Monday, July 18, 2016

Getting to know the (nearest) neighbors



kNN background

The Nearest Neighbor algorithm (NN) is a simple algorithm that works well in classifying previously unseen data (see this post). It simply compares an unknown data point against all known data points, then concludes that the unknown point is of the same kind as the closest known point, hence the nearest neighbor designation. More precisely, the algorithm presumes that the type or class ("type" and "class" are used interchangeably throughout this post) of the unknown data point has to be the type/class of its nearest neighbor. Due to this very direct dependency on prior known training data, the more training data one has, the better the approximation.

Expressed in mathematical form, NN is described as follows. Given a set of $m$ classes $C = \{c_1,c_2,...,c_m\}$, a set of training points $\mathbf D=\{\mathbf x_0,\mathbf x_1,...,\mathbf x_n\}$ where $\mathbf x_i\in\Bbb R^d$, and a set of corresponding classes $\{y_0,y_1,...,y_n\}$ for each $\mathbf x_i$ in $\mathbf D$ where $y_i\in C$, the NN algorithm returns the class $y_i$ of the training point $\mathbf x_i$ whose distance to the query/test point $\mathbf x'$ is minimized. If we define $z(\mathbf x_i)=y_i$, the NN algorithm becomes:

$$y'=z\left(\underset{\mathbf x_i\in \mathbf D}{\operatorname{argmin}}\, d(\mathbf x',\mathbf x_i)\right)$$

where $d(\mathbf x',\mathbf x_i)$ is some distance calculation.

On a set of data points whose classes are clearly delineated from each other --that is, no overlaps between the classes-- this method works well. Any errors will come from classifying test points around class boundaries. If there are few examples of one type (say Type A) in a particular region near a boundary between two types, but many more examples of another type (say Type B) across that boundary, the nearest neighbor of an unknown point (a point that is in fact a Type A point) might come from across the boundary. This will cause the unknown point to be incorrectly identified as Type B.

To hedge against a misleading nearest neighbor, we could query more points and average the types, concluding that the most dominant type around a neighborhood is the correct class. This method of looking at the majority class over several nearest neighbors is called $k$-NN (or kNN), where $k$ is the number of neighboring data points to query.

The original function above changes slightly. We define $z_i(\mathbf x')$ as the class of the $i$-th closest training point in $\mathbf D$ to the query point $\mathbf x'$:

$$\hat y=\underset{y\in C}{\operatorname{argmax}}\left(\sum_{i=1}^k \mathbf I(y,z_i(\mathbf x'))\right)$$

$$\mathbf I(a,b)= \begin{cases} 1 & \text{if }a=b\\ 0 & \text{otherwise} \end{cases}$$
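To make the vote concrete, here is a minimal (and deliberately naive) NumPy sketch of the majority rule above. The array names train_X, train_y, and x_query are assumptions for illustration, and ties go to whichever class np.unique happens to list first.

import numpy as np

def knn_predict(train_X, train_y, x_query, k=3):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((train_X - x_query) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote: the class with the largest indicator sum wins
    classes, counts = np.unique(train_y[nearest], return_counts=True)
    return classes[np.argmax(counts)]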

kNN can handle cases where classes are known to overlap, e.g., each class is a Gaussian distribution (or some other kind of distribution) that extends outward from a class center or centers regardless of the presence of other classes in the vicinity. At different points between the ‘centers’ of the class distributions, there will be different likelihoods of being a Type A or some other type (Type B, C, and so on).

We hope to guess this likelihood of belonging to the correct type by examining the class density around the immediate neighborhood of our sample, with the class 'density' estimated as the majority class within $k$ neighbors. A kNN estimate is clearly not perfect, but gets better with more data. The intuition is that we should choose the type that is most dense, which should imply that we are nearest that type's center. However, there is no guarantee that the type of the nearest neighbor is the actual correct type for the unknown data point (recall the assumed Gaussian distribution overlap). All we get is a higher likelihood.

Fine-tuning kNN

If we think through how the kNN algorithm works, we should be able to figure out that the algorithm is affected by --and could be improved by fine-tuning-- the following factors:

  • The number of neighbors to consider in the calculation, $k$
  • The distance measure, $d(\mathbf x',\mathbf x_i)$
  • The weight of each of the $k$ nearest neighbors, $\mathbf w^k\in\Bbb R^k$
  • The weight of each feature, $\mathbf w^f\in\Bbb R^d$
  • The weight of each instance/sample, $\mathbf w^s\in\Bbb R^n$

(Note that the variable notation for the last three parameters is my arbitrary choice. The superscripts identify each parameter; they are not exponents. Formal literature might use different standard variable designations.)

The first three parameters have been well explored by researchers, and intuitive choices produce good results without a lot of testing. In practice, most of these choices are fixed, with k as the only parameter that requires fine-tuning. The distance measure is typically the Euclidean distance (the square root of the summed squared errors), and the neighbor weighting is uniform (each neighbor has the same effect). The value of k is determined by cross-validation (k=3 often works well without tuning), while additional tests might uncover whether a specialized neighbor weighting scheme (such as inverse distance weighting) works better.

The last two parameters are typically left untouched, meaning equal weights are assumed once training features and instances are finalized. However, as the concept of nearest neighbors becomes diluted in high dimensions, especially with data that has many irrelevant features, it is often desirable to apply techniques that filter out the more important features, i.e., feature selection, during pre-processing. These last two parameters will be the focus of our experiments.

I have tested changing the weights of each feature, but through an indirect approach. Instead of ranking the relative value or relevance of a feature over all others, I reduced the number of features (dimensionality reduction), effectively setting the weights of ‘irrelevant’ features to zero while holding the remaining features at equal weights. I could still explore a weighted-feature kNN, but will leave this as a future experiment.
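As a sketch of that 'zero or equal weight' idea: dropping a feature is the same as giving it weight zero. The snippet below removes near-constant pixels (almost no variance across the training set) and keeps the rest at equal weight; the array name train_X and the variance threshold are arbitrary choices for illustration, not a recommendation.

import numpy as np

def drop_low_variance_features(train_X, threshold=0.01):
    variances = train_X.astype(float).var(axis=0)
    keep = variances > threshold      # boolean feature 'weights': True = keep, False = drop
    return train_X[:, keep], keep

# the same 'keep' mask must be applied to the test data before computing distances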

I have no clear plan of attack on the last one, except perhaps to deploy anomaly detection algorithms on the training data (yet another experiment). In other words, training samples that are atypical (outside some normality threshold) could be discarded or assigned a smaller contribution to the summed kNN distance, as they might be causing incorrect classification. I am not (yet) well-versed enough in the limitations of anomaly detection, especially on data for which a class might have multiple kinds of normal representation, so I will do this experiment at a later time.

If we add these parameters to the equations above, we will get, for NN:

$$d_w(\mathbf x', \mathbf x_j)=w^s_j\sum_{i=1}^d w^f_i\, d(x'_i,x_{j,i})$$

$$y'=z(\mathbf x_{i'})$$

where $i'=\underset{\mathbf x_j\in \mathbf D}{\operatorname{argmin}}\, d_w(\mathbf x',\mathbf x_j)$. That is, the feature weights scale each feature's contribution to the distance, and the instance weight scales the distance to that training point.

And for kNN:

$$\hat y=\underset{y\in C}{\operatorname{argmax}}\left(\sum_{i=1}^k w^k_i\, \mathbf I(y,z_i(\mathbf x'))\right)$$
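In code, the weighted vote is just a weighted count. A minimal sketch, assuming the classes and weights of the k nearest neighbors have already been computed:

import numpy as np

def weighted_vote(neighbor_classes, neighbor_weights):
    # for each candidate class y, sum the weights w^k_i of the neighbors whose class is y
    classes = np.unique(neighbor_classes)
    scores = [neighbor_weights[neighbor_classes == c].sum() for c in classes]
    return classes[np.argmax(scores)]

# uniform weights reduce this to the plain majority vote:
# weighted_vote(np.array([1, 7, 1]), np.array([1.0, 1.0, 1.0])) -> 1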


Number of neighbors

We already described the number of nearest neighbors in the kNN post referenced above. We simply take a survey of several points near our unknown point, and make a judgment based on these neighbor points. For instance, we could decide that the class of an unknown point should be the majority class of the k=5 nearest neighbors, even if the nearest neighbor is of a different class.

This approach typically avoids the noise caused by outliers, e.g., a solo Type A surrounded by many Type Bs. This loner Type A data point might represent an island of Type As, or it could simply be noise. Without more data, it would be hard to conclude. If this were a valid Type A cluster, more training data would add points around this island, making its neighborhood more Type A. If it were truly an outlier, the solo Type A would not gain more Type A neighbors, justifying our original majority vote.

Generally, k=3 works well. Depending on the problem and the available training data, other values of k (odd k, to break ties) might work better, but performance often flatlines at higher values. The best k for a machine learning problem is mostly determined via cross-validation, a topic for another day.

For now, from an experimentation point of view, we will focus on k=1, using k=3 only to test if the performance improves significantly. The thinking is that k=3 should outperform k=1, but we want the performance of our experiments to improve on k=1 first.

Distance measure

The distance measure is typically the distance between two points in space. In 2d space, we are familiar with using the Pythagorean Theorem to calculate the distance between two points (the hypotenuse $c$ of a triangle), $c^2=a^2+b^2$. If we extend this to many dimensions between two points $\mathbf x$ and $\mathbf y$, we get the following, called Euclidean distance:

$$\mathbf x=(x_1,x_2,...,x_n) \in \mathbb R^n$$

$$\mathbf y=(y_1,y_2,...,y_n) \in \mathbb R^n$$

$$d_2(\mathbf x, \mathbf y)=\sqrt {{\sum_{i=1}^n ({x_i-y_i})^2}}$$

There is a simpler version. If we do not use squared difference, instead adding the absolute values of the gaps, we get the so-called taxicab distance, or the Manhattan distance:

$$d_1(\mathbf x, \mathbf y)={\sum_{i=1}^n \left\lvert{x_i-y_i}\right\rvert}$$

Taxicab is an apt description. It is the distance we get when we look at a city map and determine the number of blocks to travel from one point to another. This distance measure is also called a city block distance.

Notice that this is similar to the original Euclidean formula, except that we do not have the power of 2 and the square root. If we replace the “2” with “1,” we change from the Euclidean distance to the taxicab distance. In machine learning terminology, you will hear of the $\mathsf L^2$ or $\ell_2$ distance; this is the Euclidean distance. The $\mathsf L^1$ or $\ell_1$ distance is the taxicab distance.

More generally, both are specific variations of a general case, the Minkowski distance metric:

$$d_p(\mathbf x, \mathbf y)=\left({\sum_{i=1}^n \left\lvert{x_i-y_i}\right\rvert^p}\right)^{1/p}$$
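A quick sketch showing that all three are the same Minkowski formula with a different p (the vectors here are made up for illustration):

import numpy as np

def minkowski(x, y, p):
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

x = np.array([0.0, 3.0, 1.0])
y = np.array([4.0, 0.0, 1.0])

print(minkowski(x, y, 2))   # p=2, Euclidean: sqrt(16 + 9 + 0) = 5.0
print(minkowski(x, y, 1))   # p=1, taxicab/Manhattan: 4 + 3 + 0 = 7.0
print(minkowski(x, y, 3))   # any other p is still a valid Minkowski distance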

There are a few more cautionary points when calculating distance in kNN. The data has to be of a consistent scale on all dimensions. For example, a feature that scales from 0 to 100 will overpower a feature whose scale is from 0 to 1. The squared errors will be huge for the first feature, artificially suggesting that the first feature is more important, which might not be the case. Thus, each feature has to be normalized to a consistent scale first.
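One common fix is min-max scaling of each feature to the [0, 1] range, sketched below. The scaling parameters must be computed from the training data only and then reused on the test data; the function names are mine.

import numpy as np

def minmax_fit(train_X):
    lo = train_X.min(axis=0).astype(float)
    span = train_X.max(axis=0).astype(float) - lo
    span[span == 0] = 1.0             # avoid dividing by zero on constant features
    return lo, span

def minmax_transform(X, lo, span):
    return (X - lo) / span

# lo, span = minmax_fit(train_X)
# train_scaled = minmax_transform(train_X, lo, span)
# test_scaled = minmax_transform(test_X, lo, span)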

This focus on consistent distance measurements also implies the formula should satisfy what is known as a metric, i.e., a distance metric (not merely a distance measure). The term metric has a precise mathematical definition. For a distance calculation to be considered a metric, it has to satisfy the following:

  • the distance between two points is zero or positive (non-negative; $d(\mathbf x,\mathbf y)\ge 0$)
  • if two points are equal, the distance between them is zero (identity; $d(\mathbf x,\mathbf y)=0 \Leftrightarrow \mathbf x=\mathbf y$)
  • the distance from point A to point B should be equal to the distance from point B to point A (symmetry; $d(\mathbf x,\mathbf y)=d(\mathbf y,\mathbf x)$)
  • the distance from point A to point C has to be less than or equal to the combined distance from point A to an interim point B and the distance from the interim point B to point C (triangle inequality; $d(\mathbf x,\mathbf z)\le d(\mathbf x,\mathbf y)+d(\mathbf y,\mathbf z)$)

Extending the scale constraint further, kNN is also a bit clunky to use on purely categorical data (e.g., hot, warm, neutral, cool, cold) where real number values do not make sense, or datasets that contain a mix of continuous numerical data and categorical or binary data. There are ways to use kNN in those cases still, but we will not cover that here in detail.

Binary features are easy. The distance calculation above still works. With binary features, the feature vector becomes a binary vector. We can therefore use well-known binary/set similarity measures such as Hamming, Jaccard, Tanimoto (not a metric), Cosine similarity (not strictly for binary vectors), and so on. To handle categorical data, we can extend the binary feature concept. This approach is known as one-hot encoding, where each possible value of a categorical feature is encoded as on/off (e.g., one binary feature for hot, another for warm, yet another for neutral, and so on). This is essentially the same as the concept of dummy variables in statistics.
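A small pandas sketch of one-hot encoding for the temperature example; get_dummies turns each category into its own 0/1 column, after which the usual distance formulas apply directly.

import pandas as pd

df = pd.DataFrame({'temperature': ['hot', 'warm', 'neutral', 'cool', 'cold', 'warm']})
one_hot = pd.get_dummies(df['temperature'], prefix='temp')
print(one_hot)
# each row has a 1 in exactly one of the temp_cold ... temp_warm columns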

Different scales and categorical data are not issues in our MNIST dataset. All features are of the same type (pixel intensity) and range (0-255). For our test purposes, we will use the Euclidean distance.

Weights of neighbors

If we use k>1, we have to decide how we value the classes suggested by the several neighbors. The standard kNN calculation treats each neighbor equally, relying on the local class densities of the training data within the first k neighbors to drive the majority decision. Intuitively, however, the data point closest to our test point should matter more (i.e., it is more likely to be of the class that governs the unknown point) than those farther away. Hence, we could define the weight based on the distance to the unknown point.

There is a version of kNN that incorporates distance weighting of the neighboring points. Unsurprisingly, it is called distance-weighted kNN. It assigns a weight of 1 to the nearest neighbor, 0 to the farthest, and linearly interpolated weights for the rest of the neighbors. The literature also has other methods for calculating the weights, including other modifications to distance-weighted kNN. These modifications tend to improve kNN performance on UCI datasets.
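A sketch of that linear interpolation: the nearest of the k neighbors gets weight 1, the farthest gets 0, and the rest are scaled in between. The input is assumed to be the k neighbor distances sorted in ascending order.

import numpy as np

def linear_distance_weights(sorted_dists):
    d_near, d_far = sorted_dists[0], sorted_dists[-1]
    if d_far == d_near:               # all neighbors equally far: fall back to uniform weights
        return np.ones_like(sorted_dists, dtype=float)
    return (d_far - sorted_dists) / (d_far - d_near)

# linear_distance_weights(np.array([1.0, 2.0, 4.0])) -> [1.0, 0.667, 0.0]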

For our MNIST tests, we will assume k=1. Whenever we use k>1 in our experiments, we will make the weight of each neighbor equal to simplify the calculations.

Weights of features

The basic kNN algorithm assumes that the features contribute equally, hence the summation is over all features. In reality, certain types of objects are identifiable by a subset of their many features. For instance, relative to other sports balls, an (American) football’s overriding and consistent feature is its oblong shape. It does not have to be of a specific color, although a leather shade helps identification. For a basketball or a football (soccer), the stitches and panels are very distinct. The same could be said of the digits in MNIST, where each digit has features unique to itself or shared with only a subset of the 10 digits. Irrelevant features tend to cause misclassifications.

The literature on kNN provides several improvements on the basic kNN where a parameter weight is added for each feature based on some calculation. We will not explore those techniques here. Instead, we will explore our own ideas of assigning weights to features, without regard to existing literature (for the joy of exploration, of course, plus we all know that literature search is not as much fun as figuring things out!).

Note also that some distance metrics might already incorporate implicit feature weighting, such as the Mahalanobis distance:

$$d_m(\mathbf x, \mathbf y)=\sqrt {(\mathbf x-\mathbf y)^T C^{-1}(\mathbf x-\mathbf y)}$$

Notice that it is similar to the Euclidean distance (in vector form):

$$d_2(\mathbf x, \mathbf y)=\sqrt {(\mathbf x-\mathbf y)^T (\mathbf x-\mathbf y)}$$

The only difference is the presence of $C^{-1}$, where $C$ is the covariance matrix of the data.
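A NumPy sketch of the Mahalanobis distance, under the assumption that the covariance matrix is estimated from the training data and is invertible (raw MNIST pixels would need regularization or a pseudo-inverse, since many pixels are constant):

import numpy as np

def mahalanobis(x, y, C_inv):
    diff = x - y
    return np.sqrt(diff @ C_inv @ diff)

# C_inv would be computed once from the training matrix, for example:
# C = np.cov(train_X, rowvar=False)       # rows are samples, columns are features
# C_inv = np.linalg.inv(C)                # or np.linalg.pinv(C) if C is singular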

Other ML algorithms prioritize certain features by automatically adjusting the weights of each feature, lending more weight to important elements and reducing the influence of noisy features. For example, the back-propagation and gradient descent of a neural network automatically adjust the feature weights so that a loss function is minimized (and overall accuracy tends to increase). Unfortunately, in standard kNN, we are stuck with a uniform weight value for each feature.

In our experiment, unless the subject of a trial algorithm, we will assume features have equal weights.

Weights of instances

The weight of instances and the weight of neighbors sound similar. However, whereas the weight of a neighbor affects the weight of its vote for a class, the weight of an instance affects the distance calculation for that instance. In other words, even if a query point lands closer to Point B, an instance weighting could bias the final distance calculation to favor Point A.

But why would we assume one point is 'better' than another? We could think of outliers, where some training data points might not be very informative, or are even misleading. This is not strictly an issue with the algorithm, but rather with how the training data was prepared, perhaps due to labeling errors or to including data points that could be interpreted as one type even though they are factually of another type (for example, due to a very unusual writing style). However, this parameter is not typically explored when improving kNN. Each data point contributes equally to the final distance calculation.

One technique where instance weights are implicitly calculated is the use of class prototypes (the mean features of each class) constructed from the training data, then running kNN against the prototypes instead of the original training data. The original nearest neighbor instances, while no longer used directly, are implicitly weighted into the distance calculation against the class prototypes.
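A sketch of the prototype idea: average the training vectors of each class into a single prototype, then classify a query point by its nearest prototype. (scikit-learn ships a similar ready-made classifier, NearestCentroid, if you would rather not hand-roll it.)

import numpy as np

def build_prototypes(train_X, train_y):
    classes = np.unique(train_y)
    protos = np.array([train_X[train_y == c].mean(axis=0) for c in classes])
    return classes, protos

def nearest_prototype(x_query, classes, protos):
    dists = np.sqrt(((protos - x_query) ** 2).sum(axis=1))
    return classes[np.argmin(dists)]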

In our experiment, we will assume each instance is correct and equally likely to exist.

Conclusions

The literature on kNN is vast. It is an old technique. It remains a reliable baseline when presented with enough training data. Being a memory-based learner, it is memory-intensive and slow relative to other algorithms. Being an instance-based classifier, it tends to overfit and its structure is highly dependent on the subset of data used for training (i.e., high variance). It could be argued that some ML algorithms reduce to a variant of the kNN algorithm, or the kNN turns out to be a special case of other algorithms. Despite its weaknesses, the feasibility of kNN as a good ML algorithm is well-documented. For its simplicity and logic, it is a good tool to run experiments, which we will do in future posts.

Monday, June 27, 2016

ML without learning ML



Machine learning without knowing (a lot of) machine learning

Folks who are new to ML might think ML math is not for the faint-hearted. They are not wrong. I have seen a couple of self-assessment tests for graduate students considering intro ML courses (here and here). The thinking is that one needs a thorough grasp of certain mathematical fields to make learning ML easier. The schools mean it. I should know. I tried reading introductory ML material a long time ago without brushing up on statistics and linear algebra. It was not pretty. :)

But today, I'd like to make a little detour (and adopt a less formal tone) to convince readers that ML can also be for anyone, even those without math backgrounds. We just need some very basic coding skills, really basic, as you are about to see. All I am after is to answer this question:

Can one use ML tools without knowing a lot of ML concepts?

This is a fair question because even the simplest ML techniques have a lot of math behind them, apparently meant only for the truly determined. The short answer is yes, and one can indeed get quite far without knowing the algorithm details. In fact, we will tackle what looks like a fairly difficult pattern recognition/learning challenge (although it will look pretty simple after the fact), all without knowing how these ML tools work. The idea is that we should be able to apply ML as business users, not as the system developers. As long as we know how to use the tools, we can abstract away the implementation details.

All that said, we will always get better mileage in machine learning if we know the underlying details. But for now, let's see how far we can go without knowing ML and its math.


The basic structure of ML code

There is a pattern to ML code. At its simplest, we have:

  • get the training data
  • get the test data
  • something, something
  • ... (wait)...
  • predict using the test data
  • profit!

That's it. The third step is where we train a classifier. Let's start with a simple one, a nearest neighbor classifier.

Introducing the Nearest Neighbor classifier

The concept behind the nearest neighbor classifier is very simple, yet its performance in many classification problems is quite astounding. The nearest neighbor classifier is typically called kNN, for k-nearest neighbors. The k represents a number, specifically the number of neighbors used for comparison.

A kNN classifier first stores all training data. It then takes a test data point, compares that test point with each one of the training data points, finds the closest match(es), and 'predicts' that the test point is similar to, or belongs to the same class as, that closest match(es).

This process is logically correct, and very simple. There is no 'learning' done other than 'memorizing' everything. The kNN classifier simply makes a comparison each time an unknown test data is presented. There is no computation or estimating prior to the comparison.

In contrast, as described in the perceptron post (here), a perceptron actively 'learns' a separating line based on the training data, after which the comparison is only against the separating line. The perceptron learning algorithm (PLA) would not need the training data after this point, whereas in kNN, the entire training dataset has to be kept. Due to this behavior, kNN is categorized as a lazy learner and an instance-based (or memory-based) classifier. But even with its simple approach, a kNN classifier is very effective!

Machine learning in Python

Let's now create a kNN using the above code structure. Don't worry, this is an exercise in avoiding complex code. We will actually not write our own kNN algorithm. The point is to show that we can write ML code without knowing how ML algorithms work. Remember NumPy and pandas from the perceptron post? We will do the same here. We deploy well-tested ML libraries. The Python library we need is scikit-learn, which we introduced in a previous post (here).

Our general pseudo-code structure will be similar to the code below. Normally, the training and test files are separate files. In our example later, we will use a single file, splitting it into a training set and a test set.

In [ ]:
# this is a pseudo-code; meant to show outline

# get the training data
train_data = read_file(train_filename)

# get the test data
test_data = read_file(test_filename)

# create a classifier
my_classifier = library.classifier_type()

# train the classifier (first column holds the labels, the rest are the features)
my_classifier.fit(train_data[:, 1:], train_data[:, 0])

# use the classifier on the test data
output=my_classifier.predict(test_data)

# save to file
output.save(output_filename)


'Reading' handwritten digits

To make this a little bit challenging, let us tackle the problem of identifying handwritten digits, without telling our classifier what kind of pixel patterns indicate which digits (as we explored when we introduced the MNIST dataset, here). We simply provide the raw digit values (0-255) in a sequential array.

If you have not read or do not have time to read the MNIST post, think of the MNIST dataset as a huge dataset of handwritten digits (0 to 9). It was meant to be similar to numbers extracted from actual ZIP codes in actual mail/letters. The sample digits were all written by people in natural human stroke patterns, without trying to make the digits artificially legible for computers. We will train a classifier using a portion of this dataset, then we will test the model against another set of handwritten digits (from the same dataset).

Let us start by loading the data using pandas.

The Python library scikit-learn has dedicated loader functions for the digits and iris datasets (iris is the classic iris flower dataset discussed in a previous post, here). The digits dataset in scikit-learn is not MNIST. We will continue to use pandas-based routines instead of pre-packaged loaders, if any, unless there are clear advantages.

In [1]:
# import libraries
import numpy as np
import pandas as pd

# let's limit our dataset
train_start=1
train_end=20000
test_start=20001
test_end=22000

print('Running Random Forest on MNIST, stats:')
print('... train_start:',train_start)
print('... train_end:',train_end)
print('... test_start:',test_start)
print('... test_end:',test_end)

print('Loading training and test data...')
train_file='kaggle_mnist_train.csv'
train=np.array(pd.read_csv(train_file).ix[train_start-1:train_end-1].as_matrix(),dtype='uint8')
test=np.array(pd.read_csv(train_file).ix[test_start-1:test_end-1].as_matrix(),dtype='uint8')

print('... done.')
Running Random Forest on MNIST, stats:
... train_start: 1
... train_end: 20000
... test_start: 20001
... test_end: 22000
Loading training and test data...
... done.


We now get into the most difficult part of using a machine learning tool, the part where we have to live with some math.

In [2]:
from sklearn.ensemble import RandomForestClassifier

# create a random forest classifier, default parameters, hence 10 trees
print('Running a Random Forest classifier')
print('... creating classifier')
rf = RandomForestClassifier()

# train the classifier
print('... training classifier')
rf.fit(train[:,1:], train[:,0])

# test the classifier
#print('... classifying test data')
#output=rf.predict(test[:,1:])

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=rf.score(test[:,1:],test[:,0])
print('... done.')
print('Accuracy:',accuracy)
Running a Random Forest classifier
... creating classifier
... training classifier
... classifying test data and scoring accuracy
... done.
Accuracy: 0.9235


Of course I was kidding on the math part. :) We want to use ML without the hard math topics.

We just created a Random Forest classifier that can read handwritten numbers. We only needed two lines: "create" a classifier, then "fit" said classifier! In fact, these two could be merged into one line. Creating a machine that can read handwritten digits should not be trivial, as we explored in the MNIST post. Yet, here it is! This no-frills classifier gave us a 92+% accuracy on our 2k test set. It can be improved with some basic fine-tuning and more training data.

Notice that we only trained using the first 20,000 digits, not the entire 42k on the Kaggle dataset (much less than 60k on the original MNIST dataset). Further, we only used a default classifier, in this case a basic Random Forest of 10 trees.

But where is the kNN classifier, and what is a Random Forest?

Good point. A Random Forest is another kind of ML tool. The fact that you had no idea what it is and how it does things, only that it works by being trained then by predicting, is the whole point. You didn't even have to know the difference between a kNN, which I explained above (and that explanation is correct, I'm not trying to trick you), and a Random Forest.

The code mentioned "trees." What are trees in a random forest?

Trees in a Random Forest are Decision Trees, which are another type of classifier. Decision Trees are common in business decision making. A decision tree is essentially a series of branching decision points that end in different final decisions (in this case, different classifications of digits). The decision points and the sequence of the decision branches are automatically calculated by some algorithm.

So what is a Random Forest?

Random Forests are composed of many (partially-grown) Decision Trees that are randomly created by an algorithm based on certain parameters, for example, the number of trees, the size of the trees, and so on. The rationale --and what makes Random Forests powerful-- is that the collective decision of these randomized trees is diversified, with each tree specialized to detect digits based on its own set of pixel patterns. Two trees might both be good at detecting "8," but each one will use different pixel patterns to arrive at that decision. The extreme votes tend to cancel out, leaving a majority vote that tends to be the right outcome. It is a simple idea, and it works beautifully.

Is it really as simple as that?

Yes, but it depends on the problem and the datasets. Here, the dataset is well-prepared. You plug it in, out comes a model.

But is this really how it is done?

In many cases yes. But actual ML applications are very specific, and you might have to design certain things that require specific knowledge on ML concepts and math. There is also some method to fine-tuning a model, a skill that demands deeper understanding of the techniques used, accumulated experience to understand the behavior of these techniques, and luck.

But what if we really want to use a kNN algorithm?

Easy peasy. We just need to replace one line, the line where we create the classifier (two including the library import line). The rest remains the same. See the revised code below.

In [3]:
from sklearn.neighbors import KNeighborsClassifier

# create a nearest neighbor classifier, default parameters (scikit-learn's default is k=5)
print('Running a Nearest Neighbor classifier')
print('... creating classifier')
kNN = KNeighborsClassifier()

# train the classifier
print('... training classifier')
kNN.fit(train[:,1:], train[:,0])

# test the classifier
#print('... classifying test data')
#output=kNN.predict(test[:,1:])

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=kNN.score(test[:,1:],test[:,0])
print('... done.')
print('Accuracy:',accuracy)
Running a Nearest Neighbor classifier
... creating classifier
... training classifier
... classifying test data and scoring accuracy
... done.
Accuracy: 0.956


This is interesting! A simple nearest neighbor classifier gave us almost a 96% accuracy, better than the more sophisticated Random Forest. Random Forests are traditionally considered one of the consistently better performing classifiers outside of deep learning networks.

Well, what if we use more trees? Generally, Random Forests do well with more trees (up to a point), and anywhere between 100-200 trees is typical. So let's do that.

In [4]:
# create a random forest classifier, 100 trees
print('Running a Random Forest classifier, trees=100')
print('... creating classifier')
rf = RandomForestClassifier(n_estimators=100)

# train the classifier
print('... training classifier')
rf.fit(train[:,1:], train[:,0])

# test the classifier
#print('... classifying test data')
#output=rf.predict(test[:,1:])

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=rf.score(test[:,1:],test[:,0])
print('... done.')
print('Accuracy:',accuracy)
Running a Random Forest classifier, trees=100
... creating classifier
... training classifier
... classifying test data and scoring accuracy
... done.
Accuracy: 0.953


There we go, a 95% accuracy!

Well, we fine-tuned Random Forest. Can we also fine-tune kNN? Yes, we can.

Let us try k=3. This choice makes logical sense: instead of relying on just one nearest neighbor, let us query the closest three training points to give us more assurance that our test point indeed belongs to a particular class (which would be 2 out of 3, or 3 out of 3 neighbors). If the three points are all different classes, then we can let the algorithm break the tie (arbitrarily, based on how the algorithm was written).

In [5]:
# create a nearest neighbor classifier, k=3
print('Running a Nearest Neighbor classifier, k=3')
print('... creating classifier')
kNN = KNeighborsClassifier(n_neighbors=3)

# train the classifier
print('... training classifier')
kNN.fit(train[:,1:], train[:,0])

# test the classifier
#print('... classifying test data')
#output=kNN.predict(test[:,1:])

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=kNN.score(test[:,1:],test[:,0])
print('... done.')
print('Accuracy:',accuracy)
Running a Nearest Neighbor classifier, k=3
... creating classifier
... training classifier
... classifying test data and scoring accuracy
... done.
Accuracy: 0.9545


There seems to be not much difference in the result between the default setting (k=5 in scikit-learn) and k=3, with the default doing slightly better; with more training data, the relative performance of different k values could shift. On the Random Forest side, there is no definitive advantage in having more trees past a certain point, so 200 trees is sometimes worse than 100 trees. There are also other parameters to change in Random Forests, but those changes demand a level of familiarity with how Random Forests work, which we want to avoid for this post.

Selecting the parameters

Fine-tuning the parameters of a classifier is more an art than disciplined science, especially for the more complex classifiers. There are guidelines on parameter values that tend to do well for each classifier (and for the characteristics of the problem being solved). Practical experience is therefore important.

In many cases, the parameters are selected via a technique called cross-validation, which is a topic for another day. In brief, we set aside a portion of our original training data as a validation set (which is never used for training at any point, and therefore shrinks the data available for training), run the same class of model with different parameters, then select the model/parameters with the best accuracy on the validation set. We cannot use the test set as the benchmark to fine-tune parameters because, in practice, we do not know how well a model performs on the test set. We cannot use the training set because it has been used for training, and the parameters will have 'memorized' that training set.
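As an illustration only (we do not run it here, and it would be slow on the full 42k set), scikit-learn can do this search for us. The sketch below tries a few values of k with 3-fold cross-validation and reports the mean validation accuracy of each; depending on your scikit-learn version, cross_val_score lives in sklearn.model_selection (newer releases) or sklearn.cross_validation (older ones).

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# train[:,1:] holds the pixels and train[:,0] the labels, as in the cells above
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, train[:, 1:], train[:, 0], cv=3)
    print('k =', k, 'mean validation accuracy:', scores.mean())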

kNN and Random Forest against the full Kaggle test set

Now that we have built the foundations, we might as well run the entire thing. Let us compare the performance of kNN with k=3 against a 100-tree Random Forest, trained over the entire Kaggle training set of 42k digits (roughly equal distribution over the 10 digits), and test them against all 28k digits of Kaggle's test set.

The Random Forest algorithm is significantly faster than kNN at prediction time. A brute-force kNN prediction scales with the product of the number of dimensions (784 pixels), the number of training data points (42k), and the number of test data points (28k), roughly 9 x 10^11 distance-term calculations here. While the scikit-learn kNN is likely optimized to avoid scanning the entire 42k training set for each test data point, it still takes a long time to run.

In [6]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# full Kaggle training set, full 28k test set
train_start=1
train_end=42000
test_start=1
test_end=28000

print('Running Random Forest on MNIST, stats:')
print('... train_start:',train_start)
print('... train_end:',train_end)
print('... test_start:',test_start)
print('... test_end:',test_end)

print('Loading training and test data...')
train_file='kaggle_mnist_train.csv'
test_file='kaggle_mnist_test.csv'
test_label_file='kaggle_mnist_test_labels.csv'
train=np.array(pd.read_csv(train_file).ix[train_start-1:train_end-1].as_matrix(),dtype='uint8')
test=np.array(pd.read_csv(test_file).ix[test_start-1:test_end-1].as_matrix(),dtype='uint8')
test_label=np.array(pd.read_csv(test_label_file).ix[test_start-1:test_end-1].as_matrix(),dtype='uint8')
print('... train dimensions:',train.shape)
print('... test dimensions:',test.shape)
print('... test label dimensions:',test_label.shape)
print('... done.')
Running Random Forest on MNIST, stats:
... train_start: 1
... train_end: 42000
... test_start: 1
... test_end: 28000
Loading training and test data...
... train dimensions: (42000, 785)
... test dimensions: (28000, 784)
... test label dimensions: (28000, 1)
... done.


The data loading seems to have worked. The dimensions look as expected. We can now safely proceed to running the two classifiers.

In [7]:
# create a nearest neighbor classifier, k=3
print('Running a Nearest Neighbor classifier, k=3')
print('... creating classifier')
kNN = KNeighborsClassifier(n_neighbors=3)
print('... done.')

# train the classifier
print('... training classifier')
kNN.fit(train[:,1:], train[:,0])
print('... done.')

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=kNN.score(test[:,:],test_label[:])
print('... done.')
print('Accuracy:',accuracy)

# create a random forest classifier, 100 trees
print('Running a Random Forest classifier, trees=100')
print('... creating classifier')
rf = RandomForestClassifier(n_estimators=100)
print('... done.')

# train the classifier
print('... training classifier')
rf.fit(train[:,1:], train[:,0])
print('... done.')

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=rf.score(test[:,:],test_label[:])
print('... done.')
print('Accuracy:',accuracy)
Running a Nearest Neighbor classifier, k=3
... creating classifier
... done.
... training classifier
... done.
... classifying test data and scoring accuracy
... done.
Accuracy: 0.968035714286
Running a Random Forest classifier, trees=100
... creating classifier
... done.
... training classifier
... done.
... classifying test data and scoring accuracy
... done.
Accuracy: 0.965178571429


Both algorithms reach almost 96-97% accuracy! Not bad. The kNN failed to classify 895 digits out of the 28k, about 32 errors per 1,000 digits. Not too shabby at all. (This is however a simple solution and will only get us to around the mid-600s in the Kaggle MNIST competition rankings (as of 6/28/2016).)

Misclassified digits

Let us take a look at some of the digits it failed to classify correctly within the first 1,000 digits.

In [8]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# full Kaggle training set, first 1,000 test digits
train_start=1
train_end=42000
test_start=1
test_end=1000

print('Running Random Forest on MNIST, stats:')
print('... train_start:',train_start)
print('... train_end:',train_end)
print('... test_start:',test_start)
print('... test_end:',test_end)

print('Loading training and test data...')
train_file='kaggle_mnist_train.csv'
test_file='kaggle_mnist_test.csv'
test_label_file='kaggle_mnist_test_labels.csv'
train=np.array(pd.read_csv(train_file).ix[train_start-1:train_end-1].as_matrix(),dtype='uint8')
test=np.array(pd.read_csv(test_file).ix[test_start-1:test_end-1].as_matrix(),dtype='uint8')
test_label=np.array(pd.read_csv(test_label_file).ix[test_start-1:test_end-1].as_matrix(),dtype='uint8')
print('... train dimensions:',train.shape)
print('... test dimensions:',test.shape)
print('... test label dimensions:',test_label.shape)
print('... done.')
Running Random Forest on MNIST, stats:
... train_start: 1
... train_end: 42000
... test_start: 1
... test_end: 1000
Loading training and test data...
... train dimensions: (42000, 785)
... test dimensions: (1000, 784)
... test label dimensions: (1000, 1)
... done.


Let's now run the same classifiers as before, but instead of automatically scoring the results, we will use the "predict" function, manually compare the results against the actual labels, and record the incorrect digits as we go.

In [9]:
# create a nearest neighbor classifier, k=3
print('Running a Nearest Neighbor classifier, k=3')
print('... creating classifier')
kNN = KNeighborsClassifier(n_neighbors=3)
print('... done.')

# train the classifier
print('... training classifier')
kNN.fit(train[:,1:], train[:,0])
print('... done.')

# run the classifier on unseen data
print('... classifying test data')
output_kNN=kNN.predict(test[:,:])
print('... done.')
print('output_kNN:',output_kNN[:10])
print('test labels:',test_label[:10,0])

# compare results
errors_kNN=output_kNN!=test_label[:,0]
print('errors_kNN:',errors_kNN[1:10])
Running a Nearest Neighbor classifier, k=3
... creating classifier
... done.
... training classifier
... done.
... classifying test data
... done.
output_kNN: [2 0 9 9 3 7 0 3 0 3]
test labels: [2 0 9 0 3 7 0 3 0 3]
errors_kNN: [False False  True False False False False False False]


The output seems to work well on kNN, so let's do the same with the Random Forest.

In [10]:
# create a random forest classifier, 100 trees
print('Running a Random Forest classifier, trees=100')
print('... creating classifier')
rf = RandomForestClassifier(n_estimators=100)
print('... done.')

# train the classifier
print('... training classifier')
rf.fit(train[:,1:], train[:,0])
print('... done.')

# run the classifier on unseen data
print('... classifying test data')
output_rf=rf.predict(test)
print('output_rf:',output_rf[:10])
print('test labels:',test_label[:10,0])

# compare results
errors_rf=output_rf!=test_label[:,0]
print('errors_rf:',errors_rf[1:10])
Running a Random Forest classifier, trees=100
... creating classifier
... done.
... training classifier
... done.
... classifying test data
output_rf: [2 0 9 9 3 7 0 3 0 3]
test labels: [2 0 9 0 3 7 0 3 0 3]
errors_rf: [False False  True False False False False False False]


So far so good. We have a list of the mismatched digits for each classifier. We can re-use our previous code for displaying MNIST digits, with small tweaks to show the actual vs. predicted digits. We want to see whether the incorrect predictions of our classifiers were at least visually close, even if ultimately wrong.

We could try to extract the second-best guess of each classifier, but I have not had the time to read through the scikit-learn documentation to see whether this is possible and how it could be done. In one kNN implementation I have seen (not scikit-learn), it was possible to extract additional information from the output, one piece of which was the second closest neighbor.

We will create a plot of each digit where the y label is the actual digit value and the x label is our prediction.

In [11]:
# Jupyter magic so we do not have to manually attach PNG output files
%matplotlib inline

# import libraries
import matplotlib.pyplot as plt

# display the misclassified digits
test_label_errors_kNN=test_label[:,0][errors_kNN]
output_kNN_errors=output_kNN[errors_kNN]
print('output_kNN_errors:',output_kNN_errors)
fig=plt.figure(figsize=(10,10))
for i, row in enumerate(test[:1000][errors_kNN]):
    ax=fig.add_subplot(6,5,i+1)
    ax.set_xlabel(output_kNN_errors[i])
    ax.set_ylabel(test_label_errors_kNN[i])
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    a=np.copy(row)
    a=np.reshape(a,(28,28))
    ax.imshow(a, cmap='gray', interpolation='nearest')
plt.tight_layout()
plt.show()
output_kNN_errors: [9 8 1 2 0 9 2 9 3 2 5 1 2 5 0 1 7 0 1 7 4 5 4 0]
In [12]:
# display the misclassified digits
test_label_errors_rf=test_label[:,0][errors_rf]
output_rf_errors=output_rf[errors_rf]
print('output_rf_errors:',output_rf_errors)
fig=plt.figure(figsize=(10,10))
for i, row in enumerate(test[:1000][errors_rf]):
    ax=fig.add_subplot(6,5,i+1)
    ax.set_xlabel(output_rf_errors[i])
    ax.set_ylabel(test_label_errors_rf[i])
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    a=np.copy(row)
    a=np.reshape(a,(28,28))
    ax.imshow(a, cmap='gray', interpolation='nearest')
plt.tight_layout()
plt.show()
output_rf_errors: [9 9 5 8 3 3 8 0 2 9 9 4 2 2 8 5 6 4 0 1 5 7 4 9 2 8 8 4]


We see instances where we can understand why the classifiers would get confused. However, there are also clear errors, such as a "9" being predicted as a "3" or an "8". There is also the obvious oversight of small stroke patterns that distinguish two similar digits, such as the bottom stroke of a "2" vs. a "7".

Of interest, however, is that the two classifiers make different mistakes. This means we could improve performance by creating several classifiers, pooling their decisions, and selecting the majority vote. We will explore this in a future post.