Monday, June 27, 2016

ML without learning ML

Machine learning without knowing (a lot of) machine learning

Folks who are new to ML might think ML math is not for the faint-hearted. They are not wrong. I have seen a couple of self-assessment tests for graduate students considering intro ML courses (here and here). The thinking is that one needs a thorough grasp of certain mathematical fields to make learning ML easier. The schools mean it, too. I should know: I tried reading introductory ML material a long time ago without brushing up on statistics and linear algebra. It was not pretty. :)

But today, I'd like to take a little detour (and a less formal tone) to convince readers that ML can also be for anyone, even those without a math background. We just need some very basic coding skills, really basic, as you are about to see. All I am after is to answer this question:

Can one use ML tools without knowing a lot of ML concepts?

This is a fair question, because even the simplest ML techniques have a lot of math behind them, seemingly meant only for the truly determined. The short answer is yes, and one can indeed get far without knowing algorithm details. In fact, we will tackle what looks like a fairly difficult pattern recognition/learning challenge (although it will look pretty simple after the fact), all without knowing how these ML tools work. The idea is that we should be able to apply ML as business users, not as system developers. As long as we know how to use the tools, we can abstract away the implementation details.

All that said, we will always get better mileage in machine learning if we know the underlying details. But for now, let's see how far we can go without knowing ML and its math.


The basic structure of ML code

There is a pattern to ML code. At its simplest, we have:

  • get the training data
  • get the test data
  • something, something
  • ... (wait)...
  • predict using the test data
  • profit!

That's it. The third step is where we train a classifier. Let's start with a simple one, a nearest neighbor classifier.

Introducing the Nearest Neighbor classifier

The concept behind the nearest neighbor classifier is very simple, yet its performance in many classification problems is quite astounding. The nearest neighbor classifier is typically called kNN, for k-nearest neighbors. The k represents a number, specifically the number of neighbors used for comparison.

A kNN classifier first stores all the training data. It then takes a test data point, compares it with each of the training points, finds the closest match(es), and 'predicts' that the test point is similar to, or belongs to the same class as, the closest match(es).

This process is logically straightforward and very simple. There is no 'learning' done other than 'memorizing' everything: the kNN classifier simply makes a comparison each time an unknown test point is presented, with no computation or estimation before that comparison.

In contrast, as described in the perceptron post (here), a perceptron actively 'learns' a separating line based on the training data, after which the comparison is only against the separating line. The PLA would not need the training data after this point, whereas in a kNN, the entire training dataset has to be kept. Due to this behavior, the kNN is categorized as a lazy learner and an instance-based (or memory-based) classifier. But even with its simple approach, a kNN classifier is very effective!
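
To make this concrete, here is a toy 1-nearest-neighbor classifier in plain NumPy. This is a minimal sketch for illustration (with made-up arrays), not the scikit-learn implementation we will use below:

In [ ]:
import numpy as np

def predict_1nn(train_X, train_y, test_X):
    """Toy 1-NN: label each test point with the label of the
    closest training point (squared Euclidean distance)."""
    predictions = []
    for x in test_X:
        distances = np.sum((train_X - x) ** 2, axis=1)
        predictions.append(train_y[np.argmin(distances)])
    return np.array(predictions)

# toy usage: two clusters in 2-D
train_X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
train_y = np.array([0, 0, 1, 1])
print(predict_1nn(train_X, train_y, np.array([[1, 1], [9, 9]])))  # [0 1]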

Machine learning in Python

Let's now create a kNN using the above code structure. Don't worry, this is an exercise in avoiding complex code; we will not actually write our own kNN algorithm. The point is to show that we can write ML code without knowing how ML algorithms work. Remember NumPy and pandas from the perceptron post? We will do the same here and rely on well-tested ML libraries. The Python library we need is scikit-learn, which we introduced in a previous post (here).

Our general structure will be similar to the pseudo-code below. Normally, the training and test data come in separate files. In our example later, we will use a single file, splitting it into a training set and a test set.

In [ ]:
# this is pseudo-code, meant to show the outline

# get the training data
train_data = read_file(train_filename)

# get the test data
test_data = read_file(test_filename)

# create a classifier
my_classifier = library.classifier_type()

# train the classifier: labels in column 0, pixel features in the rest
my_classifier.fit(train_data[:, 1:], train_data[:, 0])

# use the classifier on the test data
output = my_classifier.predict(test_data)

# save to file
output.save(output_filename)


'Reading' handwritten digits

To make this a little bit challenging, let us tackle the problem of identifying handwritten digits without telling our classifier what kinds of pixel patterns indicate which digits (as we explored when we introduced the MNIST dataset, here). We simply provide the raw pixel values (0-255) in a sequential array.

If you have not read (or do not have time to read) the MNIST post, think of the MNIST dataset as a huge collection of handwritten digits (0 to 9), meant to resemble numbers extracted from ZIP codes on actual mail. The sample digits were all written in natural human strokes, without any attempt to make them artificially legible to computers. We will train a classifier on a portion of this dataset, then test the model against another set of handwritten digits from the same dataset.

Let us start by loading the data using pandas.

The Python library scikit-learn has dedicated loader functions for its built-in digits dataset and for the iris dataset (iris is the classic iris flower dataset discussed in a previous post, here). Note that the digits dataset bundled with scikit-learn is not MNIST. We will continue to use pandas-based routines instead of pre-packaged loaders unless there are clear advantages. For reference, the built-in loader looks like the sketch below.
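
A minimal sketch using scikit-learn's load_digits (again, this is the small 8x8-pixel digits dataset, not MNIST):

In [ ]:
from sklearn.datasets import load_digits

digits = load_digits()       # 1,797 samples of 8x8 grayscale digits
print(digits.data.shape)     # (1797, 64) -- flattened pixel features
print(digits.target.shape)   # (1797,)   -- labels 0 to 9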

In [1]:
# import libraries
import numpy as np
import pandas as pd

# let's limit our dataset
train_start=1
train_end=20000
test_start=20001
test_end=22000

print('Running Random Forest on MNIST, stats:')
print('... train_start:',train_start)
print('... train_end:',train_end)
print('... test_start:',test_start)
print('... test_end:',test_end)

print('Loading training and test data...')
train_file='kaggle_mnist_train.csv'
# column 0 of the Kaggle CSV is the digit label; the other 784 columns are pixels
train=np.array(pd.read_csv(train_file).iloc[train_start-1:train_end],dtype='uint8')
test=np.array(pd.read_csv(train_file).iloc[test_start-1:test_end],dtype='uint8')

print('... done.')
Running Random Forest on MNIST, stats:
... train_start: 1
... train_end: 20000
... test_start: 20001
... test_end: 22000
Loading training and test data...
... done.


We now get into the most difficult part of using a machine learning tool, the part where we have to live with some math.

In [2]:
from sklearn.ensemble import RandomForestClassifier

# create a random forest classifier with 10 trees
# (10 was the scikit-learn default when this was written; newer versions default to 100)
print('Running a Random Forest classifier')
print('... creating classifier')
rf = RandomForestClassifier(n_estimators=10)

# train the classifier
print('... training classifier')
rf.fit(train[:,1:], train[:,0])

# test the classifier
#print('... classifying test data')
#output=rf.predict(test[:,1:])

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=rf.score(test[:,1:],test[:,0])
print('... done.')
print('Accuracy:',accuracy)
Running a Random Forest classifier
... creating classifier
... training classifier
... classifying test data and scoring accuracy
... done.
Accuracy: 0.9235


Of course I was kidding on the math part. :) We want to use ML without the hard math topics.

We just created a Random Forest classifier that can read handwritten numbers. We only needed two lines: "create" a classifier, then "fit" said classifier! In fact, these two can be merged into a single line, as shown below. Creating a machine that can read handwritten digits should not be trivial, as we explored in the MNIST post. Yet, here it is! This no-frills classifier gave us 92+% accuracy on our 2,000-digit test set, and it can be improved with some basic fine-tuning and more training data.
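
As an aside, here is that merged one-liner; it works because fit() returns the fitted classifier itself:

In [ ]:
# create and train in a single line
rf = RandomForestClassifier().fit(train[:,1:], train[:,0])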

Notice that we only trained on the first 20,000 digits, not the entire 42k in the Kaggle dataset (let alone the 60k in the original MNIST training set). Further, we used an essentially stock classifier, in this case a basic Random Forest of 10 trees.

But where is the kNN classifier, and what is a Random Forest?

Good point. A Random Forest is another kind of ML tool. The fact that you had no idea what it is or how it works, only that it trains and then predicts, is the whole point. You didn't even have to know the difference between a kNN, which I explained above (and that explanation is correct, I'm not trying to trick you), and a Random Forest.

The code mentioned "trees." What are trees in a random forest?

Trees in a Random Forest are Decision Trees, which are another type of classifier. Decision Trees are common in business decision making: a decision tree is essentially a series of branching decision points that end in different final decisions (in this case, different digit classifications). The decision points and the sequence of the decision branches are calculated automatically by the training algorithm. A single tree can even be run on its own, as sketched below.
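
A sketch using scikit-learn's DecisionTreeClassifier on the same arrays we loaded earlier; a lone tree typically scores noticeably lower than a forest of them:

In [ ]:
from sklearn.tree import DecisionTreeClassifier

# one decision tree, default parameters, same fit/score pattern as before
dt = DecisionTreeClassifier()
dt.fit(train[:,1:], train[:,0])
print('Single tree accuracy:', dt.score(test[:,1:], test[:,0]))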

So what is a Random Forest?

Random Forests are composed of many (partially-grown) Decision Trees that are randomly created by an algorithm based on certain parameters, for example, the number of trees, the size of the trees, and so on. The rationale --and what makes Random Forests powerful-- is that the randomized trees are diversified: each tree learns to detect digits from a different subset of pixel patterns. Two trees might both be good at detecting an "8", but each will use different pixel patterns to arrive at that decision. The outlier votes tend to cancel out, leaving a majority vote that tends to be the right outcome. It is a simple idea, and it works beautifully; a toy illustration of the voting follows.
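
Here is a toy illustration of that voting, with made-up per-tree predictions for three test digits:

In [ ]:
import numpy as np

# hypothetical votes from 5 trees for 3 test digits (one column per digit)
tree_votes = np.array([[8, 3, 1],
                       [8, 5, 1],
                       [9, 3, 1],
                       [8, 3, 7],
                       [8, 3, 1]])

# majority vote per test digit; individual mistakes get outvoted
majority = [np.bincount(tree_votes[:, i]).argmax()
            for i in range(tree_votes.shape[1])]
print(majority)  # [8, 3, 1]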

Is it really as simple as that?

Yes, but it depends on the problem and the datasets. Here, the dataset is well-prepared. You plug it in, out comes a model.

But is this really how it is done?

In many cases, yes. But real ML applications are very specific, and you might have to design certain things that require specific knowledge of ML concepts and math. There is also some method to fine-tuning a model, a skill that demands a deeper understanding of the techniques used, accumulated experience with their behavior, and luck.

But what if we really want to use a kNN algorithm?

Easy peasy. We just need to replace one line, the line where we create the classifier (two lines, counting the library import). The rest remains the same. See the revised code below.

In [3]:
from sklearn.neighbors import KNeighborsClassifier

# create a nearest neighbor classifier, default parameters (scikit-learn's default is k=5)
print('Running a Nearest Neighbor classifier')
print('... creating classifier')
kNN = KNeighborsClassifier()

# train the classifier
print('... training classifier')
kNN.fit(train[:,1:], train[:,0])

# test the classifier
#print('... classifying test data')
#output=kNN.predict(test[:,1:])

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=kNN.score(test[:,1:],test[:,0])
print('... done.')
print('Accuracy:',accuracy)
Running a Nearest Neighbor classifier
... creating classifier
... training classifier
... classifying test data and scoring accuracy
... done.
Accuracy: 0.956


This is interesting! A simple nearest neighbor classifier gave us almost 96% accuracy, better than the more sophisticated Random Forest, which is traditionally considered one of the consistently better-performing classifiers outside of deep learning networks.

Well, what if we use more trees? Generally, Random Forests do well with more trees (up to a point), and anywhere between 100 and 200 trees is typical. So let's do that.

In [4]:
# create a random forest classifier, 100 trees
print('Running a Random Forest classifier, trees=100')
print('... creating classifier')
rf = RandomForestClassifier(n_estimators=100)

# train the classifier
print('... training classifier')
rf.fit(train[:,1:], train[:,0])

# test the classifier
#print('... classifying test data')
#output=rf.predict(test[:,1:])

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=rf.score(test[:,1:],test[:,0])
print('... done.')
print('Accuracy:',accuracy)
Running a Random Forest classifier, trees=100
... creating classifier
... training classifier
... classifying test data and scoring accuracy
... done.
Accuracy: 0.953


There we go, a 95% accuracy!

Well, we fine-tuned Random Forest. Can we also fine-tune kNN? Yes, we can.

Let us try k=3. Using more than one neighbor makes logical sense: instead of relying on a single nearest neighbor, we query the closest three training points for more assurance that our test point indeed belongs to a particular class (a 2-out-of-3 or 3-out-of-3 vote). If the three points belong to three different classes, we let the algorithm break the tie (arbitrarily, based on how the algorithm was written).

In [5]:
# create a nearest neighbor classifier, k=3
print('Running a Nearest Neighbor classifier, k=3')
print('... creating classifier')
kNN = KNeighborsClassifier(n_neighbors=3)

# train the classifier
print('... training classifier')
kNN.fit(train[:,1:], train[:,0])

# test the classifier
#print('... classifying test data')
#output=kNN.predict(test[:,1:])

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=kNN.score(test[:,1:],test[:,0])
print('... done.')
print('Accuracy:',accuracy)
Running a Nearest Neighbor classifier, k=3
... creating classifier
... training classifier
... classifying test data and scoring accuracy
... done.
Accuracy: 0.9545


There is not much difference in the result between the default k=5 and k=3, with the default doing slightly better here. Generally, small odd values of k do well, and which one wins can shift as the training data grows. On the Random Forest side, there is no definitive advantage in having more trees past a certain point, so 200 trees is sometimes worse than 100 trees. There are other parameters to change in Random Forests as well, but those changes demand a level of familiarity with how Random Forests work, which we want to avoid for this post.

Selecting the parameters

Fine-tuning the parameters of a classifier is more an art than a disciplined science, especially for the more complex classifiers. There are guidelines on parameter values that tend to do well for each classifier (and for the characteristics of the problem being solved), so practical experience is important.

In many cases, parameter selection is done via a technique called cross-validation, which is a topic for another day. In brief, we set aside a portion of our original training data as a validation set (never used for training at any point, which shrinks the data available for training), train the same class of model with different parameters, then select the model/parameters with the best accuracy on the validation set. We cannot use the test set as the benchmark for fine-tuning because in practice we do not know how well a model performs on the test set. We cannot use the training set because it has already been used for training, and the model will have 'memorized' it.
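
As a preview, scikit-learn can automate this search. A minimal sketch using GridSearchCV, which cross-validates each candidate parameter value on splits of the training data (the parameter grid here is just an example; running this with kNN on the full training set is slow, so one would typically search on a subsample first):

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# try a few values of k with 3-fold cross-validation on the training set only
search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5, 7]}, cv=3)
search.fit(train[:,1:], train[:,0])
print('Best parameters:', search.best_params_)
print('Best cross-validated accuracy:', search.best_score_)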

kNN and Random Forest against the full Kaggle test set

Now that we have built the foundations, we might as well run the entire thing. Let us compare the performance of kNN with k=3 against a 100-tree Random Forest, trained on the entire Kaggle training set of 42k digits (roughly evenly distributed over the 10 digits), and tested against Kaggle's full 28k test set.

The Random Forest is significantly faster than kNN at prediction time. A brute-force kNN's running time grows with the product of the number of dimensions (784 pixels), the number of training points (42k), and the number of test points (28k). While the scikit kNN is likely optimized to avoid scanning the entire 42k training set for each test point, it still takes a long time to run.
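
For what it is worth, scikit-learn exposes some levers here. A sketch (these parameters exist in current scikit-learn versions, though how much they help on 784-dimensional data is another matter, since tree-based neighbor searches degrade in high dimensions):

In [ ]:
# explicitly choose the neighbor-search strategy ('auto', the default,
# picks one based on the data) and parallelize queries across CPU cores
kNN_fast = KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree', n_jobs=-1)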

In [6]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# full Kaggle training set, full 28k test set
train_start=1
train_end=42000
test_start=1
test_end=28000

print('Running Random Forest on MNIST, stats:')
print('... train_start:',train_start)
print('... train_end:',train_end)
print('... test_start:',test_start)
print('... test_end:',test_end)

print('Loading training and test data...')
train_file='kaggle_mnist_train.csv'
test_file='kaggle_mnist_test.csv'
test_label_file='kaggle_mnist_test_labels.csv'
train=np.array(pd.read_csv(train_file).iloc[train_start-1:train_end],dtype='uint8')
test=np.array(pd.read_csv(test_file).iloc[test_start-1:test_end],dtype='uint8')
test_label=np.array(pd.read_csv(test_label_file).iloc[test_start-1:test_end],dtype='uint8')
print('... train dimensions:',train.shape)
print('... test dimensions:',test.shape)
print('... test label dimensions:',test_label.shape)
print('... done.')
Running Random Forest on MNIST, stats:
... train_start: 1
... train_end: 42000
... test_start: 1
... test_end: 28000
Loading training and test data...
... train dimensions: (42000, 785)
... test dimensions: (28000, 784)
... test label dimensions: (28000, 1)
... done.


The data load seems to have worked well, and the dimensions look as expected. We can now safely proceed to running the two classifiers.

In [7]:
# create a nearest neighbor classifier, k=3
print('Running a Nearest Neighbor classifier, k=3')
print('... creating classifier')
kNN = KNeighborsClassifier(n_neighbors=3)
print('... done.')

# train the classifier
print('... training classifier')
kNN.fit(train[:,1:], train[:,0])
print('... done.')

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=kNN.score(test,test_label[:,0])
print('... done.')
print('Accuracy:',accuracy)

# create a random forest classifier, 100 trees
print('Running a Random Forest classifier, trees=100')
print('... creating classifier')
rf = RandomForestClassifier(n_estimators=100)
print('... done.')

# train the classifier
print('... training classifier')
rf.fit(train[:,1:], train[:,0])
print('... done.')

# test and record accuracy score
print('... classifying test data and scoring accuracy')
accuracy=rf.score(test,test_label[:,0])
print('... done.')
print('Accuracy:',accuracy)
Running a Nearest Neighbor classifier, k=3
... creating classifier
... done.
... training classifier
... done.
... classifying test data and scoring accuracy
... done.
Accuracy: 0.968035714286
Running a Random Forest classifier, trees=100
... creating classifier
... done.
... training classifier
... done.
... classifying test data and scoring accuracy
... done.
Accuracy: 0.965178571429


Both algorithms reach almost 96-97% accuracy! Not bad. The kNN misclassified 895 digits out of the 28k, about 32 errors per 1,000 digits. Not too shabby at all. (This is, however, a simple solution, and would only get us to around the mid-600s in the Kaggle MNIST competition rankings as of 6/28/2016.)

Misclassified digits

Let us take a look at some of the digits it failed to classify correctly within the first 1,000 digits.

In [8]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# full Kaggle training set, first 1,000 test digits
train_start=1
train_end=42000
test_start=1
test_end=1000

print('Running Random Forest on MNIST, stats:')
print('... train_start:',train_start)
print('... train_end:',train_end)
print('... test_start:',test_start)
print('... test_end:',test_end)

print('Loading training and test data...')
train_file='kaggle_mnist_train.csv'
test_file='kaggle_mnist_test.csv'
test_label_file='kaggle_mnist_test_labels.csv'
train=np.array(pd.read_csv(train_file).iloc[train_start-1:train_end],dtype='uint8')
test=np.array(pd.read_csv(test_file).iloc[test_start-1:test_end],dtype='uint8')
test_label=np.array(pd.read_csv(test_label_file).iloc[test_start-1:test_end],dtype='uint8')
print('... train dimensions:',train.shape)
print('... test dimensions:',test.shape)
print('... test label dimensions:',test_label.shape)
print('... done.')
Running Random Forest on MNIST, stats:
... train_start: 1
... train_end: 42000
... test_start: 1
... test_end: 1000
Loading training and test data...
... train dimensions: (42000, 785)
... test dimensions: (1000, 784)
... test label dimensions: (1000, 1)
... done.


Let's now run the same classifiers as before, but instead of automatically scoring the results, we will use the predict function, manually compare the predictions against the actual labels, and record the incorrect digits as we go.

In [9]:
# create a nearest neighbor classifier, k=3
print('Running a Nearest Neighbor classifier, k=3')
print('... creating classifier')
kNN = KNeighborsClassifier(n_neighbors=3)
print('... done.')

# train the classifier
print('... training classifier')
kNN.fit(train[:,1:], train[:,0])
print('... done.')

# run the classifier on unseen data
print('... classifying test data')
output_kNN=kNN.predict(test[:,:])
print('... done.')
print('output_kNN:',output_kNN[:10])
print('test labels:',test_label[:10,0])

# compare results
errors_kNN=output_kNN!=test_label[:,0]
print('errors_kNN:',errors_kNN[1:10])
Running a Nearest Neighbor classifier, k=3
... creating classifier
... done.
... training classifier
... done.
... classifying test data
... done.
output_kNN: [2 0 9 9 3 7 0 3 0 3]
test labels: [2 0 9 0 3 7 0 3 0 3]
errors_kNN: [False False  True False False False False False False]


The comparison seems to work well for kNN, so let's do the same with the Random Forest.

In [10]:
# create a random forest classifier, 100 trees
print('Running a Random Forest classifier, trees=100')
print('... creating classifier')
rf = RandomForestClassifier(n_estimators=100)
print('... done.')

# train the classifier
print('... training classifier')
rf.fit(train[:,1:], train[:,0])
print('... done.')

# run the classifier on unseen data
print('... classifying test data')
output_rf=rf.predict(test)
print('output_rf:',output_rf[:10])
print('test labels:',test_label[:10,0])

# compare results
errors_rf=output_rf!=test_label[:,0]
print('errors_rf:',errors_rf[1:10])
Running a Random Forest classifier, trees=100
... creating classifier
... done.
... training classifier
... done.
... classifying test data
output_rf: [2 0 9 9 3 7 0 3 0 3]
test labels: [2 0 9 0 3 7 0 3 0 3]
errors_rf: [False False  True False False False False False False]


So far so good. We have a list of the mismatched digits for each classifier. We can re-use our previous code for displaying MNIST digits, with small tweaks to show the actual vs predicted digits. We want to see whether our classifiers' incorrect predictions were at least visually close, even if ultimately wrong.

We could also try to extract the second-best guess of each classifier, but I have not combed through the scikit documentation to see whether and how this could be done. In one kNN implementation I have seen (not scikit), it was possible to extract additional information from the output, including the second closest neighbor.
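
For those who want to dig in: both classifiers expose a predict_proba method that returns per-class probability estimates, so a second-best guess can plausibly be read off by sorting those probabilities. A sketch (not something I have verified against this exact data):

In [ ]:
# per-class probability estimates, shape (n_samples, 10)
proba = rf.predict_proba(test)

# classes ranked by descending probability for each test digit;
# column 0 is the top guess, column 1 the runner-up
ranked = np.argsort(proba, axis=1)[:, ::-1]
second_best = rf.classes_[ranked[:, 1]]
print('second-best guesses:', second_best[:10])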

We will create a plot of each digit where the y label is the actual digit value and the x label is our prediction.

In [11]:
# Jupyter magic so we do not have to manually attach PNG output files
%matplotlib inline

# import libraries
import matplotlib.pyplot as plt

# display the misclassified digits
test_label_errors_kNN=test_label[:,0][errors_kNN]
output_kNN_errors=output_kNN[errors_kNN]
print('output_kNN_errors:',output_kNN_errors)
fig=plt.figure(figsize=(10,10))
for i, row in enumerate(test[:1000][errors_kNN]):
    ax=fig.add_subplot(6,5,i+1)
    ax.set_xlabel(output_kNN_errors[i])
    ax.set_ylabel(test_label_errors_kNN[i])
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    a=np.copy(row)
    a=np.reshape(a,(28,28))
    ax.imshow(a, cmap='gray', interpolation='nearest')
plt.tight_layout()
plt.show()
output_kNN_errors: [9 8 1 2 0 9 2 9 3 2 5 1 2 5 0 1 7 0 1 7 4 5 4 0]
In [12]:
# display the misclassified digits
test_label_errors_rf=test_label[:,0][errors_rf]
output_rf_errors=output_rf[errors_rf]
print('output_rf_errors:',output_rf_errors)
fig=plt.figure(figsize=(10,10))
for i, row in enumerate(test[:1000][errors_rf]):
    ax=fig.add_subplot(6,5,i+1)
    ax.set_xlabel(output_rf_errors[i])
    ax.set_ylabel(test_label_errors_rf[i])
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    a=np.copy(row)
    a=np.reshape(a,(28,28))
    ax.imshow(a, cmap='gray', interpolation='nearest')
plt.tight_layout()
plt.show()
output_rf_errors: [9 9 5 8 3 3 8 0 2 9 9 4 2 2 8 5 6 4 0 1 5 7 4 9 2 8 8 4]


We see instances where we can understand why the classifiers got confused. However, there are also clear errors, such as a "9" being predicted as a "3" or an "8". There is also the obvious oversight of small stroke patterns that distinguish two similar digits, such as the bottom stroke of a "2" vs a "7".

Of interest, however, is that the two classifiers make different mistakes. This means we could improve performance by creating several classifiers, pooling their decisions, and selecting the majority vote. We will explore this in a future post, but a quick teaser follows.
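
As a teaser, scikit-learn already packages this idea as VotingClassifier. A minimal sketch pooling the two classifiers we used here plus a single decision tree, with a hard majority vote:

In [ ]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# pool three classifiers and take the majority ('hard') vote
ensemble = VotingClassifier(estimators=[
    ('knn', KNeighborsClassifier(n_neighbors=3)),
    ('rf', RandomForestClassifier(n_estimators=100)),
    ('dt', DecisionTreeClassifier()),
], voting='hard')
ensemble.fit(train[:,1:], train[:,0])
output_ensemble = ensemble.predict(test)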