Wednesday, December 14, 2016

Improving the accuracy of a sparse kNN

Model accuracy improves with more training data. Many impressive results in machine learning, typically from neural network-based approaches, use a lot of data and prolonged training (e.g., several days of running gradient descent across hundreds of thousands of samples). This is not a bad thing. In practice, there is no reason why models should not use as much data and training time as practically available. Even in low-stakes Kaggle competitions, it is common to hear of algorithms that run for hours on high-end GPUs.

However, there might be instances when we only have access to a limited set of training data. We should therefore expect performance to decrease, for any model. This is a theoretical result as well as a practical reality: given a baseline model, more data generally leads to better accuracy.

In this post, we will observe this degradation due to sparse training data in the context of a PCA+kNN model run against the MNIST dataset. To recall, a PCA+kNN trained on 42,000 MNIST digits reaches a test accuracy of about 97+%. This accuracy drops below 90% when only the first 1,000 training digits are used.

We will also see if we can somehow improve on a PCA+kNN under limited training data. We know that a reduction in dimensionality tends to help kNN accuracy by diminishing overfitting, and we have established that a PCA set to 50 principal components generates consistently high scores on the Kaggle MNIST. Recall also that the 50 principal components were in fact selected via a grid search over the test set, not against a validation set held out from training. Put another way, the PCA=50 setting will be very hard to beat.
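For reference, a minimal sketch of this kind of PCA+kNN baseline using scikit-learn is shown below. The PCA=50 setting matches the text; the value of k and the reliance on sklearn defaults elsewhere are assumptions for illustration, not necessarily the exact settings used in these runs.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def pca_knn_baseline(train_X, train_y, test_X, test_y, n_components=50, k=5):
    # project the raw pixels onto the top principal components, then classify
    # with a vanilla kNN in the reduced space
    model = make_pipeline(PCA(n_components=n_components),
                          KNeighborsClassifier(n_neighbors=k))
    model.fit(train_X, train_y)
    return model.score(test_X, test_y)   # mean accuracy on the held-out digits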

How do we do this?

We will not explore the actual code in detail as it is long and not well-organized. It was meant to be experimental, and is an exploratory patchwork of different sub-algorithms and full of debugging stubs. In concept, my approach is to run a PCA+kNN, select the top two or three classes, then proceed to a runoff vote. The runoff portion is kNN-like, relying on a Euclidean distance measure, plus some other refinements. As there is a secondary runoff --and the refinements are not yet vectorized-- execution time is much longer than standard PCA+kNN/kNN.
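In outline, the two-stage prediction might look like the sketch below. Here runoff_vote is a hypothetical stand-in for the actual runoff (which is not reproduced in this post), and the top candidates are read off the kNN vote shares; this is a conceptual illustration, not the experimental code itself.

import numpy as np

def predict_with_runoff(pca_knn_model, runoff_vote, test_X, top=3):
    # first layer: class vote shares from a fitted PCA+kNN pipeline
    proba = pca_knn_model.predict_proba(test_X)
    classes = pca_knn_model.classes_
    preds = []
    for row, p in zip(test_X, proba):
        candidates = classes[np.argsort(p)[::-1][:top]]   # top two/three classes
        if p.max() == 1.0:
            preds.append(candidates[0])                   # unanimous neighbors: no runoff needed
        else:
            preds.append(runoff_vote(row, candidates))    # second-chance runoff
    return np.array(preds)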

Algorithm context

kNN is sensitive to the features used in the distance calculation. It is obvious that kNN would improve if only the important features (pixels in this case) were included. However, at the pixel level, selecting the defining pixels is fundamentally subject to slight image perturbations such as translation, rotation, and so on. In hyperspace, a shift of even a single pixel leads to big jumps. This could be minimized to an extent by certain techniques (e.g., convolution/pooling, dictionary/bag-of-words, pre-processing, etc.), but we wish to see if we can improve performance even with naturally noisy data.

A related concept is that of transformation manifolds, and a good reference is tangent distance as applied to MNIST [Simard, 2000]. In brief, this approach treats samples of the same class as neighbors by calculating the distance between points after certain transformations (e.g., a thin digit and a fat digit of the same class become neighbors, where they would otherwise be far apart in normal kNN hyperspace). The tangent distance is the closest distance to another point as a digit moves along the transformation manifold. Another technique [Wong, 2015] generates a skeleton of each digit, decomposes digits into important patches, generates a dictionary of possible transformations of these patches, then reassembles them for the best fit. These techniques assume knowledge or calculation of likely transformations and preprocessing.

[These are the ones that I recently re-read. There must be many other techniques and variants thereof. It could also be said that the convolutional nature of CNNs (e.g., LeNet) is tantamount to prior knowledge of transformations and visual structures, even if the weights are actively learned. In other words, if the wiring of the CNN receptive field were random across the entire image, it would need far more random wiring to eventually settle into a network that favors local and overlapping patches and accurately solves translational and rotational invariance. We do know that image pixels are locally and spatially related, so providing that nudge simplifies the problem --and leads to better accuracy. Another good algorithm, shape context matching [Belongie, 2002], could be said to assume local spatial relationships by virtue of its diminishing bin sizes.]

Without the benefit of transformations, it is obvious that a digit would have different key pixels depending on how it is written (orientation, line thickness, and so on). In kNN space, we could think of this as different key features at different parts of the kNN hyperspace. So if we localize the kNN calculation, we might gain some improvement. We can extract these locally important features by considering a local region ball. As with Gaussian/radial-basis models, the size/radius of the region ball is a key but very sensitive parameter.
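As a rough illustration of the local-region idea, the sketch below takes the training samples inside a Euclidean ball around a test point and ranks pixels by their variance within that ball. The variance criterion, the radius, and the feature count are assumptions for illustration; they are not the refinements used in the actual runoff.

import numpy as np

def local_ball_features(test_point, train_X, radius, n_features=50):
    # neighbors inside the region ball around the test point
    dists = np.linalg.norm(train_X - test_point, axis=1)
    ball = train_X[dists <= radius]
    if len(ball) < 2:
        # ball too sparse to estimate anything; fall back to the first pixels
        return np.arange(n_features)
    importance = ball.var(axis=0)                       # per-pixel spread inside the ball
    return np.argsort(importance)[::-1][:n_features]    # indices of the key local pixels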

This algorithm relies on this local region kNN concept. In this sense, the algorithm does not assume knowledge of tranformations (it does not provide guideposts that a slightly curved line is equal to a more slightly curved line) nor pre-processing (a thin line is different from a thick line). The absence of these assumptions therefore prevents the algorithm from making generalizations outside the known samples. What this does however is that we could re-order the pixels in each sample (so it won't make sense visually to anyone, and would lead to a different set of local convolutions in a CNN), and the algorithm can still achieve the same result.

Finding the key local features is sort of trivial. Finding the right ones is not. A previous experiment that applied a PCA over each localized ball did not produce a significant improvement. The problem is that any local ball will only have a few samples, and a PCA or similar technique is prone to producing misleading results. I also explored a variation of tf-idf (over a bag-of-pixels!), a technique I have explored before on a binary MNIST template matcher.
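For the curious, a sketch of what a tf-idf-style weighting over a 'bag of pixels' might look like is below. The binarization threshold and the exact idf formula are assumptions for illustration, not the precise variant tested here.

import numpy as np

def pixel_tfidf(X, threshold=0.5):
    # treat each binarized pixel as a "term": pixels that are on in most samples
    # get down-weighted, rarer pixels get emphasized
    B = (X > threshold).astype(float)     # bag of pixels (assumes X scaled to [0, 1])
    df = B.sum(axis=0) + 1.0              # document frequency per pixel, +1 smoothing
    idf = np.log(len(B) / df)
    return B * idf                        # tf (0/1) times idf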

Accuracy bounds

The runoff accuracy relies on the top choices produced by the PCA+kNN (or any other first-layer algorithm). If the top choices do not include the correct class, the runoff must fail. The runoff simply gives the algorithm a second chance at improving (or keeping) the PCA+kNN result. The lower bound is therefore set by the test cases where the top choices all belong to the same class (no runoff occurs); the upper bound is the accuracy obtained if every case with non-homogeneous top choices is predicted correctly.
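These bounds can be computed directly from the candidate sets. Below is a small sketch under one reading of the bounds: the lower bound assumes the runoff gets every contested case wrong, the upper bound assumes it gets every winnable contested case right. top_candidates is an assumed list of per-test-case candidate-class arrays from the first layer.

import numpy as np

def runoff_bounds(top_candidates, true_labels):
    lower = upper = 0
    for cands, y in zip(top_candidates, true_labels):
        if len(set(cands)) == 1:          # homogeneous: the runoff cannot change the outcome
            correct = int(cands[0] == y)
            lower += correct
            upper += correct
        elif y in cands:                  # contested but winnable by a perfect runoff
            upper += 1
    n = len(true_labels)
    return lower / n, upper / n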

While it relies on the PCA+kNN initially, the algorithm does not know which of the runoff choices is the correct class --that is, the 'correct' class as predicted by the prior PCA+kNN. The algorithm is completely independent and receives no guidance from the PCA+kNN (beyond the top choices themselves). Its final choice could be different from the top PCA+kNN choice. If the algorithm is ineffective, it could override and undermine the initial accuracy of the underlying PCA+kNN, leading to worse performance overall.

Since it relies on a prior algorithm, its final accuracy is limited by the top choices generated by that algorithm. In fact, my experiments showed that there is more accuracy to be gained by making a 'better' first layer (so that it includes the correct class in the top two/three choices) than by improving the runoff accuracy over the top two/three choices.

Runoff background

We previously experimented with a runoff approach (here), but while it provided slight improvements, the outcome was not very convincing. Unfortunately, in many cases the runoff would choose the wrong class even when the first-layer PCA+kNN was able to identify the correct class. Clearly the runoff approach had to be tweaked…

… and tweak I did. I have been working on this algorithm for several months now (hence the lack of new posts). It is time-consuming. Parameter tuning and testing use up more time than the coding. It is finally in a state that is consistently accurate. I have settled on some hard parameters that seem to work across different training set sizes, but are hardly optimal. More research is needed to tie these parameters to certain qualities of the training dataset. There is also no reason to match the number of principal components in the PCA to the number of localized features in the runoff, but we will stick to that in these runs.

The plotting code is standard, shown below:

In [19]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def plotter(a,b,c,reps,train_size,test_size):
    # a, b, c: per-trial accuracies for PCA+kNN, plain kNN, and the runoff
    print('mean accuracy, pca+knn:',np.mean(a))
    print('mean accuracy, knn:',np.mean(b))
    print('mean accuracy, pca+knn-runoff:',np.mean(c))
    plt.figure(figsize=(12,4))
    plt.plot(reps,a,marker='+',ls=':',c='r',label='PCA+kNN')
    plt.plot(reps,b,marker='x',ls='-.',c='g',label='kNN')
    plt.plot(reps,c,marker='^',ls='--',c='b',label='PCA+kNN-Runoff')
    # horizontal lines marking each model's mean accuracy across trials
    plt.plot(reps,[np.mean(a)]*len(reps),ls='-',c='r',label='PCA+kNN, mean')
    plt.plot(reps,[np.mean(b)]*len(reps),ls='-',c='g',label='kNN, mean')
    plt.plot(reps,[np.mean(c)]*len(reps),ls='-',c='b',label='Runoff, mean')
    plot_title='Limited Training Data Comparison\n(train size='+str(train_size)+', test size='+str(test_size)+')'
    plt.title(plot_title)
    plt.legend(loc=4,fontsize='small')
    plt.xlabel('Trials')
    plt.ylabel('Accuracy')
    plt.xlim([1,20])
    plt.show()


Experiment setup

The experiment runs 20 repetitions, each classifying 100 test digits. It randomizes the training data used for each repetition. We then run the algorithm to record the accuracy of the different models (a PCA+kNN, a kNN, and the runoff).

Notice that the training data is not guaranteed to be balanced, nor to reflect the original ratio of the different classes. This affects kNN accuracy. In later experiments, we also explore different training sizes to observe how accuracy changes with the amount of training data.
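To get a feel for how unbalanced a small random draw can be, a quick check like the one below helps. The labels array here is a stand-in, since the notebook's own label arrays are not shown in this post; with real MNIST labels, the per-digit counts of a 100-sample draw are similarly uneven.

import numpy as np

rng = np.random.RandomState(0)
labels = rng.randint(0, 10, size=42000)          # stand-in for the training labels
draw = rng.choice(len(labels), size=100, replace=False)
print(np.bincount(labels[draw], minlength=10))   # samples per digit in a 100-digit draw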

In [5]:
def runoff_experiment(reps,train_size,test_size,seed=0,randomTest=True):
    # train_mnist, test_mnist, test_mnist_label, and runoffexp are defined
    # earlier in the notebook (not shown in this post)
    np.random.seed(seed)
    pca_knn=[]
    knn=[]
    runoff=[]
    for i in reps:
        train_sel=train_mnist.copy()
        test_sel=test_mnist.copy()
        test_label=test_mnist_label.copy()
        
        # draw a fresh random training subset for each repetition
        np.random.shuffle(train_sel)
        train_sel=train_sel[:train_size]
        if randomTest:
            # also draw a random test subset (used from Test 5 onwards)
            idx=np.arange(0,len(test_sel))
            np.random.shuffle(idx)
            test_sel=test_sel[idx[:test_size]]
            test_label=test_label[idx[:test_size]]
        else:
            # fixed test subset: the first test_size digits
            test_sel=test_sel[:test_size]
            test_label=test_label[:test_size]
        
        # per-repetition accuracies for PCA+kNN, plain kNN, and the runoff
        a,b,c=runoffexp(train_sel,test_sel,test_label)
        pca_knn.append(a)
        knn.append(b)
        runoff.append(c)
    return pca_knn,knn,runoff


Test 1: Initial test, training size=1,000 samples

We first test with a generic and reasonable size of 1,000 samples. If we are lucky with the random selection, we should get about 100 training samples per digit. For handwriting, this is very low.

In [21]:
reps=np.arange(1,21)
train_size=1000
test_size=100
pca_knnA0,knnA0,runoffA0=runoff_experiment(reps,train_size,test_size,0,False)
plotter(pca_knnA0,knnA0,runoffA0,reps,train_size,test_size)
mean accuracy, pca+knn: 0.9125
mean accuracy, knn: 0.9015
mean accuracy, pca+knn-runoff: 0.918


We see a bit of an improvement, but it is marginal and might not be significant. We also observe that PCA+kNN is generally better than a plain kNN, which is what we expect.

Test 2: Re-run Test 1 (training size=1,000 samples), but with a different random seed

Let's see what would happen if we use a different random number seed.

In [22]:
pca_knnA1,knnA1,runoffA1=runoff_experiment(reps,train_size,test_size,1,False)
plotter(pca_knnA1,knnA1,runoffA1,reps,train_size,test_size)
mean accuracy, pca+knn: 0.8945
mean accuracy, knn: 0.8865
mean accuracy, pca+knn-runoff: 0.9115


This is good! We see some separation between the PCA+kNN and the runoff. We also see that the runoff result is almost always better than the other two.

Test 3: Training size=100

Now that we think it works, it is time to try an even smaller training data set. Let's cut this down by a factor of 10, leaving us with only about 10 training samples per digit (remember that the selection is random --some classes might not even be represented). Surely this would be a problem.

In [23]:
train_size=100
pca_knnB0,knnB0,runoffB0=runoff_experiment(reps,train_size,test_size,0,False)
plotter(pca_knnB0,knnB0,runoffB0,reps,train_size,test_size)
mean accuracy, pca+knn: 0.7135
mean accuracy, knn: 0.708
mean accuracy, pca+knn-runoff: 0.721


Indeed it is a problem. The accuracy dropped from 90% to 70%!

But our runoff algorithm continues to outperform the other two…

Test 4: Re-run Test 3 (training size=100), with a different random seed

To be sure that it still works at this low training size, let's run with a different random seed.

In [24]:
train_size=100
pca_knnB1,knnB1,runoffB1=runoff_experiment(reps,train_size,test_size,1,False)
plotter(pca_knnB1,knnB1,runoffB1,reps,train_size,test_size)
mean accuracy, pca+knn: 0.7085
mean accuracy, knn: 0.709
mean accuracy, pca+knn-runoff: 0.708


Whoa! What happened here? Why are the average lines together? And where did the runoff advantage go?

Well, the similar averages are a coincidence. The individual runs still show obvious differences. The 'underperformance' of the runoff algorithm means one of two things: (i) the previous three tests are misleading and this one is more representative, or (ii) the random training selection is so unbalanced that it is throwing off the runoff portion of the algorithm.

How will we know? Why don't we throw caution to the wind and just unleash this in the wild?

Test 5: Training size=100, randomized test set

The previous tests were run against the same test set each time, so the comparisons were consistent apart from the randomized training data selection and its size. Here, to truly test the runoff algorithm 'in the wild', we should also randomize the test set creation (which we do via the default flag in the experiment setup code).

In [25]:
train_size=100
pca_knn1,knn1,runoff1=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn1,knn1,runoff1,reps,train_size,test_size)
mean accuracy, pca+knn: 0.6975
mean accuracy, knn: 0.693
mean accuracy, pca+knn-runoff: 0.7145


Clear improvement on this one.

Test 6: Training size=1,000, randomized test set

Let's try bringing up the training size to 1,000 again.

In [26]:
train_size=1000
pca_knn2,knn2,runoff2=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn2,knn2,runoff2,reps,train_size,test_size)
mean accuracy, pca+knn: 0.8965
mean accuracy, knn: 0.8885
mean accuracy, pca+knn-runoff: 0.9155


Yep, still better.

Test 7: Training size=500, randomized test set

We may have skipped too far ahead to 1,000 samples, so let's cut that in half to 500.

In [27]:
train_size=500
pca_knn3,knn3,runoff3=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn3,knn3,runoff3,reps,train_size,test_size)
mean accuracy, pca+knn: 0.8535
mean accuracy, knn: 0.845
mean accuracy, pca+knn-runoff: 0.8795


Yep, still better.

Test 8: Training size=2,000, randomized test set

How about we double the 1,000 training samples from earlier to 2,000?

In [28]:
train_size=2000
pca_knn4,knn4,runoff4=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn4,knn4,runoff4,reps,train_size,test_size)
mean accuracy, pca+knn: 0.9265
mean accuracy, knn: 0.917
mean accuracy, pca+knn-runoff: 0.9345


Yep, still better.

Test 9: Training size=5,000, randomized test set

How about 5,000 training samples? Maybe this improvement will disappear when plain old kNN and PCA+kNN have more instances in memory.

In [29]:
train_size=5000
pca_knn5,knn5,runoff5=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn5,knn5,runoff5,reps,train_size,test_size)
mean accuracy, pca+knn: 0.9505
mean accuracy, knn: 0.9475
mean accuracy, pca+knn-runoff: 0.9595


Yep, still better.

Test 10: Training size=10,000, randomized test set

Ok, how about 10,000 training samples?

In [30]:
train_size=10000
pca_knn6,knn6,runoff6=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn6,knn6,runoff6,reps,train_size,test_size)
mean accuracy, pca+knn: 0.9645
mean accuracy, knn: 0.961
mean accuracy, pca+knn-runoff: 0.967


Yep, still better, although we do notice that the improvement gap narrows with more training data. This is intuitive.

The algorithm 'guesses' the better result from the top two or three choices, but with more training data, the kNN's baseline accuracy increases, nullifying the advantage created by the runoff algorithm.
