Improving the accuracy of a sparse kNN
Model accuracy improves with more training data. Many impressive results in machine learning, particularly in neural network approaches, rely on a lot of data and prolonged training (e.g., several days of running gradient descent over hundreds of thousands of samples). This is not a bad thing. In practice, there is no reason why models should not use as much data and training time as practically available. Even in low-stakes Kaggle competitions, it is common to hear of algorithms that run for hours on high-end GPUs.
However, there are instances when we only have access to a limited set of training data. We therefore expect performance to decrease, whatever the model. This is a theoretical result as well as a practical reality: given a baseline model, more data generally leads to better accuracy.
In this post, we will observe this degradation due to sparse training data in the context of a PCA+kNN model run against the MNIST dataset. To recall, PCA+kNN on MNIST using 42,000 training digits achieves a test accuracy of about 97+%. This accuracy drops below 90% when only the first 1,000 training digits are used.
We will also see whether we can somehow improve on PCA+kNN under limited training data. We know that a reduction in dimensionality tends to help kNN accuracy by reducing overfitting, and we have established that a PCA set to 50 principal components generates consistently high scores on the Kaggle MNIST. Recall also that the 50 principal components were in fact selected via a grid search over the test set, not against a training validation set. Put another way, the PCA=50 setting will be very hard to beat.
How do we do this?
We will not explore the actual code in detail, as it is long and not well organized. It was meant to be experimental, and is an exploratory patchwork of different sub-algorithms full of debugging stubs. In concept, my approach is to run a PCA+kNN, select the top two or three classes, then proceed to a runoff vote. The runoff portion is kNN-like, relying on a Euclidean distance measure plus some other refinements. Because there is a secondary runoff, and the refinements are not yet vectorized, execution time is much longer than standard PCA+kNN/kNN.
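To make the two-stage structure concrete, here is a minimal sketch, not the actual code: it assumes scikit-learn's PCA, and the runoff_vote below is only a hypothetical stand-in (nearest class centroid among the candidates) for the localized refinements described later in this post.

import numpy as np
from sklearn.decomposition import PCA

def top_choices(train_p, train_y, x, n_top=3):
    # labels of the n_top nearest training digits in PCA space (the PCA+kNN layer)
    d = np.linalg.norm(train_p - x, axis=1)
    return train_y[np.argsort(d)[:n_top]]

def runoff_vote(candidates, x, train_p, train_y):
    # placeholder runoff: nearest class centroid among the candidate classes,
    # standing in for the localized-feature refinements described in the post
    cands = np.unique(candidates)
    dists = [np.linalg.norm(x - train_p[train_y == c].mean(axis=0)) for c in cands]
    return cands[int(np.argmin(dists))]

def predict_with_runoff(train_x, train_y, test_x, n_components=50, n_top=3):
    pca = PCA(n_components=n_components).fit(train_x)
    train_p, test_p = pca.transform(train_x), pca.transform(test_x)
    preds = []
    for x in test_p:
        cands = top_choices(train_p, train_y, x, n_top)
        if len(np.unique(cands)) == 1:   # homogeneous top choices: keep the kNN answer
            preds.append(cands[0])
        else:                            # contested: second-chance runoff vote
            preds.append(runoff_vote(cands, x, train_p, train_y))
    return np.array(preds)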
Algorithm context
kNN is sensitive to the features used in the distance calculation. It is obvious that kNN would improve if only the important features (pixels in this case) were included. However, at the pixel level, selecting the defining pixels is fundamentally sensitive to slight image perturbations such as translation, rotation, and so on. In pixel hyperspace, a shift of even a single pixel leads to big jumps. This could be minimized to an extent by certain techniques (e.g., convolution/pooling, dictionary/bag-of-words, pre-processing, etc.), but we wish to see if we can improve performance even with naturally noisy data.
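To see just how brittle raw pixels are as features, consider a small synthetic check (a 28x28 image with a single vertical stroke; np.roll stands in for a one-pixel translation):

import numpy as np

# a synthetic 28x28 'digit': a single vertical stroke of full-intensity pixels
img = np.zeros((28, 28))
img[4:24, 13] = 255

shifted = np.roll(img, 1, axis=1)   # translate the stroke one pixel to the right

# Euclidean distance in flattened pixel space: the two images share no 'on'
# pixels, so a one-pixel translation looks like a completely different point
print(np.linalg.norm(img.flatten() - shifted.flatten()))
# compare with the distance from the original stroke to a blank image
print(np.linalg.norm(img.flatten()))

For this stroke, the one-pixel shift actually lands farther from the original (in Euclidean terms) than an entirely blank image does, which is exactly the kind of big jump described above.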
A related concept is that of transformation manifolds; a good reference is tangent distance as applied to MNIST [Simard, 2000]. In brief, that solution treats samples of the same class as neighbors by calculating the distance between points after certain transformations (e.g., a thin digit and a fat digit of the same class become neighbors, where they would otherwise be far apart in ordinary kNN hyperspace). The tangent distance is the closest distance to another point as a digit moves along its transformation manifold. Another technique [Wong, 2015] generates a skeleton of each digit, decomposes digits into important patches, generates a dictionary of possible transformations of these patches, then reassembles them for the best fit. These techniques assume knowledge or calculation of likely transformations and preprocessing.
[These are the ones that I recently re-read. There must be many other techniques and variants thereof. It could also be said that the convolutional nature of CNNs (e.g., LeNet) amounts to prior knowledge of transformations and visual structure, even if the weights are actively learned. In other words, if the wiring of the CNN receptive field were random across the entire image, it would take far more random wiring to eventually settle into a network that favors local, overlapping patches and accurately handles translational and rotational invariance. We do know that image pixels are locally and spatially related, so providing that nudge simplifies training and leads to better accuracy. Another good algorithm, shape context matching [Belongie, 2002], could be said to assume local spatial relationships by virtue of its diminishing bin sizes.]
Without the benefit of transformations, it is obvious that a digit would have different key pixels depending on how it is written (orientation, line thickness, and so on). In kNN space, we could think of this as different key features mattering in different parts of the hyperspace. So if we localize the kNN calculation, we might see improvement gains. We can extract these locally important features by considering a local region ball. As with Gaussian/radial-basis models, the size/radius of this ball is a key but very sensitive parameter.
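A rough sketch of the local-ball idea follows. This is not the actual selection criterion used in the runoff; ranking pixels by their variance within the ball is just one crude stand-in for "locally important features", with radius as the sensitive parameter mentioned above.

import numpy as np

def local_ball_features(train_x, query, radius, n_features=50):
    # training samples inside a Euclidean ball of the given radius around the query
    d = np.linalg.norm(train_x - query, axis=1)
    ball = train_x[d <= radius]
    if len(ball) < 2:
        # ball too sparse to say anything useful: fall back to the first pixels
        return np.arange(train_x.shape[1])[:n_features]
    # crude local importance: pixels that vary the most within the ball
    scores = ball.var(axis=0)
    return np.argsort(scores)[::-1][:n_features]

def localized_distance(a, b, feature_idx):
    # Euclidean distance restricted to the locally important pixels
    return np.linalg.norm(a[feature_idx] - b[feature_idx])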
This algorithm relies on this local region kNN concept. In this sense, the algorithm does not assume knowledge of tranformations (it does not provide guideposts that a slightly curved line is equal to a more slightly curved line) nor pre-processing (a thin line is different from a thick line). The absence of these assumptions therefore prevents the algorithm from making generalizations outside the known samples. What this does however is that we could re-order the pixels in each sample (so it won't make sense visually to anyone, and would lead to a different set of local convolutions in a CNN), and the algorithm can still achieve the same result.
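The permutation claim is easy to verify for any purely distance-based method: applying the same pixel re-ordering to every train and test image leaves all pairwise Euclidean distances, and hence kNN-style predictions, unchanged. A small numpy check (with random stand-ins for flattened digits):

import numpy as np

rng = np.random.RandomState(0)
train = rng.rand(20, 784)   # stand-ins for flattened 28x28 digits
test = rng.rand(5, 784)

perm = rng.permutation(784)          # one fixed pixel re-ordering
train_p, test_p = train[:, perm], test[:, perm]

# distances from each test point to each training point are identical
d_before = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
d_after = np.linalg.norm(test_p[:, None, :] - train_p[None, :, :], axis=2)
print(np.allclose(d_before, d_after))   # True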
Finding key local features is fairly easy; finding the right ones is not. A previous experiment that applied a PCA over each localized ball did not produce a significant improvement. The problem is that any local ball will only contain a few samples, and PCA and similar methods are prone to producing misleading results on so little data. I also explored a variation of tf-idf (over a bag-of-pixels!), a technique I have explored before on a binary MNIST template matcher.
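As an illustration of the tf-idf-over-bag-of-pixels idea (not necessarily the exact weighting used here), treat each binarized image as a 'document' and each 'on' pixel as a 'term'; pixels that are on in nearly every sample of a neighborhood then carry little weight:

import numpy as np

def pixel_idf(samples, threshold=128):
    # samples: (n, 784) grayscale images; binarize into a bag-of-pixels
    on = samples >= threshold
    # document frequency of each pixel: in how many samples it is 'on'
    df = on.sum(axis=0)
    # smoothed idf: rare pixels score high, ubiquitous pixels score near zero
    return np.log((1 + len(samples)) / (1 + df))

def tfidf_weighted_distance(a, b, idf, threshold=128):
    # compare binarized images, weighting each pixel by its idf
    diff = (a >= threshold).astype(float) - (b >= threshold).astype(float)
    return np.linalg.norm(idf * diff)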
Accuracy bounds
The runoff accuracy relies on the top choices produced by a PCA+kNN (or any other algorithm). If the top choices do not include the correct class, the runoff will fail. The runoff simply gives the algorithm a second chance at improving (or keeping) the result of the PCA+kNN. Thus, the lower bound is set by the test cases where the top choices all belong to the same class (the runoff cannot change those), and the upper bound is the accuracy obtained if every case with non-homogeneous top choices is predicted correctly.
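One way to make these bounds concrete is sketched below; top_choices_per_test and true_labels are hypothetical names for the first-layer candidate lists and the ground-truth labels.

import numpy as np

def runoff_accuracy_bounds(top_choices_per_test, true_labels):
    n = len(true_labels)
    fixed_correct = 0        # homogeneous top choices that happen to be correct
    contested_reachable = 0  # non-homogeneous top choices that contain the correct class
    for cands, y in zip(top_choices_per_test, true_labels):
        if len(np.unique(cands)) == 1:
            fixed_correct += int(cands[0] == y)
        else:
            contested_reachable += int(y in cands)
    lower = fixed_correct / n                          # runoff gets every contested case wrong
    upper = (fixed_correct + contested_reachable) / n  # runoff gets every reachable contested case right
    return lower, upper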
While it relies on the PCA+kNN initially, the algorithm does not know which of the runoff choices was the 'correct' class as predicted by the prior PCA+kNN. The algorithm is completely independent and receives no guidance from the PCA+kNN (beyond receiving the top choices). Its final choice could differ from the top PCA+kNN choice. If the algorithm is ineffective, it could override and undermine the initial accuracy of the underlying PCA+kNN, leading to worse performance overall.
Since it relies on a prior algorithm, its final accuracy is limited by the top choices generated by that algorithm. In fact, my experiments showed that there is more accuracy to be gained by making a 'better' first layer (so that it includes the correct class in the top two or three choices) than by improving the runoff accuracy over those top choices.
Runoff background
We previously experimented with a runoff approach (here), and while it provided slight improvements, the outcome was not very convincing. Unfortunately, in many cases the runoff would choose the wrong class even when the first-layer PCA+kNN had identified the correct one. Clearly the runoff approach had to be tweaked…
… and tweak I did. I have been working on this algorithm for several months now (hence the lack of new posts). It is time-consuming; parameter tuning and testing use up more time than the coding. It is finally in a state that is consistently accurate. I have settled on some hard-coded parameters that seem to work across different training set sizes, but they are hardly optimal. More research is needed to tie these parameters to measurable qualities of the training dataset. There is also no reason to match the number of principal components in the PCA to the number of localized features in the runoff, but we will stick to that in these runs.
The plotting code is standard, shown below:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def plotter(a,b,c,reps,train_size,test_size):
    # a, b, c: per-trial accuracies for PCA+kNN, plain kNN, and the runoff
    print('mean accuracy, pca+knn:',np.mean(a))
    print('mean accuracy, knn:',np.mean(b))
    print('mean accuracy, pca+knn-runoff:',np.mean(c))
    plt.figure(figsize=(12,4))
    # per-trial accuracy curves
    plt.plot(reps,a,marker='+',ls=':',c='r',label='PCA+kNN')
    plt.plot(reps,b,marker='x',ls='-.',c='g',label='kNN')
    plt.plot(reps,c,marker='^',ls='--',c='b',label='PCA+kNN-Runoff')
    # horizontal lines at each model's mean accuracy
    plt.plot(reps,[np.mean(a)]*len(reps),ls='-',c='r',label='PCA+kNN, mean')
    plt.plot(reps,[np.mean(b)]*len(reps),ls='-',c='g',label='kNN, mean')
    plt.plot(reps,[np.mean(c)]*len(reps),ls='-',c='b',label='Runoff, mean')
    plot_title='Limited Training Data Comparison\n(train size='+str(train_size)+', test size='+str(test_size)+')'
    plt.title(plot_title)
    plt.legend(loc=4,fontsize='small')
    plt.xlabel('Trials')
    plt.ylabel('Accuracy')
    plt.xlim([1,20])
    plt.show()
Experiment setup
The experiment goes through 20 repetitions, classifying 100 test digits in each. It randomizes the training data used for each repetition. We then run the algorithm to extract the accuracy of the different models (a PCA+kNN, a plain kNN, and the runoff).
Notice that the training data is not guaranteed to be balanced, nor to reflect the original ratio of the different classes. This affects kNN accuracy. In later experiments, we also explore different training sizes to observe how accuracy changes with the amount of training data.
def runoff_experiment(reps,train_size,test_size,seed=0,randomTest=True):
    np.random.seed(seed)
    pca_knn=[]
    knn=[]
    runoff=[]
    for i in reps:
        # fresh copies so the shuffles do not alter the originals
        train_sel=train_mnist.copy()
        test_sel=test_mnist.copy()
        test_label=test_mnist_label.copy()
        # random training subset of size train_size
        np.random.shuffle(train_sel)
        train_sel=train_sel[:train_size]
        if randomTest:
            # random test subset of size test_size
            idx=np.arange(0,len(test_sel))
            np.random.shuffle(idx)
            test_sel=test_sel[idx[:test_size]]
            test_label=test_label[idx[:test_size]]
        else:
            # fixed test subset: always the first test_size digits
            test_sel=test_sel[:test_size]
            test_label=test_label[:test_size]
        # accuracies for PCA+kNN, kNN, and the runoff on this repetition
        a,b,c=runoffexp(train_sel,test_sel,test_label)
        pca_knn.append(a)
        knn.append(b)
        runoff.append(c)
    return pca_knn,knn,runoff
Test 1: Training size=1,000 samples
reps=np.arange(1,21)
train_size=1000
test_size=100
pca_knnA0,knnA0,runoffA0=runoff_experiment(reps,train_size,test_size,0,False)
plotter(pca_knnA0,knnA0,runoffA0,reps,train_size,test_size)
We see a bit of an improvement, but only marginally, so it might not be significant. We also observe that PCA+kNN is generally better than a plain kNN, which is what we expect.
Test 2: Re-run Test 1 (training size=1,000 samples), but with a different random seed
Let's see what would happen if we use a different random number seed.
pca_knnA1,knnA1,runoffA1=runoff_experiment(reps,train_size,test_size,1,False)
plotter(pca_knnA1,knnA1,runoffA1,reps,train_size,test_size)
This is good! We see some separation between the PCA+kNN and the runoff. We also see that the runoff result is almost always better than the other two.
Test 3: Training size=100
Now that we think it works, it is time to try an even smaller training data set. Let's cut this down by a factor of 10, leaving us with only about 10 training samples per digit class (remember that the selection is random; some classes might not even be represented). Surely this would be a problem.
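To get a feel for how uneven a random draw of 100 digits can be across the 10 classes, here is a quick check using synthetic stand-in labels (not the actual MNIST label array):

import numpy as np

rng = np.random.RandomState(0)
labels = rng.randint(0, 10, size=42000)   # stand-in for the full set of training labels

subset = rng.choice(labels, size=100, replace=False)
print(np.bincount(subset, minlength=10))  # per-class counts in a random 100-digit draw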
train_size=100
pca_knnB0,knnB0,runoffB0=runoff_experiment(reps,train_size,test_size,0,False)
plotter(pca_knnB0,knnB0,runoffB0,reps,train_size,test_size)
Test 4: Re-run Test 3 (training size=100), but with a different random seed
train_size=100
pca_knnB1,knnB1,runoffB1=runoff_experiment(reps,train_size,test_size,1,False)
plotter(pca_knnB1,knnB1,runoffB1,reps,train_size,test_size)
Whoaa! What happened here? Why are the average lines sitting on top of each other? And where did the runoff advantage go?
Well, the similar averages are a coincidence; the individual runs still show obvious differences. The 'underperformance' of the runoff algorithm means either: (i) the previous three tests are misleading and this one is more representative, or (ii) the randomly drawn training set is so unbalanced that it is throwing off the runoff portion of the algorithm.
How will we know? Why don't we throw caution to the wind and just unleash this in the wild?
Test 5: Training size=100, randomized test set
The previous tests were run against the same test set each time, so the comparisons are consistent apart from the randomized training data selection and its size. Here, to truly test the runoff algorithm 'in the wild', we also randomize the test set selection (which we do via the default flag in the experiment setup code).
train_size=100
pca_knn1,knn1,runoff1=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn1,knn1,runoff1,reps,train_size,test_size)
train_size=1000
pca_knn2,knn2,runoff2=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn2,knn2,runoff2,reps,train_size,test_size)
train_size=500
pca_knn3,knn3,runoff3=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn3,knn3,runoff3,reps,train_size,test_size)
train_size=2000
pca_knn4,knn4,runoff4=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn4,knn4,runoff4,reps,train_size,test_size)
train_size=5000
pca_knn5,knn5,runoff5=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn5,knn5,runoff5,reps,train_size,test_size)
train_size=10000
pca_knn6,knn6,runoff6=runoff_experiment(reps,train_size,test_size)
plotter(pca_knn6,knn6,runoff6,reps,train_size,test_size)
Yep, still better, although we do notice that the improvement gap narrows with more training data. This is intuitive.
The algorithm 'guesses' the better result from among the top two or three choices, but with more training data the baseline kNN accuracy increases, nullifying the advantage created by the runoff algorithm.