Monday, July 18, 2016

Getting to know the (nearest) neighbors

[Figure: kNN_intro]


kNN background

The Nearest Neighbor algorithm (NN) is a simple algorithm that works well in classifying previously unseen data (see this post). It simply compares an unknown data point against all known data points. It then concludes that the unknown point is of the same class as the closest known point, hence the nearest neighbor designation. More precisely, the algorithm presumes that the type or class ("type" and "class" are used interchangeably throughout this post) of the unknown point has to be the type/class of its nearest neighbor. Due to this very direct dependency on prior known training data, the more training data one has, the better the approximation.

Expressed in mathematical form, NN is described as follows. Given a set of $m$ classes $C = \{c_1,c_2,...,c_m\}$, a set of training points $\mathbf D=\{\mathbf x_1,\mathbf x_2,...,\mathbf x_n\}$ where $\mathbf x\in\Bbb R^d$, and a set of corresponding classes $\{y_1,y_2,...,y_n\}$ for each $\mathbf x_i$ in $\mathbf D$ where $y_i\in C$, the NN algorithm finds the training point $\mathbf x_i$ whose distance to the query/test point $\mathbf x'$ is minimized and returns its class $y_i$. If we define $z(\mathbf x_i)=y_i$, the NN algorithm becomes:

$$y'=z\left(\underset{\mathbf x_i\in \mathbf D}{\operatorname{argmin}}\; d(\mathbf x',\mathbf x_i)\right)$$

where $d(\mathbf x',\mathbf x_i)$ is some distance calculation.
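To make this concrete, here is a minimal brute-force sketch in Python/NumPy (the function and argument names are my own; `X_train` is an $n\times d$ array of training points and `y_train` holds their classes):

```python
import numpy as np

def nn_classify(X_train, y_train, x_query):
    """Return the label of the single nearest training point (brute force)."""
    # Squared Euclidean distance from the query to every training point.
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    return y_train[np.argmin(dists)]
```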

On a set of data points whose classes are clearly delineated from each other (that is, no overlaps in the classes), this method will work well. Any errors will come from classifying test points around class boundaries. If there are few examples of one type (say Type A) in a particular region near a boundary between two types, but a lot more examples of another type (say Type B) across that boundary, the nearest neighbor of an unknown point (a point that is in fact a Type A point) might be from across the boundary. This will cause the unknown point to be incorrectly identified as Type B.

To hedge against a misleading nearest neighbor, we could query more points and take a vote among their types, concluding that the dominant type around a neighborhood is the correct class. This method of looking at the majority class over several nearest neighbors is called $k$-NN (or kNN), where $k$ is the number of neighboring data points to query.

The original function above changes slightly. We define $z_i(\mathbf x')$ as the class of the $i$-th nearest training point in $\mathbf D$ to the query point $\mathbf x'$, and $\mathbf I$ as an indicator function:

$$\hat y=\underset{y\in C}{\operatorname{argmax}}\left(\sum_{i=1}^k \mathbf I(y,z_i(\mathbf x'))\right)$$$$\mathbf I(a,b)= \begin{cases} 1 & \text{if }a=b\\ 0 & \text{otherwise} \end{cases}$$
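Extending the sketch above from the single nearest neighbor to a majority vote over $k$ neighbors (again, an illustrative sketch rather than a reference implementation):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """Majority vote among the k nearest training points."""
    dists = np.sum((X_train - x_query) ** 2, axis=1)   # squared Euclidean
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                  # the majority class
```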

kNN can handle cases where classes are known to overlap, e.g., each class is a Gaussian distribution (or some other kind of distribution) that extends outward from a class center or centers regardless of the presence of other classes in the vicinity. At different points between the ‘centers’ of the class distributions, there will be different likelihoods of being a Type A or some other type (Type B, C, and so on).

We hope to guess this likelihood of belonging to the correct type by examining the class density around the immediate neighborhood of our sample, with the class 'density' estimated as the majority class within $k$ neighbors. A kNN estimate is clearly not perfect, but gets better with more data. The intuition is that we should choose the type that is most dense, which should imply that we are nearest that type's center. However, there is no guarantee that the type of the nearest neighbor is the actual correct type for the unknown data point (recall the assumed Gaussian distribution overlap). All we get is a higher likelihood.

Fine-tuning kNN

If we think through how the kNN algorithm works, we should be able to figure out that the algorithm is affected by (and could be improved by fine-tuning) the following factors:

  • The number of neighbors to consider in the calculation, $k$
  • The distance measure, $d(\mathbf x',\mathbf x_n)$
  • The weight of each of the $k$ nearest neighbors, $\mathbf w^k\in\Bbb R^k$
  • The weight of each feature, $\mathbf w^f\in\Bbb R^d$
  • The weight of each instance/sample, $\mathbf w^s\in\Bbb R^n$

(Note that the variable notation for the last three parameters is my arbitrary choice. The superscripts are there to identify each parameter, not as exponents. Formal literature might have different standard variable designations.)

The first three parameters have been well explored by researchers, and intuitive choices produce good results without a lot of testing. In practice, most of these choices are fixed, with k as the only parameter that requires fine-tuning. The distance measure is typically the Euclidean (squared-error) distance, and the neighbor weighting is uniform (each neighbor has the same effect). The value of k is determined by cross-validation (and k=3 works well without tuning), while additional tests might uncover whether a specialized neighbor-weighting scheme (such as inverse distance weighting) could work better.

The last two parameters are typically left untouched, meaning equal weights are assumed, once training features and instances are finalized. However, as the concept of nearest neighbors becomes diluted in high dimensions, especially with data that has many irrelevant features, it is often desirable to consider techniques that single out the more important features, i.e., feature selection, during pre-processing. These last two parameters will be the focus of our experiments.

I have tested changing the weights of each feature, but through an indirect approach. Instead of ranking the relative value or relevance of a feature over all others, I reduced the number of features (dimensionality reduction), effectively setting to zero the weights of ‘irrelevant’ features while holding the remaining features at equal weights. I could still explore a weighted feature kNN, but will leave this as a future experiment.
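As a sketch of what this indirect approach amounts to in code (the mask below is a made-up example; in practice it would come from some feature-selection step):

```python
import numpy as np

# Hypothetical binary mask: 1 keeps a feature, 0 discards it.
feature_mask = np.array([1, 0, 1, 1, 0])

def masked_distance(x, y, mask=feature_mask):
    """Squared Euclidean distance computed only over the kept features."""
    diff = (x - y) * mask
    return np.sum(diff ** 2)
```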

I have no clear plan of attack on the last one, except perhaps to deploy anomaly detection algorithms on the training data (yet another experiment). In other words, training samples that are atypical (outside some normality threshold) could be discarded or assigned less contribution to the summed kNN distance, as they might be causing incorrect classification. I am not (yet) well-versed enough in the limitations of anomaly detection, especially on data for which a class might have multiple kinds of normal representation, so I will do this experiment at a later time.

If we add these parameters to the equations above, we will get, for NN:

$$d_w(\mathbf x',\mathbf x_j)=w^s_j\sum_{i=1}^d w^f_i\, d(x'_i,x_{j,i})$$$$y'=z(\mathbf x_{j'})$$

where $j'=\underset{\mathbf x_j\in \mathbf D}{\operatorname{argmin}}\; d_w(\mathbf x',\mathbf x_j)$; that is, the instance weight $w^s_j$ scales the feature-weighted distance of training point $\mathbf x_j$, and $w^f_i$ weights the contribution of feature $i$.

And for kNN:

$$\hat y=\underset{y\in C}{\operatorname{argmax}}\left(\sum_{i=1}^k w^k_i\, \mathbf I(y,z_i(\mathbf x'))\right)$$

where the $k$ nearest neighbors are now ranked by the weighted distance $d_w$ above.
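Folding all three weight vectors into a single sketch (the weight values themselves are placeholders that would have to come from some tuning or learning step):

```python
import numpy as np

def weighted_knn_classify(X_train, y_train, x_query, k,
                          w_feature, w_instance, w_neighbor):
    """kNN with feature, instance, and neighbor weights (all assumed given).

    w_feature : (d,) weight per feature
    w_instance: (n,) weight per training sample, scaling its distance
    w_neighbor: (k,) weight per neighbor rank, scaling its vote
    """
    # Feature-weighted squared differences, scaled per training instance.
    dists = w_instance * np.sum(w_feature * (X_train - x_query) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for rank, idx in enumerate(nearest):
        label = y_train[idx]
        votes[label] = votes.get(label, 0.0) + w_neighbor[rank]
    return max(votes, key=votes.get)
```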


Number of neighbors

We already described the number of nearest neighbors in the kNN post referenced above. We simply take a survey of several points near our unknown point, and make a judgment based on these neighbor points. For instance, we could decide that the class of an unknown point should be the majority class of the k=5 nearest neighbors, even if the nearest neighbor is of a different class.

This approach typically avoids the noise caused by outliers, e.g., a solo Type A surrounded by many Type Bs. This loner Type A data point might represent an island of Type As, or it could simply be noise. Without more data, it would be hard to conclude. If this were a valid Type A cluster, then with more training samples, more data points would gather around this island, thus making its neighborhood more Type A. If it were truly an outlier, the solo Type A would not gain more Type A neighbors, justifying our original majority vote.

Generally, k=3 works well. Depending on the problem and the available training data, other values of k (odd k, to break ties) might work better, but performance often flatlines at higher values. The best k for a machine learning problem is mostly determined via cross-validation, a topic for another day.
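As a small preview of what that cross-validation looks like in scikit-learn (a sketch using the small built-in digits dataset as a stand-in for MNIST, not part of our experiments):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)   # small MNIST-like digits dataset

for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")
```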

For now, from an experimentation point of view, we will focus on k=1, using k=3 only to test if the performance improves significantly. The thinking is that k=3 should outperform k=1, but we want the performance of our experiments to improve on k=1 first.

Distance measure

The distance measure is typically the distance between two points in space. In 2D space, we are familiar with using the Pythagorean Theorem to calculate the distance between two points (the hypotenuse $c$ of a right triangle), $c^2=a^2+b^2$. If we extend this to many dimensions between two points $\mathbf x$ and $\mathbf y$, we get the following, called the Euclidean distance:

$$\mathbf x=(x_1,x_2,...,x_n) \in \mathbb R^n$$$$\mathbf y=(y_1,y_2,...,y_n) \in \mathbb R^n$$$$d_2(\mathbf x, \mathbf y)=\sqrt {{\sum_{i=1}^n ({x_i-y_i})^2}}$$

There is a simpler version. If we do not use squared difference, instead adding the absolute values of the gaps, we get the so-called taxicab distance, or the Manhattan distance:

$$d_1(\mathbf x, \mathbf y)={\sum_{i=1}^n \left\lvert{x_i-y_i}\right\rvert}$$

Taxicab is an apt description. It is the distance we get when we look at a city map and determine the number of blocks to travel from one point to another. This distance measure is also called a city block distance.

Notice that this is similar to the original Euclidean formula, except that we do not have the power of 2 and the square root. If we replace the “2” with “1,” we change from the Euclidean distance to the taxicab distance. In machine learning terminology, you will hear $\mathsf L^2$ or $\ell_2$ distance. This is the Euclidean distance. The $\mathsf L^1$ or $\ell_1$ distance is the taxicab distance.

More generally, both are specific variations of a general case, the Minkowski distance metric:

$$d_p(\mathbf x, \mathbf y)=\left({\sum_{i=1}^n \left\lvert{x_i-y_i}\right\rvert^p}\right)^{1/p}$$
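In code, these three distances differ only slightly (a NumPy sketch):

```python
import numpy as np

def euclidean(x, y):        # L2, i.e., Minkowski with p=2
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):        # L1 / taxicab, i.e., Minkowski with p=1
    return np.sum(np.abs(x - y))

def minkowski(x, y, p):     # the general case
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)
```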

There are a few more cautionary points when calculating distance in kNN. The data has to be of a consistent scale on all dimensions. For example, a feature that scales from 0 to 100 will overpower a feature whose scale is from 0 to 1. The squared errors will be huge for the first feature, artificially suggesting that the first feature is more important, which might not be the case. Thus, each feature has to be normalized to a consistent scale first.
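A common way to do this is min-max scaling each feature to [0, 1] using the training data's ranges (a sketch; scikit-learn's MinMaxScaler does the equivalent):

```python
import numpy as np

def min_max_scale(X_train, X_test):
    """Scale each feature to [0, 1] using the training set's min and max."""
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    span = np.where(hi == lo, 1.0, hi - lo)   # guard against constant features
    return (X_train - lo) / span, (X_test - lo) / span
```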

This focus on consistent distance measurements also implies that the formula should satisfy what is known as a metric, i.e., a distance metric (not merely a distance measure). The term metric has a precise mathematical definition. For a distance calculation to be considered a metric, it has to satisfy the following:

  • the distance between two points is zero or positive (non-negative; $d(\mathbf x,\mathbf y)\ge 0$)
  • if two points are equal, the distance between them is zero (identity; $d(\mathbf x,\mathbf y)=0 \Leftrightarrow \mathbf x=\mathbf y$)
  • the distance from point A to point B should be equal to the distance from point B to point A (symmetry; $d(\mathbf x,\mathbf y)=d(\mathbf y,\mathbf x)$)
  • the distance from point A to point C has to be less than or equal to the combined distance from point A to an interim point B and the distance from the interim point B to point C (triangle inequality; $d(\mathbf x,\mathbf z)\le d(\mathbf x,\mathbf y)+d(\mathbf y,\mathbf z)$)

Extending the scale constraint further, kNN is also a bit clunky to use on purely categorical data (e.g., hot, warm, neutral, cool, cold) where real number values do not make sense, or datasets that contain a mix of continuous numerical data and categorical or binary data. There are ways to use kNN in those cases still, but we will not cover that here in detail.

Binary features are easy. The distance calculation above still works. With binary features, the feature vector becomes a binary vector. We can therefore use well-known binary/set similarity measures such as Hamming, Jaccard, Tanimoto (not a metric), cosine similarity (not strictly for binary vectors), and so on. To handle categorical data, we can extend the binary feature concept. This approach is known as one-hot encoding, where each possible value of a categorical feature is encoded as on/off (e.g., one binary feature for hot, another for warm, yet another for neutral, and so on). This is similar to the concept of dummy variables in statistics.
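A quick sketch of one-hot encoding the temperature example above:

```python
# One-hot encode a small categorical feature by hand.
categories = ["hot", "warm", "neutral", "cool", "cold"]

def one_hot(value, categories=categories):
    """Return a binary vector with a single 1 at the position of the category."""
    return [1 if c == value else 0 for c in categories]

print(one_hot("warm"))   # [0, 1, 0, 0, 0]
```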

Different scales and categorical data are not an issue in our MNIST dataset. All data points are of the same type (pixel intensity) and range (0-255). For our test purposes, we will use the Euclidean distance.

Weights of neighbors

If we are to use k>1, we have to decide how we value the classes suggested by the several neighbors. The standard kNN calculation treats each neighbor equally, relying on the local class densities of the training data within the first k neighbors to drive the majority decision. Intuitively, however, the data point closest to our test point should matter more (i.e., it is more likely to be of the class that governs the unknown point) than those farther away. Hence, we could define the weight based on the distance to the unknown point.

There is a version of kNN that incorporates distance weighting of the neighboring points. Unsurprisingly, it is called distance-weighted kNN. It assigns a weight of 1 to the nearest neighbor, 0 to the farthest, and a linear interpolation for the rest of the neighbors. The literature also has other methods for calculating the weights, including other modifications to distance-weighted kNN. These modifications tend to improve kNN performance on UCI datasets.
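A sketch of that linear interpolation (the nearest of the k neighbors votes with weight 1, the farthest with weight 0):

```python
import numpy as np
from collections import defaultdict

def distance_weighted_knn(X_train, y_train, x_query, k=5):
    """kNN where each neighbor's vote is linearly interpolated by distance."""
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    d_near, d_far = dists[nearest[0]], dists[nearest[-1]]
    votes = defaultdict(float)
    for idx in nearest:
        # Weight 1 for the nearest neighbor, 0 for the farthest of the k.
        w = 1.0 if d_far == d_near else (d_far - dists[idx]) / (d_far - d_near)
        votes[y_train[idx]] += w
    return max(votes, key=votes.get)
```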

For our MNIST tests, we will assume k=1. Whenever we use k>1 in our experiments, we will make the weight of each neighbor equal to simplify the calculations.

Weights of features

The basic kNN algorithm assumes that the features contribute equally, hence the summation is over all features. In reality, certain types of objects are identifiable by a subset of their many features. For instance, relative to other sports balls, an (American) football's overriding and consistent feature is its oblong shape. It does not have to be of a specific color, although a leather shade helps identification. For a basketball or a football (soccer ball), the stitches and panels are very distinct. The same could be said of the digits in MNIST, where each digit has features unique to itself or shared with only a subset of all 10 digits. Irrelevant features tend to cause misclassifications.

The literature on kNN provides several improvements on the basic kNN where a parameter weight is added for each feature based on some calculation. We will not explore those techniques here. Instead, we will explore our own ideas of assigning weights to features, without regard to existing literature (for the joy of exploration, of course, plus we all know that literature search is not as much fun as figuring things out!).

Note also that some distance metrics might already incorporate some implicit feature weighting, such as the Mahalanobis distance:

$$d_m(\mathbf x, \mathbf y)=\sqrt {(\mathbf x-\mathbf y)^T C^{-1}(\mathbf x-\mathbf y)}$$

Notice that it is similar to the Euclidean distance (in vector form):

$$d_2(\mathbf x, \mathbf y)=\sqrt {(\mathbf x-\mathbf y)^T (\mathbf x-\mathbf y)}$$

The only difference is the presence of $C^{-1}$, the inverse of the covariance matrix of the training data.
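A NumPy sketch of the Mahalanobis calculation, estimating the covariance from the training data (for image data like MNIST the covariance can be singular, hence the pseudo-inverse):

```python
import numpy as np

def mahalanobis(x, y, X_train):
    """Mahalanobis distance using the covariance estimated from X_train."""
    C = np.cov(X_train, rowvar=False)   # d x d covariance of the features
    C_inv = np.linalg.pinv(C)           # pseudo-inverse in case C is singular
    diff = x - y
    return np.sqrt(diff @ C_inv @ diff)
```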

Other ML algorithms prioritize certain features by automatically adjusting the weight of each feature, lending more weight to important features and reducing the influence of noisy ones. For example, the back-propagation and gradient descent of a neural network will automatically adjust feature weights such that the overall accuracy increases (or a loss function is minimized). Unfortunately, in standard kNN, we are stuck with a uniform weight value for each feature.

In our experiments, unless they are the subject of a trial algorithm, we will assume features have equal weights.

Weights of instances

The weights of instances and of neighbors sound similar. However, whereas the weight of a neighbor affects the weight of its vote for a class, the weight of an instance affects the distance calculation for that instance. In other words, even if a query point lands closer to Point B, instance weights could bias the final distance calculation to favor Point A.

But why would we assume one point is 'better' than another? We could think of outliers, where some training data points might not be very informative, even misleading. This is not strictly an issue with the algorithm, but rather with how the training data was prepared, perhaps due to labeling errors or to including data points that could be interpreted as one type even if they are factually of another type (for example, due to a very odd writing style). However, this parameter is not typically explored when improving kNN. Each data point is considered equally in the final distance calculation.

One technique where instance weights are implicitly calculated is the use of class prototypes (the mean feature vector of each class) constructed from the training data, followed by running kNN against the prototypes instead of the original training data. The original nearest-neighbor instances, while no longer used directly, are implicitly weighted into the distance calculation against the class prototypes.
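A sketch of the prototype idea: collapse each class to its mean vector and classify against those means (scikit-learn's NearestCentroid does essentially this):

```python
import numpy as np

def build_prototypes(X_train, y_train):
    """One prototype per class: the mean of that class's training vectors."""
    classes = np.unique(y_train)
    protos = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes, protos

def prototype_classify(x_query, classes, protos):
    dists = np.sum((protos - x_query) ** 2, axis=1)
    return classes[np.argmin(dists)]
```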

In our experiment, we will assume each instance is correct and equally likely to exist.

Conclusions

The literature on kNN is vast. It is an old technique, yet it remains a reliable baseline when given enough training data. Being a memory-based learner, it is memory-intensive and slow relative to other algorithms. Being an instance-based classifier, it tends to overfit, and its structure is highly dependent on the subset of data used for training (i.e., high variance). It could be argued that some ML algorithms reduce to a variant of the kNN algorithm, or that kNN turns out to be a special case of other algorithms. Despite its weaknesses, the feasibility of kNN as a good ML algorithm is well-documented. For its simplicity and transparent logic, it is a good tool for running experiments, which we will do in future posts.