Projects

Running experiments on real-world data using standard machine learning techniques is fun. Running experiments on real-world data using non-standard techniques is even better. That is what drew me into machine learning: I get a chance to make up algorithms. Except for the rare slice-of-life musing, every one of my posts here is an experiment.

Most of these ‘experiments’ will simply be to implement the algorithms myself and modify them in controlled, limited ways to observe changes in performance. Having implemented them, my hope is that I will become more fluent with these algorithms. Even if the experiments lead nowhere, I will have learned something about their performance envelopes. Who knows, perhaps I'll get lucky and one of them will turn out to be new and useful.

But in machine learning these days, to have any hope of inventing something truly novel and effective, it seems one has to be a PhD student or a PhD-trained researcher. I am neither, so I'll just have to be creative and stumble along the way. To limit the stumbling, though, I will have to learn the theory behind these algorithms. That is the not-fun part of my experiments. :)

Simple algorithms... pushed to their limits

I am interested in simple algorithms, so most of my tests will start with them. Simple algorithms are often not the strongest classifiers, so the obvious experiment is to push them to their limits. For example, a generic, simple classifier run against the MNIST handwritten digit recognition dataset might correctly classify digits roughly 95% of the time. In contrast, state-of-the-art techniques (and ensembles of different techniques), with various degrees of explicit or implicit feature engineering, hover north of 99%. It would thus be quite interesting if these simple classifiers could be nudged closer to 99%.
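
To make that ~95% figure concrete, here is roughly how such a baseline could be set up with scikit-learn. This is my own sketch of the setup, not a tuned benchmark or a reported result; the exact accuracy depends on k, the distance metric, and the train/test split.

```python
# A rough baseline sketch: plain kNN on MNIST via scikit-learn.
# Brute-force prediction over ~60,000 training images is slow, and the
# accuracy figure will vary with k and the split.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# MNIST: 70,000 handwritten digits, each a 28x28 image flattened to 784 pixels.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10_000, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))  # mid-90s, give or take
```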

The kNN algorithm and its suitability for simple blue-sky experiments

In machine learning, there is nothing simpler than nearest neighbors (NN, or kNN): the algorithm finds the nearest known neighbor (or the k nearest neighbors) of an unknown object and uses that neighbor (or those neighbors) as the basis for identifying it. For something so simple, kNN is surprisingly effective in many situations where there is enough training data. Just think about it: a simple comparison-based algorithm can correctly 'read' handwritten digits roughly 95% of the time!
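
For concreteness, the entire idea fits in a few lines. This is a brute-force sketch of my own (function name included; real libraries use smarter index structures than an exhaustive distance scan):

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Label each query point by majority vote among its k nearest
    training points (Euclidean distance). Expects NumPy arrays."""
    preds = []
    for x in X_query:
        # Squared Euclidean distance from the query to every training point.
        dists = np.sum((X_train - x) ** 2, axis=1)
        # Indices of the k closest training points.
        nearest = np.argsort(dists)[:k]
        # Majority vote among their labels.
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```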

Its advantage is that it is non-linear. While it is fairly accurate (low bias), its main disadvantage is its dependence on the available training data (high variance). It has other drawbacks too, such as heavy memory requirements and slow prediction (being a memory/instance-based learner), and a pronounced susceptibility to noise and to training-set size in high-dimensional problems (i.e., the curse of dimensionality).

But given the choice, I would prefer a low-bias, high-variance model over the opposite. Unlike linear models, there is practically nothing a kNN cannot model, assuming there is a consistent pattern to be extracted from the available features. Unlike other non-linear (kernel-based linear) models, it assumes no particular kernel. A kNN decision boundary can contort through hyperspace as much as its training data demands, and that flexibility gives it a good baseline accuracy in many situations (i.e., low bias). But with its decision boundary explicitly and directly dependent on the training data, a kNN is sensitive to the particular training set behind each model (i.e., the high-variance issue: every kNN model's decision boundary is unique and unlike that of any other kNN model trained on a different sample). So the challenge is to ease the variance problem in kNN without sacrificing its inherent low bias.
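
One quick way to see the variance problem for yourself (a toy sketch on synthetic scikit-learn data, only to illustrate the point, not part of any experiment here): fit several kNN models on different random subsets of the same data and check how often they disagree.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

# Toy illustration of variance: train 1-NN on ten random subsets of the
# same synthetic dataset and count how often the resulting models
# disagree on a fixed set of query points.
X, y = make_moons(n_samples=2000, noise=0.3, random_state=0)
X_query, _ = make_moons(n_samples=500, noise=0.3, random_state=1)

rng = np.random.default_rng(0)
preds = []
for _ in range(10):
    idx = rng.choice(len(X), size=200, replace=False)   # a small training subset
    model = KNeighborsClassifier(n_neighbors=1).fit(X[idx], y[idx])
    preds.append(model.predict(X_query))

preds = np.array(preds)
# A query point counts as a disagreement if the ten models don't all agree on it.
disagreement = np.mean(preds.min(axis=0) != preds.max(axis=0))
print(f"models disagree on {100 * disagreement:.0f}% of query points")
```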

But I digress. All this is to say that kNN is a good starting point for experimentation. The challenge then is how to deal with limited training data.

Experiment candidates

I initially intend to run experiments on the following classifiers, using standard libraries or my own code where I need to modify an algorithm: nearest neighbors (kNN), k-means clustering, tf-idf (in domains where it is not typically applied), decision trees, ensembles of decision trees (including Random Forests), linear classifiers, kernel-based linear classifiers, and ensembles of different types of classifiers. I also want to explore extensions of similarity-based algorithms, for example k-means+kNN (one reading of which is sketched below), RBF networks, and mixtures of Gaussians.
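
To give one example of what I mean by an extension, here is my own, purely hypothetical reading of the k-means+kNN combination: compress each class into a handful of k-means centroids and run kNN against the centroids rather than the full training set. Whether that actually preserves accuracy is exactly the kind of question these experiments are for.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def fit_kmeans_knn(X_train, y_train, clusters_per_class=10, k=1):
    """Hypothetical k-means+kNN hybrid: summarize each class with a few
    k-means centroids, then classify by nearest centroid(s) instead of
    searching the full training set."""
    centroids, centroid_labels = [], []
    for label in np.unique(y_train):
        km = KMeans(n_clusters=clusters_per_class, n_init=10, random_state=0)
        km.fit(X_train[y_train == label])
        centroids.append(km.cluster_centers_)
        centroid_labels.append(np.full(clusters_per_class, label))
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(np.vstack(centroids), np.concatenate(centroid_labels))
    return clf
```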

Of course, this list might change. But for now, I will avoid multi-layer neural networks, deep networks, and gradient-boosted machines (e.g., XGBoost). They are already too successful. They also require huge amounts of training data and repeated training runs to reach their high performance, which is something I would philosophically like to avoid.