Gaël Varoquaux

Thu 16 September 2010


Machine learning humour

Yes, but they overfit

If you are reading this post through a planet, the movie won't show up; just click through to understand what the hell this is about.

Some explanations…

Machine learning, geeks, and beers

Sorry for the bad humour. Over the previous weeks, my social geek life had two strong moments:

  • Pycon fr, the French Python conference, and the ensuing drinking

  • A coding sprint on the scikit

At the first event (or maybe during the related drinking) there was a lot of discussion about NoSQL databases, and I was introduced to this fantastic video making fun of MongoDB fanboys. A few days later I was hacking on the scikit, comparing estimators and discussing hype versus fact in machine learning algorithms (hint: there is no free lunch, but you may get a free brunch). Since in brain imaging people seem to be doing nothing but SVMs, over and over, while methods with more appropriate sparsity clearly perform better, I composed this stupid video.

Anything to learn about machine learning in there?

The short answer is: probably not. This video is humour, and there is little truth in it (well, RFE is indeed slow as a dog). However, not every reader of this blog is a machine learning expert, so let me explain the stakes of the pseudo-discussion.

Overfitting: when you learn a predictive model from a noisy, finite data set (for instance, trying to learn to predict whether a movie is popular from its ratings), you should be careful not to learn every detail of the data by heart. Otherwise you will learn noise that, by chance, correlates with what you are trying to predict. When you then try to generalize to new data, these features learned from noise will be detrimental to your prediction performance. For instance, the presence of Matt Damon is not the sole predictor of the quality of a movie. This is called overfitting, and the goal of regularization is to avoid it.
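To make this concrete, here is a tiny self-contained sketch in plain NumPy (a toy of my own making, not the scikit's machinery; the data and helper names are made up for illustration): a degree-9 polynomial has enough parameters to memorize 10 noisy samples of a sine wave, and a touch of ridge regularization is what keeps it from doing so.

```python
import numpy as np

rng = np.random.RandomState(0)
# 10 noisy samples of a sine wave: little data, noticeable noise
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.randn(10)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

degree = 9  # enough parameters to fit all 10 training points exactly

def fit_poly(x, y, alpha):
    # ridge-regularized polynomial least squares;
    # alpha=0 gives plain (unregularized) least squares
    A = np.vander(x, degree + 1)
    return np.linalg.solve(A.T @ A + alpha * np.eye(degree + 1), A.T @ y)

def mse(coef, x, y):
    return float(np.mean((np.vander(x, degree + 1) @ coef - y) ** 2))

coef_overfit = fit_poly(x_train, y_train, alpha=0.0)
coef_ridge = fit_poly(x_train, y_train, alpha=1e-3)

# the unregularized fit memorizes the training noise and pays for it
# on fresh data; the ridge fit generalizes better
test_err_overfit = mse(coef_overfit, x_test, y_test)
test_err_ridge = mse(coef_ridge, x_test, y_test)
```

The only thing the regularization changes is the `alpha * np.eye(...)` term, yet it is the difference between fitting the sine wave and fitting the noise.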

Both SVMs and the elastic net implement regularization, but in different ways. In the case of brain imaging, the predictive features (voxels) are very sparse, but the noise is highly structured; SVMs (which do not operate on voxels directly) are not able to select the relevant voxels directly and tend to overfit (which can be counter-balanced by univariate feature selection, as in the scikit example).
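What a sparsity-inducing penalty buys you can be shown in a few lines of plain NumPy. Below is a minimal lasso-style coordinate descent, a toy sketch and not the scikit's implementation (the data, the `alpha` value, and the helper names are mine): with only 3 out of 30 features carrying signal, the l1 penalty drives the irrelevant weights exactly to zero.

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples, n_features = 100, 30
X = rng.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:3] = [2.0, -1.5, 1.0]   # only 3 "voxels" actually matter
y = X @ true_coef + 0.1 * rng.randn(n_samples)

def soft_threshold(rho, threshold):
    # shrink towards zero, setting small values exactly to zero
    return np.sign(rho) * max(abs(rho) - threshold, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    # cyclic coordinate descent on
    #   1/2 ||y - X w||^2 + alpha * n_samples * ||w||_1
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_norms = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(n_features):
            # residual with feature j's contribution removed
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho, alpha * n_samples) / col_norms[j]
    return w

w = lasso_cd(X, y, alpha=0.1)
n_selected = int(np.sum(np.abs(w) > 1e-6))  # features left non-zero
```

An l2-regularized estimator (ridge, or a plain linear SVM) would spread small non-zero weights over all 30 features; the l1 penalty recovers the sparse support directly.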

RFE (recursive feature elimination) is slow as a dog

In [1]: from scikits.learn import datasets
In [2]: digits = datasets.load_digits()
In [3]: X = digits.data
In [4]: y = digits.target
In [5]: from scikits.learn.svm import LinearSVC
In [6]: svc = LinearSVC()
In [7]: from scikits.learn.rfe import RFE
In [8]: %timeit RFE(estimator=svc, n_features=1, percentage=0.1).fit(X, y)
1 loops, best of 3: 21.5 s per loop
In [9]: from scikits.learn.glm import ElasticNet
In [10]: %timeit ElasticNet(alpha=.1, rho=0.7).fit(X, y)
10 loops, best of 3: 26.7 ms per loop

Yeah, but it does much more than simply build a predictor: it builds a ‘heat map’ of which features help prediction (run this scikit-learn example to get an idea).
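The "heat map" is nothing mysterious: the learned coefficient vector, reshaped back onto the pixel grid, shows which features the estimator relies on. A toy NumPy sketch, with synthetic 8x8 "images" of my own invention and a closed-form ridge fit standing in for the scikit's estimators:

```python
import numpy as np

rng = np.random.RandomState(0)
# synthetic "images": 200 samples of 8x8 pixels, flattened to 64 features
n_samples, shape = 200, (8, 8)
X = rng.randn(n_samples, shape[0] * shape[1])
# only a 2x2 patch of pixels actually drives the target
true_weights = np.zeros(shape)
true_weights[2:4, 2:4] = 1.0
y = X @ true_weights.ravel() + 0.1 * rng.randn(n_samples)

# ridge regression in closed form, then reshape the learned weights
# back onto the pixel grid: that image is the "heat map"
alpha = 1.0
n_features = X.shape[1]
coef = np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)
heat_map = coef.reshape(shape)  # hot where a pixel helps prediction
```

Plotting `heat_map` (e.g. with `matplotlib.pyplot.imshow`) lights up the informative patch, which is exactly what the brain-imaging weight maps do at voxel scale.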

I am afraid that all the examples I pointed to require the development version of the scikit. Sorry, we just finished a sprint, and there will be a release soon.
