Yes, but they overfit
If you are reading this post through a planet, the video isn’t showing up; just click through to understand what the hell this is about.
Machine learning, geeks, and beers
Sorry for the bad humour. In the past few weeks, my social geek life had two strong moments:
- PyCon FR, the French Python conference, and the ensuing drinking
- A coding sprint on the scikit
At the first event (or maybe at the related drinking) there was a lot of discussion about NoSQL databases, and I was introduced to this fantastic video making fun of MongoDB fanboys. A few days later I was hacking on the scikit, comparing estimators and discussing hype versus fact in machine learning algorithms (hint: there is no free lunch, but you may get a free brunch). As people in brain imaging seem to be doing nothing but SVMs over and over, while methods with more appropriate sparsity clearly perform better, I composed this stupid video.
Anything to learn about machine learning in there?
The short answer is: probably not. This video is humour, and there is little truth in it (well, RFE is indeed slow as a dog). However, not every reader of this blog is a machine learning expert, so let me explain the stakes of the pseudo-discussion.
Overfitting: when you learn a predictive model on a noisy data set (for instance, trying to predict whether a movie will be popular from its ratings) with a finite amount of data, you should be careful not to learn every detail of the data by heart. You would learn noise that, by chance, correlates with what you are trying to predict. When you then try to generalize to new data, these features learned from noise hurt your prediction performance: the presence of Matt Damon, for instance, is not a reliable predictor of the quality of a movie. This is called overfitting. The goal of regularization is to avoid it.
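The effect is easy to reproduce on synthetic data. The sketch below uses the modern scikit-learn API (the code in this post predates it), with a lasso standing in as the sparse, regularized model; the sizes and parameter values are illustrative, not anything from the post:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.RandomState(0)

# 50 samples, 200 features, but only the first 5 features carry signal;
# the rest are pure noise (all sizes here are illustrative)
n_samples, n_features = 50, 200
X = rng.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:5] = 3.0
y = X.dot(true_coef) + 0.5 * rng.randn(n_samples)

# Fresh data from the same distribution, to measure generalization
X_new = rng.randn(1000, n_features)
y_new = X_new.dot(true_coef) + 0.5 * rng.randn(1000)

# With more features than samples, plain least squares can memorize the
# training data, noise included...
ols = LinearRegression().fit(X, y)
print("OLS train R^2:", ols.score(X, y))
print("OLS test R^2: ", ols.score(X_new, y_new))

# ...while a sparse, regularized model gives up a little training fit
# and generalizes much better
lasso = Lasso(alpha=0.2).fit(X, y)
print("Lasso test R^2:", lasso.score(X_new, y_new))
```

The unregularized fit scores a perfect R² on the training set and falls apart on new data; the regularized one holds up.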
Both the SVM and the elastic net implement regularization, but in different ways. In brain imaging, the predictive features (voxels) are very sparse, but the noise is highly structured. SVMs, which do not operate on voxels directly, are not able to select the relevant voxels and tend to overfit (this can be counter-balanced by univariate feature selection, as in the scikit example).
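To make the difference concrete, here is a sketch with the modern scikit-learn API (where the elastic net's rho parameter has since been renamed l1_ratio), counting nonzero coefficients on the digits data rather than brain images; the penalty settings mirror the ones below but are otherwise illustrative:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import ElasticNet
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# The l1 part of the elastic net penalty drives many coefficients exactly
# to zero: the model selects features (pixels here, voxels in imaging)
enet = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
n_selected = np.sum(enet.coef_ != 0)

# A plain l2-penalized linear SVM instead spreads nonzero weights over
# essentially every informative feature
svc = LinearSVC(C=1.0, dual=False).fit(X, y)
n_used = np.sum(np.any(svc.coef_ != 0, axis=0))

print(n_selected, "pixels selected by the elastic net out of", X.shape[1])
print(n_used, "pixels with nonzero SVM weight")
```

The elastic net's coefficient vector reads as a short list of useful pixels; the SVM's does not.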
RFE (recursive feature elimination) is slow as a dog
In : from scikits.learn import datasets
In : digits = datasets.load_digits()
In : X = digits.data
In : y = digits.target
In : from scikits.learn.svm import LinearSVC
In : svc = LinearSVC()
In : from scikits.learn.rfe import RFE
In : %timeit RFE(estimator=svc, n_features=1, percentage=0.1).fit(X, y)
1 loops, best of 3: 21.5 s per loop
In : from scikits.learn.glm import ElasticNet
In : %timeit ElasticNet(alpha=.1, rho=0.7).fit(X, y)
10 loops, best of 3: 26.7 ms per loop
Yeah, but it does much more than simply build a predictor: it builds a ‘heat map’ of which features help the prediction (run this scikit-learn example to get an idea).
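If you cannot run that example, here is a minimal sketch of the idea with the modern scikit-learn API, on the digits data rather than brain images: the magnitude of each elastic-net coefficient says how much the corresponding pixel contributes, and reshaping them to the image grid gives the heat map (the penalty values are the illustrative ones used above):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import ElasticNet

X, y = load_digits(return_X_y=True)

# Fit the elastic net, then map each coefficient back to its pixel:
# reshaped to the 8x8 image grid, the coefficient magnitudes form a
# 'heat map' of which features help predicting
enet = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
heat_map = np.abs(enet.coef_).reshape(8, 8)

print(heat_map.shape)  # (8, 8)
# To display it: plt.matshow(heat_map) with matplotlib
```

Pixels the model never uses (the always-blank image border, or voxels outside the informative regions in imaging) show up as exact zeros in the map.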
I am afraid that all the examples I pointed to require the development version of the scikit. Sorry, we just finished a sprint, and there will be a release soon.