Scikit-learn 2014 sprint: a report


A week ago, the 2014 edition of the scikit-learn sprint was held in Paris. This was the third time that we held an international sprint, and it was hugely productive and great fun, as always.

Great people and great venues

We had a mix of core contributors and newcomers, which is a great combination: it enables us to be productive, but also to foster the next generation of core developers. The participants were:

Laurent Direr
Michael Eickenberg
Loic Esteve
Alexandre Gramfort
Olivier Grisel
Arnaud Joly
Kyle Kastner
Manoj Kumar
Balazs Kegl
Nicolas Le Roux
Andreas Mueller
Vlad Niculae
Fabian Pedregosa
Amir Sani
Danny Sullivan
Gabriel Synnaeve
Roland Thiolliere
Gael Varoquaux

As the sprint extended through a French bank holiday and the weekend, we were hosted in a variety of venues:

La Paillasse, a Paris bio-hacker space
INRIA, the French national computer-science research institute, and the place where I work :)
Criteo, a French company doing worldwide ad-banner placement. The venue there was absolutely gorgeous, with a beautiful terrace on the rooftops of Paris. And they even had a social event with free drinks one evening.
Tinyclues, a French startup mining e-commerce data.

I must say that we were treated like kings during the whole stay, each host welcoming us as well as they could. Thank you to all of our hosts!

Achievements during the sprint

The first day of the sprint was dedicated to polishing the 0.15 release, which finally came out on the morning of the second day, after 10 months of development.

A large part of the sprint's effort was dedicated to improving the codebase, rather than directly adding new features. Some files were reorganized. The input validation code was cleaned up (opening the way for better support of pandas structures in scikit-learn). We hunted down dead code, deprecation warnings, numerical instabilities, and randomly failing tests. We made the test suite faster, and refactored our common tests that scan all the models.

Some of the work of our GSoC student, Manoj Kumar, was merged, making some linear models faster.

Our online documentation was improved, with the API documentation now pointing to examples and source code.

Still work in progress:

Faster stochastic gradient descent (with AdaGrad, ASGD, and one day SAG)
Calibration of probabilities for models that do not have a ‘predict_proba’ method
Warm restart in random forests to add more estimators to an existing ensemble.
Infomax ICA algorithm.
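
To give an idea of what warm restart means for an ensemble, here is a minimal sketch. It assumes a recent scikit-learn release, in which `RandomForestClassifier` exposes a `warm_start` parameter; the branch being worked on during the sprint may well have looked different. The toy dataset and the variable names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data, for illustration only.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Fit a first batch of 10 trees, asking the forest to keep them on refit.
forest = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
forest.fit(X, y)
first_trees = list(forest.estimators_)

# Raise the target size and fit again: only the 10 missing trees are grown,
# and the trees fitted previously are kept as they are.
forest.set_params(n_estimators=20)
forest.fit(X, y)
n_trees = len(forest.estimators_)
```

The point of the design is that growing an ensemble does not require paying again for the trees that were already fitted.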


Gaël Varoquaux

Sponsored by the Digicosm Labex