Gaël Varoquaux

Wed 01 August 2018


Sprint on scikit-learn, in Paris and Austin

Two weeks ago, we held a scikit-learn sprint in Austin and Paris. Here is a brief report on progress and challenges.

Several sprints

We actually held two sprints in Austin: one open sprint, at the SciPy conference sprints, which was open to new contributors, and one core sprint, for more advanced contributors. Thank you to all who joined the SciPy conference sprint. As I wasn’t there, I cannot report on it.

Many achievements

Too many things were done to be listed here. Here is a brief overview:

  • OPTICS got merged: The OPTICS clustering algorithm is a density-based clustering, like DBSCAN, but with hyperparameters that are more flexible and easier to set. Our implementation also scales better to very large numbers of samples. The pull request was opened in 2013 and received many improvements over the years (a usage sketch follows this list).
  • Yeo-Johnson: The Yeo-Johnson transform is a simple parametric transformation of the data that can be used to make it more Gaussian. It is similar to the Box-Cox transform but can deal with negative data (PR; see the sketch after this list).
  • Novelty versus outlier detection: Novelty detection attempts to find, in new data, observations that differ from the training data. Outlier detection considers that even the training data contains aberrant observations. New modes in scikit-learn enable both usage scenarios with the same algorithms (see this issue and this PR, and the sketch after this list).
  • Missing-value indicator: a new transform that adds indicator columns marking missing data (PR; a sketch follows this list).
  • PyPy support: PyPy support was merged (PR).
  • Random forests with 100 estimators: the default of n_estimators in random forests was changed from 10, which was fast but statistically poor, to 100 (PR).
  • Changing to 5-fold: we changed the default cross-validation from 3-fold to 5-fold (PR).
  • Toward release 0.20: most of the effort of the sprint was actually spent on addressing issues for the 0.20 release: a long list of quality improvements (milestone).
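
As an illustration, here is a minimal sketch of how the new OPTICS estimator can be used; the parameter name and defaults are illustrative and may differ slightly from the merged code.

```python
# Minimal sketch: density-based clustering with OPTICS (assumed API).
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.RandomState(42).randn(300, 2)
clustering = OPTICS(min_samples=10).fit(X)
print(clustering.labels_[:10])   # -1 marks points labelled as noise
```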
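
A quick sketch of the Yeo-Johnson transform through the PowerTransformer (assuming method='yeo-johnson' as the parameter value):

```python
# Sketch: make skewed data more Gaussian; Yeo-Johnson accepts negative values,
# unlike Box-Cox.
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(200, 1)) - 1.0       # skewed data, with negative values
X_gaussian = PowerTransformer(method='yeo-johnson').fit_transform(X)
```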
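
To make the distinction between the two scenarios concrete, here is a hedged sketch using the local outlier factor; the novelty flag is my reading of the new mode, and the exact spelling may differ:

```python
# Sketch: outlier detection versus novelty detection with LocalOutlierFactor.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.randn(100, 2)
X_new = rng.randn(20, 2)

# Outlier detection: flag aberrant observations inside the data itself.
outlier_labels = LocalOutlierFactor().fit_predict(X_train)

# Novelty detection: fit on the training data, then score new observations.
lof = LocalOutlierFactor(novelty=True).fit(X_train)
novelty_labels = lof.predict(X_new)          # -1 for novelties, 1 for inliers
```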
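
And a sketch of the missing-value indicator (assuming it lives in sklearn.impute, next to the other imputation tools):

```python
# Sketch: boolean indicator columns marking where values were missing.
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])
mask = MissingIndicator().fit_transform(X)   # one column per feature with missing values
```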

Scikit-learn is hard work

Even for the almighty @amueller

Two days of intense group work on scikit-learn reminded me how much hard work it is. I thought it might be a good idea to try to illustrate why.


  • Mathematical errors: maintaining the library requires a mathematical understanding of the models. For instance, Ivan Panico fixed sparse PCA, whose transform was mathematically incorrect.
  • Numerical instabilities: sometimes, however, when models give a result different from the expected one, it is due to numerical instability. For instance, Sergül Aydöre changed the tolerance for certain variants of ridge regression.
  • Keeping examples and documentation up to date: each change requires updating all documentation and examples, and we have a lot of these. For instance, Alexandre Boucaud had to update many examples and documentation pages when changing the default cross-validation.
  • Clean deprecation path: we make sure that our changes do not break users’ code, and therefore we provide a smooth update path, with progressive deprecations. For instance, the change of the default cross-validation introduces an intermediate step where the default is kept the same and a warning announces that it will change in two releases (a sketch follows this list).
  • Consistent behavior across the library: one of the acclaimed strengths of scikit-learn is its very consistent behavior across different models. We enforce this with “common tests” that check shared properties of all estimators. For instance, Sergül implemented common tests for sample weights (see the sketch after this list).
  • Extensive testing: we test many, many things in scikit-learn: that the code snippets in the documentation are correct, that the docstring conventions are respected, and that no deprecation errors are raised, including from our dependencies. As a result, continuous integration is a core part of our development. During the sprint, we flooded our cloud-based continuous integration, and iteration really slowed down. TravisCI were kind enough to fix this by freely allocating us more computing power.
  • Supporting many versions: last but not least, one constraint that makes scikit-learn development hard is that we support many different versions of Python, of our dependencies, of linear-algebra libraries, and of operating systems. This makes development harder and continuous integration slower. But we feel that this is very valuable for a core library: narrowing the supported versions means that users are more likely to end up in unsatisfiable dependency situations, where different parts of a project require different versions of a dependency.
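
To illustrate the deprecation pattern, here is a purely hypothetical sketch (the function and parameter names are made up for illustration) of how a default can be changed without silently breaking users’ code:

```python
# Hypothetical sketch of a smooth deprecation path for a changing default.
import warnings

def cross_validate_sketch(estimator, X, y, cv='warn'):
    if cv == 'warn':
        warnings.warn("The default value of cv will change from 3 to 5 "
                      "in two releases. Pass cv explicitly to silence this "
                      "warning.", FutureWarning)
        cv = 3       # keep the old behavior for now
    ...
```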
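
As for the common tests, any estimator can be run through the shared checks; a minimal sketch, assuming the checker accepts an estimator instance:

```python
# Sketch: run an estimator through scikit-learn's common estimator checks.
from sklearn.linear_model import LogisticRegression
from sklearn.utils.estimator_checks import check_estimator

check_estimator(LogisticRegression())   # raises if a shared API contract is violated
```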

Warning: dropping support for Python 2

Supporting many versions slows development. It also prevents implementing new features: supporting Python 2 makes it harder to provide better parallelism or traceback management.

Python 3 has been out for 10 years. It is solid and comes with many improvements over Python 2. Along with many other projects, we will require Python 3 for future releases of scikit-learn (0.21 and later); scikit-learn 0.20 will be the last release to support Python 2. This will enable us to develop a better toolkit, faster.


Credits and acknowledgments

Contributors to the sprint

In Paris

  • Albert Thomas, Huawei
  • Alexandre Boucaud, Inria
  • Alexandre Gramfort, Inria
  • Eric Lebigot, CFM
  • Gaël Varoquaux, Inria
  • Ivan Panico, Deloitte
  • Jean-Baptiste Schiratti, Telecom ParisTech
  • Jérémie du Boisberranger, Inria
  • Léo Dreyfus-Schmidt, Dataiku
  • Nicolas Goix
  • Samuel Ronsin, Dataiku
  • Sebastien Treguer, Independent
  • Sergül Aydöre, Stevens Institute of Technology

In Austin

  • Andreas Mueller, Columbia
  • Guillaume Lemaître, Inria
  • Jan van Rijn, Columbia
  • Joan Massich, Inria
  • Joris Van den Bossche, Inria
  • Loïc Estève, Inria
  • Nicolas Hug, Columbia
  • Olivier Grisel, Inria
  • Roman Yurchak, independent
  • William de Vazelhes, Inria

Remote

  • Hanmin Qin, Peking University
  • Joel Nothman, University of Sydney

Sponsors

  • France Is AI paid for the travel of the French contributors to Austin
  • The NSF and the Sloan Foundation paid for the travel of the people from Columbia
  • SciPy 2018 organizers (and sponsors) hosted the first part of the sprint in Austin
  • Enthought hosted the second part of the sprint in Austin
  • Dataiku hosted us in Paris
  • TravisCI raised our number of workers for online testing
  • ParisML meetup helped us with the organization

Thank you all for the support


Also thanks to Andy Mueller and Olivier Grisel for feedback on this blog post.

[*]We aspire to treat everybody exactly the same way. However, acknowledging the fact that there is currently a lack of diversity, we are happy to do some outreach and give extra help onboarding newcomers.