programming category // Gaël Varoquaux: computer / data / health science

Skrub 0.2.0: tabular learning made easy

We just released skrub 0.2.0. This release markedly simplifies learning on complex dataframes.

model = tabular_learner('classifier')

The highlight of the release is the tabular_learner function, which facilitates creating pipelines that readily perform machine learning on dataframes, adding preprocessing to a scikit-learn compatible learner …

03 July 2024

Promoting open-source, from inria to :probabl.

Note

Open-source efforts around scikit-learn at Inria are spinning off to a new enterprise, Probabl, in charge of sustainable development of a data-science commons.

Contents

Prelude: funding scikit-learn is hard
The birth of a new ambition
Probabl, a mission-driven enterprise
Probabl is already having an impact
My position within Probabl …

09 June 2024

People underestimate how impactful Scikit-learn continues to be

Note

François Chollet rightfully said that people often underestimate the impact of scikit-learn. I give here a few illustrations to back his claim.

A few days ago, François Chollet (the creator of Keras, the library that democratized deep learning) posted:

Indeed, scikit-learn continues to be the most popular machine …

27 November 2023

My Mayavi story: discovering open source communities

The Mayavi Python software, and my personal history: A thread on the Python and scipy ecosystems, building an open source codebase, and meeting really cool and friendly people

I am writing today as a goodbye to the project: I used to be one of the core contributors and maintainers but have been …

10 July 2022

Hiring an engineer and post-doc to simplify data science on dirty data

Note

Join us to work on reinventing data-science practices and tools to produce robust analysis with less data curation.

It is well known that data cleaning and preparation are a heavy burden to the data scientist.

Dirty data research

In the dirty data project, we have been conducting machine-learning research …

29 October 2021

Hiring someone to develop scikit-learn community and industry partners

Note

With the growth of scikit-learn and the wider PyData ecosystem, we want to recruit in the Inria scikit-learn team for a new role. Departing from our usual focus on excellence in algorithms, statistics, or code, we want to add to the team someone with some technical understanding, but an …

14 September 2021

Technical discussions are hard; a few tips

Note

This post discusses the difficulties of communicating while developing open-source projects and tries to give some simple advice.

A large software project is above all a social exercise in which technical experts try to reach good decisions together, for instance on github pull requests. But communication is difficult, in …

28 May 2020

Getting a big scientific prize for open-source software

Note

An important acknowledgement for a different view of doing science: open, collaborative, and more than a proof of concept.

A few days ago, Loïc Estève, Alexandre Gramfort, Olivier Grisel, Bertrand Thirion, and I received the “Académie des Sciences Inria prize for transfer”, for our contributions to the scikit-learn project …

01 December 2019

A foundation for scikit-learn at Inria

We have just announced that a foundation will be supporting scikit-learn at Inria [1]: scikit-learn.fondation-inria.fr

Growth and sustainability

This is an exciting turn for us, because it enables us to receive private funding. As a result, we will be able to have secure employment for some existing core …

17 September 2018

Sprint on scikit-learn, in Paris and Austin

Two weeks ago, we held a scikit-learn sprint in Austin and Paris. Here is a brief report on progress and challenges.

Several sprints

We actually held two sprints in Austin: one open sprint, at the scipy conference sprints, which was open to new contributors, and one core sprint, for more …

01 August 2018

Beyond computational reproducibility, let us aim for reusability

Note

Scientific progress calls for reproducing results. Due to limited resources, this is difficult even in computational sciences. Yet, reproducibility is only a means to an end. It is not enough by itself to enable new scientific results. Rather, new discoveries must build on reuse and modification of the state …

19 September 2017

Scikit-learn Paris sprint 2017

Two weeks ago, we held in Paris a large international sprint on scikit-learn. It was incredibly productive and fun, as always. We are still busy merging in the work, but I think that now is a good time to try to summarize the sprint.

A massive workforce

We had a …

23 June 2017

Data science instrumenting social media for advertising is responsible for today's politics

To my friends developing data science for the social media, marketing, and advertising industries,

It is time to accept that we have our share of responsibility in the outcome of the US elections and the vote on Brexit. We are not creating the society that we would like. Facebook, Twitter …

11 November 2016

Better Python compressed persistence in joblib

New persistence in joblib enables low-overhead storage of big data contained in arbitrary objects

20 May 2016

Of software and Science. Reproducible science: what, why, and how

At MLOSS 15 we brainstormed on reproducible science, discussing why we care about software in computer science. Here is a summary blending notes from the discussions with my opinion.

“Without engineering, science is not more than philosophy” — the community

How do we enable better Science? Why do we do software …

16 December 2015

Nilearn 0.2: more powerful machine learning for neuroimaging

After 6 months of effort, we just released version 0.2 of nilearn, dedicated to making machine learning in neuroimaging easier and more powerful.

This release integrates the features of the July sprint, and more.

Highlights

Better documentation …

13 December 2015

MLOSS 2015: wising up to building open-source machine learning

Note

The 2015 edition of the machine learning open source software (MLOSS) workshop was full of very mature discussions that I strive to report here.

I give links to the videos. Some machine-learning researchers have great thoughts about growing communities of coders, about code as a process and a deliverable …

28 November 2015

Nilearn sprint: hacking neuroimaging machine learning

A couple of weeks ago, we had in Paris the second international nilearn sprint, dedicated to making machine learning in neuroimaging easier and more powerful.

It was such a fantastic experience, as nilearn is really shaping up as a simple yet powerful tool, and there is a lot of enthusiasm …

04 August 2015

Software for reproducible science: let’s not have a misunderstanding

Note

tl;dr: Reproducibility is a noble cause and scientific software a promising vessel. But an excess of reproducibility can be at odds with the housekeeping required for good software engineering. Code that “just works” should not be taken for granted.

This post advocates for a progressive consolidation effort of scientific …

18 May 2015

MLOSS: machine learning open source software workshop @ ICML 2015

Note

This year again we will have an exciting workshop on the leading-edge machine-learning open-source software. This subject is central to many, because software is how we propagate, reuse, and apply progress in machine learning.

Want to present a project? The deadline for the call for papers is Apr 28th …

23 April 2015

Posts in 'programming'