<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Gaël Varoquaux</title><link href="https://gael-varoquaux.info/" rel="alternate"></link><link href="https://gael-varoquaux.info/feeds/all.atom.xml" rel="self"></link><id>https://gael-varoquaux.info/</id><updated>2026-01-14T00:00:00+01:00</updated><entry><title>Stepping up as probabl’s CSO to supercharge scikit-learn and its ecosystem</title><link href="https://gael-varoquaux.info/programming/stepping-up-as-probabls-cso-to-supercharge-scikit-learn-and-its-ecosystem.html" rel="alternate"></link><published>2026-01-14T00:00:00+01:00</published><updated>2026-01-14T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2026-01-14:/programming/stepping-up-as-probabls-cso-to-supercharge-scikit-learn-and-its-ecosystem.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../programming/attachments/probabl_team_2025.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Probabl’s get together, in falls 2025&lt;/p&gt;
&lt;/div&gt;
&lt;p class="last"&gt;I’m thrilled to announce that I’m stepping up as &lt;a class="reference external" href="https://probabl.ai/?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_probabl_awareness_post"&gt;Probabl&lt;/a&gt;’s CSO (Chief Science Officer) to supercharge
scikit-learn and its ecosystem, pursuing my dreams of tools that help go
from data to impact.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-a-central-tool"&gt;
&lt;h2&gt;Scikit-learn, a central tool&lt;/h2&gt;
&lt;p&gt;Scikit-learn is central …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../programming/attachments/probabl_team_2025.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Probabl’s get together, in falls 2025&lt;/p&gt;
&lt;/div&gt;
&lt;p class="last"&gt;I’m thrilled to announce that I’m stepping up as &lt;a class="reference external" href="https://probabl.ai/?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_probabl_awareness_post"&gt;Probabl&lt;/a&gt;’s CSO (Chief Science Officer) to supercharge
scikit-learn and its ecosystem, pursuing my dreams of tools that help go
from data to impact.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-a-central-tool"&gt;
&lt;h2&gt;Scikit-learn, a central tool&lt;/h2&gt;
&lt;p&gt;Scikit-learn is central to data scientists’ work: it is &lt;strong&gt;the most used
machine-learning package&lt;/strong&gt;. It has grown over more than a decade,
supported by volunteers’ time, donations, and grant funding, with Inria
playing a central role.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../programming/attachments/scikit-learn_clickpy_2025.png" style="width: 350px;" /&gt;
&lt;p class="caption"&gt;Scikit-learn download numbers; &lt;a class="reference external" href="https://clickpy.clickhouse.com/dashboard/scikit-learn"&gt;reproduce and explore on clickpy&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;And the usage numbers keep going up…&lt;/p&gt;
&lt;p&gt;Scikit-learn keeps growing because it enables crucial applications:
machine learning that can easily be adapted to a given application. This
type of AI does not make the headlines, but it is central to the value
brought by data science. It is used across the board to extract insights
from data and to automate business-specific processes, thus ensuring the
functioning and efficiency of a wide variety of activities.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And scikit-learn is quietly but steadily advancing. The recent releases
bring progress in all directions: computational foundations (&lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#array-api-support-enables-gpu-computations"&gt;the array
API enabling GPU support&lt;/a&gt;),
user interface (&lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#html-representation-of-estimators"&gt;rich HTML displays&lt;/a&gt;),
new models (e.g. &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html"&gt;HDBSCAN&lt;/a&gt;,
&lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#temperature-scaling-in-calibratedclassifiercv"&gt;temperature-scaling recalibration&lt;/a&gt;…), and, as always, algorithmic
improvements (release 1.8 brought &lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#efficiency-improvements-in-linear-models"&gt;marked speed-ups to linear models&lt;/a&gt; and to
&lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#decisiontreeregressor-with-criterion-absolute-error"&gt;trees with MAE&lt;/a&gt;).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="a-new-opportunity-to-boost-scikit-learn-and-its-ecosystem"&gt;
&lt;h2&gt;A new opportunity to boost scikit-learn and its ecosystem&lt;/h2&gt;
&lt;p&gt;Probabl recently raised a &lt;a class="reference external" href="https://blog.probabl.ai/probabl-raises-a-13m-in-seed-to-accelerate-enterprise-grade-ai?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_blog_awareness_post"&gt;beautiful seed round&lt;/a&gt;
from investors who really understand the value and potential of
scikit-learn. We have a unique opportunity to accelerate scikit-learn’s
development. Our analysis is that &lt;strong&gt;enterprises need dedicated tooling and
partners to build best on scikit-learn&lt;/strong&gt;, and we’re hard at work providing
this.&lt;/p&gt;
&lt;p&gt;Two thirds of Probabl’s founders are scikit-learn contributors, and we have
been investing in all aspects of scikit-learn: features, releases,
communication, documentation, and training. In addition, part of
scikit-learn’s success has always been nurturing an ecosystem, for
instance via its simple API, which has become a standard. Thus Probabl is
consolidating not only scikit-learn but also this ecosystem: the &lt;a class="reference external" href="https://skops.readthedocs.io/en/stable/"&gt;skops
project, to put scikit-learn-based models in production&lt;/a&gt;, the &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub project, which
facilitates data preparation&lt;/a&gt;, the &lt;a class="reference external" href="https://skore.probabl.ai/?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_skore_awareness_post"&gt;young skore
project, to track data science&lt;/a&gt;, &lt;a class="reference external" href="https://fairlearn.org/"&gt;fairlearn,
which helps avoid machine learning that discriminates&lt;/a&gt;, and more upstream projects, such as &lt;a class="reference external" href="https://joblib.readthedocs.io/en/stable/"&gt;joblib
for parallel computing&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="my-obsession-as-probabl-cso-serving-the-data-scientists"&gt;
&lt;h2&gt;My obsession as Probabl CSO: serving the data scientists&lt;/h2&gt;
&lt;p&gt;As CSO (Chief Science Officer) at Probabl, my role is to nourish our
development strategy with an understanding of machine learning, data
science, and open source. Making sure that &lt;strong&gt;scikit-learn and its
ecosystem are enterprise-ready&lt;/strong&gt; will bring resources for scikit-learn’s
sustainability, enabling its ecosystem to grow into a standard-setting
platform for the industry that continues &lt;strong&gt;to serve data scientists&lt;/strong&gt;.
This mission will require consolidating existing tools and patterns,
and inventing new ones.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Probabl is in a unique position for this endeavor: our core is an amazing
team of engineers with deep knowledge of data science. Working directly
with businesses gives us an acute understanding of where the ecosystem
can be improved. I also profoundly enjoy working with
people whose DNA differs from the historical DNA of scikit-learn,
with product research, marketing, and business mindsets. I believe that
the union of our different cultures will make the scikit-learn ecosystem
better.&lt;/p&gt;
&lt;p&gt;Beyond the Probabl team, we have an amazing community: a broader
group of scikit-learn contributors who do a remarkable job bringing
together what makes scikit-learn so versatile, and a deep ecosystem of
Python data tools enriched by so many different actors. I’m deeply
grateful to the many scikit-learn and pydata contributors. At Probabl, we
are very attuned to enabling the open-source contributor community. Such
a community is what enables a single tool, scikit-learn, to serve a long
tail of diverse usages.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="open source"></category><category term="growth"></category><category term="communities"></category><category term="scikit-learn"></category><category term="inria"></category><category term="probabl"></category></entry><entry><title>2025 highlights: AI research and code</title><link href="https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html" rel="alternate"></link><published>2026-01-02T00:00:00+01:00</published><updated>2026-01-02T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2026-01-02:/science/2025-highlights-ai-research-and-code.html</id><summary type="html">&lt;div class="figure align-right"&gt;
&lt;img alt="" class="small" src="attachments/2025_highlights/eiffel_tower_ai.jpg" /&gt;
&lt;p class="caption"&gt;AI is everywhere. Can you see it here?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Some highlights about my work in 2025: progress on
tabular-learning stands out, a publication on unpacking trade-off and
consequences of scale in AI, and of course progress on the open-source
data-science and machine learning stack.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2026 starts, I’m looking …&lt;/p&gt;</summary><content type="html">&lt;div class="figure align-right"&gt;
&lt;img alt="" class="small" src="attachments/2025_highlights/eiffel_tower_ai.jpg" /&gt;
&lt;p class="caption"&gt;AI is everywhere. Can you see it here?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Some highlights about my work in 2025: progress on
tabular-learning stands out, a publication on unpacking trade-off and
consequences of scale in AI, and of course progress on the open-source
data-science and machine learning stack.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2026 starts, I’m looking back on 2025. It was all about AI, with
research in the &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;soda team&lt;/a&gt; on tabular
machine learning stimulating better software.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#beyond-maths-unpacking-the-scale-narrative-in-ai" id="toc-entry-1"&gt;Beyond maths: Unpacking the scale narrative in AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabular-learning-research" id="toc-entry-2"&gt;Tabular-learning research&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-open-source-table-foundation-model" id="toc-entry-3"&gt;TabICL:  open-source table foundation model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#retrieve-merge-predict-tradeoffs-of-predictions-from-data-lakes" id="toc-entry-4"&gt;Retrieve merge predict: tradeoffs of predictions from data lakes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#growing-the-machine-learning-and-data-science-stack" id="toc-entry-5"&gt;Growing the machine learning and data science stack&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#skrub-machine-learning-with-tables" id="toc-entry-6"&gt;Skrub: machine learning with tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#fundamental-progress-in-scikit-learn" id="toc-entry-7"&gt;Fundamental progress in scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="beyond-maths-unpacking-the-scale-narrative-in-ai"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Beyond maths: Unpacking the scale narrative in AI&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Plotting the increase in the scale of notable AI systems over recent
years reveals a staggering explosion. AI systems have been growing
super-exponentially along a variety of dimensions: training compute, training cost
(figure below), inference cost, and amount of data used. Studying the wording
used in pivotal publications as well as in company communications shows that
it anchors AI success in this growth, thus &lt;strong&gt;setting implicit social
norms around scale&lt;/strong&gt;. But systematic analysis of benchmark results shows
that &lt;strong&gt;scale does not always bring benefits&lt;/strong&gt;. The narrative of scale is
simplified and leaves aside many important ingredients of the success of AI
systems. In addition, the race for scale comes with planetary and
societal consequences, which we study and &lt;a class="reference external" href="https://dl.acm.org/doi/10.1145/3715275.3732006"&gt;document&lt;/a&gt;. Ever-increasing
inference costs threaten economic and electricity sustainability. An
unstoppable appetite for training data leads to fitting models on
enormous datasets that elude quality control, engulfing undesirable
facets of the internet (including child pornography) or eroding privacy. The
race for scale has financial consequences, benefiting above all the
providers of compute, but also structuring an ecosystem where cash-rich and
GPU-rich actors have leverage on priorities, industrial or academic. These actors
sometimes have circular investment strategies: funding third parties
that will spend all this funding on compute, which can fuel &lt;strong&gt;an
investment bubble in AI&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2025_highlights/cost_ai.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Evolution of the training cost (in dollars) of notable AI systems
across the years&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;We conclude our study, &lt;a class="reference external" href="https://dl.acm.org/doi/10.1145/3715275.3732006"&gt;published at FAccT&lt;/a&gt;, by underlining that &lt;strong&gt;academic
research has a central role to play in these dynamics and must shape a
healthy and grounded narrative&lt;/strong&gt;. We recommend that researchers:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;pursue basic AI research of interest independent of scale, &lt;em&gt;e.g.&lt;/em&gt;
uncertainty quantification, causality…&lt;/li&gt;
&lt;li&gt;uphold responsible norms, in particular not asking for increased compute
when editing or reviewing,&lt;/li&gt;
&lt;li&gt;always publish measures of compute to document the tradeoffs.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2025_highlights/pareto_schema.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;We need to document and explore the tradeoffs&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In addition, I personally want to push those tradeoffs in the direction
of resource-efficient progress, not only resource-intensive progress
(as illustrated in the figure alongside),
which is the easy route to task performance, but not the one that brings
the most value.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="tabular-learning-research"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Tabular-learning research&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="tabicl-open-source-table-foundation-model"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;TabICL:  open-source table foundation model&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Recent tabular-learning models have been bringing better performance. A
poster child is the TabPFN series of models, which rely on
pretrained transformers to deliver excellent performance. However, the
quadratic complexity of transformers is a bottleneck. I do fear that
the agenda of fancy tabular learning is leading us into a race for scale
again.&lt;/p&gt;
&lt;p&gt;With the &lt;a class="reference external" href="https://icml.cc/virtual/2025/poster/46681"&gt;TabICL model&lt;/a&gt; we
strove to decrease this computational cost. We showed that a multi-stage
architecture can build a pretrained in-context predictor in which the
separation of stages decreases the quadratic cost. The model can be
pretrained on larger datasets, and is thus the best performer in
settings with larger tables. The model is faster than alternatives, in
particular when using a CPU rather than a GPU. In addition, we released
&lt;strong&gt;all the code in open source&lt;/strong&gt;, including the pretraining code.&lt;/p&gt;
&lt;p&gt;TabICL gives a table foundation model that is easy to use on modest or
big hardware and that can be easily customized.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="retrieve-merge-predict-tradeoffs-of-predictions-from-data-lakes"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Retrieve merge predict: tradeoffs of predictions from data lakes&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A full data-science pipeline must often assemble data across multiple
source tables:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Alice is working on a base table that contains information about
movies. She also has access to a data lake: a collection of other
tables on all sorts of subjects. She wants to predict the ranking of
a movie based on as much information as possible. She would like to
extract information from the data lake to improve the performance of her
model.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The challenge is that the information of interest is mixed with a
huge amount of unrelated data. Thus, Alice’s problem is: “How do I find
tables that are relevant to my problem? How do I combine them with the
base table?”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When the user is faced with a complex data lake, with many
tables and few explicit links between them, it is difficult to find the
best assembly for a given machine-learning task. This problem requires
not only finding which tables must be joined into the main table of interest
(a table-retrieval problem), but also how to aggregate multiple records
when tables are linked through a many-to-one relation. While table
retrieval is a classic problem in the data-management literature, it had
been understudied in the case of supervised machine learning. We
assembled a systematic (and open) benchmark with data lakes &lt;em&gt;and&lt;/em&gt;
supervised-learning tasks (&lt;a class="reference external" href="https://openreview.net/pdf?id=4uPJN6yfY1"&gt;publication&lt;/a&gt;, &lt;a class="reference external" href="https://soda-inria.github.io/retrieve-merge-predict/"&gt;benchmark material&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;We found that supervised learning does change the picture compared to
classic table-retrieval settings: for a fixed compute budget, it
is worth avoiding fancy retrieval methods, which can be very
computationally costly, and instead using better supervised-learning
methods, which can be comparatively less expensive while still being
able to extract the relevant information from a noisy retrieval.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2025_highlights/yadl_benchmark.png" style="width: 700px;" /&gt;
&lt;p class="caption"&gt;A schema of the pipeline&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The pipeline that we studied here is broader than the
typical machine-learning modeling step. In my experience, data-science
applications are often much more complex than mere tabular learning, and
for this reason we develop the skrub software, described below.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="growing-the-machine-learning-and-data-science-stack"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Growing the machine learning and data science stack&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="skrub-machine-learning-with-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Skrub: machine learning with tables&lt;/a&gt;&lt;/h3&gt;
&lt;a class="reference external image-reference" href="https://skrub-data.org"&gt;&lt;img alt="" class="align-right" src="attachments/skrub_logo.png" style="width: 150px;" /&gt;&lt;/a&gt;
&lt;p&gt;&lt;a class="reference external" href="https://skrub-data.org"&gt;Skrub&lt;/a&gt; is a recent library to blend machine
learning with data-frame computing. In 2025, we have ironed existing
features to make them more performant and really easy to use. For
instance the &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html"&gt;TableVectorizer&lt;/a&gt;
is incredibly useful to build tabular machine-learning pipelines. But we
have also added exciting new features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.ApplyToCols.html"&gt;ApplyToCols&lt;/a&gt; is an object that can use skrub’s powerful &lt;a class="reference external" href="https://skrub-data.org/stable/modules/multi_column_operations/selectors.html"&gt;selectors&lt;/a&gt; to apply transforms to some columns but not others. I find myself using it all the time.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://skrub-data.org/stable/data_ops.html"&gt;DataOps&lt;/a&gt; are an
incredibly powerful way of blending dataframe transformation and
scikit-learn fit/transform/predict API, to build complete machine
learning pipeline across multiple tables. The benefit is that, unlike
standard data wrangling code, they can be applied to new data,
cross-validated, or any component of the pipeline can be tuned to
maximize a prediction score. We even have added optuna support for this
tuning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="fundamental-progress-in-scikit-learn"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Fundamental progress in scikit-learn&lt;/a&gt;&lt;/h3&gt;
&lt;a class="reference external image-reference" href="https://scikit-learn.org"&gt;&lt;img alt="" class="align-right" src="attachments/scikit-learn-logo.png" style="width: 150px;" /&gt;&lt;/a&gt;
&lt;p&gt;What strikes me in the 2025 releases of &lt;a class="reference external" href="https://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; is that we have been
making progress on fundamental improvements to the core features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Faster linear models and tree-based models thanks to better algorithms
(which, in certain cases, give massive speedups).&lt;/li&gt;
&lt;li&gt;Ramping up GPU support: we are progressively adding to scikit-learn a
compute backend that enables GPU computing (an intro &lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#array-api-support-enables-gpu-computations"&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Free-threading: we now support the “free-threaded” build of Python,
which removes a central lock and opens the door to
heavily-multithreaded parallel computing. More of the ecosystem needs
to support free-threaded Python for it to be widely used, but I am
hoping that in the mid-term we’ll see great improvements to parallel
computing.&lt;/li&gt;
&lt;/ul&gt;
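&lt;p&gt;On the free-threading point, whether an interpreter is a free-threaded build can be probed at runtime; a small defensive sketch (sys._is_gil_enabled() only exists on CPython 3.13+, hence the getattr guard):&lt;/p&gt;

```python
import sys
import sysconfig

# Py_GIL_DISABLED is set in the build configuration of free-threaded
# CPython builds; it is absent (None) or 0 on standard builds.
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# sys._is_gil_enabled() (CPython 3.13+) reports whether the GIL is
# actually active at runtime; older interpreters lack the attribute.
gil_probe = getattr(sys, "_is_gil_enabled", None)
if gil_probe is None:
    print("standard build (GIL always on)")
else:
    print("free-threaded-aware interpreter; GIL enabled:", gil_probe())
```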
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Exciting times :)&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="python"></category><category term="yearly report"></category></entry><entry><title>Maïc, you lived 100 years, what changed?</title><link href="https://gael-varoquaux.info/personnal/maic-you-lived-100-years-what-changed.html" rel="alternate"></link><published>2025-10-29T00:00:00+01:00</published><updated>2025-10-29T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-10-29:/personnal/maic-you-lived-100-years-what-changed.html</id><summary type="html">&lt;p&gt;At Maïc’s 100th birthday, I asked her “you lived 100 years, what was the most important change for you?”. She mentioned “Internet”. I asked, why was the Internet important to her eyes? Because this is how she kept close contact with her loved ones, sharing travels or discussing everyday …&lt;/p&gt;</summary><content type="html">&lt;p&gt;At Maïc’s 100th birthday, I asked her “you lived 100 years, what was the most important change for you?”. She mentioned “Internet”. I asked, why was the Internet important to her eyes? Because this is how she kept close contact with her loved ones, sharing travels or discussing everyday life on her phone, her tablet…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Born in 1925, she was of a generation sometimes called the silent one. And indeed, she was often low-key. Her father was an administrator in the countryside, and she arrived in Paris in her youth. She studied maths, joining the prestigious “Ecole Normale Supérieure”, which provided her with an income and led her to become a maths teacher. After meeting and marrying &lt;a class="reference external" href="jean-dechoux-june-13rd-1923-feb-9th-2020.html"&gt;Jean Dechoux&lt;/a&gt;, she used her income to fund his medical studies. The story goes that, living in a tiny room, she had to cook on the balcony.&lt;/p&gt;
&lt;p&gt;Maïc was a teacher, one of those unsung heroes who have educated the masses. Nowadays, this is not a job title that draws much acclaim, unlike, say, “start-up founder”. But the only reason we have good computer scientists who create start-ups, the only reason we have researchers to build computer science, is that they had great teachers. Maïc was also a mother, a foster mother, a grandmother, a great-grandmother. She was kind, humble, tireless, always positive. Her life philosophy was focused on doing the best with what she got.&lt;/p&gt;
&lt;p&gt;Maïc never seemed left behind by the transformations of our world. Turning 100 years old, she was as sharp as ever, reading book after book and using her phone, her tablet, her computer. Whenever I hear how technology changes the world, I cannot help thinking of her, a 100-year-old geek. The world went through many transformations during her lifetime. But what she saw in these transformations, in Internet technology, was a way to stay in contact with others, a way to bring more humanity into our lives.&lt;/p&gt;
&lt;img alt="" class="align-right" src="../personnal/attachments/nicole_dechoux.jpg" style="width: 350px;" /&gt;
&lt;br/&gt;&lt;p&gt;&lt;em&gt;Remembering Nicole Dechoux, May 3rd 1925 – October 22nd 2025&lt;/em&gt;&lt;/p&gt;
&lt;br/&gt;
&lt;br/&gt;

&lt;style&gt;
 div.poem p {margin: 0;}
 div.poem div.line-block {clear: unset}
&lt;/style&gt;&lt;div class="poem docutils container"&gt;
&lt;p&gt;Il restera de toi ce que tu as donné&lt;/p&gt;
&lt;p&gt;Au lieu de le garder dans des coffres rouillés…&lt;/p&gt;
&lt;p&gt;Ce que tu as donné en d’autres fleurira…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Il restera de toi ce que tu as offert&lt;/p&gt;
&lt;p&gt;Entre tes bras ouverts un matin au soleil…&lt;/p&gt;
&lt;p&gt;Ce que tu as offert en d’autres revivra…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Il restera de toi un sourire épanoui&lt;/p&gt;
&lt;p&gt;Aux bords de tes lèvres comme au bord de ton cœur…&lt;/p&gt;
&lt;p&gt;Ce que tu as ouvert en d’autres grandira…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Il restera de toi ce que tu as semé&lt;/p&gt;
&lt;p&gt;Que tu as partagé aux mendiants du bonheur…&lt;/p&gt;
&lt;p&gt;Ce que tu as semé en d’autres germera…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Adapted from Simone Weil and Michel Scouarnec&lt;/em&gt;&lt;/p&gt;
</content><category term="personnal"></category><category term="family"></category><category term="people"></category></entry><entry><title>A national recognition; but science and open source are bitter victories</title><link href="https://gael-varoquaux.info/personnal/a-national-recognition-but-science-and-open-source-are-bitter-victories.html" rel="alternate"></link><published>2025-10-10T00:00:00+02:00</published><updated>2025-10-10T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-10-10:/personnal/a-national-recognition-but-science-and-open-source-are-bitter-victories.html</id><summary type="html">&lt;img alt="" class="align-right" src="../personnal/attachments/gael_speech.jpg" style="width: 400px;" /&gt;
&lt;p&gt;I have recently been awarded &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Ordre_national_du_M%C3%A9rite"&gt;France’s national order of merit&lt;/a&gt; for my career in science, in open source, and around AI.&lt;/p&gt;
&lt;p&gt;The speech that I gave carries messages that are important to me (French below; it
flows better).&lt;/p&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#speech-translated-to-english" id="toc-entry-1"&gt;Speech translated to English&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#le-texte-d-origine-en-francais" id="toc-entry-2"&gt;Le texte d’origine, en Français&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;style&gt;
.content p …&lt;/style&gt;</summary><content type="html">&lt;img alt="" class="align-right" src="../personnal/attachments/gael_speech.jpg" style="width: 400px;" /&gt;
&lt;p&gt;I have recently been awarded &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Ordre_national_du_M%C3%A9rite"&gt;France’s national order of merit&lt;/a&gt; for my career in science, in open source, and around AI.&lt;/p&gt;
&lt;p&gt;The speech that I gave carries messages that are important to me (French below; it
flows better).&lt;/p&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#speech-translated-to-english" id="toc-entry-1"&gt;Speech translated to English&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#le-texte-d-origine-en-francais" id="toc-entry-2"&gt;Le texte d’origine, en Français&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;style&gt;
.content p {
    margin: .5ex 0;
}

p.centered-symbol {
    margin: 1ex auto;
    text-align: center;
    font-size: xx-large;
    color: rgb(210, 210, 210);
}
&lt;/style&gt;&lt;div class="section" id="speech-translated-to-english"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Speech translated to English&lt;/a&gt;&lt;/h2&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Receiving such a medal is a powerful symbol. But what battles does it honor?&lt;/p&gt;
&lt;p&gt;My first battle, my first dream, was that of science, with the hope of understanding and improving the world. I probably turned to computers because they were simpler, less frightening, than society.&lt;/p&gt;
&lt;p&gt;This led me to my second battle: the dream of democratizing this science and these digital tools, thanks to open source, also in the hope of making a better world.&lt;/p&gt;
&lt;p&gt;The freedom I enjoyed, an extraordinary privilege of researchers, allowed me to devote my time to these dreams. And many people helped on this journey: my colleagues at Inria and elsewhere, because science is a team sport; free software developers from all over the world; my parents, who gave me a love of science even when I was failing at school.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And two decades later, we have won. Open source is everywhere. Statistical algorithms raise billions of dollars. But what good will this free software, these algorithms, have been if an Elon Musk can buy their vector of action and transform it into a fascist machine? This victory is bitter.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Science and open source play out within a societal context, mediated by norms and means of action. These means of action are rooted in economic rationality, and I find myself, to my great surprise, interested in commercial and financial logics.&lt;/p&gt;
&lt;p&gt;Money is power. It is the ability to build, to buy Twitter or to finance Wikipedia. For science or open source to be successful, we need economic ambitions.&lt;/p&gt;
&lt;p&gt;But I do not want to reduce the world to economic motivations. Science and free software result from the work of individuals who believe in what they are doing. With scikit-learn, as with many other open source projects, humble developers with few resources have created incredible wealth.&lt;/p&gt;
&lt;p&gt;And it is these battles that today’s medal rewards. I have always been wary of individual distinctions. Success is rarely the work of a single person. We need more collective effort and fewer heroes, less ego.&lt;/p&gt;
&lt;p&gt;And yet, I hope that this medal, this symbol, can be useful. Indeed, symbols create the collective narrative, and control the choices we make, individually or as a society. For both science and free software, the risk is to be invisible, unheard, and powerless.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Neither lines of code nor equations will be enough to make a better world. The privilege of a researcher is the independence of thought necessary for the consolidation of knowledge. The unique strength of open source software is to offer independence to the user. Beyond independence, this knowledge and these software tools are only useful if society embraces them. And for that, we must win the battle of the narrative.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Today, I have only one dream: that our children live in the best possible world. Between the global rise of fascism and climate warming, this dream faces many challenges. But we can fight for it. For this, as always, we need to gather people and unite around the right causes. And thus, I thank you all for the support and help you have given me across the years, for today’s recognition.&lt;/p&gt;
&lt;p class="centered-symbol"&gt;✶ ✶ ✶&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="le-texte-d-origine-en-francais"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Le texte d’origine, en Français&lt;/a&gt;&lt;/h2&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Recevoir un tel insigne est un symbole puissant. Mais quels combats décore-t-il?&lt;/p&gt;
&lt;p&gt;Mon premier combat, mon premier rêve a été celui de la science, avec l’espoir de comprendre et d’améliorer le monde. Je me suis probablement tourné vers les ordinateurs car ils étaient plus simples, moins effrayants, que la société.&lt;/p&gt;
&lt;p&gt;Un deuxième combat est né en moi: le rêve de démocratiser cette science et ces outils numériques, grâce au logiciel libre, toujours dans l’espoir de faire un monde meilleur.&lt;/p&gt;
&lt;p&gt;La liberté dont j’ai joui, privilège inouï des chercheurs, m’a permis de me consacrer à ces rêves. Et beaucoup m’ont aidé: mes collègues à Inria et ailleurs, car la science est un sport d’équipe; les développeurs logiciels libres partout dans le monde; mes parents, qui m’ont donné l’amour de la science même lorsque j’étais en échec scolaire.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Et deux décennies plus tard, nous avons gagné. Les logiciels libres sont partout. Les algorithmes statistiques font des levées de fonds de plusieurs milliards. Mais à quoi auront servi ces logiciels libres, ces algorithmes, si un Elon Musk peut racheter leur vecteur d’action et le transformer en machine à fascisme. Cette victoire est amère.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;La science, le logiciel libre, se réalisent dans un contexte sociétal, médié par des normes et des moyens d’actions. Ces moyens d’actions sont ancrés dans le rationnel économique, et je me trouve, à ma grande surprise, à m’intéresser à des logiques commerciales et financières.&lt;/p&gt;
&lt;p&gt;L’argent, c’est le pouvoir. C’est la capacité de réaliser, de racheter twitter ou de financer wikipedia. Pour le succès de la science ou du logiciel libre, nous avons besoin d’une ambition économique.&lt;/p&gt;
&lt;p&gt;Mais je ne voudrais réduire le monde aux motivations économiques. La science et le logiciel libre résultent du travail d’individus qui croient à ce qu’ils font. Avec scikit-learn, comme avec beaucoup d’autres logiciels libres, des développeurs humbles et avec peu de moyens ont créé une richesse incroyable.&lt;/p&gt;
&lt;p&gt;Et c’est ces combats que récompense aujourd’hui l’insigne que je reçois. Je me suis toujours méfié des distinctions individuelles. Un succès est rarement l’œuvre d’un seul. Nous avons besoin de plus de collectif et de moins de héros, de moins d’égo.&lt;/p&gt;
&lt;p&gt;Et pourtant, j’espère que cette médaille, ce symbole, peut être utile. En effet, les symboles créent le récit collectif, et contrôlent les choix que nous faisons, individuellement ou en tant que société. Science comme logiciel libre, le risque est d’être invisibles, inaudibles, et impuissants.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;La ligne de code, ou l’équation, ne suffiront à faire un meilleur monde. Le privilège du chercheur, c’est l’indépendance de pensée nécessaire à la consolidation de la connaissance. L’atout du logiciel libre, c’est d’offrir une indépendance à l’utilisateur. Au-delà de l’indépendance, cette connaissance et ces logiciels ne sont utiles que si la société s’en empare. Et pour cela, il nous faut gagner la bataille du récit.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Aujourd’hui, je n’ai plus qu’un rêve: que nos enfants vivent dans le meilleur monde possible. Entre montée mondiale du fascisme et réchauffement climatique, j’ai la détermination que ce rêve ne soit pas une chimère. Pour ce rêve, il nous faut encore réunir, rassembler, et je vous remercie tous des soutiens et des aides que vous m’avez apportés, de cet honneur que vous me faites aujourd’hui.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../personnal/attachments/gael_knight_monty_python.jpg" style="width: 300px;" /&gt;
&lt;p class="caption"&gt;Technically, I might be a knight now&lt;/p&gt;
&lt;/div&gt;
&lt;p class="centered-symbol"&gt;✶ ✶ ✶&lt;/p&gt;
&lt;/div&gt;
</content><category term="personnal"></category><category term="award"></category><category term="open source"></category><category term="science"></category></entry><entry><title>TabICL: Pretraining the best tabular learner</title><link href="https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html" rel="alternate"></link><published>2025-07-09T00:00:00+02:00</published><updated>2025-07-09T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-07-09:/science/tabicl-pretraining-the-best-tabular-learner.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;TabICL is a state-of-the-art tabular learner &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;. The key is its very rich
prior, that is baked in a pre-trained architecture -a table foundation
model-, and leveraged by in-context-learning. Thanks to clever
choices, it is fast and scalable, efficient even without a GPU.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#recent-progress-in-tabular-learning-in-context-learning" id="toc-entry-1"&gt;Recent progress …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;TabICL is a state-of-the-art tabular learner &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;. The key is its very rich
prior, that is baked in a pre-trained architecture -a table foundation
model-, and leveraged by in-context-learning. Thanks to clever
choices, it is fast and scalable, efficient even without a GPU.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#recent-progress-in-tabular-learning-in-context-learning" id="toc-entry-1"&gt;Recent progress in tabular learning: In-Context Learning&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabular-learning-as-a-completion-problem" id="toc-entry-2"&gt;Tabular learning as a completion problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sophisticated-prior-via-data-generation" id="toc-entry-3"&gt;Sophisticated prior via data generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-improved-architecture" id="toc-entry-4"&gt;TabICL: improved architecture&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-challenge-accounting-for-the-structure-of-tables" id="toc-entry-5"&gt;The challenge: accounting for the structure of tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-s-solution" id="toc-entry-6"&gt;TabICL’s solution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-result-a-powerful-and-easy-to-use-tabular-learner" id="toc-entry-7"&gt;The result: a powerful and easy to use tabular learner&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;This note is about the research behind TabICL &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;, work by Jingang Qu, David
Holzmüller, myself, and Marine Le Morvan, published at ICML 2025, and
available as &lt;a class="reference external" href="https://tabicl.readthedocs.io/en/latest/"&gt;open-source software&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="recent-progress-in-tabular-learning-in-context-learning"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Recent progress in tabular learning: In-Context Learning&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Describing the statistical structure of tables in general is very subtle.
They do have some unique statistical features. For instance, each column
is typically meaningful by itself, more meaningful than linear
combinations of columns (the data are &lt;em&gt;not rotationally invariant&lt;/em&gt;, cf
&lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;[Grinsztajn et al, 2022]&lt;/a&gt;).
For a long time, tree-based models, in particular gradient-boosted trees, were
the models that best captured this statistical structure.&lt;/p&gt;
&lt;p&gt;The central question is: &lt;strong&gt;how do we build complex and rich inductive biases
into statistical models&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;A pioneering contribution to this question was made with the TabPFN
approach &lt;a class="reference external" href="https://www.nature.com/articles/s41586-024-08328-6"&gt;[Hollmann et al, 2025]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="tabular-learning-as-a-completion-problem"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Tabular learning as a completion problem&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/table_in_context_learning.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;Prediction by table completion using across-row transformers&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The key idea behind this line of work is that tabular learning can be
seen as completing a table in which one column has a missing entry.
Transformer-based large language models are very good at completing
sequences, in particular in the few-shot regime. Hence the idea of using a
transformer architecture for this table-completion task.&lt;/p&gt;
&lt;p&gt;More specifically, this is a &lt;em&gt;meta-learning&lt;/em&gt; setting (learning to learn),
using transformers.&lt;/p&gt;
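As a rough illustration of this completion framing (a sketch of the data layout only, not the actual TabICL code), one can stack the labelled training rows with a query row whose label is masked:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(8, 3))             # 8 labelled rows, 3 columns
y_train = (X_train[:, 0] > 0).astype(float)   # a toy binary label
x_query = rng.normal(size=(1, 3))             # the row to predict

# Stack features and labels into one table; the query row's label is
# missing, and the model's job is to complete this bottom-right entry.
context = np.column_stack([X_train, y_train])
query = np.column_stack([x_query, [[np.nan]]])
table = np.vstack([context, query])

print(table.shape)  # (9, 4)
```

The in-context learner then predicts the missing entry from the rest of the table, with no gradient step at prediction time.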
&lt;/div&gt;
&lt;div class="section" id="sophisticated-prior-via-data-generation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Sophisticated prior via data generation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Teaching transformers to predict well requires showing them a great many
prediction problems.&lt;/p&gt;
&lt;p&gt;The benefit of this approach is that these prediction problems can be
chosen to reflect the downstream task well. In particular, it becomes
easy to bake in any form of inductive bias by simulating data.&lt;/p&gt;
&lt;p&gt;TabPFN simulates data by cascading series of simple transformations, each
combining very few columns. The actual data-generating processes are
more subtle, but the idea is that they produce plausible data tables.&lt;/p&gt;
&lt;p&gt;Experience (ours and others’) shows that pretraining on a quality
data-generation process is crucial to produce a good tabular learner,
much like for foundation models in other settings.&lt;/p&gt;
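A toy generator in this spirit (purely illustrative; the actual priors are considerably richer) cascades simple nonlinear transformations of previously generated columns:

```python
import numpy as np

def sample_table(n_rows=100, n_cols=5, seed=0):
    """Toy synthetic-table generator: each new column is a simple
    nonlinear function of one or two earlier columns, plus noise."""
    rng = np.random.default_rng(seed)
    cols = [rng.normal(size=n_rows)]
    ops = [np.tanh, np.sin, np.abs]
    for _ in range(n_cols - 1):
        parents = rng.choice(len(cols), size=rng.integers(1, 3))
        mix = sum(cols[p] for p in parents) + 0.1 * rng.normal(size=n_rows)
        cols.append(ops[rng.integers(len(ops))](mix))
    return np.column_stack(cols)

X = sample_table()
print(X.shape)  # (100, 5)
```

Each sampled table becomes one prediction problem in the pretraining stream; the quality of this stream is what shapes the model's inductive bias.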
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="tabicl-improved-architecture"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;TabICL: improved architecture&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="the-challenge-accounting-for-the-structure-of-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;The challenge: accounting for the structure of tables&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabpfn_architecture.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;Tables are 2D objects, and the TabPFNv2 architecture alternates
attentions across row and across columns&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In practice, a table is not a 1D structure like a sentence. It is closer
to a 2D structure, with rows and columns. A good architecture must
account for this structure, and the TabPFNv2 architecture uses
transformers with alternating across-row and across-column attention.&lt;/p&gt;
&lt;p&gt;One problem is the computational complexity: attention is quadratic in
the number of entries, and the alternating attention of TabPFNv2 leads
to a cost in &lt;em&gt;O(n p² + p n²)&lt;/em&gt; for a table with &lt;em&gt;n&lt;/em&gt; rows and &lt;em&gt;p&lt;/em&gt; columns.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="tabicl-s-solution"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;TabICL’s solution&lt;/a&gt;&lt;/h3&gt;
&lt;div class="section" id="row-wise-encoding"&gt;
&lt;h4&gt;Row-wise encoding&lt;/h4&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_architecture.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;To break the quadratic cost, TabICL first encodes the rows to a
smaller, fixed-sized, represention, before performing across-row
in-context learning.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;For more scalability and better inductive bias, our model, TabICL, first
embeds the rows (using a first transformer) and then does in-context
learning across rows (with a second transformer). The resulting
computational complexity is &lt;em&gt;O(n p² + n²)&lt;/em&gt;, which is more scalable,
though still quadratic in &lt;em&gt;n&lt;/em&gt; and &lt;em&gt;p&lt;/em&gt;.&lt;/p&gt;
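A back-of-the-envelope comparison of the two costs (constants ignored) shows why dropping the p n² term matters for tall tables:

```python
# Attention-cost scalings quoted above, up to constants (illustrative).
def cost_tabpfnv2(n, p):
    return n * p**2 + p * n**2   # alternating row/column attention

def cost_tabicl(n, p):
    return n * p**2 + n**2       # rows encoded first, then one across-row pass

n, p = 100_000, 50               # a tall table
print(cost_tabpfnv2(n, p) / cost_tabicl(n, p))  # roughly 49x cheaper
```

For wide tables (large p, small n) the two scalings are much closer; the gain is driven by the number of rows.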
&lt;p&gt;Scalability is important because it enables us to pretrain TabICL on both
small &lt;em&gt;and&lt;/em&gt; large datasets, and as a consequence TabICL is a good
predictor for large datasets.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="column-specific-embeddings"&gt;
&lt;h4&gt;Column-specific embeddings&lt;/h4&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_embeddings.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;To apply different transformations on columns depending on their
statistical properties, TabICL builds positional embeddings for
columns that capture aspects of their distribution.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Another important innovation of TabICL is that it feeds the table entries
into the transformer with column-specific embeddings. These column
embeddings are computed as a function of the distribution of the column.
For this, we use a set transformer, a scalable transformer-like way
of building a function on sets, without the quadratic complexity.&lt;/p&gt;
&lt;p&gt;After pretraining, we find that the column embeddings have learned a
mapping that implicitly captures statistical aspects of the data
distribution in the column, such as kurtosis or skewness.&lt;/p&gt;
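The embeddings themselves are learned end-to-end, but for intuition about the kind of distributional information involved, here are classic moment statistics computed per column (my own illustration, not TabICL code):

```python
import numpy as np

def column_moments(X):
    """Per-column skewness and excess kurtosis from standardized values."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    skew = (Z**3).mean(axis=0)
    kurt = (Z**4).mean(axis=0) - 3.0
    return skew, kurt

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=1000),        # symmetric column
                     rng.exponential(size=1000)])  # right-skewed column
skew, kurt = column_moments(X)
print(skew)  # the exponential column shows clearly positive skewness
```

Two columns with the same values but different distributions thus get different embeddings, letting the transformer treat them differently.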
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-result-a-powerful-and-easy-to-use-tabular-learner"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;The result: a powerful and easy to use tabular learner&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;After a lot of pretraining on synthetic data, TabICL is a
state-of-the-art tabular learner. Pretraining gave it the right inductive
bias, as visible from the classifier-comparison plot below:&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_comparison.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;A classic classification comparison plot that shows the decision
boundaries on very simple toy data. It is useful to get a feeling of
how classifiers behave.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is interesting to see that while TabICL forms very flexible decision
boundaries, they do extend along the horizontal and vertical axes, as do
the decision tree and the random forest. These axis-aligned features are a
very important aspect of the inductive bias.&lt;/p&gt;
&lt;p&gt;At the end of the day, TabICL is an excellent tabular learner, as visible
on benchmarks:&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/result_comparison.png" /&gt;
&lt;p class="caption"&gt;TabICL is a great predictor: Comparison of many predictors.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabarena.png" /&gt;
&lt;p class="caption"&gt;Experimental results, from a benchmark paper independent of the TabICL
paper: TabArena &lt;a class="reference external" href="https://arxiv.org/abs/2506.16791"&gt;[Erickson et al, 2025]&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The benefit of TabICL over TabPFNv2 becomes more marked for larger datasets:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_scale_bench.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;Rank (lower is best) as a function of dataset size.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;However, one limitation to keep in mind is that with in-context learners,
such as TabICL or TabPFN, inference (prediction on new data points) can be
costly.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;All in all, TabICL is an excellent tabular predictor, and a push forward
for tabular foundation models. From a fundamental standpoint, it shows
that in-context learning is not only for few-shot learning: it can be
very beneficial for sample sizes as large as &lt;em&gt;n = 100,000&lt;/em&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;More about TabICL&lt;/p&gt;
&lt;p&gt;There is a lot more to TabICL: the details of pretraining are crucial,
and the implementation uses memory offloading, facilitated by an
architecture that dissociates the training data from the test data for
most of the operations. To learn more about TabICL:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The paper: &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;https://arxiv.org/abs/2502.05564&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The GitHub code: &lt;strong&gt;TabICL is 100% open source&lt;/strong&gt;
&lt;a class="reference external" href="https://github.com/soda-inria/tabicl"&gt;https://github.com/soda-inria/tabicl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Install the Python package: TabICL is just one pip install away
&lt;a class="reference external" href="https://pypi.org/project/tabicl/"&gt;https://pypi.org/project/tabicl/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Other topics in table foundation models: leveraging strings&lt;/p&gt;
&lt;p&gt;TabICL is only one aspect of table foundation models. We are also pursuing
another line of research that focuses on using strings (in
entries and column names) to bring knowledge about the real world into
table foundation models; see &lt;a class="reference external" href="carte-toward-table-foundation-models.html"&gt;CARTE&lt;/a&gt; and, more recently, &lt;a class="reference external" href="https://arxiv.org/abs/2505.14415"&gt;[Kim
et al, 2025]&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="machine learning"></category><category term="tabular learning"></category><category term="foundation models"></category></entry><entry><title>AI agents that use tools</title><link href="https://gael-varoquaux.info/science/ai-agents-that-use-tools.html" rel="alternate"></link><published>2025-07-04T00:00:00+02:00</published><updated>2025-07-04T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-07-04:/science/ai-agents-that-use-tools.html</id><summary type="html">&lt;img alt="Image generated with ChatGPT, with the prompt &amp;quot;Please generate an image of an AI using a mechanical tool, such as a wrench. Please make the robot look rather friendly. Also, please make the image square&amp;quot;" class="small align-right" src="../science/attachments/robot_tool_friendly.png" /&gt;
&lt;p&gt;Modern AIs acquire new capabilities by combining tools to perform
complex tasks, controlling them as an agent. Unlike in traditional
programming, they define the sequence of actions themselves.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/des-agents-ia-qui-utilisent-des-outils-2163252"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Modern AIs are increasingly using …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with ChatGPT, with the prompt &amp;quot;Please generate an image of an AI using a mechanical tool, such as a wrench. Please make the robot look rather friendly. Also, please make the image square&amp;quot;" class="small align-right" src="../science/attachments/robot_tool_friendly.png" /&gt;
&lt;p&gt;Modern AIs acquire new capabilities by combining tools to perform
complex tasks, controlling them as an agent. Unlike in traditional
programming, they define the sequence of actions themselves.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/des-agents-ia-qui-utilisent-des-outils-2163252"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Modern AIs are increasingly using tools. For example, if you ask a
conversational AI to solve a complicated equation, the AI alone cannot do
it. This is not surprising: there is no general mathematical formula. But
if this AI knows how to use numerical equation-solving routines, it
quickly gives us the answer. For example, “Le Chat” from Mistral
generates a small program that uses the “Python” language and its
numerical routines to solve our problem. The difficulty here is to
generate the program that calls the right routines. This ability is an
extension of conversational AI models that know how to answer questions
by generating text. Here, the text is computer code and not English.&lt;/p&gt;
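For instance, here is the kind of small Python program such an AI might generate to solve an equation with no closed-form solution, x = cos(x), by simple bisection:

```python
import math

def f(x):
    # A root of f is a solution of x = cos(x)
    return x - math.cos(x)

lo, hi = 0.0, 1.0   # f(0) is negative and f(1) is positive: a root lies between
for _ in range(60):
    mid = (lo + hi) / 2
    if f(lo) * f(mid) > 0:
        lo = mid     # sign change is in the upper half
    else:
        hi = mid     # sign change is in the lower half

print(round((lo + hi) / 2, 6))  # 0.739085
```

The hard part for the AI is not the arithmetic, which the routine handles, but deciding which routine to call and how.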
&lt;p&gt;By controlling the computer, the AI “acts”. That’s why it is said to be
an “agent”. By coupling with other systems, agentic AIs develop new
capabilities. The most powerful ones can then combine different tools by
leveraging their complementarities. These agent systems are currently
progressing very quickly, but they remind us of what we have always done
in computer science: any complicated system is assembled from multiple
routines, each with a specific functionality. Writing a computer program
is precisely describing how we are going to call these routines to solve
a problem. Yet, until the recent advances in AI, we had to specify all
the steps ourselves, whereas agentic AIs take a given goal and produce
these steps on their own. The difficulty then becomes breaking down a
task into sub-tasks, which is called planning, a hard problem.&lt;/p&gt;
&lt;p&gt;In modern AIs, these planning skills are learned. The systems improve
through trial and error: we give the AI lots of tasks to solve and the AI
tries sequences of sub-tasks, deciding to use one tool or another. If it
succeeds in the final task, it learns that the sequence of tool use was a
good sequence for the task. This is called reinforcement learning; its
main inventors received the Turing Award (the Nobel Prize of computer
science) this year.&lt;/p&gt;
&lt;p&gt;Another major driver of progress for agentic AIs is the powerful
analogy-making and associative memory of language models. These language
skills enable them to start from problems specified by the user in plain
English, with an open vocabulary. They draw their tool-use strategies
from a broad knowledge of similar problems, but also know how to adapt
these strategies to the intermediate responses of the tools. They can
also interact with systems that are much more complex and indeterminate
than computer routines. For example, an AI can go and fetch information
on the internet, or even ask a human.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Agentic AIs open new perspectives. But they also greatly increase
computing costs, as they iterate over sub-tasks. These costs must be
kept in mind: they are an important hurdle to the democratization of AI.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>AIs that break down questions reason better</title><link href="https://gael-varoquaux.info/science/ais-that-break-down-questions-reason-better.html" rel="alternate"></link><published>2025-06-20T00:00:00+02:00</published><updated>2025-06-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-06-20:/science/ais-that-break-down-questions-reason-better.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an AI that is thinking deeply. Philosophical references may be welcomed, for instance like the classic hamlet holding skull cliché.&amp;quot;" class="small align-right" src="../science/attachments/ai_thinking.jpg" /&gt;
&lt;p&gt;The key to the most powerful conversational AIs is to reason by breaking
down a complex task into simpler subproblems. Why is this crucial, and
how does it work?&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-qui-decomposent-les-questions-raisonnent-mieux-2151428"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The recent release of …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an AI that is thinking deeply. Philosophical references may be welcomed, for instance like the classic hamlet holding skull cliché.&amp;quot;" class="small align-right" src="../science/attachments/ai_thinking.jpg" /&gt;
&lt;p&gt;The key to the most powerful conversational AIs is to reason by breaking
down a complex task into simpler subproblems. Why is this crucial, and
how does it work?&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-qui-decomposent-les-questions-raisonnent-mieux-2151428"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The recent release of the conversational AI “DeepSeek R1” shook the
financial markets because it showed a significant reduction in the costs
of reasoning models. But what are these reasoning models?&lt;/p&gt;
&lt;p&gt;To understand the challenges of reasoning in conversational AIs, we can
ask them to solve riddles. I tried various logical riddles on different
AIs, such as the puzzle where a man has to get a fox, a chicken, and a
sack of corn across a river without one eating the other. The AI responds
brilliantly. But how can we ensure that the AI is truly reasoning and not
just reciting answers it has seen before? By replacing them with an
equivalent trio (wolf, lamb, and hay), the AI does just as well. But
it could have solved the problem by analogy with the previous classic
one, rather than with reasoning. Indeed, language models are very good at
analogies. A conversational AI typically works by proposing an answer
inspired by the flow of words (and corresponding concepts) in the texts
on which it was trained.&lt;/p&gt;
&lt;p&gt;If, instead of a riddle resembling a story, we try to play tic-tac-toe,
the weaknesses appear. Most conversational AIs are very bad at
tic-tac-toe, even going so far as to declare victory when faced with a
defeat. Perhaps this is because analogy is not as useful. But activating
the “reasoning” option makes them unbeatable. What is behind this option?&lt;/p&gt;
&lt;p&gt;A third task helps to understand the reasoning mechanisms of a
conversational AI: let’s ask it how many “L”s there are in
“LOLLAPALOUZA”. There is a catch: ChatGPT was able to give me the correct
answer for the number of Ls in “LOLLAPALOOZA”, a question often used in the past
to show its limits. For “LOLLAPALOUZA”, it fails. Or rather, it needs
help: if we tell it to spell out the word, then count the “L”s, it gives
the correct answer. With the right intermediate steps, a problem is often
much simpler. These decompositions into subproblems are called chains of
thought in conversational AIs. The “reasoning” option of some AIs
generates such chains.&lt;/p&gt;
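&lt;p&gt;The decomposition that helps the AI is the same one a short program would use. As an illustration (plain Python, not how a language model works internally), spelling the word out into letters and then counting makes the problem trivial:&lt;/p&gt;

```python
# Illustration only: the two-step decomposition (spell out, then count),
# written as plain Python rather than as a chain of thought.
word = "LOLLAPALOUZA"

# Step 1: spell the word out, one letter at a time.
letters = list(word)

# Step 2: count the occurrences of "L" among the letters.
count = sum(1 for letter in letters if letter == "L")
```

&lt;p&gt;Each step is easy on its own; the difficulty of the original question lay in not breaking it down.&lt;/p&gt;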
&lt;p&gt;DeepSeek R1 received much attention due to its excellence at breaking
down problems to reason in this way. To do this, it has been trained to
generate reasoning patterns from questions, using reinforcement learning:
through trial and error, on many problems generated together with their
answers, like math problems. Faced with a task, the AI still proceeds by
analogy with the tasks it has seen during this learning phase, but it
uses this analogy to sketch a battle plan, rather than a response. Each
subproblem is then easier, and the AI can tackle it by analogy to
problems already seen. By observing the chains of thought, we can even
see the AI verifying its intermediate results. These chains of thought
are not always visible, but we can guess them from the AI’s response
time.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With these reasoning mechanisms, a conversational AI is as good as I am
at tic-tac-toe. But using such a model to play tic-tac-toe is like using
a sledgehammer to crush a fly: it is very inefficient in computational
cost compared to a specialized program for tic-tac-toe, which we have
known how to do for decades.&lt;/p&gt;
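&lt;p&gt;Such specialized programs are indeed decades old. A minimal sketch (assuming a 3x3 board encoded as a list of 9 cells) of the classic minimax algorithm, which plays tic-tac-toe perfectly at a tiny fraction of the cost of a language model:&lt;/p&gt;

```python
# A minimal tic-tac-toe solver using minimax -- a sketch of the kind of
# decades-old specialized program mentioned in the text.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if one side has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, move): "X" maximizes the score, "O" minimizes it."""
    w = winner(board)
    if w == "X":
        return 1, None
    if w == "O":
        return -1, None
    if " " not in board:
        return 0, None  # draw
    outcomes = []
    for i, cell in enumerate(board):
        if cell == " ":
            board[i] = player  # try the move...
            score, _ = minimax(board, "O" if player == "X" else "X")
            board[i] = " "     # ...and undo it
            outcomes.append((score, i))
    return (max if player == "X" else min)(outcomes)
```

&lt;p&gt;On a board where “X” already holds two cells of the top row, the search returns the winning third cell: the answer is exhaustively checked, not guessed by analogy.&lt;/p&gt;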
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>Science must drive the narratives that shape society</title><link href="https://gael-varoquaux.info/science/science-must-drive-the-narratives-that-shape-society.html" rel="alternate"></link><published>2025-03-01T00:00:00+01:00</published><updated>2025-03-01T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-03-01:/science/science-must-drive-the-narratives-that-shape-society.html</id><summary type="html">&lt;img alt="A picture of me giving this speech" class="small align-right" src="../science/attachments/louvain_gael_dhc.jpg" /&gt;
&lt;p&gt;I would like to take a brief moment to reflect on what drives me as an
academic.&lt;/p&gt;
&lt;p&gt;Academia’s roots are in creating knowledge and sharing it. We, academics,
have a role to play in shaping society. In computer science, we sometimes
focus on the creation of technology. Here, creation …&lt;/p&gt;</summary><content type="html">&lt;img alt="A picture of me giving this speech" class="small align-right" src="../science/attachments/louvain_gael_dhc.jpg" /&gt;
&lt;p&gt;I would like to take a brief moment to reflect on what drives me as an
academic.&lt;/p&gt;
&lt;p&gt;Academia’s roots are in creating knowledge and sharing it. We, academics,
have a role to play in shaping society. In computer science, we sometimes
focus on the creation of technology. Here, creation of open technology is
central to knowledge consolidation in computer science, because open
technology can be studied, because open technology can be shared.
But academia’s role in society is more than technology, even open technology.&lt;/p&gt;
&lt;p&gt;Academia’s position in consolidating knowledge implies that it is trusted
with responsibilities in shaping the narrative, for instance that of
technology. An important narrative today is that of artificial
intelligence, a new industrial revolution, they say. Our role here is to
do a sober assessment, inventing the future of technology, but without
false promises and blind spots. This work, as all broad scientific work,
requires working across disciplines.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;The above text is extracted from my acceptance speech when receiving
UC Louvain’s  Doctor Honoris Causa.&lt;/p&gt;
&lt;p class="last"&gt;As stated in my full speech, I am incredibly greatful for this honor. I
deeply thank all those that have been part of my scientific and
technical adventures. They were all built through team works, with
many amazing people, from all horizons, young and older, famous or
invisible. Working together is what moves mountains.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="society"></category><category term="AI"></category><category term="award"></category></entry><entry><title>AI super-intelligent to play Go, and math?</title><link href="https://gael-varoquaux.info/science/ai-super-intelligent-to-play-go-and-math.html" rel="alternate"></link><published>2025-02-19T00:00:00+01:00</published><updated>2025-02-19T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-02-19:/science/ai-super-intelligent-to-play-go-and-math.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an artificial intelligences playing go, with mathematical formula flying in the background. The mathematical formula are flying in all directions, and the image is futuristic.&amp;quot;" class="small align-right" src="../science/attachments/robots_playing_go.jpg" /&gt;
&lt;p&gt;Since 2017, an AI has been defeating the best Go experts, despite the game being particularly challenging. Such “super intelligence” is rare, but it could also emerge in fundamental mathematics.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/lia-le-go-et-les-maths-2140332"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="imitation-is-not-creation"&gt;
&lt;h2&gt;Imitation is not …&lt;/h2&gt;&lt;/div&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an artificial intelligences playing go, with mathematical formula flying in the background. The mathematical formula are flying in all directions, and the image is futuristic.&amp;quot;" class="small align-right" src="../science/attachments/robots_playing_go.jpg" /&gt;
&lt;p&gt;Since 2017, an AI has been defeating the best Go experts, despite the game being particularly challenging. Such “super intelligence” is rare, but it could also emerge in fundamental mathematics.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/lia-le-go-et-les-maths-2140332"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="imitation-is-not-creation"&gt;
&lt;h2&gt;Imitation is not creation&lt;/h2&gt;
&lt;p&gt;For several decades, calculators have been better than humans at an
intellectual task: mental arithmetic. Yet, we do not call this
“super-intelligence.” Probably because it is humans who specified all the
rules for these calculations to the machine. Similarly, a computer has a
superhuman ability to memorize information exactly, such as numbers, but
we do not consider it super-intelligent for that reason. Perhaps this is
because it does not teach us anything new. However, in 2017, an AI
started teaching the best Go players moves and strategies that no one had
ever known. How is this possible? Will AI surpass its creator and become
super-intelligent in all fields?&lt;/p&gt;
&lt;p&gt;Most recent breakthroughs in AI rely on learning methods where the
computer imitates humans. For example, to create computer-vision systems,
we provide the computer with many annotated images describing what they
represent. Likewise, conversational AIs learn by training to complete
examples of text. Under these conditions, it is difficult for AI to
surpass its creator.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="when-ais-invent"&gt;
&lt;h2&gt;When AIs invent&lt;/h2&gt;
&lt;p&gt;But AlphaZero, the AI champion in Go, operates on a different principle:
reinforcement learning. Here, the AI takes actions –moves in the game of
Go– and receives a “reward” if it wins the game. Through countless games,
it optimizes its strategies to maximize rewards, including exploring new
strategies. AlphaZero trained by playing tens of millions of games
against itself. This is how the AI was able to create new strategies,
unrestricted by human knowledge.&lt;/p&gt;
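&lt;p&gt;At a much smaller scale than AlphaZero, the core loop of reinforcement learning (try actions, observe rewards, shift toward what pays off) can be sketched in a few lines. This is a toy “bandit” example with invented rewards, not AlphaZero’s actual algorithm:&lt;/p&gt;

```python
import random

# Toy reinforcement learning: the agent does not know the rewards and
# discovers the best action purely by trial and error.
random.seed(0)
true_reward = {"a": 0.2, "b": 0.8, "c": 0.5}  # hidden from the agent
estimates = {action: 0.0 for action in true_reward}
counts = {action: 0 for action in true_reward}

for _ in range(1000):
    # epsilon-greedy: mostly exploit the current best estimate,
    # but explore a random action 10% of the time
    if 0.1 > random.random():
        action = random.choice(list(true_reward))
    else:
        action = max(estimates, key=estimates.get)
    reward = true_reward[action]  # simplified: deterministic reward
    counts[action] += 1
    # running average of the rewards observed for this action
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = max(estimates, key=estimates.get)
```

&lt;p&gt;After enough trials, the agent settles on the best action without anyone spelling out the rules: the same principle, scaled up enormously, that let AlphaZero invent new strategies.&lt;/p&gt;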
&lt;p&gt;Such learning, based on millions of trial-and-error attempts, does not
apply to all problems –it requires the ability to perform rapid
experiments, like in a computer game, which remains the only domain where
a true super-intelligence has been achieved. However, there is hope in
mathematics, another intellectual game.&lt;/p&gt;
&lt;p&gt;Indeed, progress in generative AI for language –which powers tools such as
ChatGPT– can be applied to mathematical proofs, which consist of
sequences of symbols. Trained on numerous proofs, an AI can learn to
complete partial proofs. However, such a generative AI will produce
sequences without guarantees of mathematical validity. Another tool,
using proof-verification techniques based on symbolic AI, can then keep
only the correct sequences, giving a “reward” signal. Reinforcement
learning finally comes in, using its exploration schemes to maximize this
reward and discover new valid proof steps.&lt;/p&gt;
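&lt;p&gt;The generate-then-verify loop described above can be sketched with a deliberately trivial stand-in for mathematics: checking additions rather than real proof steps. The names and numbers are illustrative, not AlphaProof’s machinery:&lt;/p&gt;

```python
import random

random.seed(0)

def generate_candidate():
    """A 'generative' step: propose a claim with no validity guarantee."""
    a, b = random.randint(0, 9), random.randint(0, 9)
    claimed_sum = random.choice([a + b, a + b + 1])  # sometimes wrong
    return a, b, claimed_sum

def verify(a, b, claimed_sum):
    """A symbolic checker: exact, and the source of the reward signal."""
    return a + b == claimed_sum

# Keep only the candidates the verifier accepts; in reinforcement
# learning, acceptance would be fed back as a reward to the generator.
accepted = [cand for cand in (generate_candidate() for _ in range(100))
            if verify(*cand)]
```

&lt;p&gt;The generator alone offers no guarantees; the verifier alone invents nothing; together they produce only valid steps to learn from.&lt;/p&gt;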
&lt;p&gt;This is how, in July 2024, the AlphaProof AI won a silver medal at the
International Mathematical Olympiad. Further progress may eventually lead
to “super-intelligence” in mathematics. However, we are still far from
general super-intelligence, as, both in Go and mathematics, progress is
made possible by the ease of verifying whether one has “won” or not.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>AI for health: the impossible necessity of unbiased data</title><link href="https://gael-varoquaux.info/science/ai-for-health-the-impossible-necessity-of-unbiased-data.html" rel="alternate"></link><published>2025-02-13T00:00:00+01:00</published><updated>2025-02-13T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-02-13:/science/ai-for-health-the-impossible-necessity-of-unbiased-data.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate a fairly abstract image of biased data. The image is about data. It should have numbers, streams of numbers. It should express the notion of bias, showing a black woman in the middle of the stream of numbers.&amp;quot;" class="small align-right" src="../science/attachments/biased_data.jpg" /&gt;
&lt;p&gt;Is unbiased data important to build health AI? Yes!&lt;/p&gt;
&lt;p&gt;Can there be unbiased data? No!&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Building health on biased data discriminates&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The notion of bias depends on the intended use.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;In medicine, we have seen the importance of tuning devices and decisions
for the target population. The problem is not …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate a fairly abstract image of biased data. The image is about data. It should have numbers, streams of numbers. It should express the notion of bias, showing a black woman in the middle of the stream of numbers.&amp;quot;" class="small align-right" src="../science/attachments/biased_data.jpg" /&gt;
&lt;p&gt;Is unbiased data important to build health AI? Yes!&lt;/p&gt;
&lt;p&gt;Can there be unbiased data? No!&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Building health on biased data discriminates&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The notion of bias depends on the intended use.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;In medicine, we have seen the importance of tuning devices and decisions
for the target population. The problem is not limited to AI: pulse
oximeters, which measure oxygen saturation, do not work well on dark
skin; cardiac procedures were adjusted to the symptoms and anatomy of men,
while those of women differ. These issues arose because the corresponding
groups were underrepresented in the clinical studies.&lt;/p&gt;
&lt;p&gt;So when we build AI, we need to make sure that they are not trained on
biased data.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Beyond population sampling, historical choices also bias&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;But unbiased data is hard, as it goes beyond sampling the right
population of individuals. Indeed, the data we have is the result of a
historical set of choices: Who do we measure? Which measurements? And
what led to their condition? Beyond health, consider for instance
salaries: we can train a model from historical data to tell us what
should be the right compensation for a given individual. But it is just
going to capture and repeat historical biases, such as paying women
less than their equally qualified male counterparts.&lt;/p&gt;
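&lt;p&gt;A toy sketch makes this concrete (the figures are made up for illustration): if historical records underpay one group, a model that simply reproduces historical patterns, here the group average, inherits the bias:&lt;/p&gt;

```python
# Synthetic, deliberately biased "historical" salary records:
# equally qualified people, but one group was historically underpaid.
records = [("men", 100), ("men", 102), ("women", 85), ("women", 87)]

# The simplest possible "model": predict the historical group average.
totals = {}
for group, salary in records:
    totals.setdefault(group, []).append(salary)
predicted = {group: sum(vals) / len(vals) for group, vals in totals.items()}
# predicted now recommends lower pay for women, repeating the bias
```

&lt;p&gt;Nothing in the fitting step is “wrong”; the bias comes entirely from the data it was given.&lt;/p&gt;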
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The notion of being unbiased embeds societal and ethical values&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here we see that the notion of being unbiased embeds societal and ethical
values: Should Olympic-level gymnasts and football players be paid the
same? How about men and women with the same job description?&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And now, to go back to medicine, there is another critical aspect: that
of cause and effect, which is central to making decisions. To
take a simple example, if we compared the health outcomes of
individuals after two days at the hospital to those of individuals who
did not go to the hospital, we would conclude, incorrectly, that a
hospital is a very dangerous place, as individuals there are in worse
shape. The problem is, of course, that we are comparing individuals who
are not comparable, as they have a different baseline health. A health
intervention is given for a reason, so it is given to a specific
population: insulin is given to diabetics. Building a model, an AI, that
can decide on health interventions requires compensating for the
difference between the treated and non-treated individuals.&lt;/p&gt;
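&lt;p&gt;A small synthetic simulation (with invented numbers) shows the trap: if sicker people are the ones admitted, the naive comparison blames the hospital even when the hospital genuinely helps:&lt;/p&gt;

```python
import random

random.seed(0)
outcomes_hospital, outcomes_home = [], []
for _ in range(10_000):
    baseline = random.gauss(0, 1)        # health before any decision
    goes_to_hospital = -0.5 > baseline   # sicker people get admitted
    true_benefit = 0.3 if goes_to_hospital else 0.0  # the hospital helps
    outcome = baseline + true_benefit
    if goes_to_hospital:
        outcomes_hospital.append(outcome)
    else:
        outcomes_home.append(outcome)

def mean(values):
    return sum(values) / len(values)

# Naive comparison: hospital patients look worse off, even though the
# simulated hospital improved every patient it treated by 0.3.
naive_gap = mean(outcomes_hospital) - mean(outcomes_home)
```

&lt;p&gt;The naive gap comes out strongly negative while the true effect is positive: the comparison measures baseline health, not the intervention.&lt;/p&gt;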
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Reference: causality&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;a class="reference external" href="https://hal.science/hal-04774700/"&gt;A 15-page introduction to causal inference with machine
learning&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;AIs can make good decisions only from adequate data&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here also we have a case of bias. The bias is with regard to the data
required to answer the question of the intervention’s effect, which
calls for comparable treated and non-treated populations. More generally,
we are seeing once again that the data are always the result of a
historical set of choices, and these choices condition the statistical
relationships in the data. And AIs build on these statistical
relationships.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The notion of bias depends on the intended use&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;What we see here is that the notion of bias depends on the intended use: it depends on the target population, but also on the target intervention. So there really is no absolute notion of unbiased data. There is just the notion of data that are well suited to a particular goal.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="../science/attachments/lady_justice_robot.png" /&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;This post was consolidated from notes of a panel on health AI at the
AI Action Summit, but it is linked to my &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;AI chronicles&lt;/a&gt;, big-picture
didactic pieces on AI and related topics.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="society"></category><category term="health"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>2024 highlights: of computer science and society</title><link href="https://gael-varoquaux.info/science/2024-highlights-of-computer-science-and-society.html" rel="alternate"></link><published>2025-01-01T00:00:00+01:00</published><updated>2025-01-01T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-01-01:/science/2024-highlights-of-computer-science-and-society.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;For me, 2024 was full of back and forth between research,
software, and connecting these to society. Here, I lay out some
highlights on AI and society, as well as research and software, around
tabular AI and language models.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2025 starts, I’m looking back on 2024. It …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;For me, 2024 was full of back and forth between research,
software, and connecting these to society. Here, I lay out some
highlights on AI and society, as well as research and software, around
tabular AI and language models.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2025 starts, I’m looking back on 2024. It was an interesting
professional year, as the research in the &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;soda team&lt;/a&gt; on machine learning for health and
social science nourished reflection on society.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#thoughts-from-the-national-ai-committee" id="toc-entry-1"&gt;Thoughts from the national AI committee&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#adventures-in-software-land" id="toc-entry-2"&gt;Adventures in software land&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-to-supercharge-scikit-learn" id="toc-entry-3"&gt;probabl to supercharge scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#skrub-machine-learning-on-tables-made-easy" id="toc-entry-4"&gt;Skrub: machine learning on tables made easy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#research-better-ai-tools-more-understanding" id="toc-entry-5"&gt;Research: better AI tools, more understanding&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#table-foundation-models" id="toc-entry-6"&gt;Table foundation models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#disparities-of-confidence-of-large-language-models" id="toc-entry-7"&gt;Disparities of confidence of large language models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-straggler-consistency-of-supervised-learning-with-missing-values" id="toc-entry-8"&gt;A straggler: Consistency of supervised learning with missing values&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="thoughts-from-the-national-ai-committee"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Thoughts from the national AI committee&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Early 2024, I was serving in the French national AI committee. Our final write up can be found
&lt;a class="reference external" href="https://www.info.gouv.fr/actualite/25-recommandations-pour-lia-en-france"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a ton of work, a very interesting experience, and I learned a lot
on many aspects of the interfaces between technology, policy, and
society. A few things that stood out for me, some partly
obvious but worth saying:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Digital services are a growing economy.&lt;/strong&gt; The share of the economy
that is digital keeps growing, whether we like it or not (IMHO, most of
us spend too much time on our phones…). For France, or Europe, there
is no question: we must produce our share of digital services and
innovation, else our economic balance suffers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Privacy is erroding.&lt;/strong&gt; Whether it is social network, information
leaking into search engines or training of large language models,
or people uploading private information to chatGPT, private information
is more and more available. History has shown us the dangers behind
loss of privacy, which the powerful (governing or economical elites)
typically leverage to assert more power. Europe has had a long stance
of trying to mitigate this loss of privacy via regulation (GDPR). But
regulating services that we don’t control is hard, and it ends up being
a geo-political and economical battle.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Big AI is huge.&lt;/strong&gt; The size of investments in AI is huge (dozens of
billions yearly, comparable to a sizeable fraction of the state
expenditures of a rich country like Switzerland). Data centers are
having significant impacts on the electric grid of modern countries,
running in competition with other usage. The cost of large models have
ballooned (training a large language model is in the hundreds of
millions of cost, which is comparable to a sizeable fraction of the
budget of the national research institute that I work in (&lt;a class="reference external" href="https://inria.fr/fr"&gt;inria&lt;/a&gt;). Training costs are just the visible part
of the iceberg, operational costs are huge and are everywhere.&lt;/p&gt;
&lt;p&gt;Not all in tech are worried about rising costs. Indeed, they go hand in
hand with more money in tech, making us, tech bros, richer, as long as
investments keep pouring in. But &lt;a class="reference external" href="https://www.goldmansachs.com/images/migrated/insights/pages/gs-research/gen-ai--too-much-spend%2C-too-little-benefit-/TOM_AI%202.0_ForRedaction.pdf"&gt;bubble dynamics are at play&lt;/a&gt;,
and explain part of the conversation around AI.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Concentration of power.&lt;/strong&gt; Many factors in today’s AI lead to
concentration into the hands of large actors. Training and operation
costs, of course. But also limited access to the correspond skills,
platform effect on the data and the users. The most striking bottleneck
is the compute hardware. Only one company makes the chips that we all
need. Few actors can afford buying them; and as a result most of the
world lives from renting out to big landlords.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;AI neither good nor bad, but what we do of it.&lt;/strong&gt; The above may
paint a gloomy picture. But this is not how I see it. AI does have a
lot of potential for good, as all general purpose technology. It all
depends how society uses it. And here the future is open: we, as actors
of democratic societies, as innovators, in tech but in every aspects of
society, we can determine what the future of AI is. I look forward to
technology that empowers each and everybody, to act for their own
benefit. Key to this future is enabling and bringing in every stakeholder.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="adventures-in-software-land"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Adventures in software land&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With the growing importance of data and artificial intelligence in
shaping society, I believe more than ever in the importance of open
source and commons for data science, making tools accessible to as many
as possible.&lt;/p&gt;
&lt;div class="section" id="probabl-to-supercharge-scikit-learn"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;probabl to supercharge scikit-learn&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Early 2024, Inria spun off the scikit-learn development to a new structure, &lt;a class="reference external" href="https://probabl.ai"&gt;probabl&lt;/a&gt;, to supercharge the development of the broader
ecosystem. I detailed the motivation and the goals in &lt;a class="reference external" href="../programming/promoting-open-source-from-inria-to-probabl.html"&gt;a previous article&lt;/a&gt;. In a
nutshell:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Scikit-learn is &lt;a class="reference external" href="programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html"&gt;a key component of the machine-learning
ecosystem&lt;/a&gt;,
but its development require funding.&lt;/li&gt;
&lt;li&gt;Probabl is there to foster a broader open data-science ecosystem, as
scikit-learn can be sustainable only when used in such ecosystem.
Probabl focus on delivering value to enterprises, and thus makes sure
that there is a seamless solution to their needs.&lt;/li&gt;
&lt;li&gt;I have 10% of my time allocated from Inria to Probabl.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some of our successes are already publicly visible:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The open-source team at probabl is maintaining and improving &lt;a class="reference external" href="https://probabl.ai/open-source"&gt;a range
of software libraries&lt;/a&gt;: scikit-learn,
joblib, imbalanced-learn, fairlearn, skops, skrub… Our priorities are
openly discussed &lt;a class="reference external" href="https://papers.probabl.ai/open-source-priorities-chapter-2"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We have launched &lt;a class="reference external" href="https://papers.probabl.ai/official-scikit-learn-certification-launch"&gt;an official certification program for scikit-learn&lt;/a&gt;. I’m very excited about these certifications (there are three levels), to grow recognition in the scikit-learn skills, and thus make sure that it is a dependable stack for the industry.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="skrub-machine-learning-on-tables-made-easy"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Skrub: machine learning on tables made easy&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://skrub-data.org/"&gt;skrub&lt;/a&gt; is a software project that I am very
excited about. Many crucial applications of machine learning are on
tables. Skrub facilitates the corresponding patterns. We are designing it
with the insights of years of research and practice on the topic. It does
not always look impressive, but it’s the little things that add up to
productivity.&lt;/p&gt;
&lt;p&gt;A typical dataset is the employee-salaries one:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub.datasets import fetch_employee_salaries
&amp;gt;&amp;gt;&amp;gt; dataset = fetch_employee_salaries()
&amp;gt;&amp;gt;&amp;gt; employees_df, y = dataset.X, dataset.y
&lt;/pre&gt;
&lt;p&gt;Skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableReport.html"&gt;TableReport&lt;/a&gt; makes it really easy to interactively visualize and
explore such a table:&lt;/p&gt;
&lt;img alt="" src="attachments/2024_highlights/table_report_vscode.png" style="width: 700px;" /&gt;
&lt;p&gt;The dataframe &lt;cite&gt;employees_df&lt;/cite&gt; has plenty of non-numerical columns, as visible above.
Skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html"&gt;TableVectorizer&lt;/a&gt; turns it into a numerical array suitable for
machine learning, taking care of dates, categories, strings…&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub import TableVectorizer
&amp;gt;&amp;gt;&amp;gt; X = TableVectorizer().fit_transform(employees_df)
&lt;/pre&gt;
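Under the hood, the idea is to route each column to an encoder suited to its type. A rough scikit-learn-only sketch of this principle (an illustration, not skrub’s actual implementation; the column names and data are made up):

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A tiny made-up table mixing a string column and a numeric column
df = pd.DataFrame({
    "department": ["POL", "FRS", "POL"],
    "hire_year": [2001, 2010, 1998],
})

# Route columns by dtype: one-hot encoding for strings, scaling for numbers
vectorizer = make_column_transformer(
    (OneHotEncoder(), make_column_selector(dtype_include=object)),
    (StandardScaler(), make_column_selector(dtype_include="number")),
)
X = vectorizer.fit_transform(df)
print(X.shape)  # (3, 3): two one-hot columns plus one scaled numeric column
```

skrub’s TableVectorizer goes much further (dates, high-cardinality categories, dirty strings), but the routing idea is the same.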
&lt;p&gt;If you want to use deep-learning language models for the string
categories, skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TextEncoder.html"&gt;TextEncoder&lt;/a&gt;
can download pre-trained models from Hugging Face:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub import TextEncoder
&amp;gt;&amp;gt;&amp;gt; text_encoder = TextEncoder(
        &amp;quot;sentence-transformers/paraphrase-albert-small-v2&amp;quot;,
        device=&amp;quot;cpu&amp;quot;,
    )
&amp;gt;&amp;gt;&amp;gt; tab_vec = TableVectorizer(high_cardinality=text_encoder)
&amp;gt;&amp;gt;&amp;gt; X = tab_vec.fit_transform(employees_df)
&lt;/pre&gt;
&lt;p&gt;With this, the latest artificial-intelligence developments are easily
brought to bear on the data that matters.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="research-better-ai-tools-more-understanding"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Research: better AI tools, more understanding&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Software, and thoughts on AI and society, are best built on a solid
understanding of AI, which calls for research.&lt;/p&gt;
&lt;div class="section" id="table-foundation-models"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Table foundation models&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Modeling data semantics enable pretaining for tables&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I have been working on machine learning for tables for more than a
decade. These data are crucial for many applications, but they have so
far not witnessed the breakthroughs of deep learning seen &lt;em&gt;eg&lt;/em&gt; in vision
or text. Much of this success of &lt;strong&gt;deep learning has been driven by the
ability to reuse pretrained models&lt;/strong&gt;, fitted on very large datasets.
Foundation models pushed this idea very far with models that provide
background information useful for a wide variety of downstream tasks. But
pretraining is challenging for tables.&lt;/p&gt;
&lt;p&gt;A crucial part of foundation models for text and images is the attention
mechanism, stacked in a transformer architecture, which brings associative
memory to the inputs by contextualizing them. We had a breakthrough with
the &lt;a class="reference external" href="https://openreview.net/forum?id=9kArQnKLDp"&gt;CARTE model&lt;/a&gt;: we
managed to adapt these ideas to tables. The strings (table
entries and column names) give the information that enables transfer from
one table to another: data semantics. Here, the key is to have an
architecture that 1) models both strings and numerical values, and 2) applies
to any set of tables while using the column names to route the
information. For this purpose, CARTE uses a new dedicated attention
mechanism that accounts for column names. It is pre-trained on a very
large knowledge base. As a result, it outperforms the best models
(including tree-based models) in small-sample settings (up to n=2000).&lt;/p&gt;
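To give an intuition of the mechanism, here is a toy numpy illustration (not CARTE’s actual architecture or code): building attention keys and queries from both the cell values and the column-name embeddings makes the routing of information depend on what a column means, not on its position.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cols, d = 4, 8                          # a row with 4 columns, embeddings of dim 8
values = rng.normal(size=(n_cols, d))     # embeddings of the cell entries
col_names = rng.normal(size=(n_cols, d))  # embeddings of the column names

# Keys and queries combine each entry with its column name, so attention
# scores are conditioned on the column semantics
keys = values + col_names
queries = values + col_names
scores = queries @ keys.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each entry, contextualized by the rest of the row
contextualized = weights @ values
print(contextualized.shape)  # (4, 8)
```

Because column names enter the computation as embeddings of strings, the same weights can be applied to a table never seen during pretraining.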
&lt;p&gt;The pretrained CARTE model is available for download as &lt;a class="reference external" href="https://pypi.org/project/carte-ai"&gt;a Python package&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This result is very significant as it opens the door to &lt;strong&gt;foundation models
for tables&lt;/strong&gt;: models that embark much background knowledge and can be
specialized to many tabular-learning tasks.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://openreview.net/forum?id=9kArQnKLDp"&gt;&lt;img alt="" src="attachments/2024_highlights/carte_comparisons.png" style="width: 100%;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Extensive empirical results show that CARTE brings benefits to very
broad set of baselines. The relative performance of baselines also
contains interesting results.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;See also&lt;/p&gt;
&lt;p&gt;I wrote a longer &lt;a class="reference external" href="./carte-toward-table-foundation-models.html"&gt;high-level post on CARTE&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="disparities-of-confidence-of-large-language-models"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Disparities of confidence of large language models&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://hal.science/hal-04750567"&gt;&lt;img alt="" src="attachments/2024_highlights/hallucination_probability.png" style="width: 400px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;A good confidence assessment on replies of an LLM would separate out
correct from incorrect statements: Einstein was not born on Jan 14th
1879 (close call, it was March 14th); his PhD was in Zurich.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Large language models (LLMs), such as ChatGPT, may produce answers that
are plausible but not factually correct, the so-called “hallucinations”.
A variety of approaches try to assess how likely a statement is to be true,
for instance by sampling multiple responses from the language model.
Ideally, we would like to use these confidence assessments to flag the
wrong statements in an LLM’s answer. For this, a challenge is to
threshold them, or assign a probability of correctness.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://hal.science/hal-04750567"&gt;&lt;img alt="" src="attachments/2024_highlights/llm_confidence_nationality.png" style="width: 400px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Observed error rate and a function predicted probability of
correctness For the birth date, when a large language model (here Mistral
7B) gives information on a given notable individual. The different
curves give the corresponding calibration for different nationalities of
the individuals, revealing that &lt;strong&gt;the probability is much more trustworthy
for a citizen of the United States than for other countries&lt;/strong&gt;, and
particularly poor for people that originate from South-East Asia.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://hal.science/hal-04750567/"&gt;Chen et al&lt;/a&gt;, we investigate the
confidence of LLMs in their answers. We show that the
probabilities computed are not only overconfident, but also that there is
heterogeneity (grouping loss): on some groups of queries the
overconfidence is more pronounced than on others. For instance, for an
answer on a notable individual, the LLMs’ confidence is reasonably
calibrated if the individual is from the United States, but severely
overconfident for individuals from South-East Asia
(see figure). Characterizing the groups in question
opens the door to correcting the corresponding bias, a “reconfidencing”
procedure.&lt;/p&gt;
&lt;p&gt;This study is an application of our earlier, more theoretical, &lt;a class="reference external" href="https://openreview.net/forum?id=6w1k-IixnL8"&gt;work&lt;/a&gt; that contributed the
first estimator of the grouping loss, a mathematically solid concept capturing
hidden heterogeneity in classifier calibration. I am very happy to see
that these fairly abstract ideas are useful to probe very concrete
problems such as the disparity in LLM confidence across nationalities.&lt;/p&gt;
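The grouping-loss idea can be illustrated with a toy computation (hypothetical numbers, not the paper’s data): the same stated confidence can match observed accuracy on one group of queries while being far too high on another.

```python
import numpy as np

# Hypothetical data: the LLM states 90% confidence on every answer,
# but correctness differs across nationality groups of the individuals
confidence = np.array([0.9] * 14)
correct = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1,   # "US" queries: 9/10 correct
                    1, 0, 0, 1])                    # "SEA" queries: 2/4 correct
group = np.array(["US"] * 10 + ["SEA"] * 4)

# Gap between stated confidence and observed accuracy, per group
gaps = {}
for g in ("US", "SEA"):
    mask = group == g
    gaps[g] = confidence[mask].mean() - correct[mask].mean()
print(gaps)  # near 0 for "US" (calibrated), large positive for "SEA" (overconfident)
```

A global calibration measure would average these gaps away; looking per group is what reveals the hidden heterogeneity.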
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-straggler-consistency-of-supervised-learning-with-missing-values"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;A straggler: Consistency of supervised learning with missing values&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;A&lt;/em&gt; &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;paper&lt;/a&gt;
&lt;em&gt;on the fundamentals of machine-learning with missing values&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In 2018, &lt;a class="reference external" href="https://juliejosse.com"&gt;Julie Josse&lt;/a&gt;, &lt;a class="reference external" href="https://erwanscornet.github.io"&gt;Erwan Scornet&lt;/a&gt;, and myself started working on the
theory of how supervised learning works with missing values (learning
theory). Working with an intern, Nicolas Prost, we quickly realized that there
was a gap between the statistical thinking around missing values, which
was focused on enabling inference in parametric models as if there were
no missing values, and the needs of prediction with missing values.&lt;/p&gt;
&lt;p&gt;We wrote &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;a paper&lt;/a&gt; to
lay out the theory cleanly, summarizing both elements of learning theory
and the fundamentals of statistics with missing values. Beyond these
didactic aspects, the paper gives a series of formal results, such as the
need for multiple imputations to be able to use the &lt;em&gt;complete case&lt;/em&gt;
predictor (the optimal predictor without missing values), the optimal way
to model missing values in trees (which was already used in XGBoost :) ),
and the fact that, asymptotically, constant imputation of missing values
can work well for prediction.&lt;/p&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Frustrations of the academic game&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://hal.science/hal-02024202"&gt;The preprint&lt;/a&gt; got a lot of success
(more than a hundred citations), probably because it laid out
fundamentals. But it took 5 years to publish it. The machine learning
community did not like the absence of new methods (we only gave
theoretical results on existing practice, such as imputation). The
statistics literature really did not like our messages that imputation
was not always important. In one journal, a reviewer rejected the paper on
the basis that it was giving bad messages to the community, but not
arguing that anything was wrong in our proofs or our experiments. Of
course, there is a lot to say about the difficulties of doing data
analysis with missing values, but the conversation did not go in these
details. This is a good illustration that &lt;strong&gt;progress in science is
social&lt;/strong&gt;, and is as much about shifting norms than accumulating knowledge
(actually, knowledge is social too, as put forward by &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Social_epistemology"&gt;social
epistemology&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As time went by, my colleague &lt;a class="reference external" href="https://marinelm.github.io"&gt;Marine Le Morvan&lt;/a&gt; has published &lt;a class="reference external" href="https://proceedings.mlr.press/v108/morvan20a.html"&gt;more&lt;/a&gt; &lt;a class="reference external" href="https://proceedings.neurips.cc/paper/2021/hash/5fe8fdc79ce292c39c5f209d734b7206-Abstract.html"&gt;and&lt;/a&gt;
&lt;a class="reference external" href="https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac013/6568998"&gt;more&lt;/a&gt;
&lt;a class="reference external" href="https://arxiv.org/abs/2407.19804"&gt;results&lt;/a&gt; that push deeper
understanding of prediction with missing values. But I still see value in
our original paper, as it lays the foundations.&lt;/p&gt;
&lt;p&gt;The paper is now out, thanks to my coauthors who kept replying to
reviewers, improving the manuscript, and resubmitting. Read &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;it&lt;/a&gt;; I think
it is a good read.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Well, this article ended up longer than I had expected. Thanks for
reading. Taking a step back to figure out what is important is always a
good exercise for me.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>When AIs must overcome the data</title><link href="https://gael-varoquaux.info/science/when-ais-must-overcome-the-data.html" rel="alternate"></link><published>2024-12-22T00:00:00+01:00</published><updated>2024-12-22T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-12-22:/science/when-ais-must-overcome-the-data.html</id><summary type="html">&lt;p&gt;Improving conversational artificial intelligences or simpler prediction engines involves overcoming biases, that is, going beyond the limits of data. But the notion of bias is subtle, as it depends on the goals.&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot arm wrestling a figure made of numbers. This figure does not look like a robot, but more like a human, however it is made of numbers.&amp;quot;" class="small align-right" src="../science/attachments/robot_wresting_numbers.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/quand-lia-doit-depasser-les-donnees-2126369"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Improving conversational artificial intelligences or simpler prediction engines involves overcoming biases, that is, going beyond the limits of data. But the notion of bias is subtle, as it depends on the goals.&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot arm wrestling a figure made of numbers. This figure does not look like a robot, but more like a human, however it is made of numbers.&amp;quot;" class="small align-right" src="../science/attachments/robot_wresting_numbers.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/quand-lia-doit-depasser-les-donnees-2126369"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In 2023, Microsoft’s conversational AI insulted users.
Salary-recommendation engines ignore women’s degrees to underpay them. At
the start of the Covid-19 pandemic, predictions of hospital stays
consistently underestimated the duration. These three issues all stem
from the same failure: predictive engines, artificial intelligences, that
have learned from biases. The rude conversational AI replicated its
training texts, some of which came from internet forums where politeness
is sometimes overlooked. The medical AI only considered finished
hospitalizations, and, as the epidemic had just begun, only patients
with mild forms had already been discharged, while the more seriously ill
remained hospitalized.&lt;/p&gt;
&lt;p&gt;To obtain an AI that doesn’t spout nonsense, the biases must be
“corrected.” The problem of too-short observation windows is a classic
issue in medical statistics: more importance must be placed on the few
individuals who have been sick for a long time. A similar solution is
used to improve conversational AIs: weighting the training text sources
based on the deviation from the desired behavior.&lt;/p&gt;
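Such reweighting can be illustrated on a toy example (hypothetical numbers): completed hospital stays over-represent short stays, so up-weighting the rarely observed long stays corrects the naive average.

```python
import numpy as np

# Completed hospital stays observed early in an epidemic, in days:
# short stays are fully observed, long stays mostly still ongoing
durations = np.array([3, 4, 5, 20, 25])

# Suppose only 1 in 4 long stays has finished, versus all short ones:
# weight each completed long stay by 4 to stand in for the unseen ones
weights = np.array([1.0, 1.0, 1.0, 4.0, 4.0])

naive = durations.mean()
corrected = np.average(durations, weights=weights)
print(naive, corrected)  # the corrected estimate is larger than the naive one
```

The same logic underlies weighting training text sources for conversational AIs: samples from the under-represented, desired behavior get more weight.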
&lt;div class="section" id="aligning-on-which-values"&gt;
&lt;h2&gt;Aligning on which values?&lt;/h2&gt;
&lt;p&gt;The problem of bias is universal in statistics. And modern AIs are
statistical because they learn from data. The notion of bias is very
relative. It should be understood as a gap between the available data and
the desired behavior. Therefore, &lt;strong&gt;there is no such thing as unbiased data,
or a universal bias correction&lt;/strong&gt;. Much of the effort to improve AIs focuses
on reducing this gap between training and the desired behavior.&lt;/p&gt;
&lt;p&gt;For example, when training AIs for autonomous vehicles, one difficulty is
that the data contains very few traffic accidents. Simulators are
sometimes used to fill this gap. They are inherently less rich than
reality and are mixed with real-world driving. There is a well-controlled
gap between the resulting mixture and typical driving; this gap is there
to put emphasis on safety requirements in unfavorable scenarios. This is
another form of data correction.&lt;/p&gt;
&lt;p&gt;Just as the notion of data bias depends on how well the data match a
targeted use, an AI does not produce absolute or objective truth. Without
corrections, it simply replicates its behavior based on what it has
observed. And when corrections are made, the whole question is how to
correct it. For powerful AIs, we then talk about “alignment” towards
goals and values. As AI incorporates the values of its designers, one
might wonder whether the same AI can be socially acceptable in all
cultures.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>Do AIs reason or recite?</title><link href="https://gael-varoquaux.info/science/do-ais-reason-or-recite.html" rel="alternate"></link><published>2024-10-19T00:00:00+02:00</published><updated>2024-10-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-10-19:/science/do-ais-reason-or-recite.html</id><summary type="html">&lt;p&gt;Despite their apparent intelligence, conversational artificial intelligences often lack logic. The debate rages on: do they reason or do they recite snatches of text memorized on the Internet?&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot with a stream of numbers coming out of his mouth. The robot is on the left, facing right, and the numbers flow, as if they were sound.&amp;quot;" class="small align-right" src="../science/attachments/robot_numbers_flow_mouth.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-raisonnent-elles-ou-recitent-elles-2103079"&gt;Les Echos&lt;/a&gt;. I updated it with new …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Despite their apparent intelligence, conversational artificial intelligences often lack logic. The debate rages on: do they reason or do they recite snatches of text memorized on the Internet?&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot with a stream of numbers coming out of his mouth. The robot is on the left, facing right, and the numbers flow, as if they were sound.&amp;quot;" class="small align-right" src="../science/attachments/robot_numbers_flow_mouth.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-raisonnent-elles-ou-recitent-elles-2103079"&gt;Les Echos&lt;/a&gt;. I updated it with new references.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Conversational AIs, or large language models, are sometimes seen as the
gateway to general artificial intelligence. ChatGPT, for example, can
answer questions asked at the International Mathematical Olympiad. And
yet, on other, seemingly much simpler questions, ChatGPT makes surprising
mistakes. What aspects of conversational AI intelligence explain its
ability to solve some problems and not others?&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://arxiv.org/abs/2309.13638"&gt;Thomas McCoy and co-authors&lt;/a&gt;
conjecture that it has to do with their underlying model of
autoregression: technically, these AIs are trained to complete texts
found on the Internet. If an AI is very good at calculating (9/5) x + 32,
but not (7/5) x + 31, it is because the first formula corresponds to the
conversion of degrees Celsius to Fahrenheit, a very frequent conversion
on the Internet, while the second does not correspond to any particular
formula. Conversational AIs would therefore be good at reproducing what
they’ve already seen. Indeed, numerous studies have shown that they have
a certain tendency to reproduce snippets of known text. So, if an AI can
solve problems from the International Mathematical Olympiad, is it simply
because it has memorized the answer?&lt;/p&gt;
&lt;div class="section" id="something-new"&gt;
&lt;h2&gt;Something new?&lt;/h2&gt;
&lt;p&gt;In terms of intelligence, inventing a new mathematical demonstration
requires mastering abstractions and the ability to string together
complicated logical reasoning with an imposed start and finish. This
seems much more difficult than memorizing a demonstration. This is one of
the traditional oppositions in machine learning, the line of research
that gave rise to today’s AIs: memorizing is one thing, knowing how to
generalize is another. For example, if I memorize all the additions
between two numbers smaller than ten, I cannot extrapolate beyond that. To
go further, I need to master the logic of addition… or memorize more.&lt;/p&gt;
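The addition example can be made concrete (a toy illustration): a memorized table of sums of numbers below ten answers nothing outside what it has seen, while the rule itself extrapolates.

```python
# Memorization: a lookup table of all sums of numbers below ten
memorized = {(a, b): a + b for a in range(10) for b in range(10)}

def recite(a, b):
    return memorized.get((a, b))  # None when outside what was memorized

def reason(a, b):
    return a + b                  # the rule itself generalizes to any inputs

print(recite(3, 4), recite(12, 5), reason(12, 5))  # 7 None 17
```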
&lt;p&gt;And precisely, conversational AIs have an enormous capacity for
memorization, and have been trained on almost the entire Internet. Given
a question, they can often dip into their memory to find answers. So, are
they intelligent, or do they just have a great memory? Scientists are still
debating the importance of memory to their abilities. Some argue that
their storage capacity is ultimately limited by the size of the Internet.
Others wonder to what extent the impressive successes highlighted are not
on tasks already solved on the Internet, questioning their ability to do
anything new.&lt;/p&gt;
&lt;p&gt;But could memorization be an aspect of intelligence? In 1987, Lenat and
Feigenbaum conjectured that, for a cognitive agent, accumulating
knowledge enables it to solve new tasks with less learning. Perhaps the
intelligence of conversational AI lies in knowing how to pick up the
right bits of information, and combine them.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Related academic work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://www.pnas.org/doi/10.1073/pnas.2322420121"&gt;Embers of autoregression show how large language models are shaped
by the problem they are trained to solve&lt;/a&gt;, R. Thomas McCoy,
Shunyu Yao, Dan Friedman, Mathew D. Hardy, and Thomas L. Griffiths,
PNAS 2024 (&lt;a class="reference external" href="https://arxiv.org/abs/2309.13638"&gt;ArXiv&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Princeton researchers show that properties of large language models
(LLMs) are governed by the data that they are trained on, including
their arithmetic abilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://arxiv.org/abs/2410.05229"&gt;GSM-Symbolic: Understanding the Limitations of Mathematical
Reasoning in Large Language Models&lt;/a&gt;, Iman Mirzadeh, Keivan Alizadeh
Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar&lt;/p&gt;
&lt;p&gt;Apple researchers show that LLMs solve mathematical challenge via
probabilistic &lt;strong&gt;pattern matching&lt;/strong&gt; on previously seen examples, rather
than logical reasonning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>CARTE: toward table foundation models</title><link href="https://gael-varoquaux.info/science/carte-toward-table-foundation-models.html" rel="alternate"></link><published>2024-07-19T00:00:00+02:00</published><updated>2024-07-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-07-19:/science/carte-toward-table-foundation-models.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Foundation models, pretrained and readily usable for many downstream
tasks, have changed the way we process text, images, and sound. Can we
achieve similar breakthroughs for tables? Here I explain why with
&lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;“CARTE”&lt;/a&gt;, we’ve made significant headway.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-for-data-tables-hopes-and-challenges" id="toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-is-a-necessity" id="toc-entry-2"&gt;Pre-training is a …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Foundation models, pretrained and readily usable for many downstream
tasks, have changed the way we process text, images, and sound. Can we
achieve similar breakthroughs for tables? Here I explain why with
&lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;“CARTE”&lt;/a&gt;, we’ve made significant headway.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-for-data-tables-hopes-and-challenges" id="toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-is-a-necessity" id="toc-entry-2"&gt;Pre-training is a necessity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pretraining-for-data-tables" id="toc-entry-3"&gt;Pretraining for data tables?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#carte-a-table-foundation-model-breakthrough" id="toc-entry-4"&gt;CARTE: a table foundation model breakthrough&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#an-architecture-to-learn-across-tables" id="toc-entry-5"&gt;An architecture to learn across tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pretraining-on-knowledge-graphs" id="toc-entry-6"&gt;Pretraining on knowledge graphs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#empirical-results" id="toc-entry-7"&gt;Empirical results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#lessons-learned" id="toc-entry-8"&gt;Lessons learned&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="pre-training-for-data-tables-hopes-and-challenges"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="pre-training-is-a-necessity"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Pre-training is a necessity&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Foundation models have brought breakthroughs to text and image processing
because they embark a great deal of knowledge on these data, knowledge
that can then be reused to simplify processing. But their promises have
not come true for tables, which hold much of an organization’s specific
data, &lt;em&gt;eg&lt;/em&gt; relational databases capturing day-to-day operations, or
measurements tables related to a specific source of data.&lt;/p&gt;
&lt;p&gt;Rather, for tabular learning, a couple of years ago &lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;our extensive
benchmarks&lt;/a&gt;
showed that tree-based models outperformed even deep-learning
architectures specially crafted for data tables.&lt;/p&gt;
&lt;p&gt;One challenge is that typically tables are not that big and thus the
high flexibility of deep learning is a weakness rather than a benefit.
This shortcoming was solved by pretrained models, for data modalities
where deep learning has been vastly successful: &lt;strong&gt;most people do not
train a deep-learning model from scratch, but download a pre-trained one
from model hubs&lt;/strong&gt;. Such universal pre-training is also at the root of
foundation models.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="pretraining-for-data-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Pretraining for data tables?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;But what does pretraining mean for data tables? If I give you a table of
numbers, what prior information can you use to process it better?
Images and text have a lot of regularity that repeats across datasets:
I can recognize a car in pictures coming from all kinds of cameras
(including old black-and-white photographs). I use my knowledge of the
meaning of words to understand a text. But given a table of numbers as
below, what sense can I make of it?&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The tabular learning challenge: every table is a special snowflake&lt;/em&gt;&lt;/div&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="29%" /&gt;
&lt;col width="29%" /&gt;
&lt;col width="29%" /&gt;
&lt;col width="14%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;181&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;166&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;161&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The reason a data analyst can understand this data and use this
understanding to build a better data-processing pipeline is that the
data comes with context: meaningful strings sprinkled around these
numbers. For instance, a table with the same numbers as above but with
column names and a few string entries makes complete sense:&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;caption&gt;Cardiovascular cohort&lt;/caption&gt;
&lt;colgroup&gt;
&lt;col width="18%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="36%" /&gt;
&lt;col width="9%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Age&lt;/th&gt;
&lt;th class="head"&gt;Weight&lt;/th&gt;
&lt;th class="head"&gt;Height&lt;/th&gt;
&lt;th class="head"&gt;Commorbidity&lt;/th&gt;
&lt;th class="head"&gt;Cardiovascular event&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;td&gt;Diabetes&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;181&lt;/td&gt;
&lt;td&gt;Cardiac arrhythmia&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;166&lt;/td&gt;
&lt;td&gt;NA&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;161&lt;/td&gt;
&lt;td&gt;Asthma&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In such a setting, it becomes clear what background knowledge, what
pre-training can bring to analyzing data tables: &lt;strong&gt;string entries and
column names bring meaning to the numbers in data tables&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Another way of seeing the challenge is that of &lt;strong&gt;data integration&lt;/strong&gt;: as
studied by the knowledge-representation and database communities, putting
multiple sources of data in a consistent representation requires:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;schema matching&lt;/strong&gt;, which to a first order is about finding column
correspondences across tables&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;entity matching&lt;/strong&gt;, finding correspondences across table entries
denoting the same thing, for instance “Diabetes” and “Diabetes mellitus”&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These challenges of data integration are central to building pretrained
or foundation models for tables. Indeed, such models must apply to all
tables, and thus must bridge these gaps across tables.&lt;/p&gt;
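&lt;p&gt;To make these two steps concrete, here is a toy sketch (not CARTE’s actual machinery) that uses plain string similarity from Python’s standard library as a crude stand-in for language-model embeddings; all column and entry names are made up for illustration:&lt;/p&gt;

```python
# Toy sketch of the two data-integration steps: difflib string
# similarity stands in for learned embeddings; names are illustrative.
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Schema matching: for each column of table A, find the closest column
# of table B
cols_a = ["Age", "Comorbidity", "Cardiovascular event"]
cols_b = ["patient age", "medical conditions", "CV event"]
matches = {c: max(cols_b, key=lambda other: similarity(c, other))
           for c in cols_a}

# Entity matching: link entries denoting the same thing
close = similarity("Diabetes", "Diabetes mellitus")  # same entity
far = similarity("Diabetes", "Asthma")               # unrelated entities
```

&lt;p&gt;Real systems replace this crude string similarity with learned embeddings, but the structure of the problem, matching columns and matching entries, is the same.&lt;/p&gt;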
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="carte-a-table-foundation-model-breakthrough"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;CARTE: a table foundation model breakthrough&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Our recent &lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;CARTE paper&lt;/a&gt; builds upon
the above insights and demonstrates that pretraining can yield
models that markedly improve prediction performance.&lt;/p&gt;
&lt;div class="section" id="an-architecture-to-learn-across-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;An architecture to learn across tables&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Graphlets&lt;/strong&gt;
The key ingredient of CARTE is how we represent the inputs. CARTE’s goal
is to build predictors on rows of tables, for instance associating the
features of an individual to a risk of developing adverse cardiovascular
events. To pretrain across tables, we use a universal representation of
the data (rows of tables), as small graphs.&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_graphlet.png" /&gt;
&lt;p class="caption"&gt;Turning table rows into graphlets. Each column leads to an edge and
the column name is turned into the corresponding edge feature. It’s a
“multirelational graph”. The entry associated with the given column
is turned into the corresponding node feature, and the row is
represented as a special row token in the center of the graphlet.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Thus, tables with different numbers of columns can be turned into a
consistent representation. An additional benefit of this
representation is that it can capture data spread across multiple tables
with shared keys (for instance all the visits of a patient to a hospital).&lt;/p&gt;
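&lt;p&gt;The construction can be sketched in a few lines; this is a schematic illustration with made-up column names, not the paper’s code:&lt;/p&gt;

```python
# Schematic graphlet construction: each row becomes a small star graph,
# one edge per (column, value) pair, centered on a special row token.
# CARTE itself feeds embeddings of these strings, not the raw strings.
def row_to_graphlet(row):
    """Turn a dict {column_name: value} into (center, edges)."""
    center = "ROW"  # special row token at the center of the graphlet
    edges = [
        # (source, edge feature = column name, node feature = entry)
        (center, column, value)
        for column, value in row.items()
        if value is not None  # missing entries simply yield no edge
    ]
    return center, edges

center, edges = row_to_graphlet(
    {"Age": 72, "Comorbidity": "Diabetes", "Cardiovascular event": 1}
)
# A row from a table with a different schema maps to the same structure
_, other = row_to_graphlet({"patient age": 64, "CV event": 1})
```

&lt;p&gt;Note how no schema matching is needed: rows with different columns simply yield graphlets with different edge features.&lt;/p&gt;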
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;A representation that can bridge tables without schema or entity
matching&lt;/em&gt;&lt;/div&gt;
&lt;br/&gt;
&lt;br/&gt;&lt;p&gt;&lt;strong&gt;String embeddings&lt;/strong&gt;
The second ingredient is to represent all strings as embeddings, using a
pretrained language model, whether for column names or for string
entries. A good language model will embed close together strings
with similar meanings, for instance a column named “comorbidity” and
another one named “medical conditions”. Such a representation helps
learning without entity or schema matching.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Graph transformer&lt;/strong&gt; CARTE then uses a form of graph transformer on top
of this representation. Key to this graph transformer is an attention
mechanism that accounts for the relation information (the edge type,
&lt;em&gt;ie&lt;/em&gt; the column name). Thus &lt;em&gt;(born in, Paris)&lt;/em&gt; is represented
differently from &lt;em&gt;(living in, Paris)&lt;/em&gt;.&lt;/p&gt;
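&lt;p&gt;As a toy illustration of such edge-conditioned attention (an assumption-laden sketch, not CARTE’s actual layer), one can make the attention key depend on the edge embedding, so that the same node scores differently under different relations; all vectors below are made up:&lt;/p&gt;

```python
# Toy edge-conditioned attention score: the key for a neighbor combines
# the node feature with the edge (column-name) embedding.
import math

def attention_score(query, node_emb, edge_emb):
    # Edge-conditioned key: node embedding shifted by the edge embedding
    key = [n + e for n, e in zip(node_emb, edge_emb)]
    scale = math.sqrt(len(query))  # usual scaled dot-product attention
    return sum(q * k for q, k in zip(query, key)) / scale

query = [1.0, 0.5]
paris = [0.2, 0.9]  # toy node feature for "Paris"
born_in, living_in = [0.8, -0.1], [-0.3, 0.4]  # toy edge features

s1 = attention_score(query, paris, born_in)
s2 = attention_score(query, paris, living_in)
# Same node, different relation: different attention scores
```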
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Numbers treated as such&lt;/strong&gt; Columns with numerical entries are often
important information in a data table. Unlike typical large language
models, we do not represent numbers via string tokenization, but use a
vector representation where the numerical value is multiplied with the
embedding of the column name (a vector output by the language model).
That way a value of 126 in a column named “Systolic mm Hg” is represented
close to 1.5 times a value of 84 in a column named “Blood pressure”.&lt;/p&gt;
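&lt;p&gt;A small sketch of this encoding, with made-up embedding vectors standing in for the language-model output: because the two column embeddings are close, the two encoded values have a ratio close to 126/84 = 1.5 on every dimension:&lt;/p&gt;

```python
# Sketch of the numerical encoding: value * embedding(column name).
# The embedding vectors below are invented for illustration; in CARTE
# they come from a pretrained language model.
def encode_number(value, column_embedding):
    return [value * x for x in column_embedding]

systolic_emb = [0.61, 0.30, 0.75]        # toy emb("Systolic mm Hg")
blood_pressure_emb = [0.60, 0.31, 0.74]  # toy emb("Blood pressure")

v1 = encode_number(126, systolic_emb)
v2 = encode_number(84, blood_pressure_emb)
ratios = [a / b for a, b in zip(v1, v2)]  # each close to 126/84 = 1.5
```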
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="pretraining-on-knowledge-graphs"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Pretraining on knowledge graphs&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We pretrain the above architecture on a large, general-purpose knowledge
graph. The goal is to distill the corresponding information into the
pretrained model, which can then implicitly use it when analyzing new
tables. Indeed, a large knowledge graph (we use &lt;a class="reference external" href="https://yago-knowledge.org"&gt;YAGO&lt;/a&gt;) represents a huge number of facts about the
world, and its representation, as a multirelational graph, is close to
the one that we use to model data tables.&lt;/p&gt;
&lt;p&gt;Given an analytic task on a data table of interest, the pretrained model
can be fine-tuned. We found that this was a tricky part, as those tables
are often small.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="empirical-results"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Empirical results&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Excellent performance on extensive benchmarks&lt;/strong&gt;
We compared CARTE to a variety of baselines across 51 datasets (mostly
downloaded from Kaggle), as a function of the number of samples (number
of rows):&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_learning_curve.png" /&gt;
&lt;p class="caption"&gt;Prediction performance as a function of sample size for classification
and regression tasks&lt;/p&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
CARTE outperforms all baselines, including very strong ones&lt;/div&gt;
&lt;p&gt;CARTE appears as a very strong performer, outperforming all baselines
when there are fewer than 2000 samples. For larger tables, the prior
information is less crucial, and more flexible learners are beneficial.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strong contenders&lt;/strong&gt; We see that powerful tree-based learners, such as
CatBoost or XGBoost, also work very well. We investigated many baselines
in detail. Here, we consider not only learners, but also a variety of
methods to encode strings, and these really help prediction:&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_cd_plots.png" /&gt;
&lt;p class="caption"&gt;Detailed comparison (critical difference plots, giving the average
ranking of methods) across all 42 baselines that we investigated&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;CatBoost is an excellent predictor because it encodes categories
with great care. &lt;em&gt;S-LLM-CN-XGB&lt;/em&gt; is a baseline that we contributed that
encodes strings with an LLM, concatenates the numerical values, and uses
XGBoost on the resulting representation. &lt;em&gt;TabVec&lt;/em&gt; is the &lt;a class="reference external" href="https://skrub-data.org/stable/generated/skrub.TableVectorizer.html#skrub.TableVectorizer"&gt;TableVectorizer&lt;/a&gt;
from &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub&lt;/a&gt;. Combined with standard learners,
it gives really strong baselines.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Learning across tables&lt;/strong&gt; As CARTE can jointly model different tables with
different conventions, we show that one can use large source tables to
boost prediction on a smaller target table.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/carte/carte_joint_learning.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Ranking of various methods used across tables with imperfect
correspondences, where “matched” means manual column matching, and “not
matched” means no manual column matching&lt;/em&gt;&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Transfer learning across sources with different columns / schemas&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="lessons-learned"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;Lessons learned&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The extensive empirical results hold many lessons.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tabular foundation models are possible&lt;/strong&gt; The first lesson is that
using strings to bring meaning to the numbers enables foundation models
for tables: pretrained models that facilitate a variety of downstream
tasks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLMs are not enough&lt;/strong&gt; Many approaches to table foundation models adapt
large language models pretrained on huge text corpora. The argument is
that with the amount of high-quality text on the Internet, the corresponding
LLM can acquire more background knowledge. The seminal example is
&lt;a class="reference external" href="https://proceedings.mlr.press/v206/hegselmann23a.html"&gt;TabLLM&lt;/a&gt;, which
makes sentences out of table rows and feeds them to LLMs. Yet, by itself,
it does not perform well on tables with numbers.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/carte/tabllm_comparison.png" style="width: 350px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Ranking of models on data from the TabLLM paper, which differs from
our benchmark above in that it does not have string entries.&lt;/em&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
A table foundation model must model strings and numbers&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Modeling numbers is crucial&lt;/strong&gt; TabPFN, CARTE, and XGBoost all outperform
TabLLM on tables without strings, likely because they readily model
numbers, while an LLM sees them as strings. Likewise, our variant
&lt;em&gt;S-LLM-CN-XGB&lt;/em&gt;, which combines LLMs with a model suitable for numbers,
performs very well.&lt;/p&gt;
&lt;p&gt;As the strings are crucial to give context to numbers, we believe that
the future of table foundation models is to model well both strings and
numbers.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;CARTE is only a first step in the world of table foundation models. I
am convinced that these ideas will be pushed much further.&lt;/p&gt;
&lt;p class="last"&gt;But we have learned a lot in this study, and I have only skimmed the
surface of our work here. If you want more details, read the &lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;CARTE paper&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="machine learning"></category><category term="tabular learning"></category><category term="foundation models"></category></entry><entry><title>Skrub 0.2.0: tabular learning made easy</title><link href="https://gael-varoquaux.info/programming/skrub-020-tabular-learning-made-easy.html" rel="alternate"></link><published>2024-07-03T00:00:00+02:00</published><updated>2024-07-03T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-07-03:/programming/skrub-020-tabular-learning-made-easy.html</id><summary type="html">&lt;img alt="" class="align-center" src="attachments/skrub_schematic.png" style="width: 500px;" /&gt;
&lt;p&gt;We just released &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub 0.2.0&lt;/a&gt;. This release
markedly simplifies learning on complex dataframes.&lt;/p&gt;
&lt;div class="section" id="model-tabular-learner-classifier"&gt;
&lt;h2&gt;&lt;cite&gt;model = tabular_learner(‘classifier’)&lt;/cite&gt;&lt;/h2&gt;
&lt;div class="align-right docutils container"&gt;
Simple, yet solid default baseline&lt;/div&gt;
&lt;p&gt;The highlight of the release is the &lt;a class="reference external" href="https://skrub-data.org/stable/generated/skrub.tabular_learner.html"&gt;tabular_learner&lt;/a&gt;
function, which facilitates creating pipelines that readily perform
machine learning on dataframes, adding preprocessing to a scikit-learn
compatible learner …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;img alt="" class="align-center" src="attachments/skrub_schematic.png" style="width: 500px;" /&gt;
&lt;p&gt;We just released &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub 0.2.0&lt;/a&gt;. This release
markedly simplifies learning on complex dataframes.&lt;/p&gt;
&lt;div class="section" id="model-tabular-learner-classifier"&gt;
&lt;h2&gt;&lt;cite&gt;model = tabular_learner(‘classifier’)&lt;/cite&gt;&lt;/h2&gt;
&lt;div class="align-right docutils container"&gt;
Simple, yet solid default baseline&lt;/div&gt;
&lt;p&gt;The highlight of the release is the &lt;a class="reference external" href="https://skrub-data.org/stable/generated/skrub.tabular_learner.html"&gt;tabular_learner&lt;/a&gt;
function, which facilitates creating pipelines that readily perform
machine learning on dataframes, adding preprocessing to a scikit-learn
compatible learner. The function packs defaults and heuristics
to transform all forms of dataframes to a representation that is well
suited to a learner, and it can adapt these transformations:
&lt;cite&gt;tabular_learner(HistGradientBoostingClassifier())&lt;/cite&gt; encodes categories
differently than &lt;cite&gt;tabular_learner(LogisticRegression())&lt;/cite&gt;.&lt;/p&gt;
&lt;p&gt;The heuristics are tuned based on extensive benchmarking, and experience
shows that they give good tradeoffs. The default
&lt;cite&gt;tabular_learner(‘classifier’)&lt;/cite&gt; is often a strong baseline.&lt;/p&gt;
&lt;p&gt;The benefits are visible in a really simple example:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; # First retrieve data
&amp;gt;&amp;gt;&amp;gt; from skrub.datasets import fetch_employee_salaries
&amp;gt;&amp;gt;&amp;gt; dataset = fetch_employee_salaries()
&amp;gt;&amp;gt;&amp;gt; df = dataset.X
&amp;gt;&amp;gt;&amp;gt; y = dataset.y
&amp;gt;&amp;gt;&amp;gt; # The dataframe is a quite rich and complex dataframe, with various columns
&amp;gt;&amp;gt;&amp;gt; df
&lt;/pre&gt;
&lt;img alt="" src="attachments/employee_salaries_df.png" /&gt;
&lt;p&gt;We can then easily build a learner that applies readily to this
dataframe, without any transformation:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub import tabular_learner
&amp;gt;&amp;gt;&amp;gt; learner = tabular_learner('regressor')
&amp;gt;&amp;gt;&amp;gt; # The resulting learner can apply all the machine-learning conveniences (eg cross-validation) directly on the dataframe
&amp;gt;&amp;gt;&amp;gt; from sklearn.model_selection import cross_val_score
&amp;gt;&amp;gt;&amp;gt; cross_val_score(learner, df, y)
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])
&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="section" id="transformer-tablevectorizer"&gt;
&lt;h2&gt;&lt;cite&gt;transformer = TableVectorizer()&lt;/cite&gt;&lt;/h2&gt;
&lt;div class="align-right docutils container"&gt;
Making encoding complex dataframes easy&lt;/div&gt;
&lt;p&gt;Under the hood, the work is done by the &lt;a class="reference external" href="https://skrub-data.org/stable/generated/skrub.TableVectorizer.html"&gt;skrub.TableVectorizer()&lt;/a&gt;, a
scikit-learn compatible transformer that facilitates combining multiple
transformations on the different columns of a dataframe. The
TableVectorizer is not new in the 0.2.0 release, but we have completely
revamped its internals to cover edge cases really well. Indeed, one
challenge is to make sure that nothing different or strange happens at
test time. Actually, enforcing consistency between train-time and
test-time transformations is the real value of skrub compared to using
pandas or polars to do the transformations.&lt;/p&gt;
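&lt;p&gt;To illustrate why this matters, here is a stdlib-only toy encoder (not skrub’s API) that freezes the categories seen at fit time, so that a category appearing only at test time cannot silently shift the encoding:&lt;/p&gt;

```python
# Toy illustration of train/test consistency for categorical encoding;
# skrub's TableVectorizer handles this (and much more) for real.
class ToyCategoryEncoder:
    def fit(self, values):
        # Learn a fixed mapping from the categories seen at train time
        self.categories_ = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        unknown = len(self.categories_)  # stable code for unseen values
        return [self.categories_.get(v, unknown) for v in values]

enc = ToyCategoryEncoder().fit(["Asthma", "Diabetes"])
train_codes = enc.transform(["Asthma", "Diabetes"])
# At test time, a new category does not change the learned encoding
test_codes = enc.transform(["Diabetes", "Cardiac arrhythmia"])
```

&lt;p&gt;Naively re-encoding the test dataframe with pandas or polars would instead recompute the category set and silently change the codes.&lt;/p&gt;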
&lt;/div&gt;
&lt;div class="section" id="increasing-support-of-polars"&gt;
&lt;h2&gt;Increasing support of polars&lt;/h2&gt;
&lt;div class="align-right docutils container"&gt;
Short-term goal of optimized support for pandas and polars&lt;/div&gt;
&lt;p&gt;We have implemented a new mechanism for supporting both pandas and
polars. It has not yet been applied across the whole codebase, so the
support is still imperfect. However, we are seeing increasing support for
polars in skrub, and our goal in the short term is to provide rock-solid
polars support.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="attachments/skrub_logo.png" style="width: 200px;" /&gt;
&lt;p&gt;Try skrub out! It’s still young, but in my opinion, it provides a lot
of value to tabular learning.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="skrub"></category><category term="scikit-learn"></category><category term="tabular"></category><category term="machine learning"></category><category term="open source"></category><category term="software"></category></entry><entry><title>Promoting open-source, from inria to :probabl.</title><link href="https://gael-varoquaux.info/programming/promoting-open-source-from-inria-to-probabl.html" rel="alternate"></link><published>2024-06-09T00:00:00+02:00</published><updated>2024-06-09T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-06-09:/programming/promoting-open-source-from-inria-to-probabl.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/scikit-learn_at_probabl.png" style="width: 300px;" /&gt;
&lt;p class="last"&gt;Open-source efforts around scikit-learn at Inria are spinning off to a
new enterprise, &lt;a class="reference external" href="https://probabl.ai"&gt;Probabl&lt;/a&gt;, in charge of
sustainable development of a data-science commons.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#prelude-funding-scikit-learn-is-hard" id="toc-entry-1"&gt;Prelude: funding scikit-learn is hard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-birth-of-a-new-ambition" id="toc-entry-2"&gt;The birth of a new ambition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-a-mission-driven-enterprise" id="toc-entry-3"&gt;Probabl, a mission-driven enterprise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-is-already-having-an-impact" id="toc-entry-4"&gt;Probabl is already having an impact&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#my-position-within-probabl-my-vested-interests" id="toc-entry-5"&gt;My position within Probabl …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/scikit-learn_at_probabl.png" style="width: 300px;" /&gt;
&lt;p class="last"&gt;Open-source efforts around scikit-learn at Inria are spinning off to a
new enterprise, &lt;a class="reference external" href="https://probabl.ai"&gt;Probabl&lt;/a&gt;, in charge of
sustainable development of a data-science commons.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#prelude-funding-scikit-learn-is-hard" id="toc-entry-1"&gt;Prelude: funding scikit-learn is hard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-birth-of-a-new-ambition" id="toc-entry-2"&gt;The birth of a new ambition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-a-mission-driven-enterprise" id="toc-entry-3"&gt;Probabl, a mission-driven enterprise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-is-already-having-an-impact" id="toc-entry-4"&gt;Probabl is already having an impact&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#my-position-within-probabl-my-vested-interests" id="toc-entry-5"&gt;My position within Probabl, my vested interests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#more-to-come" id="toc-entry-6"&gt;More to come&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="prelude-funding-scikit-learn-is-hard"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Prelude: funding scikit-learn is hard&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Scikit-learn is a &lt;a class="reference external" href="../programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html"&gt;central software component in today’s machine learning
landscape&lt;/a&gt;,
and it is open source, governed by a community, easy to install, and well
documented. It started many years ago as a project that we did on the
side, and we were joined by many volunteers, which was key to the success
of the project. We soon decided to ensure that scikit-learn was not
&lt;em&gt;only&lt;/em&gt; a volunteer-based effort. Over more than a decade, I’ve dedicated
a lot of energy to this, using a variety of funding mechanisms: first
grants (as an academic), then sponsoring and related contracts with
various actors.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Digital commons eliminate scarcity and exclusivity&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Funding digital commons is really hard. People build fortunes by
leveraging competitive advantages, by creating lock-ins, or selling
access to data. What makes a great open-source library, as scikit-learn,
is exactly what prevents these tricks: we are committed to being
independent, easy to use and install, lightweight…&lt;/p&gt;
&lt;img src="../programming/attachments/probabl_rocket.svg" class="align-right" width="150px"&gt;&lt;/div&gt;
&lt;div class="section" id="the-birth-of-a-new-ambition"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;The birth of a new ambition&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Scikit-learn is very successful, but it could be more. For instance, it
does not facilitate pushing to production as much as TensorFlow, which
can be served, deployed to Android… And scikit-learn is not very
visible to top decision makers: it’s not a line on their budget, not a brand
that they know. As a consequence, it is not reaping the benefits of its
success &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Many commercial tools are sitting on top of open source software
like scikit-learn (splunk, sagemaker, to name only a few), making
profits, and not helping in any way the open source world that they
build upon.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The French government is backing us to push the envelope&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Three years ago, the French government challenged us to go further, to consolidate
the ecosystem into a consistent data-science commons. The strategic
interest of France is to preserve some technological autonomy on data, eg
sensitive data. Thus, the government offered us, at Inria, a funding
opportunity to go further.&lt;/p&gt;
&lt;p&gt;They promised us a lot of money (dozens of millions of Euros), but with a
specific mission to develop a sustainable “data-science commons” &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;
ecosystem around scikit-learn. I’ll spare you the details of the number
of meetings we had and the documents we wrote to sketch the outline of the
project. I pushed forward a vision of technical components that fit in
the broader open-source ecosystem, complementing it.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;The letter that we received from the French government
specifically defines the objective in these words: “data-science
common” (“Communs numériques pour la Science des Données”)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As I moved forward, I faced a difficulty: the French government wanted a
&lt;strong&gt;sustainability plan&lt;/strong&gt;, and private investment to back it. To be honest,
this is not what I’m good at. François Goupil, the COO of the
scikit-learn consortium, was helping me, but we needed more for our
ambitions. And this is when we started talking to &lt;a class="reference external" href="https://www.linkedin.com/in/ylechelle/"&gt;Yann Lechelle&lt;/a&gt;, a tech entrepreneur with an
impressive track record interested in the impact of France on the global
tech world.&lt;/p&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/probabl_logo.jpeg" style="width: 100px;" /&gt;
&lt;/div&gt;
&lt;div class="section" id="probabl-a-mission-driven-enterprise"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Probabl, a mission-driven enterprise&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With Yann, we built a new vision. Our challenge is to be long-term
sustainable and virtuous for scikit-learn, its broader ecosystem, and its
community. Yann brought in a business point of view, and I tried to bring
that of open-source communities beyond Probabl &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;, for instance
avoiding getting in the way of others building businesses that
contribute to scikit-learn. Indeed, we are convinced that having a broad
and diverse community around scikit-learn is central to its future.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;One of the first things that Probabl did (Guillaume Lemaître, to
be specific), was submit a grant application (to the Chang-Zuckenberg
Institute), to fund, via NumFocus, a developer employed by
Quantsight, with no money transiting via Probabl (one reason being
that we have no operations outside of Europe so far).&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Our sustainability model is still being fine-tuned. What I can say is
that it will involve a mix of professional services, support &amp;amp; sponsorship
agreements, as well as a product-based offer, where we supplement
scikit-learn with enterprise features. Our focus will be on features that
are typically not the focus of open-source developers: integration in
large structures, such as access control, LDAP connection, regulatory
compliance. We will not shoehorn scikit-learn into open-core or dual-licensing
approaches: we want our incentives to be aligned with
scikit-learn, and its ecosystem, being as complete as possible.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Foster growth and adoption of our open-source stack&lt;/div&gt;
&lt;p&gt;In a sense, our inspiration is that of RedHat, where the growth of the
company fosters the growth and adoption of the software (Linux in the case
of RedHat), beyond the company, in an ecosystem, and for a wide variety
of applications.&lt;/p&gt;
&lt;p&gt;Strong growth will mean external capital. To ensure that we do not lose
focus on our mission of building data-science commons, Yann sketched
out a specific governance for the company (and then validated it with
many people, as we are a spin-off from a governmental organization). The
ultimate share structure, and the board, are divided into three electoral
colleges: one for outside investors, one for founders and employees, and
one for public institutions. This ensures a balance of power that will
hopefully keep us aligned with our mission. I think that this
structure sends a strong signal that we are not just another for-profit
that will drift from creating useful tech to dark money-generating patterns.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="probabl-is-already-having-an-impact"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Probabl is already having an impact&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A strong open-source team&lt;/strong&gt; In February, the whole team developing
scikit-learn at Inria moved to Probabl, joined by Adrin Jalali, a
Berlin-based core developer of scikit-learn and fairlearn. We’ve been
hiring excellent people, and we now have &lt;strong&gt;9 people working on open source&lt;/strong&gt; (see
the &lt;a class="reference external" href="https://probabl.ai/about"&gt;Probabl team&lt;/a&gt;), spending their time
contributing to open source (Jérémie, for instance, has been handling the
latest releases of scikit-learn).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fostering an ecosystem&lt;/strong&gt; Probabl is not only about scikit-learn. We are
prioritizing &lt;a class="reference external" href="https://probabl.ai/open-source"&gt;8 libraries&lt;/a&gt;, central to
the machine-learning and data science ecosystem: joblib, fairlearn,
imbalanced-learn… In general, as we have always done, we will not
hesitate to contribute to upstream or related projects. Our goal is to
have a healthy open-source ecosystem around data science.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Not only software&lt;/strong&gt; Not everybody sees the important lines of code.
I’ve become increasingly aware of the need for outreach and
communication, to coders, but also to decision makers. At Probabl we
dedicate energy to being in business meetings, to taking part in the tech
narrative, and to teaching how best to do data science, &lt;em&gt;e.g.&lt;/em&gt; with didactic
videos. We’re starting a mentoring program, we’ll be organizing
sprints… I am convinced that all of this is a useful long-term investment.&lt;/p&gt;
&lt;img alt="" class="align-center" src="../programming/attachments/probabl_robot_dog.jpeg" style="width: 360px;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="my-position-within-probabl-my-vested-interests"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;My position within Probabl, my vested interests&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I am a French civil servant (a researcher at Inria, one of our national
research institutes). Such a position comes with strong responsibilities
to control conflicts of interest. The creation of Probabl underwent
strict scrutiny (which took a very long time). I have recently been
cleared to take an active role: 10% of my time is allocated to being a
&lt;strong&gt;scientific and open-source advisor for Probabl&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I am not paid by Probabl&lt;/strong&gt;. 100% of my salary comes from Inria (and I was
not given a raise because of my involvement in Probabl). I do have financial
interests as a founder, but given my small active role, I hold
one of the smallest amounts of shares among the founders.&lt;/p&gt;
&lt;p&gt;My main interest in Probabl is really the success of its mission: the
long-term growth of an open-source data-science ecosystem. Spinning off
from Inria actually continues my efforts in this direction, but with more
agility and breadth. And building a variety of complementary commercial
activities on top of open source makes it stronger, by better answering
the needs of some actors.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="more-to-come"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;More to come&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There are many things that we are still ironing out. Settling
specific details takes time (for instance, clearing my role took a
while). We have yet to announce the future of the sponsorship program
that we had set up at the Inria foundation. Its mission has been
transferred to Probabl. Currently, Probabl’s open-source
team is ensuring continuity of our work with the existing sponsors.
But we will set up broader
partnership opportunities, with a similar governance, that enable
third parties to invest in open source on a roadmap decided jointly with
the open-source community.&lt;/p&gt;
&lt;p&gt;I believe that we need a lot of &lt;strong&gt;transparency&lt;/strong&gt; in how we set priorities
in our open-source team. Our 2024 priorities for scikit-learn are visible
&lt;a class="reference external" href="https://papers.probabl.ai/scikit-learns-priorities-at-probabl"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I look forward to the moment when Probabl starts adding value to scikit-learn
for enterprises, with an offer enriching scikit-learn and the broader
open-source ecosystem.&lt;/p&gt;
&lt;p&gt;I am acutely aware that good &lt;strong&gt;open source is made of communities&lt;/strong&gt;, and that
communities need to trust and understand big players such as Probabl
(well, so far we are not that big). I hope that with time our actions
will become easy to read and will speak for themselves.&lt;/p&gt;
&lt;img src="../programming/attachments/probabl_machine_heart.svg" class="align-center" width="400px"&gt;&lt;/div&gt;
</content><category term="programming"></category><category term="open source"></category><category term="growth"></category><category term="communities"></category><category term="scikit-learn"></category><category term="inria"></category><category term="probabl"></category></entry><entry><title>People underestimate how impactful Scikit-learn continues to be</title><link href="https://gael-varoquaux.info/programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html" rel="alternate"></link><published>2023-11-27T00:00:00+01:00</published><updated>2023-11-27T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2023-11-27:/programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;François Chollet rightfully said that people often underestimate the
impact of scikit-learn. I give here a few illustrations to back his
claim.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;A few days ago, François Chollet (the creator of Keras, the library
that democratized deep learning) &lt;a class="reference external" href="https://twitter.com/fchollet/status/1727186047115882624"&gt;posted&lt;/a&gt;:&lt;/p&gt;
&lt;a class="reference external image-reference" href="https://twitter.com/fchollet/status/1727186047115882624"&gt;&lt;img alt="Tweet from François Chollet: &amp;quot;People underestimate how impactful scikit-learn continues to be&amp;quot;" class="align-center" src="../programming/attachments/chollet_scikit_learn_impact.png" /&gt;&lt;/a&gt;
&lt;p&gt;Indeed, scikit-learn continues to be the most popular machine …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;François Chollet rightfully said that people often underestimate the
impact of scikit-learn. I give here a few illustrations to back his
claim.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;A few days ago, François Chollet (the creator of Keras, the library
that democratized deep learning) &lt;a class="reference external" href="https://twitter.com/fchollet/status/1727186047115882624"&gt;posted&lt;/a&gt;:&lt;/p&gt;
&lt;a class="reference external image-reference" href="https://twitter.com/fchollet/status/1727186047115882624"&gt;&lt;img alt="Tweet from François Chollet: &amp;quot;People underestimate how impactful scikit-learn continues to be&amp;quot;" class="align-center" src="../programming/attachments/chollet_scikit_learn_impact.png" /&gt;&lt;/a&gt;
&lt;p&gt;Indeed, scikit-learn continues to be the most popular machine-learning
library in surveys:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="../programming/attachments/kaggle_survey_library_2022.png"&gt;&lt;img alt="" src="../programming/attachments/kaggle_survey_library_2022.png" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Most popular machine-learning framework, according to &lt;a class="reference external" href="https://www.kaggle.com/kaggle-survey-2022"&gt;a Kaggle survey&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Scikit-learn is probably the most used machine-learning library&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;This popularity is sometimes underestimated, as scikit-learn is a small player
in terms of funding and team size, in particular
compared to giants such as TensorFlow and PyTorch. The size is limited
by the nature of the project: it is community-based, without a strong commercial
entity backing it.&lt;/p&gt;
&lt;p&gt;We target a different technology than TensorFlow and PyTorch: we have,
by design, let the big players focus on deep learning, which demands much
more resources. Rather, we have focused on classic machine learning,
believing that it serves other important needs. While such technologies
make the news less often, they are used a lot, and scikit-learn is massively
used:&lt;/p&gt;
&lt;table border="1" class="noborder docutils align-center"&gt;
&lt;caption&gt;&lt;strong&gt;Usage statistics&lt;/strong&gt; (from GitHub)&lt;/caption&gt;
&lt;colgroup&gt;
&lt;col width="33%" /&gt;
&lt;col width="33%" /&gt;
&lt;col width="33%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/scikit-learn/scikit-learn"&gt;&lt;img alt="sklearn_header" src="../programming/attachments/scikit-learn_header.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/pytorch/pytorch/"&gt;&lt;img alt="pytorch_header" src="../programming/attachments/pytorch_header.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/tensorflow/tensorflow"&gt;&lt;img alt="tensorflow_header" src="../programming/attachments/tensorflow_header.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/scikit-learn/scikit-learn"&gt;&lt;img alt="sklearn_used_by" src="../programming/attachments/scikit-learn_used_by.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/pytorch/pytorch/"&gt;&lt;img alt="pytorch_used_by" src="../programming/attachments/pytorch_used_by.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/tensorflow/tensorflow"&gt;&lt;img alt="tensorflow_used_by" src="../programming/attachments/tensorflow_used_by.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;By not focusing on deep learning, does scikit-learn risk becoming
outdated? Surveys show that simple models, such as linear models or models
based on trees (including boosting), are actually the most used:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="../programming/attachments/popular_ml_algorithm_2022.png"&gt;&lt;img alt="" src="../programming/attachments/popular_ml_algorithm_2022.png" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Most popular machine learning algorithm, according to &lt;a class="reference external" href="https://www.kaggle.com/code/dhirajkumar612/kaggle-survey-2022-data-analysis"&gt;a kaggle
survey&lt;/a&gt;
(apologies for the small fonts on the figure, I did not generate it)&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Gradient Boosted Trees is a good go-to model&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;There is a lot of hype surrounding deep learning, but it is most
often not the right tool to tackle tabular data. Tabular data has
different properties than images or text: it comes with heterogeneous
columns that make sense by themselves, and tree-based models have the
right inductive bias &lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;[Grinsztajn et al 2023]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;&lt;img alt="" src="../programming/attachments/benchmark_tree_models.png" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Benchmark comparing models on tabular data while tuning
hyper-parameters&lt;/strong&gt; (from &lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;Grinsztajn et al 2023&lt;/a&gt;) Each value corresponds to the test score of the
best model (on the validation set) after a specific time spent doing
random search. The
ribbon corresponds to the minimum and maximum scores on these 15
shuffles.
The models HistGradientBoostingTree, GradientBoostingTree, and
RandomForest come from scikit-learn. FTTransformer, SAINT, ResNet, and
MLP are all deep-learning architectures, with FTTransformer and SAINT
specifically developed for tabular data.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As we can see, scikit-learn’s &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting"&gt;HistGradientBoosting&lt;/a&gt; really shines, delivering good prediction performance at a small computational cost. We strive to facilitate data science: keeping it lightweight, with good documentation and APIs.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Linear models and tree-based models are here to stay. They answer strong
needs in many application settings and they come with a small
operational cost.&lt;/p&gt;
&lt;p&gt;In my opinion, where scikit-learn could really grow to be even more
relevant is in integrating better into a broader ecosystem, going from
databases to deployment in production, becoming more “enterprise ready” :).&lt;/p&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="open-source"></category><category term="machine learning"></category></entry><entry><title>Comité de l’intelligence artificielle: vision et stratégie nationale</title><link href="https://gael-varoquaux.info/science/comite-de-lintelligence-artificielle-vision-et-strategie-nationale.html" rel="alternate"></link><published>2023-09-20T00:00:00+02:00</published><updated>2023-09-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2023-09-20:/science/comite-de-lintelligence-artificielle-vision-et-strategie-nationale.html</id><summary type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;English summary&lt;/p&gt;
&lt;p&gt;I have been appointed to the government-level panel of experts on AI,
to set the national vision and strategy in France.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I have the honor of being &lt;a class="reference external" href="https://www.gouvernement.fr/communique/comite-de-lintelligence-artificielle"&gt;appointed to the artificial intelligence committee of the French government&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The mission entrusted to us is to inform public policy …&lt;/p&gt;</summary><content type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;English summary&lt;/p&gt;
&lt;p&gt;I have been appointed to the government-level panel of experts on AI,
to set the national vision and strategy in France.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I have the honor of being &lt;a class="reference external" href="https://www.gouvernement.fr/communique/comite-de-lintelligence-artificielle"&gt;appointed to the artificial intelligence committee of the French government&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The mission entrusted to us is to inform public policy around
artificial intelligence, a technology that can affect many
aspects of society.&lt;/p&gt;
&lt;p&gt;The committee comprises experts with very varied profiles, ranging from
the young entrepreneur to the world-renowned economist. The difficulty will be to consider
the full set of links between technological progress and society. We will
seek to articulate a vision, to bring together the expertise of many
different actors on
different
topics, and to ground our projections in the current state of
scientific knowledge.&lt;/p&gt;
&lt;p&gt;I will not share the committee’s work ahead of time: building
consensus will require work, and that work takes time.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This mission goes beyond my usual domain of academic research
and software creation. I am doing it because I believe that, for
technology to have the best impact on society, there must be
a back-and-forth between technological creation and societal
change. If we scientists decide to focus solely
on our academic and technical work, we lose control over how
society adopts our technology; we leave that control
to the people who decide to use their energy to act, to influence,
and to profit directly from these technologies. As a computer-science researcher, working
both on fundamental AI and on applications in
health, I have expertise that is important to bring to the
table. As a civil servant, I think I can and must
inform the debate: I am less exposed to the risk of conflicts
of interest, and I am paid with public money to be useful to the public.&lt;/p&gt;
&lt;p&gt;This work is nevertheless not a political stance: I am a
scientist, not an elected official. The committee’s power is not to make
political decisions, but to inform about what is possible. It is a work of
synthesis and mediation.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Mise à jour: rapport disponible&lt;/p&gt;
&lt;p&gt;Nous avons publié en mars 2024 notre rapport, disponible &lt;a class="reference external" href="https://www.info.gouv.fr/actualite/25-recommandations-pour-lia-en-france"&gt;en ligne&lt;/a&gt;.
Il est très lisible et traite de tous les sujets autours de l’IA.
Lecture recommandée à tous.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="artificial intelligence"></category><category term="society"></category><category term="science"></category><category term="government"></category></entry><entry><title>2022, a new scientific adventure: machine learning for health and social sciences</title><link href="https://gael-varoquaux.info/science/2022-a-new-scientific-adventure-machine-learning-for-health-and-social-sciences.html" rel="alternate"></link><published>2023-01-31T00:00:00+01:00</published><updated>2023-01-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2023-01-31:/science/2022-a-new-scientific-adventure-machine-learning-for-health-and-social-sciences.html</id><summary type="html">&lt;p&gt;A retrospective on last year (2022): I embarked on a new scientific
adventure, assembling &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;a team&lt;/a&gt; focused on
developing machine learning for health and social science. The team has
existed for almost a year, and the vision is nicely shaping up. Let me
share with you illustrations of where we …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A retrospective on last year (2022): I embarked on a new scientific
adventure, assembling &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;a team&lt;/a&gt; focused on
developing machine learning for health and social science. The team has
existed for almost a year, and the vision is nicely shaping up. Let me
share with you illustrations of where we are at. This is extracted from
our yearly report which will be public later, but I have sometimes edited
it a bit to add personal context.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-new-team-soda" id="toc-entry-1"&gt;A new team: Soda&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-scientific-vision" id="toc-entry-2"&gt;The scientific vision&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#application-context-richer-data-in-health-and-social-sciences" id="toc-entry-3"&gt;Application context: richer data in health and social sciences&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#applications-raise-specific-data-science-challenges" id="toc-entry-4"&gt;Applications raise specific data-science challenges&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#our-research-axes" id="toc-entry-5"&gt;Our research axes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#some-notable-results-of-2022" id="toc-entry-6"&gt;Some notable results of 2022&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#learning-on-relational-data-aggregating-across-many-tables" id="toc-entry-7"&gt;Learning on relational data: aggregating across many tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#validating-probabilistic-classifiers-beyond-calibration" id="toc-entry-8"&gt;Validating probabilistic classifiers: beyond calibration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#reweighting-randomized-trials-for-generalization-finite-sample-error-and-variable-selection" id="toc-entry-9"&gt;Reweighting randomized trials for generalization: finite sample error and variable selection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#challenges-to-clinical-impact-of-ai-in-medical-imaging" id="toc-entry-10"&gt;Challenges to clinical impact of AI in medical imaging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#privacy-preserving-synthetic-educational-data-generation" id="toc-entry-11"&gt;Privacy-preserving synthetic educational data generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-new-team-soda"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;A new team: Soda&lt;/a&gt;&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/team_2022.jpg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;The team in early 2022 (it has grown a lot since)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;At &lt;a class="reference external" href="https://www.inria.fr/en"&gt;Inria&lt;/a&gt;, we have teams assembling multiple
tenured researchers around a scientific project. Last year, we assembled
a new team called &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;Soda&lt;/a&gt;, which stands for
“social data”, but above all is a fun name.&lt;/p&gt;
&lt;p&gt;In a year, the team grew like crazy (to be honest, this had been baking
for a little while). We are now around 25 people.
There are 4 PIs (Marine le Morvan, Judith Abécassis, Jill-Jênn Vie, and
myself); and the engineers working on scikit-learn at Inria are also part
of the team.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-scientific-vision"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;The scientific vision&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Machine learning to leverage richer, more complex, data for
social-sciences and health&lt;/em&gt;&lt;/p&gt;
&lt;div class="section" id="application-context-richer-data-in-health-and-social-sciences"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Application context: richer data in health and social sciences&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Opportunistic data accumulations, often observational, bear great
promise for the social and health sciences. But the data are too big and
complex for the standard statistical methodologies of these sciences.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Health databases&lt;/strong&gt; Increasingly rich health data is accumulated
during routine clinical practice as well as for research. Its large
coverage brings new promises for public health and personalized medicine,
but it does not fit easily in standard biostatistical practice because it
is not acquired and formatted for a specific medical question.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Social, educational, and behavioral sciences&lt;/strong&gt; Better data sheds new
light on human behavior and psychology, for instance with online
learning platforms. Machine learning can be used both as a model of
human intelligence and as a tool to leverage these data, for instance to
improve education.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="applications-raise-specific-data-science-challenges"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Applications raise specific data-science challenges&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Data management: preparing dirty data for analytics&lt;/strong&gt; Assembling,
curating, and transforming data for data analysis is very labor
intensive. These data-preparation steps are often considered the number
one bottleneck to data science. They mostly rely on data-management
techniques. A typical problem is establishing correspondences between
entries that denote the same entities but appear in different forms
(entity linking, including deduplication and record linkage). Another
time-consuming process is to join and aggregate data across multiple
tables with repetitions at different levels (as with panel data in
econometrics and epidemiology) to form a unique set of “features” to
describe each individual.&lt;/p&gt;
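As an illustration, here is a minimal pandas sketch of this join-and-aggregate step; the tables and column names are hypothetical, made up for the example:

```python
# Minimal sketch of joining and aggregating across tables with pandas.
# The tables and column names are hypothetical, for illustration only.
import pandas as pd

# One row per individual.
people = pd.DataFrame({"person_id": [1, 2], "age": [34, 51]})

# One-to-many relation: several visits per individual (panel-like data).
visits = pd.DataFrame({
    "person_id": [1, 1, 2],
    "cost": [100.0, 50.0, 80.0],
})

# Aggregate the "many" side down to one row per individual...
per_person = (
    visits.groupby("person_id")["cost"].agg(["sum", "count"]).reset_index()
)

# ...then join back, forming a single table of features per individual.
features = people.merge(per_person, on="person_id", how="left")
print(features)
```

The hard part in practice is not the syntax but choosing the aggregations and handling the irregularity of real records; this is the step that machine learning increasingly helps automate.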
&lt;div class="sidebar"&gt;
The &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;Dirty Data project&lt;/a&gt; paved the way.&lt;/div&gt;
&lt;p&gt;Progress in machine learning increasingly helps automate data
preparation and process data with less curation.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Data science with statistical machine learning&lt;/strong&gt; Machine learning can
be a tool to answer complex domain questions by providing non-parametric
estimators. Yet, it still requires much work, e.g. to go beyond point
estimators, to derive non-parametric procedures that account for a
variety of biases (censoring, sampling biases, non-causal associations), or
to provide theoretical and practical tools to assess the validity of
estimates and conclusions in weakly-parametric settings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="our-research-axes"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Our research axes&lt;/a&gt;&lt;/h3&gt;
&lt;div class="section" id="representation-learning-for-relational-data"&gt;
&lt;h4&gt;Representation learning for relational data&lt;/h4&gt;
&lt;p&gt;I dream of deep-learning methodology for relational databases, from
tabular datasets to full relational databases. The stakes are &lt;em&gt;i)&lt;/em&gt; to
build machine-learning models that apply readily to the raw data, so as to
minimize manual cleaning, data formatting, and integration, and &lt;em&gt;ii)&lt;/em&gt; to
extract reusable representations that reduce sample complexity on new
databases by transforming the data into well-distributed vectors.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="mathematical-aspects-of-statistical-learning-for-data-science"&gt;
&lt;h4&gt;Mathematical aspects of statistical learning for data science&lt;/h4&gt;
&lt;p&gt;I want to use machine-learning models as non-parametric estimators, as I
worry about the impact of mismodeling on conclusions. However, for a given
statistical task, the statistical procedures and validity criteria need
to be reinvented. Soda contributes statistical tools and results for a
variety of problems important to data science in health and social
science (epidemiology, econometrics, education). These fields lead to
various statistical topics:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Causal inference&lt;/li&gt;
&lt;li&gt;Model validation&lt;/li&gt;
&lt;li&gt;Uncertainty quantification&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-for-health-and-social-sciences"&gt;
&lt;h4&gt;Machine learning for health and social sciences&lt;/h4&gt;
&lt;p&gt;Soda targets applications in health and social sciences, as these can
markedly benefit from advanced processing of richer datasets and can have a
large societal impact, but fall outside mainstream machine-learning
research, which focuses on processing natural images, language, and voice.
Rather, data surveying humans needs another focus: it is most of the time
tabular, sparse, with a time dimension, and with missing values. In terms of
application fields, we focus on the social sciences that rely on
quantitative predictions or analysis across individuals, such as policy
evaluation. Indeed, the same formal problems, addressed in the two
research axes above, arise across various social sciences:
&lt;strong&gt;epidemiology, education research, and economics&lt;/strong&gt;.
The challenge is to develop efficient and trustworthy machine-learning
methodology for these high-stakes applications.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="high-quality-data-science-software"&gt;
&lt;h4&gt;High-quality data-science software&lt;/h4&gt;
&lt;p&gt;Societal and economical impact of machine learning requires easy-to-use
practical tools that can be leveraged in non-specialized organizations
such as hospitals or policy-making institutions.&lt;/p&gt;
&lt;p&gt;Soda incorporates the core team working at Inria on &lt;strong&gt;scikit-learn&lt;/strong&gt;, one
of the most popular machine-learning tools worldwide. One of the missions
of soda is to improve scikit-learn and its documentation, transferring the
understanding of machine learning and data science accumulated by the
various research efforts.&lt;/p&gt;
&lt;p&gt;Soda works on other important software tools to foster the growth and
health of the Python data ecosystem in which scikit-learn is embedded.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="some-notable-results-of-2022"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Some notable results of 2022&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I am listing here a few of the team’s achievements, because I find
them inspiring.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="learning-on-relational-data-aggregating-across-many-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Learning on relational data: aggregating across many tables&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For many machine-learning tasks, augmenting the data table at hand with
features built from external sources is key to improving performance. For
instance, estimating housing prices benefits from background information
on the location, such as the population density or the average income.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/aggregating.png" style="width: 300px;" /&gt;
&lt;p class="caption"&gt;Often, data must be assembled across multiple tables into a single
table for analysis. Challenges arise due to one-to-many relations,
irregularity of the information, and the number of tables that may be
involved.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Most often, a major bottleneck is to &lt;strong&gt;assemble this information across
many tables&lt;/strong&gt;, requiring time and expertise from the data scientist. We
propose &lt;strong&gt;vectorial representations of entities (e.g. cities) that capture
the corresponding information&lt;/strong&gt; and thus can replace human-crafted
features. In &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s10994-022-06277-7"&gt;Cvetkov-Iliev 2023&lt;/a&gt;, we
represent the relational data on the entities as a graph and adapt
graph-embedding methods to create feature vectors for each entity. We
show that two technical ingredients are crucial: modeling well the
different relationships between entities, and capturing numerical
attributes. We adapt knowledge graph embedding methods that were
primarily designed for graph completion. Yet, they model only discrete
entities, while creating good feature vectors from relational data also
requires capturing numerical attributes. For this, we introduce KEN:
Knowledge Embedding with Numbers. We thoroughly evaluate approaches to
enrich features with background information on 7 prediction tasks. We
show that a good embedding model coupled with KEN can perform better than
manually handcrafted features, while requiring much less human effort. It
is also competitive with combinatorial feature engineering methods, but
much more scalable. Our approach can be applied to huge databases, for
instance general knowledge graphs such as YAGO, creating &lt;strong&gt;general-purpose
feature vectors reusable in various downstream tasks&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2022/entity_types_with_names.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Entity embeddings of YAGO (wikipedia)&lt;/strong&gt; (2D-representation using
UMAP). The vectors are downloadable from
&lt;a class="reference external" href="https://soda-inria.github.io/ken_embeddings"&gt;https://soda-inria.github.io/ken_embeddings&lt;/a&gt;} to readily augment
data-science projects.&lt;/p&gt;
&lt;/div&gt;
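&lt;p&gt;As a sketch of how such general-purpose embeddings can augment a data-science project (a toy example: the table contents and column names below are illustrative, not the schema of the distributed files), enriching a table reduces to a single join on the entity:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical analysis table, keyed by an entity (here, cities)
housing = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris"],
    "surface_m2": [30, 55, 72],
    "price": [350_000, 280_000, 690_000],
})

# Hypothetical pre-computed entity embeddings (illustrative column names)
embeddings = pd.DataFrame({
    "entity": ["Paris", "Lyon"],
    "dim_0": [0.12, 0.34],
    "dim_1": [0.56, 0.78],
})

# A single left join replaces manual feature engineering on the entity:
# each city brings along its general-purpose feature vector
enriched = housing.merge(embeddings, left_on="city", right_on="entity",
                         how="left")
print(enriched.shape)
```

&lt;p&gt;The enriched table can then feed any downstream estimator, with the embedding dimensions acting as ready-made features.&lt;/p&gt;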
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="validating-probabilistic-classifiers-beyond-calibration"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;Validating probabilistic classifiers: beyond calibration&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/grouping_loss.png" style="width: 360px;" /&gt;
&lt;p class="caption"&gt;Validating probabilistic predictions of classifiers must go account
not only for the average error given an predicted score, but also for
the dispersion of errors.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Ensuring that a classifier gives reliable confidence scores is essential
for informed decision-making, in particular in high-stakes areas such as
health. For instance, before using a clinical prognostic model, we want
to establish that, for a given individual, it attributes probabilities to
the different clinical outcomes that can indeed be trusted. To this end,
recent work has focused on miscalibration, &lt;em&gt;i.e.&lt;/em&gt;, the over- or
under-confidence of model scores.&lt;/p&gt;
&lt;p&gt;Yet calibration is not enough: even a perfectly calibrated classifier
with the best possible accuracy can have confidence scores that are far
from the true posterior probabilities, if it is over-confident for some
samples and under-confident for others. This is captured by the grouping
loss, created by samples with &lt;strong&gt;the same confidence scores but different
true posterior probabilities&lt;/strong&gt;. Proper scoring rule theory shows that given
the calibration loss, the missing piece to characterize individual errors
is the grouping loss. While there are many estimators of the calibration
loss, none exists for the grouping loss in standard settings. In
&lt;a class="reference external" href="https://arxiv.org/abs/2210.16315"&gt;Perez-Level 2023&lt;/a&gt;, we propose an
estimator to approximate the grouping loss. We show that modern neural
network architectures in vision and NLP exhibit grouping loss, notably in
distribution-shift settings, which highlights the importance of
pre-production validation.&lt;/p&gt;
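&lt;p&gt;A toy numerical sketch of the phenomenon (illustrative numbers only, not the estimator of the paper): two groups of samples receive the same confidence score, so the score is well calibrated on average, yet it is far from the true posterior for every individual sample:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Two latent groups that a classifier scores identically at 0.5, although
# their true posterior probabilities differ (0.2 vs 0.8)
true_posterior = np.array([0.2] * 500 + [0.8] * 500)
score = np.full(1000, 0.5)
labels = rng.binomial(1, true_posterior)

# Calibration looks fine: among samples scored 0.5, the empirical event
# rate is close to 0.5
calibration_gap = abs(labels.mean() - 0.5)

# Yet the score is off by 0.3 for every single sample: this dispersion of
# the true posterior within a level set of the score is what drives the
# grouping loss
dispersion = np.mean((true_posterior - score) ** 2)
print(round(calibration_gap, 3), round(dispersion, 3))
```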
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="reweighting-randomized-trials-for-generalization-finite-sample-error-and-variable-selection"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-9"&gt;Reweighting randomized trials for generalization: finite sample error and variable selection&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/reweighting_trial.png" style="width: 360px;" /&gt;
&lt;p class="caption"&gt;There may be a sampling bias between a randomized trial and the
target population.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Randomized Controlled Trials (RCTs) are ideal experiments to establish
causal statements. However, they may suffer from a limited scope, in
particular because they may have been run on non-representative samples:
some RCTs over- or under-sample individuals with certain characteristics
compared to the target population, for which one wants conclusions on
treatment effectiveness. Re-weighting trial individuals to match the
target population can improve the treatment effect estimation.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://hal.science/hal-03822662"&gt;Colnet 2022&lt;/a&gt;, we establish the
exact expressions of the bias and variance of such reweighting procedures,
also called Inverse Propensity of Sampling Weighting (IPSW), in the
presence of categorical covariates for any sample size. Such results
allow us to compare the theoretical performance of different versions of
IPSW estimates. Besides, our results show how the performance (bias,
variance, and quadratic risk) of IPSW estimates depends on the two sample
sizes (RCT and target population). A by-product of our work is the proof
of consistency of IPSW estimates. Results also reveal that IPSW
performance improves when the trial probability of being treated is
estimated (rather than using its oracle counterpart). In addition, we
study the &lt;strong&gt;choice of variables&lt;/strong&gt;: how including covariates that are not
necessary for identifiability of the causal effect may impact the
asymptotic variance. Including covariates that are shifted between the
two samples but are not treatment effect modifiers increases the variance,
while covariates that are treatment effect modifiers but not shifted do not.&lt;/p&gt;
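&lt;p&gt;The core reweighting idea can be sketched in a few lines (a toy simulation with a single binary covariate; this is an illustration of IPSW, not the finite-sample analysis of the paper):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one binary covariate that is both shifted between the trial
# and the target population and a treatment effect modifier
n_trial, n_target = 2000, 10000
x_trial = rng.binomial(1, 0.8, n_trial)    # the trial over-samples x=1
x_target = rng.binomial(1, 0.3, n_target)  # target population
treat = rng.binomial(1, 0.5, n_trial)      # randomized treatment
effect = np.where(x_trial == 1, 2.0, 0.0)  # effect differs across strata
y = effect * treat + rng.normal(size=n_trial)

# Estimated stratum frequencies in each sample
p_trial = np.array([np.mean(x_trial == 0), np.mean(x_trial == 1)])
p_target = np.array([np.mean(x_target == 0), np.mean(x_target == 1)])

# IPSW: weight each trial individual by the target/trial frequency ratio
# of its stratum
w = (p_target / p_trial)[x_trial]

def weighted_ate(y, treat, w):
    treated = np.sum(w * treat * y) / np.sum(w * treat)
    control = np.sum(w * (1 - treat) * y) / np.sum(w * (1 - treat))
    return treated - control

naive = weighted_ate(y, treat, np.ones(n_trial))  # close to the trial ATE, 1.6
ipsw = weighted_ate(y, treat, w)                  # close to the target ATE, 0.6
print(round(naive, 2), round(ipsw, 2))
```

&lt;p&gt;The naive difference of means reflects the trial’s mix of strata, while the reweighted estimate recovers an effect representative of the target population.&lt;/p&gt;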
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="challenges-to-clinical-impact-of-ai-in-medical-imaging"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-10"&gt;Challenges to clinical impact of AI in medical imaging&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I have worked for many years on research in computer analysis of medical
images. In particular, I am convinced that machine learning bears many
promises to improve patients’ health. However, I cannot be blind to the
fact that a number of systematic challenges are slowing down the progress
of the field.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://www.nature.com/articles/s41746-022-00592-y"&gt;Varoquaux &amp;amp; Cheplygina&lt;/a&gt;, we tried to take
a step back on these challenges, from limitations of the data, such as
biases, to research incentives, such as optimizing for publication. We
reviewed roadblocks to developing and assessing methods. Building our
analysis on evidence from the literature and data challenges, we showed
that potential biases can creep in at every step.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;First, larger datasets do not bring increased prediction accuracy and
may suffer from biases.&lt;/li&gt;
&lt;li&gt;Second, evaluations often miss the target, with evaluation error larger
than algorithmic improvements, improper evaluation procedures and
leakage, metrics that do not reflect the application, incorrectly chosen
baselines, and improper statistics.&lt;/li&gt;
&lt;li&gt;Finally, we show how publishing too often leads to distorted incentives.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On a positive note, we also discuss ongoing efforts to counteract these
problems and provide recommendations on how to further address them in
the future.&lt;/p&gt;
&lt;p&gt;This was a fun exercise. I realize that I still need to sit on it and
introspect how it has shaped my research agenda, because I think it has
pushed me to choose specific emphases (such as model evaluation, or
focusing on rich data sources).&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="privacy-preserving-synthetic-educational-data-generation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-11"&gt;Privacy-preserving synthetic educational data generation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Soda also works on applications other than health, for instance
education. In this direction, I would like to highlight work in which I
did not participate, by Jill-Jenn Vie, another PI of the team.&lt;/p&gt;
&lt;p&gt;Institutions collect massive learning traces but may not disclose them
for privacy reasons. Synthetic data generation opens new opportunities for
research in education. &lt;a class="reference external" href="https://hal.inria.fr/hal-03715416"&gt;Vie 2022&lt;/a&gt;
presented a generative model for educational data that can preserve the
privacy of participants, and an evaluation framework for comparing
synthetic data generators. We show how naive pseudonymization can lead to
re-identification threats and suggest techniques to guarantee privacy. We
evaluate our method on existing massive educational open datasets.&lt;/p&gt;
&lt;p&gt;The tension between privacy of individuals and the need for datasets for
open science is a real and important one.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This was just a quick glance at what we do at soda, and we are just
warming up. I am super excited about this research. I hope that it will
matter.&lt;/p&gt;
&lt;p&gt;I truly believe that more and better machine learning can help health
and social sciences draw new insights from new datasets.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="health"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>My Mayavi story: discovering open source communities</title><link href="https://gael-varoquaux.info/programming/my-mayavi-story-discovering-open-source-communities.html" rel="alternate"></link><published>2022-07-10T00:00:00+02:00</published><updated>2022-07-10T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2022-07-10:/programming/my-mayavi-story-discovering-open-source-communities.html</id><summary type="html">&lt;p class="align-right"&gt;&lt;em&gt;The Mayavi Python software, and my personal history: A thread on
the Python and scipy ecosystems, building an open-source codebase, and
meeting really cool and friendly people&lt;/em&gt;&lt;/p&gt;
&lt;img alt="" class="align-right" src="attachments/mayavi/mayavi_ets.png" /&gt;
&lt;p&gt;I am writing today as a goodbye to the project: I used to be one of the
core contributors and maintainers but have been …&lt;/p&gt;</summary><content type="html">&lt;p class="align-right"&gt;&lt;em&gt;The Mayavi Python software, and my personal history: A thread on
the Python and scipy ecosystems, building an open-source codebase, and
meeting really cool and friendly people&lt;/em&gt;&lt;/p&gt;
&lt;img alt="" class="align-right" src="attachments/mayavi/mayavi_ets.png" /&gt;
&lt;p&gt;I am writing today as a goodbye to the project: I used to be one of the
core contributors and maintainers but have been inactive for a while for
lack of time. By common agreement, we recently removed my commit
rights to limit security risks.&lt;/p&gt;
&lt;p&gt;Mayavi brought me so much!&lt;/p&gt;
&lt;div class="section" id="the-start-of-my-adventure-with-mayavi"&gt;
&lt;h2&gt;The start of my adventure with Mayavi&lt;/h2&gt;
&lt;img alt="" class="align-right" src="attachments/mayavi/example_magnetic_field_lines.jpg" /&gt;
&lt;p&gt;I got involved around 2007: I needed 3D visualization of magnetic fields as I was designing coils for my PhD &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;This led to an example in the Mayavi docs &lt;a class="reference external" href="http://docs.enthought.com/mayavi/mayavi/auto/example_magnetic_field_lines.html"&gt;http://docs.enthought.com/mayavi/mayavi/auto/example_magnetic_field_lines.html&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I started as an early user of Mayavi2, a rewrite of Mayavi, and
eventually joined Prabhu Ramachandran and Enthought as a contributor.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="what-is-mayavi"&gt;
&lt;h2&gt;What is Mayavi?&lt;/h2&gt;
&lt;p&gt;Mayavi is a scientific 3D visualization library in Python.&lt;/p&gt;
&lt;p&gt;It enables interactive visualization to understand complex information in
3D, such as multi-physics fields, combined with &lt;a class="reference external" href="https://docs.enthought.com/mayavi/mayavi/mlab.html"&gt;simple scripting&lt;/a&gt; to integrate into a
broader scientific-computing workflow.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Mayavi was designed and founded around 2000 by Prabhu Ramachandran, a
researcher in computational fluid dynamics at IIT Bombay and a long-time
figure of the open-source Python world.&lt;/p&gt;
&lt;p&gt;The key idea was to make VTK, a powerful C++ visualization library,
easily usable through a Python interface.&lt;/p&gt;
&lt;p&gt;Mayavi bridged the gap between VTK’s C++ data structures and efficient Python data structures, exposing them as numpy arrays without copies.&lt;/p&gt;
&lt;p&gt;It uses tools from Enthought (namely the Enthought Tool Suite) for an
interactive GUI built on a Python object model, fully scriptable (the
vision is explained in &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-00502548"&gt;an article Prabhu and I wrote&lt;/a&gt;)&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/mayavi/mayavi_application.png" /&gt;
&lt;p class="caption"&gt;Mayavi is a full-blown interactive application&lt;/p&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/mayavi/mayavi_mlab.jpg" /&gt;
&lt;p class="caption"&gt;Mayavi is also a Python library, for full scripting&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="working-on-mayavi-taught-me-code-and-communities"&gt;
&lt;h2&gt;Working on Mayavi taught me code and communities&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/mayavi/mayavi_ipython.png" /&gt;
&lt;p class="caption"&gt;Mayavi used within an interactive IPython – an image from the
&lt;a class="reference external" href="https://ieeexplore.ieee.org/abstract/document/5725237"&gt;Mayavi paper&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I joined to help with the “mlab” interface, for even simpler Python
scripting built upon functions. My goal was to make Mayavi natural to
Matlab and matplotlib users, a product vision that probably helped push
its popularity even further.&lt;/p&gt;
&lt;p&gt;I was an isolated PhD student in a physics lab. Emboldened by a
discussion with Fernando Perez, I started contributing and discussing
with Prabhu Ramachandran. I remember my first Skype discussion with
Prabhu; I was very intimidated.&lt;/p&gt;
&lt;p&gt;Understanding this large codebase was hard! And yet, slowly but surely,
I started making more and more meaningful contributions: on mlab, then on
the broader codebase, fixing bugs, and doing a lot of work on
documentation and examples…&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/mayavi/scipy_conf.jpg" /&gt;
&lt;p class="caption"&gt;Prabhu and myself are in this scipy conference group picture! From &lt;a class="reference external" href="https://slideshare.net/enthought/scientific-computing-with-python-webinar-august-28-2009"&gt;https://slideshare.net/enthought/scientific-computing-with-python-webinar-august-28-2009&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Then Enthought funded my overseas travel to the scipy conference: a big
deal for me, as I was a penniless PhD student.&lt;/p&gt;
&lt;p&gt;My Mayavi story is that of meeting amazing people in the Python, scipy,
and pydata world; people who believe in building a tool stack to
democratize scientific computing; people from all over the world,
friendly, welcoming, passionate.&lt;/p&gt;
&lt;p&gt;It founded my belief in communities.&lt;/p&gt;
&lt;p&gt;This adventure led me to learn software engineering (&lt;a class="reference external" href="https://software-carpentry.org/"&gt;Software carpentry&lt;/a&gt; really helped me get started), to
work at Enthought (a software startup central to scientific computing in
Python), to change careers from physics to computing, to join Inria (the
French national research institute for maths and computing), and to other
open-source projects…&lt;/p&gt;
&lt;p&gt;Mayavi was crucial to my personal adventure. Thank you Prabhu! Thank you
Enthought! Thank you, Scipy community!&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="mayavi"></category><category term="python"></category><category term="science"></category><category term="conferences"></category></entry><entry><title>2021 highlight: Decoding brain activity to new cognitive paradigms</title><link href="https://gael-varoquaux.info/science/2021-highlight-decoding-brain-activity-to-new-cognitive-paradigms.html" rel="alternate"></link><published>2022-02-24T00:00:00+01:00</published><updated>2022-02-24T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2022-02-24:/science/2021-highlight-decoding-brain-activity-to-new-cognitive-paradigms.html</id><summary type="html">&lt;p class="align-right"&gt;&lt;em&gt;Broad decoding models that can specialize to discriminate
closely-related mental processes with limited data&lt;/em&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;TL;DR&lt;/p&gt;
&lt;p&gt;Decoding models can help isolate which mental processes are implied
by the activation of given brain structures. But to support a broad
conclusion, they must be trained on many studies, a difficult problem
given …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p class="align-right"&gt;&lt;em&gt;Broad decoding models that can specialize to discriminate
closely-related mental processes with limited data&lt;/em&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;TL;DR&lt;/p&gt;
&lt;p&gt;Decoding models can help isolate which mental processes are implied
by the activation of given brain structures. But to support a broad
conclusion, they must be trained on many studies, a difficult problem
given the unclear relations between tasks of different studies. We
contributed a method that infers these links from the data. Their
validity is established by generalization to new tasks. Some
cognitive neuroscientists prefer qualitative consolidation of
knowledge, but such an approach is hard to put to the test.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="context-infering-cognition-from-brain-imaging"&gt;
&lt;h2&gt;Context: Inferring cognition from brain imaging&lt;/h2&gt;
&lt;p&gt;Often, when interpreting functional brain images, one would like to
conclude on the individual’s ongoing mental processes. But this
conclusion is not directly warranted by brain-imaging studies, as they do
not control the brain activity, but rather engage the participant via a
cognitive paradigm made of psychological manipulations &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. &lt;em&gt;Brain
decoding&lt;/em&gt; can help ground such &lt;em&gt;reverse inferences&lt;/em&gt; &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;, by using
machine learning to predict aspects of the task.&lt;/p&gt;
&lt;p&gt;But a brain decoding model can seldom support broad reverse-inference
claims, as typical decoding models are trained on a given study that
samples only a few aspects of cognition. Thus the decoding model only
supports conclusions on the interpretation of brain activity within the
study’s narrow scope.&lt;/p&gt;
&lt;p&gt;Another challenge is that of statistical power. Most functional brain
imaging studies comprise only a few dozen subjects, compromising
statistical power &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;, even more so when using machine learning &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-4"&gt;[4]&lt;/a&gt;.
While large acquisition efforts exist, these must focus on broad
psychological manipulations that do not probe fine aspects of mental
processes.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2006, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1364661305003360"&gt;Can cognitive processes be inferred from
neuroimaging data?&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2011, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S0896627311009895"&gt;Inferring Mental States from Neuroimaging Data:
From Reverse Inference to Large-Scale Decoding&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2017, &lt;a class="reference external" href="https://www.nature.com/articles/nrn.2016.167"&gt;Scanning the horizon: towards transparent and
reproducible neuroimaging research&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-4"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Varoquaux 2018, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305311"&gt;Cross-validation failure: Small sample sizes lead
to large error bars&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="contribution-informing-specialized-decoding-questions-from-broad-data-accumulation"&gt;
&lt;h2&gt;Contribution: Informing specialized decoding questions from broad data accumulation&lt;/h2&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008795"&gt;Mensch 2021&lt;/a&gt;,
we designed a machine-learning method that can &lt;strong&gt;jointly analyze many
unrelated functional imaging studies to build representations associating
brain activity to mental processes&lt;/strong&gt;. These representations can then be
used to &lt;strong&gt;improve brain decoding in new unrelated studies&lt;/strong&gt;, thus bringing
statistical-power improvements even to experiments probing fine aspects
of mental processes not studied in large cohorts.&lt;/p&gt;
&lt;p&gt;One roadblock to accumulating information across
cognitive neuroimaging studies is that they all probe different, yet related,
mental processes. Framing them all in the same analysis faces the lack of a
universally-adopted language to describe cognitive paradigms. Our prior
work &lt;a class="footnote-reference" href="#footnote-5" id="footnote-reference-5"&gt;[5]&lt;/a&gt; on this endeavior –the quest for universal decoding across
studies–, relied on describing each experimental paradigm in an ontology
of cognitive processes and psychological manipulations. However, such
approach is not scalable. Here, rather, we infered the latent structure
of the tasks from the data, without explicitely modeling the links
between studies. In my eye, this was a very important ingredient of our
work, and it is non trivial that it enables improving the decoding of
unrelated studies.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-5" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-5"&gt;[5]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Varoquaux 2018, &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;Atlases of cognition with large-scale human brain
mapping&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Capturing &lt;em&gt;representations&lt;/em&gt; was key to transferring across studies:
representations of brain activity captured distributed brain structures
predictive of behavior; representations of tasks across studies captured
decompositions of behavior well explained by brain activity. Of course,
the representations that we extracted were not as sharp as the stylized
functional modules that have been manually compiled from decades of
cognitive-neuroscience research.&lt;/p&gt;
&lt;p&gt;From a computer-science standpoint, we used a deep-learning architecture.
This is the first time that we witnessed a
deep-learning architecture outperforming well-tuned shallow baselines on
functional neuroimaging data &lt;a class="footnote-reference" href="#footnote-6" id="footnote-reference-6"&gt;[6]&lt;/a&gt;. This success is likely due to the
massive amount of data that we assembled: as our method can
readily work across studies, we were able to apply it to 40000
subject-level contrast maps.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-6" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-6"&gt;[6]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;There have been many reports of deep architectures on functional
brain imaging. However, in our experience, good shallow benchmarks
are hard to beat.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2021_highlights/mston.png" /&gt;
&lt;p class="caption"&gt;Our deep-learning architecture&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-research-agenda-that-does-not-win-all-hearts"&gt;
&lt;h2&gt;A research agenda that does not win all hearts&lt;/h2&gt;
&lt;p&gt;Our underlying research agenda is to &lt;strong&gt;piece together
cognitive-neuroimaging evidence on a wide variety of tasks and mental
processes&lt;/strong&gt;. In cognitive neuroscience, such consolidation of knowledge
is done via review articles that assemble findings from many
publications into a consistent picture of how tasks decompose into
elementary mental processes implemented by brain functional modules. The
literature review and the ensuing neuro-cognitive model are however verbal
by nature: assembling qualitative findings. I, for one, would like to
have quantitative tools to foster a big-picture view. Of course, the
challenge with quantitative approaches such as ours is to capture all
qualitative aspects of the question.&lt;/p&gt;
&lt;p&gt;Over the years that I have been pushing these ideas, I find that they are
met with resistance from some elite cognitive neuroscientists who see
them as unexciting at best. The same people are enthusiastic about new
data-analysis methods to dissect brain responses in fine detail with a
detailed model of a given task, despite limited statistical power and
external validity. My feeling is that &lt;strong&gt;the question of how
various tasks are related is perceived as belonging to the walled garden
of cognitive neuroscientists, not to be put to the test by statistical
methods&lt;/strong&gt; &lt;a class="footnote-reference" href="#footnote-7" id="footnote-reference-7"&gt;[7]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-7" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-7"&gt;[7]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article/peerReview?id=10.1371/journal.pcbi.1008795"&gt;The second round of review of our manuscript&lt;/a&gt;
certainly felt as if the method was judged through cognitive-neuroscience
lenses, rather than on the validity of the data analysis that it entailed.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Yet, as clearly exposed by Tal Yarkoni in his &lt;a class="reference external" href="https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/generalizability-crisis/AD386115BA539A759ACB3093760F4824"&gt;Generalizability crisis&lt;/a&gt;,
drawing conclusions on mental organization from a few repetitions of a
task risks picking up idiosyncrasies of the task or the stimuli.
A starting point of our work (&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008795"&gt;Mensch 2021&lt;/a&gt;)
was the fall of statistical power in cognitive neuroscience, documented
by &lt;a class="reference external" href="https://www.nature.com/articles/nrn.2016.167"&gt;Poldrack 2017&lt;/a&gt;, but
one reviewer censored this argument &lt;a class="footnote-reference" href="#footnote-8" id="footnote-reference-8"&gt;[8]&lt;/a&gt;. This exchange felt to me like &lt;strong&gt;a
field refusing to discuss its challenges publicly&lt;/strong&gt;, which leaves no room for
methods researchers such as myself to address them.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-8" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-8"&gt;[8]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article/peerReview?id=10.1371/journal.pcbi.1008795"&gt;Comments in the first review&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>Hiring an engineer and post-doc to simplify data science on dirty data</title><link href="https://gael-varoquaux.info/programming/hiring-an-engineer-and-post-doc-to-simplify-data-science-on-dirty-data.html" rel="alternate"></link><published>2021-10-29T00:00:00+02:00</published><updated>2021-10-29T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2021-10-29:/programming/hiring-an-engineer-and-post-doc-to-simplify-data-science-on-dirty-data.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Join us to work on reinventing data-science practices and tools to
produce robust analysis with less data curation.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is well known that data cleaning and preparation are a heavy burden to
the data scientist.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/big_data_borat_cleaning_data.png" style="width: 400px;" /&gt;
&lt;div class="section" id="dirty-data-research"&gt;
&lt;h2&gt;Dirty data research&lt;/h2&gt;
&lt;p&gt;In the &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;dirty data project&lt;/a&gt;, we
have been conducting machine-learning research …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Join us to work on reinventing data-science practices and tools to
produce robust analysis with less data curation.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is well known that data cleaning and preparation are a heavy burden to
the data scientist.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/big_data_borat_cleaning_data.png" style="width: 400px;" /&gt;
&lt;div class="section" id="dirty-data-research"&gt;
&lt;h2&gt;Dirty data research&lt;/h2&gt;
&lt;p&gt;In the &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;dirty data project&lt;/a&gt;, we
have been conducting machine-learning research on how better
statistical models could readily ingest non-curated data and reduce the
need for data preparation in data science. We now have a growing
understanding of the problems, theoretical and practical, that lie at the
intersection of statistics and databases.&lt;/p&gt;
&lt;p&gt;Machine learning leads to different tradeoffs than traditional
inferential statistics (because it can rely on more powerful models). For
instance, we now have a good understanding of the case of missing values:
in &lt;a class="reference external" href="https://arxiv.org/abs/2106.00311"&gt;Le Morvan et al&lt;/a&gt;, we showed that
for traditional methods, ignorable missingness &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; and “good”
imputation are important; for prediction, however, flexible
predictors are what matters, and they can work under any missingness
mechanism.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;“Missing at Random”, where missingness does not depend on the
unobserved values&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Similarly, we have made good progress on tolerating normalization errors
and typos. We find that, rather than attempting to deduplicate entries or
fix typos, it is best to expose similarities and ambiguities to
a flexible learning algorithm. The simplest and most reliable methods are
implemented in the &lt;a class="reference external" href="http://dirty-cat.github.io/"&gt;dirty-cat&lt;/a&gt; library, to
facilitate the life of data scientists.&lt;/p&gt;
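&lt;p&gt;The idea behind these encoders can be sketched with scikit-learn alone (hypothetical category names; dirty-cat itself provides more refined variants): represent each dirty category by its character 3-gram counts, so that typos and formatting variants of the same entry land close together without explicit deduplication:&lt;/p&gt;

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Dirty categorical entries: typos and formatting variants of the same city
cities = ["London", "Londn", "london ", "Paris", "Pariis", "Berlin"]

# Character 3-grams give morphologically close strings close vectors,
# which a flexible downstream learner can exploit directly
vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vec.fit_transform(cities)

sim = cosine_similarity(X)
print(sim[0, 1])  # "London" vs "Londn": clearly positive
print(sim[0, 3])  # "London" vs "Paris": no shared 3-grams
```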
&lt;/div&gt;
&lt;div class="section" id="reinventing-data-science"&gt;
&lt;h2&gt;Reinventing data science&lt;/h2&gt;
&lt;p&gt;With this understanding (and even more exciting on-going research), we
want to revisit data science. Machine learning can provide flexible
models for many usages of data science. Our goal is to use it to help
assemble and analyze datasets while minimizing human effort. For
this, we need tools that can answer typical data-science questions using
machine learning, starting from raw data that is often spread across
multiple files or multiple tables of a database. Building these tools
requires data-science research, a new vision of data-science APIs, and
careful software crafting.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="join-us-in-this-adventure"&gt;
&lt;h2&gt;Join us in this adventure&lt;/h2&gt;
&lt;p&gt;We have an &lt;a class="reference external" href="https://project.inria.fr/dirtydata/team/"&gt;awesome team&lt;/a&gt;,
with a great mix of people of different seniorities and areas of expertise
(statistics, machine learning, databases, software engineering), sharing
offices with the &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/home/"&gt;scikit-learn team at Inria&lt;/a&gt;. But we have too many
exciting ideas, so we are growing this team.&lt;/p&gt;
&lt;div class="section" id="a-data-science-engineer-new-software-with-new-ideas"&gt;
&lt;h3&gt;A data-science engineer: new software with new ideas&lt;/h3&gt;
&lt;p&gt;We are looking for someone with a background in data science or numerical
Python programming to join us, to help with designing a new data-science
library, evolving from &lt;a class="reference external" href="http://dirty-cat.github.io/"&gt;dirty-cat&lt;/a&gt;, and
to help with data-science experimentation for the research.&lt;/p&gt;
&lt;p&gt;We like people who care about data and designing good tools, and who have
a vision of data science. We are happy to consider different levels of
experience. Apply on &lt;a class="reference external" href="https://jobs.inria.fr/public/classic/fr/offres/2021-04182"&gt;the job offer&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="a-post-doc-researcher-science-joining-data-engineering-to-deep-learning"&gt;
&lt;h3&gt;A post-doc researcher: science joining data engineering to deep learning&lt;/h3&gt;
&lt;p&gt;We will soon be announcing a post-doc position to join the team for
research in this scope. We are interested in questions around learning on
relational or tabular data, or learning data integration. We have plenty
of ideas to explore around embeddings in databases, learning to
aggregate, learning on sets, graph neural networks for databases, or
distributional matching for entity and schema alignment.
We expect to borrow tools (conceptual and practical) from deep
learning, but to blend them with techniques from data integration,
knowledge graphs, and databases.&lt;/p&gt;
&lt;p&gt;The job posting will be out soon, but I am running out of the office
right now to go on vacation (work-life balance also matters to us).&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Diversity is important&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://project.inria.fr/dirtydata/team/"&gt;Our team&lt;/a&gt; is not as
diverse as I would like it to be (though probably doing better than the
typical computer-science team). We love diverse candidates. Do not
hesitate.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="machine learning"></category><category term="data science"></category><category term="dirty data"></category><category term="hiring"></category></entry><entry><title>Hiring someone to develop scikit-learn community and industry partners</title><link href="https://gael-varoquaux.info/programming/hiring-someone-to-develop-scikit-learn-community-and-industry-partners.html" rel="alternate"></link><published>2021-09-14T00:00:00+02:00</published><updated>2021-09-14T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2021-09-14:/programming/hiring-someone-to-develop-scikit-learn-community-and-industry-partners.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;With the growth of scikit-learn and the wider PyData ecosystem, we
want to recruit in the Inria scikit-learn team for &lt;a class="reference external" href="https://recrutement.inria.fr/public/classic/en/offres/2021-04058"&gt;a new role&lt;/a&gt;.
Departing from our usual focus on excellence in algorithms,
statistics, or code, we want to add to the team someone with some
technical understanding, but an …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;With the growth of scikit-learn and the wider PyData ecosystem, we
want to recruit in the Inria scikit-learn team for &lt;a class="reference external" href="https://recrutement.inria.fr/public/classic/en/offres/2021-04058"&gt;a new role&lt;/a&gt;.
Departing from our usual focus on excellence in algorithms,
statistics, or code, we want to add to the team someone with some
technical understanding, but an eye for people dynamics. Are you
passionate about developing open-source communities for data science?
This job is a unique opportunity.&lt;/p&gt;
&lt;p class="last"&gt;The mandate will be on the one hand to develop the wider community
behind scikit-learn, on the other hand to foster the foundation’s
partnerships, as this is our funding.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="context-scikit-learn-inria-foundation"&gt;
&lt;h2&gt;Context: Scikit-learn &amp;#64; Inria foundation&lt;/h2&gt;
&lt;div class="section" id="the-growth-of-scikit-learn"&gt;
&lt;h3&gt;The growth of Scikit-learn&lt;/h3&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/scikit-learn-logo.png" style="width: 200px;" /&gt;
&lt;p&gt;Scikit-learn is used massively, from schools to major companies. It
underpins business-intelligence analyses and automates processes. Its
reliability is crucial to enterprises. Its well-documented methods
help data scientists run valid analyses.&lt;/p&gt;
&lt;p&gt;Scikit-learn has grown hugely, and is still growing, in terms of userbase
and expectations of quality. These days, the development team is large,
with many grass-roots volunteers and some contributors spending a
sizeable fraction of their work time on the project.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="../programming/attachments/sklearn_website_access.png" style="width: 450px;" /&gt;
&lt;p class="caption"&gt;Number of monthly website access&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-inria-foundation"&gt;
&lt;h3&gt;Scikit-learn &amp;#64; Inria foundation&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Birth of a foundation&lt;/strong&gt;
To ensure reliable funding for a small core of scikit-learn developers, we
set up a foundation &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; a few years ago. The goal was to make sure that
we did not lose our experienced developers.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;See &lt;a class="reference external" href="http://gael-varoquaux.info/programming/a-foundation-for-scikit-learn-at-inria.html"&gt;the motivating announcement&lt;/a&gt; and the &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr"&gt;website&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Achieving sustainability&lt;/strong&gt;
The resulting structure is set up to provide a career path to a few of
our core people. As a consequence, it is a French legal entity, acting as
an employer, funded via sponsorship agreements with a few
major economic users of scikit-learn (check out &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr"&gt;the list of our
sponsors&lt;/a&gt;). The priorities of
the team are set &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/how-are-the-priorities-of-the-consortium-defined/"&gt;jointly between the sponsors and the open-source
community&lt;/a&gt;. The setup is not without flaws; in particular, it forces us to employ people &lt;a class="reference external" href="https://www.inria.fr/en/centre-inria-saclay-ile-de-france"&gt;on campus&lt;/a&gt;, but it enables giving proper benefits to these contributors.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The team&lt;/strong&gt; The &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/people/"&gt;scikit-learn team at Inria foundation&lt;/a&gt; currently comprises 4
very experienced developers. In addition, we have other sources of
funding – research projects, &lt;a class="reference external" href="https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn/"&gt;the scikit-learn MOOC&lt;/a&gt; –
that we use to create a larger team (currently 3 full-time positions).
Finally, various researchers on campus are heavily invested in
scikit-learn or related projects such as joblib. As a result, the amount
of technical skill is staggering.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Long story short, we want to add new DNA to this awesome team: someone
into peopleware as much as software.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="mandate"&gt;
&lt;h2&gt;Mandate&lt;/h2&gt;
&lt;p&gt;The goal of &lt;a class="reference external" href="https://recrutement.inria.fr/public/classic/en/offres/2021-04058"&gt;the new position&lt;/a&gt; is
to talk both to our wider open-source world and our corporate partners.
Both are crucial to fostering growth for scikit-learn.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://recrutement.inria.fr/public/classic/en/offres/2021-04058"&gt;official job posting&lt;/a&gt;
doesn’t convey as well as I would like what is behind this position. I’m
probably to blame :).&lt;/p&gt;
&lt;div class="section" id="growing-our-open-source-community"&gt;
&lt;h3&gt;Growing our open-source community&lt;/h3&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/herdingcats.jpg" style="width: 300px;" /&gt;
&lt;p&gt;As both the scikit-learn and the PyData community have grown,
communication becomes a bottleneck. There are so many little things that
make an open-source community productive: facilitating on-boarding,
dividing the workload efficiently, documenting decision making well,
organizing fun sprints, making sure that issue triaging is efficient…&lt;/p&gt;
&lt;p&gt;We are looking for someone passionate about open-source
communities and who wants to be herding such cats.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="increasing-our-corporate-visibility"&gt;
&lt;h3&gt;Increasing our corporate visibility&lt;/h3&gt;
&lt;p&gt;Scikit-learn is one of the most used data-science tools. However, when
talking to senior decision makers, we find that their perception sometimes
differs. Indeed, we are competing for visibility with many powerful actors.&lt;/p&gt;
&lt;p&gt;We must communicate beyond the open-source world to develop
a strong brand for scikit-learn. Good communication will help us find new
sponsors, a key ingredient of growth and sustainability for scikit-learn.&lt;/p&gt;
&lt;p&gt;We need to communicate about our progress and our actions, as people are
often surprised by the breadth of our contributions &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;For instance, the foundation team has contributed &lt;a class="reference external" href="https://youtu.be/UVL4LFy8ch0?t=1437"&gt;improvements to
CPython itself&lt;/a&gt;, and maintains
&lt;a class="reference external" href="https://github.com/cloudpipe/cloudpickle"&gt;cloudpickle&lt;/a&gt;, a central
component of the data ecosystem.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As a foundation, we need to be transparent and accountable, which is
harder than it seems.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-good-fit"&gt;
&lt;h2&gt;A good fit&lt;/h2&gt;
&lt;a class="reference external image-reference" href="https://www.flickr.com/photos/randychiu/4602851011/"&gt;&lt;img alt="One Man Band, CCby2.0 from randychiu" class="align-right" src="../programming/attachments/one_man_band.jpg" style="width: 250px;" /&gt;&lt;/a&gt;
&lt;p&gt;We are looking for someone who is into open source, but who also likes
writing blog posts, engaging on social networks, organizing events,
presenting scikit-learn, and improving processes.&lt;/p&gt;
&lt;p&gt;We believe that such a job is best done by someone who has some technical
interest in scikit-learn: good advocacy requires good understanding.&lt;/p&gt;
&lt;p&gt;Maybe this sounds daunting? Few people have all the skills, let alone the
experience. We are actually more &lt;strong&gt;looking for a passionate and promising
candidate&lt;/strong&gt;, whatever the length of the resume. We believe that
&lt;strong&gt;talented people can learn&lt;/strong&gt;, when they like what they do.&lt;/p&gt;
&lt;p&gt;This is a job about open source, for open source! It’s not a perfect job:
we have many administrative constraints in running the foundation, and we
pay ourselves less than a non-open-source job would.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Apply now&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We are looking forward to your application. You can submit it on
&lt;a class="reference external" href="https://recrutement.inria.fr/public/classic/en/offres/2021-04058"&gt;the official job offer&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="open source"></category><category term="growth"></category><category term="communities"></category><category term="scikit-learn"></category><category term="inria"></category><category term="foundation"></category></entry><entry><title>2020: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2020-my-scientific-year-in-review.html" rel="alternate"></link><published>2021-01-05T00:00:00+01:00</published><updated>2021-01-05T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2021-01-05:/science/2020-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;The year 2020 has undoubtedly been interesting: the covid19 pandemic
struck while I was on a work sabbatical in Montréal, at the &lt;a class="reference external" href="https://www.mcgill.ca/neuro/"&gt;MNI&lt;/a&gt; and the &lt;a class="reference external" href="https://mila.quebec/"&gt;MILA&lt;/a&gt;,
and it further pushed my interest in machine learning for health care.
&lt;strong&gt;My highlights this year revolve around basic and applied data-science
for health&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;The year 2020 has undoubtedly been interesting: the covid19 pandemic
struck while I was on a work sabbatical in Montréal, at the &lt;a class="reference external" href="https://www.mcgill.ca/neuro/"&gt;MNI&lt;/a&gt; and the &lt;a class="reference external" href="https://mila.quebec/"&gt;MILA&lt;/a&gt;,
and it further pushed my interest in machine learning for health care.
&lt;strong&gt;My highlights this year revolve around basic and applied data-science
for health&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#mining-electronic-health-records-for-covid-19" id="toc-entry-1"&gt;Mining electronic health records for covid-19&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#machine-learning-for-dirty-data" id="toc-entry-2"&gt;Machine learning for dirty data&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#supervised-learning-with-missing-values-beyond-imputation" id="toc-entry-3"&gt;Supervised learning with Missing values: beyond imputation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#machine-learning-without-normalizing-entries" id="toc-entry-4"&gt;Machine-learning without normalizing entries&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#making-sense-of-brain-functional-signals" id="toc-entry-5"&gt;Making sense of brain functional signals&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#neuroquery-brain-mapping-any-neuroscience-query" id="toc-entry-6"&gt;NeuroQuery: brain mapping any neuroscience query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-high-resolution-brain-functional-atlas" id="toc-entry-7"&gt;A high-resolution brain functional atlas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="mining-electronic-health-records-for-covid-19"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Mining electronic health records for covid-19&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Hospital databases are rich and messy&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hospital databases&lt;/strong&gt;
In March, we &lt;a class="reference external" href="https://www.inria.fr/en/scikiteds-visualization-tool-monitoring-flow-sick-patients"&gt;teamed up with the hospitals around Paris&lt;/a&gt; that were suffering from a severe overload due to a new pathology,
covid-19. The challenge was to extract information from the huge
databases of the hospital management system: What were the characteristics
of the patients? How were the resources of the hospital evolving? Of the
treatments that were empirically attempted, which were most effective?&lt;/p&gt;
&lt;p&gt;The hospital databases are hugely promising, because &lt;strong&gt;they offer at
almost no cost information on all the patients that go through the
hospital&lt;/strong&gt;. As we were dealing with a conglomerate of 39 hospitals, this
information covers millions of patients each year: excellent
epidemiological coverage.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Challenging data science&lt;/strong&gt;
Our work was classic data science: we did a lot of data management,
crafting SQL queries and munging pandas dataframes to create data tables
for statistics and visualizations. We interacted closely with the
hospital management and the doctors to understand the information of
interest. As we moved forward, it became clear that behind each “simple”
question, there were challenges of statistical validity. We did not want
to produce a figure that was misleading. Typical challenges were:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Information needed complicated transformations (such as following a
patient hopping across hospitals to capture the patient’s status)&lt;/li&gt;
&lt;li&gt;Information was represented differently in the different hospitals&lt;/li&gt;
&lt;li&gt;Incorrect inputs prevented aggregation (such as an entry date erroneously
after the exit date, or missing values)&lt;/li&gt;
&lt;li&gt;The database had biases compared to the ground truth (simple
oxygen-therapy acts go unreported more often than complicated invasive
ventilation)&lt;/li&gt;
&lt;li&gt;Censoring effects prevented the use of naive statistics (20 days into
the epidemic outbreak, most hospital stays are short simply because
patients have entered the hospitals recently)&lt;/li&gt;
&lt;li&gt;A lot of information was present as unnormalized text, sometimes in
long hand-written notes, full of acronyms and errors due to character
recognition.&lt;/li&gt;
&lt;li&gt;The data were of course often a consequence of treatment policy (the
choices of the medical staff in terms of patient handling and
measures), and hence not directly interpretable in causal or
interventional terms.&lt;/li&gt;
&lt;/ul&gt;
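&lt;p&gt;The censoring point can be made concrete with a small simulation (a sketch with synthetic numbers, not the hospital data): averaging only completed stays under-estimates the true mean stay length, because long stays are more likely to still be ongoing at the time of analysis:&lt;/p&gt;

```python
import numpy as np

rng = np.random.RandomState(0)
# True hospital-stay lengths (days), exponential with mean 10
true_stays = rng.exponential(scale=10.0, size=10000)

# Patients enter uniformly over a 20-day outbreak; we analyze on day 20
entry_day = rng.uniform(0.0, 20.0, size=10000)
still_in_hospital = entry_day + true_stays > 20.0  # censored stays
completed = true_stays[~still_in_hospital]

# Averaging only the completed stays is biased low: long stays are
# disproportionately still ongoing, hence censored out of the sample
print(true_stays.mean())  # close to 10
print(completed.mean())   # markedly smaller
```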
&lt;p&gt;These challenges were very interesting to me, as they related directly to
my research agenda of &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;facilitating the processing of “dirty data”&lt;/a&gt; (more on that below).&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Most of the work that we did was not oriented toward publication, but
rather toward addressing urgent needs of the hospitals. Some scholarly
contributions did come out:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Part of the extracted data are consolidated worldwide for medical
studies (&lt;a class="reference external" href="https://www.nature.com/articles/s41746-020-00308-0"&gt;Brat et al, Nature Digital Medicine 2020&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;We used causal-inference methods to estimate the treatment effects of
HCQ with and without Azithromycin (&lt;a class="reference external" href="https://www.medrxiv.org/content/10.1101/2020.06.16.20132597v1"&gt;Sbidian et al, MedRxiv 2020&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The data are used in follow-up medical studies (e.g. associating
mortality and obesity: &lt;a class="reference external" href="https://onlinelibrary.wiley.com/doi/full/10.1002/oby.23014"&gt;Czernichow et al, Obesity 2020&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Biomedical entity recognition&lt;/strong&gt; A major AI difficulty in this work is
recognizing biomedical entities, such as conditions or treatments, in the
various texts. Coincidentally, we had been working on simplifying the
state-of-the-art pipelines for biomedical entity linking. While this
research work was not used on the hospital data, because it was too
bleeding edge, it led to an AAAI paper (&lt;a class="reference external" href="https://arxiv.org/abs/2012.08844"&gt;Chen et al, AAAI 2021&lt;/a&gt;) on a state-of-the-art model for
biomedical entity linking that is much more lightweight than current
approaches.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-for-dirty-data"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Machine learning for dirty data&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Machine learning methods that can robustly ingest non-curated data.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;Dirty Data project&lt;/a&gt;, which we
undertook a few years ago, is really bearing fruit.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="supervised-learning-with-missing-values-beyond-imputation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Supervised learning with Missing values: beyond imputation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The classic view on processing data with missing values is to try to
&lt;em&gt;impute&lt;/em&gt; the missing values: replace them by probable values (or, better,
compute the distribution of the unobserved values given the observed
ones). However, such an approach needs a model of the missing-values
mechanism; this is simple only when the values are missing at random.
We have been studying the alternative view, based on directly computing
a predictive function to be applied to data with missing values.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/mnar_versus_mcar.png" style="width: 500px;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Missing-values mechanisms&lt;/strong&gt;: black dots are fully-observed data
points, while grey ones are partially observed. The left panel
displays a missing-at-random situation, where missingness is
independent of the underlying values. On the contrary, in a
missing-not-at-random situation (right panel), whether values are
observed or not depends on the underlying values (potentially
unobserved).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v108/morvan20a.html"&gt;Le Morvan et al, AIStats 2020&lt;/a&gt; studied the
seemingly-simple case of a linear generative mechanism and showed that,
with missing values, the optimal predictor was a complex, piecewise
linear, function of the observed data concatenated with the
missing-values mask. This function can be implemented with a neural
network with ReLu activation functions, fed with data where missing
values are replaced by zeros and corresponding indicator features are
added.&lt;/p&gt;
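&lt;p&gt;This recipe can be sketched with scikit-learn building blocks (an illustration on synthetic data, not the authors’ code):&lt;/p&gt;

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(400, 3))
y = X.sum(axis=1) + 0.1 * rng.normal(size=400)
X[rng.binomial(1, 0.25, size=X.shape).astype(bool)] = np.nan

# Zero-fill the missing entries and append the missingness mask as features
mask = np.isnan(X).astype(float)
X_in = np.hstack([np.nan_to_num(X), mask])

# A ReLU network on this representation can express the piecewise-linear
# optimal predictor studied by Le Morvan et al
net = MLPRegressor(hidden_layer_sizes=(64,), activation="relu",
                   max_iter=2000, random_state=0).fit(X_in, y)
print(net.score(X_in, y))
```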
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To go one step further, we noticed that the optimal predictor uses the
correlation between features (&lt;em&gt;eg&lt;/em&gt; on fully-observed data) to compensate
for missing values.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/compensation_effects.jpeg" style="width: 700px;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Compensation effects&lt;/strong&gt;: The optimal predictor uses the correlation
between features to compensate when a value is missing.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="https://neurips.cc/virtual/2020/public/poster_42ae1544956fbe6e09242e6cd752444c.html"&gt;Le Morvan et al, NeurIPS 2020&lt;/a&gt;
devise a neural-network architecture that efficiently captures these
links across the features. Mathematically, it stems from seeking good
functional forms to approximate the expression of the optimal predictor,
that can be derived for various missing-values mechanisms. A non-trivial
result is that a simple functional form can approximate the optimal
predictor under very different mechanisms.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/neumiss_nb_parameters.jpeg" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Better parameter efficiency&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The resulting architecture needs many fewer parameters (in depth or width)
than a fully-connected multi-layer perceptron to predict well in the
presence of missing values. This, in turn, leads to better performance
on limited data sizes.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-without-normalizing-entries"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Machine-learning without normalizing entries&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A challenge of data management is that the same information may be
represented in different ways, typically with different strings denoting
the same, or related entities. For instance, in the following table, the
&lt;em&gt;employee position title&lt;/em&gt; column contains such non-normalized
information:&lt;/p&gt;
&lt;blockquote&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="13%" /&gt;
&lt;col width="47%" /&gt;
&lt;col width="40%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Sex&lt;/th&gt;
&lt;th class="head"&gt;Employee Position Title&lt;/th&gt;
&lt;th class="head"&gt;Years of experience&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Master Police Officer&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Social Worker IV&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Police Officer III&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Police Aide&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Electrician I&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Bus Operator&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Bus Operator&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Social Worker III&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Library Assistant I&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Library Assistant I&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;Typos, or other morphological variants (such as varying abbreviations)
often make things worse. We found many instances of such challenges in
electronic health records.&lt;/p&gt;
&lt;p&gt;In a data-science analysis, such data has a categorical meaning, but a
typical categorical-data representation (such as one-hot encoding) breaks
down: there are too many categories, and in machine learning, the test
set may come with new categories.&lt;/p&gt;
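&lt;p&gt;A small sketch of this failure mode with scikit-learn’s one-hot encoder (the job titles are taken from the table above):&lt;/p&gt;

```python
from sklearn.preprocessing import OneHotEncoder

train = [["Police Officer III"], ["Bus Operator"]]
test = [["Master Police Officer"]]  # a title never seen during fitting

# The default encoder errors out on unseen categories
enc = OneHotEncoder()
enc.fit(train)
error_raised = False
try:
    enc.transform(test)
except ValueError:
    error_raised = True

# handle_unknown="ignore" avoids the crash, but encodes the new title
# as all zeros: its similarity to "Police Officer III" is lost
enc = OneHotEncoder(handle_unknown="ignore").fit(train)
row = enc.transform(test).toarray()
print(error_raised, row)
```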
&lt;p&gt;The standard practice is to curate the data: represent the information in
a normalized way, without morphological variants, separating the
various bits of information (for instance the type of job from the rank).
This typically requires a lot of human labor.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/gamma_poisson_encoding.png" style="width: 600px;" /&gt;
&lt;p class="caption"&gt;The original categories and their continuous representation on latent
categorical features inferred from the data.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="https://ieeexplore.ieee.org/abstract/document/9086128"&gt;Cerda &amp;amp; Varoquaux, TKDE 2020&lt;/a&gt; give two
efficient approaches to encode such data for statistical analysis
capturing string similarities. The most interpretable of these approaches
represents the data by continuous encoding on latent categories inferred
automatically from recurrent substrings.&lt;/p&gt;
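&lt;p&gt;The encoder itself is available in skrub, but the gist (a continuous encoding on latent categories inferred from recurrent substrings) can be sketched with scikit-learn alone; this is an approximation of the idea, not the Gamma-Poisson model of the paper:&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

titles = ["Master Police Officer", "Police Officer III", "Police Aide",
          "Social Worker IV", "Social Worker III",
          "Library Assistant I", "Bus Operator", "Electrician I"]

# Count character 3-grams so that morphological variants share features
counts = CountVectorizer(analyzer="char_wb",
                         ngram_range=(3, 3)).fit_transform(titles)

# Factorize the counts into a few latent components playing the role of
# latent categories; each title gets a continuous, non-negative encoding
nmf = NMF(n_components=3, random_state=0, max_iter=1000)
activations = nmf.fit_transform(counts)

print(activations.shape)  # one row of latent-category loadings per title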
&lt;p&gt;This research is implemented in the &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub&lt;/a&gt;
Python library, which is making rapid progress (and was originally called
dirty-cat).&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="making-sense-of-brain-functional-signals"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Making sense of brain functional signals&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Turning brain-imaging signal into insights&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Brain imaging, and in particular functional brain imaging, is amazing,
because it gives a window on brain function, whether it is to understand
cognition, behavior, or pathologies. One challenge that I have been
interested in, across the years, is how to give systematic sense to these
signals, in a broader perspective than a given study.&lt;/p&gt;
&lt;div class="section" id="neuroquery-brain-mapping-any-neuroscience-query"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;NeuroQuery: brain mapping any neuroscience query&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Systematically linking mental processes and disorders to brain structures
is a very difficult task because of the huge diversity of behavior.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://elifesciences.org/articles/53385"&gt;Dockes et al, elife 2020&lt;/a&gt; we used text mining on a
large number of brain-imaging publications to predict where in the brain
a given subject of study (in neuroscience, behavior, and related
pathologies) would report findings.&lt;/p&gt;
&lt;p&gt;With this model, we built a web application, &lt;a class="reference external" href="https://neuroquery.org"&gt;NeuroQuery&lt;/a&gt;, in which the user can type a neuroscience
query and get a brain map of where a study on the topic is likely to
report findings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-high-resolution-brain-functional-atlas"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;A high-resolution brain functional atlas&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Regions to summarize the fMRI signal&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Atlases of brain regions are convenient to summarize the information of
brain images, turning them into information that is easy to analyse. We have long
studied the specific case of functional brain atlases, extracting and
validating them from brain-imaging data. &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811920306121"&gt;Dadi et al, NeuroImage 2020&lt;/a&gt;
contributes a high-resolution brain functional atlas, DiFuMo. This atlas
can be browsed or downloaded &lt;a class="reference external" href="https://parietal-inria.github.io/DiFuMo/"&gt;online&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/difumo.jpg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;The functional regions, at dimension 512.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The atlas comes at various resolutions, and all the structures that it
segments have been given meaningful names. In the paper, we showed that
using this atlas to extract functional signals led to better analyses on
a large number of problems compared to the commonly-used atlases. We thus
recommend this atlas, for instance to extract Image-Derived Phenotypes in
population analyses, where the huge size of the data requires working on
summarized information.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/putamen_difumo.png" /&gt;
&lt;p class="caption"&gt;The region capturing the right hemisphere putamen.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="health"></category><category term="covid19"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>Technical discussions are hard; a few tips</title><link href="https://gael-varoquaux.info/programming/technical-discussions-are-hard-a-few-tips.html" rel="alternate"></link><published>2020-05-28T00:00:00+02:00</published><updated>2020-05-28T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-05-28:/programming/technical-discussions-are-hard-a-few-tips.html</id><summary type="html">&lt;!-- Emma, Eliz, Rashema, Ralf Gommers to read this --&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post discuss the difficulties of communicating while developing
open-source projects and tries to gives some simple advice.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;A large software project is above all a social exercise in which technical
experts try to reach good decisions together, for instance on github
pull requests. But communication is difficult, in …&lt;/p&gt;</summary><content type="html">&lt;!-- Emma, Eliz, Rashema, Ralf Gommers to read this --&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post discuss the difficulties of communicating while developing
open-source projects and tries to gives some simple advice.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;A large software project is above all a social exercise in which technical
experts try to reach good decisions together, for instance on github
pull requests. But communication is difficult, in particular between
diverging points of view. It is easy to
underestimate how much well-intentioned people can misunderstand
each other and get hurt, in open source as elsewhere. Knowing why
there are communication challenges can help, as can applying a few
simple rules.&lt;/p&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/communication.png" style="width: 300px;" /&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#maintainer-s-anxiety" id="toc-entry-1"&gt;Maintainer’s anxiety&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#contributor-s-fatigue" id="toc-entry-2"&gt;Contributor’s fatigue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#communication-is-hard" id="toc-entry-3"&gt;Communication is hard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#little-things-that-help" id="toc-entry-4"&gt;Little things that help&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The first challenge is to understand the other’s point of view: the
different parties see the problem differently.&lt;/p&gt;
&lt;!-- TODO: put a few things in bold --&gt;
&lt;div class="section" id="maintainer-s-anxiety"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Maintainer’s anxiety&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="open-source-can-be-anxiety-generating-for-the-maintainers"&gt;
&lt;h3&gt;Open source can be anxiety-generating for the maintainers&lt;/h3&gt;
&lt;p&gt;Maintainers ensure the quality and the long-term life of an open-source
project. As such, &lt;strong&gt;they feel responsible for any shortcoming in
the product&lt;/strong&gt;. In addition, they often do this work because they care,
even though it may not bring any financial support.
But they can quickly become a converging point of anxiety-generating
feedback:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Code has bugs; the more code, the more bugs. Watching an issue tracker
fill up with a long list of bugs is frightening to people who
feel in charge.&lt;/li&gt;
&lt;li&gt;Given that maintainers are visible and qualified, they become the
target of constant requests for attention: from pleas to prioritize a
specific issue to solicitations for advice.&lt;/li&gt;
&lt;li&gt;A small fraction of these interactions come as plain
aggressions. I have been insulted many times by unsatisfied
users. Each time, it hurts me a lot. My policy is to
disengage from the conversation, but I am left shaking and staring at
my computer in the evening.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Related writings&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ralf Gommers discusses &lt;a class="reference external" href="https://rgommers.github.io/2019/06/the-cost-of-an-open-source-contribution/"&gt;the cost of an open source
contribution&lt;/a&gt;, from the point of view of the maintainer.&lt;/p&gt;
&lt;p&gt;Ilya Grigorik suggests: &lt;a class="reference external" href="https://www.igvita.com/2011/12/19/dont-push-your-pull-requests/"&gt;Don’t push your pull request&lt;/a&gt;.&lt;/p&gt;
&lt;p class="last"&gt;Brett Cannon: &lt;a class="reference external" href="https://snarky.ca/setting-expectations-for-open-source-participation/"&gt;Setting expectations for open source participation&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The more popular a project, the more weight it puts on its maintainers’
shoulders. A consequence is that &lt;strong&gt;maintainers are tired&lt;/strong&gt;, and can
sometimes approach discussions in a defensive way. Also, we may be plain
scared of integrating code that we do not fully comprehend.&lt;/p&gt;
&lt;p&gt;Open-source developers may even, unconsciously, adopt a simple, but
unfortunate, protection mechanism: being rude. The logic is flawless: if
I am nasty to people, or set unreasonable expectations, people will leave me alone.
Alas, this strategy leads to toxic environments. It not only makes people
unhappy but also harms the community dynamics that ground the excellence
of open source.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-danger-abusive-gatekeeping"&gt;
&lt;h3&gt;The danger of abusive gatekeeping&lt;/h3&gt;
&lt;!-- add a image of puppy? And a gate? --&gt;
&lt;p&gt;A maintainer quickly learns that every piece of code, no matter how cute
it might be, will give him or her work in the long run, &lt;a class="reference external" href="https://snarky.ca/setting-expectations-for-open-source-participation/#submittingacontribution"&gt;just like a puppy&lt;/a&gt;. This
is unavoidable given that the complexity of code grows faster than its number of
features &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;, and, even for a company as rich as Google,
project maintenance becomes intractable on huge projects &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="side-hanging docutils container"&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a class="reference external" href="https://ieeexplore.ieee.org/document/1702600"&gt;An Experiment on Unit Increase in Problem Complexity, Woodfield 1979&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;To quote tensorflow developers
&lt;a class="reference external" href="https://github.com/tensorflow/tensorflow/pull/33460"&gt;“Every [code addition] takes around 16 CPU/GPU
hours of [quality control]. As such, we cannot just run every
[code addition] through the [quality control] infrastructure.”&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;A maintainer’s job is to say no often&lt;/strong&gt;, to protect the project. But,
like any gatekeeping, it can unfortunately become an exercise in unchecked
power. Making objective choices in these difficult decisions is hard,
and we all naturally tend to trust more the people that we know.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Most often we are not aware of our shortcomings, let alone committing
them on purpose.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="contributor-s-fatigue"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Contributor’s fatigue&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A new contributor starting a conversation with a group of seasoned
project maintainers may easily &lt;strong&gt;feel like an impostor&lt;/strong&gt;. The
new contributor knows less about the project. In addition, he or she is engaging
with a group of people who know each other well, and is not yet part of
that &lt;em&gt;inner&lt;/em&gt; group.&lt;/p&gt;
&lt;p&gt;This person does not know the code base, or the conventions, and must &lt;strong&gt;make
extra efforts&lt;/strong&gt;, compared to the seasoned developers, to propose a
contribution suitable for the project. Often, he or she does
not fully understand the reasons for the project guidelines, or for the
feedback given. Requests for changes can easily be seen as trifles.&lt;/p&gt;
&lt;p&gt;Integrating the contribution can often be a lengthy process –in
particular in scikit-learn. Indeed, it will involve not only shaping up
the contribution, but also learning the skills and discovering the
process. These &lt;strong&gt;long cycles can undermine motivation&lt;/strong&gt;: humans need
successes to feel enthusiasm. Also, the contributor may legitimately
worry: Will all these efforts be fruitful? Will the contribution make its
way to the project?&lt;/p&gt;
&lt;p&gt;Note that for these reasons, it is recommended to start contributing with
very simple features, and to seek feedback on the scope of the
contribution before writing the code.&lt;/p&gt;
&lt;p&gt;Finally, contributors are seldom paid to work on the project, and there
is no single chain of command that makes decisions and controls incentives
for all the people on the project. No one is responsible when things go
astray, which means that the weight falls on the shoulders of
individuals.&lt;/p&gt;
&lt;!-- fun pictures, to relax atmosphere, but only later, first write and
review --&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The danger behind the lengthy cycle of reviews and improvements needed to
contribute is &lt;strong&gt;death by a thousand cuts&lt;/strong&gt;. The contributor loses
motivation, and no longer finds the energy to finish the work.&lt;/p&gt;
&lt;div class="grey docutils container"&gt;
&lt;p&gt;&lt;strong&gt;How about users?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This article is focused on developers. Yet, users are also an
important part of the discussion around open source.&lt;/p&gt;
&lt;p&gt;Often, communication failures with users are due to frustration:
frustration at being unable to use the software, at hitting a bug, at
seeing an important issue still not addressed. This frustration stems
from incorrect expectations, which can often be traced to a
misunderstanding of the processes and the dynamics. Managing
expectations is important to improve the dialogue, via the
documentation and via notes on the issue tracker.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="communication-is-hard"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Communication is hard&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Communication is hard: messages are sometimes received differently than
we would like. &lt;strong&gt;Overworked people discussing very technically
challenging issues&lt;/strong&gt; only makes the matter worse. I have seen people not
come across well, while I know they are absolutely lovely and caring.&lt;/p&gt;
&lt;p&gt;We are human beings; we are limited; we misunderstand things, and we have
feelings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Emotions&lt;/strong&gt; –
My most vivid memory of a communication failure was when I was a sailing
instructor. Trainees that were under my responsibility had put themselves
at risk, causing me a lot of worry. During the debrief, I was angry. My
failure to convey the messages without emotional loading undermined my
leadership on the group, putting everybody at risk for the rest of the
week.&lt;/p&gt;
&lt;p&gt;Inability to understand the other’s point of view, or to communicate
ours, can bring in emotions. Emotions most often impede technical
communication.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Limited attention&lt;/strong&gt; –
We, in particular maintainers, are bombarded with email, notifications,
text and code to read.
As a consequence, it is easy to read things too fast, to stop in the
middle, to forget.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Language barriers&lt;/strong&gt; –
Most discussions happen in English; but most of us are not native English
speakers. We may hide our difficulties well, but nuances are often lost.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Clique effects&lt;/strong&gt; –
Most interactions in open source are done in writing, with low
communication bandwidth. It can be much harder to convince a maintainer
on the other side of the world than a colleague in the same room. Schools
of thought naturally emerge when people work a lot together. These
create bubbles, where we have the impression that everything we say is
obvious and uncontroversial, and yet we fail to convince people outside
of our bubble.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="little-things-that-help"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Little things that help&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Communication can be improved by continuously working on it &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;.
This may be obvious to some, but it personally took me many years to learn.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Training materials for managers often discuss communication, and
give tricks. I am sure that there are better references than my
list below. But that’s the best I can do.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="hear-the-other-exchange"&gt;
&lt;h3&gt;Hear the other: exchange&lt;/h3&gt;
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Related presentation&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;a class="reference external" href="https://docs.google.com/presentation/d/1mEMjGQXErZC-mBeCt0quLz7b5ODQnehmfwwnCeggzcU/edit#slide=id.g5135b4b0eb_1_14"&gt;How can we have healthier technical discussions?&lt;/a&gt; by Nathaniel J. Smith&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Foster multiway discussions&lt;/strong&gt; – The goal of a technical discussion is to
arrive at the best solution. Better solutions emerge from confronting
different points of view: a single brilliant individual
probably cannot find or recognize the best solution alone.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Integrate input from as many perspectives as possible.&lt;/li&gt;
&lt;li&gt;Make sure everyone feels heard.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Don’t seek victory&lt;/strong&gt; – Most important to keep in mind is that giving
up on an argument and accepting the other point of view is a perfectly
valid option. I am naturally biased to think that my view on topics dear to
me is the right one. However, I’ve learned that adopting the view of the
other can bring a lot to the social dynamics of a project: we are often
debating over details, and the bigger benefit comes from moving forward.&lt;/p&gt;
&lt;p&gt;In addition, if several very bright people have reached different conclusions
from mine about something that they’ve thought a lot about, who am I to disagree?&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="convey-ideas-well-pedagogy"&gt;
&lt;h3&gt;Convey ideas well: pedagogy&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Explain&lt;/strong&gt; – Give the premises of your thoughts. Unroll your thought
processes. People are not sitting in your head, and need to hear not only
your conclusion, but how you got there.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Repeat things&lt;/strong&gt; – Account for the fact that people can forget, and
never hesitate to gently restate important points. Reformulating things
differently can also help explain them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Keep it short&lt;/strong&gt; – A typical reading speed is around 200 words a
minute. People have limited time and attention spans. The greatest help
you can provide to your reader is to condense your ideas: let us avoid
long threads that require several dozen minutes to read and digest.
There is a tension between this point and the one above. My suggestion:
remove every word that is not useful, and move details to footnotes or
postscripts.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="cater-for-emotions-tone"&gt;
&lt;h3&gt;Cater for emotions: tone&lt;/h3&gt;
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Related good advice&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;a class="reference external" href="https://www.mozilla.org/en-US/about/governance/policies/participation/#expected-behavior"&gt;Mozilla participation guide, expected behavior section&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Stay technical&lt;/strong&gt; – Always try to get to the technical aspect of the
matter, and never the human one. Give specific code and wording suggestions.
When explaining a decision, give technical arguments, even if they feel
obvious to you.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Be positive&lt;/strong&gt; – Being positive in general helps people feel happy and
motivated. It is well known that positive feedback leads to quicker
progress than negative feedback, as revealed &lt;em&gt;eg&lt;/em&gt; by studies of classrooms. I am
particularly guilty here: I always forget to say something nice,
even when I am super impressed by a contribution. Likewise, avoid
negative words when giving feedback (stay technical).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Avoid “you”&lt;/strong&gt; – The mere use of the pronoun “you” puts the person we are
talking to at the center of the message. But the message should not be about
the person; it should be about the work. It’s very easy to react
emotionally when it’s about us. The passive voice can be useful to avoid
putting people as the topic. If the topic is indeed people, sometimes “we”
is an adequate substitute for “you”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Assume good faith&lt;/strong&gt; – There are so many misunderstandings that can
happen. People forget things, people make mistakes, people fail to convey
their messages. Most often, all these failures are in good faith, and
misunderstandings are legitimate. In the rare cases where there might
be some bad faith, accounting for it will only make communication worse,
not better. Along the same lines, we should ignore it when we feel assaulted
or insulted, and avoid replying in kind.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose words wisely&lt;/strong&gt; – The choice of words matters, because words convey
implicit messages. In particular, avoid terms that carry value
judgements: “good” or “bad”. For example, “This is done wrong” (note that this
sentence already avoids “you”) could be replaced by “There might be a more
numerically stable / efficient way of doing it” (note also the use of
precise technical wording rather than the generic term “better”).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use moderating words&lt;/strong&gt; – Try to leave room for the other in the
discussion. Overly assertive statements close the door to different points
of view: “this must be changed” (note the lack of “you”) should be
avoided, while “this should be changed” is better. For this reason, this
article is riddled with words such as “tend”, “often”, “feel”, “may”,
“might”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don’t blame someone else&lt;/strong&gt; – If you feel that there is some pattern that
you would like to change, do not point fingers, and do not blame others.
Rather, put yourself at the center of the story, find an example of
this pattern involving you, and the message becomes that “it is a pattern
that &lt;em&gt;we&lt;/em&gt; should avoid”. &lt;em&gt;“We”&lt;/em&gt; is such a powerful term. It unites; it
builds a team.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Give your understanding&lt;/strong&gt; – If you feel that there is a misunderstanding,
explain how you are feeling. But do it using “I”, and not “you”, and
acknowledge the subjectivity: “I feel ignored” rather than “you are
ignoring me”. Even better: only talk about the feeling: “I am losing
motivation, because this is not moving forward”, or “I think that I am
failing to convey why this numerical problem is such an important issue”
(note the use of “I think”, which avoids casting the situation as
necessarily true).&lt;/p&gt;
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Implicit messages&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Four-sides_model"&gt;The four sides&lt;/a&gt;
view of communication highlights the multiple messages present even in
simple statements.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I hope this can be useful. I personally try to apply these rules, because
I want to work better with others.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Thanks&lt;/p&gt;
&lt;p&gt;to many who gave me feedback: Adrin Jalali, Andreas Mueller,
Elizabeth DuPre, Emmanuelle Gouillart, Guillaume
Lemaitre, Joel Nothman, Joris Van den Bossche, Nicolas Hug.&lt;/p&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;PS: note how many times I’ve used “you” above. I can clearly get better
at communication!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="programming"></category><category term="open source"></category><category term="people"></category></entry><entry><title>Jean Dechoux, June 13rd 1923 – Feb 9th 2020</title><link href="https://gael-varoquaux.info/personnal/jean-dechoux-june-13rd-1923-feb-9th-2020.html" rel="alternate"></link><published>2020-02-16T00:00:00+01:00</published><updated>2020-02-16T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-02-16:/personnal/jean-dechoux-june-13rd-1923-feb-9th-2020.html</id><summary type="html">&lt;p&gt;Jean Dechoux was born between the first and the second world wars, in a
small French town, close to Germany. His family was that of poor
farmers, who would work in coal mines to make up for the small size of
their crops.&lt;/p&gt;
&lt;p&gt;He grew to become a pulmonologist, heading …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Jean Dechoux was born between the first and the second world wars, in a
small French town, close to Germany. His family was that of poor
farmers, who would work in coal mines to make up for the small size of
their crops.&lt;/p&gt;
&lt;p&gt;He grew to become a pulmonologist, heading a hospital department that
tended to the illnesses of his people. He became an intellectual,
traveling the world, an avid reader, and the author of multiple
publications on diseases of coal miners.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The story of how Jean got his education is worth telling. His native
language was not even French, but the “Lorrain” dialect. His sisters started
working young. But he was able to go to school because the village priest had
perceived Jean’s intelligence and wanted him to go to the seminary.
However, the second world war came. Jean eventually got drafted into the
German (Nazi) army. Being from Lorraine, he was considered a German, yet
not one to be fully trusted: his fate was to be sent to Stalingrad, as
cannon fodder. Mistreated during training, he catches tuberculosis and
narrowly escapes the front. During his recovery in the German army
hospitals, a chief doctor shelters him, declares him unfit for service,
and pushes him to study for the &lt;em&gt;abitur&lt;/em&gt;, the German high-school degree.
Now Jean wants to become a doctor, and serves as a nurse in the German
hospitals.&lt;/p&gt;
&lt;p&gt;When the allies’ army advances, Jean is taken prisoner of war, then
incorporated into the French army, and eventually released with war
compensations. He uses them for college studies, during which he meets
his wife-to-be, Nicole Lissacq. Nicole is wealthier than him, and
receives a stipend, as a student of the famed “École Normale Supérieure”.
The rest is history: Jean is brilliantly successful during his medical
studies, and comes back to his native region, Lorraine, to work as a
doctor for the coal miners.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="../personnal/attachments/jean_dechoux.jpg" style="width: 200px;" /&gt;
&lt;p&gt;Jean, as I knew him, was a profoundly open and kind person. He survived
tragedy in his family by becoming even more so. Despite his age, he was
modern: the first time that I saw wifi was at his place.&lt;/p&gt;
&lt;p&gt;Jean was my grandfather. I very much look up to him.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
</content><category term="personnal"></category><category term="family"></category><category term="people"></category></entry><entry><title>Survey of machine-learning experimental methods at NeurIPS2019 and ICLR2020</title><link href="https://gael-varoquaux.info/science/survey-of-machine-learning-experimental-methods-at-neurips2019-and-iclr2020.html" rel="alternate"></link><published>2020-01-22T00:00:00+01:00</published><updated>2020-01-22T00:00:00+01:00</updated><author><name>Xavier Bouthillier &amp; Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-01-22:/science/survey-of-machine-learning-experimental-methods-at-neurips2019-and-iclr2020.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;A simple survey asking authors of two leading machine-learning
conferences a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;How do machine-learning researchers run their empirical validation? In
the context of a push for improved reproducibility and benchmarking, this
question is important to develop new tools for model comparison. We …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;A simple survey asking authors of two leading machine-learning
conferences a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;How do machine-learning researchers run their empirical validation? In
the context of a push for improved reproducibility and benchmarking, this
question is important to develop new tools for model comparison. We ran a
simple survey asking authors of two leading conferences, NeurIPS 2019
and ICLR 2020, a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;p&gt;A &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;technical report on HAL&lt;/a&gt; summarizes our
findings. It gives a simple picture of how hyper-parameters are set, how
many baselines and datasets are included, or how seeds are used.
Below, we give a very short summary, but please read (and &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823v1/bibtex"&gt;cite&lt;/a&gt;)
&lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;the full report&lt;/a&gt; if you are interested.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;
The response rates were 35.6% for NeurIPS and 48.6%
for ICLR.
A vast majority of empirical works optimize model hyper-parameters,
though almost half of these use manual tuning, and most of the automatic
hyper-parameter optimization is done with grid search. The typical number
of hyper-parameters set is in the interval 3–5, and fewer than 50 model fits
are used to explore the search space. In addition, most works also
optimized their baselines (typically, around 4 baselines).
Finally, studies typically reported 4 results per model per task to provide a measure of variance, and around 50% of them
used a different random seed for each experiment.&lt;/p&gt;
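To make the typical setup concrete, here is a minimal, hypothetical sketch of the dominant tuning strategy reported by respondents: a grid search over a few values of a few hyper-parameters, keeping the total number of model fits under 50. The scored function is a toy stand-in, not something from the survey.

```python
import itertools

# Toy stand-in for fitting a model and scoring a configuration
# (hypothetical; not part of the survey itself).
def fit_and_score(lr, depth):
    return -(lr - 0.1) ** 2 - (depth - 4) ** 2

# Grid over 2 hyper-parameters with a handful of values each: the
# typical setup reported (3-5 values per hyper-parameter, under 50 fits).
grid = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 8, 16]}
configs = [dict(zip(grid, values))
           for values in itertools.product(*grid.values())]
assert len(configs) < 50  # fewer than 50 model fits in total

best = max(configs, key=lambda c: fit_and_score(**c))
# best is {"lr": 0.1, "depth": 4}, the maximizer of the toy score
```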
&lt;p&gt;&lt;strong&gt;Sample results&lt;/strong&gt;&lt;/p&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/hyper_parameter_optimization.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;How many papers with experiments optimized hyperparameters.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/tuning_methods.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;What hyperparameter optimization method were used.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/number_datasets.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Number of different datasets used for benchmarking.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/number_seeds_or_trials.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Number of results reported for each model (ex: for different seeds)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;These are just samples. Read &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;the full report&lt;/a&gt; for
more results.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For reproducibility and AutoML, there is active research in benchmarking
and hyperparameter procedures in machine learning. We hope that the
survey results presented here can help inform this research. As this
document is merely a research report, we purposely limited
interpretation of the results and refrained from drawing recommendations. However, trends that stand out to our
eyes are &lt;cite&gt;1)&lt;/cite&gt; the simplicity of hyper-parameter tuning strategies
(mostly manual search and grid search), &lt;cite&gt;2)&lt;/cite&gt; the small number of
model fits explored during this tuning (often 50 or fewer), which biases the
results, and &lt;cite&gt;3)&lt;/cite&gt; the small number of performance results reported, which limits
statistical power. These
practices are most likely due to the high computational cost of fitting
modern machine-learning models.&lt;/p&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Code&lt;/p&gt;
&lt;p class="last"&gt;The code used for plotting and analysis is &lt;a class="reference external" href="https://github.com/bouthilx/ml-survey-2020"&gt;on github&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt; We are deeply grateful to the participants of
the survey who took time to answer the questions.&lt;/p&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="benchmarking"></category><category term="conferences"></category><category term="experimental methods"></category></entry><entry><title>2019: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2019-my-scientific-year-in-review.html" rel="alternate"></link><published>2020-01-05T00:00:00+01:00</published><updated>2020-01-05T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-01-05:/science/2019-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;My current research spans wide: from brain sciences to core data
science. My overall interest is to build &lt;strong&gt;methodology drawing insights from
data&lt;/strong&gt; for questions that have often been addressed qualitatively. If I can
highlight a few publications from 2019 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;, the common thread would be
computational statistics, from dirty …&lt;/p&gt;</summary><content type="html">&lt;p&gt;My current research spans wide: from brain sciences to core data
science. My overall interest is to build &lt;strong&gt;methodology drawing insights from
data&lt;/strong&gt; for questions that have often been addressed qualitatively. If I can
highlight a few publications from 2019 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;, the common thread would be
computational statistics, from dirty data to brain images. Let me try to
give the gist of these advances, in simple terms.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It’s already 2020, I’m always late.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#comparing-distributions" id="toc-entry-1"&gt;Comparing distributions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#predictive-pipelines-on-brain-functional-connectomes" id="toc-entry-2"&gt;Predictive pipelines on brain functional connectomes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#population-shrinkage-of-covariance" id="toc-entry-3"&gt;Population shrinkage of covariance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#deep-learning-on-non-translation-invariant-images" id="toc-entry-4"&gt;Deep learning on non-translation-invariant images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#open-science" id="toc-entry-5"&gt;Open science&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="comparing-distributions"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Comparing distributions&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Fundamental computational-statistics work&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What if you are given two sets of observations and need to conclude
whether they are drawn from the same distribution? We are interested in
this question for the &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;
research project, to facilitate analysis of data without manual curation.
Comparing distributions is indeed important to detect drifts in the data,
to match information across datasets, or to compensate for dataset
biases.&lt;/p&gt;
&lt;p&gt;Formally, we are given two clouds of points (circles and crosses in the
figure below) and we want to develop a statistical test of whether the
distributions differ. There is an abundant literature on this topic, which I
cover in &lt;a class="reference external" href="http://gael-varoquaux.info/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html"&gt;a more detailed post on this subject&lt;/a&gt;.
Specifically, when the observations have a natural similarity, for
instance when they live in a vector space, kernel methods are interesting
because they make it possible to estimate a representative of the underlying
distribution that interpolates between observations, as with &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;a kernel
density estimator&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing"&gt;&lt;img alt="" src="../science/attachments/comparing_distributions_l1/optimizing_position.png" style="width: 500px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Two cloud of points, the corresponding distribution representants μ_P
and μ_Q (blue and orange), the difference between these
(black), and locations to measure this difference (red triangles).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With Meyer Scetbon, in
&lt;a class="reference external" href="http://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing"&gt;Scetbon &amp;amp; Varoquaux, NeurIPS&lt;/a&gt;,
we investigate how to best measure the difference between these
representatives. We show that the best choice is to take the absolute value
of the difference (the l1 norm), while the default choice had so far been
the Euclidean (l2) norm. In a nutshell, the reason is that the difference
is most likely dense when the distributions differ: zero almost nowhere.&lt;/p&gt;
&lt;p&gt;We were able to show that the &lt;a class="reference external" href="https://slideslive.com/38921490/interpretable-comparison-of-distributions-and-models"&gt;sophisticated framework&lt;/a&gt;
for efficient and powerful tests in the
Euclidean case carries over to the l1 case. In particular, our paper
gives efficient testing procedures using a small number of locations to
avoid costly computation (the red triangles in the figure above), that
can either be sampled at random or optimized.&lt;/p&gt;
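As an illustration of the quantities at play (a schematic numpy sketch under our own simplifying assumptions, not the paper’s code): estimate the two mean embeddings at a few test locations, and compare the l1 and l2 statistics computed on their difference.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two clouds of 1D points from slightly different distributions
x = rng.normal(0.0, 1.0, size=200)
y = rng.normal(0.5, 1.0, size=200)

def mean_embedding(samples, locations, bandwidth=1.0):
    # Kernel mean embedding evaluated at a few test locations
    # (the red triangles above): an average of Gaussian kernels.
    d = locations[:, None] - samples[None, :]
    return np.exp(-d ** 2 / (2 * bandwidth ** 2)).mean(axis=1)

locations = np.linspace(-3, 3, 7)  # a small number J of locations
diff = mean_embedding(x, locations) - mean_embedding(y, locations)
stat_l1 = np.abs(diff).sum()          # l1 norm of the difference
stat_l2 = np.sqrt((diff ** 2).sum())  # classical l2 (Euclidean) norm
```

For a dense difference vector, the l1 norm aggregates the signal at every location, which is the intuition behind its better test power.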
&lt;p&gt;My hunch is that the result is quite general: the l1 geometry is better
than the l2 one on representatives of distributions. There might be more
fundamental mathematical properties behind this. The drawback is that the
l1 norm is non-smooth, which can be challenging in optimization settings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="predictive-pipelines-on-brain-functional-connectomes"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Predictive pipelines on brain functional connectomes&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Brain-imaging methods&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Brain functional connectivity is increasingly used to extract biomarkers
of behavior and mental health. The long-term stakes are to ground
assessment of psychological traits on quantitative brain
data, rather than qualitative behavioral observations. But, to build
biomarkers, there are many details that go into estimating functional
connectivity from fMRI, something that I have studied for more than 10
years. With Kamalakar Dadi, in &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811919301594"&gt;Dadi et al&lt;/a&gt;,
we ran thorough empirical benchmarks to find which methodological choices
for the various steps of the pipeline give best prediction across
multiple cohorts. Specifically, we studied 1) defining regions of
interest for signal extraction, 2) building a functional-connectivity
matrix across these regions, 3) prediction across subjects with
supervised learning on these features.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811919301594"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/dadi_2019_highlights.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Summarizing our benchmark results.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Recommendations&lt;/p&gt;
&lt;ul class="last simple"&gt;
&lt;li&gt;functional regions (eg from dictionary learning)&lt;/li&gt;
&lt;li&gt;tangent-space for covariances&lt;/li&gt;
&lt;li&gt;l2-logistic regression&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;Results show the importance of defining regions from functional data,
ideally with a linear-decomposition method that produces soft
parcellations such as ICA or dictionary learning. To represent
connectivity between regions, the best choice is tangent-space
parametrization, a method to build a vector-space from covariance
matrices (more below). Finally, for supervised learning, a simple
l2-penalized logistic regression is the best option. With the huge popularity
of deep learning, it may come as a surprise that linear models are the best
performers, but this is well explained by the amount of data at hand: a
cohort typically comprises fewer than 1000 individuals, which is way below the
data sizes needed to see the benefits of non-linear models.&lt;/p&gt;
&lt;p&gt;A recent preprint, &lt;a class="reference external" href="https://www.biorxiv.org/content/10.1101/741595v2.abstract"&gt;Pervaiz et al&lt;/a&gt; from
Oxford, overall
confirms our findings, even though they investigated slightly
different methodological choices. In particular, they find tangent space
clearly useful.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In my eyes, such benchmarking studies are important not only to improve
prediction, but also to reduce analytic variability that opens the door
to inflation of reported effects. Indeed, given 1000 individuals, the
measure of prediction accuracy of a pipeline is quite imprecise
(&lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811917305311"&gt;Varoquaux 2018&lt;/a&gt;).
As a consequence, trying out a bunch of analytic choices and
publishing the one that works best can lead to grossly optimistic
prediction accuracies. &lt;strong&gt;If we want trust in biomarkers, we need to
reduce the variability in the methods used to build them&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="population-shrinkage-of-covariance"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Population shrinkage of covariance&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;Statistics for brain signals&lt;/p&gt;
&lt;p&gt;Estimating covariances is central for functional brain connectivity and
in many other applications. With Mehdi Rahim, in &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;Rahim et al&lt;/a&gt;
we considered the case of a population of random processes with
related covariances, as for instance when estimating functional
connectivity from a group of individuals. For this, we combined two
mathematical ideas: that of using natural operations on covariance
matrices, and that of priors for mean-square estimation:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Tangent space&lt;/strong&gt; Covariance matrices are positive-definite matrices,
for which standard arithmetics are not well suited &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;: subtracting
two covariance matrices can lead to a matrix that cannot be
the covariance of a signal. However, a group of covariance matrices can
be transformed into points in a vector space for which standard
distances and arithmetics respect the structure of
covariances (for instance, the Euclidean distance between these points
approximates the KL divergence between covariances). This is what we call
the &lt;em&gt;tangent space&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Technically, covariance matrices live on a Riemannian manifold:
a curve surface inside &lt;em&gt;R^{n x n}&lt;/em&gt; that has some metric
properties.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;James-Stein shrinkage&lt;/strong&gt; To estimate the mean of &lt;em&gt;n&lt;/em&gt; observations, it
is actually best not to compute the average of these, but rather to
push this average a bit toward a prior guess. The better the
guess, the more this “push” helps. The larger the number of observations,
the gentler this push should be. This strategy is known as
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator"&gt;James-Stein shrinkage&lt;/a&gt; and it
is in my opinion one of the most beautiful results in statistics.
It can be seen as a Bayesian posterior, but it comes with guarantees
that do not require the model to be true and that control estimation
error, rather than a posterior probability.&lt;/li&gt;
&lt;/ul&gt;
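A minimal sketch of (positive-part) James-Stein shrinkage toward zero for a vector of noisy observations, assuming unit noise variance. This is illustrative only; the estimator in the paper operates on covariances in tangent space, not on plain vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
obs = true_mean + rng.normal(size=true_mean.shape)  # noisy raw estimate

# Positive-part James-Stein: shrink the raw estimate toward a prior
# guess (here zero); the shrinkage weakens as the observed signal grows.
p, sigma2 = len(obs), 1.0
factor = max(0.0, 1.0 - (p - 2) * sigma2 / float(obs @ obs))
js_estimate = factor * obs  # pushed a bit toward the prior guess
```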
&lt;p&gt;James-Stein shrinkage is easily written for quadratic errors on vectors,
but cannot be easily applied to covariances, as they do not live in a vector
space and we would like to control a KL divergence rather than
a quadratic error. Our work combined both ideas to give an excellent
estimator of a family of related covariances that is also very
computationally efficient. We call it PoSCE: Population Shrinkage
Covariance Estimation.&lt;/p&gt;
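The tangent-space projection itself can be sketched in a few lines (an illustrative numpy version of the standard construction, not the PoSCE code): whiten each covariance by a reference covariance, such as the group mean, then take a matrix logarithm.

```python
import numpy as np

def spd_logm(m):
    # Matrix logarithm of a symmetric positive-definite matrix
    w, v = np.linalg.eigh(m)
    return (v * np.log(w)) @ v.T

def spd_invsqrt(m):
    # Inverse square root of a symmetric positive-definite matrix
    w, v = np.linalg.eigh(m)
    return (v / np.sqrt(w)) @ v.T

def to_tangent(cov, reference):
    # Whiten by the reference (e.g. group-mean) covariance, then take
    # the log: covariances become points of a vector space where
    # standard arithmetic respects their structure.
    white = spd_invsqrt(reference)
    return spd_logm(white @ cov @ white)

# The reference projects onto the origin of its own tangent space
ref = np.array([[2.0, 0.5], [0.5, 1.0]])
origin = to_tangent(ref, ref)  # ~ the zero matrix
```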
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/posce.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Schema of the estimation strategy: projecting the covariances matrices
into a tangent space, shrinkage to a group mean, but taking in account
the anisotropy of the dispersion of the group, and projecting back to
covariances.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is easy to see how accounting for group information in the estimation
of individual covariances can help stabilize them. However, will it be
beneficial if we are interested in the differences between these
covariances, for instance to ground biomarkers, as studied above? Our
results show that it does indeed help build better biomarkers, for
instance to predict brain age. The larger the group of covariances used,
the larger the benefits.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/posce_age_learning_curve.png" style="width: 500px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Error in predicting brain aging decreases when more individuals are used
to build the biomarker.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="deep-learning-on-non-translation-invariant-images"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Deep learning on non-translation-invariant images&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;Computer vision&lt;/p&gt;
&lt;p&gt;Brain images, in particular images of brain activity, are very different
from the natural images on which most computer-vision research focuses.
A central difference is that detecting activity in different parts of the
brain completely changes the meaning of this detection, while detecting a
cat in the left or the right of a picture on Facebook makes no
difference. This is important because much of the progress in computer vision,
such as convolutional neural networks, is built on the fact that natural
images are statistically translation invariant. In contrast, brain
images are realigned to a template before being analyzed.&lt;/p&gt;
&lt;p&gt;Convolutional architectures have been crucial to the successes of deep
learning on natural images because they impose a lot of structure on the
weights of neural networks and thus help fight estimation noise. For
predicting from brain images, the regularization strategies that have
been successful foster spatially continuous structures. Unfortunately,
they have led to costly non-smooth optimizations that cannot easily be
used with the optimization framework of deep learning, stochastic
gradient descent.&lt;/p&gt;
&lt;p&gt;With Sergul Aydore, in &lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;Aydore et al, ICML&lt;/a&gt;, we have introduced a
spatial regularization that is compatible with the deep learning toolbox.
During the stochastic optimization, we impose random spatial structure
via feature groups estimated from the data. These stabilize the input
layers of deep architectures. They also lead to iterating on smaller
representations, which greatly speeds up the algorithm.&lt;/p&gt;
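The idea can be sketched on a plain linear model (a schematic numpy version under our own simplifying assumptions; the actual method of Aydore et al. uses groupings estimated from the data inside a deep network): at each step, reduce the data with a random grouping matrix, compute the gradient in the reduced space, and invert the reduction to update the full weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_groups = 32, 12, 4
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)
w = np.zeros(n_features)

def random_grouping():
    # Stand-in for a data-driven feature grouping: each feature is
    # assigned to one group; each row averages the features of a group.
    assign = rng.integers(0, n_groups, size=n_features)
    G = np.zeros((n_groups, n_features))
    G[assign, np.arange(n_features)] = 1.0
    return G / np.maximum(G.sum(axis=1, keepdims=True), 1.0)

for _ in range(200):  # SGD-like loop: new random grouping at each step
    G = random_grouping()
    Xr = X @ G.T                       # reduce the data to group space
    resid = Xr @ (G @ w) - y           # residual of the reduced model
    grad_r = Xr.T @ resid / n_samples  # gradient in the reduced space
    w -= 0.1 * (G.T @ grad_r)          # invert the reduction to update w
```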
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/stochastic_grouping_mlp.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;At each step of a stochastic gradient descent, we randomly pick a
feature-grouping matrix (itself estimated from the data), and use it
to reduce the data in the computations of the gradients, then invert
this reduction to update the weights.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;The paper&lt;/a&gt; comes with
extensive empirical validation, including comparison to convolutional
neural networks. We benchmark the strategy on brain images, but also
on realigned faces, to show that the approach is beneficial for any
non-translation-invariant images. In particular, the approach greatly
speeds up convergence.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/stochastic_grouping_results.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Prediction accuracy as a function of training time – left: on
realigned faces – right: on brain images&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;This paper&lt;/a&gt; clearly
shows that &lt;strong&gt;one should not use convolutional neural networks on fMRI
data&lt;/strong&gt;: these images are not translation invariant.&lt;/p&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Preprints&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;All papers are available as preprints, eg on &lt;a class="reference external" href="http://gael-varoquaux.info/publications.html"&gt;my site&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="open-science"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Open science&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Open and reproducible science:&lt;/strong&gt; Looking at all these publications, I
realize that every single one of them comes with code on a GitHub
repository and is done on open data, which means that they can all be
easily reproduced. I’m very proud of the teams behind these papers.
Achieving this level of reproducibility requires hard work and
discipline. It is also a testament to a community investment in
software tools and infrastructure for open science that has been going on
for decades and provides the foundations on which these works build.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;A prize for scikit-learn:&lt;/strong&gt; On this topic, a highlight of 2019 was also
that the work behind scikit-learn was acknowledged in &lt;a class="reference external" href="../programming/getting-a-big-scientific-prize-for-open-source-software.html"&gt;an important
scientific prize&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Why open science:&lt;/strong&gt; Why do I care so much about open science? Because in
a world of uncertainty, the claims of science must be trusted and hence
built on transparent practice (think about science and global warming).
Because it helps put our methods in the hands of a wider public,
society at large. And because it levels the playing field, making it easier for
newcomers –young scientists, or developing countries– to contribute,
which in itself makes science more efficient.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>Comparing distributions: Kernels estimate good representations, l1 distances give good tests</title><link href="https://gael-varoquaux.info/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html" rel="alternate"></link><published>2019-12-08T00:00:00+01:00</published><updated>2019-12-08T00:00:00+01:00</updated><author><name>Meyer Scetbon &amp; Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2019-12-08:/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;Given two sets of observations, are they drawn from the same
distribution? Our paper &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Comparing distributions: l1 geometry
improves kernel two-sample testing&lt;/a&gt;
at the &lt;strong&gt;NeurIPS 2019 conference&lt;/strong&gt; revisits this classic statistical
problem known as “two-sample testing”.&lt;/p&gt;
&lt;p class="last"&gt;This post explains the context and the paper with a bit of hand …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;Given two sets of observations, are they drawn from the same
distribution? Our paper &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Comparing distributions: l1 geometry
improves kernel two-sample testing&lt;/a&gt;
at the &lt;strong&gt;NeurIPS 2019 conference&lt;/strong&gt; revisits this classic statistical
problem known as “two-sample testing”.&lt;/p&gt;
&lt;p class="last"&gt;This post explains the context and the paper with a bit of hand
waving.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-context-two-sample-testing" id="toc-entry-1"&gt;The context: two-sample testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#from-kernel-mean-embeddings-to-distances-on-distributions" id="toc-entry-2"&gt;From kernel mean embeddings to distances on distributions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#controlling-the-weak-convergence-of-probability-measures" id="toc-entry-3"&gt;Controlling the weak convergence of probability measures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#two-sample-testing-procedures" id="toc-entry-4"&gt;Two-sample testing procedures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-l1-metric-provides-best-testing-power" id="toc-entry-5"&gt;The L1 metric provides best testing power&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="the-context-two-sample-testing"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;The context: two-sample testing&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Given two samples from two unknown populations, the goal of two-sample tests is
to determine whether the underlying populations differ with a statistical
significance. For instance, we may want to know whether
McDonald’s and KFC use different logic to choose restaurant locations
across the US. This is a difficult question: we have access to data points,
but not to the underlying generative mechanism, which is probably governed by
marketing strategies.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/comparing_distributions_l1/map_KFC_McDo_simple.png" style="width: 70%;" /&gt;
&lt;/div&gt;
&lt;div class="section" id="from-kernel-mean-embeddings-to-distances-on-distributions"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;From kernel mean embeddings to distances on distributions&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the example of spatial distributions of restaurants,
there is &lt;strong&gt;a lot of information in how close observed data
points lie in the original measurement space (here geographic coordinates)&lt;/strong&gt;.
Kernel methods arise naturally to capture this information. They can be
applied to distributions, building representatives of distributions:
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_embedding_of_distributions"&gt;Kernel embeddings of distributions&lt;/a&gt;. The
mean embedding of a distribution P with a kernel k is written:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;) :  = &lt;span class="limits"&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator integral"&gt;∫&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;sub&gt;ℝ&lt;sup&gt;&lt;i&gt;d&lt;/i&gt;&lt;/sup&gt;&lt;/sub&gt;&lt;i&gt;k&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;, &lt;i&gt;t&lt;/i&gt;)&lt;i&gt;dP&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;)
&lt;/div&gt;
&lt;p&gt;Intuitively, it is related to &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;Kernel Density Estimates (KDEs)&lt;/a&gt; which
estimate a density in continuous space by smoothing the observed data
points with a kernel.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/comparing_distributions_l1/kde.jpg" /&gt;
&lt;p class="caption"&gt;Kernel mean embeddings for two distributions of points&lt;/p&gt;
&lt;/div&gt;
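&lt;p&gt;As a hand-wavy numerical sketch –not code from the paper– the mean embedding can be estimated by averaging kernel evaluations over the samples; the Gaussian kernel and its bandwidth below are arbitrary choices:&lt;/p&gt;

```python
import numpy as np

def mean_embedding(X, t, bandwidth=1.0):
    """Empirical mean embedding mu_P(t): average of k(x, t) over the samples x."""
    # Squared distances between every sample in X and every evaluation point in t
    sq_dists = ((X[:, None, :] - t[None, :, :]) ** 2).sum(axis=-1)
    # Gaussian kernel values, averaged over the samples (axis 0)
    return np.exp(-sq_dists / (2 * bandwidth ** 2)).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # 500 draws from a standard 2D Gaussian P
t = np.zeros((1, 2))           # evaluate the embedding at the origin
mu_at_origin = mean_embedding(X, t)[0]  # close to E[k(x, 0)] = 0.5 here
```

&lt;p&gt;Unlike a KDE, the embedding is not normalized to integrate to one; it is simply the expected kernel value at each location.&lt;/p&gt;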
&lt;p&gt;For two-sample testing, kernel embeddings can be used to compute distances
between distributions, building metrics over the space of probability
measures. Metrics between probability measures can be defined via the
notion of Integral Probability Metric (IPM): as a difference of
expectations:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;span class="text"&gt;IPM&lt;/span&gt;[&lt;i&gt;F&lt;/i&gt;, &lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;] :  = &lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;sup&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;f&lt;/i&gt; ∈ &lt;i&gt;F&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;(𝔼&lt;sub&gt;&lt;i&gt;x&lt;/i&gt; ∼ &lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;&lt;span class="stretchy"&gt;[&lt;/span&gt;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;)&lt;span class="stretchy"&gt;]&lt;/span&gt; − 𝔼&lt;sub&gt;&lt;i&gt;y&lt;/i&gt; ∼ &lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;&lt;span class="stretchy"&gt;[&lt;/span&gt;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;y&lt;/i&gt;)&lt;span class="stretchy"&gt;]&lt;/span&gt;)
&lt;/div&gt;
&lt;p&gt;where F is a class of functions. This definition is appealing because it
&lt;strong&gt;characterizes the difference between P and Q by the function for which
the expectation differs most&lt;/strong&gt;. The specific choice of function class
defines the metric. If we now consider a kernel, it implicitly defines a
space of functions (intuitively related to all the possible KDEs
generated by varying data points): a Reproducing Kernel Hilbert Space
(RKHS). Defining a metric (an IPM) with the function class F taken as the unit
ball of such an RKHS is known as the Maximum Mean Discrepancy (MMD). It
can be shown that, rather than computing the supremum, the MMD has a more
convenient expression, the RKHS distance between the mean embeddings:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;span class="text"&gt;MMD&lt;/span&gt;[&lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;] = ‖&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt; − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;‖&lt;sub&gt;&lt;i&gt;H&lt;/i&gt;&lt;sub&gt;&lt;i&gt;k&lt;/i&gt;&lt;/sub&gt;&lt;/sub&gt;
&lt;/div&gt;
&lt;p&gt;For good choices of kernels, the MMD has appealing mathematical
properties to compare distributions. With kernels said to be
characteristic, eg Gaussian kernels, the MMD is a metric: MMD[P, Q] = 0
if and only if P = Q. Using the MMD for two-sample testing –given only
observations from the distributions, and not P and Q–  requires using an
empirical estimation of the MMD. This can be done by computing the RKHS
norm in the expression above, which leads to summing kernel evaluations
on all data points in P and Q.&lt;/p&gt;
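&lt;p&gt;Concretely, expanding the squared RKHS norm gives three averages of pairwise kernel evaluations. A minimal sketch of the (biased) empirical MMD with a Gaussian kernel –helper names are mine, not the paper’s:&lt;/p&gt;

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # Pairwise Gaussian kernel values between the rows of A and of B
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def mmd2_biased(X, Y, bandwidth=1.0):
    # ||mu_X - mu_Y||^2 in the RKHS expands into three averages of kernel values
    return (gaussian_kernel(X, X, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean()
            - 2 * gaussian_kernel(X, Y, bandwidth).mean())

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(400, 2))
Y = rng.normal(0, 1, size=(400, 2))  # same distribution: MMD^2 close to 0
Z = rng.normal(2, 1, size=(400, 2))  # shifted distribution: MMD^2 clearly positive
```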
&lt;p&gt;Our work builds upon this framework, but deviates a bit from the
classical definition of MMD as it addresses the question of which norm is
best to use on the difference of mean embeddings, µQ - µP (as well as
other representatives, namely the smooth characteristic function, SCF).
We consider a wider family of metrics based on the Lp distances between
mean embeddings (p=2 recovers the classic framework):&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;d&lt;/i&gt;&lt;sub&gt;&lt;i&gt;L&lt;/i&gt;&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;, &lt;i&gt;μ&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;) :  = &lt;span class="stretchy"&gt;(&lt;/span&gt;&lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator integral"&gt;∫&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;t&lt;/i&gt; ∈ ℝ&lt;sup&gt;&lt;i&gt;d&lt;/i&gt;&lt;/sup&gt;&lt;/sub&gt;&lt;/span&gt;|&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;) − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;)|&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;d&lt;/i&gt;Γ(&lt;i&gt;t&lt;/i&gt;)&lt;span class="stretchy"&gt;)&lt;/span&gt;&lt;sup&gt;1 ⁄ &lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;
&lt;/div&gt;
&lt;p&gt;where Γ is an absolutely continuous Borel probability measure.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="controlling-the-weak-convergence-of-probability-measures"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Controlling the weak convergence of probability measures&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We show that these metrics have good properties. Specifically, for p ≥ 1,
as soon as the kernel is bounded, continuous, and characteristic, these
metrics metrize the weak convergence: the distance between a sequence of
distributions and P tends to zero if and only if the sequence converges
weakly to P.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Convergence_of_measures#Weak_convergence_of_measures"&gt;weak convergence of probability measures&lt;/a&gt;
is a notion of convergence based &lt;strong&gt;not just on events having the same
probabilities under the two distributions, but also on some events being
“close”&lt;/strong&gt;. Indeed, classic convergence in probability just tells us that
the same observation should have the same probability under the two
distributions. Weak convergence takes into account the topology of the
observations. For instance, going back to the problem of spatial
distributions of restaurants, it does not only look at whether the
probabilities of having a McDonald’s or a KFC restaurant converge on
11th Wall Street, but also at whether restaurants are likely on 9th Wall Street.&lt;/p&gt;
&lt;p&gt;A simple example to see why this matters is to consider two Dirac
distributions: spikes each located at a single point. If we bring these spikes closer
and closer, merely looking at the probability of events in the same exact
position will not detect any convergence until the spikes exactly
overlap.&lt;/p&gt;
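&lt;p&gt;With a Gaussian kernel of bandwidth 1, two unit spikes at distance δ have squared kernel distance 2 − 2 exp(−δ²/2), which shrinks smoothly to zero as the spikes approach; the total variation distance, in contrast, stays maximal until they exactly coincide. A quick check:&lt;/p&gt;

```python
import numpy as np

def dirac_mmd2(delta):
    # Squared kernel distance between two Diracs at distance delta
    # (Gaussian kernel, bandwidth 1): k(x,x) + k(y,y) - 2 k(x,y)
    return 2 - 2 * np.exp(-delta ** 2 / 2)

# Evaluate as the spikes are brought closer and closer
distances = [dirac_mmd2(d) for d in (2.0, 1.0, 0.1, 0.0)]
# The kernel distance decreases continuously to 0 as the spikes approach,
# while the total variation would stay at its maximum for every nonzero delta.
```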
&lt;p&gt;Using kernel embeddings of distributions makes it possible to capture
convergence in the spatial domain, because the kernels used give a
spatial smoothness to the representatives:&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/comparing_distributions_l1/converging_diracs.png" style="width: 70%;" /&gt;
&lt;p&gt;Having a metric on probability distributions that captures the topology
of the observations is important for many applications, for instance when
fitting GANs to generate images: the goal is not only to capture whether
images are exactly the same, but also whether they are “close”.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="two-sample-testing-procedures"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Two-sample testing procedures&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now that we have built metrics, we can derive two-sample test statistics.
A straightforward approach would involve large sums over all the
observations, which would be costly. Hence, we resort to a good
approximation by sampling a set of locations {Tj} from the distribution
Γ:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;d̂&lt;/i&gt;&lt;span class="scripts"&gt;&lt;sup class="script"&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;&lt;sub class="script"&gt;&lt;i&gt;ℓ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sub&gt;, &lt;i&gt;μ&lt;/i&gt;, &lt;i&gt;J&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;[&lt;i&gt;X&lt;/i&gt;, &lt;i&gt;Y&lt;/i&gt;] :  = &lt;i&gt;n&lt;/i&gt;&lt;sup&gt;&lt;i&gt;p&lt;/i&gt; ⁄ 2&lt;/sup&gt;&lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;j&lt;/i&gt; = 1..&lt;i&gt;J&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;|&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;T&lt;/i&gt;&lt;sub&gt;&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;) − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Y&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;T&lt;/i&gt;&lt;sub&gt;&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;)|&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;
&lt;/div&gt;
&lt;p&gt;We show that this approximation maintains (almost surely) the appealing
metric properties, generalizing the results that were established by
&lt;a class="reference external" href="http://papers.nips.cc/paper/5685-fast-two-sample-testing-with-analytic-representations-of-probability-measures"&gt;Chwialkowski et al 2015&lt;/a&gt;
for the special case of the L2 metric.&lt;/p&gt;
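&lt;p&gt;A rough sketch of such a sampled statistic –with Γ taken as a standard Gaussian, an arbitrary choice, and helper names of my own:&lt;/p&gt;

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def lp_statistic(X, Y, J=10, p=1, seed=0):
    # Draw J test locations T_j from Gamma (here a standard Gaussian),
    # evaluate both empirical mean embeddings there, and sum |difference|^p
    rng = np.random.default_rng(seed)
    T = rng.normal(size=(J, X.shape[1]))
    mu_X = gaussian_kernel(X, T).mean(axis=0)  # mu_X(T_j), shape (J,)
    mu_Y = gaussian_kernel(Y, T).mean(axis=0)
    n = len(X)
    return n ** (p / 2) * (np.abs(mu_X - mu_Y) ** p).sum()

rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(500, 2))
Y = rng.normal(1, 1, size=(500, 2))  # shifted: the statistic should be large
```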
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/comparing_distributions_l1/optimizing_position.png" style="width: 70%;" /&gt;
&lt;p class="caption"&gt;Sampling at different positions&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;We further develop the testing procedures by showing that other tricks
known to improve testing with the L2 metric can be adapted to other
metrics, such as the L1 metric. Fast and performant tests can be obtained
by optimizing the test locations –using an upper-bound on the test power–
or by testing in the Fourier domain, using the Smooth Characteristic
Function of the kernel. Even in the case of the L1 metric, the null
distribution of the test statistic can be derived, leading to tests that
can control errors without permutations.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-l1-metric-provides-best-testing-power"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;The L1 metric provides best testing power&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Going back to our question of which norm on the difference of
distribution representatives is best suited to detect differences, we show
that when using analytic kernels, such as the Gaussian kernel, the L1 metric
improves upon the L2 metric, which corresponds to the classic definition
of the MMD.&lt;/p&gt;
&lt;p&gt;Indeed, analytic kernels are non-zero almost everywhere. As a result,
when P is different from Q, the difference between their mean embeddings
will be dense, as will the differences between the representatives
that we use to build our tests (for instance the values at the locations
used in the tests above). l1 norms capture dense
differences better than l2 norms –this is the reason why, used as penalties,
they induce sparsity.&lt;/p&gt;
&lt;img alt="" class="align-right" src="attachments/comparing_distributions_l1/l1_vs_l2.png" style="width: 150px;" /&gt;
&lt;p&gt;A simple intuition is that dense vectors tend to lie along the diagonals of
the measurement basis, as none of their coordinates are zero. On these
diagonals, at a given l2 norm, the l1 norm is much larger than the l1 norm of
vectors with some zero, or nearly-zero, coordinates.&lt;/p&gt;
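&lt;p&gt;A two-line numerical illustration: among unit-l2-norm vectors in dimension d, the constant “diagonal” vector has l1 norm √d, while a one-hot vector has l1 norm 1:&lt;/p&gt;

```python
import numpy as np

d = 100
dense = np.full(d, 1 / np.sqrt(d))  # unit l2 norm, all coordinates equal
sparse = np.zeros(d)
sparse[0] = 1.0                     # unit l2 norm, a single nonzero coordinate
l1_dense = np.abs(dense).sum()      # sqrt(d) = 10.0
l1_sparse = np.abs(sparse).sum()    # 1.0
```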
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For a very simple summary, the story is: to test
whether two distributions differ, it is useful to compute a “mean
kernel embedding” –similar to a kernel density estimate, but without
normalization– of each distribution, and consider the l1 norm of the
difference of these embeddings. The embeddings can be computed at a small number
of locations, either drawn at random or optimized. This approach is
reminiscent of looking at the total variation between the measures;
however, the use of kernels makes it robust to small spatial
noise in the observations, unlike the total variation, for which events
must perfectly coincide in both sets of observations (the total
variation does not metrize the weak convergence).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The framework exposed here is one that was developed over a long line
of research, which our work builds upon. &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Our paper&lt;/a&gt;
gives a complete list of references, however, some useful review
papers are&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;C.-J. Simon-Gabriel and B. Schölkopf. &lt;em&gt;Kernel distribution
embeddings: Universal kernels, characteristic kernels and kernel
metrics on distributions&lt;/em&gt;, &lt;a class="reference external" href="https://arxiv.org/abs/1604.05251"&gt;arXiv:1604.05251&lt;/a&gt;, 2016.&lt;/li&gt;
&lt;li&gt;A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, A. Smola; &lt;em&gt;A
Kernel Two-Sample Test&lt;/em&gt;, &lt;a class="reference external" href="http://www.jmlr.org/papers/v13/gretton12a.html"&gt;JMLR, 2012&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://slideslive.com/38921490/interpretable-comparison-of-distributions-and-models"&gt;The NeurIPS 2019 tutorial&lt;/a&gt;,
by Gretton, Sutherland, and Jitkrittum, is extremely didactic and gives
a lot of the big picture.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="two-sample testing"></category><category term="conferences"></category><category term="statistics"></category></entry><entry><title>Getting a big scientific prize for open-source software</title><link href="https://gael-varoquaux.info/programming/getting-a-big-scientific-prize-for-open-source-software.html" rel="alternate"></link><published>2019-12-01T06:00:00+01:00</published><updated>2019-12-01T06:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2019-12-01:/programming/getting-a-big-scientific-prize-for-open-source-software.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;An important acknowledgement for a different view of doing science:
open, collaborative, and more than a proof of concept.&lt;/p&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="attachments/sklearn_prize_academie/prize.jpg" style="width: 350px;" /&gt;
&lt;p&gt;A few days ago, Loïc Estève, Alexandre Gramfort, Olivier Grisel, Bertrand
Thirion, and myself received the &lt;a class="reference external" href="https://www.academie-sciences.fr/fr/Laureats/prix-inria-academie-des-sciences-2019-vincent-hayward-equipe-scikit-learn-et-maria-naya-plasencia.html"&gt;“Académie des Sciences Inria prize for transfer”&lt;/a&gt;,
for our contributions to the scikit-learn project …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;An important acknowledgement for a different view of doing science:
open, collaborative, and more than a proof of concept.&lt;/p&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="attachments/sklearn_prize_academie/prize.jpg" style="width: 350px;" /&gt;
&lt;p&gt;A few days ago, Loïc Estève, Alexandre Gramfort, Olivier Grisel, Bertrand
Thirion, and myself received the &lt;a class="reference external" href="https://www.academie-sciences.fr/fr/Laureats/prix-inria-academie-des-sciences-2019-vincent-hayward-equipe-scikit-learn-et-maria-naya-plasencia.html"&gt;“Académie des Sciences Inria prize for transfer”&lt;/a&gt;,
for our contributions to the scikit-learn project. To put things simply,
it’s quite a big deal to me, because I feel that it illustrates a change
of culture in academia.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Recognizing an open view of scientific contributions&lt;/div&gt;
&lt;p&gt;It is a great honor, because the selection was made by the members of the
Académie des Sciences, very accomplished scientists with impressive
contributions to science. The “Académie” is the hallmark of fundamental
academic science in France. To me, this prize is also symbolic because it
recognizes an open view of academic research and transfer, a view that
sometimes felt as not playing according to the incentives. We started
scikit-learn as a crazy endeavor, a bit of a &lt;em&gt;hippy&lt;/em&gt; science thing.
People didn’t really take us seriously. We were working on software, and
not publications. We were doing open source, while industrial transfer is
made by creating startups or filing patents. We were doing Python, while
academic machine learning was then done in Matlab, and industrial
transfer in C++. We were not pursuing the latest publications, while
these are thought to be research’s best assets. We were interested in
reaching out to non-experts, while the partners deemed
interesting were those with qualified staff.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Quality and openness, at the cost of quantity and control&lt;/div&gt;
&lt;p&gt;No. We did it differently. We reached out to an open community. We did
BSD-licensed code. We worked to achieve quality at the cost of quantity. We
cared about installation issues, on-boarding biologists or medical
doctors, playing well with the wider scientific Python ecosystem.
We gave decision power to people outside of Inria, sometimes whom we had
never met in real life. We made sure that Inria was never the sole actor,
the sole stake-holder. We never pushed our own scientific publications in
the project. We limited complexity, trading off performance for ease of
use, ease of installation, ease of understanding.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;object data="attachments/sklearn_prize_academie/sklearn_website_stats_white.svg" style="width: 25%;" type="image/svg+xml"&gt;&lt;/object&gt;
&lt;/div&gt;
&lt;p&gt;As a consequence, we slowly but surely assembled a large community. In
such a community, the
sum is greater than the parts. The breadth of interlocutors and cultures
slows movement down, but creates better results, because these results are
understandable to many and usable on a diversity of problems. The
consequence of this quality is that
we were progressively used in more and more places: industrial
data-science labs, startups, research in applied or fundamental
statistical learning, teaching. Ironically, the institutional world did
not notice. It got hard, next to impossible, to get funding &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. A few years
ago, I was told by a central governmental agency that we, open-source
zealots, were destroying an incredible amount of value by giving away
for free the production of research &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;. The French report on AI, led by a
Fields medalist, cited TensorFlow and Theano –a discontinued software–, but
ignored scikit-learn; maybe because we were doing “boring science”?&lt;/p&gt;
&lt;p&gt;But, scikit-learn’s amazing community continued plowing forward. We grew
so much that we were heard from the top. The prize from the Académie shows
that we managed to capture the attention of senior scientists with
open-source software, because this software is really having a worldwide
impact in many disciplines.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/sklearn_prize_academie/academie_presentation.jpeg" style="width: 70%;" /&gt;
&lt;p class="caption"&gt;Presenting scikit-learn at the Academie Des Sciences&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
An accomplishment of the community&lt;/div&gt;
&lt;p&gt;There were only five of us on stage, as the prize is for Inria permanent
staff. But this is of course not a fair account of how the project has
grown and what made it successful.&lt;/p&gt;
&lt;p&gt;In 2011, at &lt;a class="reference external" href="scikit-learn-nips-2011-sprint-international-thanks-to-our-sponsors.html"&gt;the first international sprint&lt;/a&gt;,
I felt something was happening: Incredible people whom I had never met
before were sitting next to me, working very hard on solving problems
with me. This experience of being united to solve difficult problems is
something amazing. And I deeply thank every single person who has worked
on this project, the 1500 contributors, many of those that I have never
met, in particular &lt;a class="reference external" href="https://scikit-learn.org/stable/about.html#authors"&gt;the core team&lt;/a&gt; who is committed
to making sure that every detail of scikit-learn is solid and serves the
users. The team that has assembled over the years is of incredible
quality.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
The promises of data science need open source&lt;/div&gt;
&lt;p&gt;The world does not understand how much the promises of data science,
for today and tomorrow, need open source projects, easy to install and to use
by everybody. These projects are like &lt;a class="reference external" href="https://www.fordfoundation.org/work/learning/research-reports/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/"&gt;roads and bridges&lt;/a&gt;:
they are needed for growth, though no one wants to pay for maintaining
them. I hope that I can use the podium that the prize will give us to
stress the importance of the battle that we are fighting.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Getting funding from the government implied too much politics and
risks. For these reasons, I turned to private donors, in a
&lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;foundation&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Inria &lt;em&gt;always&lt;/em&gt; supported us, and often paid developers in my team
out of its own pockets.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;PS: As another illustration of the culture change toward openness in
science, it was announced during the ceremony that the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Comptes_rendus_de_l%27Acad%C3%A9mie_des_Sciences"&gt;“Compte Rendu de
l’Académie des Sciences”&lt;/a&gt; is becoming open access, without publication
charges!&lt;/p&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="science"></category><category term="scientific computing"></category><category term="open source"></category><category term="software"></category></entry><entry><title>2018: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2018-my-scientific-year-in-review.html" rel="alternate"></link><published>2019-01-03T00:00:00+01:00</published><updated>2019-01-03T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2019-01-03:/science/2018-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;From a scientific perspective, 2018 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; was once again extremely exciting
thanks to awesome collaborators (at &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Inria&lt;/a&gt;, with &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;, and our &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;local scikit-learn team&lt;/a&gt;).
Rather than going over everything that we did in 2018, I would like to
give a few highlights: We published major work using &lt;strong&gt;machine learning to …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;From a scientific perspective, 2018 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; was once again extremely exciting
thanks to awesome collaborators (at &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Inria&lt;/a&gt;, with &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;, and our &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;local scikit-learn team&lt;/a&gt;).
Rather than going over everything that we did in 2018, I would like to
give a few highlights: We published major work using &lt;strong&gt;machine learning to
map cognition in the brain&lt;/strong&gt;; we started a new research project on &lt;strong&gt;analysis
of non-curated data&lt;/strong&gt; (addressing all of data science, beyond brain
imaging); and we worked a lot on &lt;strong&gt;growing scikit-learn&lt;/strong&gt;.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It’s already 2019, I am indeed late in posting this summary.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#cognitive-brain-mapping" id="toc-entry-1"&gt;Cognitive brain mapping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#data-science-without-data-cleaning" id="toc-entry-2"&gt;Data science without data cleaning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#scikit-learn-growth-and-consolidation" id="toc-entry-3"&gt;Scikit-learn: growth and consolidation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="cognitive-brain-mapping"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Cognitive brain mapping&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We have been exploring &lt;strong&gt;how predictive models can help map cognition
in the human brain&lt;/strong&gt;. In 2018, these long-running efforts led to important
publications.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="atlases-of-cognition-with-large-scale-human-brain-mapping"&gt;
&lt;h3&gt;Atlases of cognition with large-scale human brain mapping&lt;/h3&gt;
&lt;p&gt;More than 6 years ago, with my student Yannick Schwartz, we started
working on &lt;strong&gt;compiling an atlas of cognition across many cognitive
neuroimaging studies&lt;/strong&gt;. This turned out to be quite challenging for several
reasons:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Formalizing the links between mental processes&lt;/strong&gt; studied across the
literature is challenging. Strictly speaking, every paper studies a
different mental process. However, to build an atlas of cognition, we
are interested in finding commonalities across the literature.&lt;/li&gt;
&lt;li&gt;While cognitive studies tend to target a specific mental function,
the psychological manipulations that they use also recruit many other
processes. For instance, a memory study might use a &lt;em&gt;visual n-back&lt;/em&gt;
task, and hence recruit the visual cortex. The problem is more than an
experimental inconvenience: &lt;strong&gt;varying details of an experiment may
trigger different cognitive processes&lt;/strong&gt;. For instance, there are common
and separate pathways for visual word recognition and auditory word
recognition.&lt;/li&gt;
&lt;li&gt;Simply &lt;strong&gt;detecting regions that are recruited in a given mental operation
leads to selecting the whole cortex&lt;/strong&gt; with enough statistical power. Indeed,
tasks are never fully balanced; reading might for instance require more
attention than listening.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These challenges are related on the one hand to the problem of &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1364661305003360"&gt;reverse
inference&lt;/a&gt;
&lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;, and on the other hand to that of mental-process decomposition, or
cognitive subtraction, both central to cognitive neuroimaging. They also
call for formal knowledge representation, &lt;em&gt;eg&lt;/em&gt; by building ontologies,
which is a task harder than it might seem at first glance.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;In essence, the reverse inference problem arises because in a
cognitive brain imaging the observed brain activity is a consequence
of the behavior, and not a cause. While a conclusion that activity in
a brain structure causes a certain behavior is desirable, it is not
directly supported by a cognition neuroimaging experiment.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In our work &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;[Varoquaux et al, PLOS 2018]&lt;/a&gt;,
we tackled these challenges to build atlases of cognition as follows:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;We assigned to each brain-activity image labels describing the
&lt;em&gt;multiple&lt;/em&gt; mental processes related to the experimental manipulation&lt;/li&gt;
&lt;li&gt;We used decoding –&lt;em&gt;ie&lt;/em&gt; prediction of the cognitive labels from the brain
activity– to ground a principled &lt;em&gt;reverse inference&lt;/em&gt; interpretation:
the regions selected indeed imply the corresponding behavior.&lt;/li&gt;
&lt;li&gt;Regions in the atlas were built of brain structures that both implied
the corresponding cognition, and were triggered by it (conditional and
marginal link), to ground a strong selectivity:&lt;/li&gt;
&lt;/ul&gt;
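&lt;p&gt;The decoding step above can be sketched with a toy nearest-centroid
classifier (a pure-Python illustration on synthetic data; the labels,
vectors, and classifier here are ours, not the actual pipeline of the
paper):&lt;/p&gt;

```python
# Toy sketch of "decoding": predict a cognitive label from a brain-activity
# vector with a nearest-centroid classifier (illustrative only).
def centroid(vectors):
    # Coordinate-wise mean of a list of activity vectors
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

def decode(activity, centroids):
    # Assign the label whose centroid is closest in squared Euclidean distance
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda label: dist(activity, centroids[label]))

# Synthetic activation maps (3 "voxels") for two mental processes
train = {
    "visual":   [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "auditory": [[0.1, 0.9, 0.8], [0.0, 1.0, 0.9]],
}
centroids = {label: centroid(vs) for label, vs in train.items()}
print(decode([0.95, 0.15, 0.05], centroids))  # a held-out "visual"-like image
```

&lt;p&gt;As in the paper, what grounds the reverse inference is that the label is
predicted from held-out activity, rather than read off the training maps.&lt;/p&gt;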
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;&lt;img alt="" src="attachments/2018_highlights/mapping_types.png" style="width: 700px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;We applied these techniques to the data from 30 different studies,
resulting in a detailed breakdown of the cortex into functionally-specialized
modules:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;&lt;img alt="" src="attachments/2018_highlights/cognitive_regions.png" style="width: 700px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;Importantly, the validity of this decomposition in regions is established
by the ability of these regions to predict the cognitive aspects of new
experimental paradigms.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="predictive-models-avoid-excessive-reductionism-in-cognitive-neuroimaging"&gt;
&lt;h3&gt;Predictive models avoid excessive reductionism in cognitive neuroimaging&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2018_highlights/decoding.png" style="width: 400px;" /&gt;
&lt;/div&gt;
&lt;p&gt;While machine learning is generally seen as an engineering tool to build
predictive models or automate tasks, I see in it a central method of
modern science. Indeed, it can distill &lt;strong&gt;evidence that generalizes&lt;/strong&gt; from
vast –high dimensional– and ill-structured experimental data. Beyond
prediction, it can guide understanding.&lt;/p&gt;
&lt;p&gt;With Russ Poldrack, we wrote an opinion paper &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-01856412/"&gt;[Varoquaux &amp;amp; Poldrack,
Curr Opinion Neurobio 2019]&lt;/a&gt; that details why
predictive models are important tools to building wider theories of brain
function. It reviews much exciting progress in uncovering, with
predictive models, how brain mechanisms support the mind. It makes the
point that the &lt;strong&gt;ability to generalize is a fundamentally desirable property of
scientific inference&lt;/strong&gt;. Models that are grounded in explicit
generalization give a solid path to build broad theories of the mind.
Particularly interesting is generalization to significantly different
settings, &lt;em&gt;ie&lt;/em&gt; going further than the typical cross-validation experiments of
machine learning, where the same data are artificially split.&lt;/p&gt;
&lt;p&gt;Something that is dear to my heart is that we are aiming for
&lt;strong&gt;quantitative generalization&lt;/strong&gt;, while psychology often contents itself
with qualitative generalization.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="individual-brain-charting-a-high-resolution-fmri-dataset-for-cognitive-mapping"&gt;
&lt;h3&gt;Individual Brain Charting, a high-resolution fMRI dataset for cognitive mapping&lt;/h3&gt;
&lt;p&gt;We are convinced of the importance of analyzing brain response across
multiple paradigms, to build models of brain function that generalize
across these paradigms. However, addressing such a research program by
aggregating multiple studies is hindered by data heterogeneity, due to
inter-individual differences or to differing scanners.&lt;/p&gt;
&lt;p&gt;Hence, my team, &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Parietal&lt;/a&gt;, has
undertaken a major data acquisition, the &lt;a class="reference external" href="https://project.inria.fr/IBC"&gt;Individual Brain Charting
project&lt;/a&gt;: &lt;strong&gt;scanning a few individuals
on a huge number of cognitive tasks&lt;/strong&gt;. The data acquisition will last
for many years, as the individuals come back to the lab for new
acquisitions. The images are of excellent quality, thanks to the unique
expertise of our scanning site, Neurospin, a brain-imaging research
facility.&lt;/p&gt;
&lt;p&gt;The data are completely &lt;strong&gt;openly accessible&lt;/strong&gt;: the raw data, preprocessed
data, and statistical outputs, alongside the processing scripts. We are
releasing new data as the project moves forward. This year, we published
the data paper &lt;a class="reference external" href="https://www.nature.com/articles/sdata2018105"&gt;[Pinho et al, Scientific Data 2018]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Data accumulation in brain imaging&lt;/p&gt;
&lt;p&gt;We are living exciting times, as &lt;strong&gt;there are more and more large volumes
of shared brain imaging data&lt;/strong&gt;. &lt;a class="reference external" href="https://openfmri.org/"&gt;OpenfMRI&lt;/a&gt;
aggregates data in a consistent way across brain-imaging
studies. Large projects such as the Human Connectome Project, our
Individual Brain Charting project, or the UK BioBank, are designed
from the beginning to be shared. We are entering an era of
brain-image analysis on many terabytes of data, with tens of
thousands of subjects, spanning hundreds of different clinical or
cognitive conditions.&lt;/p&gt;
&lt;p&gt;Massive data accumulation opens exciting new scientific prospects,
and raises new engineering challenges. Some of these challenges are
to scale up neuroimaging data-processing practices, eg inter-subject
alignments at the scale of many thousands of subjects. Some of these
challenges are new to neuroimaging: &lt;strong&gt;when compounding hundreds of
sources of data into an analysis, the human cost of data
integration becomes a major roadblock&lt;/strong&gt;. As I have become convinced
that analysing more, and more diverse, data is an important way
forward, I have started working on data integration per se.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="data-science-without-data-cleaning"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Data science without data cleaning&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="a-new-personal-research-agenda-dirtydata"&gt;
&lt;h3&gt;A new personal research agenda: DirtyData&lt;/h3&gt;
&lt;p&gt;Challenges to integrating data in a statistical analysis are ubiquitous,
including in brain imaging. Data cleaning &lt;a class="reference external" href="https://www.kaggle.com/surveys/2017"&gt;is recognized&lt;/a&gt; as the number one time sink for
data scientists. When advising scikit-learn users, including very large
companies, I often find that the major roadblock is going from the raw
data sources to the data matrix that is input to scikit-learn.&lt;/p&gt;
&lt;p&gt;A year ago, I started a new research focus, around the &lt;a class="reference external" href="https://project.inria.fr/dirtydata"&gt;DirtyData project&lt;/a&gt;. We now have a team with multiple
exciting collaborations, and funding. Our goal is to &lt;strong&gt;facilitate
statistical analysis of non-curated data&lt;/strong&gt;. We hope to foster better
understanding of how powerful machine-learning models can cope with
imperfect, non-homogeneous data. As we go, we will publish this
understanding, but also distribute code with new methods, and hopefully
influence common data-science practices and software. This is an exciting
adventure (and yes, &lt;strong&gt;we are hiring&lt;/strong&gt;; see our &lt;a class="reference external" href="https://project.inria.fr/dirtydata/job-offers"&gt;job offers&lt;/a&gt; or contact me).&lt;/p&gt;
&lt;p&gt;The topics are vast, at the intersection between database research and
statistics. In particular, they call for integrating machine learning
with:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Knowledge representation&lt;/li&gt;
&lt;li&gt;Information retrieval&lt;/li&gt;
&lt;li&gt;Information extraction&lt;/li&gt;
&lt;li&gt;Statistics with missing data&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="similarity-encoding-analysis-with-non-normalized-string-categories"&gt;
&lt;h3&gt;Similarity encoding: analysis with non-normalized string categories&lt;/h3&gt;
&lt;p&gt;While the DirtyData project is young, we already made progress for
analysis of &lt;strong&gt;dirty categories, ie categorical data represented with
strings that lack curation&lt;/strong&gt;. These can have typos or other simple
morphological variants (&lt;em&gt;eg&lt;/em&gt; “patient” vs “patients”), or they can have
more structured and fundamental differences, &lt;em&gt;eg&lt;/em&gt; arising from the merge
of multiple data sources. This latter problem is well known in database
research, where it is seen as a &lt;em&gt;record linkage&lt;/em&gt; or &lt;em&gt;alignment&lt;/em&gt; problem.&lt;/p&gt;
&lt;p&gt;For statistical analysis, in particular machine learning, the problem
with these non-curated string categories is that they must be encoded to
numerical representations, and classic categorical encodings are not well
suited for them. For instance, one-hot encoding of such high-cardinality
categories leads to very high-dimensional representations.&lt;/p&gt;
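&lt;p&gt;A toy sketch of the issue (illustrative data, not from the paper): with
dirty categories, one-hot encoding gives every distinct string its own
orthogonal dimension, so near-duplicates such as “patient” and “patients”
are treated as unrelated:&lt;/p&gt;

```python
# Toy illustration: one-hot encoding of non-curated string categories.
# Every distinct string, including near-duplicates, gets its own dimension.
values = ["patient", "patients", "Patient", "doctor", "doctors"]
vocabulary = sorted(set(values))  # 5 distinct strings, hence 5 dimensions
one_hot = [[1 if v == c else 0 for c in vocabulary] for v in values]
print(len(vocabulary))  # dimensionality grows with every typo or variant
# "patient" and "patients" activate different dimensions: their one-hot
# vectors are orthogonal, so the encoding carries no notion of closeness.
```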
&lt;p&gt;In &lt;a class="reference external" href="https://hal.inria.fr/hal-01806175"&gt;Cerda et al (2018)&lt;/a&gt;, we
contribute a simple encoding approach, &lt;em&gt;similarity encoding&lt;/em&gt;, based on
interpolating one-hot encoding with string similarities between the
categories.&lt;/p&gt;
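&lt;p&gt;A minimal sketch of the idea (using the Python standard library’s string
similarity rather than the similarities studied in the paper): each
category is encoded by its similarity to every reference category, so
morphological variants get close vectors:&lt;/p&gt;

```python
from difflib import SequenceMatcher

# Toy sketch of similarity encoding: replace the 0/1 entries of one-hot
# encoding with a continuous string similarity to each reference category.
def similarity_encode(value, vocabulary):
    return [SequenceMatcher(None, value, ref).ratio() for ref in vocabulary]

vocabulary = ["doctor", "nurse", "patient"]
a = similarity_encode("patient", vocabulary)   # exact match: 1.0 on the "patient" axis
b = similarity_encode("patients", vocabulary)  # typo-like variant stays close to a
print(a)
print(b)
```

&lt;p&gt;The encoder studied in the paper considers several string similarities
(e.g. n-gram based); this sketch only conveys the interpolation idea.&lt;/p&gt;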
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://dirty-cat.github.io/stable/auto_examples/01_investigating_dirty_categories.html"&gt;&lt;img alt="" src="attachments/2018_highlights/investigating_dirty_categories.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://dirty-cat.github.io/stable/auto_examples/02_fit_predict_plot_employee_salaries.html"&gt;&lt;img alt="" src="attachments/2018_highlights/predict_employee_salaries.png" style="width: 230px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;We ran an extensive empirical study, and showed that &lt;strong&gt;similarity encoding
leads to better prediction accuracy without curation of the data&lt;/strong&gt;,
outperforming all the other approaches that we tried. The paper is purely
empirical, but stay tuned: a theoretical analysis of why this is the case
is coming soon.&lt;/p&gt;
&lt;p&gt;For the benefit of data scientists and researchers, we released a
small Python package, &lt;a class="reference external" href="https://dirty-cat.github.io/stable/"&gt;dirty-cat&lt;/a&gt;,
for learning with dirty categories.&lt;/p&gt;
&lt;p&gt;This is just the beginning of the DirtyData project, more exciting work
is under way.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-growth-and-consolidation"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Scikit-learn: growth and consolidation&lt;/a&gt;&lt;/h2&gt;
&lt;img alt="" class="align-right" src="attachments/2018_highlights/scikit-learn-logo-notext.png" style="width: 150px;" /&gt;
&lt;p&gt;In 2018, a lot of my energy went to consolidating scikit-learn as a
project. Describing the work in detail is for another post. However, my
main efforts were around growing the team and working on sustainability.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;We established a &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;scikit-learn foundation at Inria&lt;/a&gt;, in which companies
partner with us to fund scikit-learn development. This took a lot of
effort to establish good partnerships and create the legal vessels.
Indeed, we want to make sure that the common effort is invested to make
scikit-learn better. For instance, working with Intel, who are in
something of an arms race for computing speed, we improved our test suite,
and are slowly but surely learning how to improve our speed.&lt;/li&gt;
&lt;li&gt;A consequence of the foundation is that we are hiring to grow the team
(check out &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/people/"&gt;our open positions&lt;/a&gt;). In 2018, my own
team grew, with more excellent people working on scikit-learn, but also
&lt;a class="reference external" href="http://joblib.readthedocs.io/"&gt;joblib&lt;/a&gt;, and even contributing to
core Python and numpy to improve &lt;a class="reference external" href="https://github.com/python/cpython/pull/3895"&gt;parallel computing&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/numpy/numpy/pull/12133"&gt;pickling&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;As the scikit-learn community is growing, it seemed important to
formalize a bit more how decisions are made. To me, an important aspect
was laying out clearly that the project is still governed by the
community, and not partners or people paid by the foundation. We have a
draft of a &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/12878"&gt;governance document&lt;/a&gt; that is
pretty much ready for merge. We also worked on a &lt;a class="reference external" href="https://scikit-learn.org/dev/roadmap.html"&gt;roadmap&lt;/a&gt;. It is a non-binding
document, but it was still an interesting exercise.&lt;/li&gt;
&lt;li&gt;Scikit-learn 0.20 was released, &lt;a class="reference external" href="https://scikit-learn.org/dev/whats_new.html"&gt;with many enhancements&lt;/a&gt;. And the 0.20 release
was followed by two minor releases, to make sure that our users got
robust code with backward compatibility.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We are busy finishing a few very interesting studies; next year will be
exciting! I hope that we will have much to say about population analysis
with brain imaging, which is an amazingly interesting subject.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="yearly report"></category></entry><entry><title>A foundation for scikit-learn at Inria</title><link href="https://gael-varoquaux.info/programming/a-foundation-for-scikit-learn-at-inria.html" rel="alternate"></link><published>2018-09-17T00:00:00+02:00</published><updated>2018-09-17T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2018-09-17:/programming/a-foundation-for-scikit-learn-at-inria.html</id><summary type="html">&lt;p&gt;We have just announced that a foundation will be supporting scikit-learn
at Inria &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;: &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr"&gt;scikit-learn.fondation-inria.fr&lt;/a&gt;&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Growth and sustainability&lt;/div&gt;
&lt;p&gt;This is an exciting turn for us, because it enables us to receive private
funding. As a result, we will be able to have secure employment for some
existing core …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We have just announced that a foundation will be supporting scikit-learn
at Inria &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;: &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr"&gt;scikit-learn.fondation-inria.fr&lt;/a&gt;&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Growth and sustainability&lt;/div&gt;
&lt;p&gt;This is an exciting turn for us, because it enables us to receive private
funding. As a result, we will be able to have secure employment for some
existing core contributors, and to hire more people on the team. The goal
is to help sustain quality (more frequent releases?) and to tackle
some ambitious features.&lt;/p&gt;
&lt;div class="section" id="a-foundation-what-and-why"&gt;
&lt;h2&gt;A foundation? What and why?&lt;/h2&gt;
&lt;p&gt;Open source lives and thrives by its base, the community of developers.
And scikit-learn is a fantastic example of these dynamics. Because of its
grass-root origins, it has focused on features that matter for the small
and the many, such as ease of use and statistical models that work well
in data-poor situations. Over the years, decisions have been based on
their technical merit, rather than the importance of displaying a list of
features that are trendy. A consequence of the breadth of contributors
with different backgrounds is that the library tends to be well-suited for
many applications, including some models that are less mainstream.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
People with dedicated time to support the community&lt;/div&gt;
&lt;p&gt;That said, over time there is an increasing need for a core team of
maintainers. As the library gets bigger, it is more and more difficult to
have a full view of what is happening. Integration of new features,
quality assurance, and releases are best done by developers who can
dedicate a large amount of time to the library. Also, ambitious changes
to the library, such as improving the parallel computing engine, need
long efforts. For many years, we have always had people with dedicated
time to support the community. In France, we were jumping through hoops to
find public money to fund them. As someone who has made this effort, I
can tell you that it is a complicated one &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The ability to receive money from sponsors will enable us to scale up our
operations. I was initially worried that we would have difficulties
finding partners willing to give us money without asking for
control on the project. However, I was proven wrong, and we have found a
small set of &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/en/home/#sponsors"&gt;great partners&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="what-will-people-work-on-how-will-decisions-be-made"&gt;
&lt;h2&gt;What will people work on? How will decisions be made?&lt;/h2&gt;
&lt;p&gt;It can be a difficult exercise to balance how money is used in a
community-driven project. The project should not lose its drive, in which
the community of developers is central. The interests of the sponsors
should not take precedence over the interests of the user base.&lt;/p&gt;
&lt;p&gt;We will make sure that the money that the foundation receives is invested
for the interest of the community. We have a technical committee that
supervises the activity of the foundation. Its decisions will be informed
by the community &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;. For this, we have an advisory board composed of
core contributors of scikit-learn. Beside the advisory board, the
technical committee also comprises a delegate from each sponsor. I am
excited about the input that our partners will provide us on
the priorities for them, as they represent various industries.
Voting power will be spread so that sponsors and community have the same
voting power.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="why-not-an-existing-foundation-such-as-numfocus-or-the-psf"&gt;
&lt;h2&gt;Why not an existing foundation such as NumFOCUS, or the PSF?&lt;/h2&gt;
&lt;p&gt;There are several reasons why we choose this particular legal vessel. Our
endeavor is slightly different from the prominent foundations in our ecosystem,
&lt;a class="reference external" href="https://numfocus.org"&gt;NumFocus&lt;/a&gt; and the &lt;a class="reference external" href="https://www.python.org/psf"&gt;PSF (Python Software
Foundation)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The first important aspect is that we want to employ full-time
developers. Different countries have very different legal frameworks, and
it is really hard for a non-profit to transfer money overseas. Handling
physical commitments, such as employing people or owning real estate, is even harder. We
needed something in France. And there might be a need for something else
in another country at some point.&lt;/p&gt;
&lt;p&gt;Another reason to be embedded in the Inria foundation is that it gives
us a really good deal. We basically get legal advice, accounting,
office space, and IT support, for an 8% overhead. This is an excellent
deal and is part of the sponsoring efforts that Inria will keep doing.&lt;/p&gt;
&lt;p&gt;Last, we feel that a foundation targeting specifically scikit-learn can
raise money from different people than other foundations. I think that
there is value in having multiple foundations seeking money for open-source
software. Indeed, a foundation builds a case and an image, to convince
donors. Different donors require a different case and a different image.
For instance the president of NumFOCUS &lt;a class="reference external" href="https://twitter.com/aterrel/status/1039488246454083585"&gt;argues for a name less focused on
numerics&lt;/a&gt;. Yet,
too wide of a scope can dilute the image.&lt;/p&gt;
&lt;p&gt;We have in mind to make it easy for other foundations to support
scikit-learn. We have major contributors at leading institutions, such
as Andreas Mueller at Columbia or Joel Nothman at Sydney university. It
is important that these institutions can easily gather donations too, in
the legal framework suited to their country. Hence the name reflects that
the foundation is embedded at Inria, leaving room for other initiatives.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="what-s-the-scope"&gt;
&lt;h2&gt;What’s the scope?&lt;/h2&gt;
&lt;p&gt;The scope of our work is everything scikit-learn related. It is not the
whole pydata or scipy ecosystem: it is focused on scikit-learn. But we
will not hesitate to contribute fixes and enhancements to neighboring
projects, like in the past, even all the way up to core Python &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-4"&gt;[4]&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;I am very excited. A strong team of full-time contributors will allow
us to do ambitious things with scikit-learn.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Join us&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We will be recruiting! See &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/people"&gt;our positions&lt;/a&gt;. Come work with us
in Paris.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I want to end by thanking the amazing men and women who have been
contributing to scikit-learn, and are with us in this fantastic
adventure! The energy that is in this project is incredible. We are
launching this effort thanks to you, and to empower you even more.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/code_sklearn_crop.jpg" style="width: 90%;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;I am quite proud that over the years, my group has employed
&lt;a class="reference external" href="https://github.com/ogrisel"&gt;Olivier Grisel&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/jorisvandenbossche"&gt;Joris van den Bossche&lt;/a&gt; (working on pandas in
addition to scikit-learn), &lt;a class="reference external" href="https://github.com/glemaitre"&gt;Guillaume Lemaître&lt;/a&gt; (working on imbalanced-learn in
addition to scikit-learn), &lt;a class="reference external" href="https://github.com/jeremiedbb"&gt;Jérémie du Boisberranger&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/tomMoral"&gt;Tom Moreau&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/lesteve"&gt;Loic Estève&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/fabianp"&gt;Fabian Pedregosa&lt;/a&gt;, to name only a
few. All these people, and the many other students that we have
paid part time to work on software, have had a structuring
impact on our ecosystem, going beyond the bounds of scikit-learn
and touching many aspects of computing in Python. However, because
of the constraints of research funding in France, public money
forced me to hire them on short-term contracts.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Technically, it is a tax-deductible scikit-learn consortium inside
the Inria foundation, which is a non-profit entity related to Inria.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Details on the goverance of the foundation can be found at
&lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/en/mission-and-governance"&gt;https://scikit-learn.fondation-inria.fr/en/mission-and-governance&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-4"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;For instance, Olivier and Tom have been making parallelism more
robust in Python 3.7 (among other issues,
&lt;a class="reference external" href="https://bugs.python.org/issue33056"&gt;https://bugs.python.org/issue33056&lt;/a&gt; and
&lt;a class="reference external" href="https://bugs.python.org/issue31699"&gt;https://bugs.python.org/issue31699&lt;/a&gt;). Olivier helped define the
&lt;a class="reference external" href="https://www.python.org/dev/peps/pep-0574/"&gt;new pickling protocol&lt;/a&gt;, crucial to
efficient persistence.
This is hard work. Yet it is
important, because it benefits all libraries.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="open-source"></category><category term="sustainabilty"></category><category term="scientific software"></category></entry><entry><title>Sprint on scikit-learn, in Paris and Austin</title><link href="https://gael-varoquaux.info/programming/sprint-on-scikit-learn-in-paris-and-austin.html" rel="alternate"></link><published>2018-08-01T00:00:00+02:00</published><updated>2018-08-01T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2018-08-01:/programming/sprint-on-scikit-learn-in-paris-and-austin.html</id><summary type="html">&lt;p&gt;Two weeks ago, we held a scikit-learn sprint in Austin and Paris. Here is
a brief report on progress and challenges.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Several sprints&lt;/p&gt;
&lt;p&gt;We actually held two sprints in Austin: one open sprint, at the SciPy
conference sprints, which was open to new contributors, and one core
sprint, for more …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Two weeks ago, we held a scikit-learn sprint in Austin and Paris. Here is
a brief report on progress and challenges.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Several sprints&lt;/p&gt;
&lt;p&gt;We actually held two sprints in Austin: one open sprint, at the SciPy
conference sprints, which was open to new contributors, and one core
sprint, for more advanced contributors. Thank you to all who joined
the SciPy conference sprint. As I wasn’t there, I cannot report on
it.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="many-achievements"&gt;
&lt;h2&gt;Many achievements&lt;/h2&gt;
&lt;p&gt;Too many things were done to be listed here. Here is a brief overview:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;OPTICS got merged&lt;/strong&gt;: &lt;a class="reference external" href="http://scikit-learn.org/dev/modules/clustering.html#optics"&gt;The OPTICS clustering algorithm&lt;/a&gt; is a
density-based clustering algorithm, like DBSCAN, but with more flexible
and easier-to-set hyperparameters. Our implementation also scales better
to very large numbers of samples. The &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/1984"&gt;pull request&lt;/a&gt; was opened
in 2013 and received many improvements over the years.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Yeo-Johnson&lt;/strong&gt;: &lt;a class="reference external" href="http://scikit-learn.org/dev/modules/preprocessing.html#mapping-to-a-gaussian-distribution"&gt;The Yeo-Johnson transform&lt;/a&gt;
is a simple parametric transformation of the data that can be used to
make it more Gaussian. It is similar to the Box-Cox transform but can
deal with negative data
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11520"&gt;PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Novelty versus outlier detection&lt;/strong&gt;: Novelty detection attempts to
find, in new data, observations that differ from the training data.
Outlier detection considers that even the training data may contain
aberrant observations. New modes in scikit-learn enable both usage
scenarios with the same algorithms (see &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/issues/8693"&gt;this issue&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/10700"&gt;this
PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Missing-value indicator&lt;/strong&gt;: a new transform that adds indicator columns
marking missing data
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/8075"&gt;PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PyPy support&lt;/strong&gt;: PyPy support was merged.
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11010"&gt;PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Random Forest with 100 estimators&lt;/strong&gt;: The default of &lt;cite&gt;n_estimators&lt;/cite&gt; in
RandomForest was changed from 10, which was fast but statistically
poor, to 100 (&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11542"&gt;PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Changing to 5-fold&lt;/strong&gt;: we changed the default cross-validation from
3-fold to 5-fold
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11557"&gt;PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Toward release 0.20&lt;/strong&gt;: most of the effort of the sprint was actually
spent on addressing issues for the 0.20 release: a long list of quality
improvements
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/milestone/24"&gt;milestone&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
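&lt;p&gt;To make the Yeo-Johnson item above concrete, here is a minimal
pure-Python sketch of the transform itself (a simplification for
illustration; scikit-learn exposes it through
&lt;cite&gt;PowerTransformer&lt;/cite&gt; with &lt;cite&gt;method="yeo-johnson"&lt;/cite&gt;).
Unlike Box-Cox, it is defined for negative inputs:&lt;/p&gt;

```python
import math

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single value.

    Unlike Box-Cox, it is defined for negative x, at the cost of a
    piecewise definition (four cases, depending on x and lambda).
    """
    if x >= 0:
        if abs(lmbda) > 1e-12:
            return ((x + 1.0) ** lmbda - 1.0) / lmbda
        return math.log1p(x)               # lambda == 0
    if abs(lmbda - 2.0) > 1e-12:
        return -((1.0 - x) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)
    return -math.log1p(-x)                 # lambda == 2

# lambda = 1 leaves the data unchanged; other values reshape it,
# and negative inputs are handled, unlike with Box-Cox.
print(yeo_johnson(5.0, 1.0), yeo_johnson(-3.0, 0.5))
```

&lt;p&gt;In practice, the lambda parameter is estimated from the data so
that the transformed values are as Gaussian as possible.&lt;/p&gt;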
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-is-hard-work"&gt;
&lt;h2&gt;Scikit-learn is hard work&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/dev_scikit-learn.png" style="width: 300px;" /&gt;
&lt;p class="caption"&gt;Even for the almighty &amp;#64;amueller&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Two days of intense group work on scikit-learn reminded me how hard
this work is. I thought it might be a good idea to illustrate
why.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Mathematical errors&lt;/strong&gt;: maintaining the library requires mathematical
understanding of the models. For instance, Ivan Panico &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11585"&gt;fixed the sparse
PCA&lt;/a&gt;, for
which the transform was mathematically incorrect.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Numerical instabilities&lt;/strong&gt;: sometimes, however, when models give a
result different from the expected one, this is due to numerical
instability. For instance, Sergül Aydöre &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11587"&gt;changed the tolerance for
certain variants of ridge&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keeping examples and documentation up to date&lt;/strong&gt;:
Each change requires updating all documentation and examples, and we
have a lot of these. For instance, Alexandre Boucaud &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11557"&gt;had to update many examples and
documentation pages when changing the default cross-validation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clean deprecation path&lt;/strong&gt;: We make sure that our changes do not break
users’ code, and therefore we provide a smooth update path, with
progressive deprecations. For instance, &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11557"&gt;the change of default
cross-validation&lt;/a&gt; introduces
an intermediate step where the default is kept the same but a warning
announces that it will change in two releases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistent behavior across the library&lt;/strong&gt;:
One of the celebrated strengths of scikit-learn is its very
consistent behavior across different models. We enforce this with “common
tests” that check properties of all estimators together. For
instance, Sergül implemented &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11558"&gt;common tests for sample weights&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extensive testing&lt;/strong&gt;: We test many, many things in scikit-learn:
that the code snippets in the documentation are correct, that &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11421"&gt;the
docstring conventions&lt;/a&gt; are
respected, and that no deprecation errors are raised, including from
our dependencies. As a result, continuous integration is a core part
of our development. During the sprint, we flooded our cloud-based
continuous integration, and as a result iteration really slowed down.
&lt;a class="reference external" href="https://travis-ci.org/"&gt;TravisCI&lt;/a&gt; was kind enough to fix this by
freely allocating us more computing power.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supporting many versions&lt;/strong&gt;: Last but not least, one constraint that
makes scikit-learn development hard is that we support many
different versions of Python, of our dependencies, of linear-algebra
libraries, and of operating systems. This makes development harder and
continuous integration slower. But we feel that this is very valuable
for a core library: narrowing the supported versions means that users
are more likely to end up in unsatisfiable dependency situations,
where different parts of a project want different versions of a
dependency.&lt;/li&gt;
&lt;/ul&gt;
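&lt;p&gt;The deprecation pattern described above can be sketched in a few
lines (a hypothetical simplification for illustration, not
scikit-learn’s actual code): the old default is kept for now, but a
warning announces the upcoming change so that no code silently changes
behavior.&lt;/p&gt;

```python
import warnings

def cross_validate(estimator, X, y, cv="warn"):
    # Sentinel default: keep the old behavior (3-fold) for now,
    # but tell users that the default will become 5-fold.
    if cv == "warn":
        warnings.warn(
            "The default value of cv will change from 3 to 5 in two "
            "releases. Pass cv explicitly to silence this warning.",
            FutureWarning,
        )
        cv = 3
    return cv  # stand-in for the actual cross-validation computation

# Passing cv explicitly opts in to the new value and silences the warning:
print(cross_validate(None, None, None, cv=5))
```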
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="admonition warning"&gt;
&lt;p class="first admonition-title"&gt;Warning&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dropping support for Python 2&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Supporting many versions slows development. It also prevents
implementing new features: supporting Python 2 makes it harder to
provide better parallelism or traceback management.&lt;/p&gt;
&lt;p class="last"&gt;Python 3 has been out for 10 years. It is solid and comes with many
improvements over Python 2. Alongside with &lt;a class="reference external" href="http://python3statement.org"&gt;many other projects&lt;/a&gt;, we will be requiring Python 3 for
the future releases of scikit-learn (0.21 and later). scikit-learn
0.20 will be the last release to support Python 2. It will enable
us to develop faster a better toolkit.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="credits-and-acknowledgments"&gt;
&lt;h2&gt;Credits and acknowledgments&lt;/h2&gt;
&lt;div class="section" id="contributors-to-the-sprint"&gt;
&lt;h3&gt;Contributors to the sprint&lt;/h3&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Women contributors&lt;/p&gt;
&lt;p class="last"&gt;We deeply regret having only one woman in this long list of
contributors. We care about diversity and welcome contributors from
under-represented groups &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[*]&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;In Paris&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="columns simple"&gt;
&lt;li&gt;Albert Thomas, Huawei&lt;/li&gt;
&lt;li&gt;Alexandre Boucaud, Inria&lt;/li&gt;
&lt;li&gt;Alexandre Gramfort, Inria&lt;/li&gt;
&lt;li&gt;Eric Lebigot, CFM&lt;/li&gt;
&lt;li&gt;Gaël Varoquaux, Inria&lt;/li&gt;
&lt;li&gt;Ivan Panico, Deloitte&lt;/li&gt;
&lt;li&gt;Jean-Baptiste Schiratti, Telecom ParisTech&lt;/li&gt;
&lt;li&gt;Jérémie du Boisberranger, Inria&lt;/li&gt;
&lt;li&gt;Léo Dreyfus-Schmidt, Dataiku&lt;/li&gt;
&lt;li&gt;Nicolas Goix&lt;/li&gt;
&lt;li&gt;Samuel Ronsin, Dataiku&lt;/li&gt;
&lt;li&gt;Sebastien Treguer, Independent&lt;/li&gt;
&lt;li&gt;Sergül Aydöre, Stevens Institute of Technology&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;In Austin&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="columns simple"&gt;
&lt;li&gt;Andreas Mueller, Columbia&lt;/li&gt;
&lt;li&gt;Guillaume Lemaître, Inria&lt;/li&gt;
&lt;li&gt;Jan van Rijn, Columbia&lt;/li&gt;
&lt;li&gt;Joan Massich, Inria&lt;/li&gt;
&lt;li&gt;Joris Van den Bossche, Inria&lt;/li&gt;
&lt;li&gt;Loïc Estève, Inria&lt;/li&gt;
&lt;li&gt;Nicolas Hug, Columbia&lt;/li&gt;
&lt;li&gt;Olivier Grisel, Inria&lt;/li&gt;
&lt;li&gt;Roman Yurchak, independent&lt;/li&gt;
&lt;li&gt;William de Vazelhes, Inria&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Remote&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="columns simple"&gt;
&lt;li&gt;Hanmin Qin, Peking University&lt;/li&gt;
&lt;li&gt;Joel Nothman, University of Sydney&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="sponsors"&gt;
&lt;h3&gt;Sponsors&lt;/h3&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://franceisai.com/"&gt;France Is AI&lt;/a&gt; payed the travel of the French
contributors to Austin&lt;/li&gt;
&lt;li&gt;The NSF and the Sloan Foundation paid for the travel of the people from
Columbia.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://scipy2018.scipy.org"&gt;SciPy 2018&lt;/a&gt; organizers (and sponsors) hosted the first part of the sprint in Austin,&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.enthought.com/"&gt;Enthought&lt;/a&gt; hosted the second part of the sprint in Austin,&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.dataiku.com/"&gt;Dataiku&lt;/a&gt; hosted us in Paris&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://travis-ci.org/"&gt;TravisCI&lt;/a&gt; raised our number of workers for
online testing&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.meetup.com/Paris-Machine-learning-applications-group/"&gt;ParisML meetup&lt;/a&gt; helped us with the organization&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thank you all for the support.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Also thanks to Andy Mueller and Olivier Grisel for feedback on this blog post.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[*]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;We aspire to treat everybody exactly the same way. However,
acknowledging the fact that there is currently a lack of diversity, we
are happy to do some outreach and give extra help onboarding
newcomers.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="open-source"></category><category term="reproducible research"></category><category term="scientific software"></category></entry><entry><title>Our research in 2017: personal scientific highlights</title><link href="https://gael-varoquaux.info/science/our-research-in-2017-personal-scientific-highlights.html" rel="alternate"></link><published>2017-12-31T00:00:00+01:00</published><updated>2017-12-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2017-12-31:/science/our-research-in-2017-personal-scientific-highlights.html</id><summary type="html">&lt;p&gt;In my opinion the scientific highlights of 2017 for &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt; were on multivariate predictive
analysis for brain imaging: a brain decoder more efficient and faster
than alternatives, improved clinical predictions by jointly predicting
multiple traits of subjects, decoding based on the raw time-series of
brain activity, and a personal …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In my opinion the scientific highlights of 2017 for &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt; were on multivariate predictive
analysis for brain imaging: a brain decoder more efficient and faster
than alternatives, improved clinical predictions by jointly predicting
multiple traits of subjects, decoding based on the raw time-series of
brain activity, and a personal concern with the small sample sizes we
use in predictive brain imaging…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-fast-and-stable-brain-decoder-using-ensembling-frem"&gt;
&lt;h2&gt;A fast and stable brain decoder using ensembling: FReM&lt;/h2&gt;
&lt;p&gt;We have been working for 10 years on methods for brain decoding:
predicting behavior from imaging. In particular, we developed state of
the art decoders based on &lt;a class="reference external" href="http://ieeexplore.ieee.org/abstract/document/5711672/"&gt;total variation&lt;/a&gt;.
In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917308182"&gt;Hoyos-Idrobo et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/INRIA/hal-01615015v1"&gt;preprint&lt;/a&gt;)
we used a different technique based on ensembling: combining many fast
decoders. The resulting decoder, dubbed &lt;em&gt;FReM&lt;/em&gt;, predicts better, faster,
and with more stable maps than existing methods. Indeed, we have learned
that good prediction accuracy is not the only important feature of a
decoder.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/frem_benchmarks.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="brain-imaging-to-characterize-individuals-joint-prediction-of-multiple-traits"&gt;
&lt;h2&gt;Brain imaging to characterize individuals: joint prediction of multiple traits&lt;/h2&gt;
&lt;p&gt;In &lt;em&gt;population imaging&lt;/em&gt;, individual traits are linked to their brain
images. Predictive models ground the development of imaging biomarkers.
In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305438"&gt;Rahim et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01547524/"&gt;preprint&lt;/a&gt;), we showed that
accounting for multiple traits of the subjects when &lt;em&gt;learning&lt;/em&gt; the
biomarker gave a better prediction of the individual traits. For
instance, knowing the MMSE (mini mental state examination) of subjects
in a reference population helps derive better markers of Alzheimer’s
disease, even for subjects of unknown MMSE. This is an important step to
including a more complete picture of individuals in imaging studies.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/multi_output_decoder.jpg" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="time-domain-decoding-for-fmri"&gt;
&lt;h2&gt;Time-domain decoding for fMRI&lt;/h2&gt;
&lt;p&gt;In studies of cognition with functional MRI, the standard practice for
decoding brain activity is to estimate a first-level model that teases
apart the different experimental trials. It results in maps of the brain
regions that correlate with each trial. Decoding is then run on
these maps, with supervised learning. The limitation of this approach is
that the experiment has to be designed with a good time separation
between trials.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/time_domain_decoding.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917306651"&gt;Loula et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01576641/"&gt;preprint&lt;/a&gt;) we designed a
&lt;em&gt;time-domain decoding&lt;/em&gt; scheme, that starts from the raw brain activity
time-series and predicts model time-courses of cognition. From these, it
can classify the type of each trial. Importantly, it works better than
traditional approaches when the trials are not well separated. It thus
opens the door to decoding in experiments that were so far too fast.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="cross-validation-failure-the-dangers-of-small-samples"&gt;
&lt;h2&gt;Cross-validation failure: the dangers of small samples&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/sample_size_distribution.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;I wrote &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305311"&gt;an opinion paper&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01545002/"&gt;preprint&lt;/a&gt;) on a problem of our
field that has been worrying me a lot: &lt;strong&gt;often, we do not have enough
samples to assess properly the predictive power in neuroimaging&lt;/strong&gt;.
Indeed, the typical predictive analysis in neuroimaging uses 100 samples.&lt;/p&gt;
&lt;div style="clear: both"&gt;&lt;/div&gt;&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/binomial_cdf.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;The error distribution on the measure of prediction accuracy of a decoder
is at best given by a binomial distribution. With around 100 samples, this yields
confidence bounds around ±7%. Analysis of neuroimaging studies reveals
larger error bars.&lt;/p&gt;
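&lt;p&gt;The order of magnitude of these error bars can be checked with a
back-of-the-envelope computation, using the normal approximation to the
binomial (the exact bounds depend on the accuracy and on the confidence
level chosen):&lt;/p&gt;

```python
import math

def accuracy_ci_halfwidth(accuracy, n_samples, z=1.96):
    """Half-width of the ~95% normal-approximation confidence
    interval for an accuracy measured on n_samples test samples."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

# With about 100 samples and a typical accuracy of 70%, the error
# bars approach +/-9%: the same order as the effects under study.
print(round(accuracy_ci_halfwidth(0.7, 100), 3))
```

&lt;p&gt;Quadrupling the number of samples only halves the width of the
confidence interval, which is why increasing sample sizes is costly but
necessary.&lt;/p&gt;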
&lt;p&gt;Such error bars, large compared to the effect of interest, undermine
publications using or developing predictive models in neuroimaging.
Indeed, they couple with the publication incentives in two ways. First,
studies that by chance observe an effect are published, while the others
end up unaccounted for in a &lt;em&gt;file drawer&lt;/em&gt;. Second, minor
modifications to the data-processing strategy give large but meaningless
differences in the observed prediction accuracy. These &lt;em&gt;researcher
degrees of freedom&lt;/em&gt; can hardly be checked in a review process or a
statistical test. The methods research, trying to improve decoders, is
hindered by such error bars and should consider multiple datasets to
gauge progress. Clinical neuroimaging, for biomarkers, must increase
sample sizes and face heterogeneity.&lt;/p&gt;
&lt;p&gt;I believe that this is a major challenge for our field, and invite you to
read the paper if you are not convinced.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="convergence-proofs-for-last-year-s-blazing-fast-dictionary-learning"&gt;
&lt;h2&gt;Convergence proofs for last year’s blazing fast dictionary learning&lt;/h2&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/online_dict_learning.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://ieeexplore.ieee.org/abstract/document/8038072/"&gt;Mensch et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01431618/"&gt;preprint&lt;/a&gt;) is a long paper that
studies in detail our very fast dictionary learning algorithm, with
extensive experiments and convergence proofs. On huge matrices, such as
brain imaging data in population studies, hyperspectral imaging, or
recommender systems, it gives &lt;strong&gt;10-fold speedups for matrix factorization&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We are busy finishing a few very interesting studies. Stay posted, next
year will be exciting!&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="machine learning"></category><category term="yearly report"></category></entry><entry><title>Beyond computational reproducibility, let us aim for reusability</title><link href="https://gael-varoquaux.info/programming/beyond-computational-reproducibility-let-us-aim-for-reusability.html" rel="alternate"></link><published>2017-09-19T12:10:00+02:00</published><updated>2017-09-19T12:10:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2017-09-19:/programming/beyond-computational-reproducibility-let-us-aim-for-reusability.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Scientific progress calls for reproducing results. Due to limited
resources, this is difficult even in computational sciences. Yet,
reproducibility is only a means to an end. It is not enough by itself
to enable new scientific results. Rather, new discoveries must build
on reuse and modification of the state …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Scientific progress calls for reproducing results. Due to limited
resources, this is difficult even in computational sciences. Yet,
reproducibility is only a means to an end. It is not enough by itself
to enable new scientific results. Rather, new discoveries must build
on reuse and modification of the state of the art. As time goes, this
state of the art must be consolidated in software libraries, just as
scientific knowledge has been consolidated on bookshelves of
brick-and-mortar libraries.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="small docutils container"&gt;
I am reposting &lt;a class="reference external" href="https://openlab-flowers.inria.fr/uploads/default/original/1X/65addc14bb2a6a7feaf7690865fa3708d5b0990f.pdf"&gt;an essay&lt;/a&gt;
that I wrote on reproducible science and software libraries. The full
discussion is in &lt;a class="reference external" href="https://openlab-flowers.inria.fr/t/ieee-cis-newsletter-on-cognitive-and-developmental-systems/129/1"&gt;IEEE CIS TC Cognitive and Developmental Systems&lt;/a&gt;,
but I’ve been told that it is hard to find.&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Science is based on the ability to falsify claims. Thus, reproduction or
replication of published results is central to the progress of science.
Researchers failing to reproduce a result will raise questions:
Are these investigators not skilled enough? Did they misunderstand the
original scientific endeavor? Or is the scientific claim unfounded? For
this reason, the quality of the methods description in a research paper
is crucial. Beyond papers, computers —central to science in our digital
era— bring the hope of automating reproduction. Indeed, computers excel
at doing the same thing several times.&lt;/p&gt;
&lt;p&gt;However, there are many challenges to computational reproducibility. To
begin with, computers enable reproducibility only if all steps of a
scientific study are automated. In this sense, interactive environments
—productivity-boosters for many— are detrimental unless they enable easy
recording and replay of the actions performed. Similarly, as a
computational-science study progresses, it is crucial to keep track of
changes to the corresponding data and scripts. With a
software-engineering perspective, version control is the solution. It
should be in the curriculum of today’s scientists. But it does not
suffice. Automating a computational study is difficult. This is because
it comes with a large maintenance burden: operations change rapidly,
straining limited resources —processing power and storage. Saving
intermediate results helps. As does devising light experiments that are
easier to automate. These are crucial to the progress of science, as
laboratory classes or thought experiments in physics. A software
engineer would relate them to unit tests, elementary operations checked
repeatedly to ensure the quality of a program.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Archiving computers in thermally-regulated nuclear-proof vaults?&lt;/div&gt;
&lt;p&gt;Once a study is automated and published, ensuring reproducibility should
be easy; just a matter of archiving the computer used, preferably in a
thermally-regulated nuclear-proof vault. Maybe, dear reader, the
scientist in you frowns at this solution. Indeed, studies should also be
reproduced by new investigators. Hardware and software variations then
get in the way. Portability, &lt;em&gt;i.e.&lt;/em&gt; achieving identical results across
platforms, is well-known by the software industry as being a difficult
problem. It faces great hurdles due to incompatibilities in compilers,
libraries, or operating systems. Beyond these issues, portability also
faces numerical and statistical stability issues in scientific computing.
Hiding instability problems with heavy restrictions on the environment is
like rearranging deck chairs on the Titanic. While enough freezing will
recover reproducibility, unstable operations cast doubt upon scientific
conclusions they might lead to. Computational reproducibility is more
than a software engineering challenge; it must build upon solid numerical
and statistical methods.&lt;/p&gt;
&lt;p&gt;Reproducibility is not enough. It is only a means to an end, scientific
progress. Setting in stone a numerical pipeline that produces a figure is
of little use to scientific thinking if it is a black box. Researchers
need to understand the corresponding set of operations to relate them to
modeling assumptions. New scientific discoveries will arise from varying
those assumptions, or applying the methodology to new questions or new
data. Future studies build upon past studies, standing on the shoulders
of giants, as Isaac Newton famously wrote. In this process, published
results need to be modified and adapted, not only reproduced. Enabling
reuse is an important goal.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Libraries as reusable computational experiments&lt;/div&gt;
&lt;p&gt;To a software architect, a reusable computational experiment may sound
like a library. Software libraries are not only a good analogy, but also
an essential tool. The demanding process of designing a good library
involves isolating elementary steps, ensuring their quality, and
documenting them. It is akin to the editorial work needed to assemble a
textbook from the research literature.&lt;/p&gt;
&lt;p&gt;Science should value libraries made of code, and not only bookshelves.
But they are expensive to develop, and even more so to maintain. Where
should we draw the line? It is clear that in physics not every experimental setup
can be stored for later reuse. Costs are less tangible with computational
science; but they should not be underestimated. In addition, the race to
publish creates legions of studies. As an example, Google Scholar lists
28,000 publications concerning compressive sensing in 2015. Arguably many
are incremental, and research could do with fewer publications. Yet the
very nature of research is to explore new ideas, not all of which are to
stay.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Identifying and consolidating major results for reuse&lt;/div&gt;
&lt;p&gt;Computational research will best create scientific progress by
identifying and consolidating the major results. It is a difficult but
important task. These studies should be made reusable. Limited resources
imply that the remainder will suffer from “code rot”, with results
becoming harder and harder to reproduce as their software environment
becomes obsolete. Libraries, curated and maintained, are the building
blocks that can enable progress.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="grey docutils container"&gt;
If you want to cite this essay in an academic publication, please
cite the version in
&lt;a class="reference external" href="https://openlab-flowers.inria.fr/t/ieee-cis-newsletter-on-cognitive-and-developmental-systems/129/1"&gt;IEEE CIS TC Cognitive and Developmental Systems&lt;/a&gt;
(volume 32, number 2, 2016).&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Related posts&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="software-for-reproducible-science-lets-not-have-a-misunderstanding.html"&gt;Software for reproducible science: let’s not have a misunderstanding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="mloss-2015-wising-up-to-building-open-source-machine-learning.html"&gt;MLOSS 2015: wising up to building open-source machine learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="../science/publishing-scientific-software-matters.html"&gt;Publishing scientific software matters&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="science"></category><category term="scientific computing"></category><category term="publishing"></category><category term="software"></category><category term="reproducible research"></category></entry><entry><title>Scikit-learn Paris sprint 2017</title><link href="https://gael-varoquaux.info/programming/scikit-learn-paris-sprint-2017.html" rel="alternate"></link><published>2017-06-23T00:00:00+02:00</published><updated>2017-06-23T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2017-06-23:/programming/scikit-learn-paris-sprint-2017.html</id><summary type="html">&lt;object class="align-right" data="attachments/scikit-learn-logo.svg" style="width: 400px;" type="image/svg+xml"&gt;&lt;/object&gt;
&lt;p&gt;Two weeks ago, we held in Paris a large international sprint on
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt;. It was incredibly productive
and fun, as always. We are still busy merging in the work, but I think
that now is a good time to try to summarize the sprint.&lt;/p&gt;
&lt;div class="section" id="a-massive-workforce"&gt;
&lt;h2&gt;A massive workforce&lt;/h2&gt;
&lt;img alt="" class="align-center" src="attachments/sklearn_sprint_2017/P1060011.jpg" style="width: 100%;" /&gt;
&lt;p&gt;We had a …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;object class="align-right" data="attachments/scikit-learn-logo.svg" style="width: 400px;" type="image/svg+xml"&gt;&lt;/object&gt;
&lt;p&gt;Two weeks ago, we held in Paris a large international sprint on
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt;. It was incredibly productive
and fun, as always. We are still busy merging in the work, but I think
that now is a good time to try to summarize the sprint.&lt;/p&gt;
&lt;div class="section" id="a-massive-workforce"&gt;
&lt;h2&gt;A massive workforce&lt;/h2&gt;
&lt;img alt="" class="align-center" src="attachments/sklearn_sprint_2017/P1060011.jpg" style="width: 100%;" /&gt;
&lt;p&gt;We had a mix of core contributors and newcomers, which is a great
combination: it lets us be productive while also fostering the next
generation of core developers. Present were:&lt;/p&gt;
&lt;ul class="columns simple"&gt;
&lt;li&gt;Albert Thomas&lt;/li&gt;
&lt;li&gt;Alexandre Abadie&lt;/li&gt;
&lt;li&gt;Alexandre Gramfort&lt;/li&gt;
&lt;li&gt;Andreas Mueller&lt;/li&gt;
&lt;li&gt;Arthur Imbert&lt;/li&gt;
&lt;li&gt;Aurélien Bellet&lt;/li&gt;
&lt;li&gt;Bertrand Thirion&lt;/li&gt;
&lt;li&gt;Denis Engemann&lt;/li&gt;
&lt;li&gt;Elvis Dohmatob&lt;/li&gt;
&lt;li&gt;Gael Varoquaux&lt;/li&gt;
&lt;li&gt;Jan Margeta&lt;/li&gt;
&lt;li&gt;Joan Massich&lt;/li&gt;
&lt;li&gt;Joris Van den Bossche&lt;/li&gt;
&lt;li&gt;Laurent Direr&lt;/li&gt;
&lt;li&gt;Guillaume Lemaître&lt;/li&gt;
&lt;li&gt;Loic Esteve&lt;/li&gt;
&lt;li&gt;Mohamed Maskani Filali&lt;/li&gt;
&lt;li&gt;Nathalie Vauquier&lt;/li&gt;
&lt;li&gt;Nicolas Cordier&lt;/li&gt;
&lt;li&gt;Nicolas Goix&lt;/li&gt;
&lt;li&gt;Olivier Grisel&lt;/li&gt;
&lt;li&gt;Patricio Cerda&lt;/li&gt;
&lt;li&gt;Paul Lagrée&lt;/li&gt;
&lt;li&gt;Raghav RV&lt;/li&gt;
&lt;li&gt;Roman Yurchak&lt;/li&gt;
&lt;li&gt;Sebastien Treger&lt;/li&gt;
&lt;li&gt;Sergei Lebedev&lt;/li&gt;
&lt;li&gt;Thierry Guillemot&lt;/li&gt;
&lt;li&gt;Thomas Moreau&lt;/li&gt;
&lt;li&gt;Tom Dupré la Tour&lt;/li&gt;
&lt;li&gt;Vlad Niculae&lt;/li&gt;
&lt;/ul&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Manoj Kumar (could not come to Paris because of visa issues)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Many more people participated remotely, and I am fairly certain that I
have forgotten some names.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="support-and-hosting"&gt;
&lt;h2&gt;Support and hosting&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Hosting&lt;/strong&gt;:
As the sprint extended through a French bank holiday and the weekend,
we were hosted in a variety of venues:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://lapaillasse.org"&gt;La paillasse&lt;/a&gt;, a Paris bio-hacker space&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.criteo.com"&gt;Criteo&lt;/a&gt;, a French company doing word-wide
add-banner placement. The venue there was absolutely gorgeous, with a
beautiful terrace on the roofs of Paris. And they even had a social
event with free drinks one evening.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Guillaume Lemaître did most of the organization, and at Criteo Ibrahim
Abubakari was our host. We were treated like kings during the whole stay,
each host welcoming us as well as they could.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Financial support by France is IA&lt;/strong&gt;: Beyond our hosts, we need to thank
&lt;a class="reference external" href="https://franceisai.com/"&gt;France is IA&lt;/a&gt;, who funded the sprint, covering
some of the lunches, accommodation, and travel expenses to bring in our
contributors from abroad (3000 euros for travel &amp;amp; accommodation, and 1000
euros for food and a venue during the weekend).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="some-achievements-during-the-sprint"&gt;
&lt;h2&gt;Some achievements during the sprint&lt;/h2&gt;
&lt;p&gt;It would be hard to list everything that we did during the sprint (have a
look at the &lt;a class="reference external" href="http://scikit-learn.org/dev/whats_new.html#version-0-14"&gt;development changelog&lt;/a&gt; if you’re curious). Here are some highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;Quantile transformer, to transform the data distribution into uniform,
or Gaussian distributions
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/8363"&gt;PR&lt;/a&gt;,
&lt;a class="reference external" href="http://scikit-learn.org/dev/auto_examples/preprocessing/plot_all_scaling.html"&gt;example&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;&lt;/p&gt;
&lt;img alt="" src="attachments/sklearn_sprint_2017/original_distributions.png" style="width: 500px;" /&gt;
&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;&lt;/p&gt;
&lt;img alt="" src="attachments/sklearn_sprint_2017/quantile_transform.png" style="width: 500px;" /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Memory saving by avoiding to cast to float64 if X is given as float32:
we are slowly making sure that, as much as possible, all models avoid
using internal representations of a dtype float64 when the data is
given as float32. This reduces significantly memory usage and can give
speed ups up to a factor of two.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;API test on instances rather than class. This is to facilitate testing
packages in &lt;a class="reference external" href="https://github.com/scikit-learn-contrib/scikit-learn-contrib"&gt;scikit-learn-contrib&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Many small API fixes to ensure better consistency of models, as well as
cleaning the codebase, making sure that examples display well under
matplotlib 2.x.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Many bug fixes, include fixing corner cases in our average precision,
which was dear to me (&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/9017"&gt;PR&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
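&lt;p&gt;The idea behind the quantile transformer can be sketched in a few lines of plain Python. This is only a toy rank-based illustration of the concept, not scikit-learn’s implementation, which interpolates between a fixed number of reference quantiles and can also map to a Gaussian:&lt;/p&gt;

```python
# Toy sketch: map each value to its empirical quantile, so the output
# is approximately uniformly distributed on [0, 1].  An illustration of
# the idea only, not scikit-learn's QuantileTransformer.

def quantile_transform(values):
    """Map each value to its empirical quantile in [0, 1]."""
    order = sorted(values)
    n = len(values)
    # index() takes the first rank for ties; a careful implementation
    # would use average ranks and interpolation
    return [order.index(v) / (n - 1) for v in values]

skewed = [1, 2, 2, 3, 100, 1000]           # heavy-tailed data
print(quantile_transform(skewed))          # [0.0, 0.2, 0.2, 0.6, 0.8, 1.0]
```

&lt;p&gt;Mapping to a Gaussian instead amounts to passing these uniform ranks through the inverse normal CDF.&lt;/p&gt;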
&lt;p&gt;&lt;strong&gt;Work soon to be merged&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;ColumnTransformer (&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/9012"&gt;PR&lt;/a&gt;): from
pandas dataframe to feature matrix, by applying different transformers
to different columns.&lt;/li&gt;
&lt;li&gt;Fixing t-SNE (&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/9032"&gt;PR&lt;/a&gt;): our
t-SNE implementation was extremely memory-inefficient, and on top of
this had minor bugs. We are fixing it.&lt;/li&gt;
&lt;/ul&gt;
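&lt;p&gt;The concept behind ColumnTransformer, routing each column of a heterogeneous table through its own transformer, can be sketched in plain Python. This mimics only the spirit of the pull request; the actual scikit-learn API composes fitted transformer objects over dataframe columns:&lt;/p&gt;

```python
# Concept sketch: apply a different transformation to each column of a
# table and assemble the results into one feature matrix.  Only an
# illustration of the idea behind ColumnTransformer, not its API.

def column_transform(rows, transformers):
    """rows: list of dicts; transformers: {column_name: function}."""
    feature_matrix = []
    for row in rows:
        features = [transform(row[column])
                    for column, transform in transformers.items()]
        feature_matrix.append(features)
    return feature_matrix

data = [{"age": 30, "city": "Paris"}, {"age": 50, "city": "Lyon"}]
cities = ["Paris", "Lyon"]                 # hypothetical category list
X = column_transform(data, {
    "age": lambda a: a / 100,              # crude numeric scaling
    "city": lambda c: cities.index(c),     # crude ordinal encoding
})
print(X)                                   # [[0.3, 0], [0.5, 1]]
```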
&lt;p&gt;There is a lot more pending work that the sprint helped move forward.
You can also glance at the &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pulse/monthly"&gt;monthly activity report on github&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Joblib progress&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://pythonhosted.org/joblib/"&gt;Joblib&lt;/a&gt;, the parallel-computing
engine used by scikit-learn, is getting extended to work in distributed
settings, for instance using dask distributed as a &lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/joblib.html"&gt;backend&lt;/a&gt;.
At the sprint, we made progress running a grid-search on Criteo’s Hadoop
cluster.&lt;/p&gt;
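&lt;p&gt;The workload being distributed here, an embarrassingly parallel grid search, can be sketched with the standard library. concurrent.futures stands in for joblib in this sketch; with joblib one would use Parallel and delayed, and the dask distributed backend would ship the same tasks to a cluster:&lt;/p&gt;

```python
# Sketch of the embarrassingly-parallel grid search that joblib
# distributes.  concurrent.futures stands in for joblib here; a real
# scikit-learn grid search would fit and score an estimator per point.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(params):
    """Hypothetical scoring function for one hyper-parameter point."""
    alpha, depth = params
    return {"alpha": alpha, "depth": depth, "score": 1.0 / (alpha + depth)}

grid = list(product([0.1, 1.0], [1, 2]))   # all parameter combinations

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, grid))

best = max(results, key=lambda r: r["score"])
print(best["alpha"], best["depth"])        # 0.1 1
```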
&lt;img alt="" class="align-center" src="attachments/sklearn_sprint_2017/P1060014.jpg" style="width: 100%;" /&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="sprint"></category><category term="scikit-learn"></category><category term="python"></category><category term="machine learning"></category></entry><entry><title>Our research in 2016: personal scientific highlights</title><link href="https://gael-varoquaux.info/science/our-research-in-2016-personal-scientific-highlights.html" rel="alternate"></link><published>2016-12-31T00:00:00+01:00</published><updated>2016-12-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2016-12-31:/science/our-research-in-2016-personal-scientific-highlights.html</id><summary type="html">&lt;p&gt;Year 2016 has been productive for science in &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt;. Here are some personal highlights:
bridging artificial intelligence tools to human cognition,
markers of neuropsychiatric conditions from brain activity at rest,
algorithmic speedups for matrix factorization on huge datasets…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="artificial-intelligence-convolutional-networks-map-well-the-human-visual-system"&gt;
&lt;h2&gt;Artificial-intelligence convolutional networks map well the human visual system&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305481"&gt;Eickenberg et …&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Year 2016 has been productive for science in &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt;. Here are some personal highlights:
bridging artificial intelligence tools to human cognition,
markers of neuropsychiatric conditions from brain activity at rest,
algorithmic speedups for matrix factorization on huge datasets…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="artificial-intelligence-convolutional-networks-map-well-the-human-visual-system"&gt;
&lt;h2&gt;Artificial-intelligence convolutional networks map well the human visual system&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305481"&gt;Eickenberg et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01389809/document"&gt;preprint&lt;/a&gt;), showed that
convolutional networks –machine-learning tools developed in artificial
intelligence for image analysis– map well the human visual system. This
is interesting because it shows that cognitive vision and artificial
computer vision have evolved to similar architectures. It is not that
surprising, as they are both driven by the statistics of natural images.
From the point of view of inference in neuroscience, what I found really
interesting is that we demonstrated that our computational model of brain
activity generalizes across experimental paradigms. This is something new
to my knowledge.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="using-brain-activity-at-rest-to-predicting-autism-status-across-clinical-sites"&gt;
&lt;h2&gt;Using brain activity at rest to predict Autism status across clinical sites&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305924"&gt;Abraham et al&lt;/a&gt;
(&lt;a class="reference external" href="https://arxiv.org/pdf/1611.06066"&gt;preprint&lt;/a&gt;) used resting-state brain
activity to predict whether individuals were typical controls or
diagnosed with Autistic symptoms. The important aspect of this study
is that it was performed on a large data collection across many sites
that had not coordinated with each other during acquisition. Given that
prediction was successful across sites, the study shows the viability of
extracting predictive biomarkers across inhomogeneous multi-site data. I
think that it is an important result for the future of psychiatric
neuroimaging research. The paper also highlights the aspects of the
predictive pipeline that were important for this success.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="dictionary-learning-for-massive-matrix-factorization"&gt;
&lt;h2&gt;Dictionary Learning for Massive Matrix Factorization&lt;/h2&gt;
&lt;p&gt;On a pure machine-learning side, &lt;a class="reference external" href="http://jmlr.org/proceedings/papers/v48/mensch16.html"&gt;Mensch et al&lt;/a&gt; introduced a new
algorithm for matrix factorization that gives 10 times speedups compared
to the state of the art on absolutely huge datasets (Terabyte scales).
The key aspect is to combine online learning with random subsampling that
exploits redundancies in the data. For neuroimaging, this algorithmic
advance is needed to tackle larger and larger resting-state data. We
will use it to scale predictive models to epidemiologic cohorts. The
original paper was purely heuristic but &lt;a class="reference external" href="https://arxiv.org/pdf/1611.10041"&gt;later work&lt;/a&gt; comes with proofs and we will soon
be submitting a very rich journal paper about this class of algorithms.&lt;/p&gt;
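&lt;p&gt;The flavor of the approach, online updates computed from random subsamples of the data, can be sketched with a bare-bones stochastic matrix factorization. This toy is not Mensch et al.’s algorithm (their subsampled surrogate updates are what make the speed-ups principled); it only illustrates learning factors from randomly sampled entries:&lt;/p&gt;

```python
# Toy sketch: learn a low-rank factorization X ~ U V^T by stochastic
# updates on randomly subsampled entries.  Plain SGD, illustrating
# online learning with subsampling only; NOT the surrogate-based
# algorithm of Mensch et al.
import random

random.seed(0)
n, m, rank = 20, 15, 2
# Synthetic low-rank data
U_true = [[random.gauss(0, 1) for _ in range(rank)] for _ in range(n)]
V_true = [[random.gauss(0, 1) for _ in range(rank)] for _ in range(m)]
X = [[sum(U_true[i][k] * V_true[j][k] for k in range(rank))
      for j in range(m)] for i in range(n)]

# Small random initialization of the learned factors
U = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(n)]
V = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(m)]

def loss():
    return sum((X[i][j] - sum(U[i][k] * V[j][k] for k in range(rank))) ** 2
               for i in range(n) for j in range(m))

initial_loss = loss()
lr = 0.02
for step in range(8000):
    # Subsample one entry (real algorithms subsample blocks or columns)
    i, j = random.randrange(n), random.randrange(m)
    err = X[i][j] - sum(U[i][k] * V[j][k] for k in range(rank))
    for k in range(rank):
        u, v = U[i][k], V[j][k]
        U[i][k] += lr * err * v
        V[j][k] += lr * err * u

print(initial_loss, loss())   # reconstruction error after training
```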
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-guide-to-cross-validation-in-neuroimaging"&gt;
&lt;h2&gt;A guide to cross-validation in neuroimaging&lt;/h2&gt;
&lt;p&gt;We published &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S105381191630595X"&gt;a review on cross-validation for neuroimaging&lt;/a&gt;
(&lt;a class="reference external" href="https://arxiv.org/pdf/1606.05201"&gt;preprint&lt;/a&gt;). While this may sound
less leading edge than other of our work, cross-validation is central to
everything we do. Doing it right is important. We learned some
interesting tradeoffs while doing the experiments for the review. One of
them is that for predictive models that are quite stable, such as SVMs,
it may be preferable to use default hyper-parameters rather than to tune them by
cross-validation. This is because with the small sample sizes typical of
neuroimaging cross-validation is fairly noisy.&lt;/p&gt;
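&lt;p&gt;A back-of-the-envelope calculation gives the order of magnitude of this noise: the accuracy measured on n held-out samples is a binomial proportion, so its standard error is roughly sqrt(p(1-p)/n). This ignores fold-to-fold correlations, which make real cross-validation noisier still:&lt;/p&gt;

```python
# Back-of-the-envelope: accuracy measured on n held-out samples is a
# binomial proportion, with standard error sqrt(p * (1 - p) / n).
# Fold-to-fold correlations make real cross-validation noisier still.
from math import sqrt

def accuracy_standard_error(p, n):
    """Standard error of an observed accuracy p on n test samples."""
    return sqrt(p * (1 - p) / n)

# A typical neuroimaging-sized test fold vs. large-sample settings
for n_test in (30, 300, 3000):
    half_width = 1.96 * accuracy_standard_error(0.75, n_test)
    print(f"n={n_test:5d}: accuracy 75% +/- {half_width:.1%}")
```

&lt;p&gt;With 30 test samples per fold, the 95% interval is roughly +/- 15 percentage points: plenty of room for hyper-parameter tuning to chase noise.&lt;/p&gt;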
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Though not in my team, &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916306103"&gt;Liem et al&lt;/a&gt;
(&lt;a class="reference external" href="http://www.biorxiv.org/content/biorxiv/early/2016/11/07/085506.full.pdf"&gt;preprint&lt;/a&gt;)
collaborated with us for a beautiful study showing multimodal prediction
of brain age from resting-state brain activity and brain anatomy. Interestingly,
they showed that discrepancy between predicted age and chronological age
captures cognitive impairment.&lt;/p&gt;
&lt;p&gt;We have many interesting things in the pipeline, but it will be for next
year. On an unrelated note, I’ve been doing more &lt;a class="reference external" href="http://www.flickriver.com/photos/gaelvaroquaux/popular-interesting/"&gt;art photography&lt;/a&gt;
in my free time in 2016.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="machine learning"></category><category term="yearly report"></category></entry><entry><title>Data science instrumenting social media for advertising is responsible for todays politics</title><link href="https://gael-varoquaux.info/programming/data-science-instrumenting-social-media-for-advertising-is-responsible-for-todays-politics.html" rel="alternate"></link><published>2016-11-11T00:00:00+01:00</published><updated>2016-11-11T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2016-11-11:/programming/data-science-instrumenting-social-media-for-advertising-is-responsible-for-todays-politics.html</id><summary type="html">&lt;p&gt;&lt;em&gt;To my friends developing data science for the social media, marketing, and
advertising industries,&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It is time to accept that we have our share of responsibility in the outcome of
the US elections and the vote on Brexit. We are not creating the
society that we would like. Facebook,
Twitter …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;To my friends developing data science for the social media, marketing, and
advertising industries,&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It is time to accept that we have our share of responsibility in the outcome of
the US elections and the vote on Brexit. We are not creating the
society that we would like. Facebook,
Twitter, targeted advertising, customer profiling, are harmful to truth
and have helped Brexiting and electing Trump. Journalism
has been replaced by social media and commercial content tailored to
influence the reader: your own personal distorted reality.&lt;/p&gt;
&lt;p&gt;There are many deep reasons why Trump won the election. Here, as a
data scientist, I want to talk about the factors created by data science.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Rumor replaces truth&lt;/strong&gt;: the way we, data-miners, aggregate and
recommend content is based on its popularity, on readership statistics.
In no way is it based in the truthfulness of the content. As a
result, Facebook, Twitter, Medium, and the like amplify rumors and
sensational news, with no reality check &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is nothing new: clickbait and tabloids build upon it. However, social networking and
active recommendation make things significantly worse. Indeed, birds of
a feather flock together, reinforcing their own biases. &lt;strong&gt;We receive
filtered information&lt;/strong&gt;: have you noticed that every single argument you
heard was overwhelmingly against (or in favor of) Brexit? To make matters
even worse, our brain loves it: to resolve cognitive dissonance we avoid
information that contradicts our biases &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;We all believe more information when it confirms our biases&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Gossiping, rumors, and propaganda have always made sane decisions
difficult. The &lt;strong&gt;filter bubble&lt;/strong&gt;, algorithmically-tuned rose-colored
glasses of Facebook, escalate this problem into a major dysfunction of
our society. They amplify messy and false information better than
anything before. Soviet-style propaganda builds on carefully-crafted
lies; post-truth politics builds on a flood of information that does not
even pretend to be credible in the long run.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Active distortion of reality&lt;/strong&gt;: amplifying biases to the point that
they drown truth is bad. Social networks actually do worse: they give
tools for active manipulation of our perception of the world. Indeed, the
revenue of today’s Internet information engines comes from advertising.
For this purpose they are designed to learn as much as possible about the
reader. Then they sell this information bundled with a slot where the
buyer can insert the optimal message to influence the reader.&lt;/p&gt;
&lt;a class="reference external image-reference" href="https://www.flickr.com/photos/benterrett/6929895752/"&gt;&lt;img alt="" class="align-right" src="https://farm8.staticflickr.com/7212/6929895752_2e359557b8_z_d.jpg" style="width: 25%;" /&gt;&lt;/a&gt;
&lt;p&gt;The Trump campaign used targeted Facebook ads presenting to
unenthusiastic democrats information about Clinton tuned to discourage
them from voting. For instance, &lt;a class="reference external" href="http://www.theverge.com/2016/10/27/13434246/donald-trump-targeted-dark-facebook-ads-black-voters"&gt;portraying her as racist to black voters&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Information manipulation works. The Trump campaign has been a smearing
campaign aimed at suppressing votes of his opponent. Release of
negative information on Clinton &lt;a class="reference external" href="https://medium.com/&amp;#64;jonathonmorgan/we-are-more-than-our-partisanship-4ea179592c1f"&gt;did affect her supporter allegiance&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tech created the perfect mind-control tool, with an eye on
sales revenue. Someone used it for politics.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The tech industry is mostly socially-liberal and highly educated,
wishing the best for society. But it must accept its share of the blame.
My friends improving machine-learning for customer profiling and ad
placement, &lt;strong&gt;you are helping shape a world of lies and deception&lt;/strong&gt;. I will
not blame you for accepting this money: if it were not for you, others
would do it. But we should all be thinking about how we can improve this
system. How do we use data science to build a world based on objectivity,
transparency, and truth, rather than Internet-based marketing?&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;References analysing the erosion of truth&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.economist.com/news/briefing/21706498-dishonesty-politics-nothing-new-manner-which-some-politicians-now-lie-and"&gt;Must-read article in the economist on lies in politics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Post-truth_politics"&gt;Wikipedia page on Post-truth politics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://nymag.com/selectall/2016/11/donald-trump-won-because-of-facebook.html"&gt;Donald Trump won because of Facebook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://inverseprobability.com/2016/06/23/the-real-story-behind-todays-referendum"&gt;The real story behind todays referendum&lt;/a&gt; : Neil Lawrence’s analysis of the filter-bublle effect in Brexit&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://users.polisci.wisc.edu/behavior/Papers/Toff&amp;amp;Kim2013.pdf"&gt;A 2013 academic study showing that twitter increases partisan
polarization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Disgression: other social issues of data science&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The tech industry is &lt;strong&gt;increasing inequalities&lt;/strong&gt;, making the rich richer and
leaving the poor behind. Data science, with its ability to automate
actions and wield large sources of information, is a major contributor
to these inequalities.&lt;/li&gt;
&lt;li&gt;Internet-based marketing is building &lt;strong&gt;a huge spying machine&lt;/strong&gt; that
infers as much as possible about the user. The Trump campaign was able
to target a specific population, black voters leaning towards
democrats. What if this data was used for direct executive action? This
could come quicker than we think, given how intelligence agencies tap
into social media.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I preferred to focus this post on how data-science can help distort truth.
Indeed, it is a problem too often ignored by data scientists who like to
think that they are empowering users.&lt;/p&gt;
&lt;/div&gt;
&lt;!-- The wikileaks dumps of Clinton's mail resemble the
`Kompromat &lt;https://en.wikipedia.org/wiki/Kompromat&gt;`_ techniques used
by post-soviet regimes, using private information on opponents to
control them. --&gt;
&lt;p class="align-right"&gt;In memory of &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Aaron_Swartz"&gt;Aaron Schwartz&lt;/a&gt;
who fought centralized power on Internet.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Facebook was until recently using human curators, &lt;a class="reference external" href="http://arstechnica.com/business/2016/08/facebook-fires-human-editors-algorithm-immediately-posts-fake-news/"&gt;but fired them,
leading to a loss of control on veracity&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It is a well-known and well-studied cognitive bias that
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Cognitive_dissonance"&gt;individuals strive to reduce cognitive dissonace and actively avoid
situations and information likely to increase it&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;a class="reference external image-reference" href="https://www.flickr.com/photos/cdevers/4602805654"&gt;&lt;img alt="" class="align-center" src="https://farm2.staticflickr.com/1376/4602805654_db8b6569fb_z_d.jpg" style="width: 80%;" /&gt;&lt;/a&gt;
</content><category term="programming"></category><category term="politics"></category><category term="data science"></category><category term="software"></category><category term="machine learning"></category><category term="society"></category></entry><entry><title>Unison 2.48 binaries for ARM</title><link href="https://gael-varoquaux.info/misc/unison-248-binaries-for-arm.html" rel="alternate"></link><published>2016-07-23T00:00:00+02:00</published><updated>2016-07-23T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2016-07-23:/misc/unison-248-binaries-for-arm.html</id><summary type="html">&lt;p class="first last"&gt;I have built static binaries of Unision 2.48 for ARM&lt;/p&gt;
</summary><content type="html">&lt;p&gt;I have built static binaries of Unison 2.48 for ARM
Run on my NAS, the arm architecture is necessary to synchronize with the
recent Ubuntu.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="../programming/attachments/unison-2.48.4-armel.zip"&gt;unison-2.48.4-armel.zip&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="admonition warning"&gt;
&lt;p class="first admonition-title"&gt;Warning&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I will not support these binaries&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;I will not answer any questions or request on these binaries. I have
built them for my personal use and put them online in case it might be
useful for others.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="remark-on-backward-compatibility"&gt;
&lt;h2&gt;Remark on backward compatibility&lt;/h2&gt;
&lt;p&gt;Why don’t the Unison devs ensure compatibility between minor versions of
Unison?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Breaking compatibility is bad practice, in particular between minor
versions&lt;/strong&gt;. It breaks the trust that users have in updating the software.
Programmers complain that users always run old versions of
OSs/libraries/programs, but this is explained by the fear of stuff
breaking during upgrades.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="notes-to-build-these-binaries"&gt;
&lt;h2&gt;Notes to build these binaries&lt;/h2&gt;
&lt;p&gt;I built these binaries following instructions historically hosted at
&lt;a class="reference external" href="http://www.crutzi.info/unison/binary/armel"&gt;http://www.crutzi.info/unison/binary/armel&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I retrieved these instructions from the Wayback Machine and adapted them
to work on a more modern Debian system.&lt;/p&gt;
&lt;p&gt;To compile it, I used qemu and a Debian ARM image&lt;/p&gt;
&lt;div class="section" id="build-a-debian-system-under-qemu"&gt;
&lt;h3&gt;1. Build a Debian system under qemu&lt;/h3&gt;
&lt;p&gt;Install the system (this takes a couple of hours and requires some user input):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
sudo apt install qemu-system-arm qemu-efi libguestfs-tools

wget -O installer-vmlinuz http://http.us.debian.org/debian/dists/jessie/main/installer-armhf/current/images/netboot/vmlinuz
wget -O installer-initrd.gz http://http.us.debian.org/debian/dists/jessie/main/installer-armhf/current/images/netboot/initrd.gz

# Create a drive
qemu-img create -f qcow2 hda.qcow2 5G

qemu-system-arm -M virt -m 1024 \
-kernel installer-vmlinuz \
-initrd installer-initrd.gz \
-drive if=none,file=hda.qcow2,format=qcow2,id=hd \
-device virtio-blk-device,drive=hd \
-netdev user,id=mynet \
-device virtio-net-device,netdev=mynet \
-nographic -no-reboot
&lt;/pre&gt;
&lt;p&gt;Under Ubuntu, the host kernel images must first be made readable:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
sudo chmod 644 /boot/vmlinuz*
&lt;/pre&gt;
&lt;p&gt;List the contents of the /boot directory of the VM’s disk:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
virt-ls -a hda.qcow2 /boot/
&lt;/pre&gt;
&lt;p&gt;Copy the initrd and vmlinuz:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
virt-copy-out -a hda.qcow2 /boot/vmlinuz-3.16.0-6-armmp-lpae /boot/initrd.img-3.16.0-6-armmp-lpae .
&lt;/pre&gt;
&lt;p&gt;Create symlinks:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
ln -s initrd.img-3.16.0-6-armmp-lpae initrd.img
ln -s vmlinuz-3.16.0-6-armmp-lpae vmlinuz
&lt;/pre&gt;
&lt;p&gt;The installed system is then booted with:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
qemu-system-arm -M virt -m 1024 \
-kernel vmlinuz \
-initrd initrd.img \
-drive if=none,file=hda.qcow2,format=qcow2,id=hd \
-device virtio-blk-device,drive=hd \
-netdev user,id=mynet \
-device virtio-net-device,netdev=mynet \
-nographic -no-reboot -append &amp;quot;root=/dev/vda2&amp;quot;
&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="section" id="build-unison-under-the-debian-system"&gt;
&lt;h3&gt;2. Build unison under the Debian system&lt;/h3&gt;
&lt;p&gt;Download the unison source package from GitHub and compile it within
the qemu ARM environment:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
apt-get update
apt-get upgrade
apt-get build-dep unison
wget https://github.com/bcpierce00/unison/archive/v2.48.15v4.tar.gz
tar -xvzf v2.48.15v4.tar.gz
cd unison-2.48.15v4
make UISTYLE=text NATIVE=true STATIC=true
&lt;/pre&gt;
&lt;p&gt;You might need to remove the ‘-unsafe-string’ option as detailed in &lt;a class="reference external" href="https://github.com/bcpierce00/unison/issues/211"&gt;https://github.com/bcpierce00/unison/issues/211&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The binary will be in &lt;cite&gt;src/unison&lt;/cite&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="software"></category></entry><entry><title>Better Python compressed persistence in joblib</title><link href="https://gael-varoquaux.info/programming/new_low-overhead_persistence_in_joblib_for_big_data.html" rel="alternate"></link><published>2016-05-20T00:00:00+02:00</published><updated>2016-05-20T00:00:00+02:00</updated><author><name>Alexandre Abadie &amp; Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2016-05-20:/programming/new_low-overhead_persistence_in_joblib_for_big_data.html</id><summary type="html">&lt;p class="first last"&gt;New persistence in joblib enables low-overhead storage of big data contained in arbitrary objects&lt;/p&gt;
</summary><content type="html">&lt;div class="section" id="problem-setting-persistence-for-big-data"&gt;
&lt;h2&gt;Problem setting: persistence for big data&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://pythonhosted.org/joblib/"&gt;Joblib&lt;/a&gt; is a powerful Python package
for management of computation: parallel computing, caching, and
primitives for out-of-core computing. It is handy when working on so
called &lt;strong&gt;big data&lt;/strong&gt;, that can consume more than the available RAM (several GB
nowadays). In such situations, objects in the working space must be
persisted to disk, for out-of-core computing, distribution of jobs, or
caching.&lt;/p&gt;
&lt;p&gt;An efficient strategy to write code dealing with big data is to rely on
&lt;strong&gt;numpy arrays to hold large chunks of structured data&lt;/strong&gt;.
The code then handles objects or arbitrary containers (list, dict) with
numpy arrays. For data management, joblib provides transparent disk
persistence that is very efficient with such objects. The internal
mechanism relies on specializing &lt;a class="reference external" href="https://docs.python.org/3/library/pickle.html"&gt;pickle&lt;/a&gt; to better handle numpy
arrays.&lt;/p&gt;
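The idea behind this specialization can be sketched with the standard library's own extension hooks. This is not joblib's actual code, only a minimal illustration, assuming numpy is installed: a custom Pickler intercepts numpy arrays via `persistent_id` and keeps them out-of-band, so the array buffers can be handled separately from the generic pickle stream.

```python
import io
import pickle

import numpy as np

class ArrayAwarePickler(pickle.Pickler):
    """Pickle everything normally, but divert numpy arrays out-of-band."""

    def __init__(self, file, protocol=None):
        super().__init__(file, protocol)
        self.arrays = []  # out-of-band storage for array objects

    def persistent_id(self, obj):
        if isinstance(obj, np.ndarray):
            self.arrays.append(obj)
            return len(self.arrays) - 1  # reference the array by index
        return None  # anything else goes through regular pickling

class ArrayAwareUnpickler(pickle.Unpickler):
    """Resolve the out-of-band references back to the stored arrays."""

    def __init__(self, file, arrays):
        super().__init__(file)
        self.arrays = arrays

    def persistent_load(self, pid):
        return self.arrays[pid]

# Round-trip an arbitrary container mixing arrays and plain objects:
obj = {"weights": np.arange(6).reshape(2, 3), "name": "model"}
buf = io.BytesIO()
pickler = ArrayAwarePickler(buf)
pickler.dump(obj)
buf.seek(0)
restored = ArrayAwareUnpickler(buf, pickler.arrays).load()
```

In joblib the out-of-band path is where the efficient array serialization happens; here the arrays are simply kept in memory to show the mechanism.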
&lt;p&gt;&lt;a class="reference external" href="https://github.com/joblib/joblib/pull/260"&gt;Recent improvements&lt;/a&gt;
vastly reduce the memory overhead of data persistence.&lt;/p&gt;
&lt;div class="section" id="limitations-of-the-old-implementation"&gt;
&lt;h3&gt;Limitations of the old implementation&lt;/h3&gt;
&lt;p&gt;❶ Dumping/loading persisted data &lt;strong&gt;with compression&lt;/strong&gt; was a memory hog,
because of internal copies of data, limiting the maximum size
of usable data with compressed persistence:&lt;/p&gt;
&lt;img alt="" class="large" src="https://gael-varoquaux.info/programming/attachments/old_pickle_mem_profile.png" /&gt;
&lt;p&gt;We see the increased memory usage during the calls to &lt;tt class="docutils literal"&gt;dump&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;load&lt;/tt&gt; functions, profiled using the &lt;a class="reference external" href="https://pypi.python.org/pypi/memory_profiler"&gt;memory_profiler package&lt;/a&gt; with this &lt;a class="reference external" href="https://gist.github.com/aabadie/7cba3385406d1cec7d3dd4407ba3f164"&gt;gist&lt;/a&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;❷ Another drawback was that large numpy arrays (&amp;gt;10MB) contained in an
arbitrary Python object were dumped in separate &lt;tt class="docutils literal"&gt;.npy&lt;/tt&gt; files, increasing
the load on the file system &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;joblib&lt;/span&gt; &lt;span class="c1"&gt;# joblib version: 0.9.4&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;

&lt;span class="c1"&gt;# 3 files are generated:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl_01.npy.z&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl_02.npy.z&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="what-s-new-compression-low-memory"&gt;
&lt;h2&gt;What’s new: compression, low memory…&lt;/h2&gt;
&lt;p&gt;❶ &lt;strong&gt;Memory usage is now stable&lt;/strong&gt;:&lt;/p&gt;
&lt;img alt="" src="https://gael-varoquaux.info/programming/attachments/new_pickle_mem_profile.png" /&gt;
&lt;p&gt;❷ &lt;strong&gt;All numpy arrays are persisted in a single file&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;joblib&lt;/span&gt; &lt;span class="c1"&gt;# joblib version: 0.10.0 (dev)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;

&lt;span class="c1"&gt;# only 1 file is generated:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;❸ &lt;strong&gt;Persistence in a file handle&lt;/strong&gt; (ongoing work in a &lt;a class="reference external" href="https://github.com/joblib/joblib/pull/351"&gt;pull request&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;❹ &lt;strong&gt;More compression formats are available&lt;/strong&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Backward compatibility&lt;/p&gt;
&lt;p&gt;Existing joblib users can be reassured: the new version is &lt;strong&gt;still
compatible with pickles generated by older versions&lt;/strong&gt; (&amp;gt;= 0.8.4). You
are encouraged to rebuild your cache if you want to take
advantage of this new version.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="benchmarks-speed-and-memory-consumption"&gt;
&lt;h2&gt;Benchmarks: speed and memory consumption&lt;/h2&gt;
&lt;p&gt;Joblib strives to have &lt;strong&gt;minimum dependencies&lt;/strong&gt; (only numpy) and to
&lt;strong&gt;be agnostic to the input data&lt;/strong&gt;. Hence the goals are to deal with any
kind of data while trying to &lt;strong&gt;be as efficient as possible with numpy arrays&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;To illustrate the benefits and cost of the new persistence implementation, let’s
now compare a real life use case
(&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_people.html"&gt;LFW dataset from scikit-learn&lt;/a&gt;)
with different libraries:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Joblib, with 2 different versions,
0.9.4 and master (dev),&lt;/li&gt;
&lt;li&gt;Pickle&lt;/li&gt;
&lt;li&gt;Numpy&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" class="large" src="https://gael-varoquaux.info/programming/attachments/persistence_lfw_bench.png" /&gt;
&lt;p&gt;The first four lines use non compressed persistence strategies, the last
four use persistence with zlib/gzip &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt; strategies. Code to reproduce the
benchmarks is available on this &lt;a class="reference external" href="https://gist.github.com/aabadie/2ba94d28d68f19f87eb8916a2238a97c"&gt;gist&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;⚫ &lt;strong&gt;Speed&lt;/strong&gt;: the results between joblib 0.9.4 and 0.10.0 (dev) are
similar whereas &lt;strong&gt;numpy and pickle are clearly slower than joblib&lt;/strong&gt; in both
compressed and non compressed cases.&lt;/p&gt;
&lt;p&gt;⚫ &lt;strong&gt;Memory consumption&lt;/strong&gt;: Without compression, old and
new joblib versions are the same; with compression, the new joblib version is
much better than the old one.
&lt;strong&gt;Joblib clearly outperforms pickle and numpy in terms of
memory consumption&lt;/strong&gt;. This can be explained by the fact that numpy relies on
pickle if the object is not a pure numpy array (a list or a dict with arrays for
example), so in this case it inherits the memory drawbacks from pickle. When
persisting pure numpy arrays (not tested here), numpy uses its internal save/load
functions which are efficient in terms of speed and memory consumption.&lt;/p&gt;
&lt;p&gt;⚫ &lt;strong&gt;Disk used&lt;/strong&gt;: results are as expected: non compressed files have
the same size as the in-memory data; compressed files are smaller.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Caveat Emptor: performance is data-dependent&lt;/p&gt;
&lt;p&gt;Different data compress more or less easily. Speed and disk used will
vary depending on the data. Key considerations are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Fraction of data in arrays&lt;/strong&gt;: joblib is efficient if much of the
data is contained in numpy arrays. The worst-case scenario is
something like a large dictionary with random numbers as keys and
values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Entropy of the data&lt;/strong&gt;: an array full of zeros will compress well
and fast. A fully random array will compress slowly, and use a lot
of disk. Real data is often somewhere in the middle.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
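The entropy point above can be illustrated with a plain zlib sketch (stdlib only, not a joblib benchmark): a low-entropy payload of zeros shrinks to a tiny fraction of its size, while random bytes barely compress at all.

```python
import os
import zlib

# Low-entropy payload: one megabyte of zeros.
zeros = bytes(1_000_000)
# High-entropy payload: one megabyte of random bytes.
noise = os.urandom(1_000_000)

ratio_zeros = len(zlib.compress(zeros)) / len(zeros)
ratio_noise = len(zlib.compress(noise)) / len(noise)

print(f"zeros compress to {ratio_zeros:.4%} of their size")
print(f"random bytes compress to {ratio_noise:.1%} of their size")
```

Real data usually sits between these two extremes, which is why the benchmark numbers above should only be taken as indicative.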
&lt;/div&gt;
&lt;div class="section" id="extra-improvements-in-compressed-persistence"&gt;
&lt;h2&gt;Extra improvements in compressed persistence&lt;/h2&gt;
&lt;div class="section" id="new-compression-formats"&gt;
&lt;h3&gt;New compression formats&lt;/h3&gt;
&lt;p&gt;Joblib can use new compression formats based on Python standard library modules:
&lt;strong&gt;zlib, gzip, bz2, lzma and xz&lt;/strong&gt; (the last two require Python
3.3 or later). &lt;strong&gt;The compressor is
selected automatically when the file name has an explicit extension&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.z&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# zlib&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.z&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# gzip&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.bz2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# bz2&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.bz2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.lzma&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# lzma&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.lzma&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.xz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# xz&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.xz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One can tune the compression level, setting the compressor explicitly:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.compressed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;zlib&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.compressed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.compressed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lzma&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.compressed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;On loading, joblib uses the magic number of the file to determine the
right decompression method. This makes loading compressed pickles transparent:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.compressed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Importantly, the generated compressed files use a &lt;strong&gt;standard
compression file format&lt;/strong&gt;: for instance, regular command line tools (zip/unzip,
gzip/gunzip, bzip2, lzma, xz) can be used to compress/uncompress a pickled file
generated with joblib. Joblib will be able to load a cache compressed with those
tools.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Toward more and faster compression&lt;/p&gt;
&lt;p&gt;Specific compression strategies have been developed for fast
compression, sometimes even faster than disk reads, such as &lt;a class="reference external" href="http://google.github.io/snappy/"&gt;snappy&lt;/a&gt;, &lt;a class="reference external" href="http://www.blosc.org/"&gt;blosc&lt;/a&gt;, LZO or LZ4. With a file-like interface, they should be
readily usable with joblib.&lt;/p&gt;
&lt;p&gt;In the benchmarks above, loading and dumping with compression is
slower than without (though only by a factor of 3 for loading). These
were done on a computer with an SSD, hence with very fast I/O. In a
situation with slower I/O, as &lt;strong&gt;on a network drive, compression could
save time&lt;/strong&gt;. With faster compressors, compression will save time on most
hardware.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="compressed-persistence-into-a-file-handle"&gt;
&lt;h3&gt;Compressed persistence into a file handle&lt;/h3&gt;
&lt;p&gt;Now that everything is stored in a
single file using standard compression formats, joblib can
persist in an &lt;a class="reference external" href="https://github.com/joblib/joblib/pull/351"&gt;open file handle&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;wb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;    &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This also works with the compression file objects available in the standard library,
like &lt;tt class="docutils literal"&gt;gzip.GzipFile&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;bz2.BZ2File&lt;/tt&gt; or &lt;tt class="docutils literal"&gt;lzma.LZMAFile&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;gzip&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GzipFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;wb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;    &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GzipFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Make sure the decompressor matches the internal compression when
loading with the above method. If unsure, simply use
&lt;tt class="docutils literal"&gt;open&lt;/tt&gt;: joblib will &lt;strong&gt;select the right decompressor&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;     &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
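&lt;p&gt;As an illustration of how such auto-detection can work (a sketch, not joblib’s
actual code), a loader can inspect the file’s first bytes, since each
compression format starts with a well-known magic number:&lt;/p&gt;

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Magic numbers of common compression formats; a sketch of the kind of
# detection joblib performs internally, not joblib's actual code.
_MAGIC = {
    b"\x1f\x8b": gzip.open,       # gzip
    b"BZh": bz2.open,             # bz2
    b"\xfd7zXZ\x00": lzma.open,   # xz/lzma
}

def sniff_opener(path):
    """Pick an opener based on the file's magic number."""
    with open(path, "rb") as f:
        head = f.read(6)
    for magic, opener in _MAGIC.items():
        if head.startswith(magic):
            return opener
    return open  # no magic found: assume uncompressed

# Write gzip-compressed data, then detect the format from the content
# rather than trusting the file name.
path = os.path.join(tempfile.gettempdir(), "demo.blob")
with gzip.open(path, "wb") as f:
    f.write(b"hello")
with sniff_opener(path)(path, "rb") as f:
    assert f.read() == b"hello"
```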
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Towards dumping to elaborate stores&lt;/p&gt;
&lt;p&gt;Working with file handles opens the door to &lt;strong&gt;storing cache data in database blobs or cloud
storage services such as Amazon S3, Amazon Glacier and Google Cloud Storage&lt;/strong&gt;
(for instance via the Python package &lt;a class="reference external" href="https://github.com/boto/boto"&gt;boto&lt;/a&gt;).&lt;/p&gt;
&lt;/div&gt;
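&lt;p&gt;The pattern is simply to hand the dump and load functions a file-like object
whose bytes end up in the store. A minimal sketch with the standard library
(using &lt;tt class="docutils literal"&gt;pickle&lt;/tt&gt; as a stand-in so it runs without
joblib installed; joblib’s dump and load accept such file objects the same way):&lt;/p&gt;

```python
import io
import pickle

# Sketch: serialize into an in-memory buffer; the resulting bytes can
# then be pushed to a database blob or a cloud object store (e.g. S3
# through boto). With joblib this would be joblib.dump(data, buf).
data = {"weights": [1.0, 2.0, 3.0]}
buf = io.BytesIO()
pickle.dump(data, buf)
payload = buf.getvalue()  # bytes ready for an upload call

# Loading from a downloaded blob is symmetric:
restored = pickle.load(io.BytesIO(payload))
assert restored == data
```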
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="implementation"&gt;
&lt;h2&gt;Implementation&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A Pickler subclass&lt;/strong&gt;: joblib relies on subclassing the Python Pickler/Unpickler
&lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;. These are state machines that walk the graph of nested objects (a
dict may contain a list, which may in turn contain…), creating a serialized
representation of each object encountered. The new implementation
proceeds as follows:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Pickling an arbitrary object&lt;/strong&gt;: when an &lt;tt class="docutils literal"&gt;np.ndarray&lt;/tt&gt; object is reached,
instead of using the default pickling functions (&lt;tt class="docutils literal"&gt;__reduce__()&lt;/tt&gt;), the joblib
Pickler replaces the ndarray in the pickle stream with a wrapper object containing
all the important array metadata (shape, dtype, flags), then writes the array
content into the pickle file. Note that this step breaks compatibility
with the standard pickle format. One benefit is that it enables fast,
copy-free handling of the numpy array. For compression, chunks
of the data are passed to a compressor object (using the buffer protocol to avoid
copies).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unpickling from a file&lt;/strong&gt;: when the Unpickler reaches the array wrapper,
the file handle is positioned at the beginning of the array content, since
the wrapper precedes it in the pickle stream. At this point the Unpickler simply
constructs an array from the metadata contained in the wrapper and then
fills the array buffer directly from the file. The object returned is the
reconstructed array; the wrapper is dropped. A benefit is that
if the data is stored uncompressed, &lt;strong&gt;the array can be directly memory
mapped from the storage&lt;/strong&gt; (the mmap_mode option of &lt;a class="reference external" href="https://pythonhosted.org/joblib/generated/joblib.load.html"&gt;joblib.load&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
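&lt;p&gt;The idea can be sketched with the standard library alone (an illustration,
not joblib’s implementation): the stdlib pickler’s
&lt;tt class="docutils literal"&gt;persistent_id&lt;/tt&gt;/&lt;tt class="docutils literal"&gt;persistent_load&lt;/tt&gt;
hooks play the role of joblib’s wrapper objects, and
&lt;tt class="docutils literal"&gt;array.array&lt;/tt&gt; stands in for &lt;tt class="docutils literal"&gt;np.ndarray&lt;/tt&gt;:&lt;/p&gt;

```python
import array
import io
import pickle

class ArrayPickler(pickle.Pickler):
    def __init__(self, file, side_channel):
        super().__init__(file)
        self.side_channel = side_channel  # raw buffers go here, out of band

    def persistent_id(self, obj):
        if isinstance(obj, array.array):
            # Replace the array in the stream by a small wrapper:
            # an index into the side channel plus metadata to rebuild it.
            self.side_channel.append(obj.tobytes())
            return (len(self.side_channel) - 1, obj.typecode)
        return None  # everything else is pickled normally

class ArrayUnpickler(pickle.Unpickler):
    def __init__(self, file, side_channel):
        super().__init__(file)
        self.side_channel = side_channel

    def persistent_load(self, pid):
        index, typecode = pid
        # Rebuild the array from the metadata plus the raw buffer.
        a = array.array(typecode)
        a.frombytes(self.side_channel[index])
        return a

buffers = []  # stands in for the array content written alongside the stream
stream = io.BytesIO()
data = {"x": array.array("d", [1.0, 2.0, 3.0]), "label": "demo"}
ArrayPickler(stream, buffers).dump(data)

stream.seek(0)
restored = ArrayUnpickler(stream, buffers).load()
assert restored["x"].tolist() == [1.0, 2.0, 3.0]
```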
&lt;p&gt;This technique lets joblib pickle all objects into a single file while
keeping both dump and load memory efficient.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;A fast compression stream&lt;/strong&gt;: as the pickling refactoring opens the door
to the use of file objects, joblib can now persist data into any kind of file
object: &lt;tt class="docutils literal"&gt;open&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;gzip.GzipFile&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;bz2.BZ2File&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;lzma.LZMAFile&lt;/tt&gt;. For
performance and usability reasons, the new joblib version uses its own file
object, &lt;tt class="docutils literal"&gt;BinaryZlibFile&lt;/tt&gt;, for zlib compression. Compared to
&lt;tt class="docutils literal"&gt;GzipFile&lt;/tt&gt;, it disables CRC computation, which brings a performance gain of 15%.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Speed penalties of on-the-fly writes&lt;/p&gt;
&lt;p&gt;There is also a small speed difference between the new and old joblib for
dict/list objects when using compression.
The old version pickled the data into an &lt;tt class="docutils literal"&gt;io.BytesIO&lt;/tt&gt; buffer and then
compressed it in one go, whereas the new version writes compressed chunks
of pickled data to the file on the fly.
Because of this internal buffer, the old implementation was not memory safe:
it duplicated the data in memory before compressing. The small speed difference
was judged acceptable compared to this memory duplication.&lt;/p&gt;
&lt;/div&gt;
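&lt;p&gt;The two strategies can be contrasted with a stdlib sketch (stand-in code,
not joblib’s), where a raw byte payload plays the role of a pickled stream:&lt;/p&gt;

```python
import io
import zlib

payload = b"some pickled bytes " * 10000

# Old strategy: the whole payload sits duplicated in an in-memory
# buffer, then gets compressed in one shot.
one_shot = zlib.compress(payload)

# New strategy: feed fixed-size chunks to a streaming compressor and
# write each compressed piece out immediately; only one chunk is held
# in memory on top of the source.
out = io.BytesIO()
comp = zlib.compressobj()
for start in range(0, len(payload), 4096):
    out.write(comp.compress(payload[start:start + 4096]))
out.write(comp.flush())

# Both strategies produce a valid zlib stream with the same content.
assert zlib.decompress(out.getvalue()) == payload
assert zlib.decompress(one_shot) == payload
```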
&lt;/div&gt;
&lt;div class="section" id="conclusion-and-future-work"&gt;
&lt;h2&gt;Conclusion and future work&lt;/h2&gt;
&lt;p&gt;Memory copies were a limitation when caching very large numpy arrays on
disk, e.g. arrays with a size close to the computer’s available RAM.
The problem was solved via intensive buffering and a lot of hacking on top of
pickle and numpy. Unfortunately, our strategy performs poorly with
big dictionaries or lists compared to &lt;tt class="docutils literal"&gt;cPickle&lt;/tt&gt;, so try to use
numpy arrays in your internal data structures (note that scipy sparse
matrices work well, as they build on arrays).&lt;/p&gt;
&lt;p&gt;In the future, numpy’s pickle methods could perhaps be improved to make
better use of the &lt;a class="reference external" href="https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects"&gt;64-bit opcodes for large objects&lt;/a&gt;
introduced in recent Python versions.&lt;/p&gt;
&lt;p&gt;Pickling through file handles is a first step toward pickling into
sockets, enabling broadcasting of data between computing units
on a network. This will be invaluable with &lt;a class="reference external" href="https://github.com/joblib/joblib/pull/325"&gt;joblib’s new distributed backends&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Other improvements will come from better compressors, making everything
faster.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;The pull request was implemented by &lt;a class="reference external" href="https://github.com/aabadie"&gt;&amp;#64;aabadie&lt;/a&gt;. He thanks &lt;a class="reference external" href="https://github.com/lesteve"&gt;&amp;#64;lesteve&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/ogrisel"&gt;&amp;#64;ogrisel&lt;/a&gt;
and &lt;a class="reference external" href="https://github.com/GaelVaroquaux"&gt;&amp;#64;GaelVaroquaux&lt;/a&gt; for the valuable
help, reviews and support.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;The load created by multiple files on the filesystem is
particularly detrimental for network filesystems, as it triggers
multiple requests and isn’t cache friendly.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;gzip is based on zlib with additional crc checks and a default
compression level of 3.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;A drawback of subclassing the Python Pickler/Unpickler is that it
is done for the pure-Python version, and not the “cPickle” version.
The latter is much faster when dealing with a large number of Python
objects. Once again, joblib is efficient when most of the data is
represented as numpy arrays or subclasses.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="joblib"></category><category term="persistence"></category><category term="big data"></category></entry><entry><title>Of software and Science. Reproducible science: what, why, and how</title><link href="https://gael-varoquaux.info/programming/of-software-and-science-reproducible-science-what-why-and-how.html" rel="alternate"></link><published>2015-12-16T00:00:00+01:00</published><updated>2015-12-16T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-12-16:/programming/of-software-and-science-reproducible-science-what-why-and-how.html</id><summary type="html">&lt;p&gt;At &lt;a class="reference external" href="mloss-2015-wising-up-to-building-open-source-machine-learning.html"&gt;MLOSS 15&lt;/a&gt; we
brainstormed on reproducible science, discussing &lt;strong&gt;why we care about
software in computer science&lt;/strong&gt;. Here is a summary blending &lt;a class="reference external" href="https://gist.github.com/GaelVaroquaux/33e7a7b297425890fefa"&gt;notes from
the discussions&lt;/a&gt; with my
opinion.&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
“Without engineering, science is not more than philosophy”
&amp;nbsp; &amp;nbsp; —  &amp;nbsp; &amp;nbsp;
&lt;a class="reference external" href="https://twitter.com/GaelVaroquaux/status/619767624654786560"&gt;the community&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do we enable better Science? Why do we do software …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;At &lt;a class="reference external" href="mloss-2015-wising-up-to-building-open-source-machine-learning.html"&gt;MLOSS 15&lt;/a&gt; we
brainstormed on reproducible science, discussing &lt;strong&gt;why we care about
software in computer science&lt;/strong&gt;. Here is a summary blending &lt;a class="reference external" href="https://gist.github.com/GaelVaroquaux/33e7a7b297425890fefa"&gt;notes from
the discussions&lt;/a&gt; with my
opinion.&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
“Without engineering, science is not more than philosophy”
&amp;nbsp; &amp;nbsp; —  &amp;nbsp; &amp;nbsp;
&lt;a class="reference external" href="https://twitter.com/GaelVaroquaux/status/619767624654786560"&gt;the community&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do we enable better Science? Why do we do software in science?&lt;/strong&gt;
These are the questions that we were interested in.&lt;/p&gt;
&lt;div class="grey docutils container"&gt;
&lt;strong&gt;Improving reproducibility of our scientific studies makes us more
efficient in the long run&lt;/strong&gt; to do good science: even inside a lab, new
research efforts build upon the previous work.&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="forms-of-reproducible-science-reproduction-replication-reuse"&gt;
&lt;h2&gt;Forms of reproducible science: reproduction, replication, &amp;amp; reuse&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://politicalsciencereplication.wordpress.com/2013/02/24/is-there-a-difference-between-replication-reproduction-and-re-analysis/"&gt;The classic concepts of reproducible science&lt;/a&gt;
are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;: being able to rerun an experiment as it was run,
for instance by reanalysing data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replicability&lt;/strong&gt;: being able to redo an experiment from scratch.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;em&gt;reproducible science&lt;/em&gt; movement argues that sharing the source code of
experiments is a prerequisite for &lt;em&gt;reproduction&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;For reproduction, fields like computer science (development of methods)
and biology (challenging data acquisition) have very different
constraints, with the complexity allocated differently between data and
code.&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
“Machine learning people use hugely complex algorithms on trivially
simple datasets. Biology does trivially simple algorithms on hugely
complex datasets.”
&amp;nbsp; &amp;nbsp; —  &amp;nbsp; &amp;nbsp;
&lt;em&gt;an MLOSS15 attendee&lt;/em&gt;&lt;/blockquote&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We felt that computer science needed an additional notion, complementing
replication and reproduction:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Reusability&lt;/strong&gt;: applying the process to a new yet similar question.
For instance for a paper contributing data analysis method, applying it
to new data.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="align-right docutils container"&gt;
Reusability is more valuable than reproducibility.&lt;/div&gt;
&lt;p&gt;Reproducibility without reusability in method development may hinder the
advancement of science, as it pushes people to all do the same
things, &lt;em&gt;eg&lt;/em&gt; always running experiments on the same data.&lt;/p&gt;
&lt;p&gt;Reusability enables results that the original investigator did not have in
mind. It implies that the experimental protocol extends further than the
exact scope of the question initially asked. For software development, it
is also harder, as it implies more robustness and flexibility.&lt;/p&gt;
&lt;p&gt;Finally sharing source code is not enough: &lt;strong&gt;readability&lt;/strong&gt; of the code is
necessary.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="roadblocks-to-reproducible-science"&gt;
&lt;h2&gt;Roadblocks to reproducible science&lt;/h2&gt;
&lt;div class="section" id="man-power"&gt;
&lt;h3&gt;Manpower&lt;/h3&gt;
&lt;p&gt;Reusability, readability, and support of released code all take a
lot of time, even though this is seldom acknowledged in talks about
reproducible science. Given fixed manpower, it is impossible to
achieve reusability and high quality for everything.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="computing-power"&gt;
&lt;h3&gt;Computing power&lt;/h3&gt;
&lt;p&gt;Some numerical experiments or complex data analyses require weeks of
cluster time to run. These are much harder to reproduce. Also, rerunning
an analysis from scratch on a regular basis is a good recipe for a
robust path from data to results. The more computing power is a limiting
resource, the more likely it is that a glitch goes undetected.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="data-availability"&gt;
&lt;h3&gt;Data availability&lt;/h3&gt;
&lt;p&gt;No access, or restricted access, to data is a show-stopper for
reproducibility. Data-sharing requirements are becoming common –from
funding agencies, or journals. However, privacy concerns or confidential
information get in the way of making data public, for instance in medical
research or microeconomics. Often, these concerns serve as a pretext
for people who actually do not want to relinquish &lt;em&gt;control&lt;/em&gt; &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;A related post by Deevy Bishop: &lt;a class="reference external" href="http://deevybee.blogspot.co.uk/2015/11/whos-afraid-of-open-data.html?m=1"&gt;Who’s afraid of open data&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div class="section" id="incentives-problem"&gt;
&lt;h3&gt;Incentives problem&lt;/h3&gt;
&lt;p&gt;Fancy new results are what matters for success in academia. “High impact”
journals such as Nature or Science accept papers that amaze and impress,
often with subpar inspection of the materials and methods &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;. The rate of
publication in many leading groups is incompatible with consolidation
efforts required for strong reproducibility.&lt;/p&gt;
&lt;p&gt;On the other hand, it is hard to tell beforehand whether a new idea is a good
one. Hence letting imagination run free to foster impossible and
improbable ideas is a good path to innovation. The underlying questions
are: What are the best community rules for the advancement of knowledge?
What do we want from the way science moves forward? Rapid publication of
many incremental ideas, &lt;em&gt;eg&lt;/em&gt; at a conference, gives food for thought,
possibly at the expense of reproducibility.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;“Science, Nature and Cell, had a higher rate of retractions” –
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Invalid_science"&gt;Wikipedia: Invalid science&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="how-to-improve-the-situation"&gt;
&lt;h2&gt;How to improve the situation&lt;/h2&gt;
&lt;div class="section" id="docker-containers-and-virtual-machines"&gt;
&lt;h3&gt;Docker, containers, and virtual machines&lt;/h3&gt;
&lt;p&gt;Docker and other container or virtual-machine technologies enable shipping a software
environment. They diminish the challenges of building software and
setting up an analysis. Virtual machines are often used as a way to avoid
software packaging issues. This seems to me like a plaster on a wooden leg.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Containers give easy reproduction, at the cost of hard
replication and reuse.&lt;/div&gt;
&lt;p&gt;Indeed, an analysis that lives in a box can be reproduced, but can it be
understood, modified, or applied to new data? New science is likely going
to come from modifying this analysis, or combining it with other tools,
or new data. If these other tools live in a different virtual machine,
the combination will be challenging.&lt;/p&gt;
&lt;p&gt;In addition, people use containers as an excuse to avoid tackling
the need for proper documentation of requirements and of the process to set
them up. They sometimes even try to justify binary blobs &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;. This is
wrong. An analysis should be runnable without requiring the stars to
align, and it should be understandable.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;See also Titus Brown’s post: &lt;a class="reference external" href="http://ivory.idyll.org/blog/2014-containers.html"&gt;The post-apocalyptic world of binary
containers&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div class="section" id="version-control-wear-your-seatbelt"&gt;
&lt;h3&gt;Version control: wear your seatbelt&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control"&gt;Version control&lt;/a&gt;
is like a time machine: if used with regular commits, it enables rolling
back to any point in time. For my work, it has always been crucial
for reproducing what I or my students did a while ago. I often meet
researchers who feel they lack the time to learn it. I really cannot support
this position. &lt;a class="reference external" href="http://try.github.io"&gt;http://try.github.io&lt;/a&gt; is an easy way to learn version
control.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Hint&lt;/em&gt;: use a “tag” to pinpoint a position in the history that you might
want to return to, such as making a figure or the publication of an article.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sotware-libraries-curated-and-maintained"&gt;
&lt;h3&gt;Software libraries, curated and maintained&lt;/h3&gt;
&lt;p&gt;Consolidating an analysis pipeline, a standard visualization, or any
computational aspect of a paper into a software library is a sure way to
make the paper more reproducible. It will also make the steps reusable,
and a replication easier. If continued effort is put in the library,
chances are that computational efficiency will improve over time, thus
helping in the long run with the challenge of computing power.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Tough choices: not every variant of an analysis can be forever
reproducible.&lt;/div&gt;
&lt;p&gt;Maintaining the library will ensure that results are still reproducible
on new hardware, or with evolution of the general software stack (a new
Python or Matlab release, for instance). Documentation and curated
examples will lower the bar to reuse and facilitate replication of the
original scientific results.&lt;/p&gt;
&lt;p&gt;To avoid feature creep and technical debt, a library calls for focused
efforts on selecting the most important operations.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="datasets-serving-as-model-experiments-tractable-and-open"&gt;
&lt;h3&gt;Datasets, serving as model experiments, tractable and open&lt;/h3&gt;
&lt;p&gt;Sometimes researchers create a toy dataset, with a well-posed question, that
is curated and open, small enough to be tractable yet large enough to be
relevant to the application field. This is an invaluable service to the
field. One example is the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Netflix_Prize"&gt;Netflix prize&lt;/a&gt; in machine learning,
which led to a standard dataset. Unfortunately, the dataset was taken
down some years later due to copyright concerns. But it has been
replaced, &lt;em&gt;eg&lt;/em&gt; by the &lt;a class="reference external" href="http://grouplens.org/datasets/movielens/"&gt;movielens dataset&lt;/a&gt;. For computer vision, a
series of datasets –&lt;a class="reference external" href="http://www.vision.caltech.edu/Image_Datasets/Caltech101/"&gt;Caltech101&lt;/a&gt;, &lt;a class="reference external" href="https://www.cs.toronto.edu/~kriz/cifar.html"&gt;CIFAR&lt;/a&gt;, &lt;a class="reference external" href="http://www.image-net.org/"&gt;ImageNet&lt;/a&gt;…– have led to continuous progress of the
field. In bioinformatics, standard data are regularly created, for
instance by the &lt;a class="reference external" href="http://dreamchallenges.org/"&gt;DREAM challenges&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These reference open datasets serve as benchmarks and therefore foster
competition. They also define a canonical experiment, helping a wider
scientific community understand the questions that they ask. Ultimately,
they result in better software tools to solve the problem at hand, as
this problem becomes a standard example and application of tools.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Sage_Bionetworks"&gt;Sage bionetworks&lt;/a&gt;, for
instance, is a non-profit that collects and make biomedical data
available. These people believe, as I do, that such data will lead to
better medical care.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="changing-incentives-setting-the-right-goals"&gt;
&lt;h3&gt;Changing incentives: setting the right goals&lt;/h3&gt;
&lt;p&gt;Making sustainable, quality scientific work that facilitates reproduction
needs to be a clearly-visible benefit to researchers, young and senior.
Such contributions should help them get jobs and grants.&lt;/p&gt;
&lt;p&gt;An unsophisticated publication count is the basis of scientific
evaluation. We need to accept publications about data, software, and
replication of prior work in high-quality journals. They need to be
strictly reviewed, to establish high standards on these contributions.
This change is happening. &lt;a class="reference external" href="http://www.gigasciencejournal.com/"&gt;Gigascience&lt;/a&gt;, amongst other venues, publishes
data. The &lt;a class="reference external" href="http://jmlr.org/mloss/"&gt;MLOSS (machine learning open source software) track&lt;/a&gt; of the JMLR (journal of machine learning
research) publishes software, with a tough review on the software quality
of the project.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Researchers should cite the software they use.&lt;/div&gt;
&lt;p&gt;Yet software is still often under-cited: many will use software
implementing a method, and only cite the original paper that proposed the
method. Another remaining challenge is how to give credit for continuing
development and maintenance.&lt;/p&gt;
&lt;p&gt;Fast-paced science is probably useful even if fragile. But the difference
between a quick proof of concept and solid, reproducible and reusable
work needs to be acknowledged. It is important to select for publication
not only impressive results, but also sound reusable material and
methods. The latter are the foundation of future scientific developments,
but high-impact journals tend to focus on the former.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Related posts&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="software-for-reproducible-science-lets-not-have-a-misunderstanding.html"&gt;Software for reproducible science: let’s not have a misunderstanding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="mloss-2015-wising-up-to-building-open-source-machine-learning.html"&gt;MLOSS 2015: wising up to building open-source machine learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="reproducible research"></category><category term="science"></category><category term="software"></category><category term="machine learning"></category><category term="scientific software"></category></entry><entry><title>Nilearn 0.2: more powerful machine learning for neuroimaging</title><link href="https://gael-varoquaux.info/programming/nilearn-02-more-powerful-machine-learning-for-neuroimaging.html" rel="alternate"></link><published>2015-12-13T00:00:00+01:00</published><updated>2015-12-13T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-12-13:/programming/nilearn-02-more-powerful-machine-learning-for-neuroimaging.html</id><summary type="html">&lt;div class="small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Nilearn’s goals&lt;/p&gt;
&lt;p class="last"&gt;Make advanced machine learning techniques easy for neuroimaging
research.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;After 6 months of effort, we just released version 0.2 of &lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt;, dedicated to making &lt;strong&gt;machine learning in
neuroimaging easier and more powerful&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This release integrates the features of the &lt;a class="reference external" href="nilearn_july_2015_sprint.html"&gt;july sprint&lt;/a&gt;, and &lt;a class="reference external" href="http://nilearn.github.io/whats_new.html"&gt;more&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="highlights"&gt;
&lt;h2&gt;Highlights&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Better documentation …&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Nilearn’s goals&lt;/p&gt;
&lt;p class="last"&gt;Make advanced machine learning techniques easy for neuroimaging
research.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;After 6 months of effort, we just released version 0.2 of &lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt;, dedicated to making &lt;strong&gt;machine learning in
neuroimaging easier and more powerful&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This release integrates the features of the &lt;a class="reference external" href="nilearn_july_2015_sprint.html"&gt;july sprint&lt;/a&gt;, and &lt;a class="reference external" href="http://nilearn.github.io/whats_new.html"&gt;more&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="highlights"&gt;
&lt;h2&gt;Highlights&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Better documentation with narrative examples&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Examples can now be broken down into blocks (as &lt;a class="reference external" href="http://nilearn.github.io/auto_examples/connectivity/plot_signal_extraction.html#sphx-glr-auto-examples-connectivity-plot-signal-extraction-py"&gt;here&lt;/a&gt;)
for better narration (thanks to &lt;a class="reference external" href="http://sphinx-gallery.readthedocs.org/en/latest/"&gt;sphinx-gallery&lt;/a&gt;).&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="http://nilearn.github.io/auto_examples/decoding/plot_mixed_gambles_space_net.html"&gt;&lt;img alt="" src="http://nilearn.github.io/_images/sphx_glr_plot_mixed_gambles_space_net_001.png" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Space net: spatial regularizations in decoding&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://nilearn.github.io/decoding/space_net.html"&gt;“SpaceNet” decoder&lt;/a&gt; does spatial
regularizations such as TV-l1 or Graph-Net to identify predictive regions
in decoding.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="http://nilearn.github.io/auto_examples/connectivity/plot_compare_resting_state_decomposition.html"&gt;&lt;img alt="" src="http://nilearn.github.io/_images/sphx_glr_plot_compare_resting_state_decomposition_002.png" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Dictionary learning for resting-state parcellations&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Dictionary learning is a &lt;a class="reference external" href="http://nilearn.github.io/connectivity/resting_state_networks.html#beyond-ica-dictionary-learning"&gt;promising alternative to ICA to learn networks&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="http://nilearn.github.io/auto_examples/manipulating_visualizing/plot_prob_atlas.html#sphx-glr-auto-examples-manipulating-visualizing-plot-prob-atlas-py"&gt;&lt;img alt="" src="http://nilearn.github.io/_images/sphx_glr_plot_prob_atlas_003.png" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Plotting sets of probabilistic maps&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With &lt;a class="reference external" href="http://nilearn.github.io/manipulating_visualizing/plotting.html#different-plotting-functions"&gt;a simple function&lt;/a&gt;,
you can plot outlines for multiple maps.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="http://nilearn.github.io/auto_examples/manipulating_visualizing/plot_extract_rois_statistical_maps.html"&gt;&lt;img alt="" src="http://nilearn.github.io/_images/sphx_glr_plot_extract_rois_statistical_maps_003.png" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Separating regions out of maps&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We have a set of functions to &lt;a class="reference external" href="http://nilearn.github.io/auto_examples/manipulating_visualizing/plot_extract_rois_statistical_maps.html"&gt;separate regions on maps&lt;/a&gt; or &lt;a class="reference external" href="http://nilearn.github.io/auto_examples/connectivity/plot_extract_regions_canica_maps.html"&gt;turn networks into a probabilistic parcellation&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;strong&gt;Classification on connectomes&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We now have advanced connectivity measures to do &lt;a class="reference external" href="http://nilearn.github.io/auto_examples/connectivity/plot_connectivity_measures.html"&gt;comparisons across
connectomes for classification&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Thanks&lt;/p&gt;
&lt;p&gt;Thanks to Alexandre Abraham, who led the effort, and &lt;a class="reference external" href="http://nilearn.github.io/whats_new.html#contributors"&gt;all the
contributors&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="neuroimaging"></category><category term="python"></category><category term="scientific computing"></category><category term="scipy"></category></entry><entry><title>Job offer: data crunching brain functional connectivity for biomarkers</title><link href="https://gael-varoquaux.info/science/job-offer-data-crunching-brain-functional-connectivity-for-biomarkers.html" rel="alternate"></link><published>2015-12-08T00:00:00+01:00</published><updated>2015-12-08T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-12-08:/science/job-offer-data-crunching-brain-functional-connectivity-for-biomarkers.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;My research group&lt;/a&gt; is looking to fill
a &lt;strong&gt;post-doc position on learning biomarkers from functional
connectivity&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="section" id="scientific-context"&gt;
&lt;h2&gt;Scientific context&lt;/h2&gt;
&lt;p&gt;The challenge is to use resting-state fMRI at the level of a population
to understand how intrinsic functional connectivity captures pathologies
and other cognitive phenotypes. Rest fMRI is a promising tool for …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;My research group&lt;/a&gt; is looking to fill
a &lt;strong&gt;post-doc position on learning biomarkers from functional
connectivity&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="section" id="scientific-context"&gt;
&lt;h2&gt;Scientific context&lt;/h2&gt;
&lt;p&gt;The challenge is to use resting-state fMRI at the level of a population
to understand how intrinsic functional connectivity captures pathologies
and other cognitive phenotypes. Rest fMRI is a promising tool for
large-scale population analysis of brain function as it is easy to
acquire and accumulate. Scans for thousands of subjects have already been
shared, and more are to come. However, the signatures of cognition in this
modality are weak, and extracting biomarkers is a challenging
data-processing and machine-learning problem. Meeting this challenge is
the core expertise of my research group. Medical applications cover a wide
range of brain pathologies for which diagnosis is challenging, such as
autism or Alzheimer’s disease.&lt;/p&gt;
&lt;p&gt;This project is a collaboration with the &lt;a class="reference external" href="http://www.childmind.org/"&gt;Child Mind Institute&lt;/a&gt;, experts on psychiatric disorders and
resting-state fMRI, as well as coordinators of the major data-sharing
initiatives for rest fMRI data (e.g. ABIDE).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="objectives-of-the-project"&gt;
&lt;h2&gt;Objectives of the project&lt;/h2&gt;
&lt;p&gt;The project hinges on the processing of very large rest fMRI databases.
Important novelties of the project are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Building predictive models that can discriminate &lt;strong&gt;multiple
pathologies&lt;/strong&gt; in &lt;strong&gt;large inhomogeneous datasets&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Using and improving &lt;strong&gt;advanced connectomics&lt;/strong&gt; and
&lt;strong&gt;brain-parcellation&lt;/strong&gt; techniques in fMRI.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Expected results include the discovery of neurophenotypes for several
brain pathologies, as well as intrinsic brain structures, such as
functional parcellations or connectomes, that carry signatures of
cognition.&lt;/p&gt;
&lt;p&gt;The analysis framework is based on algorithmic tools developed in Python
(crucially, leveraging scikit-learn for predictive modeling).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="desired-profile"&gt;
&lt;h2&gt;Desired profile&lt;/h2&gt;
&lt;p&gt;We are looking to hire a post-doctoral fellow in the spring. The ideal
candidate would have some, but not all, of the following expertise and
interests:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Experience in advanced processing of fMRI&lt;/li&gt;
&lt;li&gt;General knowledge of brain structure and function&lt;/li&gt;
&lt;li&gt;Good communication skills to write high-impact neuroscience publications&lt;/li&gt;
&lt;li&gt;Good computing skills, in particular with Python. Cluster computing
experience is desired.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="a-great-research-environment"&gt;
&lt;h2&gt;A great research environment&lt;/h2&gt;
&lt;p&gt;The work environment is dynamic and exciting, using state-of-the-art
machine learning to answer challenging functional neuroimaging questions.&lt;/p&gt;
&lt;p&gt;The post-doc will be employed by &lt;a class="reference external" href="http://www.inria.fr"&gt;INRIA&lt;/a&gt;, the lead
computing research institute in France. We are a team of computer
scientists specialized in image processing and statistical data analysis,
integrated in one of the top French brain research centers, &lt;a class="reference external" href="http://i2bm.cea.fr/dsv/i2bm/Pages/NeuroSpin.aspx"&gt;NeuroSpin&lt;/a&gt;, south of Paris. We
work mostly in Python. The team includes core contributors to the
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn project&lt;/a&gt;, for machine learning in
Python, and the &lt;a class="reference external" href="http://nilearn.github.io/"&gt;nilearn project&lt;/a&gt;, for
statistical learning in NeuroImaging.&lt;/p&gt;
&lt;p&gt;In addition, the post-doc will interact closely with researchers from the
&lt;a class="reference external" href="http://www.childmind.org/"&gt;Child Mind Institute&lt;/a&gt;, with deep expertise
in brain pathologies and in the details of the fMRI acquisitions.
Finally, he or she will have access to advanced storage and grid
computing facilities at INRIA.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Contact information&lt;/strong&gt;: gael dotnospam varoquaux atnotspam inria dotnospam fr&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="jobs"></category><category term="neuromaging"></category><category term="science"></category><category term="python"></category><category term="scientific computing"></category></entry><entry><title>MLOSS 2015: wising up to building open-source machine learning</title><link href="https://gael-varoquaux.info/programming/mloss-2015-wising-up-to-building-open-source-machine-learning.html" rel="alternate"></link><published>2015-11-28T00:00:00+01:00</published><updated>2015-11-28T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-11-28:/programming/mloss-2015-wising-up-to-building-open-source-machine-learning.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The 2015 edition of the machine learning open
source software (MLOSS) workshop was full of very mature discussions
that I strive to report here.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;em&gt;I give links to the videos. Some machine-learning researchers have
great thoughts about growing communities of coders, about code as a
process and a deliverable …&lt;/em&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The 2015 edition of the machine learning open
source software (MLOSS) workshop was full of very mature discussions
that I strive to report here.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;em&gt;I give links to the videos. Some machine-learning researchers have
great thoughts about growing communities of coders, about code as a
process and a deliverable.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I was a co-organizer of the &lt;a class="reference external" href="https://mloss.org/workshop/icml15/"&gt;MLOSS 2015 workshop&lt;/a&gt;, held during &lt;a class="reference external" href="http://icml.cc/2015/"&gt;ICML 2015&lt;/a&gt;. As I have finally figured out where the
videos are, now is a good time to summarize my impressions on the
workshop.&lt;/p&gt;
&lt;img alt="" src="attachments/mloss/mloss_t_shirt_white.png" style="width: 100%;" /&gt;
&lt;div class="section" id="online-videos-of-the-talks"&gt;
&lt;h2&gt;Online videos of the talks&lt;/h2&gt;
&lt;div class="small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Graphics &amp;amp; T-shirts&lt;/p&gt;
&lt;p&gt;The graphics were printed on T-shirts. We ran out, but the material is
&lt;a class="reference external" href="attachments/mloss/mloss_t_shirt_graphics.zip"&gt;here&lt;/a&gt; for you to
print.&lt;/p&gt;
&lt;p class="last"&gt;&lt;em&gt;Anyone wants to help making an online T-shirt ordering?&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The videos of all the talks are online:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/4216268dc28148c89d8b6e4eba1ad6e51d"&gt;Python and Parallelism or Dask&lt;/a&gt;
by &lt;em&gt;Matthew Rocklin&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/afe6f76b3bb1452790fc8982e28112641d"&gt;Collaborative filtering via matrix decomposition in mlpack&lt;/a&gt;
by &lt;em&gt;Ryan Curtin&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/9cd947554ddf404b9a40ca2601e44b4c1d"&gt;BLOG: a probabilistic programming language for open-universe contingent
Bayesian networks&lt;/a&gt;
by &lt;em&gt;Yi Wu&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/45c3bb312a37491dbce1af25f1aeba001d"&gt;Spotlights&lt;/a&gt;:&lt;ul&gt;
&lt;li&gt;Nilearn, machine learning for neuroimaging in Python (Alexandre
Abraham)&lt;/li&gt;
&lt;li&gt;KeLP: a Kernel-based Learning Platform in Java (Simone Filice)&lt;/li&gt;
&lt;li&gt;DiffSharp: Automatic Differentiation Library (Atılım Güneş Baydin)&lt;/li&gt;
&lt;li&gt;The FAST toolkit for Unsupervised Learning of HMMs (José P.
González-Brenes)&lt;/li&gt;
&lt;li&gt;OpenML: a Networked Science Platform for Machine Learning (Joaquin
Vanschoren)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/2529ebcb20794942874d5c277c5dcc981d"&gt;Julia’s Approach to Open Source Machine Learning&lt;/a&gt;
by &lt;em&gt;John Myles White&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/da4f7869f07745f7bbc5a2e5f31761b61d"&gt;Do it yourself deep learning with the Caffe community&lt;/a&gt;
by &lt;em&gt;Evan Shelhamer&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/2bc15b283f324784a945d79d9a06c76c1d"&gt;From flop to success in academic software development&lt;/a&gt;
by &lt;em&gt;Gaël Varoquaux&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="mloss-a-maturing-community"&gt;
&lt;h2&gt;MLOSS: a maturing community&lt;/h2&gt;
&lt;!-- Say that I was not enthousiastic, originaly, and say why (typical
flaws of academic software) --&gt;
&lt;p&gt;When Antti Honkela and Cheng Soon Ong approached me to co-organize an
MLOSS workshop, I felt that it was important to do it for the sake of
open source scientific software. But I didn’t feel very enthusiastic
about the event or the talks themselves. Boy, was I wrong.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Huge attendance: open-source ML software is now mainstream.&lt;/div&gt;
&lt;p&gt;My first MLOSS workshop was at the ICML 2011 conference, in Haifa. The
workshop was in a tiny, cramped room, with a couple dozen geeks,
and it felt like a clique of people on the side of the conference. This
year, we had a huge room and more than 200 people showed up.&lt;/p&gt;
&lt;p&gt;I am used to talks being about a grad student or young researcher who
has whipped the code of a paper onto the web, with an open license but no
vision. This year, people were presenting actual projects, with long-term
goals and the desire to solve a problem larger than their latest research.
It might explain why the attendance was huge: people came because talks
might genuinely help them.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With Cheng and Antti, we had chosen &lt;em&gt;“open ecosystems”&lt;/em&gt; as a theme,
because ecosystems are the key to scaling computing and science. Between
us, imposing a theme on a workshop is challenging, as people submit
abstracts, good or bad, and one has to work with what one gets.
However, a lot of talks mentioned how the projects slot into a wider
picture, and interact with a community. For instance, Evan attributes
part of the success of Caffe to the &lt;a class="reference external" href="https://github.com/BVLC/caffe/wiki/Model-Zoo"&gt;“Model Zoo”&lt;/a&gt; in which the community
contributes fitted models. At the other end of the spectrum, OpenML is a
full online project with the goal of fostering collaboration and comparison.
Project developers showed in their talks that they are very conscious
of other projects that might be used together with theirs.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="accepting-the-sustainability-challenges"&gt;
&lt;h2&gt;Accepting the sustainability challenges&lt;/h2&gt;
&lt;p&gt;Over time, I have gradually realized the importance of community
building, &lt;em&gt;i.e.&lt;/em&gt; project management and goal setting, more than technical
virtuosity. Historically, the scientific culture of code has put the
emphasis on the genius ideas behind the code, and the craftsmanship of
the implementation, at the cost of sustainability.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Alone, I go fast. Together, we go far.&lt;/div&gt;
&lt;p&gt;I was surprised to see that the MLOSS community was growing very aware of
mechanisms of long-term project life, in particular the human factors.&lt;/p&gt;
&lt;p&gt;I was asked by my co-organizers to give &lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/2bc15b283f324784a945d79d9a06c76c1d"&gt;a talk on factors of success of
open source scientific software&lt;/a&gt;.
I touched upon &lt;strong&gt;software engineering&lt;/strong&gt;, &lt;strong&gt;project vision&lt;/strong&gt;,
&lt;strong&gt;licensing&lt;/strong&gt;, &lt;strong&gt;governance&lt;/strong&gt;, &lt;strong&gt;community building&lt;/strong&gt;. All these topics
are deemed &lt;em&gt;“non-scientific”&lt;/em&gt; and thus so often despised and left out. I was
astonished to find out that the talks before me were giving very good
advice on these. I found that I only had to summarize and comment what
had been said before. This evolution of the scientific community makes me
very hopeful for the future.&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
&lt;p&gt;Every line of code you write is debt. You should be ashamed of every line
of code you have written. […]&lt;/p&gt;
&lt;p&gt;You have a supply of labor. These are the people who are contributors
[…].
The people who are users and not contributors are actually a source of
demand […] they mostly consume sources of labor rather than produce it.
&amp;nbsp; &amp;nbsp; &amp;nbsp; —  &amp;nbsp; &amp;nbsp;
&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/2529ebcb20794942874d5c277c5dcc981d"&gt;John Myles White&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Thanks to our sponsors&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.facebook.com"&gt;Facebook&lt;/a&gt; and &lt;a class="reference external" href="http://www.continuum.io"&gt;continuum&lt;/a&gt; sponsored the trip for our keynote
speakers. Thank you very much, the keynotes were great!&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://www.datascience-paris-saclay.fr/"&gt;Paris-Saclay Center for Data Science (CDS)&lt;/a&gt; gave us our main operating
fund, which is critical for organizing an event. In general, I must
say that the CDS has been hugely supportive of open source data
science in the Paris area, having a significant impact on training as
well as development.&lt;/p&gt;
&lt;p&gt;And also, I must acknowledge support from &lt;a class="reference external" href="http://www.inria.fr/"&gt;Inria&lt;/a&gt; for the accounting and administration
of the event.&lt;/p&gt;
&lt;p&gt;Finally, &lt;strong&gt;our reviewers were amazing&lt;/strong&gt;. Most of them reviewed the
project itself: its code, its documentation, its support. They rose above
the typical petty fights that we see in academia and focused on what
the project was bringing to the scientific community. Often their
reviews were longer and carried more information than the submitted
abstracts.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Related posts&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="software-for-reproducible-science-lets-not-have-a-misunderstanding.html"&gt;Software for reproducible science: let’s not have a misunderstanding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="../science/publishing-scientific-software-matters.html"&gt;Publishing scientific software matters&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="conferences"></category><category term="science"></category><category term="software"></category><category term="machine learning"></category><category term="reproducible research"></category><category term="scientific software"></category></entry><entry><title>Nilearn sprint: hacking neuroimaging machine learning</title><link href="https://gael-varoquaux.info/programming/nilearn-sprint-hacking-neuroimaging-machine-learning.html" rel="alternate"></link><published>2015-08-04T00:00:00+02:00</published><updated>2015-08-04T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-08-04:/programming/nilearn-sprint-hacking-neuroimaging-machine-learning.html</id><summary type="html">&lt;p&gt;A couple of weeks ago, we had in Paris the second international &lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt; sprint, dedicated to making &lt;strong&gt;machine learning
in neuroimaging easier and more powerful&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;It was such a fantastic experience, as nilearn is really shaping up as a
simple yet powerful tool, and there is a lot of enthusiasm …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A couple of weeks ago, we had in Paris the second international &lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt; sprint, dedicated to making &lt;strong&gt;machine learning
in neuroimaging easier and more powerful&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;It was such a fantastic experience, as nilearn is really shaping up as a
simple yet powerful tool, and there is a lot of enthusiasm. For me, this
sprint is a turning point, as I could see people other than the original
core team (which spun out of &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;our research team&lt;/a&gt;) excited about the project’s future.
Thank you to all who came:&lt;/p&gt;
&lt;ul class="columns simple"&gt;
&lt;li&gt;Ahmed Kanaan&lt;/li&gt;
&lt;li&gt;Andres Hoyos Idrobo&lt;/li&gt;
&lt;li&gt;Alexandre Abraham&lt;/li&gt;
&lt;li&gt;Arthur Mensch&lt;/li&gt;
&lt;li&gt;Ben Cipolli (remote)&lt;/li&gt;
&lt;li&gt;Bertrand Thirion&lt;/li&gt;
&lt;li&gt;Chris Filo Gorgolewski&lt;/li&gt;
&lt;li&gt;Danilo Bzdok&lt;/li&gt;
&lt;li&gt;Elvis Dohmatob&lt;/li&gt;
&lt;li&gt;Julia Hutenburg&lt;/li&gt;
&lt;li&gt;Kamalaker Dadi&lt;/li&gt;
&lt;li&gt;Loic Esteve&lt;/li&gt;
&lt;li&gt;Martin Perez&lt;/li&gt;
&lt;li&gt;Michael Hanke&lt;/li&gt;
&lt;li&gt;Oscar Nájera, working on
&lt;a class="reference external" href="http://sphinx-gallery.readthedocs.org/"&gt;sphinx-gallery&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" src="attachments/nilearn_july_2015_sprint/nilearn_sprint.jpg" style="width: 100%;" /&gt;
&lt;p&gt;The sprint was a joint sprint with the &lt;a class="reference external" href="http://martinos.org/mne/stable/mne-python.html"&gt;MNE-Python&lt;/a&gt; team, which makes MEG
processing awesome. We also need to thank &lt;a class="reference external" href="http://alexandre.gramfort.net"&gt;Alex Gramfort&lt;/a&gt;, who did most of the work to set up the
sprint, as well as &lt;a class="reference external" href="https://www.universite-paris-saclay.fr/en/research/project/lidex-neurosaclay"&gt;NeuroSaclay&lt;/a&gt;
for funding, and &lt;a class="reference external" href="http://lapaillasse.org/"&gt;La paillasse&lt;/a&gt;, &lt;a class="reference external" href="http://www.telecom-paristech.fr"&gt;Telecom&lt;/a&gt;, and &lt;a class="reference external" href="http://www.inria.fr/en/centre/saclay"&gt;INRIA&lt;/a&gt; for hosting.&lt;/p&gt;
&lt;div class="section" id="highlights-of-the-sprints-results"&gt;
&lt;h2&gt;Highlights of the sprint’s results&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Plotting of multiple maps&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external image-reference" href="https://circle-artifacts.com/gh/nilearn/nilearn/128/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/connectivity/plot_canica_resting_state.html"&gt;&lt;img alt="" class="align-right" src="attachments/nilearn_july_2015_sprint/plot_canica_resting_state_001.png" style="width: 200px;" /&gt;&lt;/a&gt;
&lt;p&gt;A function to visualize overlays of various maps, e.g. for a
probabilistic atlas, with defaults that try to adapt to the number of
maps (see the &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/128/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/manipulating_visualizing/plot_prob_atlas.html"&gt;example&lt;/a&gt;).
It is very useful, for example, for &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/128/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/connectivity/plot_canica_resting_state.html"&gt;easily visualizing ICA components&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sign of activation in glass brain&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external image-reference" href="https://circle-artifacts.com/gh/nilearn/nilearn/287/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/manipulating_visualizing/plot_demo_glass_brain_extensive.html"&gt;&lt;img alt="" class="align-right" src="attachments/nilearn_july_2015_sprint/plot_demo_glass_brain_extensive_005.png" style="width: 200px;" /&gt;&lt;/a&gt;
&lt;p&gt;Our glass brain plotting was greatly improved, adding, amongst other
things, the option to capture the sign of the activation in the color
(see this &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/287/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/manipulating_visualizing/plot_demo_glass_brain_extensive.html"&gt;example&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Spatially-regularized decoder&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external image-reference" href="https://circle-artifacts.com/gh/nilearn/nilearn/287/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/decoding/plot_haxby_space_net.html"&gt;&lt;img alt="" class="align-right" src="attachments/nilearn_july_2015_sprint/plot_haxby_space_net_002.png" style="width: 200px;" /&gt;&lt;/a&gt;
&lt;p&gt;Decoders based on GraphNet and total variation have finally landed in
nilearn. This has required a lot of work to get fast convergence and
robust parameter selection. At the end of the day, it is much slower
than an SVM, but the maps look splendid
(see this &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/287/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/decoding/plot_haxby_space_net.html"&gt;example&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sparse dictionary learning&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external image-reference" href="https://circle-artifacts.com/gh/nilearn/nilearn/282/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/connectivity/plot_dict_learning_resting_state.html"&gt;&lt;img alt="" class="align-right" src="attachments/nilearn_july_2015_sprint/plot_dict_learning_resting_state_001.png" style="width: 200px;" /&gt;&lt;/a&gt;
&lt;p&gt;We have almost merged sparse dictionary learning as an alternative to ICA.
Experience shows that, on resting-state data, it gives a more contrasted
segmentation of networks
(see this &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/282/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/connectivity/plot_dict_learning_resting_state.html"&gt;example&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;New installation docs&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
New webpage layout using tabs to display only the installation
instructions relevant to the user’s OS (see &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/287/artifacts/0/home/ubuntu/nilearn/doc/_build/html/introduction.html#installation"&gt;here&lt;/a&gt;).
The result is more compact and clearer instructions, which I hope
will make our users’ lives easier.&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;CircleCI integration&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
We now use &lt;a class="reference external" href="https://circleci.com/gh/nilearn/nilearn"&gt;CircleCI&lt;/a&gt; to
run the examples and build the docs. This is challenging because our
examples are real cases of neuroimaging data analysis, and thus require
heavy datasets and computing horsepower.&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Neurodebian packaging&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
There are now &lt;a class="reference external" href="http://neuro.debian.net/pkgs/python-nilearn.html"&gt;neurodebian packages&lt;/a&gt; for nilearn.&lt;/blockquote&gt;
&lt;p&gt;And much more!&lt;/p&gt;
&lt;div class="admonition warning"&gt;
&lt;p class="first admonition-title"&gt;Warning&lt;/p&gt;
&lt;p class="last"&gt;Features listed above are &lt;strong&gt;not&lt;/strong&gt; in the released version of nilearn.
You need to wait a month or so.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="neuroimaging"></category><category term="python"></category><category term="scientific computing"></category><category term="scipy"></category></entry><entry><title>Software for reproducible science: let’s not have a misunderstanding</title><link href="https://gael-varoquaux.info/programming/software-for-reproducible-science-lets-not-have-a-misunderstanding.html" rel="alternate"></link><published>2015-05-18T00:00:00+02:00</published><updated>2015-05-18T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-05-18:/programming/software-for-reproducible-science-lets-not-have-a-misunderstanding.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tl;dr:&lt;/strong&gt; &amp;nbsp; &lt;em&gt;Reproducibility is a noble cause and scientific
software a promising vessel. But excess of reproducibility can be at
odds with the housekeeping required for good software engineering.
Code that “just works” should not be taken for granted.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;em&gt;This post advocates for a progressive consolidation effort of
scientific …&lt;/em&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tl;dr:&lt;/strong&gt; &amp;nbsp; &lt;em&gt;Reproducibility is a noble cause and scientific
software a promising vessel. But excess of reproducibility can be at
odds with the housekeeping required for good software engineering.
Code that “just works” should not be taken for granted.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;em&gt;This post advocates for a progressive consolidation effort of
scientific code, rather than putting too high a bar on code release.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://ivory.idyll.org/blog/"&gt;Titus Brown&lt;/a&gt; recently shared &lt;a class="reference external" href="http://ivory.idyll.org/blog/2015-how-should-we-think-about-research-software.html"&gt;an
interesting war story&lt;/a&gt;
in which a reviewer refuses to review a paper until he can run the code
on his own files. Titus’s comment boils down to:&lt;/p&gt;
&lt;blockquote&gt;
&lt;blockquote class="epigraph"&gt;
&lt;a class="reference external" href="http://ivory.idyll.org/blog/2015-how-should-we-think-about-research-software.html"&gt;“Please destroy this software after publication”&lt;/a&gt;.&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Reproducible science: Does the emperor have clothes?&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In other words, code for a publication is often not reusable. This
point of view is very interesting coming from someone like Titus, who is a
&lt;a class="reference external" href="http://ivory.idyll.org/blog/a-conversation-on-reproducibility.html"&gt;vocal proponent&lt;/a&gt; of
reproducible science. His words surprised some people, which led Titus
to wonder whether &lt;a class="reference external" href="http://ivory.idyll.org/blog/2015-we-live-in-a-bubble.html"&gt;some of the reproducible-science crowd live in a
bubble&lt;/a&gt;. I
was happy to see &lt;a class="reference external" href="https://twitter.com/ctitusbrown/status/589171853031186434"&gt;the discussion&lt;/a&gt; unroll, as
I think that there is a strong risk of creating a bubble around
reproducible science. Such a bubble will backfire.&lt;/p&gt;
&lt;!-- Let me share my point of view on software for reproducible science. --&gt;
&lt;div class="section" id="replication-is-a-must-for-science-and-society"&gt;
&lt;h2&gt;Replication is a must for science and society&lt;/h2&gt;
&lt;p&gt;Science advances by accumulating knowledge built upon
observations. It’s easy to forget that these observations, and the
corresponding paradigmatic conclusions, are not always as simple to
establish as the fact that hot air rises: &lt;strong&gt;replicating the
scientific process many times transforms evidence into truth&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;One striking example of scientific replication is &lt;a class="reference external" href="http://www.nature.com/news/first-results-from-psychology-s-largest-reproducibility-test-1.17433"&gt;the ongoing effort in
psychology&lt;/a&gt;
to replay the evidence behind well-accepted findings central to
current lines of thought in psychological science. It involves setting up
the experiments according to the seminal publications, acquiring the
data, and processing it to reach the same conclusions. Surprisingly,
not everything that was taken for granted holds.&lt;/p&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Findings later discredited backed economic policy&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Another example, with massive consequences for Joe Average’s everyday life, is
the failed replication of Reinhart and Rogoff’s &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Growth_in_a_Time_of_Debt"&gt;“Growth in a Time of
Debt”&lt;/a&gt;
publication. The original paper, published in 2010 in the American
Economic Review, claimed empirical findings linking high public debt
to stalled GDP growth. In a context of economic crisis, it was used
by policy makers as a justification for restricted public spending.
However, while pursuing a mere homework assignment to replicate these
findings, &lt;a class="reference external" href="http://www.bbc.com/news/magazine-22223190"&gt;a student uncovered methodological flaws in the paper&lt;/a&gt;. Understanding the
&lt;a class="reference external" href="http://www.nextnewdeal.net/rortybomb/researchers-finally-replicated-reinhart-rogoff-and-there-are-serious-problems"&gt;limitations&lt;/a&gt;
of the original study took a while, and &lt;strong&gt;discredited the academic
backing of the economic doctrine of austerity&lt;/strong&gt;. Critically, this
analysis of the publication was possible only because Reinhart and Rogoff
&lt;strong&gt;released their spreadsheet, with data and analysis details&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sharing-code-can-make-science-reproducible"&gt;
&lt;h2&gt;Sharing code can make science reproducible&lt;/h2&gt;
&lt;p&gt;A great example of sharing code to make a publication reproducible is the
recent paper on &lt;a class="reference external" href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0126255"&gt;orthogonalization of regressors in fMRI models&lt;/a&gt;,
by Mumford, Poline and Poldrack. The paper is a didactic refutation
of non-justified data processing practices. The authors made their
point much stronger by giving &lt;a class="reference external" href="http://nbviewer.ipython.org/github/jmumford/orthogonalizaton_ipynb/blob/master/orthogonalization.ipynb"&gt;an IPython notebook&lt;/a&gt;
to reproduce their figures. The recipe works perfectly here, because the
ideas underlying the publication are simple and can be illustrated on
synthetic data with relatively inexpensive computation. A short IPython
notebook is all it takes to convince the reader.&lt;/p&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Sharing complex code… chances are it won’t run on new data.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;At the other end of the spectrum, a complex analysis pipeline will not be
as easy to share. For instance, a feat of strength such as Miyawaki &lt;em&gt;et
al&lt;/em&gt;’s &lt;a class="reference external" href="http://www.cell.com/neuron/abstract/S0896-6273%2808%2900958-6"&gt;visual image
reconstruction from brain activity&lt;/a&gt;
requires complex statistical signal processing to extract weak
signatures. Miyawaki &lt;em&gt;et al&lt;/em&gt; shared the data. They might share the code, but
it would be a large chunk of code, probably fragile to changes in the
environment (Matlab version, OS…). Chances are that it wouldn’t run on
new data. This is the scenario that prompted Titus’s words:&lt;/p&gt;
&lt;blockquote&gt;
&lt;blockquote class="epigraph"&gt;
&lt;a class="reference external" href="http://ivory.idyll.org/blog/2015-how-should-we-think-about-research-software.html"&gt;“Please destroy this software after publication”&lt;/a&gt;.&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have good news: you can reproduce Miyawaki’s work with &lt;a class="reference external" href="http://nilearn.github.io/auto_examples/decoding/plot_miyawaki_reconstruction.html"&gt;an example&lt;/a&gt;
in &lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt;, a library for
machine learning on brain images. The example itself is concise,
readable and it reliably produces figures close to that of the paper.&lt;/p&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Maintained libraries make feats of strength routinely
reproducible.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;This easy replication is only possible because &lt;strong&gt;the corresponding code
leverages a set of libraries that encapsulate the main steps of the
analysis&lt;/strong&gt;, mainly &lt;a class="reference external" href="http://scikit-learn.org/stable/"&gt;scikit-learn&lt;/a&gt; and
&lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt; here. These libraries are
&lt;a class="reference external" href="https://travis-ci.org/nilearn/nilearn"&gt;tested&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/nilearn/nilearn/issues?q=is%3Aissue+is%3Aclosed"&gt;maintained&lt;/a&gt;
and &lt;a class="reference external" href="http://gael-varoquaux.info/programming/scikit-learn-015-release-highlights.html"&gt;released&lt;/a&gt;.
They enable us to go from a feat of strength to routine replication.&lt;/p&gt;
&lt;!-- * An example of non-reproducible research (my ICML paper) --&gt;
&lt;!-- Can research be up to the software engineering challenge? --&gt;
&lt;/div&gt;
&lt;div class="section" id="reproducibility-is-not-sustainable-for-everything"&gt;
&lt;h2&gt;Reproducibility is not sustainable for everything&lt;/h2&gt;
&lt;!-- Things are not always that easy

It's not you, it's me

Nobody said it was easy

Living up to the promise? --&gt;
&lt;blockquote class="epigraph"&gt;
Thinking is easy, acting is difficult &amp;nbsp; &amp;nbsp; &amp;nbsp;
—  &amp;nbsp; &amp;nbsp; &amp;nbsp;  &lt;em&gt;Goethe&lt;/em&gt;&lt;/blockquote&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Keeping a physics apparatus running for replication years later?&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I started my scientific career doing physics, and fairly &lt;a class="reference external" href="http://gael-varoquaux.info/science/general-relativity-quantum-physics-freely-falling-planes-and-bayesian-statistics.html"&gt;“heavy” physics&lt;/a&gt;:
vacuum systems, lasers, free-falling airplanes. In such settings, the
cost of maintaining an experiment is apparent to the layman. No-one is
expected to keep an apparatus running for replication years later. The
pinnacle of reproducible research is when the work becomes doable in a
student lab. Such progress is often supported by improved
technology, driven by wider applications of the findings.&lt;/p&gt;
&lt;p&gt;However, not every experiment will give rise to a student lab.
Replicating the others will not be easy. Even if the instruments are
still around the lab, they will require setting up, adjusting and wiring.
And chances are that connectors or cables will be missing.&lt;/p&gt;
&lt;p&gt;Software is no different. Storing and sharing it is cheaper. But
technology evolves very fast. Every setup is different. Code for a
scientific paper has seldom been built for easy maintenance: no
tests, a profusion of exotic dependencies, nonexistent documentation.
Robustness, portability and isolation would be desirable, but they are
difficult and costly to achieve.&lt;/p&gt;
&lt;p&gt;Software developers know that understanding the constraints to design a
good program requires writing a prototype. &lt;strong&gt;Code for a scientific paper
is very much a prototype&lt;/strong&gt;: it’s a first version of an idea, that proves
its feasibility. Common sense in software engineering says that
&lt;a class="reference external" href="http://blog.codinghorror.com/the-prototype-pitfall/"&gt;prototypes are designed to be thrown away&lt;/a&gt;. Prototype code
is fragile. It’s untested, and probably buggy for some uses. Releasing
prototypes amounts to distributing semi-functioning code. This is the
case for most code accompanying a publication, and it is to be expected
given the very nature of research: exploration and prototyping &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;!-- Quality scientific software require making choices --&gt;
&lt;!-- Doing less, better --&gt;
&lt;!-- Quality scientific software, only for a happy few --&gt;
&lt;/div&gt;
&lt;div class="section" id="no-success-without-quality"&gt;
&lt;h2&gt;No success without quality, …&lt;/h2&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Highly-reliable is more useful than state-of-the-art.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;My experience with scientific code has taught me that success requires
quality. Having a good implementation of simple, well-known methods
seems to matter more than doing something fancy. This is what the
success of scikit-learn has taught us: we are really providing classic
“old” machine learning methods, but with a good API, good docs,
computational performance, and stable numerics controlled by stringent
tests. There exist plenty of more sophisticated machine-learning
methods, including some that I have developed specifically for my data.
Yet, I find myself advising my co-workers to use the methods in
scikit-learn, because I know that the implementation is reliable and that
they will be able to use them &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This quality is indeed central to doing science with code. What good is a
data analysis pipeline if it crashes when I fiddle with the data? How can
I draw conclusions from simulations if I cannot change their parameters?
As soon as I need to trust code supporting a scientific
finding, I find myself tinkering with its input, and often breaking it.
Good scientific code is code that can be reused, that can lead to
large-scale experiments validating its underlying assumptions.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://twitter.com/divineomega/status/576165762911608833"&gt;&lt;img alt="" src="../programming/attachments/sqlite_code.png" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Sqlite is so much used that its developers have been woken up at
night by users.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;You might say that I am putting the bar too high; that slightly buggy
code is more useful than no code. But I frown at the idea of releasing
code for which I am unable to do proper quality assurance. I may have
done too much of that in the past. And because I am a prolific coder, many
people are using code that has been through my hands. My mailbox looks
like a battlefield, and when I go to the coffee machine I find myself
answering questions.&lt;/p&gt;
&lt;!-- Pour vivre heureux, vivons cachés.
http://en.wikipedia.org/wiki/Jean-Pierre_Claris_de_Florian --&gt;
&lt;/div&gt;
&lt;div class="section" id="and-making-difficult-choices"&gt;
&lt;h2&gt;… and making difficult choices&lt;/h2&gt;
&lt;!-- diminishing returns --&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Craftsmanship is about trade-offs&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Achieving quality requires making choices. Not only because time
is limited, but also because the difficulty of maintaining and improving a
codebase grows much faster than the number of features &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;. This
phenomenon is actually frightening to watch: adding a feature to
scikit-learn these days is much harder than it used to be in
the early days. Interactions between features are a killer: when you
modify something, something else unrelated breaks. For a given
functionality, &lt;strong&gt;nothing makes the code more incomprehensible than
cyclomatic complexity&lt;/strong&gt;: the multiplicity of branching, if/then clauses and
for loops. This complexity naturally appears when supporting different
input types, or minor variants of the same method.&lt;/p&gt;
&lt;p&gt;The consequence is that ensuring quality for many variants of a method is
prohibitive. This limit is a real problem for reproducible
science, as science builds upon comparing and opposing models. However,
ignoring it simply leads to code that fails to do what it claims to do.
What this tells us is that if we really aim at long-term
reproducibility, we &lt;strong&gt;need to identify successful and important research
and focus our efforts on it&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If you agree with my earlier point that the code of a publication is
a prototype, this iterative process seems natural. Various ideas
can be thought of as competing prototypes. Some will not lead to
publication at all, while others will end up having a high impact.
Knowing beforehand is impossible. Focusing too early on achieving high
quality is counterproductive. What matters is &lt;strong&gt;progressively
consolidating the code&lt;/strong&gt;.&lt;/p&gt;
&lt;!-- XXX rephrase the above to avoid 'what matters'? --&gt;
&lt;!-- I am sorry to say that my publications are not based on code with 90% test coverage. --&gt;
&lt;!-- say that my methods in machine learning will probably never make it to
scikit-learn --&gt;
&lt;/div&gt;
&lt;div class="section" id="reproducible-science-a-rich-trade-off-space"&gt;
&lt;h2&gt;Reproducible science, a rich trade-off space&lt;/h2&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Verbatim replication or reuse?&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Does Reinhart and Rogoff’s &lt;em&gt;“Growth in a Time of Debt”&lt;/em&gt; paper face the
same challenges as the manuscript under review by Titus? One is
describing mechanisms while the other is introducing a method. The code
of the former is probably much simpler than that of the latter. Different
publications come with different goals and code that is more or less easy
to share. For verbatim replication of the analysis of a paper, a simple
IPython notebook without tests or API is enough. To go beyond requires
applying the analysis to different problems or data: reuse. Reuse is
very difficult and cannot be a requirement for all publications.&lt;/p&gt;
&lt;!-- As someone who spends a lot of time on method development, I think a lot
in terms of code reuse. On the contrary, --&gt;
&lt;p&gt;Conventional wisdom in academia is that science builds upon ideas and
concepts rather than methods and code. Galileo is known for his
contribution to our understanding of the cosmos. Yet, methods
development underpins science. Galileo also built his own
telescopes, a huge technical achievement at the time. He needed to develop
them to back his cosmological theories. Today, Galileo’s measurements are
easy to reproduce because telescopes are readily-available as consumer
products.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;blockquote class="epigraph"&gt;
Standing on the shoulders of giants &amp;nbsp; &amp;nbsp; —  &amp;nbsp; &amp;nbsp;
&lt;em&gt;Isaac Newton, on software libraries&lt;/em&gt;&lt;/blockquote&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Related posts&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="../science/publishing_scientific_software_matters.html"&gt;Publishing scientific software matters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="views_on_scientific_computing.html"&gt;Personal views on scientific computing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!-- With great powers come great responsibility --&gt;
&lt;!-- Some publications, including computational ones, strive to contribute an idea. --&gt;
&lt;!-- The way I understand Titus's
phrase *"Please destroy this software after publication"* is that some
methods publication --&gt;
&lt;!-- Is the output of a paper the idea, or the code? It depends? (example of
the ICML) --&gt;
&lt;!-- Different code complexity, different trade-off (loops back to the point
above with Poldrack) --&gt;
&lt;!-- XXX: need to point to the donoho paper and cite it --&gt;
&lt;!-- Recommendations (in a separate blog post?):

* What the difficulties are (evolving APIs, plus configuration problems)
  (skip this point?)

* don't publish method work on non open data (very restrictive, I have
  been criticized for working on 'old', 'uninteresting' data). --&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;To make my point very clear, releasing buggy untested code is not
a good thing. However, it is not possible to ask for all research
papers to come with industrial-quality code. I am trying here to push
for a collective, reasoned undertaking of consolidation.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Theory tells us that there is there is no universal machine
learning algorithm. Given a specific machine-learning application, it
is always possible to devise a custom strategy that out-performs a
generic one. However, &lt;a class="reference external" href="http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf"&gt;do we need hundreds of classifiers to solve
real world classification problems?&lt;/a&gt;
Empirical results &lt;a class="reference external" href="http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf"&gt;[Delgado 2014]&lt;/a&gt; show
that most of the benefits can be achieved with a small number of
strategies. Is it desirable and sustainable to distribute and keep
alive the code of every machine learning paper?&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Empirical studies on the workload for programmers to achieve a
given task showed that 25 percent increase in problem complexity results in
a 100 percent increase in programming complexity: &lt;a class="reference external" href="http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F32%2F35909%2F01702600.pdf%3Farnumber%3D1702600&amp;amp;authDecision=-203"&gt;An Experiment on
Unit increase in Problem Complexity, Woodfield 1979&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p class="small"&gt;I need to thank my colleague &lt;a class="reference external" href="http://multiplecomparisons.blogspot.fr"&gt;Chris Filo Gorgolewski&lt;/a&gt; and my sister &lt;a class="reference external" href="http://cbio.ensmp.fr/~nvaroquaux/"&gt;Nelle
Varoquaux&lt;/a&gt; for their
feedback on this note.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="science"></category><category term="software"></category><category term="reproducible research"></category><category term="scientific software"></category></entry><entry><title>MLOSS: machine learning open source software workshop @ ICML 2015</title><link href="https://gael-varoquaux.info/programming/mloss-machine-learning-open-source-software-workshop-icml-2015.html" rel="alternate"></link><published>2015-04-23T00:00:00+02:00</published><updated>2015-04-23T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-04-23:/programming/mloss-machine-learning-open-source-software-workshop-icml-2015.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This year again we will have an exciting workshop on the
leading-edge machine-learning open-source software. This subject is
central to many, because software is how we propagate, reuse, and
apply progress in machine learning.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;strong&gt;Want to present a project? The deadline for the call for papers is
Apr 28th …&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This year again we will have an exciting workshop on the
leading-edge machine-learning open-source software. This subject is
central to many, because software is how we propagate, reuse, and
apply progress in machine learning.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;strong&gt;Want to present a project? The deadline for the call for papers is
Apr 28th, in a few days&lt;/strong&gt;
: &lt;a class="reference external" href="http://mloss.org/workshop/icml15/"&gt;http://mloss.org/workshop/icml15/&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The workshop will be held at the &lt;a class="reference external" href="http://icml.cc/2015/"&gt;ICML conference&lt;/a&gt;, in Lille, France, on July 10th. ICML
–International Conference in Machine Learning– is the leading venue for
academic research in machine learning. It’s a fantastic place to hold
such a workshop, as the actors of theoretical progress are all around.
Software is the bridge that brings this progress beyond papers.&lt;/p&gt;
&lt;p&gt;There is a &lt;a class="reference external" href="http://mloss.org/workshop/"&gt;long tradition&lt;/a&gt; of MLOSS
workshop, with one every year and a half. Last time, at NIPS 2013, I
could feel a bit of a turning point, as people started feeling that
different software slotted together, to create an efficient and
state-of-the art working environment. For this reason, we have entitled
this year’s workshop ‘open ecosystems’, stressing that contributions in
the scope of the workshop, that build a thriving work environment, are
not only machine learning software, but also better statistics or
numerical tools.&lt;/p&gt;
&lt;p&gt;We have two keynotes with important contributions to such ecosystems:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.johnmyleswhite.com/"&gt;John Myles White&lt;/a&gt; (Facebook), lead
developer of Julia statistics and machine learning: “Julia for machine
learning: high-level syntax with compiled-code speed”&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://matthewrocklin.com"&gt;Matthew Rocklin&lt;/a&gt; (Continuum Analytics),
developer of Python computational tools, in particular Blaze (confirmed):
“Blaze, a modern numerical engine with out-of-core and out-of-order
computations”.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There will also be a practical presentation, by yours truly, on how to set up an
open-source project, covering hosting, community development, quality
assurance, and license choice.&lt;/p&gt;
</content><category term="programming"></category><category term="conferences"></category><category term="machine learning"></category><category term="scientific computing"></category><category term="scipy"></category></entry><entry><title>Job offer: working on open source data processing in Python</title><link href="https://gael-varoquaux.info/programming/job-offer-working-on-open-source-data-processing-in-python.html" rel="alternate"></link><published>2015-04-02T00:00:00+02:00</published><updated>2015-04-02T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-04-02:/programming/job-offer-working-on-open-source-data-processing-in-python.html</id><summary type="html">&lt;p&gt;We, &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Parietal team&lt;/a&gt; at &lt;a class="reference external" href="http://www.inria.fr/"&gt;INRIA&lt;/a&gt;, are recruiting software developers to work on
open source machine learning and neuroimaging software in Python.&lt;/p&gt;
&lt;p&gt;In general, we are looking for people who:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;have a mathematical mindset,&lt;/li&gt;
&lt;li&gt;are curious about data (ie like looking at data and understanding it)&lt;/li&gt;
&lt;li&gt;have an affinity for problem-solving …&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;We, &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Parietal team&lt;/a&gt; at &lt;a class="reference external" href="http://www.inria.fr/"&gt;INRIA&lt;/a&gt;, are recruiting software developers to work on
open source machine learning and neuroimaging software in Python.&lt;/p&gt;
&lt;p&gt;In general, we are looking for people who:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;have a mathematical mindset,&lt;/li&gt;
&lt;li&gt;are curious about data (ie like looking at data and understanding it)&lt;/li&gt;
&lt;li&gt;have an affinity for problem-solving tradeoffs&lt;/li&gt;
&lt;li&gt;love high-quality code&lt;/li&gt;
&lt;li&gt;worry about users&lt;/li&gt;
&lt;li&gt;are good scientific Python coders,&lt;/li&gt;
&lt;li&gt;enjoy interacting with a community of developers&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;We welcome candidates who do not have all of these skills but are strongly
motivated to acquire them. Prior open-source experience is a big plus.&lt;/p&gt;
&lt;p&gt;One example of such a position, with application to neuroimaging, is:
&lt;a class="reference external" href="http://gael-varoquaux.info/programming/hiring-a-programmer-for-a-brain-imaging-machine-learning-library.html"&gt;http://gael-varoquaux.info/programming/hiring-a-programmer-for-a-brain-imaging-machine-learning-library.html&lt;/a&gt;,
which was opened a year ago and has now resulted in nilearn:
&lt;a class="reference external" href="http://nilearn.github.io/"&gt;http://nilearn.github.io/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Other positions may be more focused on general machine learning or
computing tools such as scikit-learn and joblib, which are reference
open-source libraries for data processing in Python.&lt;/p&gt;
&lt;p&gt;We are a tightly knit team, with strong programming, data
analysis and neuroimaging skills.&lt;/p&gt;
&lt;p&gt;Please contact me and Olivier Grisel if you are interested.&lt;/p&gt;
</content><category term="programming"></category><category term="jobs"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="python"></category></entry><entry><title>Euroscipy 2015: Call for paper</title><link href="https://gael-varoquaux.info/programming/euroscipy-2015-call-for-paper.html" rel="alternate"></link><published>2015-03-28T00:00:00+01:00</published><updated>2015-03-28T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-03-28:/programming/euroscipy-2015-call-for-paper.html</id><summary type="html">&lt;p&gt;EuroScipy 2015, the annual conference on Python in science will take
place in Cambridge, UK on 26-30 August 2015. The conference features two
days of tutorials followed by two days of scientific talks &amp;amp; posters and
an extra day dedicated to developer sprints. It is the major event in
Europe in …&lt;/p&gt;</summary><content type="html">&lt;p&gt;EuroScipy 2015, the annual conference on Python in science, will take
place in Cambridge, UK on 26-30 August 2015. The conference features two
days of tutorials followed by two days of scientific talks &amp;amp; posters and
an extra day dedicated to developer sprints. It is the major event in
Europe in the field of technical/scientific computing within the Python
ecosystem. Scientists, PhDs, students, data scientists, analysts, and
quants from more than 20 countries attended the conference last year.&lt;/p&gt;
&lt;p&gt;The topics presented at EuroSciPy are very diverse, with a focus on advanced
software engineering and original uses of Python and its scientific libraries,
either in theoretical or experimental research, from both academia and the
industry.&lt;/p&gt;
&lt;p&gt;Submissions for posters, talks &amp;amp; tutorials (beginner and advanced) are welcome
on our website at &lt;a class="reference external" href="http://www.euroscipy.org/2015/"&gt;http://www.euroscipy.org/2015/&lt;/a&gt;.
Sprint proposals should be addressed directly to the organisation at
&lt;em&gt;euroscipy-org&amp;#64;python.org&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important dates&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;em&gt;Apr 30, 2015&lt;/em&gt; Talk and tutorials submission deadline&lt;/li&gt;
&lt;li&gt;&lt;em&gt;May 1, 2015&lt;/em&gt; Registration opens&lt;/li&gt;
&lt;li&gt;&lt;em&gt;May 30, 2015&lt;/em&gt; Final program announced&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Jun 15, 2015&lt;/em&gt; Early-bird registration ends&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Aug 26-27, 2015&lt;/em&gt; Tutorials&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Aug 28-29, 2015&lt;/em&gt; Main conference&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Aug 30, 2015&lt;/em&gt; Sprints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We look forward to an exciting conference and hope to see you in Cambridge.&lt;/p&gt;
&lt;p&gt;The EuroSciPy 2015 Team - &lt;a class="reference external" href="http://www.euroscipy.org/2015/"&gt;http://www.euroscipy.org/2015/&lt;/a&gt;&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="conferences"></category></entry><entry><title>PRNI 2016: call for organization</title><link href="https://gael-varoquaux.info/programming/prni-2016-call-for-organization.html" rel="alternate"></link><published>2015-01-01T00:00:00+01:00</published><updated>2015-01-01T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-01-01:/programming/prni-2016-call-for-organization.html</id><summary type="html">&lt;p class="first last"&gt;The steering committee of PRNI (Pattern Recognition for NeuroImaging) is opening a call for bid to organize the conference in June 2016, in Europe&lt;/p&gt;
</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="http://www.prni.org"&gt;PRNI (Pattern Recognition for NeuroImaging)&lt;/a&gt; is
an IEEE conference about applying pattern recognition and machine
learning to brain imaging. It is a mid-sized conference (about 150
attendees), and is a satellite of OHBM (the annual “Human Brain Mapping”
meeting).&lt;/p&gt;
&lt;p&gt;The steering committee is calling for bids to organize the conference in
June 2016, in Europe, as a satellite of the OHBM meeting in Geneva.&lt;/p&gt;
</content><category term="programming"></category><category term="neuroimaging"></category><category term="conferences"></category><category term="science"></category><category term="machine learning"></category></entry><entry><title>New website</title><link href="https://gael-varoquaux.info/misc/new-website.html" rel="alternate"></link><published>2014-10-09T00:00:00+02:00</published><updated>2014-10-09T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-10-09:/misc/new-website.html</id><summary type="html">&lt;p&gt;I am moving my website to a new design, relying on &lt;a class="reference external" href="http://blog.getpelican.com/"&gt;Pelican&lt;/a&gt; and more modern CSS.&lt;/p&gt;
&lt;p&gt;So far, I had been using &lt;a class="reference external" href="http://www.voidspace.org.uk/python/rest2web/"&gt;rest2web&lt;/a&gt; to generate the static
part of the website, and a local install of wordpress for the blog. I
wasn’t doing a good job of keeping the wordpress install …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I am moving my website to a new design, relying on &lt;a class="reference external" href="http://blog.getpelican.com/"&gt;Pelican&lt;/a&gt; and more modern CSS.&lt;/p&gt;
&lt;p&gt;So far, I had been using &lt;a class="reference external" href="http://www.voidspace.org.uk/python/rest2web/"&gt;rest2web&lt;/a&gt; to generate the static
part of the website, and a local install of wordpress for the blog. I
wasn’t doing a good job of keeping the wordpress install up to date, and I
eventually got hacked. Needing a dynamic website hurt my desire for
simplicity. The combination of &lt;a class="reference external" href="http://blog.getpelican.com/"&gt;Pelican&lt;/a&gt; for my content and &lt;a class="reference external" href="https://disqus.com/"&gt;Disqus&lt;/a&gt; suits my needs very well, as it enables me to have
a simpler website while still having blog posts and discussions.&lt;/p&gt;
&lt;p&gt;I also took the opportunity to clean up the website, remove some old
content, and move my travel pictures to
&lt;a class="reference external" href="https://www.flickr.com/photos/gaelvaroquaux/"&gt;flickr&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="technical-choices"&gt;
&lt;h2&gt;Technical choices&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="http://blog.getpelican.com/"&gt;Pelican&lt;/a&gt; for the core engine. I like
the fact that it generates a static blog, that it uses restructured
text to store the content, and &lt;a class="reference external" href="http://jinja.pocoo.org"&gt;jinja&lt;/a&gt; as a
templating engine.&lt;/p&gt;
&lt;p&gt;One interesting aspect of redoing my website with a more modern content
management system was that I could lay out the information based on tags
and categories, rather than the old way of having a tree of nested
topics. This is much more flexible because one article is likely to
fall in many topics. Modern information organization is moving away
from the notion of a path used to access an entry, to the notion of a
set of properties (tags here).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="http://purecss.io"&gt;Pure CSS&lt;/a&gt; as a CSS base layer. I chose to use
Pure CSS rather than &lt;a class="reference external" href="http://getbootstrap.com/"&gt;Bootstrap&lt;/a&gt; as it is a
pure CSS framework (no javascript) and it is much lighter. I find that
Bootstrap websites can easily slow down browsing (due to download size
and javascript). Bootstrap also does not play very well with html documents
in which one doesn’t control the class tags, such as those generated from
restructured text. But that’s true of most web front-end frameworks.
Another option was &lt;a class="reference external" href="http://foundation.zurb.com/"&gt;Foundation&lt;/a&gt;. I
didn’t explore it in detail, but it looked like an interesting
tradeoff between Pure, which is very bare bones, and Bootstrap, the
heavy lifter. I chose to go for the most lightweight option, because I
had simple needs.&lt;/p&gt;
&lt;p&gt;A result of using more modern CSS is that the website should look good
on any screen size, from very large screens to mobile phones.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="web"></category></entry><entry><title>Improving your programming style in Python</title><link href="https://gael-varoquaux.info/programming/improving-your-programming-style-in-python.html" rel="alternate"></link><published>2014-09-29T00:00:00+02:00</published><updated>2014-09-29T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-09-29:/programming/improving-your-programming-style-in-python.html</id><summary type="html">&lt;p class="first last"&gt;Some references on software development techniques and patterns to help write better code.&lt;/p&gt;
</summary><content type="html">&lt;p&gt;Here are some references on software development techniques and patterns
to help write better code. They are intended for the casual programmer,
and certainly not for the advanced developer.&lt;/p&gt;
&lt;p&gt;They are listed in order of difficulty.&lt;/p&gt;
&lt;div class="section" id="software-carpentry"&gt;
&lt;h2&gt;Software carpentry&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://swc.scipy.org"&gt;http://swc.scipy.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These are the original notes from Greg Wilson’s course on software
engineering at the University of Toronto. This course is specifically
intended for scientists rather than computer-science students. It is very
basic and does not cover design issues.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="a-tutorial-introduction-to-python"&gt;
&lt;h2&gt;A tutorial introduction to Python&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.informit.com/articles/article.asp?p=23100&amp;amp;seqNum=3&amp;amp;rl=1"&gt;http://www.informit.com/articles/article.asp?p=23100&amp;amp;seqNum=3&amp;amp;rl=1&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This tutorial is easier to follow than &lt;a class="reference external" href="http://www.python.org/doc/"&gt;Guido’s tutorial&lt;/a&gt;, though it does not go into as much depth.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="python-essential-reference"&gt;
&lt;h2&gt;Python Essential Reference&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.informit.com/articles/article.asp?p=453682&amp;amp;rl=1"&gt;http://www.informit.com/articles/article.asp?p=453682&amp;amp;rl=1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.informit.com/articles/article.asp?p=459269&amp;amp;rl=1"&gt;http://www.informit.com/articles/article.asp?p=459269&amp;amp;rl=1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;These are two chapters out of David Beazley’s excellent book &lt;a class="reference external" href="http://www.amazon.com/Python-Essential-Reference-David-Beazley/dp/0735710910"&gt;Python
Essential Reference&lt;/a&gt;.
They allow one to understand more deeply how Python works. I strongly recommend
this book to anybody serious about python.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="an-introduction-to-regular-expressions"&gt;
&lt;h2&gt;An Introduction to Regular Expressions&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.informit.com/articles/article.asp?p=20454&amp;amp;rl=1"&gt;http://www.informit.com/articles/article.asp?p=20454&amp;amp;rl=1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you are going to do any sort of text manipulation, you absolutely need
to know how to use regular expressions: powerful search and replace patterns.&lt;/p&gt;
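As a minimal, hypothetical illustration of such a search-and-replace pattern (my own example, not taken from the article):

```python
import re

# Capture ISO dates like "2015-04-30" and rewrite them as "30/04/2015"
text = "Talk submissions close on 2015-04-30."
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
# reformatted == "Talk submissions close on 30/04/2015."
```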
&lt;/div&gt;
&lt;div class="section" id="software-design-for-maintainability"&gt;
&lt;h2&gt;Software design for maintainability&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="./software-design-for-maintainability.html"&gt;My own post&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A case of shameless plug: this is a post that I wrote a few years ago. I
think that it is still relevant.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="writing-a-graphical-application-for-scientific-programming-using-traitsui"&gt;
&lt;h2&gt;Writing a graphical application for scientific programming using TraitsUI&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://gael-varoquaux.info/computers/traits_tutorial/index.html"&gt;http://gael-varoquaux.info/computers/traits_tutorial/index.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Building interactive graphical applications is a difficult problem. I have
found that the traitsUI module provides a great answer to this problem.
This is a tutorial intended for the non-programmer.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="an-introduction-to-python-iterators"&gt;
&lt;h2&gt;An introduction to Python iterators&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.informit.com/articles/article.asp?p=26895&amp;amp;rl=1"&gt;http://www.informit.com/articles/article.asp?p=26895&amp;amp;rl=1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This article may not be terribly easy to follow, but iterators are a
great feature of Python, so this is definitely worth reading.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="functional-programming"&gt;
&lt;h2&gt;Functional programming&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.ibm.com/developerworks/linux/library/l-prog.html?open&amp;amp;l=766,t=gr,p=PrmgPyth"&gt;http://www.ibm.com/developerworks/linux/library/l-prog.html?open&amp;amp;l=766,t=gr,p=PrmgPyth&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Functional programming is a programming style where mathematical
functions are successively applied to immutable objects to go from the
inputs of the program to its outputs in a succession of transformations.
It is appreciated by some because such programs are easy to analyze and
prove correct. In certain cases it can be very readable.&lt;/p&gt;
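A minimal sketch of this style (an illustrative example of mine, not from the referenced article): pure functions chained over immutable tuples, with no in-place mutation.

```python
# Each function returns a new immutable tuple instead of mutating its input
def normalize(words):
    return tuple(w.lower() for w in words)

def keep_long(words, min_len=4):
    return tuple(w for w in words if len(w) >= min_len)

# The program is a succession of transformations from input to output
result = keep_long(normalize(("Python", "is", "Readable")))
# result == ("python", "readable")
```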
&lt;/div&gt;
&lt;div class="section" id="patterns-in-python"&gt;
&lt;h2&gt;Patterns in Python&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.suttoncourtenay.org.uk/duncan/accu/pythonpatterns.html"&gt;http://www.suttoncourtenay.org.uk/duncan/accu/pythonpatterns.html&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This document exposes a few design patterns in Python. Design patterns
are solutions to recurring development problems using object oriented
programming. I suggest this reading only if you are familiar with OOP.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="idiomatic-python"&gt;
&lt;h2&gt;Idiomatic Python&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;Jeff Knupp’s post, a summary of his book:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.jeffknupp.com/blog/2012/10/04/writing-idiomatic-python/"&gt;http://www.jeffknupp.com/blog/2012/10/04/writing-idiomatic-python/&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The &lt;a class="reference external" href="https://scipy-lectures.github.io"&gt;scipy-lectures&lt;/a&gt; chapter on
advanced Python:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://scipy-lectures.github.io/advanced/advanced_python/index.html"&gt;https://scipy-lectures.github.io/advanced/advanced_python/index.html&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="general-object-oriented-programming-advice"&gt;
&lt;h2&gt;General Object-Oriented programming advice&lt;/h2&gt;
&lt;p&gt;Designing Object-oriented code actually requires some care: when you are
building your set of abstractions, you are designing the world in which
you are going to be condemned to live (or actually, to code). I would
advise people to keep things as simple as possible, and follow the SOLID
principles:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://mmiika.wordpress.com/oo-design-principles/"&gt;http://mmiika.wordpress.com/oo-design-principles/&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="using-decorators-to-do-meta-programming-in-python"&gt;
&lt;h2&gt;Using decorators to do meta-programming in Python&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www-128.ibm.com/developerworks/linux/library/l-cpdecor.html"&gt;http://www-128.ibm.com/developerworks/linux/library/l-cpdecor.html&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A very beautiful article for the advanced python user. Meta-programming
is a programming technique that involves changing the program at
run time. This makes it possible to add new abstractions to the code the
programmer writes, thus creating a “meta-language”. This article shows
this very well.&lt;/p&gt;
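As a small hand-written sketch of the idea (the `logged` decorator below is my own illustration, not from the article): a decorator rewrites a function at definition time, adding behaviour the original code never mentions.

```python
import functools

def logged(func):
    """Wrap func so that every call is counted: a tiny piece of meta-programming."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@logged
def square(x):
    return x * x

square(3)
square(4)
# square.calls == 2, and square(3) still returns 9
```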
&lt;/div&gt;
&lt;div class="section" id="a-primer-on-python-metaclass-programming"&gt;
&lt;h2&gt;A Primer on Python Metaclass Programming&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.onlamp.com/lpt/a/3388"&gt;http://www.onlamp.com/lpt/a/3388&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Metaclasses make it possible to define new styles of objects, which can
have different calling, creation, or inheritance rules. This is way over my head, but I
am referencing it here for the record.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="iterators-in-python"&gt;
&lt;h2&gt;Iterators in Python&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://docs.python.org/2/library/itertools.html#recipes"&gt;https://docs.python.org/2/library/itertools.html#recipes&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Learn to use the itertools (but don’t abuse them)!&lt;/p&gt;
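For instance, the classic `pairwise` recipe from the itertools documentation turns a sequence into its consecutive pairs:

```python
import itertools

def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

pairs = list(pairwise([1, 2, 3, 4]))
# pairs == [(1, 2), (2, 3), (3, 4)]
```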
&lt;p&gt;Related to the producer/consumer problem with iterators, see:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.oluyede.org/blog/2007/04/09/producerconsumer-in-python/"&gt;http://www.oluyede.org/blog/2007/04/09/producerconsumer-in-python/&lt;/a&gt;&lt;/p&gt;
&lt;!-- vim:spell:spelllang=en_us ft=rst --&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="software engineering"></category><category term="selected"></category></entry><entry><title>Hiring an engineer to mine large functional-connectivity databases</title><link href="https://gael-varoquaux.info/programming/hiring-an-engineer-to-mine-large-functional-connectivity-databases.html" rel="alternate"></link><published>2014-09-20T00:00:00+02:00</published><updated>2014-09-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-09-20:/programming/hiring-an-engineer-to-mine-large-functional-connectivity-databases.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;Work with us to leverage leading-edge machine learning for
neuroimaging&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At &lt;a class="reference external" href="https://team.inria.fr/parietal"&gt;Parietal&lt;/a&gt;, my research team,
we work on improving the way brain images are analyzed, for medical
diagnostic purposes, or to understand the brain better. We develop
new machine-learning tools and investigate new methodologies for
quantifying brain function from …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Work with us to leverage leading-edge machine learning for
neuroimaging&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At &lt;a class="reference external" href="https://team.inria.fr/parietal"&gt;Parietal&lt;/a&gt;, my research team,
we work on improving the way brain images are analyzed, for medical
diagnostic purposes, or to understand the brain better. We develop
new machine-learning tools and investigate new methodologies for
quantifying brain function from MRI scans.&lt;/p&gt;
&lt;p&gt;One of our important avenues of contribution is in deciphering “functional
connectivity”: analyzing the correlation of brain activity to measure
interactions across the brain. This direction of research is exciting
because it can be used to probe the neural support of &lt;em&gt;functional&lt;/em&gt;
deficits in incapacitated patients, and thus lead to new biomarkers of
functional pathologies, such as autism. Indeed, functional connectivity
can be computed without resorting to complicated cognitive tasks, unlike
most functional imaging approaches. The flip side is that exploiting such
“resting-state” signal requires advanced multivariate statistics tools,
something at which the Parietal team excels.&lt;/p&gt;
&lt;p&gt;For such multivariate processing of brain imaging data, Parietal has an
ecosystem of &lt;a class="reference external" href="https://team.inria.fr/parietal/software"&gt;leading-edge high-quality tools&lt;/a&gt;. In particular we have built
the foundations of the most successful Python machine learning library,
&lt;a class="reference external" href="http://scikit-learn"&gt;scikit-learn&lt;/a&gt;, and we are growing a dedicate
software, &lt;a class="reference external" href="http://nilearn.github.io/"&gt;nilearn&lt;/a&gt;, that leverages
machine-learning for neuroimaging. To support this ecosystem, we have
dedicated top-notch programmers, led by the well-known
&lt;a class="reference external" href="http://ogrisel.com/"&gt;Olivier Grisel&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We are looking for a data-processing engineer to join our team and work
on &lt;strong&gt;applying our tools on very large neuroimaging databases to
learn specific biomarkers of pathologies&lt;/strong&gt;. For this, the work will be
shared with the &lt;a class="reference external" href="http://www.cati-neuroimaging.com/"&gt;CATI&lt;/a&gt;, the French
platform for multicentric neuroimaging studies, located in the same
building as us. The general context of the job is the &lt;a class="reference external" href="https://team.inria.fr/parietal/research/spatial_patterns/niconnect/"&gt;NiConnect&lt;/a&gt;
project, a multi-organisational research project that I lead and
that focuses on improving diagnostic tools on resting-state functional
connectivity. We have access to unique algorithms and datasets, before
they are published. What we are now missing is the link between those
two, and that link could be you.&lt;/p&gt;
&lt;p&gt;If you want more details, they can be found on the &lt;a class="reference external" href="https://team.inria.fr/parietal/job-offers"&gt;job offer&lt;/a&gt;. This post is to motivate
the job in a personal way that I cannot give in an official posting.&lt;/p&gt;
&lt;div class="section" id="why-take-this-job"&gt;
&lt;h2&gt;Why take this job?&lt;/h2&gt;
&lt;p&gt;I don’t expect someone to take this job only because it pays the bills. To be
clear, the kind of person I am looking for has no difficulty finding a
job elsewhere. So, if you are that person, why would you take the job?&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;To join &lt;a class="reference external" href="https://team.inria.fr/parietal/team-members/"&gt;a great team&lt;/a&gt;
with many experts, focused on finding elegant solutions to hard
problems at the intersection of machine learning, cognitive science,
and software. Choose to work with great people, knowledgeable,
passionate, and &lt;a class="reference external" href="https://team.inria.fr/parietal/inria-winter-party-2014/"&gt;fun&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;To work on interesting problems, that matter. They are interesting
because they are challenging but we have the skills to solve them. They
matter because they can make brain research better.&lt;/li&gt;
&lt;li&gt;To learn. NeuroImaging + Machine learning is a quickly growing topic.
If you come from a NeuroImaging background and want to add to your CV
an actual expertise in machine learning for NeuroImaging, this is the
place to be.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="what-would-make-me-excited-in-a-resume"&gt;
&lt;h2&gt;What would make me excited in a resume?&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;A genuine experience in neuroimaging data processing, especially large
databases.&lt;/li&gt;
&lt;li&gt;Talent with computers and ideally some Python experience.&lt;/li&gt;
&lt;li&gt;The unlikely combination of research training (graduate or
undergraduate) and experience in a non-academic setting.&lt;/li&gt;
&lt;li&gt;A problem-solving mindset.&lt;/li&gt;
&lt;li&gt;A good ability to write about neuroimaging and data processing in
English: who knows, if everything goes to plan, you could very well be
publishing about new biomarkers.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Now if you are interested and feel up for the challenge, read the real
&lt;a class="reference external" href="https://team.inria.fr/parietal/job-offers"&gt;job offer&lt;/a&gt;, and send me
your resume.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="jobs"></category><category term="neuroimaging"></category><category term="python"></category></entry><entry><title>Scikit-learn 2014 sprint: a report</title><link href="https://gael-varoquaux.info/programming/scikit-learn-2014-sprint-a-report.html" rel="alternate"></link><published>2014-07-25T00:00:00+02:00</published><updated>2014-07-25T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-07-25:/programming/scikit-learn-2014-sprint-a-report.html</id><summary type="html">&lt;p&gt;A week ago, the 2014 edition of the
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; sprint was held in Paris.
This was the third time that we held an internation sprint and it was
hugely productive, and great fun, as always.&lt;/p&gt;
&lt;div class="section" id="great-people-and-great-venues"&gt;
&lt;h2&gt;Great people and great venues&lt;/h2&gt;
&lt;img alt="" class="align-center" src="https://pbs.twimg.com/media/BsqD4BeCQAEnT6w.jpg" style="width: 65%;" /&gt;
&lt;p&gt;We had a mix of core contributors and newcomers, which …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;A week ago, the 2014 edition of the
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; sprint was held in Paris.
This was the third time that we held an international sprint and it was
hugely productive, and great fun, as always.&lt;/p&gt;
&lt;div class="section" id="great-people-and-great-venues"&gt;
&lt;h2&gt;Great people and great venues&lt;/h2&gt;
&lt;img alt="" class="align-center" src="https://pbs.twimg.com/media/BsqD4BeCQAEnT6w.jpg" style="width: 65%;" /&gt;
&lt;p&gt;We had a mix of core contributors and newcomers, which is a great
combination, as it enables us to be productive, but also to foster the
new generation of core developers. Present were:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Laurent Direr&lt;/li&gt;
&lt;li&gt;Michael Eickenberg&lt;/li&gt;
&lt;li&gt;Loic Esteve&lt;/li&gt;
&lt;li&gt;Alexandre Gramfort&lt;/li&gt;
&lt;li&gt;Olivier Grisel&lt;/li&gt;
&lt;li&gt;Arnaud Joly&lt;/li&gt;
&lt;li&gt;Kyle Kastner&lt;/li&gt;
&lt;li&gt;Manoj Kumar&lt;/li&gt;
&lt;li&gt;Balazs Kegl&lt;/li&gt;
&lt;li&gt;Nicolas Le Roux&lt;/li&gt;
&lt;li&gt;Andreas Mueller&lt;/li&gt;
&lt;li&gt;Vlad Niculae&lt;/li&gt;
&lt;li&gt;Fabian Pedregosa&lt;/li&gt;
&lt;li&gt;Amir Sani&lt;/li&gt;
&lt;li&gt;Danny Sullivan&lt;/li&gt;
&lt;li&gt;Gabriel Synnaeve&lt;/li&gt;
&lt;li&gt;Roland Thiolliere&lt;/li&gt;
&lt;li&gt;Gael Varoquaux&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" class="align-center" src="https://pbs.twimg.com/media/BsqRedvCEAE5Opw.jpg" style="width: 65%;" /&gt;
&lt;p&gt;As the sprint extended through a French bank holiday and the weekend,
we were hosted in a variety of venues:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://lapaillasse.org"&gt;La paillasse&lt;/a&gt;, a Paris bio-hacker space&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.inria.fr"&gt;INRIA&lt;/a&gt;, the French computer-science national
research, and the place where I work :)&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.criteo.com"&gt;Criteo&lt;/a&gt;, a French company doing word-wide
add-banner placement. The venue there was absolutely gorgeous, with a
beautiful terrace on the roofs of Paris. And they even had a social
event with free drinks one evening.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.tinyclues.com"&gt;Tinyclues&lt;/a&gt;, a French startup mining
e-commerce data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I must say that we were treated like kings during the whole stay; each
host welcoming us as well as they could. Thank you to all of our hosts!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sponsored-by-the-digicosm-labex"&gt;
&lt;h2&gt;Sponsored by the Digicosme Labex&lt;/h2&gt;
&lt;p&gt;Beyond our hosts, we need to thank the &lt;a class="reference external" href="https://digicosme.lri.fr/tiki-index.php"&gt;Digicosme Labex&lt;/a&gt;.
Digicosme gave us funding that covered some of the lunches, accommodation,
and travel expenses to bring in our contributors from abroad.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="achievements-during-the-sprint"&gt;
&lt;h2&gt;Achievements during the sprint&lt;/h2&gt;
&lt;p&gt;The first day of the sprint was dedicated to polishing the &lt;a class="reference external" href="http://www.scikit-learn.org/stable/whats_new.html"&gt;0.15
release&lt;/a&gt;, which
was finally released on the morning of the second day, after 10 months
of development.&lt;/p&gt;
&lt;p&gt;A large part of the effort of the sprint was dedicated to improving
the code base, rather than directly adding new features. Some files
were reorganized. The input validation code was cleaned up (opening the
way for better support of pandas structures in scikit-learn). We hunted
dead code, deprecation warnings, numerical instabilities and tests
randomly failing. We made the test suite faster, and refactored our
common tests that scan all the models.&lt;/p&gt;
&lt;p&gt;Some work of our GSOC student, Manoj Kumar, was merged, making some
linear models faster.&lt;/p&gt;
&lt;p&gt;Our &lt;a class="reference external" href="http:/scikit-learn.org/dev"&gt;online documentation&lt;/a&gt; was improve
with the &lt;a class="reference external" href="http://scikit-learn.org/stable/modules/classes.html"&gt;API
documentation&lt;/a&gt;
pointing to examples and source code.&lt;/p&gt;
&lt;p&gt;Still work in progress:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Faster stochastic gradient descent (with AdaGrad, ASGD, and one day
SAG)&lt;/li&gt;
&lt;li&gt;Calibration of probabilities for models that do not have a
‘predict_proba’ method&lt;/li&gt;
&lt;li&gt;Warm restart in random forests to add more estimators to an existing
ensemble.&lt;/li&gt;
&lt;li&gt;Infomax ICA algorithm.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="sprint"></category><category term="scikit-learn"></category><category term="python"></category><category term="machine learning"></category></entry><entry><title>Scikit-learn 0.15 release: highlights</title><link href="https://gael-varoquaux.info/programming/scikit-learn-015-release-highlights.html" rel="alternate"></link><published>2014-07-15T00:00:00+02:00</published><updated>2014-07-15T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-07-15:/programming/scikit-learn-015-release-highlights.html</id><summary type="html">&lt;p&gt;We have just released the 0.15 version of scikit-learn. Hurray!! Thanks
to all
&lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html#people"&gt;involved&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="a-long-development-stretch"&gt;
&lt;h2&gt;A long development stretch&lt;/h2&gt;
&lt;p&gt;It’s been a while since the &lt;a class="reference external" href="http://gael-varoquaux.info/programming/scikit-learn-014-release-features-and-benchmarks.html"&gt;last release of
scikit-learn&lt;/a&gt;. So a lot has
happened. Exactly 2611 commits, according to my count. Quite clearly, we
have more and more existing code …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;We have just released the 0.15 version of scikit-learn. Hurray!! Thanks
to all
&lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html#people"&gt;involved&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="a-long-development-stretch"&gt;
&lt;h2&gt;A long development stretch&lt;/h2&gt;
&lt;p&gt;It’s been a while since the &lt;a class="reference external" href="http://gael-varoquaux.info/programming/scikit-learn-014-release-features-and-benchmarks.html"&gt;last release of
scikit-learn&lt;/a&gt;. So a lot has
happened. Exactly 2611 commits, according to my count. Quite clearly, we
have more and more existing code, more and more features to support.
This means that when we modify an algorithm, for instance to make it
faster, something else might break due to numerical instability, or
exploring some obscure option. The good news is that we have tight
continuous integration, mostly thanks to
&lt;a class="reference external" href="https://travis-ci.org/scikit-learn/scikit-learn"&gt;travis&lt;/a&gt; (but
Windows continuous integration is on its way), and we keep growing our
test suite. Thus while it is getting harder and harder to change
something in scikit-learn, scikit-learn is also becoming more and more
robust.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="highlights"&gt;
&lt;h2&gt;Highlights&lt;/h2&gt;
&lt;a class="reference external image-reference" href="https://twitter.com/t3kcit/status/434378452901187584"&gt;&lt;img alt="" src="https://pbs.twimg.com/media/Bgc45seCUAAbze1.png" /&gt;&lt;/a&gt;
&lt;p&gt;&lt;strong&gt;Quality&lt;/strong&gt; — Looking at the commit log, there has been a huge amount of
work to &lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html#id7"&gt;fix minor annoying
issues&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt; — A huge effort has been put into making many parts of
scikit-learn faster, with little details improved all over the codebase. We do hope
that you’ll find that your applications run faster. For instance, we
find that the worst case speed of Ward clustering is 1.5 times faster in
0.15 than 0.14. K-means clustering is often 1.1 times faster. KNN, when
used in brute-force mode, got faster by a factor of 2 or 3.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Random Forest and various tree methods&lt;/strong&gt; — The random forest and
various tree methods are much much faster, use parallel computing much
better, and use less memory. For instance, the picture on the right
shows the scikit-learn random forest running in parallel on a fat Amazon
node, and nicely using all the CPUs with little RAM usage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hierarchical agglomerative clustering&lt;/strong&gt; — &lt;a class="reference external" href="http://scikit-learn.org/dev/modules/clustering.html#different-linkage-type-ward-complete-and-average-linkage"&gt;Complete linkage and average
linkage clustering have been
added&lt;/a&gt;.
The benefit of these approaches compared to the existing Ward clustering
is that they can take &lt;a class="reference external" href="http://scikit-learn.org/stable/modules/clustering.html#varying-the-metric"&gt;an arbitrary distance
matrix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Robust linear models&lt;/strong&gt; — Scikit-learn now includes
&lt;a class="reference external" href="http://scikit-learn.org/0.15/modules/linear_model.html#robustness-to-outliers-ransac"&gt;RANSAC&lt;/a&gt;
for robust linear regression.&lt;/p&gt;
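For illustration, here is a small sketch (not from the post) of RANSAC fitting a line through data polluted by gross outliers; by default `RANSACRegressor` wraps an ordinary linear regression:

```python
# Sketch: robust line fitting with RANSAC. The injected outliers
# should be flagged as such and the true slope of 2 recovered.
import numpy as np
from sklearn.linear_model import RANSACRegressor

X = np.arange(50, dtype=float)[:, np.newaxis]
y = 2.0 * X.ravel() + 1.0
y[::10] += 100.0  # corrupt every tenth point with a gross outlier

ransac = RANSACRegressor(random_state=0)
ransac.fit(X, y)
slope = ransac.estimator_.coef_[0]   # close to 2 despite the outliers
n_inliers = ransac.inlier_mask_.sum()  # the 45 uncorrupted points
```

An ordinary least-squares fit on the same data would be pulled toward the outliers; RANSAC fits repeatedly on random minimal subsets and keeps the model with the largest consensus set.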
&lt;p&gt;&lt;strong&gt;HMMs are deprecated&lt;/strong&gt; — We have long been discussing removing
HMMs, which do not fit scikit-learn’s focus on predictive
modeling. We have created a separate
&lt;a class="reference external" href="https://github.com/hmmlearn/hmmlearn"&gt;hmmlearn&lt;/a&gt; repository for the
HMM code. It is looking for maintainers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;And much more&lt;/strong&gt; — plenty of &lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html"&gt;“minor
things”&lt;/a&gt;, such as
better support for sparse data, better support for multi-label data…&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="machine learning"></category><category term="python"></category></entry><entry><title>Google summer of code projects for scikit-learn</title><link href="https://gael-varoquaux.info/programming/google-summer-of-code-projects-for-scikit-learn.html" rel="alternate"></link><published>2014-04-23T00:00:00+02:00</published><updated>2014-04-23T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-04-23:/programming/google-summer-of-code-projects-for-scikit-learn.html</id><summary type="html">&lt;p&gt;I’d like to welcome the four students that were accepted for the GSoC
this year:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Issam: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/issamou/5733935958982656"&gt;Extending Neural networks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hamzeh: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/hamsal/5709068098338816"&gt;Sparse Support for Ensemble Methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Manoj: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/manojkumar/5673522948997120"&gt;Making Linear models faster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Maheshakya: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/maheshakya/5754903989321728"&gt;Locality Sensitive Hashing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Welcome to all of you. Your submissions were excellent, and you
demonstrated a real willingness …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I’d like to welcome the four students that were accepted for the GSoC
this year:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Issam: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/issamou/5733935958982656"&gt;Extending Neural networks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hamzeh: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/hamsal/5709068098338816"&gt;Sparse Support for Ensemble Methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Manoj: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/manojkumar/5673522948997120"&gt;Making Linear models faster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Maheshakya: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/maheshakya/5754903989321728"&gt;Locality Sensitive Hashing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Welcome to all of you. Your submissions were excellent, and you
demonstrated a real willingness to integrate into the project, with its social and
coding dynamics. It is a privilege to work with you.&lt;/p&gt;
&lt;p&gt;I’d also like to thank all the mentors, Alex, Arnaud, Daniel, James,
Jaidev, Olivier, Robert and Vlad. Mentoring is a lot of work, and
mentors not only make it possible for great code to enter
scikit-learn, but also shape a future generation of scikit-learn
contributors.&lt;/p&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="machine learning"></category></entry><entry><title>Hiring a programmer for a brain imaging machine-learning library</title><link href="https://gael-varoquaux.info/programming/hiring-a-programmer-for-a-brain-imaging-machine-learning-library.html" rel="alternate"></link><published>2014-02-12T00:00:00+01:00</published><updated>2014-02-12T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-02-12:/programming/hiring-a-programmer-for-a-brain-imaging-machine-learning-library.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;Work with us on putting machine learning in the hands of cognitive
scientists&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Parietal is a research team that creates advanced data analysis to mine
functional brain images and solve medical and cognitive science problems.
Our day to day work is to write machine-learning and statistics code to
understand and …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Work with us on putting machine learning in the hands of cognitive
scientists&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Parietal is a research team that creates advanced data-analysis methods to mine
functional brain images and solve medical and cognitive science problems.
Our day-to-day work is to write machine-learning and statistics code to
better understand and use images of brain function (most often fMRI). Our
purpose is to be useful to the NeuroImaging community, mostly medical and
cognitive science researchers, to understand brain function better. What
is limiting us in this respect is that to reach end users we need to turn
our algorithms into usable software.&lt;/p&gt;
&lt;p&gt;This is why Parietal has a long tradition of investing in building an
ecosystem of &lt;a class="reference external" href="https://team.inria.fr/parietal/software"&gt;high-quality libraries and tools&lt;/a&gt;: we build, layer by layer, an
environment in which we can do our research, and with which we hope to
one day reach the user. We choose Python, as a high-level general purpose
language with which we can do scientific computing, and, one day, GUIs,
or web servers. We contribute to the scipy ecosystem; we have built the
foundations of the most successful Python machine learning library,
&lt;a class="reference external" href="http://scikit-learn"&gt;scikit-learn&lt;/a&gt;. We are invested in the
&lt;a class="reference external" href="http://nipy.org"&gt;neuroimaging in Python ecosystem&lt;/a&gt;. Our students, our
team members, send patches to scientific Python projects, teach courses
on how to use them, speak at conferences.&lt;/p&gt;
&lt;p&gt;But to go all the way, we need support from people who do software as
their sole goal. To put the finishing touches on the quality of our
end-user libraries, we need full-time programmers. In an academic
setting, they can be hard to justify, but we have always had dedicated
top-notch engineers at Parietal, our latest hire being the well-known
&lt;a class="reference external" href="http://ogrisel.com/"&gt;Olivier Grisel&lt;/a&gt;. This is where &lt;strong&gt;you&lt;/strong&gt; can come
in.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://team.inria.fr/parietal/research/spatial_patterns/niconnect/"&gt;NiConnect&lt;/a&gt;
project is a specific research effort in which we are developing leading
algorithmic tools. For this project, we have funding for a full-time
programmer: someone who will help us turn our understanding of how to
process brain images into a software tool that a cognitive science
researcher can use. We have started work on such software, in the
&lt;a class="reference external" href="http://nilearn.github.io/"&gt;nilearn&lt;/a&gt; project. What we need is someone
who drives the project and makes sure that the pieces fit together
well; someone who ensures that the code solving the user’s problem is not
our research code, but a clean and lean library, just as scikit-learn is an elegant
answer to day-to-day machine learning tasks.&lt;/p&gt;
&lt;p&gt;If you want more details, they can be found on the &lt;a class="reference external" href="https://team.inria.fr/parietal/job-offers"&gt;job offer&lt;/a&gt;. This post is to motivate
the job in a personal way that I cannot in an official posting.&lt;/p&gt;
&lt;div class="section" id="why-take-this-job"&gt;
&lt;h2&gt;Why take this job?&lt;/h2&gt;
&lt;p&gt;I don’t expect someone to take this job only because it pays the bills. To be
clear, the kind of person I am looking for has no difficulty finding a
well-paid job elsewhere. So, if you are that person, why would you take
this job?&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;To join &lt;a class="reference external" href="https://team.inria.fr/parietal/team-members/"&gt;a great team&lt;/a&gt;
that is focused on finding elegant solutions to hard problems at the
intersection of machine learning, cognitive science, and software.
Choose to work with great people, knowledgeable, passionate, and &lt;a class="reference external" href="https://team.inria.fr/parietal/inria-winter-party-2014/"&gt;fun&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;To work on interesting problems, that matter. They are interesting
because they are challenging but we have the skills to solve them. They
matter because these skills need to be used to make brain research
better.&lt;/li&gt;
&lt;li&gt;To have a boss (&lt;a class="reference external" href="https://github.com/GaelVaroquaux"&gt;me&lt;/a&gt;) that
actually codes and gives you feedback on your code.&lt;/li&gt;
&lt;li&gt;To learn. Data science + Python is &lt;em&gt;the&lt;/em&gt; combination of skills to have.
We have at Parietal unique expertise in these. Add to it a fine
understanding of algorithms, high-performance computing, statistics,
and software quality, and you have the perfect lines for a CV.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="what-would-make-me-excited-in-a-resume"&gt;
&lt;h2&gt;What would make me excited in a resume?&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Open source contributions (there is no better coding CV than a github
account).&lt;/li&gt;
&lt;li&gt;Experience in agile-like situations&lt;/li&gt;
&lt;li&gt;A passion for code quality&lt;/li&gt;
&lt;li&gt;Good Python experience&lt;/li&gt;
&lt;li&gt;The unlikely combination of research-like training (e.g. undergraduate)
and experience in a non-academic, non-scientific setting (say, web
development).&lt;/li&gt;
&lt;li&gt;To know that you care about user experience, about understanding and
solving the user’s problems.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Now if you are interested and feel up for the challenge, read the real
&lt;a class="reference external" href="https://team.inria.fr/parietal/job-offers"&gt;job offer&lt;/a&gt;, and send me
your resume.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="jobs"></category><category term="neuroimaging"></category><category term="python"></category></entry><entry><title>Publishing scientific software matters</title><link href="https://gael-varoquaux.info/science/publishing-scientific-software-matters.html" rel="alternate"></link><published>2013-09-19T00:00:00+02:00</published><updated>2013-09-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2013-09-19:/science/publishing-scientific-software-matters.html</id><summary type="html">&lt;p class="light"&gt;Christophe Pradal, Hans Peter Langtangen, and I recently edited
&lt;a class="reference external" href="http://www.sciencedirect.com/science/journal/18777503/4/5"&gt;a version&lt;/a&gt; of the
Journal of Computational Science on scientific software, in
particular those written in Python. We wrote &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1877750313000938"&gt;an editorial&lt;/a&gt;
defending writing and publishing open source scientific software that
I wish to summarize here. The &lt;a class="reference external" href="http://hal.inria.fr/hal-00858663/en"&gt;full text preprint&lt;/a&gt; is openly …&lt;/p&gt;</summary><content type="html">&lt;p class="light"&gt;Christophe Pradal, Hans Peter Langtangen, and I recently edited
&lt;a class="reference external" href="http://www.sciencedirect.com/science/journal/18777503/4/5"&gt;a version&lt;/a&gt; of the
Journal of Computational Science on scientific software, in
particular those written in Python. We wrote &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1877750313000938"&gt;an editorial&lt;/a&gt;
defending writing and publishing open source scientific software that
I wish to summarize here. The &lt;a class="reference external" href="http://hal.inria.fr/hal-00858663/en"&gt;full text preprint&lt;/a&gt; is openly available in &lt;a class="reference external" href="http://gael-varoquaux.info/publications.html"&gt;my
publications list&lt;/a&gt; as always. It
includes, amongst other things, references.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Software is a central part of modern scientific discovery.&lt;/strong&gt; Software turns a
theoretical model into quantitative predictions; software controls an
experiment; and software extracts from raw data evidence supporting or
rejecting a theory. As of today, scientific publications seldom discuss
software in depth, maybe because it is both highly technical and a recent
addition to scientific tools. But times are changing. More and more scientific
investigators are developing software and it is important to establish norms
for publication of this work. Producing scientific software is an important
part of the landscape of research activities. Very visible scientific software
is found in products developed by private companies, such as Mathwork’s Matlab
or Wolfram’s Mathematica, but let us not forget that these build upon code
written by and for academics. Scientists writing software contribute to the
advancement of Science via several factors.&lt;/p&gt;
&lt;p&gt;First, software developed in one field, if written in a sufficiently general
way, can often be applied to advance a different field if the underlying
mathematics is common. &lt;strong&gt;Modern scientific software development has a strong
emphasis on generality and reusability by taking advantage of the general
properties of the mathematical structures in the problem.&lt;/strong&gt; This feature of
modern software helps close the gap between fields and accelerates scientific
discovery by packaging mathematical theories in a directly applicable way.&lt;/p&gt;
&lt;p&gt;Second, &lt;strong&gt;the public availability of code is a cornerstone of the
scientific method&lt;/strong&gt;, as it is a requirement to reproducing scientific
results: “&lt;em&gt;if it’s not open and verifiable by others, it’s not science,
or engineering, or whatever it is you call what we do.&lt;/em&gt;” (V. Stodden,
&lt;em&gt;The scientific method in practice&lt;/em&gt;). Emphasizing code to an extreme,
Buckheit and Donoho have challenged the traditional view that a
publication was the valuable outcome of scientific research: “&lt;em&gt;an article
about computational science in a scientific publication is not the
scholarship itself, it is merely advertising of the scholarship. The
actual scholarship is the complete software development environment
[…]&lt;/em&gt;”.&lt;/p&gt;
&lt;p&gt;It is important to keep in mind that &lt;strong&gt;going beyond replication of
results requires reusable software tools&lt;/strong&gt;: code that is portable, comes
with documentation, and, most of all, is maintained throughout the years.
Indeed, &lt;strong&gt;software development is a major undertaking that must build
upon best practices and a quality process&lt;/strong&gt;. Reversing Buckheit and
Donoho’s argument, publications about scientific software play an increasingly
important part in the scientific methodology. First, in the publish-or-perish
academic culture, such publications give an incentive to software production
and maintenance, because good software can lead to highly-cited papers. Second,
&lt;strong&gt;the publication and review process are the de facto standards of
ensuring quality in the scientific world. As software is becoming increasingly
central to the scientific discovery process, it must be subject to these
standards&lt;/strong&gt;. We have found that writing an article on software leads the
authors to better clarify the project vision, technically and scientifically,
the prior art, and the contributions. Last but not least, scientists publishing
new results based on a particular software need an informed analysis of the
validity of that software. Unfortunately, much of the current practice for
adopting research software relies on ease of use of the package and reputation
of the authors.&lt;/p&gt;
&lt;p&gt;[…]&lt;/p&gt;
&lt;p&gt;Today, software is to scientific research what Galileo’s telescope was to
astronomy: a tool, combining science and engineering. It lies outside the
central field of competence of the researchers who rely on it.
Like the telescope, it also builds upon scientific progress and shapes our
scientific vision. Galileo’s telescope was a leap forward in optics, a field of
investigation that is now well established, with its own high-impact journals
and scholarly associations. Similarly, we hope that visibility and recognition
of scientific software development will grow.&lt;/p&gt;
</content><category term="science"></category><category term="publishing"></category><category term="open source"></category><category term="scientific computing"></category><category term="reproducible research"></category><category term="scientific software"></category></entry><entry><title>Scikit-learn 0.14 release: features and benchmarks</title><link href="https://gael-varoquaux.info/programming/scikit-learn-014-release-features-and-benchmarks.html" rel="alternate"></link><published>2013-08-08T00:00:00+02:00</published><updated>2013-08-08T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2013-08-08:/programming/scikit-learn-014-release-features-and-benchmarks.html</id><summary type="html">&lt;p&gt;I have tagged and released the scikit-learn 0.14 release yesterday
evening, after more than 6 months of heavy development from the team. I
would like to give a quick overview of the highlights of this release in
terms of features but also in terms of performance. Indeed, the
scikit-learn …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I have tagged and released the scikit-learn 0.14 release yesterday
evening, after more than 6 months of heavy development from the team. I
would like to give a quick overview of the highlights of this release in
terms of features but also in terms of performance. Indeed, the
scikit-learn developers believe that &lt;strong&gt;performance matters&lt;/strong&gt; and strive
to be fast and efficient on fairly large datasets.&lt;/p&gt;
&lt;p&gt;I will show in this article, on a couple of benchmarks, that we have
significant performance improvements and are competitive with the fastest
libraries, such as the proprietary WiseRF.&lt;/p&gt;
&lt;div class="section" id="prohiminent-new-features"&gt;
&lt;h2&gt;Prominent new features&lt;/h2&gt;
&lt;p&gt;Most of the new features of the upcoming release have been mentioned
in more detail on &lt;a class="reference external" href="http://peekaboo-vision.blogspot.de/2013/07/scikit-learn-sprint-and-014-release.html"&gt;Andy Mueller’s
blog&lt;/a&gt;.
I am just giving a quick list here for completeness (see also the &lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html"&gt;full
list of changes&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Major new estimators&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;AdaBoost&lt;/strong&gt; (by &lt;a class="reference external" href="http://noel.dawe.me"&gt;Noel Dawe&lt;/a&gt; and &lt;a class="reference external" href="http://www.montefiore.ulg.ac.be/~glouppe/"&gt;Gilles
Louppe&lt;/a&gt;): the classic
boosting algorithm. This implementation can be applied to any
estimator, but uses trees by default.
AdaBoost is a learning strategy that builds upon simple learners
by focusing successively on samples that are not well
predicted. Typically, the simple learners (called &lt;em&gt;weak learners&lt;/em&gt;)
can be rules as simple as thresholds on observed
quantities (forming &lt;em&gt;decision stumps&lt;/em&gt;).
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/ensemble.html#AdaBoost"&gt;Documentation&lt;/a&gt;
—
&lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_twoclass.html"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Biclustering&lt;/strong&gt; (by &lt;a class="reference external" href="http://www.kemaleren.com"&gt;Kemal Eren&lt;/a&gt;):
clustering rows and columns of the data matrices.
Suppose you have access to the shopping lists of many consumers;
biclustering would consist in grouping both the consumers and the products
they bought, to come up with stories such as “geeks buy computers and
phones”, where “geeks” is a group of consumers and “computers”
and “phones” are groups of products.
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/biclustering.html"&gt;Documentation&lt;/a&gt;
—
&lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/bicluster/plot_spectral_biclustering.html"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Missing value imputation&lt;/strong&gt; (by &lt;a class="reference external" href="http://nicolastr.com/"&gt;Nicolas
Tresegnie&lt;/a&gt;): simple transformer filling
missing data with means or medians.
If your data acquisition has failures, human or material, you can
easily end up with some descriptors missing for some observations. It
would be a pity to throw away either those observations or those
descriptors. “Imputation” fills in the blanks with simple strategies.
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values"&gt;Documentation&lt;/a&gt;
—
&lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/imputation.html"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RBMs (Restricted Boltzmann Machines)&lt;/strong&gt; (by &lt;a class="reference external" href="http://ynd.github.io/"&gt;Yann
Dauphin&lt;/a&gt;): a neural network model useful
for unsupervised learning of features.
Restricted Boltzmann machines learn a set of hidden (latent) factors
that have, for each observation, a probability to be activated or
not. These activations are found so that they explain the data well,
when combined across all the hidden factors with connection weights.
Typically, they form a new feature set that can be useful in a
prediction task.
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/neural_networks.html#restricted-boltzmann-Machines"&gt;Documentation&lt;/a&gt;
—
&lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/plot_rbm_logistic_classification.html"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RandomizedSearchCV&lt;/strong&gt; (by &lt;a class="reference external" href="http://peekaboo-vision.blogspot.com"&gt;Andreas
Mueller&lt;/a&gt;): setting
meta-parameters on estimators using a randomized parameter
exploration rather than a grid, as in a grid-search.
A CV (cross-validated) meta-estimator sets the parameters of an
estimator by maximizing its cross-validated prediction score. This
entails fitting the estimator for each parameter value tried. The
randomized search explores the parameter space randomly, avoiding the
exponential growth in the number of settings to fit that a grid search incurs.
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization"&gt;Documentation&lt;/a&gt;
—
&lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/randomized_search.html"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
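To make the weak-learner idea concrete, here is a minimal sketch (not from the post; the dataset is synthetic) that boosts depth-1 decision stumps:

```python
# Sketch: AdaBoost over decision stumps (depth-1 trees), the
# classic weak learners described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)  # a single-threshold rule
clf = AdaBoostClassifier(stump, n_estimators=50, random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)  # well above chance
```

Each individual stump is barely better than guessing, but the reweighted combination of fifty of them yields a strong classifier.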
&lt;p&gt;&lt;strong&gt;Infrastructure work&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;New website&lt;/strong&gt; (mostly by &lt;a class="reference external" href="http://www.montefiore.ulg.ac.be/~glouppe/"&gt;Gilles
Louppe&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/nellev"&gt;Nelle
Varoquaux&lt;/a&gt;, Vincent Michel and &lt;a class="reference external" href="http://peekaboo-vision.blogspot.com"&gt;Andreas
Mueller&lt;/a&gt;). The redesign of
the website had two objectives: &lt;em&gt;i)&lt;/em&gt; unclutter the pages to help
prioritize information, &lt;em&gt;ii)&lt;/em&gt; make it easier for users to find the
stable documentation, if they follow an external link to a
documentation of previous releases. I think that it also looks
prettier &lt;em&gt;:)&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python 3 support&lt;/strong&gt; (&lt;a class="reference external" href="https://github.com/justinvf"&gt;Justin
Vincent&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/larsmans"&gt;Lars
Buitinck&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/smoitra87"&gt;Subhodeep
Moitra&lt;/a&gt; and &lt;a class="reference external" href="http://twitter.com/ogrisel"&gt;Olivier
Grisel&lt;/a&gt;). As a side note, under Python
3.3, on Windows, we have found that &lt;em&gt;np.load&lt;/em&gt; can trigger segfaults,
which means our test suite crashes. The tests not relying on
&lt;em&gt;np.load&lt;/em&gt; pass.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="major-api-changes"&gt;
&lt;h2&gt;Major API changes&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;The scoring parameter&lt;/strong&gt; One of the benefits of scikit-learn over
other learning packages is that it can set parameters to maximize a
prediction score. However, the prediction that one would want to
optimize might depend on the application. Also, some scores can only
be computed with specific estimators, for instance because they
require probabilistic prediction. &lt;a class="reference external" href="http://peekaboo-vision.blogspot.com"&gt;Andreas
Mueller&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/larsmans"&gt;Lars
Buitinck&lt;/a&gt; came up with &lt;a class="reference external" href="http://scikit-learn.org/dev/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules"&gt;a new
API&lt;/a&gt;
to specify the scoring strategy; it is versatile and hides
complexity from the user. This replaces the &lt;em&gt;score_func&lt;/em&gt; argument.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;sklearn.test()&lt;/strong&gt; is deprecated and will not run the test suite.
Please use &lt;em&gt;nosetests sklearn&lt;/em&gt; from the command line.&lt;/li&gt;
&lt;/ul&gt;
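As a minimal sketch of the new API (not from the post, and using the modern `sklearn.model_selection` import path; at the time this functionality lived in `sklearn.cross_validation`), the scoring strategy is simply selected by name:

```python
# Sketch: selecting the evaluation metric with the `scoring`
# parameter, which replaced the old `score_func` argument.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
# Any registered scorer name works, e.g. "accuracy" or "roc_auc"
scores = cross_val_score(LogisticRegression(), X, y,
                         scoring="roc_auc", cv=5)
mean_auc = scores.mean()
```

The string names hide the distinction between metrics computed from hard predictions and those, like ROC AUC, that need probabilistic output.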
&lt;p&gt;The full list of API changes can be found on the &lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html"&gt;change
log&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="performance-improvements"&gt;
&lt;h2&gt;Performance improvements&lt;/h2&gt;
&lt;p&gt;Many parts of the codebase got speed-ups, with a focus on making
&lt;strong&gt;scikit-learn more scalable for bigger data&lt;/strong&gt;.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The trees (random forests and extra-trees) were massively sped up by
&lt;a class="reference external" href="http://www.montefiore.ulg.ac.be/~glouppe/"&gt;Gilles Louppe&lt;/a&gt;,
bringing them on par with the fastest libraries (see benchmarks
below)&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.astro.washington.edu/users/vanderplas/"&gt;Jake
Vanderplas&lt;/a&gt;
improved the BallTree and implemented fast KDTrees for
nearest-neighbor search (benchmarks below).&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://github.com/cleverless"&gt;“cleverless”&lt;/a&gt; made the DBSCAN
implementation scale to a large number of samples by relying on
KDTree and BallTree for neighbor search.&lt;/li&gt;
&lt;li&gt;KMeans much faster on sparse data (&lt;a class="reference external" href="https://github.com/larsmans"&gt;Lars
Buitinck&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;For text vectorization: much faster CountVectorizer and
TfidfVectorizer with less memory consumption (Jochen Wersdorfer and
Roman Sinayev)&lt;/li&gt;
&lt;li&gt;Out-of-core learning for discrete naive Bayes classifiers by &lt;a class="reference external" href="http://twitter.com/ogrisel"&gt;Olivier
Grisel&lt;/a&gt;. Estimators that implement a
&lt;em&gt;partial_fit&lt;/em&gt; method can be used to fit the model with an
out-of-core strategy, as illustrated by the &lt;a class="reference external" href="http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html"&gt;out-of-core
classification
example&lt;/a&gt;.
These settings are well suited to very big data.&lt;/li&gt;
&lt;li&gt;FastICA: less memory consumption and slightly faster code (&lt;a class="reference external" href="https://github.com/dengemann"&gt;Denis
Engemann&lt;/a&gt; and &lt;a class="reference external" href="http://alexandre.gramfort.net"&gt;Alexandre
Gramfort&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Faster IsotonicRegression (&lt;a class="reference external" href="https://github.com/nellev"&gt;Nelle
Varoquaux&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;OrthogonalMatchingPursuitCV by &lt;a class="reference external" href="http://alexandre.gramfort.net"&gt;Alexandre
Gramfort&lt;/a&gt; and &lt;a class="reference external" href="http://vene.ro"&gt;Vlad
Niculae&lt;/a&gt;: while strictly speaking not a speedup of
an existing estimator, this new estimator means that OMP parameters
can be set much faster.&lt;/li&gt;
&lt;/ul&gt;
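&lt;p&gt;As a rough illustration of the &lt;em&gt;partial_fit&lt;/em&gt; pattern mentioned above (a minimal sketch, with synthetic chunks standing in for data read from disk):&lt;/p&gt;

```python
# Minimal sketch of out-of-core learning: feed data to the model in chunks.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
clf = MultinomialNB()
classes = np.array([0, 1])  # all classes must be announced up front

for _ in range(5):
    # In a real out-of-core setting, each chunk would be read from disk.
    X_chunk = rng.randint(0, 10, size=(100, 20))
    y_chunk = rng.randint(0, 2, size=100)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

predictions = clf.predict(X_chunk)
```

&lt;p&gt;Each call to &lt;em&gt;partial_fit&lt;/em&gt; updates the sufficient statistics of the model, so memory usage stays bounded by the chunk size rather than by the dataset size.&lt;/p&gt;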
&lt;/div&gt;
&lt;div class="section" id="we-are-faster-lies-damn-lies-and-benchmarks"&gt;
&lt;h2&gt;We are faster: lies, damn lies and benchmarks&lt;/h2&gt;
&lt;blockquote class="epigraph"&gt;
&lt;p&gt;“There are three kinds of lies: lies, damned lies and statistics.” —&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Mark Twain’s Own Autobiography: The Chapters from the North
American Review&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I claim we have gotten faster at certain things. Other libraries, such
as &lt;a class="reference external" href="http://docs.wise.io/"&gt;WiseRf&lt;/a&gt;, make performance claims compared
to us. It turns out that benchmarking statistical learning code is very
hard, because speed depends a lot on the properties of the data.&lt;/p&gt;
&lt;div class="section" id="fast-neighbor-searches-good-kdtrees-beat-balltrees"&gt;
&lt;h3&gt;Fast neighbor searches: good KDTrees beat BallTrees&lt;/h3&gt;
&lt;p&gt;A good example of interplay between properties of the data and
computational speed is the nearest neighbor search. In general, finding
the nearest neighbor to a point out of &lt;em&gt;n&lt;/em&gt; other points will cost you
&lt;em&gt;n&lt;/em&gt; operations, as you have to compute the distance to each of these
points. However, building a tree-like data structure ahead of time can
make this query cost only &lt;em&gt;log n&lt;/em&gt;. If these points are in 1D, &lt;em&gt;ie&lt;/em&gt;
simple scalars, this would be achieved by sorting them. In higher
dimensions that can be achieved by building a &lt;em&gt;KDTree&lt;/em&gt;, made of planes
dividing the space in half-spaces, or a &lt;em&gt;BallTree&lt;/em&gt;, made of nested
balls.&lt;/p&gt;
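&lt;p&gt;In scikit-learn this build-once, query-fast pattern looks as follows (a minimal sketch; the actual timings depend heavily on the data):&lt;/p&gt;

```python
# Minimal sketch: pay the tree-building cost once, then query in ~log n time.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(42)
X = rng.rand(1000, 3)  # 1000 points in 3 dimensions

tree = KDTree(X)                    # build the tree once
dist, ind = tree.query(X[:1], k=5)  # 5 nearest neighbors of the first point
```

&lt;p&gt;Since the query point is itself in the tree, its nearest neighbor is itself, at distance 0.&lt;/p&gt;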
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="http://www.astroml.org/_images/fig_kdtree_example_1.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;KD Tree&lt;/strong&gt; Image from &lt;a class="reference external" href="http://www.astroml.org/index.html"&gt;AstroML’s documentation&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="http://www.astroml.org/_images/fig_balltree_example_1.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Ball tree&lt;/strong&gt; Image from &lt;a class="reference external" href="http://www.astroml.org/index.html"&gt;AstroML’s documentation&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Popular wisdom in machine learning is that in high dimensions, BallTrees
scale better than KDTrees. This is explained by the fact that as the
dimensionality grows, the number of planes required to break up the
space grows too. On the contrary, if the data has structure, BallTrees
can more efficiently cover this structure. I have benchmarked scikit-learn’s
KDTree and BallTree, as well as scipy’s KDTree, which employs a simpler
tree-building strategy, on a variety of datasets, both real-life and
artificial. Below is a summary plot giving the relative performance of
neighbor searches:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="https://gael-varoquaux.info/programming/attachments/sklearn_0.14.X_speed/nn_trees.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;&lt;em&gt;n&lt;/em&gt; is the number of data points, and &lt;em&gt;p&lt;/em&gt; the dimensionality.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;We can see that no approach wins on all counts. That said, it came as a
surprise to me to see that even in high dimension, &lt;strong&gt;scikit-learn’s
KDTree outperformed the BallTrees&lt;/strong&gt;. This is explained by the fact that these
datasets do not display heavy structure with a low intrinsic dimension. On
highly-structured synthetic data, the benefit of BallTree can clearly
stand out, as shown by Jake
&lt;a class="reference external" href="http://jakevdp.github.io/blog/2013/04/29/benchmarking-nearest-neighbor-searches-in-python"&gt;here&lt;/a&gt;.
However, on most datasets people encounter, it seems that this is not the
case. Note also that &lt;strong&gt;scikit-learn’s KDTree tends to scale better in
high dimension than scipy’s&lt;/strong&gt;. This is due to the more elaborate choice
of cutting planes. That choice also has a cost, and may backfire, as on
some datasets scikit-learn is slower than scipy.&lt;/p&gt;
&lt;p&gt;Overall, the new KDTree in scikit-learn seems to give an excellent
compromise. Congratulations
&lt;a class="reference external" href="http://www.astro.washington.edu/users/vanderplas/"&gt;Jake&lt;/a&gt;!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="dbscan-is-faster-with-trees"&gt;
&lt;h3&gt;DBSCAN is faster with trees&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/clustering.html#dbscan"&gt;DBSCAN&lt;/a&gt;
is a clustering algorithm that relies heavily on the local neighborhood
structure. The implementation in scikit-learn 0.13 computed the complete
&lt;em&gt;n&lt;/em&gt; by &lt;em&gt;n&lt;/em&gt; matrix of distances between observations, which means that if
you had a lot of data, you would blow your memory. In the 0.14 release,
DBSCAN uses the BallTree, and as a result scales to much larger datasets
and brings speed benefits. Here is a comparison between the 0.13 and 0.14
implementations (I couldn’t use data as large as I wanted because the
0.13 code would blow up the memory):&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="53%" /&gt;
&lt;col width="23%" /&gt;
&lt;col width="24%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Dataset&lt;/th&gt;
&lt;th class="head"&gt;time with 0.13&lt;/th&gt;
&lt;th class="head"&gt;time with 0.14&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;“lfw”: 13233 samples, 5 features&lt;/td&gt;
&lt;td&gt;6.57 seconds&lt;/td&gt;
&lt;td&gt;3.59 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;“make_blobs”: 30000, with 10 features&lt;/td&gt;
&lt;td&gt;33.50 seconds&lt;/td&gt;
&lt;td&gt;12.87 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Importantly, the scaling is different: while the 0.13 code scales as &lt;em&gt;n
^ 2&lt;/em&gt;, the 0.14 code scales as &lt;em&gt;n log n&lt;/em&gt;. This means that the benefit is
bigger for larger datasets.&lt;/p&gt;
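&lt;p&gt;The user-facing API is unchanged between the two releases; only the internals differ. A minimal sketch on synthetic blobs:&lt;/p&gt;

```python
# Minimal sketch: DBSCAN clustering; since 0.14 neighbor queries go
# through tree structures instead of a full n-by-n distance matrix.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=3000, centers=3, cluster_std=0.5, random_state=0)
db = DBSCAN(eps=0.5, min_samples=10).fit(X)
labels = db.labels_  # cluster index per sample, -1 for noise
n_clusters = len(set(labels) - {-1})
```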
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-0-14-s-random-forests-are-fast"&gt;
&lt;h3&gt;Scikit-learn 0.14’s random forests are fast&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.montefiore.ulg.ac.be/~glouppe/"&gt;Gilles Louppe&lt;/a&gt; has made
the random forests significantly faster in the 0.14 release. Let us
bench them in comparison with WiseIO’s
&lt;a class="reference external" href="http://docs.wise.io/"&gt;WiseRf&lt;/a&gt;, a proprietary package that only does
random forest and for which the main selling point is that it is
significantly than scikit-learn. However, let us also bench
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/ensemble.html#extremely-randomized-trees"&gt;ExtraTrees&lt;/a&gt;,
a tree-based model that is very similar to random forests, but that in
our experience can be implemented a bit faster, and tends to work
better.&lt;/p&gt;
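&lt;p&gt;The scikit-learn side of such a benchmark is easy to reproduce (a minimal sketch; absolute timings will of course vary with the machine and with &lt;em&gt;n_estimators&lt;/em&gt;):&lt;/p&gt;

```python
# Minimal sketch: timing ExtraTrees against RandomForest on digits.
import time
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

digits = load_digits()
X, y = digits.data, digits.target

scores = {}
for Model in (ExtraTreesClassifier, RandomForestClassifier):
    clf = Model(n_estimators=100, random_state=0)
    t0 = time.time()
    clf.fit(X, y)
    elapsed = time.time() - t0
    scores[Model.__name__] = clf.score(X, y)  # training-set accuracy
    print("%s: %.2fs train" % (Model.__name__, elapsed))
```

&lt;p&gt;A proper benchmark would of course measure accuracy on held-out data, as the tables below do.&lt;/p&gt;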
&lt;p&gt;&lt;strong&gt;On the digits dataset (1797 samples, 64 features):&lt;/strong&gt;&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="33%" /&gt;
&lt;col width="19%" /&gt;
&lt;col width="17%" /&gt;
&lt;col width="31%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;Forest implementation&lt;/td&gt;
&lt;td&gt;train time&lt;/td&gt;
&lt;td&gt;test time&lt;/td&gt;
&lt;td&gt;prediction accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn ExtraTrees&lt;/td&gt;
&lt;td&gt;2.641s&lt;/td&gt;
&lt;td&gt;0.082s&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn RandomForest&lt;/td&gt;
&lt;td&gt;5.074s&lt;/td&gt;
&lt;td&gt;0.088s&lt;/td&gt;
&lt;td&gt;0.981&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;WiseRF&lt;/td&gt;
&lt;td&gt;5.665s&lt;/td&gt;
&lt;td&gt;0.108s&lt;/td&gt;
&lt;td&gt;0.979&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;So we see that on a mid-sized dataset, scikit-learn is faster than
WiseRF, and ExtraTrees is twice as fast as RandomForest, for better
results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On the MNIST dataset (70000 samples, 784 features):&lt;/strong&gt;&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="33%" /&gt;
&lt;col width="19%" /&gt;
&lt;col width="17%" /&gt;
&lt;col width="31%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;Forest implementation&lt;/td&gt;
&lt;td&gt;train time&lt;/td&gt;
&lt;td&gt;test time&lt;/td&gt;
&lt;td&gt;prediction accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn ExtraTrees&lt;/td&gt;
&lt;td&gt;1378.141s&lt;/td&gt;
&lt;td&gt;4.768s&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn RandomForest&lt;/td&gt;
&lt;td&gt;1639.866s&lt;/td&gt;
&lt;td&gt;4.132s&lt;/td&gt;
&lt;td&gt;0.972&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;WiseRF&lt;/td&gt;
&lt;td&gt;1102.465s&lt;/td&gt;
&lt;td&gt;14.542s&lt;/td&gt;
&lt;td&gt;0.972&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;On a big dataset, WiseRF takes the lead, but not by a large factor.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Using 2 CPUs (n_jobs=2) on the digits dataset:&lt;/strong&gt;&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="33%" /&gt;
&lt;col width="19%" /&gt;
&lt;col width="17%" /&gt;
&lt;col width="31%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;Forest implementation&lt;/td&gt;
&lt;td&gt;train time&lt;/td&gt;
&lt;td&gt;test time&lt;/td&gt;
&lt;td&gt;prediction accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn ExtraTrees&lt;/td&gt;
&lt;td&gt;4.874s&lt;/td&gt;
&lt;td&gt;1.478s&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn RandomForest&lt;/td&gt;
&lt;td&gt;5.716s&lt;/td&gt;
&lt;td&gt;1.349s&lt;/td&gt;
&lt;td&gt;0.978&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;WiseRF&lt;/td&gt;
&lt;td&gt;3.264s&lt;/td&gt;
&lt;td&gt;0.104s&lt;/td&gt;
&lt;td&gt;0.979&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Both scikit-learn and WiseRF can use several CPUs. However, the Python
parallel execution model via multiple processes has an overhead in terms
of computing time and of memory usage. The internals of WiseRF are coded
in C++, and thus it is not limited by this overhead. Also, because of
the memory duplication with multiple processes in scikit-learn, I could
not run it on MNIST with 2 jobs. The next release will address these issues,
partly by using memmapped arrays to share memory between processes.&lt;/p&gt;
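&lt;p&gt;The memmap idea can be sketched with plain numpy (hypothetical file name; this shows the mechanism, not scikit-learn’s actual implementation):&lt;/p&gt;

```python
# Minimal sketch: a memory-mapped file lets several processes read the
# same array without each holding a private copy in RAM.
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.mmap")  # hypothetical path

writer = np.memmap(path, dtype="float64", mode="w+", shape=(1000, 50))
writer[:] = np.random.RandomState(0).rand(1000, 50)
writer.flush()  # make sure the data hits the file

# A worker process would open the same file read-only, copy-free:
reader = np.memmap(path, dtype="float64", mode="r", shape=(1000, 50))
same = np.allclose(writer, reader)
```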
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="we-make-good-use-of-funding-the-paris-sprint"&gt;
&lt;h2&gt;We make good use of funding: the Paris sprint&lt;/h2&gt;
&lt;p&gt;A couple of weeks ago, we had a coding sprint in Paris. We were able to
bring in a lot of core developers from all over Europe thanks to our
sponsors: &lt;a class="reference external" href="http://www.frs-fnrs.be/%20"&gt;FNRS&lt;/a&gt;,
&lt;a class="reference external" href="http://www.afpy.org"&gt;AFPy&lt;/a&gt;, &lt;a class="reference external" href="http://www.telecom-paristech.fr/"&gt;Telecom
Paristech&lt;/a&gt;, and &lt;a class="reference external" href="http://www.svi.cnrs-bellevue.fr"&gt;Saint-Gobain
Recherche&lt;/a&gt;. The total budget,
including accommodation and travel, was a couple of thousand euros, thanks
to &lt;a class="reference external" href="http://www.telecom-paristech.fr/"&gt;Telecom Paristech&lt;/a&gt; and
&lt;a class="reference external" href="http://www.tinyclues.com"&gt;tinyclues&lt;/a&gt; helping us with accommodation
and hosting the sprint.&lt;/p&gt;
&lt;p&gt;The productivity of such a sprint is huge, both because we get together
and work efficiently, but also because we get back home and keep working
(I have been sleep deprived because of late-night hacking ever since the
sprint). As an illustration, here is the diagram of commits as can be
seen on GitHub. The huge spike corresponds to the second international
sprint: Paris 2013.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="https://gael-varoquaux.info/programming/attachments/sklearn_0.14.X_speed/commit_graph.png" style="width: 100%;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;We now have a “donate” button&lt;/strong&gt; on the
&lt;a class="reference external" href="http://scikit-learn.org/stable"&gt;website&lt;/a&gt;. I can assure you that
your donations are well spent and turned into code.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="machine learning"></category></entry><entry><title>RIP John Hunter: the loss of a great man</title><link href="https://gael-varoquaux.info/programming/rip-john-hunter-the-loss-of-a-great-man.html" rel="alternate"></link><published>2012-08-30T10:21:00+02:00</published><updated>2012-08-30T10:21:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-08-30:/programming/rip-john-hunter-the-loss-of-a-great-man.html</id><summary type="html">&lt;p&gt;John Hunter, the author of &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt; passed away yesterday after a
short battle against cancer. John gave the keynote at the scipy 2012
conference a few weeks ago, and was diagnosed with cancer just on his
return from the conference. It is a shock to me that a friend …&lt;/p&gt;</summary><content type="html">&lt;p&gt;John Hunter, the author of &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt; passed away yesterday after a
short battle against cancer. John gave the keynote at the scipy 2012
conference a few weeks ago, and was diagnosed with cancer just on his
return from the conference. It is a shock to me that a friend can
disappear so quickly. Please read the &lt;a class="reference external" href="https://groups.google.com/forum/#!msg/pydata/FpwXp3sX6N8/mxopkZ1PkBQJ"&gt;announcement&lt;/a&gt; of &lt;a class="reference external" href="http://fperez.org/"&gt;Fernando
Perez&lt;/a&gt;, who supported John in his last weeks, to learn more about John.&lt;/p&gt;
&lt;div class="section" id="a-man-who-gave-a-lot-not-asking-for-anything-in-return"&gt;
&lt;h2&gt;A man who gave a lot, not asking for anything in return&lt;/h2&gt;
&lt;p&gt;Many have benefited from the silent efforts of John, and are not fully
aware of how he generously invested his time and talent for the benefit
of others. Matplotlib, the Python plotting library that he created in
2002, has propelled Python as a major tool for scientific research and
engineering. The impact of John’s efforts goes well beyond Matplotlib.
Early on, John had the vision of Python as an interactive scientific
environment. He promoted this vision, pairing with Fernando Perez to
develop the fantastic &lt;a class="reference external" href="http://ipython.org/"&gt;ipython&lt;/a&gt;/&lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt; tandem, solving many
technical challenges. But he also invested a lot of energy in teaching
workshops that helped change the way people compute, as well as writing
didactic documentation and articles. He was a friendly, active, leader
of an online community, open and helpful to newcomers.&lt;/p&gt;
&lt;p&gt;As Travis Oliphant said on John’s numfocus &lt;a class="reference external" href="http://numfocus.org/johnhunter/"&gt;memorial webpage&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
Those who contribute much to open source, as John did, do so at the
expense of something - often it is time with family.&lt;/blockquote&gt;
&lt;p&gt;I cannot stress how true this is. The entire open-source software stack
that nowadays supports our economy, our education, and our research is built
on the shoulders of a fairly small number of generous people who spend
their energy making better software rather than building personal wealth.&lt;/p&gt;
&lt;p&gt;John was a humble man. He did not have a blog, or a twitter account, did
not seek fame or money. For this reason I feel that his contributions
are unknown and undervalued by many. In my eyes, he is an unknown
soldier of our modern times. I hope that I am not being too emphatic,
but this is how I feel.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;John passed away at 44, leaving behind a wife and 3 daughters. Please
do consider supporting them:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote class="last"&gt;
&lt;a class="reference external" href="http://numfocus.org/johnhunter"&gt;http://numfocus.org/johnhunter&lt;/a&gt;&lt;/blockquote&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="scipy"></category><category term="personnal"></category><category term="community"></category></entry><entry><title>A journal promoting high-quality research code: dream and reality</title><link href="https://gael-varoquaux.info/programming/a-journal-promoting-high-quality-research-code-dream-and-reality.html" rel="alternate"></link><published>2012-06-04T21:39:00+02:00</published><updated>2012-06-04T21:39:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-06-04:/programming/a-journal-promoting-high-quality-research-code-dream-and-reality.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="http://www.openresearchcomputation.com/"&gt;Open research computation (ORC)&lt;/a&gt; was an attempt to create a scientific
publication promoting &lt;strong&gt;high-quality and open source scientific code&lt;/strong&gt;.
The project went public in fall 2010, but last month, facing the low
volume of submissions, the editorial board &lt;a class="reference external" href="http://blogs.openaccesscentral.com/blogs/bmcblog/entry/open_research_computation_thematic_series"&gt;chose to reorient it&lt;/a&gt; as a
special track of an existing journal …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="http://www.openresearchcomputation.com/"&gt;Open research computation (ORC)&lt;/a&gt; was an attempt to create a scientific
publication promoting &lt;strong&gt;high-quality and open source scientific code&lt;/strong&gt;.
The project went public in fall 2010, but last month, facing the low
volume of submissions, the editorial board &lt;a class="reference external" href="http://blogs.openaccesscentral.com/blogs/bmcblog/entry/open_research_computation_thematic_series"&gt;chose to reorient it&lt;/a&gt; as a
special track of an existing journal.&lt;/p&gt;
&lt;p&gt;The challenges that we face are discussed in our editorial:&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external" href="http://www.scfbm.org/content/7/1/2/abstract"&gt;Changing computational research. The challenges ahead.&lt;/a&gt; C Neylon,
J Aerts, CT Brown, D Lemire, J Millman, P Murray-Rust, F Perez, N
Saunders, A Smith, G Varoquaux and E Willighagen, &lt;em&gt;Source Code for
Biology and Medicine&lt;/em&gt; 2012, 7:20&lt;/blockquote&gt;
&lt;p&gt;Here is my own personal take on the rise and fall of this ideal.&lt;/p&gt;
&lt;div class="section" id="my-story-with-orc"&gt;
&lt;h2&gt;My story with ORC&lt;/h2&gt;
&lt;img alt="" class="align-right" src="http://www.rcac.net.au/images/Publications1.jpg" style="width: 40%;" /&gt;
&lt;p&gt;&lt;strong&gt;From pipe dream to journal -&lt;/strong&gt; My involvement with ORC started long
before there was such a thing as ORC. In fall 2008, I had a discussion
with a friend working in the publication industry, telling her how I
believed that the publication system is broken, because it promotes new
results without any interest in whether these can be exported outside
the lab that produced them: &lt;strong&gt;it is currently easier to publish a minor
but novel result than a tool enabling the routine reproduction of
previous results&lt;/strong&gt;. This seemed particularly marked in the scientific
software world, as software tools are becoming central to the scientific
workflow, and cost nothing to duplicate when produced under open-source
license. To my surprise, she took me seriously, and asked me to write my
ideas down in an email that she would forward to her colleagues in the
publication industry.&lt;/p&gt;
&lt;p&gt;Looking back at the email that I sent, my concerns were, back then, to
promote:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;quality and openness of scientific software&lt;/li&gt;
&lt;li&gt;basic tools shared across communities&lt;/li&gt;
&lt;li&gt;recognition of software development as a challenging and worthwhile
task in academic research&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Shaping the idea -&lt;/strong&gt; In the year that followed, I had a few
discussions with staff from &lt;a class="reference external" href="http://www.biomedcentral.com"&gt;BioMedCentral&lt;/a&gt;, an open-access publisher
in biology and medicine that was looking into expanding into the physics-
and math-related fields. Eventually, my contact there told me that they
had other similar requests and were launching a journal that would be
led by Cameron Neylon, a British biophysicist and strong advocate of
openness and reproducibility in science. This was the start of ORC, and
for me the chance to meet other people sharing my concerns, some new and
some &lt;a class="reference external" href="http://fperez.org/"&gt;already&lt;/a&gt; &lt;a class="reference external" href="http://jarrodmillman.com/"&gt;old&lt;/a&gt; &lt;a class="reference external" href="http://ivory.idyll.org"&gt;friends&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="http://www.salinafbc.com/Websites/fbcsalina/images/nerd_computer.gif" style="width: 230px;" /&gt;
&lt;p class="caption"&gt;ORC editor&lt;/p&gt;
&lt;/div&gt;
&lt;div class="figure align-left"&gt;
&lt;img alt="" src="http://researchsupportgroup.files.wordpress.com/2011/11/kayla1.jpg" style="width: 150px;" /&gt;
&lt;p class="caption"&gt;Conventional editor&lt;/p&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;strong&gt;Setting up the journal -&lt;/strong&gt; BioMedCentral was instrumental in setting
up the journal project. I quickly learned that, no surprises, a journal
is a product, like anything else, and it must find customers. Here, as
we were launching an open access journal, the customers were authors.
This is where a journal faces a chicken and egg problem: to be
recognised it needs high-visibility publications, but authors will
submit only to journals that they know. The main tools to overcome this
challenge are communication and advocacy. I then realized that these
really weren’t my strong points. Cameron Neylon absolutely shined on
this side, with very enthusiastic &lt;a class="reference external" href="http://cameronneylon.net/blog/open-research-computation-an-ordinary-journal-with-extraordinary-aims/"&gt;communications&lt;/a&gt; and an incredibly
active &lt;a class="reference external" href="https://twitter.com/#!/CameronNeylon"&gt;twitter account&lt;/a&gt;. On my side, I am a slow writer, and I tend to
speak Python code better than English language, which is not a strong
asset to be a journal editor.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wild editorial discussions -&lt;/strong&gt; The discussions in the editorial board
really thrilled me because they were centered on how to set standards to
improve the quality of code published. Looking in my mailbox, I see
discussions about code repositories, software testing, documentation or
licensing issues. This is not that surprising, given that a lot of the
editors were actually contributors to major software projects. It made
me very happy, as I have the feeling that, so far, most committees or
decision makers are clueless about software.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sand-in-the-gears-the-lack-of-uptake"&gt;
&lt;h2&gt;Sand in the gears: the lack of uptake&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A false start -&lt;/strong&gt; So ORC was launched in late 2010 and we had fantastic
feedback. I had the feeling that people were &lt;a class="reference external" href="http://neuralensemble.blogspot.fr/2010/12/open-research-computation-new-journal.html"&gt;genuinely&lt;/a&gt; &lt;a class="reference external" href="https://twitter.com/vaguery/status/15402390589018112"&gt;excited&lt;/a&gt;
about our program: changing the way computational science worked from
the inside, through the review process. The idea was that we had opened
a pre-submission call, and were waiting for a few good papers to be
submitted to launch the journal. However, it turned out that the papers
were slow to come. It took me a while to realize that there was
something wrong. But slowly we had to face the truth: many people were
excited about the journal, but most were sending their papers elsewhere.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What went wrong? -&lt;/strong&gt; If I really knew what went wrong, I would
probably not be discussing it here, but rather changing the world.
However, I can come up with a few guesses:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Working across communities is harder.&lt;/strong&gt; From the beginning we had
wanted to position the journal across communities, in order to foster
the sharing of tools for a greater good. The challenge is that a
central role of publication is nowadays to provide recognition. It is
much easier to achieve recognition in a given community than across
communities, and authors always preferred submitting their work to a
non-software-oriented journal in their field. We couldn’t fight at the
same time the battle for software quality and the battle for
inter-community work.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Setting the bar too high.&lt;/strong&gt; Many felt that the submission
requirements were too demanding, as a researcher expressed on a NeuroImaging
forum: &lt;a class="reference external" href="http://www.nitrc.org/forum/message.php?msg_id=3674"&gt;“I think it’s setting the bar
unrealistically high for most neuroimaging software”&lt;/a&gt;. While we had
originally shot for a very high test coverage (probably too high), we
had scaled it back quickly, simply stressing that editors and
reviewers would be looking closely at test coverage, documentation
and ease of installation. That said, the average researcher did not
share our ideals of raising the quality of scientific software.
Trying to ask only for excellent publications in a new and unproven
journal was probably unrealistic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Editors not willing to game the system.&lt;/strong&gt; I have watched a few
journal launches, and it seems to me that a common trick is to line
up articles that are created by the editors and their friends
specifically for the new journal. People come up with &lt;em&gt;opinion
papers&lt;/em&gt;, &lt;em&gt;reviews&lt;/em&gt;, &lt;em&gt;commentaries&lt;/em&gt; that only serve to generate an
identity to the journal. This did not happen for ORC, and I believe
that it is because &lt;a class="reference external" href="http://cameronneylon.net/blog/open-research-computation-an-ordinary-journal-with-extraordinary-aims"&gt;the editors themselves&lt;/a&gt; were not huge fans of
the low signal-to-noise ratio in modern scientific publishing
practice.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="the-times-they-are-a-changing"&gt;
&lt;h2&gt;The times they are a changing&lt;/h2&gt;
&lt;img alt="" class="align-right" src="http://www.pictures88.com/p/success/success_005.jpg" style="width: 35%;" /&gt;
&lt;p&gt;&lt;strong&gt;ORC is dead, long live ORC -&lt;/strong&gt; We did get a few submissions. ORC is
not coming to an end; it is morphing into a special thematic series in
&lt;a class="reference external" href="http://www.scfbm.org/"&gt;source code for biology and medicine&lt;/a&gt;. This solution is not completely
satisfactory, as it pushes what should have been a forum for exposing
good practices and good software into a smaller community. But at least
there is now a venue in which people can publish a paper about software
that they have been improving and maintaining, and not only about a new
algorithm.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Changing practices across the board -&lt;/strong&gt; Among the reasons we
had a hard time making a breakthrough is that authors were sending
their software papers to other journals, in particular journals not
specialized on software. While these papers are not getting the
attention of a review and editorial team expert on software development,
as we are setting up with ORC, this is still a good thing. Indeed it
shows that the times are changing and that recognition of software as a
scientific work is improving. I have been impressed to see that many
high profile journals have changed their editorial policies to
specifically accept software papers, or have created tracks dedicated to
software.&lt;/p&gt;
&lt;p&gt;Software is being slowly recognized as a pillar of modern scientific
research. We need to keep pushing to make sure that quality standards
are set and that the open-source scientific software grows into a mature
ecosystem focused on problem solving.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="publishing"></category><category term="science"></category><category term="computational science"></category><category term="programming"></category><category term="python"></category><category term="scientific computing"></category></entry><entry><title>Update on scikit-learn: recent developments for machine learning in Python</title><link href="https://gael-varoquaux.info/programming/update-on-scikit-learn-recent-developments-for-machine-learning-in-python.html" rel="alternate"></link><published>2012-05-09T00:12:00+02:00</published><updated>2012-05-09T00:12:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-05-09:/programming/update-on-scikit-learn-recent-developments-for-machine-learning-in-python.html</id><summary type="html">&lt;p&gt;Yesterday, we released version 0.11 of the &lt;a class="reference external" href="http://scikit-learn"&gt;scikit-learn&lt;/a&gt; toolkit for
machine learning in Python, and there was much rejoicing.&lt;/p&gt;
&lt;div class="section" id="major-features-gained-in-the-last-releases"&gt;
&lt;h2&gt;Major features gained in the last releases&lt;/h2&gt;
&lt;p&gt;In the last 6 months, there have been many things happening with the
scikit-learn. While I do not wish to give an exhaustive …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Yesterday, we released version 0.11 of the &lt;a class="reference external" href="http://scikit-learn"&gt;scikit-learn&lt;/a&gt; toolkit for
machine learning in Python, and there was much rejoicing.&lt;/p&gt;
&lt;div class="section" id="major-features-gained-in-the-last-releases"&gt;
&lt;h2&gt;Major features gained in the last releases&lt;/h2&gt;
&lt;p&gt;In the last 6 months, there have been many things happening with the
scikit-learn. While I do not whish to give an exhaustive summary of
features added (it can be found &lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html"&gt;here&lt;/a&gt;), let me list a few of the
additions that I personally find exciting.&lt;/p&gt;
&lt;div class="section" id="non-linear-prediction-models"&gt;
&lt;h3&gt;Non-linear prediction models&lt;/h3&gt;
&lt;p&gt;For complex prediction problems where there is no simple model
available, as in computer vision, non-linear models are handy. Good
examples of such models are those based on decision trees and model
averaging. For instance random forests are used in the Kinect to locate
body parts. As they are intrinsically complex, they may need a large
amount of training data. For this reason, they have been implemented in
the scikit-learn with special attention to computational efficiency.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees"&gt;Randomized Forests and extra-trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting"&gt;Gradient boosted regression trees&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
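&lt;p&gt;As an illustration, here is a minimal sketch of fitting these tree
ensembles with the scikit-learn estimator interface. The toy dataset and
parameter values are assumptions chosen for the example, not
recommendations:&lt;/p&gt;

```python
# Minimal sketch: tree-ensemble classifiers in scikit-learn.
# The synthetic dataset and hyper-parameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(n_estimators=50, random_state=0).fit(X, y)
    print(Model.__name__, "training accuracy:", clf.score(X, y))
```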
&lt;/div&gt;
&lt;div class="section" id="dealing-with-unlabeled-instances"&gt;
&lt;h3&gt;Dealing with unlabeled instances&lt;/h3&gt;
&lt;p&gt;It is often easier to gather unlabeled observations than labeled
ones. While predicting a quantity of interest is then harder
or simply impossible, mining this data can be useful.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/label_propagation.html"&gt;Semi-supervised learning&lt;/a&gt;: using unlabeled observations together with
labeled ones for better prediction.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/outlier_detection.html"&gt;Outlier/novelty detection&lt;/a&gt;: detect deviant observations.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/manifold.html"&gt;Manifold learning&lt;/a&gt;: discover a non-linear low-dimensional structure in
the data.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/clustering.html"&gt;Clustering&lt;/a&gt; with &lt;a class="reference external" href="http://scikit-learn.org/stable/modules/clustering.html#mini-batch-k-means"&gt;an algorithm&lt;/a&gt; that can scale to really large
datasets using an online approach: fitting small portions of the data one
after the other (Mini-batch k-means).&lt;/p&gt;
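&lt;p&gt;A minimal sketch of this online approach, pretending the data arrives
in small chunks (the two synthetic blobs are assumptions for the
example):&lt;/p&gt;

```python
# Minimal sketch: online clustering with mini-batch k-means.
# The two Gaussian blobs are illustrative data only.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
X = np.concatenate([rng.randn(500, 2), rng.randn(500, 2) + 5])

kmeans = MiniBatchKMeans(n_clusters=2, batch_size=100, n_init=3,
                         random_state=0)
for chunk in np.array_split(X, 10):  # feed the data one portion at a time
    kmeans.partial_fit(chunk)
print(kmeans.cluster_centers_)
```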
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/decomposition.html#dictionarylearning"&gt;Dictionary learning&lt;/a&gt;: learning patterns in the data that represent it
sparsely: each observation is a combination of a small number of patterns.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sparse-models-when-very-few-descriptors-are-relevant"&gt;
&lt;h3&gt;Sparse models: when very few descriptors are relevant&lt;/h3&gt;
&lt;p&gt;In general, finding which descriptors are useful when there are many of
them is like finding a needle in a haystack: it is a very hard problem.
However, if you know that only a few of these descriptors actually carry
information, you are in a so-called &lt;em&gt;sparse&lt;/em&gt; problem, for which
specific approaches can work well.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/linear_model.html#orthogonal-matching-pursuit-omp"&gt;Orthogonal matching pursuit&lt;/a&gt;: a greedy and fast algorithm for very
sparse linear models&lt;/p&gt;
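&lt;p&gt;A minimal sketch of orthogonal matching pursuit on a synthetic sparse
problem. The noiseless data and the number of non-zero coefficients are
assumptions made for illustration:&lt;/p&gt;

```python
# Minimal sketch: fitting a very sparse linear model with OMP.
# Synthetic, noiseless data for illustration only.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.RandomState(0)
X = rng.randn(100, 50)
true_coef = np.zeros(50)
true_coef[[3, 17, 42]] = [1.5, -2.0, 1.0]  # only 3 relevant descriptors
y = X @ true_coef

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)
print("selected descriptors:", np.flatnonzero(omp.coef_))
```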
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/feature_selection.html#randomized-sparse-models"&gt;Randomized sparsity (randomized Lasso)&lt;/a&gt;: selecting the relevant
descriptors in noisy high-dimensional observations&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/generated/sklearn.covariance.GraphLasso.html#sklearn.covariance.GraphLasso"&gt;Sparse inverse covariance&lt;/a&gt;: learning graphs of connectivity from
correlations in the data&lt;/p&gt;
&lt;div class="section" id="getting-developpers-together-the-granada-sprint"&gt;
&lt;h4&gt;Getting developers together: the Granada sprint&lt;/h4&gt;
&lt;p&gt;Of course, such developments happen only because we have a great team of
&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/graphs/contributors"&gt;dedicated coders&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Getting along and working together is a critical part of the project. In
December 2011, we held the first international &lt;a class="reference external" href="http://scikit-learn"&gt;scikit-learn&lt;/a&gt; sprint in
Granada, on the side of the &lt;a class="reference external" href="http://nips.cc"&gt;NIPS conference&lt;/a&gt;. That was a while ago,
and I haven’t found time to blog about it, maybe because I was too busy
merging in the code produced :). Here is a small report from my point of
view. Better late than never.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="participants-from-all-over-the-globe"&gt;
&lt;h2&gt;Participants from all over the globe&lt;/h2&gt;
&lt;p&gt;This sprint was a big deal for us, because for the first time, thanks to
sponsor money, we were able to fly contributors from overseas and meet
the team in person. I was finally able to put faces on many of the
fantastic people that I knew only from the mailing
list.&lt;/p&gt;
&lt;p&gt;I really think that we must thank our sponsors, &lt;strong&gt;Google&lt;/strong&gt; and
&lt;strong&gt;tinyclues&lt;/strong&gt;, but also the PSF, in particular Jesse Noller and
especially &lt;strong&gt;Steve Holden&lt;/strong&gt;, whose help was absolutely instrumental in
getting sponsor money. This money is what made it possible to unite a
good fraction of the team, and it opened the door to great moments of
coding, and more.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="producing-code-lines-and-friendship"&gt;
&lt;h2&gt;Producing code lines and friendship&lt;/h2&gt;
&lt;p&gt;An important aspect of the sprint for me was that I really felt the team
being united. Granada is a great city and we spent fantastic moments
together. Now when I review code, I can often put a face on the author
of that code and remember a walk below the Alhambra or an evening in a
bar. I am sure it helps reviewing code!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="was-it-worth-the-money"&gt;
&lt;h2&gt;Was it worth the money?&lt;/h2&gt;
&lt;img alt="" src="attachments/skl_activity.png" style="width: 90%;" /&gt;
&lt;p&gt;I really appreciate that the sponsors did not ask for specific returns on
investment beyond acknowledgments, but I think that it is useful for us
to ask the question: was it worth the money? After all, we got around
$5000, and that’s a lot of money. First of all, as a side effect of the
sprint, people who had invested a huge amount of time in a machine
learning toolkit without asking anything in return got help to go to a
major machine learning conference.&lt;/p&gt;
&lt;p&gt;But was there a return on investment in terms of code? If you look at
the number of lines of code modified weekly (figure on the right), there
is a big spike in December 2011. That’s our sprint! Importantly, there
still is a lot of activity in the months following the sprint. This is
actually unusual, as active development happens more in the summer break
than during the winter, when our developers are busy working on papers
or teaching.&lt;/p&gt;
&lt;p&gt;The explanation is simple: we were thrilled by the sprint. Overall, it
was incredibly beneficial to the project. I am looking forward to the
next ones.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="machine learning"></category><category term="python"></category><category term="science"></category><category term="scikit-learn"></category><category term="sprint"></category></entry><entry><title>3 Google summer of code for scikit-learn and more…</title><link href="https://gael-varoquaux.info/programming/3-google-summer-of-code-for-scikit-learn-and-more.html" rel="alternate"></link><published>2012-04-23T22:25:00+02:00</published><updated>2012-04-23T22:25:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-04-23:/programming/3-google-summer-of-code-for-scikit-learn-and-more.html</id><summary type="html">&lt;p&gt;The &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; got 3 students accepted for the Google summer of
code.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://ibayer.blogspot.fr/"&gt;Imanuel Bayer&lt;/a&gt; will work on making our sparse linear models, for
regression and classification, faster. His proposal: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/ibayer/11001"&gt;Optimizing
sparse linear models using coordinate descent and strong rules&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.davidmarek.cz/"&gt;David Marek&lt;/a&gt; will implement multi-layer perceptrons for the scikit.
His proposal …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;The &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; got 3 students accepted for the Google summer of
code.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://ibayer.blogspot.fr/"&gt;Imanuel Bayer&lt;/a&gt; will work on making our sparse linear models, for
regression and classification, faster. His proposal: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/ibayer/11001"&gt;Optimizing
sparse linear models using coordinate descent and strong rules&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.davidmarek.cz/"&gt;David Marek&lt;/a&gt; will implement multi-layer perceptrons for the scikit.
His proposal: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/h4wk_cz/24001"&gt;Multilayer Perceptron&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://blog.vene.ro/"&gt;Vlad Niculae&lt;/a&gt; will work on speeding up the library in general,
catching all the low hanging fruits, and the ones a bit higher. His
proposal: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/vladn/26002"&gt;Need for scikit-learn speed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, other related projects got exciting proposals, for instance
&lt;a class="reference external" href="http://statsmodels.sourceforge.net/"&gt;&lt;strong&gt;statsmodels&lt;/strong&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Divyanshu Bandil: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/divyanshu/34002"&gt;Extension of Linear to Non Linear Models in
Statsmodels Python module&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Alexandre Crayssac: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/alexandreyc/8001"&gt;estimating system of equations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Justin Grana: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/j_grana/8001"&gt;empirical Likelihood in Statsmodels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Georgi Panterov: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/gpanterov/7001"&gt;nonparametric estimation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and &lt;a class="reference external" href="http://www.cython.org"&gt;Cython&lt;/a&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Philip Herron: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/redbrain1123/28002"&gt;pxd generation using gcc-python-plugin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Mark Florisson: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/markflorisson88/30002"&gt;Fast Numerical Computing with Cython&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;finally, in &lt;a class="reference external" href="http://pandas.pydata.org/"&gt;Pandas&lt;/a&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Vytautas Jancauskas: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/bucket_brigade/42002"&gt;Plots in pandas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Congratulations to all of the students. This is going to be an exciting
summer.&lt;/p&gt;
</content><category term="programming"></category><category term="machine learning"></category><category term="programming"></category><category term="scipy"></category><category term="scikit-learn"></category></entry><entry><title>The problems of low statistical power and publication bias</title><link href="https://gael-varoquaux.info/science/the-problems-of-low-statistical-power-and-publication-bias.html" rel="alternate"></link><published>2012-04-14T16:16:00+02:00</published><updated>2012-04-14T16:16:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-04-14:/science/the-problems-of-low-statistical-power-and-publication-bias.html</id><summary type="html">&lt;img alt="" class="align-right" src="http://idoubtit.files.wordpress.com/2010/12/coldfusion.jpg" style="width: 30%;" /&gt;
&lt;p&gt;Lately, I have been in a mood of scientific scepticism: I have the feeling
that the worldwide academic system is more and more failing to produce
useful research. Christophe Lalanne’s &lt;a class="reference external" href="https://twitter.com/#!/chlalanne"&gt;twitter feed&lt;/a&gt; led me to an
interesting article in a non-mainstream journal: &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;A farewell to
Bonferroni: the problems of low …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;img alt="" class="align-right" src="http://idoubtit.files.wordpress.com/2010/12/coldfusion.jpg" style="width: 30%;" /&gt;
&lt;p&gt;Lately, I have been in a mood of scientific scepticism: I have the feeling
that the worldwide academic system is more and more failing to produce
useful research. Christophe Lalanne’s &lt;a class="reference external" href="https://twitter.com/#!/chlalanne"&gt;twitter feed&lt;/a&gt; led me to an
interesting article in a non-mainstream journal: &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;A farewell to
Bonferroni: the problems of low statistical power and publication
bias&lt;/a&gt;, by Shinichi Nakagawa.&lt;/p&gt;
&lt;p&gt;Each study performed has a probability of being wrong. Thus performing
many studies will lead to some wrong conclusions by chance. This is
known in statistics as the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Multiple_comparisons"&gt;multiple comparisons&lt;/a&gt; problem. When a
working hypothesis is not verified empirically in a study, this null
finding is seldom reported, leading to what is called &lt;em&gt;publication
bias&lt;/em&gt;: &lt;strong&gt;discoveries are further studied; negative results are usually
ignored&lt;/strong&gt; (Y. Benjamini). Because only &lt;em&gt;discoveries&lt;/em&gt;, called
&lt;em&gt;detections&lt;/em&gt; in statistical terms, are reported, &lt;strong&gt;published results
contain more false detections than the individual experiments and very
little false negatives&lt;/strong&gt;. Arguably, the original investigators have
corrected using the understanding that they gained the experiments
performed and account in a &lt;em&gt;post-hoc analysis&lt;/em&gt; for the fact that some of
their working hypothesis could not have been correct. Such a correction
can work only in a field where there is a good mechanistic
understanding, or models, such as physics, but in my opinion not in life
and social sciences.&lt;/p&gt;
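&lt;p&gt;A small simulation makes the multiple comparisons problem concrete:
testing many hypotheses on pure noise yields some significant results by
chance. The sample sizes and thresholds below are assumptions chosen for
illustration:&lt;/p&gt;

```python
# Minimal sketch: the multiple comparisons problem on pure noise.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
n_tests, alpha = 100, 0.05
# 100 independent t-tests on samples drawn from a null (zero-mean) model
p_values = np.array([stats.ttest_1samp(rng.randn(30), 0.0).pvalue
                     for _ in range(n_tests)])
print((p_values < alpha).sum(), "false detections out of", n_tests)
# Bonferroni controls the family-wise error rate by dividing the threshold
print((p_values < alpha / n_tests).sum(), "detections after Bonferroni")
```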
&lt;p&gt;Let me quote some relevant extracts of &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;the article&lt;/a&gt;, as you may never
have access to it thanks to the way scientific publishing works:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
&lt;p&gt;Recently, Jennions and Moller (2003) carried out a meta-analysis
on statistical power in the field of behavioral ecology and animal
behavior, reviewing 10 leading journals including Behavioral
Ecology. Their results showed dismayingly low average statistical
power (note that a meta-analytic review of statistical power is
different from post hoc power analysis as criticized in Hoenig and
Heisey, 2001). The statistical power of a null hypothesis (Ho)
significance test is the probability that the test will reject Ho
when a research hypothesis (Ha) is true.&lt;/p&gt;
&lt;p&gt;…&lt;/p&gt;
&lt;p&gt;The meta-analysis on statistical power by Jennions and Moller
(2003) revealed that, in the field of behavioral ecology and animal
behavior, statistical power of less than 20% to detect a small
effect and power of less than 50% to detect a medium effect existed.
This means, for example, that the average behavioral scientist
performing a statistical test has a greater probability of making a
Type II error (or beta) (&lt;em&gt;i.e.&lt;/em&gt;, not rejecting Ho when Ho is false;
note that statistical power is equal to 1 - beta) than if they had
flipped a coin, when an experiment effect is of medium size.&lt;/p&gt;
&lt;p&gt;…&lt;/p&gt;
&lt;p&gt;Imagine that we conduct a study where we measure as many relevant
variables as possible, 10 variables, for example. We find only two
variables statistically significant. Then, what should we do? We
could decide to write a paper highlighting these two variables (and
not reporting the other eight at all) as if we had hypotheses about
the two significant variables in the first place. Subsequently, our
paper would be published. Alternatively, we could write a paper
including all 10 variables. When the paper is reviewed, referees
might tell us that there were no significant results if we had
“appropriately” employed Bonferroni corrections, so that our study
would not be advisable for publication. However, the latter paper is
scientifically more important than the former paper. For example, if
one wants to conduct a meta-analysis to investigate an overall
effect in a specific area of study, the latter paper is five times
more informative than the former paper. In the long term,
statistical significance of particular tests may be of trivial
importance (if not always), although, in the short term, it makes
papers publishable. Bonferroni procedures may, in part, be
preventing the accumulation of knowledge in the field of behavioral
ecology and animal behavior, thus hindering the progress of the
field as science.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;img alt="" class="align-right" src="http://farm6.staticflickr.com/5206/5330056727_a98c97c3c5.jpg" style="width: 50%;" /&gt;
&lt;p&gt;Some of the concerns raised here are partly a criticism of Bonferroni
corrections, &lt;em&gt;i.e.&lt;/em&gt; in technical terms correcting for &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Familywise_error_rate"&gt;family-wise error
rate (FWER)&lt;/a&gt;. It is actually the message that the author wants to
convey in his paper. Proponents of controlling for &lt;a class="reference external" href="http://en.wikipedia.org/wiki/False_discovery_rate"&gt;false discovery rate
(FDR)&lt;/a&gt; argue that an investigator shouldn’t be penalized for asking
more questions, and the fraction of errors in the answers should be
controlled, rather than the absolute value. That said, FDR, while
useful, does not answer the problems of publication bias.&lt;/p&gt;
</content><category term="science"></category><category term="statistics"></category><category term="computational science"></category><category term="science"></category></entry><entry><title>Want features? Just code</title><link href="https://gael-varoquaux.info/programming/want-features-just-code.html" rel="alternate"></link><published>2012-03-08T22:46:00+01:00</published><updated>2012-03-08T22:46:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-03-08:/programming/want-features-just-code.html</id><summary type="html">&lt;p&gt;Somebody just sent an email on a user’s mailing list for an open-source
scientific package entitled &lt;strong&gt;“Feature foo: why is package bar
not&amp;nbsp;up to the task?”&lt;/strong&gt;. To quote him:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
Is there ANY plan for having such a module in &lt;em&gt;package bar&lt;/em&gt;?? I
think&amp;nbsp;(personally) that this is a …&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;Somebody just sent an email on a user’s mailing list for an open-source
scientific package entitled &lt;strong&gt;“Feature foo: why is package bar
not&amp;nbsp;up to the task?”&lt;/strong&gt;. To quote him:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
Is there ANY plan for having such a module in &lt;em&gt;package bar&lt;/em&gt;?? I
think&amp;nbsp;(personally) that this is a MUST DO. This is typically the
type of&amp;nbsp;routines that I hear people use in e.g., idl etc. If this
could be an&amp;nbsp;optimised, fast (and easy to use) routine, all the
better.&lt;/blockquote&gt;
&lt;p&gt;As someone who spends a fair amount of time working on open
source software, I hear such remarks quite often. I am finding it harder
and harder not to react negatively to these emails. Now, I cannot
consider myself a contributor to &lt;em&gt;package bar&lt;/em&gt;, and thus I can claim
that I am not taking your comment personally.&lt;/p&gt;
&lt;p&gt;Why aren’t packages up to the task? Well, the answer is quite
simple: because they are developed by volunteers on their spare
time, too often late at night, or by companies that put some of their
profits into open source rather than into locking down a market. 90% of the time
the reason a feature isn’t as good as you would want it to be is
lack of time.&lt;/p&gt;
&lt;p&gt;I personally find that suggesting that somebody else should put more
of the time and money they are already giving away into improving a
feature that you need is almost insulting.&lt;/p&gt;
&lt;p&gt;I am aware that people do not realize how small the group of people
that&amp;nbsp;develop and maintain their toys is. Borrowing the figure below from
&lt;a class="reference external" href="http://www.euroscipy.org/file/6459?vid=download"&gt;Fernando Perez’s talk&amp;nbsp;at Euroscipy&lt;/a&gt;,&amp;nbsp;the number of people that do 90%
of the grunt work to get the core&amp;nbsp;scientific Python ecosystem going is
around two handfuls:&lt;/p&gt;
&lt;img alt="" src="attachments/fperez_euroscipy_2011_contributors.jpg" style="width: 70%;" /&gt;
&lt;p&gt;I’d like to think that this recruitment problem is a lack of skill set:
users that have the ability to contribute are just too rare. This is not
entirely true: there are scores of skilled people on the mailing lists.
The poster himself mentioned in his email that he was developing a package.
I personally started contributing not knowing anything about software
development. I struggled, and I did the grunt work: maintaining wikis,
answering questions on mailing lists, and writing documentation. These
easier tasks were useful to the community, I think, but most
importantly, they taught me a lot because I was investing energy in
them.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;&lt;strong&gt;If people want things to improve, they will have more&amp;nbsp;successes
sending in pull requests than messages on mailing list that&amp;nbsp;sound
condescending to my ears.&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I hope that I haven’t overreacted too badly :); that email set me off.
That said, I am not sure that people realize how much they owe to the
open source developers breaking their backs on the packages they use.&lt;/p&gt;
&lt;img alt="" src="attachments/fperez_euroscipy_2011_i_want_you.jpg" style="width: 50%;" /&gt;
&lt;p&gt;All credit for images goes to &lt;a class="reference external" href="http://fperez.org/"&gt;Fernando Perez&lt;/a&gt;&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="scientific computing"></category><category term="community"></category></entry><entry><title>Book review: NumPy 1.5 Beginner’s guide</title><link href="https://gael-varoquaux.info/programming/book-review-numpy-15-beginners-guide.html" rel="alternate"></link><published>2012-01-10T08:57:00+01:00</published><updated>2012-01-10T08:57:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-01-10:/programming/book-review-numpy-15-beginners-guide.html</id><summary type="html">&lt;p&gt;Packt publishing sent me a copy of &lt;a class="reference external" href="http://www.packtpub.com/numpy-1-5-using-real-world-examples-beginners-guide/Book"&gt;NumPy 1.5 Beginner’s guide&lt;/a&gt; by Ivan
Idris.&lt;/p&gt;
&lt;p&gt;The book actually covers more than only &lt;a class="reference external" href="http://numpy.scipy.org/"&gt;numpy&lt;/a&gt;: it is a full
introduction to numerical computing with Python. The &lt;a class="reference external" href="http://www.packtpub.com/toc/numpy-15-beginners-guide-table-contents"&gt;table of
contents&lt;/a&gt; is the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;NumPy Quick Start&lt;/li&gt;
&lt;li&gt;Beginning with NumPy Fundamentals&lt;/li&gt;
&lt;li&gt;Get into …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;Packt publishing sent me a copy of &lt;a class="reference external" href="http://www.packtpub.com/numpy-1-5-using-real-world-examples-beginners-guide/Book"&gt;NumPy 1.5 Beginner’s guide&lt;/a&gt; by Ivan
Idris.&lt;/p&gt;
&lt;p&gt;The book actually covers more than only &lt;a class="reference external" href="http://numpy.scipy.org/"&gt;numpy&lt;/a&gt;: it is a full
introduction to numerical computing with Python. The &lt;a class="reference external" href="http://www.packtpub.com/toc/numpy-15-beginners-guide-table-contents"&gt;table of
contents&lt;/a&gt; is the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;NumPy Quick Start&lt;/li&gt;
&lt;li&gt;Beginning with NumPy Fundamentals&lt;/li&gt;
&lt;li&gt;Get into Terms with Commonly Used Functions&lt;/li&gt;
&lt;li&gt;Convenience Functions for Your Convenience&lt;/li&gt;
&lt;li&gt;Working with Matrices and ufuncs&lt;/li&gt;
&lt;li&gt;Move Further with NumPy Modules&lt;/li&gt;
&lt;li&gt;Peeking Into Special Routines&lt;/li&gt;
&lt;li&gt;Assure Quality with Testing&lt;/li&gt;
&lt;li&gt;Plotting with Matplotlib&lt;/li&gt;
&lt;li&gt;When NumPy is Not Enough: SciPy and Beyond&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The book is easy to read, as it requires no specific expertise other
than knowing basic Python programming. It is full of examples and
exercises, which is really great for learning. I find the style of the
author, Ivan Idris, particularly amusing and relaxing, engaging the
reader with questions, challenges, or even jokes (&lt;em&gt;“Have a go hero”&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;With regards to the formatting and the print, the book is written in
large fonts, with sectioning information, tips and exercises clearly
standing out.&lt;/p&gt;
&lt;p&gt;It is full of practical information, such as how to install the
software, or where to get help. Finally, one thing that I appreciated
is that the examples are typed in &lt;a class="reference external" href="http://ipython.org/"&gt;IPython&lt;/a&gt;. Each time I teach, I like
to use IPython, because it is full of features to help plotting,
debugging and profiling numerical code. The book even has a little
introduction to some useful IPython features.&lt;/p&gt;
&lt;p&gt;After an introduction to the work flow, the book explores array
manipulation such as creation or reshaping, followed by some simple
numerics and the battery of array-based operations on functions and
polynomials. Then it presents linear algebra and signal processing
basics (FFT). It also covers the financial functions that are present in
numpy and mentions testing, which is very important to achieve quality
code. The book finishes with matplotlib and scipy, two modules that are
important to know to go further.&lt;/p&gt;
&lt;p&gt;The examples are mostly drawn from statistics or financial applications,
such as computing running averages on stock quotes. Basic math
explanations, such as the definition of the Moore-Penrose
pseudo-inverse, are given when needed.&lt;/p&gt;
&lt;p&gt;To conclude, I enjoyed this book and I think that it is a nice addition
to my library. It delivers exactly what its title promises: it is well-suited for
beginners wanting to learn numpy. On the other hand, I would not
recommend it as a reference material, or as a book to learn more general
scientific or numerical computing with Python.&lt;/p&gt;
</content><category term="programming"></category><category term="scipy"></category><category term="python"></category><category term="scientific computing"></category><category term="books"></category></entry><entry><title>Joblib beta release: fast compressed persistence + Python 3</title><link href="https://gael-varoquaux.info/programming/joblib-beta-release-fast-compressed-persistence-python-3.html" rel="alternate"></link><published>2012-01-07T19:27:00+01:00</published><updated>2012-01-07T19:27:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-01-07:/programming/joblib-beta-release-fast-compressed-persistence-python-3.html</id><summary type="html">&lt;div class="section" id="joblib-0-6-better-i-o-and-python-3-support"&gt;
&lt;h2&gt;Joblib 0.6: better I/O and Python 3 support&lt;/h2&gt;
&lt;p&gt;Happy new year, every one. I have just released &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Out-of-core_algorithm"&gt;Joblib&lt;/a&gt; 0.6.0 beta.
The highlights of the 0.6 release are a reworked enhanced pickler, and
Python 3 support.&lt;/p&gt;
&lt;p&gt;Many thanks go to the contributors to the 0.5 …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="joblib-0-6-better-i-o-and-python-3-support"&gt;
&lt;h2&gt;Joblib 0.6: better I/O and Python 3 support&lt;/h2&gt;
&lt;p&gt;Happy new year, every one. I have just released &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Out-of-core_algorithm"&gt;Joblib&lt;/a&gt; 0.6.0 beta.
The highlights of the 0.6 release are a reworked enhanced pickler, and
Python 3 support.&lt;/p&gt;
&lt;p&gt;Many thanks go to the contributors to the 0.5.X series (Fabian
Pedregosa, Yaroslav Halchenko, Kenneth C. Arnold, Alexandre Gramfort,
Lars Buitinck, Bala Subrahmanyam Varanasi, Olivier Grisel, Ralf Gommers,
Juan Manuel Caicedo Carvajal, and myself). In particular Fabian made
sure that Joblib worked under Python 3.&lt;/p&gt;
&lt;p&gt;In this blog post, I’d like to discuss the compressed persistence
engine in a bit more detail, as it nicely illustrates key factors in
implementing and using compressed serialization.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="fast-compressed-persistence"&gt;
&lt;h2&gt;Fast compressed persistence&lt;/h2&gt;
&lt;p&gt;One of the key components of joblib is its ability to persist arbitrary
Python objects, and read them back very quickly. It is particularly
efficient for &lt;strong&gt;containers that do their heavy lifting with numpy
arrays&lt;/strong&gt;. The trick to achieving great speed has been to save the numpy
arrays in separate files, and load them via &lt;strong&gt;memmapping&lt;/strong&gt;.&lt;/p&gt;
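The dump/load round-trip described above can be sketched with joblib's public API (a minimal illustration, not code from the release; the object and file path are arbitrary):

```python
# Minimal sketch of joblib's persistence: numpy arrays inside the
# container are stored so that they can be read back via memmapping
# (in the 0.6 series, as separate files next to the pickle).
import os
import tempfile

import numpy as np
import joblib

obj = {"name": "atlas", "data": np.random.rand(1000, 10)}
path = os.path.join(tempfile.mkdtemp(), "obj.pkl")
joblib.dump(obj, path)

# mmap_mode="r" maps the arrays from disk; data is paged in lazily
# when actually accessed, which is why the load time looks negligible.
loaded = joblib.load(path, mmap_mode="r")
print(loaded["data"].shape)
```

The `compress` argument of `joblib.dump` enables the compressed storage discussed below.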
&lt;p&gt;However, one drawback of joblib is that the caching mechanism may end
up using a lot of disk space. As a result, there is strong interest in
having &lt;strong&gt;compressed storage&lt;/strong&gt;, provided it doesn’t slow down the library
too much. Another use case that I have in mind for fast compressed
persistence is implementing &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Out-of-core_algorithm"&gt;out-of-core computation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There are some great compressed I/O libraries for Python, for instance
&lt;a class="reference external" href="http://pytables.github.com/index.html"&gt;Pytables&lt;/a&gt;. You may wonder why the need to code yet another one. The
answer is that joblib is &lt;strong&gt;pure Python, depending only on the standard
library&lt;/strong&gt; (numpy is optional), but also that the goal here is
&lt;strong&gt;black-box persistence of arbitrary objects&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="section" id="comparing-i-o-speed-and-compression-to-other-libraries"&gt;
&lt;h3&gt;Comparing I/O speed and compression to other libraries&lt;/h3&gt;
&lt;p&gt;Implementing efficient compressed storage was a bit of a struggle and I
learned a lot. Rather than going into the details straight away, let me
first discuss a few benchmarks of the resulting code. Benchmarking such a
feature is very hard: first because you are fighting the disk
cache, second because performance depends very much on the data at
hand (some data compresses better than others), and last because there are
three interesting metrics: disk space used, write speed, and read speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dataset used&lt;/strong&gt; - I chose to compare the different strategies on some
datasets that I work with, namely the probabilistic brain atlases MNI
1mm (62MB uncompressed) and Juelich 2mm (105MB uncompressed). Whether
the data is represented as a Fortran-ordered array, or a C-ordered array
is important for the I/O performance. This data is normally stored to
disk compressed using the domain-specific Nifti format (&lt;em&gt;.nii&lt;/em&gt; files),
accessed in Python with the &lt;a class="reference external" href="http://nipy.sourceforge.net/nibabel/"&gt;Nibabel&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Libraries used&lt;/strong&gt; - I benched different compression strategies in
joblib against Nibabel’s Nifti I/O, compressed or not, and against using
Pytables to store the data buffer (without the meta-information).
Pytables exposes a variety of compression strategies, with different
speed trade-offs. In addition, I benched numpy’s builtin
&lt;em&gt;savez_compressed&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I would like to stress that I am comparing a general purpose persistence
engine (joblib) to specific I/O libraries either optimized for the data
(Nifti), or requiring some massaging to enable persistence (pytables).&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/disk.png" style="width: 66%;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/write.png" style="width: 66%;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/read.png" style="width: 66%;" /&gt;
&lt;p&gt;&lt;em&gt;Comparing to other libraries&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Actual numbers can be found &lt;a class="reference external" href="attachments/joblib_rel_0.6_speed/results_nii.csv"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Take home messages&lt;/strong&gt; - The graphs are not crystal-clear, but a few
tendencies appear:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Pytables with LZO or blosc compression is the king of the hill for
read and write speed.&lt;/li&gt;
&lt;li&gt;I/O of compressed data is often faster than with uncompressed data
for a good compression algorithm.&lt;/li&gt;
&lt;li&gt;Joblib with Zlib compression level 1 performs honorably in terms of
speed with only the Python standard library and no compiled code.&lt;/li&gt;
&lt;li&gt;Read time of memmapping (with nibabel or joblib) is negligible (it
is tiny on the graphs), however the loading time appears when you
start accessing the data.&lt;/li&gt;
&lt;li&gt;Passing in arrays with a memory layout (Fortran versus C order) that
the I/O library doesn’t expect can really slow down writing.&lt;/li&gt;
&lt;li&gt;Compressing with Zlib compression-level 1 gets you most of the disk
space gains for a reasonable cost in write/read speed.&lt;/li&gt;
&lt;li&gt;Compressing with Zlib compression-level 9 (not shown on the figures)
doesn’t buy you much in disk space, but costs a lot in writing time.&lt;/li&gt;
&lt;/ul&gt;
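The last two bullet points can be illustrated with the standard-library zlib module (a quick sketch on synthetic, regularly-structured data, not the atlases used in the benchmarks):

```python
# Compare zlib compression levels on data that compresses well.
# Level 1 already captures most of the size reduction; level 9
# mainly costs extra CPU time for little disk-space gain.
import time
import zlib

import numpy as np

# Highly regular data, standing in for the redundant arrays discussed above
data = np.arange(2_000_000, dtype=np.int32).tobytes()

for level in (1, 3, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(compressed) / len(data)
    print(f"level {level}: ratio {ratio:.3f}, time {elapsed:.3f}s")
```

On real data the exact ratios differ, but the shape of the trade-off is the same.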
&lt;/div&gt;
&lt;div class="section" id="benching-datasets-richer-than-pure-arrays"&gt;
&lt;h3&gt;Benching datasets richer than pure arrays&lt;/h3&gt;
&lt;p&gt;The datasets used so far are pretty much composed of one big array, a 4D
smooth spatial map. I wanted to test on more datasets, to see how the
performances varied with data type and richness. For this, I used the
datasets of the &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt;, real life data of various nature,
described &lt;a class="reference external" href="http://scikit-learn.org/stable/datasets/index.html"&gt;here&lt;/a&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;20 news&lt;/strong&gt; - 20 usenet news group: this data mainly consists of
text, and not numpy arrays.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LFW people&lt;/strong&gt; - Labeled faces in the wild, many pictures of
different people’s face.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LFW pairs&lt;/strong&gt; - Labeled faces in the wild, pairs of pictures for each
individual. This is a high entropy dataset, it does not have much
redundant information.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Olivetti&lt;/strong&gt; - Olivetti dataset: centered pictures of faces.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Juelich(F)&lt;/strong&gt; - Our previous Juelich atlas&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Big people&lt;/strong&gt; - The LFW people dataset, but repeated 4 times, to put
a strain on memory resources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MNI(F)&lt;/strong&gt; - Our previous MNI atlas&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Species&lt;/strong&gt; - Occurrence of species measured in Latin America, with a
lot of missing data.&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/joblib_disk.png" style="width: 50%;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/joblib_write.png" style="width: 50%;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/joblib_read.png" style="width: 50%;" /&gt;
&lt;p&gt;Actual numbers can be found
&lt;a class="reference external" href="attachments/joblib_rel_0.6_speed/joblib_results.csv"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What this tells us&lt;/strong&gt; - The main message from these benchmarks is that
datasets with redundant information, i.e. that compress well, give fast
I/O. This is not surprising. In particular, good compression can give
good I/O on text (20 news). Another result, more of a sanity check, is
that compressed I/O on big data (Big people) works as well as on
smaller data. Earlier code would start to swap. Finally, I conclude from
these graphs that compression levels from 1 to 3 buy you most of the
gains for reasonable costs, and that going up to 9 is not recommended,
unless you know that your data can be compressed a lot (species).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="lessons-learned"&gt;
&lt;h3&gt;Lessons learned&lt;/h3&gt;
&lt;p&gt;I’ll keep this paragraph short, because the information is really in
&lt;a class="reference external" href="https://github.com/joblib/joblib/blob/0.5.X/joblib/numpy_pickle.py"&gt;joblib’s code and comments&lt;/a&gt;. Don’t hesitate to have a look, it’s
BSD-licensed, so you are free to borrow what you please.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Memory copies of arrays, but also of strings and byte streams, can
really slow you down with big data.&lt;/li&gt;
&lt;li&gt;To avoid copies with numpy arrays, fully embrace numpy’s strided
memory model. For instance, you do not need to save arrays in C
order, if they are given to you in a different order. Accessing the
memory in the wrong striding direction explains the poor write
performance of pytables on Fortran-ordered Juelich.&lt;/li&gt;
&lt;li&gt;When dealing with the file system, the OS makes so much magic (e.g.
prefetching) that clever hacks tend not to work: always benchmark.&lt;/li&gt;
&lt;li&gt;Depending on the size of the data, it may be more efficient to store
subsets in different files: it introduces ‘chunks’ that avoid filling
up memory too much (parameter &lt;em&gt;cache_size&lt;/em&gt; in joblib’s code). In
addition, data of a same nature tends to compress better.&lt;/li&gt;
&lt;li&gt;The I/O stream or file object interfaces are abstractions that can
hide the data movement and the creation of large temporaries. After
experiments with GZipFile and StringIO/BytesIO I found it more
efficient to fall back to passing around big buffer objects, numpy
arrays, or strings.&lt;/li&gt;
&lt;li&gt;For reasons 4 and 5, I ended up avoiding the gzip module: raw access
to zlib with buffers gives more control. This explains a good
part of the differences in read speed for pure arrays with numpy’s
&lt;em&gt;savez_compressed&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;
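Point 6 can be sketched as follows (a simplified illustration, not joblib's actual code): feeding a zlib compression object with buffer slices avoids both the gzip module and the large intermediate copies that GzipFile/BytesIO tend to create.

```python
# Compress a numpy array's memory chunk by chunk with raw zlib,
# without going through GzipFile or BytesIO temporaries.
import zlib

import numpy as np

arr = np.zeros(1_000_000, dtype=np.float64)
buf = memoryview(arr).cast("B")      # view of the raw bytes, no copy

compressor = zlib.compressobj(1)     # compression level 1
chunk_size = 1024 * 1024
chunks = []
for start in range(0, len(buf), chunk_size):
    chunks.append(compressor.compress(buf[start:start + chunk_size]))
chunks.append(compressor.flush())
compressed = b"".join(chunks)
print(len(compressed), "compressed bytes for", buf.nbytes, "raw bytes")
```

The chunking also plays the role described in point 4: only one chunk's worth of compressed output needs to live in memory at a time.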
&lt;p&gt;One of my conclusions for joblib, is that I’ll probably use Pytables as
an optional backend for persistence in a future release.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="details-on-the-benchmarks"&gt;
&lt;h3&gt;Details on the benchmarks&lt;/h3&gt;
&lt;p&gt;These benchmarks were run on a Dell Latitude D630 laptop. That’s a
dual-core Intel Core2 Duo box, with 2MB of CPU cache.&lt;/p&gt;
&lt;p&gt;The code for the benchmarks below can be found on &lt;a class="reference external" href="https://gist.github.com/1551250"&gt;a gist&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="thanks"&gt;
&lt;h3&gt;Thanks&lt;/h3&gt;
&lt;p&gt;I’d like to thank Francesc Alted for the very useful feedback he gave on this
topic. In particular, the &lt;a class="reference external" href="http://sourceforge.net/mailarchive/message.php?msg_id=28609087"&gt;following thread&lt;/a&gt; on the pytables
mailing-list may be of interest to the reader.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="joblib"></category><category term="python"></category><category term="scientific computing"></category><category term="scipy"></category><category term="scikit-learn"></category></entry><entry><title>Scikit-learn NIPS 2011 sprint: international thanks to our sponsors</title><link href="https://gael-varoquaux.info/programming/scikit-learn-nips-2011-sprint-international-thanks-to-our-sponsors.html" rel="alternate"></link><published>2011-11-18T14:47:00+01:00</published><updated>2011-11-18T14:47:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-11-18:/programming/scikit-learn-nips-2011-sprint-international-thanks-to-our-sponsors.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;The NIPS conference: time for a sprint.&lt;/strong&gt; The &lt;a class="reference external" href="http://nips.cc/"&gt;NIPS conference&lt;/a&gt;, one
of the major conferences in machine learning, is hosted in Granada this
year. I believe that it is the first time that it is hosted in Europe.
As many of the &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; developers are part of the wider NIPS …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;The NIPS conference: time for a sprint.&lt;/strong&gt; The &lt;a class="reference external" href="http://nips.cc/"&gt;NIPS conference&lt;/a&gt;, one
of the major conferences in machine learning, is hosted in Granada this
year. I believe that it is the first time that it is hosted in Europe.
As many of the &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; developers are part of the wider NIPS
community, and many live in Europe, we jumped at the chance to
organize a truly international sprint: the &lt;a class="reference external" href="http://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events"&gt;NIPS 2011 scikit-learn
sprint&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Finding money.&lt;/strong&gt; As often with open source development, a lot of our
contributors are young people, investing their free time outside of any
request from their hierarchy. In such a situation, it can be hard to
find travel money. So we started looking for sponsors. We needed to find
a decent sum of money, as we were flying people in from places such as
the West coast of the US, or even Japan. The good news is that we found
money, and between supervisors pitching in, universities giving travel
grants, and our generous sponsors, there will be an impressive list of
contributors from all over the world at the sprint.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thanks to our sponsors.&lt;/strong&gt; The first people that we need to thank are
Google, who gave us a sizable sponsorship, and the &lt;a class="reference external" href="http://www.python.org/psf/"&gt;PSF&lt;/a&gt;, who made
Google’s sponsorship possible through their accounting and sprints
programs. We also need to thank our other sponsors, namely
&lt;a class="reference external" href="http://www.tinyclues.com/"&gt;Tinyclues&lt;/a&gt;. Thanks to these sponsors, and additional investment from
many universities and research groups, we have been able to gather a
total of 12 contributors in Granada, a handful coming from overseas.
Also, we are indebted to the &lt;a class="reference external" href="http://www.ugr.es/"&gt;University of Granada&lt;/a&gt;, and the Gnu/Linux
Granada Group (GGG), who are providing hosting for the sprint, as well
as Régine Bricquet, from INRIA, who did a lot of the trip planning for
the sponsored people.&lt;/p&gt;
&lt;p&gt;I am very much looking forward to the sprint. It will be the first time
that I meet many of the contributors in real life, and judging by the
warmth of the on-line exchanges, it will be a great moment. Besides,
Granada is known to be a lively and historical city.&lt;/p&gt;
&lt;p&gt;If you are around and want to join us to work on Python in machine
learning, send us a mail on the &lt;a class="reference external" href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general"&gt;mailing list&lt;/a&gt;.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="scikit-learn"></category><category term="scipy"></category><category term="conferences"></category><category term="sprint"></category></entry><entry><title>Cython example of exposing C-computed arrays in Python without data copies</title><link href="https://gael-varoquaux.info/programming/cython-example-of-exposing-c-computed-arrays-in-python-without-data-copies.html" rel="alternate"></link><published>2011-09-28T23:42:00+02:00</published><updated>2011-09-28T23:42:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-09-28:/programming/cython-example-of-exposing-c-computed-arrays-in-python-without-data-copies.html</id><summary type="html">&lt;p&gt;Some advice on passing arrays from C to Python avoiding copies. I use
Cython as I have found the code to be more maintainable than hand-written
Python C-API code.&lt;/p&gt;
&lt;p&gt;I found out that there was no self-contained example of creating numpy
arrays from existing data in Cython. Thus I created …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Some advice on passing arrays from C to Python avoiding copies. I use
Cython as I have found the code to be more maintainable than hand-written
Python C-API code.&lt;/p&gt;
&lt;p&gt;I found out that there was no self-contained example of creating numpy
arrays from existing data in Cython. Thus I created my own. The full code
with readme, build, and demo scripts is available on a &lt;a class="reference external" href="https://gist.github.com/1249305"&gt;gist&lt;/a&gt;. Here I only
give an executive summary.&lt;/p&gt;
&lt;p&gt;The core functionality is implemented by the
&lt;a class="reference external" href="http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#PyArray_SimpleNewFromData"&gt;PyArray_SimpleNewFromData&lt;/a&gt; function of the C API of numpy that can
create an ndarray from a pointer to the data, a simple data type, and
the shape of the data. The Cython file just builds around that function:&lt;/p&gt;
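For readers who want the flavor of this zero-copy wrapping without compiling Cython, here is a pure-Python analogue (an illustration only, not the gist's code): np.ctypeslib.as_array, like PyArray_SimpleNewFromData, builds an ndarray view over existing memory.

```python
# Wrap memory owned by a C-level buffer in a numpy array without copying.
import ctypes

import numpy as np

n = 5
c_buffer = (ctypes.c_double * n)(*range(n))  # stands in for C-allocated data
arr = np.ctypeslib.as_array(c_buffer)        # ndarray view, no data copy

arr[0] = 42.0       # writes go straight through to the underlying buffer
print(c_buffer[0])  # the mutation is visible on the C side
```

The Cython version adds what this sketch cannot: correct ownership of the memory, so that it is freed when the array is garbage-collected.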
&lt;p&gt;
&lt;script src="https://gist.github.com/1249305.js?file=cython_wrapper.pyx"&gt;&lt;/script&gt;
&lt;/p&gt;</content><category term="programming"></category><category term="scipy"></category><category term="cython"></category><category term="python"></category><category term="scientific computing"></category><category term="selected"></category></entry><entry><title>Python at scientific conferences</title><link href="https://gael-varoquaux.info/programming/python-at-scientific-conferences.html" rel="alternate"></link><published>2011-09-11T15:52:00+02:00</published><updated>2011-09-11T15:52:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-09-11:/programming/python-at-scientific-conferences.html</id><summary type="html">&lt;p&gt;Top notch scientific conferences are starting to add Python tracks to
their program. This is good news. Indeed, the scientific Python
conferences (namely &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/"&gt;Scipy&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/"&gt;EuroSciPy&lt;/a&gt; and &lt;a class="reference external" href="http://scipy.in/scipyin/2011/"&gt;Scipy India&lt;/a&gt;) are doing
a great job of bringing together people who have already heard about Python for
science, but we need to reach out to …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Top notch scientific conferences are starting to add Python tracks to
their program. This is good news. Indeed, the scientific Python
conferences (namely &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/"&gt;Scipy&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/"&gt;EuroSciPy&lt;/a&gt; and &lt;a class="reference external" href="http://scipy.in/scipyin/2011/"&gt;Scipy India&lt;/a&gt;) are doing
a great job of bringing together people who have already heard about Python for
science, but we need to reach out to specific Python communities to
maximize impact.&lt;/p&gt;
&lt;div class="section" id="esco-2012-european-seminar-on-coupled-problems"&gt;
&lt;h2&gt;ESCO 2012 - European Seminar on Coupled Problems&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://esco2012.femhub.com/"&gt;ESCO 2012&lt;/a&gt; is the 3rd event in a series of interdisciplineary meetings
dedicated to computational science challenges in multi-physics and PDEs.&lt;/p&gt;
&lt;p&gt;I was invited to ESCO last year. It was an absolute pleasure, because it
is a small conference that is very focused on discussions. I learned a
lot and could sit down with people who code top notch PDE libraries such
as FEniCS and have technical discussions. Besides, it is hosted in the
historical brewery where the Pilsner was invented. Plenty of great beer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Application areas&lt;/strong&gt; Theoretical results as well as applications are
welcome. Application areas include, but are not limited to:
Computational electromagnetics, Civil engineering, Nuclear engineering,
Mechanical engineering, Computational fluid dynamics, Computational
geophysics, Geomechanics and rock mechanics, Computational hydrology,
Subsurface modeling, Biomechanics, Computational chemistry, Climate and
weather modeling, Wave propagation, Acoustics, Stochastic differential
equations, and Uncertainty quantification.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Minisymposia&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Multiphysics and Multiscale Problems in Civil Engineering&lt;/li&gt;
&lt;li&gt;Modern Numerical Methods for ODE&lt;/li&gt;
&lt;li&gt;Porous Media Hydrodynamics&lt;/li&gt;
&lt;li&gt;Nuclear Fuel Recycling Simulations&lt;/li&gt;
&lt;li&gt;Adaptive Methods for Eigenproblems&lt;/li&gt;
&lt;li&gt;Discontinuous Galerkin Methods for Electromagnetics&lt;/li&gt;
&lt;li&gt;Undergraduate Projects in Technical Computing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Software afternoon&lt;/strong&gt; An important part of each ESCO conference is a
software afternoon featuring software projects by participants.
Any computational software that has reached a certain level of maturity
can be presented, i.e., software that is used outside of the author’s
institution and has a web page and user documentation. If you would like to
present your software project, let us know soon.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Proceedings&lt;/strong&gt; For each ESCO we strive to reserve a special issue of an
international journal with impact factor. Proceedings of ESCO 2008
appeared in Math. Comput. Simul., proceedings of ESCO 2010 in CiCP and
Appl. Math. Comput. Proceedings of ESCO 2012 will appear in Computing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important Dates&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;December 15, 2011: Abstract submission deadline.&lt;/li&gt;
&lt;li&gt;December 15, 2011: Minisymposia proposals.&lt;/li&gt;
&lt;li&gt;January 15, 2012: Notification of acceptance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="pyhpc-python-for-high-performance-computing"&gt;
&lt;h2&gt;PyHPC: Python for High performance computing&lt;/h2&gt;
&lt;p&gt;If you are doing super computing, &lt;a class="reference external" href="http://sc11.supercomputing.org/"&gt;SC11, the Super Computing
conference&lt;/a&gt; is &lt;em&gt;the&lt;/em&gt; reference conference. This year there will be a
workshop on high performance computing with Python: &lt;a class="reference external" href="http://www.dlr.de/sc/desktopdefault.aspx/tabid-1183/1638_read-31733/"&gt;PyHPC&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At the scipy conference, I was having a discussion with some of the
attendees on how people often still do process management and I/O with
Fortran in big computing environments. This is counterproductive.
However, as success stories of supercomputing folks using high-level
languages are not advertised, this is bound to stay. Come and tell us
how you use Python for high performance computing!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Topics&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Python-based scientific applications and libraries&lt;/li&gt;
&lt;li&gt;High performance computing&lt;/li&gt;
&lt;li&gt;Parallel Python-based programming languages&lt;/li&gt;
&lt;li&gt;Scientific visualization&lt;/li&gt;
&lt;li&gt;Scientific computing education&lt;/li&gt;
&lt;li&gt;Python performance and language issues&lt;/li&gt;
&lt;li&gt;Problem solving environments with Python&lt;/li&gt;
&lt;li&gt;Performance analysis tools for Python applications&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Papers&lt;/strong&gt; We invite you to submit a paper of up to 10 pages via the
submission site. Authors are encouraged to use IEEE two column format.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important Dates&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Full paper submission: September 19, 2011&lt;/li&gt;
&lt;li&gt;Notification of acceptance: October 7, 2011&lt;/li&gt;
&lt;li&gt;Camera-ready papers: October 31, 2011&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="conferences"></category><category term="python"></category><category term="scipy"></category><category term="science"></category><category term="scientific computing"></category></entry><entry><title>Conference posters</title><link href="https://gael-varoquaux.info/science/conference-posters.html" rel="alternate"></link><published>2011-09-05T04:15:00+02:00</published><updated>2011-09-05T04:15:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-09-05:/science/conference-posters.html</id><summary type="html">&lt;p&gt;At the request of a friend, I am putting up some of the posters that I
recently presented at conferences.&lt;/p&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_nips.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Large-scale functional-connectivity graphical models for individual
subjects using population prior.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00512451/en"&gt;our NIPS work&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_nips.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_ipmi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Multi-subject dictionary learning to segment an atlas of brain
spontaneous activity …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;At the request of a friend, I am putting up some of the posters that I
recently presented at conferences.&lt;/p&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_nips.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Large-scale functional-connectivity graphical models for individual
subjects using population prior.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00512451/en"&gt;our NIPS work&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_nips.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_ipmi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Multi-subject dictionary learning to segment an atlas of brain
spontaneous activity.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00588898/en"&gt;our IPMI work&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_ipmi.png"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_mayavi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Mayavi for 3D visualization of neuroimaging data: powerful scripting
and reusable components in Python.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_mayavi.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_scikit.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Machine learning for fMRI in Python: inverse inference with
scikit-learn.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_scikit.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
</content><category term="science"></category><category term="neuroimaging"></category><category term="machine learning"></category><category term="science"></category><category term="publishing"></category></entry><entry><title>Hiring a junior developer on the scikit-learn</title><link href="https://gael-varoquaux.info/programming/hiring-a-junior-developer-on-the-scikit-learn.html" rel="alternate"></link><published>2011-09-03T07:26:00+02:00</published><updated>2011-09-03T07:26:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-09-03:/programming/hiring-a-junior-developer-on-the-scikit-learn.html</id><summary type="html">&lt;p&gt;Once again, we are looking for a junior developer to work on the
scikit-learn. Below is the official job posting. As a personal remark, I
would like to stress that this is a unique opportunity to be paid for
two years to work on learning and improving the scientific Python …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Once again, we are looking for a junior developer to work on the
scikit-learn. Below is the official job posting. As a personal remark, I
would like to stress that this is a unique opportunity to be paid for
two years to work on learning and improving the scientific Python
toolstack.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="section" id="job-description"&gt;
&lt;h2&gt;Job Description&lt;/h2&gt;
&lt;p&gt;INRIA is looking to hire a young graduate on a 2-year position to help
with the community-driven development of scikit-learn, an open source
machine-learning library in Python. The scikit-learn is one of the
major machine-learning libraries in Python. It aims to be
state-of-the-art on mid-size to large datasets by harnessing the power
of the scientific Python toolstack.&lt;/p&gt;
&lt;p&gt;Speaking French is not a requirement, as it is an international team.&lt;/p&gt;
&lt;div class="section" id="requirements"&gt;
&lt;h3&gt;Requirements&lt;/h3&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Programming skills in Python and C/C++&lt;/li&gt;
&lt;li&gt;Understanding of quality assurance in software development:
test-driven programming, version control, technical documentation.&lt;/li&gt;
&lt;li&gt;Some knowledge of Linux/Unix&lt;/li&gt;
&lt;li&gt;Software design skills&lt;/li&gt;
&lt;li&gt;Knowledge of open-source development and community-driven
environments&lt;/li&gt;
&lt;li&gt;Good technical English level&lt;/li&gt;
&lt;li&gt;An experience in statistical learning or a mathematical-oriented
mindset is a plus&lt;/li&gt;
&lt;li&gt;We can only hire a young graduate who has received a master’s or
equivalent degree at most a year ago.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="about-inria"&gt;
&lt;h2&gt;About INRIA&lt;/h2&gt;
&lt;p&gt;INRIA is the French computer science research institute. It is recognized
world-wide as one of the leading research institutions and has a strong
expertise in machine learning. You will be working in the &lt;a class="reference external" href="https://parietal.saclay.inria.fr"&gt;Parietal
team&lt;/a&gt; that makes a heavy use of Python for brain imaging analysis.&lt;/p&gt;
&lt;p&gt;Parietal is a small research team (around 10 people) with an excellent
technical knowledge of scientific and numerical computing in Python as
well as a fine understanding of algorithmic issues in machine learning
and statistics. Parietal is committed to investing in scikit-learn.&lt;/p&gt;
&lt;p&gt;Working at Parietal is a unique opportunity to improve your skills in
machine learning and numerical computing in Python. In addition, working
full time on scikit-learn, a very active open-source project, will give
you premium experience in open-source community management and
collaborative project development.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="contact-info"&gt;
&lt;h2&gt;Contact Info:&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Technical Contact&lt;/strong&gt;: Bertrand Thirion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;E-mail contact&lt;/strong&gt;: bertrand dotnospam thirion atnospam inria
dotnospam fr&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HR Contact&lt;/strong&gt;: Marie Domingues&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;E-mail Contact&lt;/strong&gt;: marie dotnospam domingues atnospam inria
dotnospam fr&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No telecommuting&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="machine learning"></category><category term="python"></category><category term="science"></category><category term="jobs"></category><category term="scikit-learn"></category></entry><entry><title>My conference travels: Scipy 2011 and HBM 2011</title><link href="https://gael-varoquaux.info/science/my-conference-travels-scipy-2011-and-hbm-2011.html" rel="alternate"></link><published>2011-07-23T23:45:00+02:00</published><updated>2011-07-23T23:45:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-07-23:/science/my-conference-travels-scipy-2011-and-hbm-2011.html</id><summary type="html">&lt;div class="section" id="the-scipy-2011-conference-in-austin"&gt;
&lt;h2&gt;The Scipy 2011 conference in Austin&lt;/h2&gt;
&lt;p&gt;Last week, I was at the Scipy conference in Austin. It was really great
to see old friends, and Austin is such a nice place.&lt;/p&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6143/5931239349_13c78bbef5_m.jpg" style="width: 50%;" /&gt;
&lt;p&gt;The Scipy conference was held in &lt;a class="reference external" href="http://www.meetattexas.com/"&gt;UT Austin’s conference center&lt;/a&gt;, which
is a fantastic venue. This is the …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="the-scipy-2011-conference-in-austin"&gt;
&lt;h2&gt;The Scipy 2011 conference in Austin&lt;/h2&gt;
&lt;p&gt;Last week, I was at the Scipy conference in Austin. It was really great
to see old friends, and Austin is such a nice place.&lt;/p&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6143/5931239349_13c78bbef5_m.jpg" style="width: 50%;" /&gt;
&lt;p&gt;The Scipy conference was held in &lt;a class="reference external" href="http://www.meetattexas.com/"&gt;UT Austin’s conference center&lt;/a&gt;, which
is a fantastic venue. This is the first geek conference I have been to
where the wireless network worked flawlessly with good bandwidth, even
though 200 geeks were pounding on it. As a tutorial presenter, I found
this incredibly useful.&lt;/p&gt;
&lt;div class="section" id="conference-highlight"&gt;
&lt;h3&gt;Conference highlight&lt;/h3&gt;
&lt;p&gt;Here is a short list of what I &lt;em&gt;felt&lt;/em&gt; were the big trends and highlights
of the conference. This is obviously biased by my own interests. I am
not listing parallel computing: it is clearly an important area of
progress and debate, but that has been the case for the last few years.&lt;/p&gt;
&lt;div class="section" id="eric-jone-s-keynote"&gt;
&lt;h4&gt;Eric Jones’s keynote&lt;/h4&gt;
&lt;p&gt;Of course Eric’s keynote was excellent. Eric is a great speaker and
always has good insights on how to run a team and a project. This year
he shared (some of) his tricks for making Enthought deliver on software
projects: &lt;em&gt;“What Matters in Scientific Software Projects? 10 Years of
Success and Failure Distilled”&lt;/em&gt;. The video is not yet online,
unfortunately. Grab it when you can.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="hilary-mason-s-keynote"&gt;
&lt;h4&gt;Hilary Mason’s keynote&lt;/h4&gt;
&lt;p&gt;Hilary is an applied data geek, just what I like! She gave an
interesting &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/mason_awesome.pdf"&gt;keynote&lt;/a&gt; on how &lt;a class="reference external" href="https://bitly.com/"&gt;bitly&lt;/a&gt; (a URL-shortening startup, for
those living under a rock) mines the requests on the URLs that they
serve to do things like ranking or detecting phishing attempts. Of
course, I couldn’t resist asking what tools they used, thinking that she
would reply R. She mentioned that they do roll some of their own, but
she also mentioned &lt;a class="reference external" href="https://mlpy.fbk.eu/"&gt;mlpy&lt;/a&gt; and &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, with a remark that it was very
nice, at which point I believe that I blushed. She stressed that R was
hard to use in production and raised the point that academic software
most often doesn’t pan out in these settings (I hope that I am not
distorting her thoughts too much).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="statistics-and-learning"&gt;
&lt;h4&gt;Statistics and learning&lt;/h4&gt;
&lt;p&gt;I had the feeling that statistics and data mining played a big role at
scipy this year. Maybe it is because I am more tuned to these questions
nowadays, but some signs do not lie. There was a special session on
Python in data sciences, a panel discussion on Python in finance and
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/cron_gpustats.pdf"&gt;many&lt;/a&gt;
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/refsdal_sherpa.zip"&gt;many&lt;/a&gt;
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/mckinney_time_series.pdf"&gt;statistics&lt;/a&gt; and &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/determan_vision_spreadsheet.pdf"&gt;data&lt;/a&gt; &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/caraciolo_crab_recommendation.pdf"&gt;related&lt;/a&gt; talks, as well as two tutorials and
a keynote.&lt;/p&gt;
&lt;p&gt;In addition, on a personal basis it was really great to meet part of the
team behind &lt;a class="reference external" href="http://statsmodels.sourceforge.net/"&gt;scikits.statsmodels&lt;/a&gt;. We had plenty of very interesting
discussions and they really helped me understand the way that some
statisticians approach data: very differently from me, because they have
fairly little data, and can afford to inspect reports and graphs,
whereas I rely more on automated decision rules.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="ipython"&gt;
&lt;h4&gt;IPython&lt;/h4&gt;
&lt;p&gt;&lt;a class="reference external" href="http://twitter.com/#!/minrk"&gt;Min&lt;/a&gt; gave &lt;a class="reference external" href="http://minrk.github.com/scipy-tutorial-2011/"&gt;an excellent tutorial&lt;/a&gt; on how to do parallel computing
using IPython. These guys have certainly done an excellent job to make
cluster-level programming in Python easier. While they don’t play yet
terribly well with the restrictive job-queue policy of the clusters to
which I have access, they have all the right low-level tools to address
these issues and Min told me that they will be working on this next
year.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://fperez.org/"&gt;Fernando&lt;/a&gt; gave &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/perez_ipython.pdf"&gt;an impressive talk&lt;/a&gt; on the new developments of
IPython. In particular, the new Qt-based terminal is &lt;em&gt;`really cool`_&lt;/em&gt;
and there is a web frontend in the works.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="cluster-computing-as-facility"&gt;
&lt;h4&gt;Cluster computing as a facility&lt;/h4&gt;
&lt;p&gt;While I mention cluster computing, I must confess that I have always
stayed away from this beast: I find it a time sink, and I find that I
get more science done without it. This is why I really liked the
presentation by the &lt;a class="reference external" href="http://www.picloud.com/"&gt;PiCloud&lt;/a&gt; guys on, … cluster computing! The
reason I liked it is that they start from the principle that your time
is more important than CPU time. I hear so much about &lt;em&gt;bigger better
faster more&lt;/em&gt; high-performance computing while researchers forget to
address the biggest issue:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
… a whole generation of researchers turned into system
administrators by the demands of computing - Dan Reed, VP Microsoft&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div class="section" id="abstract-code-manipulation-for-numerical-computation"&gt;
&lt;h4&gt;Abstract code manipulation for numerical computation&lt;/h4&gt;
&lt;p&gt;Finally, a trend that is picking up in the Python-based scientific
computing is the abstract manipulation of expressions to generate fast
code. This ranges from &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Just-in-time_compilation"&gt;JIT (just in time) compilation&lt;/a&gt; generating
machine code, to rewriting mathematical expressions. Peter Wang gave a
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/wang_metagraph.pdf"&gt;talk&lt;/a&gt; in this vein, but the topic was also brought up by Aron Ahmadia.
Of course this is not new: &lt;a class="reference external" href="http://code.google.com/p/numexpr/"&gt;numexpr&lt;/a&gt; has been using these tricks for
years, and more recently &lt;a class="reference external" href="http://deeplearning.net/software/theano/"&gt;Theano&lt;/a&gt; has been making good use of GPUs
thanks to them.&lt;/p&gt;
&lt;p&gt;This topic is emerging in more and more places for good reasons: with
faster and more numerous CPUs, the number of operations per second is
less of a bottleneck, and the order in which operations are applied, and
where they physically happen, is becoming critical.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="my-own-agenda"&gt;
&lt;h3&gt;My own agenda&lt;/h3&gt;
&lt;div class="section" id="sprinting-on-scikit-learn"&gt;
&lt;h4&gt;Sprinting on scikit-learn&lt;/h4&gt;
&lt;a class="reference external image-reference" href="http://scikit-learn.org/dev/auto_examples/mixture/plot_gmm.html"&gt;&lt;img alt="" src="http://scikit-learn.org/dev/_images/plot_gmm_1.png" /&gt;&lt;/a&gt;
&lt;p&gt;We had two days of sprints after the conference. A huge number of people
voted to sprint on the &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, but only two people showed up:
Minwoo Lee and &lt;a class="reference external" href="http://www-etud.iro.umontreal.ca/~wardefar"&gt;David Warde-Farley&lt;/a&gt;. Thanks heaps to these guys! My
priority for the sprint was to review and merge branches. That worked
beautifully: we merged in the following features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/mixture.html#the-dirichlet-process"&gt;Dirichlet-Process Gaussian mixture models&lt;/a&gt;, by Alex Passos&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/decomposition.html#sparse-principal-components-analysis-sparsepca"&gt;Sparse PCA&lt;/a&gt; by Vlad Niculae.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/gaussian_process.html"&gt;Speedups in Gaussian processes&lt;/a&gt; by Vincent Schut.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/clustering.html#mini-batch-k-means"&gt;Sparse implementation of the mini-batch k-means&lt;/a&gt; by Peter
Prettenhofer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, David added a dataset downloader for the &lt;a class="reference external" href="http://cs.nyu.edu/~roweis/data/olivettifaces.gif"&gt;Olivetti face
dataset&lt;/a&gt;, which is lightweight, but rich enough to give very
interesting examples.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="my-presentation"&gt;
&lt;h4&gt;My presentation&lt;/h4&gt;
&lt;p&gt;I gave a talk on my research work, and the software stack that
underlies it: &lt;a class="reference external" href="http://www.slideshare.net/GaelVaroquaux/python-for-brain-mining-neuroscience-with-state-of-the-art-machine-learning-and-data-visualization"&gt;Python for brain mining: (neuro)science with state of
the art machine learning and data visualization&lt;/a&gt;. I think that it was
well received by the audience. What is really crazy is that I uploaded
the slides on slideshare, and they got a ridiculous number of views. I
suspect that it is because of the title: &lt;em&gt;brain mining&lt;/em&gt; does sound
fancy.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="mayavi"&gt;
&lt;h4&gt;Mayavi&lt;/h4&gt;
&lt;p&gt;Because of technical and political reasons, I cannot get &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi&lt;/a&gt;
installed on the computers at work. This, and the fact that many people
ask for help, but few contribute, even in the form of answers on the
mailing list, had been wearing me down a bit. I got so much great
feedback on Mayavi at the conference that I feel much more motivated to
invest energy in it.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-humain-brain-mapping-conference-in-quebec-city"&gt;
&lt;h2&gt;The Human Brain Mapping conference in Quebec City&lt;/h2&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6018/5968391718_002105ccd1.jpg" style="width: 50%;" /&gt;
&lt;p&gt;This blog post is getting too long. It is well beyond my own attention
span. However, scipy is not the only conference I have been to
recently. Two weeks before, I was in Quebec for the &lt;a class="reference external" href="http://www.humanbrainmapping.org/i4a/pages/index.cfm?pageID=3419"&gt;Human Brain Mapping
conference&lt;/a&gt;. As every year, HBM is a fun ride. It has fantastic parties
in the evenings. But I didn’t stay up too late, as this year was a busy
one for me: I was teaching in an educational course, and chairing a
symposium, both on comparing brain functional connectivity across
subjects.&lt;/p&gt;
&lt;p&gt;But the really big deal at HBM this year came at the end. As I was
dozing off, vaguely listening to Russ Poldrack’s closing comments, he
brought up on screen a slide entitled &lt;em&gt;the year of Python&lt;/em&gt;. This is a
big deal: we’ve been working for years to get Python into the
neuroimaging world, and it is clearly making progress, despite all the
roadblocks.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="conferences"></category><category term="travels"></category><category term="machine learning"></category><category term="mayavi"></category><category term="python"></category><category term="science"></category><category term="scikit-learn"></category></entry><entry><title>Euroscipy 2011: early bird deadline soon</title><link href="https://gael-varoquaux.info/programming/euroscipy-2011-early-bird-deadline-soon.html" rel="alternate"></link><published>2011-07-22T00:44:00+02:00</published><updated>2011-07-22T00:44:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-07-22:/programming/euroscipy-2011-early-bird-deadline-soon.html</id><summary type="html">&lt;div class="section" id="euroscipy-2011-register-now-for-early-bird-prices"&gt;
&lt;h2&gt;Euroscipy 2011: register now for early bird prices&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The deadline for early-bird registration at the Euroscipy conference
is this Sunday&lt;/strong&gt;. Beyond this deadline prices will double. &lt;strong&gt;Register
now to get a great deal.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;To register, simply go to &lt;a class="reference external" href="http://www.euroscipy.org"&gt;www.euroscipy.org&lt;/a&gt;, log in using the link on
the top right …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="euroscipy-2011-register-now-for-early-bird-prices"&gt;
&lt;h2&gt;Euroscipy 2011: register now for early bird prices&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The deadline for early-bird registration at the Euroscipy conference
is this Sunday&lt;/strong&gt;. Beyond this deadline prices will double. &lt;strong&gt;Register
now to get a great deal.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;To register, simply go to &lt;a class="reference external" href="http://www.euroscipy.org"&gt;www.euroscipy.org&lt;/a&gt;, log in using the link on
the top right, and follow the &lt;em&gt;‘Register now for the conference’&lt;/em&gt; link
on the top left.&lt;/p&gt;
&lt;p&gt;The conference is a great opportunity to learn the intricacies of
numerical and scientific computing in Python. You can register for the
tutorials in an &lt;a class="reference external" href="http://www.euroscipy.org/track/4010?vid=tracktalkslist"&gt;intro track&lt;/a&gt;, which will take you from beginner to fully
autonomous user, or for an &lt;a class="reference external" href="http://www.euroscipy.org/track/4011?vid=tracktalkslist"&gt;advanced track&lt;/a&gt;, to learn from the experts on
topics such as image processing, GPU computing, machine learning or
optimization. The tutorials are a fairly unique occasion to improve your
skills, as you will seldom get such a concentration of experts.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="some-program-highlights"&gt;
&lt;h2&gt;Some program highlights&lt;/h2&gt;
&lt;p&gt;After the 2 days of tutorials, the conference itself will host 2 keynotes:
one by &lt;a class="reference external" href="http://mcs.open.ac.uk/mp8/"&gt;Marian Petre&lt;/a&gt;, of the Open University, well known for her
empirical studies of software development, and another by &lt;a class="reference external" href="http://fperez.org/"&gt;Fernando
Perez&lt;/a&gt;, a pioneer in scientific computing in Python and the original
author of IPython.&lt;/p&gt;
&lt;p&gt;Glancing at the &lt;a class="reference external" href="http://www.euroscipy.org/track/3992?vid=tracktalkslist"&gt;program&lt;/a&gt;, we can see that a wide range of topics is
covered:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;pure computer-science topics, such as &lt;a class="reference external" href="http://www.euroscipy.org/talk/4186"&gt;concurrent programming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;traditional &lt;em&gt;hard&lt;/em&gt; sciences, such as &lt;a class="reference external" href="http://www.euroscipy.org/talk/4201"&gt;multi-physics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;simulation of complex systems, for instance &lt;a class="reference external" href="http://www.euroscipy.org/talk/4219"&gt;network modeling in
epidemiology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;or novel applications of quantitative large-data processing, as in
&lt;a class="reference external" href="http://www.euroscipy.org/talk/4182"&gt;legal research&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The variety of the topics illustrates what is for me one of the greatest
benefits of the scipy conferences: they form a forum to exchange ideas
and techniques to find new solutions to scientific, numerical and data
analysis problems. Unlike pure computer science conferences, they sit at
the frontier between applications and bleeding-edge computing
developments, &lt;strong&gt;because these people really use the tools presented to
solve their problems&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In addition to this rich program, we will have 2 days of &lt;a class="reference external" href="http://www.euroscipy.org/track/5201"&gt;sprints&lt;/a&gt;
before the conference, as well as 2-day-long satellite conferences on
Python in &lt;a class="reference external" href="http://www.euroscipy.org/card/pyphy2011"&gt;Physics&lt;/a&gt; and &lt;a class="reference external" href="http://pythonneuro.sciencesconf.org/"&gt;NeuroScience&lt;/a&gt; after the conference. This is
how what used to be a small conference can now be a full 8-day event if
you order all the extras.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="conferences"></category></entry><entry><title>Hiring a junior engineer on the scikit-learn</title><link href="https://gael-varoquaux.info/programming/hiring-a-junior-engineer-on-the-scikit-learn.html" rel="alternate"></link><published>2011-05-14T19:10:00+02:00</published><updated>2011-05-14T19:10:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-05-14:/programming/hiring-a-junior-engineer-on-the-scikit-learn.html</id><summary type="html">&lt;p&gt;The &lt;a class="reference external" href="http://www.scikit-learn.org"&gt;scikit-learn&lt;/a&gt; is a Python module for machine learning. The
project builds on the scientific and numerical tools of the &lt;a class="reference external" href="http://www.scipy.org"&gt;scipy
community&lt;/a&gt; to provide state-of-the-art data analysis tools. It is
developed by a community of open source developers to which my research
team (&lt;a class="reference external" href="https://parietal.saclay.inria.fr/"&gt;Parietal&lt;/a&gt;, &lt;a class="reference external" href="http://www.inria.fr/"&gt;INRIA&lt;/a&gt;) contributes a lot and is …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The &lt;a class="reference external" href="http://www.scikit-learn.org"&gt;scikit-learn&lt;/a&gt; is a Python module for machine learning. The
project builds on the scientific and numerical tools of the &lt;a class="reference external" href="http://www.scipy.org"&gt;scipy
community&lt;/a&gt; to provide state-of-the-art data analysis tools. It is
developed by a community of open source developers to which my research
team (&lt;a class="reference external" href="https://parietal.saclay.inria.fr/"&gt;Parietal&lt;/a&gt;, &lt;a class="reference external" href="http://www.inria.fr/"&gt;INRIA&lt;/a&gt;) contributes a lot and is a &lt;a class="reference external" href="http://github.com/scikit-learn/scikit-learn"&gt;thriving
project&lt;/a&gt;. Its mailing list fosters many discussions on code and machine
learning topics, it has &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/user_guide.html"&gt;very detailed documentation&lt;/a&gt;, and &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/whats_new.html"&gt;a tight
release cycle&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Although scikits.learn is mostly developed by volunteers, INRIA has
funded a two-year position for a junior engineer —currently &lt;a class="reference external" href="http://fseoane.net/blog/"&gt;Fabian
Pedregosa&lt;/a&gt;— to help with the core management and integration of the
project. This funding is coming to an end in fall 2011 &lt;a class="reference external" href="#footnote"&gt;[*]&lt;/a&gt;. The
good news is that we have been allocated new funding to hire an engineer
on the scikit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We are thus looking to hire a junior engineer for a 2-year position to
work on the scikits.learn at INRIA in Saclay, near Paris&lt;/strong&gt;. The position
is only available to candidates that have received a &lt;strong&gt;masters or
equivalent degree at most a year ago&lt;/strong&gt; — this is non negotiable: we
cannot hire more senior candidates.&lt;/p&gt;
&lt;p&gt;We are looking for a developer with good open-source project management
skills: the successful candidate will review and merge patches, ensure
the quality of the scikit, make releases, coordinate development on the
mailing list and on github. Good knowledge of Python and its scientific
ecosystem is expected. A mathematical or computer-science oriented
mindset is a plus, as the project involves working with machine learning
algorithms.&lt;/p&gt;
&lt;p&gt;The candidate should be willing to relocate to work daily in the
&lt;a class="reference external" href="http://www-dsv.cea.fr/en/instituts/institut-d-imagerie-biomedicale-i2bm/services/neurospin-neurospin"&gt;Neurospin brain research institute&lt;/a&gt;, in which the Parietal team is
located. Knowledge of French is not required, as the team and the
institute are very international. Non-EU candidates are welcome, but the
hiring process will take longer.&lt;/p&gt;
&lt;p&gt;You will be working in a very stimulating environment. You will be
employed by INRIA, the French computer science research institute. As
such, you will benefit from the expertise of the institute’s researchers
and engineers. Team members contribute to various scientific Python
libraries (in addition to scikits.learn, &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi&lt;/a&gt;, &lt;a class="reference external" href="http://nipy.org"&gt;nipy&lt;/a&gt;, &lt;a class="reference external" href="http://packages.python.org/joblib/"&gt;joblib&lt;/a&gt;).
In addition, you will be working in a brain research institute, in
collaboration with leading &lt;a class="reference external" href="http://lnao.fr"&gt;methods researchers&lt;/a&gt; and &lt;a class="reference external" href="http://www.unicog.org/pm/pmwiki.php"&gt;neuroscientists&lt;/a&gt;
that use machine learning to gain new insights on brain processes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;To apply:&lt;/strong&gt; Prepare a CV and a motivation
letter. The deadline for applications is mid-June, but we will be
selecting candidates and conducting interviews before then. &lt;strong&gt;Don’t send me
CVs&lt;/strong&gt;. The formal job description, as well as instructions to apply, can
be found on this &lt;a class="reference external" href="http://en.inria.fr/institute/recruitment/offers/young-graduate-engineers/%28view%29/details.html?id=PNGFK026203F3VBQB6G68LOE1&amp;amp;LOV5=4510&amp;amp;ContractType=4545&amp;amp;LG=EN&amp;amp;Resultsperpage=20&amp;amp;nPostingID=5534&amp;amp;nPostingTargetID=10628&amp;amp;option=52&amp;amp;sort=DESC&amp;amp;nDepartmentID=10"&gt;page&lt;/a&gt;. The page is mostly in French, sorry; use
Google Translate if you don’t understand. At the bottom of the page you
will find a link to apply.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;strong&gt;[*]&lt;/strong&gt; Fabian will most probably stay with us to do a PhD on
&lt;a class="reference external" href="https://parietal.saclay.inria.fr/research"&gt;analysis of large brain functional imaging datasets&lt;/a&gt;.&lt;/p&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="jobs"></category><category term="machine learning"></category><category term="scipy"></category><category term="science"></category></entry><entry><title>EuroScipy: the program is filling up, and the submission deadline nearing</title><link href="https://gael-varoquaux.info/programming/euroscipy-the-program-is-filling-up-and-the-submission-deadline-nearing.html" rel="alternate"></link><published>2011-04-30T17:21:00+02:00</published><updated>2011-04-30T17:21:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-04-30:/programming/euroscipy-the-program-is-filling-up-and-the-submission-deadline-nearing.html</id><summary type="html">&lt;div class="section" id="submission-deadline-may-8th"&gt;
&lt;h2&gt;Submission deadline May 8th&lt;/h2&gt;
&lt;p&gt;The deadline for the call for presentation for the EuroScipy conference
is on &lt;strong&gt;May 8th&lt;/strong&gt;. There is only a week and a half left.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.euroscipy.org/"&gt;EuroScipy&lt;/a&gt; will be held in &lt;strong&gt;Paris, August 25-28&lt;/strong&gt;. It is the European
meeting for users of Python in scientific and numerical-intensive
applications …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="submission-deadline-may-8th"&gt;
&lt;h2&gt;Submission deadline May 8th&lt;/h2&gt;
&lt;p&gt;The deadline for the call for presentation for the EuroScipy conference
is on &lt;strong&gt;May 8th&lt;/strong&gt;. There is only a week and a half left.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.euroscipy.org/"&gt;EuroScipy&lt;/a&gt; will be held in &lt;strong&gt;Paris, August 25-28&lt;/strong&gt;. It is the European
meeting for users of Python in scientific and numerical-intensive
applications. It strives to bring together both users and developers of
scientific and numerical tools, as well as academic research and state
of the art industry. The conference will host 2 days of tutorials and 2
days of technical presentations.&lt;/p&gt;
&lt;p&gt;Lately, numerical computing in Python has started reaching a much wider
audience than its traditional academic one. This is partly
because Python is making its way in major engineering companies, but
also because more and more industries are processing large amounts of
data, and find precious &lt;strong&gt;data analytics tools&lt;/strong&gt; in the &lt;a class="reference external" href="http://www.scipy.org"&gt;Scipy&lt;/a&gt;
community. In this spirit, this year there will be a &lt;a class="reference external" href="http://www.euroscipy.org/talk/4061"&gt;tutorial on
machine learning with Python&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="poster-session"&gt;
&lt;h2&gt;Poster session&lt;/h2&gt;
&lt;p&gt;Last year, the organizing committee had to refuse a large fraction of
the proposals, because there were not enough slots available. We had
considered organizing a poster session, but the logistics were too
challenging for our limited resources. Indeed, EuroSciPy still tries to
be organized as a hackers’ and coders’ conference, rather than an
industry-level one. For instance, we keep the prices to a minimum, in
order to make it easy for young people traveling on their own budget to
join us. Getting 200 attendees, as we did last year, did strain our small
organizing committee.&lt;/p&gt;
&lt;p&gt;This year, we had unexpected backing from the &lt;a class="reference external" href="http://www.phys.ens.fr/"&gt;physics department&lt;/a&gt; of
the &lt;a class="reference external" href="http://www.ens.fr/?lang=en"&gt;ENS&lt;/a&gt;. They are extremely enthusiastic about Python, which they now
use for teaching and research. This made me really happy, as this is
where I studied. They offered help, and in particular help with the
local organization.&lt;/p&gt;
&lt;p&gt;Thus I am able to announce that thanks to the physics department of the
ENS, we will be able to host a poster session!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="an-exciting-program-shaping-up"&gt;
&lt;h2&gt;An exciting program shaping up&lt;/h2&gt;
&lt;p&gt;The program is starting to shape up, and it is looking really good, in
my eyes.&lt;/p&gt;
&lt;div class="section" id="keynotes"&gt;
&lt;h3&gt;Keynotes&lt;/h3&gt;
&lt;p&gt;We will be having two keynote speakers, one directly from the SciPy
community, Fernando Perez, and one probably less known to this
community, Marian Petre.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://mcs.open.ac.uk/mp8/"&gt;Marian Petre&lt;/a&gt;: Marian is the director of the &lt;a class="reference external" href="http://crc.open.ac.uk/"&gt;Center for Research
in Computing&lt;/a&gt;, at the &lt;a class="reference external" href="http://www.open.ac.uk/"&gt;Open University&lt;/a&gt;. She is interested in
empirical studies of software development. I am very excited to hear
a bit more about the often-forgotten human factor behind
every coding job, big or small. In my experience, scientific computing
and computational sciences pay a hefty price because they don’t
acknowledge well enough the gap between good ideas and tractable
code.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://fperez.org/"&gt;Fernando Perez&lt;/a&gt;: Fernando is a research scientist in
neuroscience at &lt;a class="reference external" href="http://neuroscience.berkeley.edu/"&gt;UC Berkeley&lt;/a&gt;. Before that, he was successively a
physicist and a mathematician. He has been an early advocate of the
scientific Python ecosystem, in addition to being the creator of
IPython. His vision has always been oriented toward finding a
computing environment that makes scientific creativity easier.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="tutorials"&gt;
&lt;h3&gt;Tutorials&lt;/h3&gt;
&lt;p&gt;The tutorial program is now final, and can be seen on the &lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2011"&gt;schedule&lt;/a&gt;.
Like last year, we will have two tracks:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/track/4010"&gt;An introductory track&lt;/a&gt;, designed as a two-day course addressing
the different aspects of the Python language and the scientific
computing modules, to bring beginners up to full speed. At the end of
the two days, attendees should be able to solve simple computational
problems using Python alone.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/track/4011"&gt;An advanced track&lt;/a&gt;, in which experts in various aspects of
scientific and numerical computing in Python share their knowledge in
2-hour-long tutorials.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="python-in-neuroscience-satellite"&gt;
&lt;h2&gt;Python in NeuroScience satellite&lt;/h2&gt;
&lt;p&gt;The two days following the conference, there will be &lt;a class="reference external" href="http://pythonneuro.sciencesconf.org/"&gt;a satellite
meeting on the use of Python in neuroscience&lt;/a&gt;. It will be a smaller and more
focused event, in which neuroscientists will be able to exchange on
technical aspects of computation and data management in Python.
Hopefully it will foster interesting discussions and collaborations. If you
are interested, you can submit a talk proposal for this satellite
meeting &lt;a class="reference external" href="http://pythonneuro.sciencesconf.org/"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-center" src="http://farm5.static.flickr.com/4143/4780097256_14c99f3b32.jpg" style="width: 60%;" /&gt;
&lt;p&gt;&lt;strong&gt;Come and join us at EuroSciPy in Paris, August 25-28. Paris is a great
city. The SciPy community is a friendly one.&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="conferences"></category><category term="scipy"></category><category term="python"></category><category term="science"></category></entry><entry><title>Scikit-learn sprint on April 1st</title><link href="https://gael-varoquaux.info/programming/scikit-learn-sprint-on-april-1st.html" rel="alternate"></link><published>2011-03-26T13:27:00+01:00</published><updated>2011-03-26T13:27:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-03-26:/programming/scikit-learn-sprint-on-april-1st.html</id><summary type="html">&lt;a class="reference external image-reference" href="http://scikit-learn.sourceforge.net/"&gt;&lt;img alt="" src="http://scikit-learn.sourceforge.net/stable/_static/scikit-learn-logo-small.png" /&gt;&lt;/a&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt; team is organizing a sprint on April 1st (next
Friday). Join us in &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events"&gt;Paris, Boston, or on IRC&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;With the rise of data science, scikit-learn, a &lt;strong&gt;BSD-licensed
Python package for machine learning&lt;/strong&gt;, is becoming an asset for more and
more endeavors. Machine learning has traditionally …&lt;/p&gt;</summary><content type="html">&lt;a class="reference external image-reference" href="http://scikit-learn.sourceforge.net/"&gt;&lt;img alt="" src="http://scikit-learn.sourceforge.net/stable/_static/scikit-learn-logo-small.png" /&gt;&lt;/a&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt; team is organizing a sprint on April 1st (next
Friday). Join us in &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events"&gt;Paris, Boston, or on IRC&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;With the rise of data science, scikit-learn, a &lt;strong&gt;BSD-licensed
Python package for machine learning&lt;/strong&gt;, is becoming an asset for more and
more endeavors. Machine learning has traditionally been considered
very technical and inaccessible to non-mathematicians. We are aiming
to break this barrier.&lt;/p&gt;
&lt;p&gt;The sprint will be focused on pragmatic down-to-earth improvements in
the scikit. Our goal is to make it easy for people to contribute. A list
of tasks and organization details can be found on the &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events"&gt;sprint planning&lt;/a&gt;
wiki page. Amongst other things, we’ll be working on:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;integrating new learning algorithms&lt;/strong&gt;, in particular merging in the
many excellent pull requests that we have: &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/103"&gt;hierarchical
clustering&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/103"&gt;data transformation using linear discriminant
analysis&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/107"&gt;multinomial naive Bayes classifier&lt;/a&gt; …&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;testing and logging framework&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/94"&gt;&lt;strong&gt;better parallel computing support&lt;/strong&gt;&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;and many other itches to scratch, as it is a community-driven event.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Come and join us. It will be fun, and it’s an occasion to learn new
tricks.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external image-reference" href="http://farm5.static.flickr.com/4067/4405351641_5675ba000c.jpg"&gt;&lt;img alt="image1" src="http://farm5.static.flickr.com/4067/4405351641_5675ba000c.jpg" style="width: 20%;" /&gt;&lt;/a&gt; &lt;a class="reference external image-reference" href="http://farm6.static.flickr.com/5249/5265835075_ea0b41019c.jpg"&gt;&lt;img alt="image2" src="http://farm6.static.flickr.com/5249/5265835075_ea0b41019c.jpg" style="width: 20%;" /&gt;&lt;/a&gt; &lt;a class="reference external image-reference" href="http://farm5.static.flickr.com/4135/4974339970_566424185f.jpg"&gt;&lt;img alt="image3" src="http://farm5.static.flickr.com/4135/4974339970_566424185f.jpg" style="width: 20%;" /&gt;&lt;/a&gt; &lt;a class="reference external image-reference" href="http://farm6.static.flickr.com/5294/5425114531_6eec316967.jpg"&gt;&lt;img alt="image4" src="http://farm6.static.flickr.com/5294/5425114531_6eec316967.jpg" style="width: 20%;" /&gt;&lt;/a&gt;&lt;/p&gt;
</content><category term="programming"></category><category term="sprint"></category><category term="machine learning"></category><category term="python"></category><category term="science"></category><category term="scientific computing"></category><category term="scikit-learn"></category></entry><entry><title>Windows binaries for the scientific Python ecosystem</title><link href="https://gael-varoquaux.info/programming/windows-binaries-for-the-scientific-python-ecosystem.html" rel="alternate"></link><published>2011-02-15T09:02:00+01:00</published><updated>2011-02-15T09:02:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-02-15:/programming/windows-binaries-for-the-scientific-python-ecosystem.html</id><summary type="html">&lt;p&gt;I just realized yesterday that Christoph Gohlke has &lt;a class="reference external" href="http://www.lfd.uci.edu/~gohlke/pythonlibs/"&gt;a repository of
binary installers&lt;/a&gt; (&lt;em&gt;.exe&lt;/em&gt;) for Windows 32 and 64bit with almost all
the scientific Python packages that you can dream of:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://numpy.scipy.org"&gt;numpy&lt;/a&gt;, &lt;a class="reference external" href="http://www.scipy.org/"&gt;scipy&lt;/a&gt; and &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt;, of course (compiled
with the MKL)&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://cython.org/"&gt;cython&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;the &lt;a class="reference external" href="http://enthought.github.com/"&gt;ETS&lt;/a&gt;, including &lt;a class="reference external" href="http://enthought.github.com/mayavi/mayavi/"&gt;Mayavi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VTK&lt;/strong&gt;, with the Python …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;I just realized yesterday that Christoph Gohlke has &lt;a class="reference external" href="http://www.lfd.uci.edu/~gohlke/pythonlibs/"&gt;a repository of
binary installers&lt;/a&gt; (&lt;em&gt;.exe&lt;/em&gt;) for Windows 32 and 64bit with almost all
the scientific Python packages that you can dream of:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://numpy.scipy.org"&gt;numpy&lt;/a&gt;, &lt;a class="reference external" href="http://www.scipy.org/"&gt;scipy&lt;/a&gt; and &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt;, of course (compiled
with the MKL)&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://cython.org/"&gt;cython&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;the &lt;a class="reference external" href="http://enthought.github.com/"&gt;ETS&lt;/a&gt;, including &lt;a class="reference external" href="http://enthought.github.com/mayavi/mayavi/"&gt;Mayavi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VTK&lt;/strong&gt;, with the Python bindings&lt;/li&gt;
&lt;li&gt;a variety of &lt;a class="reference external" href="http://scikits.appspot.com/"&gt;scikits&lt;/a&gt; (including the &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;,
hurray!)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These binaries are incredibly useful, as building all these packages
under Windows does require some skills, and a compiler. They complement
very well fully-fledged scientific Python distributions such as EPD or
Python(x,y), as they can be installed on top of an existing Python
installation.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;I should say that I discovered this thanks to a long email discussion in
which Christoph Gohlke and Yakub Nowacki helped me debug a nasty Mayavi
bug on Windows 64bit that I couldn’t reproduce, as I don’t have a Windows
64bit machine available. That was particularly helpful.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="scipy"></category><category term="mayavi"></category></entry><entry><title>Interested in parallel computing and statistics? We are looking for a post-doc</title><link href="https://gael-varoquaux.info/programming/interested-in-parallel-computing-and-statistics-we-are-looking-for-a-post-doc.html" rel="alternate"></link><published>2011-01-30T22:30:00+01:00</published><updated>2011-01-30T22:30:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-01-30:/programming/interested-in-parallel-computing-and-statistics-we-are-looking-for-a-post-doc.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="https://parietal.saclay.inria.fr/"&gt;My research group&lt;/a&gt; is kick starting a new project, called
&lt;strong&gt;AzureBrain&lt;/strong&gt;, to do computational analysis of large population-wise
brain imaging and genetics data. One of the goals of the project is to
harness the power of grid computing to do statistical learning on fMRI
data, finding features in an individuals …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="https://parietal.saclay.inria.fr/"&gt;My research group&lt;/a&gt; is kick starting a new project, called
&lt;strong&gt;AzureBrain&lt;/strong&gt;, to do computational analysis of large population-wise
brain imaging and genetics data. One of the goals of the project is to
harness the power of grid computing to do statistical learning on fMRI
data, finding features in an individual’s brain images that can be
predicted by their genome. The medical applications cover the wide scope
of genetically-related brain pathologies, such as autism.&lt;/p&gt;
&lt;p&gt;Want to work in a dynamic and exciting environment, using Python to solve
challenging data analysis problems? We are looking for a post-doctoral fellow,
to be hired in spring or early summer. The ideal candidate would have a
strong background in computational statistics or machine learning, as
well as parallel computing; however, we will consider any candidate with
good experience in one or the other and a strong desire to learn.&lt;/p&gt;
&lt;p&gt;You would be employed by &lt;a class="reference external" href="http://www.inria.fr"&gt;INRIA&lt;/a&gt;, the leading computing research institute
in France. We are a team of computer scientists specialized in image
processing and statistical data analysis, integrated in one of the top
French brain research centers, &lt;a class="reference external" href="http://www-dsv.cea.fr/en/instituts/institut-d-imagerie-biomedicale-i2bm/services/neurospin-d.-le-bihan"&gt;NeuroSpin&lt;/a&gt;, south of Paris. We work
mostly in Python. The team includes core contributors to the
&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn project&lt;/a&gt;, for machine learning in Python, and the &lt;a class="reference external" href="http://nipy.sourceforge.net/"&gt;nipy
project&lt;/a&gt;, for NeuroImaging in Python.&lt;/p&gt;
&lt;p&gt;Below follows a summary of &lt;a class="reference external" href="http://parietal.saclay.inria.fr/open-positions/azure-brain-post-doc-proposal"&gt;the official job announcement&lt;/a&gt;. Please
contact Bertrand Thirion, (first name _dot_ last name _at_ inria
_dot_ fr) if you are interested, referencing the AzureBrain project.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="section" id="introduction"&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Imaging genetics studies linking functional MRI data and Single
Nucleotide Polymorphisms (SNPs) data face a dire multiple comparisons
issue. In the genome dimension, genotyping DNA chips allow recording
several hundred thousand values per subject, while in the imaging
dimension a brain image may contain 100k-1M voxels. Finding the brain
and genome regions that may be involved in this link entails a huge
number of hypotheses, hence a drastic correction of the statistical
significance of pairwise relationships, which in turn crucially reduces
the sensitivity of the statistical procedures that aim at detecting the
association. It is therefore desirable to set up techniques as sensitive
as possible to explore where in the brain and where in the genome a
significant link can be detected, while correcting for family-wise
multiple comparisons (controlling the false positive rate). Another
issue is the computational cost of these procedures, which needs to be
addressed with adequate algorithmic and computational tools.&lt;/p&gt;
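&lt;p&gt;As a rough, hypothetical illustration of the scale involved (the numbers below are my own back-of-the-envelope choices, not the project’s), a naive Bonferroni correction over all voxel-SNP pairs gives:&lt;/p&gt;

```python
# Hypothetical back-of-the-envelope numbers illustrating the multiple
# comparisons burden of pairwise voxel-SNP association tests.
n_voxels = 100_000    # low end of the 100k-1M voxels per brain image
n_snps = 500_000      # "several hundred thousand" SNP values per subject
n_tests = n_voxels * n_snps

alpha = 0.05
# A naive Bonferroni family-wise correction divides the significance
# threshold by the number of pairwise tests, which is what drastically
# reduces the sensitivity of the procedure.
bonferroni_alpha = alpha / n_tests

print(f"number of pairwise hypotheses: {n_tests:.1e}")
print(f"per-test significance threshold: {bonferroni_alpha:.1e}")
```

In practice, less conservative corrections or dimension-reduction strategies are needed to retain any sensitivity at this scale.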
&lt;/div&gt;
&lt;div class="section" id="objectives"&gt;
&lt;h2&gt;Objectives&lt;/h2&gt;
&lt;p&gt;In this project, we will consider a unique dataset acquired in the
&lt;a class="reference external" href="http://www.imagen-europe.com"&gt;Imagen project&lt;/a&gt;, an FP6 project that aims at investigating factors of
addiction in a population of adolescents; Imagen’s database contains
multi-modal neuroimaging as well as genetics and psychological data on
about 2000 subjects. This database is hosted and processed at Neurospin
and is available for research purposes. The candidate will be in charge
of:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Setting up an analysis pipeline (based on code already available to
analyze neuroimaging/genetics datasets) to extract and pre-process
the relevant data for statistical analysis.&lt;/li&gt;
&lt;li&gt;Performing statistical analysis on simulated datasets and sub-parts
of the whole database in order to set all the computational
framework. These procedures will include mass-univariate linear
modeling (with peak- and cluster-level tests), regularized multiple
regression and a permutation-based assessment framework.&lt;/li&gt;
&lt;li&gt;Launching data analysis on a large-scale grid and cloud environment,
with the help of the KerData researchers (see below).&lt;/li&gt;
&lt;li&gt;Building the post-analytic framework to ease the interpretation of the
results in both the neuroimaging and genetics domains.&lt;/li&gt;
&lt;/ul&gt;
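&lt;p&gt;The permutation-based assessment mentioned above can be sketched as follows; this is a toy illustration on simulated data, with names and sizes of my own choosing rather than the actual Imagen pipeline:&lt;/p&gt;

```python
import numpy as np

# Toy permutation-based assessment of mass-univariate associations:
# correlate one variable (e.g. a genetic score) with many features
# (e.g. voxel values), and control the family-wise error rate by
# comparing the maximum statistic to its permutation null distribution.
rng = np.random.default_rng(0)
n_subjects, n_features = 50, 200
x = rng.standard_normal(n_subjects)                # genetic variable
Y = rng.standard_normal((n_subjects, n_features))  # imaging features

def max_abs_corr(x, Y):
    # Pearson correlation of x with each column of Y; keep the max
    xc = (x - x.mean()) / x.std()
    Yc = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return np.abs(xc @ Yc / len(x)).max()

observed = max_abs_corr(x, Y)
# Null distribution: shuffling subjects breaks any x-Y association
null = np.array([max_abs_corr(rng.permutation(x), Y)
                 for _ in range(500)])
p_fwe = (null >= observed).mean()  # family-wise corrected p-value
```

Using the maximum statistic across features means a single threshold controls the family-wise error rate, at the price of the many refits that make the computational cost an issue at the scales discussed here.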
&lt;p&gt;The analysis framework is based on algorithmic tools developed in
C/Python (numpy, scipy and scikit-learn). The candidate will interact i)
with researchers of the Parietal team for algorithmic aspects, but also
ii) with CEA researchers at Neurospin, who will provide expertise in the
genetics domain, and iii) with the KerData team (INRIA Rennes) and the
Joint MSR-INRIA Research Center (Microsoft Research), which will provide
help and massive computation facilities. The project has access to
grid/cloud computing facilities to be used in collaboration with
INRIA/KerData and MSR-INRIA partners.&lt;/p&gt;
&lt;p&gt;The expected result is the discovery of correlations between brain
activation and genetic information.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="required-knowledge-and-background"&gt;
&lt;h2&gt;Required knowledge and background&lt;/h2&gt;
&lt;p&gt;The candidate should have at least a basic knowledge of standard
statistical concepts. He or she should have a first significant
experience in parallel computation and with the Python language. It is
important that he or she has some real interest in genetics and/or brain
imaging, in order to have strong interactions with specialists of these
domains. He or she will benefit from the algorithmic tools developed at
Parietal and from the database settings and data pre-processing tools
developed by Neurospin researchers.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="jobs"></category><category term="python"></category><category term="science"></category><category term="scientific computing"></category></entry><entry><title>EuroSciPy 2011: the dates are out - Aug 25-28, Paris</title><link href="https://gael-varoquaux.info/programming/euroscipy-2011-the-dates-are-out-aug-25-28-paris.html" rel="alternate"></link><published>2011-01-16T15:57:00+01:00</published><updated>2011-01-16T15:57:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-01-16:/programming/euroscipy-2011-the-dates-are-out-aug-25-28-paris.html</id><summary type="html">&lt;p&gt;We have finally been able to settle on final dates and venue for
&lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy_2011"&gt;EuroSciPy 2011&lt;/a&gt;, the 4th European meeting on Python in Science.&lt;/p&gt;
&lt;p&gt;The conference will be held &lt;strong&gt;from Thursday August 25th, to Sunday
August 28th&lt;/strong&gt;. The &lt;a class="reference external" href="http://www.ens.fr"&gt;ENS&lt;/a&gt; will be hosting the conference once again,
right in the center of …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We have finally been able to settle on final dates and venue for
&lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy_2011"&gt;EuroSciPy 2011&lt;/a&gt;, the 4th European meeting on Python in Science.&lt;/p&gt;
&lt;p&gt;The conference will be held &lt;strong&gt;from Thursday August 25th, to Sunday
August 28th&lt;/strong&gt;. The &lt;a class="reference external" href="http://www.ens.fr"&gt;ENS&lt;/a&gt; will be hosting the conference once again,
right in the center of Paris.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="conferences"></category></entry><entry><title>Research jobs in France: the black humor of 2010 is the reality of 2011</title><link href="https://gael-varoquaux.info/science/research-jobs-in-france-the-black-humor-of-2010-is-the-reality-of-2011.html" rel="alternate"></link><published>2011-01-15T11:41:00+01:00</published><updated>2011-01-15T11:41:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-01-15:/science/research-jobs-in-france-the-black-humor-of-2010-is-the-reality-of-2011.html</id><summary type="html">&lt;p&gt;The French basic research landscape is dominated by a few nationwide
institutes, similar to the NIST or the NIH in the US. The largest of these
is the &lt;a class="reference external" href="http://www.cnrs.fr/index.php"&gt;CNRS&lt;/a&gt; (Centre National de la Recherche Scientifique). Getting a
tenured job in one of those institutes enables someone to focus on basic …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The French basic research landscape is dominated by a few nationwide
institutes, similar to the NIST or the NIH in the US. The largest of these
is the &lt;a class="reference external" href="http://www.cnrs.fr/index.php"&gt;CNRS&lt;/a&gt; (Centre National de la Recherche Scientifique). Getting a
tenured job in one of those institutes enables someone to focus on basic
research rather than teaching or going into industry. It has always
been quite challenging to get such a position, as many people apply for very
few positions, and the choice of the candidates is quite political. Each
year there is a call for applications, through an impressive formal
process that young researchers trying to get jobs in France end up
knowing quite well.&lt;/p&gt;
&lt;p&gt;Last year, I was visiting a research lab (&lt;a class="reference external" href="http://www.incm.cnrs-mrs.fr/en_index.php"&gt;INCM&lt;/a&gt;) and I saw, in their
coffee-break room, the following poster (below), which I could
clearly recognize as the official call for applications for positions at
CNRS.&lt;/p&gt;
&lt;p&gt;Now this poster says ‘&lt;strong&gt;The CNRS recruits 3 researchers (m/w) in all
fields of research&lt;/strong&gt;‘. Of course it’s a fake poster and black humor: 3
positions nationwide in all fields of research is ridiculously low. It
is however an expression of the nightmare of thousands of young
researchers who are applying each year and keep hearing that the
government will &lt;a class="reference external" href="http://www.latribune.fr/actualites/economie/france/20100415trib000499181/la-fonction-publique-d-etat-perdra-34.000-postes-en-2011-selon-georges-tron.html"&gt;slash the number of state employees&lt;/a&gt;.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/cnrs_recruits.jpg" style="width: 70%;" /&gt;
&lt;p&gt;The call for the 2011 applications for research positions at &lt;a class="reference external" href="http://en.inria.fr/"&gt;INRIA&lt;/a&gt;,
the French national computer science institute, another one of
the big research institutions in France, is &lt;a class="reference external" href="http://www.inria.fr/institut/recrutement-metiers/offres/concours-2011-5-postes-de-charge-de-recherche-2e-classe-sont-a-pourvoir/concours-2011"&gt;out&lt;/a&gt;. The page is entitled
&lt;em&gt;Cinq postes de chargé de recherche 2e classe sont à pourvoir&lt;/em&gt; (&lt;strong&gt;5
positions for junior researchers are available&lt;/strong&gt;). This is not a joke,
and it is striking to see the similarity between &lt;strong&gt;the dark humor of
2010 and the reality of 2011&lt;/strong&gt;. To be fair, INRIA is smaller than the CNRS,
as it covers only computer science and its applications (listed as applied
maths, numerical computing and simulation, algorithm and software
research, networks and distributed systems, and computational modeling
for life sciences). The number of applications is in the hundreds rather than
thousands, but having only 5 jobs available nationwide still feels
really awkward.&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external" href="attachments/cnrs_recruits.pdf"&gt;PDF poster&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;A minor detail: I am trying to get a job in computational science
research in France.&lt;/p&gt;
</content><category term="science"></category><category term="personnal"></category><category term="science"></category></entry><entry><title>Scientific publication for software development</title><link href="https://gael-varoquaux.info/programming/scientific-publication-for-software-development.html" rel="alternate"></link><published>2011-01-08T22:40:00+01:00</published><updated>2011-01-08T22:40:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-01-08:/programming/scientific-publication-for-software-development.html</id><summary type="html">&lt;p&gt;The academic community seems to judge the validity and significance of
any contribution by the number of papers published and the number of
citations they get. To find funding, to get credit, you have to
&lt;strong&gt;publish or perish&lt;/strong&gt;. However, the natural output of software
development tends not to be an …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The academic community seems to judge the validity and significance of
any contribution by the number of papers published and the number of
citations they get. To find funding, to get credit, you have to
&lt;strong&gt;publish or perish&lt;/strong&gt;. However, the natural output of software
development tends not to be an article (people who confuse articles and
documentation do a poor job of both, IMHO).&lt;/p&gt;
&lt;p&gt;While I believe that this policy is harmful for the quality of research,
I also know that I cannot fight it, and chances are that many others are
in my situation. As such, we need to publish scientific papers about the
scientific software that we develop (such as &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi&lt;/a&gt;, or
&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, as far as I am concerned). On the other hand, as an
editor of the &lt;a class="reference external" href="http://conference.scipy.org/proceedings.html"&gt;Scipy conference proceedings&lt;/a&gt;, I have found that the
process of writing a paper on software work and going through peer
review can be greatly beneficial to the software. Indeed, it forces
authors to do a thorough review of the prior work, and to clearly
identify the purpose of the project. Also, such an article can only be
much shorter than a user manual, so it forces the authors to identify
the key concepts of their software, and explain them clearly. As a
result, it helps find design and usability flaws and gain insight
into how the user manual can be structured.&lt;/p&gt;
&lt;p&gt;A major challenge to publishing is that most of the highly-ranked
journals tend to disregard software work, unless it is very specific
to a scientific problem, which actually makes it less useful to the
complete ecosystem. Deeply rooted in the minds of the editors and the
reviewers, there tends to be the idea that developing software is easy
compared to doing experiments or proofs. In addition, these top-notch
scientists are not always the most qualified to judge the quality of
software, as they have most often never worked on a major software
project. The good news is that this is slowly changing with the
creation of software tracks in specialized journals, and the development
of new journals focused on scientific software.&lt;/p&gt;
&lt;div class="section" id="journals-for-publishing-about-interdisciplinary-scientific-software"&gt;
&lt;h2&gt;Journals for publishing about interdisciplinary scientific software&lt;/h2&gt;
&lt;p&gt;In my opinion, interdisciplinary scientific software packages such as &lt;a class="reference external" href="http://numpy.scipy.org/"&gt;numpy&lt;/a&gt;,
the &lt;a class="reference external" href="http://www.gnu.org/software/gsl/"&gt;GSL&lt;/a&gt;, &lt;a class="reference external" href="http://www.gnu.org/software/octave/"&gt;octave&lt;/a&gt;, &lt;a class="reference external" href="http://www.scilab.org/"&gt;scilab&lt;/a&gt;, &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt;, or &lt;a class="reference external" href="http://www.fenicsproject.org"&gt;Fenics&lt;/a&gt;, are the
most valuable projects, as they provide foundations to build science in
the open. The challenges that these projects have to face are not only
algorithmic or computational, but also include providing good user
interfaces, or developing and catering for very large communities of
users. These problems are considered &lt;em&gt;solved&lt;/em&gt; in a scientific
context, as they have all been solved at least once, often quite
successfully, by commercial products such as Matlab. As a result, it is
hard to get funding for these projects unless there is a political
reason behind the funding, and IMHO politics tend to produce bad
software. Publishing high-profile articles on interdisciplinary
scientific software is thus hard, but critical. For this we need
journals that accept software papers, but are not read only by
researchers in CS or IT departments.&lt;/p&gt;
&lt;p&gt;A couple of years ago, some of us made a review of where it was possible
to publish truly wide-scope scientific software, and we found that there
was pretty much no option. It’s crazy to see that things have still not
changed much, and that a lot of major general-purpose, widely-used
projects, like the ones I cited above, have never been acknowledged by a
publication.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://cise.aip.org/"&gt;Computing in Science and Engineering&lt;/a&gt;: a joint publication
between the AIP (American Institute of Physics) and the IEEE, it is a
magazine-style journal and it can be seen in many coffee rooms of
computational-science departments. Thanks to that it gets a lot of
readership, but the articles cannot be too technical (which might be a
good thing), and there is room for only a few articles.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.openresearchcomputation.com/"&gt;Open Research Computation (ORC)&lt;/a&gt;: A newly-created journal, with
a focus on making computational research reproducible. As such, it
favors papers about open source scientific software with good
software-engineering. &lt;strong&gt;Open access&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition to these software-friendly journals, some large-scope
journals on computational science sometimes accept software papers,
though software production falls outside their scope:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.elsevier.com/locate/jocs/"&gt;Journal of Computational Science&lt;/a&gt;: a very multidisciplinary
journal.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.siam.org/journals/sisc.php"&gt;SIAM Journal on Scientific Computing (SISC)&lt;/a&gt;: a journal of the
SIAM (society for industrial and applied mathematics), thus with a
focus on engineering-type applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="journals-for-publishing-domain-specific-scientific-software"&gt;
&lt;h2&gt;Journals for publishing domain-specific scientific software&lt;/h2&gt;
&lt;p&gt;It is usually easier to publish a domain-specific software contribution,
as you can claim that you have solved a well-identified scientific
roadblock. Until recently, it was hard to get such papers in the best
journals of a community, but things have been changing.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.elsevier.com/locate/cpc"&gt;Computer Physics Communications&lt;/a&gt;: for algorithms and packages
solving numerical and computational problems related to physics.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://bioinformatics.oxfordjournals.org/"&gt;Bioinformatics&lt;/a&gt;: accepts software papers on biology-related
problems.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://toms.acm.org/"&gt;ACM Transactions On Mathematical Software (TOMS)&lt;/a&gt;: a journal of
the ACM (Association for Computing Machinery), thus with a focus on
algorithms.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.jstatsoft.org/"&gt;Journal of Statistical Software&lt;/a&gt;: this journal comes from the
community of people who wrote the R language. They know that open
source scientific software is hard and important. &lt;strong&gt;Open access&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://jmlr.csail.mit.edu/mloss/"&gt;Journal of Machine Learning Research (JMLR), Machine Learning Open
Source (MLOSS) track&lt;/a&gt;: reference journal in the machine learning
community, the MLOSS track cares strongly about documentation,
packaging and usability of the software. &lt;strong&gt;Open access&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.elsevier.com/wps/find/journaldescription.cws_home/398/description#description"&gt;Computers &amp;amp; Geoscience&lt;/a&gt;: computational geoscience journal that
accepts software papers (thanks Michael Aye for the pointer).&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291099-0542"&gt;Computer Applications in Engineering Education&lt;/a&gt;: a journal
about education with computers. AFAIK, no special focus on open
source or software-engineering quality (thanks Doug Holton for the
pointer).&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.springer.com/biomed/neuroscience/journal/12021"&gt;NeuroInformatics&lt;/a&gt; and &lt;a class="reference external" href="http://www.frontiersin.org/neuroinformatics"&gt;Frontiers NeuroInformatics&lt;/a&gt; (&lt;strong&gt;open
access&lt;/strong&gt;): two journals on computer-related issues in neuroscience
that accept software papers. I have the feeling that the latter is a
bit warmer to open source than the former (thanks Andrew Davison for
the pointer).&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.elsevier.com/wps/find/journaldescription.cws_home/503304/description#description"&gt;Computers and Electronics in Agriculture&lt;/a&gt;: for publishing
agriculture-related software (thanks John B. Cole for the pointer).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I should stress that, in my opinion, journals such as &lt;a class="reference external" href="http://www.ploscompbiol.org"&gt;PLOS
computational biology&lt;/a&gt; or the &lt;a class="reference external" href="http://www.elsevier.com/wps/find/journaldescription.cws_home/622866/description#description"&gt;Journal of Computational Physics&lt;/a&gt;
are not great venues for software papers, as they tend to emphasize what
I would call &lt;em&gt;proof of principle&lt;/em&gt;, and not packaged and maintained
software.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I have the feeling that there is a need for more communication on
scientific software. The list above is, of course, incomplete. If you
have extra ideas, please do not hesitate to contact me.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As a conclusion, I would like to point out that conferences are also a
good way to advertise scientific software. You may even be approached by
a journal editor, opening the door to a journal article. Last
year I was at &lt;a class="reference external" href="http://hpfem.org/events/esco-2010/"&gt;ESCO&lt;/a&gt;, a coupled-problems conference, and there was a
track on Python in science. All in all, the conference was a huge amount
of fun, and I learned a lot about practical aspects of numerical methods,
given the number of numerical-computing geeks that were around. The same
community is organizing &lt;a class="reference external" href="http://hpfem.org/events/femtec-2011/"&gt;FEMTEC&lt;/a&gt; in Lake Tahoe (California) this year.
If you are in any field related to FEM or multiphysics, you should
consider it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update: added links suggested by Doug Holton, Michael Aye, Andrew
Davison, and John B. Cole&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="scientific computing"></category><category term="publishing"></category></entry><entry><title>ICA versus PCA in the scikit-learn: the value of code over pictures</title><link href="https://gael-varoquaux.info/programming/ica-versus-pca-in-the-scikit-learn-the-value-of-code-over-pictures.html" rel="alternate"></link><published>2010-11-20T16:12:00+01:00</published><updated>2010-11-20T16:12:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-11-20:/programming/ica-versus-pca-in-the-scikit-learn-the-value-of-code-over-pictures.html</id><summary type="html">&lt;p&gt;When I was trying to get an intuitive feeling of the difference between
&lt;strong&gt;Independent Component Analysis&lt;/strong&gt; (ICA) and &lt;strong&gt;Principal Component
Analysis&lt;/strong&gt; (PCA), I wrote a few Python scripts producing &lt;a class="reference external" href="http://gael-varoquaux.info/scientific_computing/ica_pca/index.html"&gt;some
visualizations explaining the difference&lt;/a&gt; that have had a bit of
success.&lt;/p&gt;
&lt;p&gt;During the last sprint on &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt;, a machine learning …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When I was trying to get an intuitive feeling of the difference between
&lt;strong&gt;Independent Component Analysis&lt;/strong&gt; (ICA) and &lt;strong&gt;Principal Component
Analysis&lt;/strong&gt; (PCA), I wrote a few Python scripts producing &lt;a class="reference external" href="http://gael-varoquaux.info/scientific_computing/ica_pca/index.html"&gt;some
visualizations explaining the difference&lt;/a&gt; that have had a bit of
success.&lt;/p&gt;
&lt;p&gt;During the last sprint on &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt;, a machine learning
toolkit in Python, we cleaned up the ICA code that I had been using, and
we added it to the scikit, along with &lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_vs_pca.html"&gt;an example&lt;/a&gt; inspired from this
earlier toy problem.&lt;/p&gt;
&lt;a class="reference external image-reference" href="http://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_vs_pca.html"&gt;&lt;img alt="" class="align-center" src="http://scikit-learn.org/stable/_images/sphx_glr_plot_ica_vs_pca_001.png" /&gt;&lt;/a&gt;
&lt;p&gt;While the pictures are not as pretty as the initial ones I had done
(because we wanted to keep the example as simple as possible), I am very
happy that this discussion is now more than a set of static pictures:
it comes with runnable code.&lt;/p&gt;
&lt;p&gt;This illustrates very well my feelings on the future of scientific code
and scientific research: papers, books, and teaching materials on numerical
methods or computational science are greatly enhanced when they come
with highly-readable code that illustrates their purpose, because the
reader can start asking questions of the algorithm. Hopefully, &lt;strong&gt;the
documentation of scientific programming toolkits will become the
textbooks of tomorrow&lt;/strong&gt;. We still have a lot of work to do.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;It’s funny, I just realized that my vision on software might have been
strongly influenced by the fact that my mother, a high-school math
teacher, spent endless nights when I was a teenager working on
&lt;a class="reference external" href="http://fr.wikipedia.org/wiki/G%C3%A9oplan"&gt;Geoplan&lt;/a&gt;, a software for teaching geometry by interaction with
figures.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="scientific computing"></category></entry><entry><title>Multitouch with VTK (and MedINRIA and Mayavi)</title><link href="https://gael-varoquaux.info/programming/multitouch-with-vtk-and-medinria-and-mayavi.html" rel="alternate"></link><published>2010-09-18T09:40:00+02:00</published><updated>2010-09-18T09:40:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-09-18:/programming/multitouch-with-vtk-and-medinria-and-mayavi.html</id><summary type="html">&lt;p&gt;If the videos on this post are not showing, click through to see
them.&lt;/p&gt;
&lt;p&gt;A colleague of mine, &lt;a class="reference external" href="http://sites.google.com/site/pierrefillard/"&gt;Pierre Fillard&lt;/a&gt;, has just integrated multitouch
in the next generation of the VTK-based medical imaging software
&lt;a class="reference external" href="http://www-sop.inria.fr/asclepios/software/MedINRIA/"&gt;MedINRIA&lt;/a&gt;. The nice thing is that it works on an Apple laptop out of
the box …&lt;/p&gt;</summary><content type="html">&lt;p&gt;If the videos on this post are not showing, click through to see
them.&lt;/p&gt;
&lt;p&gt;A colleague of mine, &lt;a class="reference external" href="http://sites.google.com/site/pierrefillard/"&gt;Pierre Fillard&lt;/a&gt;, has just integrated multitouch
in the next generation of the VTK-based medical imaging software
&lt;a class="reference external" href="http://www-sop.inria.fr/asclepios/software/MedINRIA/"&gt;MedINRIA&lt;/a&gt;. The nice thing is that it works on an Apple laptop out of
the box.&lt;/p&gt;
&lt;p&gt;
&lt;object width="640" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/UyO4KRnYreU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;&lt;p&gt;On &lt;a class="reference external" href="https://sites.google.com/site/pierrefillard/coding-blog/multi-touchgesturesinvtk"&gt;his blog&lt;/a&gt;, he explains how he did it (warning, it involves C++ and
VTK programming). &lt;strong&gt;He also gives the code for this!&lt;/strong&gt; Enjoy.&lt;/p&gt;
&lt;p&gt;This reminded me of when the &lt;a class="reference external" href="http://www.enthought.com/"&gt;Enthought guys&lt;/a&gt; had rigged up a large
multitouch screen and wired it in &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi&lt;/a&gt; for 3D plotting, and in
&lt;a class="reference external" href="http://code.enthought.com/projects/chaco/"&gt;chaco&lt;/a&gt; for 2D plotting, using only a web-cam, a video projector, and
pure Python image-analysis code:&lt;/p&gt;
&lt;p&gt;
&lt;object width="480" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/bEf3nGjOgpU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;</content><category term="programming"></category><category term="mayavi"></category><category term="python"></category><category term="scientific computing"></category></entry><entry><title>Machine learning humour</title><link href="https://gael-varoquaux.info/science/machine-learning-humour.html" rel="alternate"></link><published>2010-09-16T23:11:00+02:00</published><updated>2010-09-16T23:11:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-09-16:/science/machine-learning-humour.html</id><summary type="html">&lt;div class="section" id="yes-but-they-overfit"&gt;
&lt;h2&gt;Yes, but they overfit&lt;/h2&gt;
&lt;p&gt;If you are reading this post through a planet and the movie isn’t showing
up, just &lt;a class="reference external" href="http://gael-varoquaux.info/science/machine-learning-humour.html"&gt;click through&lt;/a&gt; to understand what the hell this is about.&lt;/p&gt;
&lt;p&gt;
&lt;object width="480" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/m60lVGz34hU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;&lt;/div&gt;
&lt;div class="section" id="some-explanations"&gt;
&lt;h2&gt;Some explanations…&lt;/h2&gt;
&lt;div class="section" id="machine-learning-geeks-and-beers"&gt;
&lt;h3&gt;Machine learning, geeks, and beers&lt;/h3&gt;
&lt;p&gt;Sorry for the bad humour. In the previous weeks my social geek life …&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="yes-but-they-overfit"&gt;
&lt;h2&gt;Yes, but they overfit&lt;/h2&gt;
&lt;p&gt;If you are reading this post through a planet and the movie isn’t showing
up, just &lt;a class="reference external" href="http://gael-varoquaux.info/science/machine-learning-humour.html"&gt;click through&lt;/a&gt; to understand what the hell this is about.&lt;/p&gt;
&lt;p&gt;
&lt;object width="480" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/m60lVGz34hU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;&lt;/div&gt;
&lt;div class="section" id="some-explanations"&gt;
&lt;h2&gt;Some explanations…&lt;/h2&gt;
&lt;div class="section" id="machine-learning-geeks-and-beers"&gt;
&lt;h3&gt;Machine learning, geeks, and beers&lt;/h3&gt;
&lt;p&gt;Sorry for the bad humour. Over the past few weeks, my social geek life
had two highlights:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.pycon.fr/conference/edition2010"&gt;Pycon fr&lt;/a&gt;, the French Python conference, and ensuing drinking&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" src="http://farm5.static.flickr.com/4077/4938486734_378f52fd3d.jpg" style="width: 45%;" /&gt;
&lt;img alt="" src="http://farm5.static.flickr.com/4114/4938124265_027853c81a.jpg" style="width: 45%;" /&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://fseoane.net/blog/2010/second-scikitslearn-coding-sprint/"&gt;The second sprint&lt;/a&gt; on the &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit learn&lt;/a&gt;, a library for machine
learning in Python.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At the first event (or maybe the related drinking) there was a lot of
discussion about NoSQL databases, and I was introduced to &lt;a class="reference external" href="http://www.xtranormal.com/watch/6995033/&amp;quot;&amp;quot;"&gt;this
fantastic video&lt;/a&gt; making fun of MongoDB fanboys. A few days later I was
hacking on the scikit, comparing estimators and discussing hype versus
fact in machine learning algorithms (hint: &lt;a class="reference external" href="http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization"&gt;there is no free lunch&lt;/a&gt;,
but you may get &lt;a class="reference external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.2501&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;a free brunch&lt;/a&gt;). Since, in brain imaging, people seem to
be doing nothing but SVMs over and over while &lt;a class="reference external" href="http://hal.inria.fr/hal-00504095/PDF/icpr_2010_tv.pdf"&gt;methods with more
appropriate sparsity clearly perform better&lt;/a&gt;, I composed this stupid
video.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="anything-to-learn-about-machine-learning-in-there"&gt;
&lt;h3&gt;Anything to learn about machine learning in there?&lt;/h3&gt;
&lt;p&gt;The short answer is: probably no. This video is humour, and there is
little truth in it (well, RFE is indeed slow as a dog). However, not every
reader of this blog is a machine learning expert, so let me explain the
stakes of the pseudo discussion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overfitting&lt;/strong&gt;: when you learn a predictive model on a noisy data set
with a finite amount of data, for instance trying to predict whether a
movie is popular or not from its ratings, you should be careful not to
learn every detail of the data by heart. Otherwise you will learn
noise that, by chance, correlates with what you are trying to predict.
When you try to generalize to new data, these features that you learned
from noise will be detrimental to your prediction performance. For
instance, &lt;a class="reference external" href="http://www.reddit.com/r/Python/comments/cwq37/announcing_python_nltk_demos_natural_language/"&gt;the presence of Matt Damon&lt;/a&gt; is not the sole predictor of the
quality of a movie. This is called overfitting. The goal of
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Regularization_%28mathematics%29"&gt;regularization&lt;/a&gt; is to avoid this overfitting.&lt;/p&gt;
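&lt;p&gt;To make this concrete, here is a small illustrative sketch (my own toy example, nothing to do with movies or brain imaging): a high-degree polynomial fit to a handful of noisy points learns the noise by heart, while a small ridge (regularization) penalty tames it.&lt;/p&gt;

```python
# Toy illustration of overfitting and regularization: fit a degree-12
# polynomial to 15 noisy samples of a sine, with and without a ridge
# (L2) penalty on the coefficients.
import numpy as np

rng = np.random.RandomState(0)
x_train = np.linspace(0, 1, 15)
x_test = np.linspace(0, 1, 200)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.randn(15)
y_true = np.sin(2 * np.pi * x_test)  # noiseless ground truth

def fit_poly(x, y, degree, alpha=0.0):
    # Ridge-regularized least squares on a polynomial (Vandermonde) basis,
    # solved as an augmented least-squares problem for numerical stability.
    V = np.vander(x, degree + 1)
    A = np.vstack([V, np.sqrt(alpha) * np.eye(degree + 1)])
    b = np.concatenate([y, np.zeros(degree + 1)])
    return np.linalg.lstsq(A, b, rcond=None)[0]

for alpha in (0.0, 1e-3):
    c = fit_poly(x_train, y_train, degree=12, alpha=alpha)
    train_err = np.sqrt(np.mean((np.vander(x_train, 13) @ c - y_train) ** 2))
    test_err = np.sqrt(np.mean((np.vander(x_test, 13) @ c - y_true) ** 2))
    print(f"alpha={alpha:g}: train RMSE {train_err:.3f}, test RMSE {test_err:.3f}")
```

&lt;p&gt;The training error of the unregularized fit is deceptively small; the error on fresh test points tells the real story.&lt;/p&gt;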
&lt;p&gt;Both SVM and elasticnet implement regularization, but in different ways.
In the case of brain imaging, the predictive features (voxels) are
very sparse but the noise is highly structured, so SVMs (which do not
operate on voxels directly) are not able to select the relevant
voxels and tend to overfit (which can be counter-balanced by univariate
feature selection, as in the &lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html"&gt;scikit example&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RFE (recursive feature elimination) is slow as a dog&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_digits&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearSVC&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;svc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LinearSVC&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.rfe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RFE&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;RFE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;21.5&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.glm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ElasticNet&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;ElasticNet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;26.7&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Yeah, but it does much more than simply build a predictor: it builds
a ‘heat map’ of which features help predicting (run &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/auto_examples/rfe_digits.html"&gt;this scikit-learn
example&lt;/a&gt; to get an idea).&lt;/p&gt;
&lt;p&gt;I am afraid that all the examples I pointed to require the development
version of the scikit. Sorry, we just finished a sprint, and there will
be a release soon.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="personnal"></category><category term="python"></category><category term="humor"></category></entry><entry><title>Scikit Learn coding sprint</title><link href="https://gael-varoquaux.info/programming/scikit-learn-coding-sprint.html" rel="alternate"></link><published>2010-09-04T17:43:00+02:00</published><updated>2010-09-04T17:43:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-09-04:/programming/scikit-learn-coding-sprint.html</id><summary type="html">&lt;p&gt;We have been really crap at communicating the next &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;
coding sprint. It’s next week!&lt;/p&gt;
&lt;p&gt;The coding sprint will take place the 8 and 9 September at &lt;a class="reference external" href="http://maps.google.fr/maps/place?oe=utf-8&amp;amp;rls=com.mandriva:en-US:official&amp;amp;client=firefox-a&amp;amp;um=1&amp;amp;ie=UTF-8&amp;amp;q=inria+saclay&amp;amp;fb=1≷=fr&amp;amp;hq=inria&amp;amp;hnear=Saclay&amp;amp;cid=14838681423181723946"&gt;INRIA
Saclay&lt;/a&gt;, near Paris, in the room K110 (building K).&lt;/p&gt;
&lt;p&gt;For those who cannot make it, it will be possible to participate …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We have been really crap at communicating the next &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;
coding sprint. It’s next week!&lt;/p&gt;
&lt;p&gt;The coding sprint will take place the 8 and 9 September at &lt;a class="reference external" href="http://maps.google.fr/maps/place?oe=utf-8&amp;amp;rls=com.mandriva:en-US:official&amp;amp;client=firefox-a&amp;amp;um=1&amp;amp;ie=UTF-8&amp;amp;q=inria+saclay&amp;amp;fb=1≷=fr&amp;amp;hq=inria&amp;amp;hnear=Saclay&amp;amp;cid=14838681423181723946"&gt;INRIA
Saclay&lt;/a&gt;, near Paris, in the room K110 (building K).&lt;/p&gt;
&lt;p&gt;For those who cannot make it, it will be possible to participate using
the IRC chan (#scikit-learn on irc.freenode.net).&lt;/p&gt;
&lt;p&gt;We will start at 9am (Paris time), and a sketch of the planning can be
found &lt;a class="reference external" href="http://sourceforge.net/apps/trac/scikit-learn/wiki/SprintPlanning"&gt;here&lt;/a&gt;. In particular:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;More docs! We still need tutorials: feature selection, model
selection, cross-validation, etc.&lt;/li&gt;
&lt;li&gt;Make the &lt;a class="reference external" href="http://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/pipeline.py"&gt;pipeline object&lt;/a&gt; really work + illustration in different
contexts.&lt;/li&gt;
&lt;li&gt;Clean-up and docs for Bayesian approaches.&lt;/li&gt;
&lt;li&gt;Implementation of PCA (fit + transform).&lt;/li&gt;
&lt;li&gt;FastICA (adapt the &lt;a class="reference external" href="http://github.com/GaelVaroquaux/canica/blob/master/canica/algorithms/fastica.py"&gt;CanICA&lt;/a&gt; code)&lt;/li&gt;
&lt;li&gt;LDA : Covariance estimators (Ledoit-Wolf) and add transform.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/preprocessing.py"&gt;Preprocessing routines&lt;/a&gt; (center, standardize) with fit transform.&lt;/li&gt;
&lt;li&gt;Anything that you have a particular interest in.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not hesitate to send advice on this (incomplete…) list to the &lt;a class="reference external" href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general"&gt;mailing list&lt;/a&gt;, and see you next week!&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt; is a Python module for efficient and easy machine
learning using scipy and numpy.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="scientific computing"></category><category term="scikit-learn"></category></entry><entry><title>SVG Word map of countries</title><link href="https://gael-varoquaux.info/misc/svg-word-map-of-countries.html" rel="alternate"></link><published>2010-08-24T10:55:00+02:00</published><updated>2010-08-24T10:55:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-08-24:/misc/svg-word-map-of-countries.html</id><summary type="html">&lt;p&gt;To be able to visualize some quantities attached to countries all over
the world, I needed an image with various countries color-coded. The
fantastic &lt;a class="reference external" href="http://matplotlib.sourceforge.net/basemap/doc/html/"&gt;matplotlib basemap package&lt;/a&gt; was not an option as I really
needed a static image.&lt;/p&gt;
&lt;p&gt;So I generated an SVG image with all the countries. It was …&lt;/p&gt;</summary><content type="html">&lt;p&gt;To be able to visualize some quantities attached to countries all over
the world, I needed an image with various countries color-coded. The
fantastic &lt;a class="reference external" href="http://matplotlib.sourceforge.net/basemap/doc/html/"&gt;matplotlib basemap package&lt;/a&gt; was not an option as I really
needed a static image.&lt;/p&gt;
&lt;p&gt;So I generated an SVG image with all the countries. It was generated by
tracing a bitmap, so it has a lot of imperfections, but being an SVG
with each (major) country as a separate object, it can be used to
create a color-coded world map. I am uploading it here under a
public-domain license. Enjoy!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PNG&lt;/strong&gt;&lt;/p&gt;
&lt;img alt="" src="../images/misc/countries.png" style="width: 50%;" /&gt;
&lt;p&gt;&lt;strong&gt;SVG&lt;/strong&gt;: &lt;a class="reference external" href="../images/misc/countries.svg"&gt;countries.svg&lt;/a&gt;&lt;/p&gt;
&lt;!-- _ --&gt;
</content><category term="misc"></category><category term="python"></category><category term="scientific computing"></category><category term="travels"></category><category term="art"></category></entry><entry><title>Software design for maintainability</title><link href="https://gael-varoquaux.info/programming/software-design-for-maintainability.html" rel="alternate"></link><published>2010-08-01T23:47:00+02:00</published><updated>2010-08-01T23:47:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-08-01:/programming/software-design-for-maintainability.html</id><summary type="html">&lt;p&gt;I have just spent the best part of my Sunday fixing a bug that turned
out to be a &lt;a class="reference external" href="https://svn.enthought.com/enthought/changeset/25699/"&gt;seemingly-trivial two-liner&lt;/a&gt;. Such unpleasant experiences
are all too frequent, and weigh heavily on my view of code design.&lt;/p&gt;
&lt;div class="section" id="my-stance-on-code-design"&gt;
&lt;h2&gt;My stance on code design&lt;/h2&gt;
&lt;img alt="" class="align-right" src="https://gael-varoquaux.info/programming/attachments/software_design_for_maintainability/cool-car-drawing-5.jpg" style="width: 30%;" /&gt;
&lt;p&gt;I call &lt;em&gt;code design&lt;/em&gt; the process of designing …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;I have just spent the best part of my Sunday fixing a bug that turned
out to be a &lt;a class="reference external" href="https://svn.enthought.com/enthought/changeset/25699/"&gt;seemingly-trivial two-liner&lt;/a&gt;. Such unpleasant experiences
are all too frequent, and weigh heavily on my view of code design.&lt;/p&gt;
&lt;div class="section" id="my-stance-on-code-design"&gt;
&lt;h2&gt;My stance on code design&lt;/h2&gt;
&lt;img alt="" class="align-right" src="https://gael-varoquaux.info/programming/attachments/software_design_for_maintainability/cool-car-drawing-5.jpg" style="width: 30%;" /&gt;
&lt;p&gt;I call &lt;em&gt;code design&lt;/em&gt; the process of designing the architecture of a
piece of software: what are the objects it uses? how do they interact?
how is the information passed around?…&lt;/p&gt;
&lt;p&gt;My view of code design and software engineering has progressively
evolved to favor &lt;strong&gt;extreme simplicity&lt;/strong&gt; over sophistication. I believe
that a good programmer should know &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Design_pattern_%28computer_science%29"&gt;design patterns&lt;/a&gt;, &lt;a class="reference external" href="http://gael-varoquaux.info/computers/python_advanced/index.html"&gt;powerful
language features&lt;/a&gt;, &lt;a class="reference external" href="http://scipy2010.blogspot.com/2010/06/tutorials-day-1-advanced-numpy.html"&gt;libraries’ dark corners&lt;/a&gt;, and &lt;em&gt;not use them unless
absolutely necessary&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="some-rules-of-thumb"&gt;
&lt;h2&gt;Some rules of thumb&lt;/h2&gt;
&lt;p&gt;Here are some rules that I apply nowadays when writing code that I would
like to last (I am aware that some of them go against well-advertised
best practices).&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Keep it as simple as possible, really!&lt;/strong&gt; Experimental results have
shown that the tractability of a code base goes down as the square of
the number of interactions, and thus much quicker than the number of
lines in a project. Each time you add a line, think about it: can you
make it simpler? If not, you will have to find resources to maintain your
project, as fixing bugs or adding features will grow harder.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Design for the 80% usecases.&lt;/strong&gt; In the same vein, a small decrease
in the requirements can make your project much simpler
&lt;a class="reference external" href="http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F32%2F35909%2F01702600.pdf%3Farnumber%3D1702600&amp;amp;authDecision=-203"&gt;[Woodfield1979]&lt;/a&gt;. Corner cases and minor usecases should not make
the whole project complex and hard to maintain. If you can, give up
on what is bringing in complexity. If you cannot, isolate it, and
don’t let it sit at the core of your design.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don’t design for the future.&lt;/strong&gt; Again the same core idea: don’t
start planning for all the usecases and all the difficulties that you
haven’t encountered; you will most certainly design it wrong, and
chances are that you’ll add complexity that you do not use. Design
simple, design cleanly, and refactor as you go, based on concrete
problems. This is known as the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/You_aren't_gonna_need_it"&gt;“YAGNI principle”&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/programming/attachments/software_design_for_maintainability/howtobuildmvp.gif" style="width: 60%;" /&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Don’t be clever.&lt;/strong&gt; Each time you use a clever trick, whoever has to
read and maintain the code will have to understand it (that person
may be you, in a few years). Chances are that they’ll get it wrong
and start by losing a lot of time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repeating yourself may actually be OK.&lt;/strong&gt; This is a case of
&lt;em&gt;practicality beats purity&lt;/em&gt;. Repeating code is really a bad thing in
software design, because it leads to an increased number of lines to
debug, and tends to hinder reusability. However, adding complexity in
order to save a few lines of duplicated code will cost you more in
the long run.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use objects sparingly.&lt;/strong&gt; Objects are great, but are they always
needed? An object with a single method &lt;em&gt;eval&lt;/em&gt; can probably simply be
implemented as a function. The limitation of objects is that they all
have a different behavior. As a result, the users and maintainers of
your codebase will first have to understand how all your classes
interact before understanding your code. This also means that there
is a lot of benefit in making many different classes share the
same interface.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid abstractions and levels of indirection.&lt;/strong&gt; The more levels of
code are piled on top of one another, the more layers your maintainer
is going to have to inspect to find where the bug might be. An
abstraction hides another object or algorithm. To debug code, chances
are that all the black boxes will first have to be opened.&lt;/li&gt;
&lt;/ul&gt;
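&lt;p&gt;To make the &lt;em&gt;eval&lt;/em&gt; example above concrete, here is a minimal
sketch (the names are illustrative, not taken from any real codebase):
a single-method class, and the plain function that can replace it.&lt;/p&gt;

```python
# A class whose only purpose is a single `eval` method...
class Polynomial:
    def __init__(self, coefs):
        self.coefs = coefs

    def eval(self, x):
        return sum(c * x ** i for i, c in enumerate(self.coefs))


# ... can probably simply be a function:
def eval_polynomial(coefs, x):
    return sum(c * x ** i for i, c in enumerate(coefs))


print(Polynomial([1, 2]).eval(3))    # 7
print(eval_polynomial([1, 2], 3))    # 7
```

&lt;p&gt;The function version gives the reader one less moving part to
understand: no state, no life cycle, just inputs and an output.&lt;/p&gt;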
&lt;/div&gt;
&lt;div class="section" id="coding-for-others-to-debug"&gt;
&lt;h2&gt;Coding for others to debug&lt;/h2&gt;
&lt;blockquote class="epigraph"&gt;
“Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it.” - Brian W. Kernighan&lt;/blockquote&gt;
&lt;img alt="" class="align-right" src="https://gael-varoquaux.info/programming/attachments/software_design_for_maintainability/auto-graveyard-1.jpg" style="width: 40%;" /&gt;
&lt;p&gt;You may think that I am overemphasizing simplicity at the cost of
functionality. Well, think about the future of your code. The net is
full of unmaintained and abandoned code. If you want your project to
grow and have a future, you will probably need people to help you. For a
given purpose, the easier the code is to read and debug, the better
your chances of picking up momentum.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Some external references I like (about software engineering, rather than
debugging):&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Edmond Lau: &lt;a class="reference external" href="http://www.theeffectiveengineer.com/blog/hidden-costs-that-engineers-ignore"&gt;Hidden costs that engineers ignore&lt;/a&gt;
(&lt;strong&gt;Read this&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;Titus Brown: &lt;a class="reference external" href="http://ivory.idyll.org/blog/sep-07/not-sucking-v2"&gt;Writing (Python) Code that Doesn’t Suck&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Peter Norvig: &lt;a class="reference external" href="http://norvig.com/21-days.html"&gt;Teach yourself programming in 10 years&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Paul Stachour and David Collier-Brown: &lt;a class="reference external" href="http://cacm.acm.org/magazines/2009/11/48444-you-dont-know-jack-about-software-maintenance/fulltext"&gt;You Don’t Know Jack About
Software Maintenance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Greg Wilson: &lt;a class="reference external" href="http://software-carpentry.org/"&gt;Software carpentry: a course in software engineering&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="software engineering"></category><category term="software architecture"></category><category term="python"></category><category term="selected"></category></entry><entry><title>Sprint Scikit learn in Paris</title><link href="https://gael-varoquaux.info/programming/sprint-scikit-learn-in-paris.html" rel="alternate"></link><published>2010-07-23T14:31:00+02:00</published><updated>2010-07-23T14:31:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-07-23:/programming/sprint-scikit-learn-in-paris.html</id><summary type="html">&lt;p&gt;We are organizing a coding sprint in Paris on &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit learn&lt;/a&gt;,
&lt;strong&gt;machine learning in Python&lt;/strong&gt;. The goal of this sprint is to set the
API and the general coding guidelines of the scikit to be able to tackle
many different statistical learning problems in a consistent framework.&lt;/p&gt;
&lt;p&gt;This is why …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We are organizing a coding sprint in Paris on &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit learn&lt;/a&gt;,
&lt;strong&gt;machine learning in Python&lt;/strong&gt;. The goal of this sprint is to set the
API and the general coding guidelines of the scikit to be able to tackle
many different statistical learning problems in a consistent framework.&lt;/p&gt;
&lt;p&gt;This is why we would like to have people with different problems,
applications, and backgrounds to pitch in.&lt;/p&gt;
&lt;p&gt;It will be a two-day sprint. Everyone is welcome, so just fill in the
&lt;a class="reference external" href="http://www.doodle.com/4cqxnhuq5rr4qzn5"&gt;doodle&lt;/a&gt;, so that we can choose the date.&lt;/p&gt;
&lt;p&gt;And do not hesitate to suggest some topics that you would like to be
addressed during the sprint, and to discuss them on the &lt;a class="reference external" href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general"&gt;mailing-list&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://parietal.saclay.inria.fr/Members/vincent-michel"&gt;Vincent Michel&lt;/a&gt; is organizing the sprint. If you have questions about
the sprint, you are welcome to contact me, but please do put him in
Cc.&lt;/p&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="scipy"></category><category term="scientific computing"></category><category term="sprint"></category><category term="conferences"></category></entry><entry><title>Simple object signatures</title><link href="https://gael-varoquaux.info/programming/simple-object-signatures.html" rel="alternate"></link><published>2010-07-16T23:31:00+02:00</published><updated>2010-07-16T23:31:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-07-16:/programming/simple-object-signatures.html</id><summary type="html">&lt;div class="section" id="a-signature-pattern"&gt;
&lt;h2&gt;A &lt;em&gt;signature&lt;/em&gt; pattern&lt;/h2&gt;
&lt;p&gt;There are many libraries around to specify what I call a &lt;em&gt;‘signature’&lt;/em&gt;
for an object, in other words a list of attributes that define its
parameter set. I have heavily used &lt;a class="reference external" href="http://code.enthought.com/projects/traits"&gt;Enthought’s Traits library&lt;/a&gt; for
this purpose, but the concept is fairly general and can be …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="a-signature-pattern"&gt;
&lt;h2&gt;A &lt;em&gt;signature&lt;/em&gt; pattern&lt;/h2&gt;
&lt;p&gt;There are many libraries around to specify what I call a &lt;em&gt;‘signature’&lt;/em&gt;
for an object, in other words a list of attributes that define its
parameter set. I have heavily used &lt;a class="reference external" href="http://code.enthought.com/projects/traits"&gt;Enthought’s Traits library&lt;/a&gt; for
this purpose, but the concept is fairly general and can be found &lt;em&gt;eg&lt;/em&gt; in
ORMs (Object Relational Mappers) or web frameworks.&lt;/p&gt;
&lt;p&gt;Specification of this interface of parameters may be used to answer a
variety of needs:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Typing&lt;/strong&gt;: in the case of an ORM, to generate UIs, or for better
error management, it may be desirable to have some control on the
types of certain attributes of an object. In this case, specifying
the signature corresponds to laying out a &lt;strong&gt;data model&lt;/strong&gt; for the
object.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reactive programming&lt;/strong&gt;: using properties to react to changes to
attributes, one can fully specify the API of an object in terms of
these attributes. This gives a message-passing like programming style
that can be very well suited to parallel-computing in particular
because it can easily be made thread-safe.&lt;/li&gt;
&lt;/ul&gt;
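&lt;p&gt;As a minimal sketch of the reactive style described above (the class
and the notification mechanism are illustrative assumptions, not any
particular library’s API): a Python property whose setter reacts to
every change of the attribute.&lt;/p&gt;

```python
# A property-based reactive attribute: assigning to `x` triggers a
# notification, here simply recorded in a log.
class Observable:
    def __init__(self):
        self._x = 0
        self.log = []

    @property
    def x(self):
        return self._x

    @x.setter
    def x(self, value):
        self._x = value
        self.log.append('x changed to %r' % value)


obs = Observable()
obs.x = 3
print(obs.log)    # ['x changed to 3']
```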
&lt;/div&gt;
&lt;div class="section" id="signatures-for-statistical-learning-objects"&gt;
&lt;h2&gt;Signatures for statistical learning objects&lt;/h2&gt;
&lt;p&gt;Recently, I considered the &lt;em&gt;signature&lt;/em&gt; pattern in a new context. In the
&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, we are interested in statistical learning. This entails
fitting models to data and often tuning parameters to select a model
that fits best (a problem called &lt;em&gt;model selection&lt;/em&gt;). Each of our models
is an object that implements a couple of key methods to fit to the data
and to apply to new data (&lt;em&gt;fit&lt;/em&gt; and &lt;em&gt;predict&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;The approach that we are currently taking for model selection is (more
or less) to generate a list of models with different parameters and fit
and test them on the data.&lt;/p&gt;
&lt;p&gt;A very nice feature would be to find out the parameters to vary simply
by inspecting the objects, and such a desire recently got us
&lt;a class="reference external" href="http://sourceforge.net/mailarchive/forum.php?thread_name=201007050958.16199.matthieu.perrot%40cea.fr&amp;amp;forum_name=scikit-learn-general"&gt;discussing&lt;/a&gt; of defining &lt;em&gt;signatures&lt;/em&gt; for our objects. I must confess
that I am a bit weary as this means either depending on a signature
library, or building one. We don’t want to grow our dependencies, and
most signature-definition code that I know involve meta-programming
tricks to avoid code duplication.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="solving-the-simple-problem-avoiding-type-checking"&gt;
&lt;h2&gt;Solving the simple problem: avoiding type checking&lt;/h2&gt;
&lt;p&gt;Today, I had to bite the bullet, because we were in a situation in which
we had to instantiate new models from the existing one during model
selection. For technical reasons, using a &lt;em&gt;copy.copy&lt;/em&gt; to create these
new models was not a great idea, and it was better to have the minimal
list of parameters required. Here come signatures again.&lt;/p&gt;
&lt;p&gt;After a bit of messing around with the code, I realized that typing
information was useless, and most probably harmful, to our immediate
goals, and that I just needed the names of the relevant attributes. I
finally settled on the following solution (which might still
change):&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;All parameters need to be specified as keyword arguments of the
&lt;em&gt;__init__&lt;/em&gt;. The &lt;em&gt;__init__&lt;/em&gt; may not have positional arguments
or ‘*’ arguments. Attributes on the objects have the same names as
the &lt;em&gt;__init__&lt;/em&gt; parameters.&lt;/li&gt;
&lt;li&gt;A simple base class, with a couple of methods relying on a simple use
of the &lt;em&gt;inspect&lt;/em&gt; module to find the signature of the &lt;em&gt;__init__&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseEstimator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_param_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;varargs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inspect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getargspec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;cls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;varargs&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;scikit learn estimators should always specify their &amp;#39;&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;parameters in the signature of their init (no varargs).&amp;#39;&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Remove &amp;#39;self&amp;#39;&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_get_param_names&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_set_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;valid_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_get_param_names&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteritems&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;valid_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Invalid parameter &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt; &amp;#39;&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;for estimator &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="vm"&gt;__class__&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="vm"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nb"&gt;setattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The full code can be seen &lt;a class="reference external" href="attachments/base_estimator.py"&gt;here&lt;/a&gt;; it adds a few more features, such as
a clever &lt;em&gt;__repr__&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;What I like about this solution is that it (almost) does not use
metaprogramming, and avoids code duplication without forcing any specific
pattern on the developer subclassing &lt;em&gt;BaseEstimator&lt;/em&gt;.&lt;/p&gt;
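&lt;p&gt;As an illustration, here is a self-contained sketch of the pattern in
use. This is not the linked file: it uses the modern
&lt;em&gt;inspect.signature&lt;/em&gt; rather than &lt;em&gt;inspect.getargspec&lt;/em&gt;, and
the &lt;em&gt;LinearModel&lt;/em&gt; class is a made-up example.&lt;/p&gt;

```python
import inspect


# A compact re-implementation of the base class above, using the
# modern inspect.signature instead of the old inspect.getargspec.
class BaseEstimator:

    @classmethod
    def _get_param_names(cls):
        # The __init__ arguments (minus self) are the parameter names.
        sig = inspect.signature(cls.__init__)
        return [name for name, param in sig.parameters.items()
                if name != 'self'
                and param.kind == param.POSITIONAL_OR_KEYWORD]

    def _get_params(self):
        # Attributes carry the same names as the __init__ arguments.
        return {key: getattr(self, key)
                for key in self._get_param_names()}


# A subclass only has to follow the convention: keyword arguments in
# __init__, stored on the object under the same names.
class LinearModel(BaseEstimator):
    def __init__(self, l1=.5, fit_intercept=True):
        self.l1 = l1
        self.fit_intercept = fit_intercept


model = LinearModel(l1=.1)
print(model._get_params())    # {'l1': 0.1, 'fit_intercept': True}
```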
&lt;/div&gt;
&lt;div class="section" id="the-next-step"&gt;
&lt;h2&gt;The next step&lt;/h2&gt;
&lt;p&gt;This approach solves my immediate problem, but not the bigger one of
finding what values the different parameters can take when varied for
model selection. Of course, this second problem is much more complicated,
and maybe it is not worth solving: the framework could very easily
bring in more problems than it solves.&lt;/p&gt;
&lt;p&gt;However, it seems that a fairly easy way of specifying possible values
for parameters would be to decorate the &lt;em&gt;__init__&lt;/em&gt;, giving the
values to be tested during model selection:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@cv_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;All the decorator has to do is to store the information in an attribute
attached to the &lt;em&gt;__init__&lt;/em&gt; (and probably to check that the
parameters it was given are valid arguments, in order to raise errors
early). Methods on the class can later inspect this information for
model selection, or GUI building (data-model specification will probably
require some typing language, rather than a simple list of possible
parameters).&lt;/p&gt;
&lt;p&gt;Once again, we would be sidestepping the difficulty of specifying type
information in a non-restrictive way; but avoiding a problem that we
don’t have to solve is probably a good idea.&lt;/p&gt;
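&lt;p&gt;To make the idea concrete, here is a sketch of what such a
&lt;em&gt;cv_params&lt;/em&gt; decorator could look like. This is hypothetical code,
not an implemented API, and a plain list of values stands in for the
&lt;em&gt;np.logspace&lt;/em&gt; call of the example above.&lt;/p&gt;

```python
import inspect


def cv_params(**candidates):
    """Store candidate parameter values on the decorated __init__."""
    def decorate(init):
        # Raise early if a candidate name is not an __init__ argument.
        arg_names = set(inspect.signature(init).parameters)
        for name in candidates:
            if name not in arg_names:
                raise TypeError('%s is not an argument of %s'
                                % (name, init.__name__))
        init._cv_params = candidates
        return init
    return decorate


class LinearModel:
    @cv_params(l1=[.001, .01, .1, 1])
    def __init__(self, l1=.5, fit_intercept=True):
        self.l1 = l1
        self.fit_intercept = fit_intercept


# Model-selection code can later retrieve the stored candidates:
print(LinearModel.__init__._cv_params)    # {'l1': [0.001, 0.01, 0.1, 1]}
```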
&lt;/div&gt;
</content><category term="programming"></category><category term="software engineering"></category><category term="software architecture"></category><category term="design patterns"></category><category term="scientific computing"></category><category term="selected"></category></entry><entry><title>Euroscipy 2010: code, science, and a lot of fun</title><link href="https://gael-varoquaux.info/programming/euroscipy-2010-code-science-and-a-lot-of-fun.html" rel="alternate"></link><published>2010-07-13T17:31:00+02:00</published><updated>2010-07-13T17:31:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-07-13:/programming/euroscipy-2010-code-science-and-a-lot-of-fun.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2010"&gt;Euroscipy 2010&lt;/a&gt;, the third European conference for the use of Python in
science, is just over, and I think it was a great success.&lt;/p&gt;
&lt;div class="section" id="euroscipy-in-numbers"&gt;
&lt;h2&gt;Euroscipy in numbers&lt;/h2&gt;
&lt;p&gt;&lt;img alt="image0" src="http://farm5.static.flickr.com/4118/4779625445_0e783484cd_m_d.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The attendance this year was huge: a grand total of 160 people
came to EuroScipy, with 140 that came only to …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2010"&gt;Euroscipy 2010&lt;/a&gt;, the third European conference for the use of Python in
science, is just over, and I think it was a great success.&lt;/p&gt;
&lt;div class="section" id="euroscipy-in-numbers"&gt;
&lt;h2&gt;Euroscipy in numbers&lt;/h2&gt;
&lt;p&gt;&lt;img alt="image0" src="http://farm5.static.flickr.com/4118/4779625445_0e783484cd_m_d.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The attendance this year was huge: a grand total of 160 people came
to EuroScipy, with 140 attending the tutorials, and 130 the conference.
This is up by almost a factor of 3 compared to last year’s EuroScipy,
more than last year’s SciPy conference in Pasadena, and almost as much
as this year’s SciPy conference in Austin, which hosted 180 people. We
had people coming from 16 countries, from as far as New Zealand, the
US, or Turkey. Research labs, education, and industry (small to large
companies) were all well represented, with approximately a third of the
delegates coming from industry. Similarly, many different scientific
fields were discussed, ranging from landscape ecology to pure math.&lt;/p&gt;
&lt;p&gt;There were 2 tutorial tracks with 10 tutorial slots in each track. We
had 2 keynotes, from Hans Petter Langtangen and Konrad Hinsen. With
regards to the contributed talks, the conference this year was highly
selective: we received 52 proposals, and unfortunately could accept
only 30 of them, which corresponds to an acceptance rate of 58%.
Finally, we had 18 &lt;a class="reference external" href="http://www.euroscipy.org/talk/937"&gt;lightning talks&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="a-warm-and-friendly-atmosphere"&gt;
&lt;h2&gt;A warm and friendly atmosphere&lt;/h2&gt;
&lt;p&gt;&lt;img alt="image1" src="http://farm5.static.flickr.com/4097/4774499149_5dda469dc2_m.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;As an organizer, I was really pleased to find out how relaxed and
friendly people were. This certainly facilitated discussions during the
breaks. And the ambiance was undoubtedly warm: 140 people with laptops
in a room without air conditioning in the Paris summer :).&lt;/p&gt;
&lt;p&gt;Of course during the evenings, many people met to continue the
passionate discussions in restaurants and bars.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="trends-i-noticed"&gt;
&lt;h2&gt;Trends I noticed&lt;/h2&gt;
&lt;p&gt;What one remembers from a conference is obviously biased by personal
interests. With that disclaimer, here are the recurrent and important
topics that I noticed, both in the talks, but also in the coffee break
discussions:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Parallel computing&lt;/strong&gt;, in particular making it easy to do parallel
computing. &lt;a class="reference external" href="http://www.euroscipy.org/talk/2011"&gt;Konrad’s keynote&lt;/a&gt; had many interesting directions to
explore. (talks: &lt;a class="reference external" href="http://www.euroscipy.org/talk/2009"&gt;Playdoh&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/talk/1686"&gt;DANA&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code generation&lt;/strong&gt;. In the various conferences I have been to
recently, I heard much talk about symbolic manipulation of
numerical problems to generate optimal computing kernels (talks:
&lt;a class="reference external" href="http://www.euroscipy.org/talk/1657"&gt;Efficient computation tutorial&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/talk/1666"&gt;Theano&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/talk/2045"&gt;Algorithmic
Differentiation&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data management&lt;/strong&gt;, with problems such as provenance tracking for
reproducibility (talks: &lt;a class="reference external" href="http://www.euroscipy.org/talk/1960"&gt;Sumatra&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/talk/880"&gt;Knowledge management
tutorial&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, installation problems of scientific tools were the subject of
many discussions, as every year. One thing that I did notice is that
people stopped simply blaming each other and acknowledged that nobody
knew how to fix the problem. Somebody even pointed out that installing
any major scientific code was not a piece of cake. Hans Petter and
others said that they had solved the problem by relying on a virtual
machine and Ubuntu.&lt;/p&gt;
&lt;p&gt;Konrad has also &lt;a class="reference external" href="http://khinsen.wordpress.com/2010/07/12/euroscipy-2010/"&gt;blogged&lt;/a&gt;, giving his own view of the conference.&lt;/p&gt;
&lt;p&gt;&lt;img alt="image2" src="http://farm5.static.flickr.com/4097/4778812305_9217c5d3c2_m.jpg" /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="thanks"&gt;
&lt;h2&gt;Thanks&lt;/h2&gt;
&lt;p&gt;The conference could happen only because of the help of many people.
First we need to thank our sponsors: &lt;a class="reference external" href="http://www.enthought.com"&gt;Enthought&lt;/a&gt;, &lt;a class="reference external" href="http://www.python-academy.com/"&gt;Python Academy&lt;/a&gt;,
&lt;a class="reference external" href="http://www.pytables.org"&gt;Pytables&lt;/a&gt;, and especially our host &lt;a class="reference external" href="http://www.ens.fr"&gt;Ecole Normale Supérieure&lt;/a&gt;, which
not only provided us with the rooms, but also made sure that everything
was going well with the sound system, the projection, or the access to
the building. With regards to organization and planing, Nicolas and I
received a lot of help from &lt;a class="reference external" href="http://www.saint-gobain-recherche.com/svi/en/emmanuelle_gouillart.html"&gt;Emmanuelle Gouillart&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="scientific computing"></category><category term="conferences"></category></entry><entry><title>Making posters for scientific conferences</title><link href="https://gael-varoquaux.info/science/making-posters-for-scientific-conferences.html" rel="alternate"></link><published>2010-07-12T00:00:00+02:00</published><updated>2010-07-12T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-07-12:/science/making-posters-for-scientific-conferences.html</id><summary type="html">&lt;p class="first last"&gt;Some advice and examples on making posters for scientific conferences.&lt;/p&gt;
</summary><content type="html">&lt;p&gt;This page gives some advices and examples on making posters for
scientific conference.&lt;/p&gt;
&lt;p&gt;Here are some posters I made (one in 2007, the other in 2011). They don’t
follow all the advice on this page, but they should.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external image-reference" href="attachments/poster_YAO.pdf"&gt;&lt;img alt="poster1" src="attachments/poster_YAO.jpg" style="width: 33%;" /&gt;&lt;/a&gt; &lt;a class="reference external image-reference" href="attachments/poster_hbm2011.pdf"&gt;&lt;img alt="poster2" src="attachments/poster_hbm2011.png" style="width: 33%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;LaTeX sources&lt;/p&gt;
&lt;p&gt;These posters are written in LaTeX. You can download the whole source of
the posters: &lt;a class="reference external" href="attachments/poster.zip"&gt;the first poster (left)&lt;/a&gt;,
and &lt;a class="reference external" href="attachments/poster_hbm2011.zip"&gt;the second one (right)&lt;/a&gt;. These
are some of my personal projects, not meant for sharing. As a result,
they contain a fair amount of hacking. I have been asked for the source code
more than once, so I put it on the web. I do not, however, have time to
provide &lt;strong&gt;any&lt;/strong&gt; support for it (I am already too busy supporting other
things). Any mail asking for help on these files will go unanswered. Sorry.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Here is another example, a bit more visually appealing, as it is intended
for a less technical audience.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_ICE.pdf"&gt;&lt;img alt="" class="align-center" src="attachments/poster_ICE.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;One more about my work: this one was made to convey a strong message and
simplified the content a lot to get the message across. I am not too sure
it worked, but I still find the poster pretty.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_ICOLS07.pdf"&gt;&lt;img alt="" class="align-center" src="attachments/poster_ICOLS07.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;And finally two made by Emmanuelle with really nice colours.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_Emmanuelle.pdf"&gt;&lt;img alt="" src="attachments/poster_Emmanuelle.jpg" /&gt;&lt;/a&gt;
&lt;a class="reference external image-reference" href="attachments/poster_blue.pdf"&gt;&lt;img alt="" src="attachments/poster_blue.jpg" /&gt;&lt;/a&gt;
&lt;div class="section" id="advice-on-poster-presentation"&gt;
&lt;h2&gt;Advice on poster presentation&lt;/h2&gt;
&lt;p&gt;See also &lt;a class="reference external" href="http://www.ncsu.edu/project/posters"&gt;http://www.ncsu.edu/project/posters&lt;/a&gt;&lt;/p&gt;
&lt;div class="section" id="fonts"&gt;
&lt;h3&gt;Fonts&lt;/h3&gt;
&lt;p&gt;Sans-serif fonts look really nice, but are less readable in
paragraphs. Use them for titles and headers. Use serif fonts for
paragraphs. Stick to a simple font family like Times. Use bold fonts
when writing with a light colour on a dark background.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="colours"&gt;
&lt;h3&gt;Colours&lt;/h3&gt;
&lt;p&gt;Stick to a rather small number of colours, but choose them well.
Put a very light colour behind your text blocks. If ink is not too
expensive, I would use a dark background, and have light text blocks on
it. Have well-separated areas in your poster (like the background and
the text blocks), and give the background, or other decorative elements,
little contrast: they should not stand out too much (mine stood out
too much in my poster; that’s because the print-out didn’t look like what
was on the screen).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="page-layout"&gt;
&lt;h3&gt;Page layout&lt;/h3&gt;
&lt;p&gt;Break symmetry and order. A well-aligned poster is boring to the
eye, and does not catch attention from afar. People read your poster by
first scanning through it and stopping at a few key points (usually
first at the upper left, then the upper right, then the lower right, and
the lower left), then they might read it more thoroughly after this first
scan. You want to define these key points visually, make them appealing,
and put key ideas there.&lt;/p&gt;
&lt;p&gt;Long lines are difficult to read. Pick up a book, a flyer, anything made
by a professional publisher: it will never have long lines. A good rule
of thumb is that if a text block has lines longer than 80 characters, it
needs breaking down into several columns.&lt;/p&gt;
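&lt;p&gt;As an illustration of the 80-character rule, here is a minimal LaTeX
sketch of breaking a text block into columns (it assumes the standard
&lt;em&gt;multicol&lt;/em&gt; package, and is not taken from the posters above):&lt;/p&gt;

```latex
% Sketch: break a wide text block into two columns,
% so that no line runs much past 40-50 characters.
\documentclass{article}
\usepackage{multicol}
\begin{document}
\begin{multicols}{2}
Long lines are hard to read. Splitting a wide text
block into two or three columns keeps each line
comfortably short.
\end{multicols}
\end{document}
```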
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="which-software-to-use"&gt;
&lt;h2&gt;Which software to use&lt;/h2&gt;
&lt;p&gt;Many people use PowerPoint to make their posters. It is easy to use, but
it is not dedicated to making posters, and it produces horrible PDFs.&lt;/p&gt;
&lt;p&gt;If you want to pay a lot, there is QuarkXPress, which is very good for this
kind of thing. Adobe PageMaker is also a very good program. &lt;a class="reference external" href="http://www.xara.com/"&gt;Xara&lt;/a&gt; is a cheap and good design program, and a free
version will soon be available for Linux.&lt;/p&gt;
&lt;p&gt;I use LaTeX, just because I love the way it positions characters. But I
admit it is a bit brutal. What I would advise you to use is &lt;a class="reference external" href="http://www.scribus.net"&gt;Scribus&lt;/a&gt;: it is dedicated to making posters, and is free
and open source. I sometimes use LaTeX to create the text boxes, and
Scribus to lay them out. I wrote a &lt;a class="reference external" href="LaTeX-scribus.html"&gt;page&lt;/a&gt;
describing how I do it.&lt;/p&gt;
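&lt;p&gt;For the LaTeX route, a minimal poster skeleton could look like the
following (a sketch assuming the &lt;em&gt;a0poster&lt;/em&gt; class and the
&lt;em&gt;textpos&lt;/em&gt; package; it is not the actual source of the posters
above, which you can download from the links in the box):&lt;/p&gt;

```latex
% Minimal poster sketch: a0poster sets the page size,
% textpos places text blocks at absolute positions.
\documentclass[a0,portrait]{a0poster}
\usepackage[absolute]{textpos}
\setlength{\TPHorizModule}{1cm}   % textblock units in cm
\setlength{\TPVertModule}{1cm}
\begin{document}
\begin{textblock}{60}(10, 3)      % 60cm-wide block at (10cm, 3cm)
  \Huge A poster title, big and readable from afar
\end{textblock}
\begin{textblock}{38}(3, 15)      % a text column on the left
  \large Keep lines short, and put a key idea here.
\end{textblock}
\end{document}
```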
&lt;!-- See also :
http://theoval.cmp.uea.ac.uk/~nlct/jpgfdraw/manual/postertutorial.html --&gt;
&lt;p&gt;One last remark: use vector graphics (eps, ps, pdf, svg), not bitmaps:
bitmaps scale up really badly.
Try to get a vector logo of your institution. Usually, asking the PR
people is all it takes to get one. Of course, if you are using
PowerPoint, chances are that you won’t be able to insert it in your poster.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category><category term="conferences"></category><category term="selected"></category></entry><entry><title>A simple LaTeX example</title><link href="https://gael-varoquaux.info/science/a-simple-latex-example.html" rel="alternate"></link><published>2010-06-01T00:00:00+02:00</published><updated>2010-06-01T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-06-01:/science/a-simple-latex-example.html</id><summary type="html">&lt;p class="first last"&gt;A simple LaTeX document, to use as a skeleton&lt;/p&gt;
</summary><content type="html">&lt;p&gt;Here is a very simple example of a laTeX document that uses good package
to have a simple but nice layout:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="attachments/simple.tex"&gt;The LaTeX source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="attachments/simple.pdf"&gt;The pdf document&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
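&lt;p&gt;For a rough idea of what such a skeleton contains, here is a sketch
using commonly chosen layout packages; it is not the actual contents of
&lt;em&gt;simple.tex&lt;/em&gt;, which you should download from the link above:&lt;/p&gt;

```latex
% A simple article skeleton: readable font, sane margins.
\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{charter}                % a pleasant serif text font
\usepackage[margin=2.5cm]{geometry} % reasonable margins
\begin{document}
\title{A simple document}
\author{Your Name}
\maketitle
\section{Introduction}
Body text goes here.
\end{document}
```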
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Some advice&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Use &lt;a class="reference external" href="http://www.texniccenter.org/"&gt;texniccenter&lt;/a&gt; if you don’t have a
favorite editor.&lt;/li&gt;
&lt;li&gt;Read the &lt;a class="reference external" href="http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf"&gt;not so short introduction to latex&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category><category term="science"></category></entry><entry><title>Personal views on scientific computing</title><link href="https://gael-varoquaux.info/programming/view_on_scientific_computing.html" rel="alternate"></link><published>2010-05-20T00:00:00+02:00</published><updated>2010-05-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-05-20:/programming/view_on_scientific_computing.html</id><summary type="html">&lt;p&gt;My contributions to the scientific computing software ecosystem are
motivated by my vision of computational science.&lt;/p&gt;
&lt;p&gt;Scientific research relies more and more on computing. However, most
researchers are not software engineers, and as computing becomes
ubiquitous, the limiting factor is increasingly the &lt;strong&gt;human
factor&lt;/strong&gt; &lt;a class="reference external" href="http://software-carpentry.org/articles/amsci-swc-2006.pdf"&gt;[G …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;My contributions to the scientific computing software ecosystem are
motivated by my vision of computational science.&lt;/p&gt;
&lt;p&gt;Scientific research relies more and more on computing. However, most
researchers are not software engineers, and as computing becomes
ubiquitous, the limiting factor is increasingly the &lt;strong&gt;human
factor&lt;/strong&gt; &lt;a class="reference external" href="http://software-carpentry.org/articles/amsci-swc-2006.pdf"&gt;[G. Wilson, 2006]&lt;/a&gt; &lt;a class="reference external" href="http://download.on9pc.com/ebook/programing/Teach%20Yourself%20Programming%20in%20Ten%20Years.pdf"&gt;[P.
Norvig, 2009]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;To address the needs of computing accross scientific fields, I believe
that we need a &lt;strong&gt;general-purpose&lt;/strong&gt;, &lt;strong&gt;high-level&lt;/strong&gt;, &lt;strong&gt;interactive&lt;/strong&gt;, and
&lt;strong&gt;highly-readable&lt;/strong&gt; language and set of tools for scientific computing.&lt;/p&gt;
&lt;/div&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;C does not answer my needs: does a molecular biologist know about
pointers? Should she?&lt;/li&gt;
&lt;li&gt;Matlab does not answer my needs either: scientific work with computers
is not only about numerical computation. Have you tried writing
experiment-control software with Matlab? How about file management?
How about inserting your algorithms in a web server?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We need better teaching material that sits at the interface between
software engineering and general science. Most top-notch tools and
libraries are full of domain-specific jargon and conventions.&lt;/p&gt;
&lt;p&gt;For reproducible science, we need the code to be readable and to reflect
the corresponding scientific operation. We need it to be unit-tested to
ensure its correctness.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;We need to consider scientific libraries as end-result of our
research with the same importance than articles &lt;a class="reference external" href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.6201"&gt;[J. Buckheit and D.
Donoho. 1995]&lt;/a&gt;.
They need to convey a scientific message, to be &lt;strong&gt;understandable&lt;/strong&gt; and
&lt;strong&gt;refutable&lt;/strong&gt;. New results should be &lt;strong&gt;reproducible&lt;/strong&gt; via published code
&lt;a class="reference external" href="http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.14"&gt;[CISE Jan. 2009]&lt;/a&gt;. As
for established algorithms, scientific libraries with their
&lt;strong&gt;documentation&lt;/strong&gt; and &lt;strong&gt;examples&lt;/strong&gt; should be the textbooks of tomorrow.&lt;/p&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Scientific software should be as reusable as possible&lt;/strong&gt;, to enable the
advancement of Science via software, year after year. This means that
we need to build general-purpose libraries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code quality and documentation are crucial&lt;/strong&gt;, as human factors are
often the limitation. As a corollary, scientific code should be
unit-tested to ensure correctness.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Core scientific software should be open source&lt;/strong&gt;, as scientific work
cannot build on black boxes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Algorithms should be written as simply as possible&lt;/strong&gt;. A high level of
sophistication in software engineering should not be a requirement for
all scientists.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prefer high-level languages&lt;/strong&gt;. The code should be written at the right
level of abstraction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;We need to build common and shared tools&lt;/strong&gt;. Scientific software
shouldn’t be ‘owned’ by a lab.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The source code should be a deliverable of the research&lt;/strong&gt;. As a result, it
should read clearly and be understandable to all.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation and examples are the textbooks of tomorrow&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Publications should be reproducible&lt;/strong&gt;. Ideally they should become an
example of the library. This should be mitigated by the fact that code
maintenance is costly, and achieving good code takes more work than
publishing. Focus should be on publications that will give rise to reference
results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Academia needs to value software maintenance&lt;/strong&gt;. It is hard and costly,
but it determines our future.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tools that develop the environment, rather than a specific algorithm or
scientific field, are crucial&lt;/strong&gt; (one example is IPython).&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- Cite V Stodden --&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Further reading:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Open source Machine Learning software &lt;a class="reference external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.5605&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;[S. Sonnenburg et al. 2007]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Open source mathematical software &lt;a class="reference external" href="http://www.ams.org/notices/200710/tx071001279p.pdf"&gt;[D. Joyner and W. Stein 2007]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Licensing, intellectual property in scientific work
&lt;a class="reference external" href="http://jolt.unc.edu/sites/default/files/7_nc_jl_tech_321.pdf"&gt;[A. Gonzalez 2006]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Scientific software development best practices
&lt;a class="reference external" href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0020087"&gt;[S. Baxter et al. 2006]&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
</content><category term="programming"></category><category term="science"></category><category term="academia"></category><category term="scientific computing"></category><category term="selected"></category><category term="scientific software"></category></entry><entry><title>EuroScipy abstract submission deadline extended</title><link href="https://gael-varoquaux.info/programming/euroscipy-abstract-submission-deadline-extended.html" rel="alternate"></link><published>2010-05-15T23:36:00+02:00</published><updated>2010-05-15T23:36:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-05-15:/programming/euroscipy-abstract-submission-deadline-extended.html</id><summary type="html">&lt;p&gt;Given that we have been able to turn on registration only very late, the
&lt;a class="reference external" href="http://www.euroscipy.org"&gt;EuroScipy&lt;/a&gt; conference committee is extending the deadline for abstract
submission for the 2010 EuroScipy conference.&lt;/p&gt;
&lt;p&gt;On Thursday May 20th, at midnight Samoa time, we will turn off the
abstract submission on the conference site. Up to …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Given that we have been able to turn on registration only very late, the
&lt;a class="reference external" href="http://www.euroscipy.org"&gt;EuroScipy&lt;/a&gt; conference committee is extending the deadline for abstract
submission for the 2010 EuroScipy conference.&lt;/p&gt;
&lt;p&gt;On Thursday May 20th, at midnight Samoa time, we will turn off the
abstract submission on the conference site. Up to then, you can modify
the already-submitted abstract, or submit new abstracts.&lt;/p&gt;
&lt;p&gt;We are very much looking forward to your submissions to the conference.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;Gaël Varoquaux&lt;/div&gt;
&lt;div class="line"&gt;Nicolas Chauvat&lt;/div&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;EuroScipy 2010 is the annual European conference for scientists using Python. It will be held July 8-11 2010, in ENS, Paris, France.&lt;/div&gt;
&lt;div class="line"&gt;&lt;strong&gt;Links: `Conference website`_,&amp;nbsp; `Call for papers`_,&amp;nbsp; `Practical information`_&lt;/strong&gt;&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="scientific computing"></category><category term="conferences"></category><category term="science"></category></entry><entry><title>EuroScipy is finally open for registration</title><link href="https://gael-varoquaux.info/programming/euroscipy-is-finally-open-for-registration.html" rel="alternate"></link><published>2010-05-13T13:23:00+02:00</published><updated>2010-05-13T13:23:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-05-13:/programming/euroscipy-is-finally-open-for-registration.html</id><summary type="html">&lt;a class="reference external image-reference" href="attachments/poster_euroscipy_2010.pdf"&gt;&lt;img alt="" src="attachments/poster_euroscipy_2010.jpg" /&gt;&lt;/a&gt;
&lt;div class="section" id="the-registration-for-euroscipy-is-finally-open"&gt;
&lt;h2&gt;The registration for &lt;a class="reference external" href="http://www.euroscipy.org//conference/euroscipy2010"&gt;EuroScipy&lt;/a&gt; is finally open.&lt;/h2&gt;
&lt;p&gt;To register, go to the &lt;a class="reference external" href="http://www.euroscipy.org//conference/euroscipy2010"&gt;website&lt;/a&gt;, create an account, and you will see a
&lt;em&gt;‘register to the conference’&lt;/em&gt; button on the left. Follow it to a page
which presents a &lt;em&gt;‘shopping cart’&lt;/em&gt;. Simply submitting this information
registers you to the conference, and on …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;a class="reference external image-reference" href="attachments/poster_euroscipy_2010.pdf"&gt;&lt;img alt="" src="attachments/poster_euroscipy_2010.jpg" /&gt;&lt;/a&gt;
&lt;div class="section" id="the-registration-for-euroscipy-is-finally-open"&gt;
&lt;h2&gt;The registration for &lt;a class="reference external" href="http://www.euroscipy.org//conference/euroscipy2010"&gt;EuroScipy&lt;/a&gt; is finally open.&lt;/h2&gt;
&lt;p&gt;To register, go to the &lt;a class="reference external" href="http://www.euroscipy.org//conference/euroscipy2010"&gt;website&lt;/a&gt;, create an account, and you will see a
&lt;em&gt;‘register to the conference’&lt;/em&gt; button on the left. Follow it to a page
which presents a &lt;em&gt;‘shopping cart’&lt;/em&gt;. Simply submitting this information
registers you to the conference, and on the left of the website, the
button will now display &lt;em&gt;‘You are registered for the conference’&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The registration fee is 50 euros for the conference, and 50 euros for
the tutorial. Right now there is no payment system: you will be
contacted later (in a week) with instructions for paying.&lt;/p&gt;
&lt;p&gt;We apologize for such a late set up. We do realize this has come as an
inconvenience to people.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Do not wait to register: the number of people we can host is
limited.&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="an-exciting-program"&gt;
&lt;h2&gt;An exciting program&lt;/h2&gt;
&lt;div class="section" id="tutorials-from-beginners-to-experts"&gt;
&lt;h3&gt;Tutorials: from beginners to experts&lt;/h3&gt;
&lt;p&gt;We have two tutorial tracks:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/track/871"&gt;**Introductory tutorial**&lt;/a&gt;: to get you to speed on scientific
programming with Python.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/track/872"&gt;**Advanced tutorial**&lt;/a&gt;: experts sharing their knowledge on specific
techniques and libraries.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="scientific-track-doing-new-science-in-python"&gt;
&lt;h3&gt;Scientific track: doing new science in Python&lt;/h3&gt;
&lt;p&gt;Although abstract submission is not yet over, I can say, looking at the
current submissions, that we are going to have a rich set of talks.
In addition to the contributed talks, we have:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2010"&gt;**Keynote speakers**&lt;/a&gt;: Hans Petter Langtangen and Konrard Hinsen,
two major player of scientific computing in Python.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/talk/937"&gt;**Lightning talks**&lt;/a&gt;: one hour will be open for people to come up
and present in a flash an interesting project.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="publishing-papers"&gt;
&lt;h3&gt;Publishing papers&lt;/h3&gt;
&lt;p&gt;We are talking with the editors of a major scientific computing journal,
and the odds are quite high that we will be able to publish a special
issue on scientific computing in Python based on the proceedings of the
conference. The papers will undergo peer-review independently from the
conference, to ensure high quality of the final publication.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="call-for-papers"&gt;
&lt;h2&gt;Call for papers&lt;/h2&gt;
&lt;p&gt;Abstract submission is still open, though not for long. We are
soliciting contributions on scientific libraries and tools developed
with Python and on scientific or engineering achievements using Python.
These include applications, teaching, future development directions, and
current research. See the &lt;a class="reference external" href="http://www.euroscipy.org/card/euroscipy2010_call_for_papers"&gt;call for papers&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I am very much looking forward to passionate discussions about
Python in science in Paris&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="scientific computing"></category></entry><entry><title>Status of the EuroScipy registration</title><link href="https://gael-varoquaux.info/programming/status-of-the-euroscipy-registration.html" rel="alternate"></link><published>2010-05-02T22:57:00+02:00</published><updated>2010-05-02T22:57:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-05-02:/programming/status-of-the-euroscipy-registration.html</id><summary type="html">&lt;p&gt;It is still not possible to register for the &lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2010"&gt;Euroscipy conference&lt;/a&gt;: we
are having difficulties with payment for the registration, and we are
still not sure that we will be able to actually charge money!&lt;/p&gt;
&lt;p&gt;This might not be a bad news, because it might mean that the conference
will …&lt;/p&gt;</summary><content type="html">&lt;p&gt;It is still not possible to register for the &lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2010"&gt;Euroscipy conference&lt;/a&gt;: we
are having difficulties with payment for the registration, and we are
still not sure that we will be able to actually charge money!&lt;/p&gt;
&lt;p&gt;This might not be bad news, because it might mean that the conference
will be completely free. It would mean that we would not be able to
provide lunch, which is a pity, as there is nothing like eating with a
bunch of passionate experts to learn new tricks; but it would not hamper
the conference in any other way, as the rooms are already booked and
various little expenses covered.&lt;/p&gt;
&lt;p&gt;If we manage to sort out payments in the next weeks, the fee should be
50 euros for the 2 days of tutorial, and between 50 and 100 euros for
the full conference, depending on exactly what catering we offer.&lt;/p&gt;
&lt;p&gt;Anyhow, we should open the registration really soon, with or without
payment. We will need some formal registration, as the number of
people that can fit in the rooms is limited.&lt;/p&gt;
&lt;p&gt;All in all, with or without registration fees, it should be possible to
make it to EuroScipy while keeping expenses low: we have listed a few cheap
accommodations on the &lt;a class="reference external" href="http://www.euroscipy.org/card/euroscipy2010_practical_information"&gt;practical details page&lt;/a&gt;, and it is easy to get
good food for a good price in the area.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;I am very excited about this conference. We have two keynotes that I am
really looking forward to hearing, and I can say that we have been
getting pretty good submissions for presentations. Also, chances are
that we will be able to publish proceedings in a peer-reviewed
journal, although I can’t say more about that right now.&lt;/p&gt;
&lt;p&gt;Also, even if you are not interested in scientific research done using
Python, the tutorials are a unique opportunity: we are having top-notch
experts presenting with two tracks, &lt;a class="reference external" href="http://www.euroscipy.org/track/871"&gt;one&lt;/a&gt; to get beginners up to speed
and efficient in a couple of days, and the &lt;a class="reference external" href="http://www.euroscipy.org/track/872"&gt;other&lt;/a&gt; for exploring
advanced subjects. I know the speakers, and I can tell you that I won’t
be talking in the corridor, but sitting with my laptop and listening to
them. People pay large chunks of money for such training, usually.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="scientific computing"></category><category term="conferences"></category></entry><entry><title>Mayavi: Representing an additional scalar on surfaces</title><link href="https://gael-varoquaux.info/programming/mayavi-representing-an-additional-scalar-on-surfaces.html" rel="alternate"></link><published>2010-04-05T00:30:00+02:00</published><updated>2010-04-05T00:30:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-04-05:/programming/mayavi-representing-an-additional-scalar-on-surfaces.html</id><summary type="html">&lt;p&gt;We have been getting a few questions on the &lt;a class="reference external" href="https://mail.enthought.com/mailman/listinfo/enthought-dev"&gt;enthought-dev&lt;/a&gt;
mailing-list on how to represent additional information on a surface
with &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi"&gt;Mayavi&lt;/a&gt;, using a color that is not given by, e.g., the elevation. A &lt;a class="reference external" href="http://dpinte.wordpress.com/2010/03/30/4d-surface-plots-in-mayavi/"&gt;recent
post&lt;/a&gt; by Didrik Pinte on his blog shows the problem quite well:&lt;/p&gt;
&lt;a class="reference external image-reference" href="http://dpinte.wordpress.com/2010/03/30/4d-surface-plots-in-mayavi/"&gt;&lt;img alt="" src="http://dpinte.files.wordpress.com/2010/03/option_valuation_3d.png" /&gt;&lt;/a&gt;
&lt;p&gt;This problem can be seen …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We have been getting a few questions on the &lt;a class="reference external" href="https://mail.enthought.com/mailman/listinfo/enthought-dev"&gt;enthought-dev&lt;/a&gt;
mailing-list on how to represent additional information on a surface
with &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi"&gt;Mayavi&lt;/a&gt;, using a color that is not given by, e.g., the elevation. A &lt;a class="reference external" href="http://dpinte.wordpress.com/2010/03/30/4d-surface-plots-in-mayavi/"&gt;recent
post&lt;/a&gt; by Didrik Pinte on his blog shows the problem quite well:&lt;/p&gt;
&lt;a class="reference external image-reference" href="http://dpinte.wordpress.com/2010/03/30/4d-surface-plots-in-mayavi/"&gt;&lt;img alt="" src="http://dpinte.files.wordpress.com/2010/03/option_valuation_3d.png" /&gt;&lt;/a&gt;
&lt;p&gt;This problem can be seen as taking a standard &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_helper_functions.html#enthought.mayavi.mlab.surf"&gt;surf&lt;/a&gt; plot:&lt;/p&gt;
&lt;a class="reference external image-reference" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_helper_functions.html#enthought.mayavi.mlab.surf"&gt;&lt;img alt="" src="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/_images/enthought_mayavi_mlab_surf.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;but coloring it with a different scalar than the elevation.&lt;/p&gt;
&lt;p&gt;I would like to present two ways of solving this problem: first, a very
simple way specific to this exact problem; second, a more complicated but
quite generic approach.&lt;/p&gt;
&lt;div class="section" id="representing-surfaces-more-complex-than-an-elevation-map"&gt;
&lt;h2&gt;Representing surfaces more complex than an elevation map&lt;/h2&gt;
&lt;p&gt;The first option is simply to use the &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab.html#d-data"&gt;tools&lt;/a&gt; that Mayavi’s &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab.html"&gt;mlab&lt;/a&gt;
interface provides to represent surfaces that are not the particular case
of an elevation plot. In our case, it is very easy to use the &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_helper_functions.html#enthought.mayavi.mlab.mesh"&gt;mesh
function&lt;/a&gt;, which can take the x, y, z positions of a grid giving the
surface, but also an additional scalar value at these positions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Create some data&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mgrid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arctan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Visualize it&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;enthought.mayavi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mesh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;.05&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scalars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Finally, add a few decorations.&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;177&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;img alt="" src="attachments/mesh_example.png" /&gt;
&lt;p&gt;As you can see, this solution is really simple, and solves the problem.&lt;/p&gt;
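&lt;p&gt;One small numpy aside on the data above: np.arctan(x/y) divides by zero on
the y=0 edge of the grid and produces a NaN at the origin (0/0); np.arctan2
computes the same angle without these problems. A minimal sketch, independent
of Mayavi:&lt;/p&gt;

```python
import numpy as np

x, y = np.mgrid[0:10:100j, 0:10:100j]

# np.arctan(x/y) warns about division by zero on the y=0 edge,
# and 0/0 at the origin gives a NaN:
with np.errstate(divide='ignore', invalid='ignore'):
    w_naive = np.arctan(x / y)

# np.arctan2 handles the y=0 edge gracefully: no warning, no NaN.
w_safe = np.arctan2(x, y)

print(np.isnan(w_naive).sum())  # 1 (the origin)
print(np.isnan(w_safe).sum())   # 0
```

&lt;p&gt;The two arrays agree everywhere except at the origin, so the picture is
essentially the same; arctan2 just makes the intent cleaner.&lt;/p&gt;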
&lt;/div&gt;
&lt;div class="section" id="a-generic-way-of-representing-several-scalar-attributes-with-one-visualization"&gt;
&lt;h2&gt;A generic way of representing several scalar attributes with one visualization&lt;/h2&gt;
&lt;p&gt;If we think of the visualization problem as representing two
scalar values, ‘z’ and ‘w’, as functions of two others, ‘x’ and ‘y’,
the above solution is not really satisfactory: the surf function simply
turns the scalar value ‘z’ into elevation (using a WarpScalar filter). We
would like to be able to add an additional scalar value ‘w’ and turn it
into color, just like ‘z’ is turned into elevation. The pipeline
created by the surf function is the following:&lt;/p&gt;
&lt;img alt="" src="attachments/surf_pipeline.png" /&gt;
&lt;p&gt;The first element of the pipeline after the scene is the data source
created for us by the surf function: a 2D array that contains the
‘z’ value as a scalar. The ‘WarpScalar’ filter is applied, and
transforms that value into elevation. After that, a ‘PolyDataNormals’
filter is used to calculate normals, so as to have a smooth rendering,
and finally, a ‘Surface’ module is applied to display the resulting
elevation map as a surface, with a color reflecting the scalar value.&lt;/p&gt;
&lt;p&gt;The way to operate on two scalar values and turn them into elevation
and color successively is to embed both scalar values, ‘z’ and ‘w’, in the
dataset, and use a ‘SetActiveAttribute’ filter to control which
one the ‘Surface’ module is applied to. This approach is much more powerful,
because we can tweak the pipeline ourselves, and use any filter in
place of the WarpScalar to display the ‘z’ information (more on that
below).&lt;/p&gt;
&lt;p&gt;Here is how to achieve a visualization with a similar look to the one above,
but with two scalar values transformed successively into elevation and
color:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;###############################################################&lt;/span&gt;
&lt;span class="c1"&gt;# Create some data&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mgrid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arctan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;###############################################################&lt;/span&gt;
&lt;span class="c1"&gt;# Visualize the data&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;enthought.mayavi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;

&lt;span class="c1"&gt;# Create the data source&lt;/span&gt;
&lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array2d_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add the additional scalar information &amp;#39;w&amp;#39;, this is where we need to be a bit careful,&lt;/span&gt;
&lt;span class="c1"&gt;# see&lt;/span&gt;
&lt;span class="c1"&gt;# http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/example_atomic_orbital.html&lt;/span&gt;
&lt;span class="c1"&gt;# and&lt;/span&gt;
&lt;span class="c1"&gt;# http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/data.html&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlab_source&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;
&lt;span class="n"&gt;array_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;point_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ravel&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;point_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;color&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;point_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Here, we build the very exact pipeline of surf, but add a&lt;/span&gt;
&lt;span class="c1"&gt;# set_active_attribute filter to switch the color, this is code very&lt;/span&gt;
&lt;span class="c1"&gt;# similar to the code introduced in:&lt;/span&gt;
&lt;span class="c1"&gt;# http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab.html#assembling-pipelines-with-mlab&lt;/span&gt;
&lt;span class="n"&gt;warp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warp_scalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;warp_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;normals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;poly_data_normals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;warp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;active_attr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_active_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                            &lt;span class="n"&gt;point_scalars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;color&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;surf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;surface&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;active_attr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Finally, add a few decorations.&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;177&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;82&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The pipeline that is created is the following:&lt;/p&gt;
&lt;img alt="" src="attachments/complex_pipeline.png" /&gt;
&lt;p&gt;In the first part of the pipeline, the ‘WarpScalar’ filter is applied to
the ‘z’ scalar value, whereas, due to the ‘SetActiveAttribute’ filter,
the ‘Surface’ module uses the ‘w’ scalar value to display the color.&lt;/p&gt;
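&lt;p&gt;A detail worth pausing on in the code above is the w.T.ravel() call when
adding the array: VTK’s implicit 2D datasets order points with the first grid
axis varying fastest, which corresponds to numpy’s Fortran order rather than
the default C order, hence the transpose (this ordering is precisely the
subtlety the two documentation links in the code warn about). The pure-numpy
equivalence can be checked in isolation:&lt;/p&gt;

```python
import numpy as np

w = np.arange(6).reshape(2, 3)  # small stand-in for the 'w' array

# Transposing then raveling (C order) is the same as raveling the
# original array in Fortran order, i.e. first axis varying fastest:
flat_t = w.T.ravel()
flat_f = w.ravel(order='F')

print(flat_t.tolist())  # [0, 3, 1, 4, 2, 5]
assert (flat_t == flat_f).all()
```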
&lt;p&gt;This pattern is very powerful, and can be used with other sets of
filters or modules. The example of this pattern that we use in the
Mayavi documentation is the following:&lt;/p&gt;
&lt;a class="reference external image-reference" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/example_atomic_orbital.html"&gt;&lt;img alt="" src="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/_images/example_atomic_orbital.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;We use a ‘Contour’ filter to contour on the amplitude of a complex
field defined in the volume, and then switch to the phase to display the
color. See the &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/example_atomic_orbital.html"&gt;atomic orbital example&lt;/a&gt; in the Mayavi documentation for
more details.&lt;/p&gt;
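&lt;p&gt;The amplitude and phase in that example are plain numpy operations on a
complex field (np.abs and np.angle); only the contouring and the attribute
switch involve Mayavi. A toy field, just to show the two scalar values that
would be fed to such a pipeline (the field itself is an arbitrary stand-in,
not the orbital from the example):&lt;/p&gt;

```python
import numpy as np

# A toy complex field on a small 3D grid
x, y, z = np.mgrid[-1:1:20j, -1:1:20j, -1:1:20j]
field = (x + 1j * y) * np.exp(-(x**2 + y**2 + z**2))

amplitude = np.abs(field)   # what the Contour filter would operate on
phase = np.angle(field)     # what the color would be switched to

print(amplitude.shape, phase.shape)  # (20, 20, 20) (20, 20, 20)
```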
&lt;/div&gt;
&lt;/content&gt;<category term="programming"></category><category term="mayavi"></category><category term="scipy"></category><category term="scientific computing"></category></entry><entry><title>Book review: Matplotlib for Python Developers</title><link href="https://gael-varoquaux.info/programming/book-review-matplotlib-for-python-developpers.html" rel="alternate"></link><published>2010-03-26T10:49:00+01:00</published><updated>2010-03-26T10:49:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-03-26:/programming/book-review-matplotlib-for-python-developpers.html</id><summary type="html">&lt;p&gt;&lt;em&gt;Packt Publishing&lt;/em&gt; sent me a copy of Sandro Tosi’s book &lt;a class="reference external" href="http://www.packtpub.com/matplotlib-python-development/book"&gt;Matplotlib for
Python Developers&lt;/a&gt; a while ago. Unfortunately, it arrived after I had
left for the Christmas break, and I couldn’t find time to review it for
a while (I am terribly bad at time-management, and I do …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;Packt Publishing&lt;/em&gt; sent me a copy of Sandro Tosi’s book &lt;a class="reference external" href="http://www.packtpub.com/matplotlib-python-development/book"&gt;Matplotlib for
Python Developers&lt;/a&gt; a while ago. Unfortunately, it arrived after I had
left for the Christmas break, and I couldn’t find time to review it for
a while (I am terribly bad at time-management, and I do too many things;
as a result I am always overworked). Three months later, I have finally
found time to read it and post a review.&lt;/p&gt;
&lt;div class="section" id="content"&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;The book introduces &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt; which is, for those who don’t know, a
truly fantastic library for scientific plotting in Python. Matplotlib is
great because it is really easy to pick up, and can be used to produce
very high-quality plots.&lt;/p&gt;
&lt;p&gt;The book starts by progressively introducing the simple, imperative API
of matplotlib, with a focus on getting the user plotting
data immediately. It then moves on to a review of the plotting functionality
in matplotlib and the object-oriented usage of matplotlib. Finally, Sandro
shows us how to embed matplotlib in various environments such as GUI
toolkits or web development tools.&lt;/p&gt;
&lt;p&gt;The last part of the book is, in my opinion, the most original and
valuable, as these subjects are less well known and less documented in
classical references accessible to people with a scientific computing
background.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="target-audience"&gt;
&lt;h2&gt;Target audience&lt;/h2&gt;
&lt;p&gt;The book can pretty much be picked up by a scientific Python beginner. It
does require some knowledge of the Python language, but if the reader
has programmed in another language, I don’t see this as a big problem.
In this regard, the book is especially interesting, as it can lead a
scientist from newbie to writing simple end-user programs. There is a
clear need for more of these documents currently.&lt;/p&gt;
&lt;p&gt;The book will also be useful for experienced Python developers
looking to pick up matplotlib quickly.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="personal-comments-on-the-book"&gt;
&lt;h2&gt;Personal comments on the book&lt;/h2&gt;
&lt;p&gt;In my experience, exposing a tool such as matplotlib is a challenge:
everybody has different plotting needs, and there is infinite
variation in the ways a powerful library like matplotlib can be used.
Thus, Sandro’s exposition of matplotlib will not suffice on its own: people
should absolutely read more, and I cannot stress enough that the matplotlib
documentation is excellent.&lt;/p&gt;
&lt;p&gt;In general, I found that the book reads fairly well. Of course, I am
not the best critic in terms of ease of reading, as I know matplotlib very
well. I do find that the book lacks a &lt;em&gt;personal touch&lt;/em&gt;, such as
striking examples or deep insights into specific problems. Nothing
in the book really got me excited (again, maybe because I
already know its content quite well).&lt;/p&gt;
&lt;p&gt;Once again, in my eyes, the biggest contribution of this book is to put
together an introduction to matplotlib, and examples of application
building using matplotlib. I would especially recommend the book for
people wanting to build simple data visualization GUIs.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Finally, with regard to interactive data visualization, in my
experience scientific programmers achieve better productivity when they
avoid working at the widget level and use an abstraction library. I
strongly recommend looking at &lt;a class="reference external" href="http://code.enthought.com/projects/traits/docs/html/"&gt;TraitsUI&lt;/a&gt; for this purpose. You can find
a tutorial &lt;a class="reference external" href="http://gael-varoquaux.info/computers/traits_tutorial/index.html"&gt;here&lt;/a&gt; (disclaimer: I wrote that tutorial).&lt;/p&gt;
&lt;p&gt;Also, if you are going to write a data visualization program that is
interactive in the sense that it enables the user to interact with the
data, using &lt;a class="reference external" href="http://code.enthought.com/chaco/"&gt;Chaco&lt;/a&gt; instead of matplotlib may make your life easier.
Chaco is not as well polished and documented as matplotlib, and I would
never use it for quick scripting work, but it has a strong focus on
data interaction, and as such makes it really easy to build very
responsive user interfaces, because it is very fast and has a clear
object-oriented API.&lt;/p&gt;
&lt;/div&gt;
&lt;/content&gt;<category term="programming"></category><category term="python"></category><category term="scientific computing"></category><category term="books"></category></entry><entry><title>New Mayavi release</title><link href="https://gael-varoquaux.info/programming/new-mayavi-release.html" rel="alternate"></link><published>2010-03-14T12:58:00+01:00</published><updated>2010-03-14T12:58:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-03-14:/programming/new-mayavi-release.html</id><summary type="html">&lt;p&gt;A week ago, Peter Wang released a new version of the &lt;a class="reference external" href="http://code.enthought.com/"&gt;Enthought Tool
Suite (ETS)&lt;/a&gt;. With it came a new version of &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi2&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Prabhu and I have been horribly busy with real life, and I had the bad
feeling that we were not giving enough love to Mayavi. I …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A week ago, Peter Wang released a new version of the &lt;a class="reference external" href="http://code.enthought.com/"&gt;Enthought Tool
Suite (ETS)&lt;/a&gt;. With it came a new version of &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi2&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Prabhu and I have been horribly busy with real life, and I had the bad
feeling that we were not giving enough love to Mayavi. I was surprised
when I put together the list of features and bug fixes that went into
Mayavi over the last two releases. The full list can be found &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/changes.html"&gt;in the
documentation&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="contributors"&gt;
&lt;h2&gt;Contributors&lt;/h2&gt;
&lt;p&gt;We are not terribly good at tracking external ideas and patches,
so I hope that I haven’t forgotten anybody, but I am very happy to say
that Prabhu and I have received a fair amount of help from non-core
contributors:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Chris Colbert&lt;/li&gt;
&lt;li&gt;Darren Dale&lt;/li&gt;
&lt;li&gt;Dave Martin&lt;/li&gt;
&lt;li&gt;Dave Peterson&lt;/li&gt;
&lt;li&gt;Emmanuelle Gouillart&lt;/li&gt;
&lt;li&gt;Erik Tollerud&lt;/li&gt;
&lt;li&gt;Evan Patterson&lt;/li&gt;
&lt;li&gt;Gary Ruben&lt;/li&gt;
&lt;li&gt;Kyle Mandli&lt;/li&gt;
&lt;li&gt;Michele Mattioni&lt;/li&gt;
&lt;li&gt;Ondrej Certik&lt;/li&gt;
&lt;li&gt;Ram Rachum&lt;/li&gt;
&lt;li&gt;Robert Kern&lt;/li&gt;
&lt;li&gt;Scott Warts&lt;/li&gt;
&lt;li&gt;Suyog Jain&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On top of these people, I wish to thank the people making sure that the
Mayavi packages are available in the different Linux distributions:
Varun Hiremath, Lev Givon, Andrea Colangelo, Rakesh Pandit, as well as
Pierre Raybaut for integrating it in &lt;a class="reference external" href="http://pythonxy.com"&gt;Pythonxy&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="important-features-added-in-3-3-0"&gt;
&lt;h2&gt;Important features added in 3.3.0&lt;/h2&gt;
&lt;p&gt;3.3.0 was released last fall. We had not compiled the list of changes at
the time, so I am giving it here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;An &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/examples.html"&gt;example gallery&lt;/a&gt; in the documentation.&lt;/li&gt;
&lt;li&gt;A &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_figure.html#sync-camera"&gt;sync_camera&lt;/a&gt; helper function to synchronize camera between two
scenes.&lt;/li&gt;
&lt;li&gt;A &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_other_functions.html#text3d"&gt;text3d&lt;/a&gt; module, for positioning text in 3D so that it is scaled and hidden
like a data object.&lt;/li&gt;
&lt;li&gt;A &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_figure.html#close"&gt;close&lt;/a&gt; function to close scenes, similar to that in pylab or
matlab.&lt;/li&gt;
&lt;li&gt;A new filter to crop datasets: &lt;em&gt;DataSet Clipper&lt;/em&gt;. This filter is
terribly useful.&lt;/li&gt;
&lt;li&gt;All the &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab_pipeline_reference.html"&gt;mlab.pipeline&lt;/a&gt; functions now take a &lt;em&gt;figure=&lt;/em&gt; keyword
argument. This is very useful when coding with several figures
embedded in GUIs, as in a GUI you can’t rely on a context. This is
illustrated in this &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/example_multiple_mlab_scene_models.html"&gt;example&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="important-features-added-in-3-3-1"&gt;
&lt;h2&gt;Important features added in 3.3.1&lt;/h2&gt;
&lt;p&gt;In the latest release, the following important features were added:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_figure.html#savefig"&gt;mlab.savefig&lt;/a&gt; can now reliably save images of a size larger than
the window.&lt;/li&gt;
&lt;li&gt;The interactive VTK documentation browser is now available in the
GUI.&lt;/li&gt;
&lt;li&gt;New functions added to &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab.html"&gt;mlab&lt;/a&gt; to control position of the camera:
&lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#move"&gt;move&lt;/a&gt;, &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#yaw"&gt;yaw&lt;/a&gt;, and &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#pitch"&gt;pitch&lt;/a&gt;. These complement the existing &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#view"&gt;view&lt;/a&gt;
and &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#roll"&gt;roll&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Make the lines smoother when using &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_helper_functions.html#enthought.mayavi.mlab.plot3d"&gt;mlab.plot3d&lt;/a&gt; (using a VTK Stripper
filter).&lt;/li&gt;
&lt;li&gt;Add a &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_figure.html#enthought.mayavi.mlab.screenshot"&gt;screenshot&lt;/a&gt; function to mlab for easy screen capture as a
numpy array. This is very useful when creating figures that combine
3D using Mayavi and 2D using pylab. I use it all the time.&lt;/li&gt;
&lt;li&gt;Add a &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_pipeline_data.html#probe-data"&gt;probe_data&lt;/a&gt; function to return the data values of Mayavi
objects at given locations as numpy arrays. This is very useful to
combine numerics with Mayavi.&lt;/li&gt;
&lt;li&gt;Add an auto mode to &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#view"&gt;mlab.view&lt;/a&gt; to compute the camera position and distance
based on the objects in the scene.&lt;/li&gt;
&lt;li&gt;Add a helper function to easily interact with the data: a callback
can be registered for picking data with the mouse. &lt;a class="reference external" href="https://svn.enthought.com/enthought/browser/Mayavi/trunk/examples/mayavi/data_interaction/"&gt;Two
examples&lt;/a&gt; illustrate this new functionality. This is a major step
forward in making life easier for people using Mayavi to build custom
interfaces.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="mayavi"></category></entry></feed>