<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Gaël Varoquaux - science</title><link href="https://gael-varoquaux.info/" rel="alternate"></link><link href="https://gael-varoquaux.info/feeds/science.atom.xml" rel="self"></link><id>https://gael-varoquaux.info/</id><updated>2026-01-02T00:00:00+01:00</updated><entry><title>2025 highlights: AI research and code</title><link href="https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html" rel="alternate"></link><published>2026-01-02T00:00:00+01:00</published><updated>2026-01-02T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2026-01-02:/science/2025-highlights-ai-research-and-code.html</id><summary type="html">&lt;div class="figure align-right"&gt;
&lt;img alt="" class="small" src="attachments/2025_highlights/eiffel_tower_ai.jpg" /&gt;
&lt;p class="caption"&gt;AI is everywhere. Can you see it here?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Some highlights about my work in 2025: progress on
tabular-learning stands out, a publication on unpacking trade-off and
consequences of scale in AI, and of course progress on the open-source
data-science and machine learning stack.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2026 starts, I’m looking …&lt;/p&gt;</summary><content type="html">&lt;div class="figure align-right"&gt;
&lt;img alt="" class="small" src="attachments/2025_highlights/eiffel_tower_ai.jpg" /&gt;
&lt;p class="caption"&gt;AI is everywhere. Can you see it here?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Some highlights about my work in 2025: progress on
tabular-learning stands out, a publication on unpacking trade-off and
consequences of scale in AI, and of course progress on the open-source
data-science and machine learning stack.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2026 starts, I’m looking back on 2025. It was all about AI, with
research in the &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;soda team&lt;/a&gt; on tabular
machine learning stimulating better software.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#beyond-maths-unpacking-the-scale-narrative-in-ai" id="toc-entry-1"&gt;Beyond maths: Unpacking the scale narrative in AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabular-learning-research" id="toc-entry-2"&gt;Tabular-learning research&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-open-source-table-foundation-model" id="toc-entry-3"&gt;TabICL:  open-source table foundation model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#retrieve-merge-predict-tradeoffs-of-predictions-from-data-lakes" id="toc-entry-4"&gt;Retrieve merge predict: tradeoffs of predictions from data lakes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#growing-the-machine-learning-and-data-science-stack" id="toc-entry-5"&gt;Growing the machine learning and data science stack&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#skrub-machine-learning-with-tables" id="toc-entry-6"&gt;Skrub: machine learning with tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#fundamental-progress-in-scikit-learn" id="toc-entry-7"&gt;Fundamental progress in scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="beyond-maths-unpacking-the-scale-narrative-in-ai"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Beyond maths: Unpacking the scale narrative in AI&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Plotting the increase in scale of notable AI systems over recent
years reveals a staggering explosion. AI systems have been growing
super-exponentially along a variety of dimensions: training compute,
training cost (figure below), inference cost, and amount of data used.
Studying the wording used in pivotal publications as well as company
communications shows that it anchors AI success in this growth, thus
&lt;strong&gt;setting implicit social norms around scale&lt;/strong&gt;. But
systematic analysis of benchmark results shows that &lt;strong&gt;scale does
not always bring benefits&lt;/strong&gt;. The narrative of scale is
oversimplified and leaves aside many important ingredients of the success
of AI systems. In addition, the race for scale comes with planetary and
societal consequences, which we study and &lt;a class="reference external" href="https://dl.acm.org/doi/10.1145/3715275.3732006"&gt;document&lt;/a&gt;. Ever-increasing
inference costs threaten economic and electricity sustainability. An
unstoppable appetite for training data leads to fitting models on
enormous datasets that elude quality control, engulfing undesirable
facets of the internet (including child pornography) or eroding privacy.
The race for scale also has financial consequences, benefiting above all
the providers of compute, but also structuring an ecosystem where
cash-rich and GPU-rich actors have leverage on priorities, industrial or
academic. These actors sometimes have circular investment strategies:
funding third parties that will spend all this funding on compute, which
can fuel &lt;strong&gt;an investment bubble in AI&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2025_highlights/cost_ai.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Evolution of the training cost (in dollars) of notable AI systems
across the years&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;We conclude our study, &lt;a class="reference external" href="https://dl.acm.org/doi/10.1145/3715275.3732006"&gt;published at FAccT&lt;/a&gt;, by underlining that &lt;strong&gt;academic
research has a central role to play in these dynamics and must shape a
healthy and grounded narrative&lt;/strong&gt;. We recommend:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;pursue basic AI research that is of interest independently of scale, &lt;em&gt;eg&lt;/em&gt;
uncertainty quantification, causality…&lt;/li&gt;
&lt;li&gt;uphold responsible norms, in particular avoiding requests for
increased compute when editing or reviewing,&lt;/li&gt;
&lt;li&gt;always publish measures of compute to document the tradeoffs.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2025_highlights/pareto_schema.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;We need to document and explore the tradeoffs&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In addition, I personally want to push these tradeoffs in the
direction of resource-efficient progress, and not only resource-intensive
progress (as illustrated in the figure alongside),
which is the easy route to task performance, but not the one that brings
the most value.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="tabular-learning-research"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Tabular-learning research&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="tabicl-open-source-table-foundation-model"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;TabICL:  open-source table foundation model&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Recent tabular-learning models have been bringing better performance. A
poster example is the TabPFN series of models, which rely on
pretrained transformers to achieve excellent performance. However, the
quadratic complexity of the transformers is a bottleneck. I do fear that
the agenda of fancy tabular learning is leading us into a race for scale
again.&lt;/p&gt;
&lt;p&gt;With the &lt;a class="reference external" href="https://icml.cc/virtual/2025/poster/46681"&gt;TabICL model&lt;/a&gt; we
strove to decrease this computational cost. We showed that a multi-stage
architecture can build a pre-trained in-context predictor in which the
separation of stages decreases the quadratic cost. The model can be
pretrained on larger datasets, and is thus the best performer in settings
with larger tables. The model is also faster than the alternatives, in
particular when using a CPU rather than a GPU. In addition, we released
&lt;strong&gt;all the code in open source&lt;/strong&gt;, including the pretraining.&lt;/p&gt;
&lt;p&gt;TabICL gives a table foundation model that is easy to use on modest or
big hardware and that can be easily customized.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="retrieve-merge-predict-tradeoffs-of-predictions-from-data-lakes"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Retrieve merge predict: tradeoffs of predictions from data lakes&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A full data-science pipeline must often assemble data across multiple
source tables:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Alice is working on a base table that contains information about
movies. She also has access to a data lake, a collection of other
tables on all sorts of subjects. She wants to predict the ranking of
a movie based on as much information as possible. She would like to
extract information from the data lake to improve the performance of her
model.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The challenge is that the information of interest is mixed with a
huge amount of unrelated data. Thus, Alice’s problem is: “how to find
tables that are relevant to my problem? how to combine them with the
base table?”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When the user is faced with a complex data lake, with many
tables and few explicit links between them, it is difficult to find the
best assembly for a given machine-learning task. This problem requires
not only finding which tables must be joined to the main table of interest
(a table-retrieval problem), but also how to aggregate multiple records
when tables are linked through a many-to-one relation. While table
retrieval is a classic problem in the data-management literature, it had
been understudied in the case of supervised machine learning. We
assembled a systematic (and open) benchmark with data lakes &lt;em&gt;and&lt;/em&gt;
supervised-learning tasks (&lt;a class="reference external" href="https://openreview.net/pdf?id=4uPJN6yfY1"&gt;publication&lt;/a&gt;, &lt;a class="reference external" href="https://soda-inria.github.io/retrieve-merge-predict/"&gt;benchmark material&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;We found that supervised learning does change the picture compared to
classic table-retrieval settings: for a fixed compute budget, it is worth
avoiding fancy retrieval methods, which can be very computationally
costly, and instead using better supervised-learning methods, which can
be comparatively less expensive while still being able to extract the
relevant information from a noisy retrieval.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2025_highlights/yadl_benchmark.png" style="width: 700px;" /&gt;
&lt;p class="caption"&gt;A schema of the pipeline&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The pipeline studied here is broader than the typical
machine-learning modeling step. In my experience, data-science
applications are often much more complex than mere tabular learning, and
for this reason we develop the skrub software, described below.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="growing-the-machine-learning-and-data-science-stack"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Growing the machine learning and data science stack&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="skrub-machine-learning-with-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Skrub: machine learning with tables&lt;/a&gt;&lt;/h3&gt;
&lt;a class="reference external image-reference" href="https://skrub-data.org"&gt;&lt;img alt="" class="align-right" src="attachments/skrub_logo.png" style="width: 150px;" /&gt;&lt;/a&gt;
&lt;p&gt;&lt;a class="reference external" href="https://skrub-data.org"&gt;Skrub&lt;/a&gt; is a recent library to blend machine
learning with data-frame computing. In 2025, we have ironed existing
features to make them more performant and really easy to use. For
instance the &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html"&gt;TableVectorizer&lt;/a&gt;
is incredibly useful to build tabular machine-learning pipelines. But we
have also added exciting new features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.ApplyToCols.html"&gt;ApplyToCols&lt;/a&gt; is an object that uses skrub’s powerful &lt;a class="reference external" href="https://skrub-data.org/stable/modules/multi_column_operations/selectors.html"&gt;selectors&lt;/a&gt; to apply transformations to some columns but not others. I find myself using it all the time.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://skrub-data.org/stable/data_ops.html"&gt;DataOps&lt;/a&gt; are an
incredibly powerful way of blending dataframe transformation and
scikit-learn fit/transform/predict API, to build complete machine
learning pipeline across multiple tables. The benefit is that, unlike
standard data wrangling code, they can be applied to new data,
cross-validated, or any component of the pipeline can be tuned to
maximize a prediction score. We even have added optuna support for this
tuning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="fundamental-progress-in-scikit-learn"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Fundamental progress in scikit-learn&lt;/a&gt;&lt;/h3&gt;
&lt;a class="reference external image-reference" href="https://scikit-learn.org"&gt;&lt;img alt="" class="align-right" src="attachments/scikit-learn-logo.png" style="width: 150px;" /&gt;&lt;/a&gt;
&lt;p&gt;What strikes me in the 2025 releases of &lt;a class="reference external" href="https://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; is that we have been
making progress on fundamental improvements to the core features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Faster linear models and tree-based models thanks to better
algorithms (which, in certain cases, give massive speedups).&lt;/li&gt;
&lt;li&gt;Ramping up GPU support: we are progressively adding to scikit-learn a
compute backend that enables GPU computing (an intro &lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#array-api-support-enables-gpu-computations"&gt;here&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Free-threading: we now support the “free-threaded” version of Python,
which removes a central lock and opens the door to
heavily-multithreaded parallel computing. More of the ecosystem needs
to support free-threaded Python for it to be widely used, but I am
hoping that in the medium term we’ll see great improvements to parallel
computing.&lt;/li&gt;
&lt;/ul&gt;
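&lt;p&gt;For the curious, a small sketch for checking whether your interpreter
is a free-threaded build. The build flag and the runtime check below
exist from Python 3.13; on older versions the GIL is simply always
active, which the fallback reflects.&lt;/p&gt;

```python
# Detect a free-threaded CPython build. Py_GIL_DISABLED is set to 1 in
# free-threaded builds; sys._is_gil_enabled() (Python 3.13+) reports
# whether the GIL is actually active at runtime.
import sys
import sysconfig

free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
# On interpreters older than 3.13 the attribute is absent: GIL is on.
gil_active = getattr(sys, "_is_gil_enabled", lambda: True)()
print(f"free-threaded build: {free_threaded_build}, GIL active: {gil_active}")
```

&lt;p&gt;Note that even on a free-threaded build the GIL can be re-enabled at
runtime (for instance by an extension that does not yet support it),
which is why the build flag and the runtime check are reported
separately.&lt;/p&gt;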
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Exciting times :)&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="python"></category><category term="yearly report"></category></entry><entry><title>TabICL: Pretraining the best tabular learner</title><link href="https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html" rel="alternate"></link><published>2025-07-09T00:00:00+02:00</published><updated>2025-07-09T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-07-09:/science/tabicl-pretraining-the-best-tabular-learner.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;TabICL is a state-of-the-art tabular learner &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;. The key is its very rich
prior, that is baked in a pre-trained architecture -a table foundation
model-, and leveraged by in-context-learning. Thanks to clever
choices, it is fast and scalable, efficient even without a GPU.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#recent-progress-in-tabular-learning-in-context-learning" id="toc-entry-1"&gt;Recent progress …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;TabICL is a state-of-the-art tabular learner &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;. The key is its very rich
prior, that is baked in a pre-trained architecture -a table foundation
model-, and leveraged by in-context-learning. Thanks to clever
choices, it is fast and scalable, efficient even without a GPU.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#recent-progress-in-tabular-learning-in-context-learning" id="toc-entry-1"&gt;Recent progress in tabular learning: In-Context Learning&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabular-learning-as-a-completion-problem" id="toc-entry-2"&gt;Tabular learning as a completion problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sophisticated-prior-via-data-generation" id="toc-entry-3"&gt;Sophisticated prior via data generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-improved-architecture" id="toc-entry-4"&gt;TabICL: improved architecture&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-challenge-accounting-for-the-structure-of-tables" id="toc-entry-5"&gt;The challenge: accounting for the structure of tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-s-solution" id="toc-entry-6"&gt;TabICL’s solution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-result-a-powerful-and-easy-to-use-tabular-learner" id="toc-entry-7"&gt;The result: a powerful and easy to use tabular learner&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;This note is about the research behind TabICL &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;, work by Jingang Qu, David
Holzmüller, myself, and Marine Le Morvan, published at ICML 2025, and
available as &lt;a class="reference external" href="https://tabicl.readthedocs.io/en/latest/"&gt;open-source software&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="recent-progress-in-tabular-learning-in-context-learning"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Recent progress in tabular learning: In-Context Learning&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Describing the statistical structure of tables in general is very subtle.
They do have some unique statistical features. For instance, each column
is typically meaningful by itself, more meaningful than linear
combinations of columns (the data are &lt;em&gt;not rotationally invariant&lt;/em&gt;, cf
&lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;[Grinsztajn et al, 2022]&lt;/a&gt;).
For a long time, tree-based models, in particular gradient-boosted trees,
were the models that best captured this statistical structure.&lt;/p&gt;
&lt;p&gt;The question is indeed: &lt;strong&gt;how to build complex and rich inductive biases
into statistical models&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;A pioneering contribution to this question was made with the TabPFN
approach &lt;a class="reference external" href="https://www.nature.com/articles/s41586-024-08328-6"&gt;[Hollmann et al, 2025]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="tabular-learning-as-a-completion-problem"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Tabular learning as a completion problem&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/table_in_context_learning.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;Prediction by table completion using across-row transformers&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The key idea behind this line of work is that tabular learning can be
seen as completing a table where one column has a missing entry.
Transformer-based large-language models are very good at completing
sequences, in particular in the few-shot regime. Hence the idea to use a
transformer architecture for this table-completion task.&lt;/p&gt;
&lt;p&gt;More specifically, this is a &lt;em&gt;meta-learning&lt;/em&gt; setting (learning to learn),
using transformers.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sophisticated-prior-via-data-generation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Sophisticated prior via data generation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Teaching transformers to predict well requires showing them many many
prediction problems.&lt;/p&gt;
&lt;p&gt;The benefit of this approach is that these prediction problems can be
chosen to reflect well the downstream task. In particular, it becomes now
easy to bake in any form of inductive bias by simulating data.&lt;/p&gt;
&lt;p&gt;TabPFN simulates data by cascading a series of simple transformations,
each combining very few columns. The actual data-generative processes are
more subtle, but the idea is that they are plausible for data tables.&lt;/p&gt;
&lt;p&gt;Experience (from us and others) shows that pretraining on a quality
data-generation process is crucial to produce a good tabular learner,
as with foundation models in other settings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="tabicl-improved-architecture"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;TabICL: improved architecture&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="the-challenge-accounting-for-the-structure-of-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;The challenge: accounting for the structure of tables&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabpfn_architecture.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;Tables are 2D objects, and the TabPFNv2 architecture alternates
attentions across row and across columns&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In practice, a table is not a 1D structure like a sentence. It is closer
to a 2D structure, with rows and columns. A good architecture will
account for this structure, and the TabPFNv2 architecture uses
transformers with alternating across-row and across-column attention.&lt;/p&gt;
&lt;p&gt;One problem is the computational complexity: attention is quadratic in
the number of entries, and the bi-directional transform of TabPFNv2 leads
to a cost in &lt;em&gt;O(n p² + p n²)&lt;/em&gt; for a table with &lt;em&gt;n&lt;/em&gt; rows and &lt;em&gt;p&lt;/em&gt; columns.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="tabicl-s-solution"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;TabICL’s solution&lt;/a&gt;&lt;/h3&gt;
&lt;div class="section" id="row-wise-encoding"&gt;
&lt;h4&gt;Row-wise encoding&lt;/h4&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_architecture.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;To break the quadratic cost, TabICL first encodes the rows to a
smaller, fixed-sized, represention, before performing across-row
in-context learning.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;For more scalability and better inductive bias, our model, TabICL, first
embeds the rows (using a first transformer) and then does in-context
learning across rows (with a second transformer). The resulting
computational complexity is &lt;em&gt;O(n p² + n²)&lt;/em&gt;, which is more scalable,
though still quadratic in &lt;em&gt;n&lt;/em&gt; and &lt;em&gt;p&lt;/em&gt;.&lt;/p&gt;
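&lt;p&gt;To make the gain concrete, here is a back-of-the-envelope comparison
of the two dominant costs discussed above (entries attended to, ignoring
constant factors; the specific n and p values are illustrative):&lt;/p&gt;

```python
# Compare the dominant attention costs: O(n p**2 + p n**2) for
# alternating row/column attention, vs O(n p**2 + n**2) once rows are
# first encoded to a fixed-size representation (the TabICL approach).
def cost_alternating(n, p):
    return n * p**2 + p * n**2

def cost_row_encoded(n, p):
    return n * p**2 + n**2

p = 50  # illustrative number of columns
for n in (1_000, 10_000, 100_000):
    ratio = cost_alternating(n, p) / cost_row_encoded(n, p)
    print(f"n={n:>7}: cost ratio ~ {ratio:.1f}")
```

&lt;p&gt;For large n the ratio approaches p: the p·n² term dominates, so
removing the factor p from the quadratic-in-n term is what buys the
extra scalability.&lt;/p&gt;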
&lt;p&gt;Scalability is important because it enables us to pretrain TabICL on both
small &lt;em&gt;and&lt;/em&gt; large datasets, and as a consequence TabICL is a good
predictor for large datasets.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="column-specific-embeddings"&gt;
&lt;h4&gt;Column-specific embeddings&lt;/h4&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_embeddings.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;To apply different transformations on columns depending on their
statistical properties, TabICL builds positional embeddings for
columns that capture aspects of their distribution.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Another important innovation of TabICL is that it inputs the entries in
the transformer with column-specific embeddings. These column embeddings
are computed to be a function of the distribution of the column. For
this, we use a set transformer, which is a scalable transformer-like way
of building a function on sets (but without the quadratic complexity).&lt;/p&gt;
&lt;p&gt;After pretraining, we find that the column embeddings have learned a
mapping that implicitly captures statistical aspects of the column’s
distribution, such as the kurtosis or the skewness.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-result-a-powerful-and-easy-to-use-tabular-learner"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;The result: a powerful and easy to use tabular learner&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;After a lot of pretraining on synthetic data, TabICL is a
state-of-the-art tabular learner. Pretraining gave it the right inductive
bias, as visible from the classifier-comparison plot below:&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_comparison.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;A classic classification comparison plot that shows the decision
boundaries on very simple toy data. It is useful to get a feeling of
how classifiers behave.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is interesting to see that while TabICL forms very flexible decision
boundaries, they do extend along the horizontal and vertical axes, as do
the decision tree and random forest. These axis-aligned features are a
very important aspect of the inductive bias.&lt;/p&gt;
&lt;p&gt;At the end of the day, TabICL is an excellent tabular learner, as visible
on benchmarks:&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/result_comparison.png" /&gt;
&lt;p class="caption"&gt;TabICL is a great predictor: Comparison of many predictors.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabarena.png" /&gt;
&lt;p class="caption"&gt;Experimental results, from a benchmark paper independent of the TabICL
paper: TabArena &lt;a class="reference external" href="https://arxiv.org/abs/2506.16791"&gt;[Erickson et al, 2025]&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The benefit of TabICL over TabPFNv2 becomes more marked for larger datasets:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_scale_bench.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;Rank (lower is best) as a function of dataset size.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;However, one limitation to keep in mind is that with in-context learners,
such as TabICL or TabPFN, inference (prediction on new data points) can be
costly.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;All in all, TabICL is an excellent tabular predictor, and a push forward
for tabular foundation models. From a fundamental standpoint, it shows
that in-context learning is not only for few-shot learning: it can be
very beneficial for sample sizes as large as &lt;em&gt;n=100,000&lt;/em&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;More about TabICL&lt;/p&gt;
&lt;p&gt;There is a lot more in TabICL: the details of pretraining are crucial,
and the implementation uses memory offloading (facilitated by the
architecture, which dissociates the train X from the test y for most
of the operations). To learn more about TabICL:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The paper: &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;https://arxiv.org/abs/2502.05564&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The GitHub code: &lt;strong&gt;TabICL is 100% open source&lt;/strong&gt;
&lt;a class="reference external" href="https://github.com/soda-inria/tabicl"&gt;https://github.com/soda-inria/tabicl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Install the Python package, TabICL is just one pip install away
&lt;a class="reference external" href="https://pypi.org/project/tabicl/"&gt;https://pypi.org/project/tabicl/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Other topics in table foundation models: leveraging strings&lt;/p&gt;
&lt;p&gt;TabICL is only one aspect of table foundation models. We are also
pursuing another line of research that focuses on using strings (in
entries and column names) to bring knowledge about the real world into
table foundation models; see &lt;a class="reference external" href="carte-toward-table-foundation-models.html"&gt;CARTE&lt;/a&gt; and more recently &lt;a class="reference external" href="https://arxiv.org/abs/2505.14415"&gt;[Kim
et al, 2025]&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="machine learning"></category><category term="tabular learning"></category><category term="foundation models"></category></entry><entry><title>AI agents that use tools</title><link href="https://gael-varoquaux.info/science/ai-agents-that-use-tools.html" rel="alternate"></link><published>2025-07-04T00:00:00+02:00</published><updated>2025-07-04T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-07-04:/science/ai-agents-that-use-tools.html</id><summary type="html">&lt;img alt="Image generated with ChatGPT, with the prompt &amp;quot;Please generate an image of an AI using a mechanical tool, such as a wrench. Please make the robot look rather friendly. Also, please make the image square&amp;quot;" class="small align-right" src="../science/attachments/robot_tool_friendly.png" /&gt;
&lt;p&gt;Modern AIs acquire new capabilities by combining tools to perform a
complex task, controlling them like an agent. Unlike in traditional
programming, they define the sequences of actions themselves.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/des-agents-ia-qui-utilisent-des-outils-2163252"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Modern AIs are increasingly using …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with ChatGPT, with the prompt &amp;quot;Please generate an image of an AI using a mechanical tool, such as a wrench. Please make the robot look rather friendly. Also, please make the image square&amp;quot;" class="small align-right" src="../science/attachments/robot_tool_friendly.png" /&gt;
&lt;p&gt;Modern AIs acquire new capabilities by combining tools to perform a
complex task, controlling them like an agent. Unlike in traditional
programming, they define the sequences of actions themselves.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/des-agents-ia-qui-utilisent-des-outils-2163252"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Modern AIs are increasingly using tools. For example, if you ask a
conversational AI to solve a complicated equation, the AI alone cannot do
it. This is not surprising: there is no general mathematical formula. But
if this AI knows how to use numerical equation-solving routines, it
quickly gives us the answer. For example, “Le Chat” from Mistral
generates a small program that uses the “Python” language and its
numerical routines to solve our problem. The difficulty here is to
generate the program that calls the right routines. This ability is an
extension of conversational AI models that know how to answer questions
by generating text. Here, the text is computer code and not English.&lt;/p&gt;
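&lt;p&gt;For illustration, here is a sketch of the kind of small program such an AI might generate, solving the equation cos(x) = x (which has no closed-form solution) by simple bisection. This is a hypothetical example written for this post, not Le Chat’s actual output:&lt;/p&gt;

```python
# Sketch of the kind of program a conversational AI might generate to
# solve an equation numerically: here cos(x) = x, which has no
# closed-form solution. Plain bisection with standard-library math.
from math import cos

def f(x):
    return cos(x) - x  # the equation cos(x) = x, rewritten as f(x) = 0

# f is positive at 0 and negative at 1, so a root lies in between.
lo, hi = 0.0, 1.0
for _ in range(60):            # halve the bracketing interval 60 times
    mid = 0.5 * (lo + hi)
    if f(mid) == abs(f(mid)):  # f(mid) is non-negative: root is above mid
        lo = mid
    else:                      # f(mid) is negative: root is below mid
        hi = mid

print(round(lo, 6))  # prints 0.739085
```

&lt;p&gt;The hard part, as the text notes, is not running such routines but generating the program that calls the right ones.&lt;/p&gt;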
&lt;p&gt;By controlling the computer, the AI “acts”. That’s why it is said to be
an “agent”. By coupling with other systems, agentic AIs develop new
capabilities. The most powerful ones can then combine different tools by
leveraging their complementarities. These agent systems are currently
progressing very quickly, but they remind us of what we have always done
in computer science: any complicated system is assembled from multiple
routines, each with a specific functionality. Writing a computer program
is precisely describing how we are going to call these routines to solve
a problem. Until the recent advances in AI, however, we had to specify
all the steps ourselves, whereas agentic AIs take a given goal and
produce these steps themselves. The difficulty then becomes breaking a
task down into sub-tasks, a hard problem known as planning.&lt;/p&gt;
&lt;p&gt;In modern AIs, these planning skills are learned. The systems improve
through trial and error: we give the AI lots of tasks to solve and the AI
tries sequences of sub-tasks, deciding to use one tool or another. If it
succeeds in the final task, it learns that the sequence of tool use was a
good sequence for the task. This is called reinforcement learning; its
main inventors received the Turing Award, often called the Nobel Prize
of computer science, this year.&lt;/p&gt;
&lt;p&gt;Another major driver of progress for agentic AIs is the powerful
analogy-making and associative memory of language models. These language
skills enable them to start from problems specified by the user in plain
English, with an open vocabulary. They draw their tool-use strategies
from broad knowledge of similar problems, but they also know how to adapt
these strategies to the intermediate responses of the tools. They can
also interact with systems that are much more complex and indeterminate
than computer routines. For example, an AI can go and fetch information
on the internet, or even ask a human.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Agentic AIs open new perspectives. But they also greatly increase
computing costs, as they iterate over sub-tasks. These costs must be kept
in mind, as they are an important hurdle to the democratization of AI.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>AIs that break down questions reason better</title><link href="https://gael-varoquaux.info/science/ais-that-break-down-questions-reason-better.html" rel="alternate"></link><published>2025-06-20T00:00:00+02:00</published><updated>2025-06-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-06-20:/science/ais-that-break-down-questions-reason-better.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an AI that is thinking deeply. Philosophical references may be welcomed, for instance like the classic hamlet holding skull cliché.&amp;quot;" class="small align-right" src="../science/attachments/ai_thinking.jpg" /&gt;
&lt;p&gt;The key to the most powerful conversational AIs is to reason by breaking
down a complex task into simpler subproblems. Why is this crucial, and
how does it work?&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-qui-decomposent-les-questions-raisonnent-mieux-2151428"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The recent release of …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an AI that is thinking deeply. Philosophical references may be welcomed, for instance like the classic hamlet holding skull cliché.&amp;quot;" class="small align-right" src="../science/attachments/ai_thinking.jpg" /&gt;
&lt;p&gt;The key to the most powerful conversational AIs is to reason by breaking
down a complex task into simpler subproblems. Why is this crucial, and
how does it work?&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-qui-decomposent-les-questions-raisonnent-mieux-2151428"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The recent release of the conversational AI “DeepSeek R1” shook the
financial markets because it showed a significant reduction in the costs
of reasoning models. But what are these reasoning models?&lt;/p&gt;
&lt;p&gt;To understand the challenges of reasoning in conversational AIs, we can
ask them to solve riddles. I tried various logical riddles on different
AIs, such as the puzzle where a man has to get a fox, a chicken, and a
sack of corn across a river without one eating the other. The AI responds
brilliantly. But how can we ensure that the AI is truly reasoning and not
just reciting answers it has seen before? If we replace the protagonists
with an equivalent trio (wolf, lamb, and hay), the AI does just as well.
But it could have solved the problem by analogy with the classic version,
rather than by reasoning. Indeed, language models are very good at
analogies. A conversational AI typically works by proposing an answer
inspired by the flow of words (and corresponding concepts) in the texts
on which it was trained.&lt;/p&gt;
&lt;p&gt;If, instead of a riddle resembling a story, we try to play tic-tac-toe,
the weaknesses appear. Most conversational AIs are very bad at
tic-tac-toe, even going so far as to declare victory when facing defeat.
Perhaps this is because analogy is less useful here. But activating the
“reasoning” option makes them unbeatable. What is behind this option?&lt;/p&gt;
&lt;p&gt;A third task helps to understand the reasoning mechanisms of a
conversational AI: let’s ask it how many “L”s there are in
“LOLLAPALOUZA”. There is a catch: ChatGPT was able to give me the correct
answer for the number of “L”s in “LOLLAPALOOZA”, a question often used in
the past to show its limits. For “LOLLAPALOUZA”, it fails. Or rather, it needs
help: if we tell it to spell out the word, then count the “L”s, it gives
the correct answer. With the right intermediate steps, a problem is often
much simpler. These decompositions into subproblems are called chains of
thought in conversational AIs. The “reasoning” option of some AIs
generates such chains.&lt;/p&gt;
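&lt;p&gt;The decomposition that helps can be written out in plain Python. This is only a didactic illustration of the intermediate steps (the AI does not execute such code; it produces the equivalent steps as text):&lt;/p&gt;

```python
# The two intermediate steps that make the question easy:
# first spell the word out letter by letter, then count the "L"s.
word = "LOLLAPALOUZA"
letters = list(word)        # spell it out: ['L', 'O', 'L', 'L', ...]
count = letters.count("L")  # then count the "L"s
print(count)                # prints 4
```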
&lt;p&gt;DeepSeek R1 received much attention due to its excellence in breaking
down problems to reason in this way. To do this, it was trained to
generate reasoning patterns from questions, using reinforcement learning:
through trial and error on many problems generated together with their
answers, such as math problems. Faced with a task, the AI still proceeds by
analogy with the tasks it has seen during this learning phase, but it
uses this analogy to sketch a battle plan, rather than a response. Each
subproblem is then easier, and the AI can tackle it by analogy to
problems already seen. By observing the chains of thought, we can even
see the AI verifying its intermediate results. These chains of thought
are not always visible, but we can guess them from the AI’s response
time.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With these reasoning mechanisms, a conversational AI is as good as I am
at tic-tac-toe. But using such a model to play tic-tac-toe is like using
a sledgehammer to crush a fly: it is very inefficient in computational
cost compared to a specialized program for tic-tac-toe, which we have
known how to do for decades.&lt;/p&gt;
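&lt;p&gt;For comparison, the decades-old specialized approach fits in a few lines: exhaustive minimax search, a minimal sketch of the classic technique (not any particular program’s code):&lt;/p&gt;

```python
# A tiny specialized tic-tac-toe solver: exhaustive minimax search, a
# decades-old technique with negligible cost compared to a large
# language model. Board: list of 9 cells, "X", "O", or " ".
def winner(board):
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
    for a, b, c in lines:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    # Best achievable score for "X": 1 for a win, 0 a draw, -1 a loss.
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    moves = [i for i in range(9) if board[i] == " "]
    if not moves:
        return 0  # board full: draw
    scores = []
    for i in moves:
        board[i] = player
        scores.append(minimax(board, "O" if player == "X" else "X"))
        board[i] = " "
    return max(scores) if player == "X" else min(scores)

# From the empty board, perfect play by both sides gives a draw.
print(minimax([" "] * 9, "X"))  # prints 0
```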
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>Science must drive the narratives that shape society</title><link href="https://gael-varoquaux.info/science/science-must-drive-the-narratives-that-shape-society.html" rel="alternate"></link><published>2025-03-01T00:00:00+01:00</published><updated>2025-03-01T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-03-01:/science/science-must-drive-the-narratives-that-shape-society.html</id><summary type="html">&lt;img alt="A picture of me giving this speech" class="small align-right" src="../science/attachments/louvain_gael_dhc.jpg" /&gt;
&lt;p&gt;I would like to take a brief moment to reflect on what drives me as an
academic.&lt;/p&gt;
&lt;p&gt;Academia’s roots are in creating knowledge and sharing it. We, academics,
have a role to play in shaping society. In computer science, we sometimes
focus on the creation of technology. Here, creation …&lt;/p&gt;</summary><content type="html">&lt;img alt="A picture of me giving this speech" class="small align-right" src="../science/attachments/louvain_gael_dhc.jpg" /&gt;
&lt;p&gt;I would like to take a brief moment to reflect on what drives me as an
academic.&lt;/p&gt;
&lt;p&gt;Academia’s roots are in creating knowledge and sharing it. We, academics,
have a role to play in shaping society. In computer science, we sometimes
focus on the creation of technology. Here, creation of open technology is
central to knowledge consolidation in computer science, because open
technology can be studied, because open technology can be shared.
But academia’s role in society is more than technology, even open
technology.&lt;/p&gt;
&lt;p&gt;Academia’s position in consolidating knowledge implies that it is trusted
with responsibilities in shaping the narrative, for instance that of
technology. An important narrative today is that of artificial
intelligence, a new industrial revolution, they say. Our role here is to
make a sober assessment, inventing the future of technology without
false promises or blind spots. This work, like all broad scientific work,
requires working across disciplines.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;The above text is extracted from my acceptance speech when receiving
UCLouvain’s Doctor Honoris Causa.&lt;/p&gt;
&lt;p class="last"&gt;As stated in my full speech, I am incredibly greatful for this honor. I
deeply thank all those that have been part of my scientific and
technical adventures. They were all built through team works, with
many amazing people, from all horizons, young and older, famous or
invisible. Working together is what moves mountains.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="society"></category><category term="AI"></category><category term="award"></category></entry><entry><title>AI super-intelligent to play Go, and math?</title><link href="https://gael-varoquaux.info/science/ai-super-intelligent-to-play-go-and-math.html" rel="alternate"></link><published>2025-02-19T00:00:00+01:00</published><updated>2025-02-19T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-02-19:/science/ai-super-intelligent-to-play-go-and-math.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an artificial intelligences playing go, with mathematical formula flying in the background. The mathematical formula are flying in all directions, and the image is futuristic.&amp;quot;" class="small align-right" src="../science/attachments/robots_playing_go.jpg" /&gt;
&lt;p&gt;Since 2017, an AI has been defeating the best Go experts, despite the game being particularly challenging. Such “super intelligence” is rare, but it could also emerge in fundamental mathematics.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/lia-le-go-et-les-maths-2140332"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="imitation-is-not-creation"&gt;
&lt;h2&gt;Imitation is not …&lt;/h2&gt;&lt;/div&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an artificial intelligences playing go, with mathematical formula flying in the background. The mathematical formula are flying in all directions, and the image is futuristic.&amp;quot;" class="small align-right" src="../science/attachments/robots_playing_go.jpg" /&gt;
&lt;p&gt;Since 2017, an AI has been defeating the best Go experts, despite the game being particularly challenging. Such “super intelligence” is rare, but it could also emerge in fundamental mathematics.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/lia-le-go-et-les-maths-2140332"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="imitation-is-not-creation"&gt;
&lt;h2&gt;Imitation is not creation&lt;/h2&gt;
&lt;p&gt;For several decades, calculators have been better than humans at an
intellectual task: mental arithmetic. Yet, we do not call this
“super-intelligence.” Probably because it is humans who specified all the
rules for these calculations to the machine. Similarly, a computer has a
superhuman ability to memorize information exactly, such as numbers, but
we do not consider it super-intelligent for that reason. Perhaps this is
because it does not teach us anything new. However, in 2017, an AI
started teaching the best Go players moves and strategies that no one had
ever known. How is this possible? Will AI surpass its creator and become
super-intelligent in all fields?&lt;/p&gt;
&lt;p&gt;Most recent breakthroughs in AI rely on learning methods where the
computer imitates humans. For example, to create computer-vision systems,
we provide the computer with many annotated images describing what they
represent. Likewise, conversational AIs learn by training to complete
examples of text. Under these conditions, it is difficult for AI to
surpass its creator.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="when-ais-invent"&gt;
&lt;h2&gt;When AIs invent&lt;/h2&gt;
&lt;p&gt;But AlphaZero, the AI champion in Go, operates on a different principle:
reinforcement learning. Here, the AI takes actions (moves in the game of
Go) and receives a “reward” if it wins the game. Through countless games,
it optimizes its strategies to maximize rewards, including exploring new
strategies. AlphaZero trained by playing tens of millions of games
against itself. This is how the AI was able to create new strategies,
unrestricted by human knowledge.&lt;/p&gt;
&lt;p&gt;Such learning, based on millions of trial-and-error attempts, does not
apply to all problems: it requires the ability to perform rapid
experiments, as in a computer game, which is why games remain the only
domain where true super-intelligence has been achieved. However, there is
hope in mathematics, another intellectual game.&lt;/p&gt;
&lt;p&gt;Indeed, progress in generative AI for language (which powers tools such
as ChatGPT) can be applied to mathematical proofs, which consist of
sequences of symbols. Trained on numerous proofs, an AI can learn to
complete partial proofs. However, such a generative AI will produce
sequences without guarantees of mathematical validity. Another tool,
using proof-verification techniques based on symbolic AI, can then retain
only the correct sequences, giving a “reward” signal. Reinforcement
learning finally comes in, using its exploration schemes to maximize this
reward and discover new valid proof steps.&lt;/p&gt;
&lt;p&gt;This is how, in July 2024, the AlphaProof AI reached silver-medal level
at the International Mathematical Olympiad. Further progress may eventually lead
to “super-intelligence” in mathematics. However, we are still far from
general super-intelligence, as, both in Go and mathematics, progress is
made possible by the ease of verifying whether one has “won” or not.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>AI for health: the impossible necessity of unbiased data</title><link href="https://gael-varoquaux.info/science/ai-for-health-the-impossible-necessity-of-unbiased-data.html" rel="alternate"></link><published>2025-02-13T00:00:00+01:00</published><updated>2025-02-13T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-02-13:/science/ai-for-health-the-impossible-necessity-of-unbiased-data.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate a fairly abstract image of biased data. The image is about data. It should have numbers, streams of numbers. It should express the notion of bias, showing a black woman in the middle of the stream of numbers.&amp;quot;" class="small align-right" src="../science/attachments/biased_data.jpg" /&gt;
&lt;p&gt;Is unbiased data important to build health AI? Yes!&lt;/p&gt;
&lt;p&gt;Can there be unbiased data? No!&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Building health on biased data discriminates&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The notion of bias depends on the intended use.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;In medicine, we have seen the importance of tuning devices and decisions
for the target population. The problem is not …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate a fairly abstract image of biased data. The image is about data. It should have numbers, streams of numbers. It should express the notion of bias, showing a black woman in the middle of the stream of numbers.&amp;quot;" class="small align-right" src="../science/attachments/biased_data.jpg" /&gt;
&lt;p&gt;Is unbiased data important to build health AI? Yes!&lt;/p&gt;
&lt;p&gt;Can there be unbiased data? No!&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Building health on biased data discriminates&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The notion of bias depends on the intended use.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;In medicine, we have seen the importance of tuning devices and decisions
for the target population. The problem is not limited to AI: pulse
oximeters, which measure oxygen saturation, do not work well on dark
skin; cardiac procedures were adjusted to the symptoms and anatomy of
men, while those of women differ. These issues arose because the
corresponding groups were underrepresented in the clinical studies.&lt;/p&gt;
&lt;p&gt;So when we build AI, we need to make sure that they are not trained on
biased data.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Beyond population sampling, historical choices also bias&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;But unbiased data is hard to obtain, as it goes beyond sampling the right
population of individuals. Indeed, the data we have are the result of a
historical set of choices: Whom do we measure? Which measurements do we
take? And what led to their condition? Beyond health, consider for
instance salaries: we can train a model on historical data to tell us the
right compensation for a given individual. But it will just capture and
repeat historical biases, such as paying women less than their equally
qualified male counterparts.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The notion of being unbiased embeds societal and ethical values&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here we see that the notion of being unbiased embeds societal and ethical
values: Should Olympic-level gymnasts and football players be paid the
same? What about men and women with the same job description?&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Going back to medicine, there is another critical aspect: that of cause
and effect, which is central to making decisions. To take a simple
example, if we compared the health outcomes of individuals after two days
at the hospital to those of individuals who did not go to the hospital,
we would conclude, incorrectly, that a hospital is a very dangerous
place, as individuals there are in worse shape. The problem is, of
course, that we are comparing individuals who are not comparable, as they
have different baseline health. A health intervention is given for a
reason, so it is given to a specific population: insulin is given to
diabetics. Building a model, an AI, that can decide on health
interventions requires compensating for the differences between treated
and non-treated individuals.&lt;/p&gt;
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Reference: causality&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;a class="reference external" href="https://hal.science/hal-04774700/"&gt;A 15-page introduction to causal inference with machine
learning&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;AIs can make good decisions only from adequate data&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here also we have a case of bias. The bias is with regard to the data
required to answer the question of the intervention’s effect, which
demands that both populations be comparable. More generally, we see once
again that data are always the result of a historical set of choices, and
these choices condition the statistical relationships in the data. And
AIs build on these statistical relationships.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The notion of bias depends on the intended use&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;What we see here is that the notion of bias depends on the intended use: it depends on the target population, but also on the target intervention. So there really is no absolute notion of unbiased data. There is just the notion of data that are well suited to a particular goal.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="../science/attachments/lady_justice_robot.png" /&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;This post was consolidated from notes of a panel on health AI at the
AI Action Summit, but it is linked to my &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;AI chronicles&lt;/a&gt;, big-picture
didactic pieces on AI and related topics.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="society"></category><category term="health"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>2024 highlights: of computer science and society</title><link href="https://gael-varoquaux.info/science/2024-highlights-of-computer-science-and-society.html" rel="alternate"></link><published>2025-01-01T00:00:00+01:00</published><updated>2025-01-01T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-01-01:/science/2024-highlights-of-computer-science-and-society.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;For me, 2024 was full of back and forth between research,
software, and connecting these to society. Here, I lay out some
highlights on AI and society, as well as research and software, around
tabular AI and language models.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2025 starts, I’m looking back on 2024. It …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;For me, 2024 was full of back and forth between research,
software, and connecting these to society. Here, I lay out some
highlights on AI and society, as well as research and software, around
tabular AI and language models.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2025 starts, I’m looking back on 2024. It was an interesting
professional year, as the research in the &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;soda team&lt;/a&gt; on machine learning for health and
social science nourished reflection on society.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#thoughts-from-the-national-ai-committee" id="toc-entry-1"&gt;Thoughts from the national AI committee&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#adventures-in-software-land" id="toc-entry-2"&gt;Adventures in software land&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-to-supercharge-scikit-learn" id="toc-entry-3"&gt;probabl to supercharge scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#skrub-machine-learning-on-tables-made-easy" id="toc-entry-4"&gt;Skrub: machine learning on tables made easy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#research-better-ai-tools-more-understanding" id="toc-entry-5"&gt;Research: better AI tools, more understanding&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#table-foundation-models" id="toc-entry-6"&gt;Table foundation models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#disparities-of-confidence-of-large-language-models" id="toc-entry-7"&gt;Disparities of confidence of large language models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-straggler-consistency-of-supervised-learning-with-missing-values" id="toc-entry-8"&gt;A straggler: Consistency of supervised learning with missing values&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="thoughts-from-the-national-ai-committee"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Thoughts from the national AI committee&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In early 2024, I served on the French national AI committee. Our final write-up can be found
&lt;a class="reference external" href="https://www.info.gouv.fr/actualite/25-recommandations-pour-lia-en-france"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a ton of work, a very interesting experience, and I learned a lot
about many aspects of the interfaces between technology, policy, and
society. A few things that stood out for me, some partly
obvious but worth saying:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Digital services are a growing economy.&lt;/strong&gt; The share of the economy
that is digital keeps growing, whether we like it or not (IMHO, most of
us spend too much time on our phones…). For France, or Europe, there
is no question: we must produce our share of digital services and
innovation, or our economic balance will suffer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Privacy is erroding.&lt;/strong&gt; Whether it is social network, information
leaking into search engines or training of large language models,
or people uploading private information to chatGPT, private information
is more and more available. History has shown us the dangers behind
loss of privacy, which the powerful (governing or economical elites)
typically leverage to assert more power. Europe has had a long stance
of trying to mitigate this loss of privacy via regulation (GDPR). But
regulating services that we don’t control is hard, and it ends up being
a geo-political and economical battle.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Big AI is huge.&lt;/strong&gt; The size of investments in AI is huge (dozens of
billions yearly, comparable to a sizeable fraction of the state
expenditures of a rich country like Switzerland). Data centers are
having significant impacts on the electric grid of modern countries,
running in competition with other usage. The cost of large models have
ballooned (training a large language model is in the hundreds of
millions of cost, which is comparable to a sizeable fraction of the
budget of the national research institute that I work in (&lt;a class="reference external" href="https://inria.fr/fr"&gt;inria&lt;/a&gt;). Training costs are just the visible part
of the iceberg, operational costs are huge and are everywhere.&lt;/p&gt;
&lt;p&gt;Not all in tech are worried about rising costs. Indeed, they go hand in
hand with more money in tech, making us, tech bros, richer, as long as
investments keep pouring in. But &lt;a class="reference external" href="https://www.goldmansachs.com/images/migrated/insights/pages/gs-research/gen-ai--too-much-spend%2C-too-little-benefit-/TOM_AI%202.0_ForRedaction.pdf"&gt;bubble dynamics are at play&lt;/a&gt;,
and explain part of the conversation around AI.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Concentration of power.&lt;/strong&gt; Many factors in today’s AI lead to
concentration into the hands of large actors. Training and operation
costs, of course. But also limited access to the correspond skills,
platform effect on the data and the users. The most striking bottleneck
is the compute hardware. Only one company makes the chips that we all
need. Few actors can afford buying them; and as a result most of the
world lives from renting out to big landlords.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;AI neither good nor bad, but what we do of it.&lt;/strong&gt; The above may
paint a gloomy picture. But this is not how I see it. AI does have a
lot of potential for good, as all general purpose technology. It all
depends how society uses it. And here the future is open: we, as actors
of democratic societies, as innovators, in tech but in every aspects of
society, we can determine what the future of AI is. I look forward to
technology that empowers each and everybody, to act for their own
benefit. Key to this future is enabling and bringing in every stakeholder.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="adventures-in-software-land"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Adventures in software land&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With the growing importance of data and artificial intelligence in
shaping society, I believe more than ever in the importance of open
source and commons for data science, making tools accessible to as many
as possible.&lt;/p&gt;
&lt;div class="section" id="probabl-to-supercharge-scikit-learn"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;probabl to supercharge scikit-learn&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In early 2024, Inria spun off the scikit-learn development to a new structure, &lt;a class="reference external" href="https://probabl.ai"&gt;probabl&lt;/a&gt;, to supercharge the development of the broader
ecosystem. I detailed the motivation and the goals in &lt;a class="reference external" href="../programming/promoting-open-source-from-inria-to-probabl.html"&gt;a previous article&lt;/a&gt;. In a
nutshell:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Scikit-learn is &lt;a class="reference external" href="programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html"&gt;a key component of the machine-learning
ecosystem&lt;/a&gt;,
but its development requires funding.&lt;/li&gt;
&lt;li&gt;Probabl is there to foster a broader open data-science ecosystem, as
scikit-learn can be sustainable only when used within such an ecosystem.
Probabl focuses on delivering value to enterprises, and thus makes sure
that there is a seamless solution to their needs.&lt;/li&gt;
&lt;li&gt;I have 10% of my time allocated from Inria to Probabl.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some of our successes are already publicly visible:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The open-source team at probabl is maintaining and improving &lt;a class="reference external" href="https://probabl.ai/open-source"&gt;a range
of software libraries&lt;/a&gt;: scikit-learn,
joblib, imbalanced-learn, fairlearn, skops, skrub… Our priorities are
openly discussed &lt;a class="reference external" href="https://papers.probabl.ai/open-source-priorities-chapter-2"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We have launched &lt;a class="reference external" href="https://papers.probabl.ai/official-scikit-learn-certification-launch"&gt;an official certification program for scikit-learn&lt;/a&gt;. I’m very excited about these certifications (there are three levels): they grow recognition of scikit-learn skills, and thus make sure that it is a dependable stack for industry.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="skrub-machine-learning-on-tables-made-easy"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Skrub: machine learning on tables made easy&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://skrub-data.org/"&gt;skrub&lt;/a&gt; is a software project that I am very
excited about. Many crucial applications of machine learning are on
tables. Skrub facilitates the corresponding patterns. We are designing it
with the insights of years of research and practice on the topic. It does
not always look impressive, but it’s the little things that add up for
productivity.&lt;/p&gt;
&lt;p&gt;A typical dataset is the employee salaries one:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub.datasets import fetch_employee_salaries
&amp;gt;&amp;gt;&amp;gt; dataset = fetch_employee_salaries()
&amp;gt;&amp;gt;&amp;gt; employees_df, y = dataset.X, dataset.y
&lt;/pre&gt;
&lt;p&gt;Skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableReport.html"&gt;TableReport&lt;/a&gt; makes it really easy to interactively visualize and
explore such a table:&lt;/p&gt;
&lt;img alt="" src="attachments/2024_highlights/table_report_vscode.png" style="width: 700px;" /&gt;
&lt;p&gt;The dataframe &lt;cite&gt;employees_df&lt;/cite&gt; has plenty of non-numerical columns, as visible above.
Skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html"&gt;TableVectorizer&lt;/a&gt; turns it into a numerical array suitable for
machine learning, taking care of dates, categories, strings…&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub import TableVectorizer
&amp;gt;&amp;gt;&amp;gt; X = TableVectorizer().fit_transform(employees_df)
&lt;/pre&gt;
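&lt;p&gt;To make concrete what such vectorization does, here is a minimal, purely illustrative sketch in plain Python. The column names and the toy one-hot encoding are hypothetical; skrub’s TableVectorizer picks real encoders per column type:&lt;/p&gt;

```python
from datetime import date

# Toy sketch of table vectorization: illustration only, not skrub's
# implementation. Column names are made up for the example.
def one_hot(value, vocabulary):
    """Encode a category as a 0/1 indicator over a known vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def vectorize_row(row, department_vocabulary):
    department, hire_date, years_worked = row
    features = []
    # Low-cardinality string column: one-hot encoding
    features.extend(one_hot(department, department_vocabulary))
    # Date column: expanded into several numeric features
    features.extend([hire_date.year, hire_date.month, hire_date.day])
    # Numeric column: passed through
    features.append(float(years_worked))
    return features

vocab = ["Police", "Fire", "Libraries"]
x = vectorize_row(("Police", date(2007, 5, 14), 17), vocab)
```

Each heterogeneous record becomes a flat numeric vector that any scikit-learn estimator can consume; TableVectorizer automates this choice of encoding per column.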
&lt;p&gt;If you want to use deep-learning language models for the string
categories, skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TextEncoder.html"&gt;TextEncoder&lt;/a&gt;
can download pre-trained models from Hugging Face:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub import TextEncoder
&amp;gt;&amp;gt;&amp;gt; text_encoder = TextEncoder(
        &amp;quot;sentence-transformers/paraphrase-albert-small-v2&amp;quot;,
        device=&amp;quot;cpu&amp;quot;,
    )
&amp;gt;&amp;gt;&amp;gt; tab_vec = TableVectorizer(high_cardinality=text_encoder)
&amp;gt;&amp;gt;&amp;gt; X = tab_vec.fit_transform(employees_df)
&lt;/pre&gt;
&lt;p&gt;With this, the latest artificial-intelligence developments are easily
brought in to drive decisions on the data that matters.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="research-better-ai-tools-more-understanding"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Research: better AI tools, more understanding&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Software, as well as thoughts on AI and society, is best built on a solid
understanding of AI, which calls for research.&lt;/p&gt;
&lt;div class="section" id="table-foundation-models"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Table foundation models&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Modeling data semantics enable pretaining for tables&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I have been working on machine learning for tables for more than a
decade. These data are crucial for many applications, but they have so
far not witnessed the breakthroughs of deep learning seen &lt;em&gt;eg&lt;/em&gt; in vision
or text. Much of this success of &lt;strong&gt;deep learning has been driven by the
ability to reuse pretrained models&lt;/strong&gt;, fitted on very large datasets.
Foundation models pushed this idea very far with models that provide
background information useful for a wide variety of downstream tasks. But
pretraining is challenging for tables.&lt;/p&gt;
&lt;p&gt;A crucial part of foundation models for text and images is the attention
mechanism, stacked in a transformer architecture, which brings associative
memory to the inputs by contextualizing them. We had a breakthrough with
the &lt;a class="reference external" href="https://openreview.net/forum?id=9kArQnKLDp"&gt;CARTE model&lt;/a&gt;: we
managed to adapt these ideas to tables. The strings –table
entries and column names– give the information that enables transfer from
one table to another: data semantics. Here, the key is an
architecture that 1) models both strings and numerical values and 2) applies
to any set of tables while using the column names to route the
information. For this purpose, CARTE uses a new dedicated attention
mechanism that accounts for column names. It is pre-trained on a very
large knowledge base. As a result, it outperforms the best models
(including tree-based models) in small-sample settings (up to n=2000).&lt;/p&gt;
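&lt;p&gt;To give a flavor of attention routed by column names, here is a toy sketch in plain Python. The embeddings and dimensions are made up for illustration; the actual CARTE architecture is described in the paper:&lt;/p&gt;

```python
import math

# Toy attention over table cells, where each cell's key comes from an
# embedding of its column name. Illustration only, not CARTE itself.
def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each cell's value embedding by query/key similarity."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Because keys are built from column-name embeddings, the same mechanism
# applies to any table schema (hypothetical two-dimensional embeddings).
column_embeddings = {"salary": [1.0, 0.0], "department": [0.0, 1.0]}
keys = [column_embeddings["salary"], column_embeddings["department"]]
values = [[55.0, 0.0], [0.0, 3.0]]  # toy embeddings of the cell values
context = attend([1.0, 0.0], keys, values)  # query focused on "salary"
```

The point of the sketch: the column names, not fixed positions, decide where information flows, which is what lets a pretrained model transfer across tables with different schemas.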
&lt;p&gt;The pretrained CARTE model is available for download as &lt;a class="reference external" href="https://pypi.org/project/carte-ai"&gt;a Python package&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This result is very significant as it opens the door to &lt;strong&gt;foundation models
for tables&lt;/strong&gt;: models that embark much background knowledge and can be
specialized to many tabular-learning tasks.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://openreview.net/forum?id=9kArQnKLDp"&gt;&lt;img alt="" src="attachments/2024_highlights/carte_comparisons.png" style="width: 100%;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Extensive empirical results show that CARTE brings benefits to very
broad set of baselines. The relative performance of baselines also
contains interesting results.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;See also&lt;/p&gt;
&lt;p&gt;I wrote a longer &lt;a class="reference external" href="./carte-toward-table-foundation-models.html"&gt;high-level post on CARTE&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="disparities-of-confidence-of-large-language-models"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Disparities of confidence of large language models&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://hal.science/hal-04750567"&gt;&lt;img alt="" src="attachments/2024_highlights/hallucination_probability.png" style="width: 400px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;A good confidence assessment on replies of an LLM would separate out
correct from incorrect statements: Einstein was not born on Jan 14th
1879 (close call, it was March 14th); his PhD was in Zurich.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Large language models (LLMs), such as chatGPT, may produce answers that
are plausible but not factually correct, the so-called “hallucinations”.
A variety of approaches try to assess how likely a statement is to be true,
for instance by sampling multiple responses from the language model.
Ideally, we would like to use these confidence assessments to flag the
wrong statements in an LLM’s answer. For this, a challenge is to
threshold them, or assign a probability of correctness.&lt;/p&gt;
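&lt;p&gt;One simple way to assign such a probability of correctness, sketched here with made-up numbers, is to bin held-out confidence scores and use the observed rate of correct answers per bin. This is a generic calibration illustration, not the paper’s exact method:&lt;/p&gt;

```python
# Map raw confidence scores to an observed probability of correctness
# by binning held-out (score, was_correct) pairs. Toy illustration.
def binned_calibration(scores, correct, n_bins=5):
    """Return the observed rate of correct answers in each score bin."""
    bins = [[] for _ in range(n_bins)]
    for s, c in zip(scores, correct):
        index = min(int(s * n_bins), n_bins - 1)  # clamp score of 1.0
        bins[index].append(c)
    return [sum(b) / len(b) if b else None for b in bins]

# Hypothetical held-out data: confidence score and whether the
# corresponding statement turned out to be correct (1) or not (0)
scores = [0.1, 0.15, 0.35, 0.5, 0.55, 0.75, 0.8, 0.95]
correct = [0, 0, 0, 1, 0, 1, 1, 1]
rates = binned_calibration(scores, correct)
```

A statement can then be flagged when its bin’s observed rate falls below a chosen threshold.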
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://hal.science/hal-04750567"&gt;&lt;img alt="" src="attachments/2024_highlights/llm_confidence_nationality.png" style="width: 400px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Observed error rate and a function predicted probability of
correctness For the birth date, when a large language model (here Mistral
7B) gives information on a given notable individual. The different
curves give the corresponding calibration for different nationalities of
the individuals, revealing that &lt;strong&gt;the probability is much more trustworthy
for a citizen of the United States than for other countries&lt;/strong&gt;, and
particularly poor for people that originate from South-East Asia.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://hal.science/hal-04750567/"&gt;Chen et al&lt;/a&gt;, we investigate the
confidence of LLMs in their answers. We show that the
probabilities computed are not only overconfident, but also that there is
heterogeneity (grouping loss): on some groups of queries the
overconfidence is more pronounced than on others. For instance, for an
answer on a notable individual, the LLMs’ confidence is reasonably
calibrated if the individual is from the United States, but severely
overconfident for individuals from South-East Asia
(see the figure on the right). Characterizing the corresponding groups
opens the door to correcting the corresponding bias, a “reconfidencing”
procedure.&lt;/p&gt;
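&lt;p&gt;The grouping loss can be illustrated with a toy computation: answers all emitted at the same stated confidence can hide very different observed error rates across groups. The numbers below are hypothetical:&lt;/p&gt;

```python
# Toy illustration of hidden heterogeneity: compute per-group error rates
# for answers that all carried the same stated confidence (say 0.9).
def error_rate_by_group(records):
    totals, errors = {}, {}
    for group, is_wrong in records:
        totals[group] = totals.get(group, 0) + 1
        errors[group] = errors.get(group, 0) + is_wrong
    return {g: errors[g] / totals[g] for g in totals}

# Hypothetical (group, was_wrong) records at identical stated confidence
records = [("US", 0), ("US", 0), ("US", 0), ("US", 1),
           ("South-East Asia", 1), ("South-East Asia", 1),
           ("South-East Asia", 0), ("South-East Asia", 1)]
rates = error_rate_by_group(records)
```

Average calibration over all records would look acceptable, yet the per-group rates differ sharply, which is exactly what a single calibration curve cannot reveal.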
&lt;p&gt;This study is an application of our earlier, more theoretical, &lt;a class="reference external" href="https://openreview.net/forum?id=6w1k-IixnL8"&gt;work&lt;/a&gt; that contributed the
first estimator of the grouping loss, a mathematically solid concept behind
hidden heterogeneity in classifier calibration. I am very happy to see
that these fairly abstract ideas are useful to probe very concrete
problems such as the disparity in LLM confidence across nationalities.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-straggler-consistency-of-supervised-learning-with-missing-values"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;A straggler: Consistency of supervised learning with missing values&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;A&lt;/em&gt; &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;paper&lt;/a&gt;
&lt;em&gt;on the fundamentals of machine-learning with missing values&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In 2018, &lt;a class="reference external" href="https://juliejosse.com"&gt;Julie Josse&lt;/a&gt;, &lt;a class="reference external" href="https://erwanscornet.github.io"&gt;Erwan Scornet&lt;/a&gt;, and myself started working on the
theory of how supervised learning works with missing values (learning
theory). Working with an intern, Nicolas Prost, we quickly realized that there
was a gap between the statistical thinking around missing values, which
was focused on enabling inference in parametric models as if there were
no missing values, and the needs of prediction with missing values.&lt;/p&gt;
&lt;p&gt;We wrote  &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;a paper&lt;/a&gt; to
lay out the theory cleanly, summarizing both elements of learning theory
and the fundamentals of statistics with missing values. Beyond these
didactic aspects, the paper gives a series of formal results, such as the
need for multiple imputations to be able to use the &lt;em&gt;complete case&lt;/em&gt;
predictor (the optimal predictor without missing values), the optimal way
to model missing values in trees (which was already used in XGBoost :) ),
and the fact that, asymptotically, constant imputation of missing values
can work well for prediction.&lt;/p&gt;
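&lt;p&gt;Constant imputation itself is trivial to sketch. The toy code below just replaces missing entries by a fixed value; the theoretical point is that a flexible downstream predictor can then learn to treat that value specially (in practice one would use e.g. scikit-learn’s SimpleImputer with a powerful learner):&lt;/p&gt;

```python
# Constant imputation: replace missing entries (None) by a fixed value.
# Toy illustration of the strategy discussed above.
def impute_constant(rows, fill_value=0.0):
    return [[fill_value if v is None else v for v in row] for row in rows]

X = [[1.0, None], [None, 3.0], [2.5, 4.0]]
X_imputed = impute_constant(X)
```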
&lt;p class="align-right"&gt;&lt;em&gt;Frustrations of the academic game&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://hal.science/hal-02024202"&gt;The preprint&lt;/a&gt; got a lot of success
(more than a hundred citations), probably because it laid out
fundamentals. But it took 5 years to publish it. The machine learning
community did not like the absence of new methods (we only gave
theoretical results on existing practice, such as imputation). The
statistics literature really did not like our messages that imputation
was not always important. In one journal, a reviewer rejected the paper on
the basis that it was giving bad messages to the community, but not
arguing that anything was wrong in our proofs or our experiments. Of
course, there is a lot to say about the difficulties of doing data
analysis with missing values, but the conversation did not go into these
details. This is a good illustration that &lt;strong&gt;progress in science is
social&lt;/strong&gt;, and is as much about shifting norms as about accumulating knowledge
(actually, knowledge is social too, as put forward by &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Social_epistemology"&gt;social
epistemology&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As time went by, my colleague &lt;a class="reference external" href="https://marinelm.github.io"&gt;Marine Le Morvan&lt;/a&gt; has published &lt;a class="reference external" href="https://proceedings.mlr.press/v108/morvan20a.html"&gt;more&lt;/a&gt; &lt;a class="reference external" href="https://proceedings.neurips.cc/paper/2021/hash/5fe8fdc79ce292c39c5f209d734b7206-Abstract.html"&gt;and&lt;/a&gt;
&lt;a class="reference external" href="https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac013/6568998"&gt;more&lt;/a&gt;
&lt;a class="reference external" href="https://arxiv.org/abs/2407.19804"&gt;results&lt;/a&gt; that push deeper
understanding of prediction with missing values. But I still see value in
our original paper, as it lays the foundations.&lt;/p&gt;
&lt;p&gt;The paper is now out, thanks to my coauthors who kept replying to
reviewers, improving the manuscript, and resubmitting. Read &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;it&lt;/a&gt;; I think
that it is a good read.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Well, this article ended up longer than I had expected. Thanks for
reading. Taking a step back to figure out what is important is always a
good exercise for me.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>When AIs must overcome the data</title><link href="https://gael-varoquaux.info/science/when-ais-must-overcome-the-data.html" rel="alternate"></link><published>2024-12-22T00:00:00+01:00</published><updated>2024-12-22T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-12-22:/science/when-ais-must-overcome-the-data.html</id><summary type="html">&lt;p&gt;Improving conversational artificial intelligences or simpler prediction engines involves overcoming biases, that is, going beyond the limits of data. But the notion of bias is subtle, as it depends on the goals.&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot arm wrestling a figure made of numbers. This figure does not look like a robot, but more like a human, however it is made of numbers.&amp;quot;" class="small align-right" src="../science/attachments/robot_wresting_numbers.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/quand-lia-doit-depasser-les-donnees-2126369"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Improving conversational artificial intelligences or simpler prediction engines involves overcoming biases, that is, going beyond the limits of data. But the notion of bias is subtle, as it depends on the goals.&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot arm wrestling a figure made of numbers. This figure does not look like a robot, but more like a human, however it is made of numbers.&amp;quot;" class="small align-right" src="../science/attachments/robot_wresting_numbers.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/quand-lia-doit-depasser-les-donnees-2126369"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In 2023, Microsoft’s conversational AI insulted users.
Salary-recommendation engines ignore women’s degrees to underpay them. At
the start of the Covid-19 pandemic, predictions of hospital stays
consistently underestimated the duration. These three issues all stem
from the same failure: predictive engines, artificial intelligences, that
have learned from biases. The rude conversational AI replicated its
training texts, some of which came from internet forums where politeness
is sometimes overlooked. The medical AI only considered finished
hospitalizations, and, as the epidemic had just begun, only patients
with mild forms had already been discharged, while the more seriously ill
remained hospitalized.&lt;/p&gt;
&lt;p&gt;To obtain an AI that doesn’t spout nonsense, the biases must be
“corrected.” The problem of too-short observation windows is a classic
issue in medical statistics: more importance must be placed on the few
individuals who have been sick for a long time. A similar solution is
used to improve conversational AIs: weighting the training text sources
based on the deviation from the desired behavior.&lt;/p&gt;
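&lt;p&gt;The reweighting idea can be sketched with a toy computation: upweighting the under-observed long stays shifts the estimate toward the target population. The numbers below are made up, and this is only a cartoon of a censoring correction:&lt;/p&gt;

```python
# Toy illustration of reweighting: long hospital stays are under-observed
# early in an epidemic, so they receive larger weights.
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

stay_lengths = [3, 4, 5, 20]         # observed stays, in days (made up)
weights = [1.0, 1.0, 1.0, 4.0]       # upweight the rare long stay

plain = sum(stay_lengths) / len(stay_lengths)
corrected = weighted_mean(stay_lengths, weights)
```

The uncorrected average underestimates the typical stay; the weighted average moves it up, which is the same logic used when weighting training text sources for conversational AIs.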
&lt;div class="section" id="aligning-on-which-values"&gt;
&lt;h2&gt;Aligning on which values?&lt;/h2&gt;
&lt;p&gt;The problem of bias is universal in statistics. And modern AIs are
statistical because they learn from data. The notion of bias is very
relative. It should be understood as a gap between the available data and
the desired behavior. Therefore, &lt;strong&gt;there is no such thing as unbiased data,
or a universal bias correction&lt;/strong&gt;. Much of the effort to improve AIs focuses
on reducing this gap between training and the desired behavior.&lt;/p&gt;
&lt;p&gt;For example, when training AIs for autonomous vehicles, one difficulty is
that the data contains very few traffic accidents. Simulators are
sometimes used to fill this gap. They are inherently less rich than
reality and are mixed with real-world driving. There is a well-controlled
gap between the resulting mixture and typical driving; this gap is there
to put emphasis on safety requirements in unfavorable scenarios. This is
another form of data correction.&lt;/p&gt;
&lt;p&gt;Just as the notion of data bias depends on how well the data match a
targeted use, an AI does not produce absolute or objective truth. Without
corrections, it simply replicates its behavior based on what it has
observed. And when corrections are made, the whole question is how to
correct it. For powerful AIs, we then talk about “alignment” towards
goals and values. As AI incorporates the values of its designers, one
might wonder whether the same AI can be socially acceptable in all
cultures.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>Do AIs reason or recite?</title><link href="https://gael-varoquaux.info/science/do-ais-reason-or-recite.html" rel="alternate"></link><published>2024-10-19T00:00:00+02:00</published><updated>2024-10-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-10-19:/science/do-ais-reason-or-recite.html</id><summary type="html">&lt;p&gt;Despite their apparent intelligence, conversational artificial intelligences often lack logic. The debate rages on: do they reason or do they recite snatches of text memorized on the Internet?&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot with a stream of numbers coming out of his mouth. The robot is on the left, facing right, and the numbers flow, as if they were sound.&amp;quot;" class="small align-right" src="../science/attachments/robot_numbers_flow_mouth.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-raisonnent-elles-ou-recitent-elles-2103079"&gt;Les Echos&lt;/a&gt;. I updated it with new …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Despite their apparent intelligence, conversational artificial intelligences often lack logic. The debate rages on: do they reason or do they recite snatches of text memorized on the Internet?&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot with a stream of numbers coming out of his mouth. The robot is on the left, facing right, and the numbers flow, as if they were sound.&amp;quot;" class="small align-right" src="../science/attachments/robot_numbers_flow_mouth.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-raisonnent-elles-ou-recitent-elles-2103079"&gt;Les Echos&lt;/a&gt;. I updated it with new references.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Conversational AI, or large language models, are sometimes seen as the
gateway to general artificial intelligence. ChatGPT, for example, can
answer questions asked at the International Mathematical Olympiad. And
yet, on other, seemingly much simpler questions, ChatGPT makes surprising
mistakes. What aspects of conversational AI intelligence explain its
ability to solve some problems and not others?&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://arxiv.org/abs/2309.13638"&gt;Thomas McCoy and co-authors&lt;/a&gt;
conjecture that it has to do with their underlying model of
autoregression: technically, these AIs are trained to complete texts
found on the Internet. If an AI is very good at calculating (9/5) x + 32,
but not (7/5) x + 31, it is because the first formula corresponds to the
conversion of degrees Celsius to Fahrenheit, a very frequent conversion
on the Internet, while the second does not correspond to any particular
formula. Conversational AIs would therefore be good at reproducing what
they’ve already seen. Indeed, numerous studies have shown that they have
a certain tendency to reproduce snippets of known text. So, if an AI can
solve problems from the International Mathematical Olympiad, is it simply
because it has memorized the answer?&lt;/p&gt;
&lt;div class="section" id="something-new"&gt;
&lt;h2&gt;Something new?&lt;/h2&gt;
&lt;p&gt;In terms of intelligence, inventing a new mathematical demonstration
requires mastering abstractions and the ability to string together
complicated logical reasoning with an imposed start and finish. This
seems much more difficult than memorizing a demonstration. This is one of
the traditional oppositions in machine learning, the line of research
that gave rise to today’s AIs: memorizing is one thing, knowing how to
generalize is another. For example, if I memorize all the additions
between two numbers smaller than ten, I cannot extrapolate beyond that. To
go further, I need to master the logic of addition… or memorize more.&lt;/p&gt;
&lt;p&gt;And precisely, conversational AIs have an enormous capacity for
memorization, and have been trained on almost the entire Internet. Given
a question, they can often dip into their memory to find answers. So, are
they intelligent or just have a great memory? Scientists are still
debating the importance of memory to their abilities. Some argue that
their storage capacity is ultimately limited by the size of the Internet.
Others wonder to what extent the impressive successes highlighted are not
on tasks already solved on the Internet, questioning their ability to do
anything new.&lt;/p&gt;
&lt;p&gt;But could memorization be an aspect of intelligence? In 1987, Lenat and
Feigenbaum conjectured that, for a cognitive agent, accumulating
knowledge enables it to solve new tasks with less learning. Perhaps the
intelligence of conversational AI lies in knowing how to pick up the
right bits of information, and combine them.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Related academic work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://www.pnas.org/doi/10.1073/pnas.2322420121"&gt;Embers of autoregression show how large language models are shaped
by the problem they are trained to solve&lt;/a&gt;, R. Thomas McCoy,
Shunyu Yao, Dan Friedman, Mathew D. Hardy, and Thomas L. Griffiths,
PNAS 2024 (&lt;a class="reference external" href="https://arxiv.org/abs/2309.13638"&gt;ArXiv&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Princeton researchers show that properties of large language models
(LLMs) are governed by the data that they are trained on, including
their arithmetic abilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://arxiv.org/abs/2410.05229"&gt;GSM-Symbolic: Understanding the Limitations of Mathematical
Reasoning in Large Language Models&lt;/a&gt;, Iman Mirzadeh, Keivan Alizadeh,
Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar&lt;/p&gt;
&lt;p&gt;Apple researchers show that LLMs solve mathematical challenges via
probabilistic &lt;strong&gt;pattern matching&lt;/strong&gt; on previously seen examples, rather
than logical reasoning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>CARTE: toward table foundation models</title><link href="https://gael-varoquaux.info/science/carte-toward-table-foundation-models.html" rel="alternate"></link><published>2024-07-19T00:00:00+02:00</published><updated>2024-07-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-07-19:/science/carte-toward-table-foundation-models.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Foundation models, pretrained and readily usable for many downstream
tasks, have changed the way we process text, images, and sound. Can we
achieve similar breakthroughs for tables? Here I explain why with
&lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;“CARTE”&lt;/a&gt;, we’ve made significant headway.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-for-data-tables-hopes-and-challenges" id="toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-is-a-necessity" id="toc-entry-2"&gt;Pre-training is a …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Foundation models, pretrained and readily usable for many downstream
tasks, have changed the way we process text, images, and sound. Can we
achieve similar breakthroughs for tables? Here I explain why with
&lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;“CARTE”&lt;/a&gt;, we’ve made significant headway.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-for-data-tables-hopes-and-challenges" id="toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-is-a-necessity" id="toc-entry-2"&gt;Pre-training is a necessity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pretraining-for-data-tables" id="toc-entry-3"&gt;Pretraining for data tables?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#carte-a-table-foundation-model-breakthrough" id="toc-entry-4"&gt;CARTE: a table foundation model breakthrough&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#an-architecture-to-learn-across-tables" id="toc-entry-5"&gt;An architecture to learn across tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pretraining-on-knowledge-graphs" id="toc-entry-6"&gt;Pretraining on knowledge graphs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#empirical-results" id="toc-entry-7"&gt;Empirical results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#lessons-learned" id="toc-entry-8"&gt;Lessons learned&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="pre-training-for-data-tables-hopes-and-challenges"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="pre-training-is-a-necessity"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Pre-training is a necessity&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Foundation models have brought breakthroughs to text and image processing
because they carry a great deal of knowledge about these data, knowledge
that can then be reused to simplify processing. But their promises have
not come true for tables, which hold much of an organization’s specific
data, &lt;em&gt;eg&lt;/em&gt; relational databases capturing day-to-day operations, or
measurement tables related to a specific source of data.&lt;/p&gt;
&lt;p&gt;Rather, for tabular learning, a couple of years ago &lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;our extensive
benchmarks&lt;/a&gt;
showed that tree-based models outperformed even deep-learning
architectures specially crafted for data tables.&lt;/p&gt;
&lt;p&gt;One challenge is that tables are typically not that big, and thus the
high flexibility of deep learning is a weakness rather than a benefit.
For data modalities where deep learning has been vastly successful, this
shortcoming was solved by pretrained models: &lt;strong&gt;most people do not
train a deep-learning model from scratch, but download a pre-trained one
from model hubs&lt;/strong&gt;. Such universal pre-training is also at the root of
foundation models.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="pretraining-for-data-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Pretraining for data tables?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;But what does pretraining mean for data tables? If I give you a table of
numbers, what prior information can you use to process it better?
Images and text have a lot of regularity that repeats across datasets:
I can recognize a car in pictures coming from all kinds of cameras
(including old black-and-white photographs). I use my knowledge of the
meaning of words to understand a text. But given a table of numbers as
below, what sense can I make of it?&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The tabular learning challenge: every table is a special snowflake&lt;/em&gt;&lt;/div&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="29%" /&gt;
&lt;col width="29%" /&gt;
&lt;col width="29%" /&gt;
&lt;col width="14%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;181&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;166&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;161&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The reason a data analyst can understand this data, and use this
understanding to build a better data-processing pipeline, is that the
data comes with context: meaningful strings sprinkled around these
numbers. For instance, a table with the same numbers as above, but with
column names and string entries, makes complete sense:&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;caption&gt;Cardiovascular cohort&lt;/caption&gt;
&lt;colgroup&gt;
&lt;col width="18%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="36%" /&gt;
&lt;col width="9%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Age&lt;/th&gt;
&lt;th class="head"&gt;Weight&lt;/th&gt;
&lt;th class="head"&gt;Height&lt;/th&gt;
&lt;th class="head"&gt;Commorbidity&lt;/th&gt;
&lt;th class="head"&gt;Cardiovascular event&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;td&gt;Diabetes&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;181&lt;/td&gt;
&lt;td&gt;Cardiac arrhythmia&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;166&lt;/td&gt;
&lt;td&gt;NA&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;161&lt;/td&gt;
&lt;td&gt;Asthma&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In such a setting, it becomes clear what background knowledge, what
pre-training can bring to analyzing data tables: &lt;strong&gt;string entries and
column names bring meaning to the numbers in data tables&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Another way of seeing the challenge is as one of &lt;strong&gt;data integration&lt;/strong&gt;: as
studied by the knowledge-representation and database communities, putting
multiple sources of data in a consistent representation requires:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;schema matching&lt;/strong&gt;, which to a first order is about finding column
correspondences across tables&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;entity matching&lt;/strong&gt;, finding correspondences across table entries
denoting the same thing, for instance “Diabetes” and “Diabetes mellitus”&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These challenges of data integration are central to building pretrained
or foundation models for tables. Indeed, such models must apply to all
tables, and thus must bridge these gaps across tables.&lt;/p&gt;
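&lt;p&gt;A minimal sketch of these two steps, using plain string similarity from the Python standard library (a crude stand-in: real systems use much richer matching, &lt;em&gt;eg&lt;/em&gt; language-model embeddings):&lt;/p&gt;

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Character-level similarity in the unit interval; a crude
    # stand-in for semantic matching.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Schema matching: find column correspondences across two tables.
columns_a = ["Age", "Weight", "Comorbidity"]
columns_b = ["age (years)", "body weight", "medical conditions"]
for col in columns_a:
    best = max(columns_b, key=lambda other: similarity(col, other))
    print(col, "matches", best)

# Entity matching: entries denoting the same thing in different forms.
print(round(similarity("Diabetes", "Diabetes mellitus"), 2))
```

&lt;p&gt;Character similarity catches “Diabetes” vs “Diabetes mellitus”, but not “Comorbidity” vs a column named with a true synonym, which is exactly where pretrained embeddings come in.&lt;/p&gt;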
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="carte-a-table-foundation-model-breakthrough"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;CARTE: a table foundation model breakthrough&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Our recent &lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;CARTE paper&lt;/a&gt; builds upon
the above insights, and demonstrates that pretraining can give
models that markedly improve performance.&lt;/p&gt;
&lt;div class="section" id="an-architecture-to-learn-across-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;An architecture to learn across tables&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Graphlets&lt;/strong&gt;
The key ingredient of CARTE is how we represent the inputs. CARTE’s goal
is to build predictors on rows of tables, for instance associating the
features of an individual with a risk of developing adverse cardiovascular
events. To pretrain across tables, we use a universal representation of
the data (rows of tables) as small graphs.&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_graphlet.png" /&gt;
&lt;p class="caption"&gt;Turning table rows into graphlets. Each column leads to an edge and
the column name is turned into the corresponding edge feature. It’s a
“multirelational graph”. The entry associated with the given column
is turned into the corresponding node feature, and the row is
represented as a special row token in the center of the graphlet.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Thus, tables with different numbers of columns can be turned into a
consistent representation. But an additional benefit of this
representation is that it can represent data across multiple tables with
shared keys (for instance all the visits of a patient to a hospital).&lt;/p&gt;
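&lt;p&gt;In code, the construction can be sketched as follows (a simplified, hypothetical rendition of the graphlet idea; the actual CARTE implementation builds proper graph objects with embedded features):&lt;/p&gt;

```python
# Simplified sketch: each table row becomes a star-shaped graphlet.
# Each column yields an edge labeled by the column name (the edge
# feature); the cell entry becomes the node at the other end; a special
# row token sits at the center.

def row_to_graphlet(row):
    edges = []
    for column_name, entry in row.items():
        edges.append(("ROW", column_name, entry))
    return edges

row = {"Age": 72, "Comorbidity": "Diabetes", "Cardiovascular event": 1}
for edge in row_to_graphlet(row):
    print(edge)
```

&lt;p&gt;Two tables with different columns both map into this same edge-list form, which is what lets a single model consume rows from any table.&lt;/p&gt;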
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;A representation that can bridge tables without schema or entity
matching&lt;/em&gt;&lt;/div&gt;
&lt;br/&gt;
&lt;br/&gt;&lt;p&gt;&lt;strong&gt;String embeddings&lt;/strong&gt;
The second ingredient is to represent all strings, whether column names
or string entries, as embeddings from a pretrained language model. A good
language model will embed close to one another different strings with
similar meanings, for instance a column named “comorbidity” and
another one named “medical conditions”. Such a representation helps
learning without entity or schema matching.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Graph transformer&lt;/strong&gt; CARTE then uses a form of graph transformer on top
of this representation. Key to this graph transformer is an attention
mechanism that accounts for the relation information (the edge type,
&lt;em&gt;ie&lt;/em&gt; the column name). Thus &lt;em&gt;(born in, Paris)&lt;/em&gt; is represented
differently from &lt;em&gt;(living in, Paris)&lt;/em&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Numbers treated as such&lt;/strong&gt; Columns with numerical entries often carry
important information in a data table. Unlike typical large language
models, we do not represent numbers via string tokenization, but use a
vector representation where the numerical value is multiplied with the
embedding of the column name (a vector output by the language model).
That way a value of 126 in a column named “Systolic mm Hg” is represented
close to 1.5 times a value of 84 in a column named “Blood pressure”.&lt;/p&gt;
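&lt;p&gt;A toy sketch of this numerical encoding, with made-up three-dimensional vectors standing in for the language-model embeddings:&lt;/p&gt;

```python
# Toy sketch: a number is encoded as its value times the embedding of
# its column name. The vectors below are made up; in CARTE they come
# from a pretrained language model, so near-synonymous column names get
# close-by vectors.
TOY_EMBEDDINGS = {
    "Systolic mm Hg": [0.9, 0.1, 0.4],
    "Blood pressure": [0.9, 0.1, 0.4],  # near-synonym: same toy vector
}

def represent(value, column_name):
    return [value * x for x in TOY_EMBEDDINGS[column_name]]

rep_a = represent(126, "Systolic mm Hg")
rep_b = represent(84, "Blood pressure")
# 126 = 1.5 * 84, so the two representations are colinear, ratio 1.5
print([round(a / b, 2) for a, b in zip(rep_a, rep_b)])
```
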
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="pretraining-on-knowledge-graphs"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Pretraining on knowledge graphs&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We pretrain the above architecture on a large, general-purpose knowledge
graph. The goal is to distill the corresponding information into the
pretrained model, which can then implicitly use it when analyzing new
tables. Indeed, a large knowledge graph (we use &lt;a class="reference external" href="https://yago-knowledge.org"&gt;YAGO&lt;/a&gt;) represents a huge number of facts about the
world, and its representation, as a multirelational graph, is close to
the one that we use to model data tables.&lt;/p&gt;
&lt;p&gt;Given an analytic task on a data table of interest, the pretrained model
can be fine-tuned. We found this to be a tricky part, as those tables
are often small.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="empirical-results"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Empirical results&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Excellent performance on extensive benchmarks&lt;/strong&gt;
We compared CARTE to a variety of baselines across 51 datasets (mostly
downloaded from Kaggle), as a function of the number of samples (number
of rows):&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_learning_curve.png" /&gt;
&lt;p class="caption"&gt;Prediction performance as a function of sample size for classification
and regression tasks&lt;/p&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
CARTE outperforms all baselines, including very strong ones&lt;/div&gt;
&lt;p&gt;CARTE appears as a very strong performer, outperforming all baselines
when there are fewer than 2000 samples. For larger tables, the prior
information is less crucial, and more flexible learners are beneficial.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strong contenders&lt;/strong&gt; We see that powerful tree-based learners, such as
CatBoost or XGBoost, also work very well. We investigated many baselines
in detail. Here, we consider not only learners, but also a variety of
methods to encode strings, and these really help prediction:&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_cd_plots.png" /&gt;
&lt;p class="caption"&gt;Detailed comparison (critical difference plots, giving the average
ranking of methods) across all 42 baselines that we investigated&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;CatBoost is an excellent predictor because it encodes categories
with great care. &lt;em&gt;S-LLM-CN-XGB&lt;/em&gt; is a baseline that we contributed; it
encodes strings with an LLM, concatenates the numerical values, and uses
XGBoost on the resulting representation. &lt;em&gt;TabVec&lt;/em&gt; is the &lt;a class="reference external" href="https://skrub-data.org/stable/generated/skrub.TableVectorizer.html#skrub.TableVectorizer"&gt;TableVectorizer&lt;/a&gt;
from &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub&lt;/a&gt;. Combined with standard learners,
it gives really strong baselines.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Learning across tables&lt;/strong&gt; As CARTE can jointly model different tables with
different conventions, we show that it can use large source tables to
boost prediction on a smaller target table.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/carte/carte_joint_learning.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Ranking of various methods used across tables with imperfect
correspondences, where “matched” means manual column matching, and “not
matched” means no manual column matching&lt;/em&gt;&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Transfer learning across sources with different columns / schemas&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="lessons-learned"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;Lessons learned&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The extensive empirical results hold many lessons.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tabular foundation models are possible&lt;/strong&gt; The first lesson is that
using strings to bring meaning to the numbers enables foundation models
for tables: pretrained models that facilitate a variety of downstream
tasks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLMs are not enough&lt;/strong&gt; Many approaches to table foundation models adapt
large language models pretrained on huge text corpora. The argument is
that, given the amount of high-quality text on the Internet, the
corresponding LLM can acquire broad background knowledge. The seminal example is that of
&lt;a class="reference external" href="https://proceedings.mlr.press/v206/hegselmann23a.html"&gt;TabLLM&lt;/a&gt;, which
makes sentences out of table rows and feeds them to LLMs. Yet, by itself
it does not perform well on tables with numbers.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/carte/tabllm_comparison.png" style="width: 350px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Ranking of models on data from the TabLLM paper, data that differs from
our benchmark above as it does not have string entries.&lt;/em&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
A table foundation model must model strings and numbers&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Modeling numbers is crucial&lt;/strong&gt; TabPFN, CARTE, and XGBoost all outperform
TabLLM on tables without strings, likely because they readily model
numbers, while an LLM sees them as strings. Likewise, our variant
&lt;em&gt;S-LLM-CN-XGB&lt;/em&gt;, which combines LLMs with a model suitable for numbers,
performs very well.&lt;/p&gt;
&lt;p&gt;As the strings are crucial to give context to numbers, we believe that
the future of table foundation models is to model well both strings and
numbers.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;CARTE is only a first step in the world of table foundation models. I
am convinced that these ideas will be pushed much further.&lt;/p&gt;
&lt;p class="last"&gt;But we have learned a lot in this study. I have only scratched the
surface of our work here. If you want more details, read the &lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;CARTE paper&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="machine learning"></category><category term="tabular learning"></category><category term="foundation models"></category></entry><entry><title>Comité de l’intelligence artificielle: vision et stratégie nationale</title><link href="https://gael-varoquaux.info/science/comite-de-lintelligence-artificielle-vision-et-strategie-nationale.html" rel="alternate"></link><published>2023-09-20T00:00:00+02:00</published><updated>2023-09-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2023-09-20:/science/comite-de-lintelligence-artificielle-vision-et-strategie-nationale.html</id><summary type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;English summary&lt;/p&gt;
&lt;p&gt;I have been appointed to the government-level panel of experts on AI,
to set the national vision and strategy in France.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I have the honor of being &lt;a class="reference external" href="https://www.gouvernement.fr/communique/comite-de-lintelligence-artificielle"&gt;appointed to the French government’s artificial intelligence committee&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The mission entrusted to us is to inform public action …&lt;/p&gt;</summary><content type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;English summary&lt;/p&gt;
&lt;p&gt;I have been appointed to the government-level panel of experts on AI,
to set the national vision and strategy in France.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I have the honor of being &lt;a class="reference external" href="https://www.gouvernement.fr/communique/comite-de-lintelligence-artificielle"&gt;appointed to the French government’s artificial intelligence committee&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The mission entrusted to us is to inform public action around
artificial intelligence, a technology that can impact many aspects of
society.&lt;/p&gt;
&lt;p&gt;The committee brings together experts with very varied profiles, from
young entrepreneurs to world-renowned economists. The difficulty will be
to consider the full set of links between technological progress and
society. We will seek to articulate a vision, gather expertise from many
different actors on many different topics, and ground our projections in
the current state of scientific knowledge.&lt;/p&gt;
&lt;p&gt;I will not share the committee’s work ahead of time: establishing
consensus takes real work, and that work takes time.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This mission goes beyond my usual sphere of academic research and
software development. I am doing it because I believe that, for
technology to have the best impact on society, there must be a
back-and-forth between technological creation and societal change. If we
scientists decide to focus solely on our academic and technical work, we
lose control over how society adopts our technology; we leave that
control to the people who choose to spend their energy acting,
influencing, and profiting directly from these technologies. As a
computer-science researcher, working both on fundamental AI and on
applications in health, I have expertise that is important to bring to
the table. As a civil servant, I think I can and must inform the debate:
I am less exposed to the risk of conflicts of interest, and I am paid
with public money to be useful to the public.&lt;/p&gt;
&lt;p&gt;This work is nevertheless not a political stance: I am a scientist,
not an elected official. The committee’s power is not to make political
decisions, but to inform about what is possible. It is a work of
synthesis and mediation.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Mise à jour: rapport disponible&lt;/p&gt;
&lt;p&gt;Nous avons publié en mars 2024 notre rapport, disponible &lt;a class="reference external" href="https://www.info.gouv.fr/actualite/25-recommandations-pour-lia-en-france"&gt;en ligne&lt;/a&gt;.
Il est très lisible et traite de tous les sujets autours de l’IA.
Lecture recommandée à tous.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="artificial intelligence"></category><category term="society"></category><category term="science"></category><category term="government"></category></entry><entry><title>2022, a new scientific adventure: machine learning for health and social sciences</title><link href="https://gael-varoquaux.info/science/2022-a-new-scientific-adventure-machine-learning-for-health-and-social-sciences.html" rel="alternate"></link><published>2023-01-31T00:00:00+01:00</published><updated>2023-01-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2023-01-31:/science/2022-a-new-scientific-adventure-machine-learning-for-health-and-social-sciences.html</id><summary type="html">&lt;p&gt;A retrospective on last year (2022): I embarked on a new scientific
adventure, assembling &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;a team&lt;/a&gt; focused on
developing machine learning for health and social science. The team has
existed for almost a year, and the vision is nicely shaping up. Let me
share with you illustrations of where we …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A retrospective on last year (2022): I embarked on a new scientific
adventure, assembling &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;a team&lt;/a&gt; focused on
developing machine learning for health and social science. The team has
existed for almost a year, and the vision is nicely shaping up. Let me
share with you illustrations of where we are at. This is extracted from
our yearly report which will be public later, but I have sometimes edited
it a bit to add personal context.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-new-team-soda" id="toc-entry-1"&gt;A new team: Soda&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-scientific-vision" id="toc-entry-2"&gt;The scientific vision&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#application-context-richer-data-in-health-and-social-sciences" id="toc-entry-3"&gt;Application context: richer data in health and social sciences&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#applications-raise-specific-data-science-challenges" id="toc-entry-4"&gt;Applications raise specific data-science challenges&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#our-research-axes" id="toc-entry-5"&gt;Our research axes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#some-notable-results-of-2022" id="toc-entry-6"&gt;Some notable results of 2022&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#learning-on-relational-data-aggregating-across-many-tables" id="toc-entry-7"&gt;Learning on relational data: aggregating across many tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#validating-probabilistic-classifiers-beyond-calibration" id="toc-entry-8"&gt;Validating probabilistic classifiers: beyond calibration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#reweighting-randomized-trials-for-generalization-finite-sample-error-and-variable-selection" id="toc-entry-9"&gt;Reweighting randomized trials for generalization: finite sample error and variable selection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#challenges-to-clinical-impact-of-ai-in-medical-imaging" id="toc-entry-10"&gt;Challenges to clinical impact of AI in medical imaging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#privacy-preserving-synthetic-educational-data-generation" id="toc-entry-11"&gt;Privacy-preserving synthetic educational data generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-new-team-soda"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;A new team: Soda&lt;/a&gt;&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/team_2022.jpg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;The team in early 2022 (it has grown a lot since)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;At &lt;a class="reference external" href="https://www.inria.fr/en"&gt;Inria&lt;/a&gt;, we have teams assembling multiple
tenured researchers around a scientific project. Last year, we assembled
a new team called &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;Soda&lt;/a&gt;, which stands for
“social data”, but above all is a fun name.&lt;/p&gt;
&lt;p&gt;In a year, the team grew like crazy (to be honest, this had been baking
for a little while). We are now around 25 people.
There are 4 PIs (Marine le Morvan, Judith Abécassis, Jill-Jênn Vie, and
myself); and the engineers working on scikit-learn at Inria are also part
of the team.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-scientific-vision"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;The scientific vision&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Machine learning to leverage richer, more complex, data for
social-sciences and health&lt;/em&gt;&lt;/p&gt;
&lt;div class="section" id="application-context-richer-data-in-health-and-social-sciences"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Application context: richer data in health and social sciences&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Opportunistic data accumulations, often observational, bear great
promise for the social and health sciences. But these data are too big and
complex for the standard statistical methodologies of these sciences.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Health databases&lt;/strong&gt; Increasingly rich health data is accumulated
during routine clinical practice as well as for research. Its large
coverage brings new promises for public health and personalized medicine,
but it does not fit easily in standard biostatistical practice because it
is not acquired and formatted for a specific medical question.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Social, educational, and behavioral sciences&lt;/strong&gt; Better data sheds new
light on human behavior and psychology, for instance with on-line
learning platforms. Machine learning can be used both as a model for
human intelligence and as a tool to leverage these data, for instance
improving education.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="applications-raise-specific-data-science-challenges"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Applications raise specific data-science challenges&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Data management: preparing dirty data for analytics&lt;/strong&gt; Assembling,
curating, and transforming data for analysis is very labor
intensive. These data-preparation steps are often considered the number-one
bottleneck of data science. They mostly rely on data-management
techniques. A typical problem is establishing correspondences between
entries that denote the same entities but appear in different forms
(entity linking, including deduplication and record linkage). Another
time-consuming process is joining and aggregating data across multiple
tables with repetitions at different levels (as with panel data in
econometrics and epidemiology) to form a unique set of “features”
describing each individual.&lt;/p&gt;
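&lt;p&gt;As a toy illustration of the entity-linking problem (just the flavor of it, not the methods we develop; the city names below are made up), string similarity from Python’s standard library can map variant spellings to canonical entries:&lt;/p&gt;

```python
# Minimal sketch of entity linking: map variant spellings of entities
# to canonical entries using string similarity (standard-library difflib).
import difflib

canonical = ["Paris", "London", "New York"]
observed = ["paris", "Lndon", "New-York", "London"]

canonical_lower = [c.lower() for c in canonical]
links = {}
for name in observed:
    # Compare lowercased strings so that case differences do not matter
    match = difflib.get_close_matches(name.lower(), canonical_lower,
                                      n=1, cutoff=0.6)
    if match:
        # Recover the canonical spelling of the matched entity
        links[name] = canonical[canonical_lower.index(match[0])]

print(links)  # each variant mapped to its canonical entity
```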
&lt;div class="sidebar"&gt;
The &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;Dirty Data project&lt;/a&gt; paved the way.&lt;/div&gt;
&lt;p&gt;Progress in machine learning increasingly helps automate data
preparation and process data with less curation.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Data science with statistical machine learning&lt;/strong&gt; Machine learning can
be a tool to answer complex domain questions by providing non-parametric
estimators. Yet, much work remains: to go beyond point
estimators, to derive non-parametric procedures that account for a
variety of biases (censoring, sampling biases, non-causal associations), and
to provide theoretical and practical tools to assess the validity of
estimates and conclusions in weakly-parametric settings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="our-research-axes"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Our research axes&lt;/a&gt;&lt;/h3&gt;
&lt;div class="section" id="representation-learning-for-relational-data"&gt;
&lt;h4&gt;Representation learning for relational data&lt;/h4&gt;
&lt;p&gt;I dream of deep-learning methodology for relational databases, from
tabular datasets to full relational databases. The stakes are &lt;em&gt;i)&lt;/em&gt; to
build machine-learning models that apply readily to the raw data so as to
minimize manual cleaning, data formatting, and integration, and &lt;em&gt;ii)&lt;/em&gt; to
extract reusable representations that reduce sample complexity on new
databases by transforming the data into well-distributed vectors.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="mathematical-aspects-of-statistical-learning-for-data-science"&gt;
&lt;h4&gt;Mathematical aspects of statistical learning for data science&lt;/h4&gt;
&lt;p&gt;I want to use machine-learning models as non-parametric estimators, as I
worry about the impact of mismodeling on conclusions. However, for a given
statistical task, the statistical procedures and validity criteria need
to be reinvented. Soda contributes statistical tools and results for a
variety of problems important to data science in health and social
science (epidemiology, econometrics, education). These fields lead to
various statistical topics:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Causal inference&lt;/li&gt;
&lt;li&gt;Model validation&lt;/li&gt;
&lt;li&gt;Uncertainty quantification&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-for-health-and-social-sciences"&gt;
&lt;h4&gt;Machine learning for health and social sciences&lt;/h4&gt;
&lt;p&gt;Soda targets applications in health and the social sciences, as these can
markedly benefit from advanced processing of richer datasets and can have a
large societal impact, but fall outside mainstream machine-learning
research, which focuses on processing natural images, language, and voice.
Data surveying humans needs another focus: it is most of the time
tabular and sparse, with a time dimension and missing values. In terms of
application fields, we focus on the social sciences that rely on
quantitative predictions or analyses across individuals, such as policy
evaluation. Indeed, the same formal problems, addressed in the two
research axes above, arise across various social sciences:
&lt;strong&gt;epidemiology, education research, and economics&lt;/strong&gt;.
The challenge is to develop efficient and trustworthy machine-learning
methodology for these high-stakes applications.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="high-quality-data-science-software"&gt;
&lt;h4&gt;High-quality data-science software&lt;/h4&gt;
&lt;p&gt;The societal and economic impact of machine learning requires easy-to-use
practical tools that can be leveraged in non-specialized organizations
such as hospitals or policy-making institutions.&lt;/p&gt;
&lt;p&gt;Soda incorporates the core team working at Inria on &lt;strong&gt;scikit-learn&lt;/strong&gt;, one
of the most popular machine-learning tools worldwide. One of the missions
of Soda is to improve scikit-learn and its documentation, transferring the
understanding of machine learning and data science accumulated by our
various research efforts.&lt;/p&gt;
&lt;p&gt;Soda also works on other important software tools to foster the growth and
health of the Python data ecosystem in which scikit-learn is embedded.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="some-notable-results-of-2022"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Some notable results of 2022&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I am listing here a small number of the achievements of the team, because
I find them inspiring.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="learning-on-relational-data-aggregating-across-many-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Learning on relational data: aggregating across many tables&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For many machine-learning tasks, augmenting the data table at hand with
features built from external sources is key to improving performance. For
instance, estimating housing prices benefits from background information
on the location, such as the population density or the average income.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/aggregating.png" style="width: 300px;" /&gt;
&lt;p class="caption"&gt;Often, data must be assembled across multiple tables into a single
table for analysis. Challenges arise due to one-to-many relations,
irregularity of the information, and the number of tables that may be
involved.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Most often, a major bottleneck is to &lt;strong&gt;assemble this information across
many tables&lt;/strong&gt;, requiring time and expertise from the data scientist. We
propose &lt;strong&gt;vectorial representations of entities (e.g. cities) that capture
the corresponding information&lt;/strong&gt; and thus can replace human-crafted
features. In &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s10994-022-06277-7"&gt;Cvetkov-Iliev 2023&lt;/a&gt;, we
represent the relational data on the entities as a graph and adapt
graph-embedding methods to create feature vectors for each entity. We
show that two technical ingredients are crucial: modeling well the
different relationships between entities, and capturing numerical
attributes. We adapt knowledge graph embedding methods that were
primarily designed for graph completion. Yet, they model only discrete
entities, while creating good feature vectors from relational data also
requires capturing numerical attributes. For this, we introduce KEN:
Knowledge Embedding with Numbers. We thoroughly evaluate approaches to
enrich features with background information on 7 prediction tasks. We
show that a good embedding model coupled with KEN can perform better than
manually handcrafted features, while requiring much less human effort. It
is also competitive with combinatorial feature engineering methods, but
much more scalable. Our approach can be applied to huge databases, for
instance on general knowledge graphs as in YAGO, creating &lt;strong&gt;general-purpose
feature vectors reusable in various downstream tasks&lt;/strong&gt;.&lt;/p&gt;
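&lt;p&gt;In practice, using such entity embeddings boils down to a join: the sketch below (with made-up numbers and column names, merely to show the spirit of the approach) merges pre-computed vectors onto the table at hand, replacing manual feature engineering on the entity:&lt;/p&gt;

```python
# Sketch: augmenting a prediction table with pre-computed entity
# embeddings, in the spirit of the KEN feature vectors.
import pandas as pd

# Table at hand: one row per housing sale, with the city as an entity
sales = pd.DataFrame({"city": ["Paris", "Lyon", "Paris"],
                      "price": [9500, 4200, 8800]})

# Hypothetical embedding table: one vector per entity (the actual KEN
# vectors are downloadable from soda-inria.github.io/ken_embeddings)
embeddings = pd.DataFrame({"city": ["Paris", "Lyon"],
                           "dim_0": [0.12, -0.45],
                           "dim_1": [0.87, 0.33]})

# A single join brings in background information on each entity
augmented = sales.merge(embeddings, on="city", how="left")
print(augmented.shape)  # (3, 4): price plus two embedding dimensions
```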
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2022/entity_types_with_names.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Entity embeddings of YAGO (wikipedia)&lt;/strong&gt; (2D-representation using
UMAP). The vectors are downloadable from
&lt;a class="reference external" href="https://soda-inria.github.io/ken_embeddings"&gt;https://soda-inria.github.io/ken_embeddings&lt;/a&gt;} to readily augment
data-science projects.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="validating-probabilistic-classifiers-beyond-calibration"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;Validating probabilistic classifiers: beyond calibration&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/grouping_loss.png" style="width: 360px;" /&gt;
&lt;p class="caption"&gt;Validating probabilistic predictions of classifiers must go account
not only for the average error given an predicted score, but also for
the dispersion of errors.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Ensuring that a classifier gives reliable confidence scores is essential
for informed decision-making, in particular in high-stakes areas such as
health. For instance, before using a clinical prognostic model, we want
to establish that, for a given individual, the probabilities it attributes
to different clinical outcomes can indeed be trusted. To this end,
recent work has focused on miscalibration, &lt;em&gt;i.e.&lt;/em&gt;, the over- or
under-confidence of model scores.&lt;/p&gt;
&lt;p&gt;Yet calibration is not enough: even a perfectly calibrated classifier
with the best possible accuracy can have confidence scores that are far
from the true posterior probabilities, if it is over-confident for some
samples and under-confident for others. This is captured by the grouping
loss, created by samples with &lt;strong&gt;the same confidence scores but different
true posterior probabilities&lt;/strong&gt;. Proper scoring rule theory shows that given
the calibration loss, the missing piece to characterize individual errors
is the grouping loss. While there are many estimators of the calibration
loss, none exists for the grouping loss in standard settings. In
&lt;a class="reference external" href="https://arxiv.org/abs/2210.16315"&gt;Perez-Lebel 2023&lt;/a&gt;, we propose an
estimator to approximate the grouping loss. We show that modern neural-network
architectures in vision and NLP exhibit grouping loss, notably in
distribution-shift settings, which highlights the importance of
pre-production validation.&lt;/p&gt;
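&lt;p&gt;A tiny numeric illustration (a sketch of the concept, not the estimator of the paper): two groups of samples receive the same confidence score, so the classifier looks perfectly calibrated on average, yet the true posterior probabilities differ within the score level:&lt;/p&gt;

```python
import numpy as np

# All samples get the same confidence score from the classifier...
scores = np.full(200, 0.7)
# ...but the true posterior probability differs between two groups
true_prob = np.array([0.9] * 100 + [0.5] * 100)

# Calibration compares the average outcome to the score level: here they
# match, so the classifier appears perfectly calibrated
print(np.isclose(scores.mean(), true_prob.mean()))  # True

# The grouping loss reflects the dispersion of true probabilities within
# a score level: non-zero here, revealing over-confidence on one group
# and under-confidence on the other
print(round(float(np.var(true_prob)), 2))  # 0.04
```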
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="reweighting-randomized-trials-for-generalization-finite-sample-error-and-variable-selection"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-9"&gt;Reweighting randomized trials for generalization: finite sample error and variable selection&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/reweighting_trial.png" style="width: 360px;" /&gt;
&lt;p class="caption"&gt;There may be a sampling bias between a randomized trial and the
target population.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Randomized Controlled Trials (RCTs) are the ideal experiments to establish
causal statements. However, they may suffer from limited scope, in
particular because they may have been run on non-representative samples:
some RCTs over- or under-sample individuals with certain characteristics
compared to the target population, for which one wants conclusions on
treatment effectiveness. Re-weighting trial individuals to match the
target population can improve the treatment-effect estimation.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://hal.science/hal-03822662"&gt;Colnet 2022&lt;/a&gt;, we establish the
exact expressions of the bias and variance of such reweighting procedures
- also called Inverse Propensity of Sampling Weighting (IPSW) - in the
presence of categorical covariates, for any sample size. Such results
allow us to compare the theoretical performance of different versions of
IPSW estimates. Besides, our results show how the performance (bias,
variance, and quadratic risk) of IPSW estimates depends on the two sample
sizes (RCT and target population). A by-product of our work is a proof
of consistency of IPSW estimates. Results also reveal that IPSW
performance is improved when the trial probability of being treated is
estimated (rather than using its oracle counterpart). In addition, we
study the &lt;strong&gt;choice of variables&lt;/strong&gt;: how including covariates that are not
necessary for identifiability of the causal effect may impact the
asymptotic variance. Including covariates that are shifted between the
two samples but are not treatment-effect modifiers increases the variance,
while non-shifted treatment-effect modifiers do not.&lt;/p&gt;
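&lt;p&gt;The idea of IPSW can be sketched in a few lines (a toy simulation with a single binary covariate and invented numbers, not the estimators analyzed in the paper): trial individuals are reweighted by the ratio of target-population to trial covariate frequencies before taking a difference in means:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# The trial over-samples x=1 (70 percent) relative to the target
# population (30 percent)
x = rng.binomial(1, 0.7, size=10_000)
treat = rng.binomial(1, 0.5, size=10_000)
# The treatment effect is larger when x=1 (x is an effect modifier)
y = treat * (1.0 + 2.0 * x) + rng.normal(size=10_000)

# IPSW weights: P_target(x) / P_trial(x) for each individual
w = np.where(x == 1, 0.3 / 0.7, 0.7 / 0.3)

# Weighted difference in means between treated and control arms
ate = (np.sum(w * treat * y) / np.sum(w * treat)
       - np.sum(w * (1 - treat) * y) / np.sum(w * (1 - treat)))
print(ate)  # approximately 1.6 = 1 + 2 * 0.3, the target-population effect
```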
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="challenges-to-clinical-impact-of-ai-in-medical-imaging"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-10"&gt;Challenges to clinical impact of AI in medical imaging&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I have worked for many years on research in computer analysis of medical
images. In particular, I am convinced that machine learning bears many
promises to improve patients’ health. However, I cannot be blind to the
fact that a number of systematic challenges are slowing down the progress
of the field.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://www.nature.com/articles/s41746-022-00592-y"&gt;Varoquaux &amp;amp; Cheplygina&lt;/a&gt;, we tried to take
a step back on these challenges, from limitations of the data, such as
biases, to research incentives, such as optimizing for publication. We
reviewed roadblocks to developing and assessing methods. Building our
analysis on evidence from the literature and data challenges, we showed
that potential biases can creep in at every step.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;First, larger datasets do not bring increased prediction accuracy and
may suffer from biases.&lt;/li&gt;
&lt;li&gt;Second, evaluations often miss the target, with evaluation error larger
than algorithmic improvements, improper evaluation procedures and
leakage, metrics that do not reflect the application, incorrectly chosen
baselines, and improper statistics.&lt;/li&gt;
&lt;li&gt;Finally, we show how publishing too often leads to distorted incentives.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On a positive note, we also discuss on-going efforts to counteract these
problems and provide recommendations on how to further address these
problems in the future.&lt;/p&gt;
&lt;p&gt;This was a fun exercise. I realize that I still need to sit with it and
introspect on how it has shaped my research agenda, because I think it has
pushed me to choose specific emphases (such as model evaluation, or
focusing on rich data sources).&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="privacy-preserving-synthetic-educational-data-generation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-11"&gt;Privacy-preserving synthetic educational data generation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Soda also works on applications other than health, for instance
education. In this direction, I would like to highlight work in which I
did not participate, by Jill-Jênn Vie, another PI of the team.&lt;/p&gt;
&lt;p&gt;Institutions collect massive learning traces, but they may not disclose
them for privacy reasons. Synthetic data generation opens new opportunities for
research in education. &lt;a class="reference external" href="https://hal.inria.fr/hal-03715416"&gt;Vie 2022&lt;/a&gt;
presented a generative model for educational data that can preserve the
privacy of participants, and an evaluation framework for comparing
synthetic-data generators. We show how naive pseudonymization can lead to
re-identification threats and suggest techniques to guarantee privacy. We
evaluate our method on existing massive open educational datasets.&lt;/p&gt;
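&lt;p&gt;A minimal sketch of the re-identification threat behind naive pseudonymization (with invented records, not the paper’s data): quasi-identifiers left in the released traces can be joined against auxiliary public data:&lt;/p&gt;

```python
# Pseudonymized learning traces: names removed, but quasi-identifiers
# (zip code, birth year) are left in the release
released = [
    {"pseudo_id": "u1", "zip": "75013", "birth_year": 1990, "grade": 14},
    {"pseudo_id": "u2", "zip": "69002", "birth_year": 1985, "grade": 9},
]
# Auxiliary public dataset, with names attached to the same quasi-identifiers
public = [{"name": "Alice", "zip": "75013", "birth_year": 1990}]

# A simple join on the quasi-identifiers breaks the pseudonymization
reidentified = [
    (person["name"], row["pseudo_id"])
    for row in released
    for person in public
    if (row["zip"], row["birth_year"]) == (person["zip"], person["birth_year"])
]
print(reidentified)  # [('Alice', 'u1')]
```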
&lt;p&gt;The tension between privacy of individuals and the need for datasets for
open science is a real and important one.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This was just a quick glance at what we do at Soda, and we are just
warming up. I am super excited about this research. I hope that it will
matter.&lt;/p&gt;
&lt;p&gt;I truly believe that more and better machine learning can help health
and social science draw new insights from new datasets.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="health"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>2021 highlight: Decoding brain activity to new cognitive paradigms</title><link href="https://gael-varoquaux.info/science/2021-highlight-decoding-brain-activity-to-new-cognitive-paradigms.html" rel="alternate"></link><published>2022-02-24T00:00:00+01:00</published><updated>2022-02-24T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2022-02-24:/science/2021-highlight-decoding-brain-activity-to-new-cognitive-paradigms.html</id><summary type="html">&lt;p class="align-right"&gt;&lt;em&gt;Broad decoding models that can specialize to discriminate
closely-related mental processes with limited data&lt;/em&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;TL;DR&lt;/p&gt;
&lt;p&gt;Decoding models can help isolate which mental processes are implied
by the activation of given brain structures. But to support a broad
conclusion, they must be trained on many studies, a difficult problem
given …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p class="align-right"&gt;&lt;em&gt;Broad decoding models that can specialize to discriminate
closely-related mental processes with limited data&lt;/em&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;TL;DR&lt;/p&gt;
&lt;p&gt;Decoding models can help isolate which mental processes are implied
by the activation of given brain structures. But to support a broad
conclusion, they must be trained on many studies, a difficult problem
given the unclear relations between tasks of different studies. We
contributed a method that infers these links from the data. Their
validity is established by generalization to new tasks. Some
cognitive neuroscientists prefer qualitative consolidation of
knowledge, but such an approach is hard to put to the test.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="context-infering-cognition-from-brain-imaging"&gt;
&lt;h2&gt;Context: Inferring cognition from brain imaging&lt;/h2&gt;
&lt;p&gt;Often, when interpreting functional brain images, one would like to
conclude on the individual’s ongoing mental processes. But this
conclusion is not directly warranted by brain-imaging studies, as they do
not control the brain activity, but rather engage the participant via a
cognitive paradigm made of psychological manipulations &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. &lt;em&gt;Brain
decoding&lt;/em&gt; can help ground such &lt;em&gt;reverse inferences&lt;/em&gt; &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;, by using
machine learning to predict aspects of the task.&lt;/p&gt;
&lt;p&gt;But a brain-decoding model can seldom support broad reverse-inference
claims, as typical decoding models are trained on a given study that
samples only a few aspects of cognition. Thus the decoding model only
supports conclusions on the interpretation of brain activity within the
study’s narrow scope.&lt;/p&gt;
&lt;p&gt;Another challenge is that of statistical power. Most functional brain
imaging studies comprise only a few dozen subjects, compromising
statistical power &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;, even more so when using machine learning &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-4"&gt;[4]&lt;/a&gt;.
While there exist large acquisition efforts, these must focus on broad
psychological manipulations that do not probe fine aspects of mental
processes.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2006, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1364661305003360"&gt;Can cognitive processes be inferred from
neuroimaging data?&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2011, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S0896627311009895"&gt;Inferring Mental States from Neuroimaging Data:
From Reverse Inference to Large-Scale Decoding&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2017, &lt;a class="reference external" href="https://www.nature.com/articles/nrn.2016.167"&gt;Scanning the horizon: towards transparent and
reproducible neuroimaging research&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-4"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Varoquaux 2018, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305311"&gt;Cross-validation failure: Small sample sizes lead
to large error bars&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="contribution-informing-specialized-decoding-questions-from-broad-data-accumulation"&gt;
&lt;h2&gt;Contribution: Informing specialized decoding questions from broad data accumulation&lt;/h2&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008795"&gt;Mensch 2021&lt;/a&gt;,
we designed a machine-learning method that can &lt;strong&gt;jointly analyze many
unrelated functional imaging studies to build representations associating
brain activity to mental processes&lt;/strong&gt;. These representations can then be
used to &lt;strong&gt;improve brain decoding in new unrelated studies&lt;/strong&gt;, thus bringing
statistical-power improvements even to experiments probing fine aspects
of mental processes not studied in large cohorts.&lt;/p&gt;
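&lt;p&gt;Schematically (a toy numpy stand-in, not the actual architecture of the paper), the approach amounts to a representation of brain maps shared across studies, combined with study-specific read-out heads:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_components = 100, 5

# Encoder shared by all studies; in the real model it is learned jointly
shared_projection = rng.normal(size=(n_voxels, n_components))

def decode(brain_map, study_head):
    """Project into the shared space, then apply a study-specific head."""
    z = brain_map @ shared_projection   # representation common to all studies
    logits = z @ study_head             # read-out specific to one study
    return int(np.argmax(logits))

# Each study keeps its own head over its own task conditions
head_study_a = rng.normal(size=(n_components, 3))  # a study with 3 conditions
head_study_b = rng.normal(size=(n_components, 7))  # a study with 7 conditions

label_a = decode(rng.normal(size=n_voxels), head_study_a)
label_b = decode(rng.normal(size=n_voxels), head_study_b)
print(label_a in range(3), label_b in range(7))  # True True
```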
&lt;p&gt;One roadblock to accumulating information across
cognitive-neuroimaging studies is that they all probe different, yet related,
mental processes. Framing them all in the same analysis faces the lack of
a universally adopted language to describe cognitive paradigms. Our prior
work &lt;a class="footnote-reference" href="#footnote-5" id="footnote-reference-5"&gt;[5]&lt;/a&gt; on this endeavor – the quest for universal decoding across
studies – relied on describing each experimental paradigm in an ontology
of cognitive processes and psychological manipulations. However, such an
approach is not scalable. Here, rather, we inferred the latent structure
of the tasks from the data, without explicitly modeling the links
between studies. In my eyes, this was a very important ingredient of our
work, and it is non-trivial that it enables improving the decoding of
unrelated studies.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-5" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-5"&gt;[5]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Varoquaux 2018, &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;Atlases of cognition with large-scale human brain
mapping&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Capturing &lt;em&gt;representations&lt;/em&gt; was key to transferring across studies:
representations of brain activity captured distributed brain structures
predictive of behavior; representations of tasks across studies captured
decompositions of behavior well explained by brain activity. Of course,
the representations that we extracted were not as sharp as the stylized
functional modules that have been manually compiled from decades of
cognitive-neuroscience research.&lt;/p&gt;
&lt;p&gt;From a computer-science standpoint, we used a deep-learning architecture.
This is the first time that we witnessed a
deep-learning architecture outperforming well-tuned shallow baselines on
functional neuroimaging data &lt;a class="footnote-reference" href="#footnote-6" id="footnote-reference-6"&gt;[6]&lt;/a&gt;. This success is likely due to the
massive amount of data that we assembled: as our method can
readily work across studies, we were able to apply it to 40,000
subject-level contrast maps.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-6" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-6"&gt;[6]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;There have been many reports of deep architectures on functional
brain imaging. However, in our experience, good shallow benchmarks
are hard to beat.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2021_highlights/mston.png" /&gt;
&lt;p class="caption"&gt;Our deep-learning architecture&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-research-agenda-that-does-not-win-all-hearts"&gt;
&lt;h2&gt;A research agenda that does not win all hearts&lt;/h2&gt;
&lt;p&gt;Our underlying research agenda is to &lt;strong&gt;piece together
cognitive-neuroimaging evidence on a wide variety of tasks and mental
processes&lt;/strong&gt;. In cognitive neuroscience, such consolidation of knowledge
is done via review articles that assemble findings from many
publications into a consistent picture of how tasks decompose into
elementary mental processes implemented by brain functional modules. The
literature review and the ensuing neuro-cognitive model are however verbal
by nature: they assemble qualitative findings. I, for one, would like to
have quantitative tools to foster a big-picture view. Of course, the
challenge with quantitative approaches such as ours is to capture all the
qualitative aspects of the question.&lt;/p&gt;
&lt;p&gt;Over the years that I have been pushing these ideas, I find that they are
met with resistance from some elite cognitive neuroscientists who see
them as unexciting at best. The same people are enthusiastic about new
data-analysis methods to dissect brain responses in fine detail with a
detailed model of a given task, despite limited statistical power and
external validity. My feeling is that &lt;strong&gt;the question of how
various tasks are related is perceived as belonging to the walled garden
of cognitive neuroscientists, not to be put to the test by statistical
methods&lt;/strong&gt; &lt;a class="footnote-reference" href="#footnote-7" id="footnote-reference-7"&gt;[7]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-7" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-7"&gt;[7]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article/peerReview?id=10.1371/journal.pcbi.1008795"&gt;The second round of review of our manuscript&lt;/a&gt;
certainly felt as if the method was judged by cognitive-neuroscience
lenses, and not the validity of the data analysis that it entailed.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Yet, as clearly exposed by Tal Yarkoni in his &lt;a class="reference external" href="https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/generalizability-crisis/AD386115BA539A759ACB3093760F4824"&gt;Generalizability crisis&lt;/a&gt;,
drawing conclusions on mental organization from a few repetitions of a
task is at risk of picking up idiosyncrasies of the task or the stimuli.
A starting point of our work (&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008795"&gt;Mensch 2021&lt;/a&gt;)
was the fall of statistical power in cognitive neuroscience, documented
by &lt;a class="reference external" href="https://www.nature.com/articles/nrn.2016.167"&gt;Poldrack 2017&lt;/a&gt;, but
one reviewer censored this argument &lt;a class="footnote-reference" href="#footnote-8" id="footnote-reference-8"&gt;[8]&lt;/a&gt;. This exchange felt to me like &lt;strong&gt;a
field refusing to discuss its challenges publicly&lt;/strong&gt;, which leaves no room for
methods researchers such as myself to address them.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-8" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-8"&gt;[8]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article/peerReview?id=10.1371/journal.pcbi.1008795"&gt;Comments in the first review&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>2020: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2020-my-scientific-year-in-review.html" rel="alternate"></link><published>2021-01-05T00:00:00+01:00</published><updated>2021-01-05T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2021-01-05:/science/2020-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;The year 2020 has undoubtedly been interesting: the covid19 pandemic
struck while I was on a work sabbatical in Montréal, at the &lt;a class="reference external" href="https://www.mcgill.ca/neuro/"&gt;MNI&lt;/a&gt; and the &lt;a class="reference external" href="https://mila.quebec/"&gt;MILA&lt;/a&gt;,
and it pushed further my interest in machine learning for health-care.
&lt;strong&gt;My highlights this year revolve around basic and applied data-science
for health&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;The year 2020 has undoubtedly been interesting: the covid19 pandemic
struck while I was on a work sabbatical in Montréal, at the &lt;a class="reference external" href="https://www.mcgill.ca/neuro/"&gt;MNI&lt;/a&gt; and the &lt;a class="reference external" href="https://mila.quebec/"&gt;MILA&lt;/a&gt;,
and it pushed further my interest in machine learning for health-care.
&lt;strong&gt;My highlights this year revolve around basic and applied data-science
for health&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#mining-electronic-health-records-for-covid-19" id="toc-entry-1"&gt;Mining electronic health records for covid-19&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#machine-learning-for-dirty-data" id="toc-entry-2"&gt;Machine learning for dirty data&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#supervised-learning-with-missing-values-beyond-imputation" id="toc-entry-3"&gt;Supervised learning with Missing values: beyond imputation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#machine-learning-without-normalizing-entries" id="toc-entry-4"&gt;Machine-learning without normalizing entries&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#making-sense-of-brain-functional-signals" id="toc-entry-5"&gt;Making sense of brain functional signals&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#neuroquery-brain-mapping-any-neuroscience-query" id="toc-entry-6"&gt;NeuroQuery: brain mapping any neuroscience query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-high-resolution-brain-functional-atlas" id="toc-entry-7"&gt;A high-resolution brain functional atlas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="mining-electronic-health-records-for-covid-19"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Mining electronic health records for covid-19&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Hospital databases are rich and messy&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hospital databases&lt;/strong&gt;
In March, we &lt;a class="reference external" href="https://www.inria.fr/en/scikiteds-visualization-tool-monitoring-flow-sick-patients"&gt;teamed up with the hospitals around Paris&lt;/a&gt; that were suffering from a severe overload due to a new pathology,
covid-19. The challenge was to extract information from the huge
databases of the hospital management system: What were the characteristics
of the patients? How were the resources of the hospital evolving? Of the
treatments that were empirically attempted, which were most effective?&lt;/p&gt;
&lt;p&gt;The hospital databases are hugely promising, because &lt;strong&gt;they offer at
almost no cost information on all the patients that go through the
hospital&lt;/strong&gt;. As we were dealing with a conglomerate of 39 hospitals, this
information covers millions of patients each year: excellent
epidemiological coverage.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Challenging data science&lt;/strong&gt;
Our work was classic data science: we did a lot of data management,
crafting SQL queries and munging pandas dataframes to create data tables
for statistics and visualizations. We interacted strongly with the
hospital management and the doctors to understand the information of
interest. As we moved forward it became clear that behind each “simple”
question, there were challenges of statistical validity. We did not want
to produce a figure that was misleading. Typical challenges were:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Information needed complicated transformations (such as following a
patient hopping across hospitals to capture the patient status)&lt;/li&gt;
&lt;li&gt;Information was represented differently in the different hospitals&lt;/li&gt;
&lt;li&gt;Incorrect inputs prevented aggregation (such as erroneous entry dates
falling after the exit date, or missing values)&lt;/li&gt;
&lt;li&gt;The database had biases compared to the ground truth (simple oxygen
therapy acts are more often unreported than complicated invasive
ventilation)&lt;/li&gt;
&lt;li&gt;Censoring effects prevented the use of naive statistics (20 days into
the epidemic outbreak, most hospital stays are short simply because
patients have entered the hospitals recently)&lt;/li&gt;
&lt;li&gt;A lot of information was present as unnormalized text, sometimes in
long hand-written notes, full of acronyms and errors due to character
recognition.&lt;/li&gt;
&lt;li&gt;The data were of course often a consequence of treatment policy (the
choices of the medical staff in terms of patient handling and
measures), and hence not directly interpretable in causal or
interventional terms.&lt;/li&gt;
&lt;/ul&gt;
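&lt;p&gt;The censoring point is worth a small worked example. Below is a toy
simulation (synthetic numbers, not the hospital data) of why naively
averaging the durations of stays observed during an outbreak
underestimates the true length of stay:&lt;/p&gt;

```python
import numpy as np

# Toy simulation (synthetic, not the hospital data): patients are
# admitted uniformly over a 30-day observation window; the true length
# of stay is exponential with a 14-day mean.  Stays still ongoing at
# day 30 are censored: we only observe the time elapsed so far.
rng = np.random.default_rng(0)
n = 10_000
entry_day = rng.uniform(0, 30, n)                  # admission date
true_stay = rng.exponential(14.0, n)               # true length of stay
observed = np.minimum(true_stay, 30 - entry_day)   # censored observation

naive_mean = observed.mean()   # biased downwards by the censoring
true_mean = true_stay.mean()   # close to the 14-day ground truth
```

&lt;p&gt;The naive mean comes out well below the true 14-day mean:
survival-analysis tools (for instance Kaplan-Meier estimates) are needed
to correct for such censoring.&lt;/p&gt;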
&lt;p&gt;These challenges were very interesting to me, as they related directly to
my research agenda of &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;facilitating the processing of “dirty data”&lt;/a&gt; (more on that below).&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Most of the work that we did was not oriented toward publication, but
rather to address urgent needs of the hospitals. Some scholarly
contributions did come out:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Part of the extracted data are consolidated worldwide for medical
studies (&lt;a class="reference external" href="https://www.nature.com/articles/s41746-020-00308-0"&gt;Brat et al, Nature Digital Medicine 2020&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;We used causal-inference methods to estimate the treatment effects of
HCQ with and without Azithromycin (&lt;a class="reference external" href="https://www.medrxiv.org/content/10.1101/2020.06.16.20132597v1"&gt;Sbidian et al, MedRxiv 2020&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The data are used in follow-up medical studies (eg associating
mortality and obesity, &lt;a class="reference external" href="https://onlinelibrary.wiley.com/doi/full/10.1002/oby.23014"&gt;Czernichow et al, Obesity 2020&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Biomedical entity recognition&lt;/strong&gt; A major AI difficulty in this work is
recognizing biomedical entities, such as conditions or treatments, in the
various texts. Coincidentally, we had been working on simplifying
state-of-the-art pipelines for biomedical entity linking. While this
research work was not used on the hospital data, because it was too
bleeding-edge, it led to an AAAI paper (&lt;a class="reference external" href="https://arxiv.org/abs/2012.08844"&gt;Chen et al, AAAI 2021&lt;/a&gt;) on a state-of-the-art model for
biomedical entity linking that is much more lightweight than current
approaches.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-for-dirty-data"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Machine learning for dirty data&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Machine learning methods that can robustly ingest non-curated data.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;Dirty Data project&lt;/a&gt;, that we
undertook a few years ago, is really bearing its fruits.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="supervised-learning-with-missing-values-beyond-imputation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Supervised learning with Missing values: beyond imputation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The classic view on processing data with missing values is to try to
&lt;em&gt;impute&lt;/em&gt; the missing values: replace them by probable values (or better,
compute the distribution of the unobserved values given the observed
ones). However, such an approach needs a model of the missing-values
mechanism; this is simple only when the values are missing at random.
We have been studying the alternative view, based on directly estimating
a predictive function to be applied to data with missing values.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/mnar_versus_mcar.png" style="width: 500px;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Missing-values mechanisms&lt;/strong&gt;: black dots are fully-observed data
points, while grey ones are partially observed. The left panel
displays a missing-at-random situation, where missingness is
independent of the underlying values. On the contrary, in a
missing-not-at-random situation (right panel), whether values are
observed or not depends on the underlying values (potentially
unobserved).&lt;/p&gt;
&lt;/div&gt;
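&lt;p&gt;The difference matters for statistics. A minimal numpy sketch
(synthetic data, not from the paper) of the two mechanisms:&lt;/p&gt;

```python
import numpy as np

# Toy illustration of the two missingness mechanisms (synthetic data).
rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 5_000)

# Missing completely at random: missingness independent of the values.
mcar_mask = rng.binomial(1, 0.3, x.size).astype(bool)

# Missing not at random: large values are preferentially unobserved.
mnar_mask = np.greater(x, 0.5)

mean_full = x.mean()
mean_mcar = x[~mcar_mask].mean()   # unbiased estimate of mean_full
mean_mnar = x[~mnar_mask].mean()   # biased: the large values are gone
```

&lt;p&gt;Under the missing-not-at-random mechanism, the observed mean is
systematically shifted, which no model oblivious to the mechanism can
correct.&lt;/p&gt;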
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v108/morvan20a.html"&gt;Le Morvan et al, AIStats 2020&lt;/a&gt; studied the
seemingly-simple case of a linear generative mechanism and showed that,
with missing values, the optimal predictor was a complex, piecewise
linear, function of the observed data concatenated with the
missing-values mask. This function can be implemented with a neural
network with ReLu activation functions, fed with data where missing
values are replaced by zeros and corresponding indicator features are
added.&lt;/p&gt;
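&lt;p&gt;This input encoding is easy to sketch in numpy (a hedged
illustration, not the paper's code):&lt;/p&gt;

```python
import numpy as np

def encode_with_mask(X):
    """Zero-impute NaNs and append the missingness mask as indicator
    features: the encoding that lets a ReLU network represent the
    piecewise-linear optimal predictor described above."""
    mask = np.isnan(X)
    X_imputed = np.where(mask, 0.0, X)
    return np.concatenate([X_imputed, mask.astype(float)], axis=1)

X = np.array([[1.0, np.nan],
              [np.nan, 3.0]])
X_enc = encode_with_mask(X)   # shape (2, 4): imputed values, then mask
```

&lt;p&gt;In scikit-learn, a SimpleImputer with strategy="constant",
fill_value=0 and add_indicator=True produces the same kind of
features.&lt;/p&gt;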
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To go one step further, we noticed that the optimal predictor uses the
correlation between features (&lt;em&gt;eg&lt;/em&gt; on fully-observed data) to compensate
for missing values.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/compensation_effects.jpeg" style="width: 700px;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Compensation effects&lt;/strong&gt;: The optimal predictor uses the correlation
between features to compensate when a value is missing.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="https://neurips.cc/virtual/2020/public/poster_42ae1544956fbe6e09242e6cd752444c.html"&gt;Le Morvan et al, NeurIPS 2020&lt;/a&gt;
devise a neural-network architecture that efficiently captures these
links across the features. Mathematically, it stems from seeking good
functional forms to approximate the expression of the optimal predictor,
that can be derived for various missing-values mechanisms. A non-trivial
result is that a simple functional form can approximate the optimal
predictor under very different mechanisms.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/neumiss_nb_parameters.jpeg" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Better parameter efficiency&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The resulting architecture needs far fewer parameters (depth or width)
than a fully-connected multi-layer perceptron to predict well in the
presence of missing values. This, in turn, leads to better performance
on limited data sizes.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-without-normalizing-entries"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Machine-learning without normalizing entries&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A challenge of data management is that the same information may be
represented in different ways, typically with different strings denoting
the same, or related entities. For instance, in the following table, the
&lt;em&gt;employee position title&lt;/em&gt; column contains such non-normalized
information:&lt;/p&gt;
&lt;blockquote&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="13%" /&gt;
&lt;col width="47%" /&gt;
&lt;col width="40%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Sex&lt;/th&gt;
&lt;th class="head"&gt;Employee Position Title&lt;/th&gt;
&lt;th class="head"&gt;Years of experience&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Master Police Officer&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Social Worker IV&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Police Officer III&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Police Aide&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Electrician I&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Bus Operator&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Bus Operator&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Social Worker III&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Library Assistant I&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Library Assistant I&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;Typos, or other morphological variants (such as varying abbreviations)
often make things worse. We found many instances of such challenges in
electronic health records.&lt;/p&gt;
&lt;p&gt;In a data-science analysis, such data has categorical meaning, but a
typical categorical-data representation (such as one-hot encoding) breaks
down: there are too many categories, and in machine learning, the test
set may come with new categories.&lt;/p&gt;
&lt;p&gt;The standard practice is to curate the data: represent the information in
a normalized way, without morphological variants, separating the
various bits of information (for instance the type of job from the rank).
This typically requires a lot of human labor.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/gamma_poisson_encoding.png" style="width: 600px;" /&gt;
&lt;p class="caption"&gt;The original categories and their continuous representation on latent
categorical features inferred from the data.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="https://ieeexplore.ieee.org/abstract/document/9086128"&gt;Cerda &amp;amp; Varoquaux, TKDE 2020&lt;/a&gt; give two
efficient approaches to encode such data for statistical analysis
capturing string similarities. The most interpretable of these approaches
represents the data by continuous encoding on latent categories inferred
automatically from recurrent substrings.&lt;/p&gt;
&lt;p&gt;This research is implemented in the &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub&lt;/a&gt;
Python library (originally called dirty-cat), which is making rapid
progress.&lt;/p&gt;
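&lt;p&gt;To give a feel for the underlying idea, here is a minimal sketch of
character n-gram similarity, the ingredient that lets such encoders see
morphological variants as close (a simplified stand-in, not skrub's
actual implementation):&lt;/p&gt;

```python
def ngrams(s, n=3):
    """Set of character n-grams of a string, with whitespace padding."""
    s = ' ' + s.lower() + ' '
    return set(s[i:i + n] for i in range(len(s) - n + 1))

def similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga.intersection(gb)) / len(ga.union(gb))

# Morphological variants of the same job come out as similar, whereas
# one-hot encoding would treat them as orthogonal categories:
s_close = similarity('Police Officer III', 'Master Police Officer')
s_far = similarity('Police Officer III', 'Social Worker IV')
```

&lt;p&gt;Building features from such similarities, rather than from exact
category matches, gives representations that degrade gracefully with
typos and unseen categories.&lt;/p&gt;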
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="making-sense-of-brain-functional-signals"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Making sense of brain functional signals&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Turning brain-imaging signal into insights&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Brain imaging, and in particular functional brain imaging, is amazing,
because it gives a window on brain function, whether it is to understand
cognition, behavior, or pathologies. One challenge that I have been
interested in, across the years, is how to give systematic sense to these
signals, in a broader perspective than a given study.&lt;/p&gt;
&lt;div class="section" id="neuroquery-brain-mapping-any-neuroscience-query"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;NeuroQuery: brain mapping any neuroscience query&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Systematically linking mental processes and disorders to brain structures
is a very difficult task because of the huge diversity of behavior.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://elifesciences.org/articles/53385"&gt;Dockes et al, elife 2020&lt;/a&gt; we used text mining on a
large number of brain-imaging publications to predict where in the brain
a given subject of study (in neuroscience, behavior, and related
pathologies) would report findings.&lt;/p&gt;
&lt;p&gt;With this model, we built a web application, &lt;a class="reference external" href="https://neuroquery.org"&gt;NeuroQuery&lt;/a&gt; in which the user can type a neuroscience
query, and get a brain map of where a study on the topic is likely to
report findings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-high-resolution-brain-functional-atlas"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;A high-resolution brain functional atlas&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Regions to summarize the fMRI signal&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Atlases of brain regions are convenient to summarize the information of
brain images, turning them into information easy to analyse. We have long
studied the specific case of functional brain atlases, extracting and
validating them from brain imaging data. &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811920306121"&gt;Dadi NeuroImage 2020&lt;/a&gt;
contributes a high-resolution brain functional atlas, DiFuMo. This atlas
can be browsed or downloaded &lt;a class="reference external" href="https://parietal-inria.github.io/DiFuMo/"&gt;online&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/difumo.jpg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;The functional regions, at dimension 512.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The atlas comes with various resolutions, and all the structures that it
segments have been given meaningful names. In the paper, we showed that
using this atlas to extract functional signals led to better analyses for
a large number of problems compared to the atlases commonly used. We thus
recommend this atlas, for instance to extract Image-Derived Phenotypes in
population analyses, where the huge size of the data requires working on
summarized information.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/putamen_difumo.png" /&gt;
&lt;p class="caption"&gt;The region capturing the right hemisphere putamen.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="health"></category><category term="covid19"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>Survey of machine-learning experimental methods at NeurIPS2019 and ICLR2020</title><link href="https://gael-varoquaux.info/science/survey-of-machine-learning-experimental-methods-at-neurips2019-and-iclr2020.html" rel="alternate"></link><published>2020-01-22T00:00:00+01:00</published><updated>2020-01-22T00:00:00+01:00</updated><author><name>Xavier Bouthillier &amp; Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-01-22:/science/survey-of-machine-learning-experimental-methods-at-neurips2019-and-iclr2020.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;A simple survey asking authors of two leading machine-learning
conferences a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;How do machine-learning researchers run their empirical validation? In
the context of a push for improved reproducibility and benchmarking, this
question is important to develop new tools for model comparison. We …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;A simple survey asking authors of two leading machine-learning
conferences a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;How do machine-learning researchers run their empirical validation? In
the context of a push for improved reproducibility and benchmarking, this
question is important to develop new tools for model comparison. We ran a
simple survey asking authors of two leading conferences, NeurIPS 2019
and ICLR 2020, a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;p&gt;A &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;technical report on HAL&lt;/a&gt; summarizes our
findings. It gives a simple picture of how hyper-parameters are set, how
many baselines and datasets are included, or how seeds are used.
Below, we give a very short summary, but please read (and &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823v1/bibtex"&gt;cite&lt;/a&gt;)
&lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;the full report&lt;/a&gt; if you are interested.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;
The response rates were 35.6% for NeurIPS and 48.6%
for ICLR.
A vast majority of empirical works optimize model hyper-parameters,
though almost half of these use manual tuning, and most of the automatic
hyper-parameter optimization is done with grid search. The typical number
of hyper-parameters tuned is in the interval 3-5, and fewer than 50 model fits
are used to explore the search space. In addition, most works also
optimized their baselines (typically, around 4 baselines).
Finally, studies typically reported 4 results per model per task to provide a measure of variance, and around 50% of them
used a different random seed for each experiment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample results&lt;/strong&gt;&lt;/p&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/hyper_parameter_optimization.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;How many papers with experiments optimized hyperparameters.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/tuning_methods.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;What hyperparameter optimization method were used.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/number_datasets.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Number of different datasets used for benchmarking.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/number_seeds_or_trials.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Number of results reported for each model (ex: for different seeds)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;These are just samples. Read &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;the full report&lt;/a&gt; for
more results.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For reproducibility and AutoML, there is active research in benchmarking
and hyper-parameter procedures in machine learning. We hope that the
survey results presented here can help inform this research. As this
document is merely a research report, we purposely limited the
interpretation of the results and refrained from drawing recommendations. However, trends that stand out to our
eyes are: &lt;cite&gt;1)&lt;/cite&gt; the simplicity of hyper-parameter tuning strategies
(mostly manual search and grid search),  &lt;cite&gt;2)&lt;/cite&gt; the small number of
model fits explored during this tuning (often 50 or fewer), which biases the
results, and &lt;cite&gt;3)&lt;/cite&gt; the small number of performances reported, which limits
statistical power. These
practices are most likely due to the high computational cost of fitting
modern machine-learning models.&lt;/p&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Code&lt;/p&gt;
&lt;p class="last"&gt;The code used for plotting and analysis is &lt;a class="reference external" href="https://github.com/bouthilx/ml-survey-2020"&gt;on github&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt; We are deeply grateful to the participants of
the survey who took time to answer the questions.&lt;/p&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="benchmarking"></category><category term="conferences"></category><category term="experimental methods"></category></entry><entry><title>2019: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2019-my-scientific-year-in-review.html" rel="alternate"></link><published>2020-01-05T00:00:00+01:00</published><updated>2020-01-05T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-01-05:/science/2019-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;My current research spans wide: from brain sciences to core data
science. My overall interest is to build &lt;strong&gt;methodology drawing insights from
data&lt;/strong&gt; for questions that have often been addressed qualitatively. If I can
highlight a few publications from 2019 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;, the common thread would be
computational statistics, from dirty …&lt;/p&gt;</summary><content type="html">&lt;p&gt;My current research spans wide: from brain sciences to core data
science. My overall interest is to build &lt;strong&gt;methodology drawing insights from
data&lt;/strong&gt; for questions that have often been addressed qualitatively. If I can
highlight a few publications from 2019 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;, the common thread would be
computational statistics, from dirty data to brain images. Let me try to
give the gist of this progress, in simple terms.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It’s already 2020, I’m always late.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#comparing-distributions" id="toc-entry-1"&gt;Comparing distributions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#predictive-pipelines-on-brain-functional-connectomes" id="toc-entry-2"&gt;Predictive pipelines on brain functional connectomes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#population-shrinkage-of-covariance" id="toc-entry-3"&gt;Population shrinkage of covariance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#deep-learning-on-non-translation-invariant-images" id="toc-entry-4"&gt;Deep learning on non-translation-invariant images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#open-science" id="toc-entry-5"&gt;Open science&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="comparing-distributions"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Comparing distributions&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Fundamental computational-statistics work&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What if you are given two sets of observations and need to decide
whether they are drawn from the same distribution? We are interested in
this question for the &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;
research project, to facilitate analysis of data without manual curation.
Comparing distributions is indeed important to detect drifts in the data,
to match information across datasets, or to compensate for dataset
biases.&lt;/p&gt;
&lt;p&gt;Formally, we are given two clouds of points (circles and crosses in the
figure below) and we want to develop a statistical test of whether the
distributions differ. There is an abundant literature on this topic, which I
cover in &lt;a class="reference external" href="http://gael-varoquaux.info/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html"&gt;a more detailed post on this subject&lt;/a&gt;.
Specifically, when the observations have a natural similarity, for
instance when they live in a vector space, kernel methods are interesting
because they make it possible to estimate a representative of the underlying
distribution that interpolates between observations, as with &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;a kernel
density estimator&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing"&gt;&lt;img alt="" src="../science/attachments/comparing_distributions_l1/optimizing_position.png" style="width: 500px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Two cloud of points, the corresponding distribution representants μ_P
and μ_Q (blue and orange), the difference between these
(black), and locations to measure this difference (red triangles).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With Meyer Scetbon, in
&lt;a class="reference external" href="http://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing"&gt;Scetbon &amp;amp; Varoquaux, NeurIPS&lt;/a&gt;,
we investigate how best to measure the difference between these
representatives. We show that the best choice is to take the absolute value
of the difference (the l1 norm), while the default choice had so far been
the Euclidean (l2) norm. In a nutshell, the reason is that the difference
is most likely dense when the distributions differ: zero almost nowhere.&lt;/p&gt;
&lt;p&gt;We were able to show that the &lt;a class="reference external" href="https://slideslive.com/38921490/interpretable-comparison-of-distributions-and-models"&gt;sophisticated framework&lt;/a&gt;
for efficient and powerful tests in the
Euclidean case carries over to the l1 case. In particular, our paper
gives efficient testing procedures using a small number of locations to
avoid costly computation (the red triangles in the figure above), that
can either be sampled at random or optimized.&lt;/p&gt;
&lt;p&gt;My hunch is that the result is quite general: the l1 geometry is better
than the l2 one on representatives of distributions. There might be more
fundamental mathematical properties behind this. The drawback is that the
l1 norm is non-smooth, which can be challenging in optimization settings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="predictive-pipelines-on-brain-functional-connectomes"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Predictive pipelines on brain functional connectomes&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Brain-imaging methods&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Brain functional connectivity is increasingly used to extract biomarkers
of behavior and mental health. The long-term stakes are to ground
assessment of psychological traits on quantitative brain
data, rather than qualitative behavioral observations. But, to build
biomarkers, many details go into estimating functional
connectivity from fMRI, something that I have studied for more than 10
years. With Kamalakar Dadi, in &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811919301594"&gt;Dadi et al&lt;/a&gt;,
we ran thorough empirical benchmarks to find which methodological choices
for the various steps of the pipeline give best prediction across
multiple cohorts. Specifically, we studied 1) defining regions of
interest for signal extraction, 2) building a functional-connectivity
matrix across these regions, 3) prediction across subjects with
supervised learning on these features.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811919301594"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/dadi_2019_highlights.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Summarizing our benchmark results.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Recommendations&lt;/p&gt;
&lt;ul class="last simple"&gt;
&lt;li&gt;functional regions (eg from dictionary learning)&lt;/li&gt;
&lt;li&gt;tangent-space for covariances&lt;/li&gt;
&lt;li&gt;l2-logistic regression&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;Results show the importance of defining regions from functional data,
ideally with a linear-decomposition method that produces soft
parcellations such as ICA or dictionary learning. To represent
connectivity between regions, the best choice is tangent-space
parametrization, a method to build a vector-space from covariance
matrices (more below). Finally, for supervised learning, a simple
l2-penalized logistic regression is the best option. Given the huge popularity
of deep learning, it may come as a surprise that linear models are the best
performers, but this is well explained by the amount of data at hand: a
cohort typically comprises fewer than 1000 individuals, far below the
data sizes needed to see the benefits of non-linear models.&lt;/p&gt;
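&lt;p&gt;To make the last step concrete, here is a rough sketch of it with scikit-learn: an l2-penalized logistic regression evaluated with cross-validation. The features are random stand-ins for vectorized connectivity coefficients, with an artificial group difference injected; the shapes and signal are made up for illustration.&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Stand-in for vectorized functional-connectivity features:
# 200 subjects, 300 connectivity coefficients each.
X = rng.normal(size=(200, 300))
y = rng.integers(0, 2, size=200)
# Inject a group difference on a few connections, so there is signal to find.
X[y == 1, :10] += 1.0

# l2-penalized logistic regression: a strong default at these sample sizes.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

&lt;p&gt;In a real pipeline, X would come from steps 1 and 2: signals extracted on functional regions, then tangent-space connectivity coefficients.&lt;/p&gt;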
&lt;p&gt;A recent preprint, &lt;a class="reference external" href="https://www.biorxiv.org/content/10.1101/741595v2.abstract"&gt;Pervaiz et al&lt;/a&gt; from
Oxford, overall
confirms our findings, even though they investigated slightly
different methodological choices. In particular, they find tangent space
clearly useful.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In my eyes, such benchmarking studies are important not only to improve
prediction, but also to reduce analytic variability that opens the door
to inflation of reported effects. Indeed, given 1000 individuals, the
measure of prediction accuracy of a pipeline is quite imprecise
(&lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811917305311"&gt;Varoquaux 2018&lt;/a&gt;).
As a consequence, trying out a bunch of analytic choices and
publishing the one that works best can lead to grossly optimistic
prediction accuracies. &lt;strong&gt;If we want trust in biomarkers, we need to
reduce the variability in the methods used to build them&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="population-shrinkage-of-covariance"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Population shrinkage of covariance&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;Statistics for brain signals&lt;/p&gt;
&lt;p&gt;Estimating covariances is central for functional brain connectivity and
in many other applications. With Mehdi Rahim, in &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;Rahim et al&lt;/a&gt;
we considered the case of a population of random processes with
related covariances, as for instance when estimating functional
connectivity from a group of individuals. For this, we combined two
mathematical ideas: that of using natural operations on covariance
matrices, and that of priors for mean-square estimation:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Tangent space&lt;/strong&gt; Covariance matrices are positive-definite matrices,
for which standard arithmetic is not well suited &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;: subtracting
two covariance matrices can lead to a matrix that cannot be
the covariance of a signal. However, a group of covariance matrices can
be transformed into points in a vector space for which standard
distances and arithmetic respect the structure of
covariances (for instance, the Euclidean distance between these points
approximates the KL divergence between covariances). This is what we call
the &lt;em&gt;tangent space&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Technically, covariance matrices live on a Riemannian manifold:
a curved surface inside &lt;em&gt;R^{n x n}&lt;/em&gt; with specific metric
properties.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;James-Stein shrinkage&lt;/strong&gt; To estimate the mean of &lt;em&gt;n&lt;/em&gt; observations, it
is actually best not to compute their plain average, but rather to
push this average a bit toward a prior guess. The better the
guess, the more this “push” helps. The more observations there are,
the gentler this push should be. This strategy is known as
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator"&gt;James-Stein shrinkage&lt;/a&gt; and it
is in my opinion one of the most beautiful results in statistics.
It can be seen as a Bayesian posterior, but it comes with guarantees
that do not require the model to be true and that control estimation
error, rather than a posterior probability.&lt;/li&gt;
&lt;/ul&gt;
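&lt;p&gt;The shrinkage effect is easy to check numerically. Below is a toy NumPy simulation of the classic (positive-part) James-Stein estimator on Gaussian vectors, shrinking toward a prior guess of zero; the dimension and noise level are arbitrary choices for the demo.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
d, n_trials = 50, 2000
theta = rng.normal(size=d)   # unknown mean to estimate
sigma = 2.0

mse_mle, mse_js = 0.0, 0.0
for _ in range(n_trials):
    x = theta + sigma * rng.normal(size=d)   # one draw of N(theta, sigma^2 I)
    # James-Stein: shrink x toward the prior guess 0.
    shrink = 1.0 - (d - 2) * sigma**2 / np.sum(x**2)
    x_js = max(shrink, 0.0) * x              # positive-part variant
    mse_mle += np.sum((x - theta) ** 2)
    mse_js += np.sum((x_js - theta) ** 2)

# Ratio of estimation errors: strictly below 1 in dimension 3 or more.
print(mse_js / mse_mle)
```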
&lt;p&gt;James-Stein shrinkage is easily written for quadratic errors on vectors,
but cannot be easily applied to covariances, as they do not live in a vector
space and we would like to control a KL divergence rather than
a quadratic error. Our work combined both ideas to give an excellent
estimator of a family of related covariances that is also very
computationally efficient. We call it PoSCE: Population Shrinkage
Covariance Estimation.&lt;/p&gt;
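&lt;p&gt;A minimal NumPy/SciPy sketch of the two ingredients combined, not the actual PoSCE estimator: here the reference point is a plain Euclidean mean rather than a geometric mean, and the shrinkage is isotropic rather than learned from the population dispersion.&lt;/p&gt;

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm

rng = np.random.default_rng(0)

def random_spd(d):
    a = rng.normal(size=(d, d))
    return a @ a.T + d * np.eye(d)

d = 4
covs = [random_spd(d) for _ in range(20)]

# Reference point: Euclidean mean (a simple stand-in for the geometric mean).
ref = np.mean(covs, axis=0)
w = np.real(sqrtm(ref))
w_inv = np.linalg.inv(w)

# Project each covariance to the tangent space at the reference...
tangent = [np.real(logm(w_inv @ c @ w_inv)) for c in covs]
# ...shrink toward the group mean (isotropic here; PoSCE learns an
# anisotropic prior from the dispersion of the population)...
t_mean = np.mean(tangent, axis=0)
alpha = 0.5
shrunk = [alpha * t + (1 - alpha) * t_mean for t in tangent]
# ...and map back: the results are symmetric positive definite by
# construction, i.e. valid covariance matrices.
covs_shrunk = [w @ np.real(expm(t)) @ w for t in shrunk]

print(np.linalg.eigvalsh(covs_shrunk[0]).min())
```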
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/posce.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Schema of the estimation strategy: projecting the covariances matrices
into a tangent space, shrinkage to a group mean, but taking in account
the anisotropy of the dispersion of the group, and projecting back to
covariances.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is easy to see how accounting for group information in the estimation
of individual covariances can help stabilize them. However, will it be
beneficial if we are interested in the differences between these
covariances, for instance to ground biomarkers, as studied above? Our
results show that it does indeed help build better biomarkers, for
instance to predict brain age. The larger the group of covariances used,
the larger the benefits.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/posce_age_learning_curve.png" style="width: 500px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Error in predicting brain aging decreases when more individuals are used
to build the biomarker.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="deep-learning-on-non-translation-invariant-images"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Deep learning on non-translation-invariant images&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;Computer vision&lt;/p&gt;
&lt;p&gt;Brain images, in particular images of brain activity, are very different
from the natural images on which most computer-vision research focuses.
A central difference is that detecting activity in different parts of the
brain completely changes the meaning of this detection, while detecting a
cat in the left or the right of a picture on Facebook makes no
difference. This is important because many advances in computer vision,
such as convolutional neural networks, are built on the fact that natural
images are statistically translation invariant. By contrast, brain
images are realigned to a template before being analyzed.&lt;/p&gt;
&lt;p&gt;Convolutional architectures have been crucial to the successes of deep
learning on natural images because they impose a lot of structure on the
weights of neural networks and thus help fight estimation noise. For
predicting from brain images, the regularization strategies that have
been successful foster spatially continuous structures. Unfortunately,
they lead to costly non-smooth optimizations that cannot easily be
used with the optimization framework of deep learning, stochastic
gradient descent.&lt;/p&gt;
&lt;p&gt;With Sergul Aydore, in &lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;Aydore et al, ICML&lt;/a&gt;, we have introduced a
spatial regularization that is compatible with the deep learning toolbox.
During the stochastic optimization, we impose random spatial structure
via feature groups estimated from the data. These stabilize the input
layers of deep architectures. They also lead to iterating on smaller
representations, which greatly speeds up the algorithm.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/stochastic_grouping_mlp.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;At each step of a stochastic gradient descent, we randomly pick a
feature-grouping matrix (itself estimated from the data), and use it
to reduce the data in the computations of the gradients, then invert
this reduction to update the weights.&lt;/p&gt;
&lt;/div&gt;
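&lt;p&gt;The mechanics can be sketched in a few lines of NumPy for a plain linear model. This is a simplification: the paper estimates the feature groupings from the data (eg by clustering), whereas here they are drawn uniformly at random.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 64, 1000, 50   # samples, features, number of feature groups

X = rng.normal(size=(n, p))
y = rng.normal(size=n)
w = np.zeros(p)           # weights of a linear model

for step in range(100):
    # Draw a random feature grouping: each feature assigned to one of k groups.
    groups = rng.integers(0, k, size=p)
    G = np.zeros((p, k))
    G[np.arange(p), groups] = 1.0
    G /= np.sqrt(np.maximum(G.sum(axis=0), 1.0))   # normalize each group

    # Reduce: iterate on k grouped features instead of p raw ones.
    X_red = X @ G
    w_red = G.T @ w
    grad_red = X_red.T @ (X_red @ w_red - y) / n
    # Expand the update back to the full weight vector.
    w -= 0.1 * (G @ grad_red)

print(w.shape)
```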
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;The paper&lt;/a&gt; comes with
extensive empirical validation, including comparison to convolutional
neural networks. We benchmark the strategy on brain images, but also
on realigned faces, to show that the approach is beneficial for any
non-translation-invariant images. In particular, the approach greatly
speeds up convergence.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/stochastic_grouping_results.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Prediction accuracy as a function of training time – left: on
realigned faces – right: on brain images&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;This paper&lt;/a&gt; clearly
shows that &lt;strong&gt;one should not use convolutional neural networks on fMRI
data&lt;/strong&gt;: these images are not translation invariant.&lt;/p&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Preprints&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;All papers are available as preprints, eg on &lt;a class="reference external" href="http://gael-varoquaux.info/publications.html"&gt;my site&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="open-science"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Open science&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Open and reproducible science:&lt;/strong&gt; Looking at all these publications, I
realize that every single one of them comes with code on a GitHub
repository and is done on open data, which means that they can all be
easily reproduced. I’m very proud of the teams behind these papers.
Achieving this level of reproducibility requires hard work and
discipline. It is also a testament to the community investment in
software tools and infrastructure for open science that has been going on
for decades and gives the foundations on which these works build.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;A prize for scikit-learn:&lt;/strong&gt; On this topic, a highlight of 2019 was also
that the work behind scikit-learn was acknowledged in &lt;a class="reference external" href="../programming/getting-a-big-scientific-prize-for-open-source-software.html"&gt;an important
scientific prize&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Why open science:&lt;/strong&gt; Why do I care so much for open science? Because in
a world of uncertainty, the claims of science must be trusted and hence
built on transparent practice (think about science and global warming).
Because it helps put our methods in the hands of a wider public,
and of society at large. And because it levels the playing field, making it easier for
newcomers –young scientists, or those from developing countries– to contribute,
which in itself makes science more efficient.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>Comparing distributions: Kernels estimate good representations, l1 distances give good tests</title><link href="https://gael-varoquaux.info/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html" rel="alternate"></link><published>2019-12-08T00:00:00+01:00</published><updated>2019-12-08T00:00:00+01:00</updated><author><name>Meyer Scetbon &amp; Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2019-12-08:/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;Given two sets of observations, are they drawn from the same
distribution? Our paper &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Comparing distributions: l1 geometry
improves kernel two-sample testing&lt;/a&gt;
at the &lt;strong&gt;NeurIPS 2019 conference&lt;/strong&gt; revisits this classic statistical
problem known as “two-sample testing”.&lt;/p&gt;
&lt;p class="last"&gt;This post explains the context and the paper with a bit of hand …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;Given two sets of observations, are they drawn from the same
distribution? Our paper &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Comparing distributions: l1 geometry
improves kernel two-sample testing&lt;/a&gt;
at the &lt;strong&gt;NeurIPS 2019 conference&lt;/strong&gt; revisits this classic statistical
problem known as “two-sample testing”.&lt;/p&gt;
&lt;p class="last"&gt;This post explains the context and the paper with a bit of hand
waiving.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-context-two-sample-testing" id="toc-entry-1"&gt;The context: two-sample testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#from-kernel-mean-embeddings-to-distances-on-distributions" id="toc-entry-2"&gt;From kernel mean embeddings to distances on distributions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#controlling-the-weak-convergence-of-probability-measures" id="toc-entry-3"&gt;Controlling the weak convergence of probability measures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#two-sample-testing-procedures" id="toc-entry-4"&gt;Two-sample testing procedures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-l1-metric-provides-best-testing-power" id="toc-entry-5"&gt;The L1 metric provides best testing power&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="the-context-two-sample-testing"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;The context: two-sample testing&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Given two samples from two unknown populations, the goal of two-sample tests is
to determine whether the underlying populations differ with a statistical
significance. For instance, we may care to know whether the
McDonald’s and KFC use different logic to chose locations of restaurants
across the US. This is a difficult question: we have access to data points,
but not the underlying generative mechanism, that is probably governed by
marketing strategies.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/comparing_distributions_l1/map_KFC_McDo_simple.png" style="width: 70%;" /&gt;
&lt;/div&gt;
&lt;div class="section" id="from-kernel-mean-embeddings-to-distances-on-distributions"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;From kernel mean embeddings to distances on distributions&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the example of the spatial distribution of restaurants,
there is &lt;strong&gt;a lot of information in how close observed data
points lie in the original measurement space (here geographic coordinates)&lt;/strong&gt;.
Kernel methods arise naturally to capture this information. They can be
applied to distributions, building representatives of distributions:
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_embedding_of_distributions"&gt;Kernel embeddings of distributions&lt;/a&gt;. The
mean embedding of a distribution P with a kernel k is written:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;) :  = &lt;span class="limits"&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator integral"&gt;∫&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;sub&gt;ℝ&lt;sup&gt;&lt;i&gt;d&lt;/i&gt;&lt;/sup&gt;&lt;/sub&gt;&lt;i&gt;k&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;, &lt;i&gt;t&lt;/i&gt;)&lt;i&gt;dP&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;)
&lt;/div&gt;
&lt;p&gt;Intuitively, it is related to &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;Kernel Density Estimates (KDEs)&lt;/a&gt; which
estimate a density in continuous space by smoothing the observed data
points with a kernel.&lt;/p&gt;
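&lt;p&gt;Empirically, the mean embedding evaluated at a point t is simply the average of kernel evaluations between t and the observed points. A small NumPy sketch with a Gaussian kernel (the bandwidth is an arbitrary choice for the demo):&lt;/p&gt;

```python
import numpy as np

def mean_embedding(X, t, bandwidth=1.0):
    """Empirical kernel mean embedding of a sample X evaluated at point t,
    with a Gaussian kernel: the average of k(x_i, t) over the sample."""
    sq_dists = np.sum((X - t) ** 2, axis=1)
    return np.mean(np.exp(-sq_dists / (2 * bandwidth ** 2)))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))   # sample from P
t_near = np.zeros(2)            # location in a dense region of the data
t_far = np.full(2, 5.0)         # location far from the data

# Large value near the data, value near zero far from it.
print(mean_embedding(X, t_near), mean_embedding(X, t_far))
```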
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/comparing_distributions_l1/kde.jpg" /&gt;
&lt;p class="caption"&gt;Kernel mean embeddings for two distributions of points&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;For two-sample testing, kernel embeddings can be used to compute distances
between distributions, building metrics over the space of probability
measures. Metrics between probability measures can be defined via the
notion of Integral Probability Metric (IPM): as a difference of
expectations:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;span class="text"&gt;IPM&lt;/span&gt;[&lt;i&gt;F&lt;/i&gt;, &lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;] :  = &lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;sup&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;f&lt;/i&gt; ∈ &lt;i&gt;F&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;(𝔼&lt;sub&gt;&lt;i&gt;x&lt;/i&gt; ∼ &lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;&lt;span class="stretchy"&gt;[&lt;/span&gt;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;)&lt;span class="stretchy"&gt;]&lt;/span&gt; − 𝔼&lt;sub&gt;&lt;i&gt;y&lt;/i&gt; ∼ &lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;&lt;span class="stretchy"&gt;[&lt;/span&gt;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;y&lt;/i&gt;)&lt;span class="stretchy"&gt;]&lt;/span&gt;)
&lt;/div&gt;
&lt;p&gt;where F is a class of functions. This definition is appealing because it
&lt;strong&gt;characterizes the difference between P and Q by the function for which
the expectation differs most&lt;/strong&gt;. The specific choice of function class
defines the metric. If we now consider a kernel, it implicitly defines a
space of functions (intuitively related to all the possible KDEs
generated by varying data points): a Reproducing Kernel Hilbert Space
(RKHS). Defining a metric (an IPM) with the function class F as the unit
ball in such an RKHS is known as the Maximum Mean Discrepancy (MMD). It
can be shown that, rather than computing the maximum, the MMD has a more
convenient expression, the RKHS distance between the mean embeddings:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;span class="text"&gt;MMD&lt;/span&gt;[&lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;] = ‖&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt; − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;‖&lt;sub&gt;&lt;i&gt;H&lt;/i&gt;&lt;sub&gt;&lt;i&gt;k&lt;/i&gt;&lt;/sub&gt;&lt;/sub&gt;
&lt;/div&gt;
&lt;p&gt;For good choices of kernels, the MMD has appealing mathematical
properties to compare distributions. With kernels said to be
characteristic, eg Gaussian kernels, the MMD is a metric: MMD[P, Q] = 0
if and only if P = Q. Using the MMD for two-sample testing –given only
observations from the distributions, and not P and Q–  requires using an
empirical estimation of the MMD. This can be done by computing the RKHS
norm in the expression above, which leads to summing kernel evaluations
on all data points in P and Q.&lt;/p&gt;
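&lt;p&gt;To make this concrete, here is a sketch of the (biased) empirical MMD with a Gaussian kernel, computed as just described by summing kernel evaluations over all pairs of data points; it is not an optimized or unbiased implementation:&lt;/p&gt;

```python
import numpy as np

def mmd_biased(X, Y, bandwidth=1.0):
    """Biased empirical MMD with a Gaussian kernel: the RKHS distance between
    the two empirical mean embeddings, via sums of kernel evaluations."""
    def k(A, B):
        sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * bandwidth ** 2))
    val = k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
    return np.sqrt(max(val, 0.0))   # guard against tiny negative rounding

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
Y_same = rng.normal(size=(300, 2))         # same distribution as X
Y_shift = rng.normal(size=(300, 2)) + 1.0  # shifted distribution

# Small value for matching distributions, larger value for the shifted one.
print(mmd_biased(X, Y_same), mmd_biased(X, Y_shift))
```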
&lt;p&gt;Our work builds upon this framework, but deviates a bit from the
classical definition of MMD as it addresses the question of which norm is
best to use on the difference of mean embeddings, µQ - µP (as well as
other representatives, namely the smooth characteristic function, SCF).
We consider a wider family of metrics based on the Lp distances between
mean embeddings (p=2 recovers the classic framework):&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;d&lt;/i&gt;&lt;sub&gt;&lt;i&gt;L&lt;/i&gt;&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;, &lt;i&gt;μ&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;) :  = &lt;span class="stretchy"&gt;(&lt;/span&gt;&lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator integral"&gt;∫&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;t&lt;/i&gt; ∈ ℝ&lt;sup&gt;&lt;i&gt;d&lt;/i&gt;&lt;/sup&gt;&lt;/sub&gt;&lt;/span&gt;|&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;) − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;)|&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;d&lt;/i&gt;Γ(&lt;i&gt;t&lt;/i&gt;)&lt;span class="stretchy"&gt;)&lt;/span&gt;&lt;sup&gt;1 ⁄ &lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;
&lt;/div&gt;
&lt;p&gt;where Γ is an absolutely continuous Borel probability measure.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="controlling-the-weak-convergence-of-probability-measures"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Controlling the weak convergence of probability measures&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We show that these metrics have good properties. Specifically, for p ≥ 1,
as soon as the kernel is bounded, continuous, and characteristic, these
metrics metrize weak convergence: the distance between a sequence of
distributions and a target distribution tends to zero if and only if the
sequence converges weakly to that target.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Convergence_of_measures#Weak_convergence_of_measures"&gt;weak convergence of probability measures&lt;/a&gt;
is a notion of convergence that is based &lt;strong&gt;not just on having events with
probabilities that are the same for the two distributions, but also that some events are
“close”&lt;/strong&gt;. Indeed, classic convergence in probability just tells us that
the same observation should have the same probability under the two
distributions. Weak convergence takes into account the topology of the
observations. For instance, to go back to the problem of spatial
distributions of restaurants, it does not only look at whether the
probabilities of having a McDonald’s or a KFC restaurant converge on
11th Wall Street, but also at whether restaurants are likely nearby, on 9th Wall Street.&lt;/p&gt;
&lt;p&gt;A simple example to see why this matters is to consider two Dirac
distributions: spikes in a single point. If we bring these spikes closer
and closer, merely looking at the probability of events in the same exact
position will not detect any convergence until the spikes exactly
overlap.&lt;/p&gt;
&lt;p&gt;Using kernel embeddings of distributions makes it possible to capture
convergence in the spatial domain, because the kernels used give
spatial smoothness to the representatives:&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/comparing_distributions_l1/converging_diracs.png" style="width: 70%;" /&gt;
&lt;p&gt;Having a metric on probability distributions that captures the topology
of the observations is important for many applications, for instance when
fitting GANs to generate images: the goal is not only to capture whether
images are exactly the same, but also whether they are “close”.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="two-sample-testing-procedures"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Two-sample testing procedures&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now that we have built metrics, we can derive two-sample test statistics.
A straightforward way of doing so would involve large sums over all the
observations, which would be costly. Hence, we resort to a good
approximation by sampling a set of {Tj} locations from the distribution
Γ:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;d̂&lt;/i&gt;&lt;span class="scripts"&gt;&lt;sup class="script"&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;&lt;sub class="script"&gt;&lt;i&gt;ℓ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sub&gt;, &lt;i&gt;μ&lt;/i&gt;, &lt;i&gt;J&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;[&lt;i&gt;X&lt;/i&gt;, &lt;i&gt;Y&lt;/i&gt;] :  = &lt;i&gt;n&lt;/i&gt;&lt;sup&gt;&lt;i&gt;p&lt;/i&gt; ⁄ 2&lt;/sup&gt;&lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;j&lt;/i&gt; = 1..&lt;i&gt;J&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;|&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;T&lt;/i&gt;&lt;sub&gt;&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;) − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Y&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;T&lt;/i&gt;&lt;sub&gt;&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;)|&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;
&lt;/div&gt;
&lt;p&gt;We show that this approximation maintains (almost surely) the appealing
metric properties, generalizing the results that were established by
&lt;a class="reference external" href="http://papers.nips.cc/paper/5685-fast-two-sample-testing-with-analytic-representations-of-probability-measures"&gt;Chwialkowski et al 2015&lt;/a&gt;
for the special case of the L2 metric.&lt;/p&gt;
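&lt;p&gt;In code, the sampled-locations approximation is cheap to evaluate. Below is a minimal numpy sketch, not the code from the paper: the function names, the Gaussian-kernel bandwidth, and the distribution used to draw the locations are all illustrative assumptions.&lt;/p&gt;

```python
import numpy as np

def mean_embedding(samples, locations, bandwidth=1.0):
    # empirical mean kernel embedding: mu(T_j) = (1/n) sum_i k(x_i, T_j),
    # with a Gaussian kernel k(x, t) = exp(-||x - t||^2 / (2 * bandwidth^2))
    sq_dists = ((locations[:, None, :] - samples[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2)).mean(axis=1)

def lp_statistic(X, Y, locations, p=1, bandwidth=1.0):
    # d_hat = n^(p/2) * sum_j |mu_X(T_j) - mu_Y(T_j)|^p
    diff = (mean_embedding(X, locations, bandwidth)
            - mean_embedding(Y, locations, bandwidth))
    return len(X) ** (p / 2.0) * np.sum(np.abs(diff) ** p)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(1000, 1))
X2 = rng.normal(0.0, 1.0, size=(1000, 1))  # same distribution as X
Y = rng.normal(1.0, 1.0, size=(1000, 1))   # shifted distribution
T = rng.normal(0.0, 1.5, size=(10, 1))     # J = 10 locations drawn from Γ
stat_null = lp_statistic(X, X2, T, p=1)    # small: same distribution
stat_alt = lp_statistic(X, Y, T, p=1)      # much larger: P differs from Q
```

The cost is only O(nJ) per distribution, instead of the O(n²) sums over all pairs of observations.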
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/comparing_distributions_l1/optimizing_position.png" style="width: 70%;" /&gt;
&lt;p class="caption"&gt;Sampling at different positions&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;We further develop the testing procedures by showing that other tricks
known to improve testing with the L2 metric can be adapted to other
metrics, such as the L1 metric. Fast and performant tests can be obtained
by optimizing the test locations –using an upper-bound on the test power–
or by testing in the Fourier domain, using the Smooth Characteristic
Function of the kernel. Even in the case of the L1 metric, the null
distribution of the test statistic can be derived, leading to tests that
can control errors without permutations.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-l1-metric-provides-best-testing-power"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;The L1 metric provides best testing power&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Going back to our question of which norm on the difference of
distribution representatives is best suited to detect discrepancies, we show
that when using analytic kernels, such as the Gaussian kernel, the L1 metric
improves upon the L2 metric, which corresponds to the classic definition
of the MMD.&lt;/p&gt;
&lt;p&gt;Indeed, analytic kernels are non-zero almost everywhere. As a result,
when P is different from Q, the difference between their mean embeddings
will be dense, as well as the differences between the representatives
that we use to build our tests (for instance the values at the locations
that we use to build the tests above). l1 norms capture dense
differences better than l2 norms –this is the reason why, used as
penalties, they induce sparsity.&lt;/p&gt;
&lt;img alt="" class="align-right" src="attachments/comparing_distributions_l1/l1_vs_l2.png" style="width: 150px;" /&gt;
&lt;p&gt;A simple intuition is that dense vectors tend to lie along the diagonals of
the measurement basis, as none of their coordinates are zero. On these
diagonals, at equal l2 norm, the l1 norm is much larger than the l1 norm
of vectors with some zero, or nearly-zero, coordinates.&lt;/p&gt;
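&lt;p&gt;This intuition is easy to check numerically: at equal l2 norm, a fully dense vector has a much larger l1 norm than a sparse one (a small illustrative sketch, with an arbitrary dimension of 100):&lt;/p&gt;

```python
import numpy as np

d = 100
dense = np.ones(d) / np.sqrt(d)    # all coordinates equal, unit l2 norm
sparse = np.zeros(d)
sparse[0] = 1.0                    # one nonzero coordinate, unit l2 norm

# both vectors have the same l2 norm ...
l2_dense = np.linalg.norm(dense, 2)     # 1.0
l2_sparse = np.linalg.norm(sparse, 2)   # 1.0
# ... but the l1 norm of the dense vector is sqrt(d) times larger
l1_dense = np.abs(dense).sum()          # sqrt(100) = 10.0
l1_sparse = np.abs(sparse).sum()        # 1.0
```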
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For a very simple summary, the story is that: to test whether two
distributions differ, it is useful to compute a “mean kernel
embedding” –similar to a kernel density estimate, but without
normalization– of each distribution, and consider the l1 norm of the
difference of these embeddings. The embeddings can be evaluated at a small
number of locations, either drawn at random or optimized. This approach is
reminiscent of looking at the total variation between the measures;
however, the fact that it uses kernels makes it robust to small spatial
noise in the observations, unlike the total variation, for which events
must perfectly coincide in both sets of observations (the total
variation does not metrize weak convergence).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The framework exposed here is one that was developed over a long line
of research, which our work builds upon. &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Our paper&lt;/a&gt;
gives a complete list of references, however, some useful review
papers are&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;C.-J. Simon-Gabriel and B. Schölkopf, &lt;em&gt;Kernel distribution
embeddings: Universal kernels, characteristic kernels and kernel
metrics on distributions&lt;/em&gt;, &lt;a class="reference external" href="https://arxiv.org/abs/1604.05251"&gt;arXiv:1604.05251&lt;/a&gt;, 2016.&lt;/li&gt;
&lt;li&gt;A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, A. Smola, &lt;em&gt;A
Kernel Two-Sample Test&lt;/em&gt;, &lt;a class="reference external" href="http://www.jmlr.org/papers/v13/gretton12a.html"&gt;JMLR, 2012&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://slideslive.com/38921490/interpretable-comparison-of-distributions-and-models"&gt;The NeurIPS 2019 tutorial&lt;/a&gt;,
by Gretton, Sutherland, and Jitkrittum, is extremely didactic and gives
a good big-picture view.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;·&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="two-sample testing"></category><category term="conferences"></category><category term="statistics"></category></entry><entry><title>2018: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2018-my-scientific-year-in-review.html" rel="alternate"></link><published>2019-01-03T00:00:00+01:00</published><updated>2019-01-03T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2019-01-03:/science/2018-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;From a scientific perspective, 2018 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; was once again extremely exciting
thanks to awesome collaborators (at &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Inria&lt;/a&gt;, with &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;, and our &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;local scikit-learn team&lt;/a&gt;).
Rather than going over everything that we did in 2018, I would like to
give a few highlights: We published major work using &lt;strong&gt;machine learning to …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;From a scientific perspective, 2018 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; was once again extremely exciting
thanks to awesome collaborators (at &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Inria&lt;/a&gt;, with &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;, and our &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;local scikit-learn team&lt;/a&gt;).
Rather than going over everything that we did in 2018, I would like to
give a few highlights: we published major work using &lt;strong&gt;machine learning to
map cognition in the brain&lt;/strong&gt;; we started a new research project on &lt;strong&gt;analysis
of non-curated data&lt;/strong&gt; (addressing all of data science, beyond brain
imaging); and we worked a lot on &lt;strong&gt;growing scikit-learn&lt;/strong&gt;.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It’s already 2019, I am indeed late in posting this summary.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#cognitive-brain-mapping" id="toc-entry-1"&gt;Cognitive brain mapping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#data-science-without-data-cleaning" id="toc-entry-2"&gt;Data science without data cleaning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#scikit-learn-growth-and-consolidation" id="toc-entry-3"&gt;Scikit-learn: growth and consolidation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="cognitive-brain-mapping"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Cognitive brain mapping&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We have been exploring &lt;strong&gt;how predictive models can help mapping cognition
in the human brain&lt;/strong&gt;. In 2018, these long-running efforts led to important
publications.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="atlases-of-cognition-with-large-scale-human-brain-mapping"&gt;
&lt;h3&gt;Atlases of cognition with large-scale human brain mapping&lt;/h3&gt;
&lt;p&gt;More than 6 years ago, with my student Yannick Schwartz, we started
working on &lt;strong&gt;compiling an atlas of cognition across many cognitive
neuroimaging studies&lt;/strong&gt;. This turned out to be quite challenging for several
reasons:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Formalizing the links between mental processes&lt;/strong&gt; studied across the
literature is challenging. Strictly speaking, every paper studies a
different mental process. However, to build an atlas of cognition, we
are interested in finding commonalities across the literature.&lt;/li&gt;
&lt;li&gt;While cognitive studies tend to target a specific mental function,
the psychological manipulations that they use also recruit many other
processes. For instance, a memory study might use a &lt;em&gt;visual n-back&lt;/em&gt;
task, and hence recruit the visual cortex. The problem is more than an
experimental inconvenience: &lt;strong&gt;varying details of an experiment may
trigger different cognitive processes&lt;/strong&gt;. For instance, there are common
and separate pathways for visual word recognition and auditory word
recognition.&lt;/li&gt;
&lt;li&gt;Simply &lt;strong&gt;detecting regions that are recruited in a given mental operation
leads to selecting the whole cortex&lt;/strong&gt; with enough statistical power. Indeed
tasks are never fully balanced; reading might for instance require more
attention than listening.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These challenges are related on the one hand to the problem of &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1364661305003360"&gt;reverse
inference&lt;/a&gt;
&lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;, and on the other hand to that of mental-process decomposition, or
cognitive subtraction, both central to cognitive neuroimaging. They also
call for formal knowledge representation, &lt;em&gt;eg&lt;/em&gt; by building ontologies,
which is a task harder than it might seem at first glance.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;In essence, the reverse inference problem arises because in a
cognitive brain imaging the observed brain activity is a consequence
of the behavior, and not a cause. While a conclusion that activity in
a brain structure causes a certain behavior is desirable, it is not
directly supported by a cognition neuroimaging experiment.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In our work &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;[Varoquaux et al, PLOS 2018]&lt;/a&gt;,
we tackled these challenges to build atlases of cognition as follows:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;We assigned to each brain-activity image labels describing the
&lt;em&gt;multiple&lt;/em&gt; mental processes related to the experimental manipulation&lt;/li&gt;
&lt;li&gt;We used decoding –&lt;em&gt;ie&lt;/em&gt; prediction of the cognitive labels from the brain
activity– to ground a principled &lt;em&gt;reverse inference&lt;/em&gt; interpretation:
the regions selected indeed imply the corresponding behavior.&lt;/li&gt;
&lt;li&gt;Regions in the atlas were built of brain structures that both implied
the corresponding cognition, and were triggered by it (conditional and
marginal link), to ground a strong selectivity:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;&lt;img alt="" src="attachments/2018_highlights/mapping_types.png" style="width: 700px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;We applied these techniques to the data from 30 different studies,
resulting in a detailed breakdown of the cortex into functionally-specialized
modules:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;&lt;img alt="" src="attachments/2018_highlights/cognitive_regions.png" style="width: 700px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;Importantly, the validity of this decomposition in regions is established
by the ability of these regions to predict the cognitive aspects of new
experimental paradigms.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="predictive-models-avoid-excessive-reductionism-in-cognitive-neuroimaging"&gt;
&lt;h3&gt;Predictive models avoid excessive reductionism in cognitive neuroimaging&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2018_highlights/decoding.png" style="width: 400px;" /&gt;
&lt;/div&gt;
&lt;p&gt;While machine learning is generally seen as an engineering tool to build
predictive models or automate tasks, I see in it a central method of
modern science. Indeed, it can distill &lt;strong&gt;evidence that generalizes&lt;/strong&gt; from
vast –high dimensional– and ill-structured experimental data. Beyond
prediction, it can guide understanding.&lt;/p&gt;
&lt;p&gt;With Russ Poldrack, we wrote an opinion paper &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-01856412/"&gt;[Varoquaux &amp;amp; Poldrack,
Curr Opinion Neurobio 2019]&lt;/a&gt; that details why
predictive models are important tools to building wider theories of brain
function. It reviews much exciting progress in uncovering, with
predictive models, how brain mechanisms support the mind. It makes the
point that &lt;strong&gt;the ability to generalize is a fundamentally desirable property
of scientific inference&lt;/strong&gt;. Models that are grounded on explicit
generalization give a solid path to build broad theories of the mind.
Particularly interesting is generalization to significantly different
settings, &lt;em&gt;ie&lt;/em&gt; going further than typical cross-validation experiments of
machine learning, where identical data are artificially split.&lt;/p&gt;
&lt;p&gt;Something that is dear to my heart is that we are aiming for
&lt;strong&gt;quantitative generalization&lt;/strong&gt;, while psychology often contents itself
with qualitative generalization.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="individual-brain-charting-a-high-resolution-fmri-dataset-for-cognitive-mapping"&gt;
&lt;h3&gt;Individual Brain Charting, a high-resolution fMRI dataset for cognitive mapping&lt;/h3&gt;
&lt;p&gt;We are convinced about the importance of analyzing brain response across
multiple paradigms, to build models of brain function that generalize
across these paradigms. However, addressing such a research program by
aggregating multiple studies is hindered by data heterogeneity, due to
inter-individual differences or to differing scanners.&lt;/p&gt;
&lt;p&gt;Hence, my team, &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Parietal&lt;/a&gt;, has
undertaken a major data acquisition, the &lt;a class="reference external" href="https://project.inria.fr/IBC"&gt;Individual Brain Charting
project&lt;/a&gt;: &lt;strong&gt;scanning a few individuals
under a large number of cognitive tasks&lt;/strong&gt;. The data acquisition will last
for many years, as the individuals come back to the lab for new
acquisitions. The images are of excellent quality, thanks to the unique
expertise of our scanning site, Neurospin, a brain-imaging research
facility.&lt;/p&gt;
&lt;p&gt;The data are completely &lt;strong&gt;openly accessible&lt;/strong&gt;: the raw data, preprocessed
data, and statistical outputs, along with the processing scripts. We are
releasing new data as the project moves forward. This year, we published
the data paper &lt;a class="reference external" href="https://www.nature.com/articles/sdata2018105"&gt;[Pinho et al, Scientific Data 2018]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Data accumulation in brain imaging&lt;/p&gt;
&lt;p&gt;We are living exciting times, as &lt;strong&gt;there are more and more large volumes
of shared brain imaging data&lt;/strong&gt;. &lt;a class="reference external" href="https://openfmri.org/"&gt;OpenfMRI&lt;/a&gt;
aggregates data in a consistent way across brain-imaging
studies. Large projects such as the Human Connectome Project, our
Individual Brain Charting project, or the UK BioBank, are designed
from the beginning to be shared. We are entering an era of
brain-image analysis on many terabytes of data, with tens of
thousands of subjects, compounding hundreds of different clinical or
cognitive conditions.&lt;/p&gt;
&lt;p&gt;Massive data accumulation opens exciting new scientific prospects,
and raises new engineering challenges. Some of these challenges are
to scale up neuroimaging data-processing practices, eg inter-subject
alignments at the scale of many thousands of subjects. Some of these
challenges are new to neuroimaging: &lt;strong&gt;when compounding hundreds of
sources of data into an analysis, the human cost of data
integration becomes a major roadblock&lt;/strong&gt;. As I have become convinced
that analysing more, and more diverse, data is an important way
forward, I have started working on data integration per se.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="data-science-without-data-cleaning"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Data science without data cleaning&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="a-new-personal-research-agenda-dirtydata"&gt;
&lt;h3&gt;A new personal research agenda: DirtyData&lt;/h3&gt;
&lt;p&gt;Challenges to integrating data in a statistical analysis are ubiquitous,
including in brain imaging. Data cleaning &lt;a class="reference external" href="https://www.kaggle.com/surveys/2017"&gt;is recognized&lt;/a&gt; as the number one time sink for
data scientists. When advising scikit-learn users, including very large
companies, I often find that the major roadblock is going from the raw
data sources to the data matrix that is input to scikit-learn.&lt;/p&gt;
&lt;p&gt;A year ago, I started a new research focus, around the &lt;a class="reference external" href="https://project.inria.fr/dirtydata"&gt;DirtyData project&lt;/a&gt;. We now have a team with multiple
exciting collaborations, and funding. Our goal is to &lt;strong&gt;facilitate
statistical analysis of non-curated data&lt;/strong&gt;. We hope to foster better
understanding of how powerful machine-learning models can cope with
imperfect, non homogeneous data. As we go, we will publish this
understanding, but also distribute code with new methods, and hopefully
influence common data-science practices and software. This is an exciting
adventure (and yes, &lt;strong&gt;we are hiring&lt;/strong&gt;; see our &lt;a class="reference external" href="https://project.inria.fr/dirtydata/job-offers"&gt;job offers&lt;/a&gt; or contact me).&lt;/p&gt;
&lt;p&gt;The topics are vast, at the intersection between database research and
statistics. In particular, it calls for integrating machine learning
with:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Knowledge representation&lt;/li&gt;
&lt;li&gt;Information retrieval&lt;/li&gt;
&lt;li&gt;Information extraction&lt;/li&gt;
&lt;li&gt;Statistics with missing data&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="similarity-encoding-analysis-with-non-normalized-string-categories"&gt;
&lt;h3&gt;Similarity encoding: analysis with non-normalized string categories&lt;/h3&gt;
&lt;p&gt;While the DirtyData project is young, we have already made progress on the
analysis of &lt;strong&gt;dirty categories, ie categorical data represented with
strings that lack curation&lt;/strong&gt;. These can have typos or other simple
morphological variants (&lt;em&gt;eg&lt;/em&gt; “patient” vs “patients”), or they can have
more structured and fundamental differences, &lt;em&gt;eg&lt;/em&gt; arising from the merge
of multiple data sources. This latter problem is well known in database
research, where it is seen as a &lt;em&gt;record linkage&lt;/em&gt; or &lt;em&gt;alignment&lt;/em&gt; problem.&lt;/p&gt;
&lt;p&gt;For statistical analysis, in particular machine learning, the problem
with these non-curated string categories is that they must be encoded to
numerical representations, and classic categorical encodings are not well
suited for them. For instance, one-hot encoding these high-cardinality
categories leads to very high-dimensional representations.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://hal.inria.fr/hal-01806175"&gt;Cerda et al (2018)&lt;/a&gt;, we
contribute a simple encoding approach, &lt;em&gt;similarity encoding&lt;/em&gt;, based on
interpolating one-hot encoding with string similarities between the
categories.&lt;/p&gt;
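&lt;p&gt;The idea can be sketched in a few lines of numpy: each string category is encoded by its similarity to every category of a reference vocabulary, so that an exact match recovers a one-hot row. This is only an illustrative sketch using a 3-gram Jaccard similarity and made-up category names; the paper studies several string similarities, and the dirty-cat package provides the actual implementation.&lt;/p&gt;

```python
import numpy as np

def char_ngrams(string, n=3):
    # character n-grams of a padded string (padding is an illustrative choice)
    padded = ' ' + string + ' '
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    # Jaccard similarity between the n-gram sets of two strings
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga.intersection(gb)) / len(ga.union(gb))

def similarity_encode(values, vocabulary):
    # one row per value, one column per vocabulary category;
    # identical strings get a 1, so clean data recovers one-hot encoding
    return np.array([[ngram_similarity(v, c) for c in vocabulary]
                     for v in values])

vocab = ['police officer', 'police aide', 'firefighter']
encoded = similarity_encode(['police officier', 'firefighter'], vocab)
# the typo 'police officier' stays close to the 'police officer' column,
# while the exact match 'firefighter' gets exactly 1.0 in its own column
```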
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://dirty-cat.github.io/stable/auto_examples/01_investigating_dirty_categories.html"&gt;&lt;img alt="" src="attachments/2018_highlights/investigating_dirty_categories.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://dirty-cat.github.io/stable/auto_examples/02_fit_predict_plot_employee_salaries.html"&gt;&lt;img alt="" src="attachments/2018_highlights/predict_employee_salaries.png" style="width: 230px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;We ran an extensive empirical study, and showed that &lt;strong&gt;similarity encoding
leads to better prediction accuracy without curation of the data&lt;/strong&gt;,
outperforming all the other approaches that we tried. The paper is purely
empirical, but stay tuned: a theoretical analysis of why this is the case
is coming soon.&lt;/p&gt;
&lt;p&gt;For the benefit of data scientists and researchers, we have released a
small Python package, &lt;a class="reference external" href="https://dirty-cat.github.io/stable/"&gt;dirty-cat&lt;/a&gt;,
for learning with dirty categories.&lt;/p&gt;
&lt;p&gt;This is just the beginning of the DirtyData project, more exciting work
is under way.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-growth-and-consolidation"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Scikit-learn: growth and consolidation&lt;/a&gt;&lt;/h2&gt;
&lt;img alt="" class="align-right" src="attachments/2018_highlights/scikit-learn-logo-notext.png" style="width: 150px;" /&gt;
&lt;p&gt;In 2018, a lot of my energy went to consolidating scikit-learn as a
project. Describing the work in detail is for another post. However, my
main efforts were around growing the team and working on sustainability.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;We established a &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;scikit-learn foundation at Inria&lt;/a&gt;, in which companies
partner with us to fund scikit-learn development. This took a lot of
effort to establish good partnerships and create the legal vessels.
Indeed, we want to make sure that the common effort is invested to make
scikit-learn better. For instance, working with Intel, who are engaged
in something of an arms race for computing speed, we improved our test suite,
and are slowly but surely learning how to improve our speed.&lt;/li&gt;
&lt;li&gt;A consequence of the foundation is that we are hiring to grow the team
(check out &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/people/"&gt;our open positions&lt;/a&gt;). In 2018, my own
team grew, with more excellent people working on scikit-learn, but also
&lt;a class="reference external" href="http://joblib.readthedocs.io/"&gt;joblib&lt;/a&gt;, and even contributing to
core Python and numpy to improve &lt;a class="reference external" href="https://github.com/python/cpython/pull/3895"&gt;parallel computing&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/numpy/numpy/pull/12133"&gt;pickling&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;As the scikit-learn community is growing, it seemed important to
formalize a bit more how decisions are made. To me, an important aspect
was laying out clearly that the project is still governed by the
community, and not partners or people paid by the foundation. We have a
draft of a &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/12878"&gt;governance document&lt;/a&gt;, that is
pretty much ready for merge. We also worked on a &lt;a class="reference external" href="https://scikit-learn.org/dev/roadmap.html"&gt;roadmap&lt;/a&gt;. It is a non-binding
document, but it still was an interesting exercise.&lt;/li&gt;
&lt;li&gt;Scikit-learn 0.20 was released, &lt;a class="reference external" href="https://scikit-learn.org/dev/whats_new.html"&gt;with many enhancements&lt;/a&gt;. And the 0.20 release
was followed by two minor releases, to make sure that our users got
robust code with backward compatibility.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We are busy finishing a few very interesting studies; next year will be
exciting! I hope that we will have much to say about population analysis
with brain imaging, which is an amazingly interesting subject.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="yearly report"></category></entry><entry><title>Our research in 2017: personal scientific highlights</title><link href="https://gael-varoquaux.info/science/our-research-in-2017-personal-scientific-highlights.html" rel="alternate"></link><published>2017-12-31T00:00:00+01:00</published><updated>2017-12-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2017-12-31:/science/our-research-in-2017-personal-scientific-highlights.html</id><summary type="html">&lt;p&gt;In my opinion the scientific highlights of 2017 for &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt; were on multivariate predictive
analysis for brain imaging: a brain decoder more efficient and faster
than alternatives, improved clinical predictions by jointly predicting
multiple traits of subjects, decoding based on the raw time-series of
brain activity, and a personnal …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In my opinion the scientific highlights of 2017 for &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt; were on multivariate predictive
analysis for brain imaging: a brain decoder more efficient and faster
than alternatives, improved clinical predictions by jointly predicting
multiple traits of subjects, decoding based on the raw time-series of
brain activity, and a personal concern with the small sample sizes we
use in predictive brain imaging…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-fast-and-stable-brain-decoder-using-ensembling-frem"&gt;
&lt;h2&gt;A fast and stable brain decoder using ensembling: FReM&lt;/h2&gt;
&lt;p&gt;We have been working for 10 years on methods for brain decoding:
predicting behavior from imaging. In particular, we developed
state-of-the-art decoders based on &lt;a class="reference external" href="http://ieeexplore.ieee.org/abstract/document/5711672/"&gt;total variation&lt;/a&gt;.
In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917308182"&gt;Hoyos-Idrobo et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/INRIA/hal-01615015v1"&gt;preprint&lt;/a&gt;)
we used a different technique based on ensembling: combining many fast
decoders. The resulting decoder, dubbed &lt;em&gt;FReM&lt;/em&gt;, predicts better, faster,
and with more stable maps than existing methods. Indeed, we have learned
that good prediction accuracy was not the only important feature of a
decoder.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/frem_benchmarks.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="brain-imaging-to-characterize-individuals-joint-prediction-of-multiple-traits"&gt;
&lt;h2&gt;Brain imaging to characterize individuals: joint prediction of multiple traits&lt;/h2&gt;
&lt;p&gt;In &lt;em&gt;population imaging&lt;/em&gt;, individual traits are linked to their brain
images. Predictive models ground the development of imaging biomarkers.
In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305438"&gt;Rahim et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01547524/"&gt;preprint&lt;/a&gt;), we showed that
accounting for multiple traits of the subjects when &lt;em&gt;learning&lt;/em&gt; the
biomarker gave better predictions of the individual traits. For
instance, knowing the MMSE (mini mental state examination) of subjects
in a reference population helps derive better markers of Alzheimer’s
disease, even for subjects of unknown MMSE. This is an important step to
including a more complete picture of individuals in imaging studies.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/multi_output_decoder.jpg" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="time-domain-decoding-for-fmri"&gt;
&lt;h2&gt;Time-domain decoding for fMRI&lt;/h2&gt;
&lt;p&gt;In studies of cognition with functional MRI, the standard practice to
decoding brain activity is to estimate a first-level model that teases
apart the different experimental trials. It results in maps of regions
of the brain that correlate with each trial. Decoding is then run on
these maps, with supervised learning. The limitation of this approach is
that the experiment has to be designed with a good time separation
between each trial.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/time_domain_decoding.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917306651"&gt;Loula et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01576641/"&gt;preprint&lt;/a&gt;) we designed a
&lt;em&gt;time-domain decoding&lt;/em&gt; scheme, that starts from the raw brain activity
time-series and predicts model time-courses of cognition. From these, it
can classify the type of each trial. Importantly, it works better than
traditional approaches when the trials are not well separated. It thus
opens the door to decoding in experiments that were so far too fast.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="cross-validation-failure-the-dangers-of-small-samples"&gt;
&lt;h2&gt;Cross-validation failure: the dangers of small samples&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/sample_size_distribution.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;I wrote &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305311"&gt;an opinion paper&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01545002/"&gt;preprint&lt;/a&gt;) on a problem of our
field that has been worrying me a lot: &lt;strong&gt;often, we do not have enough
samples to properly assess the predictive power in neuroimaging&lt;/strong&gt;.
Indeed, the typical predictive analysis in neuroimaging uses around 100 samples.&lt;/p&gt;
&lt;div style="clear: both"&gt;&lt;/div&gt;&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/binomial_cdf.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;The error on the measured prediction accuracy of a decoder is at best
given by a binomial distribution. With around 100 samples, this yields
confidence bounds of around ±7%. Analysis of neuroimaging studies reveals
even larger error bars.&lt;/p&gt;
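&lt;p&gt;As a back-of-the-envelope check (a normal approximation to the binomial; the bounds in the paper are computed more carefully), the width of the confidence interval on an accuracy shrinks only as the square root of the sample size:&lt;/p&gt;

```python
import math

def accuracy_ci_halfwidth(p, n, z=1.96):
    """Half-width of the 95% normal-approximation confidence
    interval on an accuracy p measured on n test samples."""
    return z * math.sqrt(p * (1 - p) / n)

# Quadrupling the precision requires 16 times more samples
for n in (100, 1000, 10000):
    print(n, round(accuracy_ci_halfwidth(0.75, n), 3))
```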
&lt;p&gt;Such error bars, large compared to the effect of interest, undermine
publications using or developing predictive models in neuroimaging.
Indeed, they couple with the publication incentives in two ways. First,
studies that by chance observe an effect are published, while the others
end up unaccounted for in a &lt;em&gt;file drawer&lt;/em&gt;. Second, minor
modifications to the data-processing strategy give large but meaningless
differences in the observed prediction accuracy. These &lt;em&gt;researcher
degrees of freedom&lt;/em&gt; can hardly be checked in a review process or a
statistical test. Methods research, trying to improve decoders, is
hindered by such error bars and should consider multiple datasets to
gauge progress. Clinical neuroimaging, for biomarkers, must increase
sample sizes and face heterogeneity.&lt;/p&gt;
&lt;p&gt;I believe that this is a major challenge for our field, and invite you to
read the paper if you are not convinced.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="convergence-proofs-for-last-year-s-blazing-fast-dictionary-learning"&gt;
&lt;h2&gt;Convergence proofs for last year’s blazing fast dictionary learning&lt;/h2&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/online_dict_learning.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://ieeexplore.ieee.org/abstract/document/8038072/"&gt;Mensch et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01431618/"&gt;preprint&lt;/a&gt;) is a long paper that
studies in detail our very fast dictionary learning algorithm, with
extensive experiments and convergence proofs. On huge matrices, such as
brain imaging data in population studies, hyperspectral imaging, or
recommender systems, it gives &lt;strong&gt;10-fold speedups for matrix factorization&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We are busy finishing a few very interesting studies. Stay posted, next
year will be exciting!&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="machine learning"></category><category term="yearly report"></category></entry><entry><title>Our research in 2016: personal scientific highlights</title><link href="https://gael-varoquaux.info/science/our-research-in-2016-personal-scientific-highlights.html" rel="alternate"></link><published>2016-12-31T00:00:00+01:00</published><updated>2016-12-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2016-12-31:/science/our-research-in-2016-personal-scientific-highlights.html</id><summary type="html">&lt;p&gt;Year 2016 has been productive for science in &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt;. Here are some personal highlights:
bridging artificial intelligence tools to human cognition,
markers of neuropsychiatric conditions from brain activity at rest,
algorithmic speedups for matrix factorization on huge datasets…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="artificial-intelligence-convolutional-networks-map-well-the-human-visual-system"&gt;
&lt;h2&gt;Artificial-intelligence convolutional networks map well the human visual system&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305481"&gt;Eickenberg et …&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Year 2016 has been productive for science in &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt;. Here are some personal highlights:
bridging artificial intelligence tools to human cognition,
markers of neuropsychiatric conditions from brain activity at rest,
algorithmic speedups for matrix factorization on huge datasets…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="artificial-intelligence-convolutional-networks-map-well-the-human-visual-system"&gt;
&lt;h2&gt;Artificial-intelligence convolutional networks map well the human visual system&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305481"&gt;Eickenberg et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01389809/document"&gt;preprint&lt;/a&gt;), showed that
convolutional networks –machine-learning tools developed in artificial
intelligence for image analysis– map well the human visual system. This
is interesting because it shows that cognitive vision and artificial
computer vision have evolved to similar architectures. It is not that
surprising, as they are both driven by the statistics of natural images.
From the point of view of inference in neuroscience, what I found really
interesting is that we demonstrated that our computational model of brain
activity generalizes across experimental paradigms. This is something new
to my knowledge.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="using-brain-activity-at-rest-to-predicting-autism-status-across-clinical-sites"&gt;
&lt;h2&gt;Using brain activity at rest to predict Autism status across clinical sites&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305924"&gt;Abraham et al&lt;/a&gt;
(&lt;a class="reference external" href="https://arxiv.org/pdf/1611.06066"&gt;preprint&lt;/a&gt;) used resting-state brain
activity to predict whether individuals were typical controls or
diagnosed with Autistic symptoms. The important aspect of this study
is that it was performed on a large data collection across many sites
that had not coordinated with each other during acquisition. Given that
prediction was successful across sites, the study shows the viability of
extracting predictive biomarkers across inhomogeneous multi-site data. I
think that it is an important result for the future of psychiatric
neuroimaging research. The paper also highlights the aspects of the
predictive pipeline that were important for this success.&lt;/p&gt;
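&lt;p&gt;The overall shape of such a connectome-based pipeline can be sketched as follows. This is a toy version on synthetic signals, not the tuned pipeline of the paper (which extracts regions from real fMRI, e.g. with nilearn): per-subject correlation matrices are vectorized and fed to a supervised classifier:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_subjects, n_regions, n_timepoints = 40, 10, 120

def connectome_features(time_series):
    """Vectorize one subject's functional-connectivity matrix."""
    corr = np.corrcoef(time_series)            # (regions, regions)
    iu = np.triu_indices_from(corr, k=1)
    return corr[iu]                            # upper triangle only

X = np.array([connectome_features(rng.randn(n_regions, n_timepoints))
              for _ in range(n_subjects)])
y = rng.randint(0, 2, n_subjects)              # toy diagnosis labels

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(X.shape)  # n_subjects rows, one connectivity value per region pair
```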
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="dictionary-learning-for-massive-matrix-factorization"&gt;
&lt;h2&gt;Dictionary Learning for Massive Matrix Factorization&lt;/h2&gt;
&lt;p&gt;On a pure machine-learning side, &lt;a class="reference external" href="http://jmlr.org/proceedings/papers/v48/mensch16.html"&gt;Mensch et al&lt;/a&gt; introduced a new
algorithm for matrix factorization that gives 10-fold speedups compared
to the state of the art on absolutely huge datasets (terabyte scale).
The key aspect is to combine online learning with random subsampling that
exploits redundancies in the data. For neuroimaging, this algorithmic
advance is needed to tackle larger and larger resting-state data. We
will use it to scale predictive models to epidemiologic cohorts. The
original paper was purely heuristic but &lt;a class="reference external" href="https://arxiv.org/pdf/1611.10041"&gt;later work&lt;/a&gt; comes with proofs and we will soon
be submitting a very rich journal paper about this class of algorithms.&lt;/p&gt;
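&lt;p&gt;The online, mini-batch flavor of dictionary learning is available in scikit-learn; the sketch below shows the streaming pattern on random data (the random-subsampling trick of Mensch et al goes a step further and is not part of this estimator):&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(0)
X = rng.randn(500, 64)  # stand-in for a tall data matrix

# Online dictionary learning: data is visited in mini-batches, so the
# memory footprint stays constant as the number of rows grows
dico = MiniBatchDictionaryLearning(n_components=10, batch_size=32,
                                   random_state=0)
codes = dico.fit_transform(X)
print(dico.components_.shape)  # the learned dictionary atoms
print(codes.shape)             # the code of each sample on the atoms
```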
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-guide-to-cross-validation-in-neuroimaging"&gt;
&lt;h2&gt;A guide to cross-validation in neuroimaging&lt;/h2&gt;
&lt;p&gt;We published &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S105381191630595X"&gt;a review on cross-validation for neuroimaging&lt;/a&gt;
(&lt;a class="reference external" href="https://arxiv.org/pdf/1606.05201"&gt;preprint&lt;/a&gt;). While this may sound
less leading edge than other of our work, cross-validation is central to
everything we do. Doing it right is important. We learned some
interesting tradeoffs while doing the experiments for the review. One of
them is that for predictive models that are quite stable, such as SVMs,
it may be preferable to use default hyper-parameters rather than to tune
them by cross-validation. This is because, with the small sample sizes
typical of neuroimaging, cross-validation is fairly noisy.&lt;/p&gt;
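&lt;p&gt;A minimal illustration of this trade-off on synthetic data (whether tuning helps will vary from run to run, which is precisely the point). Both strategies are scored with the same outer cross-validation; tuning happens inside each fold:&lt;/p&gt;

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Small-sample regime typical of neuroimaging decoding
X, y = make_classification(n_samples=100, n_features=50, random_state=0)

default_svc = SVC(kernel='linear')  # C left at its default value of 1
tuned_svc = GridSearchCV(SVC(kernel='linear'),
                         {'C': np.logspace(-3, 3, 7)}, cv=5)

# Nested cross-validation: the choice of C is refit inside every fold
for name, model in [('default', default_svc), ('tuned', tuned_svc)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean().round(3), '+/-', scores.std().round(3))
```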
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Though not in my team, &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916306103"&gt;Liem et al&lt;/a&gt;
(&lt;a class="reference external" href="http://www.biorxiv.org/content/biorxiv/early/2016/11/07/085506.full.pdf"&gt;preprint&lt;/a&gt;)
collaborated with us on a beautiful study showing multimodal prediction
of brain age from resting brain activity and brain anatomy. Interestingly,
they showed that the discrepancy between predicted and chronological age
captures cognitive impairment.&lt;/p&gt;
&lt;p&gt;We have many interesting things in the pipeline, but it will be for next
year. On an unrelated note, I’ve been doing more &lt;a class="reference external" href="http://www.flickriver.com/photos/gaelvaroquaux/popular-interesting/"&gt;art photography&lt;/a&gt;
on my free time in 2016.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="machine learning"></category><category term="yearly report"></category></entry><entry><title>Job offer: data crunching brain functional connectivity for biomarkers</title><link href="https://gael-varoquaux.info/science/job-offer-data-crunching-brain-functional-connectivity-for-biomarkers.html" rel="alternate"></link><published>2015-12-08T00:00:00+01:00</published><updated>2015-12-08T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-12-08:/science/job-offer-data-crunching-brain-functional-connectivity-for-biomarkers.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;My research group&lt;/a&gt; is looking to fill
a &lt;strong&gt;post-doc position on learning biomarkers from functional
connectivity&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="section" id="scientific-context"&gt;
&lt;h2&gt;Scientific context&lt;/h2&gt;
&lt;p&gt;The challenge is to use resting-state fMRI at the level of a population
to understand how intrinsic functional connectivity captures pathologies
and other cognitive phenotypes. Rest fMRI is a promising tool for …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;My research group&lt;/a&gt; is looking to fill
a &lt;strong&gt;post-doc position on learning biomarkers from functional
connectivity&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="section" id="scientific-context"&gt;
&lt;h2&gt;Scientific context&lt;/h2&gt;
&lt;p&gt;The challenge is to use resting-state fMRI at the level of a population
to understand how intrinsic functional connectivity captures pathologies
and other cognitive phenotypes. Rest fMRI is a promising tool for
large-scale population analysis of brain function as it is easy to
acquire and accumulate. Scans for thousands of subjects have already been
shared, and more are to come. However, the signatures of cognition in this
modality are weak. Extracting biomarkers is a challenging data-processing
and machine-learning problem. Addressing this challenge is the expertise of my
research group. Medical applications cover a wider range of brain
pathologies, for which diagnosis is challenging, such as autism or
Alzheimer’s disease.&lt;/p&gt;
&lt;p&gt;This project is a collaboration with the &lt;a class="reference external" href="http://www.childmind.org/"&gt;Child Mind Institute&lt;/a&gt;, experts on psychiatric disorders and
resting-state fMRI, as well as coordinators of the major data-sharing
initiatives for rest fMRI data (e.g. ABIDE).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="objectives-of-the-project"&gt;
&lt;h2&gt;Objectives of the project&lt;/h2&gt;
&lt;p&gt;The project hinges on processing of very large rest fMRI databases.
Important novelties of the project are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Building predictive models that can discriminate &lt;strong&gt;multiple
pathologies&lt;/strong&gt; in &lt;strong&gt;large inhomogeneous datasets&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Using and improving &lt;strong&gt;advanced connectomics&lt;/strong&gt; and
&lt;strong&gt;brain-parcellation&lt;/strong&gt; techniques in fMRI.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Expected results include the discovery of neurophenotypes for several
brain pathologies, as well as intrinsic brain structures, such as
functional parcellations or connectomes, that carry signatures of
cognition.&lt;/p&gt;
&lt;p&gt;The analysis framework is based on algorithmic tools developed in Python
(crucially, leveraging scikit-learn for predictive modeling).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="desired-profile"&gt;
&lt;h2&gt;Desired profile&lt;/h2&gt;
&lt;p&gt;We are looking for a post-doctoral fellow to hire in spring. The ideal
candidate would have some, but not all, of the following expertise and
interests:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Experience in advanced processing of fMRI&lt;/li&gt;
&lt;li&gt;General knowledge of brain structure and function&lt;/li&gt;
&lt;li&gt;Good communication skills to write high-impact neuroscience publications&lt;/li&gt;
&lt;li&gt;Good computing skills, in particular with Python. Cluster computing
experience is desired.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="a-great-research-environment"&gt;
&lt;h2&gt;A great research environment&lt;/h2&gt;
&lt;p&gt;The work environment is dynamic and exciting, using state-of-the-art
machine learning to answer challenging functional-neuroimaging questions.&lt;/p&gt;
&lt;p&gt;The post-doc will be employed by &lt;a class="reference external" href="http://www.inria.fr"&gt;INRIA&lt;/a&gt;, the leading
computing research institute in France. We are a team of computer
scientists specialized in image processing and statistical data analysis,
integrated in one of the top French brain research centers, &lt;a class="reference external" href="http://i2bm.cea.fr/dsv/i2bm/Pages/NeuroSpin.aspx"&gt;NeuroSpin&lt;/a&gt;, south of Paris. We
work mostly in Python. The team includes core contributors to the
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn project&lt;/a&gt;, for machine learning in
Python, and the &lt;a class="reference external" href="http://nilearn.github.io/"&gt;nilearn project&lt;/a&gt;, for
statistical learning in NeuroImaging.&lt;/p&gt;
&lt;p&gt;In addition, the post-doc will interact closely with researchers from the
&lt;a class="reference external" href="http://www.childmind.org/"&gt;Child Mind Institute&lt;/a&gt;, with deep expertise
in brain pathologies and in the details of the fMRI acquisitions.
Finally, he or she will have access to advanced storage and grid
computing facilities at INRIA.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Contact information&lt;/strong&gt;: gael dotnospam varoquaux atnotspam inria dotnospam fr&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="jobs"></category><category term="neuromaging"></category><category term="science"></category><category term="python"></category><category term="scientific computing"></category></entry><entry><title>Publishing scientific software matters</title><link href="https://gael-varoquaux.info/science/publishing-scientific-software-matters.html" rel="alternate"></link><published>2013-09-19T00:00:00+02:00</published><updated>2013-09-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2013-09-19:/science/publishing-scientific-software-matters.html</id><summary type="html">&lt;p class="light"&gt;Christophe Pradal, Hans Peter Langtangen, and myself recently edited
&lt;a class="reference external" href="http://www.sciencedirect.com/science/journal/18777503/4/5"&gt;a version&lt;/a&gt; of the
Journal of Computational Science on scientific software, in
particular those written in Python. We wrote &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1877750313000938"&gt;an editorial&lt;/a&gt;
defending writing and publishing open source scientific software that
I wish to summarize here. The &lt;a class="reference external" href="http://hal.inria.fr/hal-00858663/en"&gt;full text preprint&lt;/a&gt; is openly …&lt;/p&gt;</summary><content type="html">&lt;p class="light"&gt;Christophe Pradal, Hans Peter Langtangen, and myself recently edited
&lt;a class="reference external" href="http://www.sciencedirect.com/science/journal/18777503/4/5"&gt;a version&lt;/a&gt; of the
Journal of Computational Science on scientific software, in
particular those written in Python. We wrote &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1877750313000938"&gt;an editorial&lt;/a&gt;
defending writing and publishing open source scientific software that
I wish to summarize here. The &lt;a class="reference external" href="http://hal.inria.fr/hal-00858663/en"&gt;full text preprint&lt;/a&gt; is openly available in &lt;a class="reference external" href="http://gael-varoquaux.info/publications.html"&gt;my
publications list&lt;/a&gt; as always. It
includes, amongst other things, references.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Software is a central part of modern scientific discovery.&lt;/strong&gt; Software turns a
theoretical model into quantitative predictions; software controls an
experiment; and software extracts from raw data evidence supporting or
rejecting a theory. As of today, scientific publications seldom discuss
software in depth, maybe because it is both highly technical and a recent
addition to scientific tools. But times are changing. More and more scientific
investigators are developing software and it is important to establish norms
for publication of this work. Producing scientific software is an important
part of the landscape of research activities. Very visible scientific software
is found in products developed by private companies, such as MathWorks’ Matlab
or Wolfram’s Mathematica, but let us not forget that these build upon code
written by and for academics. Scientists writing software contribute to the
advancement of Science via several factors.&lt;/p&gt;
&lt;p&gt;First, software developed in one field, if written in a sufficiently general
way, can often be applied to advance a different field if the underlying
mathematics is common. &lt;strong&gt;Modern scientific software development has a strong
emphasis on generality and reusability by taking advantage of the general
properties of the mathematical structures in the problem.&lt;/strong&gt; This feature of
modern software helps close the gap between fields and accelerates scientific
discovery by packaging mathematical theories in a directly applicable way.&lt;/p&gt;
&lt;p&gt;Second, &lt;strong&gt;the public availability of code is a corner stone of the
scientific method&lt;/strong&gt;, as it is a requirement to reproducing scientific
results: “&lt;em&gt;if it’s not open and verifiable by others, it’s not science,
or engineering, or whatever it is you call what we do.&lt;/em&gt;” (V. Stodden,
&lt;em&gt;The scientific method in practice&lt;/em&gt;). Emphasizing code to an extreme,
Buckheit and Donoho have challenged the traditional view that a
publication was the valuable outcome of scientific research: “&lt;em&gt;an article
about computational science in a scientific publication is not the
scholarship itself, it is merely advertising of the scholarship. The
actual scholarship is the complete software development environment
[…]&lt;/em&gt;”.&lt;/p&gt;
&lt;p&gt;It is important to keep in mind that &lt;strong&gt;going beyond replication of
results requires reusable software tools&lt;/strong&gt;: code that is portable, comes
with documentation, and, most of all, is maintained throughout the years.
Indeed, &lt;strong&gt;software development is a major undertaking that must build
upon best practices and a quality process&lt;/strong&gt;. Reversing Buckheit and
Donoho’s argument, publications about scientific software play an increasingly
important part in the scientific methodology. First, in the publish-or-perish
academic culture, such publications give an incentive to software production
and maintenance, because good software can lead to highly-cited papers. Second,
&lt;strong&gt;the publication and review process are the de facto standards of
ensuring quality in the scientific world. As software is becoming increasingly
more central to the scientific discovery process, it must be subject to these
standards&lt;/strong&gt;. We have found that writing an article on software leads the
authors to better clarify the project vision, technically and scientifically,
the prior art, and the contributions. Last but not least, scientists publishing
new results based on a particular software need an informed analysis of the
validity of that software. Unfortunately, much of the current practice for
adopting research software relies on ease of use of the package and reputation
of the authors.&lt;/p&gt;
&lt;p&gt;[…]&lt;/p&gt;
&lt;p&gt;Today, software is to scientific research what Galileo’s telescope was to
astronomy: a tool, combining science and engineering. It lies outside the
central field of principal competence among the researchers that rely on it.
Like the telescope, it also builds upon scientific progress and shapes our
scientific vision. Galileo’s telescope was a leap forward in optics, a field of
investigation that is now well established, with its own high-impact journals
and scholarly associations. Similarly, we hope that visibility and recognition
of scientific software development will grow.&lt;/p&gt;
</content><category term="science"></category><category term="publishing"></category><category term="open source"></category><category term="scientific computing"></category><category term="reproducible research"></category><category term="scientific software"></category></entry><entry><title>The problems of low statistical power and publication bias</title><link href="https://gael-varoquaux.info/science/the-problems-of-low-statistical-power-and-publication-bias.html" rel="alternate"></link><published>2012-04-14T16:16:00+02:00</published><updated>2012-04-14T16:16:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-04-14:/science/the-problems-of-low-statistical-power-and-publication-bias.html</id><summary type="html">&lt;img alt="" class="align-right" src="http://idoubtit.files.wordpress.com/2010/12/coldfusion.jpg" style="width: 30%;" /&gt;
&lt;p&gt;Lately, I have been in a mood of scientific scepticism: I have the feeling
that the worldwide academic system is more and more failing to produce
useful research. Christophe Lalanne’s &lt;a class="reference external" href="https://twitter.com/#!/chlalanne"&gt;twitter feed&lt;/a&gt; led me to an
interesting article in a non-mainstream journal: &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;A farewell to
Bonferroni: the problems of low …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;img alt="" class="align-right" src="http://idoubtit.files.wordpress.com/2010/12/coldfusion.jpg" style="width: 30%;" /&gt;
&lt;p&gt;Lately, I have been in a mood of scientific scepticism: I have the feeling
that the worldwide academic system is more and more failing to produce
useful research. Christophe Lalanne’s &lt;a class="reference external" href="https://twitter.com/#!/chlalanne"&gt;twitter feed&lt;/a&gt; led me to an
interesting article in a non-mainstream journal: &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;A farewell to
Bonferroni: the problems of low statistical power and publication
bias&lt;/a&gt;, by Shinichi Nakagawa.&lt;/p&gt;
&lt;p&gt;Each study performed has a probability of being wrong. Thus performing
many studies will lead to some wrong conclusions by chance. This is
known in statistics as the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Multiple_comparisons"&gt;multiple comparisons&lt;/a&gt; problem. When a
working hypothesis is not verified empirically in a study, this null
finding is seldom reported, leading to what is called &lt;em&gt;publication
bias&lt;/em&gt;: &lt;strong&gt;discoveries are further studied; negative results are usually
ignored&lt;/strong&gt; (Y. Benjamini). Because only &lt;em&gt;discoveries&lt;/em&gt;, called
&lt;em&gt;detections&lt;/em&gt; in statistical terms, are reported, &lt;strong&gt;published results
contain more false detections than the individual experiments and very
few false negatives&lt;/strong&gt;. Arguably, the original investigators correct
for this using the understanding that they gained from the experiments
performed, and account in a &lt;em&gt;post-hoc analysis&lt;/em&gt; for the fact that some of
their working hypotheses could not have been correct. Such a correction
can work only in a field where there is a good mechanistic
understanding, or models, such as physics, but in my opinion not in the life
and social sciences.&lt;/p&gt;
&lt;p&gt;Let me quote some relevant extracts of &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;the article&lt;/a&gt;, as you may never
have access to it thanks to the way scientific publishing works:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
&lt;p&gt;Recently, Jennions and Moller (2003) carried out a meta-analysis
on statistical power in the field of behavioral ecology and animal
behavior, reviewing 10 leading journals including Behavioral
Ecology. Their results showed dismayingly low average statistical
power (note that a meta-analytic review of statistical power is
different from post hoc power analysis as criticized in Hoenig and
Heisey, 2001). The statistical power of a null hypothesis (Ho)
significance test is the probability that the test will reject Ho
when a research hypothesis (Ha) is true.&lt;/p&gt;
&lt;p&gt;…&lt;/p&gt;
&lt;p&gt;The meta-analysis on statistical power by Jennions and Moller
(2003) revealed that, in the field of behavioral ecology and animal
behavior, statistical power of less than 20% to detect a small
effect and power of less than 50% to detect a medium effect existed.
This means, for example, that the average behavioral scientist
performing a statistical test has a greater probability of making a
Type II error (or beta) (&lt;em&gt;i.e.&lt;/em&gt;, not rejecting Ho when Ho is false;
note that statistical power equals 1 - beta) than if they had
flipped a coin, when an experiment effect is of medium size.&lt;/p&gt;
&lt;p&gt;…&lt;/p&gt;
&lt;p&gt;Imagine that we conduct a study where we measure as many relevant
variables as possible, 10 variables, for example. We find only two
variables statistically significant. Then, what should we do? We
could decide to write a paper highlighting these two variables (and
not reporting the other eight at all) as if we had hypotheses about
the two significant variables in the first place. Subsequently, our
paper would be published. Alternatively, we could write a paper
including all 10 variables. When the paper is reviewed, referees
might tell us that there were no significant results if we had
“appropriately” employed Bonferroni corrections, so that our study
would not be advisable for publication. However, the latter paper is
scientifically more important than the former paper. For example, if
one wants to conduct a meta-analysis to investigate an overall
effect in a specific area of study, the latter paper is five times
more informative than the former paper. In the long term,
statistical significance of particular tests may be of trivial
importance (if not always), although, in the short term, it makes
papers publishable. Bonferroni procedures may, in part, be
preventing the accumulation of knowledge in the field of behavioral
ecology and animal behavior, thus hindering the progress of the
field as science.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;img alt="" class="align-right" src="http://farm6.staticflickr.com/5206/5330056727_a98c97c3c5.jpg" style="width: 50%;" /&gt;
&lt;p&gt;Some of the concerns raised here are partly a criticism of Bonferroni
corrections, &lt;em&gt;i.e.&lt;/em&gt; in technical terms correcting for the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Familywise_error_rate"&gt;family-wise error
rate (FWER)&lt;/a&gt;. This is actually the message that the author wants to
convey in his paper. Proponents of controlling the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/False_discovery_rate"&gt;false discovery rate
(FDR)&lt;/a&gt; argue that an investigator shouldn’t be penalized for asking
more questions, and the fraction of errors in the answers should be
controlled, rather than the absolute value. That said, FDR, while
useful, does not answer the problems of publication bias.&lt;/p&gt;
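&lt;p&gt;To make the trade-off concrete, here is a minimal sketch (my own toy
implementation, not any particular library’s API) comparing the two
procedures on the same p-values: Bonferroni divides the significance
threshold by the number of tests, whereas the Benjamini-Hochberg step-up
procedure controls the fraction of false discoveries and typically
rejects more hypotheses.&lt;/p&gt;

```python
# Toy comparison of FWER control (Bonferroni) and FDR control
# (Benjamini-Hochberg) on one set of p-values.

def bonferroni(p_values, alpha=0.05):
    """Reject hypotheses whose p-value survives the Bonferroni threshold."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject hypotheses with the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis whose rank is <= max_k
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            rejected[i] = True
    return rejected

# Ten tests, a few of them with small p-values
p = [0.001, 0.008, 0.012, 0.016, 0.2, 0.3, 0.5, 0.6, 0.7, 0.9]
print(sum(bonferroni(p)))          # Bonferroni keeps only 1 discovery
print(sum(benjamini_hochberg(p)))  # Benjamini-Hochberg keeps 4
```

Here Bonferroni retains a single discovery where the FDR procedure retains four: asking ten questions did not wipe out the moderately significant answers.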
</content><category term="science"></category><category term="statistics"></category><category term="computational science"></category><category term="science"></category></entry><entry><title>Conference posters</title><link href="https://gael-varoquaux.info/science/conference-posters.html" rel="alternate"></link><published>2011-09-05T04:15:00+02:00</published><updated>2011-09-05T04:15:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-09-05:/science/conference-posters.html</id><summary type="html">&lt;p&gt;At the request of a friend, I am putting up some of the posters that I
recently presented at conferences.&lt;/p&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_nips.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Large-scale functional-connectivity graphical models for individual
subjects using population prior.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00512451/en"&gt;our NIPS work&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_nips.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_ipmi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Multi-subject dictionary learning to segment an atlas of brain
spontaneous activity …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;At the request of a friend, I am putting up some of the posters that I
recently presented at conferences.&lt;/p&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_nips.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Large-scale functional-connectivity graphical models for individual
subjects using population prior.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00512451/en"&gt;our NIPS work&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_nips.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_ipmi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Multi-subject dictionary learning to segment an atlas of brain
spontaneous activity.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00588898/en"&gt;our IPMI work&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_ipmi.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_mayavi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Mayavi for 3D visualization of neuroimaging data: powerful scripting
and reusable components in Python.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_mayavi.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_scikit.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Machine learning for fMRI in Python: inverse inference with
scikit-learn.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_scikit.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
</content><category term="science"></category><category term="neuroimaging"></category><category term="machine learning"></category><category term="science"></category><category term="publishing"></category></entry><entry><title>My conference travels: Scipy 2011 and HBM 2011</title><link href="https://gael-varoquaux.info/science/my-conference-travels-scipy-2011-and-hbm-2011.html" rel="alternate"></link><published>2011-07-23T23:45:00+02:00</published><updated>2011-07-23T23:45:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-07-23:/science/my-conference-travels-scipy-2011-and-hbm-2011.html</id><summary type="html">&lt;div class="section" id="the-scipy-2011-conference-in-austin"&gt;
&lt;h2&gt;The Scipy 2011 conference in Austin&lt;/h2&gt;
&lt;p&gt;Last week, I was at the Scipy conference in Austin. It was really great
to see old friends, and Austin is such a nice  place.&lt;/p&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6143/5931239349_13c78bbef5_m.jpg" style="width: 50%;" /&gt;
&lt;p&gt;The Scipy conference was held in &lt;a class="reference external" href="http://www.meetattexas.com/"&gt;UT Austin’s conference center&lt;/a&gt;, which
is a fantastic venue. This is the …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="the-scipy-2011-conference-in-austin"&gt;
&lt;h2&gt;The Scipy 2011 conference in Austin&lt;/h2&gt;
&lt;p&gt;Last week, I was at the Scipy conference in Austin. It was really great
to see old friends, and Austin is such a nice  place.&lt;/p&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6143/5931239349_13c78bbef5_m.jpg" style="width: 50%;" /&gt;
&lt;p&gt;The Scipy conference was held in &lt;a class="reference external" href="http://www.meetattexas.com/"&gt;UT Austin’s conference center&lt;/a&gt;, which
is a fantastic venue. This is the first geek conference I have been to
where the wireless network worked flawlessly with good bandwidth, even
though 200 geeks were pounding on it. As a tutorial presenter, this was
incredibly useful.&lt;/p&gt;
&lt;div class="section" id="conference-highlight"&gt;
&lt;h3&gt;Conference highlight&lt;/h3&gt;
&lt;p&gt;Here is a short list of what I &lt;em&gt;felt&lt;/em&gt; were the big trends and highlights
of the conference. This is obviously biased by my own interests. I am
not listing parallel computing, as it is clearly an important area of
progress and debates, but it has been the case for the last few years.&lt;/p&gt;
&lt;div class="section" id="eric-jone-s-keynote"&gt;
&lt;h4&gt;Eric Jones’s keynote&lt;/h4&gt;
&lt;p&gt;Of course Eric’s keynote was excellent. Eric is a great speaker and
always has good insights on how to run a team and a project. This year
he shared (some) of his tricks in making Enthought deliver on software
projects: &lt;em&gt;“What Matters in Scientific Software Projects? 10 Years of
Success and Failure Distilled”&lt;/em&gt;. The video is not yet online,
unfortunately. Grab it when you can.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="hilary-mason-s-keynote"&gt;
&lt;h4&gt;Hilary Mason’s keynote&lt;/h4&gt;
&lt;p&gt;Hilary is an applied data geek, just what I like! She gave an
interesting &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/mason_awesome.pdf"&gt;keynote&lt;/a&gt; on how &lt;a class="reference external" href="https://bitly.com/"&gt;bitly&lt;/a&gt; (a URL-shortening startup, for
those living under a rock) mines the requests on the URLs that they serve
to do things like ranking or phishing-attempt detection. Of course, I
couldn’t resist asking what tools they used, thinking that she would
reply R. She said that they do roll some of their own, but she
mentioned &lt;a class="reference external" href="https://mlpy.fbk.eu/"&gt;mlpy&lt;/a&gt; and &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, mentioning that it was very
nice, at which point I believe that I blushed. She stressed that R is
hard to use in production and raised the point that academic software
most often doesn’t pan out in these settings (I hope that I am not
distorting her thoughts too much).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="statistics-and-learning"&gt;
&lt;h4&gt;Statistics and learning&lt;/h4&gt;
&lt;p&gt;I had the feeling that statistics and data mining played a big role at
scipy this year. Maybe it is because I am more tuned to these questions
nowadays, but some signs do not lie. There was a special session on
Python in data sciences, a panel discussion on Python in finance and
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/cron_gpustats.pdf"&gt;many&lt;/a&gt;
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/refsdal_sherpa.zip"&gt;many&lt;/a&gt;
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/mckinney_time_series.pdf"&gt;statistics&lt;/a&gt; and &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/determan_vision_spreadsheet.pdf"&gt;data&lt;/a&gt; &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/caraciolo_crab_recommendation.pdf"&gt;related&lt;/a&gt; talks, as well as two tutorials and
a keynote.&lt;/p&gt;
&lt;p&gt;In addition, on a personal basis it was really great to meet part of the
team behind &lt;a class="reference external" href="http://statsmodels.sourceforge.net/"&gt;scikits.statsmodels&lt;/a&gt;. We had plenty of very interesting
discussions and they really helped me understand the way that some
statisticians approach data: very differently from me, because they have
fairly little data, and can afford to inspect reports and graphs,
whereas I rely more on automated decision rules.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="ipython"&gt;
&lt;h4&gt;IPython&lt;/h4&gt;
&lt;p&gt;&lt;a class="reference external" href="http://twitter.com/#!/minrk"&gt;Min&lt;/a&gt; gave &lt;a class="reference external" href="http://minrk.github.com/scipy-tutorial-2011/"&gt;an excellent tutorial&lt;/a&gt; on how to do parallel computing
using IPython. These guys have certainly done an excellent job to make
cluster-level programming in Python easier. While they don’t yet play
terribly well with the restrictive job-queue policies of the clusters to
which I have access, they have all the right low-level tools to address
these issues and Min told me that they will be working on this next
year.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://fperez.org/"&gt;Fernando&lt;/a&gt; gave &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/perez_ipython.pdf"&gt;an impressive talk&lt;/a&gt; on the new developments of
IPython. In particular, the new Qt-based terminal is &lt;em&gt;really cool&lt;/em&gt;
and there is a web frontend in the works.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="cluster-computing-as-facility"&gt;
&lt;h4&gt;Cluster computing as facility&lt;/h4&gt;
&lt;p&gt;While I mention cluster computing, I must confess that I have always
stayed away from this beast: I find it a time sink, and I find that I
get more science done without it. This is why I really liked the
presentation by the &lt;a class="reference external" href="http://www.picloud.com/"&gt;PiCloud&lt;/a&gt; guys on, … cluster computing! The
reason I liked it is that they start from the principle that your time
is more important than CPU time. I hear so much about &lt;em&gt;bigger better
faster more&lt;/em&gt; high-performance computing when researchers forget to
address the biggest issue:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
… a whole generation of researchers turned into system
administrators by the demands of computing - Dan Reed, VP Microsoft&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div class="section" id="abstract-code-manipulation-for-numerical-computation"&gt;
&lt;h4&gt;Abstract code manipulation for numerical computation&lt;/h4&gt;
&lt;p&gt;Finally, a trend that is picking up in the Python-based scientific
computing is the abstract manipulation of expressions to generate fast
code. This ranges from &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Just-in-time_compilation"&gt;JIT (just in time) compilation&lt;/a&gt; generating
machine code, to rewriting mathematical expressions. Peter Wang gave a
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/wang_metagraph.pdf"&gt;talk&lt;/a&gt; along these lines, and the topic was also brought up by Aron Ahmadia.
Of course this is not new: &lt;a class="reference external" href="http://code.google.com/p/numexpr/"&gt;numexpr&lt;/a&gt; has been using these tricks for
years, and more recently &lt;a class="reference external" href="http://deeplearning.net/software/theano/"&gt;Theano&lt;/a&gt; has been making good use of GPUs
thanks to them.&lt;/p&gt;
&lt;p&gt;This topic is emerging in more and more places for good reasons: with
faster and more numerous CPUs, the number of operations per second is
less of a bottleneck, and the order in which operations are applied, or
where the data physically sits, is becoming critical.&lt;/p&gt;
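&lt;p&gt;As a toy sketch of the underlying idea (my own illustration, not the
API of numexpr or Theano): once the computation is represented as an
expression tree, an engine can evaluate it in a single pass over the
data, instead of materializing a temporary array for each intermediate
result.&lt;/p&gt;

```python
# Toy expression-tree evaluator: `2*a + 3*b` is walked once per element,
# so no intermediate temporary arrays are created (unlike naive
# element-wise evaluation, which would build `2*a` and `3*b` first).

def eval_tree(node, env):
    """Recursively evaluate a scalar expression tree against variable values."""
    if isinstance(node, str):           # variable reference
        return env[node]
    if isinstance(node, (int, float)):  # constant
        return node
    op, left, right = node
    l, r = eval_tree(left, env), eval_tree(right, env)
    return l + r if op == '+' else l * r

def evaluate(tree, **arrays):
    """Single pass over the data: one tree walk per element."""
    n = len(next(iter(arrays.values())))
    return [eval_tree(tree, {k: v[i] for k, v in arrays.items()})
            for i in range(n)]

# 2*a + 3*b as a tree of ('op', left, right) tuples
tree = ('+', ('*', 2, 'a'), ('*', 3, 'b'))
print(evaluate(tree, a=[1, 2, 3], b=[10, 20, 30]))  # [32, 64, 96]
```

Real tools go much further, of course: they rewrite the tree, fuse loops, and emit machine or GPU code, but the starting point is this kind of abstract representation of the computation.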
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="my-own-agenda"&gt;
&lt;h3&gt;My own agenda&lt;/h3&gt;
&lt;div class="section" id="sprinting-on-scikit-learn"&gt;
&lt;h4&gt;Sprinting on scikit-learn&lt;/h4&gt;
&lt;a class="reference external image-reference" href="http://scikit-learn.org/dev/auto_examples/mixture/plot_gmm.html"&gt;&lt;img alt="" src="http://scikit-learn.org/dev/_images/plot_gmm_1.png" /&gt;&lt;/a&gt;
&lt;p&gt;We had two days of sprints after the conference. A huge number of people
voted to sprint on &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, but only two people showed up:
Minwoo Lee and &lt;a class="reference external" href="http://www-etud.iro.umontreal.ca/~wardefar"&gt;David Warde-Farley&lt;/a&gt;. Thanks heaps to these guys! My
priority for the sprint was to review and merge branches. That worked
beautifully: we merged in the following features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/mixture.html#the-dirichlet-process"&gt;Dirichlet-Process Gaussian mixture models&lt;/a&gt;, by Alex Passos&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/decomposition.html#sparse-principal-components-analysis-sparsepca"&gt;Sparse PCA&lt;/a&gt; by Vlad Niculae.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/gaussian_process.html"&gt;Speedups in Gaussian processes&lt;/a&gt; by Vincent Schut.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/clustering.html#mini-batch-k-means"&gt;Sparse implementation of the mini-batch k-means&lt;/a&gt; by Peter
Prettenhofer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, David added a dataset downloader for the &lt;a class="reference external" href="http://cs.nyu.edu/~roweis/data/olivettifaces.gif"&gt;Olivetti face
dataset&lt;/a&gt;, which is lightweight, but rich enough to give very
interesting examples.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="my-presentation"&gt;
&lt;h4&gt;My presentation&lt;/h4&gt;
&lt;p&gt;I gave a talk on my research work, and the software stack that
underpins it: &lt;a class="reference external" href="http://www.slideshare.net/GaelVaroquaux/python-for-brain-mining-neuroscience-with-state-of-the-art-machine-learning-and-data-visualization"&gt;Python for brain mining: (neuro)science with state of
the art machine learning and data visualization&lt;/a&gt;. I think that it was
well received by the audience. What is really crazy is that I uploaded
the slides on slideshare, and they got a ridiculous number of views. I
suspect that it is because of the title: &lt;em&gt;brain mining&lt;/em&gt; does sound
fancy.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="mayavi"&gt;
&lt;h4&gt;Mayavi&lt;/h4&gt;
&lt;p&gt;For technical and political reasons, I cannot get &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi&lt;/a&gt;
installed on the computers at work. This, and the fact that many people
ask for help but few contribute, even in the form of answers on the
mailing list, had been wearing me down a bit. I got so much great
feedback on Mayavi at the conference that I feel much more motivated to
invest energy in it.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-humain-brain-mapping-conference-in-quebec-city"&gt;
&lt;h2&gt;The Human Brain Mapping conference in Quebec City&lt;/h2&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6018/5968391718_002105ccd1.jpg" style="width: 50%;" /&gt;
&lt;p&gt;This blog post is getting too long. It is well beyond my own attention
span. However, scipy is not the only conference I have been to
recently. Two weeks earlier I was in Quebec City for the &lt;a class="reference external" href="http://www.humanbrainmapping.org/i4a/pages/index.cfm?pageID=3419"&gt;Human Brain Mapping
conference&lt;/a&gt;. As every year, HBM was a fun ride, with fantastic parties
in the evenings. But I didn’t stay up too late, as this year was a busy
one for me: I was teaching in an educational course and chairing a
symposium, both on comparing brain functional connectivity across
subjects.&lt;/p&gt;
&lt;p&gt;But the really big deal at HBM this year came at the end. As I was
dozing off, vaguely listening to Russ Poldrack’s closing comments, he
brought up on screen a slide entitled &lt;em&gt;the year of Python&lt;/em&gt;. This is a
big deal: we’ve been working for years to get Python into the neuroimaging
world, and it is clearly making progress, despite all the roadblocks.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="conferences"></category><category term="travels"></category><category term="machine learning"></category><category term="mayavi"></category><category term="python"></category><category term="science"></category><category term="scikit-learn"></category></entry><entry><title>Research jobs in France: the black humor of 2010 is the reality of 2011</title><link href="https://gael-varoquaux.info/science/research-jobs-in-france-the-black-humor-of-2010-is-the-reality-of-2011.html" rel="alternate"></link><published>2011-01-15T11:41:00+01:00</published><updated>2011-01-15T11:41:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-01-15:/science/research-jobs-in-france-the-black-humor-of-2010-is-the-reality-of-2011.html</id><summary type="html">&lt;p&gt;The French basic research landscape is dominated by a few nationwide
institutes, similar to the NIST or the NIH in the US. The largest of these
is the &lt;a class="reference external" href="http://www.cnrs.fr/index.php"&gt;CNRS&lt;/a&gt; (Centre National de la Recherche Scientifique). Getting a
tenured job in one of those institutes enables someone to focus on basic …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The French basic research landscape is dominated by a few nationwide
institutes, similar to the NIST or the NIH in the US. The largest of these
is the &lt;a class="reference external" href="http://www.cnrs.fr/index.php"&gt;CNRS&lt;/a&gt; (Centre National de la Recherche Scientifique). Getting a
tenured job in one of those institutes enables someone to focus on basic
research rather than teaching or going into industry. It has always
been quite challenging to get such a position, as many people apply for
very few openings, and the choice of candidates is quite political. Each
year there is a call for applications, through an impressive formal
process that young researchers trying to get jobs in France end up
knowing quite well.&lt;/p&gt;
&lt;p&gt;Last year, I was visiting a research lab (&lt;a class="reference external" href="http://www.incm.cnrs-mrs.fr/en_index.php"&gt;INCM&lt;/a&gt;) and I saw in their
coffee-break room the following poster (below), which I could
clearly recognize as the official call for applications for positions at
CNRS.&lt;/p&gt;
&lt;p&gt;Now this poster says ‘&lt;strong&gt;The CNRS recruits 3 researchers (m/w) in all
fields of research&lt;/strong&gt;’. Of course it’s a fake poster and black humor: 3
positions nationwide in all fields of research is ridiculously low. It
is however an expression of the nightmare of thousands of young
researchers who are applying each year and keep hearing that the
government will &lt;a class="reference external" href="http://www.latribune.fr/actualites/economie/france/20100415trib000499181/la-fonction-publique-d-etat-perdra-34.000-postes-en-2011-selon-georges-tron.html"&gt;slash the number of state employees&lt;/a&gt;.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/cnrs_recruits.jpg" style="width: 70%;" /&gt;
&lt;p&gt;The call for 2011 applications for research positions at &lt;a class="reference external" href="http://en.inria.fr/"&gt;INRIA&lt;/a&gt;,
the French national computer science institute and another one of
the big research institutions in France, is &lt;a class="reference external" href="http://www.inria.fr/institut/recrutement-metiers/offres/concours-2011-5-postes-de-charge-de-recherche-2e-classe-sont-a-pourvoir/concours-2011"&gt;out&lt;/a&gt;. The page is entitled
&lt;em&gt;Cinq postes de chargé de recherche 2e classe sont à pourvoir&lt;/em&gt; (&lt;strong&gt;5
positions for junior researchers are available&lt;/strong&gt;). This is not a joke,
and it is striking to see the similarity between &lt;strong&gt;the dark humor of
2010 and the reality of 2011&lt;/strong&gt;. To be fair, INRIA is smaller than CNRS,
as it covers only computer science and applications (listed as applied
maths, numerical computing and simulation, algorithm and software
research, networks and distributed systems, and computational modeling
for life sciences). The number of applicants is in the hundreds rather
than thousands, but having only 5 jobs available nationwide still feels
really awkward.&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external" href="attachments/cnrs_recruits.pdf"&gt;PDF poster&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;A minor detail: I am trying to get a job in computational science
research in France.&lt;/p&gt;
</content><category term="science"></category><category term="personnal"></category><category term="science"></category></entry><entry><title>Machine learning humour</title><link href="https://gael-varoquaux.info/science/machine-learning-humour.html" rel="alternate"></link><published>2010-09-16T23:11:00+02:00</published><updated>2010-09-16T23:11:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-09-16:/science/machine-learning-humour.html</id><summary type="html">&lt;div class="section" id="yes-but-they-overfit"&gt;
&lt;h2&gt;Yes, but they overfit&lt;/h2&gt;
&lt;p&gt;If you are reading this post through a planet, the movie isn’t showing
up, just &lt;a class="reference external" href="http://gael-varoquaux.info/science/machine-learning-humour.html"&gt;click through&lt;/a&gt; to understand what the hell this is about.&lt;/p&gt;
&lt;p&gt;
&lt;object width="480" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/m60lVGz34hU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;&lt;/div&gt;
&lt;div class="section" id="some-explanations"&gt;
&lt;h2&gt;Some explanations…&lt;/h2&gt;
&lt;div class="section" id="machine-learning-geeks-and-beers"&gt;
&lt;h3&gt;Machine learning, geeks, and beers&lt;/h3&gt;
&lt;p&gt;Sorry for the bad humour. In the previous weeks my social geek life …&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="yes-but-they-overfit"&gt;
&lt;h2&gt;Yes, but they overfit&lt;/h2&gt;
&lt;p&gt;If you are reading this post through a planet, the movie isn’t showing
up, just &lt;a class="reference external" href="http://gael-varoquaux.info/science/machine-learning-humour.html"&gt;click through&lt;/a&gt; to understand what the hell this is about.&lt;/p&gt;
&lt;p&gt;
&lt;object width="480" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/m60lVGz34hU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;&lt;/div&gt;
&lt;div class="section" id="some-explanations"&gt;
&lt;h2&gt;Some explanations…&lt;/h2&gt;
&lt;div class="section" id="machine-learning-geeks-and-beers"&gt;
&lt;h3&gt;Machine learning, geeks, and beers&lt;/h3&gt;
&lt;p&gt;Sorry for the bad humour. In the previous weeks my social geek life had
two strong moments:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.pycon.fr/conference/edition2010"&gt;Pycon fr&lt;/a&gt;, the French Python conference, and ensuing drinking&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" src="http://farm5.static.flickr.com/4077/4938486734_378f52fd3d.jpg" style="width: 45%;" /&gt;
&lt;img alt="" src="http://farm5.static.flickr.com/4114/4938124265_027853c81a.jpg" style="width: 45%;" /&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://fseoane.net/blog/2010/second-scikitslearn-coding-sprint/"&gt;The second sprint&lt;/a&gt; on the &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit learn&lt;/a&gt;, a library for machine
learning in Python.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At the first event (or maybe the related drinking) there was a lot of
discussion about NoSQL databases, and I was introduced to &lt;a class="reference external" href="http://www.xtranormal.com/watch/6995033/&amp;quot;&amp;quot;"&gt;this
fantastic video&lt;/a&gt; making fun of MongoDB fanboys. A few days later I was
hacking on the scikit, comparing estimators and discussing hype versus
fact in machine learning algorithms (hint: &lt;a class="reference external" href="http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization"&gt;there is no free lunch&lt;/a&gt;,
but you may get &lt;a class="reference external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.2501&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;a free brunch&lt;/a&gt;). As in brain imaging people seem to
be doing nothing but SVMs over and over while &lt;a class="reference external" href="http://hal.inria.fr/hal-00504095/PDF/icpr_2010_tv.pdf"&gt;methods with more
appropriate sparsity clearly perform better&lt;/a&gt;, I composed this stupid
video.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="anything-to-learn-about-machine-learning-in-there"&gt;
&lt;h3&gt;Anything to learn about machine learning in there?&lt;/h3&gt;
&lt;p&gt;The short answer is: probably not. This video is humour, and there is
little truth in it (well, RFE is indeed slow as a dog). However, not every
reader of this blog is a machine learning expert, so let me explain the
stakes of the pseudo discussion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overfitting&lt;/strong&gt;: when you learn a predictive model on a noisy data set,
for instance trying to predict whether a movie is popular
or not from ratings, if you have a finite amount of data, you should be
careful not to learn every detail of the data by heart. Otherwise you
will learn noise that, by chance, correlates with what you are trying to
predict. When you try to generalize to new data, these features that you
learned from noise will be detrimental to your prediction performance. For
instance, &lt;a class="reference external" href="http://www.reddit.com/r/Python/comments/cwq37/announcing_python_nltk_demos_natural_language/"&gt;the presence of Matt Damon&lt;/a&gt; is not the sole predictor of the
quality of a movie. This is called overfitting. The goal of
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Regularization_%28mathematics%29"&gt;regularization&lt;/a&gt; is to avoid this overfitting.&lt;/p&gt;
&lt;p&gt;Both SVMs and elastic net implement regularization, but in different ways.
In the case of brain imaging, the predictive features (voxels) are
very sparse, but the noise is highly structured; SVMs (which do not
operate on voxels directly) are not able to select the relevant
voxels and tend to overfit (which can be counter-balanced by univariate
feature selection as in the &lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html"&gt;scikit example&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RFE (recursive feature elimination) is slow as a dog&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_digits&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearSVC&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;svc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LinearSVC&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.rfe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RFE&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;RFE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;21.5&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.glm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ElasticNet&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;ElasticNet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;26.7&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Yeah, but it does much more than simply building a predictor: it builds
a ‘heat map’ of which features help prediction (run &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/auto_examples/rfe_digits.html"&gt;this scikit-learn
example&lt;/a&gt; to get an idea).&lt;/p&gt;
&lt;p&gt;I am afraid that all the examples I pointed to require the development
version of the scikit. Sorry, we just finished a sprint, and there will
be a release soon.&lt;/p&gt;
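For readers on a current scikit-learn, here is a rough modern equivalent of the session above. The module paths and parameter names are my translation of the old `scikits.learn` API (`rfe.RFE(n_features=..., percentage=...)` became `feature_selection.RFE(n_features_to_select=..., step=...)`), not the historical code:

```python
# Recursive feature elimination on the digits data, modern scikit-learn API.
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

digits = load_digits()
X, y = digits.data, digits.target

svc = LinearSVC()
# step=0.1 removes 10% of the remaining features at each iteration,
# mirroring the old percentage=0.1 parameter
rfe = RFE(estimator=svc, n_features_to_select=1, step=0.1).fit(X, y)

# ranking_ gives the elimination order; reshaped to the 8x8 image shape
# it is the 'heat map' of which pixels help prediction
ranking = rfe.ranking_.reshape(digits.images[0].shape)
print(ranking.shape)
```

The pixel with ranking 1 is the last one standing, i.e. the most useful for prediction.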
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="personnal"></category><category term="python"></category><category term="humor"></category></entry><entry><title>Making posters for scientific conferences</title><link href="https://gael-varoquaux.info/science/making-posters-for-scientific-conferences.html" rel="alternate"></link><published>2010-07-12T00:00:00+02:00</published><updated>2010-07-12T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-07-12:/science/making-posters-for-scientific-conferences.html</id><summary type="html">&lt;p class="first last"&gt;Some advice and examples on making posters for scientific conferences.&lt;/p&gt;
</summary><content type="html">&lt;p&gt;This page gives some advice and examples on making posters for
scientific conferences.&lt;/p&gt;
&lt;p&gt;Here are some posters I made (one in 2007, the other in 2011). They don’t
follow all the advice on this page, but they should.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external image-reference" href="attachments/poster_YAO.pdf"&gt;&lt;img alt="poster1" src="attachments/poster_YAO.jpg" style="width: 33%;" /&gt;&lt;/a&gt; &lt;a class="reference external image-reference" href="attachments/poster_hbm2011.pdf"&gt;&lt;img alt="poster2" src="attachments/poster_hbm2011.png" style="width: 33%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;LaTeX sources&lt;/p&gt;
&lt;p&gt;These posters are written in LaTeX. You can download the whole source of
the posters for &lt;a class="reference external" href="attachments/poster.zip"&gt;the first poster (left)&lt;/a&gt;,
and &lt;a class="reference external" href="attachments/poster_hbm2011.zip"&gt;the second one (right)&lt;/a&gt;. These
are some of my personal projects, not meant for sharing. As a result
they contain a fair amount of hacking. I have been asked for the source code
more than once, so I put it on the web. I do not, however, have time to
provide &lt;strong&gt;any&lt;/strong&gt; support for it (I am already too busy supporting other
things). Any mail asking for help on these files will go unanswered. Sorry.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Here is another example, a bit more visually appealing, as it is intended
for a less technical audience.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_ICE.pdf"&gt;&lt;img alt="" class="align-center" src="attachments/poster_ICE.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;One more about my work: this one was made to convey a strong message and
simplified the content a lot to get the message across. I am not too sure
it worked, but I still find the poster pretty.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_ICOLS07.pdf"&gt;&lt;img alt="" class="align-center" src="attachments/poster_ICOLS07.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;And finally two made by Emmanuelle with really nice colours.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_Emmanuelle.pdf"&gt;&lt;img alt="" src="attachments/poster_Emmanuelle.jpg" /&gt;&lt;/a&gt;
&lt;a class="reference external image-reference" href="attachments/poster_blue.pdf"&gt;&lt;img alt="" src="attachments/poster_blue.jpg" /&gt;&lt;/a&gt;
&lt;div class="section" id="advice-on-poster-presentation"&gt;
&lt;h2&gt;Advice on poster presentation&lt;/h2&gt;
&lt;p&gt;See also &lt;a class="reference external" href="http://www.ncsu.edu/project/posters"&gt;http://www.ncsu.edu/project/posters&lt;/a&gt;&lt;/p&gt;
&lt;div class="section" id="fonts"&gt;
&lt;h3&gt;Fonts&lt;/h3&gt;
&lt;p&gt;Sans-serif fonts look really nice, but are less readable in
paragraphs. Use them for titles and headers, and use serif fonts for
paragraphs. Stick to a simple font family like Times. Use bold fonts
when writing with a light colour on a dark background.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="colours"&gt;
&lt;h3&gt;Colours&lt;/h3&gt;
&lt;p&gt;Stick to a rather small number of colours, but choose them well.
Put a very light colour behind your text blocks. If ink is not too
expensive, I would use a dark background, and have light text blocks on
it. Have well-separated areas in your poster (like the background and
the text blocks), and give the background, or other decorative elements,
little contrast: they should not stand out too much (mine stood out
too much in my poster; it’s because the print-out didn’t look like what
was on the screen).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="page-layout"&gt;
&lt;h3&gt;Page layout&lt;/h3&gt;
&lt;p&gt;Break symmetry and order. A well-aligned poster is boring to the
eye, and does not catch attention from afar. People read your poster by
first scanning through it and stopping at a few key points (usually
first at the upper left, then the upper right, then the lower right, and
the lower left), then they might read it more thoroughly after this first
scan. You want to define these key points visually, make them appealing,
and put key ideas there.&lt;/p&gt;
&lt;p&gt;Long lines are difficult to read. Pick up a book, a flyer, anything made
by a professional publisher: it will never have long lines. A good rule
of thumb is that if a text block has lines longer than 80 characters, it
needs breaking down into several columns.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="which-software-to-use"&gt;
&lt;h2&gt;Which software to use&lt;/h2&gt;
&lt;p&gt;Many people use PowerPoint to make their posters. It is easy to use, but
it is not dedicated to making posters, and it produces horrible PDFs.&lt;/p&gt;
&lt;p&gt;If you are willing to pay a lot, there is Quark Xpress, which is very good for this
kind of thing. Adobe PageMaker is also a very good piece of software. &lt;a class="reference external" href="http://www.xara.com/"&gt;Xara&lt;/a&gt; is a cheap and good design program, and a free
version will soon be available for Linux.&lt;/p&gt;
&lt;p&gt;I use LaTeX, just because I love the way it positions characters. But I
admit it is a bit brutal. What I would advise you to use is &lt;a class="reference external" href="http://www.scribus.net"&gt;Scribus&lt;/a&gt;: it is dedicated to this kind of layout work and is free
and open source. I sometimes use LaTeX to create the text boxes, and
Scribus to lay them out. I wrote a &lt;a class="reference external" href="LaTeX-scribus.html"&gt;page&lt;/a&gt;
describing how I do it.&lt;/p&gt;
&lt;!-- See also :
http://theoval.cmp.uea.ac.uk/~nlct/jpgfdraw/manual/postertutorial.html --&gt;
&lt;p&gt;One last remark: use vector graphics (eps, ps, pdf, svg), not bitmaps,
which scale up really badly.
Try to get a vector logo of your institution. Usually, asking the PR
people is all it takes to get one. Of course, if you are using
PowerPoint, chances are that you won’t be able to insert it in your poster.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category><category term="conferences"></category><category term="selected"></category></entry><entry><title>A simple LaTeX example</title><link href="https://gael-varoquaux.info/science/a-simple-latex-example.html" rel="alternate"></link><published>2010-06-01T00:00:00+02:00</published><updated>2010-06-01T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-06-01:/science/a-simple-latex-example.html</id><summary type="html">&lt;p class="first last"&gt;A simple LaTeX document, to use as a skeleton&lt;/p&gt;
</summary><content type="html">&lt;p&gt;Here is a very simple example of a LaTeX document that uses good packages
to achieve a simple but nice layout:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="attachments/simple.tex"&gt;The LaTeX source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="attachments/simple.pdf"&gt;The pdf document&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Some advice&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Use &lt;a class="reference external" href="http://www.texniccenter.org/"&gt;texniccenter&lt;/a&gt; if you don’t have a
favorite editor.&lt;/li&gt;
&lt;li&gt;Read the &lt;a class="reference external" href="http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf"&gt;not so short introduction to latex&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category><category term="science"></category></entry><entry><title>PCA and ICA: Identifying combinations of variables</title><link href="https://gael-varoquaux.info/science/ica_vs_pca.html" rel="alternate"></link><published>2010-02-05T00:00:00+01:00</published><updated>2010-02-05T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-02-05:/science/ica_vs_pca.html</id><summary type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Dimension reduction and interpretability&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Suppose you have statistical data that has too many dimensions, in other
words too many variables of the same random process, which has been
observed many times. You want to find out, from all these variables (or all
these dimensions when speaking in terms of multivariate data …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Dimension reduction and interpretability&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Suppose you have statistical data that has too many dimensions, in other
words too many variables of the same random process, which has been
observed many times. You want to find out, from all these variables (or all
these dimensions when speaking in terms of multivariate data),
what are the relevant combinations, or directions.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="dimension-reduction-with-pca"&gt;
&lt;h2&gt;Dimension reduction with PCA&lt;/h2&gt;
&lt;p&gt;Suppose we have three-dimensional data, for instance simultaneous measurements
made by three thermometers positioned at different locations in a room.
The data forms a cluster of points in a 3D space:&lt;/p&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/science/attachments/ica_pca/3d_data.jpg" style="width: 50%;" /&gt;
&lt;p&gt;If the temperature in that room is conditioned by only two parameters,
the setting of a heater and the outside temperature, we probably have
too much data: the three sets of measurements can be expressed as a
linear combination of two fluctuating variables, plus an additional, much
smaller, noise term. In other words, the data mostly lies in a 2D
plane embedded in the 3D measurement space.&lt;/p&gt;
&lt;p&gt;We can use PCA (Principal Component Analysis) to find this plane: PCA
will give us the orthogonal basis in which the covariance matrix of our
data is diagonal. The vectors of this basis point in successive
orthogonal directions in which the data variance is maximum. In the case
of data mainly residing on a 2D plane, the variance is much greater along
the two first vectors, which define our plane of interest, than along the
third one:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="https://gael-varoquaux.info/science/attachments/ica_pca/3d_data_pca_axis.jpg" style="width: 50%;" /&gt;
&lt;p class="caption"&gt;The covariance eigenvectors identified by PCA are shown in red. The
plane defined by the 2 largest eigenvectors is shown in light red.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;If we look at the data in the plane identified by PCA, it is clear that
it was mostly 2D:&lt;/p&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/science/attachments/ica_pca/3d_data_pca.jpg" style="width: 50%;" /&gt;
&lt;/div&gt;
&lt;div class="section" id="understanding-pca-with-a-gaussian-model"&gt;
&lt;h2&gt;Understanding PCA with a Gaussian model&lt;/h2&gt;
&lt;p&gt;Let &lt;cite&gt;x&lt;/cite&gt; and &lt;cite&gt;y&lt;/cite&gt; be two normally-distributed variables, describing the
processes we are observing:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;x&lt;/i&gt; = &lt;span class="scriptfont"&gt;N&lt;/span&gt;(0, 1)
&lt;/div&gt;
&lt;p&gt;and&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;y&lt;/i&gt; = &lt;span class="scriptfont"&gt;N&lt;/span&gt;(0, 1)
&lt;/div&gt;
&lt;p&gt;Let &lt;cite&gt;a&lt;/cite&gt; and &lt;cite&gt;b&lt;/cite&gt; be two observation variables, linear combinations of &lt;cite&gt;x&lt;/cite&gt;
and &lt;cite&gt;y&lt;/cite&gt;:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;a&lt;/i&gt; = &lt;i&gt;x&lt;/i&gt; + &lt;i&gt;y&lt;/i&gt;
&lt;/div&gt;
&lt;p&gt;and&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;b&lt;/i&gt; = 2 &lt;i&gt;y&lt;/i&gt;
&lt;/div&gt;
&lt;p&gt;PCA is performed by applying an SVD (singular value decomposition) on the
observed data matrix:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;Y&lt;/i&gt; = [&lt;i&gt;a&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;&lt;i&gt;a&lt;/i&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;i&gt;a&lt;/i&gt;&lt;sub&gt;3&lt;/sub&gt;...; &lt;i&gt;b&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;&lt;i&gt;b&lt;/i&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;i&gt;b&lt;/i&gt;&lt;sub&gt;3&lt;/sub&gt;...]
&lt;/div&gt;
&lt;p&gt;This is equivalent to finding the eigenvalues and eigenvectors of
&lt;span class="formula"&gt;&lt;i&gt;Y&lt;/i&gt;&lt;sup&gt; &lt;i&gt;T&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;Y&lt;/i&gt;&lt;/span&gt;, the correlation matrix of the observed data. The
multidimensional (or multivariate, in statistical jargon) probability
density function of Y is written:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;p&lt;/i&gt;(&lt;i&gt;Y&lt;/i&gt;) ∼ &lt;i&gt;exp&lt;/i&gt;( − &lt;i&gt;r&lt;/i&gt;&lt;sup&gt; &lt;i&gt;T&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;M&lt;/i&gt; &lt;i&gt;r&lt;/i&gt;)
&lt;/div&gt;
&lt;p&gt;where &lt;cite&gt;r&lt;/cite&gt; is the position in the &lt;cite&gt;(a,b)&lt;/cite&gt; observation space, and &lt;cite&gt;M&lt;/cite&gt; the
correlation matrix. Diagonalizing the matrix &lt;cite&gt;M&lt;/cite&gt; corresponds to finding
a rotation matrix &lt;cite&gt;U&lt;/cite&gt; such that:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;p&lt;/i&gt;(&lt;i&gt;Y&lt;/i&gt;) ∼ &lt;i&gt;exp&lt;/i&gt;( − &lt;i&gt;r&lt;/i&gt;&lt;sup&gt; &lt;i&gt;T&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;U&lt;/i&gt;&lt;sup&gt; &lt;i&gt;T&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;S&lt;/i&gt; &lt;i&gt;U&lt;/i&gt; &lt;i&gt;r&lt;/i&gt;)
&lt;/div&gt;
&lt;p&gt;with &lt;cite&gt;S&lt;/cite&gt; a diagonal matrix. In other words, &lt;cite&gt;U&lt;/cite&gt; is a rotation of the
observation space to a basis in which the probability density
function is written:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;p&lt;/i&gt;(&lt;i&gt;Y&lt;/i&gt;) ∼ &lt;i&gt;exp&lt;/i&gt;( − &lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt; &lt;i&gt;σ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt; &lt;i&gt;r&lt;/i&gt;&lt;span class="scripts"&gt;&lt;sup class="script"&gt;2&lt;/sup&gt;&lt;sub class="script"&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;) = &lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator"&gt;∏&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt; &lt;i&gt;exp&lt;/i&gt;( − &lt;i&gt;σ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt; &lt;i&gt;r&lt;/i&gt;&lt;span class="scripts"&gt;&lt;sup class="script"&gt;2&lt;/sup&gt;&lt;sub class="script"&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;)
&lt;/div&gt;
&lt;p&gt;In this new basis, &lt;cite&gt;Y&lt;/cite&gt; can thus be interpreted as a sum of independent
normal processes of different variance.&lt;/p&gt;
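This equivalence between the SVD of the data matrix and the diagonalization of its correlation matrix is easy to check numerically on the toy model above (a = x + y, b = 2y); a minimal sketch in plain NumPy:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=10000)   # x ~ N(0, 1)
y = rng.normal(size=10000)   # y ~ N(0, 1)

# Observations a = x + y and b = 2y, stacked as the columns of Y
Y = np.column_stack([x + y, 2 * y])
Y -= Y.mean(axis=0)          # center before PCA

# PCA via the SVD of the data matrix ...
s = np.linalg.svd(Y, full_matrices=False)[1]
# ... is equivalent to diagonalizing Y^T Y: its eigenvalues are the
# squared singular values of Y
eigvals = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]

print(np.allclose(eigvals, s ** 2))
```

Here the two variances differ (the covariance of (a, b) is [[2, 2], [2, 4]]), which is precisely what lets PCA pick out the directions.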
&lt;p&gt;We can thus picture the PCA as a way of finding independent normal
processes. The different steps of the argument exposed above can be
pictured in the following figure:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="https://gael-varoquaux.info/science/attachments/ica_pca/pca_on_gaussian_data.png" style="width: 80%;" /&gt;
&lt;p class="caption"&gt;First we represent samples drawn from
&lt;cite&gt;x&lt;/cite&gt; and &lt;cite&gt;y&lt;/cite&gt; in their original space, the basis of the independent
variables. Then we represent the (&lt;cite&gt;a&lt;/cite&gt;, &lt;cite&gt;b&lt;/cite&gt;) samples, and we apply PCA on
these samples, to estimate the eigenvectors of the covariance matrix.
Then we represent the data projected in the basis estimated by PCA. One
important detail to note, is that after PCA, the data is most often
rescaled: each direction is divided by the corresponding sample standard
deviation identified by PCA. After this operation, all directions of
space play the same role, the data is spheric, or “white”.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;PCA was able to identify the original independent variables &lt;cite&gt;x&lt;/cite&gt; and &lt;cite&gt;y&lt;/cite&gt;
in the &lt;cite&gt;a&lt;/cite&gt; and &lt;cite&gt;b&lt;/cite&gt; samples only because they were mixed with different
variances. For an isotropic Gaussian model, any basis can describe the data
in terms of independent normal processes.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="pca-on-non-normal-data"&gt;
&lt;h2&gt;PCA on non normal data&lt;/h2&gt;
&lt;p&gt;More generally, the PCA algorithm can be understood as an algorithm
finding the direction of space with the highest sample variance, and
moving on to the orthogonal subspace of this direction to find the next
highest variance, and iteratively discovering an ordered orthogonal basis
of highest variance. This is well adapted to normal processes, as their
covariance is indeed diagonal in an orthogonal basis. In addition, the
resulting vectors come with a “PCA score”, i.e. the variance of the data
projected along the direction they define. Thus when using PCA for
dimension reduction, we can choose the subspace defined by the first &lt;cite&gt;n&lt;/cite&gt;
PCA vectors, on the basis that they explain a given percentage of the
variance, and that the subspace they define is the subspace of dimension
&lt;cite&gt;n&lt;/cite&gt; that explains the largest possible fraction of the total variance.&lt;/p&gt;
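That selection rule can be sketched in a few lines of NumPy (my own illustration; the 90% threshold and the synthetic low-rank data are arbitrary choices):

```python
import numpy as np

def n_components_for(X, fraction=0.9):
    """Smallest n such that the first n principal directions explain
    at least `fraction` (< 1) of the total variance."""
    X = X - X.mean(axis=0)
    s = np.linalg.svd(X, full_matrices=False)[1]  # singular values, descending
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(explained, fraction) + 1)

rng = np.random.RandomState(0)
# Synthetic data: two latent factors embedded in 5 dimensions, plus faint noise
latent = rng.normal(size=(1000, 2))
mixing = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0, 1.0]])
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 5))
print(n_components_for(X, fraction=0.9))   # 2: two directions carry ~all variance
```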
&lt;p&gt;However, on strongly non-Gaussian processes, the variance may not be the
quantity of interest.&lt;/p&gt;
&lt;p&gt;Let us consider the same model as above, with two independent variables
&lt;cite&gt;x&lt;/cite&gt; and &lt;cite&gt;y&lt;/cite&gt;, though with strongly non-Gaussian distributions. Here we
use a mixture of a narrow Gaussian and a wide one, to populate the tails:&lt;/p&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/science/attachments/ica_pca/non_gaussian_pdf.png" style="width: 40%;" /&gt;
&lt;p&gt;We can apply the same operations on these random variables: change of
basis to an observation basis made of &lt;cite&gt;a&lt;/cite&gt; and &lt;cite&gt;b&lt;/cite&gt;, and PCA on the
resulting sample:&lt;/p&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/science/attachments/ica_pca/pca_on_non_gaussian_data.png" style="width: 80%;" /&gt;
&lt;p&gt;We can see that PCA did not properly identify the original
independent variables. The variance criterion is not good enough when the
principal axes of the observed distribution are not orthogonal, as the
highest variance can be found in a direction mixing the two processes.
Indeed, the largest PCA direction is found slightly off-axis. In addition,
the second direction can only be found orthogonal to the first one, as
this is a restriction of PCA.&lt;/p&gt;
&lt;p&gt;On the other hand, the data after PCA is much more spherical than the
original data. No strong anisotropy is found in the central part of the
sample cloud, which contributes most to the variance.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="ica-independent-non-gaussian-variables"&gt;
&lt;h2&gt;ICA: independent, non-Gaussian variables&lt;/h2&gt;
&lt;p&gt;For strongly non-Gaussian processes, the above example shows that
separating independent processes should be done by looking at fine details
of the distribution, such as the tails. Indeed, after PCA, the Gaussian
parts of the processes have been separated by their variance, and the
resulting, rescaled, samples cannot be decomposed into independent processes
within a Gaussian model, as they all have the same variance, and would
already be considered independent under a Gaussian hypothesis.&lt;/p&gt;
&lt;p&gt;A popular class of algorithms to separate independent sources, called ICA
(independent component analysis), makes the simplification that finding
independent sources in such data can be reduced to finding maximally
non-Gaussian directions. Indeed, the central-limit theorem tells us that a
sum of non-Gaussian processes tends toward a Gaussian process. Conversely,
among multivariate samples of equal variance, the more non-Gaussian a signal
extracted from the data, the fewer independent (and non-Gaussian) variables
it mixes.&lt;/p&gt;
&lt;p&gt;A good discussion of these arguments can be found in the following paper:
&lt;a class="reference external" href="http://www.cis.hut.fi/aapo/papers/IJCNN99_tutorialweb/IJCNN99_tutorial3.html"&gt;http://www.cis.hut.fi/aapo/papers/IJCNN99_tutorialweb/IJCNN99_tutorial3.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;ICA is thus an optimization algorithm that extracts from the data the
direction with the least Gaussian PDF, removes the data explained by this
variable from the signal, and iterates.&lt;/p&gt;
&lt;p&gt;Applying ICA to the previous model yields the following:&lt;/p&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/science/attachments/ica_pca/ica_on_non_gaussian_data.png" style="width: 80%;" /&gt;
&lt;p&gt;We can see that ICA has correctly identified the original independent
variables. Its use of the tails of the distribution was paramount for
this task. In addition, ICA relaxes the constraint that all identified
directions must be perpendicular. This flexibility was also important to
match our data.&lt;/p&gt;
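This separation can be reproduced in a few lines with scikit-learn's FastICA (one popular ICA implementation). This is a sketch under my own assumptions: the heavy-tailed sources mimic the narrow/wide Gaussian mixture used above, and the mixing reuses the a = x + y, b = 2y model:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
n = 20000

def heavy_tailed(rng, n):
    """Mixture of a narrow and a wide Gaussian, to populate the tails."""
    wide = rng.uniform(size=n) < 0.1
    return np.where(wide,
                    rng.normal(scale=4.0, size=n),
                    rng.normal(scale=0.5, size=n))

# Two independent non-Gaussian sources, mixed non-orthogonally:
# a = x + y, b = 2y, as in the Gaussian example above
S = np.column_stack([heavy_tailed(rng, n), heavy_tailed(rng, n)])
A = np.array([[1.0, 1.0],
              [0.0, 2.0]])
X = S @ A.T

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)

# Up to sign, order and scale, each recovered component should line up
# with exactly one of the true sources
corr = np.abs(np.corrcoef(S.T, S_ica.T)[:2, 2:])
print(corr.round(2))
```

Each row of the correlation matrix has one entry close to 1 and one close to 0: ICA has recovered the sources despite the non-orthogonal mixing.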
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This discussion can now be seen as an &lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_vs_pca.html"&gt;example of the scikit-learn&lt;/a&gt;.
Thus you can replicate the figure using the code in the scikit.&lt;/p&gt;
&lt;/div&gt;
&lt;!-- vim:set spell:
vim:set autoindent: --&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="machine learning"></category><category term="scientific computing"></category><category term="selected"></category></entry><entry><title>General relativity, quantum physics, freely-falling planes and Bayesian statistics</title><link href="https://gael-varoquaux.info/science/general-relativity-quantum-physics-freely-falling-planes-and-bayesian-statistics.html" rel="alternate"></link><published>2009-12-08T22:20:00+01:00</published><updated>2009-12-08T22:20:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2009-12-08:/science/general-relativity-quantum-physics-freely-falling-planes-and-bayesian-statistics.html</id><summary type="html">&lt;p&gt;We’re famous: the &lt;a class="reference external" href="http://gael-varoquaux.info/science/acceleration-estimation-in-atom-interferometric-tests-of-the-einstein-equivalence-principle.html"&gt;work&lt;/a&gt; that concluded my PhD is now picked up by the
press &lt;a class="reference external" href="http://www.physorg.com/news179481148.html"&gt;http://www.physorg.com/news179481148.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I hadn’t realized before reading this journalist’s version of the story,
but we have all the proper buzz words:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;general relativity&lt;/li&gt;
&lt;li&gt;quantum physics&lt;/li&gt;
&lt;li&gt;freely-falling planes&lt;/li&gt;
&lt;li&gt;Bayesian …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;We’re famous: the &lt;a class="reference external" href="http://gael-varoquaux.info/science/acceleration-estimation-in-atom-interferometric-tests-of-the-einstein-equivalence-principle.html"&gt;work&lt;/a&gt; that concluded my PhD is now picked up by the
press &lt;a class="reference external" href="http://www.physorg.com/news179481148.html"&gt;http://www.physorg.com/news179481148.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I hadn’t realized before reading this journalist’s version of the story,
but we have all the proper buzz words:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;general relativity&lt;/li&gt;
&lt;li&gt;quantum physics&lt;/li&gt;
&lt;li&gt;freely-falling planes&lt;/li&gt;
&lt;li&gt;Bayesian statistics.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This kind of stuff makes great headlines, but the way we are judged on
this “success” is actually harmful (I believe), as there is so much
interesting research that lies away from the trendy words and that needs
to be done.&lt;/p&gt;
</content><category term="science"></category><category term="personnal"></category><category term="physics"></category><category term="science"></category></entry><entry><title>Acceleration estimation in atom-interferometric tests of the Einstein equivalence principle</title><link href="https://gael-varoquaux.info/science/acceleration-estimation-in-atom-interferometric-tests-of-the-einstein-equivalence-principle.html" rel="alternate"></link><published>2009-11-07T15:24:00+01:00</published><updated>2009-11-07T15:24:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2009-11-07:/science/acceleration-estimation-in-atom-interferometric-tests-of-the-einstein-equivalence-principle.html</id><summary type="html">&lt;p&gt;Hurray! The pivot article that marks my transition from physics to
statistical modeling is finally out:&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external" href="http://www.iop.org/EJ/article/1367-2630/11/11/113010/njp9_11_113010.pdf"&gt;How to estimate the differential acceleration in a two-species atom interferometer to test the equivalence principle&lt;/a&gt;
&lt;em&gt;G Varoquaux, R A Nyman, R Geiger, P Cheinet, A Landragin and P Bouyer&lt;/em&gt;&lt;/blockquote&gt;
&lt;p&gt;To put things in …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Hurray! The pivot article that marks my transition from physics to
statistical modeling is finally out:&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external" href="http://www.iop.org/EJ/article/1367-2630/11/11/113010/njp9_11_113010.pdf"&gt;How to estimate the differential acceleration in a two-species atom interferometer to test the equivalence principle&lt;/a&gt;
&lt;em&gt;G Varoquaux, R A Nyman, R Geiger, P Cheinet, A Landragin and P Bouyer&lt;/em&gt;&lt;/blockquote&gt;
&lt;p&gt;To put things in context, at the end of my PhD, we had been building an
atom interferometer to test the Einstein equivalence principle and my
reflections on the limits of atom interferometry shifted from worrying
about the underlying physics, to worrying about the estimation: the
inverse problem of going from the experimental signal, to the underlying
quantities that we are measuring, confounded by all the horrible
experimental noise.&lt;/p&gt;
&lt;div class="section" id="atoms-light-gravity-fields-and-free-fall-planes"&gt;
&lt;h2&gt;Atoms, light, gravity fields and free-fall planes&lt;/h2&gt;
&lt;p&gt;The problem is: we want to do high-precision metrological tests in a
free-falling plane. We use interferometry to measure gravity fields. But
rather than doing interferometry with light, we use atoms, which are much
more strongly coupled to gravity. When probing gravity fields with light, the
trick is to use huge highly-sensitive interferometers. For instance the
&lt;a class="reference external" href="http://www.ligo.caltech.edu/"&gt;ligo&lt;/a&gt; and &lt;a class="reference external" href="http://www.virgo.infn.it/"&gt;virgo&lt;/a&gt; projects are kilometer-long light interferometers
listening for gravitational waves, and the &lt;a class="reference external" href="http://www.ringlaser.org.nz/content/facilities.php"&gt;giant ring lasers&lt;/a&gt; can test
for tiny modifications in the Earth rotation and gravity field.
Gravimetric coupling with matter waves and light waves describes the
&lt;a class="reference external" href="http://www.turpion.org/php/paper.phtml?journal_id=pu&amp;amp;paper_id=6425"&gt;very same underlying physics&lt;/a&gt;. However, matter waves, atoms in
the case of my PhD, fall in gravity fields. While this is the expression of
the very phenomenon we are trying to measure, it also means that to
build a very large atom interferometer, you have to let the atoms fall
over a large distance. And I can attest that even laboratory-sized
versions of atom-interferometric experiments are fairly nasty to
run:&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/P1010619.jpg" style="width: 55%;" /&gt;
&lt;p&gt;This is why we simply decided to build an experiment in a
&lt;a class="reference external" href="http://arxiv.org/pdf/0705.2922"&gt;freely-falling plane&lt;/a&gt;: let’s fall with the atoms for 6
kilometers (30 seconds).&lt;/p&gt;
&lt;img alt="" src="http://gael-varoquaux.info/physics/ICELog/07/0328/DSCF0662.jpg" style="width: 40%;" /&gt;
&lt;img alt="" src="http://gael-varoquaux.info/physics/ICELog/07/0327/100_6838.jpg" style="width: 40%;" /&gt;
&lt;/div&gt;
&lt;div class="section" id="measuring-free-fall-while-in-free-fall"&gt;
&lt;h2&gt;Measuring free fall, while in free fall?&lt;/h2&gt;
&lt;img alt="" class="align-right" src="attachments/coyote.png" /&gt;
&lt;p&gt;Of course, the plane is not really in free fall. The pilots try as hard
as possible to compensate for drag and atmospheric turbulence, but there
is a limit to what they can achieve with an Airbus. The atoms are in a
vacuum apparatus, so they are indeed in free fall (before they crash into
the side of the apparatus). However, making sense of free-fall measurements
made relative to an unstable and unpredictable platform is not trivial.
This is where the statistical modeling kicked in. After reading a bit
about noise in interferometers, I realized that we had a well-known
problem in statistics: estimation of hidden variables from noisy
observations. I learned about &lt;a class="reference external" href="http://www.google.fr/url?sa=t&amp;amp;source=web&amp;amp;ct=res&amp;amp;cd=1&amp;amp;ved=0CAcQFjAA&amp;amp;url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FRecursive_Bayesian_estimation&amp;amp;ei=S331StLdCof34Ab117i4BA&amp;amp;usg=AFQjCNFeQT7-ruBii_IfqL5C7smW9jBL3Q&amp;amp;sig2=fYSw1ieKbBFPLqnoBEsdEQ"&gt;recursive Bayesian estimation&lt;/a&gt;, coded a
proof-of-principle algorithm for our problem (in Python, of course), and
was sold. The rest of the story is about noise simulations, and trying
to convince a metrology community that you could perform good
measurements in a noisy environment.&lt;/p&gt;
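&lt;p&gt;To make the idea concrete, here is a minimal sketch of recursive Bayesian
estimation: a one-dimensional Kalman filter tracking a hidden constant from
noisy readings. This is an illustration only, not the algorithm from the
article; the noise variances and the &#8220;gravity&#8221; value are made up.&lt;/p&gt;

```python
import random

def kalman_1d(observations, q=1e-4, r=0.25):
    """Minimal 1-D Kalman filter: track a hidden value from noisy readings.

    q: process-noise variance, r: measurement-noise variance (assumed known).
    """
    x, p = 0.0, 1.0              # initial state estimate and its variance
    estimates = []
    for z in observations:
        p += q                   # predict step: uncertainty grows
        k = p / (p + r)          # Kalman gain: how much to trust the data
        x += k * (z - x)         # update the estimate with the observation
        p *= (1.0 - k)           # uncertainty shrinks after the update
        estimates.append(x)
    return estimates

random.seed(0)
true_value = 9.81                # hidden value (think: local gravity)
noisy = [true_value + random.gauss(0, 0.5) for _ in range(200)]
est = kalman_1d(noisy)
print(f"final estimate: {est[-1]:.3f}")
```

&lt;p&gt;Each new observation refines the running estimate, which is what makes
this family of methods attractive for data coming from an unstable
platform.&lt;/p&gt;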
&lt;p&gt;It took us a lot of time (2 years) to write an article that was
acceptable to the target scientific community, while keeping the core
estimation and statistics message. Publishing new ideas is hard, because
you are not answering questions that people already have in mind. This
is why the fact that &lt;a class="reference external" href="http://www.iop.org/EJ/abstract/1367-2630/11/11/113010"&gt;this article&lt;/a&gt; is out is a huge deal for me. It
marks a turning point in my thinking: I switched from worrying only
about forward models, with which you try to describe the system at hand
as well as possible, to inverse problems, in which you worry about
estimating the parameters from the data.&lt;/p&gt;
&lt;p&gt;I was startled to see that people are ready to spend huge amounts of
money and effort on improving complicated experiments involving quantum
physics and very sophisticated technology, but can be wary of
processing the output signal to increase statistical power. Scientific
communities have their own goals that they pitch (e.g. reducing the
phase noise in lasers) and there can be huge divides between different
scientific interests. Realizing this played an important role in &lt;a class="reference external" href="http://gael-varoquaux.info/personnal/update-on-my-life.html"&gt;my
career shift&lt;/a&gt;. I wanted to know more about the power of statistical
modeling and machine learning applied to real-life systems. I decided
that to learn more, I had to work with people who had a different
culture from mine. It’s been a huge amount of fun so far… More about
that later.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="personnal"></category><category term="science"></category><category term="physics"></category><category term="scientific computing"></category></entry><entry><title>What’s wrong with young academic careers in France</title><link href="https://gael-varoquaux.info/science/whats-wrong-with-young-academic-careers-in-france.html" rel="alternate"></link><published>2008-10-13T22:36:00+02:00</published><updated>2008-10-13T22:36:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2008-10-13:/science/whats-wrong-with-young-academic-careers-in-france.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="http://cournape.wordpress.com/"&gt;David&lt;/a&gt; just blogged a link to an &lt;a class="reference external" href="http://insidehighered.com/views/2008/09/15/altbach"&gt;article&lt;/a&gt; about careers in higher
education. I thought the paragraph on the French system was so much to
the point that I would like to quote it entirely here:&lt;/p&gt;
&lt;blockquote&gt;
In France, the access to a first permanent position as &lt;em&gt;maître de
conférences&lt;/em&gt; occurs …&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="http://cournape.wordpress.com/"&gt;David&lt;/a&gt; just blogged a link to an &lt;a class="reference external" href="http://insidehighered.com/views/2008/09/15/altbach"&gt;article&lt;/a&gt; about careers in higher
education. I thought the paragraph on the French system was so much to
the point that I would like to quote it entirely here:&lt;/p&gt;
&lt;blockquote&gt;
In France, the access to a first permanent position as &lt;em&gt;maître de
conférences&lt;/em&gt; occurs rather early compared with other countries (on
average prior to the age of 33 years) and opens the path to 35 to 40
years of an academic career. These recruitments happen after a
period of high uncertainty as in almost all disciplines the ratio of
“open positions per doctors” has worsened, while the doctoral degree
is still not recognized as a qualification by businesses or the
public sector. Recruiting a new &lt;em&gt;maître de conférences&lt;/em&gt; thus
constitutes a high-stakes decision. But currently university
departments have about two months to examine the candidates, select
some of them, hold a 20- to 30-minute interview with those on the
short list, and rank the best ones. Despite the highly selective
process that the first candidate on the list successfully passes,
this new colleague is rarely considered as a chance on which to
build by the recruiting university. Not only is the salary based on
a national bureaucratic scale below the average GDP per capita for
France, but new academics are frequently not offered a personal
office and may be asked to teach the classes colleagues do not want
to offer or to accept administrative duties. The difficult road
toward the doctorate leads to a rather disappointing and frequently
non-well-remunerated situation, thus undermining the attractiveness
of the career.&lt;/blockquote&gt;
&lt;p&gt;I don&#8217;t regret doing a PhD, but I think the current situation needs to
be stressed, especially to future PhD students: it is a high-risk,
little-gain career. You had better really love what you&#8217;ll be doing. And
keep an exit door in mind.&lt;/p&gt;
</content><category term="science"></category><category term="science"></category><category term="scientific computing"></category></entry><entry><title>LaTeX files of my PhD thesis</title><link href="https://gael-varoquaux.info/science/latex-files-of-my-phd-thesis.html" rel="alternate"></link><published>2008-04-01T00:00:00+02:00</published><updated>2008-04-01T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2008-04-01:/science/latex-files-of-my-phd-thesis.html</id><summary type="html">&lt;p class="first last"&gt;The main files of my phd thesis, to give an example of the LaTeX code used&lt;/p&gt;
</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="attachments/gaeltex.zip"&gt;Here&lt;/a&gt; are the main files I use for writing
&lt;a class="reference external" href="http://tel.archives-ouvertes.fr/tel-00265714"&gt;my PhD thesis&lt;/a&gt; with
LaTeX. I am not publishing them on the net as a model of what to do:
by the end I was too much in a hurry to do a good job, and I hacked
kludges all over the code (it no longer compiles without
overflows).&lt;/p&gt;
&lt;p&gt;What turned out to be very handy was the use of the &lt;a class="reference external" href="http://www.ctan.org/tex-archive/macros/latex/contrib/memoir/"&gt;memoir package&lt;/a&gt;. It
allowed me just enough customization while staying compact. In order to
make it work with some other packages I use, I had to hack it a bit
(horrible kludges again).&lt;/p&gt;
&lt;p&gt;You need the Garamond fonts installed to build this (they are used for
the epigraphs). I use my own version.&lt;/p&gt;
&lt;p&gt;Don&#8217;t e-mail me to debug the problems you get by copying the kludges in
here. This is ugly code that I put out because people were asking for
it.&lt;/p&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category></entry><entry><title>Mission accomplished</title><link href="https://gael-varoquaux.info/science/mission-accomplished.html" rel="alternate"></link><published>2008-01-19T11:59:00+01:00</published><updated>2008-01-19T11:59:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2008-01-19:/science/mission-accomplished.html</id><content type="html">&lt;p&gt;I defended my PhD yesterday. I am pretty happy to be done with this.&lt;/p&gt;
&lt;a class="reference external image-reference" href="../science/attachments/talking1.jpg"&gt;&lt;img alt="" src="../science/attachments/talking1.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;After the defense, the other PhD students offered me a plastic python
(well, it was a cobra, actually, but they told me to pretend it was a
Python).&lt;/p&gt;
&lt;a class="reference external image-reference" href="../science/attachments/gael_with_python.jpg"&gt;&lt;img alt="" src="../science/attachments/gael_with_python.jpg" /&gt;&lt;/a&gt;
</content><category term="science"></category><category term="personnal"></category><category term="science"></category><category term="physics"></category></entry><entry><title>Garamond fonts for LaTeX</title><link href="https://gael-varoquaux.info/science/garamond-fonts-for-latex.html" rel="alternate"></link><published>2006-10-01T00:00:00+02:00</published><updated>2006-10-01T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2006-10-01:/science/garamond-fonts-for-latex.html</id><summary type="html">&lt;p class="first last"&gt;An easy to install version of Garamond fonts for LaTeX&lt;/p&gt;
</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Garamond"&gt;Garamond fonts&lt;/a&gt; are a large
family of fonts. At a friend’s request I modified the &lt;a class="reference external" href="ftp://dante.ctan.org/tex-archive/fonts/urw/garamond/"&gt;URW-garamond&lt;/a&gt; fonts to improve
kerning, add old style numbers, and make some letters prettier. These
fonts are available under the &lt;a class="reference external" href="http://www.cs.wisc.edu/~ghost/doc/cvs/Public.htm"&gt;Aladdin Free Public License&lt;/a&gt;, which states, if I
understand it correctly, that you can use and modify the fonts freely for
non-commercial purposes.&lt;/p&gt;
&lt;p&gt;Here is &lt;a class="reference external" href="attachments/baudelaire.pdf"&gt;a pdf file&lt;/a&gt; that gives an example
of the fonts.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Questions and suggestions&lt;/p&gt;
&lt;p&gt;I made this font in 2006. Time has passed, and I have completely
forgotten the skills required to modify it. I cannot go anywhere
beyond providing the file for download. Sorry: if you send me a kind
email mentioning that the accents or the numbers are not right, I will
be unable to address it.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="instructions-for-use-with-pdflatex"&gt;
&lt;h2&gt;Instructions for use with pdfLaTeX&lt;/h2&gt;
&lt;p&gt;The standard procedure for installing new fonts in a LaTeX installation
is quite complicated and varies from one LaTeX distribution to another.&lt;/p&gt;
&lt;p&gt;I strongly suggest that you install the fonts only in your document&#8217;s
folder. This makes your document portable: as long as you give the
complete folder to your colleagues, they will be able to compile it.&lt;/p&gt;
&lt;p&gt;If you want to install the fonts in the TeXMF tree (so that all documents
compiled on your installation have access to the fonts), I assume you know
TeX well enough to perform the installation without further help.&lt;/p&gt;
&lt;div class="section" id="installing-in-the-current-folder"&gt;
&lt;h3&gt;Installing in the current folder&lt;/h3&gt;
&lt;p&gt;Here is an easy way to install the fonts in your document’s folder (this
will only work if you are using pdfLaTeX):&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/garamond.zip"&gt;Here&lt;/a&gt; is a package to use these fonts with LaTeX.&lt;/p&gt;
&lt;p&gt;Unzip &lt;em&gt;garamond.zip&lt;/em&gt; in the same folder as the LaTeX document you
are working on.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="using-in-a-latex-document"&gt;
&lt;h3&gt;Using in a LaTeX document&lt;/h3&gt;
&lt;p&gt;In your LaTeX file, include the package “garamond”:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;\usepackage&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;garamond&lt;span class="nb"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You also need to use the T1 font encoding:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;\usepackage&lt;/span&gt;&lt;span class="na"&gt;[T1]&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;fontenc&lt;span class="nb"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The garamond package defines a new command &lt;tt class="docutils literal"&gt;\garamond&lt;/tt&gt; that switches
the font in the current group to garamond. Here is a minimal example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;\documentclass&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;article&lt;span class="nb"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;\usepackage&lt;/span&gt;&lt;span class="na"&gt;[T1]&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;fontenc&lt;span class="nb"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;\usepackage&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;lmodern&lt;span class="nb"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;\usepackage&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;garamond&lt;span class="nb"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;\begin&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;document&lt;span class="nb"&gt;}&lt;/span&gt;

&lt;span class="nb"&gt;{&lt;/span&gt;&lt;span class="k"&gt;\garamond&lt;/span&gt;
The Quick Brown Fox Jumps Over The Lazy Dog. 0123456789 &lt;span class="k"&gt;\\&lt;/span&gt;
    &lt;span class="nb"&gt;{&lt;/span&gt;&lt;span class="k"&gt;\slshape&lt;/span&gt; This is garamond slanted&lt;span class="nb"&gt;}&lt;/span&gt; &lt;span class="k"&gt;\\&lt;/span&gt;
    &lt;span class="nb"&gt;{&lt;/span&gt;&lt;span class="k"&gt;\bfseries&lt;/span&gt; This is garamond bold face&lt;span class="nb"&gt;}&lt;/span&gt; &lt;span class="k"&gt;\\&lt;/span&gt;
    &lt;span class="nb"&gt;{&lt;/span&gt;&lt;span class="k"&gt;\scshape&lt;/span&gt; This is in small caps&lt;span class="nb"&gt;}&lt;/span&gt; &lt;span class="k"&gt;\\&lt;/span&gt;
    &lt;span class="nb"&gt;{&lt;/span&gt;&lt;span class="k"&gt;\slshape&lt;/span&gt; &lt;span class="k"&gt;\bfseries&lt;/span&gt; This is slanted and bold face&lt;span class="nb"&gt;}&lt;/span&gt; &lt;span class="k"&gt;\\&lt;/span&gt;
&lt;span class="nb"&gt;}&lt;/span&gt;
And this is written with the latin modern fonts.

&lt;span class="k"&gt;\garamond&lt;/span&gt;

Here we switch to garamond.
&lt;span class="k"&gt;\ungaramond&lt;/span&gt;

Here we switch back to the default.

&lt;span class="k"&gt;\end&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;document&lt;span class="nb"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;img alt="minimal example of a LaTeX file using garamond fonts" class="align-center" src="attachments/minimal.png" /&gt;
&lt;p&gt;One remark on this example: you should never, ever, use the standard
out-of-the-box T1 fonts with pdfLaTeX; they look ugly. Always include the
&#8220;lmodern&#8221; or &#8220;pslatex&#8221; package, which use much better PostScript fonts.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category><category term="selected"></category></entry><entry><title>Timing problems with a computer</title><link href="https://gael-varoquaux.info/science/timing-problems-with-a-computer.html" rel="alternate"></link><published>2006-03-20T00:00:00+01:00</published><updated>2006-03-20T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2006-03-20:/science/timing-problems-with-a-computer.html</id><summary type="html">&lt;p class="first last"&gt;Simple experiments on real-time computing, to put in the perspective of the computer-control of an experiment&lt;/p&gt;
</summary><content type="html">&lt;p&gt;Computers are very versatile beasts. Physicists are tempted to use them
to do real-time signal processing and for instance implement a
feedback-loop on an instrument. If the frequencies are above 10Hz this is
not as easy as one might think (after they run at several gHz). I will
try to explore some difficulties here.&lt;/p&gt;
&lt;p&gt;Remember, these are just the ramblings of a physics PhD student. I have
little formal training in IT, so don&#8217;t hesitate to correct me if I didn&#8217;t
get things right.&lt;/p&gt;
&lt;div class="section" id="operating-systems-timing-and-latencies"&gt;
&lt;h2&gt;Operating systems, timing and latencies&lt;/h2&gt;
&lt;p&gt;If you want to build an I/O system that interacts in real-time with
external devices you will want to control the timing of the signals you
send to the instruments.&lt;/p&gt;
&lt;p&gt;Computers are not good at generating events with precise timing. This is
because modern operating systems share the processor time
between a large number of tasks. Your process does not completely control
the computer: it has to ask the operating system for time. The
operating system shares time between different processes, but it also has
some internal tasks to do (like allocating memory). All these
operations may not complete in a predictable amount of time &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-1"&gt;[2]&lt;/a&gt;, and they make it
harder for a process to produce an event (e.g. a hardware output signal) at
a precise instant.&lt;/p&gt;
&lt;p&gt;One solution is to run the program on a single-task
operating system, like DOS. Even then you have to be careful,
as system operations requested by your program may not return in a
controlled amount of time. The proper solution is to use a &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Real-time_operating_system"&gt;hard real-time
operating system&lt;/a&gt;, but this
forces us to use a dedicated system and makes the job much harder, as we
cannot use standard programming techniques and libraries.&lt;/p&gt;
&lt;p&gt;I will attempt to study the limitations of a simple approach, using
standard operating systems and programming techniques, to put numbers on
the performance one can expect.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="real-time-clock-interrupt-latency"&gt;
&lt;h2&gt;Real-time clock interrupt latency&lt;/h2&gt;
&lt;p&gt;The right tool to control timing under linux is the “real time clock”
&lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-2"&gt;[3]&lt;/a&gt;. It can be used to generate interrupts at a given frequency or
instant.&lt;/p&gt;
&lt;p&gt;To quote Wikipedia: “in computing, an interrupt is an asynchronous signal
from hardware indicating the need for attention or a synchronous event in
software indicating the need for a change in execution”. In our case the
interrupt is a signal generated by the real time clock that is trapped by
a process.&lt;/p&gt;
&lt;p&gt;I have run a few experiments on the computers I have available to test
the reliability of the timing of these interrupts, that is, the time it
takes for the process to receive the interrupt. This is known as &#8220;interrupt
latency&#8221; (for more details see &lt;a class="reference external" href="http://lwn.net/Articles/139784/"&gt;this article&lt;/a&gt;), and it limits both the response
time and the timing accuracy of a program that does not monopolize the
CPU, as it corresponds to the time needed for the OS to hand control back
to the program.&lt;/p&gt;
&lt;div class="section" id="the-experiment-and-the-results"&gt;
&lt;h3&gt;The experiment and the results&lt;/h3&gt;
&lt;p&gt;I used a test program to measure interrupt latency &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-3"&gt;[4]&lt;/a&gt; on Linux. The test
code first sets the highest scheduling priority it can, then asks to be
woken up at a given frequency &lt;em&gt;f&lt;/em&gt; by the real-time clock. It checks the
real-time clock to see whether it was really woken up when it asked to be. It
computes the difference between the measured delay between two interrupts
and the theoretical one, &lt;em&gt;1/f&lt;/em&gt;. Here is a histogram of the delays
on different systems. The delay is plotted in units of the period &lt;em&gt;1/f&lt;/em&gt;.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/real_time_results.png" /&gt;
&lt;p&gt;While the code was running I put some stress on the system: pinging
google.com, copying data to the disk, and computing an md5 hash. This
is not supposed to be representative of any particular use; I just did not
want the system to be idle aside from my test code. The tests were run
under a GNOME session but without any user action.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="interpretation-of-the-results"&gt;
&lt;h3&gt;Interpretation of the results&lt;/h3&gt;
&lt;p&gt;I am no kernel guru, so my interpretations may be imprecise, but I can
see that the results are pretty bad.&lt;/p&gt;
&lt;p&gt;There is a jitter that can reach half a period at 1 kHz. Depending on
how narrow a linewidth your &#8220;digital oscillator&#8221; needs, this jitter sets a
limit on the frequency up to which the computer can be used as a &#8220;digital
oscillator&#8221;.&lt;/p&gt;
&lt;p&gt;This also tells us that an interrupt request takes on average 0.5 ms to
get through to the program it targets. This lets us estimate the
time it takes for an event (for instance one generated by an I/O card) to
reach a program, if that program is not currently running.&lt;/p&gt;
&lt;p&gt;Keep in mind that this experiment only measures the jitter and frequency
offset due to software imperfections (kernel, i.e. operating-system related);
on top of this you must add all the I/O bus and buffer problems if you
want to control an external device.&lt;/p&gt;
&lt;p&gt;It is interesting to see how the results vary from one computer to
another. Quite clearly omega&#8217;s RTC is not working properly; this is
probably due to driver problems. Beta has good results, probably
thanks to its pre-emptible kernel. The results of our computer
(digamma) are surprisingly bad. It is a powerful 4-CPU computer. It seems
to me that the process may be getting relocated from one CPU to another,
which generates large jitter. Aramis is a 2-CPU (+ multithreading, that&#8217;s
why it appears as 4) box, and it performs much better. The CPUs are
different, and the kernel versions are different, but I would expect more
recent kernels to fare better.&lt;/p&gt;
&lt;blockquote&gt;
&lt;strong&gt;The take-home message: do not trust computers under the millisecond.&lt;/strong&gt;&lt;/blockquote&gt;
&lt;p&gt;Other sources have indeed confirmed that with a standard Linux kernel, at
the time of writing (Linux 2.6.18), interrupt latency is of the order
of the millisecond. The &#8220;RT_PREEMPT&#8221; compile switch has been measured to
drop the interrupt latency to 50 microseconds, which is of the order of
the hardware limit.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="implications-of-this-jitter"&gt;
&lt;h3&gt;Implications of this jitter&lt;/h3&gt;
&lt;p&gt;These histograms can be seen as frequency spectra of the signal generated
by the computer.&lt;/p&gt;
&lt;p&gt;We can see that the signal created can be slightly off in frequency (the
peak is not always centered on zero): the RTC is not well calibrated.
This should not be a major problem if the offset is repeatable, as it can
be measured and taken into account.&lt;/p&gt;
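&lt;p&gt;Correcting a repeatable offset is simple arithmetic; here is a toy
sketch, with made-up numbers rather than measurements from these
machines:&lt;/p&gt;

```python
# If the RTC's frequency offset is repeatable, estimate it as the mean
# deviation and subtract it, leaving only the residual jitter.
deviations = [0.12, 0.10, 0.15, 0.09, 0.13, 0.11]  # made-up, in periods

offset = sum(deviations) / len(deviations)     # systematic miscalibration
corrected = [d - offset for d in deviations]   # centered residual jitter

print(f"estimated offset: {offset:.3f} periods")
```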
&lt;p&gt;We can see that the spectrum has a non-negligible width at high
frequency. This means that in a servo-loop-like system the computer will
add high-frequency noise at around 1 kHz. It also means that the
timing of a computer-generated event cannot be trusted at the millisecond
level.&lt;/p&gt;
&lt;p&gt;However, it is interesting to note that very few events fall outside the
+/- 1 period range. This means that the computer does not skip a beat very
often: it does perform the work reliably, it just does not
deliver it on time. As a consequence, if we correct for this jitter the
computer can act as a servo loop up to 1 kHz. The preempt kernel performs
very well in terms of reliability, even though it runs on an old box with
little computing power.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="dealing-with-the-jitter"&gt;
&lt;h3&gt;Dealing with the jitter&lt;/h3&gt;
&lt;p&gt;First, we could try to correct for the jitter with a software trick. For
instance, we could ask for the interrupt in advance, and block the CPU by
busy-waiting (to ensure that the scheduler does not schedule us out)
until the exact moment comes.&lt;/p&gt;
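&lt;p&gt;In Python, the trick could look like the sketch below. This is an
illustration only; a real implementation would sit much closer to the
kernel, and &lt;tt class="docutils literal"&gt;time.monotonic&lt;/tt&gt; is far coarser
than an RTC interrupt.&lt;/p&gt;

```python
import time

def wait_until(deadline, margin=0.002):
    """Sleep until `margin` seconds before `deadline` (a time.monotonic()
    value), then busy-wait: the scheduler cannot swap us out unnoticed
    during the final, CPU-burning stretch."""
    remaining = deadline - time.monotonic() - margin
    if remaining > 0:
        time.sleep(remaining)    # coarse wait: lets other tasks run
    while time.monotonic() < deadline:
        pass                     # busy-wait: imprecision shrinks drastically

start = time.monotonic()
deadline = start + 0.01          # target an event 10 ms from now
wait_until(deadline)
overshoot = time.monotonic() - deadline
print(f"overshoot: {overshoot * 1e6:.1f} microseconds")
```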
&lt;p&gt;Another option is to use an I/O device with an embedded clock that
corrects for the jitter, for instance a hardware-triggered acquisition
card. I prefer this solution as it is more versatile and scalable.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This brings us to something that seems to be quite general with real-time
computer control: buffers and external clocks. The computer has the
processing power to do the work in the required amount of time. The
buffer and the external clock correct for the jitter introduced by the
software.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Finally, recompiling the kernel with the RT-preempt patch would probably
help a lot, given that it reduces the interrupt latency by two orders of
magnitude.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="technical-details-about-the-experiment"&gt;
&lt;h3&gt;Technical details about the experiment&lt;/h3&gt;
&lt;div class="section" id="the-measuring-code"&gt;
&lt;h4&gt;The measuring code&lt;/h4&gt;
&lt;p&gt;The way this works: a small C program (borrowed and adapted from
Andrew Morton&#8217;s &#8220;realfeel.c&#8221;) asks for the highest scheduling priority it
can get, then sets the real-time clock to generate an interrupt at a given
frequency. It then loops, waiting for the real-time clock (RTC). The OS
schedules other tasks during the waiting period, but when the interrupt
is generated by the RTC the OS gives the CPU back to the program. The
program then compares the delay since it last received the interrupt with
the expected period, and stores the difference. The results are stored in a
histogram file.&lt;/p&gt;
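&lt;p&gt;For readers without RTC access at hand, the same kind of bookkeeping can
be approximated in a few lines of Python, using
&lt;tt class="docutils literal"&gt;time.sleep&lt;/tt&gt; instead of RTC interrupts; this is
much coarser than the C code described above, but the logic is the same.&lt;/p&gt;

```python
import time

def measure_sleep_jitter(freq=100, n=50):
    """Ask the OS to wake us every 1/freq seconds and record by how much
    each wake-up misses the requested period.  A rough, portable analogue
    of the RTC measurement; time.sleep() has far worse granularity."""
    period = 1.0 / freq
    deltas = []
    last = time.monotonic()
    for _ in range(n):
        time.sleep(period)
        now = time.monotonic()
        deltas.append((now - last) - period)   # signed error vs. 1/freq
        last = now
    return deltas

deltas = measure_sleep_jitter()
print(f"worst wake-up error: {max(deltas) * 1e3:.3f} ms")
```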
&lt;/div&gt;
&lt;div class="section" id="the-stress-code"&gt;
&lt;h4&gt;The stress code&lt;/h4&gt;
&lt;p&gt;I have a very ugly way of putting stress on the computer, so that the
kernel actually schedules other tasks. I did not put tremendous stress
on the CPU, as I wanted to simulate standard use cases. This is how I
did it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;i&lt;span class="w"&gt; &lt;/span&gt;&amp;lt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;i++&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;do&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;ping&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;www.google.com&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;&amp;amp;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;dd&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/urandom&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;bs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1M&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;40&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;md5sum&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;&amp;amp;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;dd&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/zero&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;of&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/foo&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;bs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1M&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;500&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;sync
rm&lt;span class="w"&gt; &lt;/span&gt;/tmp/foo
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Three tasks running in parallel: pinging google, computing the md5 hash
of a random chunk of bits (which also means generating it), and writing
500 MB to the disk. If the system and the network are fast enough the
first two tasks finish before the last one. This is done on purpose.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="making-your-own-measurements"&gt;
&lt;h3&gt;Making your own measurements&lt;/h3&gt;
&lt;p&gt;You can reproduce the histograms under Linux by running the
&#8220;stresstest.sh&#8221; script from the &lt;a class="reference external" href="attachments/real_time_stress_test.zip"&gt;following archive&lt;/a&gt;. The plots can be obtained by
running the &#8220;process.py&#8221; Python script (requires scipy and matplotlib).
You may have to increase the real-time clock frequency user limit. You
can do this by running (as root) &#8220;echo 1024 &amp;gt;
/proc/sys/dev/rtc/max-user-freq&#8221;.&lt;/p&gt;
&lt;p&gt;Send me the results directory created by the &#8220;stresstest.sh&#8221; script on
your box; I am very interested in gathering more statistics.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The jitter measurement is interesting not because it shows the absolute
limit of the technology (hard real-time OSes, like RTLinux, could go much
further), but because it shows the performance achievable with simple
techniques. Looking at this data I would say that anything with
frequencies below 10 to 100 Hz is fairly easy to achieve with the RTC
interrupts, anything around a few kilohertz can be done with a bit more
work, and anything above that requires a lot of work.&lt;/p&gt;
&lt;p&gt;My current policy is to try to move to embedded devices anything with
speeds above 10Hz.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I would like to thank Nicolas George for enlightening discussions on
these matters, as well as for useful questions on the purpose of this
experiment. I would also like to thank David Cournapeau for pointing me to
interesting references, and the Linux Audio Developer mailing list for more
information.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;[1]&lt;/td&gt;&lt;td&gt;Wikipedia article on real-time computing:
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Real-time_computing"&gt;http://en.wikipedia.org/wiki/Real-time_computing&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;A very clear article about fighting latency in the Linux kernel:
&lt;a class="reference external" href="http://lac.zkm.de/2006/papers/lac2006_lee_revell.pdf"&gt;http://lac.zkm.de/2006/papers/lac2006_lee_revell.pdf&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;About the RTC: &lt;a class="reference external" href="http://www.die.net/doc/linux/man/man4/rtc.4.html"&gt;http://www.die.net/doc/linux/man/man4/rtc.4.html&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;What this code is actually measuring is, in technical terms, the
interrupt latency, that is, the time it takes for the kernel to catch
the interrupt, and the rescheduling latency, that is, the time it takes
for the kernel to reschedule from one process to another.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-5" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;[5]&lt;/td&gt;&lt;td&gt;A different benchmark, which probably studies the
intrinsic kernel limits more directly than my code does: &lt;a class="reference external" href="http://lwn.net/Articles/139403/"&gt;http://lwn.net/Articles/139403/&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-6" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;[6]&lt;/td&gt;&lt;td&gt;Another benchmark, which also covers the RT-preempt patch and
shows the impressive improvements achieved with this patch:
&lt;a class="reference external" href="http://kerneltrap.org/node/5466"&gt;http://kerneltrap.org/node/5466&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-7" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;[7]&lt;/td&gt;&lt;td&gt;A course on real-time computing, with the lecture notes.
&lt;a class="reference external" href="http://lamspeople.epfl.ch/decotignie/#InfoTR"&gt;http://lamspeople.epfl.ch/decotignie/#InfoTR&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="linux"></category><category term="science"></category></entry></feed>