<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Gaël Varoquaux - science</title><link href="https://gael-varoquaux.info/" rel="alternate"></link><link href="https://gael-varoquaux.info/feeds/science.atom.xml" rel="self"></link><id>https://gael-varoquaux.info/</id><updated>2026-01-02T00:00:00+01:00</updated><entry><title>2025 highlights: AI research and code</title><link href="https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html" rel="alternate"></link><published>2026-01-02T00:00:00+01:00</published><updated>2026-01-02T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2026-01-02:/science/2025-highlights-ai-research-and-code.html</id><summary type="html">&lt;div class="figure align-right"&gt;
&lt;img alt="" class="small" src="attachments/2025_highlights/eiffel_tower_ai.jpg" /&gt;
&lt;p class="caption"&gt;AI is everywhere. Can you see it here?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Some highlights about my work in 2025: progress on
tabular-learning stands out, a publication on unpacking trade-off and
consequences of scale in AI, and of course progress on the open-source
data-science and machine learning stack.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2026 starts, I’m looking …&lt;/p&gt;</summary><content type="html">&lt;div class="figure align-right"&gt;
&lt;img alt="" class="small" src="attachments/2025_highlights/eiffel_tower_ai.jpg" /&gt;
&lt;p class="caption"&gt;AI is everywhere. Can you see it here?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Some highlights about my work in 2025: progress on
tabular-learning stands out, a publication on unpacking trade-off and
consequences of scale in AI, and of course progress on the open-source
data-science and machine learning stack.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2026 starts, I’m looking back on 2025. It was all about AI, with
research in the &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;soda team&lt;/a&gt; on tabular
machine learning stimulating better software.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#beyond-maths-unpacking-the-scale-narrative-in-ai" id="toc-entry-1"&gt;Beyond maths: Unpacking the scale narrative in AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabular-learning-research" id="toc-entry-2"&gt;Tabular-learning research&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-open-source-table-foundation-model" id="toc-entry-3"&gt;TabICL:  open-source table foundation model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#retrieve-merge-predict-tradeoffs-of-predictions-from-data-lakes" id="toc-entry-4"&gt;Retrieve merge predict: tradeoffs of predictions from data lakes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#growing-the-machine-learning-and-data-science-stack" id="toc-entry-5"&gt;Growing the machine learning and data science stack&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#skrub-machine-learning-with-tables" id="toc-entry-6"&gt;Skrub: machine learning with tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#fundamental-progress-in-scikit-learn" id="toc-entry-7"&gt;Fundamental progress in scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="beyond-maths-unpacking-the-scale-narrative-in-ai"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Beyond maths: Unpacking the scale narrative in AI&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Plotting the increase in scale of notable AI systems over recent
years reveals a staggering explosion. AI systems have been growing
super-exponentially along a variety of dimensions: training compute,
training cost (figure below), inference cost, and amount of data used.
Studying the wording used in pivotal publications as well as company
communications shows that it anchors AI success in this growth, thus
&lt;strong&gt;setting implicit social norms around scale&lt;/strong&gt;. But
systematic analysis of benchmark results shows that &lt;strong&gt;scale does
not always bring benefits&lt;/strong&gt;. The narrative of scale is
oversimplified and leaves aside many important ingredients of the success
of AI systems. In addition, the race for scale comes with planetary and
societal consequences, which we study and &lt;a class="reference external" href="https://dl.acm.org/doi/10.1145/3715275.3732006"&gt;document&lt;/a&gt;. Ever-increasing
inference costs threaten economic and electricity sustainability. An
unstoppable appetite for training data leads to fitting models on
enormous datasets that elude quality control, engulfing undesirable
facets of the internet (including child pornography) or eroding privacy.
The race for scale also has financial consequences, benefiting above all
the providers of compute, but also structuring an ecosystem where
cash-rich and GPU-rich actors have leverage on priorities, industrial or
academic. These actors sometimes have circular investment strategies:
funding third parties that will spend all this funding on compute, which
can fuel &lt;strong&gt;an investment bubble in AI&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2025_highlights/cost_ai.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Evolution of the training cost (in dollars) of notable AI systems
across the years&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;We conclude our study, &lt;a class="reference external" href="https://dl.acm.org/doi/10.1145/3715275.3732006"&gt;published at FAccT&lt;/a&gt;, by underlining that &lt;strong&gt;academic
research has a central role to play in these dynamics and must shape a
healthy and grounded narrative&lt;/strong&gt;. We recommend:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;pursue basic AI research that is of interest independently of scale, &lt;em&gt;eg&lt;/em&gt;
uncertainty quantification, causality…&lt;/li&gt;
&lt;li&gt;uphold responsible norms, in particular avoiding requests for
increased compute when editing or reviewing,&lt;/li&gt;
&lt;li&gt;always publish measures of compute to document the tradeoffs.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2025_highlights/pareto_schema.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;We need to document and explore the tradeoffs&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In addition, I personally want to push these tradeoffs in the
direction of resource-efficient progress, and not only resource-intensive
progress (as illustrated in the figure alongside),
which is the easy route to task performance, but not the one that brings
the most value.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="tabular-learning-research"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Tabular-learning research&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="tabicl-open-source-table-foundation-model"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;TabICL:  open-source table foundation model&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Recent tabular-learning models have been bringing better performance. A
poster example is the TabPFN series of models, which rely on
pretrained transformers to achieve excellent performance. However, the
quadratic complexity of the transformers is a bottleneck. I do fear that
the agenda of fancy tabular learning is leading us into a race for scale
again.&lt;/p&gt;
&lt;p&gt;With the &lt;a class="reference external" href="https://icml.cc/virtual/2025/poster/46681"&gt;TabICL model&lt;/a&gt; we
strove to decrease this computational cost. We showed that a multi-stage
architecture can build a pre-trained in-context predictor in which the
separation of stages decreases the quadratic cost. The model can be
pretrained on larger datasets, and is thus the best performer in settings
with larger tables. The model is also faster than the alternatives, in
particular when using a CPU rather than a GPU. In addition, we released
&lt;strong&gt;all the code in open source&lt;/strong&gt;, including the pretraining.&lt;/p&gt;
&lt;p&gt;TabICL gives a table foundation model that is easy to use on modest or
big hardware and that can be easily customized.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="retrieve-merge-predict-tradeoffs-of-predictions-from-data-lakes"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Retrieve merge predict: tradeoffs of predictions from data lakes&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A full data-science pipeline must often assemble data across multiple
source tables:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Alice is working on a base table that contains information about
movies. She also has access to a data lake, a collection of other
tables on all sorts of subjects. She wants to predict the ranking of
a movie based on as much information as possible. She would like to
extract information from the data lake to improve the performance of her
model.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The challenge is that the information of interest is mixed with a
huge amount of unrelated data. Thus, Alice’s problem is: “how to find
tables that are relevant to my problem? how to combine them with the
base table?”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When the user is faced with a complex data lake, with many
tables and few explicit links between them, it is difficult to find the
best assembly for a given machine-learning task. This problem requires
not only finding which tables must be joined to the main table of interest
(a table-retrieval problem), but also how to aggregate multiple records
when tables are linked through a many-to-one relation. While table
retrieval is a classic problem in the data-management literature, it had
been understudied in the case of supervised machine learning. We
assembled a systematic (and open) benchmark with data lakes &lt;em&gt;and&lt;/em&gt;
supervised-learning tasks (&lt;a class="reference external" href="https://openreview.net/pdf?id=4uPJN6yfY1"&gt;publication&lt;/a&gt;, &lt;a class="reference external" href="https://soda-inria.github.io/retrieve-merge-predict/"&gt;benchmark material&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;We found that supervised learning does change the picture compared to
classic table-retrieval settings: for a fixed compute budget, it is worth
avoiding fancy retrieval methods, which can be very computationally
costly, and instead using better supervised-learning methods, which can
be comparatively less expensive while still being able to extract the
relevant information from a noisy retrieval.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2025_highlights/yadl_benchmark.png" style="width: 700px;" /&gt;
&lt;p class="caption"&gt;A schema of the pipeline&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The pipeline studied here is broader than the typical
machine-learning modeling step. In my experience, data-science
applications are often much more complex than mere tabular learning, and
for this reason we develop the skrub software, described below.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="growing-the-machine-learning-and-data-science-stack"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Growing the machine learning and data science stack&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="skrub-machine-learning-with-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Skrub: machine learning with tables&lt;/a&gt;&lt;/h3&gt;
&lt;a class="reference external image-reference" href="https://skrub-data.org"&gt;&lt;img alt="" class="align-right" src="attachments/skrub_logo.png" style="width: 150px;" /&gt;&lt;/a&gt;
&lt;p&gt;&lt;a class="reference external" href="https://skrub-data.org"&gt;Skrub&lt;/a&gt; is a recent library to blend machine
learning with data-frame computing. In 2025, we have ironed existing
features to make them more performant and really easy to use. For
instance the &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html"&gt;TableVectorizer&lt;/a&gt;
is incredibly useful to build tabular machine-learning pipelines. But we
have also added exciting new features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.ApplyToCols.html"&gt;ApplyToCols&lt;/a&gt; is an object that uses skrub’s powerful &lt;a class="reference external" href="https://skrub-data.org/stable/modules/multi_column_operations/selectors.html"&gt;selectors&lt;/a&gt; to apply transformations to some columns but not others. I find myself using it all the time.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://skrub-data.org/stable/data_ops.html"&gt;DataOps&lt;/a&gt; are an
incredibly powerful way of blending dataframe transformation and
scikit-learn fit/transform/predict API, to build complete machine
learning pipeline across multiple tables. The benefit is that, unlike
standard data wrangling code, they can be applied to new data,
cross-validated, or any component of the pipeline can be tuned to
maximize a prediction score. We even have added optuna support for this
tuning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="fundamental-progress-in-scikit-learn"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Fundamental progress in scikit-learn&lt;/a&gt;&lt;/h3&gt;
&lt;a class="reference external image-reference" href="https://scikit-learn.org"&gt;&lt;img alt="" class="align-right" src="attachments/scikit-learn-logo.png" style="width: 150px;" /&gt;&lt;/a&gt;
&lt;p&gt;What strikes me in the 2025 releases of &lt;a class="reference external" href="https://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; is that we have been
making progress on fundamental improvements to the core features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Faster linear models and tree-based models thanks to better
algorithms (which, in certain cases, give massive speedups).&lt;/li&gt;
&lt;li&gt;Ramping up GPU support: we are progressively adding to scikit-learn a
compute backend that enables GPU computing (an intro &lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#array-api-support-enables-gpu-computations"&gt;here&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Free-threading: we now support the “free-threaded” version of Python,
which removes a central lock and opens the door to
heavily-multithreaded parallel computing. More of the ecosystem needs
to support free-threaded Python for it to be widely used, but I am
hoping that in the medium term we’ll see great improvements to parallel
computing.&lt;/li&gt;
&lt;/ul&gt;
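&lt;p&gt;For the curious, a small sketch for checking whether your interpreter
is a free-threaded build. The build flag and the runtime check below
exist from Python 3.13; on older versions the GIL is simply always
active, which the fallback reflects.&lt;/p&gt;

```python
# Detect a free-threaded CPython build. Py_GIL_DISABLED is set to 1 in
# free-threaded builds; sys._is_gil_enabled() (Python 3.13+) reports
# whether the GIL is actually active at runtime.
import sys
import sysconfig

free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
# On interpreters older than 3.13 the attribute is absent: GIL is on.
gil_active = getattr(sys, "_is_gil_enabled", lambda: True)()
print(f"free-threaded build: {free_threaded_build}, GIL active: {gil_active}")
```

&lt;p&gt;Note that even on a free-threaded build the GIL can be re-enabled at
runtime (for instance by an extension that does not yet support it),
which is why the build flag and the runtime check are reported
separately.&lt;/p&gt;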
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Exciting times :)&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="python"></category><category term="yearly report"></category></entry><entry><title>TabICL: Pretraining the best tabular learner</title><link href="https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html" rel="alternate"></link><published>2025-07-09T00:00:00+02:00</published><updated>2025-07-09T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-07-09:/science/tabicl-pretraining-the-best-tabular-learner.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;TabICL is a state-of-the-art tabular learner &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;. The key is its very rich
prior, that is baked in a pre-trained architecture -a table foundation
model-, and leveraged by in-context-learning. Thanks to clever
choices, it is fast and scalable, efficient even without a GPU.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#recent-progress-in-tabular-learning-in-context-learning" id="toc-entry-1"&gt;Recent progress …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;TabICL is a state-of-the-art tabular learner &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;. The key is its very rich
prior, that is baked in a pre-trained architecture -a table foundation
model-, and leveraged by in-context-learning. Thanks to clever
choices, it is fast and scalable, efficient even without a GPU.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#recent-progress-in-tabular-learning-in-context-learning" id="toc-entry-1"&gt;Recent progress in tabular learning: In-Context Learning&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabular-learning-as-a-completion-problem" id="toc-entry-2"&gt;Tabular learning as a completion problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sophisticated-prior-via-data-generation" id="toc-entry-3"&gt;Sophisticated prior via data generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-improved-architecture" id="toc-entry-4"&gt;TabICL: improved architecture&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-challenge-accounting-for-the-structure-of-tables" id="toc-entry-5"&gt;The challenge: accounting for the structure of tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-s-solution" id="toc-entry-6"&gt;TabICL’s solution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-result-a-powerful-and-easy-to-use-tabular-learner" id="toc-entry-7"&gt;The result: a powerful and easy to use tabular learner&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;This note is about the research behind TabICL &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;, work by Jingang Qu, David
Holzmüller, myself, and Marine Le Morvan, published at ICML 2025, and
available as &lt;a class="reference external" href="https://tabicl.readthedocs.io/en/latest/"&gt;open-source software&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="recent-progress-in-tabular-learning-in-context-learning"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Recent progress in tabular learning: In-Context Learning&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Describing the statistical structure of tables in general is very subtle.
They do have some unique statistical features. For instance, each column
is typically meaningful by itself, more meaningful than linear
combinations of columns (the data are &lt;em&gt;not rotationally invariant&lt;/em&gt;, cf
&lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;[Grinsztajn et al, 2022]&lt;/a&gt;).
For a long time, tree-based models, in particular gradient-boosted trees,
were the models that best captured this statistical structure.&lt;/p&gt;
&lt;p&gt;The question is indeed: &lt;strong&gt;how to build complex and rich inductive biases
into statistical models&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;A pioneering contribution to this question was made with the TabPFN
approach &lt;a class="reference external" href="https://www.nature.com/articles/s41586-024-08328-6"&gt;[Hollmann et al, 2025]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="tabular-learning-as-a-completion-problem"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Tabular learning as a completion problem&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/table_in_context_learning.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;Prediction by table completion using across-row transformers&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The key idea behind this line of work is that tabular learning can be
seen as completing a table where one column has a missing entry.
Transformer-based large-language models are very good at completing
sequences, in particular in the few-shot regime. Hence the idea to use a
transformer architecture for this table-completion task.&lt;/p&gt;
&lt;p&gt;More specifically, this is a &lt;em&gt;meta-learning&lt;/em&gt; setting (learning to learn),
using transformers.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sophisticated-prior-via-data-generation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Sophisticated prior via data generation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Teaching transformers to predict well requires showing them many many
prediction problems.&lt;/p&gt;
&lt;p&gt;The benefit of this approach is that these prediction problems can be
chosen to reflect well the downstream task. In particular, it becomes now
easy to bake in any form of inductive bias by simulating data.&lt;/p&gt;
&lt;p&gt;TabPFN simulates data by cascading a series of simple transformations,
each combining very few columns. The actual data-generative processes are
more subtle, but the idea is that they are plausible for data tables.&lt;/p&gt;
&lt;p&gt;Experience (from us and others) shows that pretraining on a quality
data-generation process is crucial to produce a good tabular learner,
as with foundation models in other settings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="tabicl-improved-architecture"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;TabICL: improved architecture&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="the-challenge-accounting-for-the-structure-of-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;The challenge: accounting for the structure of tables&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabpfn_architecture.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;Tables are 2D objects, and the TabPFNv2 architecture alternates
attentions across row and across columns&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In practice, a table is not a 1D structure like a sentence. It is closer
to a 2D structure, with rows and columns. A good architecture will
account for this structure, and the TabPFNv2 architecture uses
transformers with alternating across-row and across-column attention.&lt;/p&gt;
&lt;p&gt;One problem is the computational complexity: attention is quadratic in
the number of entries, and the bi-directional transform of TabPFNv2 leads
to a cost in &lt;em&gt;O(n p² + p n²)&lt;/em&gt; for a table with &lt;em&gt;n&lt;/em&gt; rows and &lt;em&gt;p&lt;/em&gt; columns.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="tabicl-s-solution"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;TabICL’s solution&lt;/a&gt;&lt;/h3&gt;
&lt;div class="section" id="row-wise-encoding"&gt;
&lt;h4&gt;Row-wise encoding&lt;/h4&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_architecture.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;To break the quadratic cost, TabICL first encodes the rows to a
smaller, fixed-sized, represention, before performing across-row
in-context learning.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;For more scalability and better inductive bias, our model, TabICL, first
embeds the rows (using a first transformer) and then does in-context
learning across rows (with a second transformer). The resulting
computational complexity is &lt;em&gt;O(n p² + n²)&lt;/em&gt;, which is more scalable,
though still quadratic in &lt;em&gt;n&lt;/em&gt; and &lt;em&gt;p&lt;/em&gt;.&lt;/p&gt;
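&lt;p&gt;To make the gain concrete, here is a back-of-the-envelope comparison
of the two dominant costs discussed above (entries attended to, ignoring
constant factors; the specific n and p values are illustrative):&lt;/p&gt;

```python
# Compare the dominant attention costs: O(n p**2 + p n**2) for
# alternating row/column attention, vs O(n p**2 + n**2) once rows are
# first encoded to a fixed-size representation (the TabICL approach).
def cost_alternating(n, p):
    return n * p**2 + p * n**2

def cost_row_encoded(n, p):
    return n * p**2 + n**2

p = 50  # illustrative number of columns
for n in (1_000, 10_000, 100_000):
    ratio = cost_alternating(n, p) / cost_row_encoded(n, p)
    print(f"n={n:>7}: cost ratio ~ {ratio:.1f}")
```

&lt;p&gt;For large n the ratio approaches p: the p·n² term dominates, so
removing the factor p from the quadratic-in-n term is what buys the
extra scalability.&lt;/p&gt;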
&lt;p&gt;Scalability is important because it enables us to pretrain TabICL on both
small &lt;em&gt;and&lt;/em&gt; large datasets, and as a consequence TabICL is a good
predictor for large datasets.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="column-specific-embeddings"&gt;
&lt;h4&gt;Column-specific embeddings&lt;/h4&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_embeddings.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;To apply different transformations on columns depending on their
statistical properties, TabICL builds positional embeddings for
columns that capture aspects of their distribution.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Another important innovation of TabICL is that it inputs the entries in
the transformer with column-specific embeddings. These column embeddings
are computed to be a function of the distribution of the column. For
this, we use a set transformer, which is a scalable transformer-like way
of building a function on sets (but without the quadratic complexity).&lt;/p&gt;
&lt;p&gt;After pretraining, we find that the column embeddings have learned a
mapping that implicitly captures statistical aspects of the column’s
distribution, such as the kurtosis or the skewness.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-result-a-powerful-and-easy-to-use-tabular-learner"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;The result: a powerful and easy to use tabular learner&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;After a lot of pretraining on synthetic data, TabICL is a
state-of-the-art tabular learner. Pretraining gave it the right inductive
bias, as visible from the classifier-comparison plot below:&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_comparison.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;A classic classification comparison plot that shows the decision
boundaries on very simple toy data. It is useful to get a feeling of
how classifiers behave.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is interesting to see that while TabICL forms very flexible decision
boundaries, they do extend along the horizontal and vertical axes, as do
the decision tree and random forest. These axis-aligned features are a
very important aspect of the inductive bias.&lt;/p&gt;
&lt;p&gt;At the end of the day, TabICL is an excellent tabular learner, as visible
on benchmarks:&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/result_comparison.png" /&gt;
&lt;p class="caption"&gt;TabICL is a great predictor: Comparison of many predictors.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabarena.png" /&gt;
&lt;p class="caption"&gt;Experimental results, from a benchmark paper independent of the TabICL
paper: TabArena &lt;a class="reference external" href="https://arxiv.org/abs/2506.16791"&gt;[Erickson et al, 2025]&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The benefit of TabICL over TabPFNv2 becomes more marked for larger datasets:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_scale_bench.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;Rank (lower is best) as a function of dataset size.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;However, one limitation to keep in mind is that with in-context learners,
such as TabICL or TabPFN, inference (prediction on new data points) can be
costly.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;All in all, TabICL is an excellent tabular predictor, and a push forward
for tabular foundation models. From a fundamental standpoint, it shows
that in-context learning is not only for few-shot learning: it can be
very beneficial for sample sizes as large as &lt;em&gt;n=100,000&lt;/em&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;More about TabICL&lt;/p&gt;
&lt;p&gt;There is a lot more in TabICL: the details of pretraining are crucial,
and the implementation uses memory offloading (facilitated by the
architecture, which dissociates the train X from the test y for most
of the operations). To learn more about TabICL:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The paper: &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;https://arxiv.org/abs/2502.05564&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The GitHub code: &lt;strong&gt;TabICL is 100% open source&lt;/strong&gt;
&lt;a class="reference external" href="https://github.com/soda-inria/tabicl"&gt;https://github.com/soda-inria/tabicl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Install the Python package, TabICL is just one pip install away
&lt;a class="reference external" href="https://pypi.org/project/tabicl/"&gt;https://pypi.org/project/tabicl/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Other topics in table foundation models: leveraging strings&lt;/p&gt;
&lt;p&gt;TabICL is only one aspect of table foundation models. We are also
pursuing another line of research that focuses on using strings (in
entries and column names) to bring knowledge about the real world into
table foundation models; see &lt;a class="reference external" href="carte-toward-table-foundation-models.html"&gt;CARTE&lt;/a&gt; and more recently &lt;a class="reference external" href="https://arxiv.org/abs/2505.14415"&gt;[Kim
et al, 2025]&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="machine learning"></category><category term="tabular learning"></category><category term="foundation models"></category></entry><entry><title>AI agents that use tools</title><link href="https://gael-varoquaux.info/science/ai-agents-that-use-tools.html" rel="alternate"></link><published>2025-07-04T00:00:00+02:00</published><updated>2025-07-04T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-07-04:/science/ai-agents-that-use-tools.html</id><summary type="html">&lt;img alt="Image generated with ChatGPT, with the prompt &amp;quot;Please generate an image of an AI using a mechanical tool, such as a wrench. Please make the robot look rather friendly. Also, please make the image square&amp;quot;" class="small align-right" src="../science/attachments/robot_tool_friendly.png" /&gt;
&lt;p&gt;Modern AIs acquire new capabilities by combining tools to perform a
complex task, controlling them like an agent. Unlike in traditional
programming, they define the sequences of actions themselves.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/des-agents-ia-qui-utilisent-des-outils-2163252"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Modern AIs are increasingly using …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with ChatGPT, with the prompt &amp;quot;Please generate an image of an AI using a mechanical tool, such as a wrench. Please make the robot look rather friendly. Also, please make the image square&amp;quot;" class="small align-right" src="../science/attachments/robot_tool_friendly.png" /&gt;
&lt;p&gt;Modern AIs acquire new capabilities by combining tools to perform a
complex task, controlling them like an agent. Unlike in traditional
programming, they define the sequences of actions themselves.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/des-agents-ia-qui-utilisent-des-outils-2163252"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Modern AIs are increasingly using tools. For example, if you ask a
conversational AI to solve a complicated equation, the AI alone cannot do
it. This is not surprising: there is no general mathematical formula. But
if this AI knows how to use numerical equation-solving routines, it
quickly gives us the answer. For example, “Le Chat” from Mistral
generates a small program that uses the “Python” language and its
numerical routines to solve our problem. The difficulty here is to
generate the program that calls the right routines. This ability is an
extension of conversational AI models that know how to answer questions
by generating text. Here, the text is computer code and not English.&lt;/p&gt;
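&lt;p&gt;For illustration, here is a sketch of the kind of small program such an AI might generate, solving the equation cos(x) = x (which has no closed-form solution) by simple bisection. This is a hypothetical example written for this post, not Le Chat’s actual output:&lt;/p&gt;

```python
# Sketch of the kind of program a conversational AI might generate to
# solve an equation numerically: here cos(x) = x, which has no
# closed-form solution. Plain bisection with standard-library math.
from math import cos

def f(x):
    return cos(x) - x  # the equation cos(x) = x, rewritten as f(x) = 0

# f is positive at 0 and negative at 1, so a root lies in between.
lo, hi = 0.0, 1.0
for _ in range(60):            # halve the bracketing interval 60 times
    mid = 0.5 * (lo + hi)
    if f(mid) == abs(f(mid)):  # f(mid) is non-negative: root is above mid
        lo = mid
    else:                      # f(mid) is negative: root is below mid
        hi = mid

print(round(lo, 6))  # prints 0.739085
```

&lt;p&gt;The hard part, as the text notes, is not running such routines but generating the program that calls the right ones.&lt;/p&gt;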
&lt;p&gt;By controlling the computer, the AI “acts”. That’s why it is said to be
an “agent”. By coupling with other systems, agentic AIs develop new
capabilities. The most powerful ones can then combine different tools by
leveraging their complementarities. These agent systems are currently
progressing very quickly, but they remind us of what we have always done
in computer science: any complicated system is assembled from multiple
routines, each with a specific functionality. Writing a computer program
is precisely describing how we are going to call these routines to solve
a problem. Until the recent advances in AI, however, we had to specify
all the steps ourselves, whereas agentic AIs take a given goal and
produce these steps themselves. The difficulty then becomes breaking a
task down into sub-tasks, a hard problem known as planning.&lt;/p&gt;
&lt;p&gt;In modern AIs, these planning skills are learned. The systems improve
through trial and error: we give the AI lots of tasks to solve and the AI
tries sequences of sub-tasks, deciding to use one tool or another. If it
succeeds in the final task, it learns that the sequence of tool use was a
good sequence for the task. This is called reinforcement learning; its
main inventors received the Turing Award, often called the Nobel Prize
of computer science, this year.&lt;/p&gt;
&lt;p&gt;Another major driver of progress for agentic AIs is the powerful
analogy-making and associative memory of language models. These language
skills enable them to start from problems specified by the user in plain
English, with an open vocabulary. They draw their tool-use strategies
from broad knowledge of similar problems, but they also know how to adapt
these strategies to the intermediate responses of the tools. They can
also interact with systems that are much more complex and indeterminate
than computer routines. For example, an AI can go and fetch information
on the internet, or even ask a human.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Agentic AIs open new perspectives. But they also greatly increase
computing costs, as they iterate over sub-tasks. These costs must be kept
in mind, as they are an important hurdle to the democratization of AI.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>AIs that break down questions reason better</title><link href="https://gael-varoquaux.info/science/ais-that-break-down-questions-reason-better.html" rel="alternate"></link><published>2025-06-20T00:00:00+02:00</published><updated>2025-06-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-06-20:/science/ais-that-break-down-questions-reason-better.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an AI that is thinking deeply. Philosophical references may be welcomed, for instance like the classic hamlet holding skull cliché.&amp;quot;" class="small align-right" src="../science/attachments/ai_thinking.jpg" /&gt;
&lt;p&gt;The key to the most powerful conversational AIs is to reason by breaking
down a complex task into simpler subproblems. Why is this crucial, and
how does it work?&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-qui-decomposent-les-questions-raisonnent-mieux-2151428"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The recent release of …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an AI that is thinking deeply. Philosophical references may be welcomed, for instance like the classic hamlet holding skull cliché.&amp;quot;" class="small align-right" src="../science/attachments/ai_thinking.jpg" /&gt;
&lt;p&gt;The key to the most powerful conversational AIs is to reason by breaking
down a complex task into simpler subproblems. Why is this crucial, and
how does it work?&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-qui-decomposent-les-questions-raisonnent-mieux-2151428"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The recent release of the conversational AI “DeepSeek R1” shook the
financial markets because it showed a significant reduction in the costs
of reasoning models. But what are these reasoning models?&lt;/p&gt;
&lt;p&gt;To understand the challenges of reasoning in conversational AIs, we can
ask them to solve riddles. I tried various logical riddles on different
AIs, such as the puzzle where a man has to get a fox, a chicken, and a
sack of corn across a river without one eating the other. The AI responds
brilliantly. But how can we ensure that the AI is truly reasoning and not
just reciting answers it has seen before? If we replace the protagonists
with an equivalent trio (wolf, lamb, and hay), the AI does just as well.
But it could have solved the problem by analogy with the classic version,
rather than by reasoning. Indeed, language models are very good at
analogies. A conversational AI typically works by proposing an answer
inspired by the flow of words (and corresponding concepts) in the texts
on which it was trained.&lt;/p&gt;
&lt;p&gt;If, instead of a riddle resembling a story, we try to play tic-tac-toe,
the weaknesses appear. Most conversational AIs are very bad at
tic-tac-toe, even going so far as to declare victory when facing defeat.
Perhaps this is because analogy is less useful here. But activating the
“reasoning” option makes them unbeatable. What is behind this option?&lt;/p&gt;
&lt;p&gt;A third task helps to understand the reasoning mechanisms of a
conversational AI: let’s ask it how many “L”s there are in
“LOLLAPALOUZA”. There is a catch: ChatGPT was able to give me the correct
answer for the number of “L”s in “LOLLAPALOOZA”, a question often used in
the past to show its limits. For “LOLLAPALOUZA”, it fails. Or rather, it needs
help: if we tell it to spell out the word, then count the “L”s, it gives
the correct answer. With the right intermediate steps, a problem is often
much simpler. These decompositions into subproblems are called chains of
thought in conversational AIs. The “reasoning” option of some AIs
generates such chains.&lt;/p&gt;
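&lt;p&gt;The decomposition that helps can be written out in plain Python. This is only a didactic illustration of the intermediate steps (the AI does not execute such code; it produces the equivalent steps as text):&lt;/p&gt;

```python
# The two intermediate steps that make the question easy:
# first spell the word out letter by letter, then count the "L"s.
word = "LOLLAPALOUZA"
letters = list(word)        # spell it out: ['L', 'O', 'L', 'L', ...]
count = letters.count("L")  # then count the "L"s
print(count)                # prints 4
```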
&lt;p&gt;DeepSeek R1 received much attention due to its excellence in breaking
down problems to reason in this way. To do this, it was trained to
generate reasoning patterns from questions, using reinforcement learning:
through trial and error on many problems generated together with their
answers, such as math problems. Faced with a task, the AI still proceeds by
analogy with the tasks it has seen during this learning phase, but it
uses this analogy to sketch a battle plan, rather than a response. Each
subproblem is then easier, and the AI can tackle it by analogy to
problems already seen. By observing the chains of thought, we can even
see the AI verifying its intermediate results. These chains of thought
are not always visible, but we can guess them from the AI’s response
time.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With these reasoning mechanisms, a conversational AI is as good as I am
at tic-tac-toe. But using such a model to play tic-tac-toe is like using
a sledgehammer to crush a fly: it is very inefficient in computational
cost compared to a specialized program for tic-tac-toe, which we have
known how to do for decades.&lt;/p&gt;
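&lt;p&gt;For comparison, the decades-old specialized approach fits in a few lines: exhaustive minimax search, a minimal sketch of the classic technique (not any particular program’s code):&lt;/p&gt;

```python
# A tiny specialized tic-tac-toe solver: exhaustive minimax search, a
# decades-old technique with negligible cost compared to a large
# language model. Board: list of 9 cells, "X", "O", or " ".
def winner(board):
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
    for a, b, c in lines:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    # Best achievable score for "X": 1 for a win, 0 a draw, -1 a loss.
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    moves = [i for i in range(9) if board[i] == " "]
    if not moves:
        return 0  # board full: draw
    scores = []
    for i in moves:
        board[i] = player
        scores.append(minimax(board, "O" if player == "X" else "X"))
        board[i] = " "
    return max(scores) if player == "X" else min(scores)

# From the empty board, perfect play by both sides gives a draw.
print(minimax([" "] * 9, "X"))  # prints 0
```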
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>Science must drive the narratives that shape society</title><link href="https://gael-varoquaux.info/science/science-must-drive-the-narratives-that-shape-society.html" rel="alternate"></link><published>2025-03-01T00:00:00+01:00</published><updated>2025-03-01T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-03-01:/science/science-must-drive-the-narratives-that-shape-society.html</id><summary type="html">&lt;img alt="A picture of me giving this speech" class="small align-right" src="../science/attachments/louvain_gael_dhc.jpg" /&gt;
&lt;p&gt;I would like to take a brief moment to reflect on what drives me as an
academic.&lt;/p&gt;
&lt;p&gt;Academia’s roots are in creating knowledge and sharing it. We, academics,
have a role to play in shaping society. In computer science, we sometimes
focus on the creation of technology. Here, creation …&lt;/p&gt;</summary><content type="html">&lt;img alt="A picture of me giving this speech" class="small align-right" src="../science/attachments/louvain_gael_dhc.jpg" /&gt;
&lt;p&gt;I would like to take a brief moment to reflect on what drives me as an
academic.&lt;/p&gt;
&lt;p&gt;Academia’s roots are in creating knowledge and sharing it. We, academics,
have a role to play in shaping society. In computer science, we sometimes
focus on the creation of technology. Here, creation of open technology is
central to knowledge consolidation in computer science, because open
technology can be studied, because open technology can be shared.
But academia’s role in society is more than technology, even open
technology.&lt;/p&gt;
&lt;p&gt;Academia’s position in consolidating knowledge implies that it is trusted
with responsibilities in shaping the narrative, for instance that of
technology. An important narrative today is that of artificial
intelligence, a new industrial revolution, they say. Our role here is to
make a sober assessment, inventing the future of technology without
false promises or blind spots. This work, like all broad scientific work,
requires working across disciplines.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;The above text is extracted from my acceptance speech when receiving
UCLouvain’s Doctor Honoris Causa.&lt;/p&gt;
&lt;p class="last"&gt;As stated in my full speech, I am incredibly greatful for this honor. I
deeply thank all those that have been part of my scientific and
technical adventures. They were all built through team works, with
many amazing people, from all horizons, young and older, famous or
invisible. Working together is what moves mountains.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="society"></category><category term="AI"></category><category term="award"></category></entry><entry><title>AI super-intelligent to play Go, and math?</title><link href="https://gael-varoquaux.info/science/ai-super-intelligent-to-play-go-and-math.html" rel="alternate"></link><published>2025-02-19T00:00:00+01:00</published><updated>2025-02-19T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-02-19:/science/ai-super-intelligent-to-play-go-and-math.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an artificial intelligences playing go, with mathematical formula flying in the background. The mathematical formula are flying in all directions, and the image is futuristic.&amp;quot;" class="small align-right" src="../science/attachments/robots_playing_go.jpg" /&gt;
&lt;p&gt;Since 2017, an AI has been defeating the best Go experts, despite the game being particularly challenging. Such “super intelligence” is rare, but it could also emerge in fundamental mathematics.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/lia-le-go-et-les-maths-2140332"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="imitation-is-not-creation"&gt;
&lt;h2&gt;Imitation is not …&lt;/h2&gt;&lt;/div&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an artificial intelligences playing go, with mathematical formula flying in the background. The mathematical formula are flying in all directions, and the image is futuristic.&amp;quot;" class="small align-right" src="../science/attachments/robots_playing_go.jpg" /&gt;
&lt;p&gt;Since 2017, an AI has been defeating the best Go experts, despite the game being particularly challenging. Such “super intelligence” is rare, but it could also emerge in fundamental mathematics.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/lia-le-go-et-les-maths-2140332"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="imitation-is-not-creation"&gt;
&lt;h2&gt;Imitation is not creation&lt;/h2&gt;
&lt;p&gt;For several decades, calculators have been better than humans at an
intellectual task: mental arithmetic. Yet, we do not call this
“super-intelligence.” Probably because it is humans who specified all the
rules for these calculations to the machine. Similarly, a computer has a
superhuman ability to memorize information exactly, such as numbers, but
we do not consider it super-intelligent for that reason. Perhaps this is
because it does not teach us anything new. However, in 2017, an AI
started teaching the best Go players moves and strategies that no one had
ever known. How is this possible? Will AI surpass its creator and become
super-intelligent in all fields?&lt;/p&gt;
&lt;p&gt;Most recent breakthroughs in AI rely on learning methods where the
computer imitates humans. For example, to create computer-vision systems,
we provide the computer with many annotated images describing what they
represent. Likewise, conversational AIs learn by training to complete
examples of text. Under these conditions, it is difficult for AI to
surpass its creator.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="when-ais-invent"&gt;
&lt;h2&gt;When AIs invent&lt;/h2&gt;
&lt;p&gt;But AlphaZero, the AI champion in Go, operates on a different principle:
reinforcement learning. Here, the AI takes actions (moves in the game of
Go) and receives a “reward” if it wins the game. Through countless games,
it optimizes its strategies to maximize rewards, including exploring new
strategies. AlphaZero trained by playing tens of millions of games
against itself. This is how the AI was able to create new strategies,
unrestricted by human knowledge.&lt;/p&gt;
&lt;p&gt;Such learning, based on millions of trial-and-error attempts, does not
apply to all problems: it requires the ability to perform rapid
experiments, as in a computer game, which is why games remain the only
domain where true super-intelligence has been achieved. However, there is
hope in mathematics, another intellectual game.&lt;/p&gt;
&lt;p&gt;Indeed, progress in generative AI for language (which powers tools such
as ChatGPT) can be applied to mathematical proofs, which consist of
sequences of symbols. Trained on numerous proofs, an AI can learn to
complete partial proofs. However, such a generative AI will produce
sequences without guarantees of mathematical validity. Another tool,
using proof-verification techniques based on symbolic AI, can then retain
only the correct sequences, giving a “reward” signal. Reinforcement
learning finally comes in, using its exploration schemes to maximize this
reward and discover new valid proof steps.&lt;/p&gt;
&lt;p&gt;This is how, in July 2024, the AlphaProof AI reached silver-medal level
at the International Mathematical Olympiad. Further progress may eventually lead
to “super-intelligence” in mathematics. However, we are still far from
general super-intelligence, as, both in Go and mathematics, progress is
made possible by the ease of verifying whether one has “won” or not.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>AI for health: the impossible necessity of unbiased data</title><link href="https://gael-varoquaux.info/science/ai-for-health-the-impossible-necessity-of-unbiased-data.html" rel="alternate"></link><published>2025-02-13T00:00:00+01:00</published><updated>2025-02-13T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-02-13:/science/ai-for-health-the-impossible-necessity-of-unbiased-data.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate a fairly abstract image of biased data. The image is about data. It should have numbers, streams of numbers. It should express the notion of bias, showing a black woman in the middle of the stream of numbers.&amp;quot;" class="small align-right" src="../science/attachments/biased_data.jpg" /&gt;
&lt;p&gt;Is unbiased data important to build health AI? Yes!&lt;/p&gt;
&lt;p&gt;Can there be unbiased data? No!&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Building health on biased data discriminates&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The notion of bias depends on the intended use.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;In medicine, we have seen the importance of tuning devices and decisions
for the target population. The problem is not …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate a fairly abstract image of biased data. The image is about data. It should have numbers, streams of numbers. It should express the notion of bias, showing a black woman in the middle of the stream of numbers.&amp;quot;" class="small align-right" src="../science/attachments/biased_data.jpg" /&gt;
&lt;p&gt;Is unbiased data important to build health AI? Yes!&lt;/p&gt;
&lt;p&gt;Can there be unbiased data? No!&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Building health on biased data discriminates&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The notion of bias depends on the intended use.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;In medicine, we have seen the importance of tuning devices and decisions
for the target population. The problem is not limited to AI: pulse
oximeters, which measure oxygen saturation, do not work well on dark
skin; cardiac procedures were adjusted to the symptoms and anatomy of
men, while those of women differ. These issues arose because the
corresponding groups were underrepresented in the clinical studies.&lt;/p&gt;
&lt;p&gt;So when we build AI, we need to make sure that they are not trained on
biased data.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Beyond population sampling, historical choices also bias&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;But unbiased data is hard to obtain, as it goes beyond sampling the right
population of individuals. Indeed, the data we have are the result of a
historical set of choices: Whom do we measure? Which measurements do we
take? And what led to their condition? Beyond health, consider for
instance salaries: we can train a model on historical data to tell us the
right compensation for a given individual. But it will just capture and
repeat historical biases, such as paying women less than their equally
qualified male counterparts.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The notion of being unbiased embeds societal and ethical values&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here we see that the notion of being unbiased embeds societal and ethical
values: Should Olympic-level gymnasts and football players be paid the
same? What about men and women with the same job description?&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Going back to medicine, there is another critical aspect: that of cause
and effect, which is central to making decisions. To take a simple
example, if we compared the health outcomes of individuals after two days
at the hospital to those of individuals who did not go to the hospital,
we would conclude, incorrectly, that a hospital is a very dangerous
place, as individuals there are in worse shape. The problem is, of
course, that we are comparing individuals who are not comparable, as they
have different baseline health. A health intervention is given for a
reason, so it is given to a specific population: insulin is given to
diabetics. Building a model, an AI, that can decide on health
interventions requires compensating for the differences between treated
and non-treated individuals.&lt;/p&gt;
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Reference: causality&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;a class="reference external" href="https://hal.science/hal-04774700/"&gt;A 15-page introduction to causal inference with machine
learning&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;AIs can make good decisions only from adequate data&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here also we have a case of bias. The bias is with regard to the data
required to answer the question of the intervention’s effect, which
demands that both populations be comparable. More generally, we see once
again that data are always the result of a historical set of choices, and
these choices condition the statistical relationships in the data. And
AIs build on these statistical relationships.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The notion of bias depends on the intended use&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;What we see here is that the notion of bias depends on the intended use: it depends on the target population, but also on the target intervention. So there really is no absolute notion of unbiased data. There is just the notion of data that are well suited to a particular goal.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="../science/attachments/lady_justice_robot.png" /&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;This post was consolidated from notes of a panel on health AI at the
AI Action Summit, but it is linked to my &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;AI chronicles&lt;/a&gt;, big-picture
didactic pieces on AI and related topics.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="society"></category><category term="health"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>2024 highlights: of computer science and society</title><link href="https://gael-varoquaux.info/science/2024-highlights-of-computer-science-and-society.html" rel="alternate"></link><published>2025-01-01T00:00:00+01:00</published><updated>2025-01-01T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-01-01:/science/2024-highlights-of-computer-science-and-society.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;For me, 2024 was full of back and forth between research,
software, and connecting these to society. Here, I lay out some
highlights on AI and society, as well as research and software, around
tabular AI and language models.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2025 starts, I’m looking back on 2024. It …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;For me, 2024 was full of back and forth between research,
software, and connecting these to society. Here, I lay out some
highlights on AI and society, as well as research and software, around
tabular AI and language models.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2025 starts, I’m looking back on 2024. It was an interesting
professional year, as the research in the &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;soda team&lt;/a&gt; on machine learning for health and
social science nourished reflection on society.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#thoughts-from-the-national-ai-committee" id="toc-entry-1"&gt;Thoughts from the national AI committee&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#adventures-in-software-land" id="toc-entry-2"&gt;Adventures in software land&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-to-supercharge-scikit-learn" id="toc-entry-3"&gt;probabl to supercharge scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#skrub-machine-learning-on-tables-made-easy" id="toc-entry-4"&gt;Skrub: machine learning on tables made easy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#research-better-ai-tools-more-understanding" id="toc-entry-5"&gt;Research: better AI tools, more understanding&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#table-foundation-models" id="toc-entry-6"&gt;Table foundation models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#disparities-of-confidence-of-large-language-models" id="toc-entry-7"&gt;Disparities of confidence of large language models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-straggler-consistency-of-supervised-learning-with-missing-values" id="toc-entry-8"&gt;A straggler: Consistency of supervised learning with missing values&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="thoughts-from-the-national-ai-committee"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Thoughts from the national AI committee&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In early 2024, I served on the French national AI committee. Our final write-up can be found
&lt;a class="reference external" href="https://www.info.gouv.fr/actualite/25-recommandations-pour-lia-en-france"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a ton of work, a very interesting experience, and I learned a lot
about many aspects of the interfaces between technology, policy, and
society. A few things that stood out for me, some partly
obvious but worth saying:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Digital services are a growing economy.&lt;/strong&gt; The share of the economy
that is digital keeps growing, whether we like it or not (IMHO, most of
us spend too much time on our phones…). For France, or Europe, there
is no question: we must produce our share of digital services and
innovation, or our economic balance will suffer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Privacy is erroding.&lt;/strong&gt; Whether it is social network, information
leaking into search engines or training of large language models,
or people uploading private information to chatGPT, private information
is more and more available. History has shown us the dangers behind
loss of privacy, which the powerful (governing or economical elites)
typically leverage to assert more power. Europe has had a long stance
of trying to mitigate this loss of privacy via regulation (GDPR). But
regulating services that we don’t control is hard, and it ends up being
a geo-political and economical battle.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Big AI is huge.&lt;/strong&gt; The size of investments in AI is huge (dozens of
billions yearly, comparable to a sizeable fraction of the state
expenditures of a rich country like Switzerland). Data centers are
having significant impacts on the electric grid of modern countries,
running in competition with other usage. The cost of large models have
ballooned (training a large language model is in the hundreds of
millions of cost, which is comparable to a sizeable fraction of the
budget of the national research institute that I work in (&lt;a class="reference external" href="https://inria.fr/fr"&gt;inria&lt;/a&gt;). Training costs are just the visible part
of the iceberg, operational costs are huge and are everywhere.&lt;/p&gt;
&lt;p&gt;Not all in tech are worried about rising costs. Indeed, they go hand in
hand with more money in tech, making us, tech bros, richer, as long as
investments keep pouring in. But &lt;a class="reference external" href="https://www.goldmansachs.com/images/migrated/insights/pages/gs-research/gen-ai--too-much-spend%2C-too-little-benefit-/TOM_AI%202.0_ForRedaction.pdf"&gt;bubble dynamics are at play&lt;/a&gt;,
and explain part of the conversation around AI.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Concentration of power.&lt;/strong&gt; Many factors in today’s AI lead to
concentration into the hands of large actors. Training and operation
costs, of course. But also limited access to the correspond skills,
platform effect on the data and the users. The most striking bottleneck
is the compute hardware. Only one company makes the chips that we all
need. Few actors can afford buying them; and as a result most of the
world lives from renting out to big landlords.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;AI neither good nor bad, but what we do of it.&lt;/strong&gt; The above may
paint a gloomy picture. But this is not how I see it. AI does have a
lot of potential for good, as all general purpose technology. It all
depends how society uses it. And here the future is open: we, as actors
of democratic societies, as innovators, in tech but in every aspects of
society, we can determine what the future of AI is. I look forward to
technology that empowers each and everybody, to act for their own
benefit. Key to this future is enabling and bringing in every stakeholder.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="adventures-in-software-land"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Adventures in software land&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With the growing importance of data and artificial intelligence in
shaping society, I believe more than ever in the importance of open
source and commons for data science, making tools accessible to as many
as possible.&lt;/p&gt;
&lt;div class="section" id="probabl-to-supercharge-scikit-learn"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;probabl to supercharge scikit-learn&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In early 2024, Inria spun off the scikit-learn development to a new structure, &lt;a class="reference external" href="https://probabl.ai"&gt;probabl&lt;/a&gt;, to supercharge the development of the broader
ecosystem. I detailed the motivation and the goals in &lt;a class="reference external" href="../programming/promoting-open-source-from-inria-to-probabl.html"&gt;a previous article&lt;/a&gt;. In a
nutshell:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Scikit-learn is &lt;a class="reference external" href="programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html"&gt;a key component of the machine-learning
ecosystem&lt;/a&gt;,
but its development requires funding.&lt;/li&gt;
&lt;li&gt;Probabl is there to foster a broader open data-science ecosystem, as
scikit-learn can be sustainable only when used within such an ecosystem.
Probabl focuses on delivering value to enterprises, and thus makes sure
that there is a seamless solution to their needs.&lt;/li&gt;
&lt;li&gt;I have 10% of my time allocated from Inria to Probabl.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some of our successes are already publicly visible:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The open-source team at probabl is maintaining and improving &lt;a class="reference external" href="https://probabl.ai/open-source"&gt;a range
of software libraries&lt;/a&gt;: scikit-learn,
joblib, imbalanced-learn, fairlearn, skops, skrub… Our priorities are
openly discussed &lt;a class="reference external" href="https://papers.probabl.ai/open-source-priorities-chapter-2"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We have launched &lt;a class="reference external" href="https://papers.probabl.ai/official-scikit-learn-certification-launch"&gt;an official certification program for scikit-learn&lt;/a&gt;. I’m very excited about these certifications (there are three levels): they grow recognition of scikit-learn skills, and thus make sure that it is a dependable stack for industry.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="skrub-machine-learning-on-tables-made-easy"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Skrub: machine learning on tables made easy&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://skrub-data.org/"&gt;skrub&lt;/a&gt; is a software project that I am very
excited about. Many crucial applications of machine learning are on
tables. Skrub facilitates the corresponding patterns. We are designing it
with the insights of years of research and practice on the topic. It does
not always look impressive, but it’s the little things that add up for
productivity.&lt;/p&gt;
&lt;p&gt;A typical dataset is the employee salaries one:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub.datasets import fetch_employee_salaries
&amp;gt;&amp;gt;&amp;gt; dataset = fetch_employee_salaries()
&amp;gt;&amp;gt;&amp;gt; employees_df, y = dataset.X, dataset.y
&lt;/pre&gt;
&lt;p&gt;Skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableReport.html"&gt;TableReport&lt;/a&gt; makes it really easy to interactively visualize and
explore such a table:&lt;/p&gt;
&lt;img alt="" src="attachments/2024_highlights/table_report_vscode.png" style="width: 700px;" /&gt;
&lt;p&gt;The dataframe &lt;cite&gt;employees_df&lt;/cite&gt; has plenty of non-numerical columns, as visible above.
Skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html"&gt;TableVectorizer&lt;/a&gt; turns it into a numerical array suitable for
machine learning, taking care of dates, categories, strings…&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub import TableVectorizer
&amp;gt;&amp;gt;&amp;gt; X = TableVectorizer().fit_transform(employees_df)
&lt;/pre&gt;
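&lt;p&gt;To make concrete what such vectorization does, here is a minimal, purely illustrative sketch in plain Python. The column names and the toy one-hot encoding are hypothetical; skrub’s TableVectorizer picks real encoders per column type:&lt;/p&gt;

```python
from datetime import date

# Toy sketch of table vectorization: illustration only, not skrub's
# implementation. Column names are made up for the example.
def one_hot(value, vocabulary):
    """Encode a category as a 0/1 indicator over a known vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def vectorize_row(row, department_vocabulary):
    department, hire_date, years_worked = row
    features = []
    # Low-cardinality string column: one-hot encoding
    features.extend(one_hot(department, department_vocabulary))
    # Date column: expanded into several numeric features
    features.extend([hire_date.year, hire_date.month, hire_date.day])
    # Numeric column: passed through
    features.append(float(years_worked))
    return features

vocab = ["Police", "Fire", "Libraries"]
x = vectorize_row(("Police", date(2007, 5, 14), 17), vocab)
```

Each heterogeneous record becomes a flat numeric vector that any scikit-learn estimator can consume; TableVectorizer automates this choice of encoding per column.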
&lt;p&gt;If you want to use deep-learning language models for the string
categories, skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TextEncoder.html"&gt;TextEncoder&lt;/a&gt;
can download pre-trained models from Hugging Face:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub import TextEncoder
&amp;gt;&amp;gt;&amp;gt; text_encoder = TextEncoder(
        &amp;quot;sentence-transformers/paraphrase-albert-small-v2&amp;quot;,
        device=&amp;quot;cpu&amp;quot;,
    )
&amp;gt;&amp;gt;&amp;gt; tab_vec = TableVectorizer(high_cardinality=text_encoder)
&amp;gt;&amp;gt;&amp;gt; X = tab_vec.fit_transform(employees_df)
&lt;/pre&gt;
&lt;p&gt;With this, the latest artificial-intelligence developments are easily
brought in to drive decisions on the data that matters.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="research-better-ai-tools-more-understanding"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Research: better AI tools, more understanding&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Software, as well as thoughts on AI and society, is best built on a solid
understanding of AI, which calls for research.&lt;/p&gt;
&lt;div class="section" id="table-foundation-models"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Table foundation models&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Modeling data semantics enable pretaining for tables&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I have been working on machine learning for tables for more than a
decade. These data are crucial for many applications, but they have so
far not witnessed the breakthroughs of deep learning seen &lt;em&gt;eg&lt;/em&gt; in vision
or text. Much of this success of &lt;strong&gt;deep learning has been driven by the
ability to reuse pretrained models&lt;/strong&gt;, fitted on very large datasets.
Foundation models pushed this idea very far with models that provide
background information useful for a wide variety of downstream tasks. But
pretraining is challenging for tables.&lt;/p&gt;
&lt;p&gt;A crucial part of foundation models for text and images is the attention
mechanism, stacked in a transformer architecture, which brings associative
memory to the inputs by contextualizing them. We had a breakthrough with
the &lt;a class="reference external" href="https://openreview.net/forum?id=9kArQnKLDp"&gt;CARTE model&lt;/a&gt;: we
managed to adapt these ideas to tables. The strings –table
entries and column names– give the information that enables transfer from
one table to another: data semantics. Here, the key is an
architecture that 1) models both strings and numerical values and 2) applies
to any set of tables while using the column names to route the
information. For this purpose, CARTE uses a new dedicated attention
mechanism that accounts for column names. It is pre-trained on a very
large knowledge base. As a result, it outperforms the best models
(including tree-based models) in small-sample settings (up to n=2000).&lt;/p&gt;
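&lt;p&gt;To give a flavor of attention routed by column names, here is a toy sketch in plain Python. The embeddings and dimensions are made up for illustration; the actual CARTE architecture is described in the paper:&lt;/p&gt;

```python
import math

# Toy attention over table cells, where each cell's key comes from an
# embedding of its column name. Illustration only, not CARTE itself.
def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each cell's value embedding by query/key similarity."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Because keys are built from column-name embeddings, the same mechanism
# applies to any table schema (hypothetical two-dimensional embeddings).
column_embeddings = {"salary": [1.0, 0.0], "department": [0.0, 1.0]}
keys = [column_embeddings["salary"], column_embeddings["department"]]
values = [[55.0, 0.0], [0.0, 3.0]]  # toy embeddings of the cell values
context = attend([1.0, 0.0], keys, values)  # query focused on "salary"
```

The point of the sketch: the column names, not fixed positions, decide where information flows, which is what lets a pretrained model transfer across tables with different schemas.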
&lt;p&gt;The pretrained CARTE model is available for download as &lt;a class="reference external" href="https://pypi.org/project/carte-ai"&gt;a Python package&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This result is very significant as it opens the door to &lt;strong&gt;foundation models
for tables&lt;/strong&gt;: models that embark much background knowledge and can be
specialized to many tabular-learning tasks.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://openreview.net/forum?id=9kArQnKLDp"&gt;&lt;img alt="" src="attachments/2024_highlights/carte_comparisons.png" style="width: 100%;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Extensive empirical results show that CARTE brings benefits to very
broad set of baselines. The relative performance of baselines also
contains interesting results.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;See also&lt;/p&gt;
&lt;p&gt;I wrote a longer &lt;a class="reference external" href="./carte-toward-table-foundation-models.html"&gt;high-level post on CARTE&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="disparities-of-confidence-of-large-language-models"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Disparities of confidence of large language models&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://hal.science/hal-04750567"&gt;&lt;img alt="" src="attachments/2024_highlights/hallucination_probability.png" style="width: 400px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;A good confidence assessment on replies of an LLM would separate out
correct from incorrect statements: Einstein was not born on Jan 14th
1879 (close call, it was March 14th); his PhD was in Zurich.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Large language models (LLMs), such as chatGPT, may produce answers that
are plausible but not factually correct, the so-called “hallucinations”.
A variety of approaches try to assess how likely a statement is to be true,
for instance by sampling multiple responses from the language model.
Ideally, we would like to use these confidence assessments to flag the
wrong statements in an LLM’s answer. For this, a challenge is to
threshold them, or assign a probability of correctness.&lt;/p&gt;
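&lt;p&gt;One simple way to assign such a probability of correctness, sketched here with made-up numbers, is to bin held-out confidence scores and use the observed rate of correct answers per bin. This is a generic calibration illustration, not the paper’s exact method:&lt;/p&gt;

```python
# Map raw confidence scores to an observed probability of correctness
# by binning held-out (score, was_correct) pairs. Toy illustration.
def binned_calibration(scores, correct, n_bins=5):
    """Return the observed rate of correct answers in each score bin."""
    bins = [[] for _ in range(n_bins)]
    for s, c in zip(scores, correct):
        index = min(int(s * n_bins), n_bins - 1)  # clamp score of 1.0
        bins[index].append(c)
    return [sum(b) / len(b) if b else None for b in bins]

# Hypothetical held-out data: confidence score and whether the
# corresponding statement turned out to be correct (1) or not (0)
scores = [0.1, 0.15, 0.35, 0.5, 0.55, 0.75, 0.8, 0.95]
correct = [0, 0, 0, 1, 0, 1, 1, 1]
rates = binned_calibration(scores, correct)
```

A statement can then be flagged when its bin’s observed rate falls below a chosen threshold.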
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://hal.science/hal-04750567"&gt;&lt;img alt="" src="attachments/2024_highlights/llm_confidence_nationality.png" style="width: 400px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Observed error rate and a function predicted probability of
correctness For the birth date, when a large language model (here Mistral
7B) gives information on a given notable individual. The different
curves give the corresponding calibration for different nationalities of
the individuals, revealing that &lt;strong&gt;the probability is much more trustworthy
for a citizen of the United States than for other countries&lt;/strong&gt;, and
particularly poor for people that originate from South-East Asia.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://hal.science/hal-04750567/"&gt;Chen et al&lt;/a&gt;, we investigate the
confidence of LLMs in their answers. We show that the
probabilities computed are not only overconfident, but also that there is
heterogeneity (grouping loss): on some groups of queries the
overconfidence is more pronounced than on others. For instance, for an
answer on a notable individual, the LLMs’ confidence is reasonably
calibrated if the individual is from the United States, but severely
overconfident for individuals from South-East Asia
(see the figure on the right). Characterizing the corresponding groups
opens the door to correcting the corresponding bias, a “reconfidencing”
procedure.&lt;/p&gt;
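&lt;p&gt;The grouping loss can be illustrated with a toy computation: answers all emitted at the same stated confidence can hide very different observed error rates across groups. The numbers below are hypothetical:&lt;/p&gt;

```python
# Toy illustration of hidden heterogeneity: compute per-group error rates
# for answers that all carried the same stated confidence (say 0.9).
def error_rate_by_group(records):
    totals, errors = {}, {}
    for group, is_wrong in records:
        totals[group] = totals.get(group, 0) + 1
        errors[group] = errors.get(group, 0) + is_wrong
    return {g: errors[g] / totals[g] for g in totals}

# Hypothetical (group, was_wrong) records at identical stated confidence
records = [("US", 0), ("US", 0), ("US", 0), ("US", 1),
           ("South-East Asia", 1), ("South-East Asia", 1),
           ("South-East Asia", 0), ("South-East Asia", 1)]
rates = error_rate_by_group(records)
```

Average calibration over all records would look acceptable, yet the per-group rates differ sharply, which is exactly what a single calibration curve cannot reveal.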
&lt;p&gt;This study is an application of our earlier, more theoretical, &lt;a class="reference external" href="https://openreview.net/forum?id=6w1k-IixnL8"&gt;work&lt;/a&gt; that contributed the
first estimator of the grouping loss, a mathematically solid concept behind
hidden heterogeneity in classifier calibration. I am very happy to see
that these fairly abstract ideas are useful to probe very concrete
problems such as the disparity in LLM confidence across nationalities.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-straggler-consistency-of-supervised-learning-with-missing-values"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;A straggler: Consistency of supervised learning with missing values&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;A&lt;/em&gt; &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;paper&lt;/a&gt;
&lt;em&gt;on the fundamentals of machine-learning with missing values&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In 2018, &lt;a class="reference external" href="https://juliejosse.com"&gt;Julie Josse&lt;/a&gt;, &lt;a class="reference external" href="https://erwanscornet.github.io"&gt;Erwan Scornet&lt;/a&gt;, and myself started working on the
theory of how supervised learning works with missing values (learning
theory). Working with an intern, Nicolas Prost, we quickly realized that there
was a gap between the statistical thinking around missing values, which
was focused on enabling inference in parametric models as if there were
no missing values, and the needs of prediction with missing values.&lt;/p&gt;
&lt;p&gt;We wrote  &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;a paper&lt;/a&gt; to
lay out the theory cleanly, summarizing both elements of learning theory
and the fundamentals of statistics with missing values. Beyond these
didactic aspects, the paper gives a series of formal results, such as the
need for multiple imputations to be able to use the &lt;em&gt;complete case&lt;/em&gt;
predictor (the optimal predictor without missing values), the optimal way
to model missing values in trees (which was already used in XGBoost :) ),
and the fact that, asymptotically, constant imputation of missing values
can work well for prediction.&lt;/p&gt;
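&lt;p&gt;Constant imputation itself is trivial to sketch. The toy code below just replaces missing entries by a fixed value; the theoretical point is that a flexible downstream predictor can then learn to treat that value specially (in practice one would use e.g. scikit-learn’s SimpleImputer with a powerful learner):&lt;/p&gt;

```python
# Constant imputation: replace missing entries (None) by a fixed value.
# Toy illustration of the strategy discussed above.
def impute_constant(rows, fill_value=0.0):
    return [[fill_value if v is None else v for v in row] for row in rows]

X = [[1.0, None], [None, 3.0], [2.5, 4.0]]
X_imputed = impute_constant(X)
```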
&lt;p class="align-right"&gt;&lt;em&gt;Frustrations of the academic game&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://hal.science/hal-02024202"&gt;The preprint&lt;/a&gt; got a lot of success
(more than a hundred citations), probably because it laid out
fundamentals. But it took 5 years to publish it. The machine learning
community did not like the absence of new methods (we only gave
theoretical results on existing practice, such as imputation). The
statistics literature really did not like our messages that imputation
was not always important. In one journal, a reviewer rejected the paper on
the basis that it was giving bad messages to the community, but not
arguing that anything was wrong in our proofs or our experiments. Of
course, there is a lot to say about the difficulties of doing data
analysis with missing values, but the conversation did not go into these
details. This is a good illustration that &lt;strong&gt;progress in science is
social&lt;/strong&gt;, and is as much about shifting norms as about accumulating knowledge
(actually, knowledge is social too, as put forward by &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Social_epistemology"&gt;social
epistemology&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As time went by, my colleague &lt;a class="reference external" href="https://marinelm.github.io"&gt;Marine Le Morvan&lt;/a&gt; has published &lt;a class="reference external" href="https://proceedings.mlr.press/v108/morvan20a.html"&gt;more&lt;/a&gt; &lt;a class="reference external" href="https://proceedings.neurips.cc/paper/2021/hash/5fe8fdc79ce292c39c5f209d734b7206-Abstract.html"&gt;and&lt;/a&gt;
&lt;a class="reference external" href="https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac013/6568998"&gt;more&lt;/a&gt;
&lt;a class="reference external" href="https://arxiv.org/abs/2407.19804"&gt;results&lt;/a&gt; that push deeper
understanding of prediction with missing values. But I still see value in
our original paper, as it lays the foundations.&lt;/p&gt;
&lt;p&gt;The paper is now out, thanks to my coauthors who kept replying to
reviewers, improving the manuscript, and resubmitting. Read &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;it&lt;/a&gt;; I think
that it is a good read.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Well, this article ended up longer than I had expected. Thanks for
reading. Taking a step back to figure out what is important is always a
good exercise for me.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>When AIs must overcome the data</title><link href="https://gael-varoquaux.info/science/when-ais-must-overcome-the-data.html" rel="alternate"></link><published>2024-12-22T00:00:00+01:00</published><updated>2024-12-22T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-12-22:/science/when-ais-must-overcome-the-data.html</id><summary type="html">&lt;p&gt;Improving conversational artificial intelligences or simpler prediction engines involves overcoming biases, that is, going beyond the limits of data. But the notion of bias is subtle, as it depends on the goals.&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot arm wrestling a figure made of numbers. This figure does not look like a robot, but more like a human, however it is made of numbers.&amp;quot;" class="small align-right" src="../science/attachments/robot_wresting_numbers.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/quand-lia-doit-depasser-les-donnees-2126369"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Improving conversational artificial intelligences or simpler prediction engines involves overcoming biases, that is, going beyond the limits of data. But the notion of bias is subtle, as it depends on the goals.&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot arm wrestling a figure made of numbers. This figure does not look like a robot, but more like a human, however it is made of numbers.&amp;quot;" class="small align-right" src="../science/attachments/robot_wresting_numbers.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/quand-lia-doit-depasser-les-donnees-2126369"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In 2023, Microsoft’s conversational AI insulted users.
Salary-recommendation engines ignore women’s degrees to underpay them. At
the start of the Covid-19 pandemic, predictions of hospital stays
consistently underestimated the duration. These three issues all stem
from the same failure: predictive engines, artificial intelligences, that
have learned from biases. The rude conversational AI replicated its
training texts, some of which came from internet forums where politeness
is sometimes overlooked. The medical AI only considered finished
hospitalizations, and, as the epidemic had just begun, only patients
with mild forms had already been discharged, while the more seriously ill
remained hospitalized.&lt;/p&gt;
&lt;p&gt;To obtain an AI that doesn’t spout nonsense, the biases must be
“corrected.” The problem of too-short observation windows is a classic
issue in medical statistics: more importance must be placed on the few
individuals who have been sick for a long time. A similar solution is
used to improve conversational AIs: weighting the training text sources
based on the deviation from the desired behavior.&lt;/p&gt;
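&lt;p&gt;The reweighting idea can be sketched with a toy computation: upweighting the under-observed long stays shifts the estimate toward the target population. The numbers below are made up, and this is only a cartoon of a censoring correction:&lt;/p&gt;

```python
# Toy illustration of reweighting: long hospital stays are under-observed
# early in an epidemic, so they receive larger weights.
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

stay_lengths = [3, 4, 5, 20]         # observed stays, in days (made up)
weights = [1.0, 1.0, 1.0, 4.0]       # upweight the rare long stay

plain = sum(stay_lengths) / len(stay_lengths)
corrected = weighted_mean(stay_lengths, weights)
```

The uncorrected average underestimates the typical stay; the weighted average moves it up, which is the same logic used when weighting training text sources for conversational AIs.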
&lt;div class="section" id="aligning-on-which-values"&gt;
&lt;h2&gt;Aligning on which values?&lt;/h2&gt;
&lt;p&gt;The problem of bias is universal in statistics. And modern AIs are
statistical because they learn from data. The notion of bias is very
relative. It should be understood as a gap between the available data and
the desired behavior. Therefore, &lt;strong&gt;there is no such thing as unbiased data,
or a universal bias correction&lt;/strong&gt;. Much of the effort to improve AIs focuses
on reducing this gap between training and the desired behavior.&lt;/p&gt;
&lt;p&gt;For example, when training AIs for autonomous vehicles, one difficulty is
that the data contains very few traffic accidents. Simulators are
sometimes used to fill this gap. They are inherently less rich than
reality and are mixed with real-world driving. There is a well-controlled
gap between the resulting mixture and typical driving; this gap is there
to put emphasis on safety requirements in unfavorable scenarios. This is
another form of data correction.&lt;/p&gt;
&lt;p&gt;Just as the notion of data bias depends on how well the data match a
targeted use, an AI does not produce absolute or objective truth. Without
corrections, it simply replicates its behavior based on what it has
observed. And when corrections are made, the whole question is how to
correct it. For powerful AIs, we then talk about “alignment” towards
goals and values. As AI incorporates the values of its designers, one
might wonder whether the same AI can be socially acceptable in all
cultures.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>Do AIs reason or recite?</title><link href="https://gael-varoquaux.info/science/do-ais-reason-or-recite.html" rel="alternate"></link><published>2024-10-19T00:00:00+02:00</published><updated>2024-10-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-10-19:/science/do-ais-reason-or-recite.html</id><summary type="html">&lt;p&gt;Despite their apparent intelligence, conversational artificial intelligences often lack logic. The debate rages on: do they reason or do they recite snatches of text memorized on the Internet?&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot with a stream of numbers coming out of his mouth. The robot is on the left, facing right, and the numbers flow, as if they were sound.&amp;quot;" class="small align-right" src="../science/attachments/robot_numbers_flow_mouth.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-raisonnent-elles-ou-recitent-elles-2103079"&gt;Les Echos&lt;/a&gt;. I updated it with new …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Despite their apparent intelligence, conversational artificial intelligences often lack logic. The debate rages on: do they reason or do they recite snatches of text memorized on the Internet?&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot with a stream of numbers coming out of his mouth. The robot is on the left, facing right, and the numbers flow, as if they were sound.&amp;quot;" class="small align-right" src="../science/attachments/robot_numbers_flow_mouth.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-raisonnent-elles-ou-recitent-elles-2103079"&gt;Les Echos&lt;/a&gt;. I updated it with new references.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Conversational AI, or large language models, are sometimes seen as the
gateway to general artificial intelligence. ChatGPT, for example, can
answer questions asked at the International Mathematical Olympiad. And
yet, on other, seemingly much simpler questions, ChatGPT makes surprising
mistakes. What aspects of conversational AI intelligence explain its
ability to solve some problems and not others?&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://arxiv.org/abs/2309.13638"&gt;Thomas McCoy and co-authors&lt;/a&gt;
conjecture that it has to do with their underlying model of
autoregression: technically, these AIs are trained to complete texts
found on the Internet. If an AI is very good at calculating (9/5) x + 32,
but not (7/5) x + 31, it is because the first formula corresponds to the
conversion of degrees Celsius to Fahrenheit, a very frequent conversion
on the Internet, while the second does not correspond to any particular
formula. Conversational AIs would therefore be good at reproducing what
they’ve already seen. Indeed, numerous studies have shown that they have
a certain tendency to reproduce snippets of known text. So, if an AI can
solve problems from the International Mathematical Olympiad, is it simply
because it has memorized the answer?&lt;/p&gt;
&lt;div class="section" id="something-new"&gt;
&lt;h2&gt;Something new?&lt;/h2&gt;
&lt;p&gt;In terms of intelligence, inventing a new mathematical demonstration
requires mastering abstractions and the ability to string together
complicated logical reasoning with an imposed start and finish. This
seems much more difficult than memorizing a demonstration. This is one of
the traditional oppositions in machine learning, the line of research
that gave rise to today’s AIs: memorizing is one thing, knowing how to
generalize is another. For example, if I memorize all the additions
between two numbers smaller than ten, I cannot extrapolate beyond that. To
go further, I need to master the logic of addition… or memorize more.&lt;/p&gt;
&lt;p&gt;And precisely, conversational AIs have an enormous capacity for
memorization, and have been trained on almost the entire Internet. Given
a question, they can often dip into their memory to find answers. So, are
they intelligent or just have a great memory? Scientists are still
debating the importance of memory to their abilities. Some argue that
their storage capacity is ultimately limited by the size of the Internet.
Others wonder to what extent the impressive successes highlighted are not
on tasks already solved on the Internet, questioning their ability to do
anything new.&lt;/p&gt;
&lt;p&gt;But could memorization be an aspect of intelligence? In 1987, Lenat and
Feigenbaum conjectured that, for a cognitive agent, accumulating
knowledge enables it to solve new tasks with less learning. Perhaps the
intelligence of conversational AI lies in knowing how to pick up the
right bits of information, and combine them.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Related academic work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://www.pnas.org/doi/10.1073/pnas.2322420121"&gt;Embers of autoregression show how large language models are shaped
by the problem they are trained to solve&lt;/a&gt;, R. Thomas McCoy,
Shunyu Yao, Dan Friedman, Mathew D. Hardy, and Thomas L. Griffiths,
PNAS 2024 (&lt;a class="reference external" href="https://arxiv.org/abs/2309.13638"&gt;ArXiv&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Princeton researchers show that properties of large language models
(LLMs) are governed by the data that they are trained on, including
their arithmetic abilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://arxiv.org/abs/2410.05229"&gt;GSM-Symbolic: Understanding the Limitations of Mathematical
Reasoning in Large Language Models&lt;/a&gt;, Iman Mirzadeh, Keivan Alizadeh,
Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar&lt;/p&gt;
&lt;p&gt;Apple researchers show that LLMs solve mathematical challenges via
probabilistic &lt;strong&gt;pattern matching&lt;/strong&gt; on previously seen examples, rather
than logical reasoning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>CARTE: toward table foundation models</title><link href="https://gael-varoquaux.info/science/carte-toward-table-foundation-models.html" rel="alternate"></link><published>2024-07-19T00:00:00+02:00</published><updated>2024-07-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-07-19:/science/carte-toward-table-foundation-models.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Foundation models, pretrained and readily usable for many downstream
tasks, have changed the way we process text, images, and sound. Can we
achieve similar breakthroughs for tables? Here I explain why with
&lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;“CARTE”&lt;/a&gt;, we’ve made significant headway.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-for-data-tables-hopes-and-challenges" id="toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-is-a-necessity" id="toc-entry-2"&gt;Pre-training is a …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Foundation models, pretrained and readily usable for many downstream
tasks, have changed the way we process text, images, and sound. Can we
achieve similar breakthroughs for tables? Here I explain why with
&lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;“CARTE”&lt;/a&gt;, we’ve made significant headway.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-for-data-tables-hopes-and-challenges" id="toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-is-a-necessity" id="toc-entry-2"&gt;Pre-training is a necessity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pretraining-for-data-tables" id="toc-entry-3"&gt;Pretraining for data tables?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#carte-a-table-foundation-model-breakthrough" id="toc-entry-4"&gt;CARTE: a table foundation model breakthrough&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#an-architecture-to-learn-across-tables" id="toc-entry-5"&gt;An architecture to learn across tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pretraining-on-knowledge-graphs" id="toc-entry-6"&gt;Pretraining on knowledge graphs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#empirical-results" id="toc-entry-7"&gt;Empirical results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#lessons-learned" id="toc-entry-8"&gt;Lessons learned&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="pre-training-for-data-tables-hopes-and-challenges"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="pre-training-is-a-necessity"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Pre-training is a necessity&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Foundation models have brought breakthroughs to text and image processing
because they carry a great deal of knowledge about these data, knowledge
that can then be reused to simplify processing. But their promises have
not come true for tables, which hold much of an organization’s specific
data, &lt;em&gt;eg&lt;/em&gt; relational databases capturing day-to-day operations, or
measurement tables related to a specific source of data.&lt;/p&gt;
&lt;p&gt;Rather, for tabular learning, a couple of years ago &lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;our extensive
benchmarks&lt;/a&gt;
showed that tree-based models outperformed even deep-learning
architectures specially crafted for data tables.&lt;/p&gt;
&lt;p&gt;One challenge is that tables are typically not that big, and thus the
high flexibility of deep learning is a weakness rather than a benefit.
For data modalities where deep learning has been vastly successful, this
shortcoming was solved by pretrained models: &lt;strong&gt;most people do not
train a deep-learning model from scratch, but download a pre-trained one
from model hubs&lt;/strong&gt;. Such universal pre-training is also at the root of
foundation models.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="pretraining-for-data-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Pretraining for data tables?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;But what does pretraining mean for data tables? If I give you a table of
numbers, what prior information can you use to process it better?
Images and text have a lot of regularity that repeats across datasets:
I can recognize a car in pictures coming from all kinds of cameras
(including old black-and-white photographs). I use my knowledge of the
meaning of words to understand a text. But given a table of numbers as
below, what sense can I make of it?&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The tabular learning challenge: every table is a special snowflake&lt;/em&gt;&lt;/div&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="29%" /&gt;
&lt;col width="29%" /&gt;
&lt;col width="29%" /&gt;
&lt;col width="14%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;181&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;166&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;161&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The reason a data analyst can understand this data, and use this
understanding to build a better data-processing pipeline, is that the
data comes with context: meaningful strings sprinkled around these
numbers. For instance, a table with the same numbers as above, but with
column names and string entries, makes complete sense:&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;caption&gt;Cardiovascular cohort&lt;/caption&gt;
&lt;colgroup&gt;
&lt;col width="18%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="36%" /&gt;
&lt;col width="9%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Age&lt;/th&gt;
&lt;th class="head"&gt;Weight&lt;/th&gt;
&lt;th class="head"&gt;Height&lt;/th&gt;
&lt;th class="head"&gt;Commorbidity&lt;/th&gt;
&lt;th class="head"&gt;Cardiovascular event&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;td&gt;Diabetes&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;181&lt;/td&gt;
&lt;td&gt;Cardiac arrhythmia&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;166&lt;/td&gt;
&lt;td&gt;NA&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;161&lt;/td&gt;
&lt;td&gt;Asthma&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In such a setting, it becomes clear what background knowledge, what
pre-training can bring to analyzing data tables: &lt;strong&gt;string entries and
column names bring meaning to the numbers in data tables&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Another way of seeing the challenge is as one of &lt;strong&gt;data integration&lt;/strong&gt;: as
studied by the knowledge-representation and database communities, putting
multiple sources of data in a consistent representation requires:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;schema matching&lt;/strong&gt;, which to a first order is about finding column
correspondences across tables&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;entity matching&lt;/strong&gt;, finding correspondences across table entries
denoting the same thing, for instance “Diabetes” and “Diabetes mellitus”&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These challenges of data integration are central to building pretrained
or foundation models for tables. Indeed, such models must apply to all
tables, and thus must bridge these gaps across tables.&lt;/p&gt;
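&lt;p&gt;A minimal sketch of these two steps, using plain string similarity from the Python standard library (a crude stand-in: real systems use much richer matching, &lt;em&gt;eg&lt;/em&gt; language-model embeddings):&lt;/p&gt;

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Character-level similarity in the unit interval; a crude
    # stand-in for semantic matching.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Schema matching: find column correspondences across two tables.
columns_a = ["Age", "Weight", "Comorbidity"]
columns_b = ["age (years)", "body weight", "medical conditions"]
for col in columns_a:
    best = max(columns_b, key=lambda other: similarity(col, other))
    print(col, "matches", best)

# Entity matching: entries denoting the same thing in different forms.
print(round(similarity("Diabetes", "Diabetes mellitus"), 2))
```

&lt;p&gt;Character similarity catches “Diabetes” vs “Diabetes mellitus”, but not “Comorbidity” vs a column named with a true synonym, which is exactly where pretrained embeddings come in.&lt;/p&gt;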
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="carte-a-table-foundation-model-breakthrough"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;CARTE: a table foundation model breakthrough&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Our recent &lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;CARTE paper&lt;/a&gt; builds upon
the above insights, and demonstrates that pretraining can give
models that markedly improve performance.&lt;/p&gt;
&lt;div class="section" id="an-architecture-to-learn-across-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;An architecture to learn across tables&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Graphlets&lt;/strong&gt;
The key ingredient of CARTE is how we represent the inputs. CARTE’s goal
is to build predictors on rows of tables, for instance associating the
features of an individual with a risk of developing adverse cardiovascular
events. To pretrain across tables, we use a universal representation of
the data (rows of tables) as small graphs.&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_graphlet.png" /&gt;
&lt;p class="caption"&gt;Turning table rows into graphlets. Each column leads to an edge and
the column name is turned into the corresponding edge feature. It’s a
“multirelational graph”. The entry associated with the given column
is turned into the corresponding node feature, and the row is
represented as a special row token in the center of the graphlet.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Thus, tables with different numbers of columns can be turned into a
consistent representation. But an additional benefit of this
representation is that it can represent data across multiple tables with
shared keys (for instance all the visits of a patient to a hospital).&lt;/p&gt;
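&lt;p&gt;In code, the construction can be sketched as follows (a simplified, hypothetical rendition of the graphlet idea; the actual CARTE implementation builds proper graph objects with embedded features):&lt;/p&gt;

```python
# Simplified sketch: each table row becomes a star-shaped graphlet.
# Each column yields an edge labeled by the column name (the edge
# feature); the cell entry becomes the node at the other end; a special
# row token sits at the center.

def row_to_graphlet(row):
    edges = []
    for column_name, entry in row.items():
        edges.append(("ROW", column_name, entry))
    return edges

row = {"Age": 72, "Comorbidity": "Diabetes", "Cardiovascular event": 1}
for edge in row_to_graphlet(row):
    print(edge)
```

&lt;p&gt;Two tables with different columns both map into this same edge-list form, which is what lets a single model consume rows from any table.&lt;/p&gt;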
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;A representation that can bridge tables without schema or entity
matching&lt;/em&gt;&lt;/div&gt;
&lt;br/&gt;
&lt;br/&gt;&lt;p&gt;&lt;strong&gt;String embeddings&lt;/strong&gt;
The second ingredient is to represent all strings, whether column names
or string entries, as embeddings from a pretrained language model. A good
language model will embed close to one another different strings with
similar meanings, for instance a column named “comorbidity” and
another one named “medical conditions”. Such a representation helps
learning without entity or schema matching.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Graph transformer&lt;/strong&gt; CARTE then uses a form of graph transformer on top
of this representation. Key to this graph transformer is an attention
mechanism that accounts for the relation information (the edge type,
&lt;em&gt;ie&lt;/em&gt; the column name). Thus &lt;em&gt;(born in, Paris)&lt;/em&gt; is represented
differently from &lt;em&gt;(living in, Paris)&lt;/em&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Numbers treated as such&lt;/strong&gt; Columns with numerical entries often carry
important information in a data table. Unlike typical large language
models, we do not represent numbers via string tokenization, but use a
vector representation where the numerical value is multiplied with the
embedding of the column name (a vector output by the language model).
That way a value of 126 in a column named “Systolic mm Hg” is represented
close to 1.5 times a value of 84 in a column named “Blood pressure”.&lt;/p&gt;
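&lt;p&gt;A toy sketch of this numerical encoding, with made-up three-dimensional vectors standing in for the language-model embeddings:&lt;/p&gt;

```python
# Toy sketch: a number is encoded as its value times the embedding of
# its column name. The vectors below are made up; in CARTE they come
# from a pretrained language model, so near-synonymous column names get
# close-by vectors.
TOY_EMBEDDINGS = {
    "Systolic mm Hg": [0.9, 0.1, 0.4],
    "Blood pressure": [0.9, 0.1, 0.4],  # near-synonym: same toy vector
}

def represent(value, column_name):
    return [value * x for x in TOY_EMBEDDINGS[column_name]]

rep_a = represent(126, "Systolic mm Hg")
rep_b = represent(84, "Blood pressure")
# 126 = 1.5 * 84, so the two representations are colinear, ratio 1.5
print([round(a / b, 2) for a, b in zip(rep_a, rep_b)])
```
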
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="pretraining-on-knowledge-graphs"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Pretraining on knowledge graphs&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We pretrain the above architecture on a large, general-purpose knowledge
graph. The goal is to distill the corresponding information into the
pretrained model, which can then implicitly use it when analyzing new
tables. Indeed, a large knowledge graph (we use &lt;a class="reference external" href="https://yago-knowledge.org"&gt;YAGO&lt;/a&gt;) represents a huge number of facts about the
world, and its representation, as a multirelational graph, is close to
the one that we use to model data tables.&lt;/p&gt;
&lt;p&gt;Given an analytic task on a data table of interest, the pretrained model
can be fine-tuned. We found this to be a tricky part, as those tables
are often small.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="empirical-results"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Empirical results&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Excellent performance on extensive benchmarks&lt;/strong&gt;
We compared CARTE to a variety of baselines across 51 datasets (mostly
downloaded from Kaggle), as a function of the number of samples (number
of rows):&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_learning_curve.png" /&gt;
&lt;p class="caption"&gt;Prediction performance as a function of sample size for classification
and regression tasks&lt;/p&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
CARTE outperforms all baselines, including very strong ones&lt;/div&gt;
&lt;p&gt;CARTE appears as a very strong performer, outperforming all baselines
when there are fewer than 2000 samples. For larger tables, the prior
information is less crucial, and more flexible learners are beneficial.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strong contenders&lt;/strong&gt; We see that powerful tree-based learners, such as
CatBoost or XGBoost, also work very well. We investigated many baselines
in detail. Here, we consider not only learners, but also a variety of
methods to encode strings, and these really help prediction:&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_cd_plots.png" /&gt;
&lt;p class="caption"&gt;Detailed comparison (critical difference plots, giving the average
ranking of methods) across all 42 baselines that we investigated&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;CatBoost is an excellent predictor because it encodes categories
with great care. &lt;em&gt;S-LLM-CN-XGB&lt;/em&gt; is a baseline that we contributed; it
encodes strings with an LLM, concatenates the numerical values, and uses
XGBoost on the resulting representation. &lt;em&gt;TabVec&lt;/em&gt; is the &lt;a class="reference external" href="https://skrub-data.org/stable/generated/skrub.TableVectorizer.html#skrub.TableVectorizer"&gt;TableVectorizer&lt;/a&gt;
from &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub&lt;/a&gt;. Combined with standard learners,
it gives really strong baselines.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Learning across tables&lt;/strong&gt; As CARTE can jointly model different tables with
different conventions, we show that it can use large source tables to
boost prediction on a smaller target table.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/carte/carte_joint_learning.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Ranking of various methods used across tables with imperfect
correspondences, where “matched” means manual column matching, and “not
matched” means no manual column matching&lt;/em&gt;&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Transfer learning across sources with different columns / schemas&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="lessons-learned"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;Lessons learned&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The extensive empirical results hold many lessons.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tabular foundation models are possible&lt;/strong&gt; The first lesson is that
using strings to bring meaning to the numbers enables foundation models
for tables: pretrained models that facilitate a variety of downstream
tasks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLMs are not enough&lt;/strong&gt; Many approaches to table foundation models adapt
large language models pretrained on huge text corpora. The argument is
that, given the amount of high-quality text on the Internet, the
corresponding LLM can acquire broad background knowledge. The seminal example is that of
&lt;a class="reference external" href="https://proceedings.mlr.press/v206/hegselmann23a.html"&gt;TabLLM&lt;/a&gt;, which
makes sentences out of table rows and feeds them to LLMs. Yet, by itself
it does not perform well on tables with numbers.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/carte/tabllm_comparison.png" style="width: 350px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Ranking of models on data from the TabLLM paper, data that differs from
our benchmark above as it does not have string entries.&lt;/em&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
A table foundation model must model strings and numbers&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Modeling numbers is crucial&lt;/strong&gt; TabPFN, CARTE, and XGBoost all outperform
TabLLM on tables without strings, likely because they readily model
numbers, while an LLM sees them as strings. Likewise, our variant
&lt;em&gt;S-LLM-CN-XGB&lt;/em&gt;, which combines LLMs with a model suitable for numbers,
performs very well.&lt;/p&gt;
&lt;p&gt;As the strings are crucial to give context to numbers, we believe that
the future of table foundation models is to model well both strings and
numbers.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;CARTE is only a first step in the world of table foundation models. I
am convinced that these ideas will be pushed much further.&lt;/p&gt;
&lt;p class="last"&gt;But we have learned a lot in this study. I have only scratched the
surface of our work here. If you want more details, read the &lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;CARTE paper&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="machine learning"></category><category term="tabular learning"></category><category term="foundation models"></category></entry><entry><title>Comité de l’intelligence artificielle: vision et stratégie nationale</title><link href="https://gael-varoquaux.info/science/comite-de-lintelligence-artificielle-vision-et-strategie-nationale.html" rel="alternate"></link><published>2023-09-20T00:00:00+02:00</published><updated>2023-09-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2023-09-20:/science/comite-de-lintelligence-artificielle-vision-et-strategie-nationale.html</id><summary type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;English summary&lt;/p&gt;
&lt;p&gt;I have been appointed to the government-level panel of experts on AI,
to set the national vision and strategy in France.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I have the honor of being &lt;a class="reference external" href="https://www.gouvernement.fr/communique/comite-de-lintelligence-artificielle"&gt;appointed to the French government’s artificial intelligence committee&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The mission entrusted to us is to inform public action …&lt;/p&gt;</summary><content type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;English summary&lt;/p&gt;
&lt;p&gt;I have been appointed to the government-level panel of experts on AI,
to set the national vision and strategy in France.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I have the honor of being &lt;a class="reference external" href="https://www.gouvernement.fr/communique/comite-de-lintelligence-artificielle"&gt;appointed to the French government’s artificial intelligence committee&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The mission entrusted to us is to inform public action around
artificial intelligence, a technology that can impact many aspects of
society.&lt;/p&gt;
&lt;p&gt;The committee brings together experts with very varied profiles, from
young entrepreneurs to world-renowned economists. The difficulty will be
to consider the full set of links between technological progress and
society. We will seek to articulate a vision, gather expertise from many
different actors on many different topics, and ground our projections in
the current state of scientific knowledge.&lt;/p&gt;
&lt;p&gt;I will not share the committee’s work ahead of time: establishing
consensus takes real work, and that work takes time.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This mission goes beyond my usual sphere of academic research and
software development. I am doing it because I believe that, for
technology to have the best impact on society, there must be a
back-and-forth between technological creation and societal change. If we
scientists decide to focus solely on our academic and technical work, we
lose control over how society adopts our technology; we leave that
control to the people who choose to spend their energy acting,
influencing, and profiting directly from these technologies. As a
computer-science researcher, working both on fundamental AI and on
applications in health, I have expertise that is important to bring to
the table. As a civil servant, I think I can and must inform the debate:
I am less exposed to the risk of conflicts of interest, and I am paid
with public money to be useful to the public.&lt;/p&gt;
&lt;p&gt;This work is nevertheless not a political stance: I am a scientist,
not an elected official. The committee’s power is not to make political
decisions, but to inform about what is possible. It is a work of
synthesis and mediation.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Mise à jour: rapport disponible&lt;/p&gt;
&lt;p&gt;Nous avons publié en mars 2024 notre rapport, disponible &lt;a class="reference external" href="https://www.info.gouv.fr/actualite/25-recommandations-pour-lia-en-france"&gt;en ligne&lt;/a&gt;.
Il est très lisible et traite de tous les sujets autours de l’IA.
Lecture recommandée à tous.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="artificial intelligence"></category><category term="society"></category><category term="science"></category><category term="government"></category></entry><entry><title>2022, a new scientific adventure: machine learning for health and social sciences</title><link href="https://gael-varoquaux.info/science/2022-a-new-scientific-adventure-machine-learning-for-health-and-social-sciences.html" rel="alternate"></link><published>2023-01-31T00:00:00+01:00</published><updated>2023-01-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2023-01-31:/science/2022-a-new-scientific-adventure-machine-learning-for-health-and-social-sciences.html</id><summary type="html">&lt;p&gt;A retrospective on last year (2022): I embarked on a new scientific
adventure, assembling &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;a team&lt;/a&gt; focused on
developing machine learning for health and social science. The team has
existed for almost a year, and the vision is nicely shaping up. Let me
share with you illustrations of where we …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A retrospective on last year (2022): I embarked on a new scientific
adventure, assembling &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;a team&lt;/a&gt; focused on
developing machine learning for health and social science. The team has
existed for almost a year, and the vision is nicely shaping up. Let me
share with you illustrations of where we are at. This is extracted from
our yearly report which will be public later, but I have sometimes edited
it a bit to add personal context.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-new-team-soda" id="toc-entry-1"&gt;A new team: Soda&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-scientific-vision" id="toc-entry-2"&gt;The scientific vision&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#application-context-richer-data-in-health-and-social-sciences" id="toc-entry-3"&gt;Application context: richer data in health and social sciences&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#applications-raise-specific-data-science-challenges" id="toc-entry-4"&gt;Applications raise specific data-science challenges&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#our-research-axes" id="toc-entry-5"&gt;Our research axes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#some-notable-results-of-2022" id="toc-entry-6"&gt;Some notable results of 2022&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#learning-on-relational-data-aggregating-across-many-tables" id="toc-entry-7"&gt;Learning on relational data: aggregating across many tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#validating-probabilistic-classifiers-beyond-calibration" id="toc-entry-8"&gt;Validating probabilistic classifiers: beyond calibration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#reweighting-randomized-trials-for-generalization-finite-sample-error-and-variable-selection" id="toc-entry-9"&gt;Reweighting randomized trials for generalization: finite sample error and variable selection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#challenges-to-clinical-impact-of-ai-in-medical-imaging" id="toc-entry-10"&gt;Challenges to clinical impact of AI in medical imaging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#privacy-preserving-synthetic-educational-data-generation" id="toc-entry-11"&gt;Privacy-preserving synthetic educational data generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-new-team-soda"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;A new team: Soda&lt;/a&gt;&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/team_2022.jpg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;The team in early 2022 (it has grown a lot since)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;At &lt;a class="reference external" href="https://www.inria.fr/en"&gt;Inria&lt;/a&gt;, we have teams assembling multiple
tenured researchers around a scientific project. Last year, we assembled
a new team called &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;Soda&lt;/a&gt;, which stands for
“social data”, but above all is a fun name.&lt;/p&gt;
&lt;p&gt;In a year, the team grew like crazy (to be honest, this had been baking
for a little while). We are now around 25 people.
There are 4 PIs (Marine le Morvan, Judith Abécassis, Jill-Jênn Vie, and
myself); and the engineers working on scikit-learn at Inria are also part
of the team.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-scientific-vision"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;The scientific vision&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Machine learning to leverage richer, more complex, data for
social-sciences and health&lt;/em&gt;&lt;/p&gt;
&lt;div class="section" id="application-context-richer-data-in-health-and-social-sciences"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Application context: richer data in health and social sciences&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Opportunistic data accumulations, often observational, bear great
promise for the social and health sciences. But these data are too big and
complex for the standard statistical methodologies of these sciences.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Health databases&lt;/strong&gt; Increasingly rich health data is accumulated
during routine clinical practice as well as for research. Its large
coverage brings new promises for public health and personalized medicine,
but it does not fit easily in standard biostatistical practice because it
is not acquired and formatted for a specific medical question.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Social, educational, and behavioral sciences&lt;/strong&gt; Better data sheds new
light on human behavior and psychology, for instance with on-line
learning platforms. Machine learning can be used both as a model for
human intelligence and as a tool to leverage these data, for instance
improving education.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="applications-raise-specific-data-science-challenges"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Applications raise specific data-science challenges&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Data management: preparing dirty data for analytics&lt;/strong&gt; Assembling,
curating, and transforming data for analysis is very labor
intensive. These data-preparation steps are often considered the number-one
bottleneck of data science. They mostly rely on data-management
techniques. A typical problem is establishing correspondences between
entries that denote the same entities but appear in different forms
(entity linking, including deduplication and record linkage). Another
time-consuming process is joining and aggregating data across multiple
tables with repetitions at different levels (as with panel data in
econometrics and epidemiology) to form a unique set of “features”
describing each individual.&lt;/p&gt;
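&lt;p&gt;As a toy illustration of the entity-linking problem (just the flavor of it, not the methods we develop; the city names below are made up), string similarity from Python’s standard library can map variant spellings to canonical entries:&lt;/p&gt;

```python
# Minimal sketch of entity linking: map variant spellings of entities
# to canonical entries using string similarity (standard-library difflib).
import difflib

canonical = ["Paris", "London", "New York"]
observed = ["paris", "Lndon", "New-York", "London"]

canonical_lower = [c.lower() for c in canonical]
links = {}
for name in observed:
    # Compare lowercased strings so that case differences do not matter
    match = difflib.get_close_matches(name.lower(), canonical_lower,
                                      n=1, cutoff=0.6)
    if match:
        # Recover the canonical spelling of the matched entity
        links[name] = canonical[canonical_lower.index(match[0])]

print(links)  # each variant mapped to its canonical entity
```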
&lt;div class="sidebar"&gt;
The &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;Dirty Data project&lt;/a&gt; paved the way.&lt;/div&gt;
&lt;p&gt;Progress in machine learning increasingly helps automate data
preparation and process data with less curation.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Data science with statistical machine learning&lt;/strong&gt; Machine learning can
be a tool to answer complex domain questions by providing non-parametric
estimators. Yet, much work remains: to go beyond point
estimators, to derive non-parametric procedures that account for a
variety of biases (censoring, sampling biases, non-causal associations), and
to provide theoretical and practical tools to assess the validity of
estimates and conclusions in weakly-parametric settings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="our-research-axes"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Our research axes&lt;/a&gt;&lt;/h3&gt;
&lt;div class="section" id="representation-learning-for-relational-data"&gt;
&lt;h4&gt;Representation learning for relational data&lt;/h4&gt;
&lt;p&gt;I dream of deep-learning methodology for relational databases, from
tabular datasets to full relational databases. The stakes are &lt;em&gt;i)&lt;/em&gt; to
build machine-learning models that apply readily to the raw data so as to
minimize manual cleaning, data formatting, and integration, and &lt;em&gt;ii)&lt;/em&gt; to
extract reusable representations that reduce sample complexity on new
databases by transforming the data into well-distributed vectors.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="mathematical-aspects-of-statistical-learning-for-data-science"&gt;
&lt;h4&gt;Mathematical aspects of statistical learning for data science&lt;/h4&gt;
&lt;p&gt;I want to use machine-learning models as non-parametric estimators, as I
worry about the impact of mismodeling on conclusions. However, for a given
statistical task, the statistical procedures and validity criteria need
to be reinvented. Soda contributes statistical tools and results for a
variety of problems important to data science in health and social
science (epidemiology, econometrics, education). These fields lead to
various statistical topics:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Causal inference&lt;/li&gt;
&lt;li&gt;Model validation&lt;/li&gt;
&lt;li&gt;Uncertainty quantification&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-for-health-and-social-sciences"&gt;
&lt;h4&gt;Machine learning for health and social sciences&lt;/h4&gt;
&lt;p&gt;Soda targets applications in health and the social sciences, as these can
markedly benefit from advanced processing of richer datasets and can have a
large societal impact, but fall outside mainstream machine-learning
research, which focuses on processing natural images, language, and voice.
Data surveying humans needs another focus: it is most of the time
tabular and sparse, with a time dimension and missing values. In terms of
application fields, we focus on the social sciences that rely on
quantitative predictions or analyses across individuals, such as policy
evaluation. Indeed, the same formal problems, addressed in the two
research axes above, arise across various social sciences:
&lt;strong&gt;epidemiology, education research, and economics&lt;/strong&gt;.
The challenge is to develop efficient and trustworthy machine-learning
methodology for these high-stakes applications.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="high-quality-data-science-software"&gt;
&lt;h4&gt;High-quality data-science software&lt;/h4&gt;
&lt;p&gt;The societal and economic impact of machine learning requires easy-to-use
practical tools that can be leveraged in non-specialized organizations
such as hospitals or policy-making institutions.&lt;/p&gt;
&lt;p&gt;Soda incorporates the core team working at Inria on &lt;strong&gt;scikit-learn&lt;/strong&gt;, one
of the most popular machine-learning tools worldwide. One of the missions
of Soda is to improve scikit-learn and its documentation, transferring the
understanding of machine learning and data science accumulated by our
various research efforts.&lt;/p&gt;
&lt;p&gt;Soda also works on other important software tools to foster the growth and
health of the Python data ecosystem in which scikit-learn is embedded.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="some-notable-results-of-2022"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Some notable results of 2022&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I am listing here a small number of the achievements of the team, because
I find them inspiring.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="learning-on-relational-data-aggregating-across-many-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Learning on relational data: aggregating across many tables&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For many machine-learning tasks, augmenting the data table at hand with
features built from external sources is key to improving performance. For
instance, estimating housing prices benefits from background information
on the location, such as the population density or the average income.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/aggregating.png" style="width: 300px;" /&gt;
&lt;p class="caption"&gt;Often, data must be assembled across multiple tables into a single
table for analysis. Challenges arise due to one-to-many relations,
irregularity of the information, and the number of tables that may be
involved.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Most often, a major bottleneck is to &lt;strong&gt;assemble this information across
many tables&lt;/strong&gt;, requiring time and expertise from the data scientist. We
propose &lt;strong&gt;vectorial representations of entities (e.g. cities) that capture
the corresponding information&lt;/strong&gt; and thus can replace human-crafted
features. In &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s10994-022-06277-7"&gt;Cvetkov-Iliev 2023&lt;/a&gt;, we
represent the relational data on the entities as a graph and adapt
graph-embedding methods to create feature vectors for each entity. We
show that two technical ingredients are crucial: modeling well the
different relationships between entities, and capturing numerical
attributes. We adapt knowledge graph embedding methods that were
primarily designed for graph completion. Yet, they model only discrete
entities, while creating good feature vectors from relational data also
requires capturing numerical attributes. For this, we introduce KEN:
Knowledge Embedding with Numbers. We thoroughly evaluate approaches to
enrich features with background information on 7 prediction tasks. We
show that a good embedding model coupled with KEN can perform better than
manually handcrafted features, while requiring much less human effort. It
is also competitive with combinatorial feature engineering methods, but
much more scalable. Our approach can be applied to huge databases, for
instance on general knowledge graphs as in YAGO, creating &lt;strong&gt;general-purpose
feature vectors reusable in various downstream tasks&lt;/strong&gt;.&lt;/p&gt;
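&lt;p&gt;In practice, using such entity embeddings boils down to a join: the sketch below (with made-up numbers and column names, merely to show the spirit of the approach) merges pre-computed vectors onto the table at hand, replacing manual feature engineering on the entity:&lt;/p&gt;

```python
# Sketch: augmenting a prediction table with pre-computed entity
# embeddings, in the spirit of the KEN feature vectors.
import pandas as pd

# Table at hand: one row per housing sale, with the city as an entity
sales = pd.DataFrame({"city": ["Paris", "Lyon", "Paris"],
                      "price": [9500, 4200, 8800]})

# Hypothetical embedding table: one vector per entity (the actual KEN
# vectors are downloadable from soda-inria.github.io/ken_embeddings)
embeddings = pd.DataFrame({"city": ["Paris", "Lyon"],
                           "dim_0": [0.12, -0.45],
                           "dim_1": [0.87, 0.33]})

# A single join brings in background information on each entity
augmented = sales.merge(embeddings, on="city", how="left")
print(augmented.shape)  # (3, 4): price plus two embedding dimensions
```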
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2022/entity_types_with_names.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Entity embeddings of YAGO (wikipedia)&lt;/strong&gt; (2D-representation using
UMAP). The vectors are downloadable from
&lt;a class="reference external" href="https://soda-inria.github.io/ken_embeddings"&gt;https://soda-inria.github.io/ken_embeddings&lt;/a&gt;} to readily augment
data-science projects.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="validating-probabilistic-classifiers-beyond-calibration"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;Validating probabilistic classifiers: beyond calibration&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/grouping_loss.png" style="width: 360px;" /&gt;
&lt;p class="caption"&gt;Validating probabilistic predictions of classifiers must go account
not only for the average error given an predicted score, but also for
the dispersion of errors.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Ensuring that a classifier gives reliable confidence scores is essential
for informed decision-making, in particular in high-stakes areas such as
health. For instance, before using a clinical prognostic model, we want
to establish that, for a given individual, the probabilities it attributes
to different clinical outcomes can indeed be trusted. To this end,
recent work has focused on miscalibration, &lt;em&gt;i.e.&lt;/em&gt;, the over- or
under-confidence of model scores.&lt;/p&gt;
&lt;p&gt;Yet calibration is not enough: even a perfectly calibrated classifier
with the best possible accuracy can have confidence scores that are far
from the true posterior probabilities, if it is over-confident for some
samples and under-confident for others. This is captured by the grouping
loss, created by samples with &lt;strong&gt;the same confidence scores but different
true posterior probabilities&lt;/strong&gt;. Proper scoring rule theory shows that given
the calibration loss, the missing piece to characterize individual errors
is the grouping loss. While there are many estimators of the calibration
loss, none exists for the grouping loss in standard settings. In
&lt;a class="reference external" href="https://arxiv.org/abs/2210.16315"&gt;Perez-Lebel 2023&lt;/a&gt;, we propose an
estimator to approximate the grouping loss. We show that modern neural-network
architectures in vision and NLP exhibit grouping loss, notably in
distribution-shift settings, which highlights the importance of
pre-production validation.&lt;/p&gt;
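&lt;p&gt;A tiny numeric illustration (a sketch of the concept, not the estimator of the paper): two groups of samples receive the same confidence score, so the classifier looks perfectly calibrated on average, yet the true posterior probabilities differ within the score level:&lt;/p&gt;

```python
import numpy as np

# All samples get the same confidence score from the classifier...
scores = np.full(200, 0.7)
# ...but the true posterior probability differs between two groups
true_prob = np.array([0.9] * 100 + [0.5] * 100)

# Calibration compares the average outcome to the score level: here they
# match, so the classifier appears perfectly calibrated
print(np.isclose(scores.mean(), true_prob.mean()))  # True

# The grouping loss reflects the dispersion of true probabilities within
# a score level: non-zero here, revealing over-confidence on one group
# and under-confidence on the other
print(round(float(np.var(true_prob)), 2))  # 0.04
```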
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="reweighting-randomized-trials-for-generalization-finite-sample-error-and-variable-selection"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-9"&gt;Reweighting randomized trials for generalization: finite sample error and variable selection&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/reweighting_trial.png" style="width: 360px;" /&gt;
&lt;p class="caption"&gt;There may be a sampling bias between a randomized trial and the
target population.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Randomized Controlled Trials (RCTs) are the ideal experiments to establish
causal statements. However, they may suffer from limited scope, in
particular because they may have been run on non-representative samples:
some RCTs over- or under-sample individuals with certain characteristics
compared to the target population, for which one wants conclusions on
treatment effectiveness. Re-weighting trial individuals to match the
target population can improve the treatment-effect estimation.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://hal.science/hal-03822662"&gt;Colnet 2022&lt;/a&gt;, we establish the
exact expressions of the bias and variance of such reweighting procedures
- also called Inverse Propensity of Sampling Weighting (IPSW) - in the
presence of categorical covariates, for any sample size. Such results
allow us to compare the theoretical performance of different versions of
IPSW estimates. Besides, our results show how the performance (bias,
variance, and quadratic risk) of IPSW estimates depends on the two sample
sizes (RCT and target population). A by-product of our work is a proof
of consistency of IPSW estimates. Results also reveal that IPSW
performance is improved when the trial probability of being treated is
estimated (rather than using its oracle counterpart). In addition, we
study the &lt;strong&gt;choice of variables&lt;/strong&gt;: how including covariates that are not
necessary for identifiability of the causal effect may impact the
asymptotic variance. Including covariates that are shifted between the
two samples but are not treatment-effect modifiers increases the variance,
while non-shifted treatment-effect modifiers do not.&lt;/p&gt;
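&lt;p&gt;The idea of IPSW can be sketched in a few lines (a toy simulation with a single binary covariate and invented numbers, not the estimators analyzed in the paper): trial individuals are reweighted by the ratio of target-population to trial covariate frequencies before taking a difference in means:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# The trial over-samples x=1 (70 percent) relative to the target
# population (30 percent)
x = rng.binomial(1, 0.7, size=10_000)
treat = rng.binomial(1, 0.5, size=10_000)
# The treatment effect is larger when x=1 (x is an effect modifier)
y = treat * (1.0 + 2.0 * x) + rng.normal(size=10_000)

# IPSW weights: P_target(x) / P_trial(x) for each individual
w = np.where(x == 1, 0.3 / 0.7, 0.7 / 0.3)

# Weighted difference in means between treated and control arms
ate = (np.sum(w * treat * y) / np.sum(w * treat)
       - np.sum(w * (1 - treat) * y) / np.sum(w * (1 - treat)))
print(ate)  # approximately 1.6 = 1 + 2 * 0.3, the target-population effect
```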
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="challenges-to-clinical-impact-of-ai-in-medical-imaging"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-10"&gt;Challenges to clinical impact of AI in medical imaging&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I have worked for many years on research in computer analysis of medical
images. In particular, I am convinced that machine learning bears many
promises to improve patients’ health. However, I cannot be blind to the
fact that a number of systematic challenges are slowing down the progress
of the field.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://www.nature.com/articles/s41746-022-00592-y"&gt;Varoquaux &amp;amp; Cheplygina&lt;/a&gt;, we tried to take
a step back on these challenges, from limitations of the data, such as
biases, to research incentives, such as optimizing for publication. We
reviewed roadblocks to developing and assessing methods. Building our
analysis on evidence from the literature and data challenges, we showed
that potential biases can creep in at every step.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;First, larger datasets do not bring increased prediction accuracy and
may suffer from biases.&lt;/li&gt;
&lt;li&gt;Second, evaluations often miss the target, with evaluation error larger
than algorithmic improvements, improper evaluation procedures and
leakage, metrics that do not reflect the application, incorrectly chosen
baselines, and improper statistics.&lt;/li&gt;
&lt;li&gt;Finally, we show how publishing too often leads to distorted incentives.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On a positive note, we also discuss on-going efforts to counteract these
problems and provide recommendations on how to further address these
problems in the future.&lt;/p&gt;
&lt;p&gt;This was a fun exercise. I realize that I still need to sit with it and
introspect on how it has shaped my research agenda, because I think it has
pushed me to choose specific emphases (such as model evaluation, or
focusing on rich data sources).&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="privacy-preserving-synthetic-educational-data-generation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-11"&gt;Privacy-preserving synthetic educational data generation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Soda also works on applications other than health, for instance
education. In this direction, I would like to highlight work in which I
did not participate, by Jill-Jênn Vie, another PI of the team.&lt;/p&gt;
&lt;p&gt;Institutions collect massive learning traces, but they may not disclose
them for privacy reasons. Synthetic data generation opens new opportunities for
research in education. &lt;a class="reference external" href="https://hal.inria.fr/hal-03715416"&gt;Vie 2022&lt;/a&gt;
presented a generative model for educational data that can preserve the
privacy of participants, and an evaluation framework for comparing
synthetic-data generators. We show how naive pseudonymization can lead to
re-identification threats and suggest techniques to guarantee privacy. We
evaluate our method on existing massive open educational datasets.&lt;/p&gt;
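&lt;p&gt;A minimal sketch of the re-identification threat behind naive pseudonymization (with invented records, not the paper’s data): quasi-identifiers left in the released traces can be joined against auxiliary public data:&lt;/p&gt;

```python
# Pseudonymized learning traces: names removed, but quasi-identifiers
# (zip code, birth year) are left in the release
released = [
    {"pseudo_id": "u1", "zip": "75013", "birth_year": 1990, "grade": 14},
    {"pseudo_id": "u2", "zip": "69002", "birth_year": 1985, "grade": 9},
]
# Auxiliary public dataset, with names attached to the same quasi-identifiers
public = [{"name": "Alice", "zip": "75013", "birth_year": 1990}]

# A simple join on the quasi-identifiers breaks the pseudonymization
reidentified = [
    (person["name"], row["pseudo_id"])
    for row in released
    for person in public
    if (row["zip"], row["birth_year"]) == (person["zip"], person["birth_year"])
]
print(reidentified)  # [('Alice', 'u1')]
```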
&lt;p&gt;The tension between privacy of individuals and the need for datasets for
open science is a real and important one.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This was just a quick glance at what we do at Soda, and we are just
warming up. I am super excited about this research. I hope that it will
matter.&lt;/p&gt;
&lt;p&gt;I truly believe that more and better machine learning can help health
and social science draw new insights from new datasets.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="health"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>2021 highlight: Decoding brain activity to new cognitive paradigms</title><link href="https://gael-varoquaux.info/science/2021-highlight-decoding-brain-activity-to-new-cognitive-paradigms.html" rel="alternate"></link><published>2022-02-24T00:00:00+01:00</published><updated>2022-02-24T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2022-02-24:/science/2021-highlight-decoding-brain-activity-to-new-cognitive-paradigms.html</id><summary type="html">&lt;p class="align-right"&gt;&lt;em&gt;Broad decoding models that can specialize to discriminate
closely-related mental processes with limited data&lt;/em&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;TL;DR&lt;/p&gt;
&lt;p&gt;Decoding models can help isolate which mental processes are implied
by the activation of given brain structures. But to support a broad
conclusion, they must be trained on many studies, a difficult problem
given …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p class="align-right"&gt;&lt;em&gt;Broad decoding models that can specialize to discriminate
closely-related mental processes with limited data&lt;/em&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;TL;DR&lt;/p&gt;
&lt;p&gt;Decoding models can help isolate which mental processes are implied
by the activation of given brain structures. But to support a broad
conclusion, they must be trained on many studies, a difficult problem
given the unclear relations between tasks of different studies. We
contributed a method that infers these links from the data. Their
validity is established by generalization to new tasks. Some
cognitive neuroscientists prefer qualitative consolidation of
knowledge, but such an approach is hard to put to the test.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="context-infering-cognition-from-brain-imaging"&gt;
&lt;h2&gt;Context: Inferring cognition from brain imaging&lt;/h2&gt;
&lt;p&gt;Often, when interpreting functional brain images, one would like to
conclude on the individual’s ongoing mental processes. But this
conclusion is not directly warranted by brain-imaging studies, as they do
not control the brain activity, but rather engage the participant via a
cognitive paradigm made of psychological manipulations &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. &lt;em&gt;Brain
decoding&lt;/em&gt; can help ground such &lt;em&gt;reverse inferences&lt;/em&gt; &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;, by using
machine learning to predict aspects of the task.&lt;/p&gt;
&lt;p&gt;But a brain-decoding model can seldom support broad reverse-inference
claims, as typical decoding models are trained on a given study that
samples only a few aspects of cognition. Thus the decoding model only
supports conclusions on the interpretation of brain activity within the
study’s narrow scope.&lt;/p&gt;
&lt;p&gt;Another challenge is that of statistical power. Most functional brain
imaging studies comprise only a few dozen subjects, compromising
statistical power &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;, even more so when using machine learning &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-4"&gt;[4]&lt;/a&gt;.
While there exist large acquisition efforts, these must focus on broad
psychological manipulations that do not probe fine aspects of mental
processes.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2006, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1364661305003360"&gt;Can cognitive processes be inferred from
neuroimaging data?&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2011, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S0896627311009895"&gt;Inferring Mental States from Neuroimaging Data:
From Reverse Inference to Large-Scale Decoding&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2017, &lt;a class="reference external" href="https://www.nature.com/articles/nrn.2016.167"&gt;Scanning the horizon: towards transparent and
reproducible neuroimaging research&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-4"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Varoquaux 2018, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305311"&gt;Cross-validation failure: Small sample sizes lead
to large error bars&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="contribution-informing-specialized-decoding-questions-from-broad-data-accumulation"&gt;
&lt;h2&gt;Contribution: Informing specialized decoding questions from broad data accumulation&lt;/h2&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008795"&gt;Mensch 2021&lt;/a&gt;,
we designed a machine-learning method that can &lt;strong&gt;jointly analyze many
unrelated functional imaging studies to build representations associating
brain activity to mental processes&lt;/strong&gt;. These representations can then be
used to &lt;strong&gt;improve brain decoding in new unrelated studies&lt;/strong&gt;, thus bringing
statistical-power improvements even to experiments probing fine aspects
of mental processes not studied in large cohorts.&lt;/p&gt;
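&lt;p&gt;Schematically (a toy numpy stand-in, not the actual architecture of the paper), the approach amounts to a representation of brain maps shared across studies, combined with study-specific read-out heads:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_components = 100, 5

# Encoder shared by all studies; in the real model it is learned jointly
shared_projection = rng.normal(size=(n_voxels, n_components))

def decode(brain_map, study_head):
    """Project into the shared space, then apply a study-specific head."""
    z = brain_map @ shared_projection   # representation common to all studies
    logits = z @ study_head             # read-out specific to one study
    return int(np.argmax(logits))

# Each study keeps its own head over its own task conditions
head_study_a = rng.normal(size=(n_components, 3))  # a study with 3 conditions
head_study_b = rng.normal(size=(n_components, 7))  # a study with 7 conditions

label_a = decode(rng.normal(size=n_voxels), head_study_a)
label_b = decode(rng.normal(size=n_voxels), head_study_b)
print(label_a in range(3), label_b in range(7))  # True True
```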
&lt;p&gt;One roadblock to accumulating information across
cognitive-neuroimaging studies is that they all probe different, yet related,
mental processes. Framing them all in the same analysis faces the lack of
a universally adopted language to describe cognitive paradigms. Our prior
work &lt;a class="footnote-reference" href="#footnote-5" id="footnote-reference-5"&gt;[5]&lt;/a&gt; on this endeavor – the quest for universal decoding across
studies – relied on describing each experimental paradigm in an ontology
of cognitive processes and psychological manipulations. However, such an
approach is not scalable. Here, rather, we inferred the latent structure
of the tasks from the data, without explicitly modeling the links
between studies. In my eyes, this was a very important ingredient of our
work, and it is non-trivial that it enables improving the decoding of
unrelated studies.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-5" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-5"&gt;[5]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Varoquaux 2018, &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;Atlases of cognition with large-scale human brain
mapping&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Capturing &lt;em&gt;representations&lt;/em&gt; was key to transferring across studies:
representations of brain activity captured distributed brain structures
predictive of behavior; representations of tasks across studies captured
decompositions of behavior well explained by brain activity. Of course,
the representations that we extracted were not as sharp as the stylized
functional modules that have been manually compiled from decades of
cognitive-neuroscience research.&lt;/p&gt;
&lt;p&gt;From a computer-science standpoint, we used a deep-learning architecture.
This is the first time that we witnessed a
deep-learning architecture outperforming well-tuned shallow baselines on
functional neuroimaging data &lt;a class="footnote-reference" href="#footnote-6" id="footnote-reference-6"&gt;[6]&lt;/a&gt;. This success is likely due to the
massive amount of data that we assembled: as our method can
readily work across studies, we were able to apply it to 40,000
subject-level contrast maps.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-6" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-6"&gt;[6]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;There have been many reports of deep architectures on functional
brain imaging. However, in our experience, good shallow benchmarks
are hard to beat.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2021_highlights/mston.png" /&gt;
&lt;p class="caption"&gt;Our deep-learning architecture&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-research-agenda-that-does-not-win-all-hearts"&gt;
&lt;h2&gt;A research agenda that does not win all hearts&lt;/h2&gt;
&lt;p&gt;Our underlying research agenda is to &lt;strong&gt;piece together
cognitive-neuroimaging evidence on a wide variety of tasks and mental
processes&lt;/strong&gt;. In cognitive neuroscience, such consolidation of knowledge
is done via review articles that assemble findings from many
publications into a consistent picture of how tasks decompose into
elementary mental processes implemented by brain functional modules. The
literature review and the ensuing neuro-cognitive model are however verbal
by nature: they assemble qualitative findings. I, for one, would like to
have quantitative tools to foster a big-picture view. Of course, the
challenge with quantitative approaches such as ours is to capture all the
qualitative aspects of the question.&lt;/p&gt;
&lt;p&gt;Over the years that I have been pushing these ideas, I find that they are
met with resistance from some elite cognitive neuroscientists who see
them as unexciting at best. The same people are enthusiastic about new
data-analysis methods to dissect brain responses in fine detail with a
detailed model of a given task, despite limited statistical power and
external validity. My feeling is that &lt;strong&gt;the question of how
various tasks are related is perceived as belonging to the walled garden
of cognitive neuroscientists, not to be put to the test by statistical
methods&lt;/strong&gt; &lt;a class="footnote-reference" href="#footnote-7" id="footnote-reference-7"&gt;[7]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-7" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-7"&gt;[7]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article/peerReview?id=10.1371/journal.pcbi.1008795"&gt;The second round of review of our manuscript&lt;/a&gt;
certainly felt as if the method was judged by cognitive-neuroscience
lenses, and not the validity of the data analysis that it entailed.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Yet, as clearly exposed by Tal Yarkoni in his &lt;a class="reference external" href="https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/generalizability-crisis/AD386115BA539A759ACB3093760F4824"&gt;Generalizability crisis&lt;/a&gt;,
drawing conclusions on mental organization from a few repetitions of a
task is at risk of picking up idiosyncrasies of the task or the stimuli.
A starting point of our work (&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008795"&gt;Mensch 2021&lt;/a&gt;)
was the fall of statistical power in cognitive neuroscience, documented
by &lt;a class="reference external" href="https://www.nature.com/articles/nrn.2016.167"&gt;Poldrack 2017&lt;/a&gt;, but
one reviewer censored this argument &lt;a class="footnote-reference" href="#footnote-8" id="footnote-reference-8"&gt;[8]&lt;/a&gt;. This exchange felt to me like &lt;strong&gt;a
field refusing to discuss its challenges publicly&lt;/strong&gt;, which leaves no room for
methods researchers such as myself to address them.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-8" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-8"&gt;[8]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article/peerReview?id=10.1371/journal.pcbi.1008795"&gt;Comments in the first review&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>2020: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2020-my-scientific-year-in-review.html" rel="alternate"></link><published>2021-01-05T00:00:00+01:00</published><updated>2021-01-05T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2021-01-05:/science/2020-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;The year 2020 has undoubtedly been interesting: the covid19 pandemic
struck while I was on a work sabbatical in Montréal, at the &lt;a class="reference external" href="https://www.mcgill.ca/neuro/"&gt;MNI&lt;/a&gt; and the &lt;a class="reference external" href="https://mila.quebec/"&gt;MILA&lt;/a&gt;,
and it pushed further my interest in machine learning for health-care.
&lt;strong&gt;My highlights this year revolve around basic and applied data-science
for health&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;The year 2020 has undoubtedly been interesting: the covid19 pandemic
struck while I was on a work sabbatical in Montréal, at the &lt;a class="reference external" href="https://www.mcgill.ca/neuro/"&gt;MNI&lt;/a&gt; and the &lt;a class="reference external" href="https://mila.quebec/"&gt;MILA&lt;/a&gt;,
and it pushed further my interest in machine learning for health-care.
&lt;strong&gt;My highlights this year revolve around basic and applied data-science
for health&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#mining-electronic-health-records-for-covid-19" id="toc-entry-1"&gt;Mining electronic health records for covid-19&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#machine-learning-for-dirty-data" id="toc-entry-2"&gt;Machine learning for dirty data&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#supervised-learning-with-missing-values-beyond-imputation" id="toc-entry-3"&gt;Supervised learning with Missing values: beyond imputation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#machine-learning-without-normalizing-entries" id="toc-entry-4"&gt;Machine-learning without normalizing entries&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#making-sense-of-brain-functional-signals" id="toc-entry-5"&gt;Making sense of brain functional signals&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#neuroquery-brain-mapping-any-neuroscience-query" id="toc-entry-6"&gt;NeuroQuery: brain mapping any neuroscience query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-high-resolution-brain-functional-atlas" id="toc-entry-7"&gt;A high-resolution brain functional atlas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="mining-electronic-health-records-for-covid-19"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Mining electronic health records for covid-19&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Hospital databases are rich and messy&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hospital databases&lt;/strong&gt;
In March, we &lt;a class="reference external" href="https://www.inria.fr/en/scikiteds-visualization-tool-monitoring-flow-sick-patients"&gt;teamed up with the hospitals around Paris&lt;/a&gt; that were suffering from a severe overload due to a new pathology,
covid-19. The challenge was to extract information from the huge
databases of the hospital management system: What were the characteristics
of the patients? How were the resources of the hospital evolving? Of the
treatments that were empirically attempted, which were most effective?&lt;/p&gt;
&lt;p&gt;The hospital databases are hugely promising, because &lt;strong&gt;they offer at
almost no cost information on all the patients that go through the
hospital&lt;/strong&gt;. As we were dealing with a conglomerate of 39 hospitals, this
information covers millions of patients each year: excellent
epidemiological coverage.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Challenging data science&lt;/strong&gt;
Our work was classic data science: we did a lot of data management,
crafting SQL queries and munging pandas dataframes to create data tables
for statistics and visualizations. We interacted strongly with the
hospital management and the doctors to understand the information of
interest. As we moved forward it became clear that behind each “simple”
question, there were challenges of statistical validity. We did not want
to produce a figure that was misleading. Typical challenges were:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Information needed complicated transformations (such as following a
patient hopping across hospitals to capture the patient status)&lt;/li&gt;
&lt;li&gt;Information was represented differently in the different hospitals&lt;/li&gt;
&lt;li&gt;Incorrect inputs prevented aggregation (such as erroneous entry dates
falling after the exit date, or missing values)&lt;/li&gt;
&lt;li&gt;The database had biases compared to the ground truth (simple oxygen
therapy acts are more often unreported than complicated invasive
ventilation)&lt;/li&gt;
&lt;li&gt;Censoring effects prevented the use of naive statistics (20 days into
the epidemic outbreak, most hospital stays are short simply because
patients have entered the hospitals recently)&lt;/li&gt;
&lt;li&gt;A lot of information was present as unnormalized text, sometimes in
long hand-written notes, full of acronyms and errors due to character
recognition.&lt;/li&gt;
&lt;li&gt;The data were of course often a consequence of treatment policy (the
choices of the medical staff in terms of patient handling and
measures), and hence not directly interpretable in causal or
interventional terms.&lt;/li&gt;
&lt;/ul&gt;
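&lt;p&gt;The censoring point is worth a small worked example. Below is a toy
simulation (synthetic numbers, not the hospital data) of why naively
averaging the durations of stays observed during an outbreak
underestimates the true length of stay:&lt;/p&gt;

```python
import numpy as np

# Toy simulation (synthetic, not the hospital data): patients are
# admitted uniformly over a 30-day observation window; the true length
# of stay is exponential with a 14-day mean.  Stays still ongoing at
# day 30 are censored: we only observe the time elapsed so far.
rng = np.random.default_rng(0)
n = 10_000
entry_day = rng.uniform(0, 30, n)                  # admission date
true_stay = rng.exponential(14.0, n)               # true length of stay
observed = np.minimum(true_stay, 30 - entry_day)   # censored observation

naive_mean = observed.mean()   # biased downwards by the censoring
true_mean = true_stay.mean()   # close to the 14-day ground truth
```

&lt;p&gt;The naive mean comes out well below the true 14-day mean:
survival-analysis tools (for instance Kaplan-Meier estimates) are needed
to correct for such censoring.&lt;/p&gt;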
&lt;p&gt;These challenges were very interesting to me, as they related directly to
my research agenda of &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;facilitating the processing of “dirty data”&lt;/a&gt; (more on that below).&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Most of the work that we did was not oriented toward publication, but
rather to address urgent needs of the hospitals. Some scholarly
contributions did come out:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Part of the extracted data are consolidated worldwide for medical
studies (&lt;a class="reference external" href="https://www.nature.com/articles/s41746-020-00308-0"&gt;Brat et al, Nature Digital Medicine 2020&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;We used causal-inference methods to estimate the treatment effects of
HCQ with and without Azithromycin (&lt;a class="reference external" href="https://www.medrxiv.org/content/10.1101/2020.06.16.20132597v1"&gt;Sbidian et al, MedRxiv 2020&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The data are used in follow-up medical studies (eg associating
mortality and obesity, &lt;a class="reference external" href="https://onlinelibrary.wiley.com/doi/full/10.1002/oby.23014"&gt;Czernichow et al, Obesity 2020&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Biomedical entity recognition&lt;/strong&gt; A major AI difficulty in this work is
recognizing biomedical entities, such as conditions or treatments, in the
various texts. Coincidentally, we had been working on simplifying
state-of-the-art pipelines for biomedical entity linking. While this
research work was not used on the hospital data, because it was too
bleeding-edge, it led to an AAAI paper (&lt;a class="reference external" href="https://arxiv.org/abs/2012.08844"&gt;Chen et al, AAAI 2021&lt;/a&gt;) on a state-of-the-art model for
biomedical entity linking that is much more lightweight than current
approaches.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-for-dirty-data"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Machine learning for dirty data&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Machine learning methods that can robustly ingest non-curated data.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;Dirty Data project&lt;/a&gt;, that we
undertook a few years ago, is really bearing its fruits.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="supervised-learning-with-missing-values-beyond-imputation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Supervised learning with Missing values: beyond imputation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The classic view on processing data with missing values is to try to
&lt;em&gt;impute&lt;/em&gt; the missing values: replace them by probable values (or better,
compute the distribution of the unobserved values given the observed
ones). However, such an approach needs a model of the missing-values
mechanism; this is simple only when the values are missing at random.
We have been studying the alternative view, based on directly estimating
a predictive function to be applied to data with missing values.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/mnar_versus_mcar.png" style="width: 500px;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Missing-values mechanisms&lt;/strong&gt;: black dots are fully-observed data
points, while grey ones are partially observed. The left panel
displays a missing-at-random situation, where missingness is
independent of the underlying values. On the contrary, in a
missing-not-at-random situation (right panel), whether values are
observed or not depends on the underlying values (potentially
unobserved).&lt;/p&gt;
&lt;/div&gt;
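&lt;p&gt;The difference matters for statistics. A minimal numpy sketch
(synthetic data, not from the paper) of the two mechanisms:&lt;/p&gt;

```python
import numpy as np

# Toy illustration of the two missingness mechanisms (synthetic data).
rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 5_000)

# Missing completely at random: missingness independent of the values.
mcar_mask = rng.binomial(1, 0.3, x.size).astype(bool)

# Missing not at random: large values are preferentially unobserved.
mnar_mask = np.greater(x, 0.5)

mean_full = x.mean()
mean_mcar = x[~mcar_mask].mean()   # unbiased estimate of mean_full
mean_mnar = x[~mnar_mask].mean()   # biased: the large values are gone
```

&lt;p&gt;Under the missing-not-at-random mechanism, the observed mean is
systematically shifted, which no model oblivious to the mechanism can
correct.&lt;/p&gt;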
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v108/morvan20a.html"&gt;Le Morvan et al, AIStats 2020&lt;/a&gt; studied the
seemingly-simple case of a linear generative mechanism and showed that,
with missing values, the optimal predictor was a complex, piecewise
linear, function of the observed data concatenated with the
missing-values mask. This function can be implemented with a neural
network with ReLu activation functions, fed with data where missing
values are replaced by zeros and corresponding indicator features are
added.&lt;/p&gt;
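&lt;p&gt;This input encoding is easy to sketch in numpy (a hedged
illustration, not the paper's code):&lt;/p&gt;

```python
import numpy as np

def encode_with_mask(X):
    """Zero-impute NaNs and append the missingness mask as indicator
    features: the encoding that lets a ReLU network represent the
    piecewise-linear optimal predictor described above."""
    mask = np.isnan(X)
    X_imputed = np.where(mask, 0.0, X)
    return np.concatenate([X_imputed, mask.astype(float)], axis=1)

X = np.array([[1.0, np.nan],
              [np.nan, 3.0]])
X_enc = encode_with_mask(X)   # shape (2, 4): imputed values, then mask
```

&lt;p&gt;In scikit-learn, a SimpleImputer with strategy="constant",
fill_value=0 and add_indicator=True produces the same kind of
features.&lt;/p&gt;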
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To go one step further, we noticed that the optimal predictor uses the
correlation between features (&lt;em&gt;eg&lt;/em&gt; on fully-observed data) to compensate
for missing values.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/compensation_effects.jpeg" style="width: 700px;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Compensation effects&lt;/strong&gt;: The optimal predictor uses the correlation
between features to compensate when a value is missing.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="https://neurips.cc/virtual/2020/public/poster_42ae1544956fbe6e09242e6cd752444c.html"&gt;Le Morvan et al, NeurIPS 2020&lt;/a&gt;
devise a neural-network architecture that efficiently captures these
links across the features. Mathematically, it stems from seeking good
functional forms to approximate the expression of the optimal predictor,
that can be derived for various missing-values mechanisms. A non-trivial
result is that a simple functional form can approximate the optimal
predictor under very different mechanisms.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/neumiss_nb_parameters.jpeg" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Better parameter efficiency&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The resulting architecture needs far fewer parameters (depth or width)
than a fully-connected multi-layer perceptron to predict well in the
presence of missing values. This, in turn, leads to better performance
on limited data sizes.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-without-normalizing-entries"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Machine-learning without normalizing entries&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A challenge of data management is that the same information may be
represented in different ways, typically with different strings denoting
the same, or related entities. For instance, in the following table, the
&lt;em&gt;employee position title&lt;/em&gt; column contains such non-normalized
information:&lt;/p&gt;
&lt;blockquote&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="13%" /&gt;
&lt;col width="47%" /&gt;
&lt;col width="40%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Sex&lt;/th&gt;
&lt;th class="head"&gt;Employee Position Title&lt;/th&gt;
&lt;th class="head"&gt;Years of experience&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Master Police Officer&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Social Worker IV&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Police Officer III&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Police Aide&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Electrician I&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Bus Operator&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Bus Operator&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Social Worker III&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Library Assistant I&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Library Assistant I&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;Typos, or other morphological variants (such as varying abbreviations)
often make things worse. We found many instances of such challenges in
electronic health records.&lt;/p&gt;
&lt;p&gt;In a data-science analysis, such data has categorical meaning, but a
typical categorical-data representation (such as one-hot encoding) breaks
down: there are too many categories, and in machine learning, the test
set may come with new categories.&lt;/p&gt;
&lt;p&gt;The standard practice is to curate the data: represent the information in
a normalized way, without morphological variants, separating the
various bits of information (for instance the type of job from the rank).
This typically requires a lot of human labor.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/gamma_poisson_encoding.png" style="width: 600px;" /&gt;
&lt;p class="caption"&gt;The original categories and their continuous representation on latent
categorical features inferred from the data.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="https://ieeexplore.ieee.org/abstract/document/9086128"&gt;Cerda &amp;amp; Varoquaux, TKDE 2020&lt;/a&gt; give two
efficient approaches to encode such data for statistical analysis
capturing string similarities. The most interpretable of these approaches
represents the data by continuous encoding on latent categories inferred
automatically from recurrent substrings.&lt;/p&gt;
&lt;p&gt;This research is implemented in the &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub&lt;/a&gt;
Python library (originally called dirty-cat), which is making rapid
progress.&lt;/p&gt;
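&lt;p&gt;To give a feel for the underlying idea, here is a minimal sketch of
character n-gram similarity, the ingredient that lets such encoders see
morphological variants as close (a simplified stand-in, not skrub's
actual implementation):&lt;/p&gt;

```python
def ngrams(s, n=3):
    """Set of character n-grams of a string, with whitespace padding."""
    s = ' ' + s.lower() + ' '
    return set(s[i:i + n] for i in range(len(s) - n + 1))

def similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga.intersection(gb)) / len(ga.union(gb))

# Morphological variants of the same job come out as similar, whereas
# one-hot encoding would treat them as orthogonal categories:
s_close = similarity('Police Officer III', 'Master Police Officer')
s_far = similarity('Police Officer III', 'Social Worker IV')
```

&lt;p&gt;Building features from such similarities, rather than from exact
category matches, gives representations that degrade gracefully with
typos and unseen categories.&lt;/p&gt;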
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="making-sense-of-brain-functional-signals"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Making sense of brain functional signals&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Turning brain-imaging signal into insights&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Brain imaging, and in particular functional brain imaging, is amazing,
because it gives a window on brain function, whether it is to understand
cognition, behavior, or pathologies. One challenge that I have been
interested in, across the years, is how to give systematic sense to these
signals, in a broader perspective than a given study.&lt;/p&gt;
&lt;div class="section" id="neuroquery-brain-mapping-any-neuroscience-query"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;NeuroQuery: brain mapping any neuroscience query&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Systematically linking mental processes and disorders to brain structures
is a very difficult task because of the huge diversity of behavior.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://elifesciences.org/articles/53385"&gt;Dockes et al, elife 2020&lt;/a&gt; we used text mining on a
large number of brain-imaging publications to predict where in the brain
a given subject of study (in neuroscience, behavior, and related
pathologies) would report findings.&lt;/p&gt;
&lt;p&gt;With this model, we built a web application, &lt;a class="reference external" href="https://neuroquery.org"&gt;NeuroQuery&lt;/a&gt; in which the user can type a neuroscience
query, and get a brain map of where a study on the topic is likely to
report findings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-high-resolution-brain-functional-atlas"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;A high-resolution brain functional atlas&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Regions to summarize the fMRI signal&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Atlases of brain regions are convenient to summarize the information of
brain images, turning them into information easy to analyse. We have long
studied the specific case of functional brain atlases, extracting and
validating them from brain imaging data. &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811920306121"&gt;Dadi NeuroImage 2020&lt;/a&gt;
contributes a high-resolution brain functional atlas, DiFuMo. This atlas
can be browsed or downloaded &lt;a class="reference external" href="https://parietal-inria.github.io/DiFuMo/"&gt;online&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/difumo.jpg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;The functional regions, at dimension 512.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The atlas comes with various resolutions, and all the structures that it
segments have been given meaningful names. In the paper, we showed that
using this atlas to extract functional signals led to better analyses for
a large number of problems compared to the atlases commonly used. We thus
recommend this atlas, for instance to extract Image-Derived Phenotypes in
population analyses, where the huge size of the data requires working on
summarized information.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/putamen_difumo.png" /&gt;
&lt;p class="caption"&gt;The region capturing the right hemisphere putamen.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="health"></category><category term="covid19"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>Survey of machine-learning experimental methods at NeurIPS2019 and ICLR2020</title><link href="https://gael-varoquaux.info/science/survey-of-machine-learning-experimental-methods-at-neurips2019-and-iclr2020.html" rel="alternate"></link><published>2020-01-22T00:00:00+01:00</published><updated>2020-01-22T00:00:00+01:00</updated><author><name>Xavier Bouthillier &amp; Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-01-22:/science/survey-of-machine-learning-experimental-methods-at-neurips2019-and-iclr2020.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;A simple survey asking authors of two leading machine-learning
conferences a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;How do machine-learning researchers run their empirical validation? In
the context of a push for improved reproducibility and benchmarking, this
question is important to develop new tools for model comparison. We …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;A simple survey asking authors of two leading machine-learning
conferences a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;How do machine-learning researchers run their empirical validation? In
the context of a push for improved reproducibility and benchmarking, this
question is important to develop new tools for model comparison. We ran a
simple survey asking authors of two leading conferences, NeurIPS 2019
and ICLR 2020, a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;p&gt;A &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;technical report on HAL&lt;/a&gt; summarizes our
findings. It gives a simple picture of how hyper-parameters are set, how
many baselines and datasets are included, or how seeds are used.
Below, we give a very short summary, but please read (and &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823v1/bibtex"&gt;cite&lt;/a&gt;)
&lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;the full report&lt;/a&gt; if you are interested.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;
The response rates were 35.6% for NeurIPS and 48.6%
for ICLR.
A vast majority of empirical works optimize model hyper-parameters,
though almost half of these use manual tuning, and most of the automatic
hyper-parameter optimization is done with grid search. The typical number
of hyper-parameters tuned is in the interval 3-5, and fewer than 50 model fits
are used to explore the search space. In addition, most works also
optimized their baselines (typically, around 4 baselines).
Finally, studies typically reported 4 results per model per task to provide a measure of variance, and around 50% of them
used a different random seed for each experiment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample results&lt;/strong&gt;&lt;/p&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/hyper_parameter_optimization.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;How many papers with experiments optimized hyperparameters.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/tuning_methods.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;What hyperparameter optimization method were used.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/number_datasets.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Number of different datasets used for benchmarking.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/number_seeds_or_trials.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Number of results reported for each model (ex: for different seeds)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;These are just samples. Read &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;the full report&lt;/a&gt; for
more results.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For reproducibility and AutoML, there is active research in benchmarking
and hyper-parameter procedures in machine learning. We hope that the
survey results presented here can help inform this research. As this
document is merely a research report, we purposely limited the
interpretation of the results and refrained from drawing recommendations. However, trends that stand out to our
eyes are: &lt;cite&gt;1)&lt;/cite&gt; the simplicity of hyper-parameter tuning strategies
(mostly manual search and grid search),  &lt;cite&gt;2)&lt;/cite&gt; the small number of
model fits explored during this tuning (often 50 or fewer), which biases the
results, and &lt;cite&gt;3)&lt;/cite&gt; the small number of performances reported, which limits
statistical power. These
practices are most likely due to the high computational cost of fitting
modern machine-learning models.&lt;/p&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Code&lt;/p&gt;
&lt;p class="last"&gt;The code used for plotting and analysis is &lt;a class="reference external" href="https://github.com/bouthilx/ml-survey-2020"&gt;on github&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt; We are deeply grateful to the participants of
the survey who took time to answer the questions.&lt;/p&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="benchmarking"></category><category term="conferences"></category><category term="experimental methods"></category></entry><entry><title>2019: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2019-my-scientific-year-in-review.html" rel="alternate"></link><published>2020-01-05T00:00:00+01:00</published><updated>2020-01-05T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-01-05:/science/2019-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;My current research spans wide: from brain sciences to core data
science. My overall interest is to build &lt;strong&gt;methodology drawing insights from
data&lt;/strong&gt; for questions that have often been addressed qualitatively. If I can
highlight a few publications from 2019 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;, the common thread would be
computational statistics, from dirty …&lt;/p&gt;</summary><content type="html">&lt;p&gt;My current research spans wide: from brain sciences to core data
science. My overall interest is to build &lt;strong&gt;methodology drawing insights from
data&lt;/strong&gt; for questions that have often been addressed qualitatively. If I can
highlight a few publications from 2019 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;, the common thread would be
computational statistics, from dirty data to brain images. Let me try to
give the gist of this progress, in simple terms.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It’s already 2020, I’m always late.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#comparing-distributions" id="toc-entry-1"&gt;Comparing distributions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#predictive-pipelines-on-brain-functional-connectomes" id="toc-entry-2"&gt;Predictive pipelines on brain functional connectomes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#population-shrinkage-of-covariance" id="toc-entry-3"&gt;Population shrinkage of covariance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#deep-learning-on-non-translation-invariant-images" id="toc-entry-4"&gt;Deep learning on non-translation-invariant images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#open-science" id="toc-entry-5"&gt;Open science&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="comparing-distributions"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Comparing distributions&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Fundamental computational-statistics work&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What if you are given two sets of observations and need to decide
whether they are drawn from the same distribution? We are interested in
this question for the &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;
research project, to facilitate analysis of data without manual curation.
Comparing distributions is indeed important to detect drifts in the data,
to match information across datasets, or to compensate for dataset
biases.&lt;/p&gt;
&lt;p&gt;Formally, we are given two clouds of points (circles and crosses in the
figure below) and we want to develop a statistical test of whether the
distributions differ. There is an abundant literature on this topic, which I
cover in &lt;a class="reference external" href="http://gael-varoquaux.info/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html"&gt;a more detailed post on this subject&lt;/a&gt;.
Specifically, when the observations have a natural similarity, for
instance when they live in a vector space, kernel methods are interesting
because they make it possible to estimate a representative of the underlying
distribution that interpolates between observations, as with &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;a kernel
density estimator&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing"&gt;&lt;img alt="" src="../science/attachments/comparing_distributions_l1/optimizing_position.png" style="width: 500px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Two cloud of points, the corresponding distribution representants μ_P
and μ_Q (blue and orange), the difference between these
(black), and locations to measure this difference (red triangles).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With Meyer Scetbon, in
&lt;a class="reference external" href="http://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing"&gt;Scetbon &amp;amp; Varoquaux, NeurIPS&lt;/a&gt;,
we investigate how best to measure the difference between these
representatives. We show that the best choice is to take the absolute value
of the difference (the l1 norm), while the default choice had so far been
the Euclidean (l2) norm. In a nutshell, the reason is that the difference
is most likely dense when the distributions differ: zero almost nowhere.&lt;/p&gt;
&lt;p&gt;We were able to show that the &lt;a class="reference external" href="https://slideslive.com/38921490/interpretable-comparison-of-distributions-and-models"&gt;sophisticated framework&lt;/a&gt;
for efficient and powerful tests in the
Euclidean case carries over to the l1 case. In particular, our paper
gives efficient testing procedures using a small number of locations to
avoid costly computation (the red triangles in the figure above), that
can either be sampled at random or optimized.&lt;/p&gt;
&lt;p&gt;My hunch is that the result is quite general: the l1 geometry is better
than the l2 one on representatives of distributions. There might be more
fundamental mathematical properties behind this. The drawback is that the
l1 norm is non-smooth, which can be challenging in optimization settings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="predictive-pipelines-on-brain-functional-connectomes"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Predictive pipelines on brain functional connectomes&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Brain-imaging methods&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Brain functional connectivity is increasingly used to extract biomarkers
of behavior and mental health. The long-term stakes are to ground
assessment of psychological traits on quantitative brain
data, rather than qualitative behavioral observations. But, to build
biomarkers, many details go into estimating functional
connectivity from fMRI, something that I have studied for more than 10
years. With Kamalakar Dadi, in &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811919301594"&gt;Dadi et al&lt;/a&gt;,
we ran thorough empirical benchmarks to find which methodological choices
for the various steps of the pipeline give best prediction across
multiple cohorts. Specifically, we studied 1) defining regions of
interest for signal extraction, 2) building a functional-connectivity
matrix across these regions, 3) prediction across subjects with
supervised learning on these features.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811919301594"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/dadi_2019_highlights.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Summarizing our benchmark results.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Recommendations&lt;/p&gt;
&lt;ul class="last simple"&gt;
&lt;li&gt;functional regions (eg from dictionary learning)&lt;/li&gt;
&lt;li&gt;tangent-space for covariances&lt;/li&gt;
&lt;li&gt;l2-logistic regression&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;Results show the importance of defining regions from functional data,
ideally with a linear-decomposition method that produces soft
parcellations such as ICA or dictionary learning. To represent
connectivity between regions, the best choice is tangent-space
parametrization, a method to build a vector-space from covariance
matrices (more below). Finally, for supervised learning, a simple
l2-penalized logistic regression is the best option. Given the huge popularity
of deep learning, it may come as a surprise that linear models are the best
performers, but this is well explained by the amount of data at hand: a
cohort typically comprises fewer than 1000 individuals, far below the
data sizes needed to see the benefits of non-linear models.&lt;/p&gt;
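&lt;p&gt;To make the last step concrete, here is a rough sketch of it with scikit-learn: an l2-penalized logistic regression evaluated with cross-validation. The features are random stand-ins for vectorized connectivity coefficients, with an artificial group difference injected; the shapes and signal are made up for illustration.&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Stand-in for vectorized functional-connectivity features:
# 200 subjects, 300 connectivity coefficients each.
X = rng.normal(size=(200, 300))
y = rng.integers(0, 2, size=200)
# Inject a group difference on a few connections, so there is signal to find.
X[y == 1, :10] += 1.0

# l2-penalized logistic regression: a strong default at these sample sizes.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

&lt;p&gt;In a real pipeline, X would come from steps 1 and 2: signals extracted on functional regions, then tangent-space connectivity coefficients.&lt;/p&gt;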
&lt;p&gt;A recent preprint, &lt;a class="reference external" href="https://www.biorxiv.org/content/10.1101/741595v2.abstract"&gt;Pervaiz et al&lt;/a&gt; from
Oxford, overall
confirms our findings, even though they investigated slightly
different methodological choices. In particular, they find tangent space
clearly useful.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In my eyes, such benchmarking studies are important not only to improve
prediction, but also to reduce analytic variability that opens the door
to inflation of reported effects. Indeed, given 1000 individuals, the
measure of prediction accuracy of a pipeline is quite imprecise
(&lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811917305311"&gt;Varoquaux 2018&lt;/a&gt;).
As a consequence, trying out a bunch of analytic choices and
publishing the one that works best can lead to grossly optimistic
prediction accuracies. &lt;strong&gt;If we want trust in biomarkers, we need to
reduce the variability in the methods used to build them&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="population-shrinkage-of-covariance"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Population shrinkage of covariance&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;Statistics for brain signals&lt;/p&gt;
&lt;p&gt;Estimating covariances is central for functional brain connectivity and
in many other applications. With Mehdi Rahim, in &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;Rahim et al&lt;/a&gt;
we considered the case of a population of random processes with
related covariances, as for instance when estimating functional
connectivity from a group of individuals. For this, we combined two
mathematical ideas: that of using natural operations on covariance
matrices, and that of priors for mean-square estimation:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Tangent space&lt;/strong&gt; Covariance matrices are positive-definite matrices,
for which standard arithmetic is not well suited &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;: subtracting
two covariance matrices can lead to a matrix that cannot be
the covariance of a signal. However, a group of covariance matrices can
be transformed into points in a vector space for which standard
distances and arithmetic respect the structure of
covariances (for instance, the Euclidean distance between these points
approximates the KL divergence between covariances). This is what we call
the &lt;em&gt;tangent space&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Technically, covariance matrices live on a Riemannian manifold:
a curved surface inside &lt;em&gt;R^{n x n}&lt;/em&gt; with specific metric
properties.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;James-Stein shrinkage&lt;/strong&gt; To estimate the mean of &lt;em&gt;n&lt;/em&gt; observations, it
is actually best not to compute their plain average, but rather to
push this average a bit toward a prior guess. The better the
guess, the more this “push” helps. The more observations there are,
the gentler this push should be. This strategy is known as
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator"&gt;James-Stein shrinkage&lt;/a&gt; and it
is in my opinion one of the most beautiful results in statistics.
It can be seen as a Bayesian posterior, but it comes with guarantees
that do not require the model to be true and that control estimation
error, rather than a posterior probability.&lt;/li&gt;
&lt;/ul&gt;
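&lt;p&gt;The shrinkage effect is easy to check numerically. Below is a toy NumPy simulation of the classic (positive-part) James-Stein estimator on Gaussian vectors, shrinking toward a prior guess of zero; the dimension and noise level are arbitrary choices for the demo.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
d, n_trials = 50, 2000
theta = rng.normal(size=d)   # unknown mean to estimate
sigma = 2.0

mse_mle, mse_js = 0.0, 0.0
for _ in range(n_trials):
    x = theta + sigma * rng.normal(size=d)   # one draw of N(theta, sigma^2 I)
    # James-Stein: shrink x toward the prior guess 0.
    shrink = 1.0 - (d - 2) * sigma**2 / np.sum(x**2)
    x_js = max(shrink, 0.0) * x              # positive-part variant
    mse_mle += np.sum((x - theta) ** 2)
    mse_js += np.sum((x_js - theta) ** 2)

# Ratio of estimation errors: strictly below 1 in dimension 3 or more.
print(mse_js / mse_mle)
```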
&lt;p&gt;James-Stein shrinkage is easily written for quadratic errors on vectors,
but cannot be easily applied to covariances, as they do not live in a vector
space and we would like to control a KL divergence rather than
a quadratic error. Our work combined both ideas to give an excellent
estimator of a family of related covariances that is also very
computationally efficient. We call it PoSCE: Population Shrinkage
Covariance Estimation.&lt;/p&gt;
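&lt;p&gt;A minimal NumPy/SciPy sketch of the two ingredients combined, not the actual PoSCE estimator: here the reference point is a plain Euclidean mean rather than a geometric mean, and the shrinkage is isotropic rather than learned from the population dispersion.&lt;/p&gt;

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm

rng = np.random.default_rng(0)

def random_spd(d):
    a = rng.normal(size=(d, d))
    return a @ a.T + d * np.eye(d)

d = 4
covs = [random_spd(d) for _ in range(20)]

# Reference point: Euclidean mean (a simple stand-in for the geometric mean).
ref = np.mean(covs, axis=0)
w = np.real(sqrtm(ref))
w_inv = np.linalg.inv(w)

# Project each covariance to the tangent space at the reference...
tangent = [np.real(logm(w_inv @ c @ w_inv)) for c in covs]
# ...shrink toward the group mean (isotropic here; PoSCE learns an
# anisotropic prior from the dispersion of the population)...
t_mean = np.mean(tangent, axis=0)
alpha = 0.5
shrunk = [alpha * t + (1 - alpha) * t_mean for t in tangent]
# ...and map back: the results are symmetric positive definite by
# construction, i.e. valid covariance matrices.
covs_shrunk = [w @ np.real(expm(t)) @ w for t in shrunk]

print(np.linalg.eigvalsh(covs_shrunk[0]).min())
```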
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/posce.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Schema of the estimation strategy: projecting the covariances matrices
into a tangent space, shrinkage to a group mean, but taking in account
the anisotropy of the dispersion of the group, and projecting back to
covariances.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is easy to see how accounting for group information in the estimation
of individual covariances can help stabilize them. However, will it be
beneficial if we are interested in the differences between these
covariances, for instance to ground biomarkers, as studied above? Our
results show that it does indeed help build better biomarkers, for
instance to predict brain age. The larger the group of covariances used,
the larger the benefits.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/posce_age_learning_curve.png" style="width: 500px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Error in predicting brain aging decreases when more individuals are used
to build the biomarker.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="deep-learning-on-non-translation-invariant-images"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Deep learning on non-translation-invariant images&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;Computer vision&lt;/p&gt;
&lt;p&gt;Brain images, in particular images of brain activity, are very different
from the natural images on which most computer-vision research focuses.
A central difference is that detecting activity in different parts of the
brain completely changes the meaning of this detection, while detecting a
cat in the left or the right of a picture on Facebook makes no
difference. This is important because many advances in computer vision,
such as convolutional neural networks, are built on the fact that natural
images are statistically translation invariant. By contrast, brain
images are realigned to a template before being analyzed.&lt;/p&gt;
&lt;p&gt;Convolutional architectures have been crucial to the successes of deep
learning on natural images because they impose a lot of structure on the
weights of neural networks and thus help fight estimation noise. For
predicting from brain images, the regularization strategies that have
been successful foster spatially continuous structures. Unfortunately,
they lead to costly non-smooth optimizations that cannot easily be
used with the optimization framework of deep learning, stochastic
gradient descent.&lt;/p&gt;
&lt;p&gt;With Sergul Aydore, in &lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;Aydore et al, ICML&lt;/a&gt;, we have introduced a
spatial regularization that is compatible with the deep learning toolbox.
During the stochastic optimization, we impose random spatial structure
via feature groups estimated from the data. These stabilize the input
layers of deep architectures. They also lead to iterating on smaller
representations, which greatly speeds up the algorithm.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/stochastic_grouping_mlp.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;At each step of a stochastic gradient descent, we randomly pick a
feature-grouping matrix (itself estimated from the data), and use it
to reduce the data in the computations of the gradients, then invert
this reduction to update the weights.&lt;/p&gt;
&lt;/div&gt;
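&lt;p&gt;The mechanics can be sketched in a few lines of NumPy for a plain linear model. This is a simplification: the paper estimates the feature groupings from the data (eg by clustering), whereas here they are drawn uniformly at random.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 64, 1000, 50   # samples, features, number of feature groups

X = rng.normal(size=(n, p))
y = rng.normal(size=n)
w = np.zeros(p)           # weights of a linear model

for step in range(100):
    # Draw a random feature grouping: each feature assigned to one of k groups.
    groups = rng.integers(0, k, size=p)
    G = np.zeros((p, k))
    G[np.arange(p), groups] = 1.0
    G /= np.sqrt(np.maximum(G.sum(axis=0), 1.0))   # normalize each group

    # Reduce: iterate on k grouped features instead of p raw ones.
    X_red = X @ G
    w_red = G.T @ w
    grad_red = X_red.T @ (X_red @ w_red - y) / n
    # Expand the update back to the full weight vector.
    w -= 0.1 * (G @ grad_red)

print(w.shape)
```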
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;The paper&lt;/a&gt; comes with
extensive empirical validation, including comparison to convolutional
neural networks. We benchmark the strategy on brain images, but also
on realigned faces, to show that the approach is beneficial for any
non-translation-invariant images. In particular, the approach greatly
speeds up convergence.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/stochastic_grouping_results.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Prediction accuracy as a function of training time – left: on
realigned faces – right: on brain images&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;This paper&lt;/a&gt; clearly
shows that &lt;strong&gt;one should not use convolutional neural networks on fMRI
data&lt;/strong&gt;: these images are not translation invariant.&lt;/p&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Preprints&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;All papers are available as preprints, eg on &lt;a class="reference external" href="http://gael-varoquaux.info/publications.html"&gt;my site&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="open-science"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Open science&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Open and reproducible science:&lt;/strong&gt; Looking at all these publications, I
realize that every single one of them comes with code on a GitHub
repository and is done on open data, which means that they can all be
easily reproduced. I’m very proud of the teams behind these papers.
Achieving this level of reproducibility requires hard work and
discipline. It is also a testament to the community investment in
software tools and infrastructure for open science that has been going on
for decades and gives the foundations on which these works build.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;A prize for scikit-learn:&lt;/strong&gt; On this topic, a highlight of 2019 was also
that the work behind scikit-learn was acknowledged in &lt;a class="reference external" href="../programming/getting-a-big-scientific-prize-for-open-source-software.html"&gt;an important
scientific prize&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Why open science:&lt;/strong&gt; Why do I care so much for open science? Because in
a world of uncertainty, the claims of science must be trusted and hence
built on transparent practice (think about science and global warming).
Because it helps put our methods in the hands of a wider public,
and of society at large. And because it levels the playing field, making it easier for
newcomers –young scientists, or those from developing countries– to contribute,
which in itself makes science more efficient.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>Comparing distributions: Kernels estimate good representations, l1 distances give good tests</title><link href="https://gael-varoquaux.info/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html" rel="alternate"></link><published>2019-12-08T00:00:00+01:00</published><updated>2019-12-08T00:00:00+01:00</updated><author><name>Meyer Scetbon &amp; Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2019-12-08:/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;Given two sets of observations, are they drawn from the same
distribution? Our paper &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Comparing distributions: l1 geometry
improves kernel two-sample testing&lt;/a&gt;
at the &lt;strong&gt;NeurIPS 2019 conference&lt;/strong&gt; revisits this classic statistical
problem known as “two-sample testing”.&lt;/p&gt;
&lt;p class="last"&gt;This post explains the context and the paper with a bit of hand …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;Given two sets of observations, are they drawn from the same
distribution? Our paper &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Comparing distributions: l1 geometry
improves kernel two-sample testing&lt;/a&gt;
at the &lt;strong&gt;NeurIPS 2019 conference&lt;/strong&gt; revisits this classic statistical
problem known as “two-sample testing”.&lt;/p&gt;
&lt;p class="last"&gt;This post explains the context and the paper with a bit of hand
waiving.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-context-two-sample-testing" id="toc-entry-1"&gt;The context: two-sample testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#from-kernel-mean-embeddings-to-distances-on-distributions" id="toc-entry-2"&gt;From kernel mean embeddings to distances on distributions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#controlling-the-weak-convergence-of-probability-measures" id="toc-entry-3"&gt;Controlling the weak convergence of probability measures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#two-sample-testing-procedures" id="toc-entry-4"&gt;Two-sample testing procedures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-l1-metric-provides-best-testing-power" id="toc-entry-5"&gt;The L1 metric provides best testing power&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="the-context-two-sample-testing"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;The context: two-sample testing&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Given two samples from two unknown populations, the goal of two-sample tests is
to determine whether the underlying populations differ with a statistical
significance. For instance, we may care to know whether the
McDonald’s and KFC use different logic to chose locations of restaurants
across the US. This is a difficult question: we have access to data points,
but not the underlying generative mechanism, that is probably governed by
marketing strategies.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/comparing_distributions_l1/map_KFC_McDo_simple.png" style="width: 70%;" /&gt;
&lt;/div&gt;
&lt;div class="section" id="from-kernel-mean-embeddings-to-distances-on-distributions"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;From kernel mean embeddings to distances on distributions&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the example of the spatial distribution of restaurants,
there is &lt;strong&gt;a lot of information in how close observed data
points lie in the original measurement space (here geographic coordinates)&lt;/strong&gt;.
Kernel methods arise naturally to capture this information. They can be
applied to distributions, building representatives of distributions:
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_embedding_of_distributions"&gt;Kernel embeddings of distributions&lt;/a&gt;. The
mean embedding of a distribution P with a kernel k is written:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;) :  = &lt;span class="limits"&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator integral"&gt;∫&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;sub&gt;ℝ&lt;sup&gt;&lt;i&gt;d&lt;/i&gt;&lt;/sup&gt;&lt;/sub&gt;&lt;i&gt;k&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;, &lt;i&gt;t&lt;/i&gt;)&lt;i&gt;dP&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;)
&lt;/div&gt;
&lt;p&gt;Intuitively, it is related to &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;Kernel Density Estimates (KDEs)&lt;/a&gt; which
estimate a density in continuous space by smoothing the observed data
points with a kernel.&lt;/p&gt;
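&lt;p&gt;Empirically, the mean embedding evaluated at a point t is simply the average of kernel evaluations between t and the observed points. A small NumPy sketch with a Gaussian kernel (the bandwidth is an arbitrary choice for the demo):&lt;/p&gt;

```python
import numpy as np

def mean_embedding(X, t, bandwidth=1.0):
    """Empirical kernel mean embedding of a sample X evaluated at point t,
    with a Gaussian kernel: the average of k(x_i, t) over the sample."""
    sq_dists = np.sum((X - t) ** 2, axis=1)
    return np.mean(np.exp(-sq_dists / (2 * bandwidth ** 2)))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))   # sample from P
t_near = np.zeros(2)            # location in a dense region of the data
t_far = np.full(2, 5.0)         # location far from the data

# Large value near the data, value near zero far from it.
print(mean_embedding(X, t_near), mean_embedding(X, t_far))
```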
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/comparing_distributions_l1/kde.jpg" /&gt;
&lt;p class="caption"&gt;Kernel mean embeddings for two distributions of points&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;For two-sample testing, kernel embeddings can be used to compute distances
between distributions, building metrics over the space of probability
measures. Metrics between probability measures can be defined via the
notion of Integral Probability Metric (IPM): as a difference of
expectations:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;span class="text"&gt;IPM&lt;/span&gt;[&lt;i&gt;F&lt;/i&gt;, &lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;] :  = &lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;sup&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;f&lt;/i&gt; ∈ &lt;i&gt;F&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;(𝔼&lt;sub&gt;&lt;i&gt;x&lt;/i&gt; ∼ &lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;&lt;span class="stretchy"&gt;[&lt;/span&gt;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;)&lt;span class="stretchy"&gt;]&lt;/span&gt; − 𝔼&lt;sub&gt;&lt;i&gt;y&lt;/i&gt; ∼ &lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;&lt;span class="stretchy"&gt;[&lt;/span&gt;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;y&lt;/i&gt;)&lt;span class="stretchy"&gt;]&lt;/span&gt;)
&lt;/div&gt;
&lt;p&gt;where F is a class of functions. This definition is appealing because it
&lt;strong&gt;characterizes the difference between P and Q by the function for which
the expectation differs most&lt;/strong&gt;. The specific choice of function class
defines the metric. If we now consider a kernel, it implicitly defines a
space of functions (intuitively related to all the possible KDEs
generated by varying data points): a Reproducing Kernel Hilbert Space
(RKHS). Defining a metric (an IPM) with the function class F as the unit
ball in such an RKHS is known as the Maximum Mean Discrepancy (MMD). It
can be shown that, rather than computing the maximum, the MMD has a more
convenient expression, the RKHS distance between the mean embeddings:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;span class="text"&gt;MMD&lt;/span&gt;[&lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;] = ‖&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt; − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;‖&lt;sub&gt;&lt;i&gt;H&lt;/i&gt;&lt;sub&gt;&lt;i&gt;k&lt;/i&gt;&lt;/sub&gt;&lt;/sub&gt;
&lt;/div&gt;
&lt;p&gt;For good choices of kernels, the MMD has appealing mathematical
properties to compare distributions. With kernels said to be
characteristic, eg Gaussian kernels, the MMD is a metric: MMD[P, Q] = 0
if and only if P = Q. Using the MMD for two-sample testing –given only
observations from the distributions, and not P and Q–  requires using an
empirical estimation of the MMD. This can be done by computing the RKHS
norm in the expression above, which leads to summing kernel evaluations
on all data points in P and Q.&lt;/p&gt;
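&lt;p&gt;To make this concrete, here is a sketch of the (biased) empirical MMD with a Gaussian kernel, computed as just described by summing kernel evaluations over all pairs of data points; it is not an optimized or unbiased implementation:&lt;/p&gt;

```python
import numpy as np

def mmd_biased(X, Y, bandwidth=1.0):
    """Biased empirical MMD with a Gaussian kernel: the RKHS distance between
    the two empirical mean embeddings, via sums of kernel evaluations."""
    def k(A, B):
        sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * bandwidth ** 2))
    val = k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
    return np.sqrt(max(val, 0.0))   # guard against tiny negative rounding

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
Y_same = rng.normal(size=(300, 2))         # same distribution as X
Y_shift = rng.normal(size=(300, 2)) + 1.0  # shifted distribution

# Small value for matching distributions, larger value for the shifted one.
print(mmd_biased(X, Y_same), mmd_biased(X, Y_shift))
```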
&lt;p&gt;Our work builds upon this framework, but deviates a bit from the
classical definition of MMD as it addresses the question of which norm is
best to use on the difference of mean embeddings, µQ - µP (as well as
other representatives, namely the smooth characteristic function, SCF).
We consider a wider family of metrics based on the Lp distances between
mean embeddings (p=2 recovers the classic framework):&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;d&lt;/i&gt;&lt;sub&gt;&lt;i&gt;L&lt;/i&gt;&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;, &lt;i&gt;μ&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;) :  = &lt;span class="stretchy"&gt;(&lt;/span&gt;&lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator integral"&gt;∫&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;t&lt;/i&gt; ∈ ℝ&lt;sup&gt;&lt;i&gt;d&lt;/i&gt;&lt;/sup&gt;&lt;/sub&gt;&lt;/span&gt;|&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;) − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;)|&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;d&lt;/i&gt;Γ(&lt;i&gt;t&lt;/i&gt;)&lt;span class="stretchy"&gt;)&lt;/span&gt;&lt;sup&gt;1 ⁄ &lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;
&lt;/div&gt;
&lt;p&gt;where Γ is an absolutely continuous Borel probability measure.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="controlling-the-weak-convergence-of-probability-measures"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Controlling the weak convergence of probability measures&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We show that these metrics have good properties. Specifically, for p ≥ 1,
as soon as the kernel is bounded, continuous, and characteristic, these
metrics metrize weak convergence: the distance between a sequence of
distributions and a target distribution tends to zero if and only if the
sequence converges weakly to that target.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Convergence_of_measures#Weak_convergence_of_measures"&gt;weak convergence of probability measures&lt;/a&gt;
is a notion of convergence that is based &lt;strong&gt;not just on having events with
probabilities that are the same for the two distributions, but also that some events are
“close”&lt;/strong&gt;. Indeed, classic convergence in probability just tells us that
the same observation should have the same probability under the two
distributions. Weak convergence takes into account the topology of the
observations. For instance, to go back to the problem of spatial
distributions of restaurants, it does not only look at whether the
probabilities of having a McDonald’s or a KFC restaurant converge on
11th Wall Street, but also at whether restaurants are likely nearby, on 9th Wall Street.&lt;/p&gt;
&lt;p&gt;A simple example to see why this matters is to consider two Dirac
distributions: spikes in a single point. If we bring these spikes closer
and closer, merely looking at the probability of events in the same exact
position will not detect any convergence until the spikes exactly
overlap.&lt;/p&gt;
&lt;p&gt;Using kernel embeddings of distributions makes it possible to capture
convergence in the spatial domain, because the kernels used give
spatial smoothness to the representatives:&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/comparing_distributions_l1/converging_diracs.png" style="width: 70%;" /&gt;
&lt;p&gt;Having a metric on probability distributions that captures the topology
of the observations is important for many applications, for instance when
fitting GANs to generate images: the goal is not only to capture whether
images are exactly the same, but also whether they are “close”.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="two-sample-testing-procedures"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Two-sample testing procedures&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now that we have built metrics, we can derive two-sample test statistics.
A straightforward way of doing so would involve large sums over all the
observations, which would be costly. Hence, we resort to a good
approximation by sampling a set of {Tj} locations from the distribution
Γ:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;d̂&lt;/i&gt;&lt;span class="scripts"&gt;&lt;sup class="script"&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;&lt;sub class="script"&gt;&lt;i&gt;ℓ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sub&gt;, &lt;i&gt;μ&lt;/i&gt;, &lt;i&gt;J&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;[&lt;i&gt;X&lt;/i&gt;, &lt;i&gt;Y&lt;/i&gt;] :  = &lt;i&gt;n&lt;/i&gt;&lt;sup&gt;&lt;i&gt;p&lt;/i&gt; ⁄ 2&lt;/sup&gt;&lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;j&lt;/i&gt; = 1..&lt;i&gt;J&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;|&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;T&lt;/i&gt;&lt;sub&gt;&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;) − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Y&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;T&lt;/i&gt;&lt;sub&gt;&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;)|&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;
&lt;/div&gt;
&lt;p&gt;We show that this approximation maintains (almost surely) the appealing
metric properties, generalizing the results that were established by
&lt;a class="reference external" href="http://papers.nips.cc/paper/5685-fast-two-sample-testing-with-analytic-representations-of-probability-measures"&gt;Chwialkowski et al 2015&lt;/a&gt;
for the special case of the L2 metric.&lt;/p&gt;
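&lt;p&gt;In code, the sampled-locations approximation is cheap to evaluate. Below is a minimal numpy sketch, not the code from the paper: the function names, the Gaussian-kernel bandwidth, and the distribution used to draw the locations are all illustrative assumptions.&lt;/p&gt;

```python
import numpy as np

def mean_embedding(samples, locations, bandwidth=1.0):
    # empirical mean kernel embedding: mu(T_j) = (1/n) sum_i k(x_i, T_j),
    # with a Gaussian kernel k(x, t) = exp(-||x - t||^2 / (2 * bandwidth^2))
    sq_dists = ((locations[:, None, :] - samples[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2)).mean(axis=1)

def lp_statistic(X, Y, locations, p=1, bandwidth=1.0):
    # d_hat = n^(p/2) * sum_j |mu_X(T_j) - mu_Y(T_j)|^p
    diff = (mean_embedding(X, locations, bandwidth)
            - mean_embedding(Y, locations, bandwidth))
    return len(X) ** (p / 2.0) * np.sum(np.abs(diff) ** p)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(1000, 1))
X2 = rng.normal(0.0, 1.0, size=(1000, 1))  # same distribution as X
Y = rng.normal(1.0, 1.0, size=(1000, 1))   # shifted distribution
T = rng.normal(0.0, 1.5, size=(10, 1))     # J = 10 locations drawn from Γ
stat_null = lp_statistic(X, X2, T, p=1)    # small: same distribution
stat_alt = lp_statistic(X, Y, T, p=1)      # much larger: P differs from Q
```

The cost is only O(nJ) per distribution, instead of the O(n²) sums over all pairs of observations.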
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/comparing_distributions_l1/optimizing_position.png" style="width: 70%;" /&gt;
&lt;p class="caption"&gt;Sampling at different positions&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;We further develop the testing procedures by showing that other tricks
known to improve testing with the L2 metric can be adapted to other
metrics, such as the L1 metric. Fast and performant tests can be obtained
by optimizing the test locations –using an upper-bound on the test power–
or by testing in the Fourier domain, using the Smooth Characteristic
Function of the kernel. Even in the case of the L1 metric, the null
distribution of the test statistic can be derived, leading to tests that
can control errors without permutations.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-l1-metric-provides-best-testing-power"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;The L1 metric provides best testing power&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Going back to our question of which norm on the difference of
distribution representatives is best suited to detect discrepancies, we show
that when using analytic kernels, such as the Gaussian kernel, the L1 metric
improves upon the L2 metric, which corresponds to the classic definition
of the MMD.&lt;/p&gt;
&lt;p&gt;Indeed, analytic kernels are non-zero almost everywhere. As a result,
when P is different from Q, the difference between their mean embeddings
will be dense, as well as the differences between the representatives
that we use to build our tests (for instance the values at the locations
that we use to build the tests above). l1 norms capture dense
differences better than l2 norms –this is the reason why, used as
penalties, they induce sparsity.&lt;/p&gt;
&lt;img alt="" class="align-right" src="attachments/comparing_distributions_l1/l1_vs_l2.png" style="width: 150px;" /&gt;
&lt;p&gt;A simple intuition is that dense vectors tend to lie along the diagonals of
the measurement basis, as none of their coordinates are zero. On these
diagonals, at equal l2 norm, the l1 norm is much larger than the l1 norm
of vectors with some zero, or nearly-zero, coordinates.&lt;/p&gt;
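&lt;p&gt;This intuition is easy to check numerically: at equal l2 norm, a fully dense vector has a much larger l1 norm than a sparse one (a small illustrative sketch, with an arbitrary dimension of 100):&lt;/p&gt;

```python
import numpy as np

d = 100
dense = np.ones(d) / np.sqrt(d)    # all coordinates equal, unit l2 norm
sparse = np.zeros(d)
sparse[0] = 1.0                    # one nonzero coordinate, unit l2 norm

# both vectors have the same l2 norm ...
l2_dense = np.linalg.norm(dense, 2)     # 1.0
l2_sparse = np.linalg.norm(sparse, 2)   # 1.0
# ... but the l1 norm of the dense vector is sqrt(d) times larger
l1_dense = np.abs(dense).sum()          # sqrt(100) = 10.0
l1_sparse = np.abs(sparse).sum()        # 1.0
```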
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For a very simple summary, the story is that: to test whether two
distributions differ, it is useful to compute a “mean kernel
embedding” –similar to a kernel density estimate, but without
normalization– of each distribution, and consider the l1 norm of the
difference of these embeddings. The embeddings can be evaluated at a small
number of locations, either drawn at random or optimized. This approach is
reminiscent of looking at the total variation between the measures;
however, the fact that it uses kernels makes it robust to small spatial
noise in the observations, unlike the total variation, for which events
must perfectly coincide in both sets of observations (the total
variation does not metrize weak convergence).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The framework exposed here is one that was developed over a long line
of research, which our work builds upon. &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Our paper&lt;/a&gt;
gives a complete list of references, however, some useful review
papers are&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;C.-J. Simon-Gabriel and B. Schölkopf, &lt;em&gt;Kernel distribution
embeddings: Universal kernels, characteristic kernels and kernel
metrics on distributions&lt;/em&gt;, &lt;a class="reference external" href="https://arxiv.org/abs/1604.05251"&gt;arXiv:1604.05251&lt;/a&gt;, 2016.&lt;/li&gt;
&lt;li&gt;A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, A. Smola, &lt;em&gt;A
Kernel Two-Sample Test&lt;/em&gt;, &lt;a class="reference external" href="http://www.jmlr.org/papers/v13/gretton12a.html"&gt;JMLR, 2012&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://slideslive.com/38921490/interpretable-comparison-of-distributions-and-models"&gt;The NeurIPS 2019 tutorial&lt;/a&gt;,
by Gretton, Sutherland, and Jitkrittum, is extremely didactic and gives
a good big-picture view.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;·&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="two-sample testing"></category><category term="conferences"></category><category term="statistics"></category></entry><entry><title>2018: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2018-my-scientific-year-in-review.html" rel="alternate"></link><published>2019-01-03T00:00:00+01:00</published><updated>2019-01-03T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2019-01-03:/science/2018-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;From a scientific perspective, 2018 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; was once again extremely exciting
thanks to awesome collaborators (at &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Inria&lt;/a&gt;, with &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;, and our &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;local scikit-learn team&lt;/a&gt;).
Rather than going over everything that we did in 2018, I would like to
give a few highlights: We published major work using &lt;strong&gt;machine learning to …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;From a scientific perspective, 2018 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; was once again extremely exciting
thanks to awesome collaborators (at &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Inria&lt;/a&gt;, with &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;, and our &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;local scikit-learn team&lt;/a&gt;).
Rather than going over everything that we did in 2018, I would like to
give a few highlights: we published major work using &lt;strong&gt;machine learning to
map cognition in the brain&lt;/strong&gt;; we started a new research project on &lt;strong&gt;analysis
of non-curated data&lt;/strong&gt; (addressing all of data science, beyond brain
imaging); and we worked a lot on &lt;strong&gt;growing scikit-learn&lt;/strong&gt;.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It’s already 2019, I am indeed late in posting this summary.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#cognitive-brain-mapping" id="toc-entry-1"&gt;Cognitive brain mapping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#data-science-without-data-cleaning" id="toc-entry-2"&gt;Data science without data cleaning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#scikit-learn-growth-and-consolidation" id="toc-entry-3"&gt;Scikit-learn: growth and consolidation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="cognitive-brain-mapping"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Cognitive brain mapping&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We have been exploring &lt;strong&gt;how predictive models can help mapping cognition
in the human brain&lt;/strong&gt;. In 2018, these long-running efforts led to important
publications.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="atlases-of-cognition-with-large-scale-human-brain-mapping"&gt;
&lt;h3&gt;Atlases of cognition with large-scale human brain mapping&lt;/h3&gt;
&lt;p&gt;More than 6 years ago, with my student Yannick Schwartz, we started
working on &lt;strong&gt;compiling an atlas of cognition across many cognitive
neuroimaging studies&lt;/strong&gt;. This turned out to be quite challenging for several
reasons:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Formalizing the links between mental processes&lt;/strong&gt; studied across the
literature is challenging. Strictly speaking, every paper studies a
different mental process. However, to build an atlas of cognition, we
are interested in finding commonalities across the literature.&lt;/li&gt;
&lt;li&gt;While cognitive studies tend to target a specific mental function,
the psychological manipulations that they use also recruit many other
processes. For instance, a memory study might use a &lt;em&gt;visual n-back&lt;/em&gt;
task, and hence recruit the visual cortex. The problem is more than an
experimental inconvenience: &lt;strong&gt;varying details of an experiment may
trigger different cognitive processes&lt;/strong&gt;. For instance, there are common
and separate pathways for visual word recognition and auditory word
recognition.&lt;/li&gt;
&lt;li&gt;Simply &lt;strong&gt;detecting regions that are recruited in a given mental operation
leads to selecting the whole cortex&lt;/strong&gt; with enough statistical power. Indeed
tasks are never fully balanced; reading might for instance require more
attention than listening.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These challenges are related on the one hand to the problem of &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1364661305003360"&gt;reverse
inference&lt;/a&gt;
&lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;, and on the other hand to that of mental-process decomposition, or
cognitive subtraction, both central to cognitive neuroimaging. They also
call for formal knowledge representation, &lt;em&gt;eg&lt;/em&gt; by building ontologies,
which is a task harder than it might seem at first glance.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;In essence, the reverse inference problem arises because in a
cognitive brain imaging the observed brain activity is a consequence
of the behavior, and not a cause. While a conclusion that activity in
a brain structure causes a certain behavior is desirable, it is not
directly supported by a cognition neuroimaging experiment.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In our work &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;[Varoquaux et al, PLOS 2018]&lt;/a&gt;,
we tackled these challenges to build atlases of cognition as follows:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;We assigned to each brain-activity image labels describing the
&lt;em&gt;multiple&lt;/em&gt; mental processes related to the experimental manipulation&lt;/li&gt;
&lt;li&gt;We used decoding –&lt;em&gt;ie&lt;/em&gt; prediction of the cognitive labels from the brain
activity– to ground a principled &lt;em&gt;reverse inference&lt;/em&gt; interpretation:
the regions selected indeed imply the corresponding behavior.&lt;/li&gt;
&lt;li&gt;Regions in the atlas were built of brain structures that both implied
the corresponding cognition, and were triggered by it (conditional and
marginal link), to ground a strong selectivity:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;&lt;img alt="" src="attachments/2018_highlights/mapping_types.png" style="width: 700px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;We applied these techniques to the data from 30 different studies,
resulting in a detailed breakdown of the cortex into functionally-specialized
modules:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;&lt;img alt="" src="attachments/2018_highlights/cognitive_regions.png" style="width: 700px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;Importantly, the validity of this decomposition in regions is established
by the ability of these regions to predict the cognitive aspects of new
experimental paradigms.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="predictive-models-avoid-excessive-reductionism-in-cognitive-neuroimaging"&gt;
&lt;h3&gt;Predictive models avoid excessive reductionism in cognitive neuroimaging&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2018_highlights/decoding.png" style="width: 400px;" /&gt;
&lt;/div&gt;
&lt;p&gt;While machine learning is generally seen as an engineering tool to build
predictive models or automate tasks, I see in it a central method of
modern science. Indeed, it can distill &lt;strong&gt;evidence that generalizes&lt;/strong&gt; from
vast –high dimensional– and ill-structured experimental data. Beyond
prediction, it can guide understanding.&lt;/p&gt;
&lt;p&gt;With Russ Poldrack, we wrote an opinion paper &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-01856412/"&gt;[Varoquaux &amp;amp; Poldrack,
Curr Opinion Neurobio 2019]&lt;/a&gt; that details why
predictive models are important tools to building wider theories of brain
function. It reviews much exciting progress in uncovering, with
predictive models, how brain mechanisms support the mind. It makes the
point that &lt;strong&gt;the ability to generalize is a fundamentally desirable property
of scientific inference&lt;/strong&gt;. Models that are grounded on explicit
generalization give a solid path to build broad theories of the mind.
Particularly interesting is generalization to significantly different
settings, &lt;em&gt;ie&lt;/em&gt; going further than typical cross-validation experiments of
machine learning, where identical data are artificially split.&lt;/p&gt;
&lt;p&gt;Something that is dear to my heart is that we are aiming for
&lt;strong&gt;quantitative generalization&lt;/strong&gt;, while psychology often contents itself
with qualitative generalization.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="individual-brain-charting-a-high-resolution-fmri-dataset-for-cognitive-mapping"&gt;
&lt;h3&gt;Individual Brain Charting, a high-resolution fMRI dataset for cognitive mapping&lt;/h3&gt;
&lt;p&gt;We are convinced about the importance of analyzing brain response across
multiple paradigms, to build models of brain function that generalize
across these paradigms. However, addressing such a research program by
aggregating multiple studies is hindered by data heterogeneity, due to
inter-individual differences or to differing scanners.&lt;/p&gt;
&lt;p&gt;Hence, my team, &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Parietal&lt;/a&gt;, has
undertaken a major data acquisition, the &lt;a class="reference external" href="https://project.inria.fr/IBC"&gt;Individual Brain Charting
project&lt;/a&gt;: &lt;strong&gt;scanning a few individuals
under a large number of cognitive tasks&lt;/strong&gt;. The data acquisition will last
for many years, as the individuals come back to the lab for new
acquisitions. The images are of excellent quality, thanks to the unique
expertise of our scanning site, Neurospin, a brain-imaging research
facility.&lt;/p&gt;
&lt;p&gt;The data are completely &lt;strong&gt;openly accessible&lt;/strong&gt;: the raw data, preprocessed
data, and statistical outputs, along with the processing scripts. We are
releasing new data as the project moves forward. This year, we published
the data paper &lt;a class="reference external" href="https://www.nature.com/articles/sdata2018105"&gt;[Pinho et al, Scientific Data 2018]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Data accumulation in brain imaging&lt;/p&gt;
&lt;p&gt;We are living exciting times, as &lt;strong&gt;there are more and more large volumes
of shared brain imaging data&lt;/strong&gt;. &lt;a class="reference external" href="https://openfmri.org/"&gt;OpenfMRI&lt;/a&gt;
aggregates data in a consistent way across brain-imaging
studies. Large projects such as the Human Connectome Project, our
Individual Brain Charting project, or the UK BioBank, are designed
from the beginning to be shared. We are entering an era of
brain-image analysis on many terabytes of data, with tens of
thousands of subjects, compounding hundreds of different clinical or
cognitive conditions.&lt;/p&gt;
&lt;p&gt;Massive data accumulation opens exciting new scientific prospects,
and raises new engineering challenges. Some of these challenges are
to scale up neuroimaging data-processing practices, eg inter-subject
alignments at the scale of many thousands of subjects. Some of these
challenges are new to neuroimaging: &lt;strong&gt;when compounding hundreds of
sources of data into an analysis, the human cost of data
integration becomes a major roadblock&lt;/strong&gt;. As I have become convinced
that analysing more, and more diverse, data is an important way
forward, I have started working on data integration per se.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="data-science-without-data-cleaning"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Data science without data cleaning&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="a-new-personal-research-agenda-dirtydata"&gt;
&lt;h3&gt;A new personal research agenda: DirtyData&lt;/h3&gt;
&lt;p&gt;Challenges to integrating data in a statistical analysis are ubiquitous,
including in brain imaging. Data cleaning &lt;a class="reference external" href="https://www.kaggle.com/surveys/2017"&gt;is recognized&lt;/a&gt; as the number one time sink for
data scientists. When advising scikit-learn users, including very large
companies, I often find that the major roadblock is going from the raw
data sources to the data matrix that is input to scikit-learn.&lt;/p&gt;
&lt;p&gt;A year ago, I started a new research focus, around the &lt;a class="reference external" href="https://project.inria.fr/dirtydata"&gt;DirtyData project&lt;/a&gt;. We now have a team with multiple
exciting collaborations, and funding. Our goal is to &lt;strong&gt;facilitate
statistical analysis of non-curated data&lt;/strong&gt;. We hope to foster better
understanding of how powerful machine-learning models can cope with
imperfect, non homogeneous data. As we go, we will publish this
understanding, but also distribute code with new methods, and hopefully
influence common data-science practices and software. This is an exciting
adventure (and yes, &lt;strong&gt;we are hiring&lt;/strong&gt;; see our &lt;a class="reference external" href="https://project.inria.fr/dirtydata/job-offers"&gt;job offers&lt;/a&gt; or contact me).&lt;/p&gt;
&lt;p&gt;The topics are vast, at the intersection between database research and
statistics. In particular, it calls for integrating machine learning
with:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Knowledge representation&lt;/li&gt;
&lt;li&gt;Information retrieval&lt;/li&gt;
&lt;li&gt;Information extraction&lt;/li&gt;
&lt;li&gt;Statistics with missing data&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="similarity-encoding-analysis-with-non-normalized-string-categories"&gt;
&lt;h3&gt;Similarity encoding: analysis with non-normalized string categories&lt;/h3&gt;
&lt;p&gt;While the DirtyData project is young, we have already made progress on the
analysis of &lt;strong&gt;dirty categories, ie categorical data represented with
strings that lack curation&lt;/strong&gt;. These can have typos or other simple
morphological variants (&lt;em&gt;eg&lt;/em&gt; “patient” vs “patients”), or they can have
more structured and fundamental differences, &lt;em&gt;eg&lt;/em&gt; arising from the merge
of multiple data sources. This latter problem is well known in database
research, where it is seen as a &lt;em&gt;record linkage&lt;/em&gt; or &lt;em&gt;alignment&lt;/em&gt; problem.&lt;/p&gt;
&lt;p&gt;For statistical analysis, in particular machine learning, the problem
with these non-curated string categories is that they must be encoded to
numerical representations, and classic categorical encodings are not well
suited for them. For instance, one-hot encoding these high-cardinality
categories leads to very high-dimensional representations.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://hal.inria.fr/hal-01806175"&gt;Cerda et al (2018)&lt;/a&gt;, we
contribute a simple encoding approach, &lt;em&gt;similarity encoding&lt;/em&gt;, based on
interpolating one-hot encoding with string similarities between the
categories.&lt;/p&gt;
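&lt;p&gt;The idea can be sketched in a few lines of numpy: each string category is encoded by its similarity to every category of a reference vocabulary, so that an exact match recovers a one-hot row. This is only an illustrative sketch using a 3-gram Jaccard similarity and made-up category names; the paper studies several string similarities, and the dirty-cat package provides the actual implementation.&lt;/p&gt;

```python
import numpy as np

def char_ngrams(string, n=3):
    # character n-grams of a padded string (padding is an illustrative choice)
    padded = ' ' + string + ' '
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    # Jaccard similarity between the n-gram sets of two strings
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga.intersection(gb)) / len(ga.union(gb))

def similarity_encode(values, vocabulary):
    # one row per value, one column per vocabulary category;
    # identical strings get a 1, so clean data recovers one-hot encoding
    return np.array([[ngram_similarity(v, c) for c in vocabulary]
                     for v in values])

vocab = ['police officer', 'police aide', 'firefighter']
encoded = similarity_encode(['police officier', 'firefighter'], vocab)
# the typo 'police officier' stays close to the 'police officer' column,
# while the exact match 'firefighter' gets exactly 1.0 in its own column
```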
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://dirty-cat.github.io/stable/auto_examples/01_investigating_dirty_categories.html"&gt;&lt;img alt="" src="attachments/2018_highlights/investigating_dirty_categories.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://dirty-cat.github.io/stable/auto_examples/02_fit_predict_plot_employee_salaries.html"&gt;&lt;img alt="" src="attachments/2018_highlights/predict_employee_salaries.png" style="width: 230px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;We ran an extensive empirical study, and showed that &lt;strong&gt;similarity encoding
leads to better prediction accuracy without curation of the data&lt;/strong&gt;,
outperforming all the other approaches that we tried. The paper is purely
empirical, but stay tuned: a theoretical analysis of why this is the case
is coming soon.&lt;/p&gt;
&lt;p&gt;For the benefit of data scientists and researchers, we have released a
small Python package, &lt;a class="reference external" href="https://dirty-cat.github.io/stable/"&gt;dirty-cat&lt;/a&gt;,
for learning with dirty categories.&lt;/p&gt;
&lt;p&gt;This is just the beginning of the DirtyData project, more exciting work
is under way.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-growth-and-consolidation"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Scikit-learn: growth and consolidation&lt;/a&gt;&lt;/h2&gt;
&lt;img alt="" class="align-right" src="attachments/2018_highlights/scikit-learn-logo-notext.png" style="width: 150px;" /&gt;
&lt;p&gt;In 2018, a lot of my energy went to consolidating scikit-learn as a
project. Describing the work in detail is for another post. However, my
main efforts were around growing the team and working on sustainability.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;We established a &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;scikit-learn foundation at Inria&lt;/a&gt;, in which companies
partner with us to fund scikit-learn development. This took a lot of
effort to establish good partnerships and create the legal vessels.
Indeed, we want to make sure that the common effort is invested to make
scikit-learn better. For instance, working with Intel, who are engaged
in something of an arms race for computing speed, we improved our test suite,
and are slowly but surely learning how to improve our speed.&lt;/li&gt;
&lt;li&gt;A consequence of the foundation is that we are hiring to grow the team
(check out &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/people/"&gt;our open positions&lt;/a&gt;). In 2018, my own
team grew, with more excellent people working on scikit-learn, but also
&lt;a class="reference external" href="http://joblib.readthedocs.io/"&gt;joblib&lt;/a&gt;, and even contributing to
core Python and numpy to improve &lt;a class="reference external" href="https://github.com/python/cpython/pull/3895"&gt;parallel computing&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/numpy/numpy/pull/12133"&gt;pickling&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;As the scikit-learn community is growing, it seemed important to
formalize a bit more how decisions are made. To me, an important aspect
was laying out clearly that the project is still governed by the
community, and not partners or people paid by the foundation. We have a
draft of a &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/12878"&gt;governance document&lt;/a&gt;, that is
pretty much ready for merge. We also worked on a &lt;a class="reference external" href="https://scikit-learn.org/dev/roadmap.html"&gt;roadmap&lt;/a&gt;. It is a non-binding
document, but it still was an interesting exercise.&lt;/li&gt;
&lt;li&gt;Scikit-learn 0.20 was released, &lt;a class="reference external" href="https://scikit-learn.org/dev/whats_new.html"&gt;with many enhancements&lt;/a&gt;. And the 0.20 release
was followed by two minor releases, to make sure that our users got
robust code with backward compatibility.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We are busy finishing a few very interesting studies; next year will be
exciting! I hope that we will have much to say about population analysis
with brain imaging, which is an amazingly interesting subject.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="yearly report"></category></entry><entry><title>Our research in 2017: personal scientific highlights</title><link href="https://gael-varoquaux.info/science/our-research-in-2017-personal-scientific-highlights.html" rel="alternate"></link><published>2017-12-31T00:00:00+01:00</published><updated>2017-12-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2017-12-31:/science/our-research-in-2017-personal-scientific-highlights.html</id><summary type="html">&lt;p&gt;In my opinion the scientific highlights of 2017 for &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt; were on multivariate predictive
analysis for brain imaging: a brain decoder more efficient and faster
than alternatives, improved clinical predictions by jointly predicting
multiple traits of subjects, decoding based on the raw time-series of
brain activity, and a personnal …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In my opinion the scientific highlights of 2017 for &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt; were on multivariate predictive
analysis for brain imaging: a brain decoder more efficient and faster
than alternatives, improved clinical predictions by jointly predicting
multiple traits of subjects, decoding based on the raw time-series of
brain activity, and a personal concern with the small sample sizes we
use in predictive brain imaging…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-fast-and-stable-brain-decoder-using-ensembling-frem"&gt;
&lt;h2&gt;A fast and stable brain decoder using ensembling: FReM&lt;/h2&gt;
&lt;p&gt;We have been working for 10 years on methods for brain decoding:
predicting behavior from imaging. In particular, we developed
state-of-the-art decoders based on &lt;a class="reference external" href="http://ieeexplore.ieee.org/abstract/document/5711672/"&gt;total variation&lt;/a&gt;.
In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917308182"&gt;Hoyos-Idrobo et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/INRIA/hal-01615015v1"&gt;preprint&lt;/a&gt;)
we used a different technique based on ensembling: combining many fast
decoders. The resulting decoder, dubbed &lt;em&gt;FReM&lt;/em&gt;, predicts better, faster,
and with more stable maps than existing methods. Indeed, we have learned
that good prediction accuracy was not the only important feature of a
decoder.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/frem_benchmarks.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="brain-imaging-to-characterize-individuals-joint-prediction-of-multiple-traits"&gt;
&lt;h2&gt;Brain imaging to characterize individuals: joint prediction of multiple traits&lt;/h2&gt;
&lt;p&gt;In &lt;em&gt;population imaging&lt;/em&gt;, individual traits are linked to their brain
images. Predictive models ground the development of imaging biomarkers.
In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305438"&gt;Rahim et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01547524/"&gt;preprint&lt;/a&gt;), we showed that
accounting for multiple traits of the subjects when &lt;em&gt;learning&lt;/em&gt; the
biomarker gave better predictions of the individual traits. For
instance, knowing the MMSE (mini mental state examination) of subjects
in a reference population helps derive better markers of Alzheimer’s
disease, even for subjects of unknown MMSE. This is an important step to
including a more complete picture of individuals in imaging studies.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/multi_output_decoder.jpg" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="time-domain-decoding-for-fmri"&gt;
&lt;h2&gt;Time-domain decoding for fMRI&lt;/h2&gt;
&lt;p&gt;In studies of cognition with functional MRI, the standard practice to
decoding brain activity is to estimate a first-level model that teases
apart the different experimental trials. It results in maps of regions
of the brain that correlate with each trial. Decoding is then run on
these maps, with supervised learning. The limitation of this approach is
that the experiment has to be designed with a good time separation
between each trial.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/time_domain_decoding.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917306651"&gt;Loula et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01576641/"&gt;preprint&lt;/a&gt;) we designed a
&lt;em&gt;time-domain decoding&lt;/em&gt; scheme, that starts from the raw brain activity
time-series and predicts model time-courses of cognition. From these, it
can classify the type of each trial. Importantly, it works better than
traditional approaches when the trials are not well separated. It thus
opens the door to decoding in experiments that were so far too fast.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="cross-validation-failure-the-dangers-of-small-samples"&gt;
&lt;h2&gt;Cross-validation failure: the dangers of small samples&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/sample_size_distribution.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;I wrote &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305311"&gt;an opinion paper&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01545002/"&gt;preprint&lt;/a&gt;) on a problem of our
field that has been worrying me a lot: &lt;strong&gt;often, we do not have enough
samples to properly assess the predictive power in neuroimaging&lt;/strong&gt;.
Indeed, the typical predictive analysis in neuroimaging uses around 100 samples.&lt;/p&gt;
&lt;div style="clear: both"&gt;&lt;/div&gt;&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/binomial_cdf.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;The error on the measured prediction accuracy of a decoder is at best
given by a binomial distribution. With around 100 samples, this yields
confidence bounds of around ±7%. Analysis of neuroimaging studies reveals
even larger error bars.&lt;/p&gt;
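&lt;p&gt;As a back-of-the-envelope check (a normal approximation to the binomial; the bounds in the paper are computed more carefully), the width of the confidence interval on an accuracy shrinks only as the square root of the sample size:&lt;/p&gt;

```python
import math

def accuracy_ci_halfwidth(p, n, z=1.96):
    """Half-width of the 95% normal-approximation confidence
    interval on an accuracy p measured on n test samples."""
    return z * math.sqrt(p * (1 - p) / n)

# Quadrupling the precision requires 16 times more samples
for n in (100, 1000, 10000):
    print(n, round(accuracy_ci_halfwidth(0.75, n), 3))
```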
&lt;p&gt;Such error bars, large compared to the effect of interest, undermine
publications using or developing predictive models in neuroimaging.
Indeed, they couple with the publication incentives in two ways. First,
studies that by chance observe an effect are published, while the others
end up unaccounted for in a &lt;em&gt;file drawer&lt;/em&gt;. Second, minor
modifications to the data-processing strategy give large but meaningless
differences in the observed prediction accuracy. These &lt;em&gt;researcher
degrees of freedom&lt;/em&gt; can hardly be checked in a review process or a
statistical test. Methods research, trying to improve decoders, is
hindered by such error bars and should consider multiple datasets to
gauge progress. Clinical neuroimaging, for biomarkers, must increase
sample sizes and face heterogeneity.&lt;/p&gt;
&lt;p&gt;I believe that this is a major challenge for our field, and invite you to
read the paper if you are not convinced.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="convergence-proofs-for-last-year-s-blazing-fast-dictionary-learning"&gt;
&lt;h2&gt;Convergence proofs for last year’s blazing fast dictionary learning&lt;/h2&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/online_dict_learning.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://ieeexplore.ieee.org/abstract/document/8038072/"&gt;Mensch et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01431618/"&gt;preprint&lt;/a&gt;) is a long paper that
studies in detail our very fast dictionary learning algorithm, with
extensive experiments and convergence proofs. On huge matrices, such as
brain imaging data in population studies, hyperspectral imaging, or
recommender systems, it gives &lt;strong&gt;10-fold speedups for matrix factorization&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We are busy finishing a few very interesting studies. Stay posted, next
year will be exciting!&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="machine learning"></category><category term="yearly report"></category></entry><entry><title>Our research in 2016: personal scientific highlights</title><link href="https://gael-varoquaux.info/science/our-research-in-2016-personal-scientific-highlights.html" rel="alternate"></link><published>2016-12-31T00:00:00+01:00</published><updated>2016-12-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2016-12-31:/science/our-research-in-2016-personal-scientific-highlights.html</id><summary type="html">&lt;p&gt;Year 2016 has been productive for science in &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt;. Here are some personal highlights:
bridging artificial intelligence tools to human cognition,
markers of neuropsychiatric conditions from brain activity at rest,
algorithmic speedups for matrix factorization on huge datasets…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="artificial-intelligence-convolutional-networks-map-well-the-human-visual-system"&gt;
&lt;h2&gt;Artificial-intelligence convolutional networks map well the human visual system&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305481"&gt;Eickenberg et …&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Year 2016 has been productive for science in &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt;. Here are some personal highlights:
bridging artificial intelligence tools to human cognition,
markers of neuropsychiatric conditions from brain activity at rest,
algorithmic speedups for matrix factorization on huge datasets…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="artificial-intelligence-convolutional-networks-map-well-the-human-visual-system"&gt;
&lt;h2&gt;Artificial-intelligence convolutional networks map well the human visual system&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305481"&gt;Eickenberg et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01389809/document"&gt;preprint&lt;/a&gt;), showed that
convolutional networks –machine-learning tools developed in artificial
intelligence for image analysis– map well the human visual system. This
is interesting because it shows that cognitive vision and artificial
computer vision have evolved to similar architectures. It is not that
surprising, as they are both driven by the statistics of natural images.
From the point of view of inference in neuroscience, what I found really
interesting is that we demonstrated that our computational model of brain
activity generalizes across experimental paradigms. This is something new
to my knowledge.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="using-brain-activity-at-rest-to-predicting-autism-status-across-clinical-sites"&gt;
&lt;h2&gt;Using brain activity at rest to predict Autism status across clinical sites&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305924"&gt;Abraham et al&lt;/a&gt;
(&lt;a class="reference external" href="https://arxiv.org/pdf/1611.06066"&gt;preprint&lt;/a&gt;) used resting-state brain
activity to predict whether individuals were typical controls or
diagnosed with Autistic symptoms. The important aspect of this study
is that it was performed on a large data collection across many sites
that had not coordinated with each other during acquisition. Given that
prediction was successful across sites, the study shows the viability of
extracting predictive biomarkers across inhomogeneous multi-site data. I
think that it is an important result for the future of psychiatric
neuroimaging research. The paper also highlights the aspects of the
predictive pipeline that were important for this success.&lt;/p&gt;
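&lt;p&gt;The overall shape of such a connectome-based pipeline can be sketched as follows. This is a toy version on synthetic signals, not the tuned pipeline of the paper (which extracts regions from real fMRI, e.g. with nilearn): per-subject correlation matrices are vectorized and fed to a supervised classifier:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_subjects, n_regions, n_timepoints = 40, 10, 120

def connectome_features(time_series):
    """Vectorize one subject's functional-connectivity matrix."""
    corr = np.corrcoef(time_series)            # (regions, regions)
    iu = np.triu_indices_from(corr, k=1)
    return corr[iu]                            # upper triangle only

X = np.array([connectome_features(rng.randn(n_regions, n_timepoints))
              for _ in range(n_subjects)])
y = rng.randint(0, 2, n_subjects)              # toy diagnosis labels

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(X.shape)  # n_subjects rows, one connectivity value per region pair
```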
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="dictionary-learning-for-massive-matrix-factorization"&gt;
&lt;h2&gt;Dictionary Learning for Massive Matrix Factorization&lt;/h2&gt;
&lt;p&gt;On a pure machine-learning side, &lt;a class="reference external" href="http://jmlr.org/proceedings/papers/v48/mensch16.html"&gt;Mensch et al&lt;/a&gt; introduced a new
algorithm for matrix factorization that gives 10-fold speedups compared
to the state of the art on absolutely huge datasets (terabyte scale).
The key aspect is to combine online learning with random subsampling that
exploits redundancies in the data. For neuroimaging, this algorithmic
advance is needed to tackle larger and larger resting-state data. We
will use it to scale predictive models to epidemiologic cohorts. The
original paper was purely heuristic but &lt;a class="reference external" href="https://arxiv.org/pdf/1611.10041"&gt;later work&lt;/a&gt; comes with proofs and we will soon
be submitting a very rich journal paper about this class of algorithms.&lt;/p&gt;
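&lt;p&gt;The online, mini-batch flavor of dictionary learning is available in scikit-learn; the sketch below shows the streaming pattern on random data (the random-subsampling trick of Mensch et al goes a step further and is not part of this estimator):&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(0)
X = rng.randn(500, 64)  # stand-in for a tall data matrix

# Online dictionary learning: data is visited in mini-batches, so the
# memory footprint stays constant as the number of rows grows
dico = MiniBatchDictionaryLearning(n_components=10, batch_size=32,
                                   random_state=0)
codes = dico.fit_transform(X)
print(dico.components_.shape)  # the learned dictionary atoms
print(codes.shape)             # the code of each sample on the atoms
```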
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-guide-to-cross-validation-in-neuroimaging"&gt;
&lt;h2&gt;A guide to cross-validation in neuroimaging&lt;/h2&gt;
&lt;p&gt;We published &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S105381191630595X"&gt;a review on cross-validation for neuroimaging&lt;/a&gt;
(&lt;a class="reference external" href="https://arxiv.org/pdf/1606.05201"&gt;preprint&lt;/a&gt;). While this may sound
less leading edge than other of our work, cross-validation is central to
everything we do. Doing it right is important. We learned some
interesting tradeoffs while doing the experiments for the review. One of
them is that for predictive models that are quite stable, such as SVMs,
it may be preferable to use default hyper-parameters rather than to tune
them by cross-validation. This is because, with the small sample sizes
typical of neuroimaging, cross-validation is fairly noisy.&lt;/p&gt;
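&lt;p&gt;A minimal illustration of this trade-off on synthetic data (whether tuning helps will vary from run to run, which is precisely the point). Both strategies are scored with the same outer cross-validation; tuning happens inside each fold:&lt;/p&gt;

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Small-sample regime typical of neuroimaging decoding
X, y = make_classification(n_samples=100, n_features=50, random_state=0)

default_svc = SVC(kernel='linear')  # C left at its default value of 1
tuned_svc = GridSearchCV(SVC(kernel='linear'),
                         {'C': np.logspace(-3, 3, 7)}, cv=5)

# Nested cross-validation: the choice of C is refit inside every fold
for name, model in [('default', default_svc), ('tuned', tuned_svc)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean().round(3), '+/-', scores.std().round(3))
```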
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Though not in my team, &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916306103"&gt;Liem et al&lt;/a&gt;
(&lt;a class="reference external" href="http://www.biorxiv.org/content/biorxiv/early/2016/11/07/085506.full.pdf"&gt;preprint&lt;/a&gt;)
collaborated with us on a beautiful study showing multimodal prediction
of brain age from resting brain activity and brain anatomy. Interestingly,
they showed that the discrepancy between predicted and chronological age
captures cognitive impairment.&lt;/p&gt;
&lt;p&gt;We have many interesting things in the pipeline, but it will be for next
year. On an unrelated note, I’ve been doing more &lt;a class="reference external" href="http://www.flickriver.com/photos/gaelvaroquaux/popular-interesting/"&gt;art photography&lt;/a&gt;
on my free time in 2016.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="machine learning"></category><category term="yearly report"></category></entry><entry><title>Job offer: data crunching brain functional connectivity for biomarkers</title><link href="https://gael-varoquaux.info/science/job-offer-data-crunching-brain-functional-connectivity-for-biomarkers.html" rel="alternate"></link><published>2015-12-08T00:00:00+01:00</published><updated>2015-12-08T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-12-08:/science/job-offer-data-crunching-brain-functional-connectivity-for-biomarkers.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;My research group&lt;/a&gt; is looking to fill
a &lt;strong&gt;post-doc position on learning biomarkers from functional
connectivity&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="section" id="scientific-context"&gt;
&lt;h2&gt;Scientific context&lt;/h2&gt;
&lt;p&gt;The challenge is to use resting-state fMRI at the level of a population
to understand how intrinsic functional connectivity captures pathologies
and other cognitive phenotypes. Rest fMRI is a promising tool for …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;My research group&lt;/a&gt; is looking to fill
a &lt;strong&gt;post-doc position on learning biomarkers from functional
connectivity&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="section" id="scientific-context"&gt;
&lt;h2&gt;Scientific context&lt;/h2&gt;
&lt;p&gt;The challenge is to use resting-state fMRI at the level of a population
to understand how intrinsic functional connectivity captures pathologies
and other cognitive phenotypes. Rest fMRI is a promising tool for
large-scale population analysis of brain function as it is easy to
acquire and accumulate. Scans for thousands of subjects have already been
shared, and more are to come. However, the signatures of cognition in this
modality are weak. Extracting biomarkers is a challenging data-processing
and machine-learning problem. Addressing this challenge is the expertise of my
research group. Medical applications cover a wider range of brain
pathologies, for which diagnosis is challenging, such as autism or
Alzheimer’s disease.&lt;/p&gt;
&lt;p&gt;This project is a collaboration with the &lt;a class="reference external" href="http://www.childmind.org/"&gt;Child Mind Institute&lt;/a&gt;, experts on psychiatric disorders and
resting-state fMRI, as well as coordinators of the major data-sharing
initiatives for rest fMRI data (e.g. ABIDE).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="objectives-of-the-project"&gt;
&lt;h2&gt;Objectives of the project&lt;/h2&gt;
&lt;p&gt;The project hinges on processing of very large rest fMRI databases.
Important novelties of the project are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Building predictive models that can discriminate &lt;strong&gt;multiple
pathologies&lt;/strong&gt; in &lt;strong&gt;large inhomogeneous datasets&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Using and improving &lt;strong&gt;advanced connectomics&lt;/strong&gt; and
&lt;strong&gt;brain-parcellation&lt;/strong&gt; techniques in fMRI.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Expected results include the discovery of neurophenotypes for several
brain pathologies, as well as intrinsic brain structures, such as
functional parcellations or connectomes, that carry signatures of
cognition.&lt;/p&gt;
&lt;p&gt;The analysis framework is based on algorithmic tools developed in Python
(crucially, leveraging scikit-learn for predictive modeling).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="desired-profile"&gt;
&lt;h2&gt;Desired profile&lt;/h2&gt;
&lt;p&gt;We are looking for a post-doctoral fellow to hire in spring. The ideal
candidate would have some, but not all, of the following expertise and
interests:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Experience in advanced processing of fMRI&lt;/li&gt;
&lt;li&gt;General knowledge of brain structure and function&lt;/li&gt;
&lt;li&gt;Good communication skills to write high-impact neuroscience publications&lt;/li&gt;
&lt;li&gt;Good computing skills, in particular with Python. Cluster computing
experience is desired.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="a-great-research-environment"&gt;
&lt;h2&gt;A great research environment&lt;/h2&gt;
&lt;p&gt;The work environment is dynamic and exciting, using state-of-the-art
machine learning to answer challenging functional-neuroimaging questions.&lt;/p&gt;
&lt;p&gt;The post-doc will be employed by &lt;a class="reference external" href="http://www.inria.fr"&gt;INRIA&lt;/a&gt;, the leading
computing research institute in France. We are a team of computer
scientists specialized in image processing and statistical data analysis,
integrated in one of the top French brain research centers, &lt;a class="reference external" href="http://i2bm.cea.fr/dsv/i2bm/Pages/NeuroSpin.aspx"&gt;NeuroSpin&lt;/a&gt;, south of Paris. We
work mostly in Python. The team includes core contributors to the
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn project&lt;/a&gt;, for machine learning in
Python, and the &lt;a class="reference external" href="http://nilearn.github.io/"&gt;nilearn project&lt;/a&gt;, for
statistical learning in NeuroImaging.&lt;/p&gt;
&lt;p&gt;In addition, the post-doc will interact closely with researchers from the
&lt;a class="reference external" href="http://www.childmind.org/"&gt;Child Mind Institute&lt;/a&gt;, with deep expertise
in brain pathologies and in the details of the fMRI acquisitions.
Finally, he or she will have access to advanced storage and grid
computing facilities at INRIA.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Contact information&lt;/strong&gt;: gael dotnospam varoquaux atnotspam inria dotnospam fr&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="jobs"></category><category term="neuromaging"></category><category term="science"></category><category term="python"></category><category term="scientific computing"></category></entry><entry><title>Publishing scientific software matters</title><link href="https://gael-varoquaux.info/science/publishing-scientific-software-matters.html" rel="alternate"></link><published>2013-09-19T00:00:00+02:00</published><updated>2013-09-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2013-09-19:/science/publishing-scientific-software-matters.html</id><summary type="html">&lt;p class="light"&gt;Christophe Pradal, Hans Peter Langtangen, and myself recently edited
&lt;a class="reference external" href="http://www.sciencedirect.com/science/journal/18777503/4/5"&gt;a version&lt;/a&gt; of the
Journal of Computational Science on scientific software, in
particular those written in Python. We wrote &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1877750313000938"&gt;an editorial&lt;/a&gt;
defending writing and publishing open source scientific software that
I wish to summarize here. The &lt;a class="reference external" href="http://hal.inria.fr/hal-00858663/en"&gt;full text preprint&lt;/a&gt; is openly …&lt;/p&gt;</summary><content type="html">&lt;p class="light"&gt;Christophe Pradal, Hans Peter Langtangen, and myself recently edited
&lt;a class="reference external" href="http://www.sciencedirect.com/science/journal/18777503/4/5"&gt;a version&lt;/a&gt; of the
Journal of Computational Science on scientific software, in
particular those written in Python. We wrote &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1877750313000938"&gt;an editorial&lt;/a&gt;
defending writing and publishing open source scientific software that
I wish to summarize here. The &lt;a class="reference external" href="http://hal.inria.fr/hal-00858663/en"&gt;full text preprint&lt;/a&gt; is openly available in &lt;a class="reference external" href="http://gael-varoquaux.info/publications.html"&gt;my
publications list&lt;/a&gt; as always. It
includes, amongst other things, references.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Software is a central part of modern scientific discovery.&lt;/strong&gt; Software turns a
theoretical model into quantitative predictions; software controls an
experiment; and software extracts from raw data evidence supporting or
rejecting a theory. As of today, scientific publications seldom discuss
software in depth, maybe because it is both highly technical and a recent
addition to scientific tools. But times are changing. More and more scientific
investigators are developing software and it is important to establish norms
for publication of this work. Producing scientific software is an important
part of the landscape of research activities. Very visible scientific software
is found in products developed by private companies, such as MathWorks’ Matlab
or Wolfram’s Mathematica, but let us not forget that these build upon code
written by and for academics. Scientists writing software contribute to the
advancement of Science via several factors.&lt;/p&gt;
&lt;p&gt;First, software developed in one field, if written in a sufficiently general
way, can often be applied to advance a different field if the underlying
mathematics is common. &lt;strong&gt;Modern scientific software development has a strong
emphasis on generality and reusability by taking advantage of the general
properties of the mathematical structures in the problem.&lt;/strong&gt; This feature of
modern software helps close the gap between fields and accelerates scientific
discovery by packaging mathematical theories in a directly applicable way.&lt;/p&gt;
&lt;p&gt;Second, &lt;strong&gt;the public availability of code is a corner stone of the
scientific method&lt;/strong&gt;, as it is a requirement to reproducing scientific
results: “&lt;em&gt;if it’s not open and verifiable by others, it’s not science,
or engineering, or whatever it is you call what we do.&lt;/em&gt;” (V. Stodden,
&lt;em&gt;The scientific method in practice&lt;/em&gt;). Emphasizing code to an extreme,
Buckheit and Donoho have challenged the traditional view that a
publication was the valuable outcome of scientific research: “&lt;em&gt;an article
about computational science in a scientific publication is not the
scholarship itself, it is merely advertising of the scholarship. The
actual scholarship is the complete software development environment
[…]&lt;/em&gt;”.&lt;/p&gt;
&lt;p&gt;It is important to keep in mind that &lt;strong&gt;going beyond replication of
results requires reusable software tools&lt;/strong&gt;: code that is portable, comes
with documentation, and, most of all, is maintained throughout the years.
Indeed, &lt;strong&gt;software development is a major undertaking that must build
upon best practices and a quality process&lt;/strong&gt;. Reversing Buckheit and
Donoho’s argument, publications about scientific software play an increasingly
important part in the scientific methodology. First, in the publish-or-perish
academic culture, such publications give an incentive to software production
and maintenance, because good software can lead to highly-cited papers. Second,
&lt;strong&gt;the publication and review process are the de facto standards of
ensuring quality in the scientific world. As software is becoming increasingly
more central to the scientific discovery process, it must be subject to these
standards&lt;/strong&gt;. We have found that writing an article on software leads the
authors to better clarify the project vision, technically and scientifically,
the prior art, and the contributions. Last but not least, scientists publishing
new results based on a particular software need an informed analysis of the
validity of that software. Unfortunately, much of the current practice for
adopting research software relies on ease of use of the package and reputation
of the authors.&lt;/p&gt;
&lt;p&gt;[…]&lt;/p&gt;
&lt;p&gt;Today, software is to scientific research what Galileo’s telescope was to
astronomy: a tool, combining science and engineering. It lies outside the
central field of principal competence among the researchers that rely on it.
Like the telescope, it also builds upon scientific progress and shapes our
scientific vision. Galileo’s telescope was a leap forward in optics, a field of
investigation that is now well established, with its own high-impact journals
and scholarly associations. Similarly, we hope that visibility and recognition
of scientific software development will grow.&lt;/p&gt;
</content><category term="science"></category><category term="publishing"></category><category term="open source"></category><category term="scientific computing"></category><category term="reproducible research"></category><category term="scientific software"></category></entry><entry><title>The problems of low statistical power and publication bias</title><link href="https://gael-varoquaux.info/science/the-problems-of-low-statistical-power-and-publication-bias.html" rel="alternate"></link><published>2012-04-14T16:16:00+02:00</published><updated>2012-04-14T16:16:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-04-14:/science/the-problems-of-low-statistical-power-and-publication-bias.html</id><summary type="html">&lt;img alt="" class="align-right" src="http://idoubtit.files.wordpress.com/2010/12/coldfusion.jpg" style="width: 30%;" /&gt;
&lt;p&gt;Lately, I have been in a mood of scientific scepticism: I have the feeling
that the worldwide academic system is more and more failing to produce
useful research. Christophe Lalanne’s &lt;a class="reference external" href="https://twitter.com/#!/chlalanne"&gt;twitter feed&lt;/a&gt; led me to an
interesting article in a non-mainstream journal: &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;A farewell to
Bonferroni: the problems of low …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;img alt="" class="align-right" src="http://idoubtit.files.wordpress.com/2010/12/coldfusion.jpg" style="width: 30%;" /&gt;
&lt;p&gt;Lately, I have been in a mood of scientific scepticism: I have the feeling
that the worldwide academic system is more and more failing to produce
useful research. Christophe Lalanne’s &lt;a class="reference external" href="https://twitter.com/#!/chlalanne"&gt;twitter feed&lt;/a&gt; led me to an
interesting article in a non-mainstream journal: &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;A farewell to
Bonferroni: the problems of low statistical power and publication
bias&lt;/a&gt;, by Shinichi Nakagawa.&lt;/p&gt;
&lt;p&gt;Each study performed has a probability of being wrong. Thus performing
many studies will lead to some wrong conclusions by chance. This is
known in statistics as the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Multiple_comparisons"&gt;multiple comparisons&lt;/a&gt; problem. When a
working hypothesis is not verified empirically in a study, this null
finding is seldom reported, leading to what is called &lt;em&gt;publication
bias&lt;/em&gt;: &lt;strong&gt;discoveries are further studied; negative results are usually
ignored&lt;/strong&gt; (Y. Benjamini). Because only &lt;em&gt;discoveries&lt;/em&gt;, called
&lt;em&gt;detections&lt;/em&gt; in statistical terms, are reported, &lt;strong&gt;published results
contain more false detections than the individual experiments and very
few false negatives&lt;/strong&gt;. Arguably, the original investigators correct
for this using the understanding that they gained from the experiments
performed, and account in a &lt;em&gt;post-hoc analysis&lt;/em&gt; for the fact that some of
their working hypotheses could not have been correct. Such a correction
can work only in a field where there is a good mechanistic
understanding, or models, such as physics, but in my opinion not in the life
and social sciences.&lt;/p&gt;
&lt;p&gt;Let me quote some relevant extracts of &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;the article&lt;/a&gt;, as you may never
have access to it thanks to the way scientific publishing works:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
&lt;p&gt;Recently, Jennions and Moller (2003) carried out a meta-analysis
on statistical power in the field of behavioral ecology and animal
behavior, reviewing 10 leading journals including Behavioral
Ecology. Their results showed dismayingly low average statistical
power (note that a meta-analytic review of statistical power is
different from post hoc power analysis as criticized in Hoenig and
Heisey, 2001). The statistical power of a null hypothesis (Ho)
significance test is the probability that the test will reject Ho
when a research hypothesis (Ha) is true.&lt;/p&gt;
&lt;p&gt;…&lt;/p&gt;
&lt;p&gt;The meta-analysis on statistical power by Jennions and Moller
(2003) revealed that, in the field of behavioral ecology and animal
behavior, statistical power of less than 20% to detect a small
effect and power of less than 50% to detect a medium effect existed.
This means, for example, that the average behavioral scientist
performing a statistical test has a greater probability of making a
Type II error (or beta) (&lt;em&gt;i.e.&lt;/em&gt;, not rejecting Ho when Ho is false;
note that statistical power equals 1 - beta) than if they had
flipped a coin, when an experiment effect is of medium size.&lt;/p&gt;
&lt;p&gt;…&lt;/p&gt;
&lt;p&gt;Imagine that we conduct a study where we measure as many relevant
variables as possible, 10 variables, for example. We find only two
variables statistically significant. Then, what should we do? We
could decide to write a paper highlighting these two variables (and
not reporting the other eight at all) as if we had hypotheses about
the two significant variables in the first place. Subsequently, our
paper would be published. Alternatively, we could write a paper
including all 10 variables. When the paper is reviewed, referees
might tell us that there were no significant results if we had
“appropriately” employed Bonferroni corrections, so that our study
would not be advisable for publication. However, the latter paper is
scientifically more important than the former paper. For example, if
one wants to conduct a meta-analysis to investigate an overall
effect in a specific area of study, the latter paper is five times
more informative than the former paper. In the long term,
statistical significance of particular tests may be of trivial
importance (if not always), although, in the short term, it makes
papers publishable. Bonferroni procedures may, in part, be
preventing the accumulation of knowledge in the field of behavioral
ecology and animal behavior, thus hindering the progress of the
field as science.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;img alt="" class="align-right" src="http://farm6.staticflickr.com/5206/5330056727_a98c97c3c5.jpg" style="width: 50%;" /&gt;
&lt;p&gt;Some of the concerns raised here are partly a criticism of Bonferroni
corrections, &lt;em&gt;i.e.&lt;/em&gt; in technical terms correcting for the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Familywise_error_rate"&gt;family-wise error
rate (FWER)&lt;/a&gt;. This is actually the message that the author wants to
convey in his paper. Proponents of controlling the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/False_discovery_rate"&gt;false discovery rate
(FDR)&lt;/a&gt; argue that an investigator shouldn’t be penalized for asking
more questions, and the fraction of errors in the answers should be
controlled, rather than the absolute value. That said, FDR, while
useful, does not answer the problems of publication bias.&lt;/p&gt;
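&lt;p&gt;To make the trade-off concrete, here is a minimal sketch (my own toy
implementation, not any particular library’s API) comparing the two
procedures on the same p-values: Bonferroni divides the significance
threshold by the number of tests, whereas the Benjamini-Hochberg step-up
procedure controls the fraction of false discoveries and typically
rejects more hypotheses.&lt;/p&gt;

```python
# Toy comparison of FWER control (Bonferroni) and FDR control
# (Benjamini-Hochberg) on one set of p-values.

def bonferroni(p_values, alpha=0.05):
    """Reject hypotheses whose p-value survives the Bonferroni threshold."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject hypotheses with the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis whose rank is <= max_k
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            rejected[i] = True
    return rejected

# Ten tests, a few of them with small p-values
p = [0.001, 0.008, 0.012, 0.016, 0.2, 0.3, 0.5, 0.6, 0.7, 0.9]
print(sum(bonferroni(p)))          # Bonferroni keeps only 1 discovery
print(sum(benjamini_hochberg(p)))  # Benjamini-Hochberg keeps 4
```

Here Bonferroni retains a single discovery where the FDR procedure retains four: asking ten questions did not wipe out the moderately significant answers.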
</content><category term="science"></category><category term="statistics"></category><category term="computational science"></category><category term="science"></category></entry><entry><title>Conference posters</title><link href="https://gael-varoquaux.info/science/conference-posters.html" rel="alternate"></link><published>2011-09-05T04:15:00+02:00</published><updated>2011-09-05T04:15:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-09-05:/science/conference-posters.html</id><summary type="html">&lt;p&gt;At the request of a friend, I am putting up some of the posters that I
recently presented at conferences.&lt;/p&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_nips.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Large-scale functional-connectivity graphical models for individual
subjects using population prior.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00512451/en"&gt;our NIPS work&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_nips.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_ipmi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Multi-subject dictionary learning to segment an atlas of brain
spontaneous activity …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;At the request of a friend, I am putting up some of the posters that I
recently presented at conferences.&lt;/p&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_nips.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Large-scale functional-connectivity graphical models for individual
subjects using population prior.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00512451/en"&gt;our NIPS work&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_nips.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_ipmi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Multi-subject dictionary learning to segment an atlas of brain
spontaneous activity.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00588898/en"&gt;our IPMI work&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_ipmi.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_mayavi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Mayavi for 3D visualization of neuroimaging data: powerful scripting
and reusable components in Python.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_mayavi.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_scikit.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Machine learning for fMRI in Python: inverse inference with
scikit-learn.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_scikit.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
</content><category term="science"></category><category term="neuroimaging"></category><category term="machine learning"></category><category term="science"></category><category term="publishing"></category></entry><entry><title>My conference travels: Scipy 2011 and HBM 2011</title><link href="https://gael-varoquaux.info/science/my-conference-travels-scipy-2011-and-hbm-2011.html" rel="alternate"></link><published>2011-07-23T23:45:00+02:00</published><updated>2011-07-23T23:45:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-07-23:/science/my-conference-travels-scipy-2011-and-hbm-2011.html</id><summary type="html">&lt;div class="section" id="the-scipy-2011-conference-in-austin"&gt;
&lt;h2&gt;The Scipy 2011 conference in Austin&lt;/h2&gt;
&lt;p&gt;Last week, I was at the Scipy conference in Austin. It was really great
to see old friends, and Austin is such a nice  place.&lt;/p&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6143/5931239349_13c78bbef5_m.jpg" style="width: 50%;" /&gt;
&lt;p&gt;The Scipy conference was held in &lt;a class="reference external" href="http://www.meetattexas.com/"&gt;UT Austin’s conference center&lt;/a&gt;, which
is a fantastic venue. This is the …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="the-scipy-2011-conference-in-austin"&gt;
&lt;h2&gt;The Scipy 2011 conference in Austin&lt;/h2&gt;
&lt;p&gt;Last week, I was at the Scipy conference in Austin. It was really great
to see old friends, and Austin is such a nice  place.&lt;/p&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6143/5931239349_13c78bbef5_m.jpg" style="width: 50%;" /&gt;
&lt;p&gt;The Scipy conference was held in &lt;a class="reference external" href="http://www.meetattexas.com/"&gt;UT Austin’s conference center&lt;/a&gt;, which
is a fantastic venue. This is the first geek conference I have been to
where the wireless network worked flawlessly with good bandwidth, even
though 200 geeks were pounding on it. As a tutorial presenter, this was
incredibly useful.&lt;/p&gt;
&lt;div class="section" id="conference-highlight"&gt;
&lt;h3&gt;Conference highlight&lt;/h3&gt;
&lt;p&gt;Here is a short list of what I &lt;em&gt;felt&lt;/em&gt; were the big trends and highlights
of the conference. This is obviously biased by my own interests. I am
not listing parallel computing, as it is clearly an important area of
progress and debates, but it has been the case for the last few years.&lt;/p&gt;
&lt;div class="section" id="eric-jone-s-keynote"&gt;
&lt;h4&gt;Eric Jones’s keynote&lt;/h4&gt;
&lt;p&gt;Of course Eric’s keynote was excellent. Eric is a great speaker and
always has good insights on how to run a team and a project. This year
he shared (some) of his tricks in making Enthought deliver on software
projects: &lt;em&gt;“What Matters in Scientific Software Projects? 10 Years of
Success and Failure Distilled”&lt;/em&gt;. The video is not yet online,
unfortunately. Grab it when you can.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="hilary-mason-s-keynote"&gt;
&lt;h4&gt;Hilary Mason’s keynote&lt;/h4&gt;
&lt;p&gt;Hilary is an applied data geek, just what I like! She gave an
interesting &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/mason_awesome.pdf"&gt;keynote&lt;/a&gt; on how &lt;a class="reference external" href="https://bitly.com/"&gt;bitly&lt;/a&gt; (a URL-shortening startup, for
those living under a rock) mines the requests on the URLs that they serve
to do things like ranking or phishing-attempt detection. Of course, I
couldn’t resist asking what tools they used, thinking that she would
reply R. She said that they do roll some of their own, but she
mentioned &lt;a class="reference external" href="https://mlpy.fbk.eu/"&gt;mlpy&lt;/a&gt; and &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, mentioning that it was very
nice, at which point I believe that I blushed. She stressed that R is
hard to use in production and raised the point that academic software
most often doesn’t pan out in these settings (I hope that I am not
distorting her thoughts too much).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="statistics-and-learning"&gt;
&lt;h4&gt;Statistics and learning&lt;/h4&gt;
&lt;p&gt;I had the feeling that statistics and data mining played a big role at
scipy this year. Maybe it is because I am more tuned to these questions
nowadays, but some signs do not lie. There was a special session on
Python in data sciences, a panel discussion on Python in finance and
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/cron_gpustats.pdf"&gt;many&lt;/a&gt;
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/refsdal_sherpa.zip"&gt;many&lt;/a&gt;
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/mckinney_time_series.pdf"&gt;statistics&lt;/a&gt; and &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/determan_vision_spreadsheet.pdf"&gt;data&lt;/a&gt; &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/caraciolo_crab_recommendation.pdf"&gt;related&lt;/a&gt; talks, as well as two tutorials and
a keynote.&lt;/p&gt;
&lt;p&gt;In addition, on a personal basis it was really great to meet part of the
team behind &lt;a class="reference external" href="http://statsmodels.sourceforge.net/"&gt;scikits.statsmodels&lt;/a&gt;. We had plenty of very interesting
discussions and they really helped me understand the way that some
statisticians approach data: very differently from me, because they have
fairly little data, and can afford to inspect reports and graphs,
whereas I rely more on automated decision rules.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="ipython"&gt;
&lt;h4&gt;IPython&lt;/h4&gt;
&lt;p&gt;&lt;a class="reference external" href="http://twitter.com/#!/minrk"&gt;Min&lt;/a&gt; gave &lt;a class="reference external" href="http://minrk.github.com/scipy-tutorial-2011/"&gt;an excellent tutorial&lt;/a&gt; on how to do parallel computing
using IPython. These guys have certainly done an excellent job to make
cluster-level programming in Python easier. While they don’t yet play
terribly well with the restrictive job-queue policies of the clusters to
which I have access, they have all the right low-level tools to address
these issues and Min told me that they will be working on this next
year.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://fperez.org/"&gt;Fernando&lt;/a&gt; gave &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/perez_ipython.pdf"&gt;an impressive talk&lt;/a&gt; on the new developments of
IPython. In particular, the new Qt-based terminal is &lt;em&gt;really cool&lt;/em&gt;
and there is a web frontend in the works.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="cluster-computing-as-facility"&gt;
&lt;h4&gt;Cluster computing as facility&lt;/h4&gt;
&lt;p&gt;While I mention cluster computing, I must confess that I have always
stayed away from this beast: I find it a time sink, and I find that I
get more science done without it. This is why I really liked the
presentation by the &lt;a class="reference external" href="http://www.picloud.com/"&gt;PiCloud&lt;/a&gt; guys on, … cluster computing! The
reason I liked it is that they start from the principle that your time
is more important than CPU time. I hear so much about &lt;em&gt;bigger better
faster more&lt;/em&gt; high-performance computing when researchers forget to
address the biggest issue:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
… a whole generation of researchers turned into system
administrators by the demands of computing - Dan Reed, VP Microsoft&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div class="section" id="abstract-code-manipulation-for-numerical-computation"&gt;
&lt;h4&gt;Abstract code manipulation for numerical computation&lt;/h4&gt;
&lt;p&gt;Finally, a trend that is picking up in the Python-based scientific
computing is the abstract manipulation of expressions to generate fast
code. This ranges from &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Just-in-time_compilation"&gt;JIT (just in time) compilation&lt;/a&gt; generating
machine code, to rewriting mathematical expressions. Peter Wang gave a
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/wang_metagraph.pdf"&gt;talk&lt;/a&gt; along these lines, and the topic was also brought up by Aron Ahmadia.
Of course this is not new: &lt;a class="reference external" href="http://code.google.com/p/numexpr/"&gt;numexpr&lt;/a&gt; has been using these tricks for
years, and more recently &lt;a class="reference external" href="http://deeplearning.net/software/theano/"&gt;Theano&lt;/a&gt; has been making good use of GPUs
thanks to them.&lt;/p&gt;
&lt;p&gt;This topic is emerging in more and more places for good reasons: with
faster and more numerous CPUs, the number of operations per second is
less of a bottleneck, and the order in which operations are applied, or
where the data physically sits, is becoming critical.&lt;/p&gt;
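&lt;p&gt;As a toy sketch of the underlying idea (my own illustration, not the
API of numexpr or Theano): once the computation is represented as an
expression tree, an engine can evaluate it in a single pass over the
data, instead of materializing a temporary array for each intermediate
result.&lt;/p&gt;

```python
# Toy expression-tree evaluator: `2*a + 3*b` is walked once per element,
# so no intermediate temporary arrays are created (unlike naive
# element-wise evaluation, which would build `2*a` and `3*b` first).

def eval_tree(node, env):
    """Recursively evaluate a scalar expression tree against variable values."""
    if isinstance(node, str):           # variable reference
        return env[node]
    if isinstance(node, (int, float)):  # constant
        return node
    op, left, right = node
    l, r = eval_tree(left, env), eval_tree(right, env)
    return l + r if op == '+' else l * r

def evaluate(tree, **arrays):
    """Single pass over the data: one tree walk per element."""
    n = len(next(iter(arrays.values())))
    return [eval_tree(tree, {k: v[i] for k, v in arrays.items()})
            for i in range(n)]

# 2*a + 3*b as a tree of ('op', left, right) tuples
tree = ('+', ('*', 2, 'a'), ('*', 3, 'b'))
print(evaluate(tree, a=[1, 2, 3], b=[10, 20, 30]))  # [32, 64, 96]
```

Real tools go much further, of course: they rewrite the tree, fuse loops, and emit machine or GPU code, but the starting point is this kind of abstract representation of the computation.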
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="my-own-agenda"&gt;
&lt;h3&gt;My own agenda&lt;/h3&gt;
&lt;div class="section" id="sprinting-on-scikit-learn"&gt;
&lt;h4&gt;Sprinting on scikit-learn&lt;/h4&gt;
&lt;a class="reference external image-reference" href="http://scikit-learn.org/dev/auto_examples/mixture/plot_gmm.html"&gt;&lt;img alt="" src="http://scikit-learn.org/dev/_images/plot_gmm_1.png" /&gt;&lt;/a&gt;
&lt;p&gt;We had two days of sprints after the conference. A huge number of people
voted to sprint on &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, but only two people showed up:
Minwoo Lee and &lt;a class="reference external" href="http://www-etud.iro.umontreal.ca/~wardefar"&gt;David Warde-Farley&lt;/a&gt;. Thanks heaps to these guys! My
priority for the sprint was to review and merge branches. That worked
beautifully: we merged in the following features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/mixture.html#the-dirichlet-process"&gt;Dirichlet-Process Gaussian mixture models&lt;/a&gt;, by Alex Passos&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/decomposition.html#sparse-principal-components-analysis-sparsepca"&gt;Sparse PCA&lt;/a&gt; by Vlad Niculae.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/gaussian_process.html"&gt;Speedups in Gaussian processes&lt;/a&gt; by Vincent Schut.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/clustering.html#mini-batch-k-means"&gt;Sparse implementation of the mini-batch k-means&lt;/a&gt; by Peter
Prettenhofer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, David added a dataset downloader for the &lt;a class="reference external" href="http://cs.nyu.edu/~roweis/data/olivettifaces.gif"&gt;Olivetti face
dataset&lt;/a&gt;, which is lightweight, but rich enough to give very
interesting examples.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="my-presentation"&gt;
&lt;h4&gt;My presentation&lt;/h4&gt;
&lt;p&gt;I gave a talk on my research work, and the software stack that
underpins it: &lt;a class="reference external" href="http://www.slideshare.net/GaelVaroquaux/python-for-brain-mining-neuroscience-with-state-of-the-art-machine-learning-and-data-visualization"&gt;Python for brain mining: (neuro)science with state of
the art machine learning and data visualization&lt;/a&gt;. I think that it was
well received by the audience. What is really crazy is that I uploaded
the slides on slideshare, and they got a ridiculous number of views. I
suspect that it is because of the title: &lt;em&gt;brain mining&lt;/em&gt; does sound
fancy.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="mayavi"&gt;
&lt;h4&gt;Mayavi&lt;/h4&gt;
&lt;p&gt;For technical and political reasons, I cannot get &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi&lt;/a&gt;
installed on the computers at work. This, and the fact that many people
ask for help but few contribute, even in the form of answers on the
mailing list, had been wearing me down a bit. I got so much great
feedback on Mayavi at the conference that I feel much more motivated to
invest energy in it.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-humain-brain-mapping-conference-in-quebec-city"&gt;
&lt;h2&gt;The Human Brain Mapping conference in Quebec City&lt;/h2&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6018/5968391718_002105ccd1.jpg" style="width: 50%;" /&gt;
&lt;p&gt;This blog post is getting too long. It is well beyond my own attention
span. However, scipy is not the only conference I have been to
recently. Two weeks earlier I was in Quebec City for the &lt;a class="reference external" href="http://www.humanbrainmapping.org/i4a/pages/index.cfm?pageID=3419"&gt;Human Brain Mapping
conference&lt;/a&gt;. As every year, HBM was a fun ride, with fantastic parties
in the evenings. But I didn’t stay up too late, as this year was a busy
one for me: I was teaching in an educational course and chairing a
symposium, both on comparing brain functional connectivity across
subjects.&lt;/p&gt;
&lt;p&gt;But the really big deal at HBM this year came at the end. As I was
dozing off, vaguely listening to Russ Poldrack’s closing comments, he
brought up on screen a slide entitled &lt;em&gt;the year of Python&lt;/em&gt;. This is a
big deal: we’ve been working for years to get Python into the neuroimaging
world, and it is clearly making progress, despite all the roadblocks.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="conferences"></category><category term="travels"></category><category term="machine learning"></category><category term="mayavi"></category><category term="python"></category><category term="science"></category><category term="scikit-learn"></category></entry><entry><title>Research jobs in France: the black humor of 2010 is the reality of 2011</title><link href="https://gael-varoquaux.info/science/research-jobs-in-france-the-black-humor-of-2010-is-the-reality-of-2011.html" rel="alternate"></link><published>2011-01-15T11:41:00+01:00</published><updated>2011-01-15T11:41:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-01-15:/science/research-jobs-in-france-the-black-humor-of-2010-is-the-reality-of-2011.html</id><summary type="html">&lt;p&gt;The French basic research landscape is dominated by a few nationwide
institutes, similar to the NIST or the NIH in the US. The largest of these
is the &lt;a class="reference external" href="http://www.cnrs.fr/index.php"&gt;CNRS&lt;/a&gt; (Centre National de la Recherche Scientifique). Getting a
tenured job in one of those institutes enables someone to focus on basic …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The French basic research landscape is dominated by a few nationwide
institutes, similar to the NIST or the NIH in the US. The largest of these
is the &lt;a class="reference external" href="http://www.cnrs.fr/index.php"&gt;CNRS&lt;/a&gt; (Centre National de la Recherche Scientifique). Getting a
tenured job in one of those institutes enables someone to focus on basic
research rather than teaching or going into industry. It has always
been quite challenging to get such a position, as many people apply for
very few openings, and the choice of candidates is quite political. Each
year there is a call for applications, through an impressive formal
process that young researchers trying to get jobs in France end up
knowing quite well.&lt;/p&gt;
&lt;p&gt;Last year, I was visiting a research lab (&lt;a class="reference external" href="http://www.incm.cnrs-mrs.fr/en_index.php"&gt;INCM&lt;/a&gt;) and I saw in their
coffee-break room the following poster (below), which I could
clearly recognize as the official call for applications for positions at
CNRS.&lt;/p&gt;
&lt;p&gt;Now this poster says ‘&lt;strong&gt;The CNRS recruits 3 researchers (m/w) in all
fields of research&lt;/strong&gt;’. Of course it’s a fake poster and black humor: 3
positions nationwide in all fields of research is ridiculously low. It
is however an expression of the nightmare of thousands of young
researchers who are applying each year and keep hearing that the
government will &lt;a class="reference external" href="http://www.latribune.fr/actualites/economie/france/20100415trib000499181/la-fonction-publique-d-etat-perdra-34.000-postes-en-2011-selon-georges-tron.html"&gt;slash the number of state employees&lt;/a&gt;.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/cnrs_recruits.jpg" style="width: 70%;" /&gt;
&lt;p&gt;The call for 2011 applications for research positions at &lt;a class="reference external" href="http://en.inria.fr/"&gt;INRIA&lt;/a&gt;,
the French national computer science institute and another one of
the big research institutions in France, is &lt;a class="reference external" href="http://www.inria.fr/institut/recrutement-metiers/offres/concours-2011-5-postes-de-charge-de-recherche-2e-classe-sont-a-pourvoir/concours-2011"&gt;out&lt;/a&gt;. The page is entitled
&lt;em&gt;Cinq postes de chargé de recherche 2e classe sont à pourvoir&lt;/em&gt; (&lt;strong&gt;5
positions for junior researchers are available&lt;/strong&gt;). This is not a joke,
and it is striking to see the similarity between &lt;strong&gt;the dark humor of
2010 and the reality of 2011&lt;/strong&gt;. To be fair, INRIA is smaller than CNRS,
as it covers only computer science and applications (listed as applied
maths, numerical computing and simulation, algorithm and software
research, networks and distributed systems, and computational modeling
for life sciences). The number of applicants is in the hundreds rather
than thousands, but having only 5 jobs available nationwide still feels
really awkward.&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external" href="attachments/cnrs_recruits.pdf"&gt;PDF poster&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;A minor detail: I am trying to get a job in computational science
research in France.&lt;/p&gt;
</content><category term="science"></category><category term="personnal"></category><category term="science"></category></entry><entry><title>Machine learning humour</title><link href="https://gael-varoquaux.info/science/machine-learning-humour.html" rel="alternate"></link><published>2010-09-16T23:11:00+02:00</published><updated>2010-09-16T23:11:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-09-16:/science/machine-learning-humour.html</id><summary type="html">&lt;div class="section" id="yes-but-they-overfit"&gt;
&lt;h2&gt;Yes, but they overfit&lt;/h2&gt;
&lt;p&gt;If you are reading this post through a planet, the movie isn’t showing
up, just &lt;a class="reference external" href="http://gael-varoquaux.info/science/machine-learning-humour.html"&gt;click through&lt;/a&gt; to understand what the hell this is about.&lt;/p&gt;
&lt;p&gt;
&lt;object width="480" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/m60lVGz34hU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;&lt;/div&gt;
&lt;div class="section" id="some-explanations"&gt;
&lt;h2&gt;Some explanations…&lt;/h2&gt;
&lt;div class="section" id="machine-learning-geeks-and-beers"&gt;
&lt;h3&gt;Machine learning, geeks, and beers&lt;/h3&gt;
&lt;p&gt;Sorry for the bad humour. In the previous weeks my social geek life …&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="yes-but-they-overfit"&gt;
&lt;h2&gt;Yes, but they overfit&lt;/h2&gt;
&lt;p&gt;If you are reading this post through a planet, the movie isn’t showing
up, just &lt;a class="reference external" href="http://gael-varoquaux.info/science/machine-learning-humour.html"&gt;click through&lt;/a&gt; to understand what the hell this is about.&lt;/p&gt;
&lt;p&gt;
&lt;object width="480" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/m60lVGz34hU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;&lt;/div&gt;
&lt;div class="section" id="some-explanations"&gt;
&lt;h2&gt;Some explanations…&lt;/h2&gt;
&lt;div class="section" id="machine-learning-geeks-and-beers"&gt;
&lt;h3&gt;Machine learning, geeks, and beers&lt;/h3&gt;
&lt;p&gt;Sorry for the bad humour. In the previous weeks my social geek life had
two strong moments:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.pycon.fr/conference/edition2010"&gt;Pycon fr&lt;/a&gt;, the French Python conference, and ensuing drinking&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" src="http://farm5.static.flickr.com/4077/4938486734_378f52fd3d.jpg" style="width: 45%;" /&gt;
&lt;img alt="" src="http://farm5.static.flickr.com/4114/4938124265_027853c81a.jpg" style="width: 45%;" /&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://fseoane.net/blog/2010/second-scikitslearn-coding-sprint/"&gt;The second sprint&lt;/a&gt; on the &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit learn&lt;/a&gt;, a library for machine
learning in Python.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At the first event (or maybe the related drinking) there was a lot of
discussion about NoSQL databases, and I was introduced to &lt;a class="reference external" href="http://www.xtranormal.com/watch/6995033/&amp;quot;&amp;quot;"&gt;this
fantastic video&lt;/a&gt; making fun of MongoDB fanboys. A few days later I was
hacking on the scikit, comparing estimators and discussing hype versus
fact in machine learning algorithms (hint: &lt;a class="reference external" href="http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization"&gt;there is no free lunch&lt;/a&gt;,
but you may get &lt;a class="reference external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.2501&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;a free brunch&lt;/a&gt;). As in brain imaging people seem to
be doing nothing but SVMs over and over while &lt;a class="reference external" href="http://hal.inria.fr/hal-00504095/PDF/icpr_2010_tv.pdf"&gt;methods with more
appropriate sparsity clearly perform better&lt;/a&gt;, I composed this stupid
video.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="anything-to-learn-about-machine-learning-in-there"&gt;
&lt;h3&gt;Anything to learn about machine learning in there?&lt;/h3&gt;
&lt;p&gt;The short answer is: probably not. This video is humour, and there is
little truth in it (well, RFE is indeed slow as a dog). However, not every
reader of this blog is a machine learning expert, so let me explain the
stakes of the pseudo discussion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overfitting&lt;/strong&gt;: when you learn a predictive model on a noisy data set,
for instance trying to predict whether a movie is popular
or not from ratings, if you have a finite amount of data, you should be
careful not to learn every detail of the data by heart. Otherwise you
will learn noise that, by chance, correlates with what you are trying to
predict. When you try to generalize to new data, these features that you
learned from noise will be detrimental to your prediction performance. For
instance, &lt;a class="reference external" href="http://www.reddit.com/r/Python/comments/cwq37/announcing_python_nltk_demos_natural_language/"&gt;the presence of Matt Damon&lt;/a&gt; is not the sole predictor of the
quality of a movie. This is called overfitting. The goal of
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Regularization_%28mathematics%29"&gt;regularization&lt;/a&gt; is to avoid this overfitting.&lt;/p&gt;
&lt;p&gt;Both SVMs and elastic net implement regularization, but in different ways.
In the case of brain imaging, the predictive features (voxels) are
very sparse, but the noise is highly structured; SVMs (which do not
operate on voxels directly) are not able to select the relevant
voxels and tend to overfit (which can be counter-balanced by univariate
feature selection as in the &lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html"&gt;scikit example&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RFE (recursive feature elimination) is slow as a dog&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_digits&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearSVC&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;svc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LinearSVC&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.rfe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RFE&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;RFE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;21.5&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.glm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ElasticNet&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;ElasticNet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;26.7&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Yeah, but it does much more than simply building a predictor: it builds
a ‘heat map’ of which features help prediction (run &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/auto_examples/rfe_digits.html"&gt;this scikit-learn
example&lt;/a&gt; to get an idea).&lt;/p&gt;
&lt;p&gt;I am afraid that all the examples I pointed to require the development
version of the scikit. Sorry, we just finished a sprint, and there will
be a release soon.&lt;/p&gt;
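For readers on a current scikit-learn, here is a rough modern equivalent of the session above. The module paths and parameter names are my translation of the old `scikits.learn` API (`rfe.RFE(n_features=..., percentage=...)` became `feature_selection.RFE(n_features_to_select=..., step=...)`), not the historical code:

```python
# Recursive feature elimination on the digits data, modern scikit-learn API.
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

digits = load_digits()
X, y = digits.data, digits.target

svc = LinearSVC()
# step=0.1 removes 10% of the remaining features at each iteration,
# mirroring the old percentage=0.1 parameter
rfe = RFE(estimator=svc, n_features_to_select=1, step=0.1).fit(X, y)

# ranking_ gives the elimination order; reshaped to the 8x8 image shape
# it is the 'heat map' of which pixels help prediction
ranking = rfe.ranking_.reshape(digits.images[0].shape)
print(ranking.shape)
```

The pixel with ranking 1 is the last one standing, i.e. the most useful for prediction.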
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="personnal"></category><category term="python"></category><category term="humor"></category></entry><entry><title>Making posters for scientific conferences</title><link href="https://gael-varoquaux.info/science/making-posters-for-scientific-conferences.html" rel="alternate"></link><published>2010-07-12T00:00:00+02:00</published><updated>2010-07-12T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-07-12:/science/making-posters-for-scientific-conferences.html</id><summary type="html">&lt;p class="first last"&gt;Some advice and examples on making posters for scientific conferences.&lt;/p&gt;
</summary><content type="html">&lt;p&gt;This page gives some advice and examples on making posters for
scientific conferences.&lt;/p&gt;
&lt;p&gt;Here are some posters I made (one in 2007, the other in 2011). They don’t
follow all the advice on this page, but they should.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external image-reference" href="attachments/poster_YAO.pdf"&gt;&lt;img alt="poster1" src="attachments/poster_YAO.jpg" style="width: 33%;" /&gt;&lt;/a&gt; &lt;a class="reference external image-reference" href="attachments/poster_hbm2011.pdf"&gt;&lt;img alt="poster2" src="attachments/poster_hbm2011.png" style="width: 33%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;LaTeX sources&lt;/p&gt;
&lt;p&gt;These posters are written in LaTeX. You can download the whole source of
the posters for &lt;a class="reference external" href="attachments/poster.zip"&gt;the first poster (left)&lt;/a&gt;,
and &lt;a class="reference external" href="attachments/poster_hbm2011.zip"&gt;the second one (right)&lt;/a&gt;. These
are some of my personal projects, not meant for sharing. As a result
they contain a fair amount of hacking. I have been asked for the source code
more than once, so I put it on the web. I do not, however, have time to
provide &lt;strong&gt;any&lt;/strong&gt; support for it (I am already too busy supporting other
things). Any mail asking for help on these files will go unanswered. Sorry.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Here is another example, a bit more visually appealing, as it is intended
for a less technical audience.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_ICE.pdf"&gt;&lt;img alt="" class="align-center" src="attachments/poster_ICE.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;One more about my work: this one was made to convey a strong message and
simplified the content a lot to get the message across. I am not too sure
it worked, but I still find the poster pretty.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_ICOLS07.pdf"&gt;&lt;img alt="" class="align-center" src="attachments/poster_ICOLS07.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;And finally two made by Emmanuelle with really nice colours.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_Emmanuelle.pdf"&gt;&lt;img alt="" src="attachments/poster_Emmanuelle.jpg" /&gt;&lt;/a&gt;
&lt;a class="reference external image-reference" href="attachments/poster_blue.pdf"&gt;&lt;img alt="" src="attachments/poster_blue.jpg" /&gt;&lt;/a&gt;
&lt;div class="section" id="advice-on-poster-presentation"&gt;
&lt;h2&gt;Advice on poster presentation&lt;/h2&gt;
&lt;p&gt;See also &lt;a class="reference external" href="http://www.ncsu.edu/project/posters"&gt;http://www.ncsu.edu/project/posters&lt;/a&gt;&lt;/p&gt;
&lt;div class="section" id="fonts"&gt;
&lt;h3&gt;Fonts&lt;/h3&gt;
&lt;p&gt;Sans-serif fonts look really nice, but are less readable in
paragraphs. Use them for titles and headers, and use serif fonts for
paragraphs. Stick to a simple font family like Times. Use bold fonts
when writing with a light colour on a dark background.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="colours"&gt;
&lt;h3&gt;Colours&lt;/h3&gt;
&lt;p&gt;Stick to a rather small number of colours, but choose them well.
Put a very light colour behind your text blocks. If ink is not too
expensive, I would use a dark background, and have light text blocks on
it. Have well-separated areas in your poster (like the background and
the text blocks), and give the background, or other decorative elements,
little contrast: they should not stand out too much (mine stood out
too much in my poster; it’s because the print-out didn’t look like what
was on the screen).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="page-layout"&gt;
&lt;h3&gt;Page layout&lt;/h3&gt;
&lt;p&gt;Break symmetry and order. A well-aligned poster is boring to the
eye, and does not catch attention from afar. People read your poster by
first scanning through it and stopping at a few key points (usually
first at the upper left, then the upper right, then the lower right, and
the lower left), then they might read it more thoroughly after this first
scan. You want to define these key points visually, make them appealing,
and put key ideas there.&lt;/p&gt;
&lt;p&gt;Long lines are difficult to read. Pick up a book, a flyer, anything made
by a professional publisher: it will never have long lines. A good rule
of thumb is that if a text block has lines longer than 80 characters, it
needs breaking down into several columns.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="which-software-to-use"&gt;
&lt;h2&gt;Which software to use&lt;/h2&gt;
&lt;p&gt;Many people use PowerPoint to make their posters. It is easy to use, but
it is not dedicated to making posters, and it produces horrible PDFs.&lt;/p&gt;
&lt;p&gt;If you are willing to pay a lot, there is Quark Xpress, which is very good for this
kind of thing. Adobe PageMaker is also a very good piece of software. &lt;a class="reference external" href="http://www.xara.com/"&gt;Xara&lt;/a&gt; is a cheap and good design program, and a free
version will soon be available for Linux.&lt;/p&gt;
&lt;p&gt;I use LaTeX, just because I love the way it positions characters. But I
admit it is a bit brutal. What I would advise you to use is &lt;a class="reference external" href="http://www.scribus.net"&gt;Scribus&lt;/a&gt;: it is dedicated to this kind of layout work and is free
and open source. I sometimes use LaTeX to create the text boxes, and
Scribus to lay them out. I wrote a &lt;a class="reference external" href="LaTeX-scribus.html"&gt;page&lt;/a&gt;
describing how I do it.&lt;/p&gt;
&lt;!-- See also :
http://theoval.cmp.uea.ac.uk/~nlct/jpgfdraw/manual/postertutorial.html --&gt;
&lt;p&gt;One last remark: use vector graphics (eps, ps, pdf, svg), not bitmaps,
which scale up really badly.
Try to get a vector logo of your institution. Usually, asking the PR
people is all it takes to get one. Of course, if you are using
PowerPoint, chances are that you won’t be able to insert it in your poster.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category><category term="conferences"></category><category term="selected"></category></entry><entry><title>A simple LaTeX example</title><link href="https://gael-varoquaux.info/science/a-simple-latex-example.html" rel="alternate"></link><published>2010-06-01T00:00:00+02:00</published><updated>2010-06-01T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-06-01:/science/a-simple-latex-example.html</id><summary type="html">&lt;p class="first last"&gt;A simple LaTeX document, to use as a skeleton&lt;/p&gt;
</summary><content type="html">&lt;p&gt;Here is a very simple example of a LaTeX document that uses good packages
to achieve a simple but nice layout:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="attachments/simple.tex"&gt;The LaTeX source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="attachments/simple.pdf"&gt;The pdf document&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Some advice&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Use &lt;a class="reference external" href="http://www.texniccenter.org/"&gt;texniccenter&lt;/a&gt; if you don’t have a
favorite editor.&lt;/li&gt;
&lt;li&gt;Read the &lt;a class="reference external" href="http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf"&gt;not so short introduction to latex&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category><category term="science"></category></entry><entry><title>PCA and ICA: Identifying combinations of variables</title><link href="https://gael-varoquaux.info/science/ica_vs_pca.html" rel="alternate"></link><published>2010-02-05T00:00:00+01:00</published><updated>2010-02-05T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-02-05:/science/ica_vs_pca.html</id><summary type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Dimension reduction and interpretability&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Suppose you have statistical data that has too many dimensions, in other
words too many variables of the same random process, which has been
observed many times. You want to find out, from all these variables (or all
these dimensions when speaking in terms of multivariate data …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Dimension reduction and interpretability&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Suppose you have statistical data that has too many dimensions, in other
words too many variables of the same random process, which has been
observed many times. You want to find out, from all these variables (or all
these dimensions when speaking in terms of multivariate data),
what are the relevant combinations, or directions.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="dimension-reduction-with-pca"&gt;
&lt;h2&gt;Dimension reduction with PCA&lt;/h2&gt;
&lt;p&gt;Suppose we have three-dimensional data, for instance simultaneous measurements
made by three thermometers positioned at different locations in a room.
The data forms a cluster of points in a 3D space:&lt;/p&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/science/attachments/ica_pca/3d_data.jpg" style="width: 50%;" /&gt;
&lt;p&gt;If the temperature in that room is conditioned by only two parameters,
the setting of a heater and the outside temperature, we probably have
too much data: the three sets of measurements can be expressed as a
linear combination of two fluctuating variables, plus an additional, much
smaller, noise term. In other words, the data mostly lies in a 2D
plane embedded in the 3D measurement space.&lt;/p&gt;
&lt;p&gt;We can use PCA (Principal Component Analysis) to find this plane: PCA
will give us the orthogonal basis in which the covariance matrix of our
data is diagonal. The vectors of this basis point in successive
orthogonal directions in which the data variance is maximum. In the case
of data mainly residing on a 2D plane, the variance is much greater along
the two first vectors, which define our plane of interest, than along the
third one:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="https://gael-varoquaux.info/science/attachments/ica_pca/3d_data_pca_axis.jpg" style="width: 50%;" /&gt;
&lt;p class="caption"&gt;The covariance eigenvectors identified by PCA are shown in red. The
plane defined by the 2 largest eigenvectors is shown in light red.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;If we look at the data in the plane identified by PCA, it is clear that
it was mostly 2D:&lt;/p&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/science/attachments/ica_pca/3d_data_pca.jpg" style="width: 50%;" /&gt;
&lt;/div&gt;
&lt;div class="section" id="understanding-pca-with-a-gaussian-model"&gt;
&lt;h2&gt;Understanding PCA with a Gaussian model&lt;/h2&gt;
&lt;p&gt;Let &lt;cite&gt;x&lt;/cite&gt; and &lt;cite&gt;y&lt;/cite&gt; be two normally-distributed variables, describing the
processes we are observing:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;x&lt;/i&gt; = &lt;span class="scriptfont"&gt;N&lt;/span&gt;(0, 1)
&lt;/div&gt;
&lt;p&gt;and&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;y&lt;/i&gt; = &lt;span class="scriptfont"&gt;N&lt;/span&gt;(0, 1)
&lt;/div&gt;
&lt;p&gt;Let &lt;cite&gt;a&lt;/cite&gt; and &lt;cite&gt;b&lt;/cite&gt; be two observation variables, linear combinations of &lt;cite&gt;x&lt;/cite&gt;
and &lt;cite&gt;y&lt;/cite&gt;:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;a&lt;/i&gt; = &lt;i&gt;x&lt;/i&gt; + &lt;i&gt;y&lt;/i&gt;
&lt;/div&gt;
&lt;p&gt;and&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;b&lt;/i&gt; = 2 &lt;i&gt;y&lt;/i&gt;
&lt;/div&gt;
&lt;p&gt;PCA is performed by applying an SVD (singular value decomposition) on the
observed data matrix:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;Y&lt;/i&gt; = [&lt;i&gt;a&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;&lt;i&gt;a&lt;/i&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;i&gt;a&lt;/i&gt;&lt;sub&gt;3&lt;/sub&gt;...; &lt;i&gt;b&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;&lt;i&gt;b&lt;/i&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;i&gt;b&lt;/i&gt;&lt;sub&gt;3&lt;/sub&gt;...]
&lt;/div&gt;
&lt;p&gt;This is equivalent to finding the eigenvalues and eigenvectors of
&lt;span class="formula"&gt;&lt;i&gt;Y&lt;/i&gt;&lt;sup&gt; &lt;i&gt;T&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;Y&lt;/i&gt;&lt;/span&gt;, the correlation matrix of the observed data. The
multidimensional (or multivariate, in statistical jargon) probability
density function of Y is written:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;p&lt;/i&gt;(&lt;i&gt;Y&lt;/i&gt;) ∼ &lt;i&gt;exp&lt;/i&gt;( − &lt;i&gt;r&lt;/i&gt;&lt;sup&gt; &lt;i&gt;T&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;M&lt;/i&gt; &lt;i&gt;r&lt;/i&gt;)
&lt;/div&gt;
&lt;p&gt;where &lt;cite&gt;r&lt;/cite&gt; is the position in the &lt;cite&gt;(a,b)&lt;/cite&gt; observation space, and &lt;cite&gt;M&lt;/cite&gt; the
correlation matrix. Diagonalizing the matrix &lt;cite&gt;M&lt;/cite&gt; corresponds to finding
a rotation matrix &lt;cite&gt;U&lt;/cite&gt; such that:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;p&lt;/i&gt;(&lt;i&gt;Y&lt;/i&gt;) ∼ &lt;i&gt;exp&lt;/i&gt;( − &lt;i&gt;r&lt;/i&gt;&lt;sup&gt; &lt;i&gt;T&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;U&lt;/i&gt;&lt;sup&gt; &lt;i&gt;T&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;S&lt;/i&gt; &lt;i&gt;U&lt;/i&gt; &lt;i&gt;r&lt;/i&gt;)
&lt;/div&gt;
&lt;p&gt;with &lt;cite&gt;S&lt;/cite&gt; a diagonal matrix. In other words, &lt;cite&gt;U&lt;/cite&gt; is a rotation of the
observation space to a basis in which the probability density
function is written:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;p&lt;/i&gt;(&lt;i&gt;Y&lt;/i&gt;) ∼ &lt;i&gt;exp&lt;/i&gt;( − &lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt; &lt;i&gt;σ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt; &lt;i&gt;r&lt;/i&gt;&lt;span class="scripts"&gt;&lt;sup class="script"&gt;2&lt;/sup&gt;&lt;sub class="script"&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;) = &lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator"&gt;∏&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt; &lt;i&gt;exp&lt;/i&gt;( − &lt;i&gt;σ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt; &lt;i&gt;r&lt;/i&gt;&lt;span class="scripts"&gt;&lt;sup class="script"&gt;2&lt;/sup&gt;&lt;sub class="script"&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;)
&lt;/div&gt;
&lt;p&gt;In this new basis, &lt;cite&gt;Y&lt;/cite&gt; can thus be interpreted as a sum of independent
normal processes of different variance.&lt;/p&gt;
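This equivalence between the SVD of the data matrix and the diagonalization of its correlation matrix is easy to check numerically on the toy model above (a = x + y, b = 2y); a minimal sketch in plain NumPy:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=10000)   # x ~ N(0, 1)
y = rng.normal(size=10000)   # y ~ N(0, 1)

# Observations a = x + y and b = 2y, stacked as the columns of Y
Y = np.column_stack([x + y, 2 * y])
Y -= Y.mean(axis=0)          # center before PCA

# PCA via the SVD of the data matrix ...
s = np.linalg.svd(Y, full_matrices=False)[1]
# ... is equivalent to diagonalizing Y^T Y: its eigenvalues are the
# squared singular values of Y
eigvals = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]

print(np.allclose(eigvals, s ** 2))
```

Here the two variances differ (the covariance of (a, b) is [[2, 2], [2, 4]]), which is precisely what lets PCA pick out the directions.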
&lt;p&gt;We can thus picture the PCA as a way of finding independent normal
processes. The different steps of the argument exposed above can be
pictured in the following figure:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="https://gael-varoquaux.info/science/attachments/ica_pca/pca_on_gaussian_data.png" style="width: 80%;" /&gt;
&lt;p class="caption"&gt;First we represent samples drawn from
&lt;cite&gt;x&lt;/cite&gt; and &lt;cite&gt;y&lt;/cite&gt; in their original space, the basis of the independent
variables. Then we represent the (&lt;cite&gt;a&lt;/cite&gt;, &lt;cite&gt;b&lt;/cite&gt;) samples, and we apply PCA on
these samples, to estimate the eigenvectors of the covariance matrix.
Then we represent the data projected in the basis estimated by PCA. One
important detail to note, is that after PCA, the data is most often
rescaled: each direction is divided by the corresponding sample standard
deviation identified by PCA. After this operation, all directions of
space play the same role, the data is spheric, or “white”.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;PCA was able to identify the original independent variables &lt;cite&gt;x&lt;/cite&gt; and &lt;cite&gt;y&lt;/cite&gt;
in the &lt;cite&gt;a&lt;/cite&gt; and &lt;cite&gt;b&lt;/cite&gt; samples only because they were mixed with different
variances. For an isotropic Gaussian model, any basis can describe the data
in terms of independent normal processes.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="pca-on-non-normal-data"&gt;
&lt;h2&gt;PCA on non normal data&lt;/h2&gt;
&lt;p&gt;More generally, the PCA algorithm can be understood as an algorithm
finding the direction of space with the highest sample variance, and
moving on to the orthogonal subspace of this direction to find the next
highest variance, and iteratively discovering an ordered orthogonal basis
of highest variance. This is well adapted to normal processes, as their
covariance is indeed diagonal in an orthogonal basis. In addition, the
resulting vectors come with a “PCA score”, i.e. the variance of the data
projected along the direction they define. Thus when using PCA for
dimension reduction, we can choose the subspace defined by the first &lt;cite&gt;n&lt;/cite&gt;
PCA vectors, on the basis that they explain a given percentage of the
variance, and that the subspace they define is the subspace of dimension
&lt;cite&gt;n&lt;/cite&gt; that explains the largest possible fraction of the total variance.&lt;/p&gt;
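That selection rule can be sketched in a few lines of NumPy (my own illustration; the 90% threshold and the synthetic low-rank data are arbitrary choices):

```python
import numpy as np

def n_components_for(X, fraction=0.9):
    """Smallest n such that the first n principal directions explain
    at least `fraction` (< 1) of the total variance."""
    X = X - X.mean(axis=0)
    s = np.linalg.svd(X, full_matrices=False)[1]  # singular values, descending
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(explained, fraction) + 1)

rng = np.random.RandomState(0)
# Synthetic data: two latent factors embedded in 5 dimensions, plus faint noise
latent = rng.normal(size=(1000, 2))
mixing = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0, 1.0]])
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 5))
print(n_components_for(X, fraction=0.9))   # 2: two directions carry ~all variance
```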
&lt;p&gt;However, on strongly non-Gaussian processes, the variance may not be the
quantity of interest.&lt;/p&gt;
&lt;p&gt;Let us consider the same model as above, with two independent variables
&lt;cite&gt;x&lt;/cite&gt; and &lt;cite&gt;y&lt;/cite&gt;, though with strongly non-Gaussian distributions. Here we
use a mixture of a narrow Gaussian and a wide one, to populate the tails:&lt;/p&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/science/attachments/ica_pca/non_gaussian_pdf.png" style="width: 40%;" /&gt;
&lt;p&gt;We can apply the same operations on these random variables: change of
basis to an observation basis made of &lt;cite&gt;a&lt;/cite&gt; and &lt;cite&gt;b&lt;/cite&gt;, and PCA on the
resulting sample:&lt;/p&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/science/attachments/ica_pca/pca_on_non_gaussian_data.png" style="width: 80%;" /&gt;
&lt;p&gt;We can see that PCA did not properly identify the original
independent variables. The variance criterion is not good enough when the
principal axes of the observed distribution are not orthogonal, as the
highest variance can be found in a direction mixing the two processes.
Indeed, the largest PCA direction is found slightly off-axis. In addition,
the second direction can only be found orthogonal to the first one, as
this is a restriction of PCA.&lt;/p&gt;
&lt;p&gt;On the other hand, the data after PCA is much more spherical than the
original data. No strong anisotropy is found in the central part of the
sample cloud, which contributes most to the variance.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="ica-independent-non-gaussian-variables"&gt;
&lt;h2&gt;ICA: independent, non-Gaussian variables&lt;/h2&gt;
&lt;p&gt;For strongly non-Gaussian processes, the above example shows that
separating independent processes should be done by looking at fine details
of the distribution, such as the tails. Indeed, after PCA, the Gaussian
parts of the processes have been separated by their variance, and the
resulting, rescaled, samples cannot be decomposed into independent processes
within a Gaussian model, as they all have the same variance, and would
already be considered independent under a Gaussian hypothesis.&lt;/p&gt;
&lt;p&gt;A popular class of algorithms to separate independent sources, called ICA
(independent component analysis), makes the simplification that finding
independent sources in such data can be reduced to finding maximally
non-Gaussian directions. Indeed, the central-limit theorem tells us that a
sum of non-Gaussian processes tends toward a Gaussian process. Conversely,
among multivariate samples of equal variance, the more non-Gaussian a signal
extracted from the data, the fewer independent (and non-Gaussian) variables
it mixes.&lt;/p&gt;
&lt;p&gt;A good discussion of these arguments can be found in the following paper:
&lt;a class="reference external" href="http://www.cis.hut.fi/aapo/papers/IJCNN99_tutorialweb/IJCNN99_tutorial3.html"&gt;http://www.cis.hut.fi/aapo/papers/IJCNN99_tutorialweb/IJCNN99_tutorial3.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;ICA is thus an optimization algorithm that extracts from the data the
direction with the least Gaussian PDF, removes the data explained by this
variable from the signal, and iterates.&lt;/p&gt;
&lt;p&gt;Applying ICA to the previous model yields the following:&lt;/p&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/science/attachments/ica_pca/ica_on_non_gaussian_data.png" style="width: 80%;" /&gt;
&lt;p&gt;We can see that ICA has correctly identified the original independent
variables. Its use of the tails of the distribution was paramount for
this task. In addition, ICA relaxes the constraint that all identified
directions must be perpendicular. This flexibility was also important to
match our data.&lt;/p&gt;
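This separation can be reproduced in a few lines with scikit-learn's FastICA (one popular ICA implementation). This is a sketch under my own assumptions: the heavy-tailed sources mimic the narrow/wide Gaussian mixture used above, and the mixing reuses the a = x + y, b = 2y model:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
n = 20000

def heavy_tailed(rng, n):
    """Mixture of a narrow and a wide Gaussian, to populate the tails."""
    wide = rng.uniform(size=n) < 0.1
    return np.where(wide,
                    rng.normal(scale=4.0, size=n),
                    rng.normal(scale=0.5, size=n))

# Two independent non-Gaussian sources, mixed non-orthogonally:
# a = x + y, b = 2y, as in the Gaussian example above
S = np.column_stack([heavy_tailed(rng, n), heavy_tailed(rng, n)])
A = np.array([[1.0, 1.0],
              [0.0, 2.0]])
X = S @ A.T

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)

# Up to sign, order and scale, each recovered component should line up
# with exactly one of the true sources
corr = np.abs(np.corrcoef(S.T, S_ica.T)[:2, 2:])
print(corr.round(2))
```

Each row of the correlation matrix has one entry close to 1 and one close to 0: ICA has recovered the sources despite the non-orthogonal mixing.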
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This discussion can now be seen as an &lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_vs_pca.html"&gt;example of the scikit-learn&lt;/a&gt;.
Thus you can replicate the figure using the code in the scikit.&lt;/p&gt;
&lt;/div&gt;
&lt;!-- vim:set spell:
vim:set autoindent: --&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="machine learning"></category><category term="scientific computing"></category><category term="selected"></category></entry><entry><title>General relativity, quantum physics, freely-falling planes and Bayesian statistics</title><link href="https://gael-varoquaux.info/science/general-relativity-quantum-physics-freely-falling-planes-and-bayesian-statistics.html" rel="alternate"></link><published>2009-12-08T22:20:00+01:00</published><updated>2009-12-08T22:20:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2009-12-08:/science/general-relativity-quantum-physics-freely-falling-planes-and-bayesian-statistics.html</id><summary type="html">&lt;p&gt;We’re famous: the &lt;a class="reference external" href="http://gael-varoquaux.info/science/acceleration-estimation-in-atom-interferometric-tests-of-the-einstein-equivalence-principle.html"&gt;work&lt;/a&gt; that concluded my PhD is now picked up by the
press &lt;a class="reference external" href="http://www.physorg.com/news179481148.html"&gt;http://www.physorg.com/news179481148.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I hadn’t realized before reading this journalist’s version of the story,
but we have all the proper buzz words:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;general relativity&lt;/li&gt;
&lt;li&gt;quantum physics&lt;/li&gt;
&lt;li&gt;freely-falling planes&lt;/li&gt;
&lt;li&gt;Bayesian …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;We’re famous: the &lt;a class="reference external" href="http://gael-varoquaux.info/science/acceleration-estimation-in-atom-interferometric-tests-of-the-einstein-equivalence-principle.html"&gt;work&lt;/a&gt; that concluded my PhD is now picked up by the
press &lt;a class="reference external" href="http://www.physorg.com/news179481148.html"&gt;http://www.physorg.com/news179481148.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I hadn’t realized before reading this journalist’s version of the story,
but we have all the proper buzz words:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;general relativity&lt;/li&gt;
&lt;li&gt;quantum physics&lt;/li&gt;
&lt;li&gt;freely-falling planes&lt;/li&gt;
&lt;li&gt;Bayesian statistics.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This kind of stuff makes great headlines, but the way we are judged on
this “success” is actually harmful (I believe), as there is so much
interesting research that lies away from the trendy words and that needs
to be done.&lt;/p&gt;
</content><category term="science"></category><category term="personnal"></category><category term="physics"></category><category term="science"></category></entry><entry><title>Acceleration estimation in atom-interferometric tests of the Einstein equivalence principle</title><link href="https://gael-varoquaux.info/science/acceleration-estimation-in-atom-interferometric-tests-of-the-einstein-equivalence-principle.html" rel="alternate"></link><published>2009-11-07T15:24:00+01:00</published><updated>2009-11-07T15:24:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2009-11-07:/science/acceleration-estimation-in-atom-interferometric-tests-of-the-einstein-equivalence-principle.html</id><summary type="html">&lt;p&gt;Hurray! The pivot article that marks my transition from physics to
statistical modeling is finally out:&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external" href="http://www.iop.org/EJ/article/1367-2630/11/11/113010/njp9_11_113010.pdf"&gt;How to estimate the differential acceleration in a two-species atom interferometer to test the equivalence principle&lt;/a&gt;
&lt;em&gt;G Varoquaux, R A Nyman, R Geiger, P Cheinet, A Landragin and P Bouyer&lt;/em&gt;&lt;/blockquote&gt;
&lt;p&gt;To put things in …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Hurray! The pivot article that marks my transition from physics to
statistical modeling is finally out:&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external" href="http://www.iop.org/EJ/article/1367-2630/11/11/113010/njp9_11_113010.pdf"&gt;How to estimate the differential acceleration in a two-species atom interferometer to test the equivalence principle&lt;/a&gt;
&lt;em&gt;G Varoquaux, R A Nyman, R Geiger, P Cheinet, A Landragin and P Bouyer&lt;/em&gt;&lt;/blockquote&gt;
&lt;p&gt;To put things in context, at the end of my PhD, we had been building an
atom interferometer to test the Einstein equivalence principle and my
reflections on the limits of atom interferometry shifted from worrying
about the underlying physics, to worrying about the estimation: the
inverse problem of going from the experimental signal, to the underlying
quantities that we are measuring, confounded by all the horrible
experimental noise.&lt;/p&gt;
&lt;div class="section" id="atoms-light-gravity-fields-and-free-fall-planes"&gt;
&lt;h2&gt;Atoms, light, gravity fields and free-fall planes&lt;/h2&gt;
&lt;p&gt;The problem is: we want to do high-precision metrological tests in a
free-falling plane. We use interferometry to measure gravity fields. But
rather than doing interferometry with light, we use atoms, which are much
more strongly coupled to gravity. When probing gravity fields with light, the
trick is to use huge highly-sensitive interferometers. For instance the
&lt;a class="reference external" href="http://www.ligo.caltech.edu/"&gt;ligo&lt;/a&gt; and &lt;a class="reference external" href="http://www.virgo.infn.it/"&gt;virgo&lt;/a&gt; projects are kilometer-long light interferometers
listening for gravitational waves, and the &lt;a class="reference external" href="http://www.ringlaser.org.nz/content/facilities.php"&gt;giant ring lasers&lt;/a&gt; can test
for tiny modifications in the Earth rotation and gravity field.
Gravimetric coupling with matter waves and light waves describes the
&lt;a class="reference external" href="http://www.turpion.org/php/paper.phtml?journal_id=pu&amp;amp;paper_id=6425"&gt;very same underlying physics&lt;/a&gt;. However, matter waves, atoms in
the case of my PhD, fall in gravity fields. While this is the expression of
the very phenomenon we are trying to measure, it also means that to
build a very large atom interferometer, you have to let the atoms fall
over a large distance. And I can attest that even laboratory-sized
versions of atom-interferometric experiments are fairly nasty to
run:&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/P1010619.jpg" style="width: 55%;" /&gt;
&lt;p&gt;This is why we simply decided to build an experiment in a
&lt;a class="reference external" href="http://arxiv.org/pdf/0705.2922"&gt;freely-falling plane&lt;/a&gt;: let’s fall with the atoms for 6
kilometers (30 seconds).&lt;/p&gt;
&lt;img alt="" src="http://gael-varoquaux.info/physics/ICELog/07/0328/DSCF0662.jpg" style="width: 40%;" /&gt;
&lt;img alt="" src="http://gael-varoquaux.info/physics/ICELog/07/0327/100_6838.jpg" style="width: 40%;" /&gt;
&lt;/div&gt;
&lt;div class="section" id="measuring-free-fall-while-in-free-fall"&gt;
&lt;h2&gt;Measuring free fall, while in free fall?&lt;/h2&gt;
&lt;img alt="" class="align-right" src="attachments/coyote.png" /&gt;
&lt;p&gt;Of course, the plane is not really in free fall. The pilots try as hard
as possible to compensate for drag and atmospheric turbulence, but there
is a limit to what they can achieve with an Airbus. The atoms are in a
vacuum apparatus, so they are indeed in free fall (before they crash into
the side of the apparatus). However, making sense of free-fall measurements
made relative to an unstable and unpredictable platform is not trivial.
This is where the statistical modeling kicked in. After reading a bit
about noise in interferometers, I realized that we had a well-known
problem in statistics: estimation of hidden variables from noisy
observations. I learned about &lt;a class="reference external" href="http://www.google.fr/url?sa=t&amp;amp;source=web&amp;amp;ct=res&amp;amp;cd=1&amp;amp;ved=0CAcQFjAA&amp;amp;url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FRecursive_Bayesian_estimation&amp;amp;ei=S331StLdCof34Ab117i4BA&amp;amp;usg=AFQjCNFeQT7-ruBii_IfqL5C7smW9jBL3Q&amp;amp;sig2=fYSw1ieKbBFPLqnoBEsdEQ"&gt;recursive Bayesian estimation&lt;/a&gt;, coded a
proof-of-principle algorithm for our problem (in Python, of course), and
was sold. The rest of the story is about noise simulations, and trying
to convince a metrology community that you could perform good
measurements in a noisy environment.&lt;/p&gt;
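&lt;p&gt;To make the idea concrete, here is a minimal sketch of recursive Bayesian
estimation: a one-dimensional Kalman filter tracking a hidden constant from
noisy readings. This is an illustration only, not the algorithm from the
article; the noise variances and the &#8220;gravity&#8221; value are made up.&lt;/p&gt;

```python
import random

def kalman_1d(observations, q=1e-4, r=0.25):
    """Minimal 1-D Kalman filter: track a hidden value from noisy readings.

    q: process-noise variance, r: measurement-noise variance (assumed known).
    """
    x, p = 0.0, 1.0              # initial state estimate and its variance
    estimates = []
    for z in observations:
        p += q                   # predict step: uncertainty grows
        k = p / (p + r)          # Kalman gain: how much to trust the data
        x += k * (z - x)         # update the estimate with the observation
        p *= (1.0 - k)           # uncertainty shrinks after the update
        estimates.append(x)
    return estimates

random.seed(0)
true_value = 9.81                # hidden value (think: local gravity)
noisy = [true_value + random.gauss(0, 0.5) for _ in range(200)]
est = kalman_1d(noisy)
print(f"final estimate: {est[-1]:.3f}")
```

&lt;p&gt;Each new observation refines the running estimate, which is what makes
this family of methods attractive for data coming from an unstable
platform.&lt;/p&gt;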
&lt;p&gt;It took us a lot of time (2 years) to write an article that was
acceptable to the target scientific community, while keeping the core
estimation and statistics message. Publishing new ideas is hard, because
you are not answering questions that people already have in mind. This
is why the fact that &lt;a class="reference external" href="http://www.iop.org/EJ/abstract/1367-2630/11/11/113010"&gt;this article&lt;/a&gt; is out is a huge deal for me. It
marks a turning point in my thinking: I switched from worrying only
about forward models, with which you try to describe the system at hand
as well as possible, to inverse problems, in which you worry about
estimating the parameters from the data.&lt;/p&gt;
&lt;p&gt;I was startled to see that people are ready to spend huge amounts of
money and effort on improving complicated experiments involving quantum
physics and very sophisticated technology, but can be wary of
processing the output signal to increase statistical power. Scientific
communities have their own goals that they pitch (e.g. reducing the
phase noise in lasers) and there can be huge divides between different
scientific interests. Realizing this played an important role in &lt;a class="reference external" href="http://gael-varoquaux.info/personnal/update-on-my-life.html"&gt;my
career shift&lt;/a&gt;. I wanted to know more about the power of statistical
modeling and machine learning applied to real-life systems. I decided
that to learn more, I had to work with people who had a different
culture from mine. It’s been a huge amount of fun so far… More about
that later.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="personnal"></category><category term="science"></category><category term="physics"></category><category term="scientific computing"></category></entry><entry><title>What’s wrong with young academic careers in France</title><link href="https://gael-varoquaux.info/science/whats-wrong-with-young-academic-careers-in-france.html" rel="alternate"></link><published>2008-10-13T22:36:00+02:00</published><updated>2008-10-13T22:36:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2008-10-13:/science/whats-wrong-with-young-academic-careers-in-france.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="http://cournape.wordpress.com/"&gt;David&lt;/a&gt; just blogged a link to an &lt;a class="reference external" href="http://insidehighered.com/views/2008/09/15/altbach"&gt;article&lt;/a&gt; about careers in higher
education. I thought the paragraph on the French system was so much to
the point that I would like to quote it entirely here:&lt;/p&gt;
&lt;blockquote&gt;
In France, the access to a first permanent position as &lt;em&gt;maître de
conférences&lt;/em&gt; occurs …&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="http://cournape.wordpress.com/"&gt;David&lt;/a&gt; just blogged a link to an &lt;a class="reference external" href="http://insidehighered.com/views/2008/09/15/altbach"&gt;article&lt;/a&gt; about careers in higher
education. I thought the paragraph on the French system was so much to
the point that I would like to quote it entirely here:&lt;/p&gt;
&lt;blockquote&gt;
In France, the access to a first permanent position as &lt;em&gt;maître de
conférences&lt;/em&gt; occurs rather early compared with other countries (on
average prior to the age of 33 years) and opens the path to 35 to 40
years of an academic career. These recruitments happen after a
period of high uncertainty as in almost all disciplines the ratio of
“open positions per doctors” has worsened, while the doctoral degree
is still not recognized as a qualification by businesses or the
public sector. Recruiting a new &lt;em&gt;maître de conférences&lt;/em&gt; thus
constitutes a high-stakes decision. But currently university
departments have about two months to examine the candidates, select
some of them, hold a 20- to 30-minute interview with those on the
short list, and rank the best ones. Despite the highly selective
process that the first candidate on the list successfully passes,
this new colleague is rarely considered as a chance on which to
build by the recruiting university. Not only is the salary based on
a national bureaucratic scale below the average GDP per capita for
France, but new academics are frequently not offered a personal
office and may be asked to teach the classes colleagues do not want
to offer or to accept administrative duties. The difficult road
toward the doctorate leads to a rather disappointing and frequently
non-well-remunerated situation, thus undermining the attractiveness
of the career.&lt;/blockquote&gt;
&lt;p&gt;I don&#8217;t regret doing a PhD, but I think the current situation needs to
be stressed, especially to future PhD students: it is a high-risk,
little-gain career. You had better really love what you&#8217;ll be doing. And
keep an exit door in mind.&lt;/p&gt;
</content><category term="science"></category><category term="science"></category><category term="scientific computing"></category></entry><entry><title>LaTeX files of my PhD thesis</title><link href="https://gael-varoquaux.info/science/latex-files-of-my-phd-thesis.html" rel="alternate"></link><published>2008-04-01T00:00:00+02:00</published><updated>2008-04-01T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2008-04-01:/science/latex-files-of-my-phd-thesis.html</id><summary type="html">&lt;p class="first last"&gt;The main files of my phd thesis, to give an example of the LaTeX code used&lt;/p&gt;
</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="attachments/gaeltex.zip"&gt;Here&lt;/a&gt; are the main files I use for writing
&lt;a class="reference external" href="http://tel.archives-ouvertes.fr/tel-00265714"&gt;my PhD thesis&lt;/a&gt; with
LaTeX. I am not publishing them on the net as a model of what to do:
by the end I was too much in a hurry to do a good job, and I hacked
kludges all over the code (it no longer compiles without
overflows).&lt;/p&gt;
&lt;p&gt;What turned out to be very handy was the use of the &lt;a class="reference external" href="http://www.ctan.org/tex-archive/macros/latex/contrib/memoir/"&gt;memoir package&lt;/a&gt;. It
allowed me just enough customization while staying compact. In order to
make it work with some other packages I use, I had to hack it a bit
(horrible kludges again).&lt;/p&gt;
&lt;p&gt;You need the Garamond fonts installed to build this (they are used for
the epigraphs). I use my own version.&lt;/p&gt;
&lt;p&gt;Don&#8217;t e-mail me to debug the problems you get by copying the kludges in
here. This is ugly code that I put out because people were asking for
it.&lt;/p&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category></entry><entry><title>Mission accomplished</title><link href="https://gael-varoquaux.info/science/mission-accomplished.html" rel="alternate"></link><published>2008-01-19T11:59:00+01:00</published><updated>2008-01-19T11:59:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2008-01-19:/science/mission-accomplished.html</id><content type="html">&lt;p&gt;I defended my PhD yesterday. I am pretty happy to be done with this.&lt;/p&gt;
&lt;a class="reference external image-reference" href="../science/attachments/talking1.jpg"&gt;&lt;img alt="" src="../science/attachments/talking1.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;After the defense, the other PhD students offered me a plastic python
(well, it was a cobra, actually, but they told me to pretend it was a
Python).&lt;/p&gt;
&lt;a class="reference external image-reference" href="../science/attachments/gael_with_python.jpg"&gt;&lt;img alt="" src="../science/attachments/gael_with_python.jpg" /&gt;&lt;/a&gt;
</content><category term="science"></category><category term="personnal"></category><category term="science"></category><category term="physics"></category></entry><entry><title>Garamond fonts for LaTeX</title><link href="https://gael-varoquaux.info/science/garamond-fonts-for-latex.html" rel="alternate"></link><published>2006-10-01T00:00:00+02:00</published><updated>2006-10-01T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2006-10-01:/science/garamond-fonts-for-latex.html</id><summary type="html">&lt;p class="first last"&gt;An easy to install version of Garamond fonts for LaTeX&lt;/p&gt;
</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Garamond"&gt;Garamond fonts&lt;/a&gt; are a large
family of fonts. At a friend’s request I modified the &lt;a class="reference external" href="ftp://dante.ctan.org/tex-archive/fonts/urw/garamond/"&gt;URW-garamond&lt;/a&gt; fonts to improve
kerning, add old style numbers, and make some letters prettier. These
fonts are available under the &lt;a class="reference external" href="http://www.cs.wisc.edu/~ghost/doc/cvs/Public.htm"&gt;Aladdin Free Public License&lt;/a&gt;, which states, if I
understand it correctly, that you can use and modify the fonts freely for
non-commercial purposes.&lt;/p&gt;
&lt;p&gt;Here is &lt;a class="reference external" href="attachments/baudelaire.pdf"&gt;a pdf file&lt;/a&gt; that gives an example
of the fonts.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Questions and suggestions&lt;/p&gt;
&lt;p&gt;I made this font in 2006. Time has passed, and I have completely
forgotten the skills required to modify it. I cannot go anywhere
beyond providing the file for download. Sorry: if you send me a kind
email mentioning that the accents or the numbers are not right, I will
be unable to address it.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="instructions-for-use-with-pdflatex"&gt;
&lt;h2&gt;Instructions for use with pdfLaTeX&lt;/h2&gt;
&lt;p&gt;The standard procedure for installing new fonts in a LaTeX installation
is quite complicated and varies from one LaTeX distribution to another.&lt;/p&gt;
&lt;p&gt;I strongly suggest that you install the fonts only in your document&#8217;s
folder. This makes your document portable: as long as you give the
complete folder to your colleagues, they will be able to compile it.&lt;/p&gt;
&lt;p&gt;If you want to install the fonts in the TeXMF tree (so that all documents
compiled on your installation have access to the fonts), I assume you know
TeX well enough to perform the installation without further help.&lt;/p&gt;
&lt;div class="section" id="installing-in-the-current-folder"&gt;
&lt;h3&gt;Installing in the current folder&lt;/h3&gt;
&lt;p&gt;Here is an easy way to install the fonts in your document’s folder (this
will only work if you are using pdfLaTeX):&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/garamond.zip"&gt;Here&lt;/a&gt; is a package to use these fonts with LaTeX.&lt;/p&gt;
&lt;p&gt;Unzip &lt;em&gt;garamond.zip&lt;/em&gt; in the same folder as the LaTeX document you
are working on.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="using-in-a-latex-document"&gt;
&lt;h3&gt;Using in a LaTeX document&lt;/h3&gt;
&lt;p&gt;In your LaTeX file, include the package “garamond”:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;\usepackage&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;garamond&lt;span class="nb"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You also need to use the T1 font encoding:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;\usepackage&lt;/span&gt;&lt;span class="na"&gt;[T1]&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;fontenc&lt;span class="nb"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The garamond package defines a new command &lt;tt class="docutils literal"&gt;\garamond&lt;/tt&gt; that switches
the font in the current group to garamond. Here is a minimal example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;\documentclass&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;article&lt;span class="nb"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;\usepackage&lt;/span&gt;&lt;span class="na"&gt;[T1]&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;fontenc&lt;span class="nb"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;\usepackage&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;lmodern&lt;span class="nb"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;\usepackage&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;garamond&lt;span class="nb"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;\begin&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;document&lt;span class="nb"&gt;}&lt;/span&gt;

&lt;span class="nb"&gt;{&lt;/span&gt;&lt;span class="k"&gt;\garamond&lt;/span&gt;
The Quick Brown Fox Jumps Over The Lazy Dog. 0123456789 &lt;span class="k"&gt;\\&lt;/span&gt;
    &lt;span class="nb"&gt;{&lt;/span&gt;&lt;span class="k"&gt;\slshape&lt;/span&gt; This is garamond slanted&lt;span class="nb"&gt;}&lt;/span&gt; &lt;span class="k"&gt;\\&lt;/span&gt;
    &lt;span class="nb"&gt;{&lt;/span&gt;&lt;span class="k"&gt;\bfseries&lt;/span&gt; This is garamond bold face&lt;span class="nb"&gt;}&lt;/span&gt; &lt;span class="k"&gt;\\&lt;/span&gt;
    &lt;span class="nb"&gt;{&lt;/span&gt;&lt;span class="k"&gt;\scshape&lt;/span&gt; This is in small caps&lt;span class="nb"&gt;}&lt;/span&gt; &lt;span class="k"&gt;\\&lt;/span&gt;
    &lt;span class="nb"&gt;{&lt;/span&gt;&lt;span class="k"&gt;\slshape&lt;/span&gt; &lt;span class="k"&gt;\bfseries&lt;/span&gt; This is slanted and bold face&lt;span class="nb"&gt;}&lt;/span&gt; &lt;span class="k"&gt;\\&lt;/span&gt;
&lt;span class="nb"&gt;}&lt;/span&gt;
And this is written with the latin modern fonts.

&lt;span class="k"&gt;\garamond&lt;/span&gt;

Here we switch to garamond.
&lt;span class="k"&gt;\ungaramond&lt;/span&gt;

Here we switch back to the default.

&lt;span class="k"&gt;\end&lt;/span&gt;&lt;span class="nb"&gt;{&lt;/span&gt;document&lt;span class="nb"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;img alt="minimal example of a LaTeX file using garamond fonts" class="align-center" src="attachments/minimal.png" /&gt;
&lt;p&gt;One remark on this example: you should never, ever, use the standard
out-of-the-box T1 fonts with pdfLaTeX; they look ugly. Always include the
&#8220;lmodern&#8221; or &#8220;pslatex&#8221; package, which use much better PostScript fonts.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category><category term="selected"></category></entry><entry><title>Timing problems with a computer</title><link href="https://gael-varoquaux.info/science/timing-problems-with-a-computer.html" rel="alternate"></link><published>2006-03-20T00:00:00+01:00</published><updated>2006-03-20T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2006-03-20:/science/timing-problems-with-a-computer.html</id><summary type="html">&lt;p class="first last"&gt;Simple experiments on real-time computing, to put in the perspective of the computer-control of an experiment&lt;/p&gt;
</summary><content type="html">&lt;p&gt;Computers are very versatile beasts. Physicists are tempted to use them
to do real-time signal processing and for instance implement a
feedback-loop on an instrument. If the frequencies are above 10Hz this is
not as easy as one might think (after they run at several gHz). I will
try to explore some difficulties here.&lt;/p&gt;
&lt;p&gt;Remember, these are just the ramblings of a physics PhD student. I have
little formal training in IT, so don&#8217;t hesitate to correct me if I didn&#8217;t
get things right.&lt;/p&gt;
&lt;div class="section" id="operating-systems-timing-and-latencies"&gt;
&lt;h2&gt;Operating systems, timing and latencies&lt;/h2&gt;
&lt;p&gt;If you want to build an I/O system that interacts in real-time with
external devices you will want to control the timing of the signals you
send to the instruments.&lt;/p&gt;
&lt;p&gt;Computers are not good at generating events with precise timing. This is
because modern operating systems share the processor time
between a large number of tasks. Your process does not completely control
the computer: it has to ask the operating system for time. The
operating system shares time between different processes, but it also has
some internal tasks to do (like allocating memory). All these
operations may not complete in a predictable amount of time &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-1"&gt;[2]&lt;/a&gt;, and they make it
harder for a process to produce an event (e.g. a hardware output signal) at
a precise instant.&lt;/p&gt;
&lt;p&gt;One solution is to run the program on a single-task
operating system, like DOS. Even then you have to be careful,
as system operations requested by your program may not return in a
controlled amount of time. The proper solution is to use a &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Real-time_operating_system"&gt;hard real-time
operating system&lt;/a&gt;, but this
forces us to use a dedicated system and makes the job much harder, as we
cannot use standard programming techniques and libraries.&lt;/p&gt;
&lt;p&gt;I will attempt to study the limitations of a simple approach, using
standard operating systems and programming techniques, to put numbers on
the performance one can expect.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="real-time-clock-interrupt-latency"&gt;
&lt;h2&gt;Real-time clock interrupt latency&lt;/h2&gt;
&lt;p&gt;The right tool to control timing under linux is the “real time clock”
&lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-2"&gt;[3]&lt;/a&gt;. It can be used to generate interrupts at a given frequency or
instant.&lt;/p&gt;
&lt;p&gt;To quote Wikipedia: “in computing, an interrupt is an asynchronous signal
from hardware indicating the need for attention or a synchronous event in
software indicating the need for a change in execution”. In our case the
interrupt is a signal generated by the real time clock that is trapped by
a process.&lt;/p&gt;
&lt;p&gt;I have run a few experiments on the computers I have available to test
the reliability of the timing of these interrupts, that is, the time it
takes for the process to receive the interrupt. This is known as &#8220;interrupt
latency&#8221; (for more details see &lt;a class="reference external" href="http://lwn.net/Articles/139784/"&gt;this article&lt;/a&gt;), and it limits both the response
time and the timing accuracy of a program that does not monopolize the
CPU, as it corresponds to the time needed for the OS to hand control back
to the program.&lt;/p&gt;
&lt;div class="section" id="the-experiment-and-the-results"&gt;
&lt;h3&gt;The experiment and the results&lt;/h3&gt;
&lt;p&gt;I used a test program to measure interrupt latency &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-3"&gt;[4]&lt;/a&gt; on Linux. The test
code first sets the highest scheduling priority it can, then asks to be
woken up at a given frequency &lt;em&gt;f&lt;/em&gt; by the real-time clock. It checks the
real-time clock to see whether it was really woken up when it asked to be. It
computes the difference between the measured delay between two interrupts
and the theoretical one, &lt;em&gt;1/f&lt;/em&gt;. Here is a histogram of the delays
on different systems. The delay is plotted in units of the period &lt;em&gt;1/f&lt;/em&gt;.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/real_time_results.png" /&gt;
&lt;p&gt;While the code was running I put some stress on the system: pinging
google.com, copying data to the disk, and computing an md5 hash. This
is not supposed to be representative of any particular use; I just did not
want the system to be idle aside from my test code. The tests were run
under a GNOME session but without any user action.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="interpretation-of-the-results"&gt;
&lt;h3&gt;Interpretation of the results&lt;/h3&gt;
&lt;p&gt;I am no kernel guru, so my interpretations may be imprecise, but I can
see that the results are pretty bad.&lt;/p&gt;
&lt;p&gt;There is a jitter that can reach half a period at 1 kHz. Depending on
how narrow a linewidth your &#8220;digital oscillator&#8221; needs, this jitter sets a
limit on the frequency up to which the computer can be used as a &#8220;digital
oscillator&#8221;.&lt;/p&gt;
&lt;p&gt;This also tells us that an interrupt request takes on average 0.5 ms to
get through to the program it targets. This lets us estimate the
time it takes for an event (for instance one generated by an I/O card) to
reach a program, if that program is not currently running.&lt;/p&gt;
&lt;p&gt;Keep in mind that this experiment only measures the jitter and frequency
offset due to software imperfections (kernel, i.e. operating-system related);
on top of this you must add all the I/O bus and buffer problems if you
want to control an external device.&lt;/p&gt;
&lt;p&gt;It is interesting to see how the results vary from one computer to
another. Quite clearly omega&#8217;s RTC is not working properly; this is
probably due to driver problems. Beta has good results, probably
thanks to its pre-emptible kernel. The results of our computer
(digamma) are surprisingly bad. It is a powerful 4-CPU computer. It seems
to me that the process may be getting relocated from one CPU to another,
which generates large jitter. Aramis is a 2-CPU (+ multithreading, that&#8217;s
why it appears as 4) box, and it performs much better. The CPUs are
different, and the kernel versions are different, but I would expect more
recent kernels to fare better.&lt;/p&gt;
&lt;blockquote&gt;
&lt;strong&gt;The take-home message: do not trust computers under the millisecond.&lt;/strong&gt;&lt;/blockquote&gt;
&lt;p&gt;Other sources have indeed confirmed that with a standard Linux kernel, at
the time of writing (Linux 2.6.18), interrupt latency is of the order
of the millisecond. The &#8220;RT_PREEMPT&#8221; compile switch has been measured to
drop the interrupt latency to 50 microseconds, which is of the order of
the hardware limit.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="implications-of-this-jitter"&gt;
&lt;h3&gt;Implications of this jitter&lt;/h3&gt;
&lt;p&gt;These histograms can be seen as frequency spectra of the signal generated
by the computer.&lt;/p&gt;
&lt;p&gt;We can see that the signal created can be slightly off in frequency (the
peak is not always centered on zero): the RTC is not well calibrated.
This should not be a major problem if the offset is repeatable, as it can
be measured and taken into account.&lt;/p&gt;
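&lt;p&gt;Correcting a repeatable offset is simple arithmetic; here is a toy
sketch, with made-up numbers rather than measurements from these
machines:&lt;/p&gt;

```python
# If the RTC's frequency offset is repeatable, estimate it as the mean
# deviation and subtract it, leaving only the residual jitter.
deviations = [0.12, 0.10, 0.15, 0.09, 0.13, 0.11]  # made-up, in periods

offset = sum(deviations) / len(deviations)     # systematic miscalibration
corrected = [d - offset for d in deviations]   # centered residual jitter

print(f"estimated offset: {offset:.3f} periods")
```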
&lt;p&gt;We can see that the spectrum has a non-negligible width at high
frequency. This means that in a servo-loop-like system the computer will
add high-frequency noise at around 1 kHz. It also means that the
timing of a computer-generated event cannot be trusted at the millisecond
level.&lt;/p&gt;
&lt;p&gt;However, it is interesting to note that very few events fall outside the
+/- 1 period range. This means that the computer does not skip a beat very
often: it does perform the work reliably, it just does not
deliver it on time. As a consequence, if we correct for this jitter the
computer can act as a servo loop up to 1 kHz. The preempt kernel performs
very well in terms of reliability, even though it runs on an old box with
little computing power.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="dealing-with-the-jitter"&gt;
&lt;h3&gt;Dealing with the jitter&lt;/h3&gt;
&lt;p&gt;First, we could try to correct for the jitter with a software trick. For
instance, we could ask for the interrupt in advance, and block the CPU by
busy-waiting (to ensure that the scheduler does not schedule us out)
until the exact moment comes.&lt;/p&gt;
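&lt;p&gt;In Python, the trick could look like the sketch below. This is an
illustration only; a real implementation would sit much closer to the
kernel, and &lt;tt class="docutils literal"&gt;time.monotonic&lt;/tt&gt; is far coarser
than an RTC interrupt.&lt;/p&gt;

```python
import time

def wait_until(deadline, margin=0.002):
    """Sleep until `margin` seconds before `deadline` (a time.monotonic()
    value), then busy-wait: the scheduler cannot swap us out unnoticed
    during the final, CPU-burning stretch."""
    remaining = deadline - time.monotonic() - margin
    if remaining > 0:
        time.sleep(remaining)    # coarse wait: lets other tasks run
    while time.monotonic() < deadline:
        pass                     # busy-wait: imprecision shrinks drastically

start = time.monotonic()
deadline = start + 0.01          # target an event 10 ms from now
wait_until(deadline)
overshoot = time.monotonic() - deadline
print(f"overshoot: {overshoot * 1e6:.1f} microseconds")
```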
&lt;p&gt;Another option is to use an I/O device with an embedded clock that
corrects for the jitter, for instance a hardware-triggered acquisition
card. I prefer this solution as it is more versatile and scalable.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This brings us to something that seems to be quite general with real-time
computer control: buffers and external clocks. The computer has the
processing power to do the work in the required amount of time. The
buffer and the external clock correct for the jitter introduced by the
software.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Finally, recompiling the kernel with the RT-preempt patch would probably
help a lot, given that it reduces the interrupt latency by two orders of
magnitude.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="technical-details-about-the-experiment"&gt;
&lt;h3&gt;Technical details about the experiment&lt;/h3&gt;
&lt;div class="section" id="the-measuring-code"&gt;
&lt;h4&gt;The measuring code&lt;/h4&gt;
&lt;p&gt;The way this works: a small C program (borrowed and adapted from
Andrew Morton&#8217;s &#8220;realfeel.c&#8221;) asks for the highest scheduling priority it
can get, then sets the real-time clock to generate an interrupt at a given
frequency. It then loops, waiting for the real-time clock (RTC). The OS
schedules other tasks during the waiting period, but when the interrupt
is generated by the RTC the OS gives the CPU back to the program. The
program then compares the delay since it last received the interrupt with
the expected period, and stores the difference. The results are stored in a
histogram file.&lt;/p&gt;
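&lt;p&gt;For readers without RTC access at hand, the same kind of bookkeeping can
be approximated in a few lines of Python, using
&lt;tt class="docutils literal"&gt;time.sleep&lt;/tt&gt; instead of RTC interrupts; this is
much coarser than the C code described above, but the logic is the same.&lt;/p&gt;

```python
import time

def measure_sleep_jitter(freq=100, n=50):
    """Ask the OS to wake us every 1/freq seconds and record by how much
    each wake-up misses the requested period.  A rough, portable analogue
    of the RTC measurement; time.sleep() has far worse granularity."""
    period = 1.0 / freq
    deltas = []
    last = time.monotonic()
    for _ in range(n):
        time.sleep(period)
        now = time.monotonic()
        deltas.append((now - last) - period)   # signed error vs. 1/freq
        last = now
    return deltas

deltas = measure_sleep_jitter()
print(f"worst wake-up error: {max(deltas) * 1e3:.3f} ms")
```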
&lt;/div&gt;
&lt;div class="section" id="the-stress-code"&gt;
&lt;h4&gt;The stress code&lt;/h4&gt;
&lt;p&gt;I have a very ugly way of putting stress on the computer, so that the
kernel actually schedules other tasks. I did not put tremendous stress
on the CPU, as I wanted to simulate standard use cases. This is how I
did it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;i&lt;span class="w"&gt; &lt;/span&gt;&amp;lt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;i++&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;do&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;ping&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;www.google.com&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;&amp;amp;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;dd&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/urandom&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;bs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1M&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;40&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;md5sum&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;&amp;amp;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;dd&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/zero&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;of&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/foo&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;bs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1M&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;500&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;sync
rm&lt;span class="w"&gt; &lt;/span&gt;/tmp/foo
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Three tasks running in parallel: pinging google, computing the md5 hash
of a random chunk of bits (which also means generating it), and writing
500 MB to the disk. If the system and the network are fast enough the
first two tasks finish before the last one. This is done on purpose.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="making-your-own-measurements"&gt;
&lt;h3&gt;Making your own measurements&lt;/h3&gt;
&lt;p&gt;You can reproduce the histograms under Linux by running the
&#8220;stresstest.sh&#8221; script from the &lt;a class="reference external" href="attachments/real_time_stress_test.zip"&gt;following archive&lt;/a&gt;. The plots can be obtained by
running the &#8220;process.py&#8221; Python script (requires scipy and matplotlib).
You may have to increase the real-time clock frequency user limit. You
can do this by running (as root) &#8220;echo 1024 &amp;gt;
/proc/sys/dev/rtc/max-user-freq&#8221;.&lt;/p&gt;
&lt;p&gt;Send me the results directory created by the &#8220;stresstest.sh&#8221; script on
your box; I am very interested in gathering more statistics.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The jitter measurement is interesting not because it shows the absolute
limit of the technology (hard real-time OSes, like RTLinux, could go much
further), but because it shows the performance achievable with simple
techniques. Looking at this data I would say that anything with
frequencies below 10 to 100 Hz is fairly easy to achieve with the RTC
interrupts, anything around a few kilohertz can be done with a bit more
work, and anything above that requires a lot of work.&lt;/p&gt;
&lt;p&gt;My current policy is to try to move to embedded devices anything with
speeds above 10Hz.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I would like to thank Nicolas George for enlightening discussions on
these matters, as well as for useful questions on the purpose of this
experiment. I would also like to thank David Cournapeau for pointing me to
interesting references, and the Linux Audio Developer mailing list for more
information.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;[1]&lt;/td&gt;&lt;td&gt;Wikipedia article on real-time computing:
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Real-time_computing"&gt;http://en.wikipedia.org/wiki/Real-time_computing&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;A very clear article about fighting latency in the Linux kernel:
&lt;a class="reference external" href="http://lac.zkm.de/2006/papers/lac2006_lee_revell.pdf"&gt;http://lac.zkm.de/2006/papers/lac2006_lee_revell.pdf&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;About the RTC: &lt;a class="reference external" href="http://www.die.net/doc/linux/man/man4/rtc.4.html"&gt;http://www.die.net/doc/linux/man/man4/rtc.4.html&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;What this code is actually measuring is, in technical terms, the
interrupt latency, that is, the time it takes for the kernel to catch
the interrupt, and the rescheduling latency, that is, the time it takes
for the kernel to reschedule from one process to another.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-5" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;[5]&lt;/td&gt;&lt;td&gt;A different benchmark, which probably studies the
intrinsic kernel limits more directly than my code does: &lt;a class="reference external" href="http://lwn.net/Articles/139403/"&gt;http://lwn.net/Articles/139403/&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-6" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;[6]&lt;/td&gt;&lt;td&gt;Another benchmark, which also covers the RT-preempt patch and
shows the impressive improvements achieved with this patch:
&lt;a class="reference external" href="http://kerneltrap.org/node/5466"&gt;http://kerneltrap.org/node/5466&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-7" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;[7]&lt;/td&gt;&lt;td&gt;A course on real-time computing, with the lecture notes.
&lt;a class="reference external" href="http://lamspeople.epfl.ch/decotignie/#InfoTR"&gt;http://lamspeople.epfl.ch/decotignie/#InfoTR&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="linux"></category><category term="science"></category></entry></feed>