<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Gaël Varoquaux</title><link href="https://gael-varoquaux.info/" rel="alternate"></link><link href="https://gael-varoquaux.info/feeds/all.atom.xml" rel="self"></link><id>https://gael-varoquaux.info/</id><updated>2026-01-14T00:00:00+01:00</updated><entry><title>Stepping up as probabl’s CSO to supercharge scikit-learn and its ecosystem</title><link href="https://gael-varoquaux.info/programming/stepping-up-as-probabls-cso-to-supercharge-scikit-learn-and-its-ecosystem.html" rel="alternate"></link><published>2026-01-14T00:00:00+01:00</published><updated>2026-01-14T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2026-01-14:/programming/stepping-up-as-probabls-cso-to-supercharge-scikit-learn-and-its-ecosystem.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../programming/attachments/probabl_team_2025.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Probabl’s get together, in falls 2025&lt;/p&gt;
&lt;/div&gt;
&lt;p class="last"&gt;I’m thrilled to announce that I’m stepping up as &lt;a class="reference external" href="https://probabl.ai/?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_probabl_awareness_post"&gt;Probabl&lt;/a&gt;’s CSO (Chief Science Officer) to supercharge
scikit-learn and its ecosystem, pursuing my dreams of tools that help go
from data to impact.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-a-central-tool"&gt;
&lt;h2&gt;Scikit-learn, a central tool&lt;/h2&gt;
&lt;p&gt;Scikit-learn is central …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../programming/attachments/probabl_team_2025.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Probabl’s get together, in falls 2025&lt;/p&gt;
&lt;/div&gt;
&lt;p class="last"&gt;I’m thrilled to announce that I’m stepping up as &lt;a class="reference external" href="https://probabl.ai/?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_probabl_awareness_post"&gt;Probabl&lt;/a&gt;’s CSO (Chief Science Officer) to supercharge
scikit-learn and its ecosystem, pursuing my dreams of tools that help go
from data to impact.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-a-central-tool"&gt;
&lt;h2&gt;Scikit-learn, a central tool&lt;/h2&gt;
&lt;p&gt;Scikit-learn is central to data scientists’ work: it is &lt;strong&gt;the most used
machine-learning package&lt;/strong&gt;. It has grown over more than a decade,
supported by volunteers’ time, donations, and grant funding, with Inria
playing a central role.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../programming/attachments/scikit-learn_clickpy_2025.png" style="width: 350px;" /&gt;
&lt;p class="caption"&gt;Scikit-learn download numbers; &lt;a class="reference external" href="https://clickpy.clickhouse.com/dashboard/scikit-learn"&gt;reproduce and explore on clickpy&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;And the usage numbers keep going up…&lt;/p&gt;
&lt;p&gt;Scikit-learn keeps growing because it enables crucial applications:
machine learning that can easily be adapted to a given application. This
type of AI does not make the headlines, but it is central to the value
brought by data science. It is used across the board to extract insights
from data and to automate business-specific processes, thus ensuring the
functioning and efficiency of a wide variety of activities.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And scikit-learn is quietly but steadily advancing. The recent releases
bring progress in all directions: computational foundations (&lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#array-api-support-enables-gpu-computations"&gt;the array
API enabling GPU support&lt;/a&gt;),
user interface (&lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#html-representation-of-estimators"&gt;rich HTML displays&lt;/a&gt;),
new models (e.g. &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html"&gt;HDBSCAN&lt;/a&gt;,
&lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#temperature-scaling-in-calibratedclassifiercv"&gt;temperature-scaling recalibration&lt;/a&gt;…), and, as always, algorithmic
improvements (release 1.8 brought &lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#efficiency-improvements-in-linear-models"&gt;marked speed-ups to linear models&lt;/a&gt; and to
&lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#decisiontreeregressor-with-criterion-absolute-error"&gt;trees with MAE&lt;/a&gt;).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="a-new-opportunity-to-boost-scikit-learn-and-its-ecosystem"&gt;
&lt;h2&gt;A new opportunity to boost scikit-learn and its ecosystem&lt;/h2&gt;
&lt;p&gt;Probabl recently raised a &lt;a class="reference external" href="https://blog.probabl.ai/probabl-raises-a-13m-in-seed-to-accelerate-enterprise-grade-ai?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_blog_awareness_post"&gt;beautiful seed round&lt;/a&gt;
from investors who really understand the value and potential of
scikit-learn. We have a unique opportunity to accelerate scikit-learn’s
development. Our analysis is that &lt;strong&gt;enterprises need dedicated tooling and
partners to build best on scikit-learn&lt;/strong&gt;, and we’re hard at work providing
this.&lt;/p&gt;
&lt;p&gt;Two thirds of Probabl’s founders are scikit-learn contributors, and we have
been investing in all aspects of scikit-learn: features, releases,
communication, documentation, and training. In addition, part of
scikit-learn’s success has always been nurturing an ecosystem, for
instance via its simple API, which has become a standard. Thus Probabl is
consolidating not only scikit-learn but also this ecosystem: the &lt;a class="reference external" href="https://skops.readthedocs.io/en/stable/"&gt;skops
project, to put scikit-learn-based models in production&lt;/a&gt;, the &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub project, which
facilitates data preparation&lt;/a&gt;, the &lt;a class="reference external" href="https://skore.probabl.ai/?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_skore_awareness_post"&gt;young skore
project, to track data science&lt;/a&gt;, &lt;a class="reference external" href="https://fairlearn.org/"&gt;fairlearn,
which helps avoid machine learning that discriminates&lt;/a&gt;, and more upstream projects, such as &lt;a class="reference external" href="https://joblib.readthedocs.io/en/stable/"&gt;joblib
for parallel computing&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="my-obsession-as-probabl-cso-serving-the-data-scientists"&gt;
&lt;h2&gt;My obsession as Probabl CSO: serving the data scientists&lt;/h2&gt;
&lt;p&gt;As CSO (Chief Science Officer) at Probabl, my role is to nourish our
development strategy with an understanding of machine learning, data
science, and open source. Making sure that &lt;strong&gt;scikit-learn and its
ecosystem are enterprise-ready&lt;/strong&gt; will bring resources for scikit-learn’s
sustainability, enabling its ecosystem to grow into a standard-setting
platform for the industry that continues &lt;strong&gt;to serve data scientists&lt;/strong&gt;.
This mission will require consolidating existing tools and patterns,
and inventing new ones.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Probabl is in a unique position for this endeavor: our core is an amazing
team of engineers with deep knowledge of data science. Working directly
with businesses gives us an acute understanding of where the ecosystem
can be improved. I also profoundly enjoy working with
people whose DNA differs from the historical DNA of scikit-learn,
with product research, marketing, and business mindsets. I believe that
the union of our different cultures will make the scikit-learn ecosystem
better.&lt;/p&gt;
&lt;p&gt;Beyond the Probabl team, we have an amazing community: a broader
group of scikit-learn contributors who do a remarkable job bringing
together what makes scikit-learn so versatile, and a deep ecosystem of
Python data tools enriched by so many different actors. I’m deeply
grateful to the many scikit-learn and pydata contributors. At Probabl, we
are very attuned to enabling the open-source contributor community. Such
a community is what enables a single tool, scikit-learn, to serve a long
tail of diverse usages.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="open source"></category><category term="growth"></category><category term="communities"></category><category term="scikit-learn"></category><category term="inria"></category><category term="probabl"></category></entry><entry><title>2025 highlights: AI research and code</title><link href="https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html" rel="alternate"></link><published>2026-01-02T00:00:00+01:00</published><updated>2026-01-02T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2026-01-02:/science/2025-highlights-ai-research-and-code.html</id><summary type="html">&lt;div class="figure align-right"&gt;
&lt;img alt="" class="small" src="attachments/2025_highlights/eiffel_tower_ai.jpg" /&gt;
&lt;p class="caption"&gt;AI is everywhere. Can you see it here?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Some highlights about my work in 2025: progress on
tabular-learning stands out, a publication on unpacking trade-off and
consequences of scale in AI, and of course progress on the open-source
data-science and machine learning stack.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2026 starts, I’m looking …&lt;/p&gt;</summary><content type="html">&lt;div class="figure align-right"&gt;
&lt;img alt="" class="small" src="attachments/2025_highlights/eiffel_tower_ai.jpg" /&gt;
&lt;p class="caption"&gt;AI is everywhere. Can you see it here?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Some highlights about my work in 2025: progress on
tabular-learning stands out, a publication on unpacking trade-off and
consequences of scale in AI, and of course progress on the open-source
data-science and machine learning stack.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2026 starts, I’m looking back on 2025. It was all about AI, with
research in the &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;soda team&lt;/a&gt; on tabular
machine learning stimulating better software.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#beyond-maths-unpacking-the-scale-narrative-in-ai" id="toc-entry-1"&gt;Beyond maths: Unpacking the scale narrative in AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabular-learning-research" id="toc-entry-2"&gt;Tabular-learning research&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-open-source-table-foundation-model" id="toc-entry-3"&gt;TabICL:  open-source table foundation model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#retrieve-merge-predict-tradeoffs-of-predictions-from-data-lakes" id="toc-entry-4"&gt;Retrieve merge predict: tradeoffs of predictions from data lakes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#growing-the-machine-learning-and-data-science-stack" id="toc-entry-5"&gt;Growing the machine learning and data science stack&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#skrub-machine-learning-with-tables" id="toc-entry-6"&gt;Skrub: machine learning with tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#fundamental-progress-in-scikit-learn" id="toc-entry-7"&gt;Fundamental progress in scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="beyond-maths-unpacking-the-scale-narrative-in-ai"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Beyond maths: Unpacking the scale narrative in AI&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Plotting the increase in the scale of notable AI systems over recent
years reveals a staggering explosion. AI systems have been growing
super-exponentially along a variety of dimensions: training compute, training cost
(figure below), inference cost, and amount of data used. Studying the wording
used in pivotal publications as well as in company communications shows that
it anchors AI success in this growth, thus &lt;strong&gt;setting implicit social
norms around scale&lt;/strong&gt;. But systematic analysis of benchmark results shows
that &lt;strong&gt;scale does not always bring benefits&lt;/strong&gt;. The narrative of scale is
simplified and leaves aside many important ingredients of the success of AI
systems. In addition, the race for scale comes with planetary and
societal consequences, which we study and &lt;a class="reference external" href="https://dl.acm.org/doi/10.1145/3715275.3732006"&gt;document&lt;/a&gt;. Ever-increasing
inference costs threaten economic and electricity sustainability. An
unstoppable appetite for training data leads to fitting models on
enormous datasets that elude quality control, engulfing undesirable
facets of the internet (including child pornography) or eroding privacy. The
race for scale has financial consequences, benefiting above all the
providers of compute, but also structuring an ecosystem where cash-rich and
GPU-rich actors have leverage on priorities, industrial or academic. These actors
sometimes have circular investment strategies: funding third parties
that will spend all this funding on compute, which can fuel &lt;strong&gt;an
investment bubble in AI&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2025_highlights/cost_ai.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Evolution of the training cost (in dollars) of notable AI systems
across the years&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;We conclude our study, &lt;a class="reference external" href="https://dl.acm.org/doi/10.1145/3715275.3732006"&gt;published at FAccT&lt;/a&gt;, by underlining that &lt;strong&gt;academic
research has a central role to play in these dynamics and must shape a
healthy and grounded narrative&lt;/strong&gt;. We recommend that researchers:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;pursue basic AI research of interest independent of scale, &lt;em&gt;e.g.&lt;/em&gt;
uncertainty quantification, causality…&lt;/li&gt;
&lt;li&gt;uphold responsible norms, in particular not asking for increased compute
when editing or reviewing,&lt;/li&gt;
&lt;li&gt;always publish measures of compute to document the tradeoffs.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2025_highlights/pareto_schema.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;We need to document and explore the tradeoffs&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In addition, I personally want to push those tradeoffs in the direction
of resource-efficient progress, not only resource-intensive progress
(as illustrated in the figure alongside),
which is the easy route to task performance, but not the one that brings
the most value.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="tabular-learning-research"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Tabular-learning research&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="tabicl-open-source-table-foundation-model"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;TabICL:  open-source table foundation model&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Recent tabular-learning models have been bringing better performance. A
poster child is the TabPFN series of models, which rely on
pretrained transformers to deliver excellent performance. However, the
quadratic complexity of transformers is a bottleneck. I do fear that
the agenda of fancy tabular learning is leading us into a race for scale
again.&lt;/p&gt;
&lt;p&gt;With the &lt;a class="reference external" href="https://icml.cc/virtual/2025/poster/46681"&gt;TabICL model&lt;/a&gt; we
strove to decrease this computational cost. We showed that a multi-stage
architecture can build a pretrained in-context predictor in which the
separation of stages decreases the quadratic cost. The model can be
pretrained on larger datasets, and is thus the best performer in
settings with larger tables. The model is faster than alternatives, in
particular when using a CPU rather than a GPU. In addition, we released
&lt;strong&gt;all the code in open source&lt;/strong&gt;, including the pretraining code.&lt;/p&gt;
&lt;p&gt;TabICL gives a table foundation model that is easy to use on modest or
big hardware and that can be easily customized.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="retrieve-merge-predict-tradeoffs-of-predictions-from-data-lakes"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Retrieve merge predict: tradeoffs of predictions from data lakes&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A full data-science pipeline must often assemble data across multiple
source tables:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Alice is working on a base table that contains information about
movies. She also has access to a data lake: a collection of other
tables on all sorts of subjects. She wants to predict the ranking of
a movie based on as much information as possible. She would like to
extract information from the data lake to improve the performance of her
model.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The challenge is that the information of interest is mixed with a
huge amount of unrelated data. Thus, Alice’s problem is: “How do I find
tables that are relevant to my problem? How do I combine them with the
base table?”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When the user is faced with a complex data lake, with many
tables and few explicit links between them, it is difficult to find the
best assembly for a given machine-learning task. This problem requires
not only finding which tables must be joined into the main table of interest
(a table-retrieval problem), but also how to aggregate multiple records
when tables are linked through a many-to-one relation. While table
retrieval is a classic problem in the data-management literature, it had
been understudied in the case of supervised machine learning. We
assembled a systematic (and open) benchmark with data lakes &lt;em&gt;and&lt;/em&gt;
supervised-learning tasks (&lt;a class="reference external" href="https://openreview.net/pdf?id=4uPJN6yfY1"&gt;publication&lt;/a&gt;, &lt;a class="reference external" href="https://soda-inria.github.io/retrieve-merge-predict/"&gt;benchmark material&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;We found that supervised learning does change the picture compared to
classic table-retrieval settings: for a fixed compute budget, it
is worth avoiding fancy retrieval methods, which can be very
computationally costly, and instead using better supervised-learning
methods, which can be comparatively less expensive while still being
able to extract the relevant information from a noisy retrieval.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2025_highlights/yadl_benchmark.png" style="width: 700px;" /&gt;
&lt;p class="caption"&gt;A schema of the pipeline&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The pipeline that we studied here is broader than the
typical machine-learning modeling step. In my experience, data-science
applications are often much more complex than mere tabular learning, and
for this reason we develop the skrub software, described below.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="growing-the-machine-learning-and-data-science-stack"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Growing the machine learning and data science stack&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="skrub-machine-learning-with-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Skrub: machine learning with tables&lt;/a&gt;&lt;/h3&gt;
&lt;a class="reference external image-reference" href="https://skrub-data.org"&gt;&lt;img alt="" class="align-right" src="attachments/skrub_logo.png" style="width: 150px;" /&gt;&lt;/a&gt;
&lt;p&gt;&lt;a class="reference external" href="https://skrub-data.org"&gt;Skrub&lt;/a&gt; is a recent library to blend machine
learning with data-frame computing. In 2025, we have ironed existing
features to make them more performant and really easy to use. For
instance the &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html"&gt;TableVectorizer&lt;/a&gt;
is incredibly useful to build tabular machine-learning pipelines. But we
have also added exciting new features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.ApplyToCols.html"&gt;ApplyToCols&lt;/a&gt; is an object that can use skrub’s powerful &lt;a class="reference external" href="https://skrub-data.org/stable/modules/multi_column_operations/selectors.html"&gt;selectors&lt;/a&gt; to apply transforms to some columns but not others. I find myself using it all the time.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://skrub-data.org/stable/data_ops.html"&gt;DataOps&lt;/a&gt; are an
incredibly powerful way of blending dataframe transformation and
scikit-learn fit/transform/predict API, to build complete machine
learning pipeline across multiple tables. The benefit is that, unlike
standard data wrangling code, they can be applied to new data,
cross-validated, or any component of the pipeline can be tuned to
maximize a prediction score. We even have added optuna support for this
tuning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="fundamental-progress-in-scikit-learn"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Fundamental progress in scikit-learn&lt;/a&gt;&lt;/h3&gt;
&lt;a class="reference external image-reference" href="https://scikit-learn.org"&gt;&lt;img alt="" class="align-right" src="attachments/scikit-learn-logo.png" style="width: 150px;" /&gt;&lt;/a&gt;
&lt;p&gt;What strikes me in the 2025 releases of &lt;a class="reference external" href="https://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; is that we have been
making progress on fundamental improvements to the core features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Faster linear models and tree-based models thanks to better algorithms
(which, in certain cases, give massive speedups).&lt;/li&gt;
&lt;li&gt;Ramping up GPU support: we are progressively adding to scikit-learn a
compute backend that enables GPU computing (an intro &lt;a class="reference external" href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#array-api-support-enables-gpu-computations"&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Free-threading: we now support the “free-threaded” build of Python,
which removes a central lock and opens the door to
heavily-multithreaded parallel computing. More of the ecosystem needs
to support free-threaded Python for it to be widely used, but I am
hoping that in the mid-term we’ll see great improvements to parallel
computing.&lt;/li&gt;
&lt;/ul&gt;
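&lt;p&gt;On the free-threading point, whether an interpreter is a free-threaded build can be probed at runtime; a small defensive sketch (sys._is_gil_enabled() only exists on CPython 3.13+, hence the getattr guard):&lt;/p&gt;

```python
import sys
import sysconfig

# Py_GIL_DISABLED is set in the build configuration of free-threaded
# CPython builds; it is absent (None) or 0 on standard builds.
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# sys._is_gil_enabled() (CPython 3.13+) reports whether the GIL is
# actually active at runtime; older interpreters lack the attribute.
gil_probe = getattr(sys, "_is_gil_enabled", None)
if gil_probe is None:
    print("standard build (GIL always on)")
else:
    print("free-threaded-aware interpreter; GIL enabled:", gil_probe())
```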
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Exciting times :)&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="python"></category><category term="yearly report"></category></entry><entry><title>Maïc, you lived 100 years, what changed?</title><link href="https://gael-varoquaux.info/personnal/maic-you-lived-100-years-what-changed.html" rel="alternate"></link><published>2025-10-29T00:00:00+01:00</published><updated>2025-10-29T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-10-29:/personnal/maic-you-lived-100-years-what-changed.html</id><summary type="html">&lt;p&gt;At Maïc’s 100th birthday, I asked her “you lived 100 years, what was the most important change for you?”. She mentioned “Internet”. I asked, why was the Internet important to her eyes? Because this is how she kept close contact with her loved ones, sharing travels or discussing everyday …&lt;/p&gt;</summary><content type="html">&lt;p&gt;At Maïc’s 100th birthday, I asked her “you lived 100 years, what was the most important change for you?”. She mentioned “Internet”. I asked, why was the Internet important to her eyes? Because this is how she kept close contact with her loved ones, sharing travels or discussing everyday life on her phone, her tablet…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Born in 1925, she was of a generation sometimes called the silent one. And indeed, she was often low-key. Her father was an administrator in the countryside, and she arrived in Paris in her youth. She studied maths, joining the prestigious “Ecole Normale Supérieure”, which provided her with an income and led her to become a maths teacher. After meeting and marrying &lt;a class="reference external" href="jean-dechoux-june-13rd-1923-feb-9th-2020.html"&gt;Jean Dechoux&lt;/a&gt;, she used her income to fund his medical studies. The story goes that, living in a tiny room, she had to cook on the balcony.&lt;/p&gt;
&lt;p&gt;Maïc was a teacher, one of those unsung heroes who have educated the masses. Nowadays, this is not a job title that draws much acclaim, unlike, say, “start-up founder”. But the only reason we have good computer scientists who create start-ups, the only reason we have researchers to build computer science, is that they had great teachers. Maïc was also a mother, a foster mother, a grandmother, a great-grandmother. She was kind, humble, tireless, always positive. Her life philosophy was focused on doing the best with what she got.&lt;/p&gt;
&lt;p&gt;Maïc never seemed left behind by the transformations of our world. Turning 100 years old, she was as sharp as ever, reading book after book and using her phone, her tablet, her computer. Whenever I hear how technology changes the world, I cannot help thinking of her, a 100-year-old geek. The world went through many transformations during her lifetime. But what she saw in these transformations, in Internet technology, was a way to stay in contact with others, a way to bring more humanity into our lives.&lt;/p&gt;
&lt;img alt="" class="align-right" src="../personnal/attachments/nicole_dechoux.jpg" style="width: 350px;" /&gt;
&lt;br/&gt;&lt;p&gt;&lt;em&gt;Remembering Nicole Dechoux, May 3rd 1925 – October 22nd 2025&lt;/em&gt;&lt;/p&gt;
&lt;br/&gt;
&lt;br/&gt;

&lt;style&gt;
 div.poem p {margin: 0;}
 div.poem div.line-block {clear: unset}
&lt;/style&gt;&lt;div class="poem docutils container"&gt;
&lt;p&gt;Il restera de toi ce que tu as donné&lt;/p&gt;
&lt;p&gt;Au lieu de le garder dans des coffres rouillés…&lt;/p&gt;
&lt;p&gt;Ce que tu as donné en d’autres fleurira…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Il restera de toi ce que tu as offert&lt;/p&gt;
&lt;p&gt;Entre tes bras ouverts un matin au soleil…&lt;/p&gt;
&lt;p&gt;Ce que tu as offert en d’autres revivra…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Il restera de toi un sourire épanoui&lt;/p&gt;
&lt;p&gt;Aux bords de tes lèvres comme au bord de ton cœur…&lt;/p&gt;
&lt;p&gt;Ce que tu as ouvert en d’autres grandira…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Il restera de toi ce que tu as semé&lt;/p&gt;
&lt;p&gt;Que tu as partagé aux mendiants du bonheur…&lt;/p&gt;
&lt;p&gt;Ce que tu as semé en d’autres germera…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Adapted from Simone Weil and Michel Scouarnec&lt;/em&gt;&lt;/p&gt;
</content><category term="personnal"></category><category term="family"></category><category term="people"></category></entry><entry><title>A national recognition; but science and open source are bitter victories</title><link href="https://gael-varoquaux.info/personnal/a-national-recognition-but-science-and-open-source-are-bitter-victories.html" rel="alternate"></link><published>2025-10-10T00:00:00+02:00</published><updated>2025-10-10T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-10-10:/personnal/a-national-recognition-but-science-and-open-source-are-bitter-victories.html</id><summary type="html">&lt;img alt="" class="align-right" src="../personnal/attachments/gael_speech.jpg" style="width: 400px;" /&gt;
&lt;p&gt;I have recently been awarded &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Ordre_national_du_M%C3%A9rite"&gt;France’s national order of merit&lt;/a&gt; for my career in science, in open source, and around AI.&lt;/p&gt;
&lt;p&gt;The speech that I gave carries messages that are important to me (French below; it
flows better).&lt;/p&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#speech-translated-to-english" id="toc-entry-1"&gt;Speech translated to English&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#le-texte-d-origine-en-francais" id="toc-entry-2"&gt;Le texte d’origine, en Français&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;style&gt;
.content p …&lt;/style&gt;</summary><content type="html">&lt;img alt="" class="align-right" src="../personnal/attachments/gael_speech.jpg" style="width: 400px;" /&gt;
&lt;p&gt;I have recently been awarded &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Ordre_national_du_M%C3%A9rite"&gt;France’s national order of merit&lt;/a&gt; for my career in science, in open source, and around AI.&lt;/p&gt;
&lt;p&gt;The speech that I gave carries messages that are important to me (French below; it
flows better).&lt;/p&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#speech-translated-to-english" id="toc-entry-1"&gt;Speech translated to English&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#le-texte-d-origine-en-francais" id="toc-entry-2"&gt;Le texte d’origine, en Français&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;style&gt;
.content p {
    margin: .5ex 0;
}

p.centered-symbol {
    margin: 1ex auto;
    text-align: center;
    font-size: xx-large;
    color: rgb(210, 210, 210);
}
&lt;/style&gt;&lt;div class="section" id="speech-translated-to-english"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Speech translated to English&lt;/a&gt;&lt;/h2&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Receiving such a medal is a powerful symbol. But what battles does it honor?&lt;/p&gt;
&lt;p&gt;My first battle, my first dream, was that of science, with the hope of understanding and improving the world. I probably turned to computers because they were simpler, less frightening, than society.&lt;/p&gt;
&lt;p&gt;This led me to my second battle: the dream of democratizing this science and these digital tools, thanks to open source, also in the hope of making a better world.&lt;/p&gt;
&lt;p&gt;The freedom I enjoyed, an extraordinary privilege of researchers, allowed me to devote my time to these dreams. And many people helped on this journey: my colleagues at Inria and elsewhere, because science is a team sport; free software developers from all over the world; my parents, who gave me a love of science even when I was failing at school.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And two decades later, we have won. Open source is everywhere. Statistical algorithms raise billions of dollars. But what good will this free software, these algorithms, have been if an Elon Musk can buy their vector of action and transform it into a fascist machine? This victory is bitter.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Science and open source play out within a societal context, mediated by norms and means of action. These means of action are rooted in economic rationality, and I find myself, to my great surprise, interested in commercial and financial logics.&lt;/p&gt;
&lt;p&gt;Money is power. It is the ability to build, to buy Twitter or to finance Wikipedia. For science or open source to be successful, we need economic ambitions.&lt;/p&gt;
&lt;p&gt;But I do not want to reduce the world to economic motivations. Science and free software result from the work of individuals who believe in what they are doing. With scikit-learn, as with many other open source projects, humble developers with few resources have created incredible wealth.&lt;/p&gt;
&lt;p&gt;And it is these battles that today’s medal rewards. I have always been wary of individual distinctions. Success is rarely the work of a single person. We need more collective effort and fewer heroes, less ego.&lt;/p&gt;
&lt;p&gt;And yet, I hope that this medal, this symbol, can be useful. Indeed, symbols create the collective narrative, and control the choices we make, individually or as a society. For both science and free software, the risk is to be invisible, unheard, and powerless.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Neither lines of code nor equations will be enough to make a better world. The privilege of a researcher is the independence of thought necessary for the consolidation of knowledge. The unique strength of open source software is to offer independence to the user. Beyond independence, this knowledge and these software tools are only useful if society embraces them. And for that, we must win the battle of the narrative.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Today, I have only one dream: that our children live in the best possible world. Between the global rise of fascism and climate warming, this dream faces many challenges. But we can fight for it. For this, as always, we need to gather people and unite around the right causes. And thus, I thank you all for the support and help you have given me across the years, for today’s recognition.&lt;/p&gt;
&lt;p class="centered-symbol"&gt;✶ ✶ ✶&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="le-texte-d-origine-en-francais"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Le texte d’origine, en Français&lt;/a&gt;&lt;/h2&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Recevoir un tel insigne est un symbole puissant. Mais quels combats décore-t-il?&lt;/p&gt;
&lt;p&gt;Mon premier combat, mon premier rêve a été celui de la science, avec l’espoir de comprendre et d’améliorer le monde. Je me suis probablement tourné vers les ordinateurs car ils étaient plus simples, moins effrayants, que la société.&lt;/p&gt;
&lt;p&gt;Un deuxième combat est né en moi: le rêve de démocratiser cette science et ces outils numériques, grâce au logiciel libre, toujours dans l’espoir de faire un monde meilleur.&lt;/p&gt;
&lt;p&gt;La liberté dont j’ai joui, privilège inouï des chercheurs, m’a permis de me consacrer à ces rêves. Et beaucoup m’ont aidé: mes collègues à Inria et ailleurs, car la science est un sport d’équipe; les développeurs logiciels libres partout dans le monde; mes parents, qui m’ont donné l’amour de la science même lorsque j’étais en échec scolaire.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Et deux décennies plus tard, nous avons gagné. Les logiciels libres sont partout. Les algorithmes statistiques font des levées de fonds de plusieurs milliards. Mais à quoi auront servi ces logiciels libres, ces algorithmes, si un Elon Musk peut racheter leur vecteur d’action et le transformer en machine à fascisme. Cette victoire est amère.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;La science, le logiciel libre, se réalisent dans un contexte sociétal, médié par des normes et des moyens d’actions. Ces moyens d’actions sont ancrés dans le rationnel économique, et je me trouve, à ma grande surprise, à m’intéresser à des logiques commerciales et financières.&lt;/p&gt;
&lt;p&gt;L’argent, c’est le pouvoir. C’est la capacité de réaliser, de racheter twitter ou de financer wikipedia. Pour le succès de la science ou du logiciel libre, nous avons besoin d’une ambition économique.&lt;/p&gt;
&lt;p&gt;Mais je ne voudrais réduire le monde aux motivations économiques. La science et le logiciel libre résultent du travail d’individus qui croient à ce qu’ils font. Avec scikit-learn, comme avec beaucoup d’autres logiciels libres, des développeurs humbles et avec peu de moyens ont créé une richesse incroyable.&lt;/p&gt;
&lt;p&gt;Et c’est ces combats que récompense aujourd’hui l’insigne que je reçois. Je me suis toujours méfié des distinctions individuelles. Un succès est rarement l’œuvre d’un seul. Nous avons besoin de plus de collectif et de moins de héros, de moins d’égo.&lt;/p&gt;
&lt;p&gt;Et pourtant, j’espère que cette médaille, ce symbole, peut être utile. En effet, les symboles créent le récit collectif, et contrôlent les choix que nous faisons, individuellement ou en tant que société. Science comme logiciel libre, le risque est d’être invisibles, inaudibles, et impuissants.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;La ligne de code, ou l’équation, ne suffiront à faire un meilleur monde. Le privilège du chercheur, c’est l’indépendance de pensée nécessaire à la consolidation de la connaissance. L’atout du logiciel libre, c’est d’offrir une indépendance à l’utilisateur. Au-delà de l’indépendance, cette connaissance et ces logiciels ne sont utiles que si la société s’en empare. Et pour cela, il nous faut gagner la bataille du récit.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Aujourd’hui, je n’ai plus qu’un rêve: que nos enfants vivent dans le meilleur monde possible. Entre montée mondiale du fascisme et réchauffement climatique, j’ai la détermination que ce rêve ne soit pas une chimère. Pour ce rêve, il nous faut encore réunir, rassembler, et je vous remercie tous des soutiens et des aides que vous m’avez apportés, de cet honneur que vous me faites aujourd’hui.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../personnal/attachments/gael_knight_monty_python.jpg" style="width: 300px;" /&gt;
&lt;p class="caption"&gt;Technically, I might be a knight now&lt;/p&gt;
&lt;/div&gt;
&lt;p class="centered-symbol"&gt;✶ ✶ ✶&lt;/p&gt;
&lt;/div&gt;
</content><category term="personnal"></category><category term="award"></category><category term="open source"></category><category term="science"></category></entry><entry><title>TabICL: Pretraining the best tabular learner</title><link href="https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html" rel="alternate"></link><published>2025-07-09T00:00:00+02:00</published><updated>2025-07-09T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-07-09:/science/tabicl-pretraining-the-best-tabular-learner.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;TabICL is a state-of-the-art tabular learner &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;. The key is its very rich
prior, that is baked in a pre-trained architecture -a table foundation
model-, and leveraged by in-context-learning. Thanks to clever
choices, it is fast and scalable, efficient even without a GPU.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#recent-progress-in-tabular-learning-in-context-learning" id="toc-entry-1"&gt;Recent progress …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;TabICL is a state-of-the-art tabular learner &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;. The key is its very rich
prior, that is baked in a pre-trained architecture -a table foundation
model-, and leveraged by in-context-learning. Thanks to clever
choices, it is fast and scalable, efficient even without a GPU.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#recent-progress-in-tabular-learning-in-context-learning" id="toc-entry-1"&gt;Recent progress in tabular learning: In-Context Learning&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabular-learning-as-a-completion-problem" id="toc-entry-2"&gt;Tabular learning as a completion problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sophisticated-prior-via-data-generation" id="toc-entry-3"&gt;Sophisticated prior via data generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-improved-architecture" id="toc-entry-4"&gt;TabICL: improved architecture&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-challenge-accounting-for-the-structure-of-tables" id="toc-entry-5"&gt;The challenge: accounting for the structure of tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tabicl-s-solution" id="toc-entry-6"&gt;TabICL’s solution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-result-a-powerful-and-easy-to-use-tabular-learner" id="toc-entry-7"&gt;The result: a powerful and easy to use tabular learner&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;This note is about the research behind TabICL &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;, work by Jingang Qu, David
Holzmüller, myself, and Marine Le Morvan, published at ICML 2025, and
available as &lt;a class="reference external" href="https://tabicl.readthedocs.io/en/latest/"&gt;open-source software&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="recent-progress-in-tabular-learning-in-context-learning"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Recent progress in tabular learning: In-Context Learning&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Describing the statistical structure of tables in general is very subtle.
They do have some unique statistical features. For instance, each column
is typically meaningful by itself, more meaningful than linear
combinations of columns (the data are &lt;em&gt;not rotationally invariant&lt;/em&gt;, cf
&lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;[Grinsztajn et al, 2022]&lt;/a&gt;).
For a long time, tree-based models, in particular gradient-boosted trees, were
the models that best captured this statistical structure.&lt;/p&gt;
&lt;p&gt;The central question is: &lt;strong&gt;how do we build complex and rich inductive biases
into statistical models&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;A pioneering contribution to this question was made with the TabPFN
approach &lt;a class="reference external" href="https://www.nature.com/articles/s41586-024-08328-6"&gt;[Hollmann et al, 2025]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="tabular-learning-as-a-completion-problem"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Tabular learning as a completion problem&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/table_in_context_learning.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;Prediction by table completion using across-row transformers&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The key idea behind this line of work is that tabular learning can be
seen as completing a table in which one column has a missing entry.
Transformer-based large language models are very good at completing
sequences, in particular in the few-shot regime. Hence the idea of using a
transformer architecture for this table-completion task.&lt;/p&gt;
&lt;p&gt;More specifically, this is a &lt;em&gt;meta-learning&lt;/em&gt; setting (learning to learn),
using transformers.&lt;/p&gt;
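As a rough illustration of this completion framing (a sketch of the data layout only, not the actual TabICL code), one can stack the labelled training rows with a query row whose label is masked:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(8, 3))             # 8 labelled rows, 3 columns
y_train = (X_train[:, 0] > 0).astype(float)   # a toy binary label
x_query = rng.normal(size=(1, 3))             # the row to predict

# Stack features and labels into one table; the query row's label is
# missing, and the model's job is to complete this bottom-right entry.
context = np.column_stack([X_train, y_train])
query = np.column_stack([x_query, [[np.nan]]])
table = np.vstack([context, query])

print(table.shape)  # (9, 4)
```

The in-context learner then predicts the missing entry from the rest of the table, with no gradient step at prediction time.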
&lt;/div&gt;
&lt;div class="section" id="sophisticated-prior-via-data-generation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Sophisticated prior via data generation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Teaching transformers to predict well requires showing them a great many
prediction problems.&lt;/p&gt;
&lt;p&gt;The benefit of this approach is that these prediction problems can be
chosen to reflect the downstream task well. In particular, it becomes
easy to bake in any form of inductive bias by simulating data.&lt;/p&gt;
&lt;p&gt;TabPFN simulates data by cascading series of simple transformations, each
combining very few columns. The actual data-generating processes are
more subtle, but the idea is that they produce plausible data tables.&lt;/p&gt;
&lt;p&gt;Experience (ours and others’) shows that pretraining on a quality
data-generation process is crucial to produce a good tabular learner,
much like for foundation models in other settings.&lt;/p&gt;
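A toy generator in this spirit (purely illustrative; the actual priors are considerably richer) cascades simple nonlinear transformations of previously generated columns:

```python
import numpy as np

def sample_table(n_rows=100, n_cols=5, seed=0):
    """Toy synthetic-table generator: each new column is a simple
    nonlinear function of one or two earlier columns, plus noise."""
    rng = np.random.default_rng(seed)
    cols = [rng.normal(size=n_rows)]
    ops = [np.tanh, np.sin, np.abs]
    for _ in range(n_cols - 1):
        parents = rng.choice(len(cols), size=rng.integers(1, 3))
        mix = sum(cols[p] for p in parents) + 0.1 * rng.normal(size=n_rows)
        cols.append(ops[rng.integers(len(ops))](mix))
    return np.column_stack(cols)

X = sample_table()
print(X.shape)  # (100, 5)
```

Each sampled table becomes one prediction problem in the pretraining stream; the quality of this stream is what shapes the model's inductive bias.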
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="tabicl-improved-architecture"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;TabICL: improved architecture&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="the-challenge-accounting-for-the-structure-of-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;The challenge: accounting for the structure of tables&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabpfn_architecture.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;Tables are 2D objects, and the TabPFNv2 architecture alternates
attentions across row and across columns&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In practice, a table is not a 1D structure like a sentence. It is closer
to a 2D structure, with rows and columns. A good architecture must
account for this structure, and the TabPFNv2 architecture uses
transformers with alternating across-row and across-column attention.&lt;/p&gt;
&lt;p&gt;One problem is the computational complexity: attention is quadratic in
the number of entries, and the alternating attention of TabPFNv2 leads
to a cost in &lt;em&gt;O(n p² + p n²)&lt;/em&gt; for a table with &lt;em&gt;n&lt;/em&gt; rows and &lt;em&gt;p&lt;/em&gt; columns.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="tabicl-s-solution"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;TabICL’s solution&lt;/a&gt;&lt;/h3&gt;
&lt;div class="section" id="row-wise-encoding"&gt;
&lt;h4&gt;Row-wise encoding&lt;/h4&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_architecture.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;To break the quadratic cost, TabICL first encodes the rows to a
smaller, fixed-sized, represention, before performing across-row
in-context learning.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;For more scalability and better inductive bias, our model, TabICL, first
embeds the rows (using a first transformer) and then does in-context
learning across rows (with a second transformer). The resulting
computational complexity is &lt;em&gt;O(n p² + n²)&lt;/em&gt;, which is more scalable,
though still quadratic in &lt;em&gt;n&lt;/em&gt; and &lt;em&gt;p&lt;/em&gt;.&lt;/p&gt;
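A back-of-the-envelope comparison of the two costs (constants ignored) shows why dropping the p n² term matters for tall tables:

```python
# Attention-cost scalings quoted above, up to constants (illustrative).
def cost_tabpfnv2(n, p):
    return n * p**2 + p * n**2   # alternating row/column attention

def cost_tabicl(n, p):
    return n * p**2 + n**2       # rows encoded first, then one across-row pass

n, p = 100_000, 50               # a tall table
print(cost_tabpfnv2(n, p) / cost_tabicl(n, p))  # roughly 49x cheaper
```

For wide tables (large p, small n) the two scalings are much closer; the gain is driven by the number of rows.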
&lt;p&gt;Scalability is important because it enables us to pretrain TabICL on both
small &lt;em&gt;and&lt;/em&gt; large datasets, and as a consequence TabICL is a good
predictor for large datasets.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="column-specific-embeddings"&gt;
&lt;h4&gt;Column-specific embeddings&lt;/h4&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_embeddings.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;To apply different transformations on columns depending on their
statistical properties, TabICL builds positional embeddings for
columns that capture aspects of their distribution.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Another important innovation of TabICL is that it feeds the table entries
into the transformer with column-specific embeddings. These column
embeddings are computed as a function of the distribution of the column.
For this, we use a set transformer, a scalable transformer-like way
of building a function on sets, without the quadratic complexity.&lt;/p&gt;
&lt;p&gt;After pretraining, we find that the column embeddings have learned a
mapping that implicitly captures statistical aspects of the data
distribution in the column, such as kurtosis or skewness.&lt;/p&gt;
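The embeddings themselves are learned end-to-end, but for intuition about the kind of distributional information involved, here are classic moment statistics computed per column (my own illustration, not TabICL code):

```python
import numpy as np

def column_moments(X):
    """Per-column skewness and excess kurtosis from standardized values."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    skew = (Z**3).mean(axis=0)
    kurt = (Z**4).mean(axis=0) - 3.0
    return skew, kurt

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=1000),        # symmetric column
                     rng.exponential(size=1000)])  # right-skewed column
skew, kurt = column_moments(X)
print(skew)  # the exponential column shows clearly positive skewness
```

Two columns with the same values but different distributions thus get different embeddings, letting the transformer treat them differently.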
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-result-a-powerful-and-easy-to-use-tabular-learner"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;The result: a powerful and easy to use tabular learner&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;After a lot of pretraining on synthetic data, TabICL is a
state-of-the-art tabular learner. Pretraining gave it the right inductive
bias, as visible from the classifier-comparison plot below:&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_comparison.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;A classic classification comparison plot that shows the decision
boundaries on very simple toy data. It is useful to get a feeling of
how classifiers behave.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is interesting to see that while TabICL forms very flexible decision
boundaries, they do extend along the horizontal and vertical axes, as do
the decision tree and the random forest. These axis-aligned features are a
very important aspect of the inductive bias.&lt;/p&gt;
&lt;p&gt;At the end of the day, TabICL is an excellent tabular learner, as visible
on benchmarks:&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/result_comparison.png" /&gt;
&lt;p class="caption"&gt;TabICL is a great predictor: Comparison of many predictors.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabarena.png" /&gt;
&lt;p class="caption"&gt;Experimental results, from a benchmark paper independent of the TabICL
paper: TabArena &lt;a class="reference external" href="https://arxiv.org/abs/2506.16791"&gt;[Erickson et al, 2025]&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The benefit of TabICL over TabPFNv2 becomes more marked for larger datasets:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="../science/attachments/tabicl/tabicl_scale_bench.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;Rank (lower is best) as a function of dataset size.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;However, one limitation to keep in mind is that with in-context learners,
such as TabICL or TabPFN, inference (prediction on new data points) can be
costly.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;All in all, TabICL is an excellent tabular predictor, and a push forward
for tabular foundation models. From a fundamental standpoint, it shows
that in-context learning is not only for few-shot learning: it can be
very beneficial for sample sizes as large as &lt;em&gt;n = 100,000&lt;/em&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;More about TabICL&lt;/p&gt;
&lt;p&gt;There is a lot more to TabICL: the details of pretraining are crucial,
and the implementation uses memory offloading, facilitated by an
architecture that dissociates the training data from the test data for
most of the operations. To learn more about TabICL:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The paper: &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;https://arxiv.org/abs/2502.05564&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The GitHub code: &lt;strong&gt;TabICL is 100% open source&lt;/strong&gt;
&lt;a class="reference external" href="https://github.com/soda-inria/tabicl"&gt;https://github.com/soda-inria/tabicl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Install the Python package: TabICL is just one pip install away
&lt;a class="reference external" href="https://pypi.org/project/tabicl/"&gt;https://pypi.org/project/tabicl/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Other topics in table foundation models: leveraging strings&lt;/p&gt;
&lt;p&gt;TabICL is only one aspect of table foundation models. We are also pursuing
another line of research that focuses on using strings (in
entries and column names) to bring knowledge about the real world into
table foundation models; see &lt;a class="reference external" href="carte-toward-table-foundation-models.html"&gt;CARTE&lt;/a&gt; and, more recently, &lt;a class="reference external" href="https://arxiv.org/abs/2505.14415"&gt;[Kim
et al, 2025]&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="machine learning"></category><category term="tabular learning"></category><category term="foundation models"></category></entry><entry><title>AI agents that use tools</title><link href="https://gael-varoquaux.info/science/ai-agents-that-use-tools.html" rel="alternate"></link><published>2025-07-04T00:00:00+02:00</published><updated>2025-07-04T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-07-04:/science/ai-agents-that-use-tools.html</id><summary type="html">&lt;img alt="Image generated with ChatGPT, with the prompt &amp;quot;Please generate an image of an AI using a mechanical tool, such as a wrench. Please make the robot look rather friendly. Also, please make the image square&amp;quot;" class="small align-right" src="../science/attachments/robot_tool_friendly.png" /&gt;
&lt;p&gt;Modern AIs acquire new capabilities by combining tools to perform
complex tasks, controlling them as an agent. Unlike in traditional
programming, they define the sequence of actions themselves.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/des-agents-ia-qui-utilisent-des-outils-2163252"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Modern AIs are increasingly using …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with ChatGPT, with the prompt &amp;quot;Please generate an image of an AI using a mechanical tool, such as a wrench. Please make the robot look rather friendly. Also, please make the image square&amp;quot;" class="small align-right" src="../science/attachments/robot_tool_friendly.png" /&gt;
&lt;p&gt;Modern AIs acquire new capabilities by combining tools to perform
complex tasks, controlling them as an agent. Unlike in traditional
programming, they define the sequence of actions themselves.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/des-agents-ia-qui-utilisent-des-outils-2163252"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Modern AIs are increasingly using tools. For example, if you ask a
conversational AI to solve a complicated equation, the AI alone cannot do
it. This is not surprising: there is no general mathematical formula. But
if this AI knows how to use numerical equation-solving routines, it
quickly gives us the answer. For example, “Le Chat” from Mistral
generates a small program that uses the “Python” language and its
numerical routines to solve our problem. The difficulty here is to
generate the program that calls the right routines. This ability is an
extension of conversational AI models that know how to answer questions
by generating text. Here, the text is computer code and not English.&lt;/p&gt;
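For instance, here is the kind of small Python program such an AI might generate to solve an equation with no closed-form solution, x = cos(x), by simple bisection:

```python
import math

def f(x):
    # A root of f is a solution of x = cos(x)
    return x - math.cos(x)

lo, hi = 0.0, 1.0   # f(0) is negative and f(1) is positive: a root lies between
for _ in range(60):
    mid = (lo + hi) / 2
    if f(lo) * f(mid) > 0:
        lo = mid     # sign change is in the upper half
    else:
        hi = mid     # sign change is in the lower half

print(round((lo + hi) / 2, 6))  # 0.739085
```

The hard part for the AI is not the arithmetic, which the routine handles, but deciding which routine to call and how.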
&lt;p&gt;By controlling the computer, the AI “acts”. That’s why it is said to be
an “agent”. By coupling with other systems, agentic AIs develop new
capabilities. The most powerful ones can then combine different tools by
leveraging their complementarities. These agent systems are currently
progressing very quickly, but they remind us of what we have always done
in computer science: any complicated system is assembled from multiple
routines, each with a specific functionality. Writing a computer program
is precisely describing how we are going to call these routines to solve
a problem. Yet, until the recent advances in AI, we had to specify all
the steps ourselves, whereas agentic AIs take a given goal and produce
these steps on their own. The difficulty then becomes breaking down a
task into sub-tasks, which is called planning, a hard problem.&lt;/p&gt;
&lt;p&gt;In modern AIs, these planning skills are learned. The systems improve
through trial and error: we give the AI lots of tasks to solve and the AI
tries sequences of sub-tasks, deciding to use one tool or another. If it
succeeds in the final task, it learns that the sequence of tool use was a
good sequence for the task. This is called reinforcement learning; its
main inventors received the Turing Award (the Nobel Prize of computer
science) this year.&lt;/p&gt;
&lt;p&gt;Another major driver of progress for agentic AIs is the powerful
analogy-making and associative memory of language models. These language
skills enable them to start from problems specified by the user in plain
English, with an open vocabulary. They draw their tool-use strategies
from a broad knowledge of similar problems, but also know how to adapt
these strategies to the intermediate responses of the tools. They can
also interact with systems that are much more complex and indeterminate
than computer routines. For example, an AI can go and fetch information
on the internet, or even ask a human.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Agentic AIs open new perspectives. But they also greatly increase
computing costs, as they iterate over sub-tasks. These costs must be
kept in mind: they are an important hurdle to the democratization of AI.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>AIs that break down questions reason better</title><link href="https://gael-varoquaux.info/science/ais-that-break-down-questions-reason-better.html" rel="alternate"></link><published>2025-06-20T00:00:00+02:00</published><updated>2025-06-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-06-20:/science/ais-that-break-down-questions-reason-better.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an AI that is thinking deeply. Philosophical references may be welcomed, for instance like the classic hamlet holding skull cliché.&amp;quot;" class="small align-right" src="../science/attachments/ai_thinking.jpg" /&gt;
&lt;p&gt;The key to the most powerful conversational AIs is to reason by breaking
down a complex task into simpler subproblems. Why is this crucial, and
how does it work?&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-qui-decomposent-les-questions-raisonnent-mieux-2151428"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The recent release of …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an AI that is thinking deeply. Philosophical references may be welcomed, for instance like the classic hamlet holding skull cliché.&amp;quot;" class="small align-right" src="../science/attachments/ai_thinking.jpg" /&gt;
&lt;p&gt;The key to the most powerful conversational AIs is to reason by breaking
down a complex task into simpler subproblems. Why is this crucial, and
how does it work?&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-qui-decomposent-les-questions-raisonnent-mieux-2151428"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The recent release of the conversational AI “DeepSeek R1” shook the
financial markets because it showed a significant reduction in the costs
of reasoning models. But what are these reasoning models?&lt;/p&gt;
&lt;p&gt;To understand the challenges of reasoning in conversational AIs, we can
ask them to solve riddles. I tried various logical riddles on different
AIs, such as the puzzle where a man has to get a fox, a chicken, and a
sack of corn across a river without one eating the other. The AI responds
brilliantly. But how can we ensure that the AI is truly reasoning and not
just reciting answers it has seen before? By replacing them with an
equivalent trio (wolf, lamb, and hay), the AI does just as well. But
it could have solved the problem by analogy with the previous classic
one, rather than with reasoning. Indeed, language models are very good at
analogies. A conversational AI typically works by proposing an answer
inspired by the flow of words (and corresponding concepts) in the texts
on which it was trained.&lt;/p&gt;
&lt;p&gt;If, instead of a riddle resembling a story, we try to play tic-tac-toe,
the weaknesses appear. Most conversational AIs are very bad at
tic-tac-toe, even going so far as to declare victory when faced with a
defeat. Perhaps this is because analogy is not as useful. But activating
the “reasoning” option makes them unbeatable. What is behind this option?&lt;/p&gt;
&lt;p&gt;A third task helps to understand the reasoning mechanisms of a
conversational AI: let’s ask it how many “L”s there are in
“LOLLAPALOUZA”. There is a catch: ChatGPT was able to give me the correct
answer for the number of Ls in “LOLLAPALOOZA”, a question often used in the past
to show its limits. For “LOLLAPALOUZA”, it fails. Or rather, it needs
help: if we tell it to spell out the word, then count the “L”s, it gives
the correct answer. With the right intermediate steps, a problem is often
much simpler. These decompositions into subproblems are called chains of
thought in conversational AIs. The “reasoning” option of some AIs
generates such chains.&lt;/p&gt;
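&lt;p&gt;The decomposition that helps the AI is the same one a short program would use. As an illustration (plain Python, not how a language model works internally), spelling the word out into letters and then counting makes the problem trivial:&lt;/p&gt;

```python
# Illustration only: the two-step decomposition (spell out, then count),
# written as plain Python rather than as a chain of thought.
word = "LOLLAPALOUZA"

# Step 1: spell the word out, one letter at a time.
letters = list(word)

# Step 2: count the occurrences of "L" among the letters.
count = sum(1 for letter in letters if letter == "L")
```

&lt;p&gt;Each step is easy on its own; the difficulty of the original question lay in not breaking it down.&lt;/p&gt;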
&lt;p&gt;DeepSeek R1 received much attention due to its excellence at breaking
down problems to reason in this way. To do this, it has been trained to
generate reasoning patterns from questions, using reinforcement learning:
through trial and error, on many problems generated together with their
answers, like math problems. Faced with a task, the AI still proceeds by
analogy with the tasks it has seen during this learning phase, but it
uses this analogy to sketch a battle plan, rather than a response. Each
subproblem is then easier, and the AI can tackle it by analogy to
problems already seen. By observing the chains of thought, we can even
see the AI verifying its intermediate results. These chains of thought
are not always visible, but we can guess them from the AI’s response
time.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With these reasoning mechanisms, a conversational AI is as good as I am
at tic-tac-toe. But using such a model to play tic-tac-toe is like using
a sledgehammer to crush a fly: it is very inefficient in computational
cost compared to a specialized program for tic-tac-toe, which we have
known how to do for decades.&lt;/p&gt;
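&lt;p&gt;Such specialized programs are indeed decades old. A minimal sketch (assuming a 3x3 board encoded as a list of 9 cells) of the classic minimax algorithm, which plays tic-tac-toe perfectly at a tiny fraction of the cost of a language model:&lt;/p&gt;

```python
# A minimal tic-tac-toe solver using minimax -- a sketch of the kind of
# decades-old specialized program mentioned in the text.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if one side has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, move): "X" maximizes the score, "O" minimizes it."""
    w = winner(board)
    if w == "X":
        return 1, None
    if w == "O":
        return -1, None
    if " " not in board:
        return 0, None  # draw
    outcomes = []
    for i, cell in enumerate(board):
        if cell == " ":
            board[i] = player  # try the move...
            score, _ = minimax(board, "O" if player == "X" else "X")
            board[i] = " "     # ...and undo it
            outcomes.append((score, i))
    return (max if player == "X" else min)(outcomes)
```

&lt;p&gt;On a board where “X” already holds two cells of the top row, the search returns the winning third cell: the answer is exhaustively checked, not guessed by analogy.&lt;/p&gt;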
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>Science must drive the narratives that shape society</title><link href="https://gael-varoquaux.info/science/science-must-drive-the-narratives-that-shape-society.html" rel="alternate"></link><published>2025-03-01T00:00:00+01:00</published><updated>2025-03-01T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-03-01:/science/science-must-drive-the-narratives-that-shape-society.html</id><summary type="html">&lt;img alt="A picture of me giving this speech" class="small align-right" src="../science/attachments/louvain_gael_dhc.jpg" /&gt;
&lt;p&gt;I would like to take a brief moment to reflect on what drives me as an
academic.&lt;/p&gt;
&lt;p&gt;Academia’s roots are in creating knowledge and sharing it. We, academics,
have a role to play in shaping society. In computer science, we sometimes
focus on the creation of technology. Here, creation …&lt;/p&gt;</summary><content type="html">&lt;img alt="A picture of me giving this speech" class="small align-right" src="../science/attachments/louvain_gael_dhc.jpg" /&gt;
&lt;p&gt;I would like to take a brief moment to reflect on what drives me as an
academic.&lt;/p&gt;
&lt;p&gt;Academia’s roots are in creating knowledge and sharing it. We, academics,
have a role to play in shaping society. In computer science, we sometimes
focus on the creation of technology. Here, creation of open technology is
central to knowledge consolidation in computer science, because open
technology can be studied, because open technology can be shared.
But academia’s role in society is more than technology, even open technology.&lt;/p&gt;
&lt;p&gt;Academia’s position in consolidating knowledge implies that it is trusted
with responsibilities in shaping the narrative, for instance that of
technology. An important narrative today is that of artificial
intelligence, a new industrial revolution, they say. Our role here is to
do a sober assessment, inventing the future of technology, but without
false promises and blind spots. This work, as all broad scientific work,
requires working across disciplines.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;The above text is extracted from my acceptance speech when receiving
UC Louvain’s  Doctor Honoris Causa.&lt;/p&gt;
&lt;p class="last"&gt;As stated in my full speech, I am incredibly greatful for this honor. I
deeply thank all those that have been part of my scientific and
technical adventures. They were all built through team works, with
many amazing people, from all horizons, young and older, famous or
invisible. Working together is what moves mountains.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="society"></category><category term="AI"></category><category term="award"></category></entry><entry><title>AI super-intelligent to play Go, and math?</title><link href="https://gael-varoquaux.info/science/ai-super-intelligent-to-play-go-and-math.html" rel="alternate"></link><published>2025-02-19T00:00:00+01:00</published><updated>2025-02-19T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-02-19:/science/ai-super-intelligent-to-play-go-and-math.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an artificial intelligences playing go, with mathematical formula flying in the background. The mathematical formula are flying in all directions, and the image is futuristic.&amp;quot;" class="small align-right" src="../science/attachments/robots_playing_go.jpg" /&gt;
&lt;p&gt;Since 2017, an AI has been defeating the best Go experts, despite the game being particularly challenging. Such “super intelligence” is rare, but it could also emerge in fundamental mathematics.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/lia-le-go-et-les-maths-2140332"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="imitation-is-not-creation"&gt;
&lt;h2&gt;Imitation is not …&lt;/h2&gt;&lt;/div&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate an image of an artificial intelligences playing go, with mathematical formula flying in the background. The mathematical formula are flying in all directions, and the image is futuristic.&amp;quot;" class="small align-right" src="../science/attachments/robots_playing_go.jpg" /&gt;
&lt;p&gt;Since 2017, an AI has been defeating the best Go experts, despite the game being particularly challenging. Such “super intelligence” is rare, but it could also emerge in fundamental mathematics.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/lia-le-go-et-les-maths-2140332"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="imitation-is-not-creation"&gt;
&lt;h2&gt;Imitation is not creation&lt;/h2&gt;
&lt;p&gt;For several decades, calculators have been better than humans at an
intellectual task: mental arithmetic. Yet, we do not call this
“super-intelligence.” Probably because it is humans who specified all the
rules for these calculations to the machine. Similarly, a computer has a
superhuman ability to memorize information exactly, such as numbers, but
we do not consider it super-intelligent for that reason. Perhaps this is
because it does not teach us anything new. However, in 2017, an AI
started teaching the best Go players moves and strategies that no one had
ever known. How is this possible? Will AI surpass its creator and become
super-intelligent in all fields?&lt;/p&gt;
&lt;p&gt;Most recent breakthroughs in AI rely on learning methods where the
computer imitates humans. For example, to create computer-vision systems,
we provide the computer with many annotated images describing what they
represent. Likewise, conversational AIs learn by training to complete
examples of text. Under these conditions, it is difficult for AI to
surpass its creator.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="when-ais-invent"&gt;
&lt;h2&gt;When AIs invent&lt;/h2&gt;
&lt;p&gt;But AlphaZero, the AI champion in Go, operates on a different principle:
reinforcement learning. Here, the AI takes actions –moves in the game of
Go– and receives a “reward” if it wins the game. Through countless games,
it optimizes its strategies to maximize rewards, including exploring new
strategies. AlphaZero trained by playing tens of millions of games
against itself. This is how the AI was able to create new strategies,
unrestricted by human knowledge.&lt;/p&gt;
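&lt;p&gt;At a much smaller scale than AlphaZero, the core loop of reinforcement learning (try actions, observe rewards, shift toward what pays off) can be sketched in a few lines. This is a toy “bandit” example with invented rewards, not AlphaZero’s actual algorithm:&lt;/p&gt;

```python
import random

# Toy reinforcement learning: the agent does not know the rewards and
# discovers the best action purely by trial and error.
random.seed(0)
true_reward = {"a": 0.2, "b": 0.8, "c": 0.5}  # hidden from the agent
estimates = {action: 0.0 for action in true_reward}
counts = {action: 0 for action in true_reward}

for _ in range(1000):
    # epsilon-greedy: mostly exploit the current best estimate,
    # but explore a random action 10% of the time
    if 0.1 > random.random():
        action = random.choice(list(true_reward))
    else:
        action = max(estimates, key=estimates.get)
    reward = true_reward[action]  # simplified: deterministic reward
    counts[action] += 1
    # running average of the rewards observed for this action
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = max(estimates, key=estimates.get)
```

&lt;p&gt;After enough trials, the agent settles on the best action without anyone spelling out the rules: the same principle, scaled up enormously, that let AlphaZero invent new strategies.&lt;/p&gt;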
&lt;p&gt;Such learning, based on millions of trial-and-error attempts, does not
apply to all problems –it requires the ability to perform rapid
experiments, like in a computer game, which remains the only domain where
a true super-intelligence has been achieved. However, there is hope in
mathematics, another intellectual game.&lt;/p&gt;
&lt;p&gt;Indeed, progress in generative AI for language –which powers tools such as
ChatGPT– can be applied to mathematical proofs, which consist of
sequences of symbols. Trained on numerous proofs, an AI can learn to
complete partial proofs. However, such a generative AI will produce
sequences without guarantees of mathematical validity. Another tool,
using proof-verification techniques based on symbolic AI, can then keep
only the correct sequences, giving a “reward” signal. Reinforcement
learning finally comes in, using its exploration schemes to maximize this
reward and discover new valid proof steps.&lt;/p&gt;
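&lt;p&gt;The generate-then-verify loop described above can be sketched with a deliberately trivial stand-in for mathematics: checking additions rather than real proof steps. The names and numbers are illustrative, not AlphaProof’s machinery:&lt;/p&gt;

```python
import random

random.seed(0)

def generate_candidate():
    """A 'generative' step: propose a claim with no validity guarantee."""
    a, b = random.randint(0, 9), random.randint(0, 9)
    claimed_sum = random.choice([a + b, a + b + 1])  # sometimes wrong
    return a, b, claimed_sum

def verify(a, b, claimed_sum):
    """A symbolic checker: exact, and the source of the reward signal."""
    return a + b == claimed_sum

# Keep only the candidates the verifier accepts; in reinforcement
# learning, acceptance would be fed back as a reward to the generator.
accepted = [cand for cand in (generate_candidate() for _ in range(100))
            if verify(*cand)]
```

&lt;p&gt;The generator alone offers no guarantees; the verifier alone invents nothing; together they produce only valid steps to learn from.&lt;/p&gt;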
&lt;p&gt;This is how, in July 2024, the AlphaProof AI won a silver medal at the
International Mathematical Olympiad. Further progress may eventually lead
to “super-intelligence” in mathematics. However, we are still far from
general super-intelligence, as, both in Go and mathematics, progress is
made possible by the ease of verifying whether one has “won” or not.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>AI for health: the impossible necessity of unbiased data</title><link href="https://gael-varoquaux.info/science/ai-for-health-the-impossible-necessity-of-unbiased-data.html" rel="alternate"></link><published>2025-02-13T00:00:00+01:00</published><updated>2025-02-13T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-02-13:/science/ai-for-health-the-impossible-necessity-of-unbiased-data.html</id><summary type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate a fairly abstract image of biased data. The image is about data. It should have numbers, streams of numbers. It should express the notion of bias, showing a black woman in the middle of the stream of numbers.&amp;quot;" class="small align-right" src="../science/attachments/biased_data.jpg" /&gt;
&lt;p&gt;Is unbiased data important to build health AI? Yes!&lt;/p&gt;
&lt;p&gt;Can there be unbiased data? No!&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Building health on biased data discriminates&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The notion of bias depends on the intended use.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;In medicine, we have seen the importance of tuning devices and decisions
for the target population. The problem is not …&lt;/p&gt;</summary><content type="html">&lt;img alt="Image generated with &amp;quot;LeChat&amp;quot;, with the prompt &amp;quot;Please generate a fairly abstract image of biased data. The image is about data. It should have numbers, streams of numbers. It should express the notion of bias, showing a black woman in the middle of the stream of numbers.&amp;quot;" class="small align-right" src="../science/attachments/biased_data.jpg" /&gt;
&lt;p&gt;Is unbiased data important to build health AI? Yes!&lt;/p&gt;
&lt;p&gt;Can there be unbiased data? No!&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Building health on biased data discriminates&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The notion of bias depends on the intended use.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;In medicine, we have seen the importance of tuning devices and decisions
for the target population. The problem is not limited to AI: pulse
oximeters, which measure oxygen saturation, do not work well on dark
skin; cardiac procedures were adjusted to the symptoms and anatomy of men,
while those of women differ. These issues arose because the corresponding
groups were underrepresented in the clinical studies.&lt;/p&gt;
&lt;p&gt;So when we build AI, we need to make sure that they are not trained on
biased data.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Beyond population sampling, historical choices also bias&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;But unbiased data is hard, as it goes beyond sampling the right
population of individuals. Indeed, the data we have is the result of a
historical set of choices: Who do we measure? Which measurements? And
what led to their condition? Beyond health, consider for instance
salaries: we can train a model from historical data to tell us what
should be the right compensation for a given individual. But it is just
going to capture and repeat historical biases, such as paying women
less than their equally qualified male counterparts.&lt;/p&gt;
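&lt;p&gt;A toy sketch makes this concrete (the figures are made up for illustration): if historical records underpay one group, a model that simply reproduces historical patterns, here the group average, inherits the bias:&lt;/p&gt;

```python
# Synthetic, deliberately biased "historical" salary records:
# equally qualified people, but one group was historically underpaid.
records = [("men", 100), ("men", 102), ("women", 85), ("women", 87)]

# The simplest possible "model": predict the historical group average.
totals = {}
for group, salary in records:
    totals.setdefault(group, []).append(salary)
predicted = {group: sum(vals) / len(vals) for group, vals in totals.items()}
# predicted now recommends lower pay for women, repeating the bias
```

&lt;p&gt;Nothing in the fitting step is “wrong”; the bias comes entirely from the data it was given.&lt;/p&gt;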
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The notion of being unbiased embeds societal and ethical values&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here we see that the notion of being unbiased embeds societal and ethical
values: Should Olympic-level gymnasts and football players be paid the
same? How about men and women with the same job description?&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And now, to go back to medicine, there is another critical aspect: that
of cause and effect, which is central to making decisions. To
take a simple example, if we compared the health outcomes of
individuals after two days at the hospital to those of individuals who
did not go to the hospital, we would conclude, incorrectly, that a
hospital is a very dangerous place, as individuals there are in worse
shape. The problem is, of course, that we are comparing individuals who
are not comparable, as they have a different baseline health. A health
intervention is given for a reason, so it is given to a specific
population: insulin is given to diabetics. Building a model, an AI, that
can decide on health interventions requires compensating for the
difference between the treated and non-treated individuals.&lt;/p&gt;
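&lt;p&gt;A small synthetic simulation (with invented numbers) shows the trap: if sicker people are the ones admitted, the naive comparison blames the hospital even when the hospital genuinely helps:&lt;/p&gt;

```python
import random

random.seed(0)
outcomes_hospital, outcomes_home = [], []
for _ in range(10_000):
    baseline = random.gauss(0, 1)        # health before any decision
    goes_to_hospital = -0.5 > baseline   # sicker people get admitted
    true_benefit = 0.3 if goes_to_hospital else 0.0  # the hospital helps
    outcome = baseline + true_benefit
    if goes_to_hospital:
        outcomes_hospital.append(outcome)
    else:
        outcomes_home.append(outcome)

def mean(values):
    return sum(values) / len(values)

# Naive comparison: hospital patients look worse off, even though the
# simulated hospital improved every patient it treated by 0.3.
naive_gap = mean(outcomes_hospital) - mean(outcomes_home)
```

&lt;p&gt;The naive gap comes out strongly negative while the true effect is positive: the comparison measures baseline health, not the intervention.&lt;/p&gt;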
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Reference: causality&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;a class="reference external" href="https://hal.science/hal-04774700/"&gt;A 15-page introduction to causal inference with machine
learning&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;AIs can make good decisions only from adequate data&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here also we have a case of bias. The bias is with regard to the data
required to answer the question of the intervention’s effect, which
calls for comparable treated and non-treated populations. More generally,
we are seeing once again that the data are always the result of a
historical set of choices, and these choices condition the statistical
relationships in the data. And AIs build on these statistical
relationships.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The notion of bias depends on the intended use&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;What we see here is that the notion of bias depends on the intended use: it depends on the target population, but also on the target intervention. So there really is no absolute notion of unbiased data. There is just the notion of data that are well suited to a particular goal.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="../science/attachments/lady_justice_robot.png" /&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;This post was consolidated from notes of a panel on health AI at the
AI Action Summit, but it is linked to my &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;AI chronicles&lt;/a&gt;, big-picture
didactic pieces on AI and related topics.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="society"></category><category term="health"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>2024 highlights: of computer science and society</title><link href="https://gael-varoquaux.info/science/2024-highlights-of-computer-science-and-society.html" rel="alternate"></link><published>2025-01-01T00:00:00+01:00</published><updated>2025-01-01T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2025-01-01:/science/2024-highlights-of-computer-science-and-society.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;For me, 2024 was full of back and forth between research,
software, and connecting these to society. Here, I lay out some
highlights on AI and society, as well as research and software, around
tabular AI and language models.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2025 starts, I’m looking back on 2024. It …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;For me, 2024 was full of back and forth between research,
software, and connecting these to society. Here, I lay out some
highlights on AI and society, as well as research and software, around
tabular AI and language models.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2025 starts, I’m looking back on 2024. It was an interesting
professional year, as the research in the &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;soda team&lt;/a&gt; on machine learning for health and
social science nourished reflection on society.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#thoughts-from-the-national-ai-committee" id="toc-entry-1"&gt;Thoughts from the national AI committee&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#adventures-in-software-land" id="toc-entry-2"&gt;Adventures in software land&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-to-supercharge-scikit-learn" id="toc-entry-3"&gt;probabl to supercharge scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#skrub-machine-learning-on-tables-made-easy" id="toc-entry-4"&gt;Skrub: machine learning on tables made easy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#research-better-ai-tools-more-understanding" id="toc-entry-5"&gt;Research: better AI tools, more understanding&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#table-foundation-models" id="toc-entry-6"&gt;Table foundation models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#disparities-of-confidence-of-large-language-models" id="toc-entry-7"&gt;Disparities of confidence of large language models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-straggler-consistency-of-supervised-learning-with-missing-values" id="toc-entry-8"&gt;A straggler: Consistency of supervised learning with missing values&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="thoughts-from-the-national-ai-committee"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Thoughts from the national AI committee&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Early 2024, I was serving in the French national AI committee. Our final write up can be found
&lt;a class="reference external" href="https://www.info.gouv.fr/actualite/25-recommandations-pour-lia-en-france"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was a ton of work, a very interesting experience, and I learned a lot
on many aspects of the interfaces between technology, policy, and
society. A few things that stood out for me, some partly
obvious but worth saying:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Digital services are a growing economy.&lt;/strong&gt; The share of the economy
that is digital keeps growing, whether we like it or not (IMHO, most of
us spend too much time on our phones…). For France, or Europe, there
is no question: we must produce our share of digital services and
innovation, else our economic balance suffers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Privacy is erroding.&lt;/strong&gt; Whether it is social network, information
leaking into search engines or training of large language models,
or people uploading private information to chatGPT, private information
is more and more available. History has shown us the dangers behind
loss of privacy, which the powerful (governing or economical elites)
typically leverage to assert more power. Europe has had a long stance
of trying to mitigate this loss of privacy via regulation (GDPR). But
regulating services that we don’t control is hard, and it ends up being
a geo-political and economical battle.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Big AI is huge.&lt;/strong&gt; The size of investments in AI is huge (dozens of
billions yearly, comparable to a sizeable fraction of the state
expenditures of a rich country like Switzerland). Data centers are
having significant impacts on the electric grid of modern countries,
running in competition with other usage. The cost of large models have
ballooned (training a large language model is in the hundreds of
millions of cost, which is comparable to a sizeable fraction of the
budget of the national research institute that I work in (&lt;a class="reference external" href="https://inria.fr/fr"&gt;inria&lt;/a&gt;). Training costs are just the visible part
of the iceberg, operational costs are huge and are everywhere.&lt;/p&gt;
&lt;p&gt;Not all in tech are worried about rising costs. Indeed, they go hand in
hand with more money in tech, making us, tech bros, richer, as long as
investments keep pouring in. But &lt;a class="reference external" href="https://www.goldmansachs.com/images/migrated/insights/pages/gs-research/gen-ai--too-much-spend%2C-too-little-benefit-/TOM_AI%202.0_ForRedaction.pdf"&gt;bubble dynamics are at play&lt;/a&gt;,
and explain part of the conversation around AI.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Concentration of power.&lt;/strong&gt; Many factors in today’s AI lead to
concentration into the hands of large actors. Training and operation
costs, of course. But also limited access to the correspond skills,
platform effect on the data and the users. The most striking bottleneck
is the compute hardware. Only one company makes the chips that we all
need. Few actors can afford buying them; and as a result most of the
world lives from renting out to big landlords.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;AI neither good nor bad, but what we do of it.&lt;/strong&gt; The above may
paint a gloomy picture. But this is not how I see it. AI does have a
lot of potential for good, as all general purpose technology. It all
depends how society uses it. And here the future is open: we, as actors
of democratic societies, as innovators, in tech but in every aspects of
society, we can determine what the future of AI is. I look forward to
technology that empowers each and everybody, to act for their own
benefit. Key to this future is enabling and bringing in every stakeholder.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="adventures-in-software-land"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Adventures in software land&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With the growing importance of data and artificial intelligence in
shaping society, I believe more than ever in the importance of open
source and commons for data science, making tools accessible to as many
as possible.&lt;/p&gt;
&lt;div class="section" id="probabl-to-supercharge-scikit-learn"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;probabl to supercharge scikit-learn&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Early 2024, Inria spun off the scikit-learn development to a new structure, &lt;a class="reference external" href="https://probabl.ai"&gt;probabl&lt;/a&gt;, to supercharge the development of the broader
ecosystem. I detailed the motivation and the goals in &lt;a class="reference external" href="../programming/promoting-open-source-from-inria-to-probabl.html"&gt;a previous article&lt;/a&gt;. In a
nutshell:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Scikit-learn is &lt;a class="reference external" href="programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html"&gt;a key component of the machine-learning
ecosystem&lt;/a&gt;,
but its development require funding.&lt;/li&gt;
&lt;li&gt;Probabl is there to foster a broader open data-science ecosystem, as
scikit-learn can be sustainable only when used in such ecosystem.
Probabl focus on delivering value to enterprises, and thus makes sure
that there is a seamless solution to their needs.&lt;/li&gt;
&lt;li&gt;I have 10% of my time allocated from Inria to Probabl.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some of our successes are already publicly visible:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The open-source team at probabl is maintaining and improving &lt;a class="reference external" href="https://probabl.ai/open-source"&gt;a range
of software libraries&lt;/a&gt;: scikit-learn,
joblib, imbalanced-learn, fairlearn, skops, skrub… Our priorities are
openly discussed &lt;a class="reference external" href="https://papers.probabl.ai/open-source-priorities-chapter-2"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We have launched &lt;a class="reference external" href="https://papers.probabl.ai/official-scikit-learn-certification-launch"&gt;an official certification program for scikit-learn&lt;/a&gt;. I’m very excited about these certifications (there are three levels), to grow recognition in the scikit-learn skills, and thus make sure that it is a dependable stack for the industry.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="skrub-machine-learning-on-tables-made-easy"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Skrub: machine learning on tables made easy&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://skrub-data.org/"&gt;skrub&lt;/a&gt; is a software project that I am very
excited about. Many crucial applications of machine learning are on
tables. Skrub facilitates the corresponding patterns. We are designing it
with the insights of years of research and practice on the topic. It does
not always look impressive, but it’s the little things that add up to
productivity.&lt;/p&gt;
&lt;p&gt;A typical dataset is the employee-salaries one:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub.datasets import fetch_employee_salaries
&amp;gt;&amp;gt;&amp;gt; dataset = fetch_employee_salaries()
&amp;gt;&amp;gt;&amp;gt; employees_df, y = dataset.X, dataset.y
&lt;/pre&gt;
&lt;p&gt;Skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableReport.html"&gt;TableReport&lt;/a&gt; makes it really easy to interactively visualize and
explore such a table:&lt;/p&gt;
&lt;img alt="" src="attachments/2024_highlights/table_report_vscode.png" style="width: 700px;" /&gt;
&lt;p&gt;The dataframe &lt;cite&gt;employees_df&lt;/cite&gt; has plenty of non-numerical columns, as visible above.
Skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html"&gt;TableVectorizer&lt;/a&gt; turns it into a numerical array suitable for
machine learning, taking care of dates, categories, strings…&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub import TableVectorizer
&amp;gt;&amp;gt;&amp;gt; X = TableVectorizer().fit_transform(employees_df)
&lt;/pre&gt;
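Under the hood, the idea is to route each column to an encoder suited to its type. A rough scikit-learn-only sketch of this principle (an illustration, not skrub’s actual implementation; the column names and data are made up):

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A tiny made-up table mixing a string column and a numeric column
df = pd.DataFrame({
    "department": ["POL", "FRS", "POL"],
    "hire_year": [2001, 2010, 1998],
})

# Route columns by dtype: one-hot encoding for strings, scaling for numbers
vectorizer = make_column_transformer(
    (OneHotEncoder(), make_column_selector(dtype_include=object)),
    (StandardScaler(), make_column_selector(dtype_include="number")),
)
X = vectorizer.fit_transform(df)
print(X.shape)  # (3, 3): two one-hot columns plus one scaled numeric column
```

skrub’s TableVectorizer goes much further (dates, high-cardinality categories, dirty strings), but the routing idea is the same.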
&lt;p&gt;If you want to use deep-learning language models for the string
categories, skrub’s &lt;a class="reference external" href="https://skrub-data.org/stable/reference/generated/skrub.TextEncoder.html"&gt;TextEncoder&lt;/a&gt;
can download pre-trained models from Hugging Face:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub import TextEncoder
&amp;gt;&amp;gt;&amp;gt; text_encoder = TextEncoder(
        &amp;quot;sentence-transformers/paraphrase-albert-small-v2&amp;quot;,
        device=&amp;quot;cpu&amp;quot;,
    )
&amp;gt;&amp;gt;&amp;gt; tab_vec = TableVectorizer(high_cardinality=text_encoder)
&amp;gt;&amp;gt;&amp;gt; X = tab_vec.fit_transform(employees_df)
&lt;/pre&gt;
&lt;p&gt;With this, the latest artificial-intelligence developments are easily
brought to bear on the data that matters.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="research-better-ai-tools-more-understanding"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Research: better AI tools, more understanding&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Software, and thoughts on AI and society, are best built on a solid
understanding of AI, which calls for research.&lt;/p&gt;
&lt;div class="section" id="table-foundation-models"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Table foundation models&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Modeling data semantics enable pretaining for tables&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I have been working on machine learning for tables for more than a
decade. These data are crucial for many applications, but they have so
far not witnessed the breakthroughs of deep learning seen &lt;em&gt;eg&lt;/em&gt; in vision
or text. Much of this success of &lt;strong&gt;deep learning has been driven by the
ability to reuse pretrained models&lt;/strong&gt;, fitted on very large datasets.
Foundation models pushed this idea very far with models that provide
background information useful for a wide variety of downstream tasks. But
pretraining is challenging for tables.&lt;/p&gt;
&lt;p&gt;A crucial part of foundation models for text and images is the attention
mechanism, stacked in a transformer architecture, which brings associative
memory to the inputs by contextualizing them. We had a breakthrough with
the &lt;a class="reference external" href="https://openreview.net/forum?id=9kArQnKLDp"&gt;CARTE model&lt;/a&gt;: we
managed to adapt these ideas to tables. The strings (table
entries and column names) give the information that enables transfer from
one table to another: data semantics. Here, the key is to have an
architecture that 1) models both strings and numerical values, and 2) applies
to any set of tables while using the column names to route the
information. For this purpose, CARTE uses a new dedicated attention
mechanism that accounts for column names. It is pre-trained on a very
large knowledge base. As a result, it outperforms the best models
(including tree-based models) in small-sample settings (up to n=2000).&lt;/p&gt;
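To give an intuition of the mechanism, here is a toy numpy illustration (not CARTE’s actual architecture or code): building attention keys and queries from both the cell values and the column-name embeddings makes the routing of information depend on what a column means, not on its position.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cols, d = 4, 8                          # a row with 4 columns, embeddings of dim 8
values = rng.normal(size=(n_cols, d))     # embeddings of the cell entries
col_names = rng.normal(size=(n_cols, d))  # embeddings of the column names

# Keys and queries combine each entry with its column name, so attention
# scores are conditioned on the column semantics
keys = values + col_names
queries = values + col_names
scores = queries @ keys.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each entry, contextualized by the rest of the row
contextualized = weights @ values
print(contextualized.shape)  # (4, 8)
```

Because column names enter the computation as embeddings of strings, the same weights can be applied to a table never seen during pretraining.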
&lt;p&gt;The pretrained CARTE model is available for download as &lt;a class="reference external" href="https://pypi.org/project/carte-ai"&gt;a Python package&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This result is very significant as it opens the door to &lt;strong&gt;foundation models
for tables&lt;/strong&gt;: models that embark much background knowledge and can be
specialized to many tabular-learning tasks.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://openreview.net/forum?id=9kArQnKLDp"&gt;&lt;img alt="" src="attachments/2024_highlights/carte_comparisons.png" style="width: 100%;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Extensive empirical results show that CARTE brings benefits to very
broad set of baselines. The relative performance of baselines also
contains interesting results.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;See also&lt;/p&gt;
&lt;p&gt;I wrote a longer &lt;a class="reference external" href="./carte-toward-table-foundation-models.html"&gt;high-level post on CARTE&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="disparities-of-confidence-of-large-language-models"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Disparities of confidence of large language models&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://hal.science/hal-04750567"&gt;&lt;img alt="" src="attachments/2024_highlights/hallucination_probability.png" style="width: 400px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;A good confidence assessment on replies of an LLM would separate out
correct from incorrect statements: Einstein was not born on Jan 14th
1879 (close call, it was March 14th); his PhD was in Zurich.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Large language models (LLMs), such as ChatGPT, may produce answers that
are plausible but not factually correct, the so-called “hallucinations”.
A variety of approaches try to assess how likely a statement is to be true,
for instance by sampling multiple responses from the language model.
Ideally, we would like to use these confidence assessments to flag the
wrong statements in an LLM’s answer. For this, a challenge is to
threshold them, or assign a probability of correctness.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://hal.science/hal-04750567"&gt;&lt;img alt="" src="attachments/2024_highlights/llm_confidence_nationality.png" style="width: 400px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Observed error rate and a function predicted probability of
correctness For the birth date, when a large language model (here Mistral
7B) gives information on a given notable individual. The different
curves give the corresponding calibration for different nationalities of
the individuals, revealing that &lt;strong&gt;the probability is much more trustworthy
for a citizen of the United States than for other countries&lt;/strong&gt;, and
particularly poor for people that originate from South-East Asia.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://hal.science/hal-04750567/"&gt;Chen et al&lt;/a&gt;, we investigate the
confidence of LLMs in their answers. We show that the
probabilities computed are not only overconfident, but also that there is
heterogeneity (grouping loss): on some groups of queries the
overconfidence is more pronounced than on others. For instance, for an
answer on a notable individual, the LLMs’ confidence is reasonably
calibrated if the individual is from the United States, but severely
overconfident for individuals from South-East Asia
(see figure). Characterizing the groups in question
opens the door to correcting the corresponding bias, a “reconfidencing”
procedure.&lt;/p&gt;
&lt;p&gt;This study is an application of our earlier, more theoretical, &lt;a class="reference external" href="https://openreview.net/forum?id=6w1k-IixnL8"&gt;work&lt;/a&gt; that contributed the
first estimator of the grouping loss, a mathematically solid concept capturing
hidden heterogeneity in classifier calibration. I am very happy to see
that these fairly abstract ideas are useful to probe very concrete
problems such as the disparity in LLM confidence across nationalities.&lt;/p&gt;
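The grouping-loss idea can be illustrated with a toy computation (hypothetical numbers, not the paper’s data): the same stated confidence can match observed accuracy on one group of queries while being far too high on another.

```python
import numpy as np

# Hypothetical data: the LLM states 90% confidence on every answer,
# but correctness differs across nationality groups of the individuals
confidence = np.array([0.9] * 14)
correct = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1,   # "US" queries: 9/10 correct
                    1, 0, 0, 1])                    # "SEA" queries: 2/4 correct
group = np.array(["US"] * 10 + ["SEA"] * 4)

# Gap between stated confidence and observed accuracy, per group
gaps = {}
for g in ("US", "SEA"):
    mask = group == g
    gaps[g] = confidence[mask].mean() - correct[mask].mean()
print(gaps)  # near 0 for "US" (calibrated), large positive for "SEA" (overconfident)
```

A global calibration measure would average these gaps away; looking per group is what reveals the hidden heterogeneity.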
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-straggler-consistency-of-supervised-learning-with-missing-values"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;A straggler: Consistency of supervised learning with missing values&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;A&lt;/em&gt; &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;paper&lt;/a&gt;
&lt;em&gt;on the fundamentals of machine-learning with missing values&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In 2018, &lt;a class="reference external" href="https://juliejosse.com"&gt;Julie Josse&lt;/a&gt;, &lt;a class="reference external" href="https://erwanscornet.github.io"&gt;Erwan Scornet&lt;/a&gt;, and myself started working on the
theory of how supervised learning works with missing values (learning
theory). Working with an intern, Nicolas Prost, we quickly realized that there
was a gap between the statistical thinking around missing values, which
was focused on enabling inference in parametric models as if there were
no missing values, and the needs of prediction with missing values.&lt;/p&gt;
&lt;p&gt;We wrote &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;a paper&lt;/a&gt; to
lay out the theory cleanly, summarizing both elements of learning theory
and the fundamentals of statistics with missing values. Beyond these
didactic aspects, the paper gives a series of formal results, such as the
need for multiple imputations to be able to use the &lt;em&gt;complete case&lt;/em&gt;
predictor (the optimal predictor without missing values), the optimal way
to model missing values in trees (which was already used in XGBoost :) ),
and the fact that, asymptotically, constant imputation of missing values
can work well for prediction.&lt;/p&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Frustrations of the academic game&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://hal.science/hal-02024202"&gt;The preprint&lt;/a&gt; got a lot of success
(more than a hundred citations), probably because it laid out
fundamentals. But it took 5 years to publish it. The machine learning
community did not like the absence of new methods (we only gave
theoretical results on existing practice, such as imputation). The
statistics literature really did not like our messages that imputation
was not always important. In one journal, a reviewer rejected the paper on
the basis that it was giving bad messages to the community, but not
arguing that anything was wrong in our proofs or our experiments. Of
course, there is a lot to say about the difficulties of doing data
analysis with missing values, but the conversation did not go in these
details. This is a good illustration that &lt;strong&gt;progress in science is
social&lt;/strong&gt;, and is as much about shifting norms than accumulating knowledge
(actually, knowledge is social too, as put forward by &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Social_epistemology"&gt;social
epistemology&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As time went by, my colleague &lt;a class="reference external" href="https://marinelm.github.io"&gt;Marine Le Morvan&lt;/a&gt; has published &lt;a class="reference external" href="https://proceedings.mlr.press/v108/morvan20a.html"&gt;more&lt;/a&gt; &lt;a class="reference external" href="https://proceedings.neurips.cc/paper/2021/hash/5fe8fdc79ce292c39c5f209d734b7206-Abstract.html"&gt;and&lt;/a&gt;
&lt;a class="reference external" href="https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac013/6568998"&gt;more&lt;/a&gt;
&lt;a class="reference external" href="https://arxiv.org/abs/2407.19804"&gt;results&lt;/a&gt; that push deeper
understanding of prediction with missing values. But I still see value in
our original paper, as it lays the foundations.&lt;/p&gt;
&lt;p&gt;The paper is now out, thanks to my coauthors who kept replying to
reviewers, improving the manuscript, and resubmitting. Read &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s00362-024-01550-4"&gt;it&lt;/a&gt;; I think
it is a good read.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Well, this article ended up longer than I had expected. Thanks for
reading. Taking a step back to figure out what is important is always a
good exercise for me.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>When AIs must overcome the data</title><link href="https://gael-varoquaux.info/science/when-ais-must-overcome-the-data.html" rel="alternate"></link><published>2024-12-22T00:00:00+01:00</published><updated>2024-12-22T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-12-22:/science/when-ais-must-overcome-the-data.html</id><summary type="html">&lt;p&gt;Improving conversational artificial intelligences or simpler prediction engines involves overcoming biases, that is, going beyond the limits of data. But the notion of bias is subtle, as it depends on the goals.&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot arm wrestling a figure made of numbers. This figure does not look like a robot, but more like a human, however it is made of numbers.&amp;quot;" class="small align-right" src="../science/attachments/robot_wresting_numbers.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/quand-lia-doit-depasser-les-donnees-2126369"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Improving conversational artificial intelligences or simpler prediction engines involves overcoming biases, that is, going beyond the limits of data. But the notion of bias is subtle, as it depends on the goals.&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot arm wrestling a figure made of numbers. This figure does not look like a robot, but more like a human, however it is made of numbers.&amp;quot;" class="small align-right" src="../science/attachments/robot_wresting_numbers.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/quand-lia-doit-depasser-les-donnees-2126369"&gt;Les Echos&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In 2023, Microsoft’s conversational AI insulted users.
Salary-recommendation engines ignore women’s degrees to underpay them. At
the start of the Covid-19 pandemic, predictions of hospital stays
consistently underestimated the duration. These three issues all stem
from the same failure: predictive engines, artificial intelligences, that
have learned from biases. The rude conversational AI replicated its
training texts, some of which came from internet forums where politeness
is sometimes overlooked. The medical AI only considered finished
hospitalizations, and, as the epidemic had just begun, only patients
with mild forms had already been discharged, while the more seriously ill
remained hospitalized.&lt;/p&gt;
&lt;p&gt;To obtain an AI that doesn’t spout nonsense, the biases must be
“corrected.” The problem of too-short observation windows is a classic
issue in medical statistics: more importance must be placed on the few
individuals who have been sick for a long time. A similar solution is
used to improve conversational AIs: weighting the training text sources
based on the deviation from the desired behavior.&lt;/p&gt;
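Such reweighting can be illustrated on a toy example (hypothetical numbers): completed hospital stays over-represent short stays, so up-weighting the rarely observed long stays corrects the naive average.

```python
import numpy as np

# Completed hospital stays observed early in an epidemic, in days:
# short stays are fully observed, long stays mostly still ongoing
durations = np.array([3, 4, 5, 20, 25])

# Suppose only 1 in 4 long stays has finished, versus all short ones:
# weight each completed long stay by 4 to stand in for the unseen ones
weights = np.array([1.0, 1.0, 1.0, 4.0, 4.0])

naive = durations.mean()
corrected = np.average(durations, weights=weights)
print(naive, corrected)  # the corrected estimate is larger than the naive one
```

The same logic underlies weighting training text sources for conversational AIs: samples from the under-represented, desired behavior get more weight.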
&lt;div class="section" id="aligning-on-which-values"&gt;
&lt;h2&gt;Aligning on which values?&lt;/h2&gt;
&lt;p&gt;The problem of bias is universal in statistics. And modern AIs are
statistical because they learn from data. The notion of bias is very
relative. It should be understood as a gap between the available data and
the desired behavior. Therefore, &lt;strong&gt;there is no such thing as unbiased data,
or a universal bias correction&lt;/strong&gt;. Much of the effort to improve AIs focuses
on reducing this gap between training and the desired behavior.&lt;/p&gt;
&lt;p&gt;For example, when training AIs for autonomous vehicles, one difficulty is
that the data contains very few traffic accidents. Simulators are
sometimes used to fill this gap. They are inherently less rich than
reality and are mixed with real-world driving. There is a well-controlled
gap between the resulting mixture and typical driving; this gap is there
to put emphasis on safety requirements in unfavorable scenarios. This is
another form of data correction.&lt;/p&gt;
&lt;p&gt;Just as the notion of data bias depends on how well the data match a
targeted use, an AI does not produce absolute or objective truth. Without
corrections, it simply replicates its behavior based on what it has
observed. And when corrections are made, the whole question is how to
correct it. For powerful AIs, we then talk about “alignment” towards
goals and values. As AI incorporates the values of its designers, one
might wonder whether the same AI can be socially acceptable in all
cultures.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>Do AIs reason or recite?</title><link href="https://gael-varoquaux.info/science/do-ais-reason-or-recite.html" rel="alternate"></link><published>2024-10-19T00:00:00+02:00</published><updated>2024-10-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-10-19:/science/do-ais-reason-or-recite.html</id><summary type="html">&lt;p&gt;Despite their apparent intelligence, conversational artificial intelligences often lack logic. The debate rages on: do they reason or do they recite snatches of text memorized on the Internet?&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot with a stream of numbers coming out of his mouth. The robot is on the left, facing right, and the numbers flow, as if they were sound.&amp;quot;" class="small align-right" src="../science/attachments/robot_numbers_flow_mouth.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-raisonnent-elles-ou-recitent-elles-2103079"&gt;Les Echos&lt;/a&gt;. I updated it with new …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Despite their apparent intelligence, conversational artificial intelligences often lack logic. The debate rages on: do they reason or do they recite snatches of text memorized on the Internet?&lt;/p&gt;
&lt;img alt="Image generated with &amp;quot;ChatGPT&amp;quot;, with the prompt &amp;quot;Please generate an image of a robot with a stream of numbers coming out of his mouth. The robot is on the left, facing right, and the numbers flow, as if they were sound.&amp;quot;" class="small align-right" src="../science/attachments/robot_numbers_flow_mouth.jpg" /&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post was originally published in French as part of my scientific
chronicle in &lt;a class="reference external" href="https://www.lesechos.fr/idees-debats/sciences-prospective/les-ia-raisonnent-elles-ou-recitent-elles-2103079"&gt;Les Echos&lt;/a&gt;. I updated it with new references.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Conversational AIs, or large language models, are sometimes seen as the
gateway to general artificial intelligence. ChatGPT, for example, can
answer questions asked at the International Mathematical Olympiad. And
yet, on other, seemingly much simpler questions, ChatGPT makes surprising
mistakes. What aspects of conversational AI intelligence explain its
ability to solve some problems and not others?&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://arxiv.org/abs/2309.13638"&gt;Thomas McCoy and co-authors&lt;/a&gt;
conjecture that it has to do with their underlying model of
autoregression: technically, these AIs are trained to complete texts
found on the Internet. If an AI is very good at calculating (9/5) x + 32,
but not (7/5) x + 31, it is because the first formula corresponds to the
conversion of degrees Celsius to Fahrenheit, a very frequent conversion
on the Internet, while the second does not correspond to any particular
formula. Conversational AIs would therefore be good at reproducing what
they’ve already seen. Indeed, numerous studies have shown that they have
a certain tendency to reproduce snippets of known text. So, if an AI can
solve problems from the International Mathematical Olympiad, is it simply
because it has memorized the answer?&lt;/p&gt;
&lt;div class="section" id="something-new"&gt;
&lt;h2&gt;Something new?&lt;/h2&gt;
&lt;p&gt;In terms of intelligence, inventing a new mathematical demonstration
requires mastering abstractions and the ability to string together
complicated logical reasoning with an imposed start and finish. This
seems much more difficult than memorizing a demonstration. This is one of
the traditional oppositions in machine learning, the line of research
that gave rise to today’s AIs: memorizing is one thing, knowing how to
generalize is another. For example, if I memorize all the additions
between two numbers smaller than ten, I cannot extrapolate beyond that. To
go further, I need to master the logic of addition… or memorize more.&lt;/p&gt;
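The addition example can be made concrete (a toy illustration): a memorized table of sums of numbers below ten answers nothing outside what it has seen, while the rule itself extrapolates.

```python
# Memorization: a lookup table of all sums of numbers below ten
memorized = {(a, b): a + b for a in range(10) for b in range(10)}

def recite(a, b):
    return memorized.get((a, b))  # None when outside what was memorized

def reason(a, b):
    return a + b                  # the rule itself generalizes to any inputs

print(recite(3, 4), recite(12, 5), reason(12, 5))  # 7 None 17
```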
&lt;p&gt;And precisely, conversational AIs have an enormous capacity for
memorization, and have been trained on almost the entire Internet. Given
a question, they can often dip into their memory to find answers. So, are
they intelligent, or do they just have a great memory? Scientists are still
debating the importance of memory to their abilities. Some argue that
their storage capacity is ultimately limited by the size of the Internet.
Others wonder to what extent the impressive successes highlighted are not
on tasks already solved on the Internet, questioning their ability to do
anything new.&lt;/p&gt;
&lt;p&gt;But could memorization be an aspect of intelligence? In 1987, Lenat and
Feigenbaum conjectured that, for a cognitive agent, accumulating
knowledge enables it to solve new tasks with less learning. Perhaps the
intelligence of conversational AI lies in knowing how to pick up the
right bits of information, and combine them.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Related academic work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://www.pnas.org/doi/10.1073/pnas.2322420121"&gt;Embers of autoregression show how large language models are shaped
by the problem they are trained to solve&lt;/a&gt;, R. Thomas McCoy,
Shunyu Yao, Dan Friedman, Mathew D. Hardy, and Thomas L. Griffiths,
PNAS 2024 (&lt;a class="reference external" href="https://arxiv.org/abs/2309.13638"&gt;ArXiv&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Princeton researchers show that properties of large language models
(LLMs) are governed by the data that they are trained on, including
their arithmetic abilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://arxiv.org/abs/2410.05229"&gt;GSM-Symbolic: Understanding the Limitations of Mathematical
Reasoning in Large Language Models&lt;/a&gt;, Iman Mirzadeh, Keivan Alizadeh
Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar&lt;/p&gt;
&lt;p&gt;Apple researchers show that LLMs solve mathematical challenge via
probabilistic &lt;strong&gt;pattern matching&lt;/strong&gt; on previously seen examples, rather
than logical reasonning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;AI chronicles&lt;/p&gt;
&lt;p&gt;Find all my AI chronicles &lt;a class="reference external" href="https://gael-varoquaux.info/tag/ai-chronicle.html"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The goal of these “AI chronicles” is to introduce concepts of AI to a broader public, staying at a very high level.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="AI"></category><category term="chronicle"></category><category term="AI chronicle"></category></entry><entry><title>CARTE: toward table foundation models</title><link href="https://gael-varoquaux.info/science/carte-toward-table-foundation-models.html" rel="alternate"></link><published>2024-07-19T00:00:00+02:00</published><updated>2024-07-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-07-19:/science/carte-toward-table-foundation-models.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Foundation models, pretrained and readily usable for many downstream
tasks, have changed the way we process text, images, and sound. Can we
achieve similar breakthroughs for tables? Here I explain why with
&lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;“CARTE”&lt;/a&gt;, we’ve made significant headway.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-for-data-tables-hopes-and-challenges" id="toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-is-a-necessity" id="toc-entry-2"&gt;Pre-training is a …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Foundation models, pretrained and readily usable for many downstream
tasks, have changed the way we process text, images, and sound. Can we
achieve similar breakthroughs for tables? Here I explain why with
&lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;“CARTE”&lt;/a&gt;, we’ve made significant headway.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-for-data-tables-hopes-and-challenges" id="toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pre-training-is-a-necessity" id="toc-entry-2"&gt;Pre-training is a necessity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pretraining-for-data-tables" id="toc-entry-3"&gt;Pretraining for data tables?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#carte-a-table-foundation-model-breakthrough" id="toc-entry-4"&gt;CARTE: a table foundation model breakthrough&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#an-architecture-to-learn-across-tables" id="toc-entry-5"&gt;An architecture to learn across tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pretraining-on-knowledge-graphs" id="toc-entry-6"&gt;Pretraining on knowledge graphs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#empirical-results" id="toc-entry-7"&gt;Empirical results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#lessons-learned" id="toc-entry-8"&gt;Lessons learned&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="pre-training-for-data-tables-hopes-and-challenges"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Pre-training for data tables: hopes and challenges&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="pre-training-is-a-necessity"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Pre-training is a necessity&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Foundation models have brought breakthroughs to text and image processing
because they embark a great deal of knowledge on these data, knowledge
that can then be reused to simplify processing. But their promises have
not come true for tables, which hold much of an organization’s specific
data, &lt;em&gt;eg&lt;/em&gt; relational databases capturing day-to-day operations, or
measurements tables related to a specific source of data.&lt;/p&gt;
&lt;p&gt;Rather, for tabular learning, a couple of years ago &lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;our extensive
benchmarks&lt;/a&gt;
showed that tree-based models outperformed even deep-learning
architectures specially crafted for data tables.&lt;/p&gt;
&lt;p&gt;One challenge is that typically tables are not that big and thus the
high flexibility of deep learning is a weakness rather than a benefit.
This shortcoming was solved by pretrained models, for data modalities
where deep learning has been vastly successful: &lt;strong&gt;most people do not
train a deep-learning model from scratch, but download a pre-trained one
from model hubs&lt;/strong&gt;. Such universal pre-training is also at the root of
foundation models.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="pretraining-for-data-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Pretraining for data tables?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;But what does pretraining mean for data tables? If I give you a table of
numbers, what prior information can you use to process it better?
Images and text have a lot of regularity that repeats across datasets:
I can recognize a car in pictures coming from all kinds of cameras
(including old black-and-white photographs). I use my knowledge of the
meaning of words to understand a text. But given a table of numbers as
below, what sense can I make of it?&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The tabular learning challenge: every table is a special snowflake&lt;/em&gt;&lt;/div&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="29%" /&gt;
&lt;col width="29%" /&gt;
&lt;col width="29%" /&gt;
&lt;col width="14%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;181&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;166&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;161&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The reason a data analyst can understand this data and use this
understanding to build a better data-processing pipeline is that the
data comes with context: meaningful strings sprinkled around these
numbers. For instance, a table with the same numbers as above but with
column names and a few string entries makes complete sense:&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;caption&gt;Cardiovascular cohort&lt;/caption&gt;
&lt;colgroup&gt;
&lt;col width="18%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="36%" /&gt;
&lt;col width="9%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Age&lt;/th&gt;
&lt;th class="head"&gt;Weight&lt;/th&gt;
&lt;th class="head"&gt;Height&lt;/th&gt;
&lt;th class="head"&gt;Commorbidity&lt;/th&gt;
&lt;th class="head"&gt;Cardiovascular event&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;td&gt;Diabetes&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;181&lt;/td&gt;
&lt;td&gt;Cardiac arrhythmia&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;166&lt;/td&gt;
&lt;td&gt;NA&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;161&lt;/td&gt;
&lt;td&gt;Asthma&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In such a setting, it becomes clear what background knowledge, what
pre-training can bring to analyzing data tables: &lt;strong&gt;string entries and
column names bring meaning to the numbers in data tables&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Another way of seeing the challenge is that of &lt;strong&gt;data integration&lt;/strong&gt;: as
studied by the knowledge-representation and database communities, putting
multiple sources of data in a consistent representation requires:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;schema matching&lt;/strong&gt;, which to a first order is about finding column
correspondences across tables&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;entity matching&lt;/strong&gt;, finding correspondences across table entries
denoting the same thing, for instance “Diabetes” and “Diabetes mellitus”&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These challenges of data integration are central to building pretrained
or foundation models for tables. Indeed, such models must apply to all
tables, and thus must bridge these gaps across tables.&lt;/p&gt;
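&lt;p&gt;To make these two steps concrete, here is a toy sketch (not CARTE’s actual machinery) that uses plain string similarity from Python’s standard library as a crude stand-in for language-model embeddings; all column and entry names are made up for illustration:&lt;/p&gt;

```python
# Toy sketch of the two data-integration steps: difflib string
# similarity stands in for learned embeddings; names are illustrative.
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Schema matching: for each column of table A, find the closest column
# of table B
cols_a = ["Age", "Comorbidity", "Cardiovascular event"]
cols_b = ["patient age", "medical conditions", "CV event"]
matches = {c: max(cols_b, key=lambda other: similarity(c, other))
           for c in cols_a}

# Entity matching: link entries denoting the same thing
close = similarity("Diabetes", "Diabetes mellitus")  # same entity
far = similarity("Diabetes", "Asthma")               # unrelated entities
```

&lt;p&gt;Real systems replace this crude string similarity with learned embeddings, but the structure of the problem, matching columns and matching entries, is the same.&lt;/p&gt;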
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="carte-a-table-foundation-model-breakthrough"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;CARTE: a table foundation model breakthrough&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Our recent &lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;CARTE paper&lt;/a&gt; builds upon
the above insights and demonstrates that pretraining can yield
models that markedly improve prediction performance.&lt;/p&gt;
&lt;div class="section" id="an-architecture-to-learn-across-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;An architecture to learn across tables&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Graphlets&lt;/strong&gt;
The key ingredient of CARTE is how we represent the inputs. CARTE’s goal
is to build predictors on rows of tables, for instance associating the
features of an individual to a risk of developing adverse cardiovascular
events. To pretrain across tables, we use a universal representation of
the data (rows of tables), as small graphs.&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_graphlet.png" /&gt;
&lt;p class="caption"&gt;Turning table rows into graphlets. Each column leads to an edge and
the column name is turned into the corresponding edge feature. It’s a
“multirelational graph”. The entry associated with the given column
is turned into the corresponding node feature, and the row is
represented as a special row token in the center of the graphlet.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Thus, tables with different numbers of columns can be turned into a
consistent representation. An additional benefit of this
representation is that it can capture data spread across multiple tables
with shared keys (for instance all the visits of a patient to a hospital).&lt;/p&gt;
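&lt;p&gt;The construction can be sketched in a few lines; this is a schematic illustration with made-up column names, not the paper’s code:&lt;/p&gt;

```python
# Schematic graphlet construction: each row becomes a small star graph,
# one edge per (column, value) pair, centered on a special row token.
# CARTE itself feeds embeddings of these strings, not the raw strings.
def row_to_graphlet(row):
    """Turn a dict {column_name: value} into (center, edges)."""
    center = "ROW"  # special row token at the center of the graphlet
    edges = [
        # (source, edge feature = column name, node feature = entry)
        (center, column, value)
        for column, value in row.items()
        if value is not None  # missing entries simply yield no edge
    ]
    return center, edges

center, edges = row_to_graphlet(
    {"Age": 72, "Comorbidity": "Diabetes", "Cardiovascular event": 1}
)
# A row from a table with a different schema maps to the same structure
_, other = row_to_graphlet({"patient age": 64, "CV event": 1})
```

&lt;p&gt;Note how no schema matching is needed: rows with different columns simply yield graphlets with different edge features.&lt;/p&gt;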
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;A representation that can bridge tables without schema or entity
matching&lt;/em&gt;&lt;/div&gt;
&lt;br/&gt;
&lt;br/&gt;&lt;p&gt;&lt;strong&gt;String embeddings&lt;/strong&gt;
The second ingredient is to represent all strings as embeddings, using a
pretrained language model, whether for column names or for string
entries. A good language model will embed close together strings
with similar meanings, for instance a column named “comorbidity” and
another one named “medical conditions”. Such a representation helps
learning without entity or schema matching.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Graph transformer&lt;/strong&gt; CARTE then uses a form of graph transformer on top
of this representation. Key to this graph transformer is an attention
mechanism that accounts for the relation information (the edge type,
&lt;em&gt;ie&lt;/em&gt; the column name). Thus &lt;em&gt;(born in, Paris)&lt;/em&gt; is represented
differently from &lt;em&gt;(living in, Paris)&lt;/em&gt;.&lt;/p&gt;
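&lt;p&gt;As a toy illustration of such edge-conditioned attention (an assumption-laden sketch, not CARTE’s actual layer), one can make the attention key depend on the edge embedding, so that the same node scores differently under different relations; all vectors below are made up:&lt;/p&gt;

```python
# Toy edge-conditioned attention score: the key for a neighbor combines
# the node feature with the edge (column-name) embedding.
import math

def attention_score(query, node_emb, edge_emb):
    # Edge-conditioned key: node embedding shifted by the edge embedding
    key = [n + e for n, e in zip(node_emb, edge_emb)]
    scale = math.sqrt(len(query))  # usual scaled dot-product attention
    return sum(q * k for q, k in zip(query, key)) / scale

query = [1.0, 0.5]
paris = [0.2, 0.9]  # toy node feature for "Paris"
born_in, living_in = [0.8, -0.1], [-0.3, 0.4]  # toy edge features

s1 = attention_score(query, paris, born_in)
s2 = attention_score(query, paris, living_in)
# Same node, different relation: different attention scores
```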
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Numbers treated as such&lt;/strong&gt; Columns with numerical entries are often
important information in a data table. Unlike typical large language
models, we do not represent numbers via string tokenization, but use a
vector representation where the numerical value is multiplied with the
embedding of the column name (a vector output by the language model).
That way a value of 126 in a column named “Systolic mm Hg” is represented
close to 1.5 times a value of 84 in a column named “Blood pressure”.&lt;/p&gt;
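&lt;p&gt;A small sketch of this encoding, with made-up embedding vectors standing in for the language-model output: because the two column embeddings are close, the two encoded values have a ratio close to 126/84 = 1.5 on every dimension:&lt;/p&gt;

```python
# Sketch of the numerical encoding: value * embedding(column name).
# The embedding vectors below are invented for illustration; in CARTE
# they come from a pretrained language model.
def encode_number(value, column_embedding):
    return [value * x for x in column_embedding]

systolic_emb = [0.61, 0.30, 0.75]        # toy emb("Systolic mm Hg")
blood_pressure_emb = [0.60, 0.31, 0.74]  # toy emb("Blood pressure")

v1 = encode_number(126, systolic_emb)
v2 = encode_number(84, blood_pressure_emb)
ratios = [a / b for a, b in zip(v1, v2)]  # each close to 126/84 = 1.5
```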
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="pretraining-on-knowledge-graphs"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Pretraining on knowledge graphs&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We pretrain the above architecture on a large, general-purpose knowledge
graph. The goal is to distill the corresponding information into the
pretrained model, which can then implicitly use it when analyzing new
tables. Indeed, a large knowledge graph (we use &lt;a class="reference external" href="https://yago-knowledge.org"&gt;YAGO&lt;/a&gt;) represents a huge number of facts about the
world, and its representation, as a multirelational graph, is close to
the one that we use to model data tables.&lt;/p&gt;
&lt;p&gt;Given an analytic task on a data table of interest, the pretrained model
can be fine-tuned. We found that this was a tricky part, as those tables
are often small.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="empirical-results"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Empirical results&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Excellent performance on extensive benchmarks&lt;/strong&gt;
We compared CARTE to a variety of baselines across 51 datasets (mostly
downloaded from Kaggle), as a function of the number of samples (number
of rows):&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_learning_curve.png" /&gt;
&lt;p class="caption"&gt;Prediction performance as a function of sample size for classification
and regression tasks&lt;/p&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
CARTE outperforms all baselines, including very strong ones&lt;/div&gt;
&lt;p&gt;CARTE appears as a very strong performer, outperforming all baselines
when there are fewer than 2000 samples. For larger tables, the prior
information is less crucial, and more flexible learners are beneficial.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strong contenders&lt;/strong&gt; We see that powerful tree-based learners, such as
CatBoost or XGBoost, also work very well. We investigated many baselines
in detail. Here, we consider not only learners, but also a variety of
methods to encode strings, and these really help prediction:&lt;/p&gt;
&lt;div class="figure"&gt;
&lt;img alt="" src="attachments/carte/carte_cd_plots.png" /&gt;
&lt;p class="caption"&gt;Detailed comparison (critical difference plots, giving the average
ranking of methods) across all 42 baselines that we investigated&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;CatBoost is an excellent predictor because it encodes categories
with great care. &lt;em&gt;S-LLM-CN-XGB&lt;/em&gt; is a baseline that we contributed that
encodes strings with an LLM, concatenates the numerical values, and uses
XGBoost on the resulting representation. &lt;em&gt;TabVec&lt;/em&gt; is the &lt;a class="reference external" href="https://skrub-data.org/stable/generated/skrub.TableVectorizer.html#skrub.TableVectorizer"&gt;TableVectorizer&lt;/a&gt;
from &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub&lt;/a&gt;. Combined with standard learners,
it gives really strong baselines.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Learning across tables&lt;/strong&gt; As CARTE can jointly model different tables with
different conventions, we show that one can use large source tables to
boost prediction on a smaller target table.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/carte/carte_joint_learning.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Ranking of various methods used across tables with imperfect
correspondences, where “matched” means manual column matching, and “not
matched” means no manual column matching&lt;/em&gt;&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Transfer learning across sources with different columns / schemas&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="lessons-learned"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;Lessons learned&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The extensive empirical results hold many lessons.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tabular foundation models are possible&lt;/strong&gt; The first lesson is that
using strings to bring meaning to the numbers enables foundation models
for tables: pretrained models that facilitate a variety of downstream
tasks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLMs are not enough&lt;/strong&gt; Many approaches to table foundation models adapt
large language models pretrained on huge text corpora. The argument is
that with the amount of high-quality text on the Internet, the corresponding
LLM can acquire more background knowledge. The seminal example is
&lt;a class="reference external" href="https://proceedings.mlr.press/v206/hegselmann23a.html"&gt;TabLLM&lt;/a&gt;, which
makes sentences out of table rows and feeds them to LLMs. Yet, by itself,
it does not perform well on tables with numbers.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/carte/tabllm_comparison.png" style="width: 350px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Ranking of models on data from the TabLLM paper, which differs from
our benchmark above in that it does not have string entries.&lt;/em&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
A table foundation model must model strings and numbers&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Modeling numbers is crucial&lt;/strong&gt; TabPFN, CARTE, and XGBoost all outperform
TabLLM on tables without strings, likely because they readily model
numbers, while an LLM sees them as strings. Likewise, our variant
&lt;em&gt;S-LLM-CN-XGB&lt;/em&gt;, which combines LLMs with a model suitable for numbers,
performs very well.&lt;/p&gt;
&lt;p&gt;As the strings are crucial to give context to numbers, we believe that
the future of table foundation models is to model well both strings and
numbers.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;CARTE is only a first step in the world of table foundation models. I
am convinced that these ideas will be pushed much further.&lt;/p&gt;
&lt;p class="last"&gt;But we have learned a lot in this study, and I have only skimmed the
surface of our work here. If you want more details, read the &lt;a class="reference external" href="https://arxiv.org/abs/2402.16785"&gt;CARTE paper&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="machine learning"></category><category term="tabular learning"></category><category term="foundation models"></category></entry><entry><title>Skrub 0.2.0: tabular learning made easy</title><link href="https://gael-varoquaux.info/programming/skrub-020-tabular-learning-made-easy.html" rel="alternate"></link><published>2024-07-03T00:00:00+02:00</published><updated>2024-07-03T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-07-03:/programming/skrub-020-tabular-learning-made-easy.html</id><summary type="html">&lt;img alt="" class="align-center" src="attachments/skrub_schematic.png" style="width: 500px;" /&gt;
&lt;p&gt;We just released &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub 0.2.0&lt;/a&gt;. This release
markedly simplifies learning on complex dataframes.&lt;/p&gt;
&lt;div class="section" id="model-tabular-learner-classifier"&gt;
&lt;h2&gt;&lt;cite&gt;model = tabular_learner(‘classifier’)&lt;/cite&gt;&lt;/h2&gt;
&lt;div class="align-right docutils container"&gt;
Simple, yet solid default baseline&lt;/div&gt;
&lt;p&gt;The highlight of the release is the &lt;a class="reference external" href="https://skrub-data.org/stable/generated/skrub.tabular_learner.html"&gt;tabular_learner&lt;/a&gt;
function, which facilitates creating pipelines that readily perform
machine learning on dataframes, adding preprocessing to a scikit-learn
compatible learner …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;img alt="" class="align-center" src="attachments/skrub_schematic.png" style="width: 500px;" /&gt;
&lt;p&gt;We just released &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub 0.2.0&lt;/a&gt;. This release
markedly simplifies learning on complex dataframes.&lt;/p&gt;
&lt;div class="section" id="model-tabular-learner-classifier"&gt;
&lt;h2&gt;&lt;cite&gt;model = tabular_learner(‘classifier’)&lt;/cite&gt;&lt;/h2&gt;
&lt;div class="align-right docutils container"&gt;
Simple, yet solid default baseline&lt;/div&gt;
&lt;p&gt;The highlight of the release is the &lt;a class="reference external" href="https://skrub-data.org/stable/generated/skrub.tabular_learner.html"&gt;tabular_learner&lt;/a&gt;
function, which facilitates creating pipelines that readily perform
machine learning on dataframes, adding preprocessing to a scikit-learn
compatible learner. The function packs defaults and heuristics
to transform all forms of dataframes to a representation that is well
suited to a learner, and it can adapt these transformations:
&lt;cite&gt;tabular_learner(HistGradientBoostingClassifier())&lt;/cite&gt; encodes categories
differently than &lt;cite&gt;tabular_learner(LogisticRegression())&lt;/cite&gt;.&lt;/p&gt;
&lt;p&gt;The heuristics are tuned based on extensive benchmarking, and experience
shows that they give good tradeoffs. The default
&lt;cite&gt;tabular_learner(‘classifier’)&lt;/cite&gt; is often a strong baseline.&lt;/p&gt;
&lt;p&gt;The benefits are visible in a really simple example:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; # First retrieve data
&amp;gt;&amp;gt;&amp;gt; from skrub.datasets import fetch_employee_salaries
&amp;gt;&amp;gt;&amp;gt; dataset = fetch_employee_salaries()
&amp;gt;&amp;gt;&amp;gt; df = dataset.X
&amp;gt;&amp;gt;&amp;gt; y = dataset.y
&amp;gt;&amp;gt;&amp;gt; # The dataframe is a quite rich and complex dataframe, with various columns
&amp;gt;&amp;gt;&amp;gt; df
&lt;/pre&gt;
&lt;img alt="" src="attachments/employee_salaries_df.png" /&gt;
&lt;p&gt;We can then easily build a learner that applies readily to this
dataframe, without any transformation:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; from skrub import tabular_learner
&amp;gt;&amp;gt;&amp;gt; learner = tabular_learner('regressor')
&amp;gt;&amp;gt;&amp;gt; # The resulting learner can apply all the machine-learning conveniences (eg cross-validation) directly on the dataframe
&amp;gt;&amp;gt;&amp;gt; from sklearn.model_selection import cross_val_score
&amp;gt;&amp;gt;&amp;gt; cross_val_score(learner, df, y)
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])
&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="section" id="transformer-tablevectorizer"&gt;
&lt;h2&gt;&lt;cite&gt;transformer = TableVectorizer()&lt;/cite&gt;&lt;/h2&gt;
&lt;div class="align-right docutils container"&gt;
Making encoding complex dataframes easy&lt;/div&gt;
&lt;p&gt;Under the hood, the work is done by the &lt;a class="reference external" href="https://skrub-data.org/stable/generated/skrub.TableVectorizer.html"&gt;skrub.TableVectorizer()&lt;/a&gt;, a
scikit-learn compatible transformer that facilitates combining multiple
transformations on the different columns of a dataframe. The
TableVectorizer is not new in the 0.2.0 release, but we have completely
revamped its internals to cover edge cases really well. Indeed, one
challenge is to make sure that nothing different or strange happens at
test time. Actually, enforcing consistency between train-time and
test-time transformations is the real value of skrub compared to using
pandas or polars to do the transformations.&lt;/p&gt;
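&lt;p&gt;To illustrate why this matters, here is a stdlib-only toy encoder (not skrub’s API) that freezes the categories seen at fit time, so that a category appearing only at test time cannot silently shift the encoding:&lt;/p&gt;

```python
# Toy illustration of train/test consistency for categorical encoding;
# skrub's TableVectorizer handles this (and much more) for real.
class ToyCategoryEncoder:
    def fit(self, values):
        # Learn a fixed mapping from the categories seen at train time
        self.categories_ = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        unknown = len(self.categories_)  # stable code for unseen values
        return [self.categories_.get(v, unknown) for v in values]

enc = ToyCategoryEncoder().fit(["Asthma", "Diabetes"])
train_codes = enc.transform(["Asthma", "Diabetes"])
# At test time, a new category does not change the learned encoding
test_codes = enc.transform(["Diabetes", "Cardiac arrhythmia"])
```

&lt;p&gt;Naively re-encoding the test dataframe with pandas or polars would instead recompute the category set and silently change the codes.&lt;/p&gt;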
&lt;/div&gt;
&lt;div class="section" id="increasing-support-of-polars"&gt;
&lt;h2&gt;Increasing support of polars&lt;/h2&gt;
&lt;div class="align-right docutils container"&gt;
Short-term goal of optimized support for pandas and polars&lt;/div&gt;
&lt;p&gt;We have implemented a new mechanism for supporting both pandas and
polars. It has not yet been applied across the whole codebase, so the
support is still imperfect. However, we are seeing increasing support for
polars in skrub, and our goal in the short term is to provide rock-solid
polars support.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="attachments/skrub_logo.png" style="width: 200px;" /&gt;
&lt;p&gt;Try skrub out! It’s still young, but in my opinion, it provides a lot
of value to tabular learning.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="skrub"></category><category term="scikit-learn"></category><category term="tabular"></category><category term="machine learning"></category><category term="open source"></category><category term="software"></category></entry><entry><title>Promoting open-source, from inria to :probabl.</title><link href="https://gael-varoquaux.info/programming/promoting-open-source-from-inria-to-probabl.html" rel="alternate"></link><published>2024-06-09T00:00:00+02:00</published><updated>2024-06-09T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2024-06-09:/programming/promoting-open-source-from-inria-to-probabl.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/scikit-learn_at_probabl.png" style="width: 300px;" /&gt;
&lt;p class="last"&gt;Open-source efforts around scikit-learn at Inria are spinning off to a
new enterprise, &lt;a class="reference external" href="https://probabl.ai"&gt;Probabl&lt;/a&gt;, in charge of
sustainable development of a data-science commons.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#prelude-funding-scikit-learn-is-hard" id="toc-entry-1"&gt;Prelude: funding scikit-learn is hard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-birth-of-a-new-ambition" id="toc-entry-2"&gt;The birth of a new ambition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-a-mission-driven-enterprise" id="toc-entry-3"&gt;Probabl, a mission-driven enterprise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-is-already-having-an-impact" id="toc-entry-4"&gt;Probabl is already having an impact&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#my-position-within-probabl-my-vested-interests" id="toc-entry-5"&gt;My position within Probabl …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/scikit-learn_at_probabl.png" style="width: 300px;" /&gt;
&lt;p class="last"&gt;Open-source efforts around scikit-learn at Inria are spinning off to a
new enterprise, &lt;a class="reference external" href="https://probabl.ai"&gt;Probabl&lt;/a&gt;, in charge of
sustainable development of a data-science commons.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#prelude-funding-scikit-learn-is-hard" id="toc-entry-1"&gt;Prelude: funding scikit-learn is hard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-birth-of-a-new-ambition" id="toc-entry-2"&gt;The birth of a new ambition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-a-mission-driven-enterprise" id="toc-entry-3"&gt;Probabl, a mission-driven enterprise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#probabl-is-already-having-an-impact" id="toc-entry-4"&gt;Probabl is already having an impact&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#my-position-within-probabl-my-vested-interests" id="toc-entry-5"&gt;My position within Probabl, my vested interests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#more-to-come" id="toc-entry-6"&gt;More to come&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="prelude-funding-scikit-learn-is-hard"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Prelude: funding scikit-learn is hard&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Scikit-learn is a &lt;a class="reference external" href="../programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html"&gt;central software component in today’s machine learning
landscape&lt;/a&gt;,
and it is open source, governed by a community, easy to install, and well
documented. It started many years ago as a project that we did on the
side, and we were joined by many volunteers, which was key to the success
of the project. We soon decided to ensure that scikit-learn was not
&lt;em&gt;only&lt;/em&gt; a volunteer-based effort. Over more than a decade, I’ve dedicated
a lot of energy to this, using a variety of funding mechanisms: first
grants (as an academic), then sponsoring and related contracts with
various actors.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;Digital commons eliminate scarcity and exclusivity&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Funding digital commons is really hard. People build fortunes by
leveraging competitive advantages, by creating lock-ins, or selling
access to data. What makes a great open-source library, as scikit-learn,
is exactly what prevents these tricks: we are committed to being
independent, easy to use and install, lightweight…&lt;/p&gt;
&lt;img src="../programming/attachments/probabl_rocket.svg" class="align-right" width="150px"&gt;&lt;/div&gt;
&lt;div class="section" id="the-birth-of-a-new-ambition"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;The birth of a new ambition&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Scikit-learn is very successful, but it could be more. For instance, it
does not facilitate pushing to production as much as TensorFlow, which
can be served, deployed to Android… And scikit-learn is not very
visible to top decision makers: it’s not a line on their budget, not a brand
that they know. As a consequence, it is not reaping the benefits of its
success &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Many commercial tools are sitting on top of open source software
like scikit-learn (splunk, sagemaker, to name only a few), making
profits, and not helping in any way the open source world that they
build upon.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="align-right docutils container"&gt;
&lt;em&gt;The French government is backing us to push the envelope&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Three years ago, the French government challenged us to go further, to consolidate
the ecosystem into a consistent data-science commons. The strategic
interest of France is to preserve some technological autonomy on data, eg
sensitive data. Thus, the government offered us, at Inria, a funding
opportunity to go further.&lt;/p&gt;
&lt;p&gt;They promised us a lot of money (dozens of millions of Euros), but with a
specific mission to develop a sustainable “data-science commons” &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;
ecosystem around scikit-learn. I’ll spare you the details of the number
of meetings we had and the documents we wrote to sketch the outline of the
project. I pushed forward a vision of technical components that fit in
the broader open-source ecosystem, complementing it.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;The letter that we received from the French government
specifically defines the objective in these words: “data-science
common” (“Communs numériques pour la Science des Données”)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As I moved forward, I faced a difficulty: the French government wanted a
&lt;strong&gt;sustainability plan&lt;/strong&gt;, and private investment to back it. To be honest,
this is not what I’m good at. François Goupil, the COO of the
scikit-learn consortium, was helping me, but we needed more for our
ambitions. And this is when we started talking to &lt;a class="reference external" href="https://www.linkedin.com/in/ylechelle/"&gt;Yann Lechelle&lt;/a&gt;, a tech entrepreneur with an
impressive track record interested in the impact of France on the global
tech world.&lt;/p&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/probabl_logo.jpeg" style="width: 100px;" /&gt;
&lt;/div&gt;
&lt;div class="section" id="probabl-a-mission-driven-enterprise"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Probabl, a mission-driven enterprise&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With Yann, we built a new vision. Our challenge is to be long-term
sustainable and virtuous for scikit-learn, its broader ecosystem, and its
community. Yann brought in a business point of view, and I tried to bring
that of open-source communities beyond Probabl &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;, for instance
avoiding getting in the way of others building businesses that
contribute to scikit-learn. Indeed, we are convinced that having a broad
and diverse community around scikit-learn is central to its future.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;One of the first things that Probabl did (Guillaume Lemaître, to
be specific), was submit a grant application (to the Chang-Zuckenberg
Institute), to fund, via NumFocus, a developer employed by
Quantsight, with no money transiting via Probabl (one reason being
that we have no operations outside of Europe so far).&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Our sustainability model is still being fine-tuned. What I can say is
that it will involve a mix of professional services, support &amp;amp; sponsorship
agreements, as well as a product-based offer, where we supplement
scikit-learn with enterprise features. Our focus will be on features that
are typically not the focus of open-source developers: integration in
large structures, such as access control, LDAP connection, regulatory
compliance. We will not shoehorn scikit-learn into open-core or dual-licensing
approaches: we want our incentives to be aligned with
scikit-learn, and its ecosystem, being as complete as possible.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Foster growth and adoption of our open-source stack&lt;/div&gt;
&lt;p&gt;In a sense, our inspiration is that of RedHat, where the growth of the
company fosters the growth and adoption of the software (Linux in the case
of RedHat), beyond the company, in an ecosystem, and for a wide variety
of applications.&lt;/p&gt;
&lt;p&gt;Strong growth will mean external capital. To ensure that we do not lose
focus on our mission of building data-science commons, Yann sketched
out a specific governance for the company (and then validated it with
many people, as we are a spin-off from a governmental organization). The
ultimate share structure, and the board, are divided into three electoral
colleges: one for outside investors, one for founders and employees, and
one for public institutions. This ensures a balance of power that will
hopefully keep us aligned with our mission. I think that this
structure sends a strong signal that we are not just another for-profit
that will drift from creating useful tech to dark money-generating patterns.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="probabl-is-already-having-an-impact"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Probabl is already having an impact&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A strong open-source team&lt;/strong&gt; In February, the whole team developing
scikit-learn at Inria moved to Probabl, joined by Adrin Jalali, a
Berlin-based core developer of scikit-learn and fairlearn. We’ve been
hiring excellent people, and we now have &lt;strong&gt;9 people working on open source&lt;/strong&gt; (see
the &lt;a class="reference external" href="https://probabl.ai/about"&gt;Probabl team&lt;/a&gt;), spending their time
contributing to open source (Jérémie, for instance, has been handling the
latest releases of scikit-learn).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fostering an ecosystem&lt;/strong&gt; Probabl is not only about scikit-learn. We are
prioritizing &lt;a class="reference external" href="https://probabl.ai/open-source"&gt;8 libraries&lt;/a&gt;, central to
the machine-learning and data science ecosystem: joblib, fairlearn,
imbalanced-learn… In general, as we have always done, we will not
hesitate to contribute to upstream or related projects. Our goal is to
have a healthy open-source ecosystem around data science.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Not only software&lt;/strong&gt; Not everybody sees the important lines of code.
I’ve become increasingly aware of the need for outreach and
communication, to coders, but also to decision makers. At Probabl we
dedicate energy to being in business meetings, to taking part in the tech
narrative, and to teaching how best to do data science, &lt;em&gt;e.g.&lt;/em&gt; with didactic
videos. We’re starting a mentoring program, we’ll be organizing
sprints… I am convinced that all of this is a useful long-term investment.&lt;/p&gt;
&lt;img alt="" class="align-center" src="../programming/attachments/probabl_robot_dog.jpeg" style="width: 360px;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="my-position-within-probabl-my-vested-interests"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;My position within Probabl, my vested interests&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I am a French civil servant (a researcher at Inria, one of our national
research institutes). Such a position comes with strong responsibilities
to control conflicts of interest. The creation of Probabl underwent
strict scrutiny (which took a very long time). I have recently been
cleared to take an active role: 10% of my time is allocated to being a
&lt;strong&gt;scientific and open-source advisor for Probabl&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I am not paid by Probabl&lt;/strong&gt;. 100% of my salary comes from Inria (and I was
not given a raise because of my involvement in Probabl). I do have financial
interests as a founder, but given my small active role, I hold
one of the smallest amounts of shares among the founders.&lt;/p&gt;
&lt;p&gt;My main interest in Probabl is really the success of its mission: the
long-term growth of an open-source data-science ecosystem. Spinning off
from Inria actually continues my efforts in this direction, but with more
agility and breadth. And building a variety of complementary commercial
activities on top of open source makes it stronger, by better answering
the needs of some actors.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="more-to-come"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;More to come&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There are many things that we are still ironing out. Settling
specific details takes time (for instance, clearing my role took a
while). We have yet to announce the future of the sponsorship program
that we had set up at the Inria foundation. Its mission has been
transferred to Probabl. Currently, Probabl’s open-source
team is ensuring continuity of our work with the existing sponsors.
But we will set up broader
partnership opportunities, with a similar governance, that enable
third parties to invest in open source on a roadmap decided jointly with
the open-source community.&lt;/p&gt;
&lt;p&gt;I believe that we need a lot of &lt;strong&gt;transparency&lt;/strong&gt; in how we set priorities
in our open-source team. Our 2024 priorities for scikit-learn are visible
&lt;a class="reference external" href="https://papers.probabl.ai/scikit-learns-priorities-at-probabl"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I look forward to the moment when Probabl starts adding value to scikit-learn
for enterprises, with an offer enriching scikit-learn and the broader
open-source ecosystem.&lt;/p&gt;
&lt;p&gt;I am acutely aware that good &lt;strong&gt;open source is made of communities&lt;/strong&gt;, and that
communities need to trust and understand big players such as Probabl
(well, so far we are not that big). I hope that with time our actions
will become easy to read and will speak for themselves.&lt;/p&gt;
&lt;img src="../programming/attachments/probabl_machine_heart.svg" class="align-center" width="400px"&gt;&lt;/div&gt;
</content><category term="programming"></category><category term="open source"></category><category term="growth"></category><category term="communities"></category><category term="scikit-learn"></category><category term="inria"></category><category term="probabl"></category></entry><entry><title>People underestimate how impactful Scikit-learn continues to be</title><link href="https://gael-varoquaux.info/programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html" rel="alternate"></link><published>2023-11-27T00:00:00+01:00</published><updated>2023-11-27T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2023-11-27:/programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;François Chollet rightfully said that people often underestimate the
impact of scikit-learn. I give here a few illustrations to back his
claim.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;A few days ago, François Chollet (the creator of Keras, the library
that democratized deep learning) &lt;a class="reference external" href="https://twitter.com/fchollet/status/1727186047115882624"&gt;posted&lt;/a&gt;:&lt;/p&gt;
&lt;a class="reference external image-reference" href="https://twitter.com/fchollet/status/1727186047115882624"&gt;&lt;img alt="Tweet from François Chollet: &amp;quot;People underestimate how impactful scikit-learn continues to be&amp;quot;" class="align-center" src="../programming/attachments/chollet_scikit_learn_impact.png" /&gt;&lt;/a&gt;
&lt;p&gt;Indeed, scikit-learn continues to be the most popular machine …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;François Chollet rightfully said that people often underestimate the
impact of scikit-learn. I give here a few illustrations to back his
claim.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;A few days ago, François Chollet (the creator of Keras, the library
that democratized deep learning) &lt;a class="reference external" href="https://twitter.com/fchollet/status/1727186047115882624"&gt;posted&lt;/a&gt;:&lt;/p&gt;
&lt;a class="reference external image-reference" href="https://twitter.com/fchollet/status/1727186047115882624"&gt;&lt;img alt="Tweet from François Chollet: &amp;quot;People underestimate how impactful scikit-learn continues to be&amp;quot;" class="align-center" src="../programming/attachments/chollet_scikit_learn_impact.png" /&gt;&lt;/a&gt;
&lt;p&gt;Indeed, scikit-learn continues to be the most popular machine-learning
library in surveys:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="../programming/attachments/kaggle_survey_library_2022.png"&gt;&lt;img alt="" src="../programming/attachments/kaggle_survey_library_2022.png" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Most popular machine-learning framework, according to &lt;a class="reference external" href="https://www.kaggle.com/kaggle-survey-2022"&gt;a Kaggle survey&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Scikit-learn is probably the most used machine-learning library&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;This popularity is sometimes underestimated, as scikit-learn is a small player
in terms of funding and team size, in particular
compared to giants such as TensorFlow and PyTorch. The size is limited
by the nature of the project: it is community-based, without a strong commercial
entity backing it.&lt;/p&gt;
&lt;p&gt;We target a different technology than TensorFlow and PyTorch: we have,
by design, let the big players focus on deep learning, which demands much
more resources. Rather, we have focused on classic machine learning,
believing that it serves other important needs. While such technologies
make the news less often, they are used a lot, and scikit-learn is massively
used:&lt;/p&gt;
&lt;table border="1" class="noborder docutils align-center"&gt;
&lt;caption&gt;&lt;strong&gt;Usage statistics&lt;/strong&gt; (from GitHub)&lt;/caption&gt;
&lt;colgroup&gt;
&lt;col width="33%" /&gt;
&lt;col width="33%" /&gt;
&lt;col width="33%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/scikit-learn/scikit-learn"&gt;&lt;img alt="sklearn_header" src="../programming/attachments/scikit-learn_header.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/pytorch/pytorch/"&gt;&lt;img alt="pytorch_header" src="../programming/attachments/pytorch_header.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/tensorflow/tensorflow"&gt;&lt;img alt="tensorflow_header" src="../programming/attachments/tensorflow_header.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/scikit-learn/scikit-learn"&gt;&lt;img alt="sklearn_used_by" src="../programming/attachments/scikit-learn_used_by.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/pytorch/pytorch/"&gt;&lt;img alt="pytorch_used_by" src="../programming/attachments/pytorch_used_by.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a class="reference external image-reference" href="https://github.com/tensorflow/tensorflow"&gt;&lt;img alt="tensorflow_used_by" src="../programming/attachments/tensorflow_used_by.png" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;By not focusing on deep learning, does scikit-learn risk becoming
outdated? Surveys show that simple models, such as linear models or models
based on trees (including boosting), are actually the most used:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="../programming/attachments/popular_ml_algorithm_2022.png"&gt;&lt;img alt="" src="../programming/attachments/popular_ml_algorithm_2022.png" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Most popular machine learning algorithm, according to &lt;a class="reference external" href="https://www.kaggle.com/code/dhirajkumar612/kaggle-survey-2022-data-analysis"&gt;a kaggle
survey&lt;/a&gt;
(apologies for the small fonts on the figure, I did not generate it)&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Gradient Boosted Trees is a good go-to model&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;There is a lot of hype surrounding deep learning, but it is most
often not the right tool to tackle tabular data. Tabular data has
different properties than images or text: it comes with heterogeneous
columns that make sense by themselves, and tree-based models have the
right inductive bias &lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;[Grinsztajn et al 2023]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;&lt;img alt="" src="../programming/attachments/benchmark_tree_models.png" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Benchmark comparing models on tabular data while tuning
hyper-parameters&lt;/strong&gt; (from &lt;a class="reference external" href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;Grinsztajn et al 2023&lt;/a&gt;) Each value corresponds to the test score of the
best model (on the validation set) after a specific time spent doing
random search. The
ribbon corresponds to the minimum and maximum scores on these 15
shuffles.
The models HistGradientBoostingTree, GradientBoostingTree, and
RandomForest come from scikit-learn. FTTransformer, SAINT, ResNet, and
MLP are all deep-learning architectures, with FTTransformer and SAINT
specifically developed for tabular data.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As we can see, scikit-learn’s &lt;a class="reference external" href="https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting"&gt;HistGradientBoosting&lt;/a&gt; really shines, delivering good prediction performance at a small computational cost. We strive to facilitate data science: keeping it lightweight, with good documentation and APIs.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Linear models and tree-based models are here to stay. They answer strong
needs in many application settings and they come with a small
operational cost.&lt;/p&gt;
&lt;p&gt;In my opinion, where scikit-learn could really grow to be even more
relevant is in integrating better into a broader ecosystem, going from
databases to deployment in production, becoming more “enterprise ready” :).&lt;/p&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="open-source"></category><category term="machine learning"></category></entry><entry><title>Comité de l’intelligence artificielle: vision et stratégie nationale</title><link href="https://gael-varoquaux.info/science/comite-de-lintelligence-artificielle-vision-et-strategie-nationale.html" rel="alternate"></link><published>2023-09-20T00:00:00+02:00</published><updated>2023-09-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2023-09-20:/science/comite-de-lintelligence-artificielle-vision-et-strategie-nationale.html</id><summary type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;English summary&lt;/p&gt;
&lt;p&gt;I have been appointed to the government-level panel of experts on AI,
to set the national vision and strategy in France.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I have the honor of being &lt;a class="reference external" href="https://www.gouvernement.fr/communique/comite-de-lintelligence-artificielle"&gt;appointed to the artificial intelligence committee of the French government&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The mission entrusted to us is to inform public policy …&lt;/p&gt;</summary><content type="html">&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;English summary&lt;/p&gt;
&lt;p&gt;I have been appointed to the government-level panel of experts on AI,
to set the national vision and strategy in France.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I have the honor of being &lt;a class="reference external" href="https://www.gouvernement.fr/communique/comite-de-lintelligence-artificielle"&gt;appointed to the artificial intelligence committee of the French government&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The mission entrusted to us is to inform public policy around
artificial intelligence, a technology that can affect many
aspects of society.&lt;/p&gt;
&lt;p&gt;The committee comprises experts with very varied profiles, ranging from
the young entrepreneur to the world-renowned economist. The difficulty will be to consider
the full set of links between technological progress and society. We will
seek to articulate a vision, to bring together the expertise of many
different actors on
different
topics, and to ground our projections in the current state of
scientific knowledge.&lt;/p&gt;
&lt;p&gt;I will not share the committee’s work ahead of time: building
consensus will require work, and that work takes time.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This mission goes beyond my usual domain of academic research
and software creation. I am doing it because I believe that, for
technology to have the best impact on society, there must be
a back-and-forth between technological creation and societal
change. If we scientists decide to focus solely
on our academic and technical work, we lose control over how
society adopts our technology; we leave that control
to the people who decide to use their energy to act, to influence,
and to profit directly from these technologies. As a computer-science researcher, working
both on fundamental AI and on applications in
health, I have expertise that is important to bring to the
table. As a civil servant, I think I can and must
inform the debate: I am less exposed to the risk of conflicts
of interest, and I am paid with public money to be useful to the public.&lt;/p&gt;
&lt;p&gt;This work is nevertheless not a political stance: I am a
scientist, not an elected official. The committee’s power is not to make
political decisions, but to inform about what is possible. It is a work of
synthesis and mediation.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Mise à jour: rapport disponible&lt;/p&gt;
&lt;p&gt;Nous avons publié en mars 2024 notre rapport, disponible &lt;a class="reference external" href="https://www.info.gouv.fr/actualite/25-recommandations-pour-lia-en-france"&gt;en ligne&lt;/a&gt;.
Il est très lisible et traite de tous les sujets autours de l’IA.
Lecture recommandée à tous.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="artificial intelligence"></category><category term="society"></category><category term="science"></category><category term="government"></category></entry><entry><title>2022, a new scientific adventure: machine learning for health and social sciences</title><link href="https://gael-varoquaux.info/science/2022-a-new-scientific-adventure-machine-learning-for-health-and-social-sciences.html" rel="alternate"></link><published>2023-01-31T00:00:00+01:00</published><updated>2023-01-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2023-01-31:/science/2022-a-new-scientific-adventure-machine-learning-for-health-and-social-sciences.html</id><summary type="html">&lt;p&gt;A retrospective on last year (2022): I embarked on a new scientific
adventure, assembling &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;a team&lt;/a&gt; focused on
developing machine learning for health and social science. The team has
existed for almost a year, and the vision is nicely shaping up. Let me
share with you illustrations of where we …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A retrospective on last year (2022): I embarked on a new scientific
adventure, assembling &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;a team&lt;/a&gt; focused on
developing machine learning for health and social science. The team has
existed for almost a year, and the vision is nicely shaping up. Let me
share with you illustrations of where we are at. This is extracted from
our yearly report which will be public later, but I have sometimes edited
it a bit to add personal context.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-new-team-soda" id="toc-entry-1"&gt;A new team: Soda&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-scientific-vision" id="toc-entry-2"&gt;The scientific vision&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#application-context-richer-data-in-health-and-social-sciences" id="toc-entry-3"&gt;Application context: richer data in health and social sciences&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#applications-raise-specific-data-science-challenges" id="toc-entry-4"&gt;Applications raise specific data-science challenges&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#our-research-axes" id="toc-entry-5"&gt;Our research axes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#some-notable-results-of-2022" id="toc-entry-6"&gt;Some notable results of 2022&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#learning-on-relational-data-aggregating-across-many-tables" id="toc-entry-7"&gt;Learning on relational data: aggregating across many tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#validating-probabilistic-classifiers-beyond-calibration" id="toc-entry-8"&gt;Validating probabilistic classifiers: beyond calibration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#reweighting-randomized-trials-for-generalization-finite-sample-error-and-variable-selection" id="toc-entry-9"&gt;Reweighting randomized trials for generalization: finite sample error and variable selection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#challenges-to-clinical-impact-of-ai-in-medical-imaging" id="toc-entry-10"&gt;Challenges to clinical impact of AI in medical imaging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#privacy-preserving-synthetic-educational-data-generation" id="toc-entry-11"&gt;Privacy-preserving synthetic educational data generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-new-team-soda"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;A new team: Soda&lt;/a&gt;&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/team_2022.jpg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;The team in early 2022 (it has grown a lot since)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;At &lt;a class="reference external" href="https://www.inria.fr/en"&gt;Inria&lt;/a&gt;, we have teams assembling multiple
tenured researchers around a scientific project. Last year, we assembled
a new team called &lt;a class="reference external" href="https://team.inria.fr/soda/"&gt;Soda&lt;/a&gt;, which stands for
“social data”, but above all is a fun name.&lt;/p&gt;
&lt;p&gt;In a year, the team grew like crazy (to be honest, this had been baking
for a little while). We are now around 25 people.
There are 4 PIs (Marine le Morvan, Judith Abécassis, Jill-Jênn Vie, and
myself); and the engineers working on scikit-learn at Inria are also part
of the team.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-scientific-vision"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;The scientific vision&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Machine learning to leverage richer, more complex, data for
social-sciences and health&lt;/em&gt;&lt;/p&gt;
&lt;div class="section" id="application-context-richer-data-in-health-and-social-sciences"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Application context: richer data in health and social sciences&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Opportunistic data accumulations, often observational, bear great
promise for the social and health sciences. But the data are too big and
complex for the standard statistical methodologies of these sciences.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Health databases&lt;/strong&gt; Increasingly rich health data is accumulated
during routine clinical practice as well as for research. Its large
coverage brings new promises for public health and personalized medicine,
but it does not fit easily in standard biostatistical practice because it
is not acquired and formatted for a specific medical question.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Social, educational, and behavioral sciences&lt;/strong&gt; Better data sheds new
light on human behavior and psychology, for instance with online
learning platforms. Machine learning can be used both as a model of
human intelligence and as a tool to leverage these data, for instance to
improve education.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="applications-raise-specific-data-science-challenges"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Applications raise specific data-science challenges&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Data management: preparing dirty data for analytics&lt;/strong&gt; Assembling,
curating, and transforming data for data analysis is very labor
intensive. These data-preparation steps are often considered the number
one bottleneck to data science. They mostly rely on data-management
techniques. A typical problem is establishing correspondences between
entries that denote the same entities but appear in different forms
(entity linking, including deduplication and record linkage). Another
time-consuming process is to join and aggregate data across multiple
tables with repetitions at different levels (as with panel data in
econometrics and epidemiology) to form a unique set of “features” to
describe each individual.&lt;/p&gt;
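As an illustration, here is a minimal pandas sketch of this join-and-aggregate step; the tables and column names are hypothetical, made up for the example:

```python
# Minimal sketch of joining and aggregating across tables with pandas.
# The tables and column names are hypothetical, for illustration only.
import pandas as pd

# One row per individual.
people = pd.DataFrame({"person_id": [1, 2], "age": [34, 51]})

# One-to-many relation: several visits per individual (panel-like data).
visits = pd.DataFrame({
    "person_id": [1, 1, 2],
    "cost": [100.0, 50.0, 80.0],
})

# Aggregate the "many" side down to one row per individual...
per_person = (
    visits.groupby("person_id")["cost"].agg(["sum", "count"]).reset_index()
)

# ...then join back, forming a single table of features per individual.
features = people.merge(per_person, on="person_id", how="left")
print(features)
```

The hard part in practice is not the syntax but choosing the aggregations and handling the irregularity of real records; this is the step that machine learning increasingly helps automate.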
&lt;div class="sidebar"&gt;
The &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;Dirty Data project&lt;/a&gt; paved the way.&lt;/div&gt;
&lt;p&gt;Progress in machine learning increasingly helps automate data
preparation and process data with less curation.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Data science with statistical machine learning&lt;/strong&gt; Machine learning can
be a tool to answer complex domain questions by providing non-parametric
estimators. Yet, it still requires much work, e.g. to go beyond point
estimators, to derive non-parametric procedures that account for a
variety of biases (censoring, sampling biases, non-causal associations), or
to provide theoretical and practical tools to assess the validity of
estimates and conclusions in weakly-parametric settings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="our-research-axes"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Our research axes&lt;/a&gt;&lt;/h3&gt;
&lt;div class="section" id="representation-learning-for-relational-data"&gt;
&lt;h4&gt;Representation learning for relational data&lt;/h4&gt;
&lt;p&gt;I dream of deep-learning methodology for relational databases, from
tabular datasets to full relational databases. The stakes are &lt;em&gt;i)&lt;/em&gt; to
build machine-learning models that apply readily to the raw data, so as to
minimize manual cleaning, data formatting, and integration, and &lt;em&gt;ii)&lt;/em&gt; to
extract reusable representations that reduce sample complexity on new
databases by transforming the data into well-distributed vectors.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="mathematical-aspects-of-statistical-learning-for-data-science"&gt;
&lt;h4&gt;Mathematical aspects of statistical learning for data science&lt;/h4&gt;
&lt;p&gt;I want to use machine-learning models as non-parametric estimators, as I
worry about the impact of mismodeling on conclusions. However, for a given
statistical task, the statistical procedures and validity criteria need
to be reinvented. Soda contributes statistical tools and results for a
variety of problems important to data science in health and social
science (epidemiology, econometrics, education). These fields lead to
various statistical topics:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Causal inference&lt;/li&gt;
&lt;li&gt;Model validation&lt;/li&gt;
&lt;li&gt;Uncertainty quantification&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-for-health-and-social-sciences"&gt;
&lt;h4&gt;Machine learning for health and social sciences&lt;/h4&gt;
&lt;p&gt;Soda targets applications in health and social sciences, as these can
markedly benefit from advanced processing of richer datasets and can have a
large societal impact, but fall outside mainstream machine-learning
research, which focuses on processing natural images, language, and voice.
Rather, data surveying humans needs another focus: it is most of the time
tabular, sparse, with a time dimension, and with missing values. In terms of
application fields, we focus on the social sciences that rely on
quantitative predictions or analysis across individuals, such as policy
evaluation. Indeed, the same formal problems, addressed in the two
research axes above, arise across various social sciences:
&lt;strong&gt;epidemiology, education research, and economics&lt;/strong&gt;.
The challenge is to develop efficient and trustworthy machine-learning
methodology for these high-stakes applications.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="high-quality-data-science-software"&gt;
&lt;h4&gt;High-quality data-science software&lt;/h4&gt;
&lt;p&gt;Societal and economical impact of machine learning requires easy-to-use
practical tools that can be leveraged in non-specialized organizations
such as hospitals or policy-making institutions.&lt;/p&gt;
&lt;p&gt;Soda incorporates the core team working at Inria on &lt;strong&gt;scikit-learn&lt;/strong&gt;, one
of the most popular machine-learning tools worldwide. One of the missions
of soda is to improve scikit-learn and its documentation, transferring the
understanding of machine learning and data science accumulated by the
various research efforts.&lt;/p&gt;
&lt;p&gt;Soda works on other important software tools to foster the growth and
health of the Python data ecosystem in which scikit-learn is embedded.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="some-notable-results-of-2022"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;Some notable results of 2022&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I am listing here a few of the team’s achievements, because I find
them inspiring.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="learning-on-relational-data-aggregating-across-many-tables"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;Learning on relational data: aggregating across many tables&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For many machine-learning tasks, augmenting the data table at hand with
features built from external sources is key to improving performance. For
instance, estimating housing prices benefits from background information
on the location, such as the population density or the average income.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/aggregating.png" style="width: 300px;" /&gt;
&lt;p class="caption"&gt;Often, data must be assembled across multiple tables into a single
table for analysis. Challenges arise due to one-to-many relations,
irregularity of the information, and the number of tables that may be
involved.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Most often, a major bottleneck is to &lt;strong&gt;assemble this information across
many tables&lt;/strong&gt;, requiring time and expertise from the data scientist. We
propose &lt;strong&gt;vectorial representations of entities (e.g. cities) that capture
the corresponding information&lt;/strong&gt; and thus can replace human-crafted
features. In &lt;a class="reference external" href="https://link.springer.com/article/10.1007/s10994-022-06277-7"&gt;Cvetkov-Iliev 2023&lt;/a&gt;, we
represent the relational data on the entities as a graph and adapt
graph-embedding methods to create feature vectors for each entity. We
show that two technical ingredients are crucial: modeling well the
different relationships between entities, and capturing numerical
attributes. We adapt knowledge graph embedding methods that were
primarily designed for graph completion. Yet, they model only discrete
entities, while creating good feature vectors from relational data also
requires capturing numerical attributes. For this, we introduce KEN:
Knowledge Embedding with Numbers. We thoroughly evaluate approaches to
enrich features with background information on 7 prediction tasks. We
show that a good embedding model coupled with KEN can perform better than
manually handcrafted features, while requiring much less human effort. It
is also competitive with combinatorial feature engineering methods, but
much more scalable. Our approach can be applied to huge databases, for
instance general knowledge graphs such as YAGO, creating &lt;strong&gt;general-purpose
feature vectors reusable in various downstream tasks&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2022/entity_types_with_names.png" style="width: 100%;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Entity embeddings of YAGO (wikipedia)&lt;/strong&gt; (2D-representation using
UMAP). The vectors are downloadable from
&lt;a class="reference external" href="https://soda-inria.github.io/ken_embeddings"&gt;https://soda-inria.github.io/ken_embeddings&lt;/a&gt;} to readily augment
data-science projects.&lt;/p&gt;
&lt;/div&gt;
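&lt;p&gt;As a sketch of how such general-purpose embeddings can augment a data-science project (a toy example: the table contents and column names below are illustrative, not the schema of the distributed files), enriching a table reduces to a single join on the entity:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical analysis table, keyed by an entity (here, cities)
housing = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris"],
    "surface_m2": [30, 55, 72],
    "price": [350_000, 280_000, 690_000],
})

# Hypothetical pre-computed entity embeddings (illustrative column names)
embeddings = pd.DataFrame({
    "entity": ["Paris", "Lyon"],
    "dim_0": [0.12, 0.34],
    "dim_1": [0.56, 0.78],
})

# A single left join replaces manual feature engineering on the entity:
# each city brings along its general-purpose feature vector
enriched = housing.merge(embeddings, left_on="city", right_on="entity",
                         how="left")
print(enriched.shape)
```

&lt;p&gt;The enriched table can then feed any downstream estimator, with the embedding dimensions acting as ready-made features.&lt;/p&gt;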
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="validating-probabilistic-classifiers-beyond-calibration"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-8"&gt;Validating probabilistic classifiers: beyond calibration&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/grouping_loss.png" style="width: 360px;" /&gt;
&lt;p class="caption"&gt;Validating probabilistic predictions of classifiers must go account
not only for the average error given an predicted score, but also for
the dispersion of errors.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Ensuring that a classifier gives reliable confidence scores is essential
for informed decision-making, in particular in high-stakes areas such as
health. For instance, before using a clinical prognostic model, we want
to establish that, for a given individual, it attributes probabilities to
the different clinical outcomes that can indeed be trusted. To this end,
recent work has focused on miscalibration, &lt;em&gt;i.e.&lt;/em&gt;, the over- or
under-confidence of model scores.&lt;/p&gt;
&lt;p&gt;Yet calibration is not enough: even a perfectly calibrated classifier
with the best possible accuracy can have confidence scores that are far
from the true posterior probabilities, if it is over-confident for some
samples and under-confident for others. This is captured by the grouping
loss, created by samples with &lt;strong&gt;the same confidence scores but different
true posterior probabilities&lt;/strong&gt;. Proper scoring rule theory shows that given
the calibration loss, the missing piece to characterize individual errors
is the grouping loss. While there are many estimators of the calibration
loss, none exists for the grouping loss in standard settings. In
&lt;a class="reference external" href="https://arxiv.org/abs/2210.16315"&gt;Perez-Level 2023&lt;/a&gt;, we propose an
estimator to approximate the grouping loss. We show that modern neural
network architectures in vision and NLP exhibit grouping loss, notably in
distribution-shift settings, which highlights the importance of
pre-production validation.&lt;/p&gt;
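&lt;p&gt;A toy numerical sketch of the phenomenon (illustrative numbers only, not the estimator of the paper): two groups of samples receive the same confidence score, so the score is well calibrated on average, yet it is far from the true posterior for every individual sample:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Two latent groups that a classifier scores identically at 0.5, although
# their true posterior probabilities differ (0.2 vs 0.8)
true_posterior = np.array([0.2] * 500 + [0.8] * 500)
score = np.full(1000, 0.5)
labels = rng.binomial(1, true_posterior)

# Calibration looks fine: among samples scored 0.5, the empirical event
# rate is close to 0.5
calibration_gap = abs(labels.mean() - 0.5)

# Yet the score is off by 0.3 for every single sample: this dispersion of
# the true posterior within a level set of the score is what drives the
# grouping loss
dispersion = np.mean((true_posterior - score) ** 2)
print(round(calibration_gap, 3), round(dispersion, 3))
```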
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="reweighting-randomized-trials-for-generalization-finite-sample-error-and-variable-selection"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-9"&gt;Reweighting randomized trials for generalization: finite sample error and variable selection&lt;/a&gt;&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2022/reweighting_trial.png" style="width: 360px;" /&gt;
&lt;p class="caption"&gt;There may be a sampling bias between a randomized trial and the
target population.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Randomized Controlled Trials (RCTs) are ideal experiments to establish
causal statements. However, they may suffer from a limited scope, in
particular because they may have been run on non-representative samples:
some RCTs over- or under-sample individuals with certain characteristics
compared to the target population, for which one wants conclusions on
treatment effectiveness. Re-weighting trial individuals to match the
target population can improve the treatment effect estimation.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://hal.science/hal-03822662"&gt;Colnet 2022&lt;/a&gt;, we establish the
exact expressions of the bias and variance of such reweighting procedures,
also called Inverse Propensity of Sampling Weighting (IPSW), in the
presence of categorical covariates for any sample size. Such results
allow us to compare the theoretical performance of different versions of
IPSW estimates. Besides, our results show how the performance (bias,
variance, and quadratic risk) of IPSW estimates depends on the two sample
sizes (RCT and target population). A by-product of our work is the proof
of consistency of IPSW estimates. Results also reveal that IPSW
performance improves when the trial probability of being treated is
estimated (rather than using its oracle counterpart). In addition, we
study the &lt;strong&gt;choice of variables&lt;/strong&gt;: how including covariates that are not
necessary for identifiability of the causal effect may impact the
asymptotic variance. Including covariates that are shifted between the
two samples but are not treatment effect modifiers increases the variance,
while covariates that are treatment effect modifiers but not shifted do not.&lt;/p&gt;
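&lt;p&gt;The core reweighting idea can be sketched in a few lines (a toy simulation with a single binary covariate; this is an illustration of IPSW, not the finite-sample analysis of the paper):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one binary covariate that is both shifted between the trial
# and the target population and a treatment effect modifier
n_trial, n_target = 2000, 10000
x_trial = rng.binomial(1, 0.8, n_trial)    # the trial over-samples x=1
x_target = rng.binomial(1, 0.3, n_target)  # target population
treat = rng.binomial(1, 0.5, n_trial)      # randomized treatment
effect = np.where(x_trial == 1, 2.0, 0.0)  # effect differs across strata
y = effect * treat + rng.normal(size=n_trial)

# Estimated stratum frequencies in each sample
p_trial = np.array([np.mean(x_trial == 0), np.mean(x_trial == 1)])
p_target = np.array([np.mean(x_target == 0), np.mean(x_target == 1)])

# IPSW: weight each trial individual by the target/trial frequency ratio
# of its stratum
w = (p_target / p_trial)[x_trial]

def weighted_ate(y, treat, w):
    treated = np.sum(w * treat * y) / np.sum(w * treat)
    control = np.sum(w * (1 - treat) * y) / np.sum(w * (1 - treat))
    return treated - control

naive = weighted_ate(y, treat, np.ones(n_trial))  # close to the trial ATE, 1.6
ipsw = weighted_ate(y, treat, w)                  # close to the target ATE, 0.6
print(round(naive, 2), round(ipsw, 2))
```

&lt;p&gt;The naive difference of means reflects the trial’s mix of strata, while the reweighted estimate recovers an effect representative of the target population.&lt;/p&gt;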
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="challenges-to-clinical-impact-of-ai-in-medical-imaging"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-10"&gt;Challenges to clinical impact of AI in medical imaging&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I have worked for many years on research in computer analysis of medical
images. In particular, I am convinced that machine learning bears many
promises to improve patients’ health. However, I cannot be blind to the
fact that a number of systematic challenges are slowing down the progress
of the field.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://www.nature.com/articles/s41746-022-00592-y"&gt;Varoquaux &amp;amp; Cheplygina&lt;/a&gt;, we tried to take
a step back on these challenges, from limitations of the data, such as
biases, to research incentives, such as optimizing for publication. We
reviewed roadblocks to developing and assessing methods. Building our
analysis on evidence from the literature and data challenges, we showed
that potential biases can creep in at every step.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;First, larger datasets do not bring increased prediction accuracy and
may suffer from biases.&lt;/li&gt;
&lt;li&gt;Second, evaluations often miss the target, with evaluation error larger
than algorithmic improvements, improper evaluation procedures and
leakage, metrics that do not reflect the application, incorrectly chosen
baselines, and improper statistics.&lt;/li&gt;
&lt;li&gt;Finally, we show how publishing too often leads to distorted incentives.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On a positive note, we also discuss ongoing efforts to counteract these
problems and provide recommendations on how to further address them in
the future.&lt;/p&gt;
&lt;p&gt;This was a fun exercise. I realize that I still need to sit on it and
introspect how it has shaped my research agenda, because I think it has
pushed me to choose specific emphases (such as model evaluation, or
focusing on rich data sources).&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="privacy-preserving-synthetic-educational-data-generation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-11"&gt;Privacy-preserving synthetic educational data generation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Soda also works on applications other than health, for instance
education. In this direction, I would like to highlight work in which I
did not participate, by Jill-Jenn Vie, another PI of the team.&lt;/p&gt;
&lt;p&gt;Institutions collect massive learning traces but may not disclose them
for privacy reasons. Synthetic data generation opens new opportunities for
research in education. &lt;a class="reference external" href="https://hal.inria.fr/hal-03715416"&gt;Vie 2022&lt;/a&gt;
presented a generative model for educational data that can preserve the
privacy of participants, and an evaluation framework for comparing
synthetic data generators. We show how naive pseudonymization can lead to
re-identification threats and suggest techniques to guarantee privacy. We
evaluate our method on existing massive educational open datasets.&lt;/p&gt;
&lt;p&gt;The tension between privacy of individuals and the need for datasets for
open science is a real and important one.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This was just a quick glance at what we do at soda, and we are just
warming up. I am super excited about this research. I hope that it will
matter.&lt;/p&gt;
&lt;p&gt;I truly believe that more and better machine learning can help health
and social sciences draw new insights from new datasets.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="health"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>My Mayavi story: discovering open source communities</title><link href="https://gael-varoquaux.info/programming/my-mayavi-story-discovering-open-source-communities.html" rel="alternate"></link><published>2022-07-10T00:00:00+02:00</published><updated>2022-07-10T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2022-07-10:/programming/my-mayavi-story-discovering-open-source-communities.html</id><summary type="html">&lt;p class="align-right"&gt;&lt;em&gt;The Mayavi Python software, and my personal history: A thread on
the Python and scipy ecosystems, building an open-source codebase, and
meeting really cool and friendly people&lt;/em&gt;&lt;/p&gt;
&lt;img alt="" class="align-right" src="attachments/mayavi/mayavi_ets.png" /&gt;
&lt;p&gt;I am writing today as a goodbye to the project: I used to be one of the
core contributors and maintainers but have been …&lt;/p&gt;</summary><content type="html">&lt;p class="align-right"&gt;&lt;em&gt;The Mayavi Python software, and my personal history: A thread on
the Python and scipy ecosystems, building an open-source codebase, and
meeting really cool and friendly people&lt;/em&gt;&lt;/p&gt;
&lt;img alt="" class="align-right" src="attachments/mayavi/mayavi_ets.png" /&gt;
&lt;p&gt;I am writing today as a goodbye to the project: I used to be one of the
core contributors and maintainers but have been inactive for a while for
lack of time. By common agreement, we recently removed my commit
rights to limit security risks.&lt;/p&gt;
&lt;p&gt;Mayavi brought me so much!&lt;/p&gt;
&lt;div class="section" id="the-start-of-my-adventure-with-mayavi"&gt;
&lt;h2&gt;The start of my adventure with Mayavi&lt;/h2&gt;
&lt;img alt="" class="align-right" src="attachments/mayavi/example_magnetic_field_lines.jpg" /&gt;
&lt;p&gt;I got involved around 2007: I needed 3D visualization of magnetic fields as I was designing coils for my PhD &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;This led to an example in the Mayavi docs &lt;a class="reference external" href="http://docs.enthought.com/mayavi/mayavi/auto/example_magnetic_field_lines.html"&gt;http://docs.enthought.com/mayavi/mayavi/auto/example_magnetic_field_lines.html&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I started as an early user of Mayavi2, a rewrite of Mayavi, and
eventually joined Prabhu Ramachandran and Enthought as a contributor.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="what-is-mayavi"&gt;
&lt;h2&gt;What is Mayavi?&lt;/h2&gt;
&lt;p&gt;Mayavi is a scientific 3D visualization library in Python.&lt;/p&gt;
&lt;p&gt;It enables interactive visualization to understand complex information in
3D, such as multi-physics fields, combined with &lt;a class="reference external" href="https://docs.enthought.com/mayavi/mayavi/mlab.html"&gt;simple scripting&lt;/a&gt; to integrate into a
broader scientific-computing workflow.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Mayavi was designed and founded around 2000 by Prabhu Ramachandran, a
researcher in computational fluid dynamics at IIT Bombay and a long-time
figure of the open-source Python world.&lt;/p&gt;
&lt;p&gt;The key idea was to make VTK, a powerful C++ visualization library,
easily usable through a Python interface.&lt;/p&gt;
&lt;p&gt;Mayavi bridged the gap between VTK’s C++ data structures and efficient Python data structures, exposing them as numpy arrays without copies.&lt;/p&gt;
&lt;p&gt;It uses tools from Enthought (namely the Enthought Tool Suite) for an
interactive GUI built on a Python object model, fully scriptable (the
vision is explained in &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-00502548"&gt;an article Prabhu and I wrote&lt;/a&gt;)&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/mayavi/mayavi_application.png" /&gt;
&lt;p class="caption"&gt;Mayavi is a full-blown interactive application&lt;/p&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/mayavi/mayavi_mlab.jpg" /&gt;
&lt;p class="caption"&gt;Mayavi is also a Python library, for full scripting&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="working-on-mayavi-taught-me-code-and-communities"&gt;
&lt;h2&gt;Working on Mayavi taught me code and communities&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/mayavi/mayavi_ipython.png" /&gt;
&lt;p class="caption"&gt;Mayavi used within an interactive IPython – an image from the
&lt;a class="reference external" href="https://ieeexplore.ieee.org/abstract/document/5725237"&gt;Mayavi paper&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I joined to help with the “mlab” interface, for even simpler Python
scripting built upon functions. My goal was to make Mayavi natural to
Matlab and matplotlib users, a product vision that probably helped push
its popularity even further.&lt;/p&gt;
&lt;p&gt;I was an isolated PhD student in a physics lab. Emboldened by a
discussion with Fernando Perez, I started contributing and discussing
with Prabhu Ramachandran. I remember my first Skype discussion with
Prabhu; I was very intimidated.&lt;/p&gt;
&lt;p&gt;Understanding this large codebase was hard! And yet, slowly but surely,
I started making more and more meaningful contributions: on mlab, then on
the broader codebase, fixing bugs, and doing a lot of work on
documentation and examples…&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/mayavi/scipy_conf.jpg" /&gt;
&lt;p class="caption"&gt;Prabhu and myself are in this scipy conference group picture! From &lt;a class="reference external" href="https://slideshare.net/enthought/scientific-computing-with-python-webinar-august-28-2009"&gt;https://slideshare.net/enthought/scientific-computing-with-python-webinar-august-28-2009&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Then Enthought funded my overseas travel to the scipy conference: a big
deal for me, as I was a penniless PhD student.&lt;/p&gt;
&lt;p&gt;My Mayavi story is that of meeting amazing people in the Python, scipy,
and pydata world; people who believe in building a tool stack to
democratize scientific computing; people from all over the world,
friendly, welcoming, passionate.&lt;/p&gt;
&lt;p&gt;It founded my belief in communities.&lt;/p&gt;
&lt;p&gt;This adventure led me to learn software engineering (&lt;a class="reference external" href="https://software-carpentry.org/"&gt;Software carpentry&lt;/a&gt; really helped me get started), to
work at Enthought (a software startup central to scientific computing in
Python), to change careers from physics to computing, to join Inria (the
French national research institute for maths and computing), and to other
open-source projects…&lt;/p&gt;
&lt;p&gt;Mayavi was crucial to my personal adventure. Thank you Prabhu! Thank you
Enthought! Thank you, Scipy community!&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="mayavi"></category><category term="python"></category><category term="science"></category><category term="conferences"></category></entry><entry><title>2021 highlight: Decoding brain activity to new cognitive paradigms</title><link href="https://gael-varoquaux.info/science/2021-highlight-decoding-brain-activity-to-new-cognitive-paradigms.html" rel="alternate"></link><published>2022-02-24T00:00:00+01:00</published><updated>2022-02-24T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2022-02-24:/science/2021-highlight-decoding-brain-activity-to-new-cognitive-paradigms.html</id><summary type="html">&lt;p class="align-right"&gt;&lt;em&gt;Broad decoding models that can specialize to discriminate
closely-related mental processes with limited data&lt;/em&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;TL;DR&lt;/p&gt;
&lt;p&gt;Decoding models can help isolate which mental processes are implied
by the activation of given brain structures. But to support a broad
conclusion, they must be trained on many studies, a difficult problem
given …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p class="align-right"&gt;&lt;em&gt;Broad decoding models that can specialize to discriminate
closely-related mental processes with limited data&lt;/em&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;TL;DR&lt;/p&gt;
&lt;p&gt;Decoding models can help isolate which mental processes are implied
by the activation of given brain structures. But to support a broad
conclusion, they must be trained on many studies, a difficult problem
given the unclear relations between tasks of different studies. We
contributed a method that infers these links from the data. Their
validity is established by generalization to new tasks. Some
cognitive neuroscientists prefer qualitative consolidation of
knowledge, but such an approach is hard to put to the test.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="context-infering-cognition-from-brain-imaging"&gt;
&lt;h2&gt;Context: Inferring cognition from brain imaging&lt;/h2&gt;
&lt;p&gt;Often, when interpreting functional brain images, one would like to
conclude on the individual’s ongoing mental processes. But this
conclusion is not directly warranted by brain-imaging studies, as they do
not control the brain activity, but rather engage the participant via a
cognitive paradigm made of psychological manipulations &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. &lt;em&gt;Brain
decoding&lt;/em&gt; can help ground such &lt;em&gt;reverse inferences&lt;/em&gt; &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;, by using
machine learning to predict aspects of the task.&lt;/p&gt;
&lt;p&gt;But a brain decoding model can seldom support broad reverse-inference
claims, as typical decoding models are trained on a given study that
samples only a few aspects of cognition. Thus the decoding model only
supports conclusions on the interpretation of brain activity within the
study’s narrow scope.&lt;/p&gt;
&lt;p&gt;Another challenge is that of statistical power. Most functional brain
imaging studies comprise only a few dozen subjects, compromising
statistical power &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;, even more so when using machine learning &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-4"&gt;[4]&lt;/a&gt;.
While large acquisition efforts exist, these must focus on broad
psychological manipulations that do not probe fine aspects of mental
processes.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2006, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1364661305003360"&gt;Can cognitive processes be inferred from
neuroimaging data?&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2011, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S0896627311009895"&gt;Inferring Mental States from Neuroimaging Data:
From Reverse Inference to Large-Scale Decoding&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Poldrack 2017, &lt;a class="reference external" href="https://www.nature.com/articles/nrn.2016.167"&gt;Scanning the horizon: towards transparent and
reproducible neuroimaging research&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-4"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Varoquaux 2018, &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305311"&gt;Cross-validation failure: Small sample sizes lead
to large error bars&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="contribution-informing-specialized-decoding-questions-from-broad-data-accumulation"&gt;
&lt;h2&gt;Contribution: Informing specialized decoding questions from broad data accumulation&lt;/h2&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008795"&gt;Mensch 2021&lt;/a&gt;,
we designed a machine-learning method that can &lt;strong&gt;jointly analyze many
unrelated functional imaging studies to build representations associating
brain activity to mental processes&lt;/strong&gt;. These representations can then be
used to &lt;strong&gt;improve brain decoding in new unrelated studies&lt;/strong&gt;, thus bringing
statistical-power improvements even to experiments probing fine aspects
of mental processes not studied in large cohorts.&lt;/p&gt;
&lt;p&gt;One roadblock to accumulating information across
cognitive neuroimaging studies is that they all probe different, yet related,
mental processes. Framing them all in the same analysis faces the lack of a
universally-adopted language to describe cognitive paradigms. Our prior
work &lt;a class="footnote-reference" href="#footnote-5" id="footnote-reference-5"&gt;[5]&lt;/a&gt; on this endeavior –the quest for universal decoding across
studies–, relied on describing each experimental paradigm in an ontology
of cognitive processes and psychological manipulations. However, such
approach is not scalable. Here, rather, we infered the latent structure
of the tasks from the data, without explicitely modeling the links
between studies. In my eye, this was a very important ingredient of our
work, and it is non trivial that it enables improving the decoding of
unrelated studies.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-5" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-5"&gt;[5]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Varoquaux 2018, &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;Atlases of cognition with large-scale human brain
mapping&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Capturing &lt;em&gt;representations&lt;/em&gt; was key to transferring across studies:
representations of brain activity captured distributed brain structures
predictive of behavior; representations of tasks across studies captured
decompositions of behavior well explained by brain activity. Of course,
the representations that we extracted were not as sharp as the stylized
functional modules that have been manually compiled from decades of
cognitive-neuroscience research.&lt;/p&gt;
&lt;p&gt;From a computer-science standpoint, we used a deep-learning architecture.
This is the first time that we witnessed a
deep-learning architecture outperforming well-tuned shallow baselines on
functional neuroimaging data &lt;a class="footnote-reference" href="#footnote-6" id="footnote-reference-6"&gt;[6]&lt;/a&gt;. This success is likely due to the
massive amount of data that we assembled: as our method can
readily work across studies, we were able to apply it to 40000
subject-level contrast maps.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-6" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-6"&gt;[6]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;There have been many reports of deep architectures on functional
brain imaging. However, in our experience, good shallow benchmarks
are hard to beat.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2021_highlights/mston.png" /&gt;
&lt;p class="caption"&gt;Our deep-learning architecture&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-research-agenda-that-does-not-win-all-hearts"&gt;
&lt;h2&gt;A research agenda that does not win all hearts&lt;/h2&gt;
&lt;p&gt;Our underlying research agenda is to &lt;strong&gt;piece together
cognitive-neuroimaging evidence on a wide variety of tasks and mental
processes&lt;/strong&gt;. In cognitive neuroscience, such consolidation of knowledge
is done via review articles that assemble findings from many
publications into a consistent picture of how tasks decompose into
elementary mental processes implemented by brain functional modules. The
literature review and the ensuing neuro-cognitive model are however verbal
by nature: assembling qualitative findings. I, for one, would like to
have quantitative tools to foster a big-picture view. Of course, the
challenge with quantitative approaches such as ours is to capture all
qualitative aspects of the question.&lt;/p&gt;
&lt;p&gt;Over the years that I have been pushing these ideas, I find that they are
met with resistance from some elite cognitive neuroscientists who see
them as unexciting at best. The same people are enthusiastic about new
data-analysis methods to dissect brain responses in fine detail with a
detailed model of a given task, despite limited statistical power and
external validity. My feeling is that &lt;strong&gt;the question of how
various tasks are related is perceived as belonging to the walled garden
of cognitive neuroscientists, not to be put to the test by statistical
methods&lt;/strong&gt; &lt;a class="footnote-reference" href="#footnote-7" id="footnote-reference-7"&gt;[7]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-7" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-7"&gt;[7]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article/peerReview?id=10.1371/journal.pcbi.1008795"&gt;The second round of review of our manuscript&lt;/a&gt;
certainly felt as if the method was judged through cognitive-neuroscience
lenses, rather than on the validity of the data analysis that it entailed.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Yet, as clearly exposed by Tal Yarkoni in his &lt;a class="reference external" href="https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/generalizability-crisis/AD386115BA539A759ACB3093760F4824"&gt;Generalizability crisis&lt;/a&gt;,
drawing conclusions on mental organization from a few repetitions of a
task risks picking up idiosyncrasies of the task or the stimuli.
A starting point of our work (&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008795"&gt;Mensch 2021&lt;/a&gt;)
was the fall of statistical power in cognitive neuroscience, documented
by &lt;a class="reference external" href="https://www.nature.com/articles/nrn.2016.167"&gt;Poldrack 2017&lt;/a&gt;, but
one reviewer censored this argument &lt;a class="footnote-reference" href="#footnote-8" id="footnote-reference-8"&gt;[8]&lt;/a&gt;. This exchange felt to me like &lt;strong&gt;a
field refusing to discuss its challenges publicly&lt;/strong&gt;, which leaves no room for
methods researchers such as myself to address them.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-8" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-8"&gt;[8]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article/peerReview?id=10.1371/journal.pcbi.1008795"&gt;Comments in the first review&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>Hiring an engineer and post-doc to simplify data science on dirty data</title><link href="https://gael-varoquaux.info/programming/hiring-an-engineer-and-post-doc-to-simplify-data-science-on-dirty-data.html" rel="alternate"></link><published>2021-10-29T00:00:00+02:00</published><updated>2021-10-29T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2021-10-29:/programming/hiring-an-engineer-and-post-doc-to-simplify-data-science-on-dirty-data.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Join us to work on reinventing data-science practices and tools to
produce robust analysis with less data curation.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is well known that data cleaning and preparation are a heavy burden to
the data scientist.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/big_data_borat_cleaning_data.png" style="width: 400px;" /&gt;
&lt;div class="section" id="dirty-data-research"&gt;
&lt;h2&gt;Dirty data research&lt;/h2&gt;
&lt;p&gt;In the &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;dirty data project&lt;/a&gt;, we
have been conducting machine-learning research …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Join us to work on reinventing data-science practices and tools to
produce robust analysis with less data curation.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is well known that data cleaning and preparation are a heavy burden to
the data scientist.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/big_data_borat_cleaning_data.png" style="width: 400px;" /&gt;
&lt;div class="section" id="dirty-data-research"&gt;
&lt;h2&gt;Dirty data research&lt;/h2&gt;
&lt;p&gt;In the &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;dirty data project&lt;/a&gt;, we
have been conducting machine-learning research on how better
statistical models could readily ingest non-curated data and reduce the
need for data preparation in data science. We now have a growing
understanding of the problems, theoretical and practical, that lie at the
intersection of statistics and databases.&lt;/p&gt;
&lt;p&gt;Machine learning leads to different tradeoffs than traditional
inferential statistics (because it can rely on more powerful models). For
instance, we now have a good understanding of the case of missing values:
in &lt;a class="reference external" href="https://arxiv.org/abs/2106.00311"&gt;Le Morvan et al&lt;/a&gt;, we showed that
for traditional methods, ignorable missingness &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; and “good”
imputation are important; for prediction, however, flexible
predictors are what matters, and they can work under any missingness
mechanism.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;“Missing at Random”, where missingness does not depend on the
unobserved values&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Similarly, we have made good progress on tolerating normalization errors
and typos. We find that, rather than attempting to deduplicate entries or
fix typos, it is best to expose similarities and ambiguities to
a flexible learning algorithm. The simplest and most reliable methods are
implemented in the &lt;a class="reference external" href="http://dirty-cat.github.io/"&gt;dirty-cat&lt;/a&gt; library, to
facilitate the life of data scientists.&lt;/p&gt;
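&lt;p&gt;The idea behind these encoders can be sketched with scikit-learn alone (hypothetical category names; dirty-cat itself provides more refined variants): represent each dirty category by its character 3-gram counts, so that typos and formatting variants of the same entry land close together without explicit deduplication:&lt;/p&gt;

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Dirty categorical entries: typos and formatting variants of the same city
cities = ["London", "Londn", "london ", "Paris", "Pariis", "Berlin"]

# Character 3-grams give morphologically close strings close vectors,
# which a flexible downstream learner can exploit directly
vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vec.fit_transform(cities)

sim = cosine_similarity(X)
print(sim[0, 1])  # "London" vs "Londn": clearly positive
print(sim[0, 3])  # "London" vs "Paris": no shared 3-grams
```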
&lt;/div&gt;
&lt;div class="section" id="reinventing-data-science"&gt;
&lt;h2&gt;Reinventing data science&lt;/h2&gt;
&lt;p&gt;With this understanding (and even more exciting on-going research), we
want to revisit data science. Machine learning can provide flexible
models for many usages of data science. Our goal is to use it to help
assemble and analyze datasets while minimizing human effort. For
this, we need tools that can answer typical data-science questions using
machine learning, starting from raw data that is often spread across
multiple files or multiple tables of a database. Building these tools
requires data-science research, a new vision of data-science APIs, and
careful software crafting.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="join-us-in-this-adventure"&gt;
&lt;h2&gt;Join us in this adventure&lt;/h2&gt;
&lt;p&gt;We have an &lt;a class="reference external" href="https://project.inria.fr/dirtydata/team/"&gt;awesome team&lt;/a&gt;,
with a great mix of people of different seniorities and areas of expertise
(statistics, machine learning, databases, software engineering), sharing
offices with the &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/home/"&gt;scikit-learn team at Inria&lt;/a&gt;. But we have too many
exciting ideas, so we are growing this team.&lt;/p&gt;
&lt;div class="section" id="a-data-science-engineer-new-software-with-new-ideas"&gt;
&lt;h3&gt;A data-science engineer: new software with new ideas&lt;/h3&gt;
&lt;p&gt;We are looking for someone with a background in data science or numerical
Python programming to join us, to help with designing a new data-science
library, evolving from &lt;a class="reference external" href="http://dirty-cat.github.io/"&gt;dirty-cat&lt;/a&gt;, and
to help with data-science experimentation for the research.&lt;/p&gt;
&lt;p&gt;We like people who care about data and designing good tools, and who have
a vision of data science. We are happy to consider different levels of
experience. Apply on &lt;a class="reference external" href="https://jobs.inria.fr/public/classic/fr/offres/2021-04182"&gt;the job offer&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="a-post-doc-researcher-science-joining-data-engineering-to-deep-learning"&gt;
&lt;h3&gt;A post-doc researcher: science joining data engineering to deep learning&lt;/h3&gt;
&lt;p&gt;We will soon be announcing a post-doc position to join the team for
research in this scope. We are interested in questions around learning on
relational or tabular data, or learning data integration. We have plenty
of ideas to explore around embeddings in databases, learning to
aggregate, learning on sets, graph neural networks for databases, or
distributional matching for entity and schema alignment.
We expect to borrow tools (conceptual and practical) from deep
learning, but to blend them with techniques from data integration,
knowledge graphs, and databases.&lt;/p&gt;
&lt;p&gt;The job posting will be out soon, but I am running out of the office
right now to go on vacation (work-life balance also matters to us).&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Diversity is important&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://project.inria.fr/dirtydata/team/"&gt;Our team&lt;/a&gt; is not as
diverse as I would like it to be (though probably doing better than the
typical computer-science team). We love diverse candidates. Do not
hesitate.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="machine learning"></category><category term="data science"></category><category term="dirty data"></category><category term="hiring"></category></entry><entry><title>Hiring someone to develop scikit-learn community and industry partners</title><link href="https://gael-varoquaux.info/programming/hiring-someone-to-develop-scikit-learn-community-and-industry-partners.html" rel="alternate"></link><published>2021-09-14T00:00:00+02:00</published><updated>2021-09-14T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2021-09-14:/programming/hiring-someone-to-develop-scikit-learn-community-and-industry-partners.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;With the growth of scikit-learn and the wider PyData ecosystem, we
want to recruit in the Inria scikit-learn team for &lt;a class="reference external" href="https://recrutement.inria.fr/public/classic/en/offres/2021-04058"&gt;a new role&lt;/a&gt;.
Departing from our usual focus on excellence in algorithms,
statistics, or code, we want to add to the team someone with some
technical understanding, but an …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;With the growth of scikit-learn and the wider PyData ecosystem, we
want to recruit in the Inria scikit-learn team for &lt;a class="reference external" href="https://recrutement.inria.fr/public/classic/en/offres/2021-04058"&gt;a new role&lt;/a&gt;.
Departing from our usual focus on excellence in algorithms,
statistics, or code, we want to add to the team someone with some
technical understanding, but an eye for people dynamics. Are you
passionate about developing open-source communities for data science?
This job is a unique opportunity.&lt;/p&gt;
&lt;p class="last"&gt;The mandate will be on the one hand to develop the wider community
behind scikit-learn, on the other hand to foster the foundation’s
partnerships, as this is our funding.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="context-scikit-learn-inria-foundation"&gt;
&lt;h2&gt;Context: Scikit-learn &amp;#64; Inria foundation&lt;/h2&gt;
&lt;div class="section" id="the-growth-of-scikit-learn"&gt;
&lt;h3&gt;The growth of Scikit-learn&lt;/h3&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/scikit-learn-logo.png" style="width: 200px;" /&gt;
&lt;p&gt;Scikit-learn is used massively, from schools to major companies. It
underpins business-intelligence analyses and automates processes. Its
reliability is crucial to enterprises. Its well-documented methods
help data scientists run valid analyses.&lt;/p&gt;
&lt;p&gt;Scikit-learn has grown hugely, and is still growing, in terms of userbase
and expectations of quality. These days, the development team is large,
with many grass-roots volunteers and some contributors spending a
sizeable fraction of their work time on the project.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="../programming/attachments/sklearn_website_access.png" style="width: 450px;" /&gt;
&lt;p class="caption"&gt;Number of monthly website access&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-inria-foundation"&gt;
&lt;h3&gt;Scikit-learn &amp;#64; Inria foundation&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Birth of a foundation&lt;/strong&gt;
To ensure reliable funding for a small core of scikit-learn developers, we
set up a foundation &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; a few years ago. The goal was to make sure that
we did not lose our experienced developers.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;See &lt;a class="reference external" href="http://gael-varoquaux.info/programming/a-foundation-for-scikit-learn-at-inria.html"&gt;the motivating announcement&lt;/a&gt; and the &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr"&gt;website&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Achieving sustainability&lt;/strong&gt;
The resulting structure is set up to provide a career path to a few of
our core people. As a consequence, it is a French legal entity, acting as
an employer, funded via sponsorship agreements with a few
major economic users of scikit-learn (check out &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr"&gt;the list of our
sponsors&lt;/a&gt;). The priorities of
the team are set &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/how-are-the-priorities-of-the-consortium-defined/"&gt;jointly between the sponsors and the open-source
community&lt;/a&gt;. The setup is not without flaws; in particular, it forces us to employ people &lt;a class="reference external" href="https://www.inria.fr/en/centre-inria-saclay-ile-de-france"&gt;on campus&lt;/a&gt;, but it enables giving proper benefits to these contributors.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The team&lt;/strong&gt; The &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/people/"&gt;scikit-learn team at Inria foundation&lt;/a&gt; currently comprises 4
very experienced developers. In addition, we have other sources of
funding – research projects, &lt;a class="reference external" href="https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn/"&gt;the scikit-learn MOOC&lt;/a&gt; –
that we use to create a larger team (currently 3 full-time positions).
Finally, various researchers on campus are heavily invested in
scikit-learn or related projects such as joblib. As a result, the amount
of technical skill is staggering.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Long story short, we want to add new DNA to this awesome team: someone
into peopleware as much as software.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="mandate"&gt;
&lt;h2&gt;Mandate&lt;/h2&gt;
&lt;p&gt;The goal of &lt;a class="reference external" href="https://recrutement.inria.fr/public/classic/en/offres/2021-04058"&gt;the new position&lt;/a&gt; is
to talk both to our wider open-source world and our corporate partners.
Both are crucial to fostering growth for scikit-learn.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://recrutement.inria.fr/public/classic/en/offres/2021-04058"&gt;official job posting&lt;/a&gt;
doesn’t convey as well as I would like what is behind this position. I’m
probably to blame :).&lt;/p&gt;
&lt;div class="section" id="growing-our-open-source-community"&gt;
&lt;h3&gt;Growing our open-source community&lt;/h3&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/herdingcats.jpg" style="width: 300px;" /&gt;
&lt;p&gt;As both the scikit-learn and the PyData community have grown,
communication becomes a bottleneck. There are so many little things that
make an open-source community productive: facilitating on-boarding,
dividing the workload efficiently, documenting decision making well,
organizing fun sprints, making sure that issue triaging is efficient…&lt;/p&gt;
&lt;p&gt;We are looking for someone passionate about open-source
communities and who wants to be herding such cats.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="increasing-our-corporate-visibility"&gt;
&lt;h3&gt;Increasing our corporate visibility&lt;/h3&gt;
&lt;p&gt;Scikit-learn is one of the most used data-science tools. However, when
talking to senior decision makers, we find that their perception sometimes
differs. Indeed, we are competing for visibility with many powerful actors.&lt;/p&gt;
&lt;p&gt;We must communicate beyond the open-source world to develop
a strong brand for scikit-learn. Good communication will help us find new
sponsors, a key ingredient of growth and sustainability for scikit-learn.&lt;/p&gt;
&lt;p&gt;We need to communicate about our progress and our actions, as people are
often surprised by the breadth of our contributions &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;For instance, the foundation team has contributed &lt;a class="reference external" href="https://youtu.be/UVL4LFy8ch0?t=1437"&gt;improvements to
CPython itself&lt;/a&gt;, and maintains
&lt;a class="reference external" href="https://github.com/cloudpipe/cloudpickle"&gt;cloudpickle&lt;/a&gt;, a central
component of the data ecosystem.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As a foundation, we need to be transparent and accountable, which is
harder than it seems.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-good-fit"&gt;
&lt;h2&gt;A good fit&lt;/h2&gt;
&lt;a class="reference external image-reference" href="https://www.flickr.com/photos/randychiu/4602851011/"&gt;&lt;img alt="One Man Band, CCby2.0 from randychiu" class="align-right" src="../programming/attachments/one_man_band.jpg" style="width: 250px;" /&gt;&lt;/a&gt;
&lt;p&gt;We are looking for someone who is into open source, but who also likes
writing blog posts, engaging on social networks, organizing events,
presenting scikit-learn, and improving processes.&lt;/p&gt;
&lt;p&gt;We believe that such a job is best done by someone who has some technical
interest in scikit-learn: good advocacy requires good understanding.&lt;/p&gt;
&lt;p&gt;Maybe this sounds daunting? Few people have all the skills, let alone the
experience. We are actually more &lt;strong&gt;looking for a passionate and promising
candidate&lt;/strong&gt;, whatever the length of the resume. We believe that
&lt;strong&gt;talented people can learn&lt;/strong&gt;, when they like what they do.&lt;/p&gt;
&lt;p&gt;This is a job about open source, for open source! It’s not a perfect job:
we have many administrative constraints in running the foundation, and we
pay ourselves less than a non-open-source job would.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Apply now&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We are looking forward to your application. You can submit it on
&lt;a class="reference external" href="https://recrutement.inria.fr/public/classic/en/offres/2021-04058"&gt;the official job offer&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="open source"></category><category term="growth"></category><category term="communities"></category><category term="scikit-learn"></category><category term="inria"></category><category term="foundation"></category></entry><entry><title>2020: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2020-my-scientific-year-in-review.html" rel="alternate"></link><published>2021-01-05T00:00:00+01:00</published><updated>2021-01-05T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2021-01-05:/science/2020-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;The year 2020 has undoubtedly been interesting: the covid19 pandemic
struck while I was on a work sabbatical in Montréal, at the &lt;a class="reference external" href="https://www.mcgill.ca/neuro/"&gt;MNI&lt;/a&gt; and the &lt;a class="reference external" href="https://mila.quebec/"&gt;MILA&lt;/a&gt;,
and it further pushed my interest in machine learning for health care.
&lt;strong&gt;My highlights this year revolve around basic and applied data-science
for health&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;The year 2020 has undoubtedly been interesting: the covid19 pandemic
struck while I was on a work sabbatical in Montréal, at the &lt;a class="reference external" href="https://www.mcgill.ca/neuro/"&gt;MNI&lt;/a&gt; and the &lt;a class="reference external" href="https://mila.quebec/"&gt;MILA&lt;/a&gt;,
and it further pushed my interest in machine learning for health care.
&lt;strong&gt;My highlights this year revolve around basic and applied data-science
for health&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#mining-electronic-health-records-for-covid-19" id="toc-entry-1"&gt;Mining electronic health records for covid-19&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#machine-learning-for-dirty-data" id="toc-entry-2"&gt;Machine learning for dirty data&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#supervised-learning-with-missing-values-beyond-imputation" id="toc-entry-3"&gt;Supervised learning with Missing values: beyond imputation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#machine-learning-without-normalizing-entries" id="toc-entry-4"&gt;Machine-learning without normalizing entries&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#making-sense-of-brain-functional-signals" id="toc-entry-5"&gt;Making sense of brain functional signals&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#neuroquery-brain-mapping-any-neuroscience-query" id="toc-entry-6"&gt;NeuroQuery: brain mapping any neuroscience query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#a-high-resolution-brain-functional-atlas" id="toc-entry-7"&gt;A high-resolution brain functional atlas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="mining-electronic-health-records-for-covid-19"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Mining electronic health records for covid-19&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Hospital databases are rich and messy&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hospital databases&lt;/strong&gt;
In March, we &lt;a class="reference external" href="https://www.inria.fr/en/scikiteds-visualization-tool-monitoring-flow-sick-patients"&gt;teamed up with the hospitals around Paris&lt;/a&gt; that were suffering from a severe overload due to a new pathology,
covid-19. The challenge was to extract information from the huge
databases of the hospital management system: What were the characteristics
of the patients? How were the resources of the hospital evolving? Of the
treatments that were empirically attempted, which were most effective?&lt;/p&gt;
&lt;p&gt;The hospital databases are hugely promising, because &lt;strong&gt;they offer at
almost no cost information on all the patients that go through the
hospital&lt;/strong&gt;. As we were dealing with a conglomerate of 39 hospitals, this
information covers millions of patients each year: excellent
epidemiological coverage.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Challenging data science&lt;/strong&gt;
Our work was classic data science: we did a lot of data management,
crafting SQL queries and munging pandas dataframes to create data tables
for statistics and visualizations. We interacted closely with the
hospital management and the doctors to understand the information of
interest. As we moved forward, it became clear that behind each “simple”
question, there were challenges of statistical validity. We did not want
to produce a figure that was misleading. Typical challenges were:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Information needed complicated transformations (such as following a
patient hopping across hospitals to capture the patient’s status)&lt;/li&gt;
&lt;li&gt;Information was represented differently in the different hospitals&lt;/li&gt;
&lt;li&gt;Incorrect inputs prevented aggregation (such as an entry date erroneously
after the exit date, or missing values)&lt;/li&gt;
&lt;li&gt;The database had biases compared to the ground truth (simple
oxygen-therapy acts go unreported more often than complicated invasive
ventilation)&lt;/li&gt;
&lt;li&gt;Censoring effects prevented the use of naive statistics (20 days into
the epidemic outbreak, most hospital stays are short simply because
patients have entered the hospitals recently)&lt;/li&gt;
&lt;li&gt;A lot of information was present as unnormalized text, sometimes in
long hand-written notes, full of acronyms and errors due to character
recognition.&lt;/li&gt;
&lt;li&gt;The data were of course often a consequence of treatment policy (the
choices of the medical staff in terms of patient handling and
measures), and hence not directly interpretable in causal or
interventional terms.&lt;/li&gt;
&lt;/ul&gt;
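&lt;p&gt;The censoring point can be made concrete with a small simulation (a sketch with synthetic numbers, not the hospital data): averaging only completed stays under-estimates the true mean stay length, because long stays are more likely to still be ongoing at the time of analysis:&lt;/p&gt;

```python
import numpy as np

rng = np.random.RandomState(0)
# True hospital-stay lengths (days), exponential with mean 10
true_stays = rng.exponential(scale=10.0, size=10000)

# Patients enter uniformly over a 20-day outbreak; we analyze on day 20
entry_day = rng.uniform(0.0, 20.0, size=10000)
still_in_hospital = entry_day + true_stays > 20.0  # censored stays
completed = true_stays[~still_in_hospital]

# Averaging only the completed stays is biased low: long stays are
# disproportionately still ongoing, hence censored out of the sample
print(true_stays.mean())  # close to 10
print(completed.mean())   # markedly smaller
```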
&lt;p&gt;These challenges were very interesting to me, as they related directly to
my research agenda of &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;facilitating the processing of “dirty data”&lt;/a&gt; (more on that below).&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Most of the work that we did was not oriented toward publication, but
rather toward addressing urgent needs of the hospitals. Some scholarly
contributions did come out:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Part of the extracted data are consolidated worldwide for medical
studies (&lt;a class="reference external" href="https://www.nature.com/articles/s41746-020-00308-0"&gt;Brat et al, Nature Digital Medicine 2020&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;We used causal-inference methods to estimate the treatment effects of
HCQ with and without Azithromycin (&lt;a class="reference external" href="https://www.medrxiv.org/content/10.1101/2020.06.16.20132597v1"&gt;Sbidian et al, MedRxiv 2020&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The data are used in follow-up medical studies (e.g. associating
mortality and obesity: &lt;a class="reference external" href="https://onlinelibrary.wiley.com/doi/full/10.1002/oby.23014"&gt;Czernichow et al, Obesity 2020&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Biomedical entity recognition&lt;/strong&gt; A major AI difficulty in this work is
recognizing biomedical entities, such as conditions or treatments, in the
various texts. Coincidentally, we had been working on simplifying the
state-of-the-art pipelines for biomedical entity linking. While this
research work was not used on the hospital data, because it was too
bleeding edge, it led to an AAAI paper (&lt;a class="reference external" href="https://arxiv.org/abs/2012.08844"&gt;Chen et al, AAAI 2021&lt;/a&gt;) on a state-of-the-art model for
biomedical entity linking that is much more lightweight than current
approaches.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-for-dirty-data"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Machine learning for dirty data&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Machine learning methods that can robustly ingest non-curated data.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;Dirty Data project&lt;/a&gt;, which we
undertook a few years ago, is really bearing fruit.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="supervised-learning-with-missing-values-beyond-imputation"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Supervised learning with Missing values: beyond imputation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The classic view on processing data with missing values is to try to
&lt;em&gt;impute&lt;/em&gt; the missing values: replace them by probable values (or, better,
compute the distribution of the unobserved values given the observed
ones). However, such an approach needs a model of the missing-values
mechanism; this is simple only when the values are missing at random.
We have been studying the alternative view, based on directly computing
a predictive function to be applied to data with missing values.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/mnar_versus_mcar.png" style="width: 500px;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Missing-values mechanisms&lt;/strong&gt;: black dots are fully-observed data
points, while grey ones are partially observed. The left panel
displays a missing-at-random situation, where missingness is
independent of the underlying values. On the contrary, in a
missing-not-at-random situation (right panel), whether values are
observed or not depends on the underlying values (potentially
unobserved).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v108/morvan20a.html"&gt;Le Morvan et al, AIStats 2020&lt;/a&gt; studied the
seemingly-simple case of a linear generative mechanism and showed that,
with missing values, the optimal predictor was a complex, piecewise
linear, function of the observed data concatenated with the
missing-values mask. This function can be implemented with a neural
network with ReLu activation functions, fed with data where missing
values are replaced by zeros and corresponding indicator features are
added.&lt;/p&gt;
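&lt;p&gt;This recipe can be sketched with scikit-learn building blocks (an illustration on synthetic data, not the authors’ code):&lt;/p&gt;

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(400, 3))
y = X.sum(axis=1) + 0.1 * rng.normal(size=400)
X[rng.binomial(1, 0.25, size=X.shape).astype(bool)] = np.nan

# Zero-fill the missing entries and append the missingness mask as features
mask = np.isnan(X).astype(float)
X_in = np.hstack([np.nan_to_num(X), mask])

# A ReLU network on this representation can express the piecewise-linear
# optimal predictor studied by Le Morvan et al
net = MLPRegressor(hidden_layer_sizes=(64,), activation="relu",
                   max_iter=2000, random_state=0).fit(X_in, y)
print(net.score(X_in, y))
```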
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To go one step further, we noticed that the optimal predictor uses the
correlation between features (&lt;em&gt;eg&lt;/em&gt; on fully-observed data) to compensate
for missing values.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/compensation_effects.jpeg" style="width: 700px;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Compensation effects&lt;/strong&gt;: The optimal predictor uses the correlation
between features to compensate when a value is missing.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="https://neurips.cc/virtual/2020/public/poster_42ae1544956fbe6e09242e6cd752444c.html"&gt;Le Morvan et al, NeurIPS 2020&lt;/a&gt;
devise a neural-network architecture that efficiently captures these
links across the features. Mathematically, it stems from seeking good
functional forms to approximate the expression of the optimal predictor,
that can be derived for various missing-values mechanisms. A non-trivial
result is that a simple functional form can approximate the optimal
predictor under very different mechanisms.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/neumiss_nb_parameters.jpeg" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Better parameter efficiency&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The resulting architecture needs many fewer parameters (in depth or width)
than a fully-connected multi-layer perceptron to predict well in the
presence of missing values. This, in turn, leads to better performance
on limited data sizes.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="machine-learning-without-normalizing-entries"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Machine-learning without normalizing entries&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A challenge of data management is that the same information may be
represented in different ways, typically with different strings denoting
the same, or related entities. For instance, in the following table, the
&lt;em&gt;employee position title&lt;/em&gt; column contains such non-normalized
information:&lt;/p&gt;
&lt;blockquote&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="13%" /&gt;
&lt;col width="47%" /&gt;
&lt;col width="40%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Sex&lt;/th&gt;
&lt;th class="head"&gt;Employee Position Title&lt;/th&gt;
&lt;th class="head"&gt;Years of experience&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Master Police Officer&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Social Worker IV&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Police Officer III&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Police Aide&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Electrician I&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Bus Operator&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Bus Operator&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Social Worker III&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Female&lt;/td&gt;
&lt;td&gt;Library Assistant I&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Male&lt;/td&gt;
&lt;td&gt;Library Assistant I&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;Typos, or other morphological variants (such as varying abbreviations)
often make things worse. We found many instances of such challenges in
electronic health records.&lt;/p&gt;
&lt;p&gt;In a data-science analysis, such data has a categorical meaning, but a
typical categorical-data representation (such as one-hot encoding) breaks
down: there are too many categories, and in machine learning, the test
set may come with new categories.&lt;/p&gt;
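&lt;p&gt;A small sketch of this failure mode with scikit-learn’s one-hot encoder (the job titles are taken from the table above):&lt;/p&gt;

```python
from sklearn.preprocessing import OneHotEncoder

train = [["Police Officer III"], ["Bus Operator"]]
test = [["Master Police Officer"]]  # a title never seen during fitting

# The default encoder errors out on unseen categories
enc = OneHotEncoder()
enc.fit(train)
error_raised = False
try:
    enc.transform(test)
except ValueError:
    error_raised = True

# handle_unknown="ignore" avoids the crash, but encodes the new title
# as all zeros: its similarity to "Police Officer III" is lost
enc = OneHotEncoder(handle_unknown="ignore").fit(train)
row = enc.transform(test).toarray()
print(error_raised, row)
```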
&lt;p&gt;The standard practice is to curate the data: represent the information in
a normalized way, without morphological variants, separating the
various bits of information (for instance the type of job from the rank).
This typically requires a lot of human labor.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/gamma_poisson_encoding.png" style="width: 600px;" /&gt;
&lt;p class="caption"&gt;The original categories and their continuous representation on latent
categorical features inferred from the data.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="https://ieeexplore.ieee.org/abstract/document/9086128"&gt;Cerda &amp;amp; Varoquaux, TKDE 2020&lt;/a&gt; give two
efficient approaches to encode such data for statistical analysis
capturing string similarities. The most interpretable of these approaches
represents the data by continuous encoding on latent categories inferred
automatically from recurrent substrings.&lt;/p&gt;
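&lt;p&gt;The encoder itself is available in skrub, but the gist (a continuous encoding on latent categories inferred from recurrent substrings) can be sketched with scikit-learn alone; this is an approximation of the idea, not the Gamma-Poisson model of the paper:&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

titles = ["Master Police Officer", "Police Officer III", "Police Aide",
          "Social Worker IV", "Social Worker III",
          "Library Assistant I", "Bus Operator", "Electrician I"]

# Count character 3-grams so that morphological variants share features
counts = CountVectorizer(analyzer="char_wb",
                         ngram_range=(3, 3)).fit_transform(titles)

# Factorize the counts into a few latent components playing the role of
# latent categories; each title gets a continuous, non-negative encoding
nmf = NMF(n_components=3, random_state=0, max_iter=1000)
activations = nmf.fit_transform(counts)

print(activations.shape)  # one row of latent-category loadings per title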
&lt;p&gt;This research is implemented in the &lt;a class="reference external" href="https://skrub-data.org"&gt;skrub&lt;/a&gt;
Python library, which is making rapid progress (and was originally called
dirty-cat).&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="making-sense-of-brain-functional-signals"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Making sense of brain functional signals&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Turning brain-imaging signal into insights&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Brain imaging, and in particular functional brain imaging, is amazing,
because it gives a window on brain function, whether it is to understand
cognition, behavior, or pathologies. One challenge that I have been
interested in, across the years, is how to give systematic sense to these
signals, in a broader perspective than a given study.&lt;/p&gt;
&lt;div class="section" id="neuroquery-brain-mapping-any-neuroscience-query"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-6"&gt;NeuroQuery: brain mapping any neuroscience query&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Systematically linking mental processes and disorders to brain structures
is a very difficult task because of the huge diversity of behavior.&lt;/p&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://elifesciences.org/articles/53385"&gt;Dockes et al, elife 2020&lt;/a&gt; we used text mining on a
large number of brain-imaging publications to predict where in the brain
a given subject of study (in neuroscience, behavior, and related
pathologies) would report findings.&lt;/p&gt;
&lt;p&gt;With this model, we built a web application, &lt;a class="reference external" href="https://neuroquery.org"&gt;NeuroQuery&lt;/a&gt;, in which the user can type a neuroscience
query and get a brain map of where a study on the topic is likely to
report findings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-high-resolution-brain-functional-atlas"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="#toc-entry-7"&gt;A high-resolution brain functional atlas&lt;/a&gt;&lt;/h3&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Regions to summarize the fMRI signal&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Atlases of brain regions are convenient to summarize the information of
brain images, turning them into information that is easy to analyse. We have long
studied the specific case of functional brain atlases, extracting and
validating them from brain-imaging data. &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811920306121"&gt;Dadi et al, NeuroImage 2020&lt;/a&gt;
contributes a high-resolution brain functional atlas, DiFuMo. This atlas
can be browsed or downloaded &lt;a class="reference external" href="https://parietal-inria.github.io/DiFuMo/"&gt;online&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2020_highlights/difumo.jpg" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;The functional regions, at dimension 512.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The atlas comes at various resolutions, and all the structures that it
segments have been given meaningful names. In the paper, we showed that
using this atlas to extract functional signals led to better analyses on
a large number of problems compared to the commonly-used atlases. We thus
recommend this atlas, for instance to extract Image-Derived Phenotypes in
population analyses, where the huge size of the data requires working on
summarized information.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2020_highlights/putamen_difumo.png" /&gt;
&lt;p class="caption"&gt;The region capturing the right hemisphere putamen.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="health"></category><category term="covid19"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>Technical discussions are hard; a few tips</title><link href="https://gael-varoquaux.info/programming/technical-discussions-are-hard-a-few-tips.html" rel="alternate"></link><published>2020-05-28T00:00:00+02:00</published><updated>2020-05-28T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-05-28:/programming/technical-discussions-are-hard-a-few-tips.html</id><summary type="html">&lt;!-- Emma, Eliz, Rashema, Ralf Gommers to read this --&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post discuss the difficulties of communicating while developing
open-source projects and tries to gives some simple advice.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;A large software project is above all a social exercise in which technical
experts try to reach good decisions together, for instance on github
pull requests. But communication is difficult, in …&lt;/p&gt;</summary><content type="html">&lt;!-- Emma, Eliz, Rashema, Ralf Gommers to read this --&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;This post discuss the difficulties of communicating while developing
open-source projects and tries to gives some simple advice.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;A large software project is above all a social exercise in which technical
experts try to reach good decisions together, for instance on github
pull requests. But communication is difficult, in particular between
diverging points of view. It is easy to
underestimate how much well-intentioned people can misunderstand
each other and get hurt, in open source as elsewhere. Knowing why
there are communication challenges can help, as can applying a few
simple rules.&lt;/p&gt;
&lt;img alt="" class="align-right" src="../programming/attachments/communication.png" style="width: 300px;" /&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#maintainer-s-anxiety" id="toc-entry-1"&gt;Maintainer’s anxiety&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#contributor-s-fatigue" id="toc-entry-2"&gt;Contributor’s fatigue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#communication-is-hard" id="toc-entry-3"&gt;Communication is hard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#little-things-that-help" id="toc-entry-4"&gt;Little things that help&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The first challenge is to understand the other’s point of view: the
different parties see the problem differently.&lt;/p&gt;
&lt;!-- TODO: put a few things in bold --&gt;
&lt;div class="section" id="maintainer-s-anxiety"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Maintainer’s anxiety&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="open-source-can-be-anxiety-generating-for-the-maintainers"&gt;
&lt;h3&gt;Open source can be anxiety-generating for the maintainers&lt;/h3&gt;
&lt;p&gt;Maintainers ensure the quality and the long-term life of an open-source
project. As such, &lt;strong&gt;they feel responsible for any shortcoming in
the product&lt;/strong&gt;. In addition, they often do this work because they care,
even though it may not bring any financial support.
But they can quickly become a converging point of anxiety-generating
feedback:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Code has bugs; the more code, the more bugs. Watching an issue tracker
fill up with a long list of bugs is frightening to people who
feel in charge.&lt;/li&gt;
&lt;li&gt;Given that maintainers are visible and qualified, they become the
target of constant requests for attention: from pleas to prioritize a
specific issue to solicitations for advice.&lt;/li&gt;
&lt;li&gt;A small fraction of these interactions come as plain
aggressions. I have been insulted many times by unsatisfied
users. Each time, it hurts me a lot. My policy is to
disengage from the conversation, but I am left shaking and staring at
my computer in the evening.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Related writings&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ralf Gommers discusses &lt;a class="reference external" href="https://rgommers.github.io/2019/06/the-cost-of-an-open-source-contribution/"&gt;the cost of an open source
contribution&lt;/a&gt;, from the point of view of the maintainer.&lt;/p&gt;
&lt;p&gt;Ilya Grigorik suggests: &lt;a class="reference external" href="https://www.igvita.com/2011/12/19/dont-push-your-pull-requests/"&gt;Don’t push your pull request&lt;/a&gt;.&lt;/p&gt;
&lt;p class="last"&gt;Brett Cannon: &lt;a class="reference external" href="https://snarky.ca/setting-expectations-for-open-source-participation/"&gt;Setting expectations for open source participation&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The more popular a project, the more weight it puts on its maintainers’
shoulders. A consequence is that &lt;strong&gt;maintainers are tired&lt;/strong&gt;, and can
sometimes approach discussions in a defensive way. Also, we may be plain
scared of integrating code that we do not fully comprehend.&lt;/p&gt;
&lt;p&gt;Open-source developers may even, unconsciously, adopt a simple, but
unfortunate, protection mechanism: being rude. The logic is flawless: if
I am nasty to people, or set unreasonable expectations, people will leave me alone.
Alas, this strategy leads to toxic environments. It not only makes people
unhappy but also harms the community dynamics that ground the excellence
of open source.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-danger-abusive-gatekeeping"&gt;
&lt;h3&gt;The danger of abusive gatekeeping&lt;/h3&gt;
&lt;!-- add a image of puppy? And a gate? --&gt;
&lt;p&gt;A maintainer quickly learns that every piece of code, no matter how cute
it might be, will give him or her work in the long run, &lt;a class="reference external" href="https://snarky.ca/setting-expectations-for-open-source-participation/#submittingacontribution"&gt;just like a puppy&lt;/a&gt;. This
is unavoidable given that the complexity of code grows faster than its number of
features &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;, and, even for a company as rich as Google,
project maintenance becomes intractable on huge projects &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="side-hanging docutils container"&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a class="reference external" href="https://ieeexplore.ieee.org/document/1702600"&gt;An Experiment on Unit Increase in Problem Complexity, Woodfield 1979&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;To quote tensorflow developers
&lt;a class="reference external" href="https://github.com/tensorflow/tensorflow/pull/33460"&gt;“Every [code addition] takes around 16 CPU/GPU
hours of [quality control]. As such, we cannot just run every
[code addition] through the [quality control] infrastructure.”&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;A maintainer’s job is to say no often&lt;/strong&gt;, to protect the project. But,
like any gatekeeping, it can unfortunately become an exercise in unchecked
power. Making objective choices in these difficult decisions is hard,
and we all naturally tend to trust more the people that we know.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Most often we are not aware of our shortcomings, let alone committing
them on purpose.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="contributor-s-fatigue"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Contributor’s fatigue&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A new contributor starting a conversation with a group of seasoned
project maintainers may easily &lt;strong&gt;feel like an impostor&lt;/strong&gt;. The
new contributor knows less about the project. In addition, he or she is engaging
with a group of people who know each other well, and is not yet part of
that &lt;em&gt;inner&lt;/em&gt; group.&lt;/p&gt;
&lt;p&gt;This person does not know the code base, or the conventions, and must &lt;strong&gt;make
extra efforts&lt;/strong&gt;, compared to the seasoned developers, to propose a
contribution suitable for the project. Often, he or she does
not fully understand the reasons for the project guidelines, or for the
feedback given. Requests for changes can easily be seen as trifles.&lt;/p&gt;
&lt;p&gt;Integrating the contribution can often be a lengthy process –in
particular in scikit-learn. Indeed, it will involve not only shaping up
the contribution, but also learning the skills and discovering the
process. These &lt;strong&gt;long cycles can undermine motivation&lt;/strong&gt;: humans need
successes to feel enthusiasm. Also, the contributor may legitimately
worry: Will all these efforts be fruitful? Will the contribution make its
way to the project?&lt;/p&gt;
&lt;p&gt;Note that for these reasons, it is recommended to start contributing with
very simple features, and to seek feedback on the scope of the
contribution before writing the code.&lt;/p&gt;
&lt;p&gt;Finally, contributors are seldom paid to work on the project, and there
is no single chain of command that makes decisions and controls incentives
for all the people on the project. No one is responsible when things go
astray, which means that the weight falls on the shoulders of
individuals.&lt;/p&gt;
&lt;!-- fun pictures, to relax atmosphere, but only later, first write and
review --&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The danger behind the lengthy cycle of reviews and improvements needed to
contribute is &lt;strong&gt;death by a thousand cuts&lt;/strong&gt;. The contributor loses
motivation, and no longer finds the energy to finish the work.&lt;/p&gt;
&lt;div class="grey docutils container"&gt;
&lt;p&gt;&lt;strong&gt;How about users?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This article is focused on developers. Yet, users are also an
important part of the discussion around open source.&lt;/p&gt;
&lt;p&gt;Often, communication failures with users are due to frustration:
frustration at being unable to use the software, at hitting a bug, at
seeing an important issue still not addressed. This frustration stems
from incorrect expectations, which can often be traced to a
misunderstanding of the processes and the dynamics. Managing
expectations is important to improve the dialogue, via the
documentation and via notes on the issue tracker.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="communication-is-hard"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Communication is hard&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Communication is hard: messages are sometimes received differently than
we would like. &lt;strong&gt;Overworked people discussing very technically
challenging issues&lt;/strong&gt; only makes the matter worse. I have seen people not
come across well, while I know they are absolutely lovely and caring.&lt;/p&gt;
&lt;p&gt;We are human beings; we are limited; we misunderstand things, and we have
feelings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Emotions&lt;/strong&gt; –
My most vivid memory of a communication failure was when I was a sailing
instructor. Trainees that were under my responsibility had put themselves
at risk, causing me a lot of worry. During the debrief, I was angry. My
failure to convey the messages without emotional loading undermined my
leadership on the group, putting everybody at risk for the rest of the
week.&lt;/p&gt;
&lt;p&gt;Inability to understand the other’s point of view, or to communicate
ours, can bring in emotions. Emotions most often impede technical
communication.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Limited attention&lt;/strong&gt; –
We, in particular maintainers, are bombarded with email, notifications,
text and code to read.
As a consequence, it is easy to read things too fast, to stop in the
middle, to forget.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Language barriers&lt;/strong&gt; –
Most discussions happen in English; but most of us are not native English
speakers. We may hide our difficulties well, but nuances are often lost.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Clique effects&lt;/strong&gt; –
Most interactions in open source are done in writing, with low
communication bandwidth. It can be much harder to convince a maintainer
on the other side of the world than a colleague in the same room. Schools
of thought naturally emerge when people work a lot together. These
create bubbles, where we have the impression that everything we say is
obvious and uncontroversial, and yet we fail to convince people outside
of our bubble.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="little-things-that-help"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Little things that help&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Communication can be improved by continuously working on it &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;.
This may be obvious to some, but it personally took me many years to learn.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Training materials for managers often discuss communication, and
give tricks. I am sure that there are better references than my
list below. But that’s the best I can do.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="hear-the-other-exchange"&gt;
&lt;h3&gt;Hear the other: exchange&lt;/h3&gt;
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Related presentation&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;a class="reference external" href="https://docs.google.com/presentation/d/1mEMjGQXErZC-mBeCt0quLz7b5ODQnehmfwwnCeggzcU/edit#slide=id.g5135b4b0eb_1_14"&gt;How can we have healthier technical discussions?&lt;/a&gt; by Nathaniel J. Smith&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Foster multiway discussions&lt;/strong&gt; – The goal of a technical discussion is to
arrive at the best solution. Better solutions emerge from confronting
different points of view: a single brilliant individual
probably cannot find or recognize the best solution alone.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Integrate input from as many perspectives as possible.&lt;/li&gt;
&lt;li&gt;Make sure everyone feels heard.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Don’t seek victory&lt;/strong&gt; – Most important to keep in mind is that giving
up on an argument and accepting the other point of view is a perfectly
valid option. I am naturally biased to think that my view on topics dear to
me is the right one. However, I’ve learned that adopting the view of the
other can bring a lot to the social dynamics of a project: we are often
debating over details, and the bigger benefit comes from moving forward.&lt;/p&gt;
&lt;p&gt;In addition, if several very bright people have reached different conclusions
from mine about something that they’ve thought a lot about, who am I to disagree?&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="convey-ideas-well-pedagogy"&gt;
&lt;h3&gt;Convey ideas well: pedagogy&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Explain&lt;/strong&gt; – Give the premises of your thoughts. Unroll your thought
processes. People are not sitting in your head, and need to hear not only
your conclusion, but how you got there.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Repeat things&lt;/strong&gt; – Account for the fact that people can forget, and
never hesitate to gently restate important points. Reformulating things
differently can also help explain them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Keep it short&lt;/strong&gt; – A typical reading speed is around 200 words a
minute. People have limited time and attention spans. The greatest help
you can provide to your reader is to condense your ideas: let us avoid
long threads that require several dozen minutes to read and digest.
There is a tension between this point and the one above. My suggestion:
remove every word that is not useful, and move details to footnotes or
postscripts.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="cater-for-emotions-tone"&gt;
&lt;h3&gt;Cater for emotions: tone&lt;/h3&gt;
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Related good advice&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;a class="reference external" href="https://www.mozilla.org/en-US/about/governance/policies/participation/#expected-behavior"&gt;Mozilla participation guide, expected behavior section&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Stay technical&lt;/strong&gt; – Always try to get to the technical aspect of the
matter, and never the human one. Give specific code and wording suggestions.
When explaining a decision, give technical arguments, even if they feel
obvious to you.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Be positive&lt;/strong&gt; – Being positive in general helps people feel happy and
motivated. It is well known that positive feedback leads to quicker
progress than negative feedback, as revealed &lt;em&gt;eg&lt;/em&gt; by studies of classrooms. I am
particularly guilty here: I always forget to say something nice,
even when I am super impressed by a contribution. Likewise, avoid
negative words when giving feedback (stay technical).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Avoid “you”&lt;/strong&gt; – The mere use of the pronoun “you” puts the person we are
talking to at the center of the message. But the message should not be about
the person; it should be about the work. It’s very easy to react
emotionally when it’s about us. The passive voice can be useful to avoid
putting people as the topic. If the topic is indeed people, sometimes “we”
is an adequate substitute for “you”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Assume good faith&lt;/strong&gt; – There are so many misunderstandings that can
happen. People forget things, people make mistakes, people fail to convey
their messages. Most often, all these failures are in good faith, and
misunderstandings are legitimate. In the rare cases where there might
be some bad faith, accounting for it will only make communication worse,
not better. Along the same lines, we should ignore it when we feel assaulted
or insulted, and avoid replying in kind.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose words wisely&lt;/strong&gt; – The choice of words matters, because words convey
implicit messages. In particular, avoid terms that carry value
judgements: “good” or “bad”. For example, “This is done wrong” (note that this
sentence already avoids “you”) could be replaced by “There might be a more
numerically stable / efficient way of doing it” (note also the use of
precise technical wording rather than the generic term “better”).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use moderating words&lt;/strong&gt; – Try to leave room for the other in the
discussion. Overly assertive statements close the door to different points
of view: “this must be changed” (note the lack of “you”) should be
avoided, while “this should be changed” is better. For this reason, this
article is riddled with words such as “tend”, “often”, “feel”, “may”,
“might”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don’t blame someone else&lt;/strong&gt; – If you feel that there is some pattern that
you would like to change, do not point fingers, and do not blame others.
Rather, put yourself at the center of the story, find an example of
this pattern involving you, and the message becomes that “it is a pattern
that &lt;em&gt;we&lt;/em&gt; should avoid”. &lt;em&gt;“We”&lt;/em&gt; is such a powerful term. It unites; it
builds a team.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Give your understanding&lt;/strong&gt; – If you feel that there is a misunderstanding,
explain how you are feeling. But do it using “I”, and not “you”, and
acknowledge the subjectivity: “I feel ignored” rather than “you are
ignoring me”. Even better: only talk about the feeling: “I am losing
motivation, because this is not moving forward”, or “I think that I am
failing to convey why this numerical problem is such an important issue”
(note the use of “I think”, which avoids casting the situation as
necessarily true).&lt;/p&gt;
&lt;div class="side-hanging small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Implicit messages&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Four-sides_model"&gt;The four sides&lt;/a&gt;
view of communication highlights the multiple messages present even in
simple statements.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I hope this can be useful. I personally try to apply these rules, because
I want to work better with others.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Thanks&lt;/p&gt;
&lt;p&gt;to many who gave me feedback: Adrin Jalali, Andreas Mueller,
Elizabeth DuPre, Emmanuelle Gouillart, Guillaume
Lemaitre, Joel Nothman, Joris Van den Bossche, Nicolas Hug.&lt;/p&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;PS: note how many times I’ve used “you” above. I can clearly get better
at communication!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="programming"></category><category term="open source"></category><category term="people"></category></entry><entry><title>Jean Dechoux, June 13rd 1923 – Feb 9th 2020</title><link href="https://gael-varoquaux.info/personnal/jean-dechoux-june-13rd-1923-feb-9th-2020.html" rel="alternate"></link><published>2020-02-16T00:00:00+01:00</published><updated>2020-02-16T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-02-16:/personnal/jean-dechoux-june-13rd-1923-feb-9th-2020.html</id><summary type="html">&lt;p&gt;Jean Dechoux was born between the first and the second world wars, in a
small French town, close to Germany. His family was that of poor
farmers, who would work in coal mines to make up for the small size of
their crops.&lt;/p&gt;
&lt;p&gt;He grew to become a pulmonologist, heading …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Jean Dechoux was born between the first and the second world wars, in a
small French town, close to Germany. His family was that of poor
farmers, who would work in coal mines to make up for the small size of
their crops.&lt;/p&gt;
&lt;p&gt;He grew to become a pulmonologist, heading a hospital department that
tended to the illnesses of his people. He became an intellectual,
traveling the world, an avid reader, and the author of multiple
publications on diseases of coal miners.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The story of how Jean got his education is worth telling. His native
language was not even French, but the “Lorrain” dialect. His sisters started
working young. But he was able to go to school because the village priest had
perceived Jean’s intelligence and wanted him to go to the seminary.
However, the second world war came. Jean eventually got drafted into the
German (Nazi) army. Being from Lorraine, he was considered a German, yet
not one to be fully trusted: his fate was to be sent to Stalingrad, as
cannon fodder. Mistreated during training, he catches tuberculosis and
narrowly escapes the front. During his recovery in the German army
hospitals, a chief doctor shelters him, declares him unfit for service,
and pushes him to study for the &lt;em&gt;abitur&lt;/em&gt;, the German high-school degree.
Now Jean wants to become a doctor, and serves as a nurse in the German
hospitals.&lt;/p&gt;
&lt;p&gt;When the allies’ army advances, Jean is taken prisoner of war, then
incorporated into the French army, and eventually released with war
compensations. He uses them for college studies, during which he meets
his wife-to-be, Nicole Lissacq. Nicole is wealthier than him, and
receives a stipend, as a student of the famed “École Normale Supérieure”.
The rest is history: Jean is brilliantly successful during his medical
studies, and comes back to his native region, Lorraine, to work as a
doctor for the coal miners.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="../personnal/attachments/jean_dechoux.jpg" style="width: 200px;" /&gt;
&lt;p&gt;Jean, as I knew him, was a profoundly open and kind person. He survived
tragedy in his family by becoming even more so. Despite his age, he was
modern: the first time that I saw wifi was at his place.&lt;/p&gt;
&lt;p&gt;Jean was my grandfather. I very much look up to him.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
</content><category term="personnal"></category><category term="family"></category><category term="people"></category></entry><entry><title>Survey of machine-learning experimental methods at NeurIPS2019 and ICLR2020</title><link href="https://gael-varoquaux.info/science/survey-of-machine-learning-experimental-methods-at-neurips2019-and-iclr2020.html" rel="alternate"></link><published>2020-01-22T00:00:00+01:00</published><updated>2020-01-22T00:00:00+01:00</updated><author><name>Xavier Bouthillier &amp; Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-01-22:/science/survey-of-machine-learning-experimental-methods-at-neurips2019-and-iclr2020.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;A simple survey asking authors of two leading machine-learning
conferences a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;How do machine-learning researchers run their empirical validation? In
the context of a push for improved reproducibility and benchmarking, this
question is important to develop new tools for model comparison. We …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;A simple survey asking authors of two leading machine-learning
conferences a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;How do machine-learning researchers run their empirical validation? In
the context of a push for improved reproducibility and benchmarking, this
question is important to develop new tools for model comparison. We ran a
simple survey asking authors of two leading conferences, NeurIPS 2019
and ICLR 2020, a few quantitative questions on their experimental
procedures.&lt;/p&gt;
&lt;p&gt;A &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;technical report on HAL&lt;/a&gt; summarizes our
findings. It gives a simple picture of how hyper-parameters are set, how
many baselines and datasets are included, or how seeds are used.
Below, we give a very short summary, but please read (and &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823v1/bibtex"&gt;cite&lt;/a&gt;)
&lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;the full report&lt;/a&gt; if you are interested.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;
The response rates were 35.6% for NeurIPS and 48.6%
for ICLR.
A vast majority of empirical works optimize model hyper-parameters,
though almost half of these use manual tuning, and most of the automatic
hyper-parameter optimization is done with grid search. The typical number
of hyper-parameters set is in the interval 3–5, and fewer than 50 model fits
are used to explore the search space. In addition, most works also
optimized their baselines (typically, around 4 baselines).
Finally, studies typically reported 4 results per model per task to provide a measure of variance, and around 50% of them
used a different random seed for each experiment.&lt;/p&gt;
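To make the typical setup concrete, here is a minimal, hypothetical sketch of the dominant tuning strategy reported by respondents: a grid search over a few values of a few hyper-parameters, keeping the total number of model fits under 50. The scored function is a toy stand-in, not something from the survey.

```python
import itertools

# Toy stand-in for fitting a model and scoring a configuration
# (hypothetical; not part of the survey itself).
def fit_and_score(lr, depth):
    return -(lr - 0.1) ** 2 - (depth - 4) ** 2

# Grid over 2 hyper-parameters with a handful of values each: the
# typical setup reported (3-5 values per hyper-parameter, under 50 fits).
grid = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 8, 16]}
configs = [dict(zip(grid, values))
           for values in itertools.product(*grid.values())]
assert len(configs) < 50  # fewer than 50 model fits in total

best = max(configs, key=lambda c: fit_and_score(**c))
# best is {"lr": 0.1, "depth": 4}, the maximizer of the toy score
```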
&lt;p&gt;&lt;strong&gt;Sample results&lt;/strong&gt;&lt;/p&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/hyper_parameter_optimization.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;How many papers with experiments optimized hyperparameters.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/tuning_methods.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;What hyperparameter optimization method were used.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/number_datasets.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Number of different datasets used for benchmarking.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="side-caption figure align-center"&gt;
&lt;img alt="" src="../science/attachments/survey_of_ml_experimental_methods/number_seeds_or_trials.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Number of results reported for each model (ex: for different seeds)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;These are just samples. Read &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-02447823"&gt;the full report&lt;/a&gt; for
more results.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For reproducibility and AutoML, there is active research in benchmarking
and hyperparameter procedures in machine learning. We hope that the
survey results presented here can help inform this research. As this
document is merely a research report, we purposely limited
interpretation of the results and refrained from drawing recommendations. However, trends that stand out to our
eyes are &lt;cite&gt;1)&lt;/cite&gt; the simplicity of hyper-parameter tuning strategies
(mostly manual search and grid search), &lt;cite&gt;2)&lt;/cite&gt; the small number of
model fits explored during this tuning (often 50 or fewer), which biases the
results, and &lt;cite&gt;3)&lt;/cite&gt; the small number of performance results reported, which limits
statistical power. These
practices are most likely due to the high computational cost of fitting
modern machine-learning models.&lt;/p&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Code&lt;/p&gt;
&lt;p class="last"&gt;The code used for plotting and analysis is &lt;a class="reference external" href="https://github.com/bouthilx/ml-survey-2020"&gt;on github&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt; We are deeply grateful to the participants of
the survey who took time to answer the questions.&lt;/p&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="benchmarking"></category><category term="conferences"></category><category term="experimental methods"></category></entry><entry><title>2019: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2019-my-scientific-year-in-review.html" rel="alternate"></link><published>2020-01-05T00:00:00+01:00</published><updated>2020-01-05T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2020-01-05:/science/2019-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;My current research spans wide: from brain sciences to core data
science. My overall interest is to build &lt;strong&gt;methodology drawing insights from
data&lt;/strong&gt; for questions that have often been addressed qualitatively. If I can
highlight a few publications from 2019 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;, the common thread would be
computational statistics, from dirty …&lt;/p&gt;</summary><content type="html">&lt;p&gt;My current research spans wide: from brain sciences to core data
science. My overall interest is to build &lt;strong&gt;methodology drawing insights from
data&lt;/strong&gt; for questions that have often been addressed qualitatively. If I can
highlight a few publications from 2019 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;, the common thread would be
computational statistics, from dirty data to brain images. Let me try to
give the gist of these advances, in simple terms.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It’s already 2020, I’m always late.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#comparing-distributions" id="toc-entry-1"&gt;Comparing distributions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#predictive-pipelines-on-brain-functional-connectomes" id="toc-entry-2"&gt;Predictive pipelines on brain functional connectomes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#population-shrinkage-of-covariance" id="toc-entry-3"&gt;Population shrinkage of covariance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#deep-learning-on-non-translation-invariant-images" id="toc-entry-4"&gt;Deep learning on non-translation-invariant images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#open-science" id="toc-entry-5"&gt;Open science&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="comparing-distributions"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Comparing distributions&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Fundamental computational-statistics work&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What if you are given two sets of observations and need to conclude
whether they are drawn from the same distribution? We are interested in
this question for the &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;
research project, to facilitate analysis of data without manual curation.
Comparing distributions is indeed important to detect drifts in the data,
to match information across datasets, or to compensate for dataset
biases.&lt;/p&gt;
&lt;p&gt;Formally, we are given two clouds of points (circles and crosses in the
figure below) and we want to develop a statistical test of whether the
distributions differ. There is an abundant literature on this topic, which I
cover in &lt;a class="reference external" href="http://gael-varoquaux.info/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html"&gt;a more detailed post on this subject&lt;/a&gt;.
Specifically, when the observations have a natural similarity, for
instance when they live in a vector space, kernel methods are interesting
because they make it possible to estimate a representative of the underlying
distribution that interpolates between observations, as with &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;a kernel
density estimator&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing"&gt;&lt;img alt="" src="../science/attachments/comparing_distributions_l1/optimizing_position.png" style="width: 500px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Two cloud of points, the corresponding distribution representants μ_P
and μ_Q (blue and orange), the difference between these
(black), and locations to measure this difference (red triangles).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With Meyer Scetbon, in
&lt;a class="reference external" href="http://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing"&gt;Scetbon &amp;amp; Varoquaux, NeurIPS&lt;/a&gt;,
we investigate how to best measure the difference between these
representatives. We show that the best choice is to take the absolute value
of the difference (the l1 norm), while the default choice had so far been
the Euclidean (l2) norm. In a nutshell, the reason is that the difference
is most likely dense when the distributions differ: zero almost nowhere.&lt;/p&gt;
&lt;p&gt;We were able to show that the &lt;a class="reference external" href="https://slideslive.com/38921490/interpretable-comparison-of-distributions-and-models"&gt;sophisticated framework&lt;/a&gt;
for efficient and powerful tests in the
Euclidean case carries over to the l1 case. In particular, our paper
gives efficient testing procedures using a small number of locations to
avoid costly computation (the red triangles in the figure above), that
can either be sampled at random or optimized.&lt;/p&gt;
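As an illustration of the quantities at play (a schematic numpy sketch under our own simplifying assumptions, not the paper’s code): estimate the two mean embeddings at a few test locations, and compare the l1 and l2 statistics computed on their difference.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two clouds of 1D points from slightly different distributions
x = rng.normal(0.0, 1.0, size=200)
y = rng.normal(0.5, 1.0, size=200)

def mean_embedding(samples, locations, bandwidth=1.0):
    # Kernel mean embedding evaluated at a few test locations
    # (the red triangles above): an average of Gaussian kernels.
    d = locations[:, None] - samples[None, :]
    return np.exp(-d ** 2 / (2 * bandwidth ** 2)).mean(axis=1)

locations = np.linspace(-3, 3, 7)  # a small number J of locations
diff = mean_embedding(x, locations) - mean_embedding(y, locations)
stat_l1 = np.abs(diff).sum()          # l1 norm of the difference
stat_l2 = np.sqrt((diff ** 2).sum())  # classical l2 (Euclidean) norm
```

For a dense difference vector, the l1 norm aggregates the signal at every location, which is the intuition behind its better test power.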
&lt;p&gt;My hunch is that the result is quite general: the l1 geometry is better
than the l2 one on representatives of distributions. There might be more
fundamental mathematical properties behind this. The drawback is that the
l1 norm is non-smooth, which can be challenging in optimization settings.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="predictive-pipelines-on-brain-functional-connectomes"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Predictive pipelines on brain functional connectomes&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;&lt;em&gt;Brain-imaging methods&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Brain functional connectivity is increasingly used to extract biomarkers
of behavior and mental health. The long-term stakes are to ground
assessment of psychological traits on quantitative brain
data, rather than qualitative behavioral observations. But, to build
biomarkers, there are many details that go into estimating functional
connectivity from fMRI, something that I have studied for more than 10
years. With Kamalakar Dadi, in &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811919301594"&gt;Dadi et al&lt;/a&gt;,
we ran thorough empirical benchmarks to find which methodological choices
for the various steps of the pipeline give best prediction across
multiple cohorts. Specifically, we studied 1) defining regions of
interest for signal extraction, 2) building a functional-connectivity
matrix across these regions, 3) prediction across subjects with
supervised learning on these features.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811919301594"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/dadi_2019_highlights.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Summarizing our benchmark results.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Recommendations&lt;/p&gt;
&lt;ul class="last simple"&gt;
&lt;li&gt;functional regions (eg from dictionary learning)&lt;/li&gt;
&lt;li&gt;tangent-space for covariances&lt;/li&gt;
&lt;li&gt;l2-logistic regression&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;Results show the importance of defining regions from functional data,
ideally with a linear-decomposition method that produces soft
parcellations such as ICA or dictionary learning. To represent
connectivity between regions, the best choice is tangent-space
parametrization, a method to build a vector-space from covariance
matrices (more below). Finally, for supervised learning, a simple
l2-penalized logistic regression is the best option. With the huge popularity
of deep learning, it may come as a surprise that linear models are the best
performers, but this is well explained by the amount of data at hand: a
cohort typically comprises fewer than 1000 individuals, which is way below the
data sizes needed to see the benefits of non-linear models.&lt;/p&gt;
&lt;p&gt;A recent preprint, &lt;a class="reference external" href="https://www.biorxiv.org/content/10.1101/741595v2.abstract"&gt;Pervaiz et al&lt;/a&gt; from
Oxford, overall
confirms our findings, even though they investigated slightly
different methodological choices. In particular, they find tangent space
clearly useful.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In my eyes, such benchmarking studies are important not only to improve
prediction, but also to reduce analytic variability that opens the door
to inflation of reported effects. Indeed, given 1000 individuals, the
measure of prediction accuracy of a pipeline is quite imprecise
(&lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1053811917305311"&gt;Varoquaux 2018&lt;/a&gt;).
As a consequence, trying out a bunch of analytic choices and
publishing the one that works best can lead to grossly optimistic
prediction accuracies. &lt;strong&gt;If we want trust in biomarkers, we need to
reduce the variability in the methods used to build them&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="population-shrinkage-of-covariance"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Population shrinkage of covariance&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;Statistics for brain signals&lt;/p&gt;
&lt;p&gt;Estimating covariances is central for functional brain connectivity and
in many other applications. With Mehdi Rahim, in &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;Rahim et al&lt;/a&gt;
we considered the case of a population of random processes with
related covariances, as for instance when estimating functional
connectivity from a group of individuals. For this, we combined two
mathematical ideas: that of using natural operations on covariance
matrices, and that of priors for mean-square estimation:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Tangent space&lt;/strong&gt; Covariance matrices are positive-definite matrices,
for which standard arithmetics are not well suited &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;: subtracting
two covariance matrices can lead to a matrix that cannot be
the covariance of a signal. However, a group of covariance matrices can
be transformed into points in a vector space for which standard
distances and arithmetics respect the structure of
covariances (for instance, the Euclidean distance between these points
approximates the KL divergence between covariances). This is what we call
the &lt;em&gt;tangent space&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Technically, covariance matrices live on a Riemannian manifold:
a curve surface inside &lt;em&gt;R^{n x n}&lt;/em&gt; that has some metric
properties.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;James-Stein shrinkage&lt;/strong&gt; To estimate the mean of &lt;em&gt;n&lt;/em&gt; observations, it
is actually best not to compute the average of these, but rather to
push this average a bit toward a prior guess. The better the
guess, the more this “push” helps. The larger the number of observations,
the gentler this push should be. This strategy is known as
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator"&gt;James-Stein shrinkage&lt;/a&gt; and it
is in my opinion one of the most beautiful results in statistics.
It can be seen as a Bayesian posterior, but it comes with guarantees
that do not require the model to be true and that control estimation
error, rather than a posterior probability.&lt;/li&gt;
&lt;/ul&gt;
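A minimal sketch of (positive-part) James-Stein shrinkage toward zero for a vector of noisy observations, assuming unit noise variance. This is illustrative only; the estimator in the paper operates on covariances in tangent space, not on plain vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
obs = true_mean + rng.normal(size=true_mean.shape)  # noisy raw estimate

# Positive-part James-Stein: shrink the raw estimate toward a prior
# guess (here zero); the shrinkage weakens as the observed signal grows.
p, sigma2 = len(obs), 1.0
factor = max(0.0, 1.0 - (p - 2) * sigma2 / float(obs @ obs))
js_estimate = factor * obs  # pushed a bit toward the prior guess
```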
&lt;p&gt;James-Stein shrinkage is easily written for quadratic errors on vectors,
but cannot be easily applied to covariances, as they do not live in a vector
space and we would like to control a KL divergence rather than
a quadratic error. Our work combined both ideas to give an excellent
estimator of a family of related covariances that is also very
computationally efficient. We call it PoSCE: Population Shrinkage
Covariance Estimation.&lt;/p&gt;
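The tangent-space projection itself can be sketched in a few lines (an illustrative numpy version of the standard construction, not the PoSCE code): whiten each covariance by a reference covariance, such as the group mean, then take a matrix logarithm.

```python
import numpy as np

def spd_logm(m):
    # Matrix logarithm of a symmetric positive-definite matrix
    w, v = np.linalg.eigh(m)
    return (v * np.log(w)) @ v.T

def spd_invsqrt(m):
    # Inverse square root of a symmetric positive-definite matrix
    w, v = np.linalg.eigh(m)
    return (v / np.sqrt(w)) @ v.T

def to_tangent(cov, reference):
    # Whiten by the reference (e.g. group-mean) covariance, then take
    # the log: covariances become points of a vector space where
    # standard arithmetic respects their structure.
    white = spd_invsqrt(reference)
    return spd_logm(white @ cov @ white)

# The reference projects onto the origin of its own tangent space
ref = np.array([[2.0, 0.5], [0.5, 1.0]])
origin = to_tangent(ref, ref)  # ~ the zero matrix
```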
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/posce.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Schema of the estimation strategy: projecting the covariances matrices
into a tangent space, shrinkage to a group mean, but taking in account
the anisotropy of the dispersion of the group, and projecting back to
covariances.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is easy to see how accounting for group information in the estimation
of individual covariances can help stabilize them. However, will it be
beneficial if we are interested in the differences between these
covariances, for instance to ground biomarkers, as studied above? Our
results show that it does indeed help build better biomarkers, for
instance to predict brain age. The larger the group of covariances used,
the larger the benefits.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://www.sciencedirect.com/science/article/abs/pii/S1361841518301014"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/posce_age_learning_curve.png" style="width: 500px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Error in predicting brain aging decreases when more individuals are used
to build the biomarker.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="deep-learning-on-non-translation-invariant-images"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Deep learning on non-translation-invariant images&lt;/a&gt;&lt;/h2&gt;
&lt;p class="align-right"&gt;Computer vision&lt;/p&gt;
&lt;p&gt;Brain images, in particular images of brain activity, are very different
from the natural images on which most computer-vision research focuses.
A central difference is that detecting activity in different parts of the
brain completely changes the meaning of this detection, while detecting a
cat in the left or the right of a picture on Facebook makes no
difference. This is important because much of the progress in computer vision,
such as convolutional neural networks, is built on the fact that natural
images are statistically translation invariant. In contrast, brain
images are realigned to a template before being analyzed.&lt;/p&gt;
&lt;p&gt;Convolutional architectures have been crucial to the successes of deep
learning on natural images because they impose a lot of structure on the
weights of neural networks and thus help fight estimation noise. For
predicting from brain images, the regularization strategies that have
been successful foster spatially continuous structures. Unfortunately,
they have led to costly non-smooth optimizations that cannot easily be
used with the optimization framework of deep learning, stochastic
gradient descent.&lt;/p&gt;
&lt;p&gt;With Sergul Aydore, in &lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;Aydore et al, ICML&lt;/a&gt;, we have introduced a
spatial regularization that is compatible with the deep learning toolbox.
During the stochastic optimization, we impose random spatial structure
via feature groups estimated from the data. These stabilize the input
layers of deep architectures. They also lead to iterating on smaller
representations, which greatly speeds up the algorithm.&lt;/p&gt;
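The idea can be sketched on a plain linear model (a schematic numpy version under our own simplifying assumptions; the actual method of Aydore et al. uses groupings estimated from the data inside a deep network): at each step, reduce the data with a random grouping matrix, compute the gradient in the reduced space, and invert the reduction to update the full weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_groups = 32, 12, 4
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)
w = np.zeros(n_features)

def random_grouping():
    # Stand-in for a data-driven feature grouping: each feature is
    # assigned to one group; each row averages the features of a group.
    assign = rng.integers(0, n_groups, size=n_features)
    G = np.zeros((n_groups, n_features))
    G[assign, np.arange(n_features)] = 1.0
    return G / np.maximum(G.sum(axis=1, keepdims=True), 1.0)

for _ in range(200):  # SGD-like loop: new random grouping at each step
    G = random_grouping()
    Xr = X @ G.T                       # reduce the data to group space
    resid = Xr @ (G @ w) - y           # residual of the reduced model
    grad_r = Xr.T @ resid / n_samples  # gradient in the reduced space
    w -= 0.1 * (G.T @ grad_r)          # invert the reduction to update w
```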
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/stochastic_grouping_mlp.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;At each step of a stochastic gradient descent, we randomly pick a
feature-grouping matrix (itself estimated from the data), and use it
to reduce the data in the computations of the gradients, then invert
this reduction to update the weights.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;The paper&lt;/a&gt; comes with
extensive empirical validation, including comparison to convolutional
neural networks. We benchmark the strategy on brain images, but also
on realigned faces, to show that the approach is beneficial for any
non-translation-invariant images. In particular, the approach greatly
speeds up convergence.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;&lt;img alt="" src="../science/attachments/2019_highlights/stochastic_grouping_results.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Prediction accuracy as a function of training time – left: on
realigned faces – right: on brain images&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://proceedings.mlr.press/v97/aydore19a.html"&gt;This paper&lt;/a&gt; clearly
shows that &lt;strong&gt;one should not use convolutional neural networks on fMRI
data&lt;/strong&gt;: these images are not translation invariant.&lt;/p&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;&lt;strong&gt;Preprints&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;All papers are available as preprints, eg on &lt;a class="reference external" href="http://gael-varoquaux.info/publications.html"&gt;my site&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="open-science"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;Open science&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Open and reproducible science:&lt;/strong&gt; Looking at all these publications, I
realize that every single one of them comes with code on a GitHub
repository and is done on open data, which means that they can all be
easily reproduced. I’m very proud of the teams behind these papers.
Achieving this level of reproducibility requires hard work and
discipline. It is also a testament to a community investment in
software tools and infrastructure for open science that has been going on
for decades and provides the foundations on which these works build.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;A prize for scikit-learn:&lt;/strong&gt; On this topic, a highlight of 2019 was also
that the work behind scikit-learn was acknowledged in &lt;a class="reference external" href="../programming/getting-a-big-scientific-prize-for-open-source-software.html"&gt;an important
scientific prize&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Why open science:&lt;/strong&gt; Why do I care so much about open science? Because in
a world of uncertainty, the claims of science must be trusted and hence
built on transparent practice (think about science and global warming).
Because it helps put our methods in the hands of a wider public,
society at large. And because it levels the playing field, making it easier for
newcomers –young scientists, or developing countries– to contribute,
which in itself makes science more efficient.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="statistics"></category><category term="yearly report"></category></entry><entry><title>Comparing distributions: Kernels estimate good representations, l1 distances give good tests</title><link href="https://gael-varoquaux.info/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html" rel="alternate"></link><published>2019-12-08T00:00:00+01:00</published><updated>2019-12-08T00:00:00+01:00</updated><author><name>Meyer Scetbon &amp; Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2019-12-08:/science/comparing-distributions-kernels-estimate-good-representations-l1-distances-give-good-tests.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;Given two sets of observations, are they drawn from the same
distribution? Our paper &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Comparing distributions: l1 geometry
improves kernel two-sample testing&lt;/a&gt;
at the &lt;strong&gt;NeurIPS 2019 conference&lt;/strong&gt; revisits this classic statistical
problem known as “two-sample testing”.&lt;/p&gt;
&lt;p class="last"&gt;This post explains the context and the paper with a bit of hand …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;Given two sets of observations, are they drawn from the same
distribution? Our paper &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Comparing distributions: l1 geometry
improves kernel two-sample testing&lt;/a&gt;
at the &lt;strong&gt;NeurIPS 2019 conference&lt;/strong&gt; revisits this classic statistical
problem known as “two-sample testing”.&lt;/p&gt;
&lt;p class="last"&gt;This post explains the context and the paper with a bit of hand
waving.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-context-two-sample-testing" id="toc-entry-1"&gt;The context: two-sample testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#from-kernel-mean-embeddings-to-distances-on-distributions" id="toc-entry-2"&gt;From kernel mean embeddings to distances on distributions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#controlling-the-weak-convergence-of-probability-measures" id="toc-entry-3"&gt;Controlling the weak convergence of probability measures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#two-sample-testing-procedures" id="toc-entry-4"&gt;Two-sample testing procedures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#the-l1-metric-provides-best-testing-power" id="toc-entry-5"&gt;The L1 metric provides best testing power&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="the-context-two-sample-testing"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;The context: two-sample testing&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Given two samples from two unknown populations, the goal of two-sample tests is
to determine whether the underlying populations differ with a statistical
significance. For instance, we may want to know whether
McDonald’s and KFC use different logic to choose restaurant locations
across the US. This is a difficult question: we have access to data points,
but not to the underlying generative mechanism, which is probably governed by
marketing strategies.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/comparing_distributions_l1/map_KFC_McDo_simple.png" style="width: 70%;" /&gt;
&lt;/div&gt;
&lt;div class="section" id="from-kernel-mean-embeddings-to-distances-on-distributions"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;From kernel mean embeddings to distances on distributions&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the example of spatial distributions of restaurants,
there is &lt;strong&gt;a lot of information in how close observed data
points lie in the original measurement space (here geographic coordinates)&lt;/strong&gt;.
Kernel methods arise naturally to capture this information. They can be
applied to distributions, building representatives of distributions:
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_embedding_of_distributions"&gt;Kernel embeddings of distributions&lt;/a&gt;. The
mean embedding of a distribution P with a kernel k is written:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;) :  = &lt;span class="limits"&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator integral"&gt;∫&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;sub&gt;ℝ&lt;sup&gt;&lt;i&gt;d&lt;/i&gt;&lt;/sup&gt;&lt;/sub&gt;&lt;i&gt;k&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;, &lt;i&gt;t&lt;/i&gt;)&lt;i&gt;dP&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;)
&lt;/div&gt;
&lt;p&gt;Intuitively, it is related to &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;Kernel Density Estimates (KDEs)&lt;/a&gt; which
estimate a density in continuous space by smoothing the observed data
points with a kernel.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/comparing_distributions_l1/kde.jpg" /&gt;
&lt;p class="caption"&gt;Kernel mean embeddings for two distributions of points&lt;/p&gt;
&lt;/div&gt;
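&lt;p&gt;As a hand-wavy numerical sketch –not code from the paper– the mean embedding can be estimated by averaging kernel evaluations over the samples; the Gaussian kernel and its bandwidth below are arbitrary choices:&lt;/p&gt;

```python
import numpy as np

def mean_embedding(X, t, bandwidth=1.0):
    """Empirical mean embedding mu_P(t): average of k(x, t) over the samples x."""
    # Squared distances between every sample in X and every evaluation point in t
    sq_dists = ((X[:, None, :] - t[None, :, :]) ** 2).sum(axis=-1)
    # Gaussian kernel values, averaged over the samples (axis 0)
    return np.exp(-sq_dists / (2 * bandwidth ** 2)).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # 500 draws from a standard 2D Gaussian P
t = np.zeros((1, 2))           # evaluate the embedding at the origin
mu_at_origin = mean_embedding(X, t)[0]  # close to E[k(x, 0)] = 0.5 here
```

&lt;p&gt;Unlike a KDE, the embedding is not normalized to integrate to one; it is simply the expected kernel value at each location.&lt;/p&gt;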
&lt;p&gt;For two-sample testing, kernel embeddings can be used to compute distances
between distributions, building metrics over the space of probability
measures. Metrics between probability measures can be defined via the
notion of Integral Probability Metric (IPM): as a difference of
expectations:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;span class="text"&gt;IPM&lt;/span&gt;[&lt;i&gt;F&lt;/i&gt;, &lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;] :  = &lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;sup&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;f&lt;/i&gt; ∈ &lt;i&gt;F&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;(𝔼&lt;sub&gt;&lt;i&gt;x&lt;/i&gt; ∼ &lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;&lt;span class="stretchy"&gt;[&lt;/span&gt;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;)&lt;span class="stretchy"&gt;]&lt;/span&gt; − 𝔼&lt;sub&gt;&lt;i&gt;y&lt;/i&gt; ∼ &lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;&lt;span class="stretchy"&gt;[&lt;/span&gt;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;y&lt;/i&gt;)&lt;span class="stretchy"&gt;]&lt;/span&gt;)
&lt;/div&gt;
&lt;p&gt;where F is a class of functions. This definition is appealing because it
&lt;strong&gt;characterizes the difference between P and Q by the function for which
the expectation differs most&lt;/strong&gt;. The specific choice of function class
defines the metric. If we now consider a kernel, it implicitly defines a
space of functions (intuitively related to all the possible KDEs
generated by varying data points): a Reproducing Kernel Hilbert Space
(RKHS). Defining a metric (an IPM) with the function class F taken as the unit
ball of such an RKHS is known as the Maximum Mean Discrepancy (MMD). It
can be shown that, rather than computing the supremum, the MMD has a more
convenient expression, the RKHS distance between the mean embeddings:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;span class="text"&gt;MMD&lt;/span&gt;[&lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;] = ‖&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt; − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;‖&lt;sub&gt;&lt;i&gt;H&lt;/i&gt;&lt;sub&gt;&lt;i&gt;k&lt;/i&gt;&lt;/sub&gt;&lt;/sub&gt;
&lt;/div&gt;
&lt;p&gt;For good choices of kernels, the MMD has appealing mathematical
properties to compare distributions. With kernels said to be
characteristic, eg Gaussian kernels, the MMD is a metric: MMD[P, Q] = 0
if and only if P = Q. Using the MMD for two-sample testing –given only
observations from the distributions, and not P and Q–  requires using an
empirical estimation of the MMD. This can be done by computing the RKHS
norm in the expression above, which leads to summing kernel evaluations
on all data points in P and Q.&lt;/p&gt;
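&lt;p&gt;Concretely, expanding the squared RKHS norm gives three averages of pairwise kernel evaluations. A minimal sketch of the (biased) empirical MMD with a Gaussian kernel –helper names are mine, not the paper’s:&lt;/p&gt;

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # Pairwise Gaussian kernel values between the rows of A and of B
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def mmd2_biased(X, Y, bandwidth=1.0):
    # ||mu_X - mu_Y||^2 in the RKHS expands into three averages of kernel values
    return (gaussian_kernel(X, X, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean()
            - 2 * gaussian_kernel(X, Y, bandwidth).mean())

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(400, 2))
Y = rng.normal(0, 1, size=(400, 2))  # same distribution: MMD^2 close to 0
Z = rng.normal(2, 1, size=(400, 2))  # shifted distribution: MMD^2 clearly positive
```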
&lt;p&gt;Our work builds upon this framework, but deviates a bit from the
classical definition of MMD as it addresses the question of which norm is
best to use on the difference of mean embeddings, µQ - µP (as well as
other representatives, namely the smooth characteristic function, SCF).
We consider a wider family of metrics based on the Lp distances between
mean embeddings (p=2 recovers the classic framework):&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;d&lt;/i&gt;&lt;sub&gt;&lt;i&gt;L&lt;/i&gt;&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;, &lt;i&gt;μ&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;P&lt;/i&gt;, &lt;i&gt;Q&lt;/i&gt;) :  = &lt;span class="stretchy"&gt;(&lt;/span&gt;&lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator integral"&gt;∫&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;t&lt;/i&gt; ∈ ℝ&lt;sup&gt;&lt;i&gt;d&lt;/i&gt;&lt;/sup&gt;&lt;/sub&gt;&lt;/span&gt;|&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;P&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;) − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Q&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;)|&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;d&lt;/i&gt;Γ(&lt;i&gt;t&lt;/i&gt;)&lt;span class="stretchy"&gt;)&lt;/span&gt;&lt;sup&gt;1 ⁄ &lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;
&lt;/div&gt;
&lt;p&gt;where Γ is an absolutely continuous Borel probability measure.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="controlling-the-weak-convergence-of-probability-measures"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Controlling the weak convergence of probability measures&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We show that these metrics have good properties. Specifically, for p ≥ 1,
as soon as the kernel is bounded, continuous, and characteristic, these
metrics metrize the weak convergence: the distance between a sequence of
distributions and P tends to zero if and only if the sequence converges
weakly to P.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Convergence_of_measures#Weak_convergence_of_measures"&gt;weak convergence of probability measures&lt;/a&gt;
is a notion of convergence based &lt;strong&gt;not just on events having the same
probabilities under the two distributions, but also on some events being
“close”&lt;/strong&gt;. Indeed, classic convergence in probability just tells us that
the same observation should have the same probability under the two
distributions. Weak convergence takes into account the topology of the
observations. For instance, going back to the problem of spatial
distributions of restaurants, it does not only look at whether the
probabilities of having a McDonald’s or a KFC restaurant converge on
11th Wall Street, but also at whether restaurants are likely on 9th Wall Street.&lt;/p&gt;
&lt;p&gt;A simple example to see why this matters is to consider two Dirac
distributions: spikes each located at a single point. If we bring these spikes closer
and closer, merely looking at the probability of events in the same exact
position will not detect any convergence until the spikes exactly
overlap.&lt;/p&gt;
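&lt;p&gt;With a Gaussian kernel of bandwidth 1, two unit spikes at distance δ have squared kernel distance 2 − 2 exp(−δ²/2), which shrinks smoothly to zero as the spikes approach; the total variation distance, in contrast, stays maximal until they exactly coincide. A quick check:&lt;/p&gt;

```python
import numpy as np

def dirac_mmd2(delta):
    # Squared kernel distance between two Diracs at distance delta
    # (Gaussian kernel, bandwidth 1): k(x,x) + k(y,y) - 2 k(x,y)
    return 2 - 2 * np.exp(-delta ** 2 / 2)

# Evaluate as the spikes are brought closer and closer
distances = [dirac_mmd2(d) for d in (2.0, 1.0, 0.1, 0.0)]
# The kernel distance decreases continuously to 0 as the spikes approach,
# while the total variation would stay at its maximum for every nonzero delta.
```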
&lt;p&gt;Using kernel embeddings of distributions makes it possible to capture
convergence in the spatial domain, because the kernels used give a
spatial smoothness to the representatives:&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/comparing_distributions_l1/converging_diracs.png" style="width: 70%;" /&gt;
&lt;p&gt;Having a metric on probability distributions that captures the topology
of the observations is important for many applications, for instance when
fitting GANs to generate images: the goal is not only to capture whether
images are exactly the same, but also whether they are “close”.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="two-sample-testing-procedures"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-4"&gt;Two-sample testing procedures&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now that we have built metrics, we can derive two-sample test statistics.
A straightforward approach would involve large sums over all the
observations, which would be costly. Hence, we resort to a good
approximation by sampling a set of locations {Tj} from the distribution
Γ:&lt;/p&gt;
&lt;div class="formula"&gt;
&lt;i&gt;d̂&lt;/i&gt;&lt;span class="scripts"&gt;&lt;sup class="script"&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;&lt;sub class="script"&gt;&lt;i&gt;ℓ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sub&gt;, &lt;i&gt;μ&lt;/i&gt;, &lt;i&gt;J&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;[&lt;i&gt;X&lt;/i&gt;, &lt;i&gt;Y&lt;/i&gt;] :  = &lt;i&gt;n&lt;/i&gt;&lt;sup&gt;&lt;i&gt;p&lt;/i&gt; ⁄ 2&lt;/sup&gt;&lt;span class="limits"&gt;&lt;sup class="limit"&gt; &lt;/sup&gt;&lt;span class="limit"&gt;&lt;span class="bigoperator"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;sub class="limit"&gt;&lt;i&gt;j&lt;/i&gt; = 1..&lt;i&gt;J&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;|&lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;T&lt;/i&gt;&lt;sub&gt;&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;) − &lt;i&gt;μ&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Y&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;T&lt;/i&gt;&lt;sub&gt;&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;)|&lt;sup&gt;&lt;i&gt;p&lt;/i&gt;&lt;/sup&gt;
&lt;/div&gt;
&lt;p&gt;We show that this approximation maintains (almost surely) the appealing
metric properties, generalizing the results that were established by
&lt;a class="reference external" href="http://papers.nips.cc/paper/5685-fast-two-sample-testing-with-analytic-representations-of-probability-measures"&gt;Chwialkowski et al 2015&lt;/a&gt;
for the special case of the L2 metric.&lt;/p&gt;
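&lt;p&gt;A rough sketch of such a sampled statistic –with Γ taken as a standard Gaussian, an arbitrary choice, and helper names of my own:&lt;/p&gt;

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def lp_statistic(X, Y, J=10, p=1, seed=0):
    # Draw J test locations T_j from Gamma (here a standard Gaussian),
    # evaluate both empirical mean embeddings there, and sum |difference|^p
    rng = np.random.default_rng(seed)
    T = rng.normal(size=(J, X.shape[1]))
    mu_X = gaussian_kernel(X, T).mean(axis=0)  # mu_X(T_j), shape (J,)
    mu_Y = gaussian_kernel(Y, T).mean(axis=0)
    n = len(X)
    return n ** (p / 2) * (np.abs(mu_X - mu_Y) ** p).sum()

rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(500, 2))
Y = rng.normal(1, 1, size=(500, 2))  # shifted: the statistic should be large
```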
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/comparing_distributions_l1/optimizing_position.png" style="width: 70%;" /&gt;
&lt;p class="caption"&gt;Sampling at different positions&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;We further develop the testing procedures by showing that other tricks
known to improve testing with the L2 metric can be adapted to other
metrics, such as the L1 metric. Fast and performant tests can be obtained
by optimizing the test locations –using an upper-bound on the test power–
or by testing in the Fourier domain, using the Smooth Characteristic
Function of the kernel. Even in the case of the L1 metric, the null
distribution of the test statistic can be derived, leading to tests that
can control errors without permutations.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-l1-metric-provides-best-testing-power"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-5"&gt;The L1 metric provides best testing power&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Going back to our question of which norm on the difference of
distribution representatives is best suited to detect differences, we show
that when using analytic kernels, such as the Gaussian kernel, the L1 metric
improves upon the L2 metric, which corresponds to the classic definition
of the MMD.&lt;/p&gt;
&lt;p&gt;Indeed, analytic kernels are non-zero almost everywhere. As a result,
when P is different from Q, the difference between their mean embeddings
will be dense, as will the differences between the representatives
that we use to build our tests (for instance the values at the locations
used in the tests above). l1 norms capture dense
differences better than l2 norms –this is the reason why, used as penalties,
they induce sparsity.&lt;/p&gt;
&lt;img alt="" class="align-right" src="attachments/comparing_distributions_l1/l1_vs_l2.png" style="width: 150px;" /&gt;
&lt;p&gt;A simple intuition is that dense vectors tend to lie along the diagonals of
the measurement basis, as none of their coordinates are zero. On these
diagonals, at a given l2 norm, the l1 norm is much larger than the l1 norm of
vectors with some zero, or nearly-zero, coordinates.&lt;/p&gt;
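&lt;p&gt;A two-line numerical illustration: among unit-l2-norm vectors in dimension d, the constant “diagonal” vector has l1 norm √d, while a one-hot vector has l1 norm 1:&lt;/p&gt;

```python
import numpy as np

d = 100
dense = np.full(d, 1 / np.sqrt(d))  # unit l2 norm, all coordinates equal
sparse = np.zeros(d)
sparse[0] = 1.0                     # unit l2 norm, a single nonzero coordinate
l1_dense = np.abs(dense).sum()      # sqrt(d) = 10.0
l1_sparse = np.abs(sparse).sum()    # 1.0
```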
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For a very simple summary, the story is: to test
whether two distributions differ, it is useful to compute a “mean
kernel embedding” –similar to a kernel density estimate, but without
normalization– of each distribution, and consider the l1 norm of the
difference of these embeddings. The embeddings can be computed at a small number
of locations, either drawn at random or optimized. This approach is
reminiscent of looking at the total variation between the measures;
however, the use of kernels makes it robust to small spatial
noise in the observations, unlike the total variation, for which events
must perfectly coincide in both sets of observations (the total
variation does not metrize the weak convergence).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The framework exposed here is one that was developed over a long line
of research, which our work builds upon. &lt;a class="reference external" href="https://papers.nips.cc/paper/9398-comparing-distributions-ell_1-geometry-improves-kernel-two-sample-testing.html"&gt;Our paper&lt;/a&gt;
gives a complete list of references, however, some useful review
papers are&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;C.-J. Simon-Gabriel and B. Schölkopf. &lt;em&gt;Kernel distribution
embeddings: Universal kernels, characteristic kernels and kernel
metrics on distributions&lt;/em&gt;, &lt;a class="reference external" href="https://arxiv.org/abs/1604.05251"&gt;arXiv:1604.05251&lt;/a&gt;, 2016.&lt;/li&gt;
&lt;li&gt;A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, A. Smola; &lt;em&gt;A
Kernel Two-Sample Test&lt;/em&gt;, &lt;a class="reference external" href="http://www.jmlr.org/papers/v13/gretton12a.html"&gt;JMLR, 2012&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://slideslive.com/38921490/interpretable-comparison-of-distributions-and-models"&gt;The NeurIPS 2019 tutorial&lt;/a&gt;,
by Gretton, Sutherland, and Jitkrittum, is extremely didactic and gives
a lot of the big picture.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="two-sample testing"></category><category term="conferences"></category><category term="statistics"></category></entry><entry><title>Getting a big scientific prize for open-source software</title><link href="https://gael-varoquaux.info/programming/getting-a-big-scientific-prize-for-open-source-software.html" rel="alternate"></link><published>2019-12-01T06:00:00+01:00</published><updated>2019-12-01T06:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2019-12-01:/programming/getting-a-big-scientific-prize-for-open-source-software.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;An important acknowledgement for a different view of doing science:
open, collaborative, and more than a proof of concept.&lt;/p&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="attachments/sklearn_prize_academie/prize.jpg" style="width: 350px;" /&gt;
&lt;p&gt;A few days ago, Loïc Estève, Alexandre Gramfort, Olivier Grisel, Bertrand
Thirion, and myself received the &lt;a class="reference external" href="https://www.academie-sciences.fr/fr/Laureats/prix-inria-academie-des-sciences-2019-vincent-hayward-equipe-scikit-learn-et-maria-naya-plasencia.html"&gt;“Académie des Sciences Inria prize for transfer”&lt;/a&gt;,
for our contributions to the scikit-learn project …&lt;/p&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;An important acknowledgement for a different view of doing science:
open, collaborative, and more than a proof of concept.&lt;/p&gt;
&lt;/div&gt;
&lt;img alt="" class="align-right" src="attachments/sklearn_prize_academie/prize.jpg" style="width: 350px;" /&gt;
&lt;p&gt;A few days ago, Loïc Estève, Alexandre Gramfort, Olivier Grisel, Bertrand
Thirion, and myself received the &lt;a class="reference external" href="https://www.academie-sciences.fr/fr/Laureats/prix-inria-academie-des-sciences-2019-vincent-hayward-equipe-scikit-learn-et-maria-naya-plasencia.html"&gt;“Académie des Sciences Inria prize for transfer”&lt;/a&gt;,
for our contributions to the scikit-learn project. To put things simply,
it’s quite a big deal to me, because I feel that it illustrates a change
of culture in academia.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Recognizing an open view of scientific contributions&lt;/div&gt;
&lt;p&gt;It is a great honor, because the selection was made by the members of the
Académie des Sciences, very accomplished scientists with impressive
contributions to science. The “Académie” is the hallmark of fundamental
academic science in France. To me, this prize is also symbolic because it
recognizes an open view of academic research and transfer, a view that
sometimes felt as not playing according to the incentives. We started
scikit-learn as a crazy endeavor, a bit of a &lt;em&gt;hippy&lt;/em&gt; science thing.
People didn’t really take us seriously. We were working on software, and
not publications. We were doing open source, while industrial transfer is
made by creating startups or filing patents. We were doing Python, while
academic machine learning was then done in Matlab, and industrial
transfer in C++. We were not pursuing the latest publications, while
these are thought to be research’s best assets. We were interested in
reaching out to non-experts, while the partners deemed
interesting were those with qualified staff.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Quality and openness, at the cost of quantity and control&lt;/div&gt;
&lt;p&gt;No. We did it differently. We reached out to an open community. We did
BSD-licensed code. We worked to achieve quality at the cost of quantity. We
cared about installation issues, on-boarding biologists or medical
doctors, playing well with the wider scientific Python ecosystem.
We gave decision power to people outside of Inria, sometimes whom we had
never met in real life. We made sure that Inria was never the sole actor,
the sole stake-holder. We never pushed our own scientific publications in
the project. We limited complexity, trading off performance for ease of
use, ease of installation, ease of understanding.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;object data="attachments/sklearn_prize_academie/sklearn_website_stats_white.svg" style="width: 25%;" type="image/svg+xml"&gt;&lt;/object&gt;
&lt;/div&gt;
&lt;p&gt;As a consequence, we slowly but surely assembled a large community. In
such a community, the
sum is greater than the parts. The breadth of interlocutors and cultures
slows movement down, but creates better results, because these results are
understandable to many and usable on a diversity of problems. The
consequence of this quality is that
we were progressively used in more and more places: industrial
data-science labs, startups, research in applied or fundamental
statistical learning, teaching. Ironically, the institutional world did
not notice. It got hard, next to impossible, to get funding &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. A few years
ago, I was told by a central governmental agency that we, open-source
zealots, were destroying an incredible amount of value by giving away
for free the production of research &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;. The French report on AI, led by a
Fields medalist, cited TensorFlow and Theano –a discontinued software–, but
ignored scikit-learn; maybe because we were doing “boring science”?&lt;/p&gt;
&lt;p&gt;But, scikit-learn’s amazing community continued plowing forward. We grew
so much that we were heard from the top. The prize from the Académie shows
that we managed to capture the attention of senior scientists with
open-source software, because this software is really having a worldwide
impact in many disciplines.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/sklearn_prize_academie/academie_presentation.jpeg" style="width: 70%;" /&gt;
&lt;p class="caption"&gt;Presenting scikit-learn at the Academie Des Sciences&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
An accomplishment of the community&lt;/div&gt;
&lt;p&gt;There were only five of us on stage, as the prize is for Inria permanent
staff. But this is of course not a fair account of how the project has
grown and what made it successful.&lt;/p&gt;
&lt;p&gt;In 2011, at &lt;a class="reference external" href="scikit-learn-nips-2011-sprint-international-thanks-to-our-sponsors.html"&gt;the first international sprint&lt;/a&gt;,
I felt something was happening: Incredible people whom I had never met
before were sitting next to me, working very hard on solving problems
with me. This experience of being united to solve difficult problems is
something amazing. And I deeply thank every single person who has worked
on this project, the 1500 contributors, many of those that I have never
met, in particular &lt;a class="reference external" href="https://scikit-learn.org/stable/about.html#authors"&gt;the core team&lt;/a&gt; who is committed
to making sure that every detail of scikit-learn is solid and serves the
users. The team that has assembled over the years is of incredible
quality.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="align-right docutils container"&gt;
The promises of data science need open source&lt;/div&gt;
&lt;p&gt;The world does not understand how much the promises of data science,
for today and tomorrow, need open source projects, easy to install and to use
by everybody. These projects are like &lt;a class="reference external" href="https://www.fordfoundation.org/work/learning/research-reports/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/"&gt;roads and bridges&lt;/a&gt;:
they are needed for growth, though no one wants to pay for maintaining
them. I hope that I can use the podium that the prize will give us to
stress the importance of the battle that we are fighting.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Getting funding from the government implied too much politics and
risks. For these reasons, I turned to private donors, in a
&lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;foundation&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Inria &lt;em&gt;always&lt;/em&gt; supported us, and often paid developers in my team
out of its own pockets.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;PS: As another illustration of the culture change toward openness in
science, it was announced during the ceremony that the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Comptes_rendus_de_l%27Acad%C3%A9mie_des_Sciences"&gt;“Compte Rendu de
l’Académie des Sciences”&lt;/a&gt; is becoming open access, without publication
charges!&lt;/p&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="science"></category><category term="scientific computing"></category><category term="open source"></category><category term="software"></category></entry><entry><title>2018: my scientific year in review</title><link href="https://gael-varoquaux.info/science/2018-my-scientific-year-in-review.html" rel="alternate"></link><published>2019-01-03T00:00:00+01:00</published><updated>2019-01-03T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2019-01-03:/science/2018-my-scientific-year-in-review.html</id><summary type="html">&lt;p&gt;From a scientific perspective, 2018 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; was once again extremely exciting
thanks to awesome collaborators (at &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Inria&lt;/a&gt;, with &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;, and our &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;local scikit-learn team&lt;/a&gt;).
Rather than going over everything that we did in 2018, I would like to
give a few highlights: We published major work using &lt;strong&gt;machine learning to …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;From a scientific perspective, 2018 &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; was once again extremely exciting
thanks to awesome collaborators (at &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Inria&lt;/a&gt;, with &lt;a class="reference external" href="https://project.inria.fr/dirtydata/"&gt;DirtyData&lt;/a&gt;, and our &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;local scikit-learn team&lt;/a&gt;).
Rather than going over everything that we did in 2018, I would like to
give a few highlights: We published major work using &lt;strong&gt;machine learning to
map cognition in the brain&lt;/strong&gt;; we started a new research project on &lt;strong&gt;analysis
of non-curated data&lt;/strong&gt; (addressing all of data science, beyond brain
imaging); and we worked a lot on &lt;strong&gt;growing scikit-learn&lt;/strong&gt;.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It’s already 2019, I am indeed late in posting this summary.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="contents topic" id="highlights"&gt;
&lt;p class="topic-title"&gt;Highlights&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#cognitive-brain-mapping" id="toc-entry-1"&gt;Cognitive brain mapping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#data-science-without-data-cleaning" id="toc-entry-2"&gt;Data science without data cleaning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#scikit-learn-growth-and-consolidation" id="toc-entry-3"&gt;Scikit-learn: growth and consolidation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="cognitive-brain-mapping"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;Cognitive brain mapping&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We have been exploring &lt;strong&gt;how predictive models can help map cognition
in the human brain&lt;/strong&gt;. In 2018, these long-running efforts led to important
publications.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="atlases-of-cognition-with-large-scale-human-brain-mapping"&gt;
&lt;h3&gt;Atlases of cognition with large-scale human brain mapping&lt;/h3&gt;
&lt;p&gt;More than 6 years ago, with my student Yannick Schwartz, we started
working on &lt;strong&gt;compiling an atlas of cognition across many cognitive
neuroimaging studies&lt;/strong&gt;. This turned out to be quite challenging for several
reasons:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Formalizing the links between mental processes&lt;/strong&gt; studied across the
literature is challenging. Strictly speaking, every paper studies a
different mental process. However, to build an atlas of cognition, we
are interested in finding commonalities across the literature.&lt;/li&gt;
&lt;li&gt;While cognitive studies tend to target a specific mental function,
the psychological manipulations that they use also recruit many other
processes. For instance, a memory study might use a &lt;em&gt;visual n-back&lt;/em&gt;
task, and hence recruit the visual cortex. The problem is more than an
experimental inconvenience: &lt;strong&gt;varying details of an experiment may
trigger different cognitive processes&lt;/strong&gt;. For instance, there are common
and separate pathways for visual word recognition and auditory word
recognition.&lt;/li&gt;
&lt;li&gt;Simply &lt;strong&gt;detecting regions that are recruited in a given mental operation
leads to selecting the whole cortex&lt;/strong&gt; with enough statistical power. Indeed,
tasks are never fully balanced; reading might for instance require more
attention than listening.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These challenges are related on the one hand to the problem of &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1364661305003360"&gt;reverse
inference&lt;/a&gt;
&lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;, and on the other hand to that of mental-process decomposition, or
cognitive subtraction, both central to cognitive neuroimaging. They also
call for formal knowledge representation, &lt;em&gt;eg&lt;/em&gt; by building ontologies,
which is a task harder than it might seem at first glance.&lt;/p&gt;
&lt;table class="side-hanging docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;In essence, the reverse inference problem arises because in a
cognitive brain imaging the observed brain activity is a consequence
of the behavior, and not a cause. While a conclusion that activity in
a brain structure causes a certain behavior is desirable, it is not
directly supported by a cognition neuroimaging experiment.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In our work &lt;a class="reference external" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;[Varoquaux et al, PLOS 2018]&lt;/a&gt;,
we tackled these challenges to build atlases of cognition as follows:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;We assigned to each brain-activity image labels describing the
&lt;em&gt;multiple&lt;/em&gt; mental processes related to the experimental manipulation&lt;/li&gt;
&lt;li&gt;We used decoding –&lt;em&gt;ie&lt;/em&gt; prediction of the cognitive labels from the brain
activity– to ground a principled &lt;em&gt;reverse inference&lt;/em&gt; interpretation:
the regions selected indeed imply the corresponding behavior.&lt;/li&gt;
&lt;li&gt;Regions in the atlas were built of brain structures that both implied
the corresponding cognition, and were triggered by it (conditional and
marginal link), to ground a strong selectivity:&lt;/li&gt;
&lt;/ul&gt;
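&lt;p&gt;The decoding step above can be sketched with a toy nearest-centroid
classifier (a pure-Python illustration on synthetic data; the labels,
vectors, and classifier here are ours, not the actual pipeline of the
paper):&lt;/p&gt;

```python
# Toy sketch of "decoding": predict a cognitive label from a brain-activity
# vector with a nearest-centroid classifier (illustrative only).
def centroid(vectors):
    # Coordinate-wise mean of a list of activity vectors
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

def decode(activity, centroids):
    # Assign the label whose centroid is closest in squared Euclidean distance
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda label: dist(activity, centroids[label]))

# Synthetic activation maps (3 "voxels") for two mental processes
train = {
    "visual":   [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "auditory": [[0.1, 0.9, 0.8], [0.0, 1.0, 0.9]],
}
centroids = {label: centroid(vs) for label, vs in train.items()}
print(decode([0.95, 0.15, 0.05], centroids))  # a held-out "visual"-like image
```

&lt;p&gt;As in the paper, what grounds the reverse inference is that the label is
predicted from held-out activity, rather than read off the training maps.&lt;/p&gt;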
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;&lt;img alt="" src="attachments/2018_highlights/mapping_types.png" style="width: 700px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;We applied these techniques to the data from 30 different studies,
resulting in a detailed breakdown of the cortex into functionally-specialized
modules:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006565"&gt;&lt;img alt="" src="attachments/2018_highlights/cognitive_regions.png" style="width: 700px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;Importantly, the validity of this decomposition in regions is established
by the ability of these regions to predict the cognitive aspects of new
experimental paradigms.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="predictive-models-avoid-excessive-reductionism-in-cognitive-neuroimaging"&gt;
&lt;h3&gt;Predictive models avoid excessive reductionism in cognitive neuroimaging&lt;/h3&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2018_highlights/decoding.png" style="width: 400px;" /&gt;
&lt;/div&gt;
&lt;p&gt;While machine learning is generally seen as an engineering tool to build
predictive models or automate tasks, I see in it a central method of
modern science. Indeed, it can distill &lt;strong&gt;evidence that generalizes&lt;/strong&gt; from
vast –high dimensional– and ill-structured experimental data. Beyond
prediction, it can guide understanding.&lt;/p&gt;
&lt;p&gt;With Russ Poldrack, we wrote an opinion paper &lt;a class="reference external" href="https://hal.archives-ouvertes.fr/hal-01856412/"&gt;[Varoquaux &amp;amp; Poldrack,
Curr Opinion Neurobio 2019]&lt;/a&gt; that details why
predictive models are important tools to building wider theories of brain
function. It reviews much exciting progress in uncovering, with
predictive models, how brain mechanisms support the mind. It makes the
point that the &lt;strong&gt;ability to generalize is a fundamentally desirable property of
scientific inference&lt;/strong&gt;. Models that are grounded in explicit
generalization give a solid path to build broad theories of the mind.
Particularly interesting is generalization to significantly different
settings, &lt;em&gt;ie&lt;/em&gt; going further than the typical cross-validation experiments of
machine learning, where the same data are artificially split.&lt;/p&gt;
&lt;p&gt;Something that is dear to my heart is that we are aiming for
&lt;strong&gt;quantitative generalization&lt;/strong&gt;, while psychology often contents itself
with qualitative generalization.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="individual-brain-charting-a-high-resolution-fmri-dataset-for-cognitive-mapping"&gt;
&lt;h3&gt;Individual Brain Charting, a high-resolution fMRI dataset for cognitive mapping&lt;/h3&gt;
&lt;p&gt;We are convinced of the importance of analyzing brain response across
multiple paradigms, to build models of brain function that generalize
across these paradigms. However, addressing such a research program by
aggregating multiple studies is hindered by data heterogeneity, due to
inter-individual differences or to differing scanners.&lt;/p&gt;
&lt;p&gt;Hence, my team, &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Parietal&lt;/a&gt;, has
undertaken a major data acquisition, the &lt;a class="reference external" href="https://project.inria.fr/IBC"&gt;Individual Brain Charting
project&lt;/a&gt;: &lt;strong&gt;scanning a few individuals
on a huge number of cognitive tasks&lt;/strong&gt;. The data acquisition will last
for many years, as the individuals come back to the lab for new
acquisitions. The images are of excellent quality, thanks to the unique
expertise of our scanning site, Neurospin, a brain-imaging research
facility.&lt;/p&gt;
&lt;p&gt;The data are completely &lt;strong&gt;openly accessible&lt;/strong&gt;: the raw data, preprocessed
data, and statistical outputs, alongside the processing scripts. We are
releasing new data as the project moves forward. This year, we published
the data paper &lt;a class="reference external" href="https://www.nature.com/articles/sdata2018105"&gt;[Pinho et al, Scientific Data 2018]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Data accumulation in brain imaging&lt;/p&gt;
&lt;p&gt;We are living exciting times, as &lt;strong&gt;there are more and more large volumes
of shared brain imaging data&lt;/strong&gt;. &lt;a class="reference external" href="https://openfmri.org/"&gt;OpenfMRI&lt;/a&gt;
aggregates data in a consistent way across brain-imaging
studies. Large projects such as the Human Connectome Project, our
Individual Brain Charting project, or the UK BioBank, are designed
from the beginning to be shared. We are entering an era of
brain-image analysis on many terabytes of data, with tens of
thousands of subjects, spanning hundreds of different clinical or
cognitive conditions.&lt;/p&gt;
&lt;p&gt;Massive data accumulation opens exciting new scientific prospects,
and raises new engineering challenges. Some of these challenges are
to scale up neuroimaging data-processing practices, eg inter-subject
alignments at the scale of many thousands of subjects. Some of these
challenges are new to neuroimaging: &lt;strong&gt;when compounding hundreds of
sources of data into an analysis, the human cost of data
integration becomes a major roadblock&lt;/strong&gt;. As I have become convinced
that analysing more, and more diverse, data is an important way
forward, I have started working on data integration per se.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="data-science-without-data-cleaning"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-2"&gt;Data science without data cleaning&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="a-new-personal-research-agenda-dirtydata"&gt;
&lt;h3&gt;A new personal research agenda: DirtyData&lt;/h3&gt;
&lt;p&gt;Challenges to integrating data in a statistical analysis are ubiquitous,
including in brain imaging. Data cleaning &lt;a class="reference external" href="https://www.kaggle.com/surveys/2017"&gt;is recognized&lt;/a&gt; as the number one time sink for
data scientists. When advising scikit-learn users, including very large
companies, I often find that the major roadblock is going from the raw
data sources to the data matrix that is input to scikit-learn.&lt;/p&gt;
&lt;p&gt;A year ago, I started a new research focus, around the &lt;a class="reference external" href="https://project.inria.fr/dirtydata"&gt;DirtyData project&lt;/a&gt;. We now have a team with multiple
exciting collaborations, and funding. Our goal is to &lt;strong&gt;facilitate
statistical analysis of non-curated data&lt;/strong&gt;. We hope to foster better
understanding of how powerful machine-learning models can cope with
imperfect, non-homogeneous data. As we go, we will publish this
understanding, but also distribute code with new methods, and hopefully
influence common data-science practices and software. This is an exciting
adventure (and yes, &lt;strong&gt;we are hiring&lt;/strong&gt;; see our &lt;a class="reference external" href="https://project.inria.fr/dirtydata/job-offers"&gt;job offers&lt;/a&gt; or contact me).&lt;/p&gt;
&lt;p&gt;The topics are vast, at the intersection between database research and
statistics. In particular, they call for integrating machine learning
with:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Knowledge representation&lt;/li&gt;
&lt;li&gt;Information retrieval&lt;/li&gt;
&lt;li&gt;Information extraction&lt;/li&gt;
&lt;li&gt;Statistics with missing data&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="similarity-encoding-analysis-with-non-normalized-string-categories"&gt;
&lt;h3&gt;Similarity encoding: analysis with non-normalized string categories&lt;/h3&gt;
&lt;p&gt;While the DirtyData project is young, we already made progress for
analysis of &lt;strong&gt;dirty categories, ie categorical data represented with
strings that lack curation&lt;/strong&gt;. These can have typos or other simple
morphological variants (&lt;em&gt;eg&lt;/em&gt; “patient” vs “patients”), or they can have
more structured and fundamental differences, &lt;em&gt;eg&lt;/em&gt; arising from the merge
of multiple data sources. This latter problem is well known in database
research, where it is seen as a &lt;em&gt;record linkage&lt;/em&gt; or &lt;em&gt;alignment&lt;/em&gt; problem.&lt;/p&gt;
&lt;p&gt;For statistical analysis, in particular machine learning, the problem
with these non-curated string categories is that they must be encoded to
numerical representations, and classic categorical encodings are not well
suited for them. For instance, one-hot encoding of such high-cardinality
categories leads to very high-dimensional representations.&lt;/p&gt;
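&lt;p&gt;A toy sketch of the issue (illustrative data, not from the paper): with
dirty categories, one-hot encoding gives every distinct string its own
orthogonal dimension, so near-duplicates such as “patient” and “patients”
are treated as unrelated:&lt;/p&gt;

```python
# Toy illustration: one-hot encoding of non-curated string categories.
# Every distinct string, including near-duplicates, gets its own dimension.
values = ["patient", "patients", "Patient", "doctor", "doctors"]
vocabulary = sorted(set(values))  # 5 distinct strings, hence 5 dimensions
one_hot = [[1 if v == c else 0 for c in vocabulary] for v in values]
print(len(vocabulary))  # dimensionality grows with every typo or variant
# "patient" and "patients" activate different dimensions: their one-hot
# vectors are orthogonal, so the encoding carries no notion of closeness.
```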
&lt;p&gt;In &lt;a class="reference external" href="https://hal.inria.fr/hal-01806175"&gt;Cerda et al (2018)&lt;/a&gt;, we
contribute a simple encoding approach, &lt;em&gt;similarity encoding&lt;/em&gt;, based on
interpolating one-hot encoding with string similarities between the
categories.&lt;/p&gt;
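&lt;p&gt;A minimal sketch of the idea (using the Python standard library’s string
similarity rather than the similarities studied in the paper): each
category is encoded by its similarity to every reference category, so
morphological variants get close vectors:&lt;/p&gt;

```python
from difflib import SequenceMatcher

# Toy sketch of similarity encoding: replace the 0/1 entries of one-hot
# encoding with a continuous string similarity to each reference category.
def similarity_encode(value, vocabulary):
    return [SequenceMatcher(None, value, ref).ratio() for ref in vocabulary]

vocabulary = ["doctor", "nurse", "patient"]
a = similarity_encode("patient", vocabulary)   # exact match: 1.0 on the "patient" axis
b = similarity_encode("patients", vocabulary)  # typo-like variant stays close to a
print(a)
print(b)
```

&lt;p&gt;The encoder studied in the paper considers several string similarities
(e.g. n-gram based); this sketch only conveys the interpolation idea.&lt;/p&gt;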
&lt;div class="figure align-center"&gt;
&lt;a class="reference external image-reference" href="https://dirty-cat.github.io/stable/auto_examples/01_investigating_dirty_categories.html"&gt;&lt;img alt="" src="attachments/2018_highlights/investigating_dirty_categories.png" style="width: 600px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://dirty-cat.github.io/stable/auto_examples/02_fit_predict_plot_employee_salaries.html"&gt;&lt;img alt="" src="attachments/2018_highlights/predict_employee_salaries.png" style="width: 230px;" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;We ran an extensive empirical study, and showed that &lt;strong&gt;similarity encoding
leads to better prediction accuracy without curation of the data&lt;/strong&gt;,
outperforming all the other approaches that we tried. The paper is purely
empirical, but stay tuned: a theoretical analysis of why this is the case
is coming soon.&lt;/p&gt;
&lt;p&gt;For the benefit of data scientists and researchers, we released a
small Python package, &lt;a class="reference external" href="https://dirty-cat.github.io/stable/"&gt;dirty-cat&lt;/a&gt;,
for learning with dirty categories.&lt;/p&gt;
&lt;p&gt;This is just the beginning of the DirtyData project, more exciting work
is under way.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-growth-and-consolidation"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="#toc-entry-3"&gt;Scikit-learn: growth and consolidation&lt;/a&gt;&lt;/h2&gt;
&lt;img alt="" class="align-right" src="attachments/2018_highlights/scikit-learn-logo-notext.png" style="width: 150px;" /&gt;
&lt;p&gt;In 2018, a lot of my energy went to consolidating scikit-learn as a
project. Describing the work in detail is for another post. However, my
main efforts were around growing the team and working on sustainability.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;We established a &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/"&gt;scikit-learn foundation at Inria&lt;/a&gt;, in which companies
partner with us to fund scikit-learn development. This took a lot of
effort to establish good partnerships and create the legal vessels.
Indeed, we want to make sure that the common effort is invested to make
scikit-learn better. For instance, working with Intel, who are in
something of an arms race for computing speed, we improved our test suite,
and are slowly but surely learning how to improve our speed.&lt;/li&gt;
&lt;li&gt;A consequence of the foundation is that we are hiring to grow the team
(check out &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/people/"&gt;our open positions&lt;/a&gt;). In 2018, my own
team grew, with more excellent people working on scikit-learn, but also
&lt;a class="reference external" href="http://joblib.readthedocs.io/"&gt;joblib&lt;/a&gt;, and even contributing to
core Python and numpy to improve &lt;a class="reference external" href="https://github.com/python/cpython/pull/3895"&gt;parallel computing&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/numpy/numpy/pull/12133"&gt;pickling&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;As the scikit-learn community is growing, it seemed important to
formalize a bit more how decisions are made. To me, an important aspect
was laying out clearly that the project is still governed by the
community, and not partners or people paid by the foundation. We have a
draft of a &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/12878"&gt;governance document&lt;/a&gt; that is
pretty much ready for merge. We also worked on a &lt;a class="reference external" href="https://scikit-learn.org/dev/roadmap.html"&gt;roadmap&lt;/a&gt;. It is a non-binding
document, but it was still an interesting exercise.&lt;/li&gt;
&lt;li&gt;Scikit-learn 0.20 was released, &lt;a class="reference external" href="https://scikit-learn.org/dev/whats_new.html"&gt;with many enhancements&lt;/a&gt;. And the 0.20 release
was followed by two minor releases, to make sure that our users got
robust code with backward compatibility.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We are busy finishing a few very interesting studies; next year will be
exciting! I hope that we will have much to say about population analysis
with brain imaging, which is an amazingly interesting subject.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="yearly report"></category></entry><entry><title>A foundation for scikit-learn at Inria</title><link href="https://gael-varoquaux.info/programming/a-foundation-for-scikit-learn-at-inria.html" rel="alternate"></link><published>2018-09-17T00:00:00+02:00</published><updated>2018-09-17T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2018-09-17:/programming/a-foundation-for-scikit-learn-at-inria.html</id><summary type="html">&lt;p&gt;We have just announced that a foundation will be supporting scikit-learn
at Inria &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;: &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr"&gt;scikit-learn.fondation-inria.fr&lt;/a&gt;&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Growth and sustainability&lt;/div&gt;
&lt;p&gt;This is an exciting turn for us, because it enables us to receive private
funding. As a result, we will be able to have secure employment for some
existing core …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We have just announced that a foundation will be supporting scikit-learn
at Inria &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;: &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr"&gt;scikit-learn.fondation-inria.fr&lt;/a&gt;&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Growth and sustainability&lt;/div&gt;
&lt;p&gt;This is an exciting turn for us, because it enables us to receive private
funding. As a result, we will be able to have secure employment for some
existing core contributors, and to hire more people on the team. The goal
is to help sustain quality (more frequent releases?) and to tackle
some ambitious features.&lt;/p&gt;
&lt;div class="section" id="a-foundation-what-and-why"&gt;
&lt;h2&gt;A foundation? What and why?&lt;/h2&gt;
&lt;p&gt;Open source lives and thrives by its base, the community of developers.
And scikit-learn is a fantastic example of these dynamics. Because of its
grass-root origins, it has focused on features that matter for the small
and the many, such as ease of use and statistical models that work well
in data-poor situations. Over the years, decisions have been based on
their technical merit, rather than the importance of displaying a list of
features that are trendy. A consequence of the breadth of contributors
with different backgrounds is that the library tends to be well-suited for
many applications, including some models that are less mainstream.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
People with dedicated time to support the community&lt;/div&gt;
&lt;p&gt;That said, over time there is an increasing need for a core team of
maintainers. As the library gets bigger, it is more and more difficult to
have a full view of what is happening. Integration of new features,
quality assurance, and releases are best done by developers who can
dedicate a large amount of time to the library. Also, ambitious changes
to the library, such as improving the parallel computing engine, need
long efforts. For many years, we have always had people with dedicated
time to support the community. In France, we were jumping through hoops to
find public money to fund them. As someone who has made this effort, I
can tell you that it is a complicated one &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The ability to receive money from sponsors will enable us to scale up our
operations. I was initially worried that we would have difficulties
finding partners willing to give us money without asking for
control on the project. However, I was proven wrong, and we have found a
small set of &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/en/home/#sponsors"&gt;great partners&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="what-will-people-work-on-how-will-decisions-be-made"&gt;
&lt;h2&gt;What will people work on? How will decisions be made?&lt;/h2&gt;
&lt;p&gt;It can be a difficult exercise to balance how money is used in a
community-driven project. The project should not lose its drive, in which
the community of developers is central. The interests of the sponsors
should not take precedence over the interests of the user base.&lt;/p&gt;
&lt;p&gt;We will make sure that the money that the foundation receives is invested
for the interest of the community. We have a technical committee that
supervises the activity of the foundation. Its decisions will be informed
by the community &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;. For this, we have an advisory board composed of
core contributors of scikit-learn. Beside the advisory board, the
technical committee also comprises a delegate from each sponsor. I am
excited about the input that our partners will provide us on
the priorities for them, as they represent various industries.
Voting power will be spread so that sponsors and community have the same
voting power.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="why-not-an-existing-foundation-such-as-numfocus-or-the-psf"&gt;
&lt;h2&gt;Why not an existing foundation such as NumFOCUS, or the PSF?&lt;/h2&gt;
&lt;p&gt;There are several reasons why we choose this particular legal vessel. Our
endeavor is slightly different from the prominent foundations in our ecosystem,
&lt;a class="reference external" href="https://numfocus.org"&gt;NumFocus&lt;/a&gt; and the &lt;a class="reference external" href="https://www.python.org/psf"&gt;PSF (Python Software
Foundation)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The first important aspect is that we want to employ full-time
developers. Different countries have very different legal frameworks, and
it is really hard for a non-profit to transfer money overseas. Handling
physical commitments, such as employing people or owning real estate, is even harder. We
needed something in France. And there might be a need for something else
in another country at some point.&lt;/p&gt;
&lt;p&gt;Another reason to be embedded in the Inria foundation is that it gives
us a really good deal. We basically get legal advice, accounting,
office space, and IT support, for an 8% overhead. This is an excellent
deal and is part of the sponsoring efforts that Inria will keep doing.&lt;/p&gt;
&lt;p&gt;Last, we feel that a foundation targeting specifically scikit-learn can
raise money from different people than other foundations. I think that
there is value in having multiple foundations seeking money for open-source
software. Indeed, a foundation builds a case and an image, to convince
donors. Different donors require a different case and a different image.
For instance the president of NumFOCUS &lt;a class="reference external" href="https://twitter.com/aterrel/status/1039488246454083585"&gt;argues for a name less focused on
numerics&lt;/a&gt;. Yet,
too wide of a scope can dilute the image.&lt;/p&gt;
&lt;p&gt;We have in mind to make it easy for other foundations to support
scikit-learn. We have major contributors at leading institutions, such
as Andreas Mueller at Columbia or Joel Nothman at Sydney university. It
is important that these institutions can easily gather donations too, in
the legal framework suited to their country. Hence the name reflects that
the foundation is embedded at Inria, leaving room for other initiatives.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="what-s-the-scope"&gt;
&lt;h2&gt;What’s the scope?&lt;/h2&gt;
&lt;p&gt;The scope of our work is everything scikit-learn related. It is not the
whole pydata or scipy ecosystem: it is focused on scikit-learn. But we
will not hesitate to contribute fixes and enhancements to neighboring
projects, like in the past, even all the way up to core Python &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-4"&gt;[4]&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;I am very excited. A strong team of full-time contributors will allow
us to do ambitious things with scikit-learn.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Join us&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We will be recruiting! See &lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/people"&gt;our positions&lt;/a&gt;. Come work with us
in Paris.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I want to end by thanking the amazing men and women who have been
contributing to scikit-learn, and are with us in this fantastic
adventure! The energy that is in this project is incredible. We are
launching this effort thanks to you, and to empower you even more.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/code_sklearn_crop.jpg" style="width: 90%;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;I am quite proud that over the years, my group has employed
&lt;a class="reference external" href="https://github.com/ogrisel"&gt;Olivier Grisel&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/jorisvandenbossche"&gt;Joris van den Bossche&lt;/a&gt; (working on pandas in
addition to scikit-learn), &lt;a class="reference external" href="https://github.com/glemaitre"&gt;Guillaume Lemaître&lt;/a&gt; (working on imbalanced-learn in
addition to scikit-learn), &lt;a class="reference external" href="https://github.com/jeremiedbb"&gt;Jérémie du Boisberranger&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/tomMoral"&gt;Tom Moreau&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/lesteve"&gt;Loic Estève&lt;/a&gt;,
&lt;a class="reference external" href="https://github.com/fabianp"&gt;Fabian Pedregosa&lt;/a&gt;, to name only a
few. All these people, and the many other students that we have
paid part time to work on software, have had a structuring
impact on our ecosystem, going beyond the bounds of scikit-learn
and touching many aspects of computing in Python. However, because
of the constraints of research funding in France, public money
forced me to hire them on short-term contracts.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Technically, it is a tax-deductible scikit-learn consortium inside
the Inria foundation, which is a non-profit entity related to Inria.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Details on the goverance of the foundation can be found at
&lt;a class="reference external" href="https://scikit-learn.fondation-inria.fr/en/mission-and-governance"&gt;https://scikit-learn.fondation-inria.fr/en/mission-and-governance&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-4"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;For instance, Olivier and Tom have been making parallelism more
robust in Python 3.7 (among other issues,
&lt;a class="reference external" href="https://bugs.python.org/issue33056"&gt;https://bugs.python.org/issue33056&lt;/a&gt; and
&lt;a class="reference external" href="https://bugs.python.org/issue31699"&gt;https://bugs.python.org/issue31699&lt;/a&gt;). Olivier helped define the
&lt;a class="reference external" href="https://www.python.org/dev/peps/pep-0574/"&gt;new pickling protocol&lt;/a&gt;, crucial to
efficient persistence.
This is hard work. Yet it is
important, because it benefits all libraries.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="open-source"></category><category term="sustainabilty"></category><category term="scientific software"></category></entry><entry><title>Sprint on scikit-learn, in Paris and Austin</title><link href="https://gael-varoquaux.info/programming/sprint-on-scikit-learn-in-paris-and-austin.html" rel="alternate"></link><published>2018-08-01T00:00:00+02:00</published><updated>2018-08-01T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2018-08-01:/programming/sprint-on-scikit-learn-in-paris-and-austin.html</id><summary type="html">&lt;p&gt;Two weeks ago, we held a scikit-learn sprint in Austin and Paris. Here is
a brief report on progress and challenges.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Several sprints&lt;/p&gt;
&lt;p&gt;We actually held two sprints in Austin: one open sprint, at the SciPy
conference sprints, which was open to new contributors, and one core
sprint, for more …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Two weeks ago, we held a scikit-learn sprint in Austin and Paris. Here is
a brief report on progress and challenges.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Several sprints&lt;/p&gt;
&lt;p&gt;We actually held two sprints in Austin: one open sprint, at the SciPy
conference sprints, which was open to new contributors, and one core
sprint, for more advanced contributors. Thank you to all who joined
the SciPy conference sprint. As I wasn’t there, I cannot report on
it.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="many-achievements"&gt;
&lt;h2&gt;Many achievements&lt;/h2&gt;
&lt;p&gt;Too many things were done to be listed here. Here is a brief overview:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;OPTICS got merged&lt;/strong&gt;: &lt;a class="reference external" href="http://scikit-learn.org/dev/modules/clustering.html#optics"&gt;The OPTICS clustering algorithm&lt;/a&gt; is a
density-based clustering algorithm, like DBSCAN, but with more flexible
and easier-to-set hyperparameters. Our implementation also scales better
to very large numbers of samples. The &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/1984"&gt;pull request&lt;/a&gt; was opened
in 2013 and received many improvements over the years.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Yeo-Johnson&lt;/strong&gt;: &lt;a class="reference external" href="http://scikit-learn.org/dev/modules/preprocessing.html#mapping-to-a-gaussian-distribution"&gt;The Yeo-Johnson transform&lt;/a&gt;
is a simple parametric transformation of the data that can be used to
make it more Gaussian. It is similar to the Box-Cox transform but can
deal with negative data
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11520"&gt;PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Novelty versus outlier detection&lt;/strong&gt;: Novelty detection attempts to
find, in new data, observations that differ from the training data.
Outlier detection considers that even the training data may contain
aberrant observations. New modes in scikit-learn enable both usage
scenarios with the same algorithms (see &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/issues/8693"&gt;this issue&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/10700"&gt;this
PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Missing-value indicator&lt;/strong&gt;: a new transform that adds indicator columns
marking missing data
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/8075"&gt;PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PyPy support&lt;/strong&gt;: PyPy support was merged.
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11010"&gt;PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Random Forest with 100 estimators&lt;/strong&gt;: The default of &lt;cite&gt;n_estimators&lt;/cite&gt; in
RandomForest was changed from 10, which was fast but statistically
poor, to 100 (&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11542"&gt;PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Changing to 5-fold&lt;/strong&gt;: we changed the default cross-validation from
3-fold to 5-fold
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11557"&gt;PR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Toward release 0.20&lt;/strong&gt;: most of the effort of the sprint was actually
spent on addressing issues for the 0.20 release: a long list of quality
improvements
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/milestone/24"&gt;milestone&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
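&lt;p&gt;To make the Yeo-Johnson item above concrete, here is a minimal
pure-Python sketch of the transform itself (a simplification for
illustration; scikit-learn exposes it through
&lt;cite&gt;PowerTransformer&lt;/cite&gt; with &lt;cite&gt;method="yeo-johnson"&lt;/cite&gt;).
Unlike Box-Cox, it is defined for negative inputs:&lt;/p&gt;

```python
import math

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single value.

    Unlike Box-Cox, it is defined for negative x, at the cost of a
    piecewise definition (four cases, depending on x and lambda).
    """
    if x >= 0:
        if abs(lmbda) > 1e-12:
            return ((x + 1.0) ** lmbda - 1.0) / lmbda
        return math.log1p(x)               # lambda == 0
    if abs(lmbda - 2.0) > 1e-12:
        return -((1.0 - x) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)
    return -math.log1p(-x)                 # lambda == 2

# lambda = 1 leaves the data unchanged; other values reshape it,
# and negative inputs are handled, unlike with Box-Cox.
print(yeo_johnson(5.0, 1.0), yeo_johnson(-3.0, 0.5))
```

&lt;p&gt;In practice, the lambda parameter is estimated from the data so
that the transformed values are as Gaussian as possible.&lt;/p&gt;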
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-is-hard-work"&gt;
&lt;h2&gt;Scikit-learn is hard work&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/dev_scikit-learn.png" style="width: 300px;" /&gt;
&lt;p class="caption"&gt;Even for the almighty &amp;#64;amueller&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Two days of intense group work on scikit-learn reminded me how hard
this work is. I thought it might be a good idea to illustrate
why.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Mathematical errors&lt;/strong&gt;: maintaining the library requires mathematical
understanding of the models. For instance, Ivan Panico &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11585"&gt;fixed the sparse
PCA&lt;/a&gt;, for
which the transform was mathematically incorrect.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Numerical instabilities&lt;/strong&gt;: sometimes, however, when models give a
result different from the expected one, this is due to numerical
instability. For instance, Sergül Aydöre &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11587"&gt;changed the tolerance for
certain variants of ridge&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keeping examples and documentation up to date&lt;/strong&gt;:
Each change requires updating all documentation and examples, and we
have a lot of these. For instance, Alexandre Boucaud &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11557"&gt;had to update many examples and
documentation pages when changing the default cross-validation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clean deprecation path&lt;/strong&gt;: We make sure that our changes do not break
users’ code, and therefore we provide a smooth update path, with
progressive deprecations. For instance, &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11557"&gt;the change of default
cross-validation&lt;/a&gt; introduces
an intermediate step where the default is kept the same but a warning
announces that it will change in two releases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistent behavior across the library&lt;/strong&gt;:
One of the celebrated strengths of scikit-learn is its very
consistent behavior across different models. We enforce this with “common
tests” that check properties of all estimators together. For
instance, Sergül implemented &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11558"&gt;common tests for sample weights&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extensive testing&lt;/strong&gt;: We test many, many things in scikit-learn:
that the code snippets in the documentation are correct, that &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/11421"&gt;the
docstring conventions&lt;/a&gt; are
respected, and that no deprecation errors are raised, including from
our dependencies. As a result, continuous integration is a core part
of our development. During the sprint, we flooded our cloud-based
continuous integration, and as a result iteration really slowed down.
&lt;a class="reference external" href="https://travis-ci.org/"&gt;TravisCI&lt;/a&gt; was kind enough to fix this by
freely allocating us more computing power.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supporting many versions&lt;/strong&gt;: Last but not least, one constraint that
makes scikit-learn development hard is that we support many
different versions of Python, of our dependencies, of linear-algebra
libraries, and of operating systems. This makes development harder and
continuous integration slower. But we feel that this is very valuable
for a core library: narrowing the supported versions means that users
are more likely to end up in unsatisfiable dependency situations,
where different parts of a project want different versions of a
dependency.&lt;/li&gt;
&lt;/ul&gt;
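&lt;p&gt;The deprecation pattern described above can be sketched in a few
lines (a hypothetical simplification for illustration, not
scikit-learn’s actual code): the old default is kept for now, but a
warning announces the upcoming change so that no code silently changes
behavior.&lt;/p&gt;

```python
import warnings

def cross_validate(estimator, X, y, cv="warn"):
    # Sentinel default: keep the old behavior (3-fold) for now,
    # but tell users that the default will become 5-fold.
    if cv == "warn":
        warnings.warn(
            "The default value of cv will change from 3 to 5 in two "
            "releases. Pass cv explicitly to silence this warning.",
            FutureWarning,
        )
        cv = 3
    return cv  # stand-in for the actual cross-validation computation

# Passing cv explicitly opts in to the new value and silences the warning:
print(cross_validate(None, None, None, cv=5))
```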
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="admonition warning"&gt;
&lt;p class="first admonition-title"&gt;Warning&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dropping support for Python 2&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Supporting many versions slows development. It also prevents
implementing new features: supporting Python 2 makes it harder to
provide better parallelism or traceback management.&lt;/p&gt;
&lt;p class="last"&gt;Python 3 has been out for 10 years. It is solid and comes with many
improvements over Python 2. Alongside with &lt;a class="reference external" href="http://python3statement.org"&gt;many other projects&lt;/a&gt;, we will be requiring Python 3 for
the future releases of scikit-learn (0.21 and later). scikit-learn
0.20 will be the last release to support Python 2. It will enable
us to develop faster a better toolkit.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="credits-and-acknowledgments"&gt;
&lt;h2&gt;Credits and acknowledgments&lt;/h2&gt;
&lt;div class="section" id="contributors-to-the-sprint"&gt;
&lt;h3&gt;Contributors to the sprint&lt;/h3&gt;
&lt;div class="sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Women contributors&lt;/p&gt;
&lt;p class="last"&gt;We deeply regret having only one woman in this long list of
contributors. We care about diversity and welcome contributors from
under-represented groups &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[*]&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;In Paris&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="columns simple"&gt;
&lt;li&gt;Albert Thomas, Huawei&lt;/li&gt;
&lt;li&gt;Alexandre Boucaud, Inria&lt;/li&gt;
&lt;li&gt;Alexandre Gramfort, Inria&lt;/li&gt;
&lt;li&gt;Eric Lebigot, CFM&lt;/li&gt;
&lt;li&gt;Gaël Varoquaux, Inria&lt;/li&gt;
&lt;li&gt;Ivan Panico, Deloitte&lt;/li&gt;
&lt;li&gt;Jean-Baptiste Schiratti, Telecom ParisTech&lt;/li&gt;
&lt;li&gt;Jérémie du Boisberranger, Inria&lt;/li&gt;
&lt;li&gt;Léo Dreyfus-Schmidt, Dataiku&lt;/li&gt;
&lt;li&gt;Nicolas Goix&lt;/li&gt;
&lt;li&gt;Samuel Ronsin, Dataiku&lt;/li&gt;
&lt;li&gt;Sebastien Treguer, Independent&lt;/li&gt;
&lt;li&gt;Sergül Aydöre, Stevens Institute of Technology&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;In Austin&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="columns simple"&gt;
&lt;li&gt;Andreas Mueller, Columbia&lt;/li&gt;
&lt;li&gt;Guillaume Lemaître, Inria&lt;/li&gt;
&lt;li&gt;Jan van Rijn, Columbia&lt;/li&gt;
&lt;li&gt;Joan Massich, Inria&lt;/li&gt;
&lt;li&gt;Joris Van den Bossche, Inria&lt;/li&gt;
&lt;li&gt;Loïc Estève, Inria&lt;/li&gt;
&lt;li&gt;Nicolas Hug, Columbia&lt;/li&gt;
&lt;li&gt;Olivier Grisel, Inria&lt;/li&gt;
&lt;li&gt;Roman Yurchak, independent&lt;/li&gt;
&lt;li&gt;William de Vazelhes, Inria&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Remote&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="columns simple"&gt;
&lt;li&gt;Hanmin Qin, Peking University&lt;/li&gt;
&lt;li&gt;Joel Nothman, University of Sydney&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="sponsors"&gt;
&lt;h3&gt;Sponsors&lt;/h3&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://franceisai.com/"&gt;France Is AI&lt;/a&gt; payed the travel of the French
contributors to Austin&lt;/li&gt;
&lt;li&gt;The NSF and the Sloan Foundation paid for the travel of the people from
Columbia.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://scipy2018.scipy.org"&gt;SciPy 2018&lt;/a&gt; organizers (and sponsors) hosted the first part of the sprint in Austin,&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.enthought.com/"&gt;Enthought&lt;/a&gt; hosted the second part of the sprint in Austin,&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.dataiku.com/"&gt;Dataiku&lt;/a&gt; hosted us in Paris&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://travis-ci.org/"&gt;TravisCI&lt;/a&gt; raised our number of workers for
online testing&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.meetup.com/Paris-Machine-learning-applications-group/"&gt;ParisML meetup&lt;/a&gt; helped us with the organization&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thank you all for the support.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Also thanks to Andy Mueller and Olivier Grisel for feedback on this blog post.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[*]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;We aspire to treat everybody exactly the same way. However,
acknowledging the fact that there is currently a lack of diversity, we
are happy to do some outreach and give extra help onboarding
newcomers.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="open-source"></category><category term="reproducible research"></category><category term="scientific software"></category></entry><entry><title>Our research in 2017: personal scientific highlights</title><link href="https://gael-varoquaux.info/science/our-research-in-2017-personal-scientific-highlights.html" rel="alternate"></link><published>2017-12-31T00:00:00+01:00</published><updated>2017-12-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2017-12-31:/science/our-research-in-2017-personal-scientific-highlights.html</id><summary type="html">&lt;p&gt;In my opinion the scientific highlights of 2017 for &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt; were on multivariate predictive
analysis for brain imaging: a brain decoder more efficient and faster
than alternatives, improved clinical predictions by jointly predicting
multiple traits of subjects, decoding based on the raw time-series of
brain activity, and a personal …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In my opinion the scientific highlights of 2017 for &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt; were on multivariate predictive
analysis for brain imaging: a brain decoder more efficient and faster
than alternatives, improved clinical predictions by jointly predicting
multiple traits of subjects, decoding based on the raw time-series of
brain activity, and a personal concern with the small sample sizes we
use in predictive brain imaging…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-fast-and-stable-brain-decoder-using-ensembling-frem"&gt;
&lt;h2&gt;A fast and stable brain decoder using ensembling: FReM&lt;/h2&gt;
&lt;p&gt;We have been working for 10 years on methods for brain decoding:
predicting behavior from imaging. In particular, we developed state of
the art decoders based on &lt;a class="reference external" href="http://ieeexplore.ieee.org/abstract/document/5711672/"&gt;total variation&lt;/a&gt;.
In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917308182"&gt;Hoyos-Idrobo et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/INRIA/hal-01615015v1"&gt;preprint&lt;/a&gt;)
we used a different technique based on ensembling: combining many fast
decoders. The resulting decoder, dubbed &lt;em&gt;FReM&lt;/em&gt;, predicts better, faster,
and with more stable maps than existing methods. Indeed, we have learned
that good prediction accuracy is not the only important feature of a
decoder.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/frem_benchmarks.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="brain-imaging-to-characterize-individuals-joint-prediction-of-multiple-traits"&gt;
&lt;h2&gt;Brain imaging to characterize individuals: joint prediction of multiple traits&lt;/h2&gt;
&lt;p&gt;In &lt;em&gt;population imaging&lt;/em&gt;, individual traits are linked to their brain
images. Predictive models ground the development of imaging biomarkers.
In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305438"&gt;Rahim et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01547524/"&gt;preprint&lt;/a&gt;), we showed that
accounting for multiple traits of the subjects when &lt;em&gt;learning&lt;/em&gt; the
biomarker gave a better prediction of the individual traits. For
instance, knowing the MMSE (mini mental state examination) of subjects
in a reference population helps derive better markers of Alzheimer’s
disease, even for subjects of unknown MMSE. This is an important step to
including a more complete picture of individuals in imaging studies.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/multi_output_decoder.jpg" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="time-domain-decoding-for-fmri"&gt;
&lt;h2&gt;Time-domain decoding for fMRI&lt;/h2&gt;
&lt;p&gt;In studies of cognition with functional MRI, the standard practice for
decoding brain activity is to estimate a first-level model that teases
apart the different experimental trials. It results in maps of the brain
regions that correlate with each trial. Decoding is then run on
these maps, with supervised learning. The limitation of this approach is
that the experiment has to be designed with a good time separation
between trials.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/time_domain_decoding.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;In &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917306651"&gt;Loula et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01576641/"&gt;preprint&lt;/a&gt;) we designed a
&lt;em&gt;time-domain decoding&lt;/em&gt; scheme, that starts from the raw brain activity
time-series and predicts model time-courses of cognition. From these, it
can classify the type of each trial. Importantly, it works better than
traditional approaches when the trials are not well separated. It thus
opens the door to decoding in experiments that were so far too fast.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="cross-validation-failure-the-dangers-of-small-samples"&gt;
&lt;h2&gt;Cross-validation failure: the dangers of small samples&lt;/h2&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/sample_size_distribution.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;I wrote &lt;a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S1053811917305311"&gt;an opinion paper&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01545002/"&gt;preprint&lt;/a&gt;) on a problem of our
field that has been worrying me a lot: &lt;strong&gt;often, we do not have enough
samples to assess properly the predictive power in neuroimaging&lt;/strong&gt;.
Indeed, the typical predictive analysis in neuroimaging uses 100 samples.&lt;/p&gt;
&lt;div style="clear: both"&gt;&lt;/div&gt;&lt;div class="figure align-right"&gt;
&lt;img alt="" src="attachments/2017_highlights/binomial_cdf.png" style="width: 300px;" /&gt;
&lt;/div&gt;
&lt;p&gt;The error distribution on the measure of prediction accuracy of a decoder
is at best given by a binomial distribution. With around 100 samples, this yields
confidence bounds around ±7%. Analysis of neuroimaging studies reveals
larger error bars.&lt;/p&gt;
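&lt;p&gt;The order of magnitude of these error bars can be checked with a
back-of-the-envelope computation, using the normal approximation to the
binomial (the exact bounds depend on the accuracy and on the confidence
level chosen):&lt;/p&gt;

```python
import math

def accuracy_ci_halfwidth(accuracy, n_samples, z=1.96):
    """Half-width of the ~95% normal-approximation confidence
    interval for an accuracy measured on n_samples test samples."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

# With about 100 samples and a typical accuracy of 70%, the error
# bars approach +/-9%: the same order as the effects under study.
print(round(accuracy_ci_halfwidth(0.7, 100), 3))
```

&lt;p&gt;Quadrupling the number of samples only halves the width of the
confidence interval, which is why increasing sample sizes is costly but
necessary.&lt;/p&gt;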
&lt;p&gt;Such error bars, large compared to the effect of interest, undermine
publications using or developing predictive models in neuroimaging.
Indeed, they couple with the publication incentives in two ways. First,
studies that by chance observe an effect are published, while the others
end up unaccounted for in a &lt;em&gt;file drawer&lt;/em&gt;. Second, minor
modifications to the data-processing strategy give large but meaningless
differences in the observed prediction accuracy. These &lt;em&gt;researcher
degrees of freedom&lt;/em&gt; can hardly be checked in a review process or a
statistical test. The methods research, trying to improve decoders, is
hindered by such error bars and should consider multiple datasets to
gauge progress. Clinical neuroimaging, for biomarkers, must increase
sample sizes and face heterogeneity.&lt;/p&gt;
&lt;p&gt;I believe that this is a major challenge for our field, and invite you to
read the paper if you are not convinced.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="convergence-proofs-for-last-year-s-blazing-fast-dictionary-learning"&gt;
&lt;h2&gt;Convergence proofs for last year’s blazing fast dictionary learning&lt;/h2&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="attachments/2017_highlights/online_dict_learning.png" style="width: 600px;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://ieeexplore.ieee.org/abstract/document/8038072/"&gt;Mensch et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01431618/"&gt;preprint&lt;/a&gt;) is a long paper that
studies in detail our very fast dictionary learning algorithm, with
extensive experiments and convergence proofs. On huge matrices, such as
brain imaging data in population studies, hyperspectral imaging, or
recommender systems, it gives &lt;strong&gt;10-fold speedups for matrix factorization&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We are busy finishing a few very interesting studies. Stay posted, next
year will be exciting!&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="machine learning"></category><category term="yearly report"></category></entry><entry><title>Beyond computational reproducibility, let us aim for reusability</title><link href="https://gael-varoquaux.info/programming/beyond-computational-reproducibility-let-us-aim-for-reusability.html" rel="alternate"></link><published>2017-09-19T12:10:00+02:00</published><updated>2017-09-19T12:10:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2017-09-19:/programming/beyond-computational-reproducibility-let-us-aim-for-reusability.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Scientific progress calls for reproducing results. Due to limited
resources, this is difficult even in computational sciences. Yet,
reproducibility is only a means to an end. It is not enough by itself
to enable new scientific results. Rather, new discoveries must build
on reuse and modification of the state …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Scientific progress calls for reproducing results. Due to limited
resources, this is difficult even in computational sciences. Yet,
reproducibility is only a means to an end. It is not enough by itself
to enable new scientific results. Rather, new discoveries must build
on reuse and modification of the state of the art. As time goes, this
state of the art must be consolidated in software libraries, just as
scientific knowledge has been consolidated on bookshelves of
brick-and-mortar libraries.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="small docutils container"&gt;
I am reposting &lt;a class="reference external" href="https://openlab-flowers.inria.fr/uploads/default/original/1X/65addc14bb2a6a7feaf7690865fa3708d5b0990f.pdf"&gt;an essay&lt;/a&gt;
that I wrote on reproducible science and software libraries. The full
discussion is in &lt;a class="reference external" href="https://openlab-flowers.inria.fr/t/ieee-cis-newsletter-on-cognitive-and-developmental-systems/129/1"&gt;IEEE CIS TC Cognitive and Developmental Systems&lt;/a&gt;,
but I’ve been told that it is hard to find.&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Science is based on the ability to falsify claims. Thus, reproduction or
replication of published results is central to the progress of science.
Researchers failing to reproduce a result will raise questions:
Are these investigators not skilled enough? Did they misunderstand the
original scientific endeavor? Or is the scientific claim unfounded? For
this reason, the quality of the methods description in a research paper
is crucial. Beyond papers, computers —central to science in our digital
era— bring the hope of automating reproduction. Indeed, computers excel
at doing the same thing several times.&lt;/p&gt;
&lt;p&gt;However, there are many challenges to computational reproducibility. To
begin with, computers enable reproducibility only if all steps of a
scientific study are automated. In this sense, interactive environments
—productivity-boosters for many— are detrimental unless they enable easy
recording and replay of the actions performed. Similarly, as a
computational-science study progresses, it is crucial to keep track of
changes to the corresponding data and scripts. With a
software-engineering perspective, version control is the solution. It
should be in the curriculum of today’s scientists. But it does not
suffice. Automating a computational study is difficult. This is because
it comes with a large maintenance burden: operations change rapidly,
straining limited resources —processing power and storage. Saving
intermediate results helps. As does devising light experiments that are
easier to automate. These are crucial to the progress of science, as
laboratory classes or thought experiments in physics. A software
engineer would relate them to unit tests, elementary operations checked
repeatedly to ensure the quality of a program.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Archiving computers in thermally-regulated nuclear-proof vaults?&lt;/div&gt;
&lt;p&gt;Once a study is automated and published, ensuring reproducibility should
be easy; just a matter of archiving the computer used, preferably in a
thermally-regulated nuclear-proof vault. Maybe, dear reader, the
scientist in you frowns at this solution. Indeed, studies should also be
reproduced by new investigators. Hardware and software variations then
get in the way. Portability, &lt;em&gt;i.e.&lt;/em&gt; achieving identical results across
platforms, is well-known by the software industry as being a difficult
problem. It faces great hurdles due to incompatibilities in compilers,
libraries, or operating systems. Beyond these issues, portability also
faces numerical and statistical stability issues in scientific computing.
Hiding instability problems with heavy restrictions on the environment is
like rearranging deck chairs on the Titanic. While enough freezing will
recover reproducibility, unstable operations cast doubt upon scientific
conclusions they might lead to. Computational reproducibility is more
than a software engineering challenge; it must build upon solid numerical
and statistical methods.&lt;/p&gt;
&lt;p&gt;Reproducibility is not enough. It is only a means to an end, scientific
progress. Setting in stone a numerical pipeline that produces a figure is
of little use to scientific thinking if it is a black box. Researchers
need to understand the corresponding set of operations to relate them to
modeling assumptions. New scientific discoveries will arise from varying
those assumptions, or applying the methodology to new questions or new
data. Future studies build upon past studies, standing on the shoulders
of giants, as Isaac Newton famously wrote. In this process, published
results need to be modified and adapted, not only reproduced. Enabling
reuse is an important goal.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Libraries as reusable computational experiments&lt;/div&gt;
&lt;p&gt;To a software architect, a reusable computational experiment may sound
like a library. Software libraries are not only a good analogy, but also
an essential tool. The demanding process of designing a good library
involves isolating elementary steps, ensuring their quality, and
documenting them. It is akin to the editorial work needed to assemble a
textbook from the research literature.&lt;/p&gt;
&lt;p&gt;Science should value libraries made of code, and not only bookshelves.
But they are expensive to develop, and even more so to maintain. Where
should we draw the line? It is clear that in physics not every experimental setup
can be stored for later reuse. Costs are less tangible with computational
science; but they should not be underestimated. In addition, the race to
publish creates legions of studies. As an example, Google Scholar lists
28,000 publications concerning compressive sensing in 2015. Arguably many
are incremental, and research could do with fewer publications. Yet the
very nature of research is to explore new ideas, not all of which are to
stay.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Identifying and consolidating major results for reuse&lt;/div&gt;
&lt;p&gt;Computational research will best create scientific progress by
identifying and consolidating the major results. It is a difficult but
important task. These studies should be made reusable. Limited resources
imply that the remainder will suffer from “code rot”, with results
becoming harder and harder to reproduce as their software environment
becomes obsolete. Libraries, curated and maintained, are the building
blocks that can enable progress.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="grey docutils container"&gt;
If you want to cite this essay in an academic publication, please
cite the version in
&lt;a class="reference external" href="https://openlab-flowers.inria.fr/t/ieee-cis-newsletter-on-cognitive-and-developmental-systems/129/1"&gt;IEEE CIS TC Cognitive and Developmental Systems&lt;/a&gt;
(volume 32, number 2, 2016).&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Related posts&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="software-for-reproducible-science-lets-not-have-a-misunderstanding.html"&gt;Software for reproducible science: let’s not have a misunderstanding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="mloss-2015-wising-up-to-building-open-source-machine-learning.html"&gt;MLOSS 2015: wising up to building open-source machine learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="../science/publishing-scientific-software-matters.html"&gt;Publishing scientific software matters&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="science"></category><category term="scientific computing"></category><category term="publishing"></category><category term="software"></category><category term="reproducible research"></category></entry><entry><title>Scikit-learn Paris sprint 2017</title><link href="https://gael-varoquaux.info/programming/scikit-learn-paris-sprint-2017.html" rel="alternate"></link><published>2017-06-23T00:00:00+02:00</published><updated>2017-06-23T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2017-06-23:/programming/scikit-learn-paris-sprint-2017.html</id><summary type="html">&lt;object class="align-right" data="attachments/scikit-learn-logo.svg" style="width: 400px;" type="image/svg+xml"&gt;&lt;/object&gt;
&lt;p&gt;Two weeks ago, we held in Paris a large international sprint on
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt;. It was incredibly productive
and fun, as always. We are still busy merging in the work, but I think
that now is a good time to try to summarize the sprint.&lt;/p&gt;
&lt;div class="section" id="a-massive-workforce"&gt;
&lt;h2&gt;A massive workforce&lt;/h2&gt;
&lt;img alt="" class="align-center" src="attachments/sklearn_sprint_2017/P1060011.jpg" style="width: 100%;" /&gt;
&lt;p&gt;We had a …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;object class="align-right" data="attachments/scikit-learn-logo.svg" style="width: 400px;" type="image/svg+xml"&gt;&lt;/object&gt;
&lt;p&gt;Two weeks ago, we held in Paris a large international sprint on
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt;. It was incredibly productive
and fun, as always. We are still busy merging in the work, but I think
that now is a good time to try to summarize the sprint.&lt;/p&gt;
&lt;div class="section" id="a-massive-workforce"&gt;
&lt;h2&gt;A massive workforce&lt;/h2&gt;
&lt;img alt="" class="align-center" src="attachments/sklearn_sprint_2017/P1060011.jpg" style="width: 100%;" /&gt;
&lt;p&gt;We had a mix of core contributors and newcomers, which is a great
combination: it lets us be productive while also fostering the next
generation of core developers. Present were:&lt;/p&gt;
&lt;ul class="columns simple"&gt;
&lt;li&gt;Albert Thomas&lt;/li&gt;
&lt;li&gt;Alexandre Abadie&lt;/li&gt;
&lt;li&gt;Alexandre Gramfort&lt;/li&gt;
&lt;li&gt;Andreas Mueller&lt;/li&gt;
&lt;li&gt;Arthur Imbert&lt;/li&gt;
&lt;li&gt;Aurélien Bellet&lt;/li&gt;
&lt;li&gt;Bertrand Thirion&lt;/li&gt;
&lt;li&gt;Denis Engemann&lt;/li&gt;
&lt;li&gt;Elvis Dohmatob&lt;/li&gt;
&lt;li&gt;Gael Varoquaux&lt;/li&gt;
&lt;li&gt;Jan Margeta&lt;/li&gt;
&lt;li&gt;Joan Massich&lt;/li&gt;
&lt;li&gt;Joris Van den Bossche&lt;/li&gt;
&lt;li&gt;Laurent Direr&lt;/li&gt;
&lt;li&gt;Guillaume Lemaître&lt;/li&gt;
&lt;li&gt;Loic Esteve&lt;/li&gt;
&lt;li&gt;Mohamed Maskani Filali&lt;/li&gt;
&lt;li&gt;Nathalie Vauquier&lt;/li&gt;
&lt;li&gt;Nicolas Cordier&lt;/li&gt;
&lt;li&gt;Nicolas Goix&lt;/li&gt;
&lt;li&gt;Olivier Grisel&lt;/li&gt;
&lt;li&gt;Patricio Cerda&lt;/li&gt;
&lt;li&gt;Paul Lagrée&lt;/li&gt;
&lt;li&gt;Raghav RV&lt;/li&gt;
&lt;li&gt;Roman Yurchak&lt;/li&gt;
&lt;li&gt;Sebastien Treger&lt;/li&gt;
&lt;li&gt;Sergei Lebedev&lt;/li&gt;
&lt;li&gt;Thierry Guillemot&lt;/li&gt;
&lt;li&gt;Thomas Moreau&lt;/li&gt;
&lt;li&gt;Tom Dupré la Tour&lt;/li&gt;
&lt;li&gt;Vlad Niculae&lt;/li&gt;
&lt;/ul&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Manoj Kumar (could not come to Paris because of visa issues)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Many more people participated remotely, and I am fairly certain that I
have forgotten some names.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="support-and-hosting"&gt;
&lt;h2&gt;Support and hosting&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Hosting&lt;/strong&gt;:
As the sprint extended through a French bank holiday and the weekend,
we were hosted in a variety of venues:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://lapaillasse.org"&gt;La paillasse&lt;/a&gt;, a Paris bio-hacker space&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.criteo.com"&gt;Criteo&lt;/a&gt;, a French company doing word-wide
add-banner placement. The venue there was absolutely gorgeous, with a
beautiful terrace on the roofs of Paris. And they even had a social
event with free drinks one evening.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Guillaume Lemaître did most of the organization, and at Criteo Ibrahim
Abubakari was our host. We were treated like kings during the whole stay,
each host welcoming us as well as they could.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Financial support by France is IA&lt;/strong&gt;: Beyond our hosts, we need to thank
&lt;a class="reference external" href="https://franceisai.com/"&gt;France is IA&lt;/a&gt;, who funded the sprint, covering
some of the lunches, accommodation, and travel expenses to bring in our
contributors from abroad (3000 euros for travel &amp;amp; accommodation, and 1000
euros for food and a venue during the weekend).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="some-achievements-during-the-sprint"&gt;
&lt;h2&gt;Some achievements during the sprint&lt;/h2&gt;
&lt;p&gt;It would be hard to list everything that we did during the sprint (have a
look at the &lt;a class="reference external" href="http://scikit-learn.org/dev/whats_new.html#version-0-14"&gt;development changelog&lt;/a&gt; if you’re curious). Here are some highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;Quantile transformer, to transform the data distribution into uniform,
or Gaussian distributions
(&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/8363"&gt;PR&lt;/a&gt;,
&lt;a class="reference external" href="http://scikit-learn.org/dev/auto_examples/preprocessing/plot_all_scaling.html"&gt;example&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;&lt;/p&gt;
&lt;img alt="" src="attachments/sklearn_sprint_2017/original_distributions.png" style="width: 500px;" /&gt;
&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;&lt;/p&gt;
&lt;img alt="" src="attachments/sklearn_sprint_2017/quantile_transform.png" style="width: 500px;" /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Memory saving by avoiding to cast to float64 if X is given as float32:
we are slowly making sure that, as much as possible, all models avoid
using internal representations of a dtype float64 when the data is
given as float32. This reduces significantly memory usage and can give
speed ups up to a factor of two.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;API test on instances rather than class. This is to facilitate testing
packages in &lt;a class="reference external" href="https://github.com/scikit-learn-contrib/scikit-learn-contrib"&gt;scikit-learn-contrib&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Many small API fixes to ensure better consistency of models, as well as
cleaning the codebase, making sure that examples display well under
matplotlib 2.x.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Many bug fixes, include fixing corner cases in our average precision,
which was dear to me (&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/9017"&gt;PR&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
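&lt;p&gt;The idea behind the quantile transformer can be sketched in a few lines of plain Python. This is only a toy rank-based illustration of the concept, not scikit-learn’s implementation, which interpolates between a fixed number of reference quantiles and can also map to a Gaussian:&lt;/p&gt;

```python
# Toy sketch: map each value to its empirical quantile, so the output
# is approximately uniformly distributed on [0, 1].  An illustration of
# the idea only, not scikit-learn's QuantileTransformer.

def quantile_transform(values):
    """Map each value to its empirical quantile in [0, 1]."""
    order = sorted(values)
    n = len(values)
    # index() takes the first rank for ties; a careful implementation
    # would use average ranks and interpolation
    return [order.index(v) / (n - 1) for v in values]

skewed = [1, 2, 2, 3, 100, 1000]           # heavy-tailed data
print(quantile_transform(skewed))          # [0.0, 0.2, 0.2, 0.6, 0.8, 1.0]
```

&lt;p&gt;Mapping to a Gaussian instead amounts to passing these uniform ranks through the inverse normal CDF.&lt;/p&gt;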
&lt;p&gt;&lt;strong&gt;Work soon to be merged&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;ColumnTransformer (&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/9012"&gt;PR&lt;/a&gt;): from
pandas dataframe to feature matrix, by applying different transformers
to different columns.&lt;/li&gt;
&lt;li&gt;Fixing t-SNE (&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/9032"&gt;PR&lt;/a&gt;): our
t-SNE implementation was extremely memory-inefficient, and on top of
this had minor bugs. We are fixing it.&lt;/li&gt;
&lt;/ul&gt;
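&lt;p&gt;The concept behind ColumnTransformer, routing each column of a heterogeneous table through its own transformer, can be sketched in plain Python. This mimics only the spirit of the pull request; the actual scikit-learn API composes fitted transformer objects over dataframe columns:&lt;/p&gt;

```python
# Concept sketch: apply a different transformation to each column of a
# table and assemble the results into one feature matrix.  Only an
# illustration of the idea behind ColumnTransformer, not its API.

def column_transform(rows, transformers):
    """rows: list of dicts; transformers: {column_name: function}."""
    feature_matrix = []
    for row in rows:
        features = [transform(row[column])
                    for column, transform in transformers.items()]
        feature_matrix.append(features)
    return feature_matrix

data = [{"age": 30, "city": "Paris"}, {"age": 50, "city": "Lyon"}]
cities = ["Paris", "Lyon"]                 # hypothetical category list
X = column_transform(data, {
    "age": lambda a: a / 100,              # crude numeric scaling
    "city": lambda c: cities.index(c),     # crude ordinal encoding
})
print(X)                                   # [[0.3, 0], [0.5, 1]]
```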
&lt;p&gt;There is a lot more pending work that the sprint helped move forward.
You can also glance at the &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pulse/monthly"&gt;monthly activity report on github&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Joblib progress&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://pythonhosted.org/joblib/"&gt;Joblib&lt;/a&gt;, the parallel-computing
engine used by scikit-learn, is getting extended to work in distributed
settings, for instance using dask distributed as a &lt;a class="reference external" href="http://distributed.readthedocs.io/en/latest/joblib.html"&gt;backend&lt;/a&gt;.
At the sprint, we made progress running a grid-search on Criteo’s Hadoop
cluster.&lt;/p&gt;
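&lt;p&gt;The workload being distributed here, an embarrassingly parallel grid search, can be sketched with the standard library. concurrent.futures stands in for joblib in this sketch; with joblib one would use Parallel and delayed, and the dask distributed backend would ship the same tasks to a cluster:&lt;/p&gt;

```python
# Sketch of the embarrassingly-parallel grid search that joblib
# distributes.  concurrent.futures stands in for joblib here; a real
# scikit-learn grid search would fit and score an estimator per point.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(params):
    """Hypothetical scoring function for one hyper-parameter point."""
    alpha, depth = params
    return {"alpha": alpha, "depth": depth, "score": 1.0 / (alpha + depth)}

grid = list(product([0.1, 1.0], [1, 2]))   # all parameter combinations

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, grid))

best = max(results, key=lambda r: r["score"])
print(best["alpha"], best["depth"])        # 0.1 1
```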
&lt;img alt="" class="align-center" src="attachments/sklearn_sprint_2017/P1060014.jpg" style="width: 100%;" /&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="sprint"></category><category term="scikit-learn"></category><category term="python"></category><category term="machine learning"></category></entry><entry><title>Our research in 2016: personal scientific highlights</title><link href="https://gael-varoquaux.info/science/our-research-in-2016-personal-scientific-highlights.html" rel="alternate"></link><published>2016-12-31T00:00:00+01:00</published><updated>2016-12-31T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2016-12-31:/science/our-research-in-2016-personal-scientific-highlights.html</id><summary type="html">&lt;p&gt;Year 2016 has been productive for science in &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt;. Here are some personal highlights:
bridging artificial intelligence tools to human cognition,
markers of neuropsychiatric conditions from brain activity at rest,
algorithmic speedups for matrix factorization on huge datasets…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="artificial-intelligence-convolutional-networks-map-well-the-human-visual-system"&gt;
&lt;h2&gt;Artificial-intelligence convolutional networks map well the human visual system&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305481"&gt;Eickenberg et …&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Year 2016 has been productive for science in &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;my team&lt;/a&gt;. Here are some personal highlights:
bridging artificial intelligence tools to human cognition,
markers of neuropsychiatric conditions from brain activity at rest,
algorithmic speedups for matrix factorization on huge datasets…&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="artificial-intelligence-convolutional-networks-map-well-the-human-visual-system"&gt;
&lt;h2&gt;Artificial-intelligence convolutional networks map well the human visual system&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305481"&gt;Eickenberg et al&lt;/a&gt;
(&lt;a class="reference external" href="https://hal.inria.fr/hal-01389809/document"&gt;preprint&lt;/a&gt;), showed that
convolutional networks –machine-learning tools developed in artificial
intelligence for image analysis– map well the human visual system. This
is interesting because it shows that cognitive vision and artificial
computer vision have evolved to similar architectures. It is not that
surprising, as they are both driven by the statistics of natural images.
From the point of view of inference in neuroscience, what I found really
interesting is that we demonstrated that our computational model of brain
activity generalizes across experimental paradigms. This is something new
to my knowledge.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="using-brain-activity-at-rest-to-predicting-autism-status-across-clinical-sites"&gt;
&lt;h2&gt;Using brain activity at rest to predict Autism status across clinical sites&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916305924"&gt;Abraham et al&lt;/a&gt;
(&lt;a class="reference external" href="https://arxiv.org/pdf/1611.06066"&gt;preprint&lt;/a&gt;) used resting-state brain
activity to predict whether individuals were typical controls or
diagnosed with Autistic symptoms. The important aspect of this study
is that it was performed on a large data collection across many sites
that had not coordinated with each other during acquisition. Given that
prediction was successful across sites, the study shows the viability of
extracting predictive biomarkers across inhomogeneous multi-site data. I
think that it is an important result for the future of psychiatric
neuroimaging research. The paper also highlights the aspects of the
predictive pipeline that were important for this success.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="dictionary-learning-for-massive-matrix-factorization"&gt;
&lt;h2&gt;Dictionary Learning for Massive Matrix Factorization&lt;/h2&gt;
&lt;p&gt;On a pure machine-learning side, &lt;a class="reference external" href="http://jmlr.org/proceedings/papers/v48/mensch16.html"&gt;Mensch et al&lt;/a&gt; introduced a new
algorithm for matrix factorization that gives 10 times speedups compared
to the state of the art on absolutely huge datasets (Terabyte scales).
The key aspect is to combine online learning with random subsampling that
exploits redundancies in the data. For neuroimaging, this algorithmic
advance is needed to tackle larger and larger resting-state data. We
will use it to scale predictive models to epidemiologic cohorts. The
original paper was purely heuristic but &lt;a class="reference external" href="https://arxiv.org/pdf/1611.10041"&gt;later work&lt;/a&gt; comes with proofs and we will soon
be submitting a very rich journal paper about this class of algorithms.&lt;/p&gt;
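&lt;p&gt;The flavor of the approach, online updates computed from random subsamples of the data, can be sketched with a bare-bones stochastic matrix factorization. This toy is not Mensch et al.’s algorithm (their subsampled surrogate updates are what make the speed-ups principled); it only illustrates learning factors from randomly sampled entries:&lt;/p&gt;

```python
# Toy sketch: learn a low-rank factorization X ~ U V^T by stochastic
# updates on randomly subsampled entries.  Plain SGD, illustrating
# online learning with subsampling only; NOT the surrogate-based
# algorithm of Mensch et al.
import random

random.seed(0)
n, m, rank = 20, 15, 2
# Synthetic low-rank data
U_true = [[random.gauss(0, 1) for _ in range(rank)] for _ in range(n)]
V_true = [[random.gauss(0, 1) for _ in range(rank)] for _ in range(m)]
X = [[sum(U_true[i][k] * V_true[j][k] for k in range(rank))
      for j in range(m)] for i in range(n)]

# Small random initialization of the learned factors
U = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(n)]
V = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(m)]

def loss():
    return sum((X[i][j] - sum(U[i][k] * V[j][k] for k in range(rank))) ** 2
               for i in range(n) for j in range(m))

initial_loss = loss()
lr = 0.02
for step in range(8000):
    # Subsample one entry (real algorithms subsample blocks or columns)
    i, j = random.randrange(n), random.randrange(m)
    err = X[i][j] - sum(U[i][k] * V[j][k] for k in range(rank))
    for k in range(rank):
        u, v = U[i][k], V[j][k]
        U[i][k] += lr * err * v
        V[j][k] += lr * err * u

print(initial_loss, loss())   # reconstruction error after training
```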
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-guide-to-cross-validation-in-neuroimaging"&gt;
&lt;h2&gt;A guide to cross-validation in neuroimaging&lt;/h2&gt;
&lt;p&gt;We published &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S105381191630595X"&gt;a review on cross-validation for neuroimaging&lt;/a&gt;
(&lt;a class="reference external" href="https://arxiv.org/pdf/1606.05201"&gt;preprint&lt;/a&gt;). While this may sound
less leading edge than other of our work, cross-validation is central to
everything we do. Doing it right is important. We learned some
interesting tradeoffs while doing the experiments for the review. One of
them is that for predictive models that are quite stable, such as SVMs,
it may be preferable to use default hyper-parameters rather than to tune them by
cross-validation. This is because with the small sample sizes typical of
neuroimaging cross-validation is fairly noisy.&lt;/p&gt;
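&lt;p&gt;A back-of-the-envelope calculation gives the order of magnitude of this noise: the accuracy measured on n held-out samples is a binomial proportion, so its standard error is roughly sqrt(p(1-p)/n). This ignores fold-to-fold correlations, which make real cross-validation noisier still:&lt;/p&gt;

```python
# Back-of-the-envelope: accuracy measured on n held-out samples is a
# binomial proportion, with standard error sqrt(p * (1 - p) / n).
# Fold-to-fold correlations make real cross-validation noisier still.
from math import sqrt

def accuracy_standard_error(p, n):
    """Standard error of an observed accuracy p on n test samples."""
    return sqrt(p * (1 - p) / n)

# A typical neuroimaging-sized test fold vs. large-sample settings
for n_test in (30, 300, 3000):
    half_width = 1.96 * accuracy_standard_error(0.75, n_test)
    print(f"n={n_test:5d}: accuracy 75% +/- {half_width:.1%}")
```

&lt;p&gt;With 30 test samples per fold, the 95% interval is roughly +/- 15 percentage points: plenty of room for hyper-parameter tuning to chase noise.&lt;/p&gt;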
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Though not in my team, &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1053811916306103"&gt;Liem et al&lt;/a&gt;
(&lt;a class="reference external" href="http://www.biorxiv.org/content/biorxiv/early/2016/11/07/085506.full.pdf"&gt;preprint&lt;/a&gt;)
collaborated with us for a beautiful study showing multimodal prediction
of brain age from resting-state brain activity and brain anatomy. Interestingly,
they showed that discrepancy between predicted age and chronological age
captures cognitive impairment.&lt;/p&gt;
&lt;p&gt;We have many interesting things in the pipeline, but it will be for next
year. On an unrelated note, I’ve been doing more &lt;a class="reference external" href="http://www.flickriver.com/photos/gaelvaroquaux/popular-interesting/"&gt;art photography&lt;/a&gt;
in my free time in 2016.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="science"></category><category term="research"></category><category term="neuroimaging"></category><category term="brain science"></category><category term="machine learning"></category><category term="yearly report"></category></entry><entry><title>Data science instrumenting social media for advertising is responsible for todays politics</title><link href="https://gael-varoquaux.info/programming/data-science-instrumenting-social-media-for-advertising-is-responsible-for-todays-politics.html" rel="alternate"></link><published>2016-11-11T00:00:00+01:00</published><updated>2016-11-11T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2016-11-11:/programming/data-science-instrumenting-social-media-for-advertising-is-responsible-for-todays-politics.html</id><summary type="html">&lt;p&gt;&lt;em&gt;To my friends developing data science for the social media, marketing, and
advertising industries,&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It is time to accept that we have our share of responsibility in the outcome of
the US elections and the vote on Brexit. We are not creating the
society that we would like. Facebook,
Twitter …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;To my friends developing data science for the social media, marketing, and
advertising industries,&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It is time to accept that we have our share of responsibility in the outcome of
the US elections and the vote on Brexit. We are not creating the
society that we would like. Facebook,
Twitter, targeted advertising, customer profiling, are harmful to truth
and have helped Brexiting and electing Trump. Journalism
has been replaced by social media and commercial content tailored to
influence the reader: your own personal distorted reality.&lt;/p&gt;
&lt;p&gt;There are many deep reasons why Trump won the election. Here, as a
data scientist, I want to talk about the factors created by data science.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Rumor replaces truth&lt;/strong&gt;: the way we, data-miners, aggregate and
recommend content is based on its popularity, on readership statistics.
In no way is it based in the truthfulness of the content. As a
result, Facebook, Twitter, Medium, and the like amplify rumors and
sensational news, with no reality check &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is nothing new: clickbait and tabloids build upon it. However, social networking and
active recommendation make things significantly worse. Indeed, birds of
a feather flock together, reinforcing their own biases. &lt;strong&gt;We receive
filtered information&lt;/strong&gt;: have you noticed that every single argument you
heard was overwhelmingly against (or in favor of) Brexit? To make matters
even worse, our brain loves it: to resolve cognitive dissonance we avoid
information that contradicts our biases &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;We all believe more information when it confirms our biases&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Gossiping, rumors, and propaganda have always made sane decisions
difficult. The &lt;strong&gt;filter bubble&lt;/strong&gt;, algorithmically-tuned rose-colored
glasses of Facebook, escalate this problem into a major dysfunction of
our society. They amplify messy and false information better than
anything before. Soviet-style propaganda builds on carefully-crafted
lies; post-truth politics builds on a flood of information that does not
even pretend to be credible in the long run.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Active distortion of reality&lt;/strong&gt;: amplifying biases to the point that
they drown truth is bad. Social networks actually do worse: they give
tools for active manipulation of our perception of the world. Indeed, the
revenue of today’s Internet information engines comes from advertising.
For this purpose they are designed to learn as much as possible about the
reader. Then they sell this information bundled with a slot where the
buyer can insert the optimal message to influence the reader.&lt;/p&gt;
&lt;a class="reference external image-reference" href="https://www.flickr.com/photos/benterrett/6929895752/"&gt;&lt;img alt="" class="align-right" src="https://farm8.staticflickr.com/7212/6929895752_2e359557b8_z_d.jpg" style="width: 25%;" /&gt;&lt;/a&gt;
&lt;p&gt;The Trump campaign used targeted Facebook ads presenting to
unenthusiastic democrats information about Clinton tuned to discourage
them from voting. For instance, &lt;a class="reference external" href="http://www.theverge.com/2016/10/27/13434246/donald-trump-targeted-dark-facebook-ads-black-voters"&gt;portraying her as racist to black voters&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Information manipulation works. The Trump campaign has been a smearing
campaign aimed at suppressing votes of his opponent. Release of
negative information on Clinton &lt;a class="reference external" href="https://medium.com/&amp;#64;jonathonmorgan/we-are-more-than-our-partisanship-4ea179592c1f"&gt;did affect her supporter allegiance&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tech created the perfect mind-control tool, with an eye on
sales revenue. Someone used it for politics.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The tech industry is mostly socially-liberal and highly educated,
wishing the best for society. But it must accept its share of the blame.
My friends improving machine-learning for customer profiling and ad
placement, &lt;strong&gt;you are helping shape a world of lies and deception&lt;/strong&gt;. I will
not blame you for accepting this money: if it were not for you, others
would do it. But we should all be thinking about how we can improve this
system. How do we use data science to build a world based on objectivity,
transparency, and truth, rather than Internet-based marketing?&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;References analysing the erosion of truth&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.economist.com/news/briefing/21706498-dishonesty-politics-nothing-new-manner-which-some-politicians-now-lie-and"&gt;Must-read article in the economist on lies in politics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Post-truth_politics"&gt;Wikipedia page on Post-truth politics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://nymag.com/selectall/2016/11/donald-trump-won-because-of-facebook.html"&gt;Donald Trump won because of Facebook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://inverseprobability.com/2016/06/23/the-real-story-behind-todays-referendum"&gt;The real story behind todays referendum&lt;/a&gt; : Neil Lawrence’s analysis of the filter-bublle effect in Brexit&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://users.polisci.wisc.edu/behavior/Papers/Toff&amp;amp;Kim2013.pdf"&gt;A 2013 academic study showing that twitter increases partisan
polarization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Disgression: other social issues of data science&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The tech industry is &lt;strong&gt;increasing inequalities&lt;/strong&gt;, making the rich richer and
leaving the poor behind. Data science, with its ability to automate
actions and wield large sources of information, is a major contributor
to these inequalities.&lt;/li&gt;
&lt;li&gt;Internet-based marketing is building &lt;strong&gt;a huge spying machine&lt;/strong&gt; that
infers as much as possible about the user. The Trump campaign was able
to target a specific population, black voters leaning towards
democrats. What if this data was used for direct executive action? This
could come quicker than we think, given how intelligence agencies tap
into social media.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I preferred to focus this post on how data-science can help distort truth.
Indeed, it is a problem too often ignored by data scientists who like to
think that they are empowering users.&lt;/p&gt;
&lt;/div&gt;
&lt;!-- The wikileaks dumps of Clinton's mail resemble the
`Kompromat &lt;https://en.wikipedia.org/wiki/Kompromat&gt;`_ techniques used
by post-soviet regimes, using private information on opponents to
control them. --&gt;
&lt;p class="align-right"&gt;In memory of &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Aaron_Swartz"&gt;Aaron Schwartz&lt;/a&gt;
who fought centralized power on Internet.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Facebook was until recently using human curators, &lt;a class="reference external" href="http://arstechnica.com/business/2016/08/facebook-fires-human-editors-algorithm-immediately-posts-fake-news/"&gt;but fired them,
leading to a loss of control on veracity&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It is a well-known and well-studied cognitive bias that
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Cognitive_dissonance"&gt;individuals strive to reduce cognitive dissonace and actively avoid
situations and information likely to increase it&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;a class="reference external image-reference" href="https://www.flickr.com/photos/cdevers/4602805654"&gt;&lt;img alt="" class="align-center" src="https://farm2.staticflickr.com/1376/4602805654_db8b6569fb_z_d.jpg" style="width: 80%;" /&gt;&lt;/a&gt;
</content><category term="programming"></category><category term="politics"></category><category term="data science"></category><category term="software"></category><category term="machine learning"></category><category term="society"></category></entry><entry><title>Unison 2.48 binaries for ARM</title><link href="https://gael-varoquaux.info/misc/unison-248-binaries-for-arm.html" rel="alternate"></link><published>2016-07-23T00:00:00+02:00</published><updated>2016-07-23T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2016-07-23:/misc/unison-248-binaries-for-arm.html</id><summary type="html">&lt;p class="first last"&gt;I have built static binaries of Unision 2.48 for ARM&lt;/p&gt;
</summary><content type="html">&lt;p&gt;I have built static binaries of Unison 2.48 for ARM
Run on my NAS, the arm architecture is necessary to synchronize with the
recent Ubuntu.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="../programming/attachments/unison-2.48.4-armel.zip"&gt;unison-2.48.4-armel.zip&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="admonition warning"&gt;
&lt;p class="first admonition-title"&gt;Warning&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I will not support these binaries&lt;/strong&gt;&lt;/p&gt;
&lt;p class="last"&gt;I will not answer any questions or request on these binaries. I have
built them for my personal use and put them online in case it might be
useful for others.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="remark-on-backward-compatibility"&gt;
&lt;h2&gt;Remark on backward compatibility&lt;/h2&gt;
&lt;p&gt;Why don’t the Unison devs ensure compatibility between minor versions of
Unison?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Breaking compatibility is bad practice, in particular between minor
versions&lt;/strong&gt;. It breaks the trust that users have in updating the software.
Programmers complain that users always run old versions of
OSs/libraries/programs, but this is explained by the fear of stuff
breaking during upgrades.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="notes-to-build-these-binaries"&gt;
&lt;h2&gt;Notes to build these binaries&lt;/h2&gt;
&lt;p&gt;I built these binaries following instructions historically hosted at
&lt;a class="reference external" href="http://www.crutzi.info/unison/binary/armel"&gt;http://www.crutzi.info/unison/binary/armel&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I retrieved these instructions from the Wayback Machine and adapted them
to work on a more modern Debian system.&lt;/p&gt;
&lt;p&gt;To compile it, I used qemu and a Debian ARM image&lt;/p&gt;
&lt;div class="section" id="build-a-debian-system-under-qemu"&gt;
&lt;h3&gt;1. Build a Debian system under qemu&lt;/h3&gt;
&lt;p&gt;Install the system (this takes a couple of hours and requires some user input):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
sudo apt install qemu-system-arm qemu-efi libguestfs-tools

wget -O installer-vmlinuz http://http.us.debian.org/debian/dists/jessie/main/installer-armhf/current/images/netboot/vmlinuz
wget -O installer-initrd.gz http://http.us.debian.org/debian/dists/jessie/main/installer-armhf/current/images/netboot/initrd.gz

# Create a drive
qemu-img create -f qcow2 hda.qcow2 5G

qemu-system-arm -M virt -m 1024 \
-kernel installer-vmlinuz \
-initrd installer-initrd.gz \
-drive if=none,file=hda.qcow2,format=qcow2,id=hd \
-device virtio-blk-device,drive=hd \
-netdev user,id=mynet \
-device virtio-net-device,netdev=mynet \
-nographic -no-reboot
&lt;/pre&gt;
&lt;p&gt;Under Ubuntu, the host kernel images must first be made readable:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
sudo chmod 644 /boot/vmlinuz*
&lt;/pre&gt;
&lt;p&gt;List the contents of the /boot directory of the VM’s disk:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
virt-ls -a hda.qcow2 /boot/
&lt;/pre&gt;
&lt;p&gt;Copy the initrd and vmlinuz:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
virt-copy-out -a hda.qcow2 /boot/vmlinuz-3.16.0-6-armmp-lpae /boot/initrd.img-3.16.0-6-armmp-lpae .
&lt;/pre&gt;
&lt;p&gt;Create symlinks:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
ln -s initrd.img-3.16.0-6-armmp-lpae initrd.img
ln -s vmlinuz-3.16.0-6-armmp-lpae vmlinuz
&lt;/pre&gt;
&lt;p&gt;The installed system is then booted with:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
qemu-system-arm -M virt -m 1024 \
-kernel vmlinuz \
-initrd initrd.img \
-drive if=none,file=hda.qcow2,format=qcow2,id=hd \
-device virtio-blk-device,drive=hd \
-netdev user,id=mynet \
-device virtio-net-device,netdev=mynet \
-nographic -no-reboot -append &amp;quot;root=/dev/vda2&amp;quot;
&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="section" id="build-unison-under-the-debian-system"&gt;
&lt;h3&gt;2. Build unison under the Debian system&lt;/h3&gt;
&lt;p&gt;Download the unison source package from GitHub and compile it within
the qemu ARM environment:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
apt-get update
apt-get upgrade
apt-get build-dep unison
wget https://github.com/bcpierce00/unison/archive/v2.48.15v4.tar.gz
tar -xvzf v2.48.15v4.tar.gz
cd unison-2.48.15v4
make UISTYLE=text NATIVE=true STATIC=true
&lt;/pre&gt;
&lt;p&gt;You might need to remove the ‘-unsafe-string’ option as detailed in &lt;a class="reference external" href="https://github.com/bcpierce00/unison/issues/211"&gt;https://github.com/bcpierce00/unison/issues/211&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The binary will be in &lt;cite&gt;src/unison&lt;/cite&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="software"></category></entry><entry><title>Better Python compressed persistence in joblib</title><link href="https://gael-varoquaux.info/programming/new_low-overhead_persistence_in_joblib_for_big_data.html" rel="alternate"></link><published>2016-05-20T00:00:00+02:00</published><updated>2016-05-20T00:00:00+02:00</updated><author><name>Alexandre Abadie &amp; Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2016-05-20:/programming/new_low-overhead_persistence_in_joblib_for_big_data.html</id><summary type="html">&lt;p class="first last"&gt;New persistence in joblib enables low-overhead storage of big data contained in arbitrary objects&lt;/p&gt;
</summary><content type="html">&lt;div class="section" id="problem-setting-persistence-for-big-data"&gt;
&lt;h2&gt;Problem setting: persistence for big data&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://pythonhosted.org/joblib/"&gt;Joblib&lt;/a&gt; is a powerful Python package
for management of computation: parallel computing, caching, and
primitives for out-of-core computing. It is handy when working on so
called &lt;strong&gt;big data&lt;/strong&gt;, that can consume more than the available RAM (several GB
nowadays). In such situations, objects in the working space must be
persisted to disk, for out-of-core computing, distribution of jobs, or
caching.&lt;/p&gt;
&lt;p&gt;An efficient strategy to write code dealing with big data is to rely on
&lt;strong&gt;numpy arrays to hold large chunks of structured data&lt;/strong&gt;.
The code then handles objects or arbitrary containers (list, dict) with
numpy arrays. For data management, joblib provides transparent disk
persistence that is very efficient with such objects. The internal
mechanism relies on specializing &lt;a class="reference external" href="https://docs.python.org/3/library/pickle.html"&gt;pickle&lt;/a&gt; to better handle numpy
arrays.&lt;/p&gt;
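The idea behind this specialization can be sketched with the standard library's own extension hooks. This is not joblib's actual code, only a minimal illustration, assuming numpy is installed: a custom Pickler intercepts numpy arrays via `persistent_id` and keeps them out-of-band, so the array buffers can be handled separately from the generic pickle stream.

```python
import io
import pickle

import numpy as np

class ArrayAwarePickler(pickle.Pickler):
    """Pickle everything normally, but divert numpy arrays out-of-band."""

    def __init__(self, file, protocol=None):
        super().__init__(file, protocol)
        self.arrays = []  # out-of-band storage for array objects

    def persistent_id(self, obj):
        if isinstance(obj, np.ndarray):
            self.arrays.append(obj)
            return len(self.arrays) - 1  # reference the array by index
        return None  # anything else goes through regular pickling

class ArrayAwareUnpickler(pickle.Unpickler):
    """Resolve the out-of-band references back to the stored arrays."""

    def __init__(self, file, arrays):
        super().__init__(file)
        self.arrays = arrays

    def persistent_load(self, pid):
        return self.arrays[pid]

# Round-trip an arbitrary container mixing arrays and plain objects:
obj = {"weights": np.arange(6).reshape(2, 3), "name": "model"}
buf = io.BytesIO()
pickler = ArrayAwarePickler(buf)
pickler.dump(obj)
buf.seek(0)
restored = ArrayAwareUnpickler(buf, pickler.arrays).load()
```

In joblib the out-of-band path is where the efficient array serialization happens; here the arrays are simply kept in memory to show the mechanism.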
&lt;p&gt;&lt;a class="reference external" href="https://github.com/joblib/joblib/pull/260"&gt;Recent improvements&lt;/a&gt;
vastly reduce the memory overhead of data persistence.&lt;/p&gt;
&lt;div class="section" id="limitations-of-the-old-implementation"&gt;
&lt;h3&gt;Limitations of the old implementation&lt;/h3&gt;
&lt;p&gt;❶ Dumping/loading persisted data &lt;strong&gt;with compression&lt;/strong&gt; was a memory hog,
because of internal copies of data, limiting the maximum size
of usable data with compressed persistence:&lt;/p&gt;
&lt;img alt="" class="large" src="https://gael-varoquaux.info/programming/attachments/old_pickle_mem_profile.png" /&gt;
&lt;p&gt;We see the increased memory usage during the calls to &lt;tt class="docutils literal"&gt;dump&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;load&lt;/tt&gt; functions, profiled using the &lt;a class="reference external" href="https://pypi.python.org/pypi/memory_profiler"&gt;memory_profiler package&lt;/a&gt; with this &lt;a class="reference external" href="https://gist.github.com/aabadie/7cba3385406d1cec7d3dd4407ba3f164"&gt;gist&lt;/a&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;❷ Another drawback was that large numpy arrays (&amp;gt;10MB) contained in an
arbitrary Python object were dumped in separate &lt;tt class="docutils literal"&gt;.npy&lt;/tt&gt; files, increasing
the load on the file system &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;joblib&lt;/span&gt; &lt;span class="c1"&gt;# joblib version: 0.9.4&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;

&lt;span class="c1"&gt;# 3 files are generated:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl_01.npy.z&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl_02.npy.z&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="what-s-new-compression-low-memory"&gt;
&lt;h2&gt;What’s new: compression, low memory…&lt;/h2&gt;
&lt;p&gt;❶ &lt;strong&gt;Memory usage is now stable&lt;/strong&gt;:&lt;/p&gt;
&lt;img alt="" src="https://gael-varoquaux.info/programming/attachments/new_pickle_mem_profile.png" /&gt;
&lt;p&gt;❷ &lt;strong&gt;All numpy arrays are persisted in a single file&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;joblib&lt;/span&gt; &lt;span class="c1"&gt;# joblib version: 0.10.0 (dev)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;

&lt;span class="c1"&gt;# only 1 file is generated:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;❸ &lt;strong&gt;Persistence in a file handle&lt;/strong&gt; (ongoing work in a &lt;a class="reference external" href="https://github.com/joblib/joblib/pull/351"&gt;pull request&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;❹ &lt;strong&gt;More compression formats are available&lt;/strong&gt;&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Backward compatibility&lt;/p&gt;
&lt;p&gt;Existing joblib users can be reassured: the new version is &lt;strong&gt;still
compatible with pickles generated by older versions&lt;/strong&gt; (&amp;gt;= 0.8.4). You
are encouraged to rebuild your cache if you want to take
advantage of this new version.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="benchmarks-speed-and-memory-consumption"&gt;
&lt;h2&gt;Benchmarks: speed and memory consumption&lt;/h2&gt;
&lt;p&gt;Joblib strives to have &lt;strong&gt;minimum dependencies&lt;/strong&gt; (only numpy) and to
&lt;strong&gt;be agnostic to the input data&lt;/strong&gt;. Hence the goals are to deal with any
kind of data while trying to &lt;strong&gt;be as efficient as possible with numpy arrays&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;To illustrate the benefits and cost of the new persistence implementation, let’s
now compare a real life use case
(&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_people.html"&gt;LFW dataset from scikit-learn&lt;/a&gt;)
with different libraries:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Joblib, with 2 different versions,
0.9.4 and master (dev),&lt;/li&gt;
&lt;li&gt;Pickle&lt;/li&gt;
&lt;li&gt;Numpy&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" class="large" src="https://gael-varoquaux.info/programming/attachments/persistence_lfw_bench.png" /&gt;
&lt;p&gt;The first four lines use non compressed persistence strategies, the last
four use persistence with zlib/gzip &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt; strategies. Code to reproduce the
benchmarks is available on this &lt;a class="reference external" href="https://gist.github.com/aabadie/2ba94d28d68f19f87eb8916a2238a97c"&gt;gist&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;⚫ &lt;strong&gt;Speed&lt;/strong&gt;: the results between joblib 0.9.4 and 0.10.0 (dev) are
similar whereas &lt;strong&gt;numpy and pickle are clearly slower than joblib&lt;/strong&gt; in both
compressed and non compressed cases.&lt;/p&gt;
&lt;p&gt;⚫ &lt;strong&gt;Memory consumption&lt;/strong&gt;: Without compression, old and
new joblib versions are the same; with compression, the new joblib version is
much better than the old one.
&lt;strong&gt;Joblib clearly outperforms pickle and numpy in terms of
memory consumption&lt;/strong&gt;. This can be explained by the fact that numpy relies on
pickle if the object is not a pure numpy array (a list or a dict with arrays for
example), so in this case it inherits the memory drawbacks from pickle. When
persisting pure numpy arrays (not tested here), numpy uses its internal save/load
functions which are efficient in terms of speed and memory consumption.&lt;/p&gt;
&lt;p&gt;⚫ &lt;strong&gt;Disk used&lt;/strong&gt;: results are as expected: non compressed files have
the same size as the in-memory data; compressed files are smaller.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Caveat Emptor: performance is data-dependent&lt;/p&gt;
&lt;p&gt;Different data compress more or less easily. Speed and disk used will
vary depending on the data. Key considerations are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Fraction of data in arrays&lt;/strong&gt;: joblib is efficient if much of the
data is contained in numpy arrays. The worst-case scenario is
something like a large dictionary with random numbers as keys and
values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Entropy of the data&lt;/strong&gt;: an array full of zeros will compress well
and fast. A fully random array will compress slowly, and use a lot
of disk. Real data is often somewhere in the middle.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
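The entropy point above can be illustrated with a plain zlib sketch (stdlib only, not a joblib benchmark): a low-entropy payload of zeros shrinks to a tiny fraction of its size, while random bytes barely compress at all.

```python
import os
import zlib

# Low-entropy payload: one megabyte of zeros.
zeros = bytes(1_000_000)
# High-entropy payload: one megabyte of random bytes.
noise = os.urandom(1_000_000)

ratio_zeros = len(zlib.compress(zeros)) / len(zeros)
ratio_noise = len(zlib.compress(noise)) / len(noise)

print(f"zeros compress to {ratio_zeros:.4%} of their size")
print(f"random bytes compress to {ratio_noise:.1%} of their size")
```

Real data usually sits between these two extremes, which is why the benchmark numbers above should only be taken as indicative.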
&lt;/div&gt;
&lt;div class="section" id="extra-improvements-in-compressed-persistence"&gt;
&lt;h2&gt;Extra improvements in compressed persistence&lt;/h2&gt;
&lt;div class="section" id="new-compression-formats"&gt;
&lt;h3&gt;New compression formats&lt;/h3&gt;
&lt;p&gt;Joblib can use new compression formats based on Python standard library modules:
&lt;strong&gt;zlib, gzip, bz2, lzma and xz&lt;/strong&gt; (the last two require Python
3.3 or later). &lt;strong&gt;The compressor is
selected automatically when the file name has an explicit extension&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.z&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# zlib&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.z&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# gzip&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.bz2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# bz2&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.bz2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.lzma&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# lzma&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.lzma&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.xz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# xz&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.xz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One can tune the compression level, setting the compressor explicitly:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.compressed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;zlib&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.compressed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/tmp/test.compressed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lzma&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.compressed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;On loading, joblib uses the magic number of the file to determine the
right decompression method. This makes loading compressed pickles transparent:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.compressed&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Importantly, the generated compressed files use a &lt;strong&gt;standard
compression file format&lt;/strong&gt;: for instance, regular command line tools (zip/unzip,
gzip/gunzip, bzip2, lzma, xz) can be used to compress/uncompress a pickled file
generated with joblib. Joblib will be able to load a cache compressed with those
tools.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Toward more and faster compression&lt;/p&gt;
&lt;p&gt;Specific compression strategies have been developed for fast
compression, sometimes even faster than disk reads, such as &lt;a class="reference external" href="http://google.github.io/snappy/"&gt;snappy&lt;/a&gt;, &lt;a class="reference external" href="http://www.blosc.org/"&gt;blosc&lt;/a&gt;, LZO or LZ4. With a file-like interface, they should be
readily usable with joblib.&lt;/p&gt;
&lt;p&gt;In the benchmarks above, loading and dumping with compression is
slower than without (though only by a factor of 3 for loading). These
were done on a computer with an SSD, hence with very fast I/O. In a
situation with slower I/O, as &lt;strong&gt;on a network drive, compression could
save time&lt;/strong&gt;. With faster compressors, compression will save time on most
hardware.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="compressed-persistence-into-a-file-handle"&gt;
&lt;h3&gt;Compressed persistence into a file handle&lt;/h3&gt;
&lt;p&gt;Now that everything is stored in a
single file using standard compression formats, joblib can
persist in an &lt;a class="reference external" href="https://github.com/joblib/joblib/pull/351"&gt;open file handle&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;wb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;    &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This also works with the compression file objects available in the standard library,
like &lt;tt class="docutils literal"&gt;gzip.GzipFile&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;bz2.BZ2File&lt;/tt&gt; or &lt;tt class="docutils literal"&gt;lzma.LZMAFile&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;gzip&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GzipFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;wb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;    &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GzipFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Make sure the decompressor matches the internal compression when
loading with the above method. If unsure, simply use
&lt;tt class="docutils literal"&gt;open&lt;/tt&gt;: joblib will &lt;strong&gt;select the right decompressor&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/test.pkl.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;     &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
 &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mf"&gt;0.47006195&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.5436392&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.1218267&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;0.48592789&lt;/span&gt;&lt;span class="p"&gt;]])]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
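&lt;p&gt;As an illustration of how such auto-detection can work (a sketch, not joblib’s
actual code), a loader can inspect the file’s first bytes, since each
compression format starts with a well-known magic number:&lt;/p&gt;

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Magic numbers of common compression formats; a sketch of the kind of
# detection joblib performs internally, not joblib's actual code.
_MAGIC = {
    b"\x1f\x8b": gzip.open,       # gzip
    b"BZh": bz2.open,             # bz2
    b"\xfd7zXZ\x00": lzma.open,   # xz/lzma
}

def sniff_opener(path):
    """Pick an opener based on the file's magic number."""
    with open(path, "rb") as f:
        head = f.read(6)
    for magic, opener in _MAGIC.items():
        if head.startswith(magic):
            return opener
    return open  # no magic found: assume uncompressed

# Write gzip-compressed data, then detect the format from the content
# rather than trusting the file name.
path = os.path.join(tempfile.gettempdir(), "demo.blob")
with gzip.open(path, "wb") as f:
    f.write(b"hello")
with sniff_opener(path)(path, "rb") as f:
    assert f.read() == b"hello"
```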
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Towards dumping to elaborate stores&lt;/p&gt;
&lt;p&gt;Working with file handles opens the door to &lt;strong&gt;storing cache data in database blobs or cloud
storage services such as Amazon S3, Amazon Glacier and Google Cloud Storage&lt;/strong&gt;
(for instance via the Python package &lt;a class="reference external" href="https://github.com/boto/boto"&gt;boto&lt;/a&gt;).&lt;/p&gt;
&lt;/div&gt;
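&lt;p&gt;The pattern is simply to hand the dump and load functions a file-like object
whose bytes end up in the store. A minimal sketch with the standard library
(using &lt;tt class="docutils literal"&gt;pickle&lt;/tt&gt; as a stand-in so it runs without
joblib installed; joblib’s dump and load accept such file objects the same way):&lt;/p&gt;

```python
import io
import pickle

# Sketch: serialize into an in-memory buffer; the resulting bytes can
# then be pushed to a database blob or a cloud object store (e.g. S3
# through boto). With joblib this would be joblib.dump(data, buf).
data = {"weights": [1.0, 2.0, 3.0]}
buf = io.BytesIO()
pickle.dump(data, buf)
payload = buf.getvalue()  # bytes ready for an upload call

# Loading from a downloaded blob is symmetric:
restored = pickle.load(io.BytesIO(payload))
assert restored == data
```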
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="implementation"&gt;
&lt;h2&gt;Implementation&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A Pickler subclass&lt;/strong&gt;: joblib relies on subclassing the Python Pickler/Unpickler
&lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;. These are state machines that walk the graph of nested objects (a
dict may contain a list, which may in turn contain…), creating a serialized
representation of each object encountered. The new implementation
proceeds as follows:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Pickling an arbitrary object&lt;/strong&gt;: when an &lt;tt class="docutils literal"&gt;np.ndarray&lt;/tt&gt; object is reached,
instead of using the default pickling functions (&lt;tt class="docutils literal"&gt;__reduce__()&lt;/tt&gt;), the joblib
Pickler replaces the ndarray in the pickle stream with a wrapper object containing
all the important array metadata (shape, dtype, flags), then writes the array
content into the pickle file. Note that this step breaks compatibility
with the standard pickle format. One benefit is that it enables fast,
copy-free handling of the numpy array. For compression, chunks
of the data are passed to a compressor object (using the buffer protocol to avoid
copies).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unpickling from a file&lt;/strong&gt;: when the Unpickler reaches the array wrapper,
the file handle is positioned at the beginning of the array content, since
the wrapper precedes it in the pickle stream. At this point the Unpickler simply
constructs an array from the metadata contained in the wrapper and then
fills the array buffer directly from the file. The object returned is the
reconstructed array; the wrapper is dropped. A benefit is that
if the data is stored uncompressed, &lt;strong&gt;the array can be directly memory
mapped from the storage&lt;/strong&gt; (the mmap_mode option of &lt;a class="reference external" href="https://pythonhosted.org/joblib/generated/joblib.load.html"&gt;joblib.load&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
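&lt;p&gt;The idea can be sketched with the standard library alone (an illustration,
not joblib’s implementation): the stdlib pickler’s
&lt;tt class="docutils literal"&gt;persistent_id&lt;/tt&gt;/&lt;tt class="docutils literal"&gt;persistent_load&lt;/tt&gt;
hooks play the role of joblib’s wrapper objects, and
&lt;tt class="docutils literal"&gt;array.array&lt;/tt&gt; stands in for &lt;tt class="docutils literal"&gt;np.ndarray&lt;/tt&gt;:&lt;/p&gt;

```python
import array
import io
import pickle

class ArrayPickler(pickle.Pickler):
    def __init__(self, file, side_channel):
        super().__init__(file)
        self.side_channel = side_channel  # raw buffers go here, out of band

    def persistent_id(self, obj):
        if isinstance(obj, array.array):
            # Replace the array in the stream by a small wrapper:
            # an index into the side channel plus metadata to rebuild it.
            self.side_channel.append(obj.tobytes())
            return (len(self.side_channel) - 1, obj.typecode)
        return None  # everything else is pickled normally

class ArrayUnpickler(pickle.Unpickler):
    def __init__(self, file, side_channel):
        super().__init__(file)
        self.side_channel = side_channel

    def persistent_load(self, pid):
        index, typecode = pid
        # Rebuild the array from the metadata plus the raw buffer.
        a = array.array(typecode)
        a.frombytes(self.side_channel[index])
        return a

buffers = []  # stands in for the array content written alongside the stream
stream = io.BytesIO()
data = {"x": array.array("d", [1.0, 2.0, 3.0]), "label": "demo"}
ArrayPickler(stream, buffers).dump(data)

stream.seek(0)
restored = ArrayUnpickler(stream, buffers).load()
assert restored["x"].tolist() == [1.0, 2.0, 3.0]
```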
&lt;p&gt;This technique lets joblib pickle all objects into a single file while
keeping both dump and load memory efficient.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;A fast compression stream&lt;/strong&gt;: as the pickling refactoring opens the door
to the use of file objects, joblib can now persist data into any kind of file
object: &lt;tt class="docutils literal"&gt;open&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;gzip.GzipFile&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;bz2.BZ2File&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;lzma.LZMAFile&lt;/tt&gt;. For
performance and usability reasons, the new joblib version uses its own file
object, &lt;tt class="docutils literal"&gt;BinaryZlibFile&lt;/tt&gt;, for zlib compression. Compared to
&lt;tt class="docutils literal"&gt;GzipFile&lt;/tt&gt;, it disables CRC computation, which brings a performance gain of 15%.&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Speed penalties of on-the-fly writes&lt;/p&gt;
&lt;p&gt;There is also a small speed difference between the new and old joblib for
dict/list objects when using compression.
The old version pickled the data into an &lt;tt class="docutils literal"&gt;io.BytesIO&lt;/tt&gt; buffer and then
compressed it in one go, whereas the new version writes compressed chunks
of pickled data to the file on the fly.
Because of this internal buffer, the old implementation was not memory safe:
it duplicated the data in memory before compressing. The small speed difference
was judged acceptable compared to this memory duplication.&lt;/p&gt;
&lt;/div&gt;
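&lt;p&gt;The two strategies can be contrasted with a stdlib sketch (stand-in code,
not joblib’s), where a raw byte payload plays the role of a pickled stream:&lt;/p&gt;

```python
import io
import zlib

payload = b"some pickled bytes " * 10000

# Old strategy: the whole payload sits duplicated in an in-memory
# buffer, then gets compressed in one shot.
one_shot = zlib.compress(payload)

# New strategy: feed fixed-size chunks to a streaming compressor and
# write each compressed piece out immediately; only one chunk is held
# in memory on top of the source.
out = io.BytesIO()
comp = zlib.compressobj()
for start in range(0, len(payload), 4096):
    out.write(comp.compress(payload[start:start + 4096]))
out.write(comp.flush())

# Both strategies produce a valid zlib stream with the same content.
assert zlib.decompress(out.getvalue()) == payload
assert zlib.decompress(one_shot) == payload
```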
&lt;/div&gt;
&lt;div class="section" id="conclusion-and-future-work"&gt;
&lt;h2&gt;Conclusion and future work&lt;/h2&gt;
&lt;p&gt;Memory copies were a limitation when caching very large numpy arrays on
disk, e.g. arrays with a size close to the computer’s available RAM.
The problem was solved via intensive buffering and a lot of hacking on top of
pickle and numpy. Unfortunately, our strategy performs poorly with
big dictionaries or lists compared to &lt;tt class="docutils literal"&gt;cPickle&lt;/tt&gt;, so try to use
numpy arrays in your internal data structures (note that scipy sparse
matrices work well, as they build on arrays).&lt;/p&gt;
&lt;p&gt;In the future, numpy’s pickle methods could perhaps be improved to make
better use of the &lt;a class="reference external" href="https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects"&gt;64-bit opcodes for large objects&lt;/a&gt;
introduced in recent Python versions.&lt;/p&gt;
&lt;p&gt;Pickling through file handles is a first step toward pickling into
sockets, enabling broadcasting of data between computing units
on a network. This will be invaluable with &lt;a class="reference external" href="https://github.com/joblib/joblib/pull/325"&gt;joblib’s new distributed backends&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Other improvements will come from better compressors, making everything
faster.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;The pull request was implemented by &lt;a class="reference external" href="https://github.com/aabadie"&gt;&amp;#64;aabadie&lt;/a&gt;. He thanks &lt;a class="reference external" href="https://github.com/lesteve"&gt;&amp;#64;lesteve&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/ogrisel"&gt;&amp;#64;ogrisel&lt;/a&gt;
and &lt;a class="reference external" href="https://github.com/GaelVaroquaux"&gt;&amp;#64;GaelVaroquaux&lt;/a&gt; for the valuable
help, reviews and support.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;The load created by multiple files on the filesystem is
particularly detrimental for network filesystems, as it triggers
multiple requests and isn’t cache friendly.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;gzip is based on zlib with additional crc checks and a default
compression level of 3.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;A drawback of subclassing the Python Pickler/Unpickler is that it
is done for the pure-Python version, and not the “cPickle” version.
The latter is much faster when dealing with a large number of Python
objects. Once again, joblib is efficient when most of the data is
represented as numpy arrays or subclasses.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="joblib"></category><category term="persistence"></category><category term="big data"></category></entry><entry><title>Of software and Science. Reproducible science: what, why, and how</title><link href="https://gael-varoquaux.info/programming/of-software-and-science-reproducible-science-what-why-and-how.html" rel="alternate"></link><published>2015-12-16T00:00:00+01:00</published><updated>2015-12-16T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-12-16:/programming/of-software-and-science-reproducible-science-what-why-and-how.html</id><summary type="html">&lt;p&gt;At &lt;a class="reference external" href="mloss-2015-wising-up-to-building-open-source-machine-learning.html"&gt;MLOSS 15&lt;/a&gt; we
brainstormed on reproducible science, discussing &lt;strong&gt;why we care about
software in computer science&lt;/strong&gt;. Here is a summary blending &lt;a class="reference external" href="https://gist.github.com/GaelVaroquaux/33e7a7b297425890fefa"&gt;notes from
the discussions&lt;/a&gt; with my
opinion.&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
“Without engineering, science is not more than philosophy”
&amp;nbsp; &amp;nbsp; —  &amp;nbsp; &amp;nbsp;
&lt;a class="reference external" href="https://twitter.com/GaelVaroquaux/status/619767624654786560"&gt;the community&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do we enable better Science? Why do we do software …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;At &lt;a class="reference external" href="mloss-2015-wising-up-to-building-open-source-machine-learning.html"&gt;MLOSS 15&lt;/a&gt; we
brainstormed on reproducible science, discussing &lt;strong&gt;why we care about
software in computer science&lt;/strong&gt;. Here is a summary blending &lt;a class="reference external" href="https://gist.github.com/GaelVaroquaux/33e7a7b297425890fefa"&gt;notes from
the discussions&lt;/a&gt; with my
opinion.&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
“Without engineering, science is not more than philosophy”
&amp;nbsp; &amp;nbsp; —  &amp;nbsp; &amp;nbsp;
&lt;a class="reference external" href="https://twitter.com/GaelVaroquaux/status/619767624654786560"&gt;the community&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do we enable better Science? Why do we do software in science?&lt;/strong&gt;
These are the questions that we were interested in.&lt;/p&gt;
&lt;div class="grey docutils container"&gt;
&lt;strong&gt;Improving reproducibility of our scientific studies makes us more
efficient in the long run&lt;/strong&gt; to do good science: even inside a lab, new
research efforts build upon the previous work.&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="forms-of-reproducible-science-reproduction-replication-reuse"&gt;
&lt;h2&gt;Forms of reproducible science: reproduction, replication, &amp;amp; reuse&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://politicalsciencereplication.wordpress.com/2013/02/24/is-there-a-difference-between-replication-reproduction-and-re-analysis/"&gt;The classic concepts of reproducible science&lt;/a&gt;
are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;: being able to rerun an experiment as it was run,
for instance by reanalysing data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replicability&lt;/strong&gt;: being able to redo an experiment from scratch.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;em&gt;reproducible science&lt;/em&gt; movement argues that sharing the source code of
experiments is a prerequisite for &lt;em&gt;reproduction&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;For reproduction, fields like computer science (development of methods)
and biology (challenging data acquisition) have very different
constraints, with the complexity allocated differently between data and
code.&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
“Machine learning people use hugely complex algorithms on trivially
simple datasets. Biology does trivially simple algorithms on hugely
complex datasets.”
&amp;nbsp; &amp;nbsp; —  &amp;nbsp; &amp;nbsp;
&lt;em&gt;an MLOSS15 attendee&lt;/em&gt;&lt;/blockquote&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We felt that computer science needed an additional notion, complementing
replication and reproduction:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Reusability&lt;/strong&gt;: applying the process to a new yet similar question.
For instance for a paper contributing data analysis method, applying it
to new data.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="align-right docutils container"&gt;
Reusability is more valuable than reproducibility.&lt;/div&gt;
&lt;p&gt;Reproducibility without reusability in method development may hinder the
advancement of science, as it pushes people to all do the same
things, &lt;em&gt;eg&lt;/em&gt; always running experiments on the same data.&lt;/p&gt;
&lt;p&gt;Reusability enables results that the original investigator did not have in
mind. It implies that the experimental protocol extends further than the
exact scope of the question initially asked. For software development, it
is also harder, as it implies more robustness and flexibility.&lt;/p&gt;
&lt;p&gt;Finally sharing source code is not enough: &lt;strong&gt;readability&lt;/strong&gt; of the code is
necessary.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="roadblocks-to-reproducible-science"&gt;
&lt;h2&gt;Roadblocks to reproducible science&lt;/h2&gt;
&lt;div class="section" id="man-power"&gt;
&lt;h3&gt;Manpower&lt;/h3&gt;
&lt;p&gt;Reusability, readability, and support of released code all take a
lot of time, even though this is seldom acknowledged in talks about
reproducible science. Given fixed manpower, it is impossible to
achieve reusability and high quality for everything.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="computing-power"&gt;
&lt;h3&gt;Computing power&lt;/h3&gt;
&lt;p&gt;Some numerical experiments or complex data analyses require weeks of
cluster time to run. These are much harder to reproduce. Also, rerunning
an analysis from scratch on a regular basis is a good recipe for a
robust path from data to results. The more computing power is a limiting
resource, the more likely it is that a glitch goes undetected.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="data-availability"&gt;
&lt;h3&gt;Data availability&lt;/h3&gt;
&lt;p&gt;No access, or restricted access, to data is a show-stopper for
reproducibility. Data-sharing requirements are becoming common –from
funding agencies, or journals. However, privacy concerns or confidential
information get in the way of making data public, for instance in medical
research or microeconomics. Often, these concerns serve as a pretext
for people who actually do not want to relinquish &lt;em&gt;control&lt;/em&gt; &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;A related post by Deevy Bishop: &lt;a class="reference external" href="http://deevybee.blogspot.co.uk/2015/11/whos-afraid-of-open-data.html?m=1"&gt;Who’s afraid of open data&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div class="section" id="incentives-problem"&gt;
&lt;h3&gt;Incentives problem&lt;/h3&gt;
&lt;p&gt;Fancy new results are what matters for success in academia. “High impact”
journals such as Nature or Science accept papers that amaze and impress,
often with subpar inspection of the materials and methods &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;. The rate of
publication in many leading groups is incompatible with consolidation
efforts required for strong reproducibility.&lt;/p&gt;
&lt;p&gt;On the other hand, it is hard to tell beforehand whether a new idea is a good
one. Hence letting imagination run free to foster impossible and
improbable ideas is a good path to innovation. The underlying questions
are: What are the best community rules for the advancement of knowledge?
What do we want from the way science moves forward? Rapid publication of
many incremental ideas, &lt;em&gt;eg&lt;/em&gt; at a conference, gives food for thought,
possibly at the expense of reproducibility.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;“Science, Nature and Cell, had a higher rate of retractions” –
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Invalid_science"&gt;Wikipedia: Invalid science&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="how-to-improve-the-situation"&gt;
&lt;h2&gt;How to improve the situation&lt;/h2&gt;
&lt;div class="section" id="docker-containers-and-virtual-machines"&gt;
&lt;h3&gt;Docker, containers, and virtual machines&lt;/h3&gt;
&lt;p&gt;Docker and other container or virtual-machine technologies enable shipping a software
environment. They diminish the challenges of building software and
setting up an analysis. Virtual machines are often used as a way to avoid
software packaging issues. This seems to me like a plaster on a wooden leg.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Containers give easy reproduction, at the cost of hard
replication and reuse.&lt;/div&gt;
&lt;p&gt;Indeed, an analysis that lives in a box can be reproduced, but can it be
understood, modified, or applied to new data? New science is likely going
to come from modifying this analysis, or combining it with other tools,
or new data. If these other tools live in a different virtual machine,
the combination will be challenging.&lt;/p&gt;
&lt;p&gt;In addition, people use containers as an excuse to avoid tackling
the need for proper documentation of requirements and of the process to set
them up. They sometimes even try to justify binary blobs &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;. This is
wrong. An analysis should be runnable without requiring the stars to
align, and it should be understandable.&lt;/p&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;See also Titus Brown’s post: &lt;a class="reference external" href="http://ivory.idyll.org/blog/2014-containers.html"&gt;The post-apocalyptic world of binary
containers&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div class="section" id="version-control-wear-your-seatbelt"&gt;
&lt;h3&gt;Version control: wear your seatbelt&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control"&gt;Version control&lt;/a&gt;
is like a time machine: if used with regular commits, it enables rolling
back to any point in time. For my work, it has always been crucial
for reproducing what I or my students did a while ago. I often meet
researchers who feel they lack the time to learn it. I really cannot support
this position. &lt;a class="reference external" href="http://try.github.io"&gt;http://try.github.io&lt;/a&gt; is an easy way to learn version
control.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Hint&lt;/em&gt;: use a “tag” to pinpoint a position in the history that you might
want to return to, such as making a figure or the publication of an article.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sotware-libraries-curated-and-maintained"&gt;
&lt;h3&gt;Software libraries, curated and maintained&lt;/h3&gt;
&lt;p&gt;Consolidating an analysis pipeline, a standard visualization, or any
computational aspect of a paper into a software library is a sure way to
make the paper more reproducible. It will also make the steps reusable,
and a replication easier. If continued effort is put in the library,
chances are that computational efficiency will improve over time, thus
helping in the long run with the challenge of computing power.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Tough choices: not every variant of an analysis can be forever
reproducible.&lt;/div&gt;
&lt;p&gt;Maintaining the library will ensure that results are still reproducible
on new hardware, or with evolution of the general software stack (a new
Python or Matlab release, for instance). Documentation and curated
examples will lower the bar to reuse and facilitate replication of the
original scientific results.&lt;/p&gt;
&lt;p&gt;To avoid feature creep and technical debt, a library calls for focused
efforts on selecting the most important operations.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="datasets-serving-as-model-experiments-tractable-and-open"&gt;
&lt;h3&gt;Datasets, serving as model experiments, tractable and open&lt;/h3&gt;
&lt;p&gt;Sometimes researchers create a toy dataset, with a well-posed question, that
is curated and open, small enough to be tractable yet large enough to be
relevant to the application field. This is an invaluable service to the
field. One example is the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Netflix_Prize"&gt;Netflix prize&lt;/a&gt; in machine learning,
which led to a standard dataset. Unfortunately, the dataset was taken
down some years later due to copyright concerns. But it has been
replaced, &lt;em&gt;eg&lt;/em&gt; by the &lt;a class="reference external" href="http://grouplens.org/datasets/movielens/"&gt;movielens dataset&lt;/a&gt;. For computer vision, a
series of datasets –&lt;a class="reference external" href="http://www.vision.caltech.edu/Image_Datasets/Caltech101/"&gt;Caltech101&lt;/a&gt;, &lt;a class="reference external" href="https://www.cs.toronto.edu/~kriz/cifar.html"&gt;CIFAR&lt;/a&gt;, &lt;a class="reference external" href="http://www.image-net.org/"&gt;ImageNet&lt;/a&gt;…– have led to continuous progress of the
field. In bioinformatics, standard data are regularly created, for
instance by the &lt;a class="reference external" href="http://dreamchallenges.org/"&gt;DREAM challenges&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These reference open datasets serve as benchmarks and therefore foster
competition. They also define a canonical experiment, helping a wider
scientific community understand the questions that they ask. Ultimately,
they result in better software tools to solve the problem at hand, as
this problem becomes a standard example and application of tools.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Sage_Bionetworks"&gt;Sage bionetworks&lt;/a&gt;, for
instance, is a non-profit that collects and make biomedical data
available. These people believe, as I do, that such data will lead to
better medical care.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="changing-incentives-setting-the-right-goals"&gt;
&lt;h3&gt;Changing incentives: setting the right goals&lt;/h3&gt;
&lt;p&gt;Making sustainable, quality scientific work that facilitates reproduction
needs to be a clearly-visible benefit to researchers, young and senior.
Such contributions should help them get jobs and grants.&lt;/p&gt;
&lt;p&gt;An unsophisticated publication count is the basis of scientific
evaluation. We need to accept publications about data, software, and
replication of prior work in high-quality journals. They need to be
strictly reviewed, to establish high standards on these contributions.
This change is happening. &lt;a class="reference external" href="http://www.gigasciencejournal.com/"&gt;Gigascience&lt;/a&gt;, amongst other venues, publishes
data. The &lt;a class="reference external" href="http://jmlr.org/mloss/"&gt;MLOSS (machine learning open source software) track&lt;/a&gt; of the JMLR (journal of machine learning
research) publishes software, with a tough review on the software quality
of the project.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Researchers should cite the software they use.&lt;/div&gt;
&lt;p&gt;Yet software is still often under-cited: many will use software
implementing a method, and only cite the original paper that proposed the
method. Another remaining challenge is how to give credit for continuing
development and maintenance.&lt;/p&gt;
&lt;p&gt;Fast-paced science is probably useful even if fragile. But the difference
between a quick proof of concept and solid, reproducible and reusable
work needs to be acknowledged. It is important to select for publication
not only impressive results, but also sound reusable material and
methods. The latter are the foundation of future scientific developments,
but high-impact journals tend to focus on the former.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Related posts&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="software-for-reproducible-science-lets-not-have-a-misunderstanding.html"&gt;Software for reproducible science: let’s not have a misunderstanding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="mloss-2015-wising-up-to-building-open-source-machine-learning.html"&gt;MLOSS 2015: wising up to building open-source machine learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="reproducible research"></category><category term="science"></category><category term="software"></category><category term="machine learning"></category><category term="scientific software"></category></entry><entry><title>Nilearn 0.2: more powerful machine learning for neuroimaging</title><link href="https://gael-varoquaux.info/programming/nilearn-02-more-powerful-machine-learning-for-neuroimaging.html" rel="alternate"></link><published>2015-12-13T00:00:00+01:00</published><updated>2015-12-13T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-12-13:/programming/nilearn-02-more-powerful-machine-learning-for-neuroimaging.html</id><summary type="html">&lt;div class="small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Nilearn’s goals&lt;/p&gt;
&lt;p class="last"&gt;Make advanced machine learning techniques easy for neuroimaging
research.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;After 6 months of effort, we just released version 0.2 of &lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt;, dedicated to making &lt;strong&gt;machine learning in
neuroimaging easier and more powerful&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This release integrates the features of the &lt;a class="reference external" href="nilearn_july_2015_sprint.html"&gt;july sprint&lt;/a&gt;, and &lt;a class="reference external" href="http://nilearn.github.io/whats_new.html"&gt;more&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="highlights"&gt;
&lt;h2&gt;Highlights&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Better documentation …&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Nilearn’s goals&lt;/p&gt;
&lt;p class="last"&gt;Make advanced machine learning techniques easy for neuroimaging
research.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;After 6 months of effort, we just released version 0.2 of &lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt;, dedicated to making &lt;strong&gt;machine learning in
neuroimaging easier and more powerful&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This release integrates the features of the &lt;a class="reference external" href="nilearn_july_2015_sprint.html"&gt;july sprint&lt;/a&gt;, and &lt;a class="reference external" href="http://nilearn.github.io/whats_new.html"&gt;more&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="highlights"&gt;
&lt;h2&gt;Highlights&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Better documentation with narrative examples&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Examples can now be broken down into blocks (as &lt;a class="reference external" href="http://nilearn.github.io/auto_examples/connectivity/plot_signal_extraction.html#sphx-glr-auto-examples-connectivity-plot-signal-extraction-py"&gt;here&lt;/a&gt;)
for better narration (thanks to &lt;a class="reference external" href="http://sphinx-gallery.readthedocs.org/en/latest/"&gt;sphinx-gallery&lt;/a&gt;).&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="http://nilearn.github.io/auto_examples/decoding/plot_mixed_gambles_space_net.html"&gt;&lt;img alt="" src="http://nilearn.github.io/_images/sphx_glr_plot_mixed_gambles_space_net_001.png" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Space net: spatial regularizations in decoding&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://nilearn.github.io/decoding/space_net.html"&gt;“SpaceNet” decoder&lt;/a&gt; does spatial
regularizations such as TV-l1 or Graph-Net to identify predictive regions
in decoding.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="http://nilearn.github.io/auto_examples/connectivity/plot_compare_resting_state_decomposition.html"&gt;&lt;img alt="" src="http://nilearn.github.io/_images/sphx_glr_plot_compare_resting_state_decomposition_002.png" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Dictionary learning for resting-state parcellations&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Dictionary learning is a &lt;a class="reference external" href="http://nilearn.github.io/connectivity/resting_state_networks.html#beyond-ica-dictionary-learning"&gt;promising alternative to ICA to learn networks&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="http://nilearn.github.io/auto_examples/manipulating_visualizing/plot_prob_atlas.html#sphx-glr-auto-examples-manipulating-visualizing-plot-prob-atlas-py"&gt;&lt;img alt="" src="http://nilearn.github.io/_images/sphx_glr_plot_prob_atlas_003.png" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Plotting sets of probabilistic maps&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With &lt;a class="reference external" href="http://nilearn.github.io/manipulating_visualizing/plotting.html#different-plotting-functions"&gt;a simple function&lt;/a&gt;,
you can plot outlines for multiple maps.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="http://nilearn.github.io/auto_examples/manipulating_visualizing/plot_extract_rois_statistical_maps.html"&gt;&lt;img alt="" src="http://nilearn.github.io/_images/sphx_glr_plot_extract_rois_statistical_maps_003.png" /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Separating regions out of maps&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We have a set of functions to &lt;a class="reference external" href="http://nilearn.github.io/auto_examples/manipulating_visualizing/plot_extract_rois_statistical_maps.html"&gt;separate regions on maps&lt;/a&gt; or &lt;a class="reference external" href="http://nilearn.github.io/auto_examples/connectivity/plot_extract_regions_canica_maps.html"&gt;turn networks into a probabilistic parcellation&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;strong&gt;Classification on connectomes&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We now have advanced connectivity measures to do &lt;a class="reference external" href="http://nilearn.github.io/auto_examples/connectivity/plot_connectivity_measures.html"&gt;comparisons across
connectomes for classification&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;Thanks&lt;/p&gt;
&lt;p&gt;Thanks to Alexandre Abraham, who led the effort, and &lt;a class="reference external" href="http://nilearn.github.io/whats_new.html#contributors"&gt;all the
contributors&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="neuroimaging"></category><category term="python"></category><category term="scientific computing"></category><category term="scipy"></category></entry><entry><title>Job offer: data crunching brain functional connectivity for biomarkers</title><link href="https://gael-varoquaux.info/science/job-offer-data-crunching-brain-functional-connectivity-for-biomarkers.html" rel="alternate"></link><published>2015-12-08T00:00:00+01:00</published><updated>2015-12-08T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-12-08:/science/job-offer-data-crunching-brain-functional-connectivity-for-biomarkers.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;My research group&lt;/a&gt; is looking to fill
a &lt;strong&gt;post-doc position on learning biomarkers from functional
connectivity&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="section" id="scientific-context"&gt;
&lt;h2&gt;Scientific context&lt;/h2&gt;
&lt;p&gt;The challenge is to use resting-state fMRI at the level of a population
to understand how intrinsic functional connectivity captures pathologies
and other cognitive phenotypes. Rest fMRI is a promising tool for …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;My research group&lt;/a&gt; is looking to fill
a &lt;strong&gt;post-doc position on learning biomarkers from functional
connectivity&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="section" id="scientific-context"&gt;
&lt;h2&gt;Scientific context&lt;/h2&gt;
&lt;p&gt;The challenge is to use resting-state fMRI at the level of a population
to understand how intrinsic functional connectivity captures pathologies
and other cognitive phenotypes. Rest fMRI is a promising tool for
large-scale population analysis of brain function as it is easy to
acquire and accumulate. Scans for thousands of subjects have already been
shared, and more are to come. However, the signatures of cognition in this
modality are weak, and extracting biomarkers is a challenging
data-processing and machine-learning problem. Meeting this challenge is
the core expertise of my research group. Medical applications cover a wide
range of brain pathologies for which diagnosis is challenging, such as
autism or Alzheimer’s disease.&lt;/p&gt;
&lt;p&gt;This project is a collaboration with the &lt;a class="reference external" href="http://www.childmind.org/"&gt;Child Mind Institute&lt;/a&gt;, experts on psychiatric disorders and
resting-state fMRI, as well as coordinators of the major data-sharing
initiatives for rest fMRI data (e.g. ABIDE).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="objectives-of-the-project"&gt;
&lt;h2&gt;Objectives of the project&lt;/h2&gt;
&lt;p&gt;The project hinges on the processing of very large rest fMRI databases.
Important novelties of the project are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Building predictive models that can discriminate &lt;strong&gt;multiple
pathologies&lt;/strong&gt; in &lt;strong&gt;large inhomogeneous datasets&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Using and improving &lt;strong&gt;advanced connectomics&lt;/strong&gt; and
&lt;strong&gt;brain-parcellation&lt;/strong&gt; techniques in fMRI.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Expected results include the discovery of neurophenotypes for several
brain pathologies, as well as intrinsic brain structures, such as
functional parcellations or connectomes, that carry signatures of
cognition.&lt;/p&gt;
&lt;p&gt;The analysis framework is based on algorithmic tools developed in Python
(crucially, leveraging scikit-learn for predictive modeling).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="desired-profile"&gt;
&lt;h2&gt;Desired profile&lt;/h2&gt;
&lt;p&gt;We are looking to hire a post-doctoral fellow in the spring. The ideal
candidate would have some, but not all, of the following expertise and
interests:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Experience in advanced processing of fMRI&lt;/li&gt;
&lt;li&gt;General knowledge of brain structure and function&lt;/li&gt;
&lt;li&gt;Good communication skills to write high-impact neuroscience publications&lt;/li&gt;
&lt;li&gt;Good computing skills, in particular with Python. Cluster computing
experience is desired.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="a-great-research-environment"&gt;
&lt;h2&gt;A great research environment&lt;/h2&gt;
&lt;p&gt;The work environment is dynamic and exciting, using state-of-the-art
machine learning to answer challenging functional neuroimaging questions.&lt;/p&gt;
&lt;p&gt;The post-doc will be employed by &lt;a class="reference external" href="http://www.inria.fr"&gt;INRIA&lt;/a&gt;, the lead
computing research institute in France. We are a team of computer
scientists specialized in image processing and statistical data analysis,
integrated in one of the top French brain research centers, &lt;a class="reference external" href="http://i2bm.cea.fr/dsv/i2bm/Pages/NeuroSpin.aspx"&gt;NeuroSpin&lt;/a&gt;, south of Paris. We
work mostly in Python. The team includes core contributors to the
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn project&lt;/a&gt;, for machine learning in
Python, and the &lt;a class="reference external" href="http://nilearn.github.io/"&gt;nilearn project&lt;/a&gt;, for
statistical learning in NeuroImaging.&lt;/p&gt;
&lt;p&gt;In addition, the post-doc will interact closely with researchers from the
&lt;a class="reference external" href="http://www.childmind.org/"&gt;Child Mind Institute&lt;/a&gt;, with deep expertise
in brain pathologies and in the details of the fMRI acquisitions.
Finally, he or she will have access to advanced storage and grid
computing facilities at INRIA.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Contact information&lt;/strong&gt;: gael dotnospam varoquaux atnotspam inria dotnospam fr&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="jobs"></category><category term="neuromaging"></category><category term="science"></category><category term="python"></category><category term="scientific computing"></category></entry><entry><title>MLOSS 2015: wising up to building open-source machine learning</title><link href="https://gael-varoquaux.info/programming/mloss-2015-wising-up-to-building-open-source-machine-learning.html" rel="alternate"></link><published>2015-11-28T00:00:00+01:00</published><updated>2015-11-28T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-11-28:/programming/mloss-2015-wising-up-to-building-open-source-machine-learning.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The 2015 edition of the machine learning open
source software (MLOSS) workshop was full of very mature discussions
that I strive to report here.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;em&gt;I give links to the videos. Some machine-learning researchers have
great thoughts about growing communities of coders, about code as a
process and a deliverable …&lt;/em&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The 2015 edition of the machine learning open
source software (MLOSS) workshop was full of very mature discussions
that I strive to report here.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;em&gt;I give links to the videos. Some machine-learning researchers have
great thoughts about growing communities of coders, about code as a
process and a deliverable.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I was a co-organizer of the &lt;a class="reference external" href="https://mloss.org/workshop/icml15/"&gt;MLOSS 2015 workshop&lt;/a&gt;, held during &lt;a class="reference external" href="http://icml.cc/2015/"&gt;ICML 2015&lt;/a&gt;. As I have finally figured out where the
videos are, now is a good time to summarize my impressions on the
workshop.&lt;/p&gt;
&lt;img alt="" src="attachments/mloss/mloss_t_shirt_white.png" style="width: 100%;" /&gt;
&lt;div class="section" id="online-videos-of-the-talks"&gt;
&lt;h2&gt;Online videos of the talks&lt;/h2&gt;
&lt;div class="small sidebar"&gt;
&lt;p class="first sidebar-title"&gt;Graphics &amp;amp; T-shirts&lt;/p&gt;
&lt;p&gt;The graphics were printed on T-shirts. We ran out, but the material is
&lt;a class="reference external" href="attachments/mloss/mloss_t_shirt_graphics.zip"&gt;here&lt;/a&gt; for you to
print.&lt;/p&gt;
&lt;p class="last"&gt;&lt;em&gt;Anyone wants to help making an online T-shirt ordering?&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The videos of all the talks are online:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/4216268dc28148c89d8b6e4eba1ad6e51d"&gt;Python and Parallelism or Dask&lt;/a&gt;
by &lt;em&gt;Matthew Rocklin&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/afe6f76b3bb1452790fc8982e28112641d"&gt;Collaborative filtering via matrix decomposition in mlpack&lt;/a&gt;
by &lt;em&gt;Ryan Curtin&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/9cd947554ddf404b9a40ca2601e44b4c1d"&gt;BLOG: a probabilistic programming language for open-universe contingent
Bayesian networks&lt;/a&gt;
by &lt;em&gt;Yi Wu&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/45c3bb312a37491dbce1af25f1aeba001d"&gt;Spotlights&lt;/a&gt;:&lt;ul&gt;
&lt;li&gt;Nilearn, machine learning for neuroimaging in Python (Alexandre
Abraham)&lt;/li&gt;
&lt;li&gt;KeLP: a Kernel-based Learning Platform in Java (Simone Filice)&lt;/li&gt;
&lt;li&gt;DiffSharp: Automatic Differentiation Library (Atılım Güneş Baydin)&lt;/li&gt;
&lt;li&gt;The FAST toolkit for Unsupervised Learning of HMMs (José P.
González-Brenes)&lt;/li&gt;
&lt;li&gt;OpenML: a Networked Science Platform for Machine Learning (Joaquin
Vanschoren)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/2529ebcb20794942874d5c277c5dcc981d"&gt;Julia’s Approach to Open Source Machine Learning&lt;/a&gt;
by &lt;em&gt;John Myles White&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/da4f7869f07745f7bbc5a2e5f31761b61d"&gt;Do it yourself deep learning with the Caffe community&lt;/a&gt;
by &lt;em&gt;Evan Shelhamer&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/2bc15b283f324784a945d79d9a06c76c1d"&gt;From flop to success in academic software development&lt;/a&gt;
by &lt;em&gt;Gaël Varoquaux&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="mloss-a-maturing-community"&gt;
&lt;h2&gt;MLOSS: a maturing community&lt;/h2&gt;
&lt;!-- Say that I was not enthousiastic, originaly, and say why (typical
flaws of academic software) --&gt;
&lt;p&gt;When Antti Honkela and Cheng Soon Ong approached me to co-organize an
MLOSS workshop, I felt that it was important to do it for the sake of
open source scientific software. But I didn’t feel very enthusiastic
about the event or the talks themselves. Boy, was I wrong.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Huge attendance: open-source ML software is now mainstream.&lt;/div&gt;
&lt;p&gt;My first MLOSS workshop was at the ICML 2011 conference, in Haifa. The
workshop was in a tiny, cramped room, with a couple dozen geeks,
and it felt like a clique of people on the side of the conference. This
year, we had a huge room and more than 200 people showed up.&lt;/p&gt;
&lt;p&gt;I am used to talks being about a grad student or young researcher who
has whipped the code of a paper onto the web, with an open license but no
vision. This year, people were presenting actual projects, with long-term
goals and the desire to solve a problem larger than their latest research.
It might explain why the attendance was huge: people came because talks
might genuinely help them.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;With Cheng and Antti, we had chosen &lt;em&gt;“open ecosystems”&lt;/em&gt; as a theme,
because ecosystems are the key to scaling computing and science. Between
us, imposing a theme on a workshop is challenging, as people submit
abstracts, good or bad, and one has to work with what one gets.
However, a lot of talks mentioned how the projects slot into a wider
picture, and interact with a community. For instance, Evan attributes
part of the success of Caffe to the &lt;a class="reference external" href="https://github.com/BVLC/caffe/wiki/Model-Zoo"&gt;“Model Zoo”&lt;/a&gt; in which the community
contributes fitted models. At the other end of the spectrum, OpenML is a
full online project with the goal of fostering collaboration and comparison.
Project developers showed in their talks that they are very conscious
of other projects that might be used together with theirs.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="accepting-the-sustainability-challenges"&gt;
&lt;h2&gt;Accepting the sustainability challenges&lt;/h2&gt;
&lt;p&gt;Over time, I have gradually realized the importance of community
building, &lt;em&gt;i.e.&lt;/em&gt; project management and goal setting, more than technical
virtuosity. Historically, the scientific culture of code has put the
emphasis on the genius ideas behind the code, and the craftsmanship of
the implementation, at the cost of sustainability.&lt;/p&gt;
&lt;div class="align-right docutils container"&gt;
Alone, I go fast. Together, we go far.&lt;/div&gt;
&lt;p&gt;I was surprised to see that the MLOSS community was growing very aware of
mechanisms of long-term project life, in particular the human factors.&lt;/p&gt;
&lt;p&gt;I was asked by my co-organizers to give &lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/2bc15b283f324784a945d79d9a06c76c1d"&gt;a talk on factors of success of
open source scientific software&lt;/a&gt;.
I touched upon &lt;strong&gt;software engineering&lt;/strong&gt;, &lt;strong&gt;project vision&lt;/strong&gt;,
&lt;strong&gt;licensing&lt;/strong&gt;, &lt;strong&gt;governance&lt;/strong&gt;, &lt;strong&gt;community building&lt;/strong&gt;. All these topics
are deemed &lt;em&gt;“non-scientific”&lt;/em&gt; and thus so often despised and left out. I was
astonished to find out that the talks before me were giving very good
advice on these. I found that I only had to summarize and comment what
had been said before. This evolution of the scientific community makes me
very hopeful for the future.&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
&lt;p&gt;Every line of code you write is debt. You should be ashamed of every line
of code you have written. […]&lt;/p&gt;
&lt;p&gt;You have a supply of labor. These are the people who are contributors
[…].
The people who are users and not contributors are actually a source of
demand […] they mostly consume sources of labor rather than produce it.
&amp;nbsp; &amp;nbsp; &amp;nbsp; —  &amp;nbsp; &amp;nbsp;
&lt;a class="reference external" href="http://k4webcast.mediasite.com/Mediasite/Play/2529ebcb20794942874d5c277c5dcc981d"&gt;John Myles White&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Thanks to our sponsors&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.facebook.com"&gt;Facebook&lt;/a&gt; and &lt;a class="reference external" href="http://www.continuum.io"&gt;continuum&lt;/a&gt; sponsored the trip for our keynote
speakers. Thank you very much, the keynotes were great!&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://www.datascience-paris-saclay.fr/"&gt;Paris-Saclay Center for Data Science (CDS)&lt;/a&gt; gave us our main operating
fund, which is critical for organizing an event. In general, I must
say that the CDS has been hugely supportive of open source data
science in the Paris area, having a significant impact on training as
well as development.&lt;/p&gt;
&lt;p&gt;And also, I must acknowledge support from &lt;a class="reference external" href="http://www.inria.fr/"&gt;Inria&lt;/a&gt; for the accounting and administration
of the event.&lt;/p&gt;
&lt;p&gt;Finally, &lt;strong&gt;our reviewers were amazing&lt;/strong&gt;. Most of them reviewed the
project itself: its code, its documentation, its support. They rose above
the typical petty fights that we see in academia and focused on what
the project was bringing to the scientific community. Often their
reviews were longer and carried more information than the submitted
abstracts.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Related posts&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="software-for-reproducible-science-lets-not-have-a-misunderstanding.html"&gt;Software for reproducible science: let’s not have a misunderstanding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="../science/publishing-scientific-software-matters.html"&gt;Publishing scientific software matters&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="conferences"></category><category term="science"></category><category term="software"></category><category term="machine learning"></category><category term="reproducible research"></category><category term="scientific software"></category></entry><entry><title>Nilearn sprint: hacking neuroimaging machine learning</title><link href="https://gael-varoquaux.info/programming/nilearn-sprint-hacking-neuroimaging-machine-learning.html" rel="alternate"></link><published>2015-08-04T00:00:00+02:00</published><updated>2015-08-04T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-08-04:/programming/nilearn-sprint-hacking-neuroimaging-machine-learning.html</id><summary type="html">&lt;p&gt;A couple of weeks ago, we had in Paris the second international &lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt; sprint, dedicated to making &lt;strong&gt;machine learning
in neuroimaging easier and more powerful&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;It was such a fantastic experience, as nilearn is really shaping up as a
simple yet powerful tool, and there is a lot of enthusiasm …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A couple of weeks ago, we had in Paris the second international &lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt; sprint, dedicated to making &lt;strong&gt;machine learning
in neuroimaging easier and more powerful&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;It was such a fantastic experience, as nilearn is really shaping up as a
simple yet powerful tool, and there is a lot of enthusiasm. For me, this
sprint is a turning point, as I could see people other than the original
core team (which spun out of &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;our research team&lt;/a&gt;) excited about the project’s future.
Thank you to all who came:&lt;/p&gt;
&lt;ul class="columns simple"&gt;
&lt;li&gt;Ahmed Kanaan&lt;/li&gt;
&lt;li&gt;Andres Hoyos Idrobo&lt;/li&gt;
&lt;li&gt;Alexandre Abraham&lt;/li&gt;
&lt;li&gt;Arthur Mensch&lt;/li&gt;
&lt;li&gt;Ben Cipolli (remote)&lt;/li&gt;
&lt;li&gt;Bertrand Thirion&lt;/li&gt;
&lt;li&gt;Chris Filo Gorgolewski&lt;/li&gt;
&lt;li&gt;Danilo Bzdok&lt;/li&gt;
&lt;li&gt;Elvis Dohmatob&lt;/li&gt;
&lt;li&gt;Julia Hutenburg&lt;/li&gt;
&lt;li&gt;Kamalaker Dadi&lt;/li&gt;
&lt;li&gt;Loic Esteve&lt;/li&gt;
&lt;li&gt;Martin Perez&lt;/li&gt;
&lt;li&gt;Michael Hanke&lt;/li&gt;
&lt;li&gt;Oscar Nájera, working on
&lt;a class="reference external" href="http://sphinx-gallery.readthedocs.org/"&gt;sphinx-gallery&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" src="attachments/nilearn_july_2015_sprint/nilearn_sprint.jpg" style="width: 100%;" /&gt;
&lt;p&gt;The sprint was a joint sprint with the &lt;a class="reference external" href="http://martinos.org/mne/stable/mne-python.html"&gt;MNE-Python&lt;/a&gt; team, which makes MEG
processing awesome. We also need to thank &lt;a class="reference external" href="http://alexandre.gramfort.net"&gt;Alex Gramfort&lt;/a&gt;, who did most of the work to set up the
sprint, as well as &lt;a class="reference external" href="https://www.universite-paris-saclay.fr/en/research/project/lidex-neurosaclay"&gt;NeuroSaclay&lt;/a&gt;
for funding, and &lt;a class="reference external" href="http://lapaillasse.org/"&gt;La paillasse&lt;/a&gt;, &lt;a class="reference external" href="http://www.telecom-paristech.fr"&gt;Telecom&lt;/a&gt;, and &lt;a class="reference external" href="http://www.inria.fr/en/centre/saclay"&gt;INRIA&lt;/a&gt; for hosting.&lt;/p&gt;
&lt;div class="section" id="highlights-of-the-sprints-results"&gt;
&lt;h2&gt;Highlights of the sprint’s results&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Plotting of multiple maps&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external image-reference" href="https://circle-artifacts.com/gh/nilearn/nilearn/128/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/connectivity/plot_canica_resting_state.html"&gt;&lt;img alt="" class="align-right" src="attachments/nilearn_july_2015_sprint/plot_canica_resting_state_001.png" style="width: 200px;" /&gt;&lt;/a&gt;
&lt;p&gt;A function to visualize overlays of various maps, e.g. for a
probabilistic atlas, with defaults that try to adapt to the number of
maps (see the &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/128/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/manipulating_visualizing/plot_prob_atlas.html"&gt;example&lt;/a&gt;).
It is very useful, for example, for &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/128/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/connectivity/plot_canica_resting_state.html"&gt;easily visualizing ICA components&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sign of activation in glass brain&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external image-reference" href="https://circle-artifacts.com/gh/nilearn/nilearn/287/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/manipulating_visualizing/plot_demo_glass_brain_extensive.html"&gt;&lt;img alt="" class="align-right" src="attachments/nilearn_july_2015_sprint/plot_demo_glass_brain_extensive_005.png" style="width: 200px;" /&gt;&lt;/a&gt;
&lt;p&gt;Our glass brain plotting was greatly improved, adding, amongst other
things, the option to capture the sign of the activation in the color
(see this &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/287/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/manipulating_visualizing/plot_demo_glass_brain_extensive.html"&gt;example&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Spatially-regularized decoder&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external image-reference" href="https://circle-artifacts.com/gh/nilearn/nilearn/287/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/decoding/plot_haxby_space_net.html"&gt;&lt;img alt="" class="align-right" src="attachments/nilearn_july_2015_sprint/plot_haxby_space_net_002.png" style="width: 200px;" /&gt;&lt;/a&gt;
&lt;p&gt;Decoders based on GraphNet and total variation have finally landed in
nilearn. This has required a lot of work to get fast convergence and
robust parameter selection. At the end of the day, it is much slower
than an SVM, but the maps look splendid
(see this &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/287/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/decoding/plot_haxby_space_net.html"&gt;example&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sparse dictionary learning&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external image-reference" href="https://circle-artifacts.com/gh/nilearn/nilearn/282/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/connectivity/plot_dict_learning_resting_state.html"&gt;&lt;img alt="" class="align-right" src="attachments/nilearn_july_2015_sprint/plot_dict_learning_resting_state_001.png" style="width: 200px;" /&gt;&lt;/a&gt;
&lt;p&gt;We have almost merged sparse dictionary learning as an alternative to ICA.
Experience shows that, on resting-state data, it gives a more contrasted
segmentation of networks
(see this &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/282/artifacts/0/home/ubuntu/nilearn/doc/_build/html/auto_examples/connectivity/plot_dict_learning_resting_state.html"&gt;example&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;New installation docs&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
New webpage layout using tabs to display only the installation
instructions relevant to the user’s OS (see &lt;a class="reference external" href="https://circle-artifacts.com/gh/nilearn/nilearn/287/artifacts/0/home/ubuntu/nilearn/doc/_build/html/introduction.html#installation"&gt;here&lt;/a&gt;).
The result is more compact and clearer instructions, which I hope
will make our users’ lives easier.&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;CircleCI integration&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
We now use &lt;a class="reference external" href="https://circleci.com/gh/nilearn/nilearn"&gt;CircleCI&lt;/a&gt; to
run the examples and build the docs. This is challenging because our
examples are real cases of neuroimaging data analysis, and thus require
heavy datasets and computing horsepower.&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Neurodebian packaging&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
There are now &lt;a class="reference external" href="http://neuro.debian.net/pkgs/python-nilearn.html"&gt;neurodebian packages&lt;/a&gt; for nilearn.&lt;/blockquote&gt;
&lt;p&gt;And much more!&lt;/p&gt;
&lt;div class="admonition warning"&gt;
&lt;p class="first admonition-title"&gt;Warning&lt;/p&gt;
&lt;p class="last"&gt;Features listed above are &lt;strong&gt;not&lt;/strong&gt; in the released version of nilearn.
You need to wait a month or so.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="neuroimaging"></category><category term="python"></category><category term="scientific computing"></category><category term="scipy"></category></entry><entry><title>Software for reproducible science: let’s not have a misunderstanding</title><link href="https://gael-varoquaux.info/programming/software-for-reproducible-science-lets-not-have-a-misunderstanding.html" rel="alternate"></link><published>2015-05-18T00:00:00+02:00</published><updated>2015-05-18T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-05-18:/programming/software-for-reproducible-science-lets-not-have-a-misunderstanding.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tl;dr:&lt;/strong&gt; &amp;nbsp; &lt;em&gt;Reproducibility is a noble cause and scientific
software a promising vessel. But excess of reproducibility can be at
odds with the housekeeping required for good software engineering.
Code that “just works” should not be taken for granted.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;em&gt;This post advocates for a progressive consolidation effort of
scientific …&lt;/em&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tl;dr:&lt;/strong&gt; &amp;nbsp; &lt;em&gt;Reproducibility is a noble cause and scientific
software a promising vessel. But excess of reproducibility can be at
odds with the housekeeping required for good software engineering.
Code that “just works” should not be taken for granted.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;em&gt;This post advocates for a progressive consolidation effort of
scientific code, rather than putting too high a bar on code release.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="http://ivory.idyll.org/blog/"&gt;Titus Brown&lt;/a&gt; recently shared &lt;a class="reference external" href="http://ivory.idyll.org/blog/2015-how-should-we-think-about-research-software.html"&gt;an
interesting war story&lt;/a&gt;
in which a reviewer refuses to review a paper until he can run the code
on his own files. Titus’s comment boils down to:&lt;/p&gt;
&lt;blockquote&gt;
&lt;blockquote class="epigraph"&gt;
&lt;a class="reference external" href="http://ivory.idyll.org/blog/2015-how-should-we-think-about-research-software.html"&gt;“Please destroy this software after publication”&lt;/a&gt;.&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Reproducible science: Does the emperor have clothes?&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In other words, code for a publication is often not reusable. This
point of view is very interesting coming from someone like Titus, who is a
&lt;a class="reference external" href="http://ivory.idyll.org/blog/a-conversation-on-reproducibility.html"&gt;vocal proponent&lt;/a&gt; of
reproducible science. His words surprised some people, which led Titus
to wonder whether &lt;a class="reference external" href="http://ivory.idyll.org/blog/2015-we-live-in-a-bubble.html"&gt;some of the reproducible-science crowd live in a
bubble&lt;/a&gt;. I
was happy to see &lt;a class="reference external" href="https://twitter.com/ctitusbrown/status/589171853031186434"&gt;the discussion&lt;/a&gt; unroll, as
I think that there is a strong risk of creating a bubble around
reproducible science. Such a bubble will backfire.&lt;/p&gt;
&lt;!-- Let me share my point of view on software for reproducible science. --&gt;
&lt;div class="section" id="replication-is-a-must-for-science-and-society"&gt;
&lt;h2&gt;Replication is a must for science and society&lt;/h2&gt;
&lt;p&gt;Science advances by accumulating knowledge built upon
observations. It’s easy to forget that these observations, and the
corresponding paradigmatic conclusions, are not always as simple to
establish as the fact that hot air rises: &lt;strong&gt;replicating the
scientific process many times transforms evidence into truth&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;One striking example of scientific replication is &lt;a class="reference external" href="http://www.nature.com/news/first-results-from-psychology-s-largest-reproducibility-test-1.17433"&gt;the ongoing effort in
psychology&lt;/a&gt;
to replay the evidence behind well-accepted findings central to
current lines of thought in psychological science. It involves setting up
the experiments according to the seminal publications, acquiring the
data, and processing it to reach the same conclusions. Surprisingly,
not everything that was taken for granted holds.&lt;/p&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Findings later discredited backed economic policy&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Another example, with massive consequences for Joe Average’s everyday life, is
the failed replication of Reinhart and Rogoff’s &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Growth_in_a_Time_of_Debt"&gt;“Growth in a Time of
Debt”&lt;/a&gt;
publication. The original paper, published in 2010 in the American
Economic Review, claimed empirical findings linking high public debt
to stalled GDP growth. In a context of economic crisis, it was used
by policy makers as a justification for restricted public spending.
However, while pursuing a mere homework assignment to replicate these
findings, &lt;a class="reference external" href="http://www.bbc.com/news/magazine-22223190"&gt;a student uncovered methodological flaws in the paper&lt;/a&gt;. Understanding the
&lt;a class="reference external" href="http://www.nextnewdeal.net/rortybomb/researchers-finally-replicated-reinhart-rogoff-and-there-are-serious-problems"&gt;limitations&lt;/a&gt;
of the original study took a while, and &lt;strong&gt;discredited the academic
backing of the economic doctrine of austerity&lt;/strong&gt;. Critically, this
analysis of the publication was possible only because Reinhart and Rogoff
&lt;strong&gt;released their spreadsheet, with data and analysis details&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sharing-code-can-make-science-reproducible"&gt;
&lt;h2&gt;Sharing code can make science reproducible&lt;/h2&gt;
&lt;p&gt;A great example of sharing code to make a publication reproducible is the
recent paper on &lt;a class="reference external" href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0126255"&gt;orthogonalization of regressors in fMRI models&lt;/a&gt;,
by Mumford, Poline and Poldrack. The paper is a didactic refutation
of non-justified data processing practices. The authors made their
point much stronger by giving &lt;a class="reference external" href="http://nbviewer.ipython.org/github/jmumford/orthogonalizaton_ipynb/blob/master/orthogonalization.ipynb"&gt;an IPython notebook&lt;/a&gt;
to reproduce their figures. The recipe works perfectly here, because the
ideas underlying the publication are simple and can be illustrated on
synthetic data with relatively inexpensive computation. A short IPython
notebook is all it takes to convince the reader.&lt;/p&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Sharing complex code… chances are it won’t run on new data.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;At the other end of the spectrum, a complex analysis pipeline will not be
as easy to share. For instance, a feat of strength such as Miyawaki &lt;em&gt;et
al&lt;/em&gt;’s &lt;a class="reference external" href="http://www.cell.com/neuron/abstract/S0896-6273%2808%2900958-6"&gt;visual image
reconstruction from brain activity&lt;/a&gt;
requires complex statistical signal processing to extract weak
signatures. Miyawaki &lt;em&gt;et al&lt;/em&gt; shared the data. They might share the code, but
it would be a large chunk of code, probably fragile to changes in the
environment (Matlab version, OS…). Chances are that it wouldn’t run on
new data. This is the scenario that prompted Titus’s words:&lt;/p&gt;
&lt;blockquote&gt;
&lt;blockquote class="epigraph"&gt;
&lt;a class="reference external" href="http://ivory.idyll.org/blog/2015-how-should-we-think-about-research-software.html"&gt;“Please destroy this software after publication”&lt;/a&gt;.&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have good news: you can reproduce Miyawaki’s work with &lt;a class="reference external" href="http://nilearn.github.io/auto_examples/decoding/plot_miyawaki_reconstruction.html"&gt;an example&lt;/a&gt;
in &lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt;, a library for
machine learning on brain images. The example itself is concise,
readable and it reliably produces figures close to that of the paper.&lt;/p&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Maintained libraries make feats of strength routinely
reproducible.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;This easy replication is only possible because &lt;strong&gt;the corresponding code
leverages a set of libraries that encapsulate the main steps of the
analysis&lt;/strong&gt;, mainly &lt;a class="reference external" href="http://scikit-learn.org/stable/"&gt;scikit-learn&lt;/a&gt; and
&lt;a class="reference external" href="http://nilearn.github.io"&gt;nilearn&lt;/a&gt; here. These libraries are
&lt;a class="reference external" href="https://travis-ci.org/nilearn/nilearn"&gt;tested&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/nilearn/nilearn/issues?q=is%3Aissue+is%3Aclosed"&gt;maintained&lt;/a&gt;
and &lt;a class="reference external" href="http://gael-varoquaux.info/programming/scikit-learn-015-release-highlights.html"&gt;released&lt;/a&gt;.
They enable us to go from a feat of strength to routine replication.&lt;/p&gt;
&lt;!-- * An example of non-reproducible research (my ICML paper) --&gt;
&lt;!-- Can research be up to the software engineering challenge? --&gt;
&lt;/div&gt;
&lt;div class="section" id="reproducibility-is-not-sustainable-for-everything"&gt;
&lt;h2&gt;Reproducibility is not sustainable for everything&lt;/h2&gt;
&lt;!-- Things are not always that easy

It's not you, it's me

Nobody said it was easy

Living up to the promise? --&gt;
&lt;blockquote class="epigraph"&gt;
Thinking is easy, acting is difficult &amp;nbsp; &amp;nbsp; &amp;nbsp;
—  &amp;nbsp; &amp;nbsp; &amp;nbsp;  &lt;em&gt;Goethe&lt;/em&gt;&lt;/blockquote&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Keeping a physics apparatus running for replication years later?&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I started my scientific career doing physics, and fairly &lt;a class="reference external" href="http://gael-varoquaux.info/science/general-relativity-quantum-physics-freely-falling-planes-and-bayesian-statistics.html"&gt;“heavy” physics&lt;/a&gt;:
vacuum systems, lasers, free-falling airplanes. In such settings, the
cost of maintaining an experiment is apparent to the layman. No-one is
expected to keep an apparatus running for replication years later. The
pinnacle of reproducible research is when the work becomes doable in a
student lab. Such progress is often supported by improved
technology, driven by wider applications of the findings.&lt;/p&gt;
&lt;p&gt;However, not every experiment will give rise to a student lab.
Replicating the others will not be easy. Even if the instruments are
still around the lab, they will require setting up, adjusting and wiring.
And chances are that connectors or cables will be missing.&lt;/p&gt;
&lt;p&gt;Software is no different. Storing and sharing it is cheaper. But
technology evolves very fast. Every setup is different. Code for a
scientific paper has seldom been built for easy maintenance: no
tests, a profusion of exotic dependencies, nonexistent documentation.
Robustness, portability and isolation would be desirable, but they are
difficult and costly to achieve.&lt;/p&gt;
&lt;p&gt;Software developers know that understanding the constraints to design a
good program requires writing a prototype. &lt;strong&gt;Code for a scientific paper
is very much a prototype&lt;/strong&gt;: it’s a first version of an idea, that proves
its feasibility. Common sense in software engineering says that
&lt;a class="reference external" href="http://blog.codinghorror.com/the-prototype-pitfall/"&gt;prototypes are designed to be thrown away&lt;/a&gt;. Prototype code
is fragile. It’s untested, and probably buggy for some uses. Releasing
prototypes amounts to distributing semi-functioning code. This is the
case for most code accompanying a publication, and it is to be expected
given the very nature of research: exploration and prototyping &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;!-- Quality scientific software require making choices --&gt;
&lt;!-- Doing less, better --&gt;
&lt;!-- Quality scientific software, only for a happy few --&gt;
&lt;/div&gt;
&lt;div class="section" id="no-success-without-quality"&gt;
&lt;h2&gt;No success without quality, …&lt;/h2&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Highly-reliable is more useful than state-of-the-art.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;My experience with scientific code has taught me that success requires
quality. Having a good implementation of simple, well-known methods
seems to matter more than doing something fancy. This is what the
success of scikit-learn has taught us: we are really providing classic
“old” machine learning methods, but with a good API, good docs,
computational performance, and stable numerics controlled by stringent
tests. There exist plenty of more sophisticated machine-learning
methods, including some that I have developed specifically for my data.
Yet, I find myself advising my co-workers to use the methods in
scikit-learn, because I know that the implementation is reliable and that
they will be able to use them &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This quality is indeed central to doing science with code. What good is a
data analysis pipeline if it crashes when I fiddle with the data? How can
I draw conclusions from simulations if I cannot change their parameters?
As soon as I need to trust code supporting a scientific
finding, I find myself tinkering with its input, and often breaking it.
Good scientific code is code that can be reused, that can lead to
large-scale experiments validating its underlying assumptions.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;a class="reference external image-reference" href="https://twitter.com/divineomega/status/576165762911608833"&gt;&lt;img alt="" src="../programming/attachments/sqlite_code.png" /&gt;&lt;/a&gt;
&lt;p class="caption"&gt;Sqlite is so much used that its developers have been woken up at
night by users.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;You might say that I am putting the bar too high; that slightly buggy
code is more useful than no code. But I frown at the idea of releasing
code for which I am unable to do proper quality assurance. I may have
done too much of that in the past. And because I am a prolific coder, many
people are using code that has been through my hands. My mailbox looks
like a battlefield, and when I go to the coffee machine I find myself
answering questions.&lt;/p&gt;
&lt;!-- Pour vivre heureux, vivons cachés.
http://en.wikipedia.org/wiki/Jean-Pierre_Claris_de_Florian --&gt;
&lt;/div&gt;
&lt;div class="section" id="and-making-difficult-choices"&gt;
&lt;h2&gt;… and making difficult choices&lt;/h2&gt;
&lt;!-- diminishing returns --&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Craftsmanship is about trade-offs&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Achieving quality requires making choices. Not only because time
is limited, but also because the difficulty of maintaining and improving a
codebase grows much faster than the number of features &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;. This
phenomenon is actually frightening to watch: adding a feature to
scikit-learn these days is much harder than it used to be in
the early days. Interactions between features are a killer: when you
modify something, something else unrelated breaks. For a given
functionality, &lt;strong&gt;nothing makes the code more incomprehensible than
cyclomatic complexity&lt;/strong&gt;: the multiplicity of branching, if/then clauses and
for loops. This complexity naturally appears when supporting different
input types, or minor variants of the same method.&lt;/p&gt;
&lt;p&gt;The consequence is that ensuring quality for many variants of a method is
prohibitive. This limit is a real problem for reproducible
science, as science builds upon comparing and opposing models. However,
ignoring it simply leads to code that fails to do what it claims to do.
What this tells us is that if we really aim at long-term
reproducibility, we &lt;strong&gt;need to identify successful and important research
and focus our efforts on it&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If you agree with my earlier point that the code of a publication is
a prototype, this iterative process seems natural. Various ideas
can be thought of as competing prototypes. Some will not lead to
publication at all, while others will end up having a high impact.
Knowing beforehand is impossible. Focusing too early on achieving high
quality is counterproductive. What matters is &lt;strong&gt;progressively
consolidating the code&lt;/strong&gt;.&lt;/p&gt;
&lt;!-- XXX rephrase the above to avoid 'what matters'? --&gt;
&lt;!-- I am sorry to say that my publications are not based on code with 90% test coverage. --&gt;
&lt;!-- say that my methods in machine learning will probably never make it to
scikit-learn --&gt;
&lt;/div&gt;
&lt;div class="section" id="reproducible-science-a-rich-trade-off-space"&gt;
&lt;h2&gt;Reproducible science, a rich trade-off space&lt;/h2&gt;
&lt;div class="admonition align-right note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Verbatim replication or reuse?&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Does Reinhart and Rogoff’s &lt;em&gt;“Growth in a Time of Debt”&lt;/em&gt; paper face the
same challenges as the manuscript under review by Titus? One is
describing mechanisms while the other is introducing a method. The code
of the former is probably much simpler than that of the latter. Different
publications come with different goals and code that is more or less easy
to share. For verbatim replication of the analysis of a paper, a simple
IPython notebook without tests or API is enough. To go beyond requires
applying the analysis to different problems or data: reuse. Reuse is
very difficult and cannot be a requirement for all publications.&lt;/p&gt;
&lt;!-- As someone who spends a lot of time on method development, I think a lot
in terms of code reuse. On the contrary, --&gt;
&lt;p&gt;Conventional wisdom in academia is that science builds upon ideas and
concepts rather than methods and code. Galileo is known for his
contribution to our understanding of the cosmos. Yet, methods
development underpins science. Galileo also built his own
telescopes, a huge technical achievement at the time. He needed to develop
them to back his cosmological theories. Today, Galileo’s measurements are
easy to reproduce because telescopes are readily-available as consumer
products.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;blockquote class="epigraph"&gt;
Standing on the shoulders of giants &amp;nbsp; &amp;nbsp; —  &amp;nbsp; &amp;nbsp;
&lt;em&gt;Isaac Newton, on software libraries&lt;/em&gt;&lt;/blockquote&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Related posts&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="../science/publishing_scientific_software_matters.html"&gt;Publishing scientific software matters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="views_on_scientific_computing.html"&gt;Personal views on scientific computing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!-- With great powers come great responsibility --&gt;
&lt;!-- Some publications, including computational ones, strive to contribute an idea. --&gt;
&lt;!-- The way I understand Titus's
phrase *"Please destroy this software after publication"* is that some
methods publication --&gt;
&lt;!-- Is the output of a paper the idea, or the code? It depends? (example of
the ICML) --&gt;
&lt;!-- Different code complexity, different trade-off (loops back to the point
above with Poldrack) --&gt;
&lt;!-- XXX: need to point to the donoho paper and cite it --&gt;
&lt;!-- Recommendations (in a separate blog post?):

* What the difficulties are (evolving APIs, plus configuration problems)
  (skip this point?)

* don't publish method work on non open data (very restrictive, I have
  been criticized for working on 'old', 'uninteresting' data). --&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;To make my point very clear, releasing buggy untested code is not
a good thing. However, it is not possible to ask for all research
papers to come with industrial-quality code. I am trying here to push
for a collective, reasoned undertaking of consolidation.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Theory tells us that there is there is no universal machine
learning algorithm. Given a specific machine-learning application, it
is always possible to devise a custom strategy that out-performs a
generic one. However, &lt;a class="reference external" href="http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf"&gt;do we need hundreds of classifiers to solve
real world classification problems?&lt;/a&gt;
Empirical results &lt;a class="reference external" href="http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf"&gt;[Delgado 2014]&lt;/a&gt; show
that most of the benefits can be achieved with a small number of
strategies. Is it desirable and sustainable to distribute and keep
alive the code of every machine learning paper?&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Empirical studies on the workload for programmers to achieve a
given task showed that 25 percent increase in problem complexity results in
a 100 percent increase in programming complexity: &lt;a class="reference external" href="http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F32%2F35909%2F01702600.pdf%3Farnumber%3D1702600&amp;amp;authDecision=-203"&gt;An Experiment on
Unit increase in Problem Complexity, Woodfield 1979&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p class="small"&gt;I need to thank my colleague &lt;a class="reference external" href="http://multiplecomparisons.blogspot.fr"&gt;Chris Filo Gorgolewski&lt;/a&gt; and my sister &lt;a class="reference external" href="http://cbio.ensmp.fr/~nvaroquaux/"&gt;Nelle
Varoquaux&lt;/a&gt; for their
feedback on this note.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="science"></category><category term="software"></category><category term="reproducible research"></category><category term="scientific software"></category></entry><entry><title>MLOSS: machine learning open source software workshop @ ICML 2015</title><link href="https://gael-varoquaux.info/programming/mloss-machine-learning-open-source-software-workshop-icml-2015.html" rel="alternate"></link><published>2015-04-23T00:00:00+02:00</published><updated>2015-04-23T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-04-23:/programming/mloss-machine-learning-open-source-software-workshop-icml-2015.html</id><summary type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This year again we will have an exciting workshop on the
leading-edge machine-learning open-source software. This subject is
central to many, because software is how we propagate, reuse, and
apply progress in machine learning.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;strong&gt;Want to present a project? The deadline for the call for papers is
Apr 28th …&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This year again we will have an exciting workshop on the
leading-edge machine-learning open-source software. This subject is
central to many, because software is how we propagate, reuse, and
apply progress in machine learning.&lt;/em&gt;&lt;/p&gt;
&lt;p class="last"&gt;&lt;strong&gt;Want to present a project? The deadline for the call for papers is
Apr 28th, in a few days&lt;/strong&gt;
: &lt;a class="reference external" href="http://mloss.org/workshop/icml15/"&gt;http://mloss.org/workshop/icml15/&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The workshop will be held at the &lt;a class="reference external" href="http://icml.cc/2015/"&gt;ICML conference&lt;/a&gt;, in Lille, France, on July 10th. ICML
–International Conference in Machine Learning– is the leading venue for
academic research in machine learning. It’s a fantastic place to hold
such a workshop, as the actors of theoretical progress are all around.
Software is the bridge that brings this progress beyond papers.&lt;/p&gt;
&lt;p&gt;There is a &lt;a class="reference external" href="http://mloss.org/workshop/"&gt;long tradition&lt;/a&gt; of MLOSS
workshop, with one every year and a half. Last time, at NIPS 2013, I
could feel a bit of a turning point, as people started feeling that
different software slotted together, to create an efficient and
state-of-the art working environment. For this reason, we have entitled
this year’s workshop ‘open ecosystems’, stressing that contributions in
the scope of the workshop, that build a thriving work environment, are
not only machine learning software, but also better statistics or
numerical tools.&lt;/p&gt;
&lt;p&gt;We have two keynotes with important contributions to such ecosystems:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.johnmyleswhite.com/"&gt;John Myles White&lt;/a&gt; (Facebook), lead
developer of Julia statistics and machine learning: “Julia for machine
learning: high-level syntax with compiled-code speed”&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://matthewrocklin.com"&gt;Matthew Rocklin&lt;/a&gt; (Continuum Analytics),
developer of Python computational tools, in particular Blaze (confirmed):
“Blaze, a modern numerical engine with out-of-core and out-of-order
computations”.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There will also be a practical presentation, by yours truly, on how to set up an
open-source project, covering hosting, community development, quality
assurance, and license choice.&lt;/p&gt;
</content><category term="programming"></category><category term="conferences"></category><category term="machine learning"></category><category term="scientific computing"></category><category term="scipy"></category></entry><entry><title>Job offer: working on open source data processing in Python</title><link href="https://gael-varoquaux.info/programming/job-offer-working-on-open-source-data-processing-in-python.html" rel="alternate"></link><published>2015-04-02T00:00:00+02:00</published><updated>2015-04-02T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-04-02:/programming/job-offer-working-on-open-source-data-processing-in-python.html</id><summary type="html">&lt;p&gt;We, &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Parietal team&lt;/a&gt; at &lt;a class="reference external" href="http://www.inria.fr/"&gt;INRIA&lt;/a&gt;, are recruiting software developers to work on
open source machine learning and neuroimaging software in Python.&lt;/p&gt;
&lt;p&gt;In general, we are looking for people who:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;have a mathematical mindset,&lt;/li&gt;
&lt;li&gt;are curious about data (ie like looking at data and understanding it)&lt;/li&gt;
&lt;li&gt;have an affinity for problem-solving …&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;We, &lt;a class="reference external" href="https://team.inria.fr/parietal/"&gt;Parietal team&lt;/a&gt; at &lt;a class="reference external" href="http://www.inria.fr/"&gt;INRIA&lt;/a&gt;, are recruiting software developers to work on
open source machine learning and neuroimaging software in Python.&lt;/p&gt;
&lt;p&gt;In general, we are looking for people who:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;have a mathematical mindset,&lt;/li&gt;
&lt;li&gt;are curious about data (ie like looking at data and understanding it)&lt;/li&gt;
&lt;li&gt;have an affinity for problem-solving tradeoffs&lt;/li&gt;
&lt;li&gt;love high-quality code&lt;/li&gt;
&lt;li&gt;worry about users&lt;/li&gt;
&lt;li&gt;are good scientific Python coders,&lt;/li&gt;
&lt;li&gt;enjoy interacting with a community of developers&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;We welcome candidates who do not have all of these skills but are strongly
motivated to acquire them. Prior open-source experience is a big plus.&lt;/p&gt;
&lt;p&gt;One example of such a position, with application to neuroimaging, is:
&lt;a class="reference external" href="http://gael-varoquaux.info/programming/hiring-a-programmer-for-a-brain-imaging-machine-learning-library.html"&gt;http://gael-varoquaux.info/programming/hiring-a-programmer-for-a-brain-imaging-machine-learning-library.html&lt;/a&gt;,
which was opened a year ago and has now resulted in nilearn:
&lt;a class="reference external" href="http://nilearn.github.io/"&gt;http://nilearn.github.io/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Other positions may be more focused on general machine learning or
computing tools such as scikit-learn and joblib, which are reference
open-source libraries for data processing in Python.&lt;/p&gt;
&lt;p&gt;We are a tightly knit team, with strong programming, data
analysis and neuroimaging skills.&lt;/p&gt;
&lt;p&gt;Please contact me and Olivier Grisel if you are interested.&lt;/p&gt;
</content><category term="programming"></category><category term="jobs"></category><category term="machine learning"></category><category term="neuroimaging"></category><category term="python"></category></entry><entry><title>Euroscipy 2015: Call for paper</title><link href="https://gael-varoquaux.info/programming/euroscipy-2015-call-for-paper.html" rel="alternate"></link><published>2015-03-28T00:00:00+01:00</published><updated>2015-03-28T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-03-28:/programming/euroscipy-2015-call-for-paper.html</id><summary type="html">&lt;p&gt;EuroScipy 2015, the annual conference on Python in science will take
place in Cambridge, UK on 26-30 August 2015. The conference features two
days of tutorials followed by two days of scientific talks &amp;amp; posters and
an extra day dedicated to developer sprints. It is the major event in
Europe in …&lt;/p&gt;</summary><content type="html">&lt;p&gt;EuroScipy 2015, the annual conference on Python in science, will take
place in Cambridge, UK on 26-30 August 2015. The conference features two
days of tutorials followed by two days of scientific talks &amp;amp; posters and
an extra day dedicated to developer sprints. It is the major event in
Europe in the field of technical/scientific computing within the Python
ecosystem. Scientists, PhDs, students, data scientists, analysts, and
quants from more than 20 countries attended the conference last year.&lt;/p&gt;
&lt;p&gt;The topics presented at EuroSciPy are very diverse, with a focus on advanced
software engineering and original uses of Python and its scientific libraries,
either in theoretical or experimental research, from both academia and the
industry.&lt;/p&gt;
&lt;p&gt;Submissions for posters, talks &amp;amp; tutorials (beginner and advanced) are welcome
on our website at &lt;a class="reference external" href="http://www.euroscipy.org/2015/"&gt;http://www.euroscipy.org/2015/&lt;/a&gt;.
Sprint proposals should be addressed directly to the organisation at
&lt;em&gt;euroscipy-org&amp;#64;python.org&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important dates&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;em&gt;Apr 30, 2015&lt;/em&gt; Talk and tutorials submission deadline&lt;/li&gt;
&lt;li&gt;&lt;em&gt;May 1, 2015&lt;/em&gt; Registration opens&lt;/li&gt;
&lt;li&gt;&lt;em&gt;May 30, 2015&lt;/em&gt; Final program announced&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Jun 15, 2015&lt;/em&gt; Early-bird registration ends&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Aug 26-27, 2015&lt;/em&gt; Tutorials&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Aug 28-29, 2015&lt;/em&gt; Main conference&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Aug 30, 2015&lt;/em&gt; Sprints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We look forward to an exciting conference and hope to see you in Cambridge.&lt;/p&gt;
&lt;p&gt;The EuroSciPy 2015 Team - &lt;a class="reference external" href="http://www.euroscipy.org/2015/"&gt;http://www.euroscipy.org/2015/&lt;/a&gt;&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="conferences"></category></entry><entry><title>PRNI 2016: call for organization</title><link href="https://gael-varoquaux.info/programming/prni-2016-call-for-organization.html" rel="alternate"></link><published>2015-01-01T00:00:00+01:00</published><updated>2015-01-01T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2015-01-01:/programming/prni-2016-call-for-organization.html</id><summary type="html">&lt;p class="first last"&gt;The steering committee of PRNI (Pattern Recognition for NeuroImaging) is opening a call for bid to organize the conference in June 2016, in Europe&lt;/p&gt;
</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="http://www.prni.org"&gt;PRNI (Pattern Recognition for NeuroImaging)&lt;/a&gt; is
an IEEE conference about applying pattern recognition and machine
learning to brain imaging. It is a mid-sized conference (about 150
attendees), and is a satellite of OHBM (the annual “Human Brain Mapping”
meeting).&lt;/p&gt;
&lt;p&gt;The steering committee is calling for bids to organize the conference in
June 2016, in Europe, as a satellite of the OHBM meeting in Geneva.&lt;/p&gt;
</content><category term="programming"></category><category term="neuroimaging"></category><category term="conferences"></category><category term="science"></category><category term="machine learning"></category></entry><entry><title>New website</title><link href="https://gael-varoquaux.info/misc/new-website.html" rel="alternate"></link><published>2014-10-09T00:00:00+02:00</published><updated>2014-10-09T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-10-09:/misc/new-website.html</id><summary type="html">&lt;p&gt;I am moving my website to a new design, relying on &lt;a class="reference external" href="http://blog.getpelican.com/"&gt;Pelican&lt;/a&gt; and more modern CSS.&lt;/p&gt;
&lt;p&gt;So far, I had been using &lt;a class="reference external" href="http://www.voidspace.org.uk/python/rest2web/"&gt;rest2web&lt;/a&gt; to generate the static
part of the website, and a local install of wordpress for the blog. I
wasn’t doing a good job of keeping the wordpress install …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I am moving my website to a new design, relying on &lt;a class="reference external" href="http://blog.getpelican.com/"&gt;Pelican&lt;/a&gt; and more modern CSS.&lt;/p&gt;
&lt;p&gt;So far, I had been using &lt;a class="reference external" href="http://www.voidspace.org.uk/python/rest2web/"&gt;rest2web&lt;/a&gt; to generate the static
part of the website, and a local install of wordpress for the blog. I
wasn’t doing a good job of keeping the wordpress install up to date, and I
eventually got hacked. Needing a dynamic website hurt my desire for
simplicity. The combination of &lt;a class="reference external" href="http://blog.getpelican.com/"&gt;Pelican&lt;/a&gt; for my content and &lt;a class="reference external" href="https://disqus.com/"&gt;Disqus&lt;/a&gt; suits my needs very well, as it enables me to have
a simpler website while still having blog posts and discussions.&lt;/p&gt;
&lt;p&gt;I also took the opportunity to clean up the website, remove some old
content, and move my travel pictures to
&lt;a class="reference external" href="https://www.flickr.com/photos/gaelvaroquaux/"&gt;flickr&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="technical-choices"&gt;
&lt;h2&gt;Technical choices&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="http://blog.getpelican.com/"&gt;Pelican&lt;/a&gt; for the core engine. I like
the fact that it generates a static blog, that it uses restructured
text to store the content, and &lt;a class="reference external" href="http://jinja.pocoo.org"&gt;jinja&lt;/a&gt; as a
templating engine.&lt;/p&gt;
&lt;p&gt;One interesting aspect of redoing my website with a more modern content
management system was that I could lay out the information based on tags
and categories, rather than the old way of having a tree of nested
topics. This is much more flexible because one article is likely to
fall in many topics. Modern information organization is moving away
from the notion of a path used to access an entry, to the notion of a
set of properties (tags here).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="http://purecss.io"&gt;Pure CSS&lt;/a&gt; as a CSS base layer. I chose to use
Pure CSS rather than &lt;a class="reference external" href="http://getbootstrap.com/"&gt;Bootstrap&lt;/a&gt; as it is a
pure CSS framework (no javascript) and it is much lighter. I find that
Bootstrap websites can easily slow down browsing (due to download size
and javascript). Bootstrap also does not play very well with html documents
in which one doesn’t control the class tags, such as those generated from
restructured text. But that’s true of most web front-end frameworks.
Another option was &lt;a class="reference external" href="http://foundation.zurb.com/"&gt;Foundation&lt;/a&gt;. I
didn’t explore it in detail, but it looked like an interesting
tradeoff between Pure, which is very bare bones, and Bootstrap, the
heavy lifter. I chose to go for the most lightweight option, because I
had simple needs.&lt;/p&gt;
&lt;p&gt;A result of using more modern CSS is that the website should look good
on any screen size, from very large screens to mobile phones.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="web"></category></entry><entry><title>Improving your programming style in Python</title><link href="https://gael-varoquaux.info/programming/improving-your-programming-style-in-python.html" rel="alternate"></link><published>2014-09-29T00:00:00+02:00</published><updated>2014-09-29T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-09-29:/programming/improving-your-programming-style-in-python.html</id><summary type="html">&lt;p class="first last"&gt;Some references on software development techniques and patterns to help write better code.&lt;/p&gt;
</summary><content type="html">&lt;p&gt;Here are some references on software development techniques and patterns
to help write better code. They are intended for the casual programmer,
and certainly not for the advanced developer.&lt;/p&gt;
&lt;p&gt;They are listed in order of difficulty.&lt;/p&gt;
&lt;div class="section" id="software-carpentry"&gt;
&lt;h2&gt;Software carpentry&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://swc.scipy.org"&gt;http://swc.scipy.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These are the original notes from Greg Wilson’s course on software
engineering at the University of Toronto. This course is specifically
intended for scientists rather than computer-science students. It is very
basic and does not cover design issues.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="a-tutorial-introduction-to-python"&gt;
&lt;h2&gt;A tutorial introduction to Python&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.informit.com/articles/article.asp?p=23100&amp;amp;seqNum=3&amp;amp;rl=1"&gt;http://www.informit.com/articles/article.asp?p=23100&amp;amp;seqNum=3&amp;amp;rl=1&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This tutorial is easier to follow than &lt;a class="reference external" href="http://www.python.org/doc/"&gt;Guido’s tutorial&lt;/a&gt;, though it does not go into as much depth.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="python-essential-reference"&gt;
&lt;h2&gt;Python Essential Reference&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.informit.com/articles/article.asp?p=453682&amp;amp;rl=1"&gt;http://www.informit.com/articles/article.asp?p=453682&amp;amp;rl=1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.informit.com/articles/article.asp?p=459269&amp;amp;rl=1"&gt;http://www.informit.com/articles/article.asp?p=459269&amp;amp;rl=1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;These are two chapters out of David Beazley’s excellent book &lt;a class="reference external" href="http://www.amazon.com/Python-Essential-Reference-David-Beazley/dp/0735710910"&gt;Python
Essential Reference&lt;/a&gt;.
They allow one to understand more deeply how Python works. I strongly recommend
this book to anybody serious about python.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="an-introduction-to-regular-expressions"&gt;
&lt;h2&gt;An Introduction to Regular Expressions&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.informit.com/articles/article.asp?p=20454&amp;amp;rl=1"&gt;http://www.informit.com/articles/article.asp?p=20454&amp;amp;rl=1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you are going to do any sort of text manipulation, you absolutely need
to know how to use regular expressions: powerful search and replace patterns.&lt;/p&gt;
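As a minimal, hypothetical illustration of such a search-and-replace pattern (my own example, not taken from the article):

```python
import re

# Capture ISO dates like "2015-04-30" and rewrite them as "30/04/2015"
text = "Talk submissions close on 2015-04-30."
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
# reformatted == "Talk submissions close on 30/04/2015."
```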
&lt;/div&gt;
&lt;div class="section" id="software-design-for-maintainability"&gt;
&lt;h2&gt;Software design for maintainability&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="./software-design-for-maintainability.html"&gt;My own post&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A case of shameless plug: this is a post that I wrote a few years ago. I
think that it is still relevant.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="writing-a-graphical-application-for-scientific-programming-using-traitsui"&gt;
&lt;h2&gt;Writing a graphical application for scientific programming using TraitsUI&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://gael-varoquaux.info/computers/traits_tutorial/index.html"&gt;http://gael-varoquaux.info/computers/traits_tutorial/index.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Building interactive graphical applications is a difficult problem. I have
found that the traitsUI module provides a great answer to this problem.
This is a tutorial intended for the non-programmer.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="an-introduction-to-python-iterators"&gt;
&lt;h2&gt;An introduction to Python iterators&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.informit.com/articles/article.asp?p=26895&amp;amp;rl=1"&gt;http://www.informit.com/articles/article.asp?p=26895&amp;amp;rl=1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This article may not be terribly easy to follow, but iterators are a
great feature of Python, so this is definitely worth reading.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="functional-programming"&gt;
&lt;h2&gt;Functional programming&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.ibm.com/developerworks/linux/library/l-prog.html?open&amp;amp;l=766,t=gr,p=PrmgPyth"&gt;http://www.ibm.com/developerworks/linux/library/l-prog.html?open&amp;amp;l=766,t=gr,p=PrmgPyth&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Functional programming is a programming style where mathematical
functions are successively applied to immutable objects to go from the
inputs of the program to its outputs in a succession of transformations.
It is appreciated by some because such programs are easy to analyze and
prove correct. In certain cases it can be very readable.&lt;/p&gt;
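A minimal sketch of this style (an illustrative example of mine, not from the referenced article): pure functions chained over immutable tuples, with no in-place mutation.

```python
# Each function returns a new immutable tuple instead of mutating its input
def normalize(words):
    return tuple(w.lower() for w in words)

def keep_long(words, min_len=4):
    return tuple(w for w in words if len(w) >= min_len)

# The program is a succession of transformations from input to output
result = keep_long(normalize(("Python", "is", "Readable")))
# result == ("python", "readable")
```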
&lt;/div&gt;
&lt;div class="section" id="patterns-in-python"&gt;
&lt;h2&gt;Patterns in Python&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.suttoncourtenay.org.uk/duncan/accu/pythonpatterns.html"&gt;http://www.suttoncourtenay.org.uk/duncan/accu/pythonpatterns.html&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This document exposes a few design patterns in Python. Design patterns
are solutions to recurring development problems using object oriented
programming. I suggest this reading only if you are familiar with OOP.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="idiomatic-python"&gt;
&lt;h2&gt;Idiomatic Python&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;Jeff Knupp’s post, a summary of his book:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.jeffknupp.com/blog/2012/10/04/writing-idiomatic-python/"&gt;http://www.jeffknupp.com/blog/2012/10/04/writing-idiomatic-python/&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The &lt;a class="reference external" href="https://scipy-lectures.github.io"&gt;scipy-lectures&lt;/a&gt; chapter on
advanced Python:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://scipy-lectures.github.io/advanced/advanced_python/index.html"&gt;https://scipy-lectures.github.io/advanced/advanced_python/index.html&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="general-object-oriented-programming-advice"&gt;
&lt;h2&gt;General Object-Oriented programming advice&lt;/h2&gt;
&lt;p&gt;Designing Object-oriented code actually requires some care: when you are
building your set of abstractions, you are designing the world in which
you are going to be condemned to live (or actually, to code). I would
advise people to keep things as simple as possible, and follow the SOLID
principles:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://mmiika.wordpress.com/oo-design-principles/"&gt;http://mmiika.wordpress.com/oo-design-principles/&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="using-decorators-to-do-meta-programming-in-python"&gt;
&lt;h2&gt;Using decorators to do meta-programming in Python&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www-128.ibm.com/developerworks/linux/library/l-cpdecor.html"&gt;http://www-128.ibm.com/developerworks/linux/library/l-cpdecor.html&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A very beautiful article for the advanced python user. Meta-programming
is a programming technique that involves changing the program at
run time. This makes it possible to add new abstractions to the code the
programmer writes, thus creating a “meta-language”. This article shows
this very well.&lt;/p&gt;
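As a small hand-written sketch of the idea (the `logged` decorator below is my own illustration, not from the article): a decorator rewrites a function at definition time, adding behaviour the original code never mentions.

```python
import functools

def logged(func):
    """Wrap func so that every call is counted: a tiny piece of meta-programming."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@logged
def square(x):
    return x * x

square(3)
square(4)
# square.calls == 2, and square(3) still returns 9
```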
&lt;/div&gt;
&lt;div class="section" id="a-primer-on-python-metaclass-programming"&gt;
&lt;h2&gt;A Primer on Python Metaclass Programming&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.onlamp.com/lpt/a/3388"&gt;http://www.onlamp.com/lpt/a/3388&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Metaclasses make it possible to define new styles of objects, which can
have different calling, creation, or inheritance rules. This is way over my head, but I
am referencing it here for the record.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="iterators-in-python"&gt;
&lt;h2&gt;Iterators in Python&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://docs.python.org/2/library/itertools.html#recipes"&gt;https://docs.python.org/2/library/itertools.html#recipes&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Learn to use the itertools (but don’t abuse them)!&lt;/p&gt;
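For instance, the classic `pairwise` recipe from the itertools documentation turns a sequence into its consecutive pairs:

```python
import itertools

def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

pairs = list(pairwise([1, 2, 3, 4]))
# pairs == [(1, 2), (2, 3), (3, 4)]
```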
&lt;p&gt;Related to the producer/consumer problem with iterators, see:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.oluyede.org/blog/2007/04/09/producerconsumer-in-python/"&gt;http://www.oluyede.org/blog/2007/04/09/producerconsumer-in-python/&lt;/a&gt;&lt;/p&gt;
&lt;!-- vim:spell:spelllang=en_us ft=rst --&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="software engineering"></category><category term="selected"></category></entry><entry><title>Hiring an engineer to mine large functional-connectivity databases</title><link href="https://gael-varoquaux.info/programming/hiring-an-engineer-to-mine-large-functional-connectivity-databases.html" rel="alternate"></link><published>2014-09-20T00:00:00+02:00</published><updated>2014-09-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-09-20:/programming/hiring-an-engineer-to-mine-large-functional-connectivity-databases.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;Work with us to leverage leading-edge machine learning for
neuroimaging&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At &lt;a class="reference external" href="https://team.inria.fr/parietal"&gt;Parietal&lt;/a&gt;, my research team,
we work on improving the way brain images are analyzed, for medical
diagnostic purposes, or to understand the brain better. We develop
new machine-learning tools and investigate new methodologies for
quantifying brain function from …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Work with us to leverage leading-edge machine learning for
neuroimaging&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At &lt;a class="reference external" href="https://team.inria.fr/parietal"&gt;Parietal&lt;/a&gt;, my research team,
we work on improving the way brain images are analyzed, for medical
diagnostic purposes, or to understand the brain better. We develop
new machine-learning tools and investigate new methodologies for
quantifying brain function from MRI scans.&lt;/p&gt;
&lt;p&gt;One of our important avenues of contribution is in deciphering “functional
connectivity”: analyzing the correlation of brain activity to measure
interactions across the brain. This direction of research is exciting
because it can be used to probe the neural support of &lt;em&gt;functional&lt;/em&gt;
deficits in incapacitated patients, and thus lead to new biomarkers of
functional pathologies, such as autism. Indeed, functional connectivity
can be computed without resorting to complicated cognitive tasks, unlike
most functional imaging approaches. The flip side is that exploiting such
“resting-state” signal requires advanced multivariate statistics tools,
something at which the Parietal team excels.&lt;/p&gt;
&lt;p&gt;For such multivariate processing of brain imaging data, Parietal has an
ecosystem of &lt;a class="reference external" href="https://team.inria.fr/parietal/software"&gt;leading-edge high-quality tools&lt;/a&gt;. In particular we have built
the foundations of the most successful Python machine learning library,
&lt;a class="reference external" href="http://scikit-learn"&gt;scikit-learn&lt;/a&gt;, and we are growing a dedicate
software, &lt;a class="reference external" href="http://nilearn.github.io/"&gt;nilearn&lt;/a&gt;, that leverages
machine-learning for neuroimaging. To support this ecosystem, we have
dedicated top-notch programmers, led by the well-known
&lt;a class="reference external" href="http://ogrisel.com/"&gt;Olivier Grisel&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We are looking for a data-processing engineer to join our team and work
on &lt;strong&gt;applying our tools on very large neuroimaging databases to
learn specific biomarkers of pathologies&lt;/strong&gt;. For this, the work will be
shared with the &lt;a class="reference external" href="http://www.cati-neuroimaging.com/"&gt;CATI&lt;/a&gt;, the French
platform for multicentric neuroimaging studies, located in the same
building as us. The general context of the job is the &lt;a class="reference external" href="https://team.inria.fr/parietal/research/spatial_patterns/niconnect/"&gt;NiConnect&lt;/a&gt;
project, a multi-organisational research project that I lead and
that focuses on improving diagnostic tools on resting-state functional
connectivity. We have access to unique algorithms and datasets, before
they are published. What we are now missing is the link between those
two, and that link could be you.&lt;/p&gt;
&lt;p&gt;If you want more details, they can be found on the &lt;a class="reference external" href="https://team.inria.fr/parietal/job-offers"&gt;job offer&lt;/a&gt;. This post is to motivate
the job in a personal way that I cannot give in an official posting.&lt;/p&gt;
&lt;div class="section" id="why-take-this-job"&gt;
&lt;h2&gt;Why take this job?&lt;/h2&gt;
&lt;p&gt;I don’t expect someone to take this job only because it pays the bills. To be
clear, the kind of person I am looking for has no difficulty finding a
job elsewhere. So, if you are that person, why would you take the job?&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;To join &lt;a class="reference external" href="https://team.inria.fr/parietal/team-members/"&gt;a great team&lt;/a&gt;
with many experts, focused on finding elegant solutions to hard
problems at the intersection of machine learning, cognitive science,
and software. Choose to work with great people, knowledgeable,
passionate, and &lt;a class="reference external" href="https://team.inria.fr/parietal/inria-winter-party-2014/"&gt;fun&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;To work on interesting problems, that matter. They are interesting
because they are challenging but we have the skills to solve them. They
matter because they can make brain research better.&lt;/li&gt;
&lt;li&gt;To learn. NeuroImaging + Machine learning is a quickly growing topic.
If you come from a NeuroImaging background and want to add to your CV
an actual expertise in machine learning for NeuroImaging, this is the
place to be.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="what-would-make-me-excited-in-a-resume"&gt;
&lt;h2&gt;What would make me excited in a resume?&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;A genuine experience in neuroimaging data processing, especially large
databases.&lt;/li&gt;
&lt;li&gt;Talent with computers and ideally some Python experience.&lt;/li&gt;
&lt;li&gt;The unlikely combination of research training (graduate or
undergraduate) and experience in a non-academic setting.&lt;/li&gt;
&lt;li&gt;A problem-solving mindset.&lt;/li&gt;
&lt;li&gt;A good ability to write about neuroimaging and data processing in
English: who knows, if everything goes to plan, you could very well be
publishing about new biomarkers.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Now if you are interested and feel up for the challenge, read the real
&lt;a class="reference external" href="https://team.inria.fr/parietal/job-offers"&gt;job offer&lt;/a&gt;, and send me
your resume.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="jobs"></category><category term="neuroimaging"></category><category term="python"></category></entry><entry><title>Scikit-learn 2014 sprint: a report</title><link href="https://gael-varoquaux.info/programming/scikit-learn-2014-sprint-a-report.html" rel="alternate"></link><published>2014-07-25T00:00:00+02:00</published><updated>2014-07-25T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-07-25:/programming/scikit-learn-2014-sprint-a-report.html</id><summary type="html">&lt;p&gt;A week ago, the 2014 edition of the
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; sprint was held in Paris.
This was the third time that we held an internation sprint and it was
hugely productive, and great fun, as always.&lt;/p&gt;
&lt;div class="section" id="great-people-and-great-venues"&gt;
&lt;h2&gt;Great people and great venues&lt;/h2&gt;
&lt;img alt="" class="align-center" src="https://pbs.twimg.com/media/BsqD4BeCQAEnT6w.jpg" style="width: 65%;" /&gt;
&lt;p&gt;We had a mix of core contributors and newcomers, which …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;A week ago, the 2014 edition of the
&lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; sprint was held in Paris.
This was the third time that we held an international sprint and it was
hugely productive, and great fun, as always.&lt;/p&gt;
&lt;div class="section" id="great-people-and-great-venues"&gt;
&lt;h2&gt;Great people and great venues&lt;/h2&gt;
&lt;img alt="" class="align-center" src="https://pbs.twimg.com/media/BsqD4BeCQAEnT6w.jpg" style="width: 65%;" /&gt;
&lt;p&gt;We had a mix of core contributors and newcomers, which is a great
combination, as it enables us to be productive, but also to foster the
new generation of core developers. Present were:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Laurent Direr&lt;/li&gt;
&lt;li&gt;Michael Eickenberg&lt;/li&gt;
&lt;li&gt;Loic Esteve&lt;/li&gt;
&lt;li&gt;Alexandre Gramfort&lt;/li&gt;
&lt;li&gt;Olivier Grisel&lt;/li&gt;
&lt;li&gt;Arnaud Joly&lt;/li&gt;
&lt;li&gt;Kyle Kastner&lt;/li&gt;
&lt;li&gt;Manoj Kumar&lt;/li&gt;
&lt;li&gt;Balazs Kegl&lt;/li&gt;
&lt;li&gt;Nicolas Le Roux&lt;/li&gt;
&lt;li&gt;Andreas Mueller&lt;/li&gt;
&lt;li&gt;Vlad Niculae&lt;/li&gt;
&lt;li&gt;Fabian Pedregosa&lt;/li&gt;
&lt;li&gt;Amir Sani&lt;/li&gt;
&lt;li&gt;Danny Sullivan&lt;/li&gt;
&lt;li&gt;Gabriel Synnaeve&lt;/li&gt;
&lt;li&gt;Roland Thiolliere&lt;/li&gt;
&lt;li&gt;Gael Varoquaux&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" class="align-center" src="https://pbs.twimg.com/media/BsqRedvCEAE5Opw.jpg" style="width: 65%;" /&gt;
&lt;p&gt;As the sprint extended through a French bank holiday and the weekend,
we were hosted in a variety of venues:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://lapaillasse.org"&gt;La paillasse&lt;/a&gt;, a Paris bio-hacker space&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.inria.fr"&gt;INRIA&lt;/a&gt;, the French computer-science national
research, and the place where I work :)&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.criteo.com"&gt;Criteo&lt;/a&gt;, a French company doing word-wide
add-banner placement. The venue there was absolutely gorgeous, with a
beautiful terrace on the roofs of Paris. And they even had a social
event with free drinks one evening.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.tinyclues.com"&gt;Tinyclues&lt;/a&gt;, a French startup mining
e-commerce data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I must say that we were treated like kings during the whole stay; each
host welcoming us as well as they could. Thank you to all of our hosts!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sponsored-by-the-digicosm-labex"&gt;
&lt;h2&gt;Sponsored by the Digicosme Labex&lt;/h2&gt;
&lt;p&gt;Beyond our hosts, we need to thank the &lt;a class="reference external" href="https://digicosme.lri.fr/tiki-index.php"&gt;Digicosme Labex&lt;/a&gt;.
Digicosme gave us funding that covered some of the lunches, accommodation,
and travel expenses to bring in our contributors from abroad.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="achievements-during-the-sprint"&gt;
&lt;h2&gt;Achievements during the sprint&lt;/h2&gt;
&lt;p&gt;The first day of the sprint was dedicated to polishing the &lt;a class="reference external" href="http://www.scikit-learn.org/stable/whats_new.html"&gt;0.15
release&lt;/a&gt;, which
was finally released on the morning of the second day, after 10 months
of development.&lt;/p&gt;
&lt;p&gt;A large part of the effort of the sprint was dedicated to improving
the code base, rather than directly adding new features. Some files
were reorganized. The input validation code was cleaned up (opening the
way for better support of pandas structures in scikit-learn). We hunted
dead code, deprecation warnings, numerical instabilities and tests
randomly failing. We made the test suite faster, and refactored our
common tests that scan all the models.&lt;/p&gt;
&lt;p&gt;Some work of our GSOC student, Manoj Kumar, was merged, making some
linear models faster.&lt;/p&gt;
&lt;p&gt;Our &lt;a class="reference external" href="http:/scikit-learn.org/dev"&gt;online documentation&lt;/a&gt; was improve
with the &lt;a class="reference external" href="http://scikit-learn.org/stable/modules/classes.html"&gt;API
documentation&lt;/a&gt;
pointing to examples and source code.&lt;/p&gt;
&lt;p&gt;Still work in progress:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Faster stochastic gradient descent (with AdaGrad, ASGD, and one day
SAG)&lt;/li&gt;
&lt;li&gt;Calibration of probabilities for models that do not have a
‘predict_proba’ method&lt;/li&gt;
&lt;li&gt;Warm restart in random forests to add more estimators to an existing
ensemble.&lt;/li&gt;
&lt;li&gt;Infomax ICA algorithm.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="sprint"></category><category term="scikit-learn"></category><category term="python"></category><category term="machine learning"></category></entry><entry><title>Scikit-learn 0.15 release: highlights</title><link href="https://gael-varoquaux.info/programming/scikit-learn-015-release-highlights.html" rel="alternate"></link><published>2014-07-15T00:00:00+02:00</published><updated>2014-07-15T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-07-15:/programming/scikit-learn-015-release-highlights.html</id><summary type="html">&lt;p&gt;We have just released the 0.15 version of scikit-learn. Hurray!! Thanks
to all
&lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html#people"&gt;involved&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="a-long-development-stretch"&gt;
&lt;h2&gt;A long development stretch&lt;/h2&gt;
&lt;p&gt;It’s been a while since the &lt;a class="reference external" href="http://gael-varoquaux.info/programming/scikit-learn-014-release-features-and-benchmarks.html"&gt;last release of
scikit-learn&lt;/a&gt;. So a lot has
happened. Exactly 2611 commits, according to my count. Quite clearly, we
have more and more existing code …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;We have just released the 0.15 version of scikit-learn. Hurray!! Thanks
to all
&lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html#people"&gt;involved&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="a-long-development-stretch"&gt;
&lt;h2&gt;A long development stretch&lt;/h2&gt;
&lt;p&gt;It’s been a while since the &lt;a class="reference external" href="http://gael-varoquaux.info/programming/scikit-learn-014-release-features-and-benchmarks.html"&gt;last release of
scikit-learn&lt;/a&gt;. So a lot has
happened. Exactly 2611 commits, according to my count. Quite clearly, we
have more and more existing code, more and more features to support.
This means that when we modify an algorithm, for instance to make it
faster, something else might break due to numerical instability, or
exploring some obscure option. The good news is that we have tight
continuous integration, mostly thanks to
&lt;a class="reference external" href="https://travis-ci.org/scikit-learn/scikit-learn"&gt;travis&lt;/a&gt; (but
Windows continuous integration is on its way), and we keep growing our
test suite. Thus while it is getting harder and harder to change
something in scikit-learn, scikit-learn is also becoming more and more
robust.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="highlights"&gt;
&lt;h2&gt;Highlights&lt;/h2&gt;
&lt;a class="reference external image-reference" href="https://twitter.com/t3kcit/status/434378452901187584"&gt;&lt;img alt="" src="https://pbs.twimg.com/media/Bgc45seCUAAbze1.png" /&gt;&lt;/a&gt;
&lt;p&gt;&lt;strong&gt;Quality&lt;/strong&gt; — Looking at the commit log, there has been a huge amount of
work to &lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html#id7"&gt;fix minor annoying
issues&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt; — A huge effort has been put into making many parts of
scikit-learn faster, with little details improved all over the codebase. We do hope
that you’ll find that your applications run faster. For instance, we
find that the worst case speed of Ward clustering is 1.5 times faster in
0.15 than 0.14. K-means clustering is often 1.1 times faster. KNN, when
used in brute-force mode, got faster by a factor of 2 or 3.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Random Forest and various tree methods&lt;/strong&gt; — The random forest and
various tree methods are much much faster, use parallel computing much
better, and use less memory. For instance, the picture on the right
shows the scikit-learn random forest running in parallel on a fat Amazon
node, and nicely using all the CPUs with little RAM usage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hierarchical agglomerative clustering&lt;/strong&gt; — &lt;a class="reference external" href="http://scikit-learn.org/dev/modules/clustering.html#different-linkage-type-ward-complete-and-average-linkage"&gt;Complete linkage and average
linkage clustering have been
added&lt;/a&gt;.
The benefit of these approaches compared to the existing Ward clustering
is that they can take &lt;a class="reference external" href="http://scikit-learn.org/stable/modules/clustering.html#varying-the-metric"&gt;an arbitrary distance
matrix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Robust linear models&lt;/strong&gt; — Scikit-learn now includes
&lt;a class="reference external" href="http://scikit-learn.org/0.15/modules/linear_model.html#robustness-to-outliers-ransac"&gt;RANSAC&lt;/a&gt;
for robust linear regression.&lt;/p&gt;
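For illustration, here is a small sketch (not from the post) of RANSAC fitting a line through data polluted by gross outliers; by default `RANSACRegressor` wraps an ordinary linear regression:

```python
# Sketch: robust line fitting with RANSAC. The injected outliers
# should be flagged as such and the true slope of 2 recovered.
import numpy as np
from sklearn.linear_model import RANSACRegressor

X = np.arange(50, dtype=float)[:, np.newaxis]
y = 2.0 * X.ravel() + 1.0
y[::10] += 100.0  # corrupt every tenth point with a gross outlier

ransac = RANSACRegressor(random_state=0)
ransac.fit(X, y)
slope = ransac.estimator_.coef_[0]   # close to 2 despite the outliers
n_inliers = ransac.inlier_mask_.sum()  # the 45 uncorrupted points
```

An ordinary least-squares fit on the same data would be pulled toward the outliers; RANSAC fits repeatedly on random minimal subsets and keeps the model with the largest consensus set.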
&lt;p&gt;&lt;strong&gt;HMMs are deprecated&lt;/strong&gt; — We have long been discussing removing
HMMs, which do not fit scikit-learn’s focus on predictive
modeling. We have created a separate
&lt;a class="reference external" href="https://github.com/hmmlearn/hmmlearn"&gt;hmmlearn&lt;/a&gt; repository for the
HMM code. It is looking for maintainers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;And much more&lt;/strong&gt; — plenty of &lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html"&gt;“minor
things”&lt;/a&gt;, such as
better support for sparse data, better support for multi-label data…&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="machine learning"></category><category term="python"></category></entry><entry><title>Google summer of code projects for scikit-learn</title><link href="https://gael-varoquaux.info/programming/google-summer-of-code-projects-for-scikit-learn.html" rel="alternate"></link><published>2014-04-23T00:00:00+02:00</published><updated>2014-04-23T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-04-23:/programming/google-summer-of-code-projects-for-scikit-learn.html</id><summary type="html">&lt;p&gt;I’d like to welcome the four students that were accepted for the GSoC
this year:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Issam: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/issamou/5733935958982656"&gt;Extending Neural networks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hamzeh: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/hamsal/5709068098338816"&gt;Sparse Support for Ensemble Methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Manoj: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/manojkumar/5673522948997120"&gt;Making Linear models faster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Maheshakya: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/maheshakya/5754903989321728"&gt;Locality Sensitive Hashing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Welcome to all of you. Your submissions were excellent, and you
demonstrated a real willingness …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I’d like to welcome the four students that were accepted for the GSoC
this year:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Issam: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/issamou/5733935958982656"&gt;Extending Neural networks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hamzeh: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/hamsal/5709068098338816"&gt;Sparse Support for Ensemble Methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Manoj: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/manojkumar/5673522948997120"&gt;Making Linear models faster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Maheshakya: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/maheshakya/5754903989321728"&gt;Locality Sensitive Hashing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Welcome to all of you. Your submissions were excellent, and you
demonstrated a real willingness to integrate into the project, with its social and
coding dynamics. It is a privilege to work with you.&lt;/p&gt;
&lt;p&gt;I’d also like to thank all the mentors, Alex, Arnaud, Daniel, James,
Jaidev, Olivier, Robert and Vlad. Mentoring is a lot of work, and
mentors not only make it possible for great code to enter
scikit-learn, but also shape a future generation of scikit-learn
contributors.&lt;/p&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="machine learning"></category></entry><entry><title>Hiring a programmer for a brain imaging machine-learning library</title><link href="https://gael-varoquaux.info/programming/hiring-a-programmer-for-a-brain-imaging-machine-learning-library.html" rel="alternate"></link><published>2014-02-12T00:00:00+01:00</published><updated>2014-02-12T00:00:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2014-02-12:/programming/hiring-a-programmer-for-a-brain-imaging-machine-learning-library.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;Work with us on putting machine learning in the hands of cognitive
scientists&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Parietal is a research team that creates advanced data analysis to mine
functional brain images and solve medical and cognitive science problems.
Our day to day work is to write machine-learning and statistics code to
understand and …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Work with us on putting machine learning in the hands of cognitive
scientists&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Parietal is a research team that creates advanced data-analysis methods to mine
functional brain images and solve medical and cognitive science problems.
Our day-to-day work is to write machine-learning and statistics code to
better understand and use images of brain function (most often fMRI). Our
purpose is to be useful to the NeuroImaging community, mostly medical and
cognitive science researchers, to understand brain function better. What
is limiting us in this respect is that to reach end users we need to turn
our algorithms into usable software.&lt;/p&gt;
&lt;p&gt;This is why Parietal has a long tradition of investing in building an
ecosystem of &lt;a class="reference external" href="https://team.inria.fr/parietal/software"&gt;high-quality libraries and tools&lt;/a&gt;: we build, layer by layer, an
environment in which we can do our research, and with which we hope to
one day reach the user. We choose Python, as a high-level general purpose
language with which we can do scientific computing, and, one day, GUIs,
or web servers. We contribute to the scipy ecosystem; we have built the
foundations of the most successful Python machine learning library,
&lt;a class="reference external" href="http://scikit-learn"&gt;scikit-learn&lt;/a&gt;. We are invested in the
&lt;a class="reference external" href="http://nipy.org"&gt;neuroimaging in Python ecosystem&lt;/a&gt;. Our students, our
team members, send patches to scientific Python projects, teach courses
on how to use them, speak at conferences.&lt;/p&gt;
&lt;p&gt;But to go all the way, we need support from people who do software as
their sole goal. To put the finishing touches on the quality of our
end-user libraries, we need full-time programmers. In an academic
setting, they can be hard to justify, but we have always had dedicated
top-notch engineers at Parietal, our latest hire being the well-known
&lt;a class="reference external" href="http://ogrisel.com/"&gt;Olivier Grisel&lt;/a&gt;. This is where &lt;strong&gt;you&lt;/strong&gt; can come
in.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://team.inria.fr/parietal/research/spatial_patterns/niconnect/"&gt;NiConnect&lt;/a&gt;
project is a specific research effort in which we are developing leading
algorithmic tools. For this project, we have funding for a full-time
programmer: someone who will help us turn our understanding of how to
process brain images into a software tool that a cognitive science
researcher can use. We have started work on such software, in the
&lt;a class="reference external" href="http://nilearn.github.io/"&gt;nilearn&lt;/a&gt; project. What we need is someone
who drives the project and makes sure that the pieces fit together
well; someone who ensures that the code solving the user’s problem is not
our research code, but a clean and lean library, just as scikit-learn is an elegant
answer to day-to-day machine learning tasks.&lt;/p&gt;
&lt;p&gt;If you want more details, they can be found on the &lt;a class="reference external" href="https://team.inria.fr/parietal/job-offers"&gt;job offer&lt;/a&gt;. This post is to motivate
the job in a personal way that I cannot in an official posting.&lt;/p&gt;
&lt;div class="section" id="why-take-this-job"&gt;
&lt;h2&gt;Why take this job?&lt;/h2&gt;
&lt;p&gt;I don’t expect someone to take this job only because it pays the bills. To be
clear, the kind of person I am looking for has no difficulty finding a
well-paid job elsewhere. So, if you are that person, why would you take
this job?&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;To join &lt;a class="reference external" href="https://team.inria.fr/parietal/team-members/"&gt;a great team&lt;/a&gt;
that is focused on finding elegant solutions to hard problems at the
intersection of machine learning, cognitive science, and software.
Choose to work with great people, knowledgeable, passionate, and &lt;a class="reference external" href="https://team.inria.fr/parietal/inria-winter-party-2014/"&gt;fun&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;To work on interesting problems, that matter. They are interesting
because they are challenging but we have the skills to solve them. They
matter because these skills need to be used to make brain research
better.&lt;/li&gt;
&lt;li&gt;To have a boss (&lt;a class="reference external" href="https://github.com/GaelVaroquaux"&gt;me&lt;/a&gt;) that
actually codes and gives you feedback on your code.&lt;/li&gt;
&lt;li&gt;To learn. Data science + Python is &lt;em&gt;the&lt;/em&gt; combination of skills to have.
We have at Parietal unique expertise in these. Add to it a fine
understanding of algorithms, high-performance computing, statistics,
and software quality, and you have the perfect lines for a CV.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="what-would-make-me-excited-in-a-resume"&gt;
&lt;h2&gt;What would make me excited in a resume?&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Open source contributions (there is no better coding CV than a github
account).&lt;/li&gt;
&lt;li&gt;Experience in agile-like situations&lt;/li&gt;
&lt;li&gt;A passion for code quality&lt;/li&gt;
&lt;li&gt;Good Python experience&lt;/li&gt;
&lt;li&gt;The unlikely combination of research-like training (e.g. undergraduate)
and experience in a non-academic, non-scientific setting (say, web
development).&lt;/li&gt;
&lt;li&gt;To know that you care about user experience, about understanding and
solving the user’s problems.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Now if you are interested and feel up for the challenge, read the real
&lt;a class="reference external" href="https://team.inria.fr/parietal/job-offers"&gt;job offer&lt;/a&gt;, and send me
your resume.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="jobs"></category><category term="neuroimaging"></category><category term="python"></category></entry><entry><title>Publishing scientific software matters</title><link href="https://gael-varoquaux.info/science/publishing-scientific-software-matters.html" rel="alternate"></link><published>2013-09-19T00:00:00+02:00</published><updated>2013-09-19T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2013-09-19:/science/publishing-scientific-software-matters.html</id><summary type="html">&lt;p class="light"&gt;Christophe Pradal, Hans Peter Langtangen, and I recently edited
&lt;a class="reference external" href="http://www.sciencedirect.com/science/journal/18777503/4/5"&gt;a version&lt;/a&gt; of the
Journal of Computational Science on scientific software, in
particular those written in Python. We wrote &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1877750313000938"&gt;an editorial&lt;/a&gt;
defending writing and publishing open source scientific software that
I wish to summarize here. The &lt;a class="reference external" href="http://hal.inria.fr/hal-00858663/en"&gt;full text preprint&lt;/a&gt; is openly …&lt;/p&gt;</summary><content type="html">&lt;p class="light"&gt;Christophe Pradal, Hans Peter Langtangen, and I recently edited
&lt;a class="reference external" href="http://www.sciencedirect.com/science/journal/18777503/4/5"&gt;a version&lt;/a&gt; of the
Journal of Computational Science on scientific software, in
particular those written in Python. We wrote &lt;a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S1877750313000938"&gt;an editorial&lt;/a&gt;
defending writing and publishing open source scientific software that
I wish to summarize here. The &lt;a class="reference external" href="http://hal.inria.fr/hal-00858663/en"&gt;full text preprint&lt;/a&gt; is openly available in &lt;a class="reference external" href="http://gael-varoquaux.info/publications.html"&gt;my
publications list&lt;/a&gt; as always. It
includes, amongst other things, references.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Software is a central part of modern scientific discovery.&lt;/strong&gt; Software turns a
theoretical model into quantitative predictions; software controls an
experiment; and software extracts from raw data evidence supporting or
rejecting a theory. As of today, scientific publications seldom discuss
software in depth, maybe because it is both highly technical and a recent
addition to scientific tools. But times are changing. More and more scientific
investigators are developing software and it is important to establish norms
for publication of this work. Producing scientific software is an important
part of the landscape of research activities. Very visible scientific software
is found in products developed by private companies, such as Mathwork’s Matlab
or Wolfram’s Mathematica, but let us not forget that these build upon code
written by and for academics. Scientists writing software contribute to the
advancement of Science via several factors.&lt;/p&gt;
&lt;p&gt;First, software developed in one field, if written in a sufficiently general
way, can often be applied to advance a different field if the underlying
mathematics is common. &lt;strong&gt;Modern scientific software development has a strong
emphasis on generality and reusability by taking advantage of the general
properties of the mathematical structures in the problem.&lt;/strong&gt; This feature of
modern software helps close the gap between fields and accelerates scientific
discovery by packaging mathematical theories in a directly applicable way.&lt;/p&gt;
&lt;p&gt;Second, &lt;strong&gt;the public availability of code is a cornerstone of the
scientific method&lt;/strong&gt;, as it is a requirement to reproducing scientific
results: “&lt;em&gt;if it’s not open and verifiable by others, it’s not science,
or engineering, or whatever it is you call what we do.&lt;/em&gt;” (V. Stodden,
&lt;em&gt;The scientific method in practice&lt;/em&gt;). Emphasizing code to an extreme,
Buckheit and Donoho have challenged the traditional view that a
publication was the valuable outcome of scientific research: “&lt;em&gt;an article
about computational science in a scientific publication is not the
scholarship itself, it is merely advertising of the scholarship. The
actual scholarship is the complete software development environment
[…]&lt;/em&gt;”.&lt;/p&gt;
&lt;p&gt;It is important to keep in mind that &lt;strong&gt;going beyond replication of
results requires reusable software tools&lt;/strong&gt;: code that is portable, comes
with documentation, and, most of all, is maintained throughout the years.
Indeed, &lt;strong&gt;software development is a major undertaking that must build
upon best practices and a quality process&lt;/strong&gt;. Reversing Buckheit and
Donoho’s argument, publications about scientific software play an increasingly
important part in the scientific methodology. First, in the publish-or-perish
academic culture, such publications give an incentive to software production
and maintenance, because good software can lead to highly-cited papers. Second,
&lt;strong&gt;the publication and review process are the de facto standards of
ensuring quality in the scientific world. As software is becoming increasingly
central to the scientific discovery process, it must be subject to these
standards&lt;/strong&gt;. We have found that writing an article on software leads the
authors to better clarify the project vision, technically and scientifically,
the prior art, and the contributions. Last but not least, scientists publishing
new results based on a particular software need an informed analysis of the
validity of that software. Unfortunately, much of the current practice for
adopting research software relies on ease of use of the package and reputation
of the authors.&lt;/p&gt;
&lt;p&gt;[…]&lt;/p&gt;
&lt;p&gt;Today, software is to scientific research what Galileo’s telescope was to
astronomy: a tool, combining science and engineering. It lies outside the
central field of competence of the researchers who rely on it.
Like the telescope, it also builds upon scientific progress and shapes our
scientific vision. Galileo’s telescope was a leap forward in optics, a field of
investigation that is now well established, with its own high-impact journals
and scholarly associations. Similarly, we hope that visibility and recognition
of scientific software development will grow.&lt;/p&gt;
</content><category term="science"></category><category term="publishing"></category><category term="open source"></category><category term="scientific computing"></category><category term="reproducible research"></category><category term="scientific software"></category></entry><entry><title>Scikit-learn 0.14 release: features and benchmarks</title><link href="https://gael-varoquaux.info/programming/scikit-learn-014-release-features-and-benchmarks.html" rel="alternate"></link><published>2013-08-08T00:00:00+02:00</published><updated>2013-08-08T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2013-08-08:/programming/scikit-learn-014-release-features-and-benchmarks.html</id><summary type="html">&lt;p&gt;I have tagged and released the scikit-learn 0.14 release yesterday
evening, after more than 6 months of heavy development from the team. I
would like to give a quick overview of the highlights of this release in
terms of features but also in terms of performance. Indeed, the
scikit-learn …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I have tagged and released the scikit-learn 0.14 release yesterday
evening, after more than 6 months of heavy development from the team. I
would like to give a quick overview of the highlights of this release in
terms of features but also in terms of performance. Indeed, the
scikit-learn developers believe that &lt;strong&gt;performance matters&lt;/strong&gt; and strive
to be fast and efficient on fairly large datasets.&lt;/p&gt;
&lt;p&gt;I will show in this article, on a couple of benchmarks, that we have
significant performance improvements and are competitive with the fastest
libraries, such as the proprietary WiseRF.&lt;/p&gt;
&lt;div class="section" id="prohiminent-new-features"&gt;
&lt;h2&gt;Prominent new features&lt;/h2&gt;
&lt;p&gt;Most of the new features of the upcoming release have been mentioned
in more detail on &lt;a class="reference external" href="http://peekaboo-vision.blogspot.de/2013/07/scikit-learn-sprint-and-014-release.html"&gt;Andy Mueller’s
blog&lt;/a&gt;.
I am just giving a quick list here for completeness (see also the &lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html"&gt;full
list of changes&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Major new estimators&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;AdaBoost&lt;/strong&gt; (by &lt;a class="reference external" href="http://noel.dawe.me"&gt;Noel Dawe&lt;/a&gt; and &lt;a class="reference external" href="http://www.montefiore.ulg.ac.be/~glouppe/"&gt;Gilles
Louppe&lt;/a&gt;): the classic
boosting algorithm. This implementation can be applied to any
estimator, but uses trees by default.
AdaBoost is a learning strategy that builds upon simple learners
by focusing successively on samples that are not well
predicted. Typically, the simple learners (called &lt;em&gt;weak learners&lt;/em&gt;)
can be rules as simple as thresholds on observed
quantities (forming &lt;em&gt;decision stumps&lt;/em&gt;).
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/ensemble.html#AdaBoost"&gt;Documentation&lt;/a&gt;
—
&lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_twoclass.html"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Biclustering&lt;/strong&gt; (by &lt;a class="reference external" href="http://www.kemaleren.com"&gt;Kemal Eren&lt;/a&gt;):
clustering rows and columns of the data matrices.
Suppose you have access to the shopping lists of many consumers;
biclustering would consist in grouping both the consumers and the products
they bought, to come up with stories such as “geeks buy computers and
phones”, where “geeks” is a group of consumers and “computers”
and “phones” are groups of products.
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/biclustering.html"&gt;Documentation&lt;/a&gt;
—
&lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/bicluster/plot_spectral_biclustering.html"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Missing value imputation&lt;/strong&gt; (by &lt;a class="reference external" href="http://nicolastr.com/"&gt;Nicolas
Tresegnie&lt;/a&gt;): simple transformer filling
missing data with means or medians.
If your data acquisition has failures, human or material, you can
easily end up with some descriptors missing for some observations. It
would be a pity to throw away either those observations or those
descriptors. “Imputation” fills in the blanks with simple strategies.
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values"&gt;Documentation&lt;/a&gt;
—
&lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/imputation.html"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RBMs (Restricted Boltzmann Machines)&lt;/strong&gt; (by &lt;a class="reference external" href="http://ynd.github.io/"&gt;Yann
Dauphin&lt;/a&gt;): a neural network model useful
for unsupervised learning of features.
Restricted Boltzmann machines learn a set of hidden (latent) factors
that have, for each observation, a probability to be activated or
not. These activations are found so that they explain the data well,
when combined across all the hidden factors with connection weights.
Typically, they form a new feature set that can be useful in a
prediction task.
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/neural_networks.html#restricted-boltzmann-Machines"&gt;Documentation&lt;/a&gt;
—
&lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/plot_rbm_logistic_classification.html"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RandomizedSearchCV&lt;/strong&gt; (by &lt;a class="reference external" href="http://peekaboo-vision.blogspot.com"&gt;Andreas
Mueller&lt;/a&gt;): setting
meta-parameters on estimators using a randomized parameter
exploration rather than a grid, as in a grid-search.
A CV (cross-validated) meta-estimator sets the parameters of an
estimator by maximizing its cross-validated prediction score. This
entails fitting the estimator for each parameter value tried. The
randomized search explores the parameter space randomly, avoiding the
exponential growth in the number of settings to fit that a grid search incurs.
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization"&gt;Documentation&lt;/a&gt;
—
&lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/randomized_search.html"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
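To make the weak-learner idea concrete, here is a minimal sketch (not from the post; the dataset is synthetic) that boosts depth-1 decision stumps:

```python
# Sketch: AdaBoost over decision stumps (depth-1 trees), the
# classic weak learners described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)  # a single-threshold rule
clf = AdaBoostClassifier(stump, n_estimators=50, random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)  # well above chance
```

Each individual stump is barely better than guessing, but the reweighted combination of fifty of them yields a strong classifier.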
&lt;p&gt;&lt;strong&gt;Infrastructure work&lt;/strong&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;New website&lt;/strong&gt; (mostly by &lt;a class="reference external" href="http://www.montefiore.ulg.ac.be/~glouppe/"&gt;Gilles
Louppe&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/nellev"&gt;Nelle
Varoquaux&lt;/a&gt;, Vincent Michel and &lt;a class="reference external" href="http://peekaboo-vision.blogspot.com"&gt;Andreas
Mueller&lt;/a&gt;). The redesign of
the website had two objectives: &lt;em&gt;i)&lt;/em&gt; unclutter the pages to help
prioritize information, &lt;em&gt;ii)&lt;/em&gt; make it easier for users to find the
stable documentation, if they follow an external link to a
documentation of previous releases. I think that it also looks
prettier &lt;em&gt;:)&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python 3 support&lt;/strong&gt; (&lt;a class="reference external" href="https://github.com/justinvf"&gt;Justin
Vincent&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/larsmans"&gt;Lars
Buitinck&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/smoitra87"&gt;Subhodeep
Moitra&lt;/a&gt; and &lt;a class="reference external" href="http://twitter.com/ogrisel"&gt;Olivier
Grisel&lt;/a&gt;). As a side note, under Python
3.3, on Windows, we have found that &lt;em&gt;np.load&lt;/em&gt; can trigger segfaults,
which means our test suite crashes. The tests not relying on
&lt;em&gt;np.load&lt;/em&gt; pass.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="major-api-changes"&gt;
&lt;h2&gt;Major API changes&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;The scoring parameter&lt;/strong&gt; One of the benefits of scikit-learn over
other learning packages is that it can set parameters to maximize a
prediction score. However, the prediction that one would want to
optimize might depend on the application. Also, some scores can only
be computed with specific estimators, for instance because they
require probabilistic prediction. &lt;a class="reference external" href="http://peekaboo-vision.blogspot.com"&gt;Andreas
Mueller&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/larsmans"&gt;Lars
Buitinck&lt;/a&gt; came up with &lt;a class="reference external" href="http://scikit-learn.org/dev/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules"&gt;a new
API&lt;/a&gt;
to specify the scoring strategy; it is versatile and hides
complexity from the user. This replaces the &lt;em&gt;score_func&lt;/em&gt; argument.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;sklearn.test()&lt;/strong&gt; is deprecated and will not run the test suite.
Please use &lt;em&gt;nosetests sklearn&lt;/em&gt; from the command line.&lt;/li&gt;
&lt;/ul&gt;
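As a minimal sketch of the new API (not from the post, and using the modern `sklearn.model_selection` import path; at the time this functionality lived in `sklearn.cross_validation`), the scoring strategy is simply selected by name:

```python
# Sketch: selecting the evaluation metric with the `scoring`
# parameter, which replaced the old `score_func` argument.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
# Any registered scorer name works, e.g. "accuracy" or "roc_auc"
scores = cross_val_score(LogisticRegression(), X, y,
                         scoring="roc_auc", cv=5)
mean_auc = scores.mean()
```

The string names hide the distinction between metrics computed from hard predictions and those, like ROC AUC, that need probabilistic output.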
&lt;p&gt;The full list of API changes can be found on the &lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html"&gt;change
log&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="performance-improvements"&gt;
&lt;h2&gt;Performance improvements&lt;/h2&gt;
&lt;p&gt;Many parts of the codebase got speed-ups, with a focus on making
&lt;strong&gt;scikit-learn more scalable for bigger data&lt;/strong&gt;.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The trees (random forests and extra-trees) were massively sped up by
&lt;a class="reference external" href="http://www.montefiore.ulg.ac.be/~glouppe/"&gt;Gilles Louppe&lt;/a&gt;,
bringing them on par with the fastest libraries (see benchmarks
below)&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.astro.washington.edu/users/vanderplas/"&gt;Jake
Vanderplas&lt;/a&gt;
improved the BallTree and implemented fast KDTrees for
nearest-neighbor search (benchmarks below).&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://github.com/cleverless"&gt;“cleverless”&lt;/a&gt; made the DBSCAN
implementation scale to a large number of samples by relying on
KDTree and BallTree for neighbor search.&lt;/li&gt;
&lt;li&gt;KMeans much faster on sparse data (&lt;a class="reference external" href="https://github.com/larsmans"&gt;Lars
Buitinck&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;For text vectorization: much faster CountVectorizer and
TfidfVectorizer with less memory consumption (Jochen Wersdorfer and
Roman Sinayev)&lt;/li&gt;
&lt;li&gt;Out-of-core learning for discrete naive Bayes classifiers by &lt;a class="reference external" href="http://twitter.com/ogrisel"&gt;Olivier
Grisel&lt;/a&gt;. Estimators that implement a
&lt;em&gt;partial_fit&lt;/em&gt; method can be used to fit the model with an
out-of-core strategy, as illustrated by the &lt;a class="reference external" href="http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html"&gt;out-of-core
classification
example&lt;/a&gt;.
These settings are well suited to very big data.&lt;/li&gt;
&lt;li&gt;FastICA: less memory consumption and slightly faster code (&lt;a class="reference external" href="https://github.com/dengemann"&gt;Denis
Engemann&lt;/a&gt; and &lt;a class="reference external" href="http://alexandre.gramfort.net"&gt;Alexandre
Gramfort&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Faster IsotonicRegression (&lt;a class="reference external" href="https://github.com/nellev"&gt;Nelle
Varoquaux&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;OrthogonalMatchingPursuitCV by &lt;a class="reference external" href="http://alexandre.gramfort.net"&gt;Alexandre
Gramfort&lt;/a&gt; and &lt;a class="reference external" href="http://vene.ro"&gt;Vlad
Niculae&lt;/a&gt;: while strictly speaking not a speedup of
an existing estimator, this new estimator means that OMP parameters
can be set much faster.&lt;/li&gt;
&lt;/ul&gt;
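&lt;p&gt;As a rough illustration of the &lt;em&gt;partial_fit&lt;/em&gt; pattern mentioned above (a minimal sketch, with synthetic chunks standing in for data read from disk):&lt;/p&gt;

```python
# Minimal sketch of out-of-core learning: feed data to the model in chunks.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
clf = MultinomialNB()
classes = np.array([0, 1])  # all classes must be announced up front

for _ in range(5):
    # In a real out-of-core setting, each chunk would be read from disk.
    X_chunk = rng.randint(0, 10, size=(100, 20))
    y_chunk = rng.randint(0, 2, size=100)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

predictions = clf.predict(X_chunk)
```

&lt;p&gt;Each call to &lt;em&gt;partial_fit&lt;/em&gt; updates the sufficient statistics of the model, so memory usage stays bounded by the chunk size rather than by the dataset size.&lt;/p&gt;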
&lt;/div&gt;
&lt;div class="section" id="we-are-faster-lies-damn-lies-and-benchmarks"&gt;
&lt;h2&gt;We are faster: lies, damn lies and benchmarks&lt;/h2&gt;
&lt;blockquote class="epigraph"&gt;
&lt;p&gt;“There are three kinds of lies: lies, damned lies and statistics.” —&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Mark Twain’s Own Autobiography: The Chapters from the North
American Review&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I claim we have gotten faster at certain things. Other libraries, such
as &lt;a class="reference external" href="http://docs.wise.io/"&gt;WiseRf&lt;/a&gt;, make performance claims compared
to us. It turns out that benchmarking statistical learning code is very
hard, because speed depends a lot on the properties of the data.&lt;/p&gt;
&lt;div class="section" id="fast-neighbor-searches-good-kdtrees-beat-balltrees"&gt;
&lt;h3&gt;Fast neighbor searches: good KDTrees beat BallTrees&lt;/h3&gt;
&lt;p&gt;A good example of interplay between properties of the data and
computational speed is the nearest neighbor search. In general, finding
the nearest neighbor to a point out of &lt;em&gt;n&lt;/em&gt; other points will cost you
&lt;em&gt;n&lt;/em&gt; operations, as you have to compute the distance to each of these
points. However, building a tree-like data structure ahead of time can
make this query cost only &lt;em&gt;log n&lt;/em&gt;. If these points are in 1D, &lt;em&gt;ie&lt;/em&gt;
simple scalars, this would be achieved by sorting them. In higher
dimensions that can be achieved by building a &lt;em&gt;KDTree&lt;/em&gt;, made of planes
dividing the space in half-spaces, or a &lt;em&gt;BallTree&lt;/em&gt;, made of nested
balls.&lt;/p&gt;
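&lt;p&gt;In scikit-learn this build-once, query-fast pattern looks as follows (a minimal sketch; the actual timings depend heavily on the data):&lt;/p&gt;

```python
# Minimal sketch: pay the tree-building cost once, then query in ~log n time.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(42)
X = rng.rand(1000, 3)  # 1000 points in 3 dimensions

tree = KDTree(X)                    # build the tree once
dist, ind = tree.query(X[:1], k=5)  # 5 nearest neighbors of the first point
```

&lt;p&gt;Since the query point is itself in the tree, its nearest neighbor is itself, at distance 0.&lt;/p&gt;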
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="http://www.astroml.org/_images/fig_kdtree_example_1.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;KD Tree&lt;/strong&gt; Image from &lt;a class="reference external" href="http://www.astroml.org/index.html"&gt;AstroML’s documentation&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="http://www.astroml.org/_images/fig_balltree_example_1.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;&lt;strong&gt;Ball tree&lt;/strong&gt; Image from &lt;a class="reference external" href="http://www.astroml.org/index.html"&gt;AstroML’s documentation&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Popular wisdom in machine learning is that in high dimensions, BallTrees
scale better than KDTrees. This is explained by the fact that as the
dimensionality grows, the number of planes required to break up the
space grows too. On the contrary, if the data has structure, BallTrees
can more efficiently cover this structure. I have benchmarked scikit-learn’s
KDTree and BallTree, as well as scipy’s KDTree, which employs a simpler
tree-building strategy, on a variety of datasets, both real-life and
artificial. Below is a summary plot giving the relative performance of
neighbor searches:&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="https://gael-varoquaux.info/programming/attachments/sklearn_0.14.X_speed/nn_trees.png" style="width: 60%;" /&gt;
&lt;p class="caption"&gt;&lt;em&gt;n&lt;/em&gt; is the number of data points, and &lt;em&gt;p&lt;/em&gt; the dimensionality.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;We can see that no approach wins on all counts. That said, it came as a
surprise to me to see that even in high dimension, &lt;strong&gt;scikit-learn’s
KDTree outperformed the BallTrees&lt;/strong&gt;. This is explained by the fact that these
datasets do not display heavy structure with a low intrinsic dimension. On
highly-structured synthetic data, the benefit of BallTree can clearly
stand out, as shown by Jake
&lt;a class="reference external" href="http://jakevdp.github.io/blog/2013/04/29/benchmarking-nearest-neighbor-searches-in-python"&gt;here&lt;/a&gt;.
However, on most datasets people encounter, it seems that this is not the
case. Note also that &lt;strong&gt;scikit-learn’s KDTree tends to scale better in
high dimension than scipy’s&lt;/strong&gt;. This is due to the more elaborate choice
of cutting planes. That choice also has a cost, and may backfire, as on
some datasets scikit-learn is slower than scipy.&lt;/p&gt;
&lt;p&gt;Overall, the new KDTree in scikit-learn seems to give an excellent
compromise. Congratulations
&lt;a class="reference external" href="http://www.astro.washington.edu/users/vanderplas/"&gt;Jake&lt;/a&gt;!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="dbscan-is-faster-with-trees"&gt;
&lt;h3&gt;DBSCAN is faster with trees&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/clustering.html#dbscan"&gt;DBSCAN&lt;/a&gt;
is a clustering algorithm that relies heavily on the local neighborhood
structure. The implementation in scikit-learn 0.13 computed the complete
&lt;em&gt;n&lt;/em&gt; by &lt;em&gt;n&lt;/em&gt; matrix of distances between observations, which means that if
you had a lot of data, you would blow your memory. In the 0.14 release,
DBSCAN uses the BallTree, and as a result scales to much larger datasets
and brings speed benefits. Here is a comparison between the 0.13 and 0.14
implementations (I couldn’t use data as large as I wanted because the
0.13 code would blow up the memory):&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="53%" /&gt;
&lt;col width="23%" /&gt;
&lt;col width="24%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Dataset&lt;/th&gt;
&lt;th class="head"&gt;time with 0.13&lt;/th&gt;
&lt;th class="head"&gt;time with 0.14&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;“lfw”: 13233 samples, 5 features&lt;/td&gt;
&lt;td&gt;6.57 seconds&lt;/td&gt;
&lt;td&gt;3.59 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;“make_blobs”: 30000, with 10 features&lt;/td&gt;
&lt;td&gt;33.50 seconds&lt;/td&gt;
&lt;td&gt;12.87 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Importantly, the scaling is different: while the 0.13 code scales as &lt;em&gt;n
^ 2&lt;/em&gt;, the 0.14 code scales as &lt;em&gt;n log n&lt;/em&gt;. This means that the benefit is
bigger for larger datasets.&lt;/p&gt;
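&lt;p&gt;The user-facing API is unchanged between the two releases; only the internals differ. A minimal sketch on synthetic blobs:&lt;/p&gt;

```python
# Minimal sketch: DBSCAN clustering; since 0.14 neighbor queries go
# through tree structures instead of a full n-by-n distance matrix.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=3000, centers=3, cluster_std=0.5, random_state=0)
db = DBSCAN(eps=0.5, min_samples=10).fit(X)
labels = db.labels_  # cluster index per sample, -1 for noise
n_clusters = len(set(labels) - {-1})
```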
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-0-14-s-random-forests-are-fast"&gt;
&lt;h3&gt;Scikit-learn 0.14’s random forests are fast&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.montefiore.ulg.ac.be/~glouppe/"&gt;Gilles Louppe&lt;/a&gt; has made
the random forests significantly faster in the 0.14 release. Let us
bench them in comparison with WiseIO’s
&lt;a class="reference external" href="http://docs.wise.io/"&gt;WiseRf&lt;/a&gt;, a proprietary package that only does
random forest and for which the main selling point is that it is
significantly than scikit-learn. However, let us also bench
&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/ensemble.html#extremely-randomized-trees"&gt;ExtraTrees&lt;/a&gt;,
a tree-based model that is very similar to random forests, but that in
our experience can be implemented a bit faster, and tends to work
better.&lt;/p&gt;
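&lt;p&gt;The scikit-learn side of such a benchmark is easy to reproduce (a minimal sketch; absolute timings will of course vary with the machine and with &lt;em&gt;n_estimators&lt;/em&gt;):&lt;/p&gt;

```python
# Minimal sketch: timing ExtraTrees against RandomForest on digits.
import time
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

digits = load_digits()
X, y = digits.data, digits.target

scores = {}
for Model in (ExtraTreesClassifier, RandomForestClassifier):
    clf = Model(n_estimators=100, random_state=0)
    t0 = time.time()
    clf.fit(X, y)
    elapsed = time.time() - t0
    scores[Model.__name__] = clf.score(X, y)  # training-set accuracy
    print("%s: %.2fs train" % (Model.__name__, elapsed))
```

&lt;p&gt;A proper benchmark would of course measure accuracy on held-out data, as the tables below do.&lt;/p&gt;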
&lt;p&gt;&lt;strong&gt;On the digits dataset (1797 samples, 64 features):&lt;/strong&gt;&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="33%" /&gt;
&lt;col width="19%" /&gt;
&lt;col width="17%" /&gt;
&lt;col width="31%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;Forest implementation&lt;/td&gt;
&lt;td&gt;train time&lt;/td&gt;
&lt;td&gt;test time&lt;/td&gt;
&lt;td&gt;prediction accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn ExtraTrees&lt;/td&gt;
&lt;td&gt;2.641s&lt;/td&gt;
&lt;td&gt;0.082s&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn RandomForest&lt;/td&gt;
&lt;td&gt;5.074s&lt;/td&gt;
&lt;td&gt;0.088s&lt;/td&gt;
&lt;td&gt;0.981&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;WiseRF&lt;/td&gt;
&lt;td&gt;5.665s&lt;/td&gt;
&lt;td&gt;0.108s&lt;/td&gt;
&lt;td&gt;0.979&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;So we see that on a mid-sized dataset, scikit-learn is faster than
WiseRF, and ExtraTrees is twice as fast as RandomForest, for better
results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On the MNIST dataset (70000 samples, 784 features):&lt;/strong&gt;&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="33%" /&gt;
&lt;col width="19%" /&gt;
&lt;col width="17%" /&gt;
&lt;col width="31%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;Forest implementation&lt;/td&gt;
&lt;td&gt;train time&lt;/td&gt;
&lt;td&gt;test time&lt;/td&gt;
&lt;td&gt;prediction accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn ExtraTrees&lt;/td&gt;
&lt;td&gt;1378.141s&lt;/td&gt;
&lt;td&gt;4.768s&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn RandomForest&lt;/td&gt;
&lt;td&gt;1639.866s&lt;/td&gt;
&lt;td&gt;4.132s&lt;/td&gt;
&lt;td&gt;0.972&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;WiseRF&lt;/td&gt;
&lt;td&gt;1102.465s&lt;/td&gt;
&lt;td&gt;14.542s&lt;/td&gt;
&lt;td&gt;0.972&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;On a big dataset, WiseRF takes the lead, but not by a large factor.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Using 2 CPUs (n_jobs=2) on the digits dataset:&lt;/strong&gt;&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="33%" /&gt;
&lt;col width="19%" /&gt;
&lt;col width="17%" /&gt;
&lt;col width="31%" /&gt;
&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;Forest implementation&lt;/td&gt;
&lt;td&gt;train time&lt;/td&gt;
&lt;td&gt;test time&lt;/td&gt;
&lt;td&gt;prediction accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn ExtraTrees&lt;/td&gt;
&lt;td&gt;4.874s&lt;/td&gt;
&lt;td&gt;1.478s&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Sklearn RandomForest&lt;/td&gt;
&lt;td&gt;5.716s&lt;/td&gt;
&lt;td&gt;1.349s&lt;/td&gt;
&lt;td&gt;0.978&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;WiseRF&lt;/td&gt;
&lt;td&gt;3.264s&lt;/td&gt;
&lt;td&gt;0.104s&lt;/td&gt;
&lt;td&gt;0.979&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Both scikit-learn and WiseRF can use several CPUs. However, the Python
parallel execution model via multiple processes has an overhead in terms
of computing time and of memory usage. The internals of WiseRF are coded
in C++, and thus it is not limited by this overhead. Also, because of
the memory duplication with multiple processes in scikit-learn, I could
not run it on MNIST with 2 jobs. The next release will address these issues,
partly by using memmapped arrays to share memory between processes.&lt;/p&gt;
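&lt;p&gt;The memmap idea can be sketched with plain numpy (hypothetical file name; this shows the mechanism, not scikit-learn’s actual implementation):&lt;/p&gt;

```python
# Minimal sketch: a memory-mapped file lets several processes read the
# same array without each holding a private copy in RAM.
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.mmap")  # hypothetical path

writer = np.memmap(path, dtype="float64", mode="w+", shape=(1000, 50))
writer[:] = np.random.RandomState(0).rand(1000, 50)
writer.flush()  # make sure the data hits the file

# A worker process would open the same file read-only, copy-free:
reader = np.memmap(path, dtype="float64", mode="r", shape=(1000, 50))
same = np.allclose(writer, reader)
```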
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="we-make-good-use-of-funding-the-paris-sprint"&gt;
&lt;h2&gt;We make good use of funding: the Paris sprint&lt;/h2&gt;
&lt;p&gt;A couple of weeks ago, we had a coding sprint in Paris. We were able to
bring in a lot of core developers from all over Europe thanks to our
sponsors: &lt;a class="reference external" href="http://www.frs-fnrs.be/%20"&gt;FNRS&lt;/a&gt;,
&lt;a class="reference external" href="http://www.afpy.org"&gt;AFPy&lt;/a&gt;, &lt;a class="reference external" href="http://www.telecom-paristech.fr/"&gt;Telecom
Paristech&lt;/a&gt;, and &lt;a class="reference external" href="http://www.svi.cnrs-bellevue.fr"&gt;Saint-Gobain
Recherche&lt;/a&gt;. The total budget,
including accommodation and travel, was a couple of thousand euros, thanks
to &lt;a class="reference external" href="http://www.telecom-paristech.fr/"&gt;Telecom Paristech&lt;/a&gt; and
&lt;a class="reference external" href="http://www.tinyclues.com"&gt;tinyclues&lt;/a&gt; helping us with accommodation
and hosting the sprint.&lt;/p&gt;
&lt;p&gt;The productivity of such a sprint is huge, both because we get together
and work efficiently, but also because we get back home and keep working
(I have been sleep deprived because of late-night hacking ever since the
sprint). As an illustration, here is the diagram of commits as can be
seen on GitHub. The huge spike corresponds to the second international
sprint: Paris 2013.&lt;/p&gt;
&lt;div class="figure align-center"&gt;
&lt;img alt="" src="https://gael-varoquaux.info/programming/attachments/sklearn_0.14.X_speed/commit_graph.png" style="width: 100%;" /&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;We now have a “donate” button&lt;/strong&gt; on the
&lt;a class="reference external" href="http://scikit-learn.org/stable"&gt;website&lt;/a&gt;. I can assure you that
your donations are well spent and turned into code.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="machine learning"></category></entry><entry><title>RIP John Hunter: the loss of a great man</title><link href="https://gael-varoquaux.info/programming/rip-john-hunter-the-loss-of-a-great-man.html" rel="alternate"></link><published>2012-08-30T10:21:00+02:00</published><updated>2012-08-30T10:21:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-08-30:/programming/rip-john-hunter-the-loss-of-a-great-man.html</id><summary type="html">&lt;p&gt;John Hunter, the author of &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt; passed away yesterday after a
short battle against cancer. John gave the keynote at the scipy 2012
conference a few weeks ago, and was diagnosed with cancer just on his
return from the conference. It is a shock to me that a friend …&lt;/p&gt;</summary><content type="html">&lt;p&gt;John Hunter, the author of &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt; passed away yesterday after a
short battle against cancer. John gave the keynote at the scipy 2012
conference a few weeks ago, and was diagnosed with cancer just on his
return from the conference. It is a shock to me that a friend can
disappear so quickly. Please read the &lt;a class="reference external" href="https://groups.google.com/forum/#!msg/pydata/FpwXp3sX6N8/mxopkZ1PkBQJ"&gt;announcement&lt;/a&gt; of &lt;a class="reference external" href="http://fperez.org/"&gt;Fernando
Perez&lt;/a&gt;, who supported John in his last weeks, to learn more about John.&lt;/p&gt;
&lt;div class="section" id="a-man-who-gave-a-lot-not-asking-for-anything-in-return"&gt;
&lt;h2&gt;A man who gave a lot, not asking for anything in return&lt;/h2&gt;
&lt;p&gt;Many have benefited from the silent efforts of John, and are not fully
aware of how he generously invested his time and talent for the benefit
of others. Matplotlib, the Python plotting library that he created in
2002, has propelled Python as a major tool for scientific research and
engineering. The impact of John’s efforts goes well beyond Matplotlib.
Early on, John had the vision of Python as an interactive scientific
environment. He promoted this vision, pairing with Fernando Perez to
develop the fantastic &lt;a class="reference external" href="http://ipython.org/"&gt;ipython&lt;/a&gt;/&lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt; tandem, solving many
technical challenges. But he also invested a lot of energy in teaching
workshops that helped change the way people compute, as well as writing
didactic documentation and articles. He was a friendly, active, leader
of an online community, open and helpful to newcomers.&lt;/p&gt;
&lt;p&gt;As Travis Oliphant said on John’s numfocus &lt;a class="reference external" href="http://numfocus.org/johnhunter/"&gt;memorial webpage&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
Those who contribute much to open source, as John did, do so at the
expense of something - often it is time with family.&lt;/blockquote&gt;
&lt;p&gt;I cannot stress how true this is. The entire open-source software stack
that nowadays supports our economy, our education, and our research is built
on the shoulders of a fairly small number of generous people who spend
their energy making better software rather than building personal wealth.&lt;/p&gt;
&lt;p&gt;John was a humble man. He did not have a blog, or a twitter account, did
not seek fame or money. For this reason I feel that his contributions
are unknown and undervalued by many. In my eyes, he is an unknown
soldier of our modern times. I hope that I am not being too emphatic,
but this is how I feel.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;John passed away at 44, leaving behind a wife and 3 daughters. Please
do consider supporting them:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote class="last"&gt;
&lt;a class="reference external" href="http://numfocus.org/johnhunter"&gt;http://numfocus.org/johnhunter&lt;/a&gt;&lt;/blockquote&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="scipy"></category><category term="personnal"></category><category term="community"></category></entry><entry><title>A journal promoting high-quality research code: dream and reality</title><link href="https://gael-varoquaux.info/programming/a-journal-promoting-high-quality-research-code-dream-and-reality.html" rel="alternate"></link><published>2012-06-04T21:39:00+02:00</published><updated>2012-06-04T21:39:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-06-04:/programming/a-journal-promoting-high-quality-research-code-dream-and-reality.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="http://www.openresearchcomputation.com/"&gt;Open research computation (ORC)&lt;/a&gt; was an attempt to create a scientific
publication promoting &lt;strong&gt;high-quality and open source scientific code&lt;/strong&gt;.
The project went public in fall 2010, but last month, facing the low
volume of submissions, the editorial board &lt;a class="reference external" href="http://blogs.openaccesscentral.com/blogs/bmcblog/entry/open_research_computation_thematic_series"&gt;chose to reorient it&lt;/a&gt; as a
special track of an existing journal …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="http://www.openresearchcomputation.com/"&gt;Open research computation (ORC)&lt;/a&gt; was an attempt to create a scientific
publication promoting &lt;strong&gt;high-quality and open source scientific code&lt;/strong&gt;.
The project went public in fall 2010, but last month, facing the low
volume of submissions, the editorial board &lt;a class="reference external" href="http://blogs.openaccesscentral.com/blogs/bmcblog/entry/open_research_computation_thematic_series"&gt;chose to reorient it&lt;/a&gt; as a
special track of an existing journal.&lt;/p&gt;
&lt;p&gt;The challenges that we face are discussed in our editorial:&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external" href="http://www.scfbm.org/content/7/1/2/abstract"&gt;Changing computational research. The challenges ahead.&lt;/a&gt; C Neylon,
J Aerts, CT Brown, D Lemire, J Millman, P Murray-Rust, F Perez, N
Saunders, A Smith, G Varoquaux and E Willighagen, &lt;em&gt;Source Code for
Biology and Medicine&lt;/em&gt; 2012, 7:20&lt;/blockquote&gt;
&lt;p&gt;Here is my own personal take on the rise and fall of this ideal.&lt;/p&gt;
&lt;div class="section" id="my-story-with-orc"&gt;
&lt;h2&gt;My story with ORC&lt;/h2&gt;
&lt;img alt="" class="align-right" src="http://www.rcac.net.au/images/Publications1.jpg" style="width: 40%;" /&gt;
&lt;p&gt;&lt;strong&gt;From pipe dream to journal -&lt;/strong&gt; My involvement with ORC started long
before there was such a thing as ORC. In fall 2008, I had a discussion
with a friend working in the publication industry, telling her how I
believed that the publication system is broken, because it promotes new
results without any interest in whether these can be exported outside
the lab that produced them: &lt;strong&gt;it is currently easier to publish a minor
but novel result than a tool enabling the routine reproduction of
previous results&lt;/strong&gt;. This seemed particularly marked in the scientific
software world, as software tools are becoming central to the scientific
workflow, and cost nothing to duplicate when produced under open-source
license. To my surprise, she took me seriously, and asked me to write my
ideas down in an email that she would forward to her colleagues in the
publication industry.&lt;/p&gt;
&lt;p&gt;Looking back at the email that I sent, my concerns were, back then, to
promote:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;quality and openness of scientific software&lt;/li&gt;
&lt;li&gt;basic tools shared across communities&lt;/li&gt;
&lt;li&gt;recognition of software development as a challenging and worthwhile
task in academic research&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Shaping the idea -&lt;/strong&gt; In the year that followed, I had a few
discussions with staff from &lt;a class="reference external" href="http://www.biomedcentral.com"&gt;BioMedCentral&lt;/a&gt;, an open-access publisher
in biology and medicine that was looking into expanding into the physics-
and math-related fields. Eventually, my contact there told me that they
had other similar requests and were launching a journal that would be
led by Cameron Neylon, a British biophysicist and strong advocate of
openness and reproducibility in science. This was the start of ORC, and
for me the chance to meet other people sharing my concerns, some new and
some &lt;a class="reference external" href="http://fperez.org/"&gt;already&lt;/a&gt; &lt;a class="reference external" href="http://jarrodmillman.com/"&gt;old&lt;/a&gt; &lt;a class="reference external" href="http://ivory.idyll.org"&gt;friends&lt;/a&gt;.&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="http://www.salinafbc.com/Websites/fbcsalina/images/nerd_computer.gif" style="width: 230px;" /&gt;
&lt;p class="caption"&gt;ORC editor&lt;/p&gt;
&lt;/div&gt;
&lt;div class="figure align-left"&gt;
&lt;img alt="" src="http://researchsupportgroup.files.wordpress.com/2011/11/kayla1.jpg" style="width: 150px;" /&gt;
&lt;p class="caption"&gt;Conventional editor&lt;/p&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;strong&gt;Setting up the journal -&lt;/strong&gt; BioMedCentral was instrumental in setting
up the journal project. I quickly learned that, no surprises, a journal
is a product, like anything else, and it must find customers. Here, as
we were launching an open access journal, the customers were authors.
This is where a journal faces a chicken and egg problem: to be
recognised it needs high-visibility publications, but authors will
submit only to journals that they know. The main tools to overcome this
challenge are communication and advocacy. I then realized that these
really weren’t my strong points. Cameron Neylon absolutely shined on
this side, with very enthusiastic &lt;a class="reference external" href="http://cameronneylon.net/blog/open-research-computation-an-ordinary-journal-with-extraordinary-aims/"&gt;communications&lt;/a&gt; and an incredibly
active &lt;a class="reference external" href="https://twitter.com/#!/CameronNeylon"&gt;twitter account&lt;/a&gt;. On my side, I am a slow writer, and I tend to
speak Python code better than English language, which is not a strong
asset to be a journal editor.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wild editorial discussions -&lt;/strong&gt; The discussions in the editorial board
really thrilled me because they were centered on how to set standards to
improve the quality of code published. Looking in my mailbox, I see
discussions about code repositories, software testing, documentation or
licensing issues. This is not that surprising, given that a lot of the
editors were actually contributors to major software projects. It made
me very happy, as I have the feeling that, so far, most committees or
decision makers are clueless about software.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sand-in-the-gears-the-lack-of-uptake"&gt;
&lt;h2&gt;Sand in the gears: the lack of uptake&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A false start -&lt;/strong&gt; So ORC was launched in late 2010 and we had fantastic
feedback. I had the feeling that people were &lt;a class="reference external" href="http://neuralensemble.blogspot.fr/2010/12/open-research-computation-new-journal.html"&gt;genuinely&lt;/a&gt; &lt;a class="reference external" href="https://twitter.com/vaguery/status/15402390589018112"&gt;excited&lt;/a&gt;
about our program: changing the way computational science worked from
the inside, through the review process. The idea was that we had opened
a pre-submission call, and were waiting for a few good papers to be
submitted to launch the journal. However, it turned out that the papers
were slow to come. It took me a while to realize that there was
something wrong. But slowly we had to face the truth: many people were
excited about the journal, but most were sending their papers elsewhere.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What went wrong? -&lt;/strong&gt; If I really knew what went wrong, I would
probably not be discussing it here, but rather changing the world.
However, I can come up with a few guesses:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Working across communities is harder.&lt;/strong&gt; From the beginning we had
wanted to position the journal across communities, in order to foster
the sharing of tools for a greater good. The challenge is that a
central role of publication is nowadays to provide recognition. It is
much easier to achieve recognition in a given community than across
communities, and authors always preferred submitting their work to a
non-software-oriented journal in their field. We couldn’t fight at the
same time the battle for software quality and the battle for
inter-community work.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Setting the bar too high.&lt;/strong&gt; Many felt that the submission
requirements were too demanding, as a researcher expressed on a NeuroImaging
forum: &lt;a class="reference external" href="http://www.nitrc.org/forum/message.php?msg_id=3674"&gt;“I think it’s setting the bar
unrealistically high for most neuroimaging software”&lt;/a&gt;. While we had
originally shot for a very high test coverage (probably too high), we
had scaled it back quickly, simply stressing that editors and
reviewers would be looking closely at test coverage, documentation
and ease of installation. That said, the average researcher did not
share our ideals of raising the quality of scientific software.
Trying to ask only for excellent publications in a new and unproven
journal was probably unrealistic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Editors not willing to game the system.&lt;/strong&gt; I have watched a few
journal launches, and it seems to me that a common trick is to line
up articles that are created by the editors and their friends
specifically for the new journal. People come up with &lt;em&gt;opinion
papers&lt;/em&gt;, &lt;em&gt;reviews&lt;/em&gt;, &lt;em&gt;commentaries&lt;/em&gt; that only serve to generate an
identity to the journal. This did not happen for ORC, and I believe
that it is because &lt;a class="reference external" href="http://cameronneylon.net/blog/open-research-computation-an-ordinary-journal-with-extraordinary-aims"&gt;the editors themselves&lt;/a&gt; were not huge fans of
the low signal-to-noise ratio in modern scientific publishing
practice.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="the-times-they-are-a-changing"&gt;
&lt;h2&gt;The times they are a changing&lt;/h2&gt;
&lt;img alt="" class="align-right" src="http://www.pictures88.com/p/success/success_005.jpg" style="width: 35%;" /&gt;
&lt;p&gt;&lt;strong&gt;ORC is dead, long live ORC -&lt;/strong&gt; We did get a few submissions. ORC is
not coming to an end; it is morphing into a special thematic series in
&lt;a class="reference external" href="http://www.scfbm.org/"&gt;source code for biology and medicine&lt;/a&gt;. This solution is not completely
satisfactory, as it pushes what should have been a forum for exposing
good practices and good software into a smaller community. But at least
there is now a venue in which people can publish a paper about software
that they have been improving and maintaining, and not only about a new
algorithm.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Changing practices across the board -&lt;/strong&gt; Among the reasons we
had a hard time making a breakthrough is that authors were sending
their software papers to other journals, in particular journals not
specialized on software. While these papers are not getting the
attention of a review and editorial team expert on software development,
as we are setting up with ORC, this is still a good thing. Indeed it
shows that the times are changing and that recognition of software as a
scientific work is improving. I have been impressed to see that many
high profile journals have changed their editorial policies to
specifically accept software papers, or have created tracks dedicated to
software.&lt;/p&gt;
&lt;p&gt;Software is being slowly recognized as a pillar of modern scientific
research. We need to keep pushing to make sure that quality standards
are set and that the open-source scientific software grows into a mature
ecosystem focused on problem solving.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="publishing"></category><category term="science"></category><category term="computational science"></category><category term="programming"></category><category term="python"></category><category term="scientific computing"></category></entry><entry><title>Update on scikit-learn: recent developments for machine learning in Python</title><link href="https://gael-varoquaux.info/programming/update-on-scikit-learn-recent-developments-for-machine-learning-in-python.html" rel="alternate"></link><published>2012-05-09T00:12:00+02:00</published><updated>2012-05-09T00:12:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-05-09:/programming/update-on-scikit-learn-recent-developments-for-machine-learning-in-python.html</id><summary type="html">&lt;p&gt;Yesterday, we released version 0.11 of the &lt;a class="reference external" href="http://scikit-learn"&gt;scikit-learn&lt;/a&gt; toolkit for
machine learning in Python, and there was much rejoicing.&lt;/p&gt;
&lt;div class="section" id="major-features-gained-in-the-last-releases"&gt;
&lt;h2&gt;Major features gained in the last releases&lt;/h2&gt;
&lt;p&gt;In the last 6 months, there have been many things happening with the
scikit-learn. While I do not wish to give an exhaustive …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;Yesterday, we released version 0.11 of the &lt;a class="reference external" href="http://scikit-learn"&gt;scikit-learn&lt;/a&gt; toolkit for
machine learning in Python, and there was much rejoicing.&lt;/p&gt;
&lt;div class="section" id="major-features-gained-in-the-last-releases"&gt;
&lt;h2&gt;Major features gained in the last releases&lt;/h2&gt;
&lt;p&gt;In the last 6 months, there have been many things happening with the
scikit-learn. While I do not whish to give an exhaustive summary of
features added (it can be found &lt;a class="reference external" href="http://scikit-learn.org/stable/whats_new.html"&gt;here&lt;/a&gt;), let me list a few of the
additions that I personally find exciting.&lt;/p&gt;
&lt;div class="section" id="non-linear-prediction-models"&gt;
&lt;h3&gt;Non-linear prediction models&lt;/h3&gt;
&lt;p&gt;For complex prediction problems where there is no simple model
available, as in computer vision, non-linear models are handy. Good
examples of such models are those based on decision trees and model
averaging. For instance random forests are used in the Kinect to locate
body parts. As they are intrinsically complex, they may need a large
amount of training data. For this reason, they have been implemented in
the scikit-learn with special attention to computational efficiency.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees"&gt;Randomized Forests and extra-trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting"&gt;Gradient boosted regression trees&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
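&lt;p&gt;As an illustration, here is a minimal sketch of fitting these tree
ensembles with the scikit-learn estimator interface. The toy dataset and
parameter values are assumptions chosen for the example, not
recommendations:&lt;/p&gt;

```python
# Minimal sketch: tree-ensemble classifiers in scikit-learn.
# The synthetic dataset and hyper-parameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(n_estimators=50, random_state=0).fit(X, y)
    print(Model.__name__, "training accuracy:", clf.score(X, y))
```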
&lt;/div&gt;
&lt;div class="section" id="dealing-with-unlabeled-instances"&gt;
&lt;h3&gt;Dealing with unlabeled instances&lt;/h3&gt;
&lt;p&gt;It is often easier to gather unlabeled observations than labeled
ones. While predicting a quantity of interest is then harder
or simply impossible, mining this data can be useful.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/label_propagation.html"&gt;Semi-supervised learning&lt;/a&gt;: using unlabeled observations together with
labeled ones for better prediction.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/outlier_detection.html"&gt;Outlier/novelty detection&lt;/a&gt;: detect deviant observations.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/manifold.html"&gt;Manifold learning&lt;/a&gt;: discover a non-linear low-dimensional structure in
the data.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/clustering.html"&gt;Clustering&lt;/a&gt; with &lt;a class="reference external" href="http://scikit-learn.org/stable/modules/clustering.html#mini-batch-k-means"&gt;an algorithm&lt;/a&gt; that can scale to really large
datasets using an online approach: fitting small portions of the data one
after the other (Mini-batch k-means).&lt;/p&gt;
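&lt;p&gt;A minimal sketch of this online approach, pretending the data arrives
in small chunks (the two synthetic blobs are assumptions for the
example):&lt;/p&gt;

```python
# Minimal sketch: online clustering with mini-batch k-means.
# The two Gaussian blobs are illustrative data only.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
X = np.concatenate([rng.randn(500, 2), rng.randn(500, 2) + 5])

kmeans = MiniBatchKMeans(n_clusters=2, batch_size=100, n_init=3,
                         random_state=0)
for chunk in np.array_split(X, 10):  # feed the data one portion at a time
    kmeans.partial_fit(chunk)
print(kmeans.cluster_centers_)
```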
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/decomposition.html#dictionarylearning"&gt;Dictionary learning&lt;/a&gt;: learning patterns in the data that represent it
sparsely: each observation is a combination of a small number of patterns.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="sparse-models-when-very-few-descriptors-are-relevant"&gt;
&lt;h3&gt;Sparse models: when very few descriptors are relevant&lt;/h3&gt;
&lt;p&gt;In general, finding which descriptors are useful when there are many of
them is like finding a needle in a haystack: it is a very hard problem.
However, if you know that only a few of these descriptors actually carry
information, you are in a so-called &lt;em&gt;sparse&lt;/em&gt; problem, for which
specific approaches can work well.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/linear_model.html#orthogonal-matching-pursuit-omp"&gt;Orthogonal matching pursuit&lt;/a&gt;: a greedy and fast algorithm for very
sparse linear models&lt;/p&gt;
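&lt;p&gt;A minimal sketch of orthogonal matching pursuit on a synthetic sparse
problem. The noiseless data and the number of non-zero coefficients are
assumptions made for illustration:&lt;/p&gt;

```python
# Minimal sketch: fitting a very sparse linear model with OMP.
# Synthetic, noiseless data for illustration only.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.RandomState(0)
X = rng.randn(100, 50)
true_coef = np.zeros(50)
true_coef[[3, 17, 42]] = [1.5, -2.0, 1.0]  # only 3 relevant descriptors
y = X @ true_coef

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)
print("selected descriptors:", np.flatnonzero(omp.coef_))
```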
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/feature_selection.html#randomized-sparse-models"&gt;Randomized sparsity (randomized Lasso)&lt;/a&gt;: selecting the relevant
descriptors in noisy high-dimensional observations&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.org/stable/modules/generated/sklearn.covariance.GraphLasso.html#sklearn.covariance.GraphLasso"&gt;Sparse inverse covariance&lt;/a&gt;: learning graphs of connectivity from
correlations in the data&lt;/p&gt;
&lt;div class="section" id="getting-developpers-together-the-granada-sprint"&gt;
&lt;h4&gt;Getting developers together: the Granada sprint&lt;/h4&gt;
&lt;p&gt;Of course, such developments happen only because we have a great team of
&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/graphs/contributors"&gt;dedicated coders&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Getting along and working together is a critical part of the project. In
December 2011, we held the first international &lt;a class="reference external" href="http://scikit-learn"&gt;scikit-learn&lt;/a&gt; sprint in
Granada, on the side of the &lt;a class="reference external" href="http://nips.cc"&gt;NIPS conference&lt;/a&gt;. That was a while ago,
and I haven’t found time to blog about it, maybe because I was too busy
merging in the code produced :). Here is a small report from my point of
view. Better late than never.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="participants-from-all-over-the-globe"&gt;
&lt;h2&gt;Participants from all over the globe&lt;/h2&gt;
&lt;p&gt;This sprint was a big deal for us, because for the first time, thanks to
sponsor money, we were able to fly contributors from overseas and meet
the team in person. I was finally able to put faces on many of the
fantastic people that I knew only from the mailing
list.&lt;/p&gt;
&lt;p&gt;I really think that we must thank our sponsors, &lt;strong&gt;Google&lt;/strong&gt; and
&lt;strong&gt;tinyclues&lt;/strong&gt;, but also the PSF, in particular Jesse Noller and
especially &lt;strong&gt;Steve Holden&lt;/strong&gt;, whose help was absolutely instrumental in
getting sponsor money. This money is what made it possible to unite a
good fraction of the team, and it opened the door to great moments of
coding, and more.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="producing-code-lines-and-friendship"&gt;
&lt;h2&gt;Producing code lines and friendship&lt;/h2&gt;
&lt;p&gt;An important aspect of the sprint for me was that I really felt the team
being united. Granada is a great city and we spent fantastic moments
together. Now when I review code, I can often put a face on the author
of that code and remember a walk below the Alhambra or an evening in a
bar. I am sure it helps reviewing code!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="was-it-worth-the-money"&gt;
&lt;h2&gt;Was it worth the money?&lt;/h2&gt;
&lt;img alt="" src="attachments/skl_activity.png" style="width: 90%;" /&gt;
&lt;p&gt;I really appreciate that the sponsors did not ask for specific returns on
investment beyond acknowledgments, but I think that it is useful for us
to ask the question: was it worth the money? After all, we got around
$5000, and that’s a lot of money. First of all, as a side effect of the
sprint, people who had invested a huge amount of time in a machine
learning toolkit without asking anything in return got help to go to a
major machine learning conference.&lt;/p&gt;
&lt;p&gt;But was there a return on investment in terms of code? If you look at
the number of lines of code modified weekly (figure on the right), there
is a big spike in December 2011. That’s our sprint! Importantly, there
still is a lot of activity in the months following the sprint. This is
actually unusual, as active development happens more in the summer break
than during the winter, when our developers are busy working on papers
or teaching.&lt;/p&gt;
&lt;p&gt;The explanation is simple: we were thrilled by the sprint. Overall, it
was incredibly beneficial to the project. I am looking forward to the
next ones.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="machine learning"></category><category term="python"></category><category term="science"></category><category term="scikit-learn"></category><category term="sprint"></category></entry><entry><title>3 Google summer of code for scikit-learn and more…</title><link href="https://gael-varoquaux.info/programming/3-google-summer-of-code-for-scikit-learn-and-more.html" rel="alternate"></link><published>2012-04-23T22:25:00+02:00</published><updated>2012-04-23T22:25:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-04-23:/programming/3-google-summer-of-code-for-scikit-learn-and-more.html</id><summary type="html">&lt;p&gt;The &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; got 3 students accepted for the Google summer of
code.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://ibayer.blogspot.fr/"&gt;Imanuel Bayer&lt;/a&gt; will work on making our sparse linear models, for
regression and classification, faster. His proposal: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/ibayer/11001"&gt;Optimizing
sparse linear models using coordinate descent and strong rules&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.davidmarek.cz/"&gt;David Marek&lt;/a&gt; will implement multi-layer perceptrons for the scikit.
His proposal …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;The &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; got 3 students accepted for the Google summer of
code.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://ibayer.blogspot.fr/"&gt;Imanuel Bayer&lt;/a&gt; will work on making our sparse linear models, for
regression and classification, faster. His proposal: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/ibayer/11001"&gt;Optimizing
sparse linear models using coordinate descent and strong rules&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.davidmarek.cz/"&gt;David Marek&lt;/a&gt; will implement multi-layer perceptrons for the scikit.
His proposal: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/h4wk_cz/24001"&gt;Multilayer Perceptron&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://blog.vene.ro/"&gt;Vlad Niculae&lt;/a&gt; will work on speeding up the library in general,
catching all the low hanging fruits, and the ones a bit higher. His
proposal: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/vladn/26002"&gt;Need for scikit-learn speed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, other related projects got exciting proposals, for instance
&lt;a class="reference external" href="http://statsmodels.sourceforge.net/"&gt;&lt;strong&gt;statsmodels&lt;/strong&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Divyanshu Bandil: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/divyanshu/34002"&gt;Extension of Linear to Non Linear Models in
Statsmodels Python module&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Alexandre Crayssac: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/alexandreyc/8001"&gt;estimating system of equations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Justin Grana: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/j_grana/8001"&gt;empirical Likelihood in Statsmodels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Georgi Panterov: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/gpanterov/7001"&gt;nonparametric estimation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and &lt;a class="reference external" href="http://www.cython.org"&gt;Cython&lt;/a&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Philip Herron: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/redbrain1123/28002"&gt;pxd generation using gcc-python-plugin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Mark Florisson: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/markflorisson88/30002"&gt;Fast Numerical Computing with Cython&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;finally, in &lt;a class="reference external" href="http://pandas.pydata.org/"&gt;Pandas&lt;/a&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Vytautas Jancauskas: &lt;a class="reference external" href="http://www.google-melange.com/gsoc/project/google/gsoc2012/bucket_brigade/42002"&gt;Plots in pandas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Congratulations to all of the students. This is going to be an exciting
summer.&lt;/p&gt;
</content><category term="programming"></category><category term="machine learning"></category><category term="programming"></category><category term="scipy"></category><category term="scikit-learn"></category></entry><entry><title>The problems of low statistical power and publication bias</title><link href="https://gael-varoquaux.info/science/the-problems-of-low-statistical-power-and-publication-bias.html" rel="alternate"></link><published>2012-04-14T16:16:00+02:00</published><updated>2012-04-14T16:16:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-04-14:/science/the-problems-of-low-statistical-power-and-publication-bias.html</id><summary type="html">&lt;img alt="" class="align-right" src="http://idoubtit.files.wordpress.com/2010/12/coldfusion.jpg" style="width: 30%;" /&gt;
&lt;p&gt;Lately, I have been in a mood of scientific scepticism: I have the feeling
that the worldwide academic system is more and more failing to produce
useful research. Christophe Lalanne’s &lt;a class="reference external" href="https://twitter.com/#!/chlalanne"&gt;twitter feed&lt;/a&gt; led me to an
interesting article in a non-mainstream journal: &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;A farewell to
Bonferroni: the problems of low …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;img alt="" class="align-right" src="http://idoubtit.files.wordpress.com/2010/12/coldfusion.jpg" style="width: 30%;" /&gt;
&lt;p&gt;Lately, I have been in a mood of scientific scepticism: I have the feeling
that the worldwide academic system is more and more failing to produce
useful research. Christophe Lalanne’s &lt;a class="reference external" href="https://twitter.com/#!/chlalanne"&gt;twitter feed&lt;/a&gt; led me to an
interesting article in a non-mainstream journal: &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;A farewell to
Bonferroni: the problems of low statistical power and publication
bias&lt;/a&gt;, by Shinichi Nakagawa.&lt;/p&gt;
&lt;p&gt;Each study performed has a probability of being wrong. Thus performing
many studies will lead to some wrong conclusions by chance. This is
known in statistics as the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Multiple_comparisons"&gt;multiple comparisons&lt;/a&gt; problem. When a
working hypothesis is not verified empirically in a study, this null
finding is seldom reported, leading to what is called &lt;em&gt;publication
bias&lt;/em&gt;: &lt;strong&gt;discoveries are further studied; negative results are usually
ignored&lt;/strong&gt; (Y. Benjamini). Because only &lt;em&gt;discoveries&lt;/em&gt;, called
&lt;em&gt;detections&lt;/em&gt; in statistical terms, are reported, &lt;strong&gt;published results
contain more false detections than the individual experiments and very
little false negatives&lt;/strong&gt;. Arguably, the original investigators have
corrected using the understanding that they gained the experiments
performed and account in a &lt;em&gt;post-hoc analysis&lt;/em&gt; for the fact that some of
their working hypothesis could not have been correct. Such a correction
can work only in a field where there is a good mechanistic
understanding, or models, such as physics, but in my opinion not in life
and social sciences.&lt;/p&gt;
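&lt;p&gt;A small simulation makes the multiple comparisons problem concrete:
testing many hypotheses on pure noise yields some significant results by
chance. The sample sizes and thresholds below are assumptions chosen for
illustration:&lt;/p&gt;

```python
# Minimal sketch: the multiple comparisons problem on pure noise.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
n_tests, alpha = 100, 0.05
# 100 independent t-tests on samples drawn from a null (zero-mean) model
p_values = np.array([stats.ttest_1samp(rng.randn(30), 0.0).pvalue
                     for _ in range(n_tests)])
print((p_values < alpha).sum(), "false detections out of", n_tests)
# Bonferroni controls the family-wise error rate by dividing the threshold
print((p_values < alpha / n_tests).sum(), "detections after Bonferroni")
```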
&lt;p&gt;Let me quote some relevant extracts of &lt;a class="reference external" href="http://beheco.oxfordjournals.org/content/15/6/1044.short"&gt;the article&lt;/a&gt;, as you may never
have access to it thanks to the way scientific publishing works:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
&lt;p&gt;Recently, Jennions and Moller (2003) carried out a meta-analysis
on statistical power in the field of behavioral ecology and animal
behavior, reviewing 10 leading journals including Behavioral
Ecology. Their results showed dismayingly low average statistical
power (note that a meta-analytic review of statistical power is
different from post hoc power analysis as criticized in Hoenig and
Heisey, 2001). The statistical power of a null hypothesis (Ho)
significance test is the probability that the test will reject Ho
when a research hypothesis (Ha) is true.&lt;/p&gt;
&lt;p&gt;…&lt;/p&gt;
&lt;p&gt;The meta-analysis on statistical power by Jennions and Moller
(2003) revealed that, in the field of behavioral ecology and animal
behavior, statistical power of less than 20% to detect a small
effect and power of less than 50% to detect a medium effect existed.
This means, for example, that the average behavioral scientist
performing a statistical test has a greater probability of making a
Type II error (or beta) (&lt;em&gt;i.e.&lt;/em&gt;, not rejecting Ho when Ho is false;
note that statistical power is equal to 1 - beta) than if they had
flipped a coin, when an experiment effect is of medium size.&lt;/p&gt;
&lt;p&gt;…&lt;/p&gt;
&lt;p&gt;Imagine that we conduct a study where we measure as many relevant
variables as possible, 10 variables, for example. We find only two
variables statistically significant. Then, what should we do? We
could decide to write a paper highlighting these two variables (and
not reporting the other eight at all) as if we had hypotheses about
the two significant variables in the first place. Subsequently, our
paper would be published. Alternatively, we could write a paper
including all 10 variables. When the paper is reviewed, referees
might tell us that there were no significant results if we had
“appropriately” employed Bonferroni corrections, so that our study
would not be advisable for publication. However, the latter paper is
scientifically more important than the former paper. For example, if
one wants to conduct a meta-analysis to investigate an overall
effect in a specific area of study, the latter paper is five times
more informative than the former paper. In the long term,
statistical significance of particular tests may be of trivial
importance (if not always), although, in the short term, it makes
papers publishable. Bonferroni procedures may, in part, be
preventing the accumulation of knowledge in the field of behavioral
ecology and animal behavior, thus hindering the progress of the
field as science.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;img alt="" class="align-right" src="http://farm6.staticflickr.com/5206/5330056727_a98c97c3c5.jpg" style="width: 50%;" /&gt;
&lt;p&gt;Some of the concerns raised here are partly a criticism of Bonferroni
corrections, &lt;em&gt;i.e.&lt;/em&gt; in technical terms correcting for &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Familywise_error_rate"&gt;family-wise error
rate (FWER)&lt;/a&gt;. It is actually the message that the author wants to
convey in his paper. Proponents of controlling for &lt;a class="reference external" href="http://en.wikipedia.org/wiki/False_discovery_rate"&gt;false discovery rate
(FDR)&lt;/a&gt; argue that an investigator shouldn’t be penalized for asking
more questions, and the fraction of errors in the answers should be
controlled, rather than the absolute value. That said, FDR, while
useful, does not answer the problems of publication bias.&lt;/p&gt;
</content><category term="science"></category><category term="statistics"></category><category term="computational science"></category><category term="science"></category></entry><entry><title>Want features? Just code</title><link href="https://gael-varoquaux.info/programming/want-features-just-code.html" rel="alternate"></link><published>2012-03-08T22:46:00+01:00</published><updated>2012-03-08T22:46:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-03-08:/programming/want-features-just-code.html</id><summary type="html">&lt;p&gt;Somebody just sent an email on a user’s mailing list for an open-source
scientific package entitled &lt;strong&gt;“Feature foo: why is package bar
not&amp;nbsp;up to the task?”&lt;/strong&gt;. To quote him:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
Is there ANY plan for having such a module in &lt;em&gt;package bar&lt;/em&gt;?? I
think&amp;nbsp;(personally) that this is a …&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;Somebody just sent an email on a user’s mailing list for an open-source
scientific package entitled &lt;strong&gt;“Feature foo: why is package bar
not&amp;nbsp;up to the task?”&lt;/strong&gt;. To quote him:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
Is there ANY plan for having such a module in &lt;em&gt;package bar&lt;/em&gt;?? I
think&amp;nbsp;(personally) that this is a MUST DO. This is typically the
type of&amp;nbsp;routines that I hear people use in e.g., idl etc. If this
could be an&amp;nbsp;optimised, fast (and easy to use) routine, all the
better.&lt;/blockquote&gt;
&lt;p&gt;As someone who spends a fair amount of time working on open
source software, I hear such remarks quite often. I am finding it harder
and harder not to react negatively to these emails. Now, I cannot
consider myself a contributor to &lt;em&gt;package bar&lt;/em&gt;, and thus I can claim
that I am not taking your comment personally.&lt;/p&gt;
&lt;p&gt;Why aren’t packages up to the task? Well, the answer is quite
simple: because they are developed by volunteers on their spare
time, too often late at night, or by companies that put some of their
profits into open source rather than into locking down a market. 90% of the time
the reason a feature isn’t as good as you would want it to be is
lack of time.&lt;/p&gt;
&lt;p&gt;I personally find that suggesting that somebody else should put more
of the time and money they are already giving away into improving a
feature that you need is almost insulting.&lt;/p&gt;
&lt;p&gt;I am aware that people do not realize how small the group of people
that&amp;nbsp;develop and maintain their toys is. Borrowing the figure below from
&lt;a class="reference external" href="http://www.euroscipy.org/file/6459?vid=download"&gt;Fernando Perez’s talk&amp;nbsp;at Euroscipy&lt;/a&gt;,&amp;nbsp;the number of people that do 90%
of the grunt work to get the core&amp;nbsp;scientific Python ecosystem going is
around two handfuls:&lt;/p&gt;
&lt;img alt="" src="attachments/fperez_euroscipy_2011_contributors.jpg" style="width: 70%;" /&gt;
&lt;p&gt;I’d like to think that this recruitment problem is a lack of skill set:
users that have the ability to contribute are just too rare. This is not
entirely true: there are scores of skilled people on the mailing lists.
The poster himself mentioned in his email that he was developing a package.
I personally started contributing not knowing anything about software
development. I struggled, and I did the grunt work: maintaining wikis,
answering questions on mailing lists, and writing documentation. These
easier tasks were useful to the community, I think, but most
importantly, they taught me a lot because I was investing energy in
them.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;&lt;strong&gt;If people want things to improve, they will have more&amp;nbsp;successes
sending in pull requests than messages on mailing list that&amp;nbsp;sound
condescending to my ears.&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I hope that I haven’t overreacted too badly :); that email set me off.
That said, I am not sure that people realize how much they owe to the
open source developers breaking their backs on the packages they use.&lt;/p&gt;
&lt;img alt="" src="attachments/fperez_euroscipy_2011_i_want_you.jpg" style="width: 50%;" /&gt;
&lt;p&gt;All credit for images goes to &lt;a class="reference external" href="http://fperez.org/"&gt;Fernando Perez&lt;/a&gt;&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="scientific computing"></category><category term="community"></category></entry><entry><title>Book review: NumPy 1.5 Beginner’s guide</title><link href="https://gael-varoquaux.info/programming/book-review-numpy-15-beginners-guide.html" rel="alternate"></link><published>2012-01-10T08:57:00+01:00</published><updated>2012-01-10T08:57:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-01-10:/programming/book-review-numpy-15-beginners-guide.html</id><summary type="html">&lt;p&gt;Packt publishing sent me a copy of &lt;a class="reference external" href="http://www.packtpub.com/numpy-1-5-using-real-world-examples-beginners-guide/Book"&gt;NumPy 1.5 Beginner’s guide&lt;/a&gt; by Ivan
Idris.&lt;/p&gt;
&lt;p&gt;The book actually covers more than only &lt;a class="reference external" href="http://numpy.scipy.org/"&gt;numpy&lt;/a&gt;: it is a full
introduction to numerical computing with Python. The &lt;a class="reference external" href="http://www.packtpub.com/toc/numpy-15-beginners-guide-table-contents"&gt;table of
contents&lt;/a&gt; is the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;NumPy Quick Start&lt;/li&gt;
&lt;li&gt;Beginning with NumPy Fundamentals&lt;/li&gt;
&lt;li&gt;Get into …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;Packt publishing sent me a copy of &lt;a class="reference external" href="http://www.packtpub.com/numpy-1-5-using-real-world-examples-beginners-guide/Book"&gt;NumPy 1.5 Beginner’s guide&lt;/a&gt; by Ivan
Idris.&lt;/p&gt;
&lt;p&gt;The book actually covers more than only &lt;a class="reference external" href="http://numpy.scipy.org/"&gt;numpy&lt;/a&gt;: it is a full
introduction to numerical computing with Python. The &lt;a class="reference external" href="http://www.packtpub.com/toc/numpy-15-beginners-guide-table-contents"&gt;table of
contents&lt;/a&gt; is the following:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;NumPy Quick Start&lt;/li&gt;
&lt;li&gt;Beginning with NumPy Fundamentals&lt;/li&gt;
&lt;li&gt;Get into Terms with Commonly Used Functions&lt;/li&gt;
&lt;li&gt;Convenience Functions for Your Convenience&lt;/li&gt;
&lt;li&gt;Working with Matrices and ufuncs&lt;/li&gt;
&lt;li&gt;Move Further with NumPy Modules&lt;/li&gt;
&lt;li&gt;Peeking Into Special Routines&lt;/li&gt;
&lt;li&gt;Assure Quality with Testing&lt;/li&gt;
&lt;li&gt;Plotting with Matplotlib&lt;/li&gt;
&lt;li&gt;When NumPy is Not Enough: SciPy and Beyond&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The book is easy to read, as it requires no specific expertise other
than knowing basic Python programming. It is full of examples and
exercises, which is really great for learning. I find the style of the
author, Ivan Idris, particularly amusing and relaxing, engaging the
reader with questions, challenges, or even jokes (&lt;em&gt;“Have a go hero”&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;With regards to the formatting and the print, the book is written in
large fonts, with sectioning information, tips and exercises clearly
standing out.&lt;/p&gt;
&lt;p&gt;It is full of practical information, such as how to install the
software, or where to get help. Finally, one thing that I appreciated
is that the examples are typed in &lt;a class="reference external" href="http://ipython.org/"&gt;IPython&lt;/a&gt;. Each time I teach, I like
to use IPython, because it is full of features to help plotting,
debugging and profiling numerical code. The book even has a little
introduction to some useful IPython features.&lt;/p&gt;
&lt;p&gt;After an introduction to the work flow, the book explores array
manipulation such as creation or reshaping, followed by some simple
numerics and the battery of array-based operations on functions and
polynomials. Then it presents linear algebra and signal processing
basics (FFT). It also covers the financial functions that are present in
numpy and mentions testing, which is very important to achieve quality
code. The book finishes with matplotlib and scipy, two modules that are
important to know to go further.&lt;/p&gt;
&lt;p&gt;The examples are mostly drawn from statistics or financial applications,
such as computing running averages on stock quotes. Basic math
explanations, such as the definition of the Moore-Penrose
pseudo-inverse, are given when needed.&lt;/p&gt;
&lt;p&gt;To conclude, I enjoyed this book and I think that it is a nice addition
to my library. It delivers exactly what its title promises: it is well-suited for
beginners wanting to learn numpy. On the other hand, I would not
recommend it as a reference material, or as a book to learn more general
scientific or numerical computing with Python.&lt;/p&gt;
</content><category term="programming"></category><category term="scipy"></category><category term="python"></category><category term="scientific computing"></category><category term="books"></category></entry><entry><title>Joblib beta release: fast compressed persistence + Python 3</title><link href="https://gael-varoquaux.info/programming/joblib-beta-release-fast-compressed-persistence-python-3.html" rel="alternate"></link><published>2012-01-07T19:27:00+01:00</published><updated>2012-01-07T19:27:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2012-01-07:/programming/joblib-beta-release-fast-compressed-persistence-python-3.html</id><summary type="html">&lt;div class="section" id="joblib-0-6-better-i-o-and-python-3-support"&gt;
&lt;h2&gt;Joblib 0.6: better I/O and Python 3 support&lt;/h2&gt;
&lt;p&gt;Happy new year, every one. I have just released &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Out-of-core_algorithm"&gt;Joblib&lt;/a&gt; 0.6.0 beta.
The highlights of the 0.6 release are a reworked enhanced pickler, and
Python 3 support.&lt;/p&gt;
&lt;p&gt;Many thanks go to the contributors to the 0.5 …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="joblib-0-6-better-i-o-and-python-3-support"&gt;
&lt;h2&gt;Joblib 0.6: better I/O and Python 3 support&lt;/h2&gt;
&lt;p&gt;Happy new year, every one. I have just released &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Out-of-core_algorithm"&gt;Joblib&lt;/a&gt; 0.6.0 beta.
The highlights of the 0.6 release are a reworked enhanced pickler, and
Python 3 support.&lt;/p&gt;
&lt;p&gt;Many thanks go to the contributors to the 0.5.X series (Fabian
Pedregosa, Yaroslav Halchenko, Kenneth C. Arnold, Alexandre Gramfort,
Lars Buitinck, Bala Subrahmanyam Varanasi, Olivier Grisel, Ralf Gommers,
Juan Manuel Caicedo Carvajal, and myself). In particular Fabian made
sure that Joblib worked under Python 3.&lt;/p&gt;
&lt;p&gt;In this blog post, I’d like to discuss the compressed persistence
engine in a bit more detail, as it nicely illustrates key factors in
implementing and using compressed serialization.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="fast-compressed-persistence"&gt;
&lt;h2&gt;Fast compressed persistence&lt;/h2&gt;
&lt;p&gt;One of the key components of joblib is its ability to persist arbitrary
Python objects, and read them back very quickly. It is particularly
efficient for &lt;strong&gt;containers that do their heavy lifting with numpy
arrays&lt;/strong&gt;. The trick to achieving great speed has been to save the numpy
arrays in separate files, and load them via &lt;strong&gt;memmapping&lt;/strong&gt;.&lt;/p&gt;
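The dump/load round-trip described above can be sketched with joblib's public API (a minimal illustration, not code from the release; the object and file path are arbitrary):

```python
# Minimal sketch of joblib's persistence: numpy arrays inside the
# container are stored so that they can be read back via memmapping
# (in the 0.6 series, as separate files next to the pickle).
import os
import tempfile

import numpy as np
import joblib

obj = {"name": "atlas", "data": np.random.rand(1000, 10)}
path = os.path.join(tempfile.mkdtemp(), "obj.pkl")
joblib.dump(obj, path)

# mmap_mode="r" maps the arrays from disk; data is paged in lazily
# when actually accessed, which is why the load time looks negligible.
loaded = joblib.load(path, mmap_mode="r")
print(loaded["data"].shape)
```

The `compress` argument of `joblib.dump` enables the compressed storage discussed below.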
&lt;p&gt;However, one drawback of joblib is that the caching mechanism may end
up using a lot of disk space. As a result, there is strong interest in
having &lt;strong&gt;compressed storage&lt;/strong&gt;, provided it doesn’t slow down the library
too much. Another use case that I have in mind for fast compressed
persistence is implementing &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Out-of-core_algorithm"&gt;out-of-core computation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There are some great compressed I/O libraries for Python, for instance
&lt;a class="reference external" href="http://pytables.github.com/index.html"&gt;Pytables&lt;/a&gt;. You may wonder why the need to code yet another one. The
answer is that joblib is &lt;strong&gt;pure Python, depending only on the standard
library&lt;/strong&gt; (numpy is optional), but also that the goal here is
&lt;strong&gt;black-box persistence of arbitrary objects&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="section" id="comparing-i-o-speed-and-compression-to-other-libraries"&gt;
&lt;h3&gt;Comparing I/O speed and compression to other libraries&lt;/h3&gt;
&lt;p&gt;Implementing efficient compressed storage was a bit of a struggle and I
learned a lot. Rather than going into the details straight away, let me
first discuss a few benchmarks of the resulting code. Benchmarking such a
feature is very hard: first because you are fighting the disk
cache, second because performance depends very much on the data at
hand (some data compresses better than others), and last because there are
three interesting metrics: disk space used, write speed, and read speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dataset used&lt;/strong&gt; - I chose to compare the different strategies on some
datasets that I work with, namely the probabilistic brain atlases MNI
1mm (62MB uncompressed) and Juelich 2mm (105MB uncompressed). Whether
the data is represented as a Fortran-ordered array, or a C-ordered array
is important for the I/O performance. This data is normally stored to
disk compressed using the domain-specific Nifti format (&lt;em&gt;.nii&lt;/em&gt; files),
accessed in Python with the &lt;a class="reference external" href="http://nipy.sourceforge.net/nibabel/"&gt;Nibabel&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Libraries used&lt;/strong&gt; - I benched different compression strategies in
joblib against Nibabel’s Nifti I/O, compressed or not, and against using
Pytables to store the data buffer (without the meta-information).
Pytables exposes a variety of compression strategies, with different
speed trade-offs. In addition, I benched numpy’s builtin
&lt;em&gt;savez_compressed&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I would like to stress that I am comparing a general purpose persistence
engine (joblib) to specific I/O libraries either optimized for the data
(Nifti), or requiring some massaging to enable persistence (pytables).&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/disk.png" style="width: 66%;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/write.png" style="width: 66%;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/read.png" style="width: 66%;" /&gt;
&lt;p&gt;&lt;em&gt;Comparing to other libraries&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Actual numbers can be found &lt;a class="reference external" href="attachments/joblib_rel_0.6_speed/results_nii.csv"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Take home messages&lt;/strong&gt; - The graphs are not crystal-clear, but a few
tendencies appear:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Pytables with LZO or blosc compression is the king of the hill for
read and write speed.&lt;/li&gt;
&lt;li&gt;I/O of compressed data is often faster than with uncompressed data
for a good compression algorithm.&lt;/li&gt;
&lt;li&gt;Joblib with Zlib compression level 1 performs honorably in terms of
speed with only the Python standard library and no compiled code.&lt;/li&gt;
&lt;li&gt;Read time of memmapping (with nibabel or joblib) is negligible (it
is tiny on the graphs), however the loading time appears when you
start accessing the data.&lt;/li&gt;
&lt;li&gt;Passing in arrays with a memory layout (Fortran versus C order) that
the I/O library doesn’t expect can really slow down writing.&lt;/li&gt;
&lt;li&gt;Compressing with Zlib compression-level 1 gets you most of the disk
space gains for a reasonable cost in write/read speed.&lt;/li&gt;
&lt;li&gt;Compressing with Zlib compression-level 9 (not shown on the figures)
doesn’t buy you much in disk space, but costs a lot in writing time.&lt;/li&gt;
&lt;/ul&gt;
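The last two bullet points can be illustrated with the standard-library zlib module (a quick sketch on synthetic, regularly-structured data, not the atlases used in the benchmarks):

```python
# Compare zlib compression levels on data that compresses well.
# Level 1 already captures most of the size reduction; level 9
# mainly costs extra CPU time for little disk-space gain.
import time
import zlib

import numpy as np

# Highly regular data, standing in for the redundant arrays discussed above
data = np.arange(2_000_000, dtype=np.int32).tobytes()

for level in (1, 3, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(compressed) / len(data)
    print(f"level {level}: ratio {ratio:.3f}, time {elapsed:.3f}s")
```

On real data the exact ratios differ, but the shape of the trade-off is the same.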
&lt;/div&gt;
&lt;div class="section" id="benching-datasets-richer-than-pure-arrays"&gt;
&lt;h3&gt;Benching datasets richer than pure arrays&lt;/h3&gt;
&lt;p&gt;The datasets used so far are pretty much composed of one big array, a 4D
smooth spatial map. I wanted to test on more datasets, to see how the
performances varied with data type and richness. For this, I used the
datasets of the &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt;, real life data of various nature,
described &lt;a class="reference external" href="http://scikit-learn.org/stable/datasets/index.html"&gt;here&lt;/a&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;20 news&lt;/strong&gt; - 20 usenet news group: this data mainly consists of
text, and not numpy arrays.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LFW people&lt;/strong&gt; - Labeled faces in the wild, many pictures of
different people’s face.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LFW pairs&lt;/strong&gt; - Labeled faces in the wild, pairs of pictures for each
individual. This is a high entropy dataset, it does not have much
redundant information.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Olivetti&lt;/strong&gt; - Olivetti dataset: centered pictures of faces.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Juelich(F)&lt;/strong&gt; - Our previous Juelich atlas&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Big people&lt;/strong&gt; - The LFW people dataset, but repeated 4 times, to put
a strain on memory resources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MNI(F)&lt;/strong&gt; - Our previous MNI atlas&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Species&lt;/strong&gt; - Occurrence of species measured in Latin America, with a
lot of missing data.&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/joblib_disk.png" style="width: 50%;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/joblib_write.png" style="width: 50%;" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;img alt="" class="align-center" src="attachments/joblib_rel_0.6_speed/joblib_read.png" style="width: 50%;" /&gt;
&lt;p&gt;Actual numbers can be found
&lt;a class="reference external" href="attachments/joblib_rel_0.6_speed/joblib_results.csv"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What this tells us&lt;/strong&gt; - The main message from these benchmarks is that
datasets with redundant information, i.e. that compress well, give fast
I/O. This is not surprising. In particular, good compression can give
good I/O on text (20 news). Another result, more of a sanity check, is
that compressed I/O on big data (Big people) works as well as on
smaller data. Earlier code would start to swap. Finally, I conclude from
these graphs that compression levels from 1 to 3 buy you most of the
gains for reasonable costs, and that going up to 9 is not recommended,
unless you know that your data can be compressed a lot (species).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="lessons-learned"&gt;
&lt;h3&gt;Lessons learned&lt;/h3&gt;
&lt;p&gt;I’ll keep this paragraph short, because the information is really in
&lt;a class="reference external" href="https://github.com/joblib/joblib/blob/0.5.X/joblib/numpy_pickle.py"&gt;joblib’s code and comments&lt;/a&gt;. Don’t hesitate to have a look, it’s
BSD-licensed, so you are free to borrow what you please.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Memory copies of arrays, but also of strings and byte streams, can
really slow you down with big data.&lt;/li&gt;
&lt;li&gt;To avoid copies with numpy arrays, fully embrace numpy’s strided
memory model. For instance, you do not need to save arrays in C
order, if they are given to you in a different order. Accessing the
memory in the wrong striding direction explains the poor write
performance of pytables on Fortran-ordered Juelich.&lt;/li&gt;
&lt;li&gt;When dealing with the file system, the OS makes so much magic (e.g.
prefetching) that clever hacks tend not to work: always benchmark.&lt;/li&gt;
&lt;li&gt;Depending on the size of the data, it may be more efficient to store
subsets in different files: it introduces ‘chunks’ that avoid filling
up memory too much (parameter &lt;em&gt;cache_size&lt;/em&gt; in joblib’s code). In
addition, data of a same nature tends to compress better.&lt;/li&gt;
&lt;li&gt;The I/O stream or file object interfaces are abstractions that can
hide the data movement and the creation of large temporaries. After
experiments with GZipFile and StringIO/BytesIO I found it more
efficient to fall back to passing around big buffer objects, numpy
arrays, or strings.&lt;/li&gt;
&lt;li&gt;For reasons 4 and 5, I ended up avoiding the gzip module: raw access
to zlib with buffers gives more control. This explains a good
part of the differences in read speed for pure arrays with numpy’s
&lt;em&gt;savez_compressed&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;
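Point 6 can be sketched as follows (a simplified illustration, not joblib's actual code): feeding a zlib compression object with buffer slices avoids both the gzip module and the large intermediate copies that GzipFile/BytesIO tend to create.

```python
# Compress a numpy array's memory chunk by chunk with raw zlib,
# without going through GzipFile or BytesIO temporaries.
import zlib

import numpy as np

arr = np.zeros(1_000_000, dtype=np.float64)
buf = memoryview(arr).cast("B")      # view of the raw bytes, no copy

compressor = zlib.compressobj(1)     # compression level 1
chunk_size = 1024 * 1024
chunks = []
for start in range(0, len(buf), chunk_size):
    chunks.append(compressor.compress(buf[start:start + chunk_size]))
chunks.append(compressor.flush())
compressed = b"".join(chunks)
print(len(compressed), "compressed bytes for", buf.nbytes, "raw bytes")
```

The chunking also plays the role described in point 4: only one chunk's worth of compressed output needs to live in memory at a time.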
&lt;p&gt;One of my conclusions for joblib, is that I’ll probably use Pytables as
an optional backend for persistence in a future release.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="details-on-the-benchmarks"&gt;
&lt;h3&gt;Details on the benchmarks&lt;/h3&gt;
&lt;p&gt;These benchmarks were run on a Dell Latitude D630 laptop. That’s a
dual-core Intel Core2 Duo box, with 2MB of CPU cache.&lt;/p&gt;
&lt;p&gt;The code for the benchmarks below can be found on &lt;a class="reference external" href="https://gist.github.com/1551250"&gt;a gist&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="thanks"&gt;
&lt;h3&gt;Thanks&lt;/h3&gt;
&lt;p&gt;I’d like to thank Francesc Alted for the very useful feedback he gave on this
topic. In particular, the &lt;a class="reference external" href="http://sourceforge.net/mailarchive/message.php?msg_id=28609087"&gt;following thread&lt;/a&gt; on the pytables
mailing-list may be of interest to the reader.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="joblib"></category><category term="python"></category><category term="scientific computing"></category><category term="scipy"></category><category term="scikit-learn"></category></entry><entry><title>Scikit-learn NIPS 2011 sprint: international thanks to our sponsors</title><link href="https://gael-varoquaux.info/programming/scikit-learn-nips-2011-sprint-international-thanks-to-our-sponsors.html" rel="alternate"></link><published>2011-11-18T14:47:00+01:00</published><updated>2011-11-18T14:47:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-11-18:/programming/scikit-learn-nips-2011-sprint-international-thanks-to-our-sponsors.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;The NIPS conference: time for a sprint.&lt;/strong&gt; The &lt;a class="reference external" href="http://nips.cc/"&gt;NIPS conference&lt;/a&gt;, one
of the major conferences in machine learning, is hosted in Granada this
year. I believe that it is the first time that it is hosted in Europe.
As many of the &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; developers are part of the wider NIPS …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;The NIPS conference: time for a sprint.&lt;/strong&gt; The &lt;a class="reference external" href="http://nips.cc/"&gt;NIPS conference&lt;/a&gt;, one
of the major conferences in machine learning, is hosted in Granada this
year. I believe that it is the first time that it is hosted in Europe.
As many of the &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; developers are part of the wider NIPS
community, and many live in Europe, we jumped at the chance to
organize a truly international sprint: the &lt;a class="reference external" href="http://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events"&gt;NIPS 2011 scikit-learn
sprint&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Finding money.&lt;/strong&gt; As often with open source development, a lot of our
contributors are young people, investing their free time outside of any
request from their hierarchy. In such a situation, it can be hard to
find travel money. So we started looking for sponsors. We needed to find
a decent sum of money, as we were flying people in from places such as
the West coast of the US, or even Japan. The good news is that we found
money, and between supervisors pitching in, universities giving travel
grants, and our generous sponsors, there will be an impressive list of
contributors from all over the world at the sprint.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thanks to our sponsors.&lt;/strong&gt; The first people that we need to thank are
Google, who gave us a sizable sponsorship, and the &lt;a class="reference external" href="http://www.python.org/psf/"&gt;PSF&lt;/a&gt;, who made
Google’s sponsorship possible through their accounting and sprints
programs. We also need to thank our other sponsors, namely
&lt;a class="reference external" href="http://www.tinyclues.com/"&gt;Tinyclues&lt;/a&gt;. Thanks to these sponsors, and additional investment from
many universities and research groups, we have been able to gather a
total of 12 contributors in Granada, a handful coming from overseas.
Also, we are indebted to the &lt;a class="reference external" href="http://www.ugr.es/"&gt;University of Granada&lt;/a&gt;, and the Gnu/Linux
Granada Group (GGG), who are providing hosting for the sprint, as well
as Régine Bricquet, from INRIA, who did a lot of the trip planning for
the sponsored people.&lt;/p&gt;
&lt;p&gt;I am very much looking forward to the sprint. It will be the first time
that I meet many of the contributors in real life, and judging by the
warmth of the on-line exchanges, it will be a great moment. Besides,
Granada is known to be a lively and historical city.&lt;/p&gt;
&lt;p&gt;If you are around and want to join us to work on Python in machine
learning, send us a mail on the &lt;a class="reference external" href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general"&gt;mailing list&lt;/a&gt;.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="scikit-learn"></category><category term="scipy"></category><category term="conferences"></category><category term="sprint"></category></entry><entry><title>Cython example of exposing C-computed arrays in Python without data copies</title><link href="https://gael-varoquaux.info/programming/cython-example-of-exposing-c-computed-arrays-in-python-without-data-copies.html" rel="alternate"></link><published>2011-09-28T23:42:00+02:00</published><updated>2011-09-28T23:42:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-09-28:/programming/cython-example-of-exposing-c-computed-arrays-in-python-without-data-copies.html</id><summary type="html">&lt;p&gt;Some advice on passing arrays from C to Python avoiding copies. I use
Cython as I have found the code to be more maintainable than hand-written
Python C-API code.&lt;/p&gt;
&lt;p&gt;I found out that there was no self-contained example of creating numpy
arrays from existing data in Cython. Thus I created …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Some advice on passing arrays from C to Python avoiding copies. I use
Cython as I have found the code to be more maintainable than hand-written
Python C-API code.&lt;/p&gt;
&lt;p&gt;I found out that there was no self-contained example of creating numpy
arrays from existing data in Cython. Thus I created my own. The full code
with readme, build, and demo scripts is available on a &lt;a class="reference external" href="https://gist.github.com/1249305"&gt;gist&lt;/a&gt;. Here I only
give an executive summary.&lt;/p&gt;
&lt;p&gt;The core functionality is implemented by the
&lt;a class="reference external" href="http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#PyArray_SimpleNewFromData"&gt;PyArray_SimpleNewFromData&lt;/a&gt; function of the C API of numpy that can
create an ndarray from a pointer to the data, a simple data type, and
the shape of the data. The Cython file just builds around that function:&lt;/p&gt;
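For readers who want the flavor of this zero-copy wrapping without compiling Cython, here is a pure-Python analogue (an illustration only, not the gist's code): np.ctypeslib.as_array, like PyArray_SimpleNewFromData, builds an ndarray view over existing memory.

```python
# Wrap memory owned by a C-level buffer in a numpy array without copying.
import ctypes

import numpy as np

n = 5
c_buffer = (ctypes.c_double * n)(*range(n))  # stands in for C-allocated data
arr = np.ctypeslib.as_array(c_buffer)        # ndarray view, no data copy

arr[0] = 42.0       # writes go straight through to the underlying buffer
print(c_buffer[0])  # the mutation is visible on the C side
```

The Cython version adds what this sketch cannot: correct ownership of the memory, so that it is freed when the array is garbage-collected.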
&lt;p&gt;
&lt;script src="https://gist.github.com/1249305.js?file=cython_wrapper.pyx"&gt;&lt;/script&gt;
&lt;/p&gt;</content><category term="programming"></category><category term="scipy"></category><category term="cython"></category><category term="python"></category><category term="scientific computing"></category><category term="selected"></category></entry><entry><title>Python at scientific conferences</title><link href="https://gael-varoquaux.info/programming/python-at-scientific-conferences.html" rel="alternate"></link><published>2011-09-11T15:52:00+02:00</published><updated>2011-09-11T15:52:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-09-11:/programming/python-at-scientific-conferences.html</id><summary type="html">&lt;p&gt;Top notch scientific conferences are starting to add Python tracks to
their program. This is good news. Indeed, the scientific Python
conferences (namely &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/"&gt;Scipy&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/"&gt;EuroSciPy&lt;/a&gt; and &lt;a class="reference external" href="http://scipy.in/scipyin/2011/"&gt;Scipy India&lt;/a&gt;) are doing
a great job of bringing together people who have already heard about Python for
science, but we need to reach out to …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Top notch scientific conferences are starting to add Python tracks to
their program. This is good news. Indeed, the scientific Python
conferences (namely &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/"&gt;Scipy&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/"&gt;EuroSciPy&lt;/a&gt; and &lt;a class="reference external" href="http://scipy.in/scipyin/2011/"&gt;Scipy India&lt;/a&gt;) are doing
a great job of bringing together people who have already heard about Python for
science, but we need to reach out to specific Python communities to
maximize impact.&lt;/p&gt;
&lt;div class="section" id="esco-2012-european-seminar-on-coupled-problems"&gt;
&lt;h2&gt;ESCO 2012 - European Seminar on Coupled Problems&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="http://esco2012.femhub.com/"&gt;ESCO 2012&lt;/a&gt; is the 3rd event in a series of interdisciplineary meetings
dedicated to computational science challenges in multi-physics and PDEs.&lt;/p&gt;
&lt;p&gt;I was invited to ESCO last year. It was an absolute pleasure, because it
is a small conference that is very focused on discussions. I learned a
lot and could sit down with people who code top notch PDE libraries such
as FEniCS and have technical discussions. Besides, it is hosted in the
historical brewery where the Pilsner was invented. Plenty of great beer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Application areas&lt;/strong&gt; Theoretical results as well as applications are
welcome. Application areas include, but are not limited to:
Computational electromagnetics, Civil engineering, Nuclear engineering,
Mechanical engineering, Computational fluid dynamics, Computational
geophysics, Geomechanics and rock mechanics, Computational hydrology,
Subsurface modeling, Biomechanics, Computational chemistry, Climate and
weather modeling, Wave propagation, Acoustics, Stochastic differential
equations, and Uncertainty quantification.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Minisymposia&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Multiphysics and Multiscale Problems in Civil Engineering&lt;/li&gt;
&lt;li&gt;Modern Numerical Methods for ODE&lt;/li&gt;
&lt;li&gt;Porous Media Hydrodynamics&lt;/li&gt;
&lt;li&gt;Nuclear Fuel Recycling Simulations&lt;/li&gt;
&lt;li&gt;Adaptive Methods for Eigenproblems&lt;/li&gt;
&lt;li&gt;Discontinuous Galerkin Methods for Electromagnetics&lt;/li&gt;
&lt;li&gt;Undergraduate Projects in Technical Computing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Software afternoon&lt;/strong&gt; An important part of each ESCO conference is a
software afternoon featuring software projects by participants.
Any computational software that has reached a certain level of maturity
can be presented, i.e., software that is used outside of the author’s
institution and has a web page and user documentation. If you would like to
present your software project, let us know soon.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Proceedings&lt;/strong&gt; For each ESCO we strive to reserve a special issue of an
international journal with impact factor. Proceedings of ESCO 2008
appeared in Math. Comput. Simul., proceedings of ESCO 2010 in CiCP and
Appl. Math. Comput. Proceedings of ESCO 2012 will appear in Computing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important Dates&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;December 15, 2011: Abstract submission deadline.&lt;/li&gt;
&lt;li&gt;December 15, 2011: Minisymposia proposals.&lt;/li&gt;
&lt;li&gt;January 15, 2012: Notification of acceptance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="pyhpc-python-for-high-performance-computing"&gt;
&lt;h2&gt;PyHPC: Python for High performance computing&lt;/h2&gt;
&lt;p&gt;If you are doing super computing, &lt;a class="reference external" href="http://sc11.supercomputing.org/"&gt;SC11, the Super Computing
conference&lt;/a&gt; is &lt;em&gt;the&lt;/em&gt; reference conference. This year there will be a
workshop on high performance computing with Python: &lt;a class="reference external" href="http://www.dlr.de/sc/desktopdefault.aspx/tabid-1183/1638_read-31733/"&gt;PyHPC&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At the scipy conference, I was having a discussion with some of the
attendees on how people often still do process management and I/O with
Fortran in big computing environments. This is counterproductive.
However, as success stories of supercomputing folks using high-level
languages are not advertised, this is bound to stay. Come and tell us
how you use Python for high performance computing!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Topics&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Python-based scientific applications and libraries&lt;/li&gt;
&lt;li&gt;High performance computing&lt;/li&gt;
&lt;li&gt;Parallel Python-based programming languages&lt;/li&gt;
&lt;li&gt;Scientific visualization&lt;/li&gt;
&lt;li&gt;Scientific computing education&lt;/li&gt;
&lt;li&gt;Python performance and language issues&lt;/li&gt;
&lt;li&gt;Problem solving environments with Python&lt;/li&gt;
&lt;li&gt;Performance analysis tools for Python applications&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Papers&lt;/strong&gt; We invite you to submit a paper of up to 10 pages via the
submission site. Authors are encouraged to use IEEE two column format.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important Dates&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Full paper submission: September 19, 2011&lt;/li&gt;
&lt;li&gt;Notification of acceptance: October 7, 2011&lt;/li&gt;
&lt;li&gt;Camera-ready papers: October 31, 2011&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="conferences"></category><category term="python"></category><category term="scipy"></category><category term="science"></category><category term="scientific computing"></category></entry><entry><title>Conference posters</title><link href="https://gael-varoquaux.info/science/conference-posters.html" rel="alternate"></link><published>2011-09-05T04:15:00+02:00</published><updated>2011-09-05T04:15:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-09-05:/science/conference-posters.html</id><summary type="html">&lt;p&gt;At the request of a friend, I am putting up some of the posters that I
recently presented at conferences.&lt;/p&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_nips.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Large-scale functional-connectivity graphical models for individual
subjects using population prior.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00512451/en"&gt;our NIPS work&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_nips.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_ipmi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Multi-subject dictionary learning to segment an atlas of brain
spontaneous activity …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;At the request of a friend, I am putting up some of the posters that I
recently presented at conferences.&lt;/p&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_nips.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Large-scale functional-connectivity graphical models for individual
subjects using population prior.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00512451/en"&gt;our NIPS work&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_nips.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_ipmi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Multi-subject dictionary learning to segment an atlas of brain
spontaneous activity.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a poster for &lt;a class="reference external" href="http://hal.inria.fr/inria-00588898/en"&gt;our IPMI work&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_ipmi.png"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_mayavi.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Mayavi for 3D visualization of neuroimaging data: powerful scripting
and reusable components in Python.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_mayavi.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-left" src="attachments/scientific_posters/poster_scikit.png" style="width: 30%;" /&gt;
&lt;p&gt;&lt;strong&gt;Machine learning for fMRI in Python: inverse inference with
scikit-learn.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="attachments/scientific_posters/poster_scikit.pdf"&gt;PDF&lt;/a&gt;&lt;/p&gt;
</content><category term="science"></category><category term="neuroimaging"></category><category term="machine learning"></category><category term="science"></category><category term="publishing"></category></entry><entry><title>Hiring a junior developer on the scikit-learn</title><link href="https://gael-varoquaux.info/programming/hiring-a-junior-developer-on-the-scikit-learn.html" rel="alternate"></link><published>2011-09-03T07:26:00+02:00</published><updated>2011-09-03T07:26:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-09-03:/programming/hiring-a-junior-developer-on-the-scikit-learn.html</id><summary type="html">&lt;p&gt;Once again, we are looking for a junior developer to work on the
scikit-learn. Below is the official job posting. As a personal remark, I
would like to stress that this is a unique opportunity to be paid for
two years to work on learning and improving the scientific Python …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Once again, we are looking for a junior developer to work on the
scikit-learn. Below is the official job posting. As a personal remark, I
would like to stress that this is a unique opportunity to be paid for
two years to work on learning and improving the scientific Python
toolstack.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="section" id="job-description"&gt;
&lt;h2&gt;Job Description&lt;/h2&gt;
&lt;p&gt;INRIA is looking to hire a young graduate on a 2-year position to help
with the community-driven development of scikit-learn, an open source
machine-learning library in Python. The scikit-learn is one of the
major machine-learning libraries in Python. It aims to be
state-of-the-art on mid-size to large datasets by harnessing the power
of the scientific Python toolstack.&lt;/p&gt;
&lt;p&gt;Speaking French is not a requirement, as it is an international team.&lt;/p&gt;
&lt;div class="section" id="requirements"&gt;
&lt;h3&gt;Requirements&lt;/h3&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Programming skills in Python and C/C++&lt;/li&gt;
&lt;li&gt;Understanding of quality assurance in software development:
test-driven programming, version control, technical documentation.&lt;/li&gt;
&lt;li&gt;Some knowledge of Linux/Unix&lt;/li&gt;
&lt;li&gt;Software design skills&lt;/li&gt;
&lt;li&gt;Knowledge of open-source development and community-driven
environments&lt;/li&gt;
&lt;li&gt;Good technical English level&lt;/li&gt;
&lt;li&gt;An experience in statistical learning or a mathematical-oriented
mindset is a plus&lt;/li&gt;
&lt;li&gt;We can only hire a young graduate who has received a master’s or
equivalent degree at most a year ago.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="about-inria"&gt;
&lt;h2&gt;About INRIA&lt;/h2&gt;
&lt;p&gt;INRIA is the French computer science research institute. It is recognized
world-wide as one of the leading research institutions and has a strong
expertise in machine learning. You will be working in the &lt;a class="reference external" href="https://parietal.saclay.inria.fr"&gt;Parietal
team&lt;/a&gt; that makes a heavy use of Python for brain imaging analysis.&lt;/p&gt;
&lt;p&gt;Parietal is a small research team (around 10 people) with an excellent
technical knowledge of scientific and numerical computing in Python as
well as a fine understanding of algorithmic issues in machine learning
and statistics. Parietal is committed to investing in scikit-learn.&lt;/p&gt;
&lt;p&gt;Working at Parietal is a unique opportunity to improve your skills in
machine learning and numerical computing in Python. In addition, working
full time on scikit-learn, a very active open-source project, will give
you premium experience in open-source community management and
collaborative project development.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="contact-info"&gt;
&lt;h2&gt;Contact Info:&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Technical Contact&lt;/strong&gt;: Bertrand Thirion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;E-mail contact&lt;/strong&gt;: bertrand dotnospam thirion atnospam inria
dotnospam fr&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HR Contact&lt;/strong&gt;: Marie Domingues&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;E-mail Contact&lt;/strong&gt;: marie dotnospam domingues atnospam inria
dotnospam fr&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No telecommuting&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="machine learning"></category><category term="python"></category><category term="science"></category><category term="jobs"></category><category term="scikit-learn"></category></entry><entry><title>My conference travels: Scipy 2011 and HBM 2011</title><link href="https://gael-varoquaux.info/science/my-conference-travels-scipy-2011-and-hbm-2011.html" rel="alternate"></link><published>2011-07-23T23:45:00+02:00</published><updated>2011-07-23T23:45:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-07-23:/science/my-conference-travels-scipy-2011-and-hbm-2011.html</id><summary type="html">&lt;div class="section" id="the-scipy-2011-conference-in-austin"&gt;
&lt;h2&gt;The Scipy 2011 conference in Austin&lt;/h2&gt;
&lt;p&gt;Last week, I was at the Scipy conference in Austin. It was really great
to see old friends, and Austin is such a nice place.&lt;/p&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6143/5931239349_13c78bbef5_m.jpg" style="width: 50%;" /&gt;
&lt;p&gt;The Scipy conference was held in &lt;a class="reference external" href="http://www.meetattexas.com/"&gt;UT Austin’s conference center&lt;/a&gt;, which
is a fantastic venue. This is the …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="the-scipy-2011-conference-in-austin"&gt;
&lt;h2&gt;The Scipy 2011 conference in Austin&lt;/h2&gt;
&lt;p&gt;Last week, I was at the Scipy conference in Austin. It was really great
to see old friends, and Austin is such a nice place.&lt;/p&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6143/5931239349_13c78bbef5_m.jpg" style="width: 50%;" /&gt;
&lt;p&gt;The Scipy conference was held in &lt;a class="reference external" href="http://www.meetattexas.com/"&gt;UT Austin’s conference center&lt;/a&gt;, which
is a fantastic venue. This is the first geek conference I have been to
where the wireless network worked flawlessly with good bandwidth, even
though 200 geeks were pounding on it. As a tutorial presenter, I found
this incredibly useful.&lt;/p&gt;
&lt;div class="section" id="conference-highlight"&gt;
&lt;h3&gt;Conference highlight&lt;/h3&gt;
&lt;p&gt;Here is a short list of what I &lt;em&gt;felt&lt;/em&gt; were the big trends and highlights
of the conference. This is obviously biased by my own interests. I am
not listing parallel computing: it is clearly an important area of
progress and debate, but that has been the case for the last few years.&lt;/p&gt;
&lt;div class="section" id="eric-jone-s-keynote"&gt;
&lt;h4&gt;Eric Jones’s keynote&lt;/h4&gt;
&lt;p&gt;Of course Eric’s keynote was excellent. Eric is a great speaker and
always has good insights on how to run a team and a project. This year
he shared (some of) his tricks for making Enthought deliver on software
projects: &lt;em&gt;“What Matters in Scientific Software Projects? 10 Years of
Success and Failure Distilled”&lt;/em&gt;. The video is not yet online,
unfortunately. Grab it when you can.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="hilary-mason-s-keynote"&gt;
&lt;h4&gt;Hilary Mason’s keynote&lt;/h4&gt;
&lt;p&gt;Hilary is an applied data geek, just what I like! She gave an
interesting &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/mason_awesome.pdf"&gt;keynote&lt;/a&gt; on how &lt;a class="reference external" href="https://bitly.com/"&gt;bitly&lt;/a&gt; (a URL-shortening startup, for
those living under a rock) mines the requests on the URLs that they
serve to do things like ranking or detecting phishing attempts. Of
course, I couldn’t resist asking what tools they used, thinking that she
would reply R. She mentioned that they do roll some of their own, but
she also mentioned &lt;a class="reference external" href="https://mlpy.fbk.eu/"&gt;mlpy&lt;/a&gt; and &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, with a remark that it was very
nice, at which point I believe that I blushed. She stressed that R was
hard to use in production and raised the point that academic software
most often doesn’t pan out in these settings (I hope that I am not
distorting her thoughts too much).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="statistics-and-learning"&gt;
&lt;h4&gt;Statistics and learning&lt;/h4&gt;
&lt;p&gt;I had the feeling that statistics and data mining played a big role at
scipy this year. Maybe it is because I am more tuned to these questions
nowadays, but some signs do not lie. There was a special session on
Python in data sciences, a panel discussion on Python in finance and
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/cron_gpustats.pdf"&gt;many&lt;/a&gt;
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/refsdal_sherpa.zip"&gt;many&lt;/a&gt;
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/mckinney_time_series.pdf"&gt;statistics&lt;/a&gt; and &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/determan_vision_spreadsheet.pdf"&gt;data&lt;/a&gt; &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/caraciolo_crab_recommendation.pdf"&gt;related&lt;/a&gt; talks, as well as two tutorials and
a keynote.&lt;/p&gt;
&lt;p&gt;In addition, on a personal basis it was really great to meet part of the
team behind &lt;a class="reference external" href="http://statsmodels.sourceforge.net/"&gt;scikits.statsmodels&lt;/a&gt;. We had plenty of very interesting
discussions and they really helped me understand the way that some
statisticians approach data: very differently from me, because they have
fairly little data, and can afford to inspect reports and graphs,
whereas I rely more on automated decision rules.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="ipython"&gt;
&lt;h4&gt;IPython&lt;/h4&gt;
&lt;p&gt;&lt;a class="reference external" href="http://twitter.com/#!/minrk"&gt;Min&lt;/a&gt; gave &lt;a class="reference external" href="http://minrk.github.com/scipy-tutorial-2011/"&gt;an excellent tutorial&lt;/a&gt; on how to do parallel computing
using IPython. These guys have certainly done an excellent job to make
cluster-level programming in Python easier. While they don’t play yet
terribly well with the restrictive job-queue policy of the clusters to
which I have access, they have all the right low-level tools to address
these issues and Min told me that they will be working on this next
year.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://fperez.org/"&gt;Fernando&lt;/a&gt; gave &lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/perez_ipython.pdf"&gt;an impressive talk&lt;/a&gt; on the new developments of
IPython. In particular, the new Qt-based terminal is &lt;em&gt;`really cool`_&lt;/em&gt;
and there is a web frontend in the works.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="cluster-computing-as-facility"&gt;
&lt;h4&gt;Cluster computing as a facility&lt;/h4&gt;
&lt;p&gt;While I mention cluster computing, I must confess that I have always
stayed away from this beast: I find it a time sink, and I find that I
get more science done without it. This is why I really liked the
presentation by the &lt;a class="reference external" href="http://www.picloud.com/"&gt;PiCloud&lt;/a&gt; guys on, … cluster computing! The
reason I liked it is that they start from the principle that your time
is more important than CPU time. I hear so much about &lt;em&gt;bigger better
faster more&lt;/em&gt; high-performance computing while researchers forget to
address the biggest issue:&lt;/p&gt;
&lt;blockquote class="epigraph"&gt;
… a whole generation of researchers turned into system
administrators by the demands of computing - Dan Reed, VP Microsoft&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div class="section" id="abstract-code-manipulation-for-numerical-computation"&gt;
&lt;h4&gt;Abstract code manipulation for numerical computation&lt;/h4&gt;
&lt;p&gt;Finally, a trend that is picking up in the Python-based scientific
computing is the abstract manipulation of expressions to generate fast
code. This ranges from &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Just-in-time_compilation"&gt;JIT (just in time) compilation&lt;/a&gt; generating
machine code, to rewriting mathematical expressions. Peter Wang gave a
&lt;a class="reference external" href="http://conference.scipy.org/scipy2011/slides/wang_metagraph.pdf"&gt;talk&lt;/a&gt; in this vein, but the topic was also brought up by Aron Ahmadia.
Of course this is not new: &lt;a class="reference external" href="http://code.google.com/p/numexpr/"&gt;numexpr&lt;/a&gt; has been using these tricks for
years, and more recently &lt;a class="reference external" href="http://deeplearning.net/software/theano/"&gt;Theano&lt;/a&gt; has been making good use of GPUs
thanks to them.&lt;/p&gt;
&lt;p&gt;This topic is emerging in more and more places for good reasons: with
faster and more numerous CPUs, the number of operations per second is
less of a bottleneck, and the order in which operations are applied, and
where they physically happen, is becoming critical.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="my-own-agenda"&gt;
&lt;h3&gt;My own agenda&lt;/h3&gt;
&lt;div class="section" id="sprinting-on-scikit-learn"&gt;
&lt;h4&gt;Sprinting on scikit-learn&lt;/h4&gt;
&lt;a class="reference external image-reference" href="http://scikit-learn.org/dev/auto_examples/mixture/plot_gmm.html"&gt;&lt;img alt="" src="http://scikit-learn.org/dev/_images/plot_gmm_1.png" /&gt;&lt;/a&gt;
&lt;p&gt;We had two days of sprints after the conference. A huge number of people
voted to sprint on the &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, but only two people showed up:
Minwoo Lee and &lt;a class="reference external" href="http://www-etud.iro.umontreal.ca/~wardefar"&gt;David Warde-Farley&lt;/a&gt;. Thanks heaps to these guys! My
priority for the sprint was to review and merge branches. That worked
beautifully: we merged in the following features:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/mixture.html#the-dirichlet-process"&gt;Dirichlet-Process Gaussian mixture models&lt;/a&gt;, by Alex Passos&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/decomposition.html#sparse-principal-components-analysis-sparsepca"&gt;Sparse PCA&lt;/a&gt; by Vlad Niculae.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/gaussian_process.html"&gt;Speedups in Gaussian processes&lt;/a&gt; by Vincent Schut.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/clustering.html#mini-batch-k-means"&gt;Sparse implementation of the mini-batch k-means&lt;/a&gt; by Peter
Prettenhofer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, David added a dataset downloader for the &lt;a class="reference external" href="http://cs.nyu.edu/~roweis/data/olivettifaces.gif"&gt;Olivetti face
dataset&lt;/a&gt;, which is lightweight, but rich enough to give very
interesting examples.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="my-presentation"&gt;
&lt;h4&gt;My presentation&lt;/h4&gt;
&lt;p&gt;I gave a talk on my research work, and the software stack that
underlies it: &lt;a class="reference external" href="http://www.slideshare.net/GaelVaroquaux/python-for-brain-mining-neuroscience-with-state-of-the-art-machine-learning-and-data-visualization"&gt;Python for brain mining: (neuro)science with state of
the art machine learning and data visualization&lt;/a&gt;. I think that it was
well received by the audience. What is really crazy is that I uploaded
the slides on slideshare, and they got a ridiculous number of views. I
suspect that it is because of the title: &lt;em&gt;brain mining&lt;/em&gt; does sound
fancy.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="mayavi"&gt;
&lt;h4&gt;Mayavi&lt;/h4&gt;
&lt;p&gt;Because of technical and political reasons, I cannot get &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi&lt;/a&gt;
installed on the computers at work. This, and the fact that many people
ask for help, but few contribute, even in the form of answers on the
mailing list, had been wearing me down a bit. I got so much great
feedback on Mayavi at the conference that I feel much more motivated to
invest energy in it.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-humain-brain-mapping-conference-in-quebec-city"&gt;
&lt;h2&gt;The Human Brain Mapping conference in Quebec City&lt;/h2&gt;
&lt;img alt="" class="align-center" src="http://farm7.static.flickr.com/6018/5968391718_002105ccd1.jpg" style="width: 50%;" /&gt;
&lt;p&gt;This blog post is getting too long. It is well beyond my own attention
span. However, scipy is not the only conference I have been to
recently. Two weeks before, I was in Quebec for the &lt;a class="reference external" href="http://www.humanbrainmapping.org/i4a/pages/index.cfm?pageID=3419"&gt;Human Brain Mapping
conference&lt;/a&gt;. As every year, HBM is a fun ride. It has fantastic parties
in the evenings. But I didn’t stay up too late, as this year was a busy
one for me: I was teaching in an educational course, and chairing a
symposium, both on comparing brain functional connectivity across
subjects.&lt;/p&gt;
&lt;p&gt;But the really big deal at HBM this year came at the end. As I was
dozing off, vaguely listening to Russ Poldrack’s closing comments, he
brought up on screen a slide entitled &lt;em&gt;the year of Python&lt;/em&gt;. This is a
big deal: we’ve been working for years to get Python into the
neuroimaging world, and it is clearly making progress, despite all the
roadblocks.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="conferences"></category><category term="travels"></category><category term="machine learning"></category><category term="mayavi"></category><category term="python"></category><category term="science"></category><category term="scikit-learn"></category></entry><entry><title>Euroscipy 2011: early bird deadline soon</title><link href="https://gael-varoquaux.info/programming/euroscipy-2011-early-bird-deadline-soon.html" rel="alternate"></link><published>2011-07-22T00:44:00+02:00</published><updated>2011-07-22T00:44:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-07-22:/programming/euroscipy-2011-early-bird-deadline-soon.html</id><summary type="html">&lt;div class="section" id="euroscipy-2011-register-now-for-early-bird-prices"&gt;
&lt;h2&gt;Euroscipy 2011: register now for early bird prices&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The deadline for early-bird registration at the Euroscipy conference
is this Sunday&lt;/strong&gt;. Beyond this deadline prices will double. &lt;strong&gt;Register
now to get a great deal.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;To register, simply go to &lt;a class="reference external" href="http://www.euroscipy.org"&gt;www.euroscipy.org&lt;/a&gt;, log in using the link on
the top right …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="euroscipy-2011-register-now-for-early-bird-prices"&gt;
&lt;h2&gt;Euroscipy 2011: register now for early bird prices&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The deadline for early-bird registration at the Euroscipy conference
is this Sunday&lt;/strong&gt;. Beyond this deadline prices will double. &lt;strong&gt;Register
now to get a great deal.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;To register, simply go to &lt;a class="reference external" href="http://www.euroscipy.org"&gt;www.euroscipy.org&lt;/a&gt;, log in using the link on
the top right, and follow the &lt;em&gt;‘Register now for the conference’&lt;/em&gt; link
on the top left.&lt;/p&gt;
&lt;p&gt;The conference is a great opportunity to learn the intricacies of
numerical and scientific computing in Python. You can register for the
tutorials in an &lt;a class="reference external" href="http://www.euroscipy.org/track/4010?vid=tracktalkslist"&gt;intro track&lt;/a&gt;, which will take you from beginner to fully
autonomous user, or for an &lt;a class="reference external" href="http://www.euroscipy.org/track/4011?vid=tracktalkslist"&gt;advanced track&lt;/a&gt;, to learn from the experts on
topics such as image processing, GPU computing, machine learning or
optimization. The tutorials are a fairly unique occasion to improve your
skills, as you will seldom get such a concentration of experts.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="some-program-highlights"&gt;
&lt;h2&gt;Some program highlights&lt;/h2&gt;
&lt;p&gt;After the 2 days of tutorials, the conference itself will host 2 keynotes:
one by &lt;a class="reference external" href="http://mcs.open.ac.uk/mp8/"&gt;Marian Petre&lt;/a&gt;, of the Open University, well known for her
empirical studies of software development, and another by &lt;a class="reference external" href="http://fperez.org/"&gt;Fernando
Perez&lt;/a&gt;, a pioneer in scientific computing in Python and the original
author of IPython.&lt;/p&gt;
&lt;p&gt;Glancing at the &lt;a class="reference external" href="http://www.euroscipy.org/track/3992?vid=tracktalkslist"&gt;program&lt;/a&gt;, we can see that a wide range of topics is
covered:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;pure computer-science topics, such as &lt;a class="reference external" href="http://www.euroscipy.org/talk/4186"&gt;concurrent programming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;traditional &lt;em&gt;hard&lt;/em&gt; sciences, such as &lt;a class="reference external" href="http://www.euroscipy.org/talk/4201"&gt;multi-physics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;simulation of complex systems, for instance &lt;a class="reference external" href="http://www.euroscipy.org/talk/4219"&gt;network modeling in
epidemiology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;or novel applications of quantitative large-data processing, as in
&lt;a class="reference external" href="http://www.euroscipy.org/talk/4182"&gt;legal research&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The variety of the topics illustrates what is for me one of the greatest
benefits of the scipy conferences: they form a forum to exchange ideas
and techniques to find new solutions to scientific, numerical and data
analysis problems. Unlike pure computer science conferences, they sit at
the frontier between applications and bleeding-edge computing
developments, &lt;strong&gt;because these people really use the tools presented to
solve their problems&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In addition to this rich program, we will have 2 days of &lt;a class="reference external" href="http://www.euroscipy.org/track/5201"&gt;sprints&lt;/a&gt;
before the conference, as well as 2-day-long satellite conferences on
Python in &lt;a class="reference external" href="http://www.euroscipy.org/card/pyphy2011"&gt;Physics&lt;/a&gt; and &lt;a class="reference external" href="http://pythonneuro.sciencesconf.org/"&gt;NeuroScience&lt;/a&gt; after the conference. This is
how what used to be a small conference can now be a full 8-day event if
you order all the extras.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="conferences"></category></entry><entry><title>Hiring a junior engineer on the scikit-learn</title><link href="https://gael-varoquaux.info/programming/hiring-a-junior-engineer-on-the-scikit-learn.html" rel="alternate"></link><published>2011-05-14T19:10:00+02:00</published><updated>2011-05-14T19:10:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-05-14:/programming/hiring-a-junior-engineer-on-the-scikit-learn.html</id><summary type="html">&lt;p&gt;The &lt;a class="reference external" href="http://www.scikit-learn.org"&gt;scikit-learn&lt;/a&gt; is a Python module for machine learning. The
project builds on the scientific and numerical tools of the &lt;a class="reference external" href="http://www.scipy.org"&gt;scipy
community&lt;/a&gt; to provide state-of-the-art data analysis tools. It is
developed by a community of open source developers to which my research
team (&lt;a class="reference external" href="https://parietal.saclay.inria.fr/"&gt;Parietal&lt;/a&gt;, &lt;a class="reference external" href="http://www.inria.fr/"&gt;INRIA&lt;/a&gt;) contributes a lot and is …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The &lt;a class="reference external" href="http://www.scikit-learn.org"&gt;scikit-learn&lt;/a&gt; is a Python module for machine learning. The
project builds on the scientific and numerical tools of the &lt;a class="reference external" href="http://www.scipy.org"&gt;scipy
community&lt;/a&gt; to provide state-of-the-art data analysis tools. It is
developed by a community of open source developers to which my research
team (&lt;a class="reference external" href="https://parietal.saclay.inria.fr/"&gt;Parietal&lt;/a&gt;, &lt;a class="reference external" href="http://www.inria.fr/"&gt;INRIA&lt;/a&gt;) contributes a lot and is a &lt;a class="reference external" href="http://github.com/scikit-learn/scikit-learn"&gt;thriving
project&lt;/a&gt;. Its mailing list fosters many discussions on code and machine
learning topics, it has &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/user_guide.html"&gt;very detailed documentation&lt;/a&gt;, and &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/whats_new.html"&gt;a tight
release cycle&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Although scikits.learn is mostly developed by volunteers, INRIA has
funded a two-year position for a junior engineer —currently &lt;a class="reference external" href="http://fseoane.net/blog/"&gt;Fabian
Pedregosa&lt;/a&gt;— to help with the core management and integration of the
project. This funding is coming to an end in fall 2011 &lt;a class="reference external" href="#footnote"&gt;[*]&lt;/a&gt;. The
good news is that we have been allocated new funding to hire an engineer
on the scikit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We are thus looking to hire a junior engineer for a 2-year position to
work on the scikits.learn at INRIA in Saclay, near Paris&lt;/strong&gt;. The position
is only available to candidates that have received a &lt;strong&gt;masters or
equivalent degree at most a year ago&lt;/strong&gt; — this is non negotiable: we
cannot hire more senior candidates.&lt;/p&gt;
&lt;p&gt;We are looking for a developer with good open-source project management
skills: the successful candidate will review and merge patches, ensure
the quality of the scikit, make releases, coordinate development on the
mailing list and on github. Good knowledge of Python and its scientific
ecosystem is expected. A mathematical or computer-science oriented
mindset is a plus, as the project involves working with machine learning
algorithms.&lt;/p&gt;
&lt;p&gt;The candidate should be willing to relocate to work daily in the
&lt;a class="reference external" href="http://www-dsv.cea.fr/en/instituts/institut-d-imagerie-biomedicale-i2bm/services/neurospin-neurospin"&gt;Neurospin brain research institute&lt;/a&gt;, in which the Parietal team is
located. Knowledge of French is not required, as the team and the
institute are very international. Non-EU candidates are welcome, but the
hiring process will take longer.&lt;/p&gt;
&lt;p&gt;You will be working in a very stimulating environment. You will be
employed by INRIA, the French computer science research institute. As
such, you will benefit from the expertise of the institute’s researchers
and engineers. Team members contribute to various scientific Python
libraries (in addition to scikits.learn, &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi&lt;/a&gt;, &lt;a class="reference external" href="http://nipy.org"&gt;nipy&lt;/a&gt;, &lt;a class="reference external" href="http://packages.python.org/joblib/"&gt;joblib&lt;/a&gt;).
In addition, you will be working in a brain research institute, in
collaboration with leading &lt;a class="reference external" href="http://lnao.fr"&gt;methods researchers&lt;/a&gt; and &lt;a class="reference external" href="http://www.unicog.org/pm/pmwiki.php"&gt;neuroscientists&lt;/a&gt;
that use machine learning to gain new insights on brain processes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;To apply:&lt;/strong&gt; Prepare a CV and a motivation
letter. The deadline for applications is mid-June, but we will be
selecting candidates and conducting interviews before then. &lt;strong&gt;Don’t send me
CVs&lt;/strong&gt;. The formal job description, as well as instructions to apply, can
be found on this &lt;a class="reference external" href="http://en.inria.fr/institute/recruitment/offers/young-graduate-engineers/%28view%29/details.html?id=PNGFK026203F3VBQB6G68LOE1&amp;amp;LOV5=4510&amp;amp;ContractType=4545&amp;amp;LG=EN&amp;amp;Resultsperpage=20&amp;amp;nPostingID=5534&amp;amp;nPostingTargetID=10628&amp;amp;option=52&amp;amp;sort=DESC&amp;amp;nDepartmentID=10"&gt;page&lt;/a&gt;. The page is mostly in French, sorry; use
Google Translate if you don’t understand. At the bottom of the page you
will find a link to apply.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;strong&gt;[*]&lt;/strong&gt; Fabian will most probably stay with us to do a PhD on
&lt;a class="reference external" href="https://parietal.saclay.inria.fr/research"&gt;analysis of large brain functional imaging datasets&lt;/a&gt;.&lt;/p&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="jobs"></category><category term="machine learning"></category><category term="scipy"></category><category term="science"></category></entry><entry><title>EuroScipy: the program is filling up, and the submission deadline nearing</title><link href="https://gael-varoquaux.info/programming/euroscipy-the-program-is-filling-up-and-the-submission-deadline-nearing.html" rel="alternate"></link><published>2011-04-30T17:21:00+02:00</published><updated>2011-04-30T17:21:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-04-30:/programming/euroscipy-the-program-is-filling-up-and-the-submission-deadline-nearing.html</id><summary type="html">&lt;div class="section" id="submission-deadline-may-8th"&gt;
&lt;h2&gt;Submission deadline May 8th&lt;/h2&gt;
&lt;p&gt;The deadline for the call for presentation for the EuroScipy conference
is on &lt;strong&gt;May 8th&lt;/strong&gt;. There is only a week and a half left.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.euroscipy.org/"&gt;EuroScipy&lt;/a&gt; will be held in &lt;strong&gt;Paris, August 25-28&lt;/strong&gt;. It is the European
meeting for users of Python in scientific and numerical-intensive
applications …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="submission-deadline-may-8th"&gt;
&lt;h2&gt;Submission deadline May 8th&lt;/h2&gt;
&lt;p&gt;The deadline for the call for presentation for the EuroScipy conference
is on &lt;strong&gt;May 8th&lt;/strong&gt;. There is only a week and a half left.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.euroscipy.org/"&gt;EuroScipy&lt;/a&gt; will be held in &lt;strong&gt;Paris, August 25-28&lt;/strong&gt;. It is the European
meeting for users of Python in scientific and numerical-intensive
applications. It strives to bring together both users and developers of
scientific and numerical tools, as well as academic research and state
of the art industry. The conference will host 2 days of tutorials and 2
days of technical presentations.&lt;/p&gt;
&lt;p&gt;Lately, numerical computing in Python has started reaching a much wider
audience than its traditional academic one. This is partly
because Python is making its way in major engineering companies, but
also because more and more industries are processing large amounts of
data, and find precious &lt;strong&gt;data analytics tools&lt;/strong&gt; in the &lt;a class="reference external" href="http://www.scipy.org"&gt;Scipy&lt;/a&gt;
community. In this spirit, this year there will be a &lt;a class="reference external" href="http://www.euroscipy.org/talk/4061"&gt;tutorial on
machine learning with Python&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="poster-session"&gt;
&lt;h2&gt;Poster session&lt;/h2&gt;
&lt;p&gt;Last year, the organizing committee had to refuse a large fraction of
the proposals, because there were not enough slots available. We had
considered organizing a poster session, but the logistics were too
challenging for our limited resources. Indeed, EuroSciPy still tries to
be organized as a hackers’ and coders’ conference, rather than an
industry-level one. For instance, we keep the prices to a minimum, in
order to make it easy for young people traveling on their own budget to
join us. Getting 200 attendees, as we did last year, did strain our small
organizing committee.&lt;/p&gt;
&lt;p&gt;This year, we had unexpected backing from the &lt;a class="reference external" href="http://www.phys.ens.fr/"&gt;physics department&lt;/a&gt; of
the &lt;a class="reference external" href="http://www.ens.fr/?lang=en"&gt;ENS&lt;/a&gt;. They are extremely enthusiastic about Python, which they now
use for teaching and research. This made me really happy, as this is
where I studied. They offered help, and in particular help with the
local organization.&lt;/p&gt;
&lt;p&gt;Thus I am able to announce that thanks to the physics department of the
ENS, we will be able to host a poster session!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="an-exciting-program-shaping-up"&gt;
&lt;h2&gt;An exciting program shaping up&lt;/h2&gt;
&lt;p&gt;The program is starting to shape up, and it is looking really good, in
my eyes.&lt;/p&gt;
&lt;div class="section" id="keynotes"&gt;
&lt;h3&gt;Keynotes&lt;/h3&gt;
&lt;p&gt;We will be having two keynote speakers, one directly from the SciPy
community, Fernando Perez, and one probably less known to this
community, Marian Petre.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://mcs.open.ac.uk/mp8/"&gt;Marian Petre&lt;/a&gt;: Marian is the director of the &lt;a class="reference external" href="http://crc.open.ac.uk/"&gt;Center for Research
in Computing&lt;/a&gt;, at the &lt;a class="reference external" href="http://www.open.ac.uk/"&gt;Open University&lt;/a&gt;. She is interested in
empirical studies of software development. I am very excited to hear
a bit more about the often-forgotten human factor behind
every coding job, big or small. In my experience, scientific computing
and computational sciences pay a hefty price because they don’t
acknowledge well enough the gap between good ideas and tractable
code.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://fperez.org/"&gt;Fernando Perez&lt;/a&gt;: Fernando is a research scientist in
neuroscience at &lt;a class="reference external" href="http://neuroscience.berkeley.edu/"&gt;UC Berkeley&lt;/a&gt;. Before that, he was successively a
physicist and a mathematician. He has been an early advocate of the
scientific Python ecosystem, in addition to being the creator of
IPython. His vision has always been oriented toward finding a
computing environment that makes scientific creativity easier.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="tutorials"&gt;
&lt;h3&gt;Tutorials&lt;/h3&gt;
&lt;p&gt;The tutorial program is now final, and can be seen on the &lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2011"&gt;schedule&lt;/a&gt;.
Like last year, we will have two tracks:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/track/4010"&gt;An introductory track&lt;/a&gt;, designed as a two-day course addressing
the different aspects of the Python language and the scientific
computing modules, to bring beginners up to full speed. At the end of
the two days, attendees should be able to solve simple computational
problems using Python alone.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/track/4011"&gt;An advanced track&lt;/a&gt;, in which experts in various aspects of
scientific and numerical computing in Python share their knowledge in
2-hour-long tutorials.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="python-in-neuroscience-satellite"&gt;
&lt;h2&gt;Python in NeuroScience satellite&lt;/h2&gt;
&lt;p&gt;The two days following the conference, there will be &lt;a class="reference external" href="http://pythonneuro.sciencesconf.org/"&gt;a satellite
meeting on the use of Python in neuroscience&lt;/a&gt;. It will be a smaller and more
focused event, in which neuroscientists will be able to exchange on
technical aspects of computation and data management in Python.
Hopefully it will foster interesting discussions and collaborations. If you
are interested, you can submit a talk proposal for this satellite
meeting &lt;a class="reference external" href="http://pythonneuro.sciencesconf.org/"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;img alt="" class="align-center" src="http://farm5.static.flickr.com/4143/4780097256_14c99f3b32.jpg" style="width: 60%;" /&gt;
&lt;p&gt;&lt;strong&gt;Come and join us at EuroSciPy in Paris, August 25-28. Paris is a great
city. The SciPy community is a friendly one.&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="conferences"></category><category term="scipy"></category><category term="python"></category><category term="science"></category></entry><entry><title>Scikit-learn sprint on April 1st</title><link href="https://gael-varoquaux.info/programming/scikit-learn-sprint-on-april-1st.html" rel="alternate"></link><published>2011-03-26T13:27:00+01:00</published><updated>2011-03-26T13:27:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-03-26:/programming/scikit-learn-sprint-on-april-1st.html</id><summary type="html">&lt;a class="reference external image-reference" href="http://scikit-learn.sourceforge.net/"&gt;&lt;img alt="" src="http://scikit-learn.sourceforge.net/stable/_static/scikit-learn-logo-small.png" /&gt;&lt;/a&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt; team is organizing a sprint on April 1st (next
Friday). Join us in &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events"&gt;Paris, Boston, or on IRC&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;With the rise of data science, scikit-learn, a &lt;strong&gt;BSD-licensed
Python package for machine learning&lt;/strong&gt;, is becoming an asset for more and
more endeavors. Machine learning has traditionally …&lt;/p&gt;</summary><content type="html">&lt;a class="reference external image-reference" href="http://scikit-learn.sourceforge.net/"&gt;&lt;img alt="" src="http://scikit-learn.sourceforge.net/stable/_static/scikit-learn-logo-small.png" /&gt;&lt;/a&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt; team is organizing a sprint on April 1st (next
Friday). Join us in &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events"&gt;Paris, Boston, or on IRC&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;With the rise of data science, scikit-learn, a &lt;strong&gt;BSD-licensed
Python package for machine learning&lt;/strong&gt;, is becoming an asset for more and
more endeavors. Machine learning has traditionally been considered
very technical and inaccessible to non-mathematicians. We are aiming
to break this barrier.&lt;/p&gt;
&lt;p&gt;The sprint will be focused on pragmatic down-to-earth improvements in
the scikit. Our goal is to make it easy for people to contribute. A list
of tasks and organization details can be found on the &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events"&gt;sprint planning&lt;/a&gt;
wiki page. Amongst other things, we’ll be working on:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;integrating new learning algorithms&lt;/strong&gt;, in particular merging in the
many excellent pull requests that we have: &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/103"&gt;hierarchical
clustering&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/103"&gt;data transformation using linear discriminant
analysis&lt;/a&gt;, &lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/107"&gt;multinomial naive Bayes classifier&lt;/a&gt; …&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;testing and logging framework&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/94"&gt;&lt;strong&gt;better parallel computing support&lt;/strong&gt;&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;and many other itches to scratch, as it is a community-driven event.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Come and join us. It will be fun, and it’s an occasion to learn new
tricks.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external image-reference" href="http://farm5.static.flickr.com/4067/4405351641_5675ba000c.jpg"&gt;&lt;img alt="image1" src="http://farm5.static.flickr.com/4067/4405351641_5675ba000c.jpg" style="width: 20%;" /&gt;&lt;/a&gt; &lt;a class="reference external image-reference" href="http://farm6.static.flickr.com/5249/5265835075_ea0b41019c.jpg"&gt;&lt;img alt="image2" src="http://farm6.static.flickr.com/5249/5265835075_ea0b41019c.jpg" style="width: 20%;" /&gt;&lt;/a&gt; &lt;a class="reference external image-reference" href="http://farm5.static.flickr.com/4135/4974339970_566424185f.jpg"&gt;&lt;img alt="image3" src="http://farm5.static.flickr.com/4135/4974339970_566424185f.jpg" style="width: 20%;" /&gt;&lt;/a&gt; &lt;a class="reference external image-reference" href="http://farm6.static.flickr.com/5294/5425114531_6eec316967.jpg"&gt;&lt;img alt="image4" src="http://farm6.static.flickr.com/5294/5425114531_6eec316967.jpg" style="width: 20%;" /&gt;&lt;/a&gt;&lt;/p&gt;
</content><category term="programming"></category><category term="sprint"></category><category term="machine learning"></category><category term="python"></category><category term="science"></category><category term="scientific computing"></category><category term="scikit-learn"></category></entry><entry><title>Windows binaries for the scientific Python ecosystem</title><link href="https://gael-varoquaux.info/programming/windows-binaries-for-the-scientific-python-ecosystem.html" rel="alternate"></link><published>2011-02-15T09:02:00+01:00</published><updated>2011-02-15T09:02:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-02-15:/programming/windows-binaries-for-the-scientific-python-ecosystem.html</id><summary type="html">&lt;p&gt;I just realized yesterday that Christoph Gohlke has &lt;a class="reference external" href="http://www.lfd.uci.edu/~gohlke/pythonlibs/"&gt;a repository of
binary installers&lt;/a&gt; (&lt;em&gt;.exe&lt;/em&gt;) for Windows 32 and 64bit with almost all
the scientific Python packages that you can dream of:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://numpy.scipy.org"&gt;numpy&lt;/a&gt;, &lt;a class="reference external" href="http://www.scipy.org/"&gt;scipy&lt;/a&gt; and &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt;, of course (compiled
with the MKL)&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://cython.org/"&gt;cython&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;the &lt;a class="reference external" href="http://enthought.github.com/"&gt;ETS&lt;/a&gt;, including &lt;a class="reference external" href="http://enthought.github.com/mayavi/mayavi/"&gt;Mayavi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VTK&lt;/strong&gt;, with the Python …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;I just realized yesterday that Christoph Gohlke has &lt;a class="reference external" href="http://www.lfd.uci.edu/~gohlke/pythonlibs/"&gt;a repository of
binary installers&lt;/a&gt; (&lt;em&gt;.exe&lt;/em&gt;) for Windows 32 and 64bit with almost all
the scientific Python packages that you can dream of:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://numpy.scipy.org"&gt;numpy&lt;/a&gt;, &lt;a class="reference external" href="http://www.scipy.org/"&gt;scipy&lt;/a&gt; and &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt;, of course (compiled
with the MKL)&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://cython.org/"&gt;cython&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;the &lt;a class="reference external" href="http://enthought.github.com/"&gt;ETS&lt;/a&gt;, including &lt;a class="reference external" href="http://enthought.github.com/mayavi/mayavi/"&gt;Mayavi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VTK&lt;/strong&gt;, with the Python bindings&lt;/li&gt;
&lt;li&gt;a variety of &lt;a class="reference external" href="http://scikits.appspot.com/"&gt;scikits&lt;/a&gt; (including the &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;,
hurray!)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These binaries are incredibly useful, as building all these packages
under Windows does require some skills, and a compiler. They complement
very well fully-fledged scientific Python distributions such as EPD or
Python(x,y), as they can be installed on top of an existing Python
installation.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;I should say that I discovered this thanks to a long email discussion in
which Christoph Gohlke and Yakub Nowacki helped me debug a nasty Mayavi
bug on Windows 64bit that I couldn’t reproduce, as I don’t have a Windows
64bit machine available. That was particularly helpful.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="scipy"></category><category term="mayavi"></category></entry><entry><title>Interested in parallel computing and statistics? We are looking for a post-doc</title><link href="https://gael-varoquaux.info/programming/interested-in-parallel-computing-and-statistics-we-are-looking-for-a-post-doc.html" rel="alternate"></link><published>2011-01-30T22:30:00+01:00</published><updated>2011-01-30T22:30:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-01-30:/programming/interested-in-parallel-computing-and-statistics-we-are-looking-for-a-post-doc.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="https://parietal.saclay.inria.fr/"&gt;My research group&lt;/a&gt; is kick starting a new project, called
&lt;strong&gt;AzureBrain&lt;/strong&gt;, to do computational analysis of large population-wise
brain imaging and genetics data. One of the goals of the project is to
harness the power of grid computing to do statistical learning on fMRI
data, finding features in an individuals …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="https://parietal.saclay.inria.fr/"&gt;My research group&lt;/a&gt; is kick starting a new project, called
&lt;strong&gt;AzureBrain&lt;/strong&gt;, to do computational analysis of large population-wise
brain imaging and genetics data. One of the goals of the project is to
harness the power of grid computing to do statistical learning on fMRI
data, finding features in an individual’s brain images that can be
predicted by their genome. The medical applications cover the wide scope
of genetically-related brain pathologies, such as autism.&lt;/p&gt;
&lt;p&gt;Want to work in a dynamic and exciting environment, using Python to solve
challenging data analysis problems? We are looking for a post-doctoral fellow,
to be hired in spring or early summer. The ideal candidate would have a
strong background in computational statistics or machine learning, as
well as parallel computing; however, we will consider any candidate with
good experience in one or the other and a strong desire to learn.&lt;/p&gt;
&lt;p&gt;You would be employed by &lt;a class="reference external" href="http://www.inria.fr"&gt;INRIA&lt;/a&gt;, the leading computing research institute
in France. We are a team of computer scientists specialized in image
processing and statistical data analysis, integrated in one of the top
French brain research centers, &lt;a class="reference external" href="http://www-dsv.cea.fr/en/instituts/institut-d-imagerie-biomedicale-i2bm/services/neurospin-d.-le-bihan"&gt;NeuroSpin&lt;/a&gt;, south of Paris. We work
mostly in Python. The team includes core contributors to the
&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn project&lt;/a&gt;, for machine learning in Python, and the &lt;a class="reference external" href="http://nipy.sourceforge.net/"&gt;nipy
project&lt;/a&gt;, for NeuroImaging in Python.&lt;/p&gt;
&lt;p&gt;Below follows a summary of &lt;a class="reference external" href="http://parietal.saclay.inria.fr/open-positions/azure-brain-post-doc-proposal"&gt;the official job announcement&lt;/a&gt;. Please
contact Bertrand Thirion, (first name _dot_ last name _at_ inria
_dot_ fr) if you are interested, referencing the AzureBrain project.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="section" id="introduction"&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Imaging genetics studies linking functional MRI data and Single
Nucleotide Polymorphisms (SNPs) data face a dire multiple comparisons
issue. In the genome dimension, genotyping DNA chips allow recording
several hundred thousand values per subject, while in the imaging
dimension a brain image may contain 100k-1M voxels. Finding the brain
and genome regions that may be involved in this link entails a huge
number of hypotheses, hence a drastic correction of the statistical
significance of pairwise relationships, which in turn crucially reduces
the sensitivity of the statistical procedures that aim at detecting the
association. It is therefore desirable to set up techniques as sensitive
as possible to explore where in the brain and where in the genome a
significant link can be detected, while correcting for family-wise
multiple comparisons (controlling the false positive rate). Another
issue is the computational cost of these procedures, which needs to be
addressed with adequate algorithmic and computational tools.&lt;/p&gt;
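&lt;p&gt;As a rough, hypothetical illustration of the scale involved (the numbers below are my own back-of-the-envelope choices, not the project’s), a naive Bonferroni correction over all voxel-SNP pairs gives:&lt;/p&gt;

```python
# Hypothetical back-of-the-envelope numbers illustrating the multiple
# comparisons burden of pairwise voxel-SNP association tests.
n_voxels = 100_000    # low end of the 100k-1M voxels per brain image
n_snps = 500_000      # "several hundred thousand" SNP values per subject
n_tests = n_voxels * n_snps

alpha = 0.05
# A naive Bonferroni family-wise correction divides the significance
# threshold by the number of pairwise tests, which is what drastically
# reduces the sensitivity of the procedure.
bonferroni_alpha = alpha / n_tests

print(f"number of pairwise hypotheses: {n_tests:.1e}")
print(f"per-test significance threshold: {bonferroni_alpha:.1e}")
```

In practice, less conservative corrections or dimension-reduction strategies are needed to retain any sensitivity at this scale.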
&lt;/div&gt;
&lt;div class="section" id="objectives"&gt;
&lt;h2&gt;Objectives&lt;/h2&gt;
&lt;p&gt;In this project, we will consider a unique dataset acquired in the
&lt;a class="reference external" href="http://www.imagen-europe.com"&gt;Imagen project&lt;/a&gt;, an FP6 project that aims at investigating factors of
addiction in a population of adolescents; Imagen’s database contains
multi-modal neuroimaging as well as genetics and psychological data on
about 2000 subjects. This database is hosted and processed at Neurospin
and is available for research purposes. The candidate will be in charge
of:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Setting up an analysis pipeline (based on code already available to
analyze neuroimaging/genetics datasets) to extract and pre-process
the relevant data for statistical analysis.&lt;/li&gt;
&lt;li&gt;Performing statistical analysis on simulated datasets and sub-parts
of the whole database in order to set all the computational
framework. These procedures will include mass-univariate linear
modeling (with peak- and cluster-level tests), regularized multiple
regression and a permutation-based assessment framework.&lt;/li&gt;
&lt;li&gt;Launching data analysis on a large-scale grid and cloud environment,
with the help of the KerData researchers (see below).&lt;/li&gt;
&lt;li&gt;Building the post-analytic framework to ease the interpretation of the
results in both the neuroimaging and genetics domains.&lt;/li&gt;
&lt;/ul&gt;
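&lt;p&gt;The permutation-based assessment mentioned above can be sketched as follows; this is a toy illustration on simulated data, with names and sizes of my own choosing rather than the actual Imagen pipeline:&lt;/p&gt;

```python
import numpy as np

# Toy permutation-based assessment of mass-univariate associations:
# correlate one variable (e.g. a genetic score) with many features
# (e.g. voxel values), and control the family-wise error rate by
# comparing the maximum statistic to its permutation null distribution.
rng = np.random.default_rng(0)
n_subjects, n_features = 50, 200
x = rng.standard_normal(n_subjects)                # genetic variable
Y = rng.standard_normal((n_subjects, n_features))  # imaging features

def max_abs_corr(x, Y):
    # Pearson correlation of x with each column of Y; keep the max
    xc = (x - x.mean()) / x.std()
    Yc = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return np.abs(xc @ Yc / len(x)).max()

observed = max_abs_corr(x, Y)
# Null distribution: shuffling subjects breaks any x-Y association
null = np.array([max_abs_corr(rng.permutation(x), Y)
                 for _ in range(500)])
p_fwe = (null >= observed).mean()  # family-wise corrected p-value
```

Using the maximum statistic across features means a single threshold controls the family-wise error rate, at the price of the many refits that make the computational cost an issue at the scales discussed here.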
&lt;p&gt;The analysis framework is based on algorithmic tools developed in
C/Python (numpy, scipy and scikit-learn). The candidate will interact i)
with researchers of the Parietal team for algorithmic aspects, but also
ii) with CEA researchers at Neurospin, who will provide expertise in the
genetics domain, and iii) with the KerData team (INRIA Rennes) and the
Joint MSR-INRIA Research Center (Microsoft Research), which will provide
help and massive computation facilities. The project has access to
grid/cloud computing facilities to be used in collaboration with
INRIA/KerData and MSR-INRIA partners.&lt;/p&gt;
&lt;p&gt;The expected result is the discovery of correlations between brain
activation and genetic information.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="required-knowledge-and-background"&gt;
&lt;h2&gt;Required knowledge and background&lt;/h2&gt;
&lt;p&gt;The candidate should have at least a basic knowledge of standard
statistical concepts. He or she should have a first significant
experience in parallel computation and with the Python language. It is
important that he or she has some real interest in genetics and/or brain
imaging, in order to have strong interactions with specialists of these
domains. He or she will benefit from the algorithmic tools developed at
Parietal and from the database settings and data pre-processing tools
developed by Neurospin researchers.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="jobs"></category><category term="python"></category><category term="science"></category><category term="scientific computing"></category></entry><entry><title>EuroSciPy 2011: the dates are out - Aug 25-28, Paris</title><link href="https://gael-varoquaux.info/programming/euroscipy-2011-the-dates-are-out-aug-25-28-paris.html" rel="alternate"></link><published>2011-01-16T15:57:00+01:00</published><updated>2011-01-16T15:57:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-01-16:/programming/euroscipy-2011-the-dates-are-out-aug-25-28-paris.html</id><summary type="html">&lt;p&gt;We have finally been able to settle on final dates and venue for
&lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy_2011"&gt;EuroSciPy 2011&lt;/a&gt;, the 4th European meeting on Python in Science.&lt;/p&gt;
&lt;p&gt;The conference will be held &lt;strong&gt;from Thursday August 25th, to Sunday
August 28th&lt;/strong&gt;. The &lt;a class="reference external" href="http://www.ens.fr"&gt;ENS&lt;/a&gt; will be hosting the conference once again,
right in the center of …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We have finally been able to settle on final dates and venue for
&lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy_2011"&gt;EuroSciPy 2011&lt;/a&gt;, the 4th European meeting on Python in Science.&lt;/p&gt;
&lt;p&gt;The conference will be held &lt;strong&gt;from Thursday August 25th, to Sunday
August 28th&lt;/strong&gt;. The &lt;a class="reference external" href="http://www.ens.fr"&gt;ENS&lt;/a&gt; will be hosting the conference once again,
right in the center of Paris.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="conferences"></category></entry><entry><title>Research jobs in France: the black humor of 2010 is the reality of 2011</title><link href="https://gael-varoquaux.info/science/research-jobs-in-france-the-black-humor-of-2010-is-the-reality-of-2011.html" rel="alternate"></link><published>2011-01-15T11:41:00+01:00</published><updated>2011-01-15T11:41:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-01-15:/science/research-jobs-in-france-the-black-humor-of-2010-is-the-reality-of-2011.html</id><summary type="html">&lt;p&gt;The French basic research landscape is dominated by a few nationwide
institutes, similar to the NIST or the NIH in the US. The largest of these
is the &lt;a class="reference external" href="http://www.cnrs.fr/index.php"&gt;CNRS&lt;/a&gt; (Centre National de la Recherche Scientifique). Getting a
tenured job in one of those institutes enables someone to focus on basic …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The French basic research landscape is dominated by a few nationwide
institutes, similar to the NIST or the NIH in the US. The largest of these
is the &lt;a class="reference external" href="http://www.cnrs.fr/index.php"&gt;CNRS&lt;/a&gt; (Centre National de la Recherche Scientifique). Getting a
tenured job in one of those institutes enables someone to focus on basic
research rather than teaching or going into industry. It has always
been quite challenging to get such a position, as many people apply for very
few positions, and the choice of the candidates is quite political. Each
year there is a call for applications, through an impressive formal
process that young researchers trying to get jobs in France end up
knowing quite well.&lt;/p&gt;
&lt;p&gt;Last year, I was visiting a research lab (&lt;a class="reference external" href="http://www.incm.cnrs-mrs.fr/en_index.php"&gt;INCM&lt;/a&gt;) and I saw, in their
coffee-break room, the following poster (below), which I could
clearly recognize as the official call for applications for positions at
CNRS.&lt;/p&gt;
&lt;p&gt;Now this poster says ‘&lt;strong&gt;The CNRS recruits 3 researchers (m/w) in all
fields of research&lt;/strong&gt;‘. Of course it’s a fake poster and black humor: 3
positions nationwide in all fields of research is ridiculously low. It
is however an expression of the nightmare of thousands of young
researchers who are applying each year and keep hearing that the
government will &lt;a class="reference external" href="http://www.latribune.fr/actualites/economie/france/20100415trib000499181/la-fonction-publique-d-etat-perdra-34.000-postes-en-2011-selon-georges-tron.html"&gt;slash the number of state employees&lt;/a&gt;.&lt;/p&gt;
&lt;img alt="" class="align-center" src="attachments/cnrs_recruits.jpg" style="width: 70%;" /&gt;
&lt;p&gt;The call for the 2011 applications for research positions at &lt;a class="reference external" href="http://en.inria.fr/"&gt;INRIA&lt;/a&gt;,
the French national computer science institute, another one of
the big research institutions in France, is &lt;a class="reference external" href="http://www.inria.fr/institut/recrutement-metiers/offres/concours-2011-5-postes-de-charge-de-recherche-2e-classe-sont-a-pourvoir/concours-2011"&gt;out&lt;/a&gt;. The page is entitled
&lt;em&gt;Cinq postes de chargé de recherche 2e classe sont à pourvoir&lt;/em&gt; (&lt;strong&gt;5
positions for junior researchers are available&lt;/strong&gt;). This is not a joke,
and it is striking to see the similarity between &lt;strong&gt;the dark humor of
2010 and the reality of 2011&lt;/strong&gt;. To be fair, INRIA is smaller than the CNRS,
as it covers only computer science and its applications (listed as applied
maths, numerical computing and simulation, algorithm and software
research, networks and distributed systems, and computational modeling
for life sciences). The number of applications is in the hundreds rather than
thousands, but having only 5 jobs available nationwide still feels
really awkward.&lt;/p&gt;
&lt;blockquote&gt;
&lt;a class="reference external" href="attachments/cnrs_recruits.pdf"&gt;PDF poster&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;A minor detail: I am trying to get a job in computational science
research in France.&lt;/p&gt;
</content><category term="science"></category><category term="personnal"></category><category term="science"></category></entry><entry><title>Scientific publication for software development</title><link href="https://gael-varoquaux.info/programming/scientific-publication-for-software-development.html" rel="alternate"></link><published>2011-01-08T22:40:00+01:00</published><updated>2011-01-08T22:40:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2011-01-08:/programming/scientific-publication-for-software-development.html</id><summary type="html">&lt;p&gt;The academic community seems to judge the validity and significance of
any contribution by the number of papers published and the number of
citations they get. To find funding, to get credit, you have to
&lt;strong&gt;publish or perish&lt;/strong&gt;. However, the natural output of software
development tends not to be an …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The academic community seems to judge the validity and significance of
any contribution by the number of papers published and the number of
citations they get. To find funding, to get credit, you have to
&lt;strong&gt;publish or perish&lt;/strong&gt;. However, the natural output of software
development tends not to be an article (people who confuse articles and
documentation do a poor job of both, IMHO).&lt;/p&gt;
&lt;p&gt;While I believe that this policy is harmful for the quality of research,
I also know that I cannot fight it, and chances are that many others are
in my situation. As such, we need to publish scientific papers about the
scientific software that we develop (such as &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi&lt;/a&gt;, or
&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, as far as I am concerned). On the other hand, as an
editor of the &lt;a class="reference external" href="http://conference.scipy.org/proceedings.html"&gt;Scipy conference proceedings&lt;/a&gt;, I have found that the
process of writing a paper on software work and going through peer
review can be greatly beneficial to the software. Indeed, it forces
authors to do a thorough review of the prior work, and to clearly
identify the purpose of the project. Also, such an article can only be
much shorter than a user manual, so it forces the authors to identify
the key concepts of their software, and explain them clearly. As a
result, it helps find design and usability flaws and gain insight
into how the user manual can be structured.&lt;/p&gt;
&lt;p&gt;A major challenge to publishing is that most of the highly-ranked
journals tend to disregard software work, unless it is very specific
to a scientific problem, which actually makes it less useful to the
complete ecosystem. Deeply rooted in the minds of the editors and the
reviewers, there tends to be the idea that developing software is easy
compared to doing experiments or proofs. In addition, these top-notch
scientists are not always the most qualified to judge the quality of
software, as they have most often never worked on a major software
project. The good news is that this is slowly changing with the
creation of software tracks in specialized journals, and the development
of new journals focused on scientific software.&lt;/p&gt;
&lt;div class="section" id="journals-for-publishing-about-interdisciplinary-scientific-software"&gt;
&lt;h2&gt;Journals for publishing about interdisciplinary scientific software&lt;/h2&gt;
&lt;p&gt;In my opinion, interdisciplinary scientific software packages such as &lt;a class="reference external" href="http://numpy.scipy.org/"&gt;numpy&lt;/a&gt;,
the &lt;a class="reference external" href="http://www.gnu.org/software/gsl/"&gt;GSL&lt;/a&gt;, &lt;a class="reference external" href="http://www.gnu.org/software/octave/"&gt;octave&lt;/a&gt;, &lt;a class="reference external" href="http://www.scilab.org/"&gt;scilab&lt;/a&gt;, &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt;, or &lt;a class="reference external" href="http://www.fenicsproject.org"&gt;Fenics&lt;/a&gt;, are the
most valuable projects, as they provide foundations to build science in
the open. The challenges that these projects have to face are not only
algorithmic or computational, but also include providing good user
interfaces, or developing and catering for very large communities of
users. These problems are considered &lt;em&gt;solved&lt;/em&gt; in a scientific
context, as they have all been solved at least once, often quite
successfully, by commercial products such as Matlab. As a result, it is
hard to get funding for these projects unless there is a political
reason behind the funding, and IMHO politics tend to produce bad
software. Publishing high-profile articles on interdisciplinary
scientific software is thus hard, but critical. For this we need
journals that accept software papers, but are not read only by
researchers in CS or IT departments.&lt;/p&gt;
&lt;p&gt;A couple of years ago, some of us made a review of where it was possible
to publish truly wide-scope scientific software, and we found that there
was pretty much no option. It’s crazy to see that things have still not
changed much, and that a lot of major general-purpose, widely-used
projects, like the ones I cited above, have never been acknowledged by a
publication.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://cise.aip.org/"&gt;Computing in Science and Engineering&lt;/a&gt;: a joint publication
between the AIP (American Institute of Physics) and the IEEE, it is a
magazine-style journal and it can be seen in many coffee rooms of
computational-science departments. Thanks to that it gets a lot of
readership, but the articles cannot be too technical (which might be a
good thing), and there is room for only a few articles.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.openresearchcomputation.com/"&gt;Open Research Computation (ORC)&lt;/a&gt;: A newly-created journal, with
a focus on making computational research reproducible. As such, it
favors papers about open source scientific software with good
software-engineering. &lt;strong&gt;Open access&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition to these software-friendly journals, some large-scope
journals on computational science sometimes accept software papers,
though software production falls outside their scope:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.elsevier.com/locate/jocs/"&gt;Journal of Computational Science&lt;/a&gt;: a very multidisciplinary
journal.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.siam.org/journals/sisc.php"&gt;SIAM Journal on Scientific Computing (SISC)&lt;/a&gt;: a journal of the
SIAM (society for industrial and applied mathematics), thus with a
focus on engineering-type applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="journals-for-publishing-domain-specific-scientific-software"&gt;
&lt;h2&gt;Journals for publishing domain-specific scientific software&lt;/h2&gt;
&lt;p&gt;It is usually easier to publish a domain-specific software contribution,
as you can claim that you have solved a well-identified scientific
roadblock. Until recently, it was hard to get such papers in the best
journals of a community, but things have been changing.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.elsevier.com/locate/cpc"&gt;Computer Physics Communications&lt;/a&gt;: for algorithms and packages
solving numerical and computational problems related to physics.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://bioinformatics.oxfordjournals.org/"&gt;Bioinformatics&lt;/a&gt;: accepts software papers on biology-related
problems.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://toms.acm.org/"&gt;ACM Transactions On Mathematical Software (TOMS)&lt;/a&gt;: a journal of
the ACM (Association for Computing Machinery), thus with a focus on
algorithms.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.jstatsoft.org/"&gt;Journal of Statistical Software&lt;/a&gt;: this journal comes from the
community of people who wrote the R language. They know that open
source scientific software is hard and important. &lt;strong&gt;Open access&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://jmlr.csail.mit.edu/mloss/"&gt;Journal of Machine Learning Research (JMLR), Machine Learning Open
Source (MLOSS) track&lt;/a&gt;: reference journal in the machine learning
community, the MLOSS track cares strongly about documentation,
packaging and usability of the software. &lt;strong&gt;Open access&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.elsevier.com/wps/find/journaldescription.cws_home/398/description#description"&gt;Computers &amp;amp; Geoscience&lt;/a&gt;: computational geoscience journal that
accepts software papers (thanks Michael Aye for the pointer).&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291099-0542"&gt;Computer Applications in Engineering Education&lt;/a&gt;: a journal
about education with computers. AFAIK, no special focus on open
source or software-engineering quality (thanks Doug Holton for the
pointer).&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.springer.com/biomed/neuroscience/journal/12021"&gt;NeuroInformatics&lt;/a&gt; and &lt;a class="reference external" href="http://www.frontiersin.org/neuroinformatics"&gt;Frontiers NeuroInformatics&lt;/a&gt; (&lt;strong&gt;open
access&lt;/strong&gt;): two journals on computer-related issues in neuroscience
that accept software papers. I have the feeling that the latter is a
bit warmer to open source than the former (thanks Andrew Davison for
the pointer).&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.elsevier.com/wps/find/journaldescription.cws_home/503304/description#description"&gt;Computers and Electronics in Agriculture&lt;/a&gt;: for publishing
agriculture-related software (thanks John B. Cole for the pointer).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I should stress that, in my opinion, journals such as &lt;a class="reference external" href="http://www.ploscompbiol.org"&gt;PLOS
computational biology&lt;/a&gt; or the &lt;a class="reference external" href="http://www.elsevier.com/wps/find/journaldescription.cws_home/622866/description#description"&gt;Journal of Computational Physics&lt;/a&gt;
are not great venues for software papers, as they tend to emphasize what
I would call &lt;em&gt;proof of principle&lt;/em&gt;, and not packaged and maintained
software.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I have the feeling that there is a need for more communication on
scientific software. The list above is, of course, incomplete. If you
have extra ideas, please do not hesitate to contact me.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As a conclusion, I would like to point out that conferences are also a
good way to advertise scientific software. You may even be approached by
a journal editor, opening the door to a journal article. Last
year I was at &lt;a class="reference external" href="http://hpfem.org/events/esco-2010/"&gt;ESCO&lt;/a&gt;, a coupled-problems conference, and there was a
track on Python in science. All in all, the conference was a huge amount
of fun, and I learned a lot about practical aspects of numerical methods,
given the number of numerical-computing geeks that were around. The same
community is organizing &lt;a class="reference external" href="http://hpfem.org/events/femtec-2011/"&gt;FEMTEC&lt;/a&gt; in Lake Tahoe (California) this year.
If you are in any field related to FEM or multiphysics, you should
consider it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update: added links suggested by Doug Holton, Michael Aye, Andrew
Davison, and John B. Cole&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="scientific computing"></category><category term="publishing"></category></entry><entry><title>ICA versus PCA in the scikit-learn: the value of code over pictures</title><link href="https://gael-varoquaux.info/programming/ica-versus-pca-in-the-scikit-learn-the-value-of-code-over-pictures.html" rel="alternate"></link><published>2010-11-20T16:12:00+01:00</published><updated>2010-11-20T16:12:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-11-20:/programming/ica-versus-pca-in-the-scikit-learn-the-value-of-code-over-pictures.html</id><summary type="html">&lt;p&gt;When I was trying to get an intuitive feeling of the difference between
&lt;strong&gt;Independent Component Analysis&lt;/strong&gt; (ICA) and &lt;strong&gt;Principal Component
Analysis&lt;/strong&gt; (PCA), I wrote a few Python scripts producing &lt;a class="reference external" href="http://gael-varoquaux.info/scientific_computing/ica_pca/index.html"&gt;some
visualizations explaining the difference&lt;/a&gt; that have had a bit of
success.&lt;/p&gt;
&lt;p&gt;During the last sprint on &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt;, a machine learning …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When I was trying to get an intuitive feeling of the difference between
&lt;strong&gt;Independent Component Analysis&lt;/strong&gt; (ICA) and &lt;strong&gt;Principal Component
Analysis&lt;/strong&gt; (PCA), I wrote a few Python scripts producing &lt;a class="reference external" href="http://gael-varoquaux.info/scientific_computing/ica_pca/index.html"&gt;some
visualizations explaining the difference&lt;/a&gt; that have had a bit of
success.&lt;/p&gt;
&lt;p&gt;During the last sprint on &lt;a class="reference external" href="http://scikit-learn.org"&gt;scikit-learn&lt;/a&gt;, a machine learning
toolkit in Python, we cleaned up the ICA code that I had been using, and
we added it to the scikit, along with &lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_vs_pca.html"&gt;an example&lt;/a&gt; inspired from this
earlier toy problem.&lt;/p&gt;
&lt;a class="reference external image-reference" href="http://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_vs_pca.html"&gt;&lt;img alt="" class="align-center" src="http://scikit-learn.org/stable/_images/sphx_glr_plot_ica_vs_pca_001.png" /&gt;&lt;/a&gt;
&lt;p&gt;While the pictures are not as pretty as the initial ones I had done
(because we wanted to keep the example as simple as possible), I am very
happy that this discussion is now more than a set of static pictures:
it comes with runnable code.&lt;/p&gt;
&lt;p&gt;This illustrates very well my feelings on the future of scientific code
and scientific research: papers, books, and teaching materials on numerical
methods or computational science are greatly enhanced when they come
with highly-readable code that illustrates their purpose, because the
reader can start asking questions of the algorithm. Hopefully, &lt;strong&gt;the
documentation of scientific programming toolkits will become the
textbooks of tomorrow&lt;/strong&gt;. We still have a lot of work to do.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;It’s funny, I just realized that my vision on software might have been
strongly influenced by the fact that my mother, a high-school math
teacher, spent endless nights when I was a teenager working on
&lt;a class="reference external" href="http://fr.wikipedia.org/wiki/G%C3%A9oplan"&gt;Geoplan&lt;/a&gt;, a software for teaching geometry by interaction with
figures.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="scientific computing"></category></entry><entry><title>Multitouch with VTK (and MedINRIA and Mayavi)</title><link href="https://gael-varoquaux.info/programming/multitouch-with-vtk-and-medinria-and-mayavi.html" rel="alternate"></link><published>2010-09-18T09:40:00+02:00</published><updated>2010-09-18T09:40:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-09-18:/programming/multitouch-with-vtk-and-medinria-and-mayavi.html</id><summary type="html">&lt;p&gt;If the videos on this post are not showing, click through to see
them.&lt;/p&gt;
&lt;p&gt;A colleague of mine, &lt;a class="reference external" href="http://sites.google.com/site/pierrefillard/"&gt;Pierre Fillard&lt;/a&gt;, has just integrated multitouch
in the next generation of the VTK-based medical imaging software
&lt;a class="reference external" href="http://www-sop.inria.fr/asclepios/software/MedINRIA/"&gt;MedINRIA&lt;/a&gt;. The nice thing is that it works on an Apple laptop out of
the box …&lt;/p&gt;</summary><content type="html">&lt;p&gt;If the videos on this post are not showing, click through to see
them.&lt;/p&gt;
&lt;p&gt;A colleague of mine, &lt;a class="reference external" href="http://sites.google.com/site/pierrefillard/"&gt;Pierre Fillard&lt;/a&gt;, has just integrated multitouch
in the next generation of the VTK-based medical imaging software
&lt;a class="reference external" href="http://www-sop.inria.fr/asclepios/software/MedINRIA/"&gt;MedINRIA&lt;/a&gt;. The nice thing is that it works on an Apple laptop out of
the box.&lt;/p&gt;
&lt;p&gt;
&lt;object width="640" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/UyO4KRnYreU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;&lt;p&gt;On &lt;a class="reference external" href="https://sites.google.com/site/pierrefillard/coding-blog/multi-touchgesturesinvtk"&gt;his blog&lt;/a&gt;, he explains how he did it (warning, it involves C++ and
VTK programming). &lt;strong&gt;He also gives the code for this!&lt;/strong&gt; Enjoy.&lt;/p&gt;
&lt;p&gt;This reminded me of when the &lt;a class="reference external" href="http://www.enthought.com/"&gt;Enthought guys&lt;/a&gt; had rigged up a large
multitouch screen and wired it in &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi&lt;/a&gt; for 3D plotting, and in
&lt;a class="reference external" href="http://code.enthought.com/projects/chaco/"&gt;chaco&lt;/a&gt; for 2D plotting, using only a web-cam, a video projector, and
pure Python image-analysis code:&lt;/p&gt;
&lt;p&gt;
&lt;object width="480" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/bEf3nGjOgpU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;</content><category term="programming"></category><category term="mayavi"></category><category term="python"></category><category term="scientific computing"></category></entry><entry><title>Machine learning humour</title><link href="https://gael-varoquaux.info/science/machine-learning-humour.html" rel="alternate"></link><published>2010-09-16T23:11:00+02:00</published><updated>2010-09-16T23:11:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-09-16:/science/machine-learning-humour.html</id><summary type="html">&lt;div class="section" id="yes-but-they-overfit"&gt;
&lt;h2&gt;Yes, but they overfit&lt;/h2&gt;
&lt;p&gt;If you are reading this post through a planet and the movie isn’t showing
up, just &lt;a class="reference external" href="http://gael-varoquaux.info/science/machine-learning-humour.html"&gt;click through&lt;/a&gt; to understand what the hell this is about.&lt;/p&gt;
&lt;p&gt;
&lt;object width="480" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/m60lVGz34hU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;&lt;/div&gt;
&lt;div class="section" id="some-explanations"&gt;
&lt;h2&gt;Some explanations…&lt;/h2&gt;
&lt;div class="section" id="machine-learning-geeks-and-beers"&gt;
&lt;h3&gt;Machine learning, geeks, and beers&lt;/h3&gt;
&lt;p&gt;Sorry for the bad humour. In the previous weeks my social geek life …&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="yes-but-they-overfit"&gt;
&lt;h2&gt;Yes, but they overfit&lt;/h2&gt;
&lt;p&gt;If you are reading this post through a planet and the movie isn’t showing
up, just &lt;a class="reference external" href="http://gael-varoquaux.info/science/machine-learning-humour.html"&gt;click through&lt;/a&gt; to understand what the hell this is about.&lt;/p&gt;
&lt;p&gt;
&lt;object width="480" height="385"&gt;
&lt;embed src="http://www.youtube.com/v/m60lVGz34hU?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"&gt;
&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;&lt;/div&gt;
&lt;div class="section" id="some-explanations"&gt;
&lt;h2&gt;Some explanations…&lt;/h2&gt;
&lt;div class="section" id="machine-learning-geeks-and-beers"&gt;
&lt;h3&gt;Machine learning, geeks, and beers&lt;/h3&gt;
&lt;p&gt;Sorry for the bad humour. Over the past few weeks, my social geek life
had two highlights:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.pycon.fr/conference/edition2010"&gt;Pycon fr&lt;/a&gt;, the French Python conference, and ensuing drinking&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" src="http://farm5.static.flickr.com/4077/4938486734_378f52fd3d.jpg" style="width: 45%;" /&gt;
&lt;img alt="" src="http://farm5.static.flickr.com/4114/4938124265_027853c81a.jpg" style="width: 45%;" /&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://fseoane.net/blog/2010/second-scikitslearn-coding-sprint/"&gt;The second sprint&lt;/a&gt; on the &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit learn&lt;/a&gt;, a library for machine
learning in Python.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At the first event (or maybe the related drinking) there was a lot of
discussion about NoSQL databases, and I was introduced to &lt;a class="reference external" href="http://www.xtranormal.com/watch/6995033/&amp;quot;&amp;quot;"&gt;this
fantastic video&lt;/a&gt; making fun of MongoDB fanboys. A few days later I was
hacking on the scikit, comparing estimators and discussing hype versus
fact in machine learning algorithms (hint: &lt;a class="reference external" href="http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization"&gt;there is no free lunch&lt;/a&gt;,
but you may get &lt;a class="reference external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.2501&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;a free brunch&lt;/a&gt;). Since, in brain imaging, people seem to
be doing nothing but SVMs over and over while &lt;a class="reference external" href="http://hal.inria.fr/hal-00504095/PDF/icpr_2010_tv.pdf"&gt;methods with more
appropriate sparsity clearly perform better&lt;/a&gt;, I composed this stupid
video.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="anything-to-learn-about-machine-learning-in-there"&gt;
&lt;h3&gt;Anything to learn about machine learning in there?&lt;/h3&gt;
&lt;p&gt;The short answer is: probably no. This video is humour, and there is
little truth in it (well, RFE is indeed slow as a dog). However, not every
reader of this blog is a machine learning expert, so let me explain the
stakes of the pseudo discussion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overfitting&lt;/strong&gt;: when you learn a predictive model on a noisy data set
with a finite amount of data, for instance trying to predict whether a
movie is popular or not from its ratings, you should be careful not to
learn every detail of the data by heart. Otherwise you will learn
noise that, by chance, correlates with what you are trying to predict.
When you try to generalize to new data, these features that you learned
from noise will be detrimental to your prediction performance. For
instance, &lt;a class="reference external" href="http://www.reddit.com/r/Python/comments/cwq37/announcing_python_nltk_demos_natural_language/"&gt;the presence of Matt Damon&lt;/a&gt; is not the sole predictor of the
quality of a movie. This is called overfitting. The goal of
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Regularization_%28mathematics%29"&gt;regularization&lt;/a&gt; is to avoid this overfitting.&lt;/p&gt;
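&lt;p&gt;To make this concrete, here is a small illustrative sketch (my own toy example, nothing to do with movies or brain imaging): a high-degree polynomial fit to a handful of noisy points learns the noise by heart, while a small ridge (regularization) penalty tames it.&lt;/p&gt;

```python
# Toy illustration of overfitting and regularization: fit a degree-12
# polynomial to 15 noisy samples of a sine, with and without a ridge
# (L2) penalty on the coefficients.
import numpy as np

rng = np.random.RandomState(0)
x_train = np.linspace(0, 1, 15)
x_test = np.linspace(0, 1, 200)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.randn(15)
y_true = np.sin(2 * np.pi * x_test)  # noiseless ground truth

def fit_poly(x, y, degree, alpha=0.0):
    # Ridge-regularized least squares on a polynomial (Vandermonde) basis,
    # solved as an augmented least-squares problem for numerical stability.
    V = np.vander(x, degree + 1)
    A = np.vstack([V, np.sqrt(alpha) * np.eye(degree + 1)])
    b = np.concatenate([y, np.zeros(degree + 1)])
    return np.linalg.lstsq(A, b, rcond=None)[0]

for alpha in (0.0, 1e-3):
    c = fit_poly(x_train, y_train, degree=12, alpha=alpha)
    train_err = np.sqrt(np.mean((np.vander(x_train, 13) @ c - y_train) ** 2))
    test_err = np.sqrt(np.mean((np.vander(x_test, 13) @ c - y_true) ** 2))
    print(f"alpha={alpha:g}: train RMSE {train_err:.3f}, test RMSE {test_err:.3f}")
```

&lt;p&gt;The training error of the unregularized fit is deceptively small; the error on fresh test points tells the real story.&lt;/p&gt;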
&lt;p&gt;Both SVM and elasticnet implement regularization, but in different ways.
In the case of brain imaging, the predictive features (voxels) are
very sparse but the noise is highly structured, so SVMs (which do not
operate on voxels directly) are not able to select the relevant
voxels and tend to overfit (which can be counter-balanced by univariate
feature selection, as in the &lt;a class="reference external" href="http://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html"&gt;scikit example&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RFE (recursive feature elimination) is slow as a dog&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_digits&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearSVC&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;svc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LinearSVC&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.rfe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RFE&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;RFE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;21.5&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikits.learn.glm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ElasticNet&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;ElasticNet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;26.7&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Yeah, but it does much more than simply build a predictor: it builds
a ‘heat map’ of which features help predicting (run &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/auto_examples/rfe_digits.html"&gt;this scikit-learn
example&lt;/a&gt; to get an idea).&lt;/p&gt;
&lt;p&gt;I am afraid that all the examples I pointed to require the development
version of the scikit. Sorry, we just finished a sprint, and there will
be a release soon.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="personnal"></category><category term="python"></category><category term="humor"></category></entry><entry><title>Scikit Learn coding sprint</title><link href="https://gael-varoquaux.info/programming/scikit-learn-coding-sprint.html" rel="alternate"></link><published>2010-09-04T17:43:00+02:00</published><updated>2010-09-04T17:43:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-09-04:/programming/scikit-learn-coding-sprint.html</id><summary type="html">&lt;p&gt;We have been really crap at communicating the next &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;
coding sprint. It’s next week!&lt;/p&gt;
&lt;p&gt;The coding sprint will take place the 8 and 9 September at &lt;a class="reference external" href="http://maps.google.fr/maps/place?oe=utf-8&amp;amp;rls=com.mandriva:en-US:official&amp;amp;client=firefox-a&amp;amp;um=1&amp;amp;ie=UTF-8&amp;amp;q=inria+saclay&amp;amp;fb=1≷=fr&amp;amp;hq=inria&amp;amp;hnear=Saclay&amp;amp;cid=14838681423181723946"&gt;INRIA
Saclay&lt;/a&gt;, near Paris, in the room K110 (building K).&lt;/p&gt;
&lt;p&gt;For those who cannot make it, it will be possible to participate …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We have been really crap at communicating the next &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;
coding sprint. It’s next week!&lt;/p&gt;
&lt;p&gt;The coding sprint will take place the 8 and 9 September at &lt;a class="reference external" href="http://maps.google.fr/maps/place?oe=utf-8&amp;amp;rls=com.mandriva:en-US:official&amp;amp;client=firefox-a&amp;amp;um=1&amp;amp;ie=UTF-8&amp;amp;q=inria+saclay&amp;amp;fb=1≷=fr&amp;amp;hq=inria&amp;amp;hnear=Saclay&amp;amp;cid=14838681423181723946"&gt;INRIA
Saclay&lt;/a&gt;, near Paris, in the room K110 (building K).&lt;/p&gt;
&lt;p&gt;For those who cannot make it, it will be possible to participate using
the IRC chan (#scikit-learn on irc.freenode.net).&lt;/p&gt;
&lt;p&gt;We will start at 9am (Paris time), and a sketch of the planning can be
found &lt;a class="reference external" href="http://sourceforge.net/apps/trac/scikit-learn/wiki/SprintPlanning"&gt;here&lt;/a&gt;. In particular:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;More docs! We still need tutorials: feature selection, model
selection, cross-validation, etc.&lt;/li&gt;
&lt;li&gt;Make the &lt;a class="reference external" href="http://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/pipeline.py"&gt;pipeline object&lt;/a&gt; really work + illustration in different
contexts.&lt;/li&gt;
&lt;li&gt;Clean-up and docs for Bayesian approaches.&lt;/li&gt;
&lt;li&gt;Implementation of PCA (fit + transform).&lt;/li&gt;
&lt;li&gt;FastICA (adapt the &lt;a class="reference external" href="http://github.com/GaelVaroquaux/canica/blob/master/canica/algorithms/fastica.py"&gt;CanICA&lt;/a&gt; code)&lt;/li&gt;
&lt;li&gt;LDA : Covariance estimators (Ledoit-Wolf) and add transform.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/preprocessing.py"&gt;Preprocessing routines&lt;/a&gt; (center, standardize) with fit transform.&lt;/li&gt;
&lt;li&gt;Anything that you have a particular interest in.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not hesitate to send advice on this (incomplete…) list to the &lt;a class="reference external" href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general"&gt;mailing list&lt;/a&gt;, and see you next week!&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt; is a Python module for efficient and easy machine
learning using scipy and numpy.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="scientific computing"></category><category term="scikit-learn"></category></entry><entry><title>SVG Word map of countries</title><link href="https://gael-varoquaux.info/misc/svg-word-map-of-countries.html" rel="alternate"></link><published>2010-08-24T10:55:00+02:00</published><updated>2010-08-24T10:55:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-08-24:/misc/svg-word-map-of-countries.html</id><summary type="html">&lt;p&gt;To be able to visualize some quantities attached to countries all over
the world, I needed an image with various countries color-coded. The
fantastic &lt;a class="reference external" href="http://matplotlib.sourceforge.net/basemap/doc/html/"&gt;matplotlib basemap package&lt;/a&gt; was not an option as I really
needed a static image.&lt;/p&gt;
&lt;p&gt;So I generated an SVG image with all the countries. It was …&lt;/p&gt;</summary><content type="html">&lt;p&gt;To be able to visualize some quantities attached to countries all over
the world, I needed an image with various countries color-coded. The
fantastic &lt;a class="reference external" href="http://matplotlib.sourceforge.net/basemap/doc/html/"&gt;matplotlib basemap package&lt;/a&gt; was not an option as I really
needed a static image.&lt;/p&gt;
&lt;p&gt;So I generated an SVG image with all the countries. It was generated by
tracing a bitmap, so it has a lot of imperfections, but being an SVG
with each (major) country as a separate object, it can be used to
create a color-coded world map. I am uploading it here under a
public-domain license. Enjoy!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PNG&lt;/strong&gt;&lt;/p&gt;
&lt;img alt="" src="../images/misc/countries.png" style="width: 50%;" /&gt;
&lt;p&gt;&lt;strong&gt;SVG&lt;/strong&gt;: &lt;a class="reference external" href="../images/misc/countries.svg"&gt;countries.svg&lt;/a&gt;&lt;/p&gt;
&lt;!-- _ --&gt;
</content><category term="misc"></category><category term="python"></category><category term="scientific computing"></category><category term="travels"></category><category term="art"></category></entry><entry><title>Software design for maintainability</title><link href="https://gael-varoquaux.info/programming/software-design-for-maintainability.html" rel="alternate"></link><published>2010-08-01T23:47:00+02:00</published><updated>2010-08-01T23:47:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-08-01:/programming/software-design-for-maintainability.html</id><summary type="html">&lt;p&gt;I have just spent the best part of my Sunday fixing a bug that turned
out to be a &lt;a class="reference external" href="https://svn.enthought.com/enthought/changeset/25699/"&gt;seemingly-trivial two-liner&lt;/a&gt;. Such unpleasant experiences
are all too frequent, and weigh heavily on my view of code design.&lt;/p&gt;
&lt;div class="section" id="my-stance-on-code-design"&gt;
&lt;h2&gt;My stance on code design&lt;/h2&gt;
&lt;img alt="" class="align-right" src="https://gael-varoquaux.info/programming/attachments/software_design_for_maintainability/cool-car-drawing-5.jpg" style="width: 30%;" /&gt;
&lt;p&gt;I call &lt;em&gt;code design&lt;/em&gt; the process of designing …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;I have just spent the best part of my Sunday fixing a bug that turned
out to be a &lt;a class="reference external" href="https://svn.enthought.com/enthought/changeset/25699/"&gt;seemingly-trivial two-liner&lt;/a&gt;. Such unpleasant experiences
are all too frequent, and weigh heavily on my view of code design.&lt;/p&gt;
&lt;div class="section" id="my-stance-on-code-design"&gt;
&lt;h2&gt;My stance on code design&lt;/h2&gt;
&lt;img alt="" class="align-right" src="https://gael-varoquaux.info/programming/attachments/software_design_for_maintainability/cool-car-drawing-5.jpg" style="width: 30%;" /&gt;
&lt;p&gt;I call &lt;em&gt;code design&lt;/em&gt; the process of designing the architecture of a
piece of software: what are the objects it uses? how do they interact?
how is the information passed around?…&lt;/p&gt;
&lt;p&gt;My view of code design and software engineering has progressively
evolved to favor &lt;strong&gt;extreme simplicity&lt;/strong&gt; over sophistication. I believe
that a good programmer should know &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Design_pattern_%28computer_science%29"&gt;design patterns&lt;/a&gt;, &lt;a class="reference external" href="http://gael-varoquaux.info/computers/python_advanced/index.html"&gt;powerful
language features&lt;/a&gt;, &lt;a class="reference external" href="http://scipy2010.blogspot.com/2010/06/tutorials-day-1-advanced-numpy.html"&gt;libraries’ dark corners&lt;/a&gt;, and &lt;em&gt;not use them unless
absolutely necessary&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="some-rules-of-thumb"&gt;
&lt;h2&gt;Some rules of thumb&lt;/h2&gt;
&lt;p&gt;Here are some rules that I apply nowadays when writing code that I would
like to last (I am aware that some of them go against well-advertised
best practices).&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Keep it as simple as possible, really!&lt;/strong&gt; Experimental results have
shown that the tractability of a code base goes down as the square of
the number of interactions, and thus much quicker than the number of
lines in a project. Each time you add a line, think about it: can you
make it simpler? If not, you will have to find resources to maintain your
project, as fixing bugs or adding features will grow harder.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Design for the 80% usecases.&lt;/strong&gt; In the same vein, a small decrease
in the requirements can make your project much simpler
&lt;a class="reference external" href="http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F32%2F35909%2F01702600.pdf%3Farnumber%3D1702600&amp;amp;authDecision=-203"&gt;[Woodfield1979]&lt;/a&gt;. Corner cases and minor usecases should not make
the whole project complex and hard to maintain. If you can, give up
on what is bringing in complexity. If you cannot, isolate it, and
don’t let it sit at the core of your design.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don’t design for the future.&lt;/strong&gt; Again the same core idea: don’t
start planning for all the usecases and all the difficulties that you
haven’t encountered; you will most certainly design it wrong, and
chances are that you’ll add complexity that you do not use. Design
simple, design cleanly, and refactor as you go, based on concrete
problems. This is known as the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/You_aren't_gonna_need_it"&gt;“YAGNI principle”&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="" class="align-center" src="https://gael-varoquaux.info/programming/attachments/software_design_for_maintainability/howtobuildmvp.gif" style="width: 60%;" /&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Don’t be clever.&lt;/strong&gt; Each time you use a clever trick, whoever has to
read and maintain the code will have to understand it (that person
may be you, in a few years). Chances are that they’ll get it wrong
and start by losing a lot of time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repeating yourself may actually be OK.&lt;/strong&gt; This is a case of
&lt;em&gt;practicality beats purity&lt;/em&gt;. Repeating code is really a bad thing in
software design, because it leads to an increased number of lines to
debug, and tends to hinder reusability. However, adding complexity in
order to save a few lines of duplicated code will cost you more in
the long run.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use objects sparingly.&lt;/strong&gt; Objects are great, but are they always
needed? An object with a single method &lt;em&gt;eval&lt;/em&gt; can probably simply be
implemented as a function. The limitation of objects is that they all
have a different behavior. As a result, the users and maintainers of
your codebase will first have to understand how all your classes
interact before understanding your code. This also means that there
is a lot of benefit in making many different classes share the
same interface.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid abstractions and levels of indirection.&lt;/strong&gt; The more levels of
code are piled on top of one another, the more layers your maintainer
is going to have to inspect to find where the bug might be. An
abstraction hides another object or algorithm. To debug code, chances
are that all the black boxes will first have to be opened.&lt;/li&gt;
&lt;/ul&gt;
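&lt;p&gt;To make the &lt;em&gt;eval&lt;/em&gt; example above concrete, here is a minimal
sketch (the names are illustrative, not taken from any real codebase):
a single-method class, and the plain function that can replace it.&lt;/p&gt;

```python
# A class whose only purpose is a single `eval` method...
class Polynomial:
    def __init__(self, coefs):
        self.coefs = coefs

    def eval(self, x):
        return sum(c * x ** i for i, c in enumerate(self.coefs))


# ... can probably simply be a function:
def eval_polynomial(coefs, x):
    return sum(c * x ** i for i, c in enumerate(coefs))


print(Polynomial([1, 2]).eval(3))    # 7
print(eval_polynomial([1, 2], 3))    # 7
```

&lt;p&gt;The function version gives the reader one less moving part to
understand: no state, no life cycle, just inputs and an output.&lt;/p&gt;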
&lt;/div&gt;
&lt;div class="section" id="coding-for-others-to-debug"&gt;
&lt;h2&gt;Coding for others to debug&lt;/h2&gt;
&lt;blockquote class="epigraph"&gt;
“Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it.” - Brian W. Kernighan&lt;/blockquote&gt;
&lt;img alt="" class="align-right" src="https://gael-varoquaux.info/programming/attachments/software_design_for_maintainability/auto-graveyard-1.jpg" style="width: 40%;" /&gt;
&lt;p&gt;You may think that I am overemphasizing simplicity at the cost of
functionality. Well, think about the future of your code. The net is
full of unmaintained and abandoned code. If you want your project to
grow and have a future, you will probably need people to help you. For a
given purpose, the easier the code is to read and debug, the better
your chances of picking up momentum.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Some external references I like (about software engineering, rather than
debugging):&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Edmond Lau: &lt;a class="reference external" href="http://www.theeffectiveengineer.com/blog/hidden-costs-that-engineers-ignore"&gt;Hidden costs that engineers ignore&lt;/a&gt;
(&lt;strong&gt;Read this&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;Titus Brown: &lt;a class="reference external" href="http://ivory.idyll.org/blog/sep-07/not-sucking-v2"&gt;Writing (Python) Code that Doesn’t Suck&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Peter Norvig: &lt;a class="reference external" href="http://norvig.com/21-days.html"&gt;Teach yourself programming in 10 years&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Paul Stachour and David Collier-Brown: &lt;a class="reference external" href="http://cacm.acm.org/magazines/2009/11/48444-you-dont-know-jack-about-software-maintenance/fulltext"&gt;You Don’t Know Jack About
Software Maintenance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Greg Wilson: &lt;a class="reference external" href="http://software-carpentry.org/"&gt;Software carpentry: a course in software engineering&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="software engineering"></category><category term="software architecture"></category><category term="python"></category><category term="selected"></category></entry><entry><title>Sprint Scikit learn in Paris</title><link href="https://gael-varoquaux.info/programming/sprint-scikit-learn-in-paris.html" rel="alternate"></link><published>2010-07-23T14:31:00+02:00</published><updated>2010-07-23T14:31:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-07-23:/programming/sprint-scikit-learn-in-paris.html</id><summary type="html">&lt;p&gt;We are organizing a coding sprint in Paris on &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit learn&lt;/a&gt;,
&lt;strong&gt;machine learning in Python&lt;/strong&gt;. The goal of this sprint is to set the
API and the general coding guidelines of the scikit to be able to tackle
many different statistical learning problems in a consistent framework.&lt;/p&gt;
&lt;p&gt;This is why …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We are organizing a coding sprint in Paris on &lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit learn&lt;/a&gt;,
&lt;strong&gt;machine learning in Python&lt;/strong&gt;. The goal of this sprint is to set the
API and the general coding guidelines of the scikit to be able to tackle
many different statistical learning problems in a consistent framework.&lt;/p&gt;
&lt;p&gt;This is why we would like to have people with different problems,
applications, and backgrounds to pitch in.&lt;/p&gt;
&lt;p&gt;It will be a two-day sprint. Everyone is welcome, so just fill in the
&lt;a class="reference external" href="http://www.doodle.com/4cqxnhuq5rr4qzn5"&gt;doodle&lt;/a&gt;, so that we can choose the date.&lt;/p&gt;
&lt;p&gt;And do not hesitate to suggest some topics that you would like to be
addressed during the sprint, and to discuss them on the &lt;a class="reference external" href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general"&gt;mailing-list&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://parietal.saclay.inria.fr/Members/vincent-michel"&gt;Vincent Michel&lt;/a&gt; is organizing the sprint. If you have questions about
the sprint, you are welcome to contact me, but please do put him in
Cc.&lt;/p&gt;
</content><category term="programming"></category><category term="scikit-learn"></category><category term="scipy"></category><category term="scientific computing"></category><category term="sprint"></category><category term="conferences"></category></entry><entry><title>Simple object signatures</title><link href="https://gael-varoquaux.info/programming/simple-object-signatures.html" rel="alternate"></link><published>2010-07-16T23:31:00+02:00</published><updated>2010-07-16T23:31:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-07-16:/programming/simple-object-signatures.html</id><summary type="html">&lt;div class="section" id="a-signature-pattern"&gt;
&lt;h2&gt;A &lt;em&gt;signature&lt;/em&gt; pattern&lt;/h2&gt;
&lt;p&gt;There are many libraries around to specify what I call a &lt;em&gt;‘signature’&lt;/em&gt;
for an object, in other words a list of attributes that define its
parameter set. I have heavily used &lt;a class="reference external" href="http://code.enthought.com/projects/traits"&gt;Enthought’s Traits library&lt;/a&gt; for
this purpose, but the concept is fairly general and can be …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;div class="section" id="a-signature-pattern"&gt;
&lt;h2&gt;A &lt;em&gt;signature&lt;/em&gt; pattern&lt;/h2&gt;
&lt;p&gt;There are many libraries around to specify what I call a &lt;em&gt;‘signature’&lt;/em&gt;
for an object, in other words a list of attributes that define its
parameter set. I have heavily used &lt;a class="reference external" href="http://code.enthought.com/projects/traits"&gt;Enthought’s Traits library&lt;/a&gt; for
this purpose, but the concept is fairly general and can be found &lt;em&gt;eg&lt;/em&gt; in
ORMs (Object Relational Mappers) or web frameworks.&lt;/p&gt;
&lt;p&gt;Specification of this interface of parameters may be used to answer a
variety of needs:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Typing&lt;/strong&gt;: in the case of an ORM, to generate UIs, or for better
error management, it may be desirable to have some control on the
types of certain attributes of an object. In this case, specifying
the signature corresponds to laying out a &lt;strong&gt;data model&lt;/strong&gt; for the
object.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reactive programming&lt;/strong&gt;: using properties to react to changes to
attributes, one can fully specify the API of an object in terms of
these attributes. This gives a message-passing like programming style
that can be very well suited to parallel-computing in particular
because it can easily be made thread-safe.&lt;/li&gt;
&lt;/ul&gt;
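&lt;p&gt;As a minimal sketch of the reactive style described above (the class
and the notification mechanism are illustrative assumptions, not any
particular library’s API): a Python property whose setter reacts to
every change of the attribute.&lt;/p&gt;

```python
# A property-based reactive attribute: assigning to `x` triggers a
# notification, here simply recorded in a log.
class Observable:
    def __init__(self):
        self._x = 0
        self.log = []

    @property
    def x(self):
        return self._x

    @x.setter
    def x(self, value):
        self._x = value
        self.log.append('x changed to %r' % value)


obs = Observable()
obs.x = 3
print(obs.log)    # ['x changed to 3']
```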
&lt;/div&gt;
&lt;div class="section" id="signatures-for-statistical-learning-objects"&gt;
&lt;h2&gt;Signatures for statistical learning objects&lt;/h2&gt;
&lt;p&gt;Recently, I considered the &lt;em&gt;signature&lt;/em&gt; pattern in a new context. In the
&lt;a class="reference external" href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt;, we are interested in statistical learning. This entails
fitting models to data and often tuning parameters to select a model
that fits best (a problem called &lt;em&gt;model selection&lt;/em&gt;). Each of our models
is an object that implements a couple of key methods to fit to the data
and to apply to new data (&lt;em&gt;fit&lt;/em&gt; and &lt;em&gt;predict&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;The approach that we are currently taking for model selection is (more
or less) to generate a list of models with different parameters and fit
and test them on the data.&lt;/p&gt;
&lt;p&gt;A very nice feature would be to find out the parameters to vary simply
by inspecting the objects, and such a desire recently got us
&lt;a class="reference external" href="http://sourceforge.net/mailarchive/forum.php?thread_name=201007050958.16199.matthieu.perrot%40cea.fr&amp;amp;forum_name=scikit-learn-general"&gt;discussing&lt;/a&gt; of defining &lt;em&gt;signatures&lt;/em&gt; for our objects. I must confess
that I am a bit weary as this means either depending on a signature
library, or building one. We don’t want to grow our dependencies, and
most signature-definition code that I know involve meta-programming
tricks to avoid code duplication.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="solving-the-simple-problem-avoiding-type-checking"&gt;
&lt;h2&gt;Solving the simple problem: avoiding type checking&lt;/h2&gt;
&lt;p&gt;Today, I had to bite the bullet, because we were in a situation in which
we had to instantiate new models from the existing one during model
selection. For technical reasons, using a &lt;em&gt;copy.copy&lt;/em&gt; to create these
new models was not a great idea, and it was better to have the minimal
list of parameters required. Here come signatures again.&lt;/p&gt;
&lt;p&gt;After a bit of messing around with the code, I realized that typing
information was useless, and most probably harmful, to our immediate
goals, and that I just needed the names of the relevant attributes. I
finally settled on the following solution (which might still
change):&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;All parameters need to be specified as keyword arguments of the
&lt;em&gt;__init__&lt;/em&gt;. The &lt;em&gt;__init__&lt;/em&gt; may not have positional arguments
or ‘*’ arguments. Attributes on the objects have the same names as
the &lt;em&gt;__init__&lt;/em&gt; parameters.&lt;/li&gt;
&lt;li&gt;A simple base class, with a couple of methods relying on a simple use
of the &lt;em&gt;inspect&lt;/em&gt; module to find the signature of the &lt;em&gt;__init__&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseEstimator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_param_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;varargs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inspect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getargspec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;cls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;varargs&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;scikit learn estimators should always specify their &amp;#39;&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;parameters in the signature of their init (no varargs).&amp;#39;&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Remove &amp;#39;self&amp;#39;&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_get_param_names&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_set_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;valid_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_get_param_names&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteritems&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;valid_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Invalid parameter &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt; &amp;#39;&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;for estimator &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="vm"&gt;__class__&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="vm"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nb"&gt;setattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The full code can be seen &lt;a class="reference external" href="attachments/base_estimator.py"&gt;here&lt;/a&gt;; it adds a few more features, such as
a clever &lt;em&gt;__repr__&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;What I like about this solution is that it (almost) does not use
metaprogramming, and avoids code duplication without forcing any specific
pattern on the developer subclassing &lt;em&gt;BaseEstimator&lt;/em&gt;.&lt;/p&gt;
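&lt;p&gt;As an illustration, here is a self-contained sketch of the pattern in
use. This is not the linked file: it uses the modern
&lt;em&gt;inspect.signature&lt;/em&gt; rather than &lt;em&gt;inspect.getargspec&lt;/em&gt;, and
the &lt;em&gt;LinearModel&lt;/em&gt; class is a made-up example.&lt;/p&gt;

```python
import inspect


# A compact re-implementation of the base class above, using the
# modern inspect.signature instead of the old inspect.getargspec.
class BaseEstimator:

    @classmethod
    def _get_param_names(cls):
        # The __init__ arguments (minus self) are the parameter names.
        sig = inspect.signature(cls.__init__)
        return [name for name, param in sig.parameters.items()
                if name != 'self'
                and param.kind == param.POSITIONAL_OR_KEYWORD]

    def _get_params(self):
        # Attributes carry the same names as the __init__ arguments.
        return {key: getattr(self, key)
                for key in self._get_param_names()}


# A subclass only has to follow the convention: keyword arguments in
# __init__, stored on the object under the same names.
class LinearModel(BaseEstimator):
    def __init__(self, l1=.5, fit_intercept=True):
        self.l1 = l1
        self.fit_intercept = fit_intercept


model = LinearModel(l1=.1)
print(model._get_params())    # {'l1': 0.1, 'fit_intercept': True}
```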
&lt;/div&gt;
&lt;div class="section" id="the-next-step"&gt;
&lt;h2&gt;The next step&lt;/h2&gt;
&lt;p&gt;This approach solves my immediate problem, but not the bigger one of
finding what values the different parameters can take when varied for
model selection. Of course, this second problem is much more complicated,
and maybe it is not worth solving: the framework could very easily
bring in more problems than it solves.&lt;/p&gt;
&lt;p&gt;However, it seems that a fairly easy way of specifying possible values
for parameters would be to decorate the &lt;em&gt;__init__&lt;/em&gt;, giving the
values to be tested during model selection:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@cv_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fit_intercept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;All the decorator has to do is to store the information in an attribute
attached to the &lt;em&gt;__init__&lt;/em&gt; (and probably to check that the
parameters it was given are valid arguments, in order to raise errors
early). Methods on the class can later inspect this information for
model selection, or GUI building (data-model specification will probably
require some typing language, rather than a simple list of possible
parameters).&lt;/p&gt;
&lt;p&gt;Once again, we would be sidestepping the difficulty of specifying type
information in a non-restrictive way; but avoiding a problem that we
don’t have to solve is probably a good idea.&lt;/p&gt;
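&lt;p&gt;To make the idea concrete, here is a sketch of what such a
&lt;em&gt;cv_params&lt;/em&gt; decorator could look like. This is hypothetical code,
not an implemented API, and a plain list of values stands in for the
&lt;em&gt;np.logspace&lt;/em&gt; call of the example above.&lt;/p&gt;

```python
import inspect


def cv_params(**candidates):
    """Store candidate parameter values on the decorated __init__."""
    def decorate(init):
        # Raise early if a candidate name is not an __init__ argument.
        arg_names = set(inspect.signature(init).parameters)
        for name in candidates:
            if name not in arg_names:
                raise TypeError('%s is not an argument of %s'
                                % (name, init.__name__))
        init._cv_params = candidates
        return init
    return decorate


class LinearModel:
    @cv_params(l1=[.001, .01, .1, 1])
    def __init__(self, l1=.5, fit_intercept=True):
        self.l1 = l1
        self.fit_intercept = fit_intercept


# Model-selection code can later retrieve the stored candidates:
print(LinearModel.__init__._cv_params)    # {'l1': [0.001, 0.01, 0.1, 1]}
```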
&lt;/div&gt;
</content><category term="programming"></category><category term="software engineering"></category><category term="software architecture"></category><category term="design patterns"></category><category term="scientific computing"></category><category term="selected"></category></entry><entry><title>Euroscipy 2010: code, science, and a lot of fun</title><link href="https://gael-varoquaux.info/programming/euroscipy-2010-code-science-and-a-lot-of-fun.html" rel="alternate"></link><published>2010-07-13T17:31:00+02:00</published><updated>2010-07-13T17:31:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-07-13:/programming/euroscipy-2010-code-science-and-a-lot-of-fun.html</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2010"&gt;Euroscipy 2010&lt;/a&gt;, the third European conference for the use of Python in
science, is just over, and I think it was a great success.&lt;/p&gt;
&lt;div class="section" id="euroscipy-in-numbers"&gt;
&lt;h2&gt;Euroscipy in numbers&lt;/h2&gt;
&lt;p&gt;&lt;img alt="image0" src="http://farm5.static.flickr.com/4118/4779625445_0e783484cd_m_d.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The attendance this year was huge: a grand total of 160 people
came to EuroScipy, with 140 that came only to …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2010"&gt;Euroscipy 2010&lt;/a&gt;, the third European conference for the use of Python in
science, is just over, and I think it was a great success.&lt;/p&gt;
&lt;div class="section" id="euroscipy-in-numbers"&gt;
&lt;h2&gt;Euroscipy in numbers&lt;/h2&gt;
&lt;p&gt;&lt;img alt="image0" src="http://farm5.static.flickr.com/4118/4779625445_0e783484cd_m_d.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The attendance this year was huge: a grand total of 160 people came
to EuroScipy, with 140 attending the tutorials, and 130 the conference.
This is up by almost a factor of 3 compared to last year’s EuroScipy,
more than last year’s SciPy conference in Pasadena, and almost as much
as this year’s SciPy conference in Austin, which hosted 180 people. We
had people coming from 16 countries, from as far as New Zealand, the
US, or Turkey. Research labs, education, and industry (small to large
companies) were all well represented, with approximately a third of the
delegates coming from industry. Similarly, many different scientific
fields were discussed, ranging from landscape ecology to pure math.&lt;/p&gt;
&lt;p&gt;There were 2 tutorial tracks with 10 tutorial slots in each track. We
had 2 keynotes, from Hans Petter Langtangen and Konrad Hinsen. With
regards to the contributed talks, the conference this year was highly
selective: we received 52 proposals, and unfortunately could accept
only 30 of them, which corresponds to an acceptance rate of 58%.
Finally, we had 18 &lt;a class="reference external" href="http://www.euroscipy.org/talk/937"&gt;lightning talks&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="a-warm-and-friendly-atmosphere"&gt;
&lt;h2&gt;A warm and friendly atmosphere&lt;/h2&gt;
&lt;p&gt;&lt;img alt="image1" src="http://farm5.static.flickr.com/4097/4774499149_5dda469dc2_m.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;As an organizer, I was really pleased to find out how relaxed and
friendly people were. This certainly facilitated discussions during the
breaks. And the ambiance was undoubtedly warm: 140 people with laptops
in a room without air conditioning in the Paris summer :).&lt;/p&gt;
&lt;p&gt;Of course during the evenings, many people met to continue the
passionate discussions in restaurants and bars.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="trends-i-noticed"&gt;
&lt;h2&gt;Trends I noticed&lt;/h2&gt;
&lt;p&gt;What one remembers from a conference is obviously biased by personal
interests. With that disclaimer, here are the recurrent and important
topics that I noticed, both in the talks, but also in the coffee break
discussions:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Parallel computing&lt;/strong&gt;, in particular making it easy to do parallel
computing. &lt;a class="reference external" href="http://www.euroscipy.org/talk/2011"&gt;Konrad’s keynote&lt;/a&gt; had many interesting directions to
explore. (talks: &lt;a class="reference external" href="http://www.euroscipy.org/talk/2009"&gt;Playdoh&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/talk/1686"&gt;DANA&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code generation&lt;/strong&gt;. In the various conferences I have been to
recently, I heard much talk about symbolic manipulation of
numerical problems to generate optimal computing kernels (talks:
&lt;a class="reference external" href="http://www.euroscipy.org/talk/1657"&gt;Efficient computation tutorial&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/talk/1666"&gt;Theano&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/talk/2045"&gt;Algorithmic
Differentiation&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data management&lt;/strong&gt;, with problems such as provenance tracking for
reproducibility (talks: &lt;a class="reference external" href="http://www.euroscipy.org/talk/1960"&gt;Sumatra&lt;/a&gt;, &lt;a class="reference external" href="http://www.euroscipy.org/talk/880"&gt;Knowledge management
tutorial&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, installation problems of scientific tools were the subject of
many discussions, as every year. One thing that I did notice is that
people stopped simply blaming each other and acknowledged that nobody
knew how to fix the problem. Somebody even pointed out that installing
any major scientific code was not a piece of cake. Hans Petter and
others said that they had solved the problem by relying on a virtual
machine and Ubuntu.&lt;/p&gt;
&lt;p&gt;Konrad has also &lt;a class="reference external" href="http://khinsen.wordpress.com/2010/07/12/euroscipy-2010/"&gt;blogged&lt;/a&gt;, giving his own view of the conference.&lt;/p&gt;
&lt;p&gt;&lt;img alt="image2" src="http://farm5.static.flickr.com/4097/4778812305_9217c5d3c2_m.jpg" /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="thanks"&gt;
&lt;h2&gt;Thanks&lt;/h2&gt;
&lt;p&gt;The conference could happen only because of the help of many people.
First we need to thank our sponsors: &lt;a class="reference external" href="http://www.enthought.com"&gt;Enthought&lt;/a&gt;, &lt;a class="reference external" href="http://www.python-academy.com/"&gt;Python Academy&lt;/a&gt;,
&lt;a class="reference external" href="http://www.pytables.org"&gt;Pytables&lt;/a&gt;, and especially our host &lt;a class="reference external" href="http://www.ens.fr"&gt;Ecole Normale Supérieure&lt;/a&gt;, which
not only provided us with the rooms, but also made sure that everything
was going well with the sound system, the projection, or the access to
the building. With regards to organization and planing, Nicolas and I
received a lot of help from &lt;a class="reference external" href="http://www.saint-gobain-recherche.com/svi/en/emmanuelle_gouillart.html"&gt;Emmanuelle Gouillart&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="scientific computing"></category><category term="conferences"></category></entry><entry><title>Making posters for scientific conferences</title><link href="https://gael-varoquaux.info/science/making-posters-for-scientific-conferences.html" rel="alternate"></link><published>2010-07-12T00:00:00+02:00</published><updated>2010-07-12T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-07-12:/science/making-posters-for-scientific-conferences.html</id><summary type="html">&lt;p class="first last"&gt;Some advice and examples on making posters for scientific conferences.&lt;/p&gt;
</summary><content type="html">&lt;p&gt;This page gives some advices and examples on making posters for
scientific conference.&lt;/p&gt;
&lt;p&gt;Here are some posters I made (one in 2007, the other in 2011). They don’t
follow all the advice on this page, but they should.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external image-reference" href="attachments/poster_YAO.pdf"&gt;&lt;img alt="poster1" src="attachments/poster_YAO.jpg" style="width: 33%;" /&gt;&lt;/a&gt; &lt;a class="reference external image-reference" href="attachments/poster_hbm2011.pdf"&gt;&lt;img alt="poster2" src="attachments/poster_hbm2011.png" style="width: 33%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;LaTeX sources&lt;/p&gt;
&lt;p&gt;These posters are written in LaTeX. You can download the whole source of
the posters: &lt;a class="reference external" href="attachments/poster.zip"&gt;the first poster (left)&lt;/a&gt;,
and &lt;a class="reference external" href="attachments/poster_hbm2011.zip"&gt;the second one (right)&lt;/a&gt;. These
are some of my personal projects, not meant for sharing. As a result,
they contain a fair amount of hacking. I have been asked for the source code
more than once, so I put it on the web. I do not, however, have time to
provide &lt;strong&gt;any&lt;/strong&gt; support for it (I am already too busy supporting other
things). Any mail asking for help on these files will go unanswered. Sorry.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Here is another example, a bit more visually appealing, as it is intended
for a less technical audience.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_ICE.pdf"&gt;&lt;img alt="" class="align-center" src="attachments/poster_ICE.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;One more about my work: this one was made to convey a strong message and
simplified the content a lot to get the message across. I am not too sure
it worked, but I still find the poster pretty.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_ICOLS07.pdf"&gt;&lt;img alt="" class="align-center" src="attachments/poster_ICOLS07.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;And finally two made by Emmanuelle with really nice colours.&lt;/p&gt;
&lt;a class="reference external image-reference" href="attachments/poster_Emmanuelle.pdf"&gt;&lt;img alt="" src="attachments/poster_Emmanuelle.jpg" /&gt;&lt;/a&gt;
&lt;a class="reference external image-reference" href="attachments/poster_blue.pdf"&gt;&lt;img alt="" src="attachments/poster_blue.jpg" /&gt;&lt;/a&gt;
&lt;div class="section" id="advice-on-poster-presentation"&gt;
&lt;h2&gt;Advice on poster presentation&lt;/h2&gt;
&lt;p&gt;See also &lt;a class="reference external" href="http://www.ncsu.edu/project/posters"&gt;http://www.ncsu.edu/project/posters&lt;/a&gt;&lt;/p&gt;
&lt;div class="section" id="fonts"&gt;
&lt;h3&gt;Fonts&lt;/h3&gt;
&lt;p&gt;Sans-serif fonts look really nice, but are less readable in
paragraphs. Use them for titles and headers. Use serif fonts for
paragraphs. Stick to a simple font family like Times. Use bold fonts
when writing with a light colour on a dark background.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="colours"&gt;
&lt;h3&gt;Colours&lt;/h3&gt;
&lt;p&gt;Stick to a rather small number of colours, but choose them well.
Put a very light colour behind your text blocks. If ink is not too
expensive, I would use a dark background, and have light text blocks on
it. Have well-separated areas in your poster (like the background and
the text blocks), and give the background, or other decorative elements,
little contrast: they should not stand out too much (mine stood out
too much in my poster; that’s because the print-out didn’t look like what
was on the screen).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="page-layout"&gt;
&lt;h3&gt;Page layout&lt;/h3&gt;
&lt;p&gt;Break symmetry and order. A well-aligned poster is boring to the
eye, and does not catch attention from afar. People read your poster by
first scanning through it and stopping at a few key points (usually
first at the upper left, then the upper right, then the lower right, and
the lower left), then they might read it more thoroughly after this first
scan. You want to define these key points visually, make them appealing,
and put key ideas there.&lt;/p&gt;
&lt;p&gt;Long lines are difficult to read. Pick up a book, a flyer, anything made
by a professional publisher: it will never have long lines. A good rule
of thumb is that if a text block has lines longer than 80 characters, it
needs breaking down into several columns.&lt;/p&gt;
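&lt;p&gt;As an illustration of the 80-character rule, here is a minimal LaTeX
sketch of breaking a text block into columns (it assumes the standard
&lt;em&gt;multicol&lt;/em&gt; package, and is not taken from the posters above):&lt;/p&gt;

```latex
% Sketch: break a wide text block into two columns,
% so that no line runs much past 40-50 characters.
\documentclass{article}
\usepackage{multicol}
\begin{document}
\begin{multicols}{2}
Long lines are hard to read. Splitting a wide text
block into two or three columns keeps each line
comfortably short.
\end{multicols}
\end{document}
```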
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="which-software-to-use"&gt;
&lt;h2&gt;Which software to use&lt;/h2&gt;
&lt;p&gt;Many people use PowerPoint to make their posters. It is easy to use, but
it is not dedicated to making posters, and it produces horrible PDFs.&lt;/p&gt;
&lt;p&gt;If you want to pay a lot, there is QuarkXPress, which is very good for this
kind of thing. Adobe PageMaker is also a very good program. &lt;a class="reference external" href="http://www.xara.com/"&gt;Xara&lt;/a&gt; is a cheap and good design program, and a free
version will soon be available for Linux.&lt;/p&gt;
&lt;p&gt;I use LaTeX, just because I love the way it positions characters. But I
admit it is a bit brutal. What I would advise you to use is &lt;a class="reference external" href="http://www.scribus.net"&gt;Scribus&lt;/a&gt;: it is dedicated to making posters, and is free
and open source. I sometimes use LaTeX to create the text boxes, and
Scribus to lay them out. I wrote a &lt;a class="reference external" href="LaTeX-scribus.html"&gt;page&lt;/a&gt;
describing how I do it.&lt;/p&gt;
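&lt;p&gt;For the LaTeX route, a minimal poster skeleton could look like the
following (a sketch assuming the &lt;em&gt;a0poster&lt;/em&gt; class and the
&lt;em&gt;textpos&lt;/em&gt; package; it is not the actual source of the posters
above, which you can download from the links in the box):&lt;/p&gt;

```latex
% Minimal poster sketch: a0poster sets the page size,
% textpos places text blocks at absolute positions.
\documentclass[a0,portrait]{a0poster}
\usepackage[absolute]{textpos}
\setlength{\TPHorizModule}{1cm}   % textblock units in cm
\setlength{\TPVertModule}{1cm}
\begin{document}
\begin{textblock}{60}(10, 3)      % 60cm-wide block at (10cm, 3cm)
  \Huge A poster title, big and readable from afar
\end{textblock}
\begin{textblock}{38}(3, 15)      % a text column on the left
  \large Keep lines short, and put a key idea here.
\end{textblock}
\end{document}
```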
&lt;!-- See also :
http://theoval.cmp.uea.ac.uk/~nlct/jpgfdraw/manual/postertutorial.html --&gt;
&lt;p&gt;One last remark: use vector graphics (eps, ps, pdf, svg), not bitmaps:
bitmaps scale up really badly.
Try to get a vector logo of your institution. Usually, asking the PR
people is all it takes to get one. Of course, if you are using
PowerPoint, chances are that you won’t be able to insert it in your poster.&lt;/p&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category><category term="conferences"></category><category term="selected"></category></entry><entry><title>A simple LaTeX example</title><link href="https://gael-varoquaux.info/science/a-simple-latex-example.html" rel="alternate"></link><published>2010-06-01T00:00:00+02:00</published><updated>2010-06-01T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-06-01:/science/a-simple-latex-example.html</id><summary type="html">&lt;p class="first last"&gt;A simple LaTeX document, to use as a skeleton&lt;/p&gt;
</summary><content type="html">&lt;p&gt;Here is a very simple example of a laTeX document that uses good package
to have a simple but nice layout:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="attachments/simple.tex"&gt;The LaTeX source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="attachments/simple.pdf"&gt;The pdf document&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
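&lt;p&gt;For a rough idea of what such a skeleton contains, here is a sketch
using commonly chosen layout packages; it is not the actual contents of
&lt;em&gt;simple.tex&lt;/em&gt;, which you should download from the link above:&lt;/p&gt;

```latex
% A simple article skeleton: readable font, sane margins.
\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{charter}                % a pleasant serif text font
\usepackage[margin=2.5cm]{geometry} % reasonable margins
\begin{document}
\title{A simple document}
\author{Your Name}
\maketitle
\section{Introduction}
Body text goes here.
\end{document}
```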
&lt;div class="topic"&gt;
&lt;p class="topic-title"&gt;&lt;strong&gt;Some advice&lt;/strong&gt;&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Use &lt;a class="reference external" href="http://www.texniccenter.org/"&gt;texniccenter&lt;/a&gt; if you don’t have a
favorite editor.&lt;/li&gt;
&lt;li&gt;Read the &lt;a class="reference external" href="http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf"&gt;not so short introduction to latex&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="science"></category><category term="latex"></category><category term="publishing"></category><category term="science"></category></entry><entry><title>Personal views on scientific computing</title><link href="https://gael-varoquaux.info/programming/view_on_scientific_computing.html" rel="alternate"></link><published>2010-05-20T00:00:00+02:00</published><updated>2010-05-20T00:00:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-05-20:/programming/view_on_scientific_computing.html</id><summary type="html">&lt;p&gt;My contributions to the scientific computing software ecosystem are
motivated by my vision of computational science.&lt;/p&gt;
&lt;p&gt;Scientific research relies more and more on computing. However, most
researchers are not software engineers, and as computing becomes
ubiquitous, the limiting factor is increasingly the &lt;strong&gt;human
factor&lt;/strong&gt; &lt;a class="reference external" href="http://software-carpentry.org/articles/amsci-swc-2006.pdf"&gt;[G …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;My contributions to the scientific computing software ecosystem are
motivated by my vision of computational science.&lt;/p&gt;
&lt;p&gt;Scientific research relies more and more on computing. However, most
researchers are not software engineers, and as computing becomes
ubiquitous, the limiting factor is increasingly the &lt;strong&gt;human
factor&lt;/strong&gt; &lt;a class="reference external" href="http://software-carpentry.org/articles/amsci-swc-2006.pdf"&gt;[G. Wilson, 2006]&lt;/a&gt; &lt;a class="reference external" href="http://download.on9pc.com/ebook/programing/Teach%20Yourself%20Programming%20in%20Ten%20Years.pdf"&gt;[P.
Norvig, 2009]&lt;/a&gt;.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;To address the needs of computing accross scientific fields, I believe
that we need a &lt;strong&gt;general-purpose&lt;/strong&gt;, &lt;strong&gt;high-level&lt;/strong&gt;, &lt;strong&gt;interactive&lt;/strong&gt;, and
&lt;strong&gt;highly-readable&lt;/strong&gt; language and set of tools for scientific computing.&lt;/p&gt;
&lt;/div&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;C does not answer my needs: does a molecular biologist know about
pointers? Should she?&lt;/li&gt;
&lt;li&gt;Matlab does not answer my needs either: scientific work with computers
is not only about numerical computation. Have you tried writing
experiment-control software with Matlab? How about file management?
How about inserting your algorithms in a web server?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We need better teaching material that sits at the interface between
software engineering and general science. Most top-notch tools and
libraries are full of domain-specific jargon and conventions.&lt;/p&gt;
&lt;p&gt;For reproducible science, we need the code to be readable and to reflect
the corresponding scientific operation. We need it to be unit-tested to
ensure its correctness.&lt;/p&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;We need to consider scientific libraries as end-result of our
research with the same importance than articles &lt;a class="reference external" href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.6201"&gt;[J. Buckheit and D.
Donoho. 1995]&lt;/a&gt;.
They need to convey a scientific message, to be &lt;strong&gt;understandable&lt;/strong&gt; and
&lt;strong&gt;refutable&lt;/strong&gt;. New results should be &lt;strong&gt;reproducible&lt;/strong&gt; via published code
&lt;a class="reference external" href="http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.14"&gt;[CISE Jan. 2009]&lt;/a&gt;. As
for established algorithms, scientific libraries with their
&lt;strong&gt;documentation&lt;/strong&gt; and &lt;strong&gt;examples&lt;/strong&gt; should be the textbooks of tomorrow.&lt;/p&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Scientific software should be as reusable as possible&lt;/strong&gt;, to enable the
advancement of Science via software, year after year. This means that
we need to build general-purpose libraries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code quality and documentation are crucial&lt;/strong&gt;, as human factors are
often the limitation. As a corollary, scientific code should be
unit-tested to ensure correctness.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Core scientific software should be open source&lt;/strong&gt;, as scientific work
cannot build on black boxes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Algorithms should be written as simply as possible&lt;/strong&gt;. A high level of
sophistication in software engineering should not be a requirement for
all scientists.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prefer high-level languages&lt;/strong&gt;. The code should be written at the right
level of abstraction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;We need to build common and shared tools&lt;/strong&gt;. Scientific software
shouldn’t be ‘owned’ by a lab.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The source code should be a deliverable of the research&lt;/strong&gt;. As a result, it
should read clearly and be understandable to all.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation and examples are the textbooks of tomorrow&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Publications should be reproducible&lt;/strong&gt;. Ideally they should become an
example of the library. This should be mitigated by the fact that code
maintenance is costly, and achieving good code takes more work than
publishing. Focus should be on publications that will give rise to reference
results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Academia needs to value software maintenance&lt;/strong&gt;. It is hard and costly,
but it determines our future.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tools that develop the environment, rather than a specific algorithm or
scientific field, are crucial&lt;/strong&gt; (one example is IPython).&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- Cite V Stodden --&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Further reading:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Open source Machine Learning software &lt;a class="reference external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.5605&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;[S. Sonnenburg et al. 2007]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Open source mathematical software &lt;a class="reference external" href="http://www.ams.org/notices/200710/tx071001279p.pdf"&gt;[D. Joyner and W. Stein 2007]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Licensing, intellectual property in scientific work
&lt;a class="reference external" href="http://jolt.unc.edu/sites/default/files/7_nc_jl_tech_321.pdf"&gt;[A. Gonzalez 2006]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Scientific software development best practices
&lt;a class="reference external" href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0020087"&gt;[S. Baxter et al. 2006]&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
</content><category term="programming"></category><category term="science"></category><category term="academia"></category><category term="scientific computing"></category><category term="selected"></category><category term="scientific software"></category></entry><entry><title>EuroScipy abstract submission deadline extended</title><link href="https://gael-varoquaux.info/programming/euroscipy-abstract-submission-deadline-extended.html" rel="alternate"></link><published>2010-05-15T23:36:00+02:00</published><updated>2010-05-15T23:36:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-05-15:/programming/euroscipy-abstract-submission-deadline-extended.html</id><summary type="html">&lt;p&gt;Given that we have been able to turn on registration only very late, the
&lt;a class="reference external" href="http://www.euroscipy.org"&gt;EuroScipy&lt;/a&gt; conference committee is extending the deadline for abstract
submission for the 2010 EuroScipy conference.&lt;/p&gt;
&lt;p&gt;On Thursday May 20th, at midnight Samoa time, we will turn off the
abstract submission on the conference site. Up to …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Given that we have been able to turn on registration only very late, the
&lt;a class="reference external" href="http://www.euroscipy.org"&gt;EuroScipy&lt;/a&gt; conference committee is extending the deadline for abstract
submission for the 2010 EuroScipy conference.&lt;/p&gt;
&lt;p&gt;On Thursday May 20th, at midnight Samoa time, we will turn off the
abstract submission on the conference site. Up to then, you can modify
the already-submitted abstract, or submit new abstracts.&lt;/p&gt;
&lt;p&gt;We are very much looking forward to your submissions to the conference.&lt;/p&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;Gaël Varoquaux&lt;/div&gt;
&lt;div class="line"&gt;Nicolas Chauvat&lt;/div&gt;
&lt;/div&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="line-block"&gt;
&lt;div class="line"&gt;EuroScipy 2010 is the annual European conference for scientists using Python. It will be held July 8-11 2010, in ENS, Paris, France.&lt;/div&gt;
&lt;div class="line"&gt;&lt;strong&gt;Links: `Conference website`_,&amp;nbsp; `Call for papers`_,&amp;nbsp; `Practical information`_&lt;/strong&gt;&lt;/div&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="scientific computing"></category><category term="conferences"></category><category term="science"></category></entry><entry><title>EuroScipy is finally open for registration</title><link href="https://gael-varoquaux.info/programming/euroscipy-is-finally-open-for-registration.html" rel="alternate"></link><published>2010-05-13T13:23:00+02:00</published><updated>2010-05-13T13:23:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-05-13:/programming/euroscipy-is-finally-open-for-registration.html</id><summary type="html">&lt;a class="reference external image-reference" href="attachments/poster_euroscipy_2010.pdf"&gt;&lt;img alt="" src="attachments/poster_euroscipy_2010.jpg" /&gt;&lt;/a&gt;
&lt;div class="section" id="the-registration-for-euroscipy-is-finally-open"&gt;
&lt;h2&gt;The registration for &lt;a class="reference external" href="http://www.euroscipy.org//conference/euroscipy2010"&gt;EuroScipy&lt;/a&gt; is finally open.&lt;/h2&gt;
&lt;p&gt;To register, go to the &lt;a class="reference external" href="http://www.euroscipy.org//conference/euroscipy2010"&gt;website&lt;/a&gt;, create an account, and you will see a
&lt;em&gt;‘register to the conference’&lt;/em&gt; button on the left. Follow it to a page
which presents a &lt;em&gt;‘shopping cart’&lt;/em&gt;. Simply submitting this information
registers you to the conference, and on …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;a class="reference external image-reference" href="attachments/poster_euroscipy_2010.pdf"&gt;&lt;img alt="" src="attachments/poster_euroscipy_2010.jpg" /&gt;&lt;/a&gt;
&lt;div class="section" id="the-registration-for-euroscipy-is-finally-open"&gt;
&lt;h2&gt;The registration for &lt;a class="reference external" href="http://www.euroscipy.org//conference/euroscipy2010"&gt;EuroScipy&lt;/a&gt; is finally open.&lt;/h2&gt;
&lt;p&gt;To register, go to the &lt;a class="reference external" href="http://www.euroscipy.org//conference/euroscipy2010"&gt;website&lt;/a&gt;, create an account, and you will see a
&lt;em&gt;‘register to the conference’&lt;/em&gt; button on the left. Follow it to a page
which presents a &lt;em&gt;‘shopping cart’&lt;/em&gt;. Simply submitting this information
registers you to the conference, and on the left of the website, the
button will now display &lt;em&gt;‘You are registered for the conference’&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The registration fee is 50 euros for the conference, and 50 euros for
the tutorial. Right now there is no payment system: you will be
contacted later (in a week) with instructions for paying.&lt;/p&gt;
&lt;p&gt;We apologize for such a late set up. We do realize this has come as an
inconvenience to people.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Do not wait to register: the number of people we can host is
limited.&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="an-exciting-program"&gt;
&lt;h2&gt;An exciting program&lt;/h2&gt;
&lt;div class="section" id="tutorials-from-beginners-to-experts"&gt;
&lt;h3&gt;Tutorials: from beginners to experts&lt;/h3&gt;
&lt;p&gt;We have two tutorial tracks:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/track/871"&gt;**Introductory tutorial**&lt;/a&gt;: to get you to speed on scientific
programming with Python.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/track/872"&gt;**Advanced tutorial**&lt;/a&gt;: experts sharing their knowledge on specific
techniques and libraries.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="scientific-track-doing-new-science-in-python"&gt;
&lt;h3&gt;Scientific track: doing new science in Python&lt;/h3&gt;
&lt;p&gt;Although abstract submission is not yet over, I can say, looking at the
current submissions, that we are going to have a rich set of talks.
In addition to the contributed talks, we have:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2010"&gt;**Keynote speakers**&lt;/a&gt;: Hans Petter Langtangen and Konrard Hinsen,
two major player of scientific computing in Python.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.euroscipy.org/talk/937"&gt;**Lightning talks**&lt;/a&gt;: one hour will be open for people to come up
and present in a flash an interesting project.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="publishing-papers"&gt;
&lt;h3&gt;Publishing papers&lt;/h3&gt;
&lt;p&gt;We are talking with the editors of a major scientific computing journal,
and the odds are quite high that we will be able to publish a special
issue on scientific computing in Python based on the proceedings of the
conference. The papers will undergo peer-review independently from the
conference, to ensure high quality of the final publication.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="call-for-papers"&gt;
&lt;h2&gt;Call for papers&lt;/h2&gt;
&lt;p&gt;Abstract submission is still open, though not for long. We are
soliciting contributions on scientific libraries and tools developed
with Python and on scientific or engineering achievements using Python.
These include applications, teaching, future development directions, and
current research. See the &lt;a class="reference external" href="http://www.euroscipy.org/card/euroscipy2010_call_for_papers"&gt;call for papers&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I am very much looking forward to passionate discussions about
Python in science in Paris&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="scientific computing"></category></entry><entry><title>Status of the EuroScipy registration</title><link href="https://gael-varoquaux.info/programming/status-of-the-euroscipy-registration.html" rel="alternate"></link><published>2010-05-02T22:57:00+02:00</published><updated>2010-05-02T22:57:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-05-02:/programming/status-of-the-euroscipy-registration.html</id><summary type="html">&lt;p&gt;It is still not possible to register for the &lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2010"&gt;Euroscipy conference&lt;/a&gt;: we
are having difficulties with payment for the registration, and we are
still not sure that we will be able to actually charge money!&lt;/p&gt;
&lt;p&gt;This might not be a bad news, because it might mean that the conference
will …&lt;/p&gt;</summary><content type="html">&lt;p&gt;It is still not possible to register for the &lt;a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2010"&gt;Euroscipy conference&lt;/a&gt;: we
are having difficulties with payment for the registration, and we are
still not sure that we will be able to actually charge money!&lt;/p&gt;
&lt;p&gt;This might not be bad news, because it might mean that the conference
will be completely free. It would mean that we would not be able to
provide lunch, which is a pity, as there is nothing like eating with a
bunch of passionate experts to learn new tricks; but it would not hamper
the conference in any other way, as the rooms are already booked and
various little expenses covered.&lt;/p&gt;
&lt;p&gt;If we manage to sort out payments in the next weeks, the fee should be
50 euros for the 2 days of tutorial, and between 50 and 100 euros for
the full conference, depending on exactly what catering we offer.&lt;/p&gt;
&lt;p&gt;Anyhow, we should open the registration really soon, with or without
payment. We will need some formal registration, as the number of
people that can fit in the rooms is limited.&lt;/p&gt;
&lt;p&gt;All in all, with or without registration fees, it should be possible to
make it to EuroScipy while keeping expenses low: we have listed a few cheap
accommodations on the &lt;a class="reference external" href="http://www.euroscipy.org/card/euroscipy2010_practical_information"&gt;practical details page&lt;/a&gt;, and it is easy to get
good food for a good price in the area.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;I am very excited about this conference. We have two keynotes that I am
really looking forward to hearing, and I can say that we have been
getting pretty good submissions for presentations. Also, chances are
that we will be able to publish proceedings in a peer-reviewed
journal, although I can’t say more about that right now.&lt;/p&gt;
&lt;p&gt;Also, even if you are not interested in scientific research done using
Python, the tutorials are a unique opportunity: we are having top-notch
experts presenting with two tracks, &lt;a class="reference external" href="http://www.euroscipy.org/track/871"&gt;one&lt;/a&gt; to get beginners up to speed
and efficient in a couple of days, and the &lt;a class="reference external" href="http://www.euroscipy.org/track/872"&gt;other&lt;/a&gt; for exploring
advanced subjects. I know the speakers, and I can tell you that I won’t
be talking in the corridor, but sitting with my laptop and listening to
them. People pay large chunks of money for such training, usually.&lt;/p&gt;
</content><category term="programming"></category><category term="python"></category><category term="scientific computing"></category><category term="conferences"></category></entry><entry><title>Mayavi: Representing an additional scalar on surfaces</title><link href="https://gael-varoquaux.info/programming/mayavi-representing-an-additional-scalar-on-surfaces.html" rel="alternate"></link><published>2010-04-05T00:30:00+02:00</published><updated>2010-04-05T00:30:00+02:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-04-05:/programming/mayavi-representing-an-additional-scalar-on-surfaces.html</id><summary type="html">&lt;p&gt;We have been getting a few questions on the &lt;a class="reference external" href="https://mail.enthought.com/mailman/listinfo/enthought-dev"&gt;enthought-dev&lt;/a&gt;
mailing-list on how to represent additional information on a surface
with &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi"&gt;Mayavi&lt;/a&gt;, using a color that is not given by, e.g., the elevation. A &lt;a class="reference external" href="http://dpinte.wordpress.com/2010/03/30/4d-surface-plots-in-mayavi/"&gt;recent
post&lt;/a&gt; by Didrik Pinte on his blog shows the problem quite well:&lt;/p&gt;
&lt;a class="reference external image-reference" href="http://dpinte.wordpress.com/2010/03/30/4d-surface-plots-in-mayavi/"&gt;&lt;img alt="" src="http://dpinte.files.wordpress.com/2010/03/option_valuation_3d.png" /&gt;&lt;/a&gt;
&lt;p&gt;This problem can be seen …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We have been getting a few questions on the &lt;a class="reference external" href="https://mail.enthought.com/mailman/listinfo/enthought-dev"&gt;enthought-dev&lt;/a&gt;
mailing-list on how to represent additional information on a surface
with &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi"&gt;Mayavi&lt;/a&gt;, using a color that is not given by, e.g., the elevation. A &lt;a class="reference external" href="http://dpinte.wordpress.com/2010/03/30/4d-surface-plots-in-mayavi/"&gt;recent
post&lt;/a&gt; by Didrik Pinte on his blog shows the problem quite well:&lt;/p&gt;
&lt;a class="reference external image-reference" href="http://dpinte.wordpress.com/2010/03/30/4d-surface-plots-in-mayavi/"&gt;&lt;img alt="" src="http://dpinte.files.wordpress.com/2010/03/option_valuation_3d.png" /&gt;&lt;/a&gt;
&lt;p&gt;This problem can be seen as taking a standard &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_helper_functions.html#enthought.mayavi.mlab.surf"&gt;surf&lt;/a&gt; plot:&lt;/p&gt;
&lt;a class="reference external image-reference" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_helper_functions.html#enthought.mayavi.mlab.surf"&gt;&lt;img alt="" src="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/_images/enthought_mayavi_mlab_surf.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;but coloring it with a different scalar than the elevation.&lt;/p&gt;
&lt;p&gt;I would like to present two ways of solving this problem: first, a very
simple way specific to this exact problem; second, a more complicated but
quite generic approach.&lt;/p&gt;
&lt;div class="section" id="representing-surfaces-more-complex-than-an-elevation-map"&gt;
&lt;h2&gt;Representing surfaces more complex than an elevation map&lt;/h2&gt;
&lt;p&gt;The first option is simply to use the &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab.html#d-data"&gt;tools&lt;/a&gt; that Mayavi’s &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab.html"&gt;mlab&lt;/a&gt;
interface provides to represent surfaces that are not the particular case
of an elevation plot. In our case, it is very easy to use the &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_helper_functions.html#enthought.mayavi.mlab.mesh"&gt;mesh
function&lt;/a&gt;, which can take the x, y, z positions of a grid giving the
surface, but also an additional scalar value at these positions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Create some data&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mgrid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arctan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Visualize it&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;enthought.mayavi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mesh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;.05&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scalars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Finally, add a few decorations.&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;177&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;img alt="" src="attachments/mesh_example.png" /&gt;
&lt;p&gt;As you can see, this solution is really simple, and solves the problem.&lt;/p&gt;
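&lt;p&gt;One small numpy aside on the data above: np.arctan(x/y) divides by zero on
the y=0 edge of the grid and produces a NaN at the origin (0/0); np.arctan2
computes the same angle without these problems. A minimal sketch, independent
of Mayavi:&lt;/p&gt;

```python
import numpy as np

x, y = np.mgrid[0:10:100j, 0:10:100j]

# np.arctan(x/y) warns about division by zero on the y=0 edge,
# and 0/0 at the origin gives a NaN:
with np.errstate(divide='ignore', invalid='ignore'):
    w_naive = np.arctan(x / y)

# np.arctan2 handles the y=0 edge gracefully: no warning, no NaN.
w_safe = np.arctan2(x, y)

print(np.isnan(w_naive).sum())  # 1 (the origin)
print(np.isnan(w_safe).sum())   # 0
```

&lt;p&gt;The two arrays agree everywhere except at the origin, so the picture is
essentially the same; arctan2 just makes the intent cleaner.&lt;/p&gt;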
&lt;/div&gt;
&lt;div class="section" id="a-generic-way-of-representing-several-scalar-attributes-with-one-visualization"&gt;
&lt;h2&gt;A generic way of representing several scalar attributes with one visualization&lt;/h2&gt;
&lt;p&gt;If we think of the visualization problem as representing two
scalar values, ‘z’ and ‘w’, as functions of two others, ‘x’ and ‘y’,
the above solution is not really satisfactory: the surf function simply
turns the scalar value ‘z’ into elevation (using a WarpScalar filter). We
would like to be able to add an additional scalar value ‘w’ and turn it
into color, just like ‘z’ is turned into elevation. The pipeline
created by the surf function is the following:&lt;/p&gt;
&lt;img alt="" src="attachments/surf_pipeline.png" /&gt;
&lt;p&gt;The first element of the pipeline after the scene is the data source
created for us by the surf function: a 2D array that contains the
‘z’ value as a scalar. The ‘WarpScalar’ filter is applied, and
transforms that value into elevation. After that, a ‘PolyDataNormals’
filter is used to calculate normals, so as to have a smooth rendering,
and finally, a ‘Surface’ module is applied to display the resulting
elevation map as a surface, with a color reflecting the scalar value.&lt;/p&gt;
&lt;p&gt;The way to operate on two scalar values and turn them into elevation
and color successively is to embed both scalar values, ‘z’ and ‘w’, in the
dataset, and use a ‘SetActiveAttribute’ filter to control which
one the ‘Surface’ module is applied to. This approach is much more powerful,
because we can tweak the pipeline ourselves, and use any filter in
place of the WarpScalar to display the ‘z’ information (more on that
below).&lt;/p&gt;
&lt;p&gt;Here is how to achieve a visualization with a similar look to the one above,
but with two scalar values transformed successively into elevation and
color:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;###############################################################&lt;/span&gt;
&lt;span class="c1"&gt;# Create some data&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mgrid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arctan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;###############################################################&lt;/span&gt;
&lt;span class="c1"&gt;# Visualize the data&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;enthought.mayavi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;

&lt;span class="c1"&gt;# Create the data source&lt;/span&gt;
&lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array2d_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add the additional scalar information &amp;#39;w&amp;#39;, this is where we need to be a bit careful,&lt;/span&gt;
&lt;span class="c1"&gt;# see&lt;/span&gt;
&lt;span class="c1"&gt;# http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/example_atomic_orbital.html&lt;/span&gt;
&lt;span class="c1"&gt;# and&lt;/span&gt;
&lt;span class="c1"&gt;# http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/data.html&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlab_source&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;
&lt;span class="n"&gt;array_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;point_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ravel&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;point_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;color&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;point_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Here, we build the very exact pipeline of surf, but add a&lt;/span&gt;
&lt;span class="c1"&gt;# set_active_attribute filter to switch the color, this is code very&lt;/span&gt;
&lt;span class="c1"&gt;# similar to the code introduced in:&lt;/span&gt;
&lt;span class="c1"&gt;# http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab.html#assembling-pipelines-with-mlab&lt;/span&gt;
&lt;span class="n"&gt;warp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warp_scalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;warp_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;normals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;poly_data_normals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;warp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;active_attr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_active_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                            &lt;span class="n"&gt;point_scalars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;color&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;surf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;surface&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;active_attr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Finally, add a few decorations.&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;177&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;82&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mlab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The pipeline that is created is the following:&lt;/p&gt;
&lt;img alt="" src="attachments/complex_pipeline.png" /&gt;
&lt;p&gt;In the first part of the pipeline, the ‘WarpScalar’ filter is applied to
the ‘z’ scalar value, whereas, due to the ‘SetActiveAttribute’ filter,
the ‘Surface’ module uses the ‘w’ scalar value to display the color.&lt;/p&gt;
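&lt;p&gt;A detail worth pausing on in the code above is the w.T.ravel() call when
adding the array: VTK’s implicit 2D datasets order points with the first grid
axis varying fastest, which corresponds to numpy’s Fortran order rather than
the default C order, hence the transpose (this ordering is precisely the
subtlety the two documentation links in the code warn about). The pure-numpy
equivalence can be checked in isolation:&lt;/p&gt;

```python
import numpy as np

w = np.arange(6).reshape(2, 3)  # small stand-in for the 'w' array

# Transposing then raveling (C order) is the same as raveling the
# original array in Fortran order, i.e. first axis varying fastest:
flat_t = w.T.ravel()
flat_f = w.ravel(order='F')

print(flat_t.tolist())  # [0, 3, 1, 4, 2, 5]
assert (flat_t == flat_f).all()
```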
&lt;p&gt;This pattern is very powerful, and can be used with other sets of
filters or modules. The example of this pattern that we use in the
Mayavi documentation is the following:&lt;/p&gt;
&lt;a class="reference external image-reference" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/example_atomic_orbital.html"&gt;&lt;img alt="" src="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/_images/example_atomic_orbital.jpg" /&gt;&lt;/a&gt;
&lt;p&gt;We use a ‘Contour’ filter to contour on the amplitude of a complex
field defined in the volume, and then switch to the phase to display the
color. See the &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/example_atomic_orbital.html"&gt;atomic orbital example&lt;/a&gt; in the Mayavi documentation for
more details.&lt;/p&gt;
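&lt;p&gt;The amplitude and phase in that example are plain numpy operations on a
complex field (np.abs and np.angle); only the contouring and the attribute
switch involve Mayavi. A toy field, just to show the two scalar values that
would be fed to such a pipeline (the field itself is an arbitrary stand-in,
not the orbital from the example):&lt;/p&gt;

```python
import numpy as np

# A toy complex field on a small 3D grid
x, y, z = np.mgrid[-1:1:20j, -1:1:20j, -1:1:20j]
field = (x + 1j * y) * np.exp(-(x**2 + y**2 + z**2))

amplitude = np.abs(field)   # what the Contour filter would operate on
phase = np.angle(field)     # what the color would be switched to

print(amplitude.shape, phase.shape)  # (20, 20, 20) (20, 20, 20)
```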
&lt;/div&gt;
&lt;/content&gt;<category term="programming"></category><category term="mayavi"></category><category term="scipy"></category><category term="scientific computing"></category></entry><entry><title>Book review: Matplotlib for Python Developers</title><link href="https://gael-varoquaux.info/programming/book-review-matplotlib-for-python-developpers.html" rel="alternate"></link><published>2010-03-26T10:49:00+01:00</published><updated>2010-03-26T10:49:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-03-26:/programming/book-review-matplotlib-for-python-developpers.html</id><summary type="html">&lt;p&gt;&lt;em&gt;Packt Publishing&lt;/em&gt; sent me a copy of Sandro Tosi’s book &lt;a class="reference external" href="http://www.packtpub.com/matplotlib-python-development/book"&gt;Matplotlib for
Python Developers&lt;/a&gt; a while ago. Unfortunately, it arrived after I had
left for the Christmas break, and I couldn’t find time to review it for
a while (I am terribly bad at time-management, and I do …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;Packt Publishing&lt;/em&gt; sent me a copy of Sandro Tosi’s book &lt;a class="reference external" href="http://www.packtpub.com/matplotlib-python-development/book"&gt;Matplotlib for
Python Developers&lt;/a&gt; a while ago. Unfortunately, it arrived after I had
left for the Christmas break, and I couldn’t find time to review it for
a while (I am terribly bad at time-management, and I do too many things;
as a result I am always overworked). Three months later, I have finally
found time to read it and post a review.&lt;/p&gt;
&lt;div class="section" id="content"&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;The book introduces &lt;a class="reference external" href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt; which is, for those who don’t know, a
truly fantastic library for scientific plotting in Python. Matplotlib is
great because it is really easy to pick up, and can be used to produce
very high-quality plots.&lt;/p&gt;
&lt;p&gt;The book starts by progressively introducing the simple, imperative API
of matplotlib, with a focus on getting the user plotting
data immediately. It then moves on to a review of the plotting functionality
in matplotlib and the object-oriented usage of matplotlib. Finally, Sandro
shows us how to embed matplotlib in various environments such as GUI
toolkits or web development tools.&lt;/p&gt;
&lt;p&gt;The last part of the book is, in my opinion, the most original and
valuable, as these subjects are less well known and less documented in
classical references accessible to people with a scientific computing
background.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="target-audience"&gt;
&lt;h2&gt;Target audience&lt;/h2&gt;
&lt;p&gt;The book can pretty much be picked up by a scientific Python beginner. It
does require some knowledge of the Python language, but if the reader
has programmed in another language, I don’t see this as a big problem.
In this regard, the book is especially interesting, as it can lead a
scientist from newbie to writing simple end-user programs. There is a
clear need for more of these documents currently.&lt;/p&gt;
&lt;p&gt;The book will also be useful for experienced Python developers
looking to pick up matplotlib quickly.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="personal-comments-on-the-book"&gt;
&lt;h2&gt;Personal comments on the book&lt;/h2&gt;
&lt;p&gt;In my experience, exposing a tool such as matplotlib is a challenge:
everybody has different plotting needs, and there is infinite
variation in the ways a powerful library like matplotlib can be used.
Thus, Sandro’s exposition of matplotlib will not suffice on its own: people
should absolutely read more, and I cannot stress enough that the matplotlib
documentation is excellent.&lt;/p&gt;
&lt;p&gt;In general, I found that the book reads fairly well. Of course, I am
not the best critic in terms of ease of reading, as I know matplotlib very
well. I do find that the book lacks a &lt;em&gt;personal touch&lt;/em&gt;, such as
striking examples or deep insights into specific problems. Nothing
in the book really got me excited (again, maybe because I
already know its content quite well).&lt;/p&gt;
&lt;p&gt;Once again, in my eyes, the biggest contribution of this book is to put
together an introduction to matplotlib, and examples of application
building using matplotlib. I would especially recommend the book for
people wanting to build simple data visualization GUIs.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;Finally, with regard to interactive data visualization, in my
experience scientific programmers achieve better productivity when they
avoid working at the widget level and use an abstraction library. I
strongly recommend looking at &lt;a class="reference external" href="http://code.enthought.com/projects/traits/docs/html/"&gt;TraitsUI&lt;/a&gt; for this purpose. You can find
a tutorial &lt;a class="reference external" href="http://gael-varoquaux.info/computers/traits_tutorial/index.html"&gt;here&lt;/a&gt; (disclaimer: I wrote that tutorial).&lt;/p&gt;
&lt;p&gt;Also, if you are going to write a data visualization program that is
interactive in the sense that it enables the user to interact with the
data, using &lt;a class="reference external" href="http://code.enthought.com/chaco/"&gt;Chaco&lt;/a&gt; instead of matplotlib may make your life easier.
Chaco is not as well polished and documented as matplotlib, and I would
never use it for quick scripting work, but it has a strong focus on
data interaction, and as such makes it really easy to build very
responsive user interfaces, because it is very fast and has a clear
object-oriented API.&lt;/p&gt;
&lt;/div&gt;
&lt;/content&gt;<category term="programming"></category><category term="python"></category><category term="scientific computing"></category><category term="books"></category></entry><entry><title>New Mayavi release</title><link href="https://gael-varoquaux.info/programming/new-mayavi-release.html" rel="alternate"></link><published>2010-03-14T12:58:00+01:00</published><updated>2010-03-14T12:58:00+01:00</updated><author><name>Gaël Varoquaux</name></author><id>tag:gael-varoquaux.info,2010-03-14:/programming/new-mayavi-release.html</id><summary type="html">&lt;p&gt;A week ago, Peter Wang released a new version of the &lt;a class="reference external" href="http://code.enthought.com/"&gt;Enthought Tool
Suite (ETS)&lt;/a&gt;. With it came a new version of &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi2&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Prabhu and I have been horribly busy with real life, and I had the bad
feeling that we were not giving enough love to Mayavi. I …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A week ago, Peter Wang released a new version of the &lt;a class="reference external" href="http://code.enthought.com/"&gt;Enthought Tool
Suite (ETS)&lt;/a&gt;. With it came a new version of &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/"&gt;Mayavi2&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Prabhu and I have been horribly busy with real life, and I had the bad
feeling that we were not giving enough love to Mayavi. I was surprised
when I put together the list of features and bug fixes that went into
Mayavi over the last two releases. The full list can be found &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/changes.html"&gt;in the
documentation&lt;/a&gt;.&lt;/p&gt;
&lt;div class="section" id="contributors"&gt;
&lt;h2&gt;Contributors&lt;/h2&gt;
&lt;p&gt;We are not terribly good at tracking external ideas and patches,
so I hope that I haven’t forgotten anybody, but I am very happy to say
that Prabhu and I have received a fair amount of help from non-core
contributors:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Chris Colbert&lt;/li&gt;
&lt;li&gt;Darren Dale&lt;/li&gt;
&lt;li&gt;Dave Martin&lt;/li&gt;
&lt;li&gt;Dave Peterson&lt;/li&gt;
&lt;li&gt;Emmanuelle Gouillart&lt;/li&gt;
&lt;li&gt;Erik Tollerud&lt;/li&gt;
&lt;li&gt;Evan Patterson&lt;/li&gt;
&lt;li&gt;Gary Ruben&lt;/li&gt;
&lt;li&gt;Kyle Mandli&lt;/li&gt;
&lt;li&gt;Michele Mattioni&lt;/li&gt;
&lt;li&gt;Ondrej Certik&lt;/li&gt;
&lt;li&gt;Ram Rachum&lt;/li&gt;
&lt;li&gt;Robert Kern&lt;/li&gt;
&lt;li&gt;Scott Warts&lt;/li&gt;
&lt;li&gt;Suyog Jain&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On top of these people, I wish to thank the people making sure that the
Mayavi packages are available in the different Linux distributions:
Varun Hiremath, Lev Givon, Andrea Colangelo, Rakesh Pandit, as well as
Pierre Raybaut for integrating it in &lt;a class="reference external" href="http://pythonxy.com"&gt;Pythonxy&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="important-features-added-in-3-3-0"&gt;
&lt;h2&gt;Important features added in 3.3.0&lt;/h2&gt;
&lt;p&gt;3.3.0 was released last fall. We had not compiled the list of changes at
the time, so I am giving it here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;An &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/examples.html"&gt;example gallery&lt;/a&gt; in the documentation.&lt;/li&gt;
&lt;li&gt;A &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_figure.html#sync-camera"&gt;sync_camera&lt;/a&gt; helper function to synchronize camera between two
scenes.&lt;/li&gt;
&lt;li&gt;A &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_other_functions.html#text3d"&gt;text3d&lt;/a&gt; module, for positioning text in 3D so that it is scaled and hidden
like a data object.&lt;/li&gt;
&lt;li&gt;A &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_figure.html#close"&gt;close&lt;/a&gt; function to close scenes, similar to that in pylab or
matlab.&lt;/li&gt;
&lt;li&gt;A new filter to crop datasets: &lt;em&gt;DataSet Clipper&lt;/em&gt;. This filter is
terribly useful.&lt;/li&gt;
&lt;li&gt;All the &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab_pipeline_reference.html"&gt;mlab.pipeline&lt;/a&gt; functions now take a &lt;em&gt;figure=&lt;/em&gt; keyword
argument. This is very useful when coding with several figures
embedded in GUIs, as in a GUI you can’t rely on a context. This is
illustrated in this &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/example_multiple_mlab_scene_models.html"&gt;example&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="important-features-added-in-3-3-1"&gt;
&lt;h2&gt;Important features added in 3.3.1&lt;/h2&gt;
&lt;p&gt;In the latest release, the following important features were added:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_figure.html#savefig"&gt;mlab.savefig&lt;/a&gt; can now reliably save images of a size larger than
the window.&lt;/li&gt;
&lt;li&gt;The interactive VTK documentation browser is now available in the
GUI.&lt;/li&gt;
&lt;li&gt;New functions added to &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab.html"&gt;mlab&lt;/a&gt; to control position of the camera:
&lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#move"&gt;move&lt;/a&gt;, &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#yaw"&gt;yaw&lt;/a&gt;, and &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#pitch"&gt;pitch&lt;/a&gt;. These complement the existing &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#view"&gt;view&lt;/a&gt;
and &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#roll"&gt;roll&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Make the lines smoother when using &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_helper_functions.html#enthought.mayavi.mlab.plot3d"&gt;mlab.plot3d&lt;/a&gt; (using a VTK Stripper
filter).&lt;/li&gt;
&lt;li&gt;Add a &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_figure.html#enthought.mayavi.mlab.screenshot"&gt;screenshot&lt;/a&gt; function to mlab for easy screen capture as a
numpy array. This is very useful when creating figures that combine
3D using Mayavi and 2D using pylab. I use it all the time.&lt;/li&gt;
&lt;li&gt;Add a &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_pipeline_data.html#probe-data"&gt;probe_data&lt;/a&gt; function to return the data values of Mayavi
objects at given locations as numpy arrays. This is very useful to
combine numerics with Mayavi.&lt;/li&gt;
&lt;li&gt;Add an auto mode to &lt;a class="reference external" href="http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/mlab_camera.html#view"&gt;mlab.view&lt;/a&gt; to compute the camera position and distance
based on the objects in the scene.&lt;/li&gt;
&lt;li&gt;Add a helper function to easily interact with the data: a callback
can be registered for picking data with the mouse. &lt;a class="reference external" href="https://svn.enthought.com/enthought/browser/Mayavi/trunk/examples/mayavi/data_interaction/"&gt;Two
examples&lt;/a&gt; illustrate this new functionality. This is a major step
forward in making life easier for people using Mayavi to build custom
interfaces.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</content><category term="programming"></category><category term="python"></category><category term="science"></category><category term="mayavi"></category></entry></feed>