Gaël Varoquaux - programming

People underestimate how impactful Scikit-learn continues to be

2023-11-27T00:00:00+01:00

Note

François Chollet rightfully said that people often underestimate the impact of scikit-learn. I give here a few illustrations to back his claim.

A few days ago, François Chollet (the creator of Keras, the library that democratized deep learning) posted:

Indeed, scikit-learn continues to be the most popular machine-learning framework in surveys:

Most popular machine-learning framework, according to a Kaggle survey

Note

Scikit-learn is probably the most used machine-learning library

This popularity is sometimes underestimated as scikit-learn is a small player in terms of funding and size of the team, in particular compared to giants such as tensorflow and pytorch. Its size is limited by the nature of the project, which is based on a community without a strong commercial entity backing it.

We target a different technology than tensorflow and pytorch: we have by design let the big players focus on deep learning, which demands many more resources. Rather, we have focused on classic machine learning, believing that it serves other important needs. While such technologies make the news less often, they are used a lot, and scikit-learn is massively used:

**Usage statistics** (from github)

By not focusing on deep learning, does scikit-learn risk becoming outdated? Surveys show that simple models such as linear models or models based on trees (including boosting) are actually the most used models:

Most popular machine learning algorithm, according to a kaggle survey (apologies for the small fonts on the figure, I did not generate it)

Note

Gradient Boosted Trees is a good go-to model

There is a lot of hype surrounding deep learning, but it is most often not the right tool to tackle tabular data. Tabular data has different properties than images or text: it comes with heterogeneous columns which make sense by themselves, and tree-based models have the right inductive bias [Grinsztajn et al 2023].

Benchmark comparing models on tabular data while tuning hyper-parameters (from Grinsztajn et al 2023). Each value corresponds to the test score of the best model (on the validation set) after a specific time spent doing random search. The ribbon corresponds to the minimum and maximum scores on these 15 shuffles. Models HistGradientBoostingTree, GradientBoostingTree, and RandomForest come from scikit-learn. FT Transformer, Saint, ResNet and MLP are all deep learning architectures, with the FT Transformer and Saint models specifically developed for tabular data.

As we can see, scikit-learn’s HistGradientBoosting really shines, giving good prediction performance for small computational costs. We strive to facilitate data science: make it lightweight, give good documentation and APIs.

Linear models and tree-based models are here to stay. They answer strong needs for many application settings and they come with small operational cost.

In my opinion, where scikit-learn could really grow to be even more relevant is by integrating better in a broader ecosystem, going from databases to putting models in production, being more “enterprise ready” :).

My Mayavi story: discovering open source communities

2022-07-10T00:00:00+02:00

The Mayavi Python software, and my personal history: A thread on Python and scipy ecosystems, building open source codebase, and meeting really cool and friendly people

I am writing today as a goodbye to the project: I used to be one of the core contributors and maintainers but have been inactive for a while for lack of time. Out of common agreement, we recently removed my commit rights to limit security risks.

Mayavi brought me so much!

The start of my adventure with Mayavi

I got involved around 2007: I needed 3D visualization of magnetic fields as I was designing coils for my PhD [1].

[1]	This led to an example in the Mayavi docs http://docs.enthought.com/mayavi/mayavi/auto/example_magnetic_field_lines.html

I started as an early user of Mayavi2, a rewrite of Mayavi, and eventually joined Prabhu Ramachandran and Enthought as a contributor.

What is Mayavi?

Mayavi is a scientific 3D visualization library in Python.

It enables interactive visualization to understand complex information in 3D, such as multi-physics fields, combined with simple scripting to integrate in a broader scientific computing flow.

Mayavi was designed and founded around 2000 by Prabhu Ramachandran, a researcher in computational fluid dynamics at IIT Bombay and long-time open-source and Python figure.

The key idea was to make VTK, a powerful C++ visualization library, easily useful with a Python interface.

Mayavi bridged the gap between the C++ data structures and efficient Python data structures, exposing the data as numpy arrays without copies.

It uses tools from Enthought (namely the Enthought Tool Suite) for an interactive GUI built on a Python object model: fully scriptable (the vision is explained in an article Prabhu and I wrote).

Mayavi is a full-blown interactive application

Mayavi is also a Python library, for full scripting

Working on Mayavi taught me code and communities

Mayavi used within an interactive IPython – an image from the Mayavi paper

I joined to help with the “mlab” interface, for even simpler Python scripting built upon functions. My goal was to make Mayavi natural to matlab and matplotlib users, a product vision which was probably important to push popularity even further.

I was an isolated PhD student in a physics lab. Emboldened by a discussion with Fernando Perez, I started contributing and discussing with Prabhu Ramachandran. I remember my first skype discussion with Prabhu: I was very intimidated.

Understanding this large codebase was hard! And yet, slowly but surely, I started making more and more meaningful contributions: on mlab, then on the broader codebase, fixing bugs, a lot of work on documentation and examples…

Prabhu and myself are in this scipy conference group picture! From https://slideshare.net/enthought/scientific-computing-with-python-webinar-august-28-2009

Then Enthought funded my overseas travel to the scipy conference: a big deal for me, as I was a penniless PhD student.

My Mayavi story is that of meeting amazing people in the Python, scipy, and pydata world; people who believe in building a tool stack to democratize scientific computing; people from all over the world, friendly, welcoming, passionate.

It founded my belief in communities.

This adventure led me to learn software engineering (Software Carpentry really helped me get started), to work at Enthought (a software startup central to scientific computing in Python), to change career from physics to computing, to join Inria (French national research in maths and computing), and to work on other open source projects…

Mayavi was crucial to my personal adventure. Thank you Prabhu! Thank you Enthought! Thank you the Scipy community!!

Hiring an engineer and post-doc to simplify data science on dirty data

2021-10-29T00:00:00+02:00

Note

Join us to work on reinventing data-science practices and tools to produce robust analysis with less data curation.

It is well known that data cleaning and preparation are a heavy burden to the data scientist.

Dirty data research

In the dirty data project, we have been conducting machine-learning research to see how better statistical models could readily ingest non-curated data, and reduce the need for data preparation in data science. We now have a growing understanding of the problems, theoretical and practical, which lie across statistical and database topics.

Machine learning leads to different tradeoffs than traditional inferential statistics (because it can rely on more powerful models). For instance, we now have a good understanding of the case of missing values: in Le Morvan et al, we showed that with traditional methods, ignorable missingness [1] and “good” imputation are important, but it turns out that for prediction, flexible predictors are what matters and they can work on any missingness mechanism.

[1]	“Missing at Random”, where missingness is independent of the hidden values

Similarly, we have made good progress on tolerating normalization errors and typos. We find that rather than attempting to deduplicate the entries or fix the typos, it is best to represent similarities and ambiguities to a flexible learning algorithm. The simplest and most reliable methods are implemented in the dirty-cat library, to make the life of data scientists easier.
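As an illustration of representing similarities instead of fixing typos, the sketch below encodes each string by its character n-gram similarity to a few reference categories. This is a simplified pure-Python sketch, not dirty-cat's implementation; the function names, the Jaccard similarity, and the example strings are all assumptions made for the example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its similarities to the prototype categories."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical dirty column: typos and spacing variants of two job titles
prototypes = ["police officer", "firefighter"]
values = ["Police Oficer", "fire fighter", "police officer"]

for value, row in zip(values, similarity_encode(values, prototypes)):
    print(value, [round(s, 2) for s in row])
```

Each misspelled entry ends up close to the category it resembles rather than becoming a new, unrelated category, and the resulting numerical columns can be passed to a flexible learner.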

Reinventing data science

With this understanding (and even more exciting on-going research), we want to revisit data science. Machine learning can provide flexible models for many usages of data science. Our goal is to use it to help assemble and analyze datasets while minimizing human effort. For this, we need tools that can answer typical data-science questions using machine learning and starting from the raw data, often spread across multiple files or multiple tables of a database. Building these tools requires data-science research, a new vision of data-science APIs, and careful software crafting.

Join us in this adventure

We have an awesome team, with a great mix of people of different seniority, different expertise (statistics, machine learning, databases, software engineering), sharing offices with the scikit-learn team at Inria. But we have too many exciting ideas, so we are growing this team.

A data-science engineer: new software with new ideas

We are looking for someone with a background in data science or numerical Python programming to join us, to help with designing a new data-science library, evolving from dirty-cat, and to help with data-science experimentation for the research.

We like people who care about data, like designing good tools, and have a vision about data science. We are happy to consider different levels of experience. Apply on the job offer.

A post-doc researcher: science joining data engineering to deep learning

We will soon be announcing a post-doc position to join the team for research in this scope. We are interested in questions around learning on relational or tabular data, or learning data integration. We have plenty of ideas to explore around embeddings in databases, learning to aggregate, learning on sets, graph neural networks for databases, or distributional matching for entity and schema alignment. We expect to be borrowing tools (conceptual and practical) from deep learning, and blending them with techniques from data integration, knowledge graphs, and databases.

The job posting will be out soon, but I am running out of the office right now for vacations (work-life balance also matters to us).

Diversity is important

Our team is not as diverse as I would like it to be (though probably doing better than the typical computer-science team). We love diverse candidates. Do not hesitate.

Hiring someone to develop scikit-learn community and industry partners

2021-09-14T00:00:00+02:00

Note

With the growth of scikit-learn and the wider PyData ecosystem, we want to recruit in the Inria scikit-learn team for a new role. Departing from our usual focus on excellence in algorithms, statistics, or code, we want to add to the team someone with some technical understanding, but an eye for people dynamics. Are you passionate about developing open-source communities for data science? This job is a unique opportunity.

The mandate will be on the one hand to develop the wider community behind scikit-learn, on the other hand to foster the foundation’s partnerships, as this is our funding.

Context: Scikit-learn @ Inria foundation

The growth of Scikit-learn

Scikit-learn is used massively, from schools to major companies. It underpins business-intelligence analysis or automates processes. Its reliability is crucial for the enterprise. Its well-documented methods help data scientists run valid analyses.

Scikit-learn has hugely grown and is still growing in terms of userbase and expectation of quality. These days, the development team is large, with many grass-roots volunteers and some contributors spending a sizeable fraction of their work time.

Number of monthly website visits

Scikit-learn @ Inria foundation

Birth of a foundation: To ensure reliable funding to a small core of scikit-learn developers, we set up a foundation [1] a few years ago. The goal was to make sure that we did not lose our experienced developers.

[1]	See the motivating announcement and the website.

Achieving sustainability: The resulting structure is set up to provide a career path to a few of our core people. As a consequence, it is a French legal entity, acting as an employer, funded via sponsorship agreements with a few major economic users of scikit-learn (check out the list of our sponsors). The priorities of the team are set jointly between the sponsors and the open-source community. The setup is not without flaws, in particular it forces us to employ people on Campus, but it enables giving proper benefits to these contributors.

The team: The scikit-learn team at Inria foundation currently comprises 4 very experienced developers. In addition, we have other sources of funding (research projects, the scikit-learn MOOC) that we use to create a larger team (currently 3 full-time positions). Finally, various researchers on campus are heavily invested in scikit-learn or related projects such as joblib. As a result, the amount of technical skill is staggering.

Long story short, we want to add new DNA to this awesome team: someone into peopleware as much as software.

Mandate

The goal of the new position is to talk both to our wider open-source world and our corporate partners. Both are crucial to fostering growth for scikit-learn.

The official job posting doesn’t convey as well as I would like what is behind this position. I’m probably to blame :).

Growing our open-source community

As both the scikit-learn and the PyData community have grown, communication has become a bottleneck. There are so many little things that make an open-source community productive: facilitating on-boarding, dividing the workload efficiently, documenting the decision making well, organizing fun sprints, making sure that issue triaging is efficient…

We are looking for someone passionate about open-source communities and who wants to be herding such cats.

Increasing our corporate visibility

Scikit-learn is one of the most used data-science tools. However, when we talk to senior decision makers, their perception sometimes differs. Indeed, we are competing for visibility with many powerful actors.

We must communicate beyond the open-source world to develop a strong brand for scikit-learn. Good communication will help us find new sponsors, a key ingredient of growth and sustainability for scikit-learn.

We need to communicate on our progress and our actions, as people are often surprised by the breadth of our contributions [2].

[2]	For instance, the foundation team has contributed improvements to CPython itself, and maintains cloudpickle, a central component of the data ecosystem.

As a foundation, we need to be transparent and accountable, which is harder than it seems.

A good fit

We are looking for someone into open source, but also who likes writing blog posts, social networks, organizing events, presenting scikit-learn, and improving processes.

We believe that such a job is best done by someone who has some technical interest in scikit-learn: good advocacy comes with good understanding.

Maybe this sounds daunting? Few people have all the skills, let alone the experience. We are actually more looking for a passionate and promising candidate, whatever the length of the resume. We believe that talented people can learn, when they like what they do.

This is a job about open-source, for open source! It’s not a perfect job: we have many administrative constraints in running the foundation, and we are paying ourselves less than a non-open-source job would.

Apply now

We are looking forward to your application. You can submit it via the official job offer.

Technical discussions are hard; a few tips

2020-05-28T00:00:00+02:00

Note

This post discusses the difficulties of communicating while developing open-source projects and tries to give some simple advice.

A large software project is above all a social exercise in which technical experts try to reach good decisions together, for instance on github pull requests. But communication is difficult, in particular between diverging points of view. It is easy to underestimate how much well-intended persons can misunderstand each other and get hurt, in open source as elsewhere. Knowing why there are communication challenges can help, as well as applying a few simple rules.

Contents

Maintainer’s anxiety
Contributor’s fatigue
Communication is hard
Little things that help

The first challenge is to understand the other’s point of view: the different parties see the problem differently.

Maintainer’s anxiety

Open source can be anxiety-generating for the maintainers

Maintainers ensure the quality and the long-term life of an open-source project. As such, they feel responsible for any shortcoming in the product. In addition, they often do this work because they care, even though it may not bring any financial support. But they can quickly become a converging point of anxiety-generating feedback:

Code has bugs; the more code, the more bugs. Watching an issue tracker fill up with a long list of bugs is frightening to people who feel in charge.
Given that maintainers are visible and qualified, they become the target of constant requests for attention: from pleas to prioritize a specific issue to solicitations for advice.
A small fraction of these interactions come as plain aggressions. I have been insulted many times by unsatisfied users. Each time, it hurts me a lot. My policy is to disengage from the conversation, but I am left shaking and staring at my computer in the evening.

The more popular a project, the more weight it puts on its maintainers’ shoulders. A consequence is that maintainers are tired, and can sometimes approach discussions in a defensive way. Also, we may be plain scared of integrating code that we do not fully comprehend.

Open-source developers may even, unconsciously, adopt a simple, but unfortunate, protection mechanism: being rude. The logic is flawless: if I am nasty to people, or I set unreasonable expectations, people will leave me alone. Alas, this strategy leads to toxic environments. It not only makes people unhappy but also harms the community dynamics that ground the excellence of open source.

The danger of abusive gatekeeping

A maintainer quickly learns that every piece of code, no matter how cute it might be, will give him or her work in the long run, just like a puppy. This is unavoidable given that the complexity of code grows faster than its number of features [1], and, even for a company as rich as Google, project maintenance becomes intractable on huge projects [2].

[1]	An Experiment on Unit Increase in Problem Complexity, Woodfield 1979

[2]	To quote tensorflow developers “Every [code addition] takes around 16 CPU/GPU hours of [quality control]. As such, we cannot just run every [code addition] through the [quality control] infrastructure.”

A maintainer’s job is to say no often, to protect the project. But, like any gatekeeping, it can unfortunately become an exercise in unchecked power. Making objective choices for these difficult decisions is hard, and we all naturally tend to trust people that we know more.

Most often we are not aware of our shortcomings, let alone indulging in them on purpose.

Contributor’s fatigue

A new contributor starting a conversation with a group of seasoned project maintainers may easily feel like an imposter. The new contributor knows less about the project. In addition, he or she is engaging with a group of people that know each other well, and is not yet part of that inner group.

This person does not know the code base, or the conventions, and must make extra efforts, compared to the seasoned developers, to propose a contribution suitable for the project. Often, he or she does not fully understand the reasons for the project guidelines, or for the feedback given. Requests for changes can easily be seen as trifles.

Integrating the contribution can often be a lengthy process –in particular in scikit-learn. Indeed, it will involve not only shaping up the contribution, but also learning the skills and discovering the process. These long cycles can undermine motivation: humans need successes to feel enthusiasm. Also, the contributor may legitimately worry: Will all these efforts be fruitful? Will the contribution make its way to the project?

Note that for these reasons, it is recommended to start contributing with very simple features, and to seek feedback on the scope of the contribution before writing the code.

Finally, contributors are seldom paid to work on the project, and there is no single line of command that makes decisions and controls incentives for all the people on the project. No one is responsible when things go astray, which means that the weight falls on the shoulders of individuals.

The danger behind the lengthy cycle of reviews and improvements needed to contribute is death by a thousand cuts. The contributor loses motivation, and no longer finds the energy to finish the work.

How about users?

This article is focused on developers. Yet, users are also an important part of the discussion around open source.

Often communication failures with users are due to frustration. Frustration of being unable to use the software, of hitting a bug, of seeing an important issue still not addressed. This frustration stems from incorrect expectations, which can often be traced to misunderstanding of the processes and the dynamics. Managing expectations is important to improve the dialogue, via the documentation, via notes on the issue tracker.

Communication is hard

Communication is hard: messages are sometimes received differently than we would like. Overworked people discussing very technically challenging issues only makes the matter worse. I have seen people not come across well, while I know they are absolutely lovely and caring.

We are human beings; we are limited; we misunderstand things, and we have feelings.

Emotions – My most vivid memory of a communication failure was when I was a sailing instructor. Trainees that were under my responsibility had put themselves at risk, causing me a lot of worry. During the debrief, I was angry. My failure to convey the messages without emotional loading undermined my leadership on the group, putting everybody at risk for the rest of the week.

Inability to understand the others’ point of view, or to communicate ours, can bring in emotions. Emotions most often impede technical communication.

Limited attention – We, in particular maintainers, are bombarded with email, notifications, text and code to read. As a consequence, it is easy to read things too fast, to stop in the middle, to forget.

Language barriers – Most discussions happen in English; but most of us are not native English speakers. We may hide well our difficulties, but nuances are often lost.

Clique effects – Most interactions in open source are done in writing, with low communication bandwidth. It can be much harder to convince a maintainer on the other side of the world than a colleague in the same room. Schools of thought naturally emerge when people work a lot together. These create bubbles, where we have the impression that everything we say is obvious and uncontroversial, and yet we fail to convince people outside of our bubble.

Little things that help

Communication can be improved by continuously working on it [3]. It may be obvious to some, but it personally took me many years to learn.

[3]	Training materials for managers often discuss communication, and give tricks. I am sure that there are better references than my list below. But that’s the best I can do.

Hear the other: exchange

Foster multiway discussions – The goal of a technical discussion is to come up with the best solution. Better solutions emerge via confronting different points of view: a single brilliant individual probably cannot find or recognize the best solution alone.

Integrate input from as many perspectives as possible.
Make sure everyone feels heard.

Don’t seek victory – Most important to keep in mind is that giving up on an argument and accepting the other point of view is a perfectly valid option. I am naturally biased to think that my view on topics dear to me is the right one. However, I’ve learned that adopting the view of the other could bring a lot to the social dynamics of a project: we are often debating over details and the bigger benefit comes from moving forward.

In addition, if several very bright people have different conclusions than me about something that they’ve thought about a lot, who am I to disagree?

Convey ideas well: pedagogy

Explain – Give the premises of your thoughts. Unroll your thought processes. People are not sitting in your head, and need to hear not only your conclusion, but how you got there.

Repeat things – Account for the fact that people can forget, and never hesitate to gently restate important points. Reformulating differently can also help explaining.

Keep it short – A typical reading speed is around 200 words a minute. People have limited time and attention span. The greatest help you can provide to your reader is to condense your ideas: let us avoid long threads that require several dozen minutes to read and digest. There is a tension between this point and the above. My suggestion: remove every word that is not useful, move details to footnotes or postscripts.

Cater for emotions: tone

Stay technical – Always try to get to the technical aspect of the matter, and never the human. Give specific code and wording suggestions. When explaining a decision, give technical arguments, even if they feel obvious to you.

Be positive – Being positive in general helps people feel happy and motivated. It is well known that positive feedback leads to quicker progress than negative, as revealed e.g. by studies of classrooms. I am particularly guilty of neglecting this: I always forget to say something nice, although I may be super impressed by a contribution. Likewise, avoid negative words when giving feedback (stay technical).

Avoid “you” – The mere use of the pronoun “you” puts the person we are talking to at the center of the message. But the message should not be about the person, it should be about the work. It’s very easy to react emotionally when it’s about us. The passive voice can be useful to avoid putting people as the topic. If the topic is indeed people, sometimes “we” is an adequate substitute for “you”.

Assume good faith – There are so many misunderstandings that can happen. People forget things, people make mistakes, people fail to convey their messages. Most often, all these failures are in good faith, and misunderstandings are legitimate. In the rare cases there might possibly be some bad faith, accounting for it will only make communication worse, not better. Along the same line, we should ignore when we feel assaulted or insulted, and avoid replying in kind.

Choose words wisely – The choice of words matters, because they convey implicit messages. In particular, avoid terms that carry value judgements: “good” or “bad”. For example “This is done wrong” (note that this sentence already avoids “you”), could be replaced by “There might be a more numerically stable / efficient way of doing it” (note also the use of precise technical wording rather than the generic term “better”).

Use moderating words – Try to leave room for the other in the discussion. Statements that are too assertive close the door to different points of view: “this must be changed” (note the lack of “you”) should be avoided while “this should be changed” is better. For this reason, this article is riddled with words such as “tend”, “often”, “feel”, “may”, “might”.

Don’t blame someone else – If you feel that there is some pattern that you would like to change, do not point fingers, do not blame others. Rather, put yourself at the center of the story, find an example of this pattern with you, and the message should be that “it is a pattern that we should avoid”. “We” is such a powerful term. It unites; it builds a team.

Give your understanding – If you feel that there is a misunderstanding, explain how you are feeling. But do it using “I”, and not “you”, and acknowledge the subjectivity: “I feel ignored” rather than “you are ignoring me”. Even better: only talk about the feeling: “I am losing motivation, because this is not moving forward”, or “I think that I am failing to convey why this numerical problem is such an important issue” (note the use of “I think”, which avoids casting the situation as necessarily true).

I hope this can be useful. I personally try to apply these rules, because I want to work better with others.

Thanks

to many who gave me feedback: Adrin Jalali, Andreas Mueller, Elizabeth DuPre, Emmanuelle Gouillart, Guillaume Lemaitre, Joel Nothman, Joris Van den Bossche, Nicolas Hug.

PS: note how many times I’ve used “you” above. I can clearly get better at communication!

Getting a big scientific prize for open-source software

2019-12-01T06:00:00+01:00

Note

An important acknowledgement for a different view of doing science: open, collaborative, and more than a proof of concept.

A few days ago, Loïc Estève, Alexandre Gramfort, Olivier Grisel, Bertrand Thirion, and myself received the “Académie des Sciences Inria prize for transfer”, for our contributions to the scikit-learn project. To put things simply, it’s quite a big deal to me, because I feel that it illustrates a change of culture in academia.

Recognizing an open view of scientific contributions

It is a great honor, because the selection was made by the members of the Académie des Sciences, very accomplished scientists with impressive contributions to science. The “Académie” is the hallmark of fundamental academic science in France. To me, this prize is also symbolic because it recognizes an open view of academic research and transfer, a view that sometimes felt like not playing according to the incentives. We started scikit-learn as a crazy endeavor, a bit of a hippy science thing. People didn’t really take us seriously. We were working on software, and not publications. We were doing open source, while industrial transfer is made by creating startups or filing patents. We were doing Python, while academic machine learning was then done in Matlab, and industrial transfer in C++. We were not pursuing the latest publications, while these are thought to be research’s best assets. We were interested in reaching out to non-experts, while partners considered as interesting have qualified staff.

Quality and openness, at the cost of quantity and control

No. We did it differently. We reached out to an open community. We did BSD-licensed code. We worked to achieve quality at the cost of quantity. We cared about installation issues, on-boarding biologists or medical doctors, playing well with the wider scientific Python ecosystem. We gave decision power to people outside of Inria, sometimes people whom we had never met in real life. We made sure that Inria was never the sole actor, the sole stake-holder. We never pushed our own scientific publications in the project. We limited complexity, trading off performance for ease of use, ease of installation, ease of understanding.

As a consequence, we slowly but surely assembled a large community. In such a community, the sum is greater than the parts. The breadth of interlocutors and cultures slows movement down, but creates better results, because these results are understandable to many and usable on a diversity of problems. The consequence of this quality is that we were progressively used in more and more places: industrial data-science labs, startups, research in applied or fundamental statistical learning, teaching. Ironically, the institutional world did not notice. It got hard, next to impossible, to get funding [1]. A few years ago, I was told by a central governmental agency that we, open-source zealots, were destroying an incredible amount of value by giving away for free the production of research [2]. The French report on AI, led by a Fields medalist, cited tensorflow and theano (a discontinued software), but ignored scikit-learn; maybe because we were doing “boring science”?

But, scikit-learn’s amazing community continued plowing forward. We grew so much that we were heard from the top. The prize from the Académie shows that we managed to capture the attention of senior scientists with open-source software, because this software is really having a worldwide impact in many disciplines.

Presenting scikit-learn at the Academie Des Sciences

An accomplishment of the community

There were only five of us on stage, as the prize is for Inria permanent staff. But this is of course not a fair account of how the project has grown and what made it successful.

In 2011, at the first international sprint, I felt something was happening: incredible people whom I had never met before were sitting next to me, working very hard on solving problems with me. This experience of being united to solve difficult problems is something amazing. And I deeply thank every single person who has worked on this project, the 1500 contributors, many of whom I have never met, and in particular the core team, which is committed to making sure that every detail of scikit-learn is solid and serves the users. The team that has assembled over the years is of incredible quality.

The promises of data science need open source

The world does not understand how much the promises of data science, for today and tomorrow, need open source projects, easy to install and to use by everybody. These projects are like roads and bridges: they are needed for growth, though no one wants to pay for maintaining them. I hope that I can use the podium that the prize will give us to stress the importance of the battle that we are fighting.

[1]	Getting funding from the government implied too much politics and risks. For these reasons, I turned to private donors, in a foundation.

[2]	Inria always supported us, and often paid developers in my team out of its own pockets.

PS: As another illustration of the culture change toward openness in science, it was announced during the ceremony that the “Compte Rendu de l’Académie des Sciences” is becoming open access, without publication charges!

A foundation for scikit-learn at Inria

2018-09-17T00:00:00+02:00

We have just announced that a foundation will be supporting scikit-learn at Inria [1]: scikit-learn.fondation-inria.fr

Growth and sustainability

This is an exciting turn for us, because it enables us to receive private funding. As a result, we will be able to have secure employment for some existing core contributors, and to hire more people on the team. The goal is to help sustain quality (more frequent releases?) and to tackle some ambitious features.

A foundation? What and why?

Open source lives and thrives by its base, the community of developers. And scikit-learn is a fantastic example of these dynamics. Because of its grassroots origins, it has focused on features that matter for the small and the many, such as ease of use and statistical models that work well in data-poor situations. Over the years, decisions have been based on their technical merit, rather than the importance of displaying a list of features that are trendy. A consequence of the breadth of contributors with different backgrounds is that the library tends to be well-suited for many applications, including some models that are less mainstream.

People with dedicated time to support the community

That said, over time there is an increasing need for a core team of maintainers. As the library gets bigger, it is more and more difficult to have a full view of what is happening. Integration of new features, quality assurance, and releases are best done by developers who can dedicate a large amount of time to the library. Also, ambitious changes to the library, such as improving the parallel computing engine, need long efforts. For many years, we have always had people with dedicated time to support the community. In France, we were jumping through hoops to find public money to fund them. As someone who has done this effort, I can tell you that it is a complicated one [2].

The ability to receive money from sponsors will enable us to scale up our operations. I was initially worried that we would have difficulties finding partners that accepted to give us money without asking for control on the project. However, I was proven wrong, and we have found a small set of great partners.

What will people work on? How will decisions be made?

It can be a difficult exercise to balance how money is used in a community-driven project. The project should not lose its drive, in which the community of developers is important. Interests of the sponsors should not take precedence over interests of the user base.

We will make sure that the money that the foundation receives is invested in the interest of the community. We have a technical committee that supervises the activity of the foundation. Its decisions will be informed by the community [3]. For this, we have an advisory board composed of core contributors of scikit-learn. Besides the advisory board, the technical committee also comprises a delegate from each sponsor. I am excited about the input that our partners will provide us on the priorities for them, as they represent various industries. Voting power will be spread so that sponsors and community have the same voting power.

Why not an existing foundation such as NumFOCUS, or the PSF?

There are several reasons why we chose this particular legal vessel. Our endeavor is slightly different from the prominent foundations in our ecosystem, NumFOCUS and the PSF (Python Software Foundation).

The first important aspect is that we want to employ full-time developers. Different countries have very different legal frameworks, and it is really hard to transfer money overseas in a non-profit. Anything physical, like employing people or owning real estate, is even harder. We needed something in France. And there might be a need for something else in another country at some point.

Another reason to be embedded in the Inria foundation is that it is giving us a really good deal. We basically get legal advice, accounting, office space, and IT support, for an 8% overhead. This is an excellent deal and is part of the sponsoring efforts that Inria will keep doing.

Last, we feel that a foundation targeting specifically scikit-learn can raise money from different people than other foundations. I think that there is value having multiple foundations seeking money for open-source software. Indeed, a foundation builds a case and an image, to convince donors. Different donors require a different case and a different image. For instance the president of NumFOCUS argues for a name less focused on numerics. Yet, too wide of a scope can dilute the image.

We have in mind to make it easy for other foundations to support scikit-learn. We have major contributors at leading institutions, such as Andreas Mueller at Columbia or Joel Nothman at Sydney university. It is important that these institutions can easily gather donations too, in the legal framework suited to their country. Hence the name reflects that the foundation is embedded at Inria, leaving room for other initiatives.

What’s the scope?

The scope of our work is everything scikit-learn related. It is not the whole pydata or scipy ecosystem: it is focused on scikit-learn. But we will not hesitate to contribute fixes and enhancements to neighboring projects, like in the past, even all the way up to core Python [4].

I am very excited. A strong team of full-time contributors will allow us to do ambitious things with scikit-learn.

Join us

We will be recruiting! See our positions. Come work with us in Paris.

I want to end by thanking the amazing men and women who have been contributing to scikit-learn, and are with us in this fantastic adventure! The energy that is in this project is incredible. We are launching this effort thanks to you, and to empower you even more.

[1]

I am quite proud that over the years, my group has employed Olivier Grisel, Joris van den Bossche (working on pandas in addition to scikit-learn), Guillaume Lemaître (working on imbalanced-learn in addition to scikit-learn), Jérémie du Boisberranger, Tom Moreau, Loic Estève, Fabian Pedregosa, to name only a few. All these people, and the many other students that we have paid part time to work on software, have had a structuring impact on our ecosystem, going beyond the bounds of scikit-learn and touching many aspects of computing in Python. However, because of the constraints of research funding in France, public money forced me to hire them with short-term contracts.

[2]	Technically, it is a tax-deductible scikit-learn consortium inside the Inria foundation, which is a non-profit entity related to Inria.

[3]	Details on the governance of the foundation can be found at https://scikit-learn.fondation-inria.fr/en/mission-and-governance

[4]

For instance Olivier and Tom have been making parallelism more robust in Python 3.7 (amongst various issues https://bugs.python.org/issue33056 and https://bugs.python.org/issue31699). Olivier helped define the new pickling protocol, crucial to efficient persistence. This is hard work. Yet it is important, because it benefits all libraries.

Sprint on scikit-learn, in Paris and Austin

2018-08-01T00:00:00+02:00

Two weeks ago, we held a scikit-learn sprint in Austin and Paris. Here is a brief report on progress and challenges.

Several sprints

We actually held two sprints in Austin: one open sprint, at the scipy conference sprints, which was open to new contributors, and one core sprint, for more advanced contributors. Thank you to all who joined the scipy conference sprint. As I wasn’t there, I cannot report on it.

Many achievements

Too many things were done to be listed here. Here is a brief overview:

Optics got merged: The optics clustering algorithm is a density-based clustering algorithm, like DBSCAN, but with hyperparameters that are more flexible and easier to set. Our implementation is also more scalable for very large numbers of samples. The pull request was opened in 2013, and got many, many improvements over the years.
Yeo-Johnson: The Yeo-Johnson transform is a simple parametric transformation of the data that can be used to make it more Gaussian. It is similar to the Box-Cox transform but can deal with negative data (PR).
Novelty versus outlier detection: Novelty detection attempts to find, in new data, observations that differ from the train data. Outlier detection considers that even in the train data there are aberrant observations. New modes in scikit-learn enable both usage scenarios with the same algorithms (see this issue and this PR).
Missing-value indicator: a new transform that adds indicator columns marking missing data (PR).
Pypy support: pypy support was merged. (PR).
Random Forest with 100 estimators: The default of n_estimators in RandomForest was changed from 10, which was fast but statistically poor, to 100 (PR).
Changing to 5-fold: we changed the default of cross-validation from 3-fold to 5-fold (PR).
Toward release 0.20: most of the effort of the sprint was actually spent on addressing issues for the 0.20 release: a long list of quality improvements (milestone).
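
Two of the additions listed above, the Yeo-Johnson transform and the missing-value indicator, can be tried in a few lines. This is only a minimal sketch on made-up toy data, assuming a scikit-learn version of 0.20 or later, where they are exposed as PowerTransformer and MissingIndicator:

```python
import numpy as np
from scipy.stats import skew
from sklearn.impute import MissingIndicator
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)

# Skewed toy data that includes negative values (Box-Cox would reject it)
X = rng.lognormal(size=(500, 1)) - 1.0

# Yeo-Johnson transform; by default the output is also standardized
pt = PowerTransformer(method="yeo-johnson")
X_gauss = pt.fit_transform(X)

# Skewness should shrink toward 0 as the data become more Gaussian
skew_before = float(skew(X.ravel()))
skew_after = float(skew(X_gauss.ravel()))

# Missing-value indicator: one boolean column per feature with missing data
X_missing = np.array([[1.0, np.nan],
                      [np.nan, 2.0],
                      [3.0, 4.0]])
indicator = MissingIndicator()
mask = indicator.fit_transform(X_missing)
```

The indicator columns are typically stacked next to imputed features, so that a downstream model can still see which entries were missing.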

Scikit-learn is hard work

Even for the almighty @amueller

Two days of intense group work on scikit-learn reminded me how much hard work it is. I thought that it was maybe a good idea to try to illustrate why.

Mathematical errors: maintaining the library requires mathematical understanding of the models. For instance Ivan Panico fixed the sparse PCA, for which the transform was mathematically incorrect.
Numerical instabilities: sometimes, however, when models give a result different from the expected one, this is due to numerical instability. For instance, Sergül Aydöre changed the tolerance for certain variants of ridge.
Keeping examples and documentation up to date: Each change requires changing all documentation and examples. We have a lot of these. For instance, Alex Boucaud had to update many examples and documentation pages when changing the default cross-validation.
Clean deprecation path: We make sure that our changes do not break users' code, and therefore we provide a smooth update path, with progressive deprecations. For instance, the change of default cross-validation introduces an intermediate step where the default is kept the same and warns that it will change in two releases.
Consistent behavior across the library: One of the acclaimed values of scikit-learn is that it has a very consistent behavior across different models. We enforce this by “common tests”, that test some properties of the estimators altogether. For instance, Sergul implemented common tests for sample weights.
Extensive testing: We test many, many things in scikit-learn: that the code snippets in the documentation are correct, that the docstring conventions are respected, that there are no deprecation errors raised, including from our dependencies. As a result, continuous integration is a core part of our development. During the sprint, we flooded our cloud-based continuous integration, and as a result iteration really slowed down. TravisCI were kind enough to fix this by allocating us more computing power for free.
Supporting many versions: Last but not least, one constraint that makes development hard with scikit-learn is that we support many different versions of Python, of our dependencies, of linear-algebra libraries, and of operating systems. This makes development harder, and continuous integration slower. But we feel that this is very valuable for a core library: narrowing the supported versions means that users are more likely to end up in unsatisfiable dependencies situations, where different parts of a project want different version numbers of a dependency.
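
The "clean deprecation path" item above can be illustrated with a small sketch. It is not scikit-learn's actual code: the function name resolve_cv and the "warn" sentinel are assumptions made for illustration. The idea is that the old default keeps working, but relying on it emits a FutureWarning announcing the change:

```python
import warnings


def resolve_cv(cv="warn"):
    """Return the number of folds, keeping the old default during the transition.

    Hypothetical helper: "warn" is a sentinel meaning "the caller did not choose".
    """
    if cv == "warn":
        warnings.warn(
            "The default value of cv will change from 3 to 5 in a future "
            "release. Pass cv explicitly to silence this warning.",
            FutureWarning,
        )
        return 3  # old default, unchanged for now
    return cv


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_folds = resolve_cv()    # relies on the default: warns
    explicit_folds = resolve_cv(5)  # explicit choice: silent
```

Users who set the value explicitly never see the warning, and their code keeps the same behavior before and after the default changes.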

Warning

Dropping support for Python 2

Supporting many versions slows development. It also prevents implementing new features: supporting Python 2 makes it harder to provide better parallelism or traceback management.

Python 3 has been out for 10 years. It is solid and comes with many improvements over Python 2. Along with many other projects, we will be requiring Python 3 for the future releases of scikit-learn (0.21 and later). scikit-learn 0.20 will be the last release to support Python 2. This will enable us to develop a better toolkit faster.

Credits and acknowledgments

Contributors to the sprint

In Paris

Albert Thomas, Huawei
Alexandre Boucaud, Inria
Alexandre Gramfort, Inria
Eric Lebigot, CFM
Gaël Varoquaux, Inria
Ivan Panico, Deloitte
Jean-Baptiste Schiratti, Telecom ParisTech
Jérémie du Boisberranger, Inria
Léo Dreyfus-Schmidt, Dataiku
Nicolas Goix
Samuel Ronsin, Dataiku
Sebastien Treguer, Independent
Sergül Aydöre, Stevens Institute of Technology

In Austin

Andreas Mueller, Columbia
Guillaume Lemaître, Inria
Jan van Rijn, Columbia
Joan Massich, Inria
Joris Van den Bossche, Inria
Loïc Estève, Inria
Nicolas Hug, Columbia
Olivier Grisel, Inria
Roman Yurchak, independent
William de Vazelhes, Inria

Remote

Hanmin Qin, Peking University
Joel Nothman, University of Sydney

Beyond computational reproducibility, let us aim for reusability

2017-09-19T12:10:00+02:00

Note

Scientific progress calls for reproducing results. Due to limited resources, this is difficult even in computational sciences. Yet, reproducibility is only a means to an end. It is not enough by itself to enable new scientific results. Rather, new discoveries must build on reuse and modification of the state of the art. As time goes on, this state of the art must be consolidated in software libraries, just as scientific knowledge has been consolidated on bookshelves of brick-and-mortar libraries.

I am reposting an essay that I wrote on reproducible science and software libraries. The full discussion is in IEEE CIS TC Cognitive and Developmental Systems, but I’ve been told that it is hard to find.

Science is based on the ability to falsify claims. Thus, reproduction or replication of published results is central to the progress of science. Researchers failing to reproduce a result will raise questions: Are these investigators not skilled enough? Did they misunderstand the original scientific endeavor? Or is the scientific claim unfounded? For this reason, the quality of the methods description in a research paper is crucial. Beyond papers, computers —central to science in our digital era— bring the hope of automating reproduction. Indeed, computers excel at doing the same thing several times.

However, there are many challenges to computational reproducibility. To begin with, computers enable reproducibility only if all steps of a scientific study are automated. In this sense, interactive environments —productivity-boosters for many— are detrimental unless they enable easy recording and replay of the actions performed. Similarly, as a computational-science study progresses, it is crucial to keep track of changes to the corresponding data and scripts. With a software-engineering perspective, version control is the solution. It should be in the curriculum of today’s scientists. But it does not suffice. Automating a computational study is difficult. This is because it comes with a large maintenance burden: operations change rapidly, straining limited resources —processing power and storage. Saving intermediate results helps. As does devising light experiments that are easier to automate. These are crucial to the progress of science, as laboratory classes or thought experiments in physics. A software engineer would relate them to unit tests, elementary operations checked repeatedly to ensure the quality of a program.

Archiving computers in thermally-regulated nuclear-proof vaults?

Once a study is automated and published, ensuring reproducibility should be easy; just a matter of archiving the computer used, preferably in a thermally-regulated nuclear-proof vault. Maybe, dear reader, the scientist in you frowns at this solution. Indeed, studies should also be reproduced by new investigators. Hardware and software variations then get in the way. Portability, ie achieving identical results across platforms, is well-known by the software industry as being a difficult problem. It faces great hurdles due to incompatibilities in compilers, libraries, or operating systems. Beyond these issues, portability also faces numerical and statistical stability issues in scientific computing. Hiding instability problems with heavy restrictions on the environment is like rearranging deck chairs on the Titanic. While enough freezing will recover reproducibility, unstable operations cast doubt upon scientific conclusions they might lead to. Computational reproducibility is more than a software engineering challenge; it must build upon solid numerical and statistical methods.

Reproducibility is not enough. It is only a means to an end, scientific progress. Setting in stone a numerical pipeline that produces a figure is of little use to scientific thinking if it is a black box. Researchers need to understand the corresponding set of operations to relate them to modeling assumptions. New scientific discoveries will arise from varying those assumptions, or applying the methodology to new questions or new data. Future studies build upon past studies, standing on the shoulders of giants, as Isaac Newton famously wrote. In this process, published results need to be modified and adapted, not only reproduced. Enabling reuse is an important goal.

Libraries as reusable computational experiments

To a software architect, a reusable computational experiment may sound like a library. Software libraries are not only a good analogy, but also an essential tool. The demanding process of designing a good library involves isolating elementary steps, ensuring their quality, and documenting them. It is akin to the editorial work needed to assemble a textbook from the research literature.

Science should value libraries made of code, and not only bookshelves. But they are expensive to develop, and even more so to maintain. Where to set the cursor? It is clear that in physics not every experimental setup can be stored for later reuse. Costs are less tangible with computational science; but they should not be underestimated. In addition, the race to publish creates legions of studies. As an example, Google scholar lists 28000 publications concerning compressive sensing in 2015. Arguably many are incremental and research could do with fewer publications. Yet the very nature of research is to explore new ideas, not all of which are to stay.

Identifying and consolidating major results for reuse

Computational research will best create scientific progress by identifying and consolidating the major results. It is a difficult but important task. These studies should be made reusable. Limited resources imply that the remainder will suffer from “code rot”, with results becoming harder and harder to reproduce as their software environment becomes obsolete. Libraries, curated and maintained, are the building blocks that can enable progress.

If you want to cite this essay in an academic publication, please cite the version in IEEE CIS TC Cognitive and Developmental Systems (volume 32, number 2, 2016).

Related posts:

Scikit-learn Paris sprint 2017

2017-06-23T00:00:00+02:00

Two weeks ago, we held in Paris a large international sprint on scikit-learn. It was incredibly productive and fun, as always. We are still busy merging in the work, but I think that now is a good time to try to summarize the sprint.

A massive workforce

We had a mix of core contributors and newcomers, which is a great combination, as it enables us to be productive, but also to foster the new generation of core developers. Were present:

Albert Thomas
Alexandre Abadie
Alexandre Gramfort
Andreas Mueller
Arthur Imbert
Aurélien Bellet
Bertrand Thirion
Denis Engemann
Elvis Dohmatob
Gael Varoquaux
Jan Margeta
Joan Massich
Joris Van den Bossche
Laurent Direr
Lemaitre Guillaume
Loic Esteve
Mohamed Maskani Filali
Nathalie Vauquier
Nicolas Cordier
Nicolas Goix
Olivier Grisel
Patricio Cerda
Paul Lagrée
Raghav RV
Roman Yurchak
Sebastien Treguer
Sergei Lebedev
Thierry Guillemot
Thomas Moreau
Tom Dupré la Tour
Vlad Niculae

Manoj Kumar (could not come to Paris because of visa issues)

And many more people participating remotely, and I am pretty certain that I forgot people.

Support and hosting

Hosting: As the sprint extended through a French bank holiday and the week end, we were hosted in a variety of venues:

La paillasse, a Paris bio-hacker space
Criteo, a French company doing worldwide ad-banner placement. The venue there was absolutely gorgeous, with a beautiful terrace on the roofs of Paris. And they even had a social event with free drinks one evening.

Guillaume Lemaître did most of the organization, and at Criteo Ibrahim Abubakari was our host. We were treated like kings during the whole stay; each host welcoming us as well as they could.

Financial support by France is IA: Beyond our hosts, we need to thank France is IA who funded the sprint, covering some of the lunches, accommodation, and travel expenses to bring in our contributors from abroad (3000 euros for travel and accommodation, and 1000 euros for food and a venue during the week end).

Some achievements during the sprint

It would be hard to list everything that we did during the sprint (have a look at the development changelog if you’re curious). Here are some:

Quantile transformer, to transform the data distribution into uniform, or Gaussian distributions (PR, example):

Before

After
Memory saving by avoiding casts to float64 if X is given as float32: we are slowly making sure that, as much as possible, all models avoid using internal representations of dtype float64 when the data is given as float32. This significantly reduces memory usage and can give speed-ups of up to a factor of two.
API test on instances rather than class. This is to facilitate testing packages in scikit-learn-contrib.
Many small API fixes to ensure better consistency of models, as well as cleaning the codebase, making sure that examples display well under matplotlib 2.x.
Many bug fixes, including fixing corner cases in our average precision, which was dear to me (PR).
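
To give an idea of what the quantile transformer in the list above does, here is a minimal sketch on made-up exponential data, assuming the QuantileTransformer class as released in scikit-learn (the n_quantiles value is an arbitrary choice for this toy example):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.exponential(size=(1000, 1))  # heavily skewed toy data

# Map the empirical distribution onto a uniform distribution on [0, 1]
to_uniform = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
X_uniform = to_uniform.fit_transform(X)

# Same idea, but mapping onto a standard Gaussian distribution
to_normal = QuantileTransformer(n_quantiles=100, output_distribution="normal")
X_normal = to_normal.fit_transform(X)
```

Because the mapping only depends on the ranks of the samples, it is robust to outliers, at the cost of distorting distances between points.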

Work soon to be merged

ColumnTransformer (PR): from pandas dataframe to feature matrix, by applying different transformers to different columns.
Fixing t-SNE (PR): our t-SNE implementation was extremely memory-inefficient, and on top of this had minor bugs. We are fixing it.
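
As an illustration of the ColumnTransformer idea, here is a minimal sketch, assuming the API as it was eventually released in sklearn.compose; the dataframe and its column names are made up for the example:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy dataframe with one numerical and one categorical column
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "city": ["Paris", "Austin", "Paris"],
})

ct = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),  # scale the numerical column
        ("cat", OneHotEncoder(), ["city"]),  # one-hot encode the categorical one
    ],
    sparse_threshold=0,  # always return a dense array in this small example
)

# Feature matrix: 1 scaled column followed by 2 one-hot columns
X = np.asarray(ct.fit_transform(df))
```

The resulting object behaves like any other transformer, so it can be the first step of a pipeline that ends with a classifier or a regressor.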

There is a lot more pending work that the sprint helped move forward. You can also glance at the monthly activity report on github.

Joblib progress

Joblib, the parallel-computing engine used by scikit-learn, is getting extended to work in distributed settings, for instance using dask distributed as a backend. At the sprint, we made progress running a grid-search on Criteo’s Hadoop cluster.
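
The point of pluggable backends is that the code submitting the jobs does not change when the execution engine does. Below is a minimal sketch using the local "threading" backend so that it runs anywhere; score_candidate is a hypothetical stand-in for fitting and scoring one grid-search candidate. Swapping in a distributed backend such as dask is an assumption here: it requires dask.distributed to be installed and a cluster to connect to.

```python
from joblib import Parallel, delayed, parallel_backend


def score_candidate(alpha):
    """Hypothetical stand-in for fitting and scoring one grid-search candidate."""
    return (alpha - 0.3) ** 2


alphas = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]

# The backend is chosen by the context manager, not by the code submitting jobs
with parallel_backend("threading", n_jobs=2):
    scores = Parallel()(delayed(score_candidate)(a) for a in alphas)

# Results come back in submission order, so they line up with alphas
best_alpha = alphas[scores.index(min(scores))]
```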

Data science instrumenting social media for advertising is responsible for today's politics

2016-11-11T00:00:00+01:00

To my friends developing data science for the social media, marketing, and advertising industries,

It is time to accept that we have our share of responsibility in the outcome of the US elections and the vote on Brexit. We are not creating the society that we would like. Facebook, Twitter, targeted advertising, customer profiling, are harmful to truth and have helped Brexiting and electing Trump. Journalism has been replaced by social media and commercial content tailored to influence the reader: your own personal distorted reality.

There are many deep reasons why Trump won the election. Here, as a data scientist, I want to talk about the factors created by data science.

Rumor replaces truth: the way we, data-miners, aggregate and recommend content is based on its popularity, on readership statistics. In no way is it based on the truthfulness of the content. As a result, Facebook, Twitter, Medium, and the like amplify rumors and sensational news, with no reality check [1].

This is nothing new: clickbait and tabloids build upon it. However, social networking and active recommendation make things significantly worse. Indeed, birds of a feather flock together, reinforcing their own biases. We receive filtered information: have you noticed that every single argument you heard was overwhelmingly against (or in favor of) Brexit? To make matters even worse, our brain loves it: to resolve cognitive dissonance we avoid information that contradicts our biases [2].

Note

We all believe more information when it confirms our biases

Gossiping, rumors, and propaganda have always made sane decisions difficult. The filter bubble, the algorithmically-tuned rose-colored glasses of Facebook, escalates this problem into a major dysfunction of our society. It amplifies messy and false information better than anything before. Soviet-style propaganda builds on carefully-crafted lies; post-truth politics builds on a flood of information that does not even pretend to be credible in the long run.

Active distortion of reality: amplifying biases to the point that they drown truth is bad. Social networks actually do worse: they give tools for active manipulation of our perception of the world. Indeed, the revenue of today’s Internet information engines comes from advertising. For this purpose they are designed to learn as much as possible about the reader. Then they sell this information bundled with a slot where the buyer can insert the optimal message to influence the reader.

The Trump campaign used targeted Facebook ads presenting to unenthusiastic democrats information about Clinton tuned to discourage them from voting. For instance, portraying her as racist to black voters.

Information manipulation works. The Trump campaign has been a smear campaign aimed at suppressing votes for his opponent. Release of negative information on Clinton did affect the allegiance of her supporters.

Tech created the perfect mind-control tool, with an eye on sales revenue. Someone used it for politics.

The tech industry is mostly socially-liberal and highly educated, wishing the best for society. But it must accept its share of the blame. My friends improving machine-learning for customer profiling and ad placement, you help shape a world of lies and deception. I will not blame you for accepting this money: if it were not for you, others would do it. But we should all be thinking about how we can improve this system. How do we use data science to build a world based on objectivity, transparency, and truth, rather than Internet-based marketing?

References analysing the erosion of truth

Must-read article in the economist on lies in politics
Wikipedia page on Post-truth politics
Donald Trump won because of Facebook
The real story behind todays referendum: Neil Lawrence’s analysis of the filter-bubble effect in Brexit
A 2013 academic study showing that twitter increases partisan polarization

Digression: other social issues of data science

The tech industry is increasing inequalities, making the rich richer and leaving the poor behind. Data-science, with its ability to automate actions and wield large sources of information, is a major contributor to these sources of inequalities.
Internet-based marketing is building a huge spying machine that infers as much as possible about the user. The Trump campaign was able to target a specific population, black voters leaning towards democrats. What if this data was used for direct executive action? This could come quicker than we think, given how intelligence agencies tap into social media.

I preferred to focus this post on how data-science can help distort truth. Indeed, it is a problem too often ignored by data scientists who like to think that they are empowering users.

In memory of Aaron Swartz, who fought centralized power on the Internet.

[1]	Facebook was until recently using human curators, but fired them, leading to a loss of control on veracity

[2]	It is a well-known and well-studied cognitive bias that individuals strive to reduce cognitive dissonance and actively avoid situations and information likely to increase it

Better Python compressed persistence in joblib

2016-05-20T00:00:00+02:00

Problem setting: persistence for big data

Joblib is a powerful Python package for management of computation: parallel computing, caching, and primitives for out-of-core computing. It is handy when working on so-called big data, which can consume more than the available RAM (several GB nowadays). In such situations, objects in the working space must be persisted to disk, for out-of-core computing, distribution of jobs, or caching.

An efficient strategy to write code dealing with big data is to rely on numpy arrays to hold large chunks of structured data. The code then handles objects or arbitrary containers (list, dict) with numpy arrays. For data management, joblib provides transparent disk persistence that is very efficient with such objects. The internal mechanism relies on specializing pickle to handle numpy arrays better.

Recent improvements vastly reduce the memory overhead of data persistence.

Limitations of the old implementation

❶ Dumping/loading persisted data with compression was a memory hog, because of internal copies of data, limiting the maximum size of usable data with compressed persistence:

We see the increased memory usage during the calls to dump and load functions, profiled using the memory_profiler package with this gist

❷ Another drawback was that large numpy arrays (>10MB) contained in an arbitrary Python object were dumped in separate .npy files, increasing the load on the file system [1]:

>>> import numpy as np
>>> import joblib # joblib version: 0.9.4
>>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))]

# 3 files are generated:
>>> joblib.dump(obj, '/tmp/test.pkl', compress=True)
['/tmp/test.pkl', '/tmp/test.pkl_01.npy.z', '/tmp/test.pkl_02.npy.z']
>>> joblib.load('/tmp/test.pkl')
[array([[ 1.,  1., ...,  1.,  1.]],
 array([[ 0.47006195,  0.5436392 , ...,  0.1218267 ,  0.48592789]])]

What’s new: compression, low memory…

❶ Memory usage is now stable:

❷ All numpy arrays are persisted in a single file:

>>> import numpy as np
>>> import joblib # joblib version: 0.10.0 (dev)
>>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))]

# only 1 file is generated:
>>> joblib.dump(obj, '/tmp/test.pkl', compress=True)
['/tmp/test.pkl']
>>> joblib.load('/tmp/test.pkl')
[array([[ 1.,  1., ...,  1.,  1.]],
 array([[ 0.47006195,  0.5436392 , ...,  0.1218267 ,  0.48592789]])]

❸ Persistence in a file handle (ongoing work in a pull request)

❹ More compression formats are available

Backward compatibility

Existing joblib users can be reassured: the new version is still compatible with pickles generated by older versions (>= 0.8.4). You are encouraged to update (rebuild?) your cache if you want to take advantage of this new version.

Benchmarks: speed and memory consumption

Joblib strives to have minimum dependencies (only numpy) and to be agnostic to the input data. Hence the goals are to deal with any kind of data while trying to be as efficient as possible with numpy arrays.

To illustrate the benefits and cost of the new persistence implementation, let’s now compare a real life use case (LFW dataset from scikit-learn) with different libraries:

Joblib, with 2 different versions, 0.9.4 and master (dev),
Pickle
Numpy

The first four lines use non-compressed persistence strategies; the last four use persistence with zlib/gzip [2] strategies. Code to reproduce the benchmarks is available on this gist.

⚫ Speed: the results between joblib 0.9.4 and 0.10.0 (dev) are similar whereas numpy and pickle are clearly slower than joblib in both compressed and non compressed cases.

⚫ Memory consumption: Without compression, old and new joblib versions are the same; with compression, the new joblib version is much better than the old one. Joblib clearly outperforms pickle and numpy in terms of memory consumption. This can be explained by the fact that numpy relies on pickle if the object is not a pure numpy array (a list or a dict with arrays for example), so in this case it inherits the memory drawbacks from pickle. When persisting pure numpy arrays (not tested here), numpy uses its internal save/load functions which are efficient in terms of speed and memory consumption.

⚫ Disk used: results are as expected: non compressed files have the same size as the in-memory data; compressed files are smaller.

Caveat Emptor: performance is data-dependent

Different data compress more or less easily. Speed and disk used will vary depending on the data. Key considerations are:

Fraction of data in arrays: joblib is efficient if much of the data is contained in numpy arrays. The worst case scenario is something like a large dictionary of random numbers as keys and values.
Entropy of the data: an array full of zeros will compress well and fast. A fully random array will compress slowly, and use a lot of disk. Real data is often somewhere in the middle.
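To make the entropy point concrete, here is a minimal sketch that uses only the standard library's zlib module on raw byte buffers. The buffers stand in for array contents, and the code does not call joblib. The buffer size and the compression level are arbitrary choices made for illustration.

```python
import os
import zlib

N_BYTES = 1_000_000  # arbitrary buffer size, for illustration only

# Low entropy: a buffer of zeros, like the memory of an array full of zeros.
zeros = bytes(N_BYTES)
# High entropy: random bytes, standing in for a fully random array.
noise = os.urandom(N_BYTES)

# Compression level 3 is an assumption made for this sketch.
zeros_compressed = zlib.compress(zeros, 3)
noise_compressed = zlib.compress(noise, 3)

print("zeros:", len(zeros_compressed), "bytes after compression")
print("noise:", len(noise_compressed), "bytes after compression")
```

The zero buffer shrinks to a tiny fraction of its size, while the random buffer barely shrinks at all; real data usually lands between these two extremes.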

Extra improvements in compressed persistence

New compression formats

Joblib can use new compression formats based on Python standard library modules: zlib, gzip, bz2, lzma and xz (the last 2 are available for Python 3.3 and later). The compressor is selected automatically when the file name has an explicit extension:

>>> joblib.dump(obj, '/tmp/test.pkl.z')   # zlib
['/tmp/test.pkl.z']
>>> joblib.dump(obj, '/tmp/test.pkl.gz')  # gzip
['/tmp/test.pkl.gz']
>>> joblib.dump(obj, '/tmp/test.pkl.bz2')  # bz2
['/tmp/test.pkl.bz2']
>>> joblib.dump(obj, '/tmp/test.pkl.lzma')  # lzma
['/tmp/test.pkl.lzma']
>>> joblib.dump(obj, '/tmp/test.pkl.xz')  # xz
['/tmp/test.pkl.xz']

One can tune the compression level, setting the compressor explicitly:

>>> joblib.dump(obj, '/tmp/test.pkl.compressed', compress=('zlib', 6))
['/tmp/test.pkl.compressed']
>>> joblib.dump(obj, '/tmp/test.compressed', compress=('lzma', 6))
['/tmp/test.compressed']

On loading, joblib uses the magic number of the file to determine the right decompression method. This makes loading compressed pickle transparent:

>>> joblib.load('/tmp/test.compressed')
[array([[ 1.,  1., ...,  1.,  1.]],
 array([[ 0.47006195,  0.5436392 , ...,  0.1218267 ,  0.48592789]])]
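As an illustration of magic-number detection, the sketch below guesses the compression format from the first bytes of a stream. The prefix table and the `guess_compressor` helper are hypothetical names written for this example, not joblib's internal code, and the zlib check only covers the most common header byte.

```python
import bz2
import gzip
import lzma
import pickle
import zlib

# Leading bytes ("magic numbers") of the standard formats.
# Illustrative table, assumed for this sketch; not joblib's internal one.
_MAGIC_PREFIXES = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x5d\x00", "lzma"),
    (b"\x78", "zlib"),
]


def guess_compressor(prefix):
    """Return the name of the format whose magic number starts `prefix`."""
    for magic, name in _MAGIC_PREFIXES:
        if prefix.startswith(magic):
            return name
    return "not compressed"


payload = pickle.dumps(list(range(1000)))

print(guess_compressor(zlib.compress(payload)))
print(guess_compressor(gzip.compress(payload)))
print(guess_compressor(bz2.compress(payload)))
print(guess_compressor(lzma.compress(payload, format=lzma.FORMAT_XZ)))
print(guess_compressor(lzma.compress(payload, format=lzma.FORMAT_ALONE)))
print(guess_compressor(payload))
```

Because the format is read from the content rather than from the file name, a file renamed without its extension can still be routed to the right decompressor.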

Importantly, the generated compressed files use a standard compression file format: for instance, regular command line tools (zip/unzip, gzip/gunzip, bzip2, lzma, xz) can be used to compress/uncompress a pickled file generated with joblib. Joblib will be able to load cache compressed with those tools.

Toward more and faster compression

Specific compression strategies have been developed for fast compression, sometimes even faster than disk reads, such as snappy, blosc, LZO or LZ4. With a file-like interface, they should be readily usable with joblib.

In the benchmarks above, loading and dumping with compression is slower than without (though only by a factor of 3 for loading). These were done on a computer with an SSD, hence with very fast I/O. In a situation with slower I/O, as on a network drive, compression could save time. With faster compressors, compression will save time on most hardware.

Compressed persistence into a file handle

Now that everything is stored in a single file using standard compression formats, joblib can persist in an open file handle:

>>> with open('/tmp/test.pkl', 'wb') as f:
...     joblib.dump(obj, f)
['/tmp/test.pkl']
>>> with open('/tmp/test.pkl', 'rb') as f:
...     print(joblib.load(f))
[array([[ 1.,  1., ...,  1.,  1.]],
 array([[ 0.47006195,  0.5436392 , ...,  0.1218267 ,  0.48592789]])]

This also works with the compression file objects available in the standard library, like gzip.GzipFile, bz2.BZ2File or lzma.LZMAFile:

>>> import gzip
>>> with gzip.GzipFile('/tmp/test.pkl.gz', 'wb') as f:
...     joblib.dump(obj, f)
['/tmp/test.pkl.gz']
>>> with gzip.GzipFile('/tmp/test.pkl.gz', 'rb') as f:
...     print(joblib.load(f))
[array([[ 1.,  1., ...,  1.,  1.]],
 array([[ 0.47006195,  0.5436392 , ...,  0.1218267 ,  0.48592789]])]

Be sure that you use a decompressor matching the internal compression when loading with the above method. If unsure, simply use open: joblib will select the right decompressor:

>>> with open('/tmp/test.pkl.gz', 'rb') as f:
...     print(joblib.load(f))
[array([[ 1.,  1., ...,  1.,  1.]],
 array([[ 0.47006195,  0.5436392 , ...,  0.1218267 ,  0.48592789]])]

Towards dumping to elaborate stores

Working with file handles opens the door to storing cache data in database blobs or in cloud storage such as Amazon S3, Amazon Glacier and Google Cloud Storage (for instance via the Python package boto).

Implementation

A Pickle Subclass: joblib relies on subclassing the Python Pickler/Unpickler [3]. These are state machines that walk the graph of nested objects (a dict may contain a list, that may contain…), creating a string representation of each object encountered. The new implementation proceeds as follows:

Pickling an arbitrary object: when an np.ndarray object is reached, instead of using the default pickling functions (__reduce__()), the joblib Pickler replaces the ndarray in the pickle stream with a wrapper object containing all important array metadata (shape, dtype, flags). Then it writes the array content in the pickle file. Note that this step breaks the pickle compatibility. One benefit is that it enables using fast code for copyless handling of the numpy array. For compression, we pass chunks of the data to a compressor object (using the buffer protocol to avoid copies).
Unpickling from a file: when pickle reaches the array wrapper, as the object is in the pickle stream, the file handle is at the beginning of the array content. So at this point the Unpickler simply constructs an array based on the metadata contained in the wrapper and then fills the array buffer directly from the file. The object returned is the reconstructed array, the array wrapper being dropped. A benefit is that if the data is stored not compressed, the array can be directly memory mapped from the storage (the mmap_mode option of joblib.load).

This technique allows joblib to pickle all objects in a single file but also to have memory-efficient dump and load.
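The layout described above (a small pickled wrapper followed by the raw array content) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not joblib's implementation: `array.array` stands in for a numpy array, a plain dict stands in for the wrapper object, and `dump_array` and `load_array` are hypothetical helper names.

```python
import array
import io
import pickle


def dump_array(arr, fileobj):
    """Write a small pickled header, then the raw buffer right after it."""
    header = {"typecode": arr.typecode, "nbytes": len(arr) * arr.itemsize}
    pickle.dump(header, fileobj)
    # Buffer protocol: hand the array memory to write() as a byte view,
    # without first building an intermediate bytes object.
    fileobj.write(memoryview(arr).cast("B"))


def load_array(fileobj):
    """Read the header, then fill a preallocated array from the file."""
    header = pickle.load(fileobj)
    # Once the header is unpickled, the file position sits at the start
    # of the raw content, so it can be read straight into the array buffer.
    arr = array.array(header["typecode"], bytes(header["nbytes"]))
    fileobj.readinto(arr)
    return arr


original = array.array("d", [1.0, 2.5, -3.0])
buffer = io.BytesIO()
dump_array(original, buffer)
buffer.seek(0)
restored = load_array(buffer)
print(restored == original)
```

Keeping the metadata in the pickle stream and the content outside of it is what lets the reader fill the buffer directly, instead of rebuilding the array from a pickled copy.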

A fast compression stream: as the pickling refactoring opens the door to file object usage, joblib is now able to persist data in any kind of file object: open, gzip.GzipFile, bz2.BZ2File and lzma.LZMAFile. For performance and usability reasons, the new joblib version uses its own file object BinaryZlibFile for zlib compression. Compared to GzipFile, it disables crc computation, which brings a performance gain of 15%.

Speed penalties of on-the-fly writes

There’s also a small speed difference with dict/list objects between new/old joblib when using compression. The old version pickles the data inside an io.BytesIO buffer and then compresses it in one go, whereas the new version writes compressed chunks of pickled data to the file “on the fly”. Because of this internal buffer, the old implementation is not memory safe, as it copies the data in memory before compressing. The small speed difference was judged acceptable compared to this memory duplication.
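The difference between the two strategies can be sketched with zlib from the standard library. The function names, the chunk size and the compression level are assumptions made for this example; joblib's own code is more involved.

```python
import io
import zlib


def compress_whole(data, level=3):
    """Buffer-then-compress: the full payload is held in memory first."""
    return zlib.compress(data, level)


def compress_streaming(chunks, fileobj, level=3):
    """On-the-fly: compress each chunk as it arrives and write it out."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        fileobj.write(compressor.compress(chunk))
    # Flush whatever the compressor still holds internally.
    fileobj.write(compressor.flush())


def iter_chunks(data, chunk_size=64 * 1024):
    """Yield successive views on `data`, without copying it."""
    view = memoryview(data)
    for start in range(0, len(view), chunk_size):
        yield view[start:start + chunk_size]


data = bytes(range(256)) * 4096  # about 1 MB standing in for pickled data

whole = compress_whole(data)
streamed = io.BytesIO()
compress_streaming(iter_chunks(data), streamed)

# Both outputs decompress to the same payload.
print(zlib.decompress(whole) == zlib.decompress(streamed.getvalue()) == data)
```

The streaming variant only ever holds one chunk plus the compressor state, which is why it avoids the memory duplication at the price of more calls into the compressor.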

Conclusion and future work

Memory copies were a limitation when caching very large numpy arrays on disk, e.g. arrays with a size close to the available RAM on the computer. The problem was solved via intensive buffering and a lot of hacking on top of pickle and numpy. Unfortunately, our strategy has poor performance with big dictionaries or lists compared to cPickle, hence try to use numpy arrays in your internal data structures (note that something like scipy sparse matrices works well, as it builds on arrays).

In the future, numpy’s pickle methods could perhaps be improved to make better use of the 64-bit opcodes for large objects that were recently introduced in Python.

Pickling using file handles is a first step toward pickling in sockets, enabling broadcasting of data between computing units on a network. This will be priceless with joblib’s new distributed backends.
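As a rough sketch of what pickling into a socket could look like, the example below uses only the standard library: plain pickle stands in for joblib, and a local socket pair stands in for two computing units. The payload is deliberately small so that a single thread can write it and then read it back.

```python
import pickle
import socket

payload = {"weights": [0.1, 0.2, 0.3], "label": "demo"}

# A connected pair of sockets, standing in for two machines on a network.
sender, receiver = socket.socketpair()
with sender, receiver:
    # makefile() wraps the socket in a file-like object, which is all
    # that a pickler needs in order to write to it.
    with sender.makefile("wb") as out_stream:
        pickle.dump(payload, out_stream)
    with receiver.makefile("rb") as in_stream:
        received = pickle.load(in_stream)

print(received == payload)
```

For large objects, the sending and receiving sides would have to run concurrently, since the socket buffer cannot hold the whole payload.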

Other improvements will come from better compressors, making everything faster.

Note

The pull request was implemented by @aabadie. He thanks @lesteve, @ogrisel and @GaelVaroquaux for the valuable help, reviews and support.

[1]	The load created by multiple files on the filesystem is particularly detrimental for network filesystems, as it triggers multiple requests and isn’t cache friendly.

[2]	gzip is based on zlib with additional crc checks and a default compression level of 3.

[3]	A drawback of subclassing the Python Pickler/Unpickler is that it is done for the pure-Python version, and not the “cPickle” version. The latter is much faster when dealing with a large number of Python objects. Once again, joblib is efficient when most of the data is represented as numpy arrays or subclasses.

Of software and Science. Reproducible science: what, why, and how

2015-12-16T00:00:00+01:00

At MLOSS 15 we brainstormed on reproducible science, discussing why we care about software in computer science. Here is a summary blending notes from the discussions with my opinion.

“Without engineering, science is not more than philosophy” — the community

How do we enable better Science? Why do we do software in science? These are the questions that we were interested in.

Improving the reproducibility of our scientific studies makes us more efficient in the long run at doing good science: even inside a lab, new research efforts build upon the previous work.

Forms of reproducible science: reproduction, replication, & reuse

The classic concepts of reproducible science are:

Reproducibility: being able to rerun an experiment as it was run, for instance by reanalysing data.
Replicability: being able to redo an experiment from scratch.

The reproducible science movement argues that sharing the source code of experiments is necessary for reproduction.

For reproduction, fields like computer science (development of methods) and biology (challenging data acquisition) have very different constraints, with the complexity allocated differently between data and code.

“Machine learning people use hugely complex algorithms on trivially simple datasets. Biology does trivially simple algorithms on hugely complex datasets.” — an MLOSS15 attendee

We felt that computer science needed an additional notion, complementing replication and reproduction:

Reusability: applying the process to a new yet similar question. For instance, for a paper contributing a data analysis method, applying it to new data.

Reusability is more valuable than reproducibility.

Reproducibility without reusability in method development may hinder the advancement of science as it pushes people to do all the same things, eg always running experiments on the same data.

Reusability enables results that the original investigator did not have in mind. It implies that the experimental protocol extends further than the exact scope of the question initially asked. For software development, it is also harder, as it implies more robustness and flexibility.

Finally, sharing source code is not enough: readability of the code is necessary.

Roadblocks to reproducible science

Man power

Reusability, readability, support of released code, all actually take a lot of time, even though it is seldom acknowledged in talks about reproducible science. Given a fixed man power, it is impossible to achieve reusability and high quality for everything.

Computing power

Some numerical experiments or complex data analyses require weeks of cluster time to run. These will be much harder to reproduce. Also, rerunning an analysis from scratch on a regular basis is a good recipe to achieve a robust path from data to results. The more computing power is a limiting resource, the more likely it is that a glitch is not detected.

Data availability

No access, or restricted access, to data is a show stopper for reproducibility. Data sharing requirements are becoming common, from funding agencies or journals. However, privacy concerns or confidential information get in the way of making data public, for instance in medical research or microeconomics. Often, these concerns serve as a pretext for people who actually do not want to relinquish control [1].

[1]	A related post by Deevy Bishop: Who’s afraid of open data

Incentives problem

Fancy new results are what matters for success in academia. “High impact” journals such as Nature or Science accept papers that amaze and impress, often with subpar inspection of the materials and methods [2]. The rate of publication in many leading groups is incompatible with consolidation efforts required for strong reproducibility.

On the other hand, it is hard to tell beforehand if a new idea is a good one. Hence giving imagination free rein to foster impossible and improbable ideas is a good path to innovation. The underlying questions are: What are the best community rules for the advancement of knowledge? What do we want from the way science moves forward? Rapid publication of many incremental ideas, eg at a conference, gives food for thought, possibly at the expense of reproducibility.

[2]	“Science, Nature and Cell, had a higher rate of retractions” – Wikipedia: Invalid science

How to improve the situation

Docker, containers, and virtual machines

Docker, or other virtual machine technologies, enable shipping a software environment. This diminishes the challenges of building software and setting up an analysis. Virtual machines are used as a way to avoid software packaging issues. This seems to me like a plaster on a wooden leg.

Containers give easy reproduction, at the cost of hard replication and reuse.

Indeed, an analysis that lives in a box can be reproduced, but can it be understood, modified, or applied to new data? New science is likely going to come from modifying this analysis, or combining it with other tools, or new data. If these other tools live in a different virtual machine, the combination will be challenging.

In addition, people are using containers as an excuse to avoid tackling the need for proper documentation of requirements, and the process to set them up. They sometimes even try to justify binary blobs [3]. This is wrong. An analysis should be runnable without requiring the stars to align, and it should be understandable.

[3]	See also Titus Brown’s post: The post-apocalyptic world of binary containers

Version control: wear your seatbelt

Version control is like a time machine: if used with regular commits, it enables rolling back to any point in time. For my work, it has always been crucial to reproducing what my students or I did a while ago. I often meet researchers who feel they lack time to learn it. I really cannot support this position. http://try.github.io is an easy way to learn version control.

Hint: use a “tag” to pin-point a position in the history that you might want to repeat, such as making a figure or the publication of an article.

Software libraries, curated and maintained

Consolidating an analysis pipeline, a standard visualization, or any computational aspect of a paper into a software library is a sure way to make the paper more reproducible. It will also make the steps reusable, and a replication easier. If continued effort is put in the library, chances are that computational efficiency will improve over time, thus helping in the long run with the challenge of computing power.

Tough choices: not every variant of an analysis can be forever reproducible.

Maintaining the library will ensure that results are still reproducible on new hardware, or with evolution of the general software stack (a new Python or Matlab release, for instance). Documentation and curated examples will lower the bar to reuse and facilitate replication of the original scientific results.

To avoid feature creep and technical debt, a library calls for focused efforts on selecting the most important operations.

Datasets, serving as model experiments, tractable and open

Sometimes researchers create a toy dataset, with a well-posed question, that is curated and open, small enough to be tractable yet large enough to be relevant to the application field. This is an invaluable service to the field. One example is the Netflix prize in machine learning, which led to a standard dataset. Unfortunately, the dataset was taken down some years later due to copyright concerns. But it has been replaced, eg by the movielens dataset. For computer vision, a series of datasets (Caltech101, CIFAR, ImageNet…) has led to continuous progress of the field. In bioinformatics, standard data are regularly created, for instance by the DREAM challenges.

These reference open datasets serve as benchmarks and therefore foster competition. They also define a canonical experiment, helping a wider scientific community understand the questions that they ask. Ultimately, they result in better software tools to solve the problem at hand, as this problem becomes a standard example and application of tools.

Sage Bionetworks, for instance, is a non-profit that collects and makes biomedical data available. These people believe, as I do, that such data will lead to better medical care.

Changing incentives: setting the right goals

Making sustainable, quality scientific work that facilitates reproduction needs to be a clearly-visible benefit to researchers, young and senior. Such contributions should help them get jobs and grants.

An unsophisticated publication count is the basis of scientific evaluation. We need to accept publications about data, software, and replication of prior work in high-quality journals. They need to be strictly reviewed, to establish high standards on these contributions. This change is happening. Gigascience, amongst other venues, publishes data. The MLOSS (machine learning open source software) track of the JMLR (journal of machine learning research) publishes software, with a tough review on the software quality of the project.

Researchers should cite the software they use.

Yet software is still often under-cited: many will use software implementing a method, and only cite the original paper that proposed the method. Another remaining challenge is how to give credit for continuing development and maintenance.

Fast-paced science is probably useful even if fragile. But the difference between a quick proof of concept and solid, reproducible and reusable work needs to be acknowledged. It is important to select for publication not only impressive results, but also sound reusable material and methods. The latter are the foundation of future scientific developments, but high-impact journals tend to focus on the former.

Nilearn 0.2: more powerful machine learning for neuroimaging

2015-12-13T00:00:00+01:00

After 6 months of effort, we just released version 0.2 of nilearn, dedicated to making machine learning in neuroimaging easier and more powerful.

This release integrates the features of the July sprint, and more.

Highlights

Better documentation with narrative examples

The examples can now be broken down into blocks (as here) for a better narration (thanks to sphinx-gallery).

Space net: spatial regularizations in decoding

The “SpaceNet” decoder does spatial regularizations such as TV-l1 or Graph-Net to identify predictive regions in decoding.

Dictionary learning for resting-state parcellations

Dictionary learning is a promising alternative to ICA to learn networks.

Plotting sets of probabilistic maps

With a simple function, you can plot outlines for multiple maps.

Separating regions out of maps

We have a set of functions to separate regions on maps or turn networks into a probabilistic parcellation.

Classification on connectomes

We now have advanced connectivity measures to do comparisons across connectomes for classification.

Thanks

Thanks to Alexandre Abraham who led the effort, and all the contributors.

MLOSS 2015: wising up to building open-source machine learning

2015-11-28T00:00:00+01:00

Note

The 2015 edition of the machine learning open source software (MLOSS) workshop was full of very mature discussions that I strive to report here.

I give links to the videos. Some machine-learning researchers have great thoughts about growing communities of coders, about code as a process and a deliverable.

I was a co-organizer of the MLOSS 2015 workshop, held during ICML 2015. As I have finally figured out where the videos are, now is a good time to summarize my impressions on the workshop.

Online videos of the talks

The videos of all the talks are online:

Python and Parallelism or Dask by Matthew Rocklin
Collaborative filtering via matrix decomposition in mlpack by Ryan Curtin
BLOG: a probabilistic programming language for open-universe contingent Bayesian networks by Yi Wu
Spotlights:
- Nilearn, machine learning for neuroimaging in Python (Alexandre Abraham)
- KeLP: a Kernel-based Learning Platform in Java (Simone Filice)
- DiffSharp: Automatic Differentiation Library (Atılım Güneş Baydin)
- The FAST toolkit for Unsupervised Learning of HMMs (José P. González-Brenes)
- OpenML: a Networked Science Platform for Machine Learning (Joaquin Vanschoren)
Julia’s Approach to Open Source Machine Learning by John Myles White
Do it yourself deep learning with the Caffe community by Evan Shelhamer
From flop to success in academic software development by Gaël Varoquaux

MLOSS: a maturing community

When Antti Honkela and Cheng Soon Ong approached me to co-organize an MLOSS workshop, I felt that it was important to do it for the sake of open source scientific software. But I didn’t feel very enthusiastic about the event or the talks themselves. Boy, was I wrong.

Huge attendance: open-source ML software is now mainstream.

My first MLOSS workshop was at the ICML 2011 conference, in Haifa. The workshop was in a tiny cramped room, with a couple of dozen geeks, and it felt like a clique of people on the side of the conference. This year, we had a huge room and more than 200 people showed up.

I am used to talks being about a grad student or young researcher who has put the code of a paper on the web, with an open license but no vision. This year, people were presenting actual projects, with long-term goals and the desire to solve a problem larger than their latest research. It might explain why the attendance was huge: people came because talks might genuinely help them.

With Cheng and Antti, we had chosen “open ecosystems” as a theme, because ecosystems are the key to scaling computing and science. Between us, imposing a theme on a workshop is challenging, as people submit abstracts, good or bad, and one has to make do with what one has. However, a lot of talks mentioned how the projects slot into a wider picture, and interact with a community. For instance, Evan attributes part of the success of Caffe to the “Model Zoo” in which the community contributes fitted models. At the other end of the spectrum, OpenML is a full online project with the goal to foster collaboration and comparison. Project developers have shown in their talks that they are very conscious of other projects that might be used together with theirs.

Accepting the sustainability challenges

Over time, I have gradually realized the importance of community building, ie project management and goal setting, more than technical virtuosity. Historically, the scientific culture of code has put the emphasis on the genius ideas behind the code, and the craftsmanship of the implementation, at the cost of sustainability.

Alone, I go fast. Together, we go far.

I was surprised to see that the MLOSS community was growing very aware of mechanisms of long-term project life, in particular the human factors.

I was asked by my co-organizers to give a talk on factors of success of open source scientific software. I touched upon software engineering, project vision, licensing, governance, community building. All these topics are deemed “non scientific” and thus so often despised and left out. I was astonished to find out that the talks before me were giving very good advice on these. I found that I only had to summarize and comment on what had been said before. This evolution of the scientific community makes me very hopeful for the future.

Every line of code you write is debt. You should be ashamed of every line of code you have written. […]

You have a supply of labor. These are the people who are contributors […]. The people who are users and not contributors are actually a source of demand […] they mostly consume sources of labor rather than produce it. — John Myles White

Thanks to our sponsors

Facebook and Continuum sponsored the trip for our keynote speakers. Thank you very much, the keynotes were great!

The Paris-Saclay Center for Data Science (CDS) gave us our main operating fund, which is critical for organizing an event. In general, I must say that the CDS has been hugely supportive of open source data science in the Paris area, having a significant impact on training as well as development.

And also, I must acknowledge support from Inria for the accounting and administration of the event.

Finally, our reviewers were amazing. Most of them reviewed the project, ie its code, its documentation, its support. They rose above the typical petty fights that we see in academia and focused on what the project was bringing to the scientific community. Often their reviews were longer and more informative than the abstract submitted.

Nilearn sprint: hacking neuroimaging machine learning

2015-08-04T00:00:00+02:00

A couple of weeks ago, we had in Paris the second international nilearn sprint, dedicated to making machine learning in neuroimaging easier and more powerful.

It was such a fantastic experience, as nilearn is really shaping up as a simple yet powerful tool, and there is a lot of enthusiasm. For me, this sprint is a turning point, as I could see people other than the original core team (that spun out of our research team) excited about the project’s future. Thank you to all who came:

Ahmed Kanaan
Andres Hoyos Idrobo
Alexandre Abraham
Arthur Mensch
Ben Cipolli (remote)
Bertrand Thirion
Chris Filo Gorgolewski
Danilo Bzdok
Elvis Dohmatob
Julia Hutenburg
Kamalaker Dadi
Loic Esteve
Martin Perez
Michael Hanke
Oscar Nájera, working on sphinx-gallery

The sprint was a joint sprint with the MNE-Python team, that makes MEG processing awesome. We also need to thank Alex Gramfort, who did most of the work to set up the sprint, as well as NeuroSaclay for funding, and La paillasse, Telecom, and INRIA for hosting.

Highlights of the sprints results

Plotting of multiple maps

A function to visualize overlays of various maps, eg for a probabilistic atlas, with defaults that try to adapt to the number of maps (see the example). It is very useful, for example, to easily visualize ICA components.

Sign of activation in glass brain

Our glass brain plotting was greatly improved, adding amongst other things the option to capture the sign of the activation in the color (see this example).

Spatially-regularized decoder

Decoders based on GraphNet and total variation have finally landed in nilearn. This has required a lot of work to get fast convergence and robust parameter selection. At the end of the day, it is much slower than an SVM, but the maps look splendid (see this example).

Sparse dictionary learning

We have almost merged sparse dictionary learning as an alternative to ICA. Experience shows that on resting-state data, it gives more contrasted segmentation of networks (see this example).

New installation docs

New webpage layout using tabs to display only the installation instructions relevant to the OS of the user (see here). The result is more compact and clearer instructions, which I hope will make our users’ life easier.

CircleCI integration

We now use CircleCI to run the examples and build the docs. This is challenging because our examples are real cases of neuroimaging data analysis, and thus require heavy datasets and computing horsepower.

Neurodebian packaging

There are now neurodebian packages for nilearn.

And much more!

Warning

Features listed above are not in the released version of nilearn. You need to wait a month or so.

Software for reproducible science: let’s not have a misunderstanding

2015-05-18T00:00:00+02:00

Note

tl;dr: Reproducibility is a noble cause and scientific software a promising vessel. But excess of reproducibility can be at odds with the housekeeping required for good software engineering. Code that “just works” should not be taken for granted.

This post advocates for a progressive consolidation effort of scientific code, rather than putting too high a bar on code release.

Titus Brown recently shared an interesting war story in which a reviewer refuses to review a paper until he can run the code on his own files. Titus’s comment boils down to:

“Please destroy this software after publication”.

Note

Reproducible science: Does the emperor have clothes?

In other words, code for a publication is often not reusable. This point of view is very interesting from someone like Titus, who is a vocal proponent of reproducible science. His words triggered some surprises, which led Titus to wonder if some of the reproducible science crowd folks live in a bubble. I was happy to see the discussion unroll, as I think that there is a strong risk of creating a bubble around reproducible science. Such a bubble will backfire.

Replication is a must for science and society

Science advances by accumulating knowledge built upon observations. It’s easy to forget that these observations, and the corresponding paradigmatic conclusions, are not always as simple to establish as the fact that hot air rises: replicating many times the scientific process transforms an evidence into a truth.

One striking example of scientific replication is the ongoing effort in psychology to replay the evidence behind well-accepted findings central to current lines of thought in psychological sciences. It implies setting up the experiments according to the seminal publications, acquiring the data, and processing it to arrive at the same conclusions. Surprisingly, not everything that was taken for granted holds.

Note

Findings later discredited backed economic policy

Another example, with massive consequences on Joe Average’s everyday life, is the failed replication of Reinhart and Rogoff’s “Growth in a Time of Debt” publication. The original paper, published in 2010 in the American Economic Review, claimed empirical findings linking high public debt to failure of GDP growth. In a context of economic crisis, it was used by policy makers as a justification for restricted public spending. However, while pursuing a mere homework assignment to replicate these findings, a student uncovered methodological flaws with the paper. Understanding the limitations of the original study took a while, and discredited the academic backing to the economic doctrine of austerity. Critically, the analysis of the publication was possible only because Reinhart and Rogoff released their spreadsheet, with data and analysis details.

Reproducibility is not sustainable for everything

Thinking is easy, acting is difficult — Goethe

Note

Keeping a physics apparatus running for replication years later?

I started my scientific career doing physics, and fairly “heavy” physics: vacuum systems, lasers, free-falling airplanes. In such settings, the cost of maintaining an experiment is apparent to the layman. No one is expected to keep an apparatus running for replication years later. The pinnacle of reproducible research is when the work becomes doable in a student lab. Such progress is often supported by improved technology, driven by wider applications of the findings.

However, not every experiment will give rise to a student lab. Replicating the others will not be easy. Even if the instruments are still around the lab, they will require setting up, adjusting and wiring. And chances are that connectors or cables will be missing.

Software is no different. Storing and sharing it is cheaper. But technology evolves very fast. Every setup is different. Code for a scientific paper has seldom been built for easy maintenance: lack of tests, profusion of exotic dependencies, nonexistent documentation. Robustness, portability, and isolation would be desirable, but they are difficult and costly.

Software developers know that understanding the constraints to design a good program requires writing a prototype. Code for a scientific paper is very much a prototype: it’s a first version of an idea that proves its feasibility. Common sense in software engineering says that prototypes are designed to be thrown away. Prototype code is fragile. It’s untested, and probably buggy for certain uses. Releasing prototypes amounts to distributing semi-functioning code. This is the case for most code accompanying a publication, and it is to be expected given the very nature of research: exploration and prototyping [1].

No success without quality, …

Note

Highly-reliable is more useful than state-of-the-art.

My experience with scientific code has taught me that success requires quality. Having a good implementation of simple, well-known methods seems to matter more than doing something fancy. This is what the success of scikit-learn has taught us: we are really providing classic “old” machine learning methods, but with a good API, good docs, computational performance, and stable numerics controlled by stringent tests. There exist plenty of more sophisticated machine-learning methods, including some that I have developed specifically for my data. Yet, I find myself advising my co-workers to use the methods in scikit-learn, because I know that the implementation is reliable and that they will be able to use them [2].

This quality is indeed central to doing science with code. What good is a data analysis pipeline if it crashes when I fiddle with the data? How can I draw conclusions from simulations if I cannot change their parameters? As soon as I need trust in code supporting a scientific finding, I find myself tinkering with its input, and often breaking it. Good scientific code is code that can be reused, that can lead to large-scale experiments validating its underlying assumptions.

SQLite is so widely used that its developers have been woken up at night by users.

You might say that I am putting the bar too high; that slightly buggy code is more useful than no code. But I frown at the idea of releasing code for which I am unable to do proper quality assurance. I may have done too much of that in the past. And because I am a prolific coder, many people are using code that has been through my hands. My mailbox looks like a battlefield, and when I go to the coffee machine I find myself answering questions.

… and making difficult choices

Note

Craftsmanship is about trade-offs

Achieving quality requires making choices. Not only because time is limited, but also because the difficulty of maintaining and improving a codebase increases much quicker than the number of features [3]. This phenomenon is actually frightening to watch: adding a feature in scikit-learn these days is much, much harder than it used to be in the early days. Interactions between features are a killer: when you modify something, something else unrelated breaks. For a given functionality, nothing makes the code more incomprehensible than cyclomatic complexity: the multiplicity of branching, if/then clauses, for loops. This complexity naturally appears when supporting different input types, or minor variants of the same method.
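To make the notion of cyclomatic complexity concrete, here is a minimal sketch that counts branching constructs with Python's standard `ast` module. It is only an approximation of the formal metric, and the two `scale` functions are hypothetical examples written for this illustration, not code from scikit-learn.

```python
import ast

# Node types counted as decision points in this simplified sketch
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.IfExp, ast.comprehension)


def approximate_complexity(source):
    """Return 1 + the number of branching constructs found in `source`."""
    tree = ast.parse(source)
    count = 1  # straight-line code has a single path
    for node in ast.walk(tree):
        if isinstance(node, BRANCH_NODES):
            count += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra short-circuit paths
            count += len(node.values) - 1
    return count


SIMPLE = """
def scale(x, factor):
    return x * factor
"""

# Hypothetical variant supporting several input types and methods
FLEXIBLE = """
def scale(x, factor, inplace=False, method='linear'):
    if isinstance(x, list):
        x = list(x) if not inplace else x
    elif isinstance(x, dict):
        x = dict(x)
    if method == 'linear':
        for i in range(len(x)):
            x[i] = x[i] * factor
    elif method == 'log':
        for i in range(len(x)):
            if x[i] > 0:
                x[i] = x[i] * factor
    return x
"""

print(approximate_complexity(SIMPLE))
print(approximate_complexity(FLEXIBLE))
```

Adding two optional arguments and a second input type is enough to multiply the count, even though the core operation is still a single multiplication.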

The consequence is that ensuring quality for many variants of a method is prohibitive. This limit is a real problem for reproducible science, as science builds upon comparing and opposing models. However, ignoring it simply leads to code that fails to do what it claims to do. What this is telling us is that if we are really trying to do long-term reproducibility, we need to identify successful and important research and focus our efforts on it.

If you agree with my earlier point that the code of a publication is a prototype, this iterative process seems natural. Various ideas can be thought of as competing prototypes. Some will not lead to publication at all, while others will end up having a high impact. Knowing beforehand is impossible. Focusing too early on achieving high quality is counterproductive. What matters is progressively consolidating the code.

Reproducible science, a rich trade-off space

Note

Verbatim replication or reuse?

Does Reinhart and Rogoff’s “Growth in a Time of Debt” paper face the same challenges as the manuscript under review by Titus? One is describing mechanisms while the other is introducing a method. The code of the former is probably much simpler than that of the latter. Different publications come with different goals and code that is more or less easy to share. For verbatim replication of the analysis of a paper, a simple IPython notebook without tests or API is enough. To go beyond requires applying the analysis to different problems or data: reuse. Reuse is very difficult and cannot be a requirement for all publications.

Conventional wisdom in academia is that science builds upon ideas and concepts rather than methods and code. Galileo is known for his contribution to our understanding of the cosmos. Yet, methods development underpins science. Galileo was also a pioneering builder of telescopes, which was a huge technical achievement. He needed to develop them to back his cosmological theories. Today, Galileo’s measurements are easy to reproduce because telescopes are readily available as consumer products.

Standing on the shoulders of giants — Isaac Newton, on software libraries


[1]	To make my point very clear, releasing buggy untested code is not a good thing. However, it is not possible to ask for all research papers to come with industrial-quality code. I am trying here to push for a collective, reasoned undertaking of consolidation.

[2]	Theory tells us that there is no universal machine learning algorithm. Given a specific machine-learning application, it is always possible to devise a custom strategy that outperforms a generic one. However, do we need hundreds of classifiers to solve real world classification problems? Empirical results [Delgado 2014] show that most of the benefits can be achieved with a small number of strategies. Is it desirable and sustainable to distribute and keep alive the code of every machine learning paper?

[3]	Empirical studies on the workload for programmers to achieve a given task showed that a 25 percent increase in problem complexity results in a 100 percent increase in programming complexity: An Experiment on Unit increase in Problem Complexity, Woodfield 1979.

I need to thank my colleague Chris Filo Gorgolewski and my sister Nelle Varoquaux for their feedback on this note.

MLOSS: machine learning open source software workshop @ ICML 2015

2015-04-23T00:00:00+02:00

Note

This year again we will have an exciting workshop on the leading-edge machine-learning open-source software. This subject is central to many, because software is how we propagate, reuse, and apply progress in machine learning.

Want to present a project? The deadline for the call for papers is April 28th, in a few days: http://mloss.org/workshop/icml15/

The workshop will be held at the ICML conference, in Lille, France, on July 10th. ICML (International Conference on Machine Learning) is the leading venue for academic research in machine learning. It’s a fantastic place to hold such a workshop, as the actors of theoretical progress are all around. Software is the bridge that brings this progress beyond papers.

There is a long tradition of MLOSS workshops, with one every year and a half. Last time, at NIPS 2013, I could feel a bit of a turning point, as people started feeling that different software slotted together to create an efficient and state-of-the-art working environment. For this reason, we have entitled this year’s workshop ‘open ecosystems’, stressing that contributions in the scope of the workshop, which build a thriving work environment, are not only machine learning software, but also better statistics or numerical tools.

We have two keynotes with important contributions to such ecosystems:

John Myles White (Facebook), lead developer of Julia statistics and machine learning: “Julia for machine learning: high-level syntax with compiled-code speed”
Matthew Rocklin (Continuum Analytics), developer of Python computational tools, in particular Blaze (confirmed): “Blaze, a modern numerical engine with out-of-core and out-of-order computations”.

There will also be a practical presentation on how to set up an open-source project, discussing hosting, community development, quality assurance, and license choice, by yours truly.

Job offer: working on open source data processing in Python

2015-04-02T00:00:00+02:00

We, Parietal team at INRIA, are recruiting software developers to work on open source machine learning and neuroimaging software in Python.

In general, we are looking for people who:

have a mathematical mindset,

are curious about data (i.e. like looking at data and understanding it),

have an affinity for problem-solving tradeoffs

love high-quality code

worry about users

are good scientific Python coders,

enjoy interacting with a community of developers

We welcome candidates who do not have all the skills but are strongly motivated to acquire them. Prior open-source experience is a big plus.

One example of such a position, with an application in neuroimaging, is: http://gael-varoquaux.info/programming/hiring-a-programmer-for-a-brain-imaging-machine-learning-library.html It was opened a year ago and has now resulted in nilearn: http://nilearn.github.io/

Other positions may be more focused on general machine learning or computing tools such as scikit-learn and joblib, which are reference open-source libraries for data processing in Python.

We are a tightly knit team, with a high degree of programming, data analysis and neuroimaging skills.

Please contact me and Olivier Grisel if you are interested.

EuroSciPy 2015: Call for papers

2015-03-28T00:00:00+01:00

EuroSciPy 2015, the annual conference on Python in science, will take place in Cambridge, UK on 26-30 August 2015. The conference features two days of tutorials followed by two days of scientific talks & posters and an extra day dedicated to developer sprints. It is the major event in Europe in the field of technical/scientific computing within the Python ecosystem. Scientists, PhDs, students, data scientists, analysts, and quants from more than 20 countries attended the conference last year.

The topics presented at EuroSciPy are very diverse, with a focus on advanced software engineering and original uses of Python and its scientific libraries, either in theoretical or experimental research, from both academia and the industry.

Submissions for posters, talks & tutorials (beginner and advanced) are welcome on our website at http://www.euroscipy.org/2015/ Sprint proposals should be addressed directly to the organisation at euroscipy-org@python.org

Important dates:

Apr 30, 2015 Talk and tutorials submission deadline
May 1, 2015 Registration opens
May 30, 2015 Final program announced
Jun 15, 2015 Early-bird registration ends
Aug 26-27, 2015 Tutorials
Aug 28-29, 2015 Main conference
Aug 30, 2015 Sprints

We look forward to an exciting conference and hope to see you in Cambridge

The EuroSciPy 2015 Team - http://www.euroscipy.org/2015/

PRNI 2016: call for organization

2015-01-01T00:00:00+01:00

PRNI (Pattern Recognition for NeuroImaging) is an IEEE conference about applying pattern recognition and machine learning to brain imaging. It is a mid-sized conference (about 150 attendees), and is a satellite of OHBM (the annual “Human Brain Mapping” meeting).

The steering committee is calling for bids to organize the conference in June 2016, in Europe, as a satellite of the OHBM meeting in Geneva.

Improving your programming style in Python

2014-09-29T00:00:00+02:00

Here are some references on software development techniques and patterns to help write better code. They are intended for the casual programmer, and certainly not for the advanced developer.

They are listed in order of difficulty.

Software carpentry

http://swc.scipy.org.

These are the original notes from Greg Wilson’s course on software engineering at the University of Toronto. This course is specifically intended for scientists rather than computer science students. It is very basic and does not cover design issues.

A tutorial introduction to Python

http://www.informit.com/articles/article.asp?p=23100&seqNum=3&rl=1.

This tutorial is easier to follow than Guido’s tutorial, though it does not go into as much depth.

Python Essential Reference

http://www.informit.com/articles/article.asp?p=453682&rl=1

http://www.informit.com/articles/article.asp?p=459269&rl=1

These are two chapters out of David Beazley’s excellent book Python Essential Reference. They give a deeper understanding of how Python works. I strongly recommend this book to anybody serious about Python.

An Introduction to Regular Expressions

http://www.informit.com/articles/article.asp?p=20454&rl=1

If you are going to do any sort of text manipulation, you absolutely need to know how to use regular expressions: powerful search and replace patterns.
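As a small illustration of search and replace patterns, here is a sketch using Python's standard `re` module. The sample sentence and the date format are made up for the example.

```python
import re

text = "Released 2014-07-15, sprint on 2014-07-25."

# Search: find every ISO-formatted date (year-month-day)
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
dates = date_pattern.findall(text)

# Replace: reorder the captured groups as day/month/year
reordered = date_pattern.sub(r"\3/\2/\1", text)

print(dates)
print(reordered)
```

The parentheses define groups that `findall` returns as tuples and that `sub` can refer to with `\1`, `\2`, `\3`.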

Software design for maintainability

My own post

A case of shameless plug: this is a post that I wrote a few years ago. I think that it is still relevant.

Writing a graphical application for scientific programming using TraitsUI

http://gael-varoquaux.info/computers/traits_tutorial/index.html

Building interactive graphical applications is a difficult problem. I have found that the TraitsUI module provides a great answer to this problem. This is a tutorial intended for the non-programmer.

An introduction to Python iterators

http://www.informit.com/articles/article.asp?p=26895&rl=1

This article may not be terribly easy to follow, but iterators are a great feature of Python, so this is definitely worth reading.
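For readers who have not met iterators yet, here is a minimal sketch of the protocol, written in Python 3 syntax. The `Countdown` class and `countdown` generator are hypothetical names chosen for the example; both produce the same values.

```python
class Countdown:
    """Iterator that counts down from `start` to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


def countdown(start):
    """Same behaviour written as a generator function."""
    while start > 0:
        yield start
        start -= 1


print(list(Countdown(3)))
print(list(countdown(3)))
```

The generator version is usually the shorter way to write an iterator, since Python builds the `__iter__` and `__next__` methods for you.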

Functional programming

http://www.ibm.com/developerworks/linux/library/l-prog.html?open&l=766,t=gr,p=PrmgPyth

Functional programming is a programming style where mathematical functions are successively applied to immutable objects to go from the inputs of the program to its outputs in a succession of transformations. It is appreciated by some because it is easy to analyze and prove. In certain cases it can be very readable.
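A short sketch of this style in Python, using only the standard library: pure functions are applied one after the other to an immutable tuple. The `compose` helper and the sample readings are assumptions made for the illustration.

```python
from functools import reduce

readings = (3.0, -1.0, 4.0, -1.5, 5.0)  # immutable input


def keep_positive(values):
    """Return a new tuple with only the positive values."""
    return tuple(v for v in values if v > 0)


def square(values):
    """Return a new tuple with every value squared."""
    return tuple(v * v for v in values)


def total(values):
    """Fold the values into their sum."""
    return reduce(lambda acc, v: acc + v, values, 0.0)


def compose(*functions):
    """Compose left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), functions, x)


sum_of_positive_squares = compose(keep_positive, square, total)
result = sum_of_positive_squares(readings)
print(result)
```

No function modifies its input, so each step can be tested and reasoned about in isolation.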

Patterns in Python

http://www.suttoncourtenay.org.uk/duncan/accu/pythonpatterns.html.

This document presents a few design patterns in Python. Design patterns are solutions to recurring development problems using object-oriented programming. I suggest this reading only if you are familiar with OOP.

Idiomatic Python

Jeff Knupp’s post, a summary of his book:

http://www.jeffknupp.com/blog/2012/10/04/writing-idiomatic-python/

The scipy-lectures chapter on advanced Python:

https://scipy-lectures.github.io/advanced/advanced_python/index.html

General Object-Oriented programming advice

Designing object-oriented code actually requires some care: when you are building your set of abstractions, you are designing the world in which you are going to be condemned to living (or actually coding). I would advise people to keep things as simple as possible, and follow the SOLID principles:

http://mmiika.wordpress.com/oo-design-principles/

Using decorators to do meta-programming in Python

http://www-128.ibm.com/developerworks/linux/library/l-cpdecor.html.

A very beautiful article for the advanced Python user. Meta-programming is a programming technique that involves changing the program at run time. This makes it possible to add new abstractions to the code the programmer writes, thus creating a “meta-language”. This article shows this very well.
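To give a flavor of what a decorator does, here is a minimal sketch. The `count_calls` decorator and the `add` function are hypothetical names for this example, not taken from the linked article.

```python
import functools


def count_calls(func):
    """Decorator that records how many times `func` has been called."""

    @functools.wraps(func)  # keep the name and docstring of `func`
    def wrapper(*args, **kwargs):
        wrapper.n_calls += 1
        return func(*args, **kwargs)

    wrapper.n_calls = 0
    return wrapper


@count_calls
def add(a, b):
    """Return the sum of a and b."""
    return a + b


add(1, 2)
add(3, 4)
print(add.n_calls)
print(add.__name__)
```

The function `add` is replaced at definition time by `wrapper`, which is how the decorator adds behaviour without touching the body of `add`.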

A Primer on Python Metaclass Programming

http://www.onlamp.com/lpt/a/3388

Metaclasses make it possible to define new styles of objects, which can have different calling, creation or inheritance rules. This is way over my head, but I am referencing it here for the record.
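As a rough idea of what customizing class creation looks like, here is a small sketch in Python 3 syntax (the linked article predates it). The `Registry` metaclass and the two classes are hypothetical examples.

```python
class Registry(type):
    """Metaclass that records every class created with it."""

    classes = {}

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.classes[name] = cls  # runs when the class statement executes
        return cls


class Estimator(metaclass=Registry):
    pass


class LinearModel(Estimator):
    # Subclasses inherit the metaclass, so they are registered too
    pass


print(sorted(Registry.classes))
```

Here the creation rule is changed so that defining a class has a side effect, without any explicit registration call.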

Iterators in Python

https://docs.python.org/2/library/itertools.html#recipes

Learn to use the itertools (but don’t abuse them)!
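A few itertools building blocks, as a minimal sketch. The `pairwise` helper follows the recipe from the itertools documentation; the data is made up.

```python
import itertools


def pairwise(iterable):
    """Yield overlapping pairs: s -> (s0, s1), (s1, s2), ..."""
    first, second = itertools.tee(iterable)
    next(second, None)  # advance the second copy by one item
    return zip(first, second)


# count() is infinite; islice() lazily takes only what is needed
squares = (n * n for n in itertools.count())
even_squares = list(itertools.islice((s for s in squares if s % 2 == 0), 5))

# chain() walks several iterables as if they were one
merged = list(itertools.chain("ab", [1, 2]))

print(list(pairwise([1, 2, 3, 4])))
print(even_squares)
print(merged)
```

Nothing is computed until `list` consumes the iterators, which is what makes it safe to start from an infinite sequence.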

Related to the producer/consumer problem with iterators, see:

http://www.oluyede.org/blog/2007/04/09/producerconsumer-in-python/

Hiring an engineer to mine large functional-connectivity databases

2014-09-20T00:00:00+02:00

Work with us to leverage leading-edge machine learning for neuroimaging

At Parietal, my research team, we work on improving the way brain images are analyzed, for medical diagnostic purposes, or to understand the brain better. We develop new machine-learning tools and investigate new methodologies for quantifying brain function from MRI scans.

One of our important avenues of contribution is in deciphering “functional connectivity”: analyzing the correlation of brain activity to measure interactions across the brain. This direction of research is exciting because it can be used to probe the neural support of functional deficits in incapacitated patients, and thus lead to new biomarkers of functional pathologies, such as autism. Indeed, functional connectivity can be computed without resorting to complicated cognitive tasks, unlike most functional imaging approaches. The flip side is that exploiting such “resting-state” signal requires advanced multivariate statistics tools, something at which the Parietal team excels.
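In its simplest form, the idea of measuring interactions through correlation can be sketched as below. This is a toy illustration with made-up time series for three hypothetical regions and a plain Pearson correlation; real analyses rely on dedicated tools and more advanced estimators.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length signals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def connectivity_matrix(signals):
    """Correlation of every pair of regions, keyed by (name, name)."""
    names = list(signals)
    return {(a, b): pearson(signals[a], signals[b])
            for a in names for b in names}


# Made-up time series for three hypothetical brain regions
signals = {
    "region_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "region_b": [2.0, 4.0, 6.0, 8.0, 10.0],  # moves with region_a
    "region_c": [5.0, 4.0, 3.0, 2.0, 1.0],   # moves against region_a
}

connectivity = connectivity_matrix(signals)
print(connectivity[("region_a", "region_b")])
print(connectivity[("region_a", "region_c")])
```

Each entry of the matrix summarizes how strongly two regions fluctuate together, which is the raw material from which biomarkers are then learned.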

For such multivariate processing of brain imaging data, Parietal has an ecosystem of leading-edge high-quality tools. In particular we have built the foundations of the most successful Python machine learning library, scikit-learn, and we are growing a dedicated software, nilearn, that leverages machine learning for neuroimaging. To support this ecosystem, we have dedicated top-notch programmers, led by the well-known Olivier Grisel.

We are looking for a data-processing engineer to join our team and work on applying our tools on very large neuroimaging databases to learn specific biomarkers of pathologies. For this, the work will be shared with the CATI, the French platform for multicentric neuroimaging studies, located in the same building as us. The general context of the job is the NiConnect project, a multi-organisational research project that I lead and that focuses on improving diagnostic tools on resting-state functional connectivity. We have access to unique algorithms and datasets, before they are published. What we are now missing is the link between those two, and that link could be you.

If you want more details, they can be found on the job offer. This post is to motivate the job in a personal way, that I cannot give in an official posting.

Why take this job?

I don’t expect someone to take this job only because it pays the bills. To be clear, the kind of person I am looking for has no difficulties finding a job elsewhere. So, if you are that person, why would you take the job?

To join a great team with many experts, focused on finding elegant solutions to hard problems at the intersection of machine learning, cognitive science, and software. Choose to work with great people, knowledgeable, passionate, and fun.
To work on interesting problems, that matter. They are interesting because they are challenging but we have the skills to solve them. They matter because they can make brain research better.
To learn. NeuroImaging + Machine learning is a quickly growing topic. If you come from a NeuroImaging background and want to add to your CV an actual expertise in machine learning for NeuroImaging, this is the place to be.

What would make me excited in a resume?

A genuine experience in neuroimaging data processing, especially large databases.
Talent with computers and ideally some Python experience.
The unlikely combination of research training (graduate or undergraduate) and experience in a non academic setting.
A problem-solving mindset.
A good ability to write about neuroimaging and data processing in English: who knows, if everything goes to plan, you could very well be publishing about new biomarkers.

Now if you are interested and feel up for the challenge, read the real job offer, and send me your resume.

Scikit-learn 2014 sprint: a report

2014-07-25T00:00:00+02:00

A week ago, the 2014 edition of the scikit-learn sprint was held in Paris. This was the third time that we held an international sprint and it was hugely productive, and great fun, as always.

Great people and great venues

We had a mix of core contributors and newcomers, which is a great combination, as it enables us to be productive, but also to foster the new generation of core developers. Present were:

Laurent Direr
Michael Eickenberg
Loic Esteve
Alexandre Gramfort
Olivier Grisel
Arnaud Joly
Kyle Kastner
Manoj Kumar
Balazs Kegl
Nicolas Le Roux
Andreas Mueller
Vlad Niculae
Fabian Pedregosa
Amir Sani
Danny Sullivan
Gabriel Synnaeve
Roland Thiolliere
Gael Varoquaux

As the sprint extended through a French bank holiday and the week end, we were hosted in a variety of venues:

La paillasse, a Paris bio-hacker space
INRIA, the French national research institute for computer science, and the place where I work :)
Criteo, a French company doing world-wide ad-banner placement. The venue there was absolutely gorgeous, with a beautiful terrace on the roofs of Paris. And they even had a social event with free drinks one evening.
Tinyclues, a French startup mining e-commerce data.

I must say that we were treated like kings during the whole stay, each host welcoming us as well as they could. Thank you to all of our hosts!

Achievements during the sprint

The first day of the sprint was dedicated to polishing the 0.15 release, which was finally released on the morning of the second day, after 10 months of development.

A large part of the efforts of the sprint was dedicated to improving the codebase, rather than directly adding new features. Some files were reorganized. The input validation code was cleaned up (opening the way for better support of pandas structures in scikit-learn). We hunted dead code, deprecation warnings, numerical instabilities and randomly failing tests. We made the test suite faster, and refactored our common tests that scan all the models.

Some work of our GSOC student, Manoj Kumar, was merged, making some linear models faster.

Our online documentation was improved, with the API documentation pointing to examples and source code.

Still work in progress:

Faster stochastic gradient descent (with AdaGrad, ASGD, and one day SAG)
Calibration of probabilities for models that do not have a ‘predict_proba’ method
Warm restart in random forests to add more estimators to an existing ensemble.
Infomax ICA algorithm.

Scikit-learn 0.15 release: highlights

2014-07-15T00:00:00+02:00

We have just released the 0.15 version of scikit-learn. Hurray!! Thanks to all involved.

A long development stretch

It’s been a while since the last release of scikit-learn. So a lot has happened. Exactly 2611 commits according to my count. Quite clearly, we have more and more existing code, more and more features to support. This means that when we modify an algorithm, for instance to make it faster, something else might break due to numerical instability, or to exploring some obscure option. The good news is that we have tight continuous integration, mostly thanks to travis (but Windows continuous integration is on its way), and we keep growing our test suite. Thus while it is getting harder and harder to change something in scikit-learn, scikit-learn is also becoming more and more robust.

Highlights

Quality — Looking at the commit log, there has been a huge amount of work to fix minor annoying issues.

Speed — There has been a huge effort put in making many parts of scikit-learn faster. Little details all over the codebase. We do hope that you’ll find that your applications run faster. For instance, we find that the worst case speed of Ward clustering is 1.5 times faster in 0.15 than 0.14. K-means clustering is often 1.1 times faster. KNN, when used in brute-force mode, got faster by a factor of 2 or 3.

Random Forest and various tree methods — The random forest and various tree methods are much much faster, use parallel computing much better, and use less memory. For instance, the picture on the right shows the scikit-learn random forest running in parallel on a fat Amazon node, and nicely using all the CPUs with little RAM usage.

Hierarchical agglomerative clustering — Complete linkage and average linkage clustering have been added. The benefit of these approaches compared to the existing Ward clustering is that they can take an arbitrary distance matrix.

Robust linear models — Scikit-learn now includes RANSAC for robust linear regression.

HMM are deprecated — We have been discussing for a long time removing HMMs, that do not fit in the focus of scikit-learn on predictive modeling. We have created a separate hmmlearn repository for the HMM code. It is looking for maintainers.

And much more — plenty of “minor things”, such as better support for sparse data, better support for multi-label data…

Google summer of code projects for scikit-learn

2014-04-23T00:00:00+02:00

I’d like to welcome the four students that were accepted for the GSoC this year:

Issam: Extending Neural networks
Hamzeh: Sparse Support for Ensemble Methods
Manoj: Making Linear models faster
Maheshakya: Locality Sensitive Hashing

Welcome to all of you. Your submissions were excellent, and you demonstrated a good will to integrate in the project, with its social and coding dynamics. It is a privilege to work with you.

I’d also like to thank all the mentors, Alex, Arnaud, Daniel, James, Jaidev, Olivier, Robert and Vlad. It is a lot of work to mentor and mentors are not only making it possible for great code to enter scikit-learn, but also shaping a future generation of scikit-learn contributors.

Hiring a programmer for a brain imaging machine-learning library

2014-02-12T00:00:00+01:00

Work with us on putting machine learning in the hand of cognitive scientists

Parietal is a research team that creates advanced data analysis to mine functional brain images and solve medical and cognitive science problems. Our day-to-day work is to write machine-learning and statistics code to better understand and use images of brain function (most often fMRI). Our purpose is to be useful to the NeuroImaging community, mostly medical and cognitive science researchers, to understand brain function better. What is limiting us in this respect is that to reach end users we need to turn our algorithms into usable software.

This is why Parietal has a long tradition of investing in building an ecosystem of high-quality libraries and tools: we build, layer by layer, an environment in which we can do our research, and with which we hope to one day reach the user. We choose Python, as a high-level general purpose language with which we can do scientific computing, and, one day, GUIs, or web servers. We contribute to the scipy ecosystem; we have built the foundations of the most successful Python machine learning library, scikit-learn. We are invested in the neuroimaging in Python ecosystem. Our students, our team members, send patches to scientific Python projects, teach courses on how to use them, speak at conferences.

But to go all the way, we need support from people who do software as their sole goal. To put the finishing touch on the quality of our end-user libraries, we need full-time programmers. In an academic setting, they can be hard to justify, but we have always had dedicated top-notch engineers at Parietal, our latest hire being the well-known Olivier Grisel. This is where you can come in.

NiConnect is a specific research project in which we are developing leading algorithmic tools. For this project, we have funding for a full-time programmer: someone who will help us turn our understanding of how to process brain images into a software tool that a cognitive science researcher can use. We have started work on such software, in the nilearn project. What we need is someone who drives the project and makes sure that the pieces fit together well, so that the code to solve the user’s problem is not our research code, but a clean and lean library, just like scikit-learn is an elegant answer to day-to-day machine learning tasks.

If you want more details, they can be found on the job offer. This post is to motivate the job in a personal way, that I cannot give in an official posting.

Why take this job?

I don’t expect someone to take this job only because it pays the bills. To be clear, the kind of person I am looking for has no difficulties finding a well-paid job elsewhere. So, if you are that person, why would you take the job?

To join a great team that is focused on finding elegant solutions to hard problems at the intersection of machine learning, cognitive science, and software. Choose to work with great people, knowledgeable, passionate, and fun.
To work on interesting problems, that matter. They are interesting because they are challenging but we have the skills to solve them. They matter because these skills need to be used to make brain research better.
To have a boss (me) that actually codes and gives you feedback on your code.
To learn. Data science + Python is the combination of skills to have. We have at Parietal a unique expertise in these. Add to it a fine understanding of algorithms, high performance computing, statistics, and software quality, and you have the perfect lines on a CV.

What would make me excited in a resume?

Open source contributions (there is no better coding CV than a github account).
Experience in agile-like situations
A passion for code quality
Good Python experience
The unlikely combination of research-like training (eg undergraduate) and experience in a non academic and non scientific setting (say web development).
To know that you care about user experience, about understanding and solving the user’s problems.

Now if you are interested and feel up for the challenge, read the real job offer, and send me your resume.

Scikit-learn 0.14 release: features and benchmarks

2013-08-08T00:00:00+02:00

I tagged and released scikit-learn 0.14 yesterday evening, after more than 6 months of heavy development from the team. I would like to give a quick overview of the highlights of this release in terms of features but also in terms of performance. Indeed, the scikit-learn developers believe that performance matters and strive to be fast and efficient on fairly large datasets.

I will show in this article, on a couple of benchmarks, that we have made significant performance improvements and are competitive with the fastest libraries, such as the proprietary WiseRF.

Prominent new features

Most of the new features of this release have been covered in more detail on Andy Mueller’s blog. I am just giving a quick list here for completeness (see also the full list of changes):

Major new estimators:

AdaBoost (by Noel Dawe and Gilles Louppe): the classic boosting algorithm. This implementation can be applied to any estimator, but uses trees by default. AdaBoost is a learning strategy that builds from simple learning strategies by focussing successively on samples that are not well predicted. Typically, the simple learners (called weak learners) can be rules as simple as taking simple thresholds of observed quantities (this will form decision stumps). Documentation — Example
Biclustering (by Kemal Eren): clustering rows and columns of the data matrices. Suppose you have access to the shopping lists of many consumers: biclustering would consist in grouping both the consumers and the products they bought to come up with stories such as “geeks buy computers and phones”, where “geeks” would be a group of consumers and “computers” and “phones” would be groups of products. Documentation — Example
Missing value imputation (by Nicolas Tresegnie): simple transformer filling missing data with means or medians. If your data acquisition has failures, human or material, you can easily end up with some descriptors missing for some observations. It would be a pity to throw away either those observations or some descriptors. “Imputation” fills in the blanks with simple strategies. Documentation — Example
RBMs (Restricted Boltzmann Machines) (by Yann Dauphin): a neural network model useful for unsupervised learning of features. Restricted Boltzmann machines learn a set of hidden (latent) factors that have, for each observation, a probability to be activated or not. These activations are found so that they explain the data well, when combined across all the hidden factors with connection weights. Typically, they form a new feature set that can be useful in a prediction task. Documentation — Example
RandomizedSearchCV (by Andreas Mueller): setting meta-parameters on estimators using a randomized parameter exploration rather than a grid, as in a grid-search. A CV (cross-validated) meta-estimator sets parameters of an estimator by maximizing their cross-validated prediction scores. This entails fitting the estimator for each parameter value tried. The randomized-search explores the parameter space randomly, avoiding the exponential growth in number of points to fit of the grid search. Documentation — Example
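
To make the contrast between a grid and a randomized exploration concrete, here is a minimal, library-free sketch. It is not scikit-learn's implementation: the parameter names, the candidate values, and the `toy_score` function (a stand-in for a cross-validated score) are all made up for illustration. It only shows that the number of grid points is the product of the list sizes, while the randomized search uses a fixed budget.

```python
import itertools
import random

# Hypothetical parameter grid: names and values are illustrative only.
param_grid = {
    "max_depth": [2, 4, 8, 16],
    "min_samples_split": [2, 5, 10, 20],
    "max_features": [0.25, 0.5, 0.75, 1.0],
}


def toy_score(params):
    """Stand-in for a cross-validated score (an assumption, not a real model)."""
    return -abs(params["max_depth"] - 8) - abs(params["max_features"] - 0.5)


def grid_candidates(grid):
    """Enumerate every combination: the count is the product of the list sizes."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def random_candidates(grid, n_iter, seed=0):
    """Draw a fixed budget of candidates, whatever the number of parameters."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        yield {name: rng.choice(values) for name, values in grid.items()}


grid_points = list(grid_candidates(param_grid))
random_points = list(random_candidates(param_grid, n_iter=10))

best_grid = max(grid_points, key=toy_score)
best_random = max(random_points, key=toy_score)

print(len(grid_points), "fits for the grid,", len(random_points), "for the random search")
```

Adding a fourth parameter with four values would multiply the grid size by four, while the randomized budget would stay at `n_iter`. The price is that the randomized search is not guaranteed to hit the best grid point.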

Infrastructure work:

New website (mostly by Gilles Louppe, Nelle Varoquaux, Vincent Michel and Andreas Mueller). The redesign of the website had two objectives: i) unclutter the pages to help prioritize information, ii) make it easier for users to find the stable documentation if they follow an external link to the documentation of a previous release. I think that it also looks prettier :).
Python 3 support (Justin Vincent, Lars Buitinck, Subhodeep Moitra and Olivier Grisel). As a side note, under Python 3.3, on Windows, we have found that np.load can trigger segfaults, which means our test suite crashes. The tests not relying on np.load pass.

Major API changes

The scoring parameter: One of the benefits of scikit-learn over other learning packages is that it can set parameters to maximize a prediction score. However, the prediction score that one would want to optimize might depend on the application. Also, some scores can only be computed with specific estimators, for instance because they require probabilistic predictions. Andreas Mueller and Lars Buitinck came up with a new API to specify the scoring strategy that is versatile and hides complexity from the user. This replaces the score_func argument.
*sklearn.test()* is deprecated and will not run the test suite. Please use nosetests sklearn from the command line.

The full list of API changes can be found on the change log.

Performance improvements

Many parts of the codebase got speed-ups, with a focus on making scikit-learn more scalable for bigger data.

The trees (random forests and extra-trees) were massively sped up by Gilles Louppe, bringing them on par with the fastest libraries (see benchmarks below)
Jake Vanderplas improved the BallTree and implemented fast KDTrees for nearest-neighbor search (benchmarks below).
“cleverless” made the DBSCAN implementation scale to a large number of samples by relying on KDTree and BallTree for neighbor search.
KMeans much faster on sparse data (Lars Buitinck)
For text vectorization: much faster CountVectorizer and TfidfVectorizer with less memory consumption (Jochen Wersdorfer and Roman Sinayev)
Out-of-core learning for discrete naive Bayes classifiers by Olivier Grisel. Estimators that implement a partial_fit method can be used to fit the model with an out-of-core strategy, as illustrated by the out-of-core classification example. These settings are well suited to very big data.
FastICA: lower memory consumption and slightly faster code (Denis Engemann and Alexandre Gramfort)
Faster IsotonicRegression (Nelle Varoquaux)
OrthogonalMatchingPursuitCV by Alexandre Gramfort and Vlad Niculae: while strictly speaking not a speedup of an existing estimator, this new estimator means that OMP parameters can be set much faster.
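
The out-of-core strategy mentioned in the list above boils down to an estimator that can update its state from one chunk of data at a time. The sketch below illustrates the idea with a tiny count-based naive Bayes written in plain Python. The class `TinyCountNB`, the `chunks` generator, and the toy documents are hypothetical and are not scikit-learn code; only the name of the `partial_fit` method follows the convention described above.

```python
import math
from collections import Counter, defaultdict


class TinyCountNB:
    """Hypothetical, minimal count-based naive Bayes with a partial_fit method."""

    def __init__(self):
        self.class_counts = Counter()             # documents seen per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        self.vocabulary = set()

    def partial_fit(self, documents, labels):
        """Update the counts from one chunk; earlier chunks are not needed again."""
        for tokens, label in zip(documents, labels):
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        """Return the class with the highest smoothed log-probability."""
        n_docs = sum(self.class_counts.values())
        n_vocab = len(self.vocabulary)

        def log_posterior(label):
            counts = self.token_counts[label]
            total = sum(counts.values())
            score = math.log(self.class_counts[label] / n_docs)
            for token in tokens:
                # Laplace smoothing avoids log(0) for unseen tokens.
                score += math.log((counts[token] + 1) / (total + n_vocab))
            return score

        return max(self.class_counts, key=log_posterior)


def chunks():
    """Stand-in for reading a dataset piece by piece from disk."""
    yield [["cheap", "pills"], ["meeting", "today"]], ["spam", "ham"]
    yield [["cheap", "watches"], ["project", "meeting"]], ["spam", "ham"]


model = TinyCountNB()
for documents, labels in chunks():
    model.partial_fit(documents, labels)

print(model.predict(["cheap", "pills"]))
```

Because the model only keeps counts, memory usage depends on the vocabulary and the number of classes, not on the number of documents seen.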

We are faster: lies, damn lies and benchmarks

“There are three kinds of lies: lies, damned lies and statistics.” —

Mark Twain’s Own Autobiography: The Chapters from the North American Review

I claim we have gotten faster at certain things. Other libraries, such as WiseRF, make performance claims relative to us. It turns out that benching statistical learning code is very hard, because speed depends a lot on the properties of the data.

Fast neighbor searches: good KDTrees beat BallTrees

A good example of interplay between properties of the data and computational speed is the nearest neighbor search. In general, finding the nearest neighbor to a point out of n other points will cost you n operations, as you have to compute the distance to each of these points. However, building a tree-like data structure ahead of time can make this query cost only log n. If these points are in 1D, i.e. simple scalars, this would be achieved by sorting them. In higher dimensions, this can be achieved by building a KDTree, made of planes dividing the space in half-spaces, or a BallTree, made of nested balls.
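
The 1D case can be sketched with the standard library alone: sorting plays the role of the tree, and a binary search replaces the scan over all points. The function names and the random data below are made up for illustration, and the sketch assumes a non-empty list of points.

```python
import bisect
import random


def nearest_brute_force(points, query):
    """Scan all n points: the cost grows linearly with n."""
    return min(points, key=lambda p: abs(p - query))


def nearest_sorted(sorted_points, query):
    """Binary search in a pre-sorted list: about log2(n) comparisons per query."""
    i = bisect.bisect_left(sorted_points, query)
    # The nearest neighbor is one of the (at most two) points around index i.
    candidates = sorted_points[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda p: abs(p - query))


rng = random.Random(0)
points = [rng.uniform(0, 100) for _ in range(10_000)]
sorted_points = sorted(points)  # one-off n log n cost, paid ahead of time

query = 42.0
print(nearest_brute_force(points, query), nearest_sorted(sorted_points, query))
```

The sort is only worth its cost when many queries are made against the same set of points, which is the typical situation in neighbor-based learning.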

KD Tree Image from AstroML’s documentation

Ball tree Image from AstroML’s documentation

Popular wisdom in machine learning is that in high dimensions, BallTrees scale better than KDTrees. This is explained by the fact that as the dimensionality grows, the number of planes required to break up the space grows too. On the contrary, if the data has structure, BallTrees can more efficiently cover this structure. I have benched scikit-learn’s KDTree and BallTree, as well as scipy’s KDTree, which employs a simpler tree-building strategy, on a variety of datasets, both real-life and artificial. Below is a summary plot giving the relative performance of neighbor search:

n is the number of data points, and p the dimensionality.

We can see that no approach wins on all counts. That said, it came as a surprise to me to see that even in high dimension, scikit-learn’s KDTree outperformed the BallTrees. This is explained by the fact that these datasets do not display strong low-dimensional structure. On highly-structured synthetic data, the benefit of BallTree can clearly stand out, as shown by Jake here. However, on most datasets people encounter, it seems that this is not the case. Note also that scikit-learn’s KDTree tends to scale better in high dimension than scipy’s. This is due to the more elaborate choice of cutting planes. Note that this choice also has a cost, and may backfire, as on some datasets scikit-learn is slower than scipy.

Overall, the new KDTree in scikit-learn seems to give an excellent compromise. Congratulations Jake!

DBSCAN is faster with trees

DBSCAN is a clustering algorithm that relies heavily on the local neighborhood structure. The implementation in scikit-learn 0.13 computed the complete n by n matrix of distances between observations, which means that if you had a lot of data, you would blow your memory. In the 0.14 release, DBSCAN uses the BallTree, and as a result scales to much larger datasets and brings speed benefits. Here is a comparison between the 0.13 and 0.14 implementations (I couldn’t use data as large as I wanted because the 0.13 code would blow up):

Dataset	time with 0.13	time with 0.14
“lfw”: 13233 samples, 5 features	6.57 seconds	3.59 seconds
“make_blobs”: 30000 samples, 10 features	33.50 seconds	12.87 seconds

Importantly, the scaling is different: while the 0.13 code scales as n ^ 2, the 0.14 code scales as n log n. This means that the benefit is bigger for large datasets.
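
As a rough illustration of this scaling argument, the short script below computes the measured speed-ups from the table above and compares them with an idealized ratio of operation counts. The `work_ratio` helper is hypothetical and ignores all constant factors. The two datasets also differ in dimensionality, so this is a back-of-the-envelope sketch rather than a proper scaling experiment.

```python
import math

# Timings copied from the table above, in seconds.
timings = {
    "lfw": {"n": 13233, "t_013": 6.57, "t_014": 3.59},
    "make_blobs": {"n": 30000, "t_013": 33.50, "t_014": 12.87},
}

# Measured speed-up of 0.14 over 0.13 on each dataset.
speedups = {name: row["t_013"] / row["t_014"] for name, row in timings.items()}
for name, value in speedups.items():
    print(f"{name}: 0.14 is {value:.2f}x faster")


def work_ratio(n):
    """Ratio of n^2 to n * log2(n) operation counts, constants ignored."""
    return (n * n) / (n * math.log2(n))


# The idealized ratio keeps growing with n.
for n in (13233, 30000, 1_000_000):
    print(f"n={n}: idealized ratio of about {work_ratio(n):.0f}")
```

The measured speed-ups are much smaller than the idealized ratios because constant factors dominate at these sizes, but both grow with the number of samples.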

Scikit-learn 0.14’s random forests are fast

Gilles Louppe has made the random forests significantly faster in the 0.14 release. Let us bench them in comparison with WiseIO’s WiseRF, a proprietary package that only does random forests and for which the main selling point is that it is significantly faster than scikit-learn. However, let us also bench ExtraTrees, a tree-based model that is very similar to random forests, but that in our experience can be implemented a bit faster, and tends to work better.

On the digits dataset (1797 samples, 64 features):

Forest implementation	train time	test time	prediction accuracy
Sklearn ExtraTrees	2.641s	0.082s	0.986
Sklearn RandomForest	5.074s	0.088s	0.981
WiseRF	5.665s	0.108s	0.979

So we see that on a mid-sized dataset, scikit-learn is faster than WiseRF, and ExtraTrees is twice as fast as RandomForest, for better results.

On the MNIST dataset (70000 samples, 784 features):

Forest implementation	train time	test time	prediction accuracy
Sklearn ExtraTrees	1378.141s	4.768s	0.976
Sklearn RandomForest	1639.866s	4.132s	0.972
WiseRF	1102.465s	14.542s	0.972

On a big dataset, WiseRF takes the lead, but not by a large factor.

Using 2 CPUs (n_jobs=2) on the digits dataset:

Forest implementation	train time	test time	prediction accuracy
Sklearn ExtraTrees	4.874s	1.478s	0.986
Sklearn RandomForest	5.716s	1.349s	0.978
WiseRF	3.264s	0.104s	0.979

Both scikit-learn and WiseRF can use several CPUs. However, the Python parallel execution model via multiple processes has an overhead in terms of computing time and of memory usage. The internals of WiseRF are coded in C++, and thus it is not limited by this overhead. Also, because of the memory duplication with multiple processes in scikit-learn, I could not run it on MNIST with 2 jobs. The next release will address these issues, partly by using memmapped arrays to share memory between processes.

We make good use of funding: the Paris sprint

A couple of weeks ago, we had a coding sprint in Paris. We were able to bring in a lot of core developers from all over Europe thanks to our sponsors: FNRS, AFPy, Telecom Paristech, and Saint-Gobain Recherche. The total budget, including accommodation and travel, was a couple thousand euros, thanks to Telecom Paristech and tinyclues helping us with accommodation and hosting the sprint.

The productivity of such a sprint is huge, both because we get together and work efficiently, and because we get back home and keep working (I have been sleep deprived because of late-night hacking ever since the sprint). As an illustration, here is the diagram of commits as can be seen on Github. The huge spike corresponds to the second international sprint: Paris 2013.

We now have a “donate” button on the website. I can assure you that your donations are well spent and turned into code.

RIP John Hunter: the loss of a great man

2012-08-30T10:21:00+02:00

John Hunter, the author of matplotlib, passed away yesterday after a short battle against cancer. John gave the keynote at the scipy 2012 conference a few weeks ago, and was diagnosed with cancer just on his return from the conference. It is a shock to me that a friend can disappear so quickly. Please read the announcement of Fernando Perez, who supported John in the last weeks, to learn more about John.

A man who gave a lot, not asking for anything in return

Many have benefited from the silent efforts of John, and are not fully aware of how he generously invested his time and talent for the benefit of others. Matplotlib, the Python plotting library that he created in 2002, has propelled Python as a major tool for scientific research and engineering. The impact of John’s efforts goes well beyond Matplotlib. Early on, John had the vision of Python as an interactive scientific environment. He promoted this vision, pairing with Fernando Perez to develop the fantastic ipython/matplotlib tandem, solving many technical challenges. But he also invested a lot of energy in teaching workshops that helped change the way people compute, as well as writing didactic documentation and articles. He was a friendly, active leader of an online community, open and helpful to newcomers.

As Travis Oliphant said on John’s numfocus memorial webpage:

Those who contribute much to open source, as John did, do so at the expense of something - often it is time with family.

I cannot stress enough how true this is. The entire open source software ecosystem, which nowadays supports our economy, our education, and our research, is built on the shoulders of a fairly small number of generous people who spend their energy on making better software rather than on personal wealth.

John was a humble man. He did not have a blog, or a twitter account, did not seek fame or money. For this reason I feel that his contributions are unknown and undervalued by many. In my eyes, he is an unknown soldier of our modern times. I hope that I am not being too emphatic, but this is how I feel.

Note

John passed away at 44, leaving behind a wife and 3 daughters. Please do consider supporting them:

http://numfocus.org/johnhunter

A journal promoting high-quality research code: dream and reality

2012-06-04T21:39:00+02:00

Open research computation (ORC) was an attempt to create a scientific publication promoting high-quality and open source scientific code. The project went public in fall 2010, but last month, facing the low volume of submissions, the editorial board chose to reorient it as a special track of an existing journal.

The challenges that we face are discussed in our editorial:

Changing computational research. The challenges ahead. C Neylon, J Aerts, CT Brown, D Lemire, J Millman, P Murray-Rust, F Perez, N Saunders, A Smith, G Varoquaux and E Willighagen, Source Code for Biology and Medicine 2012, 7:20

Here is my own personal take on the rise and fall of this ideal.

My story with ORC

From pipe dream to journal - My involvement with ORC started long before there was such a thing as ORC. In fall 2008, I had a discussion with a friend working in the publication industry, telling her how I believed that the publication system is broken, because it promotes new results without any interest in whether these can be exported outside the lab that produced them: it is currently easier to publish a minor but novel result than a tool enabling the routine reproduction of previous results. This seemed particularly marked in the scientific software world, as software tools are becoming central to the scientific workflow, and cost nothing to duplicate when produced under an open-source license. To my surprise, she took me seriously, and asked me to write my ideas down in an email that she would forward to her colleagues in the publication industry.

Looking back at the email that I sent, my concerns were, back then, to promote:

quality and openness of scientific software
basic tools shared across communities
recognition of software development as a challenging and worthwhile task in academic research

Shaping the idea - In the year that followed, I had a few discussions with staff from BioMedCentral, an open-access publisher in biology and medicine that was looking into expanding into the physics and math related fields. Eventually, my contact there told me that they had other similar requests and were launching a journal that would be led by Cameron Neylon, a British biophysicist and strong advocate of openness and reproducibility in science. This was the start of ORC, and for me the chance to meet other people sharing my concerns, some new and some already old friends.

ORC editor

Conventional editor

Setting up the journal - BioMedCentral was instrumental in setting up the journal project. I quickly learned that, no surprise, a journal is a product, like anything else, and it must find customers. Here, as we were launching an open access journal, the customers were authors. This is where a journal faces a chicken and egg problem: to be recognised it needs high-visibility publications, but authors will submit only to journals that they know. The main tools to overcome this challenge are communication and advocacy. I then realized that these really weren’t my strong points. Cameron Neylon absolutely shone on this side, with very enthusiastic communications and an incredibly active twitter account. On my side, I am a slow writer, and I tend to speak Python code better than English language, which is not a strong asset for a journal editor.

Wild editorial discussions - The discussions in the editorial board really thrilled me because they were centered on how to set standards to improve the quality of code published. Looking in my mailbox, I see discussions about code repositories, software testing, documentation or licensing issues. This is not that surprising, given that a lot of the editors were actually contributors to major software projects. It made me very happy, as I have the feeling that, so far, most committees or decision makers are clueless about software.

Sand in the gears: the lack of uptake

A false start - So ORC was launched late 2010 and we had fantastic feedback. I had the feeling that people were genuinely excited about our program: changing the way computational science worked from the inside, through the review process. The idea was that we had opened a pre-submission call, and were waiting for a few good papers to be submitted to launch the journal. However, it turned out that the papers were slow to come. It took me a while to realize that there was something wrong. But slowly we had to face the truth: many people were excited about the journal, but most were sending their papers elsewhere.

What went wrong? - If I really knew what went wrong, I would probably not be discussing it here, but rather changing the world. However, I can come up with a few guesses:

Working across communities is harder. From the beginning we had wanted to position the journal across communities, in order to foster the sharing of tools for a greater good. The challenge is that a central role of publication is nowadays to provide recognition. It is much easier to achieve recognition in a given community than across communities, and authors always preferred submitting their work to a non-software oriented journal in their field. We couldn’t fight together the battle for software quality and the battle for inter-community work.
Setting the bar too high. Many felt that the submission requirements were too demanding. To quote a researcher on a neuroimaging forum: “I think it’s setting the bar unrealistically high for most neuroimaging software”. While we had originally shot for a very high test coverage (probably too high), we had scaled it back quickly, simply stressing that editors and reviewers would be looking closely at test coverage, documentation and ease of installation. That said, the average researcher did not share our ideals of raising the quality of scientific software. Trying to ask only for excellent publications in a new and unproven journal was probably unrealistic.
Editors not willing to game the system. I have watched a few journal launches, and it seems to me that a common trick is to line up articles that are created by the editors and their friends specifically for the new journal. People come up with opinion papers, reviews, commentaries that only serve to generate an identity to the journal. This did not happen for ORC, and I believe that it is because the editors themselves were not huge fans of the low signal-to-noise ratio in modern scientific publishing practice.

The times they are a changing

ORC is dead, long live ORC - We did get a few submissions. ORC is not coming to an end, it is morphing into a special thematic series in Source Code for Biology and Medicine. This solution is not completely satisfactory, as it pushes what should have been a forum for exposing good practices and good software into a smaller community. But at least there is now a venue in which people can publish a paper about software that they have been improving and maintaining, and not only about a new algorithm.

Changing practices across the board - Among the reasons for which we had a hard time making a breakthrough is that authors were sending their software papers to other journals, in particular journals not specialized in software. While these papers are not getting the attention of a review and editorial team expert in software development, as we were setting up with ORC, this is still a good thing. Indeed it shows that the times are changing and that recognition of software as a scientific work is improving. I have been impressed to see that many high profile journals have changed their editorial policies to specifically accept software papers, or have created tracks dedicated to software.

Software is being slowly recognized as a pillar of modern scientific research. We need to keep pushing to make sure that quality standards are set and that the open-source scientific software grows into a mature ecosystem focused on problem solving.

Update on scikit-learn: recent developments for machine learning in Python

2012-05-09T00:12:00+02:00

Yesterday, we released version 0.11 of the scikit-learn toolkit for machine learning in Python, and there was much rejoicing.

Major features gained in the last releases

In the last 6 months, there have been many things happening with the scikit-learn. While I do not wish to give an exhaustive summary of features added (it can be found here), let me list a few of the additions that I personally find exciting.

Non-linear prediction models

For complex prediction problems where there is no simple model available, as in computer vision, non-linear models are handy. A good example of such models are those based on decision trees and model averaging. For instance, random forests are used in the Kinect to locate body parts. As they are intrinsically complex, they may need a large amount of training data. For this reason, they have been implemented in the scikit-learn with special attention to computational efficiency.

Dealing with unlabeled instances

It is often easier to gather unlabeled observations than labeled ones. While prediction of a quantity of interest is then harder or simply impossible, mining this data can be useful.

Semi-supervised learning: using unlabeled observations together with labeled ones for better prediction.

Outlier/novelty detection: detect deviant observations.

Manifold learning: discover a non-linear low-dimensional structure in the data.

Clustering with an algorithm that can scale to really large datasets using an online approach: fitting small portions of the data one after the other (Mini-batch k-means).

Dictionary learning: learning patterns in the data that represent it sparsely: each observation is a combination of a small number of patterns.
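
To give a feel for the online approach mentioned above for clustering, here is a small sketch in the spirit of mini-batch k-means, restricted to 1D data and written in plain Python. It is not the scikit-learn implementation: the function names, the initial centers, and the synthetic stream (two groups around 0 and 10) are assumptions made for the example.

```python
import random


def mini_batch_kmeans_1d(stream, centers):
    """Update 1D cluster centers one mini-batch at a time.

    Each center moves toward its assigned points with a step of
    1 / (number of points it has received so far), i.e. a running mean.
    """
    centers = list(centers)
    counts = [0] * len(centers)
    for batch in stream:
        # Assign every point of the batch using the centers as they are now.
        assignments = [
            min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            for x in batch
        ]
        for x, k in zip(batch, assignments):
            counts[k] += 1
            # Running-mean update: old batches never need to be revisited.
            centers[k] += (x - centers[k]) / counts[k]
    return centers


def batches(n_batches, batch_size, seed=0):
    """Stand-in for a data stream: two well-separated groups around 0 and 10."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        yield [rng.gauss(rng.choice([0.0, 10.0]), 1.0) for _ in range(batch_size)]


centers = mini_batch_kmeans_1d(batches(50, 20), centers=[2.0, 8.0])
print(sorted(centers))
```

Only one batch is held in memory at a time, which is what lets this kind of algorithm scale to datasets that do not fit in RAM.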

Sparse models: when very few descriptors are relevant

In general, finding which descriptors are useful when there are many of them is like finding a needle in a haystack: it is a very hard problem. However, if you know that only a few of these descriptors actually carry information, you are in a so-called sparse problem, for which specific approaches can work well.

Orthogonal matching pursuit: a greedy and fast algorithm for very sparse linear models

Randomized sparsity (randomized Lasso): selecting the relevant descriptors in noisy high-dimensional observations

Sparse inverse covariance: learning graphs of connectivity from correlations in the data

Getting developers together: the Granada sprint

Of course, such developments happen only because we have a great team of dedicated coders.

Getting along and working together is a critical part of the project. In December 2011, we held the first international scikit-learn sprint in Granada, on the side of the NIPS conference. That was a while ago, and I haven’t found time to blog about it, maybe because I was too busy merging in the code produced :). Here is a small report from my point of view. Better late than never.

Participants from all over the globe

This sprint was a big deal for us, because for the first time, thanks to sponsor money, we were able to fly contributors from overseas and meet the team in person. For the first time I was able to see the faces behind many of the fantastic people that I knew only from the mailing list.

I really think that we must thank our sponsors, Google and tinyclues, but also The PSF, that is in particular Jesse Noller but especially Steve Holden, whose help was absolutely instrumental in getting sponsor money. This money is what made it possible to unite a good fraction of the team, and it opened the door to great moments of coding, and more.

Producing code lines and friendship

An important aspect of the sprint for me was that I really felt the team being united. Granada is a great city and we spent fantastic moments together. Now when I review code, I can often put a face on the author of that code and remember a walk below the Alhambra or an evening in a bar. I am sure it helps reviewing code!

Was it worth the money?

I really appreciate that the sponsors did not ask for specific returns on investment beyond acknowledgments, but I think that it is useful for us to ask the question: was it worth the money? After all, we got around $5000, and that’s a lot of money. First of all, as a side effect of the sprint, people who had invested a huge amount of time in a machine learning toolkit without asking anything in return got help to go to a major machine learning conference.

But was there a return on investment in terms of code? If you look at the number of lines of code modified weekly (figure on the right), there is a big spike in December 2011. That’s our sprint! Importantly, there is still a lot of activity in the months following the sprint. This is actually unusual, as active development happens more in the summer break than during the winter, when our developers are busy working on papers or teaching.

The explanation is simple: we were thrilled by the sprint. Overall, it was incredibly beneficial to the project. I am looking forward to the next ones.

3 Google summer of code for scikit-learn and more…

2012-04-23T22:25:00+02:00

The scikit-learn got 3 students accepted for the Google summer of code.

Imanuel Bayer will work on making our sparse linear models, for regression and classification, faster. His proposal: Optimizing sparse linear models using coordinate descent and strong rules.
David Marek will implement multi-layer perceptrons for the scikit. His proposal: Multilayer Perceptron
Vlad Niculae will work on speeding up the library in general, catching all the low hanging fruits, and the ones a bit higher. His proposal: Need for scikit-learn speed

In addition, other related projects have exciting projects, for instance **statsmodels**:

Divyanshu Bandil: Extension of Linear to Non Linear Models in Statsmodels Python module
Alexandre Crayssac: estimating system of equations
Justin Grana: empirical Likelihood in Statsmodels
Georgi Panterov: nonparametric estimation

and Cython:

Philip Herron: pxd generation using gcc-python-plugin
Mark Florisson: Fast Numerical Computing with Cython

finally, in Pandas:

Vytautas Jancauskas: Plots in pandas

Congratulations to all of the students. This is going to be an exciting summer.

Want features? Just code

2012-03-08T22:46:00+01:00

Somebody just sent an email on a user’s mailing list for an open-source scientific package entitled “Feature foo: why is package bar not up to the task?”. To quote him:

Is there ANY plan for having such a module in package bar?? I think (personally) that this is a MUST DO. This is typically the type of routines that I hear people use in e.g., idl etc. If this could be an optimised, fast (and easy to use) routine, all the better.

As someone who spends a fair amount of time working on open source software, I hear such remarks quite often. I am finding it harder and harder not to react negatively to these emails. Now I cannot consider myself as a contributor to package bar, and thus I can claim that I am not taking the comment personally.

Why aren’t packages up to the task? Well, the answer is quite simple: because they are developed by volunteers who do it in their spare time, late at night too often, or by companies that put some of their profits into open source rather than into locking down a market. 90% of the time, the reason the feature isn’t as good as you would want is lack of time.

I personally find that suggesting that somebody else should put more of the time and money they are already giving away in improving a feature that you need is almost insulting.

I am aware that people do not realize how small the group of people that develop and maintain their toys is. Borrowing the figure below from Fernando Perez’s talk at Euroscipy, the number of people that do 90% of the grunt work to get the core scientific Python ecosystem going is around two handfuls:

I’d like to think that this recruitment problem is a lack of skill set: users that have the ability to contribute are just too rare. This is not entirely true: there are scores of skilled people on the mailing lists. The poster himself mentioned in his email that he was developing a package. I personally started contributing not knowing anything about software development. I struggled, I did the grunt work like maintaining wikis, answering questions on mailing lists, and writing documentation. These easier tasks were useful to the community, I think, but most importantly, they taught me a lot because I was investing energy in them.

Note

If people want things to improve, they will have more successes sending in pull requests than messages on mailing list that sound condescending to my ears.

I hope that I haven’t overreacted too badly :), that email set me off. That said, I am not sure that people realize how much they owe to the open source developers breaking their backs on the packages they use.

All credit for images goes to Fernando Perez

Book review: NumPy 1.5 Beginner’s guide

2012-01-10T08:57:00+01:00

Packt Publishing sent me a copy of NumPy 1.5 Beginner’s guide by Ivan Idris.

The book actually covers more than only numpy: it is a full introduction to numerical computing with Python. The table of contents is the following:

NumPy Quick Start
Beginning with NumPy Fundamentals
Get into Terms with Commonly Used Functions
Convenience Functions for Your Convenience
Working with Matrices and ufuncs
Move Further with NumPy Modules
Peeking Into Special Routines
Assure Quality with Testing
Plotting with Matplotlib
When NumPy is Not Enough: SciPy and Beyond

The book is easy to read, as it requires no specific expertise other than knowing basic Python programming. It is full of examples and exercises, which is really great for learning. I find the style of the author, Ivan Idris, particularly amusing and relaxing, engaging the reader with questions, challenges, or even jokes (“Have a go hero”).

With regard to the formatting and the print, the book is written in large fonts, with sectioning information, tips and exercises clearly standing out.

It is full of practical information, such as how to install the software, or where to get help. Finally, one thing that I appreciated is that the examples are typed in IPython. Each time I teach, I like to use IPython, because it is full of features to help plotting, debugging and profiling numerical code. The book even has a little introduction to some useful IPython features.

After an introduction to the work flow, the book explores array manipulation such as creation or reshaping, followed by some simple numerics and the battery of array-based operations on functions and polynomials. Then it presents linear algebra and signal processing basics (FFT). It also covers the financial functions that are present in numpy and mentions testing, which is very important to achieve quality code. The book finishes with matplotlib and scipy, two modules that are important to know to go further.

The examples are mostly drawn from statistics or financial applications, such as computing running averages on stock quotes. Basic math explanations, such as the definition of the Moore-Penrose pseudo-inverse, are given when needed.

To conclude, I enjoyed this book and I think that it is a nice addition to my library. It delivers exactly what its title promises: it is well-suited for beginners wanting to learn numpy. On the other hand, I would not recommend it as reference material, or as a book to learn more general scientific or numerical computing with Python.

Joblib beta release: fast compressed persistence + Python 3

2012-01-07T19:27:00+01:00

Joblib 0.6: better I/O and Python 3 support

Happy new year, everyone. I have just released Joblib 0.6.0 beta. The highlights of the 0.6 release are a reworked enhanced pickler, and Python 3 support.

Many thanks go to the contributors to the 0.5.X series (Fabian Pedregosa, Yaroslav Halchenko, Kenneth C. Arnold, Alexandre Gramfort, Lars Buitinck, Bala Subrahmanyam Varanasi, Olivier Grisel, Ralf Gommers, Juan Manuel Caicedo Carvajal, and myself). In particular Fabian made sure that Joblib worked under Python 3.

In this blog post, I’d like to discuss a bit more the compressed persistence engine, as it illustrates well key factors in implementing and using compressed serialization.

Fast compressed persistence

One of the key components of joblib is its ability to persist arbitrary Python objects, and read them back very quickly. It is particularly efficient for containers that do their heavy lifting with numpy arrays. The trick to achieving great speed has been to save the numpy arrays in separate files, and load them via memmapping.
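The idea of writing a raw numeric buffer to its own file and then memory-mapping it can be sketched with the standard library alone. This is only an illustration under assumptions: joblib works with numpy arrays and its own file layout, whereas the snippet below uses the `array` and `mmap` modules as stand-ins, and the file name is made up.

```python
import mmap
import os
import tempfile
from array import array

# Stand-in for a large numeric array (joblib itself works with numpy arrays).
data = array("d", (float(i) for i in range(100_000)))

# Persist the raw buffer in its own file, separate from any metadata.
path = os.path.join(tempfile.mkdtemp(), "array.bin")  # hypothetical file name
with open(path, "wb") as f:
    data.tofile(f)

# "Load" by mapping the file: the OS only reads pages once they are accessed.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    values = memoryview(mapped).cast("d")  # typed view on the map, no copy
    n_values = len(values)
    first, last = values[0], values[-1]
    values.release()  # the view must be released before the map is closed
    mapped.close()
```

Creating the map is nearly instantaneous whatever the file size; the actual disk reads happen when elements are accessed.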

However, one drawback of joblib is that the caching mechanism may end up using a lot of disk space. As a result, there is strong interest in having compressed storage, provided it doesn’t slow down the library too much. Another use case that I have in mind for fast compressed persistence is implementing out-of-core computation.

There are some great compressed I/O libraries for Python, for instance Pytables. You may wonder why the need to code yet another one. The answer is that joblib is pure Python, depending only on the standard library (numpy is optional), but also that the goal here is black-box persistence of arbitrary objects.

Comparing I/O speed and compression to other libraries

Implementing efficient compressed storage was a bit of a struggle and I learned a lot. Rather than going into the details straight away, let me first discuss a few benchmarks of the resulting code. Benching such a feature is very hard: first because you are fighting with the disk cache, second because the performance depends very much on the data at hand (some data compress better than others), and last because there are three interesting metrics: disk space used, write speed, and read speed.

Dataset used - I chose to compare the different strategies on some datasets that I work with, namely the probabilistic brain atlases MNI 1mm (62Mb uncompressed) and Juelich 2mm (105Mb uncompressed). Whether the data is represented as a Fortran-ordered array, or a C-ordered array is important for the I/O performance. This data is normally stored to disk compressed using the domain-specific Nifti format (.nii files), accessed in Python with the Nibabel library.

Libraries used - I benched different compression strategies in joblib against Nibabel’s Nifti I/O, compressed or not, and against using Pytables to store the data buffer (without the meta-information). Pytables exposes a variety of compression strategies, with different speed compromises. In addition, I benched numpy’s builtin savez_compressed.

I would like to stress that I am comparing a general purpose persistence engine (joblib) to specific I/O libraries either optimized for the data (Nifti), or requiring some massaging to enable persistence (pytables).

Comparing to other libraries

Actual numbers can be found here.

Take home messages - The graphs are not crystal-clear, but a few tendencies appear:

Pytables with LZO or blosc compression is the king of the hill for read and write speed.
I/O of compressed data is often faster than with uncompressed data for a good compression algorithm.
Joblib with Zlib compression level 1 performs honorably in terms of speed with only the Python standard library and no compiled code.
Read time of memmapping (with nibabel or joblib) is negligible (it is tiny on the graphs), however the loading time appears when you start accessing the data.
Passing in arrays with a memory layout (Fortran versus C order) that the I/O library doesn’t expect can really slow down writing.
Compressing with Zlib compression-level 1 gets you most of the disk space gains for a reasonable cost in write/read speed.
Compressing with Zlib compression-level 9 (not shown on the figures) doesn’t buy you much in disk space, but costs a lot in writing time.
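The trade-off between compression level, disk space and write time can be explored with a few lines of standard-library code. This is a rough sketch, not the benchmark code from the post: the data is a synthetic, highly redundant ramp (an assumption, standing in for a smooth array), and timings will vary with the machine.

```python
import time
import zlib
from array import array

# Synthetic stand-in for a smooth, redundant array: a slowly varying ramp.
raw = array("d", (float(i // 100) for i in range(200_000))).tobytes()

results = {}
for level in (1, 3, 9):
    start = time.perf_counter()
    compressed = zlib.compress(raw, level)
    elapsed = time.perf_counter() - start
    # The round trip must give back exactly the original bytes.
    assert zlib.decompress(compressed) == raw
    results[level] = (len(compressed), elapsed)

for level, (size, elapsed) in results.items():
    print(f"level {level}: {size / len(raw):.1%} of original size, "
          f"{elapsed * 1e3:.1f} ms")
```

On real data the picture depends heavily on the content, so it is worth running such a comparison on your own arrays before settling on a level.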

Benching datasets richer than pure arrays

The datasets used so far are pretty much composed of one big array, a 4D smooth spatial map. I wanted to test on more datasets, to see how performance varied with data type and richness. For this, I used the datasets of the scikit-learn, real-life data of various nature, described here:

20 news - 20 usenet news group: this data mainly consists of text, and not numpy arrays.
LFW people - Labeled faces in the wild, many pictures of different people’s faces.
LFW pairs - Labeled faces in the wild, pairs of pictures for each individual. This is a high entropy dataset, it does not have much redundant information.
Olivetti - Olivetti dataset: centered pictures of faces.
Juelich(F) - Our previous Juelich atlas
Big people - The LFW people dataset, but repeated 4 times, to put a strain on memory resources.
MNI(F) - Our previous MNI atlas
Species - Occurrence of species measured in Latin America, with a lot of missing data.

Actual numbers can be found here.

What this tells us - The main message from these benchmarks is that datasets with redundant information, i.e. that compress well, give fast I/O. This is not surprising. In particular, good compression can give good I/O on text (20 news). Another result, more of a sanity check, is that compressed I/O on big data (Big people) works as well as on smaller data. Earlier code would start to swap. Finally, I conclude from these graphs that compression levels from 1 to 3 buy you most of the gains for reasonable costs, and that going up to 9 is not recommended, unless you know that your data can be compressed a lot (Species).
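The link between redundancy and compressibility is easy to check on toy data. The sketch below uses made-up payloads (a repeated sentence versus random bytes), not the datasets benchmarked above, and the helper name `compression_ratio` is hypothetical.

```python
import os
import zlib

n_bytes = 200_000
# Redundant data: the same sentence repeated over and over.
redundant = (b"the quick brown fox jumps over the lazy dog " * 5000)[:n_bytes]
# High-entropy data: random bytes carry no redundancy to exploit.
random_bytes = os.urandom(n_bytes)


def compression_ratio(payload, level=1):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(payload, level)) / len(payload)


ratio_redundant = compression_ratio(redundant)
ratio_random = compression_ratio(random_bytes)
print(f"redundant: {ratio_redundant:.3f}, random: {ratio_random:.3f}")
```

The random payload barely shrinks at all, so every byte still has to go through the disk, while the redundant one is reduced to a small fraction of its size.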

Lessons learned

I’ll keep this paragraph short, because the information is really in joblib’s code and comments. Don’t hesitate to have a look, it’s BSD-licenced, so you are free to borrow what you please.

Memory copies, of arrays, but also of strings and byte streams can really slow you down with big data.
To avoid copies with numpy arrays, fully embrace numpy’s strided memory model. For instance, you do not need to save arrays in C order, if they are given to you in a different order. Accessing the memory in the wrong striding direction explains the poor write performance of pytables on Fortran-ordered Juelich.
When dealing with the file system, the OS makes so much magic (e.g. prefetching) that clever hacks tend not to work: always benchmark.
Depending on the size of the data, it may be more efficient to store subsets in different files: it introduces ‘chunks’ that avoid filling up the memory too much (parameter cache_size in joblib’s code). In addition, data of a same nature tends to compress better.
The I/O stream or file object interfaces are abstractions that can hide the data movement and the creation of large temporaries. After experiments with GZipFile and StringIO/BytesIO I found it more efficient to fall back to passing around big buffer objects, numpy arrays, or strings.
For reasons 4 and 5, I ended up avoiding the gzip module: raw access to zlib with buffers gives more control. This explains a good part of the differences in read speed for pure arrays with numpy’s savez_compressed.
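The last points, working on chunks and calling zlib directly on buffers rather than going through a file-object layer, can be sketched as follows. This is an illustration only, not joblib's implementation: the function names and the `chunk_size` parameter are hypothetical (it is not joblib's cache_size).

```python
import zlib


def compress_chunks(buffer, chunk_size=1 << 20, level=1):
    """Compress a bytes-like object chunk by chunk with raw zlib calls."""
    view = memoryview(buffer).cast("B")  # slicing a memoryview does not copy
    compressor = zlib.compressobj(level)
    pieces = []
    for start in range(0, len(view), chunk_size):
        pieces.append(compressor.compress(view[start:start + chunk_size]))
    pieces.append(compressor.flush())
    return pieces


def decompress_chunks(pieces):
    """Inverse of compress_chunks: rebuild the original bytes."""
    decompressor = zlib.decompressobj()
    out = [decompressor.decompress(piece) for piece in pieces]
    out.append(decompressor.flush())
    return b"".join(out)


payload = bytes(range(256)) * 20_000  # about 5 MB of redundant bytes
pieces = compress_chunks(payload)
restored = decompress_chunks(pieces)
```

Feeding memoryview slices to the compressor keeps the input side free of temporaries; the final `b"".join` still allocates the output once, which a real implementation might avoid by writing each piece straight to disk.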

One of my conclusions for joblib, is that I’ll probably use Pytables as an optional backend for persistence in a future release.

Details on the benchmarks

These benchmarks were run on a Dell Latitude D630 laptop. That’s a dual-core Intel Core2 Duo box, with 2M of CPU cache.

The code for the benchmarks below can be found on a gist.

Thanks

I’d like to thank Francesc Alted for the very useful feedback he gave on this topic. In particular, the following thread on the pytables mailing-list may be of interest to the reader.

Scikit-learn NIPS 2011 sprint: international thanks to our sponsors

2011-11-18T14:47:00+01:00

The NIPS conference: time for a sprint. The NIPS conference, one of the major conferences in machine learning, is hosted in Granada this year. I believe that it is the first time that it is hosted in Europe. As many of the scikit-learn developers are part of the wider NIPS community, but also many live in Europe, we jumped on the occasion to organize a truly international sprint: the NIPS 2011 scikit-learn sprint.

Finding money. As often with open source development, a lot of our contributors are young people, investing their free time outside of any request from their hierarchy. In such a situation, it can be hard to find travel money. So we started looking for sponsors. We needed to find a decent sum of money, as we were flying people in from places such as the West coast of the US, or even Japan. The good news is that we found money, and between supervisors pitching in, universities giving travel grants, and our generous sponsors, there will be an impressive list of contributors from all over the world at the sprint.

Thanks to our sponsors. The first people that we need to thank are Google, who gave us a sizable sponsorship, and the PSF, who made Google’s sponsorship possible through their accounting and sprints programs. We also need to thank our other sponsor, namely Tinyclues. Thanks to these sponsors, and additional investment from many universities and research groups, we have been able to gather a total of 12 contributors in Granada, a handful coming from overseas. Also, we are indebted to the University of Granada, and the Gnu/Linux Granada Group (GGG), who are providing hosting for the sprint, as well as Régine Bricquet, from INRIA, who did a lot of the trip planning for the sponsored people.

I am very much looking forward to the sprint. It will be the first time that I meet many of the contributors in real life, and judging by the warmth of the on-line exchanges, it will be a great moment. Besides, Granada is known to be a lively and historical city.

If you are around and want to join us, to work on Python in machine learning, send us a mail on the mailing list.

Cython example of exposing C-computed arrays in Python without data copies

2011-09-28T23:42:00+02:00

Some advice on passing arrays from C to Python avoiding copies. I use Cython as I have found the code to be more maintainable than hand-written Python C-API code.

I found out that there was no self-contained example of creating numpy arrays from existing data in Cython. Thus I created my own. The full code with readme, build and demo scripts is available on a gist. Here I only give an executive summary.

The core functionality is implemented by the PyArray_SimpleNewFromData function of the C API of numpy that can create an ndarray from a pointer to the data, a simple data type, and the shape of the data. The Cython file just builds around that function:
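The Cython source itself lives in the gist and is not reproduced here. As a loose pure-Python analogy of the same idea (wrapping memory that already exists, given a pointer-like buffer, an element type and a shape, without copying), one can use `ctypes` and `memoryview` from the standard library. This is not the author's code and does not call numpy's C API; the buffer contents and sizes are made up for the illustration.

```python
import ctypes

# A C-style buffer of 6 doubles, standing in for data computed by C code.
n_rows, n_cols = 2, 3
c_buffer = (ctypes.c_double * (n_rows * n_cols))(*range(n_rows * n_cols))

# Wrap the existing memory: first as raw bytes, then as a typed 2D view.
# Neither cast copies the data.
view = memoryview(c_buffer).cast("B").cast("d", shape=[n_rows, n_cols])

# A write through the C buffer is visible from the view: same memory.
c_buffer[4] = 42.0
shared_value = view[1, 1]
```

As with `PyArray_SimpleNewFromData`, the wrapper does not own the memory, so the underlying buffer has to stay alive for as long as the view is used.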

Python at scientific conferences

2011-09-11T15:52:00+02:00

Top notch scientific conferences are starting to add Python tracks to their program. This is good news. Indeed, scientific Python conferences (namely Scipy, EuroSciPy and Scipy India) are doing a great job of bringing together people who have already heard about Python for science, but we need to reach out to specific Python communities to maximize impact.

ESCO 2012 - European Seminar on Coupled Problems

ESCO 2012 is the 3rd event in a series of interdisciplinary meetings dedicated to computational science challenges in multi-physics and PDEs.

I was invited to ESCO last year. It was an absolute pleasure, because it is a small conference that is very focused on discussions. I learned a lot and could sit down with people who code top notch PDE libraries such as FEniCS and have technical discussions. Besides, it is hosted in the historical brewery where the Pilsner was invented. Plenty of great beer.

Application areas - Theoretical results as well as applications are welcome. Application areas include, but are not limited to: Computational electromagnetics, Civil engineering, Nuclear engineering, Mechanical engineering, Computational fluid dynamics, Computational geophysics, Geomechanics and rock mechanics, Computational hydrology, Subsurface modeling, Biomechanics, Computational chemistry, Climate and weather modeling, Wave propagation, Acoustics, Stochastic differential equations, and Uncertainty quantification.

Minisymposia

Multiphysics and Multiscale Problems in Civil Engineering
Modern Numerical Methods for ODE
Porous Media Hydrodynamics
Nuclear Fuel Recycling Simulations
Adaptive Methods for Eigenproblems
Discontinuous Galerkin Methods for Electromagnetics
Undergraduate Projects in Technical Computing

Software afternoon - An important part of each ESCO conference is a software afternoon featuring software projects by participants. Any computational software that has reached a certain level of maturity can be presented, i.e., it is used outside of the author’s institution, and it has a web page and user documentation. If you would like to present your software project, let us know soon.

Proceedings - For each ESCO we strive to reserve a special issue of an international journal with impact factor. Proceedings of ESCO 2008 appeared in Math. Comput. Simul., proceedings of ESCO 2010 in CiCP and Appl. Math. Comput. Proceedings of ESCO 2012 will appear in Computing.

Important Dates

December 15, 2011: Abstract submission deadline.
December 15, 2011: Minisymposia proposals.
January 15, 2012: Notification of acceptance.

PyHPC: Python for High performance computing

If you are doing super computing, SC11, the Super Computing conference, is the reference conference. This year there will be a workshop on high performance computing with Python: PyHPC.

At the scipy conference, I was having a discussion with some of the attendees on how people often still do process management and I/O with Fortran in the big computing environment. This is counterproductive. However, as success stories of supercomputing folks using high-level languages are not advertised, this is bound to stay. Come and tell us how you use Python for high performance computing!

Topics

Python-based scientific applications and libraries
High performance computing
Parallel Python-based programming languages
Scientific visualization
Scientific computing education
Python performance and language issues
Problem solving environments with Python
Performance analysis tools for Python applications

Papers - We invite you to submit a paper of up to 10 pages via the submission site. Authors are encouraged to use IEEE two column format.

Important Dates

Full paper submission: September 19, 2011
Notification of acceptance: October 7, 2011
Camera-ready papers: October 31, 2011

Hiring a junior developer on the scikit-learn

2011-09-03T07:26:00+02:00

Once again, we are looking for a junior developer to work on the scikit-learn. Below is the official job posting. As a personal remark, I would like to stress that this is a unique opportunity to be paid for two years to work on learning and improving the scientific Python toolstack.

Job Description

INRIA is looking to hire a young graduate on a 2-year position to help with the community-driven development of the open source machine learning in Python library, scikit-learn. The scikit-learn is one of the major machine-learning libraries in Python. It aims to be state-of-the-art on mid-size to large datasets by harnessing the power of the scientific Python toolstack.

Speaking French is not a requirement, as it is an international team.

Requirements

Programming skills in Python and C/C++
Understanding of quality assurance in software development: test-driven programming, version control, technical documentation.
Some knowledge of Linux/Unix
Software design skills
Knowledge of open-source development and community-driven environments
Good technical English level
An experience in statistical learning or a mathematical-oriented mindset is a plus
We can only hire a young graduate who has received a master’s or equivalent degree at most a year ago.

About INRIA

INRIA is the French computer science research institute. It is recognized worldwide as one of the leading research institutions and has a strong expertise in machine learning. You will be working in the Parietal team that makes a heavy use of Python for brain imaging analysis.

Parietal is a small research team (around 10 people) with an excellent technical knowledge of scientific and numerical computing in Python as well as a fine understanding of algorithmic issues in machine learning and statistics. Parietal is committed to investing in scikit-learn.

Working at Parietal is a unique opportunity to improve your skills in machine learning and numerical computing in Python. In addition, working full time on the scikit-learn, a very active open-source project, will give you premium experience of open source community management and collaborative project development.

Contact Info:

Technical Contact: Bertrand Thirion
E-mail contact: bertrand dotnospam thirion atnospam inria dotnospam fr
HR Contact: Marie Domingues
E-mail Contact: marie dotnospam domingues atnospam inria dotnospam fr
No telecommuting

Euroscipy 2011: early bird deadline soon

2011-07-22T00:44:00+02:00

Euroscipy 2011: register now for early bird prices

The deadline for early-bird registration at the Euroscipy conference is this Sunday. Beyond this deadline prices will double. Register now to get a great deal.

To register, simply go to www.euroscipy.org, log in using the link on the top right, and follow the ‘Register now for the conference’ link on the top left.

The conference is a great opportunity to learn the intricacies of numerical and scientific computing in Python. You can register for the tutorials in an intro track, that will take you from beginner to fully autonomous user, or for an advanced track, to learn from the experts topics such as image processing, GPU computing, machine learning or optimization. The tutorials are a fairly unique occasion to improve your skills, as you will seldom get such a concentration of experts.

Some program highlights

After the 2 days of tutorials, the conference itself will host 2 keynotes: one by Marian Petre, of the Open University, well-known for her empirical studies of software development, and another one by Fernando Perez, a pioneer in scientific computing in Python and the original author of IPython.

Glancing at the program, we can see how a wide range of topics are touched:

pure computer-science topics, such as concurrent programming
traditional hard sciences, such as multi-physics
simulation of complex systems, for instance network modeling in epidemiology
or novel application of quantitative large-data processing, as in legal research

The variety of the topics illustrates what is for me one of the greatest benefits of the scipy conferences: they form a forum to exchange ideas and techniques to find new solutions to scientific, numerical and data analysis problems. Unlike pure computer science conferences, they sit at the frontier of applications and bleeding edge computer developments, because these people really use the tools presented to solve their problems.

In addition to this rich program, we will have 2 days of sprints before the conference as well as 2-day-long satellite conferences on Python in Physics and NeuroScience after the conference. This is how what used to be a small conference can now be a full 8-day event if you order all the extras.

Hiring a junior engineer on the scikit-learn

2011-05-14T19:10:00+02:00

The scikit-learn is a Python module for machine learning. The project builds on the scientific and numerical tools of the scipy community to provide state-of-the-art data analysis tools. It is developed by a community of open source developers to which my research team (Parietal, INRIA) contributes a lot, and is a thriving project. Its mailing list fosters many discussions on code and machine learning topics, it has very detailed documentation, and a tight release cycle.

Although scikits.learn is mostly developed by volunteers, INRIA has funded a two-year position for a junior engineer (currently Fabian Pedregosa) to help with the core management and integration of the project. This funding is coming to an end in fall 2011 [*]. The good news is that we have been allocated new funding to hire an engineer on the scikit.

We are thus looking to hire a junior engineer for a 2-year position to work on the scikits.learn at INRIA in Saclay, near Paris. The position is only available to candidates that have received a masters or equivalent degree at most a year ago — this is non negotiable: we cannot hire more senior candidates.

We are looking for a developer with good open-source project management skills: the successful candidate will review and merge patches, ensure the quality of the scikit, make releases, coordinate development on the mailing list and on github. Good knowledge of Python and its scientific ecosystem is expected. A mathematical or computer-science oriented mindset is a plus, as the project involves working with machine learning algorithms.

The candidate should be willing to relocate to work daily in the Neurospin brain research institute in which Parietal is located. Knowledge of French is not required, as the team and the institute are very international. Non-EU candidates are welcome, but the hiring process will take longer.

You will be working in a very stimulating environment. You will be employed by INRIA, the French computer science research institute. As such, you will benefit from the expertise of the institute’s researchers and engineers. Team members contribute to various scientific Python libraries (in addition to scikits.learn, Mayavi, nipy, joblib). In addition, you will be working in a brain research institute, in collaboration with leading methods researchers and neuroscientists that use machine learning to gain new insights on brain processes.

To apply: To apply, you need to prepare a CV and a motivation letter. The deadline for applications is mid June, but we will be selecting candidates and conducting interviews before. Don’t send me CVs. The formal job description, as well as instructions to apply can be found on this page. The page is mostly in French, sorry; use Google translate if you don’t understand. At the bottom of the page you will find a link to apply.

[*] Fabian will most probably stay with us to do a PhD on analysis of large brain functional imaging datasets.

EuroScipy: the program is filling up, and the submission deadline nearing

2011-04-30T17:21:00+02:00

Submission deadline May 8th

The deadline for the call for presentations for the EuroScipy conference is on May 8th. There is only a week and a half left.

EuroScipy will be held in Paris, August 25-28. It is the European meeting for users of Python in scientific and numerical-intensive applications. It strives to bring together both users and developers of scientific and numerical tools, as well as academic research and state of the art industry. The conference will host 2 days of tutorials and 2 days of technical presentations.

Lately, numerical computing in Python has started reaching a much wider audience than the traditional academic-oriented audience. This is partly because Python is making its way in major engineering companies, but also because more and more industries are processing large amounts of data, and find precious data analytics tools in the Scipy community. In this spirit, this year there will be a tutorial on machine learning with Python.

Poster session

Last year, the organizing committee had to refuse a large fraction of the proposals, because there were not enough slots available. We had considered organizing a poster session, but the logistics were too challenging for our limited resources. Indeed, EuroSciPy still tries to be organized as a hackers and coders conference, rather than an industry-level one. For instance, we keep the prices to a minimum, in order to make it easy for young people traveling on their own budget to join us. Getting 200 attendees, as we did last year, did strain our small organization committee.

This year, we had the unexpected backing of the physics department of the ENS. They were extremely enthusiastic about Python, which they now use for teaching and research. This made me really happy, as this is where I studied. They offered help, and in particular help with the local organization.

Thus I am able to announce that thanks to the physics department of the ENS, we will be able to host a poster session!

An exciting program shaping up

The program is starting to shape up, and it is looking really good, in my eyes.

Keynotes

We will be having two keynote speakers, one directly from the SciPy community, Fernando Perez, and one probably less known to this community, Marian Petre.

Marian Petre: Marian is the director of the Center for Research in Computing, at the Open University. She is interested in empirical studies of software development. I am very excited to hear a bit more about the often-forgotten human factor that goes behind every coding job, big or small. In my experience scientific computing and computational sciences pay a hefty price because they don’t acknowledge well-enough the gap between good ideas and tractable code.
Fernando Perez: Fernando is a research scientist in neuroscience at UC Berkeley. Before that, he was successively a physicist and a mathematician. He has been an early advocate of the scientific Python ecosystem, in addition to being the creator of IPython. His vision has always been oriented toward finding a computing environment that makes scientific creativity easier.

Tutorials

The tutorial program is now final, and can be seen on the schedule. Like last year, we will have two tracks:

An introductory track, designed as a two-day course addressing the different aspects of the Python language and the scientific computing modules to bring beginners up to full speed. At the end of the two days, attendees should be able to solve simple computational problems using Python alone.
An advanced track, in which experts of various aspects of scientific and numerical computing in Python share their knowledge in 2-hour-long tutorials.

Python in NeuroScience satellite

The two days following the conference, there will be a satellite meeting on the use of Python in neuroscience. It will be a small and more focused event, in which neuroscientists will be able to exchange on technical aspects of computation and data management in Python. Hopefully it will foster interesting discussions and collaborations. If you are interested, you can submit a talk proposal for this satellite meeting here.

Come and join us at EuroScipy in Paris, August 25-28. Paris is a great city. The SciPy community is a friendly one.

Scikit-learn sprint on April 1st

2011-03-26T13:27:00+01:00

The scikit-learn team is organizing a sprint on April 1st (that is, next Friday). Join us in Paris, Boston, or on IRC!

With the rise of the data sciences, the scikit-learn, a BSD-licensed Python package for machine learning, is becoming an asset for more and more endeavors. Machine learning has traditionally been considered as very technical and inaccessible to the non-mathematician. We are aiming to break this barrier.

The sprint will be focused on pragmatic down-to-earth improvements in the scikit. Our goal is to make it easy for people to contribute. A list of tasks and organization details can be found on the sprint planning wiki page. Amongst other things, we’ll be working on:

integrating new learning algorithms, in particular merging in the many excellent pull requests that we have: hierarchical clustering, data transforming using linear discriminant analysis, multinomial naive bayes classifier …
testing and logging framework,
**better parallel computing support**,
and many other itches to scratch, as it is a community-driven event.

Come and join us. It will be fun, and it’s an occasion to learn new tricks.

Windows binaries for the scientific Python ecosystem

2011-02-15T09:02:00+01:00

I just realized yesterday that Christoph Gohlke has a repository of binary installers (.exe) for Windows 32 and 64bit with almost all the scientific Python packages that you can dream of:

numpy, scipy and matplotlib, of course (compiled with the MKL)
cython
the ETS, including Mayavi
VTK, with the Python bindings
a variety of scikits (including the scikit-learn, hurray!)

These binaries are incredibly useful, as building all these packages under Windows does require some skills, and a compiler. They complement very well fully-fledged scientific Python distributions such as EPD or Python(x,y), as they can be installed on top of an existing Python installation.

I should say that I discovered this thanks to a long email discussion in which Christoph Gohlke and Yakub Nowacki helped me debug a nasty Mayavi bug on Windows 64bit that I couldn’t reproduce as I don’t have a Windows 64bit available. That was particularly helpful.

Interested in parallel computing and statistics? We are looking for a post-doc

2011-01-30T22:30:00+01:00

My research group is kick-starting a new project, called AzureBrain, to do computational analysis of large brain imaging and genetics population-wise data. One of the goals of the project is to harness the power of grid computing to do statistical learning on fMRI data, finding features in an individual’s brain images that can be predicted by his or her genome. The medical applications cover the wide scope of genetically-related brain pathologies, such as autism.

Want to work in a dynamic and exciting environment, using Python to solve challenging data analysis problems? We are looking for a post-doctoral fellow to hire in spring/beginning of summer. The ideal candidate would have a strong background in computational statistics or machine learning, as well as parallel computing; however, we will consider any candidate with good experience in one or the other and a strong desire to learn.

You would be employed by INRIA, the leading computing research institute in France. We are a team of computer scientists specialized in image processing and statistical data analysis, integrated in one of the top French brain research centers, NeuroSpin, south of Paris. We work mostly in Python. The team includes core contributors to the scikit-learn project, for machine learning in Python, and the nipy project, for NeuroImaging in Python.

Below follows a summary of the official job announcement. Please contact Bertrand Thirion, (first name _dot_ last name _at_ inria _dot_ fr) if you are interested, referencing the AzureBrain project.

Introduction

Imaging genetic studies linking functional MRI data and Single Nucleotide Polymorphisms (SNPs) data face a dire multiple comparisons issue. In the genome dimension, genotyping DNA chips allow recording several hundred thousand values per subject, while in the imaging dimension a brain image may contain 100k-1M voxels. Finding the brain and genome regions that may be involved in this link entails a huge number of hypotheses, hence a drastic correction of the statistical significance of pairwise relationships, which in turn crucially reduces the sensitivity of statistical procedures that aim at detecting the association. It is therefore desirable to set up techniques that are as sensitive as possible to explore where in the brain and where in the genome a significant link can be detected, while correcting for family-wise multiple comparisons (controlling for false positive rate). Another issue is the computational cost of these procedures, which needs to be addressed with adequate algorithmic and computational tools.
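To give a rough sense of scale, the sketch below counts the pairwise voxel/SNP hypotheses and the per-test threshold that a plain Bonferroni correction would impose. The figure of 500,000 SNPs and the 5% family-wise level are illustrative assumptions (the text only says "several hundred thousand"), and Bonferroni is used here only because it is the simplest family-wise correction, not because it is the procedure planned for the project.

```python
# Order-of-magnitude sketch of the multiple comparisons problem.
# Assumed for illustration: 500,000 SNPs and a 5% family-wise error level.
n_snps = 500_000
alpha = 0.05

for n_voxels in (100_000, 1_000_000):  # range quoted above: 100k-1M voxels
    # One hypothesis per (voxel, SNP) pair
    n_tests = n_voxels * n_snps
    # Bonferroni: each individual test must pass alpha / n_tests for the
    # family-wise error rate to stay below alpha
    threshold = alpha / n_tests
    print("%d voxels: %.0e tests, per-test threshold %.0e"
          % (n_voxels, n_tests, threshold))
```

With thresholds this small, very few pairwise associations can survive, which is why more sensitive procedures are needed.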

Objectives

In this project, we will consider a unique dataset acquired in the Imagen project, an FP6 project that aims at investigating factors of addiction in a population of adolescents; Imagen’s database contains multi-modal neuroimaging as well as genetics and psychological data on about 2000 subjects. This database is hosted and processed at Neurospin and is available for research purposes. The candidate will be in charge of:

Setting up an analysis pipeline (based on code already available to analyze neuroimaging/genetics datasets) to extract and pre-process the relevant data for statistical analysis.
Performing statistical analysis on simulated datasets and sub-parts of the whole database in order to set up the whole computational framework. These procedures will include mass-univariate linear modeling (with peak- and cluster-level tests), regularized multiple regression and a permutation-based assessment framework.
Launching data analysis on a large scale grid and cloud environment, with the help of the Kerdata researchers (see below).
Building the post-analytic framework to ease the interpretation of the results in both neuroimaging and genetics domains.

The analysis framework is based on algorithmic tools developed in C/Python (numpy, scipy and scikit-learn). The candidate will interact i) with researchers of the Parietal team for algorithmic aspects, but also ii) with CEA researchers of Neurospin, who will provide expertise in the genetics domain and iii) with the KerData team (INRIA Rennes) and the Joint MSR-INRIA Research Center (Microsoft Research), which will provide help and massive computation facilities. The project has access to grid/cloud computing facilities to be used in collaboration with INRIA/Kerdata and MSR-INRIA partners.

The expected result is the discovery of correlations between brain activation and genetic information.

Required knowledge and background

The candidate should have at least a basic knowledge of standard statistical concepts. He or she should have a first significant experience in parallel computation and with the Python language. It is important that he or she has some real interest in genetics and/or brain imaging in order to have strong interactions with specialists of these domains. He or she will benefit from the algorithmic tools developed at Parietal and from the database settings and data pre-processing tools developed by Neurospin researchers.

EuroSciPy 2011: the dates are out - Aug 25-28, Paris

2011-01-16T15:57:00+01:00

We have finally been able to settle on final dates and venue for EuroSciPy 2011, the 4th European meeting on Python in Science.

The conference will be held from Thursday August 25th, to Sunday August 28th. The ENS will be hosting the conference once again, right in the center of Paris.

Scientific publication for software development

2011-01-08T22:40:00+01:00

The academic community seems to judge the validity and significance of any contribution by the number of papers published and the number of citations they get. To find funding, to get credit, you have to publish or perish. However, the natural output of software development tends not to be an article (people who confuse articles and documentation do a poor job of both, IMHO).

While I believe that this policy is harmful for the quality of research, I also know that I cannot fight it, and chances are that many others are in my situation. As such, we need to publish scientific papers about the scientific software that we develop (such as Mayavi, or scikit-learn, as far as I am concerned). On the other hand, as an editor of the Scipy conference proceedings, I have found that the process of writing a paper on software work and going through peer review can be greatly beneficial to the software. Indeed, it forces authors to do a thorough review of the prior work, and to clearly identify the purpose of the project. Also, such an article can only be much shorter than a user manual, thus it forces the authors to identify the key concepts of their software, and explain them clearly. As a result, it helps find design and usability flaws and gain insight on how the user manual can be structured.

A major challenge to publishing is that most of the highly-ranked journals tend to disregard software works, unless they are very specific to a scientific problem, which actually makes them less useful to the complete ecosystem. Deeply rooted in the minds of the editors and the reviewers, there tends to be the idea that developing software is easy compared to doing experiments or proofs. In addition, these top-notch scientists are not always the most qualified to judge the quality of software, as they have most often never worked in a major software project. The good news is that this is slowly changing with the creation of software tracks in specialized journals, and the development of new journals focused on scientific software.

Journals for publishing about interdisciplinary scientific software

In my opinion, interdisciplinary scientific software such as numpy, the GSL, octave, scilab, matplotlib, or Fenics, are the most valuable projects, as they provide foundations to build science in the open. The challenges that these projects have to face are not only algorithmic or computational, but also involve providing good user interfaces, or developing and catering for very large communities of users. These problems are considered as solved in a scientific context, as they have all been solved at least once, often quite successfully by commercial products such as Matlab. As a result, it is hard to get some funding for these projects unless there is a political reason behind the funding, and IMHO politics tend to produce bad software. Publishing high-profile articles on interdisciplinary scientific software is thus hard, but critical. For this we need journals that accept software papers, but are not only read by researchers in CS or IT departments.

A couple of years ago, some of us made a review of where it was possible to publish truly wide-scope scientific software, and we found that there was pretty much no option. It’s crazy to see that things have still not changed much, and that a lot of major general-purpose widely-used projects, like the ones I cited above, have never been acknowledged by a publication.

Computing in Science and Engineering: a joint publication between the AIP (American Institute of Physics) and the IEEE, it is a magazine-style journal and it can be seen in many coffee rooms of computational-science departments. Thanks to that it is widely read, but the articles cannot be too technical (which might be a good thing) and there is room for only a few articles.
Open Research Computation (ORC): A newly-created journal, with a focus on making computational research reproducible. As such, it favors papers about open source scientific software with good software-engineering. Open access.

In addition to these software-friendly journals, some large-scope journals on computational science sometimes accept software papers, though software production falls outside their scope:

Journal of Computational Science: a very multidisciplinary journal.
SIAM Journal on Scientific Computing (SISC): a journal of the SIAM (society for industrial and applied mathematics), thus with a focus on engineering-type applications.

Journals for publishing domain-specific scientific software

It is usually easier to publish a domain-specific software contribution, as you can claim that you have solved a well-identified scientific roadblock. Until recently, it was hard to get such papers in the best journals of a community, but things have been changing.

Computer Physics Communications: for algorithms and packages solving numerical and computational problems related to physics.
Bioinformatics: accepts software papers on biology-related problems.
ACM Transactions On Mathematical Software (TOMS): a journal of the ACM (Association for Computing Machinery), thus with a focus on algorithms.
Journal of Statistical Software: this journal comes from the community of people who wrote the R language. They know that open source scientific software is hard and important. Open access.
Journal of Machine Learning Research (JMLR), Machine Learning Open Source (MLOSS) track: reference journal in the machine learning community, the MLOSS track cares strongly about documentation, packaging and usability of the software. Open access.
Computers & Geosciences: computational geoscience journal that accepts software papers (thanks Michael Aye for the pointer).
Computer Applications in Engineering Education: a journal about education with computers. AFAIK, no special focus on open source or software-engineering quality (thanks Doug Holton for the pointer).
NeuroInformatics and Frontiers NeuroInformatics (open access): two journals on computer-related issues in neuroscience that accept software papers. I have the feeling that the latter is a bit warmer to open source than the former (thanks Andrew Davison for the pointer).
Computers and Electronics in Agriculture: for publishing agriculture-related software (thanks John B. Cole for the pointer).

I should stress that, in my opinion, journals such as PLOS Computational Biology or the Journal of Computational Physics are not great venues for software papers, as they tend to emphasize what I would call proof of principle, and not packaged and maintained software.

I have the feeling that there is need for more communication on scientific software. The list above is, of course, incomplete. If you have extra ideas, please do not hesitate to contact me.

As a conclusion, I would like to point out that conferences are also a good way to advertise scientific software. You may even get approached by the editor of a journal to open the door for a journal article. Last year I was at ESCO, a coupled problems conference, and there was a track on Python in science. All in all the conference was a huge amount of fun, and I learned a lot on practical aspects of numerical methods, given the number of numerical computing geeks that were around. The same community is organizing FEMTEC in Lake Tahoe (California) this year. If you are in any field related to FEM or multiphysics, you should consider it.

Update: added links suggested by Doug Holton, Michael Aye, Andrew Davison, and John B. Cole

ICA versus PCA in the scikit-learn: the value of code over pictures

2010-11-20T16:12:00+01:00

When I was trying to get an intuitive feeling of the difference between Independent Component Analysis (ICA) and Principal Component Analysis (PCA), I wrote a few Python scripts producing some visualizations explaining the difference that have had a bit of success.

During the last sprint on scikit-learn, a machine learning toolkit in Python, we cleaned up the ICA code that I had been using, and we added it to the scikit, along with an example inspired from this earlier toy problem.

While the pictures are not as pretty as the initial ones I had done (because we wanted to keep the example as simple as possible), I am very happy that this discussion is now more than a set of static pictures, but comes with runnable code.

This illustrates very well my feelings on the future of scientific code and scientific research: papers, books, and teaching materials on numerical methods or computational science are greatly enhanced when they come with highly-readable code that illustrates their purpose, because the reader can start asking questions to the algorithm. Hopefully, the documentation of scientific programming toolkits will become the textbooks of tomorrow. We still have a lot of work to do.

It’s funny, I just realized that my vision on software might have been strongly influenced by the fact that my mother, a high-school math teacher, spent endless nights when I was a teenager working on Geoplan, a software for teaching geometry by interaction with figures.

Multitouch with VTK (and MedINRIA and Mayavi)

2010-09-18T09:40:00+02:00

If the videos on this post are not showing, click through to see them.

A colleague of mine, Pierre Fillard, has just integrated multitouch in the next generation of the VTK-based medical imaging software MedINRIA. The nice thing is that it works on an Apple laptop out of the box.

On his blog, he explains how he did this (warning, it involves C++ and VTK programming). He also gives the code for this! Enjoy.

This reminded me of when the Enthought guys had rigged up a large multitouch screen and wired it in Mayavi for 3D plotting, and in chaco for 2D plotting, using only a web-cam, a video projector, and pure Python image-analysis code:

Scikit Learn coding sprint

2010-09-04T17:43:00+02:00

We have been really crap at communicating the next scikit-learn coding sprint. It’s next week!

The coding sprint will take place on 8 and 9 September at INRIA Saclay, near Paris, in room K110 (building K).

For those who cannot make it, it will be possible to participate using the IRC chan (#scikit-learn on irc.freenode.net).

We will start at 9am (Paris time), and a sketch of the planning can be found here. In particular:

More docs! We still need tutorials: feature selection, model selection, cross-validation, etc.
Make the pipeline object really work + illustration in different contexts.
Clean up and doc for bayesian approaches.
Implementation of PCA (fit + transform).
FastICA (adapt the CanICA code)
LDA : Covariance estimators (Ledoit-Wolf) and add transform.
Preprocessing routines (center, standardize) with fit transform.
Anything that you have a particular interest in.

Do not hesitate to send to the mailing list some advice on this (incomplete…) list, and see you next week!

scikit-learn is a Python module for efficient and easy machine learning using scipy and numpy.

Software design for maintainability

2010-08-01T23:47:00+02:00

I have just spent the best part of my Sunday fixing a bug that turned out to be a seemingly-trivial two-liner. Such unpleasant experiences are all too frequent, and weigh a lot on my view of code design.

My stance on code design

I call code design the process of designing the architecture of a piece of software: what are the objects it uses? how do they interact? how is the information passed around?…

My view of code design and software engineering has progressively evolved to favor extreme simplicity over sophistication. I believe that a good programmer should know design patterns, powerful language features, and libraries’ dark corners, and not use them unless absolutely necessary.

Some rules of thumb

Here are some rules that I apply nowadays when writing code that I would like to last (I am aware that some of them go against well-advertised best practices).

Keep it as simple as possible, really! Experimental results have shown that the tractability of a code base goes down as the square of the number of interactions, and thus much quicker than the number of lines in a project. Each time you add a line, think about it: can you make it simpler? If not, you’ll have to find resources to maintain your project, as fixing bugs or adding features will grow harder.
Design for the 80% usecases. In the same vein, a small decrease in the requirements can make your project much simpler [Woodfield1979]. Corner cases and minor usecases should not make the whole project complex and hard to maintain. If you can, give up on what is bringing in complexity. If you cannot, isolate it, and don’t let it sit at the core of your design.
Don’t design for the future. Again the same core idea: don’t start planning for all the usecases, and all the difficulties that you haven’t encountered: you will most certainly design wrong, and chances are that you’ll add complexity that you do not use. Design simple, design cleanly and refactor as you go, based on concrete problems. This is known as the “YAGNI principle”.

Don’t be clever. Each time you do a clever trick, whoever has to read and maintain this code will have to understand it (that person may be you, in a few years). Chances are that they’ll get it wrong and start by losing a lot of time.
Repeating yourself may actually be OK. This is a case of practicality beats purity. Repeating code is really a bad thing in software design, because it leads to an increased number of lines to debug, and tends to hinder reusability. However, adding complexity in order to save a few lines of duplicated code will cost you more in the long run.
Use objects sparingly. Objects are great, but are they always needed? An object with a single method eval can probably simply be implemented by a function. The limitation of objects is that they all have a different behavior. As a result, the users and maintainers of your codebase will first have to understand how all your classes interact before understanding your code. This also means that there is a lot of benefit in making many different classes that have the same interface.
Avoid abstractions and levels of indirection. The more levels of code piled one on top of the other, the more layers your maintainer is going to have to inspect to find where the bug might be. An abstraction hides another object or algorithm. To debug code, chances are that all the black boxes will first have to be opened.
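As a small illustration of the rule on using objects sparingly, here is a sketch contrasting a class that only carries a single eval method with the plain function doing the same job. The names Polynomial and polynomial are made up for the example.

```python
# A class whose only job is to carry a single 'eval' method...
class Polynomial(object):
    def __init__(self, coefs):
        self.coefs = coefs

    def eval(self, x):
        # Horner's scheme, highest-degree coefficient first
        result = 0
        for c in self.coefs:
            result = result * x + c
        return result


# ... can usually be a plain function: nothing to instantiate, and no
# class to understand before reading the calling code.
def polynomial(coefs, x):
    # Horner's scheme, highest-degree coefficient first
    result = 0
    for c in coefs:
        result = result * x + c
    return result


# Both evaluate 2*x**2 + 3*x + 1 at x = 2
value_from_object = Polynomial([2, 3, 1]).eval(2)
value_from_function = polynomial([2, 3, 1], 2)
```

The function version has one fewer concept to learn, which is the whole point of the rule.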

Coding for others to debug

“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” - Brian W. Kernighan

You may think that I am overemphasizing simplicity at the cost of functionality. Well, think about the future of your code. The net is full of unmaintained and abandoned code. If you want your project to grow and have a future, you will probably need people to help you. For a given purpose, the easier the code is to read and debug, the more chances you will have to pick up momentum.

Some external references I like (about software engineering, rather than debugging):

Edmond Lau: Hidden costs that engineers ignore (Read this)
Titus Brown: Writing (Python) Code that Doesn’t Suck
Peter Norvig: Teach yourself programming in 10 years
Paul Stachour and David Collier-Brown: You Don’t Know Jack About Software Maintenance
Greg Wilson: Software carpentry: a course in software engineering

Sprint Scikit learn in Paris

2010-07-23T14:31:00+02:00

We are organizing a coding sprint in Paris on scikit learn, machine learning in Python. The goal of this sprint is to set the API and the general coding guidelines of the scikit to be able to tackle many different statistical learning problems in a consistent framework.

This is why we would like to have people with different problems, applications, and backgrounds to pitch in.

It will be a two-day sprint. Everyone is welcome, so just fill in the doodle, so that we can choose the date.

And do not hesitate to suggest some topics that you would like to be addressed during the sprint, and to discuss them on the mailing-list!

Vincent Michel is organizing the sprint. If you have questions about the sprint, you are welcome to contact me, but please do put him in Cc too.

Simple object signatures

2010-07-16T23:31:00+02:00

A signature pattern

There are many libraries around to specify what I call a ‘signature’ for an object, in other words a list of attributes that define its parameter set. I have heavily used Enthought’s Traits library for this purpose, but the concept is fairly general and can be found e.g. in ORMs (Object Relational Mappers) or web frameworks.

Specification of this interface of parameters may be used to answer a variety of needs:

Typing: in the case of an ORM, to generate UIs, or for better error management, it may be desirable to have some control on the types of certain attributes of an object. In this case, specifying the signature corresponds to laying out a data model for the object.
Reactive programming: using properties to react to changes to attributes, one can fully specify the API of an object in terms of these attributes. This gives a message-passing like programming style that can be very well suited to parallel-computing in particular because it can easily be made thread-safe.
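The reactive style can be sketched with nothing more than Python properties. The Thermostat class below is a made-up example, not code from Traits or any other signature library: setting the attribute is the whole API, and the setter is where the object reacts.

```python
class Thermostat(object):
    """Hypothetical object whose API is expressed through its attributes."""

    def __init__(self, target=20.0):
        self.log = []
        self.target = target  # goes through the property setter below

    @property
    def target(self):
        return self._target

    @target.setter
    def target(self, value):
        # React to the change: here we only record it, but this is where
        # a UI refresh or a recomputation would be triggered.
        self._target = float(value)
        self.log.append(self._target)


thermostat = Thermostat()
thermostat.target = 22  # a plain assignment triggers the reaction
```

Libraries such as Traits add typing and notification machinery on top of this basic idea.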

Signatures for statistical learning objects

Recently, I considered the signature pattern in a new context. In the scikit-learn, we are interested in statistical learning. This entails fitting models to data and often tuning parameters to select a model that fits best (a problem called model selection). Each of our models is an object that implements a couple of key methods to fit to the data and to apply to new data (fit and predict).

The approach that we are currently taking for model selection is (more or less) to generate a list of models with different parameters and fit and test them on the data.
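In plain Python, that approach can be sketched as follows. MeanShrinker and select_model are hypothetical names written for this illustration, not scikit-learn code; only the fit and predict convention comes from the text above.

```python
class MeanShrinker(object):
    """Toy model (hypothetical): predicts a shrunk mean of the targets."""

    def __init__(self, shrinkage=0.):
        self.shrinkage = shrinkage

    def fit(self, X, y):
        self.mean_ = (1 - self.shrinkage) * sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def select_model(models, X_train, y_train, X_test, y_test):
    """Fit every candidate, return the one with the lowest squared error."""
    def error(model):
        y_pred = model.fit(X_train, y_train).predict(X_test)
        return sum((p - t) ** 2 for p, t in zip(y_pred, y_test))
    return min(models, key=error)


# Generate the list of models from a list of parameter values
models = [MeanShrinker(shrinkage=s) for s in (0., .5, .9)]
best = select_model(models, [[0], [1]], [2., 4.], [[2]], [3.])
```

Note that the list of values to try has to be written by hand here, which is exactly what inspecting the objects would avoid.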

A very nice feature would be to find out the parameters to vary simply by inspecting the objects, and such a desire recently got us discussing defining signatures for our objects. I must confess that I am a bit wary as this means either depending on a signature library, or building one. We don’t want to grow our dependencies, and most signature-definition code that I know involves meta-programming tricks to avoid code duplication.

Solving the simple problem: avoiding type checking

Today, I had to bite the bullet, because we were in a situation in which we had to instantiate new models from the existing one during model selection. For technical reasons, using a copy.copy to create these new models was not a great idea, and it was better to have the minimal list of parameters required. Here come signatures again.

After a bit of messing around with the code, I realized that typing information was useless, and most probably harmful, to our immediate goals and that I just needed the names of the relevant attributes. I finally settled on the following solution (which might still change):

All parameters need to be specified as keyword arguments of the __init__. The __init__ may not have positional arguments or ‘*’ arguments. Attributes on the objects have the same names as the __init__ parameters.
A simple base class, with a couple of methods relying on a simple use of the inspect module to find the signature of the __init__.

import inspect


class BaseEstimator(object):
    @classmethod
    def _get_param_names(cls):
        # The parameter names are read off the signature of __init__
        # (getfullargspec is the Python 3 replacement for getargspec)
        spec = inspect.getfullargspec(cls.__init__)
        assert spec.varargs is None, (
            'scikit learn estimators should always specify their '
            'parameters in the signature of their init (no varargs).'
            )
        args = spec.args
        # Remove 'self'
        args.pop(0)
        return args

    def _get_params(self):
        # Collect the current value of each parameter from the attributes
        out = dict()
        for key in self._get_param_names():
            out[key] = getattr(self, key)
        return out

    def _set_params(self, **params):
        # Only names that appear in the __init__ signature can be set
        valid_params = self._get_param_names()
        for key, value in params.items():
            assert key in valid_params, ('Invalid parameter %s '
                'for estimator %s' %
                (key, self.__class__.__name__))
            setattr(self, key, value)

The full code can be seen here and adds a bit more features, such as a clever __repr__.

What I like about this solution is that it (almost) does not use metaprogramming, and avoids code duplication without forcing any specific pattern on the developer subclassing BaseEstimator.
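To show how this serves the original need of instantiating new models from an existing one, here is a minimal sketch. The base class is a condensed restatement of the one above, while Ridge and clone_with are hypothetical names introduced only for this example.

```python
import inspect


class BaseEstimator(object):
    # Condensed restatement of the base class above, for Python 3
    @classmethod
    def _get_param_names(cls):
        return inspect.getfullargspec(cls.__init__).args[1:]  # drop 'self'

    def _get_params(self):
        return dict((key, getattr(self, key))
                    for key in self._get_param_names())


class Ridge(BaseEstimator):
    """Hypothetical estimator following the convention described above."""

    def __init__(self, alpha=1., fit_intercept=True):
        self.alpha = alpha
        self.fit_intercept = fit_intercept


def clone_with(estimator, **new_params):
    """Hypothetical helper: build a fresh model from an existing one."""
    params = estimator._get_params()
    params.update(new_params)
    return estimator.__class__(**params)


model = Ridge(alpha=.1, fit_intercept=False)
candidates = [clone_with(model, alpha=a) for a in (.01, .1, 1.)]
```

Because only the __init__ parameters are copied, each candidate starts without any state fitted on the original model, which is what copy.copy could not guarantee.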

The next step

This approach solves my immediate problem, but not the bigger one of finding what values the different parameters can take when varied for model selection. Of course this second problem is much more complicated, and maybe it is not worth solving it: the framework could very easily be bringing in more problems than it solves.

However, it seems that a fairly easy way of specifying possible values for parameters would be to decorate the __init__, giving the possible parameters to be tested during the model selection:

@cv_params(l1=np.logspace(-4, 0, 10))  # 10 values from 1e-4 to 1
def __init__(self, l1=.5, fit_intercept=True):
    # ...

All the decorator has to do is to store the information in an attribute attached to the __init__ (and probably to check that the parameters it was given are valid arguments, in order to raise errors early). Methods on the class can later inspect this information for model selection, or GUI building (data-model specification will probably require some typing language, rather than a simple list of possible parameters).
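A minimal sketch of such a decorator could look like the following. Nothing here exists in the scikit: the attribute name cv_params, the Lasso class and the _get_cv_params method are assumptions made for the illustration.

```python
import inspect


def cv_params(**grid):
    """Hypothetical decorator: attach candidate values to an __init__."""
    def decorate(init):
        valid = inspect.getfullargspec(init).args[1:]  # drop 'self'
        # Raise errors early if a name is not an argument of __init__
        for name in grid:
            if name not in valid:
                raise ValueError('%s is not a parameter of __init__' % name)
        # Store the information on the function itself
        init.cv_params = dict(grid)
        return init
    return decorate


class Lasso(object):
    @cv_params(l1=[0.01, 0.1, 1.0])
    def __init__(self, l1=.5, fit_intercept=True):
        self.l1 = l1
        self.fit_intercept = fit_intercept

    @classmethod
    def _get_cv_params(cls):
        # Later inspection, e.g. by model-selection code
        return getattr(cls.__init__, 'cv_params', {})


grid = Lasso._get_cv_params()
```

The decorator leaves the __init__ untouched, so estimators that do not use it keep working and simply report an empty grid.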

Once again, here we would be avoiding the difficulty of specifying type information in a non restrictive way, but avoiding a problem that we don’t have to solve is probably a good idea.

Euroscipy 2010: code, science, and a lot of fun

2010-07-13T17:31:00+02:00

Euroscipy 2010, the third European conference for the use of Python in science, is just over, and I think it was a great success.

Euroscipy in numbers

The attendance this year was huge: a grand total of 160 people came to EuroScipy, with 140 attending the tutorials and 130 the conference. This is up by almost a factor of 3 compared to last year’s EuroScipy, more than last year’s SciPy conference in Pasadena, and almost as many as this year’s SciPy conference in Austin, which hosted 180 people. We had people coming from 16 countries, and from as far as New Zealand, the US, or Turkey. Research labs, education, and industry (small to large companies) were all well represented, with approximately a third of the delegates coming from industry. Similarly, many different scientific fields were discussed, ranging from landscape ecology to pure math.

There were 2 tutorial tracks with 10 tutorial slots in each track. We had 2 keynotes from Hans Petter Langtangen and Konrad Hinsen. With regards to the contributed talks, the conference this year was highly selective. We received 52 proposals. We unfortunately could accept only 30 of them, which corresponds to an acceptance rate of 58%. Finally, we had 18 lightning talks.

A warm and friendly atmosphere

As an organizer, I was really pleased to find out how relaxed and friendly people were. This certainly facilitates discussions during the breaks. And the ambiance was undoubtedly warm: 140 people with laptops in a room without air conditioning in the Paris summer :).

Of course during the evenings, many people met to continue the passionate discussions in restaurants and bars.

Trends I noticed

What one remembers from a conference is obviously biased by personal interests. With that disclaimer, here are the recurrent and important topics that I noticed, both in the talks, but also in the coffee break discussions:

Parallel computing, in particular making it easy to do parallel computing. Konrad’s keynote had many interesting directions to explore. (talks: Playdoh, DANA).
Code generation. In the various conferences I have been to recently, I heard much talk about symbolic manipulation of numerical problems to generate optimal computing kernels (talks: Efficient computation tutorial, Theano, Algorithmic Differentiation).
Data management, with problems such as provenance tracking for reproducibility (talks: Sumatra, Knowledge management tutorial).

Finally, installation problems of scientific tools were the subject of many discussions, as each year. One thing that I did notice is that people stopped simply blaming each other and acknowledged that nobody knew how to fix the problem. Somebody even pointed out that installing any major scientific code was not a piece of cake. Hans Petter and others said that they had solved the problem by relying on a virtual machine and Ubuntu.

Konrad has also blogged, giving his own view of the conference.

Thanks

The conference could happen only because of the help of many people. First we need to thank our sponsors: Enthought, Python Academy, Pytables, and especially our host Ecole Normale Supérieure, which not only provided us with the rooms, but also made sure that everything was going well with the sound system, the projection, or the access to the building. With regards to organization and planning, Nicolas and I received a lot of help from Emmanuelle Gouillart.

Personal views on scientific computing

2010-05-20T00:00:00+02:00

My contributions to the scientific computing software ecosystem are motivated by my vision on computational science.

Scientific research relies more and more on computing. However, most of the researchers are not software engineers, and as computing is becoming ubiquitous, the limiting factor becomes more and more the human factor [G. Wilson, 2006] [P. Norvig, 2009].

Note

To address the needs of computing across scientific fields, I believe that we need a general-purpose, high-level, interactive, and highly-readable language and set of tools for scientific computing.

C does not answer my needs: does a molecular biologist know about pointers? Should she?
Matlab does not answer my needs either: scientific work with computers is not only about numerical computation. Have you tried writing an experiment-control software with Matlab? How about file management? Inserting the algorithms in a web server?

We need better teaching material that sits at the interface between software engineering and general science. Most top-notch tools and libraries are full of domain-specific jargon and conventions.

For reproducible science, we need the code to be readable and to reflect the corresponding scientific operation. We need it to be unit-tested to ensure its correctness.

Note

We need to consider scientific libraries as an end result of our research, with the same importance as articles [J. Buckheit and D. Donoho. 1995]. They need to convey a scientific message, to be understandable and refutable. New results should be reproducible via published code [CISE Jan. 2009]. As for established algorithms, scientific libraries with their documentation and examples should be the textbooks of tomorrow.

Scientific software should be as reusable as possible, to enable the advancement of Science via software, year after year. This means that we need to build general-purpose libraries.
Code quality and documentation are crucial, as human factors are often the limitation. As a corollary, scientific code should be unit-tested to ensure correctness.
Core scientific software should be open source, as scientific work cannot build on black boxes.
Algorithms should be written as simply as possible. A high level of sophistication in software engineering should not be a requirement for all scientists.
Prefer high-level languages. The code should be written at the right level of abstraction.
We need to build common and shared tools. Scientific software shouldn’t be ‘owned’ by a lab.
The source code should be a deliverable of the research. As a result, it should read clearly and be understandable to all.
Documentation and examples are the textbooks of tomorrow.
Publications should be reproducible. Ideally they should become an example of the library. This should be tempered by the fact that code maintenance is costly, and achieving good code takes more work than publishing. Focus should be on publications that will give rise to reference results.
Academia needs to value software maintenance. It is hard and costly, but it determines our future.
Tools that develop the environment, rather than a specific algorithm or scientific field are crucial (one example is IPython).

EuroScipy abstract submission deadline extended

2010-05-15T23:36:00+02:00

Given that we have been able to turn on registration only very late, the EuroScipy conference committee is extending the deadline for abstract submission for the 2010 EuroScipy conference.

On Thursday May 20th, at midnight Samoa time, we will turn off the abstract submission on the conference site. Up to then, you can modify the already-submitted abstract, or submit new abstracts.

We are very much looking forward to your submissions to the conference.

Gaël Varoquaux

Nicolas Chauvat

EuroScipy 2010 is the annual European conference for scientists using Python. It will be held July 8-11 2010, in ENS, Paris, France.

Links: Conference website, Call for papers, Practical information

EuroScipy is finally open for registration

2010-05-13T13:23:00+02:00

The registration for EuroScipy is finally open.

To register, go to the website, create an account, and you will see a ‘register to the conference’ button on the left. Follow it to a page which presents a ‘shopping cart’. Simply submitting this information registers you for the conference, and on the left of the website, the button will now display ‘You are registered for the conference’.

The registration fee is 50 euros for the conference, and 50 euros for the tutorial. Right now there is no payment system: you will be contacted later (in a week) with instructions for paying.

We apologize for such a late set up. We do realize this has come as an inconvenience to people.

Do not wait to register: the number of people we can host is limited.

An exciting program

Tutorials: from beginners to experts

We have two tutorial tracks:

**Introductory tutorial**: to get you up to speed on scientific programming with Python.
**Advanced tutorial**: experts sharing their knowledge on specific techniques and libraries.

Scientific track: doing new science in Python

Although the abstract submission is not yet over, I can say that we are going to have a rich set of talks, looking at the current submissions. In addition to the contributed talks, we have:

**Keynote speakers**: Hans Petter Langtangen and Konrad Hinsen, two major players of scientific computing in Python.
**Lightning talks**: one hour will be open for people to come up and present in a flash an interesting project.

Publishing papers

We are talking with the editors of a major scientific computing journal, and the odds are quite high that we will be able to publish a special issue on scientific computing in Python based on the proceedings of the conference. The papers will undergo peer-review independently from the conference, to ensure high quality of the final publication.

Call for papers

Abstract submission is still open, though not for long. We are soliciting contributions on scientific libraries and tools developed with Python and on scientific or engineering achievements using Python. These include applications, teaching, future development directions, and current research. See the call for papers.

I am very much looking forward to passionate discussions about Python in science in Paris

Status of the EuroScipy registration

2010-05-02T22:57:00+02:00

It is still not possible to register for the Euroscipy conference: we are having difficulties with payment for the registration, and we are still not sure that we will be able to actually charge money!

This might not be bad news, because it might mean that the conference will be completely free. This would mean that we would not be able to provide lunch, which is a pity as there is nothing like eating with a bunch of passionate experts to learn new tricks, but it would not hamper the conference in any other way, as the rooms are already booked and various little expenses covered.

If we manage to sort out payments in the next weeks, the fee should be 50 euros for the 2 days of tutorial, and between 50 and 100 euros for the full conference, depending on exactly what catering we offer.

Anyhow, we should open the registration really soon, with or without payment. We will need to have some formal registration, as the number of people that can fit in the rooms will be limited.

All in all, with or without registration fees, it should be possible to make it to Euroscipy while keeping expenses low: we have indicated a few cheap accommodation options on the practical details page, and it is easy to get good food for a good price in the area.

I am very excited about this conference. We have two keynotes that I am really looking forward to hearing, and I can say that we have been getting pretty good submissions for presentations. Also, chances are that we should be able to publish proceedings in a peer-reviewed journal, although I can’t say more about that right now.

Also, even if you are not interested in scientific research done using Python, the tutorials are a unique opportunity: we are having top-notch experts presenting with two tracks, one to get beginners up to speed and efficient in a couple of days, and the other for exploring advanced subjects. I know the speakers, and I can tell you that I won’t be talking in the corridor, but sitting with my laptop and listening to them. People pay large chunks of money for such training, usually.

Mayavi: Representing an additional scalar on surfaces

2010-04-05T00:30:00+02:00

We have been getting a few questions on the enthought-dev mailing list on how to represent additional information on a surface with Mayavi, using a color that is not given by, e.g., the elevation. A recent post by Didrik Pinte on his blog shows the problem quite well:

This problem can be seen as taking a standard surf plot:

but coloring it with a different scalar than the elevation.

I would like to present two ways of solving this problem. First a very simple way specific to the exact problem, second a more complicated but quite generic approach.

Representing surfaces more complex than an elevation map

The first option is simply to use the tools that Mayavi’s mlab interface provides to represent surfaces that are not the particular case of an elevation plot. In our case, it is very easy to use the mesh function, which can take the x, y, z positions of a grid giving the surface, but also an additional scalar value at these positions:

# Create some data
import numpy as np
x, y = np.mgrid[0:10:100j, 0:10:100j]
z = x**2 + y**2
# Same angle as arctan(x/y) for y > 0, but well defined on the y = 0 edge of the grid
w = np.arctan2(x, y)

# Visualize it
from enthought.mayavi import mlab
mlab.mesh(x, y, .05*z, scalars=w)

# Finally, add a few decorations.
mlab.axes()
mlab.outline()
mlab.view(-177, 82, 32)
mlab.show()

As you can see, this solution is really simple, and solves the problem.

A generic way of representing several scalar attributes with one visualization

If we think of the visualization problem as a way of representing two scalar values, ‘z’ and ‘w’, as a function of two others, ‘x’ and ‘y’, the above solution is not really satisfactory: the surf function really turns the scalar value ‘z’ into elevation (using a WarpScalar filter). We would like to be able to add an additional scalar value ‘w’ and turn it into color, just like ‘z’ is turned into elevation. The pipeline that is created by the surf function is the following:

The first element of the pipeline after the scene is the data source created for us by the surf function: it is a 2D array that contains the ‘z’ value as a scalar value. The ‘WarpScalar’ filter is applied, and transforms that value into elevation. After that, a ‘PolyDataNormals’ filter is used to calculate normals, so as to have a smooth rendering, and finally, a ‘Surface’ module is applied to display the resulting elevation map as a surface, with a color reflecting the scalar value.

The way we can operate on two scalar values and turn them into elevation and color successively is to embed these two scalar values on the dataset, ‘z’ and ‘w’, and use a ‘SetActiveAttribute’ to control on which one the ‘Surface’ module is applied. This approach is much more powerful, because we can tweak the pipeline ourselves, and use any filter to replace the WarpScalar, and display the ‘z’ information (more on that below).

Here is how to achieve a visualization with a similar look as above, but with two scalar values transformed successively into elevation and color:

###############################################################
# Create some data
import numpy as np
x, y = np.mgrid[0:10:100j, 0:10:100j]
z = x**2 + y**2
# Same angle as arctan(x/y) for y > 0, but well defined on the y = 0 edge of the grid
w = np.arctan2(x, y)

###############################################################
# Visualize the data
from enthought.mayavi import mlab

# Create the data source
src = mlab.pipeline.array2d_source(z)

# Add the additional scalar information 'w', this is where we need to be a bit careful,
# see
# http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/example_atomic_orbital.html
# and
# http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/data.html
dataset = src.mlab_source.dataset
array_id = dataset.point_data.add_array(w.T.ravel())
dataset.point_data.get_array(array_id).name = 'color'
dataset.point_data.update()

# Here, we build the very exact pipeline of surf, but add a
# set_active_attribute filter to switch the color, this is code very
# similar to the code introduced in:
# http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab.html#assembling-pipelines-with-mlab
warp = mlab.pipeline.warp_scalar(src, warp_scale=.5)
normals = mlab.pipeline.poly_data_normals(warp)
active_attr = mlab.pipeline.set_active_attribute(normals,
                                            point_scalars='color')
surf = mlab.pipeline.surface(active_attr)

# Finally, add a few decorations.
mlab.axes()
mlab.outline()
mlab.view(-177, 82)
mlab.show()

The pipeline that is created is the following:

In the first part of the pipeline, the ‘WarpScalar’ filter is applied to the ‘z’ scalar value, whereas, due to the ‘SetActiveAttribute’ filter, the ‘Surface’ module uses the ‘w’ scalar value to display the color.
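A remark on the `w.T.ravel()` call in the listing above: as far as I understand, VTK stores the point data of a regular grid with the first (x) index varying fastest, whereas a default NumPy array ravels with its last index varying fastest. Transposing before raveling puts the values in the order the dataset expects. The tiny sketch below uses a made-up 2x3 array (not the data of the example) to show the difference:

```python
import numpy as np

# Hypothetical 2x3 array standing in for 'w', indexed as w[ix, iy]
w = np.arange(6).reshape(2, 3)

# Default (C-order) ravel: the last index (iy) varies fastest
c_order = w.ravel()

# Transpose, then ravel: the first index (ix) varies fastest
x_fastest = w.T.ravel()

# This is the same as asking NumPy for a Fortran-order flattening
assert np.array_equal(x_fastest, w.ravel(order="F"))

print(c_order)
print(x_fastest)
```

Forgetting the transpose does not raise any error when the grid is square, as in the example: the colors simply end up mirrored along the diagonal, which is why this step needs some care.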

This pattern is very powerful, and can be used with other sets of filters or modules. The example of this pattern that we use in the Mayavi documentation is the following:

We use a ‘Contour’ filter to contour on the amplitude of a complex field defined in the volume, and then switch to the phase to display the color. See the atomic orbital example in the Mayavi documentation for more details.

Book review: Matplotlib for Python Developers

2010-03-26T10:49:00+01:00

Packt publishing sent me a copy of Sandro Tosi’s book Matplotlib for Python Developers a while ago. Unfortunately, it arrived after I had left for the Christmas break, and I couldn’t find time to review it for a while (I am terribly bad at time-management, and I do too many things; as a result I am always overworked). Three months later, I have finally found time to read it and post a review.

Content

The book introduces matplotlib which is, for those who don’t know, a truly fantastic library for scientific plotting in Python. Matplotlib is great because it is really easy to pick up, and can be used to produce very high-quality plots.

The book starts by progressively introducing the simple, imperative API for matplotlib, with a focus on getting the user immediately plotting data. It then moves on to a review of the functionality for plotting in matplotlib and the object-oriented usage of matplotlib. Finally, Sandro shows us how to embed matplotlib in various environments such as GUI toolkits or web development tools.

The last part of the book is, in my opinion the most original and precious, as these subjects are less well-known and documented in classical references accessible to people with a scientific computing background.

Target audience

The book can pretty much be picked by a scientific Python beginner. It does require some knowledge of the Python language, but if the reader has programmed in another language, I don’t see this as a big problem. In this regard, the book is especially interesting, as it can lead a scientist from newbie to writing simple end-user programs. There is a clear need for more of these documents currently.

The book will also be useful for experienced Python developers looking to pick up matplotlib quickly.

Personal comments on the book

In my experience, presenting a tool such as matplotlib is a challenge: everybody has different plotting needs and there is an infinity of variations in the ways that you can use a powerful library like matplotlib. Thus, Sandro’s exposition of matplotlib will not suffice: people should absolutely read more, and I can’t stress enough that the matplotlib documentation is excellent.

In general, I found that the book reads fairly well. Of course, I am not the best critic in terms of ease of reading, as I know matplotlib very well. I do find that the book lacks a personal touch such as interesting examples, or profound insights on specific problems. There is nothing that got me excited in the book (again, maybe it’s because I know what’s in the book quite well).

Once again, in my eyes, the biggest contribution of this book is to put together an introduction to matplotlib, and examples of application building using matplotlib. I would especially recommend the book for people wanting to build simple data visualization GUIs.

Finally, with regard to interactive data visualization, in my experience, scientific programmers achieve better productivity when they avoid working at the widget level and use an abstraction library. I strongly recommend looking at TraitsUI for this purpose. You can find a tutorial here (disclaimer: I wrote that tutorial).

Also, if you are going to write a data visualization program that is interactive in the sense that it enables the user to interact with the data, using Chaco instead of matplotlib may make your life easier. Chaco is not as well polished and documented as matplotlib, and I would never use it for quick scripting work, but it has a strong focus on data interaction, and as such makes it really easy to build very responsive user interfaces, because it is very fast and has a clear object-oriented API.

New Mayavi release

2010-03-14T12:58:00+01:00

A week ago, Peter Wang released a new version of the Enthought Tool Suite (ETS). With it came a new version of Mayavi2.

Prabhu and I have been horribly busy with real life, and I had the bad feeling that we were not giving enough love to Mayavi. I was surprised when I put together the list of features and bug fixes that went into Mayavi for the last two releases. The full list can be found in the documentation.

Contributors

We are not terribly good at tracking external ideas and patches, so I hope that I haven’t forgotten anybody, but I am very happy to say that Prabhu and I have received a fair amount of help from non-core contributors:

Chris Colbert
Darren Dale
Dave Martin
Dave Peterson
Emmanuelle Gouillart
Erik Tollerud
Evan Patterson
Gary Ruben
Kyle Mandli
Michele Mattioni
Ondrej Certik
Ram Rachum
Robert Kern
Scott Warts
Suyog Jain

On top of these people, I wish to thank the people making sure that the Mayavi packages are available in the different Linux distributions: Varun Hiremath, Lev Givon, Andrea Colangelo, Rakesh Pandit, as well as Pierre Raybault for integrating in Pythonxy.

Important features added in 3.3.0

3.3.0 was released last fall. We had not compiled the list of changes at the time, so I am giving it here:

An example gallery in the documentation.
A sync_camera helper function to synchronize camera between two scenes.
A text3d module, for positioning text in 3D that is scaled and hidden like a data object.
A close function to close scenes, similar to that in pylab or matlab.
A new filter to crop datasets: DataSet Clipper. This filter is terribly useful.
All the mlab.pipeline functions now take a figure= keyword argument. This is very useful when coding with several figures embedded in GUIs, as in a GUI you can’t rely on a context. This is illustrated in this example.

Important features added in 3.3.1

In the latest release, the following important features were added:

mlab.savefig can now reliably save images of a size larger than the window.
The interactive VTK documentation browser is now available in the GUI.
New functions added to mlab to control position of the camera: move, yaw, and pitch. These complement the existing view and roll.
Make the lines smoother when using mlab.plot3d (use a VTK Stripper filter)
Add a screenshot function to mlab for easy screen capture as a numpy array. This is very useful when creating figures that combine 3D using Mayavi and 2D using pylab. I use it all the time.
Add a probe_data function to return the data values of Mayavi objects at given locations as numpy arrays. This is very useful to combine numerics with Mayavi.
Add an auto mode to mlab.view to compute position and distance based on the objects in the scene.
Add a helper function to easily interact with the data: a callback can easily be registered to picking data with the mouse. Two examples illustrate this new functionality. This is a major step forward in making life easier for people using Mayavi to build custom interfaces.

Using Python, Scipy, ETS, … to implement art

2010-02-14T14:14:00+01:00

The Aikon project has just been slashdotted.

The project is about implementing a robotic artist, with a special artistic touch:

The co-principal investigator, Patrick Tresset, gave a talk at the French Pycon this year and I was simply flabbergasted by the project. It is amazing to mix together art and technology in such a way; you should really have a look at the videos of the robotic arm making sketches of people.

But I was even more startled when I discovered that the project was using scipy and all my beloved stack for scientific computing in Python, including the Enthought Tool Suite: check it out. I really want scientific computing software to be tools opening new ideas and new research. This research goes beyond my dreams.

EuroScipy 2010, Paris July 8-11. Save the date!

2010-02-14T00:02:00+01:00

EuroScipy 2010, the 3rd European meeting on Python in Science, will be held July 8-11 in the center of Paris, at the Ecole Normale Supérieure.

We have made good progress in the organization, and we already have an exciting program although paper submission is not yet even open.

Tutorial tracks

There will be two tutorial tracks:

An introductory track, to bring attendees up to speed with Python in science. Even if you are a complete beginner, after these two days, you should be able to be efficient using Python for scientific purposes.
An advanced tutorial track, covering in-depth specific tools and projects, aimed at experienced users and presented by leading experts of the topic.

We will soon be requesting feedback from you to help us choose between the different thrilling tutorial propositions that we have for these tracks. More on that later…

Keynote speakers

Hans Petter Langtangen

Simula laboratory, Oslo, director of scientific computing and bio-medical research
Author of the famous book Python scripting for computational science

Konrad Hinsen

Synchrotron SOLEIL and Centre de Biophysique Moléculaire (Orléans)
One of the fathers of numeric, and developer of Scientific Python.

Help us spread the word

The poster of the conference can be downloaded:

Help us spread the word: print it and post it at your workplace!

The exciting city of Paris

The conference will take place in the center of Paris, in the very lively “quartier latin”, in the prestigious and historical ‘Ecole Normale Supérieure’. In the morning, on your way to ENS, drop by a café for a French croissant, served by a French waiter with a typical French accent in English. In the evenings, walk one block to enjoy the night life “rue Mouffetard”, or venture further to stroll on the river banks of the Seine, along which people dance to street music.

The SciPy 2009 proceedings are online

2009-12-20T18:49:00+01:00

We are finally announcing the online edition of SciPy proceedings:

http://conference.scipy.org/proceedings/SciPy2009/

This year, we tried to raise the bar in terms of article quality. This involved having a stricter review process, and we must warmly thank all the reviewers. I have the feeling it did improve the quality of the final papers. Actually, I must say that there are some really nice papers in the proceedings. I am not going to list them here, you can have a glance at the contents, but they range from fairly technical papers on tools development that are more in the software engineering and computer science fields, to application papers demonstrating how the tools can be used.

I must apologize for the time it took to publish the proceedings. All this was actually a lot of work, and it has taken me a lot of energy. I hope that you will find it was worth it.

Announcing EuroScipy 2010

2009-12-14T01:01:00+01:00

The 3rd European meeting on Python in Science

Paris, Ecole Normale Supérieure, July 8-11 2010

We are happy to announce the 3rd EuroScipy meeting, in Paris, July 2010.

The EuroSciPy meeting is a cross-disciplinary gathering focused on the
use and development of the Python language in scientific research. This
event strives to bring together both users and developers of
scientific tools, as well as academic research and state of the art
industry.

Important dates

Registration opens: Sunday March 29

Paper submission deadline: Sunday May 9
Program announced: Sunday May 22
Tutorials tracks: Thursday July 8 - Friday July 9
Conference track: Saturday July 10 - Sunday July 11

Tutorial

There will be two tutorial tracks at the conference: an introductory one, to bring attendees up to speed with the Python language as a scientific tool, and an advanced track, during which experts of the field will lecture on specific advanced topics such as advanced use of numpy, scientific visualization, software engineering…

Main conference topics

We will be soliciting talks on the following topics:

Presentations of scientific tools and libraries using the Python language, including but not limited to:
- Vector and array manipulation
- Parallel computing
- Scientific visualization
- Scientific data flow and persistence
- Algorithms implemented or exposed in Python
- Web applications and portals for science and engineering
Reports on the use of Python in scientific achievements or ongoing projects.
General-purpose Python tools that can be of special interest to the scientific community.

Keynote Speaker: Hans Petter Langtangen

We are excited to welcome Hans Petter Langtangen as our keynote speaker.

Director of scientific computing and bio-medical research at Simula labs, Oslo
Author of the famous book Python scripting for computational science http://www.springer.com/math/cse/book/978-3-540-73915-9

The organizers:

Gaël Varoquaux (INRIA Saclay, Parietal), conference co-chair

Nicolas Chauvat (Logilab), conference co-chair

Program committee

Romain Brette (ENS Paris, DEC)
Mike Müller (Python Academy)
Christophe Pradal (CIRAD/INRIA, DigiPlantes team)
Pierre Raybault (CEA, DAM)
Jarrod Millman (UC Berkeley, Helen Wills NeuroScience institute)

Decoration in Python done right: Decorating and pickling

2009-11-13T00:14:00+01:00

Decoration is a fantastic pattern in Python that allows for very light-weight metaprogramming with functions rather than objects (see this article for an in-depth discussion). However, when decorating, it is very easy to break another great feature of the language: its reflectivity and its ability to do static representations of its internal objects: pickling.

In this blog post, I’d like to rewrite a post I made on the IPython mailing list a month ago, summing up the few things to keep in mind when decorating a function.

A pattern to avoid?

I have recently been revisiting my decoration code, to fight a common mistake I had been making, which was partly due to the heavy use of a simplified pattern for decorating:

def with_print(func):
    """ Decorate a function to print its arguments.
    """

    def my_func(*args, **kwargs):
        print args, kwargs
        return func(*args, **kwargs)

    return my_func

@with_print
def f(x):
    print 'f called'

The nice thing about this pattern is that it is quite easy to type, and to read.

Why it is harmful

The decorated function is actually the inner function ‘my_func’. It holds a reference to the original function ‘func’, which lives in the scope of the decorator ‘with_print’ and is therefore only reachable through the closure of ‘my_func’.

The problem is that we have a closure here. Thus we have variables that are hard to get to (the undecorated function), and the decorated function is not picklable (which is more and more important to me, e.g. for parallel computing).
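To make this concrete, here is a minimal sketch in Python 3 syntax (the listings of this post use Python 2 print statements). The helper `can_pickle` is a name I made up for the illustration. It shows that the wrapper returned by the naive decorator cannot be pickled, and that the only handle on the undecorated function goes through the `__closure__` attribute:

```python
import pickle


def can_pickle(obj):
    """Hypothetical helper: return True if obj survives pickle.dumps."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def with_print(func):
    """Naive decorator: the wrapper closes over 'func'."""
    def my_func(*args, **kwargs):
        print(args, kwargs)
        return func(*args, **kwargs)
    return my_func


@with_print
def f(x):
    return 2 * x


# The name 'f' now points to the inner 'my_func', which pickle cannot
# look up by name in the module
print(f.__name__)       # my_func
print(can_pickle(f))    # False

# The undecorated function is only reachable through the closure cells
original = f.__closure__[0].cell_contents
print(original(3))      # 6, without printing the arguments
```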

Some solutions

Avoiding the closure

Use objects as a scope, rather than a closure:

class WithPrint(object):
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        print args, kwargs
        return self.func(*args, **kwargs)

This solution is not enough: the following code won’t pickle:

@WithPrint
def g(x):
    print 'g called'

The reason this won’t pickle is that we have a name collision: the code above expands to:

def g(x):
    print 'g called'

g = WithPrint(g)

and trying to pickle raises the following PicklingError:

Can't pickle <function g at 0x6ed2a8>: it's not the same object as __main__.g

If we do:

def g(x):
    print 'g called'

h = WithPrint(g)

we can pickle h, hurray!

Using functools.wraps

However, Python comes with the answer in the standard library: functools.wraps restores the original name on the decorated function.

Thus the following code produces a pickleable f:

from functools import wraps
def with_print(func):
    """ Decorate a function to print its arguments.
    """
    @wraps(func)
    def my_func(*args, **kwargs):
        print args, kwargs
        return func(*args, **kwargs)
    return my_func

@with_print
def f(x):
    print 'f called'

The pickling works simply because functools.wraps resets the
.func_name attribute of f (__name__ in Python 3), so that f has a
well-defined import path. Thus pickling works simply by storing the
import path, as for all pickling of functions.

Notice that there is only a one-line difference with the original code!
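A minimal sketch of what `functools.wraps` changes, in Python 3 syntax (where the attribute is called `__name__` rather than `func_name`). It assumes the code lives in an importable module or in a script run as `__main__`, since functions are pickled by reference to their import path:

```python
import pickle
from functools import wraps


def with_print(func):
    """Decorate a function to print its arguments."""
    @wraps(func)
    def my_func(*args, **kwargs):
        print(args, kwargs)
        return func(*args, **kwargs)
    return my_func


@with_print
def f(x):
    return 2 * x


# wraps copies the name and docstring of 'f' onto the wrapper...
print(f.__name__)

# ...so pickle can store the function as a reference to '<module>.f',
# and loading that reference gives back the decorated function itself
restored = pickle.loads(pickle.dumps(f))
print(restored is f)

# Python 3 also keeps a handle on the undecorated function
print(f.__wrapped__(3))
```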

I actually tend to use a combination of both solutions (an object, using functools.wraps), to keep a reference to the undecorated functions.
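For illustration, such a combination could look like the sketch below (Python 3 syntax). The class name `WithPrint` is reused from above, the attribute name `func` is my choice, and `functools.update_wrapper` is the function that `functools.wraps` calls under the hood. Note that this only addresses introspection: pickling an instance still means pickling `self.func`, so the name collision described earlier remains when the `@` syntax is used.

```python
from functools import update_wrapper


class WithPrint:
    """Decorator object that prints the arguments of each call."""

    def __init__(self, func):
        # Keep an explicit reference to the undecorated function
        self.func = func
        # Copy __name__, __doc__, __module__... from func onto self
        update_wrapper(self, func)

    def __call__(self, *args, **kwargs):
        print(args, kwargs)
        return self.func(*args, **kwargs)


@WithPrint
def g(x):
    """Double x."""
    return 2 * x


print(g.__name__, "-", g.__doc__)
print(g(3))          # prints the arguments, then 6
print(g.func(3))     # calls the undecorated function directly
```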

Note: Demo code of this blog post can be found here.

Take home messages for pickling

Decorators can be more clever than you think, and might not return objects as simple as you think.
Think about pickling, or you’ll get bitten at some point (for instance when doing parallel computing).

and most important:

Use functools.wraps

A remark about object-oriented programming

To jump on the bandwagon behind Travis, I believe that this discussion teaches us a bit about object-oriented programming. When decorating, we are really taking a callable object, and redefining how the call is handled. If we do this the naive way, we lose introspection (there is no way to access the original callable from Python), and as a result pickling, and many of the nice features going with reflexivity in Python. This is because we trapped information in a scope that is not accessible by normal Python code (without playing at the frame level). If, on the other hand, we accept that what we have behind all this are nested scopes with a control of lookups, and we create a full-blown object, we have the benefits of the black box, and the benefits of reflexivity.

But this is not the point I want to make. The point I want to make is that, by decorating, we are piggy-backing on an absolutely universal object/interface: the callable. Everybody knows what a callable is, and knows how to employ it. From a pure object-oriented point of view, decorating is simply some kind of proxy design pattern. But, to stress Travis’s point, introducing new objects that have their own behavior puts cognitive load on the programmer. The real value of decoration is that it is object-oriented programming without adding any new or surprising interface. You don’t really have to care what is going on, you can still use the resulting ‘proxied’ function as the original function: a simple function.

Writing parallel code in a readable way

2009-11-09T00:10:00+01:00

Although I often have embarrassingly parallel problems (data parallel), and I have an 8-CPU box at work, I used to frown on writing parallel computing code when doing exploratory coding. We now have fantastic parallel computing facilities in Python (amongst others, multiprocessing, IPython, and parallel Python). However, in my opinion, there are two reasons to hesitate to use them, especially when the code is very immature (which is almost always my case, in research settings):

It makes the code look less like the ideas it is trying to express. Peter Norvig made a pretty convincing case for scientific code reading like math at SciPy2009.
Because parallel computing is out of process, in Python, it is simply harder to debug (though I hear that the IPython guys are on that).

I have progressively developed a tiny tool to address both problems, at least for my embarrassingly parallel problems. I address the second problem by having a trivial switch to run my code without importing any fancy parallel computing tools. And I address the first problem using syntactic sugar to be able to type in map/reduce code that actually looks like standard procedural code:

results = Parallel(n_jobs=2)(
            delayed(my_calculation)(data1, data2,
                                    parameter1=1, parameter2=2)
            for data1 in store1 for data2 in store2)

There are several tricks here:

I use a ‘delayed‘ decorator that creates the argument list and keyword argument dictionary for me so that I can type something that really looks like a function call. Also, the decorator checks to see if the function and the arguments can be pickled, because if not the parallel computing libraries will raise errors, sometimes with hard-to-understand messages.
I use list comprehension to create the list to apply the map/reduce onto. List comprehension is really readable, and very powerful.
The ‘Parallel‘ object hides all the cleverness. If the ‘n_jobs‘ parameter is set to 1, it does not call any parallel computing library. If it is set to -1, all the CPUs are used. The object instantiates the parallel computing context and also destroys it. While this is inefficient, it is great for catching errors early. And finally, while I have implemented this only for the multiprocessing module, any fork/join-based parallel computing library could be encapsulated the same way, thus providing a uniform API to do multi-node parallel computing or single-computer shared memory (as multiprocessing uses the Unix fork call, and all modern Unices implement copy-on-write of memory pages, you get some shared memory for free without worrying about race conditions).
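The tricks listed above can be sketched in a few lines. This is not the code I actually use, only a minimal illustration assuming the multiprocessing module as the backend; the helper _run is a name introduced for this sketch.

```python
import pickle


def delayed(function):
    """Capture a call as a (function, args, kwargs) triple instead of running it."""
    # Fail early if the function cannot be sent to a worker process
    pickle.dumps(function)

    def capture(*args, **kwargs):
        # Same early check for the arguments
        pickle.dumps((args, kwargs))
        return function, args, kwargs

    return capture


def _run(task):
    """Hypothetical helper: execute one captured call."""
    function, args, kwargs = task
    return function(*args, **kwargs)


class Parallel(object):
    def __init__(self, n_jobs=1):
        self.n_jobs = n_jobs

    def __call__(self, tasks):
        tasks = list(tasks)
        if self.n_jobs == 1:
            # Sequential switch: no parallel library is imported at all
            return [_run(task) for task in tasks]
        import multiprocessing
        n_workers = multiprocessing.cpu_count() if self.n_jobs == -1 else self.n_jobs
        # The pool is created and destroyed on each call
        pool = multiprocessing.Pool(n_workers)
        try:
            return pool.map(_run, tasks)
        finally:
            pool.close()
            pool.join()


# Reads like a plain function call inside a loop
squares = Parallel(n_jobs=1)(delayed(pow)(x, 2) for x in range(4))
```

With n_jobs=1 everything runs in the calling process, so a debugger or a traceback behaves as in ordinary sequential code.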

Update

This pattern has since evolved into the joblib project, which provides a lot of cleverness under the hood.

EuroScipy 2010 in Paris

2009-10-27T23:22:00+01:00

Next year’s EuroScipy will be in Paris, as Nicolas Chauvat and I announced in Leipzig this summer. We are still busy organizing, but we have pretty much settled on the dates: July 8th to July 11th. So mark those dates, and get ready to come to Paris for a fantastic event where science meets computing thanks to Python.

On the Thursday and Friday, we will have 2 days of optional tutorials: introductory ones to get up to speed with Python, and advanced ones, where experts explain the tools they know best. On the Saturday and Sunday, the main conference will be held, and if it is anything like last year’s, we will be hearing thrilling discussions with topics ranging from the latest libraries for better scientific computing to how Python was used in top-notch scientific achievements.

Useful trick for functions and tests using np.random

2009-08-29T16:00:00+02:00

Recently, listening to Robert Kern taught me a useful new trick for writing functions that use the numpy random number generator:

As always, when using random number generation in code, the problem is to get ‘repeatable results’. Of course, you want only repeatable statistics, and with statistics, the problem is to define what repeatable is. Anyhow, for various reasons, it is useful to be able to reproduce runs exactly, for instance when testing, fine tuning, or debugging. That is why you would like to be able to control the seed of your random number generation. Robert Kern showed us (at the SciPy conference tutorial) a way to control the pseudo random number generator (PRNG) in a function, without affecting the rest of the code. This does not involve setting the seed of the global PRNG, which is evil because it has global effects. The idea is to pass in to your functions a PRNG instance (by default the global one):

import numpy as np
def test(prng=np.random):
    print(prng.rand(10))

If you want to use your function with a controlled PRNG, you can instantiate one with a specific seed:

prng = np.random.RandomState(seed=0)

and pass it to your function.
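Putting the pieces together, a small sketch of how this plays out in a test. The function noisy_mean is a made-up example; only np.random and np.random.RandomState come from numpy.

```python
import numpy as np


def noisy_mean(n_samples, prng=np.random):
    """Mean of n_samples standard-normal draws taken from the given PRNG."""
    return prng.normal(size=n_samples).mean()


# Two generators built from the same seed reproduce the same run exactly
first = noisy_mean(100, prng=np.random.RandomState(seed=0))
second = noisy_mean(100, prng=np.random.RandomState(seed=0))

# Without an argument, the function draws from the global generator,
# whose state was not touched by the two seeded calls above
unseeded = noisy_mean(100)
```

The seeded runs are repeatable, while the rest of the program keeps using the global generator undisturbed.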

SciPy 2009 is over!

2009-08-29T12:21:00+02:00

The week is over, and I am finally catching up with things, back here in France.

The SciPy conference was exciting and fun as usual. It was great to meet old friends and put faces on names on the mailing list.

The turnout was very good: we had 150 people total. This is more than last year (125), which shows that there is high interest, given that most institutions have travel restrictions due to this year’s low budget.

This year, the conference was very international. I was really happy that, partly thanks to the PSF contribution, we had visits from young contributors coming from far away, such as David Cournapeau (Japan), Dag Sverre Seljebotn (Norway), Pauli Virtanen (Finland), and Stéfan van der Walt (South Africa). For me, living in France, it was also great to have people from major European institutions, such as the ESRF (European Synchrotron Radiation Facility), the Fraunhofer institute, and the Max Planck Institute. These people not only contributed important projects to the scientific Python tools, but also made the effort of coming all the way to California to talk about them, which is not negligible given the cost of the trip from Europe. To me this is important because it means that we are getting more interaction worldwide, and thus the tools are more likely to converge to something of general use. Also, for the first time ever, one of my bosses came to the conference. It is fantastic to be working with great scientists who actually understand that technology is important to do good science, and that programming is actually hard, and a matter of interest per se.

On the other hand, I was disappointed that we had no presentations from the industry. There were a lot of industry people in the audience, and it is always fun to hear what they use Python for.

I really enjoyed the keynote by Peter Norvig because Peter talked about the importance of having a clear language to expose and formulate scientific ideas. This is something that is very dear to me, and I must say that the code snippets he presented were crystal clear, and involved non-trivial maths explained in a way that made it look simple, similar to his famous blog post on spell checking. It was really inspiring for me, and is driving me to try to write even clearer and simpler code.

The technical keynote by Jon Guyer also hit a soft spot for me, not only because the physics presented was very beautiful, but also because my partner is doing research in similar fields (with Python, of course), and Jon made an excellent argument for using Python, which is not always easy when you are discussing heavily computational problems.

For my personal work, SciPy was very exciting, because I had so many discussions with different people on how we could share efforts, by tweaking a data structure in an existing package, or simply having a look at a package I wasn’t aware of. The machine learning BoF was extremely enthusiastic, and I am really looking forward to October, when we will be able to start working on that. If only half of the things that were talked about ever get done, I will be thrilled.

I should point out that, thanks to hard work by Jeff Teeters and Kilian Koepsell from Berkeley, the videos of every talk are on the web for the first time.

Also, we have a nice photo gallery with a group picture.

We have so many people to thank. Special thanks go to Leah Jones, at Enthought, and Julie Ponce, at CACR, Caltech. They made sure that the organization committee didn’t forget anything important and did a lot of the grunt work. Thanks also to Enthought and CACR, and many of their staff, for the support in the organization. The PSF funded students, and that is a big deal. We should thank all the tutorial presenters: it takes a lot of work to put together the material. We are very grateful to the program committee for reviewing the papers. Also thanks to all the speakers, and to all the attendees. The SciPy conference is a bit special to me, because it is very laid back, and I can trust that it will be great almost by self-organization: you put together nice and clever people, and they find ways of discussing interesting things with enthusiasm.

Update: This blog post feels way too ‘political’. I dislike sales pitches, and it does feel like one. But how else to sum up some important contributions and thank the people who helped out? I made a point of always wearing a T-shirt of the conference, rather than a shirt, but I guess that there is a point at which trying to dodge etiquette with a T-shirt and a pony tail is just another cliché[*] and formalism.

[*] cliché is a French word for cliché.

Mayavi: 2 videos of tutorial-like presentation

2009-07-16T23:35:00+02:00

I gave a presentation on Mayavi in the Python for science seminar organised by Fernando Perez at Berkeley. I was loudmouth and obnoxious as usual, and unfortunately for me, I was recorded.

More seriously, Jeff Teeters filmed the presentation and recorded the sound with a microphone I was wearing. I find that he did a really excellent job. Getting a good recording is hard, and he really got good audio and good framing. I am amazed and I don’t know how to thank him enough.

http://www.archive.org/details/ucb_py4science_2009_07_14_Gael_Varoquaux

Also, Stéfan van der Walt gave a talk in South Africa on Mayavi that was recorded. Another very useful resource for learning Mayavi:

http://www.archive.org/details/ctpug-2008-09-mayavi

Announcing the SciPy conference schedule

2009-07-16T03:02:00+02:00

The SciPy conference committee is pleased to announce the schedule of the conference:

http://conference.scipy.org/schedule

This year’s program is very rich. In order to limit the number of interesting talks that we had to turn down, we decided to reduce the length of talks. Although this results in many short talks, we hope that it will foster discussions, and give new ideas. Many subjects are covered, spanning a variety of technical topics across the scientific computing spectrum as well as many different research areas.

I would personally like to thank the members of the program committee, who spent time reviewing the proposed abstracts and giving the chairs feedback.

Fernando Perez and the tutorial presenters are hard at work finalizing all the details of the two-day tutorial session that will precede the conference. An introductory tutorial track and an advanced tutorial track, both covering various aspects of scientific computing in Python and presented by experts in the field, should help many people get up to speed on the amazing technology driving this community.

The SciPy 2009 program committee

Co-Chair Gaël Varoquaux, Applied Mathematics and Neuroscience, Neurospin, CEA - INRIA Saclay (France)
Co-Chair Stéfan van der Walt, Applied Mathematics, University of Stellenbosch (South Africa)
Michael Aivazis, Center for Advanced Computing Research, California Institute of Technology (USA)
Brian Granger, Physics Department, California Polytechnic State University, San Luis Obispo (USA)
Aric Hagberg, Theoretical Division, Los Alamos National Laboratory (USA)
Konrad Hinsen, Centre de Biophysique Moléculaire, CNRS Orléans (France)
Randall LeVeque, Mathematics, University of Washington, Seattle (USA)
Travis Oliphant, Enthought (USA)
Prabhu Ramachandran, Department of Aerospace Engineering, IIT Bombay (India)
Raphael Ritz, International Neuroinformatics Coordinating Facility (Sweden)
William Stein, Mathematics, University of Washington, Seattle (USA)

Conference Chair: Jarrod Millman, Neuroscience Institute, UC Berkeley (USA)

My article on scientific computing with Python

2009-07-13T03:23:00+02:00

I have never sold the rights to the article I published in LinuxMagazine France on scientific computing with Python. So I am uploading it to the net, under a CC-by-SA license: http://hal.inria.fr/hal-00776672/

It is in French, so it restricts the audience.

Tutorial on scientific use of Python

2009-07-08T19:38:00+02:00

The notes of the tutorial I gave on scientific use of Python at PyconFR are online. They are in French, but I am giving the link here, just in case it is needed:

http://dl.afpy.org/pycon-fr-09/python_scientifique/index.html

Object-oriented design: framework objects versus data containers

2009-07-01T05:13:00+02:00

I find that in object oriented design, there are two kinds of objects:

A first kind is the object encoding logic. This is an object for which clever and complex design will hold together the logic of a stateful application. It can often be part of a forest of objects that are linked together via design patterns. The interfaces of these objects are driven by their active role in the application. These objects are prominently present in interactive applications. They are mostly particular to an application or a framework, and are mostly implementation-defined.
The second type of object is a data container. It strives to expose a data model that can be of use in various situations, as it is the link between different parts of the code that do not talk to each other except through data. It is responsible for loose coupling (something that is very important to achieve a maintainable code base) by having a light and shallow interface. It must be interface-designed, rather than implementation-designed. One should very easily get a grasp, an almost physical feeling, for the object by simple interaction with it. I have what I call the ‘explaining test’ for these objects: can I explain fully and completely to somebody what the object does, and any possible caveat, without being sidetracked into special discussions? If not, back to the drawing board: the object will not gain acceptance. In my experience, only the objects of the second kind can easily be shared between different projects.

SciPy abstract submission deadline extended

2009-06-27T08:14:00+02:00

Greetings,

The conference committee is extending the deadline for abstract submission for the SciPy conference 2009 by one week.

On Friday July 3rd, at midnight Pacific, we will turn off the abstract submission on the conference site. Up to then, you can modify the already-submitted abstract, or submit new abstracts.

The SciPy 2009 executive committee

Jarrod Millman, UC Berkeley, USA (Conference Chair)
Gaël Varoquaux, INRIA Saclay, France (Program Co-Chair)
Stéfan van der Walt, University of Stellenbosch, South Africa (Program Co-Chair)
Fernando Pérez, UC Berkeley, USA (Tutorial Chair)

SciPy 2009 conference opened up for registration

2009-06-19T14:53:00+02:00

We are finally opening the registration for the SciPy 2009 conference. It took us time, but the reason is that we made careful budget estimations to bring the registration cost down.

We are very happy to announce that this year registration to the conference will be only $150, tutorials $100, and students get half price! We made this effort because we hope it will open up the conference to more people, especially students who often have to finance such trips on a small budget. As a consequence, however, catering at noon is not included.

This does not mean that we are getting a reduced conference. Quite the contrary, this year we have two keynote speakers. And what speakers: Peter Norvig and Jon Guyer! Peter Norvig is the director of research at Google and Jon Guyer is a research scientist at NIST, in the Thermodynamics and Kinetics Group, where he leads FiPy, a finite volume PDE solver written in Python.

The SciPy 2009 Conference

SciPy 2009, the 8th Python in Science conference, will be held from August 18-23, 2009 at Caltech in Pasadena, CA, USA.

Each year SciPy attracts leading figures in research and scientific software development with Python from a wide range of scientific and engineering disciplines. The focus of the conference is both on scientific libraries and tools developed with Python and on scientific or engineering achievements using Python.

Call for Papers

We welcome contributions from the industry as well as the academic world. Indeed, industrial research and development as well as academic research face the challenge of mastering IT tools for exploration, modeling and analysis.

We look forward to hearing your recent breakthroughs using Python! Please read the full call for papers.

Important Dates

Friday, June 26: Abstracts Due
Saturday, July 4: Announce accepted talks, post schedule
Friday, July 10: Early Registration ends
Tuesday-Wednesday, August 18-19: Tutorials
Thursday-Friday, August 20-21: Conference
Saturday-Sunday, August 22-23: Sprints
Friday, September 4: Papers for proceedings due

The SciPy 2009 executive committee

Jarrod Millman, UC Berkeley, USA (Conference Chair)
Gaël Varoquaux, INRIA Saclay, France (Program Co-Chair)
Stéfan van der Walt, University of Stellenbosch, South Africa (Program Co-Chair)
Fernando Pérez, UC Berkeley, USA (Tutorial Chair)

Update: I corrected a typo in the original blog post: the sprints are free, the tutorials are $100.

Fuzzy on OOP and the French

2009-06-14T10:38:00+02:00

Fantastic:

Haha - I shake my fuzzywuzzy beard at you in bewilderment. Do you people dislike OOP, the class statement is mere boilerplate to you, I mumble incoherent French obscenities in your general direction. (Did you know the French acronym for object-oriented programming is POO?).

Job offering for junior Python developer

2009-06-07T19:53:00+02:00

Our lab is seeking to hire an engineer to work on porting our machine learning code to the scikit learn, adding tests and documentation and packaging it.

We are looking for someone motivated by quality in software and open source. No prior scientific computing experience is required. You will be working in a highly stimulating research environment (Neurospin), near Paris and employed by the French research institute in computer science and applied math (INRIA), a prestigious institution.

Neurospin is a research institute dedicated to the understanding of the brain. You will be working with the computer-assisted neurology laboratory, the image-analysis branch of Neurospin, in the small ‘Parietal’ INRIA team embedded in NeuroSpin and dedicated to statistical modeling.

Over the years, the lab has developed a set of tools for machine learning and statistical analysis in Python (with some C). There are some tools for this purpose available in the open-source world (BSD-licensed) in the scikit learn. We want to extract the good and unique parts of our internal library, and release them in the open-source world through the scikit learn. Our code is fully Python code, using scipy and matlab, with some bindings to R. As we want the code to be BSD-licensed, we will remove the bindings to R, and replace them when possible. The job does not involve developing new algorithms, but testing, improving, and documenting the existing ones. There is a lot of quality assurance work to be done. The code needs to be brought up to the right coding standards; APIs should be cleaned; tests added. Dead code should be deleted. There is some optimization work to be done. Also, if any functionality is duplicated with the scikit learn, you should analyse both codes and determine which one to keep. The job also involves working with the community, documenting the code, and releasing the project, including binary packages. And finally, all the original authors of the algorithms, and experts in the field, are in the lab. So you will be able to learn from them and pester them if there is a problem with the code.

In a word, this is about transforming an internal project into a leading open source project that will rock and live on!

The job description is available here.

There are two caveats: first, it is a 2-year position. Second, you need to have graduated recently (how recently I don’t know exactly, but I will inquire).

If you are interested, or just want to ask questions, don’t hesitate to send me an e-mail. I am _really_ looking forward to collaborating with someone motivated on this project.

UPDATE: I have more details on the restrictions of the job offering: you need to have graduated in 2008 or 2009. This is a very hard restriction, and I am receiving many excellent CVs that I cannot even consider because of it. I am sorry, I cannot do anything about it.

Pycon FR: presentations and tutorials

2009-05-16T16:25:00+02:00

On May 30th and 31st, the French Python conference, Pycon FR, will be held at the ‘Cité des sciences’, la Villette, in Paris.

The first day, I will be giving a one-hour-long tutorial (in French) on numpy, scipy, and all the Python for Science jazz. On the following day, I will be giving a half-hour-long talk to illustrate the use of Python in my current work: statistical analysis and modelling of brain activity.

I’ll be giving my tutorial in one room, while David Larlet (the famous Biologeek) will be giving one on Django in another room. Tough competition :-P .

The program of the conference is very eclectic, ranging from general programming talks, to GUIs or web development. While this might deter the pure scientific computing folks, I strongly encourage you to attend. Indeed, a lot of the development, packaging, quality assurance, … problems encountered in scientific computing are universal in computing.

You might think that you are only interested in writing algorithms, or processing data, but this code will have to live on. My experience is that it is terribly hard to have code in a lab that can be somewhat shared and live on when people move away to another lab, or stop having time to maintain the code. Talks like

can probably be of some use.

Also, don’t underestimate the fact that some other communities might have solved some of the issues you struggle with. When dealing with real-world problems, and not only developing algorithms on a few sets of test data, a large fraction of the lines of code are related to IO, interfaces, data massaging… Two years ago, I remember that I was not terribly interested in the web-development talks. I tried to be open-minded and listen to them, but… Now I have done a bit of web development myself, and I have played with some of the famous ‘web frameworks’. I can tell you, there are some really interesting concepts there. The web guys have managed to extract a set of patterns from the problems they face and provide excellent abstractions for data handling and display. Can we learn from them? I am especially interested in getting more insight from things like ORMs (object relational mappers), and understanding better the web frameworks:

Django-ROA pour une architecture orientée ressources
PyQt4: Un exemple de sur-mesure en Model/View/Delegate (this is not about web, but MVC/MVD pattern has been used in web a lot and is universal and very important, IMHO).
Oubliez le sql avec SQLAlchemy
Developpement d’applications maintenables avec Django
Turbogears 2, présentation et introduction (tutoriel)
Programmer CouchDB avec couchdbkit
Réflexion sur l’utilisation de python pour le développement d’une plateforme web d’annotation génomique
Django par la pratique
Python et les bases de données non sql.

And finally, one more reason to come: it is so nice to actually get to meet people in real life, and have a chat.

So, see you there, for those who live in France.

Minimum spanning tree

2009-05-10T23:52:00+02:00

Gary Ruben came up with the excellent idea of visualizing the minimum spanning tree of a Delaunay tessellation in addition to the Delaunay tessellation itself. After he sent me his code, I spent some time playing with it, because I found out that, with the right choice of visualization parameters, it gave me a nice understanding of what a minimum spanning tree is: a tree structure of minimal total length connecting all the vertices of the graph, and embedded in the graph. On the visualization, the Delaunay graph is displayed in grey, and the minimum spanning tree in thick colored lines.

The minimum spanning tree is calculated using Prim’s algorithm, on the fully-connected distance-weighted graph of all points. One can clearly see that it is embedded in the Delaunay graph. In fact I have tested that calculating a minimum spanning tree on the Delaunay graph, or on the complete graph, gives the same result.
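For readers who have not met Prim’s algorithm, here is a minimal pure-Python sketch on the complete Euclidean graph of a few points. It is not Gary’s code nor the code linked in this post, just an illustration; the function name and the list-of-tuples input are choices made for this sketch.

```python
import math


def minimum_spanning_tree(points):
    """Return the MST edges (i, j) of the complete Euclidean graph on points."""
    n_points = len(points)
    if n_points == 0:
        return []
    in_tree = [False] * n_points
    # best_dist[j]: shortest known distance from the growing tree to point j
    best_dist = [float("inf")] * n_points
    best_parent = [None] * n_points
    best_dist[0] = 0.0
    edges = []
    for _ in range(n_points):
        # Greedy step: attach the outside point that is closest to the tree
        i = min((j for j in range(n_points) if not in_tree[j]),
                key=best_dist.__getitem__)
        in_tree[i] = True
        if best_parent[i] is not None:
            edges.append((best_parent[i], i))
        # The new tree point may offer shorter connections to outside points
        for j in range(n_points):
            if not in_tree[j]:
                d = math.sqrt(sum((a - b) ** 2
                                  for a, b in zip(points[i], points[j])))
                if d < best_dist[j]:
                    best_dist[j] = d
                    best_parent[j] = i
    return edges


example_edges = minimum_spanning_tree([(0, 0), (1, 0), (5, 0), (1, 1)])
```

This version costs O(n^2) distance evaluations, which is fine for a few thousand points; restricting the candidate edges to those of the Delaunay graph would be the natural way to scale it up.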

The code to create this picture can be found here.

Extracting the data from the Delaunay triangulation

2009-05-01T16:42:00+02:00

Gary Ruben just asked me if it was possible to retrieve the triangulation information from my previous Delaunay example. Actually the reason I came up with this example is that Emmanuelle Gouillart, my partner[*], needed to do Delaunay triangulation on some data. She was kind enough to extract that code from her code base. Here it is.

[*] The various languages do not seem to have evolved quickly enough to cope with the fact that people can now have a stable long-term relationship with someone you are not married to. What word should I be using here: ‘girlfriend’, ‘partner’… ?

Mayavi image of the … month

2009-04-27T22:42:00+02:00

Tonight I sat down and played a bit with VTK’s Delaunay tessellation filter. I wanted to inspect the local structure of a graph created by Delaunay tessellation of random points. To see the structure better, I selected a slab of the resulting unstructured grid. I think the image is not only instructive to explain what a Delaunay tessellation is, it also looks pretty cool. Here is the image and the Mayavi script that creates it.

Long sys.path and consequences, one more reason not to use easy_install

2009-04-09T08:43:00+02:00

For those who don’t know, sys.path is the list of directories that the Python interpreter traverses at each module import to look for the file of the imported module.

This blog post is about the consequences of having a long sys.path. I’ll try and make it short, but I would have a lot to say. I am just reacting to Noah Gift’s post on performance improvement, not writing a full essay on why overloading sys.path is considered harmful.

When using easy_install (or setuptools), each new project is installed in a different directory, and the directory is added at runtime to the sys.path (the addition at runtime confuses many users who are not aware of it). As a result, you quickly end up with more than 40 directories on your sys.path. These directories are ‘stat-ed’ one after the other on each module import. Thus if you have a long sys.path, there is a large number of system calls to read directories. To check this out, simply try:

strace python -c "import foobar" 2>&1 | less

You can see the amount of noise created by a simple (failing) import statement. On a system with high latency (such as an NFS, as we use at work), this is very costly.
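As a complement to strace, a quick way to see how crowded sys.path is from inside Python. This is only a sketch: the function name is made up, and counting entries ending in ‘.egg’ is a heuristic based on how easy_install usually names the directories it adds.

```python
import os
import sys


def summarize_sys_path(path=None):
    """Count the entries of a module search path (sys.path by default)."""
    path = sys.path if path is None else path
    # Skip the empty string, which stands for the current directory
    entries = [entry for entry in path if entry]
    eggs = [entry for entry in entries if entry.endswith(".egg")]
    # Entries that do not exist still get probed on every import
    missing = [entry for entry in entries if not os.path.exists(entry)]
    return {"n_entries": len(entries),
            "n_eggs": len(eggs),
            "n_missing": len(missing)}


summary = summarize_sys_path()
```

Each entry counted here is a directory (or zip file) that a failing import has to probe before giving up.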

Noah joyfully reports performance improvements by hijacking the Python import mechanism. I claim that part of what Noah has done is not really hijacking the import mechanism, it is undoing the hijacking performed by setuptools.

I know I am being rude, but many people raised this point before, and it is not getting any traction from the setuptools maintainer. I claim that you should not be using setuptools or easy_install if you want performance or control. I claim that you should not be using setuptools unless you understand well what you are doing (which defeats the name easy_install).

The way I install packages via easy_install when I want good control is to first use a virtual environment to discover the dependencies, and then run:

easy_install -Zeab . package_name

to download the package for each required package, and

python setup.py install --single-version-externally-managed --record ./foobar

if the package itself is using setuptools.

As you can see, setuptools makes it really hard to do a clean install. It’s a design choice :(.

Another alternative is to use pip, which I strongly encourage.

Mayavi documentation: in multiple small pages, or a few long ones

2009-03-15T00:58:00+01:00

Prabhu and I can’t decide: what is best for the documentation, more but smaller pages, or fewer but longer ones? Two specific examples:

http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/examples.html

http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/mlab.html

Right now, these pages are split in smaller pages. Should all these smaller pages be folded back in one long page? It would be a long page, but all the information would be there.

Neither Prabhu nor I want to decide solely on our personal preference. We would like to do what suits users most. This is why we want FEEDBACK :). Could you please give feedback by mail, or in a comment on this blog? Thank you!

Mayavi on the web

2009-03-07T13:06:00+01:00

Ondrej Certik has installed a sage notebook on a server open to the net, with Mayavi installed on it. The result is that you have a command line interface on the web, in which you can enter Mayavi commands, and see the result. You have to be very careful to switch Mayavi to offscreen mode as soon as you load it. To see the result of a plot, just save it in a file. The sage notebook will display the image.

I have always had in mind the use of Mayavi as a backend for a scientific web application, for instance for a neuroimaging database, but what is really stunning with this implementation is the way you interact with it: a full-blown Python command line.

Mayavi UI issue

2009-02-18T09:25:00+01:00

I have been wanting to slightly change the design of a Mayavi dialog for a while. Here is the issue: when you create a visualization, e.g. through the command line in IPython, with mlab, you get a nice and small window with only your visualization, and a toolbar. If you want to change the properties of the objects in the visualization, or add some more, you need to click on a button on the toolbar, which displays a dialog, from which you can open more dialogs to edit the objects:

I am thinking of changing this to a single dialog:

The single-object-editing dialogs could still be opened by double-clicking on the pipeline.

I am not going to discuss why I believe the new version would be better than the old one, because I do not want to bias people. However, I would prefer not making the decision to change based only on my feelings. So I ask everybody, users of Mayavi or not: what do you think is better? And why? I will probably leave an option to have the old behavior, anyhow, but the default is very important.

Error in my article

2009-01-27T22:00:00+01:00

There is an error in a code example in my article that just came out in Linux Magazine France. I am so ashamed. I did test the code, but I didn’t have automated tests, so I broke it when tweaking it :(. I think the lesson is that you need to do more than doc-testing articles (it was doc-tested).

The code example is about calculating the Mandelbrot set. The idea is that you take a grid of numbers c in the complex plane, and iterate on it the function f = lambda z: z**2 + c. You look at the divergence of this iteration, and plotting a measure of the divergence gives you a nice figure. The code I wrote was:

from numpy import ogrid, zeros, abs, isnan, ones
c = x + y * 1j
threshold_time = zeros((500, 500))
z = zeros((500, 500))
mask = ones((500, 500), dtype=bool)
for i in range(50):
    z[mask] = z[mask]**2 + c[mask]
    mask = (abs(z) > 10)
    threshold_time += mask

The error is subtle. First there is the not-so-subtle mask error: I am masking the points that diverge, and iterating them even further. This is exactly the opposite of what I meant to do. Then there is the more subtle bug: the line “z[mask] = z[mask]**2 + c[mask]” is an in-place assignment. As a result the dtype of z is not modified: z is not magically cast to complex. Thus the imaginary information coming from c is lost. And that information is crucial to Mandelbrot.

The right code is:

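from numpy import ogrid, zeros, abs, ones
# Assumed grid bounds: the snippet above does not show how x and y are built
x, y = ogrid[-1.5:0.5:500j, -1:1:500j]
c = x + y * 1j
threshold_time = zeros((500, 500))
# z must be complex from the start: in-place assignment keeps its dtype
z = zeros((500, 500), dtype=complex)
mask = ones((500, 500), dtype=bool)
for i in range(50):
    z[mask] = z[mask]**2 + c[mask]
    # Keep iterating only the points that have not diverged yet
    mask = (abs(z) < 10)
    threshold_time += mask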

Plot the threshold_time array with pylab.imshow (from the matplotlib project) to get a nice figure.
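The dtype pitfall behind the second bug can be reproduced in isolation. This is a small sketch with made-up variable names; the warning filter is only there because numpy warns when it discards the imaginary part.

```python
import warnings

import numpy as np

c_values = np.array([1 + 2j, 3 - 1j])

# Assigning into an existing float array keeps the float dtype
z_real = np.zeros(2)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    z_real[:] = c_values

# Allocating the array as complex preserves the imaginary part
z_complex = np.zeros(2, dtype=complex)
z_complex[:] = c_values
```

After this runs, z_real holds only the real parts of c_values, while z_complex holds the full complex values.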

Mayavi image of the fortnight

2009-01-25T19:21:00+01:00

It’s been two weeks since I posted a ‘Mayavi image of the week’. Prabhu has made a really cool example of integrating trajectories in a 3D vector field, using, of course, the Lorenz equation for the 3D field. With nice colors, it makes a new fantastic image:

The green surface represented is an isosurface of the z component of the vector field: on this surface, the z-component changes sign. This can be seen from the trajectories, as they start going up once they pass the surface. The script generating this image is checked in as an example: https://svn.enthought.com/enthought/browser/Mayavi/trunk/examples/mayavi/lorenz.py

LinuxMag special edition on Python

2009-01-24T12:42:00+01:00

The French LinuxMag just published a special edition on Python, in which I authored a 12-page article on scientific computing. The edition is in French, so if you don’t speak French, it is of limited interest.

This special issue is an excellent resource for discovering Python, among other reasons because it presents Python from many different angles, and thus lets you discover which advanced tools are available to tackle a particular task.

Link to the magazine’s presentation, including where to buy it: http://www.gnulinuxmag.com/index.php/2009/01/23/gnulinux-magazine-hs-n%C2%B040-janvierfevrier-2009-chez-votre-marchand-de-journaux

Browse it online: http://ed-diamond.com/feuille_lmhs40/index.html

It costs only 6.50 euros, and is available at every newsstand in France (except at the Monoprix next to my place :<). That is not expensive for some sixty pages of specialized information. Buy it, even if you know Python well and will not learn anything new. You can leave it lying around at work, as passive propaganda :).

The authors have been working on their articles for more than a year. I do not know whether this denotes great perfectionism or great inefficiency :). In any case, many thanks to Philippe Biondi, who led the project and pulled it forward.

The PDF of the article

I have put the PDF of the article online

Mayavi image of the week

2009-01-13T00:38:00+01:00

The title of this post is a lure: there won’t be a Mayavi image each week, because I would run out quickly. But it sounded cool.

Anyway, here is an image of a graph, visualized with Mayavi. The graph is actually a protein structure, downloaded from the PDB. The Python script producing this visualization is checked in as a Mayavi example: https://svn.enthought.com/enthought/browser/Mayavi/trunk/examples/mayavi/protein.py

The part of the code that reads the PDB file is actually way longer than the visualization part.

I hope this script inspires people trying to visualize graphs. The combination of the GaussianSplatter filter and the volume rendering to create a halo renders really well, IMHO.

Tracking objects in scientific code

2008-12-23T01:26:00+01:00

When I started working in my new field (data analysis of functional brain images), I was surprised to find in our data-analysis scripts what I thought was a very particular code smell: the numerical code is always doing a lot of filename and path manipulation, loading and saving data even in the core routines. I couldn’t picture what seemed wrong with this, but I was uncomfortable with it.

The good

Memory management

In the data-processing work I am currently doing, we deal with large objects, mostly huge numpy arrays, though sometimes some domain-specific data containers creep in. As a result, simple calculations take time (an SVD is 10 minutes), and I am always fighting with memory.

Saving to disk is a handy way of freeing memory. Moreover, with memmapping, reading only the relevant parts of pre-calculated arrays becomes very cheap.
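As an illustration of the memmapping point, here is a minimal sketch using numpy.save and numpy.load with mmap_mode. The file name and the array contents are made up for the example; only the two NumPy calls reflect what is described above.

```python
import os
import tempfile

import numpy as np

# Hypothetical pre-calculated array, saved once to a temporary .npy file
path = os.path.join(tempfile.mkdtemp(), "precomputed.npy")
np.save(path, np.arange(1000000, dtype=float).reshape(1000, 1000))

# mmap_mode="r" maps the file read-only instead of loading it all in memory
data = np.load(path, mmap_mode="r")

# Copying one row into memory only touches the part of the file backing it
row = np.array(data[10])
```

The object returned by np.load is a numpy.memmap, which can be sliced like a regular array.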

Crash-resistance

When the simplest operation takes ten minutes, you want to save intermediate steps, to be able to resume calculations, or to inspect why the code crashed. And who knows, you might need this intermediate step.

The bad

The immediate apparent problem is that your code becomes riddled with path-management code. We often joke that once we have figured out the algorithm, the longest surviving piece of code is the path-related junk.

But I believe this is only the tip of the iceberg, and that this code smell hints at deeper problems.

The ugly

Loss of scoping

When I started working on these problems, I was startled to encounter basic domain-specific algorithmic functions taking input and output data filenames. It took me a while to realize that the huge problem with this is that I lose scoping, or in other words naming locality. Let us pretend that I have a function ‘foo’ that does basic numerics on large numpy arrays, but to save memory it takes as arguments the name of the file where the input array is stored, and the name of the file where the output array should be stored. So I have some code that looks like this:

def process_sessions(session_files):
    for session_file in session_files:
        foo(session_file, session_file + '.out')

Saving to files in the loop saves a lot of memory.

Now I decide I want to add a parameter to foo, and vary this parameter, with, eg:

for param in params:
    process_sessions(session_files, param)

My code is hard to refactor, because I need to introduce modifications deep in all subroutines to make sure they do not save their outputs in the same file.

Suppose session_files are actually extracted from an upstream dataset, and now I want to apply my algorithm on a set of these upstream datasets, and in parallel. Once again I need to generate a score of new filenames and keep track of them. I can use temporary files, but I need to keep hold of this information too, and I lose most of my crash-resistance.

When you think it over, the way programming languages solve this problem elegantly, is by the rules connecting names to objects, and in particular scoping: a name corresponds to an object in a given function. Using files is equivalent to using globals, and we have to cook up our own scoping rules (which results in a lot of path-massaging code).

No history tracking

When I find a file on the disk, I do not really know how it has been generated. As a result, the crash-resistance is compromised. Moreover, when tweaking algorithms, we often try to rerun only the necessary parts of the algorithms, relying on the precomputed parts saved to the disk. We comment out code, or exercise different code paths. As a result we often end up in situations where the whole code does not actually run. And once again refactoring is hard, because we have not expressed the dependency relations between our intermediate results.

Doing better?

Once again, today I was refactoring my algorithm, or my “pipeline” as we call it. And once again, I felt the lack of the proper tools, the proper abstractions, the proper words, to express the problem in the code. Manipulating files directly seems wrong, for the reasons expressed above. But can we do better?

The problem, I believe, is that we need a lightweight persistence framework adapted to scientific purposes. I remember telling Travis Vaught a few weeks before beginning my new job that scientists had no problem with their persistence. Well, I was so wrong.

By a persistence framework, I do not mean a persistence mechanism, like numpy.save, or hdf5, or a database. I am interested in the objects with which we represent it in the code. How do we solve the scoping problem? And the history problem? Can we implement a “trajectory tracking”, to reuse the words of Alexandre Fayolle, for our data containers?

I am thinking about a small set of well-thought abstractions, a bit like the use of ORM (object relational mappers) in web application, that would take care of the mapping from in-memory objects to objects on the disk for us.

I am starting to have some ideas. I am thinking in terms of context objects, with getattr tricks to do the mapping to a database doing the bookkeeping and the trajectory tracking, and doing the impedance matching with objects stored as numpy “.npy” files, hdf5 files, nifti files, or whatever you want. The added value of a database would be that it would give some robust locking, and possible network abstraction, to allow for crash-safety, and parallel or distributed computing.

This may sound overkill, or overcomplicated. I’ve tried simple things. They all failed.

This is a problem that matters a lot to me. I feel I am losing a lot of time on this. However I feel that the effort required to do something good is substantial. I am also afraid of polluting my numerical code with unnecessary abstractions. The main problem is that attempting to solve this problem would require a significant investment in time, and I don’t really see where I can find this time.

Have people encountered similar problems? Do you have any suggestions, any trick to share?

I’d be very happy to read any comments that can move forward my thinking, even if it is about pointing out problems and not solutions. I still think I haven’t identified the problems well.

Update: I have just realized that I will be almost without internet access for the next week, starting from pretty much now. Looks like it was a bad moment to start a thrilling discussion. I guess I got carried away by the discontent of a day doing some bad refactoring. I really look forward to catching up when I come back. Please forgive me for the bad timing.

Update

Patterns derived from this line of thought are now implemented in the joblib library.
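To give a rough idea of the kind of pattern meant here, below is a minimal sketch written with the standard library only. It is not joblib's implementation or API: the decorator name disk_cached, the file naming scheme, and the toy function foo are all hypothetical. The point is that the file name is derived from the function and its arguments instead of being chosen by hand, which gives back some naming locality.

```python
import hashlib
import os
import pickle
import tempfile


def disk_cached(cache_dir):
    """Hypothetical decorator: store each result in a file named after the call."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            # The file name is derived from the function name and its arguments,
            # so two calls with different parameters never share a file.
            key = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            digest = hashlib.sha1(key).hexdigest()
            path = os.path.join(cache_dir, "%s-%s.pkl" % (func.__name__, digest))
            if os.path.exists(path):
                # Result already on disk: reload it instead of recomputing
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


cache_dir = tempfile.mkdtemp()
calls = []  # records the calls that actually executed foo


@disk_cached(cache_dir)
def foo(session, param):
    calls.append((session, param))
    return [session, param * 2]


first = foo("session1", 3)
second = foo("session1", 3)  # read back from disk, foo is not re-run
third = foo("session1", 4)   # different parameter, different file
```

Adding a parameter to foo, as in the refactoring example above, then requires no path-management code: each (function, arguments) pair gets its own file.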

What’s new in Mayavi 3.1.0?

2008-12-11T00:56:00+01:00

Mayavi 3.1.0 has just been released, and I think it is a fantastic version. We are starting to be able to focus on the details and the polish. In addition, we are getting user feedback, which helps identify the pain points.

Automatic scripting

This is a huge deal! You now have a record button on the pipeline view. In record mode, the modifications that you make to the objects’ properties are recorded as valid Python lines: Mayavi tells you which lines of code modify those properties or create new objects. I use this a lot: I first build a skeleton of a visualization using mlab, but when it comes to tuning parameters, I do it interactively, and record.

Much more testing

We added a huge amount of testing (many thanks to Suyog, who contributed quite a few tests). From a user’s point of view this has two consequences. First the code is more robust (for instance the mlab commands are more flexible on the shape of the arguments passed in). Second the rendering part of the Mayavi engine is well-separated from the algorithms, which means that the VTK algorithms can now be used easily to manipulate numpy arrays through Mayavi.

Two new mlab functions: barchart and triangular_mesh

Mlab has two new functions: one to create nice bar charts, for 2D histograms displayed in 3D, and one to build meshes defined from their triangles.

Control of the pipeline through mlab is easier and more robust

As the mlab.pipeline is getting more usage, it is being ironed out. For instance applying a module to a source object (be it a Mayavi source or a VTK dataset) adds it automatically to the figure if it is not already in it. Also, when adding an additional module on an existing source, a new module manager (the object controlling the colormap) is added automatically if the colormaps or extents differ. Many modules take keyword arguments to make common operations easier.

IPython in Mayavi

If you have a recent version of IPython installed (> 0.9), Mayavi will use an IPython widget, instead of the vanilla pyshell.

mlab.view has now a sensible behavior

The mlab.view no longer gives a bad roll angle to the camera. This makes it much easier to do animations during which the camera moves.

Axes and outline extents

mlab.axes and mlab.outline now adjust by default to the extents of the object they are applied on. This removes a bad surprise for people having tuned the scale of their visualization.

enthought.tvtk.tools.visual in Mayavi

enthought.tvtk.tools.visual can now be used inside Mayavi, to provide a visual-like API in mayavi.

Documentation has received some love

Documentation has been added and completed, with a focus on making it easier for the beginner to discover the features of Mayavi. We try more and more to walk the user through complete usecases of Mayavi, in a task-oriented documentation, such as in the introductory examples, or in case-studies.

Two new sources

There are two new sources that do not require data. The first creates objects, such as an arrow, a cube, or a view of the earth, to be viewed with a ‘surface’ module. The second creates image data, such as a disk, or a 2d gaussian, or (my favorite) the Mandelbrot set. This can be viewed with an ImageActor, or (even better) with a WarpScalar filter and a Surface. These sources have been contributed by Suyog.

A word of thanks

I am sure I am going to forget some people here, but I’d like to thank a lot those who have been helping us with getting Mayavi2 going. First of all, Dave Peterson, who is doing the release management for ETS. This is a lot of work, and we would never have frequent releases without him. I’d also like to thank Suyog Jain, who contributed some code. This is fantastic, and I am sure we are going to have more people contributing improvements. Finally, I’d like to thank Pierre Raybaut, of Python(x,y), and Varun Hiremath, of Debian. Packaging is very important to our users, and it is not a trivial piece of work… Hum, I almost forgot Chris Casey. Chris has been updating the docs on the net and making sure the docs build well. This is also very important, as the web page is a major means of communication with our users.

120 pages!

2008-12-09T01:00:00+01:00

The mayavi manual in SVN now has 120 pages when compiled to pdf. I know that this is a stupid metric, and that the quality is more important than the number of pages, but it does give me a warm and fuzzy feeling.

More seriously, the next release of Mayavi (coming soon, we promise) is going to have a lot of added documentation for the casual users. In particular the mlab section has been expanded a lot and is starting to hint at Mayavi’s full power.

Thanks to Chris Casey, who is making sure that the docs land on the net as soon as they are written.

Using Mayavi to explore a potential field

2008-11-22T15:22:00+01:00

As promised, here is the sequel to the tutorial I posted yesterday on using Mayavi with scipy to understand the trajectories of a particle in a potential. (chances are you are reading this before my previous post. I suggest you first jump to my previous post, and then come back here).

This tutorial shows you how to use the powerful VTK and Mayavi features to explore the trajectories in the same potential. However, the tools we are using do not give us as much control on the dynamics of the system, so this time we do not add damping or oscillation of the potential. At the end of the day, the resulting visualization is however much more interactive. Once again, I would like as much feedback as possible, as this is intended for the Mayavi User Guide.

In this example, we create a vector field from the gradient of a scalar field and explore it interactively. This example shows you how to do some operations similar to the previous example, but interactively, using the filters and modules. This approach requires a better knowledge of Mayavi and the VTK filters, but the big gain is that the resulting visualization can be explored interactively.

First, let us create the same scalar field as in the previous example. We open Mayavi and enter the following code in the Python shell:

from enthought.mayavi import mlab
import numpy as np

def V(x, y, z):
    """ A 3D sinusoidal lattice with a parabolic confinement. """
    return np.cos(10*x) + np.cos(10*y) + np.cos(10*z) + 2*(x**2 + y**2 + z**2)
X, Y, Z = np.mgrid[-2:2:100j, -2:2:100j, -2:2:100j]
mlab.contour3d(X, Y, Z, V)

As in the previous example, we can change the color map and the values chosen in the isosurfaces.

We want to take the gradient of the scalar field, to create a vector field. To do this we are going to use the CellDerivatives filter, that takes derivatives of the data located in the cells (that is, between the points, see *Creating data for Mayavi*). For this, we first need to interpolate the data from the points where it is located to the cells, using a PointToCellData filter. We can then apply our CellDerivatives filter, and then a CellToPointData filter to get point data back. (remark: if you are not using the latest Mayavi from SVN - 3.1.0 - you need to enable the ‘pass data’ option in the two CellToPointData and PointToCellData filters).

To visualize the vector field, we can use a VectorCutPlane module. The resulting vectors are too large, and we can go to the Glyph tab, (and the Glyph tab in this tab), to reduce the scale factor to 0.2. The vector field is still too dense, therefore we go to the Masking tab to enable masking, mask with an on ratio of 6 (one arrow out of 6 is masked) and turn off the random mode.

To have nice colors, we also changed the color map of the vector field by going to the Colors and legend node just above the VectorCutPlane, and choosing a look up table in the VectorLUT tab, as there can be different color maps for vector data and scalar data.

Unlike the previous example, we can play with all the parameters in the dialog box, like masking, or select color_by_scalar in the Glyph tab, to display the value of the potential. We can also move the cut plane used to display the vectors by dragging it.

Now that we have a 3D vector field, we can also use Mayavi to integrate the trajectory of a particle in it. For this we can use the streamline module. It displays trajectories starting from the vertices of a seed surface. We choose (in the Seed tab) a Point Widget as a seed. We can then move the seed point by dragging it along in the 3D scene. This allows us to explore the trajectories in the potential created by the initial scalar field. In our case, all the trajectories end up in a local potential minimum, and moving the seed point along lets us see which minimum each point falls into, in other words the basin of attraction of each local minimum.

Using Mayavi with Scipy: a tutorial

2008-11-22T00:19:00+01:00

Many years ago, I was working with a bright undergrad on the trajectories of atoms in a complex light field created by the intersection of two laser beams. She had developed a code in C, and I was starting to discover Python, so we had wrapped it in Python. We were using the Python binding, together with ipython and matplotlib, to explore and debug the code. However, our problem was really fundamentally 3D, and I didn’t find the status of the 3D plotting tools in Python satisfying.

That use case was very much on my mind while working on Mayavi, as I have always believed that Mayavi and ipython could make a fantastic steering and debugging tool for 3D Physics code. I think Mayavi is starting to be pretty mature for this set of problems and, as I am improving the docs, I decided to write a tutorial example on this specific problem. I am posting it here as a preview. This is going to go in the docs, so please, if you have any comments that might improve it, fire away.

This tutorial example shows you how you can use Mayavi interactively to visualize numpy arrays while doing numerical work with scipy. It assumes that you are familiar with numerical Python tools, and shows you how to use Mayavi in combination with these tools.

Let us study the trajectories of a particle in a potential. This is a very common problem in physics and engineering, and visualization of the potential and the trajectories is key to developing an understanding of the problem.

The potential we are interested in is a periodic lattice, immersed in a parabolic confinement. We will shake this potential and see how the particle jumps from one hole of the lattice to another. The parabolic confinement is there to limit the excursions of the particle:

import numpy as np

def V(x, y, z):
    """ A 3D sinusoidal lattice with a parabolic confinement. """
    return np.cos(10*x) + np.cos(10*y) + np.cos(10*z) + 2*(x**2 + y**2 + z**2)

Now that we have defined the potential, we would like to see what it looks like in 3D. To do this we can create a 3D grid of points, and sample it on these points:

X, Y, Z = np.mgrid[-2:2:100j, -2:2:100j, -2:2:100j]
V(X, Y, Z)

We are going to use the mlab module (see *Simple Scripting with mlab*) to interactively visualize this volumetric data. For this it is best to type the commands in an interactive Python shell, either using the built-in shell of the Mayavi2 application, or in ipython -wthread. Let us visualize the 3D isosurfaces of the potential:

from enthought.mayavi import mlab
mlab.contour3d(X, Y, Z, V)

We can interact with the visualization created by the above command by rotating the view, but to get a good understanding of the structure of the potential, it is useful to vary the iso-surfaces. We can do this by double-clicking on the IsoSurface in the Mayavi pipeline tree (if you are running from ipython, you need to click on the Mayavi icon on the scene to pop up the pipeline). This opens a dialog which lets us select the values of the contours used. A good view of the potential can be achieved by turning off auto contours and choosing -0.5 as a first contour value (eg by entering it in the text box on the right, and pressing tab). A second contour can be added by clicking on the blue arrow and selecting “Add after”. Using a value of 15 gives a nice result.

We can now click on the Colors and legends on the pipeline and change the colors used, by selecting a different LUT (Look Up Table). Let us select ‘Paired’ as it separates levels well.

To get a better view of the potential, we would like to display more contours, but the problem with this approach is that closed contours hide their interior. One solution is to use a cut plane. Right-click on the IsoSurface node and add a ScalarCutPlane through the “Add module” sub menu. You can move the cut plane by clicking on it and dragging.

To make the link between our numpy arrays and the visualization, we can use the same menu to add an Axes and an Outline. Finally, let us add a colorbar. We can do this by typing:

mlab.colorbar(title='Potential', orientation='vertical')

Or using the options in the LUT dialog visited earlier.

We want to study the motion of a particle in this potential. For this we need to derive the corresponding force, given by the gradient of the potential. We create a gradient function:

def gradient(f, x, y, z, d=0.01):
    """ Return the gradient of f in (x, y, z). """
    fx  = f(x+d, y, z)
    fx_ = f(x-d, y, z)
    fy  = f(x, y+d, z)
    fy_ = f(x, y-d, z)
    fz  = f(x, y, z+d)
    fz_ = f(x, y, z-d)
    return (fx-fx_)/(2*d), (fy-fy_)/(2*d), (fz-fz_)/(2*d)

To check that our gradient function works well, let us visualize the vector field it creates. To avoid displaying too many vectors, we will evaluate the gradient only along a cut for X=50, and every three points on our grid:

Vx, Vy, Vz = gradient(V, X[50, ::3, ::3], Y[50, ::3, ::3], Z[50, ::3, ::3])
mlab.quiver3d(X[50, ::3, ::3], Y[50, ::3, ::3], Z[50, ::3, ::3],
                     Vx, Vy, Vz, scale_factor=-0.2, color=(1, 1, 1))
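The gradient function can also be checked numerically, without any visualization. The sketch below restates V with math.cos so that it runs on plain floats with no dependency, and compares the finite-difference result with the analytic gradient, which I derived by hand for this check. The helper exact_gradient and the test point are made up for the example.

```python
import math


def V(x, y, z):
    """Same potential as above, written with math.cos for scalar inputs."""
    return (math.cos(10*x) + math.cos(10*y) + math.cos(10*z)
            + 2*(x**2 + y**2 + z**2))


def gradient(f, x, y, z, d=0.01):
    """Central finite-difference gradient of f in (x, y, z)."""
    return ((f(x+d, y, z) - f(x-d, y, z))/(2*d),
            (f(x, y+d, z) - f(x, y-d, z))/(2*d),
            (f(x, y, z+d) - f(x, y, z-d))/(2*d))


def exact_gradient(x, y, z):
    """Analytic gradient of V: d/du [cos(10u) + 2u**2] = -10 sin(10u) + 4u."""
    return tuple(-10*math.sin(10*u) + 4*u for u in (x, y, z))


# Arbitrary test point inside the grid used above
point = (0.3, -0.7, 1.1)
numeric = gradient(V, *point)
exact = exact_gradient(*point)
max_error = max(abs(n - e) for n, e in zip(numeric, exact))
```

With the default step d=0.01, the finite-difference result stays within 0.02 of the analytic value on each component, which is small compared to forces of order 10.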

Now we can use scipy to integrate the trajectories. We first have to define a dynamical flow, the function that returns the derivative of the different parameters as a function of these parameters and of time. The flow is used by every ODE (ordinary differential equation) solver; it gives the dynamics of the system. The dynamics we are interested in is made of the force deriving from the potential, which we shake with time in the three directions, as well as a damping force. The damping coefficient and the amount and frequency of shaking have been tuned to give interesting dynamics.

def flow(r, t):
    """ The dynamical flow of the system """
    x, y, z, vx, vy, vz = r
    fx, fy, fz = gradient(V, x-.2*np.sin(6*t), y-.2*np.sin(6*t+1), z-.2*np.sin(6*t+2))
    return np.array((vx, vy, vz, -fx - 0.3*vx, -fy - 0.3*vy, -fz - 0.3*vz))

Now we can integrate the trajectory:

from scipy.integrate import odeint

# Initial conditions
R0 = (0, 0, 0, 0, 0, 0)
# Times at which we want the integrator to return the positions:
t = np.linspace(0, 50, 500)
R = odeint(flow, R0, t)
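To make explicit what an integrator does with the flow, here is a minimal pure-Python sketch that steps the same system with a fixed-step fourth-order Runge-Kutta scheme. This is a swapped-in technique for illustration only: odeint uses an adaptive solver, so the two results are not expected to agree digit for digit. The helper rk4_step is hypothetical, and V, gradient and flow are restated with math functions so that the block is self-contained.

```python
import math


def V(x, y, z):
    """Same potential as above, written with math.cos for scalar inputs."""
    return (math.cos(10*x) + math.cos(10*y) + math.cos(10*z)
            + 2*(x**2 + y**2 + z**2))


def gradient(f, x, y, z, d=0.01):
    """Central finite-difference gradient of f in (x, y, z)."""
    return ((f(x+d, y, z) - f(x-d, y, z))/(2*d),
            (f(x, y+d, z) - f(x, y-d, z))/(2*d),
            (f(x, y, z+d) - f(x, y, z-d))/(2*d))


def flow(r, t):
    """Same dynamical flow as above, returning a plain tuple."""
    x, y, z, vx, vy, vz = r
    fx, fy, fz = gradient(V, x - .2*math.sin(6*t), y - .2*math.sin(6*t + 1),
                          z - .2*math.sin(6*t + 2))
    return (vx, vy, vz, -fx - 0.3*vx, -fy - 0.3*vy, -fz - 0.3*vz)


def rk4_step(f, r, t, dt):
    """One classical fourth-order Runge-Kutta step from time t to t + dt."""
    k1 = f(r, t)
    k2 = f([ri + 0.5*dt*ki for ri, ki in zip(r, k1)], t + 0.5*dt)
    k3 = f([ri + 0.5*dt*ki for ri, ki in zip(r, k2)], t + 0.5*dt)
    k4 = f([ri + dt*ki for ri, ki in zip(r, k3)], t + dt)
    return [ri + dt/6.0*(a + 2*b + 2*c + d)
            for ri, a, b, c, d in zip(r, k1, k2, k3, k4)]


n_steps = 500
dt = 50.0 / (n_steps - 1)  # same time grid as np.linspace(0, 50, 500)
trajectory = [[0.0]*6]     # same initial conditions as R0
for i in range(n_steps - 1):
    trajectory.append(rk4_step(flow, trajectory[-1], i*dt, dt))
```

Each entry of trajectory holds (x, y, z, vx, vy, vz) at one time step, in the same layout as the rows of R.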

And we can now plot the trajectories, after removing the cut plane and the vector field by right-clicking on the corresponding pipeline node and selecting delete. We also turn the first color bar off in the corresponding Colors and legends node. We plot the trajectories with an extra scalar information attached to it, to display the time via the colormap:

x, y, z, vx, vy, vz = R.T
trajectory = mlab.plot3d(x, y, z, t, colormap='hot',
                    tube_radius=None)
mlab.colorbar(trajectory, title='Time', orientation='vertical')

If I have time, I’ll show later how some of the operations we have done with numpy can be done with VTK and Mayavi. This will give us control of these operations via widgets and thus more interactivity.

Numpy documentation editor

2008-10-27T00:50:00+01:00

Pauli Virtanen and I have finally finished transferring the numpy documentation editor to http://docs.scipy.org. The documentation editor is a project that has been mainly championed by Pauli. It allows you to edit in a wiki-like fashion the documentation for numpy, including the docstrings. The changes are reviewed by editors, and eventually merged in the numpy svn. As a result, they are shipped with numpy and end up on everybody’s install of numpy.

The documentation editor was deployed during the summer on my girlfriend’s hosted server, but we were afraid it wouldn’t scale there (and besides, using my girlfriend’s server was not ideal). The contributions made through the web portal have already helped improve the numpy documentation tremendously. It is a pleasure to look at the docstring of a function and find it actually helpful. Now that it is hosted on the main scipy servers, we are no longer afraid of making as much publicity as possible around it. So please, go straight to http://docs.scipy.org and start improving the docs. More seriously, when you think a feature is poorly documented, when you have fought for a few hours to understand how a function works, improve the docs: it is very easy, and if everybody does this, you’ll save time too.

In the long run we would like to get scipy itself under the same mechanism, and I would love to open the service to other major Python scientific computing libraries that form the scipy ecosystem.

My travels this summer

2008-10-05T23:24:00+02:00

This summer has been hectic (life is hectic anyhow!). As I was switching fields from physics to neuro-imaging, I took the chance to travel to the US and to spend the summer doing Python-related stuff.

Austin - Enthought

I spent most of my time this summer at Enthought, in Austin, Texas. Actually it was the Enthought guys who really made my fantastic summer possible by paying me a good salary and thus indirectly funding my travels. Thanks Enthought, you guys rock.

Austin is a nice city. It’s got a very nice night life (though the tequila doesn’t match Jarrod’s expectations, and they don’t always accept South Africans in bars because they look like kids and “any college kid could fake this ID”).

I don’t know if it is particular to the Enthought guys, or it is just Texans in general, but the hospitality was simply incredible. I spent the summer at Eric Jones’ place (I had a separate little cabin for me, fantastic). I enjoyed a lot interacting with Eric’s family: Courtney and the kids, Zach and Liz. It was fun to be at a place with young children. By the way, they go around in scipy T-shirts and love when daddy’s friends (aka geeks) come around at the house. I bet they know the scipy community better than I do.

Of course, there is a catch: Eric lives out in the boonies (just in case you don’t speak Texan, this means “out in the bush”, ie in a remote location). It might even have been a trap: he lives in a dry county. Moreover, I woke up one morning to find this in my living room:

It’s not deadly, I have been told, it just hurts a lot. On top of that, the kids brought back a dead black widow one day. Now it may seem to some that I am making a stupid fuss about nothing, but the scariest animals we have where I live are probably cats.

The Enthought office is a very pleasant place to work. It is large, with a lot of space for everybody. It is full of very nice people (I knew that before coming, but it was nice to have a confirmation). I had the nicest office I have ever had so far (they seem to be improving each time I change jobs, that’s a good sign). I’ll talk about what I did there in another post (I actually started a screen cast about this, but the state of video-editing software under Linux is abysmal, and I could never edit the sound track).

Visits to California

I made two trips to California. First I went to UCLA for a summer school on mathematical methods in neuro-imaging. It was the occasion for me to discover the field. One thing that struck me was the importance of software in the field, and how little people are organized to limit duplication of efforts. On the other hand, Steve Pieper gave a very nice presentation about a beautiful software engineering effort (slicer) trying to build a platform to unite tools. Fernando and I were gritting our teeth during the talk, wondering what the license of the tools would be, and it turned out that Steve ended his talk with a discussion about why it seemed to him that the BSD license was most suited. We were delighted.

My second visit was for the SciPy conference, which was great fun. Fun to meet new people in real life, fun to meet old friends. I had the feeling the talks were excellent and I learned a lot of things. After the conference I went to Berkeley with the nipy team. Nipy stands for NeuroImaging in Python. It is a project led by a crack team at Berkeley with Chris Burns, Jarrod Millman, Fernando Perez, Tom Waite and now Matthew Brett. The team I am starting my new job with, in France, hopes to be able to integrate their software with nipy.

Berkeley is a nice place. I stayed at Fernando’s and Jarrod’s and it is always a pleasure to hang out with these guys. As far as work goes, we tried to do some 3D visualization of neuro-imaging data with Mayavi. We got some things done and the week ended in a party at Chris’s place where we greeted Jarrod with a mac-book air displaying a really cool view of a brain. However I have the feeling I stayed just long enough to understand the problems, and not to solve them. Damn, software is hard. Hopefully my work will allow me to move further in this direction.

Back in France, and off to Prague

Well, after all these travels, I got back to France. Of course the first thing I did with my girlfriend, Emmanuelle, was to shoot out of the country. I need to get away from a computer, sometimes. We went to Prague. It is a very beautiful city, it has good beers, and it is the home town of Ondrej Certik, so that made three good reasons to go there. We had a great weekend strolling through the city, in the old streets or in the castles (photos here). Most people there speak English, but I was lucky enough to order fried cheese (doesn’t that sound nice?) from a lady not speaking English, so we fell back on Russian as a common language, and it is always fun for me to speak Russian (now that’s a long sentence).

We had a bunch of beers with Ondrej. We spent some time talking the world into a better place. We discussed licensing, and agreed that Cython was sooooo cool, and talked about all things under the (scientific Python) sun.

In the past year I have the feeling I have been all over the place. I changed jobs three times, moved houses once in France and twice abroad, and visited a lot of countries for the fun of it. In the near future I am planning to settle down in France to get some work done.

We need help

Speaking of work, I am starting a new career in a new academic field. This is going to require a lot of focus from me, and it will not leave too much time for open source work in the near future. We need help! There are many ways to help and not all of them involve coding. I think I spend no more than 50% of my open-source-devoted time on coding. We need better docs. We need more marketing (we are really bad at this: we have fantastic tools, but it is hard to see them). We need people to help each other on the mailing lists. We need packaging. All this is paramount and takes a lot of time… And we need coding.

SciPy Conference proceedings

2008-09-22T14:54:00+02:00

The SciPy conference proceedings are finally available online: http://conference.scipy.org/proceedings/SciPy2008 .

I hope you enjoy them. I find it great to have this set of excellent articles talking about works done with, or for, Python in science. For me, it is a reference to remember what was said at the conference. I hope it can also be interesting for people who were not present at the conference.

I apologize for being so slow at publishing them. In addition to the round trip between authors and editors taking a while, I have been travelling back home and spent way too much time last week finishing off administrative duties in the US.

Rendering static pages with Turbogears

2008-09-07T07:12:00+02:00

Turbogears hack

Suppose you have a dynamic website using turbogears, and you want to publish part of the content of this dynamic site to a static website, for instance to guarantee its longevity. Well, turbogears makes it really hard for you to do this. On the mailing lists they pretty much advise you to create a webserver and crawl it. Ugly. So here is the code required to render the kid templates that you have been using with turbogears to an html string (consider this as a brain dump, so that Google picks this up, hopefully to help somebody not to lose a day like I did):

# First set up the environment you need for your webapp:
import turbogears
turbogears.update_config(configfile="dev.cfg",
                         modulename="sanum.config")

from itertools import izip
import turbogears.view
turbogears.view.load_engines()

import turbogears.util as tg_util
from turbogears.widgets import js_location

engine = turbogears.view.engines.get('kid')

def render_static(data_dict, template):
    """ Render a given template + its data dictionary to a static html.
    """
    data_dict['tg_css'] = tg_util.setlike()
    data_dict['tg_flash'] = False
    js = dict(izip(js_location, iter(tg_util.setlike, None)))

    for l in iter(js_location):
        data_dict["tg_js_%s" % str(l)] = js[l]

    return engine.render(data_dict, template=template)

You can call this function with data_dict being a dictionary as returned by your controller methods, and template the path to your template, as in the expose decorator.

pyreport: literate programming in python

2008-07-23T00:00:00+02:00

pyreport is a program that runs a python script and captures its output, compiling it to a pretty report in a pdf or an html file. It can display the output embedded in the code that produced it and can process special comments (literate comments) according to markup languages ( rst or LaTeX ) to compile a very readable document.

This allows for extensive literate programming [1] in python, for generating reports out of calculations written in python, and for making nice tutorials.

License: pyreport is free software released under a BSD-like license. You can check out the latest code, submit bugs, ask questions… on the github project page.

Warning

Pyreport is unmaintained

Due to lack of time, pyreport is unmaintained and looking for a contributor. Please do not contact me to ask questions, unless it is to take over development.

Contents

Requirements
Installing
Examples and use cases
Command line switches

Requirements

pyreport is Python, and needs the Python interpreter to run. It should work on any operating system where Python and the other requirements are available.

External programs:

Under Windows make sure they are in the path, otherwise pyreport cannot find them. All these programs are available with the MikTeX LaTeX distribution.
- pyreport does not need any external programs to generate html files.
- LaTeX : currently pyreport calls LaTeX to generate pdf files. Hopefully one day this will be optional and pyreport will use reportlab to output pdf files.
- epstopdf or ps2pdf if you want to use pylab to insert graphs in your documents. Once matplotlib has a pdf backend this will not be needed.
Python packages:
- docutils only

Installing

The easiest way to install pyreport is using pip:

pip install pyreport

It is recommended to install to your user drive:

pip install --user pyreport

Examples and use cases

Initial goals

I use python to write small scripts that can do, for instance, numerical calculations, or simple operations [2] and I want to have nice print-outs of these scripts to study off-screen or to hand out to colleagues. Having the code with its relevant output is great for code reviewing. This also allows something similar to mathematica’s notebook in python without having to use a special IDE.

First I want to be able to have a print-out of these calculations where I can see the code ran, and the results produced:

Second I would like all the plots produced by matplotlib to be captured and displayed too.

Last, I would like to be able to comment these reports, give them titles, sections, … This can be done via “literate programming”: comment lines beginning with a special set of characters are interpreted as rst or LaTeX. I also want these files to still be standard python files, and to be able to run them with the python interpreter.
#! Just a title
#!---------------
a = "a"
#! *ooo oo
b = "b"
#$ This is \LaTeX : $c = 2\cdot(a+b)$
c = 2*(a+b)
print c
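To make the idea concrete, here is a minimal sketch of how such a script could be split into text blocks and code blocks. It is only an illustration: the `split_literate` helper is hypothetical and is not pyreport's actual parser, and the sample script uses Python 3 syntax.

```python
def split_literate(source, comment_char="!"):
    """Group the lines of a script into ("text", ...) and ("code", ...) blocks.

    Hypothetical helper for illustration, not pyreport's actual parser.
    """
    marker = "#" + comment_char
    blocks = []
    for line in source.splitlines():
        if line.startswith(marker):
            # Literate comment: drop the marker, keep the markup.
            kind, content = "text", line[len(marker):].strip()
        else:
            kind, content = "code", line
        # Merge consecutive lines of the same kind into one block.
        if blocks and blocks[-1][0] == kind:
            blocks[-1][1].append(content)
        else:
            blocks.append((kind, [content]))
    return [(kind, "\n".join(lines)) for kind, lines in blocks]


script = '''#! Just a title
#!---------------
a = "a"
b = "b"
#! The sum of the two strings, doubled:
c = 2*(a+b)
print(c)
'''

for kind, content in split_literate(script):
    print(kind.upper())
    print(content)
```

Because the literate lines are ordinary comments, the script stays a valid python file that the interpreter can run unchanged.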

Other possible uses

Hiding the source code allows you to generate nice reports from calculation scripts without worrying about writing document-generating code in the script itself. The use of the literate comments and print statements allows your report to be well structured and self-explaining. The major advantage of having the text of the report in the source of the calculation is that the report always describes the calculation that was actually run, and not a previous one, with incorrect constants, for instance.

With a moin-moin syntax and a pdf output this can be a very useful tool for writing tutorials and putting them on line, with a pdf version.

Examples

Here are two examples showing what you can do with pyreport :

A calculation of the bifurcation diagram of the logistic mapping

The code , the pdf generated , and the html file generated

An exploration of the Julia sets, this example uses a LaTeX equation (LaTeX embedding does not work with html output, so far):

The code , the pdf generated , and the html file generated

Limitations

The “sys” module is imported in your code, whether you want it or not. As a general rule, beware of the namespace when running your scripts with pyreport, pyreport injects a few variables in your namespace.

from __future__ import foobar in a script does not work. This is a big caveat! This is simply not possible in python, as the from __future__ imports have to be the first statement of a script.

As a consequence I have made the decision to always import division, so that 2/3 = 0.6666.
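The constraint on from __future__ imports can be checked with the built-in `compile` function. The sketch below is only an illustration of the language rule; the prepended `import sys` line is a stand-in for whatever a wrapper tool might inject, and I am not claiming this is how pyreport runs scripts.

```python
def compiles(source):
    """Return True if `source` compiles as a module, False on a SyntaxError."""
    try:
        compile(source, "<script>", "exec")
    except SyntaxError:
        return False
    return True


user_script = "from __future__ import division\nresult = 2 / 3\n"

# The script on its own is fine: the __future__ import comes first.
print(compiles(user_script))

# Prepending any statement (here a stand-in for what a wrapper tool might
# inject) pushes the __future__ import down and makes the script invalid.
print(compiles("import sys\n" + user_script))
```

The first call succeeds and the second fails, which is why a tool that adds its own statements in front of your script cannot honour such imports.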

Some strange bugs can occur depending on the backend you use for matplotlib. WXAgg has played me a few tricks.

Command line switches

This is not as useful as a well written documentation, but it is better than nothing:

usage: pyreport [options] pythonfile

Processes a python script and pretty prints the results using LateX. If
the script uses "show()" commands (from pylab) they are caught by
pyreport and the resulting graphs are inserted in the output pdf.
Comments lines starting with "#!" are interprated as rst lines
and pretty printed accordingly in the pdf.
    By Gael Varoquaux

options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -o FILE, --outfile=FILE
                        write report to FILE
  -x, --noexecute       do not run the code, just extract the literate
                        comments
  -n, --nocode          do not display the source code
  -d, --double          compile to two columns per page (only for pdf or tex
                        output)
  -t TYPE, --type=TYPE  output to TYPE, TYPE can be ps, dvi, trac, eps, tex,
                        html, pdf, rst, moin
  -f TYPE, --figuretype=TYPE
                        output figure type TYPE  (TYPE can be of pdf, jpg,
                        eps, png, ps depending on report output type)
  -c CHAR, --commentchar=CHAR
                        literate comments start with "#CHAR"
  -l, --latexliterals   allow LaTeX literal comment lines starting with "#$"
  -e, --latexescapes    allow LaTeX math mode escape in code wih dollar signs
  -p, --nopyreport      disallow the use of #pyreport lines in the processed
                        file to specify options
  -q, --quiet           don't print status messages to stderr
  -v, --verbose         print all the message, including tex messages
  -s, --silent          Suppress the display of warning and errors in the
                        report
  --noecho              Turns off the echoing of the output of the script on
                        the standard out
  -a ARGS, --arguments=ARGS
                        pass the arguments "ARGS" to the script

[1]

literate programming is a programming style that embeds the documentation of a program in its source code. The documentation is generated at the same time as the program is built. I am using the term in a very loose way, as pyreport is capable of literate programming, but also of much more, as it embeds the documentation, but also weaves the output of the script in the documentation.

[2]	scipy provides a very powerful math package to python, and matplotlib is a great plotting interface.

Scipy2008 Early-bird registration deadline ends today

2008-07-11T09:15:00+02:00

I have been planning to make a more interesting post highlighting the large trends of the SciPy2008 conference, but it is 3AM local time, and I am still hacking on Mayavi, so I think I’ll keep it short.

As far as the conference program goes, we can see a few major themes emerging. There will be talks about the use of Python for scientific works, but also talks about the growing stack of Python scientific tools. Interesting trends are the non-purely-numerical tools: symbolic and graph theory, and the race towards more optimisation through compilation from Python code. In addition this year we see a major effort on documentation. I think this is the sign of a numerical stack that is maturing.

As for the tutorials, I am personally very interested in the advanced track tutorial. The newest and coolest technologies, like Cython, are also not the ones I know best, and we have the chance to be able to listen to their authors presenting them.

The early bird registration deadline is ending tomorrow, as I point out in my title. If you miss this deadline, the conference fees will be higher, and the reason is simply that late registration makes organisation harder and more expensive. I would be happier if everybody registered before this deadline and paid less. I am not too sure what the accountant would say.

Student sponsorship for the SciPy08 conference

2008-06-27T05:26:00+02:00

I am delighted to announce that the Python Software Foundation has answered our call and is providing sponsoring to the SciPy08 conference.

We will use this money to sponsor the registration fees and travel for up to 10 college or graduate students to attend the conference. The PSF did not provide all the funds required for all 10 students and once again Enthought Inc. (http://www.enthought.com) is stepping up to fill in[1].

To apply, please send a short description of what you are studying and why you’d like to attend to info@enthought.com. Please include telephone contact information.

From my perspective, this is excellent news. First of all this means that the SciPy community is working a bit more closely with the PSF and the broader Python community. But this is also very dear to me as last year I was sponsored as a student to come to the SciPy conference. I got to meet fantastic people and discover thrilling new developments. This was the beginning of a move away from my core physics activity to more software-related work, and the realization that yes, I could do it, I could maybe contribute something useful to the community (well, I’ll let you judge that). Thanks a lot to Travis Vaught from Enthought for bringing this project to a success.

[1] I feel like we (the SciPy community) are like an aging teenager, wanting a lot of independence, but still living a lot out of parent’s money (Enthought). And I feel the first concerned, as I am spending the summer working at Enthought to get a chance to work on interesting SciPy-related projects.

Disclaimer: The second part of this post reflects my own opinions, and not those of my employer (obviously) or the SciPy08 organising committee.

Alex Martelli giving the SciPy2008 Keynote

2008-06-14T19:08:00+02:00

On behalf of the SciPy2008 conference organizing committee, I am happy to announce that the Keynote at the conference will be given by Alex Martelli.

It is a pleasure for us to receive Alex. He currently works as “Uber Tech Leader” at Google and is the author of two of the Python classics: “Python in a Nutshell” and the “Python Cookbook”. Alex graduated in electronic engineering from the university of Bologna and worked in chip design first for Texas Instruments, and later for IBM Research. During the 8 years he spent at IBM, he gradually shifted from hardware design to software development while winning three Outstanding Technical Achievement Awards. Then he joined think3 inc., an Italian CAD company, as Senior Software Consultant where he developed libraries, network protocols, GUI engines, event frameworks, and web access frontends. After 12 years at think3, he worked for 3 years as a freelance consultant, mostly doing Python development, before joining Google.

Alex won the 2002 Activators’ Choice Award, and the 2006 Frank Willison award for outstanding contributions to the Python community.

Alex has also taught courses on programming, development methods, object-oriented design, and numerical computing, at Ferrara University (Italy) and other venues. Alex’s proudest achievement is the articles that appeared in Bridge World (January/February 2000), which were hailed as giant steps towards solving issues that had haunted contract-bridge game theoreticians for decades.

This biography was loosely adapted from Alex’s autobiography (http://www.aleax.it/bio.txt), more information can be found on his website http://www.aleax.it .

Arrived in Texas

2008-06-13T15:10:00+02:00

I just arrived in Austin, Texas. I need to settle down a bit more, blog about my fantastic holidays, but I wanted to give an update of where I was.

The hospitality here has been fantastic so far. I am sitting in a comfy chair, sipping a fresh orange juice, after having spent a night in a very cosy bungalow and waking up to see two fawns in the garden. I would say that the Enthought guys really know how to treat their guests. It seems to be something the Python scientific community is really good at, judging from my different experiences (Fernando, and JB Poline).

Docs using Sphinx

2008-04-28T09:10:00+02:00

After Ipython and Sympy, Mayavi is now using sphinx to build its docs. Sphinx is very neat because it allows for high quality pdf and html from the same restructured text source. The killer feature is that the resulting html pages have a builtin search that works with javascript, and thus works on the client without the need of a server.

In addition, the developer is very reactive and dedicated to making sphinx versatile-enough to generate high-quality docs for many packages. As a result many Python projects are switching to sphinx. First Python itself (that’s what sphinx was created for), but now more and more. It seems that zope is even considering it. One great side effect is that documentation for different Python modules will be consistent, with the same look and feel (although you can tweak sphinx output if you want).

We don’t have a server serving the html docs yet (it is planned, we just need a bit of time), but you can check out the pdf generated here.

Of packaging, installation and dependencies

2008-04-12T15:52:00+02:00

I have been struggling for the last few days trying to understand the issues behind packaging and installing the Enthought Tool Suite. I think I have been making progress, though only in my head: no actual code or packages so far are terribly satisfying.

The problem

If you are developing a Python-only program, with only dependencies on the standard library, you have no problems with packaging. You can ship tarballs, MSI installer, eggs, … all this works.

However, if you want to develop a rich program that provides many features in a closely integrated and consistent way to the user, you will have to depend on external packages. I know that many projects work around this by including the external dependencies inside the project, or simply reinventing the wheel. Well this does not scale. We cannot expect to develop a major scientific tool and community this way. Reuse is the key to scalability, in my opinion. Thus comes the problem: how do we ship our program?

The problem can be very well seen with the Enthought Tool Suite (ETS). The ETS is a suite of many different packages, all pretty much geared towards building interactive scientific application. In house, Enthought, the company (disclaimer: I do not work for Enthought) uses these packages to develop domain-specific applications for customers. They have broken up the suite in a set of small packages, to enable assembling applications by requiring only the features you need. This is important because if you want to use ETS’s 3D plotting package (TVTK or Mayavi), but you want to stick with MatPlotLib to do 2D plotting, and not use Chaco, you should be able to download only what you need.

As a result the ETS is made of a set of interdependent packages. Maybe they went a bit too far in the modularity, and there are almost 50 packages. The dependency graph looks like this:

Just to reassure you, the next version of the ETS has a much reduced number of packages, just because some packages were grouped, and the dependency graph indeed is sane:

As you can see, there is a complex dependency graph. So how do you ship this to the user? Another problem that should not be underestimated is: how do you make it easy for people who distribute your projects to package this?

Setuptools

Python has no good answer for this problem, but setuptools does go part of the way. Dependencies in the ETS are declared using setuptools, and installing the ETS strongly relies on setuptools.

Setuptools provides a way of automatically downloading dependencies. However, it is not a full packaging system replacement. The reason I say this is that it does not have the knowledge of a dependency graph, it just downloads packages, introspects them to find their dependencies, and recursively tries to satisfy them by downloading more. Phillip J. Eby (the author of setuptools) has been quite clear that he does not want to write an APT replacement, though people keep getting it wrong and making the equation “easy_install = apt for Python” (IMHO this is due to bad communication on setuptools webpage).

Moreover, setuptools does not provide an easy to use API to extract all the information it has about packages, dependencies, and download URLs. It is thus not trivial to plug packages shipped with setuptools in another package manager like rpm or apt. This is what bothers me most, because this is strongly limiting the exposure the ETS is getting in distributions (whether they be Linux distributions, or scientific computing “superpacks”). Recently I have had discussions with somebody on how to ship Mayavi in a monolithic distribution he has developed. He agreed to ship setuptools with the distribution, so now I need to give him a list of eggs to provide. There is no obvious way to get this list using setuptools (insert here big big rant). So I thought that an option was to install Mayavi in a virtual environment to track the eggs added, and use this information. However, this person’s internet access was possible only by logging in on dumbed-down servers for security reasons. So we hit a wall. And for me this wall is a wall we keep hitting with setuptools: setuptools does everything for you, the download, the building, the install. It does have flags to control these processes, but it does not expose the information you need to do this without using it. I actually think the reason it does not expose this information is that it does not know it a priori. Looking at the code it does seem so. In addition, the structure of the packages makes it hard to do.

From packages to repositories

On the other side, Dave Peterson, at Enthought, has been working on a tool to allow checking out of the ETS SVN only the projects you are interested in. I played a bit with it, and modified it to generate the dependency graphs. I quickly found out that I actually like this tool much more than setuptools, even though it was pretty much using the same concepts. It took me a while to understand what I like about the tool. It is that it uses a map file to gather all the package and dependency information. As a result, it has the equivalent of a dependency graph. This makes it possible to do the operations I am interested in, eg listing all the packages required for installing a given project without actually downloading them.
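As a rough illustration of why a map file helps, here is a minimal sketch in which the whole dependency information is available up front, so the full list of required projects can be computed without downloading anything. The project names and the `required_projects` function are hypothetical; they do not reflect the real ETS layout or Dave's tool.

```python
# Hypothetical project map: each project lists its direct dependencies.
# The names are illustrative and do not reflect the real ETS layout.
PROJECT_MAP = {
    "viewer": ["plotting3d", "gui"],
    "plotting3d": ["core"],
    "gui": ["core"],
    "plotting2d": ["core"],
    "core": [],
}


def required_projects(project, project_map):
    """Return `project` and all its transitive dependencies, dependencies first."""
    ordered = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dependency in project_map[name]:
            visit(dependency)
        # A project is listed only after everything it depends on.
        ordered.append(name)

    visit(project)
    return ordered


print(required_projects("viewer", PROJECT_MAP))
```

The returned list is already in a valid installation order, and projects that are not needed (here the 2D plotting one) never appear in it.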

The reason this is possible is that with the ETS we are not dealing with an open set of packages, like PyPI, in which packages can come and go, and no consistency is enforced. We are dealing with one suite of multiple projects that are made to work with each other. The base entity is thus a project set, on which we can make a “project map”.

What Dave has done works fantastically for development; I would like to push it further for distribution. What we expose to the user can now be a repository, in the sense of APT: a set of packages with consistent inter-dependencies, and a way of easily retrieving this information. The difference between the two, and the implications of the difference, is not something I had clearly in my mind in the beginning, but it is becoming clearer that having a repository with a project map gives a lot of added value for distributing. I’ll see if I can reuse Dave’s work to build such a tool, but do not hold your breath, I am not willingly in the business of packaging, and will probably not spend enough time on this to make it a good tool.

Objects, modules and Traits and Envisage

2008-04-05T13:19:00+02:00

I have been reading an article about a new language paradigm (Erasmus, a modular language for concurrent programming). The authors discuss the limitations of objects in terms of modularity. To sum up their point (and most probably distort it completely), the limitations of objects come from the fact that you can’t be sure what is modifying what: suppose you have a method foo of an object bar that you call in a method of an object baz, you cannot be sure that this method hasn’t modified private attributes of your object baz, as foo could have called a method of your object. This does happen in large code bases. Of course, best practice tries to reduce this to a minimum, but this reduces modularity, and thus limits both code reuse and concurrency (as side effects are not well controlled).

Erasmus’s solution is to adopt a new container, that they call modules rather than objects, and that are based on message passing rather than method calls. These modules live in separate processes and can themselves be made of more conventional code (I am extrapolating a bit from the original article here).

This strikes me as being related to a pattern that I see more and more in my code that uses Traits. The objects deriving from HasTraits have a very easy and cheap way of coupling callbacks to the modification of their attributes. This induces a programming style known as reactive programming that is entirely callback-driven. In addition, this is a nice way of ensuring that the internal state of an object is always consistent. This is a first step to message passing and decoupling: you no longer call methods, you just set attributes and let the object do the rest. The limitation of this model in a large code base is that you have to carry around references to the objects you are interested in, and their attributes. Traits has patterns to help you do this (delegation, namely), but it is still a limitation.
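For readers who have never seen this style, here is a minimal pure-Python sketch of the pattern. It deliberately does not use Traits: the `Observable` class and its `on_change` method are stand-ins I made up to show the idea of callbacks attached to attribute modification.

```python
import math


class Observable:
    """Minimal stand-in for an object with attribute-change callbacks.

    Illustrative only: this is not the Traits API.
    """

    def __init__(self):
        # Bypass our own __setattr__ while setting up the callback table.
        object.__setattr__(self, "_callbacks", {})

    def on_change(self, name, callback):
        """Register `callback(new_value)` to run when attribute `name` is set."""
        self._callbacks.setdefault(name, []).append(callback)

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        for callback in self._callbacks.get(name, []):
            callback(value)


class Circle(Observable):
    def __init__(self, radius):
        super().__init__()
        # Keep the derived attribute consistent with `radius` at all times.
        self.on_change("radius", self._update_area)
        self.radius = radius

    def _update_area(self, radius):
        self.area = math.pi * radius ** 2


circle = Circle(1.0)
circle.radius = 2.0  # no method call: setting the attribute updates `area`
print(circle.area)
```

The caller never invokes a method to refresh the area; it only sets an attribute, and the object keeps its own state consistent.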

This is where the Envisage framework comes into play. Envisage introduces the notion of plugins which provide extension points. These extension points are special traits attributes that are published in a registry (which can be application-wide, or not, in Envisage3). You can query the registry to retrieve these extension points and contribute to them. After that, the traits callback mechanism triggers an action in the plugin contributing the extension point.
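The registry idea can be sketched in a few lines of plain Python. This is a toy model under my own assumptions, not the Envisage API: the `Registry` and `MenuPlugin` classes and the "menu.actions" identifier are hypothetical names chosen for the example.

```python
class Registry:
    """Hypothetical registry of extension points; not the Envisage API."""

    def __init__(self):
        self._contributions = {}
        self._listeners = {}

    def declare(self, point_id, on_contribution):
        """A plugin declares an extension point and how it reacts to contributions."""
        self._contributions[point_id] = []
        self._listeners[point_id] = on_contribution

    def contribute(self, point_id, item):
        """Another plugin contributes `item`; the declaring plugin is notified."""
        self._contributions[point_id].append(item)
        self._listeners[point_id](item)


class MenuPlugin:
    """Offers a "menu.actions" extension point and collects what is contributed."""

    def __init__(self, registry):
        self.actions = []
        registry.declare("menu.actions", self.actions.append)


registry = Registry()
menu = MenuPlugin(registry)

# A second plugin only needs the registry and the extension point name,
# not a reference to the MenuPlugin instance.
registry.contribute("menu.actions", "Open file")
registry.contribute("menu.actions", "Save figure")
print(menu.actions)
```

The point of the design is that contributors hold no reference to the plugin they extend, which removes the need to carry object references around a large code base.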

This contribution mechanism could be based on message passing between processes quite easily (although for GUIs it breaks down, because AFAIK you cannot assemble a consistent GUI from different widgets living in different process space, without using some Xwindows-specific tricks). Of course this does not give me hard guarantees of decoupling and control of the side-effects, as a call to a plugin can induce calls to other plugins inside it. This is where best practice comes along: core plugins should be able to run and provide their basic functionality outside of Envisage, as normal objects. Envisage should only be a thin wrapper allowing them to expose this functionality and extend other plugins. This is introducing a distinction between objects and method calls, that do not need to be arranged in self-consistent entities and which you use very often, and plugins and extension contributions, that form standalone entities and should be used more sparingly.

Of course Envisage cannot go too far in terms of providing guarantees for decoupling. It gives a mechanism, best practices, could even help plugin decoupling by having them live in different processes, but as long as it does not enforce rules in the semantics of the language, it cannot achieve what projects like Erasmus are trying to do. I however think it is good to have a look at the work done in these projects to see what we can learn.

Of travels and sprints

2008-04-01T02:13:00+02:00

This month I have traveled a bit for scientific-computing related reasons.

In England

First of all, I was speaking at the OKcon, open knowledge conference in London, about Scientific tools in Python in general, and Mayavi in particular. I jumped on the occasion to visit the Airbus campus in Bristol. We have had some contacts with these guys, because they use Mayavi in some of their homegrown applications, and I was curious to put faces on friendly names on the mailing list. In addition, I was eager to find out how they were using Mayavi and Python scientific tools in an industrial environment, as I have never worked in another place than a physics lab.

Visiting the Airbus campus

The Airbus visit was enlightening: the Bristol campus is a major research facility (several thousand people) dedicated to wing design. A good part of the work is done through simulations deployed on big clusters. These calculations have historically been run in Fortran and C, but apparently the engineers are switching to a mix of compiled languages and Python. Moreover, steering of these simulations, through mesh-design, visualization of the results, analysis of the data, is done mainly through an interactive program, ‘flightpad’, that is developed fully in Python, using the Envisage framework to couple together a bunch of scientific components, including Mayavi. I got to spend a fair amount of time with the guys doing this, and it was great to see how they did it. They have a good approach to scientific software design (loosely coupled components, reuse of all the existing libraries), even though their goal (automatic generation of Python scripts from user interaction) is way more ambitious than anything I have in mind. I was pleased to see that they were using Mayavi in a way completely consistent with its design, and did not have to hack around limitations.

It was really very encouraging to talk with the software strategist. He obviously completely got it as far as how an open-source model can be profitable to a company like Airbus. Seeing so many people using open source tools as their main tools, as well as a manager ready to back this position, and explaining how it can be beneficial to contribute to an open-source project, really filled me with hope.

Of course visiting the Airbus campus was not only about software, it was also about planes (I got a drive around the campus, and it is quite fun to ride a Mini Cooper between two 747s), and beers (reinventing the world to make it a better place at the pub, after work). I must say there is something special about the scientific Python community, it is the nicest community I know (with the sailing one :->). You meet people that you have never seen before, and you immediately feel at ease.

Open Knowledge conference

The Open Knowledge conference was fun. Not too much like the geek conferences I am used to, as here the focus was on the data, and not the tools, aka the software (for instance, the big deal is when you can get access to the complete public transport time-tables, and you can make maps of poorly connected areas). I met Martin Albrecht from the sage project. It was very interesting to talk with him. I generally consider myself as doing rather fundamental research (Bose-Einstein condensation), but for him I was in the applied science section, because I use math and computers to do applied things. This distinction between applied and fundamental maths yields a distinction in the application of the code, and therefore the way an open-source scientific project can survive. It was very interesting to see the way sage’s development process therefore differed from scipy’s. I think that both Martin’s talk on sage, and mine on Python and interactive visualization had a lot of success: the room was full of scholars, and they wanted tools to do their work.

In London, I had the occasion to catch up with my brother, and Rob, a former colleague. That was nice too (and yielded more beers).

Paris

Nipy Sprint

The week after, I was attending a sprint in Paris on nipy: neuroimaging in Python. We were a bunch of enthusiastic scientific Python users crammed in a small room during the day. There was the team from Berkeley, including Jarrod and Fernando, and all their friends. I got to make new friends, and catch up with old ones. The goal of the nipy effort is to build a complete processing pipeline for neuroimaging data, especially fMRI, in Python. This is a lot of work, as many transformations are applied to the raw data to make it useful for scientific publications. As the field matures, these transformations pile up, and the processing pipeline gets more and more complex. There already exists a good pipeline under MatLab (SPM); the problem is that, due to the poor language features of MatLab, it is a codebase that is hard to extend and to modify. One of the goals of the nipy project is to make a pluggable architecture, for researchers to be able to replace part of the pipeline by their own code, and thus explore new methods while comparing them to the reference one. This means that there are some interesting software engineering problems in here (pluggable pipelines, framework…, the kind of stuff I like), however the current focus is to get the algorithms right, before trying to do software over-engineering.

The Berkeley group got an NSF grant to work on the project and has been able to hire two developers for two years (Chris Burns and Tom Waite). The effort is led by Jarrod Millman, and they have put a lot of work in making the underlying libraries better (that is improving numpy and scipy).

I had difficulties contributing any useful code, as I don’t know neuroimaging, but I had the pleasure of seeing people pick up the mayavi API and use it to quickly build domain-specific tools for displaying brains and activation regions. As usual this also revealed some shortcomings in the mlab API that I plan to address ASAP.

IPython Sprint

The weekend after, Fernando, Laurent Dufréchou, Stefan van der Walt and I crashed at my parents’ place to work on ipython1 and the front ends. My mother cooked us some fabulous food and I had a great time.

Unfortunately we did not get as far as I would have liked. The right abstractions for talking between the ipython1 execution engine and the front end are not easy to get right, as the engine is nothing more than an abstract execution engine that basically only has a namespace and knows how to execute stuff in a non-blocking mode (that’s where it gets hard: how do you know what is going on with your engine and the commands you have sent to it? How do you deal with introspection requests such as tab-completion or docstring exploration?). We want as little logic in the front ends as possible: let us not duplicate tab-completion or history. This is why we are progressively building an object, which Fernando dubbed “InputStateManager”, that does the impedance matching between the front end and the engine. I am starting to believe that the best way to connect this object (ISM) to the front end is via a callback-based mechanism: the front end calls the ISM methods and gives them a callback to call when finished (for instance, if running in a different thread, a Wx frontend would pass something based on wx.CallAfter to display the result). That way the mechanism is very general, can adapt to event-driven front ends or readline-based ones, and knows nothing about the front end. Of course not much code got written, because I am way too slow, and it took me ages to figure this out.
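To make the callback idea concrete, here is a minimal sketch in plain Python. It is not the ipython1 code: ToyEngine and ToyInputStateManager are hypothetical names invented for this illustration, and the real engine is far richer than a dictionary plus exec. The only point is the shape of the interface: the front end hands a callback to every request and never blocks on the engine.

```python
import threading


class ToyEngine:
    """Hypothetical stand-in for an execution engine: it only owns a namespace."""

    def __init__(self):
        self.namespace = {}

    def execute(self, source):
        # Run the code in the engine's namespace (a blocking call).
        exec(source, self.namespace)

    def complete(self, prefix):
        # Introspection lives next to the namespace, not in the front end.
        return sorted(name for name in self.namespace
                      if name.startswith(prefix) and not name.startswith("__"))


class ToyInputStateManager:
    """Hypothetical mediator: owns history and talks to the engine."""

    def __init__(self, engine):
        self.engine = engine
        self.history = []
        self._lock = threading.Lock()  # serialize access to the namespace

    def _run(self, work, callback):
        # Do the work in a background thread, then hand over the outcome.
        def target():
            with self._lock:
                try:
                    result, error = work(), None
                except Exception as exc:
                    result, error = None, exc
            callback(result, error)

        thread = threading.Thread(target=target)
        thread.start()
        return thread

    def execute(self, source, callback):
        self.history.append(source)
        return self._run(lambda: self.engine.execute(source), callback)

    def complete(self, prefix, callback):
        return self._run(lambda: self.engine.complete(prefix), callback)


engine = ToyEngine()
ism = ToyInputStateManager(engine)
completions = []

# A wx front end would wrap its callbacks with wx.CallAfter; here they
# simply store the result. join() only keeps this demo deterministic.
ism.execute("answer = 6 * 7\nanother = 1", lambda result, error: None).join()
ism.complete("an", lambda result, error: completions.extend(result)).join()
print(completions)
```

Note that the manager knows nothing about the front end: a readline loop could pass a callback that prints, and an event-driven GUI one that posts an event to its main loop.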

We had a lot of fun, and for me the highlight of the weekend was when my girlfriend joined us to do some hacking on a really cool project trying to use the scipy.org wiki to edit the numpy docstrings.

Fernando has pictures of all these happy moments, and I hope he will publish them somewhere (Fernando, get a blog :->). Next time I hope there will be more of us.

Edit: my slides at OKcon

How is Mayavi pronounced

2008-03-26T00:26:00+01:00

I have been traveling recently and talking to friendly geeks I didn’t know yet. I have been surprised to see that many people were pronouncing “Mayavi” as “Maya-V-I”, as in “V-I”, like the old Unix editor. Maybe this comes from the spelling “MayaVi”, that Prabhu and I recently decided to avoid. Well, Mayavi is actually pronounced “Ma-ya-vee”, and it comes from an old Sanskrit name meaning magician.

Numpy doc sprint in Paris tomorrow!

2008-03-20T12:23:00+01:00

We really need to get numpy 1.0.5 out. And for this release to rock, we want to have good docs. This is why Jarrod offered to have a doc sprint tomorrow.

In addition we are currently having a sprint in Paris for neuroimaging in Python, with a bunch of numpy developers. Some of us are going to work on the doc sprint tomorrow. We will have a room dedicated to this.

It would be great if people in Paris join us. If you want to have great fun with Python geeks and get the chance to make numpy better, send an e-mail to Jarrod ( <millman> at <berkeley> dot <edu> ). The venue is in Paris 6ème.

See you all tomorrow.

Usability

2008-03-06T09:54:00+01:00

I just wanted to echo a very good blog post about usability:

Users are busy not stupid.

As you design something, ask “is this relevant to what people are trying to do?” rather than “is this confusing?” […] It doesn’t matter whether people could figure something out. It matters whether they’re interested in figuring it out - is it part of what they’re trying to do, or an annoying sidetrack?

Read the original blog post and ask yourself, when designing something: “is this relevant to what people are trying to do?”.

Supporting our users under Windows

2008-03-04T02:51:00+01:00

Many of our users use Windows. I don’t, I use Linux, but I completely respect people’s choice to use the OS they want, as I expect other people to respect my choice. As Prabhu also runs Linux (and MacOS X), this means I should sometimes roll up my sleeves and try out Mayavi under Windows. As my laptop came with a Windows installation that I did not entirely nuke, I have the option of booting under Windows. For me this is a tedious process: I don’t know anything about Windows, I lose all my beloved programs and have to struggle with non-Posix concepts.

To ease my pain, and make it easier for me to run Windows, Dave Peterson has been pushing me to run a virtual machine. VMware is able to boot an existing partition, and I installed it tonight (trick: Ubuntu users, add the “partner” repository). After fighting a bit with VMware and Windows, I was finally able to start the existing Windows install in a virtual box. Then, I was stopped by a stupid activation dialog. To make a long story short, it seems that Microsoft doesn’t want me to use Windows on a real machine and on a virtual machine at the same time. The canonical solution is to buy a new Windows license, but this I won’t do. I dislike Windows, and will not be forced to buy a license. It is not (only) a question of money: I have already spent more than that on my free software activities, but the idea of being coerced into spending more money on Microsoft products just to support Microsoft products simply doesn’t fit.

If somebody has a clever way of getting around this problem, and allowing me to legally use a virtual machine for Windows, I would be very grateful.

Edit2: See also the blog post of Bryce Harrington on the same topic for Inkscape.

Playing with filters in Mayavi2

2008-02-21T00:59:00+01:00

Mayavi uses VTK as a rendering engine. It does its best not to force you to learn anything about VTK, and I often forget about the infinite possibilities of this visualization toolkit, but sometimes it can be interesting to actually look a bit more at its data processing algorithms to make a nice visualization.

Lately I was trying to get a 3D view of France, using altitude measurements that can be freely downloaded on the IGN website. The sheer number of points is huge: one point every kilometer, on the whole French territory. As a result, the brute force approach does not work, and Prabhu hinted I could look at VTK filters to make good use of this data.

VTK, and thus Mayavi, uses a pipeline-oriented approach: the data is loaded by a data source, and it is plotted using one or more visualization modules. Between the source and the modules, we can insert filters that process and transform the data. Playing with this pipeline I was able to transform the set of 800 000 (yes, that’s 800 000!) altitude measurements, given as (latitude, longitude, elevation), into an optimized mesh small enough to display fluidly on my laptop. Let’s see how I did it.

First of all I have to load the data, and I create a set of scattered (non-gridded) 3D points. This I do with numpy and mlab’s scatterscalar function.

Then I want to add connectivity information to go from scattered points to a field. This I do with a Delaunay2D VTK filter. This gives me a mesh that I can display with the surface visualization module. But if I do this, my computer grinds to a halt. Remember, I am dealing with 800 000 data points. It is interesting to note that VTK is a pipeline, data-on-demand, type of rendering architecture: the delaunay filter, for instance, will process data only as it is required by the rendering. Thus adding a delaunay tessellation filter is a numerically very cheap operation, as long as no visualization module is pulling the data out (think generator/lazy-evaluation pattern).
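The generator analogy can be spelled out in a few lines of plain Python. This is only an analogy, not VTK code: VTK pulls whole datasets through its filters rather than one point at a time, and the source and keep_positive functions below are made up for the illustration. What carries over is that wiring the stages together costs nothing, and work happens only when a consumer asks for data.

```python
def source(n_points, log):
    """Yield fake (x, y, altitude) measurements one at a time."""
    for i in range(n_points):
        log.append("source")
        yield (i % 1000, i // 1000, float(i % 7) - 1.0)


def keep_positive(points, log):
    """Filter stage: keep only points with strictly positive altitude."""
    for point in points:
        log.append("filter")
        if point[2] > 0:
            yield point


log = []
# Building the pipeline does no work at all, even for 800 000 points.
pipeline = keep_positive(source(800_000, log), log)
work_before_pull = len(log)

# Only when a consumer pulls data do the upstream stages run,
# and only as far as needed to produce what was asked for.
first_three = [next(pipeline) for _ in range(3)]
work_after_pull = len(log)

print(work_before_pull, work_after_pull, first_three)
```

In Mayavi the consumer is the visualization module: as long as none is attached, the Delaunay filter sits idle in the same way.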

The first simplification I make to the mesh is simply to remove mesh faces outside the French border. For this I have added points outside France with negative scalar value, and I use altitude as a scalar value for the points given by the IGN data. I want to use the Threshold filter to filter out the cells with scalar data equal to or less than zero (this way I also filter out the sea). As I want to act on cells rather than vertices, I first have to convert the scalar data, located on the data points, to cell data, using the PointToCellData filter.
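As a rough picture of what these two steps do to the data, here is a toy version in plain Python. It is a sketch under assumptions, not the VTK implementation: the point-to-cell conversion is modeled as the average of the vertex scalars of each triangle, the threshold keeps cells strictly above zero, and the value -1000.0 used for the points outside the border is a made-up sentinel (the actual value I used is not the point here).

```python
def point_to_cell_data(triangles, point_scalars):
    """Give each triangle the average of the scalars of its three vertices."""
    return [sum(point_scalars[i] for i in tri) / len(tri) for tri in triangles]


def threshold_cells(triangles, cell_scalars, lower=0.0):
    """Keep the triangles whose cell scalar is strictly above `lower`."""
    return [tri for tri, value in zip(triangles, cell_scalars) if value > lower]


# Toy data: points 0-2 carry altitudes, points 3-4 are the extra points
# added outside the border, tagged with a hypothetical negative sentinel.
point_scalars = [120.0, 300.0, 45.0, -1000.0, -1000.0]
triangles = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]

cell_scalars = point_to_cell_data(triangles, point_scalars)
kept = threshold_cells(triangles, cell_scalars)
print(cell_scalars[0], kept)
```

Because the cell value is an average, a triangle that touches an outside point is only dropped if the sentinel is negative enough to pull the average below zero, which is worth keeping in mind when choosing it.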

Then I use the QuadricDecimation filter to simplify my mesh. This filter finds a good approximation of the mesh with fewer faces. Unfortunately it also loses track of the scalar data attached to my mesh. As I am interested in having a scalar reflecting altitude, in order to associate a pleasing color with it, I rebuild this scalar using the ElevationFilter filter.

I find the result very pleasing: the mesh simplification is very impressive, because it yields a good rendering of the landscape with few faces. For example, the triangles around a river are elongated and follow it. I tried playing around with the data using numpy and writing my own algorithms (binning, averaging, …) and I didn’t get results as good, obviously because the VTK algorithms were not developed during an evening’s hack. The resulting visualization takes a long time to load, probably because the QuadricDecimation filter is busy doing its work.

To sum up the pipeline used:

Delaunay2D to create connectivity information,
Threshold to remove what lies outside France
QuadricDecimation to simplify the mesh

and a few filters to do conversion/creation of different data types. The mlab code to generate this visualization can be found in the Mayavi2 example directory (currently on https://svn.enthought.com/enthought/browser/Mayavi/branches/enthought.mayavi_2.1.0/examples/france.py but the branch will soon move to a tag as we do a release).

Edit: Indeed, the release has happened and as Fred points out, the correct link is

https://svn.enthought.com/enthought/browser/Mayavi/tags/enthought.mayavi_2.1.1/examples/france.py

I had fun zooming with this visualization and exploring France. I can say that this doesn’t compare too well with Google Earth: these guys pull tricks like detailed textures attached to the mesh or level of detail adaptation of the mesh depending on the distance to the camera. Yeah, Mayavi is a general-purpose 3D visualization software, and you can already go pretty far quite quickly, but if you want to do better, you’ll have to get your hands dirty.

Adding simple customisation to Mayavi2

2008-02-05T02:37:00+01:00

Mayavi2 is a rewrite of the original Mayavi application to make it easier to adapt and customize.

Mayavi2 uses, for its full-blown application, the Envisage framework. As a result it can both use envisage plugins (such as the logger and the python shell), and contribute to other plugins, thus providing a visualization engine.

The problem with a framework is that if you are not already using it, it comes at a cost. The cost of the Envisage2 framework is well-known: it is a bit tedious to learn. This is why Martin Chilvers (the Envisage author) has written Envisage3, but this is another story as Mayavi2 is currently based on Envisage2. To avoid forcing Envisage on people wanting to use Mayavi2, we have been working on decoupling the two. As I showed in a previous post, Mayavi2 can now be fully used without Envisage. But this is in the development version, and some people are stuck with the current release.

Today I would like to show how one can add some very simple customization to Mayavi2. The idea is to use the “-x” switch of Mayavi, which allows executing a script in Mayavi2 after it has been started. Mayavi2 is thus started, the wxPython mainloop is running, and we can do better than a script: we can pop up a small UI. For this I will use traitsUI, as I really prefer this library to raw wxPython (you can find a tutorial for this technology on my website). I will make a small dialog that uses Mayavi2 to create a 3D visualization, giving the user the possibility to interactively change the parameters of the visualization:

import numpy as N
from enthought.mayavi.mlab import plot3d, clf

from enthought.traits.api import HasTraits, Int
from enthought.traits.ui.api import View

class MyModel(HasTraits):
    n_mer   = Int(6)
    n_long  = Int(11)

    def _anytrait_changed(self):
        pi = N.pi
        dphi = pi/1000.
        phi = N.arange(0.0, 2*pi + 0.5*dphi, dphi, 'd')
        mu = phi*self.n_mer
        x = N.cos(mu)*(1+N.cos(self.n_long*mu/self.n_mer)*0.5)
        y = N.sin(mu)*(1+N.cos(self.n_long*mu/self.n_mer)*0.5)
        z = N.sin(self.n_long*mu/self.n_mer)*0.5
        t = N.sin(mu)
        # Really ugly, but so much easier than modifying the
        # visualization in place
        clf()
        self.plot = plot3d(x, y, z, t,
                            tube_radius=0.025, colormap='Spectral')

    view = View('n_mer', 'n_long')

my_model = MyModel()
my_model._anytrait_changed()
my_model.edit_traits()

After the imports, the class definition is the object behind the dialog: two integer attributes that get displayed in the dialog, and a callback called when these attributes are modified. This callback uses Mayavi2’s mlab scripting interface to plot a nice 3D curve. The last line pops up the dialog that allows the user to interact with the visualization. This is very crude, but is a simple example. If you run this code using “mayavi2 -x”, the Mayavi2 application window appears with our visualization, and in addition the dialog to interact with it.
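For the curious, the curve drawn by the callback can be inspected without Mayavi or numpy. The sketch below recomputes the same formulas with the standard math module and checks a property that follows directly from them: every point lies on a torus of major radius 1 and minor radius 0.5, so changing n_mer and n_long only changes how the curve winds around that torus. The helper names knot_points and distance_to_tube_axis are mine, introduced for this illustration.

```python
import math


def knot_points(n_mer=6, n_long=11, n_samples=2000):
    """Sample the curve plotted by the callback above, in plain Python."""
    points = []
    for k in range(n_samples + 1):
        phi = 2 * math.pi * k / n_samples
        mu = phi * n_mer
        q = n_long * mu / n_mer
        x = math.cos(mu) * (1 + 0.5 * math.cos(q))
        y = math.sin(mu) * (1 + 0.5 * math.cos(q))
        z = 0.5 * math.sin(q)
        points.append((x, y, z))
    return points


def distance_to_tube_axis(point, major_radius=1.0):
    """Distance from a point to the central circle of the torus."""
    x, y, z = point
    return math.hypot(math.hypot(x, y) - major_radius, z)


points = knot_points()
# Every sample should sit at distance 0.5 from the central circle.
max_deviation = max(abs(distance_to_tube_axis(p) - 0.5) for p in points)
print(len(points), max_deviation)
```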

With the development version of Mayavi2, you can simply change the last line from ‘my_model.edit_traits()’ to ‘my_model.configure_traits()’ and the file can be run as a normal Python file: there is no need for the Envisage framework. As a result the UI is a bit simpler, which can be seen as a pro, or as a con, depending on what you want to do.

Mayavi2 in Ubuntu

2008-01-26T10:10:00+01:00

After Debian, Mayavi2 has just made it into Ubuntu Hardy (http://packages.ubuntu.com/hardy/science/mayavi2). From what I can see, the deps look good, thanks a lot to Varun for making sure the Debian package was in shape. This means that in April, it will be massively easier for a lot of people to install an oldish version of Mayavi2.

I am quite happy to say that we did this the right way, by polishing a Debian package first. Once this was done, getting Mayavi2 in Ubuntu was trivial. This felt like a well-working machinery.

Now that a Debian package has been done and the Debian QA went over all the fine details of permissions, license, man pages… it should be much easier to get Mayavi2 in other distros (anybody for Fedora?). Having a binary package in a distro is a major bonus for the users: there is a world between having to grab and compile the ETS to try out a program, and being able to install it from the repos. Of course, the package will always be a bit old and lacking the shiny features that we add in the SVN. When I find time, I will pull myself together and make debs for Gutsy. It is not hard, I just have to find the time (yes, promises …).

Mayavi2 in Debian

2008-01-13T09:40:00+01:00

Thanks to the combined efforts of Ondrej Certik, who made me do the necessary tarballs, and Varun Hiremath, who finalized the packaging efforts, Mayavi2 is now in Debian ( http://packages.debian.org/sid/mayavi2 ). Currently, it is in testing, but it will soon trickle down to unstable. Along with Mayavi2, we have two new Debian packages: Traits, and TraitsUI, two other absolutely great Python packages and, IMHO, fundamental technologies.

This is looking very good for simplifying reuse of these technologies.

Mayavi2: using from ipython

2008-01-04T03:38:00+01:00

Recently Prabhu and I have been ironing out the library aspect of Mayavi2 (library as opposed to application). One of the use cases we are interested in is interactive use, via for instance ipython, a la pylab.

Most people think of Mayavi as a big and powerful application, maybe a bit clunky to script and to get to interact with other bits of code. With the recent additions you can use mayavi just as you would use matplotlib, to complement matplotlib’s 3D plotting.

3D plotting from ipython

As Mayavi relies on wxPython (technical details, yes I know you don’t care), to use it with ipython, you have to start ipython with ipython -wthread. Using ipython from svn you can start ipython with both the -pylab and the -wthread options to use both pylab and mayavi for 2D and 3D plotting (beware of namespace mangling, don’t use from module import *).

The matlab/pylab-like interface to Mayavi is found in enthought.mayavi.tools.mlab (this will most probably change to enthought.mayavi.mlab, or to something else shorter); import this module to have familiar functions. Documentation is a bit missing for now (you can grab some kind of an embryo at https://svn.enthought.com/enthought/attachment/wiki/MayaVi/mlab.pdf?format=raw ), so I’ll just show an example:

from enthought.mayavi.tools import mlab as M
from numpy import *

f = lambda x, y: sin(x + y) + sin(2*x - y ) + cos(3*x + 4*y)
x, y = mgrid[-7:7.05:0.1, -5:5.05:0.05]
M.surf(x, y, f)
M.axes()
M.title('Demoing mlab.surf')

Apart from the surf command, the different commands used have equivalents in pylab. surf is inspired by matlab: let us continue pylab’s work here. OK, the keyword arguments are not exactly the same, and not all pylab features are available through the mayavi/mlab interface. But the good news is that the objects created are VTK objects, even though it is a bit hidden by this simplified interface. This means that the power of VTK is there, and that you can always modify the resulting objects to fine-tune properties that you cannot (yet) tweak with keyword arguments.

Modifying the plot from the GUI

OK, as you see, we can control Mayavi without all the fuss of the user interface. We get a really simple window that does not get in our way. But is this still Mayavi? This clunky UI was convenient to interact with the visualization. I can pop the pipeline up using M.show_engine(). Once I have the pipeline I can easily double click on any of the items I want to modify, and I get the usual Mayavi dialogs that are so convenient when trying to tweak a scene, for instance to modify the colormap:

This is still work in progress, but mlab is completely usable for real work (I use it whenever I want to make a figure in 3D). Beware that the API is not set in stone (that’s a good moment to make remarks) and that it might change. Documentation is also lacking. Don’t worry, it will come pretty soon, but I also have a thesis to defend :->.
