16 May

Offline for three weeks

I am leaving tomorrow for a three-week trip to central Asia with Emmanuelle. We are going to spend one week in Uzbekistan, visiting fabulous cities like Samarkand and Bukhara. After this, we will spend two weeks in Kyrgyzstan, back-country trekking.

I will be offline during all this time. It will be good to finally take holidays. I have been unemployed for two weeks. I took a few days in the mountains during the first week, with my parents (I climbed Mont Blanc, that was fantastic, maybe I’ll find time to blog about it), but during the second week, I have been running between embassies, specialized shops, and change offices, to sort out my various trips (this one, and the next, professional, to the States). In addition, I have been working hard to prepare a small surprise for the Scipy2008 conference (I do hope Jarrod will blog about it when it is ready).

Soooo, I am wasted. Enough said, I am not going to touch a computer for three weeks, and that’s good.

05 May

Update on my life

I am currently changing jobs and changing countries. This is why I have been really bad at dealing with questions on the mailing-lists, bug-reports or feature requests.

Before

So far I have been working as a physicist, doing atomic physics (Bose Einstein Condensation). I studied quantum physics, mostly theory, and I did a PhD in an experimental lab, building a couple of experiments on Bose Einstein Condensation and atom interferometry. After this, I moved to Florence to do a post-doc also on a BEC experiment.

A colleague working on the experiment in Florence

This kind of work is very experimental. These experiments are monsters that you have to keep alive doing a lot of homemade mechanics, optics, and electronics. I thought I would love that, because I used to like working with my hands, but I grew tired of it. I wanted to work more with abstractions. And in addition I am a computer geek: the parts of my job I preferred were related to computers.

This summer

My contract ended at the end of April, and I have not renewed it. I was missing my girlfriend and wanted to find an excuse to come back to Paris. So now I am jobless, living at the expense of my girlfriend. I decided to take some time without a job, as I have the feeling I have been working without stopping for the last few years, not having time to travel and visit the world as I like to. We are planning a three-week trip to Uzbekistan and Kyrgyzstan in two weeks.

After this I am going to devote my summer to hacking. The big news is that I am going to the States. I will spend most of my time in Austin, working for Enthought. I am very excited about this, as I see this as the occasion to learn more about building scientific GUIs with Python. Building usable scientific programs is something that I am passionate about. I will also spend some time at Berkeley, with Fernando Perez, hopefully to work on Ipython1. I need to thank Enthought for making this possible for me, as they are providing the money. With some luck, this summer I will be productive on the free software side.

Of course right now I am battling with moving houses, fighting for visas, trying to land back on my feet and organize the summer. I still don’t have my visa for the States, and it is making me nervous. I would really hate to have to cancel my trip to Kyrgyzstan because of visa problems with the States: when I take time off work, I expect to spend it enjoying myself, and not waiting for visas.

The future

So I am quitting atomic physics. I am starting a new adventure in something totally new for me. Starting from October, I will be working with JB Poline and Bertrand Thirion, at Neurospin, on neuroimaging. This work is mostly data processing, even though it has a lot of interplay with the physics of NMR. This is something very new for me and I will have to discover a new field. The good news is that a lot of the work is centered on computers, and one of the core technologies used at Neurospin is Python.

28 Apr

Docs using Sphinx

After Ipython and Sympy, Mayavi is now using sphinx to build its docs. Sphinx is very neat because it allows for high quality pdf and html from the same reStructuredText source. The killer feature is that the resulting html pages have a built-in search that works with javascript, and thus works on the client without the need of a server.

In addition, the developer is very responsive and dedicated to making sphinx versatile enough to generate high-quality docs for many packages. As a result many Python projects are switching to sphinx. First Python itself (that’s what sphinx was created for), but now more and more. It seems that zope is even considering it. One great side effect is that documentation for different Python modules will be consistent, with the same look and feel (although you can tweak sphinx output if you want).

We don’t have a server serving the html docs yet (it is planned, we just need a bit of time), but you can check out the pdf generated here.

12 Apr

Of packaging, installation and dependencies

I have been struggling for the last few days trying to understand the issues behind packaging and installing the Enthought Tool Suite. I think I have been making progress, though only in my head: no actual code or packages so far are terribly satisfying.

The problem

If you are developing a Python-only program, with only dependencies on the standard library, you have no problems with packaging. You can ship tarballs, MSI installers, eggs, … all this works.

However, if you want to develop a rich program that provides many features in a closely integrated and consistent way to the user, you will have to depend on external packages. I know that many projects work around this by including the external dependencies inside the project, or simply reinventing the wheel. Well, this does not scale. We cannot expect to develop a major scientific tool and community this way. Reuse is the key to scalability, in my opinion. Thus comes the problem: how do we ship our program?

The problem can be seen very well with the Enthought Tool Suite (ETS). The ETS is a suite of many different packages, all pretty much geared towards building interactive scientific applications. In house, Enthought, the company (disclaimer: I do not work for Enthought), uses these packages to develop domain-specific applications for customers. They have broken up the suite into a set of small packages, to enable assembling applications by requiring only the features you need. This is important because if you want to use ETS’s 3D plotting package (TVTK or Mayavi), but you want to stick with MatPlotLib to do 2D plotting, and not use Chaco, you should be able to download only what you need.

As a result the ETS is made of a set of interdependent packages. Maybe they went a bit too far in the modularity, and there are almost 50 packages. The dependency graph looks like this:

Just to reassure you, the next version of the ETS has a much reduced number of packages, just because some packages were grouped, and the dependency graph indeed is sane:

As you can see, there is a complex dependency graph. So how do you ship this to the user? Another problem that should not be underestimated is: how do you make it easy for people who distribute your projects to package this?

Setuptools

Python has no good answer for this problem, but setuptools does go part of the way. Dependencies in the ETS are declared using setuptools, and installing the ETS strongly relies on setuptools.

Setuptools provides a way of automatically downloading dependencies. However, it is not a full packaging system replacement. The reason I say this is that it does not have the knowledge of a dependency graph: it just downloads packages, introspects them to find their dependencies, and recursively tries to satisfy them by downloading more. Phillip J. Eby (the author of setuptools) has been quite clear that he does not want to write an APT replacement, though people keep getting it wrong and making the equation “easy_install = apt for Python” (IMHO this is due to bad communication on the setuptools webpage).

Moreover, setuptools does not provide an easy-to-use API to extract all the information it has about packages, dependencies, and download URLs. It is thus not trivial to plug packages shipped with setuptools into another package manager like rpm or apt. This is what bothers me most, because this is strongly limiting the exposure the ETS is getting in distributions (whether they be Linux distributions, or scientific computing “superpacks”). Recently I have had discussions with somebody on how to ship Mayavi in a monolithic distribution he has developed. He agreed to ship setuptools with the distribution, so now I need to give him a list of eggs to provide. There is no obvious way to get this list using setuptools (insert here big big rant). So I thought that an option was to install Mayavi in a virtual environment to track the eggs added, and use this information. However, this person’s internet access was possible only by logging in on dumbed-down servers for security reasons. So we hit a wall. And for me this wall is a wall we keep hitting with setuptools: setuptools does everything for you, the download, the building, the install. It does have flags to control these processes, but it does not expose the information you need to do this without using it. I actually think the reason it does not expose this information is that it does not know it a priori. Looking at the code it does seem so. In addition, the structure of the packages makes it hard to do.

From packages to repositories

On the other side, Dave Peterson, at Enthought, has been working on a tool to allow checking out of the ETS SVN only the projects you are interested in. I played a bit with it, and modified it to generate the dependency graphs. I quickly found out that I actually like this tool much more than setuptools, even though it was pretty much using the same concepts. It took me a while to understand what I like about the tool. It is that it uses a map file to gather all the package and dependency information. As a result, it has the equivalent of a dependency graph. This makes it possible to do the operations I am interested in, e.g. listing all the packages required for installing a given project without actually downloading them.
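To illustrate why a map file helps, here is a minimal sketch in Python. The `PROJECT_MAP` dictionary, its project names and its dependency edges are made up for the example; they reflect neither the format used by Dave’s tool nor the real ETS graph. The point is only that once dependencies are declared in one place, the full list of requirements can be computed without downloading anything.

```python
# Hypothetical project map: project name -> direct dependencies.
# Names and edges are illustrative only, not the real ETS graph.
PROJECT_MAP = {
    "mayavi": ["tvtk", "envisage", "traits_ui"],
    "tvtk": ["traits"],
    "envisage": ["traits"],
    "traits_ui": ["traits"],
    "traits": [],
    "chaco": ["traits_ui", "enable"],
    "enable": ["traits"],
}


def required_projects(project, project_map):
    """Return the project and all its transitive dependencies,
    ordered so that each dependency comes before its dependents."""
    ordered = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in project_map[name]:
            visit(dep)
        # Append only once every dependency has been listed.
        ordered.append(name)

    visit(project)
    return ordered


print(required_projects("mayavi", PROJECT_MAP))
```

With this toy map, asking for "mayavi" lists traits first and mayavi last, and never mentions chaco or enable.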

The reason this is possible is that with the ETS we are not dealing with an open set of packages, like PyPI, in which packages can come and go, and no consistency is enforced. We are dealing with one suite of multiple projects that are made to work with each other. The base entity is thus a project set, on which we can make a “project map”.

What Dave has done works fantastically for development, and I would like to push it further for distribution. What we expose to the user can now be a repository, in the sense of APT: a set of packages with consistent inter-dependencies, and a way of easily retrieving this information. The difference between the two, and the implications of the difference, is not something I had clearly in my mind in the beginning, but it is becoming clearer that having a repository with a project map gives a lot of added value for distributing. I’ll see if I can reuse Dave’s work to build such a tool, but do not hold your breath: I am not willingly in the business of packaging, and will probably not spend enough time on this to make it a good tool.

Edit: Correct Phillip’s name.

05 Apr

Objects, modules and Traits and Envisage

I have been reading an article about a new language paradigm (Erasmus, a modular language for concurrent programming). The authors discuss the limitations of objects in terms of modularity. To sum up their point (and most probably distort it completely), the limitations with objects come from the fact that you can’t be sure what is modifying what: suppose you have a method foo of an object bar that you call in a method of an object baz, you cannot be sure that this method hasn’t modified private attributes of your object baz, as foo could have called a method of your object. This does happen in large code bases. Of course, best practice tries to reduce this to a minimum, but this reduces modularity, and thus limits both code reuse and concurrency (as side effects are not well controlled).
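A small Python sketch of the situation described above. The classes are invented for the illustration (they do not come from the Erasmus article): `Bar.foo` looks like a harmless query, but it calls back into the `Baz` instance that owns it and changes its private state.

```python
class Bar:
    def __init__(self, owner):
        # Bar keeps a reference to the object that created it.
        self.owner = owner

    def foo(self):
        # Looks like a simple query, but it calls back into its owner,
        # which changes the owner's internal state.
        self.owner.reset()
        return 42


class Baz:
    def __init__(self):
        self._count = 10
        self.bar = Bar(self)

    def reset(self):
        self._count = 0

    def compute(self):
        before = self._count
        value = self.bar.foo()
        # Nothing in this method assigned to _count, yet it has changed.
        return before, self._count, value


print(Baz().compute())
```

Reading `compute` alone gives no hint that `_count` is modified; you have to follow the call into `Bar` to find out.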

Erasmus’s solution is to adopt a new container, which they call modules rather than objects, and which is based on message passing rather than method calls. These modules live in separate processes and can themselves be made of more conventional code (I am extrapolating a bit from the original article here).

This strikes me as being related to a pattern that I see more and more in my code that uses Traits. The objects deriving from HasTraits have a very easy and cheap way of coupling callbacks to the modification of their attributes. This induces a programming style known as reactive programming that is entirely callback-driven. In addition, this is a nice way of ensuring that the internal state of an object is always consistent. This is a first step to message passing and decoupling: you no longer call methods, you just set attributes and let the object do the rest. The limitation of this model in a large code base is that you have to carry around references to the objects you are interested in, and their attributes. Traits has patterns to help you do this (delegation, namely), but it is still a limitation.
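To make the style concrete without depending on Traits, here is a stripped-down sketch using only the standard library. It is deliberately not the Traits API: the `Reactive` base class and its `on_change` method are hypothetical names invented for this example. It only shows the idea of setting an attribute and letting callbacks keep the object consistent.

```python
import math


class Reactive:
    """Tiny stand-in for an object with observable attributes.
    This is NOT Traits, only an illustration of the pattern."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the listener table.
        object.__setattr__(self, "_listeners", {})

    def on_change(self, name, callback):
        """Register callback(old, new) for changes to attribute `name`."""
        self._listeners.setdefault(name, []).append(callback)

    def __setattr__(self, name, value):
        old = getattr(self, name, None)
        object.__setattr__(self, name, value)
        if old != value:
            for callback in self._listeners.get(name, []):
                callback(old, value)


class Circle(Reactive):
    def __init__(self, radius):
        super().__init__()
        # Keep `area` consistent with `radius` at all times.
        self.on_change("radius", self._update_area)
        self.radius = radius

    def _update_area(self, old, new):
        self.area = math.pi * new ** 2


c = Circle(1.0)
c.radius = 2.0  # no method call: setting the attribute updates the area
print(c.area)
```

The caller never invokes `_update_area`; it only sets `radius`, which is the "set attributes and let the object do the rest" style described above.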

This is where the Envisage framework comes into play. Envisage introduces the notion of plugins which provide extension points. These extension points are special traits attributes that are published in a registry (which can be application-wide, or not, in Envisage3). You can query the registry to retrieve these extension points and contribute to them. After that, the traits callback mechanism triggers an action in the plugin contributing the extension point.
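The mechanism can be sketched in a few lines of plain Python. This is not the Envisage API: `Registry`, `offer`, `contribute` and the two plugin classes are hypothetical names chosen for the illustration, and the sketch assumes that the plugin notified is the one offering the extension point.

```python
class Registry:
    """Maps an extension point id to the plugin callback that offers it.
    Hypothetical sketch, not Envisage's registry."""

    def __init__(self):
        self._callbacks = {}
        self._contributions = {}

    def offer(self, point_id, on_contribution):
        """Publish an extension point and the callback to notify."""
        self._callbacks[point_id] = on_contribution
        self._contributions[point_id] = []

    def contribute(self, point_id, item):
        """Add a contribution and notify the offering plugin."""
        self._contributions[point_id].append(item)
        self._callbacks[point_id](item)

    def contributions(self, point_id):
        return list(self._contributions[point_id])


class MenuPlugin:
    """Offers an extension point that other plugins can add entries to."""

    def __init__(self, registry):
        self.entries = []
        registry.offer("menu.entries", self._entry_added)

    def _entry_added(self, entry):
        self.entries.append(entry)


class PlotPlugin:
    """Contributes to the menu without holding a reference to MenuPlugin."""

    def __init__(self, registry):
        registry.contribute("menu.entries", "Plot data")


registry = Registry()
menu = MenuPlugin(registry)
PlotPlugin(registry)
print(menu.entries)
```

The two plugins only share the registry and a string identifier, which is what removes the need to carry object references around.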

This contribution mechanism could be based on message passing between processes quite easily (although for GUIs it breaks down, because AFAIK you cannot assemble a consistent GUI from different widgets living in different process spaces, without using some Xwindows-specific tricks). Of course this does not give me hard guarantees of decoupling and control of the side effects, as a call to a plugin can induce calls to other plugins inside it. This is where best practice comes along: core plugins should be able to run and provide their basic functionality outside of Envisage, as normal objects. Envisage should only be a thin wrapper allowing them to expose this functionality and extend other plugins. This introduces a distinction between objects and method calls, which do not need to be arranged in self-consistent entities and which you use very often, and plugins and extension contributions, which form standalone entities and should be used more sparingly.

Of course Envisage cannot go too far in terms of providing guarantees for decoupling. It gives a mechanism and best practices, and could even help plugin decoupling by having them live in different processes, but as long as it does not enforce rules in the semantics of the language, it cannot achieve what projects like Erasmus are trying to do. I however think it is good to have a look at the work done in these projects to see what we can learn.

PS: Web apps suck! I made a few shortcut mistakes under wordpress, wanted to undo them and hit “Ctrl-R”, which is “redo” under vim, and lost all my post. I strongly don’t believe in web apps, amongst other things because they don’t allow me to use vim.

01 Apr

Of travels and sprints

This month I have traveled a bit for scientific-computing related reasons, and of course it was pure delight.

In England

First of all, I was speaking at the OKcon, open knowledge conference in London, about Scientific tools in Python in general, and Mayavi in particular. I jumped on the occasion to visit the Airbus campus in Bristol. We have had some contacts with these guys, because they use Mayavi in some of their homegrown applications, and I was curious to put faces on friendly names on the mailing list. In addition, I was eager to find out how they were using Mayavi and Python scientific tools in an industrial environment, as I have never worked in another place than a physics lab.

Visiting the Airbus campus

The Airbus visit was enlightening: the Bristol campus is a major research facility (several thousand people) dedicated to wing design. A good part of the work is done through simulations deployed on big clusters. These calculations have historically been run in Fortran and C, but apparently the engineers are switching to a mix of compiled languages and Python. Moreover, steering of these simulations, through mesh-design, visualization of the results, analysis of the data, is done mainly through an interactive program, ‘flightpad’, that is developed fully in Python, using the Envisage framework to couple together a bunch of scientific components, including Mayavi. I got to spend a fair amount of time with the guys doing this, and it was great to see how they did it. They have a good approach to scientific software design (loosely coupled components, reuse of all the existing libraries), even though their goal (automatic generation of Python scripts from user interaction) is way more ambitious than anything I have in mind. I was pleased to see that they were using Mayavi in a way completely consistent with its design, and did not have to hack around limitations.

It was really very encouraging to talk with the software strategist. He obviously completely got it as far as how an open-source model can be profitable to a company like Airbus. Seeing so many people using open source tools as their main tools, as well as a manager ready to back this position and explain how it can be beneficial to contribute to an open-source project, really filled me with hope.

Of course visiting the Airbus campus was not only about software, it was also about planes (I got a drive around the campus, and it is quite fun to ride a Mini Cooper between two 747s), and beers (reinventing the world to make it a better place at the pub, after work). I must say there is something special about the scientific Python community: it is the nicest community I know (with the sailing one :->). You meet people that you have never seen before, and you immediately feel at ease.

Open Knowledge conference

The Open Knowledge conference was fun. Not too much like the geek conferences I am used to, as here the focus was on the data, and not the tools, aka the software (for instance, the big deal is when you can get access to the complete public transport time-tables, and you can make maps of poorly connected areas). I met Martin Albrecht from the sage project. It was very interesting to talk with him. I generally consider myself as doing rather fundamental research (Bose-Einstein condensation), but for him I was in the applied science section, because I use math and computers to do applied things. This distinction between applied and fundamental maths yields a distinction in the application of the code, and therefore in the way an open-source scientific project can survive. It was very interesting to see how sage’s development process therefore differed from scipy’s. I think that both Martin’s talk on sage, and mine on Python and interactive visualization, had a lot of success: the room was full of scholars, and they wanted tools to do their work.

In London, I had the occasion to catch up with my brother, and Rob, a former colleague. That was nice too (and yielded more beers).

Paris

Nipy Sprint

The week after, I was attending a sprint in Paris on nipy: neuroimaging in Python. We were a bunch of enthusiastic scientific Python users crammed in a small room during the day. There was the team from Berkeley, including Jarrod and Fernando, and all their friends. I got to make new friends, and catch up with old ones. The goal of the nipy effort is to build a complete processing pipeline for neuroimaging data, especially fMRI, in Python. This is a lot of work, as many transformations are applied to the raw data to make it useful for scientific publications. As the field matures, these transformations pile up, and the processing pipeline gets more and more complex. There already exists a good pipeline under MatLab (SPM); the problem is that, due to the poor language features of MatLab, it is a codebase that is hard to extend and to modify. One of the goals of the nipy project is to make a pluggable architecture, for researchers to be able to replace part of the pipeline with their own code, and thus explore new methods while comparing them to the reference one. This means that there are some interesting software engineering problems in here (pluggable pipelines, framework…, the kind of stuff I like); however, the current focus is to get the algorithms right, before trying to do software over-engineering.

The Berkeley group got an NSF grant to work on the project and has been able to hire two developers for two years (Chris Burns and Tom Waite). The effort is led by Jarrod Millman, and they have put a lot of work into making the underlying libraries better (that is, improving numpy and scipy).

I had difficulties contributing any useful code, as I don’t know neuroimaging, but I had the pleasure of seeing people pick up the mayavi API and use it to quickly build domain-specific tools for displaying brains and activation regions. As usual this also revealed some shortcomings in the mlab API that I plan to address ASAP.

IPython Sprint

The weekend after, Fernando, Laurent Dufréchou, Stefan van der Walt and myself crashed at my parents’ place to work on ipython1 and the front ends. My mother cooked us some fabulous food and I had a great time.

Unfortunately we did not get as far as I would have liked. The right abstractions for talking between the ipython1 execution engine and the front end are not really easy to get right, as the engine is nothing more than an abstract execution engine, that basically only has a namespace and knows how to execute stuff in a non-blocking mode (that’s where it gets hard: how do you know what is going on with your engine and the commands you have sent to it? How do you deal with introspection requests such as tab-completion or docstring exploration?). We want as little logic in the front ends as possible: let us not duplicate tab-completion or history. This is why we are progressively building an object, that Fernando dubbed “InputStateManager”, that is doing the impedance matching between the front end and the engine. I am starting to believe that the best way to connect this object (ISM) to the front end is via a callback-based mechanism: the front end calls the ISM methods and gives them a callback to call when finished (for instance if running in a different thread, a Wx front end would pass something based on wx.CallAfter to display the result). That way the mechanism is very general, can adapt to event-driven front ends or readline-based ones, and knows nothing about the front end. Of course not much code got written, because I am way too slow, and it took me ages to figure this out.
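Here is a rough sketch of what such a callback-based connection could look like, using only the standard library. It is not the ipython1 code: apart from the name “InputStateManager”, everything here (the `Engine` class, the `execute` method and its signature) is an assumption made up for the illustration.

```python
import threading


class Engine:
    """Stand-in for an execution engine: a namespace and a way to run code.
    Hypothetical, not the ipython1 engine."""

    def __init__(self):
        self.namespace = {}

    def run(self, source):
        # Evaluate an expression in the engine's namespace.
        return eval(source, self.namespace)


class InputStateManager:
    """Sits between a front end and the engine and knows nothing
    about how the front end displays things."""

    def __init__(self, engine):
        self.engine = engine
        # History lives here, so front ends do not duplicate it.
        self.history = []

    def execute(self, source, callback):
        """Run `source` without blocking; call `callback(result)` when done."""
        self.history.append(source)

        def work():
            callback(self.engine.run(source))

        thread = threading.Thread(target=work)
        thread.start()
        return thread


# A trivial "front end": it only supplies a callback to receive results.
results = []
ism = InputStateManager(Engine())
ism.execute("1 + 2", results.append).join()
print(results)
```

A readline-based front end could pass a callback that prints, while a Wx front end would pass one that hands the result back to the GUI thread; the ISM code is the same in both cases.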

We had a lot of fun, and for me the highlight of the weekend was when my girlfriend joined us to do some hacking on a really cool project trying to use the scipy.org wiki to edit the numpy docstrings.

Fernando has pictures of all these happy moments, and I hope he will publish them somewhere (Fernando, get a blog :->). Next time I hope there will be more of us.

Edit: my slides at OKcon

26 Mar

How is Mayavi pronounced

I have been traveling recently and talking to friendly geeks I didn’t know yet. I have been surprised to see that many people were pronouncing “Mayavi” as “Maya-V-I”, as in “V-I”, like the old Unix editor. Maybe this comes from the spelling “MayaVi”, that Prabhu and I recently decided to avoid. Well, Mayavi is actually pronounced “Ma-ya-vee”, and it comes from an old Sanskrit name meaning magician.

20 Mar

Numpy doc sprint in Paris tomorrow!

We really need to get numpy 1.0.5 out. And for this release to rock, we want to have good docs. This is why Jarrod offered to have a doc sprint tomorrow.

In addition we are currently having a sprint in Paris for neuroimaging in Python, with a bunch of numpy developers. Some of us are going to work on the doc sprint tomorrow. We will have a room dedicated to this.

It would be great if people in Paris join us. If you want to have great fun with Python geeks and get the chance to make numpy better, send an e-mail to Jarrod ( <millman> at <berkeley>  dot <edu> ). The venue is in Paris 6ème.

See you all tomorrow.

19 Mar

Mayavi debian packages for Gutsy

I have decided that delivering packages of the latest release of the Enthought Tool Suite to the users was something important, as some of the new features (interactive docs for Mayavi, decoupling of Mlab from Envisage) are really neat, and we have the feeling the suite is quite mature.

I spent the whole day fighting with debian packages (which I don’t know anything about) and launchpad PPA. It is not something I enjoy learning, but the important thing is to keep the user satisfied.

To get the packages under ubuntu gutsy, just add the following lines to your /etc/apt/sources.list:

# Package archive for enthought tools
deb http://ppa.launchpad.net/gael-varoquaux/ubuntu gutsy main

You will be getting three packages:

  • python-enthought-traits
  • python-enthought-traits-ui
  • mayavi2

I want to do rpms. I could really use some help here: none of the boxes I have a login on are rpm-based. I would like to be able to build rpms that work for fedora, red-hat, and mandriva. Anyone?

11 Mar

Re: Grid computing for Python

Matthieu talks about leveraging the unused desktop-computer power in his lab for performing calculations. I share his feeling that there is a huge potential here, and I think this is where the work on IPython1 by Fernando, Brian and Benjamin comes into play. Hopefully one day we will have a tool to transform prototyping code in Python into calculations scattered on a grid made of desktop computers. Of course this can only be easy for embarrassingly parallel problems, as in general parallel algorithms are hard, but a lot of the problems I encounter are of this kind, for instance a not-at-all-parallel computation that I need to run for a large number of different parameters.
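The shape of such a parameter sweep can be sketched with the standard library. This is only an illustration on a single machine, using a thread pool as a stand-in for a grid of desktop computers; it is not IPython1 code, and `simulate` and `sweep` are hypothetical names for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(param):
    """Stand-in for a serial, not-at-all-parallel computation."""
    total = 0.0
    for i in range(1, 1000):
        total += param / i
    return total


def sweep(parameters, max_workers=4):
    """Run simulate() once per parameter; the runs are independent,
    so they can be handed to any pool of workers."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the same order as the inputs.
        return list(pool.map(simulate, parameters))


parameters = [0.5, 1.0, 2.0]
results = sweep(parameters)
print(dict(zip(parameters, results)))
```

Because each run depends only on its own parameter, the serial function does not need to change; only the thing that maps it over the parameters does, which is what makes this kind of problem easy to scatter.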