19 Oct

Hiring a programmer for a brain imaging library

I am super excited to announce a job offer that is dear to my heart: doing quality open-source software, with Python scientific tools and machine learning, for clinical application of brain imaging. This is the most exciting job that I have had the chance to be recruiting for!


We are looking for a programmer to join our research group, Parietal team, at INRIA, to work on a library integrating state of the art methods of functional brain imaging.

As a programmer, you will be taking part in the NiConnect research project, developing tools for the analysis of spontaneous brain activity using functional MRI. The project unites neuroscientists, data-miners, statisticians and clinical researchers to transfer recent advances in basic neuroscience to clinical diagnostic tools. Your duties will be to work hand in hand with the computer science and statistics researchers to turn the research code into a solid and well documented Python library usable by clinical researchers. The core technologies used will rely on the scientific Python stack and the scikit-learn machine learning library.

Requirements

  • Programming skills in Python, preferably with experience of the scientific Python stack
  • Understanding of quality assurance in software development: test-driven programming, version control, technical documentation.
  • Software design skills
  • Some knowledge of Linux/Unix
  • Knowledge of open-source development and community-driven environments is valued
  • Good technical English level
  • Experience in statistical learning or a mathematically-oriented mindset is a plus

Speaking French is not a requirement, as it is an international team.

About the team

INRIA is the French computer science research institute. It is recognized worldwide as one of the leading research institutions and has strong expertise in machine learning. You will be working in the Parietal team, which makes heavy use of Python for brain imaging analysis.

Parietal is a small research team (around 15 people) with an excellent technical knowledge of scientific and numerical computing in Python as well as a fine understanding of algorithmic issues in machine learning, statistics and image processing. Parietal is committed to investing in the scientific Python toolstack and its members are core developers in central projects such as Mayavi and scikit-learn, as well as the nipy library for NeuroImaging in Python.

Parietal is located in the Neurospin brain research facility, which hosts several brain scanners and research teams in neuroscience and medical imaging.

Working at Parietal is a unique opportunity to improve your skills in numerical computing and statistical data processing in Python. In addition, working on an open source stack will give you premium experience of open source community management and collaborative project development.

Contact Info:

30 Aug

RIP John Hunter: the loss of a great man

John Hunter, the author of matplotlib, passed away yesterday after a short battle against cancer. John gave the keynote at the scipy 2012 conference a few weeks ago, and was diagnosed with cancer just on his return from the conference. It is a shock to me that a friend can disappear so quickly. Please read the announcement of Fernando Perez, who supported John in the last weeks, to learn more about John.

A man who gave a lot, not asking for anything in return

Many have benefited from the silent efforts of John, and are not fully aware of how he generously invested his time and talent for the benefit of others. Matplotlib, the Python plotting library that he created in 2002, has propelled Python as a major tool for scientific research and engineering. The impact of John’s efforts goes well beyond Matplotlib. Early on, John had the vision of Python as an interactive scientific environment. He promoted this vision, pairing with Fernando Perez to develop the fantastic ipython/matplotlib tandem, solving many technical challenges. But he also invested a lot of energy in teaching workshops that helped change the way people compute, as well as writing didactic documentation and articles. He was a friendly, active leader of an online community, open and helpful to newcomers.

As Travis Oliphant said on John’s numfocus memorial webpage:

Those who contribute much to open source, as John did, do so at the expense of something - often it is time with family.

I cannot stress enough how true this is. The entire open source software ecosystem, which nowadays supports our economy, our education, and our research, is built on the shoulders of a fairly small number of generous people who spend their energy on making better software, rather than on personal wealth.

John was a humble man. He did not have a blog, or a twitter account, did not seek fame or money. For this reason I feel that his contributions are unknown and undervalued by many. In my eyes, he is an unknown soldier of our modern times. I hope that I am not being too emphatic, but this is how I feel.


John passed away at 44, leaving behind a wife and 3 daughters. Please do consider supporting them: http://numfocus.org/johnhunter.

04 Jun

A journal promoting high-quality research code: dream and reality

Open research computation (ORC) was an attempt to create a scientific publication promoting high-quality and open source scientific code. The project went public in fall 2010, but last month, facing the low volume of submissions, the editorial board chose to reorient it as a special track of an existing journal.

The challenges that we face are discussed in our editorial:

Changing computational research. The challenges ahead. C Neylon, J Aerts, CT Brown, D Lemire, J Millman, P Murray-Rust, F Perez, N Saunders, A Smith, G Varoquaux and E Willighagen, Source Code for Biology and Medicine 2012, 7:20

Here is my own personal take on the rise and fall of this ideal.

My story with ORC

From pipe dream to journal - My involvement with ORC started long before there was such a thing as ORC. In fall 2008, I had a discussion with a friend working in the publication industry, telling her how I believed that the publication system is broken, because it promotes new results without any interest in whether these can be exported outside the lab that produced them: it is currently easier to publish a minor but novel result than a tool enabling the routine reproduction of previous results. This seemed particularly marked in the scientific software world, as software tools are becoming central to the scientific workflow, and cost nothing to duplicate when produced under an open-source license. To my surprise, she took me seriously, and asked me to write my ideas down in an email that she would forward to her colleagues in the publication industry.

Looking back at the email that I sent, my concerns were, back then, to promote:

  • quality and openness of scientific software
  • basic tools shared across communities
  • recognition of software development as a challenging and worthwhile task in academic research

Shaping the idea - In the year that followed, I had a few discussions with staff from BioMedCentral, an open-access publisher in biology and medicine that was looking into expanding into the physics and math related fields. Eventually, my contact there told me that they had other similar requests and were launching a journal that would be led by Cameron Neylon, a British biophysicist and strong advocate of openness and reproducibility in science. This was the start of ORC, and for me the chance to meet other people sharing my concerns, some new and some already old friends.


ORC editor

Conventional editor

Setting up the journal - BioMedCentral was instrumental in setting up the journal project. I quickly learned that, no surprises, a journal is a product, like anything else, and it must find customers. Here, as we were launching an open access journal, the customers were authors. This is where a journal faces a chicken and egg problem: to be recognised it needs high-visibility publications, but authors will submit only to journals that they know. The main tools to overcome this challenge are communication and advocacy. I then realized that these really weren’t my strong points. Cameron Neylon absolutely shone on this side, with very enthusiastic communications and an incredibly active twitter account. On my side, I am a slow writer, and I tend to speak Python code better than English language, which is not a strong asset for a journal editor.

Wild editorial discussions - The discussions in the editorial board really thrilled me because they were centered on how to set standards to improve the quality of code published. Looking in my mailbox, I see discussions about code repositories, software testing, documentation or licensing issues. This is not that surprising, given that a lot of the editors were actually contributors to major software projects. It made me very happy, as I have the feeling that, so far, most committees or decision makers are clueless about software.

Sand in the gears: the lack of uptake

A false start - So ORC was launched late 2010 and we had fantastic feedback. I had the feeling that people were genuinely excited about our program: changing the way computational science worked from the inside, through the review process. The idea was that we had opened a pre-submission call, and were waiting for a few good papers to be submitted to launch the journal. However, it turned out that the papers were slow to come. It took me a while to realize that there was something wrong. But slowly we had to face the truth: many people were excited about the journal, but most were sending their papers elsewhere.

What went wrong? - If I really knew what went wrong, I would probably not be discussing it here, but rather changing the world. However, I can come up with a few guesses:

  • Working across communities is harder. From the beginning we had wanted to position the journal across communities, in order to foster the sharing of tools for a greater good. The challenge is that a central role of publication is nowadays to provide recognition. It is much easier to achieve recognition in a given community than across communities, and authors always preferred submitting their work to a non-software oriented journal in their field. We couldn’t fight the battle for software quality and the battle for inter-community work at the same time.
  • Setting the bar too high. Many felt that the submission requirements were too demanding. To quote a researcher on a NeuroImaging forum: “I think it’s setting the bar unrealistically high for most neuroimaging software”. While we had originally shot for a very high test coverage (probably too high), we had scaled it back quickly, simply stressing that editors and reviewers would be looking closely at test coverage, documentation and ease of installation. That said, the average researcher did not share our ideals of raising the quality of scientific software. Trying to ask only for excellent publications in a new and unproven journal was probably unrealistic.
  • Editors not willing to game the system. I have watched a few journal launches, and it seems to me that a common trick is to line up articles that are created by the editors and their friends specifically for the new journal. People come up with opinion papers, reviews, commentaries that only serve to generate an identity to the journal. This did not happen for ORC, and I believe that it is because the editors themselves were not huge fans of the low signal-to-noise ratio in modern scientific publishing practice.

The times they are a changing

ORC is dead, long live ORC - We did get a few submissions. ORC is not coming to an end, it is morphing into a special thematic series in Source Code for Biology and Medicine. This solution is not completely satisfactory, as it pushes what should have been a forum for exposing good practices and good software into a smaller community. But at least there is now a venue in which people can publish a paper about software that they have been improving and maintaining, and not only about a new algorithm.

Changing practices across the board - Among the reasons for which we had a hard time making a breakthrough is that authors were sending their software papers to other journals, in particular journals not specialized in software. While these papers are not getting the attention of a review and editorial team expert in software development, as we were setting up with ORC, this is still a good thing. Indeed it shows that the times are changing and that recognition of software as a scientific work is improving. I have been impressed to see that many high profile journals have changed their editorial policies to specifically accept software papers, or have created tracks dedicated to software.

Software is being slowly recognized as a pillar of modern scientific research. We need to keep pushing to make sure that quality standards are set and that the open-source scientific software grows into a mature ecosystem focused on problem solving.

09 May

Update on scikit-learn: recent developments for machine learning in Python

Yesterday, we released version 0.11 of the scikit-learn toolkit for machine learning in Python, and there was much rejoicing.

Major features gained in the last releases

In the last 6 months, there have been many things happening with the scikit-learn. While I do not wish to give an exhaustive summary of features added (it can be found here), let me list a few of the additions that I personally find exciting.

Non-linear prediction models

For complex prediction problems where there is no simple model available, as in computer vision, non-linear models are handy. Good examples of such models are those based on decision trees and model averaging. For instance random forests are used in the Kinect to locate body parts. As they are intrinsically complex, they may need a large amount of training data. For this reason, they have been implemented in the scikit-learn with special attention to computational efficiency.
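To make this concrete, here is a minimal sketch comparing a linear model with a random forest on a toy problem that has no good linear boundary. It assumes a recent scikit-learn (the `sklearn.model_selection` module did not exist at the time of the 0.11 release), and the dataset and parameters are chosen purely for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two interleaved half-circles: a toy problem without a good linear boundary.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear baseline and an averaged ensemble of decision trees.
linear = LogisticRegression().fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

linear_score = linear.score(X_test, y_test)
forest_score = forest.score(X_test, y_test)
print("linear model accuracy: %.2f" % linear_score)
print("random forest accuracy: %.2f" % forest_score)
```

On data like this the forest can follow the curved boundary that the linear model cannot, at the price of a model that is harder to interpret.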

Dealing with unlabeled instances

It is often easier to gather unlabeled observations than labeled observations. While prediction of a quantity of interest is then harder or simply impossible, mining this data can be useful.

  • Semi-supervised learning: using unlabeled observations together with labeled ones for better prediction.
  • Outlier/novelty detection: detecting deviant observations.
  • Manifold learning: discovering a non-linear low-dimensional structure in the data.
  • Clustering with an algorithm that can scale to really large datasets using an online approach: fitting small portions of the data one after the other (Mini-batch k-means).
  • Dictionary learning: learning patterns in the data that represent it sparsely: each observation is a combination of a small number of patterns.
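As an illustration of the online approach, the following sketch feeds data to Mini-batch k-means one chunk at a time. It assumes a recent scikit-learn where `MiniBatchKMeans` exposes `partial_fit`; the synthetic blobs and chunk sizes are made up for the example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
# Hypothetical ground truth: three well-separated cluster centers.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

kmeans = MiniBatchKMeans(n_clusters=3, random_state=0)

# Feed the data one small chunk at a time, as if it did not fit in memory.
for _ in range(50):
    labels = rng.randint(0, 3, size=200)
    chunk = centers[labels] + 0.3 * rng.randn(200, 2)
    kmeans.partial_fit(chunk)

print(kmeans.cluster_centers_.round(1))
```

Only one chunk is held in memory at any time, which is what lets the approach scale to datasets much larger than RAM.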


Sparse models: when very few descriptors are relevant

In general, finding which descriptors are useful when there are many of them is like finding a needle in a haystack: it is a very hard problem. However, if you know that only a few of these descriptors actually carry information, you are in a so-called sparse problem, for which specific approaches can work well.

  • Orthogonal matching pursuit: a greedy and fast algorithm for very sparse linear models
  • Randomized sparsity (randomized Lasso): selecting the relevant descriptors in noisy high-dimensional observations
  • Sparse inverse covariance: learning graphs of connectivity from correlations in the data
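A small sketch of the sparse setting, using orthogonal matching pursuit: there are more descriptors than observations, but only three of them carry information. The data is simulated, and the indices and coefficients are arbitrary values picked for the illustration.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.RandomState(0)
n_samples, n_features = 100, 200  # more descriptors than observations

X = rng.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
relevant = [3, 50, 120]  # the only descriptors that matter (arbitrary choice)
true_coef[relevant] = [2.0, -3.0, 1.5]
y = X.dot(true_coef) + 0.01 * rng.randn(n_samples)

# Greedily select at most 3 descriptors.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)
selected = np.flatnonzero(omp.coef_)
print("selected descriptors:", selected)
```

Here the number of relevant descriptors is given to the algorithm; in practice it usually has to be chosen, for instance by cross-validation.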


Getting developers together: the Granada sprint

Of course, such developments happen only because we have a great team of dedicated coders.

Getting along and working together is a critical part of the project. In December 2011, we held the first international scikit-learn sprint in Granada, on the side of the NIPS conference. That was a while ago, and I haven’t found time to blog about it, maybe because I was too busy merging in the code produced :). Here is a small report from my point of view. Better late than never.

Participants from all over the globe

This sprint was a big deal for us, because for the first time, thanks to sponsor money, we were able to fly contributors from overseas and meet the team in person. For the first time I was able to see the faces behind many of the fantastic people that I knew only from the mailing list.

I really think that we must thank our sponsors, Google and tinyclues, but also The PSF, that is in particular Jesse Noller but especially Steve Holden, whose help was absolutely instrumental in getting sponsor money. This money is what made it possible to unite a good fraction of the team, and it opened the door to great moments of coding, and more.

Producing code lines and friendship

An important aspect of the sprint for me was that I really felt the team being united. Granada is a great city and we spent fantastic moments together. Now when I review code, I can often put a face on the author of that code and remember a walk below the Alhambra or an evening in a bar. I am sure it helps reviewing code!

Was it worth the money?


I really appreciate that the sponsors did not ask for specific returns on investment beyond acknowledgments, but I think that it is useful for us to ask the question: was it worth the money? After all, we got around $5000, and that’s a lot of money. First of all, as a side effect of the sprint, people who had invested a huge amount of time in a machine learning toolkit without asking anything in return got help to go to a major machine learning conference.

But was there a return on investment in terms of code? If you look at the number of lines of code modified weekly (figure on the right), there is a big spike in December 2011. That’s our sprint! Importantly, there is still a lot of activity in the months following the sprint. This is actually unusual, as active development usually happens more during the summer break than during the winter, when our developers are busy working on papers or teaching.

The explanation is simple: we were thrilled by the sprint. Overall, it was incredibly beneficial to the project. I am looking forward to the next ones.

23 Apr

3 Google summer of code for scikit-learn and more…

The scikit-learn got 3 students accepted for the Google summer of code.

In addition, other related projects have exciting projects, for instance statsmodels:

and Cython:

finally, in Pandas:

Congratulations to all of the students. This is going to be an exciting summer.

14 Apr

The problems of low statistical power and publication bias

Lately, I have been in a mood of scientific scepticism: I have the feeling that the worldwide academic system is more and more failing to produce useful research. Christophe Lalanne’s twitter feed led me to an interesting article in a non-mainstream journal: A farewell to Bonferroni: the problems of low statistical power and publication bias, by Shinichi Nakagawa.

Each study performed has a probability of being wrong. Thus performing many studies will lead to some wrong conclusions by chance. This is known in statistics as the multiple comparisons problem. When a working hypothesis is not verified empirically in a study, this null finding is seldom reported, leading to what is called publication bias: discoveries are further studied; negative results are usually ignored (Y. Benjamini). Because only discoveries, called detections in statistical terms, are reported, published results contain a larger share of false detections than the individual experiments, and very few false negatives. Arguably, the original investigators have corrected for this, using the understanding that they gained from the experiments performed, and account in a post-hoc analysis for the fact that some of their working hypotheses could not have been correct. Such a correction can work only in a field where there is a good mechanistic understanding, or models, such as physics, but in my opinion not in life and social sciences.
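The effect of reporting only detections can be seen in a small simulation. This is a sketch under made-up assumptions: 10% of the hypotheses tested are true, each test has a 5% false positive rate, and the power is 50%. None of these numbers come from the article.

```python
import random

random.seed(0)

n_studies = 100000
fraction_true = 0.1  # assumed share of working hypotheses that are true
alpha = 0.05         # false positive rate of each individual study
power = 0.5          # assumed probability of detecting a true effect

published_true = 0
published_false = 0
for _ in range(n_studies):
    effect_exists = random.random() < fraction_true
    if effect_exists:
        detected = random.random() < power
    else:
        detected = random.random() < alpha
    if detected:  # only detections are written up and published
        if effect_exists:
            published_true += 1
        else:
            published_false += 1

false_share = published_false / (published_true + published_false)
print("false detections among all studies: %.3f"
      % (published_false / n_studies))
print("false detections among published results: %.3f" % false_share)
```

Under these assumptions, each study is wrong in this way less than 5% of the time, yet close to half of the published detections are false, because the null findings never make it into the literature.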

Let me quote some relevant extracts of the article, as you may never have access to it thanks to the way scientific publishing works:

Recently, Jennions and Moller (2003) carried out a meta-analysis on statistical power in the field of behavioral ecology and animal behavior, reviewing 10 leading journals including Behavioral Ecology. Their results showed dismayingly low average statistical power (note that a meta-analytic review of statistical power is different from post hoc power analysis as criticized in Hoenig and Heisey, 2001). The statistical power of a null hypothesis (Ho) significance test is the probability that the test will reject Ho when a research hypothesis (Ha) is true.

The meta-analysis on statistical power by Jennions and Moller (2003) revealed that, in the field of behavioral ecology and animal behavior, statistical power of less than 20% to detect a small effect and power of less than 50% to detect a medium effect existed. This means, for example, that the average behavioral scientist performing a statistical test has a greater probability of making a Type II error (or beta) (i.e., not rejecting Ho when Ho is false; note that statistical power is equals to 1 - beta) than if they had flipped a coin, when an experiment effect is of medium size.

Imagine that we conduct a study where we measure as many relevant variables as possible, 10 variables, for example. We find only two variables statistically significant. Then, what should we do? We could decide to write a paper highlighting these two variables (and not reporting the other eight at all) as if we had hypotheses about the two significant variables in the first place. Subsequently, our paper would be published. Alternatively, we could write a paper including all 10 variables. When the paper is reviewed, referees might tell us that there were no significant results if we had “appropriately” employed Bonferroni corrections, so that our study would not be advisable for publication. However, the latter paper is scientifically more important than the former paper. For example, if one wants to conduct a meta-analysis to investigate an overall effect in a specific area of study, the latter paper is five times more informative than the former paper. In the long term, statistical significance of particular tests may be of trivial importance (if not always), although, in the short term, it makes papers publishable. Bonferroni procedures may, in part, be preventing the accumulation of knowledge in the field of behavioral ecology and animal behavior, thus hindering the progress of the field as science.

Some of the concerns raised here are partly a criticism of Bonferroni corrections, i.e. in technical terms correcting for family-wise error rate (FWER). It is actually the message that the author wants to convey in his paper. Proponents of controlling for false discovery rate (FDR) argue that an investigator shouldn’t be penalized for asking more questions, and that the fraction of errors in the answers should be controlled, rather than their absolute number. That said, FDR, while useful, does not answer the problems of publication bias.
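To illustrate the difference between the two corrections, here is a minimal pure-Python sketch of the Bonferroni procedure (FWER control) and of the Benjamini-Hochberg procedure (a standard way of controlling the FDR, used here as an example and not taken from the article). The p-values are hypothetical, standing in for the 10 variables of the quoted example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 when p <= alpha / m: controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank such
    that p_(k) <= k * alpha / m: controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_rejected = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * alpha / m:
            n_rejected = rank
    rejected = [False] * m
    for index in order[:n_rejected]:
        rejected[index] = True
    return rejected


# Hypothetical p-values for 10 measured variables.
p_values = [0.001, 0.008, 0.012, 0.04, 0.2, 0.35, 0.5, 0.6, 0.8, 0.9]
print(sum(bonferroni(p_values)), "detections with Bonferroni")
print(sum(benjamini_hochberg(p_values)), "detections with Benjamini-Hochberg")
```

With these p-values, Bonferroni keeps a single detection while Benjamini-Hochberg keeps three: the FDR threshold grows with the rank of the p-value instead of staying at alpha / m.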

08 Mar

Want features? Just code

Somebody just sent an email on a user’s mailing list for an open-source scientific package entitled “Feature foo: why is package bar not up to the task?” (names hidden to avoid pointing directly to the person responsible for my wrath). To quote him:

Is there ANY plan for having such a module in package bar?? I think (personally) that this is a MUST DO. This is typically the type of routines that I hear people use in e.g., idl etc. If this could be an optimised, fast (and easy to use) routine, all the better.

As someone who spends a fair amount of time working on open source software, I hear such remarks quite often. I am finding it harder and harder not to react negatively to these emails. Now, I cannot consider myself a contributor to package bar, and thus I can claim that I am not taking this comment personally.

Why aren’t packages up to the task? Well, the answer is quite simple: because they are developed by volunteers who do it in their spare time, late at night too often, or by companies that put some of their profits into open source rather than into locking down a market. 90% of the time, the reason the feature isn’t as good as you would want it is lack of time.

I personally find that suggesting that somebody else should put more of the time and money they are already giving away in improving a feature that you need is almost insulting.

I am aware that people do not realize how small the group of people that develop and maintain their toys is. Borrowing the figure below from Fernando Perez’s talk at Euroscipy, the number of people that do 90% of the grunt work to get the core scientific Python ecosystem going is around two handfuls:

Commits per contributor in various scientific Python packages, from Fernando Perez

I’d like to think that this recruitment problem is a lack of skill set: users that have the ability to contribute are just too rare. This is not entirely true: there are scores of skilled people on the mailing lists. The poster himself mentioned in his email that he was developing a package. I personally started contributing not knowing anything about software development. I struggled, I did the grunt work like maintaining wikis, answering questions on mailing lists, and writing documentation. These easier tasks were useful to the community, I think, but most importantly, they taught me a lot because I was investing energy in them.

If people want things to improve, they will have more success sending in pull requests than messages on mailing lists that sound condescending to my ears.

I hope that I haven’t overreacted too badly :), that email set me off. That said, I am not sure that people realize how much they owe to the open source developers breaking their backs on the packages they use.

All credit for images goes to Fernando Perez

10 Jan

Book review: NumPy 1.5 Beginner’s guide

Packt publishing sent me a copy of NumPy 1.5 Beginner’s guide by Ivan Idris.

The book actually covers more than only numpy: it is a full introduction to numerical computing with Python. The table of contents is the following:

  • NumPy Quick Start
  • Beginning with NumPy Fundamentals
  • Get into Terms with Commonly Used Functions
  • Convenience Functions for Your Convenience
  • Working with Matrices and ufuncs
  • Move Further with NumPy Modules
  • Peeking Into Special Routines
  • Assure Quality with Testing
  • Plotting with Matplotlib
  • When NumPy is Not Enough: SciPy and Beyond

The book is easy to read, as it requires no specific expertise other than knowing basic Python programming. It is full of examples and exercises, which is really great for learning. I find the style of the author, Ivan Idris, particularly amusing and relaxing, engaging the reader with questions, challenges, or even jokes (“Have a go hero”).

With regards to the formatting and the print, the book is written in large fonts, with sectioning information, tips and exercises clearly standing out.

It is full of practical information, such as how to install the software, or where to get help. Finally, one thing that I appreciated is that the examples are typed in IPython. Each time I teach, I like to use IPython, because it is full of features to help plotting, debugging and profiling numerical code. The book even has a little introduction to some useful IPython features.

After an introduction to the work flow, the book explores array manipulation such as creation or reshaping, followed by some simple numerics and the battery of array-based operations on functions and polynomials. Then it presents linear algebra and signal processing basics (FFT). It also covers the financial functions that are present in numpy and mentions testing, which is very important to achieve quality code. The book finishes with matplotlib and scipy, two modules that are important to know to go further.

The examples are mostly drawn from statistics or financial applications, such as computing running averages on stock quotes. Basic math explanations, such as the definition of the Moore-Penrose pseudo-inverse, are given when needed.

To conclude, I enjoyed this book and I think that it is a nice addition to my library. It delivers exactly what its title promises: it is well-suited for beginners wanting to learn numpy. On the other hand, I would not recommend it as reference material, or as a book to learn more general scientific or numerical computing with Python.

07 Jan

Joblib beta release: fast compressed persistence + Python 3

Joblib 0.6: better I/O and Python 3 support

Happy new year, everyone. I have just released Joblib 0.6.0 beta. The highlights of the 0.6 release are a reworked enhanced pickler, and Python 3 support.

Many thanks go to the contributors to the 0.5.X series (Fabian Pedregosa, Yaroslav Halchenko, Kenneth C. Arnold, Alexandre Gramfort, Lars Buitinck, Bala Subrahmanyam Varanasi, Olivier Grisel, Ralf Gommers, Juan Manuel Caicedo Carvajal, and myself). In particular Fabian made sure that Joblib worked under Python 3.

In this blog post, I’d like to discuss a bit more the compressed persistence engine, as it illustrates well key factors in implementing and using compressed serialization.

Fast compressed persistence

One of the key components of joblib is its ability to persist arbitrary Python objects, and read them back very quickly. It is particularly efficient for containers that do their heavy lifting with numpy arrays. The trick to achieving great speed has been to save the numpy arrays in separate files, and load them via memmapping.
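As a rough illustration, here is how persisting a container and reading it back through memmapping looks with `joblib.dump` and `joblib.load`. This sketch assumes a recent joblib, where the on-disk layout may differ from the separate files described above; the object and file name are made up.

```python
import os
import tempfile

import numpy as np
import joblib

# An arbitrary Python container whose heavy part is a numpy array.
data = {"name": "toy dataset",
        "values": np.arange(1000000, dtype=np.float64)}

filename = os.path.join(tempfile.mkdtemp(), "data.pkl")
joblib.dump(data, filename)

# With mmap_mode, the array is memory-mapped instead of read into RAM:
# the bytes are fetched from disk only when they are accessed.
loaded = joblib.load(filename, mmap_mode="r")
print(type(loaded["values"]))
print(loaded["values"][10])
```

The load call returns almost immediately; the actual reading cost is paid later, when the array content is touched.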

However, one drawback of joblib is that the caching mechanism may end up using a lot of disk space. As a result, there is strong interest in having compressed storage, provided it doesn’t slow down the library too much. Another use case that I have in mind for fast compressed persistence is implementing out of core computation.

There are some great compressed I/O libraries for Python, for instance Pytables. You may wonder why the need to code yet another one. The answer is that joblib is pure Python, depending only on the standard library (numpy is optional), but also that the goal here is black-box persistence of arbitrary objects.

Comparing I/O speed and compression to other libraries

Implementing efficient compressed storage was a bit of a struggle and I learned a lot. Rather than going into the details straight away, let me first discuss a few benchmarks of the resulting code. Benchmarking such a feature is very hard, first because you are fighting with the disk cache, second because the performance depends very much on the data at hand (some data compress better than others), and last because there are three interesting metrics: disk space used, write speed, and read speed.

Dataset used - I chose to compare the different strategies on some datasets that I work with, namely the probabilistic brain atlases MNI 1mm (62Mb uncompressed) and Juelich 2mm (105Mb uncompressed). Whether the data is represented as a Fortran-ordered array, or a C-ordered array is important for the I/O performance. This data is normally stored to disk compressed using the domain-specific Nifti format (.nii files), accessed in Python with the Nibabel library.

Libraries used - I benched different compression strategies in joblib against Nibabel’s Nifti I/O, compressed or not, and against using Pytables to store the data buffer (without the meta-information). Pytables exposes a variety of compression strategies, with different speed compromises. In addition, I benched numpy’s builtin savez_compressed.

I would like to stress that I am comparing a general purpose persistence engine (joblib) to specific I/O libraries either optimized for the data (Nifti), or requiring some massaging to enable persistence (pytables).






Comparing to other libraries

Actual numbers can be found here.

Take home messages - The graphs are not crystal-clear, but a few tendencies appear:

  • Pytables with LZO or blosc compression is the king of the hill for read and write speed.
  • I/O of compressed data is often faster than with uncompressed data for a good compression algorithm.
  • Joblib with Zlib compression level 1 performs honorably in terms of speed with only the Python standard library and no compiled code.
  • Read time of memmapping (with nibabel or joblib) is negligible (it is tiny on the graphs); however, the loading time appears when you start accessing the data.
  • Passing in arrays with a memory layout (Fortran versus C order) that the I/O library doesn’t expect can really slow down writing.
  • Compressing with Zlib compression-level 1 gets you most of the disk space gains for a reasonable cost in write/read speed.
  • Compressing with Zlib compression-level 9 (not shown on the figures) doesn’t buy you much in disk space, but costs a lot in writing time.
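The trade-off between compression levels is easy to explore on your own data. The following is a minimal sketch, not the benchmark code used for the figures: it times zlib at levels 1, 3 and 9 on a synthetic, fairly redundant in-memory buffer (an assumption made for illustration) and records the compression ratio alongside compression and decompression times. It deliberately stays in memory, so it says nothing about disk-cache effects.

```python
import time
import zlib

# Synthetic, fairly redundant payload (an assumption; real data will behave differently).
raw = b"".join(bytes([i % 7, (i // 50) % 256, 0, 0]) for i in range(500_000))

results = {}
for level in (1, 3, 9):
    t0 = time.perf_counter()
    compressed = zlib.compress(raw, level)
    t1 = time.perf_counter()
    restored = zlib.decompress(compressed)
    t2 = time.perf_counter()
    assert restored == raw  # the round trip must be lossless
    results[level] = {
        "ratio": len(compressed) / len(raw),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }

for level, r in results.items():
    print(f"level {level}: ratio={r['ratio']:.3f} "
          f"compress={r['compress_s']:.4f}s decompress={r['decompress_s']:.4f}s")
```

Timings from a single run like this are noisy; repeating the loop several times and keeping the best time gives more stable numbers.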

Benching datasets richer than pure arrays

The datasets used so far are pretty much composed of one big array, a 4D smooth spatial map. I wanted to test on more datasets, to see how the performance varies with data type and richness. For this, I used the datasets of scikit-learn, real-life data of various natures, described here:

  • 20 news - 20 usenet news group: this data mainly consists of text, and not numpy arrays.
  • LFW people - Labeled faces in the wild, many pictures of different people’s faces.
  • LFW pairs - Labeled faces in the wild, pairs of pictures for each individual. This is a high entropy dataset, it does not have much redundant information.
  • Olivetti - Olivetti dataset: centered pictures of faces.
  • Juelich(F) - Our previous Juelich atlas
  • Big people - The LFW people dataset, but repeated 4 times, to put a strain on memory resources.
  • MNI(F) - Our previous MNI atlas
  • Species - Occurrence of species measured in Latin America, with a lot of missing data.

Testing compression strategies on various datasets

Actual numbers can be found here.

What this tells us - The main message from these benchmarks is that datasets with redundant information, i.e. that compress well, give fast I/O. This is not surprising. In particular, good compression can give good I/O on text (20 news). Another result, more of a sanity check, is that compressed I/O on big data (Big people) works as well as on smaller data. Earlier code would start to swap. Finally, I conclude from these graphs that compression levels from 1 to 3 buy you most of the gains for reasonable costs, and that going up to 9 is not recommended, unless you know that your data can be compressed a lot (Species).

Lessons learned

I’ll keep this paragraph short, because the information is really in joblib’s code and comments. Don’t hesitate to have a look: it’s BSD-licensed, so you are free to borrow what you please.

  1. Memory copies of arrays, but also of strings and byte streams, can really slow you down with big data.
  2. To avoid copies with numpy arrays, fully embrace numpy’s strided memory model. For instance, you do not need to save arrays in C order, if they are given to you in a different order. Accessing the memory in the wrong striding direction explains the poor write performance of pytables on Fortran-ordered Juelich.
  3. When dealing with the file system, the OS does so much magic (e.g. prefetching) that clever hacks tend not to work: always benchmark.
  4. Depending on the size of the data, it may be more efficient to store subsets in different files: this introduces ‘chunks’ that avoid filling up the memory too much (parameter cache_size in joblib’s code). In addition, data of the same nature tends to compress better.
  5. The I/O stream or file object interfaces are abstractions that can hide the data movement and the creation of large temporaries. After experiments with GZipFile and StringIO/BytesIO, I found it more efficient to fall back to passing around big buffer objects, numpy arrays, or strings.
  6. For reasons 4 and 5, I ended up avoiding the gzip module: raw access to zlib with buffers gives more control. This explains a good part of the differences in read speed for pure arrays with numpy’s savez_compressed.
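Points 4 to 6 can be illustrated with a small sketch that works on raw buffers with zlib directly, instead of going through a file-like object. It is only an illustration under my own assumptions, not joblib's code: the function names compress_chunks and decompress_into, and the chunk_size parameter, are hypothetical, and the chunks are kept in a list here rather than written to separate files.

```python
import zlib


def compress_chunks(buffer, chunk_size=1 << 20, level=1):
    """Yield zlib-compressed chunks of `buffer` without copying the input.

    `compress_chunks` and `chunk_size` are hypothetical names for this sketch.
    """
    view = memoryview(buffer).cast("B")  # flat byte view, no copy
    for start in range(0, len(view), chunk_size):
        # Slicing a memoryview gives another view, not a new bytes object.
        yield zlib.compress(view[start:start + chunk_size], level)


def decompress_into(chunks, total_size):
    """Decompress chunks one at a time into a single preallocated buffer."""
    out = bytearray(total_size)
    offset = 0
    for chunk in chunks:
        data = zlib.decompress(chunk)  # only one chunk-sized temporary at a time
        out[offset:offset + len(data)] = data
        offset += len(data)
    return out


original = bytearray(b"smooth spatial map " * 200_000)
chunks = list(compress_chunks(original))
restored = decompress_into(chunks, len(original))
```

The point of the design is that the largest temporary is one chunk, whereas compressing the whole buffer in one call needs room for the full input and the full output at the same time.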

One of my conclusions for joblib is that I’ll probably use Pytables as an optional backend for persistence in a future release.

Details on the benchmarks

These benchmarks were run on a Dell Latitude D630 laptop. That’s a dual-core Intel Core2 Duo box, with 2M of CPU cache.

The code for these benchmarks can be found on a gist.

Thanks

I’d like to thank Francesc Alted for the very useful feedback he gave on this topic. In particular, the following thread on the pytables mailing-list may be of interest to the reader.

18 Nov

Scikit-learn NIPS 2011 sprint: international thanks to our sponsors

The NIPS conference: time for a sprint. The NIPS conference, one of the major conferences in machine learning, is hosted in Granada this year. I believe that it is the first time that it is hosted in Europe. As many of the scikit-learn developers are part of the wider NIPS community, and many also live in Europe, we jumped on the occasion to organize a truly international sprint: the NIPS 2011 scikit-learn sprint.

Finding money. As often with open source development, a lot of our contributors are young people, investing their free time outside of any request from their hierarchy. In such a situation, it can be hard to find travel money. So we started looking for sponsors. We needed to find a decent sum of money, as we were flying people in from places such as the West coast of the US, or even Japan. The good news is that we found money, and between supervisors pitching in, universities giving travel grants, and our generous sponsors, there will be an impressive list of contributors from all over the world at the sprint.

Thanks to our sponsors. The first people that we need to thank are Google, who gave us a sizable sponsorship, and the PSF, who made Google’s sponsorship possible through their accounting and sprints programs. We also need to thank our other sponsors, namely Tinyclues. Thanks to these sponsors, and additional investment from many universities and research groups, we have been able to gather a total of 12 contributors in Granada, a handful coming from overseas. Also, we are indebted to the University of Granada, and the Gnu/Linux Granada Group (GGG), who are providing hosting for the sprint, as well as Régine Bricquet, from INRIA, who did a lot of the trip planning for the sponsored people.

I am very much looking forward to the sprint. It will be the first time that I meet many of the contributors in real life, and judging by the warmth of the on-line exchanges, it will be a great moment. Besides, Granada is known to be a lively and historical city.

If you are around and want to join us, to work on Python in machine learning, send us a mail on the mailing list.
