<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Gaël Varoquaux</title>
	<atom:link href="http://gael-varoquaux.info/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://gael-varoquaux.info/blog</link>
	<description>Views on Python, Computational Science, ...</description>
	<pubDate>Fri, 19 Oct 2012 13:30:06 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<item>
		<title>Hiring a programmer for a brain imaging library</title>
		<link>http://gael-varoquaux.info/blog/?p=168</link>
		<comments>http://gael-varoquaux.info/blog/?p=168#comments</comments>
		<pubDate>Fri, 19 Oct 2012 13:06:26 +0000</pubDate>
		<dc:creator>gael</dc:creator>
		
		<category><![CDATA[machine learning]]></category>

		<category><![CDATA[personnal]]></category>

		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<category><![CDATA[science]]></category>

		<category><![CDATA[scientific computing]]></category>

		<category><![CDATA[scikit-learn]]></category>

		<guid isPermaLink="false">http://gael-varoquaux.info/blog/?p=168</guid>
		<description><![CDATA[I am super excited to announce a job offer that is dear to my heart: doing quality open-source software, with Python scientific tools and machine learning, for clinical application of brain imaging. This is the most exciting job that I have had the chance to be recruiting for!
We are looking for a programmer to join [...]]]></description>
			<content:encoded><![CDATA[<p>I am super excited to announce a job offer that is dear to my heart: doing quality open-source software, with Python scientific tools and machine learning, for clinical application of brain imaging. This is the most exciting job that I have had the chance to be recruiting for!</p>
<hr /><p>We are looking for a programmer to join our research group, the <a href="http://team.inria.fr/parietal/">Parietal team</a> at INRIA, to work on a library integrating state-of-the-art methods of functional brain imaging.</p>
<p>As a programmer, you will be taking part in the <a href="https://team.inria.fr/parietal/research/spatial_patterns/niconnect/">NiConnect</a> research project, developing tools for the analysis of spontaneous brain activity using functional MRI. The project unites neuroscientists, data-miners, statisticians and clinical researchers to <a href="http://www.nature.com/news/neuroscience-idle-minds-1.11440">transfer recent advances in basic neuroscience</a> to clinical diagnostic tools. Your duties will be to work hand in hand with the computer science and statistics researchers to turn the research code into a solid and well-documented Python library usable by clinical researchers. The core technologies used will rely on <a href="http://scipy.org">the scientific Python stack</a> and the <a href="http://scikit-learn.org">scikit-learn</a> machine learning library.</p>
<div id="requirements" class="section">
<h1>Requirements</h1>
<ul>
<li>Programming skills in Python, preferably with experience of the scientific Python stack</li>
<li>Understanding of quality assurance in software development: test-driven programming, version control, technical documentation.</li>
<li>Software design skills</li>
<li>Some knowledge of Linux/Unix</li>
<li>Knowledge of open-source development and community-driven environments is valued</li>
<li>Good technical English level</li>
<li>Experience in statistical learning or a mathematically oriented mindset is a plus</li>
</ul>
<p>Speaking French is not a requirement, as it is an international team.</p>
</div>
<div>
<h1>About the team</h1>
<p><a href="http://www.inria.fr">INRIA</a> is the French computer science research institute. It is recognized worldwide as one of the leading research institutions and has strong expertise in machine learning. You will be working in the <a href="http://team.inria.fr/parietal/">Parietal team</a>, which makes heavy use of Python for brain imaging analysis.</p>
<p>Parietal is a small research team (around 15 people) with an excellent technical knowledge of scientific and numerical computing in Python as well as a fine understanding of algorithmic issues in machine learning, statistics and image processing. Parietal is committed to investing in the scientific Python toolstack and its members are core developers in central projects such as <a href="http://docs.enthought.com/mayavi/mayavi/">Mayavi</a> and <a href="http://scikit-learn.org">scikit-learn</a>, as well as the <a href="http://nipy.org">nipy</a> library for NeuroImaging in Python.</p>
<p>Parietal is located in the <a href="http://www-dsv.cea.fr/en/instituts/institut-d-imagerie-biomedicale-i2bm/services/neurospin-neurospin">Neurospin brain research facility</a>, which hosts several brain scanners and research teams in neuroscience and medical imaging.</p>
<p>Working at Parietal is a unique opportunity to improve your skills in numerical computing and statistical data processing in Python. In addition, working on an open source stack will give you first-rate experience of open source community management and collaborative project development.</p>
<p><strong>Contact Info:</strong></p>
<ul>
<li><strong>Technical Contact</strong>: Gael Varoquaux</li>
<li><strong>E-mail contact</strong>: <a href="mailto:gael.varoquaux@inria.fr">gael.varoquaux@inria.fr</a></li>
<li><strong>HR Contact</strong>: Marie Domingues</li>
<li><strong>E-mail Contact</strong>: <a href="mailto:marie.domingues@inria.fr">marie.domingues@inria.fr</a></li>
<li><a href="http://www.inria.fr/institut/recrutement-metiers/offres/cdd/%28view%29/details.html?id=PUQFK026203F3VBQB6G68LO2G&amp;ContractType=5033&amp;SUBDEPT1=2&amp;LG=FR&amp;Resultsperpage=20&amp;nPostingID=6891&amp;nPostingTargetID=12252&amp;option=52&amp;sort=DESC&amp;nDepartmentID=2">Official job posting</a></li>
<li><strong>No telecommuting</strong></li>
</ul>
</div>
]]></content:encoded>
			<wfw:commentRss>http://gael-varoquaux.info/blog/?feed=rss2&amp;p=168</wfw:commentRss>
		</item>
		<item>
		<title>RIP John Hunter: the loss of a great man</title>
		<link>http://gael-varoquaux.info/blog/?p=167</link>
		<comments>http://gael-varoquaux.info/blog/?p=167#comments</comments>
		<pubDate>Thu, 30 Aug 2012 09:21:20 +0000</pubDate>
		<dc:creator>gael</dc:creator>
		
		<category><![CDATA[computational science]]></category>

		<category><![CDATA[mayavi]]></category>

		<category><![CDATA[personnal]]></category>

		<category><![CDATA[python]]></category>

		<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://gael-varoquaux.info/blog/?p=167</guid>
		<description><![CDATA[John Hunter, the author of matplotlib, passed away yesterday after a short battle against cancer. John gave the keynote at the scipy 2012 conference a few weeks ago, and was diagnosed with cancer just after his return from the conference. It is a shock to me that a friend can disappear so quickly. Please [...]]]></description>
			<content:encoded><![CDATA[<p>John Hunter, the author of <a href="http://matplotlib.sourceforge.net/">matplotlib</a>, passed away yesterday after a short battle against cancer. John gave the keynote at the scipy 2012 conference a few weeks ago, and was diagnosed with cancer just after his return from the conference. It is a shock to me that a friend can disappear so quickly. To learn more about John, please read the <a href="https://groups.google.com/forum/#!msg/pydata/FpwXp3sX6N8/mxopkZ1PkBQJ">announcement</a> by <a href="http://fperez.org/">Fernando Perez</a>, who supported him in his last weeks.</p>
<h1>A man who gave a lot, not asking for anything in return</h1>
<p>Many have benefited from the silent efforts of John, and are not fully aware of how generously he invested his time and talent for the benefit of others. Matplotlib, the Python plotting library that he created in 2002, has propelled Python into a major tool for scientific research and engineering. The impact of John&#8217;s efforts goes well beyond Matplotlib. Early on, John had the vision of Python as an interactive scientific environment. He promoted this vision, pairing with Fernando Perez to develop the fantastic <a href="http://ipython.org/">ipython</a>/<a href="http://matplotlib.sourceforge.net/">matplotlib</a> tandem and solving many technical challenges. But he also invested a lot of energy in teaching workshops that helped change the way people compute, as well as in writing didactic documentation and articles. He was a friendly, active leader of an online community, open and helpful to newcomers.</p>
<p>As Travis Oliphant said on John&#8217;s numfocus <a href="http://numfocus.org/johnhunter/">memorial webpage</a>:</p>
<blockquote><p>Those who contribute much to open source, as John did, do so at the expense of something - often it is time with family.</p></blockquote>
<p>I cannot stress enough how true this is. The entire open source software ecosystem, which nowadays supports our economy, our education, and our research, is built on the shoulders of a fairly small number of generous people who spend their energy on making better software rather than on personal wealth. </p>
<p>John was a humble man. He did not have a blog, or a twitter account, did not seek fame or money. For this reason I feel that his contributions are unknown and undervalued by many. In my eyes, he is an unknown soldier of our modern times. I hope that I am not being too emphatic, but this is how I feel.</p>
<hr/>
<p><strong>John passed away at 44, leaving behind a wife and 3 daughters. Please do consider supporting them: <a href="http://numfocus.org/johnhunter">http://numfocus.org/johnhunter</a></strong>. </p>
]]></content:encoded>
			<wfw:commentRss>http://gael-varoquaux.info/blog/?feed=rss2&amp;p=167</wfw:commentRss>
		</item>
		<item>
		<title>A journal promoting high-quality research code: dream and reality</title>
		<link>http://gael-varoquaux.info/blog/?p=166</link>
		<comments>http://gael-varoquaux.info/blog/?p=166#comments</comments>
		<pubDate>Mon, 04 Jun 2012 20:39:52 +0000</pubDate>
		<dc:creator>gael</dc:creator>
		
		<category><![CDATA[computational science]]></category>

		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<category><![CDATA[science]]></category>

		<category><![CDATA[scientific computing]]></category>

		<guid isPermaLink="false">http://gael-varoquaux.info/blog/?p=166</guid>
		<description><![CDATA[
Open research computation (ORC) was an attempt to create a scientific publication promoting high-quality and open source scientific code. The project went public in fall 2010, but last month, facing the low volume of submissions, the editorial board chose to reorient it as a special track of an existing journal. 

The challenges that we face [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.openresearchcomputation.com/sites/10206/images/logo.gif" align="right" width=300px></p>
<p><a href="http://www.openresearchcomputation.com/">Open research computation (ORC)</a> was an attempt to create a scientific publication promoting <strong>high-quality and open source scientific code</strong>. The project went public in fall 2010, but last month, facing the low volume of submissions, the editorial board <a  href="http://blogs.openaccesscentral.com/blogs/bmcblog/entry/open_research_computation_thematic_series">chose to reorient it</a> as a special track of an existing journal. </p>
<p>
The challenges that we face are discussed in our editorial:</p>
<blockquote><p>
    <a href="http://www.scfbm.org/content/7/1/2/abstract">Changing computational research. The challenges ahead.</a> C Neylon, J Aerts, CT Brown, D Lemire, J Millman, P Murray-Rust, F Perez, N Saunders, A Smith, G Varoquaux and E Willighagen, <i>Source Code for Biology and Medicine</i> 2012, 7:20
</p></blockquote>
<p>Here is my own personal take on the rise and fall of this ideal.
</p>
<h1>My story with ORC</h1>
<p><img src="http://www.rcac.net.au/images/Publications1.jpg" align="right" width=300px></p>
<p><strong>From pipe dream to journal -</strong> My involvement with ORC started long before there was such a thing as ORC. In fall 2008, I had a discussion with a friend working in the publication industry, telling her how I believed that the publication system is broken, because it promotes new results without any interest in whether these can be exported outside the lab that produced them: <strong>it is currently easier to publish a minor but novel result than a tool enabling the routine reproduction of previous results</strong>. This seemed particularly marked in the scientific software world, as software tools are becoming central to the scientific workflow, and cost nothing to duplicate when produced under an open-source license. To my surprise, she took me seriously, and asked me to write my ideas down in an email that she would forward to her colleagues in the publication industry.</p>
<p>Looking back at the email that I sent, my concerns were, back then, to promote: </p>
<ul>
<li>quality and openness of scientific software</li>
<li>basic tools shared across communities</li>
<li>recognition of software development as a challenging and worthwhile task in academic research</li>
</ul>
<p><strong>Shaping the idea - </strong>In the year that followed, I had a few discussions with staff from <a href="http://www.biomedcentral.com">BioMedCentral</a>, an open-access publisher in biology and medicine that was looking into expanding into physics- and math-related fields. Eventually, my contact there told me that they had received other similar requests and were launching a journal that would be led by Cameron Neylon, a British biophysicist and strong advocate of openness and reproducibility in science. This was the start of ORC, and for me the chance to meet other people sharing my concerns, some new and some <a href="http://fperez.org/">already</a> <a href="http://jarrodmillman.com/">old</a> <a href="http://ivory.idyll.org">friends</a>. </p>
<div style="float: right; background-color: #EEE; padding: 2px; text-align: center"><img src="http://www.salinafbc.com/Websites/fbcsalina/images/nerd_computer.gif" height=200px align=centered><br/>ORC editor</div>
<div style="float: right; background-color: #EEE; padding: 2px">
<img src="http://researchsupportgroup.files.wordpress.com/2011/11/kayla1.jpg" height=200px align=centered><br/>Conventional editor</div>
<p><strong>Setting up the journal - </strong>BioMedCentral was instrumental in setting up the journal project. I quickly learned that, unsurprisingly, a journal is a product like anything else, and it must find customers. Here, as we were launching an open access journal, the customers were authors. This is where a journal faces a chicken and egg problem: to be recognised it needs high-visibility publications, but authors will submit only to journals that they know. The main tools to overcome this challenge are communication and advocacy. I then realized that these really weren&#8217;t my strong points. Cameron Neylon absolutely shone on this side, with very enthusiastic <a href="http://cameronneylon.net/blog/open-research-computation-an-ordinary-journal-with-extraordinary-aims/">communications</a> and an incredibly active <a href="https://twitter.com/#!/CameronNeylon">twitter account</a>. On my side, I am a slow writer, and I tend to speak Python better than English, which is not a strong asset for a journal editor.</p>
<p><strong>Wild editorial discussions - </strong> The discussions in the editorial board really thrilled me because they were centered on how to set standards to improve the quality of published code. Looking in my mailbox, I see discussions about code repositories, software testing, documentation and licensing issues. This is not that surprising, given that a lot of the editors were actually contributors to major software projects. It made me very happy, as I have the feeling that, so far, most committees or decision makers are clueless about software.</p>
<h1>Sand in the gears: the lack of uptake</h1>
<p><img src="http://spacenews.com/images/123110Gsat02.jpg" align="right" width=200px></p>
<p><strong>A false start - </strong>So ORC was launched in late 2010 and we had fantastic feedback. I had the feeling that people were <a href="http://neuralensemble.blogspot.fr/2010/12/open-research-computation-new-journal.html">genuinely</a> <a href="https://twitter.com/vaguery/status/15402390589018112">excited</a> about our program: changing the way computational science worked from the inside, through the review process. The plan was that we would open a pre-submission call and wait for a few good papers to be submitted before launching the journal. However, it turned out that the papers were slow to come. It took me a while to realize that there was something wrong. But slowly we had to face the truth: many people were excited about the journal, but most were sending their papers elsewhere. </p>
<p><strong>What went wrong? - </strong>If I really knew what went wrong, I would probably not be discussing it here, but rather changing the world. However, I can come up with a few guesses:</p>
<ul>
<li><strong>Working across communities is harder.</strong> From the beginning we had wanted to position the journal across communities, in order to foster the sharing of tools for a greater good. The challenge is that a central role of publication is nowadays to provide recognition. It is much easier to achieve recognition in a given community than across communities, and authors always preferred submitting their work to a non-software-oriented journal in their field. We couldn&#8217;t fight the battle for software quality and the battle for inter-community work at the same time.
</li>
<li><strong>Setting the bar too high.</strong> Many felt that the submission requirements were too demanding. To quote a researcher on a neuroimaging forum: <a href="http://www.nitrc.org/forum/message.php?msg_id=3674">&#8220;I think it&#8217;s setting the bar unrealistically high for most neuroimaging software&#8221;</a>. While we had originally shot for a very high test coverage (probably too high), we had scaled it back quickly, simply stressing that editors and reviewers would be looking closely at test coverage, documentation and ease of installation. That said, the average researcher did not share our ideals of raising the quality of scientific software. Trying to ask only for excellent publications in a new and unproven journal was probably unrealistic.</li>
<li><strong>Editors not willing to game the system.</strong> I have watched a few journal launches, and it seems to me that a common trick is to line up articles that are created by the editors and their friends specifically for the new journal. People come up with <i>opinion papers</i>, <i>reviews</i>, <i>commentaries</i> that only serve to give the journal an identity. This did not happen for ORC, and I believe that it is because <a href="http://cameronneylon.net/blog/open-research-computation-an-ordinary-journal-with-extraordinary-aims">the editors themselves</a> were not huge fans of the low signal-to-noise ratio in modern scientific publishing practice.
</li>
</ul>
<h1>The times they are a changing</h1>
<p><img src="http://www.pictures88.com/p/success/success_005.jpg" align="right" width=200px></p>
<p><strong>ORC is dead, long live ORC - </strong> We did get a few submissions. ORC is not coming to an end; it is morphing into a special thematic series in <a href="http://www.scfbm.org/">Source Code for Biology and Medicine</a>. This solution is not completely satisfactory, as it pushes what should have been a forum for exposing good practices and good software into a smaller community. But at least there is now a venue in which people can publish a paper about software that they have been improving and maintaining, and not only about a new algorithm.</p>
<p><strong>Changing practices across the board - </strong> One of the reasons we had a hard time making a breakthrough is that authors were sending their software papers to other journals, in particular journals not specialized in software. While these papers are not getting the attention of a review and editorial team with expertise in software development, such as the one we are setting up with ORC, this is still a good thing. Indeed, it shows that the times are changing and that recognition of software as a scientific work is improving. I have been impressed to see that many high-profile journals have changed their editorial policies to specifically accept software papers, or have created tracks dedicated to software.</p>
<p>Software is being slowly recognized as a pillar of modern scientific research. We need to keep pushing to make sure that quality standards are set and that the open-source scientific software grows into a mature ecosystem focused on problem solving.</p>
]]></content:encoded>
			<wfw:commentRss>http://gael-varoquaux.info/blog/?feed=rss2&amp;p=166</wfw:commentRss>
		</item>
		<item>
		<title>Update on scikit-learn: recent developments for machine learning in Python</title>
		<link>http://gael-varoquaux.info/blog/?p=165</link>
		<comments>http://gael-varoquaux.info/blog/?p=165#comments</comments>
		<pubDate>Tue, 08 May 2012 23:12:54 +0000</pubDate>
		<dc:creator>gael</dc:creator>
		
		<category><![CDATA[machine learning]]></category>

		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<category><![CDATA[science]]></category>

		<category><![CDATA[scientific computing]]></category>

		<category><![CDATA[scikit-learn]]></category>

		<guid isPermaLink="false">http://gael-varoquaux.info/blog/?p=165</guid>
		<description><![CDATA[Yesterday, we released version 0.11 of the scikit-learn toolkit for machine learning in Python, and there was much rejoicing.
Major features gained in the last releases
In the last 6 months, there have been many things happening with the scikit-learn. While I do not wish to give an exhaustive summary of features added (it can be found [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday, we released version 0.11 of the <a href="http://scikit-learn.org"><i>scikit-learn</i></a> toolkit for machine learning in Python, and there was much rejoicing.</p>
<h2>Major features gained in the last releases</h2>
<p>In the last 6 months, there have been many things happening with the scikit-learn. While I do not wish to give an exhaustive summary of features added (it can be found <a href="http://scikit-learn.org/stable/whats_new.html">here</a>), let me list a few of the additions that I personally find exciting.</p>
<p><a href="http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_iris.html"> <img src="http://scikit-learn.org/stable/_images/plot_forest_iris_1.png" width=40% align="right"></a></p>
<h3>Non-linear prediction models</h3>
<p>For complex prediction problems where there is no simple model available, as in computer vision, non-linear models are handy. Good examples of such models are those based on decision trees and model averaging. For instance, random forests are used in the Kinect to locate body parts. As they are intrinsically complex, they may need a large amount of training data. For this reason, they have been implemented in the scikit-learn with special attention to computational efficiency.
</p>
<ul>
<li><a href="http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees">Randomized Forests and extra-trees</a></li>
<li><a href="http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting">Gradient boosted regression trees</a></li>
</ul>
<div style="clear: both"></div>
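<p>To make the idea of model averaging concrete, here is a minimal toy sketch in plain Python. It is not the scikit-learn implementation: the function names (<code>fit_stump</code>, <code>fit_bagged_stumps</code>, <code>predict</code>) and the data are made up for illustration, the trees are reduced to one-threshold stumps on 1-D data, and only bootstrap resampling is shown (real random forests also randomize the features considered at each split).</p>

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 tree on 1-D data: predict 1 when x > threshold."""
    def n_errors(t):
        return sum(int(x > t) != y for x, y in zip(xs, ys))
    # Try every observed value as a threshold, keep the one with fewest errors.
    return min(sorted(set(xs)), key=n_errors)

def fit_bagged_stumps(xs, ys, n_models=25, seed=0):
    """Train each stump on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def predict(thresholds, x):
    """Average the stumps: majority vote over the individual predictions."""
    votes = sum(x > t for t in thresholds)
    return int(2 * votes > len(thresholds))

# Toy data: the label is 1 for the upper half of the range.
xs = list(range(20))
ys = [int(x >= 10) for x in xs]
model = fit_bagged_stumps(xs, ys)
```

<p>Each stump sees a slightly different resample and so picks a slightly different threshold; the vote smooths out these individual quirks, which is the intuition behind averaging many randomized trees.</p>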
<h3>Dealing with unlabeled instances</h3>
<p>It is often easier to gather unlabeled observations than labeled observations. While prediction of a quantity of interest is then harder or simply impossible, mining this data can be useful.
</p>
<div style="width:300px; float:left; padding: 5px; border: 1px solid #888"><a href="http://scikit-learn.org/stable/modules/label_propagation.html">Semi-supervised<br />
learning</a>: using unlabeled observations together with labeled ones for better prediction.<br />
<hr/><a href="http://scikit-learn.org/stable/auto_examples/semi_supervised/plot_label_propagation_structure.html"><img src="http://scikit-learn.org/stable/_images/plot_label_propagation_structure_1.png" width=300px/></a>
</div>
<div style="width:10px; float:left;">&nbsp;</div>
<div style="width:300px; float:left; padding: 5px; border: 1px solid #888"><a href="http://scikit-learn.org/stable/modules/outlier_detection.html">Outlier/novelty detection</a>: detect deviant observations.<br />
<hr/><a href="http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html"><img src="http://scikit-learn.org/stable/_images/plot_oneclass_1.png" width=300px /></a></div>
<div style="width:10px; float:left;">&nbsp;</div>
<div style="width:300px; float:left; padding: 5px; border: 1px solid #888"><a href="http://scikit-learn.org/stable/modules/manifold.html">Manifold learning</a>: discover a non-linear low-dimensional structure in the data.<br />
<hr/><a href="http://scikit-learn.org/stable/modules/manifold.html"><img src="http://scikit-learn.org/stable/_images/plot_compare_methods_1.png" width=300px /></a> </div>
<div style="width:10px; float:left;">&nbsp;</div>
<div style="width:300px; float:left; padding: 5px; border: 1px solid #888"> <a href="http://scikit-learn.org/stable/modules/clustering.html">Clustering</a> with <a href="http://scikit-learn.org/stable/modules/clustering.html#mini-batch-k-means">an algorithm</a> that can scale to really large datasets using an online approach: fitting small portions of the data one after the other (Mini-batch k-means).<br />
<hr/><a href="http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html"><img src="http://scikit-learn.org/stable/_images/plot_cluster_comparison_1.png" width=300px /></a>
</div>
<div style="width:10px; float:left;">&nbsp;</div>
<div style="width:300px; float:left; padding: 5px; border: 1px solid #888">
<a href="http://scikit-learn.org/stable/modules/decomposition.html#dictionarylearning">Dictionary learning</a>: learning patterns in the data that represent it sparsely: each observation is a combination of a small number of patterns.<br />
<hr/><a href="http://scikit-learn.org/stable/auto_examples/decomposition/plot_image_denoising.html#example-decomposition-plot-image-denoising-py"><img src="http://scikit-learn.org/stable/_images/plot_image_denoising_1.png" width=300px /></a></div>
<div style="clear: both"></div>
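<p>The online approach behind mini-batch k-means can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the code of the library's <code>MiniBatchKMeans</code> estimator: the helper names and the toy data are hypothetical, and the initial centers are simply drawn at random from the data.</p>

```python
import random

def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])))

def mini_batch_kmeans(points, k, batch_size=20, n_iter=200, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # how many points each center has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)
        # Assign the whole batch first, using the centers as they stand.
        labels = [nearest(p, centers) for p in batch]
        # Then move each center toward its points with a shrinking step size,
        # which makes the center a running mean of the points assigned to it.
        for p, j in zip(batch, labels):
            counts[j] += 1
            step = 1.0 / counts[j]
            centers[j] = [c + step * (a - c) for c, a in zip(centers[j], p)]
    return [tuple(c) for c in centers]

# Toy data: two well-separated blobs around (0, 0) and (10, 10).
data_rng = random.Random(1)
points = [(data_rng.gauss(m, 1.0), data_rng.gauss(m, 1.0))
          for m in (0.0, 10.0) for _ in range(200)]
centers = sorted(mini_batch_kmeans(points, k=2))
```

<p>Only <code>batch_size</code> points are touched per iteration, so the cost of one update does not depend on the size of the full dataset; this is what lets the approach scale.</p>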
<h3>Sparse models: when very few descriptors are relevant</h3>
<p>In general, finding which descriptors are useful when there are many of them is like finding a needle in a haystack: it is a very hard problem. However, if you know that only a few of these descriptors actually carry information, you are in a so-called <i>sparse</i> problem, for which specific approaches can work well.
</p>
<div style="width:300px; float:left; padding: 5px; border: 1px solid #888"><a href="http://scikit-learn.org/stable/modules/linear_model.html#orthogonal-matching-pursuit-omp">Orthogonal matching pursuit</a>: a greedy and fast algorithm for very sparse linear models<br />
<hr/><a href="http://scikit-learn.org/stable/auto_examples/linear_model/plot_omp.html"><img src="http://scikit-learn.org/stable/_images/plot_omp_1.png" width=300px /></a></div>
<div style="width:10px; float:left;">&nbsp;</div>
<div style="width:300px; float:left; padding: 5px; border: 1px solid #888"><a href="http://scikit-learn.org/stable/modules/feature_selection.html#randomized-sparse-models">Randomized sparsity (randomized Lasso)</a>: selecting the relevant descriptors in noisy high-dimensional observations<br />
<hr/> <a href="http://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_recovery.html"><img src="http://scikit-learn.org/stable/_images/plot_sparse_recovery_11.png" width=300px /></a></div>
<div style="width:10px; float:left;">&nbsp;</div>
<div style="width:300px; float:left; padding: 5px; border: 1px solid #888"> <a href="http://scikit-learn.org/stable/modules/generated/sklearn.covariance.GraphLasso.html#sklearn.covariance.GraphLasso">Sparse inverse covariance</a>: learning graphs of connectivity from correlations in the data<br />
<hr/><a href="http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html#example-applications-plot-stock-market-py"><img src="http://scikit-learn.org/stable/_images/plot_stock_market_1.png" width=300px /></a>
</div>
<div style="clear: both"></div>
<h1>Getting developers together: the Granada sprint</h1>
<p><object width="400" height="300" align="right"><param name="flashvars" value="offsite=true&#038;lang=en-us&#038;page_show_url=%2Fsearch%2Fshow%2F%3Fq%3Dscikit-learn%26m%3Dtags%26w%3D66885349%2540N03&#038;page_show_back_url=%2Fsearch%2F%3Fq%3Dscikit-learn%26m%3Dtags%26w%3D66885349%2540N03&#038;method=flickr.photos.search&#038;api_params_str=&#038;api_tags=scikit-learn&#038;api_tag_mode=bool&#038;api_user_id=66885349%40N03&#038;api_safe_search=3&#038;api_content_type=7&#038;api_media=all&#038;api_sort=date-posted-desc&#038;jump_to=&#038;start_index=0"></param><param name="movie" value="http://www.flickr.com/apps/slideshow/show.swf?v=109615"></param><param name="allowFullScreen" value="true"></param><embed type="application/x-shockwave-flash" src="http://www.flickr.com/apps/slideshow/show.swf?v=109615" allowFullScreen="true" flashvars="offsite=true&#038;lang=en-us&#038;page_show_url=%2Fsearch%2Fshow%2F%3Fq%3Dscikit-learn%26m%3Dtags%26w%3D66885349%2540N03&#038;page_show_back_url=%2Fsearch%2F%3Fq%3Dscikit-learn%26m%3Dtags%26w%3D66885349%2540N03&#038;method=flickr.photos.search&#038;api_params_str=&#038;api_tags=scikit-learn&#038;api_tag_mode=bool&#038;api_user_id=66885349%40N03&#038;api_safe_search=3&#038;api_content_type=7&#038;api_media=all&#038;api_sort=date-posted-desc&#038;jump_to=&#038;start_index=0" width="400" height="300"></embed></object></p>
<p>Of course, such developments happen only because we have a great team of <a href="https://github.com/scikit-learn/scikit-learn/graphs/contributors">dedicated coders</a>.</p>
<p>Getting along and working together is a critical part of the project. In December 2011, we held the first international <a href="http://scikit-learn.org">scikit-learn</a> sprint in Granada, on the side of the <a href="http://nips.cc">NIPS conference</a>. That was a while ago, and I haven&#8217;t found time to blog about it, maybe because I was too busy merging in the code produced :). Here is a small report from my point of view. Better late than never. </p>
<h2>Participants from all over the globe</h2>
<p>This sprint was a big deal for us, because for the first time, thanks to sponsor money, we were able to fly in contributors from overseas and meet the team in person. For the first time I was able to put faces to many of the fantastic people that I knew only from the mailing list.</p>
<p>I really think that we must thank our sponsors, <strong>Google</strong> and <strong>tinyclues</strong>, and also the PSF, in particular Jesse Noller and especially <strong>Steve Holden</strong>, whose help was absolutely instrumental in getting sponsor money. This money is what made it possible to unite a good fraction of the team, and it opened the door to great moments of coding, and more.</p>
<h2>Producing code lines and friendship</h2>
<p>An important aspect of the sprint for me was that I really felt the team being united. Granada is a great city and we spent fantastic moments together. Now when I review code, I can often put a face on the author of that code and remember a walk below the Alhambra or an evening in a bar. I am sure it helps reviewing code!
</p>
<h2>Was it worth the money?</h2>
<p><a href="http://gael-varoquaux.info/blog/wp-content/uploads/2012/skl_activity.png"><br />
<img src="http://gael-varoquaux.info/blog/wp-content/uploads/2012/skl_activity.png" width=50% align='right'> </a>I really appreciate that the sponsors did not ask for specific returns on investment beyond acknowledgments, but I think that it is useful for us to ask the question: was it worth the money? After all, we got around $5000, and that&#8217;s a lot of money. First of all, as a side effect of the sprint, people who had invested a huge amount of time in a machine learning toolkit without asking anything in return got help to go to a major machine learning conference.</p>
<p>But was there a return on investment in terms of code? If you look at the number of lines of code modified weekly (figure on the right), there is a big spike in December 2011. That&#8217;s our sprint! Importantly, there is still a lot of activity in the months following the sprint. This is actually unusual, as active development usually happens more during the summer break than during the winter, when our developers are busy working on papers or teaching.</p>
<p>The explanation is simple: we were thrilled by the sprint. Overall, it was incredibly beneficial to the project. I am looking forward to the next ones.</p>
]]></content:encoded>
			<wfw:commentRss>http://gael-varoquaux.info/blog/?feed=rss2&amp;p=165</wfw:commentRss>
		</item>
		<item>
		<title>3 Google summer of code for scikit-learn and more&#8230;</title>
		<link>http://gael-varoquaux.info/blog/?p=164</link>
		<comments>http://gael-varoquaux.info/blog/?p=164#comments</comments>
		<pubDate>Mon, 23 Apr 2012 21:25:58 +0000</pubDate>
		<dc:creator>gael</dc:creator>
		
		<category><![CDATA[computational science]]></category>

		<category><![CDATA[machine learning]]></category>

		<category><![CDATA[mayavi]]></category>

		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<category><![CDATA[science]]></category>

		<category><![CDATA[scikit-learn]]></category>

		<guid isPermaLink="false">http://gael-varoquaux.info/blog/?p=164</guid>
		<description><![CDATA[The scikit-learn got 3 students accepted for the Google summer of code.

Imanuel Bayer will work on making our sparse linear models, for regression and classification, faster. His proposal Optimizing sparse linear models using coordinate descent and strong rules.
David Marek will implement multi-layer perceptrons for the scikit. His proposal: Multilayer Perceptron
Vlad Niculae will work on speeding [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://scikit-learn.org">scikit-learn</a> got 3 students accepted for the Google summer of code.</p>
<ul>
<li><a href="http://ibayer.blogspot.fr/">Imanuel Bayer</a> will work on making our sparse linear models, for regression and classification, faster. His proposal: <a href="http://www.google-melange.com/gsoc/project/google/gsoc2012/ibayer/11001">Optimizing sparse linear models using coordinate descent and strong rules</a>.</li>
<li><a href="http://www.davidmarek.cz/">David Marek</a> will implement multi-layer perceptrons for the scikit. His proposal: <a href="http://www.google-melange.com/gsoc/project/google/gsoc2012/h4wk_cz/24001">Multilayer Perceptron</a></li>
<li><a href="http://blog.vene.ro/">Vlad Niculae</a> will work on speeding up the library in general, catching all the low hanging fruits, and the ones a bit higher. His proposal: <a href="http://www.google-melange.com/gsoc/project/google/gsoc2012/vladn/26002">Need for scikit-learn speed</a></li>
</ul>
<p>
In addition, other related projects have exciting proposals, for instance <a href="http://statsmodels.sourceforge.net/"><strong>statsmodels</strong></a>:</p>
<ul>
<li>Divyanshu Bandil: <a href="http://www.google-melange.com/gsoc/project/google/gsoc2012/divyanshu/34002">Extension of Linear to Non Linear Models in Statsmodels Python module</a></li>
<li>Alexandre Crayssac: <a href="http://www.google-melange.com/gsoc/project/google/gsoc2012/alexandreyc/8001">estimating system of equations</a></li>
<li>Justin Grana: <a href="http://www.google-melange.com/gsoc/project/google/gsoc2012/j_grana/8001">empirical Likelihood in Statsmodels</a></li>
<li>Georgi Panterov: <a href="http://www.google-melange.com/gsoc/project/google/gsoc2012/gpanterov/7001">nonparametric estimation</a></li>
</ul>
<p> and <a href="http://www.cython.org">Cython</a>:</p>
<ul>
<li>Philip Herron: <a href="http://www.google-melange.com/gsoc/project/google/gsoc2012/redbrain1123/28002">pxd generation using gcc-python-plugin</a></li>
<li>Mark Florisson: <a href="http://www.google-melange.com/gsoc/project/google/gsoc2012/markflorisson88/30002">Fast Numerical Computing with Cython</a></li>
</ul>
<p>Finally, in <a href="http://pandas.pydata.org/">Pandas</a>:</p>
<ul>
<li>Vytautas Jancauskas: <a href="http://www.google-melange.com/gsoc/project/google/gsoc2012/bucket_brigade/42002">Plots in pandas</a>
</li>
</ul>
<p>Congratulations to all of the students. This is going to be an exciting summer.</p>
]]></content:encoded>
			<wfw:commentRss>http://gael-varoquaux.info/blog/?feed=rss2&amp;p=164</wfw:commentRss>
		</item>
		<item>
		<title>The problems of low statistical power and publication bias</title>
		<link>http://gael-varoquaux.info/blog/?p=163</link>
		<comments>http://gael-varoquaux.info/blog/?p=163#comments</comments>
		<pubDate>Sat, 14 Apr 2012 15:16:33 +0000</pubDate>
		<dc:creator>gael</dc:creator>
		
		<category><![CDATA[computational science]]></category>

		<category><![CDATA[python]]></category>

		<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://gael-varoquaux.info/blog/?p=163</guid>
		<description><![CDATA[
 Lately, I have been in a mood of scientific scepticism: I have the feeling that the worldwide academic system is more and more failing to produce useful research. Christophe Lalanne&#8217;s twitter feed led me to an interesting article in a non-mainstream journal: A farewell to Bonferroni: the problems of low statistical power and publication bias, [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://idoubtit.files.wordpress.com/2010/12/coldfusion.jpg" align="right" width="20%" target="http://idoubtit.wordpress.com/2010/12/16/direct-to-the-public-science/"></p>
<p>Lately, I have been in a mood of scientific scepticism: I have the feeling that the worldwide academic system is increasingly failing to produce useful research. Christophe Lalanne&#8217;s <a href="https://twitter.com/#!/chlalanne">twitter feed</a> led me to an interesting article in a non-mainstream journal: <a href="http://beheco.oxfordjournals.org/content/15/6/1044.short"><strong>A farewell to Bonferroni: the problems of low statistical power and publication bias</strong></a>, by Shinichi Nakagawa.</p>
<p>Each study performed has a probability of being wrong. Thus performing many studies will lead to some wrong conclusions by chance. This is known in statistics as the <a href="http://en.wikipedia.org/wiki/Multiple_comparisons">multiple comparisons</a> problem. When a working hypothesis is not verified empirically in a study, this null finding is seldom reported, leading to what is called <i>publication bias</i>: <strong>discoveries are further studied; negative results are usually ignored</strong> (Y. Benjamini). Because only <i>discoveries</i>, called <i>detections</i> in statistical terms, are reported, <strong>published results contain more false detections than the individual experiments and very few false negatives</strong>. Arguably, the original investigators correct for this using the understanding that they gained from the experiments performed, and account in a <i>post-hoc analysis</i> for the fact that some of their working hypotheses may not have been correct. Such a correction can work only in a field where there is a good mechanistic understanding, or good models, such as physics, but in my opinion not in the life and social sciences.</p>
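<p>To put a number on the multiple comparisons problem, here is a minimal sketch. It assumes that the tests are independent and that every null hypothesis is true, which is an idealization; the function name is mine, for illustration only.</p>

```python
def prob_at_least_one_false_detection(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests of a
    true null hypothesis comes out significant at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 10, 50):
    p = prob_at_least_one_false_detection(n_tests)
    print(f"{n_tests:3d} tests: P(at least one false detection) = {p:.2f}")
```

<p>Under these assumptions, measuring 10 unrelated variables at the usual 5% threshold already gives roughly a 40% chance of at least one spurious detection.</p>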
<p>Let me quote some relevant extracts of <a href="http://beheco.oxfordjournals.org/content/15/6/1044.short">the article</a>, as you may never have access to it thanks to the way scientific publishing works:</p>
<blockquote><p> Recently, Jennions and Moller (2003) carried out a meta-analysis on statistical power in the field of behavioral ecology and animal behavior, reviewing 10 leading journals including Behavioral Ecology. Their results showed dismayingly low average statistical power (note that a meta-analytic review of statistical power is different from post hoc power analysis as criticized in Hoenig and Heisey, 2001). The statistical power of a null hypothesis (Ho) significance test is the probability that the test will reject Ho when a research hypothesis (Ha) is true.<br />&#8230;<br />
The meta-analysis on statistical power by Jennions and Moller (2003) revealed that, in the field of behavioral ecology and animal behavior, statistical power of less than 20% to detect a small effect and power of less than 50% to detect a medium effect existed. This means, for example, that the average behavioral scientist performing a statistical test has a greater probability of making a Type II error (or beta) (<i>i.e.</i>, not rejecting Ho when Ho is false; note that statistical power is equals to 1 - beta) than if they had flipped a coin, when an experiment effect is of medium size.<br />&#8230;<br />
Imagine that we conduct a study where we measure as many relevant variables as possible, 10 variables, for example. We find only two variables statistically significant. Then, what should we do? We could decide to write a paper highlighting these two variables (and not reporting the other eight at all) as if we had hypotheses about the two significant variables in the first place. Subsequently, our paper would be published. Alternatively, we could write a paper including all 10 variables. When the paper is reviewed, referees might tell us that there were no significant results if we had &#8220;appropriately&#8221; employed Bonferroni corrections, so that our study would not be advisable for publication. However, the latter paper is scientifically more important than the former paper. For example, if one wants to conduct a meta-analysis to investigate an overall effect in a specific area of study, the latter paper is five times more informative than the former paper. In the long term, statistical significance of particular tests may be of trivial importance (if not always), although, in the short term, it makes papers publishable. Bonferroni procedures may, in part, be preventing the accumulation of knowledge in the field of behavioral ecology and animal behavior, thus hindering the progress of the field as science.
</p></blockquote>
<p><img src="http://farm6.staticflickr.com/5206/5330056727_a98c97c3c5.jpg" align="right" width="30%"></p>
<p>Some of the concerns raised here are partly a criticism of Bonferroni corrections, <i>i.e.</i> in technical terms, correcting for the <a href="http://en.wikipedia.org/wiki/Familywise_error_rate">family-wise error rate (FWER)</a>. This is actually the message that the author wants to convey in his paper. Proponents of controlling the <a href="http://en.wikipedia.org/wiki/False_discovery_rate">false discovery rate (FDR)</a> argue that an investigator shouldn&#8217;t be penalized for asking more questions, and that the fraction of errors in the answers should be controlled, rather than their absolute number. That said, FDR, while useful, does not solve the problem of publication bias.</p>
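<p>The difference between the two corrections can be sketched in a few lines of pure Python. The p-values below are made up for illustration, and the FDR procedure shown is the standard Benjamini-Hochberg step-up rule, which is one common way of controlling the FDR, not something taken from the article.</p>

```python
def bonferroni(p_values, alpha=0.05):
    """FWER control: reject only p-values below alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """FDR control: find the largest rank k such that the k-th smallest
    p-value is below k * alpha / m, and reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_rejected = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            n_rejected = rank
    rejected = [False] * m
    for i in order[:n_rejected]:
        rejected[i] = True
    return rejected


# Hypothetical p-values for 10 measured variables
p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.6, 0.8, 0.9]
print("Bonferroni detections:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg detections:", sum(benjamini_hochberg(p_values)))
```

<p>On this toy example, Bonferroni keeps a single detection while Benjamini-Hochberg keeps three: asking ten questions is penalized much less severely.</p>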
]]></content:encoded>
			<wfw:commentRss>http://gael-varoquaux.info/blog/?feed=rss2&amp;p=163</wfw:commentRss>
		</item>
		<item>
		<title>Want features? Just code</title>
		<link>http://gael-varoquaux.info/blog/?p=162</link>
		<comments>http://gael-varoquaux.info/blog/?p=162#comments</comments>
		<pubDate>Thu, 08 Mar 2012 21:46:52 +0000</pubDate>
		<dc:creator>gael</dc:creator>
		
		<category><![CDATA[personnal]]></category>

		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<category><![CDATA[scientific computing]]></category>

		<category><![CDATA[scikit-learn]]></category>

		<guid isPermaLink="false">http://gael-varoquaux.info/blog/?p=162</guid>
		<description><![CDATA[Somebody just sent an email on a user&#8217;s mailing list for an open-source scientific package entitled &#8220;Feature foo: why is package bar not up to the task?&#8221; (names hidden to avoid pointing directly to the responsible of my wrath). To quote him:
Is there ANY plan for having such a module in package bar?? I think (personally) that [...]]]></description>
			<content:encoded><![CDATA[<p>Somebody just sent an email on a user&#8217;s mailing list for an open-source scientific package entitled <strong>&#8220;<em>Feature foo</em>: why is <em>package bar</em> not up to the task?&#8221;</strong> (names hidden to avoid pointing directly at the person responsible for my wrath). To quote him:</p>
<blockquote><p>Is there ANY plan for having such a module in <em>package bar</em>?? I think (personally) that this is a MUST DO. This is typically the type of routines that I hear people use in e.g., idl etc. If this could be an optimised, fast (and easy to use) routine, all the better.</p></blockquote>
<p>As someone who spends a fair amount of time working on open source software, I hear such remarks quite often. I am finding it harder and harder not to react negatively to these emails. Now, I cannot consider myself a contributor to <em>package bar</em>, and thus I can claim that I am not taking this comment personally.</p>
<p>Why aren&#8217;t packages up to the task? Well, the answer is quite simple: because they are developed by volunteers who work on them in their spare time, too often late at night, or by companies that put some of their profits into open source rather than into locking down a market. 90% of the time, the reason the feature isn&#8217;t as good as you would want is lack of time.</p>
<p>I personally find that suggesting that somebody else should put more of the time and money they are already giving away in improving a feature that you need is almost insulting.</p>
<p>I am aware that people do not realize how small the group of people that develop and maintain their toys is. Borrowing the figure below from <a href="http://www.euroscipy.org/file/6459?vid=download">Fernando Perez&#8217;s talk at Euroscipy</a>, the number of people that do 90% of the grunt work to get the core scientific Python ecosystem going is around two handfuls:</p>
<p><img style="vertical-align: middle;" src="http://gael-varoquaux.info/blog/wp-content/uploads/2012/fperez_euroscipy_2011_contributors.jpg" alt="Commits per contributor in various scientific Python packages, from Fernando Perez" /></p>
<p>I&#8217;d like to think that this recruitment problem comes from a lack of skills: users that have the ability to contribute are just too rare. This is not entirely true: there are scores of skilled people on the mailing lists. The poster himself mentioned in his email that he was developing a package. I personally started contributing not knowing anything about software development. I struggled, and I did the grunt work, like maintaining wikis, answering questions on mailing lists, and writing documentation. These easier tasks were useful to the community, I think, but most importantly, they taught me a lot because I was investing energy in them.</p>
<div>
<div><strong>If people want things to improve, they will have more success sending in pull requests than messages on mailing lists that sound condescending to my ears.</strong></div>
<div>I hope that I haven&#8217;t overreacted too badly :), that email set me off. That said, I am not sure that people realize how much they owe to the open source developers breaking their backs on the packages they use.</div>
<div><img style="vertical-align: middle;" src="http://gael-varoquaux.info/blog/wp-content/uploads/2012/fperez_euroscipy_2011_i_want_you.jpg" alt="" width="334" height="444" /></div>
<div>All credit for images goes to <a href="http://fperez.org/">Fernando Perez</a></div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://gael-varoquaux.info/blog/?feed=rss2&amp;p=162</wfw:commentRss>
		</item>
		<item>
		<title>Book review: NumPy 1.5 Beginner&#8217;s guide</title>
		<link>http://gael-varoquaux.info/blog/?p=161</link>
		<comments>http://gael-varoquaux.info/blog/?p=161#comments</comments>
		<pubDate>Tue, 10 Jan 2012 07:57:21 +0000</pubDate>
		<dc:creator>gael</dc:creator>
		
		<category><![CDATA[computational science]]></category>

		<category><![CDATA[python]]></category>

		<category><![CDATA[scientific computing]]></category>

		<guid isPermaLink="false">http://gael-varoquaux.info/blog/?p=161</guid>
		<description><![CDATA[Packt publishing sent me a copy of NumPy 1.5 Beginner&#8217;s guide by Ivan Idris.


The book actually covers more than only numpy: it is a full introduction to numerical computing with Python. The table of contents is the following:

NumPy Quick Start
Beginning with NumPy Fundamentals
Get into Terms with Commonly Used Functions
Convenience Functions for Your Convenience
Working with Matrices [...]]]></description>
			<content:encoded><![CDATA[<p>Packt publishing sent me a copy of <a href="http://www.packtpub.com/numpy-1-5-using-real-world-examples-beginners-guide/Book">NumPy 1.5 Beginner&#8217;s guide</a> by Ivan Idris.
</p>
<p><iframe align="right" src="http://rcm.amazon.com/e/cm?t=gaelvaro-20&#038;o=1&#038;p=8&#038;l=as1&#038;asins=1849515301&#038;ref=qf_sp_asin_til&#038;fc1=000000&#038;IS2=1&#038;lt1=_blank&#038;m=amazon&#038;lc1=0000FF&#038;bc1=000000&#038;bg1=FFFFFF&#038;f=ifr" style="width:120px;height:240px;" scrolling="no" marginwidth="0" marginheight="0" frameborder="0"></iframe></p>
<p>The book actually covers more than only <a href="http://numpy.scipy.org/">numpy</a>: it is a full introduction to numerical computing with Python. The <a href="http://www.packtpub.com/toc/numpy-15-beginners-guide-table-contents">table of contents</a> is the following:</p>
<ul>
<li>NumPy Quick Start</li>
<li>Beginning with NumPy Fundamentals</li>
<li>Get into Terms with Commonly Used Functions</li>
<li>Convenience Functions for Your Convenience</li>
<li>Working with Matrices and ufuncs</li>
<li>Move Further with NumPy Modules</li>
<li>Peeking Into Special Routines</li>
<li>Assure Quality with Testing</li>
<li>Plotting with Matplotlib</li>
<li>When NumPy is Not Enough: SciPy and Beyond</li>
</ul>
<p>The book is easy to read, as it requires no specific expertise other than knowing basic Python programming. It is full of examples and exercises, which is really great for learning. I find the style of the author, Ivan Idris, particularly amusing and relaxing, engaging the reader with questions, challenges, or even jokes (<i>&#8220;Have a go hero&#8221;</i>).</p>
<p>With regards to the formatting and the print, the book is written in large fonts, with sectioning information, tips and exercises clearly standing out.</p>
<p>It is full of practical information, such as how to install the software, or where to get help. Finally, one thing that I appreciated is that the examples are typed in <a href="http://ipython.org/">IPython</a>. Each time I teach, I like to use IPython, because it is full of features that help with plotting, debugging and profiling numerical code. The book even has a little introduction to some useful IPython features.</p>
<p>After an introduction to the work flow, the book explores array manipulation such as creation or reshaping, followed by some simple numerics and the battery of array-based operations on functions and polynomials. Then it presents linear algebra and signal processing basics (FFT). It also covers the financial functions that are present in numpy and mentions testing, which is very important to achieve quality code. The book finishes with matplotlib and scipy, two modules that are important to know to go further.</p>
<p>The examples are mostly drawn from statistics or financial applications, such as computing running averages on stock quotes. Basic math explanations, such as the definition of the Moore-Penrose pseudo-inverse, are given when needed.</p>
<p>To conclude, I enjoyed this book and I think that it is a nice addition to my library. It delivers exactly what its title promises: it is well-suited for beginners wanting to learn numpy. On the other hand, I would not recommend it as reference material, or as a book to learn more general scientific or numerical computing with Python.</p>
]]></content:encoded>
			<wfw:commentRss>http://gael-varoquaux.info/blog/?feed=rss2&amp;p=161</wfw:commentRss>
		</item>
		<item>
		<title>Joblib beta release: fast compressed persistence + Python 3</title>
		<link>http://gael-varoquaux.info/blog/?p=159</link>
		<comments>http://gael-varoquaux.info/blog/?p=159#comments</comments>
		<pubDate>Sat, 07 Jan 2012 18:27:04 +0000</pubDate>
		<dc:creator>gael</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<category><![CDATA[scientific computing]]></category>

		<category><![CDATA[scikit-learn]]></category>

		<guid isPermaLink="false">http://gael-varoquaux.info/blog/?p=159</guid>
		<description><![CDATA[Joblib 0.6: better I/O and Python 3 support
Happy new year, every one. I have just released Joblib 0.6.0 beta. The highlights of the 0.6 release are a reworked enhanced pickler, and Python 3 support.
Many thanks go to the contributors to the 0.5.X series (Fabian Pedregosa, Yaroslav Halchenko, Kenneth C. Arnold, Alexandre Gramfort, Lars Buitinck, Bala [...]]]></description>
			<content:encoded><![CDATA[<h1>Joblib 0.6: better I/O and Python 3 support</h1>
<p>Happy new year, everyone. I have just released <a href="">Joblib</a> 0.6.0 beta. The highlights of the 0.6 release are a reworked, enhanced pickler and Python 3 support.</p>
<p>Many thanks go to the contributors to the 0.5.X series (Fabian Pedregosa, Yaroslav Halchenko, Kenneth C. Arnold, Alexandre Gramfort, Lars Buitinck, Bala Subrahmanyam Varanasi, Olivier Grisel, Ralf Gommers, Juan Manuel Caicedo Carvajal, and myself). In particular Fabian made sure that Joblib worked under Python 3. </p>
<p>In this blog post, I&#8217;d like to discuss a bit more the compressed persistence engine, as it illustrates well key factors in implementing and using compressed serialization. </p>
<h1>Fast compressed persistence</h1>
<p>One of the key components of joblib is its ability to persist arbitrary Python objects, and read them back very quickly. It is particularly efficient for <strong>containers that do their heavy lifting with numpy arrays</strong>. The trick to achieving great speed has been to save the numpy arrays in separate files, and load them via <strong>memmapping</strong>.</p>
<p>However, one drawback of joblib is that the caching mechanism may end up using a lot of disk space. As a result, there is strong interest in having <strong>compressed storage</strong>, provided it doesn&#8217;t slow down the library too much. Another use case that I have in mind for fast compressed persistence is implementing <a href="http://en.wikipedia.org/wiki/Out-of-core_algorithm">out of core computation</a>.</p>
<p>There are some great compressed I/O libraries for Python, for instance <a href='http://pytables.github.com/index.html'>Pytables</a>. You may wonder why I needed to code yet another one. The answer is that joblib is <strong>pure Python, depending only on the standard library</strong> (numpy is optional), but also that the goal here is <strong>black-box persistence of arbitrary objects</strong>.</p>
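<p>To make the idea of black-box compressed persistence concrete, here is a minimal sketch using only the standard library. It is not joblib&#8217;s implementation (which handles numpy arrays separately and avoids large temporaries): the helper names <i>dump_compressed</i> and <i>load_compressed</i> are hypothetical, and the sketch simply pickles the whole object in memory and compresses the result with zlib.</p>

```python
import os
import pickle
import tempfile
import zlib


def dump_compressed(obj, filename, level=1):
    """Pickle an arbitrary object and write it zlib-compressed to disk."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    with open(filename, "wb") as f:
        f.write(zlib.compress(payload, level))


def load_compressed(filename):
    """Read back an object written by dump_compressed."""
    with open(filename, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))


# Any picklable container works; this one is made up for the example
data = {"name": "atlas", "values": list(range(1000)), "labels": ("a", "b")}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl.z")
    dump_compressed(data, path)
    restored = load_compressed(path)
print(restored == data)
```

<p>This naive version holds the full pickle and its compressed copy in memory at the same time, which is exactly the kind of cost that matters with big data.</p>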
<h2>Comparing I/O speed and compression to other libraries</h2>
<p>Implementing efficient compressed storage was a bit of a struggle and I learned a lot. Rather than going into the details straight away, let me first discuss a few benchmarks of the resulting code. Benchmarking such a feature is very hard: first because you are fighting with the disk cache, second because the performance depends very much on the data at hand (some data compress better than others), and last because there are three interesting metrics: disk space used, write speed, and read speed.</p>
<p><strong>Dataset used</strong> - I chose to compare the different strategies on some datasets that I work with, namely the probabilistic brain atlases MNI 1mm (62Mb uncompressed) and Juelich 2mm (105Mb uncompressed). Whether the data is represented as a Fortran-ordered array, or a C-ordered array is important for the I/O performance. This data is normally stored to disk compressed using the domain-specific Nifti format (<i>.nii</i> files), accessed in Python with  the <a href="http://nipy.sourceforge.net/nibabel/">Nibabel</a> library.
</p>
<p><strong>Libraries used</strong> - I benched different compression strategies in joblib against Nibabel&#8217;s Nifti I/O, compressed or not, and against using Pytables to store the data buffer (without the metadata). Pytables exposes a variety of compression strategies, with different speed compromises. In addition, I benched numpy&#8217;s builtin <i>savez_compressed</i>.</p>
<p>I would like to stress that I am comparing a general purpose persistence engine (joblib) to specific I/O libraries either optimized for the data (Nifti), or requiring some massaging to enable persistence (pytables).</p>
<p><center><img src="http://gael-varoquaux.info/blog/wp-content/uploads/2012/joblib_rel_0.6_speed/disk.png" width=70%><br />
<br/></p>
<p><img src="http://gael-varoquaux.info/blog/wp-content/uploads/2012/joblib_rel_0.6_speed/write.png" width=70%><br />
<br/></p>
<p><img src="http://gael-varoquaux.info/blog/wp-content/uploads/2012/joblib_rel_0.6_speed/read.png" width=70%></p>
<p><br/></p>
<p><i>Comparing to other libraries</i></center></p>
<p>Actual numbers can be found <a href="http://gael-varoquaux.info/blog/wp-content/uploads/2012/joblib_rel_0.6_speed/results_nii.csv">here</a>.</p>
<p><strong>Take home messages</strong> - The graphs are not crystal-clear, but a few trends appear:</p>
<ul>
<li>Pytables with LZO or blosc compression is the king of the hill for read and write speed.</li>
<li>I/O of compressed data is often faster than with uncompressed data for a good compression algorithm.</li>
<li>Joblib with Zlib compression level 1 performs honorably in terms of speed with only the Python standard library and no compiled code.</li>
<li>Read time of memmapping (with nibabel or joblib) is negligible (it is tiny on the graphs); however, the loading time appears when you start accessing the data.</li>
<li>Passing in arrays with a memory layout (Fortran versus C order) that the I/O library doesn&#8217;t expect can really slow down writing. </li>
<li>Compressing with Zlib compression-level 1 gets you most of the disk space gains for a reasonable cost in write/read speed.</li>
<li>Compressing with Zlib compression-level 9 (not shown on the figures) doesn&#8217;t buy you much in disk space, but costs a lot in writing time.</li>
</ul>
<h2>Benching datasets richer than pure arrays</h2>
<p>The datasets used so far are pretty much composed of one big array, a 4D smooth spatial map. I wanted to test on more datasets, to see how performance varied with data type and richness. For this, I used the datasets of <a href="http://scikit-learn.org">scikit-learn</a>, real-life data of various natures, described <a href="http://scikit-learn.org/stable/datasets/index.html">here</a>:</p>
<ul>
<li><strong>20 news</strong> - 20 usenet news group: this data mainly consists of text, and not numpy arrays.</li>
<li><strong>LFW people</strong> - Labeled faces in the wild, many pictures of different people&#8217;s faces.</li>
<li><strong>LFW pairs</strong> - Labeled faces in the wild, pairs of pictures for each individual. This is a high entropy dataset, it does not have much redundant information.</li>
<li><strong>Olivetti</strong> - Olivetti dataset: centered pictures of faces.</li>
<li><strong>Juelich(F)</strong> - Our previous Juelich atlas</li>
<li><strong>Big people</strong> - The LFW people dataset, but repeated 4 times, to put a strain on memory resources.</li>
<li><strong>MNI(F)</strong> - Our previous MNI atlas</li>
<li><strong>Species</strong> - Occurrence of species measured in Latin America, with a lot of missing data.</li>
</ul>
<p><img src="http://gael-varoquaux.info/blog/wp-content/uploads/2012/joblib_rel_0.6_speed/joblib_disk.png" width=32%> <img src="http://gael-varoquaux.info/blog/wp-content/uploads/2012/joblib_rel_0.6_speed/joblib_write.png" width=32%> <img src="http://gael-varoquaux.info/blog/wp-content/uploads/2012/joblib_rel_0.6_speed/joblib_read.png" width=32%></p>
<p><center><i>Testing compression strategies on various datasets</i></center></p>
<p>Actual numbers can be found <a href="http://gael-varoquaux.info/blog/wp-content/uploads/2012/joblib_rel_0.6_speed/joblib_results.csv">here</a>.</p>
<p><strong>What this tells us</strong> - The main message from these benchmarks is that datasets with redundant information, i.e. that compress well, give fast I/O. This is not surprising. In particular, good compression can give good I/O on text (20 news). Another result, more of a sanity check, is that compressed I/O on big data (Big people) works as well as on smaller data. Earlier code would start to swap. Finally, I conclude from these graphs that compression levels from 1 to 3 buy you most of the gains for reasonable costs, and that going up to 9 is not recommended, unless you know that your data can be compressed a lot (Species).</p>
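<p>The effect of redundancy and of the compression level is easy to reproduce on toy data. The sketch below is not the benchmark code: it uses two synthetic buffers that I made up, one highly redundant and one of random bytes, and only measures the compressed size with zlib, not the timings.</p>

```python
import os
import zlib

n_bytes = 1_000_000
# A very redundant buffer: the same 13-byte pattern repeated
redundant = b"scikit-learn " * (n_bytes // 13)
# A high-entropy buffer: random bytes barely compress at all
high_entropy = os.urandom(n_bytes)


def compressed_fraction(data, level):
    """Size of the zlib-compressed data relative to the original size."""
    return len(zlib.compress(data, level)) / len(data)


for name, data in (("redundant", redundant), ("high entropy", high_entropy)):
    for level in (1, 3, 9):
        fraction = compressed_fraction(data, level)
        print(f"{name:12s} level {level}: {100 * fraction:6.2f}% of original size")
```

<p>Real datasets sit between these two extremes, which is why the disk space gained by going from level 1 to level 9 depends so much on the data.</p>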
<h2>Lessons learned</h2>
<p>I&#8217;ll keep this paragraph short, because the information is really in <a href="https://github.com/joblib/joblib/blob/0.5.X/joblib/numpy_pickle.py">joblib&#8217;s code and comments</a>. Don&#8217;t hesitate to have a look, it&#8217;s BSD-licensed, so you are free to borrow what you please.</p>
<ol>
<li>Memory copies, of arrays, but also of strings and byte streams can really slow you down with big data.</li>
<li>To avoid copies with numpy arrays, fully embrace numpy&#8217;s strided memory model. For instance, you do not need to save arrays in C order, if they are given to you in a different order. Accessing the memory in the wrong striding direction explains the poor write performance of pytables on Fortran-ordered Juelich.</li>
<li>When dealing with the file system, the OS makes so much magic (e.g. prefetching) that clever hacks tend not to work: always benchmark.</li>
<li>Depending on the size of the data, it may be more efficient to store subsets in different files: it introduces &#8216;chunks&#8217; that avoid filling up memory too much (parameter <i>cache_size</i> in joblib&#8217;s code). In addition, data of the same nature tends to compress better.</li>
<li>The I/O stream or file object interfaces are abstractions that can hide the data movement and the creation of large temporaries. After experiments with GZipFile and StringIO/BytesIO I found it more efficient to fall back to passing around big buffer objects, numpy arrays, or strings.</li>
<li>For reasons 4 and 5, I ended up avoiding the gzip module: raw access to zlib with buffers gives more control. This explains a good part of the differences in read speed for pure arrays with numpy&#8217;s <i>savez_compressed</i>.</li>
</ol>
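<p>As an illustration of lessons 1, 4 and 6, here is a minimal sketch of chunked compression with raw zlib access. It is not the code used in joblib: the function name and the chunk size are assumptions of mine. It slices a <i>memoryview</i> so that no chunk of the input is copied before being handed to the compressor.</p>

```python
import zlib


def compress_in_chunks(buffer, chunk_size=2 ** 20, level=1):
    """Yield compressed pieces of a bytes-like buffer, chunk by chunk.

    Slicing a memoryview does not copy the underlying data, so the only
    temporaries created are the (small) compressed pieces.
    """
    view = memoryview(buffer)
    compressor = zlib.compressobj(level)
    for start in range(0, len(view), chunk_size):
        piece = compressor.compress(view[start:start + chunk_size])
        if piece:
            yield piece
    # Emit whatever the compressor is still holding back
    yield compressor.flush()


raw = bytes(range(256)) * 4096  # 1 MiB of toy data
# Joined here only to check the round trip; real code would write each
# piece to a file as it is produced
compressed = b"".join(compress_in_chunks(raw, chunk_size=64 * 1024))
print(len(raw), "bytes compressed to", len(compressed), "bytes")
print(zlib.decompress(compressed) == raw)
```

<p>The pieces form a single valid zlib stream, so the reader does not need to know the chunk size that was used for writing.</p>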
<p>One of my conclusions for joblib, is that I&#8217;ll probably use Pytables as an optional backend for persistence in a future release.</p>
<h2>Details on the benchmarks</h2>
<p>These benchmarks were run on a Dell Latitude D630 laptop. That&#8217;s a dual-core Intel Core2 Duo box, with 2M of CPU cache.</p>
<p>The code for the benchmarks can be found in <a href="https://gist.github.com/1551250">a gist</a>.</p>
<h2>Thanks</h2>
<p>I&#8217;d like to thank Francesc Alted for the very useful feedback he gave on these topics. In particular, the <a href="http://sourceforge.net/mailarchive/message.php?msg_id=28609087">following thread</a> on the pytables mailing-list may be of interest to the reader.</p>
]]></content:encoded>
			<wfw:commentRss>http://gael-varoquaux.info/blog/?feed=rss2&amp;p=159</wfw:commentRss>
		</item>
		<item>
		<title>Scikit-learn NIPS 2011 sprint: international thanks to our sponsors</title>
		<link>http://gael-varoquaux.info/blog/?p=158</link>
		<comments>http://gael-varoquaux.info/blog/?p=158#comments</comments>
		<pubDate>Fri, 18 Nov 2011 13:47:59 +0000</pubDate>
		<dc:creator>gael</dc:creator>
		
		<category><![CDATA[mayavi]]></category>

		<guid isPermaLink="false">http://gael-varoquaux.info/blog/?p=158</guid>
		<description><![CDATA[The NIPS conference: time for a sprint. The NIPS conference, one of the major conferences in machine learning, is hosted in Granada this year. I believe that it is the first time that it is hosted in Europe. As many of the scikit-learn developers are part of the wider NIPS community, but also many live [...]]]></description>
			<content:encoded><![CDATA[<p><strong>The NIPS conference: time for a sprint.</strong> The <a href="http://nips.cc/">NIPS conference</a>, one of the major conferences in machine learning, is hosted in Granada this year. I believe that it is the first time that it is hosted in Europe. As many of the <a href="http://scikit-learn.org">scikit-learn</a> developers are part of the wider NIPS community, and many also live in Europe, we jumped at the chance to organize a truly international sprint: the <a href="http://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events">NIPS 2011 scikit-learn sprint</a>. </p>
<p><strong>Finding money.</strong> As often with open source development, a lot of our contributors are young people, investing their free time outside of any request from their hierarchy. In such a situation, it can be hard to find travel money. So we started looking for sponsors. We needed to find a decent sum of money, as we were flying people in from places such as the West coast of the US, or even Japan. The good news is that we found money, and between supervisors pitching in, universities giving travel grants, and our generous sponsors, there will be an impressive list of contributors from all over the world at the sprint. </p>
<p><strong>Thanks to our sponsors.</strong> The first people that we need to thank are Google, who gave us a sizable sponsorship, and the <a href="http://www.python.org/psf/">PSF</a>, who made Google&#8217;s sponsorship possible through their accounting and sprints programs. We also need to thank our other sponsors, namely <a href="http://www.tinyclues.com/">Tinyclues</a>. Thanks to these sponsors, and additional investment from many universities and research groups, we have been able to gather a total of 12 contributors in Granada, a handful coming from overseas. Also, we are indebted to the <a href="http://www.ugr.es/">University of Granada</a>, and the GNU/Linux Granada Group (GGG), who are providing hosting for the sprint, as well as Régine Bricquet, from INRIA, who did a lot of the trip planning for the sponsored people. </p>
<p>I am very much looking forward to the sprint. It will be the first time that I meet many of the contributors in real life, and judging by the warmth of the on-line exchanges, it will be a great moment. Besides, Granada is known to be a lively and historic city. </p>
<p>If you are around and want to join us, to work on Python in machine learning, send us a mail on the <a href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general">mailing list</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gael-varoquaux.info/blog/?feed=rss2&amp;p=158</wfw:commentRss>
		</item>
	</channel>
</rss>
