At MLOSS 15 we brainstormed about reproducible science, discussing why we care about software in computer science. Here is a summary blending notes from the discussions with my own opinions.
“Without engineering, science is not more than philosophy” — the community
How do we enable better Science? Why do we write software in science? These are the questions that we were interested in.
Forms of reproducible science: reproduction, replication, & reuse
The classic concepts of reproducible science are:
- Reproducibility: being able to rerun an experiment as it was run, for instance by reanalysing data.
- Replicability: being able to redo an experiment from scratch.
The reproducible science movement argues that sharing the source code of experiments is a prerequisite for reproduction.
For reproduction, fields like computer science (development of methods) and biology (challenging data acquisition) have very different constraints, with the complexity allocated differently between data and code.
“Machine learning people use hugely complex algorithms on trivially simple datasets. Biology does trivially simple algorithms on hugely complex datasets.” — an MLOSS15 attendee
We felt that computer science needed an additional notion, complementing replication and reproduction:
- Reusability: applying the process to a new yet similar question. For instance, for a paper contributing a data-analysis method, applying that method to new data.
Reproducibility without reusability in method development may hinder the advancement of science, as it pushes people to keep doing the same things, e.g. always running experiments on the same data.
Reusability enables results that the original investigator did not have in mind. It implies that the experimental protocol extends beyond the exact scope of the question initially asked. It is also harder from a software point of view, as it demands more robustness and flexibility.
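To make the distinction concrete, here is a minimal Python sketch of my own (the method and names are hypothetical, not from the workshop): a method written as a function parameterized by its input data, rather than as a script tied to one dataset, can be rerun on the original data for reproduction and applied unchanged to new data for reuse.

```python
# Minimal sketch: a reusable method, written as a function of its data
# rather than as a one-off script with a hard-coded dataset.
import numpy as np

def detrend(signal, order=1):
    """Remove a polynomial trend of the given order from a 1D signal."""
    x = np.arange(len(signal))
    coefs = np.polyfit(x, signal, order)
    return signal - np.polyval(coefs, x)

# Reproduction: rerun on the paper's data.
# Reuse: the same function applies unchanged to any new 1D signal.
rng = np.random.default_rng(0)
new_data = np.linspace(0, 1, 100) + 0.1 * rng.standard_normal(100)
print(detrend(new_data)[:5])
```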
Finally, sharing source code is not enough: the code must also be readable.
Roadblocks to reproducible science
Manpower
Reusability, readability, and support of released code all take a lot of time, even though this is seldom acknowledged in talks about reproducible science. Given fixed manpower, it is impossible to achieve reusability and high quality for everything.
Computing power
Some numerical experiments or complex data analyses require weeks of cluster time to run. These will be much harder to reproduce. Moreover, rerunning an analysis from scratch on a regular basis is a good recipe for a robust path from data to results: the more computing power is a limiting resource, the more likely it is that a glitch goes undetected.
Data availability
No access, or restricted access, to data is a show-stopper for reproducibility. Data-sharing requirements, from funding agencies or journals, are becoming common. However, privacy concerns or confidential information get in the way of making data public, for instance in medical research or microeconomics. Often, these concerns serve as a pretext for people who actually do not want to relinquish control [1].
[1] A related post by Deevy Bishop: Who’s afraid of open data
Incentives problem
Fancy new results are what matters for success in academia. “High impact” journals such as Nature or Science accept papers that amaze and impress, often with subpar inspection of the materials and methods [2]. The rate of publication in many leading groups is incompatible with consolidation efforts required for strong reproducibility.
On the other hand, it is hard to tell beforehand whether a new idea is a good one. Hence letting imagination run ahead, to foster impossible and improbable ideas, is a good path to innovation. The underlying questions are: What are the best community rules for the advancement of knowledge? What do we want from the way science moves forward? Rapid publication of many incremental ideas, e.g. at a conference, gives food for thought, possibly at the expense of reproducibility.
[2] “Science, Nature and Cell, had a higher rate of retractions” – Wikipedia: Invalid science
How to improve the situation
Docker, containers, and virtual machines
Docker, other container technologies, and virtual machines make it possible to ship a software environment. They reduce the challenge of building software and setting up an analysis, and are often used as a way to sidestep software-packaging issues. This seems to me like a plaster on a wooden leg.
Indeed, an analysis that lives in a box can be reproduced, but can it be understood, modified, or applied to new data? New science is likely to come from modifying this analysis, combining it with other tools, or feeding it new data. If those other tools live in a different virtual machine, the combination will be challenging.
In addition, people use containers as an excuse to avoid tackling the need for proper documentation of requirements and of the process to set them up. They sometimes even try to justify binary blobs [3]. This is wrong. An analysis should be runnable without requiring the stars to align, and it should be understandable.
[3] See also Titus Brown’s post: The post-apocalyptic world of binary containers
Version control: wear your seatbelt
Version control is like a time machine: used with regular commits, it enables rolling back to any point in time. In my work, it has always been crucial to reproducing what I or my students did a while ago. I often meet researchers who feel they lack the time to learn it. I really cannot support this position. http://try.github.io is an easy way to learn version control.
Hint: use a “tag” to pinpoint a position in the history that you might want to come back to, such as the state used to produce a figure or the version of an article at publication.
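As an illustration, here is a minimal Python sketch, assuming the analysis runs inside a git repository with git installed: it records the commit (and nearest tag, if any) that produced an output, so a figure can later be traced back to the exact code state. The function name and file-naming scheme are my own, for illustration.

```python
# Minimal sketch: stamp each output with the git state that produced it.
# Assumes git is installed and the script runs inside a git repository.
import subprocess

def git_describe():
    """Return the current commit, preferring the nearest tag, and flag
    uncommitted changes with a '-dirty' suffix."""
    return subprocess.run(
        ["git", "describe", "--always", "--dirty"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

if __name__ == "__main__":
    version = git_describe()
    # Embed the version in the file name (or in the figure's metadata).
    print(f"saving figure as figure_2_{version}.png")
```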
Software libraries, curated and maintained
Consolidating an analysis pipeline, a standard visualization, or any computational aspect of a paper into a software library is a sure way to make the paper more reproducible. It will also make the steps reusable, and a replication easier. If continued effort is put in the library, chances are that computational efficiency will improve over time, thus helping in the long run with the challenge of computing power.
Maintaining the library will ensure that results are still reproducible on new hardware, or with evolution of the general software stack (a new Python or Matlab release, for instance). Documentation and curated examples will lower the bar to reuse and facilitate replication of the original scientific results.
To avoid feature creep and technical debt, a library calls for focused efforts on selecting the most important operations.
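As a sketch of what such consolidation can look like, here is a hypothetical Python example: a step that recurs across a paper's scripts, turned into a small documented function whose curated example doubles as a test (the module and function names are mine, for illustration).

```python
"""Sketch of a hypothetical lab library module: a recurring preprocessing
step consolidated into one documented, testable function."""
import numpy as np

def standardize(X, axis=0):
    """Center and scale an array to zero mean and unit variance.

    Examples
    --------
    >>> import numpy as np
    >>> float(standardize(np.array([[1.0, 2.0], [3.0, 4.0]])).mean())
    0.0
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=axis, keepdims=True)
    std = X.std(axis=axis, keepdims=True)
    std[std == 0] = 1.0  # avoid dividing by zero on constant columns
    return (X - mean) / std

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # curated examples double as lightweight tests
```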
Datasets, serving as model experiments, tractable and open
Sometimes researchers create a toy dataset, with a well-posed question, that is curated and open, small enough to be tractable yet large enough to be relevant to the application field. This is an invaluable service to the field. One example is the Netflix prize in machine learning, which led to a standard dataset. Unfortunately, the dataset was taken down some years later over privacy concerns, but it has been replaced, e.g. by the MovieLens dataset. In computer vision, a series of datasets (Caltech101, CIFAR, ImageNet…) has led to continuous progress of the field. In bioinformatics, standard datasets are regularly created, for instance by the DREAM challenges.
These reference open datasets serve as benchmarks and therefore foster competition. They also define a canonical experiment, helping a wider scientific community understand the questions being asked. Ultimately, they result in better software tools to solve the problem at hand, as that problem becomes a standard example and application for the tools.
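To illustrate the benchmark role such datasets play, here is a minimal Python sketch using scikit-learn's small bundled digits data as a stand-in for a reference dataset: any method can be evaluated on the same canonical task and compared against others.

```python
# Minimal sketch: a shared, open dataset as a common benchmark.
# The digits data stands in for a reference dataset of the field.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
for name, model in [
    ("nearest neighbors", KNeighborsClassifier()),
    ("logistic regression", LogisticRegression(max_iter=2000)),
]:
    # Same data, same protocol: results are directly comparable.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```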
Sage Bionetworks, for instance, is a non-profit that collects biomedical data and makes it available. These people believe, as I do, that such data will lead to better medical care.
Changing incentives: setting the right goals
Making sustainable, quality scientific work that facilitates reproduction needs to bring a clearly visible benefit to researchers, young and senior. Such contributions should help them get jobs and grants.
Scientific evaluation too often rests on an unsophisticated publication count. We need to accept publications about data, software, and replication of prior work in high-quality journals, and to review them strictly, so as to establish high standards for these contributions. This change is happening. GigaScience, amongst other venues, publishes data. The MLOSS (machine learning open source software) track of the JMLR (Journal of Machine Learning Research) publishes software, with a tough review of the project's software quality.
Yet software is still often under-cited: many will use software implementing a method and cite only the original paper that proposed the method. Another remaining challenge is how to give credit for continuing development and maintenance.
Fast-paced science is probably useful even if fragile. But the difference between a quick proof of concept and solid, reproducible, reusable work needs to be acknowledged. It is important to select for publication not only impressive results, but also sound, reusable materials and methods. The latter are the foundation of future scientific developments, yet high-impact journals tend to focus on the former.