
Why reproducible research?

Research is intended for the betterment of society and the world. Science is incremental. If you agree that these two statements are true, then to gain the greatest possible “betterment” for every piece of research, it must be as reproducible by others as possible.

This is the traditional remit of the “Methods” section, but sadly the methods of a given paper often don’t contain enough detail to reproduce the research from scratch. There can also be “magic tricks” or insider know-how that are necessary for something to work and would be critical for any replication attempt, yet are often (usually with no ill intent) not made explicit in the paper.

Ideally, a research paper would be published with the tools necessary for the reader to reproduce the paper’s contents with the push of a button, from as raw a data set as is ethically and physically possible to share. If the reader were able to then modify or expand the existing paper analysis or methods, replication and extension of published work would be made a lot easier.

If this were a publishing requirement, I submit that we would see a general increase in the quality (or at least thoroughness) of published research. Authors would take more care with their organisation of data and their experiment code if they knew that the reviewers of the paper would be able to simply go and look at them. I think it would also make authors less tempted to gloss over holes in their data set for the sake of a clean story, and the state of science would be better for this.

Reproducible research in practice

I believe that one important requirement for something to be reproducible is that, wherever possible, the software needed to reproduce it is freely and easily available. Why free? Well, if we accept that publicly-funded science is done for the benefit of the public, then I believe the public should have the opportunity to become involved in the data to whatever degree they wish. After all, it belongs to them.

Another reason I think free software is preferable is that more researchers are able to replicate and extend the work. The population of people to whom free software is available is larger than the population who can afford paid software. If everyone used free software to do their research, psychology departments around the world wouldn’t need to spend money on Matlab or SPSS site licences, and could maybe afford to employ more full-time staff instead.

Paid software has its advantages too: chiefly, ease of installation and support. Installing Matlab is currently much easier than installing all the Python modules necessary to mimic Matlab functionality (blog post on this on its way). As for support: when was the last time you asked MathWorks or SPSS for help with something you were trying to use their software for? In my experience, support is usually community-driven in the first place (searching places like Stack Overflow or user groups for other people with similar problems). Why not build a community around software that’s available to everyone?

One obvious and important limitation of this philosophy is that certain things just can’t be freely reproducible. In particular, hardware is not going to be available for free (or even at low cost). If you want to replicate an fMRI study you need a big machine that goes bing; Uncle MacGyver’s magnet set, a ham radio and a helium balloon are not going to cut it.

However, straight behavioural research in experimental psychology is pretty uniquely placed as a discipline with low barriers to entry on hardware for the vast majority of experiments. Moreover, data analysis can always be open and reproducible, whether the data comes from free web repositories or a proprietary MRI scanner.

What’s one simple thing I can do to make my work more reproducible right now?

This blog will hopefully get filled with useful tips on creating a more open and reproducible workflow. I know you’re itching to get started though. What’s something you can do right now?

Start using text files for more things. If you store your content in binary files (basically, anything that looks funny when opened in a text editor) then people who want access to your content probably need your program to open it. I’m writing this blog in a .txt file on my computer on a train right now (see the excellent TextWrangler). If you hate how plain text looks, consider using Markdown (or LaTeX, which we’ll get to as well).

Chiefly, don’t store your data in a .xlsx or .mat file. Store it in a tab-delimited .txt file. Everyone can open that with whatever software they like without jumping through hoops, and reading it in is a one-liner in any good stats program. An Excel spreadsheet (or SPSS data file) is not a good solution for reproducible distribution of data.
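
To make the tab-delimited suggestion concrete, here is a minimal sketch in Python using only the standard library's csv module. The column names (subject, condition, rt_ms) and the values are made up purely for illustration; they don't come from any real data set. The example writes to an in-memory buffer so it runs anywhere, but the same two functions work on a file opened with open("data.txt", "w", newline="").

```python
import csv
import io

# Hypothetical example data: one row per trial (names and values are invented).
rows = [
    {"subject": "s01", "condition": "congruent", "rt_ms": 512},
    {"subject": "s01", "condition": "incongruent", "rt_ms": 634},
]


def write_tsv(handle, rows):
    """Write a list of dicts as tab-delimited text with a header row."""
    writer = csv.DictWriter(
        handle,
        fieldnames=list(rows[0]),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


def read_tsv(handle):
    """Read tab-delimited text back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Round trip through an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
write_tsv(buffer, rows)
tsv_text = buffer.getvalue()

loaded = read_tsv(io.StringIO(tsv_text))
```

Note that everything comes back as text, so numeric columns need converting (for example int(row["rt_ms"])) before analysis. That is the trade-off for a format that any editor or stats package can open.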

Finally, using text files rather than binary files will let you keep things under version control, which I’ll talk about in a future blog post.

If you’re interested in reading more about reproducible research, start with this recent blog post from Roger Peng.