Why reproducible research?

Research is intended for the betterment of society and the world. Science is incremental. If you agree that these two statements are true, then for every piece of research to deliver the greatest possible "betterment", it must be as reproducible by others as possible.

This is the traditional remit of the "Methods" section, but sadly the methods of a given paper often don't contain enough detail to reproduce the research from scratch. There can also be "magic tricks" or insider know-how necessary for something to work that are often (and usually with good intention) not made explicit in the paper, but would be critical for any replication attempt.

Ideally, a research paper would be published with the tools necessary for the reader to reproduce the paper’s contents with the push of a button, from as raw a data set as is ethically and physically possible to share. If the reader were able to then modify or expand the existing paper analysis or methods, replication and extension of published work would be made a lot easier.

If this were a publishing requirement, I submit that we would see a general increase in the quality (or at least thoroughness) of published research. Authors would take more care with their organisation of data and their experiment code if they knew that the reviewers of the paper would be able to simply go and look at them. I think it would also make it less likely that authors are tempted to gloss over holes in their data set for want of a clean story, and the state of science would be better for this.

Reproducible research in practice

I believe that one important requirement for something to be reproducible is that, wherever possible, the software needed to do so is freely and easily available. Why free? Well, if we accept that publicly-funded science is done for the benefit of the public, then I believe the public should have the opportunity to become involved in the data to whatever degree they wish. After all, it belongs to them.

Another reason I think free software is preferable is that more researchers are able to replicate and extend the work. The population of people to whom free software is available is larger than the population who can afford paid software. If everyone used free software to do their research, psychology departments around the world wouldn’t need to spend money on Matlab or SPSS site licences, and could maybe afford to employ more full-time staff instead.

Paid software has its advantages too: chiefly, ease and support. Installing Matlab is currently much easier than installing all the Python modules necessary to mimic Matlab functionality (blog post on this on its way). As for support: when was the last time you asked Mathworks or SPSS for help with something you were trying to use their software for? In my experience, support is usually community-driven in the first place (searching things like Stack Overflow or user groups for other people with similar problems). Why not build a community around software that’s available to everyone?

One obvious and important limitation of this philosophy is that certain things just can't be freely reproducible. In particular, hardware is not going to be available for free (or even at low cost). If you want to replicate an fMRI study you need a big machine that goes bing; Uncle MacGyver's magnet set, a ham radio and a helium balloon are not going to cut it.

However, straight behavioural research in experimental psychology is almost uniquely placed as a discipline with low hardware barriers to entry for the vast majority of experiments. Moreover, data analysis can always be open and reproducible, whether the data come from free web repositories or a proprietary MRI scanner.

What’s one simple thing I can do to make my work more reproducible right now?

This blog will hopefully get filled with useful tips on creating a more open and reproducible workflow. I know you’re itching to get started though. What’s something you can do right now?

Start using text files for more things. If you store your content in binary files (basically, anything that looks funny when opened in a text editor), then people who want access to it probably need your program to open it. I'm writing this blog in a .txt file on my computer on a train right now (see the excellent TextWrangler). If you hate how plain text looks, consider using Markdown (or LaTeX, which we'll get to as well).

Most importantly, don't store your data in a .xlsx or .mat file. Store it in a tab-delimited .txt file. Everyone can open that with whatever software they like, without jumping through hoops; loading it is a one-liner in any good stats program. An Excel spreadsheet (or SPSS data file) is not a good solution for reproducible distribution of data.
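
For example (a minimal sketch in Python with pandas; the file and column names are invented), loading a tab-delimited text file really is a single call, and R's read.table or Matlab's readtable are just as short:

```python
# Minimal sketch: loading a tab-delimited data file.
# "experiment_data.txt" and its columns are hypothetical examples.
import pandas as pd

data = pd.read_csv("experiment_data.txt", sep="\t")  # one line to load
print(data.head())  # inspect the first few rows
```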

Finally, using text files rather than binary files will let you keep things under version control, which I’ll talk about in a future blog post.

If you’re interested in reading more about reproducible research, start with this recent blog post from Roger Peng.

11 comments

  1. Tom – this is such a great topic! I am looking forward to seeing what you write about this.

    A comment I had while reading this is that I understand why you want to avoid *proprietary* binary file formats (such as xls/xlsx), but I don't see what the problem would be with properly documented open binary file formats. For example, the '.npy' file format in numpy? A csv file without column labels is just as useful as a binary file you don't know how to read… And saving large-ish data, such as data from neuroimaging experiments, in tabular/text form would be quite impractical. The nifti standard is another example of a (usually) useful open binary file format, used for that kind of data.

    As for version control of data, a thorny issue in and of itself, git-annex can help deal with those really large data-sets.

    1. Hi Ariel, thanks for commenting.

      Open binary formats are better than proprietary formats, but I would still say that unless it's impractical (e.g. neuroimaging data), the final data set for distribution should be in a text file. Sharing a binary format (e.g., .npy) means that I would need to find a way to convert it if I wanted to use XX to analyse the data. Any psychophysics dataset I can think of could be distributed simply as a text file, but I agree that for neuroimaging and similar data another format is more practical.
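
      For example (a hypothetical sketch; the file names are made up), converting a shared .npy file back to plain text is only a couple of lines in numpy, but it already assumes the reader has Python and numpy installed just to read the data:

      ```python
      # Hypothetical sketch: converting a shared .npy file to a plain text table.
      # The file names are invented for illustration.
      import numpy as np

      data = np.load("shared_data.npy")                    # needs numpy just to open the file
      np.savetxt("shared_data.txt", data, delimiter="\t")  # now anyone can read it
      ```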

      I hope you can keep commenting here – as you can see, my suggestions are going to be focused on the types of data that I deal with regularly. Getting input from someone in the big(ger) data world is really great!

  2. A great first article, Tom. I look forward to reading more!

    Now that electronic publication of journals has become the norm rather than the exception, I think that at a minimum all the data used in statistical analyses and the scripts used to perform the calculations should be freely available as an addendum, likewise with residual diagnostics for the ever-popular ANOVAs/t-tests/etc. One annoying thing about doing peer reviews is that I often develop a strong suspicion that the manuscript's statistical methodology is flawed, but without seeing what the actual analyses were I can't be certain.

    One particularly egregious example was a group of (senior) authors claiming to have "accurately" modelled numerous processes using normal distributions – but the distributions looked bimodal to me. I asked for evidence of normality, and they claimed the normality tests were significant. Upon further questioning, I discovered they were using a Kolmogorov-Smirnov test, where a "significant" result actually means non-normality :-/ It would be nice to be able to pick up on these types of major errors on the first review!
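
    For illustration (a minimal sketch with simulated data, not the data from that manuscript), in a one-sample Kolmogorov-Smirnov test against a normal distribution, a significant result is evidence of *departure* from normality:

    ```python
    # Minimal sketch: a "significant" Kolmogorov-Smirnov test indicates NON-normality.
    # The bimodal sample is simulated for illustration only.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    bimodal = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

    # Standardise and compare against the standard normal distribution.
    z = (bimodal - bimodal.mean()) / bimodal.std()
    statistic, p_value = stats.kstest(z, "norm")
    print(p_value)  # tiny p-value: the data depart significantly from normality
    ```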

  3. I don't see a problem with binary files that are documented and open, even for the most basic of psychophysical data. The ones you mentioned are particularly squishy examples, as they're not open per se but have been reverse-engineered for over a decade.

    There are issues with storing data as a plain text file. The author has to be careful with the precision of the variables saved (is it a float or an integer?) — which isn't a huge deal, but I've run into at least one case of messed-up data shared with me because of this (e.g., saving seeds of the random number generator). Also, saving data as text leads to authors sometimes making idiosyncratic choices that have to be documented. Easy enough, but why reinvent the wheel? Loading a .mat, Excel, or .rda file, like a csv, takes one line of code in R.
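
    To make the precision point concrete (a made-up illustration, not the actual data I was sent), a large integer seed written out with a low-precision float format is silently truncated:

    ```python
    # Made-up example of precision loss when writing plain text carelessly.
    import numpy as np

    seed = np.array([2147483647])             # e.g. a random number generator seed
    np.savetxt("seed.txt", seed, fmt="%.6g")  # only six significant digits survive
    print(np.loadtxt("seed.txt"))             # 2147480000.0, not 2147483647
    np.savetxt("seed.txt", seed, fmt="%d")    # saving as an integer keeps it exact
    ```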

    I agree with the point as a whole: we should always save data in a way others can read. But an idiosyncratic and poorly documented text file is far worse to deal with than the binary de facto standards mentioned.

    1. Great points Chris. I hadn't thought about the precision of saved variables, particularly when it comes to random seeds. The need to document idiosyncratic choices applies to whatever format you save in, however: if I have a variable called "presentation_side" that takes values of 0 or 1, I need to document this whether it's a binary or a text file. Binary files have more scope for idiosyncratic choices, because people can use weird data formats (think structures with lots of levels in a .mat file rather than a flat table). In any case, I think data formats are worth a post in themselves (i.e. how to properly structure and document a data file).
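
      As a rough sketch of what I mean (the column names and codes here are invented), a flat tab-delimited table plus a short plain-text codebook covers most psychophysics cases:

      ```python
      # Rough sketch: a flat table plus a tiny plain-text codebook documenting coded values.
      import pandas as pd

      trials = pd.DataFrame({
          "trial": [1, 2, 3],
          "presentation_side": [0, 1, 0],   # coded values, documented in the codebook below
          "correct": [1, 1, 0],
      })
      trials.to_csv("trials.txt", sep="\t", index=False)

      with open("codebook.txt", "w") as f:
          f.write("presentation_side: 0 = left, 1 = right\n")
          f.write("correct: 0 = incorrect response, 1 = correct response\n")
      ```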

  4. Excel used to (I don't know if it still does) have an idiosyncratic feature when saving data in plain text formats – unless you had properly specified data types and precision levels, you could easily lose data via truncation. For example, a UNIX long timestamp in a CSV file would load into Excel fine, but if re-saved, Excel would write it out in scientific notation and drop the precision substantially.

    Another issue with plain text files is that storing structure (hierarchy and links) and extending file formats become convoluted quite quickly. When I was involved in business data programming a decade ago, XML was rapidly becoming popular – but the researchers I've talked to don't seem to like it? I liked XML because it supports document structure definition and typing inline, is easily extensible with additional data, and XSLT stylesheets and XPath expressions make document navigation easy.
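
    As a toy illustration (the element names are invented, not any real standard), a fragment like this carries structure, units and types inline, and Python's standard library can navigate it with XPath-style expressions:

    ```python
    # Toy example of structured, typed trial data in XML, navigated with
    # XPath-style queries from the standard library.
    import xml.etree.ElementTree as ET

    xml_text = """
    <experiment>
      <trial id="1">
        <condition>left</condition>
        <response_time unit="ms">432.7</response_time>
      </trial>
      <trial id="2">
        <condition>right</condition>
        <response_time unit="ms">389.1</response_time>
      </trial>
    </experiment>
    """

    root = ET.fromstring(xml_text)
    for rt in root.findall("./trial/response_time"):   # simple XPath-style query
        print(rt.get("unit"), float(rt.text))
    ```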

    On the other hand, research data are typically used within individual groups and don't have to be shared across separate groups over long periods of time – so rapid coding and flexibility are more important. Thoughts?

    1. Good points Dave. I think XML is the way forward (coincidentally, I've just been talking with colleagues here about XML data standards). It's plain text (human readable in a text editor), but able to support flexible structures and data types. For this post though I wanted to encourage people to move their stuff (not just data) to plain text as a first step. Perhaps in a few weeks' time I can do a post devoted to data standards for psychophysics.

  5. Thanks for a great post. I have only recently been introduced to the idea of reproducible research, and I am excited to move in that direction. I realise that many of my frustrations in the different labs I have worked in have come from endlessly working in different proprietary programs – and spending ages moving data from one program to another, with many opportunities for error!! I am trying to tackle the current workflow in my lab – showing off some of the benefits of RR and trying to convince my boss and colleagues! Not always easy with the entry barriers for scientists who don't do much computational work.
    I wonder about how we can lower the entry barriers for scientists to conduct reproducible research and scaffold this better for new students so that each doesn’t have to invent their own workflow! Certainly having discussion forums such as this is a great start.
    As a beginner I also found great tips in Chris Gandrud’s book: Reproducible Research with R and RStudio: http://christophergandrud.github.io/RepResR-RStudio/
    Thanks for the encouragement!
    Jess

    1. Glad you found the post useful, and thanks for commenting. Yes, I think one of the huge advantages of using free programs as much as possible is that you're self-sufficient wherever you end up. You don't need your new employer to buy a licence for X so that you can be productive.

      Thanks also for the link to the book; looks very related to some of what I want to cover on this blog.

Comments are closed.