Reproducible research

Setting up a project directory

The first step in making a research project reproducible is to make sure that the files associated with it are organised. Since scientific projects will change (sometimes a lot) as you work on them, you’re aiming for a structure that will cope with changes in direction and scope as you go.

Here’s an example of what not to do, from my own archive:

[screenshot of a messy, disorganised project folder]

If this looks familiar, then read on. This post is going to run you through, step by step, how to set up a project directory like the one used here. In the following post I will talk about setting up version control using git, which is an integral part of keeping your project folder organised and its history intact. The next few posts will run through some basic data handling.

My directory structure and approach are almost entirely based on this great post from the blog Nice R Code (see what they did there?). You should tweak the process to something that suits you, but I hope this will be a useful starting point.

You’re going to have a root directory for your project, with five subdirectories:

project_root/
    data/
    exp/
    figs/
    funs/
    out/

Note how my root directory is project_root and not project root. Many programs don’t like spaces in file paths, so it’s a good habit to get into naming your files and folders in this way.
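
If you like, you can create this skeleton straight from the R console. Here’s a minimal sketch, assuming your root directory is called project_root as above (the call and its arguments are standard base R; only the project name is an example):

    # create the project skeleton: project_root/ plus its five subdirectories
    dirs <- c("data", "exp", "figs", "funs", "out")
    for (d in dirs) {
        dir.create(file.path("project_root", d),
                   recursive = TRUE, showWarnings = FALSE)
    }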

You’re trying to keep your project_root/ directory as clean as possible. In the way I work, this directory is going to contain two key files: your manuscript file (e.g. a LaTeX .tex file or a Sweave / knitr .Rnw file) and your master analysis file (analysis.R or analysis.txt or similar).

The exp/ directory is going to contain all the scripts and programs that are associated with data collection. For example, all of your Psychtoolbox or PsychoPy code is going in here. You could also create subfolders for storing stimuli like pictures or movies. All of that should go in exp/.

Any data files saved by your experiment are going to be stored in data/. That is, your experiment script, when finished, is going to write a file like project_root/data/subject_01_session_02_2014_01_16.csv. You should treat everything in the data/ directory as read-only once created. That is, your analysis scripts should read the files in data/ but never write to them. All your data analysis is going to depend on a processed data file that you will store in out/. Likewise, as I previously discussed, the files in data/ should never be edited by a human.
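
To make this concrete, here’s a hedged sketch of the sort of thing your experiment code would do when a session finishes, written in R for consistency with the rest of these posts (a Psychtoolbox or PsychoPy script would do the equivalent; the column names and values are placeholders):

    # hypothetical sketch: save one subject's session into data/ and never touch it again
    trials <- data.frame(trial     = 1:10,
                         condition = rep(c("A", "B"), 5),
                         response  = NA)
    fname <- sprintf("subject_%02d_session_%02d_%s.csv",
                     1, 2, format(Sys.Date(), "%Y_%m_%d"))
    write.csv(trials, file.path("data", fname), row.names = FALSE)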

The out/ directory stores any numerical results of your data analysis that you want to write to disk, such as processed data files, model fits, MCMC samples, and so on. Take the approach that you should be able to delete everything in your out/ directory and then regenerate it by running your analysis scripts again.

figs/ stores all the graphical output from your analysis. Your manuscript file (which is going to live in the root directory, or maybe in a separate doc/ directory if you prefer) will pull its figures from the figs/ directory. Well, if you use Word then I think you have to paste the figures in yourself every time you make new ones, but if you use LaTeX or Markdown the manuscript will populate itself. As with the out/ directory, take the attitude that everything in figs/ can be deleted and regenerated from your master analysis file. One caveat concerns multi-panel figures that combine data and diagrams, like those popular in vision science: these usually need to be arranged in an Illustrator-like program (such as Inkscape). I doubt there is an easy way to automate this, so those figures are just going to have to be assembled manually. However, all panels containing data should be generated automatically.
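
For example, in R a data panel can be written straight into figs/ from your analysis code. A sketch (the file name and the plot itself are only illustrative):

    # every data panel is written to figs/ by a script, never exported by hand
    pdf(file.path("figs", "example_panel.pdf"), width = 4, height = 3)
    plot(1:10, (1:10)^2, type = "b",
         xlab = "condition", ylab = "mean response")
    dev.off()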

Finally, the funs/ directory is going to contain all your analysis functions. These are things that your master analysis file is going to call on to do the work. My approach differs a little from Nice R Code here, in that I’m happy to have some scripts that actually do work here (and not just function definitions). Also, I call my directory funs/ rather than R/ because I’m not just going to have R code in there. For me the important thing is that your master analysis file should be as clean as possible, only acting as an overview or make file to tell the reproducer of your analysis the order in which the code in funs/ should be run. If you’re just using R, like I was in my previous projects, you can run this file and the whole analysis comes out.
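
To give you a flavour, here is roughly what such a master analysis.R could look like. The script names are invented for this example; the point is only that the file documents the order in which things run:

    # analysis.R: top-level overview of the whole analysis.
    # Each sourced script lives in funs/ and writes only to out/ or figs/.
    source(file.path("funs", "munge_data.R"))    # data/ -> out/processed_data.txt
    source(file.path("funs", "fit_models.R"))    # out/  -> out/model_fits.RData
    source(file.path("funs", "make_figures.R"))  # out/  -> figs/*.pdf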

OK. In the next post I’m going to run over how I set up version control for a project directory like this, then we can get into some basic data import / export.

New paper on scientific computing

Further to my last post and this blog in general, PLoS Biology has just published an article on Best Practices for Scientific Computing. It’s a great overview of some of the things I’m planning to talk about here, and its authors include at least one of the developers of Software Carpentry, which I recommend you check out.

A workflow principle: Avoiding humans touching data

One key principle for creating a reproducible workflow is that everything that happens to the data should be documented, and ideally repeatable automatically. The best way to do this is to make sure that the data are only ever manipulated by lines of a computer program, and wherever possible, never by the manual work of a human.

In a behavioural experiment, your raw data will often be a bunch of files containing one observation (a trial, a survey respondent, etc.) per row. There will be a number of columns for the variables and conditions of the experiment, along with some response for that row.

These files should never be modified once output by your experiment code or questionnaire software: they are read-only.

Instead, the first step of your analysis is to write a script that opens the experiment files, checks that all the values make sense, and compiles all the information you want for your analysis into a master file for further processing. This might involve compiling all the blocks of trials from different subjects into one file, for example. All of your future analysis scripts are then going to use this munged file.
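
In R, a minimal sketch of that first script might look like the following (the file pattern, the sanity check and the output name are all placeholders for whatever your experiment actually produces):

    # read every raw file in data/, check it, combine, and write the
    # processed file that everything downstream will use
    files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
    dat <- do.call(rbind, lapply(files, read.csv))

    stopifnot(all(dat$response %in% c(0, 1)))   # example check: responses are 0 or 1

    write.table(dat, file.path("out", "processed_data.txt"),
                sep = "\t", row.names = FALSE)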

The workflow for your whole analysis is going to continue in this fashion: a series of scripts that can be executed sequentially, with the eventual output being all the figures and numbers for the paper. You’re trying to avoid any stage at which you manually interact with the data. If you find yourself copying and pasting columns of data in a spreadsheet, alarm bells should be ringing. While it often feels convenient, it’s not reproducible or easily repeatable.

Why reproducible research?

Research is intended for the betterment of society and the world. Science is incremental. If you agree that these two statements are true, then to gain the greatest possible “betterment” for every piece of research, it must be as reproducible by others as possible.

This is the traditional remit of the “Methods” section, but sadly the methods of a given paper often don’t contain enough detail to reproduce the research from scratch. There can also be “magic tricks” or insider know-how necessary for something to work that are often (and usually with good intentions) not made explicit in the paper, but would be critical for any replication attempt.

Ideally, a research paper would be published with the tools necessary for the reader to reproduce the paper’s contents with the push of a button, from as raw a data set as is ethically and physically possible to share. If the reader were able to then modify or expand the existing paper analysis or methods, replication and extension of published work would be made a lot easier.

If this were a publishing requirement, I submit that we would see a general increase in the quality (or at least thoroughness) of published research. Authors would take more care with their organisation of data and their experiment code if they knew that the reviewers of the paper would be able to simply go and look at them. I think it would also make it less likely that authors are tempted to gloss over holes in their data set for want of a clean story, and the state of science would be better for this.

Reproducible research in practice

I believe that one important requirement for something to be reproducible is that, wherever possible, the software needed to do so is freely and easily available. Why free? Well, if we accept that publicly-funded science is done for the benefit of the public, then I believe the public should have the opportunity to become involved in the data to whatever degree they wish. After all, it belongs to them.

Another reason I think free software is preferable is that more researchers are able to replicate and extend the work. The population of people to whom free software is available is larger than the population who can afford paid software. If everyone used free software to do their research, psychology departments around the world wouldn’t need to spend money on Matlab or SPSS site licences, and could maybe afford to employ more full-time staff instead.

Paid software has its advantages too: chiefly, ease and support. Installing Matlab is currently much easier than installing all the Python modules necessary to mimic Matlab functionality (blog post on this on its way). As for support: when was the last time you asked Mathworks or SPSS for help with something you were trying to use their software for? In my experience, support is usually community-driven in the first place (searching things like Stack Overflow or user groups for other people with similar problems). Why not build a community around software that’s available to everyone?

One obvious and important limitation of this philosophy is that certain things just can’t be freely reproducible. In particular, hardware is not going to be available for free (or even at low cost). If you want to replicate an fMRI study you need a big machine that goes bing; Uncle MacGyver’s magnet set, a ham radio and a helium balloon are not going to cut it.

However, straight behavioural research in experimental psychology is pretty uniquely placed as a discipline with low barriers to entry on hardware for the vast majority of experiments. Moreover, data analysis can always be open and reproducible, whether the data comes from free web repositories or a proprietary MRI scanner.

What’s one simple thing I can do to make my work more reproducible right now?

This blog will hopefully get filled with useful tips on creating a more open and reproducible workflow. I know you’re itching to get started though. What’s something you can do right now?

Start using text files for more things. If you store your content in binary files (basically, anything that looks funny when opened in a text editor) then people who want access to your content probably need your program to open it. I’m writing this blog in a .txt file on my computer on a train right now (see the excellent TextWrangler). If you hate how plain text looks, consider using Markdown (or LaTeX, which we’ll get to as well).

Chiefly, don’t store your data in a .xlsx or .mat file. Store it in a tab-delimited .txt file. Everyone can open that with whatever software they like without jumping through hoops. It’s a one-liner in any good stats program. An Excel spreadsheet (or SPSS data file) is not a good solution for reproducible distribution of data.
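
For instance, in R reading or writing such a file really is a one-liner (the file name here is just an example):

    # read a tab-delimited text file with a header row
    dat <- read.delim("out/processed_data.txt")
    # and write one back out
    write.table(dat, "out/processed_data.txt", sep = "\t", row.names = FALSE)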

Finally, using text files rather than binary files will let you keep things under version control, which I’ll talk about in a future blog post.

If you’re interested in reading more about reproducible research, start with this recent blog post from Roger Peng.