
Setting up a project directory

The first step in making a research project reproducible is to make sure that the files associated with it are organised. Since scientific projects will change (sometimes a lot) as you work on them, you’re aiming for a structure that will cope with changes in direction and scope as you go.

Here’s an example of what not to do, from my own archive:

[Figure: screenshot of one of my old, disorganised project folders]

If this looks familiar, then read on. This post is going to run you through, step by step, how to set up a project directory like the one used here. In the following post I will talk about setting up version control using git, which is an integral part of keeping your project folder organised and its history intact. The next few posts will be running through some basic data handling.

My directory structure and approach are almost entirely based on this great post from the blog Nice R Code (see what they did there?). You should tweak the process to something that suits you, but I hope this will be a useful starting point.

You’re going to have a root directory for your project, with five subdirectories:

project_root/
    data/
    exp/
    figs/
    funs/
    out/

Note how my root directory is project_root and not project root. Many programs don’t like spaces in file paths, so it’s a good habit to name your files and folders without spaces.
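If you like, you can create this skeleton from within R. Here’s a minimal sketch; the project name my_project is just a placeholder:

    # Create the project skeleton; "my_project" is a placeholder name.
    dirs <- file.path("my_project", c("data", "exp", "figs", "funs", "out"))
    invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))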

You’re trying to keep your project_root/ directory as clean as possible. In the way I work, this directory is going to contain two key files: your manuscript file (e.g. a LaTeX .tex file or a Sweave / knitr .Rnw file) and your master analysis file (analysis.R or analysis.txt or similar).

The exp/ directory is going to contain all the scripts and programs associated with data collection. For example, all of your Psychtoolbox or PsychoPy code goes in here. You could also create subfolders within exp/ for storing stimuli like pictures or movies.

Any data files saved by your experiment are going to be stored in data/. That is, when your experiment script finishes, it is going to write a file like project_root/data/subject_01_session_02_2014_01_16.csv. Treat everything in the data/ directory as read-only once it’s created: your analysis scripts should read the files in data/ but never write to them, and, as I previously discussed, the files in data/ should never be edited by a human. All your data analysis is going to depend on a processed data file that you will store in out/ instead.
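To make that concrete, here’s a rough sketch of how an experiment script might write its raw output into data/. The subject and session numbers and the trial_data data frame are placeholders for whatever your experiment actually records, and I’m assuming the working directory is project_root/:

    # Placeholder values standing in for what the experiment records:
    subject <- 1
    session <- 2
    trial_data <- data.frame(trial = 1:10,
                             response = sample(c("left", "right"), 10, replace = TRUE))

    # Build an informative filename and write the raw data into data/.
    # Once written, this file is treated as read-only.
    outfile <- file.path("data",
                         sprintf("subject_%02d_session_%02d_%s.csv",
                                 subject, session, format(Sys.Date(), "%Y_%m_%d")))
    write.csv(trial_data, outfile, row.names = FALSE)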

The out/ directory stores any numerical results of your data analysis that you want to write to disk, such as processed data files, model fits, MCMC samples, and so on. Take the approach that you should be able to delete everything in your out/ directory and then regenerate it by running your analysis scripts again.
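For example, a processing script might read every raw file from data/, stack them together, and save the result to out/. This sketch assumes the raw files are CSVs with identical columns:

    # Read all raw CSVs from data/ (read-only), combine them, and
    # write the processed result to out/ for the rest of the analysis.
    raw_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
    processed <- do.call(rbind, lapply(raw_files, read.csv))
    saveRDS(processed, file.path("out", "processed_data.rds"))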

figs/ stores all the graphical output from your analysis. Your manuscript file (which is going to live in the root directory, or in a separate doc/ directory if you prefer) is going to populate itself with the figures in the figs/ directory. Well, if you use Word then I think you have to paste the figures in yourself every time you make new ones, but if you use LaTeX or Markdown the manuscript will populate itself. As with the out/ directory, try to take the attitude that everything in figs/ can be deleted and regenerated from your master analysis file. One caveat here is multi-panel figures that combine data and diagrams, like those popular in vision science: these usually need to be arranged by hand in an Illustrator-like program (such as Inkscape), and I doubt there is an easy way to automate that. Those figures are just going to have to be assembled manually, but every panel containing data should still be generated automatically.
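As a sketch, a figure script might read the processed data from out/ and write a PDF straight into figs/ (the file and column names here are placeholders):

    # Read processed data from out/ and write a figure to figs/.
    # Everything in figs/ can be deleted and regenerated by rerunning this.
    processed <- readRDS(file.path("out", "processed_data.rds"))
    pdf(file.path("figs", "response_counts.pdf"), width = 4, height = 3)
    barplot(table(processed$response), ylab = "Count")
    dev.off()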

Finally, the funs/ directory is going to contain all your analysis functions: the things your master analysis file calls on to do the work. My approach differs a little from Nice R Code here, in that I’m happy for funs/ to hold some scripts that actually do work (and not just function definitions). Also, I call the directory funs/ rather than R/ because I’m not only going to have R code in there. For me the important thing is that your master analysis file should be as clean as possible, acting only as an overview or makefile that tells the reproducer of your analysis the order in which the code in funs/ should be run. If you’re only using R, as I was in my previous projects, you can run this one file and the whole analysis comes out.
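So the master analysis file ends up looking something like this sketch. The script names are hypothetical; the point is that the file is just an ordered, commented list of the steps in funs/:

    # analysis.R -- master analysis file (hypothetical script names)
    # Rerunning these steps from scratch regenerates out/ and figs/.
    source(file.path("funs", "process_data.R"))  # data/ -> out/processed_data.rds
    source(file.path("funs", "fit_models.R"))    # out/  -> out/model_fits.rds
    source(file.path("funs", "make_figures.R"))  # out/  -> figs/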

OK. In the next post I’m going to run through how I set up version control for a project directory like this, and then we can get into some basic data import / export.