Setting up a project directory

The first step in making a research project reproducible is to make sure that the files associated with it are organised. Since scientific projects will change (sometimes a lot) as you work on them, you're aiming for something that will cope with changes in direction and scope as you go.

Here’s an example of what not to do, from my own archive:

[Screenshot of a messy project folder]

If this looks familiar, then read on. This post runs you through, step by step, how to set up a project directory like the one described here. In the following post I will talk about setting up version control using git, which is an integral part of keeping your project folder organised and its history intact. The next few posts will run through some basic data handling.

My directory structure and approach are almost entirely based on this great post from the blog Nice R Code (see what they did there?). You should tweak the process to something that suits you, but I hope this will be a useful starting point.

You’re going to have a root directory for your project, with five subdirectories:

project_root/
    data/
    exp/
    figs/
    funs/
    out/

Note how my root directory is project_root and not project root. Many programs don’t like spaces in file paths, so it’s a good habit to get into naming your files and folders in this way.
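If you'd rather not click the folders into existence by hand, a minimal R sketch like the one below will build the skeleton (the project name is just a placeholder):

    # Build the project skeleton; "my_project" is a placeholder name.
    project_root <- "my_project"
    subdirs <- c("data", "exp", "figs", "funs", "out")

    dir.create(project_root, showWarnings = FALSE)
    for (d in subdirs) {
      dir.create(file.path(project_root, d), showWarnings = FALSE)
    }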

You’re trying to keep your project_root/ directory as clean as possible. In the way I work, this directory is going to contain two key files: your manuscript file (e.g. a LaTeX .tex file or a Sweave / knitr .Rnw file) and your master analysis file (analysis.R or analysis.txt or similar).

The exp/ directory is going to contain all the scripts and programs associated with data collection. For example, all of your Psychtoolbox or PsychoPy code goes in here. You could also create subfolders for storing stimuli like pictures or movies.

Any data files saved by your experiment are going to be stored in data/. That is, your experiment script, when finished, is going to write a file like project_root/data/subject_01_session_02_2014_01_16.csv. You should treat everything in the data/ directory as read-only once created: your analysis scripts should read the files in data/ but never write to them, and, as I previously discussed, the files in data/ should never be edited by a human. All your data analysis is going to depend on a processed data file that you will store in out/.
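As a rough sketch of that workflow (the file names here are invented), a preprocessing step might read every raw file in data/ and write one processed file to out/, without ever writing back to data/:

    # Read all raw csv files from data/ (treated as read-only),
    # stack them, and save one processed file to out/.
    raw_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
    raw_list <- lapply(raw_files, read.csv, stringsAsFactors = FALSE)
    dat <- do.call(rbind, raw_list)

    # Downstream analysis reads this file, never the raw files directly.
    write.csv(dat, file = "out/processed_data.csv", row.names = FALSE)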

The out/ directory stores any numerical results of your data analysis that you want to write to disk, such as processed data files, model fits, MCMC samples, and so on. Take the approach that you should be able to delete everything in your out/ directory and then regenerate it by running your analysis scripts again.
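For instance (the model and column names below are just placeholders), anything expensive to compute can be cached in out/ and recreated at will:

    # Hypothetical example: fit a model and cache the result in out/.
    # Deleting out/model_fit.rds costs nothing but computation time,
    # because re-running this script recreates it.
    dat <- read.csv("out/processed_data.csv")
    fit <- lm(rt ~ condition, data = dat)   # placeholder model and columns
    saveRDS(fit, file = "out/model_fit.rds")

    # Later scripts reload the cached fit rather than refitting.
    fit <- readRDS("out/model_fit.rds")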

figs/ stores all the graphical output from your analysis. Your manuscript file (which is going to live in the root directory, or maybe in a separate doc/ directory if you prefer) is going to populate itself with the figures in the figs/ directory. Well, if you use Word then I think you have to paste the figures in yourself every time you make new ones, but if you use LaTeX or Markdown the manuscript will populate itself. Like the out/ directory, you should take the attitude that everything in figs/ can be deleted and regenerated from your master analysis file. One caveat here is multi-panel figures that combine data and diagrams, like those popular in vision science. Usually these need to be arranged in an Illustrator-like program (such as Inkscape). I doubt there is an easy solution for automating this sort of thing, so those figures are just going to have to be created manually. However, all panels containing data should still be generated automatically.
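As a sketch (base graphics here, though ggplot2's ggsave does the same job), each figure is written into figs/ by the analysis code, and the manuscript just points at that file (e.g. \includegraphics{figs/fig_histogram.pdf} in LaTeX). The file and column names below are made up:

    # Write a figure to figs/ so the manuscript can pull it in;
    # re-running the analysis regenerates it from scratch.
    dat <- read.csv("out/processed_data.csv")

    pdf("figs/fig_histogram.pdf", width = 4, height = 3)
    hist(dat$rt, main = "", xlab = "Reaction time (s)")   # 'rt' is a made-up column
    dev.off()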

Finally, the funs/ directory is going to contain all your analysis functions. These are the things that your master analysis file calls on to do the work. My approach differs a little from Nice R Code here, in that I'm happy to have some scripts in there that actually do work (and not just function definitions). Also, I call my directory funs/ rather than R/ because I'm not just going to have R code in there. For me the important thing is that your master analysis file should be as clean as possible, acting only as an overview or makefile that tells the reproducer of your analysis the order in which the code in funs/ should be run. If you're just using R, like I was in my previous projects, you can run this file and the whole analysis comes out.
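To make that concrete, here's a sketch of what such a master file might look like (the script names are invented; yours will differ):

    # analysis.R -- master script, lives in project_root/.
    # Sourcing these in order reproduces the whole analysis.
    source("funs/preprocess.R")     # reads data/, writes out/processed_data.csv
    source("funs/fit_models.R")     # reads out/, writes out/model_fit.rds
    source("funs/make_figures.R")   # reads out/, writes figs/*.pdf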

OK. In the next post I’m going to run over how I set up version control for a project directory like this, then we can get into some basic data import / export.

Comments

  1. I should definitely read this carefully and start using a similar system for my MATLAB administration. My manual control is not as bad as your initial example, but it could be a lot better with an automated system. Need to read up on this when I find some time…

  2. Thanks for this nice post Tom, and also all the previous ones.

    I’m really in a similar state of mind, but less advanced than you are, so it’s very useful to see your view on that. I’d have one comment, about the fact that you include the data here in the same directory. I assume that this project might live in something like Dropbox (or GitHub), so having the data inside is only possible for simple behavioral data. As soon as you get EEG, fMRI or even just eye-tracking, it starts to be problematic.

    From what I’ve seen from a few friends who work as computer scientists with fairly large data, I’d rather recommend having two general places on your computer: one for data, one for code, figures, etc. I’ve been doing that for a few months now and it works pretty well.

    1. Thanks for the comment Sebastien; I’m glad you like the blog.

      You’re exactly right about larger data sets. This exact strategy is only going to work if the data files are reasonably small. However, it’s not a big leap to reference your data folder separately, even if it’s outside the main project_root directory. Then all your analysis scripts will reference /data_drive/project_root/data rather than just project_root/data to process your raw data, exactly as you say.
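
      For example, something like this at the top of the preprocessing script is all it takes (the drive name here is obviously made up):

        # Point at the raw data wherever it lives; everything downstream
        # stays relative to project_root/.
        data_dir <- "/data_drive/project_root/data"
        raw_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)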

  3. Hi Tom! I liked this post a lot, it is really helpful!! However, I had the same concern as Seb, and the situation is even further complicated by the fact that my experiments live on several computers with different operating systems. As the data is big, I need to run the analysis scripts on the computer on which the data is stored. But, for example, I write and work on figures on my laptop.
    The only way I could think of to solve this is to have the main experiment folder in my Dropbox (or GitHub), and run the scripts from there on data that is stored locally on one of the computers. Any better suggestions are very welcome. I am looking forward to the git post!

    1. Hey Sophie, nice to hear from you! One strategy would be to have your data-processing scripts reference the data-storing computer, via SSH or similar. However, that’s probably too complicated for me (or too much overhead to set up). If I were in your position I would probably try to at least start my project directory from a preprocessed data file (living in out/) that I might have to produce on the data computer manually.

      This is not as nice for reproducibility, but you wouldn’t want to store huge data files on GitHub anyway. I think Ariel Rokem had some ideas for version controlling and sharing larger data sets on a previous post on my blog. Perhaps I should ask him to write a guest post about storing large data sets! In any case, you and Seb are right that this folder structure gets difficult if you’re working across multiple data and analysis machines.

      Perhaps one solution would be to ignore your data/ directory (i.e. not have it indexed by git), but keep the other directories organised between your machines using git? That way you could still have all your code version controlled while moving between multiple machines. If you write some new analyses at home on your laptop, you push them to a central repository (e.g. GitHub), then when you come into the lab you pull from the central repo and your new script is there. I’m not sure, but this might be something worth considering.
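
      Concretely (sketching ahead of the git post), a one-line .gitignore at the top of project_root/ would keep the raw data out of the repository:

        # .gitignore in project_root/ -- don't track raw data files
        data/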
