Month: January 2014

Reproducibility with Psychopy_ext

You may be interested in a recent paper by Jonas Kubilius (link) detailing his reproducibility framework Psychopy_ext. This is a Python package that wraps PsychoPy (for stimulus generation and presentation) together with various data analysis packages, and promises to streamline the workflow for conducting typical psychophysical experiments. Looks really useful – great work Jonas!

Regarding my last post, you might be particularly interested in Jonas’ Figure 2, which contains a slightly different suggestion for how to lay out a project directory.


Setting up a project directory

The first step in making a research project reproducible is to make sure that the files associated with it are organised. Since scientific projects will change (sometimes a lot) as you work on them, you're aiming for a structure that will cope with shifts in direction and scope as you go.

Here’s an example of what not to do, from my own archive:

[Figure: screenshot of a messy, disorganised project folder]

If this looks familiar, then read on. This post is going to run you through, step by step, how to set up a project directory like the one used here. In the following post I will talk about setting up version control using git, which is an integral part of keeping your project folder organised and its history intact. The next few posts will run through some basic data handling.

My directory structure and approach are almost entirely based on this great post from the blog Nice R Code (see what they did there?). You should tweak the process to something that suits you, but I hope this will be a useful starting point.

You’re going to have a root directory for your project, with five subdirectories:

project_root/
    data/
    exp/
    figs/
    funs/
    out/

Note how my root directory is project_root and not project root. Many programs don't like spaces in file paths, so it's a good habit to avoid them when naming your files and folders.
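
If you like, you can even script the setup. Here's a minimal R sketch (the project name is just a placeholder) that creates the skeleton above:

    # create the project skeleton shown above
    # ("my_project_root" is a placeholder -- use your own project name)
    dirs <- file.path("my_project_root", c("data", "exp", "figs", "funs", "out"))
    for (d in dirs) {
      dir.create(d, recursive = TRUE, showWarnings = FALSE)
    }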

You're trying to keep your project_root/ directory as clean as possible. In the way I work, this directory is going to contain two key files: your manuscript file (e.g. a LaTeX .tex file or a Sweave / knitr .Rnw file) and your master analysis file (analysis.R or analysis.txt or similar).

The exp/ directory is going to contain all the scripts and programs that are associated with data collection. For example, all of your Psychtoolbox or PsychoPy code is going in here. You could also create subfolders for storing stimuli like pictures or movies. All of that should go in exp/.

Any data files saved by your experiment are going to be stored in data/. That is, your experiment script, when finished, is going to write a file like project_root/data/subject_01_session_02_2014_01_16.csv. You should treat everything in the data/ directory as read-only once created: your analysis scripts should read the files in data/ but never write to them. All your data analysis is going to depend on a processed data file that you will store in out/. Likewise, as I previously discussed, the files in data/ should never be edited by a human.

The out/ directory stores any numerical results of your data analysis that you want to write to disk, such as processed data files, model fits, MCMC samples, and so on. Take the approach that you should be able to delete everything in your out/ directory and then regenerate it by running your analysis scripts again.
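
As a rough sketch of what this looks like in R (the file and variable names are made up), a fitting script might write its result to out/, and a later script can read it back in rather than refitting:

    # sketch: a script fits a model and writes the result to out/
    # ("dat" stands in for your processed data; the file name is made up)
    dat <- data.frame(condition = rep(c("a", "b"), each = 10),
                      rt = rnorm(20, mean = 0.5, sd = 0.1))
    fit <- lm(rt ~ condition, data = dat)
    saveRDS(fit, file = "out/model_fit.rds")

    # a later script reloads the saved fit instead of re-fitting
    fit <- readRDS("out/model_fit.rds")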

figs/ stores all the graphical output from your analysis. Your manuscript file (which is going to live in the root directory, or maybe in a separate doc/ directory if you prefer) is going to populate itself with the figures in the figs/ directory. Well, if you use Word then I think you have to paste the figures in yourself every time you make new ones, but if you use LaTeX or Markdown the manuscript will populate itself. Like the out/ directory, try to take the attitude that everything in figs/ can be deleted and regenerated from your master analysis file. One caveat here is multi-panel figures that combine data and diagrams, like those popular in vision science. Usually these need to be arranged in an Illustrator-like program (such as Inkscape). I doubt there is an easy solution for automating this sort of thing, so those figures are just going to have to be created manually. However, all panels containing data should still be generated automatically.
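
For those data panels, here's a sketch of what I mean in R (the file name and data are placeholders; iris is just a built-in example dataset):

    # sketch: figures are written to figs/ by a script, never saved by hand
    pdf("figs/example_figure.pdf", width = 4, height = 4)
    plot(Sepal.Length ~ Species, data = iris,
         xlab = "Species", ylab = "Sepal length")
    dev.off()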

Finally, the funs/ directory is going to contain all your analysis functions. These are the things that your master analysis file is going to call on to do the work. My approach differs a little from Nice R Code here, in that I'm happy for funs/ to contain some scripts that actually do work (and not just function definitions). Also, I call my directory funs/ rather than R/ because I'm not just going to have R code in there. For me the important thing is that your master analysis file should be as clean as possible, acting only as an overview or makefile that tells the reproducer of your analysis the order in which the code in funs/ should be run. If you're just using R, like I was in my previous projects, you can run this file and the whole analysis comes out.
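
To give you an idea, the master analysis file might be nothing more than a few source() calls run in order (the script names in funs/ here are hypothetical):

    # analysis.R -- master analysis file (a sketch)
    # running this from top to bottom reproduces the whole analysis
    source("funs/01_compile_data.R")   # read data/, write out/processed_data.csv
    source("funs/02_fit_models.R")     # read out/, write model fits back to out/
    source("funs/03_make_figures.R")   # read out/, write figures to figs/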

OK. In the next post I’m going to run over how I set up version control for a project directory like this, then we can get into some basic data import / export.

My current direction in scientific computing

During my PhD I learned to program in Matlab. I'd never done any programming before that, and I found it to be a rewarding experience. As is typical for people in vision science, I did pretty much everything in Matlab. Stimuli were generated and presented to human subjects using the CRS Visage (during my PhD; programming that thing can be hell) and now the excellent Psychtoolbox. Early on in my PhD I also moved away from SPSS and started doing data analysis in Matlab.

An early project in my postdoc (see here) involved more sophisticated statistical analyses than I had done before. For this, Matlab was an absolute pain. For example, the inability (in base Matlab) to have named columns in a numerical matrix meant that my code contained references to column numbers throughout. If I wanted to change the order or number of variables going into the analysis, I had to carefully check every column reference. Ugly, and ripe for human error.
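
To make the contrast concrete, here is the sort of thing named columns buy you in R (the variable names are made up):

    # with a data frame, columns are referred to by name rather than position,
    # so reordering or adding variables doesn't silently break the analysis
    dat <- data.frame(subject  = c(1, 1, 2, 2),
                      contrast = c(0.1, 0.2, 0.1, 0.2),
                      correct  = c(0, 1, 1, 1))
    mean(dat$correct)   # refer to the column by name...
    mean(dat[, 3])      # ...rather than by its position, Matlab-style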

Cue my switch to R. For statistical analyses R is pretty damn excellent. There are thousands of packages implementing pretty much every statistical tool ever conceived, often written by the statistician who thought up the method. Plus, it does brilliant plotting and data visualisation. Add the ability to define a function anywhere, real namespaces and the excellent RStudio IDE, and I was hooked. I would try to avoid using Matlab again for anything on the analysis side beyond some light data munging (this is also wrapped up in my preference for doing science in open software).

For several years now I've been doing pretty much everything in R. For our latest paper, I also did my best to make the analysis fully reproducible by using knitr, a package that lets you include and run R analyses in a LaTeX document. You can see all the code for reproducing the analysis, figures and paper here. I'm going to work through the workflow that I used to do this in the next few blog posts.
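
For a flavour of how that works (the chunk and file names here are made up), a .Rnw file mixes LaTeX with chunks of R; when the document is compiled, knitr runs the chunks and weaves their output into the text:

    % inside a .Rnw file: a minimal knitr chunk (names are hypothetical)
    <<mean-accuracy, echo=FALSE>>=
    dat <- read.csv("out/processed_data.csv")
    acc <- mean(dat$correct)
    @
    Observers were correct on \Sexpr{round(acc * 100)}\% of trials.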

While R is great for stats and plotting, unfortunately I'm not going to be able to fully replace Matlab with R. Why? First, last I checked, R's existing tools for image processing are pretty terrible. A typical image processing task I might do to prepare an experiment is to take an image and filter it in the Fourier domain (say, to limit the orientations and spatial frequencies to a specific band). I spent about a day trying to do this in R a year or so ago, and it was miserable. Second, R has no ability to present stimuli to the screen with any degree of timing or spatial precision. In fact, that would go well outside its intended purpose (which is usually a bad idea – see Matlab).

So my "professional development" project for this year is to learn some Python and test out the PsychoPy toolbox. In addition, I'm interested in the data analysis and image processing capabilities of Python – see for example scikit-learn, scikit-image and pandas. I've had some early success with this recently, which I'll share in a future post. It would be so great to one day have all my scientific computing happen in a single powerful, cross-platform, open and shareable software package. I think the signs point to that being a Python-based set of tools.

A workflow principle: Avoiding humans touching data

One key principle for creating a reproducible workflow is that everything that happens to the data should be documented, and ideally repeatable automatically. The best way to do this is to make sure that the data are only ever manipulated by lines of a computer program, and wherever possible, never by the manual work of a human.

In a behavioural experiment, your raw data are often going to be a bunch of files containing one observation (trial, survey respondent, etc.) per row. There will be a number of columns for the variables and conditions of the experiment, along with some response for that row.

These files should never be modified once output by your experiment code or questionnaire software: they are read-only.

Instead, the first step of your analysis is to write a script that opens the experiment files, checks that all the values make sense, and compiles all the information you want for your analysis into a master file for further processing. This might involve combining all the blocks of trials from different subjects into one file, for example. All of your future analysis scripts are then going to use this munged file.
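
A minimal R sketch of such a script (the file pattern, column names and sanity check are placeholders for your own):

    # compile all raw files in data/ into one processed file in out/
    files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
    raw <- do.call(rbind, lapply(files, read.csv))

    # basic sanity checks before writing anything out
    stopifnot(all(raw$correct %in% c(0, 1)))

    # write the munged file that all later analysis scripts will read
    write.csv(raw, "out/processed_data.csv", row.names = FALSE)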

The workflow for your whole analysis is going to continue in this fashion: a series of scripts that can be executed sequentially, with the eventual output being all the figures and numbers for the paper. You're trying to avoid any stage at which you manually interact with the data. If you find yourself copying and pasting columns of data in a spreadsheet, alarm bells should be ringing. While it often feels convenient, it's not reproducible or easily repeatable.