A workflow principle: Avoiding humans touching data

One key principle for creating a reproducible workflow is that everything that happens to the data should be documented, and ideally automatically repeatable. The best way to do this is to make sure that the data are only ever manipulated by lines of a computer program and, wherever possible, never by hand.

In a behavioural experiment, your raw data will often be a set of files with one observation (a trial, a survey respondent, etc.) per row. There will be a column for each variable and condition of the experiment, along with the response for that row.
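To make this concrete, here's a minimal sketch in Python of reading one such raw file. The column names (subject, block, condition, response, rt) are made up for illustration; yours will differ.

    import io
    import pandas as pd

    # One trial per row: columns for the experimental conditions, plus the
    # response and response time for that trial. These names are hypothetical.
    raw_csv = io.StringIO(
        "subject,block,condition,response,rt\n"
        "s01,1,congruent,left,0.412\n"
        "s01,1,incongruent,right,0.587\n"
    )

    trials = pd.read_csv(raw_csv)
    print(trials)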

These files should never be modified once they have been output by your experiment code or questionnaire software: treat them as read-only.
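If you want to enforce this at the file-system level, one option is to strip write permission from the raw files as soon as the experiment has written them. A small sketch, assuming the raw files sit in a data/raw directory:

    import stat
    from pathlib import Path

    # Remove write permission from every raw data file, so nothing can quietly
    # overwrite it later. The data/raw directory is an assumption about layout.
    for raw_file in Path("data/raw").glob("*.csv"):
        mode = raw_file.stat().st_mode
        raw_file.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))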

Instead, the first step of your analysis is to write a script that opens the experiment files, checks that all the values make sense, and compiles all the information you want for your analysis into a master file for further processing. This might mean combining all the blocks of trials from different subjects into a single file, for example. All of your later analysis scripts then work from this munged file.
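Here's a minimal sketch of such a munging script, assuming one CSV per subject in data/raw and the made-up condition and rt columns from the example above:

    from pathlib import Path
    import pandas as pd

    RAW_DIR = Path("data/raw")
    MASTER_FILE = Path("data/processed/all_trials.csv")

    frames = []
    for raw_file in sorted(RAW_DIR.glob("*.csv")):
        df = pd.read_csv(raw_file)

        # Basic sanity checks -- adapt these to your own design.
        assert df["rt"].between(0, 10).all(), f"implausible RT in {raw_file}"
        assert df["condition"].isin(["congruent", "incongruent"]).all(), (
            f"unknown condition in {raw_file}"
        )

        frames.append(df)

    # Stack all subjects into one table and write the master file.
    master = pd.concat(frames, ignore_index=True)
    MASTER_FILE.parent.mkdir(parents=True, exist_ok=True)
    master.to_csv(MASTER_FILE, index=False)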

The workflow for your whole analysis continues in this fashion: a series of scripts that can be executed sequentially, with the eventual output being all the figures and numbers for the paper. You're trying to avoid any stage at which you manually interact with the data. If you find yourself copying and pasting columns of data in a spreadsheet, alarm bells should be ringing: however convenient it feels, it's neither reproducible nor easily repeatable.
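One simple way to tie the stages together is a top-level runner that executes each script in order. A sketch, with made-up script names:

    import subprocess
    import sys

    # Each stage of the analysis is its own script, run in order, so the whole
    # path from raw data to figures can be reproduced with a single command.
    STEPS = [
        "01_munge_raw_data.py",
        "02_exclusions_and_summaries.py",
        "03_fit_models.py",
        "04_make_figures.py",
    ]

    for step in STEPS:
        print(f"Running {step} ...")
        subprocess.run([sys.executable, step], check=True)

Running that one file regenerates everything from the raw data to the figures and numbers in the paper; a build tool like GNU Make can do the same job while only re-running the stages whose inputs have changed.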
