reproducible research

Quick link on organising projects

Felix Wichmann pointed me to this paper on organising project directories.

Some of the tips are a bit specific to bioinformatics / computer science, but one idea I really liked was to have a chronologically-organised electronic notebook for each project. This would sketch out analyses / experiments you tried, what worked, and what didn’t. You could also “transcribe notes from conversations as well as email text into the lab notebook”. This type of document makes way more sense to me than keeping a single general lab notebook for all the projects you’re doing. Perhaps a general notebook works for some people, but to me, notes about a specific project should all be accessible within the project directory. This idea would definitely complement some others that I’ve previously detailed on this blog.

It strikes me that IPython notebooks or R Markdown documents are perfect formats for such a project notebook.
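For instance, a minimal R Markdown project notebook entry might look something like this (a sketch of my own; the dates, prose and file name are purely illustrative):

2014-03-10
Ran the first pilot subject. Performance is near chance at the two lowest contrasts, as expected. Notes from today’s meeting with my co-author are pasted below.

```{r}
dat <- read.csv("data/data_S1.csv")
summary(dat)
```

The key idea is dated prose entries interleaved with runnable analysis chunks, all living inside the project directory.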


Code and data, or it didn’t happen

There is a new motto for reproducible research: “code and data or it didn’t happen” (see also here). I want this to apply more often in vision science, too. Since vision science experiments are usually entirely computer-controlled, implementing this should be a lot easier than in some other fields. For example, in social psychology raw data can be paper-based (participants fill out a questionnaire), making sharing that much harder.

In my mind, vision science journals should begin requiring raw code, data and stimuli (if they’re pictures that can’t be regenerated) to accompany every published empirical article. The code and data would be maintained by the journal in a data repository (e.g. http://thedata.org/).

Why doesn’t this happen already? One attitude I’ve encountered in the field is that “know-how” (for things like stimulus generation or data analysis — proximally, the code that does these things) is somehow an in-house resource, only to be actively shared when people ask for it.

A great passage from this recent article is relevant here:

… authors are not the owners of their methods or results. By publishing their results, original authors state that they have sufficient confidence in them that they should be included in the scientific record. That record belongs to everyone…

So the code and data aren’t yours. They don’t belong to you once the paper is published (and arguably, not before then either). They belong to everyone — particularly if your research was publicly-funded.

Second, many people have the attitude that “well of course I would share the code / data if someone emailed me and asked for it”. Maybe you would, but that doesn’t guarantee that everyone will. I’ve previously encountered a refusal to share materials upon request, even several years after the paper was published. Better that these are made publicly available at the outset, rather than possibly refused (off the public record) later on.

A code / data embargo

The only reason I can think of to refuse to share code and data at the time of publication is that the researcher has a reasonable expectation that they should be able to get more publications out of that material before others have the chance. Many others before me have suggested a compromise solution for this: a code / data embargo. The journal or a third party holds the materials, and after some period of time post-publication, releases it to the public. For example, data dryad allow limited-term data embargoes to be set up.

I think the appropriate embargo period is field-specific. For example, in some fields it might make sense to have quite long embargoes. But human psychophysics data is relatively quick and cheap to collect (possibly excluding clinical data), so I see no reason why such data should be embargoed for more than a year. Perhaps for animal data or brain scanning data other timeframes might be appropriate, but in general, I think most people would be happy with this idea.

Releasing your data to the community could yield interesting scientific advances that you never even considered when you conducted the experiment. This helps everyone (and nets you another citation). For a much more in-depth consideration of this issue, take a look at this recent article.

Data import in R

In this post, I will demonstrate one way to import and collate a data set (using the R environment). This is a follow-up to a post in which I argued that a good principle for reproducible research is to avoid humans touching data. That is, once the data from the experiment are saved we want them to be “read only” and never altered by a human in some undocumented way (such as editing in a spreadsheet).

Using R is not the only way to do the following, and I would encourage you to replicate these steps in the environment of your choice. If your scientific computing environment makes following what I do here really hard, maybe you should consider switching…

Data set

First, we need a data set. To make this more interesting let’s build on a classic paper from vision science.

Imagine we’ve conducted an experiment similar to the classic Campbell & Robson (1968)^1 study but with a few modifications. As a participant in our experiment, you’re seated in front of a monitor showing a grey screen. You’re going to be shown a sequence of trials, and for each trial you make a response with a button press.
On each trial you are asked to keep your eyes on a small dot on the centre of the screen. On each trial, a pattern of dark-and-light stripes (a grating) is shown on one side of the screen (left or right of your eye position). The computer randomly decides whether to present the grating on the left or on the right (the other side just stays as the grey background). You have to respond either “grating on the left” or “grating on the right” — you can’t say “I don’t know”. The computer waits for your response before showing the next trial.
We are going to vary both the contrast of the grating pattern (how different from grey the dark and light stripes are) and also the spatial frequency of the pattern (how wide the bars are) over trials.

If the contrast is so low that you can’t see the grating, your responses across many trials will be near chance performance (here 50% correct). If the grating is really easy to see, your performance will be near 100%. We determine how your performance on the task changes as a function of contrast, for each spatial frequency tested.

We’ve tested 5 subjects in this experiment, showing them 7 contrasts at 5 spatial frequencies, with the targets equally on the left and right. They did 20 trials for each condition (so each subject did 7 * 5 * 20 * 2 = 1400 trials). Let’s say that our experiment program saves the data as a .csv file in our project’s /data/ directory. We have one .csv file per subject, and one of them might look something like this when opened in a text editor:

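(The values in this snippet are illustrative rather than the real recorded data, but the format is the point: a header row, then one comma-separated row per trial.)

subject,contrast,sf,target_side,response,unique_id
S1,0.0695,0.5,right,right,00004355-345d-403e-b244-79c8adb8f1f8
S1,0.0131,40,right,left,000f3b09-9dde-4dd4-8a97-ad87cfcbc947
S1,0.0695,4.47,left,left,00448030-70e5-4010-b954-4a35c107841e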

A few things to notice here. First, each comma , in the file denotes a new column, and each new line denotes a new row. Second, note that there’s a header row: the first line of the file contains column names for our variables.

Finally, notice how our target_side and response columns contain text strings (left and right). The reason I’ve done this is that it makes the data easily human-readable. It’s obvious what the entries mean (imagine if instead target_side could be either 0 or 1). This can be used to great effect to avoid needing a data key later.

Installing R

This couldn’t be simpler. Go here and get the right binary for your system, install it, then immediately go here and get RStudio, which is awesome. To follow along with my stuff here, you can install any packages I use (the library() calls in future posts) via RStudio’s “Packages” tab.
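If you prefer typing to clicking, the same thing can be done at the R console; ggplot2 is just an example here, since it comes up in the next post:

install.packages("ggplot2")  # download and install the package from CRAN
library(ggplot2)  # load it into the current session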

While I’m going to demonstrate this stuff using R, I would encourage you to follow along in your package of choice. I’d be interested to know how easy / hard it is to duplicate this stuff in other environments (for example, the last time I used Matlab, handling .csv files with mixed numeric and text data was a massive pain).

Reading each file into R and putting them together

Now we want to read each subject’s data file into R, then stick the files together to create one big data file.

The paste0 command

To do this, I’m going to make use of the paste command, which allows you to concatenate (stick together) strings. Actually, I’m going to use the paste0 command, which is a shortcut for paste. By default paste adds a space between each pasted item, which we usually don’t want. paste0 just puts together the items you give it. For example:

paste0("A text string", 42, ", another text string")
## [1] "A text string42, another text string"

What we get is that R automatically converts the number “42” to text, and sticks it together with the preceding and subsequent stuff. Usefully, we can also include ranges of numbers, which produces a vector of strings:

paste0("A text string", 41:43, ", another text string")
## [1] "A text string41, another text string"
## [2] "A text string42, another text string"
## [3] "A text string43, another text string"

Read in the damn files already

The file for subject one is labelled like this:

“data_S1.csv”

and subject 2’s results are in the file “data_S2.csv”, and so on. The following script uses a for loop to read in the data, then appends it to a data frame called dat.

dat <- data.frame()  # create an empty data frame.
for (i in 1:5) {
    file <- paste0(getwd(), "/data/data_S", i, ".csv")
    this_dat <- read.csv(file = file)  # read the subject's file, put in a data frame called this_dat
    dat <- rbind(dat, this_dat)  # append to larger data frame  
}
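Hard-coding 1:5 is fine for a fixed design like this one. If you would rather not assume the number of subjects, a pattern-based listing (an alternative sketch, not what I use in the rest of this post) does the same job:

files <- list.files(path = file.path(getwd(), "data"), pattern = "^data_S[0-9]+\\.csv$", full.names = TRUE)
dat <- do.call(rbind, lapply(files, read.csv))  # read each file, then stack the rows into one data frame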

What this for loop gives us is a data frame object called dat. Let’s examine it using the str (“structure”) command:

str(dat)
## 'data.frame':  7000 obs. of  6 variables:
##  $ subject    : Factor w/ 5 levels "S1","S2","S3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ contrast   : num  0.0695 0.0131 0.0695 0.0695 0.3679 ...
##  $ sf         : num  0.5 40 4.47 40 13.37 ...
##  $ target_side: Factor w/ 2 levels "left","right": 2 2 1 1 1 1 2 2 1 2 ...
##  $ response   : Factor w/ 2 levels "left","right": 2 1 1 2 1 2 2 1 2 2 ...
##  $ unique_id  : Factor w/ 7000 levels "00004355-345d-403e-b244-79c8adb8f1f8",..: 451 983 595 395 277 387 132 809 711 582 ...

Data frames

Data frames are the most important (or at least the most useful) data type in R, and the one you’re going to be using a lot. Many of R’s modelling and plotting functions expect data frames. The most awesome thing about a data frame is that it can store both numerical data and text. This allows us to read in that csv file no problem, where other basic data types would really struggle (I’m looking at you, Matlab).

Furthermore, data frames can explicitly treat text as a “factor”, which means that when you fit a model, R won’t try to use the variable numerically but will dummy code it instead. Note how in the str call above, several variables (in fact, all those that were strings in the .csv file) have been imported as factors. Let’s see how factors behave by looking at a summary of our data:

summary(dat)
##  subject      contrast            sf        target_side   response   
##  S1:1400   Min.   :0.0025   Min.   : 0.50   left :3500   left :3488  
##  S2:1400   1st Qu.:0.0057   1st Qu.: 1.50   right:3500   right:3512  
##  S3:1400   Median :0.0302   Median : 4.47                            
##  S4:1400   Mean   :0.0927   Mean   :11.97                            
##  S5:1400   3rd Qu.:0.1599   3rd Qu.:13.37                            
##            Max.   :0.3679   Max.   :40.00                            
##                                                                      
##                                 unique_id   
##  00004355-345d-403e-b244-79c8adb8f1f8:   1  
##  000f3b09-9dde-4dd4-8a97-ad87cfcbc947:   1  
##  00448030-70e5-4010-b954-4a35c107841e:   1  
##  0086b264-17ed-4fbb-8e32-8c7814ae6b6a:   1  
##  00a070b7-f849-4727-a710-0453d6f27c50:   1  
##  00b414aa-3f65-4b4d-8d12-f0d41ec7ae42:   1  
##  (Other)                             :6994

See how we get distribution summaries for the covariates (e.g. contrast), but are only told how many instances of each factor level there are? Neat, huh?
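To make the “dummy coding” I mentioned above concrete, base R’s levels() and contrasts() functions show how a two-level factor like target_side gets coded when it enters a model:

levels(dat$target_side)
## [1] "left"  "right"
contrasts(dat$target_side)  # the 0/1 coding R would use for this factor in a model
##       right
## left      0
## right     1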

Data munging

In our data file there is a “response” variable, which is a string giving the side the subject responded. What we really want, however, is to know whether they got the trial correct. That is, is the string in “target_side” the same as the string in “response”? Let’s create this new variable now:

dat$correct <- 0  # initialises the variable 'correct' with all zeros.
dat$correct[dat$target_side == dat$response] <- 1  # logical indexing: set correct to 1 wherever target_side matches response
summary(dat$correct)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    1.00    0.77    1.00    1.00
hist(dat$correct)

[Figure: histogram of the correct variable]
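Incidentally, the same variable can be created in a single line by converting the logical comparison directly to numbers (an equivalent alternative to the two lines above):

dat$correct <- as.numeric(dat$target_side == dat$response)  # TRUE becomes 1, FALSE becomes 0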

Now we have a variable in the data frame dat that gives a 1 where the subject was correct and a 0 elsewhere. In the next post, I will show some basic graphical exploration of this data set using the ggplot2 package.

PS

This blog post was written in R Markdown (in RStudio as a .Rmd file -> “knit HTML”, then paste the .md code directly into WordPress… too easy!)

You can check out the repository for this and some upcoming posts at my Github page.


[1] Campbell, F. W., & Robson, J. G. (1968). Application of Fourier analysis to the visibility of gratings. The Journal of Physiology, 197(3), 551–566.

Version control Part 2: Remote repository

The second stage in my version control workflow is to push my local changes to a remote repository. A remote repository is basically an identical repository to the one stored locally on your computer, but is on a remote server somewhere in the internet ether. Much like using Dropbox, this provides an additional layer of backup for your project (with the advantage of a full version history). So if you ever lose your local copy of your project for some reason, you can just re-clone it from the remote repository to get everything back (not including files that were never committed, of course). ** NOTE that I don’t recommend using this, or any one tool, as your only backup: your scientific projects should be backed up with multiple means, in multiple locations, all the time **.

However, the main advantage of pushing things up to a remote repository is that this facilitates sharing. With various methods that I’ll outline below, you can keep the remote repository private and invite your collaborators to use it, or you can make it public so that anyone can see it, clone it, etc (though of course in this case, you control whether to use anyone else’s changes or not).

I use two services to host remote repositories: Github and Bitbucket. These companies offer similar services with a few key differences, which means that in my current workflow I switch between both.

Github

Github is the “one that started it all”. They have a really slick web interface, awesome graphs for looking at repository activity, great tools for interacting with other members, wikis and issue trackers that can be associated with a repo, and a big user base. Plus they offer the free GUI that I talked about in my last post. However, their pricing structure is that they charge you to have private repositories. That is, they host unlimited public repositories – i.e. anyone can see the repository’s contents, contributors and history. If you want to keep your repository to a few invited collaborators however, you need a paid account. Seven dollars a month gets you 5 private repositories. The idea here would be that you have some projects on the go, then when one is ready for sharing (say, the article is accepted), you switch the repo from “private” to “public”. Now everyone can see your code and data, and you have another private repo slot to use.

However, since I know that I would need more than 5 private repos (projects languishing, maybe one day, etc), I’ve so far avoided a paid Github account (the idea of just working with everything open is for another post). Thankfully we’re helped by Bitbucket.


UPDATE: Thanks to Ariel Rokem for pointing out in the comments that Github actually offer a Micro plan (5 private repositories) for educational users. Send them an email with your educational email address at this site.


 

Bitbucket

Bitbucket is basically Github with a different pricing structure. Their web interface and user community is a fair bit behind Github. For scientists however, the advantage is that they offer unlimited free private code repositories. The catch is that you’re only allowed 5 collaborators (i.e. people who have joined any of your repositories, like co-authors). However, an academic email address will get you unlimited collaborators too, so this is essentially a free service.

Using Remote Repositories on Bitbucket with the Github GUI

Generously, Github have not restricted their GUI to working only with Github-hosted repositories. So what I do is basically use the Github GUI to manage my version control day-to-day, but push the local repository to a remote repository on Bitbucket. I can share this with collaborators and keep it private.

Here’s how:

  1. Set up a local repository as explained here.
  2. Log in to your Bitbucket account in a web browser.
  3. Follow the steps to set up a new repository. Select “git” as the version control flavour.
  4. This should then give you an option to “push up an existing repository”.
  5. From the command Bitbucket shows you starting with `git remote add origin`, copy the link to your repository (something like `git@bitbucket.org:tomwallis/test.git`). This might look different depending on whether you’re using SSH or a password to authenticate (if you’re using a password, your link will start with https; either works). The Bitbucket / Github help pages will explain how to set up an SSH key if you’d like to do that. A command-line sketch of the remote setup and push appears after this list.
  6. In the Github GUI, open your local repository and go to the “Settings” pane. On the line that says “Primary Remote Repository”, paste in the link to your repo. Hit “Update Remote”.
  7. Switch back to the “Changes” pane of the Github GUI. See the button in the top left? It should have changed from “Push to Github” to “Sync Branch” (if not, close and re-open the Github GUI).
  8. Press this button. You might be asked for your password (depending whether you’ve set up an SSH key).
  9. Github should sync, and the list of “unsynced commits” should disappear.
  10. Refresh your web browser on your Bitbucket account. Your code should now be in your private repository on Bitbucket!
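If you prefer the command line to the GUI for the remote setup and push (steps 5–9), it boils down to something like this, with the URL swapped for your own repository:

git remote add origin git@bitbucket.org:yourusername/yourproject.git
git push -u origin master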

You can now share this repository with your co-workers, friends and family, and take advantage of all the nice things about collaborating with version control!

When the code is ready to be made public, you can simply push the repository to a public repository on Github by changing the primary remote in the settings pane. This lets you take advantage of the bigger user community (for example, see the Psychtoolbox, PsychoPy, Psychopy_ext and Julia projects) and better web tools on Github for publishing your code. Maybe in the future I would just use Github exclusively (i.e. paying for private repos), but for now the dual solution (Bitbucket, Github) works well. Of course, you can also just make your Bitbucket repository public and not worry about using Github at all, but then you’re stuck with their (relatively) clunky web interface.
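On the command line, changing the primary remote amounts to something like the following (again, the URL is a placeholder for your own public repository):

git remote set-url origin git@github.com:yourusername/yourproject.git
git push -u origin master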

As always, test this process out for yourself to see that it works for you before using it for important stuff, and always keep independent backups of your project and data (Dropbox, Time Machine, etc).