Scientific programming

Data import in R

In this post, I will demonstrate one way to import and collate a data set (using the R environment). This is a follow up to a post in which I argued that a good principle for reproducible research is to avoid humans touching data. That is, once the data from the experiment are saved we want them to be “read only” and never altered by a human in some undocumented way (such as editing in a spreadsheet).

Using R is not the only way to do the following, and I would encourage you to replicate these steps in the environment of your choice. If your scientific computing environment makes following what I do here really hard, maybe you should consider switching…

Data set

First, we need a data set. To make this more interesting let’s build on a classic paper from vision science.

Imagine we’ve conducted an experiment similar to the classic Campbell & Robson (1968)^1 study but with a few modifications. As a participant in our experiment, you’re seated in front of a monitor showing a grey screen. You’re going to be shown a sequence of trials, and for each trial you make a response with a button press.
You are asked to keep your eyes on a small dot at the centre of the screen. On each trial, a pattern of dark-and-light stripes (a grating) is shown on one side of the screen (left or right of your eye position). The computer randomly decides whether to present the grating on the left or on the right (the other side just stays as the grey background). You have to respond either “grating on the left” or “grating on the right” — you can’t say “I don’t know”. The computer waits for your response before showing the next trial.
We are going to vary both the contrast of the grating pattern (how different from grey the dark and light stripes are) and also the spatial frequency of the pattern (how wide the bars are) over trials.

If the contrast is so low that you can’t see the grating, your responses across many trials will be near chance performance (here 50% correct). If the grating is really easy to see, your performance will be near 100%. We determine how your performance on the task changes as a function of contrast, for each spatial frequency tested.

We’ve tested 5 subjects in this experiment, showing them 7 contrasts at 5 spatial frequencies, with the targets appearing equally often on the left and the right. Each subject did 20 trials for each combination of contrast, spatial frequency and target side (so each subject did 7 * 5 * 2 * 20 = 1400 trials). Let’s say that our experiment program saves the data as a .csv file in our project’s /data/ directory. We have one .csv file per subject, and one of them might look something like this when opened in a text editor:

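To give you the idea, the first few rows of such a file might look something like this (the particular values here are made up for illustration, but the columns match what we import below):

subject,contrast,sf,target_side,response,unique_id
S1,0.0695,0.5,right,right,00004355-345d-403e-b244-79c8adb8f1f8
S1,0.0131,40,right,left,000f3b09-9dde-4dd4-8a97-ad87cfcbc947
S1,0.0695,4.47,left,left,00448030-70e5-4010-b954-4a35c107841e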

A few things to notice here: first, each comma , in the file denotes a new column, and each new line denotes a row. Second, note that there’s a header row: the first line of the file contains the column names for our variables.

Finally, notice how our target_side and response columns contain text strings (left and right). The reason I’ve done this is that it makes the data easily human-readable. It’s obvious what the entries mean (imagine if instead target_side could be either 0 or 1). This can be used to great effect to avoid needing a data key later.

Installing R

This couldn’t be simpler. Go here and get the right binary for your system, install it, then immediately go here and get RStudio, which is awesome. To follow along with my stuff here, you can install any packages I use (the library() calls in future posts) via RStudio’s “Packages” tab.
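If you’d rather install packages from the console than through the Packages tab, install.packages does the same job. For example (using ggplot2, a package that comes up in a later post):

install.packages("ggplot2")  # download and install the package from CRAN
library(ggplot2)             # then load it for the current session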

While I’m going to demonstrate this stuff using R, I would encourage you to follow along in your package of choice. I’d be interested to know how easy / hard it is to duplicate this stuff in other environments (for example, last I used Matlab handling .csv files with mixed numeric and text was a massive pain).

Reading each file into R and putting them together

Now we want to read each subject’s data file into R, then stick the files together to create one big data file.

The paste0 command

To do this, I’m going to make use of the paste command, which allows you to concatenate (stick together) strings. Actually, I’m going to use the paste0 command, which is a shortcut for paste. By default paste adds a space between each pasted item, which we usually don’t want. paste0 just puts together the items you give it. For example:

paste0("A text string", 42, ", another text string")
## [1] "A text string42, another text string"

Notice that R automatically converts the number 42 to text and sticks it together with the preceding and subsequent strings. Usefully, we can also pass a range of numbers, which produces one string per number in the range:

paste0("A text string", 41:43, ", another text string")
## [1] "A text string41, another text string"
## [2] "A text string42, another text string"
## [3] "A text string43, another text string"

Read in the damn files already

The file for subject one is labelled like this:

“data_S1.csv”

and subject 2’s results are in the file “data_S2.csv”, and so on. The following script uses a for loop to read in the data, then appends it to a data frame called dat.

dat <- data.frame()  # create an empty data frame.
for (i in 1:5) {
    file <- paste0(getwd(), "/data/data_S", i, ".csv")
    this_dat <- read.csv(file = file)  # read the subject's file, put in a data frame called this_dat
    dat <- rbind(dat, this_dat)  # append to larger data frame  
}
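As an aside, if you’d rather not hard-code the number of subjects, here is a sketch of an equivalent way to build the same dat (assuming the same /data/ directory and file-naming scheme), by finding all the subject files and stacking them in one go:

files <- list.files(file.path(getwd(), "data"), pattern = "data_S.*\\.csv", full.names = TRUE)  # all subject files
dat <- do.call(rbind, lapply(files, read.csv))  # read each file, then stack into one data frame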

What this for loop gives us is a data frame object called dat. Let’s examine it using the str (“structure”) command:

str(dat)
## 'data.frame':  7000 obs. of  6 variables:
##  $ subject    : Factor w/ 5 levels "S1","S2","S3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ contrast   : num  0.0695 0.0131 0.0695 0.0695 0.3679 ...
##  $ sf         : num  0.5 40 4.47 40 13.37 ...
##  $ target_side: Factor w/ 2 levels "left","right": 2 2 1 1 1 1 2 2 1 2 ...
##  $ response   : Factor w/ 2 levels "left","right": 2 1 1 2 1 2 2 1 2 2 ...
##  $ unique_id  : Factor w/ 7000 levels "00004355-345d-403e-b244-79c8adb8f1f8",..: 451 983 595 395 277 387 132 809 711 582 ...

Data frames

Data frames are the most important (or at least the most useful) data type in R, and what you’re going to be using a lot: many R functions and modelling methods are built around them. The most awesome thing about a data frame is that it can store both numerical data and text. This allows us to read in that csv file no problem, where other basic data types would really struggle (I’m looking at you, Matlab).

Furthermore, data frames can explicitly treat text as a “factor”, which means that when you fit a model, R won’t try to use the variable numerically but will dummy code it instead. Note how in the str call above, several variables (in fact, all those that were strings in the .csv file) have been imported as factors. Let’s see how factors behave by looking at a summary of our data:

summary(dat)
##  subject      contrast            sf        target_side   response   
##  S1:1400   Min.   :0.0025   Min.   : 0.50   left :3500   left :3488  
##  S2:1400   1st Qu.:0.0057   1st Qu.: 1.50   right:3500   right:3512  
##  S3:1400   Median :0.0302   Median : 4.47                            
##  S4:1400   Mean   :0.0927   Mean   :11.97                            
##  S5:1400   3rd Qu.:0.1599   3rd Qu.:13.37                            
##            Max.   :0.3679   Max.   :40.00                            
##                                                                      
##                                 unique_id   
##  00004355-345d-403e-b244-79c8adb8f1f8:   1  
##  000f3b09-9dde-4dd4-8a97-ad87cfcbc947:   1  
##  00448030-70e5-4010-b954-4a35c107841e:   1  
##  0086b264-17ed-4fbb-8e32-8c7814ae6b6a:   1  
##  00a070b7-f849-4727-a710-0453d6f27c50:   1  
##  00b414aa-3f65-4b4d-8d12-f0d41ec7ae42:   1  
##  (Other)                             :6994

See how we get distribution summaries for the numeric variables (e.g. contrast), but for the factors we’re just told how many instances of each level there are? Neat, huh?
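If you’re curious what the dummy coding mentioned above actually looks like, model.matrix will show you how R would expand a factor when fitting a model (just an illustration; we don’t need this for the analysis):

head(model.matrix(~ target_side, data = dat))  # an intercept column plus a 0/1 column named "target_sideright"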

Data munging

In our data file there is a “response” variable, which is a string giving the side the subject chose. What we really want, however, is to know whether they got the trial correct. That is, is the string in “target_side” the same as the string in “response”? Let’s create this new variable now:

dat$correct <- 0  # initialises the variable 'correct' with all zeros.
dat$correct[dat$target_side == dat$response] <- 1  # logical indexing: set correct to 1 wherever target_side matches response
summary(dat$correct)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    1.00    0.77    1.00    1.00
hist(dat$correct)

[Figure: histogram of dat$correct produced by the hist() call above]
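Incidentally, the same variable can be created in a single step; this one-liner is equivalent to the two lines above:

dat$correct <- as.integer(dat$target_side == dat$response)  # TRUE/FALSE coerced to 1/0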

Now we have a variable in the data frame dat that gives a 1 where the subject was correct and a 0 elsewhere. In the next post, I will show some basic graphical exploration of this data set using the ggplot2 package.

PS

This blog was written in R Markdown (in RStudio as a .Rmd file -> “knit HTML”, then paste the .md code directly into WordPress… too easy!)

You can check out the repository for this and some upcoming posts at my Github page.


[1] Campbell, F. W., & Robson, J. G. (1968). Application of Fourier analysis to the visibility of gratings. The Journal of Physiology, 197(3), 551–566.


Version control Part 1: Local repository

Previously I talked about setting up a project directory. Now I’ll run through the first stages of something I see as an integral part of reproducible research: version control.

Why version control?

Loosely, think of “version control” like track changes in Word, but for all the (plain text) files in your project directory. More formally, a version control system is a piece of software that will automatically detect when files in monitored directories change, record which lines of the file changed and how, prompt you to comment why you changed the file, then have you “commit” the changes to a repository (a data bank containing the history of the project directory).

If you’re collaborating on the project with others, the version control system can be used to make sure each collaborator’s files stay up to date with each other, record who changed what, and notice if two people changed a file in conflicting ways (you’ll then be asked to look at the changes and decide which to keep). This is a big advantage over something like sharing a folder on Dropbox. You can even use a version control system to keep different versions of the same project (using “branches”).

Finally, from the perspective of making your science open and reproducible, making the version control history of the project available is close to a gold standard. Not only do users get the version that created the final results, but if they’re so inclined they could search back through your repository history to see how the project evolved. Being able to just switch a repository from private to public also makes it really simple to share your code (say, when the paper comes out). Plenty of other people have written about the usefulness of version control for scientists (see here, here and here, for example), so in this post I’m going to concentrate on the way I use it.

How I set up version control

First, a disclaimer: I am not a version control expert. I use simple tools, and so far they’ve worked for me. Any problems I’ve had I’ve been able to resolve with some Googling and Stack Overflow. In the remainder of this post I’ll show you how I set things up, in the hope that this will be useful to you. Second disclaimer: I use OSX (and a bit of Linux on experimental machines), so my process might have some OSX-specific steps. However, all the software I’m using is cross-platform, so you should be able to port the process to your operating system of choice with a bit of internet searching.

The specific version control software I use is git. I find this to be the simplest system I’ve tried, with the best support for integrating with others and publishing your code. You can do some tutorials and get more info at the git site, or watch this video. I mostly use git through a GUI (graphical user interface) rather than through the command line because I find this simpler for day-to-day work. I use the Github GUI, which you can download here for mac and here for windows. I have found that this does everything I regularly need, in a clean and simple interface. Other people (Diederick, posting on this blog) have recommended SmartGit, which seems more fully-featured and is free for academic use. I haven’t tried this yet, though. So, without further ado, here are the steps for setting up version control in the way I have.

  1. Download and install the Github GUI for your operating system of choice. You might need to create a Github account at this stage (which I would recommend doing anyway). Since I already have everything installed on my setup here, I’m not going to walk you through this stage. It should be well-explained by the app itself, but if there are other steps you need to take, post them in the comments below and I will update this post.

  2. Now open the Github application. We’re going to use this to create a new repository for a new project. I will use the project directory structure I set up before:

[Screenshot: the project directory structure from the earlier post]

In the OSX Github GUI, go to the small plus sign in the lower left corner of the main display and select “Create New Repository”:

[Screenshot: the “Create New Repository” option in the Github GUI]

Select the root directory of your project (in my case, this is called blog_example). Now we have this blank screen for our new repository:

[Screenshot: the new, empty repository in the Github GUI]

The Github GUI has created a new git repository, which lives in a hidden directory within our parent directory. Here you can see the contents of the directory before and after I created the repository:

[Screenshot: terminal output of ls -la before and after creating the repository]

Note how the second call to ls -la has revealed a new (hidden) folder .git. This contains all the files for the git repository, and is what the Github GUI uses. Note that it’s just a normal git repository, so if you want to do something that isn’t possible in the GUI, you can just interact with the repository using the command line.
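For reference, here is a rough sketch of the command-line equivalents of what the GUI is doing in this post (assuming git is installed; the GUI does all of this for you):

cd /path/to/blog_example               # your project's root directory (path is just an example)
git init                               # create the hidden .git repository
git status                             # see which files are new or modified
git add master_manuscript.txt          # stage a file to be committed
git commit -m "created master file"    # record the commit with a message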

Now let’s create a file in our project directory. I made a file in the root directory called master_manuscript.txt, with two lines of text. When we flip back to the Github GUI, we see that it has detected the new file:

[Screenshot: the Github GUI showing the new, uncommitted file]

I enter a commit message “created master file”, hit commit, and there you go. Our git repository has its first local commit:

[Screenshot: the first commit listed under “Unsynced Commits”]

Note how the commit is listed in “Unsynced Commits”. It’s “unsynced” because our local repository hasn’t been synchronised with a remote repository, such as one on Github. That’s fine if you just want to maintain a local repository for yourself. I will discuss syncing with remotes in a future post.

Now let’s try modifying the file master_manuscript.txt. I’m going to delete the original two lines and add something new. How about a Well Thought-Out Englilsh Paper?

[Screenshot: the diff for master_manuscript.txt, with deleted lines in red and added lines in green]

The GUI shows us which lines were deleted (in red) and which were added (green). We can commit this new change, then take a look at our project’s history (in the history tab):

[Screenshot: the history tab showing both commits]

There we can see our original commit (“created master file”) as well as our new one. On second thought, Strong Bad’s Well Thought-Out Englilsh paper is maybe not so well thought out. Let’s revert to the last version. Select our commit “a well thought-out englilsh paper” and click on the gear to the right of the panel, then select “Revert This Commit” (for the difference between revert and roll back, see here). We then get a new commit, telling us that we’ve reverted the content of the file back to the old one:

[Screenshot: the revert commit in the repository history]

If you open up the file master_manuscript.txt, you’ll see that git has changed it back to the original two lines.

Ignoring files

Let’s say that we now have some files in our project directory that we don’t want git to monitor (e.g. a really large data file that would be infeasible to upload to a remote repository). We can add this to the list of files to be ignored by placing it in the file .gitignore. The GUI allows us to do this by right clicking on a file when it appears on the left pane and selecting “ignore”. You can also ignore entire directories by entering them into .gitignore. You can do this in the GUI by going to the Settings pane, to the ignored files section, and entering a line like:

/directory_to_be_ignored/*

The GUI simply adds this line to your .gitignore file.

I usually ignore the /out/ and /figs/ directories, since their contents can be regenerated from the code, and for some of my analyses the contents of /out/ can be rather large.
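For example, with the project layout from the earlier post, the relevant lines of my .gitignore would look something like:

/out/*
/figs/*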

Branching

A branch is basically an independent copy of a state of the repository that allows you to do work in parallel with other changes occurring to the repository. While it’s aimed more at larger collaborative projects, I have found it useful in solo repositories as a way to explicitly maintain old versions of analyses or papers. I will cover branching more in a future post.

Other things to note

Note how in the explanation above I said “for all the (plain text) files in your directory”? A plain text file is something that looks ok when opened in a text editor like Notepad. Such files might have extensions like .txt, .csv, .m, .R, .py, .tex, etc., but they are still readable in a text editor. Version control works beautifully for these. Binary files, however, are another story. You won’t get the full benefit of version control if you try to track something like Word documents, .pdfs or .jpgs: at best your version control system will record that the file changed – but not which lines. It may still work to keep collaborators synced with the latest version of the file, but resolving conflicts would be hard.

A final disclaimer: as with any file, keeping your git repository within a Dropbox folder carries the risk that it could be corrupted. Specifically, if you work from computer A, make changes, and shut down that computer before Dropbox finishes syncing, then subsequently work on the same files from computer B, you will have conflicts when you open up computer A again. Keep a separate backup of your Dropbox folder (e.g. with Time Machine, or Dropbox’s own file history service) to prevent anything nasty happening. This also becomes less of a problem if you push your git repository to a remote server, which is what I will cover in the next post.

UPDATE 12 Feb 2014:

A few people have commented to me on other forums about my selection of git over something else. Specifically, a lot of people find git to be unnecessarily complex for many projects as compared to, say, Subversion.

Alex Holcombe shared the following: “I have started using github (inside Rstudio), and seems to be working, but after reading this I am very afraid http://t.co/hLrUMrIz8J”

Another friend writes: “SVN makes perfect sense to me (I used it in my programming jobs) and would definitely do the trick. Git on the other hand is a brutally complicated, confusing thing, (e.g. https://steveko.wordpress.com/2012/02/24/10-things-i-hate-about-git/) that will require hours to master even for relatively simple tasks. However it is widely in use out in the world of software development, so the skill is a valuable one to develop, and has some super neat features (all that branching). ”

I guess my response to these concerns is just to say that the above is how I’ve done version control, and so far it has worked for me. However, the vast majority of what I’ve done so far is single-user repositories (i.e. just me), where I integrate changes and comments of co-authors myself manually. Perhaps I will learn to hate git when I have to do more with other contributors, but for now the workflow above works for me.

Reproducibility with Psychopy_ext

You may be interested in a recent paper by Jonas Kubilius (link) detailing his reproducibility framework Psychopy_ext. This is a Python wrapper package for PsychoPy (for stimulus generation and presentation) and various data analysis packages that promises to streamline workflows for conducting typical psychophysical experiments. Looks really useful – great work Jonas!

Regarding my last post, you might be particularly interested in Jonas’ Figure 2, which contains a slightly different suggestion for how to lay out a project directory.

My current direction in scientific computing

During my PhD I learned to program in Matlab. I’d never done any programming before that, and I found it to be a rewarding experience. As is typical for people in vision science, I did pretty much everything in Matlab. Stimuli were generated and presented to human subjects using the CRS Visage (in my PhD; programming this thing can be hell) and now the excellent Psychtoolbox. Early on in my PhD I also moved away from SPSS to doing data analysis in Matlab, too.

An early project in my postdoc (see here) involved some more sophisticated statistical analyses than what I had done before. For this, Matlab was an absolute pain. For example, the inability (in base Matlab) to have named columns in a numerical matrix meant that my code contained references to column numbers throughout. This meant that if I wanted to change the order or number of variables going into the analysis I had to carefully check all the column references. Ugly, and ripe for human error.

Cue my switch to R. For statistical analyses R is pretty damn excellent. There are thousands of packages implementing pretty much every statistical tool ever conceived, often written by the statistician who thought up the method. Plus, it does brilliant plotting and data visualisation. Add the ability to define a function anywhere, real namespaces, and the excellent RStudio IDE, and I was hooked. I would try to avoid using Matlab again for anything on the analysis side beyond some light data munging (this is also wrapped up in my preference for doing science in open software).

For several years now I’ve been doing pretty much everything in R. For our latest paper, I also did my best to make the analysis fully reproducible by using knitr, a package that lets you include and run R analyses in a LaTeX document. You can see all the code for reproducing the analysis, figures and paper here. I’m going to work through the work flow that I used to do this in the next few blog posts.

While R is great for stats and plotting, unfortunately I’m not going to be able to fully replace Matlab with R. Why? First, last I checked, R’s existing tools for image processing are pretty terrible. A typical image processing task I might do to prepare an experiment is take an image and filter it in the Fourier domain (say, to limit the orientations and spatial frequencies to a specific band). I spent about a day trying to do this in R a year or so ago, and it was miserable. Second, R has no ability to present stimuli to the screen with any degree of timing or spatial precision. In fact, that would be going well outside its intended purpose (which is usually a bad idea – see Matlab).

So my “professional development” project for this year is to learn some Python, and test out the PsychoPy toolbox. In addition I’m interested in the data analysis and image processing capabilities of Python – see for example scikit-learn, scikit-image and pandas. I’ve had some recent early success with this, which I’ll share in a future post. It would be so great to one day have all my scientific computing happen in a single, powerful, cross platform, open and shareable software package. I think the signs point to that being a Python-based set of tools.