
Simulating data

Generating data from a probabilistic model with known parameters is a great way to test that any analysis you have written is working correctly, and to understand statistical methods more generally. In this post I'm going to show you an example of generating (simulating) data from a probabilistic model. A probabilistic model is stochastic: it doesn't generate the same data set each time the simulation is run (unlike a deterministic model).

The following is actually the code that I used to generate the data in my posts on importing and plotting data, so if you've been following this blog then you're already familiar with the output of this post. The simulated data set consists of 5 subjects, each shown 20 trials at every combination of 7 contrasts, 5 spatial frequencies and 2 target sides.

The predictor variables

First we set up all the predictor variables (independent or experimental variables) we’re interested in.

Variable vectors

(subjects <- paste0("S", 1:5))  # the double parentheses just serve to print the output
## [1] "S1" "S2" "S3" "S4" "S5"

As we’ve discussed, the paste0 command is for concatenating (sticking together) strings. The command above creates a character vector of five strings, starting with “S1” and ending with “S5”.

(contrasts <- exp(seq(-6, -1, l = 7)))
## [1] 0.002479 0.005704 0.013124 0.030197 0.069483 0.159880 0.367879

This creates a vector of length 7 containing log-spaced contrast values… and so on for the other variables:

(sfs <- exp(seq(log(0.5), log(40), l = 5)))
## [1]  0.500  1.495  4.472 13.375 40.000
target_sides <- c("left", "right")
(n_trials <- 20)
## [1] 20
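If you want to convince yourself that the contrast and spatial frequency values really are log-spaced, a quick optional check is that their successive differences are constant on the log scale:

diff(log(contrasts))  # equal steps in log units
diff(log(sfs))        # likewise for spatial frequency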

Combining the variables

Now that we’ve set up the independent variables from the experiment, we want to create a data frame with one row for each combination of the variables for each subject. We can do this easily with R’s expand.grid command.

dat <- expand.grid(subject = subjects, contrast = contrasts, sf = sfs, target_side = target_sides, 
    trial = 1:n_trials)
head(dat)
##   subject contrast  sf target_side trial
## 1      S1 0.002479 0.5        left     1
## 2      S2 0.002479 0.5        left     1
## 3      S3 0.002479 0.5        left     1
## 4      S4 0.002479 0.5        left     1
## 5      S5 0.002479 0.5        left     1
## 6      S1 0.005704 0.5        left     1

You can see from the first few rows of dat what expand.grid has done: it has created a data frame (called dat) that contains each combination of the variables entered. A call to summary shows us how R has usefully made factors out of string variables (subject and target side):

summary(dat)
##  subject      contrast            sf        target_side      trial      
##  S1:1400   Min.   :0.0025   Min.   : 0.50   left :3500   Min.   : 1.00  
##  S2:1400   1st Qu.:0.0057   1st Qu.: 1.50   right:3500   1st Qu.: 5.75  
##  S3:1400   Median :0.0302   Median : 4.47                Median :10.50  
##  S4:1400   Mean   :0.0927   Mean   :11.97                Mean   :10.50  
##  S5:1400   3rd Qu.:0.1599   3rd Qu.:13.37                3rd Qu.:15.25  
##            Max.   :0.3679   Max.   :40.00                Max.   :20.00

You can tell that something is a factor because instead of showing the summary statistics (mean, median) it just shows how many instances of each level there are.
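Another quick check on what expand.grid produced: the number of rows should be the product of the sizes of the variables we combined.

nrow(dat)  # 5 subjects * 7 contrasts * 5 sfs * 2 sides * 20 trials = 7000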

To make the modelling a bit simpler, we’re also going to create a factor of spatial frequency:

dat$sf_factor <- factor(round(dat$sf, digits = 2))
summary(dat)
##  subject      contrast            sf        target_side      trial      
##  S1:1400   Min.   :0.0025   Min.   : 0.50   left :3500   Min.   : 1.00  
##  S2:1400   1st Qu.:0.0057   1st Qu.: 1.50   right:3500   1st Qu.: 5.75  
##  S3:1400   Median :0.0302   Median : 4.47                Median :10.50  
##  S4:1400   Mean   :0.0927   Mean   :11.97                Mean   :10.50  
##  S5:1400   3rd Qu.:0.1599   3rd Qu.:13.37                3rd Qu.:15.25  
##            Max.   :0.3679   Max.   :40.00                Max.   :20.00  
##  sf_factor   
##  0.5  :1400  
##  1.5  :1400  
##  4.47 :1400  
##  13.37:1400  
##  40   :1400  
## 

The data model

Now that we have the data frame for the experimental variables, we want to think about how to generate a lawful (but probabilistic) relationship between the experimental variables and the outcome variable (in this case, getting a trial correct or incorrect). I am going to do this in the framework of a binomial Generalised Linear Model (GLM).

For this data set, let's say that each subject's chance of getting the trial correct is a function of the contrast and the spatial frequency on the trial. I'm going to treat contrast as a continuous predictor (aka a covariate) and spatial frequency as a discrete variable (a factor). I could treat both as continuous, but visual sensitivity as a function of spatial frequency is non-monotonic, so this blog post is simpler if I just treat my five levels as discrete. In the simple GLM I'm going to use, this means that each subject will have six coefficients: an intercept, the slope of contrast, and an offset for each of the four non-reference levels of spatial frequency.

Design matrix

R is pretty amazing for doing stuff in a GLM framework. For this example, we can generate the design matrix for our data frame dat using the model.matrix function:

X <- model.matrix(~log(contrast) + sf_factor, data = dat)
head(X)
##   (Intercept) log(contrast) sf_factor1.5 sf_factor4.47 sf_factor13.37
## 1           1        -6.000            0             0              0
## 2           1        -6.000            0             0              0
## 3           1        -6.000            0             0              0
## 4           1        -6.000            0             0              0
## 5           1        -6.000            0             0              0
## 6           1        -5.167            0             0              0
##   sf_factor40
## 1           0
## 2           0
## 3           0
## 4           0
## 5           0
## 6           0

The formula ~ log(contrast) + sf_factor tells R that we want to predict something using contrast (a covariate) and additive terms of the factor sf_factor. If we changed the + to a * this would give us all the interaction terms too (i.e. the slope would be allowed to vary with spatial frequency). Note how our first 6 rows have all zeros for the columns of sf_factor: this is because all the first rows in dat are from sf 0.5, which here is the reference level.
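If you're curious what the interaction version would add, a quick way to see it (without changing anything we do below) is to look at the column names of the expanded design matrix:

colnames(model.matrix(~log(contrast) * sf_factor, data = dat))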

R has automatically dummy coded sf_factor, dropping the reference level (this defaults to the first level of the factor). A trial with a spatial frequency of 0.5 will have zeros for all the sf_factor columns of the design matrix. A trial with a spatial frequency of 1.5 will have a one in the sf_factor1.5 column and zeros in the other sf_factor columns.
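You can inspect the coding R will use for the factor directly with the contrasts function (note that this finds the stats function contrasts(), not our vector of contrast values); each column corresponds to one non-reference level:

contrasts(dat$sf_factor)  # treatment (dummy) coding, reference level 0.5 dropped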

Coefficients (parameters)

Next we need our "true" coefficients for each subject. I will generate these as samples from a normal distribution (this is where the first stochastic part of our data generation comes in). But first, I will set R's random number seed so that you can reproduce the results described here (i.e. this will make the result deterministic). If you want stochasticity again, just comment out this line:

set.seed(424242)

Now the coefficients:

b0 <- rnorm(length(subjects), mean = 7, sd = 0.2)  # intercept
b1 <- rnorm(length(subjects), mean = 2, sd = 0.2)  # slope of contrast
b2 <- rnorm(length(subjects), mean = 2, sd = 0.2)  # sf 1.5
b3 <- rnorm(length(subjects), mean = 1.5, sd = 0.2)  # sf 4.4
b4 <- rnorm(length(subjects), mean = 0, sd = 0.2)  # sf 13
b5 <- rnorm(length(subjects), mean = -2, sd = 0.2)  # sf 40

print(betas <- matrix(c(b0, b1, b2, b3, b4, b5), nrow = 5))
##       [,1]  [,2]  [,3]  [,4]    [,5]   [,6]
## [1,] 7.088 2.445 2.101 1.553 -0.2687 -2.138
## [2,] 7.194 1.974 2.279 1.722  0.1459 -2.243
## [3,] 7.152 2.232 2.136 1.510  0.2344 -2.189
## [4,] 7.034 2.070 2.242 1.391  0.1914 -1.790
## [5,] 7.085 1.922 1.705 1.541  0.3116 -1.871

The above matrix has the subjects in the rows and the coefficients in the columns. That is, subject 2's "true" (i.e. generating) parameters are 7.194, 1.974, 2.279, etc.
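Since matrix() fills column-wise, each of the coefficient vectors above becomes one column. A quick way to convince yourself of the layout is to compare one row of betas with the corresponding elements of the individual vectors:

betas[2, ]  # subject 2's generating coefficients
c(b0[2], b1[2], b2[2], b3[2], b4[2], b5[2])  # should be the same numbers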

Generating predictions

To generate predictions in the linear space of the GLM, we take the product of the design matrix and the betas for each subject:

eta <- rep(NA, length = nrow(dat))
for (i in 1:length(subjects)) {
    # for each subject
    this_subj <- levels(dat$subject)[i]  # which subject?
    subj_rows <- dat$subject == this_subj  # which rows belong to this subject?

    # create a design matrix, and pull out the betas for this subject:
    this_X <- model.matrix(~log(contrast) + sf_factor, data = dat[subj_rows, 
        ])
    this_betas <- betas[i, ]

    # mult, stick into eta:
    eta[subj_rows] <- this_X %*% this_betas
}

# stick eta into dat:
dat$eta <- eta

The multiplication of the design matrix and the subject's beta vector above is equivalent to, for each row, multiplying each number in the design matrix by the corresponding beta value, then summing all these products to produce one number per row. This number is the linear predictor (and if we were dealing with simple linear regression, we'd be done now).
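As an aside, because we already built a design matrix X for the whole data frame, the same numbers can be computed without the loop by picking out each trial's row of betas and summing the element-wise products (a sketch; the result should match the loop above):

eta_alt <- rowSums(X * betas[as.integer(dat$subject), ])
max(abs(eta_alt - dat$eta))  # should be zero (up to floating point)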

For our application however, we want to turn the linear predictor eta into a probability that ranges from 0.5 (chance performance on the task) to 1. To do this we're going to use one of the custom link functions in the psyphy package for fitting psychophysical data in R.

library(psyphy)
links <- mafc.weib(2)  # creates a list of functions.

dat$p <- links$linkinv(dat$eta)  # run the linear predictor through the inverse link function to get a probability.
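A quick sanity check: the resulting probabilities should all lie between chance (0.5) and 1.

range(dat$p)  # should lie within [0.5, 1]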

To see whether the parameters are in a decent range, plot the predicted probabilities:

library(ggplot2)
fig <- ggplot(dat, aes(x = contrast, y = p)) + geom_line() + facet_grid(sf_factor ~ 
    subject) + coord_cartesian(ylim = c(0.45, 1.05)) + scale_x_log10() + scale_y_continuous(breaks = c(0.5, 
    0.75, 1)) + theme_minimal()
fig

[Figure: predicted probability of a correct response as a function of contrast, faceted by spatial frequency (rows) and subject (columns)]

Finally, we want to generate a binary outcome (success or failure, usually denoted 1 and 0) with probability p for each trial. We do this using R's rbinom function, which generates random samples from the binomial distribution (but remember, since we set the seed above, you can reproduce my numbers):

dat$y <- rbinom(nrow(dat), 1, prob = dat$p)
summary(dat)
##  subject      contrast            sf        target_side      trial      
##  S1:1400   Min.   :0.0025   Min.   : 0.50   left :3500   Min.   : 1.00  
##  S2:1400   1st Qu.:0.0057   1st Qu.: 1.50   right:3500   1st Qu.: 5.75  
##  S3:1400   Median :0.0302   Median : 4.47                Median :10.50  
##  S4:1400   Mean   :0.0927   Mean   :11.97                Mean   :10.50  
##  S5:1400   3rd Qu.:0.1599   3rd Qu.:13.37                3rd Qu.:15.25  
##            Max.   :0.3679   Max.   :40.00                Max.   :20.00  
##  sf_factor         eta               p               y       
##  0.5  :1400   Min.   :-9.717   Min.   :0.500   Min.   :0.00  
##  1.5  :1400   1st Qu.:-2.925   1st Qu.:0.526   1st Qu.:1.00  
##  4.47 :1400   Median : 0.286   Median :0.868   Median :1.00  
##  13.37:1400   Mean   : 0.004   Mean   :0.776   Mean   :0.77  
##  40   :1400   3rd Qu.: 3.240   3rd Qu.:1.000   3rd Qu.:1.00  
##               Max.   : 7.500   Max.   :1.000   Max.   :1.00

This function generates a vector with one element per row of dat, each element being the outcome of a single binomial trial (i.e. a Bernoulli trial) with success probability given by the corresponding entry of dat$p.
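As a sanity check on the simulation (you can already see it in the summary above), the mean of the binary outcomes should be close to the mean of the generating probabilities:

mean(dat$p)  # average generating probability
mean(dat$y)  # observed proportion correct; should be similar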

Converting from "y" (correct / incorrect) to "response"

Finally, as a little extra flavour, I convert the binary correct / incorrect variable above into a "left" or "right" response by the simulated subject on each trial. Note that you wouldn't normally do this; I'm just doing it for demonstration purposes for the data import post.

dat$response <- "left"
dat$response[dat$target_side == "right" & dat$y == 1] <- "right"  # correct on right
dat$response[dat$target_side == "left" & dat$y == 0] <- "right"  # wrong on left
dat$response <- factor(dat$response)
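To check the conversion, the simulated response should match the target side exactly when (and only when) the trial was scored correct:

all((dat$response == dat$target_side) == (dat$y == 1))  # should be TRUE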

Saving the data

Now I'm going to save a subset of the data here to a series of .csv files. These are the files we imported in my post on importing data. First, to make the data set look a little more realistic I'm going to shuffle the row order (as if I randomly interleaved trials in an experiment).

new_rows <- sample.int(nrow(dat))
dat <- dat[new_rows, ]

Create a unique id for each trial

I'm going to add a unique identifier (UUID) to each trial. This is a good thing to do in your experiment script. It makes it easier to, for example, check that some data set stored in a different table (e.g. eye tracking or brain imaging data) synchs up with the correct psychophysical trial. You could also do something where you add the exact time stamp to each trial.

library(uuid)
ids <- rep(NA, length = nrow(dat))
for (i in 1:length(ids)) ids[i] <- UUIDgenerate()

dat$unique_id <- ids
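Optionally, check that the identifiers really are unique (one distinct id per row):

length(unique(dat$unique_id)) == nrow(dat)  # should be TRUE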

Writing to a file

To do this I'm going to use the paste0 command, which you should already be familiar with from the data import post. This allows you to stick strings (text) together.
In the loop below, the line with paste0 builds the output filename: the first part gives us the project working directory, then we go into the data directory, and finally we create the filename from the subject name.

for (i in 1:length(subjects)) {
    # for each subject
    this_subj <- levels(dat$subject)[i]  # which subject?

    # use the subset command to subset the full data frame and remove some
    # variables:
    this_dat <- subset(dat, subject == this_subj, select = c(-trial:-y))

    output_file <- paste0(getwd(), "/data/data_", this_subj, ".csv")
    write.table(this_dat, file = output_file, row.names = FALSE, sep = ",")
}
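One caveat: write.table won't create the /data directory for you. If you're running this in a fresh project, a line like the following (run before the loop; it assumes the same project layout as in these posts) creates the directory if it doesn't already exist:

dir.create(paste0(getwd(), "/data"), showWarnings = FALSE)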

Summing up

That's one way to generate data from an underlying probability model. I've used it to generate some data to import and plot, but this is a really great thing to know how to do. It's useful for testing your analysis code (if you don't get back the parameters you put in, you know something is wrong). I also found I learned a lot about statistics more generally by simulating various tests and models.

This post concludes the more "practical" content I've planned so far. In the next few posts I'm planning to talk about some more general things. If you have requests for future "practical" posts, let me know in the comments below.

Graphically exploring data using ggplot2

Your second step, after importing your data, should always be to look at it. That means plotting lots of things, and getting a sense of how everything fits together. Never run a statistical test until you've looked at your data in as many ways as you can. Doing so can give you good intuitions about whether the comparisons you planned make sense, and whether any unexpected relationships are apparent in the data. The best tool for reproducible data exploration that I have used is Hadley Wickham's ggplot2 package.

A brief introduction to the mindset of ggplot

The first thing to note about ggplot2 is that it is better thought of as a data exploration tool than a plotting package (like base graphics in Matlab, R, or Python). In those systems, you typically create an x variable and a y variable, then plot them as something like a line, then maybe add a new line in a different colour. ggplot tries to separate your data from how you display it by making links between the data and the visual representations explicit, including any transformations. There’s a little introduction to this philosophy here.

For me, what this means in practice is that you need to start thinking in terms of long format data frames rather than separate x- and y vectors. A long format data frame is one where each value that we want on the y-axis in our plot is in a separate row (see wiki article here). The Cookbook for R has some good recipes for going between wide and long format data frames here. For example, imagine we have measured something (say, reaction time) in a within-subjects design where the same people performed the task under two conditions. For ggplot we want a data frame with columns like:

subject  condition  rt
s1       A          373
s1       B          416
s2       A          360
s2       B          387

not like:

subject  rt condition A  rt condition B
s1       373             416
s2       360             387

and not like (familiar to anyone plotting with Matlab or Matplotlib):

x = [s1, s2]
y_1 = [373, 360]
y_2 = [416, 387]
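Converting from the wide layout to the long one doesn't have to be done by hand. Here is a minimal sketch using the reshape2 package's melt function on the toy reaction-time example above (the column names are invented for this example; tidyr offers similar tools):

library(reshape2)
wide <- data.frame(subject = c("s1", "s2"), A = c(373, 360), B = c(416, 387))
(long <- melt(wide, id.vars = "subject", variable.name = "condition", value.name = "rt"))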

For the first few weeks of using ggplot2 I found this way of thinking about data took some getting used to, particularly when trying to do things as I’d done in Matlab. However, once you make the mental flip, the ggplot universe will open up to you.

Contrast detection data example

Now we will look at the data from my data import post. This consists of data from a psychophysical experiment where five subjects detected sine wave gratings at different contrasts and spatial frequencies. You can download the data from my github repository here. For each trial, we have a binary response (grating left or right) which is either correct or incorrect. Each row in the data frame is a trial, which means that this is already in long format:

load(paste0(getwd(),'/out/contrast_data.RData'))
head(dat)
##   subject contrast     sf target_side response
## 1      S1 0.069483  0.500       right    right
## 2      S1 0.013124 40.000       right     left
## 3      S1 0.069483  4.472        left     left
## 4      S1 0.069483 40.000        left    right
## 5      S1 0.367879 13.375        left     left
## 6      S1 0.002479  0.500        left    right
##                              unique_id correct
## 1 544ee9ff-2569-4f38-b04e-7e4d0a0be4d2       1
## 2 b27fe910-e3ba-48fb-b168-5afb1f115d8f       0
## 3 72c9d6ce-0a90-4d4b-a199-03435c15291b       1
## 4 48b5bbb2-e6ee-4848-b77e-839ed5320c01       0
## 5 32a5cce4-3f8a-4e63-80c1-3fee3230d1bd       1
## 6 47ebce53-9d5a-48de-936b-25d5105a0784       0

Baby steps

Building a plot in ggplot2 starts with the ggplot() function:

library(ggplot2)
fig <- ggplot(data = dat, aes(x = contrast, y = correct))

This command creates fig, which is a ggplot object, in our workspace. We’ve specified the data frame to use (dat), and two “aesthetics” using the aes() function. Aesthetics are how ggplot assigns variables in our data frame to things we want to plot. In this example we have specified that we want to plot contrast on the x-axis and correct on the y-axis.

We can try plotting this just by typing fig into the command window:

fig
## Error: No layers in plot

but this returns an error because we haven’t specified how we want to display the data. We must add a geom to the fig object (note the iterative notation, where we overwrite the fig object with itself plus the new element):

fig <- fig + geom_point()
fig

[Figure: raw binary responses (0 = incorrect, 1 = correct) plotted as points against contrast]

Now we get a plot of the data, with each correct trial as a point at 1 and each incorrect trial as a point at 0. But that’s not very informative, because there’s a lot of overplotting — we’re really interested in how often the subjects get the trials correct at each contrast level. That is, we want to know the proportion of correct responses.

To do that we could create a new data frame where we compute the mean of all correct values for each cell of our experiment (i.e. for each subject, at each level of contrast and spatial frequency). However, it’s also possible for ggplot2 to do that for us as we plot, using the stat_summary command:

fig <- fig + stat_summary(fun.data = "mean_cl_boot", colour = "red")
fig

[Figure: the same plot with means and bootstrapped 95% confidence intervals added as red points]

The mean_cl_boot function computes the means and bootstrapped 95% confidence intervals on the mean, for all the y-values falling into each unique x-value. These are shown as the red points in the above plot. Type ?stat_summary and look at the examples (or run example(stat_summary)) to get an idea of what you can do out-of-the-box with this command. It also allows you to define your own functions to summarise the y values for each value of x, so it's incredibly flexible.
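If you did want to compute those cell means yourself rather than letting stat_summary do it on the fly, base R's aggregate is one way (a sketch; plyr or dplyr would work just as well):

agg <- aggregate(correct ~ subject + sf + contrast, data = dat, FUN = mean)
head(agg)  # mean proportion correct per subject, spatial frequency and contrast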

Since the contrast values in our experiment were sampled logarithmically, the values for the small contrasts are all squished up to the left of the plot. Therefore, the last thing we might want to do with this basic plot is to log scale the x-axis:

fig <- fig + scale_x_log10()
fig

[Figure: the same plot with a log-scaled x-axis]

Now we can see that the mean proportion correct starts from 0.5 for low contrasts (i.e. 50% correct, or chance performance on the task) and gradually rises up to near 100% correct in an S-shaped fashion.

Facets and smooths

The goal of this experiment was to see whether and how human visual sensitivity to contrast changes depending on the spatial scale of the information (loosely, whether the pattern is coarse or fine). While the basic data representation makes sense (i.e. looking at proportion correct), the plot above is not very useful because it averages over all the different subjects and over the experimental variable we’re most interested in (spatial frequency). Thus it doesn’t tell us anything about the goal of the experiment.

Here’s where ggplot2 gets really cool. We can apply the same basic plot to each subset of the data we’re interested in, in one line of code, by using faceting. First, here’s the basic plot again, but in a more succinct form (note how I string together subfunctions using the + sign across multiple lines):

fig <- ggplot(data = dat, aes(x = contrast, y = correct)) +
    stat_summary(fun.data = "mean_cl_boot") +
    scale_x_log10() +
    scale_y_continuous(limits = c(0, 1))
fig

[Figure: proportion correct as a function of contrast (log x-axis), with bootstrapped 95% confidence intervals]

Now we want to do the same thing, but look at the data for each subject and spatial frequency separately. The facet_grid command allows us to lay out the data subsets on a grid. Since we want to compare how the performance shifts as a function of contrast within each subject, it makes sense to arrange the facets with subjects in each column and spatial frequencies in the rows. This is done by adding one element to the fig object above:

fig <- fig + facet_grid(sf ~ subject)  # specifies rows ~ columns of the facet_grid.
fig

[Figure: the same plot faceted by spatial frequency (rows) and subject (columns)]

Now we get a replication of the basic x-y plot but for each subject and spatial frequency. The axes have been scaled identically by default so it's easy to see variation across the facets. If you follow down each column, you can see that performance as a function of contrast first improves and then gets worse again, relative to the first spatial frequency (0.5 cycles of the grating per degree of visual angle). To see this more clearly it will help to add trend lines, which again we can do in one line in ggplot2:

fig <- fig + stat_smooth(method = "glm", family = binomial())
fig

[Figure: the faceted plot with binomial GLM smooths (blue lines, grey confidence regions) added]

In this case I've used a Generalised Linear Model specifying that we have binomial data (this defaults to a logistic link function). The blue lines show the maximum likelihood model prediction and the grey shaded regions show the 95% confidence limits on these predictions. However, this model isn't taking account of the fact that we know performance will asymptote at 0.5 (because this is chance performance on the task), so the slopes look all wrong. A psychophysicist would now fit this model with a custom link function that asymptotes at 0.5. Such link functions are implemented in Ken Knoblauch's psyphy package for R.
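For the record, using that custom link within ggplot2 would look something like the sketch below (untested here; it assumes stat_smooth passes the family argument through to glm in the same way as the binomial() call above did):

library(psyphy)
fig + stat_smooth(method = "glm", family = binomial(mafc.weib(2)), colour = "green")  # plotted only, not saved into fig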

Instead of pursuing that here, though, I will use a Generalised Additive Model (GAM) to show more flexible fitting of splines. This could come in handy if you don't have a good approximation to the functional form of your dataset (i.e. you have only vague expectations for what the relationship between x and y should be).

library(mgcv)
## Loading required package: nlme
## This is mgcv 1.7-28. For overview type 'help("mgcv-package")'.
fig <- fig + stat_smooth(method = "gam", family = binomial(),
    formula = y ~ s(x, bs = "cs", k = 3),
    colour = "red")
fig

[Figure: the faceted plot with GAM smooths added in red]

The above code fits a cubic spline ("cs") with three knots ("k=3"). Essentially this is a really flexible way to consider nonlinear univariate relationships (a bit like polynomial terms in normal GLMs). You can see that these data are probably overfit (i.e. the model is capturing noise in the data and is unlikely to generalise well to new data), and some of the confidence regions are pretty crazy (because the model is so flexible and the data are not well sampled for the observer's sensitivity) but that it gives a reasonable impression for the purposes of exploration. If you scan down the columns for each subject, you can see that the point on the x-axis where the red curves have the steepest slope is furthest to the left for spatial frequencies of 1.5 and 4.5 cycles per degree. These observers reach a higher level of performance with less contrast than in the other conditions: their thresholds are lower. Humans are most sensitive to contrast in the 1–4 cycles per degree range (Campbell and Robson, 1968).

Appearance matters

Finally, we can adjust the appearance of our plot. First, let's get rid of the ugly decimal-point labels in the spatial frequency dimension by creating a new variable, a factor of sf:

dat$sf_factor <- factor(dat$sf)
levels(dat$sf_factor) <- round(sort(unique(dat$sf)), digits = 1)  # rename levels

Second, some people don't like the grey background theme of the ggplot default. It took me a little while to get used to, but now I quite like it: by attending to things that are darker than the background you can concentrate on the data, but the gridlines are there if you need them. However, if you prefer a more traditional plot, just add the classic theme (fig + theme_classic()). Personally my favourite is now theme_minimal(). So having done all this, our entire plot call becomes:

fig <- ggplot(data = dat, aes(x = contrast, y = correct)) +
    facet_grid(sf_factor ~ subject) +
    stat_summary(fun.data = "mean_cl_boot") +
    stat_smooth(method = "gam", family = binomial(),
        formula = y ~ s(x, bs = "cs", k = 3)) +
    scale_x_log10(name = "Contrast") +
    scale_y_continuous(name = "Proportion Correct",
        limits = c(0, 1), breaks = c(0, 0.5, 1)) +
    theme_minimal()
fig

[Figure: the final faceted plot with GAM smooths, relabelled axes and theme_minimal]

Going further with ggplot2

What would happen if we wanted to see whether performance changed depending on the side of the retina (left or right of fixation) the grating was presented? Perhaps sensitivity is different, or the person has a bias in responding to a side. We can look at this by simply adding a new argument in the aes() function when we call ggplot: colour = target_side. Try it at home! There's an example of doing this in the plots.R script on my github page. Here you can also see how the plots are saved to the /figs subdirectory, where a document file (like a .tex doc) can be set up to automatically pull in the figures. You can also see a nice vector graphic version of the final figure above.
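For example, starting from the final plot call above, only the aes() line needs the extra argument (a sketch; the smooth is left out here to keep it short):

fig <- ggplot(data = dat, aes(x = contrast, y = correct, colour = target_side)) +
    facet_grid(sf_factor ~ subject) +
    stat_summary(fun.data = "mean_cl_boot") +
    scale_x_log10(name = "Contrast") +
    scale_y_continuous(name = "Proportion Correct", limits = c(0, 1), breaks = c(0, 0.5, 1)) +
    theme_minimal()
fig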

This post was just a little taste of what you can do in ggplot2, with a focus on vision science data. There are many more thorough introductions for ggplot2 available on the web, like here and here. I find that The Cookbook for R has heaps of useful little tips and code snippets, as well as showing you lots of basic ggplot things in the plotting section. If you want an example of some published figures made with ggplot2 as well as the code that generated them, you can see our recent paper here.

Data import in R

In this post, I will demonstrate one way to import and collate a data set (using the R environment). This is a follow up to a post in which I argued that a good principle for reproducible research is to avoid humans touching data. That is, once the data from the experiment are saved we want them to be “read only” and never altered by a human in some undocumented way (such as editing in a spreadsheet).

Using R is not the only way to do the following, and I would encourage you to replicate these steps in the environment of your choice. If your scientific computing environment makes following what I do here really hard, maybe you should consider switching…

Data set

First, we need a data set. To make this more interesting let’s build on a classic paper from vision science.

Imagine we’ve conducted an experiment similar to the classic Campbell & Robson (1968)^1 study but with a few modifications. As a participant in our experiment, you’re seated in front of a monitor showing a grey screen. You’re going to be shown a sequence of trials, and for each trial you make a response with a button press.
On each trial you are asked to keep your eyes on a small dot on the centre of the screen. On each trial, a pattern of dark-and-light stripes (a grating) is shown on one side of the screen (left or right of your eye position). The computer randomly decides whether to present the grating on the left or on the right (the other side just stays as the grey background). You have to respond either “grating on the left” or “grating on the right” — you can’t say “I don’t know”. The computer waits for your response before showing the next trial.
We are going to vary both the contrast of the grating pattern (how different from grey the dark and light stripes are) and also the spatial frequency of the pattern (how wide the bars are) over trials.

If the contrast is so low that you can’t see the grating, your responses across many trials will be near chance performance (here 50% correct). If the grating is really easy to see, your performance will be near 100%. We determine how your performance on the task changes as a function of contrast, for each spatial frequency tested.

We’ve tested 5 subjects in this experiment, showing them 7 contrasts at 5 spatial frequencies, with the targets equally on the left and right. They did 20 trials for each condition (so each subject did 7 * 5 * 20 * 2 = 1400 trials). Let’s say that our experiment program saves the data as a .csv file in our project’s /data/ directory. We have one .csv file per subject, and one of them might look something like this when opened in a text editor:

[Screenshot: one subject's .csv data file opened in a text editor]
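In case the screenshot doesn't come through, the first few lines of such a file would look roughly like this (reconstructed from the data shown later in this post; the exact quoting depends on how your experiment program writes the file):

"subject","contrast","sf","target_side","response","unique_id"
"S1",0.069483,0.5,"right","right","544ee9ff-2569-4f38-b04e-7e4d0a0be4d2"
"S1",0.013124,40,"right","left","b27fe910-e3ba-48fb-b168-5afb1f115d8f"
"S1",0.069483,4.472,"left","left","72c9d6ce-0a90-4d4b-a199-03435c15291b"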

A few things to notice here. First, each comma (,) in the file denotes a new column, and each new line denotes a row. Second, there's a header row: the first line of the file contains the column names for our variables.

Finally, notice how our target_side and response columns contain text strings (left and right). The reason I’ve done this is that it makes the data easily human-readable. It’s obvious what the entries mean (imagine if instead target_side could be either 0 or 1). This can be used to great effect to avoid needing a data key later.

Installing R

This couldn’t be simpler. Go here and get the right binary for your system, install it, then immediately go here and get RStudio, which is awesome. To follow along with my stuff here, you can install any packages I use (the library() calls in future posts) via RStudio’s “Packages” tab.

While I'm going to demonstrate this stuff using R, I would encourage you to follow along in your package of choice. I'd be interested to know how easy / hard it is to duplicate this stuff in other environments (for example, the last time I used Matlab, handling .csv files with mixed numeric and text data was a massive pain).

Reading each file into R and putting them together

Now we want to read each subject’s data file into R, then stick the files together to create one big data file.

The paste0 command

To do this, I’m going to make use of the paste command, which allows you to concatenate (stick together) strings. Actually, I’m going to use the paste0 command, which is a shortcut for paste. By default paste adds a space between each pasted item, which we usually don’t want. paste0 just puts together the items you give it. For example:

paste0("A text string", 42, ", another text string")
## [1] "A text string42, another text string"

What we get is that R automatically converts the number "42" to text, and sticks it together with the preceding and subsequent stuff. Usefully, we can also include ranges of numbers, which produces a vector of strings:

paste0("A text string", 41:43, ", another text string")
## [1] "A text string41, another text string"
## [2] "A text string42, another text string"
## [3] "A text string43, another text string"

Read in the damn files already

The file for subject one is labelled like this:

“data_S1.csv”

and subject 2’s results are in the file “data_S2.csv”, and so on. The following script uses a for loop to read in the data, then appends it to a data frame called dat.

dat <- data.frame()  # create an empty data frame.
for (i in 1:5) {
    file <- paste0(getwd(), "/data/data_S", i, ".csv")
    this_dat <- read.csv(file = file)  # read the subject's file, put in a data frame called this_dat
    dat <- rbind(dat, this_dat)  # append to larger data frame  
}
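As an aside, you can avoid hard-coding the number of subjects by listing the files that match a pattern and binding them in one go (a sketch; the pattern assumes the file naming scheme above):

files <- list.files(path = paste0(getwd(), "/data"), pattern = "^data_S.*\\.csv$", full.names = TRUE)
dat <- do.call(rbind, lapply(files, read.csv))  # read each file, then stack the data frames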

What this for loop gives us is a data frame object called dat. Let’s examine it using the str (“structure”) command:

str(dat)
## 'data.frame':  7000 obs. of  6 variables:
##  $ subject    : Factor w/ 5 levels "S1","S2","S3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ contrast   : num  0.0695 0.0131 0.0695 0.0695 0.3679 ...
##  $ sf         : num  0.5 40 4.47 40 13.37 ...
##  $ target_side: Factor w/ 2 levels "left","right": 2 2 1 1 1 1 2 2 1 2 ...
##  $ response   : Factor w/ 2 levels "left","right": 2 1 1 2 1 2 2 1 2 2 ...
##  $ unique_id  : Factor w/ 7000 levels "00004355-345d-403e-b244-79c8adb8f1f8",..: 451 983 595 395 277 387 132 809 711 582 ...

Data frames

Data frames are the most important (or at least the most useful) data type in R, and what you're going to be using a lot. Many R functions and modelling methods expect data frames. The most awesome thing about a data frame is that it can store both numerical data and text. This allows us to read in that csv file no problem, where other basic data types would really struggle (I'm looking at you, Matlab).

Furthermore, data frames can explicitly treat text as a “factor”, which means that when you fit a model, it won’t try to use this numerically but will rather dummy code it. Note how in the str call above, several variables (in fact, all those that were strings in the .csv file) have been imported as factors. Let’s look at some behaviour of factors now by looking at the summary of our data:

summary(dat)
##  subject      contrast            sf        target_side   response   
##  S1:1400   Min.   :0.0025   Min.   : 0.50   left :3500   left :3488  
##  S2:1400   1st Qu.:0.0057   1st Qu.: 1.50   right:3500   right:3512  
##  S3:1400   Median :0.0302   Median : 4.47                            
##  S4:1400   Mean   :0.0927   Mean   :11.97                            
##  S5:1400   3rd Qu.:0.1599   3rd Qu.:13.37                            
##            Max.   :0.3679   Max.   :40.00                            
##                                                                      
##                                 unique_id   
##  00004355-345d-403e-b244-79c8adb8f1f8:   1  
##  000f3b09-9dde-4dd4-8a97-ad87cfcbc947:   1  
##  00448030-70e5-4010-b954-4a35c107841e:   1  
##  0086b264-17ed-4fbb-8e32-8c7814ae6b6a:   1  
##  00a070b7-f849-4727-a710-0453d6f27c50:   1  
##  00b414aa-3f65-4b4d-8d12-f0d41ec7ae42:   1  
##  (Other)                             :6994

See how we get distribution summaries for the covariates (e.g. contrast), but for the factors we're only told how many instances of each level there are? Neat, huh?

Data munging

In our data file there is a "response" variable, which is a string giving the side the subject responded to. What we really want, however, is to know whether they got the trial correct. That is, is the string in "target_side" the same as the string in "response"? Let's create this new variable now:

dat$correct <- 0  # initialises the variable 'correct' with all zeros.
dat$correct[dat$target_side == dat$response] <- 1  # logical indexing: set to 1 where target matches response
summary(dat$correct)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    1.00    0.77    1.00    1.00
hist(dat$correct)

[Figure: histogram of dat$correct]
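As an aside, the same column could be created in a single line by converting the logical comparison directly (equivalent to the two lines above):

dat$correct <- as.integer(dat$target_side == dat$response)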

Now we have a variable in the data frame dat that gives a 1 where the subject was correct and a 0 elsewhere. In the next post, I will show some basic graphical exploration of this data set using the ggplot2 package.

PS

This post was written in R Markdown (as a .Rmd file in RStudio -> "knit HTML", then the .md code pasted directly into WordPress… too easy!)

You can check out the repository for this and some upcoming posts at my Github page.


[1] Campbell, F. W., & Robson, J. G. (1968). Application of Fourier analysis to the visibility of gratings. The Journal of Physiology, 197(3), 551–566.

Version control Part 2: Remote repository

The second stage in my version control workflow is to push my local changes to a remote repository. A remote repository is basically an identical copy of the repository stored locally on your computer, but it lives on a remote server somewhere in the internet ether. Much like using Dropbox, this provides an additional layer of backup for your project (with the advantage of a full version history). So if you ever lose your local copy of your project for some reason, you can just re-clone it from the remote repository to get everything back (not including files that were never committed, of course). NOTE that I don't recommend using this, or any one tool, as your only backup: your scientific projects should be backed up by multiple means, in multiple locations, all the time.

However, the main advantage of pushing things up to a remote repository is that this facilitates sharing. With various methods that I’ll outline below, you can keep the remote repository private and invite your collaborators to use it, or you can make it public so that anyone can see it, clone it, etc (though of course in this case, you control whether to use anyone else’s changes or not).

I use two services to host remote repositories: Github and Bitbucket. These companies offer similar services with a few key differences, which means that in my current workflow I switch between both.

Github

Github is the "one that started it all". They have a really slick web interface, awesome graphs for looking at repository activity, great tools for interacting with other members, wikis and issue trackers that can be associated with a repo, and a big user base. Plus they offer the free GUI that I talked about in my last post. However, their pricing structure is that they charge you to have private repositories. That is, they host unlimited public repositories, where anyone can see the repository's contents, contributors and history. If you want to keep your repository to a few invited collaborators, however, you need a paid account. Seven dollars a month gets you 5 private repositories. The idea here would be that you have some projects on the go, then when one is ready for sharing (say, the article is accepted), you switch the repo from "private" to "public". Now everyone can see your code and data, and you have another private repo slot to use.

However, since I know that I would need more than 5 private repos (projects languishing, maybe one day, etc), I’ve so far avoided a paid Github account (the idea of just working with everything open is for another post). Thankfully we’re helped by Bitbucket.


UPDATE: Thanks to Ariel Rokem for pointing out in the comments that Github actually offer a Micro plan (5 private repositories) for educational users. Send them an email with your educational email address at this site.


 

Bitbucket

Bitbucket is basically Github with a different pricing structure. Their web interface and user community are a fair bit behind Github's. For scientists, however, the advantage is that they offer unlimited free private code repositories. The catch is that you're only allowed 5 collaborators (i.e. people who have joined any of your repositories, like co-authors). However, an academic email address will get you unlimited collaborators too, so this is essentially a free service.

Using Remote Repositories on Bitbucket with the Github GUI

Generously, Github have not restricted their GUI to working only with Github repositories. So what I do is basically use the Github GUI to manage my version control day-to-day, but push the local repository to a remote repository on Bitbucket. I can share this with collaborators and keep it private.

Here’s how:

  1. Set up a local repository as explained here.
  2. Log in to your Bitbucket account in a web browser.
  3. Follow the steps to set up a new repository. Select “git” as the version control flavour.
  4. This should then give you an option to “push up an existing repository”.
  5. From the command that starts `git remote add origin`, copy the link to your repository (something like `git@bitbucket.org:tomwallis/test.git`). This might look different depending on whether you're using SSH or a password to authenticate (if you're using a password, your link will start with https; either works). The Bitbucket / Github help pages explain how to set up an SSH key if you'd like to do that.
  6. In the Github GUI, open your local repository and go to the “Settings” pane. On the line that says “Primary Remote Repository”, paste in the link to your repo. Hit “Update Remote”.
  7. Switch back to the “Changes” pane of the Github GUI. See the button in the top left? It should have changed from “Push to Github” to “Sync Branch” (if not, close and re-open the Github GUI).
  8. Press this button. You might be asked for your password (depending on whether you've set up an SSH key).
  9. Github should synch, and the list of “unsynched commits” should disappear.
  10. Refresh your web browser on your Bitbucket account. Your code should now be in your private repository on Bitbucket!

You can now share this repository with your co-workers, friends and family, and take advantage of all the nice things about collaborating with version control!

When the code is ready to be made public, you can simply push the repository to a public repository on Github by changing the primary remote in the settings pane. This lets you take advantage of the bigger user community (for example, see Psychtoolbox, PsychoPy, Psychopy_ext, and Julia) and the better web tools on Github for publishing your code. Maybe in the future I will just use Github exclusively (i.e. paying for private repos), but for now the dual solution (Bitbucket, Github) works well. Of course, you can also just make your Bitbucket repository public and not worry about using Github at all, but then you're stuck with their (relatively) clunky web interface.

As always, test this process out for yourself to see that it works for you before using it for important stuff, and always keep independent backups of your project and data (Dropbox, Time Machine, etc).

Version control Part 1: Local repository

Previously I talked about setting up a project directory. Now I’ll run through the first stages of something I see as an integral part of reproducible research: version control.

Why version control?

Loosely, think of “version control” like track changes in Word but for all the (plain text) files in your project directory. More formally, a version control system is a piece of software that will automatically detect when files in monitored directories change, record which lines of the file changed and how, prompt you to comment why you changed the file, then have you “commit” the changes to a repository (a data bank containing the history of the project directory). If you’re collaborating on the project with others, the version control system can be used to make sure each collaborator’s files stay up to date with each other, record who changed what, and notice if two people changed a file in conflicting ways (you’ll then be asked to look at the changes and decide which to keep). This is a big advantage over something like sharing a folder on Dropbox. You can even use a version control system to keep different versions of the same project (using “branches”). Finally, from the perspective of making your science open and reproducible, making the version control history of the project available is close to a gold standard. Not only do users get the version that created the final results, but if they’re so inclined they could search back through your repository history to see how the project evolved. Being able to just switch a repository from private to public also makes it really simple to share your code (say, when the paper comes out). Plenty of other people have written about the usefulness of version control for scientists (see here, here and here, for example), so in this post I’m going to concentrate on the way I use it.

How I set up version control

First, a disclaimer: I am not a version control expert. I use simple tools, and so far they’ve worked for me. Any problems I’ve had I’ve been able to resolve with some Googling and Stack Overflow. In the remainder of this post I’ll show you how I set things up, in the hope that this will be useful to you. Second disclaimer: I use OSX (and a bit of Linux on experimental machines), so my process might have some OSX-specific steps. However, all the software I’m using is cross-platform, so you should be able to port the process to your operating system of choice with a bit of internet searching.

The specific version control software I use is git. I find this to be the simplest system I’ve tried, with the best support for integrating with others and publishing your code. You can do some tutorials and get more info at the git site, or watch this video. I mostly use git through a GUI (graphical user interface) rather than through the command line because I find this simpler for day-to-day work. I use the Github GUI, which you can download here for mac and here for windows. I have found that this does everything I regularly need, in a clean and simple interface. Other people (Diederick, posting on this blog) have recommended SmartGit, which seems more fully-featured and is free for academic use. I haven’t tried this yet, though. So, without further ado, here are the steps for setting up version control in the way I have.

  1. Download and install the Github GUI for your operating system of choice. You might need to create a Github account at this stage (which I would recommend doing anyway). Since I already have everything installed on my setup here, I’m not going to walk you through this stage. It should be well-explained by the app itself, but if there are other steps you need to take, post them in the comments below and I will update this post.

  2. Now open the Github application. We’re going to use this to create a new repository for a new project. I will use the project directory structure I set up before:

[Screenshot: the project directory structure from the earlier post]

In the OSX Github GUI, go to the small plus sign in the lower left corner of the main display and select “Create New Repository”:

[Screenshot: the "Create New Repository" option in the OSX Github GUI]

Select the root directory of your project (in my case, this is called blog_example). Now we have this blank screen for our new repository:

[Screenshot: the new, empty repository in the Github GUI]

The Github GUI has created a new git repository, which lives in a hidden directory within our parent directory. Here you can see the contents of the directory before and after I created the repository:

[Screenshot: terminal output of ls -la before and after creating the repository]

Note how the second call to ls -la has revealed a new (hidden) folder .git. This contains all the files for the git repository, and is what the Github GUI uses. Note that it’s just a normal git repository, so if you want to do something that isn’t possible in the GUI, you can just interact with the repository using the command line.

Now let’s create a file in our project directory. I made a file in the root directory called master_manuscript.txt, with two lines of text. When we flip back to the Github GUI, we see that it has detected the new file:

[Screenshot: the Github GUI showing the newly detected file master_manuscript.txt]

I enter a commit message “created master file”, hit commit, and there you go. Our git repository has its first local commit:

[Screenshot: the first local commit listed in the Github GUI]

Note how the commit is listed in "Unsynced Commits". It's "unsynced" because our local repository hasn't been synchronised with a remote repository, such as one on Github. That's fine if you just want to maintain a local repository for yourself. I will discuss syncing with remotes in a future post.

Now let’s try modifying the file master_manuscript.txt. I’m going to delete the original two lines and add something new. How about a Well Thought-Out Englilsh Paper?

[Screenshot: the diff view showing deleted lines in red and added lines in green]

The GUI shows us which lines were deleted (in red) and which were added (green). We can commit this new change, then take a look at our project’s history (in the history tab):

[Screenshot: the history tab showing both commits]

There we can see our original commit (“created master file”) as well as our new one. On second thought, Strong Bad’s Well Thought-Out Englilsh paper is maybe not so well thought out. Let’s revert to the last version. Select our commit “a well thought-out englilsh paper” and click on the gear to the right of the panel, then select “Revert This Commit” (for the difference between revert and roll back, see here). We then get a new commit, telling us that we’ve reverted the content of the file back to the old one:

[Screenshot: the new commit created by reverting]

If you open up the file master_manuscript.txt, you'll see that git has changed it back to the original two lines.

Ignoring files

Let’s say that we now have some files in our project directory that we don’t want git to monitor (e.g. a really large data file that would be infeasible to upload to a remote repository). We can add this to the list of files to be ignored by placing it in the file .gitignore. The GUI allows us to do this by right clicking on a file when it appears on the left pane and selecting “ignore”. You can also ignore entire directories by entering them into .gitignore. You can do this in the GUI by going to the Settings pane, to the ignored files section, and entering a line like:

/directory_to_be_ignored/*

The GUI simply adds this line to your .gitignore file.

I usually ignore the /out/ and /figs/ directories, since their contents can be regenerated from your code, and for some of my analyses the contents of /out/ can be rather large.

Branching

A branch is basically an independent copy of a state of the respository that allows you to do work in parallel to other changes occurring to the repository. While it’s aimed more at larger collaborative projects, I have found it useful in working on solo repositories as a way to explicitly maintain old versions of analyses or papers. I will cover branching more in a future post.

Other things to note

Note how in the explanation above I said “for all the (plain text) files in your directory”? A plain text file is something that looks ok when opened in a text editor like notepad. They could have extensions like .txt, .csv, .m, .R, .py, .tex, etc, but they are still readable in a text editor. Version control works beautifully for these. However, binary files not so much. You won’t take the most advantage of version control if you try to keep track of something like Word docs, .pdf or .jpg. At best your version control will be able to record that the file changed – but not which lines. On the other hand, it may still work to keep collaborators synched with the latest file, but conflict resolution would be hard.

A final disclaimer: Like any file, keeping your git repository within a Dropbox folder carries the risk that the file could be corrupted. Specifically, if you work from computer A, make changes, shut down that computer before Dropbox finishes syncing, then subsequently work on the same files from computer B, you will have conflicts when you open up computer A again. Keep a separate backup of your Dropbox folder (e.g. with Time Machine, or Dropbox’s own file history service) to prevent anything nasty happening. This also becomes less of a problem if you are pushing your git repository to a remote server, which is what I will cover in the next post.

UPDATE 12 Feb 2014:

A few people have commented to me on other forums about my selection of git over something else. Specifically, a lot of people find git to be unnecessarily complex for many projects compared to, say, Subversion.

Alex Holcombe shared the following: "I have started using github (inside Rstudio), and seems to be working, but after reading this I am very afraid http://t.co/hLrUMrIz8J"

Another friend writes: "SVN makes perfect sense to me (I used it in my programming jobs) and would definitely do the trick. Git on the other hand is a brutally complicated, confusing thing (e.g. https://steveko.wordpress.com/2012/02/24/10-things-i-hate-about-git/) that will require hours to master even for relatively simple tasks. However it is widely in use out in the world of software development, so the skill is a valuable one to develop, and has some super neat features (all that branching)."

I guess my response to these concerns is just to say that the above is how I’ve done version control, and so far it has worked for me. However, the vast majority of what I’ve done so far is single-user repositories (i.e. just me), where I integrate changes and comments of co-authors myself manually. Perhaps I will learn to hate git when I have to do more with other contributors, but for now the workflow above works for me.