Scientific programming

Exploring GLMs with Python

A few months ago Philipp Berens and I ran a six-week course introducing the Generalised Linear Model to graduate students. We decided to run the course in Python and to use the IPython Notebook to present the lectures. You can view the lecture notes here and get the source code here.

The six lectures were:

  1. Introduction by us, then student group presentations of a Points of Significance article that we hoped would provide a refresher on basic statistics.
  2. Finishing off student presentations (which took too long in the first week), followed by an introduction to Python, dataframes and exploratory plotting.
  3. Design matrices
  4. Interactions
  5. ANOVA and link functions
  6. Current issues in statistical practice, in which we touched on things like exploratory vs confirmatory data analysis, p-hacking, and the idea that statistics is not simply a cookbook one can mechanically follow.

What worked

  • In general the feedback suggests that most students felt they benefitted from the course. Many reported that they enjoyed using the notebooks and working through the course materials and exercises themselves.
  • The main goal of the course was to give students a broad understanding of how GLMs work and how many statistical procedures can be thought of as special cases of the GLM. Our teaching evaluations suggest that many people felt they achieved this.
  • The notebooks allow the theory and equations to be tied very concretely to the computations one performs to do the analysis. I think many students found this helpful, particularly in working through the design matrices.

What didn’t

  • Many students felt the course was too short for the material that we wanted to cover. I agree.
  • Some students found the lectures (for which weeks 2–5 involved me working through notebooks and presenting extra material) boring.
  • Despite the niceness of the Anaconda distribution, Python is unfortunately still not as easy to set up as (for example) MATLAB across different environments (primarily Windows, some OSX). We spent more time than we would have liked troubleshooting issues (mainly to do with different Windows versions not playing nicely with other things).
  • We didn’t spend a lot of time discussing the homework assignments in class.
  • Some students (graduate students in the neural behavioural school) are more familiar with MATLAB and strongly preferred that we teach the course in it.

If I were to run the course again

I think the main thing I would change if I ran the course again would be to have it run longer. As a “taste test” I think six weeks was fine, but to really get into the materials would require more lectures.

I also think it would be beneficial to split the content into lectures and practicals. I think the IPython notebooks in the way I used them would be excellent for teaching practical / small class tutorials, but that the lectures probably benefit from higher-level overviews and less scrolling over code*.

I would also plan the homework projects better. One student suggested that rather than having the introduction presentations at the beginning of the course, it would be nice to have the last lecture dedicated to students running through their analysis of a particular dataset in the notebook. I think that’s a great idea.

Finally, I would ignore requests to do the course in MATLAB. Part of why we wanted to use Python or R to do the course was to equip students with tools that they could continue using (for free) if they leave the university environment. Perhaps this is more valuable to undergraduates than PhD students (who already have some MATLAB experience), but I think it’s good for those students to get exposed to free alternatives as well. Plus, the IPython notebooks are just, well, pretty boss.

*I know you can hide cells using the slide mode, but I found slide mode in general quite clunky and so I opted not to use it.


High-level plotting in Python

If you have read my older blog posts, you know I’m a fan of R’s ggplot2 library for exploratory data analysis. It allows you to examine many views onto data, creating summaries over different variables right there as you plot. I gave a short tutorial here. Once you understand how the library works, it’s a very powerful way of quickly seeing patterns in large datasets.

Moving over to Python

The past few months have seen me ramp up my Python usage (as I preempted almost a year ago). This is for a variety of reasons, but mostly (a) Python is a general programming language, so you can essentially do anything in it without it being a massive hack; (b) the scientific computing packages available for Python are generally super awesome; (c) it seems to me to be poised to replace Matlab as the lingua franca of neuroscience / vision research — at least, much more so than R. See Tal Yarkoni’s blog for a related discussion. Of course, it’s also free and open source.

What I’m still kind of missing in Python is a nice high-level plotting library like ggplot2. Certainly, there are many in the works, but in playing around I’ve found that none of them can match R’s ggplot2 yet.

Amphibious assault

Of the libraries I’ve tried (main competitor is yhat’s ggplot port), my go-to library for plotting in Python is hands-down Seaborn. Developed largely by Michael Waskom, a graduate student in Cog Neuro at Stanford, it has the prettiest plots, and also the best ideas for high-level functionality.

Speeding up Seaborn?

Unfortunately, compared to R’s ggplot2, Seaborn’s data summary functions are slow. As an example, I’ll show you a summary plot using the simulated data from my R tutorials.

We want to summarise our (simulated) subjects’ detection performance as a function of stimulus contrast, for each spatial frequency. In R, my code looks like this:

    library(ggplot2)
    setwd('~/Dropbox/Notebooks/')
    dat <- read.csv('contrast_data.csv')
    summary(dat)

    plot_fun <- function(){
      fig <- ggplot(data = dat, aes(x=contrast, y=correct, colour=factor(sf))) +
        stat_summary(fun.data = "mean_cl_boot") + 
        scale_x_log10() + scale_y_continuous(limits = c(0,1)) + 
        facet_wrap( ~ subject, ncol=3)  +  
        stat_smooth(method = "glm", family=binomial())
      print(fig)
    }

    system.time(plot_fun())

This takes about 3 seconds (2.67) to return the following plot:

<a href="https://tomwallisblog.files.wordpress.com/2014/10/ggplot_timing.png"><img src="https://tomwallisblog.files.wordpress.com/2014/10/ggplot_timing.png" alt="ggplot_timing" width="600" height="485" class="aligncenter size-full wp-image-374" /></a>

EDIT: for some reason WordPress is being silly and won't let me actually show this picture. Suffice it to say it looks substantially similar to the plot below.

Let's contrast this to Seaborn. First the code:

    # do imports
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import ggplot

    dat = pd.read_csv('contrast_data.csv')

    # make a log contrast variable (because Seaborn can't auto-log an axis like ggplot):
    dat['log_contrast'] = np.log(dat['contrast'])

    # set some styles we like:
    sns.set_style("white")
    sns.set_style("ticks")
    pal = sns.cubehelix_palette(5, start=2, rot=0, dark=0.1, light=.8, reverse=True)

    %%timeit
    fig = sns.lmplot("log_contrast", "correct", dat,
                     x_estimator=np.mean,
                     col="subject",
                     hue='sf',
                     col_wrap=3,
                     logistic=True,
                     palette=pal,
                     ci=95);

(Note that the above lines were run in an IPython notebook, hence the %%timeit cell magic.)

Seaborn takes… wait for it… 2 minutes and 11 seconds to produce the following plot:

[Image: seaborn_2min]

(I’m using Python 3.3.5, Anaconda distribution 2.1.0, Seaborn 0.4.0.)

Note also that both packages are doing 1000 bootstraps by default (as far as I’m aware), so I’m pretty sure they’re doing equivalent things.
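For the curious, the percentile bootstrap that (as far as I know) both packages use for these summaries is simple to sketch in plain numpy. The function below is my own illustration, not either library's actual implementation:

```python
import numpy as np

def boot_ci_mean(x, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for a sample mean."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # resample with replacement n_boot times, take the mean of each resample
    boot_means = rng.choice(x, size=(n_boot, len(x)), replace=True).mean(axis=1)
    lo, hi = np.percentile(boot_means, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return x.mean(), lo, hi

# 50 misses and 50 hits: the sample mean is 0.5, and the
# bootstrap interval should bracket it
m, lo, hi = boot_ci_mean(np.r_[np.zeros(50), np.ones(50)])
```

Doing this 1000 times per plotted point is where the time goes, whichever library is doing it.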

What can be done?

This is obviously not a positive factor for my switch from R to Python, and I’m hoping it’s just that I’ve done something wrong. However, another explanation is that whatever Seaborn is using to do the bootstrapping or logistic model fitting is just far less optimised than ggplot2’s backend in R.

The nice thing about open source software is that we can help to make this better. So if you’re a code guru who’s reading this and wants to contribute to the scientific computing world moving ever faster to Python, go fork the github repo now!

Update

After I posted this, I opened an issue on Github asking the developers about the slow times. It turns out that the ci flag in sns.lmplot specifies confidence intervals for the logistic regression, which are also bootstrapped. Bootstrapping a logistic regression takes a while; setting ci=False means that Seaborn now takes about 7 seconds to produce that plot instead of 2 minutes.

So, hooray for Seaborn and for awesome open source developers!

Guest Post: Matlab versus Pandas for data analysis

Annelie Muehler is an undergraduate student who is about to finish a two-month internship in our group. She has been working with me conducting psychophysical experiments, and we have been creating stimuli using Python. As part of getting used to scientific Python, Annelie learned to use Pandas, a package that essentially gives you R’s data frames in Python. The following compares the code for a simple analysis in Matlab and Python. While it’s possible there are ways to improve the Matlab implementation (perhaps using the Statistics Toolbox?), it’s noteworthy that these weren’t taught in Annelie’s course.

A comparison of Matlab and Pandas for a simple data analysis

As part of my undergraduate studies in cognitive psychology and neuroscience, I ran a water maze experiment with mice in an advanced biology/neuroscience lab course. I had ten mice that each did four trials of the experiment per day over a six-day period. The point of the experiment is that the mice should find the platform in the water with increasing speed as they complete more trials, because they learn where the platform is. The water maze is one of the behavioural experiments used with mice and rats to test the ability to learn. Later we used this data as a basis for learning data analysis while learning Matlab in another lab class.

During my internship at the Centre for Integrative Neuroscience in Tuebingen, Germany, I reanalyzed this data using pandas in Python as a way to learn pandas, giving me a direct comparison of Matlab and pandas. There are definitely some very nice things about pandas. First, you can define your own index and column names, which are shown surrounding the matrix in a format similar to a table in a word-processing document or an Excel file. This was one of the most frustrating things for me in Matlab: there you have a dataset plus a separate variable containing a list of strings corresponding to the column names, so you have to look the names up there.

[Image: Table_example_blog]

An example of the format in which tables are seen in pandas using the mice data. The table is stored in a variable called rdata.

In pandas, reading data in and out is easy with the pd.read_csv() and rdata.to_csv() functions. As you can see in the image above, the mice data is structured so that the indices represent the row number; the other columns are:

  • Trials which represents the trial number and is numbered from one to four for each trial in each day
  • Animal is the animal number which is in the range one to ten
  • Day stands for the day number and is numbered from one to six
  • Swimming Time represents the amount of time it took the mouse to find the platform in the water maze experiment.

I find it easier to work with the table labeled in this way, as opposed to having a different variable with the labels of the columns as we had done in Matlab. pandas also has great functions such as:

  • rdata.head() which shows the top rows of the dataframe
  • rdata.describe() which gives the count, mean, standard deviation and other statistics of the dataframe (not the most useful for this specific dataframe)
  • rdata.sort(columns = 'Animal') which sorts the data by a specific column, in this case the column Animal.

As you can see above, pandas (and Python in general) uses object-oriented methods. These work by taking the name of the object, in this case rdata, adding a period and then the function call. This shows you the result of the function but generally does not change the object itself, unless you assign the result back to the object (as in rdata = rdata.sort(columns = 'Animal')).
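To make those methods concrete, here is a tiny made-up stand-in for the mice data. (Note that rdata.sort(columns=...) was the pandas API at the time; newer pandas versions call this sort_values.)

```python
import pandas as pd

# a tiny made-up stand-in for the mice data
rdata = pd.DataFrame({
    'Trials': [1, 2, 1, 2],
    'Animal': [2, 2, 1, 1],
    'Day': [1, 1, 1, 1],
    'Swimming Time': [41.0, 35.5, 52.3, 48.1],
})

first_rows = rdata.head()                # the top rows of the frame
stats = rdata.describe()                 # count, mean, std, quartiles per column
by_animal = rdata.sort_values('Animal')  # .sort(columns='Animal') in older pandas
```

None of these calls modify rdata itself; each returns a new object.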

The idea of the analysis was to find the average swimming time per day across animals, to see whether there was any improvement as the mice learned the task. In Matlab we did this by:

1.

    for i=1:nday
        rows_day(:,i)=find(rdata(:,3)==i);
    end

This created a dataset in which the rows for each day were identified.

2.

    for i=1:nday
        time_day(:,i)=rdata(rows_day(:,i),5);
    end

Using the data set from step 1, we are able to get a new data set where the swimming time of each trial is listed for each day across animals.

3.

    m_day=mean(time_day);
    f=figure;
    a=axes;
    plot(m_day);
    ylabel('Swimming Time (s)')
    xlabel('Experimental Day')
    set(a,'xtick',1:nday)
    title('Average swimming time (s) per day across animals')

This results in this simple line graph:

[Image: mean_day_mat]

Graph output from Matlab

Here’s the same thing in pandas.

1.

    import pandas as pd

The usual importing at the beginning of each python script.

2.

    m_day = rdata.groupby('Day')['Swimming Time'].mean()
    m_day = pd.DataFrame({'Swimming Time (s)': m_day, 'Experimental Day': range(1, 7)})

Groupby is a useful command that groups the data by Day (the column named in parentheses) and selects the Swimming Time column (the square brackets). This eliminates sorting out the rows by day with a for loop, as in the Matlab code above, and it lets you group your data by any variable in your data frame. The .mean() at the end tells pandas to compute the mean of the grouped data.
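Here is the groupby step on a small made-up stand-in for the mice data, just to show what the grouped means look like:

```python
import pandas as pd

# made-up stand-in: 2 animals x 3 days x 2 trials each
rdata = pd.DataFrame({
    'Animal': [1, 1, 2, 2] * 3,
    'Day': [1] * 4 + [2] * 4 + [3] * 4,
    'Swimming Time': [60, 58, 55, 57, 40, 42, 38, 36, 20, 22, 18, 16],
})

# group rows by Day, pull out Swimming Time, and average within each group
m_day = rdata.groupby('Day')['Swimming Time'].mean()
```

The result is a Series indexed by Day, with one mean per day, and the per-day means fall as the (made-up) mice learn.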

3.

    m_day.plot(style='b-', x='Experimental Day', y='Swimming Time (s)', title='Average swimming time (s) per day across animals')

There are other Python plotting functions that may be a bit more elaborate, but in the spirit of doing everything in pandas I decided to show the pandas version. This results in this simple line graph, identical to the one above:

[Image: mean_day_py]

Graph output from Pandas

Figures can be easily saved in pandas by grabbing the underlying matplotlib figure from the axes object that plot returns:

    ax = m_day.plot(x='Experimental Day', y='Swimming Time (s)')
    fig = ax.get_figure()
    fig.savefig('mean_day_py.png')

Of course this is a very simple example of data analysis, but I think it does outline some of the benefits of pandas. The nicest thing in my opinion is the ease with which you can manipulate the data frame, and the ability to select columns by their name. The groupby function is very useful and can be used to group by multiple columns, or to aggregate several columns at once.

In my opinion, pandas is a much simpler and convenient way to work with and manipulate data.

Simulating data

Generating data from a probabilistic model with known parameters is a great way to test that any analysis you have written is working correctly, and to understand statistical methods more generally. In this post I’m going to show you an example of generating (simulating) data from a probabilistic model. A probabilistic model is stochastic: it doesn’t generate the same data set each time the simulation is run (unlike a deterministic model).

The following is actually the code that I used to generate the data in my posts on importing data and plotting here, so if you’ve been following this blog then you’re already familiar with the output of this post. This simulated data set consisted of 5 subjects, who were shown 20 trials at each combination of 7 contrasts, 5 spatial frequencies and 2 target sides.

The predictor variables

First we set up all the predictor variables (independent or experimental variables) we’re interested in.

Variable vectors

(subjects <- paste0("S", 1:5))  # the double parentheses just serve to print the output
## [1] "S1" "S2" "S3" "S4" "S5"

As we’ve discussed, the paste0 command is for concatenating (sticking together) strings. The command above creates a character vector of five strings, starting with “S1” and ending with “S5”.

(contrasts <- exp(seq(-6, -1, l = 7)))
## [1] 0.002479 0.005704 0.013124 0.030197 0.069483 0.159880 0.367879

This creates a vector of length 7 containing log-spaced contrast values… and so on for the other variables:

(sfs <- exp(seq(log(0.5), log(40), l = 5)))
## [1]  0.500  1.495  4.472 13.375 40.000
target_sides <- c("left", "right")
(n_trials <- 20)
## [1] 20
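As an aside for readers following along in Python, the same variable vectors are easy to build with numpy (a sketch, not part of the original R analysis):

```python
import numpy as np

subjects = ['S' + str(i) for i in range(1, 6)]          # 'S1' ... 'S5'
contrasts = np.exp(np.linspace(-6, -1, 7))              # 7 log-spaced contrasts
sfs = np.exp(np.linspace(np.log(0.5), np.log(40), 5))   # 5 log-spaced spatial frequencies
target_sides = ['left', 'right']
n_trials = 20
```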

Combining the variables

Now that we’ve set up the independent variables from the experiment, we want to create a data frame with one row for each combination of the variables for each subject. We can do this easily with R’s expand.grid command.

dat <- expand.grid(subject = subjects, contrast = contrasts, sf = sfs, target_side = target_sides, 
    trial = 1:n_trials)
head(dat)
##   subject contrast  sf target_side trial
## 1      S1 0.002479 0.5        left     1
## 2      S2 0.002479 0.5        left     1
## 3      S3 0.002479 0.5        left     1
## 4      S4 0.002479 0.5        left     1
## 5      S5 0.002479 0.5        left     1
## 6      S1 0.005704 0.5        left     1

You can see from the first few rows of dat what expand.grid has done: it has created a data frame (called dat) that contains each combination of the variables entered. A call to summary shows us how R has usefully made factors out of string variables (subject and target side):

summary(dat)
##  subject      contrast            sf        target_side      trial      
##  S1:1400   Min.   :0.0025   Min.   : 0.50   left :3500   Min.   : 1.00  
##  S2:1400   1st Qu.:0.0057   1st Qu.: 1.50   right:3500   1st Qu.: 5.75  
##  S3:1400   Median :0.0302   Median : 4.47                Median :10.50  
##  S4:1400   Mean   :0.0927   Mean   :11.97                Mean   :10.50  
##  S5:1400   3rd Qu.:0.1599   3rd Qu.:13.37                3rd Qu.:15.25  
##            Max.   :0.3679   Max.   :40.00                Max.   :20.00

You can tell that something is a factor because instead of showing the summary statistics (mean, median) it just shows how many instances of each level there are.
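For Python readers: there is no direct expand.grid in pandas, but itertools.product builds the same set of rows. This is a sketch with rounded contrast values; note that the iteration order differs from R's, which varies the first variable fastest.

```python
import itertools
import pandas as pd

subjects = ['S' + str(i) for i in range(1, 6)]
contrasts = [0.0025, 0.0057, 0.0131, 0.0302, 0.0695, 0.1599, 0.3679]
sfs = [0.5, 1.5, 4.47, 13.37, 40.0]
target_sides = ['left', 'right']
trials = range(1, 21)

# one row per combination of the variables: expand.grid's job in R
dat = pd.DataFrame(
    list(itertools.product(subjects, contrasts, sfs, target_sides, trials)),
    columns=['subject', 'contrast', 'sf', 'target_side', 'trial'])
```

Either way you end up with 5 × 7 × 5 × 2 × 20 = 7000 rows.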

To make the modelling a bit simpler, we’re also going to create a factor of spatial frequency:

dat$sf_factor <- factor(round(dat$sf, digits = 2))
summary(dat)
##  subject      contrast            sf        target_side      trial      
##  S1:1400   Min.   :0.0025   Min.   : 0.50   left :3500   Min.   : 1.00  
##  S2:1400   1st Qu.:0.0057   1st Qu.: 1.50   right:3500   1st Qu.: 5.75  
##  S3:1400   Median :0.0302   Median : 4.47                Median :10.50  
##  S4:1400   Mean   :0.0927   Mean   :11.97                Mean   :10.50  
##  S5:1400   3rd Qu.:0.1599   3rd Qu.:13.37                3rd Qu.:15.25  
##            Max.   :0.3679   Max.   :40.00                Max.   :20.00  
##  sf_factor   
##  0.5  :1400  
##  1.5  :1400  
##  4.47 :1400  
##  13.37:1400  
##  40   :1400  
## 

The data model

Now that we have the data frame for the experimental variables, we want to think about how to generate a lawful (but probabilistic) relationship between the experimental variables and the outcome variable (in this case, getting a trial correct or incorrect). I am going to do this in the framework of a logistic Generalised Linear Model (GLM).

For this data set, let’s say that each subject’s chance of getting the trial correct is a function of the contrast and the spatial frequency on the trial. I’m going to treat contrast as a continuous predictor (aka a covariate) and spatial frequency as a discrete variable (a factor). I could treat both as continuous, but visual sensitivity as a function of spatial frequency is non-monotonic, so I thought this blog post would be simpler if I just treat my five levels as discrete. In the simple GLM I’m going to use, this means that each subject will have six coefficients: an intercept, the slope of contrast, and an offset for each of the four non-reference levels of spatial frequency.

Design matrix

R is pretty amazing for doing stuff in a GLM framework. For this example, we can generate the design matrix for our data frame dat using the model.matrix function:

X <- model.matrix(~log(contrast) + sf_factor, data = dat)
head(X)
##   (Intercept) log(contrast) sf_factor1.5 sf_factor4.47 sf_factor13.37
## 1           1        -6.000            0             0              0
## 2           1        -6.000            0             0              0
## 3           1        -6.000            0             0              0
## 4           1        -6.000            0             0              0
## 5           1        -6.000            0             0              0
## 6           1        -5.167            0             0              0
##   sf_factor40
## 1           0
## 2           0
## 3           0
## 4           0
## 5           0
## 6           0

The formula ~ log(contrast) + sf_factor tells R that we want to predict something using contrast (a covariate) and additive terms of the factor sf_factor. If we changed the + to a * this would give us all the interaction terms too (i.e. the slope would be allowed to vary with spatial frequency). Note how our first 6 rows have all zeros for the columns of sf_factor: this is because all the first rows in dat are from sf 0.5, which here is the reference level.

R has automatically dummy coded sf_factor, dropping the reference level (this defaults to the first level of the factor). A trial with a spatial frequency of 0.5 will have zeros for all the sf_factor columns of the design matrix; a trial with a spatial frequency of 1.5 will have a one in the sf_factor1.5 column and zeros in the other sf_factor columns.
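The same dummy coding can be sketched in pandas with get_dummies, whose drop_first argument plays the role of dropping the reference level (an illustration with a handful of made-up trials, not the full simulated data):

```python
import pandas as pd

levels = ['0.5', '1.5', '4.47', '13.37', '40']
# a handful of made-up trials at different spatial frequencies
sf_factor = pd.Series(pd.Categorical(['0.5', '0.5', '1.5', '4.47', '13.37', '40'],
                                     categories=levels))

# drop_first drops the reference level, like R's default treatment coding
X = pd.get_dummies(sf_factor, drop_first=True)
```

Trials at the reference level (0.5) get a row of all zeros, just as in the R design matrix above.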

Coefficients (parameters)

Next we need our "true" coefficients for each subject. I will generate these as samples from a normal distribution (this is where the first stochastic part of our data generation comes in). But first, I will set R's random number seed so that you can reproduce the results described here (i.e. this will make the result deterministic). If you want stochasticity again, just comment out this line:

set.seed(424242)

Now the coefficients:

b0 <- rnorm(length(subjects), mean = 7, sd = 0.2)  # intercept
b1 <- rnorm(length(subjects), mean = 2, sd = 0.2)  # slope of contrast
b2 <- rnorm(length(subjects), mean = 2, sd = 0.2)  # sf 1.5
b3 <- rnorm(length(subjects), mean = 1.5, sd = 0.2)  # sf 4.4
b4 <- rnorm(length(subjects), mean = 0, sd = 0.2)  # sf 13
b5 <- rnorm(length(subjects), mean = -2, sd = 0.2)  # sf 40

print(betas <- matrix(c(b0, b1, b2, b3, b4, b5), nrow = 5))
##       [,1]  [,2]  [,3]  [,4]    [,5]   [,6]
## [1,] 7.088 2.445 2.101 1.553 -0.2687 -2.138
## [2,] 7.194 1.974 2.279 1.722  0.1459 -2.243
## [3,] 7.152 2.232 2.136 1.510  0.2344 -2.189
## [4,] 7.034 2.070 2.242 1.391  0.1914 -1.790
## [5,] 7.085 1.922 1.705 1.541  0.3116 -1.871

The above matrix has the subjects in the rows and the coefficients in the columns. That is, subject 2's "true" (i.e. generating) parameters are 7.194, 1.974, 2.279, etc.

Generating predictions

To generate predictions in the linear space of the GLM, we take the product of the design matrix and the betas for each subject:

eta <- rep(NA, length = nrow(dat))
for (i in 1:length(subjects)) {
    # for each subject
    this_subj <- levels(dat$subject)[i]  # which subject?
    subj_rows <- dat$subject == this_subj  # which rows belong to this subject?

    # create a design matrix, and pull out the betas for this subject:
    this_X <- model.matrix(~log(contrast) + sf_factor, data = dat[subj_rows, 
        ])
    this_betas <- betas[i, ]

    # mult, stick into eta:
    eta[subj_rows] <- this_X %*% this_betas
}

# stick eta into dat:
dat$eta <- eta

The multiplication of the design matrix and the coefficient vector above is equivalent to, for each row, multiplying each number in the design matrix by the corresponding beta value, then summing these products to produce one number per row. This number is the linear predictor (and if we were dealing with simple linear regression, we'd be done now).
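In numpy terms, this row-wise multiply-and-sum is just a matrix-vector product (a toy sketch with made-up numbers, not the simulation's actual design matrix or coefficients):

```python
import numpy as np

# a toy design matrix: intercept, log contrast, two dummy columns
X = np.array([[1.0, -6.0, 0.0, 0.0],
              [1.0, -3.0, 1.0, 0.0],
              [1.0, -1.0, 0.0, 1.0]])
beta = np.array([7.0, 2.0, 2.0, -2.0])   # made-up coefficients

# the matrix-vector product: multiply each row by beta elementwise and sum
eta = X @ beta
# identical to (X * beta).sum(axis=1)
```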

For our application however, we want to turn the linear predictor eta into a probability that ranges from 0.5 (chance performance on the task) to 1. To do this we're going to use one of the custom link functions in the psyphy package for fitting psychophysical data in R.

library(psyphy)
links <- mafc.weib(2)  # creates a list of functions.

dat$p <- links$linkinv(dat$eta)  # run the linear predictor through the inverse link function to get a probability.
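If you're curious what linkinv is actually computing: my reading of psyphy's mafc.weib(m) is that the inverse link is a cloglog-style Weibull scaled to run from chance (1/m) up to 1. Treat the formula below as my assumption rather than the package's documented definition, though it does reproduce the p values in the summary output further below (e.g. eta = 0.286 maps to p = 0.868):

```python
import numpy as np

def weib_mafc_linkinv(eta, m=2):
    """Assumed inverse link for an m-AFC Weibull model: a cloglog-style
    curve rescaled to run from chance performance (1/m) up to 1."""
    return 1.0 / m + (1.0 - 1.0 / m) * (1.0 - np.exp(-np.exp(eta)))

# very negative eta gives chance (0.5), large eta approaches 1
p = weib_mafc_linkinv(np.array([-9.0, 0.286, 7.5]))
```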

Test that parameters are in a decent range by plotting:

library(ggplot2)
fig <- ggplot(dat, aes(x = contrast, y = p)) + geom_line() + facet_grid(sf_factor ~ 
    subject) + coord_cartesian(ylim = c(0.45, 1.05)) + scale_x_log10() + scale_y_continuous(breaks = c(0.5, 
    0.75, 1)) + theme_minimal()
fig

[Image: unnamed-chunk-12]

Finally, we want to generate a binary outcome (success or failure, usually denoted 1 and 0) with probability p for each trial. We do this using R's rbinom function, which generates random samples from the binomial distribution (but remember, since we set the seed above, you can reproduce my numbers):

dat$y <- rbinom(nrow(dat), 1, prob = dat$p)
summary(dat)
##  subject      contrast            sf        target_side      trial      
##  S1:1400   Min.   :0.0025   Min.   : 0.50   left :3500   Min.   : 1.00  
##  S2:1400   1st Qu.:0.0057   1st Qu.: 1.50   right:3500   1st Qu.: 5.75  
##  S3:1400   Median :0.0302   Median : 4.47                Median :10.50  
##  S4:1400   Mean   :0.0927   Mean   :11.97                Mean   :10.50  
##  S5:1400   3rd Qu.:0.1599   3rd Qu.:13.37                3rd Qu.:15.25  
##            Max.   :0.3679   Max.   :40.00                Max.   :20.00  
##  sf_factor         eta               p               y       
##  0.5  :1400   Min.   :-9.717   Min.   :0.500   Min.   :0.00  
##  1.5  :1400   1st Qu.:-2.925   1st Qu.:0.526   1st Qu.:1.00  
##  4.47 :1400   Median : 0.286   Median :0.868   Median :1.00  
##  13.37:1400   Mean   : 0.004   Mean   :0.776   Mean   :0.77  
##  40   :1400   3rd Qu.: 3.240   3rd Qu.:1.000   3rd Qu.:1.00  
##               Max.   : 7.500   Max.   :1.000   Max.   :1.00

This call generates a vector of the same length as the number of rows in dat, where each element is the outcome of a single binomial trial (i.e. a Bernoulli trial) with probability given by the corresponding element of the vector dat$p.
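The numpy equivalent of this rbinom call, for Python readers (a sketch with a few made-up probabilities):

```python
import numpy as np

rng = np.random.default_rng(424242)
p = np.array([0.5, 0.9, 0.99, 1.0])  # made-up per-trial probabilities

# one Bernoulli draw per probability: numpy's analogue of R's rbinom(n, 1, p)
y = rng.binomial(1, p)
```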

Converting from "y" (correct / incorrect) to "response"

Finally, as a little extra flavour, I convert the binary correct / incorrect response above into a "left" or "right" response by the simulated subject on each trial. Note that you wouldn't normally do this, I'm just doing it for demo purposes for the data import post.

dat$response <- "left"
dat$response[dat$target_side == "right" & dat$y == 1] <- "right"  # correct on right
dat$response[dat$target_side == "left" & dat$y == 0] <- "right"  # wrong on left
dat$response <- factor(dat$response)
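A Python sketch of the same conversion, using np.where on a few made-up trials:

```python
import numpy as np
import pandas as pd

# four made-up trials: two targets on each side, one correct and one wrong each
dat = pd.DataFrame({
    'target_side': ['left', 'left', 'right', 'right'],
    'y': [1, 0, 1, 0],
})

# a correct trial means the response matched the target side;
# a wrong trial means the response was the other side
other_side = np.where(dat['target_side'] == 'left', 'right', 'left')
dat['response'] = np.where(dat['y'] == 1, dat['target_side'], other_side)
```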

Saving the data

Now I'm going to save a subset of the data here to a series of .csv files. These are the files we imported in my post on importing data. First, to make the data set look a little more realistic I'm going to shuffle the row order (as if I randomly interleaved trials in an experiment).

new_rows <- sample.int(nrow(dat))
dat <- dat[new_rows, ]

Create a unique id for each trial

I'm going to add a unique identifier (UUID) to each trial. This is a good thing to do in your experiment script. It makes it easier to, for example, check that a data set stored in a different table (e.g. eye tracking or brain imaging data) syncs up with the correct psychophysical trial. You could also add the exact time stamp to each trial.
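In Python the same thing is one line with the standard-library uuid module (a sketch):

```python
import uuid

n_rows = 5  # stand-in for the number of trials in the data frame
ids = [str(uuid.uuid4()) for _ in range(n_rows)]  # one random UUID per trial
```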

library(uuid)
ids <- rep(NA, length = nrow(dat))
for (i in 1:length(ids)) ids[i] <- UUIDgenerate()

dat$unique_id <- ids

Writing to a file

To do this I'm going to use the paste0 command, which you should already be familiar with from the data import post. This allows you to stick strings (text) together.
On the line with paste0, the first part gives us the project working directory, then we go into the data directory and finally we create the filename from the subject name.

for (i in 1:length(subjects)) {
    # for each subject
    this_subj <- levels(dat$subject)[i]  # which subject?

    # use the subset command to subset the full data frame and remove some
    # variables:
    this_dat <- subset(dat, subject == this_subj, select = c(-trial:-y))

    output_file <- paste0(getwd(), "/data/data_", this_subj, ".csv")
    write.table(this_dat, file = output_file, row.names = FALSE, sep = ",")
}

Summing up

That's one way to generate data from an underlying probability model. I've used it to generate some data to import and plot, but this is a really great thing to know how to do. It's useful for testing your analysis code (if you don't get back the parameters you put in, you know something is wrong). I also found I learned a lot about statistics more generally by simulating various tests and models.

This post concludes the more "practical" content I've planned so far. In the next few posts I'm planning to talk about some more general things. If you have requests for future "practical" posts, let me know in the comments below.

Data import: follow-up

This is a quick update post following up my data import post. I have put a script file into the /funs/ directory of my blog project that repeats the import and saving stuff I stepped through in that last post. You can find it on Github here. Feel free to fork that repository, but if you don’t want to deal with all the git and version control stuff you can just click the Download Zip button on the right to get all the files as a zip archive. The data_import.R script can be sourced via RStudio or an R command prompt, and will reproduce the contrast_data.RData file in the /out/ directory. For this to work, your working directory needs to be set to the project’s root directory; the easiest way to do this is by setting up a Project in RStudio located in the root directory. When you open your R project in RStudio, the working directory will automatically be set to the root.

I also wanted to point you to this article on using the “good parts” of R. It’s certainly true that some of R’s base syntax and functions are kind of horrible; using those add-on packages is really helpful. I learned some new things there too – like the use of data.table.