Confirmatory experiments and researcher degrees of freedom

A few months ago I attended a talk given by a professor of cognitive neuroscience / psychology. The professor presented several experiments to support a particular hypothesis, including both behavioural studies and fMRI. The final minutes of the presentation were used to tell us about some exciting new findings that could suggest an interesting new effect. In presenting these results, the professor stated “we’ve only run 10 subjects so far and this difference is not quite significant yet, but we’re collecting a few more people and we expect it to drop under .05″.

This is an example of a “researcher degree of freedom”, “questionable research practice”, or “p-hacking” (specifically, we could call this example “data-peeking”). In my experience it is very common in experimental psychology, and recent publications show it’s a problem much more broadly (see e.g. here, here and this article by Chris Chambers).

Why does data-peeking happen? I believe that in almost all cases there is no malicious intent to mislead, but rather that it arises from a faulty intuition.

Researchers intuit that having more data should lead to a better estimate of the true effect. The intuition is correct, but where people go wrong is assuming it applies to statistical testing too. Unfortunately, many researchers (including my former self) don’t understand this, and erroneously rely on their intuition.

The Garden of Forking Paths *

It essentially boils down to this: if your data depend on your analyses or your analyses depend on your data, you could be on thin inferential ice. Daniel Lakens has a nice post on the former, while Gelman and Loken have an article on the latter that’s well worth your time (now published in revised form here).

Data depend on analyses

An example of this is if you test a few subjects, check the result, and maybe collect some more data because it’s “trending towards significance” (as for our anonymous professor, above). If you just apply a p-value as normal, it means that your false positive rate is no longer equal to the nominal alpha level (e.g., 0.05), but is actually higher. You’re more likely to reject the null hypothesis and call something significant if you data peek – unless you apply statistical corrections for your stopping rules (called “sequential testing” in the clinical trials literature; see this post by Daniel Lakens has some info on how to correct for this).

Analyses depend on data

An example of this is if you collect 20 subjects, then realise two of them show some “outlier-like” behavior that you hadn’t anticipated: reaction times that are too fast to be task-related. You decide to exclude the trials with “too-fast” reaction times, defining “too-fast” based on the observed RT distribution**. This neatens up your dataset — but given different data (say, those two subjects behaved like the others), your analysis would have looked different. In this case, your analyses are dependent on the data.

I believe this happens all the time in experimental psychology / neuroscience. Other examples include grouping levels of a categorical variable differently, defining which “multiple comparisons” to correct for, defining cutoffs for “regions of interest”… When your analyses depend on the data, you’re doing exploratory data analysis. Why is that a problem? By making data-dependent decisions you’ve likely managed to fit noise in your data, and this won’t hold for new, noisy data: you’re increasing the chance that your findings for this dataset won’t generalise to a new dataset.

Exploratory analyses are often very informative — but they should be labelled as such. As above, your actual false positive rate will be higher than your nominal false positive rate (alpha level) when you use a p-value.

We should be doing more confirmatory research studies

For experimental scientists, the best way to ensure that your findings are robust is to run a confirmatory study. This means

  1. Collect the data with a pre-specified plan: how many participants, then stop. If you plan on having contingent stopping rules (doing sequential testing) then follow the appropriate corrections for any test statistics you use to make inferential decisions.
  2. Analyse the data with an analysis pipeline (from data cleaning to model fitting and inference) that has been prespecified, without seeing the data.
  3. Report the results of those analyses.

If you started out with a data-dependent (exploratory) analysis, report it as such in the paper, then add a confirmatory experiment ***. There’s a great example of this approach from experimental psychology (in this case, a negative example). Nosek, Spies and Motyl report finding that political moderates were better at matching the contrast (shade of grey) of a word than those with more extreme (left or right) political ideologies (p = 0.01, N = 1,979 — it was an online study). Punch line: “Political extremists perceive the world in black and white — literally and figuratively”. However, the authors were aware that they had made several data-dependent analysis decisions, so before rushing off to publish their finding, they decided to run a direct confirmatory study. New N = 1,300, new p-value = 0.59.

The best way to show the community that your analysis is confirmatory is to pre-register your study. One option is to go on something like the Open Science Framework, submit a document with your methods and detailed analysis plan, then register it. The project can stay private (so nobody else can see it), but now there’s a record of your analyis plan registered before data collection. A better option is to submit a fully registered report. In this case, you can send your introduction, method and analysis plan to a journal, where it is peer reviewed and feedback given — all before the data are collected. Taking amendments into account, you then run off and collect the data, analyse it as agreed, and report the results. In a registered report format, the journal agrees to publish it no matter the result. If the study is truly confirmatory, a null result can be informative too.

Of course, there’s still trust involved in both of these options – and that’s ok. It’s hard to stop people from outright lying. I don’t think that’s a big problem, because the vast majority of scientists really want to do the right thing. It’s more that people just don’t realise that contingent analyses can be a problem, or they convince themselves that their analysis is fine. Pre-registration can help you convince yourself that you’ve really run a confirmatory study.


I hope the considerations above are familiar to you already — but if you’re like many people in experimental psychology, neuroscience, and beyond, maybe this is the first you’ve heard of it. In practice, most people I know (including myself) are doing exploratory studies all the time. Full disclosure: I’ve never reported a truly confirmatory study, ever. In a follow-up post, I’m going to speculate about how the recommendations above might be practically implemented for someone like me.

* the title refers to the short story by Jorge Luis Borges, and was used by Gelman and Loken to refer to data-dependent analyses.

** Note: this is a very bad idea, because it ignores any theoretical justification for what “too fast to be task-related” is. I use it only for example here. I have a more general problem with outlier removal, too: unless the data is wrong (e.g. equipment broke), then change the model, not the data. For example, if your data have a few outliers, run a robust regression (i.e. don’t assume Gaussian noise). Of course, this is another example of a data-dependent analysis. Run a confirmatory experiment…

*** An equivalent method is to keep a holdout set of data that’s only analysed at the end — if your conclusions hold, you’re probably ok.

High-level plotting in Python

If you have read my older blog posts, you know I’m a fan of R’s ggplot2 library for exploratory data analysis. It allows you to examine many views onto data, creating summaries over different variables right there as you plot. I gave a short tutorial here. Once you understand how the library works, it’s a very powerful way of quickly seeing patterns in large datasets.

Moving over to Python

The past few months have seen me ramp up my Python useage (as I preempted almost a year ago). This is for a variety of reasons, but mostly (a) Python is a general programming language, so you can essentially do anything in it without it being a massive hack; (b) the scientific computing packages available for Python are generally super awesome; (c) it seems to me to be poised to replace Matlab as the lingua franca of neuroscience / vision research — at least, much more so than R. See Tal Yarkoni’s blog for a related discussion. Of course, it’s also free and open source.

What I’m still kind of missing in Python is a nice high-level plotting library like ggplot2. Certainly, there are many in the works, but in playing around I’ve found that none of them can match R’s ggplot2 yet.

Amphibious assault

Of the libraries I’ve tried (main competitor is yhat’s ggplot port), my go-to library for plotting in Python is hands-down Seaborn. Developed largely by Michael Waskom, a graduate student in Cog Neuro at Stanford, it has the prettiest plots, and also the best ideas for high-level functionality.

Speeding up Seaborn?

Unfortunately, compared to R’s ggplot2, Seaborn data summary functions are slow. As an example, I’ll show you a summary plot using the simulated data from my R tutorials.

We want to summarise our (simulated) subjects’ detection performance as a function of stimulus contrast, for each spatial frequency. In R, my code looks like this:

    dat <- read.csv('contrast_data.csv')

    plot_fun <- function(){
      fig <- ggplot(data = dat, aes(x=contrast, y=correct, colour=factor(sf))) +
        stat_summary( = "mean_cl_boot") + 
        scale_x_log10() + scale_y_continuous(limits = c(0,1)) + 
        facet_wrap( ~ subject, ncol=3)  +  
        stat_smooth(method = "glm", family=binomial())


This takes about 3 seconds (2.67) to return the following plot:

<a href=""><img src="; alt="ggplot_timing" width="600" height="485" class="aligncenter size-full wp-image-374" /></a>

EDIT: for some reason WordPress is being silly and won't let me actually show this picture. Suffice to say it looks substantively similar to the plot below

Let's contrast this to Seaborn. First the code:

# do imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import ggplot

dat = pd.read_csv('contrast_data.csv')

# make a log contrast variable (because Seaborn can't auto-log an axis like ggplot):
dat['log_contrast'] = np.log(dat['contrast'])

# set some styles we like:
pal = sns.cubehelix_palette(5, start=2, rot=0, dark=0.1, light=.8, reverse=True)

fig = sns.lmplot("log_contrast", "correct", dat,

(note that the above lines were run in an ipython notebook, hence the %%timeit magic operator).

Seaborn takes… wait for it… 2 minutes and 11 seconds to produce the following plot:


(I’m using Python 3.3.5, Anaconda distribution 2.1.0, Seaborn 0.4.0.)

Note also that both packages are doing 1000 bootstraps by default (as far as I’m aware), so I’m pretty sure they’re doing equivalent things.

What can be done?

This is obviously not a positive factor for my switch from R to Python, and I’m hoping it’s just that I’ve done something wrong. However, another explanation is that whatever Seaborn is using to do the bootstrapping or logistic model fitting is just far less optimised than ggplot2’s backend in R.

The nice thing about open source software is that we can help to make this better. So if you’re a code guru who’s reading this and wants to contribute to the scientific computing world moving ever faster to Python, go fork the github repo now!


After I posted this, I opened an issue on Github asking the developers about the slow times. Turns out that the ci flag in sns.lmplot specifies confidence intervals for the logistic regression, which is also bootstrapped. Bootstrapping a logistic regression takes a while; setting ci=False means that Seaborn now takes about 7 seconds to produce that plot instead of 2 minutes.

So, hooray for Seaborn and for awesome open source developers!

information gain differences

How close are we to understanding image-based saliency?

How well can we predict where people look in stationary natural images? While the scope of this question addresses only a fraction of what it means to understand eye movements in natural environments, it nevertheless remains a starting point to study this complex topic. It’s also of great interest to both cognitive scientists and computer vision researchers since it has applications from advertising to robotics.

Matthias Kümmerer has come up with a statistical framework that answers this question in a principled way. Building on the nice work of Simon Barthelmé and colleagues, Matthias has shown how saliency models can be compared in units of information (i.e. using log-likelihoods). Since information provides a meaningful linear metric, it allows us to compare the distance between model predictions, a baseline (capturing image-independent spatial biases in fixations) and the gold standard (i.e. how well you could possibly do, knowing only the image).

So how close are we to understanding image-based saliency? Turns out, not very. The best model we tested (a state-of-the-art model from 2014 by Vig, Dorr and Cox) explained about one third of the possible information gain between the baseline and the gold standard in the dataset we used. If you want to predict where people look in stationary images, there’s still a long way to go.

In addition, our paper introduces methods to show, in individual images, where and by how much a model fails (see the image above). We think this is going to be really useful for people who are developing saliency models. Finally, we extend the approach to the temporal domain, showing that knowing about both spatial and temporal biases, but nothing about the image, gives you a better prediction than the best saliency model using only spatial information.

The nice thing about this last point is that it shows that Matthias’ method is very general. If you don’t think that measuring where people look in stationary images tells you much about eye movements in the real world, that’s fine. You can still use the method to quantify and compare data and models in your exciting new experiment.

A paper that goes into much more detail than this blog post is now available on arXiv. In particular, saliency experts should check out the appendices, where we think we’ve resolved some of the reasons why the saliency model comparison literature was so muddled.

We’re close to submitting it, so we’d love to hear feedback on the pitch and nuance of our story, or anything that may not be clear in the paper. You can send me an email to pass on your thoughts.

When the paper is out we will also be making a software framework available for model comparison and evaluation. We hope the community will find these to be useful tools.

As usual, everything that appears below this line are ads and not endorsed by me. I don’t make money from this site; I would have to pay WordPress to remove the ads.

Neurostats 2014 Highlights


Simon Barthelmé shows some highlights from Neurostats 2014 in Warwick. I particularly like the “in limbo” brain areas.

Originally posted on dahtah:

Last week the Neurostats 2014 workshop took place at the University of Warwick (co-organised by Adam Johansen, Nicolas Chopin, and myself). The goal was to put some neuroscientists and statisticians together to talk about neural data and what to do with it. General impressions:

  • The type of Bayesian hierarchical modelling that Andrew Gelman has been advocating for years is starting to see some use in neuroimaging. On the one hand it makes plenty of sense since the data at the level of individual subjects can be cr*p and so one could really use a bit of clever pooling. On the other, imaging data is very high-dimensional, running a Gibbs sampler can take days, and it’s not easy making the data comparable across subjects.
  • You have to know your signals. Neural data can be unbelievably complicated and details matter a lot, as Jonathan Victor showed in his talk. A consequence if…

View original 547 more words

Guest Post: Matlab versus Pandas for data analysis

Annelie Muehler is an undergraduate student who is about to finish a 2 month internship in our group. She has been working with me conducting psychophysical experiments, and we have been creating stimuli using python. As part of getting used to scientific python Annelie learned to use Pandas, a package that essentially gives you R’s data frames in Python. The following compares the code to do a simple analysis in Matlab and Python. While it’s possible there are ways to improve the Matlab implementation (perhaps using the statistics toolbox?), it’s noteworthy that these weren’t taught in Annelie’s course.

A comparison of Matlab and Pandas for a simple data analysis

As part of my undergraduate studies in cognitive psychology and neuroscience, I did a water maze experiment in an advanced biology/neuroscience lab course using mice. For this experiment, I had ten mice that did four trials of the experiment over a six day period. The point of this experiment is for the mice to be able to find the platform in the water with increasing speed as they complete more trials. This is because they learn where the platform is. The water maze experiment is one of the behavioural experiments used in mice and rats to test for the ability to learn. Later we used this data while we were learning Matlab in another lab class as a basis for learning data analysis.

During my internship at the Centre for Integrative Neuroscience in Tuebingen, Germany, I reanalyzed this data using pandas in python as a way to learn pandas, giving me a direct comparison of Matlab and pandas. There are definitely some very nice things about pandas. First, you are able to define your own index and column names that are shown surrounding the matrix in a format similar to a table in a word processing document or excel file. This was one of the most frustrating things for me in Matlab because in Matlab you have a dataset and then another variable which contains a list of strings that corresponded to the column names so that you can look them up there.


An example of the format in which tables are seen in pandas using the mice data. The table is stored in a variable called rdata.

In pandas, reading data in and out in is easy with the pd.read_csv() and rdata.to_csv function. As you can see in the image above, the mice data is structured so that the indices represent the row number, the other columns are:

  • Trials which represents the trial number and is numbered from one to four for each trial in each day
  • Animal is the animal number which is in the range one to ten
  • Day stands for the day number and is numbered from one to six
  • Swimming Time represents the amount of time it took the mouse to find the platform in the water maze experiment.

I find it easier to work with the table labeled in this way as opposed to having a different variable with the labels of the columns, as we had done in Matlab. Also pandas has great functions
such as:

  • rdata.head() which shows the top rows of the dataframe
  • rdata.describe() which gives the count, mean, standard deviation and other statistics of the dataframe (not the most useful for this specific dataframe)
  • rdata.sort(columns = 'Animal') which sorts the data by a specific column, in this case the column Animal.

As you can see above, pandas (and python in general) has object-oriented functions. These work by using the name of the object, in this case rdata, adding a period and then typing the function. This will show you the result of the function but generally not change the actual object unless the object is equated with the function (as in rdata = rdata.sort(columns = 'Animal').

The idea of the analysis was the find the average swimming time per day across animals to see if there was any improvement as the mice learned the task. In Matlab we did this by:


for i=1:nday

This created a dataset in which the rows for each day were identified.


for i=1:nday

Using the data set from step 1, we are able to get a new data set where the swimming time of each trial is listed for each day across animals.


ylabel('Swimming Time (s)')
xlabel('Experimental Day')
title('Average swimming time (s) per day across animals')

This results in this simple line graph:


Graph output from Matlab
Here’s the same thing in pandas.


import pandas as pd

The usual importing at the beginning of each python script.


m_day = rdata.groupby('Day')['Swimming Time'].mean()
m_day = pd.DataFrame({'Swimming Time (s)':m_day, 'Experimental Day': range(1,7)})

Groupby is a useful command that will group the data by day (parentheses) according to Swimming Time (square brackets). This eliminates sorting out the rows by day using a for loop as is done in the Matlab code above and allows you to group your data according to different variables in your data frame. The .mean() operator at the end tells pandas that you want to compute the means on the grouped data.


m_day.plot(style='b-', x='Experimental Day', y='Swimming Time (s)', title='Average swimming time (s) per day across animals')

There are other python plot functions that may be a bit more elaborate but in the spirit of doing everything in pandas I decided to show the pandas version. This results in this simple line graph, identical to the one above:


Graph output from Pandas

Figures can be easily saved in pandas using:

fig = plot.get_figure()

Of course this is a very simple example of data analysis, but I think it does outline some of the benefits of pandas. The nicest thing in my opinion is the ease with which you can manipulate the data frame and the ability to select columns by their name. The groupby function is very useful and can be used to groupby multiple columns or to group multiple columns.

In my opinion, pandas is a much simpler and convenient way to work with and manipulate data.