New paper: contrast, eye movements, movies and modelling

Sensitivity to spatial distributions of luminance is a fundamental component of visual encoding. In vision science, contrast sensitivity has been studied most often using stimuli like sinusoidal luminance patterns (“gratings”). The contrast information in these stimuli is limited to a narrow band of scales and orientations. Visual encoding can be studied using these tiny building blocks, and the hope is that the full architectural design will become clear if we piece together enough of the tiny bricks and examine how they interact.

Real world visual input is messy and the interactions between the building blocks are strong. Furthermore, in the real world we make about 3 eye movements per second, whereas in traditional experiments observers keep their eyes still.

In this paper, we attempt to quantify the effects of this real-world mess on sensitivity to luminance contrast. We had observers watch naturalistic movie stimuli (a nature documentary) while their gaze position was tracked. Every few seconds, a local patch of the movie was modified by increasing its contrast. The patch was yoked to the observer’s gaze position, so that it always remained at the same position on the retina as the observer moved their eyes. The observer had to report the position of the target relative to their centre of gaze.

We quantify how sensitivity depended on various factors including when and where observers moved their eyes and the content (in terms of low-level image structure) of the movies. Then we asked how well one traditional model of contrast processing could account for our data, comparing it to an atheoretical generalised linear model. In short, the parameters of the mechanistic model were poorly constrained by our data and hard to interpret, whereas the atheoretical model did just as well in terms of prediction. More complex mechanistic models may need to be tested to provide concise descriptions of behaviour in our task.

I’m really happy that this paper is finally out. It’s been a hard slog – the data were collected way back in 2010, and since then it has been through countless internal iterations. Happily, the reviews we received upon eventual submission were very positive (thanks to our two anonymous reviewers for slogging through this long paper and suggesting some great improvements). I’ve got plenty of ideas for what I would do differently next time we try something like this, and it’s something I intend to follow up sometime. In the meantime, I’m interested to see what the larger community think about this work!

To summarise, you might like to check out the paper if you’re interested in:

  • contrast perception / processing
  • eye movements
  • naturalistic stimuli and image structure
  • Bayesian estimation of model parameters
  • combining all of the above in one piece of work

You can download the paper here.

The data analysis and modelling were all done in R; you can find the data and code here. Note that I’m violating one of my own suggestions (see also here) in that the data are not shared in a text-only format. This is purely a constraint of file size – R binaries are much more compressed than the raw csv files, so they fit into a Github repository. Not ideal, but the best I can do right now.

Confirmatory experiments and researcher degrees of freedom

A few months ago I attended a talk given by a professor of cognitive neuroscience / psychology. The professor presented several experiments to support a particular hypothesis, including both behavioural studies and fMRI. The final minutes of the presentation were used to tell us about some exciting new findings that could suggest an interesting new effect. In presenting these results, the professor stated “we’ve only run 10 subjects so far and this difference is not quite significant yet, but we’re collecting a few more people and we expect it to drop under .05″.

This is an example of a “researcher degree of freedom”, “questionable research practice”, or “p-hacking” (specifically, we could call this example “data-peeking”). In my experience it is very common in experimental psychology, and recent publications show it’s a problem much more broadly (see e.g. here, here and this article by Chris Chambers).

Why does data-peeking happen? I believe that in almost all cases there is no malicious intent to mislead, but rather that it arises from a faulty intuition.

Researchers intuit that having more data should lead to a better estimate of the true effect. The intuition is correct, but where people go wrong is assuming it applies to statistical testing too. Unfortunately, many researchers (including my former self) don’t understand this, and erroneously rely on their intuition.

The Garden of Forking Paths *

It essentially boils down to this: if your data depend on your analyses or your analyses depend on your data, you could be on thin inferential ice. Daniel Lakens has a nice post on the former, while Gelman and Loken have an article on the latter that’s well worth your time (now published in revised form here).

Data depend on analyses

An example of this is if you test a few subjects, check the result, and maybe collect some more data because it’s “trending towards significance” (as for our anonymous professor, above). If you just apply a p-value as normal, it means that your false positive rate is no longer equal to the nominal alpha level (e.g., 0.05), but is actually higher. You’re more likely to reject the null hypothesis and call something significant if you data peek – unless you apply statistical corrections for your stopping rules (called “sequential testing” in the clinical trials literature; see this post by Daniel Lakens has some info on how to correct for this).

Analyses depend on data

An example of this is if you collect 20 subjects, then realise two of them show some “outlier-like” behavior that you hadn’t anticipated: reaction times that are too fast to be task-related. You decide to exclude the trials with “too-fast” reaction times, defining “too-fast” based on the observed RT distribution**. This neatens up your dataset — but given different data (say, those two subjects behaved like the others), your analysis would have looked different. In this case, your analyses are dependent on the data.

I believe this happens all the time in experimental psychology / neuroscience. Other examples include grouping levels of a categorical variable differently, defining which “multiple comparisons” to correct for, defining cutoffs for “regions of interest”… When your analyses depend on the data, you’re doing exploratory data analysis. Why is that a problem? By making data-dependent decisions you’ve likely managed to fit noise in your data, and this won’t hold for new, noisy data: you’re increasing the chance that your findings for this dataset won’t generalise to a new dataset.

Exploratory analyses are often very informative — but they should be labelled as such. As above, your actual false positive rate will be higher than your nominal false positive rate (alpha level) when you use a p-value.

We should be doing more confirmatory research studies

For experimental scientists, the best way to ensure that your findings are robust is to run a confirmatory study. This means

  1. Collect the data with a pre-specified plan: how many participants, then stop. If you plan on having contingent stopping rules (doing sequential testing) then follow the appropriate corrections for any test statistics you use to make inferential decisions.
  2. Analyse the data with an analysis pipeline (from data cleaning to model fitting and inference) that has been prespecified, without seeing the data.
  3. Report the results of those analyses.

If you started out with a data-dependent (exploratory) analysis, report it as such in the paper, then add a confirmatory experiment ***. There’s a great example of this approach from experimental psychology (in this case, a negative example). Nosek, Spies and Motyl report finding that political moderates were better at matching the contrast (shade of grey) of a word than those with more extreme (left or right) political ideologies (p = 0.01, N = 1,979 — it was an online study). Punch line: “Political extremists perceive the world in black and white — literally and figuratively”. However, the authors were aware that they had made several data-dependent analysis decisions, so before rushing off to publish their finding, they decided to run a direct confirmatory study. New N = 1,300, new p-value = 0.59.

The best way to show the community that your analysis is confirmatory is to pre-register your study. One option is to go on something like the Open Science Framework, submit a document with your methods and detailed analysis plan, then register it. The project can stay private (so nobody else can see it), but now there’s a record of your analyis plan registered before data collection. A better option is to submit a fully registered report. In this case, you can send your introduction, method and analysis plan to a journal, where it is peer reviewed and feedback given — all before the data are collected. Taking amendments into account, you then run off and collect the data, analyse it as agreed, and report the results. In a registered report format, the journal agrees to publish it no matter the result. If the study is truly confirmatory, a null result can be informative too.

Of course, there’s still trust involved in both of these options – and that’s ok. It’s hard to stop people from outright lying. I don’t think that’s a big problem, because the vast majority of scientists really want to do the right thing. It’s more that people just don’t realise that contingent analyses can be a problem, or they convince themselves that their analysis is fine. Pre-registration can help you convince yourself that you’ve really run a confirmatory study.


I hope the considerations above are familiar to you already — but if you’re like many people in experimental psychology, neuroscience, and beyond, maybe this is the first you’ve heard of it. In practice, most people I know (including myself) are doing exploratory studies all the time. Full disclosure: I’ve never reported a truly confirmatory study, ever. In a follow-up post, I’m going to speculate about how the recommendations above might be practically implemented for someone like me.

* the title refers to the short story by Jorge Luis Borges, and was used by Gelman and Loken to refer to data-dependent analyses.

** Note: this is a very bad idea, because it ignores any theoretical justification for what “too fast to be task-related” is. I use it only for example here. I have a more general problem with outlier removal, too: unless the data is wrong (e.g. equipment broke), then change the model, not the data. For example, if your data have a few outliers, run a robust regression (i.e. don’t assume Gaussian noise). Of course, this is another example of a data-dependent analysis. Run a confirmatory experiment…

*** An equivalent method is to keep a holdout set of data that’s only analysed at the end — if your conclusions hold, you’re probably ok.

High-level plotting in Python

If you have read my older blog posts, you know I’m a fan of R’s ggplot2 library for exploratory data analysis. It allows you to examine many views onto data, creating summaries over different variables right there as you plot. I gave a short tutorial here. Once you understand how the library works, it’s a very powerful way of quickly seeing patterns in large datasets.

Moving over to Python

The past few months have seen me ramp up my Python useage (as I preempted almost a year ago). This is for a variety of reasons, but mostly (a) Python is a general programming language, so you can essentially do anything in it without it being a massive hack; (b) the scientific computing packages available for Python are generally super awesome; (c) it seems to me to be poised to replace Matlab as the lingua franca of neuroscience / vision research — at least, much more so than R. See Tal Yarkoni’s blog for a related discussion. Of course, it’s also free and open source.

What I’m still kind of missing in Python is a nice high-level plotting library like ggplot2. Certainly, there are many in the works, but in playing around I’ve found that none of them can match R’s ggplot2 yet.

Amphibious assault

Of the libraries I’ve tried (main competitor is yhat’s ggplot port), my go-to library for plotting in Python is hands-down Seaborn. Developed largely by Michael Waskom, a graduate student in Cog Neuro at Stanford, it has the prettiest plots, and also the best ideas for high-level functionality.

Speeding up Seaborn?

Unfortunately, compared to R’s ggplot2, Seaborn data summary functions are slow. As an example, I’ll show you a summary plot using the simulated data from my R tutorials.

We want to summarise our (simulated) subjects’ detection performance as a function of stimulus contrast, for each spatial frequency. In R, my code looks like this:

    dat <- read.csv('contrast_data.csv')

    plot_fun <- function(){
      fig <- ggplot(data = dat, aes(x=contrast, y=correct, colour=factor(sf))) +
        stat_summary( = "mean_cl_boot") + 
        scale_x_log10() + scale_y_continuous(limits = c(0,1)) + 
        facet_wrap( ~ subject, ncol=3)  +  
        stat_smooth(method = "glm", family=binomial())


This takes about 3 seconds (2.67) to return the following plot:

<a href=""><img src="; alt="ggplot_timing" width="600" height="485" class="aligncenter size-full wp-image-374" /></a>

EDIT: for some reason WordPress is being silly and won't let me actually show this picture. Suffice to say it looks substantively similar to the plot below

Let's contrast this to Seaborn. First the code:

# do imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import ggplot

dat = pd.read_csv('contrast_data.csv')

# make a log contrast variable (because Seaborn can't auto-log an axis like ggplot):
dat['log_contrast'] = np.log(dat['contrast'])

# set some styles we like:
pal = sns.cubehelix_palette(5, start=2, rot=0, dark=0.1, light=.8, reverse=True)

fig = sns.lmplot("log_contrast", "correct", dat,

(note that the above lines were run in an ipython notebook, hence the %%timeit magic operator).

Seaborn takes… wait for it… 2 minutes and 11 seconds to produce the following plot:


(I’m using Python 3.3.5, Anaconda distribution 2.1.0, Seaborn 0.4.0.)

Note also that both packages are doing 1000 bootstraps by default (as far as I’m aware), so I’m pretty sure they’re doing equivalent things.

What can be done?

This is obviously not a positive factor for my switch from R to Python, and I’m hoping it’s just that I’ve done something wrong. However, another explanation is that whatever Seaborn is using to do the bootstrapping or logistic model fitting is just far less optimised than ggplot2’s backend in R.

The nice thing about open source software is that we can help to make this better. So if you’re a code guru who’s reading this and wants to contribute to the scientific computing world moving ever faster to Python, go fork the github repo now!


After I posted this, I opened an issue on Github asking the developers about the slow times. Turns out that the ci flag in sns.lmplot specifies confidence intervals for the logistic regression, which is also bootstrapped. Bootstrapping a logistic regression takes a while; setting ci=False means that Seaborn now takes about 7 seconds to produce that plot instead of 2 minutes.

So, hooray for Seaborn and for awesome open source developers!

information gain differences

How close are we to understanding image-based saliency?

How well can we predict where people look in stationary natural images? While the scope of this question addresses only a fraction of what it means to understand eye movements in natural environments, it nevertheless remains a starting point to study this complex topic. It’s also of great interest to both cognitive scientists and computer vision researchers since it has applications from advertising to robotics.

Matthias Kümmerer has come up with a statistical framework that answers this question in a principled way. Building on the nice work of Simon Barthelmé and colleagues, Matthias has shown how saliency models can be compared in units of information (i.e. using log-likelihoods). Since information provides a meaningful linear metric, it allows us to compare the distance between model predictions, a baseline (capturing image-independent spatial biases in fixations) and the gold standard (i.e. how well you could possibly do, knowing only the image).

So how close are we to understanding image-based saliency? Turns out, not very. The best model we tested (a state-of-the-art model from 2014 by Vig, Dorr and Cox) explained about one third of the possible information gain between the baseline and the gold standard in the dataset we used. If you want to predict where people look in stationary images, there’s still a long way to go.

In addition, our paper introduces methods to show, in individual images, where and by how much a model fails (see the image above). We think this is going to be really useful for people who are developing saliency models. Finally, we extend the approach to the temporal domain, showing that knowing about both spatial and temporal biases, but nothing about the image, gives you a better prediction than the best saliency model using only spatial information.

The nice thing about this last point is that it shows that Matthias’ method is very general. If you don’t think that measuring where people look in stationary images tells you much about eye movements in the real world, that’s fine. You can still use the method to quantify and compare data and models in your exciting new experiment.

A paper that goes into much more detail than this blog post is now available on arXiv. In particular, saliency experts should check out the appendices, where we think we’ve resolved some of the reasons why the saliency model comparison literature was so muddled.

We’re close to submitting it, so we’d love to hear feedback on the pitch and nuance of our story, or anything that may not be clear in the paper. You can send me an email to pass on your thoughts.

When the paper is out we will also be making a software framework available for model comparison and evaluation. We hope the community will find these to be useful tools.

As usual, everything that appears below this line are ads and not endorsed by me. I don’t make money from this site; I would have to pay WordPress to remove the ads.