Reproducible research

Open peer review!

Our manuscript “Detecting distortions of peripherally-presented letter stimuli under crowded conditions” (see here) has received an open peer review from Will Harrison. Thanks for your comments, Will! They will be valuable in improving the manuscript in a future revision.

Hurrah for open science!*

* As suggested here, from now on “Open Science” will just be called “Science”, and everything else will be called “Closed Science”.

New paper: testing models of peripheral encoding using appearance matching

We only perceive a small fraction of the information that enters our eyes. What information is represented and what is discarded? Freeman and Simoncelli (2011) introduced a neat method of psychophysically testing image-based models of visual processing. If images that produce identical model responses also appear identical to human observers, it implies that the model is only discarding information that does not matter for perception (and conversely, retaining image structure that matters). The images are metamers: physically different images that appear the same (the term originates in the study of colour vision).

Our latest paper extends this approach and sets a higher bar for model testing. In the original study, Freeman and Simoncelli synthesised images from a model of peripheral visual processing, and showed that observers could not tell two synthesised images apart from each other at an appropriate level of information loss (in this case, the scaling of pooling regions spanning into the visual periphery). However, observers in these experiments never compared the model images to the original (unmodified) images. If we’re interested in the appearance of natural scenes, this is not a sufficient test. To take one extreme, a “blind” model that discarded all visible information would produce images that were indiscriminable from each other, but would no doubt fail to match the appearance of a natural source image.

We extend this approach by having observers compare model-compressed images to the original image. If models are good candidates for the kind of information preserved in the periphery, they should succeed in matching the appearance of the original scenes (that is, the information they preserve is sufficient to match appearance).

We apply this logic to two models of peripheral appearance: one in which high spatial frequency information (fine detail) is lost in the periphery (simulated using Gaussian blur), and another in which image structure in the periphery is assumed to be “texturised”. We had observers discriminate images using a three-alternative temporal oddity task. Three image patches are presented consecutively; two are identical to each other, and one is different. The “oddball” could be either the original or the modified image. The observer indicates whether image 1, 2 or 3 was different to the other two. If the images appear identical, the observer will achieve 33% correct, on average.
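
To make the chance level concrete, here is a minimal simulation sketch in Python (this is not code from the paper; the trial count and random seed are arbitrary assumptions). An observer who cannot tell the images apart guesses on every trial and is correct about a third of the time, and observed performance can be compared against that 1/3 chance level with a binomial test.

    import numpy as np
    from scipy.stats import binomtest

    rng = np.random.default_rng(1)
    n_trials = 300  # arbitrary; not the number of trials used in the paper

    # A guessing observer: each of the three intervals is equally likely to be
    # chosen, so each trial is correct with probability 1/3.
    correct = rng.random(n_trials) < 1 / 3
    n_correct = int(correct.sum())
    print(f"proportion correct: {n_correct / n_trials:.2f}")

    # Is performance above chance? For a true metamer we would expect to
    # retain the null hypothesis here.
    result = binomtest(n_correct, n_trials, p=1 / 3, alternative="greater")
    print(f"binomial test against 1/3: p = {result.pvalue:.2f}")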

Our results show that neither the blur model nor the texture model is particularly good at matching peripheral appearance. Human observers were more sensitive to natural appearance than either model predicts, implying that richer representations than the ones we examined will be required to match the appearance of natural scenes in the periphery. That is, the models discard too much information.

Finally, we note that appearance matching alone is not sufficient. A model that discards no information would match appearance perfectly. We instead seek the most compressed (parsimonious) model that also matches appearance. Therefore, the psychophysical approach we outline here must ultimately be paired with information-theoretic model comparison techniques to adjudicate between multiple models that successfully match appearance.

You can read our paper here, and the code, data and materials are also available (you can also find the code on Github).

Wallis, T. S. A., Bethge, M., & Wichmann, F. A. (2016). Testing models of peripheral encoding using metamerism in an oddity paradigm. Journal of Vision, 16(2), 4.

Freeman, J., & Simoncelli, E. P. (2011). Metamers of the ventral stream. Nature Neuroscience, 14(9), 1195–1201.

Confirmatory experiments and researcher degrees of freedom

A few months ago I attended a talk given by a professor of cognitive neuroscience / psychology. The professor presented several experiments to support a particular hypothesis, including both behavioural studies and fMRI. The final minutes of the presentation were used to tell us about some exciting new findings that could suggest an interesting new effect. In presenting these results, the professor stated “we’ve only run 10 subjects so far and this difference is not quite significant yet, but we’re collecting a few more people and we expect it to drop under .05”.

This is an example of a “researcher degree of freedom”, “questionable research practice”, or “p-hacking” (specifically, we could call this example “data-peeking”). In my experience it is very common in experimental psychology, and recent publications show it’s a problem much more broadly (see e.g. here, here and this article by Chris Chambers).

Why does data-peeking happen? I believe that in almost all cases there is no malicious intent to mislead, but rather that it arises from a faulty intuition.

Researchers intuit that having more data should lead to a better estimate of the true effect. That intuition is correct for estimation, but it does not carry over to significance testing: if you test repeatedly as the data come in and stop as soon as the result looks significant, you give yourself many chances to cross the threshold by luck alone, so the test no longer controls the false positive rate at its nominal level. Unfortunately, many researchers (including my former self) don’t appreciate this, and erroneously rely on their intuition.

The Garden of Forking Paths *

It essentially boils down to this: if your data depend on your analyses or your analyses depend on your data, you could be on thin inferential ice. Daniel Lakens has a nice post on the former, while Gelman and Loken have an article on the latter that’s well worth your time (now published in revised form here).

Data depend on analyses

An example of this is if you test a few subjects, check the result, and maybe collect some more data because it’s “trending towards significance” (as for our anonymous professor, above). If you then apply a p-value as normal, your false positive rate is no longer equal to the nominal alpha level (e.g., 0.05), but is actually higher. You’re more likely to reject the null hypothesis and call something significant if you data-peek, unless you apply statistical corrections for your stopping rules (called “sequential testing” in the clinical trials literature; this post by Daniel Lakens has some info on how to correct for this).
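
To see how big the problem can get, here is a small simulation sketch in Python (the batch sizes, maximum sample size and use of a one-sample t-test are assumptions for illustration only). Every simulated experiment has no true effect, yet testing after each batch of subjects and stopping at the first p < .05 produces a false positive rate well above the nominal 5%.

    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(0)
    n_experiments = 5000
    start_n, step, max_n = 10, 5, 50  # peek after 10 subjects, then every 5 more
    alpha = 0.05

    false_positives = 0
    for _ in range(n_experiments):
        # The null hypothesis is true: the population mean really is zero.
        data = rng.normal(loc=0.0, scale=1.0, size=max_n)
        for n in range(start_n, max_n + 1, step):
            if ttest_1samp(data[:n], popmean=0.0).pvalue < alpha:
                false_positives += 1
                break  # "significant -- stop collecting data"
    print(f"false positive rate with data peeking: "
          f"{false_positives / n_experiments:.3f} (nominal alpha = {alpha})")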

Analyses depend on data

An example of this is if you collect 20 subjects, then realise two of them show some “outlier-like” behaviour that you hadn’t anticipated: reaction times that are too fast to be task-related. You decide to exclude the trials with “too-fast” reaction times, defining “too-fast” based on the observed RT distribution**. This neatens up your dataset — but given different data (say, those two subjects behaved like the others), your analysis would have looked different. In this case, your analyses are dependent on the data.

I believe this happens all the time in experimental psychology / neuroscience. Other examples include grouping levels of a categorical variable differently, defining which “multiple comparisons” to correct for, defining cutoffs for “regions of interest”… When your analyses depend on the data, you’re doing exploratory data analysis. Why is that a problem? By making data-dependent decisions you’ve likely managed to fit noise in your data, and this won’t hold for new, noisy data: you’re increasing the chance that your findings for this dataset won’t generalise to a new dataset.
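
The same point can be made with a sketch in Python (the exclusion rules, group sizes and two-sample t-test are invented for illustration). There is no true effect in any simulated dataset, but if we try several plausible-looking outlier cutoffs for each dataset and keep whichever gives the smallest p-value, the false positive rate creeps above the nominal 5%.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(42)
    n_experiments = 5000
    n_per_group = 20
    alpha = 0.05

    def best_p(a, b):
        """Try a few data-dependent exclusion rules and keep the smallest p."""
        pvals = [ttest_ind(a, b).pvalue]
        for sd_cut in (3.0, 2.5, 2.0):  # increasingly aggressive outlier cutoffs
            a_trim = a[np.abs(a - a.mean()) < sd_cut * a.std()]
            b_trim = b[np.abs(b - b.mean()) < sd_cut * b.std()]
            pvals.append(ttest_ind(a_trim, b_trim).pvalue)
        return min(pvals)

    false_positives = 0
    for _ in range(n_experiments):
        # The null hypothesis is true: both groups have the same distribution.
        a = rng.normal(size=n_per_group)
        b = rng.normal(size=n_per_group)
        if best_p(a, b) < alpha:
            false_positives += 1
    print(f"false positive rate with flexible exclusions: "
          f"{false_positives / n_experiments:.3f} (nominal alpha = {alpha})")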

Exploratory analyses are often very informative — but they should be labelled as such. As above, your actual false positive rate will be higher than your nominal false positive rate (alpha level) when you use a p-value.

We should be doing more confirmatory research studies

For experimental scientists, the best way to ensure that your findings are robust is to run a confirmatory study. This means:

  1. Collect the data with a pre-specified plan: decide in advance how many participants you will test, then stop there. If you plan to use contingent stopping rules (sequential testing), apply the appropriate corrections to any test statistics you use to make inferential decisions (a sketch of one way to calibrate such a correction follows this list).
  2. Analyse the data with an analysis pipeline (from data cleaning to model fitting and inference) that has been pre-specified, without seeing the data.
  3. Report the results of those analyses.
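
For point 1 in the list above, here is a rough simulation sketch in Python of how such a correction can be calibrated (the number of looks, the grid of candidate thresholds and the t-test are assumptions for illustration; the sequential-testing literature gives principled procedures). It searches for a per-look significance threshold that keeps the overall false positive rate near 5% when the data are tested at several pre-planned sample sizes.

    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(7)
    n_experiments = 5000
    looks = [10, 20, 30, 40, 50]  # pre-planned sample sizes at which to test

    def overall_false_positive_rate(per_look_alpha):
        """Rate at which any pre-planned look rejects a true null."""
        hits = 0
        for _ in range(n_experiments):
            data = rng.normal(size=max(looks))  # the null hypothesis is true
            if any(ttest_1samp(data[:n], popmean=0.0).pvalue < per_look_alpha
                   for n in looks):
                hits += 1
        return hits / n_experiments

    # A coarse grid search: smaller per-look thresholds bring the overall
    # false positive rate back down towards the nominal 5%.
    for candidate in (0.05, 0.03, 0.02, 0.016, 0.01):
        rate = overall_false_positive_rate(candidate)
        print(f"per-look alpha = {candidate:.3f} -> overall rate = {rate:.3f}")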

If you started out with a data-dependent (exploratory) analysis, report it as such in the paper, then add a confirmatory experiment***. There’s a great example of this approach from experimental psychology (in this case, a negative example). Nosek, Spies and Motyl report finding that political moderates were better at matching the contrast (shade of grey) of a word than those with more extreme (left or right) political ideologies (p = 0.01, N = 1,979 — it was an online study). Punch line: “Political extremists perceive the world in black and white — literally and figuratively”. However, the authors were aware that they had made several data-dependent analysis decisions, so before rushing off to publish their finding, they decided to run a direct confirmatory study. New N = 1,300, new p-value = 0.59.

The best way to show the community that your analysis is confirmatory is to pre-register your study. One option is to go on something like the Open Science Framework, submit a document with your methods and detailed analysis plan, then register it. The project can stay private (so nobody else can see it), but now there’s a record of your analysis plan registered before data collection. A better option is to submit a fully registered report. In this case, you send your introduction, method and analysis plan to a journal, where it is peer reviewed and feedback given — all before the data are collected. Taking amendments into account, you then run off and collect the data, analyse it as agreed, and report the results. In a registered report format, the journal agrees to publish it no matter the result. If the study is truly confirmatory, a null result can be informative too.

Of course, there’s still trust involved in both of these options – and that’s ok. It’s hard to stop people from outright lying. I don’t think that’s a big problem, because the vast majority of scientists really want to do the right thing. It’s more that people just don’t realise that contingent analyses can be a problem, or they convince themselves that their analysis is fine. Pre-registration can help you convince yourself that you’ve really run a confirmatory study.

Conclusion

I hope the considerations above are familiar to you already — but if you’re like many people in experimental psychology, neuroscience, and beyond, maybe this is the first you’ve heard of it. In practice, most people I know (including myself) are doing exploratory studies all the time. Full disclosure: I’ve never reported a truly confirmatory study, ever. In a follow-up post, I’m going to speculate about how the recommendations above might be practically implemented for someone like me.

* The title refers to the short story by Jorge Luis Borges, and was used by Gelman and Loken to refer to data-dependent analyses.

** Note: this is a very bad idea, because it ignores any theoretical justification for what “too fast to be task-related” means. I use it here only as an example. I have a more general problem with outlier removal, too: unless the data are actually wrong (e.g., the equipment broke), change the model, not the data. For example, if your data contain a few outliers, run a robust regression (i.e. don’t assume Gaussian noise). Of course, this is itself another data-dependent analysis. Run a confirmatory experiment…
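
To make “change the model, not the data” concrete, here is a minimal sketch in Python using statsmodels (the simulated data and the choice of a Huber loss are just one illustrative option, not a prescription from this post). Instead of deleting outlying points, fit a robust regression alongside ordinary least squares and compare the estimates.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 50
    x = np.sort(rng.uniform(0, 10, size=n))
    y = 2.0 * x + rng.normal(scale=1.0, size=n)
    y[-3:] += 25.0  # a few gross outliers (e.g., equipment glitches)

    X = sm.add_constant(x)
    ols_fit = sm.OLS(y, X).fit()  # assumes Gaussian noise; pulled around by outliers
    rlm_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()  # downweights outliers

    print(f"OLS slope:    {ols_fit.params[1]:.2f}")
    print(f"robust slope: {rlm_fit.params[1]:.2f}  (true slope = 2.0)")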

*** An equivalent method is to keep a holdout set of data that’s only analysed at the end — if your conclusions hold, you’re probably ok.

Quick link on organising projects

Felix Wichmann pointed me to this paper on organising project directories.

Some of the tips are a bit specific to bioinformatics / computer science, but one idea I really liked was to have a chronologically-organised electronic notebook for each project. This would sketch out the analyses / experiments you tried, what worked, and what didn’t. You could also “transcribe notes from conversations as well as email text into the lab notebook”. This type of document makes far more sense to me than keeping a single general lab notebook for all the projects you’re doing. Perhaps a general notebook works for you, but notes about a specific project should all be accessible within that project’s directory. This idea would definitely complement some others that I’ve previously detailed on this blog.

It strikes me that IPython notebooks or R Markdown documents are perfect formats for such a project notebook.

Code and data, or it didn’t happen

There is a new motto for reproducible research: “code and data or it didn’t happen” (see also here). I want this to apply more often in vision science, too. Since vision science experiments are usually entirely computer-controlled, implementing this should be a lot easier than in some other fields. For example, in social psychology raw data can be paper-based (participants fill out a questionnaire), making sharing that much harder.

In my mind, vision science journals should begin requiring raw code, data and stimuli (if they’re pictures that can’t be regenerated) to accompany every published empirical article. The code and data would be maintained by the journal in a data repository (e.g. http://thedata.org/).

Why doesn’t this happen already? One attitude I’ve encountered in the field is that “know-how” (for things like stimulus generation or data analysis — proximally, the code that does these things) is somehow an in-house resource, to be shared only when people ask for it.

A great passage from this recent article is relevant here:

… authors are not the owners of their methods or results. By publishing their results, original authors state that they have sufficient confidence in them that they should be included in the scientific record. That record belongs to everyone…

So the code and data aren’t yours. They don’t belong to you once the paper is published (and arguably, not before then either). They belong to everyone, particularly if your research was publicly funded.

A second attitude is “well, of course I would share the code / data if someone emailed me and asked for it”. Maybe you would, but that doesn’t guarantee that everyone will. I’ve previously encountered a refusal to share materials upon request, even several years after the paper was published. Better that these materials are made publicly available at the outset than possibly refused (off the public record) later on.

A code / data embargo

The only reason I can think of to refuse to share code and data at the time of publication is that the researcher has a reasonable expectation of getting more publications out of that material before others have the chance. Many others before me have suggested a compromise solution: a code / data embargo. The journal or a third party holds the materials and, after some period of time post-publication, releases them to the public. For example, Data Dryad allows limited-term data embargoes to be set up.

I think the appropriate embargo period is field-specific: in some fields it might make sense to have quite long embargoes. But human psychophysics data are relatively quick and cheap to collect (possibly excluding clinical data), so I see no reason why such data should be embargoed for more than a year. For animal data or brain imaging data other timeframes might be appropriate, but in general, I think most people would be happy with this idea.

Releasing your data to the community could yield interesting scientific advances that you never even considered when you conducted the experiment. This helps everyone (and nets you another citation). For a much more in-depth consideration of this issue, take a look at this recent article.