[Image: information gain differences]

How close are we to understanding image-based saliency?

How well can we predict where people look in stationary natural images? While the scope of this question addresses only a fraction of what it means to understand eye movements in natural environments, it nevertheless remains a starting point to study this complex topic. It’s also of great interest to both cognitive scientists and computer vision researchers since it has applications from advertising to robotics.

Matthias Kümmerer has come up with a statistical framework that answers this question in a principled way. Building on the nice work of Simon Barthelmé and colleagues, Matthias has shown how saliency models can be compared in units of information (i.e. using log-likelihoods). Since information provides a meaningful linear metric, it allows us to compare the distance between model predictions, a baseline (capturing image-independent spatial biases in fixations) and the gold standard (i.e. how well you could possibly do, knowing only the image).

So how close are we to understanding image-based saliency? Turns out, not very. The best model we tested (a state-of-the-art model from 2014 by Vig, Dorr and Cox) explained about one third of the possible information gain between the baseline and the gold standard in the dataset we used. If you want to predict where people look in stationary images, there’s still a long way to go.
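To make the measure concrete: given per-fixation log-likelihoods under a model, the baseline and the gold standard, the proportion of explained information gain is a simple ratio of average log-likelihood differences. Here is a minimal sketch of the arithmetic (the function names and numpy-array inputs are my own, not from the paper):

import numpy as np

def information_gain(ll_model, ll_baseline):
    # average log-likelihood difference per fixation, converted to bits
    return np.mean(ll_model - ll_baseline) / np.log(2)

def explained_information_gain(ll_model, ll_baseline, ll_gold):
    # fraction of the possible information gain the model captures;
    # the 2014 model described above scores roughly one third on this ratio
    return information_gain(ll_model, ll_baseline) / information_gain(ll_gold, ll_baseline)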

In addition, our paper introduces methods to show, in individual images, where and by how much a model fails (see the image above). We think this is going to be really useful for people who are developing saliency models. Finally, we extend the approach to the temporal domain, showing that knowing about both spatial and temporal biases, but nothing about the image, gives you a better prediction than the best saliency model using only spatial information.

The nice thing about this last point is that it shows that Matthias’ method is very general. If you don’t think that measuring where people look in stationary images tells you much about eye movements in the real world, that’s fine. You can still use the method to quantify and compare data and models in your exciting new experiment.

A paper that goes into much more detail than this blog post is now available on arXiv. In particular, saliency experts should check out the appendices, where we think we’ve resolved some of the reasons why the saliency model comparison literature was so muddled.

We’re close to submitting it, so we’d love to hear feedback on the pitch and nuance of our story, or anything that may not be clear in the paper. You can send me an email to pass on your thoughts.

When the paper is out we will also be making a software framework available for model comparison and evaluation. We hope the community will find it a useful tool.




Neurostats 2014 Highlights


Simon Barthelmé shows some highlights from Neurostats 2014 in Warwick. I particularly like the “in limbo” brain areas.

Originally posted on dahtah:

Last week the Neurostats 2014 workshop took place at the University of Warwick (co-organised by Adam Johansen, Nicolas Chopin, and myself). The goal was to put some neuroscientists and statisticians together to talk about neural data and what to do with it. General impressions:

  • The type of Bayesian hierarchical modelling that Andrew Gelman has been advocating for years is starting to see some use in neuroimaging. On the one hand it makes plenty of sense since the data at the level of individual subjects can be cr*p and so one could really use a bit of clever pooling. On the other, imaging data is very high-dimensional, running a Gibbs sampler can take days, and it’s not easy making the data comparable across subjects.
  • You have to know your signals. Neural data can be unbelievably complicated and details matter a lot, as Jonathan Victor showed in his talk. A consequence if…


Guest Post: Matlab versus Pandas for data analysis

Annelie Muehler is an undergraduate student who is about to finish a two-month internship in our group. She has been working with me conducting psychophysical experiments, and we have been creating stimuli using Python. As part of getting used to scientific Python, Annelie learned to use pandas, a package that essentially gives you R's data frames in Python. The following compares the code to do a simple analysis in Matlab and Python. While it's possible there are ways to improve the Matlab implementation (perhaps using the Statistics Toolbox?), it's noteworthy that these weren't taught in Annelie's course.

A comparison of Matlab and Pandas for a simple data analysis

As part of my undergraduate studies in cognitive psychology and neuroscience, I ran a water maze experiment with mice in an advanced biology/neuroscience lab course. I had ten mice, each of which did four trials per day over a six-day period. The point of the experiment is that the mice should find the platform in the water with increasing speed as they complete more trials, because they learn where the platform is. The water maze is one of the behavioural experiments used in mice and rats to test the ability to learn. We later used these data as the basis for learning data analysis in a Matlab lab class.

During my internship at the Centre for Integrative Neuroscience in Tuebingen, Germany, I reanalyzed these data using pandas in Python as a way to learn pandas, which gave me a direct comparison of Matlab and pandas. There are definitely some very nice things about pandas. First, you can define your own index and column names, which are shown surrounding the matrix in a format similar to a table in a word processing document or spreadsheet. This was one of the most frustrating things for me in Matlab: there you have the dataset in one variable and a separate variable containing a list of strings corresponding to the column names, so you have to look the names up there.


An example of how tables are displayed in pandas, using the mice data. The table is stored in a variable called rdata.

In pandas, reading data in and out is easy with the pd.read_csv() and rdata.to_csv() functions (see the sketch after the list below). As you can see in the image above, the mice data are structured so that the index gives the row number, and the other columns are:

  • Trials is the trial number, running from one to four within each day
  • Animal is the animal number, in the range one to ten
  • Day is the day number, running from one to six
  • Swimming Time is the time (in seconds) it took the mouse to find the platform in the water maze.
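
For instance, reading the data in and writing it back out might look like this (the filenames are invented for illustration):

import pandas as pd

# hypothetical filename; the file holds one row per trial
rdata = pd.read_csv('watermaze.csv')

# ... analysis ...

# write the (possibly modified) frame back to disk
rdata.to_csv('watermaze_out.csv', index=False)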

I find it easier to work with the table labeled in this way, as opposed to having a separate variable with the column labels, as we had done in Matlab. pandas also has great functions such as:

  • rdata.head() which shows the top rows of the dataframe
  • rdata.describe() which gives the count, mean, standard deviation and other statistics of the dataframe (not the most useful for this specific dataframe)
  • rdata.sort(columns = 'Animal') which sorts the data by a specific column, in this case the column Animal.

As you can see above, pandas (and Python in general) uses object-oriented methods. These work by taking the name of the object, in this case rdata, adding a period, and then calling the function. This shows you the result of the function but generally does not change the object itself, unless you reassign the result to the object (as in rdata = rdata.sort(columns='Animal')).
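
A minimal illustration of that point (using the pandas API of the time; newer pandas versions renamed sort to sort_values):

rdata.sort(columns='Animal')          # returns a sorted copy; rdata itself is unchanged
rdata = rdata.sort(columns='Animal')  # reassign to keep the sorted order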

The idea of the analysis was to find the average swimming time per day across animals, to see whether there was any improvement as the mice learned the task. In Matlab we did this by:

1.

% For each day, find the row indices belonging to that day
% (column 3 of rdata holds the day number)
for i = 1:nday
    rows_day(:, i) = find(rdata(:, 3) == i);
end

This creates a matrix in which the rows belonging to each day are identified.

2.

% Collect the swimming times (column 5) for each day's rows,
% giving one column of trial times per day
for i = 1:nday
    time_day(:, i) = rdata(rows_day(:, i), 5);
end

Using the matrix from step 1, we get a new matrix in which the swimming times of the trials are listed for each day, across animals.

3.

% Average over trials and animals within each day, then plot
m_day = mean(time_day);
f = figure;
a = axes;
plot(m_day);
ylabel('Swimming Time (s)')
xlabel('Experimental Day')
set(a, 'xtick', 1:nday)
title('Average swimming time (s) per day across animals')

This results in this simple line graph:


Graph output from Matlab
Here’s the same thing in pandas.

1.

import pandas as pd

The usual import at the beginning of every Python script.

2.

# Mean swimming time per day, averaged over trials and animals
m_day = rdata.groupby('Day')['Swimming Time'].mean()
# Put the means into a DataFrame with named columns for plotting
m_day = pd.DataFrame({'Swimming Time (s)': m_day, 'Experimental Day': range(1, 7)})

groupby is a useful command that groups the data by Day (named in the parentheses); the square brackets then select the Swimming Time column within each group. This eliminates picking out the rows for each day with a for loop, as in the Matlab code above, and lets you group your data by any variable in your data frame. The .mean() at the end tells pandas to compute the mean of the grouped data.

3.

# plot() returns a matplotlib axes object; keep it so we can save the figure later
plot = m_day.plot(style='b-', x='Experimental Day', y='Swimming Time (s)', title='Average swimming time (s) per day across animals')

There are other Python plotting functions that may be a bit more elaborate, but in the spirit of doing everything in pandas I decided to show the pandas version. This results in the following simple line graph, identical to the one above:


Graph output from Pandas
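
For comparison, here is a rough sketch of the same plot in plain matplotlib (not part of the original analysis):

import matplotlib.pyplot as plt

# draw the same line graph directly with matplotlib
plt.plot(m_day['Experimental Day'], m_day['Swimming Time (s)'], 'b-')
plt.xlabel('Experimental Day')
plt.ylabel('Swimming Time (s)')
plt.title('Average swimming time (s) per day across animals')
plt.show()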

Figures can be easily saved in pandas using:

fig = plot.get_figure()
fig.savefig('mean_day.png')  # savefig needs a filename; 'mean_day.png' is an example
Of course this is a very simple example of data analysis, but I think it does illustrate some of the benefits of pandas. The nicest thing in my opinion is the ease with which you can manipulate the data frame, and the ability to select columns by their name. The groupby function is very useful and can be used to group by several columns at once, or to aggregate several columns within each group (see the sketch below).
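
As a sketch of that last point, grouping by two columns at once (again assuming the rdata frame from above; the Path Length column is hypothetical):

# mean swimming time for each (day, animal) combination
per_animal = rdata.groupby(['Day', 'Animal'])['Swimming Time'].mean()

# or aggregate several columns within each group:
# rdata.groupby('Day')[['Swimming Time', 'Path Length']].mean()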

In my opinion, pandas is a much simpler and more convenient way to work with and manipulate data.

Quick link on organising projects

Felix Wichmann pointed me to this paper on organising project directories.

Some of the tips are a bit specific to bioinformatics / computer science, but one idea I really liked was to have a chronologically-organised electronic notebook for each project. This would sketch out analyses / experiments you tried, what worked, and what didn’t. You could also “transcribe notes from conversations as well as email text into the lab notebook”. This type of document makes way more sense to me than keeping a single general lab notebook for all the projects you’re doing. Perhaps that helps you too, but notes about a specific project should all be accessible within the project directory. This idea would definitely complement some others that I’ve previously detailed on this blog.

It strikes me that IPython notebooks or R Markdown documents are perfect formats for such a project notebook.

Code and data, or it didn’t happen

There is a new motto for reproducible research: “code and data or it didn’t happen” (see also here). I want this to apply more often in vision science, too. Since vision science experiments are usually entirely computer-controlled, implementing this should be a lot easier than in some other fields. For example, in social psychology raw data can be paper-based (participants fill out a questionnaire), making sharing that much harder.

In my mind, vision science journals should begin requiring raw code, data and stimuli (if they’re pictures that can’t be regenerated) to accompany every published empirical article. The code and data would be maintained by the journal in a data repository (e.g. http://thedata.org/).

Why doesn’t this happen already? One attitude I’ve encountered in the field is that “know-how” (for things like stimulus generation or data analysis — proximally, the code that does these things) is somehow an in-house resource, only to be actively shared when people ask for it.

A great passage from this recent article is relevant here:

… authors are not the owners of their methods or results. By publishing their results, original authors state that they have sufficient confidence in them that they should be included in the scientific record. That record belongs to everyone…

So the code and data aren’t yours. They don’t belong to you once the paper is published (and arguably, not before then either). They belong to everyone — particularly if your research was publicly-funded.

Second, many people have the attitude that “well of course I would share the code / data if someone emailed me and asked for it”. Maybe you would, but that doesn’t guarantee that everyone will. I’ve previously encountered a refusal to share materials upon request, even several years after the paper was published. Better that these are made publicly available at the outset, rather than possibly refused (off the public record) later on.

A code / data embargo

The only reason I can think of to refuse to share code and data at the time of publication is that the researcher has a reasonable expectation that they should be able to get more publications out of that material before others have the chance. Many others before me have suggested a compromise solution for this: a code / data embargo. The journal or a third party holds the materials and, after some period of time post-publication, releases them to the public. For example, Data Dryad allows limited-term data embargoes to be set up.

I think the embargo time is something field-specific. For example, in some fields it might make sense to have quite long embargoes. But human psychophysics data is relatively quick and cheap to collect (possibly excluding clinical data), so I see no reason why such data should be embargoed for more than a year. Perhaps for animal data or brain scanning data other timeframes might be appropriate, but in general, I think most people would be happy with this idea.

Releasing your data to the community could yield interesting scientific advances that you never even considered when you conducted the experiment. This helps everyone (and nets you another citation). For a much more in-depth consideration of this issue, take a look at this recent article.