If you have read my older blog posts, you know I’m a fan of R’s `ggplot2` library for exploratory data analysis. It allows you to examine many views onto data, creating summaries over different variables right there as you plot. I gave a short tutorial here. Once you understand how the library works, it’s a very powerful way of quickly seeing patterns in large datasets.

## Moving over to Python

The past few months have seen me ramp up my Python usage (as I predicted almost a year ago). This is for a variety of reasons, but mostly (a) Python is a general programming language, so you can essentially do *anything* in it without it being a massive hack; (b) the scientific computing packages available for Python are generally super awesome; (c) it seems to me to be poised to replace Matlab as the *lingua franca* of neuroscience / vision research — at least, much more so than R. See Tal Yarkoni’s blog for a related discussion. Of course, it’s also free and open source.

What I’m still kind of missing in Python is a nice high-level plotting library like `ggplot2`. Certainly, there are many in the works, but in playing around I’ve found that none of them can match R’s `ggplot2` yet.

## Amphibious assault

Of the libraries I’ve tried (the main competitor being yhat’s ggplot port), my go-to library for plotting in Python is hands-down Seaborn. Developed largely by Michael Waskom, a graduate student in Cog Neuro at Stanford, it has the prettiest plots and the best ideas for high-level functionality.

## Speeding up Seaborn?

Unfortunately, compared to R’s `ggplot2`, Seaborn’s data summary functions are *slow*. As an example, I’ll show you a summary plot using the simulated data from my R tutorials.

We want to summarise our (simulated) subjects’ detection performance as a function of stimulus contrast, for each spatial frequency. In R, my code looks like this:

```r
library(ggplot2)

setwd('~/Dropbox/Notebooks/')
dat <- read.csv('contrast_data.csv')
summary(dat)

plot_fun <- function(){
  fig <- ggplot(data = dat, aes(x = contrast, y = correct, colour = factor(sf))) +
    stat_summary(fun.data = "mean_cl_boot") +
    scale_x_log10() +
    scale_y_continuous(limits = c(0, 1)) +
    facet_wrap(~ subject, ncol = 3) +
    stat_smooth(method = "glm", family = binomial())
  print(fig)
}

system.time(plot_fun())
```

This takes about **3 seconds** (2.67) to return the following plot:

<a href="https://tomwallisblog.files.wordpress.com/2014/10/ggplot_timing.png"><img src="https://tomwallisblog.files.wordpress.com/2014/10/ggplot_timing.png" alt="ggplot_timing" width="600" height="485" class="aligncenter size-full wp-image-374" /></a>

**EDIT: for some reason WordPress is being silly and won't let me actually show this picture. Suffice to say it looks substantively similar to the plot below**

Let's contrast this to Seaborn. First the code:

```python
# do imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

dat = pd.read_csv('contrast_data.csv')

# make a log contrast variable (because Seaborn can't auto-log an axis like ggplot):
dat['log_contrast'] = np.log(dat['contrast'])

# set some styles we like:
sns.set_style("white")
sns.set_style("ticks")
pal = sns.cubehelix_palette(5, start=2, rot=0, dark=0.1, light=.8, reverse=True)
```

```python
%%timeit
fig = sns.lmplot("log_contrast", "correct", dat,
                 x_estimator=np.mean, col="subject",
                 hue='sf', col_wrap=3, logistic=True,
                 palette=pal, ci=95)
```

(Note that the above lines were run in an IPython notebook, hence the `%%timeit` magic operator.)
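Outside a notebook, a one-shot timing of the same call can be done with the standard library; here is a minimal sketch (the plotting call is stood in for by an arbitrary workload, since it would need the data file):

```python
import time

def timeit_once(fn, *args, **kwargs):
    """Time a single call to fn, returning (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# stand-in workload; in the post this would be the sns.lmplot call
result, secs = timeit_once(sum, range(1000))
```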

Seaborn takes… wait for it… **2 minutes and 11 seconds** to produce the following plot:

(I’m using Python 3.3.5, Anaconda distribution 2.1.0, Seaborn 0.4.0.)

Note also that both packages are doing 1000 bootstraps by default (as far as I’m aware), so I’m pretty sure they’re doing equivalent things.
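For reference, the basic nonparametric (percentile) bootstrap of a mean that both defaults refer to can be sketched in a few lines of NumPy. This is a minimal illustration, not either library's actual implementation:

```python
import numpy as np

def bootstrap_mean_ci(data, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # resample the data with replacement, n_boot times
    idx = rng.integers(0, len(data), size=(n_boot, len(data)))
    boot_means = data[idx].mean(axis=1)
    half_alpha = (100 - ci) / 2
    lo, hi = np.percentile(boot_means, [half_alpha, 100 - half_alpha])
    return lo, hi

# e.g. proportion correct over some simulated trials
trials = np.array([0, 1, 1, 1, 0, 1, 1, 0, 1, 1])
lo, hi = bootstrap_mean_ci(trials)
```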

## What can be done?

This is obviously not a positive factor for my switch from R to Python, and I’m hoping it’s just that I’ve done something wrong. However, another explanation is that whatever Seaborn is using to do the bootstrapping or logistic model fitting is just far less optimised than ggplot2’s backend in R.

The nice thing about open source software is that we can help to make this better. So if you’re a code guru who’s reading this and wants to contribute to the scientific computing world moving ever faster to Python, go fork the github repo now!

## Update

After I posted this, I opened an issue on GitHub asking the developers about the slow times. It turns out that the `ci` flag in `sns.lmplot` specifies confidence intervals for the *logistic regression*, which is also bootstrapped. Bootstrapping a logistic regression takes a while; setting `ci=False` means that Seaborn now takes about 7 seconds to produce that plot instead of 2 minutes.
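For concreteness, here is a rough sketch of the faster call on simulated stand-in data (the column names mirror my `contrast_data.csv`, but the values are made up; note that in current Seaborn versions the argument is `ci=None` rather than `ci=False`, and `logistic=True` requires the statsmodels package):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import pandas as pd
import seaborn as sns

# simulated stand-in for contrast_data.csv
rng = np.random.default_rng(0)
n = 300
dat = pd.DataFrame({
    'log_contrast': rng.choice(np.log([0.02, 0.05, 0.1, 0.2, 0.4]), n),
    'sf': rng.choice([1, 4], n),
    'subject': rng.choice(['S1', 'S2', 'S3'], n),
})
# gentle performance increase with contrast; outcomes stay mixed
p_correct = 0.4 + 0.05 * (dat['log_contrast'] + 4.0)
dat['correct'] = (rng.uniform(size=n) < p_correct).astype(int)

# ci=None skips bootstrapping the logistic fit (the slow part)
g = sns.lmplot(data=dat, x='log_contrast', y='correct',
               x_estimator=np.mean, col='subject', hue='sf',
               col_wrap=3, logistic=True, ci=None)
```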

So, hooray for Seaborn and for awesome open source developers!

## Comments

I see that ggplot’s `smean.cl.boot` (from the Hmisc package) “is a very fast implementation of the basic nonparametric bootstrap for obtaining confidence limits for the population mean without assuming normality”. Is there a similar bootstrap for Seaborn?

Hi Egor, I’m not sure how optimised the implementation in Seaborn is (I’m not sure whether you posted this before or after my update). If we don’t bootstrap the logistic regression too (`ci=False`), then Seaborn is much faster (7 seconds) but still slower than R (3 seconds). This would imply to me that further optimisation could be achieved for Seaborn’s backend, but I am not sure.

Oh, so with `ci=False` the confidence intervals still appear. I see, thanks.

Yep, there’s another flag for the point confidence intervals (which are also bootstrapped): `x_ci`. See http://web.stanford.edu/~mwaskom/software/seaborn/generated/seaborn.regplot.html#seaborn.regplot

Perfect timing for me, as I’ve been looking at getting PsychoPy to create a dataframe and plotting the results. I was going to try yhat’s ggplot, but Seaborn may be a bit more developed. However, I’ll probably stick to matplotlib for now because only that is included in the PsychoPy standalone; the others would require independent Python installations.