Ben Vincent over at inferencelab.com has a nice post on two simple ways to get set up with a Python scientific computing environment. I hadn’t heard of Wakari before – thanks Ben! Basically this lets you run Python scripts on cloud computers, and share these scripts with others without requiring them to set up a software installation identical to yours. You could send collaborators, reviewers or readers a link to a web-based interface that would allow them to re-run your analysis. Pretty cool! Yay free software!
The second stage in my version control workflow is to push my local changes to a remote repository. A remote repository is basically a copy of the repository stored locally on your computer, but hosted on a remote server somewhere in the internet ether. Much like using Dropbox, this provides an additional layer of backup for your project (with the advantage of a full version history). So if you ever lose your local copy of your project for some reason, you can just re-clone it from the remote repository to get everything back (not including files that were never committed, of course). ** NOTE that I don’t recommend using this, or any one tool, as your only backup: your scientific projects should be backed up with multiple means, in multiple locations, all the time **.
However, the main advantage of pushing things up to a remote repository is that this facilitates sharing. With various methods that I’ll outline below, you can keep the remote repository private and invite your collaborators to use it, or you can make it public so that anyone can see it, clone it, etc (though of course in this case, you control whether to use anyone else’s changes or not).
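The recovery scenario described above (losing the local copy and re-cloning it) can be tried out safely on the command line. This is only a sketch: a local bare repository stands in for the remote server, and the file names, author name and email are placeholders I made up for the demonstration.

```shell
set -e
work=$(mktemp -d)                              # scratch area for the demo

# Assumption: a local bare repository plays the role of the remote server.
git init --quiet --bare "$work/remote.git"

# A small local project with one committed file, pushed to the "remote".
git init --quiet "$work/project"
cd "$work/project"
git config user.name "Example Author"          # placeholder identity
git config user.email "author@example.org"
echo "two lines of text" > master_manuscript.txt
git add master_manuscript.txt
git commit --quiet -m "created master file"
git remote add origin "$work/remote.git"
git push --quiet -u origin HEAD

# A file that was never committed, then the "disaster": the local copy is lost.
echo "scratch notes" > uncommitted.txt
cd "$work"
rm -rf "$work/project"

# Re-clone from the remote to get the committed history back.
git clone --quiet "$work/remote.git" "$work/project"
ls "$work/project"
```

The committed file comes back; the file that was never committed does not, which is why separate backups still matter.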
Github is the “one that started it all”. They have a really slick web interface, awesome graphs for looking at repository activity, great tools for interacting with other members, wikis and issue trackers that can be associated with a repo, and a big user base. Plus they offer the free GUI that I talked about in my last post. However, their pricing structure is that they charge you to have private repositories. That is, they host unlimited public repositories – i.e. anyone can see the repository’s contents, contributors and history. If you want to restrict your repository to a few invited collaborators, however, you need a paid account. Seven dollars a month gets you 5 private repositories. The idea here would be that you have some projects on the go, then when one is ready for sharing (say, the article is accepted), you switch the repo from “private” to “public”. Now everyone can see your code and data, and you have another private repo slot to use.
However, since I know that I would need more than 5 private repos (projects languishing, maybe one day, etc), I’ve so far avoided a paid Github account (the idea of just working with everything open is for another post). Thankfully we’re helped by Bitbucket.
Bitbucket is basically Github with a different pricing structure. Their web interface and user community are a fair bit behind Github’s. For scientists, however, the advantage is that they offer unlimited free private code repositories. The catch is that you’re only allowed 5 collaborators (i.e. people who have joined any of your repositories, like co-authors). However, an academic email address will get you unlimited collaborators too, so this is essentially a free service.
Using Remote Repositories on Bitbucket with the Github GUI
Generously, Github have not restricted their GUI to use Github repositories. So what I do is basically use the Github GUI to manage my version control day-to-day, but push the local repository to a remote repository on Bitbucket. I can share this with collaborators and keep it private.
- Set up a local repository as explained here.
- Log in to your Bitbucket account in a web browser.
- Follow the steps to set up a new repository. Select “git” as the version control flavour.
- This should then give you an option to “push up an existing repository”.
- From the command line that starts `git remote add origin`, copy the link to your repository that follows it (something like `email@example.com:tomwallis/test.git`). This might look different, depending on whether you’re using SSH or a password to authenticate (if you’re using a password, your link will start with https; either works). The Bitbucket / Github help pages will explain how to set up an SSH key if you’d like to do that.
- In the Github GUI, open your local repository and go to the “Settings” pane. On the line that says “Primary Remote Repository”, paste in the link to your repo. Hit “Update Remote”.
- Switch back to the “Changes” pane of the Github GUI. See the button in the top left? It should have changed from “Push to Github” to “Sync Branch” (if not, close and re-open the Github GUI).
- Press this button. You might be asked for your password (depending whether you’ve set up an SSH key).
- Github should synch, and the list of “unsynched commits” should disappear.
- Refresh your web browser on your Bitbucket account. Your code should now be in your private repository on Bitbucket!
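For anyone who prefers the terminal, the steps above correspond roughly to the following git commands. This is a sketch rather than a description of what the Github GUI does internally: a local bare repository stands in for the empty Bitbucket repository so that the example can run anywhere, and the identity and file contents are placeholders. In real use, `remote_url` would be the link copied from the Bitbucket page.

```shell
set -e
work=$(mktemp -d)                              # scratch area for the demo

# Assumption: a local bare repository stands in for the new Bitbucket repo.
git init --quiet --bare "$work/remote.git"
remote_url="$work/remote.git"

# A local repository with one commit, as set up in the earlier post.
git init --quiet "$work/project"
cd "$work/project"
git config user.name "Example Author"          # placeholder identity
git config user.email "author@example.org"
echo "first line" > master_manuscript.txt
git add master_manuscript.txt
git commit --quiet -m "created master file"

# Roughly what setting the primary remote and then syncing amounts to.
git remote add origin "$remote_url"
git push --quiet -u origin HEAD

# Count commits that exist locally but are not yet on the remote.
unsynced=$(git rev-list --count "@{u}..HEAD")
echo "unsynced commits: $unsynced"
```

Pushing `HEAD` avoids hard-coding a branch name, since the default branch name differs between git installations.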
You can now share this repository with your co-workers, friends and family, and take advantage of all the nice things about collaborating with version control!
When the code is ready to be made public, you can simply push the repository to a public repository on Github by changing the primary remote in the settings pane. This lets you take advantage of the bigger user community (for example, see the Psychtoolbox, PsychoPy, Psychopy_ext, Julia) and better web tools on Github for publishing your code. Maybe in the future I would just use Github exclusively (i.e. paying for private repos), but for now the dual solution (Bitbucket, Github) works well. Of course, you can also just make your Bitbucket repository public and not worry about using Github at all, but then you’re stuck with their (relatively) clunky web interface.
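Changing the primary remote can also be done from the command line with `git remote set-url`. The sketch below uses two local bare repositories as stand-ins for the private Bitbucket repository and the public Github one; those paths, the identity and the file are placeholders for illustration only.

```shell
set -e
work=$(mktemp -d)
git init --quiet --bare "$work/private.git"    # stand-in for the Bitbucket repo
git init --quiet --bare "$work/public.git"     # stand-in for a public Github repo

git init --quiet "$work/project"
cd "$work/project"
git config user.name "Example Author"          # placeholder identity
git config user.email "author@example.org"
echo "analysis notes" > notes.txt
git add notes.txt
git commit --quiet -m "first commit"

# Day-to-day work goes to the private remote.
git remote add origin "$work/private.git"
git push --quiet -u origin HEAD

# When the project is ready to be public, repoint the remote and push again.
git remote set-url origin "$work/public.git"
git push --quiet -u origin HEAD
git remote -v
```

The full history travels with the push, so the public repository ends up with every commit made while the project was private.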
As always, test this process out for yourself to see that it works for you before using it for important stuff, and always keep independent backups of your project and data (Dropbox, Time Machine, etc).
Previously I talked about setting up a project directory. Now I’ll run through the first stages of something I see as an integral part of reproducible research: version control.
Why version control?
Loosely, think of “version control” like track changes in Word but for all the (plain text) files in your project directory. More formally, a version control system is a piece of software that will automatically detect when files in monitored directories change, record which lines of the file changed and how, prompt you to comment why you changed the file, then have you “commit” the changes to a repository (a data bank containing the history of the project directory). If you’re collaborating on the project with others, the version control system can be used to make sure each collaborator’s files stay up to date with each other, record who changed what, and notice if two people changed a file in conflicting ways (you’ll then be asked to look at the changes and decide which to keep). This is a big advantage over something like sharing a folder on Dropbox. You can even use a version control system to keep different versions of the same project (using “branches”). Finally, from the perspective of making your science open and reproducible, making the version control history of the project available is close to a gold standard. Not only do users get the version that created the final results, but if they’re so inclined they could search back through your repository history to see how the project evolved. Being able to just switch a repository from private to public also makes it really simple to share your code (say, when the paper comes out). Plenty of other people have written about the usefulness of version control for scientists (see here, here and here, for example), so in this post I’m going to concentrate on the way I use it.
How I set up version control
First, a disclaimer: I am not a version control expert. I use simple tools, and so far they’ve worked for me. Any problems I’ve had I’ve been able to resolve with some Googling and Stack Overflow. In the remainder of this post I’ll show you how I set things up, in the hope that this will be useful to you. Second disclaimer: I use OSX (and a bit of Linux on experimental machines), so my process might have some OSX-specific steps. However, all the software I’m using is cross-platform, so you should be able to port the process to your operating system of choice with a bit of internet searching.
The specific version control software I use is git. I find this to be the simplest system I’ve tried, with the best support for integrating with others and publishing your code. You can do some tutorials and get more info at the git site, or watch this video. I mostly use git through a GUI (graphical user interface) rather than through the command line because I find this simpler for day-to-day work. I use the Github GUI, which you can download here for mac and here for windows. I have found that this does everything I regularly need, in a clean and simple interface. Other people (Diederick, posting on this blog) have recommended SmartGit, which seems more fully-featured and is free for academic use. I haven’t tried this yet, though. So, without further ado, here are the steps for setting up version control in the way I have.
- Download and install the Github GUI for your operating system of choice. You might need to create a Github account at this stage (which I would recommend doing anyway). Since I already have everything installed on my setup here, I’m not going to walk you through this stage. It should be well-explained by the app itself, but if there are other steps you need to take, post them in the comments below and I will update this post.
Now open the Github application. We’re going to use this to create a new repository for a new project. I will use the project directory structure I set up before:
In the OSX Github GUI, go to the small plus sign in the lower left corner of the main display and select “Create New Repository”:
Select the root directory of your project (in my case, this is called `blog_example`). Now we have this blank screen for our new repository:
The Github GUI has created a new git repository, which lives in a hidden directory within our parent directory. Here you can see the contents of the directory before and after I created the repository:
Note how the second call to `ls -la` has revealed a new (hidden) folder `.git`. This contains all the files for the git repository, and is what the Github GUI uses. Note that it’s just a normal git repository, so if you want to do something that isn’t possible in the GUI, you can just interact with the repository using the command line.
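Since it is a normal git repository, the same thing can be reproduced without the GUI. A minimal sketch, assuming a fresh, empty `blog_example` directory created in a temporary location:

```shell
set -e
project="$(mktemp -d)/blog_example"            # throwaway location for the demo
mkdir -p "$project"
cd "$project"

ls -la                                         # before: no .git directory
git init --quiet                               # create the repository
ls -la                                         # after: a hidden .git directory appears
git status                                     # confirms we are inside a repository
```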
Now let’s create a file in our project directory. I made a file in the root directory called `master_manuscript.txt`, with two lines of text. When we flip back to the Github GUI, we see that it has detected the new file:
I enter a commit message “created master file”, hit commit, and there you go. Our git repository has its first local commit:
Note how the commit is listed in “Unsynced Commits”. It’s “Unsynched” because our local repository hasn’t been synchronised with a remote repository, such as one on Github. That’s fine if you just want to maintain a local repository for yourself. I will discuss synching with remotes in a future post.
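The first local commit looks something like this on the command line. It is a sketch in a temporary directory; the two lines of text and the author identity are placeholders.

```shell
set -e
cd "$(mktemp -d)"
git init --quiet
git config user.name "Example Author"          # placeholder identity
git config user.email "author@example.org"

# Create the file with two lines of text.
printf 'line one\nline two\n' > master_manuscript.txt
git status --short                             # new file is listed as untracked (??)

# Stage the file and commit it with a message.
git add master_manuscript.txt
git commit --quiet -m "created master file"
git --no-pager log --oneline                   # the repository now has one commit
```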
Now let’s try modifying the file `master_manuscript.txt`. I’m going to delete the original two lines and add something new. How about a Well Thought-Out Englilsh Paper?
The GUI shows us which lines were deleted (in red) and which were added (green). We can commit this new change, then take a look at our project’s history (in the history tab):
There we can see our original commit (“created master file”) as well as our new one. On second thought, Strong Bad’s Well Thought-Out Englilsh paper is maybe not so well thought out. Let’s revert to the last version. Select our commit “a well thought-out englilsh paper” and click on the gear to the right of the panel, then select “Revert This Commit” (for the difference between revert and roll back, see here). We then get a new commit, telling us that we’ve reverted the content of the file back to the old one:
If you open up the file `master_manuscript.txt`, you’ll see that git has changed it back to the original two lines.
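The modify, inspect and revert cycle above can be sketched with plain git commands. I am assuming here that the GUI’s “Revert This Commit” behaves like `git revert`, which adds a new commit that undoes the chosen one; the file contents and identity are placeholders.

```shell
set -e
cd "$(mktemp -d)"
git init --quiet
git config user.name "Example Author"          # placeholder identity
git config user.email "author@example.org"

printf 'line one\nline two\n' > master_manuscript.txt
git add master_manuscript.txt
git commit --quiet -m "created master file"

# Replace the original two lines with something new and commit the change.
printf 'something new\n' > master_manuscript.txt
git commit --quiet -am "a well thought-out englilsh paper"

# Show which lines were removed (-) and which were added (+).
git --no-pager diff HEAD~1 HEAD

# Undo the last commit by creating a new commit that reverses it.
git revert --no-edit HEAD
cat master_manuscript.txt
git --no-pager log --oneline
```

Nothing is erased: the history keeps the original commit, the change, and the revert as three separate entries.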
Let’s say that we now have some files in our project directory that we don’t want git to monitor (e.g. a really large data file that would be infeasible to upload to a remote repository). We can add this to the list of files to be ignored by placing it in the file `.gitignore`. The GUI allows us to do this by right clicking on a file when it appears on the left pane and selecting “ignore”. You can also ignore entire directories by entering them into `.gitignore`. You can do this in the GUI by going to the Settings pane, to the ignored files section, and entering a line naming the directory (for example, `/figs/`). The GUI simply adds this line to your `.gitignore` file. I usually ignore the `/out/` and `/figs/` directories since their contents can be regenerated from your code, and for some of my analyses the contents of `/out` can be rather large.
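Here is a small sketch of ignoring directories by editing `.gitignore` directly. The directory names follow the ones mentioned above; the files inside them are made up for the example.

```shell
set -e
cd "$(mktemp -d)"
git init --quiet

# A toy project: a manuscript plus two directories of regenerable output.
mkdir figs out
echo "plot data" > figs/fig1.txt               # hypothetical file names
echo "big table" > out/results.csv
echo "text" > master_manuscript.txt

# One pattern per line; the leading slash anchors the pattern to the repository root.
printf '/figs/\n/out/\n' > .gitignore

# Only .gitignore and the manuscript are reported; figs/ and out/ are skipped.
git status --short
```

The `.gitignore` file itself is normally committed, so collaborators get the same ignore rules.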
A branch is basically an independent copy of a state of the repository that allows you to do work in parallel to other changes occurring to the repository. While it’s aimed more at larger collaborative projects, I have found it useful in working on solo repositories as a way to explicitly maintain old versions of analyses or papers. I will cover branching more in a future post.
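As a rough illustration of using a branch to keep an old version around, the sketch below freezes a first draft on its own branch and then carries on working. The branch name `submitted-version`, the file and the identity are placeholders.

```shell
set -e
cd "$(mktemp -d)"
git init --quiet
git config user.name "Example Author"          # placeholder identity
git config user.email "author@example.org"

echo "draft one" > paper.txt
git add paper.txt
git commit --quiet -m "first draft"
main_branch=$(git rev-parse --abbrev-ref HEAD) # remember the default branch name

# Keep a named pointer to the current state of the project.
git branch submitted-version

# Carry on working on the main line.
echo "draft two" > paper.txt
git commit --quiet -am "second draft"

# Switch to the old version to look at it, then come back.
git checkout --quiet submitted-version
cat paper.txt                                  # the first draft
git checkout --quiet "$main_branch"
cat paper.txt                                  # the second draft
```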
Other things to note
Note how in the explanation above I said “for all the (plain text) files in your directory”? A plain text file is something that looks ok when opened in a text editor like notepad. They could have extensions like .txt, .csv, .m, .R, .py, .tex, etc, but they are still readable in a text editor. Version control works beautifully for these. However, binary files not so much. You won’t take the most advantage of version control if you try to keep track of something like Word docs, .pdf or .jpg. At best your version control will be able to record that the file changed – but not which lines. On the other hand, it may still work to keep collaborators synched with the latest file, but conflict resolution would be hard.
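The difference between text and binary files is easy to see in a diff. In this sketch `image.bin` is a made-up three-byte file standing in for something like a .jpg, and the identity is a placeholder.

```shell
set -e
cd "$(mktemp -d)"
git init --quiet
git config user.name "Example Author"          # placeholder identity
git config user.email "author@example.org"

printf 'a\nb\n' > notes.txt                    # plain text
printf '\000\001\002' > image.bin              # binary stand-in (contains a NUL byte)
git add notes.txt image.bin
git commit --quiet -m "add text and binary files"

# Change both files.
printf 'a\nc\n' > notes.txt
printf '\000\001\003' > image.bin

# The text file gets a line-by-line diff; the binary file is only reported as changed.
git --no-pager diff
```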
A final disclaimer: Like any file, keeping your git repository within a Dropbox folder carries the risk that the file could be corrupted. Specifically, if you work from computer A, make changes, shut down that computer before Dropbox finishes syncing, then subsequently work on the same files from computer B, you will have conflicts when you open up computer A again. Keep a separate backup of your Dropbox folder (e.g. with Time Machine, or Dropbox’s own file history service) to prevent anything nasty happening. This also becomes less of a problem if you are pushing your git repository to a remote server, which is what I will cover in the next post.
UPDATE 12 Feb 2014:
A few people have commented to me on other forums about my selection of git over something else. Specifically, a lot of people find git to be unnecessarily complex for many projects as compared to say, subversion.
Alex Holcombe shared the following: “I have started using github (inside Rstudio), and seems to be working, but after reading this I am very afraid http://t.co/hLrUMrIz8J”
Another friend writes: “SVN makes perfect sense to me (I used it in my programming jobs) and would definitely do the trick. Git on the other hand is a brutally complicated, confusing thing, (e.g. https://steveko.wordpress.com/2012/02/24/10-things-i-hate-about-git/) that will require hours to master even for relatively simple tasks. However it is widely in use out in the world of software development, so the skill is a valuable one to develop, and has some super neat features (all that branching). “
I guess my response to these concerns is just to say that the above is how I’ve done version control, and so far it has worked for me. However, the vast majority of what I’ve done so far is single-user repositories (i.e. just me), where I integrate changes and comments of co-authors myself manually. Perhaps I will learn to hate git when I have to do more with other contributors, but for now the workflow above works for me.
Academic Torrents looks like it might be pretty cool for sharing larger data sets (e.g. brain imaging or eye tracking data)…
You may be interested in a recent paper by Jonas Kubilius (link) detailing his reproducibility framework Psychopy_ext. This is a Python wrapper package for PsychoPy (for stimulus generation and presentation) and various data analysis packages that promises to streamline workflows for conducting typical psychophysical experiments. Looks really useful – great work Jonas!
Regarding my last post, you might be particularly interested in Jonas’ Figure 2, which contains a slightly different suggestion for how to lay out a project directory.