2.9 Harnessing version control for transparent data recording

As a research project progresses, a typical practice in many experimental research groups is to save new versions of files (e.g., ‘draft1.doc,’ ‘draft2.doc’), so that changes can be reverted. However, this practice leads to an explosion of files, and it becomes hard to track which files represent the ‘current’ state of a project. Version control allows researchers to edit and change research project files more cleanly, while maintaining the power to ‘backtrack’ to previous versions, messages included to explain changes. We will explain what version control is and how it can be used in research projects to improve the transparency and reproducibility of research, particularly for data recording.

Objectives. After this module, the trainee will be able to:

Describe version control
Explain how version control can be used to improve reproducibility for data recording

2.9.1 What is version control?

Version control developed as a way to coordinate collaborative work on software programming projects. The term “version” here refers to the current state of a document or set of documents, for example the source code for a computer program. The idea of “control” is to allow for safe changes and updates to this version while more than one person is working on it. The general term “version control” can refer to any method of syncing contributions from several people to a file or set of files, and very early on it was done by people rather than through a computer program. While version control of computer files can be done by people, and originally was (Irving 2011), it’s much more efficient to use a computer program to handle this tracking of the history of a set of files as they evolve.

“Tracking all that detail is just the sort of thing computers are good at and humans are not.” (E. S. Raymond 2003)

While the very earliest version control systems tracked single files, these systems quickly moved to tracking sets of files, called repositories. You can think of a repository as a computer file directory with some extra overhead added to record how the files in the directory have changed over time. In a repository of files that is under version control, you take regular “snapshots” of how the files look during your work on them. Each snapshot is called a commit, and it provides a record of which lines in each file changed from one snapshot to another, as well as exactly how they changed. The idea behind these commits—recording the differences, line-by-line, between an older and newer version of each file derives from a longstanding Unix command line tool called diff. This tool, developed early in the history of Unix at AT&T (E. S. Raymond 2003), is and extremely solid and well-tested tool that did the simple but important job of generating a list of all the differences between two plain text files.

When you are working with a directory under version control, you can also explain your changes as you make them—in other words, it allows for annotation of the developing and editing process (E. Raymond 2009). Each commit requires you to enter a commit message describing why the change was made. The commit messages can serve as a powerful tool for explaining changes to other team members or for reminding yourself in the future about why certain changes were made. As a result, a repository under version control includes a complete history of how files in a project directory have changed over the timecourse of the project and why. Further, each of the commits is given its own ID tag (a unique SHA-1 hash), and version control systems have a number of commands that let you “roll back” to earlier versions, by going back to the version as it was when a certain commit was made, provided reversability within the project files (E. Raymond 2009).

It turns out that this functionality—of being able to “roll back” to earlier versions—has a wonderful side benefit when it comes to working on a large project. It means that you don’t need to save earlier versions of each file. You can maintain one and only one version of each project file in the project’s directory, with the confidence that you never “lose” old versions of the file (J. Perkel 2018; Blischak, Davenport, and Wilson 2016). This allows you to maintain a clean and simple version of the project files, with only one copy of each, ensuring it’s always clear which version of a file is the “current” one (since there’s only one version). This also provides the reassurance that you can try new directions in a project, and always roll back to the old version if that direction doesn’t work well.

“Early in his graduate career, John Blischak found himself creating figures for his advisor’s grant application. Blischak was using the programming language R to generate the figures, and as he iterated and optimized his code, he ran into a familiar problem: Determined not to lose his work, he gave each new version a different filename—analysis_1, analysis_2, and so on, for instance—but failed to document how they had evolved. ‘I had no idea what had changed between them,’ says Blischak… Using Git, Blischak says, he no longer needed to maintain multiple copies of his files. ‘I just keep overwriting it and changing it and saving the snapshots. And if the professor comes back and says, ’oh, you sent me an email back in March with this figure,’ I can say, ‘okay, well, I’ll just bo back to the March version of my code and I can recreate it.’” (J. Perkel 2018)

Finally, most current version control systems operate under a distributed framework. In earlier types of version control programs, there was one central (“main”) repository for the file or set of files the team was working on (E. Raymond 2009; Target 2018). Very early on, this was kept on one computer (Irving 2011). A team member who wanted to make a change would “check out” the file he or she wanted to work on, make changes, and then check it back in as the newest main version (E. S. Raymond 2003). While one team member had this file checked out, other members would often be “locked” out of making any changes to that file—they could look at it, but couldn’t make any edits (E. Raymond 2009; Target 2018). This meant that there was no chance of two people trying to change the same part of a file at the same time. In spirit, this early system is pretty similar to the idea of sending a file around the team by email, with the understanding that only one person works on it at a time. While the “main” version is in different people’s hands at different times, to do work, you all agree that only one person will work on it at a time. A slightly more modern analogy is the idea of having a single version of a file in Dropbox or Google Docs, and avoiding working on the file when you see that another team member is working on it.

This system is pretty clunky, though. In particular, it usually increases the amount of time that it takes the team to finish the project, because only one person can work on a file at a time. Later types of version control programs moved toward a different style, allowing for distributed rather than centralized collaborative work on a file or a set of files (E. Raymond 2009; Irving 2011). Under the distributed model, all team members can have their own version of all the files, work on them and make records of changes they make to the files, and then occassionally sync with everyone else to share your changes with them and bring their changes into your copy of the files. This distributed model also means there is a copy of the full repository on every team members computer, which has the side benefit of provided natural backup of the project files. Remote repositories—which may be on a server in a different location—can be added with another copy of the project, which can similarly be synced regularly to update with any changes made to project files.

While there are a number of software systems for version control, by far the most common currently used for scienctific projects is git. This program was created by Linus Torvalds, who also created the Linux operating system, in 2005 as a way to facilitate the team working on Linux development. This program for version control thrives in large collaborative projects, for example open-source software development projects that include numerous contributors, both regular and occasional (Brown 2018).

In recent years, some complementary tools have been developed that make the process of collaborating together using version control software easier. Other tools can helps in collaborating on file-based projects, including bug trackers or issue trackers, which allow the team to keep a running “to-do” list of what needs to be done to complete the project, all of which are discussed in the next chapter as tools that can be used to improve collaboration on scientific projects spread across teams. GitHub, a very popular version control platform with these additional tools, was created in 2008 as a web-based platform to facilitate collaborating on projects running under git version control. It can provide an easier entry to using git for version control than trying to learn to use git from the command line (Perez-Riverol et al. 2016). It also plays well with RStudio, making it easy to integrate a collaborative workflow through GitHub from the same RStudio window on your computer where you are otherwise doing your analysis (Perez-Riverol et al. 2016).

“If your software engineering career, like mine, is no older than GitHub, then git may be the only version control software you have ever used. While people sometimes grouse about its steep learning curve or unintuitive interface, git has become everyone’s go-to for version control.” (Target 2018)

2.9.2 Recording data in the laboratory—from paper to computers

Traditionally, experimental data collected in a laboratory was recorded in a paper laboratory notebook. These laboratory notebooks played a role not only as the initial recording of data, but also can serve as, for example, a legal record of the data recorded in the lab (Mascarelli 2014). They were also a resource for collaborating across a team and for passing on a research project from one lab member to another (Butler 2005).

However, paper laboratory notebooks have a number of limitations. First, they can be very inefficient. In a time when almost all data analyses—even simple calculations—are done on a computer, recording research data on paper rather than directly entering it into a computer is inefficient. Also, any stage of copying data from one format to another, especially when done by a human rather than a machine, introduces the chance to copying errors. Handwritten laboratory notebooks can be hard to read (Butler 2005; J. M. Perkel 2011), and may lack adequate flexibility and expandability to handle the complex experiments often conducted. Further, electronic alternatives can also be easier to search, allowing for deeper and more comprehensive investigations of the data collected across multiple experiments (Giles 2012; Butler 2005; J. M. Perkel 2011).

“Handwritten lab notebooks are usually chaotic and always unsearchable.” (J. M. Perkel 2011)

Given a widespread recognition of the limitations of paper laboratory notebooks, in the past couple of decades, there have been a number of efforts, both formal and informal, to move from paper laboratory notebooks to electronic alternatives. In some fields that rely heavily on computational analysis, there are very few research labs (if any) that use paper laboratory notebooks (Butler 2005). In other fields, where researchers have traditionally used paper lab notebooks, companies have been working for a while to develop electronic laboratory notebooks specifically tailored to scientific research needs (Giles 2012). These were adopted more early in pharmaceutical industrial labs, where companies had the budgets to get customized versions and the authority to require their use, but have taken longer to be adapted in academic laboratories (Giles 2012; Butler 2005). A widely adopted platform for electronic laboratory notebooks has yet to be taken up by the scientific community, despite clear advantages of recording data directly into a computer rather than first using a paper notebook.

“Since at least the 1990s, articles on technology have predicted the imminent, widespread adoption of electronic laboratory notebooks (ELNs) by researchers. It has yet to happen—but more and more scientists are taking the plunge.” (Kwok 2018)

Instead of using customized electronic laboratory notebook software, some academics are moving their data recording online, but are using more generalized electronic alternatives, like Dropbox, Google applications, OneNote, and Evernote (J. M. Perkel 2011; Kwok 2018; Giles 2012; K. Powell 2012). Some scientists have started using version control tools, especially the combination of git and GitHub, as a way to improve laboratory data recording, and in particular to improve transparency and reproducibility standards. These pieces of software share the same pattern as Google tools or Dropbox—they are generalized tools that have been honed and optimized for ease of use through their role outside of scientific research, but can be harnessed as a powerful tool in a scientific laboratory, as well. They are also free—at least, for GitHub, at the entry and academic levels—and, even better, one (git) is open source.

“The purpose of a lab notebook is to provide a lasting record of events in a laboratory. In the same way that a chemistry experiment would be nearly impossible without a lab notebook, scientific computing would be a nightmare of inefficiency and uncertainty without version-control systems.” (Tippmann 2014)

While some generalized tools like Google tools and Dropbox might be simpler to initially learn, version control tools offer some key advantages for recording scientific data and are worth the effort to adopt. A key advantage is their ability to track the full history of files as they evolve, including not only the history of changes to each file, but also a record of why each change was made. Git excels in tracking changes made to plain text files. For these files, whether they record code, data, or text, git can show line-by-line differences between two versions of the file. This makes it very easy to go through the history of “commits” to a plain text file in a git-tracked repository and see what change was made at each time point, and then read through the commit messages associated with those commits to see why a change was made. For example, if a value was entered in the wrong row of a csv, and the researcher then made a commit to correct that data entry mistake, the researcher could explain the problem and its resolution in the commit message for that change.

Platforms for using git often include nice tools for visualizing differences between two files, providing a more visual way to look at the “diffs” between files across time points in the project. For example, GitHub automatically shows these using colors to highlight addititions and substractions of plain text for one file compared to another version of it when you look through a repository’s commit history. Similarly, RStudio provides a new “Commit” window that can be used to compare differences between the original and revised version of plain text files at a particular stage in the commit history.

The use of version control tools and platforms, like git and GitHub, not only helps in transparent and trackable recording of data, but it also brings some additional advantages in the research project. First, this combination of tools aids in collaboration across a research group, as we discuss in depth in the next chapter.

Second, if a project uses these tools, it is very easy to share data recorded for the project publicly. In a project that uses git and GitHub version control tools, it is easy to share the project data online once an associated manuscript is published, an increasingly common request or requirement from journals and funding agencies (Blischak, Davenport, and Wilson 2016). Sharing data allows a more complete assessment of the research by reviewers and readers and makes it easier for other researchers to build off the published results in their own work, extending and adapting the code to explore their own datasets or ask their own research questions (Perez-Riverol et al. 2016). On GitHub, you can set the access to a project to be either public or private, and can be converted easily from one form to the other over the course of the project (Metz 2015). A private project can be viewed only by fellow team members, while a public project can be viewed by anyone. Further, because git tracks the full history of changes to these documents, it includes functionality that let’s you tag the code and data at a specific point (for example, the date when a paper was submitted) so that viewers can look at that specific “version” of the repository files, even while the project team continues to move forward in improving files in the directory. At the more advanced end of functionality, there are even ways to assign a DOI to a specific version of a GitHub repository (Perez-Riverol et al. 2016).

Third, the combination of git and GitHub can help as a way to backup study data (Blischak, Davenport, and Wilson 2016; Perez-Riverol et al. 2016; J. Perkel 2018). Together, git and GitHub provide a structure where the project directory (repository) is copied on multiple computers, both the users’ laptop or desktop computers and on a remote server hosted by GitHub or a similar organization. This set-up makes it easy to bring all the project files onto a new computer—all you have to do is clone the project repository. It also ensures that there are copies of the full project directory, including all its files, in multiple places (Blischak, Davenport, and Wilson 2016). Further, not only is the data backed up across multiple computers, but so is the full history of all changes made to that data and the recorded messages explaining those changes, through the repositories commit messages (Perez-Riverol et al. 2016).

There are, of course, some limitations to using version control tools when recording experimental data. First, while ideally laboratory data is recorded in a plain text format (see the module in section 2.2 for a deeper discussion of why), some data may be recorded in a binary file format. Some version control tools, including git, can be used to track changes in binary files. However, git does not take to these types of files naturally. In particular, git typically will not be able to show users a useful comparison of the differences between two versions of a binary file. More problems can arise if the binary file is very large (Perez-Riverol et al. 2016; Blischak, Davenport, and Wilson 2016), as some experimental research data files are (e.g., if they are high-throughput output of laboratory equipment like a mass spectrometer). However, there are emerging tools and strategies for improving the ability to include and track large binary files when using git and GitHub (Blischak, Davenport, and Wilson 2016)

“You can version control any file that you put in a Git repository, whether it is text-based, an image, or a giant data file. However, just because you can version control something, does not mean that you should.” (Blischak, Davenport, and Wilson 2016)

Finally, as with other tools and techniques described in this book, there is an investment required to learn how to use git and GitHub (Perez-Riverol et al. 2016), as well as a bit of extra overhead when using version control tools in a project (E. S. Raymond 2003). However, both can bring dramatic gains to efficiency, transparency, and organization of research projects, even if you only use a small subset of its basic functionality (Perez-Riverol et al. 2016). In Chapter 11 we provide guidance on getting started with using git and Github to track a scientific research project.

“Although Git has a complex set of commands and can be used for rather complex operations, learning to apply the basics requires only a handful of new concepts and commands and will provide a solid ground to efficiently track code and related content for research projects.” (Perez-Riverol et al. 2016)

2.9.3 Discussion questions

“Using an RCS [revision control system] has changed how I work. … a day’s work is no longer a featureless slog toward the summit, but a sequence of small steps. What one feature could I add? What one problem could I fix? Once a step is made and you are sure your code base is in a safe and clean state, commit a revision, and if your next step turns out disastrously, you can fall back to the revision you just committed instead of starting from the beginning.” (Klemens 2014)

With version control, “Our filesystem now has a time dimension. We can query the RCS’s repository of file information to see what a file looked like last week and how it changed from then to now. Even without the other powers, I have found that this alone makes me a more confident writer.” (Klemens 2014)

“The most rudimentary means of revision control is via diff and patch, which are POSIX-standard and therefore most certainly on your system.” (Klemens 2014)

“Git is a C program like any other, and is based on a small set of objects. The key object is the commit object, which is akin to a unified diff file. Given a previous commit object and some changes from that baseline, a new commit object encapsulates the information. It gets some support from the index, which is a list of the changes registered since the last commit object, the primary use of which will be in generating the next commit object. The commit objects link together to form a tree much like any other tree. Each commit object will have (at least) one parent commit object. Stepping up and down the tree is akin to using patch and patch -R to step among versions.” (Klemens 2014)

“Having a backup system organized enough that you can delete code with confidence and recover as needed will already make you a better writer.” (Klemens 2014)