Module 9 Harnessing version control for transparent data recording

As a research project progresses, researchers will often end up with many files (e.g., ‘draft1.doc’, ‘draft2.doc’). This can result in an explosion of files, and it becomes hard to track which files represent the “current” state of a project. Version control allows researchers to edit and change research project files more cleanly, while including messages to explain changes and maintaining the power to backtrack to previous versions.

In this module, we will explain what version control is and how it can be used in research projects to improve the transparency and reproducibility of research, particularly for data recording. We’ll introduce you to the basic idea of version control, using the Git software program as an example. In later modules, we’ll explain version control platforms like GitHub, as well as give some tips on how to use both within your research projects.

Objectives. After this module, the trainee will be able to:

  • Define “version”, “version control”, “version control software”, “repository”, “commit”, and “commit message”
  • Discuss challenges in coordinating changes in project files when working in teams
  • List some downsides of physical laboratory notebooks
  • Distinguish between “version control” and “version control software”
  • Identify examples of versioning in a digital context (data, code, files)
  • Discuss how version control principles can improve collaboration in scientific projects

9.1 Challenges of collaborating on evolving research materials

When research groups—or any other professional teams—collaborate on publications and research, the process can be a bit haphazard. Teams often use emails and email attachments to share updates on the project, and they sometimes use email attachments to pass around the latest version of a document for others to review and edit.

One fascinating example comes from the business world. After the implosion of Enron, a trove of emails from within the company was released. This set of emails has become known as the Enron Corpus and has been used for a variety of research studies. One group of researchers investigated emails from this corpus that involved people who were doing work with spreadsheets.189 They found that passing Excel files through email attachments was a common practice, and that messages within emails suggested that spreadsheets were stored locally, rather than in a location that was accessible to all team members.190 This meant that team members might often be working on different versions of the same spreadsheet file. They note that “the practice of emailing spreadsheets is known to result in serious problems in terms of accountability and errors, as people do not have access to the latest version of a spreadsheet, but need to be updated of changes via email.”191 The same process for collaboration is often used in scientific research: one study found, “Team members regularly pass data files back and forth by hand, by email, and by using shared lab or project servers, websites, and databases.”192

These practices make it very difficult to keep track of all project files, and in particular, to track which version of each file is the most current. Further, this process constrains patterns of collaboration—it requires each team member to take turns in editing each file, or for one team member to attempt to merge in changes that were made by separate team members at the same time when all versions are collected.

This process also makes it difficult to keep track of why changes were made, and often requires one team member to approve the changes of other team members. While “Track changes” and “Comment” features in software like Microsoft Word can help the team communicate with each other, these features often lead to a very messy document at stages in the editing, where it is hard to pick out the current versus suggested wording, and once a change is accepted or a comment deleted, these conversations can be lost forever. Finally, word processing tools are poorly suited to track changes or add suggestions directly to data or code, as both data and code are usually saved in formats that aren’t native to word processing programs, and copying them into a format like Word can introduce problematic hidden formatting that can cause the data or code to malfunction.

9.2 Recording data in the laboratory—from paper to computers

More and more scientific researchers are tackling these challenges in their own projects using something called version control. But how does version control, traditionally a tool of software engineers, relate to collaborating to collect and analyze scientific research data? Historically, experimental data collected in a laboratory was recorded in a paper laboratory notebook. These laboratory notebooks served not only as the initial recording of data, but also as a legal record of the data recorded in the lab.193 They were also a resource for collaborating across a team and for passing on a research project from one lab member to another.194

However, paper laboratory notebooks have a number of limitations. First, they can be very inefficient. In a time when almost all data analyses, even simple calculations, are done on a computer, recording research data on paper rather than directly entering it into a computer is inefficient. Also, any stage of copying data from one format to another, especially when done by a human rather than a machine, introduces the chance of copying errors. Handwritten laboratory notebooks can be hard to read,195 and they may lack adequate flexibility to handle the complex experiments often conducted. Electronic alternatives can also be easier to search, allowing for deeper and more comprehensive investigations of the data collected across multiple experiments.196 As one article notes, physical lab notebooks are “usually chaotic and always unsearchable.”197

Given a widespread recognition of the limitations of paper laboratory notebooks, in the past couple of decades, there have been a number of efforts, both formal and informal, to move from paper laboratory notebooks to electronic alternatives. In some fields that rely heavily on computational analysis, there are very few research labs (if any) that use paper laboratory notebooks.198 In other fields, where researchers have traditionally used paper lab notebooks, companies have been working for a while to develop electronic laboratory notebooks specifically tailored to scientific research.199 Some early adopters were pharmaceutical industrial labs, where companies had the budgets to get customized versions and the authority to require their use. In academic laboratories, electronic lab notebooks have taken longer to be adopted.200 Indeed, no single platform for electronic laboratory notebooks has yet been widely adopted by the scientific community,201 despite clear advantages of recording data directly into a computer rather than first using a paper notebook. As Kwok notes in a 2018 commentary,

“Since at least the 1990s, articles on technology have predicted the imminent, widespread adoption of electronic laboratory notebooks (ELNs) by researchers. It has yet to happen”202

Instead of using customized electronic laboratory notebook software, some academics are moving their data recording online, but are using more generalized electronic alternatives, like Dropbox, Google applications, OneNote, and Evernote.203 Some scientists have started using version control software, especially the combination of Git and GitHub, as a way to improve laboratory data recording, and in particular to improve transparency and reproducibility standards. These pieces of software share the same pattern as Google applications or Dropbox—they are generalized tools that have been honed and optimized for ease of use through their role outside of scientific research, but can be harnessed as a powerful tool in a scientific laboratory, as well. They are also free—at least, for GitHub, at the entry and academic levels—and, even better, one (Git) is open-source.

9.3 Defining “version” and “version control”

Most scientific research today involves collaboration across a team of researchers, rather than an individual scientist working alone. Collaboration drives interdisciplinary science, but it also creates challenges. One challenge comes with coordinating versions of research materials. These materials can include data collection files, but can also include other documents like study protocols, as well as physical materials like cell lines, antibodies, and model organisms.

A version is one iteration of a research material that is evolving. For example, a draft of a research paper is one version of that paper. Research data that you collect may also go through several versions. For example, if you identify a typo in data after you record it, you may need to correct the typo and add a note or signature to explain that update. Further, if you are collecting data at multiple timepoints, you may have new versions of a data file as you complete each timepoint.

As materials evolve across versions, it becomes harder to maintain a research process that is smooth, efficient, and error-free. One challenge is to make sure it is always clear which version is the most current, as well as which version should be used for specific purposes. For example, if several coauthors are editing a paper draft, it is important to ensure they are all working on the most recent version.

Another challenge is to coordinate the changes that different people make when they work on the material at the same time. Scientific collaboration often does not operate as an assembly line, where one person finishes their work on a document or material and then hands it off to the next person. Instead, there will often be several copies of a version in different people’s hands, with all of them working on it at once. One example is a paper draft: coauthors often all edit the latest draft at the same time, rather than one by one. This creates the challenge of taking the contributions of each person and coordinating their changes and additions into one primary copy.

A third challenge is to keep track of the changes that are made at each step, as the document moves from version to version. This record can help in auditing for errors or bugs that might be introduced as the document evolves. Ideally, the record also will include some information about why changes were made at each step.

These challenges can be addressed through a process called version control. While the term is most commonly used in reference to software development, the idea of version control is widely relevant. Any process that creates evolving versions of a document or material can benefit from the idea of version control, which aims to record and document changes to the material over time, coordinate the contributions of different members of a team, and revert back to older versions if needed. In this module, we’ll focus on version control as it applies to research materials that are electronic (files and directories), but you may also find it useful to think about how the principles and elements of version control can be applied to other research materials, like cell lines and antibodies.

9.4 What are the key elements of version control?

The term version in version control refers to one iteration or state of a document or set of documents, for example the current version of a data file. The word control captures the idea of allowing for safe changes and updates to the version, especially when more than one person is working on it. Part of this “control” will also include recording the changes made from one version to the next and annotating reasons for those changes.

The general term version control can refer to any method of syncing contributions from several people to a file or set of files. Version control of computer files can be done “by hand”, with a person manually logging each change, and originally was.204 However, it’s much more efficient to use a computer program to handle this tracking and to coordinate contributions from multiple people. As Eric Raymond notes in The Art of Unix Programming, “tracking all that detail is just the sort of thing computers are good at and humans are not.”205 He goes on to describe version control as “a suite of programs that automates away most of the drudgery involved in keeping an annotated history of your project and avoiding modification conflicts.”206

Software for this purpose—version control software—was first developed for software programming projects. Some popular version control software today comes from these roots. In this section, we’ll introduce the key features of version control, and to do so we’ll use examples and terminology from a common version control software program called Git. While these terms are derived from this particular software program, they represent ideas that are important in any implementation of version control. Later, we’ll touch on how some of these ideas are incorporated in other software, like Google Docs.

The software available for version control tracks electronic files. While the very earliest version control software systems tracked single files, these systems quickly moved to tracking sets of files, called repositories. A repository is almost identical to a file directory (which you may also know as a file folder), and indeed a repository starts from a file directory. The only difference is the repository is enhanced with some additional overhead.207 This overhead is added to record how the files in the directory have changed over time. You can compare this to how you might track document changes if the documents were paper rather than electronic—you could store the documents in a paper folder and add a piece of paper where you record a log of each change you make to the documents in the folder. The extra overhead that changes a regular file directory to a repository is very similar to the log in this example. A repository, in other words, is a directory that is under version control.

In a repository of files that is under version control, the version control software takes snapshots of how the files look during your work on them. Each snapshot is called a commit, and it provides a record of which lines in each file changed from one snapshot to another, as well as exactly how they changed. The idea behind these commits (recording the differences, line by line, between an older and newer version of each file) derives from a longstanding Unix command line tool called diff. This tool, developed early in the history of Unix at AT&T’s Bell Labs,208 is a solid and well-tested tool that does the simple but important job of generating a list of all the differences between two plain text files. Each commit in a repository includes the same type of information about the differences introduced in the files at the time of that commit.
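
As a rough illustration of the kind of line-by-line comparison that diff produces, the sketch below uses Python’s standard difflib module (not Git or the Unix diff tool itself) to compare two versions of a small plain text data file. The file name, column names, and values are hypothetical and chosen only for illustration.

```python
import difflib

# Two hypothetical versions of a small plain text data file.
old_version = [
    "mouse_id,weight_g\n",
    "A1,21.4\n",
    "A2,2.25\n",   # data entry typo: decimal point in the wrong place
    "A3,20.9\n",
]
new_version = [
    "mouse_id,weight_g\n",
    "A1,21.4\n",
    "A2,22.5\n",   # corrected value
    "A3,20.9\n",
]

def line_diff(old, new, name="weights.csv"):
    """Return a unified diff (as a list of lines) between two versions of a file."""
    return list(
        difflib.unified_diff(
            old, new, fromfile=f"{name} (old)", tofile=f"{name} (new)"
        )
    )

diff_lines = line_diff(old_version, new_version)
print("".join(diff_lines), end="")
```

In the printed result, lines that start with a minus sign were removed from the older version and lines that start with a plus sign were added in the newer one, so the single corrected row stands out from the unchanged rows around it.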

When you are working with a directory under version control, you explain your changes as you make them—in other words, version control allows for annotation of the developing and editing process.209 Each commit requires you to enter a commit message describing why the changes in that commit were made. The commit messages can serve as a powerful tool for explaining changes to other team members or for reminding yourself in the future about why certain changes were made. A repository under version control, then, can include not only a complete history of how files in a project directory have changed over the course of the project, but also why. If this feature is used thoughtfully, then the commit history of the project provides a well-documented description of the project’s full evolution. If you’re working on a manuscript, for example, when it’s time to edit, you can cut whole paragraphs, and if you ever need to get them back, they’ll be right there in the commit history for your project, with their own commit message about why they were cut. If you make the commit message clear, it will make it easy to find that commit if you ever need those paragraphs again.

Further, each of the commits is given its own ID tag (in the Git software, this is done through something called a unique SHA-1 hash),210 and version control systems have a number of commands that let you “roll back” to earlier versions. This provides reversibility within the project files, allowing you to go back to the version as it was when a certain commit was made.211
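
To give a feel for how an ID can be derived from content, the sketch below uses Python’s standard hashlib module to compute a SHA-1 hash in the way Git names the contents of a file (a “blob”): a short header giving the type and size, followed by the file’s bytes. Commit IDs are computed along similar lines over the commit’s own contents, which this sketch does not attempt to reproduce. The function name blob_id and the example data are hypothetical.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Compute a SHA-1 ID for file contents, prefixed with a Git-style blob header."""
    # Header layout: the word "blob", a space, the size in bytes, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two hypothetical versions of one line of a data file.
version_1 = b"A2,2.25\n"
version_2 = b"A2,22.5\n"

print(blob_id(version_1))
print(blob_id(version_2))
```

Identical content always yields the same 40-character ID, while even a one-character change yields a completely different one. This is what lets an ID stand in for an exact state of the content.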

It turns out that this functionality—of being able to roll back to earlier versions—has a wonderful side benefit when it comes to working on a large project. It means that you don’t need to save earlier versions of each file. You can maintain one and only one version of each project file in the project’s directory, with the confidence that you never “lose” old versions of the file.212 This allows you to maintain a clean and simple version of the project files, with only one copy of each, ensuring it’s always clear which version of a file is the “current” one (since there’s only one version).213 This also provides the reassurance that you can try new directions in a project, and always roll back to the old version if that direction doesn’t work well.
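
The idea of keeping one working copy of each file while still being able to recover earlier states can be sketched with a toy model. The code below is not how Git is implemented; it is a minimal in-memory illustration in which each commit stores a full snapshot of the files together with a message. The class names (Commit, ToyRepo), the file name, and the messages are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Commit:
    snapshot: Dict[str, str]   # file name -> file contents at this commit
    message: str               # why the change was made

@dataclass
class ToyRepo:
    files: Dict[str, str] = field(default_factory=dict)   # the single working copy
    history: List[Commit] = field(default_factory=list)   # the extra "overhead"

    def commit(self, message: str) -> int:
        """Record a snapshot of the current files; return its position in the history."""
        self.history.append(Commit(dict(self.files), message))
        return len(self.history) - 1

    def restore(self, index: int) -> None:
        """Reset the working copy to the files as they were at an earlier commit."""
        self.files = dict(self.history[index].snapshot)

repo = ToyRepo()
repo.files["analysis.R"] = "plot(weight ~ day)"
first = repo.commit("Figure for grant application")

# Overwrite the same file rather than saving analysis_2.R alongside it.
repo.files["analysis.R"] = "plot(log(weight) ~ day)"
repo.commit("Try log scale for weights")

# Roll back to the earlier version when it is needed again.
repo.restore(first)
print(repo.files["analysis.R"])
```

The working directory only ever holds one analysis.R, yet both versions remain recoverable from the history, each with a note on why it was made.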

In a 2011 commentary in Nature Methods, Perkel tells a story about how this functionality helped one researcher keep his project directories simpler:

“Early in his graduate career, John Blischak found himself creating figures for his advisor’s grant application. Blischak was using the programming language R to generate the figures, and as he iterated and optimized his code, he ran into a familiar problem: Determined not to lose his work, he gave each new version a different filename—analysis_1, analysis_2, and so on, for instance—but failed to document how they had evolved. ‘I had no idea what had changed between them,’ says Blischak… Using Git, Blischak says, he no longer needed to maintain multiple copies of his files. ‘I just keep overwriting it and changing it and saving the snapshots. And if the professor comes back and says, “oh, you sent me an email back in March with this figure”, I can say, “okay, well, I’ll just go back to the March version of my code and I can recreate it”.’”214

A key strength, then, of using version control is its ability to track every change made to files in the project, why the change was made, and who made it. Version control creates a full history of the evolution of each file in the project. When a change is committed, the history records the exact change made, including the previous version of the file. No change is ever fully lost, therefore, unless a great deal of extra work is taken to erase something from the project’s commit history.

It’s also helpful to understand how version control programs handle collaboration. In earlier types of version control programs, under what is called a centralized framework, there was one central repository for the file or set of files the team was working on.215 A team member who wanted to make a change would “check out” the file he or she wanted to work on, make changes, and then check it back in as the newest main version.216 While one team member had this file checked out, other members would be locked out of making any changes to that file—they could look at it, but couldn’t make any edits.217 This meant that there was no chance of two people trying to change the same part of a file at the same time. In spirit, this early system is pretty similar to the idea of sending a file around the team by email, with the understanding that only one person works on it at a time. A slightly more modern analogy is the idea of having a single version of a file in Dropbox or Google Docs, and avoiding working on the file when you see that another team member is working on it.

This assembly-line approach is pretty clunky, though. In particular, it usually increases the amount of time that it takes the team to finish the project, because only one person can work on a file at a time. Later types of version control programs moved toward a different style, allowing for distributed rather than centralized collaborative work on a file or a set of files.218 Under the distributed model, all team members can have their own version of all the files, work on them and make records of the changes they make, and then occasionally sync with everyone else to share their changes and bring others’ changes into their own copy of the files. This functionality is called concurrency, since it allows team members to concurrently work on the same set of files.219
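
To make the idea of syncing concurrent work more concrete, the sketch below merges two edited copies of a file by comparing each against the version they both started from. It is a deliberately simplified illustration and not the algorithm Git uses: it assumes that edits only change lines in place, so all three versions have the same number of lines. The function name merge_lines and the example text are hypothetical.

```python
def merge_lines(base, ours, theirs):
    """Merge two edited copies of a file against their common starting version.

    Simplifying assumption: edits change lines in place, so all three
    versions have the same number of lines.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:        # both copies agree (whether or not the line changed)
            merged.append(o)
        elif o == b:      # only the other copy changed this line
            merged.append(t)
        elif t == b:      # only our copy changed this line
            merged.append(o)
        else:             # both changed the same line in different ways
            merged.append(b)
            conflicts.append(i)
    return merged, conflicts

base = ["Title", "Methods: TBD", "Results: TBD"]
ours = ["Title", "Methods: mice were weighed daily", "Results: TBD"]
theirs = ["Title", "Methods: TBD", "Results: weights increased"]

merged, conflicts = merge_lines(base, ours, theirs)
print(merged)
print(conflicts)
```

When two people edit different lines, both sets of changes can be combined automatically. When they edit the same line differently, the sketch keeps the original line and reports its position so that a person can decide which wording to keep.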

This idea allowed for the development of other useful features and styles of working, including branching and forking. Branching allows you to try out new ideas that you’re not sure you’ll ultimately want to go with. Forking is a key tool used in open-source software development, and, among other things, facilitates someone who isn’t part of the original team getting a copy of the files they can work with and suggesting some changes that might be helpful. So, this is the basic idea of modern version control—for a project that involves a set of computer files, everyone on the team has their own copy of the directory on their own computer, makes changes at the time and in the spots in the files that they want, and then regularly re-syncs their local directory with everyone else’s to share changes and updates.

This distributed model also means there is a copy of the full repository on every team member’s computer, which has the side benefit of providing additional backup of the project files. Remote repositories—which may be on a server in a different location—can be added with another copy of the project, which can similarly be synced regularly to update with any changes made to project files.

9.5 Comparing Git to other tools

While there are a number of software systems for version control, one of the most common currently used for scientific projects is Git. This program was created in 2005 by Linus Torvalds, who also created the Linux operating system, as a way to facilitate the work of the team developing Linux. This program for version control thrives in large collaborative projects, for example open-source software development projects that include numerous contributors, both regular and occasional.220 As Target notes in a 2018 article about version control:

“While people sometimes grouse about its steep learning curve or unintuitive interface, Git has become everyone’s go-to for version control.”221

In recent years, some complementary tools have been developed that make collaborating with version control software easier. Other tools, such as bug trackers or issue trackers, support collaborative file-based projects by letting the team keep a running “to-do” list of what needs to be done to complete the project. These tools, which are discussed in modules 10 and 11, can be used to improve collaboration on scientific projects done by teams. GitHub is a very popular version control platform with these additional tools. It was created in 2008 as a web-based platform to facilitate collaborating on projects running under Git version control. It can provide an easier entry to using Git for version control than trying to learn to use Git from the command line.222 It also interfaces well with RStudio, making it easy to integrate a collaborative workflow through GitHub from the same RStudio window on your computer where you are otherwise doing your analysis.223

While Git version control software is one of the best established ways of implementing version control, there are growing efforts to enable some level of version control through other platforms. For example, Google Docs enables a level of version control through its Version History feature. This feature allows you to name different versions of a document as they are saved in Google Docs. It also allows you to restore a document to earlier versions, as well as see which changes have been made to a document and who made each change.

While some generalized tools like Google tools and Dropbox might be simpler to initially learn, more powerful version control tools like Git offer some key advantages for recording scientific data and are worth the effort to adopt. A key advantage is their ability to track the full history of files as they evolve, including not only the history of changes to each file, but also a record of why each change was made.

Git excels in tracking changes made to plain text files. For these files, whether they record code, data, or text, Git can show line-by-line differences between two versions of the file. This makes it very easy to go through the history of “commits” to a plain text file in a Git-tracked repository and see what change was made at each time point, and then read through the commit messages associated with those commits to see why a change was made. For example, if a value was entered in the wrong row of a plain text file or spreadsheet, and the researcher then made a commit to correct that data entry mistake, the researcher could explain the problem and its resolution in the commit message for that change.

There are, of course, some limitations to using version control tools when recording experimental data. First, while ideally laboratory data is recorded in a plain text format, some data may be recorded in a binary file format. Some version control tools, including Git, can be used to track changes in binary files. However, Git does not take to these types of files naturally. In particular, Git typically will not be able to show a useful comparison of the differences between two versions of a binary file.
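
One way to see why binary files are awkward for line-based tools is to check whether content looks like plain text at all. The sketch below applies a simple heuristic, treating content as binary if a NUL byte appears near its start. Git is understood to use a similar check, but the 8,000-byte window here is an assumption and should not be read as a specification of Git’s behavior. The function name looks_binary and the example bytes are hypothetical.

```python
def looks_binary(data: bytes, window: int = 8000) -> bool:
    """Heuristic: treat content as binary if a NUL byte appears in the first `window` bytes."""
    return b"\0" in data[:window]

# Plain text data: readable lines that a diff can compare one by one.
plain_text = b"mouse_id,weight_g\nA1,21.4\n"

# Hypothetical binary content, such as the start of a proprietary instrument file.
binary_blob = bytes([0x89, 0x50, 0x00, 0x01, 0xFF, 0x00])

print(looks_binary(plain_text))
print(looks_binary(binary_blob))
```

Content flagged this way has no meaningful “lines” to compare, so a version control tool can record that the file changed but cannot usefully show what changed inside it.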

More problems can arise if the binary file is very large,224 as some experimental research data files are (e.g., if they are high-throughput output of laboratory equipment like a mass spectrometer). However, there are emerging tools and strategies for improving the ability to include and track large binary files when using Git and GitHub.225

Finally, as with other tools and techniques described in this book, there is an investment required to learn how to use Git,226 as well as some extra overhead when using version control tools in a project.227 However, Git can bring dramatic gains to efficiency, transparency, and organization of research projects, even if you only use a small subset of its basic functionality.228 In module 11, we provide guidance on getting started with using Git and GitHub to track a scientific research project.

Another advantage is that the combination of Git and GitHub can serve as a way to back up study data.229 Together, Git and GitHub provide a structure where the project directory (repository) is copied on multiple computers, both the users’ laptop or desktop computers and a remote server hosted by GitHub or a similar organization. This set-up makes it easy to bring all the project files onto a new computer: all you have to do is clone the project repository. It also ensures that there are copies of the full project directory, including all its files, in multiple places.230 Further, not only is the data backed up across multiple computers, but so is the full history of all changes made to that data and the recorded messages explaining those changes, through the repository’s commit messages.231

9.6 Discussion questions

  • In your own research, do you collect data in paper laboratory notebooks, electronically, or a mixture of the two? What have you found to be advantages and disadvantages of the method you typically use? Are there ever cases where you have no choice and must either record on paper or electronically (examples might include when working behind a secure barrier or when data are recorded directly by equipment into a digital format)?
  • Have you used any of the following tools for recording, sharing, and versioning data or other research files (e.g., drafts of research papers, code):
    • Electronic laboratory notebooks
    • Dropbox
    • Google Docs / Google Drive
    • Microsoft Teams
    • Local server or drive run by your institution
    • GitHub / GitLab
  • Describe how any of these tools have helped in version control, including tracking changes to the file and helping to coordinate several people working on a file at once. Are there aspects where the tools you’ve used have been limited in this capacity?
  • Can you think of any examples of times when you’ve experienced a failure of version control? Examples might include a case where some team members worked on the wrong version of a file, or when you lost track of the changes that had been made to a file. What did you learn from the experience? Have you developed methods to avoid similar problems in the future? How might a version control problem like this result in problems with the rigor and reproducibility of scientific research?
  • How does the idea of version control relate to physical research materials, like model organisms, antibodies, or cell lines? Do you have any examples you can share of issues that have come up in research related to the version of these types of physical research materials?
  • What steps do you think you could take in your research to improve version control? Do you see this as a higher or lower priority change to take compared to other steps that might improve rigor and reproducibility in your research? Discuss your reasoning.