2.10 Enhance the reproducibility of collaborative research with version control platforms

Once a researcher has learned to use git on their own computer for local version control, they can begin using version control platforms (e.g., GitLab, GitHub) to collaborate with others under version control. We will describe how a research team can benefit from using a version control platform to work collaboratively.

Objectives. After this module, the trainee will be able to:

  • List benefits of using a version control platform to collaborate on research projects, particularly for reproducibility
  • Describe the difference between version control (e.g., git) and a version control platform (e.g., GitLab)

2.10.1 What are version control platforms?

The last module introduced the idea of version control, including the popular software tool often used for version control, git. In this module, we’ll go a step further, telling you about how you can expand the idea of version control to leverage it when collaborating across your research team, using version control platforms.

When research groups, or any other professional teams, collaborate on publications and research, the process can be a bit haphazard. Teams often use emails to share updates on the project, and email attachments to pass around the latest version of a document for others to review and edit. For example, one group of researchers investigated a large collection of emails from Enron (Hermans and Murphy-Hill 2015). They found that passing Excel files through email attachments was a common practice, and that messages within emails suggested that spreadsheets were stored locally, rather than in a location that was accessible to all team members (Hermans and Murphy-Hill 2015), which meant that team members might often be working on different versions of the same spreadsheet file. They note that “the practice of emailing spreadsheets is known to result in serious problems in terms of accountability and errors, as people do not have access to the latest version of a spreadsheet, but need to be updated of changes via email.” (Hermans and Murphy-Hill 2015) The same process for collaboration is often used in scientific research, as well: one study found, “Team members regularly pass data files back and forth by hand, by email, and by using shared lab or project servers, websites, and databases.” (Edwards et al. 2011)

“The most primitive (but still very common) method [of version control] is all hand-hacking. You snapshot the project periodically by manually copying everything in it to a backup. You include history comments in source files. You make verbal or email arrangements with other developers to keep their hands off certain files while you hack them. … The hidden costs of this hand-hacking method are high, especially when (as frequently happens) it breaks down. The procedures take time and concentration; they’re prone to error, and tend to get slipped under pressure or when the project is in trouble—that is exactly when they are needed.” (E. S. Raymond 2003)

These practices make it very difficult to keep track of all project files, and in particular, to track which version of each file is the most current. Further, this process constrains patterns of collaboration: it requires team members to take turns editing each file, or requires one team member to merge changes that separate team members made at the same time once all versions are collected. This process also makes it difficult to keep track of why changes were made, and often requires one team member to approve the changes of other team members. While the “Track changes” and comment features of word processing programs can help the team communicate with each other, these features often lead to a very messy document at some stages in the editing, where it is hard to pick out the current versus suggested wording, and once a change is accepted or a comment deleted, these conversations are typically lost forever. Finally, word processing tools are poorly suited to track changes or add suggestions directly to data or code, as both data and code are usually saved in formats that aren’t native to word processing programs, and copying them into a format like Word can introduce hidden formatting that can cause the data or code to malfunction.

A version control platform allows you to share project files across a group of collaborators while keeping track of what changes are made, who made each change, and why each change was made. It therefore combines the strengths of a “Track changes” feature with those of a file sharing platform like Dropbox. To some extent, Google Docs or Google Drive also combine these features, and some spreadsheet programs are moving toward some rudimentary functionality for version control (Birch, Lyford-Smith, and Guo 2018). However, version control platforms offer added advantages. Since open-source version control platforms like GitLab can be set up on a server that you own, they can be used to collaborate on projects with sensitive data, and they can also store project files directly on the server where you would like to store large project datasets or run computationally intensive pre-processing or analysis. In addition, most version control platforms include tools that help you manage and track the project. These include “Issue Trackers,” tools for exploring the history of each file and each change, and features to assign project tasks to specific team members. These systems are being leveraged by some scientists, both to manage research projects and to collaborate on writing scientific manuscripts and grant proposals (Perez-Riverol et al. 2016). The next section will describe the features of version control platforms that make them helpful as a tool for collaborating on scientific research.

“Using GitHub or any similar versioning / tracking system is not a replacement for good project management; it is an extension, an improvement for good project and file management.” (Perez-Riverol et al. 2016)

Version control platforms are always used in conjunction with version control software, like the git software described in the last module. Version control itself has been described as “a suite of programs that automates away most of the drudgery involved in keeping an annotated history of your project and avoiding modification conflicts,” (E. S. Raymond 2003). The version control platform leverages the history of commits that were made to the project, as well as the version control software’s capabilities for merging changes made by different people at different times. On top of these facilities, a version control platform also adds attractive visual interfaces for working with the project, free or low-cost online hosting of project files, and team management tools for each project. You can think of git as the engine, in other words, and the version control platform as the driver’s seat, with dashboard, steering wheel, and gears to leverage the power of the underlying git software.

A number of version control platforms are available. Two that are currently very popular for scientific research are GitHub (https://github.com/) and GitLab (https://about.gitlab.com/). Both provide free options for scientific researchers, including the capabilities for using both public and private repositories in collaboration with other researchers.

Resources like GitHub are “essential for collaborative software projects because they enable the organization and sharing of programming tasks between different remote contributors.” (Perez-Riverol et al. 2016)

2.10.2 Why use version control platforms?

Version control platforms offer a number of advantages when collaborating on a research project that can help to improve your efficiency, rigor, and reproducibility. Further, several high-quality version control platforms offer free versions to researchers, and as their use becomes more popular, more resources are becoming available for learning the details of how to use these platforms effectively. Open-source versions, like GitLab, even allow you to set up a version control platform on a server you own, rather than needing to post data or code on an outside platform, and so you can use these tools even in cases of sensitive data.

Some of the key advantages of using a version control platform like GitHub to collaborate on research projects include:

  • Ability to track and merge changes that different collaborators made to the document
  • Ability to create alternative versions of project files (branches), and merge them into the main project as desired
  • Tools for project management, including Issue Trackers
  • Default backup of project files
  • Ability to share project information online, including through hosting websites related to the project or supplemental files related to a manuscript

Many of these strengths draw directly on the functions provided by the underlying version control software (e.g., git). However, the version control platform will typically allow team members to explore and work with these functions in an easier way than if they try to use the barebones version control software. In earlier years, the use of version control often required users to be familiar with the command line, and to send arcane commands to track the project files through that interface. With the rising popularity of version control platforms, version control for project management can be taught relatively quickly to students with a few months—or even weeks—of coding experience. In fact, version control is beginning to be used as a method of turning in and grading homework in beginning programming classes, with students learning these techniques in the first few weeks of class. This would be practically unimaginable without the user-friendly interface of a version control platform as a wrapper for the power of the version control software itself.

“One reason for GitHub’s success is that it offers more than a simple source code hosting service. It provides developers and researchers with a dynamic and collaborative environment, often referred to as a social coding platform, that supports peer review, commenting, and discussion. A diverse range of efforts, ranging from individual to large bioinformatics projects, laboratory repositories, as well as global collaborations, have found GitHub to be a productive place to share code and ideas and collaborate.” (Perez-Riverol et al. 2016)

The first strength of using version control—and a version control platform—to collaborate on scientific projects is its ability to track every change made to files in the project, why the change was made, and who made it. Version control creates a full history of the evolution of each file in the project. When a change is committed, the history records the exact change made, including the previous version of the file. No change is ever fully lost, therefore, unless a great deal of extra work is taken to erase something from the project’s commit history. Version control also requires a user to provide a commit message describing each change that is made. If this feature is used thoughtfully, then the commit history of the project provides a well-documented description of the project’s full evolution. If you’re working on a manuscript, for example, when it’s time to edit, you can cut whole paragraphs, and if you ever need to get them back, they’ll be right there in the commit history for your project, with their own commit message about why they were cut (hopefully a nice clear one that will make it easy to find that commit if you ever need those paragraphs again).
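To make the idea of a searchable, annotated history concrete, here is a minimal sketch in Python. It is a toy model and not how git stores history internally: the `history` list, its commit messages, and the helper functions `describe_change` and `find_commits` are all hypothetical names invented for illustration. It only uses the `difflib` module from Python's standard library to compare two recorded versions of a manuscript file.

```python
import difflib

# Hypothetical project history: each entry pairs a commit message with the
# full text of a manuscript file at that point (one list item per line).
history = [
    ("Draft introduction",
     ["Version control tracks changes.",
      "This tangent on lab history may be cut later.",
      "We describe our methods below."]),
    ("Cut tangent on lab history to meet word limit",
     ["Version control tracks changes.",
      "We describe our methods below."]),
]


def describe_change(old_lines, new_lines):
    """Return the lines removed and the lines added between two versions."""
    diff = list(difflib.ndiff(old_lines, new_lines))
    removed = [line[2:] for line in diff if line.startswith("- ")]
    added = [line[2:] for line in diff if line.startswith("+ ")]
    return removed, added


def find_commits(history, keyword):
    """Return positions of commits whose message contains the keyword."""
    return [i for i, (message, _) in enumerate(history)
            if keyword.lower() in message.lower()]


# Find the commit that cut the tangent, then recover the cut text by
# comparing it with the version recorded just before it.
index = find_commits(history, "cut tangent")[0]
removed, added = describe_change(history[index - 1][1], history[index][1])
print(removed)  # the sentence that was cut in that commit
print(added)    # nothing was added in that commit
```

The point of the sketch is that a clear commit message is what makes the cut paragraph easy to find again: the search works on the messages, and the recorded versions supply the lost text.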

“[Version control systems] are a huge boon to productivity and code quality in many ways, even for small single-developer projects. They automate away many procedures that are just tedious work. They help a lot in recovering from mistakes. Perhaps most importantly, they free programmers to experiment by guaranteeing that reversion to a known-good state will always be easy.” (E. S. Raymond 2003)

These capacities to track changes and histories of project files become even more important when working in collaboration on a project. As the proverb about too many cooks in the kitchen captures, any time you have multiple people working on a project, it introduces the chance for conflicts. Higher-level conflicts, like those about what you want the final product to look like or who should do which jobs, can’t be easily managed by a computer program. However, the complications of integrating everyone’s contributions, so that people can work in their own space and then bring their individual work together into one final joint project, now can be. While these programs for version control were originally created to help programmers develop code, they can now be used to coordinate group work on numerous types of file-based projects, including scientific manuscripts, books, and websites (E. Raymond 2009). And although they can work with projects that include binary files, they thrive in projects with a heavier concentration of text-based files, and so they fit in nicely in a scientific research / data analysis workflow that is based on data stored in plain text formats and data analysis scripts written in plain text files, tools we discuss in other parts of this book.

“In a medium-sized project, it often happens that a (relatively small) number of people work simultaneously on a single set of files, the ‘program’ or the ‘project.’ Often these people have additional tasks, causing their working speeds to differ greatly. One person may be working a steady ten hours a day on the project, a second may have barely time to dabble in the project enough to keep current, while a third participant may be sent off on an urgent temporary assignment just before finishing a modification. It would be nice if each participant could be abstracted from the vicissitudes of the lives of the others.” (Grune 1986)

Modern version control systems like git take a distributed approach to collaboration on project files. Under the distributed model, all team members can have their own copy of all the files, work on them and record the changes they make, and then occasionally sync with everyone else to share their changes and bring others’ changes into their own copy of the files. This functionality is called concurrency, since it allows team members to concurrently work on the same set of files (E. Raymond 2009). This idea allowed for the development of other useful features and styles of working. These include branching, to try out new ideas that you’re not sure you’ll ultimately want to go with, and forking, a key tool in open-source software development, which among other things lets someone who isn’t part of the original team get a copy of the files to work with and suggest changes that might be helpful. So, this is the basic idea of modern version control: for a project that involves a set of computer files, everyone on the team (even if that’s just one person) has their own copy of a directory with those files on their own computer, makes changes at the time and in the spots in the files that they want, and then regularly re-syncs their local directory with everyone else’s to share changes and updates.
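The distributed model can be sketched with a toy example in Python. This is an illustration under strong simplifying assumptions, not a description of how git works internally: the `Commit` and `Repository` classes, the collaborator names, and the commit identifiers are all hypothetical, and the sketch tracks only which recorded changes each copy holds, not the contents of the files. Here `push` and `pull` simply copy over any commits the other side is missing.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commit:
    """One recorded change: a short identifier, who made it, and why."""
    commit_id: str
    author: str
    message: str


@dataclass
class Repository:
    """A hypothetical, highly simplified copy of a project's history."""
    name: str
    commits: dict = field(default_factory=dict)  # commit_id -> Commit

    def commit(self, commit_id, author, message):
        """Record a change in this copy only; nobody else sees it yet."""
        self.commits[commit_id] = Commit(commit_id, author, message)

    def push(self, remote):
        """Send commits the remote lacks; return the ids that were sent."""
        missing = [cid for cid in self.commits if cid not in remote.commits]
        for cid in missing:
            remote.commits[cid] = self.commits[cid]
        return missing

    def pull(self, remote):
        """Bring in commits this copy lacks; return the ids received."""
        missing = [cid for cid in remote.commits if cid not in self.commits]
        for cid in missing:
            self.commits[cid] = remote.commits[cid]
        return missing


shared = Repository("shared")  # e.g., the copy hosted on a platform
ana = Repository("ana")        # each collaborator has a full local copy
ben = Repository("ben")

# Both collaborators work independently on their own copies.
ana.commit("a1", "Ana", "Add raw plate-reader data")
ben.commit("b1", "Ben", "Draft methods section")

# They then sync through the shared copy.
ana.push(shared)
ben_received = ben.pull(shared)
ben.push(shared)
ana_received = ana.pull(shared)

print(ben_received)  # commits Ben received from Ana
print(ana_received)  # commits Ana received from Ben
print(sorted(ana.commits) == sorted(ben.commits) == sorted(shared.commits))
```

After the sync, all three copies hold the same set of commits, which is also why every collaborator's computer ends up acting as a backup of the project history.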

There is one key feature of modern version control that’s critical to making this work: combining files that started the same but were edited in different ways and now need to be put back together, bringing in all the changes made since the original version. This step is called merging the files. While this is typically described using the plural, “files,” at a higher level you can think of this as just merging the changes that two people have made as they edited a single file, a file where they both started out with identical copies.

Think of the file broken up into each of its separate lines. There will be some lines that neither person changed. Those are easy to handle in the “merge”: they stay the same as in the original copy of the file. Next, there will be some lines that one person changed, but that the other person didn’t. It turns out that these are pretty easy to handle, too. If only one person changed the line, then you use their version. It’s the most up-to-date, since if both people started out with the same version, it means that the other person didn’t make any changes to that part of the file. Finally, there may be a few lines that both people changed. These are called merge conflicts. They’re places in the file where there’s not a clear, easy-to-automate way that the computer can know which version to put into the integrated, latest version of the file. Different version control programs handle these merge conflicts in different ways. For the most common version control program used today, git, these spots in the file are flagged with a special set of symbols when you try to integrate the two updated versions of the file. Along with the special symbols to denote a conflict, there will also be both versions of the conflicting lines of the file. Whoever is integrating the files must go in and pick the version of those lines to use in the integrated version of the file, or write in some compromise version of those lines that brings in elements from both people’s changes, and then delete all the symbols that marked the conflict and save this latest version of the file.
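The three cases described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not git's actual merge algorithm: the function name `merge_lines` and the example file contents are hypothetical, and the sketch assumes that both people only edited lines in place, so all three versions have the same number of lines. The conflict markers it writes follow the style git uses (`<<<<<<<`, `=======`, `>>>>>>>`).

```python
def merge_lines(base, mine, theirs, my_label="mine", their_label="theirs"):
    """Merge two edited copies of a file, line by line, against the original.

    Simplifying assumption: all three versions have the same number of lines
    (edits change lines in place; nothing is inserted or deleted).
    Returns (merged_lines, number_of_conflicts).
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")

    merged, conflicts = [], 0
    for original, a, b in zip(base, mine, theirs):
        if a == b:
            # Neither person changed it, or both made the identical change.
            merged.append(a)
        elif a == original:
            # Only the other person changed this line: take their version.
            merged.append(b)
        elif b == original:
            # Only I changed this line: take my version.
            merged.append(a)
        else:
            # Both changed the line differently: flag it for a person to fix.
            conflicts += 1
            merged += [f"<<<<<<< {my_label}", a, "=======", b,
                       f">>>>>>> {their_label}"]
    return merged, conflicts


base = ["Title: Growth assay", "Samples: 12", "Threshold: 0.05"]
mine = ["Title: Growth assay", "Samples: 16", "Threshold: 0.01"]
theirs = ["Title: Bacterial growth assay", "Samples: 12", "Threshold: 0.10"]

merged, conflicts = merge_lines(base, mine, theirs)
print("\n".join(merged))
print(conflicts)  # one line was changed by both people
```

In this example the title and the sample count merge automatically, while the threshold line, which both people changed, is wrapped in markers so that someone can choose between the two versions by hand.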

“You will likely share your code with multiple lab mates or collaborators, and they may have suggestions on how to improve it. If you email the code to multiple people, you will have to manually incorporate all the changes each of them sends.” (Blischak, Davenport, and Wilson 2016)

There are a number of other features of version control that make it useful for collaborating on file-based projects with teams. First, these systems allow you to explain your changes as you make them—in other words, it allows for annotation of the developing and editing process (E. Raymond 2009). This provides the team with a full history of why the files evolved in the way they did across the team. It also provides a way to communicate across the team members.

For example, if one person is the key person working on a certain file, but has run into a problem with one spot and asks another team member to take a go, then the second team member isn’t limited to just looking at the file and then emailing some suggestions. Instead, the second person can make sure they have the latest version of that file, make the changes they think will help, commit those changes with a message (a commit message) about why they think this change will fix the problem, and then push that latest version of the file back to the first person. If there are several places where it would help to change the file, then these can be fixed through several separate commits, each with their own message. The first person, who originally asked for help, can read through the updates in the file (most platforms for using version control will now highlight where all these changes are in the file) and read the second person’s message or messages about why each change might help. Even better, days or months later, when team members are trying to figure out why a certain change was made in that part of the file, they can go back and read these messages to get an explanation.

“You know your code has changed; do you know why? It’s easy to forget the reasons for changes, and step on them later. If you have collaborators on a project, how do you know what they have changed while you weren’t looking, and who was responsible for each change?” (E. S. Raymond 2003)

In recent years, some complementary tools have been developed that make the process of collaborating using version control software easier. These include bug trackers or issue trackers, which allow the team to keep a running “to-do” list of what needs to be done to complete the project; these are discussed in the next chapter (Perez-Riverol et al. 2016).

Finally, version control platforms like GitHub can be used for a number of supplementary tasks for your research project. These include publishing webpages or other web resources linked to the project and otherwise improving public engagement with the project, including by allowing other researchers to copy and adapt your project through a process called forking. Version control platforms also provide a supplemental backup to project files.

First, GitHub can be used to collaborate on, host, and publish websites and other online content (Perez-Riverol et al. 2016). Version control systems have been used by some for a long time to help in writing longform materials like books (e.g., E. S. Raymond 2003); new tools are making the process even easier. The GitHub Pages functionality, for example, is now being used to host a number of books created in R using the bookdown package, including the online version of this book. The blogdown package similarly can be used to create websites, either for individual researchers, for research labs, or for specific projects or collaborations. Further, if a project includes the creation of scientific software, the platform can be used to share that software, as well as associated documentation, in a format that is easy for others to work with. The platform can also be used to share supplemental material for a manuscript, including the code used for preprocessing and analyzing data. The most popular version control platforms, GitHub and GitLab, both allow users to toggle projects between “public” and “private” modes, which can be used to work privately on a project prior to peer review and publication, and then switch to a public mode after publication. This functionality will allow those who access the code to see not only the final product, but also the history of the development of the code and data for the project, providing more transparency in the development process, but without jeopardizing the novelty of the research results prior to publication.

“The traditional way to promote scientific software is by publishing an associated paper in the peer-reviewed scientific literature, though, as pointed out by Buckheit and Donoho, this is just advertising. Additional steps can boost the visibility of an organization. For example, GitHub Pages are simple websites freely hosted by GitHub. Users can create and host blog websites, help pages, manuals, tutorials, and websites related to specific projects.” (Perez-Riverol et al. 2016)

With GitHub, while only collaborators on a public project can directly change the code, anyone else can suggest changes through a process of copying a version of the project (forking it). This allows someone to make the changes they would like to suggest directly to a copy of the code, and then ask the project’s owners to consider integrating the changes back into the main version of the project through a pull request. GitHub therefore creates a platform where people can explore, adapt, and add to other people’s coding projects, enabling a community of coders (Perez-Riverol et al. 2016), and because of this functionality it has been described as “a social network for software development” (J. Perkel 2018) and as “a kind of bazaar that offers just about any piece of code you might want—and so much of it free.” (Metz 2015). This same process can be leveraged for others to copy and adapt code—this is particularly helpful in ensuring that a software or research project won’t be “orphaned” if its main developer is unavailable (e.g., retires, dies), but instead can be picked up and continued by other interested researchers. Copyright statements and licenses within code projects help to clarify attribution and rights in these cases.

“The astonishment was that you might want to make even your tiny hacks to other people’s code public. Before GitHub, we tended to keep those on our own computer. Nowadays, it is so easy to make a fork, or even edit the code directly in your browser, that potentially anyone can find even your least polished bug fixes immediately.” (Irving 2011)

Finally, version control platforms help in providing additional back-up for project files. As you collaborate with others using version control under a distributed model, each collaborator will have their own copy of all project files on their local computer. All project files are also stored on the remote repository to which you all push and pull commits. If you are using the GitHub platform, this will be GitHub’s servers; if you use GitLab, you can set up the system on your own server. Each time you push or pull from the remote copy of the project repository, you are syncing your copy of the project files with those on other computers.

“Backup, backup, backup—this is the main action you can take to care for your computers and your data. Many PIs assume that backup systems are inherently permanent and foolproof, and it often takes a loss to remind one that materials break, systems fail, and humans make mistakes. Even if your data are backed up at work, have at least one other backup system. Keep at least one backup off site, in case of a disaster in the lab (yes, fires and floods do happen). It doesn’t make much sense to have two separate backup systems stored next to each other in a drawer.” (LEIPS 2010)

2.10.3 How to use GitHub

In the next module, we describe practical ways to leverage these resources within your research group. We include instructions both for team leaders—who may not code but may want to use GitHub within projects to help manage the projects—as well as researchers who work directly with data and code for the research team. There are also a number of excellent resources that are now available that walk users through how to set up and use a version control platform. The process is particularly straightforward when the research project files are collected in an RStudio Project format, as described in earlier modules.

2.10.4 Discussion questions