Module 10 Enhance the reproducibility of collaborative research with version control platforms
Once a researcher has learned to use Git on their own computer for local version control, they can begin using version control platforms (e.g., GitLab, GitHub) to collaborate with others under version control. In this module, we will describe how a research team can benefit from using a version control platform to work collaboratively. In module 11, we’ll give detailed examples of how you can use a version control platform even if you’re not familiar with coding.
Objectives. After this module, the trainee will be able to:
- List benefits of using a version control platform to collaborate on research projects, particularly for reproducibility
- Describe the difference between version control (e.g., Git) and a version control platform (e.g., GitHub)
- Explain how version control software and version control platforms can help coordinate contributions from different team members
- Define “merging”, “merge conflicts”, “issue trackers”
- Explain how commit messages can improve project management
- Explain how to-do lists can help project management
- Describe how a version control platform provides additional back-up for study files
10.1 What are version control platforms?
Module 9 introduced the idea of version control, including the popular version control software Git. In this module, we’ll go a step further and describe how you can extend version control to collaboration across your research team by using version control platforms. Version control platforms build on the functionality of version control software, and they can provide you and your team with tools for sharing, visualization, and project management.
Version control platforms offer a number of advantages when collaborating on a research project that can help to improve your efficiency, rigor, and reproducibility. Further, as their use has become more popular, there are more and more resources to help you learn how to use these platforms effectively. Some of the key advantages of using a version control platform like GitHub to collaborate on research projects include that the platform:
- Can track and merge changes that different collaborators made to the document
- Allows you to create alternative versions of project files (branches), and merge them into the main project as desired
- Includes tools for project management, including Issue Trackers
- Provides additional backup of project files
- Allows you to share project information online, including through hosting websites related to the project or supplemental files related to a manuscript
A number of version control platforms are available. Two that are currently very popular for scientific research are GitHub (https://github.com/) and GitLab (https://about.gitlab.com/). Both provide free options for scientific researchers, including the capabilities for using both public and private repositories in collaboration with other researchers.
Version control platforms are always used in conjunction with version control software, like the Git software described in module 9. A version control platform adds attractive visual interfaces for working with the project, free or low-cost online hosting of project files, and team management tools for each project. In this sense, you can think of Git as the engine and the version control platform as the driver’s seat, with dashboard, steering wheel, and gears to leverage the power of the underlying Git software. One scientist, in an article about Git and GitHub for scientists, highlighted that resources like GitHub are “essential for collaborative software projects because they enable the organization and sharing of programming tasks between different remote contributors.”232
A version control platform therefore combines the strengths of a “Track changes” feature with those of a file sharing platform like Dropbox. To some extent, Google Docs or Google Drive also combine these features, and some spreadsheet programs are moving toward rudimentary functionality for version control.233 However, there are added advantages of version control platforms. For example, there are version control platforms that are open-source. GitLab is one example. Since these can be set up on a server that you own, they can be used to collaborate on projects with sensitive data, and also can operate directly on the server you’re using to store large project datasets or to run computationally-intensive pre-processing or analysis. Also, most version control platforms include tools that help you manage and track the project. These include “Issue Trackers”, tools for exploring the history of each file and each change, and features to assign project tasks to specific team members. The next section will describe the features of version control platforms that make them helpful as a tool for collaborating on scientific research. These systems are being leveraged by some scientists both to manage research projects and to collaborate on writing scientific manuscripts and grant proposals.234
10.2 Why use version control platforms?
Let’s look in detail at some of the advantages of using a version control platform. The first is that these platforms can provide an easy-to-use interface to the power of Git. Years ago, the use of version control required users to be familiar with the command line, and to send arcane commands to track the project files through that interface. Version control platforms, however, typically allow team members to explore and work with the functions of Git more easily than they could with the barebones version control software. With the rising popularity of version control platforms, version control for project management can be taught relatively quickly to students with a few months, or even weeks, of coding experience. In fact, version control is beginning to be used as a method of turning in and grading homework in beginning programming classes, with students learning these techniques in the first few weeks of class.235 This would be practically unimaginable without the user-friendly interface of a version control platform as a wrapper for the power of the version control software itself.
A second advantage is that a version control platform helps in tracking and managing contributions from team members. As the proverb about too many cooks in the kitchen captures, any time multiple people work on a project, there is a chance for conflicts: cases where contributions from different people disagree. Higher-level conflicts, like disagreements about what the final product should look like or who should do which jobs, can’t be easily managed by a computer program. The complications of integrating everyone’s contributions can be, however: version control lets people work in their own space and then bring their individual work together into one final project. While programs for version control were originally created to help programmers develop code, they can now be used to coordinate group work on numerous types of file-based projects, including scientific manuscripts, books, and websites.236 Although they can work with projects that include files saved in binary formats (Word documents, for example), they thrive in projects with a heavier concentration of text-based files. They therefore fit nicely into a scientific research and data analysis workflow that is based on data stored in plain text formats and analysis scripts written in plain text files, tools we discuss in other modules.
There is one key feature of modern version control that’s critical to making this work—resolving changes in files that started the same but were edited in different ways by different people and now need to be put back together. This step is called merging. While this is a feature driven by the Git software itself, you typically won’t use it until you’re collaborating on a project through a version control platform like GitHub.
You can think of this as merging the changes that two people have made as they edited a single file, a file where they both started out with identical copies. Without version control, this process can be time-consuming and frustrating. As one scientist notes:
“You will likely share your code with multiple lab mates or collaborators, and they may have suggestions on how to improve it. If you email the code to multiple people, you will have to manually incorporate all the changes each of them sends.”237
The version control software can handle this for you. Think of the file as broken up into its separate lines. There will be some lines that neither person changed. Those are easy to handle in the “merge”: they stay the same as in the original copy of the file. Next, there will be some lines that one person changed but the other person didn’t. These turn out to be pretty easy to handle, too. If only one person changed the line, then you use their version. It’s the most up-to-date, since both people started out with the same version and the other person made no changes to that part of the file. Finally, there may be a few lines that both people changed, each in a different way. These are called merge conflicts. They’re places in the file where there’s no clear, easy-to-automate way for the computer to know which version to put into the integrated, latest version of the file. Different version control programs handle these merge conflicts in different ways.
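The three cases described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Git itself is implemented: it assumes that all three versions of the file have the same number of lines, so that edits only change lines in place, and the function name `merge_lines` and the example file contents are made up for this sketch.

```python
def merge_lines(base, mine, theirs):
    """Merge two edited copies of a file against the original they both started from.

    Simplifying assumption for this sketch: all three versions have the same
    number of lines (real merge tools also handle inserted and deleted lines).
    Returns the merged lines and the line numbers that are merge conflicts.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")

    merged, conflicts = [], []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:
            # Neither person changed the line, or both made the same change
            merged.append(m)
        elif m == b:
            # Only the other person changed the line: use their version
            merged.append(t)
        elif t == b:
            # Only I changed the line: use my version
            merged.append(m)
        else:
            # Both changed the line in different ways: a person must decide
            merged.append(None)
            conflicts.append(number)
    return merged, conflicts


# Hypothetical example: two collaborators edit the same three-line file
base = ["Title: Pilot study", "n = 10 mice", "Dose: 5 mg"]
mine = ["Title: Pilot study", "n = 12 mice", "Dose: 5 mg/kg"]
theirs = ["Title: Pilot study", "n = 10 mice", "Dose: 6 mg"]

merged, conflicts = merge_lines(base, mine, theirs)
print(merged)
print("Lines needing a manual decision:", conflicts)
```

In this example, the first line is unchanged, the second line is taken from the only person who edited it, and the third line is reported as a conflict because both people changed it differently.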
For the most common version control program used today, Git, these spots in the file are flagged with a special set of symbols when you try to integrate the two updated versions of the file. Along with the special symbols to denote a conflict, there will also be both versions of the conflicting lines of the file. Whoever is integrating the files must go in and pick the version of those lines to use in the integrated version of the file, or write in some compromise version of those lines that brings in elements from both people’s changes, and then delete all the symbols that marked the conflict and save this latest version of the file.
Another advantage is that, when you collaborate using a version control platform, the commit messages provide a way to communicate across team members. For example, suppose one person is the key person working on a certain file, but has run into a problem in one spot and asks another team member to take a look. The second team member isn’t limited to just looking at the file and then emailing some suggestions. Instead, they can make sure they have the latest version of that file, make the changes they think will help, commit those changes with a message (a commit message) about why they think this change will fix the problem, and then push that latest version of the file back to the first person. If there are several places where it would help to change the file, then these can be fixed through several separate commits, each with its own message. The first person, who originally asked for help, can read through the updates in the file (most platforms for using version control will now highlight where all these changes are in the file) and read the second person’s message or messages about why each change might help. Even better, days or months later, when team members are trying to figure out why a certain change was made in that part of the file, they can go back and read these messages to get an explanation.
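To make the conflict symbols concrete, the sketch below builds a small file that contains the markers Git writes (`<<<<<<<`, `=======`, and `>>>>>>>`) and resolves it mechanically by keeping one side. The function name `resolve_conflicts`, the branch name `colleague-edits`, and the file contents are assumptions made up for this illustration. In practice a person usually reads both versions and chooses, or writes a compromise, rather than always keeping one side.

```python
def resolve_conflicts(text, keep="ours"):
    """Remove Git-style conflict markers, keeping one side of each conflict.

    keep="ours" keeps the lines between <<<<<<< and =======;
    keep="theirs" keeps the lines between ======= and >>>>>>>.
    """
    resolved = []
    side = None  # None outside a conflict, otherwise "ours" or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            side = "ours"
        elif line.startswith("=======") and side == "ours":
            side = "theirs"
        elif line.startswith(">>>>>>>") and side == "theirs":
            side = None
        elif side is None or side == keep:
            # Keep lines outside any conflict, plus the chosen side inside one
            resolved.append(line)
    return "\n".join(resolved)


# Hypothetical file contents after a merge that hit one conflict
conflicted = "\n".join([
    "n = 12 mice",
    "<<<<<<< HEAD",
    "Dose: 5 mg/kg",
    "=======",
    "Dose: 6 mg",
    ">>>>>>> colleague-edits",
])

print(resolve_conflicts(conflicted, keep="ours"))
print(resolve_conflicts(conflicted, keep="theirs"))
```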
In addition, platforms for using Git often include nice tools for visualizing differences between two files, providing a more visual way to look at the differences between files across time points in the project. For example, GitHub automatically shows changes using colors to highlight additions and deletions of plain text for one file compared to another version of it when you look through a repository’s commit history. Similarly, RStudio provides a “Commit” window that can be used to compare differences between the original and revised version of plain text files at a particular stage in the commit history. In module 11, we’ll walk you through examples of navigating these features.
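The comparison behind these views is a line-by-line “diff” of two versions of a plain text file. As a rough illustration, the sketch below uses Python’s standard `difflib` module to produce the same kind of output, with removed lines prefixed by `-` and added lines prefixed by `+`. The file name and its contents are hypothetical, and platforms like GitHub compute and color their diffs with their own tools rather than with this module.

```python
import difflib

# Two hypothetical versions of a plain text file from a project's history
old = [
    "Mice were weighed weekly.",
    "Doses were given orally.",
    "Data were analyzed in R.",
]
new = [
    "Mice were weighed weekly.",
    "Doses were given by injection.",
    "Data were analyzed in R.",
    "Code is available on GitHub.",
]

# unified_diff yields header lines, then context lines (prefixed with a space),
# removed lines (prefixed with "-"), and added lines (prefixed with "+")
diff = list(
    difflib.unified_diff(
        old,
        new,
        fromfile="methods.txt (previous commit)",
        tofile="methods.txt (latest commit)",
        lineterm="",
    )
)
print("\n".join(diff))
```

A platform’s visual interface presents the same information, typically with removed lines shaded in one color and added lines in another.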
Another advantage of a version control platform is that they often include extra tools for project management. These include issue trackers, which allow the team to keep a running “to-do” list of what needs to be done to complete the project.238 Sometimes the best tools also happen to be those that are cheap and easy. In this case, the tool might be so obvious that you don’t even think of formalizing it as a tool. The “to-do” list is an excellent example.
A to-do list allows you to take a big task and break it into specific steps that need to be done to complete that task. It helps with something very key to solving big problems: being able to zoom between the big picture—big but vague descriptions of major steps to solve the problem—and the fine details of how you will tackle each of those steps. As Adam Savage of the TV show Mythbusters notes:
“The value of a list is that it frees you up to think more creatively, by defining a project’s scope and scale for you on the page, so your brain doesn’t have to hold on to so much information. The beauty of the checkbox is that it does the same thing with regard to progress, allowing you to monitor the status of your project, without having to mentally keep track of everything.”239
The Issues section of a GitHub repository works as this type of to-do list. By looking at the home Issues page, you see an overview of the tasks you need to complete to finish the project. For each of these tasks, you can zoom in on the details by clicking on its Issue. This will take you to a page where your team can discuss the details of the task, homing in on how you will solve it.
It’s tempting to use emails to discuss progress on a task and talk about how to solve it. Don’t. Use an Issue instead. This will keep the discussion in one place, so you won’t have to go back through emails to find your old discussion on how you solved it. Also, the Issues section of GitHub doesn’t delete an Issue once you’ve completed that task. Instead, it allows you to “close” the Issue. This moves the Issue into a section with all your other closed issues. It takes the Issue out of your to-do list, but saves the full discussion somewhere that you’ll be able to find easily in the future if you ever want to revisit how you solved that problem.
Another advantage of version control platforms is that, if a project uses a version control platform, it is very easy to share data recorded for the project publicly. On GitHub, you can set the access to a project to be either public or private, a setting that can be converted easily from one form to the other over the course of the project.240 A private project can be viewed only by fellow team members, while a public project can be viewed by anyone. This can be used to share the project data online once an associated manuscript is published, an increasingly common request or requirement from journals and funding agencies.241 Sharing data allows a more complete assessment of the research by reviewers and readers and makes it easier for other researchers to build off the published results in their own work, extending and adapting the code to explore their own datasets or ask their own research questions.242
Further, because Git tracks the full history of changes to these documents, it includes functionality that lets you tag the code and data at a specific point (for example, the date when a paper was submitted) so that viewers can look at that specific version of the repository files, even while the project team continues to move forward in improving files in the directory. At the more advanced end of functionality, there are even ways to assign a persistent digital identifier (e.g., a DOI, like those assigned to published articles) to a specific version of a GitHub repository.243
Version control platforms also help by providing a way to back up study data.244 Together, Git and GitHub provide a structure where the project directory (repository) is copied on multiple computers. Under a distributed model, each collaborator will have their own copy of all project files on their local computer. All project files are also stored on the remote repository to which you all push and pull commits. If you are using the GitHub platform, this will be GitHub’s servers; if you use GitLab, you can set up the system on your own server. Each time you push or pull from the remote copy of the project repository, you are syncing your copy of the project files with those on other computers.
This set-up makes it easy to bring all the project files onto a new computer: all you have to do is clone the project repository. It also ensures that there are copies of the full project directory, including all its files, in multiple places.245 Further, not only is the data backed up across multiple computers, but so is the full history of all changes made to that data and the recorded messages explaining those changes, through the repository’s commit messages.246
Leips highlights the importance of backup for research data and code:
“Backup, backup, backup: this is the main action you can take to care for your computers and your data. Many PIs assume that backup systems are inherently permanent and foolproof, and it often takes a loss to remind one that materials break, systems fail, and humans make mistakes. Even if your data are backed up at work, have at least one other backup system. Keep at least one backup off site, in case of a disaster in the lab (yes, fires and floods do happen). It doesn’t make much sense to have two separate backup systems stored next to each other in a drawer.”247
Finally, version control platforms like GitHub can be used for a number of supplementary tasks for your research project. These include publishing webpages or other web resources linked to the project and otherwise improving public engagement with the project, including by allowing other researchers to copy and adapt your project through a process called forking.
First, GitHub can be used to collaborate on, host, and publish websites and other online content.248 Version control systems have been used by some for a long time to help in writing longform materials like books (e.g., Raymond249); new tools are making the process even easier. The GitHub Pages functionality, for example, is now being used to host a number of books created in R using the bookdown package, including the online version of this book.250 The blogdown package similarly can be used to create websites, either for individual researchers, for research labs, or for specific projects or collaborations.251 Further, if a project includes the creation of scientific software, a version control platform can be used to share that software, as well as associated documentation, in a format that is easy for others to work with.
The platform can also be used to share supplemental material for a manuscript, including the code used for pre-processing and analyzing data. Perez highlights this functionality:
“The traditional way to promote scientific software is by publishing an associated paper in the peer-reviewed scientific literature, though, as pointed out by Buckheit and Donoho, this is just advertising. Additional steps can boost the visibility of an organization. For example, GitHub Pages are simple websites freely hosted by GitHub. Users can create and host blog websites, help pages, manuals, tutorials, and websites related to specific projects.”252
With GitHub, while only collaborators on a public project can directly change the code, anyone else can suggest changes through a process of copying a version of the project (forking it). This allows someone to make the changes they would like to suggest directly to a copy of the code, and then ask the project’s owners to consider integrating the changes back into the main version of the project through a pull request. GitHub therefore creates a platform where people can explore, adapt, and add to other people’s coding projects, enabling a community of coders.253 Because of this functionality, GitHub has been described as “a social network for software development”254 and as “a kind of bazaar that offers just about any piece of code you might want—and so much of it free.”255 This same process can be leveraged for others to copy and adapt code—this is particularly helpful in ensuring that a software or research project won’t be “orphaned” if its main developer is unavailable (e.g., retires, dies), but instead can be picked up and continued by other interested researchers. Copyright statements and licenses within code projects help to clarify attribution and rights in these cases.
In module 11, we describe practical ways to leverage these resources within your research group. We include instructions both for team leaders, who may not code but may want to use GitHub to help manage projects, and for researchers who work directly with data and code for the research team. There are also a number of excellent resources now available that walk users through how to set up and use a version control platform. The process is particularly straightforward when the research project files are collected in an RStudio Project format, as described in earlier modules.