Module 15 Tips for improving reproducibility when writing R scripts

Some biomedical researchers have already worked quite a bit with a programming language like R, either in a role that is primarily computational, or as a way to understand data they’ve collected from the wet lab. While module 14 focused on scientists who are new to coding, to help give them an entry point into how to write and run a code script in R, this module focuses on a different audience—scientists who are familiar with coding but would like to take steps to improve their practice.

We have worked with a number of scientists in this situation. This module provides a series of tips for how they can improve their coding practice to make it more rigorous and reproducible. These are tips based on our own experiences of the things that—in real and regular practice—get in the way of code being rigorous and reproducible.

We’ll provide advice in three main areas:

Write code for computers, but edit it for humans
Modify rather than start from scratch
Do not repeat yourself

This module is meant for researchers who are using R already as part of their research. It is meant as a complement and alternative to module 14, which focused on readers who are new to creating code scripts.

Objectives. After this module, the trainee will be able to:

Improve the reproducibility of their scripts by leveraging tips for researchers who are already coders
Implement advice on editing code to make it clear for humans
Practice editing code to include better names for data objects and columns
Practice breaking up monolithic code into code that is structured and organized
List examples of dead-end code
Practice editing scripts to remove dead-end code
Define “package vignette”
List steps to adapt example code to make it more reproducible and easier to maintain in their own pipeline
Explain why you can improve rigor and reproducibility in your code if you avoid repetition

15.1 Write code for computers, but edit it for humans

A key requirement for a project to be computationally reproducible is that the code used for pre-processing and analysis is available. However, even when code for a project is available, it can be hard to understand and reproduce the analysis. One common culprit is that the code is unclear. One way to improve the reproducibility of your code, therefore, is to make sure you edit it so that it’s clear for humans, not just computers.

During World War I and World War II, the British and US navies used a special type of camouflage on some of their ships called “dazzle camouflage”. This type of camouflage uses large geometric shapes, often in black and white, and it makes the ships look a bit like zebras. Unlike other types of camouflage, this type doesn’t conceal the ship; it’s still very clear that it’s there. However, to hit a ship at sea, gunners needed to know not only where it was, but also where it was going. This is because the ship is moving: between the time a shell is fired and the time it lands, the ship will have changed location. Gunners needed to aim at where the ship would be by the time the shell reached it. Dazzle camouflage makes it much harder to determine where the ship is headed.

Often, people will write code for research projects that looks like it’s using dazzle camouflage. It’s easy to see that there’s something there when you look at the code script, but it’s very hard to figure out what it’s doing or where it’s trying to go. In other words, it’s hard for a human to quickly digest. This type of code will be hard for others to figure out, and also it will be hard for you to figure out when you come back to the code in the future.

The best way to avoid this type of code is to get into the practice of editing your code. When you first write code, you don’t want to write it slowly and carefully—rather, you’ll usually be best at figuring out how to get something to work if you get in the flow and get down some code without worrying about how legible it is to humans.

This is fine, but get in the habit of thinking of this as just the first step: in your initial coding, you’re getting the code to work for the computer, but later you will need to go back and clean up the code so it’s clear for humans, too. Editing the code will make it easier to understand (both by others and by you) and will also make the code easier to maintain and extend in the future.

This idea is similar to writing. Many writing experts recommend that you break your writing process into several stages. First, you write in a drafting process, where you get your ideas on paper without editing yourself much. This is a stage of getting the ideas out. In a separate stage, you edit, and at this stage you have your audience clearly in mind, editing to make the writing clear for them. By separating into these stages, you can use your mind in a more creative, less constrained way as you create ideas, and then in a more critical way as you refine those ideas for your audience.

While this practice is familiar to many writers, it’s less well known to scientists who are also coders. If you don’t already, try incorporating editing stages as you develop your code. It is most helpful to take time to edit code if you’re still within a day or two of writing it, so it’s helpful to work in this editing stage fairly frequently. Since editing often requires less energy and brain power than the initial stage (getting the code to work with the computer), it can be helpful to schedule it for times in your day when your energy is otherwise low. For example, taking ten or fifteen minutes to edit existing code can be a good way to start coding for the day, before you get to the heavier lifting of writing new code.

As you edit your code, there are a few specific things that you can do to make it clearer for humans to read. Some editing steps that we will cover in this module are:

Improve the names you’re using within the code
Break up monolithic code
Add useful comments
Remove dead-end code

Let’s take a closer look at how you can do each of these steps.

15.1.1 Improve names within the code

When you’re initially coding, you might often use “placeholder” types of names for data objects. For example, a coder might tend to name objects “df” (for “dataframe”) or “ex” (for “example”) as they’re first getting the code to work.

There’s no problem in using these types of generic names as you initially develop your code. In fact, there’s a rich history of these placeholder object names. They even have a fancy name: metasyntactic variables. Different coding languages have developed different popular ones, as have coders in different countries. For example, many C programmers will name things “foo” and “bar” as they initially work on their code, while Italian programmers often use the Italian names of Disney characters (“pippo”, “pluto”, “paperino”).

The problem isn’t in using these placeholder names; the problem is when you don’t later edit your code to use better names. These generic names will tell you nothing about what’s stored in each object when you go back and read the code later. With better names for each object, you can read through the code and in some ways it will document itself, without even needing to read the code comments to figure out what’s going on.

The tidyverse style guide, available at https://style.tidyverse.org/syntax.html, includes guidance on how to select good names for objects in R within its section on “Syntax”. One good general principle is that the name of an object should give you an idea of what the object contains. For example, if you have a dataframe that has the weights of mice from your experiment, it’s much better to name it “mouse_weights” than something generic like “foo”. Some of the other guidance will make your life easier as a coder, such as using only lowercase letters.
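As a minimal sketch of this kind of edit, the two snippets below do exactly the same thing; only the names change. The data values and all object and column names here are made up for illustration.

```r
# Draft version: placeholder names say nothing about the contents
df <- data.frame(
  g = c("control", "control", "treated", "treated"),
  w = c(21.3, 22.1, 19.8, 20.4)
)
x <- tapply(df$w, df$g, mean)

# Edited version: same steps, but each name describes what it holds
mouse_weights <- data.frame(
  treatment_group = c("control", "control", "treated", "treated"),
  weight_g = c(21.3, 22.1, 19.8, 20.4)
)
mean_weight_by_group <- tapply(
  mouse_weights$weight_g,
  mouse_weights$treatment_group,
  mean
)
```

When you come back to the second version later, you can tell what `mean_weight_by_group` contains without rerunning anything.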

Similar principles apply to column names: ideally, you want their names to describe what they contain. There are also some rules that will make it easier to work with the column names. For example, column names can include spaces, but if they do, it makes using them within R harder. Each time you refer to that column name, you have to surround the name in backticks so R will process its full name as a single name, rather than thinking its name ends at the first space. This becomes a pain as you write a lot of code that refers to that column. It’s also helpful to keep column names fairly short, so you can see the full name as you work with the dataset and resulting output.
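To see why spaces in column names become a pain, here is a small sketch using base R only. The data frame and its column name are hypothetical.

```r
# Hypothetical data frame whose column name contains spaces.
# check.names = FALSE stops data.frame() from altering the name.
plate_readings <- data.frame(
  `Optical Density (600 nm)` = c(0.12, 0.45, 0.88),
  check.names = FALSE
)

# A name with spaces must be wrapped in backticks every time it is used
mean(plate_readings$`Optical Density (600 nm)`)

# A short name without spaces can be typed directly
names(plate_readings) <- "od_600"
mean(plate_readings$od_600)
```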

When it comes to column names, some of your editing might be to improve names that you quickly wrote yourself as you coded. However, a common reason for ungainly column names is that you’ve read in data from a file format like Excel, where it was easy for the person who entered the data to include spaces and special characters in the column names. In this case, there are some tools in R that can help you quickly improve the column names. In particular, the janitor package has a function called clean_names that will do a lot of the work for you, including converting the name to lowercase, removing special characters (like “*” and “&”), and replacing spaces with underscores. If you need to make more targeted changes to column names, you can do so using the rename function from the dplyr package.
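The sketch below shows how these two functions might be used together. It assumes the janitor and dplyr packages are installed, and the data frame is made up to stand in for data read from a spreadsheet.

```r
library(janitor)
library(dplyr)

# Made-up data standing in for a spreadsheet with awkward column names
mouse_data <- data.frame(
  `Sample ID` = c("a1", "a2"),
  `Body Weight (g)` = c(21.3, 22.1),
  check.names = FALSE
)

# clean_names() lowercases the names, drops special characters,
# and replaces spaces with underscores
mouse_data_clean <- clean_names(mouse_data)
names(mouse_data_clean)

# rename() makes a targeted change, written as new_name = old_name
mouse_data_clean <- rename(mouse_data_clean, weight_g = body_weight_g)
names(mouse_data_clean)
```

Running `clean_names` first and then `rename` keeps the targeted changes short, since you only need to list the few columns whose automatically cleaned names you still want to adjust.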

15.1.2 Break up monolithic code

Next, you can edit to break up “monolithic” code—that is, code that isn’t clearly divided to show sections or steps in the process. When you’re first creating your code, you won’t want to take the time to nicely organize it into logical sections. However, once you are ready to edit your code, you will find that breaking it into clear sections and labeling them will help you and others navigate the code at a higher level (understanding the big picture of how it works by looking at the major steps it takes), only diving into the details of each section once the big picture is clear.

Again, this process mimics a process used by many writers. It’s common to create drafts and notes that lack clear organization, but instead are just collecting the raw material that will be shaped into a final article or book. However, this raw material then needs to be organized and edited to make it into something that others can navigate and make sense of.

In a similar way, once you’ve gotten your code to work, you should make sure you have a clear picture of the whole process and how it tackles the problem at a “big picture” level. One of the big steps might be something like reading in and cleaning the data, while another step might be identifying and addressing outliers in the data. Once you’ve identified the big steps, try as much as possible to group the code into these steps. Then use code comments and blank lines to separate the sections and label each one to describe what it’s doing.

As you do, you might find that you move some of your code around in the script. This is fine as long as it doesn’t affect the computer being able to process the script. For example, one of your big steps might be loading in packages you’ll need. Rather than having a lot of library calls sprinkled throughout your code, you can group these all together at the start of your script in a section called something like “Loading packages”. This will clean up other parts of your script, and by having all your library calls at the start of the code script, another person could immediately see which packages they’ll need to have installed to run your code.
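A script organized this way might look something like the sketch below. The section names, data values, and the outlier rule are placeholders for illustration, and the example assumes the dplyr package is installed.

```r
## Loading packages ---------------------------------------------------
library(dplyr)

## Read in and clean the data -----------------------------------------
# Made-up values; a real script would read these from a file
mouse_weights <- data.frame(
  mouse_id = c("m1", "m2", "m3", "m4"),
  weight_g = c(21.3, 22.1, NA, 20.4)
)
mouse_weights <- filter(mouse_weights, !is.na(weight_g))

## Identify and address outliers --------------------------------------
# Placeholder rule for illustration only: flag weights over 30 g
mouse_weights <- mutate(mouse_weights, is_outlier = weight_g > 30)

## Summarize -----------------------------------------------------------
weight_summary <- summarize(
  mouse_weights,
  mean_weight_g = mean(weight_g)
)
```

In RStudio, comment lines that end in four or more dashes are treated as section headers, which lets you fold sections and jump between them.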

Another way that you can break up monolithic code is to split it into more lines. R will process the code whether it’s all on one line or split into separate lines: R just keeps reading until it gets to the end of the function call either way. This means that you can use the “Return” key to break up your code lines so you’re always able to see the full line of code without scrolling.

One common standard is to keep all of your lines of code to 80 characters or fewer. RStudio has the functionality to reformat your code to meet this standard. In the RStudio menus, if you go to the “Code” menu, you can select to “Reformat Code”. This can help clean up long lines of code in your editing process.
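As a small illustration, the two calls below create the same data frame. The first is written on one long line and the second is split so that every line stays under 80 characters. The data are made up.

```r
# One long line: you have to scroll sideways to read all of it
weights_one_line <- data.frame(mouse_id = c("m1", "m2", "m3"), treatment_group = c("control", "treated", "treated"), weight_g = c(21.3, 19.8, 20.4))

# The same call split across lines, one argument per line
weights_split <- data.frame(
  mouse_id = c("m1", "m2", "m3"),
  treatment_group = c("control", "treated", "treated"),
  weight_g = c(21.3, 19.8, 20.4)
)

# R treats the two versions identically
identical(weights_one_line, weights_split)
```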

15.1.3 Add useful comments

As you are breaking up monolithic code, it is helpful to add code comments about why you are doing certain things. In R, you can add a code comment after a #; R won’t read anything that comes after that symbol on a line. You can use this to add small messages for humans that describe your code.

As you add these comments, keep in mind that it’s often more useful to describe why you’re doing something rather than what you’re doing. With a lot of R code—especially in the tidyverse approach—the functions have names that clearly describe what they do. For example, the function to rename a column is called rename, while the function to select certain columns is called select. Therefore, your code should do a fairly good job of self-documenting in terms of describing what it’s doing.

Instead, you can use code comments to remind yourself or others of why you’re implementing certain steps. For example, rather than having a code comment that says “Rename columns”, you could say, “The columns that come from the Excel file generated by the cytometer include a lot of special characters, which we need to remove to make it easier to work with the data in R.” By explaining why you’re doing something, you’ll also help yourself when it’s time to maintain or extend your code. You’ll be able to tell, for example, whether changing or deleting a certain line of code will cause a big problem in other areas of code.
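The sketch below contrasts the two kinds of comment on the same line of code. The data frame and column names are hypothetical stand-ins for a cytometer export.

```r
# Made-up data standing in for the cytometer's Excel export
cytometer_data <- data.frame(
  `CD4+ (%)` = c(41.2, 38.7),
  `CD8+ (%)` = c(22.5, 25.1),
  check.names = FALSE
)

# Less useful comment (restates what the code already says):
# Rename columns

# More useful comment (explains why the step is needed):
# The cytometer export puts "+", "%", and parentheses in the column
# names, which would force us to wrap every name in backticks later.
names(cytometer_data) <- c("cd4_pct", "cd8_pct")
```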

15.1.4 Remove dead-end code

Another useful step when you edit your code is to edit out pieces we’ll call “dead-end code”. These are pieces of code that aren’t contributing to the process of your script.

There are two main types of dead-end code that we often see. First, there’s code that you use during your interactive coding process to check on things. For example, you might use the View function to take a look at a data frame at a certain step in your process, or use functions like summary and str to explore what’s in different objects.

It’s great to do this kind of exploration as you code; in fact, one of the advantages of interactive software like R is that you can explore as you develop your scripts. However, these are tools that help you develop a script, but not ones that are necessary for the final script to run. Instead, they just gunk up the code that’s doing the real work.

There are two things you can do regarding this type of dead-end code. The first is that you can get in the habit of running it in your console, rather than having it in your script, even when you’re developing the code. However, this does require switching between the console and the script as you write code, which can interrupt your flow. An alternative is to run these in the script as you write the code, but then delete any of these exploratory calls as you edit your script.
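Here is a rough before-and-after sketch of that edit. The draft version is shown in comments so that the block runs without opening a data viewer, and the data are made up.

```r
# Draft version, with exploratory checks mixed in:
#
#   mouse_weights <- data.frame(weight_g = c(21.3, 22.1, 19.8))
#   View(mouse_weights)      # exploratory: look at the data
#   str(mouse_weights)       # exploratory: check column types
#   summary(mouse_weights)   # exploratory: check the range of values
#   mean_weight_g <- mean(mouse_weights$weight_g)

# Edited version: only the lines that the result depends on
mouse_weights <- data.frame(weight_g = c(21.3, 22.1, 19.8))
mean_weight_g <- mean(mouse_weights$weight_g)
```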

There’s also a second type of dead-end code. This is code that you wrote to try to solve a particular problem, but that ultimately didn’t work (or that you replaced with a better approach). Often, you may have worked a long time on that piece of code, or it might contain some really clever approach that you’re proud of. However, leaving it in your script, if it isn’t contributing to the ultimate process you ended up with, will only get in the way of understanding your primary code. It will lead a reader down a rabbit hole, rather than allowing them to move step by step through your logic.

Writing similarly has passages that aren’t contributing to the forward movement of a piece but that authors are reluctant to remove because they love them for one reason or another. This has resulted in the famous advice to authors (originally from Arthur Quiller-Couch, and repeated by Stephen King, among others) to “murder your darlings”. In other words, be brave enough to edit out anything that isn’t contributing to the necessary progress of your piece. Coders should take this advice in a similar way when it comes to pieces of code in their scripts that don’t ultimately contribute to the pipeline they’ve developed.

15.2 Modify rather than start from scratch

How do you get started on solving a problem? Science and engineering have long traditions of starting by modifying something that exists, rather than starting from scratch. For example, once it was clear how important penicillin would be for human health, it was time for the hard work of producing it at scale. Guru Madhavan of the National Academy of Engineering tells this story in his book Applied Minds: How Engineers Think, focusing on the critical role of adapting existing technology to get a foothold on the problem:

“Extracting penicillin from the mold was no child’s play… Instead of designing and building a reactor for the chemical reactions from scratch—which meant more time, money, and uncertainty—[Margaret] Hutchinson opted for something that was already functional. Some researchers had found that mold from cantaloupe could be an effective source for penicillin, so she started there. Her team then revised a fermentation process that Pfizer was using to produce food additives like citric acid and gluconic acid from sugars, with the help of microbes. Hutchinson swiftly helped convert a run-down Brooklyn ice factory into a production facility. The deep-tank fermentation process produced great quantities of mold by mixing sugar, salt, milk, minerals, and fodder through a chemical separation process that Hutchinson knew very well from the refinery business.”³⁰⁷

Similarly, as you code, keep in mind that you shouldn’t reinvent the wheel. Instead, it’s often useful to start from an existing script, pipeline, or piece of code.

To find some starting scripts to learn from, there are a few tactics you can try. First, check around with colleagues to see if they have R code for data pre-processing tasks that they do in their lab. If they work with similar types of data, and use R, they’re likely to have come up with some scripts that achieve tasks you also need to do.

Another excellent source of example R code is the set of vignettes and examples that come with many R packages. If you are using functions from an R package, there is likely a vignette that comes with that package, and there may also be examples within the helpfiles for each of the package’s functions. A package vignette is a tutorial that walks you through the major functionality of the package, showing how to use the key functions in the package in an extended example. Some packages will have multiple vignettes, showing a range of things that you can do with the package.

To find out if there is a vignette for a package that you’re using, you can google the package name and “vignette”. You can also find out from the console in R using the function vignette. For example, to find out if the package readxl, which helps read in data from Excel files, has any vignettes, you can run vignette(package = "readxl"). This will tell you that the package has two, one called “cell-and-column-types” and one called “sheet-geometry”. To open one of these, you can again use the vignette function. For example, vignette("cell-and-column-types", package = "readxl") would open the first of the two vignettes within your R session.
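If you want to work with the list of vignettes rather than just read it, the sketch below shows one way to do that. It assumes the readxl package is installed.

```r
# List the vignettes that come with readxl
readxl_vignettes <- vignette(package = "readxl")

# The listing stores one row per vignette; the "Item" column holds
# the names you would pass back to vignette()
readxl_vignettes$results[, "Item"]

# browseVignettes() shows the same listing as a page in a web browser.
# It is commented out here because it opens a browser window.
# browseVignettes(package = "readxl")
```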

To open the helpfile for any function in R, at the console type a question mark and then the function name. For example, ?read_excel will open the helpfile for the read_excel function (you will need to make sure you’ve run library("readxl") to load the package with this function). The helpfile provides useful information for running the function, and one of the most useful parts is the “Examples” section. Scroll down to the bottom of the helpfile to find this section. It includes several examples that you can copy into your R script or console and try yourself, to figure out the types of inputs the function needs and how different options for the function modify it.
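You can also get at a helpfile’s “Examples” section directly from the console with the `example` function. The sketch below assumes the readxl package is installed, and it uses the `give.lines` argument to return the example code as text rather than running it.

```r
# Pull the "Examples" section of a helpfile as text, without running it
read_excel_examples <- example(
  "read_excel",
  package = "readxl",
  give.lines = TRUE
)

# Print the first few lines so you can copy and adapt them
head(read_excel_examples)

# To run the examples instead, drop give.lines:
# example("read_excel", package = "readxl")
```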

Online resources like StackOverflow also provide advice and example code for many challenges you might come up against as you’re coding. Google can also be used to help you solve coding problems, especially if you become familiar with some of its special operators, which can help you refine your search. You can find more on Google special operators at https://support.google.com/websearch/answer/2466433?hl=en.

There’s no problem with using any of these as starting points as you develop your own pipeline. However, it’s often tempting for a coder to leave example code “as-is” if they’ve found an example solution that works. Instead, it’s critical that you make sure you fully understand why each line of code in your script works the way it does. Further, if you’re adapting example code to your own problem, you should edit it if possible to use the set of tools you’re most familiar with. In this section, we’ll go through some steps you should take to start from example code and adapt it to your own pipeline in a way that will be more rigorous and reproducible.

When you find a piece of example code that you think will help with something you need to do in your own code, you’ll first want to make sure that you can get it to work with any example data it came with, before you try it out with your own data. If it won’t run with its own data, there are a few trouble-shooting steps you can take. First, make sure you have all the required packages installed and loaded. Second, make sure that you’ve saved the example data to the right place on your computer if the example code reads in data from a file. Finally, make sure that you have the same versions of packages and of R as were used in the example. If the code still doesn’t work after you’ve resolved these issues, you may want to move on to finding other example code.
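To compare your setup with the one used in the example, a few base R calls will report the versions you are running. In the sketch below, "stats" is used only because it ships with R; you would swap in whichever package the example code depends on.

```r
# Version of R in the current session
R.version.string

# Version of one installed package
packageVersion("stats")

# R version, operating system, and each attached package with its version
sessionInfo()
```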

Once you’ve gotten the code to run on the example data, walk through it line by line to understand what it does. For each step, make sure you understand what the input looks like and what the output looks like. If code is nested (function calls are placed within function calls), be sure that you understand the code at each level of nesting. If the code uses piping to move the output of one call to the input of the next, make sure you’ve worked through each of the lines in the pipe individually.
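For nested code, one way to work through each level is to rewrite the call as separate steps and inspect every intermediate object. The sketch below uses made-up measurements and base R functions.

```r
# Made-up measurements
viral_titers <- c(1200, 45000, 380, 96000)

# Nested version: read from the innermost call outward
nested_result <- round(mean(log10(viral_titers)), 1)

# Same logic, one level at a time, so each intermediate can be inspected
log_titers <- log10(viral_titers)            # innermost call
mean_log_titer <- mean(log_titers)           # middle call
stepwise_result <- round(mean_log_titer, 1)  # outermost call

identical(nested_result, stepwise_result)
```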

There are two tools that can help as you dissect the code in this way. First, when you work in RStudio, you can highlight code in a script and then use the “Run” button to run only the highlighted code. This functionality allows you to run a nested function call without running the whole line of code, or to run only part of a series of piped calls (by highlighting everything up to the piping symbol on a line and then running it). The other tool that’s useful is a function in the dplyr package called pull. This function allows you to extract a column from a dataframe as a vector. This is helpful when you’re dissecting nested calls in piped code, as often a function will operate on a single column of the dataframe. With pull, you can extract that column and then test the function call to see what it’s doing with that column.
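The sketch below imitates both tools in a script: it runs only the first part of a pipe, then uses pull to test a function on a single column. The pipeline and data are hypothetical, and the example assumes the dplyr package is installed.

```r
library(dplyr)

# Made-up data
mouse_weights <- data.frame(
  treatment_group = c("control", "control", "treated", "treated"),
  weight_g = c(21.3, 22.1, 19.8, 20.4)
)

# Full pipeline, as it might appear in an example script
weights_scaled <- mouse_weights %>%
  filter(treatment_group == "treated") %>%
  mutate(weight_z = as.numeric(scale(weight_g)))

# Run only the first part of the pipe to see the intermediate output
treated_only <- mouse_weights %>%
  filter(treatment_group == "treated")

# pull() extracts one column as a vector, so the function used inside
# mutate() can be tested on its own
treated_weights <- treated_only %>%
  pull(weight_g)
scale(treated_weights)
```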

Once you figure out what the example code does on the data it comes with, you can adapt it to work with your own data. As you do, pay close attention to how your data are similar or different to the example data. At this stage, your goal will be to get the example code to work with your data.

Many researchers stop at this step—they’ve gotten example code to work with their own data (and hopefully worked through it to understand why). However, example code often follows a very different style than code you write yourself. For example, you may use the tidyverse approach, while the example might use code written in a base R style. Further, different coders think about and tackle problems in different ways, which can lead to the case where any example code that you’ve adapted in your script feels different from your usual code.

This can result in spots of your code that you will later be very worried to change, because while it works, you don’t understand why well enough to feel comfortable making any change. This can make your code fragile and hard to maintain. Instead, take a moment to adapt the logic that you’ve learned from the example to your own set of tools. For example, if the example code is written in base R but you prefer tidyverse tools, rewrite the code’s logic to use tidyverse tools.
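As a small sketch of such a rewrite, the code below calculates group means first in a base R style and then with dplyr, and checks that the two agree. The data are made up, and the example assumes the dplyr package is installed.

```r
library(dplyr)

# Made-up data with one missing value
mouse_weights <- data.frame(
  treatment_group = c("control", "control", "treated", "treated"),
  weight_g = c(21.3, 22.1, 19.8, NA)
)

# Base R style, as it might appear in example code
complete_rows <- mouse_weights[!is.na(mouse_weights$weight_g), ]
base_means <- aggregate(
  weight_g ~ treatment_group,
  data = complete_rows,
  FUN = mean
)

# Same logic rewritten with tidyverse tools
tidy_means <- mouse_weights %>%
  filter(!is.na(weight_g)) %>%
  group_by(treatment_group) %>%
  summarize(weight_g = mean(weight_g))

# Check that the rewrite gives the same numbers before dropping the original
all.equal(base_means$weight_g, tidy_means$weight_g)
```

Keeping the original version around just long enough to run a check like this gives you some assurance that the rewrite preserved the logic.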

You gain several advantages when you adapt code to use the tools you’re familiar with. For example, bugs will be less likely, and if there are bugs, you’ll catch them more quickly, since you’re familiar with the tools that the code is using. The code will also be much easier for you to understand and maintain in the future if it’s written using tools that you know well.

15.3 Do not repeat yourself

As you become more familiar with programming with R, you can start to evolve your style of writing scripts in more advanced ways. A key one is to learn how to limit how often you repeat the same code. As you write data pre-processing pipelines, you’ll find that you often need to do the same thing, or variations on the same thing, over and over. For example, you may need to read in and clean several files of the same type and structure. You will likely, at first at least, find yourself copying and pasting the same code to several parts of your script, with only minor changes to that code (e.g., changing the R object that you input each time).

Don’t worry too much about this as you start to learn how to write R scripts. This is a normal part of the drafting process. However, as you get better at using R, you’ll want to learn techniques that can help you avoid this repetition.

There are a few reasons that you’ll want to avoid repetition in your code when possible. First, these repeated copies of the same or similar code will make your code script much longer and harder to figure out later. Second, it is hard to keep these copies of code in sync with each other. For example, if you have several copies of the code you use to check for outliers in your data, and you decide you want to change how you are doing that, you’ll need to find every copy of that piece of code in your script and make sure you make the same change in each place. Instead, if you have less repetition in your code, then you can make the change in a single place and ensure that the change will be in place everywhere you are doing that process.

There are a few tools that are useful to develop to help avoid repetition. The first is to learn how to write your own R functions. Any R user can write a new function. You can collect them in packages that you plan to share with others, but you can also just write them for your personal use. When you create a function, it encapsulates the code for something that you need to do, and it allows you to do that thing anywhere else in your code just by calling that function, rather than copying all the lines of the original code. This is an excellent way to write the code in one place you need to use often, rather than copying and pasting the same code throughout your R script.
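As a minimal sketch, here is what a small function for an outlier check might look like. The function name, its `n_sd` argument, and the rule it applies are all hypothetical; the point is only that the logic lives in one place and can be called wherever it is needed.

```r
# Hypothetical helper: flags values more than `n_sd` standard deviations
# from the mean. The rule itself is only a placeholder for illustration.
flag_outliers <- function(values, n_sd = 3) {
  distance_from_mean <- abs(values - mean(values, na.rm = TRUE))
  distance_from_mean > n_sd * sd(values, na.rm = TRUE)
}

# Made-up data from two weeks of measurements
weights_week_1 <- c(21.3, 22.1, 19.8, 20.4, 21.0, 20.7, 21.6, 20.9, 21.2, 45.0)
weights_week_2 <- c(22.0, 22.8, 20.6, 21.1, 21.5, 21.3, 22.2, 21.4, 21.9, 22.5)

# The same check, applied in two places without copying its code
week_1_flags <- flag_outliers(weights_week_1, n_sd = 2)
week_2_flags <- flag_outliers(weights_week_2, n_sd = 2)
```

If you later decide to change how outliers are identified, you only need to edit the body of `flag_outliers`, and every call picks up the change.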

Since you need to run the code that defines the function before you use it, it often makes sense to write the code that creates these functions near the top of your code script. If you find that you’ve written a lot of functions, or that you’ve written functions that you’d like to use in more than one of your data pre-processing scripts, you can even save the code that creates the functions in a separate R script and just source that separate script at the top of each script that uses the function, using the source call. “Sourcing” a file in this way simply runs all the code in the file. Eventually, you could even think of creating your own package with those functions.
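The sketch below shows the mechanics of sourcing. So that it runs on its own, it first writes a small function file to a temporary location; in practice this would be a file you keep alongside your scripts. The function and file are hypothetical.

```r
# Write a small function file to a temporary location for this demo
functions_file <- tempfile(fileext = ".R")
writeLines(
  c(
    "grams_to_kilograms <- function(weight_g) {",
    "  weight_g / 1000",
    "}"
  ),
  functions_file
)

# At the top of each analysis script: run the file that defines the functions
source(functions_file)

# The function is now available for the rest of the script
grams_to_kilograms(c(21.3, 22.1))
```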

There is one other excellent set of tools for avoiding repetition that we want to mention. Again, it is likely more complex than what you’ll want to start off with as you learn to write R scripts, but once you are comfortable with the basics, it’s a powerful way to create code scripts that are as short and simple as possible while doing very powerful things. This is a set of tools that focus on iteration. They include for loops, which allow you to step through elements in a data structure and apply the same code to each. They also include a set of tools in the purrr package that allow you to apply the same code, through a function, to each element in a larger data structure. These are excellent tools when you are doing something like reading in a lot of similar files and combining them into a single R object for pre-processing.
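To give a flavor of these tools, the sketch below reads several similar files and combines them, first with a for loop and then with map from purrr. It creates its own small made-up CSV files so that it can run on its own, and it assumes the purrr and dplyr packages are installed.

```r
library(purrr)
library(dplyr)

# Create three small made-up CSV files so the example runs on its own
data_dir <- file.path(tempdir(), "plate_files")
dir.create(data_dir, showWarnings = FALSE)
for (plate in 1:3) {
  write.csv(
    data.frame(plate = plate, od_600 = c(0.1, 0.2) * plate),
    file.path(data_dir, paste0("plate_", plate, ".csv")),
    row.names = FALSE
  )
}
plate_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Option 1: a for loop that reads each file into one slot of a list
plates_loop <- vector("list", length(plate_files))
for (i in seq_along(plate_files)) {
  plates_loop[[i]] <- read.csv(plate_files[i])
}
all_plates_loop <- bind_rows(plates_loop)

# Option 2: map() applies read.csv to every file path
all_plates_map <- plate_files %>%
  map(read.csv) %>%
  bind_rows()
```

Both options give the same combined data frame. The code that reads a file appears only once in each, no matter how many files there are.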

We will not go into details about how to write R functions or these iteration tools in these modules, as our aim here is to get you started and give you an overview of where you might want to go next. If you do want to learn to write your own R functions, the free online book “R for Data Science” has a chapter describing the process (https://r4ds.had.co.nz/functions.html).³⁰⁸ If you’d like to learn more about tools for iteration, the same book also has a chapter on that (https://r4ds.had.co.nz/iteration.html).
