Module 16 Simplify scripted pre-processing through R’s “tidyverse” tools
As you learn to code, then, a good strategy is to start collecting “tools” for your toolbox in R—functions that you have learned to use very well and that you understand thoroughly. This will make you proficient in R more quickly, and it will also limit the chance of bugs and errors in your code, making your data work more robust and rigorous. If you ask most good programmers, you will find that a large amount of their code relies on a fairly small set of general-use tools, with more specialized tools only used here and there, where a specific algorithm is necessary. When you first start out, though, it is hard to know which tools are the most important to add early and learn well. In this section, we’ll cover some tools that we have found helpful for pre-processing biological data. These are not exhaustive, but may help you to identify some sets of tools to focus on learning well for data pre-processing and analysis of biological data.
Some key tools for pre-processing laboratory data are:
- Tools for data input
- Tools for changing columns or creating new columns
- Tools for working with character strings
- Tools for working with dates and times
- Tools for statistical modeling
We will concentrate on tools that are drawn from a collection of tools called the “tidyverse”. The “tidyverse” approach is an approach to using R that has grown enormously in popularity in recent years. Most R courses and workshops for beginning programmers are now structured around this approach. It provides a powerful yet flexible approach for working with data in R, and it one of the easier ways to start learning R. In module 3 we described the tidyverse approach in conjunction with talking about the power of the tidy data format. In this module, we’ll go deeper into specific tools under this approach that can be used for common data pre-processing tasks when working with biomedical data, as well as provide information on more resources that can be used to continue learning this approach.
The tidyverse functions do not come with base R, but rather are available
through extensions to base R, commonly referred to as “packages”. Like base
R, these are all open-source and free. Many are available through a
repository called CRAN, and you can download them directly from R using the
install.packages
function.
The heart of the tidyverse functions are available through an umbrella
package called “tidyverse”. This package includes a number of key tidyverse
packages (e.g., “dplyr”, “tidyr”, “stringr”, “forcats”, “ggplot2”) and allows you
to quickly install this set of packages on your computer. When you are coding
in R, you will then need to load the package in your R session, which you can
do using the library
call (e.g., library("tidyverse")
).
In addition to the packages that come with the umbrella “tidyverse” package,
there are numerous other packages that build on the tidyverse approach.
Some are created by the creator of the tidyverse approach (Hadley Wickham)
or others on his team, while others are created by other R programmers but
follow the standards of the tidyverse approach. An example of one of
these extensions that is specifically created for working with biomedical data
is the tidybulk
package,309 for working with
transcriptomics data.
Objectives. After this module, the trainee will be able to:
- Define “tidyverse”, “character string”, “factor”,
- Explain how the tidyverse collection of packages can be both user-friendly and powerful in solving many complex tasks with data
- Describe the difference between base R and R’s tidyverse
- List two key tidyverse packages for file input
- List a key tidyverse package for changing or adding columns
- List key tidyverse packages for working with character strings and factors
- List a key tidyverse package for working with dates
- List two key tidyverse packages related to statistical modeling
- Locate further resources for learning the tidyverse approach
16.1 Tools for data input
To be able to work with data in R, you first must load the data into your R session. Data will typically be saved in some type of file or files, and so you must instruct R about how to find that data and then read it from the file into the R session.
There are several key tidyverse tools for inputting data from a file. The
most important is a package in the tidyverse called readr
. This package
allows you to read data from plain text files. Data are often stored in
these plain text files, including in formats like CSV (“comma-separated
values”), tab-separated values, and fixed width files. These are all files
that you can open on your computer with a text editor (for example,
Notepad, Wordpad, or TextEdit).
The readr
package includes various functions to read in data from these
types of files, with different functions for different formats of those
files. For example, CSV files separate different pieces of data in the
file with commas, and these can be read into R with the readr
function
read_csv
.
Some equipment in the laboratory may allow you to save results in a plain
text format. When you export your data from laboratory equipment, you can
check to see if there is an option to outload it to a format like “CSV”
or “txt”, which would allow you to use these readr
functions to then
read the data into R.
There are other packages in the tidyverse that allow you to read in data
from other types of file formats. For example, you may have data that
you recorded into an Excel spreadsheet. Excel files are a bit more complex
in their structure than plain text files, and the functions that read
plain text files into R will not work for Excel files. Instead, there are
a series of functions in a package called readxl
that you can use to
read in data from Excel files into R. These functions even allow you to
specify which sheet of an Excel file to read data from, as well as which
cells on that sheet, so they allow for very fine control of data input
from an Excel spreadsheet.
In some cases, you may be collecting data with laboratory equipment that does not export its data to a standard format, like a plain text file or a basic spreadsheet file. Instead, some equipment will save data into a file format that has been standardized for a certain type of data (e.g., an mzML file for metabolomics data) or to a file type that is proprietary to the company that manufactures the equipment. There is a chance that someone has created an R package that can input data from these more specialized types of files. In fact, for common file types from biomedical research, that chance is high (for example, there are several packages available with functions that input data from an mzML file). One of the best ways to find an appropriate tool to input data from more specialized formats is by searching Google for “R data input” and then the name of the file format. If you use that file format often in your laboratory, it is worth some research to determine which R package is a good fit for inputting data from that file format and then working through vignettes and other helpfiles for that package to learn how to use it well.
You can learn more about the readr
and readxl
packages through their
vignettes, which provide tutorials walking through the functionality of
each package. You can find those at:
-
readr
: https://readr.tidyverse.org/ -
readxl
: https://readxl.tidyverse.org/
16.2 Tools for changing or creating columns
There are many pre-processing tasks that require creating columns that are mathematical functions of existing columns. Therefore, you’ll want to have some tools for changing existing columns or making new column.
One example is when you are scaling or normalizing data. Scaling is often required before using some of the techniques for dimension reduction (e.g., principal components analysis) or clustering, to ensure that the unit of measurement of each column does not influence its weight in later analysis. For example, if you were clustering observations using measurements for each subject that included their weight, you don’t want to get different results depending on whether their weight is measured in grams versus pounds, and this type of scaling can help avoid any of those differences based on the units used for measurements.
There are a range of ways to standardize and normalize different types of biomedical data, ranging from very simple to much more complex. At the simpler end is a method called z-score normalization, where the observations for each feature (i.e., column) are changed to have an overall mean of 0 and standard deviation of 1. This can be done by taking each value in a column and subtracting from it the column-wide mean, then dividing by the standard deviation. There are also more complex methods for scaling and normalization. All similarly require mathematical algorithms or functions to be applied to the original data to create a new column of data that is the scaled or normalized version of the original.
In R, there are functions that come with the base installation of R (in other
words, don’t require installing extra packages) that can be used for more basic
processes of standardization and normalization. For example, the scale
function can be used for the basic scaling described in the previous paragraph.
You can also directly use math functions (like -
for subtraction and /
for
division) and very basic functions (like mean
to calculate the mean of a
vector of numbers and sd
to calculate the standard deviation) to make these
types of calculations from scratch. To apply these, though, you’ll need to know
functions that work with columns in a dataframe.
The dplyr
package is a key package to learn from the tidyverse, as it forms
the heart of the tools for cleaning and exploring data that are stored in tidy
dataframes. The package includes not only functions for making changes to a
single column (e.g., the mutate
function), but also functions that can be used
to perform the same calculation across many columns (e.g., the across
function). This is an efficient way to do something like scale the data in
multiple columns at once.
These functions can also be used for basic cleaning operations in a dataframe.
For example, data that are recorded for colony-forming units may include “TNTC”
in cells of the spreadsheet where so many bacteria had grown that the individual
colonies were “too numerous to count”. When you read in the data, you may want
to change these values to missing values so that you can run numerical
calculations on the cells that include colony counts. This type of conversion
can easily be done using functions from the dplyr
package. They are also
critical for performing processes like scaling / normalization— the mutate
function, for example, can be used to create a new column of scaled data by
applying a scaling function to an existing column.
You can learn more about the dplyr
package through its vignette, which is
available at: https://dplyr.tidyverse.org/.
16.3 Tools for working with character strings
Once you have learned the basic tools for inputting data, as well as basic
manipulations on columns with the dplyr
tools, you should take some time to
learn a few other tools that can often be used to make your coding pipelines
much more efficient. One of these is to learn how to work well with character
strings.
Character strings are strings of alphanumerical symbols that are stored inside quotation marks, like “Mouse-01” or “Control group”. Several tidyverse packages help you work with this type of data more efficiently, either through finding and using regular patterns in the data (e.g., the number “01” stored in “Mouse-01”) or in treating these data as a marker of a set number of groups (e.g., “Control group” versus “Treated group”). These tools can help you in processing and exploring the data, and they are also extremely important in creating figures and tables from the data with clear labels. Once you start learning to work with character string data, you will realize that you can also treat the file names and directory names of your project as character strings, and use these tools to embed and use useful information in these filenames.
The stringr
package, which is part of the tidyverse, includes simple but
powerful tools for working with vectors composed of character strings. For
example, the package includes a function that let you extract a subset of each
character string based on the position of the characters in the string, a
function that lets you replace every instance of a pattern with something
else, and a function that will tell you which character strings in the vector
have a match to a certain pattern. It also includes a function that can change
the case of all the letters in each string, either to uppercase, to lowercase,
or to “title case” (the first letter in each word is capitalized).
You likely will not realize how powerful many of these tools are until you need
to do one of these tasks, but then you’ll find they make your life much easier.
For example, say that you have a column in your data that provides the ID of
each study subject (e.g., “Mouse 1A”). If some of the IDs were entered using
upper case (e.g., “MOUSE 1A”), some with lower case (“mouse 1a”), and some with
a mixture (e.g., “Mouse 1A”), then you may find that it is hard to write code
that recognizes that “Mouse 1A” is the same as “mouse 1a” and “MOUSE 1A”. The
functions in the stringr
package would let you quickly convert everything to
the same case and so work around this issue.
As another example, you may want to extract certain elements from each subject
ID—for example, you might want to create a column where you have changed
“Mouse 1A” to just “1A” and “Mouse 2B” to just “2B”. The stringr
package has
functions that will let you do this in several ways. For example, it has a
function that would let you remove “Mouse” from each character string, and
another function that would let you extract only the part of the string that
starts from the first number. These types of tools can be invaluable when you
need to pre-process or clean data from the format that it first enters R.
Sometimes, you will want to treat character strings as discrete categories or values. For example, if part of your data records subject IDs (e.g., “Mouse 1A”, “Mouse 2B”), you may want to be able to link up all of the observations that are recorded for each subject. Similarly, you may want to treat a variable that records treatment (e.g., “treated” / “control”) as a set of specific categories that each observation belongs to.
In R, you can do this by treating that column as something called a “factor”. This data type looks like a character string (e.g., “treated”), but R has recorded that there are only a few set values that values in the column can have (e.g., “treated” or “control”), and when you summarize or plot the data, you can group by this variable to get summaries within each category, or align it with the color or shape of plotted points.
The forcats
package includes helpful tools for working with this factor type
of data. When a column is changed into a factor, the possible levels of the
factor (in other words, the possible values it can take) will be given an order,
often alphabetical. You won’t notice this order with many of the processing
you might do, but it will control the order that categories are mentioned when
you summarize or plot the data. The forcats
package includes a function that
lets you rearrange this order, and so rearrange the order that each category
is presented in summaries and plots. The package also includes numerous other
tools for working with this type of data. For example, if you have a factor
that takes many different possible values, it will let you to convert to
specify only those that are most common (you can specify how many categories),
and then pool the rest into an “Other” category.
The vignettes for the stringr
and forcats
packages are available at:
-
stringr
: https://stringr.tidyverse.org/ -
forcats
: https://forcats.tidyverse.org/
16.4 Tools for working with dates and times
Another handy set of tools are those for working with dates and times. Often, you will record the date that an observation is collected, or the date and time if data are being collected at a fine time scale. Although you record these as a character string (e.g., “August 1, 2019”), you’ll want to be able to use the quantitative information within the date. For example, you may want to be able to tell if the date of each observation is before a certain date, or determine how many days there are between two date.
The tidyverse includes a package for working with dates and times called
lubridate
. This package includes functions that allow you to change a column
in your data to have a date or date-time data type. This will allow you to
do operations on those values as dates—in other words, do things like determine
the number of days between two dates. The lubridate
package also includes
functions for these operations on dates, including determining if one date is
larger or smaller than another and whether it’s within an interval of two dates,
as well as determining the difference between two dates or finding out which
date is a certain number of days after a given date. There are also functions
to extract certain elements from each date, like the day of the week or the
month of the year.
The functions in the lubridate
package can be very useful for pre-processing
data. For example, you may record the date of each measurement that you take,
but also need to determine how much time has passed between the start of the
experiment and that measurement. The lubridate
package has a function that
will allow you to calculate the time since a recorded start time, and so this
allows you to record only the date and time of each measurement, and then
determine the time since the start of the experiment within reproducible code
once you read the recorded data into R.
To find out more about the lubridate
package, you can read its vignette
at https://lubridate.tidyverse.org/.
16.5 Tools for statistical modeling
Often, analysis of biomedical data will include some statistical hypothesis testing or model building. For example, if you have collected bacterial loads in two groups of animals with different treatment assignments (treated and control), you may want to test the hypothesis that the average bacterial load in the two groups is the same. If the treatment was successful and the experiment had adequate power, then the data will hopefully show that this null hypothesis should be rejected.
R has a number of functions that can run the most common statistical hypothesis tests (e.g., Student’s t-test) as well as fit commonly-used statistical models (e.g., linear regression models). Many of the tools for common tests and model building are included with your initial installation of R. This means that you can use them without installing or loading additional packages.
Further, there are many additional packages that are available that run less common statistical tests or fit less common statistical model frameworks. Part of R’s strength is in its deep availability of these packages for statistical analysis. You can often use a Google search to determine if there is a function or package for a statistical analysis that you would like to perform in R, and it is rare to not find at least one package with the appropriate algorithm. To help you select among different packages, check out the article “Ten Simple Rules for Finding and Selecting R Packages.”310
In addition to learning the tools for the types of statistical analysis that you do often in your research, it is also helpful to learn some tools that help you incorporate that statistical analysis into your workflow. Many of the tools in R for statistical analysis were originally focused on being an endpoint of a code pipeline. For example, many of them will result in a print-out summary of the results of the statistical test or model fit. This is fine if you only want to record that result, but often you will want to use the results in further R code, for example to add to plots or tables or to combine with other results.
There are a couple of packages that can help with this. First, there is a
package called broom
that can conver the output of many statistical
tests and models into a tidy dataframe. If you have focused on learning
tidyverse tools, then this functionality makes it much easier for you to
continue working with the output. The tidymodels
package extends on this idea by creating a common interface for fitting
a variety of statistical models and extracting results in a tidy format.
You can read the vignettes for the broom
and tidymodels
packages at:
16.6 Resources to learn more on tidyverse tools
The tidyverse approach is now widely taught, both in in-person courses at universities and through a variety of online resources. Since there are so many excellent resources available—many for free—to learn how to code in R using the tidyverse approach, we consider it beyond the scope of these modules to go more deeply into these instructions. Rather, we’ll point you to some excellent references that go deeply into the tidyverse approach to coding, its set of tools, and how they can be applied when working with biomedical data.
16.6.1 Classes and workshops
Most R programming classes at universities, as well as workshops at conferences and other venues, now focus on the tidyverse approach, especially if they are geared to new R users. An R programming class can be a worthwhile investment of time if this resource is available to you, and if you head a research group and do not have time to take one yourself, you could instead encourage trainees in your research group to take this type of class. Programming in other scripted languages, like Python and Julia, provides similar skills, although the collection of extension packages that are available for biomedical data tends to be most extensive for R (at least at this time). Classes in programming languages like Java or C++, on the other hand, would have less immediate relevance for most biologists and other bench scientists, and so if you would like to become better at working with biomedical data, it would be worthwhile to focus on programming languages that are scripted.
16.6.2 Online books
There are a number of excellent free online books that are available to help you learn more about R (many of which can also be purchased as a hard copy, if you prefer that format). These typically include lots of examples of code that help you try out concepts as you learn them.
One key resource for learning the tidyverse approach for R is the book R for Data Science by Hadley Wickham (the primary developer of the tidyverse) and Garrett Grolemund.311 This book is available as a print edition through O’Reilly Media. It is also freely available online at https://r4ds.had.co.nz/. This book is geared to beginners in R, moving through to get readers to an intermediate stage of coding expertise, which is a level that will allow most scientific researchers to powerfully work with their experimental data. The book includes exercises for practicing the concepts, and a separate online book is available with solutions for the exercises at https://jrnold.github.io/r4ds-exercise-solutions/.
Another online book that is an excellent tool—particularly for those using R for biomedical research—is Modern Statistics for Modern Biology, by Susan Holmes and Wolfgang Huber.312 This book shows how the tidyverse approach can be combined with tools from Bioconductor that are custom-built to work with bioinformatics data. It also provides an excellent overview of statistical methods for working with biomedical data and how those can be applied using R. The book is available online at https://www.huber.embl.de/msmb/.
16.6.3 Cheatsheets
For many of the key tidyverse packages, there are two-page “cheatsheets” that have been developed by the package creators to help users learn and remember the functions that are available in the package. These are available at https://posit.co/resources/cheatsheets/.
Each cheatsheet includes numerous working examples. One excellent way to familiarize yourself with the tools in a package, then, is to work through the examples on the cheatsheet one at a time, making sure that you understand the inputs and outputs to the function and how the function has created the output. Once you have worked through a cheatsheet in this way, you can keep it close to your desk to serve as a quick reminder of the names and uses of different functions in the package, until you have used them enough that you don’t need this memory jog.
For deeper tutorials of each tidyverse package, you can explore the package’s vignette. We’ve provided links to several of these throughout this module.