Module 20 Example: Creating a reproducible data pre-processing protocol
We will walk through an example of creating a reproducible data pre-processing protocol, looking at how to pre-process and analyze data collected in the laboratory to estimate bacterial load in samples. These data come from plating samples at serial dilutions in an immunological experiment led by one of the coauthors. This data pre-processing protocol was created using RMarkdown and allows the efficient, transparent, and reproducible pre-processing of plating data for all experiments in the research group. We will go through how RMarkdown techniques can be applied to develop this type of data pre-processing protocol for a laboratory research group.
Objectives. After this module, you should be able to:
- Explain how a reproducible data pre-processing protocol can be developed for a real research project
- Understand how to design and implement a data pre-processing protocol to replace manual or point-and-click data pre-processing tools
- Apply techniques in RMarkdown to develop your own reproducible data pre-processing protocols
20.1 Introduction and example data
In this module, we’ll provide advice and an example of how you can use the tools for knitted documents to create a reproducible data pre-processing protocol. This module builds on ideas and techniques that were introduced in the last two modules (modules 18 and 19), to help you put them into practical use for data pre-processing that you do repeatedly for research data in your laboratory.
We will use an example of a common pre-processing task in immunological research: estimating the bacterial load in samples by plating at different dilutions. For this type of experiment, the laboratory researcher plates each of the samples at several dilutions, identifies a good dilution for counting colony-forming units (CFUs), and then back-calculates the estimated bacterial load in the original sample based on the colonies counted at this "good" dilution. This experimental technique dates back to the late 1800s, with Robert Koch, and continues to be widely used in microbiology research and applications today.[341] These data are originally from an experiment in the laboratory of one of our authors and are also available as example data for an R package called `bactcountr`, currently under development at https://github.com/aef1004/bactcountr/tree/master/data.
These data are representative of data often collected in immunological research. For example, you may be testing drugs against infectious bacteria and want to know how successful each drug is in limiting bacterial load. You run an experiment and have samples from animals treated with different drugs or kept as controls, and you would then want to know how much viable (i.e., replicating) bacteria each of your samples contains.
You can find out by plating the sample at different dilutions and counting the colony-forming units (CFUs) that are cultured on each plate. You put a sample on a plate with a medium the bacteria can grow on and then give them time to grow. The idea is that individual bacteria from the original sample end up randomly around the surface of the plate, and any that are viable (able to reproduce) will form a new colony that, after a while, you'll be able to see.
To get a good estimate of bacterial load from this process, you need to count CFUs on a "countable" plate—one with a "just right" dilution (and you typically won't know which dilution this is for a sample until after plating). If you have too high of a dilution (i.e., one with very few viable bacteria), randomness will play a big role in the CFU count, and you'll estimate the original bacterial load with more variability. If you have too low of a dilution (i.e., one with lots of viable bacteria), it will be difficult to identify separate colonies, and they may compete for resources. To translate from diluted concentration to original concentration, you can then do a back-calculation, incorporating both the number of colonies counted at that dilution and how dilute the sample was (a sketch of this calculation is shown below). There is therefore some pre-processing required (although it is fairly simple) to prepare the collected data and get an estimate of bacterial load in the original sample. This estimate of bacterial load can then be used in statistical testing and combined with other experimental data to explore questions like whether a candidate vaccine reduces bacterial load when a research animal is challenged with a pathogen.
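To make the back-calculation concrete, here is a minimal sketch in R, assuming 1:10 serial dilutions and a hypothetical plated volume (the function name and its default are illustrative, not the protocol's actual code):

```r
# Back-calculate bacterial load from a CFU count at a chosen dilution,
# assuming 1:10 serial dilutions (dilution 0 = undiluted sample)
estimate_cfu <- function(colonies, dilution, volume_ml = 0.1) {
  # colonies:  CFUs counted at the chosen dilution
  # dilution:  number of tenfold dilution steps
  # volume_ml: volume plated, in mL (hypothetical default)
  colonies * 10^dilution / volume_ml
}

estimate_cfu(colonies = 26, dilution = 1)
#> [1] 2600  # estimated CFU per mL of the original sample
```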
We will use this example of a common data pre-processing task to show how to create a reproducible pre-processing protocol in this module. If you would like, you can access all the components of the example pre-processing protocol and follow along, re-rendering it yourself on your own computer. The example data are available as a csv file, downloadable here. You can open this file using spreadsheet software, or look at it directly in RStudio. The final pre-processing protocol for these data can also be downloaded, including both the original RMarkdown file and the output PDF document. Throughout this module, we will walk through elements of this document, to provide an example as we explain the process of developing data pre-processing protocols for common tasks in your research group. We recommend that you go ahead and read through the output PDF document, to get an idea of the example protocol that we're creating.
This example is intentionally simple, to allow a basic introduction to the process using pre-processing tasks that are familiar to many laboratory-based scientists and easy to explain to anyone who has not used plating in experimental work. However, the same general process can also be used to create pre-processing protocols for data that are much larger or more complex or for pre-processing pipelines that are much more involved. For example, this process could be used to create data pre-processing protocols for automated gating of flow cytometry data or for pre-processing data collected through single cell RNA sequencing.
20.2 Advice on designing a pre-processing protocol
Before you write your protocol in a knitted document, you should decide on the content to include. In this section, we provide tips on this design process, describing some key steps in designing a data pre-processing protocol:
- Defining input and output data for the protocol;
- Setting up a project directory for the protocol;
- Outlining key tasks in pre-processing the input data; and
- Adding code for pre-processing.
We will illustrate these design steps using the example protocol on pre-processing plating data.
20.2.1 Defining input and output data for the protocol
The first step in designing the data pre-processing protocol is to decide on the starting point for the protocol (the data input) and the ending point (the data output). It may make sense to design a separate protocol for each major type of data that you collect in your research laboratory. Your input data for the protocol, under this design, might be the data that is output from a specific type of equipment (e.g., flow cytometer) or from a certain type of sample or measurement (e.g., metabolomics run on a mass spectrometer), even if it is a fairly simple type of data (e.g., CFUs from plating data, as used in the example protocol for this module). For example, say you are working with three types of data for a research experiment: data from a flow cytometer, metabolomics data measured with a mass spectrometer, and bacterial load data measured by plating samples and counting colony-forming units (CFUs). In this case, you may want to create three pre-processing protocols: one for the flow data, one for the metabolomics data, and one for the CFU data. These protocols are modular and can be re-used with other experiments that use any of these three types of data.
With an example dataset, you can begin to create a pre-processing protocol before you collect any of your own research data for a new experiment. If the format of the initial data is similar to the format you anticipate for your data, you can create the code and explanations for key steps in your pre-processing for that type of data. Often, you will be able to adapt the RMarkdown document to change it from inputting the example data to inputting your own experimental data with minimal complications, once your data come in. By thinking through and researching data pre-processing options before the data are collected, you can save time in analyzing and presenting your project results once you've completed the experimental data collection for the project. Further, with an example dataset, you can get a good approximation of the format in which you will output data from the pre-processing steps. This will allow you to begin planning the analysis and visualization that you will use to combine the different types of data from your experiment and use them to investigate important research hypotheses. Again, if data follow standardized formats across steps in your process, it will often be easy to adapt the code in the protocol to input the new dataset that you created, without major changes to the code developed with the example dataset.
While pre-processing protocols for some types of data might be very complex, others might be fairly simple. However, it is still worthwhile to develop a protocol even for simple pre-processing tasks, as it allows you to pass along some of the details of pre-processing the data that might have become “common sense” to longer-tenured members of your research group. For example, the pre-processing tasks in the example protocol are fairly simple. This protocol inputs data collected in a plain-text delimited file (a csv file, in the example). Within the protocol, there are steps to convert initial measurements from plating at different dilutions into an estimate of the bacterial load in each sample. There are also sections in the protocol for exploratory data analysis, to allow for quality assessment and control of the collected data as part of the pre-processing. The output of the protocol is a simple data object (a dataframe, in this example) with the bacterial load for each original sample. These data are now ready to be used in tables and figures in the research report or manuscript, as well as to explore associations with the experimental design details (e.g., comparing bacterial load in treated versus untreated animals) or merged with other types of experimental data (e.g., comparing immune cell populations, as measured with flow cytometry data, with bacterial loads, as measured from plating and counting CFUs).
Once you have identified the input data type to use for the protocol, you should identify an example dataset from your laboratory that you can use to create the protocol. This could be a dataset that you currently need to pre-process, in which case the development of the protocol will serve a second purpose, allowing you to complete this task at the same time. However, you may not have a new set of data of this type that you currently need to pre-process, and in this case you can build your protocol using a dataset from a previous experiment in your laboratory. In this case, you may already have a record of the steps that you used to pre-process the data previously, and these can be helpful as a starting point as you draft the more thorough pre-processing protocol. You may want to select an example dataset that you have already published or are getting ready to publish, so you won't feel awkward about making the data available for people to practice with. If you don't have an example dataset from your own laboratory, you can explore example datasets that are already available, either as data included with existing R packages or through open repositories, including those hosted through national research institutions like the NIH. In this case, be sure to cite the source of the data and include any available information about the equipment that was used to collect it, including equipment settings used when the data were collected.
For the example protocol for this module, we want to pre-process data that were collected “by hand” by counting CFUs on plates in the laboratory. These counts were recorded in a plain text delimited file (a csv file) using spreadsheet software. The spreadsheet was set up to ensure the data can easily be converted to a “tidy” format, as described in module 3. The first few rows of the input data look like this:
## # A tibble: 6 × 6
## group replicate dilution_0 dilution_1 dilution_2 dilution_3
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 2 2-A 26 10 0 0
## 2 2 2-B TNTC 52 10 5
## 3 2 2-C 0 0 0 0
## 4 3 3-A 0 0 0 0
## 5 3 3-B TNTC TNTC 30 10
## 6 3 3-C 0 0 0 0
Each row represents the number of bacterial colonies counted after plating a certain sample, where each sample represents one experimental animal and several experimental animals (replicates) were considered for each experimental group. Columns are included with values for the experimental group of the sample (`group`), the specific ID of the sample within that experimental group (`replicate`, e.g., `2-A` is mouse A in experimental group 2), and the colony-forming units (CFUs) counted at each of several dilutions. If a cell has the value "TNTC", this indicates that CFUs were too numerous to count for that sample at that dilution.
When you have identified the input data type you will use for the protocol, as well as selected an example dataset of this type to use to create the protocol, you can include a section in the protocol that describes these input data, what file format they are in, and how they can be read into R for pre-processing (Figure 20.1).
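For a csv file like this one, that section of the protocol might contain something as simple as the following sketch (the file name here is hypothetical; use the name of your own data file):

```r
library(readr)

# Read the raw plating data; because some cells contain "TNTC", the
# corresponding dilution columns will be read in as character, not numeric
cfu_data <- read_csv("cfu_counts.csv")
head(cfu_data)
```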
For the data output, it often makes sense to plan for data in a format that is appropriate for data analysis and for merging with other types of data collected from the experiment. The aim of pre-processing is to get the data from the format in which they were collected into a format that is meaningful for combining with other types of data from the experiment and using in statistical hypothesis testing.
In the example pre-processing protocol, we ultimately output a simple dataset, with one row for each of the original samples. The first few rows of this output data are:
## # A tibble: 6 × 3
## group replicate cfu_in_organ
## <dbl> <chr> <dbl>
## 1 2 2-A 260
## 2 2 2-B 2500
## 3 2 2-C 0
## 4 3 3-A 0
## 5 3 3-B 7500
## 6 3 3-C 0
For each original sample, an estimate of the CFUs of Mycobacterium tuberculosis in the full spleen is given (`cfu_in_organ`). These data can now be merged with other data collected about each animal in the experiment. For example, they could be joined with data that provide measures of the immune cell populations for each animal, to explore if certain immune cells are associated with bacterial load. They could also be joined with experimental information and then used in hypothesis testing. For example, these data could be merged with a table that describes which groups were controls versus which used a certain vaccine, and then a test could be conducted exploring evidence that bacterial loads in animals given a vaccine were lower than in control animals.
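As a sketch of such a merge (the treatment table and its column names are hypothetical):

```r
library(dplyr)
library(readr)

# Read the pre-processed output from the protocol (file name as written
# out by the example protocol)
cfu_estimates <- read_csv("processed_cfu_estimates.csv")

# Hypothetical table describing each experimental group's treatment
group_treatments <- tibble(group = c(2, 3),
                           treatment = c("vaccine", "control"))

# Merge, so each sample's bacterial load is paired with its treatment
cfu_with_treatment <- left_join(cfu_estimates, group_treatments,
                                by = "group")
```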
20.2.2 Setting up a project directory for the protocol
Once you have decided on the input and output data formats, you will next want to set up a file directory for storing all the inputs needed in the protocol. You can include the project files for the protocol in an RStudio Project (see module 6) and post this either publicly or privately on GitHub (see modules 9–11). This creates a "packet" of everything that a reader needs to recreate what you did—they can download the whole GitHub repository and will have a nice project directory on their computer with everything they need to try out the protocol.
Part of the design of the protocol involves deciding on the files that should be included in this project directory. Figure 20.2 provides an example of the initial files included in the project directory for the example protocol for this module. The left side of the figure shows the files that are initially included, while the right side shows the files in the project after the code in the protocol is run.
Generally, in the project directory you should include a file with the input example data, in whatever file format you will usually collect this type of data. You will also include an RMarkdown file where the protocol is written. If you are planning to cite articles and other references, you can include a BibTeX file, with the bibliographical information for each source you plan to cite (see module 19). Finally, if you would like to include photographs or graphics, you can include these image files in the project directory. Often, you might want to group these together in a subdirectory of the project named something like “figures”.
Once you run the RMarkdown file for the protocol, you will generate additional files in the project. You will typically generate two kinds of files: the rendered output document for the protocol (in the example, this is a pdf file) and the output data, which are pre-processed through the protocol code and written into a file to be used in further analysis.
20.2.3 Outlining key tasks in pre-processing the input data
The next step is to outline the key tasks that are involved in moving from the data input to the desired data output. For the plating data we are using for our example, the key tasks to be included in the pre-processing protocol are:
- Read the data into R
- Explore the data and perform some quality checks
- Identify a “good” dilution for each sample—one at which you have a countable plate
- Estimate the bacterial load in each original sample based on the CFUs counted at that dilution
- Output data with the estimated bacterial load for each sample
Once you have this basic design, you can set up the RMarkdown file for the pre-processing protocol to include a separate section for each task, as well as an "Overview" section at the beginning to describe the overall protocol, the data being pre-processed, and the laboratory procedures used to collect those data. In RMarkdown, you can create first-level section headers by putting the text for the header on its own line and beginning that line with `#`, followed by a space. You should include a blank line before and after the line with this header text. Figure 20.3 shows how this is done in the example protocol for this module, showing how text in the plain text RMarkdown file for the protocol aligns with section headers in the final pdf output of the protocol.
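As a sketch, the section skeleton in the RMarkdown file might look like this (the section names are illustrative, based on the task list above):

```markdown
# Overview

Text describing the experiment, the data collected, and the laboratory
procedures used to collect them.

# Reading the data into R

# Exploratory data analysis and quality checks

# Identifying a countable dilution for each sample

# Estimating bacterial load in each sample

# Outputting the pre-processed data
```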
20.2.4 Adding code for pre-processing
For many of these steps, you likely have code—or can start drafting the code—required for that step. In RMarkdown, you can test this code as you write it. You insert each piece of executable code within a special section, separated from the regular text with special characters, as described in previous modules.
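In RMarkdown, these sections are code chunks, fenced by lines of three backticks, with the opening line naming the language and, optionally, a chunk label; a minimal sketch (the chunk label and code are illustrative):

````markdown
```{r read-data}
# Executable R code for this pre-processing step goes here
cfu_data <- readr::read_csv("cfu_counts.csv")
```
````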
For any pre-processing steps that are straightforward (e.g., calculating the dilution factor in the example protocol, which requires only simple mathematical operations; see the sketch below), you can directly write in the code required for the step. For other pre-processing steps, however, the algorithm may be a bit more complex. For example, complex algorithms have been developed for steps like peak identification and alignment that are required when pre-processing data from a mass spectrometer.
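For the straightforward case, the chunk contents can be only a line or two. A minimal sketch of the dilution-factor calculation, assuming 1:10 serial dilutions where dilution 0 is undiluted:

```r
# With 1:10 serial dilutions, each step dilutes the sample tenfold
dilution <- 0:3
dilution_factor <- 10^dilution
dilution_factor
#> [1]    1   10  100 1000
```

More complex steps, like the peak-finding algorithms mentioned above, will usually not be something you write from scratch.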
For these more complex tasks, you can start to explore available R packages for performing the task. There are thousands of packages available that extend the basic functionality of R, providing code implementations of algorithms in a variety of scientific fields. Many of the R packages relevant for biological data—especially high-throughput biological data—are available through a repository called Bioconductor. These packages are all open-source (so you can explore their code if you want to) and free. You can use vignettes and package manuals for Bioconductor packages to identify the different functions you can use for your pre-processing steps. Once you have identified a function for the task, you can use the helpfile for the function to see how to use it. This help documentation will allow you to determine all of the function’s parameters and the choices you can select for each.
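For instance, Bioconductor packages are installed through the BiocManager package rather than with `install.packages()` alone; a sketch, using flowCore (a real Bioconductor package for flow cytometry data) as an example:

```r
# Install the Bioconductor installer from CRAN, then use it to install
# a Bioconductor package
install.packages("BiocManager")
BiocManager::install("flowCore")

# Open the package's vignettes and function help files to explore it
browseVignettes("flowCore")
?flowCore::read.FCS
```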
You can add each piece of code in the RMarkdown version of the protocol using the standard method for RMarkdown (module 11). Figure 20.4 shows an example from the example protocol for this module. Here, we are using code to help identify a "good" dilution for counting CFUs for each sample. The code is included in an executable code chunk, and so it will be run each time the protocol is rendered. Code comments are included in the code to provide finer-level details about what the code is doing.
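The protocol's actual rules are given in the figure; as a rough illustration of the kind of logic such a chunk might implement, here is a hedged sketch using tidyverse tools, assuming the data were read in as `cfu_data` above (the countable range of 5 to 60 is hypothetical, and the protocol's own rules may differ):

```r
library(dplyr)
library(tidyr)

good_dilutions <- cfu_data %>%
  # Reshape so there is one row per sample-dilution combination
  pivot_longer(starts_with("dilution_"),
               names_to = "dilution", names_prefix = "dilution_",
               values_to = "cfu") %>%
  # "TNTC" values become NA when coerced to numeric
  mutate(dilution = as.numeric(dilution),
         cfu = suppressWarnings(as.numeric(cfu))) %>%
  # Keep only dilutions with counts in a hypothetical "countable" range
  filter(!is.na(cfu), cfu >= 5, cfu <= 60) %>%
  # For each sample, take the lowest (most concentrated) countable dilution
  group_by(group, replicate) %>%
  slice_min(dilution, n = 1) %>%
  ungroup()
```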
For each step of the protocol, you can also include potential problems that might come up in specific instances of the data you get from future experiments. This can help you adapt the code in the protocol in thoughtful ways as you apply it in the future to new data collected for new studies and projects.
20.3 Writing data pre-processing protocols
Now that you have planned out the key components of the pre-processing protocol, you can use RMarkdown’s functionality to flesh it out into a full pre-processing protocol. This gives you the chance to move beyond a simple code script, and instead include more thorough descriptions of what you’re doing at each step and why you’re doing it. You can also include discussions of potential limitations of the approach that you are taking in the pre-processing, as well as areas where other research groups might use a different approach. These details can help when it is time to write the Methods section for the paper describing your results from an experiment using these data. They can also help your research group identify pre-processing choices that might differ from other research groups, which opens the opportunity to perform sensitivity analyses regarding these pre-processing choices and ensure that your final conclusions are robust across multiple reasonable pre-processing approaches.
Protocols are common for wet lab techniques, where they provide a "recipe" that ensures consistency and reproducibility in those processes. Computational tasks, including data pre-processing, can also be standardized through the creation and use of protocols in your research group. While code scripts are becoming more common as a means of recording data pre-processing steps, they are often not as clear as a traditional protocol, in particular in terms of providing a thorough description of what is being done at each step and why it is being done that way. Data pre-processing protocols can provide these more thorough descriptions, and by creating them with RMarkdown or with similar types of "knitted" documents (modules 18 and 19), you can combine the executable code used to pre-process the data with extensive documentation. As a further advantage, the creation of these protocols will ensure that your research group has thought carefully about each step of the process, rather than relying on cobbling together bits and pieces of code they've found but don't fully understand. Just as the creation of a research protocol for a clinical trial requires a careful consideration of each step of the ultimate trial,[342] the creation of a data pre-processing protocol ensures that each step of the process is carefully considered, and so is conducted as carefully as the design of the experiment as a whole and each wet lab technique conducted for the experiment.
A data pre-processing protocol, in the sense we use it here, is essentially an annotated recipe for each step in preparing your data from the initial, "raw" state that is output from the laboratory equipment (or collected by hand) to a state that is useful for answering important research questions. The exact implementation of each step is given in code that can be re-used and adapted with new data of a similar format. However, the code script alone is often not enough to fully understand, share, and collaborate on the process. Instead, it's critical to also include descriptions written by humans and for humans. These annotations can include descriptions of the code and of how certain parameters of the algorithms in the code are standardized. They can also be used to justify choices, and to link them up both with characteristics of the data and equipment for your experiment and with scientific principles that underlie the choices. Protocols like this are critical to allow you to standardize the process you use across many samples from one experiment, across different experiments and projects in your research laboratory, and even across different research laboratories.
As you begin adding text to your pre-processing protocol, you should keep in mind these general aims. First, a good protocol provides adequate detail for another researcher to fully reproduce the procedure.[343] For a protocol for a trial or wet lab technique, this means that the protocol should allow another researcher to reproduce the process and get results that are comparable to your results;[344] for a data pre-processing protocol, it must include adequate detail that another researcher, provided they start with the same data, gets identical results (short of any pre-processing steps that include some element of sampling or random-number generation, e.g., Monte Carlo methods). This idea—being able to exactly re-create the computational results from an earlier project—is referred to as computational reproducibility and is considered a key component in ensuring that research is fully reproducible.
By creating the data pre-processing protocol as a knitted document using a tool like RMarkdown (modules 18 and 19), you can ensure that the protocol is computationally reproducible. In an RMarkdown document, you include the code examples as executable code—this means that the code is run every time you render the document. You are therefore "checking" your code every time you render it. As the last step of your pre-processing protocol, you should output the copy of the pre-processed data that you will use for any further analysis for the project. You can use functions in R to output this to a plain text format, for example a comma-separated delimited file (modules 4 and 5). Each time you render the protocol, you will re-write this output file, and so this provides assurance that the code in your protocol can be used to reproduce your output data (since that's how you yourself created that form of the data).
Figure 20.5 provides an example from the example protocol for this module. The RMarkdown file for the protocol includes code to write out the final, pre-processed data to a comma-separated plain text file called “processed_cfu_estimates.csv”. This code writes the output file into the same directory where you’ve saved the RMarkdown file. Each time the RMarkdown file is rendered to create the pdf version of the protocol, the input data will be pre-processed from scratch, using the code throughout the protocol, and this file will be overwritten with the data generated. This guarantees that the code in the protocol can be used by anyone—you or other researchers—to reproduce the final data from the protocol, and so guarantees that these data are computationally reproducible.
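The final chunk of the protocol might look something like this minimal sketch (the object name `cfu_estimates` is illustrative; the protocol's actual object names may differ):

```r
library(readr)

# Write the final pre-processed data to a plain-text csv file in the same
# directory as the RMarkdown file; this file is overwritten on every render,
# so it always matches the code in the protocol
write_csv(cfu_estimates, "processed_cfu_estimates.csv")
```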
In your data pre-processing protocol, show the code that you use to implement each pre-processing choice and also explain clearly in the text why you made that choice and what alternatives should be considered if data characteristics are different. Write this as if you are explaining to a new research group member (or your future self) how to think about this step in the pre-processing, why you're doing it the way you're doing it, and what code is used to do it that way. You should also include references that justify choices when they are available—include these using BibTeX (module 19). By doing this, you will make it much easier on yourself when you write the Methods section of papers that report on the data you have pre-processed, as you'll already have draft information on your pre-processing methods in your protocol.
Good protocols include not only how each step is performed (for data pre-processing protocols, this is the code), but also why each step is taken. This includes explanations both at a higher level (i.e., why a larger question is being asked) and at a fine level, for each step in the process. A protocol should include some background, the aims of the work, the hypotheses to be tested, the materials and methods, and the methods of data collection and the equipment used to analyze samples.[345]
This step of documentation and explanation is very important to creating a useful data pre-processing protocol. Yes, the code itself allows someone else to replicate what you did. However, only those who are very, very familiar with the software program, including any of the extension packages you use, can "read" the code directly to understand what it's doing. Further, even if you understand the code very well when you create it, it is unlikely that you will stay at that same level of comprehension in the future, as other tasks and challenges take over that brain space. Explaining for humans, in text that augments and accompanies the code, is also important because function names and parameter names in code often are not easy to decipher. While excellent programmers can sometimes create functions with clear and transparent names that make it easy to determine the task each performs, this is difficult in software development and rare in practice. Human annotations, written by and for humans, are critical to ensure that the steps will be clear to you and others in the future, when you revisit what was done with these data and what you plan to do with future data.
The process of writing a protocol in this way forces you to think about each step in the process and why you do it a certain way (including the parameters you choose for certain functions in a pipeline of code), and to include justifications from the literature for this reasoning. If done well, it should allow you to quickly and thoroughly write the associated Methods sections in research reports and manuscripts and help you answer questions and challenges from reviewers. Writing the protocol will also help you identify steps for which you are uncertain how to proceed and what choices to make in customizing an analysis for your research data. These are areas where you can search more deeply in the literature to understand the implications of certain choices and, if needed, contact the researchers who developed and maintain the associated software packages to get advice.
For example, the example protocol for this module explains how to pre-process data collected from counting CFUs after plating serial dilutions of samples. One of the steps of pre-processing is to identify a dilution for each sample at which you have a “countable” plate. The protocol includes an explanation of why it is important to identify the dilution for a countable plate and also gives the rules that are used to pick a dilution for each sample, before including the code that implements those rules. This allows the protocol to provide research group members with the logic behind the pre-processing, so that they can adapt if needed in future experiments. For example, the count range of CFUs used for the protocol to find a good dilution is about a quarter of the typically suggested range for this process, and this is because this experiment plated each sample on a quarter of a plate, rather than using the full plate. By explaining this reasoning, in the future the protocol could be adapted when using a full plate rather than a quarter of a plate for each sample.
One tool in RMarkdown that is helpful for this process is its built-in referencing system. In the previous module, we showed how you can include bibliographical references in an RMarkdown file. When you write a protocol within RMarkdown, you can include references in this way to provide background and support as you explain why you are conducting each step of the pre-processing. Figure 20.6 shows an example of the elements you use to do this, showing each element in the example protocol for this module.
Other helpful tools in RMarkdown are its tools for creating equations and tables. As described in module 19, RMarkdown includes a number of formatting tools. You can create simple tables through basic formatting, or more complex tables using add-on packages like `kableExtra`. Math can be typeset using conventions developed in the LaTeX mark-up language. Module 19 provided advice and links to resources on using these types of tools. Figure 20.7 gives an example of them in use within the example protocol for this module.
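As an illustration of the LaTeX conventions, the back-calculation described earlier could be typeset in the protocol with a display equation like this (the symbols here are our own, not necessarily those used in the example protocol):

```latex
$$
\widehat{\mathrm{CFU}}_{\mathrm{sample}} = \frac{C_d \times 10^{d}}{V_{\mathrm{plated}}}
$$
```

where $C_d$ is the number of colonies counted at dilution $d$ (with 1:10 serial dilutions) and $V_{\mathrm{plated}}$ is the volume plated.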
You can also include figures, either figures created in R or outside figure files. Any figures that are created by code in the RMarkdown document will automatically be included in the protocol. For other graphics, you can include image files (e.g., png and jpeg files) using the `include_graphics` function from the `knitr` package. You can use code chunk options to specify the size of the figure in the document and to include a figure caption. The figures will be automatically numbered in the order they appear in the protocol.
Figure 20.8 shows an example of how external figure files were included in the example protocol. In this case, the functionality allowed us to include an overview graphic that we created in PowerPoint and saved as an image as well as a photograph taken by a member of our research group.
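A sketch of such a chunk (the chunk label, file path, and caption are hypothetical):

````markdown
```{r plating-overview, echo = FALSE, out.width = "90%", fig.cap = "Overview of the serial dilution and plating workflow."}
knitr::include_graphics("figures/plating_overview.png")
```
````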
Finally, you can try out even more complex functionality for RMarkdown as you continue to build data pre-processing protocols for your research group. Figure 20.9 shows an example of using R code within the YAML of the example protocol for this module; this allows us to include a “Last edited” date that is updated with the day’s date each time the protocol is re-rendered.
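One way to achieve this is with inline R code in the YAML date field; a sketch (the title and output format are illustrative):

```yaml
---
title: "Protocol: Estimating bacterial load from plated serial dilutions"
output: pdf_document
date: "Last edited `r format(Sys.Date(), '%B %d, %Y')`"
---
```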
20.4 Applied exercise
To wrap up this module, try downloading both the source file and the output of this example data pre-processing protocol. Again, you can find the source code (the RMarkdown file) here and the output file here. If you would like to try re-running the file, you can get all the additional files you’ll need (the original data file, figure files, etc.) here. See if you can compare the elements of the RMarkdown file with the output they produce in the PDF file. Read through the descriptions of the protocol. Do you think that you could recreate the process if your laboratory ran a new experiment that involved plating samples to estimate bacterial load?