Module 3 The “tidy” data format
In module 2, we explained the benefits of saving data in a structured format, and in particular one that follows standards for your discipline. In this section, we’ll talk about the “tidy” data format. The tidy data format is one implementation of a tabular, two-dimensional structured data format that has quickly gained popularity among statisticians and data scientists since it was defined in a 2014 paper.110
The format’s principles cover some basic rules for organizing the data, and even if you haven’t heard the term tidy data, you may already be implementing many of its standards in your own datasets. Datasets in this format tend to be very easy to work with, including to further clean, model, and visualize the data, as well as to integrate the data with other datasets. In particular, this data format is compatible with a collection of open-source tools on the R platform called the tidyverse. These characteristics mean that, if you are planning to use a standardized data format for recording experimental data in your research group, you may want to consider creating one that adheres to the tidy data format.
Objectives. After this module, the trainee will be able to:
- List characteristics defining the “tidy” structured data format
- Understand how to reformat a dataset to make it follow the “tidy” format
- Explain the difference between a structured data format (general concept) and the “tidy” data format (one popular implementation)
- Understand benefits of recording data in a “tidy” format
3.1 Keeping things tidy
Adam Savage has built a career out of making things. He became famous as the host of the TV show Mythbusters, where a crew builds contraptions to test urban myths. For many years before that, he created models and special effects for movies. He has thought a lot about how to effectively work in teams to make things, and in 2019 he published a book about his life as a maker called Every Tool is a Hammer.111
Among many insights, Savage focuses on the importance of tidying up as part of the creation process, saying “It’s time, when taken, that you might feel is slowing you down in the moment, but in fact is saving you time in the long run.”112 He introduces a new word for the process of straightening up tools and materials—“knolling”. He borrowed the term from an artist, Tom Sachs, whose rules for his own workshop include, “Always Be Knolling”.
The idea of “knolling” includes a few key principles. First, only have what you need out. Put everything else somewhere else. Removing any extras makes it faster to find what you need when you need it. Second, for things you need, make sure they’re out and available. “Drawers are where things go to die,” Savage says, highlighting the inefficiency of having to look for things that are hidden from sight as you work. Finally, organize the things that you have out. Put like things together, and arrange everything neatly, aligning things in parallel or perpendicular patterns, rather than piling them haphazardly.
Just as organizing tools and materials improves efficiency in a workshop, organizing your data can dramatically improve the efficiency of data pre-processing, analysis, and visualization. Indeed, “tidying up” your data can give such dramatic improvements that a number of researchers have developed systems and written papers that describe good organization schemes to use to tidy up data.113
The principles for tidying up data follow some of the principles for knolling. For example, you want to make sure that you’re saving data in a file or spreadsheet that only includes the data, removing any of the extras. Lab groups will sometimes design spreadsheets for data collection that include a space for recording data, but also space for notes, embedded calculations, and plots. These extra elements can make it hard to extract and use the data itself. One way to tidy up a dataset is to remove any of these extra elements. While you can do this after you’ve collected your data, it’s more efficient to design a way to record your data in the first place without extra elements in the file or spreadsheet.
You can further tidy up your data format by reformatting it to follow the rules of a data format called the “tidy data” format. Just as Adam Savage’s “knolling” helps you find things when you need them, using a tidy data format puts elements of your data in the “right” place to be found by a powerful collection of tools called the tidyverse.
We’ll start this module by describing rules a dataset format must follow for it to be “tidy” and clarifying how you can set up your data recording to follow these rules. In later parts of this module, we’ll talk more about why it’s helpful to use a tidy data format, as well as a bit about the tidyverse tools that you can use with data in this format.
3.2 What makes data “tidy”?
The “tidy” data format describes one way to structure tabular data. The name follows from the focus of this data format and its associated set of tools (the “tidyverse”) on preparing and cleaning (“tidying”) data, in contrast to sets of tools more focused on other steps, like data analysis.114 The word “tidy” is not meant to imply that other formats are “dirty”, or that they include data that is incorrect or subpar. In fact, the same set of datapoints could be saved in a file in a way that is either “tidy” (in the sense of the format’s original definition)115 or untidy, depending only on how the data are organized across columns and rows.
Wickham notes in his article, where he first describes the tidy data format, that his ideas about this format evolved from seeing many examples of different ways that data could be organized within a two-dimensional structure. He notes:
“The development of tidy data has been driven by my experience from working with real-world datasets. With few, if any, constraints on their organization, such datasets are often constructed in bizarre ways. I have spent countless hours struggling to get such datasets organized in a way that makes data analysis possible, let alone easy.”116
To help you understand the tidy data format that Wickham developed, let’s start with a checklist of rules that make a dataset tidy. Some of these are drawn directly from the journal article that originally defined the data format.117 Other rules are based on common untidy patterns that show up in data recording templates for laboratory research. The checklist is:
- Data are recorded in a tabular, two-dimensional format
- The data collection file or spreadsheet avoids extra elements like plots or embedded equations in the file
- Each observation forms a row
- Column headers are variable names, not values
- Each type of observational unit forms its own table
- Each variable forms a column
- A single variable is in a single column, not spread across multiple columns
- A column contains only one variable; multiple variables are not stored in one column
- Data types are consistent within a column
In module 1, we discussed the first two principles, highlighting how important it is to separate data collection from further steps of data processing and analysis. To start this module, we’ll go through other items in this checklist, to help you understand what makes a dataset follow the tidy data format. We aim to help you set up your data recording template to follow this format, recognize whether data that others have collected is in this format, and restructure it if not.
Tidy data, first, must be in a tabular format: that is, two-dimensional, with columns and rows, where every row has the same number of cells and every column has the same number of cells. If it’s in a spreadsheet, it should be stored without any “extras”, like embedded plots and calculations. If you record data in a spreadsheet using a very basic strategy of saving a single table per spreadsheet, with the first row giving the column names, then your data will be in a tabular format. In general, if your recorded data looks “boxy”, it’s probably in a two-dimensional tabular format.
There are some additional criteria for the tidy data format, though, and so not every structured, tabular dataset is in a tidy format. As Wickham notes in his paper defining the format,
“Most statistical datasets are rectangular tables made up of rows and columns … [but] there are many ways to structure the same underlying data. … Real datasets can, and often do, violate the three precepts of tidy data in almost every way imaginable.”118
First, each row of a tidy dataset records the values for a single observation.119 To figure out if your data format follows this rule, it’s important to determine the unit of observation of that data, which is the unit at which you take measurements.120 This idea is different from the unit of analysis, which is the unit that you’re focusing on in your study hypotheses and conclusions (this is sometimes also called the “sampling unit” or “unit of investigation”).121 In some cases, these two might be equivalent (the same unit is both the unit of observation and the unit of analysis), but often they are not.122 Sedgwick notes:
“The unit of observation and unit of analysis are often confused. The unit of observation, sometimes referred to as the unit of measurement, is defined statistically as the ‘who’ or ‘what’ for which data are measured or collected. The unit of analysis is defined statistically as the ‘who’ or ‘what’ for which information is analysed and conclusions are made.”123
As an example, say you are testing how the immune system of mice responds to a certain drug over time. In this case, the unit of analysis might be the drug, or a combination of drug and dose—ultimately, you may want to test something like if one drug is more effective than another. To answer this research question, you likely have several replicates of mice in each treatment group. If a separate mouse (replicate) is used to collect each observation, and a mouse is never measured twice (i.e., at different time points, or for a different infection status), then the unit of measurement—the level at which each data point is collected—is the mouse. This is because each mouse is providing a single observation to help answer the larger research question.
As another example, say you conducted a trial on human subjects, to see how a certain treatment affects the speed of recovery, where each study subject was measured at different time points. In this case, the unit of observation is the combination of study subject and time point (while the unit of analysis is the treatment). That means that Subject 1’s measurement at Time 1 would be one observation, and the same person’s measurement at Time 2 would be a separate observation. For a dataset to comply with the tidy data format, these two observations would need to be recorded on separate lines in the data. If the data instead had different columns to record each study subject’s measurements at different time points, then the data would still be tabular, but it would not be tidy.
For a dataset to be tidy, it cannot have variable values in any of its column names. It’s helpful to talk about an example to understand how you might end up with a variable value in a column name. If you are measuring study subjects at different times, then one variable is the thing you measure at each timepoint (the subject’s weight, for example). Another variable, though, is the timepoint itself. One observation might be recorded 14 days into the study, so it would have a timepoint of “day 14”. Another might be measured 28 days into the study, and that measure would have a timepoint of “day 28”.
It can be tempting to put these types of variables—which were set as part of the study design—into the column names. It would be tempting, for example, to have a column for each timepoint, then put the weight measures within the cells for that timepoint. This type of format would look fine visually and would be easy for readers to interpret. What’s the problem, then? Why aren’t these data tidy?
It is simply because this format doesn’t work well with the software tools created to work with tidy data. Remember that the “tidy” data format isn’t meant as a contrast, where everything else is objectively “messy”. Instead, it’s a standard format—by insisting on certain things consistently being in certain places, it allows for tools that work with that format. Some of the rules of the tidy format, then, exist to get things in the right place to work with the tools.
To make the dataset we talked about (repeated measures of weights for each subject) tidy, we’d just need to move some of the elements around. We’d need to put the description of the timepoint (e.g., “day 14”, “day 28”) not in a column name, but instead within the cells of a column of the table. This will mean we’ll need to add some rows to the table, but have fewer columns. We often refer to this change as pivoting the dataframe from a wider format to a longer format.
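As a minimal sketch of this kind of pivot, the code below uses the pivot_longer function from the tidyr package. The dataset, its column names (subject_id, day_14, day_28, weight_g), and its values are hypothetical, made up only to illustrate the reshaping.

```r
library(tidyr)

# Hypothetical wide-format data: one row per subject, with the timepoint
# (a variable value) stored in the column names
weights_wide <- data.frame(
  subject_id = c("S1", "S2"),
  day_14 = c(22.1, 19.8),
  day_28 = c(23.4, 20.5)
)

# Move the timepoints out of the column names and into the cells of a
# new "timepoint" column, with the measured weights in "weight_g"
weights_long <- pivot_longer(
  weights_wide,
  cols = c(day_14, day_28),
  names_to = "timepoint",
  values_to = "weight_g"
)

weights_long
```

The longer version has one row per combination of subject and timepoint (four rows instead of two) and three columns instead of one column per timepoint.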
In the example of human subjects measured at repeated time points, you may initially find the tidy format unappealing, because it seems like it would lead to a lot of repeated data. For example, if you wanted to record each study subject’s sex, it seems like the tidy format would require you to repeat that information in each separate line of data that’s used to record the measurements for that subject for different time points. This isn’t the case. Instead, with a tidy data format, different “levels” of data observations should be recorded in separate tables.124 In other words, you should design a separate table for each unit of observation if you have data at several of these units for your experiment. For example, you may have some data on each study subject that does not change across the time points of the study, like the subject’s ID, sex, and age at enrollment. Those data form a separate dataset, one where the unit of observation is the study subject, so there should be just one row of data per study subject in that data table. The measurements for each time point should be recorded in a separate data table. A unique identifier, like a subject ID, should be recorded in each data table so it can be used to link the data in the two tables. If you are using a spreadsheet to record data, this would mean that the data for these separate levels of observation should be recorded in separate sheets, and not on the same sheet of a spreadsheet file. Once you read the data into a scripting language like R or Python, it will be easy to link the larger and smaller tidy datasets as needed for analysis, visualizations, and reports.
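To illustrate how two such tables can be linked, here is a small sketch that uses the left_join function from the dplyr package. Both tables, along with their column names and values, are hypothetical examples rather than data from a real study.

```r
library(dplyr)

# Hypothetical subject-level table: one row per study subject
subjects <- data.frame(
  subject_id = c("S1", "S2"),
  sex = c("F", "M"),
  age_at_enrollment = c(34, 41)
)

# Hypothetical measurement-level table: one row per subject per timepoint
measurements <- data.frame(
  subject_id = c("S1", "S1", "S2", "S2"),
  timepoint = c("day 14", "day 28", "day 14", "day 28"),
  weight_kg = c(61.2, 60.8, 78.5, 78.9)
)

# Use the shared identifier to attach subject-level data to each measurement
combined <- left_join(measurements, subjects, by = "subject_id")

combined
```

Each subject’s sex and age are typed only once when the data are recorded, and the join repeats them across that subject’s rows only at the point where the analysis needs them.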
Next, for a dataset to be tidy, each column should be used to record a separate characteristic or measurement (a variable) for each observation.125 A column could give either characteristics of the data that were pre-defined by the study design, like the treatment assigned to a mouse (a type of variable called a fixed variable, since its value was fixed before the start of the experiment), or observed measurements, like the level of infection measured in an animal (a type of variable called a measured variable, since its value is determined through the experiment).126
Next, make sure that each column has one and only one variable. Let’s look at an example to see the type of thing you should avoid to make sure you’re following this rule. Say that you’re recording weights for your study subjects, but sometimes you collect the weight in ounces and sometimes in grams. You will want to include in your data both the numeric measure of weight for each subject and also the unit of that measure. To keep your data tidy, you need to have one column to record the numeric value you measured for the weight and another column for the units in which that weight was measured. If you don’t use separate columns, but instead record values like “22 g” and “0.8 oz” in a single column, you’ll have to do extra work once you read the data into a program like R to make the data tidy. This can be done using a tool called regular expressions, but it’s even better to set up your initial data recording to record the numeric value (e.g., 22, 0.8) in one column and your units (e.g., “g”, “oz”) in a separate column.
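If you do inherit data with values like “22 g” in a single column, the extra work might look something like the sketch below, which uses regular expressions through the str_extract function from the stringr package. The dataset and the column names (weight, weight_value, weight_unit) are hypothetical, and the patterns assume every entry is a plain number followed by lowercase unit letters.

```r
library(dplyr)
library(stringr)

# Hypothetical data with two variables (value and unit) stored in one column
weights_raw <- data.frame(
  subject_id = c("S1", "S2", "S3"),
  weight = c("22 g", "0.8 oz", "25 g")
)

weights_tidy <- weights_raw %>%
  mutate(
    # Pull out the numeric part (digits with an optional decimal part)
    weight_value = as.numeric(str_extract(weight, "[0-9]+\\.?[0-9]*")),
    # Pull out the unit (the first run of lowercase letters)
    weight_unit = str_extract(weight, "[a-z]+")
  ) %>%
  select(-weight)

weights_tidy
```

Patterns like these tend to break on unexpected entries (for example, “22g”, “22 grams”, or “approx. 22 g”), which is one reason it is easier to record the value and the unit in separate columns from the start.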
In each column of the dataframe, make sure you only have one type of data. For example, make sure all the values are numbers, or that all the values are character strings. Each column in a dataframe is treated by R as a vector, and a vector must be limited to one data type. If you try to mix different data types, then some of your entries may be coerced into a different data type or treated as a missing value.
One culprit to look out for is putting comments in table cells as you record data. Say you’re recording animal weights, but you forgot to weigh one animal. If you put a comment about that in the cell where you would have recorded the weight, you’ll mix numeric values (since all other cells in the column record weight as a number) with one cell that is a character string (the comment). If you need to record comments, you could handle that by making a separate column just for that.
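The short sketch below shows this coercion using only base R. The weights, the comment text, and the column names are hypothetical.

```r
# Mixing numbers with a comment forces the whole vector to character strings
weights_mixed <- c(22.1, 19.8, "forgot to weigh")
class(weights_mixed)

# Converting back to numbers turns the comment into a missing value (NA);
# R would normally warn about this, so the warning is silenced here
weights_numeric <- suppressWarnings(as.numeric(weights_mixed))
weights_numeric

# A tidier layout: a numeric column for the weight, with the missing value
# recorded as NA, and a separate column for comments
weights <- data.frame(
  animal_id = c("A1", "A2", "A3"),
  weight_g = c(22.1, 19.8, NA),
  comments = c(NA, NA, "forgot to weigh")
)
```

With the separate comments column, the weight column stays numeric, so it can be summarized or plotted without any extra conversion.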
3.3 Why make your data tidy?
This may all seem like a lot of extra work to make a dataset tidy, and why bother if you already have it in a structured, tabular format? It turns out that, once you get the hang of what gives data a tidy format, it’s pretty simple to design recording formats that comply with these rules. What’s more, when data are in a tidy format, they can be directly input into a collection of tools in R that belong to something called the tidyverse.
R’s tidyverse framework enables powerful and user-friendly data management, processing, and analysis by combining simple tools to solve complex, multi-step problems.127 Since the tidyverse tools are simple and share a common interface, they are easier to learn, use, and combine than tools created in the traditional base R framework.128 This tidyverse framework is quickly becoming the standard taught in introductory R courses and books,129 ensuring ample training resources for researchers new to programming, including books,130 massive open online courses (MOOCs), on-site university courses,131 and Software Carpentry workshops.132 Further, tools that extend the tidyverse have been created to enable high-quality data analysis and visualization in several domains, including text mining,133 microbiome studies,134 natural language processing,135 network analysis,136 ecology,137 and genomics.138
The tidyverse is a collection of tools united by a common philosophy: very complex things can be done simply and efficiently with small, sharp tools that share a common interface. Zev Ross, in an article about tidy tools and how they can declutter a workflow, notes:
“The philosophy of the tidyverse is similar to and inspired by the “unix philosophy”, a set of loose principles that ensure most command line tools play well together. … Each function should solve one small and well-defined class of problems. To solve more complex problems, you combine simple pieces in a standard way.”139
The tidyverse isn’t the only popular system that follows this philosophy—one other favorite is Legos. Legos are small, plastic bricks, with small studs on top and tubes for the studs to fit into on the bottom. The studs all have the same, standardized size and are all spaced the same distance apart. Therefore, the bricks can be joined together in any combination, since each brick uses the same input format (studs of the standard size and spaced at the standard distance fit into the tubes on the bottom of the brick) and the same output format (again, studs of the standard size and spaced at the standard distance at the top of the brick). Because of this design, bricks can be joined regardless of whether the bricks are different colors or different heights or different widths or depths. With Legos, even though each “tool” (brick) is very simple, the tools can be combined in infinite variations to create very complex structures.
The tools in the tidyverse operate on a similar principle. They all input a tidy dataset (or a column from a tidy dataset) and they (almost) all output data in the same format they input it. For most of the tools, their required format for input and output is the tidy data format,140 called a tidy dataframe in R—this is a dataframe that follows the rules detailed earlier in this section.
This common input / output interface, and the use of small tools that follow this interface and can be combined in various ways, is what makes the tidyverse tools so powerful. However, there are other good things about the tidyverse that make it so popular. One is that it’s fairly easy to learn to use the tools, in comparison to learning how to write code for other R tools.141 This is because the developers who have created the tidyverse tools have taken a lot of effort to try to make sure that they have a clear and consistent user interface.142
To help understand a user interface, and how having a consistent user interface across tools is useful, let’s think about a different example: cars. When you drive a car, you get the car to do what you want through the steering wheel, the gas pedal, the brake pedal, and different knobs and buttons on the dashboard. When the car needs to give you feedback, it uses different gauges on the dashboard, like the speedometer, as well as warning lights and sounds. Collectively, these ways of interacting with your car make up the car’s user interface. In the same way, each function in a programming language has a collection of parameters you can set, which let you customize the way the function runs, as well as a way of providing you output once the function has finished running and a way of providing any messages or warnings about the function’s run. For functions, the software developer can usually choose design elements for the function’s user interface, including which parameters to include for the function, what to name those parameters, and how to provide feedback to the user through messages, warnings, and the final output.
If tools are similar in their user interfaces, it will make it easier for users to learn and use any of the tools once they’ve learned how to use one. For cars, this explains how the rental car business is able to succeed. Even though different car models are very different in many characteristics (their engines, their colors, their software), they are very consistent in their user interfaces. Once you’ve learned how to drive one car, when you get in a new car, the gas pedal, brake, and steering wheel are almost guaranteed to be in about the same place and to operate about the same way as in the car you learned to drive in. The exceptions are rare enough to be memorable: think how many movies have a laugh line from a character trying to drive a car with the driver side on the opposite side of what they’re used to.
The tidyverse tools are similarly designed so that they all have a very similar user interface. For example, many of the tidyverse functions use a parameter named “.data” to refer to the input data. Similarly, parameters named “.vars” and “.funs” are repeatedly used across tidyverse functions, with the same meaning in each case. What’s more, the tidyverse functions are typically given names that very clearly describe the action that the function does, like filter, summarize, mutate, and group_by. As a result, the final code is very clear and can almost be “read” as a natural language, rather than code. As Jenny Bryan notes, in an article on data science:
“The Tidyverse philosophy is to rigorously (and ruthlessly) identify and obey common conventions. This applies to the objects passed from one function to another and to the user interface each function presents. Taken in isolation, each instance of this seems small and unimportant. But collectively, it creates a cohesive system: having learned one component you are more likely to be able to guess how another different component works.”143
Many people who teach R programming now focus on first teaching the tidyverse, given these characteristics,144 and it’s often a first focus for online courses and workshops on R programming. Since its main data structure is the tidy data structure, it’s often well worth recording data in this format so that all these tools can easily be used to explore and model the data.
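As a rough illustration of how these small, clearly named tools combine, the sketch below chains four dplyr functions with the pipe operator. The dataset, its column names, and its values are hypothetical. Each step takes a dataframe as its input and returns a dataframe, which is what allows the steps to be joined in sequence.

```r
library(dplyr)

# Hypothetical tidy data: one row per mouse per timepoint
mouse_weights <- data.frame(
  mouse_id = c("M1", "M1", "M2", "M2", "M3", "M3"),
  treatment = c("drug", "drug", "drug", "drug", "control", "control"),
  day = c(14, 28, 14, 28, 14, 28),
  weight_g = c(21.0, 23.0, 20.0, 22.0, 19.0, 20.0)
)

weight_summary <- mouse_weights %>%
  filter(day == 28) %>%                         # keep only the day 28 rows
  group_by(treatment) %>%                       # work within each treatment
  summarize(
    mean_weight_g = mean(weight_g),             # average weight per group
    n_mice = n()                                # number of mice per group
  ) %>%
  mutate(mean_weight_kg = mean_weight_g / 1000) # add a column in other units

weight_summary
```

Read from top to bottom, the code says roughly what it does: filter to day 28, group by treatment, summarize each group, then add a new column.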
3.4 Using tidyverse tools with data in the tidy data format
When you download R, you get what’s called base R. This includes the main code that drives anything you do in R, as well as functions for doing many core tasks. However, the power of R is that, in addition to base R, you can also add onto R through what are called packages (sometimes also referred to as extensions or libraries). These are kind of like “booster packs” that add on new functions for R. They can be created and contributed by anyone, and many are collected through a few key repositories like CRAN and Bioconductor.
All the tidyverse tools are included in R extension packages, rather than base R, so once you download R, you’ll need to download these packages as well to use the tidyverse tools. The core tidyverse functions include functions to read in data (the readr package for reading in plain text, delimited files, readxl to read in data from Excel spreadsheets), clean or summarize the data (the dplyr package, which includes functions to merge different datasets, make new columns as functions of old ones, and summarize columns in the data, either as a whole or by group), and reformat the data if needed to get it in a tidy format (the tidyr package). The tidyverse also includes more precise tools, including tools to parse dates and times (lubridate) and tools to work with character strings, including using regular expressions as a powerful way to find and use certain patterns in strings (stringr). Finally, the tidyverse includes powerful functions for visualizing data, based around the ggplot2 package, which implements a “grammar of graphics” within R. We cover some tidyverse tools you may find helpful for pre-processing biomedical data in module 16.
You can install and load any of these tidyverse packages one-by-one using the install.packages and library functions with the package name from within R. If you are planning on using many of the tidyverse packages, you can also install and load many of the tidyverse functions by installing a package called tidyverse, which serves as an umbrella for many of the tidyverse packages.
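In practice, these steps might look like the sketch below. The installation lines are commented out, on the assumption that you would run them once per computer rather than every time the script runs. Which individual packages you load depends on your own analysis; dplyr and tidyr are used here only as examples.

```r
# Install once per computer (requires an internet connection):
# install.packages("dplyr")      # a single tidyverse package
# install.packages("tidyverse")  # the umbrella package

# Load in each new R session, either one package at a time ...
library(dplyr)
library(tidyr)

# ... or through the umbrella package, which attaches the core packages:
# library(tidyverse)
```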
In addition to the original tools in the tidyverse, many people have developed tidyverse extensions: R packages that build off the tools and principles in the tidyverse. These often bring the tidyverse conventions into tools for specific areas of science. For example, the tidytext package provides tools to analyze large datasets of text, including books or collections of tweets, using the tidy data format and tidyverse-style tools. Similar tidyverse extensions exist for working with network data (tidygraph) or geospatial data (sf).
Extensions also exist for the visualization branch of the tidyverse specifically. These include ggplot extensions that allow users to create things like calendar plots (sugrrants), gene arrow maps (gggenes), network plots (ggraph), phylogenetic trees (ggtree) and anatogram images (gganatogram). These extensions all allow users to work with data that’s in a tidy data format, and they all provide similar user interfaces, making it easier to learn a large set of tools to do a range of data analysis and visualization, compared to if the set of tools lacked this coherence.
3.5 Discussion questions
What are your main considerations when you decide how to record your data?
Based on the reading, can you define the tidy data format? Were you familiar with this format before preparing for this discussion? Do you use some of these principles when recording your own data?
Describe advantages, as well as potential limitations, of storing data in a tidy data format.
In data that you have collected, can you think of examples when the data collection format included extra elements, beyond simply space for recording the data? Examples might include plots, calculations, notes, and highlighting. What were some of the advantages of having these extra elements in the template? Based on the reading or your own experience, what are some disadvantages to including these extra elements in a data collection template?
In research collaborations, have you experienced a case where the data format for one researcher created difficulties for the other?