As an epidemiologist, I meet many people who learned SAS as students
and continue
to use it. A common misperception is that R is good for visualizations, but bad for
cleaning data. While in the past this might have been (somewhat) valid, now it couldn’t
be further from the truth. With a collection of tools available through the
tidyverse, you can write clean and compact code to clean even very large and
messy datasets. (The tidyverse is a collection of packages to work with data in a
“tidy” format, or to convert it to that format if needed. Many of these packages
are developed and maintained by people at RStudio. If you run library("tidyverse"),
you can load the core tidyverse packages in your R session. This way, you avoid
having to load them one by one.)
The tidyverse works as well as it does because, for many parts of it, it requires a common input and output, and those input and output specifications are identical (the tidy data format). (There are some clear specifications for this format. I’m not going to go into them here, but several of the references given in the “Learn More” section go into depth in describing and defining this format.) If you want to get a better idea of this concept, and why it’s so powerful, think of some of the classic toys, like Legos (train sets and Lincoln logs also work here). Each piece takes the same input and produces the same output. Think of the bottom of a Lego: it “inputs” small, regularly-spaced pegs, which are exactly what’s at the top (“output”) of each Lego block. This common input and output means that the blocks can be joined together in an extraordinary number of different combinations, and that you can imagine and then make very complex structures with the blocks.
The functions in the tidyverse work this way. The major data cleaning functions all take the same format of input (a tidy tibble) and all return output in that same format. Just like you can build Legos on top of each other in different orders and patterns to create lots of different structures, this framework of small tools that work on the same type of input and produce the same type of output allows you to string together lots of small, simple calls to do some very complex things.
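As a minimal sketch of this “same input, same output” idea, the code below builds a small made-up tibble (the object name, column names, and values are invented for illustration) and checks that two dplyr functions each return a tibble, so the result of one call can be handed directly to the next.

```r
library(tibble)
library(dplyr)

# Hypothetical example data, invented for illustration
heights <- tibble(
  id = 1:4,
  gender = c("female", "male", "female", "male"),
  height_cm = c(162, 178, 170, 181)
)

# Each function takes a tibble as its first argument ...
tall <- filter(heights, height_cm > 165)
tall_small <- select(tall, id, height_cm)

# ... and returns a tibble, so the output of one call can feed the next
is_tibble(tall)
is_tibble(tall_small)
```

Because the output of filter has the same format as its input, select can work on it without any conversion step in between.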
To work with data with tidyverse tools, we’ll use two main ideas. The first is that we’ll use many small tools that each do one thing well and that can be combined in lots of configurations to achieve complex tasks. The second is that we’ll string these small functions together using a special operator called the “pipe operator”.
The main functions in the tidyverse (sometimes called “verbs”) all do simple things.
For example, there are functions to select certain columns, slice to specific
rows, filter to a set of rows that match some criterion, mutate existing columns
to create new columns or change existing ones in place, and summarize a dataframe,
possibly one that you group_by certain characteristics of the data (e.g., a summary
of mean height grouped by gender).
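To make these verbs concrete, here is a short sketch that calls each one on its own. The tibble and its columns are hypothetical, made up only for this example, and the functions all come from the dplyr package.

```r
library(tibble)
library(dplyr)

# Hypothetical example data, invented for illustration
people <- tibble(
  name = c("A", "B", "C", "D"),
  gender = c("female", "male", "female", "male"),
  height_cm = c(162, 178, 170, 181),
  weight_kg = c(60, 82, 65, 90)
)

select(people, name, height_cm)              # keep only certain columns
slice(people, 1:2)                           # keep rows by position
filter(people, height_cm > 165)              # keep rows that match a criterion
mutate(people, height_m = height_cm / 100)   # add a new column

# Summarize mean height within each gender
by_gender <- group_by(people, gender)
mean_heights <- summarize(by_gender, mean_height = mean(height_cm))
mean_heights
```

Each call does one small thing, and none of them changes the original people tibble. Each returns a new tibble instead.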
The second main idea for this data cleaning approach is that we’ll use a
pipe operator (%>%). This operator takes the tidy dataset on its left and passes
it as the first argument of the function on its right. In practice, this allows
you to string together a “pipeline” of data cleaning calls that is very clean
and compact.
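The pipeline idea can be sketched as follows, again with a made-up tibble whose name and columns are assumptions for illustration only. The same three steps are written first as nested calls and then with the pipe, and the two results are compared.

```r
library(tibble)
library(dplyr)

# Hypothetical example data, invented for illustration
people <- tibble(
  name = c("A", "B", "C", "D"),
  gender = c("female", "male", "female", "male"),
  height_cm = c(162, 178, 170, 181)
)

# Without the pipe: nested calls have to be read from the inside out
nested <- summarize(
  group_by(filter(people, height_cm > 165), gender),
  mean_height = mean(height_cm)
)

# With the pipe: each step reads top to bottom, in the order it runs
piped <- people %>%
  filter(height_cm > 165) %>%
  group_by(gender) %>%
  summarize(mean_height = mean(height_cm))

# Both versions should give the same result
all.equal(nested, piped)
```

Writing x %>% f(y) is equivalent to writing f(x, y), which is why the pipe fits so well with functions that take the dataset as their first argument.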