6.2 Tidying example

We’ve been using an example dataset with fatal motor vehicle accidents. I downloaded the raw data from the Fatality Analysis Reporting System (FARS), and the original download was a large zipped file, with separate files for variables about the accident, the people involved, etc.

In this short booklet, I won’t be able to teach you all the tools you can use to tidy a dataset in R, but I can quickly walk you through an example to show you how powerful these tools can be, and how clean and efficient the final code is. R’s power for data analytics comes in part from these wonderful tools for working with data, and in a typical data analysis project, I’ll use these tools extensive to create different summaries and views of the data as I work. You will definitely want to learn these tools if you’re serious about using R, so I’ve provided several places to go to master them in the “Learn more” section.

For the rest of the section, we’ll look at cleaning up data from the Federal Accident Reporting System, the source of the example data we’ve used in other sections. If you completed the set-up in the “Prerequisites”, you should be able to load this data using:

library("readr")
fl_accidents <- read_csv("data/accident.csv")

As a reminder, to print out some information about this data, call the object’s name:

fl_accidents
## # A tibble: 34,247 x 52
##    STATE ST_CASE VE_TOTAL VE_FORMS PVH_INVL  PEDS PERNOTMVIT PERMVIT
##    <dbl>   <dbl>    <dbl>    <dbl>    <dbl> <dbl>      <dbl>   <dbl>
##  1     1   10001        1        1        0     0          0       1
##  2     1   10002        1        1        0     0          0       1
##  3     1   10003        3        3        0     0          0       3
##  4     1   10004        1        1        0     0          0       1
##  5     1   10005        1        1        0     0          0       2
##  6     1   10006        2        2        0     0          0       4
##  7     1   10007        2        2        0     0          0       2
##  8     1   10008        1        1        0     0          0       1
##  9     1   10009        1        1        0     0          0       1
## 10     1   10010        1        1        0     1          1       1
## # … with 34,237 more rows, and 44 more variables: PERSONS <dbl>,
## #   COUNTY <dbl>, CITY <dbl>, DAY <dbl>, MONTH <dbl>, YEAR <dbl>,
## #   DAY_WEEK <dbl>, HOUR <dbl>, MINUTE <dbl>, NHS <dbl>, RUR_URB <dbl>,
## #   FUNC_SYS <dbl>, RD_OWNER <dbl>, ROUTE <dbl>, TWAY_ID <chr>,
## #   TWAY_ID2 <chr>, MILEPT <dbl>, LATITUDE <dbl>, LONGITUD <dbl>,
## #   SP_JUR <dbl>, HARM_EV <dbl>, MAN_COLL <dbl>, RELJCT1 <dbl>,
## #   RELJCT2 <dbl>, TYP_INT <dbl>, WRK_ZONE <dbl>, REL_ROAD <dbl>,
## #   LGT_COND <dbl>, WEATHER1 <dbl>, WEATHER2 <dbl>, WEATHER <dbl>,
## #   SCH_BUS <dbl>, RAIL <chr>, NOT_HOUR <dbl>, NOT_MIN <dbl>,
## #   ARR_HOUR <dbl>, ARR_MIN <dbl>, HOSP_HR <dbl>, HOSP_MN <dbl>,
## #   CF1 <dbl>, CF2 <dbl>, CF3 <dbl>, FATALS <dbl>, DRUNK_DR <dbl>

This is a large dataset, with over 50 columns and over 34,000 rows. It records details about all of the fatal motor vehicle accidents in the US in 2017 (at least, that were reported to this database). Currently it inclues all states, although we’re planning to limit it to Florida. In this section, you’ll work through cleaning up this data, as you might if you were starting from this raw data and needed to create summaries and plots similar to those in other sections of this booklet.

The following piece of code is all the code you need to transform this dataset into the fl_accidents dataset we’ve used in earlier examples.

library(magrittr)
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

fl_accidents %>% 
  rename_all(.funs = str_to_lower) %>% 
  select(state, county, day, month, year, latitude, longitud, fatals) %>% 
  filter(state == 12) %>% 
  mutate(county = str_pad(county, width = 3, pad = "0")) %>% 
  unite(col = fips, c(state, county), sep = "") %>% 
  unite(col = date, c(month, day, year), sep = "-") %>% 
  mutate(date = mdy(date)) %>% 
  filter(date >= mdy("9-7-2017") & date <= mdy("9-13-2017"))
## # A tibble: 37 x 5
##    fips  date       latitude longitud fatals
##    <chr> <date>        <dbl>    <dbl>  <dbl>
##  1 12031 2017-09-08     30.2    -81.5      1
##  2 12095 2017-09-07     28.5    -81.4      1
##  3 12097 2017-09-08     28.3    -81.3      1
##  4 12095 2017-09-07     28.6    -81.2      1
##  5 12031 2017-09-08     30.2    -81.8      1
##  6 12033 2017-09-07     30.6    -87.4      2
##  7 12023 2017-09-10     30.1    -82.7      1
##  8 12075 2017-09-08     29.6    -82.9      1
##  9 12045 2017-09-09     30.1    -85.3      2
## 10 12031 2017-09-12     30.4    -81.8      1
## # … with 27 more rows

This function takes the large original dataset. It first renames all the columns—they were in all capital letters, which are a pain to type in your code, so we’re using a function from the stringr package to change them all to lowercase. We’re then selecting just the columns we want to work with for the plot, using their column names to pick them (state, county, etc.). We’re then filtering to just the observations in Florida (where the state FIPS code is 12). Counties have five digit FIPS codes that are useful to use when merging different datasets, including merging in geographic data, and the dataset at this stage has the state part of the FIPS code and the county part in different columns. The county part is currently in a numeric class, which I need to “pad” with 0s at the beginning if it’s currently fewer than three digits. We’ll mutate the county FIPs code to pad it with 0s, using another function from the stringr package. We’ll then unite the state and county FIPS columns to create a single column with the 5-digit FIPS code. Next, we want to convert the date information into a “Date” class, which will let us work with these values more easily. We’ll unite all the columns with date information (month, day, year) into a single column, and then we’ll use a function from the lubridate package to mutate this column to have a date class. Finally, we’ll filter to just the observations with a date within a week of Hurricane Irma’s landfall on September 10, 2017.

At this stage, don’t worry if you don’t know which functions you should use to clean up a new dataset, just try to get a feel for how this tidyverse framework is allowing you to clean up the dataframe with lots of small, interoperable tools. In the example code, try highlighting and running different portions of the code and check out the output at each step along the way. This will help you get a better idea for how this process works.

Becoming familiar with these tools so you can use them yourself takes some time, but is well work the effort. In the “Learn more” section, I’ve got some tips on where you can go to develop those skills.

Finally, the code above is cleaning the data, but not overwriting the original fl_accidents object—instead, it’s printing out the result, but not saving it for you to use later. To use the cleaned data, you’ll need to overwrite the original R object. You can do that in two ways. First, you can use the gets arrow to assign the output to the same R object name:

fl_accidents <- fl_accidents %>% 
  rename_all(.funs = str_to_lower) %>% 
  select(state, county, day, month, year, latitude, longitud, fatals) %>% 
  filter(state == 12) %>% 
  mutate(county = str_pad(county, width = 3, pad = "0")) %>% 
  unite(col = fips, c(state, county), sep = "") %>% 
  unite(col = date, c(month, day, year), sep = "-") %>% 
  mutate(date = mdy(date)) %>% 
  filter(date >= mdy("9-7-2017") & date <= mdy("9-13-2017"))

If you want to be even more compact, you can use something called the compound pipe operator (%<>%). This inputs an R object and then, when it’s done running all the code, overwrites that R object with the output. You can think of it as combining the <- and %>% operators. Here’s how you would use it to save the cleaned version of the data to the fl_accidents object name:

fl_accidents %<>% 
  rename_all(.funs = str_to_lower) %>% 
  select(state, county, day, month, year, latitude, longitud, fatals) %>% 
  filter(state == 12) %>% 
  mutate(county = str_pad(county, width = 3, pad = "0")) %>% 
  unite(col = fips, c(state, county), sep = "") %>% 
  unite(col = date, c(month, day, year), sep = "-") %>% 
  mutate(date = mdy(date)) %>% 
  filter(date >= mdy("9-7-2017") & date <= mdy("9-13-2017"))