3.1 Geographical data in a tidy format

The sf package allows you to create a dataframe object with one special characteristic: a special column called “geometry” that contains the geometrical data needed to draw an observation. For example, if you have a dataset of motor vehicle accidents, where each row gives the data for an accident, the geometry column might include the latitude and longitude for the location of the accident. As another example, if you have a data set of the total number of motor vehicle fatalities in a year in each county in a state, the geometry column might have all the latitude and longitude points to form the boundary of each county.

These special dataframes are given a class called sf20 sf. Short for “simple features”, the name of both an R package and the object class created and used by the data. This class includes a special column called “geometry” for geographic information about each observation. Objects with this class can be used for mapping spatial data, but also can be manipulated using tidyverse tools very similarly to tibbles. There are several ways for you to create this special type of dataframe. We’ll create a few to use in the later mapping examples to demonstrate some of the ways to get an sf object.

First, if you have a regular dataframe, you can convert it into an sf object, specifying which parts of the dataframe include geographical information. If you followed all the set-up instructions in the “Prerequisites”, you should have downloaded a dataset called “fl_accidents.csv” and in the “data” subdirectory of the R Project directory for the examples. You can use readr to read it in. If you print out the start of it, you’ll see that it’s got observations (rows) of fatal motor vehicle accidents. These accidents all occurred in Florida within a week of Hurricane Irma’s landfall on September 10, 2017. The columns give the county code (fips), date (date), location (latitude and longitud), and the number of fatalities (fatals).

library("readr")
fl_accidents <- read_csv("data/fl_accidents.csv")
fl_accidents
## # A tibble: 37 x 5
##     fips date       latitude longitud fatals
##    <dbl> <date>        <dbl>    <dbl>  <dbl>
##  1 12031 2017-09-08     30.2    -81.5      1
##  2 12095 2017-09-07     28.5    -81.4      1
##  3 12097 2017-09-08     28.3    -81.3      1
##  4 12095 2017-09-07     28.6    -81.2      1
##  5 12031 2017-09-08     30.2    -81.8      1
##  6 12033 2017-09-07     30.6    -87.4      2
##  7 12023 2017-09-10     30.1    -82.7      1
##  8 12075 2017-09-08     29.6    -82.9      1
##  9 12045 2017-09-09     30.1    -85.3      2
## 10 12031 2017-09-12     30.4    -81.8      1
## # … with 27 more rows

Even though this data has geographical information in it (latitude and longitude), it’s currently just a regular dataframe. To convert it to an sf class object, you can use the st_as_sf function,21 Most of the functions in the st package start with st_. Since R studio allows tab completion, this makes it very easy to look up a function in the package whose name you might have forgotten. Just try typing st_ and then the Tab key in your R console—you should get a pop-up with a list of suggestions for possible function names. specifying the columns with the geogrphical coordinates using the coords parameter.22 Here, the compound pipe operator, %<>%, applies the function to fl_accidents and then overwrites the fl_accidents with the modified version, updating the object so you’re ready to use it for later code. We also want to go ahead and set the projection to something reasonable for latitude-longitude data; the “4326” code is a good pick. That can be set with the st_sf function (for a lot more on map projections, see the resources in “Learn More”).

library("sf")
fl_accidents %<>% 
  st_as_sf(coords = c("longitud", "latitude")) %>% 
  st_sf(crs = 4326)

Now, when you print out fl_accidents, you’ll see some extra information at the top of the print-out, including the objects bounding box (bbox) and projection (epsg, proj4string).

fl_accidents
## Simple feature collection with 37 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -87.3797 ymin: 25.6876 xmax: -80.32332 ymax: 30.65894
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 37 x 4
##     fips date       fatals             geometry
##    <dbl> <date>      <dbl>          <POINT [°]>
##  1 12031 2017-09-08      1   (-81.5407 30.2474)
##  2 12095 2017-09-07      1 (-81.41927 28.51796)
##  3 12097 2017-09-08      1 (-81.34221 28.26149)
##  4 12095 2017-09-07      1 (-81.18188 28.57327)
##  5 12031 2017-09-08      1 (-81.84294 30.22846)
##  6 12033 2017-09-07      2  (-87.3797 30.56341)
##  7 12023 2017-09-10      1 (-82.71024 30.13608)
##  8 12075 2017-09-08      1 (-82.87318 29.56196)
##  9 12045 2017-09-09      2 (-85.25815 30.06198)
## 10 12031 2017-09-12      1 (-81.76086 30.37913)
## # … with 27 more rows

A second way to create an sf object in R is to read in data from a geographic data file, like a shapefile23 shapefile. A format for storing geographical data common for GIS. The data will typically be stored in a directory (often zipped to a single file), with separate files that give geographic data and other characteristics of the data. If you followed the “Prerequisites”, you should have downloaded a shapefile as a subdirectory called “al112017_best_track” in the “data” subdirectory. I downloaded this as a zipped data file from the National Hurricane Center and unzipped it. The shapefile includes shapefiles for the track of Hurricane Irma.

If you look at the “al112017_best_track” directory, you will see that it contains a collection of files starting with one of several roots (e.g., “al112017_lin”) and with one of several suffixes (e.g., “.dbf”, “.prj”, “.shx”).

list.files("data/al112017_best_track/")
##  [1] "al112017_lin.dbf"           "al112017_lin.prj"          
##  [3] "al112017_lin.shp"           "al112017_lin.shp.xml"      
##  [5] "al112017_lin.shx"           "al112017_pts.dbf"          
##  [7] "al112017_pts.prj"           "al112017_pts.shp"          
##  [9] "al112017_pts.shp.xml"       "al112017_pts.shx"          
## [11] "al112017_radii.dbf"         "al112017_radii.prj"        
## [13] "al112017_radii.shp"         "al112017_radii.shp.xml"    
## [15] "al112017_radii.shx"         "al112017_windswath.dbf"    
## [17] "al112017_windswath.prj"     "al112017_windswath.shp"    
## [19] "al112017_windswath.shp.xml" "al112017_windswath.shx"

Each set of files with the same root provides a “layer” of the shapefile, giving a set of geographic information. For example, the “al112017_lin." files provides the line of the hurricane’s track, while the "al112017_windswarth.” layer provides windswaths (how severe the wind was in certain locations surrounding the storm track).

You can read in any of these layers into R as an sf object using st_read and specifying the layer you’d like from the shapefile directory with the layer parameter (put the root of the filenames for the layer you want). Since we’ll be plotting this with the accident data, we should transform the data’s projection from its original projection to the one we set for the fl_accidents data. You can do that with the st_transform function. For example, to read in a line with the track of irma, you can run:

irma_tracks <- st_read("data/al112017_best_track/",
                       layer = "al112017_lin") %>% 
  st_transform(crs = 4326)
## Reading layer `al112017_lin' from data source `/Users/georgianaanderson/Documents/r_workshops/navy_public_health/data/al112017_best_track' using driver `ESRI Shapefile'
## Simple feature collection with 20 features and 3 fields
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: -90.1 ymin: 16.1 xmax: -26.9 ymax: 36.8
## epsg (SRID):    NA
## proj4string:    +proj=longlat +a=6371200 +b=6371200 +no_defs

If you print out irma_tracks, you can see that it looks like a dataframe, but with extra geographic information (bounding box, projection, etc.) as well as a special column for geometry, which gives the latitude and longitude of each accident.

irma_tracks
## Simple feature collection with 20 features and 3 fields
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: -90.1 ymin: 16.1 xmax: -26.9 ymax: 36.8
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##    STORMNUM           STORMTYPE SS                       geometry
## 1        11 Tropical Depression  0 LINESTRING (-26.9 16.1, -28...
## 2        11      Tropical Storm  0 LINESTRING (-28.3 16.2, -29...
## 3        11          Hurricane1  1 LINESTRING (-32.5 16.4, -33...
## 4        11          Hurricane2  2 LINESTRING (-34.2 17.1, -35...
## 5        11          Hurricane3  3 LINESTRING (-35.1 17.5, -36...
## 6        11          Hurricane2  2 LINESTRING (-42.6 18.9, -44...
## 7        11          Hurricane3  3 LINESTRING (-47.9 17.9, -49...
## 8        11          Hurricane4  4 LINESTRING (-53.9 16.7, -55...
## 9        11          Hurricane5  5 LINESTRING (-57.8 16.7, -59...
## 10       11          Hurricane4  4 LINESTRING (-73 21.5, -73.2...

Finally, there are some packages available now that allow you to read data, including spatial data, directly into R from large open data databases, using R functions that wrap the database’s Application Programming Interfaces (APIs). For example, the tigris package lets you pull spatial data into R directly from the US Census. To get spatial data for all the counties in Florida, you can run:

library(tigris)
fl_counties <- counties(state = "FL", cb = TRUE, class = "sf")

If, for any reason, you’re not able to get the above code to work, I pulled and saved this data with the example data, and you can load it with the following code if you followed all the directions in the “Prerequisites” and are having problems with the previous code:

load("data/fl_counties.RData")