Chapter 4 Generalized linear models

4.1 Readings

The readings for this chapter are:

  • Bhaskaran et al. (2013) Provides an overview of time series regression in environmental epidemiology.
  • Vicedo-Cabrera, Sera, and Gasparrini (2019) Provides a tutorial of all the steps for a projecting of health impacts of temperature extremes under climate change. One of the steps is to fit the exposure-response association using present-day data (the section on “Estimation of Exposure-Response Associations” in the paper). In this chapter, we will go into details on that step, and that section of the paper is the only required reading for this chapter. Later in the class, we’ll look at other steps covered in this paper. Supplemental material for this paper is available to download by clicking http://links.lww.com/EDE/B504. You will need the data in this supplement for the exercises for class.
  • B. G. Armstrong, Gasparrini, and Tobias (2014) This paper describes different data structures for case-crossover data, as well as how conditional Poisson regression can be used in some cases to fit a statistical model to these data. Supplemental material for this paper is available at https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-122#Sec13.

The following are supplemental readings (i.e., not required, but may be of interest) associated with the material in this chapter:

  • B. Armstrong (2006) Covers similar material as Bhaskaran et al. (2013), but with more focus on the statistical modeling framework (highly recommended!)
  • Gasparrini and Armstrong (2010) Describes some of the advances made to time series study designs and statistical analysis, specifically in the context of temperature
  • Basu, Dominici, and Samet (2005) Compares time series and case-crossover study designs in the context of exploring temperature and health. Includes a nice illustration of different referent periods, including time-stratified.
  • Imai et al. (2015) Typically, the time series study design covered in this chapter is used to study non-communicable health outcomes. This paper discusses opportunities and limitations in applying a similar framework for infectious disease.
  • Lu and Zeger (2007) Heavier on statistics. This paper shows how, under conditions often common for environmental epidemiology studies, case-crossover and time series methods are equivalent.
  • Gasparrini (2014) Heavier on statistics. This provides the statistical framework for the distributed lag model for environmental epidemiology time series studies.
  • Dunn and Smyth (2018) Introduction to statistical models, moving into regression models and generalized linear models. Chapter in a book that is available online through the CSU library.
  • James et al. (2013) General overview of linear regression, with an R coding “lab” at the end to provide coding examples. Covers model fit, continuous, binary, and categorical covariates, and interaction terms. Chapter in a book that is available online through the CSU library.

4.2 Splines in GLMs

We saw from the last model, with a linear term for mean daily temperature, that the suggested effect on mortality is a decrease in daily mortality counts with increasing temperature. However, as you’ve probably guessed that’s likely not entirely accurate. A linear term for the effect of exposure restricts us to an effect that can be fitted with a straight line (either a null effect or a monotonically increasing or decreasing effect with increasing exposure).

This clearly is problematic in some cases. One example is when exploring the association between temperature and health risk. Based on human physiology, we would expect many health risks to be elevated at temperature extremes, whether those are extreme cold or extreme heat. A linear term would be inadequate to describe this kind of U-shaped association. Other effects might have a threshold—for example, heat stroke might have a very low risk at most temperatures, only increasing with temperature above a certain threshold.

We can capture non-linear patterns in effects, by using different functions of X. Examples are \(\sqrt{X}\), \(X^{2}\), or more complex smoothing functions, such as polynomials or splines. Polynomials might at first make a lot of sense, especially since you’ve likely come across polynomial terms in mathematics classes since grade school. However, it turns out that they have some undesirable properties. A key one is that they can have extreme behavior, particularly when using a high-order polynomial, and particularly outside the range of data that are available to fit the model.

An alternative that is generally preferred for environmental epidemiology studies is the regression spline. The word “spline” originally comes from drafting and engineering (ship building, in particular), where it described a flexible piece of wood or metal that you could use to draw a curved line—it created a curve that was flexible enough—but just flexible enough—to fit a space (see Wikipedia’s very interesting article on flat splines for more).

Splines follow a similar idea in mathematics, making them helpful tools when a line won’t fit your data well. In general, a spline fits together a few simpler functions to create something with a curve or non-linearity. Each simpler function operates within an interval of the data, and then they join together at “knots” along the range of the data. Regression splines are therefore simple parametric smoothing function, which fit separate polynomial in each interval of the range of the predictor; these can be linear, quadratic, and cubic.

The simplest example is a linear spline (also called a piecewise linear function). This type of spline creates non-linearity by having a breakpoint at the knot, allowing the slope of the line to be different in the intervals of data on either side of the knot. The following plot shows an example. Say you want to explore how mean temperature varies by the day in the year (Jan 1 = 1, Jan 2 = 2, and so on) in the London example dataset from the last chapter. Temperature tends to increase with day of year for a while, but then it changes around the start of August, after that decreasing with day of year. This patterns means that a line will give a bad fit for how temperature changes with day of year, since it smooths right through that change. On the other hand, you can get a very reasonable fit using a linear spline with a knot around August 1 (day 213 in the year). These two examples are shown in the following plot, with the linear function fit on the left and the linear spline on the right:

If you were to write out these regression models in mathematical notation, the linear one is very simple:

\[ Y_t = \alpha + \beta X_t \] where \(Y_t\) is the temperature on day \(t\), \(\alpha\) is the model intercept, \(X_t\) is the day of the year on day \(t\), and \(\beta\) is the estimated coefficient for \(X_t\).

The notation for the model with the linear spline is a bit more complex:

\[ Y_t = \alpha + \beta_1 X_t + \beta_2 (X_t - k)_+ \]

Here, \(Y_t\) is again the temperature on day \(t\), \(X_t\) is again the day of the year on day \(t\), and \(\alpha\) is again the intercept. The term \(k\) is the “knot”—the value of \(X\) where we’re letting the slope change. In this example, we’re using \(k = 213\). The term \((X_t - k)_+\) has a special meaning—it takes the value 0 if \(X_t\) is in the interval to the left of the knot, while if \(X_t\) is in the interval to the right of the knot, it takes the value of \(X_t\) minus the knot value:

\[ (X_t - k)_+ = \begin{cases} 0, & \mbox{if } X_t < k \\ X_t - k, & \mbox{if } X_t \ge k \end{cases} \]

In this model, the coefficient \(\beta_1\) estimates the slope of the line to the left of the knot, while \(\beta_2\) estimates how that slope will change to the right of the knot.

Fortunately, we usually won’t have to get this complex in the model notation, especially when we use more complex splines (where the notation would get even more complex). Instead, we’ll often write out the regression equation in a simpler way, just indicating that we’re using a function of the covariate, rather than the covariate directly:

\[ Y_t = \alpha + f(X_t | \mathbf{\beta}) \]

where we can note that \(f(X_t)\) is a function of day of the year (\(X_t\)), fit in this case using a linear spline, and with a set of estimated coefficients \(\mathbf{\beta}\) for that function (see B. Armstrong (2006) for an example of using this model notation).

While a linear spline is the simplest conceptually (and mathematically), it often isn’t very satisfying, because it fits a function with a sharp breakpoint, which often isn’t realistic. For example, the linear spline fit above suggests that the relationship between day of the year and temperature changes abruptly and dramatically on August 1 of the year. In reality, we know that this change in the relationship between day of year and temperature is probably a lot smoother.

To fit smoother shapes, we can move to higher level splines. Cubic splines (“cubic” because they include terms of the covariate up to the third power) are very popular. An example of a cubic spline function is \(X+X^{2}+X^{3}+I((X>X_{0})*(X-X_{0})^3)\). This particular function is a cubic spline with four degrees of freedom (\(df=4\)) and one knot (\(X_{0}\)). A special type of cubic spline called a natural cubic spline is particularly popular. Unlike a polynomial function, a natural cubic spline “behaves” better outside the range of the data used to fit the model—they are constrained to continue on a linear trajectory once they pass beyond the range of the data.

Regression splines can be fit in a GLM via the package splines. Two commonly used examples of regression splines are b-splines and natural cubic splines. Vicedo-Cabrera, Sera, and Gasparrini (2019) uses natural cubic splines, which can be fit with the ns (for “natural spline”) function from the splines package.

While splines are great for fitting non-linear relationships, they do create some challenges in interpreting the results. When you fit a linear relationship for a covariate, you will get a single estimate that helps describe the fitted relationship between that covariate and the outcome variable. However, when you use a non-linear function, you’ll end up with a mix of coefficients associated with that function. Sometimes, you will use splines to control for a potential confounder (as we will in the exercises for this first part of the chapter). In this case, you don’t need to worry about interpreting the estimated coefficients—you’re just trying to control for the variable, rather than inferring anything about how it’s associated with the outcome. In later parts of this chapter, we’ll talk about how to interpret these coefficients if you’re using a spline for the exposure that you’re interested in, when we talk more broadly about basis functions.

Applied: Including a spline in a GLM

For this exercise, you will continue to build up the model that you began in the examples in the previous chapter. The example uses the data provided with one of this chapter’s readings, Vicedo-Cabrera, Sera, and Gasparrini (2019).

  1. Start by fitting a somewhat simple model—how are daily mortality counts associated with (a) a linear and (b) a non-linear function of time? Is a linear term appropriate to describe this association? What types of patterns are captured by a non-linear function that are missed by a linear function?
  2. In the last chapter, the final version of the model used a GLM with an overdispersed Poisson distribution, including control for day of week. Start from this model and add control for long-term and seasonal trends over the study period.
  3. Refine your model to fit for a non-linear, rather than linear, function of temperature in the model. Does a non-linear term seem to be more appropriate than a linear term?

Applied exercise: Example code

  1. Start by fitting a somewhat simple model—how are daily mortality counts associated with (a) a linear and (b) a non-linear function of time?

It is helpful to start by loading the R packages you are likely to need, as well as the example dataset. You may also need to re-load the example data and perform the steps taken to clean it in the last chapter:

# Load some packages that will likely be useful
library(tidyverse)
library(viridis)
library(lubridate)
library(broom)

# Load and clean the data
obs <- read_csv("data/lndn_obs.csv") %>% 
  mutate(dow = wday(date, label = TRUE))

For this first question, the aim is to model the association between time and daily mortality counts within the example data. This approach is often used to explore and, if needed, adjust for temporal factors in the data.

There are a number of factors that can act over time to create patterns in both environmental exposures and health outcomes. For example, there may be changes in air pollution exposures over the years of a study because of changes in regulations or growth or decline of factories and automobile traffic in an area. Changes in health care and in population demographics can cause patterns in health outcomes over the study period. At a shorter, seasonal term, there are also factors that could influence both exposures and outcomes, including seasonal changes in climate, seasonal changes in emissions, and seasonal patterns in health outcomes.

It can be difficult to pinpoint and measure these temporal factors, and so instead a common practice is to include model control based on the time in the study. This can be measured, for example, as the day since the start of the study period.

You can easily add a column for day in study for a dataset that includes date. R saves dates in a special format, which we’re using the in obs dataset:

class(obs$date)
## [1] "Date"

However, this is just a fancy overlay on a value that’s ultimately saved as a number. Like most Unix programs, the date is saved as the number of days since the Unix “epoch”, January 1, 1970. You can take advantage of this convention—if you use as.numeric around a date in R, it will give you a number that gets one unit higher for every new date. Here’s the example for the first date in our example data:

obs$date[1]
## [1] "1990-01-01"
as.numeric(obs$date[1]) 
## [1] 7305

And here’s the example for the next date:

obs$date[2]
## [1] "1990-01-02"
as.numeric(obs$date[2])
## [1] 7306

You can use this convention to add a column that gives days since the first study date. While you could also use the 1:n() call to get a number for each row that goes from 1 to the number of rows, that approach would not catch any “skips” in dates in the data (e.g., missing dates if only warm-season data are included). The use of the dates is more robust:

obs <- obs %>% 
  mutate(time = as.numeric(date) - first(as.numeric(date)))

obs %>% 
  select(date, time)
## # A tibble: 8,279 × 2
##    date        time
##    <date>     <dbl>
##  1 1990-01-01     0
##  2 1990-01-02     1
##  3 1990-01-03     2
##  4 1990-01-04     3
##  5 1990-01-05     4
##  6 1990-01-06     5
##  7 1990-01-07     6
##  8 1990-01-08     7
##  9 1990-01-09     8
## 10 1990-01-10     9
## # ℹ 8,269 more rows

As a next step, it is always useful to use exploratory data analysis to look at the patterns that might exist for an association, before you start designing and fitting the regression model.

ggplot(obs, 
       aes(x = time, y = all)) +
  geom_point(size = 0.5, alpha = 0.5)

There are clear patterns between time and daily mortality counts in these data. First, there is a clear long-term pattern, with mortality rates declining on average over time. Second, there are clear seasonal patterns, with higher mortality generally in the winter and lower rates in the summer.

To model this, we can start with fitting a linear term. In the last chapter, we determined that the mortality outcome data can be fit using a GLM with a Poisson family, allowing for overdispersion as it is common in real-life count data like these. To include time as a linear term, we can just include that column name to the right of the ~ in the model formula:

mod_time <- glm(all ~ time, 
                data = obs, family = "quasipoisson")

You can use the augment function from the broom package to pull out the fitted estimate for each of the original observations and plot that, along with the observed data, to get an idea of what this model has captured:

mod_time %>% 
  augment() %>% 
  ggplot(aes(x = time)) + 
  geom_point(aes(y = all), alpha = 0.4, size = 0.5) + 
  geom_line(aes(y = exp(.fitted)), color = "red") + 
  labs(x = "Date in study", y = "Expected mortality count") 

This linear trend captures the long-term trend in mortality rates fairly well in this case. This won’t always be the case, as there may be some health outcomes—or some study populations—where the long-term pattern over the study period might be less linear than in this example. Further, the linear term is completely unsuccessful in capturing the shorter-term trends in mortality rate. These oscillate, and so would be impossible to capture over multiple years with a linear trend.

Instead, it’s helpful to use a non-linear term for time in the model. We can use a natural cubic spline for this, using the ns function from the splines package. You will need to clarify how flexible the spline function should be, and this can be specified through the degrees of freedom for the spline. A spline with more degrees of freedom will be “wigglier” over a given data range compared to a spline with fewer degrees of freedom. Let’s start by using 158 degrees of freedom, which translates to about 7 degrees of freedom per year:

library(splines)
mod_time_nonlin <- glm(all ~ ns(time, df = 158), 
                       data = obs, family = "quasipoisson")

You can visualize the model results in a similar way to how we visualized the last model. However, there is one extra step. The augment function only carries through columns in the original data (obs) that were directly used in fitting the model. Now that we’re using a transformation of the time column, by wrapping it in ns, the time column is no longer included in the augment output. However, we can easily add it back in using mutate, pulling it from the original obs dataset, and then proceed as before.

mod_time_nonlin %>% 
  augment() %>% 
  mutate(time = obs$time) %>% 
  ggplot(aes(x = time)) + 
  geom_point(aes(y = all), alpha = 0.4, size = 0.5) + 
  geom_line(aes(y = exp(.fitted)), color = "red") + 
  labs(x = "Date in study", y = "Expected mortality count") 

The non-linear term for time has allowed enough flexibility that the model now captures both long-term and seasonal trends in the data.

You might wonder how many degrees of freedom you should use for this time spline. In practice, researchers often using about 6–8 degrees of freedom per year of the study, in the case of year-round data. You can explore how changing the degrees of freedom changes the way the model fits to the observed data. As you use more degrees of freedom, the line will capture very short-term effects, and may start to interfere with the shorter-term associations between environmental exposures and health risk that you are trying to capture. Even in the example model we just fit, for example, it looks like the control for time may be capturing some patterns that were likely caused by heatwaves (the rare summer peaks, including one from the 1995 heatwave). Conversely, if too few degrees of freedom are used, the model will shift to look much more like the linear model, with inadequate control for seasonal patterns.

# A model with many less d.f. for the time spline
mod_time_nonlin_lowdf <- glm(all ~ ns(time, df = 10), 
                             data = obs, family = "quasipoisson")
mod_time_nonlin_lowdf %>% 
  augment() %>% 
  mutate(time = obs$time) %>% 
  ggplot(aes(x = time)) + 
  geom_point(aes(y = all), alpha = 0.4, size = 0.5) + 
  geom_line(aes(y = exp(.fitted)), color = "red") + 
  labs(x = "Date in study", y = "Expected mortality count") 

# A model with many more d.f. for the time spline
# (Takes a little while to run)
mod_time_nonlin_highdf <- glm(all ~ ns(time, df = 400), 
                             data = obs, family = "quasipoisson")
mod_time_nonlin_highdf %>% 
  augment() %>% 
  mutate(time = obs$time) %>% 
  ggplot(aes(x = time)) + 
  geom_point(aes(y = all), alpha = 0.4, size = 0.5) + 
  geom_line(aes(y = exp(.fitted)), color = "red") + 
  labs(x = "Date in study", y = "Expected mortality count") 

In all cases, when you fit a non-linear function of an explanatory variable, it will make the model summary results look much more complicated, e.g.:

mod_time_nonlin_lowdf %>% 
  tidy()
## # A tibble: 11 × 5
##    term                estimate std.error statistic   p.value
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)           5.26     0.00948    555.   0        
##  2 ns(time, df = 10)1   -0.0260   0.0119      -2.18 2.93e-  2
##  3 ns(time, df = 10)2   -0.0860   0.0155      -5.56 2.85e-  8
##  4 ns(time, df = 10)3   -0.114    0.0139      -8.15 4.01e- 16
##  5 ns(time, df = 10)4   -0.196    0.0151     -13.0  4.47e- 38
##  6 ns(time, df = 10)5   -0.187    0.0148     -12.6  2.80e- 36
##  7 ns(time, df = 10)6   -0.315    0.0154     -20.5  5.62e- 91
##  8 ns(time, df = 10)7   -0.337    0.0154     -21.9  1.95e-103
##  9 ns(time, df = 10)8   -0.358    0.0135     -26.5  1.56e-148
## 10 ns(time, df = 10)9   -0.467    0.0244     -19.2  4.49e- 80
## 11 ns(time, df = 10)10  -0.392    0.0126     -31.2  8.01e-202

You can see that there are multiple model coefficients for the variable fit using a spline function, the same as the number of degrees of freedom. These model coefficients are very hard to interpret on their own. When we are using the spline to control for a factor that might serve as a confounder of the association of interest, we typically won’t need to try to interpret these model coefficients—instead, we are interested in accounting for how this factor explains variability in the outcome, without needing to quantify the association as a key result. However, there are also cases where we want to use a spline to fit the association with the exposure that we are interested in. In this case, we will want to be able to interpret model coefficients from the spline. Later in this chapter, we will introduce the dlnm package, which includes functions to both fit and interpret natural cubic splines within GLMs for environmental epidemiology.

  1. Start from the last model created in the last chapter and add control for long-term and seasonal trends over the study period.

The last model fit in the last chapter was the following, which fits for the association between a linear term of temperature and mortality risk, with control for day of week:

mod_ctrl_dow <- glm(all ~ tmean + factor(dow, ordered = FALSE), 
                    data = obs, family = "quasipoisson")

To add control for long-term and seasonal trends, you can take the natural cubic spline function of temperature that you just fit and include it among the explanatory / independent variables from the model in the last chapter. If you want to control for only long-term trends, a linear term of the time column could work, as we discovered in the first part of this chapter’s exercise. However, seasonal trends could certainly confound the association of interest. Mortality rates have a clear seasonal pattern, and temperature does as well, and these patterns create the potential for confounding when we look at how temperature and mortality risk are associated, beyond any seasonally-driven pathways.

mod_ctrl_dow_time <- glm(all ~ tmean + factor(dow, ordered = FALSE) +
                           ns(time, df = 158), 
                         data = obs, family = "quasipoisson")

You can see the influence of this seasonal confounding if you look at the model results. When we look at the results from the model that did not control for long-term and seasonal trends, we get an estimate that mortality rates tend to be lower on days with higher temperature, with a negative term for tmean:

mod_ctrl_dow %>% 
  tidy() %>% 
  filter(term == "tmean")
## # A tibble: 1 × 5
##   term  estimate std.error statistic p.value
##   <chr>    <dbl>     <dbl>     <dbl>   <dbl>
## 1 tmean  -0.0148  0.000354     -41.7       0

Conversely, when we include control for long-term and seasonal trends, the estimated association between mortality rates and temperature is reversed, estimating increased mortality rates on days with higher temperature, controlling for long-term and seasonal trends:

mod_ctrl_dow_time %>% 
  tidy() %>% 
  filter(term == "tmean")
## # A tibble: 1 × 5
##   term  estimate std.error statistic  p.value
##   <chr>    <dbl>     <dbl>     <dbl>    <dbl>
## 1 tmean  0.00370  0.000395      9.36 1.02e-20
  1. Refine your model to fit for a non-linear, rather than linear, function of temperature in the model.

You can use a spline in the same way to fit a non-linear function for the exposure of interest in the model (temperature). We’ll start there. However, as mentioned earlier, it’s a bit tricky to interpret the coefficients from the fit model—you no longer generate a single coefficient for the exposure of interest, but instead several related to the spline. Therefore, once we show how to fit using ns directly, we’ll show how you can do the same thing using specialized functions in the dlnm package. This package includes a lot of nice functions for not only fitting an association using a non-linear term, but also for interpreting the results after the model is fit.

First, here is code that can be used to fit the model using ns directly, similarly to the approach we used to control for temporal patterns with a flexible function:

mod_ctrl_nl_temp <- glm(all ~ ns(tmean, 4) + factor(dow, ordered = FALSE) +
                          ns(time, df = 158), 
                        data = obs, family = "quasipoisson")

We can plot the predicted values from this fitted model (red points in the plot below) compared to the observed data (black dots) using our usual method of using augment to extract the predicted values:

mod_ctrl_nl_temp %>% 
  augment() %>% 
  mutate(tmean = obs$tmean) %>% 
  ggplot(aes(x = tmean)) + 
  geom_point(aes(y = all), alpha = 0.4, size = 0.5) + 
  geom_point(aes(y = exp(.fitted)), color = "red",  size = 0.4) + 
  labs(x = "Daily mean temperature", y = "Expected mortality count") 

However, these predictions are very variable at any given temperature. This reflects how other independent variables, like long-term and seasonal trends, explain variability in mortality. This makes sense, but it makes it a bit hard to investigate the role of temperature specifically. Let’s look, then, at some other options for viewing the results that will help us focus on the association of the exposure we care about (temperature) with the outcome.

The next part has a lot of steps, so it might at first seem confusing. However, fortunately we won’t always have to do all the steps ourselves—there is a nice R package that will help us. We’ll look at the process first, though, and then the easier way to use a package to help with some of the steps. Also, this process gives us a first look at how the idea of basis functions work. We’ll later expand on these to look at non-linear relationships with the exposure at different lag times, using cross-basis functions, so this forms a starting point for moving into the idea of a cross-basis.

First, we need to think about how temperature gets included in the regression model—specifically, as a function of basis variables rather than as a single variable. In other words, to include temperature as a nonlinear function, we’re going to create a structure in the regression equation that includes the variable of temperature in several different terms, in different transformations in each term. I simple example of this is a second-degree polynomial. If we wanted to include a second-order polynomial function of temperature in the regression, then we’d use the following basis:

\[ \beta_1 T + \beta_2 T^2 \]

Notice that here we’re using the same independent variable (\(T\), which stands for daily temperature), but we’re including both untransformed \(T\) and also \(T\) squared. This function of \(T\) might go into the regression equation as something like:

\[ E(Y) = \alpha + \beta_1 T + \beta_2 T^2 + ns(time) + \mathbf{\gamma^{'}D} \] where \(\alpha\) is the intercept, \(ns(time)\) is a natural cubic spline that controls for time (long-term and seasonal trends) and \(\mathbf{D}\) is day of week, with \(\mathbf{\gamma}\) as the set of coefficients associated with day of week. (Here, I’ve included the outcome, \(Y\), untransformed, as you would for a linear regression, but of course the same idea works with Poisson regression, when you’d instead have \(log(E(Y))\) on the left of the equation.)

When you fit a regression model in a program like R, it sets up the observed data into something called a model matrix, where it has the values for each observation for each of the independent variables you want to include, plus a column of “1”s for the intercept, if you model structure includes one. The regression model will fit a coefficient for each of these columns in the model matrix. In a simple case, this model matrix will just repeat each of your independent variables. For example, the model matrix for the model mod_overdisp_reg, which we fit in the last chapter and which included only an intercept and a linear term for tmean, looks like this (the model.matrix function will give you the model matrix of any glm object that you get by running the glm function in R):

mod_ovdisp_reg %>% 
  model.matrix() %>% 
  head()
##   (Intercept)    tmean
## 1           1 3.913589
## 2           1 5.547919
## 3           1 4.385564
## 4           1 5.431046
## 5           1 6.867855
## 6           1 9.232628

We already start using the idea of basis variables when we include categorical variables, like day of week. Here is the model matrix for the mod_ctrl_dow model that we fit in the last chapter:

mod_ctrl_dow %>% 
  model.matrix() %>% 
  head()
##   (Intercept)    tmean factor(dow, ordered = FALSE)Mon
## 1           1 3.913589                               1
## 2           1 5.547919                               0
## 3           1 4.385564                               0
## 4           1 5.431046                               0
## 5           1 6.867855                               0
## 6           1 9.232628                               0
##   factor(dow, ordered = FALSE)Tue factor(dow, ordered = FALSE)Wed
## 1                               0                               0
## 2                               1                               0
## 3                               0                               1
## 4                               0                               0
## 5                               0                               0
## 6                               0                               0
##   factor(dow, ordered = FALSE)Thu factor(dow, ordered = FALSE)Fri
## 1                               0                               0
## 2                               0                               0
## 3                               0                               0
## 4                               1                               0
## 5                               0                               1
## 6                               0                               0
##   factor(dow, ordered = FALSE)Sat
## 1                               0
## 2                               0
## 3                               0
## 4                               0
## 5                               0
## 6                               1

You can see that the regression call broke the day-of-week variable up into a set of indicator variables, which equal either 1 or 0. There will be one less of these than the number of categories for the variable; in otherwords, if the categorical variable took two values (Weekend / Weekday), then this would be covered by a single column; since there are seven days in the week, the day-of-week variable breaks into six (7 - 1) columns. The first level of the categories (Sunday in this case) serves as a baseline and doesn’t get a value. The others levels (Monday, Tuesday, etc.) each get their own indicator variable—so, their own column in the model matrix—which equals “1” on that day (i.e., “1” for the Monday column if the date of the observation is a Monday) and “0” on all other days.

Now let’s go back and look at the example of including temperature in the model as a nonlinear function, starting with the simple example of using a second-degree polynomial. What would the basis for that look like? We can find out by fitting the model and then looking at the model matrix (the I() function lets us specify a transformation of a column from the data when we set up the regression equation structure in glm):

mod_polynomial_temp <- glm(all ~ tmean + I(tmean ^ 2), 
                           data = obs, family = "quasipoisson")
mod_polynomial_temp %>% 
  model.matrix() %>% 
  head()
##   (Intercept)    tmean I(tmean^2)
## 1           1 3.913589   15.31618
## 2           1 5.547919   30.77941
## 3           1 4.385564   19.23317
## 4           1 5.431046   29.49627
## 5           1 6.867855   47.16743
## 6           1 9.232628   85.24142

You can see that this has created two columns based on the temperature variable, one with it untransformed and one with it squared. The regression will estimate coefficients for each of these basis variables of temperature, and then that allows you to fit a model with a function of temperature, rather than solely temperature. Here are the coefficients from this model:

mod_polynomial_temp %>% 
  tidy()
## # A tibble: 3 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  5.31     0.00675       787.  0        
## 2 tmean       -0.0298   0.00126       -23.6 2.27e-119
## 3 I(tmean^2)   0.000667 0.0000538      12.4 5.70e- 35

Therefore, it’s fit the following model to describe the association between temperature and mortality:

\[ log(E(Y)) = 5.3092 - 0.0298T + 0.0007T^2 \] If you want to estimate the relative risk of mortality at 15 degrees Celsius versus 10 degrees Celsius with this model (which is confounded by long-term and seasonal trends, so has some issues we’ll want to fix), you can take the following process. Take the two equations for the expected mortality when \(T\) is 15 and 10, respectively:

\[ log(E(Y|T=15)) = 5.3092 - 0.0298(15) + 0.0007(15^2) \\ log(E(Y|T=10)) = 5.3092 - 0.0298(10) + 0.0007(10^2) \] If you subtract the second from the first, you get:

\[ log(E(Y|T=15)) - log(E(Y|T=10)) = (5.3092 - 5.3092) - 0.0298(15 - 10) + 0.0007(15^2 - 10^2) \] which simplifies to (if the left part isn’t clear, review the rules for how you can manipulate logarithms):

\[ log(\frac{E(Y|T=15)}{E(Y|T=10)}) = - 0.0298(5) + 0.0007(125) = -0.0615 \] Exponentiate both sides, to get to the relative risk at 15 degrees versus 10 degrees (in other words, the ratio of expected mortality at 15 degrees to that at 10 degrees):

\[ \frac{E(Y|T=15)}{E(Y|T=10)} = e^{-0.0615} = 0.94 \] You can see that the basis variables are great in helping us explore how an independent variable might be related to the outcome in a non-linear way, but there’s a bit of a cost in terms of us needing to take some extra steps to interpret the results from that model. Also, note that there’s not a single coefficient that we can extract from the model as a summary of the relationship. Instead, we needed to pick a reference temperature (10 degrees in this example) and compare to that to get an estimated relative risk. We’ll see the same pattern as we move to using natural cubic splines to create the basis variables for temperature.

Now let’s move to a spline. The function the spline runs to transform the temperature variable into the different basis variables is more complex than for the polynomial example, but the result is similar: you get several columns to fit in the model for a variable, compared to the single column you would include if you were only fitting a linear term. If you take a look at the output of running the ns function on temperature in the data, you can see that it creates several new columns (one for each degree of freedom), which will be the basis variables in the regression:

ns(obs$tmean, df = 4) %>% 
  head()
##              1           2         3          4
## [1,] 0.1769619 -0.20432696 0.4777110 -0.2733841
## [2,] 0.2860253 -0.19686120 0.4602563 -0.2633951
## [3,] 0.2049283 -0.20410506 0.4771922 -0.2730872
## [4,] 0.2770456 -0.19804188 0.4630167 -0.2649748
## [5,] 0.4012496 -0.17595974 0.4113892 -0.2354295
## [6,] 0.6581858 -0.09844302 0.2476396 -0.1417190

The output from ns also has some metadata, included in the attributes of the object, that have some other information about the spline function, including where the knots were placed and where the boundary knots (which are at the outer ranges of the data) are placed. You can the str function to explore both the data and metadata stored in this object:

ns(obs$tmean, df = 4) %>% 
  str()
##  'ns' num [1:8279, 1:4] 0.177 0.286 0.205 0.277 0.401 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "1" "2" "3" "4"
##  - attr(*, "degree")= int 3
##  - attr(*, "knots")= num [1:3] 7.47 11.47 15.93
##  - attr(*, "Boundary.knots")= num [1:2] -5.5 29.1
##  - attr(*, "intercept")= logi FALSE

This is saying, for example, that the internal knots for this spline were put at the 25th, 50th, and 75th quantiles of the temperature data, where were 7.47 degrees, 11.47 degrees, and 15.93 degrees. (The defaults for ns is to place knots evenly at percentiles, based on the number of degrees of freedom you specify. You can change this by placing the knots “by hand”, using the knots argument in the ns function.)

This spline object can be used not just for its basis values, but also to “predict” new basis values for a new set of temperature values. For example, you could figure out what the basis values from this spline would be for every degree of temperature between -6 and 29 using the following code:

temp_spline <- ns(obs$tmean, df = 4)
temp_spline_preds <- predict(temp_spline, newx = -5:30)
temp_spline_preds %>% 
  head()
##                 1           2          3           4
## [1,] 2.689545e-05 -0.01606406 0.03755734 -0.02149328
## [2,] 7.189726e-04 -0.04768237 0.11148013 -0.06379775
## [3,] 3.321933e-03 -0.07825364 0.18295494 -0.10470130
## [4,] 9.107577e-03 -0.10708098 0.25035250 -0.14327152
## [5,] 1.934771e-02 -0.13346753 0.31204355 -0.17857602
## [6,] 3.531412e-02 -0.15671641 0.36639881 -0.20968240

This is handy, because it will help us visualize the relationship we fit in a regression model—we can start with these basis values for each degree in our temperature range, and then use the regression coefficients for each basis variable to estimate relative risk at that temperature compared to a reference temperature, exactly as we did in the equations for the polynomial function of temperature earlier, comparing 15 degrees C to the reference of 10 degrees.

All we need now are the regression coefficients for each of these temperature basis variables. We can extract those from the model we fit earlier, where we included ns(tmean, 4) as one of the model terms. Here, I’m using tidy to get the model coefficients and then, because there are lots of them (from fitting a spline for time with lots of degrees of freedom), I’m using the str_detect function from the stringr package to pick out just those with “tmean” in the term column:

mod_ctrl_nl_temp %>% 
  tidy() %>% 
  filter(str_detect(term, "tmean"))
## # A tibble: 4 × 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 ns(tmean, 4)1  -0.0700    0.0135    -5.18  2.30e- 7
## 2 ns(tmean, 4)2  -0.0660    0.0126    -5.24  1.61e- 7
## 3 ns(tmean, 4)3   0.0110    0.0313     0.350 7.26e- 1
## 4 ns(tmean, 4)4   0.347     0.0177    19.6   2.02e-83

You can add on pull to pull out just the estimate column as a vector, which will be helpful when we want to multiple these coefficients by the basis values for each degree of temperature across our temperature range:

temp_spline_ests <- mod_ctrl_nl_temp %>% 
  tidy() %>% 
  filter(str_detect(term, "tmean")) %>% 
  pull(estimate)
temp_spline_ests
## [1] -0.06995454 -0.06596939  0.01095574  0.34659715

Now, let’s put this together to see how relative risk of mortality changes as you move across the temperature range! First, we can set up a dataframe that has a column with each unit of temperature across our range—these are the temperatures where we want to estimate relative risk—and then the estimated relative risk at that temperature. By default, we’ll be comparing to the lowest temperature in the original data, but we’ll talk in a minute about how to adjust to a different reference temperature. You can use matrix multiplication (%*%) as a shorthand way to multiple each column of the spline basis variables from temp_spline_preds by its estimated coefficient from the regression model, saved in temp_spline_ests, and then add all of those values together (this is the same idea as what we did in the equations earlier, for the polynomial basis). Then, to get from log relative risk to relative risk, we’ll exponentiate that with exp. You can see the first rows of the results below:

pred_temp_function <- tibble(
  temp = -5:30, 
  temp_func = temp_spline_preds %*% temp_spline_ests,
  rr = exp(temp_func)
)

pred_temp_function %>% 
  head()
## # A tibble: 6 × 3
##    temp temp_func[,1] rr[,1]
##   <int>         <dbl>  <dbl>
## 1    -5      -0.00598  0.994
## 2    -4      -0.0178   0.982
## 3    -3      -0.0294   0.971
## 4    -2      -0.0405   0.960
## 5    -1      -0.0510   0.950
## 6     0      -0.0608   0.941

This is now very easy to plot, with a reference line added at a relative risk of 1.0:

ggplot(pred_temp_function, aes(x = temp, y = rr)) + 
  geom_point() + 
  geom_line() + 
  labs(x = "Temperature", 
       y = "Relative Risk") + 
  geom_hline(yintercept = 1.0, color = "red", linetype = 2)

We might want to shift this, so we’re comparing the temperature at which mortality risk is lowest (sometimes called the minimum mortality temperature). This tends to be at milder temperatures, in the middle of our range, rather than at the minimum temperature in the range. To start, let’s see what temperature aligns with the lowest relative risk of mortality:

pred_temp_function %>% 
  filter(rr == min(rr))
## Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
## ℹ Please use one dimensional logical vectors instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 1 × 3
##    temp temp_func[,1] rr[,1]
##   <int>         <dbl>  <dbl>
## 1     6       -0.0938  0.911

We can therefore realign so that the relative risk equals 1.0 when the temperature is 7 degrees C, and all other relative risks are relative to a reference temperature of 7 degrees C:

pred_temp_function <- pred_temp_function %>% 
  mutate(temp_func_reset = temp_func - temp_func[temp == 7],
         rr_reset = exp(temp_func_reset)
)

ggplot(pred_temp_function, aes(x = temp, y = rr_reset)) + 
  geom_point() + 
  geom_line() + 
  labs(x = "Temperature", 
       y = "Relative Risk") + 
  geom_hline(yintercept = 1.0, color = "red", linetype = 2)

This is a fairly cumbersome process, as you’ve seen with this example, and it would be a pain if we had to do it every time we wanted to visualize and explore results from fitting regressions that include basis variables from splines. Fortunately, there’s a nice R package called dlnm (for “distributed lag nonlinear models”) that will help take care of a lot of these “under the hood” steps for us. Even better, this package will let us move on to more complex crossbasis functions, where we fit non-linear functions in two dimensions, and help us visualize and interpret results from those models.

To start, make sure you have the dlnm package installed on your computer, and then load it in your R session. This package has a function called crossbasis that lets you build a crossbasis function to use in a regression model. We’ll talk more about these types of functions later in this chapter; for right now, we’ll be a bit simpler and just use it to create our spline function of temperature.

In the crossbasis function, you start by putting in the vector of the variable that you want to expand into basis variables. In our case, this is temperature, which we have saved as the tmean column in the obs dataset. You use the argvar to give some information about the basis you want to use for that variable. This will both include the type of basis function ("ns" for a natural cubic spline), and then also arguments you want to pass to that basis function, like degrees of freedom (df) or knot locations (knots) if you’re fitting a spline. The other arguments allow you to specify a function of lagged time; we won’t use that yet, so you can just include lag = 0 and arglag = list(fun = "integer") to create a crossbasis where we only consider same-day effects in a simple way. If you look at the output, you’ll see it’s very similar to the basis created by the ns call earlier (in fact, the values in each column should be identical):

library(dlnm)
temp_basis <- crossbasis(obs$tmean, lag = 0, 
                         argvar = list(fun = "ns", df = 4), 
                         arglag = list(fun = "integer"))
temp_basis %>% 
  head()
##          v1.l1       v2.l1     v3.l1      v4.l1
## [1,] 0.1769619 -0.20432696 0.4777110 -0.2733841
## [2,] 0.2860253 -0.19686120 0.4602563 -0.2633951
## [3,] 0.2049283 -0.20410506 0.4771922 -0.2730872
## [4,] 0.2770456 -0.19804188 0.4630167 -0.2649748
## [5,] 0.4012496 -0.17595974 0.4113892 -0.2354295
## [6,] 0.6581858 -0.09844302 0.2476396 -0.1417190

To estimate the regression coefficients, we’ll put this whole crossbasis in as one of our terms in the glm regression equation:

dlnm_mod_1 <- glm(all ~ temp_basis + factor(dow, ordered = FALSE) +
                          ns(time, df = 158), 
                        data = obs, family = "quasipoisson")

As when we fit the spline earlier, you can see that this gives us a set of coefficients, one for each column in the matrix of crossbasis variables:

dlnm_mod_1 %>% 
  tidy() %>% 
  filter(str_detect(term, "temp_basis"))
## # A tibble: 4 × 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 temp_basisv1.l1  -0.0700    0.0135    -5.18  2.30e- 7
## 2 temp_basisv2.l1  -0.0660    0.0126    -5.24  1.61e- 7
## 3 temp_basisv3.l1   0.0110    0.0313     0.350 7.26e- 1
## 4 temp_basisv4.l1   0.347     0.0177    19.6   2.02e-83

However, there’s an advantage with using crossbasis, even though it’s looked pretty similar to using ns up to now. That’s that there are some special functions that let us predict and visualize the model results without having to do all the work we did before.

For example, there’s a function called crosspred that will give us a number of values, including the estimated relative risk compared to a reference value. To use this, we need to input the object name of our crossbasis object (temp_basis) and the name of our regression model object (dlnm_mod_1). We also want to tell it which temperature we want to use as our reference (cen = 7 to compare everything to 7 degrees C) and what interval we want for predictions (by = 1 will give us an estimate for every degree temperature along our range). The output is a list with a lot of elements, which you can get a view of with str:

crosspred(basis = temp_basis, model = dlnm_mod_1, cen = 7, by = 1) %>% 
  str()
## List of 19
##  $ predvar     : num [1:35] -5 -4 -3 -2 -1 0 1 2 3 4 ...
##  $ cen         : num 7
##  $ lag         : num [1:2] 0 0
##  $ bylag       : num 1
##  $ coefficients: Named num [1:4] -0.07 -0.066 0.011 0.347
##   ..- attr(*, "names")= chr [1:4] "temp_basisv1.l1" "temp_basisv2.l1" "temp_basisv3.l1" "temp_basisv4.l1"
##  $ vcov        : num [1:4, 1:4] 1.83e-04 1.12e-04 3.85e-04 6.58e-05 1.12e-04 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "temp_basisv1.l1" "temp_basisv2.l1" "temp_basisv3.l1" "temp_basisv4.l1"
##   .. ..$ : chr [1:4] "temp_basisv1.l1" "temp_basisv2.l1" "temp_basisv3.l1" "temp_basisv4.l1"
##  $ matfit      : num [1:35, 1] 0.0874 0.0756 0.064 0.0529 0.0424 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35] "-5" "-4" "-3" "-2" ...
##   .. ..$ : chr [1, 1] "lag0"
##  $ matse       : num [1:35, 1] 0.01435 0.01254 0.01077 0.00907 0.00746 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35] "-5" "-4" "-3" "-2" ...
##   .. ..$ : chr [1, 1] "lag0"
##  $ allfit      : Named num [1:35] 0.0874 0.0756 0.064 0.0529 0.0424 ...
##   ..- attr(*, "names")= chr [1:35] "-5" "-4" "-3" "-2" ...
##  $ allse       : Named num [1:35] 0.01435 0.01254 0.01077 0.00907 0.00746 ...
##   ..- attr(*, "names")= chr [1:35] "-5" "-4" "-3" "-2" ...
##  $ matRRfit    : num [1:35, 1] 1.09 1.08 1.07 1.05 1.04 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35] "-5" "-4" "-3" "-2" ...
##   .. ..$ : chr [1, 1] "lag0"
##  $ matRRlow    : num [1:35, 1] 1.06 1.05 1.04 1.04 1.03 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35] "-5" "-4" "-3" "-2" ...
##   .. ..$ : chr [1, 1] "lag0"
##  $ matRRhigh   : num [1:35, 1] 1.12 1.11 1.09 1.07 1.06 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35] "-5" "-4" "-3" "-2" ...
##   .. ..$ : chr [1, 1] "lag0"
##  $ allRRfit    : Named num [1:35] 1.09 1.08 1.07 1.05 1.04 ...
##   ..- attr(*, "names")= chr [1:35] "-5" "-4" "-3" "-2" ...
##  $ allRRlow    : Named num [1:35] 1.06 1.05 1.04 1.04 1.03 ...
##   ..- attr(*, "names")= chr [1:35] "-5" "-4" "-3" "-2" ...
##  $ allRRhigh   : Named num [1:35] 1.12 1.11 1.09 1.07 1.06 ...
##   ..- attr(*, "names")= chr [1:35] "-5" "-4" "-3" "-2" ...
##  $ ci.level    : num 0.95
##  $ model.class : chr [1:2] "glm" "lm"
##  $ model.link  : chr "log"
##  - attr(*, "class")= chr "crosspred"

The one we’ll look at right now is allRRfit. You can extract it with (here I’m showing how you can do it with piping and the pluck function to pull out an element of a list, but you could do other approaches, too):

est_rr <- dlnm_mod_1 %>% 
  crosspred(basis = temp_basis, model = ., cen = 7, by = 1) %>% 
  pluck("allRRfit")
est_rr
##        -5        -4        -3        -2        -1         0         1         2 
## 1.0913393 1.0785207 1.0661255 1.0543222 1.0432720 1.0331298 1.0240458 1.0161668 
##         3         4         5         6         7         8         9        10 
## 1.0096382 1.0046057 1.0012180 0.9996288 1.0000000 1.0024680 1.0064490 1.0106777 
##        11        12        13        14        15        16        17        18 
## 1.0138437 1.0147024 1.0135774 1.0122284 1.0124622 1.0160921 1.0245314 1.0378080 
##        19        20        21        22        23        24        25        26 
## 1.0557076 1.0780540 1.1046963 1.1354974 1.1703226 1.2090286 1.2514526 1.2974015 
##        27        28        29 
## 1.3466411 1.3988856 1.4537866

To make it easier to work with, let’s put this in a dataframe. Note that the temperatures are included as the names in the vector, so we can extract those with names:

dlnm_temp_function <- tibble(tmean = as.numeric(names(est_rr)), 
                             rr = est_rr)

dlnm_temp_function %>% 
  head()
## # A tibble: 6 × 2
##   tmean    rr
##   <dbl> <dbl>
## 1    -5  1.09
## 2    -4  1.08
## 3    -3  1.07
## 4    -2  1.05
## 5    -1  1.04
## 6     0  1.03

This has gotten us (much more quickly!) to estimates of the relative risk of mortality at each temperature compared to a reference of 7 degrees C. We can plot this the same way we did earlier, and you’ll notice that this plot is identical to the one we created based on the regression with ns earlier:

ggplot(dlnm_temp_function, aes(x = tmean, y = rr)) + 
  geom_point() + 
  geom_line() + 
  labs(x = "Temperature", y = "Relative risk") + 
  geom_hline(yintercept = 1.0, color = "red", linetype = 2)

The dlnm package also includes some functions specifically for plotting. One downside is that they’re in base R, rather than based on ggplot2, which makes them a bit harder to customize. However, they’re a great way to get a first look at your results:

dlnm_mod_1 %>% 
  crosspred(basis = temp_basis, model = ., cen = 7, by = 1) %>% 
  plot()

Don’t worry if all of this about basis functions is a lot! If this is the first time you’ve seen this, it will likely take a few passes to get comfortable with it—and while the idea of basis variables is fairly straightforward, it’s certainly not straightforward to figure out how to interpret the results of the model that you fit. It is worth the effort, though, as this is a very powerful tool when working with regression modeling in environmental epidemiology.

4.3 Distributed lags and cross-basis functions in GLMs

Next, let’s explore how we can add into our model something that lets us look at delayed effects and the potential for mortality displacement. So far, we have only considered how temperature affects mortality risk on the day of exposure—in other words, if it is very hot today, does risk of mortality go up substantially today? For many exposures, however, risk can persist after the day of exposure.

In some cases, this may be caused by a delay in detection of the outcome in the data we have. For example, extreme heat today could precipitate a respiratory or cardiac event that the person may not seek treatment for or result in an outcome like mortality until tomorrow, and so in the data the outcome would be recorded at a one-day delay from the exposure.

In other cases, the path from the exposure to the outcome may take several days. For example, cold temperatures are often found to be associated with respiratory outcomes a week or more after exposures (something you can look for in the example data!), and one potential pathway is that cold temperatures might promote the spread of respiratory infectious disease, like influenza and colds, both through making people stay inside in more crowded, less ventilated conditions and also through effects on humidity and other conditions that might influence the behavior of respiratory droplets, which are key players in spreading some respiratory diseases. Since there is an incubation time up to several days for these infectious diseases, we might expect a lagged association between temperature and respiratory outcomes under this scenario.

There is another lagged pattern that can be important to check for, as well—one that indicates mortality displacement. For many ambient environmental exposures, there are clear indications that associations with health outcomes are strongest among older adults or other populations with a disproportionately high percent of people who are frail before exposure. One question that comes up, then, is if, when an exposure is associated with excess deaths, these are deaths that are displaced in time by only a few days or weeks, rather than deaths that represent a larger time of life lost. This question—of whether deaths observed to be associated with an environmental exposure were displaced by only a small amount of time—can be explored by investigating whether an observed increase in mortality across a community during an exposure is then offset by a lower-than-expected rate of mortality in the following days and weeks.This phenomenon is sometimes also referred to as ‘harvesting’ or a ‘harvesting effect’.

All of these questions can be explored through a type of model called a distributed lag model. Conceptually, these models investigate how an exposure today is associated with risk of a health outcome not only today, but also in the following days. In time-series studies where we typically want to estimate potential short-term and/or acute effects, and because we are concerned about potential confounding by seasonal trends—and have incorporated control for these trends in our models—we often restrict these models to only look up to about a month following exposure. However, this can help in identifying delayed effects of an exposure days to weeks following the initial exposure, and can also help in determining the role of mortality displacement in accounting for some or all of the excess deaths that are associated with an exposure. While some environmental exposures of interest have mostly immediate effects (e.g., hot temperature), others, including many air pollutants, tend to have effects that are stretched over a longer period of one or more days from exposure. As a result, distributed lag models are used widely in environmental epidemiology studies of time series data, to ensure that important associations aren’t missed by limiting the model to focus on risk that are detectable on the same day as the exposure. They can also help us identify influential windows of exposure as well as help eliminate potential confounding by exposures occurring at proximal times to the window of interest. For example, today’s exposure may appear to have an effect on a health outcome even if in reality yesterday’s exposure is the influential one and we don’t control for it, because today’s and yesterday’s exposure to temperature or air pollution are likely to be very similar.

To fit a distributed lag model, you can start with a fairly straightforward extension of the regression models we’ve been fitting, especially if you are investigating an exposure that can be plausibly fit with a linear term (rather than a more complex non-linear basis function) in the regression model. Say you are starting, for example, from the following regression model:

\[ log(E(Y)) = \alpha + \beta X + \mathbf{\gamma^{'}W} + ns(time) \]

In this model, you are using a Poisson regression to investigate how risk of the health outcome (\(Y\)) changes for every unit increase in the exposure (\(X\)), assuming a linear relationship between the exposure and the log of the health outcome. The model can include control for confounders like day of week (\(\mathbf{W}\)) and long-term and seasonal time trends (\(ns(time)\)). This model should look familiar from some of the models we’ve fit earlier to the example temperature data.

To expand this model to explore associations of temperature at different lags, the simplest approach is to add a term for each lag—in other words, instead of only having a term for temperature on the same day as the outcome measurement, we’ll also include a term for temperature the previous day, and the day before that, and so on. For example, here’s what the model would look like if we included terms up to lag 2:

\[ log(E(Y)) = \alpha + \beta_{0} X_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \mathbf{\gamma^{'}W} + ns(time) \] where \(X_0\) is temperature on the same day as the measured outcome \(Y\) (with an associated coefficient \(\beta_0\)), \(X_1\) is the temperature the day before (with an associated coefficient \(\beta_1\)), and \(X_2\) is the temperature two days before (with an associated coefficient \(\beta_2\)). The interpretation of each of these coefficients is similar—\(exp(\beta_2)\), for example, gives our estimate of the change in the relative risk of the outcome \(Y\) for every one-unit increase in the exposure two days prior.

It can become cumbersome to write out the separate terms for each lag day, so you’ll often see equations for distributed lag models written using a shorter approach with the \(\sum\) symbol. The \(\sum\) symbol expresses several terms that are added together and so is well-suited for shortening the expression of a distributed lag model. For example, the previous model equation could be rewritten as:

\[ log(E(Y)) = \alpha + \sum_{l=0}^2{\beta_lX_l} + \mathbf{\gamma^{'}W} + ns(time) \]

This version of a distributed lag model is nice because it is fairly simple—we’re taking the idea of fitting a linear term for an exposure, and then expanding it to include the exposure on earlier days by adding the same type of term for exposure measured at each lag. However, it does have some downsides. The main downside is that exposure tends to be very strongly correlated from one day to the next. Any time you put different terms in a regression model that are correlated, you create the risk of collinearity, and associated problems with fitting the model. The term collinearity refers to the idea that you have two or more columns in your model matrix that give identical (or, if you loosen the definition a bit, very similar) information. To fit coefficients for each term, the modeling algorithm is trying to identify weights to place on each column in the model matrix that result in a model that gives the best likelihood of the data you see. If two or more columns have (almost) identical information, then the algorithm struggles to determine how to distribute weight across these two columns. The implications are that you can end up with a lot of instability in model coefficients for the columns that are collinear, as well as very inflated estimates of the standard error. Long story short, it can cause problems in your regression model if you include independent variables that are very strongly correlated with each other, and so we usually try to avoid doing that.

There are alternative distributed lag models that can help avoid this instability that can arise from fitting a distributed lag by using a separate term for each lag. All of them share a common feature—they somehow fit a function of the lags instead of each separate lag, so that we end up adding fewer terms to the regression model (and, in turn, are somewhat constraining the pattern that the distributed lag effects follow). At the most extreme, you can use a function that reduces all the lag terms to a single term, which you can do by averaging the exposure across all lags. In other words, you could take the same-day temperature, yesterday’s temperature, and the temperature the day before that, and average those three temperatures to create the term you put in the model. The model equation in this case would look like:

\[ log(E(Y)) = \alpha + \beta_{\overline{0-2}} X_{\overline{0-2}} + \mathbf{\gamma^{'}W} + ns(time) \] where \(\overline{0-2}\) represents an average over lags 0 to 2, and so \(X_{\overline{0-2}}\) is the average of the exposure on lags 0, 1, and 2 compared to the day that \(Y\) is measured.

This approach avoids any issues with collinearity across exposure terms from different lags, because you’re only including a single term for exposure in the model. However, there is a downside—this model will provide you with a single estimate for the lagged effects, without allowing you to explore patterns across lags (for example, is the association mostly on the day of exposure, or are there some delayed effects on following days?). Instead, the main assumption is that the effect is constant across all days in the window of interest.

Another approach is more complex, but it allows you to explore patterns across lags. Instead of reducing the lagged exposure to a single term in the model, you can reduce it to a few terms that represent the basis for a smooth function. Often, a polynomial or natural cubic spline function will be used for this. The details of setting up the model become more complex, but fortunately there are software packages (like the dlnm package in R) that help in managing this set-up, as we’ll explore in the exercise.

Packages like dlnm can help in extending the complexity of the distributed lag model in other ways, too. So far, we have explored using a distributed lag model where we are happy to model the exposure using a linear term. As we’ve seen in earlier exercises, this isn’t ideal for temperature. Instead, temperature has a non-linear association with mortality risk, with the lowest risk at mild temperatures and then increased risk at both lower (colder) and higher (hotter) temperatures.

We can fit a GLM that incorporates the exposure using a non-linear shape while, at the same time, including a non-linear function to describe how the association evolves across lagged time following the exposure. The approach is to extend the idea of using a basis function—instead of using a basis function in a single dimension, we’ll use a cross of basis functions in two dimensions. One dimension is the level of exposure (e.g., cold to hot temperatures), and the other is the timing between the exposure and the outcome (e.g., same-day to lagged by many days). In terms of the mechanics of building a model matrix to use to fit the regression model, everything follows directly from the idea we explored earlier of using a function like a spline to create basis variables in that model matrix for a term with a non-linear association with the outcome variable. However, the mechanics do get quite cumbersome as so many terms are included, and so again we will use the dlnm package to handle these mechanics as we start to incorporate these cross-basis functions within a GLM.

Applied: Including a cross-basis function for exposure-lag-response in a GLM

For this exercise, you will continue to build up the model that you began in the examples in the previous chapter. The example uses the data provided with one of this chapter’s readings, Vicedo-Cabrera, Sera, and Gasparrini (2019).

  1. Start with the simple model describing how are daily mortality counts are associated with a linear function of temperature. Include control for day of week and long-term / seasonal trends. Add to this model a term that describes the lagged effects of temperature up to a week following exposure, using a separate model term for each lag. What patterns do you detect in the lagged effects of temperature?
  2. Start with the model from the first question, but change it to use a smooth function of lag to detect lagged effects, rather than fitting a seperate term for each lag. Extend the model to look at lagged effects up to a month following exposure. What patterns do you see in these lagged effects?
  3. Refine your model to fit for a non-linear, rather than linear, function of temperature, while also incorporating a function to describe lagged effects up to a month following exposure. Based on this model, what does the association between temperature and mortality look like at lag 0 days? How about at lag 21 days? What does the pattern of lagged effects look like when comparing the relative risk of mortality at a hot temperature (e.g., 28 degrees C) to a milder temperature (7 degrees C)? What about when comparing mortality at a cold temperature (e.g., -4 degrees C) to a milder temperature (7 degrees C)? If you use a constant term for the effect of exposure over a month as opposed to the smooth function you just used, how does the overall cumulative RR compare?
  4. Assume that we are interested in the potential effect of a 5-day heatwave occurring on lags 0 to 4? You can assess this as the effect of hot temperature (e.g., 28 degrees C) during these lags, while the temperature remains warm but more mild (e.g., 20 degrees C) the rest of the month, against constant 20 degrees C for the whole month. What is the effect under the ‘constant’ model? How about the smooth function lag model?

Applied exercise: Example code

  1. Start with the simple model describing how are daily mortality counts are associated with a linear function of temperature. Include control for day of week and long-term / seasonal trends. Add to this model a term that describes the lagged effects of temperature up to a week following exposure, using a separate model term for each lag. What patterns do you detect in the lagged effects of temperature?

We’ll start by looking at distributed lag associations by fitting a model with separate terms for each of the lags we want to fit. Here, we want to fit up to a week, so we’ll want eight terms in total: ones for the same day (\(X_0\)), the previous day (\(X_1\)), two days before (\(X_2\)), three days before (\(X_3\)), four days before (\(X_4\)), five days before (\(X_5\)), six days before (\(X_6\)), and seven days before (\(X_7\)).

Right now, our data have temperature lined up with the date of the recorded deaths:

obs %>% 
  select(date, all, tmean) %>% 
  head()
## # A tibble: 6 × 3
##   date         all tmean
##   <date>     <dbl> <dbl>
## 1 1990-01-01   220  3.91
## 2 1990-01-02   257  5.55
## 3 1990-01-03   245  4.39
## 4 1990-01-04   226  5.43
## 5 1990-01-05   236  6.87
## 6 1990-01-06   235  9.23

To add lagged temperature to the model, then, we’ll need to add some columns to our data, so that the row for an observation includes not only the temperature on that day, but also on each of the seven previous days. The lag function from the tidyverse suite of packages can be used to create these columns:

obs <- obs %>% 
  mutate(tmean_1 = lag(tmean, n = 1), 
         tmean_2 = lag(tmean, n = 2), 
         tmean_3 = lag(tmean, n = 3), 
         tmean_4 = lag(tmean, n = 4), 
         tmean_5 = lag(tmean, n = 5), 
         tmean_6 = lag(tmean, n = 6), 
         tmean_7 = lag(tmean, n = 7))

You can see below that each of these have taken tmean and then offset it by moving it down by one or more rows (down one row for lag 1, two for lag 2, etc.). As a result, we now have columns that give us today’s temperature, yesterday’s, and so on, all lined up with the row with the observation for daily mortality:

obs %>% 
  select(date, all, tmean, tmean_1:tmean_7) %>% 
  head()
## # A tibble: 6 × 10
##   date         all tmean tmean_1 tmean_2 tmean_3 tmean_4 tmean_5 tmean_6 tmean_7
##   <date>     <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1990-01-01   220  3.91   NA      NA      NA      NA      NA         NA      NA
## 2 1990-01-02   257  5.55    3.91   NA      NA      NA      NA         NA      NA
## 3 1990-01-03   245  4.39    5.55    3.91   NA      NA      NA         NA      NA
## 4 1990-01-04   226  5.43    4.39    5.55    3.91   NA      NA         NA      NA
## 5 1990-01-05   236  6.87    5.43    4.39    5.55    3.91   NA         NA      NA
## 6 1990-01-06   235  9.23    6.87    5.43    4.39    5.55    3.91      NA      NA

You’ll notice that this process has created some missing data right at the beginning of the dataset—we don’t know what the temperature was the day before Jan. 1, 1990, for example, since the study data starts on Jan. 1, 1990. With most time series, we’ll have plenty of days in the study period, so we’ll usually just allow these early dates to drop when the model is fit.

Now we’ll add all these terms to our regression model:

dist_lag_mod_1 <- glm(all ~ tmean + tmean_1 + tmean_2 + tmean_3 + tmean_4 + 
                        tmean_5 + tmean_6 + tmean_7 + 
                        factor(dow, ordered = FALSE) +
                          ns(time, df = 158), 
                        data = obs, family = "quasipoisson")

If you look at the output from the model, you can see that a regression coefficient has been fit for each of these distributed lag terms:

dist_lag_mod_1 %>% 
  tidy() %>% 
  filter(str_detect(term, "tmean"))
## # A tibble: 8 × 5
##   term     estimate std.error statistic  p.value
##   <chr>       <dbl>     <dbl>     <dbl>    <dbl>
## 1 tmean    0.00749   0.000592    12.6   2.67e-36
## 2 tmean_1 -0.00255   0.000788    -3.23  1.25e- 3
## 3 tmean_2 -0.00152   0.000797    -1.91  5.68e- 2
## 4 tmean_3 -0.00243   0.000798    -3.05  2.31e- 3
## 5 tmean_4 -0.00106   0.000798    -1.32  1.86e- 1
## 6 tmean_5 -0.000515  0.000797    -0.645 5.19e- 1
## 7 tmean_6 -0.00107   0.000789    -1.36  1.73e- 1
## 8 tmean_7 -0.00209   0.000593    -3.52  4.30e- 4

We can pull out all those estimates, calculate the 95% confidence intervals (do this before you take the exponential to get to the relative risk estimate, not after!), and then exponentiate all the terms to get estimates of relative risk of mortality per unit increase in temperature.

dist_lag_terms <- dist_lag_mod_1 %>% 
  tidy() %>% 
  filter(str_detect(term, "tmean")) %>% 
  mutate(lag = 0:7, 
         low_ci = estimate - 1.96 * std.error, 
         high_ci = estimate + 1.96 * std.error, 
         rr = exp(estimate), 
         low_rr = exp(low_ci), 
         high_rr = exp(high_ci)) %>% 
  select(term, lag, rr, low_rr, high_rr)
dist_lag_terms
## # A tibble: 8 × 5
##   term      lag    rr low_rr high_rr
##   <chr>   <int> <dbl>  <dbl>   <dbl>
## 1 tmean       0 1.01   1.01    1.01 
## 2 tmean_1     1 0.997  0.996   0.999
## 3 tmean_2     2 0.998  0.997   1.00 
## 4 tmean_3     3 0.998  0.996   0.999
## 5 tmean_4     4 0.999  0.997   1.00 
## 6 tmean_5     5 0.999  0.998   1.00 
## 7 tmean_6     6 0.999  0.997   1.00 
## 8 tmean_7     7 0.998  0.997   0.999

You can plot these estimates to explore how the association changes by lag (notice how geom_pointrange is very helpful here!):

dist_lag_terms %>% 
  ggplot(aes(x = lag, y = rr)) + 
  geom_point() + 
  geom_pointrange(aes(ymin = low_rr, ymax = high_rr)) + 
  labs(x = "Lag days since exposure", 
       y = "Relative risk of mortality\nper degree increase in temperature") + 
  geom_hline(yintercept = 1, color = "red", linetype = 3)

You can also, as an alternative, use the dlnm package to fit this model, using the crossbasis function to set up the distributed lag basis to include in the regression, and then the crosspred function to extract results after fitting the regression model and to plot results.

First, you’ll again use crossbasis to create the basis, as you did earlier when using this package to fit a non-linear function of temperature. You will use the lag argument to say the maximum lag you want to include—in this case, seven days (lag = 7). You can use the argvar and arglag functions to specify the basis function for the exposure variable and the distributed lag component. For right now, we’ll use a very basic linear function for the exposure (temperature), using fun = "lin" within the argvar list of arguments. To fit a separate coefficient for each lag, we can specify fun = "integer" in the arglag list of arguments (later we’ll explore some other functions for the lag component).

library(dlnm)
dl_basis <- crossbasis(obs$tmean, lag = 7,
                       argvar = list(fun = "lin"), 
                       arglag = list(fun = "integer"))

If you take a peak at this crossbasis, you’ll notice that it’s doing something similar to what we did when we added columns for lagged temperatures. The first column gives temperature on the same day, the second on the previous day (and so is offset one down from the first column), and so on. The crossbasis function deals with the missing early dates by setting all column values to be missing on those days; in practice, this difference in set up won’t create any difference in how the model is fit, as the glm function will by default exclude any observations with any columns of missing data.

dl_basis %>% 
  head(n = 12)
##           v1.l1     v1.l2    v1.l3    v1.l4    v1.l5    v1.l6    v1.l7    v1.l8
##  [1,]        NA        NA       NA       NA       NA       NA       NA       NA
##  [2,]        NA        NA       NA       NA       NA       NA       NA       NA
##  [3,]        NA        NA       NA       NA       NA       NA       NA       NA
##  [4,]        NA        NA       NA       NA       NA       NA       NA       NA
##  [5,]        NA        NA       NA       NA       NA       NA       NA       NA
##  [6,]        NA        NA       NA       NA       NA       NA       NA       NA
##  [7,]        NA        NA       NA       NA       NA       NA       NA       NA
##  [8,]  7.958644  6.686216 9.232628 6.867855 5.431046 4.385564 5.547919 3.913589
##  [9,]  7.268950  7.958644 6.686216 9.232628 6.867855 5.431046 4.385564 5.547919
## [10,]  9.510644  7.268950 7.958644 6.686216 9.232628 6.867855 5.431046 4.385564
## [11,] 10.424808  9.510644 7.268950 7.958644 6.686216 9.232628 6.867855 5.431046
## [12,]  7.368595 10.424808 9.510644 7.268950 7.958644 6.686216 9.232628 6.867855

We can use this matrix of crossbasis variables now in our regression call:

dist_lag_mod_2 <- glm(all ~ dl_basis + 
                        factor(dow, ordered = FALSE) +
                          ns(time, df = 158), 
                        data = obs, family = "quasipoisson")

Again, you can pull the regression coefficients directly out of the regression model object (and you can see how they’re identical to those from our first distributed lag model):

dist_lag_mod_2 %>% 
  tidy() %>% 
  filter(str_detect(term, "dl_basis"))
## # A tibble: 8 × 5
##   term           estimate std.error statistic  p.value
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
## 1 dl_basisv1.l1  0.00749   0.000592    12.6   2.67e-36
## 2 dl_basisv1.l2 -0.00255   0.000788    -3.23  1.25e- 3
## 3 dl_basisv1.l3 -0.00152   0.000797    -1.91  5.68e- 2
## 4 dl_basisv1.l4 -0.00243   0.000798    -3.05  2.31e- 3
## 5 dl_basisv1.l5 -0.00106   0.000798    -1.32  1.86e- 1
## 6 dl_basisv1.l6 -0.000515  0.000797    -0.645 5.19e- 1
## 7 dl_basisv1.l7 -0.00107   0.000789    -1.36  1.73e- 1
## 8 dl_basisv1.l8 -0.00209   0.000593    -3.52  4.30e- 4

You can proceed from here to plot estimates by lag using ggplot, but you can also use some of the special functions for plotting in dlnm. We can plot the “slice” of comparing the relative risk at 1 degree C (var = 1) (this will give us the one-unit increase, because by default the crossbasis set-up with 0 degrees as the baseline, so this is one degree above that value). The crosspred function predicts based on two things: (1) a crossbasis and (2) a regression model object that uses that crossbasis. Once you create an object with crosspred, then the plot function has a special method for plotting that object (to see the helpfile for this plot method, type ?plot.crosspred in your R console).

crosspred(dl_basis, dist_lag_mod_2) %>% 
  plot(ptype = "slices", var = 1, ci = "bars")

However, even though mechanically we can fit this model, we should think about whether it’s a good idea, or whether we should contrain the lag function some. If you look at the correlations between temperature at different lags, you’ll see they are very strongly correlated:

obs %>% 
  select(tmean, tmean_1:tmean_7) %>% 
  cor(use = "complete.obs") 
##             tmean   tmean_1   tmean_2   tmean_3   tmean_4   tmean_5   tmean_6
## tmean   1.0000000 0.9431416 0.8845767 0.8462589 0.8215974 0.8031790 0.7891306
## tmean_1 0.9431416 1.0000000 0.9431447 0.8846236 0.8463419 0.8216464 0.8032469
## tmean_2 0.8845767 0.9431447 1.0000000 0.9431403 0.8846198 0.8463094 0.8216348
## tmean_3 0.8462589 0.8846236 0.9431403 1.0000000 0.9431414 0.8846132 0.8463072
## tmean_4 0.8215974 0.8463419 0.8846198 0.9431414 1.0000000 0.9431482 0.8846132
## tmean_5 0.8031790 0.8216464 0.8463094 0.8846132 0.9431482 1.0000000 0.9431506
## tmean_6 0.7891306 0.8032469 0.8216348 0.8463072 0.8846132 0.9431506 1.0000000
## tmean_7 0.7781337 0.7892093 0.8032205 0.8216271 0.8463046 0.8846283 0.9431482
##           tmean_7
## tmean   0.7781337
## tmean_1 0.7892093
## tmean_2 0.8032205
## tmean_3 0.8216271
## tmean_4 0.8463046
## tmean_5 0.8846283
## tmean_6 0.9431482
## tmean_7 1.0000000

This means that we could have some issues with collinearity if we use this simpler distributed lag model, with a separate coefficient for each lag Next, we’ll explore a model that constrains the lagged effects to follow a smooth function, to help avoid this potential instability in the modeling.

  1. Start with the model from the first question, but change it to use a smooth function of lag to detect lagged effects, rather than fitting a separate term for each lag. Extend the model to look at lagged effects up to a month following exposure. What patterns do you see in these lagged effects?

As a reminder, we can stabilize our model of lagged effects a bit by fitting the lagged effects to be constrained to follow a smooth function, instead of fitting a separate term for each lag. The dlnm package makes this very easy to do.

When you use crossbasis to set up a crossbasis function, you can make a lot of choices to customize that crossbasis. The crossbasis combines a basis function for incorporating the exposure in the regression model (temperature in this case), as well as a basis function for incorporating lagged effects of that exposure. The crossbasis functions gives you several choices for each of these basis functions for the two dimensions of the crossbasis. Here, we’ll look at using that flexibility for the lag dimension; in the next part of the exercise, we’ll add some more complexity in the basis function for the exposure, as well.

To customize the basis function for lagged effects, you can use the arglag parameter in the crossbasis function. You send this parameter a list of other parameters, which get sent to a general onebasis function that builds that basis function (this will be convenient when we start working with the exposure basis, too—you’ll see that you can use the same parameters in setting up each basis function).

To see the range of functions you can use for this crossbasis function, check the help function for onebasis. When we fit a separate coefficient for each lag, we used the “integer” function, which creates a set of basis variables with indicators for each separate lag (this is really similar to the idea of basis variables for a categorical variable, like day of the week). This is what we did in the first part of the exercise.

Now, we want to instead use a smooth function, with fewer degrees of freedom (i.e., fewer columns created in the matrix of basis values). Two popular options for that function are the “poly” function, which will fit a polynomial, or the “ns” function, which will fit a natural cubic spline (just like we did earlier to create a non-linear function of temperature). We’ll use “ns” in this example (fun = "ns" in the list of parameters for arglag).

For some of these functions, you’ll need to specify additional parameters to clarify how they should be built. For example, if you’re using a spline, you need to specify how many degrees of freedom it should have with df. We’ll use 4 degrees of freedom, but you can experiment with different values for this and see how it affects the shape of the resulting function (i.e., the plot we create later). If you were fitting a polynomial function, you’d need to specify the degree of the polynomial; if you were fitting strata (i.e., constant values within sets of lags), you’d need to specify the breaks for dividing those strata; if you were using a threshold function, you’d need to specify the threshold, and so on. You can find more details on these options in the helpfile for onebasis; in general, the onebasis function is using other underlying functions like ns and poly to build these functions, so the choice of parameters will depend on the parameters for those underlying functions.

Here is the final call for building this new crossbasis (note that we’re continuing to use a linear function of temperature, and at this point still only looking at lags up to a week; later in this question we’ll expand to look at longer lags):

dl_basis_2 <- crossbasis(obs$tmean, lag = 7,
                       argvar = list(fun = "lin"), 
                       arglag = list(fun = "ns", df = 4))

If you look at this crossbasis object, you’ll see that it’s created a matrix of basis variables, similar to the last part of the exercise. However, instead of having the same number of columns as the number of lags (8), we now have fewer columns. We have 4, reflecting the degrees of freedom we specified for the spline function of lag. Therefore, we’ve constrained the lag function a bit compared to the previous part of the exercise. Again, the values are all missing for the first seven days, since the data isn’t able to capture the lagged temperature for these days.

dl_basis_2 %>% 
  head(n = 12)
##          v1.l1    v1.l2    v1.l3      v1.l4
##  [1,]       NA       NA       NA         NA
##  [2,]       NA       NA       NA         NA
##  [3,]       NA       NA       NA         NA
##  [4,]       NA       NA       NA         NA
##  [5,]       NA       NA       NA         NA
##  [6,]       NA       NA       NA         NA
##  [7,]       NA       NA       NA         NA
##  [8,] 10.53142 5.938107 17.26767 -2.9117904
##  [9,] 11.46186 7.311815 17.99659 -2.0873124
## [10,] 10.91751 8.769682 19.69413 -3.2804334
## [11,] 10.94122 8.867758 22.33138 -2.7169313
## [12,] 13.00391 8.984807 22.20899 -0.3726342

Again, once we build the crossbasis function, we can put it in the regression equation and fit a regression model using glm:

dist_lag_mod_3 <- glm(all ~ dl_basis_2 + 
                        factor(dow, ordered = FALSE) +
                          ns(time, df = 158), 
                        data = obs, family = "quasipoisson")

Again, you can plot the results to see the pattern of associations by lag. The plot is just visualizing the results of predicting the model to certain values. By default, it will predict at each lag; this makes the plot a little “jumpy” if you’re just looking at lags for a week or so, so to make it smoother, you might want to ask crosspred to predict to a finer resolution along the lag dimension (here, we’re using bylag = 0.2).

crosspred(dl_basis_2, dist_lag_mod_3, at = 1, bylag = 0.2) %>% 
  plot(ptype = "slices", var = 1)

You can go back to the crossbasis call and try playing around with the degrees of freedom for the spline, to see how that affects the shape of this function. Be sure, though, that you don’t use more degrees of freedom than there are lags (i.e., more than 8).

Now let’s try to extend the model out to look at up to a month after exposure. To do this, all you need to do is change the lag argument in crossbasis to the longest lag you want to include. Since we’re adding more lags, you’ll likely also want to increase the degrees of freedom you use for the basis function of lag; here, I’m using 6, but you can explore other values as well (again, don’t use more than the number of lags, though). From there, the process of fitting the model and plotting the results is identical.

dl_basis_3 <- crossbasis(obs$tmean, lag = 30,
                       argvar = list(fun = "lin"), 
                       arglag = list(fun = "ns", df = 6))
dist_lag_mod_4 <- glm(all ~ dl_basis_3 + 
                        factor(dow, ordered = FALSE) +
                          ns(time, df = 158), 
                        data = obs, family = "quasipoisson")
crosspred(dl_basis_3, dist_lag_mod_4, at = 1) %>% 
  plot(ptype = "slices", var = 1)

  1. Refine your model to fit for a non-linear, rather than linear, function of temperature, while also incorporating a function to describe lagged effects up to a month following exposure. Based on this model, what does the association between temperature and mortality look like at lag 0 days? How about at lag 21 days? What does the pattern of lagged effects look like when comparing the relative risk of mortality at a hot temperature (e.g., 28 degrees C) to a milder temperature (7 degrees C)? What about when comparing mortality at a cold temperature (e.g., -4 degrees C) to a milder temperature (7 degrees C)? If you use a constant term for the effect of exposure over a month as opposed to the smooth function you just used, how does the overall cumulative RR compare?

To extend the model to include temperature using a non-linear function, we’ll go back to the crossbasis call, but this time we’ll make some changes to the exposure variable dimension. The basis function for the exposure is set using the argvar (“var” is for “variable”) parameter. Again, this allows us to include a list of parameters that will be passed to onebasis to build a basis function and set up basis variables for this dimension of the crossbasis. We can use a spline for temperature by specifying fun = "ns", as well as specify the degrees of freedom for that spline (here, I’ve used 4, but again, feel free to experiment):

dl_basis_4 <- crossbasis(obs$tmean, lag = 30,
                       argvar = list(fun = "ns", df = 4), 
                       arglag = list(fun = "ns", df = 6))

Try looking at the crossbasis from this function (you’ll need to look past the 30th row or so, since the earliest rows will all be missing since they’re in that range where they don’t have lag data available for the exposure). You can see that we now have a lot of columns: specifically, we have 4 (degrees of freedom for the temperature spline) times 6 (degrees of freedom for the lag spline) = 24 columns.

dl_basis_4 %>% 
  as.data.frame() %>% 
  slice(28:38)
##       v1.l1    v1.l2    v1.l3    v1.l4    v1.l5       v1.l6      v2.l1
## 1        NA       NA       NA       NA       NA          NA         NA
## 2        NA       NA       NA       NA       NA          NA         NA
## 3        NA       NA       NA       NA       NA          NA         NA
## 4  2.068939 3.299498 3.111270 1.738184 2.541810 -0.51512317 -0.5552531
## 5  1.955177 3.314862 3.125558 1.820183 2.690445 -0.40693744 -0.5967078
## 6  1.886111 3.299149 3.148516 1.893007 2.766242 -0.32385143 -0.6273517
## 7  1.882405 3.248469 3.180797 1.927784 2.804239 -0.08305771 -0.6419354
## 8  1.852973 3.165440 3.220817 1.900001 2.905860  0.08447527 -0.6588751
## 9  1.757370 3.058913 3.263914 1.795344 3.175517  0.03686863 -0.7106452
## 10 1.760596 2.944202 3.299498 1.779203 3.259667 -0.16712630 -0.7084665
## 11 1.797646 2.838520 3.314862 1.698607 3.497240 -0.17786583 -0.7118631
##         v2.l2      v2.l3      v2.l4      v2.l5         v2.l6    v3.l1    v3.l2
## 1          NA         NA         NA         NA            NA       NA       NA
## 2          NA         NA         NA         NA            NA       NA       NA
## 3          NA         NA         NA         NA            NA       NA       NA
## 4  -0.7011824 -0.7707934 -0.4847858 -0.9916438 -0.0956588789 1.405249 1.838701
## 5  -0.6979016 -0.7603742 -0.4547662 -0.9667940 -0.0791413415 1.478133 1.830775
## 6  -0.7047091 -0.7492423 -0.4287099 -0.9503322 -0.0554076103 1.526744 1.841501
## 7  -0.7234011 -0.7371123 -0.4081381 -0.9450033 -0.0058689802 1.541114 1.873929
## 8  -0.7537142 -0.7241504 -0.4035141 -0.9205842  0.0358717107 1.565836 1.927170
## 9  -0.7930344 -0.7111578 -0.4381402 -0.7997450 -0.0007806944 1.643309 1.996194
## 10 -0.8364025 -0.7011824 -0.4373333 -0.7657041 -0.0621066847 1.644968 2.071685
## 11 -0.8780716 -0.6979016 -0.4735973 -0.6384769 -0.0958995727 1.631672 2.143093
##       v3.l3    v3.l4    v3.l5       v3.l6      v4.l1     v4.l2     v4.l3
## 1        NA       NA       NA          NA         NA        NA        NA
## 2        NA       NA       NA          NA         NA        NA        NA
## 3        NA       NA       NA          NA         NA        NA        NA
## 4  1.955381 1.180876 2.353392  0.20605400 -0.8041945 -1.052250 -1.119024
## 5  1.942554 1.126503 2.294679  0.17194514 -0.8459049 -1.047714 -1.111683
## 6  1.926528 1.081359 2.257968  0.12135036 -0.8737239 -1.053852 -1.102512
## 7  1.906672 1.048474 2.249542  0.01133949 -0.8819473 -1.072411 -1.091149
## 8  1.883354 1.051489 2.198457 -0.07875894 -0.8960954 -1.102879 -1.077804
## 9  1.858659 1.116576 2.004017 -0.03652356 -0.9404317 -1.142380 -1.063672
## 10 1.838701 1.121782 1.943746  0.09297047 -0.9413808 -1.185582 -1.052250
## 11 1.830775 1.177720 1.763872  0.11448195 -0.9337721 -1.226447 -1.047714
##         v4.l4     v4.l5        v4.l6
## 1          NA        NA           NA
## 2          NA        NA           NA
## 3          NA        NA           NA
## 4  -0.6757908 -1.346797 -0.117920408
## 5  -0.6446743 -1.313197 -0.098400617
## 6  -0.6188389 -1.292188 -0.069446282
## 7  -0.6000200 -1.287366 -0.006489356
## 8  -0.6017453 -1.258131  0.045072099
## 9  -0.6389933 -1.146857  0.020901671
## 10 -0.6419722 -1.112366 -0.053205061
## 11 -0.6739847 -1.009427 -0.065515636

The dlnm package has done the work of setting up this crossbasis function, as well as creating this matrix of basis variables from our original simple column of temperature measurements. This is getting to be a much more complex function than the simple linear/unconstrained lag function that we started with. However, the underlying principles and mechanics are the same, they’re just scaling up to allow some more complex “shapes” in our exposure-response association.

We can add this new crossbasis function to the regression model, and then predict from the crossbasis and regression model to get a view of how temperature is (non-linearly) associated with mortality risk at different lags, comparing to a baseline temperature of 7 degrees C:

dist_lag_mod_5 <- glm(all ~ dl_basis_4 + 
                        factor(dow, ordered = FALSE) +
                          ns(time, df = 158), 
                        data = obs, family = "quasipoisson")

cp_dl <- crosspred(dl_basis_4, dist_lag_mod_5, cen = 7, bylag = 1)

The object created by crosspred includes a lot of output (you can check out the whole thing by running cp_dl to print it out). We can extract elements from this object (for example, if we want to get effect estimates or estimates of confidence intervals). We can also create a variety of plots from this object. The following plots all use the plot method for crosspred objects, so you can access the helpfile with ?plot.crosspred.

First, we can look at “slices,” where we focus on just one lag value and look at the non-linear relationship between temperature and mortality risk at that lag only. For example, here are plots showing the relationship between temperature and mortality risk on the same day of exposure (lag 0) and 21 days after exposure (lag 21):

plot(cp_dl, ptype = "slices", lag = 0)

plot(cp_dl, ptype = "slices", lag = 21)

You can see that heat has a stronger relationship with mortality risk than cold at lag 0, while at a longer lag, there continues to be a positive association between cold temperature and mortality risk, but now the relationship with heat has shifted to be negative.

You can do the same idea of looking at a “slice” for a specific temperature, in this case seeing how effects vary across lags when comparing that temperature to the baseline temperature (7 degrees C, which we set when we ran crosspred). For example, here are “slices” for 28 degrees C and -4 degrees C:

plot(cp_dl, ptype = "slices", var = 28)

plot(cp_dl, ptype = "slices", var = -4)

These plots again reinforce that heat’s association with mortality risk is more immediate, while cold has an association that plays out over a much longer lag following exposure.

There are also some plots you can make from the crosspred object that help show the full, three-dimensional association between the exposure, lag time, and risk of the outcome. For example, the default plot created from the crosspred object is a three-dimensional plot (the theta, phi, and lphi parameters shift the angle we view it from), while the “contour” plot type shows us a contour plot of these results (in other words, showing the exposure and lag variables in two dimensions of a two-dimensional plot then showing relative risk using color for each combination). You can think of the ‘sliced’ plots from above as two-dimensional slices of the three-dimensional graph if we slice right at the selected lag or var value. Here is the code to create those plots:

plot(cp_dl, theta = 200, phi = 40, lphi = 30)

plot(cp_dl, ptype = "contour")

To see how a ‘constant’ lag-response model compares to the smooth function from this model we have to go back to the crossbasis and use the appropriate argument in the arglag parameter:

dl_basis_5 <- crossbasis(obs$tmean, lag = 30,
                       argvar = list(fun = "ns", df = 4), 
                       arglag = list(fun = "strata", df = 1))

The function strata in arglag essentially breaks up the lag window into strata equaling the degrees of freedom specified, and the lag-response is assumed to be constant for each day within each stratum. Here we specified df=1 so the lag-response is assumed to be constant for the entire lag window. In other words, the exposure in each day in the lag window is constrained to have the same effect. If we look at the crossbasis from this function you’ll see that we have far fewer columns. Remember that the total number of columns is determined by the total number of degrees of freedom, which is the product of the df’s in each of argvar and arglag so in this case 4.

dl_basis_5 %>% 
  as.data.frame() %>% 
  slice(28:38)
##       v1.l1     v2.l1     v3.l1     v4.l1
## 1        NA        NA        NA        NA
## 2        NA        NA        NA        NA
## 3        NA        NA        NA        NA
## 4  14.69900 -4.438015 10.919532 -6.249020
## 5  15.03355 -4.382485 10.789984 -6.174883
## 6  15.21075 -4.347038 10.707109 -6.127455
## 7  15.33913 -4.332385 10.672850 -6.107850
## 8  15.51220 -4.298983 10.594759 -6.063160
## 9  15.89771 -4.127076 10.317738 -5.904626
## 10 15.96057 -4.094014 10.267995 -5.876159
## 11 16.37863 -3.880472  9.958228 -5.698886

We can now run the same model as above replacing the crossbasis with the one we just created and repeat the prediction to see how the temperature effect comparing to a baseline temperature of 7 degrees C changes when we constrain the lag-response to be constant.

dist_lag_mod_6 <- glm(all ~ dl_basis_5 + 
                        factor(dow, ordered = FALSE) +
                          ns(time, df = 158), 
                        data = obs, family = "quasipoisson")

cp_dlconstant <- crosspred(dl_basis_5, dist_lag_mod_6, cen = 7, bylag = 1)

If we look at the same plots and sliced plots as above we can see that this is a much simple model.

plot(cp_dlconstant, ptype = "slices", lag = 0)
plot(cp_dlconstant, ptype = "slices", lag = 21)

The two plots are identical (because we assume that the exposure-response is the same at all lags) with colder temperatures having a negative effect, while the warmer temperatures appear to be protective.

plot(cp_dlconstant, ptype = "slices", var = 28)

plot(cp_dlconstant, ptype = "slices", var = -4)

Here we see the constant lag-response more clearly, but at different effects depending on the set temperature. The hot temperature appears to be mildly protective (RR < 1.0), while the cold temperature is harmful (RR > 1.0).

The reduced complexity is evident in the 3-d plot (which looks “flat” on one of the dimensions) and the contour plot (which has large rectangular areas) as well.

plot(cp_dlconstant, theta = 200, phi = 40, lphi = 30)

plot(cp_dlconstant, ptype = "contour")

One other thing that we can do with the model output is we can estimate the cumulative risk across lags, for example for a month of hot temperature (28 degrees C) compared to cooler temperature (7 degrees C). We do this by essentially accumulating the RRs from each lag-day as follows. The crosspred object has estimates of the coefficients for each lag day (in the matfit element), and you could add these up yourself, or you can extract a difference element (allfit) that has already added these up for you to get a cumulative effect estimate. If you’d like to calculate confidence intervals, you can get an estimate of the standard error for this cumulative estimate with the allse element of the crosspred object. We can get all these for a temperature of 28 degrees C; since we set the crosspred object to be centered at 7 degrees C, this will allow us to compare 28 to 7 degrees C:

### parameter for each lag-day
cp_dlconstant$matfit[cp_dlconstant$predvar==28]
##  [1] -0.004774825 -0.004774825 -0.004774825 -0.004774825 -0.004774825
##  [6] -0.004774825 -0.004774825 -0.004774825 -0.004774825 -0.004774825
## [11] -0.004774825 -0.004774825 -0.004774825 -0.004774825 -0.004774825
## [16] -0.004774825 -0.004774825 -0.004774825 -0.004774825 -0.004774825
## [21] -0.004774825 -0.004774825 -0.004774825 -0.004774825 -0.004774825
## [26] -0.004774825 -0.004774825 -0.004774825 -0.004774825 -0.004774825
## [31] -0.004774825
### combined model parameter for the cumulative RR (this is the sum for all of the above parameters)
cp_dlconstant$allfit[cp_dlconstant$predvar==28]
##         28 
## -0.1480196
### corresponding standard error for the cumulative RR parameter
cp_dlconstant$allse[cp_dlconstant$predvar==28]
##         28 
## 0.05128501

If we exponentiate the model parameter we will get the cumulative RR estimate and likewise using the model parameter and standard error we can derive 95% Confidence Intervals (CIs) for the cumulative RR. For the central estimate:

\[ RR=exp(\beta)=exp(-0.148)=0.86 \] For the confidence interval:

\[ exp(\beta \pm 1.96*se) = exp(-0.148 \pm 1.96*0.05) = (0.78, 0.95) \]

The crosspred object, it turns out, stores this information as well (in the allRRfit, allRRlow, and allRRhigh elements), so we can extract these elements directly without doing the math:

cp_dlconstant$allRRfit[cp_dlconstant$predvar==28]
##        28 
## 0.8624142
cp_dlconstant$allRRlow[cp_dlconstant$predvar==28]
##        28 
## 0.7799415
cp_dlconstant$allRRhigh[cp_dlconstant$predvar==28]
##        28 
## 0.9536078

Does this look accurate? Do we expect a 14% decrease in mortality comparing a very hot month to a cool month?

Now let’s calculate the cumulative RR for temperature at 28 degrees for the smooth lag-response function model from before (i.e., pull those same elements from the model we fit before with a smoother lag-response function):

cp_dl$allRRfit[cp_dl$predvar==28]
##       28 
## 1.078548
cp_dl$allRRlow[cp_dl$predvar==28]
##        28 
## 0.9768645
cp_dl$allRRhigh[cp_dl$predvar==28]
##       28 
## 1.190816

Based on this model, we infer there’s about a 6% cumulative increase associated with a temperature of 28 degrees C compared to 7 degrees.

Does this look more accurate? Which of the two lag-response functions is more likely to be true?

  1. Assume that we are interested in the potential effect of a 5-day heatwave occurring on lags 0 to 4? You can assess this as the effect of hot temperature (e.g., 28 degrees C) during these lags, while the temperature remains warm but more mild (e.g., 20 degrees C) the rest of the month, against constant 20 degrees C for the whole month. What is the effect under the ‘constant’ model? How about the smooth function lag model?

Using the crosspred function we can compare any exposure history over a lag-window of interest to a reference value (or values varying over time in the same window). Realistically, a whole month of extremely high exposures is unlikely. Extreme phenomena (such as 28 degree mean daily temperature in London) are more likely to last a shorter period of time. Furthermore, the effect of an event like a heat wave (likely occurring in the summer) is probably better assessed with a referent value more akin to what the heat wave is replacing, i.e., a warm but more mild summer temperature rather than temperatures more likely to occur in other seasons. We saw above that the effect of hot temperatures tends to be more short term, so let’s assess the potential effect of a 5-day heatwave assessed on the last day of the heatwave. The rest of the days in the month are at 20 degrees C and we will use the same value as the referent value for the whole month. Realistically there will be variability in the temperatures during these days as well, but for the purposes of this exercise let’s keep it constant at 20 for simplicity. Let’s assess this effect using the constant lag-response model.

hw_constant <- crosspred(dl_basis_5, dist_lag_mod_6, 
                         at = exphist(c(rep(20, 26), rep(28, 5)), times = 31, lag = 30) , 
                         cen = 20, bylag = 1)

The exphist function allows us to designate distinct exposure values for the entire lag-window. The exposure values here appear in an oldest (lag30) to most recent (lag0) order. The times argument determines were the list of exposure begins in the lag window (here time = 31 or lag30 from 0-30). The effect estimate and corresponding CI can be extracted as above.

hw_constant$allRRfit
##       31 
## 1.004005
hw_constant$allRRlow
##        31 
## 0.9891951
hw_constant$allRRhigh
##       31 
## 1.019037

Now, lets repeat this with the smooth function lag-response model:

hw_smooth<-crosspred(dl_basis_4, dist_lag_mod_5, 
                     at = exphist(c(rep(20, 26), rep(28, 5)), times = 31, lag = 30) , 
                     cen = 20, bylag = 1)

hw_smooth$allRRfit
##       31 
## 1.402728
hw_smooth$allRRlow
##       31 
## 1.356389
hw_smooth$allRRhigh
##       31 
## 1.450651

We see that the constant lag-response model underestimates the RR and does not capture any potential effect of a heatwave, while the smooth term lag-response model indicates that there is a 40% increase in mortality for the last day of the heatwave. Note that this increase in mortality is not the only effect, as based on out model the heatwave started having an effect on the same day it started, i.e. we would expect all 5 heatwave days to have a temperature related increase in mortality compared to the days before the heat-wave. We can look at the increase in mortality for each day if we look at the individual effect estimates:

### individual lag-day parameters
hw_smooth$matfit
##         lag0       lag1       lag2       lag3       lag4 lag5 lag6 lag7 lag8
## 31 0.1196662 0.09213574 0.06563143 0.04117951 0.01980617    0    0    0    0
##    lag9 lag10 lag11 lag12 lag13 lag14 lag15 lag16 lag17 lag18 lag19 lag20 lag21
## 31    0     0     0     0     0     0     0     0     0     0     0     0     0
##    lag22 lag23 lag24 lag25 lag26 lag27 lag28 lag29 lag30
## 31     0     0     0     0     0     0     0     0     0
#### individual lag-day RRs
hw_smooth$matRRfit
##        lag0     lag1     lag2     lag3     lag4 lag5 lag6 lag7 lag8 lag9 lag10
## 31 1.127121 1.096514 1.067833 1.042039 1.020004    1    1    1    1    1     1
##    lag11 lag12 lag13 lag14 lag15 lag16 lag17 lag18 lag19 lag20 lag21 lag22
## 31     1     1     1     1     1     1     1     1     1     1     1     1
##    lag23 lag24 lag25 lag26 lag27 lag28 lag29 lag30
## 31     1     1     1     1     1     1     1     1

The first day of the heatwave will have an effect corresponding to a lag0 effect:

\[ RR=exp(0.107)=1.11 \]

The second day of the heatwave will have an effect corresponding to a lag0 and lag1 effect:

\[ RR = exp(0.107 + 0.085) = 1.21 \]

(This quantity is also equal to the product of the lag-day specific RRs for lag0 and lag1:)

\[ 1.11 \times 1.09 = 1.21 \]

And so on, until we reach our last day of the heatwave with RR=1.39. There is a bidirectional relationship for this RR, as you can interpret this as the overall effect of the heatwave on mortality on the last day of the heatwave, or as the overall effect of temperature on the first day of the heatwave on mortality over the next 5 days.

References

Armstrong, Ben. 2006. “Models for the Relationship Between Ambient Temperature and Daily Mortality.” Epidemiology, 624–31.
Armstrong, Ben G, Antonio Gasparrini, and Aurelio Tobias. 2014. “Conditional Poisson Models: A Flexible Alternative to Conditional Logistic Case Cross-over Analysis.” BMC Medical Research Methodology 14 (1): 122.
Basu, Rupa, Francesca Dominici, and Jonathan M Samet. 2005. “Temperature and Mortality Among the Elderly in the United States: A Comparison of Epidemiologic Methods.” Epidemiology, 58–66.
Bhaskaran, Krishnan, Antonio Gasparrini, Shakoor Hajat, Liam Smeeth, and Ben Armstrong. 2013. “Time Series Regression Studies in Environmental Epidemiology.” International Journal of Epidemiology 42 (4): 1187–95.
Dunn, Peter K, and Gordon K Smyth. 2018. “Statistical Models.” In Generalized Linear Models with Examples in r, 1–30. Springer.
Gasparrini, Antonio. 2014. “Modeling Exposure–Lag–Response Associations with Distributed Lag Non-Linear Models.” Statistics in Medicine 33 (5): 881–99.
Gasparrini, Antonio, and Ben Armstrong. 2010. “Time Series Analysis on the Health Effects of Temperature: Advancements and Limitations.” Environmental Research 110 (6): 633–38.
Imai, Chisato, Ben Armstrong, Zaid Chalabi, Punam Mangtani, and Masahiro Hashizume. 2015. “Time Series Regression Model for Infectious Disease and Weather.” Environmental Research 142: 319–27.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. “Linear Regression.” In An Introduction to Statistical Learning: With Applications in r, 59–126. Springer.
Lu, Yun, and Scott L Zeger. 2007. “On the Equivalence of Case-Crossover and Time Series Methods in Environmental Epidemiology.” Biostatistics 8 (2): 337–44.
Vicedo-Cabrera, Ana M, Francesco Sera, and Antonio Gasparrini. 2019. “Hands-on Tutorial on a Modeling Framework for Projections of Climate Change Impacts on Health.” Epidemiology 30 (3): 321–29.