Chapter 7 Longitudinal cohort study designs
7.1 Readings
The required readings for this chapter are:
- Andersson et al. (2019) Provides background on the study we’ll use in this chapter for our example data
- Wong et al. (1989) An epidemiological study using the study data
There are also some supplemental readings you may find useful. The following are a series of instructional papers on survival analysis that are meant as general background on how to fit survival analysis models:
This article is a good summary of the limitations of hazard ratios and offers an alternative survival analysis approach that enables plotting adjusted cumulative incidence (and, similarly, survival rate) curves:
- Hernán (2010) (Note that there is an online erratum for this article: “On page 15, column 1, second full paragraph: ‘… and the average HR is ensured to reach the value 1.’ should instead say, ‘… and the risk ratio is ensured to reach the value 1.’”)
These articles provide more background on the Framingham Heart Study:
- Dawber, Meadors, and Moore Jr (1951) A paper from the 1950s, this presents the rationale behind the design of the Framingham Heart Study
- Dawber, Moore, and Mann (2015) A reprint of the paper describing the first results to come out of the Framingham Heart Study
These articles provide some general reviews on our current understanding of the epidemiology and etiology of heart disease and its risk factors:
- Wong (2014) Review of large epidemiological studies on coronary heart disease and key findings
- Ziaeian and Fonarow (2016) Review on the epidemiology and etiology of heart failure
- Valenzuela et al. (2021) Review of risk factors for hypertension
- Zhou et al. (2021) Review on global epidemiology of hypertension
7.2 Longitudinal cohort data
Our example data for this chapter comes from the Framingham Heart Study. This is a cohort study that began in 1948 (Andersson et al. 2019). At the time (and continuing today), heart disease was the most common cause of death in the US—a notable shift from earlier times, when infectious diseases played a larger role in mortality. While heart disease was an important cause of death, however, very little was known about its risk factors, beyond a little about the role of some infectious and nutritional diseases that affect the heart (Dawber, Meadors, and Moore Jr 1951; Andersson et al. 2019). The study was designed to enroll a group of people without evident cardiovascular disease (although this restriction was eased in practice) and track them over many years, to try to identify risk factors for developing cardiovascular disease. The study subjects were tracked over time, with data regularly collected on both risk factors and cardiovascular outcomes. The study was revolutionary in identifying some key risk factors for coronary heart disease, including elevated blood pressure, high cholesterol levels, being overweight, and smoking (Andersson et al. 2019). It was also revolutionary in identifying high blood pressure as a risk factor for other cardiovascular outcomes, including stroke and congestive heart failure (Andersson et al. 2019).
The original cohort included about 5,200 people from the town of Framingham, MA. The example data is a subset of data from this original cohort. You can download the example dataset for this class by clicking here and then saving the content of the page as a csv file (we recommend using the original filename of “frmgham2.csv”). There is also a codebook file that comes with these data, which you can download for this class by clicking here. This codebook includes some explanations about the columns in the data, as well as how multiple measurements from a single study subject are included in the data.
The data are saved in a csv format, and so they can be read into R using the read_csv function from the readr package (part of the tidyverse). You can use the following code to read in these data, assuming you have saved them in a “data” subdirectory of your current working directory:
library(tidyverse) # Loads all the tidyverse packages, including readr
fhs <- read_csv("data/frmgham2.csv")
fhs
## # A tibble: 11,627 × 39
## RANDID SEX TOTCHOL AGE SYSBP DIABP CURSMOKE CIGPDAY BMI DIABETES BPMEDS
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2448 1 195 39 106 70 0 0 27.0 0 0
## 2 2448 1 209 52 121 66 0 0 NA 0 0
## 3 6238 2 250 46 121 81 0 0 28.7 0 0
## 4 6238 2 260 52 105 69.5 0 0 29.4 0 0
## 5 6238 2 237 58 108 66 0 0 28.5 0 0
## 6 9428 1 245 48 128. 80 1 20 25.3 0 0
## 7 9428 1 283 54 141 89 1 30 25.3 0 0
## 8 10552 2 225 61 150 95 1 30 28.6 0 0
## 9 10552 2 232 67 183 109 1 20 30.2 0 0
## 10 11252 2 285 46 130 84 1 23 23.1 0 0
## # ℹ 11,617 more rows
## # ℹ 28 more variables: HEARTRTE <dbl>, GLUCOSE <dbl>, educ <dbl>,
## # PREVCHD <dbl>, PREVAP <dbl>, PREVMI <dbl>, PREVSTRK <dbl>, PREVHYP <dbl>,
## # TIME <dbl>, PERIOD <dbl>, HDLC <dbl>, LDLC <dbl>, DEATH <dbl>,
## # ANGINA <dbl>, HOSPMI <dbl>, MI_FCHD <dbl>, ANYCHD <dbl>, STROKE <dbl>,
## # CVD <dbl>, HYPERTEN <dbl>, TIMEAP <dbl>, TIMEMI <dbl>, TIMEMIFC <dbl>,
## # TIMECHD <dbl>, TIMESTRK <dbl>, TIMECVD <dbl>, TIMEDTH <dbl>, …
You can find full details on the structure of these data in the codebook. At a broad scale, note that the dataset includes several health outcomes related to heart disease, which the codebook calls “events” (DEATH: indicator of death from any cause; ANYCHD: indicator of one of several types of events related to coronary heart disease; HOSPMI: hospitalization for myocardial infarction [heart attack]; etc.). The data also include a number of risk factors that the study researchers hypothesized might be linked to cardiovascular disease (CURSMOKE: whether the study subject is currently a smoker; TOTCHOL: serum total cholesterol; SYSBP, DIABP: measures of systolic and diastolic blood pressure, respectively; BMI: body mass index; etc.). Finally, there are some characteristics of the study subject, like age (AGE) and sex (SEX), as well as some variables that record either the time of the observation or the time of certain events (or of censoring, if the event did not happen during follow-up).
As you look through the data, pay attention to some features that are more characteristic of cohort studies, compared to features of the time series data we worked with in earlier chapters:
- One important difference compared to a time-series dataset is the RANDID variable. This is the unique identifier for the unit for which we have repeated observations over time. In this case, the RANDID variable represents a unique identifier for each study participant, with multiple observations (rows) per participant over time.
- The TIME variable indicates the number of days that have elapsed since the beginning of follow-up at each observation. TIME is always 0 for the first observation of each participant (when PERIOD equals 1, for the first examination), and for following measurements it tracks the time since follow-up started for that study participant.
- The number of observations varies between participants (typical of many cohort studies).
- The time spacing between observations is not constant. This is because the repeated observations in the Framingham Heart Study are the result of follow-up exams happening 3 to 5 years apart. Many longitudinal cohorts will instead have observations at a fixed time interval (monthly, annual, biannual, etc.), resulting in a more balanced dataset.
- Observations are given for various risk factors, covariates, and cardiovascular outcomes. Some will be invariant for each participant over time (SEX, educ), while others will vary with each exam.
From a data management perspective, we might want to change all the column names to be in lowercase, rather than uppercase. This will save our pinkies some work as we code with the data! You can make that change with the following code, using the str_to_lower function from the stringr package (part of the tidyverse):
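# One way to do this (a sketch; rename_with, which applies a function to
# every column name, isn't named above but fits the tidyverse style)
fhs <- fhs %>%
  rename_with(str_to_lower)
fhs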
## # A tibble: 11,627 × 39
## randid sex totchol age sysbp diabp cursmoke cigpday bmi diabetes bpmeds
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2448 1 195 39 106 70 0 0 27.0 0 0
## 2 2448 1 209 52 121 66 0 0 NA 0 0
## 3 6238 2 250 46 121 81 0 0 28.7 0 0
## 4 6238 2 260 52 105 69.5 0 0 29.4 0 0
## 5 6238 2 237 58 108 66 0 0 28.5 0 0
## 6 9428 1 245 48 128. 80 1 20 25.3 0 0
## 7 9428 1 283 54 141 89 1 30 25.3 0 0
## 8 10552 2 225 61 150 95 1 30 28.6 0 0
## 9 10552 2 232 67 183 109 1 20 30.2 0 0
## 10 11252 2 285 46 130 84 1 23 23.1 0 0
## # ℹ 11,617 more rows
## # ℹ 28 more variables: heartrte <dbl>, glucose <dbl>, educ <dbl>,
## # prevchd <dbl>, prevap <dbl>, prevmi <dbl>, prevstrk <dbl>, prevhyp <dbl>,
## # time <dbl>, period <dbl>, hdlc <dbl>, ldlc <dbl>, death <dbl>,
## # angina <dbl>, hospmi <dbl>, mi_fchd <dbl>, anychd <dbl>, stroke <dbl>,
## # cvd <dbl>, hyperten <dbl>, timeap <dbl>, timemi <dbl>, timemifc <dbl>,
## # timechd <dbl>, timestrk <dbl>, timecvd <dbl>, timedth <dbl>, …
To look a bit more closely at how this dataset works, let’s take a look at just the observations of the oldest study subjects at the first examination: those 69 or older (age >= 69) at their first examination (period == 1):
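# A sketch: pull the IDs of subjects aged 69 or older at the first examination
oldest_subjects <- fhs %>%
  filter(period == 1 & age >= 69) %>%
  pull(randid)
oldest_subjects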
## [1] 3603542 3762702 4726021 7198139 7351212 7568367 7659676 7695005 8723664
## [10] 9643995 9686180 9789948
Once we have the IDs of these study subjects (oldest_subjects), we can pull out the study data just for them. I’m limiting to a few columns: their ID (randid), the time of each examination (time), the number of each examination (period), whether they died during follow-up (death), and the number of days between the first examination and either death (if they died during follow-up) or censoring (i.e., when the study stopped tracking the subject or lost them to follow-up) (timedth):
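fhs %>%
  filter(randid %in% oldest_subjects) %>%
  select(randid, time, period, death, timedth)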
## # A tibble: 24 × 5
## randid time period death timedth
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3603542 0 1 1 1825
## 2 3762702 0 1 1 3076
## 3 3762702 2253 2 1 3076
## 4 4726021 0 1 1 4275
## 5 4726021 2134 2 1 4275
## 6 7198139 0 1 1 5384
## 7 7351212 0 1 1 4396
## 8 7351212 2146 2 1 4396
## 9 7351212 4283 3 1 4396
## 10 7568367 0 1 1 1563
## # ℹ 14 more rows
We can explore this subset of the data by plotting, for each of these study subjects, the timing of each of their examination periods, whether they died during follow-up, and the timing of their death or censoring:
fhs %>%
filter(randid %in% oldest_subjects) %>%
select(randid, time, period, death, timedth) %>%
mutate(randid = as_factor(randid),
randid = fct_reorder(randid, timedth), # Arrange by time to death
period = as_factor(period),
death = as_factor(death),
timey = time / 365.25, # Convert from days to years
timedthy = timedth / 365.25) %>%
ggplot() +
geom_segment(aes(x = 0, xend = timedthy,
y = randid, yend = randid), color = "lightgray") +
geom_point(aes(x = timedthy, y = randid, fill = death), shape = 22) +
geom_point(aes(x = timey, y = randid, color = period)) +
theme_classic() +
labs(x = "Time since first examination (years)",
y = "Patient ID",
color = "Examination\nperiod") +
scale_fill_manual(name = "", values = c("white", "black"),
labels = c("Survived\nfollow-up", "Died during\nfollow-up"))
You can see that we have at least one examination (period 1) for each of the study subjects, and for some we have as many as three. One of these study subjects was tracked for almost 25 years without a recorded death (subject ID 9789948). All the other subjects in this subset died during follow-up. Some died within a few years of the first examination, and so did not survive to a second or later examination. Others survived longer but missed some later examinations.
As you work with these data, keep in mind that they have multiple measurements (rows) for some but not all of the study subjects, and that events are recorded both in terms of whether they happened during follow-up (e.g., death) and how long after the first examination the event occurred or the data for the subject were censored (e.g., timedth).
Applied exercise: Exploring longitudinal cohort data
Read the example cohort data into R and explore it to answer the following questions:
- What is the number of participants and number of observations in the fhs dataset?
- Is there any missingness in the data?
- How many participants died during the observation period? What is the distribution of age at time of death?
- What is the distribution of BMI among MI cases and non-cases? How about between smokers and non-smokers?
Based on this exploratory exercise, talk about the potential for confounding when these data are analyzed to estimate the association between smoking and risk of incident MI.
Applied exercise: Example code
- What is the number of participants and the number of observations in the fhs dataset? (i.e., what is the sample size and number of person-time observations?)
In the fhs dataset, the number of participants will be equal to the number of unique IDs (the RANDID variable, which takes a unique value for each participant). We can extract this using the unique function nested within the length function:
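length(unique(fhs$randid))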
## [1] 4434
If you’d like to use tidyverse tools to answer this question, you can do that, as well. The pipe operator (%>%) works on any type of object: it takes your current output and includes it as the first parameter value for the function call you pipe into. If you want to perform operations on a column of a dataframe, you can use pull to extract it from the dataframe as a vector, and then pipe that into vector operations:
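fhs %>%
  pull(randid) %>%
  unique() %>%
  length()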
## [1] 4434
It’s entirely a personal choice whether you use the $ operator and “nesting” of function calls, versus pull and piping through a series of function calls. You can see you get the same result, so it just comes down to the style that you will find easiest to understand when you look at your code later.
The number of person-time observations will be equal to the length of the dataset, since there’s a row for every observation taken. The dim function gives us the length (number of rows) and width (number of columns) of a dataframe or any matrix-like object in R:
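dim(fhs)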
## [1] 11627 39
We see that there are 11,627 observations, which is an average of approximately 2 to 3 observations per participant (11,627 / 4,434 ≈ 2.6).
When you know there are repeated measurements, it can be helpful to explore how much variation there is in the number of observations per study subject. You could do that in this dataset with the following code:
fhs %>%
# Group by the study subject identifier and then count the rows for each
group_by(randid) %>%
count() %>%
# Reorder the dataset so the subjects with the most observations come first
arrange(desc(n)) %>%
head()
## # A tibble: 6 × 2
## # Groups: randid [6]
## randid n
## <dbl> <int>
## 1 6238 3
## 2 11252 3
## 3 11263 3
## 4 12806 3
## 5 14367 3
## 6 16365 3
You can visualize this, as well. A histogram is one good choice:
fhs %>%
# Group by the study subject identifier and then count the rows for each
group_by(randid) %>%
count() %>%
ggplot(aes(x = n)) +
geom_histogram()
All study subjects have between one and three measurements. Most of the study subjects (over 3,000) have three measurements recorded in the dataset.
- Is there any missingness in the data?
We can check for missingness in a number of ways. There are a couple of great packages, visdat and naniar, that include functions for investigating missingness in a dataset. If you don’t have these installed, you can install them using install.packages("naniar") and install.packages("visdat"). The naniar package has a vignette with examples that is a nice starting point for working with both packages.
The vis_miss function from the visdat package shows missingness in a dataset in a way that lets you get a top-level snapshot:
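library(visdat)

# Visualize missingness across the whole dataset
vis_miss(fhs)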
This shows you how much data is missing for each column in the data. For a smaller dataset, the design of the plot would also let you see how often missing data line up across several columns for the same observation (in other words, if an observation that’s missing one measurement tends to also be missing several other measurements). In this case, however, there are so many rows, it’s a bit hard to visually line up missingness by row.
Another way to visualize this is with gg_miss_var from the naniar package:
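library(naniar)

gg_miss_var(fhs)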
This output again focuses on missingness by column in the data, helping you identify columns where we might not have many non-missing observations.
In this case, many of the columns have measurements that are available for all observations, with no missingness, including records of the subject’s ID; measures of death, stroke, CVD, and other events; age; and sex. Some of the values measured at visits, like BMI, total cholesterol, and glucose, are occasionally missing. Other measures asked of the participants (number of cigarettes per day, education) are also occasionally missing. Two of the variables—hdlc and ldlc (high-density lipoprotein cholesterol and low-density lipoprotein cholesterol, respectively)—are missing more often than they are available. If you read the codebook for the data, you’ll see that this is because these measurements are only available at time period 3.
You can also do faceting with the gg_miss_var function. For example, you could see if missingness varies by the period of the observation:
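gg_miss_var(fhs, facet = period)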
You may also want to check if missingness varies with whether an observation was associated with death of the study subject:
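gg_miss_var(fhs, facet = death)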
There are also functions in these packages that allow you to look at how missingness is related across variables. For example, both glucose and totchol are continuous variables, and both are occasionally missing. You can use the geom function geom_miss_point from the naniar package with a ggplot object to explore patterns of missingness among these two variables:
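fhs %>%
  ggplot(aes(x = glucose, y = totchol)) +
  geom_miss_point()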
The lower left corner shows the observations where both values are missing—it looks like there aren’t too many. For observations with one missing but not the other (the points in red along the x- and y-axes), it looks like the distribution across the non-missing variable is pretty similar to that for observations with both measurements available. In other words, totchol has a similar distribution among observations where glucose is available as among observations where glucose is missing, and the same holds for glucose among observations with and without missingness for totchol.
Since this function interfaces with ggplot, you can use any of the usual ggplot tricks in association with it. For example, you can facet by sex to explore patterns of missingness and codistribution at a finer level:
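fhs %>%
  ggplot(aes(x = glucose, y = totchol)) +
  geom_miss_point() +
  facet_wrap(~ sex)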
- How many participants died during the observation period? What is the distribution of age at time of death?
The death variable in the fhs data is an indicator of whether a participant died at any point during follow-up. It is time-invariant: that is, it takes the value 1 in all of a participant’s rows if they died at any point during the follow-up period, or 0 if they were alive at their end of follow-up, so we have to be careful about how we extract the actual number of deaths.
If you arrange by the random ID and look at period and death for each subject, you can see that the death variable is the same for all periods for each subject (this is what we mean by it being “time-invariant” in the data):
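fhs %>%
  arrange(randid) %>%
  select(randid, period, death)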
## # A tibble: 11,627 × 3
## randid period death
## <dbl> <dbl> <dbl>
## 1 2448 1 0
## 2 2448 3 0
## 3 6238 1 0
## 4 6238 2 0
## 5 6238 3 0
## 6 9428 1 0
## 7 9428 2 0
## 8 10552 1 1
## 9 10552 2 1
## 10 11252 1 0
## # ℹ 11,617 more rows
We need to keep this convention for recording the data in mind when we count the deaths.
It is often useful to extract the first (and sometimes last) observation, in order to assess certain covariate statistics at the individual level. We can create a dataset including only the first (or last) observation per participant from the fhs data using tidyverse tools. The group_by function groups data by unique values of designated variables (here randid) and the slice function selects rows as designated. Here is an example of extracting the first row (group_by(randid) %>% slice(1L)) for each study subject:
fhs_first <- fhs %>%
group_by(randid) %>%
slice(1L) %>% # The L after the one clarifies that this is an integer
ungroup()
fhs_first
## # A tibble: 4,434 × 39
## randid sex totchol age sysbp diabp cursmoke cigpday bmi diabetes bpmeds
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2448 1 195 39 106 70 0 0 27.0 0 0
## 2 6238 2 250 46 121 81 0 0 28.7 0 0
## 3 9428 1 245 48 128. 80 1 20 25.3 0 0
## 4 10552 2 225 61 150 95 1 30 28.6 0 0
## 5 11252 2 285 46 130 84 1 23 23.1 0 0
## 6 11263 2 228 43 180 110 0 0 30.3 0 0
## 7 12629 2 205 63 138 71 0 0 33.1 0 0
## 8 12806 2 313 45 100 71 1 20 21.7 0 0
## 9 14367 1 260 52 142. 89 0 0 26.4 0 0
## 10 16365 1 225 43 162 107 1 30 23.6 0 0
## # ℹ 4,424 more rows
## # ℹ 28 more variables: heartrte <dbl>, glucose <dbl>, educ <dbl>,
## # prevchd <dbl>, prevap <dbl>, prevmi <dbl>, prevstrk <dbl>, prevhyp <dbl>,
## # time <dbl>, period <dbl>, hdlc <dbl>, ldlc <dbl>, death <dbl>,
## # angina <dbl>, hospmi <dbl>, mi_fchd <dbl>, anychd <dbl>, stroke <dbl>,
## # cvd <dbl>, hyperten <dbl>, timeap <dbl>, timemi <dbl>, timemifc <dbl>,
## # timechd <dbl>, timestrk <dbl>, timecvd <dbl>, timedth <dbl>, …
Alternatively, you can use the slice_head function, which allows us to slice a designated number of rows beginning from the first observation. Because we are piping this after the group_by function, we will be slicing rows beginning from the first observation for each randid:
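fhs %>%
  group_by(randid) %>%
  slice_head(n = 1) %>%
  ungroup()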
We can similarly select the last observation for each participant:
fhs_last <- fhs %>%
group_by(randid) %>%
slice(n()) %>% # The `n()` function gives the count of rows in a group
ungroup()
fhs_last
## # A tibble: 4,434 × 39
## randid sex totchol age sysbp diabp cursmoke cigpday bmi diabetes bpmeds
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2448 1 209 52 121 66 0 0 NA 0 0
## 2 6238 2 237 58 108 66 0 0 28.5 0 0
## 3 9428 1 283 54 141 89 1 30 25.3 0 0
## 4 10552 2 232 67 183 109 1 20 30.2 0 0
## 5 11252 2 NA 58 155 90 1 30 24.6 0 0
## 6 11263 2 220 55 180 106 0 0 31.2 1 1
## 7 12629 2 220 70 149 81 0 0 36.8 0 0
## 8 12806 2 320 57 110 46 1 30 22.0 0 0
## 9 14367 1 280 64 168 100 0 0 25.7 0 0
## 10 16365 1 211 55 173 123 0 0 29.1 0 1
## # ℹ 4,424 more rows
## # ℹ 28 more variables: heartrte <dbl>, glucose <dbl>, educ <dbl>,
## # prevchd <dbl>, prevap <dbl>, prevmi <dbl>, prevstrk <dbl>, prevhyp <dbl>,
## # time <dbl>, period <dbl>, hdlc <dbl>, ldlc <dbl>, death <dbl>,
## # angina <dbl>, hospmi <dbl>, mi_fchd <dbl>, anychd <dbl>, stroke <dbl>,
## # cvd <dbl>, hyperten <dbl>, timeap <dbl>, timemi <dbl>, timemifc <dbl>,
## # timechd <dbl>, timestrk <dbl>, timecvd <dbl>, timedth <dbl>, …
or using the slice_tail function:
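fhs %>%
  group_by(randid) %>%
  slice_tail(n = 1) %>%
  ungroup()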
In this dataset we can extract statistics on baseline covariates (i.e., at the first examination) at the individual level, but also assess the number of participants with specific values, including death = 1. For example, we can use the sum function in base R, which returns the sum of all values in a given vector. In this case, since each death has the value 1, the sum function will give us the number of deaths in the sample:
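sum(fhs_first$death)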
## [1] 1550
Conversely, using tidyverse tools, we can extract the number of observations with death = 1 using the count function:
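fhs_first %>%
  count(death)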
## # A tibble: 2 × 2
## death n
## <dbl> <int>
## 1 0 2884
## 2 1 1550
Based on this analysis, 1,550 of the study subjects, or about 35% of them, died during follow-up for this study.
Note that, unlike in this sample data, in many datasets with longitudinal cohort data, survival or time-to-event outcomes will be recorded using time-varying conventions. For example, a variable for mortality will take the value of zero until the person-time observation that represents the time interval that the outcome actually happens in. For outcomes such as mortality this will typically be the last observation (since the subject won’t be tracked after death). In those cases, it will be important to take the last observation for each subject to count the number of deaths; for this dataset, we’ve got more flexibility since they use time-invariant recording conventions for outcomes like death.
In order to estimate the distribution of age at death among those participants who died during follow-up, we need to create a new age-at-death variable. First, we don’t know the age at death of any study subjects who died after their follow-up ended (whether follow-up ended with the end of the study or because they were lost to follow-up). Therefore, when we calculate, we should filter the dataset to only include subjects who died during follow-up (filter(death == 1) a bit later in the code).
Next, we need to use information in the dataset to calculate the age at the time of death for the study subjects who died during follow-up. The age variable in fhs represents the participant’s age at each visit. Typically a death would happen between visits, so the last recorded value for age would be less than the age at death. We will instead use the timedth variable to help us determine the actual age at death. The value of timedth is the number of days from the beginning of follow-up until death for those with death = 1, while it is a fixed value of timedth = 8766 (the maximum duration of follow-up) for those with death = 0 (did not die during follow-up). We can create a new age-at-death variable for those with death = 1 using the age at baseline and the timedth values:
fhs_first <- fhs_first %>%
mutate(timedthy = timedth / 365.25, # time-to-death in years
agedth = age + timedthy)
fhs_first
## # A tibble: 4,434 × 41
## randid sex totchol age sysbp diabp cursmoke cigpday bmi diabetes bpmeds
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2448 1 195 39 106 70 0 0 27.0 0 0
## 2 6238 2 250 46 121 81 0 0 28.7 0 0
## 3 9428 1 245 48 128. 80 1 20 25.3 0 0
## 4 10552 2 225 61 150 95 1 30 28.6 0 0
## 5 11252 2 285 46 130 84 1 23 23.1 0 0
## 6 11263 2 228 43 180 110 0 0 30.3 0 0
## 7 12629 2 205 63 138 71 0 0 33.1 0 0
## 8 12806 2 313 45 100 71 1 20 21.7 0 0
## 9 14367 1 260 52 142. 89 0 0 26.4 0 0
## 10 16365 1 225 43 162 107 1 30 23.6 0 0
## # ℹ 4,424 more rows
## # ℹ 30 more variables: heartrte <dbl>, glucose <dbl>, educ <dbl>,
## # prevchd <dbl>, prevap <dbl>, prevmi <dbl>, prevstrk <dbl>, prevhyp <dbl>,
## # time <dbl>, period <dbl>, hdlc <dbl>, ldlc <dbl>, death <dbl>,
## # angina <dbl>, hospmi <dbl>, mi_fchd <dbl>, anychd <dbl>, stroke <dbl>,
## # cvd <dbl>, hyperten <dbl>, timeap <dbl>, timemi <dbl>, timemifc <dbl>,
## # timechd <dbl>, timestrk <dbl>, timecvd <dbl>, timedth <dbl>, …
We can then get summary statistics on this new variable, filtering down to only the observations where death occurs during follow-up (this means that we’re only calculating average age at death among those who died during follow-up—see the note at the end of this section related to that):
fhs_first %>%
filter(death == 1) %>%
summarize(min_agedth = min(agedth),
mean_agedth = mean(agedth),
max_agedth = max(agedth),
missing_agedth = sum(is.na(agedth)))
## # A tibble: 1 × 4
## min_agedth mean_agedth max_agedth missing_agedth
## <dbl> <dbl> <dbl> <int>
## 1 38.4 69.1 91.1 0
The earliest death was at 38 years and the latest at 91, among deaths that occurred during follow-up. On average, subjects who died during follow-up for this study died when they were about 69 years old. There were no missing ages for study subjects who died during follow-up.
We can also check on these values by groups of interest such as sex:
fhs_first %>%
filter(death == 1) %>%
group_by(sex) %>%
summarize(min_agedth = min(agedth),
mean_agedth = mean(agedth),
max_agedth = max(agedth))
## # A tibble: 2 × 4
## sex min_agedth mean_agedth max_agedth
## <dbl> <dbl> <dbl> <dbl>
## 1 1 41.6 68.6 91.1
## 2 2 38.4 69.6 90.0
Of course, it’s important to remember that these are summaries of the age at death only for the study subjects who died during follow-up. The mean age at death will be different across all our study subjects, but we don’t have the information about age at death for those who died after censoring to include in calculating our summary statistics here. It is likely that they lived to older ages, if the most common reason for censoring is outliving follow-up. However, if they were censored because they dropped out of the cohort, then that might instead be associated with either longer or shorter average lifespans, in which case we don’t have a great idea of which direction the average age of death would move if they were included. One thing that you could do is to average the age at either death or loss to follow-up—this would give you a lower bound on the average across the whole population, since you know that those who were censored survived to at least the age they were when they were censored.
- What is the distribution of BMI among MI cases and non-cases? How about between smokers and non-smokers?
Both BMI and smoking are recorded in a time-varying way; that is, their values can differ between examinations for the same study subject. For example, you can look at a couple of study subjects:
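fhs %>%
  filter(randid %in% c(6238, 16365)) %>%
  select(randid, period, bmi, cursmoke)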
## # A tibble: 6 × 4
## randid period bmi cursmoke
## <dbl> <dbl> <dbl> <dbl>
## 1 6238 1 28.7 0
## 2 6238 2 29.4 0
## 3 6238 3 28.5 0
## 4 16365 1 23.6 1
## 5 16365 2 27.5 0
## 6 16365 3 29.1 0
For the study subject with ID 6238, BMI changed a little across the three examinations, but not much, and the person remained a non-smoker. For the study subject with ID 16365, BMI increased steadily across the examinations, and the person stopped smoking sometime after the first examination.
We will need to think about how to handle these time-varying values when we compare them to the time-invariant outcome (whether the subject was hospitalized for MI during follow-up). One thing we could do is look at the association between the average BMI for each study subject and whether they had a hospitalized MI event during follow-up:
bmi_vs_mi <- fhs %>%
  group_by(randid) %>%
  # Note: the argument is na.rm (not rm.na), so NAs are dropped when averaging
  summarize(bmi = mean(bmi, na.rm = TRUE),
            hospmi = first(hospmi))
bmi_vs_mi
## # A tibble: 4,434 × 3
## randid bmi hospmi
## <dbl> <dbl> <dbl>
## 1 2448 27.0 1
## 2 6238 28.9 0
## 3 9428 25.3 0
## 4 10552 29.4 0
## 5 11252 23.7 0
## 6 11263 30.9 0
## 7 12629 34.9 0
## 8 12806 22.0 0
## 9 14367 25.8 0
## 10 16365 26.7 0
## # ℹ 4,424 more rows
Now we can compare the distribution of these average BMIs among study subjects who did and did not have a hospitalization for MI during follow-up. A simple way is to create some summaries:
bmi_vs_mi %>%
group_by(hospmi) %>%
summarize(perc_25 = quantile(bmi, 0.25, na.rm = TRUE),
mean = mean(bmi, na.rm = TRUE),
median = median(bmi, na.rm = TRUE),
perc_75 = quantile(bmi, 0.75, na.rm = TRUE))
## # A tibble: 2 × 5
## hospmi perc_25 mean median perc_75
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 23.1 25.8 25.4 28.0
## 2 1 24.0 26.5 26.2 28.5
We can also create a plot to help explore this question:
bmi_vs_mi %>%
ggplot(aes(x = bmi)) +
geom_histogram() +
  facet_wrap(~ hospmi, ncol = 1, scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 44 rows containing non-finite outside the scale range
## (`stat_bin()`).
There is not a dramatic difference between the two groups in terms of the distribution of BMI.
We might want to check how stable BMI tends to be within each study subject, comparing variation in BMI within subjects (at different timepoints) to variation between subjects, to see whether the average is a reasonable summary of BMI for a study subject. We can check this within just the study subjects who have three examinations and no missing BMI measurements at those examinations, using a random sample of 20 such subjects (because this uses sample, the sample you get, and so the plot, will likely differ from run to run):
sample_check <- fhs %>%
  group_by(randid) %>%
  # Keep subjects with three examinations and no missing BMI values
  filter(max(period) == 3 & !anyNA(bmi)) %>%
  pull(randid) %>%
  # De-duplicate IDs so we sample 20 distinct subjects
  unique() %>%
  sample(size = 20)
fhs %>%
filter(randid %in% sample_check) %>%
mutate(sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))) %>%
ggplot(aes(x = age, y = bmi, color = sex, group = randid)) +
geom_line(alpha = 0.4)
While there is some variation within study subjects in BMI, there tends to be more variation when comparing one study subject to another. Average BMI across the three examinations is therefore likely a reasonable measurement to use in exploratory analysis as we did in the previous plot.
Another alternative we could have considered is to use the BMI at the first examination as the measure of the BMI risk factor for each study subject; if you’d like, try that out to see if it changes our main conclusions.
We can use a similar approach to compare BMI and smoking. In this case, one way we could summarize smoking for each study subject is to determine if they were a current smoker at any of their examinations. If we do this, we see that there may be a small, but not dramatic, difference in BMI between smokers and non-smokers:
bmi_vs_smoke <- fhs %>%
  group_by(randid) %>%
  summarize(bmi = mean(bmi, na.rm = TRUE),
            smoke = max(cursmoke))
bmi_vs_smoke %>%
group_by(smoke) %>%
summarize(perc_25 = quantile(bmi, 0.25, na.rm = TRUE),
mean = mean(bmi, na.rm = TRUE),
median = median(bmi, na.rm = TRUE),
perc_75 = quantile(bmi, 0.75, na.rm = TRUE))
## # A tibble: 2 × 5
## smoke perc_25 mean median perc_75
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 23.8 26.6 26.1 28.8
## 2 1 22.7 25.3 24.9 27.4
Again, we might want to check how “stable” smoking status is among study subjects, to get a better idea of how reasonable it is to use a single summary of smoking for each subject in this exploratory analysis.
fhs %>%
group_by(randid) %>%
summarize(always_no = max(cursmoke) == 0,
always_yes = min(cursmoke) == 1,
changes = !((sum(cursmoke) / n()) %in% c(0, 1))) %>%
ungroup() %>%
summarize(always_no = sum(always_no),
always_yes = sum(always_yes),
changes = sum(changes))
## # A tibble: 1 × 3
## always_no always_yes changes
## <int> <int> <int>
## 1 2100 1585 749
Less than 20% of the study subjects changed their smoking status over the course of the examinations. This isn’t negligible, but it also means that for the majority of the study subjects, their smoking status was stable across all study examinations.
7.3 Coding a survival analysis
In the context of survival analysis, what is modeled is the time to an event (also referred to as survival time or failure time). This is a bit different from the models in the linear or glm family, which model an outcome that may follow a Gaussian (linear regression), binomial (logistic model), or Poisson distribution. Another difference is that the outcome (time to event) will not be observed for some participants, as they will not have experienced the event of interest during their follow-up. These participants are considered ‘censored’. Censoring can occur in three ways:
- the participant does not experience the event of interest before the study end
- the participant is lost to follow-up before experiencing the event of interest
- the participant experiences a different event that makes the event of interest impossible (for example, if the event of interest is acute MI, a participant who dies from a different cause is considered censored)
These are all types of right censoring, and in simple survival analysis they are assumed to be uninformative (typically, not related to the exposure). If the censoring is related to both the exposure and the outcome, then you must adjust for censoring, or it could bias the estimates from your model.
Let’s assume that we are interested in all-cause mortality as the event of interest, and let’s denote \(T\) as time to death, where we can assume that \(T\geq 0\). We define the survival function as \(S(t)=Pr[T>t]=1-F(t)\), where the survival function \(S(t)\) is the probability that a participant survives past time \(t\) (\(Pr[T>t]\)). \(F(t)\) is the cumulative distribution function (sometimes also denoted as the cumulative incidence function, \(R(t)\)), or the probability that an individual will have a survival time less than or equal to \(t\) (\(Pr[T\leq t]\)).
Time to event \(t\) is bounded by \([0,\infty)\) (i.e., the time could be as low as 0, but no lower, and has no upper bound) and \(S(t)\) is non-increasing as \(t\) becomes greater. At \(t=0\), \(S(t)=1\), and as \(t\) approaches \(\infty\), \(S(t)\) approaches 0. A property of the survival and cumulative distribution functions is \(S(t) = 1 - F(t)\): the survival function and the cumulative distribution function (or cumulative incidence function, \(R(t)\)) sum to 1.
Another useful function is the hazard function, \(h(t)\), which is the instantaneous potential of experiencing an event at time \(t\), conditional on having survived to that time (\(h(t)=\lim_{\Delta t \to 0}\frac{Pr[t<T\leq t+\Delta t|T>t]}{\Delta t}=\frac{f(t)}{S(t)}\), where \(f(t)\) is the probability density function of \(T\)). The cumulative hazard function, \(H(t)\), is defined as the integral of the hazard function from time \(0\) to time \(t\), which equals the area under the curve \(h(t)\) between time \(0\) and time \(t\) (\(H(t)=\int_{0}^{t}h(u)du\)). If we know any of \(S(t)\), \(H(t)\), or \(h(t)\), we can derive the rest based on the following relationships:
\(h(t)=-\frac{\partial log(S(t))}{\partial t}\)
\(H(t)=-log(S(t))\) and conversely \(S(t)=exp(-H(t))\)
The survival package in R allows us to fit these types of models, including a very popular model in survival analysis, the Cox proportional hazards model. This is the model that was also applied in one of this chapter’s required readings, Wong et al. (1989). There are also some simple non-parametric ways to explore survival times, which we’ll also explore in the exercise.
Applied exercise: Survival curves and simple survival analysis
- What does the survival curve for mortality look like with follow-up time as the time scale of interest? How about with age?
- How do (survival) curves for mortality compare between males and females? How about for MI?
- What is the Hazard Ratio for smoking and the risk of MI, from a Cox Proportional Hazards model?
1. What does the survival curve for mortality look like with follow-up time as the time scale of interest? How about with age?
A very simple way to estimate survival is with the non-parametric Kaplan-Meier estimator. In R, we can estimate survival \(S(t)\), with all-cause mortality representing failure, using the Surv and survfit functions from the survival package, as described below.
The Surv function will be key as we look at time-to-event data. This function takes two vectors: one for the time parameter, which gives the follow-up time (either the time to the event, if the event happens during follow-up, or the time to censoring), and one for the event parameter, which gives a status indicator of whether the event happened during follow-up or whether the person was censored before the event happened. For our example data, we can use the column timedth to give the time until either death or censoring, and the death column as an indicator of whether death happened before censoring.
The output of the Surv function is a special type of object in R with the class “Surv”. If you look at the first few values, you can see that this object records both the follow-up time and the status at that follow-up time:
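library(survival)

# A sketch: build the Surv object directly (the name surv_death is just for
# illustration; it isn't used elsewhere in the chapter)
surv_death <- Surv(time = fhs_first$timedth, event = fhs_first$death)
class(surv_death)
head(surv_death)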
## [1] "Surv"
## [1] 8766+ 8766+ 8766+ 2956 8766+ 8766+
The numbers assigned to each individual represent their final measured time. A number followed by a plus sign indicates that the participant was censored at that time without developing the outcome (they had not yet failed/died), while numbers without the plus sign are the times at which participants developed the outcome (failure/death).
We can use the Surv function inside the survfit function to estimate the Kaplan-Meier curve for a set of data. If we aren’t considering any covariates, we can include ~ 1 in the model equation, to estimate with only an intercept, rather than to explore differences in the curve based on covariates (we’ll get to that idea later):
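fit_time <- survfit(Surv(timedthy, death) ~ 1, data = fhs_first)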
This creates an object of the survfit class, which we can then use in some special plotting functions to look at the curve it estimated. If you print this object, you can see that it gives us some information about the total number of study subjects as well as the total number of events during follow-up (in this case, deaths):
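class(fit_time)
fit_time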
## [1] "survfit"
## Call: survfit(formula = Surv(timedthy, death) ~ 1, data = fhs_first)
##
## n events median 0.95LCL 0.95UCL
## [1,] 4434 1550 NA NA NA
You can also look at the structure of this object with str:
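str(fit_time)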
## List of 16
## $ n : int 4434
## $ time : num [1:1419] 0.0712 0.0931 0.1095 0.1232 0.1259 ...
## $ n.risk : num [1:1419] 4434 4433 4432 4431 4430 ...
## $ n.event : num [1:1419] 1 1 1 1 1 1 1 1 1 1 ...
## $ n.censor : num [1:1419] 0 0 0 0 0 0 0 0 0 0 ...
## $ surv : num [1:1419] 1 1 0.999 0.999 0.999 ...
## $ std.err : num [1:1419] 0.000226 0.000319 0.000391 0.000451 0.000505 ...
## $ cumhaz : num [1:1419] 0.000226 0.000451 0.000677 0.000902 0.001128 ...
## $ std.chaz : num [1:1419] 0.000226 0.000319 0.000391 0.000451 0.000505 ...
## $ type : chr "right"
## $ logse : logi TRUE
## $ conf.int : num 0.95
## $ conf.type: chr "log"
## $ lower : num [1:1419] 0.999 0.999 0.999 0.998 0.998 ...
## $ upper : num [1:1419] 1 1 1 1 1 ...
## $ call : language survfit(formula = Surv(timedthy, death) ~ 1, data = fhs_first)
## - attr(*, "class")= chr "survfit"
The object lists many different times during follow-up. For each, it has determined the number of people in the study at risk at that time (not yet dead or censored) and the number of events at that time point, as well as some other values. These calculations allow it to estimate the survival probability and cumulative hazard at each time point during follow-up.
There are a number of ways you can plot this output to get a better idea of patterns in the survival curve estimated from the data. The survminer package allows you to do this in a way that interfaces well with ggplot (note also how you can use expression to include mathematical notation in an axis label):
library(survminer) # for plotting survival plots with ggplot2
fit_time %>%
  ggsurvplot(xlab = "Time to death (years)",
             ylab = expression(paste('Overall Survival Probability ',
                                     hat(S)*"(t)")))
We can see that as follow-up time increases, survival decreases steadily, though slowly. In other words, the number of people who have died increases with the length of follow-up, but not very quickly (which makes sense, since the study population was largely healthy at the start of the study). Survival \(\hat{S}(t)\) drops to about 0.65 at the end of follow-up; in other words, about 35% of participants have died, which is what we expect, since we already know that 1,550 of 4,434 participants died during follow-up.
We can repeat this estimation with a different time scale of interest. Other than follow-up time, we may also be interested in survival and failure (mortality) with respect to age. We repeat the same code, changing only the first argument in the Surv function, substituting age at death for time of death with respect to follow-up time:
fit_age <- survfit(Surv(agedth, death) ~ 1, data = fhs_first)
fit_age %>%
  ggsurvplot(xlab = "Age (years)",
             ylab = expression(paste('Overall Survival Probability ',
                                     hat(S)*"(t)")))
We see that the shape of this survival curve is different, with virtually no one dying until they reach their 40s (part of this is likely because this study focused on subjects in their 30s and older at the baseline examination period), and then a sharper drop in survival as age increases.
2. How do (survival) curves for mortality compare between males and females? How about for MI?
Kaplan-Meier curves like the above are useful for comparing the survival rate between two groups. For example, if we wanted to compare the survival rates between males and females, we would fit the same model as above with sex as an independent variable. For all-cause mortality, we can run these models both based on follow-up time and on age (in separate models):
fit_bysex <- survfit(Surv(timedthy, death) ~ sex, data = fhs_first)
fit_age_bysex <- survfit(Surv(agedth, death) ~ sex, data = fhs_first)
fit_bysex %>%
  ggsurvplot(xlab = "Time to death (years)",
             ylab = expression(paste('Overall Survival Probability ',
                                     hat(S) * "(t)")),
             legend.labs = c("Male", "Female"),
             legend.title = "Sex")
fit_age_bysex %>%
  ggsurvplot(xlab = "Age (years)",
             ylab = expression(paste('Overall Survival Probability ',
                                     hat(S) * "(t)")),
             legend.labs = c("Male", "Female"),
             legend.title = "Sex")
You can now see that the survival rate for males drops more quickly (at a younger age) than for females, and that it also drops more quickly for males when looking across follow-up time.
Similarly, we can look at MI as the outcome. If we want to compare across age on the x-axis, then we need to calculate the age at the time of the first hospitalization for MI during follow-up. We can then put that variable in as the time parameter in Surv, use the status of whether the subject had an MI hospitalization by the end of follow-up for the event parameter, and plot as before:
fhs_first <- fhs_first %>%
  mutate(timemiy = timemi / 365.25,
         agemi = age + timemiy)
fit_age_MIbysex <- survfit(Surv(agemi, hospmi) ~ sex, data = fhs_first)
fit_age_MIbysex %>%
  ggsurvplot(xlab = "Age (years)",
             ylab = expression(paste('MI Survival Probability ',
                                     hat(S) * "(t)")),
             legend.labs = c("Male", "Female"),
             legend.title = "Sex")
Once again we see a difference in the survival rates (in this case, “survival” until first hospitalized myocardial infarction, even though the subject might survive the event) with age by sex, which is in line with what we already know from the literature.
We can actually approach survival rates by smoking status in the same manner (in this case, we’ll use the subject’s smoking status at the baseline examination):
fit_age_MIsmoking <- survfit(Surv(agemi, hospmi) ~ cursmoke, data = fhs_first)
fit_age_MIsmoking %>%
  ggsurvplot(xlab = "Age (years)",
             ylab = expression(paste('MI Survival Probability ',
                                     hat(S) * "(t)")),
             legend.labs = c("Non-smokers", "Smokers"),
             legend.title = "Smoking status at baseline")
Once again we can observe that there is a difference in survival rates for MI, by smoking status at baseline.
The advantages of the Kaplan-Meier estimator of the survival function are its simplicity and the fact that it is non-parametric. One limitation of Kaplan-Meier curves is that, in this simple form of visualizing a survival rate, we cannot adjust for confounding by other variables, as the survival rates we are plotting here are marginal with respect to everything else. For example, we can compare survival rates among smokers and non-smokers, but we can’t simply plot a sex-adjusted survival rate for each, as the baseline rate for males and females will differ. What we can estimate while adjusting for other covariates is a survival time ratio, using a parametric failure time model. The survreg function in the survival package will fit such a failure time model, with time to event as the outcome of interest. Unlike the Kaplan-Meier estimator, this requires us to make an assumption about the distribution of time to event. Usually time-to-event outcomes are assumed to follow the exponential, Weibull, log-normal (log(T) is normally distributed), or log-logistic distributions.
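For example, a minimal sketch of such a model, assuming (purely for illustration) a Weibull distribution for time to death:

survreg(Surv(timedthy, death) ~ cursmoke + sex,
        data = fhs_first, dist = "weibull")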
Over the past few decades, however, survival analysis for longitudinal cohort data has been dominated by the Cox proportional hazards model, and this is the type of model we will focus on for the rest of the chapter. The main advantage of the Cox proportional hazards model is that we don’t have to make any distributional assumptions about the outcome or residuals. We simply model the instantaneous hazard of the outcome as a function of covariates of interest, and the assumption we have to make is that of ‘proportional hazards’. This assumption stipulates that the hazards across levels of the covariates of interest are proportional over time. In other words, the ratio of the hazards across levels of covariates should be constant over time.
The Cox proportional hazards model, in a simple form with a single covariate \(X\), is:
\(log(\lambda(t|X))=log(\lambda_{0}(t))+\beta_{1}\times X\)
where \(\lambda(t)\) represents the hazard at time \(t\), \(\lambda_{0}(t)\) is the baseline hazard at time \(t\), and \(\beta_{1}\) is the log hazard ratio for those with \(X=1\) compared to \(X=0\). The baseline hazard \(\lambda_{0}(t)\) is similar to the intercept term in a linear model or GLM and is the value of the hazard when all covariates equal 0. However, unlike the intercept term in a linear model or GLM, \(\lambda_{0}(t)\) is not estimated by the model. The above model can also be written as
\(\lambda(t|X)=\lambda_{0}(t)\times e^{\beta_{1}\times X}\)
\(e^{\beta_{1}}\) is the hazard ratio comparing those with \(X=1\) to those with \(X=0\).
Using the fhs data, we will fit a simple Cox proportional hazards model for the effect of smoking on the hazard of death:
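# A sketch of the model fit summarized below
coxph_mod1 <- coxph(Surv(timedth, death) ~ cursmoke, data = fhs_first)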
If we look at the estimated parameters of the model (we can use broom to pull out a tidy version of these summaries, just as we did with GLM models), there isn’t an intercept term, as noted above, just an estimate for the covariate we included (cursmoke, for smoking status at the baseline examination):
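library(broom)

coxph_mod1 %>%
  tidy()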
## # A tibble: 1 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 cursmoke 0.0867 0.0508 1.71 0.0880
The parameter for the covariate of interest is equivalent to the log of the hazard ratio comparing current smokers at baseline to non-smokers. We can extract the hazard ratio by exponentiating that parameter:
coxph_mod1 %>%
tidy() %>%
filter(term == "cursmoke") %>%
mutate(hr = exp(estimate),
low_ci = estimate - 1.96 * std.error,
high_ci = estimate + 1.96 * std.error,
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 1 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 cursmoke 1.09 0.987 1.20
We see that the HR suggests a modest elevation in the hazard of death for smokers, but the confidence interval includes the null (a hazard ratio of 1).
We have said that the main assumption we need to make here is that of proportional hazards. The survival package allows us to check this with the cox.zph function:
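cox.zph(coxph_mod1)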
## chisq df p
## cursmoke 0.511 1 0.47
## GLOBAL 0.511 1 0.47
The output of this function is the result of a \(\chi^2\) test of the proportional hazards assumption. If the p-value here were below 0.05, we would reject the null hypothesis that the proportional hazards assumption holds. We can also plot the parameter(s) of interest across time from this output. If the proportional hazards assumption holds (constant HR), then the parameter should resemble a horizontal line with respect to time.
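One way to make this plot is with the ggcoxzph function from the survminer package (a sketch; the base-R plot method for cox.zph objects would work as well):

coxph_mod1 %>%
  cox.zph() %>%
  ggcoxzph()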
Here we see that the line is basically horizontal. Now let’s repeat the model, adjusting for some covariates, specifically sex and age. We can include these in a model equation in coxph in a very similar way to how we built model equations for GLMs. The only difference is that the “outcome” part of the formula should be a Surv object, rather than a direct (untransformed) measure from the original data:
coxph_mod2 <- coxph(Surv(timedth, death) ~ cursmoke + sex + age,
data = fhs_first)
coxph_mod2 %>%
tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 cursmoke 0.338 0.0535 6.32 2.64e- 10
## 2 sex -0.540 0.0525 -10.3 9.35e- 25
## 3 age 0.0967 0.00327 29.6 3.79e-192
We can already see that the parameter for smoking is now quite a bit higher, but let’s estimate the HR and 95% CI:
coxph_mod2 %>%
tidy() %>%
filter(term == "cursmoke") %>%
mutate(hr = exp(estimate),
low_ci = estimate - 1.96 * std.error,
high_ci = estimate + 1.96 * std.error,
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 1 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 cursmoke 1.40 1.26 1.56
The estimated hazard ratio is now 1.40, and the 95% CI does not include the null (in fact, the lower bound of the 95% CI is 1.26). We can determine that there was some confounding by these variables (sex and age at the baseline examination), leading the estimate from the previous model of the association between smoking and time to death to be biased towards the null. Let’s test the proportional hazards assumption for this model:
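cox.zph(coxph_mod2)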
## chisq df p
## cursmoke 0.373 1 0.541
## sex 3.776 1 0.052
## age 3.046 1 0.081
## GLOBAL 7.747 3 0.052
We see that the assumption holds, though the results of the test for both sex and age are close to rejecting the assumption (p-values close to 0.05; if they were below 0.05, we’d reject the null hypothesis that the assumption holds).
Now let’s see what happens if we repeat the above model using age, rather than follow-up time, as the time scale of interest in the survival function of the model:
coxph_modage1 <- coxph(Surv(agedth, death) ~ cursmoke + sex, data = fhs_first)
coxph_modage1 %>%
tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 cursmoke 0.373 0.0529 7.05 1.73e-12
## 2 sex -0.539 0.0526 -10.2 1.46e-24
Notice that we did not also include age as an additional parameter in the model, since using it as the time scale inherently adjusts for it. We can again convert the output to a hazard ratio and 95% CI:
coxph_modage1 %>%
tidy() %>%
filter(term == "cursmoke") %>%
mutate(hr = exp(estimate),
low_ci = estimate - 1.96 * std.error,
high_ci = estimate + 1.96 * std.error,
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 1 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 cursmoke 1.45 1.31 1.61
We also see that the HR is similar to that from the previous model (although slightly higher in this case).
Testing for the proportional hazards assumption in this model:
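cox.zph(coxph_modage1)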
## chisq df p
## cursmoke 11.637 1 0.00065
## sex 0.657 1 0.41770
## GLOBAL 15.358 2 0.00046
Here we see that the assumption fails. If we plot the results of the model, we can also see that we no longer have a horizontal line; in the case of smoking, the parameter deviates from that line significantly.
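A sketch of that plot, again using ggcoxzph:

coxph_modage1 %>%
  cox.zph() %>%
  ggcoxzph()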
7.4 Handling complexity
You may have noticed that the models we’ve been fitting above only use the first observation for each subject from the dataset (including the death and time-of-death variables). This would be fine if all covariates of interest were time-invariant (e.g., sex: its value at baseline would be representative for the duration of follow-up). However, this is a longitudinal dataset with multiple observations per participant, and, as we have seen with BMI and smoking, some measured values change over time. The time-varying nature of these values can be informative, whether for an exposure, a covariate, or an outcome of interest, and we will now see how to leverage this extra information in the survival analysis context. We will create a longitudinal dataset appropriate for analysis and go through the steps of dealing with repeated outcome measures, time-varying exposures and covariates, and handling multi-level exposures in this setting.
7.4.1 Recurrent outcome and time varying-exposures
We mentioned earlier how, typically, survival outcomes in longitudinal studies may be represented in a time-varying fashion. These variables will usually take the value of 0 until someone becomes a case (i.e., an event like death happens to the person), at which point the value is 1 (and usually this will be the last observation for a given participant). We will go ahead and create these types of variables for death and incident MI in our longitudinal dataset.
Applied exercise: Survival analysis with time-varying values
- Construct a longitudinal dataset with repeated (time-varying) values for death and MI hospitalization.
- Fit a Cox model for the effect of smoking on mortality and MI using this dataset. How do the results compare to the results from the models in 7.3?
- Construct an exposure variable representing the history of smoking (identify if someone is a current, former, or never smoker at each period). Repeat the model from 2, using this exposure instead. What added information does this model give us?
1. Construct a longitudinal dataset with repeated (time-varying) values for death and MI hospitalization.
The original fhs dataset already has multiple observations per participant, denoted by the period variable, with a range of 1–3. Covariate values can change from period to period, as we’ve already seen, but the variables for death and cardiovascular outcomes were not time-varying; they were simply indicators of the event occurring at any point during follow-up. We will create time-varying variables for these events, as well as time variables indicating the beginning and end of follow-up time in each period.
fhstv <- fhs %>%
  group_by(randid) %>%
  mutate(time_next = lead(time), ### bring forward the start time of the next period
         time2 = ifelse(!is.na(time_next),
                        time_next - 1,
                        timedth), ### last day of follow-up in a given period
         deathtv = ifelse(death == 1 & timedth <= time2, 1, 0), ### 1 only in the period in which death occurred
         timemi2 = ifelse(time2 <= timemi, time2, timemi), ### truncate the interval end at the time of MI
         hospmitv = ifelse(hospmi == 1 & timemi <= timemi2, 1, 0)) %>% ### MI event indicator for this interval
  ungroup()
We can look at each piece of this. First, we can check the time-related variables we’ve added:
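The select call used here isn’t shown, but a sketch like the following would produce the check below:
fhstv %>%
  select(randid, period, time, time_next, time2)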
## # A tibble: 11,627 × 5
## randid period time time_next time2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2448 1 0 4628 4627
## 2 2448 3 4628 NA 8766
## 3 6238 1 0 2156 2155
## 4 6238 2 2156 4344 4343
## 5 6238 3 4344 NA 8766
## 6 9428 1 0 2199 2198
## 7 9428 2 2199 NA 8766
## 8 10552 1 0 1977 1976
## 9 10552 2 1977 NA 2956
## 10 11252 1 0 2072 2071
## # ℹ 11,617 more rows
You can see that we’ve grabbed the time of the start of the next period, then reduced this by one to get the last time in the current period. For example, study subject #2448 has examinations for periods 1 and 3. Since period 3 starts at day 4,628, we can figure out that the first period’s data represent all days up to day 4,627. The last period represents measures from the time of the last examination to the time of death or loss to follow-up, so we can use the column with the time of death / end of follow-up in the original data to figure out this time.
Next, we can look at the variable we’ve added related to death. Let’s filter to two study subjects who died during follow-up, to see how this element works:
fhstv %>%
filter(randid %in% c(10552, 24721)) %>%
select(randid, period, time, time2, death, timedth, deathtv)
## # A tibble: 5 × 7
## randid period time time2 death timedth deathtv
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10552 1 0 1976 1 2956 0
## 2 10552 2 1977 2956 1 2956 1
## 3 24721 1 0 2267 1 6411 0
## 4 24721 2 2268 4407 1 6411 0
## 5 24721 3 4408 6411 1 6411 1
The original death column is time-invariant: if the study subject died during follow-up, it will be “1” for all of the subject’s rows of data. With the new deathtv variable, the time of death is lined up with the time of each examination (period). For subject #10552, death occurred on day 2,956. This was after the study subject’s first period of observation, which went from day 0 to day 1,976, but during the second period, which went from day 1,977 to day 2,956 (in other words, the second period ended with the subject’s death). Therefore, deathtv is 0 in the first period (a death did not occur during this follow-up period for this subject) but 1 during the second period.
You can check the MI variables to see that they work in the same way:
fhstv %>%
filter(randid %in% c(10552, 24721)) %>%
select(randid, period, time, time2, hospmi, timemi, timemi2, hospmitv)
## # A tibble: 5 × 8
## randid period time time2 hospmi timemi timemi2 hospmitv
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10552 1 0 1976 0 2956 1976 0
## 2 10552 2 1977 2956 0 2956 2956 0
## 3 24721 1 0 2267 0 6411 2267 0
## 4 24721 2 2268 4407 0 6411 4407 0
## 5 24721 3 4408 6411 0 6411 6411 0
We can check that our new variables agree with the totals from above. Since these variables take the value of 1 only once for each case, their sum across the entire longitudinal dataset should equal the totals we calculated earlier from the first row of data for each study subject:
# Summarize by counting the number of "1"s for the time-variant death variable
fhstv %>%
summarize(tot_deaths = sum(deathtv))
## # A tibble: 1 × 1
## tot_deaths
## <dbl>
## 1 1550
# Summarize as we did earlier in the chapter, by taking the values from
# the first measurement for each study subject
fhstv %>%
group_by(randid) %>%
slice(1) %>%
ungroup() %>%
summarize(tot_deaths = sum(death))
## # A tibble: 1 × 1
## tot_deaths
## <dbl>
## 1 1550
2. Fit a Cox model for the effect of smoking on mortality and MI hospitalizations using this dataset. How do the results compare to the results from the models in 7.3?
Using the dataset with the time-varying outcomes we created above, we will repeat the same model for death (and MI hospitalizations) as a function of smoking, age, and sex. However, unlike the models in section 7.3, we are not interested in the hazard between the beginning and end of follow-up for each participant, but rather the hazard during each period. To address this, we will slightly modify the Surv call in our model to accommodate the specific interval for each period, by designating two time variables (time and time2) instead of one. If you look at the results of using Surv with two times, you can see that it now creates a range for each observation, including an indicator of whether the person was still alive at the end of that range (“+” after the second number in the interval) or died at the end time of the range:
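As a sketch (the exact call isn’t shown in the text), the intervals below can be printed by building the Surv object directly and printing its first 20 entries:
with(fhstv, Surv(time, time2, deathtv))[1:20]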
## [1] ( 0,4627+] (4628,8766+] ( 0,2155+] (2156,4343+] (4344,8766+]
## [6] ( 0,2198+] (2199,8766+] ( 0,1976+] (1977,2956] ( 0,2071+]
## [11] (2072,4284+] (4285,8766+] ( 0,2177+] (2178,4350+] (4351,8766+]
## [16] ( 0,2211+] (2212,8766+] ( 0,2169+] (2170,4288+] (4289,8766+]
You can use this in a Cox regression model as before and extract estimates (log hazard ratio) for each coefficient:
coxph_modtv1 <- coxph(Surv(time, time2, deathtv) ~ cursmoke + age + sex,
data = fhstv)
coxph_modtv1 %>%
tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 cursmoke 0.378 0.0545 6.94 3.80e- 12
## 2 age 0.0871 0.00319 27.3 8.13e-164
## 3 sex -0.546 0.0518 -10.5 5.42e- 26
And if we look specifically at the HR for smoking:
coxph_modtv1 %>%
tidy() %>%
filter(term == "cursmoke") %>%
mutate(hr = exp(estimate),
low_ci = estimate - 1.96 * std.error,
high_ci = estimate + 1.96 * std.error,
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 1 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 cursmoke 1.46 1.31 1.62
We see that our estimates are similar to the ones from the model looking only at baseline values and death, so it doesn’t look like we’ve gained much in terms of information.
Now let’s repeat this process with first MI hospitalization as the outcome:
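The code isn’t shown above, but a model along these lines would produce the warning and output below (the object name coxph_modhospmitv1 is our assumption):
coxph_modhospmitv1 <- coxph(Surv(time, timemi2, hospmitv) ~ cursmoke + age + sex,
                            data = fhstv)
coxph_modhospmitv1 %>%
  tidy()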
## Warning in Surv(time, timemi2, hospmitv): Stop time must be > start time, NA
## created
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 cursmoke 0.482 0.107 4.52 6.08e- 6
## 2 age 0.0455 0.00610 7.45 9.50e-14
## 3 sex -1.11 0.110 -10.1 4.77e-24
There is a warning here indicating that not all of our time variables are in agreement with the rule time2 > time1. When that rule is violated, the Cox model cannot assess a hazard for a time interval of length <= 0. In our dataset we are including periods after a participant has experienced an MI hospitalization. The Surv function will create missing values for the hazard for these observations, so the estimation of model parameters is not affected; however, we should limit the dataset to the person-time actually at risk for an incident MI hospitalization. Luckily the dataset includes a time-varying prevmi variable, indicating a prevalent MI in each period, which we can use to subset to the person-time still at risk for an incident MI hospitalization.
fhstv_incMI <- fhstv %>%
filter(prevmi == 0)
coxph_modhospmitv2 <- coxph(Surv(time, timemi2, hospmitv) ~
cursmoke + age + sex, data = fhstv_incMI)
## Warning in Surv(time, timemi2, hospmitv): Stop time must be > start time, NA
## created
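The coefficient summary below can presumably be produced, as before, with tidy():
coxph_modhospmitv2 %>%
  tidy()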
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 cursmoke 0.487 0.107 4.57 4.88e- 6
## 2 age 0.0465 0.00611 7.61 2.69e-14
## 3 sex -1.12 0.110 -10.2 3.00e-24
We see that the warning is still there. If we look into our data, we can check why this is:
fhstv_incMI %>%
filter(timemi2 <= time) %>%
pull(randid) # Identify study subjects where this is an issue
## [1] 1338446
# Check out the data for this study subject
fhstv_incMI %>%
filter(randid == '1338446') %>%
select(randid, period, time, timemi2, timemi, hospmi, hospmitv)
## # A tibble: 2 × 7
## randid period time timemi2 timemi hospmi hospmitv
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1338446 1 0 1901 1902 0 0
## 2 1338446 2 1902 1902 1902 0 0
This is probably a single participant who was lost to follow-up right after their second period visit, and was assigned the time of the beginning of that period for all time-to-event values (without actually experiencing the outcomes); in other words, they were censored at exactly this time. Again, the model doesn’t include this observation in the estimation, since it creates a missing value for the hazard, so we don’t have to worry about our results, but it’s good to investigate issues like this. We can visualize the model results with a forest plot, which we can generate using ggforest from the survminer package we installed above. We also extract the HR and 95% CI for the variable of interest.
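A minimal sketch of the forest-plot call; note that calling ggforest without its data argument is what triggers the warning shown below:
library(survminer)
ggforest(coxph_modhospmitv2)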
## Warning in .get_data(model, data = data): The `data` argument is not provided.
## Data will be extracted from model fit.
coxph_modhospmitv2 %>%
tidy() %>%
filter(term == "cursmoke") %>%
mutate(hr = exp(estimate),
low_ci = estimate - 1.96 * std.error,
high_ci = estimate + 1.96 * std.error,
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 1 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 cursmoke 1.63 1.32 2.01
We see that the hazard ratio comparing current smokers to current non-smokers is 1.63 (95% CI: 1.32–2.01), indicating an elevated risk of MI hospitalization associated with current smoking.
However, the key word in that interpretation is ‘current’. In this model, for each observation (conditional on sex and age), participants with a smoking value of zero are treated equally regardless of whether they are never smokers or former smokers (the latter would have changed from cursmoke = 1 to cursmoke = 0 sometime during follow-up). Based on what we know about smoking history and health, we are therefore probably not making full use of the information we have with this approach.
3. Construct an exposure variable representing the history of smoking (identify whether someone is a current, former, or never smoker at each period). Repeat the model from 2, using this exposure instead. What added information does this model give us?
We can construct a new time-varying exposure variable that allows us to distinguish between those who are never smokers and those who are currently non-smoking but smoked at some point in the past. Above, we’ve already identified people with a change in current smoking status during follow-up, which in theory includes both people who started smoking after beginning as non-smokers and those who quit smoking and became former smokers.
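As a quick sanity check (a minimal sketch; n_distinct comes from dplyr), we can count how many participants ever change their cursmoke value during follow-up:
fhstv %>%
  group_by(randid) %>%
  summarize(smoke_changed = n_distinct(cursmoke) > 1) %>%
  count(smoke_changed)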
We can add a column named smoking that takes three possible values: “never”, “current”, and “former”. First, let’s create a variable called previous_smoking that marks whether the subject was ever recorded as a current smoker at previous examinations. For this we can use the cummax function after grouping by the subject ID; this function takes the maximum value up to (and including) the current observation. Since smoking is marked with 1 and not smoking with 0 for cursmoke, this new variable marks whether cursmoke has been 1 at any point up to the observation time:
fhstv <- fhstv %>%
group_by(randid) %>%
mutate(previous_smoking = cummax(cursmoke)) %>%
ungroup()
fhstv %>%
filter(randid %in% c(16365, 67905, 68397)) %>% # Show for some examples where smoking status changed
select(randid, period, cursmoke, previous_smoking)
## # A tibble: 9 × 4
## randid period cursmoke previous_smoking
## <dbl> <dbl> <dbl> <dbl>
## 1 16365 1 1 1
## 2 16365 2 0 1
## 3 16365 3 0 1
## 4 67905 1 1 1
## 5 67905 2 1 1
## 6 67905 3 0 1
## 7 68397 1 0 0
## 8 68397 2 1 1
## 9 68397 3 1 1
This previous_smoking variable will make it easier to create a smoking variable of never, current, or former smoker:
library(forcats)
fhstv <- fhstv %>%
group_by(randid) %>%
mutate(smoking = case_when(
cursmoke == 1 ~ "current",
cursmoke == 0 & previous_smoking == 1 ~ "former",
cursmoke == 0 & previous_smoking == 0 ~ "never",
TRUE ~ NA_character_ # This is a catch-all, in case there are any cases missed by the earlier logic
),
smoking = as_factor(smoking), # Change to a factor
smoking = fct_relevel(smoking, "never", "former", "current")) # Get factor levels in desired order (to ensure "never" is the baseline)
fhstv %>%
filter(randid %in% c(16365, 67905, 68397)) %>% # Show for some examples where smoking status changed
select(randid, period, cursmoke, previous_smoking, smoking)
## # A tibble: 9 × 5
## # Groups: randid [3]
## randid period cursmoke previous_smoking smoking
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 16365 1 1 1 current
## 2 16365 2 0 1 former
## 3 16365 3 0 1 former
## 4 67905 1 1 1 current
## 5 67905 2 1 1 current
## 6 67905 3 0 1 former
## 7 68397 1 0 0 never
## 8 68397 2 1 1 current
## 9 68397 3 1 1 current
Let’s repeat the Cox model for all-cause mortality from above using this exposure:
coxph_modtv2 <- coxph(Surv(time, time2, deathtv) ~ smoking + age + sex,
data = fhstv)
coxph_modtv2 %>%
tidy()
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 smokingformer 0.144 0.0836 1.72 8.61e- 2
## 2 smokingcurrent 0.410 0.0577 7.10 1.28e- 12
## 3 age 0.0875 0.00320 27.3 3.28e-164
## 4 sex -0.531 0.0525 -10.1 4.92e- 24
coxph_modtv2 %>%
tidy() %>%
filter(str_detect(term, "smoking")) %>%
mutate(hr = exp(estimate),
low_ci = estimate - 1.96 * std.error,
high_ci = estimate + 1.96 * std.error,
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 2 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 smokingformer 1.15 0.980 1.36
## 2 smokingcurrent 1.51 1.35 1.69
We see that the HR for current smokers vs never smokers remains elevated, but the HR for former smokers is much smaller and its CI includes the null. It seems that the risk of mortality for former smokers compared to never smokers is quite a bit lower than it is for current smokers.
If we repeat for MI hospitalization (we should recreate our incident MI dataset to include the new smoking variable):
fhstv_incMI <- fhstv %>%
filter(prevmi == 0)
coxph_modhospmitv3 <- coxph(Surv(time, timemi2, hospmitv) ~
smoking + age + sex,
data = fhstv_incMI)
coxph_modhospmitv3 %>%
tidy()
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 smokingformer -0.0588 0.186 -0.316 7.52e- 1
## 2 smokingcurrent 0.475 0.113 4.19 2.82e- 5
## 3 age 0.0464 0.00612 7.59 3.25e-14
## 4 sex -1.12 0.111 -10.1 5.51e-24
coxph_modhospmitv3 %>%
tidy() %>%
filter(str_detect(term, "smoking")) %>%
mutate(hr = exp(estimate),
low_ci = estimate - 1.96 * std.error,
high_ci = estimate + 1.96 * std.error,
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 2 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 smokingformer 0.943 0.655 1.36
## 2 smokingcurrent 1.61 1.29 2.01
We can now see that the parameter for current smokers compared to never smokers is still elevated, with an HR of 1.61, while the HR comparing former smokers to never smokers is actually below the null, though its CI is very wide and includes the null.
7.4.2 Multi-level exposure
We’ve seen that changing smoking from a binary ‘yes/no’ variable to one with three categories (representing some history of the exposure) has added more information to our assessment of the potential effect of smoking on risk of MI. We actually have even more information on smoking, and specifically on smoking intensity, through the cigpday variable. We will incorporate this in our model, as well as look at some other continuous exposures with respect to risk of MI.
Applied exercise: Survival analysis with time-varying values
- Explore the role of smoking intensity on the hazard for MI hospitalizations, using the cigpday variable. What can we infer about the exposure-response?
- Explore the relationship between some of the other predictors in the dataset (specifically bmi and sysbp) and the hazard for MI hospitalization. Investigate whether the exposure-response for these variables deviates from (log-)linearity using spline functions.
1. Explore the role of smoking intensity on the hazard for MI hospitalizations, using the cigpday variable. What can we infer about the exposure-response?
Let’s repeat our latest model for smoking and MI hospitalizations, including the cigpday variable:
coxph_modhospmitv4 <- coxph(Surv(time, timemi2, hospmitv) ~
smoking + cigpday + age + sex,
data = fhstv_incMI)
coxph_modhospmitv4 %>%
tidy()
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 smokingformer -0.0487 0.186 -0.261 7.94e- 1
## 2 smokingcurrent 0.256 0.166 1.54 1.23e- 1
## 3 cigpday 0.00992 0.00579 1.71 8.64e- 2
## 4 age 0.0461 0.00622 7.42 1.21e-13
## 5 sex -1.10 0.114 -9.65 4.68e-22
coxph_modhospmitv4 %>%
tidy() %>%
filter(term == 'cigpday') %>%
mutate(hr = exp(estimate),
low_ci = estimate - 1.96 * std.error,
high_ci = estimate + 1.96 * std.error,
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 1 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 cigpday 1.01 0.999 1.02
We now see that the parameter coefficient for current smoking is attenuated, while the continuous cigpday variable has a positive parameter coefficient (HR = 1.01 with each cigarette-per-day increase in smoking intensity), although the CI includes the null.
We can actually combine these effects, so that for current smokers the effect is captured entirely by the continuous cigarettes-per-day term, by transforming our smoking variable into dummy variables for each level:
fhstv_incMI <- fhstv_incMI %>%
  mutate(smokingcurrent = ifelse(smoking == 'current', 1, 0),
         smokingformer = ifelse(smoking == 'former', 1, 0),
         smokingnever = ifelse(smoking == 'never', 1, 0))
coxph_modhospmitv5 <- coxph(Surv(time, timemi2, hospmitv) ~
                              smokingformer + smokingcurrent:cigpday + age + sex,
                            data = fhstv_incMI) ### leaving the dummy variable for 'never' out makes 'never' the default comparison group
coxph_modhospmitv5 %>%
tidy()
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 smokingformer -0.106 0.182 -0.585 5.59e- 1
## 2 age 0.0453 0.00620 7.31 2.64e-13
## 3 sex -1.10 0.114 -9.66 4.28e-22
## 4 smokingcurrent:cigpday 0.0163 0.00386 4.23 2.33e- 5
coxph_modhospmitv5 %>%
tidy() %>%
filter(term == 'smokingcurrent:cigpday') %>%
mutate(hr = exp(estimate),
low_ci = estimate - 1.96 * std.error,
high_ci = estimate + 1.96 * std.error,
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 1 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 smokingcurrent:cigpday 1.02 1.01 1.02
The HR for each additional cigarette per day compared to never smokers is now 1.016 (95% CI: 1.008–1.023). If we want to estimate the effect for a smoker who smokes a pack of cigarettes (20 cigarettes) a day, then:
coxph_modhospmitv5 %>%
tidy() %>%
filter(term == 'smokingcurrent:cigpday') %>%
mutate(hr = exp(20 * estimate),
low_ci = 20 * (estimate - 1.96 * std.error),
high_ci = 20 * (estimate + 1.96 * std.error),
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 1 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 smokingcurrent:cigpday 1.39 1.19 1.61
The model we just fitted, however, assumes a log-linear exposure-response between smoking intensity and the hazard for MI hospitalization. We can relax this by creating a categorical variable representing different levels of smoking intensity. The case_when function is helpful here: it assigns a variable different values depending on which logical conditions are met:
fhstv_incMI <- fhstv_incMI %>%
mutate(smokingint = case_when(
cigpday == 0 ~ "None",
0 < cigpday & cigpday <= 5 ~ "Light",
5 < cigpday & cigpday <= 10 ~ "Moderate",
10 < cigpday & cigpday <= 20 ~ "Heavy",
20 < cigpday ~ "Very Heavy",
TRUE ~ NA_character_
),
smokingint = as_factor(smokingint),
smokingint = fct_relevel(smokingint,
"None", "Light", "Moderate", "Heavy", "Very Heavy"))
You can see that this added variable gives us the desired ordered categories for smoking:
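The select call isn’t shown, but a sketch like this would produce the check below:
fhstv_incMI %>%
  select(randid, cigpday, smokingint)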
## # A tibble: 11,253 × 3
## # Groups: randid [4,348]
## randid cigpday smokingint
## <dbl> <dbl> <fct>
## 1 2448 0 None
## 2 2448 0 None
## 3 6238 0 None
## 4 6238 0 None
## 5 6238 0 None
## 6 9428 20 Heavy
## 7 9428 30 Very Heavy
## 8 10552 30 Very Heavy
## 9 10552 20 Heavy
## 10 11252 23 Very Heavy
## # ℹ 11,243 more rows
Now you can use this variable within the regression model:
coxph_modhospmitv6 <- coxph(Surv(time, timemi2, hospmitv) ~
smokingformer + smokingcurrent:smokingint + age + sex,
data = fhstv_incMI)
coxph_modhospmitv6 %>%
tidy()
## # A tibble: 8 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 smokingformer -0.0463 0.186 -0.249 8.04e- 1
## 2 age 0.0471 0.00624 7.56 4.05e-14
## 3 sex -1.08 0.113 -9.55 1.36e-21
## 4 smokingcurrent:smokingintNone NA 0 NA NA
## 5 smokingcurrent:smokingintLight -0.257 0.313 -0.820 4.12e- 1
## 6 smokingcurrent:smokingintModerate 0.0545 0.244 0.223 8.23e- 1
## 7 smokingcurrent:smokingintHeavy 0.654 0.134 4.89 1.00e- 6
## 8 smokingcurrent:smokingintVery Heavy 0.559 0.159 3.51 4.50e- 4
We see that the category for “None” did not return a parameter, which is expected, as we don’t expect any current smokers to be smoking zero cigarettes. We can extract the rest of the parameters and see what the HR is for each category compared to never smokers. You can use a regular expression with str_detect to pull out the four parameter estimates we’re interested in, so you can calculate the confidence intervals for all four at once:
coxph_modhospmitv6 %>%
tidy() %>%
filter(str_detect(term, "smokingcurrent:smokingint[HLMV]")) %>%
mutate(hr = exp(estimate),
low_ci = (estimate - 1.96 * std.error),
high_ci = (estimate + 1.96 * std.error),
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 4 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 smokingcurrent:smokingintLight 0.774 0.419 1.43
## 2 smokingcurrent:smokingintModerate 1.06 0.654 1.70
## 3 smokingcurrent:smokingintHeavy 1.92 1.48 2.50
## 4 smokingcurrent:smokingintVery Heavy 1.75 1.28 2.39
We see that only heavy and very heavy smokers have elevated HRs compared to never smokers, with the estimates for light and moderate smokers yielding CIs that include the null.
2. Explore the relationship between some of the other predictors in the dataset (specifically bmi and sysbp) and the hazard for MI hospitalization. Investigate whether the exposure-response for these variables deviates from (log-)linearity using spline functions.
We will now turn our attention to some other predictors of cardiovascular disease, namely BMI (bmi in our dataset) and systolic blood pressure (sysbp in our dataset). Below we fit a Cox model for BMI, adjusting for age and sex. BMI is considered a fairly flawed proxy for obesity and/or a healthy metabolic profile; however, unlike more accurate or informative exposure measures, it is very easy to assess and can still be quite informative.
coxph_modhospmitvBMI <- coxph(Surv(time, timemi2, hospmitv) ~ bmi + age + sex,
data = fhstv_incMI)
coxph_modhospmitvBMI %>%
tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 bmi 0.0483 0.0122 3.96 7.55e- 5
## 2 age 0.0400 0.00596 6.71 1.89e-11
## 3 sex -1.17 0.109 -10.7 8.84e-27
coxph_modhospmitvBMI %>%
tidy() %>%
filter(term == 'bmi') %>%
mutate(hr = exp(estimate),
low_ci = (estimate - 1.96 * std.error),
high_ci = (estimate + 1.96 * std.error),
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 1 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 bmi 1.05 1.02 1.07
We see that there is an elevated HR for each unit increase in BMI (HR = 1.05, 95% CI: 1.02–1.07). In order to explore whether the exposure-response relationship deviates from (log-)linearity, we will repeat the same model using a spline term for bmi.
library(splines)
coxph_modhospmitvBMIsp <- coxph(Surv(time, timemi2, hospmitv) ~
ns(bmi, df = 3) + age + sex,
data = fhstv_incMI)
coxph_modhospmitvBMIsp %>%
tidy()
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ns(bmi, df = 3)1 1.06 0.378 2.80 5.12e- 3
## 2 ns(bmi, df = 3)2 0.966 1.52 0.638 5.24e- 1
## 3 ns(bmi, df = 3)3 0.818 1.35 0.605 5.45e- 1
## 4 age 0.0400 0.00596 6.71 1.94e-11
## 5 sex -1.16 0.111 -10.4 2.10e-25
We can better summarize the relationship between BMI and the outcome from the above model by plotting it, similar to our approach in the time-series example. We could use code similar to that in section 4.2; however, the termplot function simplifies some of the steps for us:
library(purrr)
predbmi <- termplot(coxph_modhospmitvBMIsp, se = TRUE, plot = FALSE) %>%
pluck("bmi")
predbmi %>%
ggplot(aes(x = x, y = exp(y))) +
geom_line() +
labs(x = "BMI",
y = "Hazard Ratio") +
geom_hline(yintercept = 1.0, color = "red", linetype = 2)
We see that the relationship is mostly linear between values of about 25 and 40, but deviates from linearity at the lower and higher ends of the range of BMI. We would like to re-center this so that we are comparing to a BMI value of lowest risk (i.e., the plot crosses 1.0 at that BMI). The ‘normal’ range of BMI is 20–25, so let’s center our graph around 22.5. We will also add confidence bands to our plot.
predbmi <- predbmi %>%
mutate(center = y[x == 22.5],
ylow = y - 1.96 * se,
yhigh = y + 1.96 * se)
predbmi %>%
ggplot(aes(x = x, y = exp(y - center))) +
geom_line() +
labs(x = "BMI",
y = "Hazard Ratio") +
geom_hline(yintercept = 1.0, color = "red", linetype = 2)+
geom_ribbon(aes(ymin = exp(ylow - center), ymax = exp(yhigh - center)),
alpha = 0.1,
linetype = "dashed",
color = "grey")
We see that the width of the confidence interval is very large and makes reading the plot quite difficult. We can partially remedy this by log-transforming the y-axis.
predbmi %>%
ggplot(aes(x = x, y = exp(y - center))) +
geom_line() +
labs(x = "BMI",
y = "Hazard Ratio") +
geom_hline(yintercept = 1.0, color = "red", linetype = 2)+
geom_ribbon(aes(ymin = exp(ylow - center), ymax = exp(yhigh - center)),
alpha = 0.1,
linetype = "dashed",
color = "grey") +
scale_y_continuous(trans = 'log2')
We now see that the HR is elevated as BMI increases until about BMI = 40, with the effect estimate plateauing at higher values. The CI bands indicate a significant effect between BMI 25 and 42–43. Part of the reason for the very wide confidence intervals at the far ends of the BMI range is likely that data are sparse at those values, so one alternative you might explore would be to trim the data to focus on the range of BMIs that are more common among the study population.
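For example, a minimal sketch of that trimming idea, with the 1st and 99th percentiles as hypothetical cutoffs:
# keep person-periods with BMI between the chosen percentiles, then refit
bmi_limits <- quantile(fhstv_incMI$bmi, probs = c(0.01, 0.99), na.rm = TRUE)
fhstv_incMI_trim <- fhstv_incMI %>%
  filter(between(bmi, bmi_limits[1], bmi_limits[2]))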
Next, let’s repeat the model, but now add a term for sysbp.
coxph_modhospmitvBMIBP <- coxph(Surv(time, timemi2, hospmitv) ~
ns(bmi, df = 3) + sysbp + age + sex,
data = fhstv_incMI)
coxph_modhospmitvBMIBP %>%
tidy()
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ns(bmi, df = 3)1 0.661 0.381 1.73 8.30e- 2
## 2 ns(bmi, df = 3)2 -0.360 1.53 -0.236 8.14e- 1
## 3 ns(bmi, df = 3)3 -0.694 1.38 -0.502 6.16e- 1
## 4 sysbp 0.0187 0.00216 8.67 4.23e-18
## 5 age 0.0246 0.00635 3.87 1.09e- 4
## 6 sex -1.24 0.113 -11.1 2.05e-28
coxph_modhospmitvBMIBP %>%
tidy() %>%
filter(term == 'sysbp') %>%
mutate(hr = exp(estimate),
low_ci = (estimate - 1.96 * std.error),
high_ci = (estimate + 1.96 * std.error),
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 1 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 sysbp 1.02 1.01 1.02
We see that there is an HR of 1.019 (95% CI: 1.015–1.023) for each unit increase in systolic blood pressure. We can further explore this relationship by trying a spline term for blood pressure as well.
coxph_modhospmitvBMIBPsp <- coxph(Surv(time, timemi2, hospmitv) ~
ns(bmi, df = 3) + ns(sysbp, df = 3) + age + sex,
data = fhstv_incMI)
coxph_modhospmitvBMIBPsp %>%
tidy()
## # A tibble: 8 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ns(bmi, df = 3)1 0.614 0.384 1.60 1.10e- 1
## 2 ns(bmi, df = 3)2 -0.374 1.53 -0.244 8.07e- 1
## 3 ns(bmi, df = 3)3 -0.617 1.39 -0.445 6.57e- 1
## 4 ns(sysbp, df = 3)1 2.03 0.414 4.91 9.32e- 7
## 5 ns(sysbp, df = 3)2 4.75 1.62 2.93 3.37e- 3
## 6 ns(sysbp, df = 3)3 3.77 1.04 3.62 2.96e- 4
## 7 age 0.0243 0.00636 3.83 1.29e- 4
## 8 sex -1.24 0.113 -11.0 4.19e-28
# We will go ahead and plot this as well, using 110 as the center
predsysbp <- termplot(coxph_modhospmitvBMIBPsp, se = TRUE, plot = FALSE) %>%
  pluck("sysbp")
predsysbp <- predsysbp %>%
  mutate(center = y[x == 110],
         ylow = y - 1.96 * se,
         yhigh = y + 1.96 * se)
predsysbp %>%
  ggplot(aes(x = x, y = exp(y - center))) +
  geom_line() +
  labs(x = "Systolic Blood Pressure",
       y = "Hazard Ratio") +
  geom_hline(yintercept = 1.0, color = "red", linetype = 2) +
  geom_ribbon(aes(ymin = exp(ylow - center), ymax = exp(yhigh - center)),
              alpha = 0.1,
              linetype = "dashed",
              color = "grey") +
  scale_y_continuous(trans = 'log2')
We see an almost monotonic increase in the HR as blood pressure increases (keep in mind the log transformation on the y-axis), with the effect significant for the range of exposure above ~110.
Both BMI and systolic blood pressure appear to be significant predictors of the hazard of MI hospitalization. You can explore other predictors in the dataset (including blood glucose, cholesterol, etc.), but keep in mind the missingness in some of these variables and how it might affect your results.
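For example, a quick sketch of checking the share of missing values in two such variables (assuming the glucose and totchol column names from the codebook):
fhstv_incMI %>%
  ungroup() %>%
  summarize(across(c(glucose, totchol), ~ mean(is.na(.x))))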
7.4.3 Time-varying coefficients
We’ve already shown how some of the covariate values in the dataset change over time. Now consider the following model, including smoking and some of the other predictors from above:
coxph_modhospmitvall <- coxph(Surv(time, timemi2, hospmitv) ~
smoking + ns(bmi, df = 3) + ns(sysbp, df = 3) +
age + sex,
data = fhstv_incMI)
coxph_modhospmitvall %>%
tidy()
## # A tibble: 10 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 smokingformer -0.0592 0.187 -0.317 7.51e- 1
## 2 smokingcurrent 0.575 0.116 4.96 7.04e- 7
## 3 ns(bmi, df = 3)1 0.941 0.390 2.41 1.59e- 2
## 4 ns(bmi, df = 3)2 0.231 1.57 0.147 8.83e- 1
## 5 ns(bmi, df = 3)3 -0.387 1.43 -0.270 7.87e- 1
## 6 ns(sysbp, df = 3)1 2.06 0.414 4.97 6.69e- 7
## 7 ns(sysbp, df = 3)2 4.85 1.63 2.99 2.81e- 3
## 8 ns(sysbp, df = 3)3 3.78 1.05 3.60 3.22e- 4
## 9 age 0.0321 0.00656 4.90 9.73e- 7
## 10 sex -1.16 0.115 -10.1 6.69e-24
coxph_modhospmitvall %>%
tidy() %>%
filter(term == 'smokingcurrent') %>%
mutate(hr = exp(estimate),
low_ci = (estimate - 1.96 * std.error),
high_ci = (estimate + 1.96 * std.error),
low_hr = exp(low_ci),
high_hr = exp(high_ci)) %>%
select(term, hr, low_hr, high_hr)
## # A tibble: 1 × 4
## term hr low_hr high_hr
## <chr> <dbl> <dbl> <dbl>
## 1 smokingcurrent 1.78 1.42 2.23
The HR for current smoking appears more elevated than in our simple models with only age and sex adjusted for. What is a possible explanation for this?
Keeping in mind the dynamic nature of some of these variables over time, what are some other potential explanations for these relationships? How does that affect our interpretation of the parameter coefficients for these variables? We will revisit this issue in chapter 8 of the book.
7.4.4 Using survey data (e.g. NHANES)
For this chapter, we’ve used data from the Framingham study as an example. Another very famous public health study in the United States is the National Health and Nutrition Examination Survey, more commonly known by its acronym, NHANES. We’ll wrap up this chapter by talking a bit about this other famous study, about a branch of it that includes longitudinal data appropriate for a survival analysis like the one covered in this chapter for Framingham, and about how some quirks of the NHANES study design make the analysis of its data a bit more complex.
Just as the original Framingham study has evolved into a whole set of related studies, NHANES also began with one study and has evolved into a collection of studies. All are focused on nutrition and health among Americans, and an enormous body of epidemiological literature has resulted from the collection of studies, covering a wide range of health outcomes and risk factors.
The core of the NHANES study collection is a survey study. These surveys are currently run continuously (every year), with around 5,000 participants a year. The surveys collect individual-level data (rather than the county- or city-level aggregated data we used for time series studies) that allow a cross-sectional view of associations between different risk factors and different measures of health.
These survey data therefore only offer a snapshot in time (the time of the survey). However, the NHANES project has supplemented the survey with some longitudinal data that can be used to study the time to events of interest. Following NHANES I, which was the first run of the survey, the study’s researchers created the NHANES Epidemiologic Followup Study (NHEFS). They found as many people as they could who took the original survey (NHANES I) as adults (between ages 25 and 74) and began tracking their time to several health outcomes, including mortality, and also collected updated information on risk factors and health status. The cohort was tracked during the 1980s and into the early 1990s, and follow-up for mortality among the cohort continued all the way until 2011. Much more information about this cohort, as well as links to available data, is available here.
The NHEFS cohort from NHANES has been used for a wide variety of epidemiological studies, including studies of folate and colon cancer risk (Su and Arab 2001), hypertension and educational attainment (Vargas, Ingram, and Gillum 2000), chronic musculoskeletal pain and depressive symptoms (Magni et al. 1993), frequency of eating and weight change (Kant et al. 1995), physical activity and breast cancer (Breslow et al. 2001), white blood cell counts and stroke (Gillum, Ingram, and Makuc 1994), and risk factors for hip fracture (Mussolino et al. 1998).
While you can perform survival analysis on the NHEFS data, you have to take some extra steps. This is because the cohort was developed based on participants in a survey (NHANES I), and that survey selected participants in some ways that need to be addressed when you analyze the resulting data.
First, some groups were oversampled for NHANES I, including older adults, women of childbearing age, and people who lived in poorer areas. This oversampling has implications when it comes to generating estimates that generalize to the full US population, because it means that the study population isn’t representative of the general population. To address this, you can use sampling weights within an analysis, which reweight the data from each study subject to create estimates that are more appropriate for describing trends in the target population. Alternatively, in some cases it may be appropriate to do an unweighted analysis while including variables related to the weighting (e.g., age, gender, socioeconomic status) as covariates in the model (Korn and Graubard 1991).
Second, NHANES I used a multi-stage sampling design to pick participants. It would have been impractical to randomly sample from everyone in the country, in part because the study included in-person interviews and it would be hard to get the interviewers to people spread around the country. Instead, they designed a strategy where they could send interviewers to fewer places, while still having a sample that could be used to represent the non-institutionalized US population (Korn and Graubard 1991).
However, when a study population is selected based on this type of complex survey design, you may not be able to assume that observations are independent of each other. Instead, there may be some patterns of clustering in the data you collect—for example, people surveyed from the same sampling unit for a multi-stage sampling design may be more similar than people from different units. While this is unlikely to bias the main effect estimate, it is likely to cause you to estimate too-small standard errors (and so confidence intervals that are too tight), providing an overly optimistic view of the variance in your estimate (Korn and Graubard 1991).
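As a loose sketch of how both the oversampling weights and the clustered multi-stage design can be handled in R, the survey package lets you declare the weights and sampling clusters in a design object and then fit a design-aware Cox model. All of the variable names below (psu, strata, sampweight, followup_time, and the nhefs data frame itself) are hypothetical stand-ins for whatever the NHEFS documentation specifies:
library(survey)
library(survival)
# Declare the complex design: clusters (PSUs), strata, and sampling weights
nhefs_design <- svydesign(ids = ~psu, strata = ~strata,
                          weights = ~sampweight, nest = TRUE,
                          data = nhefs)
# A Cox model that accounts for the design when estimating standard errors
svycoxph(Surv(followup_time, death) ~ smoking + age + sex,
         design = nhefs_design)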
These characteristics of the NHEFS cohort were necessary to make the original survey practical enough that it could be carried out. While it does make the data more complicated to analyze, there are strategies that can be used to account for oversampling and the multi-stage sampling design. Some papers have described these approaches for the NHEFS cohort and provide a good roadmap if you’d like to explore these data yourself (Korn and Graubard 1991; Ingram and Makuc 1994). Further, the NHANES website includes extensive information about the NHANES series of studies, as well as tools, data, and tutorials, if you would like to try working with the data.