Newcastle University :: School of Maths and Stats :: My home page :: MAS3311

## Survival Data Analysis, Practical 1

In this practical we'll look at how to use R to get started with some survival data analysis. In Parts 1,2 and 3 we will look at how to:

• Create surv objects in order represent a set of times and censorship status
• Obtain the Kaplain-Meier estimate for a set of survival data
• Use data frames containing survival data and access different subsets of the data frame

In Parts 4 and 5 we will look at how to use R for the following:

• Comparing survival distributions and performing log-rank hypothesis tests
• Plotting graphs of survivor functions to assess the suitability of parametric forms for the survival distribution
• Carrying out maximum likelihood estimation of parameters

### Part 1. Representing Survival Data

R has a special library of functions and objects for analysing survival data. The library is loaded using:

```> library(survival)
```
More information on the survival library is available from the package website.

Sets of survival data are represented using an object of type Surv. These objects bundle together a set of times together with a set of data indicating whether the times are censored or not. They do not represent a survivor function or anything else -- just a set of raw survival data. To create a Surv object you need to pass it a set of times and a vector indicating which times are censored. Recall that the R function c (for "concatenate") is used to create vectors of numbers. To create a Surv object corresponding to the IUD data in section 4 of your notes, use the following commands:

```> my.times <-c(10,13,18,19,23,30,36,38,54,56,59,75,93,97,104,107,107,107)
> my.events <-c(1,0,0,1,0,1,1,0,0,0,1,1,1,1,0,1,0,0)
> my.surv <- Surv(my.times,my.events)
```
The my.times is a vector of the survival times, while the my.events vector indicates whether each time is censored or not (a 1 indicates an observed failure, a 0 indicates a right-censored failure time). Note that R uses periods / full-stops just to break up object and function names (like a hyphen in ordinary English): the full-stops don't mean anything syntactically so we could use mytimes just as well as my.times in the example above. If you type
```> my.surv
```
R displays the object's contents -- the set of times with censored times marked "+" (not "*").

In addition to the IUD data above, create a Surv object called HPA.surv corresponding to the following survival times:
23, 47, 69, 70*, 71*, 100*, 101*, 148, 181, 198*, 208*, 212*, 224*

### Part 2. Estimating survivor functions

Obviously the next thing you'd like to do is estimate a survivor function for your data. This is done using the survfit function:

```> my.KMest <- survfit(my.surv~1, conf.int=0.95)
> plot(my.KMest)
```
In this example we have created an object called my.KMest that represents the estimated (Kaplain-Meier) survivor function for the data in my.surv The Kaplain-Meier is the default kind of estimate fitted to the data, though survfit can produce other sorts of estimates - have a look in the help pages (type help(survfit)). The argument conf.int=0.95 tells survfit to find a 95% confidence interval for the estimate. The graph produced should look just like the one in section 4 of the lecture notes. The command summary can be used to obtain a table of the fitted KM estimate:
```> summary(my.KMest)
```

Obtain a graph of the survivor function and tabulated values for the HPA survival object you created earlier. This time, call survfit with the argument conf.int=FALSE to suppress plotting confidence intervals on your graph. Set the title and axis labels to suitable values using the arguments main, xlab, and ylab.

### Part 3. Working with data frames

Very typically survivor data sets contain covariates (other information about each subject) in addition to the failure times and censorship status. Data frames are used to represent such data. A data frame collects together related data into columns. One column of a data frame might give the survival times, another the censorship status, and another the smoking status, for example. All the information for each individual in a study is given on a single row. The survival library contains a data frame called lung. You can look at what it contains by typing:

```> lung
```
There's information for 228 individuals. To see the names of the columns (ie. what information is available for each individual), use the names function:
```> names(lung)
```
The column called time contains the failure times, while the column status gives the censoring status (for this data set, 2 means a death, 1 means right-censored). Columns of the data are accessed using the dollar sign, for example
```> lung\$time
```
extracts the vector of times. Create a plot of the Kaplain Meier estimate of the survivor function for the lung data set.

Sometimes we want to compare two or more different subgroups of a single data frame. For example, we might wan to compare the survivor functions for males and females in the lung data set. To do this, create a Surv object for the lung data:

```> s <- Surv(lung\$time, lung\$status)
```
Then fit the KM estimate, telling R to split the data up according to the sex of each individual:
```> lung.KM <- survfit(s~lung\$sex)
```
Alternatively, you can type
```> lung.KM<-survfit(s~sex,data=lung)
```
This second form can be more convenient when you want to refer to several variables in the same data frame.

Finally plot the lung.KM object: there should be one function for males and one for females on the graph.

### Part 4: Comparing distributions and log-rank tests

The data set bladder contains times for the recurrence of bladder cancer for 352 individuals. It is included in the survival library. If for some reason you cannot access the data, you can read it in by typing:
```> bladder <- read.table("http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3311/bladder.dat")
```
The failure times are contained in the field stop and the censoring status is contained in the field event. (As usual 1 denotes an observed failure, 0 denotes a right-censored observation.) The data concern a trial of a drug called Thiotepa. The field rx indicates whether each patient received the drug (2) or placebo (1). Start by creating a survival object called bladder.surv for the survival data. As we did for the lung data earlier we can compare the Kaplain-Meier estimates for the two treatment groups:
```> bladder.km <- survfit(bladder.surv~bladder\$rx)
```
This command tells R to create two Kaplan-Meier estimates, one for each treatment group (drug and placebo). The result is stored in bladder.km which is an array with two entries. bladder.km is the placebo group survivor function and bladder.km is the thiotepa group. Plot the survivor functions with the following commands:
```> plot(bladder.km, col=c("red","blue"))
> legend(x=10,y=0.2,legend=c("Placebo","Thiotepa"),col=c("red","blue"),lty=c(1,1))
```
The legend command is used to draw the legend on the plot. You may need to use it in the project for comparing different survivor functions. Try replotting the graph but including confidence intervals (include the argument conf.int=TRUE). Does it look like Thiotepa has an effect? What is the effect of the drug? To carry out a log-rank hypothesis test you use the survdiff command. This has the form
```survdiff(my.surv~factor)
```
where my.surv is a survival object, and factor is an array specifying the groups. So in order to test whether Thiotepa has an effect on the recurrence time of bladder cancer, use:
```survdiff(bladder.surv~bladder\$rx)
```
Look at the output and decide whether to reject the null hypothesis (that the two survival distributions are the same). Does this agree with your conclusions drawn from inspecting the estimated survivor functions? To carry out a stratified log-rank test you use the following command:
```survdiff(my.surv~array_defining_groups + strata(array_defining_strata))
```
Here array_defining_groups specifies the group of each individual eg. bladder\$rx. In addition, array_defining_strata defines the strata. Carry out a stratified test to compare the treatment groups, where the strata are defined by the initial number of tumours in each individual. This information is contained in the field number of the bladder data frame (individuals have from 1 to 8 tumours initially). How does the p-value for the stratified test compare to the p-value for the regular test? How do you account for the change?

### Part 5: Parametric models

In this section we'll look at identifying and fitting parametric models for two data sets. In outline, here's what you have to do:

• Estimate the survivor functions for the two sets of data.
• Plot graphs to decide whether each data set looks like it might have an underlying Weibull or log-logistic distribution.
• Fit a straight line to these graphs to estimate the model parameters.
• Carry out a maximum likelihood estimation of the parameters and compare the results with your estimates.
• Compare the fitted distributions to the actual data.

#### Obtaining the data

Paste the following commands into R:

```> A <- read.table("http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3311/A.dat")
```
Have a look at the the data to make sure you understand what the data frames A and B contain.

#### Plotting the survivor functions

Estimate the survivor function for each group and plot it. Call the survival objects you create A.surv and B.surv; call the survfit objects A.fit and B.fit. You can extract the x/y values of the estimated survivor function from the survfit object in the following way:

```> t <- A.fit\$time
> S <- A.fit\$surv
```
Now if you plot S against t you just get a plot of the survivor function (not very interesting):
```> plot(t,S)
```
However you can also produce other plot of the survivor function, as in Section 8.2 in your notes. For example, to produce the Weibull assessment plot, try the following:
```> plot(log(t),log(-log(S)), type="S")
```
Sometimes it's useful to change the x-axis limits for these plots eg.
```> plot(log(t),log(-log(S)), xlim=c(1,5), type="S")
```

You should produce a plot for both data sets to test whether the Weibull or log-logistic distribution is suitable to model the data. Decide which distribution is most suitable for data sets A and B.

#### Finding the best straight line

Suppose that S and t contain an estimated survivor function and its times eg.

```> t <- A.fit\$time
> S <- A.fit\$surv
```
Also suppose that you've decided that A is weibull. You therefore want to fit a straight line to the graph log(-log(S)) against log(t). You do this as follows (but see the note below):
```> x <- log(t)
> y <- log(-log(S))
> lm(y~x)
```
The output tells you the intercept and gradient (called x).

Note: often the last point on an estimated survivor function is 0 or the first point can be 1. This leads to a value of infinity in the y vector above. You need to remove these values. This happens for example in data set A:

```> t <- A.fit\$time
> S <- A.fit\$surv
> x<-log(t)
> y<-log(((1/S)-1))
> y
        -Inf        -Inf        -Inf        -Inf -3.80666249 -3.80666249
 -3.07922624 -2.64649231 -2.64649231 -2.32614079 -2.07292369 -1.86146867
etc
> y <- y[5:50]
> x <- x[5:50]
> lm(y~x)
```

Extract estimates of the parameters for your models of the two data sets.

#### Fitting models (maximum likelihood)

As we discussed in section 8.4 of the notes, the R function survreg fits parametric distributions. So, for example, to fit a Weibull distribution to data set B, you might do something like:

```> survreg(formula = B.surv ~ 1, dist="weibull")
```
and to fit a log-logistic distribution to data set A you use:
```> survreg(formula = A.surv ~ 1, dist="loglogistic")
```
The "1" on the right-hand side of the formulae means no covariates are being used. We'll see in the next couple of lectures and the next practical how to include covariates in such an analysis.

survreg tells you an intercept and a scale. Using section 8.4 of your notes, extract estimates of the model parameters and compare these to the values you obtained graphically.

#### Evaluating the fit

Try plotting the estimated survivor function along with your model estimate eg.

```> plot(B.fit)
> x <- seq(0,120)
> y <- exp(-lambda*x^gamma)
> lines(x,y)
```
You need to provide your own values of lambda and gamma here. However it is possible to calculate the values using R. For example you can start by using something like the following commands.
```> Bweib<-survreg(formula = B.surv ~ 1, dist="weibull")
> Bweib\$coefficients
(Intercept)
2.678162
> Bweib\$scale
 1.695571
> gamma<-1/Bweib\$scale
```