In this practical we'll look at how to use R to get started with some survival data analysis. In Parts 1,2 and 3 we will look at how to:

- Create
`surv`objects in order represent a set of times and censorship status - Obtain the Kaplain-Meier estimate for a set of survival data
- Use data frames containing survival data and access different subsets of the data frame

In Parts 4 and 5 we will look at how to use R for the following:

- Comparing survival distributions and performing log-rank hypothesis tests
- Plotting graphs of survivor functions to assess the suitability of parametric forms for the survival distribution
- Carrying out maximum likelihood estimation of parameters

R has a special library of functions and objects for analysing survival data. The library is loaded using:

> library(survival)More information on the survival library is available from the package website.

Sets of survival data are represented using an object of type `Surv`.
These objects bundle together a set of times together with a set of data indicating whether the times are censored or not.
They do *not* represent a survivor function or anything else -- just a set of raw survival data.
To create a `Surv` object you need to pass it a set of times and a vector indicating which times are censored.
Recall that the R function `c` (for "concatenate") is used to create vectors of numbers.
To create a `Surv` object corresponding to the IUD data in section 4 of your notes, use the following commands:

> my.times <-c(10,13,18,19,23,30,36,38,54,56,59,75,93,97,104,107,107,107) > my.events <-c(1,0,0,1,0,1,1,0,0,0,1,1,1,1,0,1,0,0) > my.surv <- Surv(my.times,my.events)The

> my.survR displays the object's contents -- the set of times with censored times marked "+" (not "*").

In addition to the IUD data above, create a `Surv` object called `HPA.surv` corresponding to the following survival times:

23, 47, 69, 70*, 71*, 100*, 101*, 148, 181, 198*, 208*, 212*, 224*

Obviously the next thing you'd like to do is estimate a survivor function for your data.
This is done using the `survfit` function:

> my.KMest <- survfit(my.surv~1, conf.int=0.95) > plot(my.KMest)In this example we have created an object called

> summary(my.KMest)

Obtain a graph of the survivor function and tabulated values for the HPA survival object you created earlier.
This time, call `survfit` with the argument `conf.int=FALSE` to suppress plotting confidence intervals on your graph.
Set the title and axis labels to suitable values using the arguments `main`, `xlab`, and `ylab`.

Very typically survivor data sets contain covariates (other information about each subject) in addition to the failure times and censorship status.
Data frames are used to represent such data.
A data frame collects together related data into columns.
One column of a data frame might give the survival times, another the censorship status, and another the smoking status, for example.
All the information for each individual in a study is given on a single row.
The survival library contains a data frame called `lung`.
You can look at what it contains by typing:

> lungThere's information for 228 individuals. To see the names of the columns (ie. what information is available for each individual), use the

> names(lung)The column called

> lung$timeextracts the vector of times. Create a plot of the Kaplain Meier estimate of the survivor function for the lung data set.

Sometimes we want to compare two or more different subgroups of a single data frame.
For example, we might wan to compare the survivor functions for males and females in the lung data set.
To do this, create a `Surv` object for the lung data:

> s <- Surv(lung$time, lung$status)Then fit the KM estimate, telling R to split the data up according to the sex of each individual:

> lung.KM <- survfit(s~lung$sex)Alternatively, you can type

> lung.KM<-survfit(s~sex,data=lung)This second form can be more convenient when you want to refer to several variables in the same data frame.

Finally plot the `lung.KM` object: there should be one function for males and one for females on the graph.

> bladder <- read.table("http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3311/bladder.dat")The failure times are contained in the field

> bladder.km <- survfit(bladder.surv~bladder$rx)This command tells R to create two Kaplan-Meier estimates, one for each treatment group (drug and placebo). The result is stored in

> plot(bladder.km, col=c("red","blue")) > legend(x=10,y=0.2,legend=c("Placebo","Thiotepa"),col=c("red","blue"),lty=c(1,1))The

survdiff(my.surv~factor)where

survdiff(bladder.surv~bladder$rx)Look at the output and decide whether to reject the null hypothesis (that the two survival distributions are the same). Does this agree with your conclusions drawn from inspecting the estimated survivor functions? To carry out a stratified log-rank test you use the following command:

survdiff(my.surv~array_defining_groups + strata(array_defining_strata))Here

In this section we'll look at identifying and fitting parametric models for two data sets. In outline, here's what you have to do:

- Estimate the survivor functions for the two sets of data.
- Plot graphs to decide whether each data set looks like it might have an underlying Weibull or log-logistic distribution.
- Fit a straight line to these graphs to estimate the model parameters.
- Carry out a maximum likelihood estimation of the parameters and compare the results with your estimates.
- Compare the fitted distributions to the actual data.

Paste the following commands into R:

> A <- read.table("http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3311/A.dat") > B <- read.table("http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3311/B.dat")Have a look at the the data to make sure you understand what the data frames

Estimate the survivor function for each group and plot it.
Call the survival objects you create `A.surv` and `B.surv`; call the survfit objects `A.fit` and `B.fit`.
You can extract the x/y values of the estimated survivor function from the `survfit` object in the following way:

> t <- A.fit$time > S <- A.fit$survNow if you plot

> plot(t,S)However you can also produce other plot of the survivor function, as in Section 8.2 in your notes. For example, to produce the Weibull assessment plot, try the following:

> plot(log(t),log(-log(S)), type="S")Sometimes it's useful to change the x-axis limits for these plots eg.

> plot(log(t),log(-log(S)), xlim=c(1,5), type="S")

You should produce a plot for both data sets to test whether the Weibull or log-logistic distribution is suitable to model the data. Decide which distribution is most suitable for data sets A and B.

Suppose that `S` and `t` contain an estimated survivor function and its times eg.

> t <- A.fit$time > S <- A.fit$survAlso suppose that you've decided that

> x <- log(t) > y <- log(-log(S)) > lm(y~x)The output tells you the intercept and gradient (called

Note: often the last point on an estimated survivor function is 0 or the first point can be 1.
This leads to a value of infinity in the `y` vector above.
You need to remove these values.
This happens for example in data set A:

> t <- A.fit$time > S <- A.fit$surv > x<-log(t) > y<-log(((1/S)-1)) > y [1] -Inf -Inf -Inf -Inf -3.80666249 -3.80666249 [7] -3.07922624 -2.64649231 -2.64649231 -2.32614079 -2.07292369 -1.86146867 etc > y <- y[5:50] > x <- x[5:50] > lm(y~x)

Extract estimates of the parameters for your models of the two data sets.

As we discussed in section 8.4 of the notes, the R function `survreg` fits parametric distributions.
So, for example, to fit a Weibull distribution to data set B, you might do something like:

> survreg(formula = B.surv ~ 1, dist="weibull")and to fit a log-logistic distribution to data set A you use:

> survreg(formula = A.surv ~ 1, dist="loglogistic")The "1" on the right-hand side of the formulae means no covariates are being used. We'll see in the next couple of lectures and the next practical how to include covariates in such an analysis.

`survreg` tells you an *intercept* and a *scale*.
Using section 8.4 of your notes, extract estimates of the model parameters and compare these to the values you obtained graphically.

Try plotting the estimated survivor function along with your model estimate eg.

> plot(B.fit) > x <- seq(0,120) > y <- exp(-lambda*x^gamma) > lines(x,y)You need to provide your own values of lambda and gamma here. However it is possible to calculate the values using R. For example you can start by using something like the following commands.

> Bweib<-survreg(formula = B.surv ~ 1, dist="weibull") > Bweib$coefficients (Intercept) 2.678162 > Bweib$scale [1] 1.695571 > gamma<-1/Bweib$scale