Survival Analysis: Practical 3


Survival Data Analysis, Practical 3

This practical involves fitting Weibull proportional hazards models to survival data with covariates. There are two parts. The first uses a data set called heart, which has two covariates, one binary and one continuous. The second uses a data set called artery and is intended to help you understand categorical covariates (factors).

Modelling with binary and continuous covariates

The data can be read into R using:

> heart <- read.table("http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3311/heart.dat")

The table gives the survival time in months from first incidence of myocardial infarction (heart attack) for 200 patients. Some observations are right-censored, as indicated by the status column (1 = observed death). The covariates are BMI (body mass index) and smoking status. Smoking status is an indicator with value 1 when the individual is a smoker.

Parametric models are fitted in R using the survreg command from the survival package, which must be loaded first. For example, to fit the model incorporating smoking status, use:

> library(survival)
> m <- survreg(formula = Surv(time, status) ~ smoker, data=heart, dist="weibull")
> summary(m)

We saw in lectures this week how to obtain the model parameters from the R output. You might need to refer back to section 12 of your notes to remind yourself about this. You will also need to use the log likelihood. Note that two log likelihood values are given in the output. One is for the full model, and the other for a model with no covariates. The value for the model with no covariates is labelled "intercept only".
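As a reminder of how the conversion can go, here is a minimal sketch on simulated data (not the heart data). It assumes the baseline hazard is written as h0(t) = lambda * gamma * t^(gamma - 1), which may differ from the notation in section 12 of your notes, so check before relying on it. The data frame sim and all "true" parameter values are invented for illustration.

```r
library(survival)

# Simulated stand-in for a data set with one binary covariate (NOT the heart data).
set.seed(1)
n <- 200
smoker <- rbinom(n, 1, 0.4)

# Assumed Weibull PH model: S(t | x) = exp(-lambda * exp(beta * x) * t^gamma).
lambda_true <- 0.01; gamma_true <- 1.5; beta_true <- 0.7   # invented values
t_event <- (-log(runif(n)) / (lambda_true * exp(beta_true * smoker)))^(1 / gamma_true)
t_cens  <- runif(n, 0, 60)                                 # random right-censoring
sim <- data.frame(time   = pmin(t_event, t_cens),
                  status = as.numeric(t_event <= t_cens),
                  smoker = smoker)

m <- survreg(Surv(time, status) ~ smoker, data = sim, dist = "weibull")

# survreg reports an intercept (mu), coefficients (alpha) and a scale (sigma)
# on the log-time scale; convert them to the proportional hazards form.
sigma <- m$scale
mu    <- unname(coef(m)["(Intercept)"])
alpha <- unname(coef(m)["smoker"])

gamma_hat  <- 1 / sigma          # Weibull shape
lambda_hat <- exp(-mu / sigma)   # baseline scale
beta_hat   <- -alpha / sigma     # log hazard ratio for smokers

# The two log likelihood values printed by summary(m).
loglik_null <- m$loglik[1]       # "intercept only"
loglik_full <- m$loglik[2]       # fitted model

# Likelihood ratio test of the fitted model against the model with no covariates
# (one extra parameter, so 1 degree of freedom).
lrt_stat <- 2 * (loglik_full - loglik_null)
p_value  <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)
```

The same conversion applies to every covariate coefficient in a larger model: divide by the scale and change the sign.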

Your task is to find the best Weibull proportional hazards model for these data. There are some notes below on model building / model selection that are relevant to this practical and to the project. Compare models using the likelihood ratio test described in section 12 of the notes. Once you have selected your model, answer the following questions:

Which covariates are included in your model?
What is the baseline hazard function for your model? You will need to extract the model parameters lambda and gamma.
Find the hazard ratio between two individuals with BMI 28 and BMI 22 who have the same smoking status.
Does any individual in the data set have the same hazard function as the baseline? Why?
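For the hazard ratio question, recall that under a proportional hazards model two individuals who differ only in one covariate have hazard ratio exp(beta * (x1 - x2)), where beta is the proportional hazards coefficient for that covariate. The sketch below only illustrates the arithmetic; the value of beta_bmi is hypothetical and is not an estimate from the heart data.

```r
# Hypothetical PH coefficient for BMI (log hazard ratio per unit of BMI).
# If you start from survreg output, this is -(survreg coefficient) / scale.
beta_bmi <- 0.08

# Hazard ratio for BMI 28 relative to BMI 22, same smoking status.
# The baseline hazard and the smoking term cancel in the ratio.
hr <- exp(beta_bmi * (28 - 22))
hr
```

Because the baseline hazard cancels, the ratio does not depend on time or on lambda and gamma.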

Model selection

When there are a number of different covariates that we might include in a model, it's useful to adopt a systematic approach to model building. You have probably seen aspects of this in other courses (e.g. general linear models). Model building is not a key aspect of MAS3311, but you need to do some in order to tackle Part 3 of the project, for which there are 8 potential covariates and factors. One way is to try every possible combination of covariates, but with 8 covariates there are 2^8 = 256 different models, which is far too many to try. A good approach is as follows:

Construct each of the possible models with just one covariate. Pick the model with the highest likelihood and check whether it is a significant improvement on the model with no covariates (use the likelihood ratio test described in section 12 of the notes). If it is, call this model M1.
Once you've selected a model with just one covariate, try adding in each of the others separately. Again pick the model with the highest likelihood, call it M2, and decide whether it is an improvement on M1.
Continue adding covariates until there is no evidence (on the basis of the likelihood ratio test) that adding covariates improves the model.
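The steps above could be automated along the following lines. This is a rough sketch rather than a recommended tool: the function name forward_select, the 5% significance level, the column names time and status, and the simulated data are all assumptions made for illustration.

```r
library(survival)

# Forward selection for Weibull survreg models using likelihood ratio tests.
# `candidates` is a character vector of column names in `data`.
forward_select <- function(data, candidates, alpha = 0.05) {
  selected <- character(0)
  current  <- survreg(Surv(time, status) ~ 1, data = data, dist = "weibull")
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # Fit one model per remaining covariate, each added to the current model.
    fits <- lapply(remaining, function(v) {
      f <- as.formula(paste("Surv(time, status) ~",
                            paste(c(selected, v), collapse = " + ")))
      survreg(f, data = data, dist = "weibull")
    })
    logliks <- sapply(fits, function(f) f$loglik[2])
    best <- which.max(logliks)
    # Likelihood ratio test of the best candidate against the current model.
    df   <- length(coef(fits[[best]])) - length(coef(current))
    stat <- 2 * (logliks[best] - current$loglik[2])
    if (pchisq(stat, df = df, lower.tail = FALSE) >= alpha) break
    selected <- c(selected, remaining[best])
    current  <- fits[[best]]
  }
  list(selected = selected, model = current)
}

# Simulated example: only x1 truly affects survival (invented values).
set.seed(42)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rbinom(n, 1, 0.5))
t_event <- (-log(runif(n)) / (0.05 * exp(1.0 * sim$x1)))^(1 / 1.2)
t_cens  <- runif(n, 0, 40)
sim$time   <- pmin(t_event, t_cens)
sim$status <- as.numeric(t_event <= t_cens)

result <- forward_select(sim, c("x1", "x2", "x3"))
result$selected
```

Fitting the models by hand, as the practical asks, is still the best way to see what each step is doing.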

There are many other valid ways to go about building a model. Whatever you decide to do for your project, make sure you describe the process in your report. For today's practical there are only 4 possible models for you to try out and compare.

Categorical covariates / factors

The data can be read into R using:

> artery <- read.table("http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3311/artery.dat")

The survival time is contained in the column time, and the column status indicates whether each time was observed or right-censored. The only covariate is alcohol, a score taking the values 1, 2, 3, 4 and 5 that indicates how much alcohol each patient drinks.

Fit a Weibull proportional hazards model treating alcohol as a continuous covariate, and a second treating it as a factor (see section 11 of the notes). On the basis of the model coefficients and log likelihood values, do you think it's correct to treat the alcohol score as a continuous covariate? Note that if s is a survreg object, then summary(s) reports standard errors for the estimated model parameters, e.g. to obtain these standard errors type:

> s <- survreg(...)
> summary(s)
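One way to set up the comparison is sketched below on simulated data (not the artery data). The continuous model is nested in the factor model, so the two can be compared with a likelihood ratio test on 3 degrees of freedom (4 factor coefficients against 1 slope). The effect sizes and the data frame sim are invented for illustration.

```r
library(survival)

# Simulated stand-in for a five-level score (NOT the artery data).
set.seed(7)
n <- 250
alcohol <- sample(1:5, n, replace = TRUE)
effect  <- c(0, 0.1, 0.2, 1.0, 1.1)[alcohol]   # invented, deliberately non-linear
t_event <- (-log(runif(n)) / (0.02 * exp(effect)))^(1 / 1.3)
t_cens  <- runif(n, 0, 50)
sim <- data.frame(time    = pmin(t_event, t_cens),
                  status  = as.numeric(t_event <= t_cens),
                  alcohol = alcohol)

# Alcohol as a continuous covariate: one slope.
s_cont <- survreg(Surv(time, status) ~ alcohol, data = sim, dist = "weibull")
# Alcohol as a factor: one coefficient per level other than the first.
s_fact <- survreg(Surv(time, status) ~ factor(alcohol), data = sim, dist = "weibull")

# Likelihood ratio test of the factor model against the continuous model.
df      <- length(coef(s_fact)) - length(coef(s_cont))
stat    <- 2 * (s_fact$loglik[2] - s_cont$loglik[2])
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

summary(s_fact)   # coefficients with standard errors
```

If the factor coefficients increase roughly in equal steps and the test gives no evidence against the continuous model, treating the score as continuous is easier to defend.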