Prof Darren Wilkinson Professor of Stochastic Modelling

School of Mathematics & Statistics

Newcastle University

Biological sequence analysis

Practical

The aim of this practical is to illustrate the use of hidden Markov models (HMMs) for biological sequence analysis (particularly, analysis of DNA sequences), and to develop an understanding of how the various statistical and probabilistic concepts underlying HMMs fit together.

The main task is to reproduce the example from page 17 of the sequence analysis notes, then go on to look at training and unsupervised learning of the transition structure and emission probabilities.
Use R to simulate a hidden sequence with the Markov transition structure given in the notes. Keep this hidden sequence for future reference.
Conditional on the hidden sequence, simulate a DNA sequence given the suggested emission probabilities. Just use 1,2,3,4 for A, C, G, and T. Keep this sequence for inference purposes.
Using knowledge of the DNA sequence, transition structure, and emission probabilities (but not the hidden sequence), carry out local and global (Viterbi) decoding using appropriate forwards/backwards algorithms and see how well the hidden sequence can be recovered.
Now assume that the transition matrix and emission probabilities are not known but that both the DNA sequence and hidden sequence are. Using the hidden sequence to "train" the model, how well are you able to recover the transition matrix and emission probabilities?
Now assume that only the DNA sequence itself is observed, and use an ad hoc iterative algorithm for "unsupervised learning" of the model. Try alternating between a Viterbi global decoding and parameter inference conditional on the decoding. Investigate convergence from a variety of starting points, and try running the algorithm with different values of r.
Repeat the above using the EM (Baum-Welch) algorithm. Compare the results you obtain using the EM algorithm with those for the ad hoc procedure by evaluating the marginal model likelihoods for the fitted models. See my additional notes on marginal model likelihood.
If time permits, try develping an algorithm for model fitting based on direct numerical maximisation of the marginal model likelihood. How does such an algorithm compare in terms of efficiency and reliability compared to the EM algorithm?
Find an interesting Genbank DNA sequence and analyse it in an appropriate way using the tools that you have developed and tested.

Hint: I have a few very simple R functions to get you started with models having a zero-order base dependence.

Assessment

Write-up this practical as a coherent report. The report should be no more than 10 pages including all figures, tables, appendices, etc. and use a minimum font size of 11pt. You should include appropriate plots and tables, and include important pieces of R code in an appendix (which may be set at 10pt). It should be emailed to me before midnight on Wednesday 13th February, 2013.

Darren J Wilkinson

darren.wilkinson@ncl.ac.uk

http://www.staff.ncl.ac.uk/d.j.wilkinson/

	Professor of Stochastic Modelling
	School of Mathematics & Statistics
	Newcastle University


		`darren.wilkinson@ncl.ac.uk`
		`http://www.staff.ncl.ac.uk/d.j.wilkinson/`