The aim of this practical is to illustrate the use of hidden Markov
models (HMMs) for biological sequence analysis (particularly,
analysis of DNA sequences), and to develop an understanding of how
the various statistical and probabilistic concepts underlying HMMs
fit together.
The main task is to reproduce the example from page 17 of the
sequence analysis notes, then go on to look at training and
unsupervised learning of the transition structure and emission
probabilities.
Use R to simulate a hidden sequence with the Markov transition
structure given in the notes. Keep this hidden sequence for
future reference.
Conditional on the hidden sequence, simulate a DNA sequence given
the suggested emission probabilities. Just use 1,2,3,4 for A,
C, G, and T. Keep this sequence for inference purposes.
Using knowledge of the DNA sequence, transition structure, and
emission probabilities (but
not the hidden sequence),
carry out local and global (Viterbi) decoding using appropriate
forwards/backwards algorithms and see how well the
hidden sequence can be recovered.
Now assume that the transition matrix and emission probabilities
are not known but that both the DNA sequence and hidden
sequence are. Using the hidden sequence to "train" the model,
how well are you able to recover the transition matrix and
emission probabilities?
Now assume that only the DNA sequence itself is observed, and use
an ad hoc iterative algorithm for "unsupervised learning" of the
model. Try alternating between a Viterbi global decoding and parameter
inference conditional on the decoding. Investigate convergence from a
variety of starting points, and try running the algorithm with
different values of r.
Repeat the above using the EM (Baum-Welch) algorithm. Compare the
results you obtain using the EM algorithm with those for the ad hoc
procedure by evaluating the marginal model likelihoods for the fitted
models. See my additional notes on marginal model
likelihood.
If time permits, try develping an algorithm for model fitting
based on direct numerical maximisation of the marginal model
likelihood. How does such an algorithm compare in terms of
efficiency and reliability compared to the EM algorithm?
Find an interesting Genbank DNA sequence and analyse it in an
appropriate way using the tools that you have developed and
tested.
Hint: I have a few very
simple R
functions to get you started with models having a
zero-order base dependence.
Assessment
Write-up this practical as a coherent report. The report should be no
more than 10 pages including all figures, tables, appendices, etc. and
use a minimum font size of 11pt. You should include appropriate plots
and tables, and include important pieces of R code in an appendix
(which may be set at 10pt). It should be emailed to me before midnight
on Wednesday 13th February, 2013.