Stochastic simulator diagnostics

Stochastic kinetic models

Suppose we have

\(N\) species: \(X_1, X_2, \ldots, X_N\)
\(M\) reactions: \(R_1, R_2, \ldots, R_M\)
In a typical model, \(M = 2N\)

Reaction \(R_i\) takes the form \[ \begin{array}{cccc} R_{i}: \, & u_{i1} X_1+\ldots +u_{ik} X_N &\xrightarrow {\phantom{a}c_{i}\phantom{a}} & v_{i1}X_1+\ldots +v_{ik}X_N \end{array} \] The effect of reaction \(i\) on species \(j\) is to change \(X_j\) by an amount \(v_{ij} - u_{ij}\)

Straightforward to simulate

Direct method

Initialisation: initial conditions, reactions constants, and random number generators
Propensities update: Update each of the \(M\) hazard functions, \(h_i(x)\)
Propensities total: Calculate the total hazard \(h_0 = \sum_{i=1}^M h_i(x)\)
Reaction time: \(\tau = -ln[U(0,1)]/h_0\) and \(t = t+ \tau\)
Reaction selection: A reaction is chosen proportional to it's hazard
Reaction execution: Update species
Iteration: If the simulation time is exceeded stop, otherwise go back to step 2

Typically there are a large number of iterates

Simulator slow down

As the number of reactions (and species) increase, the length of time a single iteration takes also increases
Clever tricks help
- Dynamic sorting of the reactions
- Updating only species and hazards that have changed
- Avoid simulating two uniform random numbers per iteration
For models where one of the hazards is large, simulation time can be prohibitively large

Approximate simulators

Relax some assumptions (e.g. discreteness and stochasticity) in order to make simulation faster and more scalable

Diffusion approximation / chemical Langevin equation (CLE)
Hybrid discrete-continuous models
Linear noise approximation (LNA) / moment closure (2MA)

How do can we assess the appropriateness of these methods?

Moment based approximations

Moment closure and the linear noise approximation
Obtain (differential) equations for the mean and covariance
Typically assume normality
Very useful for parameter inference

Dimerisation system

\[ 2X_1 \longrightarrow X_2 \quad \mbox{and} \quad X_2 \longrightarrow 2 X_1 \] with propensities \(h_1(X_1, X_2) = 0.5 k_1 X_1(X_1-1)\) and \(h_2(X_2) = k_2 X_2\)

Moment equations

The equation for the mean ODE is \[ \frac{dE[X_1]}{dt} = 0.5 k_1 (E[X_1^2] - E[X_1]) -k_2 E[X_1] \] and second moment \[ \frac{dE[X_1^2]}{dt} = k_1 (E[X_1^2X_2] - E[X_1 X_2]) + 0.5 k_1(E[X_1^2] - E[X_1]) k_2 (E[X_1] - 2 E[X_1^2]) \] where \(E[X_1]\) is the mean of \(X_1\) and \(E[X_1^2] - E[X_1]^2\) is the variance

Closing the system

Typically use Gaussian or Log-normal, since it's easy to generalise to the multivariate case

Distribution	Moments	Cumulants
Gaussian	\(\mu_3 = 3 \mu_2 \mu_1 - 2 \mu_1^3\)	\(\kappa_3=0\)
Poisson	\(\mu_3 = \mu_1+3\mu_1^2 + \mu_1^3\)	\(\kappa_3=\kappa_2=\kappa_1\)
Log-Normal	\(\mu_3 = (\mu_2/\mu_1)^3\)	\(\kappa_3=(\kappa_2/\kappa_1)^3 + 3\kappa_2^2/\kappa_1\)

Rational rate laws

If we consider a slightly more complicated propensity function \[ X \rightarrow \emptyset \] with propensity function \[ h(X) = \frac{k_1 X^2}{k_2 + X^2} \] we get the moment equation \[ \frac{dE(X)}{ dt} =\frac{1}{2k_2}\left[-4k_1 E(X^3) - (2k_2+1)\frac{d E(X^2)}{dt} - 2\frac{d E(X^3)}{dt} - \frac{d E(X^4)}{dt}\right]. \]

Why use MC/LN approximations?

Model building
- a quick method for estimating the mean/variance of system
Steady-state analysis
- set the differentials to zero
Parameter inference
- instead of multiple forward simulations (per MCMC iteration), solve a set of ODEs
- Lotka-Volterra model - \(10^4\) iterations in less than 2 seconds
Experimental design
- Each iteration of the algorithm reduces to the parameter inference problem

Example: Heat shock model (Proctor et al, 2005)

Stochastic kinetic model of the heat shock system
- twenty-three reactions
- seventeen chemical species
A single stochastic simulation up to \(t=2000\) takes about \(35\) minutes.
If we convert the model to moment equations, we get \(139\) equations

Example: Heat shock model

Issues

When modelling biological systems, it's unusual to have detailed knowledge about all rate constants
- Vague parameters over many orders of magnitude
- Uncertain initial conditions
- Unobserved species
How can we assess the appropriateness of the MC/LN approximation?

P53-Mdm2 oscillation model

Proctor and Grey, 2008
- 16 chemical species
- Around a dozen reactions
The model contains a single event
- At \(t = 1\), set \(X = 0\)
If we convert the model to moment equations, we get 139 equations.

SIR model

Immigration: \(\emptyset \longrightarrow X + 1\) and \(\emptyset \xrightarrow{k_2 sf} Y + 1\)
Death: \(X \xrightarrow{k_3 sf} \emptyset\) and \(Y \longrightarrow \emptyset\)
Interaction: \(X +Y \longrightarrow 2 Y\)

Assessing the moment closure approximations

Accuracy of the mean and variance estimates
- For any system with a second-order reaction, we can prove that the MC/LN approximations are not exact
Suitability of the normality assumption
- Continuous distribution for a discrete state space
The question is, are the approximations good enough?

Assessing moment accuracy

If the system contains only zero and first order reactions, the MC/LN moment estimates are exact
Intuitively, if the MC/LN produce different moment estimates, this suggests an issue with the approximation
- Note - if the MC/LN approximations agree, this does not imply accuracy
For a particular set of parameters, \(\gamma_i = (c_1, \ldots, c_L)\), calculate the standardised difference \[ D_{j}^{LM}(\mathbf{\gamma}_i) = \frac{\psi_j^l(\mathbf{\gamma}_i) - \psi_j^m(\mathbf{\gamma}_i)}{\sqrt{\Sigma_{jj}^l(\mathbf{\gamma}_i)}} \] where
- \(\psi_j(\mathbf{\gamma}_i)\) is the solution to the equation for species \(j\) for parameters \(\gamma_i\) using the LN or MC approximation
- \(\Sigma_{jj} (\mathbf{\gamma}_i)\) is the variance estimate
Large values of \(D_j^{LM}\) indicate a potential issues with the approximation

Lotka Volterra system

Simple system with non-linear dynamics \[ X \rightarrow X + 1, \quad X + Y \rightarrow 2Y \quad \text{and} \quad Y \rightarrow \emptyset \]
Under certain parameter combinations, deterministic model predicts sustained oscillations
- But the stochastic system can become extinct
Set-up:
- Fix \(c_3=0.3\) and \(X(0) = Y(0)=100\)
- Maximum simulation time \(t=30\)
Assess the moment closure and the linear noise approximations over the parameter ranges \[ c_1 \sim (10^{-6}, 10^0) \quad \text{and} \quad c_2 \sim (10^{-6}, 10^0) \] on the log scale

LV moment assessment

Form a Latin hypercube on the log scale
At each point, \(\gamma_i\) on the hyper cube calculate the standardised difference \(D_{j}^{LM}(\mathbf{\gamma}_i)\)
- Values of \(D_{j}^{LM}(\mathbf{\gamma}_i) > 5\) are shown as red triangles

Assessing the normality

We can borrow ideas from Gaussian process assessment
At each point, \(\gamma_i\), on the design space
- estimate the mean, \(\psi\), and co-variance, \(\Sigma\), using the LN/MC approximations
- simulate using an exact simulator
Prediction error for species \(j\),
\[ e_{i,j} = y_j^*(\mathbf{\gamma}_i) - \psi_j^{l}(\mathbf{\gamma}_i) \]
Easier to work with the standardised prediction errors \[ e_{i,j}^* = \frac{y_j^*(\mathbf{\gamma}_i) - \psi_j^l(\mathbf{\gamma}_i)}{\sqrt{\Sigma_{jj}^l(\mathbf{\gamma}_i)}} \;. \]
If the assumptions are correct, then \(e_{i,j}^* \sim N(0, 1)\)

Interval diagnostics

Construct \(100\alpha\)% confidence interval using the approximate mean \(\psi\) and variance, \(\Sigma\)
Denote a particular confidence region at design point \(i\), for species \(j\), as \(CI_{i,j} (\alpha)\)
The proportion of simulated values that land within the confidence region is given by \[ D_j^{CR} = \frac{1}{n_d} \sum_{i=1}^{n_d} \mathbf{1}[y_j^*(\mathbf{\gamma}_i) \in CI_{i,j}(\alpha)], \] where \(\mathbf{1}[\cdot]\) is the indicator function. The value of \(D_{j}^{CR}\) should be approximately equal to \(\alpha\)

Schlogl model

A well known test system that exhibits bi-modal and non-linear characteristics at certain parameter combinations
The system contains four reactions \[ 2X_1 + X_2 \xrightarrow{\phantom{a}c_{1}\phantom{a}} 3X_1, \quad 3X_1 \xrightarrow{\phantom{a}c_{2}\phantom{a}} 2X_1 + X_2, \quad X_3 \xrightarrow{\phantom{a}c_{3}\phantom{a}} X_1 \\ X_1 \xrightarrow{\phantom{a}c_{4}\phantom{a}} X_3 \] where the distribution of \(X_1\) is bi-modal, the Gaussian assumption would clearly be inappropriate
Parameters \(\{c_1, c_2\}\) are fixed at \(\{3\times 10^{-7}, 10^{-4}\}\) and the initial conditions are held constant at \[ \{X_1(0), X_2(0), X_3(0)\} = \{250, 10^5, 2\times 10^5\} \;. \]
Two dimensional hypercube size \(n_d=1,000\) on the \(\log_{10}\) space for the parameters \(\{c_3, c_4\}\)
At each of the \(n_d\) points on the hypercube, we simulate using the direct method and the LNA to a maximum time of \(t=5\).

Schlogl model: diagnostics

Latin hyper-cube
Standardised errors
Q-Q plot
\(X_1\) vs \(e_i^*\)

Schlogl model: simulations

Investigate large standardised errors more carefully

Summary

Moment closure/linear approximations provide a quick, easy to construct approximation to the system of interest
- Intelligent emulator
Care needs to be taken to assess the suitability of the approximation
Assessing the accuracy of other approximate simulators is tricky (arXiv:1409.1096)

References

Gillespie, CS. Moment closure approximations for mass-action models. IET Systems Biology 2009
Milner, P, Gillespie, CS and Wilkinson, DJ. Moment closure approximations for stochastic kinetic models with rational rate laws. Mathematical Biosciences 2011
Gillespie, C.S. Diagnostics for assessing the accuracy of approximate stochastic simulators, arXiv:1409.1096