Model
Motivation
A model is a lie that helps you see the truth. Howard Skipper 1)
There are many choices to make when designing a model. Each decision limits the scope of the model but hopefully makes it more useful. Should I include such-and-such features? They may make the model more realistic but they'll make the model harder to understand and use. I wanted this model to be as clear and understandable as possible so I started with the simplest assumptions I could think of. Here are some considerations that influenced my modeling approach:
- I wanted my model to make quantitative predictions. Therefore it needed to be quantitative, mathematical.
- I wanted to see how things changed over time so my model needed to be dynamic (just meaning that the model deals with time-varying quantities).
- In my experience models with discrete time steps are more prone to artifacts (errors) so I chose a continuous time model – time is treated as continuous, and can be divided into arbitrarily small intervals (eg. days, hours, minutes, seconds, …).
- I am particularly interested in emergence, such as how interactions between individuals can lead to surprising outcomes for the population as a whole. Therefore I was motivated to construct a model describing individual interactions and using it to predict population dynamics, using a technique called reaction kinetics.
- I considered including Spatial heterogeneity (so, for example, individuals might be located in a particular region and only interact with others nearby). Space could have been represented as a network of connected nodes (eg. a node for each affected country) or as coordinates in a continuum. Either approach would probably be more accurate than neglecting space since there were known “hot spots” in this epidemic. But I had a few reasons I chose not to include spatial structure:
- Space often makes a model much harder to understand.
- I wasn't certain that space was necessary: it may be possible to capture the essence of the epidemic without it. I am more concerned with developing a model that helps me understand the epidemic (even if it is only paints with broad strokes) than accurately reproducing the dynamics without helping me to understand them.
- Adding space is akin to adding many more parameters to a model. Then we risk overfitting the model to the data (unless the data it's compared to is also spatial).
- Adding space to the model would also make the model computationally much more complicated. It would be harder to validate so I would have less confidence if any interesting results I got were meaningful or just a side-effect of an undiscovered error. The model would have to be “solved” on a computer and spatial structure could make it take much longer (depending on how fine-grained a spatial structure I chose).
- So, with the goal of making the model as simple as possible (but no simpler) I chose not to include a spatial component. If the model was unable to reasonably represent the dynamics of the epidemic I might suppose that space was a necessary ingredient and consider building it into a new model.
With these thoughts in mind, I started by looking at the well-known SIR model of epidemics.
Reaction kinetics
The SIR model is a well-known compartmental model in epidemiology. It simplifies the representation of individuals in a population by assigning them to one of three “compartments”: Susceptible, Infected, or Removed. All other details, such as age, gender, wealth, etc., are neglected. This is of course a drastic oversimplification. But it does have one big advantage: it makes the model much easier to understand and work with.
The SIR model is usually depicted as a multi-compartment model as shown here:
The population is split into three compartments (or states): S, I, and R. Individuals transition from the S compartment to I at a rate $\alpha S I$ and from I to R at rate $\phi I$ where the variables $S$, $I$, and $R$ denote the density of individuals in each compartment.
The transition rates are not clearly motivated in this picture. I prefer an alternative representation, reaction kinetics, that clarifies how the rates come about. Instead of thinking of populations we look at the interactions that individuals are involved in. For the SIR model above those processes are
\begin{eqnarray} S + I & \xrightarrow{\alpha} & 2 I & \text{(infection)} \\ I & \xrightarrow{\phi} & R. & \text{(removal)} \end{eqnarray}
The first reaction shows an infection process, an interaction between a susceptible and an infected individual. The parameter $\alpha$ indicates the strength of the interaction, the rate per S-I pair in the population. The second reaction indicates removal of an infected individual, at rate $\phi$ per infected.
The compartmental and reaction kinetics formalism are mathematically equivalent. Herein, I will use reaction kinetics to better explain the “microscopic” (individual-level) processes that motivate the model.
In the following section I will alter the SIR model to better represent the features I think are most important.
IR Model
S is usually included in an SIR model to account for the loss of “targets” as an epidemic spreads, and the subsequent slowing. But that only makes sense if the epidemic spreads to a large proportion of those susceptible individuals. If only a small fraction are ever infected then we can treat the size of the S “pool” as effectively constant.
Even in the worst hit countries only 0.2%2) of the population had been affected so I was hopeful that this would be a valid assumption here. With this assumption I chose to leave “S” out of my model. So I had only an “IR model”.
\begin{eqnarray} I & \xrightarrow{\alpha} & 2 I & \text{(infection)} \\ I & \xrightarrow{\phi} & R. & \text{(removal)} \end{eqnarray}
But there were some important factors to consider for the I and R-types if I wanted a reasonable representation of this epidemic.
Subcompartments of I
I'm not sure it matters but I have an inkling that the latent period, after exposure but before onset of disease is important. So I chose to include this in my model by adding an Exposed compartment. Infected cases create new exposed cases that later transition to being infectious themselves. So my model grew to: \begin{eqnarray} I & \xrightarrow{\alpha} & I + E & \text{(exposure)} \\ E & \xrightarrow{\lambda} & I & \text{(infectious)} \\ I & \xrightarrow{\phi} & R. & \text{(removal)} \end{eqnarray}
Next we consider patients receiving treatment. Treatment for ebola involves hospitalization for care and isolation. Evidence indicates that care improves survival rates and isolation helps prevent spread of the disease. Isolation and care could be considered as two independent properties (so a patient might be isolated but not cared for, or vice versa) but for the sake of simplicity I will group them together. I assume patients who are isolated also receive the benefit of care.
So I split I into two subcompartments:
- I = infected but not identified. Can spread infection and lower survival chances.
- H = identified and hospitalized. Infected but isolated and receiving treatment. I assumed hospitalized cases can no longer spread infection and have higher survival probability. One might argue that past contacts are traced so they are less likely to spread disease but I didn't include it in my model.
With these changes the reaction kinetics model becomes \begin{eqnarray} I & \xrightarrow{\alpha} & I + E & \text{(exposure)} \\ E & \xrightarrow{\lambda} & I & \text{(infectious)} \\ I & \xrightarrow{\phi} & H & \text{(hospitalization)} \\ I & \rightarrow & R & \text{(removal of infectious)} \\ H & \rightarrow & R. & \text{(removal of hospitalized)} \end{eqnarray}
One could argue that the Exposed compartment should also be separated into known (via contact tracing) and unknown compartments but I didn't think of that. So I just have one type E – I assume these exposed cases have not been unidentified or hospitalized.
Note that we haven't specified the rates of removal yet. We will discuss removal in more detail next.
Subcompartments of R
In the SIR model R refers to “removed” individuals, those who are no longer infectious and are not susceptible to re-acquiring infection. There are various ways an individual can be removed. Two in particular are important for this ebola epidemic: individuals can recover or die.
Of course this is a hugely important distinction for patients, their families, care-takers, and everybody affected by the epidemic. But it might not be obvious that it's important to track the different outcomes in the model. It turns out it is important when building the model because known deaths are a recorded statistic so they are useful when comparing the model to empirical data.
On the other hand, deaths of unknown infected people can't be tracked so the model doesn't distinguish between deaths versus recovery of unidentified cases. Of course, it should not be interpreted that these lost victims are unimportant.
There have been allegations that traditional funeral rights contributed to the spread of the disease. Other models have therefore included an Infectious Dead compartment. I didn't have any solid evidence or intuition one way or the other so I chose to leave this out for the sake of simplicity. This would be an ingredient to add if the model wasn't realistic enough without it.
We've identified three compartments or important types to include in our model: E I H.
Representation & analysis
Rate equations
Above gives a complete description of the processes I considered in my model. Ignore all other factors.
The material below this point is incomplete and requires further editing.
Assuming a large, well-mixed population of individuals undergoing these reactions, law of mass action gives us the dynamical equations governing the population densities.
\begin{eqnarray} \frac{dE}{dt} & = & \alpha I - \lambda E \\ \frac{dI}{dt} & = & \lambda E - (\phi + \rho_I + \delta) I \\ \frac{dH}{dt} & = & \phi I - (\rho_H + \delta) H. \end{eqnarray}
We also track two other properties, the number of known cases and deaths, $C$ and $D$, so we can compare against available data:
\begin{eqnarray} \frac{dC}{dt} & = & \phi I \\ \frac{dD}{dt} & = & \delta H. \end{eqnarray}
Aside (put in a box?): we can see where these rates come from if we consider which processes might involve recording a new case C or death D. (Show modified reactions.) I assume cases are counted when they are hospitalized.
If we know the rate parameters and the initial numbers of each type we can use the rate equations to determine the population at any other time.
In practice it is a difficult problem but there are good mathematical tools to solve these equations numerically.
Choosing the right tool is an art. I found ode15s gives consistently good results.
Fixed parameters
- constant parameters & initial conditions
1/lambda =
1/rho_I =
1/rho_H =
delta =
No information on alpha and phi so I fit those to the data.
Originally I fit the initial conditions as well.
But I was worried this could lead to “garden of eden” scenarios. Example: more deaths than cases.
Amazingly, the index case of the current outbreak was identified.
So can set E_0, I_0, H_0, C_0, D_0 = … on Dec 2, 2013.
0.1 Solution for constant rates
If all of the above parameters are constant then it is possible to solve in closed-form for the population at any time.
Here's how.
0.2 phi as a function of time
But there are reasons to think parameters are not all constant.
I don't know what might affect alpha so I will assume it is constant and fit it.
But I think it is easier to imagine that phi changed throughout the epidemic.
Initially, ebola was not suspected so no effective protective measures were taken. Sick were cared for at home and infected their families. Hospitalization and quarantine rates were low so phi initially small (but positive).
Eventually, ebola was identified and an international response was mounted. Education happened and contact tracing. These would have resulted in better identification of infected with better care and more effective isolation. Phi high.
- strictly positive
0.2.1 Sigmoid
- originally sigmoid
0.2.2 Bounded polynomial
- later bounded polynomial
- Gaussian envelope
- polynomial inside
- lowest order possible
0.2.3 Stretched polynomial?
- try this in the future?
- conceptually even simpler
0.3 Other model variations
- also tried these variations: odemodel1..11
- but they didn't fit the data as well at the time
- haven't kept comparing them but maybe I should've
1. Calibration
1.1 Fitting C & D
- data sources. Thanks.
- parfmincon
- logscale
- don't worry about correlations in data
Results
- fit performed every day with latest data
1. Fitted curves
- animation of fitted curves
- only started recording in December
2. R0
- derivation
- show it matches analytic solution for const phi (above)
- plot over all time
- discuss what it means: early, intermediate, late time
- where it crosses one
- sensitivity analysis? Which parameter has biggest impact? Especially when R_0 near one?
From Carroll et al. (2015):
Phylogenetic analysis revealed the dynamic nature of the epidemic and molecular change in the viral sequence (Fig. 2a). Several distinct lineages were identified, with an initial lineage A (Figs 2a, 3 and Extended Data Fig. 2) linked to early Guinean cases dating from March 2014 including the three original viruses published by Baize et al.2. A second lineage, B, emerged in May and June and comprises all the sequences from Gire et al.6 and the remainder of those described here. As the epidemic expanded, lineage A remained confined in Guinea from March to June 2014, except for one sequence from 18 July 2014. A single Liberian sequence from March 2014 grouped within this lineage. No further EBOV genomes that we sequenced from samples taken after July 2014 belonged to lineage A. This clade was likely to have been associated with the original outbreak in Guinea and was almost successfully contained in May 2014 by the interventions of the multi-agency response.
Do these separate lineages explain the slowdown and resurgence of the epidemic around March 2014? I see $R_0$ declining until roughly March, then it explodes upwards. The inflection point of $R_0$ is around February 2014, which corresponds to the start of Lineage B in Figure 3.
3. Forecast
- contained?
- model estimates total number of cases (known and unknown)
- forecast total deaths
- forecast last death
- compare with Valdez et al. (2015)
Discussion
- could use model to test interventions
1. Conclusions
- sympathy
- benefits
- learned a lot about curve fitting
- learned about confidence intervals, especially with cumulative data. Cite Nychka et al. (2000), Zwiers and von Storch (1995), King et al. (2014).
- wrote and shared two routines:
lhsdesigncon
andparfmincon
- gained appreciation for importance and difficulty of true forecasting vs. pseudo-forecasting