Computational Statistics & Data Analysis 50 (2006) 3354 – 3368 www.elsevier.com/locate/csda
A latent variable model for estimating disease transmission rate from data on household outbreaks Ning Li∗ , Guoqi Qian, Richard Huggins Department of Statistical Science, La Trobe University, Bundoora, Melbourne 3086, Australia Received 18 January 2005; received in revised form 20 June 2005; accepted 21 June 2005 Available online 19 July 2005
Abstract A Bayesian latent variable model is proposed for studying household epidemics of infectious diseases in this paper. This model is more general and flexible than the commonly used chain binomial epidemic model. In particular, the model allows for the heterogeneity of the infection transmission rates in related to the sizes and generations of the infectives. Moreover, the model assumes the availability of only the household outbreak sizes, which is more reasonable than assuming the availability of the hardly observed infection chains. The Tanner–Wong’s IP algorithm is employed for effective simulations and inferences of this model. Finally, this model was applied to analyzing a real data set on F-4 Asian influenza. © 2005 Elsevier B.V. All rights reserved. Keywords: Disease transmission rate; Infection chain; Outbreak size; Data augmentation; Tanner–Wong algorithm; Bayesian statistics
1. Introduction A key quantity in planning vaccine strategies is the threshold parameter—a quantity when its value is exceeded an epidemic will take place. Given the structure of a community, the threshold parameter is a deterministic function of the often unavailable transmission rate of the infectious disease under investigation (Becker et al., 1995; Starczak, 1998; Becker and Dietz, 1995). Therefore, it is essential to find a precise estimate of the transmission rate. ∗ Corresponding author. Tel.: +61 3 9479 3296; fax: +61 3 9479 2466.
E-mail address:
[email protected] (N. Li). 0167-9473/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2005.06.011
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
3355
Estimation of the transmission rate, however, is challenging because an infection process is only partially observable. We may be able to observe the time when the symptoms were detected but rarely the time when the infection actually occurred; we may be able to observe the total number of infected cases but rarely the order of occurrences of the infections. When an infectious disease has a virtually constant latent period followed by a relatively short period of infectiousness, the latent period can be used as a time unit to study the disease transmission. This is the primary consideration that motivated the chain binomial transmission models for infectious diseases. See, for example, En’Ko (1889), Greenwood (1931), Bailey (1975) and Becker (1989). Two of the well-known chain binomial models are Greenwood and Reed-Frost models. Both are special cases of a more general chain binomial model—in which the chance that a given susceptible avoids being infected when exposed to i infectives of the previous generation is qi —so that if there are s susceptibles in a household the number of newly infected individuals has a Bin (s, qi ) distribution. While the later model is quite general, it still assumes homogeneity of infectiousness over generations. That is, the probability that a susceptible catches the disease is independent of the length of the infection process having been evolved. These assumptions might not be realistic for the diseases that have short latent period or variable incubation period. Becker (1989) extended the homogeneous qi model to a more flexible model that allows for the heterogeneity of the infectiousness across generations. To make inference for this more flexible model, Becker (1989) proposed a method based on maximum likelihood. This method has some limitations in that it requires the availability of the detailed progression of infections, i.e. the number of infected individuals in each generation. But for a diseases with a short latent period or a highly variable incubation period, identification of the progression of infectiousness is usually difficult. On the other hand, the total number of infected cases in each household is often obtainable and also more reliable because it can be laboratorially tested. Therefore, it is our intention to adopt Becker’s transmission model setting but to fit it by using only the data of outbreak sizes, i.e., the data on the total number of infected cases in each household. In this paper, we shall generalize Becker’s model to allow for a missing data or latent data formulation. We demonstrate that under this formulation the Tanner–Wong algorithm is suitable for making inference about the disease transmission rate, using only the data of the household outbreak sizes. We opt to proceed with our work in the Bayesian framework because it provides a natural platform for accommodating data augmentation. Until recently, most works in fitting stochastic transmission models for infectious diseases have been based on the method of maximum likelihood. Due to the distinct feature that an infection process is only partially observable the MLE method usually either entailed unrealistic simplicity assumptions to secure a solution or left the estimation and inference as a compromised optimization question. With the help of highly advanced computers, the Bayesian statistics and sampling-based Monte Carlo methods have been found to be particularly powerful in handling these complications. O’Neill and Roberts (1999) are presumably the first who estimated transmission rates in a Greenwood model and a continuous transmission model using the Bayesian approach. More recently Li et al. (2002) estimated the transmission rate through a random effects models using the outbreak size data but assuming generationinvariant transmission rate.
3356
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
In Section 2, we extend Becker’s model to allow for a missing data formulation. That is we model the heterogeneous transmission across generations and across the number of infectious individuals using the data on household outbreak sizes. We then derive the posterior distribution of the parameters of interest. In Section 3, we discuss the computation of posterior distribution via the Tanner–Wong algorithm. We also discuss means of model evaluation. In Section 4, we assess the performance of the proposed method by simulations. In Section 5, the method is applied to estimate the infection rate of an F-4 Asian influenza. Finally a brief summary is presented in Section 6. 2. The model Consider a community of N independent households, each with size M including i0 introductory cases of a particular infectious disease. The total number of infected cases in a household over the period of epidemic is referred to as the outbreak size of the household. We consider the type of infectious diseases for which the infection is transmitted by close contact and assume that the probability that a susceptible individual catches the disease from outside the household is negligible compared with the infection force within. Also assume that susceptible individuals are infected independently of one another. Following Becker (1989), we denote by i the number of infectious individuals in generation , with the i0 introductory cases being classified as generation zero. A typical epidemic chain, Cj , thus may be articulated as i0 → i1 → i2 → · · · → igj where gj is the last generation containing non-zero infective(s). Denote by s the number of susceptibles at the beginning of the th generation so that i0 + i1 + · · · + igj + sgj +1 = M. With these notations, the outbreak size of the infection chain Cj can be seen to be Ij = i0 + i1 + · · · + igj . Let bz be the number of households among the N that have the outbreak size z where z is an integer between of the outbreak sizes of the N households i0 and M. The frequencies are denoted as b = bi0 , bi0 +1 , . . . , bM . Let q(i, ) be the probability that a susceptible of generation avoids infection when exposed to the i infectives in the previous generation. Let q denote the vector of q(i, ) i = 1, . . . , M, = 1, . . . , gj , j = 1, 2, . . . , 2M−i0 . Given the number of infectives and susceptibles in a generation, the number of infectives in the next generation has a binomial distribution: Pr (s = s+1 , i = i | s , i−1 ) = s i s+1 , where p (i−1 , ) = 1 − q (i−1 , ) is the probability i (p (i−1 , )) (q (i−1 , )) that a susceptible of generation is infected by the i−1 infectives of the previous generation. The probability of an infection chain Cj is given by g
j Pr Cj q = Pr (s = s+1 , i = i | s , i−1 ) .
(1)
=1
If the infection chain for each household were completely observed, one would be able to estimate the characteristics of the avoidance probability vector q using the complete chain likelihood obtained from the joint probability of the observed chains. However, it is most often that only the outbreak size, i.e., the total number of infectives in the occurring infection chain, of each household is observed. Therefore, the avoidance probability q cannot be estimated from the complete chain likelihood that is not computable.
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
3357
2.1. The likelihood functions Note that except for the case z = i0 or z = i0 + 1 there are multiple infection chains that can give rise to a given outbreak size z. Suppose there are nz such infection chains and we denote them as Oz = Cz,1 , Cz,2 , . . . , Cz,nz . The probability of outbreak size z is then given by the sum of the probabilities over the nz infection chains p(z) =
nz
Pr Cz,j q .
(2)
j =1
The likelihood function of the observed data b (observed-data likelihood) is given by f (b|q) ∝
M
b
z p(z)
(0 < q(i, ) < 1).
(3)
z=i0
The maximization of (3) for estimating q requires a numerical procedure which may be difficult when there exist local maximizers or the global maximizer is on the boundary. On the other hand, the estimation of q by working with the complete chain likelihood would be much easier had all household infection chains been observed. This observation suggests the possibility of taking the unobservable infection chains as latent data and estimating q in a missing-data framework where the method of data augmentation can be used (Tanner and Wong, 1987). To be specific, denote by cz,j the unobserved frequency of households having the progresthat cz,1 +· · ·+cz,n(z) =bz . Note sion of infection chain identified by Cz,j ∈ Oz .This implies that {bz , z = i0 , ·, M} are the observed and c= cz = cz,1 , . . . , cz,n(z) , z = i0 , . . . , M the latent or the missing or the augmented data. While c is ideal for the purpose of making inference about q(i, ), it is difficult to be observed reliably. The likelihood corresponding to the augmented data is given by f (c|q) ∝
M n(z)
c Pr Cz,k q z,k .
z=i0 k=1
Denote by q1 , . . . , qJ the set of the distinct components of q. The augmented-data likelihood function can be re-expressed as f (c|q) ∝
J
w
pj j,p
(c) wj,q (c) qj ,
(4)
j =1
where pj =1−qj , and wj,p (c) and wj,q (c) are functions determined by the augmented data c. Note that now the unknown quantities in the model include not only q, the parameters to be estimated, but also the unobserved chain frequencies c. 2.2. Posterior distribution We adopt to use conjugate priors for Bayesian inference for the above model. It can be seen from (4) that the conjugate prior for each parameter qj is a Beta distribution.
3358
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
In the following we assume that the priors of q1 , . . . , qJ are independent. The posterior distribution corresponding to the augmented data is proportional to the prior distribution and the augmented-data likelihood. Hence under the conjugate priors the joint posterior distribution given the augmented data is a product of independent Beta distributions, each with its parameters being determined by the hyper parameters and the chain frequencies. Specifically, let qj follow Beta j , j independently for j = 1, . . . , J where j and j are fixed positive numbers. The augmented-data posterior distribution becomes f (q|c) ∝
J
+wj,p (c)−1 j +wj,q (c)−1 qj
pj j
j =1
∝
J
Beta j + wj,p (c), j + wj,q (c) .
(5)
j =1
The posterior distribution corresponding to the observed data (observed-data posterior) is given by f (q|b) ∝ f (q)f (b|q) ∝
J
Beta j , j · f (b|q),
(6)
j =1
where f (b|q) is specified in (3). To complete this section we investigate the relationship between the probability of outbreak size and the probability of corresponding infection chains. Given the value of bz , cz = cz,1 , cz,2 , . . . , cz,n(z) has a multinomial distribution over the n(z) possible epidemic chains, namely f cz | bz , z,1 , . . . , z,n(z) =
bz ! cz,n(z) c c , z,1 z,2 · · · z,n(z) cz,1 !cz,2 ! · · · cz,n(z) ! z,1 z,2
(7)
where z,k = Pr Cz,k q /p(z) is the conditional probability of Cz,k . The conditional joint distribution of c given b now becomes f (c|b, q) =
M
f cz | bz , z,1 , . . . , z,n(z)
(8)
z=i0
which states explicitly the missing data mechanism.
3. Posterior computation Recall that the objective is to estimate the observed-data posterior distribution f (q|b), which is shown to be difficult to compute directly. Knowing that the augmented-data posterior f (q|c) is easy to compute, this difficulty can now be overcome by employing the so-called Tanner–Wong algorithm (also known as IP algorithm, see Tanner and Wong,
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
3359
1987). In the context of the problem considered in this paper, the Tanner–Wong algorithm consists of the following two steps: Step I: Generate m independent q, denoted as q(1) , . . . , q(m) , from the current estimate fˆcurrent (q|b) of f (q|b). Then for each q(j ) generate c(j ) from the missing-data mechanism f (c|b, q) given in (8). Step P: Calculate the posterior density corresponding to the augmented data f q| c(j) by using (5), and set the next observed-data posterior density estimate to the mixture of the augmented-data posterior densities m 1 fˆnext (q|b) = f q| c(j ) . m
(9)
j =1
When the I- and P-steps are repeated for a sufficiently long time, the sample generated from fˆnext (q|b) may be regarded as a sample from the posterior distribution f (q|b). The Tanner–Wong algorithm illustrates the basic idea of data augmentation, that is, to replace a difficult computation under the observed data with a sequence of simpler computations by introducing some latent data. Note that when the posterior distribution f (q|b) is thought of as the marginal posterior distribution of f (q, c|b), f (q|b) can be computed alternatively using the standard Gibbs sampler. The required full conditional distributions in the Gibbs sampler, f (q|c, b)=f (q|c) and f (c|q, b), are already derived in (5) and (8), respectively. One important step of Bayesian data analysis is to check how well the model fits the data (refer to the outbreak size data b here). This can be implemented through the posterior predictive distribution (PPD) for b. The posterior predictive distribution is defined as
f b b = f b q, b f (q|b) dq, where b can be envisaged as an observation from a model updated from processing the information in the data b. The basic idea underlying PPD check is that if the data come from the model, then the data should typify all the data generated from the model (Gelman et al., 1995). Usually the model is assessed against a number of criteria of interest, referred to as test quantities. Denote by T = T (b , q) a test quantity, then the Bayesian p-value is computed as the tail probability of observing a value less extreme than the one based on the observed data under the posterior predictive distribution, namely p-value = Pr T b , q < T (b, q) . The Bayesian p-value is usually approximated by the sampling frequency based on the simulated sample from the PPD. For the outbreak size data, it is natural to use outbreak frequencies as the test quantities. To do this, firstly simulate K copies of the epidemics using the estimated qˆ to produce K replications of outbreak size frequencies. These replications are then compared to the observed outbreak frequencies to compute the Bayesian p-value. A p-value either close to zero or one indicates poor fit of the model in terms of the aspect specified by the test quantity.
3360
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
Table 1 Chain probabilities of households of size four with one introductory case Infection chain
Probability
Realized frequency
1 11 111 12 1111 112 121 13
q 3 (1, 0) 3p(1, 0)q 2 (1, 0)q 2 (1, 1) 6p(1, 0)q 2 (1, 0)p(1, 1)q(1, 1)q(1, 2) 3p 2 (1, 0)q(1, 0)q(2, 1) 6p(1, 0)q 2 (1, 0)p(1, 1)q(1, 1)p(1, 2) 3p(1, 0)q 2 (1, 0)p 2 (1, 1) 3p 2 (1, 0)q(1, 0)p(2, 1) p 3 (1, 0)
c1 = 6 c2 = 15 c31 = 7 c32 = 22 c41 = 5 c42 = 3 c43 = 21 c44 = 21
4. Simulation study To test the performance of the proposed method, we consider 100 households each having household of size four with one introductory case. The eight possible infection chains associated with each household and corresponding probabilities are given in Table 1. In this case there are four distinct parameters q(1, 0), q(1, 1), q(1, 2), and q(2, 1) to be estimated. In the following, we shall first obtain the data by generating an epidemic using true parameter values and, then, pretending that the parameter values were unknown, we use the proposed method to estimate them. The estimates will finally be compared back to the true parameter values. In particular, we specified the true avoidance probabilities as q = (q(1, 0), q(1, 1), q(1, 2), q(2, 1))=(0.4, 0.7, 0.6, 0.5). Using these q we simulate 100 independent infection chains according to Becker’s model. The chain frequencies were recorded, shown in the last column of Table 1. The frequencies of outbreak sizes were also noted as b=(b1 , b2 , b3 , b4 ) = (6, 15, 29, 50). The posterior distribution conditional on the chain data c = (c1 , c2 , c31 , c32 , c41 , c42 , c43 , c44 ) can be derived to be f (q|c) = Beta 10 + u10 , 10 + v10 Beta 11 + u11 , 11 + v11 × Beta 12 + u12 , 12 + v12 Beta 21 + u21 , 21 + v21 , (10) where ij and ij are prior parameters and u10 = 3c1 + 2c2 + 2c31 + c32 + 2c41 + 2c42 + c43 , v10 = c2 + c31 + 2c32 + c41 + c42 + 2c43 + 3c44 , u11 = 2c2 + c31 + c41 , v11 = c31 + c41 + 2c42 , u12 = c31 , v12 = c41 , u21 = c32 , v21 = c43 . The conditional distribution of c given b and q is f (c|q, b) = f ( c31 , c32 | q, b3 ) f ( c41 , . . . , c44 | q, b4 ) ,
(11)
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
3361
Table 2 The sets of chain frequencies used to start the Tanner–Wong algorithm Chain
1
11
111
12
1111
112
121
13
Frequency (Set No. 1) (Set No. 2) (Set No. 3) (Set No. 4) (Set No. 5) (Set No. 6)
6 6 6 6 6 6
15 15 15 15 15 15
29 29 0 0 15 15
0 0 29 29 14 14
50 0 50 0 25 0
0 0 0 0 0 0
0 0 0 0 0 0
0 50 0 50 25 50
where f ( c31 , c32 | q, b3 ) = Multinom((111, 12), Pr(111), Pr(12)), f ( c41 , . . . , c44 | q, b4 ) = Multinom((1111, 112, 121, 13), Pr(1111), Pr(112), Pr(121), Pr(13)). In the above, the notation 1111 and 112, etc. represent the infection chain 1 → 1 → 1 → 1 and 1 → 1 → 2, etc., respectively. The Multinom((111, 12); Pr(111), Pr(12)) represents the multinomial distribution over the chains 111 and 12 with corresponding non-normalized probabilities Pr(111) and Pr(12). Similar notations were used in f ( c41 , . . . , c44 | q, b4 ). The same notations have also been used in Tables 2 and 4. To estimate q we used independent uniform prior distributions for q(1, 0), q(1, 1), q(1, 2) and q(2, 1) by setting the hyper parameters ij and ij in expression (10) to 1, independently. To choose initial values of the chain frequencies c(1) , . . . , c(m) for applying the Tanner–Wong algorithm, we considered m = 6 compositions of b = (6, 15, 29, 50) in terms of chain frequencies, which are given by the six rows in Table 2. These six compositions were chosen to be as different from each other as possible but otherwise were arbitrary. The purpose was to see if there is any effects of the specification of starting values on final results. Started with these six sets of initial values, we run the Tanner–Wong algorithm for 20,000 iterations, using the software Splus 6.2. At each iteration L = 500 random points of q were generated to constitute a sampling distribution of f (q|b). Particularly the 2.5% and 97.5% quantile values of this sample in related to each component of q were computed, which constituted a sequence of 95% Bayesian intervals for each component of q. The Bayesian intervals estimated at the last 100 iterations were displayed in Fig. 1. The medians obtained at each of the 20,000 iterations were displayed in Fig. 2 where the horizontal lines indicate the true parameter values. The prior distributions used and the posterior distributions obtained from the last iteration are shown in Fig. 3, and a numerical summary was given in Table 3. The posterior medians of q(1, 0), q(1, 1), q(1, 2) and q(2, 1) are respectively 0.397, 0.724, 0.495 and 0.508. The 95% Bayesian intervals for q(1, 0), q(1, 1), q(1, 2) and q(2, 1) are (0.335, 0.493), (0.587, 0.860), (0.172, 0.823) and (0.314, 0.738), respectively, all covering corresponding true parameter values. It was seen that the q(1, 0) can be estimated most precisely and q(1, 2) least. This is expected because every household supplied
3362
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
q10
0.45 0.30 0
20
40
60
80
100
60
80
100
60
80
100
60
80
100
iteration
q11
0.8
0.5 0
20
40 iteration
q12
0.6
0.0 0
20
40 iteration
q21
0.8 0.5 0.2 0
20
40 iteration
Fig. 1. Posterior distributions of q(1, 0), q(1, 1), q(1, 2) and q(2, 1) estimated at the last 100 iterations by fitting the model to data b = (6, 15, 29, 50) using the Tanner–Wong algorithm.
information in estimating q(1, 0) no matter what outbreak size was observed in the household, but only the households that had observed the infection chain 1 → 1 → 1 → 1 contributed information in estimating q(1, 2). To investigate if there is any impact of starting values of the Tanner–Wong algorithm on the final inference, we recorded the sets of chain frequencies imputed at every 1000 iteration. The 6 sets of chain frequencies imputed at the last iteration, i.e. chain frequencies produced by starting with the initial values given in Table 2 and running the Tanner–Wong algorithm for 20,000 iterations, are given in Table 4. A comparison of the realized chain frequency c = (c1 , c2 , c31 , c32 , c41 , c42 , c43 , c44 ) = (6, 15, 7, 22, 5, 3, 21, 21) that is given in Table 1 and the imputed chain frequencies in Table 4 reveals that as the number of iterations in the Tanner–Wong algorithm increases the imputed latent data tend to be closer to the realized latent data. In other words, the starting values of the Tanner–Wong algorithm are unlikely to affect the final results as long as the run is long. This observation is consistent with the theoretical property of the Tanner–Wong algorithm (Tanner and Wong, 1987).
5. Application to influenza data As an illustration of the method proposed we consider the data from an F-4Asian influenza epidemic presented by Sugiyama (1960). All 42 households in the sample are of size three,
q10
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
3363
0.6
q11
0.0 0
5000
10000 iteration
15000
20000
0
5000
10000 iteration
15000
20000
0
5000
10000
15000
20000
15000
20000
0.6
q12
0.0
0.6 0.0
q21
iteration
0.6 0.0 0
5000
10000 iteration
Fig. 2. Sequence of estimated medians of q(1, 0), q(1, 1), q(1, 2) and q(2, 2).
and each can be reasonably assumed to have one introductory case from the information available so that there are only two distinct avoidance probabilities involved: q(1, 0) and q(1, 1). The former is the probability that a susceptible escapes being infected from the initial infectious case, the later is the probability that a susceptible escapes being infected from a secondary case. The observed frequencies for the outbreak sizes are given in Column 3 of Table 5, with the corresponding probabilities being given in Column 2 of the table. The 29 households in which no infection cases were observed do not contribute to the inference about the transmission rates in this model setting, hence we shall focus on the remaining 13 infected households. Using independent uniform prior distributions and running the Tanner–Wong algorithm, we obtained estimates of q(1, 0) and q(1, 1) as 0.816 and 0.541, respectively. We now provide more details about the inference in this example. To choose the number of mixture components, m, we noticed that there are only three possible ways to partition the outbreak frequency b = (b1 , b2 , b3 ) = (9, 2, 2) into chain (1) (2) frequency c = (c1 , c2 , c31 , c32 ) with b3 = c31 + c32 : c(0) = (9, 2, 2, 0), c(0) = (9, 2, 0, 2) (3)
and c(0) = (9, 2, 1, 1). So all of them were used to initialize the Tanner–Wong algorithm. The chain frequency c31 is augmented from the current estimate q and the data b3 as f ( c31 | q, b3 ) = Bin b3 ,
2q(1, 0)p(1, 1) . 2q(1, 0)p(1, 1) + p(1, 0)
3364
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
10
10 8
prior posterior
8
6
6
4
4
2
2
0
0 0.0
0.2
0.4
0.6
0.8
1.0
prior posterior
0.0
0.2
0.4
q10
0.6
0.8
1.0
0.6
0.8
1.0
q11 10
10 8
8
prior posterior
6
6
4
4
2
2
0
0 0.0
0.2
0.4
0.6
0.8
1.0
prior posterior
0.0
0.2
q12
0.4 q21
Fig. 3. Prior (dashed line) and Posterior (solid line) distributions of avoidance probabilities q(1, 0), q(1, 1), q(1, 2) and q(2, 1) estimated from the Tanner–Wong algorithm. The vertical line in each plot indicates the true parameter value.
The posterior distribution conditional on the augmented-data c is given by f (q|c) = Beta (1 + 2c1 + c2 + c31 , 1 + c2 + c31 + 2c32 ) × Beta (1 + c2 , 1 + c31 ) .
(12)
to the Tanner–Wong algorithm, in every P-step a sample of size L, q(1) = According (L) (1) (1) (L) (L) q (1, 0), q (1, 1) , . . . , q = q (1, 0), q (1, 1) , are generated from 1 1 1 f (q|b) = f q| c(1) + f q| c(2) + f q| c(3) . (13) 3 3 3 One draw of q from mixture (13) drawing avalue q from each of is obtained by firstly the component distributions f q| c(1) , f q| c(2) and f q| c(3) using (12), and then selecting one of the generated values with equal probability. A sample of size L can be produced by independently repeating the above procedure for L times. Alternatively, a sample of size L from distribution (13) is obtained more efficiently as follows. Sample L
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
3365
Table 3 Posterior summary of avoidance probabilities of simulated data Quantiles
q(1, 0) q(1, 1) q(1, 2) q(2, 1)
True value
2.5%
25%
50%
75%
97.5%
0.336 0.590 0.172 0.318
0.342 0.615 0.210 0.352
0.397 0.724 0.495 0.508
0.480 0.844 0.792 0.711
0.491 0.859 0.820 0.735
0.4 0.7 0.6 0.5
Table 4 The six sets of chain frequencies imputed at iteration 20,000 Chain
1
11
111
12
1111
112
121
13
Frequency (Set No. 1) (Set No. 2) (Set No. 3) (Set No. 4) (Set No. 5) (Set No. 6)
6 6 6 6 6 6
15 15 15 15 15 15
8 6 6 5 6 7
24 21 17 18 24 20
4 6 6 5 6 7
1 3 3 1 4 2
19 18 20 24 23 19
23 23 24 24 17 24
Table 5 Observed and expected frequencies of outbreaks of Asian influenza Outbreak size
Probability
Observed frequency
Expected frequency
3 2 1 0 Total
2p(1, 0)q(1, 0)p(1, 1) + p 2 (1, 0) 2p(1, 0)q(1, 0)q(1, 1) q 2 (1, 0) 1− the rest 1
2 2 9 29 42
2.23 2.11 8.66 29 42
integers with replacement from (1, 2, 3) to obtain L1 1’s, L2 2’s and L3 3’s. These represent the number of q values to be generated from each of the 3 mixture components. Then draw a sample of size L1 from f q| c(1) and a sample of size L2 from f q| c(2) and a sample of size L3 from f q| c(3) . The set of these three sample points constitute a sample of size L from the mixture distribution f (q|b). (1) (2) (3) Started with the three sets of initial values, c0 , c0 , c0 , we run the Tanner–Wong algorithm for 10,000 iterations, with a sample of size L = 1000 were generated at each iteration. Fig. 4 shows the posterior distributions estimated at the last 4 iterations. A summary of the estimates based on the last iteration was given in Table 6. The avoidance probability that a susceptible individual avoids being infected from the initial infection case was estimated as 0.816, and the avoidance probability decreased to 0.541 at the next generation.
3366
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368 q(1,0)
q(1,1)
6
6
5
5
4
4
3
3
2
2
1
1
0
0 0.0
0.2
0.4 0.6 Iteration 997
0.8
1.0
6
6
5
5
4
4
3
3
2
2
1
1
0
0.0
0.2
0.4 0.6 Iteration 997
0.8
1.0
0.0
0.2
0.4
0.8
1.0
0.8
1.0
0 0.0
0.2
0.4
0.6
0.8
1.0
Iteration 998
0.6
Iteration 998
6
6
5
5
4
4
3
3
2
2
1
1
0
0 0.0
0.2
0.4 0.6 Iteration 999
0.8
1.0
0.0
6
6
5
5
4
4
3
3
2
2
1
1
0.2
0.4 0.6 Iteration 999
0
0 0.0
0.2
0.4
0.6
Iteration 1000
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
Iteration 1000
Fig. 4. Posterior distributions of q(1, 0) (left column) and q(1, 1) (right column) simulated at the last 4 Tanner–Wong iterations. The simulation started with the three possible chain frequencies and run for 10,000 iterations.
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
3367
Table 6 Posterior summary of avoidance probabilities of Asian Influenza Quantiles
q(1, 0) q(1, 1)
2.5%
5%
50%
95%
97.5%
0.643 0.166
0.673 0.210
0.816 0.541
0.918 0.848
0.934 0.887
300
300
300
200
200
200
100
100
100
0
0 0
2 4 6 8 10 12 outbreak size 1
0 0
2
4 6 8 10 12 outbreak size 2
0
2
4
6 8 10 12 outbreak size 3
Fig. 5. Posterior Predictive check of the model fitted by Tanner–Wong algorithm. From left to right are the posterior predictive distributions of the frequencies of outbreak size 1, 2, and 3, respectively.
The 95% Bayesian intervals are (0.643, 0.934) and (0.166, 0.887) for q(1, 0) and q(1, 1), respectively. Two procedures were carried out to evaluate the model fitting. In the first procedure, we calculated the frequencies of each outbreak size one would expect to see under the estimated model. This was accomplished by substituting (q(1, ˆ 0), q(1, ˆ 1)) = (0.816, 0.541) into the outbreak probabilities of column 2 in Table 5 and then multiplying the probabilities with the total number of infected households. The resulting expected outbreak frequencies, shown in the last column of Table 5, closely agree with the observed data. Our second model evaluation procedure is via the posterior predictive distribution. Specifically, we simulated K = 1000 copies of epidemics using the estimated posterior medians (q(1, ˆ 0), q(1, ˆ 1)) = (0.816, 0.541). Each epidemic involves 13 households and each household consists of one initial infected and two susceptible individuals. At the end of each epidemic, the 13 households were grouped according to their outbreak size. The outbreak frequencies form a future observation, b , from the posterior predictive distribution f ( b | b). Histograms of the 1000 replicates of b were displayed in Fig. 5 in which the predicted frequencies of outbreak size 1, 2 and 3 are given in Column 1, 2 and 3, respectively. The vertical line in each plot represents the observed frequency of the corresponding outbreak size. It is seen that the observed frequency of each outbreak size looks typical among the future observations predicted from the PPD. Furthermore, the Bayesian p-value of an observed frequency was found from the area to the left of the vertical line in each plot. In our simulation, the Bayesian p-values are 0.465, 0.352 and 0.319 for outbreak size 1, 2 and 3, respectively. Therefore we conclude that the model fits the data reasonably well.
3368
N. Li et al. / Computational Statistics & Data Analysis 50 (2006) 3354 – 3368
6. Discussion Once the transmission rate is estimated, its value can be used to work out the minimum vaccine level for preventing further epidemics of the same infectious disease. The precision in estimating the transmission rate directly affects the precision of the minimum vaccination level. We demonstrated that a transmission model can be constructed by using the idea of data augmentation. The unobserved infection chains are introduced into the model as latent variables that give rise to the observed outbreaks. When the stochastic relationship between the latent data and the observed data given the model parameters is known, this structure can be used to impute the latent data. The posterior distribution for complete data are averaged to estimate the posterior distribution for the observed data. The Tanner–Wong algorithm facilitates such computations. In the implementation of the Tanner–Wong algorithm, the sample size in each iteration, L, should be large for the generated sample to be representative. Once the model has been estimated, the goodness of fit of the model can be assessed via the posterior predictive distributions without much extra effort. The Bayesian p-value provides a numerical measurement for this model assessment. There are several issues that we have not been able to consider in this paper. First, the method presented in this paper assumes that all households under consideration have the same household structure, that is, all households have initially i0 infected and s0 susceptible individuals. It is desirable to extend this work to households of various household structures. Secondly, formal procedures and practical rules for determining the number of iterations needed in the Tanner–Wong algorithm to attain a prescribed precision may need further investigation. Finally, whether the choice of m, the number of mixture components, has any effect on the results of the Tanner–Wong algorithm deserves an investigation. References Bailey, N.T., 1975. The Mathematical Theory of Infectious Diseases and its Applications. Griffin, London. Becker, N.G., 1989. Analysis of Infectious Disease Data. Chapman & Hall, London. Becker, N.G., Dietz, K., 1995. The effect of household distribution on transmission and control of highly infectious diseases. Math. Biosci. 127, 207–219. Becker, N.G., Bahrampour, A., Ditez, K., 1995. Threshold parameters for epidemics in different community settings. Math. Biosci. 129, 189–219. En’Ko, P.D., 1889. On the Course of Epidemics of Some Infectious Diseases. St. Petersburg X, 1008–1010, 1039–1042, 1061–1063. Gelman, A., Carlin, J., Stern, H., Rubin, D., 1995. Bayesian Data Analysis. Chapman & Hall, London, New York. Greenwood, M., 1931. On the statistical measure of infectiousness. J. Hyg. Cambridge 31, 336–351. Li, N., Qian, G., Huggins, R.M., 2002. Analysis of between-household heterogeneity in disease transmission from data on outbreak sizes. A. NZ. J. Statist. 44, 401–411. O’Neill, P.D., Roberts, G.O., 1999. Bayesian inference for partially observed epidemics. J. R. Statist. Soc. A 162, 121–129. Starczak, D.N., 1998. The vaccination coverage required to prevent epidemics, Ph.D. Thesis, La Trobe University. Sugiyama, H., 1960. Some statistical contributions to the health sciences. Osaka City Med. J. 6, 141–158. Tanner, M.A., Wong, W., 1987. The calculation of posterior distributions by data augmentation. J. Amer. Statist. Assoc. 82, 528–550.