This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/copyright
Author's personal copy water research 43 (2009) 3453–3468
Available at www.sciencedirect.com
journal homepage: www.elsevier.com/locate/watres
Multivariate distributions of disinfection by-products in chlorinated drinking water Royce A. Francisa,b,*, Mitchell J. Smalla,b, Jeanne M. VanBriesenb a
Department of Engineering and Public Policy, Carnegie Mellon University, Baker Hall 129, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA Department of Civil and Environmental Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
b
article info
abstract
Article history:
Drinking water disinfection by-product (DBP) occurrence research is important in sup-
Received 9 December 2008
porting risk assessment and regulatory performance assessment. Recent DBP occurrence
Received in revised form
surveys have expanded their scope to include non-regulated priority DBPs as well as
3 May 2009
regulated DBPs. This study applies a Box–Cox transformed multivariate normal model and
Accepted 7 May 2009
data augmentation methods for left-censored and missing observations to US EPA Infor-
Published online 18 May 2009
mation Collection Rule (ICR) drinking water data to describe the variability in the trihalomethane
(THM4),
trihaloacetic
acid
(THAA),
dihaloacetic
acid
(DHAA),
and
Keywords:
dihaloacetonitrile (DHAN) DBP classes, the relationship between class-sum and the
Bayesian modeling
occurrence of individual DBPs within these classes. Inferences about bromine incorpora-
Markov Chain Monte Carlo (MCMC)
tion in these classes are then compared to those made by Obolensky and Singer (2005).
simulation
Results reported herein show that class-based and individual DBP concentrations are
Disinfection by-products
strongly related to bromine substitution, and that speciation and bromine substitution
Gibbs’ sampling
patterns are consistent across DBP classes. In addition, the multiple imputation approach
Drinking water research priorities
employed reveals that uncertainties related to missing and left-censored DBPs have
Left-censoring
important implications for understanding bromine substitution in the THAA class. These
Regulatory evaluation
concerns should be considered through alternative approaches to DBP regulation in subsequent Stage II D/DBP assessment and revisions, where appropriate. ª 2009 Elsevier Ltd. All rights reserved.
1.
Introduction
Drinking waters in the United States (US) and other countries are disinfected through chemical oxidation. Chemical oxidation is an essential drinking water treatment technique used to
inactivate microbial pathogens in the water supply and ease removal of certain physical–chemical contaminants. All chemical oxidants, however, including chlorine, chloramines, chlorine dioxide, or ozone, added to drinking water produce disinfection by-products (DBPs) through reactions with
Abbreviations and notations: DBPs, Disinfection byproducts; THMs, Trihalomethanes; TTHM, Total Trihalomethanes; HAAs, Haloacetic acids; HAA5, 5 Regulated haloacetic acids; HAA9, 9 Brominated and chlorinated haloacetic acids; THAAs, Trihaloacetic acids; DHAAs, Dihaloacetic acids; MHAA, Monohaloacetic acids; HAN, Haloacetonitriles; DHANs, Dihaloacetonitriles; CHCL3, Chloroform; CHBrCl2, Bromodichloromethane; CHBr2Cl, Dibromochloromethane; CHBR3, Bromoform; TCAA, Trichloroacetic acid; DBCAA, Dibromochloroacetic acid; BDCAA, Bromodichloroacetic acid; TBAA, Tribromoacetic acid; DCAA, Dichloroacetic acid; BCAA, Bromochloroacetic acid; DBAA, Dibromoacetic acid; DCAN, Dichloroacetonitrile; BCAN, Bromochloroacetonitrile; DBAN, Dibromoacetonitrile; US EPA, United States Environmental Protection Agency; ICR, Information Collection Rule; MRL, Minimum reporting level; MCL, Maximum Contaminant Level; mmol/L, Micromoles per liter; mg/L, Micrograms per liter; TOC, Total organic carbon; PSRF, Potential scale reduction factor. * Corresponding author at: Department of Engineering and Public Policy, Carnegie Mellon University, Baker Hall 129, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA. Tel.: þ1 412 780 6453; fax: þ1 412 268 3757. E-mail addresses:
[email protected] (R.A. Francis),
[email protected] (M.J. Small),
[email protected] (J.M. VanBriesen). 0043-1354/$ – see front matter ª 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.watres.2009.05.008
Author's personal copy 3454
water research 43 (2009) 3453–3468
bromide and organic precursors within the water treatment plant and its drinking water distribution systems. Certain DBPs have been associated with bladder cancer (Villanueva et al., 2004) and are potentially associated with adverse reproductive outcomes (Bove et al., 2002; Graves et al., 2001; Nieuwenhuijsen, 2005; Nieuwenhuijsen et al., 2000; Tardiff et al., 2006). DBPs occurring in US drinking waters have not been completely characterized, but include halomethanes (e.g., TrihalomethanesdTHM, THM4), haloacetic acids (e.g., HAA9, HAA5), haloacetonitriles, haloketones, halonitromethanes, haloaldehydes, haloacetaldehydes, and halogenated furanones (Krasner et al., 2006; Weinberg et al., 2002). Only a subset of the THMs and HAAs is currently regulated by the US Environmental Protection Agency (i.e., THM4 and HAA5), and their concentrations are assumed to be adequate surrogates for exposure to DBP mixtures. Two recent US EPA DBP surveys are especially relevant to current exposure assessment efforts: the Information Collection Rule (ICR) database and data analysis (McGuire et al., 2002), and the Nationwide DBP Occurrence Survey (Krasner et al., 2006; Weinberg et al., 2002). The set of compounds included in the present study is presented in Table 1. The ICR, promulgated by the EPA in May 1996, mandated an 18-month effort by water utilities to sample and report water quality data that could provide the basis for regulatory actions needed to protect concurrently against microbial disease outbreak and DBP exposure (EPA, 1996). The resulting compiled ICR database includes measurements taken from 1997 to 1998 from 500 large water systems with different source waters, treatment configurations, and disinfection techniques. While the ICR database was not the first effort to characterize the occurrence of total trihalomethanes (TTHM) and haloacetic acids (HAAs) across the US (Brass et al., 1977; McGuire and Meadow, 1988; Munch et al., 1977; Symons et al., 1975), it was the first to characterize the occurrence of nonregulated DBPs, including trichloroacetonitrile, dichloroacetonitrile, bromochloroacetonitrile, dibromoacetonitrile, 1,1-dichloropropanone, 1,1,1-trichloropropanone, chloral hydrate, cyanogen chloride, and chloropicrin. The Nationwide DBP Occurrence Survey reports concentration data for the ICR chemicals, plus an additional 58 compounds across ten
Table 1 – DBPs included in the present analysis. Two regulated haloacetic acids, monochloroacetic acid and monobromoacetic acid, are listed in Table 1 but are not included in this analysis. The regulated DBPs are indicated as follows: D [ THM4, regulated THM species; * [ HAA5, regulated HAA species. Trihalomethanes Chloroformþ# Bromodichloromethaneþ# Dibromochloromethaneþ# Bromoformþ# Haloacetonitriles Dichloroacetonitrile# Bromochloroacetonitrile# Dibromoacetonitrile#
Haloacetic acids Monochloroacetic acid*# Monobromoacetic acid*# Dichloroacetic acid*# Bromochloroacetic acid# Dibromoacetic acid*# Trichloroacetic acid*# Bromodichloroacetic acid Dibromochloroacetic acid Tribromoacetic acid
categories. Furthermore, additional non-regulated and previously unobserved DBPs were identified in the Nationwide Occurrence Survey using broadscreen gas chromatography/ mass spectroscopy and novel analytical methods. The ICR database and Nationwide DBP Occurrence Survey represent substantial advances in our knowledge of DBP occurrence and speciation in U.S. drinking waters. The ICR database remains the most comprehensive effort to characterize the occurrence of DBPs in drinking waters, especially the bromo-chloro haloacetic acids, the haloacetonitriles, and the other non-regulated DBPs listed above. The Nationwide Occurrence Survey, while examining a non-representative subset of plants with high total organic carbon (TOC) and bromide levels, is one of the most extensive efforts to identify the suite of disinfection byproducts to which populations may potentially be exposed. Taken together, these data yield unprecedented information concerning the distribution of regulated and non-regulated DBPs occurring in U.S. drinking waters. An initial analysis was undertaken to determine whether a multivariate model could be fit to the data in the Nationwide Occurrence Survey to permit extrapolation from the ICR database to the set of chemicals listed in Table 1; however, difficulties with convergence precluded further pursuit of this model. The principal objectives of the present study were to: 1) Demonstrate a Bayesian method for fitting multivariate distributions to disinfection byproduct occurrence data that include censored and missing observations; and 2) Use these models to elucidate common compositional and speciation patterns across and within disinfection byproduct classes, comparing these inferences to those obtained in previous studies using less rigorous statistical methods (Obolensky and Singer, 2005, 2008). These results are then used to explore uncertainties introduced when using regulated DBPs as surrogates for risks due to non-regulated DBPs, including comparison with similar analyses made by Obolensky and Singer (2005). This study focuses on speciation patterns in the THM4, HAA, and dihaloacetonitrile (DHAN) classes. In this study, the HAA class is treated in its subclasses (e.g., trihaloacetic acids (THAAs), dihaloacetic acids (DHAAs), and monohaloacetic acids (MHAA)) since evidence suggests that these classes may be formed from different precursors through different reaction pathways (Xie, 2003). This study, consistent with previous work on halogen substitution patterns in ICR data (Obolensky and Singer, 2005), considers these DBP classes because they are the ICR classes that contain information for each brominesubstituted species in the subclass, and does not consider the MHAA class since consistently low concentrations make estimation and analysis difficult. Our focus is on bromine substitution in DBPs. It has been documented that brominated and iodinated DBPs may be more genotoxic and cytotoxic than fully chlorinated analogues (Muellner et al., 2007; Plewa et al., 2004). In addition, patterns in bromine substitution across DBP classes provide important information for understanding DBP occurrence within and between regulated and non-regulated DBP classes. Because of redundant information encoded in the bromine incorporation patterns among DBP classes (Obolensky and Singer, 2005), this study presumes that differential censoring
Author's personal copy water research 43 (2009) 3453–3468
may be satisfactorily overcome by adopting a Bayesian imputation procedure, implemented through Markov Chain Monte Carlo (MCMC) sampling and parameter estimation. In order to maintain interpretability of the results with respect to previous studies, we also restrict our analysis to distribution system samples from ICR plants using only free chlorine as the primary and residual disinfectant.
2.
Theory and methods
2.1.
Methodology/approach
A multivariate Box–Cox normal distribution is fit to measured concentrations of disinfection byproduct species in drinking water from the ICR database. Additional analyses are performed to reflect the relationships among species with concentrations that sum to yield a total class concentration. An imputation method is used for censored and missing data in the ICR Database. A Bayesian MCMC method is used to estimate concentration distribution parameters for major DBP classes, including THM4, THAA, DHAA, and DHAN. The mean and standard deviation of the marginal distribution of the 14 DBPs in these classes and the correlation structure among these 14 DBPs are estimated. Spearman rank correlations of individual DBP species concentrations with class concentrations, and correlations among class-based bromine fractions are computed at each step of the Gibbs’ sampler. In addition, the model is evaluated by simulation from its posterior predictive distribution, and scatterplots of bromine fractions in each class are from MCMC output are presented.
2.2.
Data retrieval and preparation
The ICR database is extensively described elsewhere (McGuire et al., 2002; Obolensky et al., 2007). The dataset includes individual DBP species in the THM4, THAA, DHAA, and DHAN class in mg/L. The ICR database is distributed as a Microsoft Access database to encourage data analysis and further study of the microbial and disinfection byproducts data therein. This analysis includes data from 235 ICR plants, including 67 plants measuring all nine HAAs. The data for the 14 species are stored in an R data frame, and converted to matrix form as the operations demand. Each row of data contains species measurements from one distribution system sample. These are also referred to as records. Each column of the data consists of measurements for a particular DBP species. In the ICR database, several non-regulated DBP species (e.g., nonregulated HAAs) were included, but optionally reported by many ICR plants. Despite the fact that reporting of several non-regulated DBP species was not required, the compilation of these data, especially the non-regulated HAAs, represents a larger storehouse of information concerning these chemicals than any other survey to date. Our dataset includes 1166 observations with all THAAs and DHAAs observed from 67 ICR plants measuring all nine HAAs. For all species in the analysis, measurements are only referred to as ‘‘censored’’ if the appropriate code for ‘below minimum reporting level’ was indicated. Measurements with a ‘‘NULL’’ value are considered ‘‘missing.’’ The data used in this analysis are extracted from
3455
the publicly available ICR database following the approach of Obolensky and Singer (2005). This analysis is restricted to distribution system samples from plants using either surface or groundwater as source waters, and using free chlorine as the only disinfectant for pre-oxidation, primary, and residual disinfection to avoid complicating interpretation of the results with effects from alternative disinfectants. This analysis includes observations in which at least one class of chemicals studied (THM, THAA, DHAA, DHAN) were completely reported (censored or fully observed). This analysis excludes records with null values for source water category, either treatment plant or distribution system disinfectant category, and mixed/purchased water indicators. The dataset for this study includes 6823 total records and is summarized in Table 2.
2.3.
Analysis of left-censored data
Data censoring occurs when observations are known to lie in an unobserved region of the sample space. Four types of censoring are commonly considered: left-censoring, interval censoring, right-censoring, and differential censoring. We are concerned with left- and differential censoring. Left-censoring occurs when data are not observed, but known to occur below a specified value. Left-censoring is a common problem in environmental data where analytical methods have a minimum detection limit or reporting level. If a measurement is below this minimum reporting level, the measurement is left-censored. Differential censoring originates when observations in the same dataset corresponding to the same chemical species (or other analyte) are reported by different analysts with different minimum reporting levels; or when measurements in the same sample corresponding to different species have different minimum reporting levels, complicating the interpretation of multivariate analyses (e.g., multiple linear regression) of the data. Several options are available for analyzing left-censored data, including substitution of an arbitrary data point (e.g., minimum reporting level (MRL)/2 or MRL/square-root of 2), maximum likelihood techniques, or non-parametric ranking techniques (Helsel, 1990, 2005; Helsel and Cohn, 1988). Many of these methods can lead to biased parameter estimates, with the direction and magnitude of bias being unknown. Obolensky and Singer (2005, 2008) consider the influence of differential censoring using three substitution methods in which missing observations are omitted and censored observations are replaced with zero, the detection limit, or a function of an analyte’s detection limit and the value of other analytes in the same observation. While this is likely to bound the uncertainty due to censoring on inferences drawn from this database, improved parameter estimates are possible using more sophisticated data substitution methods. Furthermore, as described below, the multiple imputation estimation procedure uses halogen substitution patterns encoded in DBP data to address problems posed by differential censoring. In this study we compare parameter estimates obtained using three alternative methods for censored data handling, and compare inferences that may be made concerning bromine incorporation in the ICR database using multiple imputation of missing and left-censored measurements.
Author's personal copy 3456
water research 43 (2009) 3453–3468
Table 2 – Summary statistics for 6823 ICR data records utilized in analysis. MRL [ minimum reporting level. No. Reported includes measurements above and below the MRL, while No. Missing includes measurements with a NULL value in the ICR database. % Below MRL is the number of measurements below the MRL divided by the number of measurements reported. % Missing is the number of measurements missing divided by the sum of the number missing plus the number reported. Class
Species
THM
CHCL3 BDCM DBCM CHBR3
THAA
MRL (mg/L)
MRL (mmol/L)
No. reported
No. below MRL
No. missing
Fraction reported below MRL
Total fraction below MRL
Fraction missing
1 1 1 1
0.008377 0.006104 0.004825 0.003957
6073 6162 6157 6151
425 542 1998 4838
750 661 666 672
0.070 0.088 0.325 0.787
0.062 0.079 0.293 0.709
0.110 0.097 0.098 0.098
TCAA BDCAA DBCAA TBAA
1 1 2 4
0.00612 0.004811 0.007927 0.01348
6069 1736 1588 1262
902 251 1194 1243
754 5087 5235 5561
0.149 0.145 0.752 0.985
0.132 0.037 0.175 0.182
0.111 0.746 0.767 0.815
DHAA
DCAA BCAA DBAA
1 1 1
0.007755 0.005767 0.00459
6111 6099 6127
630 1548 4510
712 724 696
0.103 0.254 0.736
0.092 0.227 0.661
0.104 0.106 0.102
DHAN
DCAN BCAN DBAN
0.5 0.5 0.5
0.004548 0.003239 0.002593
5719 5753 5678
1048 1984 3243
1104 1070 1145
0.183 0.345 0.571
0.154 0.291 0.475
0.162 0.157 0.168
In this study left-censored and missing observations are treated using a data augmentation approach executed within a Markov Chain Monte Carlo (MCMC) Gibbs’ sampling algorithm to estimate the parameters of the Box–Cox multivariate normal model for species concentrations. This simultaneously allows for parameterization of the marginal distributions of individual DBP species and the correlations between them. This approach is adapted from data augmentation and complete-data methods described in Gelman et al. (2004), Lockwood et al. (2001), Box and Tiao (1973), and Little and Rubin (1987).
2.4.
Box–Cox transformation
It is common practice to transform observed data, in this case x [¼ species concentration (in mg/L)], to an approximately normal distribution, through a Box–Cox transformation (Box and Cox, 1964). The Box–Cox distribution is defined as follows. Suppose y has a normal distribution where y is some transformation of the data, x:
yðxjlÞ ¼
xl 1 l
for l > 0 logðxÞ for l ¼ 0
(1)
The Box–Cox distribution on x is: fx ðxÞ ¼
2 d½yðxjlÞ 1 l : 1=2 exp 2 x mx 2sx dx 2ps2x 1
(2)
For computational purposes, we make two simplifying assumptions. First, we assume that each marginal distribution must belong to a joint distribution with a single value of the parameter, l. Second, following Shumway et al. (1989), we consider only transformations l˛f0; 12; 1g since concentration data must be positive. In most studies, environmental data are assumed log-normally distributed (l ¼ 0). In preliminary evaluation of the log transformation by regression on order statistics, we find that the DBP marginal distributions significantly deviate from lognormality. Fig. 1 illustrates this approach using bromodichloroacetic acid as an example. While the distribution of observations above the minimum reporting level is not well estimated in the upper range by the log transformation, the
Fig. 1 – Evaluation of transformation by regression on order statistics using bromodichloroacetic acid as an example. Blue points indicate observed data, red points indicate estimated distribution using regression on order statistics. Left panel, l [ 1/2, right panel, l [ 0. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article).
Author's personal copy 3457
water research 43 (2009) 3453–3468
power transformation with l ¼ 0.5 improves the normality of the transformed variables. Therefore, in this study the DBP data are transformed using the square root transformation (l ¼ 0.5). This assumption is more rigorously evaluated below using the estimated average discrepancy to compare models with different values of l (Gelman et al., 2004).
2.5.
2.7.
Bayesian parameter estimation
The model parameters are estimated using a Gibbs’ MCMC to sample from the posterior distribution of the mean vector (e.g., the vector of the transformed mean concentration for each of the 14 DBPs studied), m, and the covariance matrix (e.g., the covariance among the transformed concentrations for each of the 14 DBPs studied), S. We assume, for simplicity, that the covariance structure of the dataset is not influenced by distribution system residence time. This process involves two steps at each iteration of the MCMC algorithm: conditional imputation of the missing and left-censored data; and drawing a sample of (transformed DBP joint distribution parameters) Sjy and mjS,y from their marginal and conditional posterior distributions, respectively.
2.6. Conditional imputation of missing and left-censored data Because it is computationally simpler to sample from the posterior distribution of a complete dataset, a data augmentation step is employed at each Monte Carlo state. Data augmentation is similar to the popular techniques of multiple imputation (Hopke et al., 2001) and application of the EM algorithm (Dempster et al., 1977). Data augmentation through imputation of missing or censored data to permit completedata analysis techniques increases variability in the parameter estimates through the variability in the imputations. To account for this variability, multiple imputation techniques involve proposing multiple plausible values of the missing or censored data (Hopke et al., 2001). The method used in this study assumes the variability in the posterior parameter distributions contributed by imputation of missing and censored data are accounted for through Gibbs’ sampling, and by monitoring convergence of the posterior distributions on the parameters. Because our model considers correlation among all chemical measurements, the individual missing and censored values for a sample are potentially correlated with observed values for that sample and with each other. Let Ycom, Yobs, Ymis, and Ycens be the complete, observed, missing, and censored data. In essence, this study specifies the joint posterior distribution for the Yobs, Ymis, and Ycens as in Hopke et al. (2001): pðYmis ;Ycens jYobs Þ¼ R
in Hopke et al. (2001) and Tanner and Wong (1987): Ymis and Ycens are imputed with a draw from pðYmis ;Ycens jYobs ;m ;S Þ where m* and S* are the current parameter estimates; then m and S are drawn from their posterior distribution, pðm;SjYobs ;Ymis ;Ycens ÞfpðYcom jm;SÞpðm;SÞ.
R
pðYobs ;Ymis ;Ycens jm;SÞdmdS : pðYobs ;Ymis ;Ycens jm;SÞpðm;SÞdmdSdYmis dYcens (3)
In this model, the data are multivariate Box–Cox normal, while the parameters m and S are assigned a non-informative prior distribution. The likelihood and prior distribution are discussed below. Draws from the joint posterior distribution are simulated according to the procedure described
Prior and posterior distributions for Sjy and mjS,y
The mean vector and covariance matrix are then updated using the complete imputed data based on the conjugate multivariate normal family of distributions. Because of the large number of ICR observations, we use the multivariate non-informative Jeffrey’s prior with dimension, d: dþ1 2
pðm; SÞfjSj
(4)
Under the multivariate normal model, the joint posterior distribution (given n observations) is, up to a proportionality constant: "
n T nX yi m S1 yi m 2 i¼1 nþdþ1 1 ¼ jSj 2 exp tr S1 S 2 dþ1
n
#
pðm; SjyÞfjSj 2 jSj2 exp
(5)
The form of the marginal posterior distribution on the covariance matrix is recognized as an inverse-Wishart distribution up to a constant of proportionality: SjywInv-Wishartnn ðSÞ
(6)
where S is the sum of squares matrix about the sample mean: S¼
n X
T yi y yi y
(7)
i¼1
and the posterior distribution on the mean vector, conditional on the complete dataset and the covariance matrix is a multivariate normal distribution: . mjS; ywN y; S n :
(8)
This approach is presented in greater detail, including details of conditional imputation and MCMC convergence monitoring, in appendix.
2.8.
Model comparison and assessment
For model comparison, we employ the estimated average discrepancy metric, discussed in Gelman et al. (2004). We assume the untransformed data are not normally distributed; therefore, we compute the estimated average discrepancy only for the transformations l˛f0; 12g. To compute the estimated average discrepancy, an important distinction between ‘‘observed’’ data in the case of fitting the model (e.g., ‘‘observed’’ measurements are measurements above the detection limit), and ‘‘observed’’ data in the case of computing the model likelihood. The ‘‘observed’’ data when computing the model likelihood include missing and censored (e.g., below detection limit) observations since the likelihood of both above and below detection limit measurements may be computed. In contrast, ‘‘missing’’ data in both model fitting
Author's personal copy 3458
water research 43 (2009) 3453–3468
and computing the model likelihood only applies to measurements with a null value reported in the ICR database. Thus, we compute the estimated average discrepancy based on the joint likelihood of each non-missing (e.g., censored and observed) measurement in the dataset. Let y be each measurement for chemical, i, and sample, j, and c be the censoring point (e.g., minimum reporting level). Suppose censoring is indicated by an inclusion model, Ic, and that the model for missing measurements does not contribute to the likelihood of observing the dataset under the transformation assumptions. The likelihood of observing the dataset is (9)
pðyjm; SÞ ¼ pðyjm; S; Ic Þ
8
> > < Pr Iij ¼ 1yij ¼ 0 if yij cj
Ic ¼ > > : Pr Iij ¼ 1yij ¼ 1 if yij > cj The model for the dataset likelihood follows " # N Y
Y Y
p yij jq; Iij ¼ 1 p yij jq; Iij ¼ 0 i¼1
j˛jobs
j˛jcens
i¼1
j˛jobs
j˛jcens
" # N Y Y
Y p yij jq; Iij ¼ 1 F cj jq; Iij ¼ 0 ¼
(10)
where q are the marginal mj and s2 calculated using the properties of the multivariate normal distribution discussed above. The estimated average discrepancy estimates the posterior expectation of the deviance between the model and the data. Equation (11) gives the deviance of the data given a model structure. Dðy; qÞ ¼ 2 log½pðyjqÞ
(11)
The average discrepancy between observed data and an assumed model over L simulations is b avg ðyÞ ¼ 1 D L
L X D y; ql
Z q
Results and discussion
Gelman et al. (2004) advise that posterior inferences based on a chosen model be summarized and evaluated graphically. To evaluate the Box–Cox transformed normal model we have employed, we graphically compare draws from the marginal posterior predictive distribution of equation (13) to the distributions of the observed data. The marginal mean and variance estimates for the Box–Cox multivariate normal model are presented in Table 3a, back-converted to the original scale (mmol/L). For all species, the parameter estimates display a high level of precision due to the large dataset employed. Commensurate with nationwide DBP surveys, the chemical with the largest mean is chloroform, followed by dichloroacetic acid. Although unregulated, bromodichloroacetic acid, dibromochloroacetic acid, bromochloroacetic acid all may be found at comparable levels to regulated DBPs such as the bromine-substituted THMs. The dihaloacetonitriles are found at levels comparable to bromine-substituted trihalomethanes, dihaloacetic acids, and bromodichloroacetic acid. Across classes, the mean concentration decreases as the number of bromine-substituted atoms increases. The estimates for species variances appear to reflect the relative magnitude (on the mmol/L scale) of the concentrations, and do not seem to be
(12)
l¼1
As the number of simulations approaches infinity, the model with the lowest expected deviance may be expected to have the highest posterior probability. The estimated average discrepancy for the two model alternatives indicate that the square root transformation yields a model with a much less negative average discrepancy compared to the lognormal model. The square root model is thus used in all subsequent analyses. For model assessment, we compare inferences based on samples from the posterior predictive distribution to observed ICR data. The posterior predictive distribution, where q ¼ ðm; SÞ, follows: y~jy ¼
3.
3.1. Model evaluation: comparison of marginal and observed empirical distributions
where the inclusion model, Ic
pðyjm; SÞ ¼
distribution, the predictive distribution of new observations may be estimated conditional on the posterior distributions of the mean vector and covariance matrix. To estimate the posterior predictive distribution of new observations, we take 6823 draws from the posterior predictive distribution, sampling m and S from the MCMC output for each draw. These draws are then converted back to the original micromoles per liter scale to illustrate speciation patterns and other relationships.
Z
Z
p y~; qjy dq ¼ p y~jq; y pðqjyÞdq ¼ p y~jq pðqjyÞdq q
Table 3a – Credible intervals for mean (m) and standard deviation (s) parameter estimates for marginal concentration distributions in mmol/L. Class
Species
m
m (95% CI)
s
s (95% CI)
THM
CHCL3 BDCM DBCM CHBR3
0.2174 0.0532 0.0189 0.0047
(0.2130,0.2217) (0.0523,0.0541) (0.0185,0.0194) (0.0046,0.0049)
0.1824 0.0414 0.0197 0.0055
(0.178,0.186) (0.0407,0.0423) (0.0193,0.0202) (0.0053,0.0056)
THAA
TCAA BDCAA DBCAA TBAA
0.0750 0.0178 0.0057 0.0056
(0.0734,0.0763) (0.0174,0.0182) (0.0055,0.0059) (0.0050,0.0067)
0.0671 0.0138 0.0044 0.0031
(0.0656,0.0686) (0.0134,0.0141) (0.0043,0.0045) (0.0028,0.0034)
DHAA
DCAA BCAA DBAA
0.0950 0.0169 0.0047
(0.0931,0.0970) (0.0166,0.0173) (0.0046,0.0048)
0.0810 0.0144 0.0048
(0.0792,0.0823) (0.0141,0.0147) (0.0047,0.0049)
DHAN
DCAN BCAN DBAN
0.0197 0.0071 0.0036
(0.0193,0.0201) (0.0070,0.0073) (0.0035,0.0037)
0.0164 0.0063 0.0035
(0.0160,0.0167) (0.0061,0.0064) (0.0034,0.0036)
q
(13) Observations drawn from this posterior predictive distribution are used to illustrate speciation patterns reproduced by this model. Notice that after factoring the posterior predictive
Author's personal copy 3459
water research 43 (2009) 3453–3468
Table 3b – Comparison of mean and standard deviation among data replacement methods. Median values shown for Gibbs’ sampler, while maximum likelihood methods are shown for below detection limit observation replacement at detection limit (DL), half detection limit (HDL), and zero. Missing observations ignored for maximum likelihood methods. DL
HDL
ZERO
Gibbs’
Mean
SD
Mean
SD
Mean
SD
Mean
SD
CHCL3 BDCM DBCM CHBR3
0.2104 0.0534 0.0200 0.0065
0.1641 0.0401 0.0190 0.0047
0.2094 0.0530 0.0193 0.0049
0.1650 0.0408 0.0197 0.0047
0.2074 0.0521 0.0185 0.0033
0.1678 0.0429 0.0221 0.0045
0.2174 0.0532 0.0189 0.0047
0.1824 0.0414 0.0197 0.0055
TCAA BDCAA DBCAA TBAA
0.0710 0.0184 0.0095 0.0136
0.0545 0.0134 0.0030 0.0004
0.0703 0.0181 0.0064 0.0069
0.0563 0.0142 0.0038 0.0009
0.0693 0.0178 0.0032 0.0002
0.0613 0.0166 0.0044 0.0003
0.0750 0.0178 0.0057 0.0056
0.0671 0.0138 0.0044 0.0031
DCAA BCAA DBAA
0.0880 0.0174 0.0066
0.0611 0.0125 0.0039
0.0873 0.0167 0.0049
0.0628 0.0136 0.0042
0.0862 0.0159 0.0034
0.0676 0.0168 0.0047
0.0950 0.0169 0.0047
0.0810 0.0144 0.0048
DCAN BCAN DBAN
0.0188 0.0075 0.0043
0.0109 0.0047 0.0030
0.0183 0.0069 0.0036
0.0120 0.0052 0.0031
0.0177 0.0061 0.0028
0.0151 0.0067 0.0037
0.0197 0.0071 0.0036
0.0164 0.0063 0.0035
Fig. 2 – Empirical CDF for THM4 from transformed THM observations (green) compared with marginal CDF simulated (6823 replicates) for each species using joint posterior predictive DBP distribution (red). Number of iterations [ 1500 after burn in of 1000 iterations. Potential scale reduction factors: CHCL3 (mean: 1.007, var: 1.009), BDCM (mean: 1.007, var: 1.037), DBCM (mean: 1.020, var: 1.046), CHBR3 (mean: 1.033, var: 1.017). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article).
Author's personal copy 3460
water research 43 (2009) 3453–3468
sensitive to the rate of missing or censored measurements in each marginal distribution. The variances are largest for the fully chlorinated species, which are most prevalent in the DBP subcompositions (e.g., classes), while the variances for the fully brominated species are the smallest, despite the most censoring and missing data among these species. In Table 3b, the parameter estimates from the Box–Cox model are compared to maximum likelihood estimates obtained using alternative censored data handling methods: replacement at one-half the minimum reporting level, the reporting level, and zero. In these alternative approaches, missing measurements are not considered. For DBPs with low rates of missing and below detection limit measurements (CHCL3, BDCM, TCAA, and DCAA), the three maximum likelihood estimates under-estimate the mean and variance of the marginal distributions, with respect to the estimates obtained using the imputation method. For DBPs with modest rates of missing and below detection limit measurements (DBCM, BCAA, and DCAN), estimates obtained using replacement at the detection limit over-estimate the mean and under-estimate the variance, while estimates obtained using replacement at zero under-estimate the mean and over-estimate or approach the variance relative to the estimates obtained using the imputation method. For these DBPs, the imputation method seems to yield results most closely approximated by replacement at one-half the detection limit. For DBPs with high rates of missing and below detection limit measurements (CHBR3, DBCAA, TBAA), the alternative maximum likelihood approaches yield more highly varying estimates. In each case, the variance is under-estimated, relative to the imputation method, while the mean is very sensitive to the alternative replacement method used. Because the posterior distributions of the marginal distribution parameters for each DBP studied converges to the target distribution (the convergence criterion is presented below) with the exception of TBAA, we believe the imputation results more closely approximate the ‘‘true’’ marginal distributions for the mean and variance of the 14 DBPs studied in this analysis. Fig. 2 shows the empirical cumulative distribution function (CDF) of the reported ICR data compared to the CDF of the square root Box–Cox model fitted using the Gibbs’ Markov Chain Monte Carlo (MCMC) algorithm for the trihalomethane (THM) class as an example. In Fig. 2, each empirical distribution in the THM class seems to be well-approximated by the posterior predictive marginal distribution. Because of the exceptionally high levels of left-censoring for bromoform (>80%) the marginal distribution is estimated almost completely on the basis of observations from other classes of chemicals, and other chemicals in the THM class. While the fit seems a little less good for dibromochloromethane (DBCM), the marginal distributions of chloroform and bromodichloromethane (BDCM) appear to be better represented. Corroborating the appearance that these distributions adequately predict the data, the potential scale reduction factors (PSRF) for these compounds indicate that the model converges to a stable joint distribution. Table 4 reports the potential scale reduction factors for the mean and variance of each species included in the analysis. The fitted species marginal distributions exhibit close approximations to their corresponding empirical CDFs and the PSRFs are all close to
Table 4 – Potential scale reduction factors (PSRF) for mean and variance, of Box–Cox marginal distributions of each DBP in the analysis. Five concurrent chains of 2500 iterations, each with a burn-in period of 1000 iterations, were used to compute the PSRF and estimate the posterior distribution for the mean vector and covariance matrix. PSRF, m
PSRF, s2
CHCL3 BDCM DBCM CHBR3
1.007 1.007 1.020 1.033
1.009 1.037 1.046 1.017
THAA
TCAA BDCAA DBCAA TBAA
1.019 1.030 1.149 7.577
1.008 1.029 1.114 2.114
DHAA
DCAA BCAA DBAA
1.027 1.005 1.015
1.029 1.023 1.008
DHAN
DCAN BCAN DBAN
1.013 1.006 1.037
1.007 1.026 1.007
Class
Species
THM
1.00 (e.g., 0.7), light green indicates moderate correlation (0.5 < rS < 0.7), light blue indicates moderate low correlation (0.2 < rS < 0.5), and blue indicates low to negative correlation (rS < 0.2). 95% Credible intervals in italics and parentheses.
THM CHCL3
BDCM
0.742
0.687
0.493
(0.734,0.750)
(0.682,0.693)
(0.486,0.501)
0.760 0.237 -0.118 (-0.134,-0.098)
TCAA
0.708 (0.703,0.712)
BDCAA
0.565 (0.552,0.577)
DBCAA
0.152 (0.106,0.197)
TBAA
-0.041 (-0.139,0.143)
DCAA
0.671 (0.665,0.675)
BCAA
0.507 (0.501,0.514)
DBAA
0.100 (0.080,0.120)
DCAN
0.620 (0.614,0.628)
BCAN
0.414 (0.404,0.424)
DBAN
DHAN
0.939
(0.230,0.244) CHBR3
DHAA
(0.938,0.940) (0.757,0.763) DBCM
THAA
0.080 (0.066,0.093)
0.560
0.557
(0.552,0.567)
(0.551,0.562)
0.033
0.087
(0.022,0.043)
(0.079,0.094)
-0.215
-0.182
(-0.235,-0.197) 0.957
0.782
(0.954.0.959)
(0.779,0.786)
0.743
0.629
(0.732,0.753)
(0.614,0.642)
0.193
0.140
(0.147,0.241)
(0.088,0.198)
0.005
0.016
(-0.229,0.200) 0.807
(0.970,0.971)
0.521
0.682
(0.514,0.529)
(0.678,0.686)
-0.0005
0.135
(-0.022,0.019) 0.780
(0.115,0.152) 0.695
(-0.021,0.790) 0.386
(0.688,0.701) 0.413
(0.370,0.404)
(0.402,0.423)
-0.006
0.027
(-0.021,0.009)
a. High concentrations of DBPs form under conditions amenable to fully chlorinated compounds, including excess chlorine or high total organic carbon (TOC) concentrations; b. Waters with higher concentrations of bromine might tend to have lower TOC concentrations, with a corresponding lower supply of organic compounds that form the basis for DBP formation (Richardson, 2003; Sohn et al., 2006); or,
(-0.170,0.166) 0.970
(0.802,0.812)
with class concentration (Spearman rS ¼ 0.760 and 0.743, respectively); and the more brominated species have a low or negative correlation with class concentration (DBCM, DBCAA, bromoform 0.118 < rS < 0.237, TBAA rS ¼ 0.005). This pattern is consistent across all DBP classes analyzed. Specific explanations regarding chemical occurrence or reaction could account for all or part of this trend:
(-0.203,-0.162)
(0.011,0.041)
0.698 (0.692,0.704) 0.390 (0.381,0.397) 0.005 (-0.016,0.022) 0.567 (0.560,0.574) 0.774 (0.761,0.786) 0.489 (0.440,0.533) 0.232 (-0.061,0.407) 0.562 (0.554,0.569) 0.675 (0.668,0.680) 0.291 (0.274,0.308) 0.875 (0.873,0.878) 0.788 (0.782,0.793) 0.384 (0.371,0.397)
c. Hypobromous acid may ‘‘outcompete’’ hypochlorous acid in halogen substitution reactions where levels of organic DBP precursors are low due to higher relative activity of hypobromous acid in these reactions relative to hypochlorous acid (Chang et al., 2001; Cowman and Singer, 1996). Further study is needed to isolate the role of these or other processes in producing the consistent species–class correlation structures that are reflected in the fitted model. In addition, the results presented may be influenced by treatment strategies employed by plants at the time of the ICR database promulgation. Specifically, plants concerned with high TOC levels, high bromide levels, or a combination of both, may have employed enhanced coagulation and other advanced treatment strategies (Obolensky et al., 2007). As a result, the
Author's personal copy 3463
water research 43 (2009) 3453–3468
Table 6 – MCMC Bromine Incorporation Fraction (BIF) results. ‘‘MCMC Estimates’’ refers to value obtained for each sample averaged over the entire Markov Chain. Summary statistics for ‘‘Standard deviation in MCMC estimates’’ shown in italics. Standard deviation obtained for each sample over entire Markov Chain. The lower panel reports quantile statistics for the BIF estimates. Summary statistics for MCMC bromine incorporation fraction THM–BIF
THAA–BIF
DHAA–BIF
DHAN–BIF
0.006 0.107 0.886 0.170
0.025 0.161 0.788 0.237
0.007 0.105 0.803 0.171
0.010 0.217 0.875 0.270
Standard deviation in MCMC estimates 5th %-ile 0 Median 0.004 95th %-ile 0.109
0 0.029 0.217
0 0.010 0.235
0 0.027 0.235
MCMC estimates Min Median Max Average
Summary statistics for MCMC bromine incorporation fraction ratios
5th %-ile 10th %-ile Median 90th %-ile 95th %-ile
THM:THAA
THM:DHAA
THM:DHAN
THAA:DHAA
THAA:DHAN
DHAA:DHAN
0.457 0.548 0.913 1.358 1.535
0.530 0.638 1.004 1.541 1.855
0.194 0.274 0.585 0.960 1.097
0.889 0.971 1.450 2.839 3.586
0.402 0.497 0.830 1.401 1.609
0.191 0.254 0.585 0.940 1.080
explanations listed above may also be artifacts of shifts in treatment strategies based on the TTHM rule and anticipation of the Stage 1 D/DBP Rule (McGuire, 2006). Furthermore, for plants with high bromide levels in the source water, class sums may be highly correlated with mixed bromo-chloro species concentrations. Thus, while the class–sum correlation results apply to the U.S. nationwide distribution, individual plants may have strikingly different correlations between individual species and class sums. Despite these concerns, the inferences that may be drawn from the present study are generally representative of U.S. plants using only free chlorine as a primary or residual disinfectant.
3.3.
Bromine incorporation
The extent of bromine substitution in a mixture of DBPs may be characterized by the bromine fraction (Obolensky and Singer, 2005). The bromine fraction describes the ratio of the moles of bromine atoms substituted into a class of DBPs to the total moles of halogen atoms in a class of DBPs. The bromine fraction, computed from the individual species molar concentrations, for each class in this study follows:
BIFTHM ¼
0½CHCL3 þ 1½BDCM þ 2½DBCM þ 3½CHBR3 3ð½CHCL3 þ ½BDCM þ ½DBCM þ ½CHBR3 Þ
BIFTHAA ¼ BIFDHAA
0½TCAA þ 1½BDCAA þ 2½DBCAA þ 3½TBAA 3ð½TCAA þ ½BDCAA þ ½DBCAA þ ½TBAAÞ
0½DCAA þ 1½BCAA þ 2½DBAA ¼ 2ð½DCAA þ ½BCAA þ ½DBAAÞ
BIFDHAN ¼
(14)
0½DCAN þ 1½BCAN þ 2½DBAN 2ð½DCAN þ ½BCAN þ ½DBANÞ
Our evaluation of the patterns of bromine incorporation in the ICR database are based on the expected values of the MCMC output for the bromine fraction of each of the 6823 samples. To do this, the BIF is computed for each observation at each iteration of the Markov Chain, then the averages and variances of each class BIF computed for each sample is calculated. Summary statistics of the averages and standard deviations of BIF MCMC output are reported in Table 6. Variance is equal to zero when all measurements in a subclass are completely observed (e.g., >MRL), while the variance increases in observations across classes as the number of measurements in an observation that are left-censored or missing increases. From
Table 7 – BIF correlation shown for the four DBP classes in this analysis. Spearman rank correlation coefficients shown with 95% credible intervals in parentheses. N [ 6823. THM–BIF THM-BIF THAA-BIF DHAA-BIF DHAN-BIF
1.000 0.827 0.850 0.738
N/A (0.807,851) (0.842,0.857) (0.727,0.748)
THAA–BIF
DHAA–BIF
1.000 N/A 0.809 (0.788,0.828) 0.711 (0.666,0.754)
1.000 N/A 0.715 (0.704,0.725)
DHAN–BIF
1.000 N/A
Author's personal copy 3464
water research 43 (2009) 3453–3468
our summary statistics, we see that the variances in the BIF estimates for each sample are low. Thus, we assume that the BIF estimates averaged over the Gibbs’ sampler are representative of what would have been computed had all measurements in each sample been fully observed. These results indicate that, despite uncertainties in the distributions of individual chemicals, especially TBAA, the Gibbs’ sampler may be used to make inferences about the patterns of bromine incorporation in the four DBP subclasses included in this study. Table 7 presents the inter-class Spearman rank correlation between class-based BIFs computed at each iteration of the Gibbs’ sampler. Each class-based BIF is strongly correlated with each other, with the strongest correlation between the THM and THAA classes (rS ¼ 0.827) and the least strong correlation between the THAA and DHAN classes (rS ¼ 0.711). This cross-class correlation was studied by Obolensky and
Singer (2005), though the Spearman rank correlation coefficients in Table 6 are less strong than those found by Obolensky and Singer for each pair of classes. Further consideration of our results indicate that differential censoring has substantial effects on the inferences that may be made concerning bromine incorporation, and that more efforts at collecting occurrence information for brominesubstituted DBPs, especially the THAAs, may be justified. From Table 6, we see that our median values for the ratio of BIF–THM to BIF–THAA are quite different from the median values found by Obolensky and Singer (2005) using their ‘‘zero’’ and ‘‘level’’ data handling approaches for censored observations. This is a result of the increased variability associated with imputation of DBPs below the reporting level. The impact of censoring and missing measurements is far less pronounced in the distributions for the ratios among BIF–THM, BIF–DHAA, and BIF–DHAN,
Fig. 4 – Class-based BIF scatterplots from MCMC output averaged over 1500 post-burn-in iterations (N [ 6823). 1:1 line in red for reference. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article).
Author's personal copy
Table 8 – Posterior estimate of the correlation matrix between ICR DBP species in this analysis. 95% credible intervals are enclosed in parentheses. Strong positive correlation (0.5 < r < 1.0) indicated by no highlighting. Yellow highlighting indicates negative or no correlation, while blue highlighting indicates moderate positive correlation (r < 0.5). These results show that fully chlorinated species generally are negatively correlated with species containing at least two bromine atoms across all classes. Fully chlorinated species are generally either uncorrelated or marginally correlated with intermediate species containing only one bromine atom.
water research 43 (2009) 3453–3468
3465
Author's personal copy 3466
water research 43 (2009) 3453–3468
with median ratios for each corresponding pair close to those found by Obolensky and Singer (2005). For each ratio reported in their analysis, our imputation method widens the 10th–90th interquantile ranges. The widest interquartile range is found for the THAAs, due to imputation of over 99% of the TBAA measurements, 76% of the BDCAA measurements, and 93% of the DBCAA measurements. Our conclusions based on the imputation method confirm Obolensky and Singer’s (2005) conclusions about bromine incorporation in the DHAN, DHAA, and THM classes, and are illustrated in Fig. 4 and Table 6. First, the imputation results suggest that bromine incorporation in the DHAN class is elevated relative to the DHAA, THAA, and THM classes. Furthermore, the imputation results suggest the DHAA class has lower levels of bromine incorporation than the other classes studied, and the THMs have lower levels of bromine incorporation than each class except the DHAAs. Despite these similarities, our results diverge concerning the THAA class. We find that the THAA class has higher levels of bromine incorporation than the THMs or DHAAs. Consequently, the differences between these two studies, especially with regard to the THAA class, indicate that the rates of missing and left-censored measurements can significantly influence the inferences that may be made concerning class-based BIF correlation.
3.4.
that consider the diversity in DBP mixture composition should be a high priority for the US EPA and other regulatory agencies.
Acknowledgments This work was conducted under support from a National Science Foundation Graduate Research Fellowship, Carnegie Mellon WaterQUEST, the H. John Heinz III Chair in Environmental Engineering, and the Paul and Norene Christiano Faculty Fellowship in Civil and Environmental Engineering.
Appendix. Imputation and convergence monitoring procedures Imputation of Ymis and Ycens employs the properties of the multivariate normal distribution (Morrison, 1976) as discussed in Gelman (2004) Chapter 3, and Box and Tiao (1973) Chapter 2. Suppose X : Np ðm; SÞ, where X is a dataset consisting of p variables and m observations, and X is p-dimensional multivariate normally distributed with p-length mean vector, m, and p p positive-definite covariance matrix, S. If X is partitioned into S12 X m S X Þ. If m and S are partitioned as ½ 1 , then ½ 1 : Np ð½ 1 ; ½ 11 X2 X2 m2 S21 S22 follows;
Relationships between individual DBPs m¼
The correlation between individual DBP species (inter- and intra-class) for the Box–Cox transformed multivariate normal model is shown in Table 8. In Table 8, yellow highlighting is used to show significant negative correlation, while blue highlighting is used to show moderate positive to non-significant correlation. In conjunction with recent ICR data analyses (Obolensky and Singer, 2005, 2008), Table 8 shows that fully chlorinated species are negatively correlated with species containing at least two bromine atoms across all classes; these results also demonstrate marginal to no correlation between fully chlorinated species and intermediate bromo-chloro species containing at least one bromine atom.
4. Conclusions and policy-focused research implications A Box–Cox transformed (l ¼ 1/2) multivariate normal model is shown to provide an effective approximation of the joint distribution of DBP species across four major classes. These model results indicate that DBP mixture compositions are quite different at different levels of total DBP concentration and these concentration-based differences are consistent across the classes studied herein (e.g., THM4, THAA, DHAA, and DHAN). Our model results have particular relevance for evaluating regulatory approaches to DBP control. Class-sum maximum contaminant levels (MCLs) using mass-based concentrations may be inappropriate for assessing protection of public health (Singer, 2006). This implies that alternative MCLs for individual DBPs, as opposed to class-sum MCLs, will likely be necessary, particularly as more information is obtained on the toxicity of individual species. Finally, alternative regulatory approaches
S¼
m1
with sizes
m2 S11
S12
S21
S22
q1
ðN qÞ1 qq qðN qÞ with sizes ; ðN qÞq ðN qÞðN qÞ (A1)
then the distribution of x1 conditional on x2 ¼ a is multivariate normal, ðX1 jX2 ¼ aÞ : Nðm; SÞ; where: m ¼ m1 þ S12 S1 22 ða m2 Þ
(A2a)
and covariance matrix: S ¼ S11 S12 S1 22 S21
(A2b)
Recall that the data being imputed are the transformation of 14 chemicals (columns) by 6823 distribution system samples (rows), e.g., 14 6823. Then, corresponding to each mean vector, m, would be a transposed data vector (e.g., one row in X ). Thus, the algorithm would step through the data by rows, imputing missing and censored data at each cell. The computational approach described herein is similar to that used in Lockwood et al. (2004) and Lockwood and Schervish (2003). Let m* and S* be the mean and covariance at the current Markov state, and C be the censoring point. Then the data will be imputed using these distribution functions. If the data are missing: FðY1 Þ ¼ FðY1 jY2 ; m; S; m ; S Þ
(A3)
If the data are censored: FðY1 Þ ¼
FðY1 jY2 ; m; S; m ; S Þ where Y1 C FðCjY2 ; m; S; m ; S Þ
(A4)
In these expressions, Fð,Þ denotes the normal cumulative distribution function (CDF), and Fð,Þ denotes the
Author's personal copy water research 43 (2009) 3453–3468
cumulative distribution function over the region where y may occur.
Prior and posterior distributions for Sjy and mjS,y
pðm; SÞfjSj
(A5)
Because the model uses imputation to approach posterior computation with complete-data techniques, the likelihood function is: nþdþ1 2
pðyjm; SÞfjSj
1 exp tr S1 S0 2
(A6)
where S0 is the sum of squares matrix about the sample mean: S0 ¼
n X
T yi y yi y
(A7)
i¼1
Under the multivariate normal model, the posterior distribution on the covariance matrix is recognized as an InverseWishart distribution up to a constant of proportionality (with nn ¼ n d þ 1 degrees of freedom), thus: (A8)
SjywInv-Wishartnn ðSÞ:
The posterior distribution on the mean vector is recognized as a multivariate normal distribution up to a constant of proportionality: . mjS; ywN y; S n :
(A9)
Monitoring convergence of each parameter We monitor parameter convergence by examining a vector of potential scale reduction factors for the mean vector (Brooks and Gelman, 1998; Gelman, 2004), and a multivariate potential scale reduction factor for the covariance matrix (Brooks and Gelman, 1998). For each mean estimate, j, in the mean vector we label the draws from each Gibbs’ sample as jij forði ¼ 1; .; n; j ¼ 1; .; mÞ where i denotes the number of Gibbs’ sampler iterations, and m is the number of chains. We then estimate between-chain variation, B, and within chain variation, W. 2 Pm
j¼1 j,j j,, n m 1X 1X where j,j ¼ j and j,, ¼ j n i¼1 ij m j¼1 ,j 1 B ¼ m1
(A10)
In the expression for B, j,j and j,, may be thought of as the chain average and average of chain averages, respectively. W ¼ m1
Pm
j¼1
s2j
1 where s2j ¼ n1
Pn
i¼1
jij j,j
2
n1 1 Wþ B n n
(A12)
The potential scale reduction is:
The mean vector and covariance matrix are updated using the complete imputed data based on the conjugate multivariate normal family of distributions. Because of the large number of ICR observations, it is appropriate to use a non-informative prior distribution so that posterior parameter estimates may be informed completely by the data. The model employs the multivariate non-informative Jeffrey’s prior: dþ1 2
b s 2þ ðjjyÞ ¼
3467
(A11)
The marginal posterior variance of j is estimated by a weighted sum of W and B.
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi s 2þ ðjjyÞ b¼ b R W
(A13)
For posterior estimates of the covariance matrix, we use the multi-dimensional analogue of the univariate potential scale reduction factor approach (Brooks and Gelman, 1998). b ¼ ðn 1Þ W þ B V n n
(A14)
b is the posterior variance covariance matrix Here, V analogue of b s 2 , and W and B are the within and between chain covariance matrix estimates, respectively. To convergence is b ¼ VW b 1 is sufficiently close to the p-dimenreached when R sional identity matrix. Let vffiffiffiffiffiffiffiffiffiffiffiffiffi uP 2 b u R i;j u b 1 ¼ t i;j R p
(A15)
b i;j denotes the i,jth element of R. b R b 1 is the multivariate where R potential scale reduction factor, and convergence is reached b 1 is sufficiently close to 1. when R
Software The data are prepared for analysis by extracting from Microsoft Access tables to Microsoft Excel workbooks. We then process these using Microsoft Excel to prepare. csv data files for the observations and minimum reporting levels. All computational and graphical data analysis are performed in the R programming environment (Team, 2008), with the exception of regression on order statistics, performed in Microsoft Excel. Code was developed on both Mac OSX and Windows XP, and executed on the Windows XP GUI interface for R using the packages MCMCpack (Martin et al., 2008), mvtnorm (Genz et al., 2008), msm (Jackson, 2009), and bayesm (Rossi and McCulloch, 2008).
references
Bove, F., Shim, Y., Zeitz, P., 2002. Drinking water contaminants and adverse pregnancy outcomes: a review. Environmental Health Perspectives 110 (Suppl. 1), 61–74. Box, G.E.P., Cox, D.R., 1964. An analysis of transformations. Journal of the Royal Statistical Society, Series B 26 (2), 211–252. Box, G.E.P., Tiao, G.C., 1973. Bayesian Inference in Statistical Analysis. Addison-Wesley Publishing Company, Reading, Massachusetts. Brass, H.J., Feige, M.A., Halloran, T., Mello, J.W., Munch, D., Thomas, R.F., 1977. Drinking water quality enhancement through source protection. In: Pojasek, R.B. (Ed.). Lewis Publishers, Chelsea, MI. Brooks, S.P., Gelman, A., 1998. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphic Statistics 7 (4), 434–455. Chang, E.E., Lin, Y.P., Chiang, P.C., 2001. Effects of bromide on the formation of THMs and HAAs. Chemosphere 43 (8), 1029–1034.
Author's personal copy 3468
water research 43 (2009) 3453–3468
Cowman, G.A., Singer, P.C., 1996. Effect of bromide ion on haloacetic acid speciation resulting from chlorination and chloramination of aquatic humic substances. Environmental Science and Technology 30 (1), 16–24. Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1), 1–38. EPA, U.S., 1996. National Primary Drinking Water Regulations; Monitoring Requirements for Public Drinking Water Supplies; Final Rule, pp. 24353–24388, Federal Register. Gelman, A., 2004. Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, Fla. Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B., 2004. Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, FL. Genz, A., Bretz, F., Hothorn, T., Miwa, T., Mi, X., Leisch, F., Scheipl, F., 2008. mvtnorm: Multivariate Normal and t Distributions. Graves, C.G., Matanoski, G.M., Tardiff, R.G., 2001. Weight of evidence for an association between adverse reproductive and developmental effects and exposure to disinfection byproducts: a critical review. Regulatory Toxicology and Pharmacology 34 (2), 103–124. Helsel, D.R., 1990. Less than obvious: statistical treatment of data below the detection limit. Environmental Science and Technology 24 (12), 1766–1774. Helsel, D.R., 2005. More than obvious: better methods for interpreting nondetect data. Environmental Science and Technology 39 (20), 419A–423A. Helsel, D.R., Cohn, T.A., 1988. Estimation of descriptive statistics for multiply censored water quality data. Water Resources Research 24, 1997–2004. Hopke, P.K., Liu, C., Rubin, D.B., 2001. Multiple imputation for multivariate data with missing and below-threshold measurements: time-series concentrations of pollutants in the Arctic. Biometrics 57, 22–33. Jackson, C., 2009. The msm Package. Krasner, S.W., Weinberg, H., Richardson, S., Pastor, S.J., Chinn, R., Sclimenti, M.J., Onstad, G.D., Thruston Jr., Alfred D., 2006. Occurrence of a new generation of disinfection byproducts. Environmental Science and Technology 40 (23), 7175–7185. Little, R.J.A., Rubin, D.B., 1987. Statistical Analysis with Missing Data. John Wiley and Sons, New York. Lockwood, J.R., Schervish, M.J., 2003. MCMC Strategies for Computing Bayesian Predictive Densities for Censored Multivariate Data. Carnegie Mellon University, Pittsburgh, PA. Lockwood, J.R., Schervish, M.J., Gurian, P., Small, M.J., 2001. Characterization of arsenic occurrence in source waters of U. S. community water systems. Journal of the American Statistical Association 96 (456), 1184–1193. Lockwood, J.R., Schervish, M.J., Gurian, P.L., Small, M.J., 2004. Analysis of contaminant co-occurrence in community water systems. Journal of the American Statistical Association 99 (465), 45–56. Martin, A.D., Quinn, K.M., Park, J.-H., 2008. MCMCpack: Markov Chain Monte Carlo (MCMC) Package. http://mcmcpack.wustl.edu. McGuire, M.J., 2006. Eight revolutions in the history of US drinking water disinfection. Journal AWWA 98 (3), 123–149. McGuire, M.J., McLain, J.L., Obolensky, A., 2002. Information Collection Rule Data Analysis. AWWA Research Foundation, American Water Works Association, Denver, CO. McGuire, M.J., Meadow, R.G., 1988. AWWARF trihalomethane survey. Journal AWWA 80 (1), 61–68. Morrison, D.F., 1976. Multivariate Statistical Methods. McGrawHill Book Company, New York. Muellner, M.G., Wagner, E.D., McCalla, K., Richardson, S., Woo, Y.-T., Plewa, M.J., 2007. Haloacetonitriles vs. regulated haloacetic acids: are nitrogen-containing DBPs more toxic? Environmental Science and Technology 41 (2), 645–651. Munch, D.J., Feige, M.A., Brass, H.J., 1977. The Analysis of Purgeable Compounds in the National Organics Monitoring
Survey by Gas Chromatography/Mass Spectrometry. AWWA, Denver, CO. Nieuwenhuijsen, M.J., 2005. Adverse reproductive health effects of exposure to chlorination disinfection byproducts. Global NEST Journal 7 (1), 128–144. Nieuwenhuijsen, M.J., Toledano, M.B., Eaton, N.E., Fawell, J., Elliott, P., 2000. Chlorination disinfection byproducts in water and their association with adverse reproductive outcomes: a review. Occupational and Environmental Medicine 57, 73–85. Obolensky, A., Singer, P.C., 2005. Halogen substitution patterns among disinfection byproducts in the information collection rule database. Environmental Science and Technology 39 (8), 2719–2730. Obolensky, A., Singer, P.C., 2008. Development and interpretation of disinfection byproduct formation models using the information collection rule database. Environmental Science and Technology 42 (15), 5654–5660. Obolensky, A., Singer, P.C., Shukairy, H.M., 2007. Information collection rule data evaluation and analysis to support impacts on disinfection by-product formation. ASCE Journal of Environmental Engineering 133 (1), 53–63. Plewa, M.J., Wagner, E.D., Richardson, S.D., Thruston Jr., A.D., Woo, Y.T., McKague, A.B., 2004. Chemical and biological characterization of newly discovered iodoacid drinking water disinfection byproducts. Environmental Science and Technology 38 (18), 4713–4722. Richardson, S., 2003. Disinfection byproducts and other emerging contaminants in drinking water. Trends in Analytical Chemistry 22 (10), 666–684. Rossi, P., McCulloch, R., 2008. bayesm: Bayesian Inference for Marketing/Micro-econometrics. http://faculty.chicagogsb.edu/ peter.rossi/research/bsm.html. Shumway, R.H., Azari, A.S., Johnson, P., 1989. Estimating mean concentrations under transformation for environmental data with detection limits. Technometrics 31 (3), 347–356. Singer, P.C., 2006. DBPs in drinking water: additional scientific and policy considerations for public health protection. Journal AWWA 98 (10), 73–80. Sohn, J., Amy, G., Yoon, Y., 2006. Bromide ion incorporation into brominated disinfection byproducts. Water, Air, and Soil Pollution 174, 265–277. Symons, J.M., Bellar, T.A., Carswell, J.K., DeMarco, J., Kropp, K.L., Robeck, G.G., Seeger, D.R., Slocum, C.J., Smith, B.L., Stevens, A. A., 1975. National organics reconnaissance survey for halogenated organics. Journal AWWA 67 (11), 634–647. Tanner, M.A., Wong, W.-H., 1987. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82 (398), 528–540. Tardiff, R.G., Carson, M.L., Ginevan, M.E., 2006. Updated weight of evidence for an association between adverse reproductive and developmental effects and exposure to disinfection by-products. Regulatory Toxicology and Pharmacology 45 (2), 185–205. Team, R.D.C., 2008. R: a Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Villanueva, C.M., Cantor, K.P., Cordier, S.M., Jaakkola, J.K., King, W.D., Lynch, C.F., Porru, S., Kogevinas, M., 2004. Disinfection byproducts and bladder cancer: a pooled analysis. Epidemiology 15 (3), 357–367. Weinberg, H., Krasner, S., Richardson, S., Thruston Jr., A.D., 2002. The Occurrence of Disinfection By-products (DBPs) of Health Concern in Drinking Water: Results of a Nationwide DBP Occurrence Survey. National Exposure Research Laboratory Office of Research and Development, U.S. Environmental Protection Agency, Athens, GA. Xie, Y.F., 2003. Disinfection Byproducts in Drinking Water: Formation, Analysis, and Control. CRC Press, Washington, D.C.