J Geogr Syst (2012) 14:265–282 DOI 10.1007/s10109-010-0145-1 ORIGINAL ARTICLE
Imputing censored data with desirable spatial covariance function properties using simulated annealing L. Sedda • P. M. Atkinson • E. Barca • G. Passarella
Received: 11 June 2010 / Accepted: 25 November 2010 / Published online: 12 December 2010 Ó Springer-Verlag 2010
Abstract When measurements of values that are less than the limit of detection are reported as not detected, the data are referred to as censored. The non-recording of values below the limit of detection is common in soil science research although modelling data affected by censoring can be problematic. This paper develops and tests a modified version of Spatial Simulated Annealing, called Simulated Annealing by Variogram and Histogram form, for drawing values for censored points given a mixed set of observed and censored data. The algorithm aims to maximise the goodness of fitting between the experimental and theoretical variograms (by allowing variation in its parameters) while the imputed values are constrained to a target histogram form. In practice, the experimental histogram is estimated by transforming the available data (interval and exact observations) to quantiles and fitting a plausible distribution. The theoretical distribution of the data is used to constrain the variogram fitting. The proposed simulated annealing method is designed to find the optimal spatial arrangement of values, given by the lowest errors in variogram and histogram fitting and kriging prediction. The accuracy of the method proposed is assessed on a simulated data set in which the censored point values are known and compared with the Spatial Simulated Annealing algorithm. According to the results obtained, the Simulated Annealing by Variogram and Electronic supplementary material The online version of this article (doi:10.1007/s10109-010-0145-1) contains supplementary material, which is available to authorized users. L. Sedda (&) Spatial Ecology and Epidemiology Group, University of Oxford, South Park Road, Oxford OX1 3PS, UK e-mail:
[email protected] P. M. Atkinson School of Geography, University of Southampton, Highfield, Southampton S017 1BJ, UK E. Barca G. Passarella IRSA-CNR, National Research Council, via F. De Blasio 5, 70123 Bari, Italy
123
266
L. Sedda et al.
Histogram form (SAVH) approach can be recommended as a useful tool for the analysis of spatially distributed data with censoring. Keywords Detection limit Annealing simulation Variogram and histogram fitting Cross-validation Kriging JEL Classification
C21 C24 C61 Q19
1 Introduction In certain circumstances, the attribute of interest may not be observed fully for some of the required sampling locations, for example, due to the limitations of the measuring instrument. In particular, when the values to be observed are below the detection limit of the measuring device, the resulting values are referred to as a ‘‘censored’’ or ‘‘interval’’ observations (or soft data). Censored data can be expressed either as an interval or single value. They are distinguished from ‘‘exact observations’’ (or hard data) that are observed above the limit of detection (De Oliveira 2005). The given definition of a censored observation makes censored data distinct from a missing observation where the information is unknown. Censored observations convey information regarding their value and the general distribution of the whole data set (Orton and Lark 2007). The major difficulties related to censored data are that they are not missing in a random pattern, but are missing at one end of the distribution (left and/or right censored) and, in some cases, the censored observations can have a similar range as the variance of the data set (De Oliveira 2005; Helsel 2005). Much research has been undertaken in recent decades with the aim of producing reasonable values for censored points, according to some properties of the variable of interest (Porter et al. 1988). Methods of creating complete data by ‘‘filling in’’ censored observations can be classified into single-imputation methods and multiple-imputation methods. Single imputation involves filling in one value for each censored location (Caudill 1996; Gibbons 1995; Odell et al. 1992), often independently from the distribution of the complete sample (censored and noncensored). Multiple imputation replaces each censored value with two or more plausible values from the joint distribution of possible values, often in a Bayesian approach (Hopke et al. 2001; Rubin 1987). Recently, maximum likelihood estimation (ML) and the expectation–maximisation algorithm (EM) were proposed, although sometimes the direct evaluation of the likelihood for correlated data with censored values can involve computationally intractable integrals (Abrahamsen and Benth 2001; Svensson et al. 2006). Knowledge of the form of the spatial covariance function for the variable of interest permits constraint of the censored points to values that depend on the spatial configuration of the sample locations as well as the underlying distribution of the values. Also, in a regression context, it avoids the potential bias that may arise if considering the error term as being independent and identically distributed (Agarwal and Sharma 2003; Bang and Tsiatis 2002; Tsiatis 1990). Advanced spatial
123
Imputing censored data with desirable spatial covariance
267
frameworks for the univariate or multivariate context are proposed by Stein (1992), Knotters et al. (1995), Militino and Ugarte (1999), Abrahamsen and Benth (2001), Lark and Papritz (2003), De Oliveira (2005), Fridley and Dixon (2007), Orton and Lark (2007) and Goovaerts (2009). Some of these methods use the Bayesian statistical framework. An example of a non-Bayesian approach is given by Abrahamsen and Benth (2001). They proposed the use of kriging (a linear technique for spatial prediction) with inequality constraints preceded by Gaussian anamorphosis modelling (Rivoirard 1994) of the raw variable. The pre-transformation of the variable demands a final back transformation, often a complex technique that introduces more imprecision into the final predictions (Leuangthong and Deutsch 2003). Elements of these issues are taken into account by Goovaerts (2009) who proposed the use of indicator kriging with multiple thresholds for the values below the detection limit. The method is applied for continuous data with a cumulative histogram and requires the modelling of a number of variograms equal to the number of thresholds applied. A Bayesian approach can be found in De Oliveira (2005). De Oliveira (2005) used Gaussian Random Fields to predict the depths of geologic horizon with censored data that were described spatially by the random field model. This framework was implemented via a data augmentation method that can also be found in other papers (Fridley and Dixon 2007; Pardo-Igu´zquiza 1998). Data augmentation consists of regarding latent and missing data as parameters to estimate (Meng and Van Dyk 1999). Although this introduces many more parameters, the conditional distributions become much easier to sample from simplifying the design of the algorithm. The principal drawback is slow convergence due to the large correlation between model parameters and latent data. In the presence of large numbers of data, the computing time can be large and it can be difficult to determine when the chain has reached convergence (Liu 2001) and to invert the covariance matrix (De Oliveira 2005). This paper develops a method of imputation in the context of a known covariance function (or variogram; hereafter, the text is standardised to variogram) with unknown parameters (e.g. sill, nugget, range), a known univariate distribution with unknown parameters, where one tail of the distribution is censored, and in the absence of covariates. For the unknown parameters, the true fixed values are unknown, but can be estimated from initial (starting) values. The main assumption is that the imputed values at the censored locations are arranged spatially such as to produce the smallest variogram fitting error (considering the censored and exact observations), smallest histogram fitting error and smallest kriging errors. The algorithm, created in the R software environment,1 is an extension, for censored data with unknown variogram and histogram parameters, of the Spatial Simulated Annealing (SSA) of Deutsch and Journel (1998) called Simulated Annealing by Variogram and Histogram form (SAVH). SAVH is based on a plausible theoretical variogram function and distribution obtained from the raw data, and it does not constrain the final variogram or distribution to a specific set of pre-defined variogram parameters (range, nugget or sill) (as in Deutsch and Journel 1
The R project. www.r-project.org.
123
268
L. Sedda et al.
1998; Deutsch and Wen 1998). The final result is the imputation of values to the censored points that represent an important source of information especially when the censored interval contains values that are the level of a defined risk. The method aims to increase the speed of computing without sacrificing simplicity and efficiency.
2 Methods SAVH is a geostatistical program designed to reduce the uncertainty in the prediction of values at censored locations, when applied to incomplete data sets (the R code is available from the corresponding author, while some of the implementation details are given in the Supplementary Information). The information required a priori to run SAVH is the histogram form and the variogram form. The term ‘form’ refers to the general mathematical function describing the variogram and the histogram, without any specification of its parameters. For example, a possible variogram form is exponential with unknown values of sill, nugget and range; while for the histogram, a possible form is Poisson with unknown value of mean. 2.1 SAVH general characteristics 2.1.1 Histogram The number of censored points and the censored interval are important sources of information about the unobserved part of the true histogram. The histogram may be discretised such as to incorporate directly the information given by the censored observations. Specifically, an approximate estimate of the true histogram can be obtained by computing the histogram with bin widths equal to the length of the censored interval, G (Gilliom and Helsel 1984; Saito and Goovaerts 2000; Huzurbazar 2005). In this paper, the form of the distribution is determined in this way, but then the parameters estimated are discarded and the form of the distribution only is used as a constraint in the subsequent imputation process. 2.1.2 Variogram Depending on various parameters, such as the censored interval and the proportion of censored points to the total, it may be expected that the variogram total sill variance is likely to increase when the censored points are added because the censored points lie outside the range of the exact observations. No further information is available. However, by drawing a random value for each censored point, it is possible to simulate an experimental variogram for the complete data set. This process can be repeated to produce a large number of experimental variograms, which leads to an envelope of random draws within which the true variogram is expected to lie. The realisations can be constrained to a specific variogram form, from which a narrower envelope is obtained. It is within this envelope that the true variogram (and that estimated by SAVH) is expected to lie. The variogram is thus
123
Imputing censored data with desirable spatial covariance
269
considered as a variable in the annealing process. At each iteration, the experimental variogram is calculated from the new imputed values and the theoretical variogram parameters (sill, range and nugget) are fitted to the new experimental variogram (by nlm, see below). While the new fitted variogram parameters may change, the initial values from which they are calculated are held constant. 2.1.3 Predictions A problem with ordinary or simple kriging is that it is known to produce conditional bias. That is, it is biased when predicting extremes: under-predicting highs and overpredicting lows. This means that kriging is not well suited to imputation of censored values distributed at one tail. In particular, imputation under the constraint of minimum prediction variance will tend to predict overly (biased) large values for the censored locations at the lower tail of the distribution. SAVH reduces the biases introduced by ordinary kriging by conditioning the predictions to the histogram form and allowing the variogram to vary from that of the observed data, constrained only to a selected variogram form. Another possibility is the use of disjunctive kriging which, through indicator coding of the data set, constrains the predictions to lie within the range of the sample data. The difficulties related to the orthogonal transformation (often based on Hermite polynomials) of the data make this option unsuitable for optimisation algorithms. 2.1.4 Combination The set of three constraints given above are represented as a single objective function, E: E ¼ minðw0 WLS þ w1 D þ w2 H Þ
ð1Þ
where D is the departure (in absolute terms) of the Mean Squared Deviation Ratio error (MSDR, measured using the cross-validation method via ordinary kriging, see below) from a value of one (Kerry and Oliver 2007c): D ¼ absð1 MSDRÞ
ð2Þ
and WLS is the weighted least square residual, as defined by McBratney and Webster (1986), between the experimental and theoretical variograms. WLS with respect to ordinary least squares (OLS) gives more weight at the shorter lags (improving the fitting at shorter distances), which is desirable when there is large variation between the paired comparisons and when the variogram is used in a kriging context (Oliver 2010). The equations for MSDR and WLS are not recalled because they are common measures in geostatistical research. Finally, H is the AIC of the fitted distribution with respect to the target histogram form, and w0, w1 and w2, are user-defined rescaling coefficients for WLS, D and H respectively. A common way to choose the rescaling coefficients is to assign values that produce comparable variation in WLS, D and H (i.e. around one unit—in the present case by setting w0 = 0.001, w1 = 1 and w2 = 0.01). This is only possible after running the algorithm and recording the range of variation for WLS, D and H.
123
270
L. Sedda et al.
Table 1 Notation
Symbol
Description
Value
n
Number of censored points
62
G
Range value for the censored points (censored interval)
[0, 5]
T
Temperature during SA
i
Actual number of perturbations so far in SA
c
Number of perturbations before decreasing T
E
Value of objective function
D
Ei - Ei?1
Unit
g kg-1
The minimum of E can be solved approximately by Simulated Annealing (SA), a well established and documented procedure (Alkhamis and Ahmed 2004; Christakos and Killam 1993; Dueck and Scheuer 1990; Gelman et al. 1995; Rajasekaran 2000; Triki et al. 2005) for minimisation/maximisation of a complex objective function (the term ‘complex’ is used to express objective functions with more than one variable), and used in the past for switching points procedures (i.e. Macmillan 2001). 2.2 SAVH algorithm flow (following the notation in Table 1) 2.2.1 Input Censored locations; censored interval G; exact observations; variogram form; and histogram form. 2.2.2 Output Values at the censored locations; optimised variogram-related parameters; histogram, variogram and prediction statistics. Step i
Create the target histogram and variogram forms (supervised).
i1 Estimate the histogram using bin widths equal to G. i2 Fit the histogram with an a priori distribution form. i3 Estimate the experimental variogram of the observed data only. i4 Fit theoretical variogram models and select a suitable general form. Step ii
Create the initial state.
ii1 Draw n values, vi, from a uniform distribution with range equal to G. ii2 Allocate the vi values randomly at the n censored locations. ii3 Merge the n censored observations (new values) with the exact observations. ii4 Calculate the objective function E (see step iv).
123
Imputing censored data with desirable spatial covariance
271
Step iii Exchange the value of one point, belonging to vi, chosen randomly, with a value sampled from a uniform distribution with range equal to G. Step iv
Calculate the objective function E.
iv1 Estimate the experimental variogram and then fit with the pre-defined variogram form. New parameters for the theoretical variogram are obtained (through the non-linear minimisation function), and a WLS value between the experimental and theoretical variograms is produced. iv2 Apply cross-validation via ordinary kriging to estimate the MSDR. iv3 Fit the pre-defined distribution to the histogram by General Additive Model for Location, Shape and Scale (GAMLSS) to calculate H. Step v
Accept or reject the new state.
v1 If the new value of E is smaller than the previous one, accept the change. Else go to step v2 v2 Calculate the Boltzmann probability, exp(D/T). If it is greater than a random number generated from the uniform distribution between [0,1], accept the new state. Otherwise reject it. Step vi Check if the system is frozen. If the system is not frozen (see termination criterion), after c perturbations decrease the temperature according to the cooling schedule and go back to Step iii. 2.3 SAVH computational details 2.3.1 Step i Histogram and variogram forms To determine the histogram form, the histogram constructed with bin widths equal to the censored interval was fitted with different distribution families using generalised additive models for location, scale and shape (GAMLSS) (Rigby and Stasinopoulos 2005). The goodness of fit was investigated via three measures: global deviance and the Akaike (AIC) and Bayesian (BIC) information criteria. AIC and BIC, in particular, take into account the model log-likelihood and a complexity term. The latter provides the difference between the two criteria: in the AIC, it is a function of the number of parameters involved, while in the BIC it is a function of both the number of parameters and the sample size. The same model is applied for histogram fitting in SAVH step iv3. GAMLSS is a flexible model allowing several controls on the statistical distribution of the independent variable in the presence of large skewness and/or kurtosis (even without explanatory variables). This is possible by relaxing the hypothesis associated with the exponential family distribution of the response variable, for which a general distribution family with large skew and/or kurtosis is allowed. In practice, GAMLSS considers the distribution of the response variable defined by several parameters related (linearly and/or additively) to the shape and scale of the distribution and not only to the mean (as in generalised linear models).
123
272
L. Sedda et al.
The variogram form was chosen after analysis of the experimental variogram of the exact observations produced by residual maximum likelihood, which can provide a robust variogram in the presence of outliers and with a small data set (Lark 2000a; Kerry and Oliver 2007a, b, c; Webster and Oliver 2001). It was fitted using a non-linear minimisation function, nlm (Dennis and Schnabel 1983; Ribeiro and Diggle 2001), defining initial values for the range, partial sill and nugget, chosen according to the experimental variogram of the exact observations. 2.3.2 Step iii Value exchange The new value exchanged at step iii is drawn from a uniform distribution (and is not produced by swapping in values from other locations). This allows the relaxation of the histogram constraints and avoids the potential bias produced by the approximate estimation of the histogram form described in Sect. 2.1.1 above. 2.3.3 Step iv1 Variogram fitting and cross-validation Step iv1 of the SAVH routine was applied because the new experimental variogram (produced by changing the value at a censored location) is likely to lead to a new set of variogram model parameters. Similarly to step i, constant initial values of the variogram parameters are used in the nlm function at each iteration of the algorithm to produce a new theoretical variogram for each new experimental variogram. A further constraint prevents final variograms with a nugget term smaller than a pre-fixed noise value that, in this analysis, was chosen as the instrumental error (0.1 mg kg-1). The procedure of variogram model fitting, in the presence of distribution asymmetry and outliers in a small data set, required a log-normal transformation (Box and Cox 1964) to transform the original distribution to a Gaussian distribution (Gringarten and Deutsch 2001). Kerry and Oliver (2007a) suggest transforming the data for skewness less than -1 or greater than ?1. The log-normal transformation applied to produce the theoretical variogram requires that the values in G are also log-transformed. Hence, cross-validation (with ordinary kriging) was applied to the entire log-transformed data set (censored and exact observations) and using the i-th theoretical variogram. 2.3.4 Steps v and vi: Simulated annealing The rules applied in SA are now explained. Initially, the temperature, T, was fixed sufficiently high (the higher the temperature, the more likely an unfavourable perturbation will be accepted) to keep an acceptance rate greater than 0.5 (Corana et al. 1987; Kirkpatrick et al. 1983; Metropolis et al. 1953). This choice allows escape from local minima solutions. The parameter T is controlled by the cooling schedule (Aarts and Korst 1989; Bouktif et al. 2006; Kuo et al. 2001). In SAVH, the rule proposed by Bo¨lte and Thonemann (1996) and Ingber (1996) was implemented, which guarantees an acceptance rule with decreasing probability. Also, it respects the condition proposed by Geman and Geman (1984): T no larger than the ratio between the starting temperature and the log value of i. The termination criterion
123
Imputing censored data with desirable spatial covariance
273
defined by Kirkpatrick et al. (1983) and Deutsch and Journel (1998) was adopted: if the desired acceptance is not achieved after three successive temperature decreases, the system is considered ‘‘frozen’’ and the annealing stops. 2.4 Comparison with SSA on the simulated data set The accuracy of the SAVH method was evaluated using a simulated data set for which the values of the censored points were known. In this context, the methods proposed were compared with the classical SSA (Deutsch and Journel 1998). SSA takes into account the mean, the variance–covariance matrix and the histogram of the values. In SSA, there are various options for the objective function that can include a histogram, a variogram (or indicator variograms), and when covariates are present, a correlation coefficient and conditional distribution. Both SAVH and SSA use WLS in the optimisation function. The main differences between the two algorithms are that: (a)
In SAVH the variogram and its parameters are variable and not targeted to an a priori variogram (as in SSA); (b) In SAVH only the distributional form (histogram) is given. In the classical SSA a fully described histogram is required; (c) The MSDR is absent in SSA; SSA was applied twice: (1) with the variogram of the observed data and a fitted histogram using the censored interval and observed data (SSAob) (this allows constraint of the predictions to within the censored range) and (2) with the true variogram and true histogram (SSAvh) (note that this is not possible in practice, but is included for reference). The simulated data set was composed of 120 points randomly distributed spatially with values in the range 0–61 (in integers) produced from a stationary spatial Gaussian random field. To generate the values, 1,000 simulations of the random field were generated by conditional simulation with exponential variogram parameters: sill = 0.74 (g kg-1)-2, range = 0.98 (km) and nugget = 0.12 (g kg-1)-2. The conditional simulation (Chile`s and Delfiner 1999) is conditional to a specified mean and variance–covariance matrix and is based on kriging predictions on a random field. Initially, 13 points of the simulated data set were considered as censored and hence removed from the data in order to allow their imputation. To assess the sensitivity of the algorithms to the number of censored locations, the algorithms were also run also for 23 and 40 censored points.
3 Field data Data were collected during fieldwork in the years 2000 and 2001. The data belong to the Toxicological Database of the Apulia Soils project (TDB)2 that aimed to digitise and elaborate the main environmental characteristics of the Apulia Region (Fig. 1). 2
The Apulia Region. http://bdt.regione.puglia.it/home.html.
123
274
L. Sedda et al.
Fig. 1 Study area in southern Italy and categorised silt measures
From the TDB database, the soil attribute of total silt was considered which, because of its statistical properties, constitutes an interesting case to analyse. The importance of this variable is its relation with plant health and growth, agricultural management and the quality of groundwater (Van Breemen et al. 1983). The measure of total silt was obtained in the laboratory on soil samples following the Italian governmental policy on the official methods for soil chemical analysis (Ministero per le Politiche Agricole e Forestali 1999), and based on the same policy, the value is expressed in g kg-1 without decimals. It is a rounded continuous variable for which the original values are not available, and for this reason, it was treated as a discrete variable. The total silt is composed of calcium carbonate or calcite (CaCO3), magnesite (MgCO3) and sodium carbonate (Na2CO3), but is expressed simply as calcium carbonate because the latter is generally the prevalent component (Raimo and Napolitano 2003). Total silt was measured once at each location and was censored because some of the investigated locations were below the level of detection (the distribution is called left truncated), so that these values do not belong to the range of the exact observations. Total silt was measured at 149 locations in the Apulia Region, among which 62 were censored and 87 were exactly observed. The censored interval has a range between 0 and 5 g kg-1. As shown in Fig. 1, the locations are concentrated in the middle of the north and along the coast and administrative boundaries in the rest of the territory and always at lower altitudes (not shown) because the TDB database was constructed to help address some of the problems affecting agricultural areas. The minimum and maximum distances between points for the complete data set, the exact observations and the censored locations are almost the same. This type of spatial distribution helps in the choice of separation lag distance (Webster and Oliver 2001).
123
Imputing censored data with desirable spatial covariance
275
4 Results and discussion The results for the simulated data set are explored first to provide an understanding of the capabilities of the proposed algorithm. 4.1 Simulated data SSA provides an alternative to SAVH. SSA requires the target variogram and the target histogram, which is required to constrain the prediction to within the censored interval, but neither of these is available in the case of censored data. The only information known that can be given to SSA is the theoretical (or experimental) variogram of the observed data and the observed histogram. In fact, providing only the statistics of the observed data (variogram and histogram) often leads to predictions that are larger in value than the censored interval at the lower tail of the distribution (not shown, while the results of SSA using the censored locations with values equal to half of the detection limit are reported in the Supplementary Information). Table 2 shows a comparison between SAVH and SSA. When SSA uses the observed histogram and variogram (i.e. SSAob), it produces the largest error in terms of the number of correctly predicted unknown values (i.e. the observations considered as censored), the mean squared difference between the true values and the predicted values at the censored points and the deviance of the MSDR (for 13 and 23 censored locations) from the optimal value of one. The latter represents the overall algorithm performance by including the results of the predictions of the observed locations. With 13 and 23 censored locations to be predicted, SAVH predicts three times the number of correct values than SSAob. When the number of censored points is increased to 33% (40 from 120) of the total number of observations, SAVH is slightly more accurate than SSAob, in terms of MSE, but with predictions of similar accuracy at the observed locations (in fact, the MSDR values are identical). A comparison between the experimental variogram produced by SAVH and the true experimental variogram and its envelopes is reported in the Supplementary Information. When SSA is provided with the true histogram as well as the true variogram (a situation that is not possible in practice), SSAvh outperforms SAVH especially for 23 and 40 censored locations. Only when the proportion of censored locations compared to the total is lower than 10% do the two methods give similar results. If only the true variogram is provided (not shown in Table 2), the accuracy decreases with respect to SSAvh (for SSA with true variogram, the numbers of correctly predicted locations are 3, 9 and 12, and the MSEs are 0.85, 1.47 and 4.46, respectively, for 13, 23 and 40 censored locations), and for 40 censored locations, it equals SAVH. This confirms that the histogram is required to obtain a large number of correct predictions, especially when the number of censored locations is lower than 1/3; while with a larger number of censored locations, the histogram does not increase accuracy. In relation to the parameters of the original distribution (mean = 14.94 and variance = 47.56), neither algorithm (SAVH or SSAob) can reproduce the sample mean and variance when the number of censored locations is increased, although
123
276
L. Sedda et al.
Table 2 Number of points correctly predicted (npc), mean squared error of the prediction (MSE) and mean squared deviation ratio (MSDR) for three different censored point sample sizes (n) by the three methods for a simulated data set (SAVH, simulated annealing by variogram and histogram form; SSAob, spatial simulated annealing with variogram from the observed points and histogram from the complete data; SSAvh, spatial simulated annealing with true variogram and histogram) n
Range
SAVH
SSAvh
SSAob
npc
MSE
MSDR
npc
MSE
MSDR
npc
MSE
MSDR
13
0–4
9
0.50
0.98
3
1.12
0.78
10
0.02
1.00
23
0–6
15
1.25
0.87
6
3.53
0.77
19
0.12
0.98
40
0–9
12
4.46
0.77
10
8.21
0.77
33
1.00
0.92
Mean
Variance
Mean
Variance
Mean
Variance
Statistics of the final distribution 13
–
14.92
47.74
15.32
47.84
14.94
47.50
23
–
14.61
49.28
17.22
54.12
14.93
47.81
40
–
14.62
53.43
19.84
60.97
14.90
48.03
Computational characteristics for 40 censored locations UPT (min) NoP
99
30
38
53,213
12,021
26,564
The mean and variance values refer to the final distribution obtained by the imputing methods. Finally, the processing time (UPT) and the number of perturbations (NoP) required to reach the convergence are reported
SAVH is substantially more accurate than SSAob. Finally, the SAVH requires a longer computation time than SSA (Table 2). 4.2 The Apulia data 4.2.1 Obtaining the histogram form The exploratory data analysis for total silt considering the exact observations only and the histogram of the complete data set is summarised in Table 3. The histogram of the complete data set was estimated by transforming the raw data range (0–587) to bin widths of five units (0–590) in order to incorporate the censored values in one interval (the first one). The position of the mean and the values of skewness and kurtosis indicate a leptokurtic distribution with positive skew, especially for the discretised histogram. A common distribution choice to fit the discretised histogram with prevalence of values in the left tail is the Poisson (PO) distribution. In the present case, the Poisson inverse Gaussian (PIG) distribution (Holla 1966) was another alternative. The PIG distribution is generated as a Poisson distribution with a random mean parameter, which has an inverse Gaussian distribution. The PIG provided the best fit (Table 4). This was also confirmed by an analysis of the randomised quantile residuals where, for the PIG, the mean was close to zero, the variance was smaller and the asymmetry was almost zero.
123
Imputing censored data with desirable spatial covariance Table 3 Summary statistics for the exact observations and the transformed complete histogram from the Apulia total silt variable
Exact observations
Complete discretised histogram
Number of points
87
149
Range (g kg-1)
8–587
0–590
133.20
78.91
Raw mean (g kg-1) -1 2
Table 4 Results of fitting to the discretised histogram of total silt using a Poisson (PO) and a Poisson Inverse Gaussian (PIG) distributions
277
Variance ((g kg ) )
13,479
12,032
Standard deviation (g kg-1)
116
109
Skewness
1.40
1.88
Kurtosis
2.15
3.81
PO Intercept Global deviance
PIG 4.37
4.36
19,513.30
2,048.76
Akaike information criterion
19,513.30
2,052.76
Bayesian information criterion
19,513.30
2,058.77
Mean (g kg-1)
-4.21
-0.34
Variance ((g kg-1)2)
56.77
2.11
Coefficient of skewness
0.41
-0.05
Coefficient of kurtosis
1.47
1.28
Filliben correlation coefficient
0.89
0.91
Randomised quantile residuals
4.2.2 Obtaining the variogram form An experimental variogram was estimated, with 15 lags of 10 km each, for the 87 exact observations. The experimental variogram (not shown) did not exhibit anisotropy and so the analysis proceeded based on an isotropic random field. The model chosen to fit to the experimental variogram was exponential (variogram form). The values of 0.37 (g kg-1)-2 for the partial sill, 0.74 (g kg-1)-2 for the nugget and 12.16 km for the range were input as starting values for the optimisation in the SAVH algorithms. The variogram initial conditions and, in general, the SA parameters can affect the final SA results only if the global minimum is not reached. In fact, simulated annealing should find the same optimum with high probability repeatedly. To ensure this, the SA was tested several times (with different parameters for annealing temperature and variogram fitting optimisation). 4.2.3 SAVH results Table 5 shows the result of running SAVH. Three other results of variogram fitting are shown to facilitate interpretation: the variogram fitted to (1) the exact observations only, (2) the complete data set based on the allocation of the desired
123
278
L. Sedda et al.
Table 5 SAVH results and theoretical variograms for exact observations, complete data set with midpoint at censored locations and complete data set with random values at the censored locations Algorithm
Partial sill (g kg-1)2
Nugget (g kg-1)2
Nugget-to-sill ratio
WLS (g kg-1)2
MSDR
SE (g kg-1)
SAVH
2.00
1.32
0.39
1,528
0.98
-0.40
Random
2.01
0.99
0.33
1,111
–
2,146
Midpoint
3.07
0.41
0.11
Exact only
0.52
0.43
0.45
43.48
–
–
–
0.22
2.60
SQE (g kg-1)2 0.94 – – 13.85
For all the measures refer to the text
values (according to the fitted PIG distribution) to the censored points, but randomised spatially and (3) the complete data set, but with the censored interval mid-point allocated to the censored points. The partial sill, as expected, increases from the 0.52 (g kg-1)2 of the theoretical variogram of the exact observations to 2 (g kg-1)2 (Table 5). The total sill is more than three times the total sill of the exact observations. Given that the censored points are at the left tail of the distribution and with values very different from the mean of the exact observations, it is expected that the sill increases. SAVH reduces the nugget-to-sill ratio, with respect to the variogram of the exact observations, facilitating a reduction in the prediction error variance (Lark 2000b; Marcotte 1995). The variograms produced by attributing random draws and the midpoint to the censored locations produced an even smaller nugget-to-sill ratio. This arises because the variogram parameters are in these cases not constrained to the MSDR and AIC. Attributing the censored interval midpoint to the censored locations produced the largest partial sill because, according to the fitted PIG distribution, the censored interval contains more large values than small ones. The SAVH model produced a final theoretical variogram within the envelope obtained from a set of variogram models fitted to experimental variograms calculated from 100,000 sets of observed and censored data, in which the censored values were drawn randomly from a uniform distribution (Fig. 2). Finally, the envelopes were fitted using the same approach for variogram fitting described above (initial variogram parameter values used in the non-linear minimisation function). The final variogram produced by SAVH is not statistically different from a model fitted to a complete data set based on random draws, but is significantly different from the variogram of the exact observations only. Regarding the cross-validation results shown in Table 5 (last two columns), the standard error and the standard quadratic error are both much smaller in SAVHbased kriging relative to exact observations only. Hence, the SAVH method predicted values and a variogram that increase the general accuracy of the kriging algorithm. To provide a visual assessment of the differences revealed in Table 5, Fig. 3 shows the predictions obtained by ordinary kriging based on the theoretical variogram produced by SAVH (left-hand side) and based on the variogram of the
123
279
2 0
1
Semivariance
3
4
Imputing censored data with desirable spatial covariance
0
20000
40000
60000
80000
Distance Fig. 2 Theoretical variograms for the Apulia data: exact observations (circles), SAVH (triangles) and envelopes for the random simulation (solid lines)
Fig. 3 Kriging predictions using (1) the fitted theoretical variogram and estimated values at censored locations from SAVH (left-hand side) and (2) the fitted theoretical variogram of exact observations only and assuming censored values equal to the detection limit (5 g kg-1) (right-hand side)
exact observations only (right-hand side). In the former case, the censored observations were estimated by SAVH, while in the latter case the censored observations were set equal to the detection limit value. The kriging predictions obtained using the exact observations with censored data equal to the detection limit exhibit greater continuity with an over-prediction of the silt content in the centre and South of the Apulia region relative to the kriging
123
280
L. Sedda et al.
predictions based on SAVH. In the north, the extreme smoothing under-predicts (relatively) an area of high silt concentration. The SAVH-based kriging predicted greater variability across space due to the increase in the variance obtained including the overall range of the censored locations and permitted reproduction of the true high concentration of silt in the northern area (see Fig. 1 for comparison). An observable effect is the restriction spatially of the influence of high silt concentrations and hence the creation of a halo appearance around such locations.
5 Conclusion The variogram produced from observed (non-censored) locations can be misleading if applied to predict the censored locations using classical geostatistical approaches such as kriging and SSA. In this paper, an extension of the SSA algorithm was developed to impute values for censored locations while taking into account the general form of autocorrelation and distributional form implicit in the complete data set (i.e. observed and censored data). It was demonstrated that for a data set with less than 20% of censored locations, SAVH can estimate correctly 65-70% of the censored values. In comparison with SSA (based on the observed data only), the SAVH algorithm predicted more censored points correctly and resulted in a consistently smaller prediction error. The algorithm proposed is an attractive alternative to augmentation or geostatistical approaches for those researchers interested in imputing censored values while taking into account spatial dependence. The only information required by SAVH is the form of the histogram and the form of the variogram and starting values for all other parameters. Acknowledgments This study was supported by a fellowship from the Master and Back program financed by the Regional Sardinia Government, under agreement between the School of Geography, University of Southampton (UK) and the Dipartimento di Economia e Sistemi Arbori, University of Sassari (Italy). Thanks are due to the Apulia Regional Authority for Ecology and the Water Research Institute of the National Research Council for providing the data used in this study. Finally, the authors would like to acknowledge Dr. Edith Cheng at the University of Southampton, for inspiring the analysis and for helpful assistance.
References Aarts E, Korst J (1989) Simulated annealing and Boltzmann machines—a stochastic approach to combinatorial optimization and neural computing. Wiley & Sons, New York Abrahamsen P, Benth FE (2001) Kriging with inequality constraints. Math Geol 33(6):719–744 Agarwal R, Sharma M (2003) Parameter estimation for non-linear environmental models using belowdetection data. Ad Environ Res 7(2):249–261 Alkhamis TM, Ahmed MA (2004) Simulation-based optimization using simulated annealing with confidence interval. In: Proceedings 2004 Winter simulation conference, IEEE, Washington D.C., pp 514–519 Bang H, Tsiatis AA (2002) Median regression with censored cost data. Biometrics 58(3):643–649 Bo¨lte A, Thonemann UW (1996) Optimizing simulated annealing schedules with genetic programming. Eur J Oper Res 92(2):402–416
123
Imputing censored data with desirable spatial covariance
281
Bouktif S, Sahraoui H, Antoniol G (2006) Simulated annealing for improving software quality prediction. In: Proceedings genetic and evolutionary computation conference GECCO 06, ACM, New York, pp 1893–1991 Box GEP, Cox DR (1964) An analysis of transformations. J Roy Stat Soc B 26(2):211–252 Caudill SB (1996) Maximum likelihood estimation in a model with interval data: a comment and extension. J Appl Stat 23(1):97–104 Christakos G, Killam BR (1993) Sampling design for classifying contaminant level using annealing search algorithms. Water Resour Res 29(12):4063–4076 Corana A, Marchesi M, Martini C, Ridella S (1987) Minimizing multimodal functions of continuous variables with the ‘‘simulated annealing’’ algorithm. ACM T Math Software 13(3):262–280 De Oliveira V (2005) Bayesian inference and prediction of Gaussian random fields based on censored data. J Comput Graph Stat 14(1):95–115 Dennis JE, Schnabel RB (1983) Numerical methods for unconstrained optimization and nonlinear equations. Prentice-Hall, Englewood Cliffs Deutsch CV, Journel AG (1998) GSLIB Geostatistical Software Library and user’s guide, 2nd edn. Oxford University Press, New York Deutsch CV, Wen XH (1998) An improved perturbation mechanism for simulated annealing simulation. Math Geol 30(7):801–816 Dueck G, Scheuer T (1990) Threshold accepting: a general purpose optimization algorithm appearing superior to simulated annealing. J Comput Phys 90(1):161–175 Fridley BL, Dixon P (2007) Data augmentation for a Bayesian spatial model involving censored observations. Environmetrics 18(2):107–123 Gelman A, Roberts G, Gilks W (1995) Efficient Metropolis jumping rules. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds) Bayesian statistics, vol 5. Oxford University Press, New York, pp 599–608 Geman S, Geman D (1984) Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE T Pattern Anal 6(6):721–741 Gibbons R (1995) Some statistical and conceptual issues in the detection of low-level environmental pollutants. Environ Ecol Stat 2(2):125–167 Gilliom RJ, Helsel DR (1984) Estimation of distributional parameters of censored trace-level water quality data. Water Resour Res 22(2):147–155 Goovaerts P (2009) AUTO-IK: a 2D indicator kriging program for the automated non-parametric modeling of local uncertainty in earth sciences. Comput Geosci 35(6):1255–1270 Gringarten E, Deutsch CV (2001) Theacher’s aide. Variogram interpretation and modeling. Math Geol 27(5):659–672 Helsel DR (2005) Nondetects and data analysis. Wiley & Sons, New York Holla MS (1966) On a poisson-inverse Gaussian distribution. Metrika 11(1):115–121 Hopke PK, Liu C, Rubin DB (2001) Multiple imputation for multivariate data with missing and belowthreshold measurements: time-series concentrations of pollutants in the Arctic. Biometrics 57(1):22–33 Huzurbazar AV (2005) A censored data histogram. Commun Stat Simulat 34(1):113–120 Ingber L (1996) Adaptive simulated annealing (ASA): lessons learned. J Control Cybern 25(1):33–54 Kerry R, Oliver MA (2007a) Determining the effect of asymmetric data on the variogram. I. Underlying asymmetry. Comput Geosci 33(10):1212–1232 Kerry R, Oliver MA (2007b) Determining the effect of asymmetric data on the variogram. II. Outliers. Comput Geosci 33(10):1233–1260 Kerry R, Oliver MA (2007c) Comparing sampling needs for variograms of soil properties computed by method of moments and residual maximum likelihood. Geoderma 140(10):383–396 Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680 Knotters M, Brus DJ, Oude-Voshaar JH (1995) A comparison of kriging, co-kriging and kriging combined with regression for spatial interpolation of horizon depth with censored observations. Geoderma 67(3–4):227–246 Kuo SF, Liu CW, Merkley GP (2001) Application of the simulated annealing method to agricultural water resource management. J Agr Eng Res 80(1):109–124 Lark RM (2000a) A comparison of some robust estimators of the variogram for use in soil survey. Eur J Soil Sci 51(1):137–157
123
282
L. Sedda et al.
Lark RM (2000b) Estimating variograms of soil properties by the method-of-moments and maximum likelihood. Eur J Soil Sci 51(4):717–728 Lark RM, Papritz A (2003) Fitting a linear model of coregionalization for soil properties using simulated annealing. Geoderma 115(3–4):245–260 Leuangthong O, Deutsch CV (2003) Stepwise conditional transformation for simulation of multiple variables. Math Geol 35(2):155–173 Liu C (2001) The art of data augmentation: discussion. J Comput Graph Stat 10(1):75–81 Macmillan W (2001) Redistricting in a GIS environment: an optimisation algorithm using switchingpoints. J Geogr Syst 3(2):167–180 Marcotte D (1995) Generalized cross-validation for covariance model selection. Math Geol 27(5):659–672 McBratney AB, Webster R (1986) Choosing functions for semivariograms of soil properties and fitting them to sampling estimates. J Soil Sci 37(4):617–639 Meng XL, Van Dyk DA (1999) Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika 86(2):301–320 Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21(6):1087–1092 Militino AF, Ugarte MD (1999) Analyzing censored spatial data. Math Geol 31(5):551–561 Ministero per le Politiche Agricole e Forestali (1999) Metodi ufficiali di analisi chimica del suolo. Gazzetta Ufficiale Supplemento Ordinario 248:1–162 Odell PM, Anderson KM, D’Agostino RB (1992) Maximum likelihood estimation for interval censored data using a weibull-based accelerated failure time model. Biometrics 48(3):951–959 Oliver MA (2010) The variogram and kriging. In: Fischer MM, Getis A (eds) Handbook of applied spatial analysis. Springer, Berlin, pp 319–352 Orton TG, Lark RM (2007) Estimating the local mean for Bayesian maximum entropy by generalized least squares and maximum likelihood, and an application to the spatial analysis of a censored soil variable. Eur J Soil Sci 58(1):60–73 Pardo-Igu´zquiza E (1998) Optimal selection of number and location of rainfall gauges for areal rainfall estimation using geostatistics and simulated annealing. J Hydrol 210(1–4):206–220 Porter PS, Ward RC, Bell HF (1988) The detection limit. Water quality monitoring data are plagued with levels of chemicals that are too low to be measured precisely. Environ Sci Technol 22(8):856–861 Raimo F, Napolitano A (2003) Studio della distribuzione spaziale di alcuni parametri chimici. Il Tabacco 11:11–17 Rajasekaran S (2000) On simulated annealing and nested annealing. J Global Optim 16(1):43–56 Ribeiro PJ Jr, Diggle PJ (2001) geoR: a package for geostatistical analysis. R-NEWS 1(2):15–18 Rigby RA, Stasinopoulos DM (2005) Generalized additive models for location, scale and shape, (with discussion). Appl Stat-J Roy St C 54(3):507–554 Rivoirard J (1994) Introduction to disjunctive kriging and non-linear geostatistics. Clarendon, Oxford Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley & Sons, New York Saito H, Goovaerts P (2000) Geostatistical interpolation of positively skewed and censored data in a Dioxin-contaminated site. Environ Sci Technol 34(19):4228–4235 Stein ML (1992) Prediction and inference for truncated spatial data. J Comput Graph Stat 1(1):354–372 Svensson I, Sjo¨stedt-De Luna S, Bondesson L (2006) Estimation of wood fibre length distributions from censored data through an EM algorithm. Scand J Stat 33(3):503–522 Triki E, Collette Y, Siarry P (2005) A theoretical study on the behaviour of simulated annealing leading to a new cooling schedule. Eur J Oper Res 166(1):77–92 Tsiatis AA (1990) Estimating regression parameters using linear regression rank test for censored data. Ann Stat 18(1):354–372 Van Breemen N, Mulder J, Driscoll CT (1983) Acidification and alkalinization of soils. Plant Soil 75(3):283–308 Webster R, Oliver MA (2001) Geostatistics for environmental scientists. Wiley & Sons, Chichester
123