Stat Methods Appl (2011) 20:1–21 DOI 10.1007/s10260-010-0149-5
Adaptive cluster sampling with a data driven stopping rule Stefano A. Gattone · Tonio Di Battista
Published online: 1 October 2010 © Springer-Verlag 2010
Abstract Adaptive cluster sampling (ACS) is a suitable sampling design for rare and clustered populations. In environmental and ecological applications, biological populations are generally animals or plants with a highly patchy spatial distribution. However, ACS is a less efficient design when the study population is not rare and has low aggregation, since the final sample size can easily get out of control. In this paper, a new variant of ACS is proposed in order to improve the performance (in terms of precision and cost) of ACS versus simple random sampling (SRS). The idea is to detect the optimal sample size by means of a data-driven stopping rule that determines when to stop the adaptive procedure. Introducing a stopping rule violates the theoretical basis of ACS, so the behaviour of the ordinary estimators used in ACS is explored by means of Monte Carlo simulations. Results show that the proposed variant of ACS makes it possible to control the effective sample size and to prevent the excessive efficiency loss typical of ACS when the population is less clustered than anticipated. The proposed strategy may be recommended especially when no prior information about the population structure is available, as it does not require prior knowledge of the degree of rarity and clustering of the population of interest.
Keywords Adaptive cluster sampling · Monte Carlo simulation · Stopping rule · Efficiency
S. A. Gattone (B) Department SEFeMeQ, University of Tor Vergata, Rome, Italy e-mail: [email protected]
T. Di Battista Department DMQTE, University G. d’Annunzio of Chieti-Pescara, Chieti-Pescara, Italy e-mail: [email protected]
1 Introduction
Adaptive cluster sampling (ACS) (Thompson 1990; Thompson and Seber 1996) was introduced as a sampling design for estimating parameters of rare and clustered biological populations (Smith et al. 2004). For a detailed review of the major developments and issues in ACS, see Turk and Barkowski (2005). In adaptive designs, sampling effort is allocated to local areas where the observed value of the selected units satisfies a condition of interest, so the selection of an adaptive cluster sample depends on the population values observed in the field. Consider a study region partitioned into $N$ spatial units labelled $\{1, 2, \ldots, N\}$ and let $y_i$ denote the y-value associated with the i-th unit. We consider the population to be fixed rather than a realization of a random variable. Starting from an initial set of $n$ units, the neighbourhood of each unit is added to the sample whenever the unit satisfies a previously fixed condition $C$ in one-dimensional real space, usually of the form $C = \{y_i > c\}$. If any of these additional units satisfies the condition, its neighbourhood is added to the sample in turn. A network is a collection of units such that the selection of any unit within the network leads to the inclusion in the sample of every other unit in the network. If the initial unit does not satisfy $C$, no further units are added to the sample and the initial unit is defined to be a network of size one. The adaptively sampled units which do not satisfy $C$ are called edge units. A network plus its edge units forms a cluster. The distinct networks form a partition $\{A_1, A_2, \ldots, A_K\}$ of the $N$ units, with $K \le N$. In summary, performing an ACS design requires
− the selection of an initial sample by using a probability sampling procedure
− the definition of the condition C for additional sampling
− the definition of the concept of neighbourhood.
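The selection mechanism described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the rectangular grid, the four-cell (first-order) neighbourhood, the function names `neighbours` and `grow_cluster`, and the toy data are all assumptions made for the example.

```python
from collections import deque

def neighbours(cell, n_rows, n_cols):
    """Assumed neighbourhood: the up, down, left and right cells inside the grid."""
    r, c = cell
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < n_rows and 0 <= cc < n_cols:
            yield (rr, cc)

def grow_cluster(y, start, c=0):
    """Return (network, edge_units) reached from `start` under the condition y > c."""
    n_rows, n_cols = len(y), len(y[0])
    if y[start[0]][start[1]] <= c:
        return {start}, set()              # network of size one, no adaptive step
    network, edge = {start}, set()
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        for nb in neighbours(cell, n_rows, n_cols):
            if nb in network or nb in edge:
                continue                   # already sampled
            if y[nb[0]][nb[1]] > c:
                network.add(nb)            # satisfies C: its neighbourhood is searched too
                queue.append(nb)
            else:
                edge.add(nb)               # sampled, but does not satisfy C
    return network, edge

# Toy 4 x 4 population (hypothetical values)
grid = [
    [0, 0, 0, 0],
    [0, 3, 2, 0],
    [0, 0, 5, 0],
    [0, 0, 0, 0],
]
network, edge = grow_cluster(grid, (1, 1), c=0)
```

In this toy case the network contains the three occupied cells and the cluster adds seven edge units, which shows how quickly the final sample size can exceed the initial one.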
One practical concern in real-life applications of ACS is the uncertainty of the final sample size. If the networks are very large, the final sampling effort and the total cost of the survey may run out of control. Indeed, field investigators may face poor performance of the ACS design in real surveys (Goldberg et al. 2007; Magnussen et al. 2005), as its efficiency depends critically on the population structure. If no prior information about the rarity and the aggregation of the population is available, it may be difficult to design an efficient adaptive cluster sample (Smith et al. 1995; Brown 2003; Smith et al. 2003). There are established practices for improving the performance (in terms of precision and cost) of existing ACS methods versus simple random sampling, but almost all need a priori information on the structure of the population. Thompson and Seber (1996, Chap. 5) and Thompson (1996) proposed the use of stratification, order statistics and partitioning into blocks in order to limit sampling. A two-stage ACS was proposed by Salehi and Seber (1997a), where information from a pilot study could be useful in order to set the size and the number of primary units. An adjusted two-stage ACS was proposed by Muttlak and Khan (2002), where large networks, identified by using a rapid assessment auxiliary variable, are subsampled and small networks are completely enumerated. Inverse ACS was proposed by Christman and Lan (2001) where
the initial sample is taken by general inverse sampling. In Restricted ACS (Brown and Manly 1998) the sampling effort is fixed in advance and, once it is reached, sampling is stopped at the end of the current adaptive step. Lo et al. (1997) and Su and Quinn (2003) proposed a stopping rule defining a stopping level S, where the sampling process is terminated at the S-th step of the neighbouring unit search. The stopping rule is fixed in advance, prior to sampling. Thus, the best choice of the stopping level is not known unless some prior knowledge about the population is available. Furthermore, the stopping rule proposed by Su and Quinn (2003) causes bias in the ordinary Hansen–Hurwitz (HH) and Horvitz–Thompson (HT) estimators. In this respect, Salehi and Seber (2002) have shown that Murthy’s estimator is an unbiased estimator when used with the restricted variants of ACS proposed by Brown and Manly (1998) and by Christman and Lan (2001). This paper fits within this framework, with the aim of finding an easy-to-implement variant of ordinary ACS that enlarges the circumstances under which the ACS strategy is efficient in terms of precision and cost. Our idea is to modify the sampling effort by means of a data-driven procedure. Instead of having a criterion fixed prior to the survey, we propose a stopping rule that changes at each step of the aggregative procedure. In the spirit of the adaptive procedure, we want the data to tell us when to stop sampling, in a sort of sequential ACS. Data-driven stopping rules have been successfully applied in other areas of research such as time series analysis and generalized linear models (Zacks 2009). In Sect. 2 the proposed variant of ACS is illustrated. Since the theoretical properties of the proposed method cannot be stated analytically, in Sect. 3 a simulation study is conducted on a wide range of artificial populations in order to investigate empirically the performance of the proposed stopping rule.
Results are given in Sect. 4. Discussion is given in Sect. 5.
2 The method
In order to overcome the problem of the uncertainty of the final sampling fraction, in this section we propose a data-driven procedure which stops the adaptive selection of units whenever a given stopping rule is verified. Let $s_1 = \{1, 2, \ldots, i, \ldots, n\}$ denote an initial simple random sample of size $n$. For each unit $i \in s_1$ we can view the adaptive sampling procedure as a sequence of steps. At step $l = 0$ we have just the initial unit $i$. If unit $i$ satisfies a given condition $C$, then the adaptive sampling phase is carried out in the neighbourhood of $i$, so that at the next step $l = 1$ we have the initial unit $i$ plus its neighbourhood. Similarly, at step $l = 2$ the neighbourhood of the units satisfying the condition $C$ at step $l = 1$ is added, and so on until no newly added unit satisfies the condition $C$. Let
\[ AS_i^1 = (i, i_1^1, i_2^1, \ldots, i_{k_i^1}^1) \]
be the set of indexes labelling the units sampled after the first step ($l = 1$) of the aggregative procedure started from the i-th sampled unit. Accordingly, at step $l = 2$ we have
\[ AS_i^2 = (i, i_1^1, i_2^1, \ldots, i_{k_i^1}^1, i_1^2, i_2^2, \ldots, i_{k_i^2}^2). \]
In general, the set of units adaptively sampled after the l-th step are labelled by
\[ AS_i^l = (i, i_1^1, i_2^1, \ldots, i_{k_i^1}^1, i_1^2, i_2^2, \ldots, i_{k_i^2}^2, \ldots, i_1^l, i_2^l, \ldots, i_{k_i^l}^l). \]
For each step $l$ we may define the average of units aggregated from the i-th initial unit as
\[ w_i^{(l)} = \frac{\sum_{j \in AS_i^l} y_j}{m_i^{(l)}} \tag{1} \]
where $m_i^{(l)}$ is the cardinality of $AS_i^l$, while
\[ s_i^{2(l)} = \frac{\sum_{j \in AS_i^l} \bigl(y_j - w_i^{(l)}\bigr)^2}{m_i^{(l)}} \tag{2} \]
represents the within-network variance at the l-th step for the i-th initial unit. The stopping rule proposed in this paper follows from the analysis of the inequality expressing the condition under which ACS is more efficient than SRS when the Hansen–Hurwitz estimator is adopted (Thompson 1990):
\[ \sigma_{WN}^2 \, \frac{1 - \frac{n}{N}}{1 - \frac{n}{E(v)}} > \sigma^2 \tag{3} \]
where $\sigma^2$ is the population variance, $E(v)$ is the expected final sample size of the adaptive process and $\sigma_{WN}^2 = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{m_k} \sum_{j \in A_k} (y_j - w_k)^2$ is the within-network variance, where $m_k$ is the size of the k-th network and $w_k$ is the average of the observations in the k-th network. Formally, ACS will be more efficient than SRS when the left-hand side of (3) is larger than $\sigma^2$. It could happen that none of the initial units in the sample satisfies the condition for extra sampling. In such a case $v = n$ and the left-hand side of Eq. (3) goes to infinity. In Rocco (2003) a variant of ACS, named Constrained Inverse ACS, is proposed in order to ensure that $v > n$. In designing an efficient adaptive cluster sample one has to consider the relation between $\sigma^2$ and $\sigma_{WN}^2$ and between $n$ and $v$. The final sample size can be reduced by increasing the critical value $c$ in the condition $C$. On the other hand, if the networks are very small, the disadvantage of having a relatively small within-network variance $\sigma_{WN}^2$ compared with $\sigma^2$ will become more important than the advantage of having the final sampling fraction close to the initial sampling fraction. Therefore, we have to cope with a trade-off between $\sigma_{WN}^2$ and $v$. Of course, it is a characteristic of ACS that both $v$ and $\sigma_{WN}^2$ are not known beforehand. Before sampling takes place one does not know which would be the best choice for the critical value $c$. If the critical value is set too low, the final sample size will be excessively large, as almost all the units will satisfy the condition.
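To make the trade-off in condition (3) concrete, the following short Python sketch evaluates its left-hand side for given values. The function names and the numbers used (N = 900, n = 15, E(v) = 60 and the two variance values) are purely illustrative assumptions, not results from the paper.

```python
def efficiency_factor(N, n, expected_v):
    """Multiplier of the within-network variance on the left-hand side of (3)."""
    if expected_v <= n:
        raise ValueError("condition (3) needs E(v) > n")
    return (1 - n / N) / (1 - n / expected_v)

def acs_beats_srs(sigma2_wn, sigma2, N, n, expected_v):
    """True when the left-hand side of (3) exceeds the population variance."""
    return sigma2_wn * efficiency_factor(N, n, expected_v) > sigma2

# Illustrative values only
factor = efficiency_factor(N=900, n=15, expected_v=60)
print(round(factor, 4))
print(acs_beats_srs(0.8, 1.0, 900, 15, 60), acs_beats_srs(0.7, 1.0, 900, 15, 60))
```

With these hypothetical numbers the networks must retain roughly three quarters of the population variance for ACS to be preferable; a larger expected final sample size pushes the multiplier towards 1 − n/N and makes the condition harder to satisfy.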
Relative efficiency of ACS compared with SRS is a result of how networks are defined by the sampling design. We can define condition (3) for each step $l$ of the adaptive process as follows:
\[ \sigma_{WN}^{2(l)} \, \frac{1 - \frac{n}{N}}{1 - \frac{n}{E(v^{(l)})}} > \sigma^2 \tag{4} \]
where $\sigma_{WN}^{2(l)}$ and $E(v^{(l)})$ are the within-network variance and the expected final sample size of the ACS when the sampling process is completed after $l$ steps. Thompson (1990) states the reasons why expression (3) ensures that the adaptive strategy will have lower variance than the sample mean of a SRS of size $v$. Denoting by $D^*$ the set of distinct units included in the sample together with the number of times each unit is included in the sample, and by $\hat\mu_{HH}$ and $\bar y$ the modified Hansen–Hurwitz mean estimator and the sample mean of the initial sample, respectively, Thompson (1990) shows that the result holds because $var(\hat\mu_{HH}) = var(\bar y) - E[var(\bar y \mid D^*)]$, since $\hat\mu_{HH} = E(\bar y \mid D^*)$. Thus the variance of $\hat\mu_{HH}$ will always be less than or equal to the variance of $\bar y$. Similarly, at each step $l$, condition (4) ensures efficiency of ACS if more than one initial selection of $n$ units leads to the same $D^{*(l)}$, i.e. $E[var(\bar y \mid D^{*(l)})] > 0$, where $D^{*(l)}$ denotes the set of units sampled at the l-th step. If this is not the case, then $E(\bar y \mid D^{*(l)}) = \bar y$ and $E[var(\bar y \mid D^{*(l)})] = 0$. Since $\sigma^2$ is unknown before sampling takes place, Eq. (4) cannot be solved, but we can use its left-hand side to compare the relative efficiency at the various steps of the aggregative procedure. Then, from (4) it follows that an ACS strategy which stops at the l-th step of the aggregative procedure will be more efficient than an ACS strategy of $(l - 1)$ steps if the following event is verified:
\[ \frac{\sigma_{WN}^{2(l)}}{\sigma_{WN}^{2(l-1)}} \cdot \frac{1 - \frac{n}{v^{(l-1)}}}{1 - \frac{n}{v^{(l)}}} > 1. \tag{5} \]
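The step from (4) to (5) can be spelled out as follows. This is a sketch that reads the left-hand side of (4) as an efficiency index (the symbol $L^{(l)}$ is introduced here only for illustration) and replaces the expected sample size by its realized value, as the text does.

```latex
\begin{align*}
L^{(l)} &= \sigma_{WN}^{2(l)} \, \frac{1 - n/N}{1 - n/v^{(l)}}, \\[4pt]
\frac{L^{(l)}}{L^{(l-1)}}
  &= \frac{\sigma_{WN}^{2(l)}}{\sigma_{WN}^{2(l-1)}} \cdot
     \frac{1 - n/v^{(l-1)}}{1 - n/v^{(l)}}
  \qquad \text{(the factor } 1 - n/N \text{ cancels)}.
\end{align*}
```

Provided $L^{(l-1)} > 0$, requiring $L^{(l)} > L^{(l-1)}$ is the same as requiring this ratio to exceed 1. Because $v^{(l)} \ge v^{(l-1)} > n$, the second factor is at most 1, so the within-network variance has to grow by more than the amount lost through the larger sample size.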
It is apparent how efficiency depends on the increase of the within-network variance relative to the number of sampled units in moving to the further step of the aggregative procedure. Our idea consists in applying condition (5) for each initial unit $i$ and at each step $l$ of the adaptive process. To this purpose, a straightforward estimate of $v^{(l)}$ is given by $\sum_{i=1}^{n} m_i^{(l)}$, while an estimate of the within-network variance $\sigma_{WN}^{2(l)}$ is given by
\[ \frac{1}{n} \sum_{i=1}^{n} s_i^{2(l)}. \tag{6} \]
In our proposed variant, the step l = 1 of the aggregative procedure is not modified with respect to the ordinary ACS. Starting from l = 2, for each initial unit i ∈ s1 we propose to sample units in the network that contains unit i only if the following criterion is satisfied:
\[ S_i^l = \left\{ \frac{s_i^{2(l)}}{s_i^{2(l-1)}} \cdot \frac{1 - \frac{1}{m_i^{(l-1)}}}{1 - \frac{1}{m_i^{(l)}}} > 1 \right\}. \tag{7} \]
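A minimal Python sketch of how criterion (7) could be evaluated for one initial unit when moving from step l − 1 to step l is given below. It assumes that the y-values aggregated up to each step are available as plain lists; the function names are hypothetical, and the handling of degenerate cases (no new units, or zero variance at the previous step) is an assumption of this sketch, since the criterion itself does not cover them.

```python
def within_variance(values):
    """Eq. (2): variance of the y-values aggregated so far (divisor m, not m - 1)."""
    m = len(values)
    mean = sum(values) / m
    return sum((y - mean) ** 2 for y in values) / m

def continue_sampling(prev_values, curr_values):
    """Evaluate criterion (7); True means the units added at step l are kept."""
    m_prev, m_curr = len(prev_values), len(curr_values)
    if m_prev < 2 or m_curr <= m_prev:
        return False                       # assumption: nothing to compare, so stop
    s2_prev = within_variance(prev_values)
    s2_curr = within_variance(curr_values)
    if s2_prev == 0:
        return s2_curr > 0                 # assumption: treat the ratio as infinite
    ratio = (s2_curr / s2_prev) * (1 - 1 / m_prev) / (1 - 1 / m_curr)
    return ratio > 1

# Hypothetical y-values: initial unit plus four neighbours after step 1
step1 = [3, 0, 0, 2, 0]
informative = continue_sampling(step1, step1 + [5, 0, 0])    # new units add variability
uninformative = continue_sampling(step1, step1 + [0, 0, 0])  # new units are all empty
```

In the first call the within-network variance roughly doubles, which outweighs the penalty for the larger network; in the second call the added units are all empty, the variance drops, and the network would be truncated at step l − 1.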
$S_i^l$ reflects the trade-off between $\sigma_{WN}^2$ and $v$. Indeed, at higher steps the number of neighbouring units increases, so we ask for more extra information, that is, a higher within-network variance $s_i^{2(l)}$, in order to carry on sampling. Instead of having a criterion fixed prior to the survey, the stopping rule $S_i^l$ changes at each step of the aggregative procedure and for each unit $i$ in the initial sample $s_1$. Hence, units meeting the criterion vary from sample to sample. For each unit in the initial sample $s_1$ we implement the adaptive procedure with the proposed stopping rule and, whenever condition $S_i^l$ is not satisfied, we stop sampling and the i-th network, say $AS_i$, is truncated at the $(l - 1)$-th step. The units aggregated at the l-th step are considered as edge units. Comments about the treatment of the edge units will be given in the discussion section. Finally, we will have $n$ networks, say $AS_1, AS_2, \ldots, AS_n$, to which the modified HH and HT estimators are applied. A flow chart showing the process of ACS with our stopping rule is displayed in Fig. 1. The modified HH estimator for the mean is
\[ \hat\mu_{HH} = \frac{1}{n} \sum_{i=1}^{n} w_i \tag{8} \]
where $w_i = \frac{1}{m_i} \sum_{j \in AS_i} y_j$ is the mean of the observations in $AS_i$. The variance of the HH estimator can be unbiasedly estimated by (Thompson and Seber 1996)
\[ \widehat{var}(\hat\mu_{HH}) = \frac{N - n}{N n (n - 1)} \sum_{i=1}^{n} (w_i - \hat\mu_{HH})^2. \tag{9} \]
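The HH computations can be illustrated with a short Python sketch. It assumes that each (possibly truncated) network is available as a list of y-values; the function name `hh_estimate` and the toy data are hypothetical, and the variance estimate uses the usual unbiased form with divisor N n (n − 1).

```python
def hh_estimate(networks, N):
    """networks: one list of y-values per initial unit (AS_1, ..., AS_n); needs n >= 2."""
    n = len(networks)
    w = [sum(ys) / len(ys) for ys in networks]         # network means w_i
    mu_hh = sum(w) / n                                 # Eq. (8)
    ss = sum((wi - mu_hh) ** 2 for wi in w)
    var_hat = (N - n) / (N * n * (n - 1)) * ss         # divisor N n (n - 1)
    return mu_hh, var_hat

# Hypothetical sample of n = 4 initial units from N = 900 plots
mu, var = hh_estimate([[3, 2, 5], [0], [0], [4, 4]], N=900)
```

Initial units that do not satisfy the condition enter as networks of size one, so their own y-value is their network mean.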
The modified HT estimator of the mean takes the form
\[ \hat\mu_{HT} = \frac{1}{N} \sum_{k=1}^{r} \frac{y_k^*}{\alpha_k} \tag{10} \]
where $y_k^*$ is the sum of the y-values in the k-th network, $r$ is the number of distinct networks in the sample and $\alpha_k$ is the probability that network $k$ is included in the sample. An unbiased estimator of the variance of $\hat\mu_{HT}$ is given by (Thompson and Seber 1996)
\[ \widehat{var}(\hat\mu_{HT}) = \frac{1}{N^2} \left[ \sum_{j=1}^{r} \sum_{k=1}^{r} y_j^* y_k^* \, \frac{\alpha_{jk} - \alpha_j \alpha_k}{\alpha_{jk} \, \alpha_j \alpha_k} \right] \tag{11} \]
where $\alpha_{jk}$ are the second-order inclusion probabilities, with $\alpha_{jj} = \alpha_j$. The modified HT estimator (10) is based on the fact that $\alpha_k = \alpha_i$ for every unit $i$ in network $k$. With our proposed
Fig. 1 Flow chart of ACS with a data driven stopping rule
variant, networks $AS_1, AS_2, \ldots, AS_n$ do not form a partition of the population and may intersect, as the adaptive samples initialized by two different units might overlap. For the units which belong to only one network, an approximation of the first-order inclusion probability may be obtained by the usual formula (Thompson 1990):
\[ \hat\alpha_i = 1 - \binom{N - m_i}{n} \bigg/ \binom{N}{n}. \tag{12} \]
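Formula (12) is straightforward to evaluate numerically. The sketch below uses Python's `math.comb`; the function name and the example values are illustrative assumptions.

```python
from math import comb

def inclusion_probability(N, n, m):
    """Eq. (12): probability that at least one of the m units of a network is
    selected in a simple random sample of n units drawn without replacement from N."""
    return 1 - comb(N - m, n) / comb(N, n)

# Illustrative values: a network of size one and a network of size six
p_single = inclusion_probability(N=900, n=15, m=1)
p_network = inclusion_probability(N=900, n=15, m=6)
```

For a network of size one the formula reduces to n/N, and the probability grows with the network size, which is why truncating a network changes the weight its units receive in the HT estimator.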
For the units which belong to more than one network, expression (12) has to be slightly modified as follows
\[ \hat\alpha_i = 1 - \binom{N - m_i^*}{n} \bigg/ \binom{N}{n} \tag{13} \]
where
\[ m_i^* = \sum_{j=1}^{n} m_j I_{ij} \]
where $I_{ij}$ is an indicator function which takes the value 1 if unit $j$ belongs to the i-th sampled network $AS_i$, and 0 otherwise. In practice, the networks that overlap are merged and the inclusion probabilities are evaluated on the basis of the size of the resulting network. In our framework, the summation in the HT estimator is over the distinct units sampled and not over the distinct networks. Therefore, with our proposed variant of ACS, a suitable version of the modified HT estimator is given by
\[ \hat\mu_{HT} = \frac{1}{N} \sum_{i=1}^{v} \frac{y_i}{\hat\alpha_i} \tag{14} \]
where $y_1, y_2, \ldots, y_v$ represent the y-values from the $v$ distinct labels in the final sample. The theoretical basis for the unbiasedness of the estimators for adaptive sampling relies on the fact that the networks built by this sampling design are disjoint and form a unique partition of the population for a specified criterion. In ordinary ACS the first-order inclusion probabilities are computed using the fact that if a unit in a network is included in the initial sample then every unit in that network is included in the final sample. Our design does not have this property because of the use of a stopping rule which determines the premature end of the aggregative procedure. As a consequence, with our proposed variant of ACS the modified HH and HT estimators given in (8) and (14) turn out to be biased. The bias is the price one has to pay in order to limit the sampling effort. The key point is that it is not possible to evaluate exactly the inclusion probability of each unit, which can only be approximated by means of the size of the truncated networks. The effect of using a stopping rule on the properties of the modified HH and HT estimators has also been evaluated by Su and Quinn (2003).
3 Monte Carlo simulation
No theoretical results can be obtained on the performance of the stopping rule proposed in Sect. 2. Accordingly, similarly to Brown and Manly (1998) and Su and Quinn (2003), we evaluate the proposed sampling design by simulating artificial populations by means of a Poisson cluster process (Diggle 1983). We use the same 20 × 2 × 3 factorial design used by Brown (2003), as that article aimed to evaluate, by means of a thorough simulation study, the efficiency of ACS compared with SRS under different survey design factors. The study area was a square grid divided into 30 × 30 = N = 900 plots. For any population, the number of parents was a realization from a Poisson process with mean
Fig. 2 Examples of populations generated by a Poisson cluster process with different values of λ1, λ2 and θ. Panels: (λ1 = 5, λ2 = 10, θ = 0.5), (λ1 = 20, λ2 = 10, θ = 0.5), (λ1 = 5, λ2 = 20, θ = 1.5), (λ1 = 20, λ2 = 20, θ = 3.5), (λ1 = 80, λ2 = 20, θ = 3.5), (λ1 = 40, λ2 = 20, θ = 0.5)
$\lambda_1 = 5, 10, \ldots, 100$. Parents were randomly located within the study area. For each parent, the number of children was generated from a Poisson distribution with mean $\lambda_2 = 10, 20$. Children were placed around the parents at a random angle uniformly distributed between 0° and 360° and at a distance drawn from an exponential distribution with mean $\theta = 0.5, 1.5, 3.5$. The combination of these parameter values allows us to cover various levels of clustering: from rare and tightly aggregated populations ($\theta = 0.5$, $\lambda_2 = 10$, low values of $\lambda_1$), for which ACS outperforms SRS, to sparser and less clustered populations ($\theta = 3.5$, $\lambda_2 = 20$ and high values of $\lambda_1$), for which it is well known that ACS is not suited. Different realizations of the Poisson cluster process are given in Fig. 2 in order to illustrate how the populations are distributed over the study region. Within the study area, a central area of 20 × 20 plots was defined as the sampling area to allow for edge effects from the Poisson cluster process. For each population, M = 10000 sampling simulations were conducted using SRS, ordinary ACS and ACS with our proposed stopping rule. ACS designs were carried out by selecting an initial sample of size $n = 5, 10, \ldots, 25$. The sample size of SRS was set equal to the effective sample size of the adaptive procedures, namely $E(v)$. The condition for adaptive sampling was $C = \{y_i > 1\}$. The simulation study was based on sampling without replacement. For each population and for each design, HH and HT estimators of the population mean were evaluated by using Eqs. (8) and (14), respectively. Estimates of the sample variance from our proposed variant of ACS were also evaluated by using the conventional estimators given in Eqs. (9) and (11).
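The population-generating mechanism described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the code used in the study: the y-value of a plot is taken to be the number of children falling in it, children landing outside the study area are discarded, the sampling area is taken to be the block left after removing a margin of 5 plots on each side, and all function names are hypothetical.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_population(lam1, lam2, theta, side=30, seed=None):
    """Return a side x side grid of counts (children per plot)."""
    rng = random.Random(seed)
    grid = [[0] * side for _ in range(side)]
    for _ in range(poisson(rng, lam1)):                 # number of parents
        px, py = rng.uniform(0, side), rng.uniform(0, side)
        for _ in range(poisson(rng, lam2)):             # children of this parent
            angle = rng.uniform(0.0, 2.0 * math.pi)     # uniform direction
            dist = rng.expovariate(1.0 / theta)         # exponential with mean theta
            cx = px + dist * math.cos(angle)
            cy = py + dist * math.sin(angle)
            if 0 <= cx < side and 0 <= cy < side:       # assumption: drop children outside
                grid[int(cy)][int(cx)] += 1
    return grid

def sampling_area(grid, margin=5):
    """Central block of the grid; margin=5 on a 30 x 30 grid leaves 20 x 20 plots."""
    return [row[margin:len(row) - margin] for row in grid[margin:len(grid) - margin]]

population = simulate_population(lam1=20, lam2=10, theta=0.5, seed=1)
area = sampling_area(population)
```

Small values of `theta` keep children close to their parent and produce compact clusters, while larger values spread them over neighbouring plots, which is the compactness dimension explored in the factorial design.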
Over the sampling simulations, the mean square error (MSE) of the estimators was recorded. Only for the design with the stopping rule, the relative bias (RB) of the estimators was also evaluated. In particular we have:
\[ MSE = \frac{1}{M} \sum_{i=1}^{M} (\hat\mu_i - \mu)^2, \qquad E(v) = \frac{1}{M} \sum_{i=1}^{M} v_i, \qquad RB = \frac{1}{M} \sum_{i=1}^{M} \frac{\hat\mu_i - \mu}{\mu} \]
where $\hat\mu_i$ is the mean estimate and $v_i$ is the final sample size at the i-th sampling simulation. Similarly to Su and Quinn (2003), $v_i$ was evaluated without considering the edge units, so that $v_i$ represents the number of sampling units used in the estimators. The bias in the estimated variance was also evaluated by comparing, for each population, the average of the estimated variances with the actual variance of the M = 10000 sample estimates. Then, the relative efficiency of adaptive cluster sampling with and without the stopping rule with respect to simple random sampling was computed as follows:
\[ re_{ACS} = \frac{MSE_{SRS}}{MSE_{ACS}}, \qquad re_{ACS^*} = \frac{MSE_{SRS^*}}{MSE_{ACS^*}} \]
where $MSE_{SRS}$ and $MSE_{SRS^*}$ are the mean square errors of the SRS estimators computed with a sample size equal to the effective sample size of the ordinary ACS ($E(v)_{ACS}$) and of ACS with our data-driven stopping rule ($E(v)_{ACS^*}$), respectively. With ACS* we denote our proposed ACS design with the stopping rule. In reporting the results we will emphasize the behaviour of the sampling designs with respect to cluster compactness, that is, the cluster size defined by the θ values, and the number of individuals in each cluster, defined by the λ2 values. In particular, we will be interested in the efficiency of ACS* relative to SRS compared with the efficiency of ACS relative to SRS. Furthermore, values of the effective sample size will be compared for both adaptive designs by evaluating the sampling fractions $f_n = E(v)_{ACS}/N$ and $f_n^* = E(v)_{ACS^*}/N$.
4 Results
4.1 The influence of the stopping rule on the efficiency of the estimators
Results for the relative efficiency $re_{ACS}$ and $re_{ACS^*}$ of the modified HH and HT estimators are reported in Figs. 3 and 4, respectively.
Fig. 3 Hansen–Hurwitz estimator: relative efficiency with respect to SRS of ordinary ACS (ACS) and ACS with the data-driven stopping rule (ACS*) with λ1 ranging from 5 to 100 (horizontal axis), λ2 = 10 first row, λ2 = 20 second row and θ = 0.5, 1.5, 3.5 (first, second and third column). ACS versus SRS (dotted line) and ACS* versus SRS (solid line)
ACS and ACS* behave similarly for rare and clustered populations (λ2 = 10, 20 and λ1 < 30). With compact clusters (θ = 0.5), as the population total increases (λ1 > 35), ACS* outperforms ACS and, more importantly, it is more efficient than SRS for almost all the populations, even those which are not rare. For less compact clusters (θ = 1.5, 3.5) and for high-density populations (λ1 > 30), both ACS designs are less efficient than SRS. Nevertheless, with our proposed stopping rule the efficiency loss of the adaptive design is modest compared with that of ordinary ACS, which performs very poorly. It is well known that the modified HH estimator is less efficient than the modified HT estimator (Salehi 2003). This is confirmed by the results of the simulation study, although the improvement in efficiency ensured by the modified HT estimator is negligible with our proposed variant of ACS. Ordinary ACS turns out to be more efficient than ACS* only when the HT estimator is applied to some of the less rare and less clustered populations (λ2 = 20, θ = 1.5, 3.5 and λ1 > 60). This can be explained by the good behaviour of the HT estimator in the presence of an extremely high effective sample size (Christman 1997). However, the ACS design is prohibitive in terms of cost and time under these circumstances (sampling fraction f_n > 0.6; see Figs. 9, 10 and 11). Results on relative efficiency are reported only for an initial sample size of n = 15, since the behaviour of the stopping rule was quite similar with
Fig. 4 Horvitz–Thompson estimator: relative efficiency with respect to SRS of ordinary ACS (ACS) and ACS with the data-driven stopping rule (ACS*) with λ1 ranging from 5 to 100 (horizontal axis), λ2 = 10 first row, λ2 = 20 second row and θ = 0.5, 1.5, 3.5 (first, second and third column). ACS versus SRS (dotted line) and ACS* versus SRS (solid line)
varying initial sample sizes. However, for the high-density and non-clustered populations, the appeal of our stopping rule with respect to ordinary ACS is tempered when a relatively large initial sample size is used (n > 20). Finally, in Tables 1, 2, 3 and 4 we report, by means of factorial designs, the main effects and interactions of the factors λ1, λ2 and θ on the relative efficiency of ACS with and without the data-driven stopping rule. As expected, all three factors (degree of rarity λ1, number of individuals per cluster λ2 and cluster compactness θ) have a significant effect on the relative efficiency of ACS and ACS* for both the HH and HT estimators, with the exception of λ2 with HT and ordinary ACS. The interaction λ1 × θ is never significant: the efficiency loss observed when λ1 increases does not vary with the cluster compactness θ. Interestingly, the interaction between λ1 and λ2 is not significant with ACS* while it is significant with ACS. As the degree of rarity decreases, the efficiency loss of the HH and HT estimators caused by the increase in the number of individuals per cluster is mitigated by introducing our data-driven stopping rule. Finally, the interaction between λ2 and θ, which is not significant under ACS, becomes significant with ACS*. In fact, the efficiency loss caused by the increase in the number of individuals per cluster λ2, observed in both ACS and ACS*, is mitigated for the more compact clusters with our data-driven stopping rule.
Table 1 ANOVA summary table: $re_{ACS}$ of $\hat\mu_{HH}$

Source      Sum of squares   df    Mean square   F        p > F
λ1          7.622            19    0.40116       34.06    0
λ2          0.3059           1     0.30593       25.98    0
θ           4.2405           2     2.12025       180.04   0
λ1 × λ2     0.4451           19    0.02342       1.99     0.0351
λ1 × θ      0.3814           38    0.01004       0.85     0.6876
λ2 × θ      0.0463           2     0.02314       1.96     0.1541
Error       0.4475           38    0.01178
Total       13.4886          119

Table 2 ANOVA summary table: $re_{ACS^*}$ of $\hat\mu_{HH}$

Source      Sum of squares   df    Mean square   F        p > F
λ1          2.45664          19    0.1293        21.45    0
λ2          0.026            1     0.026         4.31     0.0446
θ           3.47829          2     1.73914       288.46   0
λ1 × λ2     0.13469          19    0.00709       1.18     0.3258
λ1 × θ      0.21016          38    0.00553       0.92     0.6042
λ2 × θ      0.04392          2     0.02196       3.64     0.0357
Error       0.22911          38    0.00603
Total       6.5788           119

Table 3 ANOVA summary table: $re_{ACS}$ of $\hat\mu_{HT}$

Source      Sum of squares   df    Mean square   F        p > F
λ1          4.8649           19    0.25605       2.97     0.0021
λ2          0.1776           1     0.17758       2.06     0.1591
θ           2.0933           2     1.04665       12.16    0.0001
λ1 × λ2     3.3315           19    0.17534       2.04     0.0306
λ1 × θ      2.9105           38    0.07659       0.89     0.6399
λ2 × θ      0.046            2     0.02298       0.27     0.7671
Error       3.2718           38    0.0861
Total       16.6955          119
4.2 Bias of the HH and HT estimators and estimates of the sample variance
As already noted, the stopping rule induces some bias in the modified HH and HT estimators. Figures 5, 6 and 7 show the relative bias for n = 5, 15, 25, respectively. The HT estimator shows less bias than the HH estimator, except for a very small initial sample size (n = 5). Increasing the initial sample size improves the behaviour of the HT estimator, whose relative bias is always less than 5%. The opposite is observed with the HH estimator, whose bias increases as a function of the initial sample size n. The bias of the HH estimator is almost always positive with the proposed stopping rule. These results are in agreement with those presented by Su and Quinn (2003), even though they report greater bias of the HH estimator when used with their variant of ACS.
Table 4 ANOVA summary table: $re_{ACS^*}$ of $\hat\mu_{HT}$

Source      Sum of squares   df    Mean square   F        p > F
λ1          3.25417          19    0.17127       21.54    0
λ2          0.1093           1     0.1093        13.75    0
θ           3.70646          2     1.85323       233.11   0
λ1 × λ2     0.23617          19    0.01243       1.56     0.1185
λ1 × θ      0.22848          38    0.00601       0.76     0.8034
λ2 × θ      0.33752          2     0.16876       21.23    0
Error       0.3021           38    0.00795
Total       8.1742           119
Fig. 5 Relative bias of the HH (solid line) and HT (dotted line) estimators for ACS with the data-driven stopping rule with an initial sample size n = 5. Simulated populations with λ1 ranging from 5 to 100, λ2 = 10 first row, λ2 = 20 second row and θ = 0.5, 1.5, 3.5 (first, second and third column)
However, a direct comparison is not feasible as they use a stopping rule different from $S_i^l$. Figure 8 shows that the conventional variance estimator $\widehat{var}(\hat\mu_{HH})$ given in (9) can be used to obtain a measure of the variability of the HH estimator with ACS*. Its relative bias is smaller than 5% for all the populations considered. On the other hand, direct estimation of the variance of the HT estimator is not possible, since the exact second-order inclusion probabilities cannot be evaluated with our proposed variant of ACS. Results of the simulations (not reported in this paper) have shown that the conventional variance estimator $\widehat{var}(\hat\mu_{HT})$ given in (11) does not give an acceptable approximation of the variance of the HT estimator. Indeed, the approximation of the inclusion probabilities often leads to estimates of the variance
Fig. 6 Relative bias of the HH (solid line) and HT (dotted line) estimators for ACS with the data-driven stopping rule with an initial sample size n = 15. Simulated populations with λ1 ranging from 5 to 100 (horizontal axis), λ2 = 10 first row, λ2 = 20 second row and θ = 0.5, 1.5, 3.5 (first, second and third column)
Fig. 7 Relative bias of the HH (solid line) and HT (dotted line) estimators for ACS with the data-driven stopping rule with an initial sample size n = 25. Simulated populations with λ1 ranging from 5 to 100 (horizontal axis), λ2 = 10 first row, λ2 = 20 second row and θ = 0.5, 1.5, 3.5 (first, second and third column)
0.04
0.04
0.04
0.02
0.02
0.02
0
0
0
−0.02
−0.02
−0.02
−0.04
−0.04
−0.04
0
50
100
0
λ
50
100
0
λ
1 0.06
0.06
0.04
0.04
0.04
0.02
0.02
0.02
0
0
0
−0.02
−0.02
−0.02
−0.04
−0.04
−0.04
50
λ
1
100
0
50
λ
1
100
1
0.06
0
50
λ
1
100
0
50
100
λ
1
Fig. 8 Hansen–Hurwitz estimator: relative bias of the conventional variance estimators for ACS with the data-driven stopping rule with an initial sample size n = 15. Simulated populations with λ1 ranging from 5 to 100, λ2 = 10 first row, λ2 = 20 second row and θ = 0.5, 1.5, 3.5 (first, second and third column)
with negative values. An easy-to-compute approximation of $\widehat{\mathrm{var}}(\hat{\mu}_{HT})$ is given by the HH variance estimator $\widehat{\mathrm{var}}(\hat{\mu}_{HH})$ (Berger 1998). It is well known that the HT estimator is in general more efficient than the HH estimator (Salehi 2003); therefore $\widehat{\mathrm{var}}(\hat{\mu}_{HH})$ has a positive bias as an estimator of $\mathrm{var}(\hat{\mu}_{HT})$ (Durbin 1953). Finding a good variance estimator for the HT estimator under our proposed variant of ACS is left for further research.
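To make the quantities in this comparison concrete, the following minimal sketch computes the HH estimate together with its conventional variance estimate in the standard form of Thompson (1990), which is assumed here to correspond to (9). It also includes a relative bias helper using the usual definition (mean of the estimates minus the true value, divided by the true value), which is likewise an assumption about how the curves in Figs. 5 to 8 were obtained. The function names and the toy data are hypothetical and are not taken from the simulation study.

```python
from statistics import mean, variance


def hh_estimate(network_means, N):
    """Modified Hansen-Hurwitz estimate and conventional variance estimate.

    network_means[i] is w_i, the mean y-value of the network containing the
    i-th unit of an initial simple random sample drawn without replacement
    from N units (standard ACS form, assumed to match eq. (9)).
    """
    n = len(network_means)
    mu_hat = mean(network_means)
    # Sample variance of the w_i scaled by the finite population correction.
    var_hat = (N - n) / (N * n) * variance(network_means)
    return mu_hat, var_hat


def relative_bias(estimates, true_value):
    """Relative bias of an estimator over Monte Carlo replicates."""
    return (mean(estimates) - true_value) / true_value


# Toy initial sample of n = 5 units out of N = 400: three units fall in
# empty networks of size one, two units fall in non-empty networks.
w = [0.0, 0.0, 0.0, 4.0, 6.0]
mu_hat, var_hat = hh_estimate(w, N=400)
print(mu_hat, var_hat)
```

In a simulation, `relative_bias` would be applied to the replicated values of `var_hat`, with the empirical variance of `mu_hat` over the same replicates playing the role of the true value.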
4.3 Effective sample size
The specification of the exact upper limit of the number of sampled units is a key point in many real-life applications. Figures 9, 10 and 11 show the final sampling fractions $f_n$ and $f_n^*$ of the two adaptive designs, ACS and ACS*, for different initial sample sizes. With our stopping rule, $f_n^*$ is always less than 0.4. For initial sample sizes n = 5, 15, $f_n^*$ is almost always less than 0.2. For initial sample sizes n > 15, values of $f_n^*$ larger than 0.2 are reported only for some populations with λ1 > 50, λ2 = 20 and θ = 1.5, 3.5. The analysis of the final sampling fraction of ACS highlights the effectiveness of our stopping rule in limiting the sampling effort. Indeed, for the less clustered populations $f_n$ exceeds 40 per cent, and for λ1 > 60, λ2 = 20 and θ = 1.5, 3.5 extremely high effective sample sizes are reported ($f_n$ > 0.6). Under these circumstances ordinary ACS is unfeasible from both a logistical and a cost perspective. With our stopping rule, the reduction in effective sample size with respect to ordinary ACS is small for highly patchy populations with small cluster size (λ1 < 50, λ2 = 10, 20 and θ = 0.5). It
Fig. 9 Sampling fraction of ordinary ACS ($f_n$, dotted line) and of ACS with the data-driven stopping rule ($f_n^*$, solid line). Initial sample size n = 5. Simulated populations with λ1 ranging from 5 to 100, λ2 = 10 first row, λ2 = 20 second row and θ = 0.5, 1.5, 3.5 (first, second and third column)
becomes relevant for the less clustered populations and as λ1 increases. Indeed, the specification of the exact upper limit of the number of sampled units is a key point in many real-life applications (Rocco 2007). Table 5 reports some empirical statistics on the final sample size for two simulated populations. ACS leads to sampling more units than cost and time would probably allow, whereas ACS* performs well in limiting the maximum final sampling effort and in reducing the variability of the final sample size.
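Summary statistics of the final sample size such as E(v), Max(v) and Std(v) can be obtained by repeating the adaptive selection many times. The sketch below does this for ordinary ACS only (the stopping rule is not implemented) on a hypothetical 6 x 6 grid with a single patch. The grid, the first-order (four-cell) neighbourhood, the initial sample size of 3 and the choice to count every visited cell, edge units included, are illustrative assumptions rather than the settings of the simulation study.

```python
import random
from statistics import mean, stdev


def acs_final_sample(y, initial, c):
    """Return the set of cells visited by ordinary ACS on a rectangular grid.

    y is a list of rows of counts and initial a list of (row, col) cells.
    A visited cell triggers sampling of its four neighbours when y > c.
    """
    rows, cols = len(y), len(y[0])
    visited = set(initial)
    stack = list(initial)
    while stack:
        r, k = stack.pop()
        if y[r][k] <= c:
            continue  # condition C not met: edge unit or empty initial unit
        for dr, dk in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = (r + dr, k + dk)
            if 0 <= nb[0] < rows and 0 <= nb[1] < cols and nb not in visited:
                visited.add(nb)
                stack.append(nb)
    return visited


# Hypothetical population: one 2 x 2 patch of occupied cells.
grid = [[0] * 6 for _ in range(6)]
for r, k in [(2, 2), (2, 3), (3, 2), (3, 3)]:
    grid[r][k] = 5

cells = [(r, k) for r in range(6) for k in range(6)]
rng = random.Random(1)  # fixed seed so that the run is reproducible
v = [len(acs_final_sample(grid, rng.sample(cells, 3), c=1)) for _ in range(1000)]
print("E(v) =", mean(v), "Max(v) =", max(v), "Std(v) =", stdev(v))
```

Because a single hit on the patch pulls in the whole network and its edge units, the distribution of v is strongly skewed, which is the behaviour the maximum and the standard deviation are meant to capture.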
5 Discussion
As is well known (Brown 2003; Su and Quinn 2003), and as the simulation study has confirmed, when there is no prior information about the rarity and patchiness of the population, applying adaptive cluster sampling can be prohibitive in terms of the time and resources available, and the survey would become impractical. The stopping rule proposed in this paper adds a sequential component to ordinary ACS which aims to predict the expected performance of ACS relative to that of SRS. This information, taken from the sample data, is then used to modify the sampling effort accordingly. Despite the lack of theoretical grounding, simulation results show that the proposed stopping rule provides a substantial improvement in efficiency when incorporated into adaptive cluster sampling. For the rarest and most clustered populations, the sampling effort of ACS* is very close to that of ACS without the stopping rule. Thus, in situations ideal for ACS (populations with few small networks and high within-network variance) the proposed stopping rule operates in only a very few samples. As a matter of fact, the relative efficiency of ACS* is nearly the same as (in general slightly worse than) that of ordinary ACS. Nevertheless, in such a case,
Fig. 10 Sampling fraction of ordinary ACS ($f_n$, dotted line) and of ACS with the data-driven stopping rule ($f_n^*$, solid line). Initial sample size n = 15. Simulated populations with λ1 ranging from 5 to 100, λ2 = 10 first row, λ2 = 20 second row and θ = 0.5, 1.5, 3.5 (first, second and third column)
it would be preferable to adopt the ordinary ACS strategy, which ensures unbiased estimators. Moreover, the reduction of the sampling effort makes adaptive cluster sampling with a stopping rule more desirable than simple random sampling for a wider range of populations. In particular, for the more compact clusters (θ = 0.5) and for both low and high density populations (λ2 = 10, 20), the proposed variant of ACS seems to be effective in reducing the well-known edge units effect, that is, the sampling of units which do not contribute to the estimator but at the same time increase the final sample size. Thus, we can conclude that the proposed design is as efficient as ordinary ACS for rare and tightly clustered populations and more efficient than ordinary ACS for a range of populations that lack clustering. As expected, the resulting estimators turn out to be biased. However, results have shown that the bias of the estimators is negligible. Furthermore, Brown and Manly (1998) have shown that the bootstrap procedure can be applied under an ACS design with a stopping rule in order to estimate the bias of the estimators. Thus, $MSE_{ACS^*}$ and $re_{ACS^*}$ could be further improved. We were not able to provide a suitable variance estimator for the HT estimator, so potential users are recommended to use the HH estimator with the variant of ACS proposed in this paper. The value c of the aggregative condition C, the treatment of the edge units and the way relative efficiency was measured deserve some further comment. It could be argued that the condition C = {yi > c = 1} would not be a reasonable one for some of the populations simulated. Nevertheless, we kept the value c = 1 in
Fig. 11 Sampling fraction of ordinary ACS ($f_n$, dotted line) and of ACS with the data-driven stopping rule ($f_n^*$, solid line). Initial sample size n = 25. Simulated populations with λ1 ranging from 5 to 100, λ2 = 10 first row, λ2 = 20 second row and θ = 0.5, 1.5, 3.5 (first, second and third column)
Table 5 Statistics concerning the final sampling effort of ACS and ACS* for two simulated populations with λ1 = 50 and θ = 1.5. Initial sample size n = 15, N = 400

Statistics    ACS*                    ACS
              λ2 = 10     λ2 = 20     λ2 = 10     λ2 = 20
E(v)          38          60          80          145
Max(v)        82          113         115         182
Std(v)        10.20       15.52       18.57       18.67
all the simulations, since the purpose of this article is to provide an effective solution for situations in which the final sample size would be out of control. We are aware that if a value of c > 1 were used for the less patchy populations, one would observe an improvement in the relative efficiency of ordinary ACS. However, the simulation study has shown that under conditions where ACS is suited, ACS* would perform quite similarly. When the stopping rule $S_{il}$ is not satisfied, the ordinary estimators used in ACS are computed using the data collected up to step l − 1, and therefore we do not include the units sampled at step l in the estimators. It could be argued that in this way we are losing information. In this regard, we stress that if condition $S_{il}$ is not satisfied, it follows that the units aggregated at the l-th step would not provide valuable
information, and their inclusion in the estimator would cause a loss of efficiency with respect to the previous step l − 1. In fact, we tried using these units in the simulations, but this resulted in an efficiency loss for ACS* and in an increase in the bias of the estimators. This is clearly due, as in ordinary ACS, to the very poor approximation of the inclusion probabilities for these units. The procedure agrees with what happens in ordinary ACS, where the edge units are not used in the standard adaptive estimators because their inclusion probabilities cannot be determined from the sample data. Thus, the ordinary estimators of ACS incorporate only those edge units which were in the initial sample, and we do the same with our proposed variant of ACS. More efficient estimators which use all the units sampled can be obtained by using the Rao-Blackwell theorem based on the minimal sufficient statistic (Thompson 1990; Salehi and Seber 1997b; Dryver and Thompson 2005). Evaluating the performance of our proposed variant of ACS with these estimators could be the goal of further developments of the present work.
The relative efficiency was measured without considering the edge units in the evaluation of the final sample size E(v). Following Su and Quinn (2003), we chose to compare the efficiency by focusing on the number of sampling units used in the estimators. We point out that including the edge units in the evaluation of E(v) would decrease the efficiency of both adaptive designs, ACS and ACS*, with respect to SRS, but the relative performances would not be affected by this choice of sample size. Furthermore, with ACS* the number of edge units would be smaller than with ACS. Thus, the overall performance of ACS* in comparison to ACS is likely to increase.
Finally, it has to be noted that the proposed variant of ACS still has the drawback of an uncertain final sample size, like the original ACS proposed by Thompson (1990).
Indeed, under the proposed stopping rule a large network might still be sampled completely, but the probability of this happening depends on the presence of a large network in which the variance of the children points around their parents increases strongly. Such a network is unlikely to be observed in real-life populations. However, results show that the final sampling fraction is well controlled in all simulations. The proposed stopping rule has the benefit of reducing the risk of cost overruns due to the adaptive increase in sample size.
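The treatment of edge units discussed above follows from how the HT estimator weights each network. As a minimal sketch, assuming the standard first-order inclusion probability of a network under an initial simple random sample without replacement (Thompson 1990), the code below computes that probability and the resulting HT estimate of the mean. The function names and the toy networks are hypothetical. Under the stopping rule these probabilities are only approximations, which is the source of the bias discussed in the text.

```python
from math import comb


def network_inclusion_prob(m, n, N):
    """Probability that an initial SRS of n units out of N intersects a
    network of size m (ordinary ACS, sampling without replacement)."""
    return 1 - comb(N - m, n) / comb(N, n)


def ht_estimate(networks, n, N):
    """Horvitz-Thompson estimate of the population mean.

    networks is a list of (network_total, network_size) pairs, one for each
    distinct network intersected by the initial sample. Edge units that were
    not in the initial sample are simply left out of this list.
    """
    return sum(t / network_inclusion_prob(m, n, N) for t, m in networks) / N


# Toy example with N = 400 and n = 15: one network of four units with total
# 20, plus two empty networks of size one hit by the initial sample.
sampled_networks = [(20, 4), (0, 1), (0, 1)]
mu_ht = ht_estimate(sampled_networks, n=15, N=400)
print(network_inclusion_prob(4, 15, 400), mu_ht)
```

An edge unit reached only through adaptive expansion has no entry in `sampled_networks`, because the probability of reaching it depends on neighbouring networks that may not have been observed.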
References
Berger GY (1998) Rate of convergence for asymptotic variance of the Horvitz-Thompson estimator. J Stat Plan Inference 74:149–168
Brown JA, Manly BJF (1998) Restricted adaptive cluster sampling. Environ Ecol Stat 5:49–63
Brown JA (2003) Designing an efficient adaptive cluster sample. Environ Ecol Stat 10:95–105
Christman MC (1997) Efficiency of some sampling designs for spatially clustered populations. Environmetrics 8:145–166
Christman MC, Lan F (2001) Inverse adaptive cluster sampling. Biometrics 57:1096–1105
Diggle PJ (1983) Statistical analysis of spatial point patterns. Academic Press, London
Dryver AL, Thompson SK (2005) Improved unbiased estimators in adaptive cluster sampling. J R Stat Soc B 67:157–166
Durbin J (1953) Some results in sampling theory when the units are selected with unequal probabilities. J R Stat Soc B 15:262–269
Goldberg NA, Heine JN, Brown JA (2007) The application of adaptive cluster sampling for rare subtidal macroalgae. Mar Biol 151:1343–1348
Lo NCH, Griffith D, Hunter JR (1997) Using a restricted adaptive cluster sampling to estimate Pacific hake larval abundance. CalCOFI Rep 38:103–113
Magnussen S, Kurz W, Leckie DG, Paradine D (2005) Adaptive cluster sampling for estimation of deforestation rates. Eur J For Res 124:207–220
Muttlak HA, Khan A (2002) Adjusted two-stage adaptive cluster sampling. Environ Ecol Stat 9:111–120
Rocco E (2003) Constrained inverse adaptive cluster sampling. J Official Stat 19:45–57
Rocco E (2007) Two-stage restricted adaptive cluster sampling. Working paper 12, Dipartimento di Statistica G. Parenti, Firenze
Salehi MM (2003) Comparison between Hansen–Hurwitz and Horvitz–Thompson estimators for adaptive cluster sampling. Environ Ecol Stat 10:115–127
Salehi MM, Seber GAF (1997a) Two-stage adaptive cluster sampling. Biometrics 53:959–970
Salehi MM, Seber GAF (1997b) Adaptive cluster sampling with networks selected without replacement. Biometrika 84:209–219
Salehi MM, Seber GAF (2002) Unbiased estimators for restricted adaptive cluster sampling. Aust NZ J Stat 44:63–74
Smith DR, Conroy MJ, Brakhage DH (1995) Efficiency of adaptive cluster sampling for estimating density of wintering waterfowl. Biometrics 51:777–788
Smith DR, Villella RF, Lemarié DP (2003) Application of adaptive cluster sampling to low-density populations of freshwater mussels. Environ Ecol Stat 10:7–15
Smith DR, Brown JA, Lo NCH (2004) Application of adaptive cluster sampling to biological populations. In: Thompson WL (ed) Sampling rare or elusive species. Island Press, Covelo, pp 93–152
Su Z, Quinn TJ II (2003) Estimator bias and efficiency for adaptive cluster sampling with order statistics and a stopping rule. Environ Ecol Stat 10:17–41
Thompson SK (1990) Adaptive cluster sampling. J Am Stat Ass 85:1050–1059
Thompson SK, Seber GAF (1996) Adaptive sampling. Wiley, New York
Thompson SK (1996) Adaptive cluster sampling based on order statistics. Environmetrics 7:123–133
Turk P, Barkowski JJ (2005) A review of adaptive cluster sampling: 1990–2003. Environ Ecol Stat 12:55–94
Zacks S (2009) Stage wise adaptive design. Wiley, New York