Estimating quantiles under sampling on two ... - Semantic Scholar

Report 2 Downloads 19 Views
Computational Statistics & Data Analysis 51 (2007) 6596 – 6613 www.elsevier.com/locate/csda

Estimating quantiles under sampling on two occasions with arbitrary sample designs M. Ruedaa,∗ , J.F. Muñoza , S. Gonzálezb , A. Arcosa a University of Granada, Spain b University of Jaén, Spain

Received 15 February 2006; received in revised form 8 March 2007; accepted 8 March 2007 Available online 16 March 2007

Abstract A practical problem related to the estimation of quantiles in double sampling with arbitrary sampling designs in each of the two phases is investigated. In practice, this scheme is commonly used for official surveys, in which quantile estimation is often required when the investigation deals with variables such as income or expenditure. A class of estimators for quantiles is proposed and some important properties, such as asymptotic unbiasedness and asymptotic variance, are established. The optimal estimator, in the sense of minimizing the asymptotic variance, is also presented. The proposed class contains several known types of estimators, such as ratio and regression estimators, which are of practical application and are therefore derived. Assuming several populations, the proposed estimators are compared with the direct estimator via an empirical study. Results show that a gain in efficiency can be obtained. © 2007 Elsevier B.V. All rights reserved. Keywords: Successive sampling; Quantile estimation; Auxiliary information; Monte Carlo simulation

1. Introduction In many social surveys, the same population is sampled repeatedly and the same study variable is measured on each occasion, so that development over time can be followed. For example, labour force surveys are conducted monthly to estimate the number of people in employment, data on the prices of goods are collected monthly to determine a consumer price index, political opinion surveys are conducted at regular intervals to measure voter preferences, etc. In such cases, the use of a successive sampling scheme can be an attractive alternative to improve the estimates of level at a point in time or to measure the change between two-time points. In successive sampling on two occasions, previous theory (Jessen, 1942; Patterson, 1950) aimed at providing the optimum estimate of the mean on the second occasion by combining two estimators of this mean: a double sampling regression estimate from the matched portion of the sample when the auxiliary variable is the value of the principal variable on the first occasion, and a simple mean based on a random sample from the unmatched portion on the second occasion. Successive sampling has also been discussed in some detail by Narain (1953), Adhvaryu (1978), ∗ Corresponding author. Dept. de Estadística e I.O., Facultad de Ciencias, Avda. Fuentenueva, Universidad de Granada, 18071 Granada, Spain.

Tel.: +34 958240494; fax: +34 958243267. E-mail address: [email protected] (M. Rueda). 0167-9473/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2007.03.011

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

6597

Eckler (1955), Gordon (1983) and Arnab and Okafor (1992) (Singh, 2003 provides an extensive bibliography of this topic). In all the above studies, the parameter considered is the mean. For the problem of estimating a population quantile, the situation is quite different. Only recently has this problem been discussed. Most of the studies related to medians have been developed by assuming simple random sampling or stratified sampling (see Gross, 1980; Sedransk and Meyer, 1978; Sedransk and Smith, 1988 among others) and considering only the variable of interest without making explicit use of auxiliary variables for the construction of the estimators. The literature relating to the estimation of medians and other quantiles which use an auxiliary variable is considerably less extensive than in the case of means and totals. Relevant references include Chambers and Dunstan (1986), Kuk and Mak (1989), Mak and Kuk (1993), Rueda et al. (1998), Allen et al. (2002), Singh et al. (2001) and Singh et al. (2006). Recently, Martínez et al. (2005) proposed a theory of successive sampling applied to quantile estimation using the value of the principal variable on a previous occasion as the auxiliary variable. This paper is developed under simple random sampling and assumes that on the current occasion a sub-sample is taken from the previously selected units, and that certain units are replaced by the new units selected independently of the matched sample. The extension to a general sampling design is not made. In Section 2, we propose a class of estimators for a -quantile under sampling on two occasions with arbitrary sampling designs in each of the two phases. This is composed of a class of estimators from the matched sample and another estimator based on the unmatched sample. The asymptotic unbiasedness of the estimators in the proposed class is derived in Section 3. Asymptotic expressions for variances of the estimators are also derived. In Section 4, we obtain the optimal estimator of the proposed class in the sense of minimizing its asymptotic variance. Since this requires the use of unknown parameters, a consistent estimator of the optimal estimator is proposed. The proposed class involves several types of estimators. Ratio and regression type estimators are derived from the proposed class in Section 5. Assuming simple random sampling, the optimal estimators of the class based on the matched sample are also derived in Section 6. The asymptotic normality, the optimal matching fraction and the gain in precision of the proposed estimators over the direct estimator described in Section 2 are other properties discussed in this section. In Section 7, we support our results with a simulation study based upon several populations, comparing the sampling properties of the proposed ratio and regression estimators to the direct estimator. 2. The proposed class of estimators In this section, after giving some definitions, a composite quantile estimator is defined under sampling on two occasions. We also introduce some basic quantile estimators, which are used by the proposed estimator. Consider a finite population U with size N, which is assumed to retain its composition over two-time periods with values yi on the current and xi on a previous occasion. Assume that a sample of size n is drawn on the previous occasion. At the current occasion, two independent samples are drawn, one being matched and the other unmatched. The matched sample is a sub-sample, of size m, taken from the previously selected n units, whereas the unmatched sample, of size u, is taken from the N − n remaining units. Thus the total sample on the current occasion has n = m + u units and the matched fraction is given by  = m/n. The first-phase sample s  , of size n , is drawn according to a sampling design d1 , such that pd1 (s  ) is the probability of s  being chosen. The corresponding first and second order probabilities are i , ij , for i, j ∈ U . Given s  , on the second occasion, a matched sample sm of size m, is drawn from s  according to the design d2 , such that pm (sm /s  ) is the conditional probability of choosing sm . The inclusion probabilities under this design are denoted by i/s  and ij /s  . The unmatched sample su is drawn from U − s  = s  c in accordance with the design d3 , such that pu (su /s  c ) is the conditional probability of choosing su . The inclusion probabilities under this design are denoted by i/s  c and ij /s  c . The total sample on the current occasion is s = sm ∪ su .  The finite population distribution function of y is FY (t) = N −1 i∈U (t − yi ), with (a) = 1 if a 0 and (a) = 0 otherwise. The corresponding finite population -quantile of y is QY () = FY−1 () = inf{t: FY (t) }, where FY−1 (·) is the inverse function. Under two-phase sampling, Särndal et al. (1992) showed that the usual Horvitz and Thompson (1952) type estimator of the population total cannot always be obtained in practice, since the inclusion probabilities associated to the secondphase sample should be known for every first-phase sample, and this is not normally the case. Under sampling on two

6598

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

occasions, the situation is very much the same as before, since sm can be seen as a second-phase sample taken from s  , the first-phase sample. This causes complications to define Horvitz–Thompson type estimators. Alternative procedures are therefore required. The use of ∗ estimators is a possible alternative, which was proposed by Särndal et al. (1992) for the problem of the population total estimation. Using this idea, we introduce the quantities ∗i = i i/s  + i c i/s  c (see Berger, 2004) where i c = 1 − i , and we define an estimator of the distribution function as  (t − yi ) Y n (t) = 1 F . N ∗i i∈s

We then estimate the quantile by taking its inverse −1 () = inf{t: F Y n (t)}. Y n () = F Q Yn

(1)

Using the samples su and sm , the following ∗ estimators of the distribution function can also be formed:  (t − yi ) Y u (t) = 1 , F N i c i/s  c i∈s u

 (t − yi ) Y m (t) = 1 F N i i/s  i∈sm

and  (t − xi ) Xm (t) = 1 , F N i i/s  i∈sm

Xm (t) are, respectively, the ∗ estimators based on the matched sample on the second and first Y m (t) and F where F Y u () = F Y m () = F −1 () −1 (), Q occasions. From these expressions, quantile estimators can easily be derived as Q Yu Ym −1 Xm () = F  (). and Q Xm Finally, we also form a Horvitz–Thompson estimator of the distribution function using the first sample:  (t − xi ) X (t) = 1 . F N i  i∈s

X () = F −1 (). The quantile QX () can then be estimated by Q X Let us consider a general estimator with the following form: H () = w Q H () + (1 − w)Q Y u (), Q Y Ym

(2)

where w is a constant (0 < w < 1), and QH Y m () belongs to a class of estimators H (inspired by the class proposed by Allen et al., 2002; Singh et al., 2001), the elements of which are stated as follows: H () = H (Q Y m (), t), Q Ym

(3)

X () and where the function H satisfies the following assumptions: Xm ()/Q with t = Q (A1) H assumes values in a closed convex subset C ⊂ R2 which contains the point (QY (), 1). (A2) H is a continuous function in C such that H (q, 1) = q. (A3) The first and second order partial derivatives of H exist and are also continuous in C, with  jH (q, t)  H10 (QY (), 1) = = 1. jq (q,t)=(QY (),1) Observe that the composite estimator (2) combines a matched sample estimator with the unmatched sample estimator. The problem of finding the optimal w is discussed in Section 4. Firstly, some asymptotic properties related to the class (2) are established in the next section.

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

6599

3. Asymptotic theory Some asymptotic properties of the proposed class are studied in this section. In particular, the asymptotic unbiasedness and the asymptotic variance of (12) are established. The results obtained are derived by assuming the following hypothesis: (A4) The finite population embeds in a sequence of populations {U }, where n and N increase such that n / N → f when n → ∞. (A5) When N → ∞ the bivariate distribution can be approximated by a continuous distribution with marginal densities fX (x) and fY (y) for x and y respectively, and fX (QX ()) and fY (QY ()) are positive. 3.1. Asymptotic unbiasedness H () are asymptotically Theorem 1. Under assumptions (A4) and (A5), the estimators in the proposed class Q Y unbiased for QY (). Y u () is asymptotically unbiased for QY (). We now study the Proof. It is well known that the sample quantile Q H (). For this, a linear approximation is needed because Q H () is not a continuous properties of the estimator Q Ym Ym function. The first order Taylor’s series expansion H about the point (QY (), 1) gives H () = H ((QY (), 1)) + (Q Y m () − QY ())H10 (QY (), 1) + (t − 1)H01 (QY (), 1) + oP (m−1 ), Q Ym where H10 and H01 denote the first order partial derivatives of H. Xm () and Q X () can be expressed asymptotically as a linear function of the estimated Y m (), Q The estimators Q distribution function evaluated at the quantile by the Bahadur representation (see Chambers and Dunstan, 1986) Y m () − QY () = Q

1 Y m (QY ())) + oP (m−1/2 ), ( − F fY (QY ())

Xm () − QX () = Q

1 Xm (QX ())) + oP (m−1/2 ) ( − F fX (QX ())

and X () − QX () = Q

1 X (QX ())) + oP (n −1/2 ). ( − F fX (QX ())

Y m (t), F Xm (t) and F X (t) are unbiased of FY (t) and FX (t), respectively, we find that Q H ()) is asymptotically As F Ym unbiased for QY (). H () and Q Y u () are asymptotically unbiased for QY () then so is the proposed estimator Finally, since Q Ym H  QY ().  3.2. Asymptotic expression of variances Expression (12) allows us to compute the asymptotic variance of the proposed estimator by providing the variance of the estimator based on the matched sample, the estimator that involves the unmatched sample and the covariance between them: H ()) = w 2 V1 + (1 − w)2 V2 + 2w(1 − w)C, V (Q Y H ()), V2 = V (Q Y u ()) and C = Cov(Q Y u (), Q H ()). where V1 = V (Q Ym Ym

(4)

6600

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

H ()) can be obtained from The asymptotic expression for V (Q Ym   Xm () Q H Y m () − QY () +  () − QY () Q − 1 H01 (QY (), 1) Q Ym X () Q = QY ()e0 +

e1 − e2 H01 (QY (), 1) 1 + e2

QY ()e0 + (e1 − e2 )(1 − e2 )H01 (QY (), 1) = QY ()e0 + (e1 − e2 )H01 (QY (), 1) − e2 (e1 − e2 )H01 (QY (), 1), with the notation Y m () Q e0 = − 1; QY ()

e1 =

Xm () Q − 1; QX ()

e2 =

(5)

X () Q − 1. QX ()

The variance, to the first order of approximation, is thus given by H ()) = Qy ()2 V (e0 ) + H01 (Qy (), 1)2 V (e1 − e2 ) V (Q Ym + 2H01 (Qy (), 1) Cov(e0 , e1 − e2 ).

(6)

Using the properties of two-phase sampling, the expression H ()) = Ed1 V (Q H ()/s  ) + Vd1 E(Q H ()/s  ) V (Q Ym Ym Ym

(7)

reflects the variation due to each of the two phases of sampling. Using known properties of the Horvitz–Thompson  estimator and its variance and noting by ij = ij − i j and sij = ij /s  − i/s  j/s  , we obtain H ()/s  ) = Vd1 E(Q Ym

 (QY () − yi ) (QY () − yj ) 1 ij i j N 2 fY2 (QY ()) i,j ∈U

and

⎛ H ()/s  ) = Ed1 ⎝ Ed1 V (Q Ym

+

  (QY () − yi ) (QY () − yj ) 1 s i · i/s  j · j/s  N 2 fY2 (QY ()) i,j ∈s  ij

2 (Q (), 1) H01 Y

Q2X ()

  (QX () − xi ) (QX () − xj ) 1 s i · i/s  j · j/s  N 2 fX2 (QX ()) i,j ∈s  ij

H01 (QY (), 1) 1 QX () N 2 fY (QY ())fX (QX ()) ⎞   (QY () − yi ) (QX () − xj ) ⎠. × sij  ·   ·    i/s j/s i j 

+2

i,j ∈s

The latter variance is not stated explicitly but is taken as an expected value over the first-phase design. This causes no problem for the variance estimation, because  (QY () − yi ) (QY () − yj ) ij j i i,j ∈U

can be estimated by  i,j ∈sm

ij

ij · ij /s 

Y n () − yj ) Y n () − yi ) (Q (Q  i j

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

and

⎛ Ed1 ⎝



i,j ∈s

6601

⎞ (QY () − yi ) (QY () − yj ) ⎠ ij i · i/s  j · j/s   s

by  i,j ∈sm

Y n () − yj ) Y n () − yi ) (Q ij (Q . ij /s  i · i/s  j · j/s 

The densities fX (QX ()) and fY (QY ()) can also be estimated following Silverman (1986). Similarly we obtain Y u ()) = V (Q



1

N 2 fY2 (QY ()) i,j ∈U

c ij



+ Ed1 ⎝

1 N 2 fY2 (QY ())

(QY () − yi ) (QY () − yj ) i c j c ⎞ (Q () − y ) c (QY () − yi ) Y j ⎠ , sij  c  c  c  c   i i/s j j/s c

 i,j ∈s

c

s c c c c c c where c ij = ij − i j and ij = ij /s  − i/s  j/s  . An unbiased estimator for this variance is given by

(Q Y u ()) = V

 Y n () − yj ) Y n () − yi ) (Q c (Q 1 ij c c 2 2 i c N fY (QY ()) i,j ∈s ij ij /s c j u

c

 sij (Q Y n () − yj ) Y n () − yi ) (Q 1 + 2 2 . c i i/s  c j c j/s  c N fY (QY ()) i,j ∈s ij /s  c u

Finally, we determine the covariance C: H ()) = Cov(E(Q Y u ()/s  ), E(Q H ()/s  )) + E(Cov(Q Y u (), Q H ()/s  )). Y u (), Q Cov(Q Y Y Y

(8)

From the independence between su and sm , we conclude that the second term in (8) is zero. As far as the first term is Y s  () + o(m−1 ) and E(Q Y u ()/s  ) = Q   c (), where Q Y s  () = F −1 (), H ()/s  ) = Q concerned, we find that E(Q Ys Y Ys −1   c () and   c () = F Q Ys Ys  (t − yi ) Y s  (t) = 1 , F N i  i∈s

 (t − yi )   c (t) = 1 F . Ys N c i c i∈s

On the other hand, Bahadur’s representation gives   c () − QY () = Q Ys Y s  () − QY () = Q

1   c (QY ())) + op (n−1/2 ), ( − F Ys fY (QY ()) 1 Y s  (QY ())) + op (n−1/2 ) ( − F fY (QY ())

and thus we have Y s  ()) −   c (), Q Cov(Q Ys



1 N 2f

2

Y (QY ()) i,j ∈U

which can be estimated from the sample in a similar way.

ij

(QY () − yi ) (QY () − yj ) , i j c

6602

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

4. The optimal estimator of the proposed class In this section, we address the problem of obtaining the optimal value w of the class (2). Note that the optimal w is the value that minimizes the asymptotic variance of (2). As this optimal value is unknown in practice, we also propose an estimator that can be applied in real situations. The variance of the composite estimator can be expressed as H ()) = w 2 V1 + (1 − w)2 V2 + 2w(1 − w)C V (Q Y

2 V2 − C V 1 V2 − C 2 + = (V1 + V2 − 2C) w − V1 + V2 − 2C V1 + V2 − 2C 

V1 V 2 − C 2 H ()), = Vmin (Q Y V1 + V2 − 2C

(9)

because V1 + V2 − 2C > 0. Equality holds if and only if w=

V2 − C = wopt . V1 + V2 − 2C

(10)

Using this optimal weight, we can construct the optimal estimator: H () = wopt Q H () + (1 − wopt )Q Y u (), Q Y opt Ym

(11)

Y u () and Q H (). which is more efficient, theoretically, than the estimators Q Ym We observe that wopt depends on unknown variances and covariances and so estimator (11) cannot be used in practice. In the absence of good a priori knowledge of these characteristics, we can consider consistent estimators of the asymptotic variances and the asymptotic covariance. This allows us to obtain consistent estimators of the optimum values wopt and 1 − wopt . The weight wopt can be estimated by ∗ = wopt

V2∗ − C ∗ , V1∗ + V2∗ − 2C ∗

(Q Y u ()) and C ∗ = Cov(  Q Y u (), Q H ()) (see Section 3). (Q H ()), V ∗ = V where V1∗ = V 2 Ym Ym ∗ By substituting wopt by wopt in (11), we obtain ∗ H ∗  H∗ () = wopt QY m () + (1 − wopt )QY u (). Q Y opt

(12)

H∗ () by the following result: We shall now establish the asymptotic behaviour of Q Y opt H∗ () and Q H () have the same asymptotic Theorem 2. Under assumptions (A4) and (A5), the estimators Q Y opt Y opt behaviour. Proof. To demonstrate Theorem 2, we will use a result provided by Randles (1982) and later used in Chambers and Dunstan (1986) and Rao et al. (1990). Randles obtained the asymptotic behaviour of a family of statistics were it not for the fact that some vital parameters in the formulation of the statistics are unknown. Randles shows that if the statistic Tn ( ) is a function of data and also uses the estimator  , which is also a function of data, consistently estimating ) and Tn () have the same limiting distribution provided (j ( )/j )| = = 0, where the vector parameter , then Tn ( ( ) = limn→+∞ E [Tn ( )] and the expectation is taken when the true parameter is  (see also Rao et al., 1990). ∗ is a consistent estimator of w . We find H∗ () = Tn ( H () = Tn (),  ) and Q  = wopt Denoting by  = wopt , Q opt Y opt Y opt the limiting mean Y (), ( ) = lim E [Tn ( )] = Q n→+∞

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

6603

Y () = limN→∞ QY (). Since (j ( )/j )| = = 0, we conclude that Tn ( being Q ) and Tn () have the same limiting H ∗  distribution. Therefore, the estimator QY opt () is asymptotically design unbiased and has the same asymptotic variance H ().  as Q Y opt

5. Some estimators within the proposed class Assuming sampling on two occasions, a new class of estimators was defined in Section 2. The optimal estimator of this class in the sense of minimal variance is proposed in Section 4. However, we need to know an expression for the H (). This would allow us to compute the estimator (12) in practice. Note that Q H () involves several estimator Q Ym Ym types of known estimators, some of which are presented in this section. 5.1. The ratio type estimator A particular case within the general class of estimators H is the ratio type estimator  X (), r () = QY m () Q Q Ym  QXm () which corresponds to the choice H (q, t) = q/t, and which in turn produces an estimator for sampling on two occasions, as follows: ∗ r ∗  r∗ () = wopt QY m () + (1 − wopt )QY u (). Q Y opt

(13)

ˆ r () coincides with QY (). This suggests that the ratio Note that if the values yi are proportional to xi , ∀i ∈ U , Q Ym estimator works well if the values yi and xi are approximately proportional. Assuming successive sampling, the situation where the variable x is the same variable y but observed on the previous occasion is quite common. Under this scenario, the assumption of proportionality is very likely to occur. r () can easily be computed, and Another reason to use the ratio type estimator is its simplicity. We observe that Q Ym furthermore it does not require the use of unknown parameters. In official surveys, the ratio method of estimation is commonly used. For example, Sen (1972) and Sen et al. (1975) used the ratio estimator in successive sampling to estimate the mean values provided by the National Harvest Survey of the Canadian Wildlife Service. 5.2. The regression type estimator If a regression type estimator based on the matched sample is constructed, it is possible to deduce a second estimator under sampling on two occasions and from which, as would occur with one-time sample designs, better theoretical and practical functioning is to be expected than with the above ratio type estimator. Specifically, let us now consider a choice within the class H of the type: d () = Q Y m () + d(Q X () − Q Xm ()), Q Ym

(14)

X (). The presence of the random variable which would require us to choose the function H (q, t) = q + d(1 − t)Q x () introduces extra variation and additional asymptotic studies are therefore required. In Appendix A we show how Q the regression type estimator is asymptotically unbiased. Asymptotic normality is also derived under simple random sampling without replacement (SRSWOR). The optimal d-value depends on unknown population characteristics and so the optimal estimator cannot be used. By adapting the estimator proposed by Rueda et al. (2003) to the case of two-phase sampling, we obtain d∗ =

Xm ())  Q Y m (), Q Cov( ,   V (QXm ())

(15)

6604

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

where  Q Y m (), Q Xm ()) Cov(  X () − xj ) Y n () − yi ) (Q ij (Q 1 = 2   N fY (QY ())fX (QX ()) ij ij /s  i j i,j ∈sm



 sij (Q X () − xj ) Y n () − yi ) (Q 1 + 2 ,  N fY (QY ())fX (QX ()) ij /s  i i/s  j j/s  i,j ∈sm

(Q Xm ()) = V

 X () − xj ) X () − xi ) (Q ij (Q 1   2 2 i j N fX (QX ()) i,j ∈s ij ij /s  m



 sij (Q X () − xj ) X () − xi ) (Q 1 + 2 2 . i i/s  j j/s  N fX (QX ()) i,j ∈s ij /s  m

Thus we define the regression estimator by d∗ () = Q Y m () + d ∗ (Q X () − Q Xm ()) Q Ym

(16)

and consequently ∗  ∗ d∗ d∗ () = wopt Q QY m () + (1 − wopt )QY u (). Y opt

(17)

Following Randles (1982), we can conclude that the estimators (16) and (14) have the same limiting distribution (see Singh et al., 2001; Rueda et al., 2003). H () 5.3. The optimal estimators of the class Q Ym H () can also be derived, where the optimality is again defined in the The optimal estimator in the general class Q Ym sense of minimizing the asymptotic variance. Considering expression (7) and equalling its derivative on H01 (QY (), 1) to zero, the optimal estimator in the class H () can be obtained, although it is not simple for a general sampling design. Q Ym H (), assuming SRSWOR. In particular, we In the next section, we calculate the optimal estimator of the class Q Ym obtain a subclass of optimal estimators, that is, the solution is not unique. 6. A particular case: simple random sampling without replacement Assuming SRSWOR, we now give simple expressions for the variances of the estimators considered. This allows us H () and its corresponding minimal variance. The asymptotic normality of to obtain the optimal estimator of class Q Ym class (2) is also discussed. Finally, we deal with other important aspects under sampling on two occasions, such as the optimal matching fraction and the gain in precision of the estimators given by (2) over the direct estimator (1). Assume that s  is a simple random sample from U, which implies that the complement s  c is a simple random sample from U. sm is also a simple random sample from s  and su is a simple random sample from s  c . The  values are: i =

n , N

ij /s  c =

ij =

n  n − 1 , N N −1

i/s  =

u(u − 1) . (N − n )(N − n − 1)

m , n

ij /s  =

m(m − 1) , n (n − 1)

i/s  c =

u N − n

and

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

6605

H () 6.1. Optimal estimators of the class Q Ym H ()) is given by Under SRSWOR, the asymptotic expression for V (Q Ym     fY (QY ()) 1 1 1 1 H ()) = (1 − ) V (Q − + − Ym 2  m N m n QX ()fX (QX ()) fY (QY ())    fY (QY ()) P11 (x, y) −1 , H01 (QY (), 1) + 2 ×H01 (QY (), 1) (1 − ) QX ()fX (QX ())

(18)

where P11 (x, y) denotes the proportion of values in the population for which x QX () and y QY () (see Singh et al., 2001). From expression (18) we conclude that the optimal estimator is that (or those) for which the first derivative is   QX ()fX (QX ()) P11 (x, y) H01 (QY (), 1) = 1− . (19) fY (QY ()) (1 − ) H () is obtained. Observe that the optimal estimator is not unique, as a subclass of the general class Q Ym By inserting the optimal derivative (19) into the asymptotic variance (18), a bound for the variances of estimators H () is derived: Q Ym       1 P11 (x, y) 2 1 1 1 (1 − ) H  V (QY m ()) , 1− − − − m N m n (1 − ) fY (QY ())2 with equality holding if H01 (QY (), 1) is equal to (19). 6.2. Asymptotic normality Y u () is asymptotically normal (Gross, 1980), whereas any estimator in Assuming SRSWOR, the sample quantile Q H is also asymptotically normal (Singh et al., 2001). Taking into account the linear combination of the proposed class of estimators (2), its asymptotic normality can easily be derived. 6.3. The optimal matching fraction An important problem under sampling on two occasions is that of obtaining the optimal matching fraction. This is H (). To determine this variance under SRSWOR, simple defined as the value of  that minimizes the variance of Q Y opt expressions for V1 , V2 and C are now used. Assuming several populations, the asymptotic variances are drawn and the optimal matching fractions can therefore be analysed. Variance (18) can be expressed as     1 1 1 1 H  V (QY m ()) = C1 − + C2 − n N n n with C1 = and

(1 − ) fY (QY ())2

  fY (QY ()) fY (QY ()) P11 (x, y) C2 = H01 (QY (), 1) H01 (QY (), 1) + 2 −1 . QX ()fX (QX ()) QX ()fX (QX ()) (1 − )

Y u ()) is given by Assuming SRSWOR, Gross (1980) showed that an asymptotic expression for V (Q

1 n Y u ()) = C1 V (Q − . 1− N

6606

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

Fig. 1. Scatter plot for Counties population.

Following the factorizing given by (8), it can be seen that   −n 1 n (1 − ) H   Cov(QY u (), QY m ()) = = C3 . 1− N n fY (QY ())2 N − n H ()) whose asymptotic expression is Therefore, the problem is to obtain the minimizer of the function () = V (Q Y opt given by (9). This can be expressed as       n 1 1 1 1 1 − C32 − − + C2 −  C12 1− N n N n n       , (20) () = 1 1 1 1 1 n − 2C3 − + C1 − + C2 −  C1 1− N n N n n and verifying the natural condition that 0 <  < 1. Differentiating (20) with respect to  and equating to zero, the local minimum (if one exists) of () can be obtained. The optimal matching fraction is now analysed under several populations. For this purpose, we computed the asymptotic expression (20) for the ratio estimator. Firstly, we consider the natural Counties population, which has been used for previous empirical studies in survey sampling. This population consists of N = 304 counties in North Carolina, South Carolina and Georgia with less than 1 00 000 households in 1960. The data were obtained on two occasions; the values yi on the current occasion represent the population in 1970, and the values xi on the previous occasion refer to the population in 1960. In Section 7, we will use the variable z= Number of households in 1960” to extract samples with unequal probabilities. Note that this is not now used because we compute asymptotic variances, and samples are not extracted. A perfectionist description of the Counties population can be seen in Royall and Cumberland (1981) and Valliant et al. (2000). This population is available at the John Wiley worldwide web site: ftp://ftp.wiley.com/public/sci_tech_med/finite_populations. Fig. 1 represents scatter plots for the three variables within the Counties population. We observe that the variables exhibit a strong linear relationship. Simulated populations are also used in this study. Thus, N = 1000 vectors (x, y, z)

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

6607

Fig. 2. Asymptotic variances for the ratio estimator in the Counties population and several sample sizes.  = 0.5.

are generated from the multivariate normal distribution (x, y, z) ∼ N (μ, 2 P), where μ = (20, 20, 30), 2 = 10 and ⎛ ⎞ 1 ⎜ ⎟ P = ⎝ 1 ⎠ .



1

H ()) can be explained by The evolution of V (Q Y opt 2 H ()) = wopt V (Q V1 + (1 − wopt )2 V2 + 2wopt (1 − wopt )C. Y opt

(21)

Figs. 2 and 3 give each component of expression (21) and the asymptotic variances of the ratio estimator over values of  within (0, 1). We see that as  increases, V2 and wopt increase, whereas V1 decreases. For instance, as V1 H ()) should decrease, however, V1 is weighted by w 2 , which increases. Therefore, taking these decreases, V (Q opt Y opt H ()) is not always a decreasing function. We also observe considerations into account, it is intuitively clear that V (Q Y opt that the optimal matching fraction varies with the population in question, the sample sizes, the correlation coefficient

, the order of the quantile, etc. H () over Q Y n () 6.4. Gain in precision of Q Y opt Y n () coincides with the Horvitz–Thompson estimator Q Y (). The Under SRSWOR, the proposed estimator Q variance of this estimator is given by (Gross, 1980) Y ()) = N − n (1 − )(n)−1 {fY (QY ())}−2 . V (Q N Y ()) and V (Q H ()) have been established, the gain in precision G1 of Q H () over Q Y () can be Once V (Q Y opt Y opt calculated: H ()) Y ()) − V (Q V (Q Y opt G1 = . H  V (Q ()) Y opt

(22)

6608

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

Fig. 3. Asymptotic variances for the ratio estimator in the simulated populations and several values of . n = 200 and n = 200.  = 0.5.

H ()), this also varies with the sample sizes, the population, the correlation coefficient, As G1 depends on V (Q Y opt etc. 7. Empirical study The next step in our study consists of carrying out a simulation study to reveal the behaviour of the proposed estimators. The populations described in Section 6.3 are also used in this study. Assuming sampling on two occasions, we observe that three types of samples need to be selected: the first-phase sample, the matched sample and the unmatched sample. On the one hand, we use traditional SRSWOR to select each sample. This allows us to make asymptotic and empirical comparisons of the optimal matching fraction. Finally, the Midzuno method (Singh, 2003) is used to extract all the samples. Note that this is a technique based upon unequal probabilities. B = 1000 samples are generated randomly to compare the performance of the proposed estimators with the direct estimator given by (1). This study is carried out for  = 0.50, and the precision measures used are the relative bias (RB) and the relative efficiency (RE), with RB =

B H () − Q () Y 1 Q b Y opt , B QY () b=1

RE =

Y ()] MSE[Q ,  MSE[QH ()] Y opt

 H ()] = B −1 B [Q H ()b − QY ()]2 , and MSE[Q Y ()] is where b indexes the bth simulation run, MSE[Q b=1 Y opt Y opt H   similarly defined for the direct estimator QY (). The estimators QY opt () used in this study are the ratio and regression estimators defined in (13) and (17). As in Section 6.3, we analyse the optimal matching fraction via the empirical variance for the ratio estimator. ∗ in (13) and (17) depends on covariances, although some authors ignore it (see, for example Observe that wopt Sen, 1972; Martínez et al., 2005). To determine the gain in precision of using covariances, we also compute the estimators

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

Fig. 4. Empirical variances for the ratio estimator in the Counties population and several sample sizes.  = 0.5.

Fig. 5. Empirical variances for the ratio estimator in the simulated populations and several values of . n = 200 and n = 200.  = 0.5. ∗ by w ∗ , where (13) and (17) after substituting wopt

w∗ =

V1∗

V2∗ . + V2∗

d∗ (). r∗ () and Q These estimators are denoted as Q Y Y

6609

6610

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

Table 1 Relative efficiency (RE) of estimators for the Counties population m

r∗ (0.5) Q Y opt

r∗ (0.5) Q Y

d∗ (0.5) Q Y opt

d∗ (0.5) Q Y

50

20 30 40

1.931 2.145 2.460

1.914 2.126 2.456

1.515 1.883 2.321

1.486 1.848 2.300

75

20 40 60

1.590 1.825 1.479

1.543 1.806 1.472

1.434 1.744 1.522

1.384 1.704 1.498

100

30 50 70

1.384 1.545 1.141

1.352 1.491 1.081

1.301 1.405 1.083

1.266 1.327 1.004

n

The samples s  (of size n = 100), s and sm are selected by SRSWOR.

Table 2 Relative efficiency (RE) of estimators for the Counties population m

r∗ (0.5) Q Y opt

r∗ (0.5) Q Y

d∗ (0.5) Q Y opt

d∗ (0.5) Q Y

50

20 30 40

2.067 2.045 1.490

2.061 2.074 1.522

1.543 1.870 1.645

1.507 1.866 1.712

75

20 40 60

1.681 1.579 1.636

1.645 1.571 1.650

1.285 1.432 1.611

1.235 1.402 1.603

100

30 50 70

1.528 1.429 1.177

1.500 1.409 1.161

1.212 1.344 1.137

1.167 1.293 1.098

n

The samples s  (of size n = 100), s and sm are selected by the Midzuno method.

Note that fX (QX ()) and fY (QY ()) are estimated by applying the method of the kth nearest neighbour to the population in hand. Following the usual practice (see Kuk and Mak, 1989; Arcos et al., 2005), we use k = O(N 1/2 ) = 2L − 1, where L is the integer nearest to N 1/2 . The random generations, calculations and all the estimators were obtained using the R programme. Programming details are available from the author. Figs. 4 and 5 give the empirical variances for the proposed ratio estimator. Observe that this is not, in general, a monotone function and the optimal matching fraction varies with the sample sizes and the correlation coefficient . Assuming the Counties population, the RE for the several estimators are given in Tables 1 and 2. Note that estimators r∗ () and Q d∗ () are, generally, more efficient than Q r∗ () and Q d∗ (), respectively. The use of covariances Q Y opt Y Y opt Y provides more accurate estimators. On the other hand, the ratio estimator is more efficient than the regression estimator. This is due to the assumption of proportionality made with this population (see Fig. 1). Finally, we observe that the direct estimator always has the smallest RE. Tables 3 and 4 give the RE of estimators for the simulated populations and several values of and sample sizes. We now observe that the regression estimator is more efficient than the ratio estimator. This is probably due to the fact the variables x and y do not have a large degree of proportionality. The correlation coefficient seems to have a significant impact on the RE, since more efficient estimators are obtained as increases. Assuming the simulated populations, the absolute values of the RBs for the several estimators are all within a reasonable range and are less than 0.5%. Therefore, these values are not reported. As far as the Counties population is concerned, the direct estimator has the largest RB at 4% and 8.7% under SRSWOR and the Midzuno method,

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

6611

Table 3 Relative efficiency (RE) of estimators for the simulated populations and several values of

n

m

r∗ (0.5) Q Y opt

r∗ (0.5) Q Y

d∗ (0.5) Q Y opt

d∗ (0.5) Q Y

0.9

100

40 60 80 40 80 120 60 100 140

1.092 1.095 1.103 1.087 1.094 1.021 1.054 1.009 0.997

1.090 1.092 1.105 1.083 1.090 1.026 1.051 1.008 0.997

1.224 1.242 1.284 1.115 1.134 1.132 1.131 1.095 1.072

1.221 1.237 1.284 1.114 1.126 1.127 1.132 1.088 1.058

40 60 80 40 80 120 60 100 140

1.063 1.179 1.081 1.045 1.043 1.061 1.069 1.042 0.999

1.061 1.176 1.080 1.043 1.040 1.062 1.068 1.038 0.993

1.137 1.227 1.172 1.097 1.100 1.114 1.081 1.058 1.057

1.135 1.222 1.169 1.096 1.094 1.105 1.083 1.048 1.037

40 60 80 40 80 120 60 100 140

0.900 0.880 0.771 0.900 0.828 0.835 0.956 0.929 0.912

0.899 0.879 0.773 0.897 0.827 0.837 0.952 0.927 0.912

1.019 1.080 1.041 1.025 1.023 1.047 1.042 1.072 1.008

1.019 1.078 1.043 1.025 1.020 1.039 1.045 1.068 0.994

150

200

0.8

100

150

200

0.7

100

150

200

The samples s  (of size n = 200), s and sm are selected by SRSWOR.

respectively. Both of the proposed estimators have analogous RBs. The largest values are 2.9% under SRSWOR, and 6.6% when the Midzuno method is used.

8. Concluding remarks This paper addresses a practical problem related to the estimation of quantiles in double sampling with arbitrary sampling designs in each of the two phases. This scheme is commonly used for official surveys, for which quantile estimation is often required, especially when the investigation deals with variables such as income or expenditure. We therefore propose a class of estimators of population quantiles under sampling on two occasions when the samples are obtained by a general sampling design on each occasion. Some important properties related to the proposed class are analysed. From the proposed class, ratio and regression type estimators with optimal properties are derived. The optimal matching fraction is analysed under simple random sampling.A simulation study based upon several populations supports our findings. The simulation in Section 7 shows how the use of covariances in the proposed estimators provides a slight gain in efficiency. We also observe that the effect of the correlation coefficient on the proposed estimators is not negligible, since more efficient estimators are obtained as the correlation coefficient increases. The extension of the proposed class to the case in which we have three or more occasions can also be obtained, and comparisons with sampling on two occasions could be made. This is an important topic which needs further research.

6612

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

Table 4 Relative efficiency (RE) of estimators for the simulated populations and several values of

n

m

r∗ (0.5) Q Y opt

r∗ (0.5) Q Y

d∗ (0.5) Q Y opt

d∗ (0.5) Q Y

0.9

100

40 60 80 40 80 120 60 100 140

1.082 1.102 1.163 1.122 1.093 1.149 1.012 1.021 1.038

1.079 1.100 1.168 1.120 1.091 1.156 1.012 1.019 1.037

1.163 1.231 1.277 1.127 1.212 1.252 1.069 1.045 1.047

1.160 1.227 1.279 1.127 1.205 1.247 1.071 1.034 1.028

40 60 80 40 80 120 60 100 140

1.101 1.058 1.305 1.060 1.044 1.026 1.051 1.088 0.983

1.099 1.055 1.305 1.058 1.041 1.023 1.049 1.084 0.978

1.122 1.198 1.338 1.088 1.089 1.099 1.057 1.066 1.020

1.120 1.193 1.334 1.089 1.082 1.089 1.060 1.055 1.002

40 60 80 40 80 120 60 100 140

0.872 0.841 0.831 0.901 0.865 0.876 0.942 0.901 0.921

0.871 0.840 0.832 0.901 0.864 0.878 0.939 0.900 0.922

1.052 1.129 1.138 1.039 1.060 1.042 1.079 1.016 1.020

1.052 1.128 1.136 1.041 1.055 1.034 1.081 1.011 1.004

150

200

0.8

100

150

200

0.7

100

150

200

The samples s  (of size n = 200), s and sm are selected by the Midzuno method.

Acknowledgements The authors would like to thank the referees for useful comments. Research partially supported by CICE Grant SEJ565 and MEC Grant MTM2006-04809.

Appendix A d (). This appendix contains the asymptotic properties of the regression type estimator, Q Ym Y m (), Q Xm () and Q X () are asymptotically unbiased First, under assumptions (A4) and (A5), the estimators Q d () is a linear function of these estimators and hence it is asymptotically unbiased for QY (). for QY (). Q Ym On the other hand, the asymptotic variance can be obtained using: d ()/s  ) + Vd1 E(Q d ()/s  ), d ()) = Ed1 V (Q V (Q Ym Ym Ym where d ()/s  ) = V (Q Ym

1



N 2 fY2 (QY ()) i,j ∈s 



sij

(QY () − yi ) (QY () − yj ) i · i/s  j · j/s 

M. Rueda et al. / Computational Statistics & Data Analysis 51 (2007) 6596 – 6613

+ d2

− 2d

1



N 2 fX2 (QX ()) i,j ∈s  N 2f



sij

6613

(QX () − xi ) (QX () − xj ) i · i/s  j · j/s 

  (QY () − yi ) (QX () − xj ) 1 sij i · i/s  j · j/s  Y (QY ())fX (QX ())  i,j ∈s

and d ()/s  ) = Vd1 E(Q Ym

 (QY () − yi ) (QY () − yj ) 1 ij . 2 2 i j N fY (QY ()) i,j ∈U

The asymptotic normality of estimator can be obtained assuming that the sample is obtained by SRSWOR. Under this Y m (), Q Xm () and Q X () are asymptotically normals (Gross, 1980). As (14) is a linear function sampling design, Q d () is also asymptotically normal. of these estimators, using the result obtained by Cramer (1946), Q Ym References Adhvaryu, D., 1978. Successive sampling using multi-auxiliary information. Sankhya 40, 167–173. Allen, J., Singh, H.P., Singh, S., Smarandache, F., 2002. A general class of estimatiors of population median using two auxiliary variables in double sampling. INTERSTAT – Statistics on the Internet, Virginia Polytechnic Institute and State University, Blacksburg, USA. Arcos, A., Rueda, M., Martínez-Miranda, M.D., 2005. Using multiparametric auxiliary information at the estimation stage. Statist. Papers 46, 339–358. Arnab, R., Okafor, F.C., 1992. A note on double sampling over two occasions. Pakistan J. Statist. 8, 9–18. Berger, Y., 2004. Variance estimation for measures of change in probability sampling. Canad. J. Statist. 32, 451–467. Chambers, R.L., Dunstan, R., 1986. Estimating distribution functions from survey data. Biometrika 73, 597–604. Cramer, H., 1946. Mathematical Methods of Statistics. Princenton University Press, Princeton. Eckler, A.R., 1955. Rotation sampling. Ann. of Math. Statist. 26, 664–685. Gordon, L., 1983. Successive sampling in finite populations. Ann. of Statist. 11, 702–706. Gross, S.T., 1980. Median estimation in sample surveym. In: Proceedings of the Section on Survey Research Methods.American Statistical Association, pp. 181–184. Horvitz, D.G., Thompson, D.J., 1952. A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47, 663–685. Jessen, R.J., 1942. Statistical investigation of a sample survey for obtaining farm facts. Iowa Agricultural Experiment Statistical Research Bulletin, 304. Kuk, A., Mak, T.K., 1989. Median estimation in the presence of auxiliary information. J. Roy. Statist. Soc. B 1, 261–269. Mak, T.K., Kuk, A.Y.C., 1993. A new method for estimating finite-population quantiles using auxiliary information. Canad. J. Statist. 25, 29–38. Martínez, M.D., Rueda, M., Arcos, A., Román, Y., Gónzalez, S., 2005. Quantile estimation under successive sampling. Comput. Statist. 20, 385–399. Narain, R.D., 1953. On the recurrence formula in sampling on successive occasions. J. Indian Soc. Agricultural Statist. 5, 96–99. Patterson, H.D., 1950. Sampling on successive occasions with partial replacement of units. J. Roy. Statist. Soc. B 12, 241–255. Randles, R., 1982. On the asymptotic normality of statistics with estimated parameters. Ann. of Statist. 10 (2), 462–474. Rao, J.N.K., Kovar, J.G., Mantel, H.J., 1990. On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika 77, 365–375. Royall, R.M., Cumberland, W.G., 1981. An empirical study of the ratio estimator and estimators of its variance. J. Amer. Statist. Assoc. 76, 66–77. Rueda, M., Arcos, A., Artés, E., 1998. Quantile interval estimation in finite population using a multivariate ratio estimator. Metrika 47, 203–213. Rueda, M., Arcos, A., Martínez-Miranda, M.D., 2003. Difference estimators of quantiles in finite populations. Test 12, 481–496. Särndal, C.E., Swensson, B., Wretman, J.H., 1992. Model Assisted Survey Sampling. Springer, New York. Sedransk, J., Meyer, J., 1978. Confidence intervals for the quantiles of a finite population: simple random and stratified simple random sampling. J. Roy. Statist. Soc. B 40 (2), 239–252. Sedransk, J., Smith, P.J., 1988. Inference for finite population quantiles. In: Krishnaiah, P.R., Rao, C.R. (Eds.), Handbook of Statistics, vol. 6. North-Holland, Amsterdam, pp. 267–289 (Chapter 11). Sen, A.R., 1972. Successive sampling with p auxiliary variables. Ann. Math. Statist. 43 (6), 2031–2034. Sen, A.R., Sellers, S., Smith, G.E.J., 1975. The use of a ratio estimate in successive sampling. Biometrics 31, 673–683. Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London. Singh, H.P., Sidhu, S.S., Singh, S., 2006. Median estimation with known interquartile range of auxiliary variable. Appl. Math. Statist. 4 (6), 68–80. Singh, S., 2003. Advanced Sampling Theory with Applications: How Michael “selected” Amy. Kluwer Academic Publisher, The Netherlands, pp. 1-1247. Singh, S., Joarder, A.H., Tracy, D.S., 2001. Median estimation using double sampling. Australian & New Zealand J. Statist. 43, 33–46. Valliant, R., Dorfman, A.H., Royall, R.M., 2000. Finite Population Sampling and Inference. A Prediction Approach. Wiley Series in Probability and Statistics. Survey Methodology Section.