Week 7: Parametric Point Estimation

1. Definition of Estimator

The problem of statistical estimation: a population has some characteristics that can be described by a random variable X with density $f_X(\cdot \mid \theta)$. The density has an unknown parameter (or set of parameters) θ. If the parameters were known, the density would be completely specified and there would be no need to make inferences about them. Suppose, however, that we can observe values of a random sample X1, X2, ..., Xn from the population $f_X(\cdot \mid \theta)$; denote the observed sample values by x1, x2, ..., xn. We then estimate the parameter (or some function of the parameter) based on this random sample.

Definition of an Estimator. Any statistic, denoted by $T(X_1, X_2, \dots, X_n)$, that is a function of observable random variables and whose values are used to estimate $\tau(\theta)$, where $\tau(\cdot)$ is some function of the parameter θ, is called an estimator of $\tau(\theta)$. A value of the statistic evaluated at the observed sample values x1, x2, ..., xn is called an estimate.

For example, the statistic

$$T(X_1, X_2, \dots, X_n) = \bar{X}_n = \frac{1}{n} \sum_{j=1}^{n} X_j,$$

which is the sample mean, may be used to estimate the mean µ of a population density.

Methods of Point Estimation
1. The Method of Moments
2. The Method of Maximum Likelihood
3. Bayes Estimation

2. The Method of Moments

Let X1, X2, ..., Xn be a random sample from a population with density $f_X(\cdot\,; \theta)$, which we assume has k parameters, say θ1, θ2, ..., θk. Finding the method of moments estimators is fairly straightforward: equate the first k sample moments to the corresponding k population moments and then solve the resulting system of simultaneous equations. So, if we denote the sample moments by

$$m_1 = \frac{1}{n} \sum_{j=1}^{n} X_j, \qquad m_2 = \frac{1}{n} \sum_{j=1}^{n} X_j^2, \qquad \dots, \qquad m_k = \frac{1}{n} \sum_{j=1}^{n} X_j^k$$

and the population moments by

$$\mu_1(\theta_1, \theta_2, \dots, \theta_k) = E(X), \qquad \mu_2(\theta_1, \theta_2, \dots, \theta_k) = E(X^2), \qquad \dots, \qquad \mu_k(\theta_1, \theta_2, \dots, \theta_k) = E(X^k),$$

then the system of equations to solve for (θ1, θ2, ..., θk) is given by

$$m_j = \mu_j(\theta_1, \theta_2, \dots, \theta_k) \quad \text{for } j = 1, 2, \dots, k.$$

Example: Normal Distribution. Suppose X1, X2, ..., Xn is a random sample from a N(µ, σ²) distribution. Use the method of moments to find point estimators of µ and σ². We solve the equations

$$\frac{1}{n} \sum_{j=1}^{n} X_j = \bar{X} = E(X) \qquad \text{and} \qquad \frac{1}{n} \sum_{j=1}^{n} X_j^2 = E(X^2).$$

Since $E(X) = \mu$ and $E(X^2) = \sigma^2 + \mu^2$, the method of moments estimators are

$$\tilde{\mu} = \bar{X} \qquad \text{and} \qquad \tilde{\sigma}^2 = \frac{1}{n} \sum_{j=1}^{n} X_j^2 - \bar{X}^2 = \frac{1}{n} \sum_{j=1}^{n} (X_j - \bar{X})^2 = \frac{n-1}{n} S^2,$$

where S² is the sample variance.
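As a quick numerical illustration of the normal example (a minimal sketch, not part of the notes; the data and parameter values below are simulated), the method-of-moments estimators can be computed directly from a sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # simulated sample; true mu = 5, sigma^2 = 4

# Method of moments: equate the first two sample moments to E(X) = mu and E(X^2) = sigma^2 + mu^2
m1 = x.mean()                 # first sample moment
m2 = np.mean(x**2)            # second sample moment
mu_mm = m1                    # mu_tilde = sample mean
sigma2_mm = m2 - m1**2        # sigma^2_tilde = (1/n) sum X_j^2 - Xbar^2

print(mu_mm, sigma2_mm)       # should be close to 5 and 4
```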

3. Maximum Likelihood Estimation

The Likelihood Function. If x1, x2, ..., xn are drawn from a population with a parameter θ (where θ could be a vector of parameters), then the likelihood function is given by

$$L(\theta; x_1, x_2, \dots, x_n) = f(x_1, x_2, \dots, x_n),$$

where $f(x_1, x_2, \dots, x_n)$ is the joint probability density of the random variables X1, X2, ..., Xn.

Maximum Likelihood Estimator/Estimate. Let $L(\theta) = L(\theta; x_1, x_2, \dots, x_n)$ be the likelihood function for X1, X2, ..., Xn. If $\hat{\theta} = \hat{\theta}(x_1, x_2, \dots, x_n)$ is a function of the observed values x1, x2, ..., xn that maximizes $L(\theta)$, then $\hat{\theta}$ is the maximum likelihood estimate of θ, and the random variable $\hat{\theta}(X_1, X_2, \dots, X_n)$ is called the maximum likelihood estimator.

When X1, X2, ..., Xn is a random sample from $f_X(x; \theta)$, the likelihood function is

$$L(\theta; x_1, x_2, \dots, x_n) = \prod_{j=1}^{n} f_X(x_j; \theta),$$

which is just the product of the densities evaluated at each of the sample values. If the likelihood function contains k parameters, so that

$$L(\theta_1, \theta_2, \dots, \theta_k) = f_X(x_1; \theta_1, \dots, \theta_k)\, f_X(x_2; \theta_1, \dots, \theta_k) \cdots f_X(x_n; \theta_1, \dots, \theta_k),$$

then, under certain regularity conditions, the point at which the likelihood is a maximum is a solution of the k equations

$$\frac{\partial L(\theta_1, \theta_2, \dots, \theta_k)}{\partial \theta_1} = 0, \qquad \frac{\partial L(\theta_1, \theta_2, \dots, \theta_k)}{\partial \theta_2} = 0, \qquad \dots, \qquad \frac{\partial L(\theta_1, \theta_2, \dots, \theta_k)}{\partial \theta_k} = 0.$$

Normally, the solution to this system of equations gives the global maximum, but to be sure you should check the second-derivative (or Hessian) conditions for a global maximum.

Consider the case of estimating two parameters, say θ1 and θ2. Define the gradient vector

$$D(L) = \begin{pmatrix} \dfrac{\partial L}{\partial \theta_1} \\[2mm] \dfrac{\partial L}{\partial \theta_2} \end{pmatrix}$$

and the Hessian matrix

$$H(L) = \begin{pmatrix} \dfrac{\partial^2 L}{\partial \theta_1^2} & \dfrac{\partial^2 L}{\partial \theta_1 \partial \theta_2} \\[2mm] \dfrac{\partial^2 L}{\partial \theta_1 \partial \theta_2} & \dfrac{\partial^2 L}{\partial \theta_2^2} \end{pmatrix}.$$

Calculus tells us that the maximizing choice of θ1 and θ2 should satisfy not only $D(L) = 0$ but also that H is negative definite, which means

$$\begin{pmatrix} h_1 & h_2 \end{pmatrix} \begin{pmatrix} \dfrac{\partial^2 L}{\partial \theta_1^2} & \dfrac{\partial^2 L}{\partial \theta_1 \partial \theta_2} \\[2mm] \dfrac{\partial^2 L}{\partial \theta_1 \partial \theta_2} & \dfrac{\partial^2 L}{\partial \theta_2^2} \end{pmatrix} \begin{pmatrix} h_1 \\ h_2 \end{pmatrix} \le 0$$

for all $(h_1, h_2)$. In several instances, maximizing the log-likelihood function is an easier task. Thus, we define the log-likelihood function as

$$\ell(\theta_1, \theta_2, \dots, \theta_k) = \ln\left[ L(\theta_1, \theta_2, \dots, \theta_k) \right].$$
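To make the second-order condition concrete, here is a small numerical sketch (not from the notes; the Hessian shown is an arbitrary illustrative matrix) that checks negative definiteness through the eigenvalues of H:

```python
import numpy as np

def is_negative_definite(H: np.ndarray) -> bool:
    """A symmetric matrix H is negative definite iff all its eigenvalues are negative,
    which is equivalent to h' H h < 0 for every nonzero vector h."""
    eigenvalues = np.linalg.eigvalsh(H)      # eigvalsh assumes H is symmetric
    return bool(np.all(eigenvalues < 0))

# Illustrative Hessian of a log-likelihood at a candidate point (theta1, theta2)
H = np.array([[-4.0, 1.0],
              [ 1.0, -2.0]])
print(is_negative_definite(H))   # True: the candidate point is a local maximum
```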

4. Examples Illustrating M.L.E.

Poisson Distribution. Suppose X1, X2, ..., Xn are i.i.d. Poisson(λ). The likelihood function is given by

$$L(\lambda) = \left( \frac{e^{-\lambda} \lambda^{x_1}}{x_1!} \right) \left( \frac{e^{-\lambda} \lambda^{x_2}}{x_2!} \right) \cdots \left( \frac{e^{-\lambda} \lambda^{x_n}}{x_n!} \right) = e^{-\lambda n} \left( \frac{\lambda^{x_1}}{x_1!} \right) \left( \frac{\lambda^{x_2}}{x_2!} \right) \cdots \left( \frac{\lambda^{x_n}}{x_n!} \right),$$

so that, taking the log of both sides, we get

$$\ell(\lambda) = -\lambda n + \ln(\lambda) \sum_{k=1}^{n} x_k - \sum_{k=1}^{n} \ln(x_k!).$$

Maximizing this log-likelihood with respect to the parameter λ, we set

$$\frac{d}{d\lambda} \ell(\lambda) = -n + \frac{1}{\lambda} \sum_{k=1}^{n} x_k = 0,$$

which gives the MLE

$$\hat{\lambda} = \frac{1}{n} \sum_{k=1}^{n} x_k = \bar{x},$$

the sample mean. Check the second-derivative condition to ensure a global maximum.
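A quick numerical check of this result (a sketch with simulated data, not part of the notes): the value of λ that maximizes the Poisson log-likelihood agrees with the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.poisson(lam=3.5, size=500)            # simulated Poisson sample

def neg_log_likelihood(lam):
    # l(lambda) = -n*lambda + log(lambda)*sum(x) - sum(log(x_k!)); the last term is constant in lambda
    return -(-len(x) * lam + np.log(lam) * x.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20), method="bounded")
print(result.x, x.mean())                     # numerical maximizer vs closed-form MLE: they agree
```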

Normal Distribution. Suppose X1, X2, ..., Xn are i.i.d. Normal(µ, σ²), where both parameters are unknown. The likelihood function is given by

$$L(\mu, \sigma) = \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x_k - \mu}{\sigma} \right)^2 \right].$$

Its log-likelihood function is

$$\ell(\mu, \sigma) = -n \ln \sigma - \frac{n}{2} \ln 2\pi - \frac{1}{2\sigma^2} \sum_{k=1}^{n} (x_k - \mu)^2.$$

Taking the derivatives and setting them to zero, we obtain

$$\hat{\mu} = \bar{x} \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \bar{x})^2.$$

See pages 255-256 of Rice for the rest of the details.

Gamma Distribution. You do not always get closed-form solutions for the parameter estimates with the maximum likelihood method. Take the case of the gamma distribution, as illustrated in Rice, pp. 256-257: one parameter has a closed-form solution, but the other does not. You have to estimate it numerically by solving a nonlinear equation, which may require an iterative process. An initial value may be needed, so another means of estimation, such as the method of moments, can be used first and its result taken as the starting value. For details, see Rice.
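One way to carry out the gamma estimation in practice is sketched below (assuming a shape/scale parameterization with mean αβ; this is an illustration, not code from the notes or from Rice): the shape parameter solves a nonlinear equation involving the digamma function, and the method-of-moments value is used to bracket the root.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(2)
x = rng.gamma(shape=2.5, scale=1.5, size=1000)    # simulated gamma sample

xbar, mean_log_x = x.mean(), np.mean(np.log(x))

# Profile-likelihood equation for the shape alpha (the scale is then xbar/alpha):
#   log(alpha) - digamma(alpha) = log(xbar) - mean(log(x))
def shape_equation(alpha):
    return np.log(alpha) - digamma(alpha) - (np.log(xbar) - mean_log_x)

# Method-of-moments value, used here to bracket the root of the nonlinear equation
alpha_mm = xbar**2 / x.var()
alpha_mle = brentq(shape_equation, 1e-3, 10 * alpha_mm)
scale_mle = xbar / alpha_mle
print(alpha_mm, alpha_mle, scale_mle)
```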

Uniform Distribution. Suppose X1, X2, ..., Xn are i.i.d. U[0, θ], i.e. $f_X(x) = 1/\theta$ for $0 \le x \le \theta$. Here the range of x depends on the parameter θ. The likelihood function can be expressed as

$$L(\theta) = \left( \frac{1}{\theta} \right)^{n} \prod_{k=1}^{n} I(0 \le x_k \le \theta).$$

You cannot use calculus to maximize this function. Use your intuition to note that this function is maximized at

$$\hat{\theta} = \max(x_1, x_2, \dots, x_n) = x_{(n)}.$$

5. Important Properties of the M.L.E.

Invariance Property. Let $\hat{\theta}$ denote the maximum likelihood estimator of θ in the density $f_X(x; \theta)$, where θ can be a vector of parameters. If g(θ) is a transformation of the parameter(s), then the maximum likelihood estimator of g(θ) is $g(\hat{\theta})$, i.e. the transformation evaluated at the maximum likelihood estimator $\hat{\theta}$.

Recall that in the Poisson distribution, the MLE of λ is $\bar{x}$. Now suppose we want to estimate the probability that the Poisson variable equals zero, i.e. $\operatorname{Prob}(X = 0) = e^{-\lambda}$, which is a function of λ. By the invariance property, the maximum likelihood estimate is $\widehat{e^{-\lambda}} = e^{-\bar{x}}$.

Asymptotic Property of the MLE. Suppose the density $f_X(x; \theta)$ satisfies certain regularity conditions (e.g. continuity, differentiability, the range of x does not depend on the parameter, etc.) and suppose $\hat{\theta}_n$ is the MLE of θ for a random sample of size n from $f_X(x; \theta)$. Then $\hat{\theta}_n$ is asymptotically normally distributed with mean

$$E\big(\hat{\theta}_n\big) = \theta$$

and variance

$$Var\big(\hat{\theta}_n\big) = \frac{1}{n} \left[ E\left\{ \left[ \frac{\partial}{\partial \theta} \log f_X(X; \theta) \right]^2 \right\} \right]^{-1} = \frac{1}{n} \tau^2.$$

We write this as

$$\sqrt{n}\big(\hat{\theta}_n - \theta\big) \xrightarrow{d} N(0, \tau^2).$$

Note that

$$\tau^2 = \left[ E\left\{ \left[ \frac{\partial}{\partial \theta} \log f_X(X; \theta) \right]^2 \right\} \right]^{-1}$$

and it can be shown that

$$E\left\{ \left[ \frac{\partial}{\partial \theta} \log f_X(X; \theta) \right]^2 \right\} = -E\left[ \frac{\partial^2}{\partial \theta^2} \log f_X(X; \theta) \right].$$

In evaluating the variance of the MLE, you can therefore use either form of this variance formula. For functions of the parameter, say g(θ), we can easily extend the theorem, except that there is a delta-method adjustment to the variance. Thus, assuming g(θ) is a differentiable function of θ,

$$\sqrt{n}\left[ g\big(\hat{\theta}_n\big) - g(\theta) \right] \xrightarrow{d} N\left( 0, \left[ g'(\theta) \right]^2 \tau^2 \right),$$

where g′(θ) is the first derivative of g with respect to the parameter θ.
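An optional simulation sketch (an assumed Poisson model with a known true λ, so purely illustrative, not part of the notes) showing that the variance of $\hat{\lambda}_n = \bar{X}$ matches $\tau^2/n = \lambda/n$, and that the delta method gives the variance of $g(\hat{\lambda}_n) = e^{-\hat{\lambda}_n}$, the estimator from the invariance example:

```python
import numpy as np

rng = np.random.default_rng(3)
lam_true, n, reps = 2.0, 200, 5000

samples = rng.poisson(lam_true, size=(reps, n))
lam_hat = samples.mean(axis=1)                  # MLE of lambda in each replication

# Asymptotic theory: Var(lam_hat) ~ lambda/n  (tau^2 = lambda for the Poisson)
print(lam_hat.var(), lam_true / n)

# Delta method for g(lambda) = exp(-lambda):  Var(g(lam_hat)) ~ [g'(lambda)]^2 * lambda / n
g_hat = np.exp(-lam_hat)
print(g_hat.var(), np.exp(-lam_true) ** 2 * lam_true / n)
```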

6. Bayes Estimation

Under this approach, we assume that θ is a random quantity with distribution π(θ), called the prior distribution. A sample x1, x2, ..., xn is taken from the population, and the prior distribution is updated with the information drawn from this sample by applying Bayes' rule. This updated prior is called the posterior distribution, which is the conditional distribution of θ given the sample x1, x2, ..., xn. It is derived as

$$\pi(\theta \mid x_1, x_2, \dots, x_n) = \frac{f(x_1, x_2, \dots, x_n \mid \theta)\, \pi(\theta)}{m(x_1, x_2, \dots, x_n)} = \frac{f(x_1, x_2, \dots, x_n \mid \theta)\, \pi(\theta)}{\displaystyle\int f(x_1, x_2, \dots, x_n \mid \theta)\, \pi(\theta)\, d\theta},$$

where

$$m(x_1, x_2, \dots, x_n) = \int f(x_1, x_2, \dots, x_n \mid \theta)\, \pi(\theta)\, d\theta$$

is the marginal distribution of x1, x2, ..., xn. The posterior is then used to draw conclusions about θ. Its mean (or expected value), for example, can be used as a point estimate of θ. Sometimes, to simplify the notation, we write x = (x1, x2, ..., xn) in bold to denote a vector.

Example of Bayes Estimation. Let X1, X2, ..., Xn be i.i.d. Bernoulli(p). Then we know that

$$f(x \mid p) = p^{x_1}(1-p)^{1-x_1}\, p^{x_2}(1-p)^{1-x_2} \cdots p^{x_n}(1-p)^{1-x_n} = p^{\sum x_k}(1-p)^{\,n - \sum x_k}.$$

Assume the prior distribution on p is Beta(a, b), so that

$$\pi(p) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, p^{a-1}(1-p)^{b-1}.$$

To simplify notation, let $y = \sum x_k$. Therefore,

$$m(x) = \int_0^1 f(x \mid p)\, \pi(p)\, dp = \int_0^1 \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, p^{(a+y)-1}(1-p)^{(b+n-y)-1}\, dp = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)} \cdot \frac{\Gamma(a+y)\,\Gamma(b+n-y)}{\Gamma(a+b+n)}.$$

Therefore, the posterior distribution, the distribution of p given x, is given by

$$\pi(p \mid x) = \frac{f(x \mid p)\, \pi(p)}{m(x)} = \frac{\Gamma(a+b+n)}{\Gamma(a+y)\,\Gamma(b+n-y)}\, p^{(a+y)-1}(1-p)^{(b+n-y)-1},$$

which is Beta(a + y, b + n − y). The mean of this posterior distribution,

$$\hat{p}_B = E\left[ \mathrm{Beta}(a+y,\, b+n-y) \right] = \frac{a+y}{a+b+n},$$

gives the Bayes estimator of p. We note that the Bayes estimator can be written as a weighted average of the prior mean (which is $\frac{a}{a+b}$) and the sample mean (which is $\bar{x}$) as follows:

$$\hat{p}_B = \left( \frac{n}{a+b+n} \right) \bar{x} + \left( \frac{a+b}{a+b+n} \right) \left( \frac{a}{a+b} \right).$$
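A short sketch (simulated Bernoulli data and assumed prior values a and b, not part of the notes) confirming that the posterior mean equals the stated weighted average of the prior mean and the sample mean:

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, n, p_true = 2.0, 3.0, 50, 0.3
x = rng.binomial(1, p_true, size=n)     # i.i.d. Bernoulli(p) sample
y = x.sum()

posterior_mean = (a + y) / (a + b + n)  # mean of the Beta(a + y, b + n - y) posterior

# Weighted-average form: the weight on the sample mean grows with n
weighted = (n / (a + b + n)) * x.mean() + ((a + b) / (a + b + n)) * (a / (a + b))
print(posterior_mean, weighted)          # the two agree
```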

7. Methods to Evaluate Estimators

To illustrate the several concepts introduced below, we consider the Poisson case, where we found that the maximum likelihood estimator of λ is given by

$$\hat{\lambda} = T(X_1, X_2, \dots, X_n) = \bar{X}.$$

Here we examine some procedures for deciding how good an estimator is:
1. Mean squared error and bias
2. The best unbiased minimum variance estimator
3. The Cramér-Rao lower bound
4. Consistency
5. Sufficient statistics

8. Mean Squared Error and Bias

The mean squared error (MSE) of an estimator $T(X_1, X_2, \dots, X_n)$ of a parameter θ is defined as

$$MSE = E(T - \theta)^2.$$

The MSE gives the average squared difference between the estimator $T(X_1, X_2, \dots, X_n)$ and θ. Because

$$MSE = E(T - \theta)^2 = Var(T) + [E(T) - \theta]^2 = Var(T) + [Bias(T)]^2,$$

the MSE is the sum of the variance of the estimator and the square of its bias, where the bias of the point estimator is the difference between the expected value of the estimator and the parameter:

$$Bias(T) = E(T) - \theta.$$

If this bias is zero, then the estimator is said to be an unbiased estimator. Thus, for an unbiased estimator, we have E(T) = θ.

Consider two estimators, say T1 and T2. We define the efficiency of T1 relative to T2 as

$$eff(T_1, T_2) = \frac{Var(T_2)}{Var(T_1)}.$$

It is clear that if this ratio is smaller than 1, then $Var(T_2) < Var(T_1)$.

In the Poisson case, we evaluate the bias

$$Bias(T) = E(T) - \lambda = E(\bar{X}) - \lambda = E\left( \frac{1}{n} \sum_{k=1}^{n} X_k \right) - \lambda = \frac{1}{n} \sum_{k=1}^{n} E(X_k) - \lambda = \lambda - \lambda = 0,$$

and hence $\bar{X}$ is unbiased. The MSE is therefore given by

$$MSE = Var(T) = Var\left( \frac{1}{n} \sum_{k=1}^{n} X_k \right) = \frac{1}{n^2} \sum_{k=1}^{n} Var(X_k) = \frac{1}{n^2}\, n\lambda = \frac{\lambda}{n}.$$

Note that the variance of the estimator is a function of the parameter, but we can approximate it using the estimator itself, so that

$$\widehat{Var}(T) = \frac{\hat{\lambda}}{n} = \frac{\bar{x}}{n}.$$

The square root of this is called the standard error of the estimate:

$$s.e.\big(\hat{\lambda}\big) = \sqrt{\frac{\bar{x}}{n}}.$$
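For the Poisson example, the estimated variance and standard error of $\hat{\lambda} = \bar{X}$ follow directly from the sample; a minimal sketch with hypothetical counts:

```python
import numpy as np

x = np.array([2, 0, 3, 1, 4, 2, 1, 0, 2, 3])   # hypothetical observed Poisson counts
n = len(x)

lam_hat = x.mean()                  # MLE (and unbiased estimator) of lambda
var_hat = lam_hat / n               # estimated Var(Xbar) = lambda_hat / n
std_err = np.sqrt(var_hat)          # standard error of the estimate
print(lam_hat, var_hat, std_err)
```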

9. UMVUEs

The Best Unbiased Estimator (UMVUE). An estimator T is said to be the best unbiased estimator of τ(θ) if it satisfies two conditions: (a) it is unbiased, i.e. E(T) = τ(θ), and (b) it has the smallest variance, i.e. Var(T) ≤ Var(T*) for any other unbiased estimator T*. Note that T is often called the uniform minimum variance unbiased estimator (UMVUE) of τ(θ).

Cramér-Rao Lower Bound (CRLB). The Fisher information of the parameter θ is defined to be the function

$$I(\theta) = E\left\{ \left[ \frac{\partial}{\partial \theta} \ln f_X(x; \theta) \right]^2 \right\} = -E\left[ \frac{\partial^2}{\partial \theta^2} \ln f_X(x; \theta) \right].$$

Note that the asymptotic variance of the MLE is the inverse of the Fisher information. Now, let X1, X2, ..., Xn be a random sample from $f_X(x; \theta)$ and let $T(X_1, X_2, \dots, X_n)$ be any estimator of θ. Then the lower bound on the variance of the estimator (the CRLB) is given by

$$Var\left( T(X_1, X_2, \dots, X_n) \right) \ge \frac{\left[ \dfrac{d}{d\theta} E\left( T(X_1, X_2, \dots, X_n) \right) \right]^2}{n I(\theta)},$$

where I(θ) is the Fisher information of the parameter θ. For unbiased estimators, the CRLB is given by

$$Var\left( T(X_1, X_2, \dots, X_n) \right) \ge \frac{1}{n I(\theta)}.$$

Another version of the CRLB is the following: if X1, X2, ..., Xn is a random sample from $f_X(x; \theta)$ and $T(X_1, X_2, \dots, X_n)$ is an unbiased estimator of τ(θ), then

$$Var\left( T(X_1, X_2, \dots, X_n) \right) \ge \frac{\left[ \tau'(\theta) \right]^2}{n I(\theta)},$$

where τ′(θ) denotes the first derivative of τ with respect to θ.
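As a worked illustration (not spelled out in the notes, but using only the Poisson density), the Fisher information of the Poisson is $I(\lambda) = 1/\lambda$, and $\bar{X}$ attains the CRLB:

$$\ln f_X(x; \lambda) = -\lambda + x \ln \lambda - \ln x!, \qquad \frac{\partial^2}{\partial \lambda^2} \ln f_X(x; \lambda) = -\frac{x}{\lambda^2},$$

$$I(\lambda) = -E\left[ -\frac{X}{\lambda^2} \right] = \frac{E(X)}{\lambda^2} = \frac{1}{\lambda}, \qquad \text{so} \qquad \frac{1}{n I(\lambda)} = \frac{\lambda}{n} = Var(\bar{X}).$$

Since $\bar{X}$ is unbiased and its variance equals the bound, $\bar{X}$ is the UMVUE of λ.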

10. Consistency

A sequence of estimators {Tn} is a consistent sequence of estimators of the parameter θ if for every ε > 0 we have

$$\lim_{n \to \infty} \operatorname{Prob}\left( |T_n - \theta| < \varepsilon \right) = 1.$$

Equivalently, if {Tn} is a sequence of estimators of a parameter θ that satisfies the two conditions (a) $\lim_{n \to \infty} Var(T_n) = 0$ and (b) $\lim_{n \to \infty} Bias(T_n) = 0$, then it is a sequence of consistent estimators of θ.

Consistency of MLEs. Suppose X1, X2, ..., Xn is a random sample from $f_X(x; \theta)$, and let $\hat{\theta}$ be the MLE of θ. Then $\tau(\hat{\theta})$ is the MLE of any continuous function τ(θ), and under certain regularity conditions on $f_X(x; \theta)$, $\tau(\hat{\theta})$ is a consistent estimator of τ(θ).

11. Sufficient Statistics

Definition. A statistic T is said to be sufficient for θ if the conditional distribution of X1, X2, ..., Xn, given T = t, does not depend on θ for any value of t.

Refer to Example A, Rice, page 281.

Factorization Theorem. A necessary and sufficient condition for $T(X_1, \dots, X_n)$ to be a sufficient statistic for θ is that the joint probability function (density function or frequency function) factors in the form

$$f(x_1, \dots, x_n \mid \theta) = g\left[ T(x_1, \dots, x_n), \theta \right] h(x_1, \dots, x_n).$$

A consequence of this is that if T is sufficient for θ, then the MLE is a function of T.

Exponential Family of Distributions. A distribution is said to belong to the exponential family if its density or mass function is of the form

$$f(x \mid \theta) = \exp\left[ c(\theta) T(x) + d(\theta) + S(x) \right],$$

where the range of x does not depend on the parameter θ. Many special distributions belong to this family, including the normal, binomial, Poisson, and gamma. Show, for example, that the Poisson distribution belongs to this family.
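As a quick check of the exercise just stated (a short derivation, not written out in the notes), the Poisson mass function can be put in this form:

$$f(x \mid \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!} = \exp\left[ x \ln \lambda - \lambda - \ln x! \right],$$

so that $c(\lambda) = \ln \lambda$, $T(x) = x$, $d(\lambda) = -\lambda$, and $S(x) = -\ln x!$, and the range $x = 0, 1, 2, \dots$ does not depend on λ.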

Now, consider a random sample X1, X2, ..., Xn from an exponential family. Then its joint probability function is

$$f(x_1, \dots, x_n \mid \theta) = \prod_{k=1}^{n} \exp\left[ c(\theta) T(x_k) + d(\theta) + S(x_k) \right] = \exp\left[ c(\theta) \sum_{k=1}^{n} T(x_k) + n\, d(\theta) \right] \times \exp\left[ \sum_{k=1}^{n} S(x_k) \right],$$

where the first factor is of the form $g\left[ \sum_{k=1}^{n} T(x_k), \theta \right]$ and the second factor is $h(x_1, \dots, x_n)$. Therefore, by the factorization theorem, $\sum_{k=1}^{n} T(x_k)$ is a sufficient statistic.
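As a small numerical illustration (hypothetical counts, not part of the notes), the Poisson likelihood factors exactly as the theorem describes, with $g(t, \lambda) = e^{-n\lambda} \lambda^{t}$ depending on the data only through $t = \sum x_k$ and $h(x) = 1 / \prod x_k!$:

```python
import numpy as np
from scipy.special import factorial
from scipy.stats import poisson

x = np.array([1, 0, 2, 3, 1])                 # hypothetical Poisson counts
lam = 1.7                                     # any value of the parameter
n, t = len(x), x.sum()

joint = np.prod(poisson.pmf(x, lam))          # f(x_1, ..., x_n | lambda)
g = np.exp(-n * lam) * lam ** t               # g[T(x), lambda], with T(x) = sum of the observations
h = 1.0 / np.prod(factorial(x))               # h(x), free of the parameter
print(np.isclose(joint, g * h))               # True: the factorization holds
```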

The Rao-Blackwell Theorem. Let $\hat{\theta}$ be an estimator of θ with $E(\hat{\theta}^2) < \infty$ (i.e. finite) for all θ. Suppose that T is sufficient for θ. Define a new estimator as

$$\tilde{\theta} = E\big(\hat{\theta} \mid T\big).$$

Then, for all θ, this new estimator has an MSE no larger than that of $\hat{\theta}$; that is,

$$MSE\big(\tilde{\theta}\big) \le MSE\big(\hat{\theta}\big),$$

or equivalently,

$$E\big(\tilde{\theta} - \theta\big)^2 \le E\big(\hat{\theta} - \theta\big)^2.$$

Thus, we see from the Rao-Blackwell theorem that if an estimator is not a function of a sufficient statistic, it can be improved.

Example Applying the Rao-Blackwell Theorem. Consider the case where X1, X2, ..., Xn are i.i.d. U[0, θ], i.e. $f_X(x) = 1/\theta$ for $0 \le x \le \theta$. We showed that $\hat{\theta} = \max(x_1, x_2, \dots, x_n) = x_{(n)}$ is an MLE of θ. Note that the density of $T = X_{(n)}$ is

$$f_{X_{(n)}}(x) = \frac{n}{\theta^{n}}\, x^{n-1}, \qquad \text{for } 0 \le x \le \theta,$$

so that its mean,

$$E\left[ X_{(n)} \right] = \int_0^{\theta} x\, \frac{n}{\theta^{n}}\, x^{n-1}\, dx = \frac{n}{n+1}\, \theta,$$

shows it is not unbiased. But it can be shown that $T = X_{(n)}$ is a sufficient statistic. Consider the estimator $Y = 2X_1$, for which $E(Y) = E(2X_1) = \theta$, so Y is unbiased. Therefore, by the Rao-Blackwell theorem, we can improve on this statistic by considering

$$E(Y \mid T) = 2 E\big( X_1 \mid X_{(n)} = t \big).$$

Note that if $X_{(n)} = t$, then $X_1 = t$ with probability 1/n and $X_1 \sim U[0, t]$ with the remaining probability (n − 1)/n (this requires a little work to show; try it as an exercise). Therefore,

$$E(Y \mid T) = 2 E\big( X_1 \mid X_{(n)} = t \big) = 2\left[ \frac{1}{n}\, t + \frac{n-1}{n} \cdot \frac{t}{2} \right] = \frac{n+1}{n}\, t,$$

so that

$$\frac{n+1}{n}\, T = \frac{n+1}{n}\, X_{(n)}$$

is the UMVUE for θ.
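To see the improvement numerically (a simulation sketch with an assumed true θ, not part of the notes), both Y = 2X1 and the Rao-Blackwellized estimator $\frac{n+1}{n} X_{(n)}$ are unbiased, but the latter has a much smaller variance:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 10.0, 20, 10000

samples = rng.uniform(0, theta, size=(reps, n))
y = 2 * samples[:, 0]                              # crude unbiased estimator Y = 2*X1
rb = (n + 1) / n * samples.max(axis=1)             # Rao-Blackwellized estimator (n+1)/n * X_(n)

print(y.mean(), rb.mean())     # both are close to theta = 10 (unbiased)
print(y.var(), rb.var())       # the Rao-Blackwell estimator has a far smaller variance
```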