Bayesian Adaptive Bandwidth Kernel Density Estimation of Irregular Multivariate Distributions Shuowen Hu, D. S. Poskitt, Xibin Zhang∗ Department of Econometrics and Business Statistics, Monash University, Australia
Abstract In this paper, we propose a new methodology for multivariate kernel density estimation in which data are categorized into low- and high-density regions as an underlying mechanism for assigning adaptive bandwidths. We derive the posterior density of the bandwidth parameters via the Kullback-Leibler divergence criterion and use a Markov chain Monte Carlo (MCMC) sampling algorithm to estimate the adaptive bandwidths. The resulting estimator is referred to as the tail-adaptive density estimator. Monte Carlo simulation results show that the tail-adaptive density estimator outperforms the globalbandwidth density estimators implemented using different global bandwidth selection rules. The inferential potential of the tail-adaptive density estimator is demonstrated by employing the estimator to estimate the bivariate density of daily index returns observed from the USA and Australian stock markets. Keywords: marginal likelihood, Markov chain Monte Carlo, S&P500 index, value-at-risk
∗
Corresponding author. Email:
[email protected]. Address: 900 Dandenong Road, Caulfield East, Victoria 3145, Australia. Telephone: +61 3 99032130. Fax: +61 3 99032007. URL: http://users.monash.edu.au/~xzhang/
Preprint submitted to Computational Statistics and Data Analysis
July 25, 2011
1. Introduction Kernel density estimation is one of the most important techniques for understanding the distributional properties of data. It is understood that the effectiveness of such approach depends on the choice of a kernel function and the choice of a smoothing parameter or bandwidth (see for example, Izenman, 1991, for a discussion). Although the two issues cannot be treated separately, it is widely accepted that the performance of a kernel density estimator is mainly determined by the bandwidth (see for example, Scott, 1992; Wand and Jones, 1995), while the impact of kernel choices on the performance of the resulting density estimator was examined by Marron and Nolan (1988),Vieu (1999) and Horov´a et al. (2002). Most investigations have aimed at choosing a fixed or global bandwidth for a full sample of data (see Jones et al., 1996, for a survey). Terrell and Scott (1992) and Sain and Scott (1996) proposed the idea of data-driven adaptive bandwidth density estimation, which allows the bandwidth to vary at different data points. In this situation, bandwidth selection remains as an important issue and has been extensively investigated for univariate data. However, less attention has been paid to investigations of data-driven methods for estimating adaptive bandwidths in multivariate data. This motivates the investigation of this paper. Our investigation is also empirically motivated. Most financial analysts believe that during the course of the global financial crisis caused by the fallout of the USA sub-prime mortgage crisis, the USA stock market has had a leading effect on most other stock markets world wide. Using a kernel density estimator of bivariate stock-index returns we can derive the conditional distribution of the stock-index return in one market for a given value of the 2
stock-index return in the USA market, and therefore we can better understand how the former market was associated with the USA market. However, the marginal density of stock-index returns often exhibits leptokurtosis. Consequently, the kernel estimation of the bivariate density of stock-index returns may require different bandwidths to different groups of observed returns. 1.1. Some Background Let X = (X1 , X2 , · · · , Xd )0 denote a d-dimensional random vector with its density function f (x) defined on Rd (see J´acome et al., 2008, for an example of censored data). Let {x1 , x2 , · · · , xn } be a random sample drawn from f (x). The kernel density estimator of f (x) is (Wand and Jones, 1995) n n X 1X 1 fˆH (x) = KH (x − xi ) = K(H −1/2 (x − xi )), n i=1 n|H|1/2 i=1
(1)
where K(·) is a multivariate kernel, and H is a symmetric and positive definite d × d matrix known as the bandwidth matrix. To choose an optimal H, several methods have been discussed in literature (see for example, Devroye and Gy¨orfi, 1985; Marron, 1987). Marron (1992) proposed a bootstrapping approach to the approximation of the mean integrated squared error (MISE), and the normal reference rule (NRR) or equivalently the rule-of-thumb that minimizes the asymptotic MISE was discussed by Scott (1992). The least squares cross-validation was discussed by Sain et al. (1994) and Duong and Hazelton (2005), and a plug-in method was suggested by Wand and Jones (1994) and improved by Duong and Hazelton (2003). Moreover, Zhang et al. (2006) proposed a Bayesian approach to the estimation of the bandwidth matrix based on Kullback-Leibler information criterion.
3
However, a problem in using a global bandwidth is that the kernel methods often produce unsatisfactory results for complex or irregular densities. Sain and Scott (1996) presented a classic example, in which a bimodal mixture with two modes of equal height but different levels of variation, where an optimal global bandwidth tends to under-smooth the mode with a large variation and over-smooth the mode with a small variation. Therefore, it is necessary to let the bandwidth be adaptive to local observations; a relatively small bandwidth is needed where observations are densely distributed, and a large bandwidth is required where observations are sparsely distributed (see Jones, 1990; Wand and Jones, 1995; Sain, 2002, for similar arguments). Several versions of the adaptive bandwidth kernel density estimator have been studied in literature. Mielniczuk et al. (1989) proposed to use a weighting function on data point x in a global bandwidth estimator. A popular method is to make the bandwidth as a function of data points, and Nolan and Marron (1989) discussed this issue from a general Delta-sequence estimator perspective. Loftsgaarden and Quesenberry (1965) suggested to replace H in (1) is by H(x), which is the bandwidth at data point x. This estimator was studied by Cao (2001) and Sain and Scott (2002) and is also called the balloon estimator. However, the balloon estimator does not integrate to one and therefore is not a good choice for density estimation (Terrell and Scott, 1992; Izenman, 1991). The other estimator proposed by Breiman et al. (1977) is called the sample-point estimator, which employs H(xi ) as the bandwidth associated with sample point xi . Abramson (1982a,b) suggested to choose bandwidth as the inverse square root of f (xi ). As the sample-point estimator always integrate to one, we consider this estimator throughout this paper.
4
The difficulty of using sample-point estimator is that the number of bandwidths exceeds the sample size in multivariate data, and this makes difficulties in estimating bandwidths. Sain and Scott (1996) and Sain (2002) suggested grouping data into different bins and using a constant bandwidth for each bin. Although this method reduces the number of bandwidths, the number of bandwidths still grows exponentially with the dimension. 1.2. The Tail-Adaptive Density Estimator A simple method to control the number of bandwidths to be estimated for the sample-point kernel density estimator is to put the data into a small number of groups. In this paper, we propose dividing the observations into two regions, namely the low-density region (LDR) and high-density region (HDR), and assigning two different bandwidth matrices to these two regions. When the true density is unimodal, the low-density region is the tail area and should be assigned larger bandwidths than the high-density region. We call this type of kernel density estimator the tail-adaptive density estimator. It is not new to group the observations into the low- and high-density regions. Hartigan (1975, 1987) proposed clustering data into different regions with different density values, and Hyndman (1996) presented an algorithm for computing and graphing data in different density regions. A comprehensive review of applications related to the issue of low-and high-density regions was given in Mason and Polonik (2009). In terms of bandwidth selection, Samworth and Wand (2010) considered a bandwidth selection method for univariate high-density region estimation. To our knowledge, there is yet any other investigation on bandwidth selection for a general multivariate kernel density estimation, where two different bandwidth matrices are assigned to 5
observations in the low- and high-density regions. In this paper, we treat the elements of the two bandwidth matrices as parameters, whose posterior can be approximately derived through the Kullback-Leibler information. Therefore, bandwidths can be estimated through a posterior simulator. During the past decade, there have been several investigations on Bayesian approaches to bandwidth estimation for kernel density estimation (see for example, Brewer, 2000; Gangopadhyay and Cheung, 2002; Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang et al. (2006) derived the posterior of bandwidths for multivariate kernel density estimation with a global bandwidth matrix. Their Monte Carlo simulation results reveal the advantage of this Bayesian approach over its competitors including the plug-in method of Duong and Hazelton (2003) and the NRR. However, Hall (1987) showed that the use of a global bandwidth for a long-tailed distribution can mislead Kullback-Leibler information. Therefore, in this paper, we extent the sampling algorithm proposed by Zhang et al. (2006) by incorporating tail-adaptive bandwidth matrices into the multivariate kernel density estimation. The rest of this paper is organized as follows. In Section 2, we derive the posterior of the elements of the tail-adaptive bandwidth matrices describe an MCMC sampling algorithm. Sections 3 and 4 present the results of Monte Carlo simulation studies designed to examine the performance of the tailadaptive density estimator. In Section 5, we apply the tail-adaptive density estimator to the estimation of the bivariate density of two asset returns. Section 6 concludes the paper.
6
2. Bayesian estimation of bandwidths 2.1. Likelihood cross-validation The Kullback-Leibler information, which measures the discrepancy between a density estimator and its true density, is defined as Z Z ˆ dKL f (x), fH (x) = log{f (x)}f (x)dx − log{fˆH (x)}f (x)dx. (2) Rd
Rd
An optimal bandwidth could be derived by minimizing (2), and this is equivR alent to the maximization of Rd log{fˆH (x)}f (x)dx with respect to H. As this integral is the expectation of log{fˆH (x)} under f (x), it is approximated P by n−1 ni=1 log fˆH,i (xi ), where fˆH,i (xi ) =
n 1 X |H|−1/2 K H −1/2 (xi − xj ) , n − 1 j=1
(3)
j6=i
known as the leave-one-out estimator of f (xi ) (H¨ardle, 1991). The likelihood P cross-validation method is to maximize ni=1 log fˆH,i (xi ), which is regarded as the likelihood of {x1 , x2 , · · · , xn } for given H, with respect to H. The bandwidth matrix can be either a full or diagonal matrix. A full bandwidth matrix can reveal useful features of the resulting density estimator, while the implementation of some computing algorithms for bandwidth estimation is often very difficult especially when adaptive bandwidth matrices are used. The numerical result obtained by Sain (2002) showed that the density estimator with a full bandwidth matrix is not smooth in lowdensity regions. In this paper, we use a diagonal bandwidth matrix and let h = (h1 , h2 , . . . , hd )0 denote the vector of the square roots of the diagonal elements of the bandwidth matrix. h is also known as the bandwidth vector. 7
2.2. Tail-adaptive kernel density estimator The concept of grouping observations into low- and high-density regions has been discussed in literature (see for example, Hartigan, 1975, p205). We propose to group some observations into the low-density region, inside which each observation has a density no more than the density of each observation outside the region. Actually, our definition of the low-density region is in the same sense of Hyndman (1996) whose defined the highest density region. Let α be a threshold value that determines the proportion of the lowdensity region relative to the sample space. Let L(fα ) denote a subset of the sample space, so that the (100 × α)% low-density region is shown as L(fα ) = {x : f (x) ≤ fα }, where fα is the largest constant such that Pr{x ∈ L(fα )} ≤ α. Let
1 Ij = 0
if xj ∈ L(fα )
,
otherwise
for j = 1, 2, · · · , n. Let h(1) denote the bandwidth vector assigned to the observations in L(fα ) and h(0) the bandwidth vector assigned to those observations outside L(fα ). The kernel density estimator is n 1X ˆ Ij K (x − xj )./h(1) ./h(1) fh(1) ,h(0) (x) = n j=1 (0)
+ (1 − Ij )K (x − xj )./h
./h , (0)
(4)
where ./ denotes division by elements. The leave-one-out estimator is denoted as fˆh(1) ,h(0) ,i (xi ) for i = 1, 2, · · · , n. As the low-density region becomes the tail area when the underlying density is unimodal, we call (4) the tail-adaptive 8
estimator for simplicity. The value of α can be chosen as either 5% or 10%. As f (x) is unknown, the initial value of fα can be approximated through the kernel density estimator of f (x) using a global bandwidth. 2.3. Posterior of bandwidth parameters Given h(1) and h(0) , the approximate likelihood is (1)
(0)
`(x1 , x2 , . . . , xn |h , h ) =
n Y
fˆh(1) ,h(0) ,i (xi ).
(5)
i=1
As suggested by Zhang et al. (2006), we assume that the prior of each bandwidth to be the truncated Cauchy density 2
(l)
p(hk ) =
1+
(l) hk
(l)
×
(l) hk
, for hk > 0,
for k = 1, 2, · · · , d, and l = 0 and 1. The posterior of h(1) and h(0) for given {x1 , x2 , . . . , xn } is π(h(1) , h(0) |x1 , x2 , · · · , xn ) ∝
( n Y
) ( fˆh(1) ,h(0) ,i (xi ) ×
i=1
d Y
) (1) p(hk )
×
(0) p(hk )
.
k=1
(6) The posterior given by (6) is of non-standard form, and we cannot derive an analytical expression as the estimate of {h(1) , h(0) }. However, we can use the random-walk Metropolis algorithm to sample {h(1) , h(0) } from (6). The sampling procedure is as follows. 1) Obtain initial low-density regions for a given α based on kernel density estimator with a global bandwidth vector chosen via NRR. 2) Assign initial values to h(1) and h(0) , which are respectively, the bandwidth vectors given to observations within the low- and high-density regions specified in Step 1). 9
˜ denote the vector of all elements of h(1) and h(0) . Use the random3) Let h ˜ with the acceptance probability walk Metropolis algorithm to update h computed through the posterior given by (6). 4) Derive the low-density region according the density estimator with the bandwidth vectors updated in Step 3). ˜ achieves reason5) Repeat Steps 3) and 4) until the simulated chain of h able mixing performance. During the above iteration process, we discard the draws during the burn˜ thereafter. Let {h ˜ (1) , h ˜ (2) , · · · , h ˜ (M ) } in period, and record the draws of h denote the recorded draws. The posterior mean (or ergodic average) denoted P ˜ ˜ as M i=1 h(i) /M , is an estimate of h. 3. A Monte Carlo simulation study To investigate the performance of the proposed tail-adaptive kernel density estimator, we approximate Kullback-Leibler information via Monte Carlo simulation. A large number of random vectors are drawn from f (x) so as to approximate (2) numerically. A bandwidth estimation method is better than its competitor if the former produces a smaller Kullback-Leibler information than the former. 3.1. True densities We simulate samples from six target bivariate densities labeled A, B, C, D, E and F1 . These densities are of irregular shapes. Density A is a mixture of 1
Refer to Hu et al. (2010) for univariate target densities, their contour plots and the
simulation results.
10
two equally weighted normal densities with bimodality: fA (x|µ1 , Σ1 , µ2 , Σ2 ) = 1 φ (x|µ1 , Σ1 ) + 12 φ (x|µ2 , Σ2 ) , 2
where φ(x|µ, Σ) is a multivariate normal den-
sity with mean µ and variance-covariance matrix 1 ρ . Σ= ρ 1
(7)
We used µ1 = (−1.5, −1.5)0 , µ2 = (2, 2)0 , ρ = 0.3 for Σ1 and ρ = −0.9 for Σ2 . Density B is a mixture of two normal densities with different weights but an equal height at the modes: fB (x|µ1 , Σ1 , µ2 , Σ2 ) = 1 φ (x|µ2 , Σ2 ) , 4
3 φ (x|µ1 , Σ1 ) 4
+
where µ1 = (−1.5, −1.5)0 and µ2 = (1.5, 1.5)0 . Σ is defined in
(7) with ρ = 0.5 in Σ1 , and Σ2 = 13 Σ1 . Density C is a mixture of two skew-normal densities proposed by Azzalini and Valle (1996): fC (x|µ1 , γ1 , µ2 , γ2 , Σ) = 1 2
1 2
× 2φ(x|µ1 , Σ)Φ (γ10 (x − µ1 )) +
× 2φ(x|µ2 , Σ)Φ (γ20 (x − µ2 )), where Φ(·) is the cumulative density function
(CDF) of the standard normal distribution. We chose µ1 = (−0.5, −0.5)0 , γ1 = (−9, −9)0 , µ2 = (0, 0)0 , γ2 = (9, 9) and Σ is defined in (7) with ρ = 0.3. Note that γ1 , γ2 ∈ R2 are the shape parameters determining the skewness. Density D is a Student t density denoted as td (x|µ, Σ, ν), Density E is a mixture of two Student t densities with ν = 5: fE (x|µ1 , µ2 , Σ, ν) = 0.5 td (x|µ1 , Σ1 , ν) + 0.5 td (x|µ2 , Σ2 , ν) , where µ1 = (−2, 0)0 , µ2 = (2, 0)0 , and Σ1 and Σ2 are defined in (7) with ρ = −0.5 and 0.5, respectively. Density F is a skew-t density proposed by Azzalini and Capitanio (2003): ˜ + d), fF (x|µ, Σ, α, ν) = 2 td (x|µ, Σ, ν)Td (x|ν 11
˜ = γ 0 ω −1 (x − µ) where x
ν+d (x−µ)0 Σ−1 (x−µ)+ν
1/2
, µ = (0, 0)0 , γ = (−2, 0)0 , Σ
is an identity matrix. ω is a diagonal matrix with diagonal elements the same as those of Σ, and Td (·|ν + d) is the CDF of the Student t distribution with ν + d degrees of freedom. The contour plot of each of the six bivariate densities can be found in Figure 2 of Hu et al. (2010). 3.2. Accuracy of our Bayesian bandwidth estimation We generated samples of n = 500, 1000, 2000 from each of the six bivariate densities. The kernel function is the product of univariate Gaussian kernels. We estimated the bandwidth vectors for the proposed tail-adaptive kernel density estimator with α = 0.05 and 0.1. A global bandwidth vector is also estimated via the Bayesian method of Zhang et al. (2006) and NRR. We used the random-walk Metropolis algorithm to sample all bandwidths from their posterior. The burn-in period contains 3,000 iterations, and the following 10,000 iterations were recorded. We computed the batch-mean standard deviation discussed by Roberts (1996) and the simulation inefficient factor (SIF) discussed by Kim et al. (1998) to monitor the mixing performance. Both indicators show that the sampler has achieved a reasonable mixing performance. Refer to Hu et al. (2010) for graphs on visual inspection of the mixing performance. Consider a sample generated from fF (x) with α = 0.05 and sample size 1000. Table 1 presents the MCMC results, where the SIFs are very small and indicate a reasonable mixing performance of the proposed sampler. The tailadaptive density estimator clearly captures the fat-tailed feature of fF . For example, the estimates of both components of h(1) for observations inside the
12
low-density region are respectively, much larger than the estimates of both components of h(0) for observations outside this region. We generated N =100,000 random vectors from the true density and calculated the estimated Kullback-Leibler information defined by (2). As shown in Table 2, among all six densities considered, the tail-adaptive density estimator obviously performs better than the global-bandwidth density estimators with the bandwidth vector estimated through Bayesian sampling and NRR. The results also indicate that the performance of the tail-adaptive density estimator is not very sensitive to different values of α. The mean integrated squared error (MISE) was also used to examine the performance of tail-adaptive density estimator. We numerically approximate the MISE through 200 data sets for each the bivariate densities with sample size 500, 1000 and 2000. Table 3 shows that the lower-adaptive estimator always outperforms the global-bandwidth estimator. 4. Tail-adaptive density estimation for high dimensions Our proposed Bayesian sampling algorithm for estimating bandwidths in tail-adaptive kernel density estimation is applicable to data of any dimension. In this section, we aim to examine the performance of the tail-adaptive estimator in comparison with the global-bandwidth estimator. 4.1. True densities We consider four target densities labeled G, H, I and J. Density G is a mixture of two multivariate normal densities: 1 1 fG (x|µ1 , µ2 , Σ1 , Σ2 ) = φ (x|µ1 , Σ1 ) + φ (x|µ2 , Σ2 ) , 2 2 13
where µ1 = (−1.5, −1.5, −1.5, −1.5, −1.5)0 , µ2 = (2, 2, 2, 2, 2)0 , and both variance-covariance matrices have the form of 1 ρ ρ2 ρ3 ρ4 2 3 1 ρ 1 ρ ρ ρ , Σ= ... 1 − ρ2 ρ4 ρ3 ρ2 ρ 1
(8)
with ρ = 0.3 for Σ1 and ρ = −0.9 for Σ2 . Density H is a multivariate skew-normal densities: fH (x|µ, Σ, α) = 2φ (x|µ, Σ) Φ (γ 0 (x − µ)) , where Σ is given by (8) with ρ = 0.9, µ = (−0.5, −0.5, −0.5, −0.5, −0.5)0 , and γ = (−9, −9, −9, −9, −9)0 . Density I is a mixture of two multivariate Student t densities: fI (x|µ1 , µ2 , Σ1 , Σ2 , ν) = 0.5 td (x|µ1 , Σ1 , ν) + 0.5 td (x|µ2 , Σ2 , ν) , where µ1 = (−2, 0, −2, 0, −2)0 , µ2 = (2, 0, 2, 0, 2)0 , ν = 5, and both Σ1 and Σ1 are defined by (8) with ρ = −0.5 and ρ = 0.5, respectively. Density J is a multivariate skew-t densities: fJ (x|µ, Σ, α, ν) = 2td (x|µ, Σ, ν) Td (˜ x|ν + d) , where µ = 0, ν = 5, Σ is an identity matrix, and x ˜ is defined in a similar way as that in the bivariate fF (x) with γ = (2, 0, 2, 0, 2)0 . 4.2. Accuracy of our Bayesian bandwidth estimation We generated samples of sizes n = 500, 1000, 2000 from each of the fivedimensional densities. Table 4 presents the estimated Kullback-Leibler information between the true density and its estimator resulted from each of 14
the three bandwidth estimation methods. The kernel density estimator with tail-adaptive bandwidth vectors obviously outperforms its counterpart with a global bandwidth vector estimated via either the NRR or Bayesian sampling. This finding is consistent with what we found in the bivariate situation. For all sample sizes of each density, we found that the tail-adaptive kernel density estimator with α = 0.1 slightly outperforms the same estimator with α = 0.05. However, we would be reluctant to make a decision as to whether the former performs better than the latter because such a is marginal. Due to the heavy burden required by the numerical computation of MISE for five-dimensional densities, we have not computed MISE in this section. 5. An application of the tail-adaptive density estimator In this section, we apply the proposed tail-adaptive kernel density estimator to the estimation of bivariate density of stock-index returns. We downloaded the daily closing index values of the S&P 500 index in the USA stock market and the All Ordinaries (AORD) in the Australian stock market, from the 2nd January 2006 to the 16th September 2010 excluding nontrading days, with sample size 1155. The AORD return was matched to the overnight S&P 500 return. Let Pt denote the closing index at date t. The daily continuously compounded returns in percentage form was computed as (ln Pt − ln Pt−1 ) × 100. The sample period covers the period of current global financial crisis, and there were some extreme observations. The Pearson correlation coefficient between the two return series is 0.6171. We used the random-walk Metropolis algorithm to estimate bandwidths for the tail-adaptive kernel density estimator of the bivariate returns, with 15
α = 0.05. We also estimated the global bandwidth via the NRR and Bayesian sampling. The results are given in Table 5, where the SIFs indicate reasonable mixing performance achieved by both sampling algorithms. Moreover, the use of tail-adaptive bandwidths leads to a much larger marginal likelihood than the use of a global bandwidth in this bivariate kernel density estimator, where the marginal likelihoods were calculated through Newton and Raftery (1994). Thus, the use of tail-adaptive bandwidths is strongly favored against the use of a global bandwidth. The density surface and contour plot for each of the three density estimators are presented in Hu et al. (2010), where the tail-adaptive estimator is shown to be able to capture richer dynamics than the other tow. Let xt and yt denote the S&P 500 and AORD daily returns, respectively. We used the estimated tail-adaptive bandwidths to compute the conditional density of yt at a given value of xt through f (y|xt = x) = f (y, x)/fx (x), where f (y, x) is the joint density of (yt , xt ), and fx (x) is the marginal density of xt . Bandwidths estimated for a joint density can be used for the purpose to compute conditional density (Holmes et al., 2010; Polak et al., 2010). We computed the conditional density of xt given that yt is at the 10th, 5th and 1st percentiles, which are corresponding to percentage return values of -0.75, -1.13, and -2.24, respectively. We are also able to compute the conditional probability Pr{yt ≤ y|xt ≤ x} = Pr{yt ≤ y, xt ≤ x}/Pr{xt ≤ x}. For example, Pr{yt ≤ 0|xt ≤ 0} = 0.67. It means that when the USA stock market finished daily trading with a negative return, there was a 67% chance that the Australian market would also drop. Such a percentage suggests that the Australian market followed
16
the USA market during this period. However, this result should not be interpreted as blaming the problems in Australian stock market to the USA stock market, while this is simply an evidence of market dependence. In addition, we can estimate the conditional CDF of yt for given xt = x: Z y f (z, x) F (y|xt = x) = Pr{yt ≤ y|xt = x} = dz. (9) −∞ fx (x) Given that the USA market finished daily trading with the S&P 500 return being at x%, the probability that the Australian market dropped beyond the same level is indicated by letting y = x in (9). At the above-mentioned percentiles of the S&P 500 return, the probability that the AORD return were below the same levels are 0.27, 0.24 and 0.12, respectively. It indicates that the probability that the Australian market had a larger drop than the USA market was no more than 27%. See Hu et al. (2010) for the graphs of the conditional density and its CDF. Moreover, we evaluated the performance of our VaR in comparison to the VaR obtained through IGARCH model (or the Riskmetrics). We followed the steps described in Bao et al. (2006) and calculated the check function of Koenker and Bassett Jr (1978). The existing sample was used as the learning set, and we forecasted daily 5% VaR of AORD from 17th September 2010 to 5th May 2011. The check function computed by IGARCH and our proposed method are 0.0356 and 0.0344, respectively. This finding indicates that our approach to VaR estimation is more effective than the Riskmetrics. 6. Conclusion This paper proposes a kernel density estimator with tail-adaptive bandwidths, which are assigned to the low- and high-density regions, respectively. 17
We have derived the posterior of bandwidth parameters based on KullbackLeibler information and presented an MCMC sampling algorithm to estimate bandwidths. The Monte Carlo simulation study shows that the tail-adaptive kernel density estimator outperforms its competitor, the global-bandwidth kernel density estimator. The simulation result also shows that the improvement made by the tail-adaptive kernel density estimator is especially obvious when the underlying density is fat-tailed. Although the probability of the low-density region has to be chosen before sampling is carried out, we have found that performance the tail-adaptive kernel estimator is not sensitive to different values of this probability. Future search could include this probability as a parameter to be estimated via the same sampling procedure. We applied the tail-adaptive kernel density estimator to the estimation of bivariate density of the paired daily returns of the Australian Ordinary and S&P 500 indices during a period that covers the global financial crisis. The tail-adaptive density estimator is strongly favored against the globalbandwidth density estimator. We derived the estimated conditional density and distribution of the Australian index return given that the USA market finished daily trading with different return values. Although the Australian stock market followed the USA market during the crisis, there was no more than 27% chance that the former market had a larger drop than the latter. Our approach can be viewed as mode estimation in clustering analysis. Even though our algorithm can be implemented to data of any dimension, the curse of dimensionality is a major concern for kernel density estimation. Ferraty and Vieu (2006) suggested attacking the dimensionality problem from a functional setting even for data of infinite dimensions. It might be possi-
18
ble to explore whether the proposed Bayesian approach is applicable in this situation. We leave it for future research. Acknowledgements We extend our sincere thanks to the Co-Editor Stanley Azen, an associate editor and two reviewers for their very insightful comments that have led to a substantially improved paper. Thanks also go to the Victorian Partnership for Advanced Computing (VPAC) for its quality computing facility. References Abramson, I., 1982a. Arbitrariness of the pilot estimator in adaptive kernel methods. Journal of Multivariate Analysis 12, 562–567. Abramson, I., 1982b. On bandwidth variation in kernel estimates — a square root law. The Annals of Statistics 10, 1217–1223. Azzalini, A., Capitanio, A., 2003. Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. Journal of the Royal Statistical Society, Series B 65, 367–389. Azzalini, A., Valle, A., 1996. The multivariate skew-normal distribution. Biometrika 83, 715–726. Bao, Y., Lee, T., Saltoglu, B., 2006. Evaluating predictive performance of value-at-risk models in emerging markets: a reality check. Journal of Forecasting 25, 101–128.
19
Breiman, L., Meisel, W., Purcell, E., 1977. Variable kernel estimates of multivariate densities. Technometrics 19, 135–144. Brewer, M., 2000. A Bayesian model for local smoothing in kernel density estimation. Statistics and Computing 10, 299–309. Cao, R., 2001. Relative efficiency of local bandwidths in kernel density estimation. Statistics 35, 113–137. Devroye, L., Gy¨orfi, L., 1985. Nonparametric Density Estimation: The L1 View. John Wiley & Sons, New York. Duong, T., Hazelton, M., 2003. Plug-in bandwidth matrices for bivariate kernel density estimation. Journal of Nonparametric Statistics 15, 17–30. Duong, T., Hazelton, M., 2005. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics 32, 485–506. Ferraty, F., Vieu, P., 2006. Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York. Gangopadhyay, A., Cheung, K., 2002. Bayesian approach to the choice of smoothing parameter in kernel density estimation. Journal of Nonparametric Statistics 14, 655–664. Hall, P., 1987. On Kullback-Leibler loss and density estimation. The Annals of Statistics 15, 1491–1519. H¨ardle, W., 1991.
Smoothing Techniques: with Implementation in S.
Springer, New York. 20
Hartigan, J.A., 1975. Clustering Algorithms. John Wiley & Sons, New York. Hartigan, J.A., 1987. Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association 82, 267–270. Holmes, M.P., Gray, A.G., Isbell Jr, C.L., 2010. Fast kernel conditional density estimation: A dual-tree Monte Carlo approach. Computational Statistics & Data Analysis 54, 1707–1718. Horov´a, I., Vieu, P., Zelinka, H., 2002. Optimal choice of nonparametric estimaties of a density and of its derivatives. Statistics & Decisions 20, 355–378. Hu, S., Poskitt, D., Zhang, X., 2010. Bayesian Adaptive Bandwidth Kernel Density Estimation of Irregular Multivariate Distributions. Monash Econometrics and Business Statistics Working Papers. Hyndman, R.J., 1996. Computing and graphing highest density regions. The American Statistician 50, 120–126. Izenman, A.J., 1991. Recent developments in nonparametric density estimation. Journal of the American Statistical Association 86, 205–224. J´acome, M., Gijbels, I., Cao, R., 2008. Comparison of presmoothing methods in kernel density estimation under censoring. Computational Statistics 23, 381–406. Jones, M.C., 1990. Variable kernel density estimates and variable kernel density estimates. Australian & New Zealand Journal of Statistics 32, 361–371. 21
Jones, M.C., Marron, J.S., Sheather, S.J., 1996. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91, 401–407. Kim, S., Shepherd, N., Chib, S., 1998. Stochastic volatility: Likelihood inference and comparison with ARCH models. Review of Economic Studies 65, 361–393. Koenker, R., Bassett Jr, G., 1978. Regression quantiles. Econometrica 46, 33–50. Kulasekera, K.B., Padgett, W.J., 2006. Bayes bandwidth selection in kernel density estimation with censored data. Journal of Nonparametric Statistics 18, 129–143. de Lima, M., Atuncar, G., 2010. A bayesian method to estimate the optimal bandwidth for multivariate kernel estimator. Journal of Nonparametric Statistics 23, 137–148. Loftsgaarden, D.O., Quesenberry, C.P., 1965. A nonparametric estimate of a multivariate density function. The Annals of Mathematical Statistics 36, 1049–1051. Marron, J., 1992. Bootstrap bandwidth selection, in: LePage, R., Billard, L. (Eds.), Exploring the Limits of Boostrap. John Wiley & Sons, New York, pp. 249–262. Marron, J.S., 1987. A comparison of cross-validation techniques in density estimation. The Annals of Statistics 15, pp. 152–162. 22
Marron, J.S., Nolan, D., 1988. Canonical kernels for density estimation. Statistics & Probability Letters 7, 195 – 199. Mason, D.M., Polonik, W., 2009. Asymptotic normality of plug-in level set estimates. The Annals of Applied Probability 19, 1108–1142. Mielniczuk, J., Sarda, P., Vieu, P., 1989. Local data-driven bandwidth choice for density estimation. Journal of Statistical Planning and Inference 23, 53–69. Newton, M.A., Raftery, A.E., 1994. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B 56, 3–48. Nolan, D., Marron, J., 1989. Uniform consistency of automatic and locationadaptive delta-sequence estimators. Probability Theory and Related Fields 80, 619–632. Polak, J., Zhang, X., King, M.L., 2010. Bandwidth selection for kernel conditional density estimation using the MCMC method. Manuscript presented at Australian Statistical Conference, 6-10 December, Western Australia. Roberts, G.O., 1996. Markov chain concepts related to sampling algorithms, in: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (Eds.), Markov Chain Monte Carlo in Practice. Chapman & Hall, London, pp. 45–57. Sain, S.R., 2002. Multivariate locally adaptive density estimation. Computational Statistics & Data Analysis 39, 165–186.
23
Sain, S.R., Baggerly, K.A., Scott, D.W., 1994. Cross-validation of multivariate densities. Journal of the American Statistical Association 89, 807–817. Sain, S.R., Scott, D.W., 1996. On locally adaptive density estimation. Journal of the American Statistical Association 91, 1525–1534. Sain, S.R., Scott, D.W., 2002. Zero-bias locally adaptive density estimators. Scandinavian Journal of Statistics 29, 441–460. Samworth, R.J., Wand, M.P., 2010. Asymptotics and optimal bandwidth selection for highest density region estimation. The Annals of Statistics 38, 1767–1792. Scott, D.W., 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, New York. Terrell, G.R., Scott, D.W., 1992. Variable kernel density estimation. The Annals of Statistics 20, 1236–1265. Vieu, P., 1999. Multiple kernel procedure: An asymptotic support. Scandinavian Journal of Statistics 26, 61–72. Wand, M.P., Jones, M.C., 1994. Multivariate plug-in bandwidth selection. Computational Statistics 9, 97–116. Wand, M.P., Jones, M.C., 1995. Kernel Smoothing. Chapman & Hall, New York. Zhang, X., King, M.L., Hyndman, R.J., 2006. A Bayesian approach to bandwidth selection for multivariate kernel density estimation. Computational Statistics & Data Analysis 50, 3009–3031. 24
Table 1: MCMC results obtained based on a sample generated from density F Bandwidths Mean Standard Batch-mean
SIF
deviation standard deviation
rate
(1)
1.1121 0.3184
0.0157
24.32
h2
(1)
1.6432 0.3816
0.0164
18.57
(0) h1 (0) h2
0.2505 0.0469
0.0019
17.13
0.4196 0.0675
0.0018
7.35
h1
Acceptance
0.28
Table 2: Estimated Kullback-Leibler information for bivariate densities
Density fA
fB
fC
fD
fE
fF
Global-bandwidth
Tail-adaptive bandwidth
n
NRR
Bayesian
α = 0.05
α = 0.10
500
0.2878
0.0858
0.0772
0.0748
1000
0.2382
0.0617
0.0498
0.0467
2000
0.1981
0.0402
0.0339
0.0338
500
0.1201
0.0499
0.0444
0.0442
1000
0.0826
0.0349
0.0332
0.0337
2000
0.0653
0.0256
0.0219
0.0217
500
0.1126
0.0930
0.0783
0.0768
1000
0.0924
0.0689
0.0559
0.0558
2000
0.0900
0.0648
0.0497
0.0498
500
0.1171
0.0946
0.0464
0.0449
1000
0.0809
0.0769
0.0286
0.0312
2000
0.0590
0.0565
0.0242
0.0270
500
0.1436
0.1072
0.0623
0.0530
1000
0.1038
0.1088
0.0328
0.0397
2000
0.0782
0.0666
0.0262
0.0282
500
0.1169
0.1641
0.0520
0.0545
1000
0.0781
0.0657
0.0261
0.0306
2000
0.0708
0.0637
0.0237
0.0242
25
Table 3: Estimated MISE (×100) for bivariate densities
Density fA
fB
fC
fD
fE
fF
Global-bandwidth
Tail-adaptive bandwidth
n
NRR
Bayesian
α = 0.05
α = 0.10
500
1.8482
0.4945
0.4415
0.4401
1000
1.5546
0.3354
0.2920
0.2901
2000
1.2875
0.2230
0.1875
0.1854
500
0.6526
0.2718
0.2612
0.2606
1000
0.4927
0.1828
0.1738
0.1727
2000
0.3595
0.1164
0.1101
0.1089
500
0.8812
0.6022
0.5422
0.5422
1000
0.7059
0.4208
0.3694
0.3804
2000
0.5589
0.2868
0.2600
0.2692
500
0.3250
0.6573
0.2782
0.2638
1000
0.2152
0.5121
0.1796
0.1737
2000
0.1489
0.3915
0.1219
0.1245
500
0.4022
0.3722
0.1919
0.1822
1000
0.3039
0.2980
0.1279
0.1242
2000
0.2207
0.2283
0.0840
0.0844
500
0.3331
0.7560
0.3092
0.2966
1000
0.2229
0.6150
0.2054
0.2022
2000
0.1496
0.4856
0.1386
0.1437
26
Table 4: Estimated Kullback-Leibler information for 5-dimensional densities
Density fG
fH
fI
fJ
Global-bandwidth
Tail-adaptive bandwidth
n
NRR
Bayesian
α = 0.05
α = 0.10
500
0.8923
0.4280
0.4026
0.4004
1000
0.7705
0.3093
0.2848
0.2825
2000
0.6933
0.2489
0.2343
0.2300
500
0.4559
0.3438
0.3212
0.3179
1000
0.4041
0.2892
0.2613
0.2582
2000
0.3355
0.2226
0.2033
0.1987
500
0.5943
0.5674
0.3446
0.3187
1000
0.4994
0.4814
0.2891
0.2666
2000
0.4395
0.4255
0.2274
0.2072
500
0.6107
0.5755
0.3226
0.3033
1000
0.5969
0.4415
0.2538
0.2284
2000
0.5050
0.3937
0.1971
0.1773
Table 5: A summary of MCMC results obtained through our proposed Bayesian sampling algorithm to the tail-adaptive kernel density estimator of the S&P500 and AORD returns Bandwidths Mean
Standard SIF Acceptance log marginal deviation
NRR
h1
0.2171
h2
0.1783
Bayesian global
h1
0.1795
0.0113
5.63
bandwidth
h2
0.2485
0.0121
5.89
Tail-adaptive
h1
(1)
0.5533
0.2217
39.30
bandwidth
(1) h2 (0) h1 (0) h2
0.1221
0.0161
15.39
0.5552
0.1140
19.97
0.1547
0.0174
13.55
with α = 0.05
27
rate
likelihood
0.21
-1719.64
0.27
-1657.14