Machine Learning, vol.86, no.3, pp.335–367, 2012.
Statistical Analysis of Kernel-Based Least-Squares Density-Ratio Estimation

Takafumi Kanamori, Nagoya University, Nagoya, Japan ([email protected])
Taiji Suzuki, University of Tokyo, Tokyo, Japan ([email protected])
Masashi Sugiyama, Tokyo Institute of Technology, Tokyo, Japan ([email protected], http://sugiyama-www.cs.titech.ac.jp/~sugi/)
Abstract

The ratio of two probability densities can be used for solving various machine learning tasks such as covariate shift adaptation (importance sampling), outlier detection (likelihood-ratio test), feature selection (mutual information), and conditional probability estimation. Several methods of directly estimating the density ratio have recently been developed, e.g., moment matching estimation, maximum-likelihood density-ratio estimation, and least-squares density-ratio fitting. In this paper, we propose a kernelized variant of the least-squares method for density-ratio estimation, which is called kernel unconstrained least-squares importance fitting (KuLSIF). We investigate its fundamental statistical properties, including a non-parametric convergence rate, an analytic-form solution, and a leave-one-out cross-validation score. We further study its relation to other kernel-based density-ratio estimators. In experiments, we numerically compare various kernel-based density-ratio estimation methods, and show that KuLSIF compares favorably with other approaches.
1 Introduction
The problem of estimating the ratio of two probability densities is attracting a great deal of attention these days, since the density ratio can be used for various purposes (Sugiyama et al., 2009; Sugiyama et al., 2012), such as covariate shift adaptation (Shimodaira, 2000; Zadrozny, 2004; Sugiyama & Müller, 2005; Huang et al., 2007; Sugiyama et al., 2007; Bickel et al., 2009; Quiñonero-Candela et al., 2009; Sugiyama & Kawanabe, 2011), outlier detection (Hido et al., 2008; Smola et al., 2009; Kawahara & Sugiyama, 2011; Hido et al., 2011), divergence estimation (Nguyen et al., 2010; Suzuki et al., 2008; Suzuki et al., 2009), and conditional probability estimation (Sugiyama et al., 2010; Sugiyama, 2010).

A naive approach to density-ratio estimation is to first separately estimate two probability densities (corresponding to the numerator and the denominator of the ratio), and then take the ratio of the estimated densities. However, density estimation is known to be a hard problem, particularly in high-dimensional cases, unless we have simple and good parametric density models (Vapnik, 1998; Härdle et al., 2004; Kanamori et al., 2010), which may not be the case in practice. For reliable statistical inference, it is therefore important to develop methods that directly estimate the density ratio without going through density estimation.

In the context of case-control studies, Qin (1998) proposed a direct method of estimating the density ratio by matching moments of the two distributions. Another density-ratio estimation approach uses the M-estimator (Nguyen et al., 2010) based on the non-asymptotic variational characterization of the f-divergence (Ali & Silvey, 1966; Csiszár, 1967). See also Sugiyama et al. (2008a) for a similar algorithm using the Kullback-Leibler divergence. Kanamori et al. (2009) developed a squared-loss version of the M-estimator for linear density-ratio models, called unconstrained Least-Squares Importance Fitting (uLSIF), and showed that uLSIF possesses superior computational properties: a closed-form solution is available and the leave-one-out cross-validation score can be computed analytically. As another approach, one can use logistic regression for the inference of density ratios, since the ratio of two probability densities is directly connected to the posterior probability of labels in classification problems. Using the Bayes formula, the estimated posterior probability can be transformed into an estimator of density ratios (Bickel et al., 2007).

Various kernel-based approaches are also available for density-ratio estimation. The kernel mean matching (KMM) method (Gretton et al., 2009) directly gives estimates of the density ratio by matching the two distributions using universal reproducing kernel Hilbert spaces (Steinwart, 2001). KMM can be regarded as a kernelized variant of Qin's moment matching estimator (Qin, 1998). Nguyen's approach based on the M-estimator (Nguyen et al., 2010) also has a kernelized variant. Non-parametric convergence properties of the M-estimator in reproducing kernel Hilbert spaces have been elucidated under the Kullback-Leibler divergence (Nguyen et al., 2010; Sugiyama et al., 2008b). For density-ratio estimation, one can also apply kernel logistic regression (Wahba et al., 1993; Zhu & Hastie, 2001), instead of conventional linear logistic models, for the inference of the posterior distribution in classification problems.
In this paper, we first propose a kernelized variant of uLSIF (called KuLSIF), and show that the solution of KuLSIF, as well as its leave-one-out cross-validation score, can be computed analytically, as for the original uLSIF with linear models. We then elucidate the statistical consistency and convergence rate of KuLSIF based on arguments involving non-parametric bounds (van de Geer, 2000; Nguyen et al., 2010). We further study the relation between KuLSIF and other kernel-based density-ratio estimators. Finally, the statistical performance of KuLSIF is numerically compared with that of other kernel-based density-ratio estimators in experiments.

The rest of this paper is organized as follows. In Section 2, we formulate the problem of density-ratio estimation and briefly review the existing least-squares method. In Section 3, we describe the kernelized variant of uLSIF and show its statistical properties, such as the convergence rate and the availability of the analytic-form solution and the analytic-form leave-one-out cross-validation score. In Section 4, we investigate the relation between KuLSIF and other kernel-based density-ratio estimators. In Section 5, we experimentally investigate the computational efficiency and statistical performance of KuLSIF. Finally, in Section 6, we conclude by summarizing our contributions and indicating possible future directions. Detailed proofs and calculations are deferred to the Appendix. In a companion paper (Kanamori et al., 2011), the computational properties of the KuLSIF method are further investigated from the viewpoint of condition numbers.
2 Estimation of Density Ratios
In this section, we formulate the problem of density-ratio estimation and briefly review the least-squares density-ratio estimator.
2.1 Formulation and Notations
Consider two probability distributions P and Q on a probability space Z. Assume that the distributions P and Q have the probability densities p and q, respectively. We assume p(x) > 0 for all x ∈ Z. Suppose that we are given two sets of independent and identically distributed (i.i.d.) samples,
$$X_1, \ldots, X_n \overset{\mathrm{i.i.d.}}{\sim} P, \qquad Y_1, \ldots, Y_m \overset{\mathrm{i.i.d.}}{\sim} Q. \tag{1}$$
Our goal is to estimate the density ratio
$$w_0(x) = \frac{q(x)}{p(x)} \;(\geq 0)$$
based on the observed samples.

We summarize some notations to be used throughout the paper. For two integers n and m, n ∧ m denotes min{n, m}. For a vector a in the Euclidean space, ∥a∥ denotes the Euclidean norm. Given a probability distribution P and a random variable h(X), we denote the expectation of h(X) under P by $\int h\, dP$ or $\int h(x)\, P(dx)$. Let ∥·∥∞ be the infinity norm, and let ∥·∥P be the L2-norm under the probability P, i.e., $\|h\|_P^2 = \int |h|^2\, dP$. For a reproducing kernel Hilbert space (RKHS) H (Aronszajn, 1950), the inner product and the norm on H are denoted as ⟨·, ·⟩H and ∥·∥H, respectively. Below we review the least-squares approach to density-ratio estimation proposed by Kanamori et al. (2009).
2.2 Least-Squares Approach
The linear model
$$\hat{w}(x) = \sum_{i=1}^{B} \alpha_i h_i(x) \tag{2}$$
is assumed for the estimation of the density ratio w0, where the coefficients α1, ..., αB are the parameters of the model. The basis functions hi, i = 1, ..., B, are chosen so that the non-negativity condition hi(x) ≥ 0 is satisfied. A practical choice is the Gaussian kernel function $h_i(x) = e^{-\|x - c_i\|^2/(2\sigma^2)}$ with an appropriate kernel center ci ∈ Z and kernel width σ (Sugiyama et al., 2008a).

Unconstrained least-squares importance fitting (uLSIF) (Kanamori et al., 2009) estimates the parameter α based on the squared error
$$\frac{1}{2}\int (\hat{w} - w_0)^2\, dP = \frac{1}{2}\int \hat{w}^2\, dP - \int \hat{w}\, dQ + \frac{1}{2}\int w_0^2\, dP.$$
The last term in the above expression is a constant and can be safely ignored when minimizing the squared error of the estimator $\hat{w}$. Therefore, the solution of the following minimization problem over the linear model (2),
$$\min_{w}\; \frac{1}{2n}\sum_{i=1}^{n} (w(X_i))^2 - \frac{1}{m}\sum_{j=1}^{m} w(Y_j) + \lambda \cdot \mathrm{Reg}(\alpha), \tag{3}$$
is expected to approximate the true density-ratio w0, where the regularization term Reg(α) with the regularization parameter λ is introduced to avoid overfitting. Let $\hat{\alpha}$ be the optimal solution of (3) under the linear model (2). Then the estimator of w0 is given as $\hat{w}(x) = \sum_{i=1}^{B} \hat{\alpha}_i h_i(x)$. There are several ways to impose the non-negativity condition $\hat{w}(x) \geq 0$ (Kanamori et al., 2009). Here, truncation of $\hat{w}$, defined as $\hat{w}_+(x) = \max\{\hat{w}(x), 0\}$, is used to ensure the non-negativity of the estimator.

It is worthwhile to point out that uLSIF can be regarded as an example of the M-estimator (Nguyen et al., 2010) with the quadratic loss function, i.e., $\phi^*(f) = f^2/2$ in Nguyen's notation. Nguyen's M-estimator is constructed based on the f-divergence between Q and P. Due to the asymmetry of the f-divergence, the estimation error evaluated with
respect to P rather than Q is obtained in uLSIF. uLSIF has a computational advantage over other M-estimators: when Reg(α) = ∥α∥²/2, the estimator $\hat{\alpha}$ can be obtained in an analytic form. As a result, the leave-one-out cross-validation (LOOCV) score can also be computed in a closed form (Kanamori et al., 2009), which allows us to compute the LOOCV score very efficiently. LOOCV is an (almost) unbiased estimator of the prediction error and can be used for determining hyper-parameters such as the regularization parameter and the Gaussian kernel width. In addition, for the L1-regularization $\mathrm{Reg}(\alpha) = \sum_{i=1}^{B} |\alpha_i|$, Kanamori et al. (2009) applied a path-following algorithm to the regularization parameter, which substantially reduces the computational cost in the model selection phase.
3 Kernel uLSIF
The purpose of this paper is to show that a kernelized variant of uLSIF (which we refer to as kernel uLSIF, or KuLSIF) has good theoretical properties and is thus useful. In this section, we formalize the KuLSIF algorithm and show its fundamental statistical properties.
3.1 uLSIF on RKHS
We assume that the model for the density ratio is an RKHS H endowed with a kernel function k on Z × Z, and we consider the optimization problem (3) on H. Then, the estimator $\hat{w}$ is obtained as an optimal solution of
$$\min_{w \in H}\; \frac{1}{2n}\sum_{i=1}^{n} (w(X_i))^2 - \frac{1}{m}\sum_{j=1}^{m} w(Y_j) + \frac{\lambda}{2}\|w\|_H^2, \tag{4}$$
where the regularization term $\frac{\lambda}{2}\|w\|_H^2$ with the regularization parameter λ (≥ 0) is introduced to avoid overfitting. We may also consider the truncated estimator $\hat{w}_+ = \max\{\hat{w}, 0\}$. The estimator based on the loss function (4) is called KuLSIF.

The computation of KuLSIF can be carried out efficiently. For an infinite-dimensional H, the problem (4) is an infinite-dimensional optimization problem. The representer theorem (Kimeldorf & Wahba, 1971), however, is applicable to RKHSs, which allows us to transform the infinite-dimensional optimization problem into a finite-dimensional one. Let K11, K12, and K21 be the sub-matrices of the Gram matrix:
$$(K_{11})_{ii'} = k(X_i, X_{i'}), \qquad (K_{12})_{ij} = k(X_i, Y_j), \qquad K_{21} = K_{12}^{\top},$$
where i, i′ = 1, ..., n and j = 1, ..., m. Then, a detailed analysis leads to the specific form of the solution as follows.

Theorem 1 (Analytic Solution of KuLSIF). Suppose λ > 0. Then, the KuLSIF estimator is given as
$$\hat{w}(z) = \sum_{i=1}^{n} \bar{\alpha}_i k(z, X_i) + \frac{1}{m\lambda}\sum_{j=1}^{m} k(z, Y_j).$$
The coefficients $\bar{\alpha} = (\bar{\alpha}_1, \ldots, \bar{\alpha}_n)^{\top}$ are given by the solution of the linear equation
$$\left(\frac{1}{n}K_{11} + \lambda I_n\right)\alpha = -\frac{1}{nm\lambda}K_{12}\mathbf{1}_m, \tag{5}$$
where $I_n$ is the n by n identity matrix and $\mathbf{1}_m$ is the column vector defined as $\mathbf{1}_m = (1, \ldots, 1)^{\top} \in \Re^m$.

The proof is deferred to Appendix A. Theorem 1 implies that it is sufficient to find the n variables $\bar{\alpha}_1, \ldots, \bar{\alpha}_n$ to obtain the estimator $\hat{w}$, and that the estimator has an analytic-form solution. Theorem 1 also guarantees that the parameters of the KuLSIF estimator are obtained as the solution of the following optimization problem:
$$\min_{\alpha \in \Re^n}\; \frac{1}{2}\alpha^{\top}\left(\frac{1}{n}K_{11} + \lambda I_n\right)\alpha + \frac{1}{nm\lambda}\mathbf{1}_m^{\top} K_{21}\alpha, \tag{6}$$
where we used the fact that the solution of $Ax = b$ is given as the minimizer of $\frac{1}{2}x^{\top}Ax - b^{\top}x$ when A is positive semidefinite. When the sample size n is large, numerically optimizing the quadratic function in (6) can be computationally more efficient than directly solving the linear equation (5). In Section 5, numerical experiments are carried out to investigate the computational efficiency of KuLSIF.
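For concreteness, here is a minimal R sketch of the analytic KuLSIF solution of Theorem 1 and Eq. (5), using a Gaussian kernel. This is our own illustration; the function names (kulsif_fit, kulsif_predict) and the choice of kernel are assumptions, not code from the paper.

```r
# Gaussian kernel matrix between the rows of A and the rows of B
gauss_kernel <- function(A, B, sigma) {
  sq <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)   # squared distances
  exp(-sq / (2 * sigma^2))
}

# KuLSIF: solve (K11/n + lambda I) alpha = -K12 1_m / (n m lambda), i.e., Eq. (5)
kulsif_fit <- function(X, Y, sigma, lambda) {
  n <- nrow(X); m <- nrow(Y)
  K11 <- gauss_kernel(X, X, sigma)
  K12 <- gauss_kernel(X, Y, sigma)
  alpha <- solve(K11 / n + lambda * diag(n),
                 -K12 %*% rep(1, m) / (n * m * lambda))
  list(X = X, Y = Y, sigma = sigma, lambda = lambda, alpha = alpha)
}

# w_hat_+(z) = max{ sum_i alpha_i k(z, X_i) + (1/(m lambda)) sum_j k(z, Y_j), 0 }
kulsif_predict <- function(fit, Z) {
  m <- nrow(fit$Y)
  w <- gauss_kernel(Z, fit$X, fit$sigma) %*% fit$alpha +
       gauss_kernel(Z, fit$Y, fit$sigma) %*% rep(1 / (m * fit$lambda), m)
  pmax(as.vector(w), 0)                     # truncation w_hat_+ = max{w_hat, 0}
}
```

The kernel width can be set by the median heuristic of Section 5.2 and the regularization parameter by λ = (n ∧ m)^{-0.9}, matching the settings used in the experiments.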
3.2 Leave-One-Out Cross-Validation
The leave-one-out cross-validation (LOOCV) score for the KuLSIF estimator can also be obtained analytically, as can the coefficient parameters of the kernel model. Let us measure the accuracy of the KuLSIF estimator $\hat{w}_+ = \max\{\hat{w}, 0\}$ by
$$\frac{1}{2}\int \hat{w}_+^2\, dP - \int \hat{w}_+\, dQ,$$
which is equal to the squared error of $\hat{w}_+$ up to a constant term. Then the LOOCV score of $\hat{w}_+$ under the squared error is defined as
$$\mathrm{LOOCV} = \frac{1}{n \wedge m}\sum_{\ell=1}^{n \wedge m}\left\{\frac{1}{2}\big(\hat{w}_+^{(\ell)}(x_\ell)\big)^2 - \hat{w}_+^{(\ell)}(y_\ell)\right\}, \tag{7}$$
where $\hat{w}_+^{(\ell)} = \max\{\hat{w}^{(\ell)}, 0\}$ is the estimator based on the training samples except $x_\ell$ and $y_\ell$.¹ The hyper-parameters achieving the minimum value of LOOCV are chosen. Thanks to the analytic-form solution shown in Theorem 1, the leave-one-out solution $\hat{w}^{(\ell)}$ can be computed efficiently from $\hat{w}$ by the use of the Sherman-Morrison-Woodbury formula (Golub & Van Loan, 1996). Details of the analytic LOOCV expression are presented in Appendix B.

¹ The indices of the removed samples could be different for x and y, i.e., $x_{\ell_1}$ and $y_{\ell_2}$ ($\ell_1 \neq \ell_2$) could be removed. For the sake of simplicity, however, we suppose that the samples $x_\ell$ and $y_\ell$ are removed together in the computation of LOOCV.
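As a naive illustration of the criterion (7), the LOOCV score can also be evaluated by simply refitting with each pair (x_ℓ, y_ℓ) held out; the analytic shortcut of Appendix B computes the same score without refitting. The sketch below is our own and assumes the kulsif_fit/kulsif_predict helpers from the previous section are in scope.

```r
# Naive LOOCV score (7): refit KuLSIF with (x_l, y_l) removed and average the losses.
kulsif_loocv <- function(X, Y, sigma, lambda) {
  L <- min(nrow(X), nrow(Y))
  scores <- sapply(1:L, function(l) {
    fit <- kulsif_fit(X[-l, , drop = FALSE], Y[-l, , drop = FALSE], sigma, lambda)
    wx  <- kulsif_predict(fit, X[l, , drop = FALSE])   # w_hat_+^(l)(x_l)
    wy  <- kulsif_predict(fit, Y[l, , drop = FALSE])   # w_hat_+^(l)(y_l)
    0.5 * wx^2 - wy
  })
  mean(scores)
}
# Model selection: choose the lambda (and kernel width) minimizing kulsif_loocv().
```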
3.3 Statistical Consistency of KuLSIF
The following theorem reveals the convergence rate of the KuLSIF estimator.

Theorem 2 (Convergence Rate of KuLSIF). Let Z be a probability space, and let H be the RKHS endowed with the kernel function k defined on Z × Z. Suppose that $\sup_{x \in Z} k(x, x) < \infty$, and that the bracketing entropy $H_B(\delta, H_M, P)$ is bounded above by $O((M/\delta)^{\gamma})$, where γ is a constant satisfying 0 < γ < 2 (see Appendix C for the detailed definition). Set the regularization parameter λ = λ_{n,m} so that²
$$\lim_{n,m \to \infty} \lambda_{n,m} = 0, \qquad \lambda_{n,m}^{-1} = O\big((n \wedge m)^{1-\delta}\big) \quad (n, m \to \infty),$$
where n ∧ m = min{n, m} and δ is an arbitrary number satisfying 1 − 2/(2 + γ) < δ < 1. Then, for q/p = w0 ∈ H, we have
$$\|\hat{w}_+ - w_0\|_P \leq \|\hat{w} - w_0\|_P = O_p\big(\lambda_{n,m}^{1/2}\big),$$
where ∥·∥P is the L2-norm under the probability P.

The proof is available in Appendix C. See Nguyen et al. (2010) and Sugiyama et al. (2008b) for similar convergence analyses for the logarithmic loss function. The condition $\lim_{n,m\to\infty} \lambda_{n,m} = 0$ means that the regularization parameter λ_{n,m} should vanish asymptotically, whereas the condition $\lambda_{n,m}^{-1} = O((n \wedge m)^{1-\delta})$ means that it should not vanish too fast. As shown in the proof of the theorem, the assumption w0 ∈ H implies $\sup_{x\in Z} w_0(x) \leq \|w_0\|_H \sup_{x\in Z}\sqrt{k(x,x)} < \infty$. For example, the ratio of two Gaussian distributions with different means or variances does not satisfy this condition (see Yamada et al., 2011 for how to handle such a situation).

Remark 1. Suppose that Z is a compact set and k is the Gaussian kernel. Then, for any small γ > 0, the condition
$$H_B(\delta, H_M, P) = O\!\left(\left(\frac{M}{\delta}\right)^{\gamma}\right), \quad (M/\delta \to \infty) \tag{8}$$
holds (Cucker & Smale, 2002, Theorem D in Chapter III, Section 5). More precisely, Cucker and Smale (2002) proved that the entropy number with respect to the supremum norm is bounded above by $c(M/\delta)^{\gamma}$ for M, δ > 0, where c is a positive constant. In addition, the bracketing entropy $H_B(\delta, H_M, P)$ is bounded above by the entropy number with the supremum norm, due to the second inequality of Lemma 2.1 in van de Geer (2000). As a result, the convergence rate in Theorem 2 is $O_p(\lambda_{n,m}^{1/2}) = O_p(1/(n \wedge m)^{(1-\delta)/2})$ for 0 < δ < 1. By choosing a small δ > 0 (i.e., letting λ_{n,m} vanish fast), the convergence rate gets close to that for parametric models, i.e., $O_p(1/\sqrt{n \wedge m})$.

² The multivariate big-O notation f(n, m) = O(g(n, m)), (n, m → ∞), means that there exist C > 0, n0 > 0, and m0 > 0 such that |f(n, m)| ≤ C|g(n, m)| holds for all n > n0 and all m > m0.
Table 1: Summary of density-ratio estimators. "#parameters" is the number of parameters other than regularization parameters.

  Estimator | loss function          | #parameters | estimates of density ratio | model selection
  KuLSIF    | quadratic loss         | n           | max{ŵ, 0}, ŵ ∈ H           | possible
  KMM       | quadratic loss         | n           | max{ŵ, 0}, ŵ ∈ H           | not possible
  KL-div    | conjugate of log loss  | m           | max{ŵ, 0}, ŵ ∈ H           | possible
  KLR       | log-likelihood loss    | n + m       | (n/m) e^{-f̂}, f̂ ∈ H        | possible
  RKDE      | log-likelihood loss    | 0           | ratio of KDEs              | possible
Remark 2. In the KuLSIF estimator, we do not need the assumption that the target density-ratio w0 is bounded below by a positive constant. In the M-estimator with the Kullback-Leibler divergence proposed by Nguyen et al. (2010), the function defined by
$$f(w) = \log\frac{w + w_0}{w_0}$$
is considered. The boundedness condition on w0, namely w0 ≥ c > 0 for some constant c, implies that f(w) is Lipschitz in w, which is a crucial property in the proof of Nguyen et al. (2010). In our proof in Appendix C, we use a different transformation of w, and thus the boundedness condition is not needed.
4 Relation to Existing Kernel-Based Estimators
In this section, we discuss the relation between KuLSIF and other kernel-based density-ratio estimators. Properties of the density-ratio estimation methods are summarized in Table 1.
4.1 Kernel Mean Matching (KMM)
The kernel mean matching (KMM) method allows us to directly obtain an estimate of w0(x) at X1, ..., Xn without going through density estimation (Gretton et al., 2009). The basic idea of KMM is to find w0(x) such that the mean discrepancy between nonlinearly transformed samples drawn from P and Q is minimized in a universal reproducing kernel Hilbert space (Steinwart, 2001). We introduce the definition of universal kernels below.

Definition 1 (Definition 4.52 in Steinwart, 2001). A continuous kernel k on a compact metric space Z is called universal if the RKHS H of k is dense in the set of all continuous functions on Z, that is, for every continuous function g on Z and all ε > 0, there exists an f ∈ H such that ∥f − g∥∞ < ε. The corresponding RKHS is called a universal RKHS.
The Gaussian kernel on a compact set Z is an example of a universal kernel. Let H be a universal RKHS endowed with a universal kernel function k : Z × Z → ℜ. Then, one can infer the density ratio w0 by solving the following minimization problem:
$$\min_{w}\; \frac{1}{2}\left\|\int w(x) k(\cdot, x)\, P(dx) - \int k(\cdot, y)\, Q(dy)\right\|_H^2 \quad \text{s.t. } \int w\, dP = 1 \text{ and } w \geq 0. \tag{9}$$
Huang et al. (2007) proved that the solution of (9) is given as w = w0 when Q is absolutely continuous with respect to P. An empirical version of the above problem reduces to the following convex quadratic program:
$$\min_{w_1, \ldots, w_n}\; \frac{1}{2n}\sum_{i,j=1}^{n} w_i w_j k(X_i, X_j) - \frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{n} w_i k(X_i, Y_j) \quad \text{s.t. } \left|\frac{1}{n}\sum_{i=1}^{n} w_i - 1\right| \leq \epsilon,\; 0 \leq w_1, \ldots, w_n \leq B. \tag{10}$$
The tuning parameters B ≥ 0 and ϵ ≥ 0 control the regularization effects. The optimal solution $(\hat{w}_1, \ldots, \hat{w}_n)$ is an estimate of the density ratio at the samples from P, i.e., of $w_0(X_1), \ldots, w_0(X_n)$. KMM does not estimate the function w0 on Z but only its values at the sample points, while the assumption w0 ∈ H is not required.

We now study the relation between KuLSIF and KMM. Below, we assume that the true density-ratio w0 = q/p is included in the RKHS H. Let Φ(w) be
$$\Phi(w) = \int k(\cdot, x) w(x)\, P(dx) - \int k(\cdot, y)\, Q(dy). \tag{11}$$
Then the loss function of KMM on H under the population distribution is written as
$$L_{\mathrm{KMM}}(w) = \frac{1}{2}\|\Phi(w)\|_H^2.$$
In the estimation phase, an empirical approximation of L_KMM is optimized in the KMM algorithm. On the other hand, the (unregularized) loss function of KuLSIF is given by
$$L_{\mathrm{KuLSIF}}(w) = \frac{1}{2}\int w^2\, dP - \int w\, dQ.$$
Both L_KMM and L_KuLSIF are minimized at the true density-ratio w0 ∈ H. Although some linear constraints may be introduced in the optimization phase, we study the optimization problems of L_KMM and L_KuLSIF without constraints. This is because, when the sample size tends to infinity, the optimal solutions of L_KMM and L_KuLSIF without constraints automatically satisfy the required constraints such as ∫ w dP = 1 and w ≥ 0.
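For reference, the empirical KMM problem (10) is a standard convex quadratic program. A minimal sketch using the quadprog package is given below; the package choice and the small ridge term are our assumptions (the timing experiments in Section 5 instead minimize the inductive objective (18) with a BFGS method).

```r
library(quadprog)   # solve.QP minimizes 1/2 b'Db - d'b subject to A'b >= b0

kmm_weights <- function(K11, K12, B = 1000, eps = 0.01) {
  n <- nrow(K11); m <- ncol(K12)
  D <- K11 / n + 1e-8 * diag(n)        # small ridge for numerical positive definiteness
  d <- K12 %*% rep(1 / m, m)           # (1/m) sum_j k(X_i, Y_j)
  # constraints of (10): |(1/n) sum_i w_i - 1| <= eps and 0 <= w_i <= B
  A  <- cbind(rep(1 / n, n), rep(-1 / n, n), diag(n), -diag(n))
  b0 <- c(1 - eps, -1 - eps, rep(0, n), rep(-B, n))
  solve.QP(D, d, A, b0)$solution       # estimates of w0(X_1), ..., w0(X_n)
}
```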
We consider the extremal condition of L_KuLSIF(w) at w0. Substituting w = w0 + δ·v (δ ∈ ℜ, v ∈ H) into L_KuLSIF(w), we have
$$L_{\mathrm{KuLSIF}}(w_0 + \delta v) - L_{\mathrm{KuLSIF}}(w_0) = \delta\left\{\int w_0 v\, dP - \int v\, dQ\right\} + \frac{\delta^2}{2}\int v^2\, dP.$$
Since L_KuLSIF(w0 + δv) is minimized at δ = 0, the derivative of L_KuLSIF(w0 + δv) at δ = 0 vanishes, i.e.,
$$\int w_0 v\, dP - \int v\, dQ = 0. \tag{12}$$
The equality (12) holds for arbitrary v ∈ H. Using the reproducing property of the kernel function k, we can derive another expression of (12) as follows:
$$\int w_0 v\, dP - \int v\, dQ = \int w_0(x)\langle k(\cdot, x), v\rangle_H\, P(dx) - \int \langle k(\cdot, y), v\rangle_H\, Q(dy) = \left\langle \int k(\cdot, x) w_0(x)\, P(dx) - \int k(\cdot, y)\, Q(dy),\; v\right\rangle_H = \langle \Phi(w_0), v\rangle_H = 0, \quad \forall v \in H. \tag{13}$$
A rigorous proof of the above formula is given in Appendix D. As a result, we obtain Φ(w0) = 0. The above expression implies that Φ(w) is the Gâteaux derivative (Zeidler, 1986, Section 4.2) of L_KuLSIF at w ∈ H, that is,
$$\left.\frac{d}{d\delta} L_{\mathrm{KuLSIF}}(w + \delta\cdot v)\right|_{\delta=0} = \langle \Phi(w), v\rangle_H \tag{14}$$
holds for all v ∈ H. Let DL_KuLSIF be the Gâteaux derivative of L_KuLSIF over the RKHS H. Then we have DL_KuLSIF = Φ, and the equality
$$L_{\mathrm{KMM}}(w) = \frac{1}{2}\|DL_{\mathrm{KuLSIF}}(w)\|_H^2 \tag{15}$$
holds. Tsuboi et al. (2008) have pointed out a similar relation for the M-estimator based on the Kullback-Leibler divergence.

We now give an interpretation of (15) through an analogous optimization example in the Euclidean space. Let f : ℜ^d → ℜ be a differentiable function, and consider the optimization problem min_x f(x). At an optimal solution x0, the extremal condition ∇f(x0) = 0 should hold, where ∇f is the gradient of f with respect to x. Thus, instead of minimizing f, minimizing ∥∇f(x)∥² also provides the minimizer of f. This corresponds to the relation between KuLSIF and KMM:
$$\mathrm{KuLSIF} \;\Longleftrightarrow\; \min_x f(x), \qquad \mathrm{KMM} \;\Longleftrightarrow\; \min_x \frac{1}{2}\|\nabla f(x)\|^2.$$
In other words, in order to find the solution of the equation
$$\Phi(w) = 0, \tag{16}$$
KMM tries to minimize the norm of Φ(w). The "dual" expression of (16) is given as
$$\langle \Phi(w), v\rangle_H = 0, \quad \forall v \in H. \tag{17}$$
By "integrating" ⟨Φ(w), v⟩H, we obtain the loss function L_KuLSIF.

Remark 3. Gretton et al. (2006) proposed the maximum mean discrepancy (MMD) criterion to measure the discrepancy between two probability distributions P and Q. When the constant function 1 is included in the RKHS H, the MMD between P and Q is equal to 2 × L_KMM(1). Due to the equality (15), we find that the MMD is also expressed as $\|DL_{\mathrm{KuLSIF}}(1)\|_H^2$, that is, the squared norm of the derivative of L_KuLSIF at 1 ∈ H. This quantity is related to the discrepancy between the constant function 1 and the true density-ratio w0 = q/p.

In the original KMM method, the density-ratio values at the training samples X1, ..., Xn are estimated (Gretton et al., 2009). Here, we consider its inductive variant, i.e., estimating the function w0 on Z using the loss function of KMM. Given the samples (1), the empirical loss function of inductive KMM is defined as
$$\min_{w \in H}\; \frac{1}{2}\big\|\hat{\Phi}(w) + \lambda w\big\|_H^2, \tag{18}$$
where $\hat{\Phi}(w)$ is defined as
$$\hat{\Phi}(w) = \frac{1}{n}\sum_{i=1}^{n} k(\cdot, X_i)\, w(X_i) - \frac{1}{m}\sum_{j=1}^{m} k(\cdot, Y_j).$$
Note that $\hat{\Phi}(w) + \lambda w$ in (18) is the Gâteaux derivative of the empirical loss function of KuLSIF in (4) including the regularization term. The optimal solution of (18) is the same as that of KuLSIF, and hence the same results as Theorem 1 and Theorem 2 hold for the inductive version of the KMM estimator. The computational efficiency, however, could be different. We show numerical examples of the computational cost in Section 5. In a companion paper (Kanamori et al., 2011), we further investigate the computational properties of the KuLSIF method from the viewpoint of condition numbers (see Section 8.7 of Luenberger & Ye, 2008), and reveal that KuLSIF is computationally more efficient than KMM.

Another difference between KuLSIF and the inductive variant of KMM lies in model selection. As shown in Section 3.2, KuLSIF is equipped with cross-validation, and thus model selection can be performed systematically. On the other hand, the KMM objective function (9) is defined in terms of the RKHS norm. This implies that once the kernel parameters (such as the Gaussian kernel width) are changed, the definition of the objective function also changes, and therefore naively performing cross-validation may not be valid in KMM. The regularization parameter in KMM may be optimized by cross-validation for a fixed RKHS.
4.2 M-Estimator with the Kullback-Leibler Divergence (KL-div)
The M-estimator based on the Kullback-Leibler (KL) divergence (Nguyen et al., 2010) also directly gives an estimate of the density ratio without going through density estimation. The KL divergence I(Q, P) is defined as
$$I(Q, P) = -\int \log\frac{p(z)}{q(z)}\, dQ(z) = -\inf_{w}\left[-\int \log(w(z))\, dQ(z) + \int w(z)\, dP(z) - 1\right], \tag{19}$$
where the second equality follows from the conjugate dual function of the logarithmic function and the infimum is taken over all measurable functions. A detailed derivation is given in Nguyen et al. (2010). The optimal solution of (19) is w(z) = q(z)/p(z), and thus an empirical approximation of (19) leads to a loss function for the estimation of density ratios. The kernel-based estimator $\hat{w}(z)$ is defined as an optimal solution of
$$\min_{w \in H}\; -\frac{1}{m}\sum_{j=1}^{m} \log(w(Y_j)) + \frac{1}{n}\sum_{i=1}^{n} w(X_i) + \frac{\lambda}{2}\|w\|_H^2,$$
where H is an RKHS. We may also use the truncated estimator $\hat{w}_+ = \max\{\hat{w}, 0\}$ as the estimator of the density ratio. Nguyen et al. (2010) proved that the RKHS H endowed with the Gaussian kernel and the regularization parameter $\lambda = (n \wedge m)^{\delta - 1}$ (0 < δ < 1) leads to a consistent estimator under a boundedness assumption on w0 = q/p. Due to the representer theorem (Kimeldorf & Wahba, 1971), the above infinite-dimensional optimization problem reduces to a finite-dimensional one. Furthermore, the optimal solution of KL-div has a similar form to that shown in Theorem 1, and one needs to estimate only m parameters when the samples (1) are observed (Nguyen et al., 2010). Actually, this property holds for general M-estimators with all f-divergences (Ali & Silvey, 1966; Csiszár, 1967); see Kanamori et al. (2011) for details. Note that model selection of the KL-div method can be carried out systematically by cross-validation in terms of the KL divergence (Sugiyama et al., 2008b).
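As an illustration only, the kernelized KL-div objective can be minimized numerically over a finite kernel expansion. The sketch below is our own simplification: it places kernel centers at the pooled samples and restricts the coefficients to be non-negative so that w stays positive under a Gaussian kernel; the experiments in Section 5 instead use a BFGS-based optim call.

```r
# M-estimator with the KL divergence: minimize
#   -(1/m) sum_j log w(Y_j) + (1/n) sum_i w(X_i) + (lambda/2) ||w||_H^2
# over w(z) = sum_c theta_c k(z, C_c) with non-negative theta (a simplification).
gk <- function(A, B, s) exp(-(outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)) / (2 * s^2))

kldiv_fit <- function(X, Y, sigma, lambda) {
  C  <- rbind(X, Y)                                   # kernel centers
  KX <- gk(X, C, sigma); KY <- gk(Y, C, sigma); KC <- gk(C, C, sigma)
  obj <- function(theta) {
    wX <- KX %*% theta; wY <- KY %*% theta
    -mean(log(pmax(wY, 1e-12))) + mean(wX) + 0.5 * lambda * sum(theta * (KC %*% theta))
  }
  theta <- optim(rep(1 / nrow(C), nrow(C)), obj,
                 method = "L-BFGS-B", lower = 0)$par  # keep coefficients >= 0
  function(Z) pmax(as.vector(gk(Z, C, sigma) %*% theta), 0)
}
```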
4.3 Kernel Logistic Regression (KLR)
Another approach to directly estimating the density ratio is to use a probabilistic classifier. Let b be a binary random variable. For the conditional probability p(z|b), we assume that p(z) = p(z|b = +1), q(z) = p(z|b = −1),
hold. That is, b plays the role of a 'class label' for discriminating the 'numerator' and the 'denominator'. An application of the Bayes theorem yields that the density ratio can be expressed in terms of the class label b as
$$w_0(z) = \frac{q(z)}{p(z)} = \frac{p(b = +1)}{p(b = -1)}\cdot\frac{p(b = -1 \mid z)}{p(b = +1 \mid z)}.$$
The ratio of class-prior probabilities, p(b = +1)/p(b = −1), can be simply estimated from the numbers of samples from P and Q, and the class-posterior probability p(b|z) can be estimated by discrimination methods such as logistic regression. Below we briefly explain the kernel logistic regression method (Wahba et al., 1993; Zhu & Hastie, 2001).

The kernel logistic regression method employs a model of the following form for expressing the class-posterior probability p(b|z):
$$p(b \mid z) = \frac{1}{1 + \exp(-b f(z))}, \qquad f \in H,$$
where H is an RKHS on Z. The function f ∈ H is learned so that the negative regularized log-likelihood based on the training samples (1) is minimized:
$$\min_{f \in H}\; \frac{1}{n + m}\left[\sum_{i=1}^{n} \log\big(1 + e^{-f(X_i)}\big) + \sum_{j=1}^{m} \log\big(1 + e^{f(Y_j)}\big)\right] + \frac{\lambda}{2}\|f\|_H^2,$$
where λ is the regularization parameter. Let $\hat{f}$ be an optimal solution. Then the density ratio can be estimated by
$$\hat{w}(z) = \frac{n}{m}\, e^{-\hat{f}(z)}.$$
Note that we do not need to truncate the negative part of $\hat{w}(z)$, since the estimator $\hat{w}$ takes only positive values by construction. Model selection of the KLR-based density-ratio estimator is performed by cross-validation in terms of the classification accuracy measured by the log-likelihood of the logistic model.
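To make the construction concrete, the sketch below mimics the same Bayes-formula recipe with a plain linear logistic model fitted by base R's glm. This is a deliberately simplified stand-in for kernel logistic regression (the experiments in Section 5 use the myKLR solver); the function name and the linear model are our assumptions.

```r
# Density-ratio estimation via logistic regression: w(z) = (n/m) * exp(-f(z)),
# where f(z) is the estimated log-odds of "z was drawn from P" (label b = +1).
ratio_from_logistic <- function(X, Y) {
  n <- nrow(X); m <- nrow(Y)
  train <- data.frame(rbind(X, Y))
  train$lab <- c(rep(1, n), rep(0, m))                # 1 <-> b = +1 (P), 0 <-> b = -1 (Q)
  fit <- glm(lab ~ ., data = train, family = binomial())
  function(Z) {
    f <- predict(fit, newdata = data.frame(Z), type = "link")   # f(z) = logit of P(b=+1|z)
    as.vector((n / m) * exp(-f))                      # always positive, no truncation needed
  }
}
```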
4.4 Ratio of Kernel Density Estimators (RKDE)
The kernel density estimator (KDE) is a non-parametric technique to estimate a probability density function p(x) from its i.i.d. samples $\{x_k\}_{k=1}^{n}$. For the Gaussian kernel $k_\sigma(x, x') = \exp\{-\|x - x'\|^2/(2\sigma^2)\}$, KDE is expressed as
$$\hat{p}(x) = \frac{1}{n(2\pi\sigma^2)^{d/2}}\sum_{k=1}^{n} k_\sigma(x, x_k).$$
The accuracy of KDE depends heavily on the choice of the kernel width σ, which can be optimized by cross-validation in terms of the log-likelihood. See Härdle et al. (2004) for details.
KDE can be used for density-ratio estimation by first obtaining density estimators $\hat{p}(x)$ and $\hat{q}(x)$ separately from X1, ..., Xn and Y1, ..., Ym, and then estimating the density ratio by $\hat{q}(z)/\hat{p}(z)$. This estimator is referred to as the ratio of kernel density estimators (RKDE). A potential limitation of RKDE is that it involves division by an estimated density $\hat{p}(z)$, which tends to magnify the estimation error. This is critical when the number of available samples is limited. Therefore, the KDE-based approach may not be reliable in high-dimensional problems.
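A minimal multivariate implementation of RKDE, written directly from the KDE formula above with our own function names, is:

```r
# Gaussian KDE: p_hat(z) = 1/(n (2 pi sigma^2)^(d/2)) * sum_k exp(-||z - x_k||^2 / (2 sigma^2))
kde <- function(Z, X, sigma) {
  d <- ncol(X); n <- nrow(X)
  sq <- outer(rowSums(Z^2), rowSums(X^2), "+") - 2 * Z %*% t(X)   # squared distances
  rowSums(exp(-sq / (2 * sigma^2))) / (n * (2 * pi * sigma^2)^(d / 2))
}

# RKDE: estimate q and p separately, then divide; the division by p_hat is what
# magnifies estimation errors in high dimensions.
rkde_ratio <- function(Z, X, Y, sigma_p, sigma_q) {
  kde(Z, Y, sigma_q) / kde(Z, X, sigma_p)
}
```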
5 Simulation Studies

In this section, we numerically compare the computational cost and the statistical performance of the proposed and existing density-ratio estimators.
5.1 Computational Costs
First, we experimentally investigate the computational cost of KuLSIF and KMM. The IDA data sets (Rätsch et al., 2001) are used, which are binary classification data sets consisting of positive/negative and training/test samples (see Table 3). We use the large data sets in IDA (titanic, waveform, banana, ringnorm, and twonorm) and compare the computation time of KuLSIF (6) with that of the inductive KMM (18). The solutions are computed numerically by minimizing the objective functions using the BFGS quasi-Newton method implemented in the optim function in the R environment (R Development Core Team, 2009). For KuLSIF, we also investigate the computation time for directly solving the linear equation (5) with the solve function in R. Note that, theoretically, all methods share the same solution (see Section 4.1).

In the first experiment, the data set corresponding to the distribution P consists of all positive test samples, and all negative test samples are assigned to the other data set corresponding to Q. Therefore, the target density-ratio may be far from the constant function w0(x) = 1. Table 2(a) shows the average computation time over 20 runs. In the table, 'KuLSIF(numerical)', 'KuLSIF(direct)', and 'KMM' denote KuLSIF numerically minimizing the loss function, KuLSIF directly solving the linear equation, and the inductive variant of KMM (numerically minimizing the loss function), respectively. In the second experiment, the samples X1, ..., Xn and Y1, ..., Ym are both randomly taken from all (i.e., both positive and negative) test samples. Hence, the target density-ratio is almost equal to the constant function w0(x) = 1. Table 2(b) shows the average computation time over 20 runs.

The results show that, for large data sets, KuLSIF(numerical) is computationally more efficient than KuLSIF(direct). Empirically, the computational cost of KuLSIF(numerical) grows approximately in proportion to n², while that of KuLSIF(direct) grows on the order of n³. Thus, for large data sets, computing the solution by numerically minimizing the quadratic loss function is more advantageous than directly solving the linear equation. KMM is computationally highly demanding in all cases.
Table 2: The average computation time (sec.) of KuLSIF(numerical), KuLSIF(direct), and KMM. (a) The data set from P is randomly taken from positive test samples, and that from Q is randomly taken from negative test samples. (b) The two data sets X1, ..., Xn and Y1, ..., Ym are both randomly taken from all (i.e., both positive and negative) test samples. The data sets are arranged in ascending order of the sample size n. (In the original table, the result of the method with the lowest mean is shown in bold face.)

(a) The true density-ratio is far from a constant
  data set    n     m     KuLSIF(numerical)  KuLSIF(direct)  KMM
  titanic    1327  2775     6.11               1.45            57.96
  waveform   3032  6168    52.74              16.96          1713.71
  banana     4383  5417    97.64              52.97          1539.65
  ringnorm   6933  7067   145.37             177.96          4346.32
  twonorm    7002  6998   145.61             226.20          1944.79

(b) The true density-ratio is close to a constant
  data set    n     m     KuLSIF(numerical)  KuLSIF(direct)  KMM
  titanic    2052  2050    10.20               5.13            91.97
  waveform   4600  4600    63.55              58.55          3078.64
  banana     4900  4900   112.21              78.08          1408.91
  ringnorm   7000  7000   135.70             258.03          3201.78
  twonorm    7000  7000   133.44             243.46          3584.25

5.2 Stability of Estimators
By using synthetic data, we study how the dimension of the data affects the estimation accuracy. The probabilities P and Q are defined as Gaussian distributions with the dimension increasing from 1 to 10, and the sample sizes are set to m = n = 500. The covariance matrix of both distributions is the identity matrix. The mean vector of P is the null vector, and that of Q is equal to µe1, where e1 is the standard unit vector whose first component is 1. Then, the density ratio is equal to w0(x) = exp{x1µ − µ²/2} for x = (x1, ..., xd). For each case of µ = 0 and µ = 1, we compare four kernel-based estimators: KuLSIF, the M-estimator with the Kullback-Leibler divergence (KL-div), kernel logistic regression (KLR), and the ratio of kernel density estimators (RKDE). See Section 4 for details of each estimator.

The solve function in R is used for computing the KuLSIF solution, and the optim function in R is used for computing the KL-div solution. For computing the KLR solution, we use the myKLR package (Rüping, 2003), which is a C++ implementation of the algorithm proposed by Keerthi et al. (2005) to solve the dual problem. For computing RKDE, we use our own implementation in R. In all estimators, the Gaussian kernel is used. Except for RKDE, the kernel width σ is
set to the median of ∥z − z′∥ among all pairs of distinct training points z and z′. This is a standard heuristic for the choice of the Gaussian kernel width (Schölkopf & Smola, 2002). For RKDE, the kernel width is chosen by CV among 20 candidates around the value determined by the above median heuristic. The regularization parameter is set to λ = 1/(n ∧ m)^{0.9}.

The estimation accuracy of the density-ratio estimator $\hat{w}(z)$ is evaluated by the normalized mean squared error (NMSE) over the test points $\tilde{z}_1, \ldots, \tilde{z}_N$:
$$\mathrm{NMSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{\hat{w}(\tilde{z}_i)}{\frac{1}{N}\sum_{k=1}^{N}\hat{w}(\tilde{z}_k)} - \frac{w_0(\tilde{z}_i)}{\frac{1}{N}\sum_{k=1}^{N} w_0(\tilde{z}_k)}\right)^2. \tag{20}$$
In many applications of density ratios, only the relative size of the density ratio is required, and the normalization factor is not essential (Sugiyama et al., 2009; Sugiyama et al., 2012). On the other hand, the target of the current experiment is density-ratio estimation itself, so evaluating the error under normalization is reasonable.

Figure 1 presents the median NMSE of each estimator as a function of the input dimensionality. Since the average NMSE of RKDE took extremely large values for high-dimensional data, we use the median NMSE for evaluation.

[Figure 1: The median of the NMSEs as a function of the input dimensionality. Left panel: the case of w0(x) = 1 (i.e., µ = 0). Right panel: the case of w0(x) = exp{x1 − 1/2} (i.e., µ = 1). Legend: KuLSIF, KL-div, KLR, RKDE; axes: NMSE versus dimension.]

We see that RKDE quickly becomes unstable for multi-dimensional data, whereas the other three estimators provide stable predictions for all data sets.
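For reference, the NMSE criterion (20) reduces to a one-line helper once the estimate and the truth have been evaluated at the test points (our own helper, under the reconstruction of (20) given above):

```r
# NMSE of Eq. (20): compare the estimate and the truth after normalizing each
# by its mean over the test points.
nmse <- function(w_hat, w_true) {
  mean((w_hat / mean(w_hat) - w_true / mean(w_true))^2)
}
```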
5.3 Statistical Performance
Finally, we experimentally compare the statistical performance of four kernel-based estimators: KuLSIF, the M-estimator with the Kullback-Leibler divergence (KL-div), kernel
logistic regression (KLR), and the ratio of kernel density estimators (RKDE). We again use the IDA data sets (see Table 3 for details). The Bayes error in Table 3 denotes the test error of the best classifier reported in Rätsch et al. (2001).

First, we explain how the training data sets are prepared from the IDA data sets. Given the training data for binary classification, (z1, b1), ..., (zt, bt) ∈ Z × {+1, −1}, the posterior probability of the binary labels is estimated by the support vector machine with the Gaussian kernel, where Platt's approach (Platt, 2000) is used.³ The estimated class-posterior probability is denoted by $\hat{P}(b \mid z)$, and let $\hat{P}_\eta(b \mid z)$ be
$$\hat{P}_\eta(b \mid z) = (1 - \eta)\hat{P}(b \mid z) + \frac{\eta}{2}, \qquad 0 \leq \eta \leq 1,$$
for z ∈ Z and b ∈ {+1, −1}. The probability $\hat{P}_0(b \mid z)$ leads to a Bayes error close to the values described in Table 3, whereas $\hat{P}_1(b \mid z)$ gives the uniform probability over the binary labels; hence the Bayes error of $\hat{P}_1(b \mid z)$ is equal to 0.5. Then, the label of each training input zi is reassigned according to the conditional probability $\hat{P}_\eta(b \mid z_i)$. The input points with the reassigned label +1 are regarded as samples from the probability distribution P, and those with the label −1 are regarded as samples from the probability distribution Q. In this way, we obtain the training data set (1), and the density ratio is estimated from these training samples. The true density-ratio is thus approximately given by
$$\tilde{w}_0(z) = \frac{n}{m}\cdot\frac{\hat{P}_\eta(b = -1 \mid z)}{\hat{P}_\eta(b = +1 \mid z)}.$$
Note that $\tilde{w}_0$ is close to the constant function 1 when η is close to 1. The estimation accuracy of the estimator $\hat{w}(z)$ is evaluated by the NMSE (20). Here the test points in the NMSE are uniformly sampled from all the test input vectors of the classification data set. Hence, the distribution of $\tilde{z}_i$ is not the same as the probability distribution P, unless η = 1.

Some examples of estimated density-ratios are depicted in Figure 2 and Figure 3. The training samples are generated from the data set banana or german with the mixing parameter η = 0.01. In these figures, the index of the test samples $\tilde{z}_i$ is arranged in the ascending order of the density-ratio values $\tilde{w}_0(\tilde{z}_i)$. The solid increasing line denotes $\tilde{w}_0(\tilde{z}_i)$ for each test point, and ◦'s in the plots are the estimated values. When the input dimension is low (see Figure 2), all the methods including RKDE perform reasonably well. However, for the high-dimensional data (see Figure 3), RKDE severely overfits due to the division by an estimated density. As illustrated in Section 5.2, this leads to instability of the RKDE estimates, and the prediction ability becomes poor. The other three direct density-ratio estimators provide reasonably stable predictions even when the dimension is high.

³ In Platt's approach, the conditional probability is estimated by the model $\hat{P}(b \mid z) = 1/\{1 + \exp(-b(\alpha \hat{h}(z) + \beta))\}$, α, β ∈ ℜ, where $\hat{h} : Z \to \Re$ is the decision function estimated by the support vector machine. Maximum likelihood estimation is used to estimate the parameters α and β.
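The label-mixing step that defines $\hat{P}_\eta$ can be written compactly as below; phat stands for the Platt-scaled SVM posterior $\hat{P}(b = +1 \mid z_i)$, which is assumed to be computed beforehand.

```r
# Resample labels from P_hat_eta(b | z) = (1 - eta) * P_hat(b | z) + eta / 2.
relabel <- function(phat, eta) {
  p_eta <- (1 - eta) * phat + eta / 2
  ifelse(runif(length(p_eta)) < p_eta, +1, -1)   # +1 -> sample of P, -1 -> sample of Q
}
```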
Table 3: The dimension of the data domain and the Bayes error. The Bayes error denotes the lowest test error reported in the original paper (Rätsch et al., 2001). The data sets are arranged in the ascending order of the Bayes error. "#samples" is the total sample size, i.e., n + m. "Iterations" denotes the number of trials of estimation used to compute the average performance of the estimators.

  data set       dimension  Bayes error (%)  #samples (n + m)  iterations
  ringnorm           20          1.5              7000             20
  twonorm            20          2.6              7000             20
  image              18          2.7              1010             20
  thyroid             5          4.2                75             50
  splice             60          9.5              2175             20
  waveform           21          9.8              4600             20
  banana              2         10.7              4900             20
  heart              13         16.0               100             50
  titanic             3         22.4              2051             20
  diabetes            8         23.2               300             50
  german             20         23.6               300             50
  breast-cancer       9         24.8                77             50
  flare-solar         9         32.4               400             50
Next, we compare the following methods: KuLSIF, KuLSIF with leave-one-out cross-validation (LOOCV), KL-div, KL-div with 5-fold cross-validation (CV), KLR, KLR with CV, and RKDE with CV. For KuLSIF, KL-div, and KLR, LOOCV or CV is used to choose the regularization parameter λ from $2^k/(n \wedge m)^{0.9}$, k = −5, −4, ..., 4, 5. We also test the fixed value $\lambda = (m \wedge n)^{-0.9}$ for KuLSIF, KL-div, and KLR. As shown in Section 3, the regularization parameter $\lambda = (m \wedge n)^{-0.9}$ guarantees the statistical consistency of KuLSIF and KL-div under mild assumptions. Statistical properties of KLR have been studied by Bartlett et al. (2006), Bartlett and Tewari (2007), Steinwart (2005), and Park (2009). In particular, Steinwart (2005) proved that, under mild assumptions, KLR with $\lambda = (m + n)^{-0.9}$ is statistically consistent. When the training samples are balanced, i.e., the ratio of sample sizes m/n converges to a positive constant, the regularization parameter $\lambda = (m \wedge n)^{-0.9}$ also guarantees the statistical consistency of KLR.

In all estimators, the Gaussian kernel is used. Except for RKDE, the kernel width σ is set by the standard heuristic introduced in Section 5.2. For RKDE, the kernel width is chosen by CV among 20 candidates around the value determined by the median heuristic. For each data set, training samples generated by setting η = 0.01, 0.1, 0.5, or 1 in $\hat{P}_\eta(b \mid z)$ are prepared. For each training set, the NMSE of each estimator is computed. By using the NMSE over uniformly distributed test samples, the estimation accuracy on the whole data domain is evaluated, although Theorem 2 does not guarantee statistical consistency for that test distribution. To compute the average performance, the above experiments are repeated multiple times as described in Table 3. The numerical results are presented in Tables 4–7 and Figure 4. In the tables, data sets are arranged in the
ascending order of the Bayes error shown in Table 3. In Table 4, the NMSEs under η = 1 are presented. In this case, the class-posterior probability satisfies $\hat{P}_1(b \mid z) = 0.5$, and hence the density ratio is close to the constant function. Estimators with strong regularization are then expected to provide good results. Indeed, the methods using LOOCV or CV, such as KuLSIF(LOOCV), KL-div(CV), and KLR(CV), achieve the lowest NMSEs. In particular, KLR(CV) is significantly better than the others. For small η, the other estimators except RKDE also show good statistical performance (see Tables 5–7).

At the bottom of each table, the relative computational costs are also reported. The computation time depends on the parameters of the optimization algorithm. In order to reduce the computation time of KL-div and KL-div(CV), we used the optim function with the stopping criterion reltol=0.5e-3 instead of the default value reltol=1e-8. On the other hand, KuLSIF using the solve function is numerically accurate. From the experimental results, we see that KuLSIF dominates the other methods in terms of computational efficiency. KuLSIF(LOOCV) with the analytic-form expression of the LOOCV score also has a computational advantage over KL-div(CV), KLR(CV), and RKDE(CV). Note that the relative computational cost of KL-div is large for small η. This phenomenon is theoretically studied in a companion paper (Kanamori et al., 2011).

In Figure 4, the NMSEs of the estimators described in these tables are plotted as functions of η. The NMSEs of RKDE are not shown since they are much larger than the others. We see that KLR and KLR(CV) are sensitive to η for data sets with low Bayes error such as ringnorm, twonorm, image, thyroid, splice, waveform, and banana. On the other hand, KuLSIF, KuLSIF(LOOCV), and KL-div(CV) present moderate NMSEs for a wide range of η. See Tables 4–7 for more details.
6 Conclusions

In this paper, we addressed the problem of estimating the ratio of two probability densities. We proposed a kernel-based least-squares density-ratio estimator called KuLSIF and investigated its statistical properties, such as consistency and the rate of convergence. We also showed that not only the estimator but also the leave-one-out cross-validation score can be obtained analytically for KuLSIF, which substantially reduces the computational cost. We then pointed out that KuLSIF and an inductive variant of kernel mean matching (KMM) actually share the same solution; hence, the statistical properties of KuLSIF carry over to KMM. However, we showed through numerical experiments that KuLSIF is computationally much more efficient than KMM. We further showed experimentally that KuLSIF overall compares favorably with other density-ratio estimators such as the M-estimator with the Kullback-Leibler divergence, kernel logistic regression, and the ratio of kernel density estimators.

Our definition of KuLSIF (see Eq. (4)) does not contain a non-negativity constraint on the learned density-ratio function. We may add a non-negativity constraint w ≥ 0 to (4), as Kanamori et al. (2009) did. With this additional constraint, however, we can no longer obtain the solution analytically. When the sample size is large, the estimator $\hat{w}$
[Figure 2: Estimation of density ratios for the data set banana. The dimension of the data is 2. The solid line is the true density-ratio $\tilde{w}_0$ with η = 0.01, and ◦'s are the predicted values of the density ratio. The data index is arranged in the ascending order of the density ratio, and thus the solid line is an increasing function. Panels: (a) KuLSIF, NMSE = 0.634; (b) KL-div, NMSE = 0.749; (c) KLR, NMSE = 1.922; (d) RKDE, NMSE = 0.960. Axes: density ratio versus data index.]
[Figure 3: Estimation of density ratios for the data set german. The dimension of the data is 20. The solid line is the true density-ratio $\tilde{w}_0$ with η = 0.01, and ◦'s are the predicted values of the density ratio. The data index is arranged in the ascending order of the density ratio, and thus the solid line is an increasing function. Panels: (a) KuLSIF, NMSE = 0.432; (b) KL-div, NMSE = 0.397; (c) KLR, NMSE = 0.372; (d) RKDE, NMSE = 7.435. Axes: density ratio versus data index.]
Table 4: Average NMSEs of each estimator for training data with η = 1 (the Bayes error is 50% for every data set). In the original table, the lowest NMSEs, and the second lowest when not significantly different from the lowest under the Wilcoxon paired rank-sum test at the 5% significance level, are shown in bold. "—" for RKDE denotes that numerically 0/0 occurred frequently and thus valid results were not obtained. The final entry of each row is the relative computational cost.

  Data sets (columns): ringnorm, twonorm, image, thyroid, splice, waveform, banana, heart, titanic, diabetes, german, breast-cancer, flare-solar

  KuLSIF(LOOCV):         0.18±0.01  0.15±0.01  0.22±0.03  0.53±0.16  0.14±0.01  0.13±0.01  0.09±0.02  0.26±0.07  0.13±0.10  0.30±0.05  0.25±0.03  0.30±0.13  0.17±0.06  | cost 38.0
  KuLSIF(λ=(m∧n)^-0.9):  0.23±0.01  0.23±0.01  0.26±0.03  0.49±0.11  0.31±0.02  0.23±0.01  0.10±0.02  0.36±0.06  0.11±0.03  0.36±0.05  0.35±0.04  0.37±0.09  0.21±0.06  | cost 1.0
  KL-div(λ=(m∧n)^-0.9):  0.35±0.02  0.33±0.03  0.45±0.06  0.57±0.13  0.39±0.05  0.37±0.05  0.52±0.14  0.54±0.10  0.57±0.20  0.46±0.06  0.48±0.06  0.51±0.13  0.31±0.08  | cost 132.1
  KL-div(CV):            0.25±0.01  0.16±0.02  0.23±0.04  0.53±0.14  0.15±0.02  0.14±0.03  0.15±0.07  0.31±0.18  0.14±0.04  0.30±0.07  0.26±0.04  0.34±0.21  0.17±0.08  | cost 5.0
  KLR(λ=(m∧n)^-0.9):     0.17±0.01  0.17±0.01  0.19±0.02  0.25±0.09  0.22±0.02  0.17±0.01  0.09±0.02  0.23±0.05  0.10±0.03  0.25±0.06  0.23±0.04  0.23±0.07  0.21±0.05  | cost 207.1
  KLR(CV):               0.03±0.01  0.03±0.01  0.05±0.03  0.10±0.22  0.03±0.03  0.04±0.02  0.05±0.02  0.13±0.23  0.03±0.02  0.07±0.09  0.03±0.04  0.11±0.21  0.05±0.06  | cost 45.0
  RKDE(CV):              41.7±9.31  30.8±12.4  —          4.39±2.23  19.9±6.63  17.8±5.49  23.9±23.9  5.35±1.27  6.07±12.3  9.72±2.98  10.9±1.88  4.09±1.28  12.8±2.55  | cost 2.0
Table 5: Average NMSEs of each estimator for training data with η = 0.5. In the original table, the lowest NMSEs, and the second lowest when not significantly different from the lowest under the Wilcoxon paired rank-sum test at the 5% significance level, are shown in bold. "—" for RKDE denotes that numerically 0/0 occurred frequently and thus valid results were not obtained. The final entry of each row is the relative computational cost.

  Data sets (columns): ringnorm, twonorm, image, thyroid, splice, waveform, banana, heart, titanic, diabetes, german, breast-cancer, flare-solar
  Bayes error (%):     25.7, 26.3, 26.3, 27.1, 29.8, 29.9, 30.4, 33.0, 36.2, 36.6, 36.8, 37.4, 41.2

  KuLSIF(LOOCV):         0.29±0.01  0.31±0.02  0.46±0.07  0.31±0.12  0.42±0.01  0.25±0.02  0.26±0.02  0.37±0.07  0.13±0.03  0.30±0.05  0.32±0.05  0.30±0.12  0.25±0.06  | cost 38.9
  KuLSIF(λ=(m∧n)^-0.9):  0.29±0.01  0.34±0.02  0.45±0.07  0.34±0.11  0.42±0.01  0.29±0.02  0.28±0.01  0.38±0.08  0.14±0.02  0.32±0.05  0.34±0.05  0.36±0.08  0.24±0.05  | cost 1.0
  KL-div(λ=(m∧n)^-0.9):  0.41±0.01  0.44±0.03  0.49±0.08  0.35±0.11  0.51±0.02  0.30±0.02  0.58±0.06  0.47±0.11  0.58±0.25  0.37±0.06  0.41±0.06  0.44±0.10  0.29±0.07  | cost 179.3
  KL-div(CV):            0.34±0.02  0.33±0.05  0.47±0.05  0.39±0.19  0.47±0.02  0.26±0.01  0.37±0.04  0.46±0.17  0.18±0.04  0.31±0.06  0.33±0.08  0.38±0.20  0.25±0.05  | cost 5.3
  KLR(λ=(m∧n)^-0.9):     0.33±0.01  0.44±0.03  0.50±0.07  0.32±0.08  0.45±0.01  0.29±0.02  0.41±0.04  0.36±0.07  0.15±0.02  0.27±0.04  0.28±0.04  0.24±0.07  0.23±0.05  | cost 218.4
  KLR(CV):               0.31±0.01  0.43±0.03  0.55±0.10  0.43±0.15  0.57±0.06  0.31±0.03  0.37±0.05  0.45±0.18  0.12±0.02  0.29±0.06  0.31±0.07  0.33±0.21  0.22±0.05  | cost 47.7
  RKDE(CV):              42.8±12.6  30.8±9.89  —          4.60±2.08  23.1±5.92  19.8±7.50  19.91±21.4 5.10±1.41  1.71±7.14  8.86±3.00  10.8±1.98  3.79±1.49  12.0±2.96  | cost 3.8
Table 6: Average NMSEs of each estimator for training data with η = 0.1. In the original table, the lowest NMSEs, and the second lowest when not significantly different from the lowest under the Wilcoxon paired rank-sum test at the 5% significance level, are shown in bold. "—" for RKDE denotes that numerically 0/0 occurred frequently and thus valid results were not obtained. The final entry of each row is the relative computational cost.

  Data sets (columns): ringnorm, twonorm, image, thyroid, splice, waveform, banana, heart, titanic, diabetes, german, breast-cancer, flare-solar
  Bayes error (%):     6.3, 7.3, 7.4, 8.8, 13.6, 13.8, 14.7, 19.4, 25.2, 25.9, 26.2, 27.3, 34.2

  KuLSIF(LOOCV):         0.32±0.02  0.41±0.02  0.89±0.37  0.37±0.11  0.49±0.01  0.35±0.04  0.46±0.04  0.58±0.13  0.15±0.04  0.40±0.07  0.46±0.07  0.35±0.10  0.30±0.07  | cost 38.9
  KuLSIF(λ=(m∧n)^-0.9):  0.34±0.01  0.41±0.02  0.79±0.14  0.36±0.10  0.54±0.02  0.36±0.04  0.49±0.02  0.58±0.13  0.18±0.02  0.39±0.07  0.45±0.07  0.37±0.08  0.30±0.07  | cost 1.0
  KL-div(λ=(m∧n)^-0.9):  0.61±0.02  0.65±0.03  0.76±0.12  0.31±0.11  0.85±0.04  0.39±0.04  0.93±0.06  0.51±0.11  0.60±0.18  0.40±0.08  0.45±0.06  0.41±0.08  0.35±0.07  | cost 237.7
  KL-div(CV):            0.50±0.04  0.55±0.13  0.67±0.05  0.44±0.19  0.67±0.08  0.32±0.02  0.83±0.07  0.66±0.15  0.23±0.04  0.41±0.08  0.46±0.08  0.43±0.16  0.33±0.08  | cost 5.6
  KLR(λ=(m∧n)^-0.9):     0.82±0.02  1.34±0.07  0.95±0.12  0.39±0.09  0.76±0.05  0.53±0.04  1.28±0.08  0.57±0.13  0.22±0.02  0.37±0.06  0.50±0.08  0.31±0.10  0.25±0.06  | cost 226.4
  KLR(CV):               0.98±0.03  1.31±0.10  1.66±0.43  0.61±0.26  2.91±0.56  0.74±0.11  1.74±0.35  0.79±0.32  0.15±0.03  0.47±0.13  0.53±0.13  0.40±0.23  0.28±0.09  | cost 49.9
  RKDE(CV):              31.7±8.78  36.4±10.3  —          2.41±2.21  21.7±6.93  26.1±9.02  37.8±16.4  5.20±1.31  0.16±0.05  10.1±2.31  11.2±1.66  3.85±1.36  11.6±3.43  | cost 4.6
Table 7: Average NMSEs of each estimator for training data with η = 0.01. In the original table, the lowest NMSEs, and the second lowest when not significantly different from the lowest under the Wilcoxon paired rank-sum test at the 5% significance level, are shown in bold. "—" for RKDE denotes that numerically 0/0 occurred frequently and thus valid results were not obtained. The final entry of each row is the relative computational cost.

  Data sets (columns): ringnorm, twonorm, image, thyroid, splice, waveform, banana, heart, titanic, diabetes, german, breast-cancer, flare-solar
  Bayes error (%):     2.0, 3.1, 3.1, 4.7, 9.9, 10.2, 11.1, 16.3, 22.7, 23.5, 23.9, 25.0, 32.6

  KuLSIF(LOOCV):         0.44±0.01  0.48±0.04  1.44±0.31  0.64±0.15  0.98±0.05  0.49±0.05  0.78±0.04  0.77±0.16  0.12±0.04  0.50±0.11  0.62±0.14  0.35±0.16  0.34±0.13  | cost 39.4
  KuLSIF(λ=(m∧n)^-0.9):  0.49±0.01  0.38±0.01  1.26±0.13  0.64±0.15  1.10±0.04  0.47±0.04  1.00±0.04  0.79±0.17  0.19±0.02  0.49±0.11  0.62±0.14  0.37±0.13  0.34±0.06  | cost 1.0
  KL-div(λ=(m∧n)^-0.9):  0.72±0.02  0.71±0.05  1.14±0.16  0.61±0.14  1.36±0.06  0.59±0.05  1.59±0.09  0.69±0.15  0.70±0.31  0.50±0.10  0.60±0.13  0.40±0.12  0.37±0.07  | cost 254.3
  KL-div(CV):            0.62±0.03  0.66±0.09  1.03±0.11  0.60±0.15  1.21±0.07  0.50±0.04  1.50±0.08  0.78±0.20  0.25±0.07  0.52±0.12  0.61±0.13  0.42±0.18  0.37±0.07  | cost 5.7
  KLR(λ=(m∧n)^-0.9):     1.10±0.03  2.11±0.10  1.24±0.12  0.68±0.14  0.75±0.03  0.66±0.04  1.61±0.08  0.76±0.16  0.23±0.02  0.48±0.09  0.68±0.16  0.32±0.17  0.27±0.06  | cost 230.9
  KLR(CV):               2.28±0.17  3.72±0.73  2.04±0.66  0.67±0.20  5.70±1.94  1.11±0.10  4.30±0.77  0.87±0.38  0.14±0.03  0.54±0.19  0.63±0.18  0.40±0.23  0.30±0.09  | cost 50.4
  RKDE(CV):              21.9±15.4  43.9±10.7  —          2.66±2.23  27.1±5.62  38.8±8.54  44.3±9.82  5.10±1.48  0.12±0.04  9.91±2.67  11.1±1.68  3.62±1.34  12.1±3.02  | cost 5.2
[Figure 4: The NMSEs are depicted as functions of η. One panel per data set (ringnorm, twonorm, image, thyroid, splice, waveform, banana, heart, titanic, diabetes, german, breast-cancer, flare-solar); each panel plots NMSE against η for KuLSIF, KuLSIF(LOOCV), KL-div, KL-div(CV), KLR, and KLR(CV).]
obtained by (4) will be a non-negative function even without the additional constraint. Thus, the estimator $\hat{w}$ (and its truncated version $\hat{w}_+$) will be asymptotically the same as the one obtained by imposing the non-negativity constraint on (4). For small sample sizes, however, the estimator $\hat{w}$ can take negative values, and the truncated estimator $\hat{w}_+$ may have a statistical bias. Thus, careful treatment is needed to obtain a good estimator in practice. In non-parametric density estimation, it was shown that non-negative estimators cannot be unbiased (Rosenblatt, 1956). We conjecture that a similar result also holds in the inference of density ratios, which needs to be investigated in future work.
Acknowledgment

The authors are grateful to the anonymous reviewers for their helpful comments. The work of T. Kanamori was partially supported by a Grant-in-Aid for Young Scientists (20700251). T. Suzuki was supported in part by the Global COE Program "The research and training center for new development in mathematics", MEXT, Japan. M. Sugiyama was supported by SCAT, AOARD, and the JST PRESTO program.
A  Proof of Theorem 1

Proof. Applying the representer theorem (Kimeldorf & Wahba, 1971), we see that an optimal solution of (4) has the form of

    w = ∑_{j=1}^{n} α_j k(·, X_j) + ∑_{ℓ=1}^{m} β_ℓ k(·, Y_ℓ).                                          (21)

Let K_{11}, K_{12}, K_{21}, and K_{22} be the sub-matrices of the Gram matrix:

    (K_{11})_{ii′} = k(X_i, X_{i′}),   (K_{12})_{ij} = k(X_i, Y_j),   K_{21} = K_{12}^⊤,   (K_{22})_{jj′} = k(Y_j, Y_{j′}),

where i, i′ = 1, . . . , n and j, j′ = 1, . . . , m. Then, the extremal condition of (4) with respect to the parameters α = (α_1, . . . , α_n)^⊤ and β = (β_1, . . . , β_m)^⊤ is given as

    (1/n) K_{11} (K_{11} α + K_{12} β) − (1/m) K_{12} 1_m + λ K_{11} α + λ K_{12} β = 0,   and
    (1/n) K_{21} (K_{11} α + K_{12} β) − (1/m) K_{22} 1_m + λ K_{22} β + λ K_{21} α = 0.

An easy computation shows that the above extremal condition is satisfied by the parameter α defined as the solution of the linear equation (5) and β = (1/(mλ)) (1, . . . , 1)^⊤.
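To make the closed-form solution concrete, the following Python sketch fits KuLSIF as derived above: β = 1_m/(mλ), and α obtained from the linear system (K_{11} + nλ I_n) α = −(1/(mλ)) K_{12} 1_m, which is a sufficient condition for both extremal equations after substituting β and cancelling the common factors K_{11} and K_{21}; we take this as the form of equation (5), which is stated earlier in the paper and is not reproduced in this excerpt. The Gaussian kernel, the bandwidth, the toy densities, and all function names below are illustrative choices, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kulsif_fit(X, Y, sigma, lam):
    """KuLSIF-style estimate of the ratio w0 = dQ/dP from X ~ P and Y ~ Q.

    Following Theorem 1, beta = 1_m / (m * lam), and alpha solves
        (K11 + n * lam * I) alpha = -(1 / (m * lam)) * K12 1_m,
    which is the form of equation (5) suggested by the extremal conditions above
    (an assumption of this sketch, since (5) is not shown in this excerpt).
    """
    n, m = len(X), len(Y)
    K11 = gaussian_kernel(X, X, sigma)
    K12 = gaussian_kernel(X, Y, sigma)
    alpha = np.linalg.solve(K11 + n * lam * np.eye(n), -K12.sum(axis=1) / (m * lam))
    beta = np.full(m, 1.0 / (m * lam))
    def w_hat(Z, cutoff=True):
        vals = gaussian_kernel(Z, X, sigma) @ alpha + gaussian_kernel(Z, Y, sigma) @ beta
        return np.maximum(vals, 0.0) if cutoff else vals   # cut-off estimator w_hat_+
    return w_hat

# toy usage: P = N(0, 1), Q = N(0.5, 1); the true ratio is w0(z) = q(z) / p(z)
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))   # samples from P
Y = rng.normal(0.5, 1.0, size=(200, 1))   # samples from Q
w_hat = kulsif_fit(X, Y, sigma=1.0, lam=200 ** (-0.9))
print(w_hat(np.array([[0.0], [1.0]])))
```

A single fit thus reduces to one n × n linear solve, which is consistent with KuLSIF with a fixed λ serving as the unit of the relative computational cost reported in Table 7.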
B  Leave-One-Out Cross-Validation of KuLSIF

The procedure to compute the leave-one-out cross-validation score of KuLSIF is presented here. Let K_{11}^{(ℓ)} ∈ ℜ^{(n−1)×(n−1)} and K_{12}^{(ℓ)} = K_{21}^{(ℓ)⊤} ∈ ℜ^{(n−1)×(m−1)} be the Gram matrices of the samples except X_ℓ and Y_ℓ, respectively. According to Theorem 1, the estimated parameters α̃^{(ℓ)} and β̃^{(ℓ)} of

    ŵ^{(ℓ)}(z) = ∑_{i≠ℓ} α_i k(z, X_i) + ∑_{j≠ℓ} β_j k(z, Y_j)

are given by

    α̃^{(ℓ)} = − (1/((m−1)λ)) (K_{11}^{(ℓ)} + (n−1)λ I_{n−1})^{−1} K_{12}^{(ℓ)} 1_{m−1},    β̃^{(ℓ)} = (1/((m−1)λ)) 1_{m−1},

where I_{n−1} denotes the (n−1) by (n−1) identity matrix. Hence, the parameter α̃^{(ℓ)} is the solution of the following convex quadratic problem:

    min_{α ∈ ℜ^{n−1}}  (1/2) α^⊤ (K_{11}^{(ℓ)} + (n−1)λ I_{n−1}) α + (1/((m−1)λ)) 1_{m−1}^⊤ K_{21}^{(ℓ)} α.

The same solution can be obtained by solving

    min_{α}  (1/2) α^⊤ (K_{11} + (n−1)λ I_n) α + (1/((m−1)λ)) (1_m − e_{m,ℓ})^⊤ K_{21} α,
    s.t.  α ∈ ℜ^n,  α_ℓ = 0,                                                                            (22)

where e_{m,ℓ} ∈ ℜ^m is the standard unit vector with only the ℓ-th component being 1. The optimal solution of (22), denoted by α^{(ℓ)}, is equal to

    α^{(ℓ)} = (K_{11} + (n−1)λ I_n)^{−1} ( − (1/((m−1)λ)) K_{12} (1_m − e_{m,ℓ}) − c_ℓ e_{n,ℓ} ),

where c_ℓ is determined so that α^{(ℓ)}_ℓ = 0. The estimator α̃^{(ℓ)} ∈ ℜ^{n−1} is equal to the (n−1)-dimensional vector consisting of α^{(ℓ)} except the ℓ-th component, i.e., α̃^{(ℓ)} = (α^{(ℓ)}_1, . . . , α^{(ℓ)}_{ℓ−1}, α^{(ℓ)}_{ℓ+1}, . . . , α^{(ℓ)}_n)^⊤. Let β̂^{(ℓ)} be

    β̂^{(ℓ)} = (1/((m−1)λ)) (1_m − e_{m,ℓ}),

then we have

    ŵ^{(ℓ)}(z) = ∑_{i=1}^{n} α^{(ℓ)}_i k(z, X_i) + ∑_{j=1}^{m} β̂^{(ℓ)}_j k(z, Y_j).

We consider an analytic expression of the leave-one-out score. Let the matrices A and B be the parameters of the leave-one-out estimators,

    A = (α^{(1)}, . . . , α^{(n∧m)}) ∈ ℜ^{n×(n∧m)},    B = (β̂^{(1)}, . . . , β̂^{(n∧m)}) ∈ ℜ^{m×(n∧m)},

let the matrix G ∈ ℜ^{n×n} be G = (K_{11} + (n−1)λ I_n)^{−1}, and let E ∈ ℜ^{m×(n∧m)} be the matrix defined as

    E_{ij} = 1  (i ≠ j),    E_{ij} = 0  (i = j).

Let S ∈ ℜ^{n×(n∧m)} be

    S = − (1/((m−1)λ)) K_{12} E,

and T ∈ ℜ^{n×(n∧m)} be

    T_{ij} = (GS)_{ii} / G_{ii}  (i = j),    T_{ij} = 0  (i ≠ j).

Then, we obtain

    A = G (S − T),    B = (1/((m−1)λ)) E.

Let K_X ∈ ℜ^{(n∧m)×(n+m)} be the sub-matrix of (K_{11} K_{12}) formed by the first n ∧ m rows and all columns. Similarly, let K_Y ∈ ℜ^{(n∧m)×(n+m)} be the sub-matrix of (K_{21} K_{22}) formed by the first n ∧ m rows and all columns. Let the product U ∗ U′ be the element-wise multiplication of matrices U and U′ of the same size, i.e., the (i, j) element is given by U_{ij} U′_{ij}. Then, we have

    ŵ_X    = (ŵ^{(1)}(X_1), . . . , ŵ^{(n∧m)}(X_{n∧m}))^⊤ = (K_X ∗ (A^⊤ B^⊤)) 1_{n+m},
    ŵ_Y    = (ŵ^{(1)}(Y_1), . . . , ŵ^{(n∧m)}(Y_{n∧m}))^⊤ = (K_Y ∗ (A^⊤ B^⊤)) 1_{n+m},
    ŵ_{X+} = (ŵ^{(1)}_+(X_1), . . . , ŵ^{(n∧m)}_+(X_{n∧m}))^⊤ = max{ŵ_X, 0},
    ŵ_{Y+} = (ŵ^{(1)}_+(Y_1), . . . , ŵ^{(n∧m)}_+(Y_{n∧m}))^⊤ = max{ŵ_Y, 0},

where the max operation for a vector is applied in the element-wise manner. As a result, LOOCV (7) is equal to

    LOOCV = (1/(n ∧ m)) { (1/2) ŵ_{X+}^⊤ ŵ_{X+} − 1_{n∧m}^⊤ ŵ_{Y+} }.
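The matrices above translate almost line-by-line into code. The following NumPy sketch (ours, not the authors' implementation) evaluates the analytic LOOCV score for one value of λ, assuming the Gram sub-matrices K_{11}, K_{12}, K_{22} have already been computed; model selection would evaluate this score over a grid of kernel and regularization parameters and pick the minimizer.

```python
import numpy as np

def kulsif_loocv_score(K11, K12, K22, lam):
    """Analytic LOOCV score of KuLSIF, following the matrices defined in Appendix B.

    K11, K12, K22 are the Gram sub-matrices on the X samples (size n),
    between X and Y, and on the Y samples (size m), respectively.
    """
    n, m = K11.shape[0], K22.shape[0]
    b = min(n, m)                                   # n ∧ m leave-one-out pairs
    G = np.linalg.inv(K11 + (n - 1) * lam * np.eye(n))
    E = np.ones((m, b))
    E[np.arange(b), np.arange(b)] = 0.0             # E_ij = 1 (i != j), 0 (i == j)
    S = -K12 @ E / ((m - 1) * lam)                  # S = -K12 E / ((m-1) lam)
    GS = G @ S
    T = np.zeros((n, b))
    T[np.arange(b), np.arange(b)] = GS[np.arange(b), np.arange(b)] / np.diag(G)[:b]
    A = G @ (S - T)                                 # columns are alpha^(l)
    B = E / ((m - 1) * lam)                         # columns are beta^(l)
    KX = np.hstack([K11, K12])[:b]                  # first n∧m rows of (K11 K12)
    KY = np.hstack([K12.T, K22])[:b]                # first n∧m rows of (K21 K22)
    C = np.hstack([A.T, B.T])                       # the concatenation (A^T B^T)
    wX = np.maximum((KX * C).sum(axis=1), 0.0)      # cut-off leave-one-out values at X_l
    wY = np.maximum((KY * C).sum(axis=1), 0.0)      # cut-off leave-one-out values at Y_l
    return (0.5 * wX @ wX - wY.sum()) / b
```

Combined with the Gram-matrix helper from the previous sketch, this score can be computed for each candidate (kernel width, λ) pair at the cost of a single n × n inversion per candidate, which is the computational advantage of the analytic form.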
C  Proof of Theorem 2

We summarize some notation to be used in the proof. Given a probability distribution P and a random variable h(X), we denote the expectation of h(X) under P by ∫ h dP. Given samples X_1, . . . , X_n from P, the empirical distribution is denoted by P_n. The expectation ∫ h dP_n denotes the empirical mean of h(X), that is, (1/n) ∑_{i=1}^{n} h(X_i). We also use the notation ∫ h d(P − P_n) to represent ∫ h dP − (1/n) ∑_{i=1}^{n} h(X_i). Let H be the RKHS endowed with the kernel k. The norm and inner product on H are denoted by ∥·∥_H and ⟨·, ·⟩_H, respectively. Let ∥·∥_∞ be the infinity norm, and for a distribution P, define the L_2(P) norm by

    ∥g∥_P = ( ∫ |g|^2 dP )^{1/2},

and let L_2(P) be the metric space defined by this norm.

Since sup_{x∈Z} k(x, x) is assumed to be bounded above, without loss of generality we assume sup_{x∈Z} k(x, x) ≤ 1; the constant factor of the kernel function does not affect the following proof. We now define the bracketing entropy of a set of functions. For any fixed δ > 0, a covering of a function class F in the metric L_2(P) is a collection of functions which allows F to be covered by L_2(P) balls of radius δ centered at these functions. Let N_B(δ, F, P) be the smallest number N for which there exist pairs of functions {(g_j^L, g_j^U) ∈ L_2(P) × L_2(P) | j = 1, . . . , N} such that ∥g_j^L − g_j^U∥_P ≤ δ, and such that for each f ∈ F there exists j satisfying g_j^L ≤ f ≤ g_j^U. Then, H_B(δ, F, P) = log N_B(δ, F, P) is called the bracketing entropy of F (van de Geer, 2000, Definition 2.2).

For w ∈ H, we have ∥w∥_P ≤ ∥w∥_∞ ≤ ∥w∥_H, because for any x ∈ Z the inequalities

    |w(x)| = |⟨w, k(·, x)⟩_H| ≤ ∥w∥_H sup_x √(k(x, x)) ≤ ∥w∥_H

hold. Let G = {v^2 | v ∈ H}, and define a measure of complexity J : G → ℜ by J(g) = inf{ ∥v∥_H^2 | g = v^2, v ∈ H }. Let H^M and G^M be

    H^M = {v ∈ H | ∥v∥_H < M},    G^M = {v^2 | v ∈ H^{√M}} = {g ∈ G | J(g) < M}.                        (23)

It is straightforward to verify the second equality of (23). The following proposition is crucial to prove the convergence property of KuLSIF.

Proposition 1 (Lemma 5.14 in van de Geer (2000)). Let F ⊂ L_2(P) be a function class, and let I(f) be a measure of complexity of f ∈ F, where I is a non-negative functional on F and I(f_0) < ∞ for a fixed f_0 ∈ F. Define F_M = {f ∈ F | I(f) < M}, which satisfies F = ∪_{M≥1} F_M. Suppose that there exist c_0 > 0 and 0 < γ < 2 such that, for all δ > 0,

    sup_{f∈F_M} ∥f − f_0∥_P ≤ c_0 M,    sup_{f∈F_M, ∥f−f_0∥_P ≤ δ} ∥f − f_0∥_∞ ≤ c_0 M,

and that H_B(δ, F_M, P) = O((M/δ)^γ). Then, we have

    sup_{f∈F} ( ∫ (f − f_0) d(P − P_n) ) / D(f) = O_p(1)    (n → ∞),

where D(f) is defined by

    D(f) = ( ∥f − f_0∥_P^{1−γ/2} I(f)^{γ/2} / √n ) ∨ ( I(f) / n^{2/(2+γ)} ),

and a ∨ b denotes max{a, b}.

In van de Geer (2000), the probabilistic order is evaluated for each case of ∥f − f_0∥ ≤ n^{−1/(2+γ)} I(f) and ∥f − f_0∥ > n^{−1/(2+γ)} I(f), respectively. When the supremum is taken over {f ∈ F | ∥f − f_0∥ ≤ n^{−1/(2+γ)} I(f)}, D(f) is equal to I(f)/n^{2/(2+γ)}, and the probabilistic order above is obtained from the first formula of Lemma 5.14 (van de Geer, 2000). In the same way, we obtain the probabilistic order for ∥f − f_0∥ > n^{−1/(2+γ)} I(f). The sum of the probabilistic upper bounds for these two cases provides the result in the above proposition. We use Proposition 1 to derive upper bounds of

    ∫ (ŵ − w_0) d(Q − Q_m)    and    ∫ (ŵ^2 − w_0^2) d(P − P_n).

Lemma 1. The bracketing entropy of G^M is bounded above by

    H_B(δ, G^M, P) = O((M/δ)^γ).

Proof. Let v_1^L, v_1^U, v_2^L, v_2^U, . . . , v_N^L, v_N^U ∈ L_2(P) be a covering of H^{√M} in the sense of bracketing, such that ∥v_i^L − v_i^U∥_P ≤ δ holds for i = 1, . . . , N. Then, for any v ∈ H^{√M} there exists i such that v_i^L ≤ v ≤ v_i^U holds. We can choose these functions such that ∥v_i^{L(U)}∥_∞ ≤ √M is satisfied for all i = 1, . . . , N, since for any v ∈ H^{√M} the inequality ∥v∥_∞ ≤ ∥v∥_H < √M holds; for example, replace v_i^{L(U)} with min{√M, max{−√M, v_i^{L(U)}}} ∈ L_2(P). Let v̄_i^L and v̄_i^U be

    v̄_i^L(x) = (v_i^L(x))^2 if v_i^L(x) ≥ 0,   (v_i^U(x))^2 if v_i^U(x) ≤ 0,   0 if v_i^L(x) < 0 < v_i^U(x),
    v̄_i^U = max{(v_i^L)^2, (v_i^U)^2},

for i = 1, . . . , N. Then, v̄_i^L ≤ v̄_i^U holds. Moreover, for any v ∈ H^{√M} satisfying v_i^L ≤ v ≤ v_i^U, we have v̄_i^L ≤ v^2 ≤ v̄_i^U. By definition, we also have

    0 ≤ v̄_i^U(x) − v̄_i^L(x) ≤ max{ |v_i^U(x)^2 − v_i^L(x)^2|, |v_i^U(x) − v_i^L(x)|^2 }
                             ≤ (|v_i^U(x)| + |v_i^L(x)|) · |v_i^U(x) − v_i^L(x)| ≤ 2√M |v_i^U(x) − v_i^L(x)|,

and thus ∥v̄_i^U − v̄_i^L∥_P ≤ 2√M ∥v_i^U − v_i^L∥_P holds. Due to (8), we obtain

    H_B(2√M δ, G^M, P) ≤ H_B(δ, H^{√M}, P) = O((√M/δ)^γ).

Hence, H_B(δ, G^M, P) = O((M/δ)^γ) holds.
Lemma 2. Assume the conditions of Theorem 2. Then, for the KuLSIF estimator ŵ, we have

    ∫ (ŵ − w_0) d(Q − Q_m) = O_p( ( ∥w_0 − ŵ∥_P^{1−γ/2} ∥ŵ∥_H^{γ/2} / √m ) ∨ ( ∥ŵ∥_H / m^{2/(2+γ)} ) ),
    ∫ (ŵ^2 − w_0^2) d(P − P_n) = O_p( ( ∥ŵ − w_0∥_P^{1−γ/2} (1 + ∥ŵ∥_H)^{1+γ/2} / √n ) ∨ ( ∥ŵ∥_H^2 / n^{2/(2+γ)} ) ).

Proof. There exists c_0 > 0 such that

    sup_{w∈H^M} ∥w − w_0∥_P ≤ c_0 M,      sup_{w∈H^M, ∥w−w_0∥_P ≤ δ} ∥w − w_0∥_∞ ≤ c_0 M,              (24)
    sup_{g∈G^M} ∥g − w_0^2∥_P ≤ c_0 M,    sup_{g∈G^M, ∥g−w_0^2∥_P ≤ δ} ∥g − w_0^2∥_∞ ≤ c_0 M.          (25)

The inequalities in (25) are derived as follows. For g ∈ G^M, there exists v ∈ H such that v^2 = g and ∥v∥_H^2 < M, and then we have

    ∥g − w_0^2∥_P ≤ ∥g − w_0^2∥_∞ ≤ ∥v∥_∞^2 + ∥w_0∥_∞^2 ≤ ∥v∥_H^2 + ∥w_0∥_∞^2 ≤ M + ∥w_0∥_∞^2 ≤ c_0 M    (M ≥ 1).

In the same way, (24) also holds. Set F to be H and I(w) = ∥w∥_H in Proposition 1. Taking (24) into account, we have

    sup_{w∈H} ( ∫ (w_0 − w) d(Q − Q_m) ) / D(w) = O_p(1),

where D(w) is defined as

    D(w) = ( ∥w_0 − w∥_P^{1−γ/2} ∥w∥_H^{γ/2} / √m ) ∨ ( ∥w∥_H / m^{2/(2+γ)} ).

In the same way, by setting F to be G and I(g) = J(g) in Proposition 1, we have

    sup_{w∈H} ( ∫ (w^2 − w_0^2) d(P − P_n) ) / E(w) = O_p(1),

where E(w) is defined as

    E(w) = ( ∥w^2 − w_0^2∥_P^{1−γ/2} J(w^2)^{γ/2} / √n ) ∨ ( J(w^2) / n^{2/(2+γ)} ).

Note that ∥w^2 − w_0^2∥_P ≤ (∥w_0∥_∞ + ∥w∥_H) ∥w − w_0∥_P = O((1 + ∥w∥_H) ∥w − w_0∥_P) and J(w^2) ≤ ∥w∥_H^2. Then, we obtain

    E(w) ≤ ( ∥w − w_0∥_P^{1−γ/2} (1 + ∥w∥_H)^{1+γ/2} / √n ) ∨ ( ∥w∥_H^2 / n^{2/(2+γ)} ).
Now we show the proof of Theorem 2.

Proof. The estimator ŵ satisfies the inequality

    (1/2) ∫ ŵ^2 dP_n − ∫ ŵ dQ_m + (λ/2) ∥ŵ∥_H^2 ≤ (1/2) ∫ w_0^2 dP_n − ∫ w_0 dQ_m + (λ/2) ∥w_0∥_H^2.

Then, we have

    (1/2) ∥ŵ − w_0∥_P^2 = ∫ (w_0 − ŵ) dQ + (1/2) ∫ (ŵ^2 − w_0^2) dP
        ≤ ∫ (w_0 − ŵ) dQ + (1/2) ∫ (ŵ^2 − w_0^2) dP
          + ∫ (ŵ − w_0) dQ_m + (1/2) ∫ (w_0^2 − ŵ^2) dP_n + (λ/2) ∥w_0∥_H^2 − (λ/2) ∥ŵ∥_H^2.

As a result, we have

    (1/2) ∥ŵ − w_0∥_P^2 + (λ/2) ∥ŵ∥_H^2
        ≤ ∫ (ŵ − w_0) d(Q − Q_m) + (1/2) ∫ (ŵ^2 − w_0^2) d(P − P_n) + (λ/2) ∥w_0∥_H^2
        ≤ (λ/2) ∥w_0∥_H^2 + O_p( ( ∥w_0 − ŵ∥_P^{1−γ/2} (1 + ∥ŵ∥_H)^{1+γ/2} / √(n ∧ m) ) ∨ ( (1 + ∥ŵ∥_H)^2 / (n ∧ m)^{2/(2+γ)} ) ),

where Lemma 2 is used. We need to study three possibilities:

    (1/2) ∥w_0 − ŵ∥_P^2 + (λ/2) ∥ŵ∥_H^2 ≤ O_p(λ),                                                       (26)
    (1/2) ∥w_0 − ŵ∥_P^2 + (λ/2) ∥ŵ∥_H^2 ≤ O_p( ∥w_0 − ŵ∥_P^{1−γ/2} (1 + ∥ŵ∥_H)^{1+γ/2} / √(n ∧ m) ),    (27)
    (1/2) ∥w_0 − ŵ∥_P^2 + (λ/2) ∥ŵ∥_H^2 ≤ O_p( (1 + ∥ŵ∥_H)^2 / (n ∧ m)^{2/(2+γ)} ).                     (28)

At least one of the above inequalities must be satisfied. We study each inequality below.

Case (26): we have

    (1/2) ∥w_0 − ŵ∥_P^2 ≤ O_p(λ),    (λ/2) ∥ŵ∥_H^2 ≤ O_p(λ),

and hence the inequalities ∥w_0 − ŵ∥_P ≤ O_p(λ^{1/2}) and ∥ŵ∥_H ≤ O_p(1) hold.

Case (27): we have

    ∥w_0 − ŵ∥_P^2 ≤ O_p( ∥w_0 − ŵ∥_P^{1−γ/2} (1 + ∥ŵ∥_H)^{1+γ/2} / (n ∧ m)^{1/2} ),
    λ ∥ŵ∥_H^2 ≤ O_p( ∥w_0 − ŵ∥_P^{1−γ/2} (1 + ∥ŵ∥_H)^{1+γ/2} / (n ∧ m)^{1/2} ).

The first inequality provides

    ∥w_0 − ŵ∥_P ≤ O_p( (1 + ∥ŵ∥_H) / (n ∧ m)^{1/(2+γ)} ).

Thus, the second inequality leads to

    λ ∥ŵ∥_H^2 ≤ O_p( ∥w_0 − ŵ∥_P^{1−γ/2} (1 + ∥ŵ∥_H)^{1+γ/2} / (n ∧ m)^{1/2} )
              ≤ O_p( ( (1 + ∥ŵ∥_H) / (n ∧ m)^{1/(2+γ)} )^{1−γ/2} (1 + ∥ŵ∥_H)^{1+γ/2} / (n ∧ m)^{1/2} )
              = O_p( (1 + ∥ŵ∥_H)^2 / (n ∧ m)^{2/(2+γ)} ).

Hence, we have

    ∥ŵ∥_H ≤ O_p( 1 / ( λ^{1/2} (n ∧ m)^{1/(2+γ)} ) ) = o_p(1).

Then, we obtain

    ∥w_0 − ŵ∥_P ≤ O_p( 1 / (n ∧ m)^{1/(2+γ)} ) ≤ O_p(λ^{1/2}).

Case (28): we have

    ∥w_0 − ŵ∥_P^2 ≤ O_p( (1 + ∥ŵ∥_H)^2 / (n ∧ m)^{2/(2+γ)} ),    λ ∥ŵ∥_H^2 ≤ O_p( (1 + ∥ŵ∥_H)^2 / (n ∧ m)^{2/(2+γ)} ).

Then, as shown in case (27), we have ∥ŵ∥_H = o_p(1). Hence, we obtain

    ∥w_0 − ŵ∥_P ≤ O_p( 1 / (n ∧ m)^{1/(2+γ)} ) ≤ O_p(λ^{1/2}).
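In summary (assuming, as the last two cases implicitly do, that λ^{−1} = O((n ∧ m)^{2/(2+γ)})), each of the three cases yields

    ∥w_0 − ŵ∥_P ≤ O_p(λ^{1/2})    and    ∥ŵ∥_H ≤ O_p(1);

in particular, for λ of order (n ∧ m)^{−2/(2+γ)} this gives ∥w_0 − ŵ∥_P = O_p((n ∧ m)^{−1/(2+γ)}).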
D  Proof of (13)

Theorem 3. Let H be the reproducing kernel Hilbert space endowed with the kernel function k on Z × Z, and suppose sup_{x∈Z} k(x, x) < ∞. Then, for w, v ∈ H, the equality

    ∫ w v dP − ∫ v dQ = ⟨Φ(w), v⟩_H

holds, where Φ(w) is defined by (11).

Proof. For all w ∈ H, sup_{x∈Z} |w(x)| is bounded, because |w(x)| = |⟨w, k(·, x)⟩_H| ≤ ∥w∥_H √(k(x, x)) < ∞. For a fixed w ∈ H, the map v ↦ ∫ w v dP − ∫ v dQ is linear and bounded as a function of v ∈ H. Indeed, the linearity is clear, and the boundedness is shown by

    | ∫ w v dP − ∫ v dQ | ≤ ∫ |v(x)| |w(x)| P(dx) + ∫ |v(y)| Q(dy)
                          = ∫ |⟨v, k(·, x)⟩_H| |w(x)| P(dx) + ∫ |⟨v, k(·, y)⟩_H| Q(dy)
                          ≤ sup_{x∈Z} √(k(x, x)) ( ∫ |w(x)| P(dx) + 1 ) ∥v∥_H.

Since |w(x)| is bounded, the integral above is finite. Then, by the Riesz representation theorem (Reed & Simon, 1972, Theorem II.4), there exists Ψ : H → H such that

    ∫ w v dP − ∫ v dQ = ⟨Ψ(w), v⟩_H

holds for all w, v ∈ H. For v = k(·, x_0) ∈ H, we have

    (Ψ(w))(x_0) = ∫ k(x_0, x) w(x) P(dx) − ∫ k(x_0, y) Q(dy),

where we used the symmetry of the kernel function. We see that the function Ψ is the same as the function Φ defined by (11).
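The same argument applies verbatim when P and Q are replaced by the empirical distributions P_n and Q_m, in which case the identity can be checked numerically. The sketch below (ours; the Gaussian kernel, the toy samples, and the helper names are illustrative assumptions) verifies that ∫ w v dP_n − ∫ v dQ_m equals ⟨Φ(w), v⟩_H computed via the reproducing property, for w and v taken in the span of a few kernel functions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(100, 2))      # samples defining P_n
Y = rng.normal(0.3, 1.0, size=(80, 2))       # samples defining Q_m
Z = rng.normal(0.0, 1.0, size=(5, 2))        # kernel centers of w and v
a = rng.normal(size=5)                        # w = sum_j a_j k(., Z_j)
c = rng.normal(size=5)                        # v = sum_j c_j k(., Z_j)

w_at = lambda U: gaussian_kernel(U, Z) @ a
v_at = lambda U: gaussian_kernel(U, Z) @ c

# left-hand side with empirical measures: ∫ w v dP_n − ∫ v dQ_m
lhs = np.mean(w_at(X) * v_at(X)) - np.mean(v_at(Y))

# empirical Φ(w)(x0) = (1/n) Σ_i k(x0, X_i) w(X_i) − (1/m) Σ_l k(x0, Y_l)
Phi_w_at = lambda U: (gaussian_kernel(U, X) @ w_at(X) / len(X)
                      - gaussian_kernel(U, Y).sum(axis=1) / len(Y))

# ⟨Φ(w), v⟩_H = Σ_j c_j Φ(w)(Z_j) by the reproducing property
rhs = c @ Phi_w_at(Z)
print(lhs, rhs)   # the two numbers agree up to floating-point error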
References

Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28, 131–142.

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.

Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101, 138–156.
Bartlett, P. L., & Tewari, A. (2007). Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8, 775–790.

Bickel, S., Brückner, M., & Scheffer, T. (2007). Discriminative learning for differing training and test distributions. Proceedings of the 24th International Conference on Machine Learning (pp. 81–88).

Bickel, S., Brückner, M., & Scheffer, T. (2009). Discriminative learning under covariate shift. Journal of Machine Learning Research, 10, 2137–2155.

Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2, 229–318.

Cucker, F., & Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39, 1–49.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore, MD: Johns Hopkins University Press.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. J. (2006). A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19 (pp. 513–520).

Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., & Schölkopf, B. (2009). Covariate shift by kernel mean matching. In J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, & N. Lawrence (Eds.), Dataset shift in machine learning, chapter 8, 131–160. Cambridge, MA: MIT Press.

Härdle, W., Müller, M., Sperlich, S., & Werwatz, A. (2004). Nonparametric and semiparametric models. Springer Series in Statistics. Berlin: Springer.

Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T. (2008). Inlier-based outlier detection via direct density ratio estimation. Proceedings of the IEEE International Conference on Data Mining (ICDM2008) (pp. 223–232). Pisa, Italy.

Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T. (2011). Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 26, 309–336.

Huang, J., Smola, A., Gretton, A., Borgwardt, K. M., & Schölkopf, B. (2007). Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems 19 (pp. 601–608). Cambridge, MA: MIT Press.

Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2010). Theoretical analysis of density ratio estimation. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E93-A, 787–798.

Kanamori, T., Suzuki, T., & Sugiyama, M. (2011). Kernel-based density ratio estimation: Part II, condition number analysis. Machine Learning. Submitted.

Kawahara, Y., & Sugiyama, M. (2011). Sequential change-point detection based on direct density-ratio estimation. Statistical Analysis and Data Mining. To appear.

Keerthi, S. S., Duan, K., Shevade, S. K., & Poo, A. N. (2005). A fast dual algorithm for kernel logistic regression. Machine Learning, 61, 151–165.

Kimeldorf, G. S., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33, 82–95.

Luenberger, D., & Ye, Y. (2008). Linear and nonlinear programming. Springer.

Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56, 5847–5861.

Park, C. (2009). Convergence rates of generalization errors for margin-based classification. Journal of Statistical Planning and Inference, 139, 2543–2551.

Platt, J. C. (2000). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers, 61–74.

Qin, J. (1998). Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85, 619–639.

Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N. (Eds.). (2009). Dataset shift in machine learning. Cambridge, MA: MIT Press.

R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287–320.

Reed, M., & Simon, B. (1972). Functional analysis. New York: Academic Press.

Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27, 832–837.

Rüping, S. (2003). myKLR – kernel logistic regression. University of Dortmund, Department of Computer Science.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227–244.

Smola, A., Song, L., & Teo, C. H. (2009). Relative novelty detection. Twelfth International Conference on Artificial Intelligence and Statistics (pp. 536–543).

Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 67–93.

Steinwart, I. (2005). Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51, 128–142.

Sugiyama, M. (2010). Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting. IEICE Transactions on Information and Systems, E93-D, 2690–2701.

Sugiyama, M., Kanamori, T., Suzuki, T., Hido, S., Sese, J., Takeuchi, I., & Wang, L. (2009). A density-ratio framework for statistical data processing. IPSJ Transactions on Computer Vision and Applications, 1, 183–208.

Sugiyama, M., & Kawanabe, M. (2011). Machine learning in non-stationary environments: Introduction to covariate shift adaptation. Cambridge, MA: MIT Press. To appear.

Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 985–1005.

Sugiyama, M., & Müller, K.-R. (2005). Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions, 23, 249–279.

Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008a). Direct importance estimation with model selection and its application to covariate shift adaptation. Advances in Neural Information Processing Systems 20 (pp. 1433–1440). Cambridge, MA: MIT Press.

Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio estimation in machine learning. Cambridge, UK: Cambridge University Press. To appear.

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008b). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699–746.

Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya, H., & Okanohara, D. (2010). Least-squares conditional density estimation. IEICE Transactions on Information and Systems, E93-D, 583–594.
Suzuki, T., Sugiyama, M., Sese, J., & Kanamori, T. (2008). Approximating mutual information by maximum likelihood density ratio estimation. JMLR Workshop and Conference Proceedings (pp. 5–20).

Suzuki, T., Sugiyama, M., & Tanaka, T. (2009). Mutual information approximation via maximum likelihood estimation of density ratio. Proceedings of the 2009 IEEE International Symposium on Information Theory (ISIT2009) (pp. 463–467). Seoul, Korea.

Tsuboi, Y., Kashima, H., Hido, S., Bickel, S., & Sugiyama, M. (2008). Direct density ratio estimation for large-scale covariate shift adaptation. Proceedings of the SIAM International Conference on Data Mining (SDM) (pp. 443–454).

van de Geer, S. (2000). Empirical processes in M-estimation. Cambridge University Press.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Wahba, G., Gu, C., Wang, Y., & Chappell, R. (1993). Soft classification, a.k.a. risk estimation, via penalized log likelihood and smoothing spline analysis of variance. The Mathematics of Generalization. Addison-Wesley.

Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., & Sugiyama, M. (2011). Relative density-ratio estimation for robust distribution comparison. Advances in Neural Information Processing Systems 24. To appear.

Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. Proceedings of the Twenty-First International Conference on Machine Learning. New York, NY: ACM Press.

Zeidler, E. (1986). Nonlinear functional analysis and its applications, I: Fixed-point theorems. Springer-Verlag.

Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics (pp. 1081–1088). MIT Press.