arXiv:1601.07460v1 [cs.LG] 27 Jan 2016

Information-theoretic lower bounds on learning the structure of Bayesian networks

Asish Ghoshal and Jean Honorio
Department of Computer Science, Purdue University, West Lafayette, IN 47906
Email: {aghoshal, jhonorio}@purdue.edu

Abstract

In this paper, we study the information-theoretic limits of learning the structure of Bayesian networks from data. We show that for Bayesian networks on continuous as well as discrete random variables, there exists a parameterization of the Bayesian network such that the minimum number of samples required to learn the "true" Bayesian network grows as O(m), where m is the number of variables in the network. Further, for sparse Bayesian networks, where the number of parents of any variable is restricted to be at most l for l ≪ m, the minimum number of samples required grows as O(l log m). We discuss conditions under which these limits are achieved. For Bayesian networks over continuous variables, we obtain results for Gaussian regression and Gumbel Bayesian networks; for discrete variables, we obtain results for Noisy-OR, conditional probability table (CPT) based, and logistic regression Bayesian networks. Finally, as a byproduct, we also obtain lower bounds on the sample complexity of feature selection in logistic regression and show that the bounds are sharp.

1 Introduction

A Bayesian network is a probabilistic graphical model that describes the conditional dependencies between a set of random variables as a directed acyclic graph (DAG). In a Bayesian network, the joint distribution of a set of random variables factorizes as a product of the conditional distribution of each variable given its parents. Bayesian networks have found widespread use as a probabilistic modeling and decision-making tool in a number of scientific disciplines, where they are used to reason about the value or distribution of some variables of interest given observations of other variables. However, in many problems of practical interest, the structure of the network is not known a priori and must be inferred from data. Structure learning for Bayesian networks entails learning the parent set of each node while constraining the resulting directed graph to be acyclic. A plethora of algorithms have been developed over the years for learning the structure of a Bayesian network from a given sample of data; see, e.g., [9] and [14]. In this paper we obtain lower bounds on the minimum number of samples required for recovering the true structure of the Bayesian network from an observed sample of data, irrespective of the particular method used for learning the structure. Our main contributions are as follows:

• We show that, in general, the sample complexity of learning the structure of Bayesian networks grows as O(m), where m is the number of variables in the network. We obtain results for various widely used Bayesian networks, namely Gaussian and Gumbel Bayesian networks for continuous-valued variables, and Noisy-OR, logistic regression and conditional probability table (CPT) based Bayesian networks for discrete-valued random variables.

• For sparse Bayesian network structure recovery, where the number of parents of any node is at most l for l ≪ m, the minimum number of samples required grows logarithmically with the number of variables, i.e. as O(l log m).

• We obtain lower bounds on the sample complexity of feature selection in logistic regression. Our results are complementary to those of [10] and show that L1-regularized logistic regression approaches the information-theoretic limits of feature selection.

The rest of the paper is organized as follows. In section 2 we discuss relevant prior work. Section 3 introduces the problem of Bayesian network structure learning, and section 4 presents our main result. In sections 5 and 6 we present sample complexity results for Bayesian networks on continuous and discrete variables respectively. Finally, we conclude by discussing potential avenues for future work in section 7.

2 Related Work

Some of the earliest sample complexity results for learning Bayesian networks were derived in [8] and [7]. In both [8] and [7] the authors provide upper bounds on the sample complexity of learning a Bayesian network that is likelihood consistent, i.e. the likelihood of the learned network is within ε of the likelihood of the true network in terms of the Kullback-Leibler (KL) divergence; in other words, they do not recover the true structure of the network. Moreover, in [8] the authors consider Boolean variables and each variable is allowed to have at most k parents. The authors in [1] provide polynomial sample complexity results for learning likelihood-consistent factor graphs. Among sample complexity results for learning structure-consistent Bayesian networks, where the structure of the learned network is close to the true network, [11] and [3] provide such guarantees for polynomial-time test-based methods, but the results hold only in the infinite-sample limit. The authors in [4] also provide a greedy hill-climbing algorithm for structure learning that is structure consistent in the infinite-sample limit. In [15] the authors show structure consistency for a single network and do not provide uniform consistency over all candidate networks, i.e. their bounds relate to the error of learning a specific wrong network having a score greater than the true network. In [2] the authors provide upper bounds on the sample complexity of recovering the structure of sparse Bayesian networks; however, they consider binary-valued variables only and the sample complexity grows as O(m²). Finally, for completeness, we mention another sample complexity result for Bayesian networks, [6], which is concerned not with learning the structure of the network but with learning the parameters of the conditional distributions given a fixed, known graph structure. To the best of our knowledge, we are the first to analyze the information-theoretic limits of learning structure-consistent Bayesian networks.

3 Problem Formulation

Let Fφ,m denote the space of Bayesian networks over m variables, X1:m = {X1, . . . , Xm}, whose joint distribution factorizes as follows:

Fφ,m = { {πi}_{i=1}^m : πi ⊆ {1, . . . , m} and p(X1:m) = ∏_{i=1}^m φ(Xi : Xπi) },   (1)

where φ(Xi : Xπi) = p(Xi | Xπi) denotes the family of distributions specifying the conditional probability densities in Fφ,m and Xπ = {Xi | i ∈ π}. The parent set Xπi of each variable Xi is chosen such that the resulting graph is a directed acyclic graph (DAG). We also denote by Fφ,m,l the space of sparse Bayesian networks in which the number of parents of any variable is at most l, i.e.

max_{i=1,...,m} |πi| ≤ l ≤ m/2.

Nature picks a "true" Bayesian network f̄ from the family Fφ,m (or Fφ,m,l) uniformly at random and generates i.i.d. sample data S = {x^(i)}_{i=1}^n, where x^(i) ∈ D^m is the i-th sample and D is the domain on which the random variables are defined. The learner then infers a network f̂ from the sample data S. In this paper, we are interested in studying the information-theoretic limits of recovering the true network f̄, irrespective of the particular algorithm used to learn the network from data. In the following section, we present our main results.

4 Main Result

In order to derive our main result, we consider restricted ensembles of networks F̃φ,m ⊆ Fφ,m and F̃φ,m,l ⊆ Fφ,m,l, where the space of possible networks is constrained by the choice of parents for one variable only. In other words, we assume that the space of networks F̃φ,m, and similarly F̃φ,m,l, is given by networks that factorize as follows:

F̃φ,m = { π : π ⊆ {3, . . . , m} and p(X1:m) = φ(X1 : Xπ) φ(X2 : X3:m) ∏_{i=3}^m φ(Xi) },   (2)

where Xπ = {Xi | i ∈ π} and X3:m = {X3, . . . , Xm}. Figure 1 shows a particular Bayesian network that belongs to F̃φ,m. Our main approach for deriving the results in this section is to show that the mutual information I(f̄, S) is only a function of the number of samples, while the number of networks in the restricted ensembles is large, i.e. exponential in the number of variables. We then use Fano's inequality [5] to lower bound the probability of failing to learn the correct network. We present our main result as the following theorem.

Figure 1: A Bayesian network in the family F̃φ,m, where the parent set of the node X1 is π = {X3, X4}.
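To make the generative process above concrete, the following is a minimal Python sketch (ours, not part of the paper) of how nature selects a network from the restricted ensemble F̃φ,m and generates i.i.d. data from it. The Gaussian conditionals of Section 5.1 are used for illustration, and the particular conditional chosen for X2 is an arbitrary placeholder, since the ensemble leaves φ(X2 : X3:m) unspecified.

```python
import numpy as np

def sample_restricted_network(m, rng):
    """Pick a parent set pi of X1 from {3, ..., m} uniformly at random;
    this indexes one of the 2^(m-2) networks in the restricted ensemble."""
    mask = rng.integers(0, 2, size=m - 2).astype(bool)
    return [i for i, keep in zip(range(3, m + 1), mask) if keep]

def sample_data(pi, m, n, rng, mu=None):
    """Generate n i.i.d. samples with Gaussian conditionals (Section 5.1):
    X_i ~ N(mu_i, 1) for i >= 3 and X1 | X_pi ~ N(mu_1 + mean(X_pi), 1)."""
    mu = np.zeros(m + 1) if mu is None else mu              # 1-indexed means, |mu_i| <= 1
    X = np.zeros((n, m + 1))
    X[:, 3:] = rng.normal(mu[3:], 1.0, size=(n, m - 2))      # root variables X3, ..., Xm
    parent_mean = X[:, pi].mean(axis=1) if pi else 0.0       # bar{X}_pi
    X[:, 1] = rng.normal(mu[1] + parent_mean, 1.0)           # X1 given its parents
    X[:, 2] = rng.normal(X[:, 3:].mean(axis=1), 1.0)         # illustrative choice for X2 | X_{3:m}
    return X[:, 1:]                                          # drop the unused column 0

rng = np.random.default_rng(0)
pi = sample_restricted_network(m=10, rng=rng)
S = sample_data(pi, m=10, n=100, rng=rng)
```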

Theorem 1. Assume there exists a true Bayesian network f̄ picked uniformly at random from some family F. Further assume that we are given a data set S of n i.i.d. samples of the m variables drawn from the Bayesian network f̄. The learner infers the network f̂ from the sample data set S. Let PX1|Xπ = p(X1 | Xπ) and PX1|Xπ′ = p(X1 | Xπ′) for π, π′ ⊆ {3, . . . , m}. If the condition

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] ≤ c

holds for all π, π′ and some c > 0, then we have the following:

(a) F = F̃φ,m ⇒ if the number of samples n < (m − 4)/(2c), then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

(b) F = F̃φ,m,l ⇒ if the number of samples n < (l log(m − 2) − l log l − 2)/(2c), then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

Proof. First we use an averaged pairwise KL-based bound from [13] to bound the mutual information I(f̄, S). Let PS|f be the distribution of the sample data under a network f in which the parent set of X1 is π, and let PS|f′ be the distribution of the sample data under a network f′ in which the parent set of X1 is π′. Let k = |F|. From the independence and identical distribution of the samples, and using Proposition 2, whose statement and proof we defer to the appendix¹, we have:

I(f̄, S) ≤ (1/k²) Σ_{f∈F} Σ_{f′∈F} KL( PS|f ‖ PS|f′ ) ≤ max_{f,f′∈F} KL( PS|f ‖ PS|f′ ),

and for any pair f, f′,

KL( PS|f ‖ PS|f′ ) = n KL( PX1:m|f ‖ PX1:m|f′ ) = n E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] ≤ nc,

so that I(f̄, S) ≤ nc.

Next we compute the number of networks in each of the families. Let k = |F| be the number of Bayesian networks. We have the following two cases, corresponding to the two cases in the statement of the theorem:

(a) If |π| ∈ {0, . . . , m − 2}, then k = 2^{m−2}.

(b) If |π| ≤ l, where l is a constant less than (m − 2)/2, then we have the following lower bound on log k:

log k = log Σ_{i=0}^{l} C(m − 2, i) ≥ log C(m − 2, l) ≥ log ( (m − 2)/l )^l = l log(m − 2) − l log l.

Then, using Fano's inequality, we have

p(f̂ ≠ f̄) ≥ 1 − (I(f̄, S) + 1)/log k ≥ 1 − (nc + 1)/log k.

Setting the right-hand side equal to 1/2 and solving for n proves our claim.

¹ The full version of the paper with the appendix can be found at https://goo.gl/cSaOzJ

The above theorem specifies the condition under which the sample complexity of learning the structure of Bayesian networks scales linearly, and for sparse Bayesian networks logarithmically, with the number of variables in the network. We show that the condition holds for many widely used Bayesian networks, such as Gaussian and logistic regression networks, noisy-OR networks, and conditional probability table based networks, among others. The main challenge in obtaining sample complexity results for the various types of Bayesian networks is to bound the KL divergence between the conditional distributions of X1 under any two different choices of the parent set, as required by the theorem above. Armed with the above result, we next derive sample complexity results for Bayesian networks over continuous and discrete random variables.
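As a quick worked example (ours, not from the paper), the failure thresholds in parts (a) and (b) of Theorem 1 can be evaluated directly for given m, l and c. Natural logarithms are used below, matching the nats convention of the KL-based bound.

```python
import math

def fano_thresholds(m, l, c):
    """Sample sizes below which structure recovery fails with probability >= 1/2,
    per Theorem 1: part (a) for the full ensemble, part (b) for the sparse one."""
    n_dense = (m - 4) / (2 * c)
    n_sparse = (l * math.log(m - 2) - l * math.log(l) - 2) / (2 * c)
    return n_dense, n_sparse

# e.g. Gaussian networks have c = 3 (Lemma 1); with m = 1000 variables and l = 10 parents:
print(fano_thresholds(m=1000, l=10, c=3))   # prints approximately (166.0, 7.34)
```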

5 Continuous variables

5.1 Gaussian Bayesian networks

For Gaussian Bayesian networks we assume the conditional distributions to be univariate Gaussian:

Xi ∼ N(µi, 1)   (∀ 3 ≤ i ≤ m),
X1 | Xπ ∼ N(µ1 + X̄π, 1),   where X̄π = (1/|π|) Σ_{i∈π} Xi,

and, for all 1 ≤ i ≤ m, µi ∈ R with |µi| ≤ 1. We have the following lemma for the KL divergence of the conditional distribution of X1. Omitted proofs of lemmas and propositions can be found in the appendix.

Lemma 1. For any π, π′ ⊆ {3, . . . , m} we have:

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] ≤ 3.
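The bound in Lemma 1 is easy to sanity-check numerically. The sketch below (ours, illustrative only) estimates the expected KL divergence for one pair of disjoint parent sets, using the closed form for the KL divergence between two unit-variance Gaussians, which is half the squared difference of their means.

```python
import numpy as np

def gaussian_kl_bound_mc(m=50, mu_max=1.0, trials=200_000, seed=0):
    """Monte Carlo estimate of E_{X_{3:m}}[ KL( N(mu_1 + mean(X_pi), 1) || N(mu_1 + mean(X_pi'), 1) ) ]
    for a pair of disjoint parent sets pi and pi'."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(-mu_max, mu_max, size=m - 2)            # means of X_3, ..., X_m, |mu_i| <= 1
    X = rng.normal(mu, 1.0, size=(trials, m - 2))
    pi = np.arange(0, (m - 2) // 2)                           # two disjoint parent sets
    pi_prime = np.arange((m - 2) // 2, m - 2)
    diff = X[:, pi].mean(axis=1) - X[:, pi_prime].mean(axis=1)
    return 0.5 * (diff ** 2).mean()                           # KL of two unit-variance Gaussians

print(gaussian_kl_bound_mc())   # typically well below the bound of 3 from Lemma 1
```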

A key assumption for bounding the KL divergence term above is that the mean of the conditional distribution of X1 is bounded by a constant, regardless of the number of parents of X1. The following theorem follows from Lemma 1 and Theorem 1.

Theorem 2. Assume there exists a true Bayesian network f̄ ∈ F where the conditional distribution φ is Gaussian and Xi ∈ R for all 1 ≤ i ≤ m. Further, assume that we are given a data set S of n i.i.d. samples of the m random variables drawn from f̄. The learner then infers the Bayesian network f̂ from the sample data set. There exists a specific parameterization of the distribution φ such that

(a) F = F̃φ,m ⇒ if n ≤ (m − 4)/6, then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

(b) F = F̃φ,m,l ⇒ if n ≤ (l log(m − 2) − l log l − 2)/6, then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

A consequence of the above theorem is that we can characterize the information-theoretic limits of any procedure for solving the problem of variable selection, or support recovery, in linear regression models. The following corollary gives the minimum number of samples needed to correctly recover the predictors in a linear regression model.

Corollary 1. Assume we have the following linear model: y = wᵀx + ε, where y ∈ R, w, x ∈ R^p and ε ∼ N(0, 1). Let Sw be the support of w, i.e. Sw = {i | 1 ≤ i ≤ p and |wi| > 0}. We are given a data set S = {(yi, xi)}_{i=1}^n of observations where xi,j ∼ N(µj, 1). Let ‖w‖ ≤ 1 and |µj| ≤ 1. The learner then infers ŵ from the data set S, with support Sŵ. Then we have:

(a) |Sw| ≤ p ⇒ if the number of samples n ≤ (p − 2)/6, then support recovery fails with probability at least 1/2, i.e. p(Sŵ ≠ Sw) ≥ 1/2.

(b) |Sw| ≤ k ≤ p/2 ⇒ if the number of samples n ≤ (k log p − k log k − 2)/6, then support recovery fails with probability at least 1/2, i.e. p(Sŵ ≠ Sw) ≥ 1/2.

Proof. The above corollary follows from Theorem 2 after setting p = m − 2, y = X1, disconnecting the variable X2, i.e. setting p(X2 | X3:m) = p(X2), and treating the variables X3, . . . , Xm as covariates, i.e. x = {X3, . . . , Xm}.

The logarithmic sample complexity for sparse support recovery in linear regression models was also proved in [12].

5.2 Gumbel Bayesian networks

For Gumbel Bayesian networks we have the following setting:

Xi ∼ Gumbel(µi, 1)   (∀ 3 ≤ i ≤ m),
X1 | Xπ = max(Xπ).

We have the following result on the distribution of the maximum of Gumbel distributed random variables.

Proposition 1 (Maximum of Gumbel random variables). If, for all 1 ≤ i ≤ n, the Xi are independent random variables with Xi ∼ Gumbel(µi, 1) and Y = max_{1≤i≤n}(Xi), then Y ∼ Gumbel(µy, 1), where

µy = log Σ_{i=1}^{n} exp(µi).

Therefore, from Proposition 1 we have that

X1 | Xπ ∼ Gumbel(µπ, 1),   where µπ = log Σ_{i∈π} exp(µi).
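A quick numerical check of Proposition 1 (ours, illustrative only): the empirical mean of the maximum of independent Gumbel(µi, 1) variables should match the mean of a Gumbel(µy, 1) variable, namely µy + γ, with γ the Euler-Mascheroni constant.

```python
import numpy as np

def check_max_gumbel(mus, trials=200_000, seed=0):
    """Empirically check Proposition 1: max of independent Gumbel(mu_i, 1) variables
    is Gumbel(mu_y, 1) with mu_y = log(sum_i exp(mu_i)); compare the two means."""
    rng = np.random.default_rng(seed)
    gamma = 0.5772156649015329                               # Euler-Mascheroni constant
    samples = rng.gumbel(loc=np.array(mus), scale=1.0, size=(trials, len(mus)))
    empirical_mean = samples.max(axis=1).mean()
    mu_y = np.log(np.exp(mus).sum())
    return empirical_mean, mu_y + gamma                      # these should agree closely

print(check_max_gumbel(mus=np.array([-0.3, 0.0, 0.4, 0.9])))
```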

Further, assume that

µi = log(θi/m) with 1 ≤ θi ≤ e   (∀ 3 ≤ i ≤ m).

From the above assumptions we have that 0 ≤ µπ ≤ 1.

Lemma 2. For any π, π′ ⊆ {3, . . . , m} we have:

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] ≤ e.

Similar to the Gaussian case, the key to proving the above result is to ensure that the conditional mean of X1 is bounded by a constant. The following theorem follows from Lemma 2 and Theorem 1.

Theorem 3. Assume there exists a true Bayesian network f̄ ∈ F where the conditional distribution φ is Gumbel and Xi ∈ R for all 1 ≤ i ≤ m. Further, assume that we are given n i.i.d. samples of the m random variables drawn from f̄. The learner then infers the Bayesian network f̂ from the sample data set. There exists a specific parameterization of the distribution φ such that

(a) F = F̃φ,m ⇒ if n ≤ (m − 4)/(2e), then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

(b) F = F̃φ,m,l ⇒ if n ≤ (l log(m − 2) − l log l − 2)/(2e), then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

6 Discrete variables

6.1 Noisy-OR Bayesian networks

For the noisy-OR setting we have binary random variables, and the conditional distribution of a node given its parents is the noisy-OR distribution:

Xi ∼ Bernoulli(θ)   (∀ 3 ≤ i ≤ m),
p(X1 | Xπ) = θ ∏_{i∈π} θ^{Xi}   if X1 = 0,   and   1 − θ ∏_{i∈π} θ^{Xi}   otherwise,

where 0 < θ < 1. Let Yπ = Σ_{i∈π} Xi and Zπ = Yπ + 1. Then we have

Yπ ∼ Binomial(m − 2, θ),
X1 | Xπ ∼ Bernoulli(1 − θ^{Yπ+1}).

Finally, we assume m ≥ 4 and set θ = 1/(m − 2).
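The construction above is straightforward to simulate. The sketch below (ours, for illustration) draws the binary parents and then X1 from the noisy-OR conditional, i.e. p(X1 = 0 | Xπ) = θ^{Zπ}.

```python
import numpy as np

def sample_noisy_or(pi, m, n, seed=0):
    """Sketch of the noisy-OR construction: X_i ~ Bernoulli(theta) for i >= 3 with
    theta = 1/(m-2), and X1 | X_pi ~ Bernoulli(1 - theta^(Z_pi)), Z_pi = sum_{i in pi} X_i + 1."""
    rng = np.random.default_rng(seed)
    theta = 1.0 / (m - 2)
    X = rng.binomial(1, theta, size=(n, m - 2))              # X_3, ..., X_m in columns 0..m-3
    cols = [i - 3 for i in pi]                                # map variable indices to columns
    Z = X[:, cols].sum(axis=1) + 1
    X1 = rng.binomial(1, 1.0 - theta ** Z)                    # noisy-OR conditional for X1
    return X1, X

X1, X = sample_noisy_or(pi=[3, 5, 8], m=20, n=1000)
```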

Lemma 3. For any π, π′ ⊆ {3, . . . , m} we have:

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] ≤ 1.

Analogous to the continuous case, the key to proving that the KL divergence is bounded in the discrete case is to ensure that the parameterization of the conditional distribution of X1 is independent of the number of parents. This is why we set θ = 1/(m − 2): it ensures that the mean of Yπ is constant. The following theorem follows from Lemma 3 and Theorem 1.

Theorem 4. Assume there exists a true Bayesian network f̄ ∈ F where the conditional distribution φ is the noisy-OR distribution and Xi ∈ {0, 1} for all 1 ≤ i ≤ m. Further, assume that we are given n i.i.d. samples of the m random variables drawn from f̄. The learner then infers the Bayesian network f̂ from the sample data set. There exists a specific parameterization of the distribution φ such that

(a) F = F̃φ,m ⇒ if n ≤ (m − 4)/2, then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

(b) F = F̃φ,m,l ⇒ if n ≤ (l log(m − 2) − l log l − 2)/2, then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

6.2 Logistic regression Bayesian networks

In this case we assume the following:

Xi ∼ Bernoulli(θi)   (∀ 3 ≤ i ≤ m),
X1 | Xπ ∼ Bernoulli(σ(wπᵀXπ)),

where σ(x) is the logistic function, given by σ(x) = 1/(1 + exp(−x)), 0 ≤ θi ≤ 1, and every entry of wπ equals 1/|π|. Note that in this setting 1/2 ≤ σ(wπᵀXπ) < 1. Denote θπ = {θi | Xi ∈ π}.
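The parameterization wπ = (1/|π|, . . . , 1/|π|) keeps the logit in [0, 1] regardless of how many parents X1 has, which is what makes the KL divergence bounded by a constant (Lemma 4 below). A tiny sketch of this conditional (ours, illustrative only):

```python
import numpy as np

def logistic_cpd(x_pi):
    """Conditional p(X1 = 1 | X_pi) with every entry of w_pi equal to 1/|pi|:
    the logit w_pi^T X_pi lies in [0, 1] for X_pi in {0,1}^{|pi|},
    so sigma(w_pi^T X_pi) lies in [1/2, 1)."""
    x_pi = np.asarray(x_pi, dtype=float)
    logit = x_pi.mean()                                      # w_pi^T X_pi with w_pi = 1/|pi|
    return 1.0 / (1.0 + np.exp(-logit))

print(logistic_cpd([0, 0, 0]))   # 0.5
print(logistic_cpd([1, 1, 1]))   # sigma(1) ~ 0.731 < 1
```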

Lemma 4. For any π, π′ ⊆ {3, . . . , m} we have:

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] ≤ 2.

The following theorem follows from Lemma 4 and Theorem 1.

Theorem 5. Assume there exists a true Bayesian network f̄ ∈ F where the conditional distribution φ is given by the logistic regression model and Xi ∈ {0, 1} for all 1 ≤ i ≤ m. Further, assume that we are given n i.i.d. samples of the m random variables drawn from f̄. The learner then infers the Bayesian network f̂ from the sample data set. There exists a specific parameterization of the distribution φ such that

(a) F = F̃φ,m ⇒ if n < (m − 4)/4, then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

(b) F = F̃φ,m,l ⇒ if n ≤ (l log(m − 2) − l log l − 2)/4, then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

As a consequence of the above theorem, we can obtain lower bounds on the sample complexity of feature selection in logistic regression.

Corollary 2. Assume we have the following logistic regression model:

p(y = 1 | x, w) = 1/(1 + e^{−wᵀx}),

where y ∈ {0, 1}, x ∈ {−1, 1}^p and w ∈ R^p. Let Sw be the support of w, i.e. Sw = {i | 1 ≤ i ≤ p and |wi| > 0}. We are given a data set S = {(yi, xi)}_{i=1}^n of observations drawn from some distribution. Let ‖w‖₁ ≤ 1. The learner then infers ŵ from the data set S, with support Sŵ. Then we have:

(a) |Sw| ≤ p ⇒ if the number of samples n ≤ (p − 2)/6, then support recovery fails with probability at least 1/2, i.e. p(Sŵ ≠ Sw) ≥ 1/2.

(b) |Sw| ≤ k ≤ p/2 ⇒ if the number of samples n ≤ (k log p − k log k − 2)/6, then support recovery fails with probability at least 1/2, i.e. p(Sŵ ≠ Sw) ≥ 1/2.

Proof. The above corollary follows from Theorem 5 after setting p = m − 2, y = X1, disconnecting the variable X2, i.e. setting p(X2 | X3:m) = p(X2), and treating the variables X3, . . . , Xm as covariates, i.e. x = {X3, . . . , Xm}. Note that in this case

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] ≤ 3.

The above sample complexity results for logistic regression complement the upper bounds in [10], where the authors showed that the sample complexity of L1-regularized logistic regression grows logarithmically with the number of irrelevant features. Thus, we conclude that L1-regularized logistic regression is statistically optimal for feature selection.

6.3 Conditional probability table (CPT) Bayesian networks

In this setting we assume that the conditional probabilities are specified by conditional probability tables. Further, we assume that every variable takes one of v values. Let θ be some element of the v-dimensional probability simplex, i.e. θ = {θi}_{i=1}^v with θi > 0 and Σ_{i=1}^v θi = 1. Nature generates the conditional probability table for X1 as follows. After picking the parents π of the variable X1, for each assignment xπ to the variables in Xπ, out of the v^{|π|} possible assignments, nature generates X1 as

X1 | (Xπ = xπ) ∼ Multinomial(ψ(θ)),

where ψ(θ) is a permutation of the entries of θ, generated uniformly at random.
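The sketch below (ours, illustrative only) spells out this table-generation step: for every assignment of the parents, the conditional distribution of X1 is an independently drawn, uniformly random permutation of θ.

```python
import numpy as np
from itertools import product

def generate_cpt(pi, v, theta, seed=0):
    """Build the CPT for X1 as in Section 6.3: for each of the v^{|pi|} parent
    assignments x_pi, X1 | (X_pi = x_pi) ~ Multinomial(psi(theta)), where psi is a
    uniformly random permutation of the fixed probability vector theta."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    cpt = {}
    for x_pi in product(range(1, v + 1), repeat=len(pi)):    # all parent assignments
        cpt[x_pi] = theta[rng.permutation(v)]                 # a random permutation psi(theta)
    return cpt

cpt = generate_cpt(pi=[3, 4], v=3, theta=[0.5, 0.3, 0.2])
```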

Lemma 5. Let Xi ∈ {1, . . . , v} for all i. For any π, π′ ⊆ {3, . . . , m} and any pair of permutations ψ(θ) and ψ′(θ) we have:

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] ≤ log v.

Theorem 6. Assume there exists a true Bayesian network f̄ ∈ F where the conditional distribution φ is given by some conditional probability table and Xi ∈ {1, . . . , v} for all 1 ≤ i ≤ m. Further, assume that we are given n i.i.d. samples of the m random variables drawn from f̄. The learner then infers the Bayesian network f̂ from the sample data set. There exists a specific parameterization of the distribution φ such that

(a) F = F̃φ,m ⇒ if n < (v log(v/e) − 2)/(2 log v) + (m − 3/2)/2, then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

(b) F = F̃φ,m,l ⇒ if n < (v + l)/2 + (l(log(m − 2) − log l) − log e − 2)/(2 log v), then learning fails with probability at least 1/2, i.e. p(f̂ ≠ f̄) ≥ 1/2.

Proof. In this case nature picks both the parents of variable X1 and the entries of the conditional probability table. Therefore, the number of Bayesian networks k is given as follows:

(a) If |π| ∈ {0, . . . , m − 2}, then

log k = log Σ_{i=0}^{m−2} C(m − 2, i) v^i v! = log( v! (v + 1)^{m−2} ) > (v + m − 3/2) log v − v log e.

(b) If |π| ≤ l ≤ m/2, then

log k ≥ log( v! v^l ( (m − 2)/l )^l ) > (v + l) log v + l(log(m − 2) − log l) − log e.

The final result follows from Lemma 5 and Theorem 1.

7 Conclusion

In this paper we derived lower bounds on the sample complexity of learning Bayesian networks. We showed that there exist parameterizations of Bayesian networks such that the minimum number of samples required to learn the true network grows linearly with the number of variables m; for sparse networks, the minimum number of samples required reduces to O(l log m). To derive the results in this paper, we considered restricted ensembles of networks in which the learning problem corresponds to identifying the parents of a single variable only. An interesting direction for future work is to obtain sample complexity results for the general case of learning the parents of all variables. Obtaining sharp phase transitions for structure learning in Bayesian networks would also be of interest.

References

[1] P. Abbeel, D. Koller, and A. Ng. Learning factor graphs in polynomial time and sample complexity. UAI, 2005.

[2] E. Brenner and D. Sontag. SparsityBoost: A new scoring function for learning Bayesian network structure. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI-13), pages 112–121, Corvallis, Oregon, 2013. AUAI Press.

[3] J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu. Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence Journal, 2002.

[4] D. Chickering and C. Meek. Finding optimal Bayesian networks. UAI, 2002.

[5] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 2nd edition, 2006.

[6] S. Dasgupta. The sample complexity of learning fixed-structure Bayesian networks. Machine Learning, 29(2-3):165–180, 1997.

[7] N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, pages 274–282. Morgan Kaufmann Publishers Inc., 1996.

[8] K.-U. Höffgen. Learning and robust learning of product distributions. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 77–83. ACM, 1993.

[9] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[10] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04), page 78, 2004.

[11] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search, volume 81. MIT Press, 2000.

[12] M. J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Transactions on Information Theory, 55(12):5728–5741, 2009.

[13] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pages 423–435. Springer, New York, NY, 1997.

[14] Y. Zhou. Structure learning of probabilistic graphical models: A comprehensive survey. 2007.

[15] O. Zuk, S. Margel, and E. Domany. On the number of samples needed to learn the correct structure of a Bayesian network. UAI, 2006.

Appendix

Proposition 2. Let PX,Y = p(X, Y) and QX,Y = q(X, Y) be two distributions. Also, let PY = p(Y), QY = q(Y), PX|Y = p(X | Y = y) and QX|Y = q(X | Y = y). Then we have

KL( PX,Y ‖ QX,Y ) = E_Y[ KL( PX|Y ‖ QX|Y ) ] + KL( PY ‖ QY ).

Proof. The statement follows directly from factorizing the joint distributions according to the Bayes rule, i.e. p(X, Y) = p(X | Y) p(Y) and q(X, Y) = q(X | Y) q(Y), and splitting the resulting logarithm inside the expectation.
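As a small sanity check (ours, not part of the paper), the chain rule in Proposition 2 can be verified numerically on random discrete distributions:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def check_chain_rule(nx=3, ny=4, seed=0):
    """Numerically verify Proposition 2 (chain rule for the KL divergence) on random
    discrete joints P(X, Y) and Q(X, Y) over an nx-by-ny grid."""
    rng = np.random.default_rng(seed)
    P = rng.random((nx, ny)); P /= P.sum()
    Q = rng.random((nx, ny)); Q /= Q.sum()
    Py, Qy = P.sum(axis=0), Q.sum(axis=0)                    # marginals of Y
    lhs = kl(P.ravel(), Q.ravel())
    cond = sum(Py[y] * kl(P[:, y] / Py[y], Q[:, y] / Qy[y]) for y in range(ny))
    rhs = cond + kl(Py, Qy)
    return lhs, rhs                                          # should match up to float error

print(check_chain_rule())
```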

Proof of Proposition 1 (Max of Gumbel variables).

p(Y ≤ y) = ∏_{i=1}^n p(Xi ≤ y) = ∏_{i=1}^n e^{−e^{−(y−µi)}} = e^{−Σ_{i=1}^n e^{−(y−µi)}} = e^{−e^{−y} Σ_{i=1}^n e^{µi}} = e^{−e^{−(y−µy)}},

with µy = log Σ_{i=1}^n exp(µi), which is the CDF of a Gumbel(µy, 1) random variable.

Proof of Lemma 1 (KL for Gaussian regression).

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] = E_{X3:m}[ (1/2) ( (1/|π|) Σ_{i∈π} Xi − (1/|π′|) Σ_{i∈π′} Xi )² ]
  = (1/2) E_{X3:m}[ ( X̄π − X̄π′ )² ]
  = (1/2) { V[ X̄π − X̄π′ ] + ( E_{X3:m}[ X̄π − X̄π′ ] )² }.

Now, since the variables are independent with unit variance and Cov(X̄π, X̄π′) ≥ 0,

V[ X̄π − X̄π′ ] ≤ V[ X̄π ] + V[ X̄π′ ] = (1/|π|²) Σ_{i∈π} V[Xi] + (1/|π′|²) Σ_{i∈π′} V[Xi] = 1/|π| + 1/|π′| ≤ 2,

and

( E_{X3:m}[ X̄π − X̄π′ ] )² = ( (1/|π|) Σ_{i∈π} µi − (1/|π′|) Σ_{i∈π′} µi )² ≤ ( (1/|π|) Σ_{i∈π} 1 + (1/|π′|) Σ_{i∈π′} 1 )² = ( |π|/|π| + |π′|/|π′| )² = 4.

Using the above equations, the expectation is at most (1/2)(2 + 4) = 3, which proves our claim.

Proof of Lemma 2 (KL for Gumbel).

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] = E_{X3:m}[ E_{X1|Xπ}[ log( p(X1 | Xπ) / p(X1 | Xπ′) ) ] ]
  = E_{X3:m}[ −(γ + 1) − E_{X1|Xπ}[ −(X1 − µπ′) − exp(−(X1 − µπ′)) ] ]
  = E_{X3:m}[ −(γ + 1) + µπ + γ − µπ′ + E_{X1|Xπ}[ e^{−X1} ] e^{µπ′} ]
  = E_{X3:m}[ −1 + µπ − µπ′ + exp( −(µπ − µπ′) ) ]
  ≤ e,

where the last inequality uses 0 ≤ µπ, µπ′ ≤ 1. In the above proof, γ is Euler's constant, the term −(γ + 1) is the expected log-density of a Gumbel(µπ, 1) variable under its own distribution, and the third line follows from the moment generating function of the Gumbel distribution, which gives E_{X1|Xπ}[ e^{−X1} ] = e^{−µπ}.

Proof of Lemma 3 (KL for Noisy-OR). We have

log p(X1 | Xπ) = X1 log(1 − θ^{Zπ}) + (1 − X1) Zπ log θ.

Also, for any 0 < x < 1,

−x > log(1 − x) > −x/(1 − x),

and 0 < θ^m ≤ θ^{Zπ}, θ^{Zπ′} ≤ θ < 1. Using the above facts we can compute the following upper bound on the KL divergence:

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ]
  = E_{X3:m}[ E_{X1|Xπ}[ X1 log( (1 − θ^{Zπ}) / (1 − θ^{Zπ′}) ) + (1 − X1)(Zπ − Zπ′) log θ ] ]
  = E_{X3:m}[ (1 − θ^{Zπ}) ( log(1 − θ^{Zπ}) − log(1 − θ^{Zπ′}) ) + θ^{Zπ} (Zπ − Zπ′) log θ ]
  ≤ E_{X3:m}[ (1 − θ^{Zπ}) ( −θ^{Zπ} + θ^{Zπ′}/(1 − θ^{Zπ′}) ) + θ^{Zπ} (Zπ − Zπ′) log θ ]
  ≤ E_{X3:m}[ θ^{Zπ} (Zπ − Zπ′) log θ + (1 − θ^{Zπ}) θ^{Zπ′} / (1 − θ^{Zπ′}) ]
  ≤ E_{X3:m}[ θ^{Zπ} (Zπ − Zπ′) log θ + θ/(1 − θ) ]
  ≤ E_{X3:m}[ (Zπ − Zπ′) θ log θ ] + θ/(1 − θ)
  ≤ | E_{X3:m}[Zπ] − E_{X3:m}[Zπ′] | / 2 + θ/(1 − θ)
  ≤ θ | |π| − |π′| | / 2 + θ/(1 − θ)
  ≤ θ (m − 2)/2 + θ/(1 − θ)
  ≤ 1/2 + 1/(m − 2)
  ≤ 1/2 + 1/2 = 1.

Proof of Lemma 4 (KL for Logistic regression).

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ]
  = E_{X3:m}[ σ(wπᵀXπ) log( σ(wπᵀXπ) / σ(wπ′ᵀXπ′) ) + (1 − σ(wπᵀXπ)) log( (1 − σ(wπᵀXπ)) / (1 − σ(wπ′ᵀXπ′)) ) ]
  = E_{X3:m}[ σ(wπᵀXπ) log( σ(wπᵀXπ)(1 − σ(wπ′ᵀXπ′)) / ( σ(wπ′ᵀXπ′)(1 − σ(wπᵀXπ)) ) ) + log( (1 − σ(wπᵀXπ)) / (1 − σ(wπ′ᵀXπ′)) ) ]
  < E_{X3:m}[ wπᵀXπ − wπ′ᵀXπ′ + 1 ]
  = wπᵀθπ − wπ′ᵀθπ′ + 1
  ≤ 2.

Proof of Lemma 5 (KL for conditional probability table). Let ψi(θ) be the i-th component of some permutation of the entries of the vector θ. We have:

E_{X1|π}[ log p(X1 | Xπ) ] = Σ_{i=1}^v E_{X1|π}[ 1[X1 = i] log ψi(θ) ] = Σ_{i=1}^v ψi(θ) log ψi(θ).

Using the above we can derive the KL divergence as follows:

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] = E_{X3:m}[ Σ_{i=1}^v ψi(θ) ( log ψi(θ) − log ψi′(θ) ) ].

Averaging over the two independent, uniformly random permutations ψ and ψ′ and bounding the resulting cross term, this reduces to

E_{X3:m}[ KL( PX1|Xπ ‖ PX1|Xπ′ ) ] ≤ Σ_{j=1}^v θj log θj − Σ_{j=1}^v θj log v + 2 log v ≤ log v,

where the last inequality uses Σ_{j=1}^v θj = 1 and Σ_{j=1}^v θj log θj ≤ 0.