arXiv:1502.02398v2 [stat.ML] 18 May 2015

Towards a Learning Theory of Cause-Effect Inference

David Lopez-Paz¹,² Krikamol Muandet¹ Bernhard Schölkopf¹ Ilya Tolstikhin¹
¹ Max-Planck-Institute for Intelligent Systems  ² University of Cambridge

Abstract

We pose causal inference as the problem of learning to classify probability distributions. In particular, we assume access to a collection {(S_i, l_i)}_{i=1}^n, where each S_i is a sample drawn from the probability distribution of X_i × Y_i, and l_i is a binary label indicating whether “X_i → Y_i” or “X_i ← Y_i”. Given these data, we build a causal inference rule in two steps. First, we featurize each S_i using the kernel mean embedding associated with some characteristic kernel. Second, we train a binary classifier on such embeddings to distinguish between causal directions. We present generalization bounds showing the statistical consistency and learning rates of the proposed approach, and provide a simple implementation that achieves state-of-the-art cause-effect inference. Furthermore, we extend our ideas to infer causal relationships between more than two variables.

1. Introduction

The vast majority of statistical learning algorithms rely on the exploitation of associations between the variables under study. Given the argument that all associations arise from underlying causal structures (Reichenbach, 1956), and that different structures imply different influences between variables, the question of how to infer and use causal knowledge in learning acquires great importance (Pearl, 2000; Schölkopf et al., 2012). Traditionally, the most widely used strategy to infer the causal structure of a system is to perform interventions on some of its variables, while studying the response of some others. However, such interventions are in many situations unethical, expensive, or

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

david@lopezpaz.org, krikamol@tuebingen.mpg.de, bs@tuebingen.mpg.de, ilya@tuebingen.mpg.de

even impossible to realize. Consequently, we often face the need to infer causation purely from observational data. In these scenarios, one suffers, in the absence of strong assumptions, from the indistinguishability between latent confounding (X ← Z → Y) and direct causation (X → Y or X ← Y). Nevertheless, and despite the seeming impossibility of the task, humans continuously learn from experience to accurately infer causality-revealing patterns.

Inspired by this successful learning, and in contrast to prior work, this paper addresses causal inference by unveiling such causal patterns directly from data. In particular, we assume access to a set {(S_i, l_i)}_{i=1}^n, where each S_i is a sample set drawn from the probability distribution of X_i × Y_i, and l_i is a binary label indicating whether “X_i → Y_i” or “X_i ← Y_i”. Using these data, we build a causal inference rule in two steps. First, we construct a suitable nonparametric representation of each sample S_i. Second, we train a nonlinear binary classifier on such features to distinguish between causal directions. Building upon this framework, we derive theoretical guarantees regarding consistency and learning rates, extend inference to the multivariate case, propose approximations to scale learning to big data, and obtain state-of-the-art performance with a simple implementation.

Given the ubiquity of uncertainty in data, which may arise from noisy measurements or the existence of unobserved common causes, we adopt the probabilistic interpretation of causation of Pearl (2000). Under this interpretation, the causal structure underlying a set of random variables X = (X_1, ..., X_d), with joint distribution P, is often described in terms of a Directed Acyclic Graph (DAG), denoted by G = (V, E). In this graph, each vertex V_i ∈ V is associated with the random variable X_i ∈ X, and an edge E_ji ∈ E from V_j to V_i denotes the causal relationship “X_i ← X_j”.
More specifically, these causal relationships are defined by a structural equation model: each X_i ← f_i(Pa(X_i), N_i), where f_i is a function, Pa(X_i) is the parental set of V_i ∈ V, and N_i is some independent noise variable. Then, causal inference is the task of recovering G from S ∼ P^n.
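As a concrete illustration, the bivariate structural equation model “X → Y” can be simulated as follows; the particular mechanism f_Y and the noise laws below are hypothetical choices for the sketch, not prescribed by the framework.

```python
import numpy as np

def sample_sem_pair(n, rng=None):
    """Draw n observations from a toy SEM with graph X -> Y:
    X <- f_X(N_X) = N_X (X has no parents), and Y <- f_Y(X, N_Y)
    with additive independent noise. Both choices are illustrative."""
    rng = np.random.default_rng(rng)
    n_x = rng.normal(size=n)                 # independent noise N_X
    n_y = rng.normal(scale=0.3, size=n)      # independent noise N_Y
    x = n_x                                  # X <- f_X(N_X)
    y = np.tanh(2.0 * x) + n_y               # Y <- f_Y(Pa(Y), N_Y), Pa(Y) = {X}
    return x, y

x, y = sample_sem_pair(500, rng=0)  # one cause-effect sample S ~ P^n
```

Recovering G here amounts to deciding, from (x, y) alone, that the tanh mechanism runs from X to Y rather than the reverse.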


1.1. Prior Art

We now briefly review the state of the art on the inference of causal structures G from observational data S ∼ P^n. For a more thorough exposition, see, e.g., Mooij et al. (2014).

One of the main strategies to recover G is the exploration of conditional dependencies, together with some other technical assumptions such as the Markov and faithfulness relationships between P and G (Pearl, 2000). This is the case of the PC algorithm (Spirtes et al., 2000), which allows the recovery of the Markov equivalence class of G without placing any restrictions on the structural equation model specifying the random variables under study.

Causal inference algorithms that exploit conditional dependencies are unsuited for inference in the bivariate case. Consequently, a large body of work has been dedicated to the study of this scenario. First, the linear non-Gaussian causal model (Shimizu et al., 2006; 2011) recovers the true causal direction between two variables whenever their relationship is linear and polluted with additive, non-Gaussian noise. Second, this model was later extended into nonlinear additive noise models (Hoyer et al., 2009; Zhang & Hyvärinen, 2009; Stegle et al., 2010; Kpotufe et al., 2013; Peters et al., 2014), which prefer the causal direction under which the alleged cause is independent from the additive residuals of some nonlinear fit to the alleged effect. Third, the information geometric causal inference framework (Daniusis et al., 2012; Janzing et al., 2014) assumes that the cause random variable is generated independently from the invertible and deterministic mapping producing its effect; thus, under the correct causal direction, it is unlikely to find dependencies between the density of the former and the slope of the latter.
As may be inferred from the previous exposition, there exists a large and heterogeneous array of causal inference algorithms, each of them working under a very specialized set of assumptions, which are sometimes difficult to test in practice. Therefore, there is a need for a more flexible causal inference rule, capable of learning the relevant causal footprints, later used for inference, directly from data. Such a “data-driven” approach would allow us to deal with complex data-generating processes, and would greatly reduce the need to explicitly craft identifiability conditions a priori.

A preliminary step in this direction was distilled from the competitions organized by Guyon (2013; 2014), which phrased causal inference as a learning problem. In these competitions, the participants were provided with a large collection of cause-effect samples {(S_i, l_i)}_{i=1}^n, where S_i = {(x_ij, y_ij)}_{j=1}^{n_i} is drawn from the probability distribution of X_i × Y_i, and l_i is a binary label indicating whether “X_i → Y_i” or “Y_i → X_i”. Given these data, most participants adopted the strategy of i) crafting a vector of features from each S_i, and ii) training a binary classifier on top of

the constructed features and paired labels. Although these “data-driven” methods achieved state-of-the-art performance (Guyon, 2013), the laborious task of hand-crafting features renders their theoretical analysis impossible.

In more specific terms, the approach described above is a learning problem whose inputs are sample sets S_i, where each S_i contains samples drawn from the probability distribution P_i(X_i, Y_i). In a separate strand of research, there have been several attempts to learn from probability distributions in a principled manner (Jebara et al., 2004; Hein & Bousquet, 2004; Cuturi et al., 2005; Martins et al., 2009; Muandet et al., 2012). Szabó et al. (2014) presented the first theoretical analysis of distributional learning based on kernel mean embeddings (Smola et al., 2007), with a focus on kernel ridge regression. Similarly, Muandet et al. (2012) studied the problem of classifying distributions, but their approach is constrained to kernel machines, and no guarantees regarding consistency or learning rates are provided.

1.2. Our Contribution

Inspired by Guyon’s competitions, we pose causal inference as the problem of classifying probability measures on causally related pairs of random variables. Our contribution to this framework is the use of kernel mean embeddings to nonparametrically featurize each cause-effect sample S_i. The benefits of this approach are three-fold. First, it avoids the need to hand-engineer features from the samples S_i. Second, it enables a clean theoretical analysis, including provable learning rates and consistency results. Third, the kernel hyperparameters (that is, the data representation) can be jointly optimized with the classifier using cross-validation.
Furthermore, we show how to extend these ideas to infer causal relationships between d ≥ 2 variables, give theoretically sustained approximations to scale learning to big data, and provide the source code of a simple implementation that outperforms the state-of-the-art. The rest of this article is organized as follows. Section 2 reviews the concept of kernel mean embeddings, the tool that will facilitate learning from distributions. Section 3 shows the consistency and learning rates of our kernel mean embedding classification approach to cause-effect inference. Section 4 extends the presented ideas to the multivariate causal inference case. Section 5 presents a variety of experiments displaying the state-of-the-art performance of a simple implementation of the proposed framework. For convenience, Table 1 summarizes our notations.

2. Kernel Mean Embeddings of Probability Measures

In order to later classify probability measures P according to their causal properties, we first need to featurize them into

E[ξ], V[ξ]                  Expected value and variance of r.v. ξ
Z                           Domain of cause-effect pairs Z = (X, Y)
P                           Set of cause-effect measures P on Z
L                           Set of labels l_i ∈ {−1, 1}
M                           Mother distribution over P × L
{(P_i, l_i)}_{i=1}^n        Sample from M^n
S_i = {Z_ij}_{j=1}^{n_i}    Sample from P_i^{n_i}
P_{S_i}                     Empirical distribution of S_i
k                           Kernel function from Z × Z to R
H_k                         RKHS induced by k
µ_k(P)                      Kernel mean embedding of measure P ∈ P
µ_k(P_{S_i})                Empirical mean embedding of P_{S_i}
µ_k(P)                      The set {µ_k(P) : P ∈ P}
M_k                         Measure over µ_k(P) × L induced by M
F_k                         Class of functionals mapping H_k to R
R_n(F_k)                    Rademacher complexity of the class F_k
ϕ, R_ϕ(f)                   Cost function and surrogate ϕ-risk of sign ∘ f

Table 1. Table of notations

a suitable representation. To this end, we will rely on the concept of kernel mean embeddings (Berlinet & Thomas-Agnan, 2004; Smola et al., 2007). In particular, let P be the probability distribution of some random variable Z taking values in the separable topological space (Z, τ_z). Then, the kernel mean embedding of P associated with the continuous, bounded, and positive-definite kernel function k : Z × Z → R is

    µ_k(P) := ∫_Z k(z, ·) dP(z),    (1)

which is an element of H_k, the Reproducing Kernel Hilbert Space (RKHS) associated with k (Schölkopf & Smola, 2002). Interestingly, the mapping µ_k is injective if k is a characteristic kernel (Sriperumbudur et al., 2010), that is, ‖µ_k(P) − µ_k(Q)‖_{H_k} = 0 ⇔ P = Q. Said differently, if using a characteristic kernel, we do not lose any information when embedding distributions. An example of a characteristic kernel is the Gaussian kernel

    k(z, z′) = exp(−γ ‖z − z′‖_2^2),  γ > 0,    (2)

which will be used throughout this paper.

In many practical situations, it is unrealistic to assume access to the true distribution P, and consequently to the true embedding µ_k(P). Instead, we often have access to a sample S = {z_i}_{i=1}^n ∼ P^n, which can be used to construct the empirical distribution P_S := (1/n) Σ_{z_i ∈ S} δ_{z_i}, where δ_z is the Dirac distribution centered at z. Using P_S, we can approximate (1) by the empirical kernel mean embedding

    µ_k(P_S) := (1/n) Σ_{i=1}^n k(z_i, ·) ∈ H_k.    (3)

The following result is a slight modification of Theorem 27 from (Song, 2008). It establishes the convergence of the empirical mean embedding µ_k(P_S) to the embedding of its population counterpart µ_k(P), in RKHS norm:

Theorem 1. Assume that ‖f‖_∞ ≤ 1 for all f ∈ H_k with ‖f‖_{H_k} ≤ 1. Then, with probability at least 1 − δ, we have

    ‖µ_k(P) − µ_k(P_S)‖_{H_k} ≤ 2 √( E_{z∼P}[k(z, z)] / n ) + √( 2 log(1/δ) / n ).

Proof. See Section B.1.

3. A Theory of Causal Inference as Distribution Classification

This section phrases the inference of cause-effect relationships from probability measures as the classification of empirical kernel mean embeddings, and analyzes the learning rates and consistency of such an approach. Throughout our exposition, the setup is as follows:

1. We assume the existence of some Mother distribution M, defined on P × L, where P is the set of all Borel probability measures on the space Z of two causally related random variables, and L = {−1, +1}.

2. A set {(P_i, l_i)}_{i=1}^n is sampled from M^n. Each measure P_i ∈ P is the joint distribution of the causally related random variables Z_i = (X_i, Y_i), and the label l_i ∈ L indicates whether “X_i → Y_i” or “X_i ← Y_i”.

3. In practice, we do not have access to the measures {P_i}_{i=1}^n. Instead, we observe samples S_i = {(x_ij, y_ij)}_{j=1}^{n_i} ∼ P_i^{n_i}, for all 1 ≤ i ≤ n. Using S_i, the data {(S_i, l_i)}_{i=1}^n are provided to the learner.

4. We featurize every sample S_i into the empirical kernel mean embedding µ_k(P_{S_i}) associated with some kernel function k (Equation 3). If k is a characteristic kernel, we incur no loss of information in this step.

Under this setup, we will use the set {(µ_k(P_{S_i}), l_i)}_{i=1}^n to train a binary classifier from H_k to L, which will later be used to unveil the causal directions of new, unseen probability measures drawn from M. Note that this framework can be straightforwardly extended to also infer the “confounding (X ← Z → Y)” and “independent (X ⊥⊥ Y)” cases, by adding two extra labels to L.

Given the two nested levels of sampling (the first from the Mother distribution M, and the second from each of the drawn cause-effect measures P_i), it is not trivial to conclude whether this learning procedure is consistent, or how its learning rates depend on the sample sizes n and {n_i}_{i=1}^n. In the following, we will study the generalization performance of empirical risk minimization over this learning setup. Specifically, we are interested in


upper bounding the excess risk between the empirical risk minimizer and the best classifier from our hypothesis class, with respect to the Mother distribution M. We divide our analysis in three parts. First, §3.1 reviews the abstract setting of statistical learning theory and surrogate risk minimization. Second, §3.2 adapts these standard results to the case of empirical kernel mean embedding classification. Third, §3.3 considers theoretically sustained approximations to deal with big data.

3.1. Margin-based Risk Bounds in Learning Theory

Let P be some unknown probability measure defined on Z × L, where Z is referred to as the input space, and L = {−1, 1} is referred to as the output space¹. One of the main goals of statistical learning theory (Vapnik, 1998) is to find a classifier h : Z → L that minimizes the expected risk

    R(h) = E_{(z,l)∼P}[ ℓ(h(z), l) ]

for a suitable loss function ℓ : L × L → R⁺, which penalizes departures between predictions h(z) and true labels l. For classification, one common choice of loss function is the 0-1 loss ℓ_01(l, l′) = |l − l′|, for which the expected risk measures the probability of misclassification. Since P is unknown in natural situations, one usually resorts to the minimization of the empirical risk (1/n) Σ_{i=1}^n ℓ(h(z_i), l_i) over some fixed hypothesis class H, for the training set {(z_i, l_i)}_{i=1}^n ∼ P^n. It is well known that this procedure is consistent under mild assumptions (Boucheron et al., 2005).

Unfortunately, the 0-1 loss function is not convex, which leads to empirical risk minimization being generally intractable. Instead, we will focus on the minimization of surrogate risk functions (Bartlett et al., 2006). In particular, we will consider the set of classifiers of the form H = {sign ∘ f : f ∈ F}, where F is some fixed set of real-valued functions f : Z → R. Introduce a nonnegative cost function ϕ : R → R⁺ which is surrogate to the 0-1 loss, that is, ϕ(z) ≥ 1_{z>0}. For any f ∈ F we define its expected and empirical ϕ-risks respectively as

    R_ϕ(f) = E_{(z,l)∼P}[ ϕ(−f(z) l) ],    (4)

    R̂_ϕ(f) = (1/n) Σ_{i=1}^n ϕ(−f(z_i) l_i).    (5)

Many natural choices of ϕ lead to tractable empirical risk minimization. Common examples of cost functions include the hinge loss ϕ(z) = max(0, 1 + z) used in SVMs, the exponential loss ϕ(z) = exp(z) used in AdaBoost, and the logistic loss ϕ(z) = log_2(1 + e^z) used in logistic regression.

¹ Refer to Section A for considerations on measurability.
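The three cost functions just listed can be written down and checked against the surrogate condition ϕ(z) ≥ 1_{z>0} directly; a small self-contained sketch (function names are ours):

```python
import math

# Common surrogate cost functions phi(z), evaluated at the margin z = -f(x)*l.
def hinge(z):        # used in SVMs
    return max(0.0, 1.0 + z)

def exponential(z):  # used in AdaBoost
    return math.exp(z)

def logistic(z):     # used in logistic regression (base-2 logarithm)
    return math.log2(1.0 + math.exp(z))

def step(z):         # the 0-1 indicator 1_{z > 0}
    return 1.0 if z > 0 else 0.0

# Each surrogate upper-bounds the 0-1 indicator on a grid of margins.
for z in [v / 10.0 for v in range(-50, 51)]:
    for phi in (hinge, exponential, logistic):
        assert phi(z) >= step(z)
```

All three are convex, which is what makes the resulting empirical risk minimization tractable.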

The misclassification error of sign ∘ f is always upper bounded by R_ϕ(f). The relationship between functions minimizing R_ϕ(f) and functions minimizing R(sign ∘ f) has been intensively studied in the literature (Steinwart & Christmann, 2008, Chapter 3). Given the high uncertainty associated with causal inferences, we argue that one is interested in predicting soft probabilities rather than hard labels, a fact that makes the study of margin-based classifiers well suited to our problem.

We now focus on the estimation of f* ∈ F, the function minimizing (4). However, since the distribution P is unknown, we can only hope to estimate f̂_n ∈ F, the function minimizing (5). Therefore, we are interested in high-probability upper bounds on the excess ϕ-risk

    E_F(f̂_n) = R_ϕ(f̂_n) − R_ϕ(f*),    (6)

w.r.t. the random training sample {(z_i, l_i)}_{i=1}^n ∼ P^n. The excess risk (6) can be upper bounded in the following way:

    E_F(f̂_n) ≤ R_ϕ(f̂_n) − R̂_ϕ(f̂_n) + R̂_ϕ(f*) − R_ϕ(f*)
             ≤ 2 sup_{f∈F} |R_ϕ(f) − R̂_ϕ(f)|.    (7)
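For intuition, the quantity that controls the supremum in (7) in classical learning theory (the Rademacher complexity appearing in the next result) admits a short Monte Carlo estimate for a simple class, e.g. linear functions of bounded norm; the closed-form supremum used below, and all names, are our own illustrative choices.

```python
import numpy as np

def empirical_rademacher_linear(Z, n_draws=2000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    F = {z -> <w, z> : ||w||_2 <= 1} on fixed points Z (an (n, d) array).
    For this class, sup_{||w|| <= 1} (1/n) sum_i s_i <w, z_i> has the
    closed form ||(1/n) sum_i s_i z_i||_2."""
    rng = np.random.default_rng(rng)
    n = Z.shape[0]
    signs = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher signs
    return (np.linalg.norm(signs @ Z, axis=1) / n).mean()

Z = np.random.default_rng(0).normal(size=(200, 5))
r_hat = empirical_rademacher_linear(Z, rng=0)  # decays like O(n**-0.5)
```

Doubling n roughly divides the estimate by √2, matching the typical O(n^{−1/2}) order quoted in the text.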

While this upper bound leads to tight results for worst-case analysis, it is well known (Bartlett et al., 2005; Boucheron et al., 2005; Koltchinskii, 2011) that tighter bounds can be achieved under additional assumptions on P. However, we leave these analyses for future research. The following result, in the spirit of Koltchinskii & Panchenko (1999) and Bartlett & Mendelson (2002), can be found in Boucheron et al. (2005, Theorem 4.1).

Theorem 2. Consider a class F of functions mapping Z to R. Let ϕ : R → R⁺ be an L_ϕ-Lipschitz function such that ϕ(z) ≥ 1_{z>0}. Let B be a uniform upper bound on ϕ(−f(z) l). Let {(z_i, l_i)}_{i=1}^n ∼ P^n and let {σ_i}_{i=1}^n be i.i.d. Rademacher random signs. Then, with probability at least 1 − δ,

    sup_{f∈F} |R_ϕ(f) − R̂_ϕ(f)| ≤ 2 L_ϕ E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(z_i) ] + B √( log(1/δ) / (2n) ),

where the expectation is taken w.r.t. {σ_i, z_i}_{i=1}^n. The expectation in the bound of Theorem 2 is known as the Rademacher complexity of F, will be denoted by R_n(F), and has a typical order of O(n^{−1/2}) (Koltchinskii, 2011).

3.2. From Classic to Distributional Learning Theory

Note that we cannot directly apply the empirical risk minimization bounds discussed in the previous section to our learning setup. This is because, instead of learning a classifier on the i.i.d. sample {(µ_k(P_i), l_i)}_{i=1}^n, we have to learn


over the set {(µ_k(P_{S_i}), l_i)}_{i=1}^n, where S_i ∼ P_i^{n_i}. Said differently, our input feature vectors µ_k(P_{S_i}) are “noisy”: they exhibit an additional source of variation, as any two different random samples S_i, S_i′ ∼ P_i^{n_i} do. In the following, we study how to incorporate these nested sampling effects into an argument similar to Theorem 2.

We will now frame our problem within the abstract learning setting considered in the previous section. Recall that our learning setup initially considers some Mother distribution M over P × L. Let µ_k(P) = {µ_k(P) : P ∈ P} ⊆ H_k, L = {−1, +1}, and let M_k be a measure (guaranteed to exist by Lemma 2, Section A.1) on µ_k(P) × L induced by M. Specifically, we will consider µ_k(P) ⊆ H_k and L to be the input and output spaces of our learning problem, respectively. Let {(µ_k(P_i), l_i)}_{i=1}^n ∼ M_k^n be our training set. We will now work with the set of classifiers {sign ∘ f : f ∈ F_k} for some fixed class F_k of functionals mapping from the RKHS H_k to R.

As pointed out in the description of our learning setup, we do not have access to the distributions {P_i}_{i=1}^n, but to samples S_i ∼ P_i^{n_i}, for all 1 ≤ i ≤ n. For this reason, we define the sample-based empirical ϕ-risk

    R̃_ϕ(f) = (1/n) Σ_{i=1}^n ϕ(−l_i f(µ_k(P_{S_i}))),

which is the approximation to the empirical ϕ-risk R̂_ϕ(f) that results from substituting the embeddings µ_k(P_i) with their empirical counterparts µ_k(P_{S_i}).

Our goal is again to find the function f* ∈ F_k minimizing the expected ϕ-risk R_ϕ(f). Since M_k is unknown to us, and we have no access to the embeddings {µ_k(P_i)}_{i=1}^n, we will instead use the minimizer of R̃_ϕ(f) in F_k:

    f̃_n ∈ arg min_{f∈F_k} R̃_ϕ(f).    (8)

To sum up, the excess risk (6) can now be reformulated as

    R_ϕ(f̃_n) − R_ϕ(f*).    (9)

Note that the estimation of f* drinks from two nested sources of error, which are i) having only n training samples from the distribution M_k, and ii) having only n_i samples from each measure P_i. Using a similar technique to (7), we can upper bound the excess risk as

    R_ϕ(f̃_n) − R_ϕ(f*) ≤ sup_{f∈F_k} |R_ϕ(f) − R̂_ϕ(f)|    (10)
                        + sup_{f∈F_k} |R̂_ϕ(f) − R̃_ϕ(f)|.   (11)

The term (10) is upper bounded by Theorem 2. On the other hand, to deal with (11), we will need to upper bound the deviations f(µ_k(P_i)) − f(µ_k(P_{S_i})) in terms of the distances ‖µ_k(P_i) − µ_k(P_{S_i})‖_{H_k}, which are in turn upper bounded using Theorem 1. To this end, we will have to assume that the class F_k consists of functionals with uniformly bounded Lipschitz constants, such as the set of linear functionals with uniformly bounded operator norm.

We now present the main result of this section, which provides a high-probability bound on the excess risk (9). Importantly, this excess risk will translate into the expected causal inference accuracy of our distribution classifier.

Theorem 3. Consider the RKHS H_k associated with some bounded, continuous kernel function k, such that sup_{z∈Z} k(z, z) ≤ 1. Consider a class F_k of functionals mapping H_k to R with Lipschitz constants uniformly bounded by L_F. Let ϕ : R → R⁺ be an L_ϕ-Lipschitz function such that ϕ(z) ≥ 1_{z>0}. Let ϕ(−f(h) l) ≤ B for every f ∈ F_k, h ∈ H_k, and l ∈ L. Then, with probability not less than 1 − δ (over all sources of randomness),

    R_ϕ(f̃_n) − R_ϕ(f*) ≤ 4 L_ϕ R_n(F_k) + 2B √( log(2/δ) / (2n) )
        + (4 L_ϕ L_F / n) Σ_{i=1}^n ( √( E_{z∼P_i}[k(z, z)] / n_i ) + √( log(2n/δ) / (2 n_i) ) ).

Proof. See Section B.2.

As mentioned in Section 3.1, the typical order of R_n(F_k) is O(n^{−1/2}). For particular examples of classes of functionals with small Rademacher complexity we refer to Maurer (2006). In such cases, the upper bound in Theorem 3 converges to zero (meaning that our procedure is consistent) as both n and the n_i tend to infinity, in such a way that² log n / n_i = o(1). The rate of convergence w.r.t. n can be improved up to O(n^{−1}) by placing additional assumptions on M (Bartlett et al., 2005). On the contrary, the rate w.r.t. n_i cannot be improved in general. Namely, the convergence rate O(n^{−1/2}) presented in the upper bound of Theorem 1 is tight, as shown in the following novel result.

² We conjecture that this constraint is an artifact from our proof.

Theorem 4. Under the assumptions of Theorem 1, denote

    σ_{H_k}² = sup_{‖f‖_{H_k} ≤ 1} V_{z∼P}[f(z)].

Then there exist universal constants c, C such that for every integer n ≥ 1/σ_{H_k}², with probability at least c,

    ‖µ_k(P) − µ_k(P_S)‖_{H_k} ≥ C σ_{H_k} / √n.

Proof. See Section B.3.

Finally, it is instructive to relate the notion of “identifiability” often considered in the causal inference community (Pearl,


2000) to the properties of the Mother distribution. Saying that the model is identifiable means that the label l of P ∈ P is assigned deterministically by M. In this case, learning rates can become as fast as O(n^{−1}). On the other hand, as M(l|P) becomes nondeterministic, the problem becomes unidentifiable and learning rates slow down (for example, in the extreme case of cause-effect pairs related by linear functions polluted with additive Gaussian noise, M(l = +1|P) = M(l = −1|P) almost surely). The investigation of these phenomena is left for future research.

3.3. Low Dimensional Embeddings for Large Data

For some kernel functions, the embeddings µ_k(P_S) ∈ H_k are infinite-dimensional. Because of this reason, one must resort to the use of dual optimization problems and, in particular, kernel matrices. The construction of these matrices requires at least O(n²) computation and memory, prohibitive for large n. In this section, we show that the infinite-dimensional embeddings µ_k(P_S) ∈ H_k can be approximated with easy-to-compute, low-dimensional representations (Rahimi & Recht, 2007; 2008). This will allow us to replace the infinite-dimensional minimization problem (8) with a low-dimensional one.

Assume that Z = R^d, and that the kernel function k is real-valued and shift-invariant. Then, we can exploit Bochner’s theorem (Rudin, 1962) to show that, for any z, z′ ∈ Z:

    k(z, z′) = 2 C_k E_{w,b}[ cos(⟨w, z⟩ + b) cos(⟨w, z′⟩ + b) ],    (12)

where w ∼ (1/C_k) p_k, b ∼ U[0, 2π], p_k : Z → R is the positive and integrable Fourier transform of k, and C_k = ∫_Z p_k(w) dw. For example, the squared-exponential kernel (2) is shift-invariant, and its evaluations can be approximated by (12) by setting p_k(w) = N(w | 0, 2γI) and C_k = 1.

We now show that, for any probability measure Q on Z and z ∈ Z, the function k(z, ·) ∈ H_k ⊆ L²(Q) can be approximated by a linear combination of randomly chosen elements from the Hilbert space L²(Q). Namely, consider the functions parametrised by w, z ∈ Z and b ∈ [0, 2π]:

    g^z_{w,b}(·) = 2 C_k cos(⟨w, z⟩ + b) cos(⟨w, ·⟩ + b),    (13)

which belong to L²(Q), since they are bounded. If we sample {(w_j, b_j)}_{j=1}^m i.i.d. as discussed above, the average

    ĝ^z_m(·) = (1/m) Σ_{j=1}^m g^z_{w_j,b_j}(·)

can be viewed as an L²(Q)-valued random variable. Moreover, (12) shows that E_{w,b}[ĝ^z_m(·)] = k(z, ·). This enables us to invoke concentration inequalities for Hilbert spaces (Ledoux & Talagrand, 1991) to show the following result, in the spirit of Rahimi & Recht (2008, Lemma 1).

Lemma 1. Let Z = R^d. For any shift-invariant kernel k such that sup_{z∈Z} k(z, z) ≤ 1, any fixed S = {z_i}_{i=1}^n ⊂ Z, any probability distribution Q on Z, and any δ > 0, we have

    ‖ µ_k(P_S) − (1/n) Σ_{i=1}^n ĝ^{z_i}_m(·) ‖_{L²(Q)} ≤ (2 C_k / √m) ( 1 + √( 2 log(n/δ) ) )

with probability larger than 1 − δ over {(w_i, b_i)}_{i=1}^m.

Proof. See Section B.4.

Once sampled, the parameters {(w_i, b_i)}_{i=1}^m allow us to approximate the empirical kernel mean embeddings {µ_k(P_{S_i})}_{i=1}^n using elements from span({cos(⟨w_i, ·⟩ + b_i)}_{i=1}^m), which is a finite-dimensional subspace of L²(Q). Therefore, we propose to use {(µ_{k,m}(P_{S_i}), l_i)}_{i=1}^n as the training sample for our final empirical risk minimization problem, where

    µ_{k,m}(P_S) = (2 C_k / |S|) Σ_{z∈S} ( cos(⟨w_j, z⟩ + b_j) )_{j=1}^m ∈ R^m.    (14)

These feature vectors can be computed in O(m|S|) time and stored in O(m) memory; importantly, they can be used off-the-shelf in conjunction with any learning algorithm. For the precise excess risk bounds that take into account the use of these low-dimensional approximations, please refer to Theorem 6 in Section B.5.
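A minimal numerical sketch of these random low-dimensional embeddings for the Gaussian kernel (2): the feature map below uses the common √(2/m) normalization of Rahimi & Recht rather than the 2C_k/|S| scaling of (14) (the two differ only by a constant factor), and all variable names are ours.

```python
import numpy as np

def random_features(S, m, gamma, rng=None):
    """Random cosine features approximating the Gaussian kernel
    k(z, z') = exp(-gamma * ||z - z'||^2): by Bochner's theorem,
    w ~ N(0, 2 * gamma * I) and b ~ U[0, 2*pi]."""
    rng = np.random.default_rng(rng)
    n, d = S.shape
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, m))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(S @ W + b)      # shape (n, m)

def empirical_mean_embedding(S, m=2000, gamma=0.5, rng=None):
    """Low-dimensional approximation of mu_k(P_S): average the random
    features over the sample, in the spirit of (14)."""
    return random_features(S, m, gamma, rng).mean(axis=0)

S = np.random.default_rng(0).normal(size=(100, 2))
mu = empirical_mean_embedding(S, rng=0)  # computable in O(m |S|) time
```

Inner products of such vectors approximate the corresponding RKHS inner products of empirical embeddings, so they can indeed be fed off-the-shelf to any learning algorithm, as the text suggests.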

4. Extensions to Multivariate Causal Inference

It is possible to extend our framework to infer causal relationships between d ≥ 2 variables X = (X_1, ..., X_d). To this end, and as introduced in Section 1, assume the existence of a causal directed acyclic graph G which underlies the dependencies present in the probability distribution P(X). Our task is then to recover G from S ∼ P^n.

Naïvely, one could extend the framework presented in Section 3 from the binary classification of 2-dimensional distributions to the multiclass classification of d-dimensional distributions. However, the number of possible DAGs (and therefore, the number of labels in our multiclass classification problem) grows super-exponentially in d. An alternative approach is to consider the probabilities of the three labels “X_i → X_j”, “X_i ← X_j”, and “X_i ⊥⊥ X_j” for each pair of variables {X_i, X_j} ⊆ X, when embedded along with every possible context X_k ⊆ X \ {X_i, X_j}. The intuition here is the same as in the PC algorithm of Spirtes et al. (2000): in order to decide the (absence of a) causal relationship between X_i and X_j, one must analyze the confounding effects of every X_k ⊆ X \ {X_i, X_j}.


5. Numerical Simulations

We conduct an array of experiments to test the effectiveness of a simple implementation of the presented causal learning framework. Given the use of random embeddings (14) in our classifier, we term our method the Randomized Causation Coefficient (RCC). Throughout our simulations, we featurize each sample S = {(x_i, y_i)}_{i=1}^n as

    ν(S) = ( µ_{k,m}(P_{S_x}), µ_{k,m}(P_{S_y}), µ_{k,m}(P_{S_xy}) ),    (15)

where the three elements forming (15) stand for the low-dimensional representations (14) of the empirical kernel mean embeddings of {x_i}_{i=1}^n, {y_i}_{i=1}^n, and {(x_i, y_i)}_{i=1}^n, respectively. The representation (15) is motivated by the typical conjecture in causal inference about the existence of asymmetries between the marginal and conditional distributions of causally related pairs of random variables (Schölkopf et al., 2012). Each of these three embeddings has random features sampled to approximate the sum of three Gaussian kernels (2) with hyper-parameters 0.1γ, γ, and 10γ, where γ is found using the median heuristic. In practice, we set m = 1000, and observe no significant improvements when using larger amounts of random features.

To classify the embeddings (15) in each of the experiments, we use the random forest³ implementation from Python’s sklearn-0.16-git. The number of trees is chosen from {100, 250, 500, 1000, 5000} via cross-validation. Our experiments can be replicated using the source code at https://github.com/lopezpaz/causation_learning_theory.

³ Although random forests do not comply with the Lipschitzness assumptions from Section 3, they showed the best empirical results. Compliant alternatives such as SVMs exhibited a typical drop in classification accuracy of 5%.

Figure 1. Accuracy of RCC, IGCI and ANM on the Tübingen cause-effect pairs, as a function of decision rate. The grey area depicts accuracies that are not statistically significant.

5.1. Classification of Tübingen Cause-Effect Pairs

The Tübingen cause-effect pairs is a collection of heterogeneous, hand-collected, real-world cause-effect samples (Zscheischler, 2014). Given the small size of this dataset, we resort to the synthesis of an artificial Mother distribution to sample our training data from. To this end, assume that sampling a synthetic cause-effect sample set Ŝ_i := {(x̂_ij, ŷ_ij)}_{j=1}^n ∼ P_θ amounts to the following simple generative process:

1. A cause vector (x̂_ij)_{j=1}^n is sampled from a mixture of Gaussians with c components. The mixture weights are sampled from U(0, 1), and normalized to sum to one. The mixture means and standard deviations are sampled from N(0, σ_1) and N(0, σ_2), respectively, accepting only positive standard deviations. The cause vector is standardized to zero mean and unit variance.

2. A noise vector (ε̂_ij)_{j=1}^n is sampled from a centered

Gaussian, with variance sampled from U(0, σ₃).

3. A mapping mechanism f̂_i is conceived as a spline fitted using a uniform grid of d_f elements from min((x̂_ij)_{j=1}^n) to max((x̂_ij)_{j=1}^n) as inputs, and d_f normally distributed outputs.

4. An effect vector is built as (ŷ_ij := f̂_i(x̂_ij) + ε̂_ij)_{j=1}^n, and standardized to zero mean and unit variance.

5. Return the cause-effect sample Ŝ_i := {(x̂_ij, ŷ_ij)}_{j=1}^n.

To choose a θ = (c, σ₁, σ₂, σ₃, d_f) that best resembles the unlabeled test data, we minimize the distance between the embeddings of N synthetic pairs and the Tübingen samples,

    arg min_θ Σ_i min_{1≤j≤N} ‖ν(S_i) − ν(Ŝ_j)‖²₂,    (16)

over c, d_f ∈ {1, . . . , 10} and σ₁, σ₂, σ₃ ∈ {0, 0.5, 1, . . . , 5}, where the Ŝ_j ∼ P_θ, the S_i are the Tübingen cause-effect pairs, and ν is as in (15). This strategy can be thought of as transductive learning, since we assume knowledge of the test inputs prior to the training of our inference rule. We set n = 1000 and N = 10,000. Using the generative process outlined above, we construct the synthetic training data

    { (ν({(x̂_ij, ŷ_ij)}_{j=1}^n), +1) }_{i=1}^N ∪ { (ν({(ŷ_ij, x̂_ij)}_{j=1}^n), −1) }_{i=1}^N,

where {(x̂_ij, ŷ_ij)}_{j=1}^n ∼ P_θ, and train our classifier on it. Figure 1 plots the classification accuracy of RCC, IGCI (Daniusis et al., 2012), and ANM (Mooij et al., 2014) versus the fraction of decisions that the algorithms are forced to make out of the 82 scalar Tübingen cause-effect pairs. To compare these results to those of other, lower-performing methods,
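The generative process and the random-feature representation (14)-(15) can be sketched in a few lines of Python. This is our own minimal illustration, not the paper's implementation: function names are ours, a single Gaussian bandwidth replaces the three-scale kernel mixture, a piecewise-linear interpolant stands in for the spline mechanism, and absolute values replace rejection sampling for the positive standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pair(n=1000, c=3, s1=2.0, s2=2.0, s3=1.0, df=5):
    """One synthetic cause-effect pair, following steps 1-5 of Section 5.1."""
    # 1. Cause: mixture of c Gaussians with random weights, means and scales.
    w = rng.uniform(size=c)
    w /= w.sum()
    means = rng.normal(0.0, s1, c)
    sds = np.abs(rng.normal(0.0, s2, c))      # stand-in for rejection sampling
    comp = rng.choice(c, size=n, p=w)
    x = rng.normal(means[comp], sds[comp])
    x = (x - x.mean()) / x.std()              # standardize the cause
    # 2. Noise: centered Gaussian with variance drawn from U(0, s3).
    e = rng.normal(0.0, np.sqrt(rng.uniform(0.0, s3)), n)
    # 3.-4. Mechanism: df random knots; piecewise-linear stand-in for a spline.
    knots_x = np.linspace(x.min(), x.max(), df)
    knots_y = rng.normal(size=df)
    y = np.interp(x, knots_x, knots_y) + e
    y = (y - y.mean()) / y.std()              # standardize the effect
    return x, y

def embed(z, W, b):
    """Empirical random cosine-feature mean embedding of a sample z (cf. (14))."""
    return np.cos(z @ W.T + b).mean(axis=0)

def featurize(x, y, m=100, gamma=1.0):
    """nu(S) of (15): embeddings of {x_i}, {y_i} and {(x_i, y_i)}, concatenated."""
    W1 = rng.normal(0.0, np.sqrt(2 * gamma), (m, 1))   # frequencies for scalars
    W2 = rng.normal(0.0, np.sqrt(2 * gamma), (m, 2))   # frequencies for pairs
    b = rng.uniform(0.0, 2 * np.pi, m)
    return np.concatenate([embed(x[:, None], W1, b),
                           embed(y[:, None], W1, b),
                           embed(np.column_stack([x, y]), W2, b)])

x, y = sample_pair()
nu = featurize(x, y)   # a 3m-dimensional feature vector for the classifier
```

The setting used in the paper corresponds to m = 1000 and to random features approximating a sum of three Gaussian kernels at bandwidths 0.1γ, γ, and 10γ.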


refer to Janzing et al. (2012). RCC surpasses the state-of-the-art with a classification accuracy of 81.61% when inferring the causal directions on all pairs. The confidence of RCC is computed using the classifier's output class probabilities. SVMs obtain a test accuracy of 77.2% on this same task.

5.2. Inferring the Arrow of Time

We test the effectiveness of our method to infer the arrow of time from causal time series. More specifically, we assume access to a set of time series {x_ij}_{j=1}^{n_i}, and our task is to infer, for each series, whether X_i → X_{i+1} or X_i ← X_{i+1}. We compare our framework to the state-of-the-art method of Peters et al. (2009), using the same electroencephalography signals (Blankertz, 2005) as in their original experiment. On the one hand, Peters et al. (2009) construct two Auto-Regressive Moving-Average (ARMA) models for each causal time series and time direction, and prefer the solution under which the model residuals are independent from the inferred cause. To this end, the method uses two parameters for which no estimation procedure is provided. On the other hand, our approach makes no assumptions whatsoever about the parametric model underlying the series, at the expense of requiring a disjoint set of N = 10,000 causal time series for training. Our method matches the best performance of Peters et al. (2009), with an accuracy of 82.66%.

5.3. ChaLearn's Challenge Data

The cause-effect challenges organized by Guyon (2014) provided N = 16,199 training causal samples S_i, each drawn from the distribution of X_i × Y_i, and labeled either "X_i → Y_i", "X_i ← Y_i", "X_i ← Z_i → Y_i", or "X_i ⊥⊥ Y_i". The task of the competition was to develop a causation coefficient which would predict large positive values for causal samples following "X_i → Y_i", large negative values for samples following "X_i ← Y_i", and values close to zero otherwise.
Using these data, we obtained a test bidirectional area under the curve score (Guyon, 2014) of 0.74 in one and a half minutes, ranking third on the overall leaderboard. The winner of the competition obtained a score of 0.82 in thirty minutes, but resorted to several dozen hand-crafted features. Partitioning these same data in different ways, we learned two related but different binary classification tasks. First, we trained our classifier to detect latent confounding, and obtained a test classification accuracy of 80% on the task of distinguishing "X → Y or X ← Y" from "X ← Z → Y". Second, we trained our classifier to measure dependence, and obtained a test classification accuracy of 88% on the task of distinguishing between "X ⊥⊥ Y" and "else". We consider these results a promising step towards learning flexible hypothesis tests and dependence measures directly from data.
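A causation coefficient of the kind requested by the challenge can be read directly off the classifier's class probabilities. The snippet below is our own illustration on random stand-in features (the actual pipeline trains on the embeddings (15)); it scores a sample as the probability of "X → Y" minus the probability of "X ← Y", using the sklearn random forests of Section 5:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))            # stand-in for nu(S) feature vectors
y = rng.choice([-1, 0, 1], size=300)     # -1: "X <- Y", 0: neither, +1: "X -> Y"
y[:3] = [-1, 0, 1]                       # make sure all classes are present

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def causation_coefficient(nu):
    """Large positive for X -> Y, large negative for X <- Y, near zero otherwise."""
    p = dict(zip(clf.classes_, clf.predict_proba(nu.reshape(1, -1))[0]))
    return p[1] - p[-1]

score = causation_coefficient(rng.normal(size=6))
```

The same construction, trained on differently partitioned labels, yields the confounding detector and the dependence measure described above.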

5.4. Reconstruction of Causal DAGs

We apply the strategy described in Section 4 to reconstruct the causal DAGs of two multivariate datasets: autoMPG and abalone (Lichman, 2013). Once again, we resort to synthetic training data, generated by a procedure similar to the one used in Section 5.1; refer to Section C for details. Regarding autoMPG, Figure 2 shows that 1) the release date of the vehicle (AGE) causes the miles-per-gallon consumption (MPG), the acceleration capabilities (ACC), and the horse-power (HP), 2) the weight of the vehicle (WEI) causes the horse-power and the MPG, and 3) other characteristics, such as the engine displacement (DIS) and the number of cylinders (CYL), cause the MPG. For abalone, Figure 3 shows that 1) the age of the snail causes all the other variables, 2) the overall weight of the snail (WEI) is caused by the partial weights of its meat (WEA), viscera (WEB), and shell (WEC), and 3) the height of the snail (HEI) is responsible for other physical attributes, such as its diameter (DIA) and length (LEN). The target variable of each dataset is shaded in gray. Interestingly, our inference reveals that the autoMPG dataset is a causal prediction task (the features cause the target), and that the abalone dataset is an anticausal prediction task (the target causes the features). This distinction has implications when learning from these data (Schölkopf et al., 2012).

Figure 2. Causal DAG recovered from the autoMPG data.

Figure 3. Causal DAG recovered from the abalone data.

6. Future Work

Three research directions are in progress: first, to improve learning rates by exploiting common assumptions from causal inference; second, to further investigate methods to reconstruct multivariate DAGs; third, to develop mechanisms to interpret the causal footprints learned by our classifiers.


References

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2002.

Bartlett, P. L. and Mendelson, S. Empirical minimization. Probability Theory and Related Fields, 135(4), 2006.

Bartlett, P. L., Bousquet, O., and Mendelson, S. Local Rademacher complexities. The Annals of Statistics, 33(4), 2005.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101, 2006.

Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.

Blankertz, B. BCI Competition III data, experiment 4a, subject 3, 1000Hz, 2005. URL http://bbci.de/competition/iii/download/.

Boucheron, S., Lugosi, G., and Bousquet, O. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 2005.

Cuturi, M., Fukumizu, K., and Vert, J. P. Semigroup kernels on measures. JMLR, 2005.

Daniusis, P., Janzing, D., Mooij, J., Zscheischler, J., Steudel, B., Zhang, K., and Schölkopf, B. Inferring deterministic causal relations. UAI, 2012.

Guyon, I. Cause-effect pairs Kaggle competition, 2013. URL https://www.kaggle.com/c/cause-effect-pairs/.

Guyon, I. ChaLearn fast causation coefficient challenge, 2014. URL https://www.codalab.org/competitions/1381.

Hein, M. and Bousquet, O. Hilbertian metrics and positive definite kernels on probability measures. AISTATS, 2004.

Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J. R., and Schölkopf, B. Nonlinear causal discovery with additive noise models. NIPS, 2009.

Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniušis, P., Steudel, B., and Schölkopf, B. Information-geometric approach to inferring causal directions. Artificial Intelligence, 2012.

Janzing, D., Steudel, B., Shajarisales, N., and Schölkopf, B. Justifying information-geometric causal inference. arXiv, 2014.

Jebara, T., Kondor, R., and Howard, A. Probability product kernels. JMLR, 2004.

Koltchinskii, V. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. École d'été de probabilités de Saint-Flour. Springer, 2011.

Koltchinskii, V. and Panchenko, D. Rademacher processes and bounding the risk of function learning. In Giné, E., Mason, D., and Wellner, J. (eds.), High Dimensional Probability II, pp. 443-457. Birkhäuser, 1999.

Kpotufe, S., Sgouritsa, E., Janzing, D., and Schölkopf, B. Consistency of causal inference under the additive noise model. ICML, 2013.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. Nonextensive information theoretic kernels on measures. JMLR, 2009.

Maurer, A. The Rademacher complexity of linear transformation classes. COLT, 2006.

Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. Distinguishing cause from effect using observational data: methods and benchmarks. arXiv preprint arXiv:1412.3773, 2014.

Muandet, K., Fukumizu, K., Dinuzzo, F., and Schölkopf, B. Learning from distributions via support measure machines. NIPS, 2012.

Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

Peters, J., Janzing, D., Gretton, A., and Schölkopf, B. Detecting the direction of causal time series. ICML, 2009.

Peters, J., Mooij, J. M., Janzing, D., and Schölkopf, B. Causal discovery with continuous additive noise models. JMLR, 2014.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. NIPS, 2007.

Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. NIPS, 2008.

Reed, M. and Simon, B. Methods of Modern Mathematical Physics. Vol. 1: Functional Analysis. Academic Press, New York, 1972.

Reichenbach, H. The Direction of Time. Dover, 1956.

Rudin, W. Fourier Analysis on Groups. Wiley, 1962.

Schölkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, 2002.

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. M. On causal and anticausal learning. ICML, 2012.

Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. A linear non-Gaussian acyclic model for causal discovery. JMLR, 2006.

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., and Bollen, K. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. JMLR, 2011.

Smola, A. J., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In ALT. Springer-Verlag, 2007.

Song, L. Learning via Hilbert Space Embedding of Distributions. PhD thesis, The University of Sydney, 2008.

Spirtes, P., Glymour, C. N., and Scheines, R. Causation, Prediction, and Search. MIT Press, 2000.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. Hilbert space embeddings and metrics on probability measures. JMLR, 2010.

Stegle, O., Janzing, D., Zhang, K., Mooij, J. M., and Schölkopf, B. Probabilistic latent variable models for distinguishing between cause and effect. NIPS, 2010.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008.

Szabó, Z., Sriperumbudur, B., Póczos, B., and Gretton, A. Learning theory for distribution regression. arXiv, 2014.

Vapnik, V. N. Statistical Learning Theory. John Wiley & Sons, 1998.

Zhang, K. and Hyvärinen, A. On the identifiability of the post-nonlinear causal model. UAI, 2009.

Zscheischler, J. Benchmark data set for causal discovery algorithms, v0.8, 2014. URL http://webdav.tuebingen.mpg.de/cause-effect/.


A. Topological and Measurability Considerations

Let (Z, τ_Z) and (L, τ_L) be two separable topological spaces, where Z is the input space and L := {−1, 1} is the output space. Let B(τ) be the Borel σ-algebra induced by the topology τ. Let P be an unknown probability measure on (Z × L, B(τ_Z) ⊗ B(τ_L)). Consider also the classifiers f ∈ F_k and the loss function ℓ to be measurable.

A.1. Measurability Conditions to Learn from Distributions

The first step towards the deployment of our learning setup is to guarantee the existence of a measure on the space µ_k(P) × L, where µ_k(P) = {µ_k(P) : P ∈ P} ⊆ H_k is the set of kernel mean embeddings associated with the measures in P. The following lemma provides such a guarantee, and allows the analysis in the remainder of this section to be carried out on µ_k(P) × L.

Lemma 2. Let (Z, τ_Z) and (L, τ_L) be two separable topological spaces. Let P be the set of all Borel probability measures on (Z, B(τ_Z)). Let µ_k(P) = {µ_k(P) : P ∈ P} ⊆ H_k, where µ_k is the kernel mean embedding (1) associated with some bounded, continuous kernel function k : Z × Z → R. Then, there exists a measure on µ_k(P) × L.

Proof. The following is a similar result to Szabó et al. (2014, Proof 3). Start by endowing P with the weak topology τ_P, such that the map

    L(P) = ∫_Z f(z) dP(z)    (17)

is continuous for all f ∈ C_b(Z). This makes (P, B(τ_P)) a measurable space. First, we show that µ_k : (P, B(τ_P)) → (H_k, B(τ_H)) is Borel measurable. Note that H_k is separable due to the separability of (Z, τ_Z) and the continuity of k (Steinwart & Christmann, 2008, Lemma 4.33). The separability of H_k implies that µ_k is Borel measurable iff it is weakly measurable (Reed & Simon, 1972, Thm. IV.22). Note that the boundedness and the continuity of k imply H_k ⊆ C_b(Z) (Steinwart & Christmann, 2008, Lemma 4.28). Therefore, (17) remains continuous for all f ∈ H_k, which implies the Borel measurability of µ_k. Second, µ_k : (P, B(τ_P)) → (G, B(τ_G)) is Borel measurable, since B(τ_G) = {A ∩ G : A ∈ B(H_k)} ⊆ B(τ_H), where B(τ_G) is the σ-algebra induced by the topology of G ∈ B(H_k) (Szabó et al., 2014). Third, we show that g : (P × L, B(τ_P) ⊗ B(τ_L)) → (G × L, B(τ_G) ⊗ B(τ_L)) is measurable. For that, it suffices to decompose g(x, y) = (g_1(x, y), g_2(x, y)) and show that g_1 and g_2 are measurable (Szabó et al., 2014).

B. Proofs

B.1. Theorem 1

Note that the original statement of Theorem 27 in Song (2008) assumed f ∈ [0, 1], while we let the elements of the ball in the RKHS take negative values as well; this can be achieved by minor changes to the proof. For completeness, we provide the modified proof here. Using the well-known dual relation between the norm in the RKHS and the sup-norm of the empirical process, which can be found in Theorem 28 of Song (2008), we can write:

    ‖µ_k(P) − µ_k(P_S)‖_{H_k} = sup_{‖f‖_{H_k} ≤ 1} ( E_{z∼P}[f(z)] − (1/n) Σ_{i=1}^n f(z_i) ).    (18)

Now we proceed in the usual way. First, we note that the sup-norm of the empirical process appearing on the r.h.s. can be viewed as a real-valued function of the i.i.d. random variables z_1, . . . , z_n; we will denote it by F(z_1, . . . , z_n). A straightforward computation shows that F satisfies the bounded difference condition (Theorem 14 of Song (2008)). Indeed, let us fix all the values z_1, . . . , z_n except for z_j, which we will set to z_j'. Using the identity |a − b| = (a − b) 1_{a>b} + (b − a) 1_{a≤b}, and noting that if sup_x f(x) = f(x*), then sup_x f(x) − sup_x g(x) is upper bounded by f(x*) − g(x*), we get

    |F(z_1, . . . , z_j', . . . , z_n) − F(z_1, . . . , z_j, . . . , z_n)|
        ≤ (1/n)(f(z_j) − f(z_j')) 1_{F(z_1,...,z_j',...,z_n) > F(z_1,...,z_j,...,z_n)}
        + (1/n)(f(z_j') − f(z_j)) 1_{F(z_1,...,z_j',...,z_n) ≤ F(z_1,...,z_j,...,z_n)}.


Now, noting that |f(z) − f(z')| ∈ [0, 2], we conclude that

    |F(z_1, . . . , z_j', . . . , z_n) − F(z_1, . . . , z_j, . . . , z_n)| ≤ 2/n.

Using McDiarmid's inequality (Theorem 14 of Song (2008)) with c_i = 2/n, we obtain that with probability not less than 1 − δ the following holds:

    sup_{‖f‖_{H_k} ≤ 1} ( E_{z∼P}[f(z)] − (1/n) Σ_{i=1}^n f(z_i) )
        ≤ E[ sup_{‖f‖_{H_k} ≤ 1} ( E_{z∼P}[f(z)] − (1/n) Σ_{i=1}^n f(z_i) ) ] + sqrt( 2 log(1/δ) / n ).

Finally, we proceed with the symmetrization step (Theorem 2.1 of Koltchinskii (2011)), which upper bounds the expected value of the sup-norm of the empirical process by twice the Rademacher complexity of the class {f ∈ H_k : ‖f‖_{H_k} ≤ 1}, and with an upper bound on this Rademacher complexity, which can be found in Lemma 22 and related remarks of Bartlett & Mendelson (2002). We also note that the original statement of Theorem 27 in Song (2008) contains an extra factor of 2 under the logarithm compared to our modified result. This is explained by the fact that, while we upper bounded the Rademacher complexity directly, Song (2008) instead upper bounds it in terms of the empirical (or conditional) Rademacher complexity, which results in another application of McDiarmid's inequality together with a union bound.

B.2. Theorem 3

We will proceed as follows:

    R_ϕ(f̃_n) − R_ϕ(f*) = R_ϕ(f̃_n) − R̃_ϕ(f̃_n)
                        + R̃_ϕ(f̃_n) − R̃_ϕ(f*)
                        + R̃_ϕ(f*) − R_ϕ(f*)
        ≤ 2 sup_{f∈F_k} |R_ϕ(f) − R̃_ϕ(f)|
        = 2 sup_{f∈F_k} |R_ϕ(f) − R̂_ϕ(f) + R̂_ϕ(f) − R̃_ϕ(f)|
        ≤ 2 sup_{f∈F_k} |R_ϕ(f) − R̂_ϕ(f)| + 2 sup_{f∈F_k} |R̂_ϕ(f) − R̃_ϕ(f)|.    (19)

We will now upper bound the two terms in (19) separately. We start by noticing that Theorem 2 can be used to upper bound the first term. All we need is to match the quantities appearing in our problem to the classical setting of learning theory, discussed in Section 3.1. Indeed, let µ(P) play the role of the input space Z; thus the input objects are kernel mean embeddings of elements of P. According to Lemma 2, there is a distribution defined over µ(P) × L, which will play the role of the unknown distribution P. Finally, the i.i.d. pairs {(µ_k(P_i), l_i)}_{i=1}^n form the training sample. Thus, using Theorem 2, we get that with probability not less than 1 − δ/2 (w.r.t. the random training sample {(µ_k(P_i), l_i)}_{i=1}^n) the following holds true:

    sup_{f∈F_k} |R_ϕ(f) − R̂_ϕ(f)| ≤ 2 L_ϕ E[ sup_{f∈F_k} (1/n) Σ_{i=1}^n σ_i f(z_i) ] + B sqrt( log(2/δ) / (2n) ).    (20)

To deal with the second term in (19), we note that

    sup_{f∈F_k} |R̂_ϕ(f) − R̃_ϕ(f)| = sup_{f∈F_k} (1/n) | Σ_{i=1}^n [ ϕ(−l_i f(µ_k(P_i))) − ϕ(−l_i f(µ_k(P_{S_i}))) ] |
        ≤ sup_{f∈F_k} (1/n) Σ_{i=1}^n | ϕ(−l_i f(µ_k(P_i))) − ϕ(−l_i f(µ_k(P_{S_i}))) |
        ≤ L_ϕ sup_{f∈F_k} (1/n) Σ_{i=1}^n | f(µ_k(P_i)) − f(µ_k(P_{S_i})) |,


where we have used the Lipschitzness of the cost function ϕ. Using the Lipschitzness of the functionals f ∈ F_k, we obtain:

    sup_{f∈F_k} |R̂_ϕ(f) − R̃_ϕ(f)| ≤ L_ϕ sup_{f∈F_k} (L_f/n) Σ_{i=1}^n ‖µ_k(P_i) − µ_k(P_{S_i})‖_{H_k}.    (21)

Also note that the usual reasoning shows that if h ∈ H_k and ‖h‖_{H_k} ≤ 1, then:

    |h(z)| = |⟨h, k(z, ·)⟩_{H_k}| ≤ ‖h‖_{H_k} ‖k(z, ·)‖_{H_k} = ‖h‖_{H_k} sqrt(k(z, z)) ≤ sqrt(k(z, z)),

and hence ‖h‖_∞ = sup_{z∈Z} |h(z)| ≤ 1, because our kernel is bounded. This allows us to use Theorem 1 to control every term in (21), and to combine the resulting upper bounds in a union bound⁴ over i = 1, . . . , n, to show that for any fixed P_1, . . . , P_n, with probability not less than 1 − δ/2 (w.r.t. the random samples {S_i}_{i=1}^n), the following is true:

    L_ϕ sup_{f∈F} (L_f/n) Σ_{i=1}^n ‖µ_k(P_i) − µ_k(P_{S_i})‖_{H_k}
        ≤ L_ϕ sup_{f∈F} (L_f/n) Σ_{i=1}^n [ 2 sqrt( E_{z∼P_i}[k(z, z)] / n_i ) + sqrt( 2 log(2n/δ) / n_i ) ].    (22)

The quantity 2n/δ appears under the logarithm since, for every i, we have used Theorem 1 with δ' = δ/(2n). Combining (20) and (22) in a union bound together with (19), we finally get that with probability not less than 1 − δ the following is true:

    R_ϕ(f̃_n) − R_ϕ(f*) ≤ 4 L_ϕ R_n(F) + 2B sqrt( log(2/δ) / (2n) )
        + (4 L_ϕ L_F / n) Σ_{i=1}^n [ sqrt( E_{z∼P_i}[k(z, z)] / n_i ) + sqrt( log(2n/δ) / (2 n_i) ) ],

where we have defined L_F = sup_{f∈F} L_f.

⁴ Note that the union bound results in the extra log n factor in our bound. We believe that this factor can be avoided using a refined proof technique, based on the application of McDiarmid's inequality. This question is left for future work.

B.3. Theorem 4

Our proof is a simple combination of the duality equation (18) with the following lower bound on the supremum of the empirical process, presented in Theorem 2.3 of Bartlett & Mendelson (2006):

Theorem 5. Let F be a class of real-valued functions defined on a set Z such that sup_{f∈F} ‖f‖_∞ ≤ 1. Let z_1, . . . , z_n, z ∈ Z be i.i.d. according to some probability measure P on Z. Set σ_F² = sup_{f∈F} V[f(z)]. Then there are universal constants c, c', and C for which the following holds:

    E[ sup_{f∈F} ( E[f(z)] − (1/n) Σ_{i=1}^n f(z_i) ) ] ≥ c σ_F / sqrt(n).

Furthermore, for every integer n ≥ 1/σ_F², with probability at least c',

    sup_{f∈F} ( E[f(z)] − (1/n) Σ_{i=1}^n f(z_i) ) ≥ C E[ sup_{f∈F} ( E[f(z)] − (1/n) Σ_{i=1}^n f(z_i) ) ].

We note that the constants c, c', and C appearing in the last result do not depend on n, σ_F², or any other quantities appearing in the statement. This can be verified by inspection of the proof presented in (Bartlett & Mendelson, 2006).

B.4. Lemma 1

Proof. Bochner's theorem (Rudin, 1962) states that for any shift-invariant, symmetric p.d. kernel k defined on Z × Z, where Z = R^d, and any z, z' ∈ Z, the following holds:

    k(z, z') = ∫_Z p_k(w) e^{i⟨w, z−z'⟩} dw,    (23)


where p_k is a positive and integrable Fourier transform of the kernel k. It is immediate to check that the Fourier transform of such kernels k is always an even function, meaning p_k(−w) = p_k(w). Indeed, since k(z − z') = k(z' − z) for all z, z' ∈ Z (due to the symmetry of the kernel), we have:

    p_k(w) := ∫_Z k(δ) e^{i⟨w, δ⟩} dδ = ∫_Z k(δ) cos(⟨w, δ⟩) dδ = ∫_Z k(δ) cos(−⟨w, δ⟩) dδ = p_k(−w),

which holds for any w ∈ R^d. Denote C_k = ∫_{R^d} p_k(w) dw < ∞, and use the identity cos(a − b) = (1/π) ∫_0^{2π} cos(a + x) cos(b + x) dx. Thus for any z, z' ∈ R^d we can write:

    k(z, z') = ∫_{R^d} p_k(w) [ cos(⟨w, z − z'⟩) + i · sin(⟨w, z − z'⟩) ] dw
             = ∫_{R^d} p_k(w) cos(⟨w, z − z'⟩) dw
             = 2 ∫_{R^d} ∫_0^{2π} (1/2π) p_k(w) cos(⟨w, z⟩ + b) cos(⟨w, z'⟩ + b) db dw.

Introducing random variables b and w distributed according to U[0, 2π] and (1/C_k) p_k(w), respectively, we can rewrite

    k(z, z') = 2 C_k E_{b,w}[ cos(⟨w, z⟩ + b) cos(⟨w, z'⟩ + b) ].    (24)

Now let Q be any probability distribution defined on Z. Then, for any z, w ∈ Z and b ∈ [0, 2π], the function

    g^z_{w,b}(·) := 2 C_k cos(⟨w, z⟩ + b) cos(⟨w, ·⟩ + b)

belongs to the space L2(Q); namely, its L2(Q) norm is finite. Moreover, it is bounded by 2 C_k:

    ‖g^z_{w,b}(·)‖²_{L2(Q)} = ∫_Z ( 2 C_k cos(⟨w, z⟩ + b) cos(⟨w, t⟩ + b) )² dQ(t) ≤ 4 C_k² ∫_Z dQ(t) = 4 C_k².    (25)

Note that for any fixed z ∈ Z and any random parameters w ∈ Z and b ∈ [0, 2π], the element g^z_{w,b}(·) is a random variable taking values in the L2(Q) space (which is Hilbert). Such Banach-space valued random variables are well-studied objects (Ledoux & Talagrand, 1991), and a number of concentration results for them are known by now. We will use the following version of the Hoeffding inequality, which can be found in Lemma 4 of Rahimi & Recht (2008):

Lemma 3. Let v_1, . . . , v_m be i.i.d. random variables taking values in a ball of radius M centred at the origin of a Hilbert space H. Then, for any δ > 0, the following holds with probability higher than 1 − δ over the random sample v_1, . . . , v_m:

    ‖ (1/m) Σ_{i=1}^m v_i − E[ (1/m) Σ_{i=1}^m v_i ] ‖_H ≤ (M / sqrt(m)) ( 1 + sqrt( 2 log(1/δ) ) ).

Note that Bochner's formula (23), and particularly its simplified form (24), indicates that if w is distributed according to the normalized Fourier transform (1/C_k) p_k and b ∼ U([0, 2π]), then E_{w,b}[g^z_{w,b}(·)] = k(z, ·). Moreover, we can show that any element h of the RKHS H_k also belongs to the L2(Q) space:

    ‖h(·)‖²_{L2(Q)} = ∫_Z h(t)² dQ(t) = ∫_Z ⟨k(t, ·), h(·)⟩²_{H_k} dQ(t) ≤ ∫_Z k(t, t) ‖h‖²_{H_k} dQ(t) ≤ ‖h‖²_{H_k} < ∞,    (26)


where we have used the reproducing property of k in the RKHS H_k, the Cauchy-Schwarz inequality, and the fact that the kernel k is bounded. Thus we conclude that the function k(z, ·) is also an element of the L2(Q) space.

This shows that if we have a sample of i.i.d. pairs {(w_i, b_i)}_{i=1}^m, then E[ (1/m) Σ_{i=1}^m g^z_{w_i,b_i}(·) ] = k(z, ·), where the {g^z_{w_i,b_i}(·)}_{i=1}^m are i.i.d. elements of the Hilbert space L2(Q). We conclude the proof using the concentration inequality for Hilbert spaces of Lemma 3 and a union bound over the elements z ∈ S, since

    ‖ µ_k(P_S) − (1/n) Σ_{i=1}^n ĝ^{z_i}_m(·) ‖_{L2(Q)} = ‖ (1/n) Σ_{i=1}^n k(z_i, ·) − (1/n) Σ_{i=1}^n ĝ^{z_i}_m(·) ‖_{L2(Q)}
        ≤ (1/n) Σ_{i=1}^n ‖ k(z_i, ·) − ĝ^{z_i}_m(·) ‖_{L2(Q)}
        = (1/n) Σ_{i=1}^n ‖ k(z_i, ·) − (1/m) Σ_{j=1}^m g^{z_i}_{w_j,b_j}(·) ‖_{L2(Q)},

where we have used the triangle inequality.

B.5. Excess Risk Bound for Low-Dimensional Representations

Let us first recall some important notation introduced in Section 3.3. For any w, z ∈ Z and b ∈ [0, 2π], we define the functions

    g^z_{w,b}(·) = 2 C_k cos(⟨w, z⟩ + b) cos(⟨w, ·⟩ + b) ∈ L2(Q),    (27)

where C_k = ∫_Z p_k(z) dz, with p_k : Z → R the Fourier transform of k. We sample m pairs {(w_i, b_i)}_{i=1}^m i.i.d. from (1/C_k) p_k × U[0, 2π] and define the average function

    ĝ^z_m(·) = (1/m) Σ_{i=1}^m g^z_{w_i,b_i}(·) ∈ L2(Q).
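The identity (24) underlying these random features is easy to verify numerically. The following snippet is our own sanity check, not part of the paper's code: it compares the Monte-Carlo average against a Gaussian kernel k(z, z') = exp(−γ‖z − z'‖²), whose normalized Fourier transform p_k/C_k is the density of N(0, 2γI), with C_k = 1.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5
z, zp = rng.normal(size=2), rng.normal(size=2)  # two points in R^2

# Monte-Carlo estimate of (24): k(z, z') = 2 C_k E[cos(<w,z>+b) cos(<w,z'>+b)],
# with w ~ p_k / C_k = N(0, 2*gamma*I) and b ~ U[0, 2*pi]; here C_k = 1.
m = 100000
W = rng.normal(0.0, np.sqrt(2 * gamma), size=(m, 2))
b = rng.uniform(0.0, 2 * np.pi, size=m)
approx = 2 * np.mean(np.cos(W @ z + b) * np.cos(W @ zp + b))
exact = np.exp(-gamma * np.sum((z - zp) ** 2))
```

As m grows, the estimate concentrates around the exact kernel value at the O(1/sqrt(m)) rate given by Lemma 3.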

Since the cosine functions (27) do not necessarily belong to the RKHS H_k, and we are going to use their linear combinations as training points, our classifiers should now act on the whole L2(Q) space. To this end, we redefine the set of classifiers introduced in Section 3.2 to be {sign ∘ f : f ∈ F_Q}, where now F_Q is a set of functionals mapping L2(Q) to R. Recall that our goal is to find f* such that

    f* ∈ arg min_{f∈F_Q} R_ϕ(f) := arg min_{f∈F_Q} E_{(P,l)∼M}[ ϕ(−f(µ_k(P)) l) ].    (28)

As was pointed out in Section B.4, if the kernel k is bounded, sup_{z∈Z} k(z, z) ≤ 1, then H_k ⊆ L2(Q). In particular, for any P ∈ P it holds that µ_k(P) ∈ L2(Q), and thus (28) is well defined. Instead of solving (28) directly, we will again use a version of empirical risk minimization (ERM). However, this time we won't use the empirical mean embeddings {µ_k(P_{S_i})}_{i=1}^n since, as was already discussed, those lead to expensive computations involving the kernel matrix. Instead, we will pose the ERM problem in terms of the low-dimensional approximations based on cosines. Namely, we propose to use the following estimator f̃^m_n:

    f̃^m_n ∈ arg min_{f∈F_Q} R̃^m_ϕ(f) := arg min_{f∈F_Q} (1/n) Σ_{i=1}^n ϕ( −f( (1/n_i) Σ_{z∈S_i} ĝ^z_m(·) ) l_i ).

The following result puts together Theorem 3 and Lemma 1 to provide an excess risk bound for f̃^m_n which accounts for all sources of error introduced in the learning pipeline:

Theorem 6. Let Z = R^d and let Q be any probability distribution on Z. Consider the RKHS H_k associated with some bounded, continuous, shift-invariant kernel function k, such that sup_{z∈Z} k(z, z) ≤ 1. Consider a class F_Q of functionals mapping L2(Q) to R with Lipschitz constants uniformly bounded by L_Q. Let ϕ : R → R+ be an L_ϕ-Lipschitz function such that ϕ(z) ≥ 1_{z>0}. Let ϕ(−f(h) l) ≤ B for every f ∈ F_Q, h ∈ L2(Q), and l ∈ L. Then for any δ > 0 the following holds:

    R_ϕ(f̃^m_n) − R_ϕ(f*) ≤ 4 L_ϕ R_n(F_Q) + 2B sqrt( log(3/δ) / (2n) )
        + (4 L_ϕ L_Q / n) Σ_{i=1}^n [ sqrt( E_{z∼P_i}[k(z, z)] / n_i ) + sqrt( log(3n/δ) / (2 n_i) ) ]
        + 2 (L_ϕ L_Q / n) Σ_{i=1}^n (2 C_k / sqrt(m)) ( 1 + sqrt( 2 log(3n · n_i / δ) ) ),

with probability not less than 1 − δ over all sources of randomness, namely {(P_i, l_i)}_{i=1}^n, {S_i}_{i=1}^n, and {(w_i, b_i)}_{i=1}^m.

Proof. We will proceed similarly to (19):

    R_ϕ(f̃^m_n) − R_ϕ(f*) = R_ϕ(f̃^m_n) − R̃^m_ϕ(f̃^m_n)
                          + R̃^m_ϕ(f̃^m_n) − R̃^m_ϕ(f*)
                          + R̃^m_ϕ(f*) − R_ϕ(f*)
        ≤ 2 sup_{f∈F_Q} |R_ϕ(f) − R̃^m_ϕ(f)|
        = 2 sup_{f∈F_Q} |R_ϕ(f) − R̂_ϕ(f) + R̂_ϕ(f) − R̃_ϕ(f) + R̃_ϕ(f) − R̃^m_ϕ(f)|
        ≤ 2 sup_{f∈F_Q} |R_ϕ(f) − R̂_ϕ(f)| + 2 sup_{f∈F_Q} |R̂_ϕ(f) − R̃_ϕ(f)| + 2 sup_{f∈F_Q} |R̃_ϕ(f) − R̃^m_ϕ(f)|.    (29)

The first two terms of (29) were upper bounded in Section B.2. Note that the upper bound of the second term (proved in Theorem 3) was based on the assumption that the functionals in F_Q are Lipschitz on H_k w.r.t. the H_k metric. But, as we already noted, for bounded kernels we have H_k ⊆ L2(Q), which implies ‖h‖_{L2(Q)} ≤ ‖h‖_{H_k} for any h ∈ H_k (see (26)). Thus |f(h) − f(h')| ≤ L_f ‖h − h'‖_{L2(Q)} ≤ L_f ‖h − h'‖_{H_k} for any h, h' ∈ H_k. This means that the assumptions of Theorem 3 hold true, and we can safely apply it to upper bound the first two terms of (29). We now upper bound the third term using Lemma 1:

    sup_{f∈F_Q} |R̃_ϕ(f) − R̃^m_ϕ(f)| = sup_{f∈F_Q} (1/n) | Σ_{i=1}^n [ ϕ(−f(µ_k(P_{S_i})) l_i) − ϕ( −f( (1/n_i) Σ_{z∈S_i} ĝ^z_m(·) ) l_i ) ] |
        ≤ (1/n) Σ_{i=1}^n sup_{f∈F_Q} | ϕ(−f(µ_k(P_{S_i})) l_i) − ϕ( −f( (1/n_i) Σ_{z∈S_i} ĝ^z_m(·) ) l_i ) |
        ≤ (L_ϕ/n) Σ_{i=1}^n sup_{f∈F_Q} | f(µ_k(P_{S_i})) − f( (1/n_i) Σ_{z∈S_i} ĝ^z_m(·) ) |
        ≤ (L_ϕ/n) Σ_{i=1}^n sup_{f∈F_Q} L_f ‖ µ_k(P_{S_i}) − (1/n_i) Σ_{z∈S_i} ĝ^z_m(·) ‖_{L2(Q)}.

We can now use Lemma 1, combined in a union bound over i = 1, . . . , n with δ' = δ/n. This gives

    sup_{f∈F_Q} |R̃_ϕ(f) − R̃^m_ϕ(f)| ≤ (L_ϕ L_Q / n) Σ_{i=1}^n (2 C_k / sqrt(m)) ( 1 + sqrt( 2 log(n · n_i / δ) ) ),

with probability not less than 1 − δ over {(w_i, b_i)}_{i=1}^m.


C. Training and Test Protocols for Section 5.4

The synthesis of the training data for the experiments described in Section 5.4 follows a procedure very similar to the one from Section 5.1. The main difference here is that, when trying to infer the cause-effect relationship between two variables X_i and X_j belonging to a larger set of variables X = (X_1, . . . , X_d), we have to account for the effects of possible confounders X_k ∈ X \ {X_i, X_j}. For the sake of simplicity, we only consider one-dimensional confounding effects, that is, scalar X_k.

C.1. Training Phase

To generate cause-effect pairs exhibiting every possible scalar confounding effect, we generate data from the eight possible directed acyclic graphs depicted in Figure 4.

Figure 4. The eight possible directed acyclic graphs on three variables.

In particular, we sample N different causal DAGs G_1, . . . , G_N, where each G_i describes the causal structure underlying the triplet (X_i, Y_i, Z_i). Given G_i, we generate the sample set S_i = {(x_ij, y_ij, z_ij)}_{j=1}^n according to the generative process described in Section 5.1. Together with S_i, we annotate the triplet of labels (l_i1, l_i2, l_i3), where, according to G_i,

• l_i1 = +1 if "X_i → Y_i", l_i1 = −1 if "X_i ← Y_i", and l_i1 = 0 else.
• l_i2 = +1 if "Y_i → Z_i", l_i2 = −1 if "Y_i ← Z_i", and l_i2 = 0 else.
• l_i3 = +1 if "X_i → Z_i", l_i3 = −1 if "X_i ← Z_i", and l_i3 = 0 else.

Then, we add the following six elements to our training set:

    ({(x_ij, y_ij, z_ij)}_{j=1}^n, +l_i1), ({(y_ij, z_ij, x_ij)}_{j=1}^n, +l_i2), ({(x_ij, z_ij, y_ij)}_{j=1}^n, +l_i3),
    ({(y_ij, x_ij, z_ij)}_{j=1}^n, −l_i1), ({(z_ij, y_ij, x_ij)}_{j=1}^n, −l_i2), ({(z_ij, x_ij, y_ij)}_{j=1}^n, −l_i3),

for all 1 ≤ i ≤ N. Therefore, our training set consists of 6N sample sets and their paired labels. At this point, given any sample {(u_ij, v_ij, w_ij)}_{j=1}^n from the training set, we propose to use as feature vector the concatenation of the m-dimensional empirical kernel mean embeddings (14) of {u_ij}_{j=1}^n, {v_ij}_{j=1}^n, and {(u_ij, v_ij, w_ij)}_{j=1}^n, respectively.

C.2. Test Phase

To start, given n_te test d-dimensional samples S = {(x_i^1, . . . , x_i^d)}_{i=1}^{n_te}, the hyper-parameters of the kernel and of the training data synthesis process are chosen transductively, as described in Section 5.1.

In order to estimate the causal graph underlying the test sample set S, we compute three d × d matrices M_→, M_⊥⊥, and M_←. Each of these matrices contains, at its coordinates (i, j), the probability of the label "X_i → X_j", "X_i ⊥⊥ X_j", or "X_i ← X_j", respectively, averaged over all possible scalar confounders X_k. Using these matrices, we estimate the underlying causal graph by selecting, for each pair of variables, the edge type (forward, backward, or no edge) with maximal probability. As a post-processing step, we prune the least-confident edges until the resulting graph is acyclic.
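The edge-selection and pruning procedure above can be sketched as follows. This is our own minimal illustration (helper names are ours, and ties and the exact pruning schedule are simplified): each matrix entry stands for the classifier's output probability for the corresponding label, averaged over scalar confounders.

```python
import numpy as np

def has_cycle(d, edges):
    """DFS-based cycle check on the directed graph given by (conf, i, j) edges."""
    adj = {i: [] for i in range(d)}
    for _, i, j in edges:
        adj[i].append(j)
    state = [0] * d  # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(u):
        state[u] = 1
        for v in adj[u]:
            if state[v] == 1 or (state[v] == 0 and dfs(v)):
                return True
        state[u] = 2
        return False
    return any(state[u] == 0 and dfs(u) for u in range(d))

def estimate_dag(M_fwd, M_none, M_bwd):
    """For each pair (i, j), keep the most probable option among i -> j,
    no edge, and i <- j; then prune least-confident edges until acyclic."""
    d = M_fwd.shape[0]
    edges = []  # (confidence, head, tail)
    for i in range(d):
        for j in range(i + 1, d):
            p = (M_fwd[i, j], M_none[i, j], M_bwd[i, j])
            best = max(range(3), key=lambda t: p[t])
            if best == 0:
                edges.append((p[0], i, j))
            elif best == 2:
                edges.append((p[2], j, i))
    edges.sort()                  # ascending confidence
    while has_cycle(d, edges):
        edges.pop(0)              # drop the least-confident edge
    return sorted((i, j) for _, i, j in edges)

# toy example: 0 -> 1 (0.9), 1 -> 2 (0.8), and a weaker 2 -> 0 (0.7)
M_fwd = np.zeros((3, 3)); M_none = np.full((3, 3), 0.1); M_bwd = np.zeros((3, 3))
M_fwd[0, 1], M_fwd[1, 2], M_bwd[0, 2] = 0.9, 0.8, 0.7
dag = estimate_dag(M_fwd, M_none, M_bwd)   # the weak 2 -> 0 edge is pruned
```

In this toy example, the three selected edges would form the cycle 0 → 1 → 2 → 0, so the least-confident one is removed, leaving the DAG 0 → 1 → 2.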