Conditional Risk Minimization for Stochastic Processes

arXiv:1510.02706v1 [stat.ML] 9 Oct 2015

Christoph H. Lampert
IST Austria
3400 Klosterneuburg, Austria
[email protected]

Alexander Zimin
IST Austria
3400 Klosterneuburg, Austria
[email protected]

Abstract

We study the task of learning from non-i.i.d. data. In particular, we aim at learning predictors that minimize the conditional risk for a stochastic process, i.e. the expected loss taking into account the set of training samples observed so far. For non-i.i.d. data, the training set contains information about the upcoming samples, so learning with respect to the conditional distribution can be expected to yield better predictors than one obtains from the classical setting of minimizing the marginal risk. Our main contribution is a practical estimator for the conditional risk based on the theory of non-parametric time-series prediction, and a finite sample concentration bound that establishes exponential convergence of the estimator to the true conditional risk under certain regularity assumptions on the process.

1 Introduction

One of the cornerstone assumptions in the analysis of machine learning algorithms is that the training examples are independently and identically distributed (i.i.d.). Interestingly, this assumption is hardly ever fulfilled in practice: dependencies between examples exist even in the famous textbook example of classifying e-mails into ham or spam. For example, when multiple e-mails are exchanged with the same writer, the contents of later e-mails depend on the contents of earlier ones. Therefore, there is growing interest in the development of algorithms that learn from dependent data and still offer generalization guarantees similar to the i.i.d. situation.

In this work, we are interested in learning algorithms for stationary stochastic processes, i.e. data sources with samples arriving in a sequential manner. Typically, the generalization performance of learning algorithms for stochastic processes is phrased in terms of the marginal risk: the expected loss for a new data point that is sampled with respect to the underlying marginal distribution, regardless of which samples had been observed before. In this work, we are instead interested in the conditional risk, i.e. the expectation of the loss taken with respect to the conditional distribution of the next sample given the samples observed so far. For i.i.d. data both notions of risk coincide. For dependent data, however, they can differ drastically, and the conditional risk is the more promising quantity for sequential prediction tasks. Imagine, for example, a self-driving car. At any point in time, it should try to make its next decision optimally for the current (conditional) situation, not for the average (marginal) conditions.

There are two main challenges when trying to learn predictors of low conditional risk. First, the conditional distribution typically changes in each step, so we are trying to learn a moving target.
Second, we cannot make use of out-of-the-box empirical risk minimization, since that would just lead to predictors of low marginal risk. Our main contributions in this work are the following:

• a non-parametric empirical estimator of the conditional risk with finite history,

• a finite sample concentration bound that, under certain technical assumptions, guarantees and quantifies the convergence of the above estimator to the true conditional risk,
• a finite sample concentration bound of the new type for the marginal risk.

Our results provide the necessary tools to theoretically justify and practically perform empirical risk minimization with respect to the conditional distribution of a stochastic process. To our knowledge, our work is the first one achieving this goal.

2 Related work

While statistical learning theory was first formulated for the i.i.d. setting [1], it was soon recognized that extensions to non-i.i.d. situations, in particular to many classes of stochastic processes, were possible and useful. As in the i.i.d. case, the core of such results is typically formed by a combination of a capacity bound on the class of considered predictors and a law-of-large-numbers argument that ensures that the empirical average of function values converges to the desired expected value. Combining both, one obtains, for example, that empirical risk minimization (ERM) is a successful learning strategy. Most existing results study the situation of stationary stochastic processes, for which the definition of a marginal risk makes sense. The consistency of ERM or similar principles can then be established under certain conditions on the dependence structure, for example for processes that are α-, β- or φ-mixing [2, 3, 4, 5], exchangeable, or conditionally i.i.d. [6, 7]. Asymptotic or distribution-dependent results were furthermore obtained even for processes that are just ergodic [8], and for Markov chains with countably infinite state space [9]. All of the above works aim to study the minimizer of a long-term or marginal risk.

Actually minimizing the conditional risk has not received much attention in the literature, even though conditional notions of risk have been noticed and discussed. For example, [7] and [10] argue in favor of minimizing the conditional risk, but focus on exchangeable processes, for which the unweighted average over the training samples can be used for this purpose. In [11] a related result is proved even for nonstationary processes, but the convergence of their bound requires looking not at the next step, but at s steps into the future, with s growing as the amount of data grows. In [12], the authors discuss the conditional risk in the context of generalization guarantees for stable algorithms.
However, the proofs require a technical assumption¹ that renders the guarantees unsuitable for the form of conditional risk we are interested in. [13] extends online-to-batch conversion to mixing processes. The authors construct an estimator for the marginal risk, and then show that it can also be used for an average over m future conditional risks, but convergence only holds for m → ∞, where the average conditional risk converges to the marginal one. In contrast, we study the situation m = 1, where both risks are very different.

On a technical level, our work is related to the task of one-step-ahead prediction of time series [14, 15, 16, 17]. The goal of these methods is to reason about the next step of a process, though not in order to choose a hypothesis of minimal risk, but to predict the value of the next observation itself. Our work on empirical conditional risk is inspired by this school of thought, in particular by kernel-based nonparametric sequential prediction [18].

3 Setting and Overview of Results

In this section we introduce the learning problem of interest and give a general overview of our results. The exact technical conditions for the results and their proofs will be provided and discussed in Section 5.

We assume that we observe data from a stationary stochastic process {z_i}_{i=1}^∞ taking values in a probability space Z. Stationarity means that for all m ≥ 1 the vector z_1^m has the same distribution as z_{s+1}^{s+m} for all s ≥ 0, where z_i^j is a shorthand notation for (z_i, …, z_j). In order to quantify the dependence between the past and the future of the process, we consider mixing coefficients.

¹ Apparently, the need for this assumption was realized only after publication of the JMLR paper of the same title. Our discussion is based on the PDF version of the manuscript from the author homepage, dated 10/10/13.


Definition 1. Let σ(z_i^j) be the sigma-algebra generated by z_i^j. Then the k-th α-mixing coefficient is
\[ \alpha_k = \sup_{t \in \mathbb{N}}\ \sup_{\substack{A \in \sigma(z_1^t) \\ B \in \sigma(z_{t+k}^\infty)}} \big| P[A \cap B] - P[A]\,P[B] \big|. \tag{1} \]

Definition 2. Then the k-th β-mixing coefficient is
\[ \beta_k = \sup_{t \in \mathbb{N}}\ \mathbb{E}\Big[ \sup_{A \in \sigma(z_{t+k}^\infty)} \big| P[A \mid z_1^t] - P[A] \big| \Big]. \tag{2} \]

The process is called α- or β-mixing if αk → 0 or βk → 0 as k → ∞, respectively. On a high level, a process is mixing if the head of the process, z_1^t, and the tail of the process, z_{t+k}^∞, become as close to independent of each other as desired when they are separated by a large enough gap. It is known that αk ≤ βk, meaning that α-mixing is the more general assumption. For this reason, we will be mainly interested in α-mixing processes. Many classical stochastic processes are α-mixing, see [19] for a detailed survey. For example, many finite-state Markov and hidden Markov models as well as autoregressive moving average (ARMA) processes fulfill αk → 0 at an exponential rate [20], while certain diffusion processes are α-mixing at least with polynomial rates [21]. Clearly, i.i.d. processes are α-mixing with αk = 0 for all k.
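For intuition, this decay can be observed numerically. The sketch below (the chain and all numbers are an arbitrary illustration, not taken from the paper) evaluates |P[A ∩ B] − P[A]P[B]| over single-time events of a stationary two-state Markov chain; restricting to such events only lower-bounds α_k, but it already exhibits the geometric decay in the gap k:

```python
import numpy as np

# Hypothetical two-state Markov chain; P is its transition matrix.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

def alpha_lower_bound(k):
    """max over single-time events A = {z_t = a}, B = {z_{t+k} = b} of
    |P[A and B] - P[A] P[B]|; a lower bound on the alpha-mixing coefficient."""
    Pk = np.linalg.matrix_power(P, k)
    joint = pi[:, None] * Pk        # P[z_t = a, z_{t+k} = b] = pi_a (P^k)_{ab}
    product = np.outer(pi, pi)      # P[z_t = a] P[z_{t+k} = b]
    return np.abs(joint - product).max()

for k in (1, 5, 10, 20):
    print(k, alpha_lower_bound(k))  # decays geometrically in k
```

The decay rate is governed by the second-largest eigenvalue of the transition matrix, which matches the exponential α-mixing of finite-state Markov chains mentioned above.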

3.1 Risk Minimization

We study the problem of risk minimization: having observed a sequence of examples S = (z_i)_{i=1}^N from a stationary α-mixing process, our goal is to select a predictor h of minimal risk from a fixed hypothesis set H. The risk is defined as the expected loss for the next observation with respect to a given loss function ℓ : H × Z → [0, 1]. For example, in a classification setting, one would have Z = X × Y, where X are inputs and Y are class labels, and solve the task of identifying a predictor h : X → Y that minimizes the expected 0/1-loss, ℓ(h, z) = I[h(x) ≠ y], for z = (x, y). Different choices of distribution lead to different definitions of risk. The simplest possibility is the marginal risk,
\[ R_{\mathrm{mar}}(h) = \mathbb{E}[\ell(h, z_{N+1})], \tag{3} \]
which has two desirable properties: it is in fact independent of the actual value of N (because of the stationarity of the process), and under weak conditions on the process it can be estimated by a simple average of the losses over the training set, i.e. the empirical risk,
\[ \hat R_{\mathrm{mar}}(h) = \frac{1}{N} \sum_{i=1}^{N} \ell(h, z_i). \tag{4} \]

On the downside, minimizing the marginal risk might pose a more difficult learning problem than necessary, because the information from the actually observed training set is not taken into account. For an i.i.d. process this would not matter, since any future sample would be independent of the past observations anyway. For a dependent process, however, the sequence z_1^N might carry valuable information about the distribution of z_{N+1}; see Section 4 for an example and numerical simulations. In this work we study the conditional risk based on a finite history of length d,
\[ R(h, \bar z) = \mathbb{E}\big[ \ell(h, z_{N+1}) \,\big|\, z_{N-d+1}^{N} = \bar z \big], \tag{5} \]
for any z̄ ∈ Z^d. Our goal is to identify a predictor of minimal conditional risk in the hypothesis set. Note that on a technical level this is a more challenging problem than marginal risk minimization: the conditional risk depends on the history z̄, so different predictors will be optimal for different histories. However, the training sample might provide useful information about the process, such that one can hope for an overall better predictor. In practice, the conditional distribution of z_{N+1} is inaccessible. Thus, we take the standard route and aim at minimizing an empirical estimator, R̂, of the conditional risk:
\[ h_S = \operatorname*{argmin}_{h \in \mathcal{H}} \hat R(h, z_{N-d+1}^{N}). \tag{6} \]

Our first contribution is the definition of a suitable conditional risk estimator, which for the sake of simplicity we state for vector-valued processes, Z = R^k, for k > 0. For more general processes, an analogous construction can be used after applying a feature function or relying on kernel methods [22].

Definition 3. For a kernel function² K : (R^k)^d → R_+ and a bandwidth b > 0, we define
\[ \hat R(h, \bar z) = \frac{\hat q(h, \bar z)}{\hat p(\bar z)}, \tag{7} \]
for
\[ \hat q(h, \bar z) = \frac{1}{n b^d} \sum_{i \in I} \ell(h, z_{i+1})\, K\big( (\bar z - z_{i-d+1}^{i})/b \big) \quad\text{and}\quad \hat p(\bar z) = \frac{1}{n b^d} \sum_{i \in I} K\big( (\bar z - z_{i-d+1}^{i})/b \big), \tag{8} \]

where I = {d, d+1, …, N−1} is the index set of samples used and n = |I|. In words, the estimator is a weighted average loss over the training set, where the weight of each sample z_{i+1} is proportional to how similar its history z_{i−d+1}^i is to the target history z̄. Similar kernel-based non-parametric estimators have been used successfully in time series prediction [23]. Note, however, that risk estimation might be an easier task than prediction, especially for processes of complex objects, since we do not have to predict the actual value of z_{N+1}, but only the loss it causes for a hypothesis h.

Our main result in this work is the proof that minimizing the above empirical conditional risk is a successful learning strategy for finding a minimizer of the conditional risk. The following well-known proposition [24] shows that it suffices to focus on the uniform deviations of the estimator from the actual risk.

Lemma 1.
\[ R(h_S, z_{N-d+1}^{N}) - \inf_{h \in \mathcal{H}} R(h, z_{N-d+1}^{N}) \le 2 \sup_{h \in \mathcal{H}} \big| \hat R(h, z_{N-d+1}^{N}) - R(h, z_{N-d+1}^{N}) \big|. \tag{9} \]
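The estimator of Definition 3 can be sketched directly. In the following minimal implementation (the Gaussian kernel, the function names, and the brute-force loop are our illustrative choices, not the paper's implementation), the common factor 1/(n b^d) cancels in the ratio (7) and is therefore omitted:

```python
import numpy as np

def conditional_risk_estimate(loss, h, z, d, b, target_history):
    """Estimate R(h, z_bar) as q_hat / p_hat (sketch of Definition 3).

    loss: callable (h, z_i) -> [0, 1];  z: array of shape (N, k);
    target_history: array of shape (d, k), the last d observed points.
    """
    num, den = 0.0, 0.0
    tgt = np.asarray(target_history).ravel()
    for j in range(d, len(z)):
        # 0-based indexing: z[j] plays the role of z_{i+1},
        # z[j-d:j] its length-d history z_{i-d+1}^i
        hist = z[j - d:j].ravel()
        w = np.exp(-np.sum((hist - tgt) ** 2) / (2 * b ** 2))  # Gaussian weight
        num += loss(h, z[j]) * w
        den += w
    return num / den
```

With the 0/1-loss, the result is a kernel-weighted error rate over the training sequence; in practice one would guard against `den` underflowing to zero when the target history is far from every training history.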

Our first result shows the almost sure convergence of such uniform deviations to zero. The exact formulations and proofs are provided in Section 5.

Theorem 1. Under assumptions that will be specified later, if b → 0 with an appropriate rate as n → ∞,
\[ \sup_{h \in \mathcal{H}} \big| R(h, z_{N-d+1}^{N}) - \hat R(h, z_{N-d+1}^{N}) \big| \to 0 \quad \text{a.s.} \tag{10} \]

In addition, we study the convergence rate of the estimator.

Theorem 2. Under assumptions that will be specified later, for n ≥ n_0,
\[ P\Big[ \sup_{h \in \mathcal{H},\, \bar z \in \mathcal{Z}^d} \big| R(h, \bar z) - \hat R(h, \bar z) \big| > t \Big] \le C_1 e^{-C_2 n t^2}, \tag{11} \]

for constants n_0, C_1 and C_2 that depend in particular on the kernel function, its bandwidth and the fat-shattering dimension of the hypothesis class. Only n_0 depends on the mixing properties of the process, while C_1 and C_2 do not.

As it turns out, the techniques we use to prove Theorem 2 can also be applied to the unweighted empirical average over the training data. Doing so yields the following concentration bound for its convergence to the marginal risk, which we believe is of interest independently of our study of the conditional risk.

Theorem 3. Under assumptions that will be specified later, for N ≥ N_0,
\[ P\Big[ \sup_{h \in \mathcal{H}} \big| R_{\mathrm{mar}}(h) - \hat R_{\mathrm{mar}}(h) \big| > t \Big] \le C_3 e^{-C_4 N t^2}, \tag{12} \]
for constants N_0, C_3 and C_4, with only N_0 depending on the mixing properties of the process.

Theorems 2 and 3 tell us that, after an initial amount of data, the convergence of the empirical estimators is exponentially fast. This is noteworthy, since bounds for the marginal risk typically show convergence behaviour that becomes slower the stronger the dependence between the samples in the process is [25, 26]. Our bound has a different characteristic: it guarantees convergence at the optimal exponential rate, as in the i.i.d. case; however, the convergence starts only for large sample sizes.

² Here and later in this section, as well as in Section 5, we use kernel only in the sense of kernel-based non-parametric density estimation, not in the sense of positive definite kernel functions from kernel methods.


[Figure 1 panels: (a) generating Markov process (transition probabilities 0.95 and 0.025); (b) sample sequence z_1^{1000}; (c) contribution of each point to the conditional risk estimate; (d) p(z_{N+1}); (e) p(z_{n+1} | z_{n−1}^n); (f) p(z_{n+1} | z_{n−2}^n); (g) p(z_{n+1} | z_{n−4}^n); (h) p(z_{n+1} | z_1^n).]

Figure 1: Illustration of the data distributions in Section 4 by their y-expectations. (a): limit distribution of the hidden Markov chain. (b)–(d): conditional distributions of the (n+1)-th sample with history lengths 1, 2 and 4. (e): conditional distribution of the (n+1)-th sample with full history.

4 Experiments

Before formally stating and proving our main results, we illustrate the problem setting and the proposed estimator of the conditional risk in a synthetic setting that is easy to analyze but difficult to learn for other techniques. Let Z = X × Y with X = [0, 10] × [0, 10] ⊂ R² and Y = {±1}. We generate data using a time-homogeneous hidden Markov process with four latent states (Figure 1a). Each state i is associated with an emission probability distribution μ_i(x, y) that is uniform in x and deterministic in y, with y = sign f_i(x) for an affine function f_i. At any step i, we observe a sample z_i = (x_i, y_i) from the distribution associated with the current latent state s_i. Figure 1b depicts the situation for N = 1000. Drawn without order, the empirical distribution of the samples z_1^N resembles the limit distribution of the hidden Markov chain (Figure 1d), which is also the marginal distribution of the next sample, p(z_{N+1}). Using the dependencies in the sequence, however, we can obtain a more informed estimate, p(z_{N+1} | z_1^N), or its windowed counterparts, p(z_{N+1} | z_{N−d}^N) for any history length d. Figures 1e to 1h visualize these distributions: already with a short history length, the conditional distribution is easier to learn (has a lower Bayes risk) than the marginal one. Thus, identifying a good classifier for the conditional distribution at each step, i.e. minimizing the conditional risk, can lead to an overall lower error rate than finding a single predictor that is optimal for the limit distribution, i.e. minimizing the marginal risk.

We illustrate the proposed kernel-based estimator of the conditional risk for d = 4 and a stratified set kernel,
\[ K(S, \bar S) = \frac{1}{2|S_+||\bar S_+|} \sum_{(i,j) \in S_+ \times \bar S_+} k(x_i, \bar x_j) + \frac{1}{2|S_-||\bar S_-|} \sum_{(i,j) \in S_- \times \bar S_-} k(x_i, \bar x_j), \]
for S = {(x_l, y_l)}_{l=1}^d and S̄ = {(x̄_l, ȳ_l)}_{l=1}^d, where S_+/S_− are the sets of positive/negative examples in S (and analogously for S̄). As base kernel k(x, x̄) we use the squared exponential kernel. Figure 1c depicts the same samples z_1^N as Figure 1a, but each point z_j is drawn at a size proportional to its contribution to the conditional risk estimate at time N = 1001 for d = 4, i.e. its kernel weight K(z_{j−d}^{j−1}, z_{N+1−d}^{N}). The points of the history z_{N−3}^N are marked with crosses. One can see that the resulting distribution indeed resembles the conditional distribution with history length 4 (Figure 1g), which is also close to the full conditional distribution (Figure 1h).
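The stratified set kernel above can be sketched as follows (a hypothetical implementation: the function names and the `sigma` bandwidth are our own, while the squared exponential base kernel and the per-class averaging follow the text):

```python
import numpy as np

def base_kernel(x, x2, sigma=1.0):
    # squared exponential base kernel; sigma is a free bandwidth parameter
    return np.exp(-np.sum((x - x2) ** 2) / (2 * sigma ** 2))

def stratified_set_kernel(S, S_bar, sigma=1.0):
    """K(S, S_bar): average the base kernel over positive examples of both
    sets and over negative ones, each stratum weighted by 1/2.
    S, S_bar: lists of (x, y) pairs with y in {+1, -1}."""
    total = 0.0
    for label in (+1, -1):
        A = [x for x, y in S if y == label]
        B = [x for x, y in S_bar if y == label]
        if A and B:  # skip empty strata (histories in the paper contain both classes)
            total += sum(base_kernel(a, b, sigma)
                         for a in A for b in B) / (2 * len(A) * len(B))
    return total
```

Since the base kernel is bounded by 1, each stratum contributes at most 1/2, so the set kernel is bounded by 1 and is symmetric in its arguments.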

[Figure 2: five panels (d = 1, 2, 3, 4, 8) plotting test error against the kernel parameter for ERM, S-W and ECRM.]
Figure 2: Experimental results of empirical conditional risk minimization (ECRM) versus ordinary empirical risk minimization (ERM) and the sliding-window (S-W) heuristic.

Figure 2 shows a numerical evaluation over 100 randomly generated Markov chains. In each case, we learn a linear predictor by approximately minimizing the conditional risk from 5000 samples of the process. Then, we measure its quality using 5000 samples from the conditional distribution, p(z_{5001} | z_1^{5000}). To learn the predictor, we use the squared loss as a convex surrogate for the 0/1-loss in the conditional risk estimator. The resulting expression is convex in the weight vector of the predictor and can be minimized by solving a weighted least squares problem, where the weight of each sample is proportional to the kernel similarity between its history and the d last observed elements of the sequence. The results show that learning by conditional risk minimization (solid line) consistently outperforms empirical risk minimization (dashed line) for a wide range of kernel parameters (bandwidth of the Gaussian base kernel). Interestingly, even d = 1 is often enough to achieve a noticeable improvement. As an additional baseline, we add a sliding-window method that performs empirical risk minimization, but only on the samples in the history (dotted line). As expected, this heuristic works better than plain ERM for long enough histories, but it does not achieve the same prediction quality as the conditional risk minimizer.
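The learning step just described reduces to a weighted least-squares problem. A minimal sketch (the function names, the tiny ridge term for numerical stability, and the closed-form solve are our illustrative choices; the paper only specifies squared loss with kernel-similarity weights):

```python
import numpy as np

def history_weights(histories, target_history, bandwidth):
    """Gaussian similarity between each training history and the target history."""
    d2 = np.sum((histories - target_history) ** 2, axis=1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def weighted_least_squares(X, y, w, reg=1e-8):
    """argmin_v  sum_i w_i (x_i^T v - y_i)^2  (+ tiny ridge term for stability)."""
    A = (X * w[:, None]).T @ X + reg * np.eye(X.shape[1])
    b = (X * w[:, None]).T @ y
    return np.linalg.solve(A, b)
```

Samples whose histories resemble the last d observations receive weight close to 1 and dominate the fit, which is exactly the effect the conditional risk estimator is designed to produce.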

5 Proofs

In this section we derive the conditions on the process distribution that are required for the theorems to hold, and we sketch their proofs. Full proofs can be found in the supplemental material. As shown in [27], ergodicity by itself is not enough to establish even L1-consistency of the kernel density estimator. Because of this, some additional assumptions have to be introduced. The convergence of the uniform deviations relies on the following assumptions.

A1 ∃ D_0, D_1 > 0 such that D_0 μ(B) ≤ P[z_{i−d+1}^i ∈ B] ≤ D_1 μ(B) for all B ∈ B(Z^d), where μ is the Lebesgue measure on Z^d = R^{kd}.

A2 The distribution of z_{i−d+1}^i is tight: ∀ε > 0 there exists a compact set K_ε such that P[z_{i−d+1}^i ∈ K_ε^c] < ε. In addition, we require that the covering number of K_ε grows logarithmically in 1/ε.

A3 For every hypothesis h ∈ H, the conditional risk R(h, z̄) is L_R-Lipschitz continuous in z̄.

Assumption A1 means that the marginal distribution of the process and the Lebesgue measure are mutually absolutely continuous. This assumption is satisfied, for example, for processes with a density that is bounded away from 0. [28] argues that it implies a form of recurrence of the process: for almost every point in the support, the process visits its neighborhood infinitely often. A2 is necessary to prove uniform convergence over the history. It is satisfied for processes with compact support and also for many other distributions, e.g. sub-Gaussian ones. The use of local averaging estimators implicitly assumes continuity of the underlying function, which is assumption A3. The proof would also work under a weaker but more technical assumption; however, we stick to this more natural one. In addition, we make the following assumptions on the kernel function:
\[ \int_{\mathcal{Z}^d} K(\bar z)\, d\bar z = 1, \qquad K \text{ is bounded by } K_1, \tag{13} \]
\[ \forall i, j = 1, \dots, d: \quad \int_{\bar z \in \mathcal{Z}^d} \bar z_i\, K(\bar z)\, d\bar z = 0 \quad\text{and}\quad \int_{\bar z \in \mathcal{Z}^d} \bar z_i \bar z_j\, K(\bar z)\, d\bar z \le K_2, \tag{14} \]
\[ K \text{ is Lipschitz continuous of order } \gamma \text{ with Lipschitz constant } L. \tag{15} \]

Using these assumptions we can prove the statement of Theorem 1 under a β-mixing assumption. If, in addition, we assume

A4 ∀i, j ≥ d − 1: P[(z_{j−d+1}^j, z_{i−d+1}^i) ∈ B] ≤ D_1 μ(B) for all B ∈ B(Z^d × Z^d),

the same result holds for α-mixing processes. To prove the rate of convergence, we have to make a few additional technical assumptions on the distribution of the process.

A5 The random vector z_{i−d+1}^i has a density p. Also, q(h, z̄) = R(h, z̄) p(z̄) and p(z̄) are both twice continuously differentiable in z̄ with second derivatives bounded by D_2.

A6 p is supported on a compact set G.

A7 The loss function ℓ is L_H-Lipschitz in its first argument.

Before we can present the proofs, we introduce some notation. Let an X-valued tree ξ of depth n be a rooted complete binary tree with nodes labeled by elements of X. Every path of length i starting at the root can be identified with a sequence of i − 1 binary decisions. Thus, we can identify the tree ξ with a sequence ξ_1, …, ξ_n of labeling functions, where each ξ_i : {±1}^{i−1} → X assigns labels to all nodes at depth i. A path of length n is given by a sequence ε = (ε_1, …, ε_n). For notational convenience, we write ξ_i(ε) for ξ_i(ε_1, …, ε_{i−1}). A crucial property of these trees is that for any function f : X → R, the sequence ε_i f(ξ_i(ε)) is a martingale difference sequence if the ε_i are i.i.d. Rademacher variables. The next notions we introduce are the cover of a tree and the sequential fat-shattering dimension of a function class.

Definition 4. A set V of R-valued trees of depth n is a (sequential) θ-cover (with respect to the ℓ_p-norm) of F ⊆ {f : X → R} on a tree ξ of depth n if
\[ \forall f \in \mathcal{F}\ \forall \varepsilon \in \{\pm 1\}^n\ \exists v \in V: \quad \Big( \frac{1}{n} \sum_{i=1}^{n} |f(\xi_i(\varepsilon)) - v_i(\varepsilon)|^p \Big)^{1/p} \le \theta. \tag{16} \]
The (sequential) covering number of a function class F on a given tree ξ is
\[ \mathcal{N}_p(\theta, \mathcal{F}, \xi) = \min \{ |V| : V \text{ is a } \theta\text{-cover w.r.t. the } \ell_p\text{-norm of } \mathcal{F} \text{ on } \xi \}, \tag{17} \]
and the maximal ℓ_p covering number of F over depth-n trees is
\[ \mathcal{N}_p(\theta, \mathcal{F}, n) = \sup_{\xi} \mathcal{N}_p(\theta, \mathcal{F}, \xi). \tag{18} \]

Definition 5. An X-valued tree ξ of depth n is θ-shattered by a function class F ⊆ {f : X → R} if there exists an R-valued tree v of depth n such that
\[ \forall \varepsilon \in \{\pm 1\}^n\ \exists f \in \mathcal{F}\ \forall i \in \{1, \dots, n\}: \quad \varepsilon_i \big( f(\xi_i(\varepsilon)) - v_i(\varepsilon) \big) \ge \theta/2. \tag{19} \]
The (sequential) fat-shattering dimension fat_θ(F) at scale θ is the largest n such that F θ-shatters an X-valued tree of depth n.

The following lemma (Corollary 1 in [29]) relates fat-shattering dimensions and covering numbers.

Lemma 2. Let F ⊆ {f : X → [−1, 1]}. For any θ > 0 and any n ≥ 1, we have
\[ \mathcal{N}_\infty(\theta, \mathcal{F}, n) \le \Big( \frac{2 e n}{\theta} \Big)^{\mathrm{fat}_\theta(\mathcal{F})}. \tag{20} \]

The proofs of Theorems 1 and 2 rely on Lemma 3, a new concentration inequality that bounds the uniform deviations of functions of blocks of values of an α-mixing stochastic process.

Lemma 3. Let F ⊆ {f : Z^{d+1} → [0, 1]} be a class of functions defined on blocks of d+1 variables. Assume there exists a strictly decreasing function ϕ such that for every f ∈ F
\[ \mathrm{Var}\Big( \frac{1}{n} \sum_{i \in I} f(z_{i-d+1}^{i+1}) \Big) \le \varphi(n). \tag{21} \]
Then, for all n ≥ ϕ^{−1}(t²/8),
\[ P\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i \in I} f(z_{i-d+1}^{i+1}) - \mathbb{E}\big[ f(z_1^{d+1}) \big] \Big| > t \Big] \le 16 \Big( \frac{32 e n}{t} \Big)^{\mathrm{fat}_{t/16}(\mathcal{F})} e^{-n t^2/512}. \tag{22} \]

Proof. The first step of the proof follows the i.i.d. situation with the introduction of a ghost sample S′ = {z′_i}_{i=1}^N, which is drawn independently from the same distribution as S. Then we have an analogue of the result for i.i.d. data, Lemma 8 (in the appendix), saying that for n ≥ ϕ^{−1}(t²/8),
\[ P\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i \in I} f(z_{i-d+1}^{i+1}) - \mathbb{E}\big[ f(z_1^{d+1}) \big] \Big| > t \Big] \le 2\, P\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i \in I} f(z_{i-d+1}^{i+1}) - \frac{1}{n} \sum_{i \in I} f(z'^{\,i+1}_{i-d+1}) \Big| > \frac{t}{2} \Big]. \tag{23} \]

To bound the term on the right, we take the sequential Rademacher complexity approach used in [29] for martingale uniform laws of large numbers. For this we introduce a decoupled tangent sequence {z̃_i}_{i=1}^∞, which means that for each i, conditioned on z_1^{i−1}, z_i and z̃_i are independent and have the same distribution. Then we construct a new sequence of blocks, y_i = (z_{i−d+1}, …, z_i, z̃_{i+1}) for each i ∈ I, which are just z_{i−d+1}^{i+1} with the last variable replaced by the corresponding one from the tangent sequence. Using Lemma 9 (in the appendix) with ψ(u) = I[u > t/4], we get (recall that I = {d, d+1, …, N−1})
\[ P\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i \in I} f(z_{i-d+1}^{i+1}) - \frac{1}{n} \sum_{i \in I} f(z'^{\,i+1}_{i-d+1}) \Big| > \frac{t}{2} \Big] \tag{24} \]
\[ \le 2\, P\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i \in I} f(z_{i-d+1}^{i+1}) - \frac{1}{n} \sum_{i \in I} f(y_i) \Big| > \frac{t}{4} \Big] \tag{25} \]
\[ = 2\, \mathbb{E}\Big[ \mathbb{I}\Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i \in I} \big( f(z_{i-d+1}^{i+1}) - f(y_i) \big) > \frac{t}{4} \Big] \Big] \tag{26} \]
\[ \le 2 \sup_{\bar z_d, \bar y_d} \mathbb{E}_{\varepsilon_d} \cdots \sup_{\bar z_{N-1}, \bar y_{N-1}} \mathbb{E}_{\varepsilon_{N-1}} \Big[ \mathbb{I}\Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i \in I} \varepsilon_i \big( f(\bar z_i) - f(\bar y_i) \big) > \frac{t}{4} \Big] \Big] \tag{27} \]
\[ = 4 \sup_{\bar z_d} \mathbb{E}_{\varepsilon_d} \cdots \sup_{\bar z_{N-1}} \mathbb{E}_{\varepsilon_{N-1}} \Big[ \mathbb{I}\Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i \in I} \varepsilon_i f(\bar z_i) > \frac{t}{8} \Big] \Big]. \tag{28} \]

The last expression can be rewritten using Lemma 7 (from the appendix), and we have so far that
\[ P\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i \in I} f(z_{i-d+1}^{i+1}) - \mathbb{E}\big[ f(z_1^{d+1}) \big] \Big| > t \Big] \le 8 \sup_{\xi} P\Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i \in I} \varepsilon_i f(\xi_i(\varepsilon)) > \frac{t}{8} \Big]. \tag{29} \]

We finish the proof using Lemma 5 from [29], which bounds the right-hand side by
\[ P\Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i \in I} \varepsilon_i f(\xi_i(\varepsilon)) > \frac{t}{8} \Big] \le 2\, \mathcal{N}_1(t/16, \mathcal{F}, n)\, e^{-n t^2/512} \tag{30} \]
\[ \le 2 \Big( \frac{32 e n}{t} \Big)^{\mathrm{fat}_{t/16}(\mathcal{F})} e^{-n t^2/512}. \tag{31} \]

Corollary 1. Let F = { f(z̄) = ℓ(h, z̄_{d+1}) K((ȳ − z̄_1^d)/b) : h ∈ H, ȳ ∈ Z^d } ⊆ {f : Z^{d+1} → R_+}. Assume that A7 holds. Then, under the conditions of Lemma 3, the following holds for t_1 = t/(K_1 L_H):
\[ P\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i \in I} f(z_{i-d+1}^{i+1}) - \mathbb{E}\big[ f(z_1^{d+1}) \big] \Big| > t \Big] \le 16 \Big( \frac{32 e n}{t_1} \Big)^{\mathrm{fat}_{t_1/16}(\mathcal{H})} e^{-n t_1^2/512}. \tag{32} \]

Proof. The corollary follows from the proof of Lemma 3 with two modifications. First, we can use the non-negativity of the loss and an upper bound on the kernel at (28) to reduce the supremum to the set L = { f : Z → R : f(z) = ℓ(h, z), h ∈ H }. Second, at (30) we can use the Lipschitz property of the loss function to upper bound the covering number: N_1(θ, F, n) ≤ N_1(θ/L_H, H, n).


We will use this result to show convergence rates for q̂(h, z̄) and p̂(z̄). For this, we need to establish condition (21), i.e. derive a bound on their variances. The following lemma is based on Theorem 1 from [30], coupled with covariance inequalities from [31].

Lemma 4. Assume that A1 and A4 hold. Then
\[ \mathrm{Var}\big( \hat q(h, \bar z) \big) \le 3 D_1^2 K_1^2 \frac{\lambda_b}{n b^d} \quad\text{and}\quad \mathrm{Var}\big( \hat p(\bar z) \big) \le 3 D_1^2 K_1^2 \frac{\lambda_b}{n b^d}, \qquad \text{for } \lambda_b = 1 + b^{-d} \sum_{k = \lfloor b^{-d} \rfloor + 1}^{\infty} \alpha_k. \tag{33} \]

Proof. The proof is a modification of Theorem 1 from [30]. We prove the statement only for q̂; the proof for p̂ follows the same lines. Let y_i = ℓ(h, z_{i+1}) K((z̄ − z_{i−d+1}^i)/b), so that q̂(h, z̄) = (1/(n b^d)) Σ_{i∈I} y_i. Our goal is to bound the covariances between the y_i. Observe that
\[ \mathbb{E}[y_i y_j] = \mathbb{E}\big[ \mathbb{E}[\, y_i y_j \mid z_{i-d+1}^{i}, z_{j-d+1}^{j} \,] \big] \tag{34} \]
\[ = \mathbb{E}\Big[ K\big( (\bar z - z_{i-d+1}^{i})/b \big) K\big( (\bar z - z_{j-d+1}^{j})/b \big)\, \mathbb{E}\big[ \ell(h, z_{i+1}) \ell(h, z_{j+1}) \mid z_{i-d+1}^{i}, z_{j-d+1}^{j} \big] \Big] \tag{35} \]
\[ \le D_1 \iint K\big( (\bar z - \bar u)/b \big) K\big( (\bar z - \bar v)/b \big)\, \mathbb{E}\big[ \ell(h, z_{i+1}) \ell(h, z_{j+1}) \mid z_{i-d+1}^{i} = \bar u,\ z_{j-d+1}^{j} = \bar v \big]\, d\bar u\, d\bar v \tag{36} \]
\[ = D_1 b^{2d} \iint K(\bar u) K(\bar v)\, \mathbb{E}\big[ \ell(h, z_{i+1}) \ell(h, z_{j+1}) \mid z_{i-d+1}^{i} = \bar z - b\bar u,\ z_{j-d+1}^{j} = \bar z - b\bar v \big]\, d\bar u\, d\bar v \tag{37} \]
\[ \le D_1 b^{2d} \iint K(\bar u) K(\bar v)\, d\bar u\, d\bar v = D_1 b^{2d}. \tag{38} \]

By a similar argument, for k ≥ 1,
\[ \mathbb{E}\big[ y_i^k \big] = \mathbb{E}\big[ \mathbb{E}[\, y_i^k \mid z_{i-d+1}^{i} \,] \big] \le D_1 b^d \int K(\bar u)^k\, d\bar u \le D_1 K_1^{k-1} b^d \int K(\bar u)\, d\bar u = D_1 K_1^{k-1} b^d. \tag{39} \]

Now let c_k = Cov(y_i, y_{i+k}). We can upper bound these covariances in two ways. First,
\[ |c_k| \le \mathbb{E}[y_i y_{i+k}] + \big( \mathbb{E}[y_i] \big)^2 \le 2 D_1^2 b^{2d}. \tag{40} \]
Second, by the covariance inequality for α-mixing processes (we use the tightest version from [31]), c_k ≤ 2 α_k ‖y_i‖_∞² ≤ 2 K_1² α_k. Using these bounds, we can calculate the variance:
\[ \mathrm{Var}\big( \hat q(h, \bar z) \big) = \mathbb{E}\Big[ \frac{1}{n b^d} \sum_{i \in I} \big( y_i - \mathbb{E}[y_i] \big) \Big]^2 \le \frac{1}{n b^{2d}} \Big( c_0 + 2 \sum_{k=1}^{\infty} |c_k| \Big) \tag{41} \]
\[ = \frac{1}{n b^{2d}} \Big( c_0 + 2 \sum_{k=1}^{\lfloor b^{-d} \rfloor} |c_k| + 2 \sum_{k = \lfloor b^{-d} \rfloor + 1}^{\infty} |c_k| \Big) \tag{42} \]
\[ \le \frac{1}{n b^{2d}} \Big( D_1 K_1 b^d + 2 D_1^2 b^{2d} \lfloor b^{-d} \rfloor + 2 K_1^2 \sum_{k = \lfloor b^{-d} \rfloor + 1}^{\infty} \alpha_k \Big) \tag{43} \]
\[ \le \frac{3 D_1^2 K_1^2}{n b^d} \Big( 1 + b^{-d} \sum_{k = \lfloor b^{-d} \rfloor + 1}^{\infty} \alpha_k \Big), \tag{44} \]

where we used the previous bounds for |c_k| and c_0 ≤ E[y_0²] ≤ D_1 K_1 b^d.

The above lemmas tell us the required conditions on the α-mixing coefficients of the process. For example, a sufficient condition is sublinear tail convergence, i.e. n Σ_{k=n+1}^∞ α_k → 0. This holds, e.g., for α_k = O(k^{−β}) with β > 2. Now we are ready to state and prove Theorem 1.
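The sufficient condition can be checked numerically. The following sketch (α_k = k^{−3}, i.e. β = 3 > 2, is an arbitrary example, and the truncation point is a numerical convenience) confirms that n Σ_{k>n} α_k decreases toward zero:

```python
def weighted_tail(n, beta=3.0, cutoff=200_000):
    """n * sum_{k=n+1}^{cutoff} k^(-beta), truncating the infinite tail."""
    return n * sum(k ** (-beta) for k in range(n + 1, cutoff + 1))

vals = [weighted_tail(n) for n in (10, 100, 1000)]
# for beta = 3 the tail behaves like 1/(2 n^2), so n * tail ~ 1/(2n) -> 0
```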

Theorem 1. Under the assumptions A1, A2, A3, if the bandwidth b satisfies n b^d → ∞ and b^d → 0, we have for an exponentially β-mixing process
\[ \sup_{h \in \mathcal{H}} \big| R(h, z_{N-d+1}^{N}) - \hat R(h, z_{N-d+1}^{N}) \big| \to 0 \quad \text{a.s.,} \tag{45} \]
and the same holds for an α-mixing process if we additionally assume A4.

The proof is based on the argument of [32] with appropriate modifications to achieve uniform convergence over hypotheses.

Proof. We start by reducing the problem to a supremum over histories. Denote sup_{h∈H} |R(h, z̄) − R̂(h, z̄)| by Ψ(z̄). Then for every ε > 0:
\[ P\big[ \Psi(z_{N-d+1}^{N}) > t \big] \le P\big[ \Psi(z_{N-d+1}^{N}) > t \mid z_{N-d+1}^{N} \in K_\varepsilon \big]\, P\big[ z_{N-d+1}^{N} \in K_\varepsilon \big] + \varepsilon \tag{46} \]
\[ \le P\Big[ \sup_{\bar z \in K_\varepsilon} \Psi(\bar z) > t \,\Big|\, z_{N-d+1}^{N} \in K_\varepsilon \Big]\, P\big[ z_{N-d+1}^{N} \in K_\varepsilon \big] + \varepsilon \tag{47} \]
\[ = P\Big[ \sup_{\bar z \in K_\varepsilon} \Psi(\bar z) > t \ \text{and}\ z_{N-d+1}^{N} \in K_\varepsilon \Big] + \varepsilon \tag{48} \]
\[ \le P\Big[ \sup_{\bar z \in K_\varepsilon} \Psi(\bar z) > t \Big] + \varepsilon. \tag{49} \]

Then we make the following decomposition:
\[ \sup_{h, \bar z} \big| R(h, \bar z) - \hat R(h, \bar z) \big| \le \frac{1}{\inf_{\bar z} \hat p(\bar z)} (T_1 + T_2 + T_3), \tag{50} \]
where
\[ T_1 = \sup_{h, \bar z} \big| \hat q(h, \bar z) - \mathbb{E}[\hat q(h, \bar z)] \big|, \tag{51} \]
\[ T_2 = \sup_{h, \bar z} \big| R(h, \bar z)\big( \mathbb{E}[\hat p(\bar z)] - \hat p(\bar z) \big) \big|, \tag{52} \]
\[ T_3 = \sup_{h, \bar z} \big| \mathbb{E}[\hat q(h, \bar z)] - R(h, \bar z)\, \mathbb{E}[\hat p(\bar z)] \big|. \tag{53} \]

First, note that, by Lemma 6, P[T_i > t] ≤ N(K_ε, τ) P[T̃_i > t′] for i = 1, 2, where
\[ \tilde T_1 = \sup_{h} \big| \hat q(h, \bar z) - \mathbb{E}[\hat q(h, \bar z)] \big|, \tag{54} \]
\[ \tilde T_2 = \sup_{h} \big| R(h, \bar z)\big( \mathbb{E}[\hat p(\bar z)] - \hat p(\bar z) \big) \big|. \tag{55} \]

For exponentially β-mixing processes, the exponential bounds for T̃_1 and T̃_2 are given in [33]. In the case of α-mixing we can use our Lemma 3 and Lemma 4. Setting ε = e^{−n} and using A2, we get the a.s. convergence of T_1 and T_2 to 0. A minor modification of Lemma 5 from [32], coupled with assumption A3, gives us the convergence of T_3. The last step is to ensure that p̂(z̄) is bounded away from zero; Lemma 4 from [32] together with Lemma 3 ensures that.

The proof of the convergence rate requires a slightly different decomposition than in Theorem 1 and a more delicate treatment of the terms using the additional assumptions. We introduce two further lemmas: Lemma 5 shows how to express the concentration of R̂(h, z̄) in terms of the concentration of q̂(h, z̄) and p̂(z̄); Lemma 6 uses covers to eliminate the supremum over z̄.

Lemma 5. Assume A1 and A5. Then, for R = R(h, z̄), R̂ = R̂(h, z̄), q̂ = q̂(h, z̄), p̂ = p̂(z̄), and with t_1 = ½ (t D_0 − K_2 D_2 d² b²),
\[ P\Big[ \sup_{h, \bar z} \big| \hat R - R \big| > t \Big] \le P\Big[ \sup_{h, \bar z} \big| \hat q - \mathbb{E}[\hat q] \big| > t_1 \Big] + P\Big[ \sup_{\bar z} \big| \hat p - \mathbb{E}[\hat p] \big| > t_1 \Big]. \tag{56} \]

Proof. We start with the following decomposition:
\[
\hat{R}(h,\bar{z}) - R(h,\bar{z}) = \frac{R(h,\bar{z})\,(E[\hat{p}(\bar{z})] - \hat{p}(\bar{z})) + \hat{q}(h,\bar{z}) - q(h,\bar{z}) + R(h,\bar{z})\,(p(\bar{z}) - E[\hat{p}(\bar{z})])}{E[\hat{p}(\bar{z})]}. \tag{57}
\]
Using the fact that A1 implies $E[\hat{p}(\bar{z})] \ge D_0 \int K(\bar{z})\,d\bar{z} = D_0$ and that $\hat{R}(h,\bar{z})$ and $R(h,\bar{z})$ are both upper bounded by 1,
\[
P\Big[\sup_{h,\bar{z}} \big|\hat{R}(h,\bar{z}) - R(h,\bar{z})\big| > t\Big] \le P\Big[\sup_{h,\bar{z}} \big|\hat{q}(h,\bar{z}) - q(h,\bar{z})\big| > \tfrac{1}{2}tD_0\Big] \tag{58}
\]
\[
+ P\Big[\sup_{\bar{z}} \big|\hat{p}(\bar{z}) - E[\hat{p}(\bar{z})]\big| + \sup_{\bar{z}} \big|p(\bar{z}) - E[\hat{p}(\bar{z})]\big| > \tfrac{1}{2}tD_0\Big]. \tag{59}
\]
The statement of the lemma will be proven if we show that $|q(h,\bar{z}) - E[\hat{q}(h,\bar{z})]|$ and $|p(\bar{z}) - E[\hat{p}(\bar{z})]|$ are upper bounded by $\frac{1}{2}K_2 D_2 d^2 b^2$. We do it only for $q$. Using stationarity,
\[
E[\hat{q}(h,\bar{z})] = \frac{1}{b^{d}} \int K\big((\bar{z}-\bar{u})/b\big)\, q(h,\bar{u})\, d\bar{u} = \int K(\bar{u})\, q(h, \bar{z} - b\bar{u})\, d\bar{u}. \tag{60}
\]
Now we apply the Taylor expansion and invoke the assumptions on the kernel:
\[
\big|q(h,\bar{z}) - E[\hat{q}(h,\bar{z})]\big| = \Big| -b \sum_{i=1}^{d} \frac{\partial}{\partial \bar{z}_i} q(h,\bar{z}) \int \bar{u}_i K(\bar{u})\, d\bar{u} + \frac{b^2}{2} \sum_{i,j=1}^{d} \frac{\partial^2}{\partial \bar{z}_i \partial \bar{z}_j} q(h,\bar{\xi}) \int \bar{u}_i \bar{u}_j K(\bar{u})\, d\bar{u} \Big| \tag{61}
\]
\[
\le \frac{1}{2} D_2 K_2 d^2 b^2. \tag{62}
\]

Lemma 6. Assume A6. Then
\[
P\Big[\sup_{h,\bar{z}} \big|\hat{q}(h,\bar{z}) - E[\hat{q}(h,\bar{z})]\big| > t_1\Big] \le \mathcal{N}(G,\tau)\, P\Big[\sup_{h} \big|\hat{q}(h,\bar{z}) - E[\hat{q}(h,\bar{z})]\big| > \frac{t_1}{3}\Big], \tag{63}
\]
\[
P\Big[\sup_{\bar{z}} \big|\hat{p}(\bar{z}) - E[\hat{p}(\bar{z})]\big| > t_1\Big] \le \mathcal{N}(G,\tau)\, P\Big[\big|\hat{p}(\bar{z}) - E[\hat{p}(\bar{z})]\big| > \frac{t_1}{3}\Big], \tag{64}
\]
where $\mathcal{N}(G,\tau)$ is a $\tau$-covering number of the set $G$ (in $\ell_2$-norm) with $\tau = \big(\frac{b^{d+\gamma} t_1}{3L}\big)^{1/\gamma}$.

Note that one could relax the requirement of compact support in different ways. For example, one could make additional assumptions on the moments of $q$ and $p$, or restrict the supremum to the set $\{\bar{z} : \|\bar{z}\| \le c_n\}$ with $c_n$ increasing as $n$ grows; this, however, would lead to a slower convergence rate. It is also possible to use A2 instead of A6, but this would require knowledge of the behavior of the covering numbers of the sets $K_\varepsilon$.

Proof. We again prove the statement only for $q$; the argument for $p$ goes along the same lines. Let us consider $V$, a fixed $\tau$-covering of $G$ with $\tau$ to be set later. We denote by $v(\bar{z})$ the closest element of the covering to $\bar{z}$. Then we have
\[
\sup_{h,\bar{z}} \big|\hat{q}(h,\bar{z}) - E[\hat{q}(h,\bar{z})]\big| \le \sup_{h,\bar{z}} \big|\hat{q}(h,\bar{z}) - \hat{q}(h,v(\bar{z}))\big| + \sup_{h,\bar{z}} \big|\hat{q}(h,v(\bar{z})) - E[\hat{q}(h,v(\bar{z}))]\big| \tag{65}
\]
\[
+ \sup_{h,\bar{z}} \big|E[\hat{q}(h,v(\bar{z}))] - E[\hat{q}(h,\bar{z})]\big|. \tag{66}
\]
Using the Lipschitz property of the kernel,
\[
\big|\hat{q}(h,\bar{z}) - \hat{q}(h,v(\bar{z}))\big| = \frac{1}{n b^d} \Big| \sum_{i\in I} \ell(h, z_{i+1}) \Big( K\big((\bar{z} - z_{i-d+1}^{i})/b\big) - K\big((v(\bar{z}) - z_{i-d+1}^{i})/b\big) \Big) \Big| \tag{67}
\]
\[
\le \frac{L}{n b^{d+\gamma}} \sum_{i\in I} \ell(h, z_{i+1})\, \|\bar{z} - v(\bar{z})\|_2^{\gamma} \le \frac{L \tau^{\gamma}}{b^{d+\gamma}}. \tag{68}
\]

Now, setting $\tau = \big(\frac{b^{d+\gamma} t_1}{3L}\big)^{1/\gamma}$, we can ensure that
\[
\sup_{h,\bar{z}} \big|\hat{q}(h,\bar{z}) - E[\hat{q}(h,\bar{z})]\big| \le \sup_{h,\, v\in V} \big|\hat{q}(h,v) - E[\hat{q}(h,v)]\big| + \frac{2}{3} t_1, \tag{69}
\]
which leads us to the statement of the lemma.

Now we are ready to make the full statements of Theorems 2 and 3.

Theorem 2 (Full statement). Assume A1 and A4–A7. For $n$ large enough so that $24 D_1^2 K_1^2 \lambda_b \le t_1^2 n b^d$, the following holds with $t_1 = \frac{1}{6 K_1 L_{\mathcal{H}}}(tD_0 - K_2 D_2 d^2 b^2)$ and $\tau = \big(\frac{b^{d+\gamma} t_1}{3L}\big)^{1/\gamma}$:
\[
P\Big[\sup_{h,\bar{z}} \big|\hat{R}(h,\bar{z}) - R(h,\bar{z})\big| > t\Big] \le 16\, \mathcal{N}(G,\tau) \Big(\frac{32 e n}{t_1}\Big)^{\mathrm{fat}_{t_1/16}(\mathcal{H})} e^{-n t_1^2 / 512}. \tag{70}
\]

As is usually the case with kernel estimators, one has to choose the bandwidth $b$ of the kernel. Theorem 2 gives several conditions that it needs to satisfy: $n b^d \to \infty$, $b^{d+\gamma} \to 0$, and $b^2 \to 0$. One option is to set $b = n^{-1/(d+\gamma)}$, which satisfies all the conditions. However, better choices might be possible depending on the mixing coefficients of the process.

Proof sketch. The proof is a combination of Lemmas 5 and 6, followed by Corollary 1. Note that, technically, to obtain the concentration for $\hat{p}$ one does not need to take the last step in Lemma 3 with tree coverings; however, to unify the final statement, we include the covering term in the bound. Note also that after applying Lemmas 5 and 6 it would be possible to continue differently and use an independent-block technique. This would, however, require at least a β-mixing process.

Analyzing the steps of the proof, one observes that most conditions on the process are required to ensure convergence of the kernel-based estimate. Using similar steps (with simpler arguments) for the unweighted empirical risk, we obtain a result analogous to Theorem 2 for the marginal risk:

Theorem 3 (Full statement). For $N$ large enough so that $2 + 16 \sum_{i=1}^{N-1} \alpha_i \le t^2 N$,
\[
P\Big[\sup_{h\in\mathcal{H}} \big|R_{\mathrm{mar}}(h) - \hat{R}_{\mathrm{mar}}(h)\big| > t\Big] \le 16 \Big(\frac{32 e N}{t}\Big)^{\mathrm{fat}_{t/16}(F)} e^{-N t^2 / 512}, \tag{71}
\]
with $F = \{f(z) = \ell(h,z) : h \in \mathcal{H}\} \subset \{f : \mathcal{Z} \to [0,1]\}$. As in Theorem 2, we can also get the statement of Theorem 3 directly in terms of the fat-shattering dimension of $\mathcal{H}$ instead of $F$ for Lipschitz losses.
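To make the objects in Theorems 2 and 3 concrete, the ratio estimator $\hat{R}(h,\bar z) = \hat{q}(h,\bar z)/\hat{p}(\bar z)$ together with the bandwidth rule $b = n^{-1/(d+\gamma)}$ can be sketched in a few lines. This is our own minimal illustration under stated assumptions (a box kernel, a scalar toy sequence, and a placeholder loss standing in for $\ell(h,\cdot)$), not the authors' implementation:

```python
import numpy as np

def conditional_risk_estimate(loss, z, d, b):
    """Kernel estimate R_hat(h, zbar) = q_hat(h, zbar) / p_hat(zbar) of the
    conditional risk, evaluated at the last observed history
    zbar = (z[n-d], ..., z[n-1]).  `loss` plays the role of l(h, .) for one
    fixed hypothesis h and maps a sample into [0, 1]."""
    def K(u):
        # product box kernel on [-1/2, 1/2]^d (a stand-in for the paper's kernel)
        return float(np.all(np.abs(u) <= 0.5))

    zbar = z[-d:]
    p_hat, q_hat = 0.0, 0.0
    for i in range(d - 1, len(z) - 1):
        # window z_{i-d+1}^{i} with successor z_{i+1}
        w = K((zbar - z[i - d + 1:i + 1]) / b)
        p_hat += w
        q_hat += w * loss(z[i + 1])
    # the 1/(n b^d) normalization cancels in the ratio; fall back to the plain
    # empirical (marginal) risk if no window lands in the kernel support
    if p_hat == 0.0:
        return float(np.mean([loss(v) for v in z[d:]]))
    return q_hat / p_hat

rng = np.random.default_rng(0)
z = rng.uniform(size=200)                 # toy scalar sequence
d, gamma = 2, 1.0
b = len(z) ** (-1.0 / (d + gamma))        # bandwidth b = n^{-1/(d+gamma)}
r = conditional_risk_estimate(lambda v: min(1.0, abs(v)), z, d, b)
```

Since the loss takes values in $[0,1]$, the returned estimate always lies in $[0,1]$ as well; the bandwidth here satisfies all three conditions listed above.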

6 Conclusion

In this paper we introduced an empirical estimator of the conditional risk for vector-valued stochastic processes, and we proved concentration bounds showing that, after an initial number of samples, the estimator converges to the true risk uniformly over large hypothesis classes at an exponential rate, provided the process is α-mixing with sufficiently fast rates. In an independent result, we showed that convergence results of the same type also hold for the marginal risk. An interesting feature of the bounds is that the mixing properties of the process only influence the leading constants of the bound, not the rate of convergence itself.

7 Appendix

We start with a short lemma about $\mathcal{X}$-valued trees, which we used in the proofs.

Lemma 7. Let $\phi : \mathcal{X}^n \times \{\pm 1\}^n \to \mathbb{R}$ and let $\varepsilon_1, \dots, \varepsilon_n$ be Rademacher random variables. Then
\[
\sup_{\xi}\, E_{\varepsilon_1,\dots,\varepsilon_n}\big[\phi(\xi_1(\varepsilon),\dots,\xi_n(\varepsilon),\varepsilon)\big] = \sup_{z_1} E_{\varepsilon_1} \dots \sup_{z_n} E_{\varepsilon_n}\big[\phi(z_1,\dots,z_n,\varepsilon)\big], \tag{72}
\]
where the first supremum is over all $\mathcal{X}$-valued trees of length $n$.

Proof. Assume that the first supremum on the right-hand side is achieved at some $z_1^{\star}$ (if it is not achieved, a limiting argument can be employed). The second supremum is achieved at some $z_2^{\star}(+1)$ if $\varepsilon_1 = +1$ and at some $z_2^{\star}(-1)$ if $\varepsilon_1 = -1$. Repeating this argument, we get
\[
\sup_{z_1} E_{\varepsilon_1} \dots \sup_{z_n} E_{\varepsilon_n}\big[\phi(z_1,\dots,z_n,\varepsilon)\big] = E_{\varepsilon_1,\dots,\varepsilon_n}\big[\phi(z_1^{\star}, \dots, z_n^{\star}(\varepsilon_1,\dots,\varepsilon_{n-1}), \varepsilon)\big] \tag{73}
\]
\[
\le \sup_{\xi}\, E_{\varepsilon_1,\dots,\varepsilon_n}\big[\phi(\xi_1(\varepsilon),\dots,\xi_n(\varepsilon),\varepsilon)\big]. \tag{74}
\]
The other direction is immediate: for any tree $\xi$,
\[
E_{\varepsilon_1,\dots,\varepsilon_n}\big[\phi(\xi_1(\varepsilon),\dots,\xi_n(\varepsilon),\varepsilon)\big] \le \sup_{z_1} E_{\varepsilon_1} \dots \sup_{z_n} E_{\varepsilon_n}\big[\phi(z_1,\dots,z_n,\varepsilon)\big]. \tag{75}
\]
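Because both sides of (72) become finite maxima and averages when $\mathcal{X}$ and $n$ are small, the equality can be checked exhaustively. The sketch below does this for $n = 2$ and $\mathcal{X} = \{0,1\}$ with an arbitrary test function $\phi$ of our own choosing; a depth-2 tree is a root $z_1$ together with one child $z_2(\varepsilon_1)$ per sign:

```python
import itertools

# Exhaustive check of equality (72) for n = 2 and X = {0, 1}.
X = [0, 1]
signs = [-1, 1]

def phi(z1, z2, e1, e2):
    # an arbitrary (hypothetical) test function of the path and the signs
    return e1 * z1 + e2 * z2 + z1 * z2 - 0.5 * e1 * z2

# Right-hand side of (72): interleaved suprema and expectations,
# sup_{z1} E_{e1} sup_{z2} E_{e2} phi(z1, z2, e).
rhs = max(
    sum(0.5 * max(sum(0.5 * phi(z1, z2, e1, e2) for e2 in signs) for z2 in X)
        for e1 in signs)
    for z1 in X
)

# Left-hand side: supremum over all X-valued trees of depth 2; a tree is a
# root z1 plus one child (zp for e1 = +1, zm for e1 = -1).
lhs = max(
    sum(0.25 * phi(z1, (zp if e1 == 1 else zm), e1, e2)
        for e1 in signs for e2 in signs)
    for z1, zp, zm in itertools.product(X, repeat=3)
)

assert lhs == rhs  # the two sides of (72) coincide
```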

Lemma 8. Let $z_1^n$ be a random sample and $z_1'^n$ be its independent copy. Fix a function $g : F \times \mathcal{Z}^n \to \mathbb{R}$. Assume that $\mathrm{Var}(g(f, z_1^n)) \le \varphi(n)$ for a strictly decreasing function $\varphi$. Then, for fixed $t > 0$ and for $n \ge \varphi^{-1}(t^2/8)$,
\[
P\Big[\sup_{f} \big|g(f, z_1^n) - E[g(f, z_1^n)]\big| > t\Big] \le 2\, P\Big[\sup_{f} \big|g(f, z_1^n) - g(f, z_1'^n)\big| > \frac{t}{2}\Big]. \tag{76}
\]

Proof. Given the sample $z_1^n$, let $f^{\star}$ be a function that maximizes $|g(f, z_1^n) - E[g(f, z_1^n)]|$. Then
\[
P\Big[\sup_{f} \big|g(f, z_1^n) - g(f, z_1'^n)\big| > \frac{t}{2}\Big] \ge P\Big[\big|g(f^{\star}, z_1^n) - g(f^{\star}, z_1'^n)\big| > \frac{t}{2}\Big] \tag{77}
\]
\[
\ge P\Big[\big|g(f^{\star}, z_1^n) - E[g(f^{\star}, z_1^n)]\big| > t \text{ and } \big|g(f^{\star}, z_1'^n) - E[g(f^{\star}, z_1^n)]\big| < \frac{t}{2}\Big] \tag{78}
\]
\[
= E\Big[\mathbb{I}\big[\big|g(f^{\star}, z_1^n) - E[g(f^{\star}, z_1^n)]\big| > t\big]\; P\Big[\big|g(f^{\star}, z_1'^n) - E[g(f^{\star}, z_1^n)]\big| < \frac{t}{2} \,\Big|\, z_1^n\Big]\Big]. \tag{79}
\]
Now we can bound the internal probability for any $f$ using Chebyshev's inequality:
\[
P\Big[\big|g(f, z_1'^n) - E[g(f, z_1^n)]\big| \ge \frac{t}{2} \,\Big|\, z_1^n\Big] \le \frac{\mathrm{Var}(g(f, z_1'^n))}{t^2/4} \le \frac{4\varphi(n)}{t^2}. \tag{80}
\]
For $n \ge \varphi^{-1}(t^2/8)$, the last expression is at most $\frac{1}{2}$. It follows that
\[
\frac{1}{2} \le \inf_{f} P\Big[\big|g(f, z_1'^n) - E[g(f, z_1^n)]\big| < \frac{t}{2} \,\Big|\, z_1^n\Big] \le P\Big[\big|g(f^{\star}, z_1'^n) - E[g(f^{\star}, z_1^n)]\big| < \frac{t}{2} \,\Big|\, z_1^n\Big], \tag{81}
\]
and finally
\[
P\Big[\sup_{f} \big|g(f, z_1^n) - g(f, z_1'^n)\big| > \frac{t}{2}\Big] \ge \frac{1}{2}\, E\Big[\mathbb{I}\big[\big|g(f^{\star}, z_1^n) - E[g(f^{\star}, z_1^n)]\big| > t\big]\Big] \tag{82}
\]
\[
= \frac{1}{2}\, P\Big[\sup_{f} \big|g(f, z_1^n) - E[g(f, z_1^n)]\big| > t\Big]. \tag{83}
\]
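The role of the threshold $n \ge \varphi^{-1}(t^2/8)$ is simply to push the Chebyshev bound (80) to $1/2$ or below. A quick numerical check with the illustrative (our own) choice $\varphi(n) = c/n$, for which $\varphi^{-1}(s) = c/s$:

```python
# Sanity check of the Chebyshev step in Lemma 8 for phi(n) = c / n
# (strictly decreasing), so the condition n >= phi^{-1}(t^2 / 8)
# reads n >= 8 c / t^2.
c, t = 2.0, 0.5

def phi(n):
    return c / n

n_min = 8.0 * c / t ** 2  # = phi^{-1}(t^2 / 8)
bounds = [4.0 * phi(n) / t ** 2 for n in (n_min, 2 * n_min, 10 * n_min)]

# at the threshold the bound (80) equals exactly 1/2 and only decreases
# for larger n, which is what step (81) needs
assert bounds[0] == 0.5
assert all(bd <= 0.5 for bd in bounds)
```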

The next lemma is an analogue of Lemma 9 from [29].

Lemma 9. Let $z_1^N$ be a sample from a stochastic process and $\tilde{z}_1^N$ its decoupled tangent sequence. Let $y_i = (z_{i-d+1}, \dots, z_i, \tilde{z}_{i+1})$, let $\psi$ be a measurable function, and let $F \subset \{f : \mathcal{Z}^{d+1} \to \mathbb{R}\}$. Then
\[
E\Big[\psi\Big(\sup_{f\in F} \sum_{i=d}^{N-1} \big(f(z_{i-d+1}^{i+1}) - f(y_i)\big)\Big)\Big] \tag{84}
\]
\[
\le \sup_{\bar{z}_d, \bar{y}_d} E_{\varepsilon_d} \dots \sup_{\bar{z}_{N-1}, \bar{y}_{N-1}} E_{\varepsilon_{N-1}} \Big[\psi\Big(\sup_{f\in F} \sum_{i=d}^{N-1} \varepsilon_i \big(f(\bar{z}_i) - f(\bar{y}_i)\big)\Big)\Big],
\]
where $\varepsilon_d, \dots, \varepsilon_{N-1}$ are i.i.d. Rademacher random variables.

Proof. Let $P$ be the joint distribution of $\{z_i\}$ and $\{\tilde{z}_i\}$. Conditioning on the first $N-1$ steps of both sequences,
\[
E\Big[\psi\Big(\sup_{f\in F} \sum_{i=d}^{N-1} \big(f(z_{i-d+1}^{i+1}) - f(y_i)\big)\Big) \,\Big|\, z_1^{N-1}, \tilde{z}_1^{N-1}\Big] \tag{85}
\]
\[
= \int \psi\Big(\sup_{f\in F} \sum_{i=d}^{N-2} \big(f(z_{i-d+1}^{i+1}) - f(y_i)\big) + \big(f(z_{N-d},\dots,z_{N-1},z_N) - f(z_{N-d},\dots,z_{N-1},\tilde{z}_N)\big)\Big)\, dP(z_N, \tilde{z}_N \mid z_1^{N-1}, \tilde{z}_1^{N-1}) \tag{86}
\]
\[
= \int \psi\Big(\sup_{f\in F} \sum_{i=d}^{N-2} \big(f(z_{i-d+1}^{i+1}) - f(y_i)\big) - \big(f(z_{N-d},\dots,z_{N-1},z_N) - f(z_{N-d},\dots,z_{N-1},\tilde{z}_N)\big)\Big)\, dP(z_N, \tilde{z}_N \mid z_1^{N-1}, \tilde{z}_1^{N-1}), \tag{87}
\]
where the last equality uses that, conditioned on the past, $z_N$ and $\tilde{z}_N$ are exchangeable. Therefore,
\[
E\Big[\psi\Big(\sup_{f\in F} \sum_{i=d}^{N-1} \big(f(z_{i-d+1}^{i+1}) - f(y_i)\big)\Big) \,\Big|\, z_1^{N-1}, \tilde{z}_1^{N-1}\Big] \tag{88}
\]
\[
= \int E_{\varepsilon_{N-1}}\, \psi\Big(\sup_{f\in F} \sum_{i=d}^{N-2} \big(f(z_{i-d+1}^{i+1}) - f(y_i)\big) + \varepsilon_{N-1} \big(f(z_{N-d}^{N}) - f(y_{N-1})\big)\Big)\, dP(z_N, \tilde{z}_N \mid z_1^{N-1}, \tilde{z}_1^{N-1}) \tag{89–90}
\]
\[
\le \sup_{\bar{z}_{N-1}, \bar{y}_{N-1} \in \mathcal{Z}^{d+1}} E_{\varepsilon_{N-1}}\, \psi\Big(\sup_{f\in F} \sum_{i=d}^{N-2} \big(f(z_{i-d+1}^{i+1}) - f(y_i)\big) + \varepsilon_{N-1} \big(f(\bar{z}_{N-1}) - f(\bar{y}_{N-1})\big)\Big). \tag{91}
\]
Repeating this for step $N-1$:
\[
E\Big[\psi\Big(\sup_{f\in F} \sum_{i=d}^{N-1} \big(f(z_{i-d+1}^{i+1}) - f(y_i)\big)\Big) \,\Big|\, z_1^{N-2}, \tilde{z}_1^{N-2}\Big] \tag{92}
\]
\[
= E\Big[ E\Big[\psi\Big(\sup_{f\in F} \sum_{i=d}^{N-1} \big(f(z_{i-d+1}^{i+1}) - f(y_i)\big)\Big) \,\Big|\, z_1^{N-1}, \tilde{z}_1^{N-1}\Big] \,\Big|\, z_1^{N-2}, \tilde{z}_1^{N-2}\Big] \tag{93}
\]
\[
\le E\Big[\sup_{\bar{z}_{N-1}, \bar{y}_{N-1}} E_{\varepsilon_{N-1}}\, \psi\Big(\sup_{f\in F} \sum_{i=d}^{N-2} \big(f(z_{i-d+1}^{i+1}) - f(y_i)\big) + \varepsilon_{N-1} \big(f(\bar{z}_{N-1}) - f(\bar{y}_{N-1})\big)\Big) \,\Big|\, z_1^{N-2}, \tilde{z}_1^{N-2}\Big] \tag{94}
\]
\[
\le \sup_{\bar{z}_{N-2}, \bar{y}_{N-2}} E_{\varepsilon_{N-2}} \sup_{\bar{z}_{N-1}, \bar{y}_{N-1}} E_{\varepsilon_{N-1}}\, \psi\Big(\sup_{f\in F} \sum_{i=d}^{N-3} \big(f(z_{i-d+1}^{i+1}) - f(y_i)\big) + \sum_{j=N-2}^{N-1} \varepsilon_j \big(f(\bar{z}_j) - f(\bar{y}_j)\big)\Big). \tag{95}
\]
Continuing in the same fashion, we get the statement of the lemma.
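A decoupled tangent sequence as used in Lemma 9 is easy to simulate for a Markov chain: conditioned on the past, $\tilde{z}_{i+1}$ is drawn from the same one-step law as $z_{i+1}$, but independently of the actual continuation. The sketch below is our own illustration with a hypothetical two-state chain, not part of the paper:

```python
import numpy as np

# Decoupled tangent sequence for a two-state Markov chain: given the past,
# z_tilde[i+1] is drawn from the same conditional law P(. | z[i]) as z[i+1],
# but independently of it.
rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
N = 2000

z = np.zeros(N, dtype=int)
z_tilde = np.zeros(N, dtype=int)
for i in range(N - 1):
    z[i + 1] = rng.choice(2, p=P[z[i]])        # actual next sample
    z_tilde[i + 1] = rng.choice(2, p=P[z[i]])  # tangent draw from P(. | z[i])

# conditioned on z[i], z[i+1] and z_tilde[i+1] are identically distributed and
# conditionally independent, so their empirical frequencies agree closely
freq, freq_tilde = z[1:].mean(), z_tilde[1:].mean()
```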

References

[1] Vladimir Naumovich Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.
[2] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. Annals of Probability, pages 94–116, 1994.
[3] Rajeeva L. Karandikar and Mathukumalli Vidyasagar. Rates of uniform convergence of empirical means with mixing processes. Statistics & Probability Letters, 58(3):297–307, 2002.
[4] Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. In Advances in Neural Information Processing Systems (NIPS), pages 1768–1776, 2009.
[5] Bin Zou, Luoqing Li, and Zongben Xu. The generalization performance of ERM algorithm with strongly mixing observations. Machine Learning, 75(3):275–295, 2009.
[6] Patrizia Berti and Pietro Rigo. A Glivenko-Cantelli theorem for exchangeable random variables. Statistics & Probability Letters, 32(4):385–391, 1997.
[7] Vladimir Pestov. Predictive PAC learnability: A paradigm for learning from exchangeable input data. In IEEE International Conference on Granular Computing (GrC), pages 387–391, 2010.
[8] Terrence M. Adams and Andrew B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. Annals of Probability, 38(4):1345–1367, 2010.
[9] David Gamarnik. Extension of the PAC framework to finite and countable Markov chains. IEEE Transactions on Information Theory (T-IT), 49(1):338–345, 2003.
[10] Cosma Shalizi and Aryeh Kontorovitch. Predictive PAC learning and process decompositions. In Neural Information Processing Systems (NIPS), pages 1619–1627, 2013.
[11] Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for time series prediction with non-stationary processes. In Algorithmic Learning Theory (ALT), pages 260–274. Springer, 2014.
[12] Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary ϕ-mixing and β-mixing processes. http://www.cs.nyu.edu/~mohri/pub/niidj.pdf, 10 Oct 2013.
[13] Alekh Agarwal and John C. Duchi. The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory (T-IT), 59(1):573–587, 2013.
[14] Dharmendra S. Modha and Elias Masry. Minimum complexity regression estimation with weakly dependent observations. IEEE Transactions on Information Theory (T-IT), 42(6):2133–2145, 1996.
[15] Dharmendra S. Modha and Elias Masry. Memory-universal prediction of stationary random processes. IEEE Transactions on Information Theory (T-IT), 44(1):117–133, 1998.
[16] Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5–34, 2000.
[17] Pierre Alquier and Olivier Wintenberger. Model selection for weakly dependent time series forecasting. Bernoulli, 18(3):883–913, 2012.
[18] Gérard Biau, Kevin Bleakley, László Györfi, and György Ottucsák. Nonparametric sequential prediction of time series. Journal of Nonparametric Statistics, 22(3):297–317, 2010.
[19] Richard C. Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.
[20] Krishna B. Athreya and Sastry G. Pantula. Mixing properties of Harris chains and autoregressive processes. Journal of Applied Probability, pages 880–892, 1986.
[21] Xiaohong Chen, Lars Peter Hansen, and Marine Carrasco. Nonlinearity and temporal dependence. Journal of Econometrics, 155(2):155–169, 2010.
[22] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. The MIT Press, 2002.
[23] László Györfi, Adam Krzyżak, Michael Kohler, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.
[24] Vladimir Naumovich Vapnik. Statistical Learning Theory, volume 1. Wiley, New York, 1998.
[25] Mathukumalli Vidyasagar. Convergence of empirical means with alpha-mixing input sequences, and an application to PAC learning. In IEEE Conference on Decision and Control (CDC), pages 560–565, 2005.
[26] H. Hang and I. Steinwart. Fast learning from α-mixing observations. Journal of Multivariate Analysis, 127:184–199, 2014.
[27] László Györfi and Gábor Lugosi. Kernel density estimation from ergodic sample is not universally consistent. Computational Statistics & Data Analysis, 14(4):437–442, 1992.
[28] S. Caires and J. A. Ferreira. On the non-parametric prediction of conditionally stationary sequences. Statistical Inference for Stochastic Processes, 8(2):151–184, 2005.
[29] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, pages 1–43, 2014.
[30] Bruce E. Hansen. Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, 24(3):726–748, 2008.
[31] Emmanuel Rio. Covariance inequalities for strongly mixing processes. In Annales de l'Institut Henri Poincaré (B) Probabilités et Statistiques, volume 29, pages 587–597. Gauthier-Villars, 1993.
[32] Gérard Collomb. Propriétés de convergence presque complète du prédicteur à noyau. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 66(3):441–460, 1984.
[33] Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Conference on Neural Information Processing Systems (NIPS), pages 1097–1104, 2009.
