arXiv:1307.0127v1 [cs.LG] 29 Jun 2013
Concentration and Confidence for Discrete Bayesian Sequence Predictors


Tor Lattimore, Marcus Hutter, and Peter Sunehag
Research School of Computer Science, Australian National University
{tor.lattimore,marcus.hutter,peter.sunehag}@anu.edu.au

July 2, 2013

Abstract

Bayesian sequence prediction is a simple technique for predicting future symbols sampled from an unknown measure on infinite sequences over a countable alphabet. While strong bounds on the expected cumulative error are known, there are only limited results on the distribution of this error. We prove tight high-probability bounds on the cumulative error, which is measured in terms of the Kullback-Leibler (KL) divergence. We also consider the problem of constructing upper confidence bounds on the KL and Hellinger errors similar to those constructed from Hoeffding-like bounds in the i.i.d. case. The new results are applied to show that Bayesian sequence prediction can be used in the Knows What It Knows (KWIK) framework with bounds that match the state-of-the-art.

Contents

1 Introduction
2 Notation
3 Convergence
4 Confidence
5 KWIK Learning
6 Conclusions
A Technical Lemmas
B Proof of Theorem 1
C Proof of Theorem 6
D Proof of Proposition 7
E Experiments
F Table of Notation

1 Introduction

Sequence prediction is the task of predicting symbol ω_t having observed ω_{<t}.

  µ_x( Σ_{t=ℓ(x)+1}^∞ d_t > e·ln(1/w_µ) + ln(ξ(x)/µ(x)) ) ≤ 1/e.
Let n ∈ N and assume µ(t_n < ∞) ≤ e^{1−n}. By the definition t_{n+1} ≥ t_n we have

  µ(t_{n+1} < ∞) = Σ_{x∈X(t_n)} µ(x) · µ_x( Σ_{t=ℓ(x)+1}^∞ d_t > e·ln(1/w_µ) + ln(ξ(x)/µ(x)) )
                 ≤ (1/e) · Σ_{x∈X(t_n)} µ(x) = (1/e) · µ(t_n < ∞) ≤ (1/e) · e^{1−n} = e^{−n}.

Therefore µ(t_n < ∞) ≤ e^{−n} for all n, and so µ(t_{L+1} < ∞) ≤ e^{−L} ≤ δ/2, which completes the proof of (⋆⋆) and so also the theorem. □

Theorem 3 is close to unimprovable.

Proposition 4. There exists an M = {µ, ν} such that with µ-probability at least δ it holds that

  D_∞ > (1/(4 ln 2)) · ln(1/δ) · ( ln(1/δ) + 2 ln((1−w)/w) − 3 ln 2 ).

Proof. Let X = {0, 1} and M := {µ, ν}, where the true measure µ is the Lebesgue measure and ν is the measure deterministically producing an infinite sequence of ones; these are defined by µ(x) := 2^{−ℓ(x)} and ν(x) := ⟦x = 1^{ℓ(x)}⟧, where 1^n is the sequence of n ones. Let w = w_µ and w_ν = 1 − w. If n = (1/ln 2)·ln(1/δ) ∈ N, then µ(Γ_{1^n}) ≥ δ, and for ω ∈ Γ_{1^n}

  D_∞(ω) ≥(a) Σ_{t=1}^{n+1} d_{1^{t−1}}(µ‖ξ)
         =(b) Σ_{t=1}^{n+1} [ (1/2)·ln(1/(2ξ(1|1^{t−1}))) + (1/2)·ln(1/(2ξ(0|1^{t−1}))) ]
         >(c) (1/2) · Σ_{t=1}^{n+1} ln( 1/(4ξ(0|1^{t−1})) )
         =(d) (1/2) · Σ_{t=1}^{n+1} ln( (w·2^{1−t} + (1−w)) / (4w·2^{−t}) )
         ≥(e) (1/2) · Σ_{t=1}^{n+1} [ (t−2)·ln 2 + ln((1−w)/w) ]
         =(f) ((n+1)/4) · ( 2·ln((1−w)/w) + (n−2)·ln 2 ).

Here (a) follows from the definition of D_∞(ω) and the non-negativity of the KL divergence, which allows the sum to be truncated; (b) follows by inserting the definitions of µ and the KL divergence; (c) follows by basic algebra and the fact that ξ(1|1^{t−1}) < 1; (d) follows from the definition of ξ; and (e) and (f) are basic algebra. Finally, substitute n + 1 ≥ (1/ln 2)·ln(1/δ). □
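As a numerical sanity check on the proof of Proposition 4, the sketch below (the helper name and default arguments are our own, not from the paper) computes the cumulative KL error for the two-measure class {µ, ν} exactly and compares it against the final lower bound of the displayed chain:

```python
import math

def prop4_check(w=0.5, n=20):
    """Compare the exact cumulative KL error with the closed-form lower
    bound for M = {mu, nu}: mu is the fair-coin (Lebesgue) measure and
    nu deterministically emits ones. Hypothetical helper for illustration."""
    total = 0.0
    for t in range(1, n + 2):                     # t = 1, ..., n+1
        # Bayes mixture mass of the prefix 1^{t-1}: xi(1^{t-1}) = w*2^{1-t} + (1-w)
        prefix = w * 2.0 ** (1 - t) + (1 - w)
        xi0 = w * 2.0 ** (-t) / prefix            # predictive probability of a 0
        xi1 = 1.0 - xi0                           # predictive probability of a 1
        # per-step KL divergence d_{1^{t-1}}(mu || xi), using mu(.|.) = 1/2
        total += 0.5 * math.log(0.5 / xi1) + 0.5 * math.log(0.5 / xi0)
    bound = (n + 1) / 4 * (2 * math.log((1 - w) / w) + (n - 2) * math.log(2))
    return total, bound
```

Because the chain discards strictly positive quantities at steps (c) and (e), the exact sum strictly exceeds the bound for any 0 < w < 1.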

In the next section we will bound d_t by a function of c_t, which can be computed without knowing µ. For this result to be useful we need to show that c_t converges to zero, which is established by the following theorems.

Theorem 5. If Ent(w) < ∞, then E_µ C_∞ ≤ Ent(w)/w_µ and lim_{t→∞} c_t = 0 with µ-probability 1.

Proof. We make use of the dominance ξ(x) ≥ w_µ µ(x), properties of expectation, and Theorem 1.
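The dominance property invoked here is elementary: a Bayes mixture ξ(x) = Σ_ν w_ν ν(x) is a sum of non-negative terms, so every single term w_µ µ(x) is a lower bound on ξ(x). A minimal numerical illustration, with a Bernoulli model class and prior weights of our own choosing:

```python
import itertools

# Illustrative prior weights and a small class of Bernoulli(theta) measures.
weights = [0.3, 0.3, 0.4]
thetas = [0.2, 0.5, 0.8]

def measure(theta, x):
    """Probability that a Bernoulli(theta) source emits the bit string x."""
    ones = sum(x)
    return theta ** ones * (1 - theta) ** (len(x) - ones)

def mixture(x):
    """Bayes mixture xi(x) = sum_nu w_nu * nu(x)."""
    return sum(w * measure(th, x) for w, th in zip(weights, thetas))

# Dominance xi(x) >= w_mu * mu(x) holds for every string x and every
# measure mu in the class, simply because the remaining terms are >= 0.
for x in itertools.product((0, 1), repeat=8):
    xi = mixture(x)
    assert all(xi >= w * measure(th, x) for w, th in zip(weights, thetas))
```

The same one-line argument works for any countable model class, since dropping all but one non-negative summand can only decrease the total.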