1
A Mathematical Theory of Learning
arXiv:1405.1513v1 [cs.LG] 7 May 2014
Ibrahim Alabdulmohsin
Abstract—In this paper, a mathematical theory of learning is proposed that has many parallels with information theory. We consider Vapnik’s General Setting of Learning in which the learning process is defined to be the act of selecting a hypothesis in response to a given training set. Such hypothesis can, for example, be a decision boundary in classification, a set of centroids in clustering, or a set of frequent item-sets in association rule mining. Depending on the hypothesis space and how the final hypothesis is selected, we show that a learning process can be assigned a numeric score, called learning capacity, which is analogous to Shannon’s channel capacity and satisfies similar interesting properties as well such as the data-processing inequality and the information-cannot-hurt inequality. In addition, learning capacity provides the tightest possible bound on the difference between true risk and empirical risk of the learning process for all loss functions that are parametrized by the chosen hypothesis. It is also shown that the notion of learning capacity equivalently quantifies how sensitive the choice of the final hypothesis is to a small perturbation in the training set. Consequently, algorithmic stability is both necessary and sufficient for generalization. While the theory does not rely on concentration inequalities, we finally show that analogs to classical results in learning theory using the Probably Approximately Correct (PAC) model can be immediately deduced using this theory, and conclude with information-theoretic bounds to learning capacity.
I. I NTRODUCTION In this paper, we consider the General Setting of Learning introduced by Vapnik [1] in which a learning machine L is presented with a training set Sm = {Z1 , . . . , Zm } ∈ Z m whose m training examples Zi are drawn i.i.d. from a fixed unknown distribution P(z). The task of the learning machine is to pick a hypothesis H out of a pre-defined hypothesis space H(m) in order to “summarize” or “fit” the training set. In general, we assume that the observation space Z may or may not be numerical and that the hypothesis space H(m) can vary according to m. For example, in a binary classification task, a classifier may be presented with feature-label pairs Z = (X, Y ) ∈ X × {−1, +1} and the goal is to choose a function f (x) : X → {−1, +1} that can accurately predict the unknown label Y given X. Here, any choice of f (x) serves as an instance of H. In mean estimation, such as when we would like to predict X ∈ Rn using its expected value E[X], the hypothesis H can be the empirical average of training examples whose space H(m) is the entire plane Rn . In the latter case, H is a deterministic function of the training set Sm . Finding weights in neural networks, prototypes in clustering methods, enclosing spheres in some outlier detection algorithms, and I. Alabdulmohsin is with the Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division at King Abdullah University of Science & Technology (KAUST), Thuwal, Saudi Arabia. e-mail:
[email protected] frequent itemsets in association rule mining are only a handful of examples to learning tasks that fall under such general learning setting. In the general learning setting, the learning machine L picks an instance of H, denoted h ∈ H(m) , according to some fixed learning process. We model such learning process as a probability distribution H ∼ P(m) (h | Sm ). Again, the process generally depends on the number of training examples. For example, if observations are binary Z P ∈ {0, 1} and m L generates the sum of observations H = i=1 Zi , then (m) H = {0, 1, . . . , m}. In the latter case, the probability of picking a specific hypothesis P(H = h) given an instance of a training set Sm = sm is a degenerate distribution; the probability is one if h is the sum of all training examples in sm and is zero otherwise. In general, however, the hypothesis H might be a random function of Sm . Once a hypothesis is inferred, the learning process is concluded. In order to assess quality of the inference, we assume a non-negative loss function with bounded range exists LH (Z) : Z → [0, 1], that is conditionally independent of the training set Sm given H. That is, we assume that the Markov chain Sm → H → LH (Z) always hold. Formally, LH (Z) is a function of both the inferred hypothesis H and the observation Z. For example, an observation in SVM is a pair of features plus class label Z = (X, Y ) ∈ Rn × {−1, +1}. Here, the hypothesis H can be a separating hyperplane, i.e. H takes its values from {(w, b) | w ∈ Rn , b ∈ R} for some normal vector w and offset b. In addition, one possible loss function is given by LH (Z) = Lw,b (x, y) = I{y(wT x − b) ≤ 0}1 , which is conditionally independent of the original training set Sm given the inferred hypothesis H = (w, b). Having a loss function LH (Z) at hand, we define the true risk of a hypothesis H with respect to LH (Z) by the risk functional: ˆ R(H) = EZ∼P(z) LH (Z) (1) Here, the true risk of any instance of H is the expected value of LH (Z), where expectation is taken over Z ∼ P(z). We define the risk of the learning machine L with respect to LH (Z), denoted R(L), to be the expected risk of its inferred hypothesis, where expectation is taken over all possible training sets and over all possible hypotheses. Precisely, we have: ˆ R(L) = ESm EH|Sm R(H)
(2)
The ideal final goal is to be able to obtain an unbiased estimator to the true risk of a learning machine R(L) given that we know its inference process. This allows us to quantify quality of the inference. One convenient estimator is the 1 Here,
I{·} is a 0-1 Boolean indicator.
2
empirical loss, which for a fixed training set Sm = sm and a fixed hypothesis H = h is defined as: 1 X Lh (zi ) (3) Remp (h, sm ) = m zi ∈Sm
Note in the above expression that Lh (z) is implicitly a function of h. Analogously, we define the empirical risk of the learning machine L by the expected empirical risk of its inferred hypothesis: Remp (L) = ESm EH|Sm Remp (H, Sm )
(4)
An unbiased estimator to Remp (L) is usually available since both training examples and the inferred hypothesis are often known. Unfortunately, however, it has long been established that the empirical risk is a biased (optimistic) estimator to the true risk. Hence, we desire to bound the difference |R(L) − Remp (L)| analytically in order to be able to correct for such bias. Such bound would hopefully shed some insight into the many phenomena associated with learning such as overfitting, underfitting, and the importance of algorithmic stability. However, quantifying the difference between true risk and empirical risk is quite subtle and several methods have been proposed in the past to answer it including uniform convergence, stability, Rademacher and Gaussian complexities, generic chaining bounds, the PAC-Bayesian framework, and robustness-based analysis [1]–[9]. Moreover, extensions to the semi-supervised setting have been proposed as well [10]. Concentration inequalities form the building blocks of such rich theories. In this paper, a new approach of bounding the difference between true risk and empirical risk is introduced. Unlike earlier approaches, the mathematical theory presented here does not treat such difference as a problem of convergence of random variables to their expectations. In fact, we will show that even though observations Z are always assumed to be drawn i.i.d. from the same underlying distribution P(z), both in the past and into the future, true and empirical risks of a learning machine have different distributions because the process of learning changes our posterior distribution of the training set. We will see that the theory can be confirmed numerically quite readily, and that it is rich enough to capture some of deepest aspects of learning even though we are only dealing with averages. As will be demonstrated throughout the sequel, the learning theory proposed in this paper has many parallels with information theory. In particular, a notion of mutual affinity in learning is closely related to mutual information, and the notion of learning capacity is quite analogous to the capacity of communication channels. In addition, important inequalities in information theory such as the “data-processing” inequality and the “information-cannot-hurt” inequality [11] have analogs within the theory of learning. The asymptotic equipartition property (AEP) plays a key role in both theories as well. In fact, we will also be able to derive information-theoretic bounds, which are close in spirit to the PAC-Bayesian bounds [4], [5]. The rest of the paper is structured as follows. We will first introduce fundamental concepts such as similarity and
distance between probability distributions, and present the main theorems after that. We will introduce the notion of learning capacity, which provides the tightest possible bound on the difference between true and empirical risks of learning machines. After that, we present several interpretations of learning capacity. For instance, it will be shown that learning machines admit a partial order with interesting implications, and that capacity and stability of learning machines are two different faces of the same coin. Finally, we show connections between learning capacity and the effective support set size of observations as well as size of the hypothesis space. II. N OTATION We will always use L to refer to a learning machine, which is a formal specification of a learning process and it comprises of three components: 1) The observation space Z. 2) A sequence of hypothesis spaces H(m) for all m ≥ 1. 3) A sequence of probability distributions P(m) (H | Sm ) for all m ≥ 1 and all Sm ∈ Z m , where H ∈ H(m) . Formally, L is a tuple: L = Z, {H(m) }m=1,2,... , {P(m) (H | Sm )}m=1,2,...
Given a learning machine L, we interpret it by saying that for any m i.i.d. observations Sm = {Z1 , . . . , Zm } ∈ Z m received, a hypothesis H ∈ H(m) is generated randomly according to P(m) (h | Sm ). In statistical terms, H can be any summary statistic of Sm . For example, if Z ⊆ R and H(m) = R, then H can be the mean, the maximum, the median, or any individual training example pick out of Sm . In fact, H can also be entirely independent of Sm . If H(m) = Z m and H = Sm , we will say that L is a lazy learner. A lazy learner receives a training set Sm and returns the training set itself as a hypothesis, hence the name. If X ∼ P(x) is a random variable drawn from the alphabet X andPf (X) is a function of X, we write EX∼P(x) f (X) to mean x∈X P(x) f (x). Often, we will simply write EX f (X) to mean EX∼P(x) f (X) if the probability distribution P(x) is clear from context. If X takes its values from a finite set P S uniformly at random, we write EX∼S f (X) to mean 1 x∈S f (x). |S| In general, random variables will be denoted using capital letters, and instances of random variables will be denoted using small letters. Finally, alphabets are denoted using calligraphic typeface. We will generally restrict attention to the case in which the observation space Z and the hypothesis space H(m) are finite or countably infinite, in which case P(z) and P(m) (H | Sm ) are probability mass functions, but the main results can be readily generalized. Finally, we will denote the 0-1 indicator function using I{·}. If X is a boolean random variable, then I{X} = 1 if and only if X is true, otherwise I{X} = 0. III. F UNDAMENTAL C ONCEPTS A. Similarity and Distance Our first fundamental concept is the notion of similarity and distance between two probability distributions:
3
Definition 1. Given two probability distributions P1 (a) and P2 (a) defined on the P same alphabet A, we define similarity using hP1 , P2 iP = a∈A min{P1 (a), P2 (a)}. Also, we write ||P1 , P2 ||P = 1 − hP1 , P2 iP to denote the total variation distance. Intuitively speaking, similarity is a measure of the intersection or overlap between two probability distributions, and is sometimes referred to as the overlapping coefficient [12]. In addition ||p , q||P is the total variation distance but we will call it distance for simplicity. Needless to say, the notion of similarity and distance is not analytic. Nevertheless, the following lemma reveals that h·, ·iP and ||· , ·||P are, in fact, the infinite limit of a sequence of smooth analytic functionals of probability distributions. Lemma 1. Let P1 (a) and P2 (a) be two probability distributions defined on the same alphabet A. Then: ||P1 , P2 ||P =
∞ Y t=1
1−
X a∈A
ρ(t) (a) · ν (t) (a)
(5)
Here, ρ(t) (a) and ν (t) (a) are probability distributions given by the following recursive definition: ρ(t) (a) = ν
(t)
ρ(t−1) (a) · (1 − ν (t−1) (a)) P , 1 − b∈A ρ(t−1) (b) · ν (t−1) (b)
ν (t−1) (a) · (1 − ρ(t−1) (a)) P (a) = , 1 − b∈A ρ(t−1) (b) · ν (t−1) (b)
ν
(1)
Taking the 1-norm of both sides and using ||P1 , P2 ||P = 1 2 ||P1 − P2 ||1 , where the subtraction is element-wise, gives us: X ||P1 , P2 ||P = 1 − P1 (b) · P2 (b) · ||ρ(2) , ν (2) ||P b∈A
Repeating this process indefinitely on the right-hand side yields statement of the lemma. Of particular importance to us in the above lemma is the following bound, which will be useful later when we discuss algorithmic stability:
t=1
0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0
X a∈A
ρ(t) (a)·ν (t) (a) ,
0.4
s
0.6
0.8
1
Fig. 1. First three approximations of Lemma 1 are plotted against true distance for Example 1.
ν (t) of Lemma 1 are given by: ρ(1) = (s, 1 − s) , ρ(2) = (s, 1 − s) , (1 − s)2 s2 , ρ(3) = 2 s + (1 − s)2 s2 + (1 − s)2 (1 − s)2 s2 ν (3) = 2 , s + (1 − s)2 s2 + (1 − s)2
1 1 ν (1) = ( , ) 2 2 (2) ν = (1 − s, s)
Example 2. To gain further insight into the role of total variation distance in learning problems, suppose we have a binary classification problem with features X ∈ X and labels Y ∈ {0, 1}. Let Pk (x) = P(X = x|Y = k) be the class conditional distribution of features X. Define κ = max{P(Y = 0), P(Y = 1)}. Then, the optimal Bayes rate e∗ satisfies: e∗ ≤ κ 1 − ||P0 , P1 ||P The above inequality holds with equality if κ = 21 . So, if distribution of the two classes are far away from each other in the total variation distance, then they can be distinguished from each other with high accuracy. Proof. The Bayes rate satisfies: n o X e∗ = min P(X = x, Y = 0), P(X = x, Y = 1) x∈X
=
X x∈X
1−
0.2
Figure 1 shows first three approximations for T = 1, 2, 3. does approach the true distance (a) = P2 (a) Clearly, the approximation ||P1 , P2 ||P = |s − 21 | as expected.
b∈A
T Y
Exact Solution 1st Approximation 2nd Approximation 3rd Approximation
ρ(1) (a) = P1 (a)
Proof. Writing P1 (a) = P1 (a) P2 (a) + P1 (a) 1 − P2 (a) , we have: X P1 (a) − P2 (a) = 1 − P1 (b) · P2 (b) · ρ(2) − ν (2)
||P1 , P2 ||P ≤
1 0.9
for any T ≥ 1 (6)
Example 1. Suppose we have two Bernoulli distributions P1 (z) = (s, 1 − s) with probability of success s, and P2 (z) = ( 21 , 12 ) with probability of success 12 . Their distance is ||P1 , P2 ||P = |s − 12 |. The first few distributions ρ(t) and
n min P(X = x|Y = 0) · P(Y = 0),
o P(X = x|Y = 1) · P(Y = 1) n o X ≤κ min P(X = x | Y = 0), P(X = x | Y = 1) x∈X = κ 1 − ||P0 , P1 ||P
4
B. Mutual Affinity The second fundamental concept in this paper is mutual affinity: Definition 2 (Mutual Affinity). The mutual affinity between two random variables X1 and X2 is defined by: IP (X1 , X2 ) = ||P(X1 ) · P(X2 ) , P(X1 , X2 )||P = EX1 ||P(X2 ) , P(X2 | X1 )||P
= EX2 ||P(X1 ) , P(X1 | X2 )||P
Mutual affinity is quite analogous to mutual information. In information theory, mutual information between two random variables X1 and X2 is the distance between the hypothesis that the two random variables are independent of each other vs. their true joint distribution, where distance is measured in the Kullback-Leibler divergence sense. In learning theory, mutual affinity is the distance between the same two hypotheses, where distance is now measured in the total variation sense. Example 3. Suppose X and Y are binary random variables with P(Y = X) = 1 − as depicted in Figure 2. Then, IP (X , Y ) = | 12 − |. Hence, if = 21 , then X and Y are independent of each other and mutual affinity is identically zero.
✓✏ ✓✏ 1 − � ✲ 0 0 ❍ ❍❍ ✒✑ ✯✒✑ ✟ ❍�❍✟✟ ❍ ✟✟ � ❍❍ ✟ ✓✏ ❥✓✏ ✟ ✲ 1 1 ✟ ✒✑ 1 − � ✒✑
X
Y
Fig. 2. The mutual affinity between X and Y in this example is | 12 − |.
C. Effective Support Set Size Our third key concept is the effective support set size. For learning problems whose observations Z are drawn from a probability mass function P(z), size of the support set of P(z) is important because it relates to how difficult the learning problem is. In other words, if size of the support set of P(z) is large, then large training sets are usually needed. For most problems of interest, however, size of the support set is infinite, hence such comparison is inappropriate. The correct measure of the spread of a probability distribution is given by its effective support set size. Definition 3 (Effective Support Set Size). Given a probability mass function P(z) on an alphabet Z, its effective support set size is defined by: 2 Xp P(z) (1 − P(z)) (7) Ess [P(z)] = 1 + z∈Z
Example 4. At one extreme, let P(z) be a uniform probability mass function on a finite alphabet |Z| < ∞. Then,
Ess [P(z)] = |Z|. In other words, effective support set size of a uniform distribution is equivalent to size of its true support set. At the other extreme, let P(z) be a Kronecker delta distribution P(z) = δz,z0 , whose entire probability mass is located at a single point z0 , then Ess [P(z)] = 1. Example 5. A geometric distribution P(z) = α (1 − α)z−1 defined on the set of natural numbers Z = {1, 2, . . .} with probability of success 0 < α ≤ 1 has a finite effective support set size with the following upper bound: α Ess[P(z)] ≤ 1 + 2 √ 1− 1−α
Intuitively, we expect effective support set size to decrease when α → 1. As the probability of success increases, higher values of the geometric random variable become less likely and the probability mass becomes more concentrated around the first few values.
Proof. Effective support set size of a geometric distribution with probability of success α > 0 satisfies: ∞ p 2 X p P(k) 1 − P(k) Ess[P(z)] = 1 + k=1
∞ p X 2 ≤1+ P(k)
=1+α
k=1 ∞ X
k=1
=1+
(1 −
(1 − α)
k−1 2
2
α √ 1 − α)2
D. Information and Events Our final key concept is that of information. One of the most widely known results in information theory is that the amount of information delivered by an observation Z is directly related to its uncertainty. In information theory, an observation Z 1 is said to deliver log P(Z) of information. Here, the use 1 of log P(Z) is interpreted in terms of coding or ‘minimum description length’. Such connection between information and uncertainty has found support in neuroscience [13], [14]. In the mathematical theory of learning presented in this paper, however, the amount of information delivered by an observation Z is given by 1 − P(Z). Informally speaking, the quantity of information delivered by an event, when measured in the context of learning, is the “amount of change” in our “belief”upon knowing the event has occurred. Not surprisingly, information in the context of learning is also directly related to uncertainty. A precise definition of information is the following: Definition 4 (Information). Let X and Y be two random variables. The amount of information contained in an event Y = y about X is given by: IX (y) = ||P(X) , P(X | Y = y)||P
(8)
In other words, the amount of information contained in an event Y = y is the amount of change in our belief about the
5
probability distribution of X measured in the total variation sense. Example 6. For any random variable X and any event X = x, we have: IX (x) = ||P(X) , P(X | X = x)||P = 1 − P(x) It is perhaps worthwhile to note that the measure of information in Definition 4 closely resembles the “Bayesian Brain”, which is a popular model for how the brain might encode information coming from sensory systems. According to such model, the brain encodes its beliefs in the form of probabilities that are being continuously updated given new sensory information. For example, the depth of objects D can have a prior distribution P(D). Given a new retinal image G, the probability distribution of depth of an object changes to P(D | G). The brain, then, acts upon the new posterior distribution [13]–[15]. In this regard, a retinal image G is only as useful as the impact it leaves on our prior beliefs, which is consistent with Definition 4. Example 7. The mutual affinity between two random variables X and Y is the expected amount of information one variable carries about the other: IP (X , Y ) = EY IX (Y ) = EX IY (X)
(9)
Finally, we note that for any two random variables X and Y , the random variable Y cannot carry more information about X than it carries about itself, which is analogous to similar results in information theory. More precisely, we have the following lemma: Lemma 2. For any two random variables X ∈ X and Y ∈ Y, and any event Y = y, we have: IX (y) ≤ IY (y) IX (y) = ||P(X) , P(X | Y = y)||P X =1− min{P(X = x), P(X = x | Y = y)} x∈X
=1− =1−
X
Lemma 3 (Distance-Loss Lemma). Suppose we have a random variable Z ∈ Z and two probability distributions P1 (z) and P2 (z) defined on Z. Let L(Z) : Z → [0, 1] be a loss function. Then: EZ∼P1 (z) L(Z) − EZ∼P2 (z) L(Z) ≤ ||P1 (z) , P2 (z)||P (10) In addition, there exists a loss function that achieves the bound.
Proof. Given an event Y = y, we have:
≤1−
a hypothesis H according to P(m) (h | Sm ). Of course, H can be selected deterministically, in which case P(m) (h | Sm ) becomes a degenerate distribution. Once H is selected, the entire learning process is concluded. Given an inferred hypothesis H, the same hypothesis can be used in multiple applications. For instance, if H = {cj }j=1,...,k ⊆ Sm is a set of k prototypes that are selected out of Sm , then such prototypes can be used in clustering, and they can also be used regression or classification if the target concept Y is one of the dimensions of Z. These different applications or uses of the same inferred hypothesis give rise to different loss functions LH (Z) that satisfy the Markov chain Sm → H → LH (Z). In regression, for instance, the loss function used might measure the mean-square error whereas a 0-1 misclassification loss might be used in classification. Nevertheless, the act of choosing a suitable loss function LH (Z) is outside the learning process because the learning process is defined solely by how H is being generated. Of course, the process by which H is generated might be designed at the outset to optimize a specific loss function in mind, but this fact would already be encoded in the distribution P(m) (H | Sm ). Henceforth, in order to bound the difference between true and empirical risks of a given learning machine, we need a bound that holds simultaneously for all loss functions that satisfy the Markov chain Sm → H → LH (Z), and we desire the bound to be as tight as possible. To achieve such objectives, we start with the following simple lemma.
min{P(X = x, Y = y), P(X = x|Y = y)}
Proof. We have: EZ∼P1 (z) L(Z) − EZ∼P2 (z) L(Z) X Z u=1 = u · P(L(Z) = u| Z = z) P1 (z) − P2 (z) du z∈Z
x∈X
X
P(X = x, Y = y) min{1,
x∈X
X
1 } P(Y = y)
P(X = x, Y = y)
x∈X
= 1 − P(Y = y)
= IY (y)
≤
X z∈Z
u=0
max{P1 (z) − P2 (z), 0}
= ||P1 (z) , P2 (z)||P Now, consider the following upper bound: EZ∼P1 (z) L(Z) ≤ EZ∼P2 (z) L(Z) + ||P1 (z) , P2 (z)||P
IV. M AIN R ESULTS
To show that the bound above is tight, we note that the inequality holds with equality for the loss function: L? (Z) = I P1 (Z) ≥ P2 (Z)
As stated earlier, our first goal is to be able to bound the difference between true and empirical risks of a given learning machine L. A formal definition of L was provided earlier in Section II, which we interpreted by saying that L receives a training set Sm of m i.i.d. examples Z ∼ P(z) and selects
Corollary 1. Suppose in a learning machine L, the hypothesis H is itself a valid bounded loss function L(Z) : Z → [0, 1].
The second inequality follows because P(X) ≥ P(X, Y ).
Since L? (Z) satisfies conditions of the theorem, namely that we have L? (Z) : Z → [0, 1], the upper bound is tight. Tightness of the lower bound is derived similarly.
6
That is, H(m) ⊆ {f (z)| ∀z ∈ Z : 0 ≤ f (z) ≤ 1} for all m ≥ 1. Let Ztrn be a random variable whose value is drawn uniformly at random out of Sm with replacement. Also, define R(L) and Remp (L) by Eq 2 and Eq 4 respectively, where LH (Z) is now the hypothesis H. Then: R(L) − Remp (L) ≤ IP (Ztrn , H) (11) Proof. As always, we write P(z) to denote the probability distribution of observations Z. First, we have by Eq 2: R(L) = ESm ,H EZ∼P(z) LH (Z) = EH EZ∼P(z) L(Z) (12) Now, we note that the value of the expression L(Z) inside the expectation depends, in fact, on the value of two random variables. The first random variable is the choice of the loss function H = L(Z), since this is selected by the learning machine according to Sm . The second random variable is the observation Z. However, by definition of true risk, Z is drawn from its original distribution P(z) independently of L(Z). By contrast, the hypothesis H = L(Z) and Ztrn are not independent of each other since both clearly depend on the training set Sm . The probability of observing the pair (H, Ztrn ) is P(H) · P(Ztrn | H), where by marginalization: X P(Ztrn |H) = P(Sm = sm |H) · P(Ztrn |Sm = sm , H) sm ∈Z m
=
X sm ∈Z m
P(Sm = sm |H) · P(Ztrn |Sm = sm ) (13)
The last line follows because Ztrn and H are conditionally independent of each other given Sm . To simplify notation, we will use P(z | H) for the conditional distribution of training examples given the hypothesis H. That is: P(z | H) ∼ = P(Ztrn = z | H),
(14)
So, we have:
Example 8. At one extreme, if the loss function L(Z) : Z → [0, 1] is chosen independently of the training set Sm , then IP (Ztrn , H) = 0 and we obtain R(L) = Remp (L). Hence, lack of learning is perfectly captured in this model. At the other extreme, if Z is a region in Rn and P(z) is a bounded probability density function over Z, then lazy learners such as in 1-NN classification achieve IP (Ztrn , H) = 1. In the latter case, the empirical risk Remp (L), which can always be made identically zero for some loss function LH (Z), might carry no information whatsoever about the true risk R(L)2 . The previous corollary is restricted to the case in which the loss function L(Z) is itself the learned hypothesis H. This happens, for example, in the classification setting if we seek a separating hyperplane (w, b) for some normal vector w ∈ Rn and offset b ∈ R, and use the loss function Lw,b (x, y) = I{y(wT x − b) ≤ 0} to measure risk. Because (w, b) ↔ Lw,b (·) is a one-to-one mapping, the act of choosing a hypothesis H = (w, b) is equivalent to the act of choosing a loss function H = Lw, b (Z). If the training set Sm influences the choice of the loss function, which in turn is used to measure risk, then the statement of Corollary 1 holds. Next, we generalize the previous result to any arbitrary learning machine as shown in the following theorem. Theorem 1. Given a learning machine L that receives a training set Sm of m i.i.d. examples Z ∼ P(z) and produces a hypothesis H ∈ H(m) , let LH (Z) : Z → [0, 1] be any loss function that satisfies the Markov chain Sm → H → LH (Z). Also, let Ztrn be a random variable whose value is drawn uniformly at random out of Sm with replacement. Then, the true risk R(L) and empirical risk Remp (L) of the learning machine, defined in Eq 2 and Eq 4 respectively, are related by: R(L) − Remp (L) ≤ IP (Ztrn , H) (16) In addition, this is the tightest possible bound.
Remp (L) = EH ESm |H EZtrn ∼Sm L(Ztrn ) = EH EZtrn ∼P(z|H) L(Ztrn )
(15)
The first line is our original definition of empirical risk given earlier in Eq 4, while the second line follows by Eq 13. Now, we employ Lemma 3 to deduce that: R(L) − Remp (L) = EH EZ∼P(z) L(Z) − EH EZ∼P(z|H) L(Z) h i = EH EZ∼P(z) L(Z) − EZ∼P(z|H) L(Z) ≤ EH ||P(z) , P(z|H)||P
= IP (Ztrn , H)
Proof. We know from Corollary 1 that for any loss function LH (Z) whose selection is influenced by the training set Sm , the following inequality holds: R(L) − Remp (L) ≤ IP (Ztrn , LH (Z)) The quantity on the right-hand side is mutual affinity between Ztrn and the choice of the loss function LH (Z). Later in Section V-A, it will be shown using the data-processing inequality that IP (A , B) ≥ IP (A , C) whenever the Markov chain A → B → C holds. Since Sm → H → LH (Z) implies Ztrn → H → LH (Z), we deduce the following bound: IP (Ztrn , LH (Z)) ≤ IP (Ztrn , H),
In the first line, we substituted Eq 12 and 15. In the third line, we employed Lemma 3. The last line uses the fact that the marginal distribution of Ztrn is P(z) by assumption whereas its posterior distribution given H is P(z|H) as stated in Eq 13. In a similar manner, we see by using Lemma 3 that the following lower bound holds:
which holds whenever the Markov chain Sm → H → LH (Z) holds. This establishes the upper bound. To prove tightness, let us consider the following loss function: n o L?H (z) = I P(Ztrn = z) ≥ P(Ztrn = z | H) (17)
R(L) − Remp (L) ≥ −IP (Ztrn , H)
2 The loss function L (Z) that achieves the bound for lazy learners, whose H hypothesis H is itself the entire training set Sm , is LH (Z) = I{Z ∈ / H}. If the observation space is a region in the plane Rn and P(z) is a bounded density function, then R(L) = 1 whereas Remp (L) = 0.
Both bounds imply statement of the corollary.
7
In order to establish tightness of the bound in Eq 16, we first see that the loss function L?H (Z) satisfies both conditions of the theorem; namely that we have ∀z ∈ Z : 0 ≤ L?H (z) ≤ 1 and Sm → H → L?H (Z). The second condition holds because any change to the original training set Sm that does not alter the final hypothesis H will not change the loss function L?H (Z). Now, it is immediate to observe that L?H (Z) achieves the bound. This is because if we measure true and empirical risks of L using L?H (Z), we obtain: R(L) − Remp (L)
= EH EZ∼P(z) L?H (Z) − EH ESm |H EZ∼Sm L?H (Z) h i = EH EZ∼P(z) L?H (Z) − EZ∼P(z|H) L?H (Z) h = EH EZ∼P(z) I{P(Z) ≥ P(Z|H)} i − EZ∼P(z|H) I{P(Z) ≥ P(Z|H)} hX i = EH P(z) − P(z|H) · I{P(z) ≥ P(z|H) z∈Z
= EH ||P(z) , P(z|H)||P
= IP (Ztrn , H)
Again, the second line follows from Eq 13 while the third line follows by construction of L?H (Z). If the inequality in the definition of L?H (Z) in Eq 17 is reversed, we obtain: R(L) − Remp (L) = −IP (Ztrn , H) Hence, the bound is tight. Of course, one does not usually know the original distribution P(z) so computing IP (Ztrn , H) is not always possible. To resolve this issue, we introduce the notion of capacity of learning machines. Definition 5 (Learning Capacity). Let L be a learning machine that receives a training set Sm of m i.i.d. examples Z ∼ P(z) and produces a hypothesis H ∈ H(m) . Let Ztrn be a random variable whose value is drawn uniformly at random out of Sm with replacement. Then, capacity of the learning machine L is defined by: C (m) (L) = supP(z) IP (Ztrn , H), where the supremum is taken over all possible distributions of observations Z. Theorem 2. Let L be a learning machine that receives a training set Sm of m i.i.d. examples Z ∼ P(z) for some unknown distribution P(z) and produces a hypothesis H ∈ H(m) . Also, let LH (Z) : Z → [0, 1] be any loss function that satisfies the Markov chain Sm → H → LH (Z). In addition, let Ztrn be a random variable whose value is drawn uniformly at random out of Sm with replacement. Then, the true risk R(L) and empirical risk Remp (L) of the learning machine, defined in Eq 2 and Eq 4 respectively, are related by: R(L) − Remp (L) ≤ C (m) (L), (18) which holds for any distribution of observations P(z). In addition, this is the tightest possible bound. Proof. This follows by Definition 5 and Theorem 1.
m Remp (L) C (m) (L)
10 0.3780 0.1230
25 0.4194 0.0806
50 0.4426 0.0561
100 0.4613 0.0398
200 0.4712 0.0282
TABLE I F IRST ROW IS EMPIRICAL RISK OF THE LEARNING MACHINE IN E XAMPLE 9, WHICH IS ESTIMATED BY AVERAGING OVER 1,000 REALIZATIONS OF TRAINING ERROR FOR RANDOMLY DRAWN TRAINING SETS . T HE SECOND ROW IS THEORETICAL LEARNING CAPACITY.
m Remp (L) Remp (L) + C (m) (L)
10 0.4460 0.5690
25 0.4811 0.5621
50 0.4928 0.5488
100 0.4947 0.5345
200 0.4966 0.5248
TABLE II I N THIS PROBLEM , THE EMPRIICAL RISK AND THEORETICAL UPPER BOUND ON TRUE RISK , WHEN USING THE LOSS FUNCTION IN E XAMPLE 10. T HE TRUE RISK IS ALWAYS 12 FOR ALL VALUES OF m.
Example 9. Suppose observations Z ∈ {0, 1} are Bernoulli trials with P(Z = 1) = φ, and that our learning machine L summarizes Pm the training set Sm with the empirical average 1 H=m i=1 Zi . Then, the capacity of this learning machine is asymptotically given by C(L) ∼ √21π m . The proof is given in Appendix A. Suppose, in addition, that a classifier predicts either 0 or 1 depending on which label occurs most often in the training set. In other words, we have the loss function: m m LH (Z) = I{Z = 1} · I{H < } + I{Z = 0} · I{H ≥ } 2 2 Here, LH (Z) satisfies the conditions of Theorem 2. When φ = 1/2, Table I shows simulations results for different values of m. As shown in table, the bound |R(L) − Remp (L)| ≤ C (m) (L) holds with equality in this case 3 . Example 10. Suppose we decided to use the same learning machine L in Example 9 but we would like to change our classifier. In this new classifier, we use empirical average on the training set and decide to either predict y = 1 all the time or y = 0 all the time but we choose to do so randomly according to the empirical distribution of the two labels 4 . Let LH (Z) be the prediction error of this classifier. Clearly, LH (Z) satisfies the conditions of Theorem 2. Table II shows empirical risk and the predicted upper bound on true risk for different values of m. Here, both labels are assumed to be equally likely, which means that the true risk of L is always 1 2 . As shown in the table, the bound indeed holds, albeit the bound is slightly loose in this example. However, we knew earlier from Example 9 that a loss function indeed exists, for which the upper bound holds with equality. Example 11 (Randomized Learning Machine). Suppose our observation space is finite |Z| < ∞. Given a training set 3 Precisely, the experiment run as follows. For each value of m, a total of m training bits were generated. Each bit is either ‘0’ or ‘1’ with equal probability. The training error is the fraction of bits that are different from the majority. Expected test error is always 12 . This entire process was, then, repeated 1,000 times and averages are reported. 4 Precisely, we generate m random bits where ‘0’ and ‘1’ are equally likely. In each training set, we compute the sum of observations s, and decide with s probability m to predict ‘1’ all the time.
8
Sm of m i.i.d. observations, let N (z) denote the number of times z ∈ Z is observed in the training set. Suppose we have a learning machine L whose final hypothesis H is a single value H ∈ Z that is selected randomly according to the empirical distribution P(H = z) = N (z)/m. For example, if Z = {1, 2, 3, 4} and the training set is Sm = {1, 2, 1, 1, 3, 3}, then we have P(H = 1|Sm ) = 12 , P(H = 2|Sm ) = 16 , P(H = 3|Sm ) = 13 , and P(H = 4|Sm ) = 0. The capacity of this learning machine is given by: 1 1 C (m) (L) = · 1− m |Z| Proof. The proof is given in Appendix B. The objective of introducing Example 11 is two-fold. First, when |Z| = 2, we see that this problem is quite similar to the previous Bernoulli problem in Example 9 except for the fact that H is now a randomized summary statistic of the training set. The capacity of the two learning machines, however, are quite different. In√the deterministic learning machine, we had C (m) (L) = O(1/ m) whereas we have C (m) (L) = O(1/m) in the randomized learning machine. Intuitively, we know that randomness should decrease capacity. Second, we note that our randomized learning machine in Example 11 is related to the earlier randomized classifier in Example 10. This is because we can equivalently think of the latter classifier as a deterministic classifier that receives a randomized hypothesis H, instead of treating it as a randomized classifier that receives a deterministic hypothesis. With this new equivalent view, we note that the randomized learning machine in Example 11 can be used to bound the difference between true risk and empirical risk in Table II. In fact, the bound now holds with equality. In other words, the difference between empirical and true risks in Table II is equal to the capacity of the learning machine in Example 11. Thus, for the same classifier, one might be able to find a better learning machine that yields tighter bounds. Later, a more insightful interpretation of this fact will be provided when we show that learning machines admit a partial order. Finally, we conclude this section with the following remark. Perhaps, one central goal of any learning algorithm is to guarantee generalization. That is, we would like to ensure that the empirical risk we estimate on a given training set is a valid approximation to the true risk we expect to obtain in the future. This is necessary because any learning algorithm has access to the empirical risk only, which can be minimized if the learning machine has sufficient capacity. The true risk, on the other hand, is inaccessible directly, and one can only minimize it by using a learning machine that generalizes well. Definition 6. A learning machine L generalizes if for all distributions of observations P(z) and all loss functions LH (Z) : Z → [0, 1] that satisfy the Markov chain Sm → H → LH (Z), we have limm→∞ |Remp (L) − R(L)| = 0. Definition 7. A learning machine L has a finite capacity if limm→∞ C (m) (L) = 0. It is important to distinguish learning machines with finite capacity from those with infinite capacity. This is partly due
to the following result: Theorem 3. A learning machine L generalizes if and only if it has a finite capacity. Proof. This follows from the fact that R(L) − Remp (L) ≤ C (m) (L) is achievable for some distribution P(z) and some loss function LH (Z) that satisfies the conditions of Definition 6. Thus, in order for L to generalize, we must have limm→∞ C (m) = 0. The converse also holds. Luckily, most learning machines of interest have finite capacities. In fact, any learning machine with a countable observation space Z has a finite capacity. This follows from the Asymptotic Equipartition Property (AEP) in information theory. In simple terms, for any distribution of observations P(z), the sequence (Z1 , Z2 , . . . , Zm ) becomes progressively closer to a sequence that is unique up to permutation, and this happens as m → ∞. This unique sequence is the one implied by the law of large numbers; i.e. for each z ∈ Z, the fraction of times z appears in the sequence is given by its probability P(z) [11]. Because the learning machines we consider in this paper are always invariant to permutations of training examples, knowledge of the hypothesis H typically yields little information about the training set as m → ∞ because all sufficiently large training sets are nearly identical at the specified limit up to permutation. Such conclusion will be established more formally in Theorem 6, in which we provide the rate of convergence. V. I NTERPRETING L EARNING C APACITY In this section, we provide several interpretations to learning capacity. To do this, we first note that any learning process is influenced by three key components: 1) Observations Z including their space Z and probability distribution P(z). 2) The inference process P(m) (H | Sm ). 3) The hypothesis space H(m) . All three components influence the learning capacity. In particular, if we impose restrictions on any of these three components, we effectively limit the learning capacity. In this section, we explore such possibilities. First, we show that learning capacity is indeed a reasonable measure of quantifying how “much” has been learned out of the training set. In particular, we show using the dataprocessing inequality that the “more” we learn, the larger the learning capacity is. Second, we show that having algorithmic stability in the inference process is equivalent to having a finite learning capacity. Third, we show that if observations are restricted to a countable space Z with a finite effective support set size, then all learning machines have finite capacity. Finally, we explore connections between learning capacity and the hypothesis space H(m) . One, perhaps not quite surprising, result is that all learning machines have finite capacity if size of the hypothesis space is finite. The latter result, proved via information theoretic inequalities, is analogous to wellknown results that have been established in the past using the Probably Approximately Correct (PAC) model.
9
A. Partial Ordering of Learning Machines
Proof. We have by definition:
Earlier in Example 9 and Example 11, we looked into two learning machines that were very similar to each other, yet with drastically different capacities. Let us briefly look into those two learning machines again. In both machines, observations are Bernoulli random variables Z ∈ {0, 1}. The difference between the two learning machines lies in their method of computing their hypothesis H:
IP (A, (B, C)) X =1− min P(A) · P(B, C), P(A, B, C) X =1− P(A = a)
1) The first learning machine LdetP computes the empirical m 1 average of samples Hdet = m i=1 Zi . 2) The second learning machine Lrnd also computes Pm the 1 empirical average of samples Hdet = m i=1 Zi . However, its final hypothesis is Hrnd ∈ {0, 1}, where Hrnd is a Bernoulli random variable with probability of success Hdet . We noted that C(Lrnd ) ≤ C(Ldet ). Why should the latter result hold? In this section, we show that the latter inequality holds because we have the Markov chain Sm → Hdet → Hrnd . In other words, because Hdet is necessarily “more informative” than Hrnd , the learning machine Ldet has a larger learning capacity than Lrnd . To establish this result, we begin with the following lemma: Lemma 4. Let A ∈ A, B ∈ B, and C ∈ C be three random variables. If A → B → C forms a Markov chain, i.e. P(C | A, B) = P(C | B), then: IP (A , (B, C)) = IP (A , B) In other words, because C is conditionally independent of A given B, adding C to B does not create any additional affinity with A. Proof. We have: IP (A, (B, C)) X =1− min P(A) · P(B, C), P(A, B, C) X =1− min P(A) · P(B) · P(C | B), P(A, B) · P(C | A, B) X =1− P(C | B) min P(A) · P(B), P(A, B) X =1− min P(A) · P(B), P(A, B) = ||P(A) · P(B) , P(A, B)||P
= IP (A , B)
Lemma 5 (Information Can’t Hurt). For any random variables A ∈ A, B ∈ B, and C ∈ C, we have: IP (A , (B, C)) ≥ IP (A , B) In other words, adding C to B cannot reduce affinity with A.
a∈A
X
min P(B = b, C = c), P(B = b, C = c|A = a)
b∈B, c∈C
However, the minimum of the sums is always larger than the sum of minimums. That is: X X X min αi , βi ≥ min{αi , βi } i
i
Using marginalization P(x) = inequality, we obtain:
i
P
y
P(x, y) and the above
IP (A, (B, C)) X =1− P(A = a) a∈A
X
min P(B = b, C = c), P(B = b, C = c|A = a)
b∈B, c∈C
≥1−
X
min{P(A = a) P(B = b), P(A = a, B = b)}
a∈A,b∈B
= IP (A , B) Lemma 5 is the analog to the “Information can’t hurt” inequality in information theory. In the context of learning, it simply states that adding more summary statistics about the training set cannot decrease mutual affinity. Thus, the “more” the summary statistics we use, the larger the learning capacity is. Using both lemmas, we arrive at the important data-processing inequality. Lemma 6 (Data Processing Inequality). Suppose we have the Markov chain: Ztrn → H1 → H2 , where Ztrn ∼ P(z). Then, the following inequality holds for any distributions of observations P(z): IP (Ztrn , H1 ) ≥ IP (Ztrn , H2 ) Proof. We have by Lemma 4 and Lemma 5: IP (Ztrn , H1 ) = IP (Ztrn , (H1 , H2 )) ≥ IP (Ztrn , H2 ) The statement that manipulation hurts information has manifested in many contexts. In information theory, manipulation leads to loss of mutual information, and hence decreases the capacity of communication channels [11]. In Bayesian decision theory, manipulation leads to loss of information and hence reduces the optimal Bayes rate in classification [16]. In our context, manipulation leads to loss of information, and hence decreases the capacity of learning machines. Decreasing
10
0.14
is indeed both necessary and sufficient for generalization of learning machines. The notion of capacity or mutual affinity of a learning machine L is intimately tied to the notion of stability. First, let us recall the inequality: R(L) − Remp (L) ≤ IP (Ztrn , H)
L2 (m=11) L1 (m=11) L2 (m=51) L1 (m=51)
0.12
Distance
0.1 0.08 0.06 0.04 0.02 0 ï0.02 0
0.1
0.2
0.3
0.4
0.5
Theta
0.6
0.7
0.8
0.9
1
Fig. 3. An illustration to a partial order of learning machines. Here, L2 ⊆ L1 , where the two machines are defined in Example 12. As shown in the figure, the inequality IP (Ztrn , H2 ) ≤ IP (Ztrn , H1 ) always hold.
For a given distribution P(z), this is the tightest possible bound because there exists some loss functions LH (Z) : Z → [0, 1] that satisfy the Markov chain Sm → H → LH (Z) and achieve the bound. To see how this is tied to the notion of stability, define: s(m) (L) = EZtrn hP(m) (H), P(m) (H | Ztrn )iP X = P(Ztrn = z) · hP(m) (H), P(m) (H | Ztrn = z)iP z∈Z
the capacity of learning machines, however, is not always disadvantageous since it can mitigate overfitting. As suggested earlier, the data processing inequality yields an insightful notion of partial ordering of learning machines. Definition 8 (Subsets and Supersets). Suppose we have two learning machines: L1 that produces a hypothesis H1 and L2 that produces H2 over the same observation space Z. We say that L2 is a subset of L1 , denoted L2 ⊆ L1 , if behavior of the learning machine L2 can be simulated completely by L1 . Mathematically, we have L2 ⊆ L1 if the Markov chain Sm → H1 → H2 holds. Theorem 4. If for two learning machines L1 and L2 we have L2 ⊆ L1 , then: C (m) (L2 ) ≤ C (m) (L1 ),
for all m ≥ 1
Proof. By the data processing inequality (Lemma 6) and by definition of capacity (Definition 5). Example 12. Returning again to our earlier example, where observations Z ∈ {0, 1} are Bernoulli trials with probability of success φ, suppose one learning machine L1 P computes m 1 the empirical average of observations H1 = m i=1 Zi . Also, suppose a second learning machine L2 only reports the label that occurs most often in the training set. That is, H2 = I{H1 ≥ m 2 }. Clearly, L2 ⊆ L1 . Figure 3 shows mutual affinity IP (Ztrn , H1 ) and IP (Ztrn , H2 ) for m = 11 and m = 51 and different values of φ ∈ (0, 1). As shown in the figure, the inequality IP (Ztrn , H2 ) ≤ IP (Ztrn , H1 ) always hold. Proof. The proof is given in Appendix C. B. Learning Capacity and Stability Algorithmic stability analysis was popularized in the last decade in learning theory. Because stability is a property of the learning machine itself rather than the hypothesis space, it is widely applicable to a broad class of machine learning algorithms. For this reason, stability has been proposed as a key condition for learnability and generalization [6], [17], [18]. In this section, we show that an appropriate notion of stability
(19) Here, s(m) (L) is a measure of how insensitive the learning machine L is to a single training example, on average, when we have m examples in the training set. Specifically, P(m) (H | Ztrn ) is the probability distribution of the hypothesis H given a single fixed training example Ztrn and expectation is taken over all possible single training examples. If the learning machine is stable, then P(m) (H) should be very close to P(m) (H | Ztrn ) in distance and s(m) (L) ≈ 1. In this case, a single training example Ztrn does not make a “big” difference to the distribution of the inferred hypothesis H. If we define stability of the learning machine L using: S (m) (L) = inf s(m) (L),
(20)
P(z)
where the infimum is taken over all possible distributions of observations Z, then S (m) (L) characterizes a distribution-free stability of L. However, we have by definition: C (m) (L) = 1 − S (m) (L) Thus, the capacity of a learning machine is inversely related to its algorithmic stability. Because in order for a learning machine to generalize (see Definition 6) it is both necessary and sufficient that it has a finite capacity, we conclude that stability as defined in Eq 20 is also both necessary and sufficient. We emphasize again that such result holds for any learning machine including unsupervised learning algorithms. Definition 9. A learning limm→∞ S (m) (L) = 1
machine
L
is
stable
if
In other words, a learning machine is stable if the impact of a single observation becomes more and more negligible as size of the training set increases. Theorem 5. A learning machine L generalizes if and only if it is algorithmically stable. Proof. By Definition 9 and Theorem 3. Corollary 2. Suppose a learning machine L is supplied with a training set Sm that consists of m i.i.d. training examples 0 Z ∼ P(z), and let H be the inferred hypothesis. Then, let Sm
11
be a new training set with m i.i.d. observations drawn also from P(z), and let H 0 be the new hypothesis. Then:
Writing = | 12 − φ| and using both Corollary 2 and Hoeffding’s inequality [19]:
s(m) (L) ≥ P(H = H 0 )
IP (Ztrn , H) ≤ P(H 6= H 0 ) ≤ 2 exp − 2 m 2 1 2 = 2 exp − 2m − φ 2 Clearly, this is a simple method of establishing that mutual affinity decreases exponentially fast when φ 6= 21 , i.e. the two classes are not equally likely.
(21)
Proof. Using the infinite product representation of the distance function ||P1 , P2 ||P in Eq 6, we know that: X hP1 (z), P2 (z)iP ≥ P1 (z) · P2 (z) (22) z∈Z
Now, we write by definition of stability and by Eq 22: s(m) (L) = EZtrn hP(m) (H), P(m) (H|Ztrn )iP X ≥ EZtrn P(m) (H = h) · P(m) (H = h|Ztrn ) h∈H(m)
= EZtrn
X
0 0 EZtrn P(m) (H = h|Ztrn ) P(m) (H = h|Ztrn )
h∈H(m) 0 = EZtrn EZtrn
X
0 P(m) (H = h|Ztrn ) P(m) (H = h|Ztrn )
h∈H(m)
The last line states the following. First, we fix a single training example Ztrn and draw all remaining m−1 training examples i.i.d. from P(z) and let H be the inferred hypothesis. After that, we perform a second trial, in which we fix a new training 0 and let H 0 be the new hypothesis. Then: example Ztrn 0 0 s(m) (L) = EZtrn , Ztrn P(H = H 0 | Ztrn , Ztrn )
= P(H = H 0 ),
where the second line follows by marginalization. Example 13. If we return again to the Bernoulli problem, where we have binary observations Z ∈ {0, 1} that are drawn from a Bernoulli distribution with parameter P(Z = 1) = φ. Let L be the learning machine that produces the label the occurs most often in the training set. It was shown earlier in Example 12 that capacity of this learning machine is given by (see Appendix C): C(L) ∼ √
1 2πm
Capacity (maximum mutual affinity) occurs when φ = 12 . For arbitrary values of φ, mutual affinity is, in general, quite involved and is given in Appendix C. Instead of dealing with the exact expression of mutual affinity, we would like to draw qualitative results using stability analysis. Using Corollary 0 2, we note that if we draw two training sets Sm and Sm independently, the probability we obtain different hypotheses is: P(H 6= H 0 ) = 2 P(H = 0) · P(H = 1) m/2 X m =2 φk (1 − φ)m−k k k=0 m X m φk (1 − φ)m−k k k=m/2+1 ≤ 2 min P(H = 0), P(H = 1)
C. Learning Capacity and Observations In the previous section, we looked into different interpretations of how the learning process influences the learning capacity. In this section, we look into observations Z and the role of the effective support set size of P(z). Earlier, it was stated that learning machines could be partially ordered, where L2 ⊆ L1 implied that C(L2 ) ≤ C(L1 ). In particular, if L? is a lazy learner, then C(L) ≤ C(L? ) for all learning machines L. To reiterate, a lazy learner returns the training set itself as a hypothesis H. Next, we show that a lazy learner in a countable observation space actually has a finite capacity. Theorem 6 (The Square-Root Law). If observations Z ∈ Z are drawn i.i.d. from a probability distribution P(z) with finite effective support set size, then the following asymptotic bound on capacity holds for any learning machine L 5 : r r Ess [P(z)] − 1 |Z| − 1 (m) ≤ (23) C (L) ≤ 2πm 2 πm In addition, the lazy learner L? achieves the bound. Proof. This can be proved by deducing capacity of the lazy learner L? that is described earlier. The detailed proof is given in Appendix D. Intuitively, Theorem 6 states that in order to have good generalization that holds for any possible learning machine, the average number of training examples per each possible observation must be sufficiently large. For multiclass classification problems where Z = (X, Y ) ∈ X × Y, we have the following corollary: Corollary 3. Suppose observations consist of attributes plus labels, i.e. Z = (X, Y ) ∈ X × Y, where |X | × |Y| < ∞, and our learning machine produces a hypothesis H, which is a classifier that predicts class label Y given X. Let LH (Y, X) = I{Y 6= H(X)} be the misclassification error. Also, let H = {h(x) : X → Y} be the set of all possible hypotheses (classifiers). Then, the difference between empirical risk and true risk for any possible learning machine L is asymptotically bounded by: r |X | × |Y| − 1 (24) R(L) − Remp (L) ≤ 2πm s |Y| × log|Y| |H| − 1 = (25) 2πm √ we have an additional term o(1/ m). However, such term is negligible and the bound becomes arbitrarily tight in ratio as m → ∞. 5 Here,
12
Proof. We have Z = X × Y. Moreover: |H| = |Y||X |
→
|X | = log|Y| |H|
Plugging these expressions into Theorem 6 yields the desired result.
we note that: I(Sm , H) = H(Sm ) − H(Sm | H) m h i X = H(Zi ) − H(Z1 |H) + H(Z2 |Z1 , H) + · · · i=1
≥ Corollary 3 is quite similar to well-known results obtained using PAC model for binary classification problems. We will derive similar results later using information-theoretic bounds. It is perhaps worthwhile to point out that Theorem 6 can be interpreted as one additional formal justification to dimensionality reduction methods such as feature selection and Principal Component Analysis (PCA) because reducing the effective support set size of observations helps improve generalization.
m X i=1
[H(Zi ) − H(Zi | H)]
= m [H(Ztrn ) − H(Ztrn | H)] = m I(Ztrn , H)
Here, the inequality follows because H(A|B) ≤ H(A) for any random variables A and B. The fourth line follows because we always assume that the learning process is invariant to permutation of the training set. Thus, we obtain: I(Sm , H) m Combining this with Corollary 4 yields the desired result. I(Ztrn , H) ≤
D. Learning Capacity and Size of the Hypothesis Space Finally, we look into the role of the hypothesis space and how it influences learning capacity. So far, we have noted the apparent similarity between information theory and the learning theory proposed in this paper; in the sense that many quantities and results have analogs in both fields. There is, in addition, one concrete result that ties both fields together [11]:
Corollary 5. If H(m) is a countable space and H is the inferred hypothesis, then the following bound holds for all learning machines: r r H(H) log |H(m) | (m) C (L) ≤ ≤ , 2m 2m where H is the Shannon entropy.
Theorem 7 (Pinsker’s Inequaity). For any two probability distributions P1 (z) and P2 (z), we have:
Proof. Because for any random variables A ∈ A and B ∈ B, we have I(A, B) ≤ H(A), and H(A) ≤ log |A|.
r ||P1 , P2 ||P ≤
D(P1 || P2 ) , 2
where D(P1 || P2 ) is the Kullback-Leibler divergence measured in nats (i.e. using natural logarithms). Many connections can be immediately deduced using Pinsker’s inequality. For example, we have the following corollary: Corollary 4. For any two random variables X and Y , we have: r I(X, Y ) IP (X , Y ) ≤ , (26) 2 where IP (X , Y ) is mutual affinity while I(X, Y ) is mutual information between X and Y . Using Corollary 4, we obtain the following bound on capacity that holds for any learning machine L: Theorem 8. Suppose we have a learning machine L that receives a training set Sm = {Z1 , . . . , Zm } and produces a hypothesis H ∈ H(m) . Then the following bound holds: r IP (Ztrn , H) ≤
I(Sm , H) 2m
Proof. We will write H to denote the Shannon entropy. First,
Corollary 5 generalizes the well-known PAC result on the finite hypothesis space [20]. In fact, the bound in Corollary 5 is tighter since log |H(m) | is now replaced with entropy of the hypothesis H. From Corollary 5, we can effortlessly deduce the following bound: Corollary 6. If we have a finite observation space |Z| < ∞, then the following bound on capacity holds for any learning machine L: r |Z| · log (1 + m) (m) C (L) ≤ , (27) 2m which is consistent with Theorem 6. Proof. In the language of information theory, using the method of types to be more specific, the discrete lazy learner L? produces the hypothesis H = T [Sm ], which is the type of the training set Sm . Here, the type of a training set is its empirical probability mass function. However, it is well-known that the number of possible types given m training examples is always bounded by H(m) ≤ (1 + m)|Z| [11]. Combining this with Corollary 5 yields the desired result. The reason behind introducing last corollary is to illustrate one scenario where information theory simplifies analysis in learning theory. Originally, Theorem 6 provided us with the tightest possible bound that is achievable by the discrete lazy learner L? . However, its proof is rather involved and is combinatorial in nature. By contrast, the proof of last corollary is quite simple, albeit at a cost of obtaining a slightly looser bound.
VI. CONCLUSIONS

This paper proposes a new mathematical theory of learning. Unlike earlier approaches, the theory presented here does not treat learning as a problem of convergence of random variables to their means and does not rely on concentration inequalities. The theory enjoys many advantages. First, it ties the mathematical notion of learning to the mathematical notion of information. For example, mutual affinity in learning theory is quite similar to mutual information, capacity of learning machines is analogous to capacity of communication channels, and the asymptotic equipartition property (AEP) as well as the data-processing inequality both play key roles in the two theories. Second, the bounds obtained through this theory are the tightest possible bounds. Third, the theory follows Vapnik's General Setting of Learning, which is a unified approach towards analyzing many learning tasks, including supervised and unsupervised learning algorithms.

Perhaps the best way to conclude this paper is to summarize the different interpretations of learning capacity that have been deduced so far. We have the following results:
1) The capacity of a learning machine is a measure of the maximum difference between empirical and true risks R(L) − Remp(L). Because the bounds are tight, a learning machine generalizes if and only if it has a finite capacity.
2) The capacity of a learning machine is a measure that quantifies how much is expected to be learned from the training set. Hence, adding more summary statistics increases the capacity of the learning machine. If one learning machine L1 can be completely simulated by a second learning machine L2, i.e. L2 is necessarily more informative than L1, then we have C(L2) ≥ C(L1).
3) The capacity of a learning machine is a measure of its algorithmic instability. Learning machines whose inferred hypothesis H is heavily perturbed by a change in a single training example have a higher capacity. Moreover, a learning machine generalizes if and only if it is stable.
4) The capacity of a learning machine is limited by the effective support set size of observations. If observations have a finite effective support set size, then sufficiently large training sets will effectively exhaust the space of possible observations, and all learning machines generalize as a result.
5) A learning machine is limited by the size of its hypothesis space H^(m). If the hypothesis space H^(m) is finite in size, then all learning machines have finite capacity that grows only logarithmically with |H^(m)|.

REFERENCES
[1] V. N. Vapnik, "An overview of statistical learning theory," IEEE Transactions on Neural Networks, vol. 10, no. 5, September 1999.
[2] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Learnability and the Vapnik-Chervonenkis dimension," Journal of the ACM (JACM), vol. 36, no. 4, pp. 929–965, 1989.
[3] M. Talagrand, "Majorizing measures: the generic chaining," The Annals of Probability, vol. 24, no. 3, pp. 1049–1103, 1996.
[4] D. A. McAllester, "Some PAC-Bayesian theorems," Machine Learning, vol. 37, pp. 355–363, 1999.
[5] ——, "PAC-Bayesian stochastic model selection," Machine Learning, vol. 51, pp. 5–21, 2003.
[6] O. Bousquet and A. Elisseeff, "Stability and generalization," The Journal of Machine Learning Research (JMLR), vol. 2, pp. 499–526, 2002.
[7] P. L. Bartlett and S. Mendelson, "Rademacher and gaussian complexities: Risk bounds and structural results," The Journal of Machine Learning Research (JMLR), vol. 3, pp. 463–482, 2002.
[8] J.-Y. Audibert and O. Bousquet, "Combining PAC-Bayesian and generic chaining bounds," The Journal of Machine Learning Research (JMLR), vol. 8, pp. 863–889, 2007.
[9] H. Xu and S. Mannor, "Robustness and generalization," Machine Learning, vol. 86, no. 3, pp. 391–423, 2012.
[10] M.-F. Balcan and A. Blum, "A PAC-style model for learning from labeled and unlabeled data," Learning Theory, vol. 3559, pp. 111–126, 2005.
[11] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley & Sons, 1991.
[12] B. Reiser and D. Faraggi, "Confidence intervals for the overlapping coefficient: the normal equal variance case," The Statistician, vol. 48, no. 3, pp. 413–418, 1999.
[13] D. C. Knill and A. Pouget, "The bayesian brain: the role of uncertainty in neural coding and computation," TRENDS in Neurosciences, vol. 27, no. 12, pp. 712–719, 2004.
[14] K. Friston, "The free-energy principle: a unified brain theory?" Nature Reviews Neuroscience, vol. 11, no. 2, pp. 127–138, 2010.
[15] G. Huang, "Is this a unified theory of the brain," New Scientist, vol. 2658, pp. 30–33, 2008.
[16] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, 1996.
[17] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi, "General conditions for predictivity in learning theory," Nature, vol. 428, pp. 419–422, 2004.
[18] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, "Learnability, stability and uniform convergence," The Journal of Machine Learning Research (JMLR), vol. 11, pp. 2635–2670, 2010.
[19] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.
[20] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning from Data, 2012.
[21] E. W. Weisstein, "Binomial distribution." [Online] From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/BinomialDistribution.html, 2013, accessed: 2013-06-30.
APPENDIX A
PROOF OF EXAMPLE 9

First, H has a binomial distribution:

P(H = k/m) = (m choose k) φ^k (1 − φ)^{m−k}

We use the identity:

I_P(Ztrn, H) = 1 − Σ_{h ∈ H^(m)} P(H = h) Σ_{z ∈ Z} min{ P(Ztrn = z), P(Ztrn = z | H = h) }
             = Σ_{h ∈ H^(m)} P(H = h) · ||P(Ztrn) , P(Ztrn | H = h)||_P

However, P(Ztrn) is a Bernoulli distribution with probability of success φ, while P(Ztrn | H = h) is Bernoulli with probability of success h. Knowing that the distance between two such Bernoulli distributions is given by |φ − h|, we obtain:

I_P(Ztrn, H) = Σ_{k=0}^m (m choose k) φ^k (1 − φ)^{m−k} |φ − k/m|    (28)

This is the mean deviation of the binomial distribution. Assuming φm is an integer, the above expression is given by [21]:

MD = (2/m) (1 + mφ) (m choose mφ+1) φ^{1+mφ} (1 − φ)^{(1−φ)m}    (29)

The maximum mutual affinity is achieved when φ = 1/2. This gives us:

C^(m)(L) = m! / (2^{m+1} ((m/2)!)^2) ∼ 1/sqrt(2πm),

where in the last step we used Stirling's approximation.
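The derivation above can be checked numerically. The sketch below (illustrative, not part of the paper) evaluates Eq. (28) directly, compares it with the closed form (29) at φ = 1/2, and with the asymptotic value 1/sqrt(2πm):

```python
from math import comb, sqrt, pi

def affinity_eq28(m, phi):
    # Eq. (28): E|phi - K/m| for K ~ Binomial(m, phi).
    return sum(comb(m, k) * phi**k * (1 - phi)**(m - k) * abs(phi - k / m)
               for k in range(m + 1))

def affinity_eq29(m, phi):
    # Closed form (29); valid when m * phi is an integer.
    v = round(m * phi)
    return (2 / m) * (1 + v) * comb(m, 1 + v) * phi**(1 + v) * (1 - phi)**(m - v)

phi = 0.5
for m in (10, 100, 1000):
    print(f"m={m:5d}  Eq. (28): {affinity_eq28(m, phi):.5f}  "
          f"Eq. (29): {affinity_eq29(m, phi):.5f}  1/sqrt(2*pi*m): {1 / sqrt(2 * pi * m):.5f}")
```

Already at m = 10 the three values agree to within a few percent.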
APPENDIX B
PROOF OF EXAMPLE 11

We will take the extreme case where all observations in Z are equally likely. Intuitively, this corresponds to the most difficult distribution to learn. Then, we have by symmetry:

P(H = z) = 1/|Z|

Since P(H = z) = P(Ztrn = z), we have by Bayes' rule:

P(Ztrn | H) = P(H | Ztrn)

However, given a single random draw of a training example Ztrn out of Sm, the probability of eventually selecting a label H = z′ depends on two cases:

P(H = z′ | Ztrn = z) = Q if z′ = z, and R if z′ ≠ z.

Of course, we have Q + (|Z| − 1) R = 1. To find Q, we use the definition of L:

Q = (1/m) · (1 + (m − 1)/|Z|) = 1/m + (1/|Z|) · (m − 1)/m

Note that we used the fact that the learning machine is randomized in deriving the above expression for Q. So, to satisfy Q + (|Z| − 1) R = 1, we have:

R = (1/|Z|) · (m − 1)/m

Now, we are ready to find the desired expression:

P(Ztrn = z | H = z′) = P(H = z′ | Ztrn = z)
= I{z = z′} · Q + I{z ≠ z′} · R
= I{z = z′}/m + (m − 1)/(m |Z|)
= (I{z = z′} − 1/|Z|) · (1/m) + 1/|Z|

So, the joint distribution of H and Ztrn is:

P(H = z′, Ztrn = z) = P(H = z′) · P(Ztrn = z | H = z′)
= 1/|Z|^2 + (1/|Z|) (I{z = z′} − 1/|Z|) (1/m)
= P(H = z′) P(Ztrn = z) + (1/|Z|) (1/m) (I{z = z′} − 1/|Z|)

Since |Z| > 1, we see from the last expression that:

P(H, Ztrn) > P(H) · P(Ztrn)  if and only if  H = Ztrn

Hence, the mutual affinity is given by:

I_P(Ztrn, H) = ||P(Ztrn) · P(H) , P(Ztrn, H)||_P
= 1 − Σ_{z, z′ ∈ Z} min{ P(H = z′, Ztrn = z), P(H = z′) · P(Ztrn = z) }
= 1 − Σ_{z, z′ ∈ Z} [ P(H = z′) · P(Ztrn = z) + min{ 0, (I{z = z′} − 1/|Z|) · 1/(m |Z|) } ]
= − Σ_{z, z′ ∈ Z} min{ 0, (I{z = z′} − 1/|Z|) · 1/(m |Z|) }
= (|Z| − 1)/(m |Z|) = (1/m) (1 − 1/|Z|)
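This closed form can be confirmed by exhaustive enumeration. The sketch below is illustrative only; it assumes, consistently with the derivation of Q above, that the learning machine of Example 11 outputs a uniformly random training example and that Ztrn is an independent uniform draw from the same Sm:

```python
import itertools
import numpy as np

def affinity_random_pick_learner(Z, m):
    # Exact joint pmf of (H, Ztrn) when H is a uniformly random training example
    # and Ztrn is an independent uniform draw from the same training set Sm.
    joint = np.zeros((Z, Z))
    for sample in itertools.product(range(Z), repeat=m):
        freq = np.bincount(sample, minlength=Z) / m
        joint += np.outer(freq, freq) / Z**m      # each training set has probability |Z|^(-m)
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    return 0.5 * float(np.abs(joint - product).sum())   # mutual affinity = TV(joint, product)

for Z, m in [(2, 3), (3, 4), (4, 5)]:
    print(f"|Z|={Z}, m={m}: enumeration={affinity_random_pick_learner(Z, m):.6f}, "
          f"(1/m)(1 - 1/|Z|)={(1 / m) * (1 - 1 / Z):.6f}")
```

The enumerated affinity matches (1/m)(1 − 1/|Z|) exactly for every small case tested.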
APPENDIX C
PROOF OF EXAMPLE 12

For the first learning machine, we have already shown in Example 9 that:

I_P(Ztrn, H1) = Σ_{k=0}^m (m choose k) φ^k (1 − φ)^{m−k} |φ − k/m|

For the second learning machine L2, we will assume that m is odd. Then, the probability that H2 = 0 is given by:

P(H2 = 0) = Σ_{k=0}^{(m−1)/2} (m choose k) φ^k (1 − φ)^{m−k}

Knowing H2, the marginal distribution of training examples is given by:

P(Ztrn = 1 | H2 = 0) = [ Σ_{k=0}^{(m−1)/2} (k/m) (m choose k) φ^k (1 − φ)^{m−k} ] / P(H2 = 0)

On the other hand:

P(Ztrn = 1 | H2 = 1) = [ Σ_{k=(m+1)/2}^{m} (k/m) (m choose k) φ^k (1 − φ)^{m−k} ] / P(H2 = 1)

The mutual affinity is given by:

I_P(Ztrn, H2) = E_{H2} |φ − P(Ztrn = 1 | H2)|
= P(H2 = 0) · |φ − P(Ztrn = 1 | H2 = 0)| + P(H2 = 1) · |φ − P(Ztrn = 1 | H2 = 1)|
= | Σ_{k=0}^{(m−1)/2} (m choose k) φ^k (1 − φ)^{m−k} (φ − k/m) | + | Σ_{k=(m+1)/2}^{m} (m choose k) φ^k (1 − φ)^{m−k} (φ − k/m) |
This is the expression used in Figure 3.
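For completeness, the sketch below (illustrative only; it reads H2 as the indicator of whether the empirical average exceeds 1/2, per the probabilities above, and the choice φ = 0.3 is arbitrary) evaluates both affinities so that L1 and L2 can be compared numerically, as in Figure 3:

```python
from math import comb

def affinity_H1(m, phi):
    # Example 9: L1 outputs the empirical average H1 = K/m.
    return sum(comb(m, k) * phi**k * (1 - phi)**(m - k) * abs(phi - k / m)
               for k in range(m + 1))

def affinity_H2(m, phi):
    # Example 12: L2 outputs H2 = I{empirical average > 1/2}; m is assumed odd.
    pmf = [comb(m, k) * phi**k * (1 - phi)**(m - k) for k in range(m + 1)]
    low = sum(pmf[k] * (phi - k / m) for k in range((m - 1) // 2 + 1))
    high = sum(pmf[k] * (phi - k / m) for k in range((m + 1) // 2, m + 1))
    return abs(low) + abs(high)

phi = 0.3
for m in (5, 25, 101):
    print(f"m={m:4d}  I_P(Ztrn, H1)={affinity_H1(m, phi):.5f}  I_P(Ztrn, H2)={affinity_H2(m, phi):.5f}")
```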
APPENDIX D
PROOF OF THEOREM 6

First, suppose observations have a finite support |Z| < ∞. To simplify notation, we will assume without loss of generality that Z = {1, 2, . . . , |Z|}. For a lazy learner L*, we note that its hypothesis H is itself the entire training set Sm up to permutation. Let mi be the number of times i ∈ Z was observed in the training set, and let pi = P(Z = i). Then, we have:

P(H) = P(Sm) = (m choose m1, m2, . . . , m_|Z|) p1^{m1} p2^{m2} · · · p_|Z|^{m_|Z|}

Here, (m choose m1, . . . , m_|Z|) is the multinomial coefficient. For now, assume that maximum affinity is attained at the uniform distribution (this will be established formally using the effective support set bound proved later). Letting pi = 1/|Z|, we obtain:

P(H) = (1/|Z|^m) (m choose m1, m2, . . . , m_|Z|)

Using the identity ||p , q||_P = (1/2) ||p − q||_1, we obtain:

C(L*) = E_H ||P(z) , P(z | H)||_P
= (1/2) (1/|Z|^m) Σ_{m1+...+m_|Z| = m} (m choose m1, . . . , m_|Z|) Σ_{k=1}^{|Z|} |mk/m − 1/|Z||
= (1/2) (1/|Z|^{m−1}) Σ_{m1+...+m_|Z| = m} (m choose m1, . . . , m_|Z|) |m1/m − 1/|Z||

The second line follows by symmetry. We can simplify further:

C(L*) = (1/2) (1/|Z|^{m−1}) Σ_{k=0}^m (m choose k) (|Z| − 1)^{m−k} |k/m − 1/|Z||
= (|Z|/2) Σ_{k=0}^m (m choose k) (1/|Z|)^k (1 − 1/|Z|)^{m−k} |k/m − 1/|Z||

The last manipulation is intended to place the expression in a binomial-distribution form. The expression is identical to the one derived earlier in Example 9, where we had |Z| = 2. Again, the quantity inside the summation is the mean deviation. Using Eq. (29) and simplifying yields:

C(L*) ∼ sqrt( (|Z| − 1) / (2πm) )

In addition, the asymptotic relation is tight, in the sense that the ratio of the two terms goes to unity as m → ∞.

In the general case where P(z) has a finite effective support set size, the proof is quite similar to the above approach. Here, we note that for a given alphabet Z = (1, 2, . . .) and a fixed distribution pi = P(Z = i), we have:

I_P(Ztrn, H) = (1/2) Σ_k Σ_{m1+m2+... = m} (m choose m1, m2, . . .) p1^{m1} p2^{m2} · · · |mk/m − pk|

For the inner summation, we write:

Σ_{m1+m2+... = m} (m choose m1, m2, . . .) p1^{m1} p2^{m2} · · · |mk/m − pk|
= Σ_{s=0}^m |s/m − pk| (m choose s) pk^s Σ_{m1+...+m_{k−1}+m_{k+1}+... = m−s} (m−s choose m1, . . . , m_{k−1}, m_{k+1}, . . .) p1^{m1} · · · p_{k−1}^{m_{k−1}} p_{k+1}^{m_{k+1}} · · ·
= Σ_{s=0}^m |s/m − pk| (m choose s) pk^s (1 − pk)^{m−s}

In the last step, we used the multinomial series. Using the expression for the mean deviation of the binomial random variable and summing over all k, we obtain the desired result.
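The exact expression and the asymptotic rate can be compared numerically. The sketch below (illustrative, not part of the paper) evaluates C(L*) for the uniform distribution via the binomial form derived above, using log-space arithmetic to keep large m tractable, and compares it with sqrt((|Z| − 1)/(2πm)):

```python
from math import lgamma, exp, log, sqrt, pi

def binom_pmf(m, k, p):
    # Binomial pmf computed in log space to avoid overflow for large m.
    log_pmf = (lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)
               + k * log(p) + (m - k) * log(1 - p))
    return exp(log_pmf)

def lazy_learner_capacity(Z, m):
    # Binomial form derived above: C(L*) = (|Z|/2) * E|K/m - 1/|Z||, K ~ Binomial(m, 1/|Z|).
    p = 1 / Z
    return (Z / 2) * sum(binom_pmf(m, k, p) * abs(k / m - p) for k in range(m + 1))

for Z in (2, 5, 20):
    for m in (100, 1000, 10000):
        print(f"|Z|={Z:3d}, m={m:6d}  exact: {lazy_learner_capacity(Z, m):.5f}  "
              f"sqrt((|Z|-1)/(2*pi*m)): {sqrt((Z - 1) / (2 * pi * m)):.5f}")
```

The ratio of the two columns approaches one as m grows, in line with the tightness claim above.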