SFI WORKING PAPER: 1992-03-014
THE RELATIONSHIP BETWEEN THE "STATISTICAL MECHANICS" SUPERVISED LEARNING FRAMEWORK AND PAC
by David H. Wolpert
The Santa Fe Institute, 1660 Old Pecos Trail, Suite A, Santa Fe, NM 87501 ([email protected])
Abstract: This paper uses an over-arching supervised learning formalism recently devised by the author to explicitly exhibit the intimate connection between the statistical mechanics supervised learning framework and PAC. More precisely, this paper shows that when viewed in terms of that formalism, in all but one of the definitions in their foundations the "exhaustive learning" statistical mechanics scenario (involving the "Gibbs learning algorithm") and PAC are identical. There are a number of ways that this close relationship between the two frameworks manifests itself; e.g., both frameworks derive expressions for the generalization error which involve (1 − ε)^m. This paper goes on to show how many of the features of PAC (e.g., the fact that PAC is an average-case rather than a worst-case analysis over training sets) become overt and immediate when PAC is viewed in terms of the author's formalism. This paper cursorily investigates those features from the perspective of the author's formalism. This paper also shows that 1) appropriately restated, PAC results apply when one varies the target function and keeps the hypothesis function fixed (the conventional PAC framework holds the target function fixed and varies the hypothesis function); 2) although as conventionally presented the results of PAC are distribution-free whereas those of the statistical mechanics framework are not, in fact the results of both frameworks can be presented in either a distribution-free or a non-distribution-free manner; 3) one can analyze a "statistical mechanics" version of PAC and derive δ ≤ (1 − ε)^m × r^n, r being the size of the output space and n the size of the input space.
...concerned with such distributions where h is fixed.

In PAC, the four quantities are as follows.

1) The sampling assumption is noise-free i.i.d. sampling, i.e., P(θ | f) ∝ δ(θ ⊂ f) × Π_{i=1}^{m} π(θ_X(i)), where π(.) is sometimes called the "sampling distribution" and is usually unknown. (The proportionality constant in this formula for P(θ | f) is set by normalization.)

2) PAC also uses an i.i.d. error function: Er(f, h, θ) ≡ Σ_{x∈X} π(x) × [1 − δ(f(x), h(x))], where π(.) is the same distribution used to define the sampling assumption, and the δ function here is a Kronecker delta. (This error function gives the average (according to π(.)) number of disagreements between a hypothesis function h and a target function f.)

3) The conditional probability of interest to PAC is P(Er(f, h, θ) = E | f, m), where f is unknown, but fixed.

4) Finally, PAC confines its attention to generalizers which meet the following restrictions: i) There is some "concept class" H consisting of |H| functions such that for all θ and all h ∉ H, P(h | θ) = 0; ii) Although f is not known, it is assumed that for all θ sampled from f with no noise, P(h | θ) reproduces θ perfectly. In other words, for all such θ, P(h | θ) = P(h | θ) × δ(θ ⊂ h). This second restriction means that, rigorously speaking, it is not true that we know nothing about f; we know (or assume) that f ∈ H. (To see this, let θ = f, i.e., let θ be a training set sampled (with no noise) from f which has θ_X = X, and then apply restriction (ii).)

The choices for the four quantities implicit in PAC are by no means unique; there are many other ways of setting these four quantities. This is obviously true for the choice of generalizer, but it also holds for the remaining three quantities. For example, if one were interested in generalization behavior as a function of m' (the number of distinct pairs in θ) rather than m, one might want to use a sampling assumption which doesn't allow duplicates in θ. Similarly, if one were interested in generalization behavior for points outside of the training set, one might want to use an error function like Er(f, h, θ) ≡ Σ_{x∉θ_X} π(x) × [1 − δ(f(x), h(x))] / Σ_{x∉θ_X} π(x). There are also a number of different possible choices for the conditional probability of interest. For example, one might want to fix h rather than f; this is what is done (implicitly) in conventional Bayesian analysis. As another example, one might want to fix neither h nor f, and therefore investigate a distribution like P(E | θ).
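To make these four choices concrete, here is a minimal Python sketch (the finite input space X, output space Y, and sampling distribution π(.) below are hypothetical, chosen purely for illustration): it draws a noise-free i.i.d. training set θ from a target f, and evaluates both the i.i.d. error function and the off-training-set variant just mentioned.

    import random

    # Hypothetical toy problem, purely for illustration.
    X = list(range(8))                        # input space
    Y = [0, 1]                                # output space
    pi = {x: 1.0 / len(X) for x in X}         # sampling distribution pi(.)

    def sample_training_set(f, m):
        """Noise-free i.i.d. sampling: P(theta | f) proportional to
        delta(theta subset f) * prod_i pi(theta_X(i))."""
        xs = random.choices(X, weights=[pi[x] for x in X], k=m)
        return [(x, f[x]) for x in xs]

    def iid_error(f, h):
        """Er(f, h, theta) = sum_x pi(x) * [1 - delta(f(x), h(x))] (theta-independent)."""
        return sum(pi[x] * (f[x] != h[x]) for x in X)

    def off_training_set_error(f, h, theta):
        """Same error, restricted and renormalized to x outside theta_X."""
        theta_X = {x for x, _ in theta}
        outside = [x for x in X if x not in theta_X]
        z = sum(pi[x] for x in outside)
        return sum(pi[x] * (f[x] != h[x]) for x in outside) / z if z > 0 else 0.0

    f = {x: random.choice(Y) for x in X}      # a fixed (but "unknown") target
    h = {x: random.choice(Y) for x in X}      # some hypothesis function
    theta = sample_training_set(f, m=5)
    print(iid_error(f, h), off_training_set_error(f, h, theta))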
The central result of the analysis of exhaustive learning carried out by Schwartz et al. ([1990]) is that when n >> m and "self-averaging" applies, P(E | f, m) ∝ p₀(E) (1 − E)^m, where p₀(E) ≡ Σ_h {T(h) × δ(E, Er(f, h))} is a function which depends on f but is independent of m.

It is important to emphasize that since they both concentrate on P(E | f, m), both PAC and exhaustive learning ignore any aspect of θ besides its size. (As is discussed below, this reflects the fact that both implicitly perform an average-case analysis over θ.) Furthermore, because they examine probabilities conditioned on a particular target function f, the conclusions of both of them are independent of the prior P(f). This means, loosely speaking, that the results of neither of them are related to the real world: in general, changing the characteristics of the real world won't affect their results, and in general, their results can't tell one how to take the real world into account. This contrasts with conventional Bayesian analysis, which concentrates most of its attention directly on how to set P(f).

There are many powerful results which can be derived by analyzing distributions over {f, h, θ}. For example, such an analysis confirms the intuitive notion that the probability of generalization error depends on how "aligned" one's generalizer P(h | θ) is with the "correct" generalizer P(f | θ): P(E | θ) equals the (non-Euclidean) inner product Σ_{f, h} P(h | θ) × P(f | θ) × M_{E,θ}(h, f), where M_{E,θ}(., .) is a symmetric matrix. One can also write down an analogous formula, expressed as an inner product involving the technique of cross-validation, which gives the probability of generalization error E which accrues from using cross-validation. In fact, since the formalism presented here is simply Bayesian analysis extended so that all quantities of interest occur in the event space, essentially any supervised learning issue can be analyzed in terms of this formalism. In this paper, the supervised learning issue of interest is the connection between PAC and the "statistical mechanics" school.
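The exhaustive learning result quoted above can be probed numerically. The following sketch (a toy Boolean problem with a uniform prior T(h); these assumptions are made only for illustration and are not taken from Schwartz et al.) implements Gibbs learning (sample a training set θ from f, then sample h from T(.) restricted to the h's that reproduce θ) and compares the resulting empirical P(E | f, m) with p₀(E)(1 − E)^m; since the toy space is far from the n >> m, self-averaging regime, the agreement is only rough.

    import itertools, random
    from collections import Counter

    X = list(range(6))                                        # toy input space
    all_h = list(itertools.product([0, 1], repeat=len(X)))    # all 2^6 Boolean functions
    T = {h: 1.0 / len(all_h) for h in all_h}                  # uniform prior T(h)
    f = random.choice(all_h)                                  # fixed target function
    err = lambda g, h: sum(g[x] != h[x] for x in X) / len(X)  # i.i.d. error, uniform pi(.)

    def gibbs_error_histogram(m, trials=20000):
        """Empirical P(E | f, m): sample theta from f, then h from T(.)
        restricted to hypotheses consistent with theta."""
        counts = Counter()
        for _ in range(trials):
            xs = [random.choice(X) for _ in range(m)]
            consistent = [h for h in all_h if all(h[x] == f[x] for x in xs)]
            counts[err(f, random.choice(consistent))] += 1    # uniform T => uniform draw
        return {E: c / trials for E, c in sorted(counts.items())}

    # p0(E) = sum_h T(h) * delta(E, Er(f, h)); compare with p0(E) * (1 - E)^m.
    p0 = Counter()
    for h in all_h:
        p0[err(f, h)] += T[h]
    m = 4
    pred = {E: p0[E] * (1 - E) ** m for E in p0}
    Z = sum(pred.values())
    print(gibbs_error_histogram(m))
    print({E: v / Z for E, v in sorted(pred.items())})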
2. PAC
In exhaustive learning, because of the class of generalizers being investigated, it is possible to write down P(E | f, m) exactly. Since for PAC's generalizer class one can't calculate P(E | f, m) as readily, PAC settles for bounds on (the integral of) P(E | f, m). More precisely, PAC chooses to ignore the precise generalizer used in a particular experiment and to investigate the quantity δ ≡ max_{P(h|θ)} P(E ≥ ε | f, m) = max_{P(h|θ)} Σ_{E=ε}^{1} P(E | f, m), where the max is implicitly understood to only be over those P(h | θ) meeting the restrictions given above for PAC generalizers, and the sum is over all E values between ε and 1, inclusive. Using some results from [Wolpert 1992c] and the fact that PAC uses the i.i.d. error function, one easily derives (see appendix A) a bound on δ which in turn implies the familiar PAC result

(1)   δ ≤ (1 − ε)^m × |H|.

Note that if one could calculate Σ_{E=ε}^{1} P(E | f, m) for any P(h | θ) in the generalizer class, the quantities so derived would be more fundamental than the max of those quantities. (If for some reason one wished to concentrate on the maximum of the sums, then given the value of the sum for every P(h | θ), it would be straightforward to analyze said maximum. However the reverse does not hold; just given the maximum, one cannot deduce the sum for every P(h | θ) in the generalizer class.) Similarly, if one could dispense with the sum and write down P(E | f, m) for any E and for any P(h | θ) in the generalizer class, that quantity would be more fundamental still. It is only for reasons of calculational tractability that in PAC one is forced to focus on the max of a sum of P(E | f, m)'s rather than on P(E | f, m) directly.
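As a sanity check on (1), the following sketch (a hypothetical five-element concept class H containing the target, and a deliberately adversarial generalizer that always returns the worst h ∈ H consistent with θ; all choices made purely for illustration) estimates δ by Monte Carlo and compares it with (1 − ε)^m × |H|.

    import random

    X = list(range(10))
    f = {x: 0 for x in X}                                     # target: all-zeros function
    # Hypothetical concept class H: f plus some corrupted copies of f.
    H = [f] + [{x: int(x in bad) for x in X}
               for bad in [(0,), (0, 1), (0, 1, 2), (0, 1, 2, 3)]]
    err = lambda h: sum(f[x] != h[x] for x in X) / len(X)     # uniform pi(.)

    def worst_consistent_learner(theta):
        """Adversarial PAC generalizer: of the h in H consistent with theta,
        return one with the largest error."""
        consistent = [h for h in H if all(h[x] == y for x, y in theta)]
        return max(consistent, key=err)

    def estimate_delta(eps, m, trials=50000):
        hits = 0
        for _ in range(trials):
            theta = [(x, f[x]) for x in random.choices(X, k=m)]
            hits += err(worst_consistent_learner(theta)) >= eps
        return hits / trials

    eps, m = 0.25, 20
    print("empirical delta:", estimate_delta(eps, m))
    print("bound (1):      ", (1 - eps) ** m * len(H))

In this toy setup the max over allowed generalizers is realized, for every θ, by always returning the worst consistent h, which is why a single adversarial learner suffices for the estimate.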
The derivation of (1) given in appendix A has the advantage that it is completely formal. This means, for example, that one can invoke symmetry under f ↔ h to replace (1) with an analogous formula where target and hypothesis functions are interchanged. (This procedure is carried out for various non-PAC scenarios in [Wolpert 1992b, 1992c].) More precisely, assume 1) that the generalizer is an i.i.d. generalizer (i.e., obeys P(h | θ) ∝ P(h) × δ(θ ⊂ h) × Π_{i=1}^{m} [π(θ_X(i))] / P(θ); see [Wolpert 1992b]), an assumption which is true, for example, for many exhaustive learning generalizers; and also assume 2) that P(f | θ) has support falling in a "concept class" F'. Then the pair (P(f | θ), P(h | θ)) is the same as in conventional PAC, just with f and h interchanged. Given that the PAC error function is invariant under f ↔ h, this means we can immediately rewrite (1) with f and h interchanged: max_{P(f|θ)} Σ_{E=ε}^{1} P(E | h, m) ≤ (1 − ε)^m × |F'|.
In English: Perform many different supervised learning experiments (having target functions chosen according to P(f)), all with the same i.i.d. generalizer. Collect all those experiments in which the hypothesis function guessed by the generalizer is h. The resultant generalization error distribution, as a function of m, is P(E | h, m). Assume that we are restricting ourselves to those supervised learning experiments whose target functions are contained in the set F'. Assume further that for all training sets θ which are consistent with the hypothesis function h, P(f | θ) is zero unless f also agrees with θ (this will be true, in particular, if there is no noise). Now we do not know P(f | θ) in general. (If we did, then how best to do supervised learning would be trivial.) However the result given above says that an upper bound on P(E ≥ ε | h, m), over all allowed P(f | θ), is given by (1 − ε)^m × |F'|.
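A sketch of this interchanged reading, under assumptions made only for illustration (a small hypothetical concept class F', a uniform P(f) over it, and a Gibbs-style i.i.d. generalizer that guesses uniformly among all functions consistent with θ): run many experiments, bin them by the guessed h, and compare the empirical frequency of E ≥ ε in each bin against (1 − ε)^m × |F'|.

    import itertools, random
    from collections import defaultdict

    X = list(range(6))
    all_fns = list(itertools.product([0, 1], repeat=len(X)))
    F_prime = random.sample(all_fns, 8)                       # hypothetical concept class F'
    err = lambda f, h: sum(f[x] != h[x] for x in X) / len(X)  # i.i.d. error, uniform pi(.)

    def iid_generalizer(theta):
        """Gibbs-style i.i.d. generalizer: P(h | theta) proportional to
        P(h) * delta(theta subset h), with P(h) uniform over all functions."""
        consistent = [h for h in all_fns if all(h[x] == y for x, y in theta)]
        return random.choice(consistent)

    def tails_by_guess(eps, m, trials=50000):
        """Estimate P(E >= eps | h, m), with f drawn uniformly from F'."""
        tallies = defaultdict(lambda: [0, 0])                 # h -> [count(E >= eps), count]
        for _ in range(trials):
            f = random.choice(F_prime)
            theta = [(x, f[x]) for x in random.choices(X, k=m)]
            h = iid_generalizer(theta)
            tallies[h][1] += 1
            tallies[h][0] += err(f, h) >= eps
        return {h: a / b for h, (a, b) in tallies.items() if b > 50}

    eps, m = 0.34, 10
    print("bound:", (1 - eps) ** m * len(F_prime))
    print("worst empirical tail:", max(tails_by_guess(eps, m).values()))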
In addition to allowing us to exploit symmetry under f ↔ h, another advantage of the formality of the derivation of (1) is that it automatically draws the researcher's attention to several interesting features of PAC. To give one obvious example, since PAC is interested in a sum over P(E | f, m), and since P(E | f, m) equals <P(E | f, θ)>_θ, the average of P(E | f, θ) over training sets θ generated from f, PAC is implicitly performing an average-case analysis over θ rather than a worst-case one. Accordingly, one might extend PAC so that, instead of investigating the behavior of δ as a function of ε, m and |H|, one investigates the behavior of <max_{P(h|θ)} [Σ_{E=ε}^{1} P(E | f, θ)]>_θ as a function of ε, m and |H|.
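The two quantities need not coincide. A toy numerical contrast (the tail probabilities below are invented purely to illustrate how the ordering of the max and the average matters):

    # Hypothetical tail probabilities P(E >= eps | f, theta) for two generalizers
    # G1, G2 and two equally likely training sets theta1, theta2.
    tails = {"G1": {"theta1": 0.9, "theta2": 0.1},
             "G2": {"theta1": 0.1, "theta2": 0.9}}
    thetas = ("theta1", "theta2")

    # PAC-style quantity: the worst average-case behavior, max_G < tail >_theta.
    max_of_avg = max(sum(t.values()) / len(t) for t in tails.values())                 # 0.5

    # Proposed extension: the average worst-case behavior, < max_G tail >_theta.
    avg_of_max = sum(max(tails[g][th] for g in tails) for th in thetas) / len(thetas)  # 0.9

    print(max_of_avg, avg_of_max)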
The difference is between the question "what's the worst average-case behavior I can have?" and the question "on average, what's the worst behavior I can have?". Define unique(θ) to be the (unordered) set consisting of the m' distinct input-output pairs in θ. Now due to PAC's sampling assumption, duplicate input-output pairs are allowed in θ (so that m' ≤ m). As a result, for any two training set sizes m1 and m2 > m1, the set of all the unique(θ) going into the < >_θ for m1 is contained within the set of all the unique(θ) going into the < >_θ for m2. In other words, for any two m values, the respective P(E | f, θ) distributions (which get averaged to form δ) overlap in unique(θ) space. This fact is directly reflected in equation (1), which connects the δ for one value of m with the δ value for another value of m. Of course, the most obvious extension of PAC suggested by the fact that it is average-case over θ is a worst-case-over-θ analysis. It turns out that such a worst-case analysis is trivial if the sampling distribution is not fixed. More precisely, so long as there exists at least one h ∈ H which does not equal f but which intersects f, then there exists a π(.) s.t.