On the Connection Between In-Sample Testing and Generalization Error
David H. Wolpert
SFI WORKING PAPER: 1992-01-006
SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant. ©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder. www.santafe.edu
ON THE CONNECTION BETWEEN IN-SAMPLE TESTING AND GENERALIZATION ERROR.
by David H. Wolpert 1,2
1 - Theoretical Division and Center for Nonlinear Studies, MS B213, LANL, Los Alamos, NM, 87545 ([email protected])
2 - The Santa Fe Institute, 1660 Old Pecos Trail, Suite A, Santa Fe, NM, 87501 (current address)
Abstract: This paper proves that it is impossible to justify a correlation between reproduction of a training set and generalization error off of the training set using only a priori reasoning. As a result, the use in the real world of any generalizer which fits a hypothesis function to a training set (e.g., the use of back-propagation) is implicitly predicated on an assumption about the physical universe. This paper shows how this assumption can be expressed in terms of a non-Euclidean inner product between two vectors, one representing the physical universe and one representing the generalizer. In deriving this result, a novel formalism for addressing machine learning is developed. This new formalism can be viewed as an extension of the conventional "Bayesian" formalism which (amongst other things) allows one to address the case where one's assumed "priors" are not exactly correct. The most important feature of this new formalism is that it uses an extremely low-level event space, consisting of triples of (target function, hypothesis function, training set). Partly as a result of this feature, most other formalisms that have been constructed to address machine learning (e.g., PAC, the Bayesian formalism, the "statistical mechanics" formalism) are special cases of the formalism presented in this paper. Consequently such formalisms are capable of addressing only a subset of the issues addressed in this paper. In fact, the formalism of this paper can be used to address all generalization issues of which I am aware: over-training, the need to restrict the number of free parameters in the hypothesis function, the problems associated with a "non-representative" training set, whether and when cross-validation works, whether and when stacked generalization works, whether and when a particular regularizer will work, etc. A summary of some of the more important results of this paper concerning these and related topics can be found in the conclusion.
I. INTRODUCTION.
i) This paper's context.

This paper concerns the problem of inductive inference, sometimes also known as (supervised) machine learning. For most purposes this problem can be formulated as follows: We have an input space X and an output space Y. There is an unknown function from X to Y which will be referred to as the target function. (This function is sometimes instead called the "parent" function, or the "generating" function.) One is given a set of m samples of the target function (the training set), perhaps made with observational noise. One is then given a value from the input space as a question. The problem is to use the training set to guess what output space value on the target function corresponds to the given question. Such a guessed function from questions to outputs is known as a hypothesis function. An algorithm which produces a hypothesis function as a guess for a target function, basing the guess only on the training set of m (X × Y) vectors read off of that target function, is called a generalizer. Some examples of generalizers are back-propagated neural nets [1], Holland's classifier system [2], and some implementations of Rissanen's minimum description length principle [3,4] (which, along with all other schemes which attempt to exploit Occam's razor, is analyzed in [5]). Other important examples are memory-based reasoning schemes [6], regularization theory [7,8], and similar schemes for overt surface fitting of a hypothesis function to the training set [9-13].

Conventional classifiers which work via Bayes' theorem, information theory, clustering analysis, or the like (e.g., ID3 [14], Bayesian classifiers like Schlimmer's Stagger system described in [15], the systems described in [16], etc.) also serve as examples of generalizers. However such classifiers usually can only guess a particular output value if that value occurs in the training set. This paper assumes no such restriction on the guessing.

If for any training set e of m pairs {x_i, y_i} the generalizer always guesses y_i when presented with the question x_i, we say that the generalizer reproduces the training set.1
(Some researchers refer to the problem of reproducing the training set as the problem of "learning", to distinguish it from the problem of generalizing for questions outside of the training set.) For the sake of simplicity, in this paper I will usually assume noiseless data. For such a situation the training set should be reproduced exactly. It is trivial to ensure such exact reproduction of the training set; simply build a look-up table. (Difficulties only arise when one insists that the look-up table be implemented in an odd way, e.g., as a feed-forward neural net.) Therefore the only questions of interest are those outside of the training set.

The problem before us is to reach rigorous and meaningful conclusions concerning inductive inference. Recently there has been a lot of research attempting to do this while using reasoning
which is as close to a priori as possible ([15, 17-25]). An archetypal example of such an analysis (related to the reasoning ...

... it is assumed that m > 0 (i.e., e is non-empty). For simplicity, it is also assumed that P(f, h, e), the joint probability of the target function f, the hypothesis function h, and the training set e, equals zero unless the set e is contained in the set f, i.e., unless y(x_i) = f(x_i) for all pairs (x_i, y(x_i)) in e. In other words, we're assuming no noise. As was mentioned earlier, the assumption of no noise means that the problem of how to guess in response to a question q is only non-trivial if q is outside of e. Whenever the no-repeats assumption is in effect, this in turn implies that we want m < n.

P(E | h, e) = Σ_{f ⊃ e} {δ[Er(f, h, e), E] × P(f, h, e)} / Σ_{f ⊃ e} {P(f, h, e)},

where δ(·, ·) is the Kronecker delta function. (For clarity, in this paper the condition "f ⊃ e" will be explicitly written whenever it applies. Note that it is actually superfluous to do so, however, since that condition is a direct reflection of the fact that P(f, h, e) equals zero if e ⊄ f.) Now use the fact that P(f, h, e) = P(f | h, e) × P(h | e) × P(e) = P(f | e) × P(h | e) × P(e) to rewrite the formula for P(E | h, e):

P(E | h, e) = Σ_{f ⊃ e} {δ[Er(f, h, e), E] × P(f | e)}.     (2.1)
To proceed further, we need to make some assumptions about P(f | e). As a first example, assume that P(f | e) is constant over its support in F; all target functions consistent with e are equally likely, given only e and some guessed hypothesis function h.12 We have the following theorem:

When P(f | e) is independent of f, P(E | h, e) = C_z^(n-m) × (r - 1)^(n-m-z) / r^(n-m), where z ≡ (n - m)(1 - E).     (2.2)

(Here r is the number of elements in Y, n is the number of elements in X, and C_z^(n-m) is the binomial coefficient "n - m choose z".)
Proof: Label the m elements of e as (x_{n-m+1}, y_{n-m+1}), (x_{n-m+2}, y_{n-m+2}), ..., (x_n, y_n). Label the remaining elements of X according to the same scheme, so that questions outside of e are chosen from the set {x_1, ..., x_{n-m}}. This allows us to rewrite the sum Σ_{f ⊃ e} as Σ_{f_1, ..., f_{n-m}}, where f_i is shorthand for f(x_i), and the sum is understood to extend over all r^(n-m) possible values of its subscript. Now write the constant value P(f | e) has over its support as p. This allows us to rewrite P(h, e) as p × Σ_{f_1, ..., f_{n-m}} {1} = p × r^(n-m). Similarly, P(E, h, e) becomes

p × Σ_{f_1, ..., f_{n-m}} {δ([Σ_{i=1}^{n-m} δ(f_i, h(x_i))], (n - m)(1 - E))}.

This is just p times the number of instances in which the set {f_1, ..., f_{n-m}} agrees with {h(x_i)} exactly z ≡ (n - m)(1 - E) times.13 Simple combinatorics tells us that this equals p × C_z^(n-m) × (r - 1)^(n-m-z). Therefore when P(f | e) is independent of f, P(E | h, e) = C_z^(n-m) × (r - 1)^(n-m-z) / r^(n-m). QED.
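Theorem (2.2) is concrete enough to check by brute force. What follows is a minimal Python sketch (mine, not part of the original text), assuming the discrete setup above with illustrative values n = 5, m = 2, r = 3; it tallies P(E | h, e) under a flat P(f | e) and confirms both the combinatorial formula and the h-independence discussed below.

from itertools import product
from math import comb
from collections import Counter

n, m, r = 5, 2, 3                      # |X|, training-set size, |Y|
off_inputs = range(n - m)              # x_1 ... x_{n-m}: the questions

def check(h):
    """Enumerate all r^(n-m) off-training-set completions f and tally
    P(E | h, e) under a flat P(f | e); compare to theorem (2.2)."""
    counts = Counter()
    for f in product(range(r), repeat=n - m):       # each f consistent with e
        z = sum(f[i] == h[i] for i in off_inputs)   # off-training-set agreements
        counts[1 - z / (n - m)] += 1                # the error E this f produces
    for E, c in sorted(counts.items()):
        z = round((n - m) * (1 - E))
        predicted = comb(n - m, z) * (r - 1) ** (n - m - z) / r ** (n - m)
        assert abs(c / r ** (n - m) - predicted) < 1e-12
        print(f"E = {E:.3f}:  P(E | h, e) = {predicted:.4f}")

# Two different hypothesis functions (specified only off the training set)
# give identical distributions, as (2.2) asserts: h is irrelevant here.
check(h=(0, 0, 0))
check(h=(2, 1, 0))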
This case where P(f | e) is constant over its support is the distribution with the maximum entropy. It serves as a benchmark case, corresponding to a "random" universe.14 Equation (2.2) shows that for such a maximum entropy P(f | e), P(E | h, e) is explicitly independent of s, the number of agreements between the training set e and the hypothesis function h. Since there is no a priori reason to rule out the possibility that P(f | e) has maximum entropy, there is no a priori reason to rule out (2.2), and therefore there is no theoretical justification for the oft-voiced claim that, for a priori reasons alone, P(E | h, e) depends on s. This answers question (2) from the introduction; we have an explicit proof that all arguments trying to justify the claim of inductive inference without making a priori assumptions (e.g., without assuming P(f | e) depends on f) are wrong.
In fact, (2.2) shows that for this benchmark case of flat P(f | e), not only is P(E | h, e) independent of s, it is independent of h entirely. This means, for example, that one can not meaningfully say "what size neural net gives valid generalization" without making some ultimately ad hoc assumptions about P(f | e). To put it another way, (2.2) shows that any attempt to determine in an a priori fashion what size neural net to use implicitly makes assumptions about P(f | e), and without justifying those assumptions there is no reason to believe the resulting conclusion.

Note also that (2.1) says that P(E | h, e) is independent of P(h | e). In fact, (2.1) shows that P(E | h, e) is independent of P(h | e) even if P(f | e) varies with f, since changing P(h | e) has no effect on P(f | e). In other words, although replacing our maximum entropy assumption with some other assumption allows the choice of h to affect P(E | h, e) (see the next section), no assumption for the form of P(f | e) results in a correlation between P(E | h, e) and P(h | e). This fact can be used to address question (3) from the introduction: Despite the suggestions of some authors ([23-24]) and despite what intuition might say, given a particular hypothesis function h, as far as the distribution P(E | h, e) is concerned, it does not matter if h was fixed before the researcher has any knowledge of e, or if instead h was chosen based on e (as in a generalizer).
III. THE ASSUMPTIONS NEEDED FOR IN-SAMPLE TESTING TO BE RELEVANT.

Given (2.2), how is it that in our physical universe so many learning techniques which expend all of their effort at reproducing the training set manage to generalize fairly well? This section addresses this question by investigating what assumptions can result in a correlation between (the level of agreement between h and e) and (the generalizing error for questions outside of e). In doing so, this section answers issues four, six, seven, and eight from the introduction.
i) Non-constant P(f | e).

P(f | e) is determined by the physical universe, and by what kinds of f's are likely in our physical universe. If, in our universe, for the kinds of problems generalizing algorithms are usually tested on, reproducing the training set results in good generalization, then it must be that the MaxEnt assumption for P(f | e) is wrong. In other words, we have empirical evidence suggesting that there are non-uniformities in the distribution of inference problems (or at least in the distribution of such problems with which humanity has so far concerned itself).16,17

In this section I consider the case where MaxEnt is wrong and P(f | e) is more elaborate than {P(f | e) = 0 if e ⊄ f; P(f | e) = a constant over its support in F}. First rewrite (2.1) to explicitly delineate what aspect(s) of h are relevant to the generalization error:

P(E | h, e) = Σ_{f_1, ..., f_{n-m}} {δ([Σ_{i=1}^{n-m} δ(f_i, h(x_i))], z) × P(f_1, ..., f_{n-m} | e)},     (3.1)

where z is the number of off-training set agreements between h and f, (n - m)(1 - E) (just like in the proof of (2.2)), and the n - m elements of the set X - e_X are labelled x_1 through x_{n-m}.

By appropriate choice of the function P(f | e), we can make almost any distribution P(E | h, e) we choose. In particular, choose P(f | e) to equal 0 unless all of the f(x_i) equal h′(x_i) for some pre-set hypothesis function h′, in which case it equals 1. In this case P(E | h′, e) equals 1 for E = 0, and 0 for all other E values; we get perfect generalization.18 However for generic h ≠ h′ we do not get perfect generalization. In other words, given that P(f | e) is allowed to vary, for fixed e different h might lead to different P(E | h, e).

Although this clarifies how choice of h can affect generalization, it still leaves unresolved the two other issues raised at the end of the previous section: even for a non-uniform P(f | e), P(E | h, e) is independent of both s and of P(h | e), in apparent violation of common experience.

To resolve these two issues, first make the definition S(f, h, {x_i}) ≡ Σ_{i} δ[h(x_i), f(x_i)]. S is a mapping taking two functions and a set of X values to an integer. That integer is the number of times the two functions agree with one another on what output corresponds to an input, over the set of provided X values. Note that S(f, h, {x_i}) is symmetric under interchange of f and h. Often for conciseness the third argument to S will be given as the X components of a set of X-Y pairs (i.e., a training set) rather than directly as a set of X values. For example, I will sometimes write S(f, h, e_X), by which I mean S(f, h, {x_i}), where the {x_i} are the X components of e. Using this notation, S(f, h, X - e_X) is the number of times h(x) agrees with f(x) for x outside of e; Er(f, h, e) = 1 - S(f, h, X - e_X) / (n - m). Similarly, if e ⊂ f, then S(f, h, e_X) is the number of times h agrees with e. Note that in such cases we can write S(e, h, e_X) instead of S(f, h, e_X). S(e, h, e_X) never occurs in equation (3.1); if I know h and e, then I have determined P(E | h, e), and counting S(e, h, e_X) doesn't have any effect on P(E | h, e). In other words, despite the fact that we're no longer assuming maximum entropy P(f | e), (3.1) tells us that it's still true that only h's behavior outside of the training set is relevant.19

To see why (3.1) might not be the final word on inductive inference, note that we're interested in the "dependence" between s ≡ S(e, h, e_X) and E. Now such "dependence" is a meaningless concept if s isn't allowed to vary. However s can't vary if both h and e are fixed. This suggests that to analyze inductive inference, we should fail to specify either h and/or e, and evaluate (for example) P(E | s, e) rather than P(E | h, e).

Another reason for examining a quantity like P(E | s, e) rather than P(E | h, e) comes from the fact that in practice we can not evaluate (3.1), since we don't know P(f | e); it's determined by the universe. To get around this problem, we could just make a direct assumption for P(f | e). That's a huge assumption to make however (which, interestingly enough, doesn't stop Bayesians from making it). As an alternative, it would be preferable to make one (or more) relatively weak indirect assumptions which can "take the place" of knowledge directly concerning P(f | e).

As an example, we can make an "indirect" assumption about P(f | e) and assume that there exists a correspondence between P(f | e) and the distribution P(h | e). In other words, we can assume that the probability distribution over the architecture on which we're going to implement our hypothesis function (e.g., a feedforward neural net) corresponds in a certain way to the probability distribution over target functions in the real world.20

As an example of how such an indirect assumption can mitigate (3.1) and its implication that generalization error is independent of S(e, h, e_X), assume that over our event space only those target functions can occur which are constant (i.e., independent of x). Now assume that we only examine constant hypothesis functions, i.e., assume that the space of hypothesis functions we're examining corresponds to the space of target functions. Then the agreement (or lack thereof) between a hypothesis function and a target function over the elements of the training set fixes exactly the agreement between the two functions over input values outside of the training set. It's still true that only h's behavior outside of the training set is relevant, as stipulated by (3.1). But now the correspondence between the distribution over F and the distribution over H couples the error outside of the training set with the error inside the training set; a small sketch of this constant-function scenario is given below. The next subsection is a formal investigation of such coupling behavior in the context of evaluating P(E | s, e).21
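To make these definitions concrete, here is a minimal sketch (mine, not the paper's) of the agreement count S, the off-training-set error Er, and the constant-function scenario just described; the particular values n = 6, m = 2, r = 3 and the dictionary representation of functions are illustrative choices only.

def S(f, h, xs):
    """Number of inputs in xs on which the functions f and h agree."""
    return sum(f[x] == h[x] for x in xs)

def Er(f, h, e_X, X):
    """Off-training-set error: fraction of questions outside e on which
    f and h disagree."""
    off = [x for x in X if x not in e_X]
    return 1 - S(f, h, off) / len(off)

n, m, r = 6, 2, 3
X = list(range(n))
e_X = X[n - m:]                                  # training-set inputs

for c_f in range(r):                             # constant target f ≡ c_f
    f = {x: c_f for x in X}
    for c_h in range(r):                         # constant hypothesis h ≡ c_h
        h = {x: c_h for x in X}
        s = S(f, h, e_X)                         # on-training-set agreement
        # Constant functions agree everywhere or nowhere, so s = m forces
        # Er = 0 and s = 0 forces Er = 1: on- and off-e error are coupled.
        print(f"f≡{c_f}, h≡{c_h}:  s = {s},  Er = {Er(f, h, e_X, X):.2f}")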
ii) How agreement between a hypothesis and a training set can affect generalization.
We have some probability distribution across our event space, and therefore in particular a distribution P(h | e). Choose a stochastic P(h | e), i.e., a P(h | e) whose support extends over more than one h for a given e (so that the hypothesis guessed by the generalizer isn't uniquely fixed by the training set). In fact, in practice the P(h | e) used in the following analysis is often completely independent of e. For example, the support of P(h | e) might be the set of feed-forward neural nets smaller than a given size. As another example, the support of P(h | e) might be the set of functions formed by taking linear combinations of the elements of some set of basis functions.

Now consider the following scheme: Let s″ be some integer and randomly pick a hypothesis function h from H (according to the probability distribution over H), subject to the condition that S(e, h, e_X) = s″. This procedure defines a distribution P(h | e, s″) which can be used as a generalizer. Now pick a new integer s′ > s″. s′ defines a new generalizer P(h | e, s′).22
We wish to choose between a hypothesis function output by the generalizer P(h | e, s′) and a hypothesis function output by the generalizer P(h | e, s″). Our strategy is to use the fact that s″ < s′ to choose the hypothesis function output by the generalizer P(h | e, s′). The following argument shows that this strategy results in desirable P(E | s, e) if there is a correspondence between P(f | e) and P(h | e).

By expressing it in terms of f and then marginalizing over F, one proves that P(E, s, e) = Σ_{h} {P(E, h, e) × δ[S(e, h, e_X), s]}. Similarly, P(s, e) = Σ_{h} {P(h, e) × δ[S(e, h, e_X), s]}. Therefore we can write

P(E | s, e) = Σ_{h} {P(E | h, e) × P(h, e) × δ[S(e, h, e_X), s]} / Σ_{h} {P(h, e) × δ[S(e, h, e_X), s]}.     (3.2)
Note that one can replace both of the P(h, e) terms in (3.2) (one in the numerator, one in the denominator) by P(h | e). (3.2) explicitly relates P(E | s, e) to P(E | h, e), the quantity analyzed in the previous section. If P(E | h, e) is independent of h, as in the maximum entropy case described by equation (2.2) where P(f | e) is independent of f, then we can take P(E | h, e) out of the sum in the numerator in (3.2), and P(E | s, e) is independent of s. In other words, for the maximum-entropy case, it doesn't matter what level of agreement there is between our hypothesis function and the training set; inductive inference still doesn't hold, even though we're examining P(E | s, e) rather than P(E | h, e).

Similarly, if P(h | e) is independent of h, then again P(E | s, e) is independent of s. This is suggested by the interchange symmetry between F and H. A formal proof is as follows:
Proof: First, note that if P(h | e) is independent of h, (3.2) becomes

P(E | s, e) = Σ_{h} {P(E | h, e) × δ[S(e, h, e_X), s]} / Σ_{h} {δ[S(e, h, e_X), s]}.

In a manner similar to that in section II we can split up the sum over h into two sums, one over the values of h(x) where x is within the training set, and one over the values of h(x) where x is outside of the training set: Σ_{h} ≡ Σ_{h_1, h_2, ..., h_{n-m}} Σ_{h_{n-m+1}, h_{n-m+2}, ..., h_n}. This allows us to rewrite our denominator as Σ_{h_1, ..., h_{n-m}} Σ_{h_{n-m+1}, ..., h_n} {δ(S(e, h, e_X), s)} = r^(n-m) × Σ_{h_{n-m+1}, ..., h_n} {δ(S(e, h′, e_X), s)}, where h′ is defined as the m-tuple {h_{n-m+1}, h_{n-m+2}, ..., h_n}. Evaluating, we get r^(n-m) × C_s^m × (r - 1)^(m-s). Since P(E | h, e) is only dependent on that part of h outside of e, we can rewrite our numerator in a similar fashion, getting Σ_{h_1, ..., h_{n-m}} {P(E | ĥ, e)} × Σ_{h_{n-m+1}, ..., h_n} {δ(S(e, h′, e_X), s)}, where ĥ is defined as the (n-m)-tuple {h_1, h_2, ..., h_{n-m}} and h′ is as before. Therefore when P(h, e) is independent of h, P(E | s, e) = r^(m-n) × Σ_{h_1, h_2, ..., h_{n-m}} {P(E | ĥ, e)}, and is independent of s. QED.
In particular, if h is chosen at random according to a uniform distribution, the generalization error is independent of s.

If P(E | h, e) is not independent of h (cf. equation (3.1)), and if P(h | e) is not uniform, then P(E | s, e) can depend on s. In other words, under these circumstances there can be coupling between on-training set error and off-training set error (as measured by P(E | s, e)) and inductive inference can occur. The precise form of the inductive inference, i.e., how P(E | s, e) depends on s, is determined by the form of P(f | e) (which determines P(E | h, e)) and the form of P(h | e). The following is an example of this:
Example: Assume P(f | e) = 1 if f is some particular function f*, 0 otherwise. (Note that S(e, f*, e_X) = m, i.e., f* must agree with e, so we can rewrite S(e, h, e_X) as S(f*, h, e_X) for any function h.) Similarly, assume P(h | e) = 1/2 if h is f*, 1/2 if h is some particular function h′ for which S(f*, h′, e_X) = m - 1 (S(f*, h′, X - e_X) being unspecified except that it doesn't equal n - m), and 0 for all other h. For this scenario, the denominator in (3.2) equals 1/2 if s = m, and 1/2 if s = m - 1. (If s equals neither m nor m - 1, P(E | s, e) is undefined, since the event (s, e) can never occur.) To evaluate the numerator in (3.2), note that P(E | h, e) = 1 if S(f*, h, X - e_X) = (n - m)(1 - E), 0 otherwise. In other words, P(E | h, e) = δ(E, 1 - [S(f*, h, X - e_X) / (n - m)]). For s = m, this means that P(E | s, e) = 1 for E = 0, 0 otherwise. For s = m - 1, P(E | s, e) = 1 for E = 1 - S(f*, h′, X - e_X) / (n - m) (which is greater than 0), 0 otherwise.
A single number giving "the generalization error" can be defined in a number of ways (e.g., as argmax_E P(E | s, e), as Σ_{E} {E × P(E | s, e)}, etc.). Whatever the precise definition used, for the scenario recounted in the example above, if S(f*, h′, X - e_X) > 0 then the generalization error either improves or stays constant as s increases. We will refer to this behavior by saying that "reproduction correlates with generalization", for all s, for this scenario, for any (reasonable) definition of generalization error. When reproduction correlates with generalization, we should pick a hypothesis function h found by randomly sampling H subject to the constraint that h agrees with all of e, rather than a hypothesis function h found by randomly sampling H subject to the constraint that h does not agree with all of e.
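The example above can be evaluated mechanically from (3.2). The following sketch does so, with illustrative choices n = 6, m = 2, r = 2 of mine, and with particular functions f* and h′ standing in for the abstract ones in the text.

from collections import defaultdict

n, m, r = 6, 2, 2
X = list(range(n))
e_X = X[n - m:]                    # inputs appearing in the training set
off = X[:n - m]                    # the n - m possible questions

f_star = {x: 0 for x in X}                       # the sole target function
h_prime = dict(f_star)
h_prime[e_X[0]] = 1                # disagree once on e: S(e, h', e_X) = m - 1
h_prime[off[0]] = 1                # and once off e, so S(f*, h', X - e_X) < n - m

P_h = {tuple(sorted(f_star.items())): 0.5,
       tuple(sorted(h_prime.items())): 0.5}      # the assumed P(h | e)

def S(a, b, xs):
    return sum(a[x] == b[x] for x in xs)

# Equation (3.2): P(E | s, e) ∝ Σ_h P(E | h, e) P(h | e) δ[S(e, h, e_X), s],
# where here P(E | h, e) = δ(E, 1 - S(f*, h, X - e_X)/(n - m)).
table = defaultdict(float)                       # (s, E) -> unnormalized mass
for h_items, p in P_h.items():
    h = dict(h_items)
    s = S(f_star, h, e_X)                        # on-training-set agreement
    E = 1 - S(f_star, h, off) / (n - m)          # the only E with P(E | h, e) = 1
    table[(s, E)] += p

for (s, E), mass in sorted(table.items()):
    norm = sum(v for (s2, _), v in table.items() if s2 == s)
    print(f"P(E = {E:.2f} | s = {s}, e) = {mass / norm:.2f}")
# Output: s = m gives E = 0, while s = m - 1 gives E > 0; reproduction
# correlates with generalization for this pair (P(f | e), P(h | e)).

Swapping in the P(h | e) of the next subsection's example, in which an h* matching e everywhere does worse off of e than h′ does, turns the same enumeration into a demonstration of reproduction anti-correlating with generalization.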
iii) When reproduction correlates with generalization.
Reproduction does not correlate with generalization for all pairs {P(f | e), P(h | e)}. An example is the following:
Example: As in the previous example, assume that P(f | e) = 1 if f is some particular function f*, 0 otherwise. Similarly, assume P(h | e) = 1/2 if h is some particular function h* for which S(e, h*, e_X) = m, 1/2 if h is some particular function h′ for which S(e, h′, e_X) = m - 1, and 0 for all other h. (Note that h* need not equal f*.) As before, the denominator in (3.2) equals 1/2 if s = m, and 1/2 if s = m - 1. Also as before, P(E | h, e) = δ(E, 1 - [S(f*, h, X - e_X) / (n - m)]). For s = m, this means that P(E | s, e) = 1 for E = 1 - S(f*, h*, X - e_X) / (n - m), 0 otherwise. Similarly, for s = m - 1, P(E | s, e) = 1 for E = 1 - S(f*, h′, X - e_X) / (n - m), 0 otherwise. If S(f*, h*, X - e_X) < S(f*, h′, X - e_X), then generalization error increases as s goes up.
This behavior is not so uncommon as might be hoped. For example, as reported in [38], for the parity target function the error of back-propagation increases as the cardinality of the training set goes up.

It has often been observed that one can "over-train" an architecture (e.g., a feedforward neural net), and thereby degrade the generalization. What is meant by this is that if s is forced to be too close to m, then for certain scenarios the generalization error is empirically observed to start to increase. "Over-training" means that reproduction correlates with generalization for small s but not for large s. Usually over-training is considered a side-effect of noise; one "learns the noise" if one trains too much. However it can occur even in the absence of noise.

...

P(E | e) = Σ_{h, f ⊃ e} {δ[Er(f, h, e_X), E] × P(f, h, e)} / P(e)
         = Σ_{h, f ⊃ e} {δ[Er(f, h, e_X), E] × P(h | e) × P(f | e)}.
(Compare to equation (3.3).) In a manner similar to the proof of (2.2) we can rewrite this as

Σ_{f_1, ..., f_{n-m}, h_1, ..., h_{n-m}} {P′(f_1, ..., f_{n-m}) × P′(h_1, ..., h_{n-m}) × δ([Σ_{i=1}^{n-m} δ(f_i, h_i)], z)},

where P′(event) ≡ P(event | e). P′(f_1, ..., f_{n-m}) is determined by the universe; our job is to determine the optimal P′(h_1, ..., h_{n-m}).

Now make the following notational conventions: A generic (n-m)-tuple {f_1, ..., f_{n-m}} is indicated by f̄, whereas a particular such tuple is indicated by f̄*. Both tuples live in the space F̄. The i'th component of such a tuple, f̄_i, equals f_i, where f is the full n-tuple {f_1, ..., f_n}. (The difference between f and f̄ is that f is an n-dimensional vector ∈ F, whereas f̄ is only (n-m)-dimensional and lives in F̄.) Similarly, the generic (n-m)-tuple {h_1, ..., h_{n-m}} is indicated by h̄ ∈ H̄. Note that since f must contain e, P′(f̄) = P′(f). However h need not contain e, so giving h's values for inputs outside of the training set doesn't specify h completely, and P′(h̄) need not equal P′(h): P′(h̄) = P′(h_1, ..., h_{n-m}) = Σ_{h_{n-m+1}, ..., h_n} P′(h_1, ..., h_{n-m}, h_{n-m+1}, ..., h_n). Also make the definition T(f̄, h̄) ≡ δ([Σ_{i=1}^{n-m} δ(f̄_i, h̄_i)], n - m). (Note that (n - m) is the value of z when E = 0.) Finally, define U(h̄) ≡ Σ_{f̄} {P′(f̄) × T(f̄, h̄)}. Note that U(h̄) ≥ 0 ...

... several generalizers for all of whom reproduction correlates with generalization; if in addition cross-validation works, then cross-validation should be used to weed out all but one of the base generalizers.
vi) Discussion.

One could argue about whether and how (5.3) can have ramifications for our physical universe. In particular, one can argue about whether or not the probability of a target generalizer is a meaningful concept; the physical universe is chock-a-block full of target input-output functions, but target generalizers? One way to address this issue is to simply say that instead of viewing the problem of inductive inference as your being given a target input-output function which is sampled, simply view the problem as your being given a target generalizer which is sampled. After all, recall that the probability of target input-output functions really means our degree of belief that we are likely to encounter a sampling of such a function. (The frequentist interpretation of probability, in which we'd view the probability of target input-output functions as some sort of objective, universe-wide frequency count of such functions, has long been discredited. See in particular [35-37].) Similarly, the probability of target generalizers means our degree of belief that we are likely to encounter a sampling of such a generalizer (as opposed to some objective, universe-wide frequency count of such generalizers). Less metaphysically, one can simply note that for the purposes of this paper, the physical meaning of probabilities is set by their use in error functions: P(d | Θ) is simply whatever distribution gives the experimentally observed30 function P(E | s, Θ) for our hypothesis-generalizer substrate P(g | Θ).

Independent of the issue of how to interpret probabilities of target generalizers, the parallel between (3.3) and (5.3) immediately suggests many interesting and novel features of inductive inference. For example, base generalizing phenomena like over-training, the usefulness of reducing the number of free parameters, the need to have a "representative" training set, the usefulness of regularizers, etc. (see the end of section III), all have meta-generalizing versions. For example over-training corresponds to reducing cross-validation error too much, so that generalization suffers. As another example, reducing the number of free parameters might aid the performance of hypothesis generalizers, just as it can aid the performance of hypothesis functions. Similarly, a meta-training set Θ can be not "representative" if P(d | Θ) is broad and flat, in which case using a meta-generalizer to determine how to generalize might not result in low generalization error. In addition, applying regularizer cost functions to hypothesis generalizers might be beneficial, just as applying regularizer cost functions to hypothesis functions can be beneficial.

In addition to resulting in predictions concerning the phenomenology of meta-generalization, the parallel between (3.3) and (5.3) has many other implications. For example, it means that almost any method which is usually viewed as meta-generalizing (e.g., cross-validation [26-30], the bootstrap [30], fan generalizers and time-series analysis [38,12]) can also be applied directly to (base) generalizing. For example, in techniques akin to the bootstrap method, one doesn't simply sum, over all partitions of the base training set, the errors in guessing one side of the partition when trained with the other (as one does in cross-validation). Rather one might look at the probability distribution of such errors. Such a distribution can be estimated by constructing a large ensemble of partitions and collating a histogram of {number of partitions which had a certain error in guessing one side of the partition when trained with the rest of it} vs. {that error value}. The parallel between (3.3) and (5.3) immediately suggests the idea of doing a similar thing with base generalization: rather than simply sum the errors of a hypothesis function at reproducing a training set (the analogy of cross-validation), construct a histogram of {number of questions leading to a given error between the hypothesis function and the target function} vs. {those error values}. In this way one could, for example, estimate whether or not a difference in the average error at reproducing a training set between two neural nets is statistically significant. (A code sketch of this histogram idea is given below.)

In a similar fashion, the parallel between (3.3) and (5.3) suggests using techniques usually used in base generalization to meta-generalize. For example, one might apply gradient descent in meta-generalization, thereby arriving at a generalizer with low cross-validation error. Alternatively, one might wish to apply Rissanen's minimum description length principle [3,4] to hypothesis generalizers rather than hypothesis functions; pick the generalizer which when combined with the training set results in the smallest coding length, and use that generalizer to generalize from the training set.
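Here is the promised sketch of the histogram idea: record the full distribution of per-partition (leave-one-out) errors instead of only their sum. The nearest-neighbour base generalizer and the toy training set below are stand-ins of mine for whatever base generalizer one actually uses.

from collections import Counter

train = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1), (4.0, 0)]  # (x, y) pairs

def nearest_neighbour(training_set, q):
    """A toy base generalizer: guess the output of the closest input."""
    return min(training_set, key=lambda p: abs(p[0] - q))[1]

# One partition per held-out element (leave-one-out); record each
# partition's error individually rather than only the total.
errors = []
for i, (x, y) in enumerate(train):
    rest = train[:i] + train[i + 1:]
    errors.append(int(nearest_neighbour(rest, x) != y))

histogram = Counter(errors)          # {error value: number of partitions}
print("error histogram:", dict(histogram))
print("summed (cross-validation) error:", sum(errors))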
"'.' .. .As 'anotherpossibility;.note..thatusing croshvalidation. to,meta,generalizeis;akin.to.base,gen",. eralizing by the following procedure: choose from a set of candidate hypothesis functions that hypothesis function which best reproduces the training set. Howeverin base generalizationone-often-creates the hypothesis function from the base training set directly with surface-fitters ([9-13]) rather than by searching over a set of possible hypothesis functions. The parallel between (3.3) and .(5.3). suggests.meta-generalizing in a similar fashion, with "meta" surface-fitters rather. than .with:;
cross-validation. For example, the analogy suggests that just as one can base generalize by taking as one's hypothesis function the linear combination of basis hypothesis functions. with the best fit to the base training set, so might on~ meta-generalize by taking as one's hypothesis generalizer the linear combination of basis hypothesis generalizers with the best fit to the meta training set. Similarly, just as one might generalizeoverasetof residuals'between a' hypothesis-functionand·the base
n
training set, one might try to meta~generalize.overa.setofresiduals.between:a:hypothesisgeneral=: .. izer and the meta-training set. In point of fact, schemes ofthese types have been investigated be" fore, and often work extremely well in practice ([31-33, 40]). Another version of such "meta-surface-fitting" follows from the observation that when base .. '-generalizing one often pre-processes the input space, for example to reduce the dimensionality of ".the.input space. values. fed. to.the generalizer. In other. words,. one,often.maps.X.~ Z via a mapping
T and then does the generalizing from Z → Y, where Z has lower dimension than X and yet still (hopefully) captures the salient characteristics of X. Again, the suggestion of (5.3) is to do the same thing when meta-generalizing: reduce X′ to some smaller space Z′ via some mapping T′, and then map Z′ to Y. Returning for a moment to the case of base generalizing: if Z is a Cartesian product of subspaces all of which are copies of Y, then T is just a Cartesian product of functions from X to Y. Similarly, if Z′ is a Cartesian product of subspaces all of which are copies of Y, then T′ is a Cartesian product of functions from X′ to Y. In other words, if Z′ is a Cartesian product of subspaces all of which are copies of Y, then T′ is a Cartesian product of generalizers. Now note that with such a space Z′, the mapping from Z′ to Y can be done via a base generalizer (just like the mapping from Z to Y). Therefore such pre-processing via T′ constitutes a very interesting means of meta-generalizing; under such schemes one combines generalizers (which make up T′) by piping their guesses through another generalizer (the mapping from Z′ to Y). As an example, with such a scheme one can use surface fitters to combine decision tree generalizers, neural nets, and surface fitters.

These kinds of "meta-surface-fitting" are examples of the technique of stacked generalization [32]. One important aspect of the procedure of stacked generalization is that the entire procedure can itself be stacked, i.e., fed into yet another generalizer. In terms of the parallel between (3.3) and (5.3), this simply means that since one can jump to a meta-realm (i.e., go from (3.3) to (5.3)) and maintain essentially the same formal structure for calculating the value of the error function, so can one jump to a meta-meta-realm. In meta-generalization one generalizes amongst generalizers. In meta-meta-generalization, one generalizes amongst meta-generalizers. An example of such meta-meta-generalization is to decide between using cross-validation and some other scheme for choosing amongst generalizers by partitioning the meta-training set and then calculating whether cross-validation or the other scheme results in lower generalization error.
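Here is a minimal sketch of the meta-surface-fitting scheme just described: base generalizers' guesses on held-out partitions form a meta training set, and a least-squares linear combination of those guesses is fit to it (a simple instance of stacked generalization [32]). The two base generalizers and the least-squares combiner are illustrative choices of mine, not prescriptions from the text.

import numpy as np

train = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]

def g_nearest(ts, q):                       # base generalizer 1
    return min(ts, key=lambda p: abs(p[0] - q))[1]

def g_mean(ts, q):                          # base generalizer 2 (constant)
    return sum(y for _, y in ts) / len(ts)

base = [g_nearest, g_mean]

# Meta training set: for each leave-one-out partition, the base
# generalizers' guesses (the inputs in Z') paired with the true output.
Z, y = [], []
for i, (x, t) in enumerate(train):
    rest = train[:i] + train[i + 1:]
    Z.append([g(rest, x) for g in base])
    y.append(t)

# Meta-generalizer: least-squares linear combination of base guesses.
w, *_ = np.linalg.lstsq(np.array(Z), np.array(y), rcond=None)

def stacked(q):                             # the combined generalizer
    return float(np.dot(w, [g(train, q) for g in base]))

print("weights on base generalizers:", w)
print("stacked guess at q = 2.5:", stacked(2.5))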
VI. CONCLUSION.
This paper addresses the question of how and why in-sample testing can correlate with generalization error off of the testing set. In addressing this question, a formalism is developed which can be viewed as an extension of the conventional Bayesian formalism. This formalism can be used to address all generalization issues of which I am aware: over-training, the need to restrict the number of free parameters in the hypothesis function, the problems associated with a "non-representative" training set, whether and when cross-validation works, whether and when stacked generalization works, whether and when a particular regularizer will work, etc.

The most important feature of the formalism presented in this paper is that it uses an extremely low-level event space, consisting of triples of {target function, hypothesis function, training set}. In much previous theoretical research peculiar definitions of machine learning issues have been used to allow the researcher to (try to) shoehorn a pet formalism into the field of machine learning. Using the extremely low-level event space employed in this paper ensures that no such "sleight of hand" occurs; all machine learning issues are addressed directly and overtly. For example, use of this event space ensures that one's assumptions about the probability distribution of target functions in the physical universe are explicit.

Most (if not all) other formalisms that have been constructed to address machine learning (e.g., PAC) are special cases of the formalism presented in this paper. They are capable of addressing only a subset of the issues addressed in this paper. Moreover, such schemes can be expressed in terms of the formalism presented in this paper, whereas the reverse is not true. In particular, the conventional Bayesian uses only 2/3 of the full event space exploited in this paper; it has probabilities involving target functions and training sets, but hypothesis functions are ignored. (Despite use of the word "hypothesis", what a Bayesian would call "P(hypothesis such-and-such)" is equivalent not to what is called "P(hypothesis function such-and-such)" in this paper, but rather to what is called "P(target function such-and-such)".) As a result, by construction that formalism is incapable of deriving results like equation (3.3).
Some of the conclusions of this paper are:

1) If one assumes a maximum-entropy universe, then it is entirely irrelevant what hypothesis function one uses and there is no correlation between reproduction of the training set and off-training set generalization error. Since such a universe can not be ruled out on an a priori basis, it is theoretically impossible to come to any conclusions about how to generalize using only a priori reasoning.

2) Given (1), empirical evidence that the choice of hypothesis function is relevant serves as empirical evidence that the probability distribution of target functions in our universe is not uniform (i.e., has sub-maximal entropy). Peaks in this distribution presumably correspond to what humans call "parsimonious" or "regular" target functions.

3) Assuming a physical universe with less than maximal entropy, not only is choice of hypothesis function relevant, but the correspondence between the hypothesis function and the training set can correlate with the error of the hypothesis function off of the training set. (Interestingly, a necessary condition for such a correspondence is that the hypothesis function was not chosen at random, independently of the training set. This is contrary to the suggestion of some researchers that choosing the hypothesis function at random is a sufficient condition for such correspondence.) The correlation can be written as a non-Euclidean inner product between two vectors, one representing the physical universe, and one representing the generalizer used to create the hypothesis function. So far as it assumes that such a strategy results in improved generalization, any generalizer which strives to create a hypothesis function in agreement with the training set (e.g., back-propagation run on neural nets) is implicitly making an assumption about this non-Euclidean inner product.

4) The inner product mentioned in (3) can be used to demonstrate over-training, the utility of minimizing the number of free parameters in the hypothesis functions, difficulties arising from the use of training sets which are not "representative" of the target function, the utility of regularizers, etc. That inner product can also be used to demonstrate some very counter-intuitive phenomena, e.g., situations in which the worse the fit between the hypothesis function and the training set, the better the generalization.

5) If one knows the distribution of target functions beforehand, then for a given definition of generalization error one can build an optimal generalizer. For example, to maximize the probability that the chosen hypothesis function has perfect generalization, one should guess the hypothesis function lying at the mode of the distribution of target functions. (In general, that function is not the same as the "Bayes-optimal" function.)

6) A "meta" formalism can be constructed for addressing issues like how to combine generalizers, how and when cross-validation works, etc. This meta-formalism is formally equivalent to the formalism alluded to in (1) through (4), and therefore all the conclusions in (1) through (4) carry over to the meta-realm. For example, just as one can have "over-training" in which one over-minimizes error in reproducing the training set and thereby increases generalization error, so might one "over-minimize" cross-validation error and thereby increase generalization error.

Future research involves:

1) Investigating in more detail some of the issues discussed in the text: how and when over-training occurs, how and when regularizers are helpful, how and when "over-regularizing" occurs, how and when limiting the number of free parameters is helpful, how and when it helps to choose a training set which is "representative" of the target function, etc.

2) Using the answers to (1) to improve real-world generalizers. In particular, following up on the suggestions in the text on how to avoid over-training, and using the analysis of the utility of limiting the number of free parameters to deduce the relationship between what simplicity measure one should use and what assumptions one makes about the physical universe.

3) Answering the "meta" versions of all the questions in (1), and using these answers to improve real-world generalization, just as in (2).

4) Extending the analysis to different error functions, in particular to error functions involving a metric over the output space. Extending the analysis to continuous input and output spaces. Extending the analysis of section IV to different measures of "best" generalization error.

5) Extending the analysis to situations in which (like in PAC and like in appendix B) one sums over training sets as well as over target functions, and/or in which one has error functions which run over the elements of the training set as well as off of it.

6) Reconciling the analysis in this paper with the analysis in [5] concerning Occam's razor and uniform simplicity measures (the optimal measures of the simplicity of a hypothesis function).

7) Investigating whether and when requiring various invariances of the generalizer (as in [11]) results in improved generalization. Investigating whether and when stacked generalization [32] results in improved generalization, whether and when fan generalizers [38] result in improved generalization, etc.

8) Extending the analysis to fields closely related to machine learning (e.g., time-series analysis).
ACKNOWLEDGMENTS
I would like to thank the members of the Complex Systems group at Los Alamos and especially M. Stein for fruitful discussion. This work was done under the auspices of the Department of Energy.
"
FOOTNOTES
1. It is implicitly assumed in such a statement that there are no two pairs in e with the same input value x_i but different output values y_i. See [11] for more details.
2. One example of such a sleight of hand is confusing the prior distribution of input-output functions in the real world with the prior distribution of feedforward neural nets (as in [17] and any attempts to apply studies like [18] to the real world). Another sleight of hand is limiting the space of allowed target functions in some "reasonable way". Yet another is allowing questions to run over the training set. (After all, the only non-trivial issue in the noise-free case, the only issue of interest, is how to guess for questions outside of the training set; allowing questions to also run over the training set is, at best, obfuscatory.) See appendix D for a discussion of such sleights of hand in the PAC formalism.
3. It should be noted that this limitation of scope is shared by essentially all other theoretical machine-learning research in the literature which does not directly make assumptions about the real universe (e.g., PAC, the statistical mechanical formalism, the analysis of various linear models, etc.). Essentially none of this previous research even suggests novel machine-learning algorithms (as is done in this paper, in section V), never mind goes on to empirically investigate the real-world utility of such an algorithm. In point of fact, as a general rule any of this previous machine learning research which at first glance appears to have real-world ramifications will, on closer inspection, be found to have none (e.g. [18], cf. [41]).
4. Such care shows that the "coin-tossing" argument as it stands is flawed, for example. See appendix B after reading through this section.
5. This assumption has the immediate consequence that, for the most part, no asymptotic arguments either for or against a particular scheme can be made (infinity is not defined for discrete spaces). This restriction is hardly a shortcoming, since even for continuous input and output spaces, asymptotic behavior is never what we're directly interested in (since training sets are always finite), and therefore arguments concerning such behavior can be extremely misleading. (This is especially true when those arguments are made without any concern for bounding the error which comes from applying their results to the finite case.)
6. It should be noted that this requirement is more of a tautology than an "assumption", given that we're doing supervised machine learning. (Indeed, this requirement is implicitly made in every single theoretical treatment of machine learning of which I am aware.) Consider changing the target function while keeping the training data the same. Since the generalizer only sees the training data (by definition), such a change in the target function provides no change in the information at our disposal telling us how the generalizer is likely to guess. To put it another way, any algorithm run on a computer (and a generalizer in particular) must have its output depend solely on its input. And in supervised machine learning, that output is the hypothesis function, and that input is the training data, e.
7. In our physical universe the probability of a given target function can vary with the training set (in fact it has to, given that P(f, h, e) is zero unless f and e are mutually consistent). On the other hand, either the hypothesis function h chosen by us is also determined (at least in part) by e, or it's random. In either case, specifying the hypothesis function we guess in addition to specifying e doesn't help the universe determine what target function generated e. This is an intuitive justification of the result that P(f | h, e) = P(f | e).
8. Although in the real world we usually have direct access both to the hypothesis function (after all, we constructed it) and to the training set, we don't necessarily have such access to target functions. The only connection between the real world and the mathematical construct of target functions occurs through the (probability distribution concerning the) error function. However this error function need not concern the level of agreement between the function h and some hypothesized pre-fixed target function f which we believe was sampled to produce e. In other words, although we usually think of the training set as being a sample of a target function, such a view is not intrinsic to U. Nothing in the mathematics defining U says that we assume there exists a fixed, unknown target function f which is sampled to create e. Nonetheless, it is often the case that the researcher has in mind precisely this scenario where e is a sampling of some pre-fixed target function. (This is implicitly the case for example whenever we talk of "no noise" and therefore set P(f, h, e) = 0 for e ⊄ f.) Accordingly, the discussion in the rest of this paper will be presented in terms of such a scenario, and error functions and the like will be chosen accordingly.
9. P(e | f) = P(f, e) / P(f) = P(f, e) / [Σ_{ω ⊂ f} P(f, ω)], where {ω ⊂ f} is the set of all training sets ω consistent with f. P(e | f) = P(e′ | f) for all e and e′ which are consistent with f then implies that P(e | f) = 1 / [Σ_{ω ⊂ f} 1] ≡ k. (k can be calculated by summing over all allowed training set cardinalities the number of training sets of that cardinality which can be chosen from f. For example, if all training set cardinalities are allowed in U but training sets aren't allowed to contain repeats, then k^(-1) = Σ_{i=1}^{n} [n! / (n - i)!].) Note that k is independent of both f and e. As a result P(f | e) = P(e | f) × P(f) / P(e) ∝ P(f) / P(e), which proves the supposition.
10. Note that in addition to avoiding reliance on a sampling assumption (by calculating errors conditioned on a particular training set), it's also possible to avoid the assumption implicit in an error function concerning how questions are chosen. To do this we must calculate "errors" conditioned on a particular question, i.e., we replace an investigation of the probability of errors of the form Σ_{x ∉ e_X} [1 - δ(f(x), h(x))] / (n - m) with an investigation of the probability of errors of the form [1 - δ(f(q), h(q))], where q is a provided question. Formally, this can be done in several ways. One way is to expand our event space to be quadruples (f, h, e, q). Another way is to keep the original event space involving triples (f, h, e) and implement the following procedure: after being given e, one must reduce X to the set of m + 1 elements {e_X ∪ q} and then use the original error measure Σ_{x ∉ e_X} [1 - δ(f(x), h(x))] / (n - m). Investigations of any sort based on being provided a single question q will not be pursued in this paper.
51 11. Note that conventional Bayesian analysis, which doesn't distinguish H from F, can not serve as such an over-arching framework.
12. Note that this assumption of uniform probability over the space of possible target functions is not the same as assuming a uniform probability over the space of possible error values.
13. Note that z is always an integer. This reflects the fact that only certain E values are allowed; in particular, no E value is allowed for which (n - m)E isn't an integer.
14. One could define "random" differently from how it's defined here. For example, considered as a function of f, with h and e fixed, P(f, h, e) is a vector in R^(r^(n-m)). The only constraint on this vector is that the sum of its components equals Σ_f P(f, h, e), which is some constant between 0 and 1. This means that, a priori, P(f, h, e) can live anywhere in a certain space T which is a subset of R^(r^(n-m)). We could now define a "random" probability distribution over T (rather than over F) and estimate <P(E | h, e)>_T. For simplicity, in this paper no such alternative definition of "random" will be used.
16. It's important to realize that we still have no a priori basis to believe P(f | e) depends on f. To prove that P(f | e) is of a certain form (as opposed to collecting empirical evidence to that effect) would require knowledge of physics and psychology far in excess of what we have today.
17. Apparently humanity has learned to recognize the local maxima in this non-uniform P(f | e); humans refer to the f at those maxima as being "regular" or "parsimonious". (See [5].)
18. Of course, this is a very unlikely P(f | e). In the real world, P(f | e) is non-zero for all f consistent with e.
19. To understand this intuitively, imagine that we have a generalizer G which tries to reproduce the training set (e.g., let G be back-propagation run on feedforward neural nets), and use G to construct an input-output surface h which reproduces a provided training set e. Then the generalization error is unchanged if we replace h with some new function h′ which is identical to h for questions outside of e but disagrees with h for questions from within e.
20. See [5] for a discussion of how the existence (or lack thereof) of this kind of correspondence affects whether or not application of Occam's razor results in improved generalization.
21. This next subsection will not show that the knowledge that the hypothesis was chosen according to a particular P(h | e), by itself, can affect generalization error. After all, such a result would contradict the discussion at the end of section II(ii). Rather it's this knowledge together with the knowledge that P(h | e) corresponds to P(f | e) which (the next subsection shows) can affect the probable generalization error.
22. It is essentially a semantic distinction whether a) P(h | e) itself is viewed as a generalizer, with s being an observed level of agreement between the hypothesis output by the generalizer and e, or b) P(h | e) is viewed as only being a common substrate over which other generalizers are defined by their s value. What is important is how P(h | e) is used in the mathematics.
23. For a stochastic generalizer, the output is a probability distribution over Y rather than a single element of Y. For the moment, we will ignore such stochastic generalizers. However it's worth noting that for most practical situations, at the end of the day one must have a single guess, and therefore one has to have a means of collapsing the set of possible guesses provided by a stochastic generalizer down to a single guess. (Examples of such a collapsing process are averaging the stochastic generalizer's guesses, or picking one of the possible guesses according to a pseudo-random number generator.) For such cases, the stochastic generalizer is essentially a sub-algorithm in a larger deterministic generalizer.
24. Note that for some applications, if X is a k-dimensional space then g(i) is undefined for i ≤ k rather than for i > 0. See [11] for details. Such a scenario will never be explicitly considered in this paper.
25. Note that using cross-validation with a fixed set of generalizers is itself a generalizer; cross-validation maps training sets and questions to outputs, by setting the output to g_j(e; q), where g_j is the generalizer which has lowest cross-validation error for e. Viewed this way, schemes like cross-validation differ from schemes like back-propagation in two ways: First, they choose amongst hypothesis functions indirectly rather than directly, by having the direct choice be amongst generalizers. Second, they do their choosing by means of partitioning the training set. See [32].
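This footnote's observation can be written out directly, as a sketch: cross-validation over a fixed set {g_i} is itself a generalizer, mapping (e, q) to g_j(e; q) for the g_j with lowest cross-validation error on e. The toy base generalizers below are stand-ins of mine.

def g_nearest(ts, q):
    """Guess the output of the input nearest to the question."""
    return min(ts, key=lambda p: abs(p[0] - q))[1]

def g_majority(ts, q):
    """Guess the most common output in the training set."""
    ys = [y for _, y in ts]
    return max(set(ys), key=ys.count)

def cross_validation_generalizer(generalizers):
    """Return the generalizer defined by choosing, for each training
    set e, the g_i with lowest leave-one-out error on e."""
    def g(e, q):
        def loo_error(gi):
            return sum(gi(e[:i] + e[i + 1:], x) != y
                       for i, (x, y) in enumerate(e))
        best = min(generalizers, key=loo_error)
        return best(e, q)
    return g

g_cv = cross_validation_generalizer([g_nearest, g_majority])
e = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]
print(g_cv(e, 2.5))      # the winning base generalizer's guess at q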
26. One interesting feature of cross-validation is that the {g_i}, the set of generalizers amongst which one is choosing, can not be the set of all possible generalizers. The reason is that for any training set e, question q ∈ X, and guess t ∈ Y, there is a generalizer with zero cross-validation error on e which makes the guess t in response to the question q. (The parallel with the discussion in section II is immediate.) In other words, by itself the criterion of zero cross-validation error is under-restrictive in the sense that it can't uniquely fix how one should generalize. Moreover, it can be proven that there is no subset of the set of all generalizers such that for any training set there is always one (and only one) generalizer from that subset which has zero cross-validation error for that training set. In other words, even in concert with other generalization criteria, one can't use cross-validation to uniquely fix how one should generalize. See [34] for details.
27. One might want to have some restrictions on the set of generalizers. (For example, in this paper we will usually want to restrict attention to those generalizers whose guessing is invariant under re-ordering of the elements of the training set, so that their guessing is defined even for unordered training sets.) It is implicitly assumed in this paper that "G" is the appropriately restricted set of generalizers if any such restrictions are desired.
28. With all these equivalences between U and Y, one is tempted to say that the probability of elements in Y is given by the probability of elements in U via P_Y(f ∈ F, g ∈ G, e) = P_U(f ∈ F, g(e), e), where g(e) is a particular hypothesis function. However there is no a priori reason to require that this equality holds, and in many circumstances it would violate normalization (e.g., there are many generalizers with the same function g(e) for a particular e, so summing P_Y(f, g, e) over all g might give a number greater than 1 if P_U(f, h, e) is normalized).
29. Note that for cross-validation, P(g | e) is deterministic; for fixed e, P(g | e) is a delta function over G.
30. If one insists on using a frequency count interpretation of probability, then "experimentally observed" s can be taken to mean frequency counts.
31. Note that since repeats are being allowed, if training sets were unordered then a uniform sampling distribution over X would not result in P(f, h, e_1) = P(f, h, e_2) for e_1 and e_2 of the same cardinality and both ⊂ f. For example, if training sets were unordered, then if e_1 contained 3 elements with three distinct X components whereas e_2 contained 3 elements two of which shared the same X component, a uniform sampling distribution over X would give P(f, h, e_1) / P(f, h, e_2) = 3! / 3 = 2.
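The 3! / 3 = 2 ratio can be checked directly by counting the distinct orderings realizing each unordered training set (a quick illustration of my own; the X and Y values are arbitrary):

    from itertools import permutations

    e1 = [(1, 0), (2, 0), (3, 0)]  # three distinct X components
    e2 = [(1, 0), (1, 0), (2, 0)]  # two elements share an X component
    count = lambda e: len(set(permutations(e)))
    print(count(e1), count(e2))    # 6 3: e1 is realized by twice as many
                                   # equiprobable ordered training sets

Under uniform i.i.d. sampling over X, every ordered sequence of a given length is equally probable, so the unordered e_1 is twice as probable as the unordered e_2.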
32. A sampling assumption concerns P(e | f). For example, the uniform sampling assumption introduced in section II assumes that P(e | f) is independent of e for all allowed e. The assumption {P(f, h, e_1) = P(f, h, e_2) if both e_1 and e_2 contain the same number of elements, all of which are chosen from f} is more than just a sampling assumption, however. To see this, write P(f, h, e) = P(f, h | e) × P(e) = P(f | e) × P(h | e) × P(e) (due to (A.4)) = P(e | f) × P(f) × P(h | e). We are not making a uniform sampling assumption; rather we're assuming that P(e | f) is chosen to cancel out the e-dependence of the generalizer P(h | e). This is clearly a somewhat peculiar assumption to make. Unfortunately, this assumption is not just a side-effect of appendix B's formalization of the coin-tossing argument. The coin-tossing argument presented in section I implicitly first fixes h; after this, e is chosen, and we want it to be chosen according to a uniform distribution over X. However, by assumption h is the hypothesis function output by the generalizer after training on e. Therefore knowledge of h will tell us something about what e can be, i.e., it will introduce non-uniformities in probability distributions over e. To get the distribution over e to be uniform despite knowledge of h, P(e | f) must compensate for the non-uniformity introduced by that knowledge of h.
33. This comes about by choosing the complexity measure of a function from {0, 1}^n → {0, 1} to be the binary decoding of the string of 2^n 0's and 1's defining that function. In Blumer's notation, with such a measure, guessing 0's is an Occam algorithm with α equal to 0.
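For instance (a sketch of my own, for n = 2): a function from {0, 1}^2 to {0, 1} is specified by the 2^2 = 4 bits of its output string, so its complexity is an integer between 0 and 15, and the all-zeros function is the unique function of complexity 0.

    from itertools import product

    def complexity(f, n):
        """Binary decoding of the 2^n output bits of f: {0,1}^n -> {0,1}."""
        bits = [f(x) for x in product((0, 1), repeat=n)]
        return int("".join(map(str, bits)), 2)

    n = 2
    print(complexity(lambda x: 0, n))  # 0: guessing 0's always outputs the
                                       # function of minimal complexity
    print(complexity(lambda x: 1, n))  # 15 = 2**(2**n) - 1: maximal complexity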
34. "Real world" meaning problems in which we have no direct control over the target function distribution (that distribution being set by the physical universe). This is meant to contrast with "toy world" problems where we directly construct the target function distribution.
35. For example, one of the defining features of PAC is its assuming that both the training set and the testing set were made according to the same sampling distribution π(x) over the input space. PAC's strength is that it is "distribution-free", i.e., it manages to reach conclusions without being told the precise distribution π(x). Now in the real world, without access to the algorithm used to choose them, it is impossible to determine that a particular finite training set and a particular finite testing set were created via the same sampling distribution. On the other hand, if we do have access to the algorithm used to choose the data sets, then we know the sampling distribution, in which case there's no reason to handcuff ourselves with a distribution-free formalism like PAC.
APPENDIX A
The ramifications of requiring that P(h | f, e) be independent of f.
Since P(h | f, e) is undefined for e ⊄ f when there is no noise, this appendix works with the restricted requirement that P(h | f, e) = P(h | f', e) ∀ h, e, f and f' such that f ⊃ e and f' ⊃ e. First note that this requirement follows from the requirement that P(h | f, e) = P(h | e) ∀ h, e, and f ⊃ e. Therefore to prove the equivalence of these two requirements, we must prove the converse: {P(h | f, e) = P(h | f', e) ∀ h, e, f and f' such that f ⊃ e and f' ⊃ e} ⇒ {P(h | f, e) = P(h | e) ∀ h, e, and f ⊃ e}. To prove this, define k_f as the ratio P(h | f, e) / P(h | e), where f ⊃ e. Our task is to prove that {P(h | f, e) = P(h | f', e) ∀ h, e, f and f' such that f ⊃ e and f' ⊃ e} implies that k_f is constant and equals 1. Expanding both the conditional probability in the numerator and the one in the denominator, we get

(A.1)

k_f^{-1} = [Σ_{f'} P(h, f', e) × Σ_{h'} P(h', f, e)] / [P(h, f, e) × Σ_{h',f'} P(h', f', e)].
Now rewrite {P(h | f, e) = P(h | f', e) ∀ h, e, f and f' such that f ⊃ e and f' ⊃ e} as

P(h, f, e) / Σ_{h'} P(h', f, e) = P(h, f', e) / Σ_{h'} P(h', f', e)   ∀ h, e, f and f' such that f ⊃ e and f' ⊃ e.
Plugging this into (A.1), we get

k_f^{-1} = Σ_{f'} { P(h, f, e) × Σ_{h'} P(h', f', e) / Σ_{h'} P(h', f, e) } × Σ_{h'} P(h', f, e) / [P(h, f, e) × Σ_{h',f'} P(h', f', e)].

This just equals 1, however, independent of f, which proves the supposition:

(A.2)

P(h | f, e) = P(h | e)   ∀ h, e, and f ⊃ e.
.". A similar proof, without the fj ::J e" conditions, holds when noise is allowed so that P(h I f, si defined even for
e)
e cr. f.
Now rewrite P(h | f, e) = P(h | e) as P(h, f, e) / P(f, e) = P(h, e) / P(e). This last equality can be rewritten as P(h, f, e) / P(h, e) = P(f, e) / P(e), which is equivalent to P(f | h, e) = P(f | e). Therefore

(A.3)

P(f | h, e) = P(f | e)   ∀ h, e, and f.
A number of interesting corollaries follow immediately from (A.3). For example, (A.3) means that P(f | e) × P(h | e) = P(f | h, e) × P(h | e), i.e., the joint probability P(f, h | e) factors:

(A.4)

P(f, h | e) = P(f | e) × P(h | e)   ∀ h, e, and f.
Note that in both (A.3) and (A.4), even when there is no noise we don't need to specify e ⊂ f, since for e ⊄ f, (A.3) and (A.4) both reduce to the equality 0 = 0.
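The content of (A.3) and (A.4) can also be checked numerically (a toy sanity check of my own devising, not part of the appendix): build a joint P(f, h, e) that factors as in (A.4) by construction, and confirm that P(h | f, e) = P(h | e) holds.

    import itertools, random

    random.seed(0)
    F, H, E = range(3), range(3), range(2)   # tiny index sets for f, h, e

    P_e = [0.4, 0.6]
    P_f_e = [[random.random() for _ in F] for _ in E]   # will be P(f | e)
    P_h_e = [[random.random() for _ in H] for _ in E]   # will be P(h | e)
    for table in (P_f_e, P_h_e):                        # normalize conditionals
        for row in table:
            s = sum(row)
            row[:] = [p / s for p in row]

    # Joint built so that (A.4) holds: P(f, h | e) = P(f | e) x P(h | e).
    P = {(f, h, e): P_f_e[e][f] * P_h_e[e][h] * P_e[e]
         for f, h, e in itertools.product(F, H, E)}

    for f, h, e in itertools.product(F, H, E):
        P_h_given_fe = P[(f, h, e)] / sum(P[(f, h2, e)] for h2 in H)
        P_h_given_e = (sum(P[(f2, h, e)] for f2 in F) /
                       sum(P[(f2, h2, e)]
                           for f2, h2 in itertools.product(F, H)))
        assert abs(P_h_given_fe - P_h_given_e) < 1e-12  # f-independence holds

Of course this only exercises one direction of the equivalence; the appendix's derivation of (A.2) from the weaker f'-invariance requirement is the substantive step.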
APPENDIX B
A rigorous investigation of the "coin-tossing proof" of inductive inference.
This appendix is a rigorous investigation of the "coin-tossing proof" of inductive inference outlined in the introduction. This "proof" makes at least two crucial assumptions, both of which are hard to justify from the point of view of real-world machine learning. The first is that when measuring generalization error one allows questions from within the training set. The second is that the training set is unknown; we only know how many times it agrees with a hypothesis function. However, even given these two assumptions, both implicit in the coin-tossing argument, as this appendix shows we do not recover the conclusion suggested in the introduction (i.e., these assumptions do not lead to Laplace's law of succession for generalization).

For simplicity, let Y be {0, 1} and let X be the n integers {1, 2, ..., n}. P(f, h, e) = 0 if e ⊄ f, as usual. Assume also that training sets are chosen, independently of h, according to a uniform sampling distribution over X with repeats allowed, i.e., training sets e consist of any finite ordered set of pairs (x_i ∈ X, y_i ∈ Y) (with or without repeats), and P(f, h, e_1) = P(f, h, e_2) if both e_1 and e_2 contain the same number of elements, all of which are chosen from f (see notes 31 and 32). Let the cardinality of the training set be m, as usual. Use an error function independent of e:

Er(f, h) = Σ_{x∈X} [1 − δ(f(x), h(x))] / n ≡ Σ_{i=1}^{n} [1 − δ(f_i, h_i)] / n.
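As a concrete rendering of this setup (a minimal sketch of my own; f and h are represented as tuples of their n output values), the error function is simply the fraction of X on which f and h disagree, with questions from within the training set counted too:

    def Er(f, h):
        """Er(f, h) = (1/n) * sum over x in X of [1 - delta(f(x), h(x))]."""
        return sum(1 for fi, hi in zip(f, h) if fi != hi) / len(f)

    # e.g., with n = 4:
    f = (0, 1, 1, 0)
    h = (0, 1, 0, 0)
    print(Er(f, h))  # 0.25: f and h disagree on one of the four questions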