Machine Learning and Formal Concept Analysis
Sergei O. Kuznetsov
All-Russia Institute for Scientific and Technical Information (VINITI), Moscow
Institut für Algebra, Technische Universität Dresden

Machine Learning...

[1/79]

Contents 1. Brief historical survey 2. JSM-method 3. Learning with Pattern Structures 4. Decision trees 5. Version spaces 6. Conclusions

Machine Learning...

[2/79]

Machine learning vs. Conceptual (FCA-based) Knowledge Discovery Machine learning is “concerned with the question of how to construct computer programs that automatically improve with experience” (T. Mitchell). Conceptual (FCA-based) knowledge discovery is a “human-centered discovery process”. “Turning information into knowledge is best supported when the information with its collective meaning is represented according to the social and cultural patterns of understanding of the community whose individuals are supposed to create the knowledge.” (R. Wille)

Machine Learning...

[3/79]

Contents 1. Brief historical survey 2. JSM-method 3. Learning with Pattern Structures 4. Decision trees 5. Version spaces 6. Conclusions

Machine Learning...

[4/79]

Lattices in machine learning. Antiunification
Antiunification, in the finite term case, was introduced by G. Plotkin and J. C. Reynolds. The antiunification algorithm was studied as the least upper bound operation in a lattice of terms in:
J. C. Reynolds, Transformational Systems and the Algebraic Structure of Atomic Formulas, Machine Intelligence, vol. 5, pp. 135-151, Edinburgh University Press, 1970.





  

 

 





Antiunification (least general generalization, lgg) of two terms: if the terms have the same head symbol, t = f(t1, ..., tn) and s = f(s1, ..., sn), then lgg(t, s) = f(lgg(t1, s1), ..., lgg(tn, sn)); otherwise lgg(t, s) is a variable, with identical pairs of mismatching subterms mapped to the same variable.

Example: lgg(f(a, g(a)), f(b, g(b))) = f(x, g(x)).

Antiunification was used by Plotkin as a method of generalization:
G. D. Plotkin, A Note on Inductive Generalization, Machine Intelligence, vol. 5, pp. 153-163, Edinburgh University Press, 1970.
Later this work was extended to form a theory of inductive generalization and hypothesis formation.
Machine Learning...

[5/79]

Formal Concept Analysis [Wille 1982; Ganter, Wille 1996]

A formal context is a triple (G, M, I): a set of objects G, a set of attributes M, and a binary relation I ⊆ G × M such that gIm holds if and only if object g has the attribute m.

Derivation operators:
A' := {m ∈ M | gIm for all g ∈ A} for A ⊆ G,
B' := {g ∈ G | gIm for all m ∈ B} for B ⊆ M.

A formal concept is a pair (A, B) with A ⊆ G, B ⊆ M such that A' = B and B' = A; A is the extent and B is the intent of the concept (A, B).

The concepts, ordered by (A1, B1) ≤ (A2, B2) :⇔ A1 ⊆ A2 (equivalently, B2 ⊆ B1), form a complete lattice, called the concept lattice B(G, M, I).
Machine Learning...

[6/79]
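As an illustrative sketch (toy data, not from the slides), the derivation operators and a naive enumeration of all formal concepts of a small context given as a mapping object -> attribute set:

```python
from itertools import chain, combinations

context = {
    "apple":      {"yellow", "nonfirm", "smooth", "round"},
    "grapefruit": {"yellow", "nonfirm", "nonsmooth", "round"},
    "kiwi":       {"green", "nonfirm", "nonsmooth", "oval"},
}
objects = set(context)
attributes = set().union(*context.values())

def prime_of_objects(A):          # A' = attributes common to all objects in A
    return set.intersection(*(context[g] for g in A)) if A else set(attributes)

def prime_of_attributes(B):       # B' = objects having all attributes in B
    return {g for g in objects if set(B) <= context[g]}

def concepts():
    """All pairs (extent, intent) with extent' = intent and intent' = extent."""
    found = set()
    for B in chain.from_iterable(combinations(sorted(attributes), r)
                                 for r in range(len(attributes) + 1)):
        A = frozenset(prime_of_attributes(B))
        found.add((A, frozenset(prime_of_objects(A))))
    return found
```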







Implications and attribute exploration

An implication A → B, for A, B ⊆ M, holds in a context if every object that has all attributes from A also has all attributes from B, i.e., A' ⊆ B' (equivalently, B ⊆ A'').

Implications obey Armstrong rules:
X → X;   if X → Y, then X ∪ Z → Y;   if X → Y and Y ∪ Z → W, then X ∪ Z → W.

Learning aspects: Next Closure is an incremental algorithm for constructing implication bases. Attribute exploration is an interactive learning procedure.

Machine Learning...

[7/79]
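A minimal sketch of the Next Closure idea mentioned above, under the assumptions that M is a list fixing a linear order of the attributes and that closure is any closure operator on subsets of M (for example B -> B'' built from the derivation operators of the previous sketch):

```python
def next_closure(A, M, closure):
    """Return the lectically next closed set after A, or None if A is the last one."""
    A = set(A)
    for i in reversed(range(len(M))):
        m = M[i]
        if m in A:
            A.discard(m)
        else:
            B = closure(A | {m})
            # B is the successor iff it adds no attribute preceding m in the order
            if not any(x in B and x not in A for x in M[:i]):
                return B
    return None

def all_intents(M, closure):
    """All closed sets, starting from the closure of the empty set."""
    intents, A = [], closure(set())
    while A is not None:
        intents.append(A)
        A = next_closure(A, M, closure)
    return intents
```

With closure(B) = prime_of_objects(prime_of_attributes(B)), all_intents enumerates all concept intents of the toy context above.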

Lattice-based machine learning models. 1980s
- Closure systems, JSM-method [V. Finn, 1983]: similarity as meet operation
- Galois connections and non-minimal implication bases: CHARADE system [J. Ganascia, 1987]
- Dedekind-MacNeille closure of a generality order and implications: GRAND system [G. D. Oosthuizen, 1988]

Machine Learning...

[8/79]

Lattices in machine learning. 1990s
In the 1990s the idea of a version space was elaborated by means of logic programming within Inductive Logic Programming (ILP):
S.-H. Nienhuys-Cheng and R. de Wolf, Foundations of Inductive Logic Programming, Lecture Notes in Artificial Intelligence, 1228, 1997,
where the notion of a subsumption lattice plays an important role.
In the late 1990s the notion of a lattice of "closed itemsets" became important in the data mining community, since it helps to construct bases of association rules.

Machine Learning...

[9/79]

Contents 1. Brief historical survey 2. JSM-method 3. Learning with Pattern Structures 4. Decision trees 5. Version spaces 6. Conclusions

Machine Learning...

[10/79]

JSM-method. 1
One of the first models of machine learning that used lattices (closure systems) was the JSM-method by V. Finn.
V. K. Finn, On Machine-Oriented Formalization of Plausible Reasoning in the Style of F. Bacon and J. S. Mill, Semiotika i Informatika, 20 (1983), 35-101 [in Russian]

Method of Agreement (First canon of inductive logic): “ If two or more instances of the phenomenon under investigation have only one circumstance in common, ... [it] is the cause (or effect) of the given phenomenon.” John Stuart Mill, A System of Logic, Ratiocinative and Inductive, London, 1843

In the JSM-method, positive hypotheses are sought among intersections of positive examples given as sets of attributes; the same holds for negative hypotheses. Various additional conditions can be imposed on these intersections. Machine Learning...

[11/79]

JSM-method. 2 Logical means of the JSM-method: Many-valued many-sorted extension of the First-Order Predicate Logic with quantifiers over tuples of variable length (weak Second Order).

  

Example: formalization of Mill's Method of Agreement.
[The first-order formula shown on the slide could not be recovered from the extracted text.]
The predicate defines a closure system (w.r.t. the meet/intersection operation) generated by descriptions of positive examples. At the same time, it is a means of expressing "similarity" of objects given by attribute sets. Machine Learning...

[12/79]

FCA translation [Ganter, Kuznetsov 2000]

A target attribute ω ∉ M.

Three subcontexts of K = (G, M, I):
- positive examples: set G+ ⊆ G of objects known to have ω,
- negative examples: set G- ⊆ G of objects known not to have ω,
- undetermined examples: set Gτ ⊆ G of objects for which it is unknown whether they have the target attribute or do not have it.
This gives the positive context K+ = (G+, M, I+), the negative context K- = (G-, M, I-), and the undetermined context Kτ = (Gτ, M, Iτ).

A positive hypothesis H ⊆ M is an intent of K+ not contained in the intent g' of any negative example g ∈ G-. Negative hypotheses are defined analogously.

Machine Learning...

[13/79]
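An illustrative sketch (assumed data layout, not from the slides) of computing positive hypotheses as intents of the positive context that are not contained in any negative example intent; positive and negative are assumed to be dicts mapping object names to attribute sets:

```python
from itertools import combinations

def intents(examples):
    """All intersections of nonempty groups of example descriptions."""
    result, keys = set(), list(examples)
    for r in range(1, len(keys) + 1):
        for group in combinations(keys, r):
            result.add(frozenset.intersection(*(frozenset(examples[g]) for g in group)))
    return result

def positive_hypotheses(positive, negative):
    return {h for h in intents(positive)
            if not any(h <= frozenset(neg) for neg in negative.values())}

def minimal_hypotheses(hyps):
    return {h for h in hyps if not any(k < h for k in hyps)}
```

Applied to the scaled fruit context of the following slides, this yields {y, nf, r}, {nf, s} and {nf, nr} as the minimal positive hypotheses.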

Example of a learning context (G, M ∪ {fruit}, I)

              color    firm   smooth   form    fruit
apple         yellow   no     yes      round    +
grapefruit    yellow   no     no       round    +
kiwi          green    no     no       oval     +
plum          blue     no     yes      oval     +
toy cube      green    yes    yes      cubic    -
egg           white    yes    yes      oval     -
tennis ball   white    no     no       round    -

Machine Learning...

[14/79]

Natural scaling of the context

              w   g   y   b   f   nf   s   ns   r   nr   fruit
apple         .   .   x   .   .   x    x   .    x   .     +
grapefruit    .   .   x   .   .   x    .   x    x   .     +
kiwi          .   x   .   .   .   x    .   x    .   x     +
plum          .   .   .   x   .   x    x   .    .   x     +
toy cube      .   x   .   .   x   .    x   .    .   x     -
egg           x   .   .   .   x   .    x   .    .   x     -
tennis ball   x   .   .   .   .   x    .   x    x   .     -

Abbreviations: "g" for green, "y" for yellow, "w" for white, "b" for blue, "f" for firm, "nf" for nonfirm, "s" for smooth, "ns" for nonsmooth, "r" for round, "nr" for nonround. Machine Learning...

[15/79]

Positive Concept Lattice

[Line diagram of the concept lattice of the positive context (objects 1-4: apple, grapefruit, kiwi, plum); the drawing did not survive extraction.]
Its concepts include ({1,2,3,4}, {nf}), ({1,2}, {y, nf, r}), ({2,3}, {nf, ns}), ({3,4}, {nf, nr}), ({1,4}, {nf, s}), the object concepts ({1}, {1}'), ({2}, {2}'), ({3}, {3}'), ({4}, {4}'), and the bottom concept.
Minimal (+)-hypotheses: {y, nf, r}, {nf, s}, {nf, nr}.
Falsified (+)-generalizations: {nf} and {nf, ns}, both contained in the intent {7}' = {w, nf, ns, r} of the negative example tennis ball.

Machine Learning...

[16/79]



Classification of undetermined examples

For an undetermined example g with description g':
- If g' contains a positive and no negative hypothesis, g is classified positively (predicted to have the target attribute).
- If g' contains a negative and no positive hypothesis, g is classified negatively.
- If g' contains hypotheses of both kinds, or if g' contains no hypothesis at all, the classification is contradictory or undetermined, respectively.

For classification purposes it suffices to have all minimal (w.r.t. ⊆) hypotheses.

Machine Learning...

[17/79]
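A small sketch of the classification rule above, with hypotheses and descriptions represented as attribute sets (illustrative, not from the slides):

```python
def classify(description, pos_hypotheses, neg_hypotheses):
    """Return '+', '-', 'contradictory' or 'undetermined' for an undetermined example."""
    has_pos = any(h <= description for h in pos_hypotheses)
    has_neg = any(h <= description for h in neg_hypotheses)
    if has_pos and not has_neg:
        return "+"
    if has_neg and not has_pos:
        return "-"
    return "contradictory" if has_pos and has_neg else "undetermined"
```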

Classifying the undetermined example mango

The scaled context is extended with object 8, mango; the objects are now apple, grapefruit, kiwi, plum, toy cube, egg, tennis ball, mango (1-8), over the attributes w, g, y, b, f, nf, s, ns, r, nr and the target attribute fruit.
[The extended cross-table did not survive extraction.]

The object mango is classified positively: a minimal positive hypothesis is contained in mango's description mango', while no negative hypothesis is; for the (-)-hypotheses {w} and {f, s, nr} one has w ∉ mango' and {f, s, nr} ⊄ mango'.

Machine Learning...

[18/79]



Variations of the learning model
- allowing for a certain number of counterexamples (for hypotheses and/or classifications),
- imposing other logical conditions (e.g., of the "Difference method" of J. S. Mill): Finn's "lattice of methods",
- nonsymmetric classification (applying only (+)- or only (-)-hypotheses),
- and so on.
The invariant: hypotheses are sought among positive and negative intents.

Machine Learning...

[19/79]

Toxicology analysis by means of the JSM-method Bioinformatics, 19(2003)

V. G. Blinova, D. A. Dobrynin, V. K. Finn, S. O. Kuznetsov and E. S. Pankratova

Predictive Toxicology Challenge (PTC): workshop at the joint 5th European Conference on Knowledge Discovery in Databases (KDD'2001) and the 12th European Conference on Machine Learning (ECML'2001), Freiburg.

Organizers: Machine Learning groups of the Freiburg University, Oxford University, University of Wales.

Toxicology experts: US Environmental Protection Agency, US National Institute of Environmental and Health Standards.

Machine Learning...

[20/79]

Toxicology analysis by means of the JSM-method Bioinformatics, 19(2003)

Training Sample: data of the National Toxicology Program (NTP) with 120 to 150 positive examples and 190 to 230 negative examples of toxicity: molecular graphs with an indication of whether a substance is toxic for each of four sex/species groups: male and female mice, male and female rats.
Testing Sample: data of the Food and Drug Administration (FDA): about 200 chemical compounds with known molecular structures, whose (non)toxicity, known to organizers, was to be predicted by participants.
Participants: 12 research groups (world-wide), each with up to 4 prediction models for every sex/species group.
Evaluation: ROC diagrams

Stages of the Competition: 1. Encoding of chemical structures in terms of attributes, 2. Generation of classification rules, 3. Prediction by means of classification rules. Results of each stage were made public by the organizers. In particular, encodings of chemical structures made by a participant were made available to all participants. Machine Learning...

[21/79]

Example of Coding

[Figure: a chemical structure (molecular graph with C, H, N, O, S atoms) and its encoding; the drawing did not survive extraction.]
Linear descriptors such as 0200331 and cyclic descriptors such as 6,06 are extracted from the chemical structure; the complete list of descriptors for the example is: 6,06  0200331  1300241  2400331  0264241  0262241.

Machine Learning...

[22/79]

Some positive hypotheses

[Figure lost in extraction: molecular-graph fragments (hypotheses) together with their FCCS descriptor encodings, e.g. 0201131, 0202410, 0200021, 6,06, and the sex/species group(s) of the predictions they produced, e.g. 2 FR, 1 FR 1 MM.]

Machine Learning...

[23/79]

ROC diagrams: Rats [figure not recovered]

Machine Learning...

[24/79]

ROC diagrams: Mice [figure not recovered]

Machine Learning...

[25/79]
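Since the ROC figures are lost, here is an illustrative sketch (not from the slides) of how ROC points (false and true positive rates) are computed from a ranking of test compounds by a toxicity score; the data layout is an assumption of this sketch:

```python
def roc_points(scores, labels):
    """scores: dict id -> score (higher = more likely toxic); labels: dict id -> bool."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_pos = sum(labels[i] for i in ranked)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in ranked:                      # sweep the decision threshold down the ranking
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / max(n_neg, 1), tp / max(n_pos, 1)))
    return points
```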

Contents 1. Brief historical survey 2. JSM-method 3. Learning with Pattern Structures 4. Decision trees 5. Version spaces 6. Conclusions

Machine Learning...

[26/79]













 

Order on labeled graphs

Consider graphs with vertices labeled by elements of a set L of labels; here vertex labels are unordered (for any vertex labels l1, l2: l1 ⪯ l2 iff l1 = l2).

A graph Γ1 := ((V1, l1), E1) dominates a graph Γ2 := ((V2, l2), E2), written Γ2 ≤ Γ1, if there exists a one-to-one mapping φ: V2 → V1 such that it
- respects edges: (v, w) ∈ E2 implies (φ(v), φ(w)) ∈ E1,
- fits under labels: l2(v) ⪯ l1(φ(v)).

Example: [the example graphs did not survive extraction].

Machine Learning...

[27/79]

Semilattice on graph sets

For two graphs X and Y,
{X} ⊓ {Y} := the set of all maximal common subgraphs of X and Y (maximal w.r.t. the domination order).

Example: [the example graphs did not survive extraction].

Machine Learning... [28/79]

Meet of graph sets

For sets of graphs X = {X1, ..., Xk} and Y = {Y1, ..., Ym},
X ⊓ Y := MAX≤ ( ∪_{i,j} ({Xi} ⊓ {Yj}) ),
i.e., the set of all maximal common subgraphs of pairs of graphs from X and Y. The operation ⊓ is idempotent, commutative, and associative.

Example: [the example did not survive extraction].

Machine Learning... [29/79]













Examples

Positive examples: labeled (molecular) graphs 1, 2, 3, 4.
Negative examples: labeled graphs, among them example 6.
[The example graphs did not survive extraction.]

Machine Learning... [30/79]

Positive (semi)lattice

[Line diagram lost in extraction: the semilattice of pattern intents generated by positive examples 1, 2, 3, 4, shown together with negative example 6. Its nodes are labeled by the extents {1}, {2}, {3}, {4}, {1,2}, {2,3}, {3,4}, {1,2,3}, {2,3,4}, {1,2,3,4}, each with the corresponding set of maximal common subgraphs as pattern intent.]

Machine Learning... [31/79]

Positive lattice

[Line diagram lost in extraction: the complete lattice obtained from the semilattice of the previous slide (positive examples 1, 2, 3, 4), again shown together with negative example 6.]

Machine Learning... [32/79]

Pattern Structures [Ganter, Kuznetsov 2001]

(G, (D, ⊓), δ) is a pattern structure if
- G is a set ("set of objects"),
- (D, ⊓) is a meet-semilattice of "descriptions",
- δ: G → D is a mapping,
and the set δ(G) := {δ(g) | g ∈ G} generates a complete subsemilattice of (D, ⊓).

Possible origin of the ⊓ operation:
- a partially ordered set (P, ⪯) of "descriptions" (⪯ is a "more general than" relation),
- a set of objects G, each with a description from P,
- the (distributive) lattice of order ideals of the ordered set (P, ⪯).

Machine Learning...

[33/79]













Pattern Structures

A pattern structure is a triple (G, (D, ⊓), δ), where G is a set of "examples", (D, ⊓) is a meet-semilattice of "descriptions", and δ: G → D is a mapping of examples to descriptions.

The subsumption order: c ⊑ d :⇔ c ⊓ d = c.

Derivation operators:
A◇ := ⊓_{g ∈ A} δ(g)   for A ⊆ G,
d◇ := {g ∈ G | d ⊑ δ(g)}   for d ∈ D.

A pair (A, d) is a pattern concept of (G, (D, ⊓), δ) if A◇ = d and d◇ = A; A is the extent and d is the pattern intent. Machine Learning...

[34/79]
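A minimal sketch of these derivation operators, under the assumption that descriptions are attribute sets and the meet is set intersection (the simplest meet-semilattice of descriptions; data values are made up for illustration):

```python
from functools import reduce

delta = {1: frozenset({"a", "b", "c"}),      # example -> description
         2: frozenset({"a", "c", "d"}),
         3: frozenset({"b", "c", "d"})}

def diamond_objects(A):      # A -> meet of the descriptions of the examples in A
    return reduce(lambda x, y: x & y, (delta[g] for g in A))

def diamond_pattern(d):      # d -> all examples whose description subsumes d
    return {g for g in delta if d <= delta[g]}

A = {1, 2}
d = diamond_objects(A)                 # frozenset({'a', 'c'})
assert diamond_pattern(d) == A         # so ({1, 2}, {'a', 'c'}) is a pattern concept
```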



Pattern-based Hypotheses

G+ and G- are positive and negative examples for some goal attribute; they give two pattern structures (G+, (D, ⊓), δ) and (G-, (D, ⊓), δ).

A positive hypothesis h ∈ D is a pattern intent of (G+, (D, ⊓), δ) not subsumed by any negative example: h ⋢ δ(g) for all g ∈ G-. Negative hypotheses are defined similarly.

Machine Learning...

[35/79]

 

Projections as Approximation Tool

Motivation: complexity of computations in (D, ⊓); e.g., for graphs, testing c ⊑ d is SUBGRAPH ISOMORPHISM, i.e., NP-complete.

ψ: D → D is a projection (kernel operator) on an ordered set (D, ⊑) if ψ is
- monotone: if x ⊑ y, then ψ(x) ⊑ ψ(y),
- contractive: ψ(x) ⊑ x,
- idempotent: ψ(ψ(x)) = ψ(x).

Machine Learning...

[36/79]
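A tiny illustration (not from the slides) under the assumption that descriptions are attribute sets ordered by inclusion: the map psi(x) = x ∩ F for a fixed set F is a projection (kernel operator), and it also preserves meets.

```python
F = frozenset({"a", "b", "c"})

def psi(x):
    return x & F

x, y = frozenset({"a", "d"}), frozenset({"a", "b", "d"})
assert psi(x) <= x and psi(y) <= y        # contractive
assert psi(x) <= psi(y)                   # monotone (here x <= y)
assert psi(psi(x)) == psi(x)              # idempotent
assert psi(x & y) == psi(x) & psi(y)      # meet-preserving
```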



Projections as Approximation Tool (continued)

Example. A projection for labeled graphs: ψ_k takes a graph to the set of its k-chains (chain subgraphs with k vertices) not dominated by other k-chains of the graph.
[The illustrating figure did not survive extraction.]

Machine Learning...

[37/79]





 





Property of projections

Any projection ψ of a complete semilattice (D, ⊓) is ⊓-preserving, i.e., for any x, y ∈ D:
ψ(x ⊓ y) = ψ(x) ⊓ ψ(y).

(As before, the example projection for labeled graphs takes a graph to the set of its k-chains not dominated by other k-chains.)

Machine Learning...

[38/79]









Projections and Representation Context

[Diagram lost in extraction: graphs and their projections on one side, the lattice of graph sets and the lattice of projected graph sets on the other, related via the Basic Theorem of FCA; projecting descriptions commutes with passing to representation contexts.]

[Two cross-tables lost in extraction: the Representation Context (objects 1, ..., 7, attributes a, ..., f plus the goal attribute) and the Representation Subcontext obtained after projection.]

Machine Learning...

[39/79]

4-Projections

[Line diagram lost in extraction: the lattice of 4-projections of the patterns of positive examples 1, 2, 3, 4, shown together with negative example 6; nodes are labeled by the extents {1}, {2}, {3}, {4}, {1,2}, {2,3}, {3,4}, {1,2,3}, {2,3,4}, {1,2,3,4}.]
Machine Learning...

[40/79]



3-Projections

[Line diagram lost in extraction: the lattice of 3-projections of the positive patterns (examples 1, 2, 3, 4) with negative example 6; nodes are labeled by the extents {1}, {2}, {3}, {4}, {1,2}, {1,2,3}, {3,4}, {1,2,3,4}.]
Machine Learning...

[41/79]

2-Projections

[Line diagram lost in extraction: the lattice of 2-projections, with extents {1,2}, {3,4}, {1,2,3,4}, shown for positive examples 1, 2, 3, 4 and negative example 6.]

Machine Learning...

[42/79]

Spam filtering First successful applications of concept-based hypotheses for filtering spam: L. Chaudron and N. Maille, Generalized Formal Concept Analysis, in Proc. 8th Int. Conf. on Conceptual Structures, ICCS’2000, G. Mineau and B. Ganter, Eds., Lecture Notes in Artificial Intelligence, 1867, 2000, pp. 357-370.

Data Mining Cup (DMC, April-May 2003) http://www.data-mining-cup.de Organized by Technical University Chemnitz, European Knowledge Discovery Network, and PrudSys AG



514 participants from 199 universities from 39 countries.
Training dataset: 8000 e-mail messages, 39% qualified as spam (positive examples) and the rest (61%) as nonspam (negative examples); 832 binary attributes and one numerical attribute (ID).
Test dataset: 11177 messages.

Machine Learning...

[43/79]

Spam filtering The sixth place was taken by a model of F. Hutter (TU-Darmstadt) which combined “Naive Bayes” approach with that of concept-based (JSM-) hypotheses. This was the best model among those that did not use the first (numerical) ID attribute, which was implicit time (could be scaled ordinally). The sixteenth and seventeenth places in the competition were taken by models from TU-Darmstadt that combined concept-based (JSM-) hypotheses, decision trees, and Naive Bayes approaches using the majority vote strategy.

Machine Learning...

[44/79]

Contents 1. Brief historical survey 2. JSM-method 3. Learning with Pattern Structures 4. Decision trees 5. Version spaces 6. Conclusions

Machine Learning...

[45/79]

Decision trees Input: descriptions of positive and negative examples as sets of attribute values.





All vertices (except for the root and the leaves) are labeled by attributes, and edges are labeled by values of the attributes (e.g., 0 or 1 in the case of binary attributes); each leaf is labeled by a class (+ or -): examples with all attribute values in the path leading from the root to the leaf belong to a certain class, either + or -. Systems like ID3 [R. Quinlan 86] compute the value of the information gain (IG), or negentropy, for each vertex and each attribute not chosen in the branch above.

The algorithm sequentially extends branches of the tree by choosing an attribute with the highest information gain (one that "most strongly separates" objects from classes + and -).

Extension of a branch terminates when the next attribute value, together with the attribute values chosen before, uniquely classifies examples into one of the classes + or -. An algorithm can stop earlier to avoid overfitting. Machine Learning...

[46/79]

Entropy

In real systems (like ID3, C4.5) the next chosen attribute should maximize some information functional, e.g., information gain (IG), based on the entropy w.r.t. the target attribute:

Ent(A) = - Σ_{c ∈ {+,-}} p(c | A) · log2 p(c | A),

where + and - are the values of the target attribute and p(c | A) is the conditional sample probability (for the training set) that an object having the set of attributes A belongs to class c.

Machine Learning...

[47/79]
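An illustrative sketch (not from the slides) of entropy and information gain for examples given as (attribute set, label) pairs, applied to the scaled fruit data used earlier:

```python
from math import log2

def entropy(examples):
    n = len(examples)
    if n == 0:
        return 0.0
    p = sum(1 for _, label in examples if label == "+") / n
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(examples, attribute):
    """IG of splitting on a binary attribute (present/absent in the description)."""
    yes = [e for e in examples if attribute in e[0]]
    no = [e for e in examples if attribute not in e[0]]
    n = len(examples)
    return entropy(examples) - len(yes) / n * entropy(yes) - len(no) / n * entropy(no)

data = [({"y", "nf", "s", "r"}, "+"), ({"y", "nf", "ns", "r"}, "+"),
        ({"g", "nf", "ns", "nr"}, "+"), ({"b", "nf", "s", "nr"}, "+"),
        ({"g", "f", "s", "nr"}, "-"), ({"w", "f", "s", "nr"}, "-"),
        ({"w", "nf", "ns", "r"}, "-")]
# "w", "f" and "nf" all attain the maximal IG on this sample (cf. the next slide)
best = max({"w", "g", "y", "b", "f", "nf", "s", "ns", "r", "nr"},
           key=lambda a: information_gain(data, a))
```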

An example of a decision tree

Decision tree obtained by the IG-based algorithm:
- root: attribute w (white); w = yes: leaf "not fruit" (examples 6, 7);
- w = no: attribute f (firm); f = yes: leaf "not fruit" (example 5);
- f = no: leaf "fruit" (examples 1, 2, 3, 4).
[The tree drawing and the cross-table of the scaled context did not survive extraction.]

The tree corresponds to three implications: {w} → not fruit, {¬w, f} → not fruit, {¬w, nf} → fruit.

Note that attributes f and w have the same IG value (a similar tree with f at the root is also optimal); IG-based algorithms usually take the first attribute among those with the same value of IG.

Machine Learning...

[48/79]

An example of a decision tree (continued)

[Same decision tree and scaled context as on the previous slide.]

The closures of the implication premises make the corresponding negative and positive hypotheses.

Note that the hypothesis given by the closure of {¬w, f} is not minimal, since the minimal hypothesis {f}'' = {f, s, nr} is contained in it. The minimal hypothesis {f}'' corresponds to a decision path of an IG-optimal tree with the attribute f at the root.

Machine Learning...

[49/79]







Decision trees in FCA terms

Training data is given by the context K± = (G+ ∪ G-, M, I+ ∪ I-) with the derivation operator (·)±. In FCA terms K± is the subposition of the positive context K+ and the negative context K-.

Assumption. The set of attributes M is dichotomized: for each attribute m ∈ M there is an attribute ¬m ∈ M, a "negation" of m: g I ¬m iff not g I m.

A subset of attributes A ⊆ M is complete if for every m ∈ M one has m ∈ A or ¬m ∈ A.
A subset of attributes A ⊆ M is noncontradictory if m ∈ A implies ¬m ∉ A.

A sequence of attributes m1, ..., mk is called a decision path if {m1, ..., mk} is noncontradictory and there exists an object g such that {m1, ..., mk} ⊆ g± (i.e., there is an example with this set of attributes).

The construction of an arbitrary decision tree proceeds by sequentially choosing attributes. First we ignore the optimization aspect related to the information gain.

Machine Learning...

[50/79]



 

















Decision trees in FCA terms (continued)

A decision path m1, ..., mk is called full if the objects having the attributes m1, ..., mk are all either positive or negative examples.

A decision path m1, ..., mj is a (proper) subpath of a decision path m1, ..., mk if {m1, ..., mj} ⊆ {m1, ..., mk} ({m1, ..., mj} ⊂ {m1, ..., mk}, respectively).

A full decision path is irredundant if none of its subpaths is a full decision path. The set of all chosen attributes in a full decision path can be considered as a sufficient condition for an object to belong to a class (+ or -).

The closure of a decision path m1, ..., mk is the closure of the corresponding set of attributes, i.e., {m1, ..., mk}±±.

A decision tree is a set of full decision paths.

A sequence of concepts with decreasing extents is called a descending chain. A chain starting at the top element of the lattice is called rooted. Machine Learning...

[51/79]





Semiproduct of dichotomic scales

The semiproduct of two contexts K1 = (G1, M1, I1) and K2 = (G2, M2, I2) with disjoint attribute sets is defined by
K1 ⨝ K2 := (G1 × G2, M1 ∪ M2, ∇), where (g1, g2) ∇ m iff g1 I1 m for m ∈ M1 and g2 I2 m for m ∈ M2.

For example, the semiproduct of three dichotomic scales (with attribute pairs a, ¬a; b, ¬b; c, ¬c) looks as follows: its objects 1, ..., 8 are the eight complete noncontradictory combinations of these attributes.
[The cross-table did not survive extraction.]

Machine Learning...

[52/79]

Semiproduct of dichotomic scales (continued)

[Figure lost in extraction: the concept lattice of the semiproduct of three dichotomic scales (objects 1, ..., 8; attributes a, ¬a, b, ¬b, c, ¬c); diagram vertices are labeled by intents.]

Machine Learning...

[53/79]



Decision trees vs. semiproducts of dichotomic scales

Consider the following context K_D = (G_D, M, I_D): in terms of FCA, K_D is the semiproduct of |M|/2 dichotomic scales D_m, where each dichotomic scale D_m stands for the pair of attributes m, ¬m. The set of objects G_D is of size 2^(|M|/2), and the relation I_D is such that the set of object intents is exactly the set of complete noncontradictory subsets of attributes.

Proposition. Every decision path is a rooted descending chain in B(K_D), and every rooted descending chain consisting of concepts with nonempty extents in B(K_D) is a decision path.

Machine Learning...

[54/79]

Decision trees vs. semiproducts of dichotomic scales (continued)

To relate decision trees to the hypotheses introduced above, we consider again the contexts K+, K-, and K±. The context K± can be much smaller than K_D, because the latter always has 2^(|M|/2) objects, while the number of objects in the former is the number of examples. Also the lattice B(K±) can be much smaller than B(K_D).

Proposition. A full decision path m1, ..., mk corresponds to a rooted descending chain of the line diagram of B(K±), and the closure of each full decision path is a hypothesis, either positive or negative. Moreover, for each minimal hypothesis h there is a full irredundant path m1, ..., mk such that {m1, ..., mk}±± = h.

Machine Learning...

[55/79]

Discussion of the propositions

The propositions illustrate the difference between hypotheses and irredundant decision paths. Hypotheses correspond to "most cautious" (most specific) classifiers consistent with the data: they are least general generalizations of descriptions of positive examples (i.e., of object intents). The shortest decision paths (for which in no decision tree there exist full paths with proper subsets of attribute values) correspond to the "most courageous" (or "most discriminant") classifiers: being the shortest possible rules, they are most general generalizations of positive example descriptions.

It is not guaranteed that for a given training set there is a decision tree such that minimal hypotheses are among the closures of its paths. In general, to obtain all minimal hypotheses as closures of decision paths one needs to consider not only paths optimal w.r.t. the information gain functional.

The issues of generality of generalizations, e.g., the relation between most specific and most general generalizations, are naturally captured in terms of version spaces. Machine Learning... [56/79]

Recalling the Information Gain

For dichotomized attributes the information gain is naturally defined for a pair of attributes m, ¬m.

For a decision path m1, ..., mk and a candidate attribute m:

IG(m | m1, ..., mk) = Ent({m1, ..., mk}) - p(m | m1, ..., mk) · Ent({m1, ..., mk, m}) - p(¬m | m1, ..., mk) · Ent({m1, ..., mk, ¬m}),

where Ent(A) = - Σ_{c ∈ {+,-}} p(c | A) · log2 p(c | A), + and - are the values of the target attribute, and p(c | A) is the conditional sample probability (for the training set) that an object having the set of attributes A belongs to class c.

Machine Learning...

[57/79]



























  



Information Gain is insensitive to closure

If the derivation operator (·)± is associated with the context K±, then, by the property of the derivation operator, A ⊆ A±± and A± = (A±±)± for any A ⊆ M; hence p(c | A) = p(c | A±±) and Ent(A) = Ent(A±±).

Hence, instead of considering decision paths, one can consider their closures without affecting the values of the information gain.

If an implication A → m holds in the context K±, then an IG-based algorithm will not choose the attribute m in the branch below the chosen A (its information gain there is zero), and will not choose ¬m in the branch below the chosen A either.

In FCA terms: instead of the concept lattice B(K_D) one can consider the concept lattice B(K±), which can be much smaller.

Machine Learning...

[58/79]

Contents 1. Brief historical survey 2. JSM-method 3. Learning with Pattern Structures 4. Decision trees 5. Version spaces 6. Conclusions

Machine Learning...

[59/79]

Version spaces

T. Mitchell, Generalization as Search, Artificial Intelligence 18, no. 2, 1982.
T. Mitchell, Machine Learning, The McGraw-Hill Companies, 1997.

- An example language that describes a set E of examples;
- a classifier language that describes a set C of classifiers (elsewhere called concepts);
- sets E+ ⊆ E and E- ⊆ E of positive and negative examples of a target attribute, with E+ ∩ E- = ∅;
- a matching predicate μ(c, e): we have μ(c, e) iff e is an example of classifier c, or c matches e.
The set of classifiers is (partially) ordered by a subsumption order ⊑.

Machine Learning...

[60/79]



















 

 

Version spaces (continued)

Consistency predicate cons(c, E+, E-): cons holds if for every e ∈ E+ the matching predicate μ(c, e) holds, and for every e ∈ E- the negation ¬μ(c, e) holds.

Version space is the set of all consistent classifiers:
VS(E+, E-) := {c ∈ C | cons(c, E+, E-)}.

Learning problem: given E+ and E-, find the version space VS(E+, E-).

Classification: a classifier c ∈ VS classifies an example e positively if c matches e, otherwise it classifies it negatively. An example is n%-classified if no less than n% of the classifiers in VS classify it positively.

Machine Learning...

[61/79]

Version spaces in terms of boundary sets

T. Mitchell, Generalization as Search, Artificial Intelligence 18, no. 2, 1982.
T. Mitchell, Machine Learning, The McGraw-Hill Companies, 1997.

If every chain in the subsumption order has a minimal and a maximal element, a version space can be described by the sets of its most specific and its most general elements, i.e., by the boundary sets MIN(VS) and MAX(VS) w.r.t. the subsumption order.

Machine Learning...

[62/79]
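A brute-force illustration (not from the slides) of a version space and its boundary sets, under the assumption that classifiers are conjunctions of binary attributes encoded as frozensets, and a classifier c matches an example e iff c <= e (so adding attributes makes a classifier more specific):

```python
from itertools import combinations

M = ["w", "g", "y", "f", "nf", "s", "ns", "r", "nr"]
E_pos = [frozenset({"y", "nf", "s", "r"}), frozenset({"y", "nf", "ns", "r"})]
E_neg = [frozenset({"w", "nf", "ns", "r"})]

def consistent(c):
    return all(c <= e for e in E_pos) and not any(c <= e for e in E_neg)

version_space = [frozenset(c) for r in range(len(M) + 1)
                 for c in combinations(M, r) if consistent(frozenset(c))]

# boundary sets: most specific elements (maximal attribute sets) and most general ones
most_specific = [c for c in version_space if not any(c < d for d in version_space)]
most_general = [c for c in version_space if not any(d < c for d in version_space)]
# here most_specific == [frozenset({'y', 'nf', 'r'})] and most_general == [frozenset({'y'})]
```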







Version spaces in terms of Galois connections

Formal context K = (E, C, I):
- E is the set of examples, containing disjoint sets of observed positive and negative examples of a target attribute: E+, E- ⊆ E, E+ ∩ E- = ∅;
- C is the set of classifiers;
- the relation I corresponds to the matching predicate: for e ∈ E and c ∈ C, eIc holds iff μ(c, e) holds;
- Ī is the complementary relation: eĪc holds iff eIc does not hold.

Proposition. VS(E+, E-) is the set of classifiers matching all positive examples and none of the negative ones, i.e., the derivation of E+ w.r.t. I intersected with the derivation of E- w.r.t. Ī.

Machine Learning...

[63/79]

Corollary: Merging version spaces

For fixed E, C, μ and two sets (E1+, E1-) and (E2+, E2-) of positive and negative examples one has
VS(E1+ ∪ E2+, E1- ∪ E2-) = VS(E1+, E1-) ∩ VS(E2+, E2-).

H. Hirsh, Generalizing Version Spaces, Machine Learning 17, 5-46, 1994.

Machine Learning...

[64/79]

Corollary: Merging version spaces (with proof)

For fixed E, C, μ and two sets (E1+, E1-) and (E2+, E2-) of positive and negative examples one has
VS(E1+ ∪ E2+, E1- ∪ E2-) = VS(E1+, E1-) ∩ VS(E2+, E2-).

Proof. By the property of derivation operators, (A ∪ B)' = A' ∩ B'; applying this to E1+ ∪ E2+ (w.r.t. I) and to E1- ∪ E2- (w.r.t. Ī) and intersecting gives the claim.

H. Hirsh, Generalizing Version Spaces, Machine Learning 17, 5-46, 1994.

Machine Learning...

[65/79]

More corollaries: Classifications and closed sets

Proposition. The set of all 100%-classified examples defined by the version space VS(E+, E-) is given by the derivation of VS in the context (E, C, I), i.e., by the set of examples matched by every classifier in VS; this is a closed set of examples.

Machine Learning...

[66/79]

More corollaries: Classifications and closed sets (continued)

Proposition. The set of all 100%-classified examples defined by the version space VS(E+, E-) is given by the derivation of VS in the context (E, C, I).

Interpretation of a closed set of examples:
Proposition. If [the two conditions on E+ and E- did not survive extraction], then there cannot be any 100%-classified undetermined example.

Machine Learning...

[67/79]

More corollaries: Classifications and closed sets (continued)

Proposition. The set of examples that are classified positively by at least one element of the version space VS(E+, E-) is the complement of the set of examples rejected by all classifiers in VS. [The closed-form expression on the slide did not survive extraction.]

Machine Learning...

[68/79]

Classifier semilattices

Assumption: the set of all classifiers forms a complete semilattice (C, ⊓).

Proposition. If the classifiers, ordered by subsumption, form a complete semilattice, then the version space is a complete subsemilattice of (C, ⊓) for any sets of examples E+ and E-.

Corollary: a dual join operation is definable.

We use again pattern structures here:
B. Ganter and S. O. Kuznetsov, Pattern Structures and Their Projections, Proc. 9th Int. Conf. on Conceptual Structures, ICCS'01, G. Stumme and H. Delugach, Eds., Lecture Notes in Artificial Intelligence, 2120, 2001, pp. 129-142.

Machine Learning...

[69/79]













Pattern Structures (recalled)

A pattern structure is a triple (G, (D, ⊓), δ), where G is a set of "examples", (D, ⊓) is a meet-semilattice of "descriptions", and δ: G → D is a mapping of examples to descriptions.

The subsumption order: c ⊑ d :⇔ c ⊓ d = c.

Derivation operators:
A◇ := ⊓_{g ∈ A} δ(g)   for A ⊆ G,
d◇ := {g ∈ G | d ⊑ δ(g)}   for d ∈ D.

A pair (A, d) is a pattern concept of (G, (D, ⊓), δ) if A◇ = d and d◇ = A; A is the extent and d is the pattern intent. Machine Learning...

[70/79]



Pattern-based Hypotheses (recalled)

G+ and G- are positive and negative examples for a target attribute.

A positive hypothesis h ∈ D is a pattern intent of (G+, (D, ⊓), δ) not subsumed by any negative example: h ⋢ δ(g) for all g ∈ G-. Negative hypotheses are defined similarly.

Machine Learning...

[71/79]

Hypotheses vs. version spaces [Ganter, Kuznetsov 2003]

Definition. A positive example e ∈ E+ is hopeless iff δ(e) ⊑ δ(ē) for some negative example ē ∈ E-.
Interpretation: e has a negative counterpart ē such that every classifier which matches e also matches ē.

Theorem 1. Suppose that the classifiers, ordered by subsumption, form a complete meet-semilattice (D, ⊓), and let (E, (D, ⊓), δ) denote the corresponding pattern structure. Then the following are equivalent:
1. The version space VS(E+, E-) is not empty.
2. The meet of all positive example descriptions, (E+)◇, is a positive hypothesis.
3. There are no hopeless positive examples and there is a unique minimal positive hypothesis h_min.

In this case, h_min = (E+)◇, and the version space is a convex set in the lattice of all pattern intents ordered by subsumption, with maximal element h_min.

Machine Learning...

[72/79]



Hypotheses vs. version spaces (continued)

Theorem 2. Let E+, E- ⊆ E be sets of positive and negative examples. An element d ∈ D is a proper (positive) predictor if it is not subsumed by any negative example description, while every strictly more general element is subsumed by some negative example description.

Let H_min and P denote the sets of minimal positive hypotheses and proper positive predictors, respectively. Then the boundary sets MIN(VS(E+, E-)) and MAX(VS(E+, E-)) are expressed through H_min and P. [The exact equalities on the slide did not survive extraction.]

Machine Learning...

[73/79]

Proper predictors

[Diagram lost in extraction: the positive lattice for examples 1-4 with negative example 6, marking the minimal hypothesis, the falsified generalizations ((+)-intents subsumed by the negative example), and the proper predictors.]

Machine Learning... [74/79]

Example. Boundaries of the Version Space

If disjunction is allowed in the classifier language, then the most general consistent classifier is equivalent to the trivial generalization: the disjunction of all positive examples.
If disjunction is not allowed, then the set of most general consistent classifiers generally can be of exponential size in the number of examples.
[The worked example on this slide did not survive extraction.]
Machine Learning... [75/79]



 















Computing a Version Space

The following notation is adapted from the standard formulation of the NextClosure algorithm: we order the set of all examples linearly and, for subsets A, B and an index i, define the relations A <_i B and A < B (the lectic order) together with the operation A ⊕ i, as in NextClosure.
[The precise definitions on this slide did not survive extraction.]

Machine Learning...

[76/79]

An algorithm

If the classifiers, ordered by subsumption, form a finite meet-semilattice, then the version space can be computed as follows:
1. If the meet of all positive example descriptions is subsumed by the description of some negative example, then the version space is empty; else
2. the first element is h_min, the unique minimal positive hypothesis;
3. if c is an element of the version space, then the "next" element is next(c), obtained as the meet of c with the largest element that is greater than max(c) and that satisfies the consistency condition.
[Some details of step 3 did not survive extraction.]

Machine Learning...

[77/79]
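A small check (illustrative, same conjunctive-classifier setting as the earlier sketch, with M assumed to contain all attributes occurring in the examples): the version space is non-empty exactly when the meet (here: intersection) of all positive descriptions is not subsumed by any negative description, and that meet is then the most specific element of the version space.

```python
from itertools import combinations

def version_space(M, E_pos, E_neg):
    cons = lambda c: all(c <= e for e in E_pos) and not any(c <= e for e in E_neg)
    return [frozenset(c) for r in range(len(M) + 1)
            for c in combinations(sorted(M), r) if cons(frozenset(c))]

def check(M, E_pos, E_neg):
    vs = version_space(M, E_pos, E_neg)
    h = frozenset.intersection(*E_pos)        # meet of all positive descriptions
    assert bool(vs) == (not any(h <= e for e in E_neg))
    if vs:
        assert h in vs and all(c <= h for c in vs)
    return vs
```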



Conclusions

- Decision trees and version spaces are neatly expressed in terms of Galois connections and formal concepts.
- Under reasonable assumptions version spaces can be computed as concept lattices.
- The set of classifiers between (in the sense of the generalization order) minimal hypotheses and proper predictors can be more interesting and/or more compact than a version space, since it introduces "restricted" disjunction over minimal hypotheses.
- Generally, FCA is a convenient tool for formalizing symbolic models of machine learning based on a generality relation.

Machine Learning...

[78/79]

Thank you

Machine Learning...

[79/79]