Mixed-Initiative Active Learning


Maya Cakmak and Andrea L. Thomaz {maya,athomaz}@cc.gatech.edu School of Interactive Computing, Georgia Institute of Technology, 801 Atlantic Dr., Atlanta, GA 30318, USA

Abstract We propose a learning paradigm in which the responsibility of choosing samples is shared between the teacher and the learner. We present an experiment that demonstrates the potential of this approach with human teachers and discuss interesting related research problems.

1. Introduction

A range of Machine Learning (ML) applications involve learning from data provided by a human teacher (e.g. document/image classification, recommender systems, programming by demonstration). Improving the sample efficiency of learning algorithms is particularly important in these applications, as providing data can be a cumbersome task for the human. Active Learning (AL) has proven useful towards this objective. Although theoretical results supporting the benefits of AL have been limited (Castro & Nowak, 2006; Balcan et al., 2010), many papers published over the last decade have demonstrated the practical strength of AL with human oracles (Settles, 2010). A number of directions have been explored with the purpose of improving techniques within the AL paradigm; however, one question that has not been asked frequently is "Can we do better than AL?" An indication that we can comes from the field of Algorithmic Teaching (Balbach & Zeugmann, 2009; Goldman & Kearns, 1995), which studies the teachability of concepts. The teaching problem, unlike the learning problem, involves producing a sequence of examples from a known target concept such that the concept can be learned by a learner. Finding an optimal teaching sequence for an arbitrary concept is NP-hard (by reduction from the minimum set cover problem (Goldman & Kearns, 1995)); however, efficient algorithms have been proposed for particular concept classes. An important insight is that an active learner can never learn faster than a passive learner that is taught optimally (Goldman & Kearns, 1995; Angluin, 2004). In other words, teachability indicates an upper bound on how fast a concept can be learned. Thus we believe that the key to going beyond AL lies in "good teaching." To this end, we hope to exploit the flexibility and intelligence of human teachers.

[Presented at the ICML 2011 Workshop on Combining Learning Strategies to Reduce Label Cost, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).]

We propose a learning paradigm in which the responsibility of choosing examples is shared between the teacher and the learner: Mixed-Initiative Active Learning (MIAL). MIAL stands between passive supervised learning, where all examples are chosen by the teacher, and AL, where all examples are chosen by the learner and labeled by the teacher. In this setup, the learner receives an example even when it does not make a query. Therefore it needs to decide when to make a query, in addition to what query to make. In this paper, we first give a few illustrative examples of good teaching outperforming AL. We then overview our previous work on human teaching, which motivates a mixed-initiative approach, and present a follow-up study that demonstrates the strength of MIAL.
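The division of initiative just described can be illustrated with a toy sketch (the threshold concept, class names and method names below are our own illustrative choices, not from the paper): the teacher volunteers labeled examples, and the learner decides both when to interject a query and what to ask.

```python
# Toy sketch of the mixed-initiative protocol (our own construction):
# the teacher volunteers examples; the learner decides WHEN and WHAT to query.

class Teacher:
    """Oracle for a threshold concept on 0..9: label = (x >= 5)."""
    def __init__(self):
        self.volunteered = [(0, 0), (9, 1)]   # teacher-chosen examples
    def provide(self):
        return self.volunteered.pop(0)
    def label(self, x):
        return int(x >= 5)

class Learner:
    """Maintains an interval [lo, hi] of thresholds still consistent."""
    def __init__(self):
        self.lo, self.hi = 0, 10
    def observe(self, x, y):
        if y == 1:
            self.hi = min(self.hi, x)
        else:
            self.lo = max(self.lo, x + 1)
    def wants_query(self):
        return self.hi - self.lo > 0          # WHEN: still uncertain
    def choose_query(self):
        return (self.lo + self.hi) // 2       # WHAT: binary-search query

teacher, learner = Teacher(), Learner()
for _ in range(2):                            # teacher-initiated rounds
    x, y = teacher.provide()
    learner.observe(x, y)
while learner.wants_query():                  # learner-initiated queries
    q = learner.choose_query()
    learner.observe(q, teacher.label(q))
print(learner.lo)                             # 5: the identified threshold
```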

2. Good Teaching

In this section we discuss the potential of "good teaching" in different problem settings.

2.1. Optimal Teaching

Quantifying the teachability of concepts is a central problem in Algorithmic Teaching. A number of teachability metrics have been proposed (Natarajan, 1989; Anthony et al., 1995; Balbach, 2008), the most popular being the Teaching Dimension (Goldman & Kearns, 1995). This is the smallest number of examples needed to uniquely identify any concept in a concept class. The shortest sequence of examples that uniquely identifies a concept is referred to as an optimal teaching sequence. A major concern is finding polynomial-time algorithms that produce an optimal teaching sequence for any concept in a concept class. Such algorithms exist for certain concept classes such as conjunctions, monotone decision lists and monotone k-term DNFs.

To illustrate the strength of optimal teaching, consider the concept class of conjunctions C_n, whose teaching dimension is identified as

TD(C_n) = min(r + 2, n + 1)    (1)

where r is the number of relevant features (i.e. the number of variables in the conjunction) and n is the total number of features (Goldman & Kearns, 1995). The sample complexity of this concept class in the PAC Learning model is characterized by the inequality

(1/ε) (n ln 3 + ln(1/δ)) ≤ m    (2)

which says that the number of examples m needed to teach a conjunction on n variables to a consistent learner, such that its error is bounded by ε with probability 1 − δ, is at least the quantity specified by the left hand side of the inequality (Mitchell, 1997). For instance, the number of required examples is 96 for a desired probability that a hypothesis with at most 0.10 error will be learned, for conjunctions with n = 6. On the other hand, the optimal teaching algorithm guarantees to teach any conjunction in a sample space with 6 features with at most 7 examples. From a human teacher's perspective, this can be a significant difference.

The number of examples required by a learner to uniquely identify any concept in a concept class C with membership queries is characterized by the inequality

log2 |C| ≤ #MQ(C) ≤ |X|    (3)

where X denotes the instance space (Angluin, 2004). For conjunctions over n binary features, |X| = 2^n and |C| = 3^n, since a feature can either be irrelevant or have one of the two binary values in a hypothesis. For example, for n = 6 the number of membership queries needed to learn any concept is between 10 and 64. In this setting, the lower bound on samples to be labeled by a human teacher is reasonable and close to the optimal value. However, the range is large and this number can become unreasonable for a human as it gets closer to the upper bound. Thus, optimal teaching can also significantly reduce the number of examples needed to learn a concept, as compared to AL.

Note that a crucial assumption in this example is that there exists an efficient algorithm that achieves optimal teaching. As we will discuss in Sec. 3, our goal is to use these algorithms as teaching guidance for humans in order to make them better teachers for machine learning algorithms.

2.2. Empirically Optimal Teaching

In some cases no optimal teaching algorithms exist (e.g. for infinite concept classes and continuous instance spaces). For these cases we consider limiting the teaching problem to choosing examples from a finite set drawn from the instance space. This is a realistic setting often employed for studying AL algorithms.

As a simple example for this setting, consider learning a decision boundary in R^1, as illustrated in Figure 1. The learning problem is to find a linear separator on a line from observed examples; a consistent learner achieves this by placing the threshold between the rightmost negative and the leftmost positive example. In this setting an active learner achieves a logarithmic advantage over random sampling by always querying the unlabeled sample closest to the estimated boundary (Dasgupta, 2006). However, we can easily see that a good teacher could directly provide the two examples closest to the true decision boundary, achieving the smallest possible error on the rest of the data set. We refer to this as empirically optimal teaching. In general, empirically optimal teaching requires enumerating possible subsets of the data. However, in some cases, such as in this example, the optimal teaching set can be identified directly by going over all examples once. This is achieved by exploiting the known structure of the sample and hypothesis spaces. Human teachers can also exploit such structure when they are teaching. Note that the visualization of the data in this example makes teaching even easier for humans (i.e. in comparison to having an unsorted list of real numbers).

[Figure 1. An example of empirically optimal teaching outperforming Active Learning: finding a decision boundary in R^1 (Dasgupta, 2006).]

2.3. Heuristic Teaching

In practice, finding such structure and visualizations is not trivial, and empirically optimal teaching in polynomial time might not be possible. Nevertheless, teaching algorithms that outperform AL might exist. To discuss this scenario, we recreate the example in (Settles, 2010) that illustrates the advantage of AL over random sampling. It involves the classification of emails between two newsgroups from the 20 Newsgroups dataset, using logistic regression. The features are counts of words in the email.

[Figure 2 plot: accuracy versus number of examples for Random, Active Learning, and Heuristic teaching.]

Figure 2. Progress of accuracy over the number of examples provided to the learner, averaged over 5-fold cross-validation, in the 2-class classification of emails between two newsgroups using logistic regression.

Figure 2 shows learning curves for random sampling and uncertainty sampling, as well as for a learner trained by a greedy teacher that always presents the example providing the most improvement in classifying the rest of the data set. While this algorithm might be unrealistic for humans to use, the result shows that a data set better than the one produced with AL exists. We hypothesize that such data sets can be captured by human teachers using teaching heuristics. Such heuristics might also allow good teaching in problems where the input set is not finite (i.e. the teacher both instantiates and labels an example for the learner, rather than picking an example from a finite set). We believe that good teaching heuristics can be derived from intuition about the properties of the state space and the concept class. Another option is to inspect good teaching sets (e.g. the set that outperforms AL in Figure 2) to characterize the properties of good teaching examples at a higher level.
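The greedy teacher described above can be sketched as follows. This is our own construction under simplifying assumptions: a finite pool, a caller-supplied training routine, and a toy nearest-labeled-neighbor rule standing in for logistic regression.

```python
# Sketch (our own, not the paper's code) of the greedy teacher: at each
# step, present the labeled example whose addition most improves the
# learner's accuracy on the rest of the data set.

def greedy_teach(pool, labels, fit, steps):
    shown, shown_y = [], []
    remaining = list(range(len(pool)))
    for _ in range(steps):
        def gain(i):
            predict = fit(shown + [pool[i]], shown_y + [labels[i]])
            rest = [j for j in remaining if j != i]
            return sum(predict(pool[j]) == labels[j] for j in rest)
        best = max(remaining, key=gain)       # most helpful example
        shown.append(pool[best])
        shown_y.append(labels[best])
        remaining.remove(best)
    return shown, shown_y

# Toy demonstration: a 1-D threshold concept with a nearest-labeled-
# neighbor "classifier" standing in for logistic regression.
def fit(X, y):
    return lambda x: y[min(range(len(X)), key=lambda k: abs(X[k] - x))]

pool = [1, 2, 3, 4, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
shown, shown_y = greedy_teach(pool, labels, fit, steps=2)
predict = fit(shown, shown_y)
print(sum(predict(x) == y for x, y in zip(pool, labels)))  # 8: whole pool correct
```

After only two greedily chosen examples, the toy learner classifies the entire pool correctly.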

3. Optimality of Human Teaching

The examples in the previous section demonstrate the potential of "good teaching." As mentioned earlier, our motivation in analyzing optimal teaching algorithms and algorithmic teaching heuristics is to derive teaching strategies and heuristics that are usable by humans. We refer to this as teaching guidance. In this section we overview an experiment from our previous work (Cakmak & Thomaz, 2010) that explores the idea of teaching guidance for humans. Our experiment involves the teaching problem discussed in Sec. 2.1. An optimal teaching sequence that satisfies the bound in Equation 1 is produced by the following algorithm (Goldman & Kearns, 1995): (1) First, show one positive example. (2) Then, show another positive example in which all

Figure 3. An optimal teaching sequence for the conjunction concept HOUSE.

n − r irrelevant features are changed to the opposite value. This demonstrates that the features changed between the first two examples are irrelevant. (3) Next, show negative examples that differ from a positive example by only one feature. This is repeated for each of the r relevant features. Since only one feature is different from what is known to be a positive example, these negative examples prove that the changed feature is a relevant one.
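The three steps above can be sketched for binary features as follows (representing a conjunction as a map from relevant feature indices to required values is our own illustrative choice):

```python
# Sketch of the Goldman & Kearns-style teaching sequence from steps
# (1)-(3), for conjunctions over n binary features. Representation
# choices (dict of relevant features, 0/1 values) are our own.

def teaching_sequence(n, concept):
    """concept: {feature_index: required_value} for the r relevant features."""
    relevant = set(concept)
    # (1) a positive example: relevant features set correctly, the rest to 0
    first = [concept.get(i, 0) for i in range(n)]
    # (2) a second positive with every irrelevant feature flipped
    second = [v if i in relevant else 1 - v for i, v in enumerate(first)]
    seq = [(first, 1), (second, 1)]
    # (3) one negative per relevant feature, flipping only that feature
    for i in sorted(relevant):
        neg = list(first)
        neg[i] = 1 - neg[i]
        seq.append((neg, 0))
    return seq

# HOUSE-like concept with n = 6, r = 3: features 0, 1, 2 must be 1.
seq = teaching_sequence(6, {0: 1, 1: 1, 2: 1})
print(len(seq))   # 5 examples: matches TD = min(r + 2, n + 1)
```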

Our experiment looked at teaching conjunctions to a virtual agent. The sample space with n = 6 consists of objects composed of two pieces (top and bottom), where each piece is specified by three features (shape, color and size). An example concept with r = 3 is HOUSE, which is defined as an object that has a pink triangle top and a square bottom piece, while the other features do not matter. An optimal teaching sequence for this concept, which consists of 5 examples, is shown in Figure 3.

Our experiment compared natural human teaching with guided human teaching. In the natural teaching condition the subjects were not given any explicit instruction on how to teach, while in the guided teaching condition they were asked to follow the optimal teaching algorithm. The algorithm was described in lay terms. Our main findings from this experiment were:

• Natural human teaching is much better than random, but not spontaneously optimal.

• Teaching guidance significantly improves human teaching; however, it does not make teachers optimal.

We observed that the improvement in the guided teaching condition was mainly due to step (2) of the described algorithm. Step (1) was already carried out during natural teaching (i.e. most human teachers naturally start with a positive example). Step (3), on the other hand, was not as intuitive for our subjects. Most subjects could not keep track of which features they had already varied, and either stopped early without uniquely identifying the target or defaulted to testing the learner until they found an example about which the learner was uncertain. While this outcome points out the importance of the usability of teaching guidance, it also motivates a mixed-initiative approach.


4. Mixed-Initiative Active Learning

In the experiment described above, the part of the teaching algorithm that humans were not good at can actually be replaced with AL without losing optimality. In other words, if the learner were to make a query after the first two examples in the sequence given in Figure 3, it would choose the same (or equivalent) samples as in the rest of the sequence. The part of the algorithm that humans were good at following happens to be the part that gives the teacher the advantage in teaching efficiently. This step involves changing all irrelevant features at once. Having the knowledge of what these features are, the teacher can easily perform this step. An active learner, on the other hand, cannot risk changing more than one feature at a time, since if the resulting example is negative, it cannot infer which feature(s) were relevant.

This points towards an ideal division of labor between the teacher and the learner: the teacher should follow steps (1) and (2) of the algorithm and then the learner should make queries. In this way, the workload of the teacher is reduced to providing two carefully chosen examples and then responding to queries, while the optimal number of examples to be labeled is maintained.

We explored this idea in a follow-up experiment where the learner has the capability to make queries. The queries are triggered by the teacher, i.e. the learner only makes a query when the teacher presses an "Any questions?" button. The experiment had two groups: the active learning group was told to use queries as they wished, and the mixed-initiative group was given the first two steps of the teaching algorithm and was told to trigger and answer queries after that.
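This division of labor can be sketched for conjunctions as follows (a toy sketch with our own naming; the oracle plays the teacher's query-answering role):

```python
# Sketch (our own) of the division of labor described above: the teacher
# supplies the two positive examples from steps (1)-(2); the learner then
# queries one candidate feature at a time, recovering step (3) without
# further teacher effort.

def mial_conjunction(n, oracle, first_pos, second_pos):
    """oracle(x) -> 0/1 label; the two positives come from the teacher."""
    # Features that differ between the two positives are provably irrelevant.
    candidates = [i for i in range(n) if first_pos[i] == second_pos[i]]
    relevant = {}
    for i in candidates:                  # learner-initiated queries
        probe = list(first_pos)
        probe[i] = 1 - probe[i]           # flip exactly one feature
        if oracle(probe) == 0:            # label flips -> feature is relevant
            relevant[i] = first_pos[i]
    return relevant

concept = {0: 1, 1: 1, 2: 1}              # hypothetical target, n = 6
oracle = lambda x: int(all(x[i] == v for i, v in concept.items()))
first = [1, 1, 1, 0, 0, 0]                # teacher's step (1)
second = [1, 1, 1, 1, 1, 1]               # teacher's step (2)
print(mial_conjunction(6, oracle, first, second))  # recovers the concept
```

In total the learner sees 2 teacher-chosen examples plus r = 3 queried labels, matching the optimal 5-example sequence.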


Figure 4. Average informativeness of examples provided by human teachers in four conditions in our experiments: Natural teaching, Guided teaching, Active learning with teacher-triggered queries, Mixed-Initiative teaching. The experiments involved 10 subjects in each condition.

The results from this experiment demonstrate that the mixed-initiative approach can achieve close-to-optimal teaching. All subjects in the mixed-initiative condition fully identified the target concept, because they kept triggering queries until the learner converged to the correct hypothesis and had no further queries. In addition, the lengths of the teaching sequences provided in this condition were optimal or close to optimal. As in the previous experiment, optimal teaching did not emerge spontaneously in the active-learning condition with teacher-triggered queries.

Figure 4 compares the teaching conditions from both experiments in terms of the average informativeness of examples across subjects. We define the informativeness of an example as the ratio of the accuracy gain it provides to the maximum accuracy gain provided by any example in the current state of the learner. Thus, every example in an optimal teaching sequence has informativeness 1, and a redundant example has informativeness 0. The average informativeness in the mixed-initiative condition is close to 1 and significantly higher than in the other conditions.

5. Discussion

The mixed-initiative approach can be applied to other concept classes for which optimal teaching algorithms exist. This requires (i) identifying the parts of an optimal teaching sequence that are obtainable with active learning, and (ii) describing the rest of the algorithm in a human-friendly manner and verifying its usability. Whether other concept classes exhibit the property that led to the success of the mixed-initiative approach in our experiment is an interesting question that we plan to address in future work.

As discussed in Sec. 2, more complex scenarios require intuitive teaching heuristics, as opposed to exact teaching strategies. A potential MIAL setting with heuristics is one in which the human teacher provides a seeding set and the learner makes queries thereafter. In most practical applications of AL, the learner is seeded with a random set of examples from each class, which gives it a good place to start. However, a large variance can be observed in the learner's performance depending on the seeding set. We can therefore aim to devise heuristics that let humans pick a good seeding set, such that the active learner's performance is close to the upper end of this variance.

We emphasize the potential of this idea with the example from Sec. 2.1, in which the number of membership queries needed to learn an arbitrary conjunction with n = 6 was shown to vary between 10 and 64. This number is reduced to a guaranteed 6 queries if the learner starts with a positive example. Given a positive example, an active learner can query examples that differ from it in only one feature to find out whether that feature is relevant; repeating this for all n features, the learner converges to the target concept. Thus, a good starting point can make a big difference in AL. Arguably, the practical success of many AL algorithms is rooted in the common practice of seeding the learner with a random set of examples from each class.

In our experiment, we sidestepped the problem of deciding when to make a query (from the learner's perspective) in the mixed-initiative setting by having the teacher trigger the queries. Leaving this decision to the learner raises other interesting problems, such as whether optimality can be guaranteed, and the necessary and sufficient conditions for guaranteeing optimality, or for guaranteeing performance better than or equivalent to pure AL. We note that even in cases where MIAL does not improve upon pure AL, the mixed-initiative setting can provide a more balanced, possibly preferable user experience for the teacher, which can be crucial in certain domains (Horvitz, 1999). In a different experiment (Cakmak et al., 2010), we demonstrated that MIAL without any guidance provides the same performance gain as pure AL but is preferred from an interaction point of view. A constant stream of queries in pure AL is often found annoying and negatively affects the teacher's mental model of what has been learned. In addition, teacher-triggered queries are preferred over learner-initiated queries, as they give the teacher full control of the interaction.
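The learner-side procedure behind this argument can be sketched as follows (our own naming, not code from the paper): given one positive example, flip each feature in turn and ask for a label.

```python
def active_learn_from_positive(pos, oracle):
    """Learn a conjunction over len(pos) boolean features with exactly
    len(pos) membership queries, starting from one positive example `pos`."""
    target, queries = {}, 0
    for i in range(len(pos)):
        q = list(pos)
        q[i] = 1 - q[i]  # flip a single feature
        queries += 1
        if not oracle(q):  # flipping i made it negative => i is relevant
            target[i] = pos[i]
    return target, queries
```

For n = 6 this always takes exactly 6 queries, matching the guaranteed bound quoted above, regardless of how many features are relevant.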

References

Angluin, D. Queries revisited. Theoretical Computer Science, 313:175–194, 2004.

Anthony, M., Brightwell, G., and Shawe-Taylor, J. On specifying boolean functions by labelled examples. Discrete Applied Mathematics, 61(1):1–25, 1995.

Balbach, F.J. Measuring teachability using variants of the teaching dimension. Theoretical Computer Science, 397(1–3):94–113, 2008.

Balbach, F.J. and Zeugmann, T. Recent developments in algorithmic teaching. In 3rd Intl. Conference on Language and Automata Theory and Applications, pp. 1–18, 2009.

Balcan, M.F., Hanneke, S., and Vaughan, J. The true sample complexity of active learning. Machine Learning, 80:111–139, 2010.

Cakmak, M. and Thomaz, A.L. Optimality of human teachers for robot learners. In Proceedings of the IEEE International Conference on Development and Learning (ICDL), 2010.

Cakmak, M., Chao, C., and Thomaz, A.L. Designing interactions for robot active learners. IEEE Transactions on Autonomous Mental Development, 2(2):108–118, 2010.

Castro, R.M. and Nowak, R.D. Upper and lower error bounds for active learning. In The 44th Annual Conference on Communication, Control and Computing, 2006.

Dasgupta, S. Coarse sample complexity bounds for active learning. In Proceedings of the NIPS Workshop on Cost-Sensitive Learning, 2006.

Goldman, S.A. and Kearns, M.J. On the complexity of teaching. Computer and System Sciences, 50(1):20–31, February 1995.

Horvitz, E. Principles of mixed-initiative user interfaces. In SIGCHI Conference on Human Factors in Computing Systems, pp. 159–166, 1999.

Mitchell, T.M. Machine Learning. McGraw-Hill, 1997.

Natarajan, B.K. On learning boolean functions. In 19th Annual ACM Symp. on Theory of Computing, pp. 296–304, 1989.

Settles, B. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2010.