Active Task Selection for Multi-Task Learning

Anastasia Pentina and Christoph H. Lampert
IST Austria (Institute of Science and Technology Austria)
{apentina,chl}@ist.ac.at
Abstract

In this paper we consider the problem of multi-task learning, in which a learner is given a collection of prediction tasks that need to be solved. In contrast to previous work, we give up on the assumption that labeled training data is available for all tasks. Instead, we propose an active task selection framework, where, based only on the unlabeled data, the learner can choose a, typically small, subset of tasks for which he gets some labeled examples. For the remaining tasks, which have no available annotation, solutions are found by transferring information from the selected tasks. We analyze two transfer strategies and develop generalization bounds for each of them. Based on this theoretical analysis we propose two algorithms for making the choice of labeled tasks in a principled way and show their effectiveness on synthetic and real data.
1 Introduction
In multi-task learning a learner is given a collection of prediction tasks that need to be solved. By learning all tasks jointly instead of independently he can identify similarities between the tasks and transfer information between them, thereby potentially improving the prediction quality. All existing multi-task learning approaches have in common, however, that they need at least some labeled training data from each task in order to solve it. In this paper, we study a new and more challenging setting, in which labeled data is needed only for a subset of the tasks, typically a small minority.

In practice, it is highly desirable to be able to handle this situation for prediction problems for which the fixed cost of obtaining any labels for a task is high, even when the variable cost per label is reasonable. Examples of such learning tasks are personalized classifiers for which labels are provided by end users. For example, to build a personalized speech recognition system, it would be preferable to have only a few speakers annotate a reasonable amount of data each, instead of collecting a few labeled examples from every potential user of the system. Similarly, the fixed cost of obtaining labels is high when the annotators are non-experts and first have to be trained for any task. This is, e.g., a major issue when using Amazon Mechanical Turk for data annotation: recruiting and training annotators imposes a large overhead, but afterwards many labels can be obtained within a short time and at a low cost.

Especially if the number of tasks is large and many of them are related, the labels for only a representative subset of tasks may already contain enough information to solve all of them, namely by transferring information from the labeled to the unlabeled tasks. Whether this strategy succeeds depends on the answer to two core questions: how representative is the subset of labeled tasks, and how effectively can information be transferred between the tasks? One of our contributions in this work is a theoretical approach to quantify these effects in the form of two generalization bounds.

The fact that different labeled task subsets lead to predictors of different quality suggests that it will be beneficial if the labeled subset is not arbitrary but can be chosen in a data-dependent way. We call this learning scenario active task selection and formalize it as follows: initially, for each task that should be learned only unlabeled samples are available. The learner then chooses for which of the tasks he wants to request some labels. After obtaining those, the learner constructs solutions for all tasks. We believe this is a scenario of great practical potential, but to our knowledge we are the first to formally formulate and study it.

Active task selection resembles active learning, where one also chooses objects to be labeled (though individual examples instead of tasks) and hopes for better predictions by choosing the objects of interest in an intelligent manner instead of, e.g., randomly. The multi-task setting, however, adds a second level of complication: where active learning only needs to identify one prediction function for all data, in multi-task learning each task requires its own predictor, including the unlabeled ones. Therefore, one also has to decide on a method for transferring information from labeled to unlabeled tasks, a problem typically studied in domain adaptation. In fact, the transfer
method should be taken into account already when choosing the tasks to be labeled, as different transfer methods will likely require different tasks to be labeled for optimal effectiveness.

In this paper we concentrate on two transfer methods that use the discrepancy distance [14] to quantify the similarity between unlabeled tasks. The first method solves each unlabeled task by transferring the predictor from the nearest labeled task (single-source transfer). The second method allows transferring from multiple labeled tasks (multi-source transfer). For each of them we prove a new generalization bound that upper bounds the total multi-task error by quantities depending on the labeled tasks and the way information is transferred between tasks. Using the computable quantities in the bounds as objective functions and minimizing them, we obtain principled algorithms for selecting which tasks to have labeled and for choosing predictors for all tasks, labeled as well as unlabeled.
2 Related Work
Most existing multi-task learning methods work in the fully supervised setting and rely on the idea of improving the overall prediction quality by sharing information between the tasks. For this, they either assume that the predictors for all tasks are similar to each other in some norm and exploit this fact through specific regularizers [9], or they assume that the predictors for all tasks share a common low-dimensional representation that can be learned from the data [8, 1]. Follow-up works extended and generalized these concepts, e.g. by learning the relatedness of tasks [25, 12] or by sharing only between subgroups of tasks [30, 13, 3]. However, all of the above methods require labeled data for each task, because they relate tasks to each other by means of their predictors.

Active learning has so far not found widespread use in the multi-task setting. Two works in this direction are [21, 24], which, however, use active learning on the level of training examples, not tasks. The idea of choosing tasks was used in active curriculum selection [23, 20], where the learner can influence the order in which tasks are processed. However, these methods nevertheless require annotated examples for all tasks of interest.

To transfer information between tasks, our work builds on existing results for single-source and multi-source domain adaptation [5, 14]. We choose these because they come with theoretical guarantees and therefore allow us to prove generalization bounds and derive principled algorithms. We suspect, however, that other domain adaptation techniques could also be exploited for the active task selection scenario, in particular those based on source reweighting [26], representation learning [18, 10], or semi-supervised transfer [29].
3 Preliminaries
Before we explain the details of our contribution, we introduce some notation and restate some central definitions and results from the multi-task and domain adaptation literature.
3.1 Formal setting
In the multi-task setting the learner observes a collection of prediction tasks and its goal is to learn all of them. Formally, we assume that there is a set of $T$ tasks $\{\langle D_1, f_1\rangle, \dots, \langle D_T, f_T\rangle\}$, where each task $t$ is defined by a marginal distribution $D_t$ over the input space $\mathcal{X}$ and a deterministic labeling function $f_t : \mathcal{X} \to \mathcal{Y}$. The goal of the learner is to find $T$ predictors $h_1, \dots, h_T$ in a hypothesis set $\mathcal{H} \subset \{h : \mathcal{X} \to \mathcal{Y}\}$ that minimize the average expected risk:

$$\mathrm{er}(h_1, \dots, h_T) = \frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_t(h_t), \qquad (1)$$

where $\mathrm{er}_t(h_t) = \mathbb{E}_{x\sim D_t}\, \ell(h_t(x), f_t(x))$.
In this work we concentrate on the case of binary tasks, $\mathcal{Y} = \{-1, 1\}$, and the 0/1-loss, $\ell(y_1, y_2) = 0$ if $y_1 = y_2$ and $\ell(y_1, y_2) = 1$ otherwise. However, we expect that the analysis can be extended to hold for any bounded loss function that satisfies the triangle inequality.

In the fully supervised setting the learner is given a training set of annotated examples for every task of interest. In contrast, in the active task selection scenario every task $t$ is initially represented only by a set $S_t = \{x_1^t, \dots, x_n^t\}$ of $n$ unlabeled examples sampled i.i.d. according to the marginal distribution $D_t$. Based on this data the learner is allowed to choose $k$ tasks $\{i_1, \dots, i_k\}$, and for each of them he obtains labels for a random subset $\bar S_{i_j} \subset S_{i_j}$ of $m$ points. Since for all other tasks the learner has access only to unlabeled data, in order to find predictors for them he has to transfer information from the chosen labeled tasks.
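To make the interaction protocol concrete, the following minimal sketch spells out the order in which data becomes available to the learner. All names here are our illustration; the paper specifies the setting only mathematically:

```python
# A minimal sketch of the active task selection protocol (hypothetical
# interface, not from the paper).

def active_task_selection(unlabeled, k, request_labels, select, transfer):
    """unlabeled: list of T arrays, the sets S_t of n points drawn from D_t.
    request_labels(t): oracle revealing labels for a random m-subset of S_t;
        it may be invoked for at most k distinct tasks.
    select: rule picking k tasks, allowed to see only the unlabeled data.
    transfer: rule building predictors h_1, ..., h_T from the k labeled
        tasks plus all unlabeled data."""
    chosen = select(unlabeled, k)                      # data-dependent choice
    labeled = {t: request_labels(t) for t in chosen}   # the only labels seen
    return transfer(unlabeled, labeled)                # predictors for all T tasks
```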
3.2 Background on domain adaptation
Unsupervised domain adaptation considers the problem of learning a predictor for a target domain, for which there is only unlabeled data available, using the labeled data from a different, source, domain. The success of any such method depends on how similar the source is to the target. For both methods that we consider in this work a sensible measure of similarity is provided by the notion of discrepancy:

Definition 1 (Definition 4 in [14]). The discrepancy between distributions $D_1$ and $D_2$ over $\mathcal{X}$ with respect to a hypothesis set $\mathcal{H}$ is defined as:

$$\mathrm{disc}(D_1, D_2) = \max_{h, h' \in \mathcal{H}} \left| \mathrm{er}_{D_1}(h, h') - \mathrm{er}_{D_2}(h, h') \right|, \qquad (2)$$

where $\mathrm{er}_{D_i}(h, h') = \mathbb{E}_{x\sim D_i}\, \ell(h(x), h'(x))$.

This measure has two advantages:

1. The discrepancy allows relating the performance of a hypothesis on one task to its performance on a different task:

Proposition 1 (Theorem 2 in [4]). For any two tasks $\langle D_1, f_1\rangle$ and $\langle D_2, f_2\rangle$ and any hypothesis $h \in \mathcal{H}$ the following holds:

$$\mathrm{er}_2(h) \le \mathrm{er}_1(h) + \mathrm{disc}(D_1, D_2) + \lambda_{12},$$

where $\lambda_{12} = \min_{h\in\mathcal{H}} (\mathrm{er}_1(h) + \mathrm{er}_2(h))$.

2. The discrepancy can be estimated from unlabeled samples:

Proposition 2 (Lemma 1 in [4]). Let $d$ be the VC dimension of the hypothesis set $\mathcal{H}$ and $S_1, S_2$ be two i.i.d. samples of size $n$ from $D_1$ and $D_2$ respectively. Then for any $\delta > 0$ with probability at least $1-\delta$:

$$\mathrm{disc}(D_1, D_2) \le \mathrm{disc}(S_1, S_2) + 2\sqrt{\frac{2d\log(2n) + \log(2/\delta)}{n}},$$

where $\mathrm{disc}(S_1, S_2) = \max_{h,h'\in\mathcal{H}} |\widehat{\mathrm{er}}_{S_1}(h, h') - \widehat{\mathrm{er}}_{S_2}(h, h')|$ is the empirical discrepancy between the samples and $\widehat{\mathrm{er}}_{S_i}(h, h') = \frac{1}{n}\sum_{x\in S_i} \ell(h(x), h'(x))$.
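One common practical estimate, in the spirit of [4] and of the implementation in Section 6, trains a single classifier to separate the two unlabeled samples: the harder they are to tell apart, the smaller the discrepancy. A rough sketch for linear hypotheses; the $1 - 2\,\mathrm{err}$ proxy is our simplification rather than the exact maximum over pairs $(h, h')$ in Definition 1:

```python
import numpy as np

def empirical_discrepancy(S1, S2):
    """Heuristic estimate of disc(S1, S2) for linear hypotheses and 0/1-loss:
    fit a least-squares linear classifier to separate the two unlabeled
    samples (labels +1 vs. -1).  Returns a value in [0, 1]."""
    X = np.vstack([S1, S2])
    y = np.hstack([np.ones(len(S1)), -np.ones(len(S2))])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # squared-loss surrogate, as in [4]
    err = np.mean(np.sign(X @ w) != y)          # 0/1 separation error
    return max(0.0, 1.0 - 2.0 * err)            # perfect separation -> 1
```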
4 Transfer from a single task
Probably the most straightforward method that can be used in unsupervised domain adaptation is to train a classifier on the labeled examples from the source task and directly use it on the target. In this section we examine what the optimal choice of labeled tasks would be if this transfer method is used to learn the remaining tasks.

Formally, we assume that based on the unlabeled data the learner chooses $k$ tasks and assigns each of the remaining unlabeled tasks to one of them. Each unlabeled task is solved by using the hypothesis trained on the corresponding labeled task (Figure 1). We encode such an assignment by a vector $C = (c_1, \dots, c_T)$ that has at most $k$ different components. These values correspond to the chosen labeled tasks, and $c_t$ specifies which of them is used as a source of information for the $t$-th task. The following theorem provides an upper bound on the performance of this approach:

Theorem 1. Let $d$ be the VC dimension of the hypothesis set $\mathcal{H}$, $k$ be the maximum number of tasks for which the learner may ask for labels, $S_1, \dots, S_T$ be $T$ random sets of size $n$ each, where $S_t$ consists of i.i.d. samples from $D_t$, and $\bar S_1, \dots, \bar S_T$ be their random subsets of size $m$ each, for which labels can be provided upon the learner's request. Then, for any $\delta > 0$, with probability at least $1-\delta$ uniformly for all possible choices of the assignments $C$ and the corresponding hypotheses, the following inequality holds:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_t(h_{c_t}) \le \frac{1}{T}\sum_{t=1}^{T} \widehat{\mathrm{er}}_{c_t}(h_{c_t}) + \frac{1}{T}\sum_{t=1}^{T} \mathrm{disc}(S_t, S_{c_t}) + \frac{1}{T}\sum_{t=1}^{T} \lambda_{t c_t}$$
$$+\, 2\sqrt{\frac{2d\log(2n) + 2\log(T) + \log(4/\delta)}{n}} + \sqrt{\frac{2d\log(em/d)}{m}} + \sqrt{\frac{\log(T) + \log(2/\delta)}{2m}}, \qquad (3)$$
where $\widehat{\mathrm{er}}_t(h) = \frac{1}{m}\sum_{(x,y)\in \bar S_t} \ell(h(x), y)$.

Figure 1: Schematic illustration of active task selection with transfer from a single task. Left: eight unlabeled tasks need to be solved. Center: the subset of tasks to be labeled is determined by minimizing (5), i.e. by k-medoids clustering with respect to the discrepancy distance. Right: prediction functions (black vs. white) are learned for each cluster center and transferred to the other tasks in the cluster.
Proof Sketch (the full proof can be found in Appendix B). First, we use Proposition 1 to relate the average expected risk on all tasks (1) to the expected risk on only those tasks that were chosen by the learner to be labeled:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_t(h_{c_t}) \le \frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_{c_t}(h_{c_t}) + \frac{1}{T}\sum_{t=1}^{T} \mathrm{disc}(D_t, D_{c_t}) + \frac{1}{T}\sum_{t=1}^{T} \lambda_{t c_t}. \qquad (4)$$
The statement of the theorem then follows from (4) by applying Proposition 2 to every pair of tasks, applying the standard VC bound to each individual task, and combining them by a union bound argument.

Interpretation. The left hand side of (3) is the average expected error on all $T$ tasks, the quantity that the learner would like to minimize but cannot directly compute. It is upper bounded by the sum of three complexity terms and three task-dependent terms: the training errors on the labeled tasks, the average distances to the prototypes in terms of the empirical discrepancies, and an average of $\lambda$-s. First, note that as the number of unlabeled examples per task $n$ tends to infinity, the first complexity term converges to $0$ as $1/\sqrt{n}$, showing that in this case the discrepancies between the tasks can be estimated precisely. If in addition the number of labeled examples for each of the chosen tasks $m$ tends to infinity, the remaining complexity terms converge to $0$ as $1/\sqrt{m}$ and inequality (3) reduces to (4), which is the best we can expect for the considered type of information transfer.

Like the bound in Proposition 1, the result of Theorem 1 is relative to $\lambda$. While the discrepancy captures the similarity between the marginal distributions, $\lambda$ in addition embodies the similarity between the labeling functions. For every pair of an unlabeled task $t$ and the corresponding labeled task $c_t$, $\lambda_{t c_t}$ is small if there exists a hypothesis that performs well on both of these tasks. As in unsupervised domain adaptation, this quantity cannot be estimated based on the data. However, if it is large, there is no classifier that works well on both domains and one cannot expect to find a good hypothesis for the target $t$ by training only on the source $c_t$. Note that not all $\lambda_{ij}$ have to be small for (3) to be useful, but only the ones between an unlabeled task and the labeled one it is assigned to.

The remaining two terms on the right hand side of (3), the average empirical error and the sum of discrepancies, depend on the chosen labeled tasks and can be estimated from the data. Therefore they can be used as a quality measure of the assignment $C$ and the corresponding hypotheses to guide the learner. However, the choice of which tasks to label, of course, has to be made based only on the unlabeled data. The only data-dependent part of (3) that can be evaluated at this stage and therefore used to direct this choice is the average discrepancy between the tasks with respect to the assignment $C$:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{disc}(S_t, S_{c_t}). \qquad (5)$$

This quantity can be interpreted as the k-medoids clustering risk where tasks correspond to points in a space with a (semi-)metric defined by the empirical discrepancy and labeled tasks correspond to the centers of the clusters. Therefore we propose the following strategy for active task selection with single-source transfer (ATS-SS), sketched in code below:

Algorithm 1 (ATS-SS).
1. estimate pairwise discrepancies between the tasks based on the unlabeled data
2. cluster the tasks using the k-medoids method based on the obtained empirical discrepancies
3. train classifiers for the cluster centers and transfer them to the other tasks in the corresponding clusters.
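The following sketch puts Algorithm 1 into code, assuming the pairwise empirical discrepancies have already been collected into a matrix. The swap-based local search mimics the k-medoids procedure of [19]; `request_labels` and `train_classifier` are placeholders for the labeling oracle and any supervised learner over $\mathcal{H}$:

```python
import numpy as np

def ats_ss(disc, k, request_labels, train_classifier, rng=np.random.default_rng(0)):
    """Active task selection with single-source transfer (Algorithm 1).
    disc: (T, T) matrix of empirical discrepancies; k: labeling budget.
    request_labels(i) is assumed to return the labeled sample (X_i, y_i).
    Minimizes the k-medoids risk (5) by local search, then transfers each
    medoid's classifier to the tasks assigned to it."""
    T = disc.shape[0]
    medoids = list(rng.choice(T, size=k, replace=False))
    cost = disc[:, medoids].min(axis=1).sum()
    improved = True
    while improved:                                   # swap-based local search
        improved = False
        for pos in range(k):
            for cand in range(T):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = disc[:, trial].min(axis=1).sum()
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
    assignment = [medoids[j] for j in disc[:, medoids].argmin(axis=1)]
    classifier = {i: train_classifier(*request_labels(i)) for i in medoids}
    return [classifier[c] for c in assignment]        # one predictor per task
```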
Figure 2: Schematic illustration of active task selection with transfer from multiple tasks. Left: eight unlabeled tasks need to be solved. Center: the subset of tasks to be labeled and the between-task weights are determined by minimizing (16). Right: prediction functions (black vs. white) for each task are learned using weighted combinations of the available labeled data. Sharing can occur between labeled tasks.
Since the inequality (3) holds uniformly with respect to the assignment $C$ and the corresponding hypotheses, it also holds for the output of ATS-SS. Therefore, by finding an assignment with a low value of the right hand side of (3), the learner is guaranteed to have low average expected error. The only non-controllable term, $\frac{1}{T}\sum_{t=1}^{T}\lambda_{t c_t}$, will be small in this case if tasks with similar marginal distributions, i.e. close with respect to the empirical discrepancy, are likely to have similar labeling functions. This can be seen as an analog of the "smoothness" assumption in semi-supervised learning, which states that close points are likely to have similar labels.
5 Transfer from multiple tasks
In the previous section we assumed that for solving every task the learner uses only one of the chosen labeled tasks as a source of information. However, this is not the only possibility: after the labels for the chosen tasks are obtained, the learner is in the multi-source domain adaptation setting. Potentially all $k$ labeled tasks could be used to obtain predictors for the remaining unlabeled tasks, and it might be suboptimal to use only one of them. Moreover, even for learning the labeled tasks it might be beneficial to transfer information from the other labeled tasks as well.

In order to exploit this possibility we consider an extension of the method described in the previous section to the case of multiple sources. Instead of training a classifier for the target domain based on the labeled data from only a single source, this method minimizes a convex combination of training errors on several source domains. For a set of tasks $I = \{i_1, \dots, i_k\} \subset \{1, \dots, T\}$ define:

$$\Lambda^I = \left\{ \alpha \in [0,1]^T : \sum_{i=1}^{T} \alpha_i = 1;\ \mathrm{supp}\,\alpha \subseteq I \right\} \qquad (6)$$

for $\mathrm{supp}\,\alpha = \{i \in \{1, \dots, T\} : \alpha_i \neq 0\}$. Given a weight vector $\alpha \in \Lambda^I$, the $\alpha$-weighted empirical error of a hypothesis $h \in \mathcal{H}$ is defined as follows:

$$\widehat{\mathrm{er}}_\alpha(h) = \sum_{i\in I} \alpha_i\, \widehat{\mathrm{er}}_i(h). \qquad (7)$$
We consider the setting where, in order to obtain a solution for every task $t$, the learner minimizes $\widehat{\mathrm{er}}_{\alpha^t}(h)$ for some $\alpha^t \in \Lambda^I$, where $I$ is the set of labeled tasks (Figure 2). Note that this approach reduces to the single-source transfer described in the previous section if every weight vector $\alpha^t$ has only one non-zero component. Real-valued weights, however, can potentially improve the performance, and choosing the set of tasks to label based on the k-medoids approach described in the previous section might not be optimal in this case. Therefore we develop an analog of Theorem 1 that can be used to make a principled choice.

Theorem 2. Let $d$ be the VC dimension of the hypothesis set $\mathcal{H}$, $k$ be the maximum number of tasks for which the learner may ask for labels, $S_1, \dots, S_T$ be $T$ sets of size $n$ each, where $S_i$ consists of i.i.d. samples from $D_i$, and $\bar S_1, \dots, \bar S_T$ be their random subsets of size $m$ each, for which labels would be provided upon the learner's request. Then for any $\delta > 0$, provided that the choice of labeled tasks $I = \{i_1, \dots, i_k\}$ and the weights $\alpha^1, \dots, \alpha^T \in \Lambda^I$ are fully determined by the unlabeled data only, the following inequality holds with probability at least $1-\delta$ over $S_1, \dots, S_T$ and $\bar S_1, \dots, \bar S_T$ for all possible choices of $h_1, \dots, h_T \in \mathcal{H}$:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_t(h_t) \le \frac{1}{T}\sum_{t=1}^{T} \widehat{\mathrm{er}}_{\alpha^t}(h_t) + \frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I} \alpha_i^t\, \mathrm{disc}(S_t, S_i) + \frac{A}{T}\|\alpha\|_{2,1} + \frac{B}{T}\|\alpha\|_{1,2} + C + D + \frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I} \alpha_i^t \lambda_{ti}, \qquad (8)$$

where:

$$\|\alpha\|_{2,1} = \sum_{t=1}^{T}\sqrt{\sum_{i\in I}(\alpha_i^t)^2}, \qquad \|\alpha\|_{1,2} = \sqrt{\sum_{i\in I}\Big(\sum_{t=1}^{T}\alpha_i^t\Big)^2}, \qquad A = \sqrt{\frac{2d\log(ekm/d)}{m}}, \qquad B = \sqrt{\frac{\log(4/\delta)}{2m}},$$
$$C = \sqrt{\frac{8(\log T + d\log(enT/d))}{n}} + \sqrt{\frac{2}{n}\log\frac{4}{\delta}}, \qquad D = 2\sqrt{\frac{2d\log(2n) + 2\log(T) + \log(4/\delta)}{n}}.$$
Proof Sketch (the full proof can be found in Appendix C). As for Theorem 1, we begin by bounding the average expected error over all tasks by the error on the labeled tasks:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_t(h_t) \le \frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_{\alpha^t}(h_t) + \frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I} \alpha_i^t\, \mathrm{disc}(D_t, D_i) + \frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I} \alpha_i^t \lambda_{ti}, \qquad (9)$$

where $\mathrm{er}_{\alpha^t}(h_t) = \sum_{i\in I} \alpha_i^t\, \mathbb{E}_{x\sim D_i}\, \ell(h_t(x), f_i(x))$.
In order to prove the statement of the theorem we need to relate the $\alpha$-weighted expected errors and the discrepancies between the marginal distributions in (9) to their empirical estimates. The proof consists of three steps. First, we show that, conditioned on the unlabeled data, $\frac{1}{T}\sum_{t=1}^{T}\widetilde{\mathrm{er}}_{\alpha^t}$ can be upper bounded in terms of $\frac{1}{T}\sum_{t=1}^{T}\widehat{\mathrm{er}}_{\alpha^t}$, where:

$$\widetilde{\mathrm{er}}_\alpha(h) = \sum_{i\in I} \alpha_i\, \widetilde{\mathrm{er}}_i(h) = \sum_{i\in I} \frac{\alpha_i}{n} \sum_{j=1}^{n} \ell(h(x_j^i), f_i(x_j^i)).$$
This quantity can be interpreted as a training error if the learner would receive the labels for all the samples of the chosen tasks $I$. Note that in the case $m = n$ this step is not needed and we can avoid the corresponding complexity terms. In the second step we relate the average $\alpha$-weighted expected errors to $\frac{1}{T}\sum_{t=1}^{T}\widetilde{\mathrm{er}}_{\alpha^t}$. In the third step we conclude the proof by bounding the pairwise discrepancies in terms of their empirical estimates.

Step 1. Fix the unlabeled sets $S_1, \dots, S_T$. They fully determine the choice of labeled tasks $I$ and the weights $\alpha^1, \dots, \alpha^T$. Therefore, conditioned on the unlabeled data, these quantities can be considered constant and we need a bound that holds uniformly only with respect to $h_1, \dots, h_T$. In order to simplify the notation we assume that $I = \{1, \dots, k\}$ and define:

$$\Phi(\bar S_1, \dots, \bar S_k) = \sup_{h_1, \dots, h_T} \frac{1}{T}\sum_{t=1}^{T} \big( \widetilde{\mathrm{er}}_{\alpha^t}(h_t) - \widehat{\mathrm{er}}_{\alpha^t}(h_t) \big). \qquad (10)$$
Note that one could analyze this quantity using standard techniques from Rademacher analysis if the labeled examples were sampled from the unlabeled sets i.i.d., i.e. with replacement. However, since we assume that every $\bar S_i$ is a subset of $S_i$, i.e. the labeled examples are sampled randomly without replacement, there are dependencies between the labeled examples. Therefore we utilize techniques from the literature on transductive learning [7] instead. We first apply Doob's construction to $\Phi$ in order to obtain a martingale sequence and then use McDiarmid's inequality for martingales [16]. As a result we obtain that with probability at least $1-\delta/4$ over the sampling of labeled examples:

$$\Phi \le \mathbb{E}_{\bar S_1, \dots, \bar S_k} \Phi + \frac{1}{T}\sqrt{\sum_{i=1}^{k}\Big(\sum_{t=1}^{T}\alpha_i^t\Big)^2}\,\sqrt{\frac{\log(4/\delta)}{2m}}. \qquad (11)$$

Now we need to upper bound $\mathbb{E}\,\Phi$. Using results from [28] and [11] we observe that:

$$\mathbb{E}_{\bar S_1, \dots, \bar S_k} \Phi(\bar S_1, \dots, \bar S_k) \le \mathbb{E}_{\tilde S_1, \dots, \tilde S_k} \Phi(\tilde S_1, \dots, \tilde S_k), \qquad (12)$$

where $\tilde S_i$ is a set of $m$ points sampled from $S_i$ i.i.d. with replacement (in contrast to the sampling without replacement corresponding to $\bar S_i$). This means that we can upper bound the expectation of $\Phi$ over samples with dependencies by the expectation over independent samples. By doing so, applying the symmetrization trick, and introducing Rademacher random variables, we obtain that:

$$\mathbb{E}_{\bar S_1, \dots, \bar S_k} \Phi \le \frac{1}{T}\sum_{t=1}^{T}\sqrt{\sum_{i=1}^{k}(\alpha_i^t)^2} \cdot \sqrt{\frac{2d\log(ekm/d)}{m}}. \qquad (13)$$

A combination of (11) and (13) shows that (conditioned on the unlabeled data) with probability at least $1-\delta/4$ over the sampling of labeled examples, uniformly for all choices of $h_1, \dots, h_T$, the following holds:

$$\frac{1}{T}\sum_{t=1}^{T} \widetilde{\mathrm{er}}_{\alpha^t}(h_t) \le \frac{1}{T}\sum_{t=1}^{T} \widehat{\mathrm{er}}_{\alpha^t}(h_t) + \frac{A}{T}\|\alpha\|_{2,1} + \frac{B}{T}\|\alpha\|_{1,2}. \qquad (14)$$
Step 2. Now we relate $\frac{1}{T}\sum_{t=1}^{T}\widetilde{\mathrm{er}}_{\alpha^t}$ to $\frac{1}{T}\sum_{t=1}^{T}\mathrm{er}_{\alpha^t}$. The choice of the tasks to label, $I$, the corresponding weights, $\alpha$, and the predictors, $h$, all depend on the unlabeled data. Therefore, we aim for a bound that is uniform in all three parameters. We define:

$$\Psi(S_1, \dots, S_T) = \sup_{I}\ \sup_{\alpha^1, \dots, \alpha^T \in \Lambda^I}\ \sup_{h_1, \dots, h_T} \frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{T} \alpha_i^t \big( \mathrm{er}_i(h_t) - \widetilde{\mathrm{er}}_i(h_t) \big).$$

The main instrument that we use here is a refined version of McDiarmid's inequality, which is due to [15]. It allows us to use the standard Rademacher analysis while taking into account the internal structure of the weights $\alpha^1, \dots, \alpha^T$. As a result we obtain that with probability at least $1-\delta/4$ simultaneously for all choices of tasks to be labeled $I$, weights $\alpha^1, \dots, \alpha^T \in \Lambda^I$ and hypotheses $h_1, \dots, h_T$:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_{\alpha^t}(h_t) \le \frac{1}{T}\sum_{t=1}^{T} \widetilde{\mathrm{er}}_{\alpha^t}(h_t) + C. \qquad (15)$$
Step 3. To conclude the proof we bound the pairwise discrepancies in terms of their finite sample estimates by using the same argument as for Theorem 1: we apply Proposition 2 to every pair of tasks and combine the results using a union bound argument. This yields the remaining two terms on the right hand side: the weighted average of the sample-based discrepancies and the constant $D$. Combining the result with (14) and (15) we obtain the statement of the theorem.

Interpretation. The complexity terms $C$ and $D$ behave as $O(\sqrt{d\log(nT)/n})$, while $\frac{A}{T}\|\alpha\|_{2,1} + \frac{B}{T}\|\alpha\|_{1,2}$ in the worst case of $\|\alpha\|_{2,1} = \|\alpha\|_{1,2} = T$ behaves as $O(\sqrt{d\log(km)/m})$. In order for these terms to be balanced, i.e. for the uncertainty coming from the estimation of the discrepancies not to dominate the uncertainty from the estimation of the $\alpha$-weighted risks, the number of unlabeled examples per task $n$ should be significantly (for $k \ll T$) larger than $m$. However, this is not a strong limitation under the common assumption that obtaining unlabeled examples is significantly cheaper than obtaining annotated ones.

The remaining terms in the bound of Theorem 2 can be estimated from the data (apart from the $\lambda$-term) and depend on the choice of labeled tasks $I$ and weights $\alpha^1, \dots, \alpha^T$. Therefore they can be used as a quality measure to choose data-appropriate labeled tasks and the corresponding weights and hypotheses. Since, analogously to the case of transferring from a single task, the empirical errors cannot be evaluated before observing the labels, the objective function for choosing the labeled tasks $I$ and the weights $\alpha^1, \dots, \alpha^T$ reduces to:

$$\frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I} \alpha_i^t\, \mathrm{disc}(S_t, S_i) + \frac{A}{T}\|\alpha\|_{2,1} + \frac{B}{T}\|\alpha\|_{1,2}. \qquad (16)$$

By minimizing (16) with respect to $I$ and $\alpha$ one can obtain the labeled tasks and the weights that are beneficial for solving the tasks based on the given data. Thus we propose the following algorithm for active task selection in the case of multi-source transfer (ATS-MS), sketched in code below:

Algorithm 2 (ATS-MS).
1. estimate pairwise discrepancies between the tasks based on the unlabeled data
2. choose the labeled tasks $I$ and the weights $\alpha^1, \dots, \alpha^T$ by minimizing (16)
3. for every task $t$ train a classifier by minimizing (7) using the obtained weight vector $\alpha^t$.
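A sketch of step 2 of Algorithm 2. The paper performs the support-constrained minimization of (16) with GraSP [2]; the greedy forward selection below is a simpler stand-in that we use for illustration, with the weights for a fixed support found by projected gradient descent (all function names are ours):

```python
import numpy as np

def objective16(alpha, disc, A, B):
    """Value of (16) for a full (T, T) weight matrix alpha (row t is alpha^t)."""
    T = alpha.shape[0]
    data = (alpha * disc).sum() / T                 # weighted discrepancies
    n21 = np.sqrt((alpha ** 2).sum(axis=1)).sum()   # ||alpha||_{2,1}
    n12 = np.linalg.norm(alpha.sum(axis=0))         # ||alpha||_{1,2}
    return data + A * n21 / T + B * n12 / T

def project_rows_to_simplex(V):
    """Euclidean projection of every row of V onto the probability simplex."""
    U = np.sort(V, axis=1)[:, ::-1]
    css = np.cumsum(U, axis=1) - 1.0
    ind = np.arange(1, V.shape[1] + 1)
    rho = (U - css / ind > 0).sum(axis=1)
    theta = css[np.arange(len(V)), rho - 1] / rho
    return np.maximum(V - theta[:, None], 0.0)

def fit_weights(disc, I, A, B, steps=300, lr=0.05):
    """Projected gradient descent on (16) for a fixed support I."""
    T = disc.shape[0]
    alpha = np.full((T, len(I)), 1.0 / len(I))
    D = disc[:, I]
    for _ in range(steps):
        grad = D / T
        grad += A * alpha / (np.linalg.norm(alpha, axis=1, keepdims=True) + 1e-12) / T
        col = alpha.sum(axis=0)
        grad += B * col / (np.linalg.norm(col) + 1e-12) / T
        alpha = project_rows_to_simplex(alpha - lr * grad)
    full = np.zeros((T, T))
    full[:, I] = alpha
    return full

def ats_ms_select(disc, k, A, B):
    """Greedy stand-in for GraSP: grow the labeled set one task at a time."""
    I = []
    for _ in range(k):
        scores = {j: objective16(fit_weights(disc, I + [j], A, B), disc, A, B)
                  for j in range(disc.shape[0]) if j not in I}
        I.append(min(scores, key=scores.get))
    return I, fit_weights(disc, I, A, B)
```

With the returned weights, step 3 of Algorithm 2 reduces to minimizing the $\alpha^t$-weighted empirical error (7) over the pooled labeled samples of the selected tasks.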
Note that the above algorithm fully determines the labeled tasks and the weights based only on the unlabeled data. Therefore the conditions of Theorem 2 are satisfied, and its guarantees also hold for the resulting solution.

An important aspect of the bound (8) and, consequently, of Algorithm 2 is the effect of $\|\alpha\|_{2,1}$ and $\|\alpha\|_{1,2}$. As already mentioned above, in the worst case of every $\alpha^t$ having only one non-zero component, $\|\alpha\|_{2,1}$ is equal to $T$ and does not improve the convergence rate. However, in the opposite extreme, if every $\alpha^t$ weights all the labeled tasks equally, i.e. $\alpha_i^t = 1/k$ for all $t \in \{1, \dots, T\}$ and $i \in I$, then $\|\alpha\|_{2,1} = T/\sqrt{k}$. Therefore the convergence of the corresponding term improves from $O(\sqrt{d\log(km)/m})$ to $O(\sqrt{d\log(km)/(km)})$, which is the best we can expect from having $km$ labeled examples. Thus, this term in (16) encourages the learner to use data from multiple labeled tasks for adaptation and captures the intuition that the multi-source approach may improve the performance.

The role of the second norm, $\|\alpha\|_{1,2}$, is less transparent. While $\|\alpha\|_{2,1}$ influences every $\alpha^t$ independently, $\|\alpha\|_{1,2}$ connects the weights for all tasks $t$. It is equal to $T/\sqrt{k}$ when every labeled task $i \in I$ is used equally much, i.e. $\sum_{t=1}^{T} \alpha_i^t = T/k$ for every $i \in I$. Therefore, this term suggests selecting tasks that would all be equally useful, thus preventing the labeling of tasks that would be useful for only a few of the remaining ones.
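As a quick sanity check of the two extremes discussed above:

```latex
% Uniform weights \alpha_i^t = 1/k for all t and all i \in I:
\|\alpha\|_{2,1} = \sum_{t=1}^{T}\sqrt{\textstyle\sum_{i\in I}(1/k)^2}
                 = T\sqrt{k/k^2} = \frac{T}{\sqrt{k}}, \qquad
\|\alpha\|_{1,2} = \sqrt{\textstyle\sum_{i\in I}\big(\sum_{t=1}^{T} 1/k\big)^2}
                 = \sqrt{k\,(T/k)^2} = \frac{T}{\sqrt{k}}.
% Single-source weights (one nonzero entry per \alpha^t) give
% \|\alpha\|_{2,1} = T and, if all tasks share one source, \|\alpha\|_{1,2} = T.
```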
6 Experiments
We benchmark the performance of the proposed algorithms by experiments on synthetic and real data. In both cases we choose $\mathcal{H}$ to be the set of all linear predictors without a bias term on $\mathcal{X} = \mathbb{R}^d$.

Synthetic data. We generate $T = 1000$ binary classification tasks in $\mathbb{R}^2$. For each task $t$ its marginal distribution $D_t$ is a unit-variance Gaussian with mean $\mu_t$ chosen uniformly at random from the set $[-5, 5]\times[-5, 5]$. The label $+1$ is assigned to all points that have an angle between $0$ and $\pi$ with $\mu_t$ (computed counter-clockwise); the other points are labeled $-1$. We use $n = 1000$ unlabeled and $m = 100$ labeled examples per task and augment all samples with a constant feature, resulting in $d = 3$.

Real data. We use the train part of the ImageNet ILSVRC2010 dataset [22], which consists of approximately 1.2 million natural images from 1000 classes. We extract features using a deep convolutional neural network [27] that was pretrained on a different dataset (MIT Places), reduce their dimension to 5 using PCA, and augment them with a constant feature, resulting in $d = 6$. We construct 999 balanced binary tasks of classifying the largest class, Yorkshire terrier, versus one of the remaining classes. We use $n = 400$ unlabeled samples per task and label a subset of $m = 100$ examples for each of the selected tasks.

Methods. We report results for the proposed active task selection method in the single-source transfer setting (ATS-SS) as well as the multi-source transfer setting (ATS-MS). Since no earlier methods for multi-task learning with unlabeled tasks exist, we compare in both cases to the natural baseline of choosing the labeled tasks randomly and then applying the same adaptation method, i.e. each unlabeled task is solved using the predictor from the closest labeled task (RTS-SS), or by training on a task-specific weighted combination of labeled tasks (RTS-MS) with only the weights obtained by minimizing (16). We found that randomizing also the step of assigning unlabeled to labeled tasks leads to essentially random classification results, and we do not evaluate it formally.

To provide context for the results we also report the results of learning independent ridge regressions with access to labels for all tasks (denoted by Fully Labeled). However, this baseline has access to many more annotated examples in total than the active and random task selection methods. In order to quantify this effect we also evaluate a modification of the Fully Labeled baseline that has fewer labels per task available (denoted by Partially Labeled): when the number of labeled tasks is $k$, the number of labels per task is $mk/T$, i.e. the total number of labeled examples is $mk$, the same as for the active and random task selection methods. To avoid the need for heuristic choices, we report results for this baseline only for integer values of $mk/T$.

Implementation. We estimate the empirical discrepancies between pairs of tasks (step 1 in Algorithms 1 and 2) by finding a hypothesis in $\mathcal{H}$ that minimizes the squared loss for the binary classification problem of separating the two sets of instances, as in [4]. To minimize the k-medoids risk (step 2 in Algorithm 1) we perform a local search as in [19]. For the corresponding minimization of (16) in Algorithm 2 we use the GraSP algorithm [2]. It requires as a subroutine a method for optimizing the objective with respect to a given sparsity pattern, for which we use gradient descent.
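For concreteness, a sketch of the synthetic task generator described above (our reading of the construction; the exact code is not part of the paper):

```python
import numpy as np

def make_synthetic_tasks(T=1000, n=1000, rng=np.random.default_rng(0)):
    """T binary tasks in R^2: unit-variance Gaussian marginals with means in
    [-5, 5]^2; points at an angle in (0, pi) from mu (counter-clockwise) get
    label +1.  A constant feature is appended, so d = 3."""
    tasks = []
    for _ in range(T):
        mu = rng.uniform(-5, 5, size=2)
        X = rng.normal(loc=mu, scale=1.0, size=(n, 2))
        # the sign of the 2D cross product mu x X tells whether a point lies
        # counter-clockwise of mu, i.e. at an angle in (0, pi)
        y = np.where(mu[0] * X[:, 1] - mu[1] * X[:, 0] > 0, 1, -1)
        X = np.hstack([X, np.ones((n, 1))])       # constant feature, d = 3
        tasks.append((X, y))
    return tasks
```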
To obtain classifiers for the individual tasks in all scenarios we use least-squares ridge regression with a regularization constant from the set $\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^2, 10^3\}$, found by $5\times 5$-fold cross-validation.

Results. The results are shown in Figure 3. One can see that choosing labeled tasks by minimizing (5) followed by single-source transfer is advantageous over a random choice, especially when the budget only allows for a small fraction of tasks to be labeled. The same observation holds for multi-source transfer: choosing labeled tasks and weights using (16) is preferable to choosing a random subset of labeled tasks. In both cases, the proposed methods require substantially fewer labeled tasks to achieve the same accuracy as the baseline of randomly chosen tasks.
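The per-task learners are plain regularized least squares. A minimal version of the training step (the cross-validation loop over the regularization grid is omitted):

```python
import numpy as np

def ridge_train(X, y, lam=1.0):
    """Least-squares ridge solution w = (X'X + lam*I)^{-1} X'y; the learned
    linear predictor classifies via sign(x @ w)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```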
[Figure 3, four panels plotting test error in % against % of labeled tasks: (a) Synthetic Data, single-source (SS) Transfer; (b) Synthetic Data, multi-source (MS) Transfer; (c) ILSVRC2010, single-source (SS) Transfer; (d) ILSVRC2010, multi-source (MS) Transfer. Each panel compares Fully Labeled, Partially Labeled, ATS and RTS.]
Figure 3: Experimental results on synthetic and real data: average test error and standard deviation over 100 repeats (synthetic) or 20 repeats (real) for the proposed active task selection (ATS) and random task selection (RTS), as well as the fully supervised and partially labeled baselines.

The difference between the proposed methods and the Partially Labeled baseline is even larger than that to the random baseline, indicating that in this case having more labels for fewer tasks, rather than only a few labels for all tasks, is beneficial not only in terms of annotation cost but also in terms of prediction accuracy. As the number of labeled tasks gets larger, e.g. half of all tasks, the performance of the active task selection learner becomes almost identical to the performance of the fully supervised method, even improving over it in the case of multi-source transfer on synthetic data. This confirms the intuition that in the case of many related tasks even a fraction of the tasks can contain enough information for solving all of them.
7 Conclusion
In this work we introduced and studied the active task selection framework: a modification of the multi-task learning setting inspired by the active learning paradigm. While initially all tasks are represented only by unlabeled training samples, the learner, given a budget, decides for which tasks of interest to query labels and solves the remaining tasks based only on their unlabeled data and information transferred from the selected tasks. We analyzed this framework for two domain adaptation methods and established generalization bounds that can be used to make the choice of labeled tasks in a principled way. We also provided an empirical evaluation that demonstrates the advantages of the proposed methods.

For future work we plan to further exploit the idea of active learning in application to the multi-task setting. In particular, we are interested in seeing whether, by allowing the learner to make his decision on which tasks to label in an iterative way rather than forcing him to choose all the tasks at the same time, one could obtain better learning guarantees as well as more effective learning methods.
Acknowledgments

The authors would like to thank Marius Kloft for helpful discussions. This work was funded in parts by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no 308036. A Tesla K40 card used for this research was donated by the NVIDIA Corporation.
References

[1] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning (ML), 2008.
[2] Sohail Bahmani, Bhiksha Raj, and Petros T. Boufounos. Greedy sparsity-constrained optimization. Journal of Machine Learning Research (JMLR), 14(1), 2013.
[3] Aviad Barzilai and Koby Crammer. Convex multi-task learning by clustering. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
[4] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning (ML), 2010.
[5] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Conference on Neural Information Processing Systems (NIPS), 2007.
[6] Joseph L. Doob. Regularity properties of certain families of chance variables. Transactions of the American Mathematical Society, 47(3), 1940.
[7] Ran El-Yaniv and Dmitry Pechyony. Transductive Rademacher complexity and its applications. In Workshop on Computational Learning Theory (COLT), 2007.
[8] A. Evgeniou and Massimiliano Pontil. Multi-task feature learning. In Conference on Neural Information Processing Systems (NIPS), 2007.
[9] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2004.
[10] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In International Conference on Machine Learning (ICML), 2011.
[11] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 1963.
[12] Zhuoliang Kang, Kristen Grauman, and Fei Sha. Learning with whom to share in multi-task feature learning. In International Conference on Machine Learning (ICML), 2011.
[13] Abhishek Kumar and Hal Daumé III. Learning task grouping and overlap in multi-task learning. In International Conference on Machine Learning (ICML), 2012.
[14] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Workshop on Computational Learning Theory (COLT), 2009.
[15] Andreas Maurer. Concentration inequalities for functions of independent variables. Random Structures and Algorithms, 29, 2006.
[16] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics. Cambridge University Press, 1989.
[17] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
[18] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks (T-NN), 22(2), 2011.
[19] Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications, 36(2), 2009.
[20] Anastasia Pentina, Viktoriia Sharmanska, and Christoph H. Lampert. Curriculum learning of multiple tasks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[21] Roi Reichart, Katrin Tomanek, Udo Hahn, and Ari Rappoport. Multi-task active learning for linguistic annotations. In Conference of the Association for Computational Linguistics (ACL), 2008.
[22] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 2015.
[23] Paul Ruvolo and Eric Eaton. Active task selection for lifelong machine learning. In Conference on Artificial Intelligence (AAAI), 2013.
[24] Avishek Saha, Piyush Rai, Hal Daumé III, and Suresh Venkatasubramanian. Active online multitask learning. In ICML Workshop on Budget Learning, 2010.
[25] Avishek Saha, Piyush Rai, Hal Daumé III, and Suresh Venkatasubramanian. Online learning of multiple tasks and their relationships. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
[26] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2), 2000.
[27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[28] I. Tolstikhin, G. Blanchard, and M. Kloft. Localized complexities for transductive learning. In Workshop on Computational Learning Theory (COLT), 2014.
[29] Dikan Xing, Wenyuan Dai, Gui-Rong Xue, and Yong Yu. Bridged refinement for transfer learning. In Knowledge Discovery in Databases (PKDD), 2007.
[30] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research (JMLR), 8:35–63, 2007.
A Technical Lemmas
By using a union bound argument we generalize Proposition 2 to hold for all $T(T-1)/2$ pairwise discrepancies simultaneously:

Corollary 1. Let $d$ be the VC dimension of the hypothesis class $\mathcal{H}$, $D_1, \dots, D_T$ be $T$ probability distributions over $\mathcal{X}$, and $S_1, \dots, S_T$ be the corresponding sample sets of i.i.d. examples of size $n$ each. Then for any $\delta > 0$ with probability at least $1-\delta$ simultaneously for all $i, j = 1, \dots, T$:

$$\mathrm{disc}(D_i, D_j) \le \mathrm{disc}(S_i, S_j) + 2\sqrt{\frac{2d\log(2n) + 2\log(T) + \log(2/\delta)}{n}}. \qquad (17)$$
Lemma 1 (Corollary 3.4 in [17]). Let $d$ be the VC dimension of the hypothesis set $\mathcal{H}$ and $S$ be a random sample of $m$ examples from some unknown distribution over $\mathcal{X}\times\mathcal{Y}$. Then for any $\delta > 0$ with probability at least $1-\delta$ for any $h \in \mathcal{H}$:

$$\mathrm{er}(h) \le \widehat{\mathrm{er}}(h) + \sqrt{\frac{2d\log(em/d)}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}}. \qquad (18)$$

Lemma 2 (Corollary 6.10 in [16]). Let $W_0^n$ be a martingale with respect to a sequence of random variables $(B_1, \dots, B_n)$. Let $b_1^n = (b_1, \dots, b_n)$ be a vector of possible values of the random variables $B_1, \dots, B_n$. Let

$$r_i(b_1^{i-1}) = \sup_{b_i}\{W_i : B_1^{i-1} = b_1^{i-1}, B_i = b_i\} - \inf_{b_i}\{W_i : B_1^{i-1} = b_1^{i-1}, B_i = b_i\}. \qquad (19)$$

Let $r^2(b_1^n) = \sum_{i=1}^{n} (r_i(b_1^{i-1}))^2$ and $\hat R^2 = \sup_{b_1^n} r^2(b_1^n)$. Then

$$\Pr_{B_1^n}\{W_n - W_0 > \epsilon\} < \exp\left(-\frac{2\epsilon^2}{\hat R^2}\right). \qquad (20)$$
Lemma 3 (Part of Lemma 19 in [28]). Let $x = (x_1, \dots, x_l) \in \mathbb{R}^l$. Then the following function is convex:

$$F(x) = \sup_{i=1,\dots,l} x_i. \qquad (21)$$

Lemma 4 (Originally [11]; in this form Theorem 18 in [28]). Let $\{U_1, \dots, U_m\}$ and $\{W_1, \dots, W_m\}$ be sampled uniformly from a finite set of $d$-dimensional vectors $\{v_1, \dots, v_N\} \subset \mathbb{R}^d$ with and without replacement, respectively. Then for any continuous and convex function $F : \mathbb{R}^d \to \mathbb{R}$ the following holds:

$$\mathbb{E}\left[F\left(\sum_{i=1}^{m} W_i\right)\right] \le \mathbb{E}\left[F\left(\sum_{i=1}^{m} U_i\right)\right]. \qquad (22)$$

Lemma 5 (Theorem 1 in [15]). Let $X_1, \dots, X_n$ be independent random variables taking values in the set $\mathcal{X}$ and $f$ be a function $f : \mathcal{X}^n \to \mathbb{R}$. For any $x = (x_1, \dots, x_n) \in \mathcal{X}^n$ and $y \in \mathcal{X}$ define:

$$x^{y,k} = (x_1, \dots, x_{k-1}, y, x_{k+1}, \dots, x_n), \qquad (\inf_k f)(x) = \inf_{y\in\mathcal{X}} f(x^{y,k}), \qquad \Delta_{+,f} = \sum_{k=1}^{n} (f - \inf_k f)^2.$$

Then for $t > 0$:

$$\Pr\{f - \mathbb{E} f \ge t\} \le \exp\left(\frac{-t^2}{2\|\Delta_{+,f}\|_\infty}\right). \qquad (23)$$
B Proof of Theorem 1
First, we note that for any $I = \{i_1, \dots, i_k\}$, any $c = (c_1, \dots, c_T)$ and any $(h_i)_{i\in I}$ the following consequence of Proposition 1 holds:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_t(h_{c_t}) \le \frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_{c_t}(h_{c_t}) + \frac{1}{T}\sum_{t=1}^{T} \mathrm{disc}(D_t, D_{c_t}) + \frac{1}{T}\sum_{t=1}^{T} \lambda_{t c_t}. \qquad (24)$$

Next, we bound the expected error terms on the right hand side by their empirical counterparts. For this, we apply a union bound argument to Lemma 1 and obtain that with probability at least $1-\delta/2$ simultaneously for all $i = 1, \dots, T$ and all $h_i \in \mathcal{H}$ the following inequality holds:

$$\mathrm{er}_i(h_i) \le \widehat{\mathrm{er}}_i(h_i) + \sqrt{\frac{2d\log(em/d)}{m}} + \sqrt{\frac{\log(T) + \log(2/\delta)}{2m}}. \qquad (25)$$

To bound the discrepancy, $\mathrm{disc}(D_t, D_{c_t})$, by its empirical version, $\mathrm{disc}(S_t, S_{c_t})$, we use Corollary 1 and obtain that with probability $1-\delta/2$ for all $i, j = 1, \dots, T$:

$$\mathrm{disc}(D_i, D_j) \le \mathrm{disc}(S_i, S_j) + 2\sqrt{\frac{2d\log(2n) + 2\log(T) + \log(4/\delta)}{n}}. \qquad (26)$$

The statement of the theorem follows from inserting (25) and (26) into (24), noting that the two high probability statements do not depend on the choice of the assignments $C$. Note that we intentionally use only elementary tools in the proof to showcase the main steps of relating the source to the target error using discrepancies and bounding the uncertainty due to the random sampling of the empirical quantities. By a more careful analysis, some of the $\log T$ terms could be avoided and the constants refined.
C Proof of Theorem 2
As in the proof of Theorem 1, we bound the multi-task error by the errors on the source tasks and transition to empirical quantities while keeping the effect of random sampling controlled. However, the steps will be more involved, since we now require the bounds to be uniform also in the (continuous) weights $\alpha$, so we cannot rely on simple union bounds. For the first step, we establish the following result (inspired by Theorem 4 in [4]):

Lemma 6. Let $\langle D_1, f_1\rangle, \dots, \langle D_T, f_T\rangle$ be $T$ tasks and $I = \{i_1, \dots, i_k\}$. Then, for every $t = 1, \dots, T$ the following inequality holds for all $h \in \mathcal{H}$ and all $\alpha \in \Lambda^I$:

$$\mathrm{er}_t(h) \le \mathrm{er}_\alpha(h) + \sum_{i\in I} \alpha_i \big(\lambda_{it} + \mathrm{disc}(D_i, D_t)\big). \qquad (27)$$
Proof. For fixed $t$, let $h_i^* \in \arg\min_{h\in\mathcal{H}} (\mathrm{er}_t(h) + \mathrm{er}_i(h))$.¹ Writing $\ell(h, h')$ as shorthand for $\ell(h(x), h'(x))$, we have

$$|\mathrm{er}_\alpha(h) - \mathrm{er}_t(h)| = \Big|\sum_{i\in I} \alpha_i\, \mathrm{er}_i(h) - \mathrm{er}_t(h)\Big| \le \sum_{i\in I} \alpha_i \big|\mathrm{er}_i(h) - \mathrm{er}_t(h)\big| \qquad (28)$$
$$\le \sum_{i\in I} \alpha_i \Big( \big|\mathrm{er}_i(h) - \mathbb{E}_{x\sim D_i} \ell(h, h_i^*)\big| + \big|\mathbb{E}_{x\sim D_i} \ell(h, h_i^*) - \mathbb{E}_{x\sim D_t} \ell(h, h_i^*)\big| + \big|\mathrm{er}_t(h) - \mathbb{E}_{x\sim D_t} \ell(h, h_i^*)\big| \Big). \qquad (29)$$

We can bound each summand: $|\mathrm{er}_i(h) - \mathbb{E}_{x\sim D_i} \ell(h, h_i^*)| \le \mathrm{er}_i(h_i^*)$, because $\mathrm{er}_i(h) = \mathbb{E}_{x\sim D_i} \ell(f_i, h)$ and $\ell$ fulfills the triangle inequality; $|\mathbb{E}_{x\sim D_i} \ell(h, h_i^*) - \mathbb{E}_{x\sim D_t} \ell(h, h_i^*)| \le \mathrm{disc}(D_i, D_t)$ by the definition of the discrepancy; and $|\mathrm{er}_t(h) - \mathbb{E}_{x\sim D_t} \ell(h, h_i^*)| \le \mathrm{er}_t(h_i^*)$ by the same reasoning as for the first summand. Therefore,

$$|\mathrm{er}_\alpha(h) - \mathrm{er}_t(h)| \le \sum_{i\in I} \alpha_i \big(\mathrm{er}_i(h_i^*) + \mathrm{disc}(D_i, D_t) + \mathrm{er}_t(h_i^*)\big) = \sum_{i\in I} \alpha_i \big(\lambda_{it} + \mathrm{disc}(D_i, D_t)\big). \qquad (30)$$

¹ If the minimum is not attained, the same inequality follows by an argument of arbitrarily close approximation.
Assuming that each task $t$ has its own weights $\alpha^t$, we obtain as a direct corollary of the previous result:

Corollary 2. Let $\langle D_1, f_1\rangle, \dots, \langle D_T, f_T\rangle$ be $T$ tasks and $I = \{i_1, \dots, i_k\}$. Then the following inequality holds for all $h_1, \dots, h_T \in \mathcal{H}$ and all $\alpha^1, \dots, \alpha^T \in \Lambda^I$:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_t(h_t) \le \frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_{\alpha^t}(h_t) + \frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I} \alpha_i^t\, \mathrm{disc}(D_t, D_i) + \frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I} \alpha_i^t\, \lambda_{ti}. \qquad (31)$$
The following proposition bounds the effect of estimating the first term on the right hand side of (31) from finite data.

Proposition 3. For any $\delta > 0$ the following inequality holds uniformly in $h_1, \dots, h_T \in \mathcal{H}$ with probability at least $1-\delta$ over the sampling of the unlabeled training sets, $S_1, \dots, S_T$, and labeled training sets, $(\bar S_i)_{i\in I}$, provided that the subset of labeled tasks, $I \subset \{1, \dots, T\}$, and the task weights, $\alpha^1, \dots, \alpha^T \in \Lambda^I$, depend deterministically on the unlabeled training data only:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_{\alpha^t}(h_t) \le \frac{1}{T}\sum_{t=1}^{T} \widehat{\mathrm{er}}_{\alpha^t}(h_t) + \frac{1}{T}\|\alpha\|_{2,1}\sqrt{\frac{2d\log(ekm/d)}{m}} + \frac{1}{T}\|\alpha\|_{1,2}\sqrt{\frac{\log(2/\delta)}{2m}} \qquad (32)$$
$$\qquad + \sqrt{\frac{8(\log T + d\log(enT/d))}{n}} + \sqrt{\frac{2}{n}\log\frac{2}{\delta}}. \qquad (33)$$
The proof consists of three steps. In the first two steps we condition on the unlabeled data sets and bound $\frac{1}{T}\sum_{t=1}^{T}\widetilde{\mathrm{er}}_{\alpha^t}$ in terms of $\frac{1}{T}\sum_{t=1}^{T}\widehat{\mathrm{er}}_{\alpha^t}$, where for any index set $I$ and weights $\alpha$ we set

$$\widetilde{\mathrm{er}}_\alpha(h) = \sum_{i\in I} \alpha_i\, \widetilde{\mathrm{er}}_i(h) \quad\text{for}\quad \widetilde{\mathrm{er}}_i(h) = \frac{1}{n}\sum_{j=1}^{n} \ell(h(x_j^i), f_i(x_j^i)). \qquad (34)$$
This quantity can be seen as a training error if the learner would receive all the labels for the chosen tasks. Clearly, if $m = n$ this part is not necessary and we can avoid the resulting complexity terms. In the third step we relate $\frac{1}{T}\sum_{t=1}^{T}\widetilde{\mathrm{er}}_{\alpha^t}$ to $\frac{1}{T}\sum_{t=1}^{T}\mathrm{er}_{\alpha^t}$.

Step 1. Fix the unlabeled samples $S_1, \dots, S_T$. This uniquely determines the chosen tasks $I$ and the weights $\alpha^1, \dots, \alpha^T \in \Lambda^I$, so the only remaining source of randomness is the uncertainty about which subsets of the selected tasks are labeled. Analyzing this would be rather straightforward if the labeled points, $\bar S_i$, were sampled i.i.d. from $S_i$ (i.e. randomly with replacement). This is not the case, however, since we assume that exactly $m$ points are labeled, i.e. $\bar S_i$ is sampled from $S_i$ randomly without replacement, and this introduces dependencies between the elements. For notational simplicity we pretend that exactly the first $k$ tasks were selected, i.e. $I = \{1, \dots, k\}$. The general case can be obtained by changing the indices in the proof from $1, \dots, k$ to $i_1, \dots, i_k$.

To deal with the dependencies between the labeled data points we first note that any random labeled subset $\bar S_i = (\bar s_1^i, \dots, \bar s_m^i)$ can be described as the first $m$ elements of a random permutation $Z_i = (z_1^i, \dots, z_n^i)$ of $n$ elements that correspond to the unlabeled sample $S_i$, i.e. $\bar s_j^i = (\bar x_j^i, \bar y_j^i) = (x_{z_j^i}^i, y_{z_j^i}^i)$. With this notation and writing $Z = (Z_1, \dots, Z_k)$ and $\ell(h, z_j^i) = \ell(h(\bar x_j^i), \bar y_j^i)$, we define the following function:

$$\Phi(Z) = \sup_{h_1, \dots, h_T} \frac{1}{T}\sum_{t=1}^{T} \big( \widetilde{\mathrm{er}}_{\alpha^t}(h_t) - \widehat{\mathrm{er}}_{\alpha^t}(h_t) \big) = \sup_{h_1, \dots, h_T} \frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{k} \alpha_i^t \Big( \frac{1}{n}\sum_{j=1}^{n} \ell(h_t, z_j^i) - \frac{1}{m}\sum_{j=1}^{m} \ell(h_t, z_j^i) \Big). \qquad (35)$$

First, we establish a large deviation bound for $\Phi$.

Lemma 7. The following inequality holds with probability at least $1-\delta/2$ over the labeled sample sets $\bar S_1, \dots, \bar S_k$:

$$\Phi(Z) - \mathbb{E}_Z \Phi(Z) \le \frac{1}{T}\sqrt{\sum_{i=1}^{k}\Big(\sum_{t=1}^{T}\alpha_i^t\Big)^2}\,\sqrt{\frac{\log(2/\delta)}{2m}}. \qquad (36)$$

Proof. The main ingredient of the proof will be an application of McDiarmid's inequality for martingales (Lemma 2). For this, we interpret $Z = (z_{11}, z_{12}, \dots, z_{kn})$ as a sequence of $kn$ dependent variables. For the sake of notational consistency we will keep using double indices, with the convention that the sample index, $j = 1, \dots, n$, runs faster than the task index, $i = 1, \dots, k$. Segments of a sequence will be denoted by upper and lower double indices, $z_{ij}^{\bar\imath\bar\jmath} = (z_{ij}, z_{i(j+1)}, \dots, z_{\bar\imath\bar\jmath})$ for $ij \le \bar\imath\bar\jmath$, and $z_{ij}^{\bar\imath\bar\jmath} = \emptyset$ otherwise. We now create a martingale sequence using Doob's construction [6]:

$$W_{ij} = \mathbb{E}_Z\{\Phi(Z)\,|\, z_{11}^{ij}\}, \qquad (37)$$

where here and in the following, when taking expectations over $Z$, it is silently assumed that the expectation is taken only with respect to the variables that are not conditioned on. Note that because of this convention the expectation in (37) is only with respect to $z_{i(j+1)}, \dots, z_{kn}$, so each $W_{ij}$ is a random variable of $z_{11}, \dots, z_{ij}$. In particular, $W_{00} = \mathbb{E}_Z \Phi(Z)$ and $W_{kn} = \Phi(Z)$, and the sequence in between is a martingale with respect to $z_{11}, \dots, z_{kn}$:

$$\mathbb{E}_Z\{W_{ij}\,|\, z_{11}^{i(j-1)}\} = \mathbb{E}_Z\Big\{ \mathbb{E}_Z\{\Phi(Z)\,|\, z_{11}^{ij}\} \,\Big|\, z_{11}^{i(j-1)}\Big\} = \mathbb{E}_Z\{\Phi(Z)\,|\, z_{11}^{i(j-1)}\} = W_{i(j-1)}. \qquad (38)$$
rij (π11
)=
i(j−1)
sup { Wij : z11
in σ∈πi(j+1)
i(j−1)
= π11
, zij = σ} −
i(j−1)
inf { Wij : z11 in
σ∈πi(j+1)
i(j−1)
= π11
, zij = σ} (39)
=
sup
sup
h
E
kn in in zi(j+1) σ∈πi(j+1) τ ∈πi(j+1)
i(j−1) kn {Φ(π11 , σ, zi(j+1) )}
14
− E
kn zi(j+1)
i
i(j−1) kn {Φ(π11 , τ, zi(j+1) )}
.
(40)
To analyze (40) further, recall that: i(j−1)
E {Φ(π11
kn zi(j+1)
i(j−1)
X
=
kn )} , σ, zi(j+1)
Φ(π11
(41) i(j−1)
kn kn kn |z11 = πi(j+1) ) × Pr( zi(j+1) , σ, πi(j+1)
i(j−1)
= π11
∧ zij = σ )
(42)
kn πi(j+1)
where here and in the following we use the convention that sums over parts of π run only over values that lead to valid permutations. Because the permutations of different task are independent, this is equal to i(j−1)
X
=
Φ(π11
i(j−1)
kn in in ) × Pr( zi(j+1) = πi(j+1) |zi1 , σ, πi(j+1)
i(j−1)
= πi1
kn kn ∧ zij = σ ) Pr(z(i+1)1 = π(i+1)1 )
kn πi(j+1)
(43) ij ij We make the following observation: for any fixed πi1 and any τ 6∈ πi1 , we can rephrase a summation over into a sum over all positions where τ can occur, and a sum over all configuration for the entries that are not τ :
in πi(j+1)
X
in F (πi(j+1) )=
n X
X
X
i(l−1)
in ) F (πi(j+1) , τ, πi(l+1)
(44)
l=j+1 π i(l−1) π in i(l+1)
in πi(j+1)
i(j+1)
for any function $F$. Applying this to the summation in (43), we obtain

$$\sum_{\pi_{i(j+1)}^{kn}} \Phi(\pi_{11}^{i(j-1)}, \sigma, \pi_{i(j+1)}^{kn})\, \Pr\big( z_{i(j+1)}^{in} = \pi_{i(j+1)}^{in} \,\big|\, z_{i1}^{i(j-1)} = \pi_{i1}^{i(j-1)} \wedge z_{ij} = \sigma \big)\, \Pr\big(z_{(i+1)1}^{kn} = \pi_{(i+1)1}^{kn}\big) \qquad (45)$$
$$= \sum_{l=j+1}^{n}\ \sum_{\pi_{i(j+1)}^{i(l-1)}}\ \sum_{\pi_{i(l+1)}^{kn}} \Phi(\pi_{11}^{i(j-1)}, \sigma, \pi_{i(j+1)}^{i(l-1)}, \tau, \pi_{i(l+1)}^{kn}) \qquad (46)$$
$$\qquad \times \Pr\big( z_{i(j+1)}^{i(l-1)} = \pi_{i(j+1)}^{i(l-1)} \wedge z_{i(l+1)}^{kn} = \pi_{i(l+1)}^{kn} \,\big|\, z_{11}^{i(j-1)} = \pi_{11}^{i(j-1)} \wedge z_{ij} = \sigma \wedge z_{il} = \tau \big)\, \Pr\big(z_{(i+1)1}^{kn} = \pi_{(i+1)1}^{kn}\big)$$
$$= \mathbb{E}_{l\sim U_{j+1}^{n}}\ \mathbb{E}_Z\, \Phi\big(Z \,\big|\, z_{11}^{i(j-1)} = \pi_{11}^{i(j-1)} \wedge z_{ij} = \sigma \wedge z_{il} = \tau\big), \qquad (47)$$
where $U_{j+1}^{n}$ denotes the uniform distribution over the set $\{j+1, \dots, n\}$. The analogous derivation can be applied to the second term in (40), with $\sigma$ and $\tau$ exchanged. For any $Z$ denote by $Z_{ij\leftrightarrow il}$ the permutation obtained by switching $z_{ij}$ and $z_{il}$. Then, due to the linearity of the expectation:

$$r_{ij}(\pi_{11}^{i(j-1)}) = \sup_{\sigma,\tau}\ \mathbb{E}_{l\sim U_{j+1}^{n}}\ \mathbb{E}_Z\, \big\{ \Phi(Z) - \Phi(Z_{ij\leftrightarrow il}) \,\big|\, z_{11}^{i(j-1)} = \pi_{11}^{i(j-1)},\ z_{ij} = \sigma,\ z_{il} = \tau \big\}. \qquad (48)$$
From the definition of $\Phi$ we see that $\Phi(Z) - \Phi(Z_{ij\leftrightarrow il}) = 0$ when $j, l \in \{1, \dots, m\}$ or $j, l \in \{m+1, \dots, n\}$. Since $l > j$ in (48), this implies $r_{ij}(\pi_{11}^{i(j-1)}) = 0$ for $j \in \{m+1, \dots, n\}$. The only remaining cases are $j \in \{1, \dots, m\}$ and $l \in \{m+1, \dots, n\}$, for which we obtain

$$\Phi(Z) - \Phi(Z_{ij\leftrightarrow il}) \le \sup_{h_1, \dots, h_T} \frac{1}{T}\sum_{t=1}^{T} \alpha_i^t\, \frac{1}{m}\big( {-\ell(h_t, z_j^i)} + \ell(h_t, z_l^i) \big) \le \frac{1}{Tm}\sum_{t=1}^{T} \alpha_i^t, \qquad (49)$$

where for the first inequality we used that $\sup F - \sup G \le \sup(F - G)$ for any $F, G$, and for the second inequality we used that $\ell$ is bounded by $[0,1]$. Consequently, $r_{ij}(\pi_{11}^{i(j-1)}) \le \frac{n-m}{n-j}\cdot\frac{1}{Tm}\sum_{t=1}^{T}\alpha_i^t$ in this case.² Therefore

$$\hat R^2 = \sum_{i=1}^{k}\sum_{j=1}^{n} r_{ij}(\pi_{11}^{i(j-1)})^2 \le \frac{1}{T^2 m^2}\sum_{j=1}^{m}\Big(\frac{n-m}{n-j}\Big)^2 \sum_{i=1}^{k}\Big(\sum_{t=1}^{T}\alpha_i^t\Big)^2 \le \frac{1}{T^2 m}\sum_{i=1}^{k}\Big(\sum_{t=1}^{T}\alpha_i^t\Big)^2.$$

² We generously bound $\frac{n-m}{n-j} \le 1$ in this step. By keeping the corresponding factor in the analysis, one obtains that the constant $B$ in the theorem can be improved by at least a factor of $\frac{(n-m)^2}{(n-0.5)(n-m-0.5)}$.
Now from Lemma 2 we obtain that with probability at least $1-\delta/2$:

$$\Phi(Z) - \mathbb{E}_Z \Phi(Z) = W_{kn} - W_0 \le \frac{1}{T}\sqrt{\sum_{i=1}^{k}\Big(\sum_{t=1}^{T}\alpha_i^t\Big)^2}\,\sqrt{\frac{\log(2/\delta)}{2m}}. \qquad (50)$$
Step 2. As a second step, we establish an upper bound on $\mathbb{E}_Z \Phi(Z)$ itself.

Lemma 8.

$$\mathbb{E}_Z \Phi(Z) \le \frac{1}{T}\sum_{t=1}^{T}\sqrt{\sum_{i=1}^{k}(\alpha_i^t)^2} \cdot \sqrt{\frac{2d\log(ekm/d)}{m}}. \qquad (51)$$
Proof. The main ingredient will be Lemma 4. First we rewrite $\Phi(Z)$ in the following way:

$$\Phi(Z) = \frac{1}{T}\sum_{t=1}^{T} \sup_{h} \sum_{i=1}^{k} \alpha_i^t \big( \widetilde{\mathrm{er}}_i(h) - \widehat{\mathrm{er}}_i(h) \big) = \frac{1}{Tm}\sum_{t=1}^{T} \Phi_t(Z) \quad\text{for}\quad \Phi_t(Z) = \sup_{h} \sum_{i=1}^{k} m\,\alpha_i^t \big( \widetilde{\mathrm{er}}_i(h) - \widehat{\mathrm{er}}_i(h) \big).$$

Note that even though $\mathcal{H}$ can be infinitely large, we can identify a finite subset that represents all possible predictions of hypotheses in $\mathcal{H}$ on $S_1 \cup \dots \cup S_k$. We denote their number by $L \le 2^{kn}$ and the corresponding hypotheses by $h^1, \dots, h^L$. Let $t \in \{1, \dots, T\}$ be fixed. For every $i \in \{1, \dots, k\}$ define a set of $n$ $L$-dimensional vectors, $V_i^t = \{v_{i1}^t, \dots, v_{in}^t\}$, where for every $j \in \{1, \dots, n\}$:

$$v_{ij}^t = \Big[\, \alpha_i^t \big( \widetilde{\mathrm{er}}_i(h^1) - \ell(h^1(x_j^i), y_j^i) \big),\ \dots,\ \alpha_i^t \big( \widetilde{\mathrm{er}}_i(h^L) - \ell(h^L(x_j^i), y_j^i) \big) \,\Big]. \qquad (52)$$

With this notation, for every $i \in \{1, \dots, k\}$ choosing a random subset $\bar S_i \subset S_i$ corresponds to sampling $m$ vectors from $V_i^t$ uniformly without replacement. For every $i \in \{1, \dots, k\}$, let $U_i = \{u_{i1}, \dots, u_{im}\}$ be sampled from $V_i^t$ in that way. Then

$$\Phi_t(Z) = F\Big( \sum_{i=1}^{k}\sum_{j=1}^{m} u_{ij} \Big), \qquad (53)$$
where the function $F$ takes as input an $L$-dimensional vector and returns the value of its maximum component. We now bound $\mathbb{E}_Z \Phi_t(Z)$ by applying Lemma 4 $k$ times:

$$\mathbb{E}_Z \Phi_t(Z) = \mathbb{E}_{U_1, \dots, U_k} F\Big( \sum_{i=1}^{k}\sum_{j=1}^{m} u_{ij} \Big) = \mathbb{E}_{U_1, \dots, U_{k-1}}\, \mathbb{E}_{U_k} \Big[ F\Big( \sum_{i=1}^{k-1}\sum_{j=1}^{m} u_{ij} + \sum_{j=1}^{m} u_{kj} \Big) \,\Big|\, U_1, \dots, U_{k-1} \Big]. \qquad (54)$$

By Lemma 3, $F(x)$ is a convex function. Thus $F(\mathrm{const} + x)$ is also convex and we can apply Lemma 4 with respect to $U_k$:

$$\le \mathbb{E}_{U_1, \dots, U_{k-1}} \Big[ \mathbb{E}_{\hat U_k} F\Big( \sum_{i=1}^{k-1}\sum_{j=1}^{m} u_{ij} + \sum_{j=1}^{m} \hat u_{kj} \Big) \,\Big|\, U_1, \dots, U_{k-1} \Big], \qquad (55)$$

where $\hat U_k = \{\hat u_{k1}, \dots, \hat u_{km}\}$ is a set of $m$ vectors sampled from $V_k^t$ with replacement. This equals

$$= \mathbb{E}_{U_1, \dots, U_{k-1}, \hat U_k} F\Big( \sum_{i=1}^{k-1}\sum_{j=1}^{m} u_{ij} + \sum_{j=1}^{m} \hat u_{kj} \Big). \qquad (56)$$

Repeating the process $k$ times, we obtain

$$\le \dots \le \mathbb{E}_{\hat U_1, \dots, \hat U_k} F\Big( \sum_{i=1}^{k}\sum_{j=1}^{m} \hat u_{ij} \Big). \qquad (57)$$
Note that writing the conditioning in the above expressions is just for clarity of presentation, since the $U_1, \dots, U_k$ are actually independent of each other. Switching from the $U$ sets to the $\hat U$ sets in $\Phi$ corresponds to switching from the random subsets $\bar S_i$ to random sets $\tilde S_i$ consisting of $m$ points sampled from $S_i$ uniformly with replacement. Therefore we obtain

$$\mathbb{E}_Z \Phi_t(Z) = \mathbb{E}_{\bar S_1, \dots, \bar S_k} \Phi_t(\bar S_1, \dots, \bar S_k) \le \mathbb{E}_{\tilde S_1, \dots, \tilde S_k} \Phi_t(\tilde S_1, \dots, \tilde S_k), \qquad (58)$$
which allows us to continue analyzing $\mathbb{E}_Z \Phi_t(Z)$ in the standard way using Rademacher complexities and independent samples. Applying the common symmetrization trick and introducing Rademacher random variables $\sigma_{ij}$, we obtain

$$\Phi_t(\tilde S_1, \dots, \tilde S_k) \le 2\, \mathbb{E}_\sigma \sup_{h} \sum_{i=1}^{k}\sum_{j=1}^{m} \sigma_{ij}\, \alpha_i^t\, \ell(h(x_j^i), y_j^i).$$

We can rewrite this using the fact that $\ell(y, y') = [\![y \neq y']\!] = \frac{1-yy'}{2}$:

$$\mathbb{E}_\sigma \sup_{h} \sum_{i=1}^{k}\sum_{j=1}^{m} \sigma_{ij}\, \alpha_i^t\, \ell(h(x_j^i), y_j^i) = \mathbb{E}_\sigma \sup_{h} \sum_{i=1}^{k}\sum_{j=1}^{m} \sigma_{ij}\, \alpha_i^t\, \frac{1 - h(x_j^i)\, y_j^i}{2} = \mathbb{E}_\sigma \sup_{h} \sum_{i=1}^{k}\sum_{j=1}^{m} \frac{1}{2}\big({-\sigma_{ij}\, y_j^i}\big)\, \alpha_i^t\, h(x_j^i).$$

Since $-\sigma_{ij}\, y_j^i$ has the same distribution as $\sigma_{ij}$, this equals

$$\frac{1}{2}\, \mathbb{E}_\sigma \sup_{a(h)\in A} \sum_{i=1}^{k}\sum_{j=1}^{m} \sigma_{ij}\, a_{ij}(h),$$
where $a_{ij}(h) = \alpha_i^t h(x_j^i)$ and $A = \{a(h) : h \in \mathcal{H}\}$. According to Sauer's lemma (Corollary 3.3 in [17]):

$$|A| \le \Big(\frac{ekm}{d}\Big)^{d}. \qquad (59)$$

At the same time:

$$\|a\|_2 = \sqrt{\sum_{i=1}^{k}\sum_{j=1}^{m} \big(\alpha_i^t\, h(x_j^i)\big)^2} = \sqrt{m}\,\sqrt{\sum_{i=1}^{k} (\alpha_i^t)^2}. \qquad (60)$$

Therefore, by Massart's lemma (Theorem 3.3 in [17]):

$$\mathbb{E}_\sigma \sup_{h} \sum_{i=1}^{k}\sum_{j=1}^{m} \sigma_{ij}\, \alpha_i^t\, \ell(h(x_j^i), y_j^i) \le \frac{1}{2}\sqrt{\sum_{i=1}^{k} (\alpha_i^t)^2} \cdot \sqrt{2dm\log(ekm/d)}. \qquad (61)$$

By applying this result for all $t$ we obtain:

$$\mathbb{E}_Z \Phi(Z) = \frac{1}{Tm}\sum_{t=1}^{T} \mathbb{E}_Z \Phi_t(Z) \le \frac{1}{Tm}\sum_{t=1}^{T} \mathbb{E}_{\tilde S} \Phi_t(\tilde S) \le \frac{1}{T}\sum_{t=1}^{T}\sqrt{\sum_{i=1}^{k} (\alpha_i^t)^2} \cdot \sqrt{\frac{2d\log(ekm/d)}{m}}. \qquad (62)$$
Combining Steps 1 and 2, we obtain that for fixed unlabeled samples $S_1, \dots, S_T$, with probability at least $1-\delta/2$ for all choices of $h_1, \dots, h_T$:

$$\frac{1}{T}\sum_{t=1}^{T} \widetilde{\mathrm{er}}_{\alpha^t}(h_t) \le \frac{1}{T}\sum_{t=1}^{T} \widehat{\mathrm{er}}_{\alpha^t}(h_t) + \frac{1}{T}\|\alpha\|_{2,1}\sqrt{\frac{2d\log(ekm/d)}{m}} + \frac{1}{T}\|\alpha\|_{1,2}\sqrt{\frac{\log(2/\delta)}{2m}}. \qquad (63)$$
Step 3. In the last step we relate the empirical multi-task error, $\widetilde{\mathrm{er}}_\alpha$, to the expected multi-task error, $\mathrm{er}_\alpha$. Because the choice of the tasks to label, $I$, their weights, $\alpha$, and the predictors, $h$, all depend on the unlabeled data, we aim for a bound that holds simultaneously for all choices of these quantities, under the condition that $I$ and $\alpha$ depend only on the unlabeled samples, while $h$ can be chosen based also on the labeled subsets.

Lemma 9. For any $\delta > 0$, the following inequality holds with probability at least $1-\delta/2$ simultaneously for all choices of tasks to be labeled $I$, weights $\alpha$, and hypotheses $h$:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_{\alpha^t}(h_t) \le \frac{1}{T}\sum_{t=1}^{T} \widetilde{\mathrm{er}}_{\alpha^t}(h_t) + \sqrt{\frac{8(\log T + d\log(enT/d))}{n}} + \sqrt{\frac{2}{n}\log\frac{2}{\delta}}. \qquad (64)$$
Proof. The main ingredient is a refined version of McDiarmid's inequality, due to Maurer [15] (Lemma 5), which allows us to make use of the internal structure of the weights, $\alpha$, while deriving a large deviation bound. For any $S = (S_1, \dots, S_T)$ with $S_i = \{(x_1^i, y_1^i), \dots, (x_n^i, y_n^i)\}$ define:

$$\Psi(S) = \sup_{I=\{i_1, \dots, i_k\}}\ \sup_{\alpha^1, \dots, \alpha^T \in \Lambda^I}\ \sup_{h_1, \dots, h_T} \frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{T} \alpha_i^t \big( \mathrm{er}_i(h_t) - \widetilde{\mathrm{er}}_i(h_t) \big) = \sup_{I}\ \sup_{\alpha}\ \sup_{h}\ g(\alpha, h, S) \qquad (65)$$

for

$$g(\alpha, h, S) = \sum_{i=1}^{T}\sum_{j=1}^{n} \Big( \frac{1}{Tn}\sum_{t=1}^{T} \alpha_i^t \big( \mathrm{er}_i(h_t) - \ell(h_t(x_j^i), y_j^i) \big) \Big). \qquad (66)$$

To apply Lemma 5 we establish a bound on $\Delta_{+,\Psi}(S) = \sum_i \sum_j \big(\Psi(S) - \Psi_{ij}(S)\big)^2$, with

$$\Psi_{ij}(S) = \inf_{(x,y)}\ \sup_{\alpha}\ \sup_{h}\ g\big(\alpha, h,\, S \setminus \{(x_j^i, y_j^i)\} \cup \{(x, y)\}\big). \qquad (67)$$
Let $\alpha^*, h^*$ be the point where the supremum in (65) is attained (otherwise an arbitrarily close approximation argument applies, as for Lemma 6), i.e. $\Psi(S) = g(\alpha^*, h^*, S)$. Then

$$\Psi_{ij}(S) \ge \inf_{(x,y)} g\big(\alpha^*, h^*,\, S \setminus \{(x_j^i, y_j^i)\} \cup \{(x, y)\}\big) \qquad (68)$$

and therefore

$$\Psi(S) - \Psi_{ij}(S) \le g(\alpha^*, h^*, S) - \inf_{(x,y)} g\big(\alpha^*, h^*,\, S \setminus \{(x_j^i, y_j^i)\} \cup \{(x, y)\}\big) \qquad (69)$$
$$\le \sup_{(x,y)} \frac{1}{Tn}\sum_{t=1}^{T} \alpha_i^{*t} \big( {-\ell(h_t^*(x_j^i), y_j^i)} + \ell(h_t^*(x), y) \big) \le \frac{1}{Tn}\sum_{t=1}^{T} \alpha_i^{*t}, \qquad (70)$$

where for the last inequality we use that $\ell$ is bounded in $[0,1]$. Because also $\Psi(S) - \Psi_{ij}(S) \ge 0$, we obtain

$$\Delta_{+,\Psi}(S) = \sum_{i=1}^{T}\sum_{j=1}^{n} \big(\Psi(S) - \Psi_{ij}(S)\big)^2 \le \sum_{i=1}^{T}\sum_{j=1}^{n} \frac{1}{T^2 n^2}\Big(\sum_{t=1}^{T}\alpha_i^{*t}\Big)^2 \le \frac{1}{T^2 n}\Big(\sum_{i=1}^{T}\sum_{t=1}^{T}\alpha_i^{*t}\Big)^2 = \frac{1}{n} \qquad (71)$$

(remember that $\sum_i \alpha_i = 1$ for any $\alpha \in \Lambda^I$). Therefore, according to Lemma 5, with probability at least $1-\delta/2$:

$$\Psi(S) \le \mathbb{E}_S \Psi(S) + \sqrt{\frac{2}{n}\log\frac{2}{\delta}}. \qquad (72)$$
To bound $\mathbb{E}_S \Psi(S)$ we again use symmetrization and Rademacher variables, $\sigma_{ij}$:

$$\mathbb{E}_S \Psi(S) = \mathbb{E}_S \sup_{I}\ \sup_{\alpha^1, \dots, \alpha^T \in \Lambda^I}\ \sup_{h_1, \dots, h_T} \sum_{i=1}^{T}\sum_{j=1}^{n} \Big( \frac{1}{Tn}\sum_{t=1}^{T} \alpha_i^t \big( \mathrm{er}_i(h_t) - \ell(h_t(x_j^i), y_j^i) \big) \Big) \qquad (73)$$
$$\le 2\, \mathbb{E}_S \mathbb{E}_\sigma \sup_{I}\ \sup_{\alpha^1, \dots, \alpha^T \in \Lambda^I}\ \sup_{h_1, \dots, h_T} \sum_{i=1}^{T}\sum_{j=1}^{n} \Big( \frac{\sigma_{ij}}{Tn}\sum_{t=1}^{T} \alpha_i^t\, \ell(h_t(x_j^i), y_j^i) \Big) \qquad (74)$$
$$\le 2\, \mathbb{E}_S \mathbb{E}_\sigma\, \frac{1}{T}\sum_{t=1}^{T}\ \sup_{\alpha^t\in\Lambda,\, h_t} \sum_{i=1}^{T}\sum_{j=1}^{n} \frac{\sigma_{ij}\,\alpha_i^t}{n}\, \ell(h_t(x_j^i), y_j^i) \qquad (75)$$
$$\le 2\, \mathbb{E}_S \mathbb{E}_\sigma \sup_{\alpha, h} \sum_{i=1}^{T}\sum_{j=1}^{n} \frac{\sigma_{ij}\,\alpha_i}{n}\, \ell(h(x_j^i), y^i), \qquad (76)$$
where line (75) is obtained from line (74) by dropping the assumption of a common sparsity pattern between the $\alpha$-s. Note that the function inside the last sup is linear in $\alpha \in \Lambda$; therefore $\sup_\alpha$ can be reduced to the sup over the corners of the simplex, $\{(1, 0, \dots, 0), \dots, (0, \dots, 0, 1)\}$. At the same time, by Sauer's lemma, the number of different choices of $h$ on $S$ is bounded by $\big(\frac{eTn}{d}\big)^d$. Therefore, the total number of different choices in (76) is bounded by $T\big(\frac{enT}{d}\big)^d$. Furthermore, for any choice of $\alpha$ and $h$, the norm of the $Tn$-vector formed by the summands of (76) is bounded by $1/\sqrt{n}$, because

$$\sum_{i=1}^{T}\sum_{j=1}^{n} \Big( \frac{\sigma_{ij}\,\alpha_i}{n}\, \ell(h(x_j^i), y^i) \Big)^2 = \frac{1}{n^2}\sum_{i=1}^{T}\sum_{j=1}^{n} \big( \alpha_i\, \ell(h(x_j^i), y^i) \big)^2 \le \frac{1}{n^2}\sum_{j=1}^{n}\Big(\sum_{i=1}^{T}\alpha_i\Big)^2 = \frac{1}{n}. \qquad (77)$$
Therefore, by Massart’s lemma: E sup σ
T X n X σil αi
α,h i=1 j=1
n
`(h(xil ), yli )
p 2(log T + d log(enT /d)) √ ≤ . n
(78)
Combining (72) and (78), we obtain the statement of Lemma 9: with probability at least $1-\delta/2$ simultaneously for all choices of tasks to be labeled $I$, weights $\alpha$, and hypotheses $h$:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{er}_{\alpha^t}(h_t) \le \frac{1}{T}\sum_{t=1}^{T} \widetilde{\mathrm{er}}_{\alpha^t}(h_t) + \sqrt{\frac{8(\log T + d\log(enT/d))}{n}} + \sqrt{\frac{2}{n}\log\frac{2}{\delta}}. \qquad (79)$$

By combining this inequality with (63) by a union bound we obtain the statement of Proposition 3. The statement of Theorem 2 follows by combining Corollary 2, Proposition 3 and Corollary 1.