Dimensionality Reduction for Active Learning with Nearest Neighbour Classifier in Text Categorisation Problems

Michael Davy
Artificial Intelligence Group, Department of Computer Science, Trinity College Dublin
[email protected]

Saturnino Luz
Artificial Intelligence Group, Department of Computer Science, Trinity College Dublin
[email protected]

Abstract

Dimensionality reduction techniques are commonly used in text categorisation problems to improve training and classification efficiency and to avoid overfitting. The best performing dimensionality reduction techniques for text categorisation are supervised, hence utilise the label information of the training data. Active learning is used to reduce the number of labelled training examples for problems where obtaining label information is expensive. Since the vast majority of data supplied to active learning are unlabelled, supervised dimensionality reduction techniques cannot be readily employed. For this reason, active learning in text categorisation problems typically does not perform dimensionality reduction, thereby restricting the choice of classifier. In this paper we investigate unsupervised dimensionality reduction techniques in active learning for text categorisation problems. Two unsupervised techniques are investigated, namely Document Frequency and Principal Components Analysis. We empirically show increased performance of active learning, using a k-Nearest Neighbour classifier, when dimensionality reduction is applied using the unsupervised techniques.

1 Introduction

Text categorisation is the task of assigning documents to a set of predefined categories [10]. Automated solutions to text categorisation have been developed using supervised learning, where a classifier is induced from a large number of labelled examples. Supervised learning assumes there is an abundance of labelled examples; however, this assumption does not hold for many domains. While labelled examples can be scarce, unlabelled examples are naturally abundant. Active learning is a technique for constructing accurate classifiers from very small


amounts of training data. Reductions in the number of labelled examples required are achieved by the active learner controlling the training data and populating it only with highly informative examples. Conversely, supervised learning has no control over the training data, and hence requires far more data to ensure there are sufficient numbers of informative training examples. Orders of magnitude reductions in labelling requirements are achieved when performing active learning on text categorisation problems [5].

In this paper we explore the difficulties arising from performing dimensionality reduction in active learning for text categorisation problems. The most successful dimensionality reduction techniques for text categorisation are supervised feature selection methods [13]. However, performing supervised feature selection is a significant problem for active learning tasks since the majority of supplied training data are unlabelled. As text data are naturally high dimensional, the choice of classifier used in active learning is therefore limited to those which do not suffer from the curse of dimensionality [7].

We investigate the application of unsupervised dimensionality reduction to active learning on text categorisation problems. Reducing the dimensionality while retaining the discriminative features allows for greater flexibility in the choice of classifier used in active learning. To the best of our knowledge this is the first analysis of the use of unsupervised dimensionality reduction in the context of active learning for text categorisation problems.

Empirical evaluations were conducted on the effect of dimensionality reduction on the performance of active learning using the k-Nearest Neighbour (kNN) algorithm. Two well established unsupervised dimensionality reduction techniques were considered for use in active learning problems. Feature selection is performed using Document Frequency with a global policy (DFG), while feature extraction is performed using Principal Components Analysis (PCA).

Both techniques offer significant reductions in the size of the input data, with DFG and PCA reducing dimensionality by up to 90% and 98% respectively. We demonstrate that preprocessing the data using the unsupervised dimensionality reduction techniques can significantly increase the performance of active learning using the kNN, making it more competitive with state-of-the-art classifiers such as Support Vector Machines.

A brief description of active learning, in particular pool-based active learning, is given in Section 2. The unsupervised dimensionality reduction techniques are reviewed in Section 3. Empirical evaluations on real world text corpora are presented and discussed in Section 4. Finally, conclusions and future work are given in Section 5.

2 Active Learning

The goal of active learning is to produce an accurate classifier (Φ) from as few training examples as possible. This is advantageous for domains where labelled training examples are scarce and the task of labelling is expensive. Typically, training data for supervised learning are chosen randomly prior to induction. This is referred to as passive learning since the learner has no control over which examples constitute the training data. Conversely, active learning allows the learner to construct its own training data. Starting from a small number of labelled seed examples, an active learner will iteratively select unlabelled examples, acquire correct labels and update the training data. Certain examples will contain more information about the problem than others. Passive learning can potentially label a large number of uninformative examples. Active learning attempts to select (and label) only those examples which contain the most information. Therefore, active learning can significantly reduce the number of labelled examples required when compared to passive learning.

2.1 Pool-Based Active Learning

In this paper we use pool-based active learning [5, 6], where the learner is supplied with a pool of unlabelled examples from which it selects queries. Algorithm 1 gives the outline of a pool-based active learner. The active learner is given a pool of unlabelled examples (ul) and training data (tr) which is seeded with a small number of labelled examples. In each iteration a classifier (Φi) is constructed from all the known labelled training data using an induction algorithm. The classifier can then be used by the query selection function to help select informative examples by providing predictions on unlabelled data. A query example (q) is selected using the query selection function and removed from the unlabelled pool.

Algorithm 1: Pool-Based Active Learner
  Input: tr - training data
  Input: ul - unlabelled examples
  for i = 0 to stopping criteria met do
      Φi = Induce(tr)                // Induce
      q = QuerySelect(ul, Φi)        // Select
      ul ← ul \ {q}                  // Remove
      l = Oracle(q)                  // Label
      tr ← tr ∪ {(q, l)}             // Update
  Output: ΦF = Induce(tr)

The true label (l) of the selected example is obtained from the oracle, an external entity assumed to be human and considered infallible. Once the true label is known, the labelled example is added to the training data, and classifiers induced in subsequent iterations will incorporate the information. Common stopping criteria used in active learning are a limit on the number of examples the oracle is willing to label, or stopping once all unlabelled examples have been selected. Once stopped, the output of active learning is a classifier (ΦF) trained on all the known labelled data.

2.1.1 Query Selection

The query selection function is a crucial component of active learning and is responsible for selecting informative examples from the pool. A number of selection strategies have emerged in the literature [6, 1]. In this paper we use Uncertainty Sampling (US) [5] as the query selection function. US selects the example about which the current classifier (Φi) is most uncertain. Uncertainty is defined in terms of the confidence the classifier has in a prediction. For a probabilistic classifier, a prediction close to 0.0 or 1.0 indicates a confident prediction while a prediction close to 0.5 indicates an uncertain one. Unlabelled examples in the pool are sorted according to their prediction uncertainty and the most uncertain example is selected as the query, as shown in Equation 1.

s = arg min_{x ∈ ul} |Φi(x) − 0.5|    (1)
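To make the loop concrete, the following is a minimal Python sketch of Algorithm 1 with Uncertainty Sampling as the query selection function. This is not the implementation used in our experiments (which were run with the Spider toolbox for Matlab); the scikit-learn kNN, the oracle callable and the label-budget stopping criterion are illustrative assumptions.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def pool_based_active_learning(seed_X, seed_y, pool_X, oracle, budget):
        """Sketch of Algorithm 1. `oracle` stands in for the human
        labeller: a function mapping an example to its true label.
        pool_X is an ndarray; labels are assumed binary in {0, 1}."""
        train_X, train_y = list(seed_X), list(seed_y)
        pool = list(range(len(pool_X)))      # indices of unlabelled examples
        for _ in range(budget):              # stopping criterion: label budget
            clf = KNeighborsClassifier(n_neighbors=3)
            clf.fit(np.asarray(train_X), np.asarray(train_y))       # Induce
            probs = clf.predict_proba(pool_X[pool])[:, 1]
            q = pool[int(np.argmin(np.abs(probs - 0.5)))]           # Select (Eq. 1)
            pool.remove(q)                                          # Remove
            train_X.append(pool_X[q])                               # Label + Update
            train_y.append(oracle(pool_X[q]))
        final = KNeighborsClassifier(n_neighbors=3)                 # Output ΦF
        return final.fit(np.asarray(train_X), np.asarray(train_y))

Because |Φi(x) − 0.5| is smallest for the least confident prediction, the argmin over the pool implements Equation 1 directly.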

3 Dimensionality Reduction for Active Learning

While high performance supervised feature selection techniques [13] can be applied in supervised text categorisation problems, the same techniques cannot be readily employed in active learning since the majority of training data supplied are unlabelled. The use of benchmark corpora can allow the use of supervised feature selection [4]; however, in real world applications the label information is not available, which limits the applicability of this kind of approach. In general, dimensionality reduction is not performed for active learning in text categorisation problems. To compensate, classifiers capable of handling high dimensional data are preferred, restricting the choice of classifier used in active learning experiments. In this paper we explore an alternative approach which is suitable for realistic active learning in text categorisation problems. Two well established unsupervised dimensionality reduction techniques are considered for use in conjunction with active learning.

3.1 Document Frequency Global (DFG)

Document frequency [10] is a feature selection technique where features are chosen based on the number of documents in which they occur. Rare features which occur in only a small number of documents are removed, and only the features which occur in a large number of documents are retained. Despite its simplicity, the performance of document frequency is comparable to the best performing feature selection methods [13] such as Information Gain. It is worth noting that stopwords are removed before dimensionality reduction is performed.

Document frequency can be performed using either a local or a global policy. Local dimension reduction selects a set of terms for each category (context-sensitive); this obviously requires knowledge of the label information. Conversely, a global policy selects a set of the most frequent terms regardless of category, and hence does not require label information (context-free). We use document frequency performed globally as an unsupervised feature selection technique, as sketched below.
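A global document frequency pass needs only the unlabelled text. The following sketch, under the assumption that documents arrive as token lists with stopwords already removed, keeps the fraction of features with the highest document frequency (10% in the experiments reported later).

    from collections import Counter

    def dfg_select(docs, keep_fraction=0.10):
        """Global (context-free) document frequency selection: no labels used."""
        df = Counter()
        for doc in docs:          # docs: iterable of token lists
            df.update(set(doc))   # count each term at most once per document
        k = max(1, int(keep_fraction * len(df)))
        return {term for term, _ in df.most_common(k)}

    # Usage sketch: vocabulary = dfg_select(pool_docs + seed_docs)
    # then re-express every training and test document over `vocabulary`.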

3.2 Principal Components Analysis (PCA)

Principal Components Analysis is a method for projecting high dimensional data into a new low dimensional space with minimum loss of information. It is an unsupervised feature extraction technique which discovers the directions of maximal variance in the data. The coordinate system of the original data is orthogonally transformed, where the new coordinates are called the principal components (sometimes called principal axes). Principal components can be found by performing eigenvalue decomposition of the covariance matrix constructed from the training data. The solution to the eigenvalue decomposition is a set of eigenvectors with associated eigenvalues. The eigenvectors are the principal components of the data, while the eigenvalues give the amount of variance accounted for by each principal component. Principal components are sorted by their eigenvalues: the first principal component accounts for the largest amount of variance, the second principal component for the second largest amount, and so on.

3.2.1 PCA for Text Categorisation

Given a set of ℓ examples, principal component analysis will first centre the data by computing the mean µ (Equation 2) and subtracting it from each example. Centering the data is not essential but can remove irrelevant variance as it reduces the overall sum of the eigenvalues.

µ = (1/ℓ) Σ_{i=1}^{ℓ} xi    (2)

The covariance matrix C is constructed as the dot product of the centered examples, as given in Equation 3 (here centering is incorporated into the construction of the covariance matrix).

C = (1/ℓ) Σ_{i=1}^{ℓ} (xi − µ)(xi − µ)^T    (3)

The eigenvalue problem (Equation 4) is solved by performing eigenvalue decomposition on C. The solution is a set of eigenvectors (v) and their associated eigenvalues (λ).

Cv = λv    (4)

The d largest eigenvalues are sorted in descending order (λ1 ≥ λ2 ≥ λ3 ≥ ... ≥ λd) and their associated eigenvectors stacked to form the transformation matrix W = [v1, v2, v3, ..., vd]. A given example x is transformed into the PCA-reduced space by Equation 5.

y = W^T x    (5)

The value of d is an important factor in the success of PCA. Since the eigenvalues correspond to the amount of variance accounted for by their associated eigenvectors, the proportion of variance accounted for by the first d eigenvectors can be calculated as:

(λ1 + λ2 + ... + λd) / (λ1 + λ2 + ... + λd + ... + λ|N|)

In this paper we choose the leading d components which account for 90% of the variance in the data.
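The following sketch mirrors Equations 2-5 directly: centre the data, build the covariance matrix, eigendecompose, and keep the leading components that account for 90% of the variance. It is a naive formulation; for vocabulary-sized feature spaces one would in practice decompose the smaller ℓ×ℓ Gram matrix or use a truncated SVD, which yields the same components.

    import numpy as np

    def pca_fit(X, variance=0.90):
        """X: (n_examples, n_features). Returns the mean and the
        transformation matrix W of Equations 2-5."""
        mu = X.mean(axis=0)                      # Equation 2
        Xc = X - mu                              # centre the data
        C = (Xc.T @ Xc) / X.shape[0]             # Equation 3
        eigvals, eigvecs = np.linalg.eigh(C)     # Equation 4 (C is symmetric)
        order = np.argsort(eigvals)[::-1]        # sort eigenvalues descending
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        explained = np.cumsum(eigvals) / eigvals.sum()
        d = int(np.searchsorted(explained, variance)) + 1   # smallest d covering 90%
        return mu, eigvecs[:, :d]                # W = [v1, ..., vd]

    def pca_transform(X, mu, W):
        return (X - mu) @ W                      # Equation 5, y = W^T x row-wise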

4 Empirical Evaluation

4.1 Experimental Setup

Experiments were conducted to examine the effect of the proposed unsupervised dimensionality reduction techniques on the performance of active learning. Two standard benchmark corpora previously used in active learning research [11, 8] were used, namely the Reuters-21578 corpus and a subset of the 20 Newsgroups corpus. The original feature set was obtained by preprocessing the corpora to remove stopwords and punctuation. Stemming was performed using the Porter stemming algorithm. Reduced feature sets were constructed using the two unsupervised dimensionality reduction techniques applied to the unlabelled and seed data. DFG retained only the 10% most frequent features, while PCA transformed the original data onto a d-dimensional space, where d was chosen as the number of principal components which accounted for 90% of the variance in the data. Both the training and test sets were re-expressed in the reduced feature representation.

The kNN is a high performance classifier for text categorisation [12]; however, it is sensitive to high dimensional data. While it is not commonly used for active learning text categorisation tasks, we chose the kNN since it benefits greatly from dimensionality reduction. The output of the kNN was transformed into a class membership probability estimate, where the distribution is based on the distance of the query example to the k nearest neighbours (one plausible form is sketched below). The estimate was then used as a measure of uncertainty (as discussed in Section 2.1.1). The k value was fixed at 3 in our experiments. The optimal value for k is typically found using validation data, which is not available in active learning. A low value for k is also important for the early iterations of active learning since the number of training examples can be very low.

Comparisons are made between a baseline kNN using the full feature set (FULL), kNN using the dimensionality reduced data (DFG and PCA), and a top-line Support Vector Machine trained on the full feature set (SVM). The Spider toolbox for Matlab (www.kyb.tuebingen.mpg.de/bs/people/spider/) was used to perform the experiments, with the "andre" optimisation selected for the SVM. Active learning was seeded with 4 positive and 4 negative examples. One query example was selected per iteration. Once started, active learning was stopped only when all the unlabelled examples had been selected from the pool. The performance of active learning was measured using the classifier induced in each iteration (Φi) evaluated on a test set. Each experiment was run ten times and the results averaged. Within each trial the same seed examples were supplied to each of the techniques.
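Since the exact weighting used to turn kNN output into a probability is not specified above, the following is one plausible distance-weighted scheme under that caveat: each of the k nearest neighbours votes with weight inversely proportional to its distance, and the positive-class mass is normalised to [0, 1].

    import numpy as np

    def knn_positive_probability(x, train_X, train_y, k=3, eps=1e-12):
        """Distance-based class membership estimate for a binary problem
        (labels in {0, 1}); 0.5 marks maximal uncertainty."""
        dists = np.linalg.norm(train_X - x, axis=1)
        nearest = np.argsort(dists)[:k]
        weights = 1.0 / (dists[nearest] + eps)   # closer neighbours count more
        positive = weights[train_y[nearest] == 1].sum()
        return positive / weights.sum()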

4.2 Reuters-21578 (R10)

We used R10 [2], the ten most frequent categories of the "ModApte" split. One-versus-rest experiments were constructed for each individual category. To reduce the computational overhead of performing active learning, a pool of 1,000 documents was randomly selected from the 9,603 training documents, as used in previous active learning research [11]. PCA selected, on average, the leading 306 principal components, a 98.5% reduction in dimensionality. DFG retained only the top 1,987 (10%) features.

Due to the unbalanced class distribution, the F1 value of precision (π) and recall (ρ) was chosen as the performance metric, where F1 = 2πρ/(π + ρ). F1 was calculated using both macroaveraged and microaveraged variants of precision and recall, as sketched below.

Table 1. Iterations of active learning required to achieve supervised learning performance for R10 (percentage of pool labelled).

              Full        DFG         PCA
    MacroF1   454 (46%)   324 (33%)   243 (25%)
    MicroF1   923 (93%)   444 (45%)   385 (39%)
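For reference, macroaveraging computes π and ρ per one-versus-rest problem and averages them before taking F1, while microaveraging pools the contingency counts across categories first. The sketch below follows that reading of the description above; other macroaveraging conventions (averaging per-category F1 directly) also exist.

    def precision_recall(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return p, r

    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0

    def macro_micro_f1(counts):
        """counts: one (tp, fp, fn) triple per one-versus-rest category."""
        per_cat = [precision_recall(*c) for c in counts]
        macro_p = sum(p for p, _ in per_cat) / len(per_cat)
        macro_r = sum(r for _, r in per_cat) / len(per_cat)
        tp, fp, fn = (sum(col) for col in zip(*counts))   # pooled counts
        return f1(macro_p, macro_r), f1(*precision_recall(tp, fp, fn))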

Figure 1. Performance of Active Learning for R10: (a) Macro F1, (b) Micro F1. The number of iterations of active learning is given on the X axis and F1 on the Y axis. Curves: PCA, FULL, DFG, SVM.

Performance of active learning on the R10 data is given in Figure 1. DFG and PCA can be seen to lift the performance of active learning closer to that achieved by the top-line SVM classifier. Of the two unsupervised dimensionality reduction techniques, PCA achieves both a greater reduction in dimensionality and a higher performance increase.

The number of iterations of active learning required to produce a classifier (Φi) with performance equal to a classifier constructed by supervised learning on all training data, using the full feature set, is given in Table 1. Increasing the performance of active learning subsequently reduces the labelling effort. Both DFG and PCA increase performance, resulting in reductions in the number of required labelled examples. Again, PCA is seen to outperform DFG. Given the high cost of labelling, it is useful to consider halting active learning after a limited number of labels are acquired. Stopping at 250 iterations, the increase in F1 of PCA compared to FULL is 0.0496 (Macro) and 0.1561 (Micro), while the increase in F1 of DFG compared to FULL is 0.0219 (Macro) and 0.0802 (Micro). Bold text indicates statistical significance (α = 0.05).

4.2.1 Random Feature Selection

It could be the case that the observed improvements in performance are due simply to the positive effect that reducing the number of features has on the classifier, irrespective of the quality of the reduced set. In order to test that possibility we compared the performance of the baseline to random feature selection [3].

Figure 2. Performance of Rand compared to Full: (a) Macro F1, (b) Micro F1. Iterations of active learning are given on the X-axis and F1 on the Y-axis.

Figure 2 plots the performance of random feature selection (Rand) with respect to the original feature set (FULL) on the R10 dataset (Rand was not run on the 20 Newsgroups dataset for the sake of brevity). The performance of Rand is significantly worse, which shows that the features selected by the unsupervised techniques are discriminative.

4.3 20 Newsgroups Subset (20NG)

Four 1v1 problems were constructed from the 20 Newsgroups corpus [8]. The problems range in difficulty from easy to hard. Ten 50%/50% training/testing splits of the data were constructed and the results obtained were averaged. The average numbers of principal components chosen were: (A-R) 194, (G-X) 23, (W-H) 232 and (B-C) 210, reducing dimensionality by approximately 81% on average. DFG reduced dimensionality by 90%.

Since these are 1v1 problems, Error was used as the performance metric. Figure 3 plots the Error rate of active learning for the four sub-problems. Both of the unsupervised dimensionality reduction techniques (DFG and PCA) increase the performance of active learning. PCA again offers greater reductions in dimensionality and also outperforms DFG on all four problems.

Figure 3. Performance of Active Learning for 20NG: (a) Atheism-Religion (A-R), (b) Graphics-X (G-X), (c) Windows-Hardware (W-H), (d) Baseball-Cryptography (B-C). Iterations of active learning are given on the X-axis and Error on the Y-axis. Curves: PCA, FULL, DFG, SVM.

Table 2. Iterations of active learning required to achieve the performance of supervised learning for 20NG (percentage of pool labelled).

           Full        DFG         PCA
    A-R    672 (95%)   597 (85%)   416 (59%)
    G-X    773 (80%)   661 (68%)   58 (6%)
    W-H    616 (64%)   553 (57%)   342 (35%)
    B-C    408 (42%)   428 (44%)   201 (21%)

We compared the number of iterations required to produce a classifier (Φi) with performance equal to that of a classifier produced by supervised learning on all training data using the full feature set. Table 2 shows a significant reduction in the labelling effort when the dimensionality reduction techniques are employed.

Stopping after just 250 iterations, the reduction in Error of PCA compared to FULL is: (A-R) 0.0574, (G-X) 0.1883, (W-H) 0.0643, (B-C) 0.1517, while the reduction in Error of DFG compared to FULL is: (A-R) 0.0255, (G-X) 0.028, (W-H) 0.0316, (B-C) 0.0386. Bold text indicates statistical significance (α = 0.05).

4.4 Discussion

Empirical evaluation shows that employing unsupervised dimensionality reduction increases the performance of active learning using a kNN. DFG offered some performance increase compared with the baseline (FULL). This increase was shown to be a result of the selection of discriminative features, since random feature selection failed to achieve any increase in performance. PCA outperformed DFG in all of the experiments conducted. There are some noticeable differences between the two techniques which may account for the increased performance. While DFG statically reduced the dimensionality of the data, PCA dynamically reduced the dimensionality until the majority of the variance in the data was accounted for. In the 20NG experiments, for instance, dimensionality was reduced to just 23 features in the G-X sub-problem. Classification in the reduced feature set was consequently considerably easier, leading to higher performance of active learning and a large reduction in the labelling effort (58 iterations compared to the baseline of 773). While PCA was shown to be the best performing technique, the computational expense associated with PCA is far greater, which limits its applicability to very large datasets. DFG offers some increased performance at much lower computational expense.

5 Conclusions and Future Work

Supervised dimensionality reduction techniques cannot be readily employed in active learning scenarios since the majority of training data is unlabelled. The choice of classifier used in active learning is therefore limited to those which do not suffer from the curse of dimensionality. This paper investigated the use of well established unsupervised dimensionality reduction techniques in active learning on text categorisation problems, to increase performance and allow for greater flexibility in the choice of classification algorithm. Empirical evaluations on two benchmark corpora show that both Document Frequency performed Globally and Principal Components Analysis significantly increased the performance of active learning when using a kNN. In both sets of experiments PCA was found to outperform DFG; however, this increased performance comes with the higher computational overhead associated with conducting PCA. We plan to continue this research by looking at Kernel Principal Components Analysis (KPCA) [9], which allows non-linear principal components to be found.

Acknowledgements

This research is funded by the Irish Research Council for Science, Engineering and Technology (IRCSET).

References

[1] M. Davy and S. Luz. Active learning with history-based query selection for text categorisation. In Proceedings of the 29th European Conference on Information Retrieval Research (ECIR 2007), LNCS 4425, page 695, 2007.
[2] F. Debole and F. Sebastiani. An analysis of the relative hardness of Reuters-21578 subsets. Journal of the American Society for Information Science and Technology, 56(6):584–596, 2005.
[3] G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305, 2003.
[4] S. Hoi, R. Jin, and M. Lyu. Large-scale text categorization by batch mode active learning. In Proceedings of the 15th International Conference on World Wide Web, 2006.
[5] D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12, 1994.
[6] A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the 15th International Conference on Machine Learning, pages 350–358, 1998.
[7] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[8] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proceedings of the 17th International Conference on Machine Learning, pages 839–846, 2000.
[9] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In Advances in Kernel Methods: Support Vector Learning, pages 327–352, 1999.
[10] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
[11] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66, 2001.
[12] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 42–49, 1999.
[13] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, 1997.