Categorization of Large Text Collections: Feature Selection for Training Neural Networks

Pensiri Manomaisupat (1), Bogdan Vrusias (1), and Khurshid Ahmad (2)

(1) Department of Computing, University of Surrey, Guildford, Surrey, UK {P.Manomaisupat, B.Vrusias}@surrey.ac.uk
(2) Department of Computer Science, O'Reilly Institute, Trinity College, Dublin 2, Ireland [email protected]

Abstract. Automatic text categorization requires the construction of appropriate surrogates for documents within a text collection. The surrogates, often called document vectors, are used to train learning systems for categorising unseen documents. A comparison of different measures (tfidf and weirdness) for creating document vectors is presented, together with two different state-of-the-art classifiers: the unsupervised Kohonen SOFM and the supervised Vapnik SVM. The methods are tested using two 'gold standard' document collections and one data set from a 'real-world' news stream. There appears to be an optimal size both for the number of document vectors and for the dimensionality of each vector that gives the best compromise between categorization accuracy and training time. The performance of each of the classifiers was computed for five different surrogate vector models: the first two surrogates were created with the tfidf and weirdness measures respectively, the third surrogate was created purely on the basis of high-frequency words in the training corpus, and the fourth vector model was created from a standardised terminology database. Finally, the fifth surrogate (used for evaluation purposes) was based on a random selection of words from the training corpus.

Keywords: Feature selection, Text categorisation, Information extraction, Self-organizing feature map (SOFM), Support vector machine (SVM).

1 Introduction

News text streams, like Reuters and Bloomberg, supply documents in a range of topic areas, and the topic 'code(s)' are marked on the news texts. The assignment of topic codes requires the definition of topics. For automatic topic assignment, a semantic concept space of relevant terms has to be created. There is a degree of arbitrariness in the assignment of topic codes by the news agency (subeditors) [1]. The problems relating to the choice of terms can be obviated to an extent by using information retrieval measures like tfidf, where statistical criteria are used to select terms that are specific enough to be characteristic of a domain and characteristic of the majority of documents within the domain. More problems are encountered when topics are closely related (for example, in one of our experiments we found that the terminology
of currency markets and bond markets has substantial overlaps). Due to the large volumes of news, covering different topics, that are becoming available, it is important to have a system that not only can categorise but can also learn to categorise. Once trained to a certain performance measure, the system is then tested with an arbitrary selection of unseen texts from the domain. Typically, a text categorization system is presented with a representative sample of texts comprising one or more topic areas: each text is then represented by a surrogate vector – usually containing keywords that characterize a specific document or sets of keywords that may represent a domain [2]. The choice of the keywords – based on information-theoretic measures – is not without controversy or problems. Once the keywords are chosen, the training of a text categorization system begins: essentially, given the training texts and a lexicon of keywords to construct the vectors, a (supervised) learning system learns to categorize texts according to the categories that are prescribed by a teacher; or, in the case of an unsupervised learning system, the vectors are clustered according to a similarity measure and the resulting clusters are expected to have some relation to the given categories of news texts in the training data set. Supervised learning systems require pre-knowledge of the categories and of the keywords that may be used to construct the surrogate vectors. In the case of unsupervised systems, only knowledge of the keywords is required; however, the categories generated by an unsupervised method may be at considerable variance with real-world categories. The discussions in the information retrieval/extraction literature focus largely on supervised or semi-supervised classification [3, 5] and show a keen interest in recent developments in machine learning, especially support vector machines (SVM) [5]. The success of unsupervised text categorization systems was precipitated by the development of WEBSOM [7, 8], whose authors looked at a number of different methods for describing the 'textual contents of documents statistically' and commented on traditional information retrieval measures: tfidf and latent semantic indexing (LSI). An important finding here is that LSI in particular is computationally more complex than 'random projection techniques', which compress the feature vector space by around 1%, and that the use of LSI is not as straightforward as it appears. It has been noted that the use of LSI does not offer a significant advantage over term space reduction methods in the context of neural classifiers [9]. The measures used for selecting the keywords for surrogate vectors have been debated extensively in the text categorization literature. Consider how such vectors were created for the 'largest WEBSOM map […] [that interfaces] a data base of 6,840,568 patent abstracts' written in English with an average length of 132 words [7, pp. 581]. A total of 733,179 base forms exist in this patent-abstract corpus: the WEBSOM designers decided intuitively to remove all words 'occurring less than 50 times in the whole [patent] corpus, as well as a set of common words in a stop-word list of 1335 words'; the 'remaining vocabulary consisted of 43,222 words'; 122,524 abstracts in which fewer than five words of this vocabulary occurred were 'omitted'. The self-organising system thus trained had an excellent classification accuracy of around 60% and took 'about six weeks on a six-processor SGI O2000 computer' (ibid, pp. 582).
The choice of keywords appears inspired; what is needed is to look at ways in which the choice itself is automated, just as the training is. This is the burden of the argument in this paper.


In our work we have looked at news wire either organised according to specific topics (the TREC 96 collection) or provided by a news agency covering a range of topics. The latter comprise two text collections: the topic-diversified RCV1 corpus and a daily financial news wire supplied by Reuters Financial over one month. A part of each of the three news collections – also referred to as news wire corpora – was used for training two text categorisation systems and another, smaller part was used for testing. We have contrasted the performance of the two systems, trained on two very different training algorithms, to potentially eliminate the bias due to the algorithms used. The keywords used in the construction of vectors for training the two systems were extracted automatically from the text collections using two different methods: the first method uses the well-known tfidf measure [3, 4] and the second uses information about the use of English in an everyday context and then selects keywords on the basis of the contrast in the distribution of the same word in a specialist and a general language context (weirdness [12]). In order to evaluate the effectiveness of the two methods we have compared the performance of systems trained using words selected (i) randomly; (ii) purely on the basis of being above an arbitrary frequency threshold; and (iii) using keywords already available from an on-line thesaurus; techniques (i)-(iii) will be referred to as 'baseline' vectors. The performance measurements for the supervised learning system included the use of a contingency table for contrasting the correctly/incorrectly recalled documents from the same category with the category of a test document; we have used the 'break-even point' statistic to evaluate the performance of the SVM-based text categorization system. The results presented here involved training an SOFM and an SVM with very large vectors (starting from a 27-component vector through to a 6561-component vector) and a large number of documents (600 to 22,418). The computation for a 27-component vector takes about 1 minute for 1,000 input vectors, and a 6561-component vector takes about 30 hours for 10,000 input vectors. These computations were carried out over many thousands of iterations. In this paper we report the results of one single run for the two methods over 4 different lengths of vector together with 5 different methods of choosing keywords. Altogether we carried out 64 experiments each for SOFM and SVM (24 for TREC-AP, 20 each for RCV1 and Reuters Financial) over a period of about 10 weeks. Ideally we should have repeated these experiments, especially for the SOFM, and reported an aggregated result together with the values of standard deviations. This work is in progress.
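To make the notion of a surrogate vector concrete, the following is a minimal sketch (not the authors' implementation) of how a tokenised document might be mapped onto a fixed keyword vector; the keyword list and tokenisation shown are illustrative assumptions only.

```python
from collections import Counter

def document_vector(tokens, keywords):
    """Surrogate vector: one component per selected keyword,
    holding that keyword's frequency in the document (0 if absent)."""
    counts = Counter(tokens)
    return [counts.get(k, 0) for k in keywords]

# Keywords would be chosen by tfidf, weirdness, raw frequency, a term base,
# or at random (the five selection schemes compared in this paper).
keywords = ["currency", "bond", "yield", "inflation"]   # illustrative only
print(document_vector("the bond yield rose as the currency fell".split(), keywords))
```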

2 Method

In the following section we describe the measures for creating document vectors (2.1). The goal of a text classification algorithm is to assign a label to a document based upon its contents. We wish to explore how the choice of keywords used to build a surrogate vector affects the ability of a system that is learning to classify documents. It may be argued that for supervised systems a complicating factor may be the choice of a 'teacher', and for unsupervised systems the factor may be the manner in which clusters are created and subsequently labelled [14]. In order to avoid such complications we use two popular and very different techniques: an exemplar unsupervised algorithm – Kohonen's SOFM [7] (2.2); and an exemplar supervised algorithm – Vapnik's SVM [11] (2.3).


2.1 Measures for Creating Document Vectors from Text Collections

The first measure, Term Frequency–Inverse Document Frequency (tfidf), facilitates the creation of a set of vectors for a document collection by looking at the importance of a term within a document collection, document by document. Subsequently, a weight is assigned to individual terms based on the observation as to whether a given term can help in identifying a class of documents within the collection: the more frequent a term is in a specific collection of documents, and the less it appears in other collections, the higher its tfidf value. The second measure is based on contrasting the frequency distribution of a word within a document collection, or a sub-collection, with the distribution of the same word within a non-specialist but representative general language collection, for example the 100-million-word British National Corpus for the English language. The ratio of the relative frequencies in the specialist and general language collections indicates whether or not a token is a candidate term. The ratio varies between zero (token not in the specialist collection) and infinity (token not used in the general language collection); a ratio of unity is generally found for the closed-class words. Tokens with a higher ratio are selected as feature terms [12]. For creating baseline vectors we have used three 'measures': the first 'measure' relates to the random selection of words from the text collection, excluding the closed-class words; the second measure relates to the selection of the most frequent words, excluding closed-class words, from a given text collection; the third 'measure' relates to the selection of only those words that occur in a publicly available terminology database for the specialist domain from which the texts in the collection originate. For the standardised term selection, we have used a web-based glossary of financial terms comprising 6,000 terms divided over 25 financial sub-categories [13] (see Table 1). In order to assess the optimum length of a surrogate vector, we have repeated our experiments with 3^n tokens (where n = 1-6).

Table 1. The computation of term weighting used in the creation of vectors in the different models: the weighting is the product of term frequency (a) and text frequency (b) [14]

Method                                        Text type                                                  Term Frequency (a)     Text Frequency (b)
Term Frequency/Inverse Document Frequency    Individual documents d in a collection C                   tf_{t,d}               idf_t
Contrastive Linguistic (weirdness)           Documents in a specialist collection 'C' and in a          tf_{t,C}               (N_G / N_C) * (1 / tf_{t,G})
                                             general language collection 'G'
Baseline: Random                             Collection C                                               tf_{t,d}               NA
Baseline: Pure Frequency                     Collection C                                               tf_{t,d} > threshold   NA
Baseline: Standardised term                  Collection C and terminology database TDB                  tf_{t,C}               1 if t ∈ TDB, 0 otherwise
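As an illustration of the two main weighting schemes in Table 1, the sketch below computes tfidf weights and weirdness ratios for tokenised collections. The function names, the idf form (log N/df) and the handling of tokens absent from the general corpus are assumptions made for this example, not the authors' exact formulation.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Per-document term weights: tf(t, d) * log(N / df(t))."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))                      # document frequency of each term
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in Counter(doc).items()}
            for doc in documents]

def weirdness(specialist_tokens, general_tokens):
    """Ratio of relative frequencies: (tf_C / N_C) / (tf_G / N_G)."""
    tf_c, tf_g = Counter(specialist_tokens), Counter(general_tokens)
    n_c, n_g = len(specialist_tokens), len(general_tokens)
    return {t: float("inf") if tf_g[t] == 0 else (f / n_c) / (tf_g[t] / n_g)
            for t, f in tf_c.items()}

# Feature selection sketch: keep the 3**n highest-weirdness open-class tokens.
# scores = weirdness(specialist_corpus_tokens, bnc_tokens)
# features = sorted(scores, key=scores.get, reverse=True)[:3**5]
```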


2.2 Self-Organising Feature Map

The self-organising feature map is based on a grid of artificial neurons whose weights are adapted to match the input vectors presented during the training process. Unlike supervised neural networks, the SOFM does not produce an error based on an expected output; rather, the SOFM learns to cluster similar data. An SOFM produces visualisations of the original high-dimensional data on a two-dimensional surface/map. The two-dimensional map produces clusters of input vectors based on 'detailed topical similarities', as the document categories are not known a priori for the unsupervised system [6, 14]. The lack of a priori category information is a given for rapidly updated text collections. Financial news collections are a case in point: categorisation at the sub-domain level is quite critical (e.g. financial news → currency news → US$ news); a sub-domain may disappear (for instance, financial news → currencies of national states now in the Euro zone); and new domains may appear (e.g. financial news → derivative trading of currencies). We have used the following algorithm for training an SOFM:

STEP 1. Randomise the map's nodes' weight vectors.
STEP 2. Take an input vector.
STEP 3. Traverse each node in the map:
   a. Use the Euclidean distance to measure the distance between the input and all of the map's nodes.
   b. Track the node that produces the smallest distance (the Best Matching Unit, BMU).
STEP 4. Update the BMU and all nodes in its neighbourhood by pulling them closer to the input vector according to Kohonen's formulae.
STEP 5. Repeat steps 3-4 with a new input vector until all the input has been taken.
STEP 6. Update the training parameters accordingly.
STEP 7. Repeat steps 5 and 6 for the desired number of cycles.

The weights w_ij(t+1) of the winning neuron and its neighbourhood are adjusted as the incremental learning occurs, following the formula:

    w_ij(t+1) = w_ij(t) + α(t) × γ(t) × (x_j − w_ij(t))        (1)

where w_ij is the current weight vector, x_j is the target input, t is the current iteration and α(t) is the learning restraint due to time. α(t) is given by:

    α(t) = (α_max − α_min) × exp( (t / t_max) × ln(1 / t_max) ) + α_min        (2)

where α_max is the starting learning rate and α_min is the final learning rate. The value t is the current cycle and t_max is the total number of cycles. γ decays exponentially with increasing training cycles and neighbourhood distances [8]. In this experiment, the number of exemplar input vectors has a fixed size similar to the output grid, which is determined during the unsupervised training process. Training commences with an output layer consisting of 15x15 nodes. This training process is repeated for a fixed number λ of training iterations; we set λ = 1,000 cycles. The key parameters for successful SOFM training include the learning rate (α) and the neighbourhood size (γ). We have used a typical initial setting for the training parameters, α = 0.9, γ = 8, both of which eventually decreased to zero towards the end of the training process. The testing regimen involves the computation of the Euclidean distance of a test input vector from the most 'excited' (winner-takes-all) node on the output map of the trained 15x15 SOFM, which eventually determines the classification of the test input.
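The training loop and equations (1)-(2) can be sketched as follows. This is a minimal NumPy illustration under the settings quoted in the text (15x15 grid, α_max = 0.9, initial neighbourhood 8, 1,000 cycles); the Gaussian neighbourhood factor γ and the α_min value are assumptions made for the example rather than details taken from the authors' code.

```python
import numpy as np

def train_sofm(data, grid=15, cycles=1000, a_max=0.9, a_min=0.01, radius=8):
    """Minimal SOFM trainer: data is an (n_vectors, dim) array of document vectors."""
    rng = np.random.default_rng(0)
    dim = data.shape[1]
    weights = rng.random((grid, grid, dim))           # STEP 1: random weight vectors
    coords = np.dstack(np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"))

    for t in range(cycles):
        # Eq. (2): learning rate decays from a_max towards a_min over the run
        alpha = (a_max - a_min) * np.exp((t / cycles) * np.log(1.0 / cycles)) + a_min
        sigma = max(radius * (1 - t / cycles), 1e-3)   # shrinking neighbourhood size
        for x in data:                                 # STEPS 2-5: present each vector
            dist = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dist), dist.shape)    # STEP 3b: the BMU
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
            gamma = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))   # neighbourhood factor
            # Eq. (1): pull the BMU and its neighbours towards the input
            weights += alpha * gamma[..., None] * (x - weights)
    return weights

def classify(x, weights):
    """Testing: the winner-takes-all node (BMU) determines the class of x."""
    dist = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(dist), dist.shape)
```

A test vector is then assigned to the grid node whose weight vector is closest, and the majority-vote label of that node (Section 3) gives the predicted category.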


2.3 Support Vector Machine

A Support Vector Machine is a relatively new learning approach for solving two-class pattern recognition problems [16, 17]. An SVM attempts to find a hyperplane that maximises the margin between positive and negative training examples, while simultaneously minimising training-set misclassifications. Given a training set of instance-label pairs (x_i, y_i), i = 1, ..., l, where x_i ∈ R^n and y ∈ {1, −1}^l, when perfect separation of the two classes is not possible a slack variable ξ is introduced, and w is defined as the weight vector. The support vector machine requires the solution of the following optimisation problem:

    min_{w, b, ξ}  (1/2) w^T w + C Σ_{i=1}^{l} ξ_i        (3)

    subject to  y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  where ξ_i ≥ 0        (4)

In a binary classification task m labelled examples (x_1, y_1), ..., (x_m, y_m) are used, where x_i ∈ X are training data points and y_i ∈ {−1, +1} are the corresponding class labels for the elements in X, and b is a parameter. In order to make the data linearly separable, data points are mapped from the input space X to a feature space F with a mapping Φ: X → F before they are used for training or for classification. The training vector x_i is mapped into a higher-dimensional space by the function Φ. The support vector machine is expected to find a linear separating hyperplane with the maximal margin in this higher-dimensional space (C > 0 is the penalty parameter of the error term). Following Yang and Liu [18], Dumais et al. [19] and Hearst [11], we have used the SVM with the RBF kernel function. The RBF kernel function is defined as:

    K(x_i, x_j) = exp(−γ ||x_i − x_j||^2),  γ > 0        (5)

For computing C and γ, the open-source implementation due to Hsu et al. [17] was used.
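A compact way to reproduce this kind of set-up is sketched below with scikit-learn's SVC (an interface to LIBSVM). The grid-search ranges for C and γ and the placeholder data are assumptions for illustration only, not the values or tools used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: (n_documents, vector_size) surrogate document vectors, y: topic labels
X = np.random.rand(200, 243)            # placeholder data for illustration only
y = np.random.randint(0, 2, size=200)

# Grid search over the penalty C and the RBF width gamma (cf. equations 3-5)
param_grid = {"C": [2 ** k for k in range(-5, 16, 2)],
              "gamma": [2 ** k for k in range(-15, 4, 2)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best C and gamma:", search.best_params_)
```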

3 Performance Measures

Two key metrics were used for determining the performance of the SOFMs – the so-called classification accuracy and the average quantization error (AQE). For SVM performance we chose the break-even point measure for measuring the effectiveness of the classification task.


Classification accuracy: Our evaluation method is similar to that of Kohonen et al. [7] and Hung et al. [20, 21]. Kohonen defines the categorisation error in terms of 'all documents that represented a minority newsgroup at any grid point were counted as classification errors.' We have used a majority voting scheme: given that a node in the SOFM has "won" many news reports from the same topic category, our system can check the category information (which is not used during the training process). If a majority of the news reports won by a node belong to one category, then that node is assigned the majority's category. The classification accuracy is computed during testing by simply checking whether the category of the test news report matches that assigned to the BMU by the majority voting scheme.

Average quantization error (AQE): The best SOFM is expected to yield the smallest average quantization error. The mean quantization error (AQE) is defined as:

    AQE = (1/N) Σ_{i=1}^{N} ||x_i − w_i||        (6)

where N is the total number of input patterns, x_i is the i-th training vector, and w_i is the weight vector of the best-matching node for input pattern i [7].

Break-even point (BEP): For supervised networks, which have access to the correct category information, the information retrieval measures of precision and recall are used. The break-even point measure is an equally weighted combination of recall and precision.
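Both quantities follow directly from the definitions above. The sketch below assumes a trained weight grid of shape (rows, cols, dim) for the AQE of equation (6), and approximates the break-even point from per-threshold precision/recall pairs as the point where the two are closest; the exact interpolation used by the authors is not specified, so this is only one common reading of the measure.

```python
import numpy as np

def average_quantization_error(data, weights):
    """Eq. (6): mean distance between each input vector and its best-matching unit."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    errors = [np.min(np.linalg.norm(flat - x, axis=1)) for x in data]
    return float(np.mean(errors))

def break_even_point(precisions, recalls):
    """Approximate BEP: the value where precision and recall are (closest to) equal
    along a ranked list, reported here as (P + R) / 2 at that point."""
    precisions, recalls = np.asarray(precisions), np.asarray(recalls)
    i = int(np.argmin(np.abs(precisions - recalls)))
    return (precisions[i] + recalls[i]) / 2.0
```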

4 Experiments and Results

We report on experiments carried out for selecting an 'appropriate' vector representation model for training and testing unsupervised and supervised neural networks for text classification. The data sets used in creating representative vectors are publicly available. The systems used for training and testing can be obtained by request to the authors. We have used two data sets that have been used extensively in testing and evaluating text categorisation systems over the last 10 years: the TREC-AP news wire (http://www-nlpir.nist.gov/projects/duc/pubs.html and http://trec.nist.gov/faq.html) and the Reuters-22173 text collection (RCV1). These data sets were chosen by text retrieval experts from the 'noisy' real-world news wire; we have used a third data set, only available through subscription – the Reuters Financial News Stream (Table 2).

Table 2. The text collections used in our study. The texts were selected from a larger dataset: the TREC-AP news corpus comprises 242,918 documents; RCV1 has 806,791. For Reuters Financial we chose a month's news supply, which was 22,418 news stories for that month.

                        Training Set                       Testing Set
Collection              # of Documents    # of Words       # of Documents    # of Words
TREC-AP news            1,744             962,322          543               321,758
RCV1                    600               114,430          180               64,217
Reuters Financial       13,670            6,710,000        8,748             4,074,174


4.1 SOFM Classifier Results

We begin by describing the results obtained on the TREC-AP collection. Recall that experts carefully chose this collection and their focus was on 10 well-defined topics. We created five vector sets of sizes 27, 81, 243, 2187 and 6561 – powers of 3. The classification accuracy improves when the vector size, and therefore the number of features, is increased, for both of the key feature selection measures – weirdness and tfidf. However, the accuracy quickly reached the 99% mark when the vector size was increased from 27 to 81 dimensions, and the accuracy remained about the same despite the increase in the vector size to 6561 (see Table 3). The comparison with our first benchmark – randomly-selected keyword vectors – indicates that for small vector sizes the SOFM trained with randomly selected keywords has a classification accuracy of 60% for a 27-component vector and 72% for a 243-component vector: compare these with 96% and 99% for an SOFM trained using words of high weirdness and tfidf. As the number of components increases, the randomly selected vector comes to within 10% of the classification accuracy for the 6561-component vector. The second benchmark – the high-frequency vector – has better or equal accuracy for all vector sizes (between 27 and 6561 components).

Table 3. Classification accuracy results using different sizes of vectors in the TREC-AP collection

              Feature Selection Measure         Benchmarking Measures
Vector Size   Weirdness    tfidf                Random    High-frequency    Term-base
27            96.2%        96.3%                60.2%     99.1%             97.0%
81            99.3%        92.5%                66.8%     98.8%             97.1%
243           99.1%        93.7%                72.9%     99.0%             97.8%
2187          99.3%        98.8%                88.9%     99.0%             99.3%
6561          99.1%        98.8%                88.1%     98.1%             -

Table 4. AQE of five feature selection methods for TREC-AP

              Feature Selection Measure         Benchmarking Measures
Vector Size   Weirdness    tfidf                Random    High-frequency    Term-base
27            0.002        0.001                0.02      0.01              0.02
81            0.01         0.02                 0.004     0.03              0.02
243           0.05         0.04                 0.01      0.07              0.04
6561          0.12         0.12                 0.13      0.19              -

As expected, the average quantization error increases when the vector sizes are increased, both for the key feature selection measures and for the three benchmark measures (see Table 4). The quantization error indicates the extent to which the categories are isolated from each other. Recall that we are using a fixed-size output surface (15x15) but we are increasing the number of components of the input vector by 2-3 orders of magnitude.


Perhaps the increased resolution of the input vector, thereby allowing for more potential categories, should have been matched by an increment in the output map. As the output map was fixed, nodes that had 'won' different categories could not have been isolated enough. A higher AQE does not necessarily indicate that the classification will be bad; it simply indicates that there are no clear cluster boundaries. We are investigating the use of AQE in this respect. The classification errors for the Reuters RCV1 corpus are substantially higher than those for TREC-AP, as this corpus is more diffuse. The increment in the size of the vectors does decrease the error, and again there is a plateau when the vector length increases beyond 243 (3^5). The keyword selection method shows a mild impact: vectors constructed using tfidf show a marginally poorer performance than those constructed using weirdness analysis, especially for vectors of longer lengths (>243). As expected, the more diffuse Reuters Financial News Stream corpus was categorized with much poorer classification accuracy, partly because the number of terms used in the corpus is far higher than the maximum vector length of 6561; the weirdness measure appears to perform better than the tfidf measure for constructing the vectors.

[Figure 1: classification accuracy vs. vector size for the RCV1 and Reuters Financial corpora. Panel (a): W-RCV1, TF-RCV1, W-R Financial, TF-R Financial; panel (b): HF-RCV1, TB-RCV1, HF-R Financial, TB-R Financial.]

Fig. 1 shows the results for the benchmark vectors and for a simple choice of very high-frequency words as the basis of feature selection. This simple choice of keywords pays handsome dividends, giving roughly the same or better classification accuracy when compared with the results of the weirdness or tfidf measures. The other empirical choice of keywords – constructing vectors based on the entries in a terminology database – appears to give similar results.

4.2 SVM Classifier

The performance of the support vector machines was quantified using the break-even point (BEP) computation – a combination of the precision and recall measures. The results are, as expected, better than those obtained with the SOFM classifiers. Increasing the size of the vector beyond 243 keywords eventually yields little further improvement for both the weirdness-based and tfidf constructions, and the former measure gives a better BEP than the latter (Figure 2a). The high-frequency vector outperforms all other methods of construction, followed closely by vectors constructed exclusively from term bases (Figure 2b).

[Figure 2: break-even point measure vs. vector size. Panel (a): TREC-AP, RCV1, Reuters; panel (b): HF-RCV1, TB-RCV1, HF-R Financial, TB-R Financial.]

Fig. 2. Classification accuracy (CA) for SVM for different methods of constructing training vectors for learning categories in the two diffuse corpora (Reuters - RCV1 and Reuters Financial News Stream). Figure 2(a) shows the comparison of CA between weirdness-based measures (W) and the tfidf measure. Figure 2(b) shows the comparison between the ‘benchmark’ vectors using high-frequency terms in the RCV1 and R-Financial corpora and using terms in a terminology database only.

5 Conclusions and Future Work

In this paper we have discussed the relative merits of different ways of constructing feature vectors for training classifiers. The choice of two different learning algorithms was intended to eliminate the bias due to the learning algorithms. We have observed that there is an optimum size of the feature vector beyond which the classification accuracy, and the break-even point, do not increase much, whilst the computation time increases with the vector length. Usually, human evaluators are involved in judging the categorization performance of learning systems. Given the volume of data now available, it is not possible to engage human evaluators in a realistic sense – although when available we should use human volunteers. One way to circumvent this problem is to use feature vectors with randomly selected terms from the training corpus, or to select terms whose frequency is above a certain threshold within the corpus, or to construct a vector from a terminology database without reference to the training corpus. Terms selected on the basis of simple frequency (or other frequency-based measures like tfidf and weirdness) appear to lead to training vectors that, in turn, lead to higher classification accuracy (or BEP). Work on large corpora needs further research, due largely to the sheer volume of data and the computation time involved. All our results on CA and BEP are averages over thousands of trials, but in the case of the Reuters Financial corpus we need to carry out more computations to deal with this topic-diverse corpus.

Acknowledgments. Two of the authors (P. Manomaisupat and B. Vrusias) gratefully acknowledge the support of the UK EPSRC sponsored REVEAL project (EPSRC GR/S98450/01). K. Ahmad acknowledges the support of Trinity College, Dublin.

References

1. Manomaisupat, P.: Term Extraction for Text Categorisation. Unpublished PhD Dissertation, Department of Computing, University of Surrey (2006)
2. Liao, D., Alpha, S. and Dixon, P.: Feature Preparation in Text Categorisation. Technical Report, Oracle Corporation. Available at: http://www.oracle.com/technology/products/text/index.html [Accessed: May 25, 2005]


3. Croft, W.B. and Lewis, D.D.: Term Clustering of Syntactic Phrases. Proc. of the 13th Annual Int. ACM SIGIR Conf. on R&D in Information Retrieval, Brussels, Belgium (1990) 385-404
4. Manning, C.D. and Schütze, H.: Foundations of Statistical Natural Language Processing. Cambridge (Mass.) & London: The MIT Press (1999/2003)
5. Beitzel, S.M., Jensen, E.C., Frieder, O., Lewis, D.D., Chowdhury, A. and Kołcz, A.: Improving Automatic Query Classification via Semi-Supervised Learning. IEEE Int. Conf. on Data Mining (ICDM'05) (2005) 42-49
6. Lewis, D.D.: Applying Support Vector Machines to the TREC-2001 Batch Filtering and Routing Tasks (2001)
7. Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V. and Saarela, A.: Self Organization of a Massive Document Collection. IEEE Trans. on Neural Networks, vol. 11, no. 3 (2000) 574-585
8. Kohonen, T.: Self-Organizing Maps. Springer Verlag (2001)
9. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys, vol. 34, no. 1 (2002) 1-47
10. Xu, R. and Wunsch, D.: Survey of Clustering Algorithms. IEEE Transactions on Neural Networks, vol. 16, no. 3 (2005) 645-678
11. Hearst, M.A.: Support Vector Machines. IEEE Intelligent Systems, vol. 13, no. 4 (1998) 18-28
12. Ahmad, K. and Rogers, M.A.: Corpus Linguistics and Terminology Extraction. In: S-E. Wright and G. Budin (eds.), Handbook of Terminology Management (Volume 2). Amsterdam & Philadelphia: John Benjamins Publishing Company (2001) 725-760
13. Investorwords.com (www.investorwords.com) [Accessed on 5 September, 2005]
14. Manomaisupat, P. and Ahmad, K.: Feature Selection for Text Categorisation Using Self-Organising Map. Proc. ICNN&B Int. Conf. on Neural Networks and Brain, vol. 3 (Oct 2005) 1875-1880
15. Azcarraga, A.P., Yap Jr., T.N., Chua, T.S. and Tan, J.: Evaluating Keyword Selection Methods for WEBSOM Text Archives. IEEE Trans. on Knowledge and Data Engineering, vol. 16, no. 3 (2004) 380-383
16. Keerthi, S.S. and Lin, C.J.: Asymptotic Behaviours of Support Vector Machines with Gaussian Kernel. Neural Computation, vol. 15 (2003) 1667-1669
17. Hsu, W., Chang, C.C. and Lin, C.J.: A Practical Guide to Support Vector Classification. Technical Report, Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei (2003)
18. Yang, Y. and Liu, X.: A Re-examination of Text Categorization Methods. Proc. of the 22nd Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR) (1999) 42-49
19. Dumais, S.T., Platt, J., Heckerman, D. and Sahami, M.: Inductive Learning Algorithms and Representations for Text Categorization. Proc. of the 7th ACM Int. Conf. on Information and Knowledge Management (CIKM-98), Washington, US (1998) 148-155
20. Hung, C. and Wermter, S.: A Dynamic Adaptive Self-Organising Hybrid Model for Text Clustering. Proc. of the 3rd IEEE Int. Conf. on Data Mining (ICDM 03), IEEE Press (2003) 75-82
21. Hung, C., Wermter, S. and Smith, P.: Hybrid Neural Document Clustering Using Guided Self-Organization and WordNet. IEEE Intelligent Systems, vol. 19, no. 2 (2004) 68-77