Urdu Text Classification
Abbas Raza Ali, Maliha Ijaz
National University of Computers and Emerging Sciences, Block-B, Faisal Town, Lahore, Pakistan
[email protected], [email protected]

ABSTRACT
This paper compares statistical techniques for text classification, namely Naïve Bayes and Support Vector Machines, in the context of the Urdu language. A large corpus is used for training and testing the classifiers. Since these classifiers cannot interpret the raw dataset directly, language-specific preprocessing techniques are applied to it to generate a standardized and reduced-feature lexicon. Urdu is a morphologically rich language, which makes these tasks complex. Statistical characteristics of the corpus and lexicon are measured and show that the text preprocessing module performs satisfactorily. The empirical results show that Support Vector Machines outperform the Naïve Bayes classifier in terms of classification accuracy.
Keywords
Corpus, information retrieval, lexicon, Naïve Bayes, normalization, feature selection, text classification, text mining, Urdu.
1. INTRODUCTION
Text classification is the process of automatically assigning unknown text to the most probable class to which it belongs. As electronic information grows day by day, text classification has become a key technique for organizing large amounts of data for analysis and processing [9]. It is involved in many applications such as text filtering, document organization, classification of news stories, and searching for interesting information on the web. These systems are language specific and have mostly been designed for English, while little work has been done for the Urdu language. Developing a classification system for Urdu documents is therefore a challenging task, both because of the morphological richness of the language and because of the scarcity of language resources such as automatic tools for tokenization, feature selection, and stemming. Two different classifiers based on supervised learning techniques are developed and their accuracies are compared on the given dataset. From the experiments, the Naïve Bayes classifier is found to be more efficient than the Support Vector Machines; however, SVMs outperform Naïve Bayes in terms of classification accuracy.
The overall system is divided into three main components: 1) acquisition, compilation, and labeling of the text documents of the corpus; 2) preprocessing of the raw corpus to generate a standardized and reduced-feature lexicon; and 3) training of statistical classifiers on the preprocessed data to classify test data. The detailed architecture of the system, along with its three components, is shown in Figure 1.
Figure 1. Architecture of Urdu text classification system
2. CORPORA
A large dataset is usually needed in order to obtain good classification accuracy from a statistical system. For that purpose, a large text corpus of 19.3 million words, collected from different online Urdu news websites, is used [3]. The corpus was manually classified into six different domains, namely news, sports, finance, culture, consumer information, and personal communication.
The breakup of documents for each class is given in Table 1 in the next section. The corpus is divided into two parts: the training set contains 90% of the documents from each class, and the remaining 10% of the documents are used as the test set.

3. DOCUMENT PREPROCESSING
Statistical classifiers mostly require the input dataset to be preprocessed into the format they specify. This preprocessing is language-dependent, so general text mining techniques need to be modified in order to apply them to an Urdu corpus.

3.1 Tokenization
Words are derived from the corpus on the basis of white spaces and punctuation marks. The corpus also contains multiple words written without any white space or punctuation mark, as well as some non-Urdu words. To resolve this problem, every word is looked up in a 'tokenization lexicon' and becomes a token if found; otherwise it is eliminated. The 'tokenization lexicon' is manually prepared, gathered from different sources, and contains 220,760 unique entries. The class-wise statistics of the input dataset development process are described in Table 1.
Table 1. Analysis of documents at different levels of preprocessing stages

Class                    Documents    Tokens        Types     Terms
News                     17,501       8,957,259     78,649    54,817
Sports                   3,388        1,666,304     21,473    16,622
Finance                  1,766        1,162,019     16,144    11,951
Culture                  1,088        3,845,117     57,486    37,493
Consumer Information     1,046        1,980,723     26,433    19,781
Personal Communication   1,278        1,685,424     34,614    25,588
Total                    26,067       19,296,846
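As a rough illustration of the lexicon-based tokenization in Section 3.1, the following Python sketch splits raw text on white space and punctuation marks and keeps only words found in the tokenization lexicon; the lexicon contents and the input sentence are placeholders, not material from the paper.

import re

def tokenize(text, lexicon):
    """Split raw Urdu text on white space and punctuation marks and keep only
    those candidate words that appear in the tokenization lexicon."""
    candidates = re.split(r"[\s\.,;:!\?()\"'،؛؟۔]+", text)
    return [w for w in candidates if w and w in lexicon]

# Placeholder lexicon; the lexicon described in Section 3.1 has 220,760 entries.
lexicon = {"خبر", "کھیل", "پاکستان"}
print(tokenize("پاکستان ، کھیل اور خبر۔", lexicon))  # ['پاکستان', 'کھیل', 'خبر']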
3.2 Diacritics Elimination
Diacritics are used in Urdu text to alter the pronunciation of a word, but they are optional characters in the language. In the current corpus, less than one percent of the words are diacritized. In order to standardize the corpus, the diacritics are completely removed, so that, for example, گھَر (house) and گھِر (surrounded) are both mapped to the single word گھر. The list of diacritical marks in Urdu is given in Appendix A.

3.3 Normalization
Some Urdu alphabets have more than one Unicode code point because they are shaped similarly to Arabic alphabets. Such characters are replaced by the alternate Urdu alphabets to avoid creating multiple copies of a word. Some examples of un-normalized alphabets and their mapping to Urdu characters are shown in Appendix B.

3.4 Stop words Elimination
Stop words are functional words of a language and are meaningless in the context of text classification. They are eliminated from the lexicon, in order to reduce its size, by using a list of the most frequent words known as a stop word list. To obtain the stop word list, the frequency of every term in the corpus is calculated and compiled into a 'word frequency lexicon'. The threshold frequency is chosen by manually analyzing that lexicon. It was observed that the top 116 high-frequency words belong to the functional class, so a stop word list comprising these 116 high-frequency words is gathered and eventually eliminated from the final lexicon. Some of the high-frequency words from the stop word list are given in Table 2.

Table 2. Some most frequent words extracted from the corpus (Word, Frequency); the entries include اور and اس, with frequencies between 166,252 and 743,949 occurrences.
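The cleanup steps of Sections 3.2–3.4 can be sketched as below; the diacritic set, the normalization map, and the stop-word list are small illustrative samples standing in for the full lists in Appendices A and B and the 116-word stop list.

# Illustrative subsets only; the paper uses the full lists of Appendices A and B
# and a 116-word stop list derived from the word-frequency lexicon.
DIACRITICS = {"\u064B", "\u064E", "\u064F", "\u0650", "\u0651", "\u0652"}  # zabar, zer, pesh, do zabar, shad, jazm
NORMALIZATION_MAP = {
    "\u064A": "\u06CC",  # Arabic yeh -> Urdu yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Urdu kaf
}
STOP_WORDS = {"اور", "اس"}  # sample high-frequency functional words

def remove_diacritics(token):
    return "".join(ch for ch in token if ch not in DIACRITICS)

def normalize(token):
    return "".join(NORMALIZATION_MAP.get(ch, ch) for ch in token)

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["گھَر", "اور", "كتاب"]  # diacritized and un-normalized input tokens
print(remove_stop_words([normalize(remove_diacritics(t)) for t in tokens]))  # ['گھر', 'کتاب']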
3.5 Stemming
Stemming is the process of reducing a word to its root form; it often consists of removing derivational affixes [13]. 'Affix elimination' based stemming is applied in order to merge multiple related word forms. An affix list containing 417 prefixes and 73 suffixes of the Urdu language is used for this purpose, which reduced the number of terms in the lexicon by 24%. The following algorithm is applied to every token to stem it:
1) Pick the first and last characters of a token separately and search for them in the affix list. If they are not found, then concatenate the second and second-last letters onto them and search again.
2) Continue this process until an affix is found in the list; then search for the remaining part of the word in the 'tokenization lexicon'. If it is found, retain it and eliminate the remaining (prefix or suffix) part of the token.
3) In case no prefix or suffix string is found at all, retain the original word as it is.
For example, consider the token صحتمند. First, ص is added to the prefix string and د is added to the suffix string, and these are looked up in the prefix and suffix lists respectively. When they are not found in those lists, ح is added to the prefix string, which becomes صح, and ن is added to the suffix string, which becomes ند; the prefix and suffix strings are again looked up in the prefix and suffix lists respectively. Since they are still not found in the lists, ت is added to the prefix string, which becomes صحت, and م is added to the suffix string, which becomes مند, and they are looked up in the respective lists. Finally, مند is found in the suffix list, so the rest of the token, صحت, is looked up in the lexicon; it is found there, and the token صحتمند is therefore reduced to صحت after stemming.
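A minimal sketch of the affix-elimination procedure above, assuming the prefix list, suffix list, and tokenization lexicon are available as sets; the sample affixes and lexicon entries are placeholders rather than the 417 prefixes and 73 suffixes used in the paper.

def stem(token, prefixes, suffixes, lexicon):
    """Affix-elimination stemming: grow candidate prefix and suffix strings one
    character at a time; when one matches a known affix and the remaining part
    of the token is found in the lexicon, return that remaining part."""
    for i in range(1, len(token)):
        prefix, rest_after_prefix = token[:i], token[i:]
        suffix, rest_before_suffix = token[-i:], token[:-i]
        if prefix in prefixes and rest_after_prefix in lexicon:
            return rest_after_prefix
        if suffix in suffixes and rest_before_suffix in lexicon:
            return rest_before_suffix
    return token  # no matching affix found, so retain the original word

# Placeholder affix lists and lexicon; the paper uses 417 prefixes and 73 suffixes.
prefixes = {"بد"}
suffixes = {"مند"}
lexicon = {"صحت", "قسمت"}
print(stem("صحتمند", prefixes, suffixes, lexicon))  # صحت
print(stem("بدقسمت", prefixes, suffixes, lexicon))  # قسمت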
3.6 Statistical Properties
Some statistical properties of the cleaned lexicon extracted from the raw corpus are analyzed using Zipf's law and Heaps' law, and they show satisfactory results.
1) According to Zipf's law, the i-th most frequent term occurs with a frequency inversely proportional to i, for some constant c:

frequency_i = c / i    (1)

This models the distribution of terms in a collection and implies that documents belonging to the same class will have similar frequency distributions [13]. Figure 2 shows that the frequencies of the most common terms are inversely proportional to their rank in the current corpus.

Figure 2. Frequency distribution of terms over the entire collection

2) According to Heaps' law, the vocabulary size of a corpus is estimated using (2), which predicts the number of distinct words that occur in a document collection [13]:

vocabulary size = K × (corpus size)^β    (2)

where K and β are constants; K typically varies between 30 and 90, and β is approximately 0.45. Figure 3 shows how the vocabulary size grows as the size of the current corpus increases.

Figure 3. Estimation of vocabulary size
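Both properties can be checked empirically with a short script such as the following sketch, which assumes the preprocessed tokens are available as a Python list; the token stream shown is a toy placeholder, so the numbers it prints are only meant to show the shape of the computation.

from collections import Counter

tokens = ["خبر", "خبر", "کھیل", "خبر", "پاکستان", "کھیل"]  # toy token stream
freq = Counter(tokens)

# Zipf's law (1): frequency_i ≈ c / i, so rank × frequency should stay roughly constant.
for rank, (term, count) in enumerate(freq.most_common(), start=1):
    print(rank, term, count, rank * count)

# Heaps' law (2): vocabulary size ≈ K × (corpus size)^β.
K, beta = 60, 0.45  # constants in the ranges quoted in Section 3.6
corpus_size = len(tokens)
print("predicted vocabulary size:", K * corpus_size ** beta)
print("observed vocabulary size:", len(freq))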
4. NAÏVE BAYES
Naïve Bayes is a supervised learning technique that is used efficiently in text classification [1]. It is based on Bayes' theorem with an independence assumption [5]. Using Bayes' rule, the probability of a document being in a class is:

P(Class | Document) = P(Document | Class) × P(Class) / P(Document)    (3)

P(Document | Class) is the conditional probability of the document given the class, while P(Class) and P(Document) are the prior and evidence probabilities of the class and the document respectively. The independence assumption is used to calculate the conditional probability, where the probability of each document feature (Term_i) is independent of the others [2]. The class that maximizes (3) is selected.

Figure 4. Architecture of text classification using Naïve Bayes

P(Document) = P(Term_1) × … × P(Term_n) = ∏_{i=1..n} P(Term_i)    (4)
P(Document) is constant over all classes, so by ignoring it and applying (4) to (3), the expression to be maximized becomes:

arg max_i [ P(Document | Class_i) × P(Class_i) ] = arg max_i [ ∏_{j=1..n} P(Term_j | Class_i) × P(Class_i) ]    (5)

P(Term_j | Class_i) = count(Term_j, Class_i) / count(Term_j)    (6)

P(Class_i) = count(documents in Class_i) / count(documents)    (7)

In (6), count(Term_j, Class_i) can be zero because the training data is not large enough to represent every term in every class, and this makes the overall estimate equal to zero. To eliminate such zeros, the conditional probability is re-evaluated by assigning a very small non-zero constant value, a technique known as smoothing [13]. A very simple smoothing technique is to add one to all the counts and to add the vocabulary size V to the denominator so that the probabilities remain normalized. This technique is known as Laplace smoothing and is usually suitable for unigram-based language models like Naïve Bayes [14].

P(Term_j | Class_i) = (count(Term_j, Class_i) + 1) / (count(Term_j) + V)    (8)

After estimating the conditional (8) and prior (7) probability parameters during the training phase, a test document is classified as:

best class = arg max_{c ε C} [ ∏_{j=1..n} P(Term_j | Class) × P(Class) ]    (9)
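A small sketch of the smoothed estimate in (8); the counts and the vocabulary size V below are hypothetical values chosen only to show the effect of add-one smoothing.

def smoothed_conditional(count_term_class, count_term, vocabulary_size):
    """Laplace-smoothed P(Term_j | Class_i) as in (8)."""
    return (count_term_class + 1) / (count_term + vocabulary_size)

V = 50_000  # hypothetical vocabulary size
print(smoothed_conditional(0, 120, V))   # unseen term in this class: small but non-zero
print(smoothed_conditional(35, 120, V))  # frequently observed term: larger estimate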
Many conditional probabilities are multiplied in (9), which can result in a floating point underflow [13]. Hence, by applying the logarithm, (9) becomes:

best class = arg max_i [ Σ_{j=1..n} log(P(Term_j | Class_i)) + log(P(Class_i)) ]    (10)
4.1 Algorithm
The algorithm is divided into three independent modules, given below.

Preprocessing
1) L ← lexicon based tokenization
2) NL ← text normalization of L
3) T ← high frequency words elimination of NL
4) term ← affix based stemming of T

Training
5) C ← {class1, class2, …, classk}
6) D ← {document1, document2, …, documentm}
7) V ← {term1, term2, …, termn}
8) for each c ε C
9)     Nc ← total documents Dc in class c
10)    prior[c] ← Nc / N
11)    tokensc ← tokens of all documents [Dc] in class c
12)    for each t ε V
13)        Tct ← frequency of token t [tokensc] in class c
14)        Tt ← frequency of token t in all classes
15)    end for
16)    for each t ε V
17)        P[t][c] ← (Tct + 1) / (Tt + V)
18)    end for
19) end for

Classification
20) T ← total tokens in test document d
21) for each c ε C
22)     score[c] ← log(prior[c])
23)     for each t ε T
24)         score[c] ← score[c] + log(P[t][c])
25)     end for
26) end for
27) best class ← max(score)
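A runnable Python rendering of the training and classification modules above (steps 5–27), assuming documents have already been passed through the preprocessing module; the two-document training set is a placeholder.

import math
from collections import Counter

def train(documents):
    """documents: list of (class_label, token_list) pairs, already preprocessed."""
    classes = {c for c, _ in documents}
    vocabulary = {t for _, tokens in documents for t in tokens}
    # Step 10: prior[c] <- Nc / N
    prior = {c: sum(1 for label, _ in documents if label == c) / len(documents)
             for c in classes}
    # Steps 12-15: per-class and overall token frequencies (Tct and Tt).
    per_class = {c: Counter() for c in classes}
    for c, tokens in documents:
        per_class[c].update(tokens)
    total = Counter()
    for counter in per_class.values():
        total.update(counter)
    # Step 17: P[t][c] <- (Tct + 1) / (Tt + V)
    V = len(vocabulary)
    cond = {c: {t: (per_class[c][t] + 1) / (total[t] + V) for t in vocabulary}
            for c in classes}
    return prior, cond

def classify(tokens, prior, cond):
    """Steps 20-27: log-space scoring of a preprocessed test document."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in tokens:
            if t in cond[c]:  # tokens unseen during training are skipped here
                score += math.log(cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

# Placeholder training data: (class, preprocessed token list).
docs = [
    ("sports", ["کرکٹ", "میچ", "ٹیم"]),
    ("finance", ["روپیہ", "بینک", "منافع"]),
]
prior, cond = train(docs)
print(classify(["میچ", "ٹیم"], prior, cond))  # expected: sports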
5. SUPPORT VECTOR MACHINES
The SVM is a supervised learning technique that is very effective in text classification. It finds a hyperplane h with maximum margin m that separates two classes, and at test time a data point is classified depending on the side of the hyperplane on which it lies [10].

h(x) = x · w + w0    (11)

m = 2 / ‖w‖    (12)

where x is the vector of terms of a document belonging to class r ε {1, …, k}; w and w0 are the weight vector and the threshold respectively. The margin of the classifier is determined by at least the two data points closest to the decision surface, known as support vectors; the other points are known as non-support vectors [13]. In text classification the data is usually not linearly separable, so a penalty C is introduced for data points that cross the margin, known as misclassified points.
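As an illustration of how such a classifier could be set up in practice, the sketch below uses scikit-learn's CountVectorizer and LinearSVC, where the parameter C plays the role of the soft-margin penalty described above; scikit-learn is not part of the paper's implementation, and the training documents are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Placeholder preprocessed documents (space-joined tokens) and their class labels.
train_docs = ["کرکٹ میچ ٹیم", "روپیہ بینک منافع"]
train_labels = ["sports", "finance"]

# Bag-of-words term counts; the paper's normalized term frequencies may be computed differently.
vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(train_docs)

# Linear soft-margin SVM; C penalizes data points that cross the margin.
clf = LinearSVC(C=1.0)
clf.fit(X, train_labels)

print(clf.predict(vectorizer.transform(["میچ ٹیم"])))  # ['sports']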