Urdu Text Classification
Abbas Raza Ali, Maliha Ijaz
National University of Computers and Emerging Sciences, Block-B, Faisal Town, Lahore, Pakistan
[email protected], [email protected]

ABSTRACT
This paper compares statistical techniques for text classification, namely Naïve Bayes and Support Vector Machines, in the context of the Urdu language. A large corpus is used for training and testing the classifiers. Since these classifiers cannot interpret the raw dataset directly, language-specific preprocessing techniques are applied to it to generate a standardized and reduced-feature lexicon. Urdu is a morphologically rich language, which makes these tasks complex. Statistical characteristics of the corpus and lexicon are measured and show that the text preprocessing module performs satisfactorily. The empirical results show that Support Vector Machines outperform the Naïve Bayes classifier in terms of classification accuracy.
Keywords
Corpus, information retrieval, lexicon, Naïve Bayes, normalization, feature selection, text classification, text mining, Urdu.
1. INTRODUCTION
Text classification is the process of automatically assigning unknown text to the most probable class to which it belongs. As electronic information grows day by day, text classification has become a key technique for organizing large amounts of data for analysis and processing [9]. It is involved in many applications such as text filtering, document organization, classification of news stories, and searching for interesting information on the web. These systems are language specific and have mostly been designed for English, while little work has been done for the Urdu language. Developing a classification system for Urdu documents is therefore a challenging task, both because of the morphological richness of the language and because of the scarcity of language resources such as automatic tools for tokenization, feature selection, and stemming. Two different classifiers based on supervised learning techniques are developed and their accuracies are compared on the given dataset. From the experiments, the Naïve Bayes classifier is found to be more efficient than the Support Vector Machines; however, SVMs outperform Naïve Bayes in terms of classification accuracy.
The overall system is divided into three main components: 1) acquisition, compilation, and labeling of the text documents of the corpus; 2) preprocessing of the raw corpus to generate a standardized and reduced-feature lexicon; and 3) training of statistical classifiers on the preprocessed data to classify test data. The detailed architecture of the system, along with its three components, is shown in Figure 1.
Figure 1. Architecture of Urdu text classification system
2. CORPORA
A large dataset is usually needed in order to obtain good classification accuracy from a statistical system. For that purpose, a large text corpus of 19.3 million words, collected from different online Urdu news websites, is used [3]. The corpus was manually classified into six different domains, namely news, sports, finance, culture, consumer information, and personal communication.
The breakup of documents for each class is given in Table 1 in the next section. The corpus is divided into two parts: the training set contains 90% of the documents from each class, and the remaining 10% of the documents are used as the test set.

3. DOCUMENT PREPROCESSING
Statistical classifiers mostly require the input dataset to be preprocessed into the format they specify. This preprocessing is language-dependent, so general text mining techniques need to be modified in order to apply them to an Urdu corpus.

3.1 Tokenization
Words are derived from the corpus on the basis of white spaces and punctuation marks. The corpus also contains multiple words written without any white space or punctuation mark, as well as some non-Urdu words. To resolve this problem, every word is looked up in a 'tokenization lexicon' and becomes a token if found; otherwise it is eliminated. The 'tokenization lexicon' is manually prepared, gathered from different sources, and contains 220,760 unique entries. The class-wise statistics of the input dataset development process are described in Table 1.
Table 1. Analysis of documents at different levels of preprocessing stages

Class                    Documents    Tokens        Types     Terms
News                     17,501       8,957,259     78,649    54,817
Sports                   3,388        1,666,304     21,473    16,622
Finance                  1,766        1,162,019     16,144    11,951
Culture                  1,088        3,845,117     57,486    37,493
Consumer Information     1,046        1,980,723     26,433    19,781
Personal Communication   1,278        1,685,424     34,614    25,588
Total                    26,067       19,296,846
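As a rough illustration of the lexicon-based tokenization in Section 3.1, the following Python sketch splits raw text on white space and punctuation marks and keeps only words found in the tokenization lexicon; the lexicon contents and the input sentence are placeholders, not material from the paper.

import re

def tokenize(text, lexicon):
    """Split raw Urdu text on white space and punctuation marks and keep only
    those candidate words that appear in the tokenization lexicon."""
    candidates = re.split(r"[\s\.,;:!\?()\"'،؛؟۔]+", text)
    return [w for w in candidates if w and w in lexicon]

# Placeholder lexicon; the lexicon described in Section 3.1 has 220,760 entries.
lexicon = {"خبر", "کھیل", "پاکستان"}
print(tokenize("پاکستان ، کھیل اور خبر۔", lexicon))  # ['پاکستان', 'کھیل', 'خبر']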
3.2 Diacritics Elimination
Diacritics are used in Urdu text to alter the pronunciation of a word, but they are optional characters in the language. In the current corpus, less than one percent of the words are diacritized. In order to standardize the corpus, the diacritics are completely removed, so that, for example, گھَر (house) and گھِر (surrounded) are both mapped to the single word گھر. The list of diacritical marks in Urdu is given in Appendix A.

3.3 Normalization
Some Urdu alphabets have more than one Unicode code point because they are shaped similarly to Arabic alphabets. Such characters are replaced by the alternate Urdu alphabets to avoid creating multiple copies of a word. Some examples of un-normalized alphabets and their mapping to Urdu characters are shown in Appendix B.

3.4 Stop words Elimination
Stop words are functional words of a language and are meaningless in the context of text classification. They are eliminated from the lexicon, in order to reduce its size, by using a list of the most frequent words known as a stop word list. To obtain the stop word list, the frequency of every term in the corpus is calculated and compiled into a 'word frequency lexicon'. The threshold frequency is chosen by manually analyzing that lexicon. It was observed that the top 116 high-frequency words belong to the functional class, so a stop word list comprising these 116 high-frequency words is gathered and eventually eliminated from the final lexicon. Some of the high-frequency words from the stop word list are given in Table 2.

Table 2. Some most frequent words extracted from the corpus (Word, Frequency); the entries include اور and اس, with frequencies between 166,252 and 743,949 occurrences.
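The cleanup steps of Sections 3.2–3.4 can be sketched as below; the diacritic set, the normalization map, and the stop-word list are small illustrative samples standing in for the full lists in Appendices A and B and the 116-word stop list.

# Illustrative subsets only; the paper uses the full lists of Appendices A and B
# and a 116-word stop list derived from the word-frequency lexicon.
DIACRITICS = {"\u064B", "\u064E", "\u064F", "\u0650", "\u0651", "\u0652"}  # zabar, zer, pesh, do zabar, shad, jazm
NORMALIZATION_MAP = {
    "\u064A": "\u06CC",  # Arabic yeh -> Urdu yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Urdu kaf
}
STOP_WORDS = {"اور", "اس"}  # sample high-frequency functional words

def remove_diacritics(token):
    return "".join(ch for ch in token if ch not in DIACRITICS)

def normalize(token):
    return "".join(NORMALIZATION_MAP.get(ch, ch) for ch in token)

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["گھَر", "اور", "كتاب"]  # diacritized and un-normalized input tokens
print(remove_stop_words([normalize(remove_diacritics(t)) for t in tokens]))  # ['گھر', 'کتاب']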
3.5 Stemming
Stemming is the process of reducing a word to its root form; it often consists of removing derivational affixes [13]. 'Affix elimination' based stemming is applied in order to merge multiple related word forms. An affix list containing 417 prefixes and 73 suffixes of the Urdu language is used for this purpose, which reduced the number of terms in the lexicon by 24%. The following algorithm is applied to every token to stem it:
1) Pick the first and last characters of a token separately and search for them in the affix list. If they are not found, then concatenate the second and second-last letters onto them and search again.
2) Continue this process until an affix is found in the list; then search for the remaining part of the word in the 'tokenization lexicon'. If it is found, retain it and eliminate the remaining (prefix or suffix) part of the token.
3) In case no prefix or suffix string is found at all, retain the original word as it is.
For example, consider the token صحتمند. First, ص is added to the prefix string and د is added to the suffix string, and these are looked up in the prefix and suffix lists respectively. When they are not found in those lists, ح is added to the prefix string, which becomes صح, and ن is added to the suffix string, which becomes ند; the prefix and suffix strings are again looked up in the prefix and suffix lists respectively. Since they are still not found in the lists, ت is added to the prefix string, which becomes صحت, and م is added to the suffix string, which becomes مند, and they are looked up in the respective lists. Finally, مند is found in the suffix list, so the rest of the token, صحت, is looked up in the lexicon; it is found there, and the token صحتمند is therefore reduced to صحت after stemming.
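A minimal sketch of the affix-elimination procedure above, assuming the prefix list, suffix list, and tokenization lexicon are available as sets; the sample affixes and lexicon entries are placeholders rather than the 417 prefixes and 73 suffixes used in the paper.

def stem(token, prefixes, suffixes, lexicon):
    """Affix-elimination stemming: grow candidate prefix and suffix strings one
    character at a time; when one matches a known affix and the remaining part
    of the token is found in the lexicon, return that remaining part."""
    for i in range(1, len(token)):
        prefix, rest_after_prefix = token[:i], token[i:]
        suffix, rest_before_suffix = token[-i:], token[:-i]
        if prefix in prefixes and rest_after_prefix in lexicon:
            return rest_after_prefix
        if suffix in suffixes and rest_before_suffix in lexicon:
            return rest_before_suffix
    return token  # no matching affix found, so retain the original word

# Placeholder affix lists and lexicon; the paper uses 417 prefixes and 73 suffixes.
prefixes = {"بد"}
suffixes = {"مند"}
lexicon = {"صحت", "قسمت"}
print(stem("صحتمند", prefixes, suffixes, lexicon))  # صحت
print(stem("بدقسمت", prefixes, suffixes, lexicon))  # قسمت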
3.6 Statistical Properties
Some statistical properties of the cleaned lexicon extracted from the raw corpus are analyzed using Zipf's law and Heaps' law, and they show satisfactory results.
1) According to Zipf's law, the i-th most frequent term occurs with a frequency inversely proportional to i, for some constant c:

frequency_i = c / i    (1)

This models the distribution of terms in a collection and implies that documents belonging to the same class will have similar frequency distributions [13]. Figure 2 shows that the frequencies of the most common terms are inversely proportional to their rank in the current corpus.

Figure 2. Frequency distribution of terms over the entire collection

2) According to Heaps' law, the vocabulary size of a corpus is estimated using (2), which predicts the number of distinct words that occur in a document collection [13]:

vocabulary size = K × (corpus size)^β    (2)

where K and β are constants; K typically varies between 30 and 90, and β is approximately 0.45. Figure 3 shows how the vocabulary size grows as the size of the current corpus increases.

Figure 3. Estimation of vocabulary size
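Both properties can be checked empirically with a short script such as the following sketch, which assumes the preprocessed tokens are available as a Python list; the token stream shown is a toy placeholder, so the numbers it prints are only meant to show the shape of the computation.

from collections import Counter

tokens = ["خبر", "خبر", "کھیل", "خبر", "پاکستان", "کھیل"]  # toy token stream
freq = Counter(tokens)

# Zipf's law (1): frequency_i ≈ c / i, so rank × frequency should stay roughly constant.
for rank, (term, count) in enumerate(freq.most_common(), start=1):
    print(rank, term, count, rank * count)

# Heaps' law (2): vocabulary size ≈ K × (corpus size)^β.
K, beta = 60, 0.45  # constants in the ranges quoted in Section 3.6
corpus_size = len(tokens)
print("predicted vocabulary size:", K * corpus_size ** beta)
print("observed vocabulary size:", len(freq))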
4. NAÏVE BAYES
Naïve Bayes is a supervised learning technique that is used efficiently in text classification [1]. It is based on Bayes' theorem with an independence assumption [5]. Using Bayes' rule, the probability of a document being in a class is:

P(Class | Document) = P(Document | Class) × P(Class) / P(Document)    (3)

P(Document | Class) is the conditional probability of the document given the class, while P(Class) and P(Document) are the prior and evidence probabilities of the class and the document respectively. The independence assumption is used to calculate the conditional probability, where the probability of each document feature (Term_i) is independent of the others [2]. The class that maximizes (3) is selected.

Figure 4. Architecture of text classification using Naïve Bayes

P(Document) = P(Term_1) × … × P(Term_n) = ∏_{i=1..n} P(Term_i)    (4)
P(Document) is constant over all classes, so by ignoring it and applying (4) to (3), the expression to be maximized becomes:

arg max_i [ P(Document | Class_i) × P(Class_i) ] = arg max_i [ ∏_{j=1..n} P(Term_j | Class_i) × P(Class_i) ]    (5)

P(Term_j | Class_i) = count(Term_j, Class_i) / count(Term_j)    (6)

P(Class_i) = count(documents in Class_i) / count(documents)    (7)

In (6), count(Term_j, Class_i) can be zero because the training data is not large enough to represent every term in every class, and this makes the overall estimate equal to zero. To eliminate such zeros, the conditional probability is re-evaluated by assigning a very small non-zero constant value, a technique known as smoothing [13]. A very simple smoothing technique is to add one to all the counts and to add the vocabulary size V to the denominator so that the probabilities remain normalized. This technique is known as Laplace smoothing and is usually suitable for unigram-based language models like Naïve Bayes [14].

P(Term_j | Class_i) = (count(Term_j, Class_i) + 1) / (count(Term_j) + V)    (8)

After estimating the conditional (8) and prior (7) probability parameters during the training phase, a test document is classified as:

best class = arg max_{c ε C} [ ∏_{j=1..n} P(Term_j | Class) × P(Class) ]    (9)
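A small sketch of the smoothed estimate in (8); the counts and the vocabulary size V below are hypothetical values chosen only to show the effect of add-one smoothing.

def smoothed_conditional(count_term_class, count_term, vocabulary_size):
    """Laplace-smoothed P(Term_j | Class_i) as in (8)."""
    return (count_term_class + 1) / (count_term + vocabulary_size)

V = 50_000  # hypothetical vocabulary size
print(smoothed_conditional(0, 120, V))   # unseen term in this class: small but non-zero
print(smoothed_conditional(35, 120, V))  # frequently observed term: larger estimate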
Many conditional probabilities are multiplied in (9), which can result in a floating point underflow [13]. Hence, by applying the logarithm, (9) becomes:

best class = arg max_i [ Σ_{j=1..n} log(P(Term_j | Class_i)) + log(P(Class_i)) ]    (10)
4.1 Algorithm
The algorithm is divided into three independent modules, given below.

Preprocessing
1) L ← lexicon based tokenization
2) NL ← text normalization of L
3) T ← high frequency words elimination of NL
4) term ← affix based stemming of T

Training
5) C ← {class1, class2, …, classk}
6) D ← {document1, document2, …, documentm}
7) V ← {term1, term2, …, termn}
8) for each c ε C
9)     Nc ← total documents Dc in class c
10)    prior[c] ← Nc / N
11)    tokensc ← tokens of all documents [Dc] in class c
12)    for each t ε V
13)        Tct ← frequency of token t [tokensc] in class c
14)        Tt ← frequency of token t in all classes
15)    end for
16)    for each t ε V
17)        P[t][c] ← (Tct + 1) / (Tt + V)
18)    end for
19) end for

Classification
20) T ← total tokens in test document d
21) for each c ε C
22)     score[c] ← log(prior[c])
23)     for each t ε T
24)         score[c] ← score[c] + log(P[t][c])
25)     end for
26) end for
27) best class ← max(score)
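A runnable Python rendering of the training and classification modules above (steps 5–27), assuming documents have already been passed through the preprocessing module; the two-document training set is a placeholder.

import math
from collections import Counter

def train(documents):
    """documents: list of (class_label, token_list) pairs, already preprocessed."""
    classes = {c for c, _ in documents}
    vocabulary = {t for _, tokens in documents for t in tokens}
    # Step 10: prior[c] <- Nc / N
    prior = {c: sum(1 for label, _ in documents if label == c) / len(documents)
             for c in classes}
    # Steps 12-15: per-class and overall token frequencies (Tct and Tt).
    per_class = {c: Counter() for c in classes}
    for c, tokens in documents:
        per_class[c].update(tokens)
    total = Counter()
    for counter in per_class.values():
        total.update(counter)
    # Step 17: P[t][c] <- (Tct + 1) / (Tt + V)
    V = len(vocabulary)
    cond = {c: {t: (per_class[c][t] + 1) / (total[t] + V) for t in vocabulary}
            for c in classes}
    return prior, cond

def classify(tokens, prior, cond):
    """Steps 20-27: log-space scoring of a preprocessed test document."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in tokens:
            if t in cond[c]:  # tokens unseen during training are skipped here
                score += math.log(cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

# Placeholder training data: (class, preprocessed token list).
docs = [
    ("sports", ["کرکٹ", "میچ", "ٹیم"]),
    ("finance", ["روپیہ", "بینک", "منافع"]),
]
prior, cond = train(docs)
print(classify(["میچ", "ٹیم"], prior, cond))  # expected: sports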
5. SUPPORT VECTOR MACHINES
The SVM is a supervised learning technique that is very effective in text classification. It finds a hyperplane h with maximum margin m that separates two classes, and at test time a data point is classified depending on the side of the hyperplane on which it lies [10].

h(x) = x · w + w0    (11)

m = 2 / ‖w‖    (12)

where x is the vector of terms of a document belonging to class r ε {1, …, k}; w and w0 are the weight vector and the threshold respectively. The margin of the classifier is determined by at least the two data points closest to the decision surface, known as support vectors; the other points are known as non-support vectors [13]. In text classification the data is usually not linearly separable, so a penalty C is introduced for data points that cross the margin, known as misclassified points.
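As an illustration of how such a classifier could be set up in practice, the sketch below uses scikit-learn's CountVectorizer and LinearSVC, where the parameter C plays the role of the soft-margin penalty described above; scikit-learn is not part of the paper's implementation, and the training documents are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Placeholder preprocessed documents (space-joined tokens) and their class labels.
train_docs = ["کرکٹ میچ ٹیم", "روپیہ بینک منافع"]
train_labels = ["sports", "finance"]

# Bag-of-words term counts; the paper's normalized term frequencies may be computed differently.
vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(train_docs)

# Linear soft-margin SVM; C penalizes data points that cross the margin.
clf = LinearSVC(C=1.0)
clf.fit(X, train_labels)

print(clf.predict(vectorizer.transform(["میچ ٹیم"])))  # ['sports']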