Multilingual Plagiarism Detection

Zdenek Ceska, Michal Toman, Karel Jezek

Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia, Univerzitni 22, 306 14 Pilsen, Czech Republic {zceska, mtoman, jezek_ka}@kiv.zcu.cz

Abstract. Multilingual aspects have been gaining more and more attention in recent years. This trend has been accentuated by the global integration of European states and the vanishing of cultural and social boundaries. Multilingual text processing has become an important field, bringing many new and interesting problems. This paper describes a novel approach to multilingual plagiarism detection. We propose a new method, called MLPlag, for plagiarism detection in a multilingual environment. The method is based on the analysis of word positions and utilizes the EuroWordNet thesaurus, which transforms words into a language-independent form. It is thus able to identify documents plagiarized from sources written in foreign languages. We incorporate special techniques, such as semantic-based word normalization, to enhance our method; this identifies the replacement of words with synonyms, used by plagiarists to hide the document match. We performed our experiments on monolingual and multilingual corpora, and their results are presented in this paper.

Keywords: Plagiarism, Copy Detection, Natural Language Processing, EuroWordNet, Thesaurus, Lemmatization.

1 Introduction

A useful and still progressing sub-area of document processing research is the task of text document plagiarism detection. Particularly when documents are written in various languages, the problem of comparing them has not yet been satisfactorily solved, and it implies a number of additional tasks. Several systems have been developed for plagiarism detection; however, none of them deals with the situation when documents are written in different languages. Clough published the fundamentals of plagiarism detection in [1]. Later, Maurer gave an overview of current plagiarism detection systems in [7] and sketched some future directions for the field of Natural Language Processing (NLP). The most popular system is SCAM [12], which employs single words contained in the examined documents. Another system that also uses words to measure document similarity is Detection of Duplicate Defect Reports, described in detail in [10]. More recent systems, such as Ferret [6] or the KOPI Portal [9], utilize N-grams as features. Although N-grams yield better results for plagiarism detection, they are inappropriate for a multilingual environment. Therefore, we propose a new method, called MLPlag, that overcomes this issue. We can imagine several usage scenarios for our system. One example of its successful use is a university environment, where student works are often plagiarized. There are known cases where two thirds of a master thesis was plagiarized from an internet source such as Wikipedia. Even if the work was translated from English into Czech, it cannot be considered a novel work. The aim of our approach is to detect plagiarized documents and their sources even if

they are written in foreign languages. Other problems we deal with are the replacement of words with synonymous expressions and the slightly different word order in different languages; resolving these issues is part of our proposal. We started our experiments with only two languages, Czech and English, but the principle remains the same for any number of processed languages. We designed a prototype of the system to detect Czech documents plagiarized from English documents and vice versa. It is able to deal with the different word order in Czech and English sentences and with synonymous word replacements. The rest of this paper is organized as follows. Section 2 describes the EuroWordNet (EWN) thesaurus in detail. Section 3 proposes our method, MLPlag, for plagiarism detection in a multilingual environment. Section 4 presents the results we achieved on two experimental multilingual corpora. Finally, Section 5 concludes the paper.

2 EuroWordNet Thesaurus

For language-independent processing, we designed a technique which transforms multilingual texts into an easily processed form. The EWN thesaurus [4] is used for this task. It is a multilingual database of words and their relations for several European languages: English, Danish, Italian, Spanish, German, French, Czech, and Estonian. It contains sets of synonyms – synsets – and relations between them. A unique index is assigned to each synset; it interconnects the languages through an inter-lingual index in such a way that the same synset has the same index in every language. Thus, cross-language processing can easily be

performed. We can, for example, detect a Czech article as a plagiarism of one or more English documents, and vice versa. With EWN, completely language-independent processing and storage can be carried out; moreover, synonyms are identically indexed. Since synonyms are often used to hide plagiarism from the reader, this fact plays an important role in plagiarism detection. In order to use EWN, it is necessary to assign an EWN index to each term of a document. To accomplish that, the words must first be transformed into their basic forms, i.e. the words must be normalized. Lemmatization transforms a word into its basic form – the lemma. Fig. 1 illustrates an overview of the lemmatization system. EWN-based lemmatization can be classified as a dictionary lemmatization method, and building the dictionary can be considered the most difficult part of this approach. We proposed a method for building the lemmatization dictionary based on the EWN thesaurus and an Ispell dictionary [3]. The lemmatization dictionary was created by extracting word forms using the Ispell utility: the Ispell dictionary contains stems and attributes specifying the possible suffixes and prefixes, which are applied to the stems to derive all possible word forms. We assume that one of the derived forms is the basic form (lemma); to recognize it, we look for the corresponding lemma in EWN. A fuzzy-match routine based on [8] can optionally be enabled when searching for lemmas in the EWN thesaurus. This pre-processing helps especially in the case of highly inflected languages, which are in general more difficult to process. We used a Czech

morphological analyzer [5] to overcome this problem. Thanks to this module, we obtained a further improvement of our method. English lemmatization is relatively simple, so basic lemmatization algorithms can be used with satisfying results. We implemented lemmatization modules for Czech and English, but the main principle remains the same for any number of languages. However, language-specific pre-processing, such as morphological analysis and disambiguation, is needed in some cases to achieve better results.
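The dictionary-based lemmatization and EWN indexing described above can be sketched as follows. The dictionaries here are hypothetical stand-ins for data that would be extracted from an Ispell dictionary and the EuroWordNet thesaurus; the synset indexes are illustrative (two are taken from Fig. 1, the rest invented for the example).

```python
# Maps every derived word form to its candidate base forms, as would be
# generated by expanding Ispell stems with their suffix/prefix attributes.
FORM_TO_CANDIDATES = {
    "legs": ["leg"],
    "has": ["have"],
    "tables": ["table"],
    "table": ["table"],
}

# Lemmas known to EuroWordNet, mapped to their inter-lingual synset index
# (hypothetical entries for illustration).
EWN_INDEX = {
    "table": "eng20-04209815-n",
    "have": "eng20-02139918-v",
    "leg": "eng20-03186572-n",
}

def lemmatize(word):
    """Return the base form of `word`: the candidate form that EWN recognizes."""
    for candidate in FORM_TO_CANDIDATES.get(word.lower(), [word.lower()]):
        if candidate in EWN_INDEX:
            return candidate
    return word.lower()          # fall back to the surface form

def to_ewn_indexes(tokens):
    """Map tokens to language-independent EWN indexes; drop unknown words."""
    lemmas = (lemmatize(t) for t in tokens)
    return [EWN_INDEX[l] for l in lemmas if l in EWN_INDEX]

print(to_ewn_indexes(["The", "table", "has", "four", "legs"]))
```

Because the output consists of inter-lingual indexes only, the same routine applied to a Czech lemmatization dictionary would yield comparable index sequences, which is what makes the later document comparison language-independent.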

[Fig. 1 diagram: the sentence "The table has four legs." from the corpus passes through lemmatization (using the lemmatization dictionary) to produce "table, have, four, leg"; EWN indexing (using the EWN thesaurus) then maps the lemmas to inter-lingual indexes such as eng20-04209815-n and eng20-02139918-v, which feed the plagiarism detection. Lemmatization forms the language-dependent part of the pipeline; EWN indexing the language-independent part.]

Fig. 1. Pre-processing of the multilingual corpus is split into two parts – lemmatization and EWN indexing. The language-independent form is used as the input of the plagiarism detection module.

In the case of monolingual pre-processing, when EWN is not used, we recommend lemmatization followed by stop-word removal, which is the most fundamental pre-processing approach. Stop-word removal eliminates common and uninformative words from the text to reduce the amount of data.
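A minimal sketch of this monolingual pipeline, lemmatization followed by stop-word removal, might look as follows. The lemma table and stop-word list are illustrative placeholders, not the actual resources used in the paper.

```python
# Illustrative lemma table and stop-word list (not the paper's resources).
LEMMAS = {"has": "have", "legs": "leg", "tables": "table"}
STOP_WORDS = {"the", "a", "an", "and", "of", "have"}

def preprocess(text):
    """Lowercase, lemmatize, then drop stop words."""
    tokens = text.lower().split()
    lemmas = [LEMMAS.get(t, t) for t in tokens]
    return [l for l in lemmas if l not in STOP_WORDS]

print(preprocess("The table has four legs"))   # → ['table', 'four', 'leg']
```

Note that stop words are removed after lemmatization, so inflected forms of stop words ("has" → "have") are caught as well.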

3 MLPlag Method

In a multilingual environment, it is difficult to choose the right features to unambiguously identify plagiarism across different languages. We need features that can be employed in most languages regardless of the grammatical rules they apply. In a monolingual environment, N-grams are mostly preferred, since they capture short word sequences and achieve better results than single words. Generally, an N-gram is a sequence of n words. In a multilingual environment, however, N-grams cannot be employed, due to the distinct word order in every language. Moreover, some words need not have any equivalent in the other languages we are dealing with. This precludes the use of N-grams, because they would require all translated words to keep the same order across languages. Therefore, we recommend using single words as features. The problem we then face is the word distribution within the examined documents. Imagine the situation when a word occurs at the beginning of one document and at the end of the other. It is obvious that these two occurrences cannot be plagiarized, because they stand at absolutely different positions in the two documents. To overcome this issue, we introduce a new method that takes the positions of words into account. Let us define the set of positions Cw(R,S) of word w according to the following formula

C_w(R,S) = { a : a ∈ <1, N_R>, ∃ b ∈ <1, N_S>, w = w_{R,a}, w = w_{S,b}, |a/N_R − b/N_S| < ω },   (1)

where word w from document R is plagiarized in document S. In this formula, we denote word w at position a in document R as w_{R,a} and, similarly, word w at position b in document S as w_{S,b}. Constant N_R expresses the total number of words in document R, and N_S the total number of words in document S. Finally, constant ω represents a relative window size within which we consider the two words w_{R,a} and w_{S,b} to be close enough. To overcome the influence of different document sizes, we normalize the word positions into the interval <0, 1>, where zero represents a word at the beginning of a document and one a word at the end. As an example, for the relative window size ω = 25%, the word w_{R,a} situated at position a/N_R = 0.5 is plagiarized if and only if the same word w_{S,b} is situated at a position b/N_S ∈ (0.25, 0.75). Now, let us define the occurrence frequency of plagiarized word w as Pw(R,S) according to the formula

P_w(R,S) = |C_w(R,S)| · |C_w(S,R)|,   (2)

where |C_w(R,S)| represents the number of positions of word w in document R that are plagiarized in document S. Further, we define the occurrence frequency of word w in document R as F_w(R), regardless of whether it is plagiarized or not. Both definitions are used in the following subsections, where we introduce two measures of similarity.
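Equations (1) and (2) can be sketched directly in code. The sketch below assumes documents are lists of tokens (in practice, lists of EWN indexes after pre-processing), with word positions normalized by document length; two occurrences count as plagiarized only if their relative positions differ by less than the window ω.

```python
def positions(doc, w):
    """1-based positions of word w in doc (a list of tokens/EWN indexes)."""
    return [i + 1 for i, t in enumerate(doc) if t == w]

def C(doc_r, doc_s, w, omega):
    """The set C_w(R,S) of Equation (1): positions a of w in R for which
    some occurrence b of w in S lies within relative distance omega."""
    n_r, n_s = len(doc_r), len(doc_s)
    return {a for a in positions(doc_r, w)
            if any(abs(a / n_r - b / n_s) < omega
                   for b in positions(doc_s, w))}

def P(doc_r, doc_s, w, omega):
    """Occurrence frequency of plagiarized word w, Equation (2)."""
    return len(C(doc_r, doc_s, w, omega)) * len(C(doc_s, doc_r, w, omega))

# "copy" occurs at the start and end of R but only at the start of S:
R = ["copy", "x", "x", "x", "x", "x", "x", "x", "x", "copy"]
S = ["copy", "y", "y", "y", "y", "y", "y", "y", "y", "y"]
print(C(R, S, "copy", 0.25))   # → {1}: only the leading occurrence matches
print(P(R, S, "copy", 0.25))   # → 1
```

With ω = 100% (no window), every shared word would match regardless of position, which is the "---" baseline used in the experiments of Section 4.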

3.1 Symmetric Similarity Measure (MLPlagSYM)

The first measure of similarity is derived in part from the traditional Vector Space Model (VSM), which has gained popularity in the Information Retrieval (IR) domain [11]. We define Equation (3) to measure the similarity between two documents R and S in a multilingual environment.

sim(R,S) = ( Σ_{w∈R∩S} α_w² · P_w(R,S) ) / √( Σ_{w∈R} α_w² · F_w²(R) · Σ_{w∈S} α_w² · F_w²(S) )   (3)

This expresses the symmetric measure for the document pair, where α_w is the weight associated with the occurrence of word w. Weight α_w in Equation (3) is composed of a local weight and a global weight, as described in [13]. The resulting similarity lies in the interval <0, 1>.

3.2 Asymmetric Similarity Measure (MLPlagASYM)

The second measure of similarity we introduce is, in contrast to the preceding one, asymmetric. We define the measure of document R being a subset of document S according to the formula

subset(R,S) = ( Σ_{w∈R∩S} α_w² · P_w(R,S) ) / ( Σ_{w∈R} α_w² · F_w²(R) ),   (4)

where α_w is the weight associated with the occurrence of word w. Subsequently, we define the measure of similarity as the maximum of the two asymmetric measures according to

sim(R,S) = max{ subset(R,S), subset(S,R) }.   (5)

Because we need to keep the similarity in the interval <0, 1>, we set sim(R,S) to 1 if it is greater than 1.
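Both measures can be sketched over precomputed per-word statistics: P[w] = P_w(R,S) (which is symmetric in R and S by Equation (2)), F_R[w] = F_w(R), F_S[w] = F_w(S), and alpha[w] = α_w. This is an illustrative sketch; Equation (3) is written here with the VSM-style square-root normalization that keeps the result in <0, 1>.

```python
import math

def sim_sym(P, F_R, F_S, alpha):
    """Symmetric similarity, Equation (3)."""
    shared = set(F_R) & set(F_S)
    num = sum(alpha[w] ** 2 * P.get(w, 0) for w in shared)
    den = math.sqrt(sum(alpha[w] ** 2 * F_R[w] ** 2 for w in F_R)
                    * sum(alpha[w] ** 2 * F_S[w] ** 2 for w in F_S))
    return num / den if den else 0.0

def subset(P, F_first, F_second, alpha):
    """Equation (4): how much of the first document is covered by
    plagiarized words shared with the second."""
    shared = set(F_first) & set(F_second)
    num = sum(alpha[w] ** 2 * P.get(w, 0) for w in shared)
    den = sum(alpha[w] ** 2 * F_first[w] ** 2 for w in F_first)
    return num / den if den else 0.0

def sim_asym(P, F_R, F_S, alpha):
    """Equation (5): maximum of both asymmetric measures, clamped to 1."""
    return min(1.0, max(subset(P, F_R, F_S, alpha),
                        subset(P, F_S, F_R, alpha)))

# Toy statistics: the documents share only the word "a", with P_a(R,S) = 1.
alpha = {"a": 1.0, "b": 1.0, "c": 1.0}
F_R, F_S = {"a": 1, "b": 1}, {"a": 1, "c": 1}
P = {"a": 1}
print(sim_sym(P, F_R, F_S, alpha))    # → 0.5
print(sim_asym(P, F_R, F_S, alpha))   # → 0.5
```

The clamp in sim_asym reflects the text above: each subset ratio can exceed 1 when a word position in one document matches several positions in the other.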

4 Experiments

In Section 2, we introduced the pre-processing required for multilingual plagiarism detection of text documents. Fig. 2 presents the influence of multilingual pre-processing on the accuracy, compared with monolingual pre-processing. In our case, monolingual pre-processing consists of lemmatization and stop-word removal; multilingual pre-processing consists of lemmatization and the inter-lingual indexing process, which partially substitutes for the stop-word removal used in monolingual pre-processing.
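Curves of the kind shown in Fig. 2 can be produced as follows: for each threshold τ, document pairs with similarity ≥ τ are labelled plagiarized, and F1 is computed against the known ground truth. The scores and labels below are made up for illustration; they are not the paper's data.

```python
def f1_at_threshold(scores, truth, tau):
    """F1-measure of the plagiarism decision sim >= tau."""
    predicted = [s >= tau for s in scores]
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(t and not p for p, t in zip(predicted, truth))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

scores = [0.92, 0.40, 0.15, 0.70, 0.05]   # pairwise similarities (made up)
truth  = [True, True, False, True, False] # ground-truth plagiarism labels
for tau in (0.1, 0.3, 0.5):
    print(tau, round(f1_at_threshold(scores, truth, tau), 3))
```

Sweeping τ over <0, 1> and plotting f1_at_threshold yields one curve per method/pre-processing combination, as in Fig. 2.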

[Fig. 2 plot: F1 [%] (0–100) against threshold τ [%] (0–100), with four curves: Mono proc. MLPlag-SYM, Mono proc. MLPlag-ASYM, Multi proc. MLPlag-SYM, and Multi proc. MLPlag-ASYM.]

Fig. 2. The dependency of the F1-measure on the threshold τ. This experiment is performed with monolingual and multilingual pre-processing on the CTK corpus.

Although the dependencies of the F1-measure on the threshold τ are very similar for monolingual and multilingual pre-processing, we discovered that multilingual pre-processing slightly improves the accuracy of both the MLPlagSYM and MLPlagASYM methods, see Table 1. Another advantage is a certain separation of the language-dependent data processing, see Fig. 1. The higher layers of data processing, i.e. plagiarism detection, consider only inter-lingual indexes, so it is easier to include further languages in the future. On the grounds of these facts, we strongly recommend employing multilingual pre-processing, and not only for a multilingual environment.

Table 1. The results achieved on the CTK corpus

Pre-processing   Method        Threshold τ   F1-measure
Monolingual      MLPlagSYM     27%           88.38%
Monolingual      MLPlagASYM    33%           90.24%
Multilingual     MLPlagSYM     33%           89.01%
Multilingual     MLPlagASYM    42%           90.35%

For our following experiments, we built two distinct multilingual corpora, called JRC-EU and Fairy-tale. The JRC-EU corpus is composed of 400 randomly selected European Union legislative texts [2]. Because it includes many topic-specific words that are not contained in the EWN thesaurus, we have to take into account that the indexation process sometimes fails. This corpus contains 200 reports written in English and the same number of corresponding reports written in Czech. The second corpus, called Fairy-tale, represents a smaller set of text documents with a simplified vocabulary. It is composed of 54 documents: 27 in English and 27 corresponding Czech translations.

Fig. 3. The dependency of the F1-measure on the threshold τ and the window size ω. The top part of the figure presents the MLPlagSYM and MLPlagASYM methods, from left to right, on the JRC-EU corpus. The bottom part presents the same methods on the second corpus, Fairy-tale.

Fig. 3 presents the dependency of the F1-measure on the threshold τ and the window size ω. As can be seen, the window size ω influences the F1-measure in a multilingual environment. We examined both the MLPlagSYM and MLPlagASYM methods on the JRC-EU and Fairy-tale corpora, testing window sizes ω in the interval from 1% to 100%. The case when no window is used corresponds to ω = 100%, i.e. the same word occurring at the beginning of one document and at the end of the other is still considered plagiarized. On the other hand, if ω = 1%, only words whose relative distance is at most 1% are considered plagiarized. Identical words whose relative distance is greater than ω are regarded as absolutely distinct and are not taken into account when computing the text overlap. From our observations, too low values of ω significantly decrease F1; on the other hand, high values of ω decrease F1 as well. Three of the four experiments showed that the best results are achieved with ω between 8% and 10%, see Table 2. An exception is the MLPlagASYM method on the JRC-EU corpus: Fig. 3 shows the best F1-measure for ω = 55%, although satisfactory results could be achieved with ω = 10% as well. Both MLPlagSYM and MLPlagASYM indicate a slightly better accuracy for ω in that interval in comparison with the situation when no window is used. For example, the MLPlagSYM method achieves a 72.53% F1-measure for ω = 8% on the JRC-EU corpus, but only 65.91% if no window is used. In Table 2, we denote the situation when no window is used as "---". The second parameter is the threshold τ, which represents the minimal level of similarity above which we consider two documents to be plagiarized. This parameter significantly influences the F1-measure. The effective interval of τ for MLPlagSYM is slightly narrower than for the MLPlagASYM method, see Fig. 3: for MLPlagSYM it is about 5% to 12%, while MLPlagASYM achieves good results in the interval between 8% and 20%. Although MLPlagASYM has a wider effective interval, the peak values occur around τ = 16% for both corpora. Now, let us look at threshold τ in Table 1 and Table 2. For the multilingual corpora, the best results are achieved with τ much lower than for the monolingual CTK corpus. This is caused by words that have no equivalent in the other language. Another issue can be found in the EWN thesaurus, because it is still under development. In any case, fewer word matches are found, and therefore a much lower value of τ is required to achieve outstanding results in a multilingual environment.

Table 2. The results achieved on the multilingual JRC-EU and Fairy-tale corpora

Corpus       Method        Threshold τ   Window size ω   F1-measure
JRC-EU       MLPlagSYM     9%            ---             65.91%
JRC-EU       MLPlagSYM     9%            8%              72.53%
JRC-EU       MLPlagASYM    16%           ---             70.52%
JRC-EU       MLPlagASYM    17%           55%             71.47%
Fairy-tale   MLPlagSYM     7%            ---             100%
Fairy-tale   MLPlagSYM     5%            10%             100%
Fairy-tale   MLPlagASYM    16%           ---             94.73%
Fairy-tale   MLPlagASYM    11%           8%              100%

Generally, for both corpora, MLPlagSYM and MLPlagASYM yield similar results. In the monolingual case, the decision is straightforward: we propose to employ MLPlagASYM because it significantly outperforms MLPlagSYM, see Table 1. In the multilingual case, the decision is debatable because, from our observations, the differences between the two methods are statistically insignificant, see Table 2. Nevertheless, we recommend employing MLPlagASYM due to the easier determination of the right threshold τ: an effective τ is spread over a wider interval, so an appropriate choice of threshold is more likely. The Fairy-tale corpus reaches outstanding results in comparison with the JRC-EU corpus; both methods achieve a 100% F1-measure because of the simplified vocabulary the corpus uses. On the other hand, the JRC-EU corpus achieves at most 72.53%, for the MLPlagSYM method. The JRC-EU corpus contains a large number of topic-specific words that do not occur in EWN, and the accuracy therefore decreases radically.

5 Conclusion

According to our experiments, the MLPlag method gives promising results. The method is able to process multilingual data without any significant impact on accuracy, which was not possible with other approaches. In fact, the experiments show that the F1-measure rises when multilingual pre-processing is applied. A further improvement is obtained when we include the relative window approach: the F1-measure reaches 72.53% on the JRC-EU corpus, an increase of almost 7 percentage points compared with the approach where no window is used. The second corpus, Fairy-tale, consists of articles written in a simplified language; therefore, the F1-measure is 100% for both setups, with and without the window. Our method is able to process any number of the languages included in the EWN thesaurus. However, we should take into account that the EWN thesaurus is still under development. This can cause some difficulties in cross-language plagiarism detection. As EWN is gradually completed, this problem will disappear and we expect even better results. We are going to extend our work in several areas. We aim to replace the relative window with a more sophisticated approach based on the structural features of languages. Further, we are working on advanced word processing that includes word-sense disambiguation. Finally, we aim to use the inter-word relationships stored in EWN.

Acknowledgments. This research was supported in part by National Research Programme II, project 2C06009 (COT-SEWing).

References

1. Clough, P.: Plagiarism in natural and programming languages: An overview of current tools and technologies. Internal Report CS-00-05, Department of Computer Science, University of Sheffield, 2000.
2. European Commission - Joint Research Centre: The JRC-Acquis Multilingual Parallel Corpus, Version 3.0. Last update 23/1/2008. URL: http://langtech.jrc.it/JRC-Acquis.html
3. Gorin, R.: Ispell. Last update 5/6/1996. URL: http://fmg-www.cs.ucla.edu/fmg-members/geoff/ispell.html
4. Global WordNet Association: EuroWordNet. Last update 9/1/2001. URL: http://www.illc.uva.nl/EuroWordNet/
5. Hajic, J.: Morphology analyzer. Last update 8/27/2001. URL: http://quest.ms.mff.cuni.cz/pdt/Morphology_and_Tagging/Morphology/index.html
6. Lane, P., Lyon, C., Malcolm, J.: Demonstration of the Ferret Plagiarism Detector. In: Proceedings of the 2nd International Plagiarism Conference, Newcastle, UK, 2006.
7. Maurer, H., Kappe, F., Zaka, B.: Plagiarism – A Survey. Journal of Universal Computer Science, vol. 12, issue 8, pp. 1050-1084, 2006.
8. Myers, E.: An O(ND) Difference Algorithm and Its Variations. Algorithmica, vol. 1, pp. 251-266, 1986.
9. Pataki, M.: Distributed Similarity and Plagiarism Search. In: Proceedings of the Automation and Applied Computer Science Workshop, pp. 121-130, Budapest, Hungary, 2006. ISBN 963-420-865-7.
10. Runeson, P., Alexanderson, M., Nyholm, O.: Detection of Duplicate Defect Reports Using Natural Language Processing. In: Proceedings of the IEEE 29th International Conference on Software Engineering, pp. 499-510, 2007.
11. Salton, G.: The state of retrieval system evaluation. International Journal of Information Processing & Management, vol. 24, issue 4, pp. 441-449, Pergamon Press, Inc., Tarrytown, USA, 1992. ISSN 0306-4573.
12. Shivakumar, N., Garcia-Molina, H.: SCAM: A copy detection mechanism for digital documents. In: Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries, Austin, 1995.
13. Salton, G., Buckley, C.: Term-Weighting Approaches in Automatic Retrieval. Journal of Information Processing and Management, vol. 24, issue 5, pp. 513-523, 1988.