Unsupervised Identification of Persian Compound Verbs

Mohammad Sadegh Rasooli¹, Heshaam Faili², and Behrouz Minaei-Bidgoli¹

¹ Department of Computer Engineering, Iran University of Science and Technology
{rasooli@comp., b_minaei@}iust.ac.ir
² School of Electrical & Computer Engineering, Tehran University
[email protected]

Abstract. One of the main tasks related to multiword expressions (MWEs) is compound verb identification. There have been many works on the unsupervised identification of multiword verbs in many languages, but there has been no conspicuous work on the Persian language yet. Persian multiword verbs (known as compound verbs) are a kind of light verb construction (LVC) with syntactic flexibility, such as an unrestricted word distance between the light verb and the nonverbal element. Furthermore, the nonverbal element can be inflected. These characteristics make the task very difficult for Persian. In this paper, two different unsupervised methods are proposed to automatically detect compound verbs in Persian. In the first method, extending the concept of the pointwise mutual information (PMI) measure, a bootstrapping method is applied. In the second approach, the K-means clustering algorithm is used. Our experiments show that the proposed approaches obtain results superior to the baseline, which uses the PMI measure as its association metric.

Keywords: multiword expression, light verb constructions, unsupervised identification, bootstrapping, K-means, Persian.
1 Introduction
A collocation is "an arbitrary and recurrent word combination" [1] or a frequently appearing sequence of adjacent words [2]. In [3], collocations are classified into two main categories: 1) theoretical and 2) empirical. Empirical collocations are the ones occurring in corpora, while theoretical collocations are the ones known in linguistics. Collocations range from lexically restricted expressions (e.g. strong tea), phrasal verbs (e.g. look after), technical terms (e.g. prime minister), and proper names (e.g. Los Angeles) to idioms (e.g. spill the beans) [4]. In [4], five types of n-grams were considered true collocations: 1) stock phrases (e.g. major problem), 2) named entities (e.g. Prague castle), 3) support verb constructions (e.g. make a decision), 4) technical terms (e.g. prime minister), and 5) idiomatic expressions (e.g. kick the bucket). A multiword expression (MWE) is known as a type of collocation that refers to a single concept [5], whose overall meaning is often not a function of the meanings of its constituent parts [6] and which differs at the meaning level [7]. Because of this idiosyncrasy in meaning, MWEs are considered different from multiword phrases [8].
One of the main tasks in NLP is the detection of MWEs. In [9], it is stated that MWEs are one of the two main challenges of NLP. In addition, MWEs are very frequent in real language data [9]. Hence, the problem of their identification should be solved in order to learn the language soundly.

The majority of MWEs are verbal expressions such as light verb constructions (LVCs), verb noun constructions (VNCs), and verb particle constructions (VPCs) [10]. VNCs are either idiomatic or literal [7]. Idioms are defined as sequences of words that are semantically idiosyncratic and non-compositional [11]. VPCs consist of a head verb and one or more obligatory particles such as prepositions (e.g. make up), adjectives (e.g. cut short), or verbs (e.g. let go) [12]. Light verbs are classes of verbs that independently lack the semantic force to function as predicates and need predicative nouns or adjectives to complete their meaning [13]. An LVC is made up of a light verb and a preverb (in most cases, a noun), and its meaning is non-compositional. Furthermore, the preverb (nonverbal part) has a verbal or predicative meaning, while the verbal part lacks its literal semantics [14]. The most challenging property of LVCs is their lexical semantic variation, i.e. polysemy, in which the verbal part of the construction tends to have different meanings according to the context [6]. The syntactic patterns of LVC occurrences in language corpora tend to be similar to each other [14]; however, the argument structure of an LVC differs from that of the light verb itself. This kind of evidence is very frequent in languages like Persian, Korean, and Kurdish [13]. In Persian, although the nonverbal part of the LVC is not the object of the verbal part, the verbal part is free of literal semantics and only works as a verbal agent for the predicative preverb to express an action in the sentence [13, 15].

There is a major difference between Persian compound verbs (the other name for Persian LVCs) and those in other languages like English. Although MWEs are not productive in other languages [9], in Persian any predicative noun can combine with its related verbal element to make a new compound verb. For example, the verb "kardan" ("to do" in English) forms most of the compound verbs in Persian, and many new predicative nouns can combine with it to make new LVCs. For instance, the new Persian word "email" can combine with "kardan" to make a new LVC ("email kardan" (email - to do) means "to compose an email"), and the argument structure or valency (in the notion of dependency grammar) changes.

In this paper, considering the special aspects of Persian compound verbs, we propose two unsupervised learning methods over a Persian corpus in order to improve the task of Persian compound verb identification. In Section 2, some related work on MWEs (especially multiword verbs) is described. In Section 3, the main challenges of Persian compound verbs are stated and the shortcomings of Persian data resources are also noted. Experimental results and the conclusion are presented later in the paper.
2 Related Work
Several statistical methods for collocation and MWE extraction have been proposed, mainly focusing on idiomatic expressions, LVCs, and multiword verbs [16]; most of these methods are based on lexical association measures, such as mutual information [4].
One of the most famous works in recognizing non-compositional phrases is presented in [17], where the pointwise mutual information (PMI) measure was used as the measure of non-compositionality. The main idea of that work was based on the hypothesis that a phrase is non-compositional when "its mutual information differs significantly from the mutual information of phrases obtained by substituting one of the word in the phrase with a similar word". In addition, dependency relations between verbs and their objects were counted as the co-occurrence evidence. This hypothesis has some deficiencies, especially the overestimation of mutual information when the counts are very small [18]. However, this measure became the state of the art in many comparisons. For instance, in [19], 84 different bigram association measures were compared for Czech multiword expressions and PMI gained the best results. In [20], 31 measures were re-examined and the normalized PMI gained the best results. In [21], five classic measures were compared for identifying German PP-verb collocations and the t-score gained the best precision. In [22], a measure based on thermodynamic models was applied to an identification task using search engines; with more informative counts from the web, it outperformed the PMI measure. It is worth noting that the performance of an association measure strongly depends on the nature of the data, and it cannot be said strictly which method retrieves the best results in all cases [4].

Most of the work on the identification of multiword verbs has been done via unsupervised methods [5]. In [23], a statistical measure based on log-linear models of the n-way interaction of words, under a multinomial distribution, was proposed in order to identify English multiword verbs. In [24], three methods based on different linguistic features were used to identify VPCs: in the first method, a simple part of speech (POS) tagger was used to enumerate potential candidates; in the second, a simple chunker was used; and in the last method, some grammatical information was added, which not only improved the accuracy but also enumerated both positive and negative evidence in the corpus sentences. In [25], a word sense disambiguation (WSD) method was used as a clustering task for discriminating literal and non-literal MWEs; in this work, the KE algorithm was used, which is based on recognizing the similarity between a target sentence and other sentences, a similarity known as attraction. In [26], using latent semantic analysis (LSA), a vector-based semantic similarity model was applied to word co-occurrence counts to find multiword verbs; the cosine similarity measure was used to compare vectors, and the method was evaluated on German texts. In [27], it was assumed that idiomatic expressions appear in a small number of canonical forms, whereas literal multiwords appear in several forms and have no restriction on their syntactic form. Based on this assumption, two models of co-occurrence vectors were proposed and evaluated on the BNC corpus to find idioms in VNCs: one was the model proposed in [26], and the other was based on the assumption that canonical forms are more likely to be idiomatic than non-canonical forms. In [28], linguistic features of multiword verbs such as pluralization, passivization, and change in determiner type were considered relevant patterns of syntactic variation.
In this work, maximum likelihood estimation was used to estimate the probability of being idiomatic, and the probability distributions of syntactic variations were compared via KL-divergence. In [29], syntactic fixedness was estimated by summing the pointwise mutual information over all syntactic variations.
In [8], statistical measures for estimating idiosyncrasy in MWEs were proposed; these measures were based on the PMI measure. Extensions of this work are presented in [11, 14]. In [11], the syntactic fixedness of an idiom was considered positive evidence; i.e., the syntax and order of the idiom constituents in data occurrences do not differ from the main form of the idiom. In this work, it is stated that most idioms are known to be lexically fixed, and each syntactic variation of an idiom that differs from the main lexical order is considered negative evidence [14]; with this assumption, the model proposed in [17] was improved. Finally, KL-divergence was used to measure the syntactic fixedness of an expression; i.e., the degree of syntactic fixedness of a verb-noun pair was estimated as the divergence of the syntactic behavior of the target verb-noun pair from the typical syntactic behavior (the prior distribution). In that work, it is shown that the methods proposed in [17] and [1] are not significantly better than random selection. In [7], inspired by the models in [8, 26], a vector-based similarity measure was used, but the notion of context was changed: the context included all nouns, verbs, adjectives, and adverbs occurring in the same paragraph as the target word. With this assumption, five different parameter settings were tested, and the ratio of a word pair in each context was inspired by the tf-idf measure in information retrieval.

There is less work on supervised multiword verb identification than on unsupervised methods. In [5], linguistic features such as word lemma, chunk, named entity, and part of speech were used as classification features, and a support vector machine (SVM) classifier was used for classification. In this work, the VNC-Tokens dataset [30], in which VNC token expressions are manually tagged as either idiomatic or non-idiomatic, was used as the training data. In [31], the paragraph context of a target word was mapped into a co-occurrence word vector, and then the Dice coefficient and the Jaccard index were used to estimate the similarity between vectors. In [32], 55 different association measures were mapped into a vector, and three different learning methods (linear logistic regression, linear discriminant analysis (LDA), and neural networks) were used to combine these measures; among them, LDA gained the best result. In [4], 84 association measures were compared, and a classifier was built by choosing an appropriate threshold for each measure. Furthermore, based on Pearson's correlation coefficient, hierarchical clustering and dimension reduction were performed on these association measures in order to handle sparsity in feature combination. Finally, four different learning algorithms (linear logistic regression, linear discriminant analysis, SVM, and neural networks) were used to combine the association measures, and it was shown that in all cases the combination of measures outperformed using the measures alone. In this work, the neural network gained the best result on the Prague Dependency Treebank (PDT).
3 Challenges in Persian Compound Verb Identification
As mentioned in the previous section, most recent methods in multiword verb identification (both supervised and unsupervised) use linguistic features whose application requires a corpus tagged with syntactic features such as dependency relations. One of the major difficulties in this field of research is the lack of reliable datasets such as [30, 33] for supervised learning and datasets such as [34] for unsupervised learning.
The only reliable corpus in Persian is the Bijankhan corpus [35], which has only been annotated with part of speech tags and some morphosyntactic features. Features such as the dependency relations in [11, 16, 17, 36], the chunks and named entities in [5], and the dictionary of collocations in [11] are not currently available for the Persian language. The other major challenge is the productiveness of Persian compound verbs [37, 38], which makes handcrafted verb lists from current dictionaries inappropriate and unreliable. One of the important cues for identifying Persian compound verbs is the change in argument structure (or subcategorization frame) compared to the light verb itself [13, 39], and this cue is not tagged in the current Persian corpus. In works such as [11, 27-29, 36], syntactic fixedness was used to identify idiomatic multiword predicates, but this kind of cue is not applicable to Persian: not only can the nonverbal element of a Persian LVC be inflected, but this inflection also never prevents the nonverbal part from combining with the light verb as a compound verb; even an inflected noun can be the nonverbal predicate in a compound verb [40].

The other problem in Persian compound verb identification is data sparseness; therefore, methods like the one in [17] do not lead to satisfactory results, and for many verbs, scores like PMI are very close to each other. Hence, this kind of method does not work properly in this case. The sparsity is due to the separability [41] and the preverb inflection ability [40] of Persian LVCs. Consider the example in (1):

(1) Man (I) bâ (with) to (you) sohbat-hâ-ye (speak-plural-Ezafe) besyâr (very) ziyâdi (high) dar (in) mored-e (about-Ezafe) jang-e (war-Ezafe) irân (Iran) va (and) erâq (Iraq) kardam (do - simple past, first person singular of "kardan").
Meaning: I spoke a great deal with you about the war between Iran and Iraq.

As shown in (1), there is a distance of 9 words between the light verb ("kardan") and the nonverbal element ("sohbat"). Also, the nonverbal element is pluralized, and an Ezafe (an indicator of noun modification [42], which only in rare cases is written explicitly) is attached to it. In sentence (1), four of the words preceding the light verb are nouns ("sohbat", "jang", "irân", and "erâq"), so there are four candidates for being the predicative noun of the light verb "kardan". If we generalize this phenomenon to all sentences in the corpus, the event space for multiword verbs is more than a simple adjacency of nouns and verbs, and the number of potential candidates increases with sentence length. In this work, we carried out a case study of the light verb "kardan" in Persian and found that, in the Bijankhan corpus, 98 percent of nonverbal elements appear adjacent to the light verb at least once, and 91 percent of sentences with LVCs are ones in which the light verb and the nonverbal element are adjacent. In other words, we face a tradeoff between precision and recall: on the one hand, if we only consider nouns adjacent to the verbs, some valuable information will be lost; on the other hand, if we consider all possible candidates, identification precision will decrease.
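To make the candidate space concrete, the following Python sketch (our own illustration, not part of the original system) enumerates (preverb candidate, light verb, distance, nouns in between) tuples from a POS-tagged sentence; the (lemma, POS) representation and the tag names "N" and "V" are assumptions. Applied to example (1), with only the four nouns above tagged as N, it would yield ("sohbat", "kardan", 9, 3), ("jang", "kardan", 4, 2), ("irân", "kardan", 3, 1), and ("erâq", "kardan", 1, 0).

def extract_candidates(tagged_sentence):
    """tagged_sentence: list of (lemma, pos) pairs, e.g.
    [("man", "PRO"), ("bâ", "P"), ..., ("sohbat", "N"), ..., ("kardan", "V")].
    Returns (noun, verb, distance, nouns_between) tuples for every noun
    preceding a verb in the sentence."""
    candidates = []
    for v_idx, (v_lemma, v_pos) in enumerate(tagged_sentence):
        if v_pos != "V":
            continue
        for n_idx in range(v_idx - 1, -1, -1):
            n_lemma, n_pos = tagged_sentence[n_idx]
            if n_pos != "N":
                continue
            distance = v_idx - n_idx
            nouns_between = sum(1 for _, p in tagged_sentence[n_idx + 1:v_idx]
                                if p == "N")
            candidates.append((n_lemma, v_lemma, distance, nouns_between))
    return candidates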
4 Persian Compound Verb Identification
In this paper, two kinds of unsupervised methods for detecting Persian compound verbs have been investigated: the first is based on bootstrapping and the second is the K-means clustering algorithm.
Furthermore, a modified version of the PMI measure is proposed in order to handle data sparseness in Persian compound verbs. We name this measure Probabilistic PMI (PPMI); it is shown in equation (1), in which fr(x) is the frequency of x and N is the total number of co-occurrence observations. In the following sections, after introducing PPMI, the details of bootstrapping and K-means are described.

PPMI(x, y) = log [ fr(x, y) · N / (fr(x) · fr(y)) ]    (1)

4.1 PPMI
If all possible alternatives of a compound verb are considered in PMI, data sparseness will decrease precision. On the other hand, if only adjacent candidates are considered, some valuable information will be lost. In our experiments, we designed a rule-based compound verb tagger based on a list of real compound verbs; the linguistic rules for Persian compound verbs used in the tagger were extracted from Persian grammar books such as [43]. The Bijankhan corpus [35] was tagged with this tagger, and the distribution of the distance between the light verb and the nonverbal element is shown in Table 1. This observation confirmed the deficiency of the classic PMI measure. In order to solve the problem, we introduce a modified version of the PMI measure in which each co-occurrence is not counted as one (as in the PMI measure); instead, the co-occurrence count is a number between zero and one based on the distance between the nonverbal element candidate and the light verb.

Table 1. Empirical distribution of the distance between the light verb and the preverb in Persian compound verbs

Distance    Probability
1           0.91
2           0.045
3-5         0.025
6-10        0.015
≥10         0.005
In PPMI, instead of counting every co-occurrence as one, each co-occurrence receives a value between zero and one according to the word distance between the light verb and the preverb. The values are obtained by estimating an empirical distribution over the Bijankhan corpus [35], as shown in Table 1. The reason we use an empirical distribution is that well-known distributions such as the polynomial distribution were tested via the K-square test and did not pass it. In this way, all counts in PMI change. Furthermore, to increase precision, we only consider nonverbal candidates that are adjacent to the light verb. For example, consider the candidates in sentence (1): the co-occurrence counts in the classic PMI, in the PMI measure used in this paper, and in PPMI are shown in Table 2.
Table 2. An example of co-occurrence counts in the PMI and PPMI measures

Preverb candidate   Light verb   Distance   Count (classic PMI)   Count (PMI used in this paper)   Count (PPMI)
sohbat              kardan       9          1                     0                                0.005
jang                kardan       4          1                     0                                0.025
irân                kardan       3          1                     0                                0.045
erâq                kardan       1          1                     1                                0.91
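The weighted counting behind PPMI can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions (the weight function is read off Table 1, and applying the same weights to the marginal frequencies fr(x) and fr(y) is our choice), not the authors' implementation.

from collections import Counter
from math import log

def distance_weight(d):
    # Distance-to-weight mapping read off Table 1; distances above 10 get 0.005.
    if d == 1:
        return 0.91
    if d == 2:
        return 0.045
    if 3 <= d <= 5:
        return 0.025
    if 6 <= d <= 10:
        return 0.015
    return 0.005

def ppmi_scores(candidates):
    """candidates: iterable of (preverb, light_verb, distance) triples,
    one per candidate pair per sentence occurrence."""
    pair_f, noun_f, verb_f = Counter(), Counter(), Counter()
    total = 0.0
    for noun, verb, dist in candidates:
        w = distance_weight(dist)
        pair_f[(noun, verb)] += w     # weighted fr(noun, verb)
        noun_f[noun] += w             # weighted fr(noun)
        verb_f[verb] += w             # weighted fr(verb)
        total += w                    # N in equation (1)
    return {(n, v): log(f * total / (noun_f[n] * verb_f[v]))
            for (n, v), f in pair_f.items()}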
4.2 Bootstrapping
Instead of deciding on all compound verb candidates at once, bootstrapping is carried out incrementally up to a threshold. In each iteration, some compound verbs with high PMI scores (the PMI is measured only for candidates in which the preverb is adjacent to the light verb) are assumed to be real compound verbs. In the first phase, the compound verb list is empty. Based on the association measure (here, PMI), some candidates with high scores are inserted into the list. After choosing these candidates as compound verbs, the corpus is reprocessed under the assumption that the chosen candidates are truly compound verbs, and the next iteration processes the corpus with this updated compound verb list. In each iteration, the compound verb list grows. After some iterations (based on a manually decided threshold), the method uses PPMI in order to find compound verbs whose parts are not adjacent. The algorithm is summarized in Fig. 1. For example, consider sentence (1) again and assume that the program is in the first iteration; the data format is as in Table 3.

Table 3. Data format for bootstrapping in sentence (1) in the first iteration

Preverb candidate   Light verb   Distance
sohbat              kardan       9
jang                kardan       4
irân                kardan       3
erâq                kardan       1
Now assume that in the first iteration "sohbat kardan" is recognized as a compound verb. The corpus is then reshaped as in Table 4, in which the other candidates lose their chance of being compound verb candidates in sentence (1).

Table 4. Data format for bootstrapping in sentence (1) after recognizing "sohbat kardan" as a compound verb

Preverb candidate   Light verb     Distance
jang                sohbatkardan   4
irân                sohbatkardan   3
erâq                sohbatkardan   1
In the first iterations, only the PMI in which adjacent words are counted as co-occurrences is bootstrapped. After a threshold, PPMI is used in order to catch compound verbs that are not frequently adjacent.

Bootstrapping algorithm for compound verb identification:
(1) compound-verb-list = empty
(2) Construct the training data from the corpus
(3) Calculate PMIs
(4) While (highest PMI ≥ threshold)
(5)    Select the k candidates with the highest PMIs and add them to the compound-verb-list
(6)    Reconstruct the training data based on the compound-verb-list
(7)    Recalculate PMIs
(8) End-while

Fig. 1. A summary of the bootstrapping algorithm for compound verb identification
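As a rough sketch of the loop in Fig. 1 (using our own data representation, not the authors' implementation), the Python code below scores adjacent candidates with PMI, adds the top-k pairs above the threshold to the compound verb list, and reconstructs the training data by merging each recognized preverb into its light verb, as illustrated in Tables 3 and 4. The values of k and the threshold are chosen manually, as in the paper; counting the marginal frequencies over all candidates, the frequency cut-off of five, and the omission of the later switch to PPMI are our simplifications.

from collections import Counter
from math import log

def adjacent_pmi(data, min_count=5):
    """PMI over candidates whose preverb is adjacent to the light verb.
    data: list of sentences, each a list of (noun, verb, distance) triples."""
    pair_f, noun_f, verb_f, total = Counter(), Counter(), Counter(), 0
    for sent in data:
        for noun, verb, dist in sent:
            noun_f[noun] += 1
            verb_f[verb] += 1
            total += 1
            if dist == 1:
                pair_f[(noun, verb)] += 1
    return {(n, v): log(f * total / (noun_f[n] * verb_f[v]))
            for (n, v), f in pair_f.items() if f >= min_count}

def reconstruct(data, compounds):
    """Merge a recognized (preverb, verb) pair into a single verb token and
    drop that preverb from the sentence's candidate list (cf. Tables 3 and 4)."""
    new_data = []
    for sent in data:
        match = next(((n, v) for n, v, _ in sent if (n, v) in compounds), None)
        if match is None:
            new_data.append(sent)
            continue
        noun, verb = match
        merged = noun + verb                            # e.g. "sohbatkardan"
        new_data.append([(n, merged, d) for n, v, d in sent if n != noun])
    return new_data

def bootstrap(data, k=50, threshold=3.0):
    compounds = set()
    while True:
        scores = adjacent_pmi(data)
        ranked = sorted(scores, key=scores.get, reverse=True)
        new = [p for p in ranked[:k]
               if scores[p] >= threshold and p not in compounds]
        if not new:
            break
        compounds.update(new)
        data = reconstruct(data, compounds)
    return compounds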
4.3 K-means
In order to apply K-means, four different features are combined; to find the best feature set, multiple experiments were run on several combinations of these features. The features are: 1) PMI (only for adjacent compound verbs), 2) PPMI, 3) the average distance between the light verb and the preverb, and 4) the average number of nouns between the light verb and the preverb. The Euclidean distance [44] is used as the distance measure in this algorithm, and all features are normalized to a number between zero and one. The number of iterations of the K-means algorithm is determined manually. After the algorithm finished, each cluster was labeled manually according to whether compound verbs or non-compound verbs formed the majority of its members; for example, if a cluster contains 250 candidates and 140 of them are compound verbs, we tag it as the compound verb cluster. We tested several values of K in order to find the best number of clusters, but surprisingly the results were equal for all numbers of clusters (we finally chose K = 2).
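A minimal sketch of this clustering step, assuming scikit-learn is available; the feature-vector layout and the dictionary keys are our own names (the paper does not specify an implementation).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

def cluster_candidates(records, n_clusters=2):
    """records: list of dicts such as
    {"pair": ("sohbat", "kardan"), "pmi": 4.2, "ppmi": 3.8,
     "avg_dist": 1.3, "avg_nouns_between": 0.2}."""
    pairs = [r["pair"] for r in records]
    X = np.array([[r["pmi"], r["ppmi"], r["avg_dist"], r["avg_nouns_between"]]
                  for r in records])
    X = MinMaxScaler().fit_transform(X)   # normalize each feature to [0, 1]
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X)
    # Which cluster represents the compound verbs is decided afterwards by
    # manual inspection, as described in the text.
    return dict(zip(pairs, labels))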
5 Experiments and Results
We used a finite state automaton (FSA) based method to parse Persian light verbs; the state transition rules of the automaton, which capture the multiword behavior of Persian light verbs, were extracted from Persian grammar books. The entire Bijankhan corpus was analyzed, and verb inflections were simplified to their lemmas. In addition, unrelated candidates were filtered via simple formal rules of Persian compound verbs, e.g., a preverb cannot occur immediately before the postposition ("râ" in Persian), and no premodifier is accepted for a preverb [43]. Furthermore, we used the morphosyntactic features in the Bijankhan corpus [35] to find noun lemmas. To test our approach, we tried to find compound verbs whose light verb is an inflection of "kardan".
Compound verbs with this light verb mostly occur with nouns as preverbs and do not occur with prepositional preverbs (e.g., "Preposition + Noun" as the preverb); as a result, we considered only nouns for the sake of simplicity. As in [4], we used only candidates with co-occurrence counts greater than or equal to five (a rough sketch of this candidate filtering is given after Table 5). We examined both the method used in [17] and PMI with the best threshold as the baselines. The thresholds for PMI and PPMI in bootstrapping were selected manually. For the K-means method, 8 different feature combinations were used in order to find the best feature set for the identification task; the feature set used in each K-means run is shown in Table 5. The results show that bootstrapping gains the best results.

Table 5. The features used in K-means (+: feature used, ×: feature not used)

Method abbreviation   Average distance between LVC parts   Average number of nouns between LVC parts   PPMI   PMI
Kmeans (1)            +                                     +                                            +      +
Kmeans (2)            ×                                     +                                            +      +
Kmeans (3)            +                                     +                                            ×      +
Kmeans (4)            +                                     +                                            +      ×
Kmeans (5)            +                                     ×                                            +      +
Kmeans (6)            ×                                     ×                                            +      +
Kmeans (7)            +                                     ×                                            ×      +
Kmeans (8)            ×                                     +                                            ×      +
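The candidate filtering mentioned at the beginning of this section can be sketched as follows; this is our own illustration under an assumed tagset, whereas the real rules operate on the Bijankhan annotations and the grammar rules in [43].

def is_valid_preverb(tokens, i):
    """tokens: list of (lemma, pos) pairs; i indexes a candidate preverb.
    Encodes the simple formal rules mentioned above (tag names are assumed)."""
    lemma, pos = tokens[i]
    if pos != "N":
        return False
    # A noun immediately followed by the postposition "râ" is an object,
    # not a preverb.
    if i + 1 < len(tokens) and tokens[i + 1][0] == "râ":
        return False
    # No premodifier is accepted for a preverb (approximated here as an
    # adjective or determiner directly before the noun).
    if i > 0 and tokens[i - 1][1] in {"ADJ", "DET"}:
        return False
    return True

def frequent_candidates(pair_counts, min_count=5):
    """Keep only candidate pairs observed at least five times, as in [4]."""
    return {pair: c for pair, c in pair_counts.items() if c >= min_count}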
As stated before, the method in [17] and the PMI-based method with the best threshold are used as the baselines; the best threshold is chosen based on the F-score. The results are shown in Table 6.

Table 6. Results on Persian compound verb identification

Method                                        Prec.   Rec.    F-Score
Lin                                           45.15   29.06   35.36
PMI with the best threshold                   47.36   32.09   38.26
Bootstrapping on PMI                          90.17   68.90   78.11
Bootstrapping on PMI and PPMI sequentially    87.77   71.33   78.70
Kmeans (1)                                    77.79   59.96   67.72
Kmeans (2)                                    79.82   61.36   69.38
Kmeans (3)                                    81.60   62.16   70.57
Kmeans (4)                                    71.80   65.77   68.65
Kmeans (5)                                    74.72   60.36   66.78
Kmeans (6)                                    75.64   59.36   66.52
Kmeans (7)                                    82.51   50.05   62.31
Kmeans (8)                                    79.52   66.07   72.17
6 Discussion and Analysis
As seen in Table 6, two features are more informative than the others in our evaluation. The first is the PMI measure, which is nevertheless not appropriate on its own.
The second is the average number of nouns between the light verb and its preverb. On the other hand, using the combination of PMI and the average word distance between LVC parts, although the recall is significantly lower than for other combinations, the algorithm gains the best precision. The experiments show that bootstrapping only on the PMI of adjacent LVCs outperforms the other methods. Even though we obtained a better F-score via bootstrapping on PPMI as well, the difference is not statistically significant. The key to the success of bootstrapping is the iterative filtering of irrelevant candidates (as shown in Table 3 and Table 4). The trends over the bootstrapping iterations are shown in Fig. 2.
Fig. 2. Trends in bootstrapping
7 Conclusion
In this paper, it is shown that successful association measures such as PMI are not sufficient for Persian compound verb identification. This phenomenon is due to the data sparsity caused by the word-distance flexibility between Persian LVC parts. As stated in previous sections, about 9 percent of LVCs occur non-adjacently in the Bijankhan corpus, and 2 percent of LVCs do not even occur adjacent to each other once. Even the introduced modification to PMI, i.e. PPMI, does not obtain better results. We gained significantly better results via bootstrapping, i.e., by filtering irrelevant candidates iteratively. Using K-means, we observed that the average number of nouns between the candidate preverb and the light verb affects identification performance.
8 Future Work
Another important aspect of Persian language processing is the identification of Persian syntactic verb valencies (or subcategorization frames) and Persian semantic verb valencies (or argument structures), as well as the semantic clustering of verbs with regard to the polysemy of main verbs. The present work was done to produce a potential list of Persian verbs from which true verbs can be manually selected and appropriate verb valencies assigned. After checking the obtained verbs by hand, we are going to carry out other unsupervised (and perhaps semi-supervised) tasks to identify the syntactic and semantic valencies of Persian verbs.
Acknowledgment. We would like to thank Ali Hadian, Manouchehr Kouhestani, Maryam Aminian, Amir Saeed Moloodi, Majid Laali, and the four anonymous reviewers for their useful comments on this work. This paper is funded by the Computer Research Center of Islamic Sciences (C.R.C.I.S).
References
1. Smadja, F.: Retrieving collocations from text: Xtract. Computational Linguistics 19(1), 143–177 (1993)
2. Choueka, Y., Klein, T., Neuwitz, E.: Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. Journal for Literary and Linguistic Computing 4(1), 34–38 (1983)
3. Evert, S.: Corpora and collocations. In: Corpus Linguistics. An International Handbook, pp. 1212–1248 (2009)
4. Pecina, P.: Lexical association measures and collocation extraction. Language Resources and Evaluation 44(1), 137–158 (2010)
5. Diab, M.T., Bhutada, P.: Verb noun construction MWE token supervised classification. In: Workshop on Multiword Expressions (ACL-IJCNLP 2009), pp. 17–22. Association for Computational Linguistics, Suntec (2009)
6. Bannard, C., Baldwin, T., Lascarides, A.: A statistical approach to the semantics of verb-particles. In: ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 65–72. Association for Computational Linguistics (2003)
7. Diab, M.T., Krishna, M.: Unsupervised Classification of Verb Noun Multi-word Expression Tokens. In: Gelbukh, A. (ed.) CICLing 2009. LNCS, vol. 5449, pp. 98–110. Springer, Heidelberg (2009)
8. Fazly, A., Stevenson, S.: Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In: Workshop on A Broader Perspective on Multiword Expressions. Association for Computational Linguistics, Prague (2007)
9. Sag, I., et al.: Multiword expressions: A pain in the neck for NLP. In: 6th Conference on Natural Language Learning (COLING 2002), pp. 1–15 (2002)
10. Villavicencio, A., Copestake, A.: On the nature of idioms. In: LinGO Working (2002)
11. Fazly, A., Cook, P., Stevenson, S.: Unsupervised type and token identification of idiomatic expressions. Computational Linguistics 35(1), 61–103 (2009)
12. Villavicencio, A., Copestake, A.: Verb-particle constructions in a computational grammar of English. Citeseer (2002)
13. Karimi-Doostan, G.: Light verbs and structural case. Lingua 115(12), 1737–1756 (2005)
14. Fazly, A., Stevenson, S., North, R.: Automatically learning semantic knowledge about multiword predicates. Language Resources and Evaluation 41(1), 61–89 (2007)
15. Karimi-Doostan, G.: Event structure of verbal nouns and light verbs. In: Aspects of Iranian Linguistics: Papers in Honor of Mohammad Reza Bateni, pp. 209–226 (2008)
16. Fazly, A., Nematzadeh, A., Stevenson, S.: Acquiring Multiword Verbs: The Role of Statistical Evidence. In: 31st Annual Conference of the Cognitive Science Society, Amsterdam, The Netherlands, pp. 1222–1227 (2009)
17. Lin, D.: Automatic identification of non-compositional phrases. In: 37th Annual Meeting of the Association for Computational Linguistics, pp. 317–324. Association for Computational Linguistics, College Park (1999)
18. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61–74 (1993)
19. Pecina, P.: An extensive empirical study of collocation extraction methods. In: ACL Student Research Workshop. Association for Computational Linguistics (2005)
20. Hoang, H.H., Kim, S.N., Kan, M.-Y.: A re-examination of lexical association measures. In: Workshop on Multiword Expressions (ACL-IJCNLP 2009), pp. 31–39. Association for Computational Linguistics, Suntec (2009)
21. Krenn, B., Evert, S.: Can we do better than frequency? A case study on extracting PP-verb collocations. In: ACL Workshop on Collocations. Citeseer (2001)
22. Bu, F., Zhu, X., Li, M.: Measuring the non-compositionality of multiword expressions. In: 23rd International Conference on Computational Linguistics (Coling 2010). Association for Computational Linguistics, Beijing (2010)
23. Blaheta, D., Johnson, M.: Unsupervised learning of multi-word verbs. In: 39th Annual Meeting and 10th Conference of the European Chapter of the Association for Computational Linguistics (ACL 39), Toulouse, France (2001)
24. Baldwin, T., Villavicencio, A.: Extracting the unextractable: A case study on verb-particles. In: 6th Conference on Natural Language Learning (COLING 2002). Association for Computational Linguistics, Stroudsburg (2002)
25. Birke, J., Sarkar, A.: A clustering approach for the nearly unsupervised recognition of nonliteral language. In: EACL 2006, pp. 329–336 (2006)
26. Katz, G., Giesbrecht, E.: Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In: Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties. Association for Computational Linguistics, Sydney (2006)
27. Cook, P., Fazly, A., Stevenson, S.: Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In: Workshop on A Broader Perspective on Multiword Expressions. Association for Computational Linguistics, Prague (2007)
28. Fazly, A., Stevenson, S.: Automatically constructing a lexicon of verb phrase idiomatic combinations. In: EACL 2006 (2006)
29. Bannard, C.: A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In: Workshop on A Broader Perspective on Multiword Expressions. Association for Computational Linguistics, Prague (2007)
30. Cook, P., Fazly, A., Stevenson, S.: The VNC-Tokens Dataset. In: LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pp. 19–22 (2008)
31. Diab, M.T., Krishna, M.: Handling sparsity for verb noun MWE token classification. In: Workshop on Geometrical Models of Natural Language Semantics. Association for Computational Linguistics, Athens (2009)
32. Pecina, P.: A machine learning approach to multiword expression extraction. In: Shared Task for Multiword Expressions (MWE 2008), pp. 54–57 (2008)
33. Kaalep, H.-J., Muischnek, K.: Multi-word verbs of Estonian: a database and a corpus. In: LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pp. 23–26 (2008)
34. Böhmová, A., et al.: The Prague Dependency Treebank: A three-level annotation scenario. In: Treebanks: Building and Using Parsed Corpora, pp. 103–127 (2003)
35. Bijankhan, M.: The role of the corpus in writing a grammar: An introduction to a software. Iranian Journal of Linguistics 19(2) (2004)
36. Fazly, A.: Automatic acquisition of lexical knowledge about multiword predicates. Citeseer (2007)
37. Dabir-Moghaddam, M.: Compound verbs in Persian. Studies in the Linguistic Sciences 27(2), 25–59 (1997)
38. Family, N.: Explorations of Semantic Space: The Case of Light Verb Constructions in Persian. Ecole des Hautes Etudes en Sciences Sociales, Paris, France (2006)
39. Pantcheva, M.: First Phase Syntax of Persian Complex Predicates: Argument Structure and Telicity. Journal of South Asian Linguistics 2(1) (2010)
40. Müller, S.: Persian complex predicates and the limits of inheritance-based analyses. Journal of Linguistics 46(3), 601–655 (2010)
41. Karimi Doostan, G.: Separability of light verb constructions in Persian. Studia Linguistica 65(1), 70–95 (2011)
42. Ghomeshi, J.: Non-projecting nouns and the ezafe construction in Persian. Natural Language & Linguistic Theory 15(4), 729–788 (1997)
43. Anvari, H., Ahmadi-Givi, H.: Persian Grammar 2, 2nd edn. Fatemi, Tehran (2006)
44. Deza, E., Deza, M.M.: Encyclopedia of Distances. Springer, Heidelberg (2009)