Chinese Organization Name Recognition Based on Multiple Features
Yajuan Ling, Jing Yang, and Liang He*
Department of Computer Science and Technology, East China Normal University, Shanghai, China
[email protected], {jyang,lhe}@cs.ecnu.edu.cn
Abstract. Recognizing Chinese organization names is a key part of Chinese named entity recognition. However, the lack of a single unified naming system that covers all types of organizations, together with the uncertainty of word segmentation, makes this recognition especially difficult. In this paper, we focus on the recognition of Chinese organization names and propose an approach that exploits several types of features of these names. First, we pre-process the input to simplify recognition. Second, we use features of the left and right boundaries to determine candidate Chinese organization names automatically. Third, we evaluate and refine the initial results with behavior features and debugging structure patterns to improve recognition performance. On the People’s Daily testing data set, the proposed approach outperforms the method based on role tagging by more than 7% in F-score. A series of further experiments shows that the proposed approach recognizes Chinese organization names well and is particularly effective in nested cases.
Keywords: Chinese organization name recognition, Core feature word, Left-border rule, Behavior feature, Debugging structure patterns.
* Corresponding author.
M. Chau et al. (Eds.): PAISI 2012, LNCS 7299, pp. 136–144, 2012. © Springer-Verlag Berlin Heidelberg 2012
1 Introduction
Named Entity Recognition (NER), which aims to identify named entities such as Person Names (PNs), Location Names (LNs) and Organization Names (ONs), is one of the key techniques in natural language processing, information retrieval, question answering, etc., and is a foundation of research in those fields. The precision of NER is therefore quite important. However, named entities are not easy to recognize, and Chinese named entities are even harder because Chinese text lacks word separators and capitalization. Many researchers have paid attention to the recognition of Chinese PNs and Chinese LNs [1-3] and proposed many different approaches, which have achieved generally satisfying results. Nevertheless, because of the inherent characteristics of Chinese organization names, such as flexible length, varied composition and complex structure, research on their recognition has made little progress.
In this paper, we focus on the recognition of Chinese organization names (CONs). The main contributions of this paper are as follows:
• An algorithm is presented to recognize CONs, including nested CONs such as “北京大学附属小学 (The Primary School attached to Peking University)”. Experiments show that our approach achieves an F-score of about 85% on nested CONs.
• A core feature word library is constructed automatically to identify the right boundaries of CONs accurately, and a left-border rule set is summarized to determine the left boundaries. The library and the rule set are detailed in Section 3.
• Behavior features and debugging structure patterns are used to evaluate and refine candidate CONs, which clearly improves recognition precision. They are described in Section 4.
The rest of this paper is organized as follows. Section 2 reviews the background of CON recognition. Section 3 details the features of CONs and explains how candidate CONs are identified. Section 4 describes different ways to evaluate and refine candidate CONs. Section 5 presents the recognition algorithm, and Section 6 presents the experiments and the corresponding discussion. Section 7 concludes the paper.
2 Related Work
The main methods of CON recognition are rule-based methods, statistical methods and hybrid methods. [4] proposes a typical rule-based method that makes full use of the complete constitution patterns of CONs, from which nine kinds of constitution patterns are extracted. Similarly, [5] analyzes the constitution patterns of CONs using manually summarized semantic and grammatical features. Experiments show that these methods are effective in specific fields; however, they are limited to those fields and require a large amount of human involvement.
Because of these disadvantages of rule-based methods, many researchers have turned to statistical methods. Most of them are based on decision tree learning [6], maximum entropy [7], hidden Markov models [8], SVM [9], CRF [10], etc. Statistical methods require less human involvement than rule-based methods, but they are more complex. For example, the maximum entropy method presented in [7] spends more time on computation, which reduces the efficiency of recognition.
For the recognition of CONs, we cannot achieve satisfactory performance using only statistical or only rule-based methods, so a combination of the two is needed. In this paper, we propose a combined approach that takes advantage of both types of methods. First, it does not need to build complete patterns of CONs; it only needs a core feature word library for the right boundary and a left-border rule set. Second, the approach makes no assumption about the length of CONs and can recognize them whether they are long or short. Third, we use behavior features and debugging structure patterns to evaluate and refine candidate CONs and eliminate incorrect ones, which effectively improves recognition performance. The proposed approach is detailed in the following sections.
3 Features of Chinese Organization Name
3.1 Structure of CON
The structures of CONs are so complex and flexible that it is quite difficult to find one generic pattern or model that captures all types of CONs. However, by analyzing a large number of CONs, we find that there are some rules and features that can be used to recognize CONs and make their recognition easier. For example, after word segmentation, the organization name “中国平安人寿保险股份有限公司 (China Peace Life Insurance Co., Ltd)” can be expressed as:
中国/ns 平安/a 人寿/n 股份/n 有限公司/n
Similarly, the organization name “上海付费通信息服务有限公司 (Shanghai Paid Information Services Co., Ltd)” can be expressed as:
上海/ns 付费通/v 信息/n 服务/vn 有限公司/n
After abstracting these two segmented CONs, we obtain their abstract structures, shown in Fig. 1.
Fig. 1. The Abstract Structures of Two CONs
By analyzing the abstract structures of CONs, we find that most segmented CONs follow the structure “M+F” shown above. Here, “M” stands for the modifiers (prefixes) and “F” represents the core feature word of the CON. Generally speaking, the number of modifiers and the part-of-speech tag of each modifier in a CON are flexible, and it is difficult to summarize basic rules from the modifiers. Core feature words, however, are few in number and easy to obtain. With their help, we can conveniently determine the right boundaries of CONs.
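The “M+F” reading can be illustrated with a short Python sketch. It assumes that the last segmented word of a known CON is its core feature word and tallies these words over a list of names to form a small library. The function names and the `min_count` threshold are illustrative choices, not part of the original method.

```python
from collections import Counter


def parse_segmented(text):
    """Parse 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def split_m_f(pairs):
    """Split a segmented CON into its modifiers M and core feature word F.

    Assumption: F is simply the last segmented word of the name.
    """
    modifiers = [word for word, _ in pairs[:-1]]
    core = pairs[-1][0]
    return modifiers, core


def build_core_library(segmented_names, min_count=1):
    """Collect the final words of known CONs as candidate core feature words."""
    counts = Counter(
        split_m_f(parse_segmented(name))[1] for name in segmented_names
    )
    return {word for word, n in counts.items() if n >= min_count}


# The two segmented names used as examples in this section.
names = [
    "中国/ns 平安/a 人寿/n 股份/n 有限公司/n",
    "上海/ns 付费通/v 信息/n 服务/vn 有限公司/n",
]
modifiers, core = split_m_f(parse_segmented(names[0]))
library = build_core_library(names)
```

Both example names end in the same word, so the resulting library contains a single entry; a larger list of existing CONs would yield a larger, but still small, set.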
3.2 Core Feature Word Library
From the analysis of the abstract structures of CONs, we know that the right boundaries of CONs have intrinsic characteristics and are limited to a small set of feature words. For example, educational institution names usually end with words like “大学 (University)”, “研究所 (Research Institution)”, etc. Enterprises and committees usually end with words like “公司 (Company)”, “集团 (Group)”, etc. In this paper, these words are defined as core feature words. Given the structure of CONs, core feature words can be used to determine their right boundaries.
It is not accidental that CONs contain core feature words; this follows from the naming rules and naming customs of CONs. First, China has established provisions on the naming of organizations, according to which a registered CON must clearly reflect the function of the organization. Second, in order to show the function of an organization clearly, the person naming it usually includes a feature word with a definite meaning in the name. Therefore, we can determine the right boundaries of CONs through a core feature word library.
Because the core feature words of CONs are generally fixed and few in number, they can be obtained from existing CONs. As the first step, we segment existing CONs and extract their core feature words to construct the core feature word library automatically.
3.3 Left-Border Rule Set
Core feature words play an important role in the recognition of CONs. However, they can only determine the right boundaries of CONs; to recognize an entire CON, we also need to determine its left boundary. Therefore, left-border rules also play an important role in the recognition of CONs.
Table 1. Left-border Feature and Corresponding Examples
Left-border feature | Example
punctuation | 1898年,北京大学正式成立。(Peking University is founded in 1898.)
preposition | 我毕业于北京大学。(I graduated from Peking University.)
conjunction | 我和北京大学的故事。(The story of Peking University and me.)
transitive verb | 2000年,李彦宏筹建百度公司。(YanHong Li prepare to establish Baidu Inc in 2000.)
auxiliary word | 盖茨一手建立了微软公司。(Gates built Microsoft company.)
Begin of sentence | 摩根斯坦利是一家投资银行。(Morgan Stanley is an investment bank.)
In this paper, we define left-border rules as functional units that can be used to separate CONs from the context preceding them. Generally speaking, left-border rules are far more flexible than core feature words: they can be noun phrases, punctuation marks, prepositions, etc. Table 1 shows some common left-border features and corresponding examples. These left-border rules can be used to determine the left boundaries of most CONs.
With the core feature word library and the left-border rule set, we can obtain candidate CONs. However, since the contextual information of CONs is complex, many incorrect CONs exist in the candidate set. For example, the approach described above extracts the CON “我们公司 (our company)” from the sentence “我们公司刚刚成立。 (Our company has just been established.)”, but “我们公司 (our company)” is obviously not a correct CON. Hence, we need other features to evaluate and refine candidate CONs.
4 Evaluate and Refine
4.1 Behavior Features
People have specific behaviors such as reading, thinking and writing. Generally speaking, these behaviors belong only to humans. For example, from the sentence “Lily is reading”, we can infer that “Lily” is a person name. Similarly, organizations have specific behaviors. Here, we define behavior words of CONs as words that describe behaviors specific to organizations. In this paper, we use behavior features to evaluate and refine candidate CONs. In general, behavior feature words appear before or after CONs, and with them we can determine whether a candidate CON is correct. For example, the text segment “被任命为……董事长 (be appointed chairman of)” can be regarded as a behavior feature of “有限公司 (corporation)”. With behavior features, we can eliminate incorrect CONs produced by the core feature words and left-border features.
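One possible reading of the behavior-feature check is sketched below in Python: a candidate is kept only if a behavior word occurs within a small window before or after it. The window size, the word list and the example token sequences are assumptions made for illustration; the paper does not specify how the context is searched.

```python
def has_behavior_feature(tokens, start, end, behavior_words, window=2):
    """Return True if a behavior word occurs within `window` tokens
    before or after the candidate span tokens[start:end]."""
    before = tokens[max(0, start - window):start]
    after = tokens[end:end + window]
    return any(tok in behavior_words for tok in before + after)


# Hypothetical behavior words, loosely based on the example in the text.
BEHAVIOR_WORDS = {"被任命为", "董事长"}

# Candidate "平安 有限公司" surrounded by behavior words.
tokens = ["他", "被任命为", "平安", "有限公司", "董事长"]
supported = has_behavior_feature(tokens, 2, 4, BEHAVIOR_WORDS)

# Candidate "公司" with no behavior word in its context.
other = ["这", "家", "公司", "很", "大"]
unsupported = has_behavior_feature(other, 2, 3, BEHAVIOR_WORDS)
```

A real behavior lexicon would have to be collected from data, and a discontinuous pattern such as “被任命为……董事长” may need matching on both sides at once rather than on either side.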
4.2 Debugging Structure Pattern Features
In the previous section, we used behavior features to evaluate and refine CONs directly. In this section, we use debugging structure patterns to find incorrect candidate CONs and eliminate them. By analyzing incorrect CONs, we find that their structures are similar. For example, “我们学校 (our school)” and “他们公司 (their company)” are obviously not CONs; both have the structure “demonstrative pronoun + core feature word”. In the same way, we analyze and summarize the patterns of such incorrect CONs. Table 2 lists some common debugging structure patterns. The theoretical basis for using them as a filter is that a debugging structure pattern cannot form a CON.
Table 2. Examples of Debugging Structure Patterns
Error pattern | Example
demonstrative pronouns + core feature word | 我们学校 (our school)
Some verb + core feature word | 设立委员会 (set up committees)
quantifier distribution + core feature word | 一家公司 (a company)
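The patterns in Table 2 can be applied as a simple filter, sketched here in Python. The sketch treats a candidate as a list of segmented words and uses exact matching against small hand-written word lists; both the lists and the exact-match rule are simplifying assumptions, not the similarity measure used in the paper.

```python
# Hypothetical word lists standing in for the three pattern classes of Table 2.
PRONOUNS = {"我们", "他们", "你们"}
VERBS = {"设立"}
QUANTIFIERS = {"一家", "一个"}


def matches_debugging_pattern(words, core_words):
    """True if the candidate is exactly '<pronoun | verb | quantifier> + core feature word'."""
    if len(words) != 2 or words[1] not in core_words:
        return False
    first = words[0]
    return first in PRONOUNS or first in VERBS or first in QUANTIFIERS


def filter_candidates(candidates, core_words):
    """Drop candidates whose structure matches a debugging structure pattern."""
    return [c for c in candidates if not matches_debugging_pattern(c, core_words)]


core_words = {"公司", "大学", "学校", "委员会"}
candidates = [["我们", "公司"], ["北京", "大学"], ["一家", "公司"], ["设立", "委员会"]]
kept = filter_candidates(candidates, core_words)
```

Only the candidate that does not fit any of the three patterns survives the filter.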
Fig. 2. The Recognition Process of CONs
5 The Process of Recognition Algorithm
The flow of the proposed approach is shown in Fig. 2. The input is Chinese text, which can be articles or sentences, and the output is the recognized CONs. The recognition process consists of the Pre-process Module, the Initial Recognition Module, and the Recognition Evaluation and Refining Module. The major functions of the Pre-process Module are formatting the text and segmenting it into words with part-of-speech tags. In this module, ICTCLAS¹ is used as the default segmentation tool, which achieves an average precision of 98% on Chinese. The pre-processed text is then passed to the Initial Recognition Module, which is detailed in the next section.
5.1 Initial Recognition Module
The Initial Recognition Module works in two steps. First, it automatically determines the right boundaries of CONs with the help of the core feature word library. Second, it uses the left-border rule set to find the left boundaries of CONs. Algorithm 1 details this process, which starts at the beginning of the text and terminates at its end. After that, we obtain the candidate CONs.
5.2 Recognition Evaluation and Refining Module
The evaluation process can also be divided into two steps. First, behavior features are used to assess every candidate CON. If a candidate CON is identified as incorrect, we find a new left-border rule for it and reassess it; if no new left-border rule can be found, the candidate is abandoned. Second, the composition models of all candidate CONs are extracted and compared with the debugging structure patterns. If the composition model of a candidate CON is highly similar to one or more debugging structure patterns, we drop it.
¹ http://ictclas.org/
Algorithm 1. Getting candidate CONs
Input: X = x_1, x_2, ……, x_n; L; R
Output: Candidate CONs
i = 1;
While (x_i != null) {
  If (x_i ∈ R and i > 1) {
    j = i – 1;
    While (x_j ∉ L) j = j – 1;
    get substring (x_{j+1}, x_i); // gets candidate CONs
  }
  i++;
}
Here, X denotes the word sequence of the pre-processed text, L denotes the left-border rule set and R denotes the core feature word library.
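A runnable reading of Algorithm 1 is sketched below in Python. It makes several assumptions that the paper does not state: the left-border rule set L is represented as a plain set of words, the beginning of the text counts as a left border, and a core feature word directly preceded by a left border yields no candidate. The function name `candidate_cons` is illustrative.

```python
def candidate_cons(words, left_border, core_words):
    """Scan the word sequence X; for every core feature word (in R), walk
    left to the nearest left-border word (in L) and emit the words between."""
    candidates = []
    for i, word in enumerate(words):
        if word not in core_words or i == 0:
            continue
        j = i - 1
        # Walk left until a left-border word or the start of the text.
        while j >= 0 and words[j] not in left_border:
            j -= 1
        # Require at least one modifier between the border and the core word.
        if j < i - 1:
            candidates.append("".join(words[j + 1:i + 1]))
    return candidates


L = {"于", "和", ",", "。"}          # illustrative left-border words
R = {"大学", "小学", "公司"}          # illustrative core feature words

simple = candidate_cons(["我", "毕业", "于", "北京", "大学", "。"], L, R)
nested = candidate_cons(["北京", "大学", "附属", "小学"], L, R)
noisy = candidate_cons(["我们", "公司", "刚刚", "成立", "。"], L, R)
```

In this sketch a nested name produces one candidate per core feature word it contains, and a phrase such as “我们公司” is still emitted as a candidate, which is why the later evaluation and refining step is needed.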
6 Experiments
To evaluate our approach, we introduce a closed testing set and an open testing set. We randomly select 1900 sentences, which contain 3289 CONs, from the Chinese People’s Daily corpus as the closed testing set, and randomly extract 400 articles, which contain 3767 CONs, from SINA² as the open testing set. First, we test our approach on the closed testing set to evaluate its performance. Since the testing data set of [11] is the same as ours, we compare the performance of the two approaches. Table 3 shows the details. Here, we use precision (P), recall (R) and F-score (F) as evaluation measures. Comparing the first and second rows of Table 3, the F-score improves by about 16 percentage points after adding the Recognition Evaluation and Refining Module, so this module is important for CON recognition. Comparing the second and third rows, we find that the method based on multiple features is more effective than the method proposed in [11]: P and R are improved by about 3 and 14 percentage points respectively. These results show that the approach presented in this paper is effective.
² http://www.sina.com.cn/
Table 3. The Results on Testing Set
Approach Type | Total | Found | Right | P (%) | R (%) | F (%)
This Paper’s approach without the Recognition Evaluate and Refining in close testing set | 3289 | 5291 | 2958 | 55.9 | 89.9 | 68.9
This paper’s approach with the Recognition Evaluate and Refining in close testing set | 3289 | 3983 | 3107 | 77.7 | 94.5 | 85.2
Approach proposed in [11] in close testing set | 3289 | 3558 | 2651 | 74.5 | 80.6 | 77.4
This paper’s approach in open testing set | 3767 | 4952 | 3530 | 71.3 | 93.4 | 80.86
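The relationship between the count columns and the percentage columns of Table 3 can be sketched as follows. The paper does not state its formulas, so this assumes the standard definitions P = Right / Found, R = Right / Total and F = 2PR / (P + R); the function name is illustrative.

```python
def precision_recall_f(total, found, right):
    """Return (P, R, F) in percent from the Total, Found and Right counts."""
    p = right / found            # share of recognized names that are correct
    r = right / total            # share of true names that were recognized
    f = 2 * p * r / (p + r)      # harmonic mean of P and R
    return 100 * p, 100 * r, 100 * f


# Counts from the row for the approach proposed in [11].
p, r, f = precision_recall_f(total=3289, found=3558, right=2651)
```

Under these assumed definitions, the counts in that row reproduce its reported P, R and F values to within rounding.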
The above experimental results show that the approach proposed in this paper recognizes CONs well. Analyzing the recognition results, we find that the major reason is that our approach is quite effective at recognizing nested CONs, and the test data contain many nested ones. To evaluate the effectiveness of our approach on nested CONs, an extra test is conducted. We divide each testing set into two parts according to whether a CON is nested or not, and test our approach on the subsets that contain only nested CONs. The results are shown in Table 4.
Table 4. The Results of Recognizing Nested CONs
Testing set | P (%) | R (%) | F (%)
Sub closed testing set | 83.31 | 87.54 | 85.73
Sub open testing set | 82.02 | 86.98 | 84.39
As Table 4 indicates, the proposed approach is effective for nested CONs. More specifically, on the two testing data sets, the approach achieves F-scores of 85.73% and 84.39% respectively, outperforming other similar systems. Across these experiments, the proposed approach achieves good performance.
7 Conclusion
The recognition of CONs plays a key role in natural language processing. Though many approaches have been proposed, the results are still not satisfactory. In this paper, the proposed approach is based on multiple features of CONs. We use a core feature word library and a left-border rule set to extract candidate CONs. These candidate CONs are then evaluated and refined according to behavior features and debugging structure patterns of CONs. Experimental results show that our approach performs well, with high precision and recall. They also indicate that our approach is particularly effective for recognizing nested CONs.
Although the proposed approach has proved effective for recognizing CONs, some details can still be improved. For example, it lacks a strategy for recognizing short names of CONs such as “中国平安 (China Peace Life Insurance Co., LTD)” and “北大 (Peking University)”. Most of the incorrect CONs in our experiments are such abbreviated CONs, so we will focus on this problem in the future.
Acknowledgments. This work is supported by a grant from the Shanghai Science and Technology Foundation (No. 10dz1500103, No. 11530700300 and No. 11511504000).
References
1. Luo, Z.-Y., Song, R.: Integrated and fast recognition of proper noun in modern Chinese word segmentation. In: Ji, D.-H. (ed.) Proceedings of International Conference on Chinese Computing, Singapore, pp. 323–328 (2001)
2. Zhang, H.P., Liu, Q.: Automatic Recognition of Chinese Personal Name Based on Role Tagging. Chinese Journal of Computers 27(1) (January 2004)
3. Tan, H., Zheng, J., Liu, K.: Design and realization of Chinese place name Automatic recognition system. Computer Engineering (08) (2002)
4. Lei, J., Zhang, D., Feng, X.: Recognition of Chinese Organization Name Based on Constitution Pattern. In: Proceedings of SWCL 2008 (2008)
5. Zhang, X., Wang, L.: Identification and Analysis of Chinese Organization and Institution Names. Journal of Chinese Information Processing 11(4) (1997)
6. Isozaki, H.: Japanese Named Entity Recognition based on a Simple Rule Generator and Decision Tree Learning. In: Association for Computational Linguistics 39th Annual Meeting and 10th Conference of the European Chapter (2001)
7. Feng, L., Jiao, L.: Chinese Organizations Names Recognition Model Based on the Maximum Entropy. Computer & Digital Engineering 38(12) (2010)
8. Liu, J.: The Arithmetic of Chinese Named Entity Recognition Based on the Improved Hidden Markov Model. Journal of Taiyuan Normal University (Natural Science Edition) 8(1) (2009)
9. Chen, X., Liu, H., Chen, Y.: Chinese organization names recognition based on SVM. Application Research of Computers 25(2) (2008)
10. Huang, D., Li, Z., Wan, R.: Chinese organization name recognition using cascaded model based on SVM and CRF. Journal of Dalian University of Technology 50(5) (2010)
11. Yu, H.-K., Zhang, H.-P., Liu, Q.: Recognition of Chinese Organization Name Based on Role Tagging. In: Advances in Computation of Oriental Languages–Proceedings of the 20th International Conference on Computer Processing of Oriental Languages (2003)