arXiv:1603.06679v1 [cs.CL] 22 Mar 2016

Recursive Neural Conditional Random Fields for Aspect-based Sentiment Analysis

Wenya Wang, Nanyang Technological University, Singapore, [email protected]

Xiaokui Xiao, Nanyang Technological University, Singapore, [email protected]
Daniel Dahlmeier, SAP Research & Innovation, Singapore, [email protected]
Sinno Jialin Pan, Nanyang Technological University, Singapore, [email protected]

Abstract

Aspect-based sentiment analysis has obtained substantial popularity due to its ability to extract useful information from customer reviews. In most cases, aspect terms in a review sentence have strong relations with opinion terms, because an aspect is the target on which an opinion is expressed. Based on this connection, some existing work focuses on designing syntactic rules to double-propagate information between aspect and opinion terms. However, these methods require a large amount of effort and domain knowledge to design precise syntactic rules, and they fail to handle uncertainty. In this paper, we propose a novel joint model that integrates recursive neural networks and conditional random fields into a unified framework for aspect-based sentiment analysis. Our task is to extract aspect and opinion terms/phrases from each review. The proposed model is able to learn high-level discriminative features and to double-propagate information between aspect and opinion terms simultaneously. Furthermore, it can flexibly incorporate linguistic or lexicon features to further boost its performance in terms of information extraction. Experimental results on the SemEval Challenge 2014 dataset show the superiority of our proposed model over several baseline methods as well as the winning systems of the challenge.

1 Introduction

Nowadays, the ability to extract useful information from a huge number of reviews has become a key constituent of decision making, for both companies and customers. Aspect-based sentiment analysis, or fine-grained opinion mining, aims to extract the most important information from reviews, e.g., opinion targets, opinion expressions, target categories, and opinion polarities. This task was first studied in 2004 [Hu and Liu, 2004a; Hu and Liu, 2004b] and has received growing interest since then [Zhang et al., 2010; Qiu et al., 2011; Li et al., 2010]. In aspect-based sentiment analysis, the core task is to extract the aspects or features of a product/service from a review, along with the opinions being expressed. For example,

in a restaurant review "I have to say they have one of the fastest delivery times in the city.", the aspect term is "delivery times", and the opinion term is "fastest", which is positive. Previous work on this task generally falls into one of two approaches. The first approach accumulates aspect terms and opinion terms from a seed collection by exploiting syntactic rules or modification relations between aspects and opinions [Qiu et al., 2011]. For example, if we know that "fastest" is an opinion word, then "delivery times" can be deduced to be an aspect term because "fastest" modifies the following noun phrase. However, this approach relies heavily on hand-coded rules and is restricted to certain part-of-speech tags, e.g., opinion words are restricted to be adjectives. The other approach [Jin and Ho, 2009; Li et al., 2010; Toh and Wang, 2014] focuses on feature engineering over a large pool of resources, including lexicons and annotated corpora, which are fed into a sequence labeling classifier such as Conditional Random Fields (CRFs) [Lafferty et al., 2001] or Hidden Markov Models (HMMs). This approach requires extensive effort to design hand-crafted features, and it only combines features linearly when a CRF/HMM is applied. In another direction, Weiss et al. [2015] proposed to combine the power of deep learning and structured learning for language parsing: a neural network model produces distributed representations, which are then fed into a structured perceptron for training. This shows the potential advantage of combining deep learning with structured prediction; however, in their method, the structured perceptron is not as flexible as graphical models. To overcome the above limitations of existing methods, we propose a novel joint model, namely Recursive Neural Conditional Random Fields (RNCRF). Specifically, an RNCRF consists of two main components. The first component constructs a recursive neural network (RNN; note that throughout this paper, RNN stands for recursive neural network rather than recurrent neural network) over the dependency tree of each sentence. The goal is to learn a high-level feature representation for each word in the context of its sentence, and to make the representation learning for aspect and opinion terms interact through the underlying dependency structure among them. The output of the RNN is then fed into a Conditional Random Field (CRF) to learn a discriminative

mapping from high-level features to labels, i.e., aspects, opinions, or others, because CRFs have proven effective for this kind of sequence tagging problem. We present a joint optimization approach based on maximum likelihood and backpropagation to learn the RNN and CRF components simultaneously. In this way, the label information of aspect and opinion terms can be dually propagated from parameter learning in the CRF to representation learning in the RNN. We conducted extensive experiments on the SemEval Challenge 2014 (task 4) dataset [Pontiki et al., 2014] to verify the superiority of RNCRF over several baseline methods as well as the winning systems of the challenge.

2 Related Work

2.1 Aspect-based Sentiment Analysis

Aspect-based sentiment analysis has been studied for over a decade. Hu and Liu [2004a] proposed to extract product features through association mining and hand-coded rules, and extracted opinion terms by augmenting a seed opinion set with synonyms and antonyms from WordNet. This method can only extract nouns or noun phrases as product features and adjectives as opinions. Qiu et al. [2011] manually defined relations between product features and opinion terms based on syntactic information, and proposed a double propagation method to augment the aspect and opinion sets. Although their model is unsupervised, it heavily depends on predefined rules for extraction and is also restricted to specific POS tags for product features and opinions. Jin and Ho [2009] and Li et al. [2010] modeled the extraction of product features and opinion terms as a sequence tagging problem, and proposed to use HMMs and CRFs to solve it, respectively. These two methods rely on rich hand-crafted features and do not explicitly consider the dual propagation between aspect and opinion terms.

2.2 Deep Learning for Sentiment Analysis

Deep learning has been applied to sentiment analysis because of its ability to learn high-level features. As shown in recent work, deep models, e.g., RNNs, deep autoencoders, and convolutional neural networks, can automatically learn the inherent semantic and syntactic information from data and thus achieve better performance for sentiment analysis [Socher et al., 2011b; Socher et al., 2012; Socher et al., 2013; Glorot et al., 2011; Kalchbrenner et al., 2014; Kim, 2014; Le and Mikolov, 2014]. Sentiment supervision can also be incorporated into word embedding models to learn sentiment-oriented feature vectors for sentence polarity prediction [Tang et al., 2014]. These works generally fall into two categories: sentence-level sentiment polarity prediction and phrase/word-level polarity prediction. Besides these, İrsoy and Cardie [2014] applied a deep recurrent neural network to opinion expression extraction. Nevertheless, there is little work on aspect-based sentiment analysis using deep models. To the best of our knowledge, the most related work to ours is [Liu et al., 2015], which proposed to combine recurrent neural networks and word embeddings for aspect-based sentiment analysis. However, their proposed

model simply uses a standard recurrent neural network on top of different word embeddings, and thus depends heavily on the quality of the word embeddings. In addition, it fails to explicitly model dependency relations or compositionality within the syntactic structure of a sentence. This problem can be addressed by RNNs, which have been used for learning phrase representations [Socher et al., 2010], language or image parsing [Socher et al., 2011a], sentiment analysis [Socher et al., 2013], and question answering [Iyyer et al., 2014]. The tree structure used for RNNs generally takes one of two forms: a constituency tree or a dependency tree. In a constituency tree, all the words lie at the bottom (leaf nodes), each internal node represents a phrase or a constituent of the sentence, and the root node represents the entire sentence [Socher et al., 2010; Socher et al., 2011a; Socher et al., 2012; Socher et al., 2013]. In a dependency tree, each node, including terminal and nonterminal nodes, represents a word, with dependency connections to other nodes [Socher et al., 2014; Iyyer et al., 2014]; the resultant model is known as a dependency-tree RNN (DT-RNN). In general, the advantage of the dependency structure over the constituency structure is its ability to extract word-level representations that take syntactic relations into account while remaining semantically robust. Therefore, in this paper, we adopt the DT-RNN for aspect-based sentiment analysis.

3 Problem Statement

Suppose that we are given a training set of customer reviews in a specific domain, e.g., restaurant or laptop, denoted by S = {s_1, s_2, ..., s_N}, where N is the number of review sentences. For any s_i ∈ S, there might exist a set of aspect terms A_i = {a_i1, a_i2, ..., a_il}, where each a_ij ∈ A_i can be a single word or a sequence of words explicitly expressing some feature of a product/service. Similarly, there could also exist a set of opinion terms O_i = {o_i1, o_i2, ..., o_im}, where each o_ir can be a single word or a sequence of words expressing the subjective sentiment of the comment holder. The task is to learn a classifier that extracts the set of aspect terms A_i and the set of opinion terms O_i from each review sentence s_i ∈ S. This task can be formulated as a sequence tagging problem, so we use the BIO encoding scheme to define labels. Specifically, each review sentence s_i is composed of a sequence of words s_i = {w_i1, w_i2, ..., w_in_i}. Each word w_ip ∈ s_i is labeled with one of the following 5 classes: "BA" (beginning of aspect), "IA" (inside of aspect), "BO" (beginning of opinion), "IO" (inside of opinion), and "O" (others). Let L = {BA, IA, BO, IO, O}. We are also given a test set of review sentences denoted by S′ = {s′_1, s′_2, ..., s′_N′}, where N′ is the number of test reviews. For each test review s′_i ∈ S′, our objective is to predict the class label y′_iq ∈ L for each word w′_iq of the review. Note that a sequence of predictions starting with "BA" and followed by "IA" indicates one aspect term, and similarly for opinion terms.
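To make the BIO scheme concrete, the following minimal sketch (our own illustration, not code from the paper; the helper function and variable names are hypothetical) assigns BIO labels to the tokens of the introduction's example sentence, given its annotated aspect and opinion spans:

```python
# Minimal illustration (not from the paper): mapping annotated spans to BIO labels.
# The sentence and spans below are taken from the example in the introduction.
def bio_labels(tokens, aspects, opinions):
    """Assign one of {BA, IA, BO, IO, O} to every token."""
    labels = ["O"] * len(tokens)
    for spans, begin, inside in ((aspects, "BA", "IA"), (opinions, "BO", "IO")):
        for span in spans:
            start = tokens.index(span[0])  # naive lookup; fine for this illustration
            for offset in range(len(span)):
                labels[start + offset] = begin if offset == 0 else inside
    return labels

tokens = "I have to say they have one of the fastest delivery times in the city .".split()
print(bio_labels(tokens, aspects=[["delivery", "times"]], opinions=[["fastest"]]))
# "fastest" -> BO, "delivery times" -> BA IA, all other tokens -> O
```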

(a) Example of a dependency tree.

(b) Example of a DT-RNN tree structure.

(c) Example of a RNCRF structure.

Figure 1: Examples of dependency tree, DT-RNN structure and RNCRF structure for a review sentence.
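For readers who want to reproduce a tree like the one in Figure 1(a), the following sketch (our own illustration; the paper does not specify which dependency parser it uses, so the parser choice is an assumption) parses the review sentence "I like the food" discussed in Section 4.1 with spaCy. spaCy's lowercase relation labels (nsubj, dobj, det) correspond to the NSUBJ, DOBJ and DET relations referred to below.

```python
# Our own illustration (not from the paper): obtaining a dependency tree with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline, assumed installed
doc = nlp("I like the food")
for token in doc:
    # Each token depends on its head with a typed relation, e.g. nsubj(like, I),
    # det(food, the), dobj(like, food); "like" is the root of the tree.
    print(f"{token.text:>5} --{token.dep_:>6}--> {token.head.text}")
```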

4 Recursive Neural CRFs

As described in Section 1, our proposed model for aspect-based sentiment analysis, RNCRF, consists of two main components. To learn high-level representations for each word in a sentence, we use a DT-RNN. These representations are then aggregated and fed into a CRF, which captures the context information around each word for information extraction.

4.1 Dependency-Tree Recursive Neural Networks

Similar to other deep learning models for NLP, we begin by associating each word w in our vocabulary with a feature vector x ∈ ℝ^d, which corresponds to a column of a word embedding matrix W_e ∈ ℝ^{d×v}, where v is the size of the vocabulary. Our model takes as input the dependency parse trees of the review sentences together with the word embeddings and the corresponding label (from L = {BA, IA, BO, IO, O}, as defined in Section 3) for each word. An example of a dependency parse tree is shown in Figure 1(a), where each edge starts from a parent and points to its dependent with a syntactic relation. Specifically, "I" is the nominal subject (NSUBJ) of "like", "food" is the direct object (DOBJ) of "like", and "the" is the determiner (DET) of "food". "like" is pointed to by ROOT, which means that it does not depend on any other word. In a DT-RNN, each node n in the dependency parse tree of a sentence, including leaf nodes, internal nodes and the root node, is associated with a word w, an input feature vector x_w and a hidden vector h_n ∈ ℝ^d of the same dimension as x_w. Each dependency relation r is associated with a separate matrix W_r ∈ ℝ^{d×d}, which is learned during training. In addition, a common transformation matrix W_v ∈ ℝ^{d×d} is introduced to map the word embedding x_w at node n to its corresponding hidden vector h_n. For a particular dependency tree, the hidden vector h_n is computed from the word embedding x_w at node n using the transformation matrix W_v, and from its children's hidden vectors using the corresponding relation matrices {W_r}'s, where the children are the dependents of the current word in the dependency tree. For instance, given the parse tree shown in Figure 1(a), we first compute the hidden vectors of the leaf nodes associated with "I" and "the" using W_v as follows,

h_I = f(W_v · x_I + b),
h_the = f(W_v · x_the + b),

where f is a non-linear activation function and b is a bias term. In this paper, we adopt tanh(·) as the activation function. Once hidden vectors of all leaf nodes are generated, we

can recursively generate hidden vectors for interior nodes using the corresponding relation matrices W_r and the common transformation matrix W_v as follows,

h_food = f(W_v · x_food + W_DET · h_the + b),
h_like = f(W_v · x_like + W_DOBJ · h_food + W_NSUBJ · h_I + b).

The resultant DT-RNN is shown in Figure 1(b). In general, the hidden vector for any node n associated with a word vector x_w can be computed as

h_n = f( W_v · x_w + b + Σ_{k ∈ K_n} W_{r_nk} · h_k ),    (1)

where K_n denotes the set of children of node n, r_nk denotes the dependency relation between node n and its child node k, and h_k is the hidden vector of the child node k, similar to [Iyyer et al., 2014].
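As an informal sketch of Eq. (1) (our own simplified code, not the authors' implementation), the hidden vectors can be computed bottom-up over the dependency tree; the parameter shapes follow the notation above, and the tree for "I like the food" is hard-coded for illustration:

```python
# A simplified sketch of the DT-RNN recursion in Eq. (1); not the authors' code.
# Word vectors x_w and hidden vectors h_n are d-dimensional, W_v and each relation
# matrix W_r are d x d, and b is a d-dimensional bias.
import numpy as np

d = 4
rng = np.random.default_rng(0)
W_v = rng.normal(scale=0.1, size=(d, d))
W_r = {rel: rng.normal(scale=0.1, size=(d, d)) for rel in ("NSUBJ", "DOBJ", "DET")}
b = np.zeros(d)
x = {w: rng.normal(size=d) for w in ("I", "like", "the", "food")}

# Dependency tree of "I like the food": children of each node with their relations.
children = {"like": [("NSUBJ", "I"), ("DOBJ", "food")], "food": [("DET", "the")]}

def hidden(word):
    # h_n = tanh(W_v x_w + b + sum_k W_{r_nk} h_k), computed recursively over children.
    total = W_v @ x[word] + b
    for rel, child in children.get(word, []):
        total += W_r[rel] @ hidden(child)
    return np.tanh(total)

h_root = hidden("like")  # hidden vector of the ROOT word
print(h_root.shape)      # (4,)
```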

4.2 Training with Conditional Random Fields

A CRF is a discriminative graphical model for structured prediction, which directly models the conditional distribution of the output given the input. In RNCRF, we feed the output of the DT-RNN, i.e., the hidden representation of each word in a sentence, into a CRF. Parameter updates for RNCRF are carried out successively from top to bottom, by backpropagating errors through the CRF to the hidden layer of the RNN, and then to the subsequent RNN layers (including the word embeddings) using backpropagation through structure (BPTS) [Goller and Kuchler, 1996]. Formally, for each sentence s_i, we denote the input to the CRF by h_i, which is generated by the DT-RNN. Here h_i is a matrix whose columns are the hidden vectors {h_i1, h_i2, ..., h_in_i} representing the sequence of words {w_i1, w_i2, ..., w_in_i} in sentence s_i. The model computes a structured output y_i = {y_i1, y_i2, ..., y_in_i} ∈ Y, where Y is the set of possible combinations of labels from the label set L. The entire structure can be represented by an undirected graph G = (V, E) with cliques c ∈ C. In the linear-chain CRF employed in this paper, there are two types of cliques: unary cliques (U) representing input-output connections, and pairwise cliques (P) representing connections between adjacent outputs, as shown in Figure 1(c). During inference, the model outputs ŷ with the maximum conditional probability p(y|h). (We drop the subscript i here for simplicity.)

Figure 2: An example of computing the input-output potential for the second position, "like".

This distribution is computed from the potential outputs of the cliques:

p(y|h) = (1/Z(h)) ∏_{c ∈ C} ψ_c(h, y_c),    (2)

where Z(h) is the normalization term and ψ_c(h, y_c) is the potential of clique c, computed as

ψ_c(h, y_c) = exp ⟨W_c, F(h, y_c)⟩,    (3)

where the RHS is the exponential of a linear combination of the feature vector F(h, y_c) for clique c, and the weight vector W_c is tied for unary and pairwise cliques. We also incorporate a context window of size 2T + 1 when computing the unary potentials. Through this formulation, the potential of the unary clique at node k can be written as

ψ_U(h, y_k) = exp( (W_0)_{y_k} · h_k + Σ_{t=1}^{T} (W_{+t})_{y_k} · h_{k+t} + Σ_{t=1}^{T} (W_{−t})_{y_k} · h_{k−t} ),    (4)

where W_0, W_{+t} and W_{−t} are the weight matrices of the CRF for the current position, the t-th position to the right, and the t-th position to the left within the context window, respectively. The subscript y_k selects the corresponding row of the weight matrix. For instance, Figure 2 shows an example with window size 3: at the second position, the input features for "like" are composed of the hidden vectors at position 1 (h_I), position 2 (h_like) and position 3 (h_the). Therefore, the conditional distribution for the entire sequence y in Figure 1(c) can be calculated as

p(y|h) = (1/Z(h)) exp( Σ_{k=1}^{4} (W_0)_{y_k} · h_k + Σ_{k=2}^{4} (W_{−1})_{y_k} · h_{k−1} + Σ_{k=1}^{3} (W_{+1})_{y_k} · h_{k+1} + Σ_{k=1}^{3} V_{y_k, y_{k+1}} ),

with the matrix V representing the pairwise state transition scores. For parameter updates, we first denote g_c(h, y_c) = ⟨W_c, F(h, y_c)⟩ as the log-potential of clique c. Under the maximum-likelihood objective, updates are first computed for the unary weight matrices Θ_U = {W_0, W_{+t}, W_{−t}} and the pairwise weight matrix V by applying the chain rule to the log-potentials. The gradient for Θ_U is given below (the updates for V are obtained similarly through g_P(y′_k, y′_{k+1})):

∂(−log p(y|h)) / ∂g_U(h, y′_k) = −(1_{y_k = y′_k} − p(y′_k|h)),    (5)

ΔΘ_U = [∂(−log p(y|h)) / ∂g_U(h, y′_k)] · [∂g_U(h, y′_k) / ∂Θ_U].    (6)

Here y′_k represents a possible label configuration of node k. The parameters of the DT-RNN, denoted by Θ_RNN = {W_v, W_r, W_e, b}, are updated subsequently by applying the chain rule with (5) through BPTS:

Δh_root = [∂(−log p(y|h)) / ∂g_U(h, y′_root)] · [∂g_U(h, y′_root) / ∂h_root],    (7)

Δh_{k ≠ root} = [∂(−log p(y|h)) / ∂g_U(h, y′_k)] · [∂g_U(h, y′_k) / ∂h_k] + Δh_par(k) · [∂h_par(k) / ∂h_k],    (8)

ΔΘ_RNN = Σ_{k=1}^{K} [∂(−log p(y|h)) / ∂h_k] · [∂h_k / ∂Θ_RNN],    (9)

where h_root is the hidden vector of the word pointed to by ROOT in the corresponding DT-RNN. Since this word is the topmost node in the tree, it only inherits error from the CRF output. h_par(k) is the hidden vector of the parent of node k in the DT-RNN; hence lower nodes receive error both from the CRF output and from the error propagated down from their parent node. The parameters within the RNN, Θ_RNN, are updated by applying the chain rule with respect to the updates of the hidden vectors and aggregating over all associated nodes, as shown in (9). The overall procedure of RNCRF is summarized in Algorithm 1.

Algorithm 1 Recursive Neural Conditional Random Fields
Input: a set of customer review sentences S = {s_1, s_2, ..., s_N}, the length d of the input feature representation for each word, and the window size T for the CRF
Output: a set of model parameters Θ = {Θ_RNN, Θ_U, V}
Initialization: initialize W_e using word2vec; initialize W_v and the {W_r}'s randomly with a uniform distribution over [−√6/√(2d+1), √6/√(2d+1)]; initialize W_0, the {W_{+t}}'s, the {W_{−t}}'s, V, and b with all 0's
for each sentence s_i do
  1: Use the DT-RNN (1) to generate h_i
  2: Compute the conditional distribution p(y_i|h_i) using (2)
  3: Use the backpropagation algorithm to update the parameters Θ through (5)-(9)
end for
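As an informal illustration of the windowed unary score in Eq. (4) (our own sketch, not the authors' implementation; all names and dimensions are assumptions), the following code scores every label for one position from the surrounding hidden vectors, with T = 1 so the window covers 2T + 1 = 3 positions:

```python
# A small sketch of the windowed unary log-potential of Eq. (4); not the authors' code.
# W0, W_plus[t], W_minus[t] are |L| x d matrices; h is the d x n matrix of hidden
# vectors produced by the DT-RNN for an n-word sentence.
import numpy as np

num_labels, d, n, T = 5, 4, 4, 1           # 5 BIO labels, toy dimensions, window 2T+1 = 3
rng = np.random.default_rng(1)
W0 = rng.normal(size=(num_labels, d))
W_plus = {1: rng.normal(size=(num_labels, d))}
W_minus = {1: rng.normal(size=(num_labels, d))}
h = rng.normal(size=(d, n))                # columns h_1 ... h_n from the DT-RNN

def unary_scores(k):
    """Log-potential g_U(h, y_k) for every label y_k at position k (0-indexed)."""
    scores = W0 @ h[:, k]
    for t, W in W_plus.items():
        if k + t < n:
            scores = scores + W @ h[:, k + t]
    for t, W in W_minus.items():
        if k - t >= 0:
            scores = scores + W @ h[:, k - t]
    return scores                          # exponentiating gives psi_U(h, y_k)

print(unary_scores(1))                     # scores for the 2nd word, e.g. "like"
```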

4.3 Discussion

Note that the best performing system [Toh and Wang, 2014] in the SemEval challenge 2014 for aspect-based sentiment analysis employed CRFs with extensive hand-crafted features, including features induced from dependency trees. However, their experiments showed that adding the features induced from dependency relations does not improve the F1 score and may even hurt the performance slightly. This indicates the difficulty of incorporating dependency structure explicitly as input features, which motivates the formulation of our model,

that is, using the DT-RNN to encode dependency-related features for each word. The most important advantage of RNCRF is its ability to learn the underlying dual propagation between aspect and opinion terms from the tree structure itself. Consider Figure 1(c), where the aspect term is "food" and the opinion expression is "like". In the dependency tree, the word "food" depends on "like" with the relation DOBJ. During the forward computation, RNCRF computes the hidden vector h_like for "like" from h_food, so the prediction for "like" is affected by h_food; this is one direction of propagation, from "food" to "like". During backpropagation, the error on the word "like" is propagated back in a top-down manner, which in turn revises the representation h_food; this is the propagation in the other direction, from "like" to "food". Therefore, the dependency structure, together with the learning approach, helps to enforce the dual propagation of aspect-opinion pairs as long as a dependency relation exists between them, either directly or indirectly.

4.4 Incorporation of Linguistic/Lexicon Features

The proposed RNCRF is an end-to-end model in which feature engineering is not necessary. However, linguistic or lexicon features can be flexibly incorporated into RNCRF to further boost its performance through "light" feature engineering, such as adding POS tags or name-list (gazetteer) based features. For the name-list based features, we simply generate a set of name lists from the training data based on the frequency of aspect terms. When adding linguistic/lexicon features to RNCRF, we append them to the hidden vector of each word, but keep their values fixed during training, only updating the neural inputs and the CRF weights as described in Section 4.2, as sketched below. As will be shown in the next section, RNCRF without any hand-crafted features already slightly outperforms the best performing systems, which involve heavy feature engineering, and RNCRF with POS tag and name-list based features achieves even better performance.
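A minimal sketch of this feature augmentation (our own illustration; the function names, the partial POS index, and the dimensions are assumptions, not the authors' code): the fixed discrete features are simply concatenated to each hidden vector before the CRF layer.

```python
# Our own sketch of appending fixed linguistic/lexicon features to the DT-RNN
# hidden vector of one word; not the authors' implementation.
import numpy as np

NUM_POS_TAGS = 15                               # the paper uses 15 universal POS categories
POS_INDEX = {"NOUN": 0, "VERB": 1, "ADJ": 2}    # hypothetical partial mapping

def augment(h_word, pos_tag, in_freq_namelist, in_prob_namelist):
    """Concatenate a one-hot POS vector and a 2-dim binary name-list indicator."""
    pos_vec = np.zeros(NUM_POS_TAGS)
    pos_vec[POS_INDEX[pos_tag]] = 1.0
    namelist_vec = np.array([float(in_freq_namelist), float(in_prob_namelist)])
    # The appended values stay fixed during training; only the neural inputs
    # and the CRF weights are updated.
    return np.concatenate([h_word, pos_vec, namelist_vec])

h_food = np.random.randn(300)                   # hidden vector from the DT-RNN (d = 300)
x = augment(h_food, "NOUN", in_freq_namelist=True, in_prob_namelist=False)
print(x.shape)                                  # (317,) = 300 + 15 + 2
```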

5 Experiments

5.1 Dataset

We evaluated our model on the two-domain dataset from SemEval Challenge 2014 task 4, using the same evaluation metric as the challenge. The dataset consists of restaurant reviews and laptop reviews; each domain contains more than 3,000 sentences for training and 800 sentences for testing. A detailed description of the dataset is given in Table 1. Note that the original dataset only includes manually annotated labels for aspect terms but not for opinion terms. To facilitate our experiments, we manually annotated the opinion terms for each sentence ourselves.

Domain       Training   Test    Total
Restaurant   3,041      800     3,841
Laptop       3,045      800     3,845
Total        6,086      1,600   7,686

Table 1: Dataset description in terms of number of sentences from SemEval Challenge 2014 task 4

5.2 Experimental Setup

For word vector initialization, we trained word embeddings with word2vec on the Yelp Challenge dataset2 for the restaurant domain and on the Amazon reviews3 [McAuley et al., 2015] for the laptop domain. The Yelp dataset contains 2.2M restaurant reviews with a 54K vocabulary. From the Amazon reviews, we only extracted the electronics domain, which contains 1M reviews with a 590K vocabulary. We varied the dimension of the word vectors and chose 300 for both domains in the comparison experiments; to study how sensitive the overall performance of RNCRF is to this dimension, we also conducted experiments with varying word vector dimensions. For the CRF, we implemented a linear-chain CRF using CRFSuite [Okazaki, 2007] and modified it so that it can be combined with the DT-RNN. For the DT-RNN, we used mini-batch stochastic gradient descent (SGD) with batch size 25, and adaptive gradient descent (AdaGrad) to adapt the learning rate, with an initial learning rate of 0.02. Pre-training of the DT-RNN takes 4 epochs for the restaurant domain and 5 epochs for the laptop domain. For RNCRF, we used SGD with a decaying learning rate, also initialized to 0.02. We tried context window sizes between 3 and 5 and used 3 for the final results. The dimension of the word embeddings is 300, which is also the size of the hidden vectors. All hyperparameters were chosen by cross-validation.

As discussed in Section 4.4, linguistic or lexicon features can easily be incorporated into RNCRF. Here, we generated only two simple linguistic/lexicon features, based on a name list and POS tags, to show that incorporating additional features can further improve the performance of RNCRF. Following [Toh and Wang, 2014], we extracted two name lists from the training data for each domain, one containing high-frequency aspect terms and the other containing high-probability aspect words. These two lists are used to construct two lexicon features, i.e., a 2-dimensional binary vector: if a word is in a list, the corresponding value is 1, and otherwise 0. For POS tags, we used the Stanford POS tagger [Toutanova et al., 2003] and converted the tags to universal POS tags, which comprise 15 categories; with one-hot encoding this yields 15 POS tag based features. We denote by RNCRF+NL+POS the proposed model with both the POS tag and name-list based features.

2 http://www.yelp.com/dataset_challenge
3 http://jmcauley.ucsd.edu/data/amazon/links.html
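As a reminder of how AdaGrad adapts the learning rate per parameter (a generic sketch, not the authors' code; the 0.02 initial rate matches the setting above, everything else is illustrative):

```python
# Generic AdaGrad update rule, as used for pre-training the DT-RNN; our own sketch,
# not the authors' implementation.
import numpy as np

class AdaGrad:
    def __init__(self, shape, lr=0.02, eps=1e-8):
        self.lr = lr                        # initial learning rate from the paper
        self.eps = eps                      # small constant for numerical stability
        self.sq_grad_sum = np.zeros(shape)

    def step(self, param, grad):
        # Accumulate squared gradients and scale each coordinate's step accordingly.
        self.sq_grad_sum += grad ** 2
        return param - self.lr * grad / (np.sqrt(self.sq_grad_sum) + self.eps)

W = np.zeros((3, 3))
opt = AdaGrad(W.shape)
W = opt.step(W, grad=np.ones((3, 3)))       # one toy update
print(W)
```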

5.3 Experimental Results

We first compare our proposed model with several baselines:
• CRF-1: a linear-chain CRF with standard linguistic features, including the word string, stylistics (capitalization, digits), the POS tag, context strings (within a certain window size), and context POS tags.
• CRF-2: a linear-chain CRF with both the standard linguistic features and dependency information, including the head word and the dependency relations with the parent token and child tokens.
• LSTM: a recurrent neural network built on top of word embeddings trained with word2vec [Liu et al., 2015]. The model is tested with the same word embeddings as ours. We tried different hidden layer dimensions (50, 100, 150, 200) and obtained the best result with size 50. We followed the same training settings reported in [Liu et al., 2015].
• SemEval-1, SemEval-2, SemEval-3: the top three winning systems of the SemEval challenge 2014 (task 4).
• RNCRF-O: the proposed RNCRF model trained without opinion term labels.

Models          Restaurant            Laptop
                Aspect    Opinion     Aspect    Opinion
CRF-1           77.00     78.95       66.21     71.78
CRF-2           78.37     78.65       68.35     70.05
LSTM            80.20     80.15       69.67     72.76
SemEval-1       84.01     -           74.55     -
SemEval-2       83.98     -           73.78     -
SemEval-3       80.18     -           70.40     -
RNCRF-O         82.00     -           72.24     -
RNCRF           84.05     80.93       74.66     73.72
RNCRF+NL+POS    84.73     81.48       78.09     76.89

Table 2: Comparison results with baselines in F1 scores. Underlined scores are the top three results in the challenge.

The comparison results are shown in Table 2 for both the restaurant domain and the laptop domain. Note that we provided the same annotated dataset (with both aspect and opinion labels for training) to CRF-1, CRF-2 and LSTM. It is clear that our proposed model RNCRF achieves superior performance compared to all the baseline models. The performance is even better when simple linguistic/lexicon features are added, i.e., RNCRF+NL+POS, with 0.72% and 3.54% absolute improvement over the best system in the challenge for aspect extraction in the restaurant domain and the laptop domain, respectively. This shows that a combination of high-level continuous features and discrete linguistic/lexicon features helps to boost performance. Though CRFs usually show promising results for sequence tagging problems, they fail to achieve comparable performance when extensive features are lacking (e.g., CRF-1). By adding dependency information explicitly to the CRF, the performance of CRF-2 improves slightly over CRF-1 for aspect extraction. But when dependency structure is incorporated into deep models (e.g., RNCRF and RNCRF+NL+POS), the effect becomes most evident, with more than 7% improvement for aspect extraction and 2% for opinion extraction. By removing the labels for opinion terms, RNCRF-O produces inferior results, because the effect of dual propagation of aspect-opinion pairs disappears in the absence of opinion labels. This verifies our previous assumption that the DT-RNN can learn the interactive effects between aspects and opinions. LSTM has shown comparable results for aspect extraction [Liu et al., 2015]. However, that work used well-pretrained word embeddings obtained from a large corpus and extensive external resources, e.g., chunking and NER. To compare their model with RNCRF, we re-implemented LSTM with the same word embedding strategy as ours. The results show that our model outperforms LSTM in aspect extraction by 3.85% and 4.99% for the restaurant domain and the laptop domain, respectively.

To test the impact of the two components of RNCRF, we conducted another experiment with different model settings:
• DT-RNN+SoftMax: instead of using a CRF as the output layer, we use a softmax classifier on top of the DT-RNN.
• CRF+word2vec: a linear-chain CRF with word embeddings as the only input features, without the DT-RNN.
• RNCRF+POS: RNCRF with only the POS tag features incorporated.
• RNCRF+NL: RNCRF with only the name-list based features incorporated.
As before, both aspect and opinion term labels are provided for training each of the above models.

Models           Restaurant            Laptop
                 Aspect    Opinion     Aspect    Opinion
DT-RNN+SoftMax   72.45     69.76       66.11     64.66
CRF+word2vec     82.57     78.83       63.62     56.96
RNCRF            84.05     80.93       74.66     73.72
RNCRF+POS        84.08     81.48       75.66     76.89
RNCRF+NL         84.24     81.22       76.71     76.49
RNCRF+NL+POS     84.73     81.23       78.09     75.97

Table 3: Comparison results for different model settings.

The results are shown in Table 3. Firstly, RNCRF achieves much better results than DT-RNN+SoftMax (11.60% improvement in the restaurant domain and 8.55% in the laptop domain for aspect extraction). DT-RNN+SoftMax performs worse because it fails to fully exploit context information for sequence labeling, which shows the necessity of the CRF component for aspect-based sentiment analysis. Secondly, RNCRF outperforms CRF+word2vec, which demonstrates the importance of the DT-RNN structure for modeling the interactions between aspects and opinions. Hence, the combination of DT-RNN and CRF inherits the advantages of both models. Moreover, by adding the hand-crafted features separately, we observe that the name-list based features are more effective than the POS tag features for aspect extraction, whereas the opposite holds for opinion extraction. This might be explained by the fact that the name-list based features usually contain informative evidence for aspect terms in the training set, while these features are not propagated to opinion terms because their values are fixed during training. The best results for aspect extraction in both domains are achieved by RNCRF+NL+POS, and the best results for opinion extraction by RNCRF+POS. Nevertheless, the RNCRF model itself already achieves promising results.

Besides the comparison experiments, we also conducted a sensitivity test for our proposed model with respect to the word vector dimension, testing dimensions ranging from 25 to 400 in increments of 25. The sensitivity plots are shown in Figure 3. The performance for aspect extraction is smooth across different vector lengths for both domains. The best result appears at dimension 325 for the restaurant domain, but the performance reaches a comparable and stable level after dimension 100.


For the laptop domain, the best result is obtained with dimension 300, with relatively small variations overall. For opinion extraction, the performance reaches a good level after dimension 75 for the restaurant domain and 125 for the laptop domain, albeit with small fluctuations. This demonstrates the stability and robustness of our model.

Figure 3: Sensitivity studies on the dimension of word embedding vectors: (a) restaurant domain; (b) laptop domain.

6 Conclusion

We have presented a joint model, RNCRF, that achieves state-of-the-art performance for aspect-based sentiment analysis on a benchmark dataset. With the help of the DT-RNN, high-level features can be learned that encode the underlying dual propagation of aspect-opinion pairs. The hybrid model combines the DT-RNN with a CRF and is trained jointly, which boosts performance by combining the advantages of deep models and discriminative graphical models. Unlike previous rule-based approaches, the proposed model is more flexible, because aspect and opinion terms are not restricted to certain observed relations and POS tags. Compared to feature engineering methods with CRFs, the proposed model saves considerable effort in composing features, and it is able to extract higher-level features obtained from non-linear transformations.

References

[Glorot et al., 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pages 97–110, 2011. [Goller and Kuchler, 1996] C. Goller and A. Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In ICNN, pages 347–352, 1996. [Hu and Liu, 2004a] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In KDD, pages 168–177, 2004. [Hu and Liu, 2004b] Minqing Hu and Bing Liu. Mining opinion features in customer reviews. In AAAI, pages 755–760, 2004. [İrsoy and Cardie, 2014] Ozan İrsoy and Claire Cardie. Opinion mining with deep recurrent neural networks. In EMNLP, pages 720–728, 2014. [Iyyer et al., 2014] Mohit Iyyer, Jordan L. Boyd-Graber, Leonardo Max Batista Claudino, Richard Socher, and Hal Daumé III. A neural network for factoid question answering over paragraphs. In EMNLP, pages 633–644, 2014.

[Jin and Ho, 2009] Wei Jin and Hung Hay Ho. A novel lexicalized hmm-based learning framework for web opinion mining. In ICML, pages 465–472, 2009. [Kalchbrenner et al., 2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In ACL, pages 655–665, 2014. [Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, pages 1746–1751, 2014. [Lafferty et al., 2001] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001. [Le and Mikolov, 2014] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014. [Li et al., 2010] Fangtao Li, Chao Han, Minlie Huang, Xiaoyan Zhu, Ying-Ju Xia, Shu Zhang, and Hao Yu. Structure-aware review mining and summarization. In COLING, pages 653–661, 2010. [Liu et al., 2015] Pengfei Liu, Shafiq Joty, and Helen Meng. Fine-grained opinion mining with recurrent neural networks and word embeddings. In EMNLP, pages 1433– 1443, 2015. [McAuley et al., 2015] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-based recommendations on styles and substitutes. In SIGIR, pages 43–52, 2015. [Okazaki, 2007] Naoaki Okazaki. Crfsuite: a fast implementation of conditional random fields (crfs), 2007. [Pontiki et al., 2014] Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. Semeval-2014 task 4: Aspect based sentiment analysis. In SemEval, pages 27–35, 2014. [Qiu et al., 2011] Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. Opinion word expansion and target extraction through double propagation. Comput. Linguist., 37(1):9– 27, 2011. [Socher et al., 2010] Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks. pages 1–9, 2010. [Socher et al., 2011a] Richard Socher, Cliff C. Lin, Andrew Y. Ng, and Christopher D. Manning. Parsing natural scenes and natural language with recursive neural networks. In ICML, 2011.

[Socher et al., 2011b] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In EMNLP, 2011. [Socher et al., 2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic Compositionality Through Recursive Matrix-Vector Spaces. In EMNLP, 2012. [Socher et al., 2013] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642, 2013. [Socher et al., 2014] Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. Grounded compositional semantics for finding and describing images with sentences. TACL, 2:207–218, 2014. [Tang et al., 2014] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning sentiment-specific word embedding for Twitter sentiment classification. In ACL, pages 1555–1565, 2014. [Toh and Wang, 2014] Zhiqiang Toh and Wenting Wang. DLIREC: Aspect term extraction and term polarity classification system. In SemEval, pages 235–240, 2014. [Toutanova et al., 2003] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL, pages 173–180, 2003. [Weiss et al., 2015] David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. Structured training for neural network transition-based parsing. In ACL-IJCNLP, pages 323–333, July 2015. [Zhang et al., 2010] Lei Zhang, Bing Liu, Suk Hwan Lim, and Eamonn O'Brien-Strain. Extracting and ranking product features in opinion documents. In COLING, pages 1462–1470, 2010.