Variational Neural Discourse Relation Recognizer
Biao Zhang^{1,2}, Deyi Xiong^2 and Jinsong Su^1
^1 Xiamen University, Xiamen, China 361005
^2 Soochow University, Suzhou, China 215006
[email protected], [email protected], [email protected]

Abstract

Implicit discourse relation recognition is a crucial component for automatic discourse-level analysis and natural language understanding. Previous studies exploit discriminative models that are built on either powerful manual features or deep discourse representations. In this paper, instead, we explore generative models and propose a variational neural discourse relation recognizer. We refer to this model as VIRILE. VIRILE establishes a directed probabilistic model with a latent continuous variable that generates both a discourse and the relation between the two arguments of the discourse. In order to perform efficient inference and learning, we introduce a neural discourse relation model to approximate the posterior of the latent variable, and employ this approximated posterior to optimize a reparameterized variational lower bound. This allows VIRILE to be trained with standard stochastic gradient methods. Experiments on the benchmark data set show that VIRILE achieves competitive results against state-of-the-art baselines.
1 Introduction
Discourse relations characterize the internal structure and logical relations of a coherent text. Automatically identifying these relations not only plays an important role in discourse comprehension and generation, but also has wide applications in many other relevant natural language processing tasks, such as text summarization [Yoshida et al., 2014], conversation [Higashinaka et al., 2014], question answering [Verberne et al., 2007] and information extraction [Cimiano et al., 2005]. Generally, discourse relations can be divided into two categories, explicit and implicit, which can be illustrated by the following example:

The company was disappointed by the ruling. (because) The obligation is totally unwarranted.

With the discourse connective because, these two sentences display an explicit discourse relation CONTINGENCY, which can be inferred easily. Once this discourse connective is removed, however, the discourse relation becomes implicit and
difficult to recognize, because almost no surface information in these two sentences can signal this relation. For successful recognition of this relation, on the contrary, we need to understand the deep semantic correlation between disappointed and obligation in the two sentences above. Although explicit discourse relation recognition (DRR) has made great progress [Miltsakaki et al., 2005; Pitler et al., 2008], implicit DRR still remains a serious challenge due to the difficulty of semantic analysis.

Conventional approaches to implicit DRR often treat relation recognition as a classification problem, where discourse arguments and relations are regarded as the inputs and outputs respectively. Generally, these methods first generate a representation for a discourse, denoted as x^1 (e.g., manual features in SVM-based recognition [Pitler et al., 2009; Lin et al., 2009] or sentence embeddings in neural network-based recognition [Ji and Eisenstein, 2015; Zhang et al., 2015]), and then directly model the conditional probability of the corresponding discourse relation y given x, i.e. p(y|x). In spite of their success, these discriminative approaches rely heavily on the goodness of the discourse representation x. Sophisticated and good representations of a discourse, however, may make models suffer from overfitting, as we do not have large-scale balanced data.

^1 Unless otherwise specified, all the variables in the paper, e.g., x, y, z, are multivariate. But for notational convenience, we treat them as univariate variables in most cases. Additionally, we use bold symbols to denote variables and plain symbols to denote values.

Instead, we assume that there is a latent continuous variable z from an underlying semantic space. It is this latent variable that generates both discourse arguments and the corresponding relation, i.e. p(x, y|z). The latent variable enables us to jointly model discourse arguments and their relations, rather than conditionally modeling y on x. However, the incorporation of the latent variable makes the modeling difficult, due to the following three aspects: 1) the posterior distribution of the latent continuous variable is intractable; 2) a relatively simple approximation to the posterior, e.g. the mean-field approach, may fail to capture the true posterior of the latent variable; 3) a complicated approximation for the posterior will make the inference and learning inefficient.

Inspired by Kingma and Welling [2014] as well as Rezende et al. [2014], who introduce a variational neural inference model for the intractable posterior via optimizing a reparameterized variational lower bound, we propose a VarIational neuRal dIscourse reLation rEcognizer (VIRILE) with a latent continuous variable for implicit DRR in this paper. The key idea behind VIRILE is that although the posterior distribution is intractable, we can approximate it via a deep neural network.

Figure 1: Illustration of the directed graph model of VIRILE. Solid lines denote the generative model pθ(z)pθ(x|z)pθ(y|z); dashed lines denote the variational approximation qφ(z|x) to the intractable posterior pθ(z|x) used for inference. The variational parameters φ are learned jointly with the generative model parameters θ.

Figure 1 illustrates the graph structure of VIRILE. Specifically, there are three essential components:
• neural discourse recognizer: Since a discourse x and the corresponding relation y are independent given the latent variable z (as shown by the solid lines), we can formulate the generation of x and y from z as pθ(x, y|z) = pθ(x|z)pθ(y|z). The two conditional probabilities on the right-hand side are modeled via deep neural networks in our neural discourse recognizer (see Section 4.1).
• neural posterior approximator: VIRILE assumes that the latent variable can be inferred from discourse arguments x (as shown by the dashed lines). In order to infer the latent variable, we employ a deep neural network qφ(z|x) to approximate the intractable posterior (see Section 4.2), which makes the inference procedure efficient.
• variational reparameterization: we introduce a reparameterization technique to bridge the gap between the above-mentioned components (see Section 4.3). This allows us to naturally use standard stochastic gradient ascent techniques for optimization (see Section 4.4).

The main contributions of our work lie in the following two aspects. 1) We exploit a generative graphical model for implicit DRR. To the best of our knowledge, this has never been investigated before. 2) We develop a neural recognizer and a neural posterior approximator specifically for implicit DRR, which enables both the recognition and the inference to be efficient.

We conduct a series of experiments for English implicit DRR on the PDTB-style corpus to evaluate the effectiveness of our proposed VIRILE model. Experiment results show that our variational model achieves competitive results against several strong baselines in terms of F1 score. Extensive analysis of the variational lower bound further reveals that our model can indeed fit the data set with respect to discourse arguments and relations.

2 Related Work
There are two lines of research related to our work: implicit discourse relation recognition and variational neural models, which we describe in turn.

Implicit Discourse Relation Recognition. Thanks to the release of the Penn Discourse Treebank [Prasad et al., 2008] corpus, constantly increasing efforts have been made on implicit DRR. Upon this corpus, Pitler et al. [2009] exploit several linguistically informed features, such as polarity tags, modality and lexical features. Lin et al. [2009] further incorporate context words, word pairs as well as discourse parse information into their classifier. Following this direction, several more powerful features have been exploited: entities [Louis et al., 2010], word embeddings [Braud and Denis, 2015], Brown cluster pairs and co-reference patterns [Rutherford and Xue, 2014]. With these features, Park and Cardie [2012] perform feature set optimization for better feature combination. Different from feature engineering, predicting discourse connectives can indirectly help relation classification [Zhou et al., 2010; Patterson and Kehler, 2013]. In addition, selecting explicit discourse instances that are similar to the implicit ones can enrich the training corpus for implicit DRR and yields improvements [Wang et al., 2012; Lan et al., 2013; Braud and Denis, 2014; Fisher and Simmons, 2015; Rutherford and Xue, 2015]. Very recently, neural network models have also been used for implicit DRR due to their capability for representation learning [Ji and Eisenstein, 2015; Zhang et al., 2015]. Despite their success, most of these approaches are discriminative models, leaving the field of generative models for implicit DRR a relatively uninvestigated area.

Variational Neural Model. In the presence of continuous latent variables with intractable posterior distributions, efficient methods for inference and learning in directed probabilistic models are required. Kingma and Welling [2014] as well as Rezende et al. [2014] introduce variational neural networks that employ an approximate inference model for the intractable posterior and a reparameterized variational lower bound for stochastic gradient optimization. Kingma et al. [2014] revisit the approach to semi-supervised learning with generative models and further develop new models that allow effective generalization from a small labeled dataset to a large unlabeled dataset. Chung et al. [2015] incorporate latent variables into the hidden state of a recurrent neural network, while Gregor et al. [2015] combine a novel spatial attention mechanism that mimics the foveation of the human eye with a sequential variational autoencoding framework that allows the iterative construction of complex images. We follow the spirit of these variational models, but focus on adapting and utilizing them for implicit DRR, which, to the best of our knowledge, is the first attempt in this respect.
3 Background: Variational Autoencoder
In this section, we briefly review the variational autoencoder (VAE) [Kingma and Welling, 2014; Rezende et al., 2014], one of the most classical variational neural models, which forms the basis of our model.

Different from conventional neural autoencoders, the VAE is a generative model that can be regarded as a regularized version of the standard autoencoder. The VAE significantly changes the autoencoder architecture by introducing a latent random variable z, designed to capture the variations in the observed variable x. With the incorporation of z, the joint distribution is formulated as follows:

pθ(x, z) = pθ(x|z) pθ(z)   (1)

where pθ(z) is the prior over the latent variable, which is usually equipped with a simple Gaussian distribution, and pθ(x|z) is the conditional distribution that models the probability of x given the latent variable z. Typically, the VAE parameterizes pθ(x|z) with a highly non-linear but flexible function approximator such as a neural network.

Although introducing a highly non-linear function improves the learning capability of the VAE, it makes the inference of the posterior pθ(z|x) intractable. To tackle this problem, the VAE further introduces an approximate posterior qφ(z|x) to enable the following variational lower bound:

L_VAE(θ, φ; x) = −KL(qφ(z|x) || pθ(z)) + E_{qφ(z|x)}[log pθ(x|z)] ≤ log pθ(x)   (2)

where KL(Q||P) is the Kullback-Leibler divergence between two distributions Q and P, and qφ(z|x) is usually a diagonal Gaussian N(µ, diag(σ²)) whose mean µ and variance σ² are parameterized by, again, neural networks conditioned on x.

To maximize the variational lower bound in Eq. (2) stochastically with respect to both θ and φ, the VAE introduces a reparameterization trick that parameterizes the latent variable z with the Gaussian parameters µ and σ in qφ(z|x):

z̃ = µ + σ ⊙ ε   (3)

where ε is a standard Gaussian variable and ⊙ denotes an element-wise product. Intuitively, the VAE learns the representation of the latent variable not as single points, but as soft ellipsoidal regions in latent space, forcing the representation to fill the space rather than memorizing the training data as isolated representations. With this trick, the VAE model can be trained through the standard backpropagation technique with stochastic gradient ascent.
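For concreteness, here is a minimal NumPy sketch of Eqs. (2)-(3): it draws z with the reparameterization trick and evaluates a one-sample estimate of the lower bound for a diagonal Gaussian posterior and a Bernoulli decoder. The `decode` function is a hypothetical stand-in for the generative network pθ(x|z), not a specific implementation.

```python
import numpy as np

def reparameterize(mu, log_sigma2, rng):
    """Draw z ~ q_phi(z|x) via Eq. (3): z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_sigma2) * eps

def vae_lower_bound(x, mu, log_sigma2, decode, rng):
    """One-sample Monte Carlo estimate of the lower bound in Eq. (2).

    `decode(z)` is a hypothetical decoder returning Bernoulli parameters for
    p_theta(x|z); `mu` and `log_sigma2` are assumed to come from the encoder.
    """
    z = reparameterize(mu, log_sigma2, rng)
    x_hat = np.clip(decode(z), 1e-7, 1.0 - 1e-7)   # keep the logs finite
    # E_q[log p_theta(x|z)], estimated with a single sample of z
    log_px = np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
    # Analytic -KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian posterior
    neg_kl = 0.5 * np.sum(1.0 + log_sigma2 - mu ** 2 - np.exp(log_sigma2))
    return neg_kl + log_px
```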
4 The VIRILE Model

This section introduces our proposed VIRILE model. Formally, in VIRILE, there are two observed variables, x for a discourse and y for the corresponding relation, and one latent variable z. As illustrated in Figure 1, the joint distribution of the three variables is formulated as follows:

pθ(x, y, z) = pθ(x, y|z) pθ(z)   (4)

We begin with this distribution to elaborate the major components of VIRILE.

4.1 Neural Discourse Recognizer

The conditional distribution pθ(x, y|z) in Eq. (4) shows that both discourse arguments and the corresponding relation are generated from the latent variable. As shown in Figure 1, x is d-separated from y by z. Therefore the discourse x and the corresponding relation y are independent given the latent variable z, and the joint probability can be factorized as follows:

pθ(x, y, z) = pθ(x|z) pθ(y|z) pθ(z)   (5)

We adopt the centered isotropic multivariate Gaussian pθ(z) = N(z; 0, I) as the prior for the latent variable, following previous work [Kingma and Welling, 2014; Rezende et al., 2014]. With respect to the two conditional distributions, we parameterize them via neural networks as shown in Figure 2. Before we further explain the network structure, it is necessary to briefly introduce how discourse relations are annotated in our training data. The PDTB corpus, our training corpus, annotates implicit discourse relations between two neighboring arguments, namely Arg1 and Arg2. In VIRILE, we represent the two arguments with bag-of-words representations, denoted as x1 and x2.

Figure 2: Neural networks for the conditional probabilities pθ(x|z) and pθ(y|z). The gray color denotes real-valued representations, while the white and black colors denote 0-1 representations.

To model pθ(x|z) (the bottom part of Figure 2), we project the representation of the latent variable z ∈ R^{d_z} onto a hidden layer:

h′1 = f(W_{h′1} z + b_{h′1})   (6)
h′2 = f(W_{h′2} z + b_{h′2})   (7)

where W_{h′1} ∈ R^{d_{h′1}×d_z} and W_{h′2} ∈ R^{d_{h′2}×d_z} are the transformation matrices, b_{h′1} ∈ R^{d_{h′1}} and b_{h′2} ∈ R^{d_{h′2}} are the bias terms, d_u is the dimensionality of the vector representation of u, and f(·) is an element-wise activation function, such as tanh(·), which is used throughout our model. Upon this hidden layer, we further stack a Sigmoid layer to predict the probabilities of the corresponding discourse arguments:

x′1 = Sigmoid(W_{x′1} h′1 + b_{x′1})   (8)
x′2 = Sigmoid(W_{x′2} h′2 + b_{x′2})   (9)

where x′1 ∈ R^{d_{x1}} and x′2 ∈ R^{d_{x2}} are the real-valued representations of the reconstructed x1 and x2 respectively. Notice that the equality of d_{x1} = d_{x2} and d_{h′1} = d_{h′2} is not necessary, though we assume so in our experiments.

We assume that pθ(x|z) follows a multivariate Bernoulli distribution. Therefore the logarithm of p(x|z) is calculated as the sum over word probabilities in the discourse arguments:

log p(x|z) = Σ_i [x_{1,i} log x′_{1,i} + (1 − x_{1,i}) log(1 − x′_{1,i})]
           + Σ_j [x_{2,j} log x′_{2,j} + (1 − x_{2,j}) log(1 − x′_{2,j})]   (10)

where u_{i,j} is the jth element of u_i.

In order to estimate pθ(y|z) (the top part of Figure 2), we stack a softmax layer over the representation of the latent variable z:

y′ = SoftMax(W_{y′} z + b_{y′})   (11)

where W_{y′} ∈ R^{d_y×d_z} and b_{y′} ∈ R^{d_y} are the weight matrix and bias term, and d_y denotes the number of discourse relations. Suppose that the true relation is y ∈ R^{d_y}; the logarithm of p(y|z) can then be computed as follows:

log p(y|z) = Σ_{i=1}^{d_y} y_i log y′_i   (12)

In order to precisely estimate these conditional probabilities, our model will force the representation z of the latent variable to encode semantic information for both the reconstructed discourse x′ (Eq. (10)) and the predicted discourse relation y′ (Eq. (12)), which is exactly what we want.
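As an illustration of Eqs. (6)-(12), the following NumPy sketch computes log pθ(x|z) + log pθ(y|z) for a single discourse. It is a sketch only: the parameter dictionary and its key names are our own assumption, tanh plays the role of f(·), x1 and x2 are 0-1 bag-of-words vectors, and y is a one-hot relation vector.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def recognizer_log_likelihood(z, x1, x2, y, params):
    """log p_theta(x|z) + log p_theta(y|z) following Eqs. (6)-(12).

    `params` maps our (assumed) names to weight matrices and bias vectors.
    """
    # Eqs. (6)-(7): hidden layers computed from the latent variable z
    h1p = np.tanh(params["W_h1p"] @ z + params["b_h1p"])
    h2p = np.tanh(params["W_h2p"] @ z + params["b_h2p"])
    # Eqs. (8)-(9): Bernoulli parameters of the reconstructed arguments
    x1p = sigmoid(params["W_x1p"] @ h1p + params["b_x1p"])
    x2p = sigmoid(params["W_x2p"] @ h2p + params["b_x2p"])
    # Eq. (10): multivariate Bernoulli log-likelihood of both arguments
    log_px = (np.sum(x1 * np.log(x1p) + (1 - x1) * np.log(1 - x1p))
              + np.sum(x2 * np.log(x2p) + (1 - x2) * np.log(1 - x2p)))
    # Eqs. (11)-(12): softmax over relations and its log-likelihood
    yp = softmax(params["W_yp"] @ z + params["b_yp"])
    log_py = np.sum(y * np.log(yp))
    return log_px + log_py
```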
4.2 Neural Posterior Approximator

For the joint distribution in Eq. (5), we can define a variational lower bound that is similar to Eq. (2). The difference lies in the approximate posterior, which should be qφ(z|x, y) for VIRILE. However, considering the absence of y during discourse relation recognition, we assume that the latent variable can be inferred from discourse arguments x alone. This allows us to use qφ(z|x) rather than qφ(z|x, y) to approximate the true posterior. Similar to previous work [Kingma and Welling, 2014; Rezende et al., 2014], we let qφ(z|x) be a multivariate Gaussian distribution with a diagonal covariance structure:

qφ(z|x) = N(z; µ, σ²I)   (13)

where the mean µ and standard deviation σ of the approximate posterior are the outputs of the neural network shown in Figure 3. Similar to the calculation of pθ(x|z), we first transform the input x into hidden representations:

h1 = f(W_{h1} x1 + b_{h1})   (14)
h2 = f(W_{h2} x2 + b_{h2})   (15)

where W_{h1} ∈ R^{d_{h1}×d_{x1}} and W_{h2} ∈ R^{d_{h2}×d_{x2}} are weight matrices, and b_{h1} ∈ R^{d_{h1}}, b_{h2} ∈ R^{d_{h2}} are the bias terms. Notice that d_{h1}/d_{h2} are not necessarily equal to d_{h′1}/d_{h′2}. We then obtain the Gaussian parameters µ and log σ² through linear regression:

µ = W_{µ1} h1 + W_{µ2} h2 + b_µ   (16)
log σ² = W_{σ1} h1 + W_{σ2} h2 + b_σ   (17)

where µ, σ ∈ R^{d_z}. In this way, this posterior approximator can be computed efficiently.

Figure 3: Neural networks for the Gaussian parameters µ and log σ² in the approximated posterior qφ(z|x).

4.3 Variational Reparameterization

We have described how to calculate the likelihood pθ(x, y|z) and the approximate posterior qφ(z|x). In order to optimize our model, we need to further compute an expectation over the approximate posterior, that is, E_{qφ(z|x)}[log pθ(x, y|z)]. Since this expectation is intractable, we employ the Monte Carlo method to estimate it with a reparameterization trick similar to Eq. (3):

E_{qφ(z|x)}[log pθ(x, y|z)] ≈ 1/L Σ_{l=1}^{L} log pθ(x, y|z̃^(l))   (18)

where z̃^(l) = µ + σ ⊙ ε^(l) with ε^(l) ∼ N(0, I), and L is the number of samples. This reparameterization bridges the gap between the likelihood and the posterior, and enables internal backpropagation in our neural network. When testing new instances with the proposed model, we simply ignore the noise and set z̃ = µ to avoid uncertainty.
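The posterior approximator and the reparameterized Monte Carlo estimate of Eqs. (13)-(18) can be sketched as follows. This is again an illustrative sketch under our own parameter naming, with `log_lik` standing for a recognizer such as the one sketched in Section 4.1; it is not the authors' released implementation.

```python
import numpy as np

def approximate_posterior(x1, x2, params):
    """Gaussian parameters (mu, log sigma^2) of q_phi(z|x), Eqs. (14)-(17)."""
    h1 = np.tanh(params["W_h1"] @ x1 + params["b_h1"])                      # Eq. (14)
    h2 = np.tanh(params["W_h2"] @ x2 + params["b_h2"])                      # Eq. (15)
    mu = params["W_mu1"] @ h1 + params["W_mu2"] @ h2 + params["b_mu"]       # Eq. (16)
    log_sigma2 = params["W_s1"] @ h1 + params["W_s2"] @ h2 + params["b_s"]  # Eq. (17)
    return mu, log_sigma2

def expected_log_likelihood(x1, x2, y, params, log_lik, L=1, rng=None):
    """Monte Carlo estimate of E_q[log p_theta(x, y|z)] as in Eq. (18).

    At test time one would skip the noise and simply use z = mu.
    """
    rng = rng or np.random.default_rng()
    mu, log_sigma2 = approximate_posterior(x1, x2, params)
    total = 0.0
    for _ in range(L):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * log_sigma2) * eps      # reparameterization trick
        total += log_lik(z, x1, x2, y, params)
    return total / L
```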
4.4 Parameter Learning

Given a training instance (x^(t), y^(t)), the joint training objective is defined as follows:

L(θ, φ) ≈ 1/2 Σ_{j=1}^{d_z} (1 + log((σ_j^(t))²) − (µ_j^(t))² − (σ_j^(t))²)
        + 1/L Σ_{l=1}^{L} log pθ(x^(t), y^(t)|z̃^(t,l))   (19)

where z̃^(t,l) = µ^(t) + σ^(t) ⊙ ε^(l) and ε^(l) ∼ N(0, I). The first term is the negative KL divergence, which can be computed and differentiated without estimation (see [Kingma and Welling, 2014] for details). Intuitively, this makes our model a conventional neural network with a special regularizer. The second term is the approximate expectation shown in Eq. (18), which is also differentiable. There are two different sets of parameters in the proposed model:
• θ: W_{h′1}, W_{h′2}, b_{h′1}, b_{h′2}, W_{x′1}, W_{x′2}, b_{x′1}, b_{x′2}, W_{y′} and b_{y′}
• φ: W_{h1}, W_{h2}, b_{h1}, b_{h2}, W_{µ1}, W_{µ2}, b_µ, W_{σ1}, W_{σ2} and b_σ

Since the objective function in Eq. (19) is differentiable, we can optimize these parameters jointly using standard gradient ascent techniques. The training procedure for VIRILE is summarized in Algorithm 1.

Algorithm 1: Parameter Learning Algorithm of VIRILE.
Inputs: A, the maximum number of iterations; M, the number of instances in one batch; L, the number of samples
θ, φ ← Initialize parameters
repeat
    D ← getRandomMiniBatch(M)
    ε ← getRandomNoiseFromStandardGaussian()
    g ← ∇_{θ,φ} L(θ, φ; D, ε)
    θ, φ ← parameterUpdater(θ, φ; g)
until convergence of the parameters (θ, φ) or the maximum iteration A is reached
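Algorithm 1 can be paraphrased as the Python skeleton below. It is a sketch only: `objective_and_grad` is assumed to return the value of Eq. (19) and its gradients (e.g. via automatic differentiation), `update` stands in for the parameter updater (Adagrad in our experiments), and `params["d_z"]` holds the latent dimensionality; none of these names come from the paper.

```python
import numpy as np

def train_virile(params, data, objective_and_grad, update,
                 A=1000, M=100, L=1, seed=0):
    """Skeleton of Algorithm 1: stochastic gradient ascent on Eq. (19)."""
    rng = np.random.default_rng(seed)
    for _ in range(A):                                     # at most A iterations
        idx = rng.choice(len(data), size=M, replace=False)
        batch = [data[i] for i in idx]                     # mini-batch D
        eps = rng.standard_normal((M, L, params["d_z"]))   # noise eps ~ N(0, I)
        _, grads = objective_and_grad(params, batch, eps)  # L(theta, phi; D, eps)
        params = update(params, grads)                     # ascend the lower bound
    return params
```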
5 Experiments
We conducted a series of experiments on the English implicit DRR task to validate the effectiveness of VIRILE. In this section, we first briefly review the PDTB dataset used to train our model, and then present the experiment setup, the results, and an analysis of the variational lower bound.
5.1 Dataset
We used the largest hand-annotated discourse corpus, PDTB 2.0^2 [Prasad et al., 2008] (PDTB hereafter). This corpus contains discourse annotations over 2,312 Wall Street Journal articles, and is organized into different sections. Following previous work [Pitler et al., 2009; Zhou et al., 2010; Lan et al., 2013; Zhang et al., 2015], we used sections 2-20 as our training set and sections 21-22 as the test set. Sections 0-1 were used as the development set for hyperparameter optimization.

In PDTB, discourse relations are annotated in a predicate-argument view. Each discourse connective is treated as a predicate that takes two text spans as its arguments. The discourse relation tags in PDTB are arranged in a three-level hierarchy, where the top level consists of four major semantic classes: TEMPORAL (TEM), CONTINGENCY (CON), EXPANSION (EXP) and COMPARISON (COM). Because the top-level relations are general enough to be annotated with a high inter-annotator agreement and are common to most theories of discourse, in our experiments we only use this level of annotations.

We formulated the task as four separate one-against-all binary classification problems: each top-level class vs. the other three discourse relation classes. We also balanced the training set by resampling training instances in each class until the numbers of positive and negative instances were equal. In contrast, all instances in the test and development sets were kept in their natural distributions. The statistics of the various data sets are listed in Table 1.

^2 http://www.seas.upenn.edu/~pdtb/
Relation   #Instance Number
           Train    Dev    Test
COM        1942     197    152
CON        3342     295    279
EXP        7004     671    574
TEM        760      64     85

Table 1: Statistics of implicit discourse relations for the training (Train), development (Dev) and test (Test) sets in PDTB.
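A minimal sketch of the one-against-all construction and the resampling-based balancing described above; the helper below is our own illustration, not the authors' preprocessing script.

```python
import random

def binarize_and_balance(instances, target, seed=0):
    """Build a balanced one-vs-other training set for one top-level relation.

    `instances` is a list of (features, relation) pairs and `target` is one of
    "COM", "CON", "EXP", "TEM". The smaller class is resampled with
    replacement until both classes have the same size.
    """
    pos = [(x, 1) for x, r in instances if r == target]
    neg = [(x, 0) for x, r in instances if r != target]
    small, large = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    rng = random.Random(seed)
    resampled = small + [rng.choice(small) for _ in range(len(large) - len(small))]
    return large + resampled
```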
5.2 Setup
We tokenized all datasets using the Stanford NLP Toolkit^3. For optimization, we employed the Adagrad algorithm to update parameters. With respect to the hyperparameters M, L, A and the dimensionality of all vector representations, we set them according to previous work [Kingma and Welling, 2014; Rezende et al., 2014] and preliminary experiments on the development set. Finally, we set M = 100, A = 1000, L = 1, d_z = 20, d_{x1} = d_{x2} = 10001, d_{h1} = d_{h2} = d_{h′1} = d_{h′2} = 400 and d_y = 2 for all experiments. Notice that one dimension in d_{x1} and d_{x2} is reserved for unknown words.

We compared VIRILE against the following two baseline methods:
• SVM: a support vector machine (SVM) classifier trained with several manual features. We used the toolkit SVM-light^4 to train the classifier in our experiments.
• SCNN: a shallow convolutional neural network proposed by Zhang et al. [2015].

Features used in SVM are taken from the state-of-the-art implicit discourse relation recognition model, including Bag of Words, Cross-Argument Word Pairs, Polarity, First-Last, First3, Production Rules, Dependency Rules and Brown cluster pairs [Rutherford and Xue, 2014]. In order to collect bag of words, production rules, dependency rules and cross-argument word pairs, we used a frequency cutoff of 5 to remove rare features, following Lin et al. [2009].

^3 http://nlp.stanford.edu/software/corenlp.shtml
^4 http://svmlight.joachims.org/
5.3 Classification Results
Because the development and test sets are imbalanced in terms of the ratio of positive and negative instances, we chose the F1 score as our major evaluation metric. In addition, we also provide the precision, recall and accuracy metrics for further analysis. Table 2 summarizes the classification results, where the highest F1 score in each of the four tasks is highlighted in bold.

From Table 2, we observe that the proposed VIRILE outperforms SVM on EXP/TEM and SCNN on EXP/COM according to their F1 scores. Although it fails on CON, VIRILE achieves the best result on EXP. Overall, VIRILE is competitive in comparison with the two state-of-the-art baselines. Similar to other generative models, VIRILE obtains relatively low precision but high recall in most cases.

With respect to accuracy, our model does not yield substantial improvements over the two baselines except for TEM. This may be because we used the F1 score, rather than the accuracy, as our selection criterion on the development set. Nevertheless, more analysis should be done to understand the deeper reason. Besides, we find that the performance of our model is proportional to the number of training instances. This suggests that collecting more training instances (in spite of the noise) may be beneficial to our model.
(a) COM vs Other
Model    Acc     P       R       F1
SVM      63.10   22.79   64.47   33.68
SCNN     60.42   22.00   67.76   33.22
VIRILE   62.43   22.55   65.13   33.50

(b) CON vs Other
Model    Acc     P       R       F1
SVM      60.71   39.14   72.40   50.82
SCNN     63.00   39.80   75.29   52.04
VIRILE   55.45   36.50   79.93   50.11

(c) EXP vs Other
Model    Acc     P       R       F1
SVM      62.62   65.89   58.89   62.19
SCNN     63.00   56.29   91.11   69.59
VIRILE   57.55   55.21   99.65   71.06

(d) TEM vs Other
Model    Acc     P       R       F1
SVM      66.25   15.10   68.24   24.73
SCNN     76.95   20.22   62.35   30.54
VIRILE   85.94   25.00   36.47   29.67

Table 2: Classification results of different models on the implicit DRR task. P=Precision, R=Recall, and F1=F1 score. The best F1 scores are highlighted in bold.
Figure 4: Illustration of the variational lower bound (blue) on the training set and the F1 score (brown) on the development set for the four tasks: (a) COM vs Other, (b) CON vs Other, (c) EXP vs Other, (d) TEM vs Other. Horizontal axis: the epoch number. Vertical axis: the F1 score for relation classification (left) and the estimated average variational lower bound per datapoint (right).
5.4 Variational Lower Bound Analysis
In addition to classification performance, efficiency in learning and inference is another concern for variational methods. Figure 4 shows the training progress for the four tasks in terms of the variational lower bound on the training set. We also provide F1 scores on the development set to investigate the relation between the variational lower bound and recognition performance.

We find that our model converges, in terms of the variational lower bound, considerably fast in all experiments (within 100 epochs), which resonates with previous findings [Kingma and Welling, 2014; Rezende et al., 2014]. However, the trend of the F1 score does not follow that of the lower bound. In particular, for the four discourse relations, we further observe that the trajectories of the F1 score are completely different. This may suggest that the four discourse relations have different properties and distributions. Specifically, the number of epochs at which the best F1 score is reached also differs across the four discourse relations. This indicates that dividing implicit DRR into four different tasks according to the type of discourse relation is reasonable and better than performing DRR on the mixture of the four relations.
6 Conclusion and Future Work
In this paper, we have presented a variational neural discourse relation recognizer for implicit DRR. Different from conventional discriminative models that directly calculate the conditional probability of the relation y given discourse arguments x, our model assumes that it is a latent variable from an underlying semantic space that generates both x and y. In order to make inference and learning efficient, we introduce a neural discourse recognizer and a neural posterior approximator as our generative and inference models respectively. Using the reparameterization technique, we are able to optimize the whole model via a standard stochastic gradient ascent algorithm. Experiment results in terms of classification performance and the variational lower bound verify the effectiveness of our model.

In the future, we would like to exploit the utilization of discourse instances with explicit relations for implicit DRR. For this we can start from two directions: 1) converting explicit instances into pseudo implicit instances and retraining our model; 2) developing a semi-supervised model to leverage the semantic information inside discourse arguments. Furthermore, we are also interested in adapting our model to other similar tasks, such as natural language inference.
References

[Braud and Denis, 2014] Chloé Braud and Pascal Denis. Combining natural and artificial examples to improve implicit discourse relation identification. In Proc. of COLING, pages 1694–1705, August 2014.
[Braud and Denis, 2015] Chloé Braud and Pascal Denis. Comparing word representations for implicit discourse relation classification. In Proc. of EMNLP, pages 2201–2211, 2015.
[Chung et al., 2015] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Proc. of NIPS, 2015.
[Cimiano et al., 2005] Philipp Cimiano, Uwe Reyle, and Jasmin Šarić. Ontology-driven discourse analysis for information extraction. Data & Knowledge Engineering, 55:59–83, 2005.
[Fisher and Simmons, 2015] Robert Fisher and Reid Simmons. Spectral semi-supervised discourse relation classification. In Proc. of ACL-IJCNLP, pages 89–93, July 2015.
[Gregor et al., 2015] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.
[Higashinaka et al., 2014] Ryuichiro Higashinaka, Kenji Imamura, Toyomi Meguro, Chiaki Miyazaki, Nozomi Kobayashi, Hiroaki Sugiyama, Toru Hirano, Toshiro Makino, and Yoshihiro Matsuo. Towards an open-domain conversational system fully based on natural language processing. In Proc. of COLING, pages 928–939, 2014.
[Ji and Eisenstein, 2015] Yangfeng Ji and Jacob Eisenstein. One vector is not enough: Entity-augmented distributed semantics for discourse relations. TACL, pages 329–344, 2015.
[Kingma and Welling, 2014] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proc. of ICLR, 2014.
[Kingma et al., 2014] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Proc. of NIPS, pages 3581–3589, 2014.
[Lan et al., 2013] Man Lan, Yu Xu, and Zhengyu Niu. Leveraging synthetic discourse data via multi-task learning for implicit discourse relation recognition. In Proc. of ACL, pages 476–485, Sofia, Bulgaria, August 2013.
[Lin et al., 2009] Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng. Recognizing implicit discourse relations in the Penn Discourse Treebank. In Proc. of EMNLP, pages 343–351, 2009.
[Louis et al., 2010] Annie Louis, Aravind Joshi, Rashmi Prasad, and Ani Nenkova. Using entity features to classify implicit discourse relations. In Proc. of SIGDIAL, pages 59–62, Tokyo, Japan, September 2010.
[Miltsakaki et al., 2005] Eleni Miltsakaki, Nikhil Dinesh, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. Experiments on sense annotations and sense disambiguation of discourse connectives. In Proc. of TLT2005, 2005.
[Park and Cardie, 2012] Joonsuk Park and Claire Cardie. Improving implicit discourse relation recognition through feature set optimization. In Proc. of SIGDIAL, pages 108–112, Seoul, South Korea, July 2012.
[Patterson and Kehler, 2013] Gary Patterson and Andrew Kehler. Predicting the presence of discourse connectives. In Proc. of EMNLP, pages 914–923, 2013.
[Pitler et al., 2008] Emily Pitler, Mridhula Raghupathy, Hena Mehta, Ani Nenkova, Alan Lee, and Aravind K. Joshi. Easily identifiable discourse relations. Technical Reports (CIS), page 884, 2008.
[Pitler et al., 2009] Emily Pitler, Annie Louis, and Ani Nenkova. Automatic sense prediction for implicit discourse relations in text. In Proc. of ACL-AFNLP, pages 683–691, August 2009.
[Prasad et al., 2008] Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. The Penn Discourse Treebank 2.0. In LREC, 2008.
[Rezende et al., 2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proc. of ICML, pages 1278–1286, 2014.
[Rutherford and Xue, 2014] Attapol Rutherford and Nianwen Xue. Discovering implicit discourse relations through Brown cluster pair representation and coreference patterns. In Proc. of EACL, pages 645–654, April 2014.
[Rutherford and Xue, 2015] Attapol Rutherford and Nianwen Xue. Improving the inference of implicit discourse relations via classifying explicit discourse connectives. In Proc. of NAACL-HLT, pages 799–808, May–June 2015.
[Verberne et al., 2007] Suzan Verberne, Lou Boves, Nelleke Oostdijk, and Peter-Arno Coppen. Evaluating discourse-based answer extraction for why-question answering. In Proc. of SIGIR, pages 735–736, 2007.
[Wang et al., 2012] Xun Wang, Sujian Li, Jiwei Li, and Wenjie Li. Implicit discourse relation recognition by selecting typical training examples. In Proc. of COLING, pages 2757–2772, 2012.
[Yoshida et al., 2014] Yasuhisa Yoshida, Jun Suzuki, Tsutomu Hirao, and Masaaki Nagata. Dependency-based discourse parser for single-document summarization. In Proc. of EMNLP, pages 1834–1839, October 2014.
[Zhang et al., 2015] Biao Zhang, Jinsong Su, Deyi Xiong, Yaojie Lu, Hong Duan, and Junfeng Yao. Shallow convolutional neural network for implicit discourse relation recognition. In Proc. of EMNLP, September 2015.
[Zhou et al., 2010] Zhi-Min Zhou, Yu Xu, Zheng-Yu Niu, Man Lan, Jian Su, and Chew Lim Tan. Predicting discourse connectives for implicit discourse relation recognition. In Proc. of COLING, pages 1507–1514, 2010.