Latent Tree Language Model


arXiv:1607.07057v1 [cs.CL] 24 Jul 2016

Tomáš Brychcín, NTIS – New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Technická 8, 306 14 Plzeň, Czech Republic, [email protected], nlp.kiv.zcu.cz

Abstract: In this paper we introduce the Latent Tree Language Model (LTLM), a novel approach to language modeling that encodes the syntax and semantics of a given sentence as a tree of word roles. The learning phase iteratively updates the trees by moving nodes according to Gibbs sampling. We introduce two algorithms to infer a tree for a given sentence. The first one is based on Gibbs sampling; it is fast, but does not guarantee finding the most probable tree. The second one is based on dynamic programming; it is slower, but guarantees finding the most probable tree. We provide a comparison of both algorithms. We combine LTLM with a 4-gram Modified Kneser-Ney language model via linear interpolation. Our experiments with English and Czech corpora show significant perplexity reductions (up to 46% for English and 49% for Czech) compared with the standalone 4-gram Modified Kneser-Ney language model.

1 Introduction

Language modeling is one of the core disciplines in natural language processing (NLP). Automatic speech recognition, machine translation, optical character recognition, and other tasks strongly depend on the language model (LM). An improvement in language modeling often leads to better performance of the whole task. The goal of language modeling is to determine the joint probability of a sentence. Currently, the dominant approach is n-gram language modeling, which decomposes the joint probability into the product of conditional probabilities by using the chain rule.

In traditional n-gram LMs the words are represented as distinct symbols. This leads to an enormous number of word combinations. In recent years many researchers have tried to capture the contextual meaning of words and incorporate it into LMs. Word sequences that have never been seen before receive high probability when they are made of words that are semantically similar to words forming sentences seen in the training data. This ability can increase LM performance because it reduces the data sparsity problem.

In NLP a very common paradigm for word meaning representation is the use of the distributional hypothesis. It suggests that two words are expected to be semantically similar if they occur in similar contexts (they are similarly distributed in the text) (Harris, 1954). Models based on this assumption are denoted as distributional semantic models (DSMs).

Recently, semantically motivated LMs have begun to surpass ordinary n-gram LMs. The most commonly used architectures are neural network LMs (Bengio et al., 2003; Mikolov et al., 2010; Mikolov et al., 2011) and class-based LMs. Class-based LMs are more closely related to this work, so we investigate them in more depth. Brown et al. (1992) introduced class-based LMs of English. Their unsupervised algorithm searches for classes consisting of words that are most probable in the given context (a one-word window in both directions). However, the computational complexity of this algorithm is very high. This approach was later extended by (Martin et al., 1998; Whittaker and Woodland, 2003) to improve the complexity and to work with a wider context. Deschacht et al. (2012) used the same idea and introduced the Latent Words Language Model (LWLM), where word classes are latent variables in a graphical model. They apply Gibbs sampling or the expectation maximization algorithm to discover the word classes that are most probable in the context of surrounding word classes. A similar approach was presented in (Brychcín and Konopík, 2014; Brychcín and Konopík, 2015), where word clusters derived from various semantic spaces were used to improve LMs.

In the above-mentioned approaches, the meaning of a word is inferred from the surrounding words independently of their relation. An alternative approach is to derive contexts based on the syntactic relations the word participates in. Such syntactic contexts are automatically produced by dependency parse trees. The resulting word representations are usually less topical and exhibit more functional similarity (they are more syntactically oriented), as shown in (Padó and Lapata, 2007; Levy and Goldberg, 2014). Dependency-based methods for syntactic parsing have become increasingly popular in NLP in the last years (Kübler et al., 2009). Popel and Mareček (2010) showed that these methods are a promising direction for improving LMs. Recently, unsupervised algorithms for dependency parsing appeared in (Headden III et al., 2009; Cohen et al., 2009; Spitkovsky et al., 2010; Spitkovsky et al., 2011; Mareček and Straka, 2013), offering new possibilities even for poorly-resourced languages.

In this work we introduce a new DSM that uses tree-based contexts to create word roles. A word role contains the words that are similarly distributed over similar tree-based contexts. The word role encodes the semantic and syntactic properties of a word. We do not rely on parse trees as prior knowledge, but jointly learn the tree structures and word roles. Our model is a soft clustering, i.e. one word may be present in several roles; thus it is theoretically able to capture word polysemy. The learned structure is used as a LM, where each word role is conditioned on its parent role. We present an unsupervised algorithm that discovers the tree structures only from the distribution of words in a training corpus (i.e. no labeled data or external sources of information are needed). In our work we were inspired by class-based LMs (Deschacht et al., 2012), unsupervised dependency parsing (Mareček and Straka, 2013), and tree-based DSMs (Levy and Goldberg, 2014).

This paper is organized as follows. We start with the definition of our model (Section 2). The process of learning the hidden sentence structures is explained in Section 3. We introduce two algorithms for searching the most probable tree for a given sentence (Section 4). The experimental results on English and Czech corpora are presented in Section 6. We conclude in Section 7 and offer some directions for future work.

2 Latent Tree Language Model

In this section we describe the Latent Tree Language Model (LTLM). LTLM is a generative statistical model that discovers the tree structures hidden in the text corpus.

Let $L$ be a word vocabulary with a total of $|L|$ distinct words. Assume we have a training corpus $w$ divided into $S$ sentences. The goal of LTLM, as of any other LM, is to estimate the probability of a text, $P(w)$. Let $N_s$ denote the number of words in the $s$-th sentence. The $s$-th sentence is a sequence of words $w_s = \{w_{s,i}\}_{i=0}^{N_s}$, where $w_{s,i} \in L$ is the word at position $i$ in this sentence and $w_{s,0} = \langle s \rangle$ is an artificial symbol that is added at the beginning of each sentence.

Each sentence $s$ is associated with a dependency graph $G_s$. We define the dependency graph as a labeled directed graph whose nodes correspond to the words in the sentence and where each node carries a label that we call a role. Formally, it is a triple $G_s = (V_s, E_s, r_s)$ consisting of:

• The set of nodes $V_s = \{0, 1, \ldots, N_s\}$. Each token $w_{s,i}$ is associated with the node $i \in V_s$.
• The set of edges $E_s \subseteq V_s \times V_s$.
• The sequence of roles $r_s = \{r_{s,i}\}_{i=0}^{N_s}$, where $1 \leq r_{s,i} \leq K$ for $i \in V_s$ and $K$ is the number of roles.

The artificial word $w_{s,0} = \langle s \rangle$ at the beginning of the sentence always has role 1 ($r_{s,0} = 1$). Analogously to $w$, the sequence of all $r_s$ is denoted as $r$ and the sequence of all $G_s$ as $G$.
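To make the notation concrete, the following sketch (ours for illustration only; the class and function names are not from the paper) shows one possible in-memory representation of a sentence $w_s$ together with its graph $G_s$: a parent array playing the role of $h_s$ and a role array for $r_s$.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LatentTree:
    """Illustrative container for one sentence s and its graph G_s.

    words[0] is the artificial symbol <s>; parent[i] gives h_s(i) for i > 0,
    with parent[0] = -1 marking the root; role[i] is r_{s,i} in 1..K, and
    role[0] is fixed to 1 for the artificial root word.
    """
    words: List[str]   # w_{s,0..N_s}, words[0] == "<s>"
    parent: List[int]  # parent[i] = h_s(i); parent[0] = -1 (node 0 is the root)
    role: List[int]    # role[i] = r_{s,i}, 1 <= role[i] <= K; role[0] == 1

def initial_tree(tokens: List[str], k_init: int = 1) -> LatentTree:
    """A flat initialization (every word attached to node 0) is one simple,
    projective starting point; the paper does not prescribe this choice."""
    words = ["<s>"] + tokens
    parent = [-1] + [0] * len(tokens)
    role = [1] + [k_init] * len(tokens)
    return LatentTree(words, parent, role)
```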

Figure 1: Example of LTLM for the sentence "Everything has beauty, but not everyone sees it."

An edge $e \in E_s$ is an ordered pair of nodes $(i, j)$. We say that $i$ is the head (the parent) and $j$ is the dependent (the child). We use the notation $i \rightarrow j$ for such an edge and $i \stackrel{*}{\rightarrow} j$ for a directed path from node $i$ to node $j$. We place a few constraints on the graph $G_s$:

• The graph $G_s$ is a tree, i.e. it is acyclic (if $i \stackrel{*}{\rightarrow} j$ then not $j \stackrel{*}{\rightarrow} i$) and each node has exactly one parent (if $i \rightarrow j$ then not $k \rightarrow j$ for any $k \neq i$).
• The graph $G_s$ is projective (there are no crossing edges): for each edge $(i, j)$ and for each $k$ between $i$ and $j$ (i.e. $i < k < j$ or $i > k > j$) there must exist the directed path $i \stackrel{*}{\rightarrow} k$.
• The graph $G_s$ is always rooted in the node 0.

We denote such graphs as projective dependency trees. An example of such a tree is shown in Figure 1. For the tree $G_s$ we define a function

$$h_s(j) = i, \quad \text{when } (i, j) \in E_s \qquad (1)$$

that returns the parent of each node except the root. We use the graph $G_s$ as a representation of a Bayesian network with random variables $E_s$ and $r_s$. The roles $r_{s,i}$ represent the node labels and the edges express the dependences between the roles. The conditional probability of the role at position $i$ given its parent role is denoted as $P(r_{s,i} \mid r_{s,h_s(i)})$. The conditional probability of the word at position $i$ in the sentence given its role $r_{s,i}$ is denoted as $P(w_{s,i} \mid r_{s,i})$. We model the distribution over words in the sentence $s$ as the mixture

$$P(w_s) = P(w_s \mid r_{s,0}) = \prod_{i=1}^{N_s} \sum_{k=1}^{K} P(w_{s,i} \mid r_{s,i} = k)\, P(r_{s,i} = k \mid r_{s,h_s(i)}). \qquad (2)$$

The root role is kept fixed for each sentence ($r_{s,0} = 1$), so $P(w_s) = P(w_s \mid r_{s,0})$. We look at the roles as mixtures over child roles and simultaneously as mixtures over words. We can represent the dependency between roles with a set of $K$ multinomial distributions $\theta$ over $K$ roles, such that $P(r_{s,i} \mid r_{s,h_s(i)} = k) = \theta^{(k)}_{r_{s,i}}$. Simultaneously, the dependency of words on their roles can be represented as a set of $K$ multinomial distributions $\phi$ over $|L|$ words, such that $P(w_{s,i} \mid r_{s,i} = k) = \phi^{(k)}_{w_{s,i}}$.

To make predictions about new sentences, we need to assume a prior distribution on the parameters $\theta^{(k)}$ and $\phi^{(k)}$. We place a Dirichlet prior $D$ with the vector of $K$ hyper-parameters $\alpha$ on each multinomial distribution $\theta^{(k)} \sim D(\alpha)$ and with the vector of $|L|$ hyper-parameters $\beta$ on each multinomial distribution $\phi^{(k)} \sim D(\beta)$. In general, $D$ is not restricted to be the Dirichlet distribution; it could be any distribution over discrete children, such as the logistic normal. In this paper we focus only on the Dirichlet as a conjugate prior to the multinomial distribution and derive the learning algorithm under this assumption.

The choice of a child role depends only on its parent role, i.e. child roles with the same parent are mutually independent. This property is especially important for the learning algorithm (Section 3) and also for searching the most probable trees (Section 4). We do not place any assumption on the length of the sentence $N_s$ or on how many children a parent node is expected to have.
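For illustration, the sketch below (our assumption-laden illustration, not the author's code) evaluates the factorization behind this model for a fully specified tree, i.e. the joint probability $\prod_{i=1}^{N_s} P(w_{s,i} \mid r_{s,i})\, P(r_{s,i} \mid r_{s,h_s(i)})$ of the words and roles given the tree; the tables `theta` and `phi` and the `LatentTree` container from the previous sketch are assumed inputs.

```python
import math
from typing import Dict, List

def tree_log_prob(tree,
                  theta: List[List[float]],    # theta[p][k] = P(child role k | parent role p)
                  phi: List[Dict[str, float]]  # phi[k][word] = P(word | role k)
                  ) -> float:
    """Log of prod_i P(w_{s,i} | r_{s,i}) * P(r_{s,i} | r_{s,h_s(i)}) for a given tree.

    Role ids are used directly as indices; size the tables with K + 1 rows so
    that roles 1..K work (index 0 stays unused).
    """
    logp = 0.0
    for i in range(1, len(tree.words)):          # skip the artificial root word <s>
        k = tree.role[i]
        p = tree.role[tree.parent[i]]
        logp += math.log(phi[k][tree.words[i]])  # word given its role
        logp += math.log(theta[p][k])            # role given its parent role
    return logp
```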

3 Parameter Estimation

In this section we present the learning algorithm for LTLM. The goal is to estimate $\theta$ and $\phi$ in a way that maximizes the predictive ability of the model (generates the corpus with maximal joint probability $P(w)$).

Let $\chi^k_{(i,j)}$ be an operation that changes the tree $G_s$ into $G'_s$,

$$\chi^k_{(i,j)} : G_s \rightarrow G'_s, \qquad (3)$$

such that the newly created tree $G'_s = (V'_s, E'_s, r'_s)$ consists of:

• $V'_s = V_s$.
• $E'_s = (E_s \setminus \{(h_s(i), i)\}) \cup \{(j, i)\}$.
• $r'_{s,a} = r_{s,a}$ for $a \neq i$ and $r'_{s,a} = k$ for $a = i$, where $0 \leq a \leq N_s$.

It means that we change the role of the selected node $i$ so that $r_{s,i} = k$ and simultaneously change the parent of this node to be $j$. We call this operation a partial change. The newly created graph $G'_s$ must satisfy all conditions presented in Section 2, i.e. it must be a projective dependency tree rooted in the node 0. Thus not all partial changes $\chi^k_{(i,j)}$ can be performed on a graph $G_s$. Clearly, for the sentence $s$ there are at most $N_s(1+N_s)/2$ possible parent changes¹.

To estimate the parameters of LTLM we apply Gibbs sampling and gradually sample $\chi^k_{(i,j)}$ for the trees $G_s$. For doing so we need to determine the posterior predictive distribution²

$$G'_s \sim P(\chi^k_{(i,j)}(G_s) \mid w, G), \qquad (4)$$

from which we will sample partial changes to update the trees. In this equation, $G$ denotes the sequence of all trees for the given sentences $w$, and $G'_s$ is the result of one sampling step. In the following text we derive this equation under the assumptions from Section 2.

The posterior predictive distribution of a Dirichlet multinomial has the form of additive smoothing, which is well known in the context of language modeling. The hyper-parameters of the Dirichlet prior determine how much the predictive distribution is smoothed. The predictive distribution for the word-in-role distribution can be expressed as

$$P(w_{s,i} \mid r_{s,i}, w_{\setminus s,i}, r_{\setminus s,i}) = \frac{n^{(w_{s,i} \mid r_{s,i})}_{\setminus s,i} + \beta}{n^{(\bullet \mid r_{s,i})}_{\setminus s,i} + |L|\,\beta}, \qquad (5)$$

where $n^{(w_{s,i} \mid r_{s,i})}_{\setminus s,i}$ is the number of times the role $r_{s,i}$ has been assigned to the word $w_{s,i}$, excluding the position $i$ in the $s$-th sentence. The symbol $\bullet$ represents any word in the vocabulary, so that $n^{(\bullet \mid r_{s,i})}_{\setminus s,i} = \sum_{l \in L} n^{(l \mid r_{s,i})}_{\setminus s,i}$. We use a symmetric Dirichlet distribution for the word-in-role probabilities, as it could be difficult to estimate the vector of hyper-parameters $\beta$ for a large word vocabulary; in the above equation, $\beta$ is therefore a scalar. The predictive distribution for the role-by-role distribution is

$$P(r_{s,i} \mid r_{s,h_s(i)}, r_{\setminus s,i}) = \frac{n^{(r_{s,i} \mid r_{s,h_s(i)})}_{\setminus s,i} + \alpha_{r_{s,i}}}{n^{(\bullet \mid r_{s,h_s(i)})}_{\setminus s,i} + \sum_{k=1}^{K} \alpha_k}. \qquad (6)$$

Analogously to the previous equation, $n^{(r_{s,i} \mid r_{s,h_s(i)})}_{\setminus s,i}$ denotes the number of times the role $r_{s,i}$ has the parent role $r_{s,h_s(i)}$, excluding the position $i$ in the $s$-th sentence. Here the symbol $\bullet$ represents any possible role, so that the probability distribution sums up to 1. We assume an asymmetric Dirichlet distribution. We can use the predictive distributions of the above-mentioned Dirichlet multinomials to express the joint probability that the role at position $i$ is $k$ ($r_{s,i} = k$) with the parent at position $j$, conditioned on the current values of all variables except those at position $i$ in the sentence $s$:

$$P(r_{s,i} = k, j \mid w, r_{\setminus s,i}) \propto P(w_{s,i} \mid r_{s,i} = k, w_{\setminus s,i}, r_{\setminus s,i}) \times P(r_{s,i} = k \mid r_{s,j}, r_{\setminus s,i}) \times \prod_{a : h_s(a) = i} P(r_{s,a} \mid r_{s,i} = k, r_{\setminus s,i}). \qquad (7)$$

¹ The largest number of parent changes is possible for the special case of a tree where each node $i$ has the parent $i-1$. Then for each node $i$ we can change its parent to any node $j < i$ and keep the projectivity of the tree, which gives $N_s(1+N_s)/2$ possibilities.

² The posterior predictive distribution is the distribution of an unobserved variable conditioned on the observed data, i.e. $P(X_{n+1} \mid X_1, \ldots, X_n)$, where the $X_i$ are i.i.d. (independent and identically distributed) random variables.

The choice of the role of node $i$ affects the word that is produced by this role and also all the child roles of node $i$. Simultaneously, the role of node $i$ depends on the role of its parent $j$. Formula 7 is derived from the joint probability of a sentence $s$ and a tree $G_s$, where all probabilities that do not depend on the choice of the role at position $i$ are removed and equality is replaced by proportionality ($\propto$). We express the final predictive distribution for sampling partial changes $\chi^k_{(i,j)}$ as

$$P(\chi^k_{(i,j)}(G_s) \mid w, G) \propto \frac{P(r_{s,i} = k, j \mid w, r_{\setminus s,i})}{P(r_{s,i}, h_s(i) \mid w, r_{\setminus s,i})}, \qquad (8)$$

which is essentially the ratio between the joint probability of $r_{s,i}$ and its parent after the partial change and before the partial change (conditioned on all other variables). This fraction can be interpreted as the necessity to perform this partial change.
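The sketch below illustrates the mechanics of one such sampling step under these definitions (our illustration, not the author's implementation): enumerate the partial changes $\chi^k_{(i,j)}$ that keep $G_s$ a projective dependency tree rooted in node 0, weight them proportionally to Equation 7, and draw one. The helper `score(tree, i, j, k)` is a hypothetical stand-in for the count-based predictive probabilities of Equations 5-7 and is left abstract; how the position $i$ is chosen follows the sampling strategies described below.

```python
import random
from typing import Callable, List, Tuple

def is_descendant(tree, node: int, ancestor: int) -> bool:
    """True if `ancestor` lies on the path from `node` up to the root."""
    while node not in (-1, 0):
        node = tree.parent[node]
        if node == ancestor:
            return True
    return False

def is_projective(parent: List[int]) -> bool:
    """No-crossing-edges condition from Section 2: for every edge (h, d),
    each node strictly between h and d must be dominated by the head h."""
    def dominated_by(node: int, head: int) -> bool:
        while node != -1:
            if node == head:
                return True
            node = parent[node]
        return False

    for d in range(1, len(parent)):
        h = parent[d]
        for between in range(min(h, d) + 1, max(h, d)):
            if not dominated_by(between, h):
                return False
    return True

def sample_partial_change(tree, i: int, K: int,
                          score: Callable[[object, int, int, int], float],
                          rng=random) -> None:
    """Apply one partial change chi^k_(i,j): list the (j, k) pairs that keep the
    tree a projective dependency tree rooted in node 0, weight them by `score`
    (proportional to Equation 7), and sample one of them."""
    old_parent = tree.parent[i]
    candidates: List[Tuple[int, int]] = []
    weights: List[float] = []
    for j in range(len(tree.words)):
        tree.parent[i] = old_parent           # evaluate candidates against the original tree
        if j == i or is_descendant(tree, j, i):
            continue                          # attaching i below itself would create a cycle
        tree.parent[i] = j                    # tentative reattachment
        if not is_projective(tree.parent):
            continue
        for k in range(1, K + 1):
            candidates.append((j, k))
            weights.append(score(tree, i, j, k))
    tree.parent[i] = old_parent               # restore before the draw
    j, k = rng.choices(candidates, weights=weights)[0]
    tree.parent[i], tree.role[i] = j, k
```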

We investigate two strategies of sampling partial changes:

• Per sentence: We sample a single partial change according to Equation 8 for each sentence in the training corpus. This means that during one pass through the corpus (one iteration) we perform $S$ partial changes.
• Per position: We sample a partial change for each position in each sentence. We perform in total $N = \sum_{s=1}^{S} N_s$ partial changes during one pass. Note that the denominator in Equation 8 is constant for this strategy and can be removed.

We compare both training strategies in Section 6. After enough training iterations, we can estimate the conditional probabilities $\phi^{(k)}_l$ and $\theta^{(p)}_k$ from the actual samples as

$$\phi^{(k)}_l \approx \frac{n^{(w_{s,i} = l \mid r_{s,i} = k)} + \beta}{n^{(\bullet \mid r_{s,i} = k)} + |L|\,\beta} \qquad (9)$$

$$\theta^{(p)}_k \approx \frac{n^{(r_{s,i} = k \mid r_{s,h_s(i)} = p)} + \alpha_k}{n^{(\bullet \mid r_{s,h_s(i)} = p)} + \sum_{m=1}^{K} \alpha_m}. \qquad (10)$$

These equations are similar to Equations 5 and 6, but here the counts $n$ do not exclude any position in the corpus. Note that in the Gibbs sampling equation we assume that the Dirichlet parameters $\alpha$ and $\beta$ are given. We use a fixed point iteration technique described in (Minka, 2003) to estimate them.
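As an illustration of Equations 9 and 10, the following sketch (with assumed count-table layouts, not taken from the paper) turns the counts collected from the sampled trees into smoothed parameter estimates; `n_word_role[k][l]` and `n_role_role[p][k]` stand for $n^{(w_{s,i}=l \mid r_{s,i}=k)}$ and $n^{(r_{s,i}=k \mid r_{s,h_s(i)}=p)}$, with role indices used however the caller indexes them (e.g. 1..K with an unused entry 0).

```python
from typing import Dict, List

def estimate_phi(n_word_role: List[Dict[str, int]], vocab_size: int,
                 beta: float) -> List[Dict[str, float]]:
    """Equation 9: phi_l^(k) ~ (n(w=l | r=k) + beta) / (n(. | r=k) + |L| * beta)."""
    phi: List[Dict[str, float]] = []
    for counts in n_word_role:                 # one count table per role k
        denom = sum(counts.values()) + vocab_size * beta
        phi.append({word: (c + beta) / denom for word, c in counts.items()})
        # words never seen with role k would receive beta / denom; omitted for brevity
    return phi

def estimate_theta(n_role_role: List[List[int]],
                   alpha: List[float]) -> List[List[float]]:
    """Equation 10: theta_k^(p) ~ (n(k | p) + alpha_k) / (n(. | p) + sum_m alpha_m)."""
    alpha_sum = sum(alpha)
    theta: List[List[float]] = []
    for row in n_role_role:                    # one row of child-role counts per parent role p
        denom = sum(row) + alpha_sum
        theta.append([(c + a) / denom for c, a in zip(row, alpha)])
    return theta
```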

Figure 2: Searching the most probable subtrees. (a) The root has two or more children. (b) The root has only one child.

4 Inference

In this section we present two approaches for searching the most probable tree for a given sentence, assuming we have already estimated the parameters $\theta$ and $\phi$.

4.1 Non-deterministic Inference

We use the same sampling technique as for estimating the parameters (Equation 8), i.e. we iteratively sample the partial changes $\chi^k_{(i,j)}$. However, we use Equations 9 and 10 for the predictive distributions of the Dirichlet multinomials instead of Equations 5 and 6. In fact, these equations correspond to the predictive distributions over a newly added word $w_{s,i}$ with the role $r_{s,i}$ in the corpus, conditioned on $w$ and $r$. This sampling technique rarely finds the best solution, but it often gets very near.
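One plausible way to wire the previous sketches together for this non-deterministic inference (our reading, not a procedure spelled out in this excerpt; keeping the best tree seen, the iteration count, and the unseen-word handling are assumptions) is to run the sampler with the fixed estimates of $\theta$ and $\phi$ plugged into `score` and remember the most probable tree encountered:

```python
import copy
import random

def infer_tree(tokens, K, score, theta, phi, iterations=50, rng=random):
    """Sampling-based inference: repeatedly apply partial changes with fixed
    parameters (Equations 9 and 10 inside `score`) and keep the best tree."""
    tree = initial_tree(tokens)                      # from the earlier sketch
    best = copy.deepcopy(tree)
    best_logp = tree_log_prob(tree, theta, phi)      # from the earlier sketch
    for _ in range(iterations):
        for i in range(1, len(tree.words)):          # per-position strategy
            sample_partial_change(tree, i, K, score, rng)
        logp = tree_log_prob(tree, theta, phi)
        if logp > best_logp:
            best, best_logp = copy.deepcopy(tree), logp
    return best
```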

4.2 Deterministic Inference

Here we present a deterministic algorithm that is guaranteed to find the most probable tree for a given sentence. We were inspired by the Cocke-Younger-Kasami (CYK) algorithm (Lange and Leiß, 2009). Let $T^n_{s,a,c}$ denote a subtree of $G_s$ (a subgraph of $G_s$ that is also a tree) containing the subsequence of nodes $\{a, a+1, \ldots, c\}$. The superscript $n$ denotes the number of children the root of this subtree has. We denote the joint probability of the subtree from position $a$ to position $c$ together with the corresponding words, conditioned on the root role $k$, as $P^n(\{w_{s,i}\}_{i=a}^{c}, T^n_{s,a,c} \mid k)$. Our goal is to find the tree $G_s = T^{1+}_{s,0,N_s}$ that maximizes the probability $P(w_s, G_s) = P^{1+}(\{w_{s,i}\}_{i=0}^{N_s}, T^{1+}_{s,0,N_s} \mid 0)$.

Similarly to the CYK algorithm, our approach follows a bottom-up direction and goes through all possible subsequences of a sentence (sequence of words). At the beginning, the probabilities for subsequences of length 1 (i.e. single words) are calculated as $P^{1+}(\{w_{s,a}\}, T^{1+}_{s,a,a} \mid k) = P(w_{s,a} \mid r_{s,a} = k)$. Once it has considered subsequences of length 1, it goes on to subsequences of length 2, and so on. Thanks to the mutual independence of roles under the same parent, we can find the most probable subtree with the root role $k$ and with at least two root children according to $P^{2+}(\{w_{s,i}\}_{i=a}^{c}, T^{2+}_{s,a,c} \mid k) = \max$

b:a