Progress in Exponential Language Models
Diamantino Caseiro

Overview

● Maximum Entropy LM Review
● L1-Regularized and Class-Based Maximum Entropy LMs
● Training and Prediction Optimizations
● Deeper Exponential Language Models

Maximum Entropy Language Model Review

Exponential Language Model (MaxEnt)

● A language model is a statistical model that predicts the probability of a word x in a sentence given the sequence of all previous words h.
● A (conditional) exponential language model is formulated as:

  P(x \mid h) = \frac{\exp\left(\sum_i \lambda_i f_i(h, x)\right)}{Z(h)}, \qquad Z(h) = \sum_{x'} \exp\left(\sum_i \lambda_i f_i(h, x')\right)

● Also known as Multinomial Logistic Regression, MaxEnt, SoftMax.
● Can be seen as a generalization of traditional n-grams.
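A minimal sketch of this probability computation, assuming binary features keyed by name and a small candidate vocabulary; the function and variable names are illustrative, not from the talk:

```python
import numpy as np

def maxent_prob(weights, active_features, vocab):
    """P(x | h) under a MaxEnt LM with binary features.

    weights:         dict feature name -> weight (lambda_i)
    active_features: dict candidate word x -> names of features firing for (h, x)
    vocab:           list of candidate output words
    """
    # Score each candidate: sum the weights of its active features.
    scores = np.array([
        sum(weights.get(f, 0.0) for f in active_features.get(x, []))
        for x in vocab
    ])
    # Softmax normalization over the output vocabulary -- this is the Z(h) term.
    scores -= scores.max()            # subtract max for numerical stability
    probs = np.exp(scores)
    return dict(zip(vocab, probs / probs.sum()))
```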

Example [Biadsy 2014]

Given the sample (h, x) with history h = "<S> 212 is also the area" and label x = "code" (other candidate labels include "codes", ...), the following features fire (see the feature-extraction sketch after this example):

N-gram features:
● 1-gram: <code>
● 2-gram: <area, code>
● 3-gram: <the, area, code>
● …

Skip bigram features:
● skip=1: <the, code>
● skip=2: <also, code>
● skip=3: <is, code>
● ...





[Figure: the resulting feature vector f(h, x) is sparse and binary, with 1s only at the positions of the active n-gram and skip-bigram features.]
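A sketch of how the features in this example could be extracted; the feature-name format and the helper below are illustrative, not from the talk:

```python
def ngram_and_skip_features(h, x, max_order=3, max_skip=3):
    """Return the binary n-gram and skip-bigram feature names for a sample (h, x).

    h: history as a list of words, e.g. ["<S>", "212", "is", "also", "the", "area"]
    x: candidate label, e.g. "code"
    """
    feats = []
    # n-gram features: the last n-1 history words followed by the label x.
    for n in range(1, max_order + 1):
        context = h[max(0, len(h) - n + 1):]
        feats.append("%d-gram:<%s>" % (n, ", ".join(context + [x])))
    # skip bigrams: pair x with the word `skip` positions further back in h.
    for skip in range(1, max_skip + 1):
        if len(h) > skip:
            feats.append("skip=%d:<%s, %s>" % (skip, h[-1 - skip], x))
    return feats

# For the sample above this yields:
#   1-gram:<code>, 2-gram:<area, code>, 3-gram:<the, area, code>,
#   skip=1:<the, code>, skip=2:<also, code>, skip=3:<is, code>
```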

Why Exponential Language Models?

● Elegance:
  ● Features can be interpreted as constraints: we want a model that generates features with the same relative frequency as the training data (see the constraint written out after this list).
  ● It has the maximum entropy of all models that satisfy the constraints, and thus has the potential to generalize well.
● Flexibility: can use any feature.
● Features can be correlated.
● The input vocabulary can be different from the output vocabulary.
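The constraint interpretation from the first bullet, written out in its standard form (with \tilde{p} the empirical distribution over the training data; this formulation is standard, not copied from the slides):

```latex
% Each feature f_i contributes one constraint: the model's expected feature
% count must equal the empirical count observed in the training data.
\sum_{h,x} \tilde{p}(h)\, p(x \mid h)\, f_i(h,x)
  \;=\;
\sum_{h,x} \tilde{p}(h,x)\, f_i(h,x)
  \qquad \text{for every feature } f_i
```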

Why is it not widely used in practice?

● Despite having maximum entropy, it can still overfit very easily.
  ● Needs the addition of a regularization term.
● The regularization term prefers models with small weights:
  ● This prevents very rare features from getting weights high enough to override more frequent features.
● L2 regularization helps training convergence.
● L1 regularization prefers sparse models, but is harder to optimize.
● Often both are used together (L1 + L2).
● Regularization is not critical when cross-validation with early stopping is used.
● Note that the MaxEnt solution is not necessarily unique (e.g. when features are linearly dependent); regularization also makes the optimum unique.
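As a sketch of L1 + L2 in practice, the MaxEnt model can be fit as multinomial logistic regression with an elastic-net penalty, e.g. in scikit-learn; X, y, and the penalty values below are placeholders for illustration, not from the talk:

```python
from sklearn.linear_model import LogisticRegression

# Multinomial logistic regression over the (h, x) feature vectors is exactly a
# MaxEnt model; the 'saga' solver supports an elastic-net penalty (L1 + L2).
maxent = LogisticRegression(
    penalty="elasticnet",
    l1_ratio=0.5,       # mix between L1 and L2 (illustrative value)
    C=1.0,              # inverse regularization strength (illustrative value)
    solver="saga",
    max_iter=1000,
)
# maxent.fit(X, y)      # X: sparse binary feature matrix; y: observed next words
```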

Why is it not widely used in practice?

● Expensive to train (and use):
  ● Training consists of solving a convex optimization problem (a sketch of the per-context normalization cost follows).
● N-grams with Kneser-Ney smoothing (KN-NGram) work just as well.
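Much of the expense comes from the normalizer Z(h), which must be summed over the entire output vocabulary for every context, both during training and at prediction time. A rough sketch of that per-context log-sum-exp (names are illustrative):

```python
import numpy as np

def log_z(weights, features_for_h):
    """log Z(h) for a single context h.

    features_for_h: |V| x |F| binary matrix with one row per candidate word x,
    holding f(h, x).  This vocabulary-sized log-sum-exp must be recomputed for
    every context during training and decoding, which is the main reason
    MaxEnt LMs cost so much more than count-based KN n-grams.
    """
    scores = features_for_h @ weights   # one score per word in the vocabulary
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())
```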



Additional features improve prediction ability (reduce perplexity), but may not lead to significant ASR accuracy improvements (e.g. trigger features: 30% perplexity reduction, yet little corresponding WER gain).