A majorization-minimization algorithm for (multiple) hyperparameter learning
Chuan-Sheng Foo, Chuong B. Do, Andrew Y. Ng
Stanford University
ICML 2009, Montreal, Canada
17 June 2009

Supervised learning
• Training set of m IID examples
  – Labels may be real-valued, discrete, structured
• Probabilistic model
• Estimate parameters
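In standard (assumed) notation, the setup is:

  \text{Training set: } \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}, \text{ drawn IID}
  \text{Model: } p(y \mid x; w)
  \text{Maximum likelihood: } \hat{w} = \arg\max_{w} \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w\big)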

Regularization prevents overfitting
• Regularized maximum likelihood estimation
  – Example: L2-regularized logistic regression, with regularization hyperparameter C
• Also maximum a posteriori (MAP) estimation
  – Objective: data log-likelihood + log-prior over model parameters
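For concreteness, a sketch of the L2-regularized logistic regression objective in assumed notation; the penalty is, up to constants, the negative log of a zero-mean Gaussian prior on w with precision C:

  \min_{w} \; -\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w\big) \;+\; \frac{C}{2}\,\|w\|^2
  \text{MAP view: } \; w \mid C \sim \mathcal{N}\big(0,\, C^{-1} I\big), \qquad -\log p(w \mid C) = \frac{C}{2}\,\|w\|^2 + \text{const}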

How to select the hyperparameter(s)?
• Grid search
  + Simple to implement
  − Scales exponentially with # hyperparameters
• Gradient-based algorithms
  + Scale well with # hyperparameters
  − Non-trivial to implement
Can we get the best of both worlds?

Our contribution
• Striking ease of implementation
  – Simple, closed-form updates for C
  – Leverage existing solvers
• Scales well to the multiple-hyperparameter case
• Applicable to a wide range of models

Outline
1. Problem definition
2. The “integrate out” strategy
3. The Majorization-Minimization algorithm
4. Experiments
5. Discussion

The “integrate out” strategy
• Treat hyperparameter C as a random variable
• Analytically integrate out C
• Need a convenient prior p(C)

Integrating out a single hyperparameter
• For L2 regularization, the prior on w is Gaussian with precision C
• A convenient prior on C: the Gamma distribution
• The result:
  1. C is gone
  2. The new prior is neither convex nor concave in w
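A hedged reconstruction of these steps, assuming a Gaussian prior on w ∈ R^n with precision C and the Gamma parameterization p(C) ∝ C^{α−1} e^{−βC} (the exact constants may differ from the paper's):

  p(w \mid C) \propto C^{n/2} \exp\!\big(-\tfrac{C}{2}\|w\|^2\big), \qquad p(C) \propto C^{\alpha-1} e^{-\beta C}
  p(w) = \int_0^{\infty} p(w \mid C)\, p(C)\, dC \;\propto\; \big(\tfrac{1}{2}\|w\|^2 + \beta\big)^{-(n/2+\alpha)}
  -\log p(w) = \big(\tfrac{n}{2}+\alpha\big) \log\!\big(\tfrac{1}{2}\|w\|^2 + \beta\big) + \text{const}

In one dimension, log(x²/2 + β) is convex near the origin and concave farther out, which is why the new prior is neither convex nor concave in w.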

The Majorization-Minimization algorithm
• Replace hard problem by series of easier ones
• EM-like; two steps:
  1. Majorization: upper-bound the objective function
  2. Minimization: minimize the upper bound

MM1: Upper-bounding the new prior
• New prior:
• Linearize the log:
[Figure: y = log(x) and its first-order Taylor expansions at x = 1, 1.5, and 2, plotted for x in (0, 5]]
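Since log is concave, its first-order Taylor expansion at any z_0 > 0 lies above it, which is the inequality the plot illustrates:

  \log z \;\le\; \log z_0 + \frac{z - z_0}{z_0} \qquad \text{for all } z, z_0 > 0, \text{ with equality at } z = z_0.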

MM2: Solving the resultant optimization problem
• Resultant linearized prior: an L2 term plus terms independent of w
• Get standard L2-regularization!
• Use existing solvers!
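A sketch of how the bound from the previous slide is applied, assuming the integrated-out penalty written earlier: linearizing at z_0 = ½‖w_t‖² + β (the current iterate) yields a standard L2 penalty with an effective hyperparameter,

  \big(\tfrac{n}{2}+\alpha\big) \log\!\big(\tfrac{1}{2}\|w\|^2 + \beta\big) \;\le\; \frac{C_t}{2}\,\|w\|^2 + \text{const}, \qquad C_t = \frac{n/2 + \alpha}{\tfrac{1}{2}\|w_t\|^2 + \beta}

which any existing L2-regularized solver can minimize together with the negative log-likelihood.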

Visualization of the upper bound
[Figure, two panels: y = log(0.5·x² + 1) with the induced upper bounds from expansions at x = 1, 1.5, and 2 (x from −5 to 5), and y = log(x) with its expansions at x = 1, 1.5, and 2 (x from 0 to 5)]

Overall algorithm
1. Closed-form updates for C
2. Leverage existing solvers
Converges to a local minimum
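A minimal Python sketch of this loop, assuming the update C = (n/2 + α) / (½‖w‖² + β) derived above; fit_l2 stands for any existing L2-regularized solver and is an assumed name, not the authors' code:

  import numpy as np

  def mm_hyperparameter_learning(fit_l2, n_params, alpha=0.0, beta=1.0, n_iters=50, tol=1e-6):
      # fit_l2(C) must return the weights minimizing -loglik(w) + (C/2) * ||w||^2
      C = 1.0                                   # initial regularization strength
      w = np.zeros(n_params)
      for _ in range(n_iters):
          w = fit_l2(C)                         # minimization step: reuse an existing L2 solver
          C_new = (n_params / 2.0 + alpha) / (0.5 * float(np.dot(w, w)) + beta)
          if abs(C_new - C) < tol:              # fixed point: local minimum reached
              C = C_new
              break
          C = C_new                             # majorization step: closed-form update for C
      return w, C

The only new code beyond the existing solver is the one-line update for C, which is where the ease of implementation comes from.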

What about multiple hyperparameters?
“To C or not to C. That is the question…”
• Regularization groups: w = (w1, w2, w3, w4, w5), C = (C1, C2), with a mapping from weights to groups
  – NLP: unigram feature weights vs. bigram feature weights
  – RNA secondary structure prediction: hairpin loops vs. bulge loops

What about multiple hyperparameters?
• Separately update each regularization group
  – Sum weights in each group
• Weighted L2-regularization
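With groups G_1, …, G_K (n_k weights in group k), the same derivation as in the single-hyperparameter sketch gives one closed-form update per group and a weighted L2 penalty (again under the assumed Gamma parameterization):

  C_k = \frac{n_k/2 + \alpha}{\tfrac{1}{2}\|w_{G_k}\|^2 + \beta}, \qquad \text{penalty} = \sum_{k=1}^{K} \frac{C_k}{2}\,\|w_{G_k}\|^2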

Experiments
• 4 probabilistic models
  – Linear regression (too easy, not shown)
  – Binary logistic regression
  – Multinomial logistic regression
  – Conditional log-linear model
• 3 competing algorithms
  – Grid search
  – Gradient-based algorithm (Do et al., 2007)
  – Direct optimization of the new objective
• Algorithm run with α = 0, β = 1

Results: Binary Logistic Regression
[Bar chart: test accuracy (50–100%) of Grid, Grad, Direct, and MM on the australian, breast-cancer, diabetes, german-numer, heart, ionosphere, liver-disorders, mushrooms, sonar, splice, and w1a datasets]

Results: Multinomial Logistic Regression
[Bar chart: test accuracy (30–100%) of Grid, Grad, Direct, and MM on the connect-4, dna, glass, iris, letter, mnist1, satimage, segment, svmguide2, usps, vehicle, vowel, and wine datasets]

Results: Conditional Log-Linear Models
• RNA secondary structure prediction
• Multiple hyperparameters
Example: AGCAGAGUGGCGCAGUGGAAGCGUGCUGGUCCCAUAACCCAGAGGUCCGAGGAUCGAAACCUUGCUCUGCUA
         (((((((((((((.......))))..((((((....(((....)))....))))))......))))))))).
[Bar chart: ROC area (0.58–0.65), comparing single and grouped hyperparameters across the Gradient, Direct, and MM methods]

Discussion
• How to choose α, β in the Gamma prior?
  – Sensitivity experiments
  – Simple choice reasonable
  – Further investigation required
• Simple assumptions sometimes wrong
• But competitive performance with Grid, Grad
• Suited for ‘Quick-and-dirty’ implementations

Thank you!