A majorization-minimization algorithm for (multiple) hyperparameter learning
Chuan-Sheng Foo, Chuong B. Do, Andrew Y. Ng
Stanford University
ICML 2009, Montreal, Canada, 17 June 2009
Supervised learning
• Training set of m IID examples
• Labels may be real-valued, discrete, or structured
• Probabilistic model; estimate the parameters

Regularization prevents overfitting
• Regularized maximum likelihood estimation (e.g., L2-regularized logistic regression)
• C is the regularization hyperparameter
• Equivalent to maximum a posteriori (MAP) estimation: data log-likelihood plus log-prior over the model parameters
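The objective on this slide did not survive extraction; a minimal sketch of the standard form it refers to, with C the regularization hyperparameter and (x^{(i)}, y^{(i)}) the m training examples (notation assumed, not taken from the slides):

  \min_{w}\; -\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w\big) \;+\; \frac{C}{2}\,\|w\|_2^2

Equivalently, MAP estimation maximizes the data log-likelihood plus the log-prior \log p(w \mid C), where for L2 regularization p(w \mid C) = \mathcal{N}(w;\, 0,\, C^{-1} I).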
How to select the hyperparameter(s)?
• Grid search
  + Simple to implement
  − Scales exponentially with # hyperparameters
• Gradient-based algorithms
  + Scales well with # hyperparameters
  − Non-trivial to implement
Can we get the best of both worlds?
Our contribution
• Striking ease of implementation
  – Simple, closed-form updates for C
  – Leverage existing solvers
• Scales well to the multiple-hyperparameter case
• Applicable to a wide range of models
Outline
1. Problem definition
2. The “integrate out” strategy
3. The Majorization-Minimization algorithm
4. Experiments
5. Discussion
The “integrate out” strategy
• Treat the hyperparameter C as a random variable
• Analytically integrate out C
• Need a convenient prior p(C)
Integrating out a single hyperparameter
• For L2 regularization, p(w | C) is Gaussian with precision C
• A convenient prior p(C): the Gamma(α, β) distribution
• The result (sketched below):
  1. C is gone
  2. The new prior is neither convex nor concave in w
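The formulas on this slide were lost in extraction; the following reconstructs the standard computation, assuming p(w \mid C) = \mathcal{N}(0, C^{-1} I) over n parameters and a Gamma(α, β) prior on C (the paper's exact constants may differ):

  p(w) = \int_0^\infty p(w \mid C)\, p(C)\, dC
       \;\propto\; \int_0^\infty C^{\,n/2+\alpha-1} \exp\!\big(-C\,(\beta + \tfrac{1}{2}\|w\|_2^2)\big)\, dC
       \;\propto\; \big(\beta + \tfrac{1}{2}\|w\|_2^2\big)^{-(n/2+\alpha)}

so, up to additive constants, the new negative log-prior is

  -\log p(w) = \big(\tfrac{n}{2}+\alpha\big)\,\log\!\big(\beta + \tfrac{1}{2}\|w\|_2^2\big),

which no longer involves C and is neither convex nor concave in w (compare the curve log(0.5x² + 1) plotted later in the talk, corresponding to β = 1).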
The Majorization-Minimization algorithm
• Replace a hard problem by a series of easier ones
• EM-like; two steps:
  1. Majorization: upper-bound the objective function
  2. Minimization: minimize the upper bound
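In symbols (a generic statement of MM, not the paper's notation): given an objective f(w), each iteration builds a surrogate g(w \mid w_t) with

  g(w \mid w_t) \ge f(w) \;\text{ for all } w, \qquad g(w_t \mid w_t) = f(w_t),

and sets w_{t+1} = \arg\min_w g(w \mid w_t). This guarantees f(w_{t+1}) \le g(w_{t+1} \mid w_t) \le g(w_t \mid w_t) = f(w_t), so the objective never increases.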
MM1: Upper-bounding the new prior
• New prior: the log term obtained by integrating out C
• Linearize the log (first-order expansion)
[Figure: log(x) and its tangent-line expansions at x = 1, 1.5, and 2]
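The linearization uses the concavity of the logarithm; a sketch, with x_0 the expansion point (notation assumed):

  \log x \;\le\; \log x_0 + \frac{x - x_0}{x_0} \qquad \text{for all } x, x_0 > 0.

Applying this with x = \beta + \tfrac{1}{2}\|w\|_2^2 and x_0 = \beta + \tfrac{1}{2}\|w_t\|_2^2 (the current iterate) turns the log term of the new prior into a quadratic, i.e. L2, penalty in w that is tight at w_t.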
MM2: Solving the resultant optimization problem
• The resultant linearized prior is a quadratic in w, plus terms independent of w
• We get standard L2-regularization, so existing solvers can be used!
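Under the Gamma-prior sketch above, the minimization step at iteration t is (a reconstruction; constants may differ from the paper):

  w_{t+1} = \arg\min_w \; -\sum_{i=1}^m \log p\big(y^{(i)} \mid x^{(i)}; w\big) + \frac{C_t}{2}\,\|w\|_2^2,
  \qquad
  C_t = \frac{n/2 + \alpha}{\beta + \tfrac{1}{2}\|w_t\|_2^2},

which is exactly the standard L2-regularized problem.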
Visualization of the upper bound
[Figure, two panels: left, log(x) with its tangent-line expansions at x = 1, 1.5, and 2; right, log(0.5x² + 1) with the corresponding upper bounds from expansions at x = 1, 1.5, and 2]
Overall algorithm
1. Closed-form updates for C
2. Leverage existing solvers
Converges to a local minimum
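A minimal sketch of the overall loop in Python, assuming binary logistic regression with labels in {0, 1} and the single-hyperparameter Gamma(α, β) update sketched above; fit_l2, mm_hyperparameter_learning, and the update constants are illustrative stand-ins, not the authors' code:

import numpy as np
from scipy.optimize import minimize

def fit_l2(X, y, C, w0):
    # Stand-in for any off-the-shelf L2-regularized logistic regression solver.
    def objective(w):
        z = X @ w
        # Negative log-likelihood for y in {0,1}: sum log(1 + e^z) - y*z, plus (C/2)||w||^2.
        return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * C * (w @ w)
    return minimize(objective, w0, method="L-BFGS-B").x

def mm_hyperparameter_learning(X, y, alpha=0.0, beta=1.0, n_iters=20):
    # Alternate a closed-form update of C with a standard L2-regularized fit.
    n = X.shape[1]                     # number of model parameters
    w = np.zeros(n)
    for _ in range(n_iters):
        C = (n / 2.0 + alpha) / (beta + 0.5 * (w @ w))   # 1. closed-form update for C
        w = fit_l2(X, y, C, w)                           # 2. minimize the upper bound
    return w, C

With α = 0, β = 1 (the setting used in the experiments), the first pass regularizes with C = n/2 and subsequent passes adapt C to the magnitude of the fitted weights.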
What about multiple hyperparameters?
• Regularization groups: w = (w1, w2, w3, w4, w5) mapped to C = (C1, C2)
  (“To C or not to C. That is the question…”)
• Example (NLP): unigram feature weights vs. bigram feature weights
• Example (RNA secondary structure prediction): hairpin loops vs. bulge loops
• Separately update each regularization group: sum over the weights in each group
• The minimization step becomes weighted L2-regularization
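A sketch of the grouped update, assuming an independent Gamma(α, β) prior on each C_k and writing n_k for the number of weights in group k (constants may differ from the paper):

  C_{k,t} = \frac{n_k/2 + \alpha}{\beta + \tfrac{1}{2}\|w_{k,t}\|_2^2},
  \qquad
  w_{t+1} = \arg\min_w \; -\sum_{i=1}^m \log p\big(y^{(i)} \mid x^{(i)}; w\big) + \sum_k \frac{C_{k,t}}{2}\,\|w_k\|_2^2,

i.e., each group's penalty is driven by the sum of squared weights in that group, giving the weighted L2-regularization mentioned above.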
Experiments
• 4 probabilistic models:
  – Linear regression (too easy, not shown)
  – Binary logistic regression
  – Multinomial logistic regression
  – Conditional log-linear model
• 3 competing algorithms:
  – Grid search
  – Gradient-based algorithm (Do et al., 2007)
  – Direct optimization of the new objective
• Algorithm run with α = 0, β = 1
Results: Binary Logistic Regression
[Bar chart: test accuracy (50–100%) of Grid, Grad, Direct, and MM on australian, breast-cancer, diabetes, german.numer, heart, ionosphere, liver-disorders, mushrooms, sonar, splice, and w1a]
Results: Multinomial Logistic Regression
[Bar chart: test accuracy (30–100%) of Grid, Grad, Direct, and MM on connect-4, dna, glass, iris, letter, mnist1, satimage, segment, svmguide2, usps, vehicle, vowel, and wine]
Results: Conditional Log-Linear Models
• RNA secondary structure prediction
• Multiple hyperparameters
• Example sequence: AGCAGAGUGGCGCAGUGGAAGCGUGCUGGUCCCAUAACCCAGAGGUCCGAGGAUCGAAACCUUGCUCUGCUA
  with predicted structure (((((((((((((.......))))..((((((....(((....)))....))))))......))))))))).
[Bar chart: ROC area (0.58–0.65); labels: Single, Gradient, Grouped, Direct, MM]
Discussion
• How to choose α, β in the Gamma prior?
  – Sensitivity experiments
  – A simple choice is reasonable
  – Further investigation required
• Simple assumptions are sometimes wrong, but performance is competitive with Grid and Grad
• Well suited for “quick-and-dirty” implementations
Thank you!