Parameterization and Bayesian Modeling

Andrew Gelman
Dept of Statistics and Dept of Political Science, Columbia University, New York
(Visiting Sciences Po, Paris, 2009–2010)

30 November 2009
Statistical computation and statistical modeling

- The folk theorem
- The Pinocchio principle
- Latent variables, transformations, and Bayes
- Examples:
  - Truncated and censored data
  - Modeling continuous data using an underlying discrete distribution
  - Modeling discrete data using an underlying continuous distribution
  - Parameter expansion for hierarchical models
  - Iterative algorithms and time processes
Example 1: Truncated and censored data

- Sample of size N from a distribution f(y | θ). We only observe the measurements that are less than 200. Out of these N, we observe n = 91 cases where y_i < 200.
- Goal is inference about θ.
- Scenario 1: N is unknown. Use the truncated-data likelihood:

      p(θ | y) ∝ p(θ) [F(200 | θ)]^{−91} ∏_{i=1}^{91} f(y_i | θ)

- Scenario 2: N is known. Use the censored-data likelihood (a numerical sketch of both scenarios follows below):

      p(θ | y, N) ∝ p(θ) [1 − F(200 | θ)]^{N−91} ∏_{i=1}^{91} f(y_i | θ)
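A minimal numerical sketch of the two scenarios, assuming a normal measurement model f(y | θ) = N(θ, 1) and simulated data; the cutoff of 200 follows the slide, while the true θ, sample size, and grid are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated example: true theta near the cutoff so some values go unobserved.
theta_true, cutoff, N = 199.0, 200.0, 100
y_all = rng.normal(theta_true, 1.0, size=N)
y_obs = y_all[y_all < cutoff]          # only measurements below 200 are recorded
n = len(y_obs)

def log_lik_truncated(theta):
    # Each observed y_i has density f(y_i|theta) / F(200|theta); N is not used.
    return (stats.norm.logpdf(y_obs, theta, 1).sum()
            - n * stats.norm.logcdf(cutoff, theta, 1))

def log_lik_censored(theta, N):
    # The N - n unobserved measurements each contribute Pr(y >= 200 | theta).
    return (stats.norm.logpdf(y_obs, theta, 1).sum()
            + (N - n) * stats.norm.logsf(cutoff, theta, 1))

# With a flat prior, the posterior is proportional to the likelihood on a grid.
grid = np.linspace(197, 202, 501)
post_trunc = np.array([log_lik_truncated(t) for t in grid])
post_cens = np.array([log_lik_censored(t, N) for t in grid])
print("posterior mode, truncated model:", grid[post_trunc.argmax()])
print("posterior mode, censored model: ", grid[post_cens.argmax()])
```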
Truncated and censored data: things get weird

- Now suppose N is unknown. A Bayesian has two options:
  - Use the truncated-data model, or
  - Use the censored-data model, averaging over the unknown N:

        p(θ | y) ∝ ∑_{N=91}^{∞} p(N) p(θ) (N choose 91) [1 − F(200 | θ)]^{N−91} ∏_{i=1}^{91} f(y_i | θ)

- If p(N) ∝ 1/N—but only with this prior—the above summation reduces to the truncated-data model (checked numerically below):

      p(θ | y) ∝ p(θ) [F(200 | θ)]^{−91} ∏_{i=1}^{91} f(y_i | θ)

- ?!
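The reduction can be checked numerically. A small sketch, again assuming a normal model; the infinite sum over N is cut off at a large value and p(N) ∝ 1/N is left unnormalized, since only the shape as a function of θ matters. The difference between the two log factors should be constant in θ (equal to −log 91):

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln, logsumexp

n, cutoff = 91, 200.0

def log_mix_factor(theta, N_max=100000):
    # log of sum_{N>=n} (1/N) * C(N, n) * [1 - F(200|theta)]^(N - n)
    N = np.arange(n, N_max + 1)
    log_terms = (-np.log(N)
                 + gammaln(N + 1) - gammaln(n + 1) - gammaln(N - n + 1)
                 + (N - n) * stats.norm.logsf(cutoff, theta, 1))
    return logsumexp(log_terms)

def log_trunc_factor(theta):
    # log of [F(200|theta)]^(-n)
    return -n * stats.norm.logcdf(cutoff, theta, 1)

# The product of f(y_i|theta) appears in both models, so it cancels here;
# the remaining factors should differ only by a constant, not by theta.
for theta in [198.0, 199.0, 200.0, 201.0]:
    print(theta, log_mix_factor(theta) - log_trunc_factor(theta))
```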
Example 2: Modeling continuous data using an underlying discrete distribution

A bimodal distribution modeled using a mixture (latent-indicator sketch below):

[figure]
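A minimal sketch of the latent-variable form of such a mixture, assuming two normal components with a common, known standard deviation and simulated bimodal data (all numbers are illustrative, not the data behind the figure). The Gibbs sampler alternates between the discrete component indicators and the continuous parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated bimodal data (think of two clusters of vote shares).
y = np.concatenate([rng.normal(0.35, 0.06, 300), rng.normal(0.65, 0.06, 200)])
n = len(y)

# Two-component normal mixture with latent indicators z_i in {0, 1}.
mu, sigma, lam = np.array([0.3, 0.7]), 0.08, 0.5   # starting values; sigma known
for draw in range(2000):
    # 1. Sample indicators given parameters.
    log_p = np.column_stack([np.log(1 - lam) + stats.norm.logpdf(y, mu[0], sigma),
                             np.log(lam) + stats.norm.logpdf(y, mu[1], sigma)])
    p1 = 1 / (1 + np.exp(log_p[:, 0] - log_p[:, 1]))
    z = rng.binomial(1, p1)
    # 2. Sample parameters given indicators (flat priors on mu, uniform on lam).
    for k in (0, 1):
        yk = y[z == k]
        if len(yk):
            mu[k] = rng.normal(yk.mean(), sigma / np.sqrt(len(yk)))
    lam = rng.beta(1 + z.sum(), 1 + n - z.sum())

print("component means:", mu, " mixing proportion:", lam)   # last draw shown
```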
Separating into Republicans, Democrats, and open seats

- We took the mixture components seriously . . . and then they came to life!
Example 3: Modeling discrete data using an underlying continuous distribution

- Logistic regression model for voting: Pr(vote Republican) = logit^{−1}(Xβ)
- Latent-variable interpretation: z = Xβ + ε; vote Republican if z > 0 (sketched below)
- Take z seriously as a continuous measure of Republican-ness
- Can look at correlations with other z's, changes over time, . . .
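A small sketch of this latent-variable reading, assuming one predictor, a known β, and simulated votes: given the observed vote, z can be drawn from a logistic distribution centered at Xβ and truncated to the matching side of zero (inverse-CDF sampling):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated data: ideology x, vote = 1 if "Republican".
n = 1000
x = rng.normal(0, 1, n)
beta = np.array([0.2, 1.5])            # intercept and slope (assumed known here)
eta = beta[0] + beta[1] * x            # linear predictor X*beta
vote = rng.binomial(1, stats.logistic.cdf(eta))

# Draw latent z_i = X*beta + e_i, e_i ~ logistic, truncated so that
# z_i > 0 when vote_i = 1 and z_i <= 0 when vote_i = 0.
F0 = stats.logistic.cdf(0, loc=eta)            # Pr(z_i <= 0)
u = rng.uniform(size=n)
lower = np.where(vote == 1, F0, 0.0)           # sample within the allowed interval
upper = np.where(vote == 1, 1.0, F0)
z = stats.logistic.ppf(lower + u * (upper - lower), loc=eta)

# z is a continuous "Republican-ness" score consistent with the observed votes.
print(np.corrcoef(z, x)[0, 1], (z > 0).mean(), vote.mean())
```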
Example 4: Parameter expansion for hierarchical models

- Hierarchical regression with M batches of coefficients:

      y = ∑_{m=1}^{M} X^{(m)} β^{(m)} + error

- Exchangeable prior distributions: within each batch m,

      β_j^{(m)} ∼ N(0, σ_m²), for j = 1, . . . , J_m

- Gibbs sampler for coefficients β and hyperparameters σ gets stuck near σ ≈ 0
- Parameter expansion:

      y = ∑_{m=1}^{M} α_m X^{(m)} β^{(m)} + error

- Run Gibbs on α, β, σ (sketched below for the simplest one-way case)
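A minimal sketch of the parameter-expanded Gibbs sampler in the simplest setting: one batch of coefficients (M = 1), a group-indicator design matrix, and a known data variance; all numerical values are illustrative. Only the products β_j = αη_j and the scale σ_β = |α|σ_η are identified; the extra α step is what keeps the sampler from sticking near zero:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated one-way data: J groups, n_j observations each.
# Model: y_ij ~ N(beta_j, sigma_y^2), beta_j ~ N(0, sigma_beta^2), sigma_y known.
J, n_j, sigma_beta_true, sigma_y = 8, 5, 0.1, 1.0
beta_true = rng.normal(0, sigma_beta_true, J)
y = rng.normal(beta_true[:, None], sigma_y, (J, n_j))

# Parameter-expanded version: y_ij ~ N(alpha * eta_j, sigma_y^2), eta_j ~ N(0, sig2_eta).
alpha, eta, sig2_eta = 1.0, np.zeros(J), 1.0
sigma_beta_draws = []
for draw in range(2000):
    # eta_j | alpha, sig2_eta, y
    prec = n_j * alpha**2 / sigma_y**2 + 1 / sig2_eta
    eta = rng.normal((alpha / sigma_y**2) * y.sum(axis=1) / prec, 1 / np.sqrt(prec))
    # alpha | eta, y (flat prior): regression of y on eta
    s_ee = n_j * (eta**2).sum()
    alpha = rng.normal((y * eta[:, None]).sum() / s_ee, sigma_y / np.sqrt(s_ee))
    # sig2_eta | eta (prior proportional to 1/sig2_eta)
    sig2_eta = (eta**2).sum() / rng.chisquare(J)
    # Save the identified group-level sd.
    sigma_beta_draws.append(abs(alpha) * np.sqrt(sig2_eta))

print("posterior median of sigma_beta:", np.median(sigma_beta_draws[1000:]))
```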
New models inspired by parameter expansion

- y = ∑_{m=1}^{M} α_m X^{(m)} β^{(m)} + error
- Factor-analysis model: α_m is the importance of factor m, and the β_j^{(m)}'s are the factor loadings
- Need informative prior distributions, for example β_j^{(m)} ∼ N(1, 0.2²)
- The coefficients β_j^{(m)} get a new life as latent factor loadings (prior sketch below)
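A small prior-simulation sketch of how the informative loading prior resolves the multiplicative nonidentification: only the products α_m β_j^{(m)} enter the likelihood, and pinning the β's near 1 lets α_m carry the overall importance of the factor (sizes and prior scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Prior simulation for one factor with J = 5 loadings.
J, n_sims = 5, 10000
alpha = rng.normal(0, 10, n_sims)             # diffuse prior on the factor scale
beta = rng.normal(1, 0.2, (n_sims, J))        # informative N(1, 0.2^2) on loadings

# Only the products alpha * beta_j enter the likelihood; with loadings near 1,
# alpha absorbs the common scale and the effects within a factor move together.
effects = alpha[:, None] * beta
print("sd of loadings around 1:", beta.std())
print("prior correlation of two effects within the factor:",
      np.corrcoef(effects[:, 0], effects[:, 1])[0, 1])
```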
New prior distributions inspired by parameter expansion

- Simple hierarchical model: y_ij ∼ N(θ_j, σ_y²), θ_j ∼ N(µ, σ_θ²)
- What's a good prior distribution for σ_θ?
- Redundant parameterization: y_ij ∼ N(µ + α η_j, σ_y²), η_j ∼ N(0, σ_η²)
- Same model: θ_j = µ + α η_j for each j, and σ_θ = |α| σ_η
- Conditionally conjugate prior: normal distribution for α and inv-χ² for σ_η²
- Put these together to get a half-t prior for σ_θ (prior simulation below)
- It really works!
- Similar ideas for covariance matrices
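A prior-simulation sketch of the induced distribution, with illustrative hyperparameters: a N(0, B²) prior on α and a scaled-inv-χ²(ν, s²) prior on σ_η² imply that σ_θ = |α|σ_η follows a half-t distribution with ν degrees of freedom and scale B·s:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Prior simulation: normal prior on alpha, scaled-inv-chi^2 prior on sigma_eta^2.
n_sims, nu, s, B = 100000, 4, 1.0, 2.5          # illustrative hyperparameters
alpha = rng.normal(0, B, n_sims)
sig2_eta = nu * s**2 / rng.chisquare(nu, n_sims)

# Implied prior on the group-level sd sigma_theta = |alpha| * sigma_eta.
sigma_theta = np.abs(alpha) * np.sqrt(sig2_eta)

# This should match a half-t distribution with nu df and scale B*s.
q = [0.5, 0.9, 0.99]
print("simulated quantiles:", np.quantile(sigma_theta, q))
print("half-t quantiles:   ", B * s * stats.t.ppf([(1 + p) / 2 for p in q], nu))
```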
Example 5: Iterative algorithms and time processes

- Iterative simulation turns an intractable spatial process into a tractable space-time process
- MCMC algorithms such as hybrid sampling, in which "position" and "momentum" variables evolve over time (minimal sketch below)
- Simulation of "agent models" to determine equilibrium states in economics
- Connection to the folk theorem
- The traveling salesman problem
- Mapping the Gibbs sampler or Metropolis algorithm to a real-time learning process:
  - Burn-in = "learning"
  - Moving through the distribution = tracing of variation
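A minimal sketch of the hybrid-sampling idea for a standard-normal target (step size and trajectory length are illustrative): position θ and momentum φ evolve together in fictitious time via leapfrog steps, and the start of the simulated series plays the role of burn-in, i.e., "learning":

```python
import numpy as np

rng = np.random.default_rng(6)

# Target: standard normal, log p(theta) = -theta^2 / 2 up to a constant.
def grad_log_p(theta):
    return -theta

def hmc_step(theta, eps=0.2, n_leapfrog=20):
    phi = rng.normal()                          # fresh momentum each iteration
    theta_new, phi_new = theta, phi
    for _ in range(n_leapfrog):                 # leapfrog integration of the dynamics
        phi_new += 0.5 * eps * grad_log_p(theta_new)
        theta_new += eps * phi_new
        phi_new += 0.5 * eps * grad_log_p(theta_new)
    # Metropolis accept/reject on the joint (position, momentum) density.
    log_r = (-0.5 * theta_new**2 - 0.5 * phi_new**2) + 0.5 * theta**2 + 0.5 * phi**2
    return theta_new if np.log(rng.uniform()) < log_r else theta

draws = [5.0]                   # deliberately poor starting point: burn-in = "learning"
for _ in range(2000):
    draws.append(hmc_step(draws[-1]))
print("mean and sd after burn-in:", np.mean(draws[500:]), np.std(draws[500:]))
```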
Summary

- Often we add latent variables that improve computation but do not change the likelihood function
  - New parameters are sometimes partially identified
  - Sometimes completely nonidentified
- The Pinocchio principle
- New classes of prior distributions