Parameterization and Bayesian Modeling

Andrew Gelman
Dept of Statistics and Dept of Political Science, Columbia University, New York
(Visiting Sciences Po, Paris, for 2009–2010)

30 November 2009


Statistical computation and statistical modeling

- The folk theorem
- The Pinocchio principle
- Latent variables, transformations, and Bayes
- Examples:
  - Truncated and censored data
  - Modeling continuous data using an underlying discrete distribution
  - Modeling discrete data using an underlying continuous distribution
  - Parameter expansion for hierarchical models
  - Iterative algorithms and time processes


Example 1: Truncated and censored data

- Sample of size N from a distribution f(y|θ). We only observe the measurements that are less than 200. Out of these N, we observe n = 91 cases where y_i < 200.
- Goal is inference about θ
- Scenario 1: N is unknown. Use the truncated-data likelihood:

  p(θ|y) ∝ p(θ) [F(200|θ)]^{−91} ∏_{i=1}^{91} f(y_i|θ)

- Scenario 2: N is known. Use the censored-data likelihood:

  p(θ|y, N) ∝ p(θ) [1 − F(200|θ)]^{N−91} ∏_{i=1}^{91} f(y_i|θ)
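As a concrete illustration of the two likelihoods, here is a minimal numerical sketch, assuming a normal data model f(y|θ) = N(θ, 1), a flat prior on θ, and the truncation point 200 from the slide; the simulated data, grid, and variable names are illustrative, not from the talk.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed setup (not from the slides): f(y|theta) = N(theta, 1), flat prior on theta
theta_true, N, cutoff = 199.0, 100, 200.0
y_all = rng.normal(theta_true, 1.0, size=N)
y_obs = y_all[y_all < cutoff]            # only measurements below 200 are seen
n = y_obs.size                           # plays the role of the 91 on the slide

theta_grid = np.linspace(197, 201, 801)

def log_post_truncated(theta):
    # p(theta|y) ∝ p(theta) [F(200|theta)]^(-n) * prod_i f(y_i|theta), flat prior
    return (stats.norm.logpdf(y_obs[:, None], theta, 1.0).sum(axis=0)
            - n * stats.norm.logcdf(cutoff, theta, 1.0))

def log_post_censored(theta, N):
    # p(theta|y,N) ∝ p(theta) [1 - F(200|theta)]^(N-n) * prod_i f(y_i|theta), flat prior
    return (stats.norm.logpdf(y_obs[:, None], theta, 1.0).sum(axis=0)
            + (N - n) * stats.norm.logsf(cutoff, theta, 1.0))

print("posterior mode, truncated model:", theta_grid[np.argmax(log_post_truncated(theta_grid))])
print("posterior mode, censored model: ", theta_grid[np.argmax(log_post_censored(theta_grid, N))])
```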


Truncated and censored data: things get weird

- Now suppose N is unknown. A Bayesian has two options:
  - Use the truncated-data model, or
  - Use the censored-data model, averaging over the unknown N:

    p(θ|y) ∝ ∑_{N=91}^{∞} p(N) p(θ) (N choose 91) [1 − F(200|θ)]^{N−91} ∏_{i=1}^{91} f(y_i|θ)

- If p(N) ∝ 1/N, but only with this prior, the above summation reduces to the truncated-data model:

  p(θ|y) ∝ p(θ) [F(200|θ)]^{−91} ∏_{i=1}^{91} f(y_i|θ)

- ?!
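The reduction rests on the identity ∑_{N=91}^{∞} (1/N) (N choose 91) q^{N−91} = (1/91) (1−q)^{−91}, with q playing the role of 1 − F(200|θ). A self-contained numerical check (the value of q is illustrative, not from the talk):

```python
import numpy as np
from scipy.special import gammaln, logsumexp

n = 91                      # number of observed cases, as on the slide
q = 0.3                     # plays the role of 1 - F(200|theta); any value in (0, 1)
Ns = np.arange(n, 20000)    # truncate the infinite sum at a large N

# log of sum_N (1/N) * C(N, n) * q^(N-n), i.e. the censored-data factor averaged
# over N with p(N) proportional to 1/N
log_terms = (-np.log(Ns)
             + gammaln(Ns + 1) - gammaln(n + 1) - gammaln(Ns - n + 1)
             + (Ns - n) * np.log(q))
lhs = logsumexp(log_terms)

# log of (1/n) * (1-q)^(-n), the truncated-data factor times the constant 1/n
rhs = -np.log(n) - n * np.log(1 - q)

print(lhs, rhs)             # agree to numerical precision
```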


Example 2: Modeling continuous data using an underlying discrete distribution

- A bimodal distribution modeled using a mixture [figure]
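The figure is not reproduced here; as a stand-in for the idea, here is a minimal sketch of fitting a two-component normal mixture to bimodal continuous data by EM. The simulated data and starting values are illustrative, not the analysis behind the slide's figure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative bimodal data, e.g. vote shares clustering around two modes
y = np.concatenate([rng.normal(0.35, 0.06, 300), rng.normal(0.68, 0.06, 300)])

# EM for a two-component normal mixture; pi is the weight of component 2
pi, mu, sigma = 0.5, np.array([0.3, 0.7]), np.array([0.1, 0.1])
for _ in range(200):
    # E step: posterior probability that each point belongs to component 2
    d1 = stats.norm.pdf(y, mu[0], sigma[0])
    d2 = stats.norm.pdf(y, mu[1], sigma[1])
    w = pi * d2 / ((1 - pi) * d1 + pi * d2)
    # M step: update mixing proportion, means, and sds using the weights
    pi = w.mean()
    mu = np.array([np.average(y, weights=1 - w), np.average(y, weights=w)])
    sigma = np.array([np.sqrt(np.average((y - mu[0])**2, weights=1 - w)),
                      np.sqrt(np.average((y - mu[1])**2, weights=w))])

print("mixing proportion:", pi)
print("component means:  ", mu)
print("component sds:    ", sigma)
```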


Separating into Republicans, Democrats, and open seats

- We took the mixture components seriously . . . and then they came to life!


Example 3: Modeling discrete data using an underlying continuous distribution

- Logistic regression model for voting: Pr(vote Republican) = logit^{−1}(Xβ)
- Latent variable interpretation: z = Xβ + ε; vote Republican if z > 0
- Take z seriously as a continuous measure of Republican-ness
- Can look at correlations with other z's, changes over time, ...
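A small simulation illustrating the equivalence of the two formulations, assuming ε follows the standard logistic distribution; the X and β values are made up for illustration.

```python
import numpy as np
from scipy.special import expit   # expit(x) = logit^{-1}(x)

rng = np.random.default_rng(2)

# Illustrative linear predictor for a handful of voters (made-up X and beta)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
beta = np.array([0.2, 1.0, -0.8])
xb = X @ beta

# Latent-variable version: z = X*beta + eps, eps ~ standard logistic; vote Republican if z > 0
n_sim = 200_000
eps = rng.logistic(size=(n_sim, 5))
vote_rep = (xb + eps) > 0

print("Pr from latent-variable simulation:", vote_rep.mean(axis=0))
print("Pr from logit^{-1}(X beta):        ", expit(xb))
```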


Example 4: Parameter expansion for hierarchical models

- Hierarchical regression with M batches of coefficients:

  y = ∑_{m=1}^{M} X^{(m)} β^{(m)} + error

- Exchangeable prior distributions: within each batch m,

  β_j^{(m)} ∼ N(0, σ_m²), for j = 1, ..., J_m

- Gibbs sampler for coefficients β and hyperparameters σ gets stuck near σ ≈ 0
- Parameter expansion:

  y = ∑_{m=1}^{M} α_m X^{(m)} β^{(m)} + error

- Run Gibbs on α, β, σ
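The slides do not spell out the sampler; as a sketch of the idea, here is a parameter-expanded Gibbs sampler for the simplest case of a single batch, i.e. a one-way hierarchical model y_ij ∼ N(µ + αη_j, σ_y²), η_j ∼ N(0, σ_η²), with the group-level sd recovered as σ_θ = |α|σ_η. The priors (flat on µ and α, Inv-χ²(1, 1) on both variances) and the simulated data are my assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative one-way data: J groups, n_j observations each (made-up values)
J, n_j, mu_true, sigma_theta_true, sigma_y_true = 8, 5, 3.0, 0.1, 1.0
theta_true = rng.normal(mu_true, sigma_theta_true, J)
y = rng.normal(theta_true[:, None], sigma_y_true, (J, n_j))     # y[j, i]

def px_gibbs(y, n_iter=2000):
    J, n_j = y.shape
    mu, alpha, eta = y.mean(), 1.0, rng.standard_normal(J)
    sigma_y2, sigma_eta2 = 1.0, 1.0
    draws = np.empty((n_iter, 2))                                # store (mu, sigma_theta)
    for t in range(n_iter):
        # mu | rest (flat prior): normal around the mean residual
        mu = rng.normal((y - alpha * eta[:, None]).mean(), np.sqrt(sigma_y2 / y.size))
        # alpha | rest (flat prior): regression of (y - mu) on eta
        num = (eta[:, None] * (y - mu)).sum()
        den = n_j * (eta ** 2).sum()
        alpha = rng.normal(num / den, np.sqrt(sigma_y2 / den))
        # eta_j | rest: conjugate normal combining the data with the N(0, sigma_eta2) prior
        prec = n_j * alpha ** 2 / sigma_y2 + 1.0 / sigma_eta2
        mean = (alpha * (y - mu).sum(axis=1) / sigma_y2) / prec
        eta = rng.normal(mean, np.sqrt(1.0 / prec))
        # variances | rest: scaled-inverse-chi^2 draws (Inv-chi^2(1, 1) priors assumed)
        ssr = ((y - mu - alpha * eta[:, None]) ** 2).sum()
        sigma_y2 = (1.0 + ssr) / rng.chisquare(1 + y.size)
        sigma_eta2 = (1.0 + (eta ** 2).sum()) / rng.chisquare(1 + J)
        draws[t] = mu, np.abs(alpha) * np.sqrt(sigma_eta2)       # sigma_theta = |alpha| * sigma_eta
    return draws

draws = px_gibbs(y)
print("posterior median of mu:         ", np.median(draws[500:, 0]))
print("posterior median of sigma_theta:", np.median(draws[500:, 1]))
```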


New models inspired by parameter expansion

- y = ∑_{m=1}^{M} α_m X^{(m)} β^{(m)} + error
- Factor-analysis model: α_m is the importance of factor m, and the β_j^{(m)}'s are the factor loadings
- Need informative prior distributions, for example β_j^{(m)} ∼ N(1, 0.2²)
- The coefficients β_j^{(m)} get a new life as latent factor loadings
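A tiny generative sketch of this model (all values here are made up for illustration): α_m sets the overall importance of factor m, while the loadings β_j^{(m)} are drawn near 1.

```python
import numpy as np

rng = np.random.default_rng(4)

n, M, J = 200, 3, 4                                  # observations, factors, coefficients per factor
alpha = np.array([2.0, 0.5, 0.0])                    # importance of each factor (illustrative)
X = [rng.normal(size=(n, J)) for _ in range(M)]      # one predictor matrix per batch
beta = [rng.normal(1.0, 0.2, J) for _ in range(M)]   # loadings, drawn from N(1, 0.2^2)

# y = sum_m alpha_m * X^(m) beta^(m) + error
y = sum(alpha[m] * X[m] @ beta[m] for m in range(M)) + rng.normal(size=n)
print(y[:5])
```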


New prior distributions inspired by parameter expansion

- Simple hierarchical model: y_ij ∼ N(θ_j, σ_y²), θ_j ∼ N(µ, σ_θ²)
- What's a good prior distribution for σ_θ?
- Redundant parameterization: y_ij ∼ N(µ + αη_j, σ_y²), η_j ∼ N(0, σ_η²)
  - Same model: θ_j = µ + αη_j for each j, and σ_θ = |α|σ_η
  - Conditionally conjugate prior: normal distribution for α and inv-χ² for σ_η²
  - Put these together to get a half-t prior for σ_θ
  - It really works!
- Similar ideas for covariance matrices
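A quick Monte Carlo check of the half-t construction, with illustrative scale and degrees-of-freedom values: if α ∼ N(0, A²) and σ_η² ∼ scaled-Inv-χ²(ν, s²), then σ_θ = |α|σ_η should match |A·s·t_ν| in distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

A, s, nu, n_sim = 1.5, 2.0, 4, 500_000                   # illustrative scales and df

alpha = rng.normal(0.0, A, n_sim)                        # normal prior on alpha
sigma_eta2 = nu * s**2 / rng.chisquare(nu, n_sim)        # scaled-Inv-chi^2(nu, s^2)
sigma_theta = np.abs(alpha) * np.sqrt(sigma_eta2)        # sigma_theta = |alpha| * sigma_eta

half_t = np.abs(stats.t.rvs(df=nu, scale=A * s, size=n_sim, random_state=6))
print(np.percentile(sigma_theta, [25, 50, 75, 90]))
print(np.percentile(half_t, [25, 50, 75, 90]))           # quantiles should match closely
```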


Example 5: Iterative algorithms and time processes

- Iterative simulation turns an intractable spatial process into a tractable space-time process
- MCMC algorithms such as hybrid sampling (in which “position” and “momentum” variables evolve over time)
- Simulation of “agent models” to determine equilibrium states in economics
- Connection to the Folk Theorem
- The Traveling Salesman Problem
- Mapping the Gibbs sampler or Metropolis algorithm to a real-time learning process
  - Burn-in = “learning”
  - Moving through the distribution = tracing of variation
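As a toy illustration of the last two points (a made-up target and starting point, not an example from the talk): a random-walk Metropolis chain started far from its target first “learns” where the distribution is, then traces out its variation.

```python
import numpy as np

rng = np.random.default_rng(7)

def log_target(x):
    # Standard normal target density, up to a constant (illustrative)
    return -0.5 * x**2

x, chain = 20.0, []                               # deliberately bad starting point
for t in range(5000):
    prop = x + rng.normal(0.0, 1.0)               # random-walk proposal
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x = prop                                  # Metropolis accept/reject
    chain.append(x)

chain = np.array(chain)
print("early iterations ('learning'):", chain[:5].round(1))
print("mean/sd after burn-in (tracing variation):",
      chain[1000:].mean().round(2), chain[1000:].std().round(2))
```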


Summary

- Often we add latent variables that improve computation but do not change the likelihood function
  - New parameters are sometimes partially identified
  - Sometimes completely nonidentified
- The Pinocchio principle
- New classes of prior distributions
