Unsupervised Learning

Lecture 6: Hierarchical and Nonlinear Models

Zoubin Ghahramani
[email protected]
Gatsby Computational Neuroscience Unit, and
MSc in Intelligent Systems, Dept. of Computer Science, University College London
Autumn 2003

Why we need nonlinearities

Linear systems have limited modelling capability.

[Figure: graphical model of a linear-Gaussian state-space model, with hidden states X1, X2, X3, ..., XT and corresponding observations Y1, Y2, Y3, ..., YT.]
Consider linear-Gaussian state-space models. Only certain dynamics can be modelled.
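As an illustration of this model class (a sketch of mine, not from the slides; the transition, observation and noise matrices below are arbitrary choices), the following samples a sequence from a linear-Gaussian state-space model x_t = A x_{t-1} + w_t, y_t = C x_t + v_t:

```python
# Minimal sketch: sampling from a linear-Gaussian state-space model.
# All matrices are arbitrary illustrative choices, not values from the lecture.
import numpy as np

rng = np.random.default_rng(0)
T = 100                                   # sequence length
A = np.array([[0.99, -0.10],              # state transition matrix
              [0.10,  0.99]])
C = np.array([[1.0, 0.0]])                # observation matrix
Q = 0.01 * np.eye(2)                      # state noise covariance
R = 0.10 * np.eye(1)                      # observation noise covariance

x = np.zeros(2)
xs, ys = [], []
for t in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)   # hidden state X_t
    y = C @ x + rng.multivariate_normal(np.zeros(1), R)   # observation Y_t
    xs.append(x)
    ys.append(y)

# Only dynamics expressible as x_t = A x_{t-1} + noise (e.g. damped
# oscillations, exponential decay) can be produced by this model class.
print(np.array(ys)[:5].ravel())
```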





Why we need hierarchical models

Many generative processes can be naturally described at different levels of detail.

[Figure: a hierarchy of levels of description, from high-level causes (e.g. objects, illumination, pose), through intermediate levels (e.g. object parts, surfaces), to low-level features (e.g. edges), and finally the retinal image, i.e. pixels.]

Biology seems to have developed hierarchical representations.

Why we need distributed representations

[Figure: graphical model of a hidden Markov model, with discrete hidden states S1, S2, S3, ..., ST and observations Y1, Y2, Y3, ..., YT.]

Consider a hidden Markov model. To capture N bits of information about the history of the sequence, an HMM requires K = 2^N states!

Factorial Hidden Markov Models and Dynamic Bayesian Networks

[Figure: factorial hidden Markov model, with three hidden chains S_t^(1), S_t^(2), S_t^(3), each evolving over times t-1, t, t+1, and all three jointly generating the observations Y_{t-1}, Y_t, Y_{t+1}.]

[Figure: a dynamic Bayesian network with coupled state variables A_t, B_t, C_t, D_t unrolled over times t, t+1, t+2, ...]

These are hidden Markov models with many state variables (i.e. a distributed representation of the state).
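To make the advantage of a distributed state concrete, here is a small sketch (mine, assuming the chains are a priori independent, as in a factorial HMM): M chains with K states each define a joint process over K^M states, yet only M small K x K transition matrices need to be stored.

```python
# Sketch: M independent chains of K states each give a joint process over
# K**M states, while the factorial parameterisation stores only M small
# K x K transition matrices. Values of M and K are arbitrary choices.
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
M, K = 3, 4                               # 3 chains, 4 states each

def random_transition(k):
    P = rng.random((k, k))
    return P / P.sum(axis=1, keepdims=True)   # make rows sum to 1

chains = [random_transition(K) for _ in range(M)]   # M separate K x K matrices
joint = reduce(np.kron, chains)                     # equivalent flat HMM transition

print(joint.shape)                    # (64, 64): a flat HMM needs K**M = 64 states
print(sum(P.size for P in chains))    # 48 parameters instead of 64 * 64 = 4096
```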

Blind Source Separation

Independent Components Analysis

[Figure: ICA graphical model, with hidden sources X1, ..., XK connected through the mixing matrix Λ to the observations Y1, Y2, ..., YD.]

• P(x_k) is non-Gaussian.
• Equivalently, P(x_k) is Gaussian, with a nonlinearity g(·):

y_d = Σ_{k=1}^{K} Λ_{dk} g(x_k) + ε_d

• For K = D, and observation noise assumed to be zero, inference and learning are easy (standard ICA). Many extensions are possible (e.g. with noise ⇒ IFA).
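As a hedged illustration of the standard noiseless K = D case above, the sketch below (mine; it assumes scikit-learn's FastICA is available, and the Laplacian sources and the particular mixing matrix are arbitrary choices) generates data from the linear model with non-Gaussian sources and then unmixes it:

```python
# Sketch of standard (noiseless, K = D) ICA: mix two non-Gaussian sources with
# a matrix Lambda, then recover them, up to permutation and scaling, with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
N, K = 5000, 2
X = rng.laplace(size=(N, K))              # non-Gaussian (Laplacian) sources x_k
Lam = np.array([[1.0, 0.5],               # mixing matrix Lambda (D = K = 2)
                [0.3, 1.0]])
Y = X @ Lam.T                             # observations y = Lambda x, no noise

ica = FastICA(n_components=K, random_state=0)
X_hat = ica.fit_transform(Y)              # estimated sources (order/sign arbitrary)

# One of these correlations should be close to +/-1 for each true source.
print(np.corrcoef(X[:, 0], X_hat[:, 0])[0, 1],
      np.corrcoef(X[:, 0], X_hat[:, 1])[0, 1])
```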

ICA Nonlinearity

Generative model:

x = g(w)
y = Λx + v

where w and v are zero-mean Gaussian noises with covariances I and R respectively. The density of x can be written in terms of g(·):

p_x(x) = N(0,1)|_{g^{-1}(x)} / |g'(g^{-1}(x))|

For example, if p_x(x) = 1/(π cosh(x)), we find that setting

g(w) = ln tan( (π/4) (1 + erf(w/√2)) )

generates vectors x in which each component is distributed according to 1/(π cosh(x)).

[Figure: plot of the nonlinearity g(w) over w ∈ [−6, 6]; g(w) ranges roughly over [−20, 20].]

So, ICA can be seen either as a linear generative model with non-Gaussian priors for the hidden variables, or as a nonlinear generative model with Gaussian priors for the hidden variables.
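This change of variables is easy to check numerically; the sketch below (mine, not from the slides) pushes standard normal samples through g(·) and compares the empirical histogram with 1/(π cosh(x)):

```python
# Numerical check that x = g(w), with w ~ N(0,1) and
# g(w) = ln tan( (pi/4) * (1 + erf(w / sqrt(2))) ),
# has density 1 / (pi * cosh(x)).
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000)
x = np.log(np.tan((np.pi / 4) * (1 + erf(w / np.sqrt(2)))))

# Compare the empirical histogram with the target density on a grid.
hist, edges = np.histogram(x, bins=200, range=(-5, 5), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
target = 1.0 / (np.pi * np.cosh(centres))
print(np.max(np.abs(hist - target)))      # small, up to Monte Carlo error
```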

Natural Scenes and Sounds

[Figure: probability (log scale, 10^0 down to 10^-4) of linear filter responses to natural scenes and sounds, over filter responses from roughly -500 to 500; the response histogram is plotted together with a Gaussian density for comparison.]

Natural Scenes

[Figure: panels a and b; the original caption is not recoverable from the source.]

Natural Scenes

[Figure: the original caption is not recoverable from the source.]

Natural Movies

[Figure: the original caption is not recoverable from the source.]