Artificial Intelligence: Representation and Problem Solving
15-381, January 16, 2007

Neural Networks
Michael S. Lewicki, Carnegie Mellon

Topics

• decision boundaries
• linear discriminants
• perceptron
• gradient learning
• neural networks


The Iris dataset with decision tree boundaries

[Figure: Iris data plotted as petal width (cm) vs. petal length (cm), with the decision tree boundaries overlaid]

The optimal decision boundary for C2 vs C3



• The optimal decision boundary is determined from the statistical distribution of the classes
  - optimal only if the model is correct!
  - assigns a precise degree of uncertainty to the classification

[Figure: class-conditional densities p(petal length | C2) and p(petal length | C3) above the Iris scatter of petal width (cm) vs. petal length (cm), with the optimal boundary between C2 and C3]

Optimal decision boundary

[Figure: posterior probabilities p(C2 | petal length) and p(C3 | petal length) plotted with the class-conditional densities p(petal length | C2) and p(petal length | C3); the optimal boundary lies where the posteriors cross]

Can we do better?

• The only way to do better is to use more information
• Decision trees (DTs) use both petal width and petal length

[Figure: class-conditional densities p(petal length | C2) and p(petal length | C3) above the Iris scatter of petal width (cm) vs. petal length (cm)]

Arbitrary decision boundaries would be more powerful

• Decision boundaries could be non-linear

[Figure: Iris scatter of petal width (cm) vs. petal length (cm) with a non-linear decision boundary]

Defining a decision boundary

• Consider just two classes
• Want points on one side of the line to be in class 1, otherwise class 2
• 2D linear discriminant function:

  y = m^T x + b = m_1 x_1 + m_2 x_2 + b = Σ_i m_i x_i + b

• This defines a plane over the 2D input space, which leads to the decision:

  x ∈ class 1 if y ≥ 0,  class 2 if y < 0

• The decision boundary: y = m^T x + b = 0. Or in terms of scalars:

  m_1 x_1 + m_2 x_2 = −b  ⇒  x_2 = −(m_1 x_1 + b) / m_2

[Figure: two-class data in the (x_1, x_2) plane with the linear decision boundary]
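To make the decision rule concrete, here is a minimal sketch in NumPy (not from the slides); the weight values m and b are arbitrary placeholders, not fitted to the Iris data.

```python
import numpy as np

def linear_discriminant(x, m, b):
    """y = m^T x + b for a single 2D input x."""
    return np.dot(m, x) + b

def classify(x, m, b):
    """Class 1 if y >= 0, otherwise class 2."""
    return 1 if linear_discriminant(x, m, b) >= 0 else 2

m = np.array([1.0, 2.0])   # placeholder weights m_1, m_2
b = -8.0                   # placeholder bias

print(classify(np.array([4.5, 1.4]), m, b))   # y = -0.7 -> class 2
print(classify(np.array([5.5, 2.2]), m, b))   # y =  1.9 -> class 1
```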

Linear separability



• Two classes are linearly separable if they can be separated by a linear combination of the attributes:
  - 1D: threshold
  - 2D: line
  - 3D: plane
  - M-D: hyperplane

[Figure: Iris scatter of petal width (cm) vs. petal length (cm); one pair of classes is linearly separable, another is not]

Diagramming the classifier as a “neural” network



• The feedforward neural network is specified by weights w_i and a bias b:

  y = w^T x + b = Σ_{i=1}^{M} w_i x_i + b

[Figure: network diagram with input units x_1, x_2, ..., x_M, weights w_1, ..., w_M, a bias b, and a single output unit y]

• It can be written equivalently as

  y = w^T x = Σ_{i=0}^{M} w_i x_i

  where w_0 = b is the bias and x_0 is a “dummy” input that is always 1.

[Figure: the same network redrawn with the bias as weight w_0 on the constant input x_0 = 1]
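A small sketch (not part of the slides) verifying that folding the bias into a weight w_0 on a dummy input x_0 = 1 gives the same output; the numbers are arbitrary placeholders.

```python
import numpy as np

x = np.array([4.5, 1.4])           # inputs x_1, ..., x_M (placeholder values)
w = np.array([1.0, 2.0])           # weights w_1, ..., w_M (placeholder values)
b = -8.0                           # bias

y_explicit = np.dot(w, x) + b      # y = w^T x + b

# Equivalent form: dummy input x_0 = 1 with the bias as weight w_0
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b], w))
y_dummy = np.dot(w_aug, x_aug)     # y = sum_{i=0}^{M} w_i x_i

assert np.isclose(y_explicit, y_dummy)
```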

Determining, i.e. learning, the optimal linear discriminant

• First we must define an objective function, i.e. the goal of learning
• Simple idea: adjust the weights so that the output y(x_n) matches the class c_n
• Objective: minimize the sum-squared error over all patterns x_n:

  E = (1/2) Σ_{n=1}^{N} (w^T x_n − c_n)^2

• Note the notation: x_n denotes a pattern vector, x_n = {x_1, ..., x_M}_n
• We can define the desired class as:

  c_n = 0 if x_n ∈ class 1,  c_n = 1 if x_n ∈ class 2

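As an illustration only, the objective can be written as a short NumPy function; it assumes patterns are stored as rows of X with a leading dummy column of ones (so w_0 is the bias) and classes coded 0/1 as above.

```python
import numpy as np

def sum_squared_error(w, X, c):
    """E = 1/2 * sum_n (w^T x_n - c_n)^2 for the linear network.

    X: (N, M+1) array of patterns, first column all ones (dummy input x_0)
    c: (N,) array of desired classes, 0 for class 1 and 1 for class 2
    """
    y = X @ w                       # network outputs for all patterns
    return 0.5 * np.sum((y - c) ** 2)
```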

We’ve seen this before: curve fitting

t = sin(2πx) + noise

[Figure: data points (x_n, t_n) and the fitted curve y(x_n, w); example from Bishop (2006), Pattern Recognition and Machine Learning]

Neural networks compared to polynomial curve fitting

  y(x, w) = w_0 + w_1 x + w_2 x^2 + · · · + w_M x^M = Σ_{j=0}^{M} w_j x^j

  E(w) = (1/2) Σ_{n=1}^{N} [y(x_n, w) − t_n]^2

• For the linear network, M = 1 and there are multiple input dimensions

[Figure: polynomial fits of different order M to the noisy sin(2πx) data; example from Bishop (2006), Pattern Recognition and Machine Learning]
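To make the comparison concrete, a brief sketch (not from the slides) of least-squares polynomial curve fitting on synthetic sin(2πx) data; the sample size, noise level, and order M are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                                  # number of samples, polynomial order
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)   # noisy targets

# Design matrix: column j holds x^j, so y(x, w) = sum_j w_j x^j
X = np.vander(x, M + 1, increasing=True)

# Least-squares weights minimizing E(w) = 1/2 * sum_n [y(x_n, w) - t_n]^2
w, *_ = np.linalg.lstsq(X, t, rcond=None)

print(w)                                      # fitted coefficients w_0 ... w_M
```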

General form of a linear network



• A linear neural network is simply a linear transformation of the input:

  y_j = Σ_{i=0}^{M} w_{i,j} x_i

• Or, in matrix-vector form: y = Wx
• Multiple outputs correspond to multivariate regression

[Figure: network diagram with inputs x_0 = 1 (bias), x_1, ..., x_M, weights w_{ij}, and outputs y_1, ..., y_K]
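A sketch (not from the slides) of the general linear network as one matrix-vector product; the layer sizes and weights are arbitrary, and the indexing is chosen so that y = Wx with the bias handled by x_0 = 1.

```python
import numpy as np

M, K = 4, 3                                  # number of inputs and outputs (arbitrary)
rng = np.random.default_rng(1)
W = rng.standard_normal((K, M + 1))          # weights; first column acts on x_0 (bias)

x = np.concatenate(([1.0], rng.standard_normal(M)))   # x_0 = 1 plus M inputs
y = W @ x                                    # y = Wx, i.e. y_j = sum_i W[j, i] x_i

print(y.shape)                               # (K,): one output per unit
```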

Training the network: Optimization by gradient descent



• We can adjust the weights incrementally to minimize the objective function.
  - This is called gradient descent (or gradient ascent if we're maximizing).
• The gradient descent rule for weight w_i is:

  w_i^{t+1} = w_i^t − ε ∂E/∂w_i

• Or in vector form:

  w^{t+1} = w^t − ε ∂E/∂w

• For gradient ascent, the sign of the gradient step changes.

[Figure: error contours in the (w_1, w_2) weight plane, with successive gradient steps w^1, w^2, w^3, w^4 descending toward the minimum]

Computing the gradient

• Idea: minimize the error by gradient descent
• Take the derivative of the objective function with respect to the weights:

  E = (1/2) Σ_{n=1}^{N} (w^T x_n − c_n)^2

  ∂E/∂w_i = (2/2) Σ_{n=1}^{N} (w_0 x_{0,n} + · · · + w_i x_{i,n} + · · · + w_M x_{M,n} − c_n) x_{i,n}
          = Σ_{n=1}^{N} (w^T x_n − c_n) x_{i,n}

• And in vector form:

  ∂E/∂w = Σ_{n=1}^{N} (w^T x_n − c_n) x_n
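A minimal sketch of this gradient in NumPy (not the authors' code), using the same conventions as the error function above: rows of X are patterns with a leading dummy 1, and c holds the 0/1 class codes.

```python
import numpy as np

def error_gradient(w, X, c):
    """dE/dw = sum_n (w^T x_n - c_n) x_n for E = 1/2 sum_n (w^T x_n - c_n)^2.

    X: (N, M+1) patterns with a dummy column of ones; c: (N,) class codes (0 or 1).
    """
    residual = X @ w - c           # (w^T x_n - c_n) for every pattern
    return X.T @ residual          # sums (w^T x_n - c_n) * x_n over n
```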

Simulation: learning the decision boundary



• Each iteration updates the weights using the gradient:

  w_i^{t+1} = w_i^t − ε ∂E/∂w_i,   ∂E/∂w_i = Σ_{n=1}^{N} (w^T x_n − c_n) x_{i,n}

• Epsilon is a small value: ε = 0.1/N
• Epsilon too large: learning diverges
• Epsilon too small: convergence is slow

[Figure: the decision boundary in the (x_1, x_2) plane and the learning curve (error vs. iteration), shown at successive stages of training]
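Putting the pieces together, a sketch of the simulation (not the authors' code): batch gradient descent on the sum-squared error with ε = 0.1/N, recording the error each iteration as in the learning curve. The two-class data is synthetic and centered so that this step size converges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2D two-class data standing in for the petal measurements
n_per_class = 50
class1 = rng.normal([4.3, 1.3], 0.3, size=(n_per_class, 2))
class2 = rng.normal([5.5, 2.0], 0.3, size=(n_per_class, 2))
features = np.vstack([class1, class2])
features -= features.mean(axis=0)            # center so this step size converges
N = len(features)

X = np.hstack([np.ones((N, 1)), features])   # dummy input x_0 = 1 for the bias
c = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])  # class codes

eps = 0.1 / N                                # step size, as on the slide
w = np.zeros(3)                              # [w_0 (bias), w_1, w_2]

errors = []
for t in range(15):                          # iterations, as in the learning curve
    residual = X @ w - c                     # w^T x_n - c_n for every pattern
    w = w - eps * (X.T @ residual)           # gradient descent update
    errors.append(0.5 * np.sum(residual ** 2))

print(w)        # learned weights; the decision boundary is w^T x = 0
print(errors)   # decreasing error, i.e. the learning curve
```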


Simulation: learning the decision boundary

• Learning converges onto the solution that minimizes the error.
• For linear networks, this is guaranteed to converge to the minimum.
• It is also possible to derive a closed-form solution (covered later).

[Figure: the converged decision boundary in the (x_1, x_2) plane and the corresponding learning curve]
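The closed-form solution itself is deferred by the slides; as a hedged aside, for this sum-squared-error objective it is the standard least-squares solution, e.g. via the pseudo-inverse:

```python
import numpy as np

def closed_form_weights(X, c):
    """Least-squares weights minimizing E = 1/2 * sum_n (w^T x_n - c_n)^2.

    Standard normal-equations / pseudo-inverse solution (a sketch of the
    'closed-form solution' the slide defers); X is (N, M+1) with a dummy
    column of ones, c is the (N,) vector of class codes.
    """
    return np.linalg.pinv(X) @ c
```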

Learning is slow when epsilon is too small



• Here, larger step sizes would converge more quickly to the minimum

[Figure: error vs. weight w, with small gradient steps slowly approaching the minimum]

Divergence when epsilon is too large



• If the step size is too large, learning can oscillate between different sides of the minimum

[Figure: error vs. weight w, with large gradient steps overshooting the minimum from side to side]

Multi-layer networks



• Can we extend our network to multiple layers? We have:

  y_j = Σ_i w_{i,j} x_i

  z_k = Σ_j v_{j,k} y_j = Σ_j v_{j,k} Σ_i w_{i,j} x_i

• Or in matrix form: z = Vy = VWx

[Figure: two-layer network diagram, inputs x through weights W to hidden units y, then through weights V to outputs z]

• Thus a two-layer linear network is equivalent to a one-layer linear network with weights U = VW.
• It is not more powerful.
• How do we address this?
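A quick numerical check (not from the slides) that the two-layer linear network collapses to a single layer with U = VW; the shapes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((5, 3))   # first-layer weights:  y = Wx (hidden size 5)
V = rng.standard_normal((2, 5))   # second-layer weights: z = Vy (output size 2)
x = rng.standard_normal(3)

z_two_layer = V @ (W @ x)         # z = Vy = VWx
z_one_layer = (V @ W) @ x         # single layer with weights U = VW

assert np.allclose(z_two_layer, z_one_layer)   # identical outputs
```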

Non-linear neural networks



• Idea: introduce a non-linearity f:

  y_j = f( Σ_i w_{i,j} x_i )

  z_k = f( Σ_j v_{j,k} y_j )

• Now, multiple layers are not equivalent
• A simple choice is a threshold non-linearity:

  f(u) = 0 if u < 0,  1 if u ≥ 0

[Figure: the threshold non-linearity plotted as a step function of its input]
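A sketch (not from the slides) of a two-layer network with the threshold non-linearity applied at each layer; the weights are random placeholders, so the outputs are only illustrative.

```python
import numpy as np

def f(u):
    """Threshold non-linearity: 0 for u < 0, 1 for u >= 0."""
    return (u >= 0).astype(float)

rng = np.random.default_rng(3)
W = rng.standard_normal((5, 3))       # first-layer weights (placeholders)
V = rng.standard_normal((2, 5))       # second-layer weights (placeholders)
x = rng.standard_normal(3)

y = f(W @ x)                          # hidden layer: y_j = f(sum_i w_{i,j} x_i)
z = f(V @ y)                          # output layer: z_k = f(sum_j v_{j,k} y_j)

# With the non-linearity, this no longer collapses to a single linear layer.
print(y, z)
```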