Artificial Intelligence: Representation and Problem Solving (15-381), January 16, 2007
Michael S. Lewicki, Carnegie Mellon
Neural Networks
Topics
• decision boundaries
• linear discriminants
• perceptron
• gradient learning
• neural networks
The Iris dataset with decision tree boundaries
[Figure: Iris data, petal width (cm) vs. petal length (cm), with the decision tree boundaries overlaid.]
The optimal decision boundary for C2 vs C3
• optimal decision boundary is determined from the statistical distribution of the classes
• optimal only if model is correct!
• assigns precise degree of uncertainty to classification

[Figure: class-conditional densities p(petal length | C2) and p(petal length | C3), together with the petal width (cm) vs. petal length (cm) scatter of the data.]
Optimal decision boundary
[Figure: posterior probabilities p(C2 | petal length) and p(C3 | petal length) plotted with the class-conditional densities p(petal length | C2) and p(petal length | C3); the optimal boundary is where the posteriors cross.]
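To make the picture concrete, here is a minimal Python sketch, not taken from the lecture, that assumes Gaussian class-conditional densities for petal length (with made-up means, variances, and equal priors) and locates the boundary where the posteriors cross:

```python
import numpy as np

# Hypothetical Gaussian class-conditional densities for petal length;
# the means, variances, and equal priors are illustrative assumptions.
def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(1.0, 7.0, 601)                  # petal length (cm)
like_c2 = gaussian_pdf(x, mu=4.3, sigma=0.5)    # p(petal length | C2)
like_c3 = gaussian_pdf(x, mu=5.6, sigma=0.6)    # p(petal length | C3)
prior_c2 = prior_c3 = 0.5

# Bayes' rule: p(Ck | x) = p(x | Ck) p(Ck) / p(x)
evidence = like_c2 * prior_c2 + like_c3 * prior_c3
post_c2 = like_c2 * prior_c2 / evidence
post_c3 = like_c3 * prior_c3 / evidence

# The optimal decision boundary is where the posteriors cross.
boundary = x[np.argmin(np.abs(post_c2 - post_c3))]
print(f"approximate optimal boundary: petal length = {boundary:.2f} cm")
```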
Can we do better?
• the only way to do better is to use more information
• decision trees use both petal width and petal length

[Figure: class-conditional densities p(petal length | C2) and p(petal length | C3), with the petal width (cm) vs. petal length (cm) scatter.]
Arbitrary decision boundaries would be more powerful
• decision boundaries could be non-linear

[Figure: petal width (cm) vs. petal length (cm) with an arbitrary non-linear decision boundary sketched through the data.]
Defining a decision boundary
• consider just two classes
• want points on one side of the line in class 1, otherwise class 2
• 2D linear discriminant function:

  y = m^T x + b = m_1 x_1 + m_2 x_2 + b = \sum_i m_i x_i + b

• this defines a plane over the 2D input, which leads to the decision rule:

  x ∈ class 1 if y ≥ 0, x ∈ class 2 if y < 0

• the decision boundary:

  y = m^T x + b = 0

  or, in terms of scalars:

  m_1 x_1 + m_2 x_2 = -b  ⇒  x_2 = -(m_1 x_1 + b) / m_2

[Figure: data in the (x_1, x_2) plane with the linear decision boundary drawn.]
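As a concrete sketch (the weights m and bias b below are made-up values, not fitted to the Iris data), the decision rule reads directly in code:

```python
import numpy as np

# Hypothetical 2D linear discriminant; m and b are illustrative only.
m = np.array([1.0, 2.0])
b = -7.0

def classify(x):
    """Return 1 if m.x + b >= 0 (class 1), else 2 (class 2)."""
    y = m @ x + b
    return 1 if y >= 0 else 2

# Points on either side of the boundary m1*x1 + m2*x2 + b = 0,
# i.e. x2 = -(m1*x1 + b) / m2.
print(classify(np.array([5.0, 2.0])))   # -> 1
print(classify(np.array([3.0, 1.0])))   # -> 2
```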
Linear separability
• two classes are linearly separable if they can be separated by a linear combination of attributes:
  - 1D: threshold
  - 2D: line
  - 3D: plane
  - M-D: hyperplane

[Figure: petal width (cm) vs. petal length (cm); one pair of Iris classes is linearly separable, the other pair is not.]
Diagramming the classifier as a “neural” network
• the feedforward neural network is specified by weights w_i and bias b:

  y = w^T x + b = \sum_{i=1}^{M} w_i x_i + b

[Figure: network diagram with input units x_1, ..., x_M, weights w_1, ..., w_M, bias b, and output unit y.]

• it can be written equivalently as

  y = w^T x = \sum_{i=0}^{M} w_i x_i

  where w_0 = b is the bias and x_0 is a "dummy" input that is always 1.

[Figure: the same network with the bias drawn as weight w_0 on the dummy input x_0 = 1.]
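A quick numerical check of the equivalence between the explicit-bias form and the dummy-input form (the weights below are arbitrary placeholders):

```python
import numpy as np

# Arbitrary example weights, bias, and input (illustrative only).
w = np.array([0.5, -1.2, 2.0])
b = 0.3
x = np.array([1.0, 0.5, -2.0])

y_with_bias = w @ x + b                 # y = w^T x + b

# Absorb the bias: prepend w0 = b and a dummy input x0 = 1.
w_aug = np.concatenate(([b], w))
x_aug = np.concatenate(([1.0], x))
y_dummy_input = w_aug @ x_aug           # y = sum_{i=0}^{M} w_i x_i

assert np.isclose(y_with_bias, y_dummy_input)
print(y_with_bias, y_dummy_input)
```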
Determining (i.e., learning) the optimal linear discriminant
• first we must define an objective function, i.e., the goal of learning
• simple idea: adjust the weights so that the output y(x_n) matches the class c_n
• objective: minimize the sum-squared error over all patterns x_n:

  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2

• note the notation: x_n denotes a pattern vector, x_n = \{x_1, \ldots, x_M\}_n
• we can define the desired class as:

  c_n = 0 if x_n ∈ class 1, c_n = 1 if x_n ∈ class 2
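In code, the sum-squared error objective is a short function; the patterns, labels, and weights below are illustrative placeholders:

```python
import numpy as np

def sum_squared_error(w, X, c):
    """E = 1/2 * sum_n (w^T x_n - c_n)^2, with one pattern per row of X
    (bias absorbed as a leading column of ones)."""
    residuals = X @ w - c
    return 0.5 * np.sum(residuals ** 2)

# Illustrative placeholders: 4 patterns, 2 inputs plus a bias column.
X = np.array([[1.0, 2.0, 0.5],
              [1.0, 1.0, 0.2],
              [1.0, 4.5, 1.8],
              [1.0, 5.0, 2.1]])
c = np.array([0.0, 0.0, 1.0, 1.0])   # desired classes
w = np.zeros(3)
print(sum_squared_error(w, X, c))    # -> 1.0 for w = 0 (two targets of 1)
```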
We’ve seen this before: curve fitting
t = sin(2πx) + noise

[Figure: noisy samples t_n of sin(2πx) with a fitted curve y(x_n, w); example from Bishop (2006), Pattern Recognition and Machine Learning.]
Neural networks compared to polynomial curve fitting

  y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j

  E(w) = \frac{1}{2} \sum_{n=1}^{N} [y(x_n, w) - t_n]^2

• for the linear network, M = 1 and there are multiple input dimensions

[Figure: polynomial fits of different orders to the noisy sin(2πx) data; example from Bishop (2006), Pattern Recognition and Machine Learning.]
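For comparison, a minimal least-squares polynomial fit on synthetic sin(2πx) data (the noise level and polynomial order M = 3 are arbitrary choices, and numpy's built-in polyfit stands in for the explicit solution):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: t = sin(2*pi*x) + noise, as in the Bishop example.
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

# Least-squares polynomial fit y(x, w) = sum_j w_j x^j for M = 3.
M = 3
w = np.polyfit(x, t, deg=M)          # coefficients, highest power first
y = np.polyval(w, x)

E = 0.5 * np.sum((y - t) ** 2)       # sum-squared error E(w)
print(f"E(w) = {E:.4f}")
```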
General form of a linear network
• a linear neural network is simply a linear transformation of the input:

  y_j = \sum_{i=0}^{M} w_{i,j} x_i

• or, in matrix-vector form:

  y = W x

• multiple outputs correspond to multivariate regression

[Figure: network diagram with inputs x_0 = 1, x_1, ..., x_M, weights w_{ij}, and outputs y_1, ..., y_K.]
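A matrix-vector sketch of the multi-output linear network, with arbitrarily chosen sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

M, K = 4, 3                           # M inputs (plus dummy x0 = 1), K outputs
W = rng.standard_normal((K, M + 1))   # weight matrix, biases in column 0

x = np.concatenate(([1.0], rng.standard_normal(M)))  # x0 = 1 absorbs the bias
y = W @ x                             # y = Wx, one linear output per row of W
print(y.shape)                        # -> (3,)
```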
Training the network: Optimization by gradient descent
• we can adjust the weights incrementally to minimize the objective function
• this is called gradient descent (or gradient ascent if we are maximizing)
• the gradient descent rule for weight w_i is:

  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}

• or in vector form:

  w^{t+1} = w^t - \epsilon \frac{\partial E}{\partial w}

• for gradient ascent, the sign of the gradient step changes

[Figure: error surface over (w_1, w_2) with successive steps w^1, w^2, w^3, w^4 descending toward the minimum.]
Computing the gradient
• idea: minimize the error by gradient descent
• take the derivative of the objective function with respect to the weights:

  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2

  \frac{\partial E}{\partial w_i} = \frac{2}{2} \sum_{n=1}^{N} (w_0 x_{0,n} + \cdots + w_i x_{i,n} + \cdots + w_M x_{M,n} - c_n)\, x_{i,n}
                                  = \sum_{n=1}^{N} (w^T x_n - c_n)\, x_{i,n}

• and in vector form:

  \frac{\partial E}{\partial w} = \sum_{n=1}^{N} (w^T x_n - c_n)\, x_n
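A vectorized sketch of the gradient and the update rule (function and variable names are mine, not from the slides):

```python
import numpy as np

def gradient(w, X, c):
    """dE/dw = sum_n (w^T x_n - c_n) x_n, with patterns as rows of X."""
    return X.T @ (X @ w - c)

def gradient_step(w, X, c, eps):
    """One gradient descent update: w <- w - eps * dE/dw."""
    return w - eps * gradient(w, X, c)

# Example with placeholder data: 2 patterns, bias column plus one input.
X = np.array([[1.0, 2.0], [1.0, 4.0]])
c = np.array([0.0, 1.0])
w = np.zeros(2)
print(gradient(w, X, c))     # -> [-1., -4.]
```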
Simulation: learning the decision boundary

• each iteration updates the weights using the gradient:

  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}, \qquad \frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} (w^T x_n - c_n) x_{i,n}

• epsilon is a small value, here \epsilon = 0.1 / N
• epsilon too large: learning diverges
• epsilon too small: convergence is slow

[Figure: data in the (x_1, x_2) plane with the current decision boundary, and the learning curve (error vs. iteration).]
Simulation: learning the decision boundary

• learning converges onto the solution that minimizes the error
• for linear networks, this is guaranteed to converge to the minimum
• it is also possible to derive a closed-form solution (covered later)

[Figure: converged decision boundary in the (x_1, x_2) plane and the corresponding learning curve (error vs. iteration).]
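Putting the pieces together, a complete gradient descent loop in the spirit of the simulation; the synthetic two-class data and the 15-iteration budget are illustrative assumptions, while epsilon = 0.1/N follows the slide:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic two-class data in 2D (illustrative; roughly standardized so that
# the fixed step size below behaves well).
N = 100
class1 = rng.normal(loc=[-1.0, -0.5], scale=0.5, size=(N // 2, 2))
class2 = rng.normal(loc=[1.0, 0.5], scale=0.5, size=(N // 2, 2))
X = np.vstack([class1, class2])
X = np.hstack([np.ones((N, 1)), X])          # dummy input x0 = 1 for the bias
c = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

eps = 0.1 / N                                # step size from the slide
w = np.zeros(3)

for t in range(15):
    grad = X.T @ (X @ w - c)                 # dE/dw = sum_n (w^T x_n - c_n) x_n
    w = w - eps * grad
    E = 0.5 * np.sum((X @ w - c) ** 2)
    print(f"iteration {t + 1:2d}: E = {E:.3f}")

print("learned weights:", w)
```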
Learning is slow when epsilon is too small
• here, larger step sizes would converge more quickly to the minimum

[Figure: error curve over w with many small gradient steps slowly approaching the minimum.]
Divergence when epsilon is too large
• if the step size is too large, learning can oscillate between different sides of the minimum

[Figure: error curve over w with large steps overshooting back and forth across the minimum.]
Multi-layer networks
• can we extend our network to multiple layers? We have:

  y_j = \sum_i w_{i,j} x_i

  z_k = \sum_j v_{j,k} y_j = \sum_j v_{j,k} \sum_i w_{i,j} x_i

• or in matrix form:

  z = V y = V W x

• thus a two-layer linear network is equivalent to a one-layer linear network with weights U = VW
• it is not more powerful
• how do we address this?

[Figure: two-layer network diagram, x → W → y → V → z.]
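A quick numerical check that stacking linear layers adds no representational power (layer sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(7)

# Two linear layers with arbitrary sizes: x (5,) -> y (4,) -> z (3,).
W = rng.standard_normal((4, 5))
V = rng.standard_normal((3, 4))
x = rng.standard_normal(5)

z_two_layer = V @ (W @ x)      # z = V(Wx)
U = V @ W                      # collapse both layers into one weight matrix
z_one_layer = U @ x            # z = Ux gives the same output

assert np.allclose(z_two_layer, z_one_layer)
```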
Non-linear neural networks
• idea: introduce a non-linearity:

  y_j = f\left( \sum_i w_{i,j} x_i \right)

  z_k = f\left( \sum_j v_{j,k} y_j \right)

• now, multiple layers are not equivalent
• one choice of f is the threshold: the output y is 0 when the summed input is below the threshold and 1 otherwise

[Figure: plot of the threshold (step) non-linearity.]
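As a sketch of why the non-linearity matters, the classic XOR construction (not from these slides; weights hand-picked, with explicit biases rather than dummy inputs) shows a two-layer threshold network computing a function no single linear unit can:

```python
import numpy as np

def step(u):
    """Threshold non-linearity: 0 below zero, 1 at or above zero."""
    return (u >= 0).astype(float)

# Hand-picked weights for XOR (illustrative assumption, not from the lecture).
W = np.array([[1.0, 1.0],      # hidden unit 1: fires if x1 + x2 >= 0.5
              [1.0, 1.0]])     # hidden unit 2: fires if x1 + x2 >= 1.5
b_hidden = np.array([-0.5, -1.5])
v = np.array([1.0, -1.0])      # output: fires if h1 - h2 >= 0.5
b_out = -0.5

def two_layer(x):
    h = step(W @ x + b_hidden)          # y_j = f(sum_i w_ij x_i)
    return step(v @ h + b_out)          # z = f(sum_j v_j y_j)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", two_layer(np.array(x, dtype=float)))   # XOR truth table
```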