Presented at the 18th Cognitive Science Society Meeting, San Diego, CA, July 12-15, 1996.

Learning Viewpoint Invariant Representations of Faces in an Attractor Network

Marian Stewart Bartlett 1,3 and Terrence J. Sejnowski 1,2,3

1 University of California San Diego; 2 Howard Hughes Medical Institute; 3 The Salk Institute


www: http://www.cnl.salk.edu/~marni    anonymous ftp: ftp.cnl.salk.edu, pub/marni/cg96bart.ps.Z

Abstract

In natural visual experience, different views of an object tend to appear in close temporal proximity as an animal manipulates the object or navigates around it. We investigated the ability of an attractor network to acquire view invariant visual representations by associating first neighbors in a pattern sequence. The pattern sequence contains successive views of faces of ten individuals as they change pose. Under the network dynamics developed by Griniasty, Tsodyks & Amit (1993), multiple views of a given subject fall into the same basin of attraction. We use an independent component (ICA) representation of the faces for the input patterns (Bell & Sejnowski, 1995). The ICA representation has advantages over the principal component representation (PCA) for viewpoint-invariant recognition both with and without the attractor network, suggesting that ICA is a better representation than PCA for object recognition.

Introduction

Recognizing an object or a face despite changes in viewpoint is a challenging problem for computer vision systems, yet people perform the task quickly and easily. In natural visual experience, different views of an object tend to appear in close temporal proximity. Capturing the temporal relationships among patterns is a way to automatically associate different views of an object without requiring complex geometrical transformations or three dimensional structural descriptions (Stryker, 1991). Temporal association may be an important aspect of invariance learning in the ventral visual stream (Rolls, 1995). A temporal window for Hebbian learning could be provided by the 0.5 second open time of the NMDA channel (Rhodes, 1992), or by reciprocal connections between cortical regions (O'Reilly, 1994). Hebbian learning of view invariant representations through temporal association has been demonstrated using idealized input representations (Foldiak, 1991; Weinshall, Edelman & Bulthoff, 1991), and recently with complex inputs such as images of faces (Bartlett & Sejnowski, 1996; Wallis & Rolls, 1996).

Purpose

We explored the development of invariances in an attractor state representation. The sustained activity in attractor networks could support Hebbian learning of temporal associations across much larger time scales than those supported by the open-time of the NMDA channel. Perceptual representations have been related to basins of attraction in activity patterns across an assembly of cells (Amit, 1995). Attractor networks with Hebbian learning mechanisms are capable of acquiring temporal associations between randomly generated patterns (Griniasty, Tsodyks, & Amit, 1993). We investigated the ability of attractor network dynamics to acquire pose invariant representations of faces by forming temporal associations between different views of the faces.

A. Relation to Temporal Lobe Neurons

Neurons in the primate anterior inferior temporal lobe are capable of forming temporal associations in their sustained activity patterns. Following prolonged exposure to a sequence of patterns, correlations emerged in the responses to neighboring patterns in the sequence (Miyashita, 1988).

[Figure 1: schematic of the fixed stimulus sequence (Miyashita, Y., Nature 335: 817-820), and a plot of correlation coefficient (0 to 0.4) against serial separation in the sequence (1st to 10th neighbor).]

Figure 1: Top: Macaques were presented a fixed sequence of 97 fractal patterns for 2 weeks in a delay-match-to-sample task. AIT neurons produced sustained responses to the patterns following removal of the stimulus. Bottom: Autocorrelograms of sustained firing rates to different stimuli plotted as a function of relative position of the stimuli in the training sequence. Responses to stimuli that were neighbors in the training sequence were correlated. Stimulus order was random during testing. Mean autocorrelograms across 57 cells for learned stimuli; means across 17 cells tested with both learned and new stimuli.

B. Input Images

[Figure 2: example face images at poses -30°, -15°, 0°, 15°, and 30°.]

Figure 2: The example set for this simulation consisted of face images of forty subjects at each of five poses, for a total of 200 images (from Beymer, 1994). An image window based on the eye positions in the 0° view was used for all five poses. The images were normalized for luminance and scaled to 20 x 20.

C. Independent Component Representation

We first reduced the dimensionality of the images by performing principal component analysis (PCA) on the set of 200 images. We used the projection onto the first 40 components, which accounted for 94.5% of the variance. Because we were interested in a representation that was sparse and as independent as possible, we subsequently performed independent component analysis (ICA) on the reduced representation. The independent components are found through an unsupervised learning algorithm that maximizes the mutual information between the input and the output of a linear transformation (Bell & Sejnowski, 1995). Maximizing the mutual information between the input and the output is equivalent to maximizing the joint entropy of the output units. The weight matrix for this transformation is found by gradient ascent on the joint entropy of the output units. This process pushes the output units toward statistical independence, since maximizing the joint entropy of the output units consists of maximizing the individual entropies while minimizing the mutual information between them.¹

[Schematic: PCA — the 400 x 200 image matrix is projected onto the first 40 principal components to give a 40 x 200 matrix of PCA coefficients; ICA — the ICA filters are applied to these coefficients to give the 40 x 200 independent component representation.]
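To make the two-stage procedure concrete, the following is a minimal sketch (not the authors' code) of projecting the image set onto its first 40 principal components and then applying an infomax-style ICA with a natural-gradient, logistic-nonlinearity update in the spirit of Bell & Sejnowski (1995). The learning rate, iteration count, and array shapes are illustrative assumptions.

import numpy as np

def pca_reduce(X, n_components=40):
    """X: (n_pixels, n_images) image matrix. Returns (components, coefficients)."""
    Xc = X - X.mean(axis=1, keepdims=True)            # zero-mean each pixel (a common choice)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = U[:, :n_components]                   # (n_pixels, 40)
    coeffs = components.T @ Xc                          # (40, n_images) PCA projections
    return components, coeffs

def infomax_ica(X, lr=0.01, n_iter=2000):
    """Natural-gradient infomax ICA with a logistic nonlinearity.
    X: (n_channels, n_samples); in practice X is usually scaled/whitened first.
    Returns the unmixing (filter) matrix W."""
    n, m = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        u = W @ X                                       # filter outputs
        y = 1.0 / (1.0 + np.exp(-np.clip(u, -50, 50)))  # logistic squashing (clipped for stability)
        # Gradient ascent on the joint entropy of the outputs (natural-gradient form)
        W += lr * (np.eye(n) + (1.0 - 2.0 * y) @ u.T / m) @ W
    return W

# Usage sketch: images is a (400, 200) matrix of 20x20 face images.
# components, pca_coeffs = pca_reduce(images)
# W_ica = infomax_ica(pca_coeffs)
# ica_rep = W_ica @ pca_coeffs                           # (40, 200) ICA representation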

Independent component analysis is a higher-order generalization of principal component analysis, which decorrelates the higher-order as well as the second-order moments of the input (Bell & Sejnowski, to appear). The PCA axes are orthogonal, but not statistically independent. In a task such as face recognition, much of the important information is contained in the high-order statistics of the images. A representational basis in which the high-order statistics are decorrelated may be more powerful for face recognition than one in which only the second-order statistics are decorrelated, as in the Eigenface representation (Turk & Pentland, 1991). ICA also provides a more sparse representation than PCA, which is advantageous for associative memory (Field, 1994).

¹ For more information on ICA, ftp to ftp.cnl.salk.edu, name = anonymous, file = pub/tony/bell.blind.ps.Z.

Figure 3: Top: First forty principal components, ordered left to right, top to bottom. Bottom: Forty independent component basis vectors derived from the 40 principal components. Basis vectors are the images that maximally activate the independent component filters. For PCA the basis vectors are the same as the components themselves.

D. Comparison of PCA and ICA Face Representations

The PCA representation of a face consisted of the 40-dimensional vector containing the projection of the face onto the first forty principal components of the image set [see Figure 3]. Likewise, the ICA representation consisted of the projections onto the 40 independent component weight vectors.

• ICA representation has greater pose invariance than PCA

We assessed the viewpoint invariance of these representations by asking whether different views of the same face were more similar to each other than to the face of another person. As a measure of similarity, we used the cosine of the angle between the representation vectors: the greater the similarity, the higher the cosine. The important comparison for pose invariance is the separability of the same-face and across-face measurements [Figure 4]. Note that these two curves cross for the PCA representation but not for the ICA representation. Same-face/different-face discriminability is measured explicitly by d′, which is the distance between the means of the two distributions in units of standard deviation. d′s for the ICA representation were higher than those for the PCA representation for all changes in pose.

[Figure 4: two panels, mean cosine (left) and d′ (right) plotted against Δ pose (15° to 60°) for the PCA and ICA representations, with same-face and across-face curves.]

Figure 4: Similarity of representations within the same face versus across faces of different people for the PCA and ICA representations. Left: Mean cosine of the angle between face representations is plotted as a function of change in pose (where a Δ pose of 15° includes all 15° differences in pose, such as 30° to 45°). Right: Same-face/different-face discriminability is measured by d′ for four pose differences. Note that d′s are not given for Δ pose = 0° due to unequal variances.
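A rough sketch of how the same-face versus across-face comparison could be computed follows (not the authors' analysis code). The cosine and d′ definitions follow the text; the array layout, the pairing loop, and the pooled-standard-deviation form of d′ are assumptions.

import numpy as np

def cosine(u, v):
    """Cosine of the angle between two representation vectors."""
    den = np.linalg.norm(u) * np.linalg.norm(v)
    return u @ v / (den + 1e-12)          # small epsilon guards against zero vectors

def dprime(reps, delta_pose=1):
    """reps: (n_faces, n_poses, dim) array of face representations.
    Returns d' between same-face and across-face cosine distributions
    for a given change in pose (in units of one pose step, i.e. 15 degrees)."""
    same, across = [], []
    n_faces, n_poses, _ = reps.shape
    for p in range(n_poses - delta_pose):
        q = p + delta_pose
        for i in range(n_faces):
            same.append(cosine(reps[i, p], reps[i, q]))          # same face, different view
            for j in range(n_faces):
                if j != i:
                    across.append(cosine(reps[i, p], reps[j, q]))  # different faces
    same, across = np.array(same), np.array(across)
    pooled_sd = np.sqrt((same.var(ddof=1) + across.var(ddof=1)) / 2)
    return (same.mean() - across.mean()) / pooled_sd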

• Accounting for the difference in pose invariance

Unequal variances in the component projections of the PCA representation did not account for the poor discriminability between same and different faces. d′s for the PCA representation with equalized component variances are plotted in Figure 5. We also examined whether the difference in pose invariance between the PCA and ICA representations was caused by including principal components with poor discriminability. The mid-range principal components have been shown to contain more information than the first few components for discriminating faces (O'Toole et al., 1993). d′s for subsets of the PCA representation are given in the right graph of Figure 5. The 21-40 range gave the best discriminability. The ICA representation continued to have superior discriminability despite equalizing the variance of the PCA filters and then taking the most discriminable component range. One manipulation that did account for the difference in pose invariance, however, was equalizing the means of the principal component filter outputs. This is equivalent to zeroing the mean of each pixel prior to projecting the images onto the component axes.

[Figure 5: left panel, d′ versus Δ pose for ICA, PCA, PCA Eq. Var., and PCA Comp. 21-40; right panel, mean d′ for principal component ranges 1-20 through 25-40.]

Figure 5: Left: d′s for same-face/different-face discriminations. d′s for the PCA and ICA representations are compared to those for PCA with equalized variances (Eq. Var.) and PCA at the most discriminable component range (Comp. 21-40). Right: d′s for different ranges of principal components in bins of 20.
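For concreteness, a small assumed sketch of the two control manipulations just described: equalizing the variances of the PCA component projections, and zeroing the mean of each pixel before projection, which equalizes (zeroes) the means of the PCA filter outputs. Function names and shapes are illustrative.

import numpy as np

def equalize_variances(pca_coeffs):
    """pca_coeffs: (40, n_images) PCA projections. Rescale each component
    to unit variance so no single component dominates the similarity measure."""
    return pca_coeffs / pca_coeffs.std(axis=1, keepdims=True)

def project_zero_mean_pixels(images, components):
    """Zero the mean of each pixel across the image set before projecting,
    which equalizes the means of the PCA filter outputs."""
    centered = images - images.mean(axis=1, keepdims=True)   # (400, n_images)
    return components.T @ centered                             # (40, n_images)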

• ICA face representations are more sparse than PCA

The superior pose invariance of the ICA representation relative to the PCA representation may be the result of the increased sparseness and the statistical independence of the ICA representation. A distribution is sparse if most of the outputs are near zero, with infrequent high-magnitude responses indicated by long tails in the probability distribution [see Figure 6]. The kurtosis (the deviation of the 4th moment of the distribution from that of a Gaussian) measures the spreading of the tails. The ICA filters have a much higher kurtosis than the PCA filters. This result is consistent with the findings of Bell & Sejnowski (in press) with a large set of samples from natural scenes.

[Figure 6: log probability versus filter weight; ICA kurtosis = 13.09, PCA kurtosis = 0.247.]

Figure 6: Probability distributions of PCA and ICA outputs, averaged over all filters of each type.
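The kurtosis measure referred to above can be illustrated with a short assumed snippet: here kurtosis is taken as the standardized fourth moment minus 3, so a Gaussian scores zero, consistent with the small PCA value reported in Figure 6.

import numpy as np

def excess_kurtosis(outputs):
    """outputs: 1-D array of filter responses pooled across images.
    Heavy-tailed (sparse) output distributions score well above 0."""
    z = (outputs - outputs.mean()) / outputs.std()
    return np.mean(z ** 4) - 3.0

# e.g. compare excess_kurtosis(ica_rep.ravel()) with excess_kurtosis(pca_coeffs.ravel())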

• ICA face representations are less correlated than PCA

The statistical dependence among the PCA filters results in high correlations among the face representations [Figure 7]. The PCA filters themselves are orthogonal, but the projections of faces onto the filters are highly correlated. The ICA representations are uncorrelated. The correlations among the PCA face representations are due to unequal means of the PCA filter outputs. Zeroing the mean of each pixel prior to projecting the images onto the component axes removes these correlations.

[Figure 7: scatter plots of the representation of Face A against the representation of Face B. Left panel (PCA representations): mean correlation = .67; components 1-10: .88, 10-20: .40, 20-30: .11, 30-40: .31. Right panel (ICA representations): mean correlation = .005.]

Figure 7: The representation of one face plotted against the representation of another face for four pairs of faces, corresponding to the four colors (symbols). Left: Correlation of PCA face representations. The variance of the PCA filters has been equalized. The mean correlation coefficient across all pairs of the 200 face representations is 0.67. The first ten components are the most correlated. Right: Correlations among the ICA representations.
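As an illustration (not the original analysis code), the mean correlation among face representations could be computed as the average correlation coefficient over all pairs of representation vectors:

import numpy as np
from itertools import combinations

def mean_pairwise_correlation(reps):
    """reps: (n_faces, 40) array, one representation vector per face image.
    Returns the mean correlation coefficient over all pairs of faces."""
    corrs = [np.corrcoef(reps[i], reps[j])[0, 1]
             for i, j in combinations(range(len(reps)), 2)]
    return float(np.mean(corrs))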

E. Attractor Network

We created sparse, binary input vectors by thresholding the ICA representation to select outputs that were substantially different from zero. Each of the 40 ICA filters had two elements in the input vector, one for positive outputs and one for negative outputs, producing an 80-dimensional input vector. Fifty such patterns comprised the input to a fully interconnected network with 80 units. The input sequence consisted of five sequential views of each of ten subjects, for a total of 50 input patterns. This input sequence reflects the tendency in natural viewing to see several views of the same item in close temporal proximity rather than random snapshots of unrelated items. The equations for the symmetric connection matrix and network dynamics are from Griniasty, Tsodyks & Amit (1993). The weight learning rule is:

W_{ij} = \frac{1}{N} \sum_{u=1}^{p} \left[ (\eta_i^u - f)(\eta_j^u - f) + a\left( (\eta_i^{u+1} - f)(\eta_j^u - f) + (\eta_i^u - f)(\eta_j^{u+1} - f) \right) \right]

where N is the number of units, p is the number of patterns, the \eta^u \in \{0, 1\} are the patterns, and f is the coding rate (the fraction of 1's in the patterns). In this simulation, N = 80, p = 50, and f = 0.027. The left-hand component of the equation is a covariance matrix and is equivalent to Hebbian learning. The parameter a adjusts the relative magnitude of the influence of neighboring patterns in the sequence on the connection strength. In these simulations, a was set to zero between subjects. A biological system has access to transitions between objects through eye movement signals, attentional inputs, and motion continuity. The update rule for the activation V_i of unit i at time t is given by:

V_i(t + \delta t) = \Theta\left[ \sum_j W_{ij} V_j(t) - \theta \right]

where \theta is a neural threshold and \Theta(x) = 1 for x > 0, and 0 otherwise. For this simulation, \theta = 0.01.
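A minimal sketch, under stated assumptions, of the network just defined: the Griniasty, Tsodyks & Amit (1993) weight rule with the cross term restricted to neighboring views of the same subject, plus synchronous threshold dynamics iterated to a fixed point. The convergence loop, the same-subject masking, and the synchronous update order are illustrative choices and may differ from the original simulation.

import numpy as np

def build_weights(patterns, f, a, same_subject):
    """patterns: (p, N) binary array; f: coding rate; a: temporal association parameter.
    same_subject[u] is True when pattern u+1 shows the same subject as pattern u,
    so the neighbor term is applied only within a subject's view sequence (a = 0 between subjects)."""
    p, N = patterns.shape
    eta = patterns - f
    W = np.zeros((N, N))
    for u in range(p):
        W += np.outer(eta[u], eta[u])                      # Hebbian (covariance) term
        if u + 1 < p and same_subject[u]:
            W += a * (np.outer(eta[u + 1], eta[u]) +       # associate first neighbors
                      np.outer(eta[u], eta[u + 1]))
    return W / N

def run_to_attractor(W, v0, theta=0.01, max_steps=100):
    """Iterate V_i <- Theta(sum_j W_ij V_j - theta) until a fixed point is reached."""
    v = v0.copy()
    for _ in range(max_steps):
        v_new = (W @ v - theta > 0).astype(float)
        if np.array_equal(v_new, v):
            break
        v = v_new
    return v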

• Temporal associations in the attractor network increase pose invariance

In the attractor network model, a representation of a face is the stable activity state that the network achieves for a given input pattern. Figure 8 demonstrates that same-face/different-face discriminability increases as the temporal association parameter a increases, and becomes more uniform across changes in pose. Figure 9 shows the dependence of discriminability on the temporal association parameter a. Discriminability increases as a increases, and then levels off after a = 2.

[Figure 8: left panel, mean cosine versus Δ pose for a = 0, 1, 2 (same face versus across faces); right panel, d′ versus Δ pose for a = 0, 1, 2 and for the original input patterns.]

Figure 8: Dependence of pose invariance on the temporal association parameter a. Left: Mean cosines of the angle between stable activity states for different views of the same face and for the same change in pose across different faces are plotted for three values of a. Right: d′s are plotted as a function of change in pose for the three values of a and for the original input patterns.

[Figure 9: mean d′ versus the value of a, for a from 0 to 3.]

Figure 9: Mean d′s across all poses for different values of the temporal association parameter a.
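Tying the pieces together, a hedged usage sketch of the experiment described in this section: rebuild the weights for several values of a, relax each input pattern to its attractor, and score same-face/different-face discriminability. The helper names refer to the illustrative sketches above, not to the authors' original code, and the face-major ordering of the patterns is an assumption consistent with the input sequence described earlier.

import numpy as np

def sweep_a(patterns, same_subject, f, a_values, n_faces=10, n_poses=5):
    """For each value of a, compute attractor states for all input patterns
    and return the mean d' across pose differences."""
    scores = {}
    for a in a_values:
        W = build_weights(patterns, f, a, same_subject)
        states = np.array([run_to_attractor(W, v) for v in patterns])
        reps = states.reshape(n_faces, n_poses, -1)      # group attractor states by face
        scores[a] = np.mean([dprime(reps, d) for d in range(1, n_poses)])
    return scores

# e.g. sweep_a(patterns, same_subject, f=0.027, a_values=[0, 0.5, 1, 1.5, 2, 2.5, 3])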

Conclusions

• Attractor network dynamics can acquire visual invariances by temporal association of input patterns. The learning algorithm need only take account of first neighbors in the pattern sequence.

• The independent component representation proved to be superior to the principal component representation for learning pose invariance. The greater sparseness and smaller correlation in the ICA input patterns are advantageous for associative memory. Most significantly, the ICA representation captured more pose invariant aspects of face images than did the PCA representation.

• Important advances in face recognition have employed principal component representations (Cottrell & Metcalfe, 1991; Turk & Pentland, 1991), which decorrelate only the second-order statistics of the input. In a task such as face recognition, much of the important information is contained in the higher order statistics of the images. A representation in which the higher order statistics of the images are decorrelated, such as the ICA representation presented here, may be more powerful for face recognition than representations based on PCA.

References

Amit, D. (1995). The Hebbian paradigm reintegrated: Local reverberations as internal representations. Behav. and Brain Sci. 18: 617-657.

Bartlett, M. Stewart, & Sejnowski, T. (1996). Unsupervised learning of invariant representations of faces through temporal association. In Computational Neuroscience: International Review of Neurobiology Suppl. 1, J.M. Bower, Ed., Academic Press, San Diego, CA: 317-322. ftp: ftp.cnl.salk.edu, pub/marni/cns95.ta.ps.Z

Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Comp. 7: 1129-1159. ftp: ftp.cnl.salk.edu, pub/tony/bell.blind.ps.Z

Bell, A., & Sejnowski, T. (to appear). The independent components of natural scenes are edge filters. Vision Research.

Beymer, D. (1994). Face recognition under varying pose. In Proceedings of the 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA: IEEE Comput. Soc. Press: 756-761.

Cottrell, G. W., & Metcalfe, J. (1991). Face, gender and emotion recognition using holons. In Advances in Neural Information Processing Systems 3, D. Touretzky, Ed., Morgan Kaufmann, San Mateo, CA: 564-571.

Field, D. (1994). What is the goal of sensory coding? Neural Comp. 6: 559-601.

Foldiak, P. (1991). Learning invariance from transformation sequences. Neural Comp. 3: 194-200.

Griniasty, M., Tsodyks, M., & Amit, D. (1993). Conversion of temporal correlations between stimuli to spatial correlations between attractors. Neural Comp. 5: 1-17.

Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature 335: 817-820.

O'Reilly, R., & Johnson, M. (1994). Object recognition and sensitive periods: A computational analysis of visual imprinting. Neural Comp. 6: 357-389.

O'Toole, A., Abdi, H., Deffenbacher, K., & Valentin, D. (1993). Low-dimensional representation of faces in higher dimensions of the face space. Journal of the Optical Society of America A 10(3): 405-411.

Rhodes, P. (1992). The long open time of the NMDA channel facilitates the self-organization of invariant object responses in cortex. Soc. Neurosci. Abst. 18: 740.

Rolls, E. (1995). Learning mechanisms in the temporal lobe visual cortex. Behav. Brain Res. 66: 177-185.

Stryker, M. (1991). Temporal associations. Nature 354: 108-109.

Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. J. of Cog. Neurosci. 3(1): 71-86.

Wallis, G., & Rolls, E. (1996). A model of invariant object recognition in the visual system. Technical Report, Oxford University Department of Experimental Psychology.

Weinshall, D., & Edelman, S. (1991). A self-organizing multiple view representation of 3D objects. Bio. Cyber. 64(3): 209-219.