
Face distinctiveness in recognition across viewpoint: An analysis of the statistical structure of face spaces

Alice J. O'Toole
School of Human Development
The University of Texas at Dallas
Richardson, TX 75093-0688, USA
[email protected]

Abstract

We present an analysis of the effects of face distinctiveness on the performance of a computational model of recognition over viewpoint change. In the first stage of the model, the face stimulus is normalized by being mapped to an arbitrary standard view. In the second stage, the normalized stimulus is mapped into a "face space" spanned by a number of reference faces, and is classified as familiar or unfamiliar. We carried out experiments employing a parametrically generated family of face stimuli that vary in distinctiveness. The experiments show that while the "view-mapping" process operates more accurately for typical than for distinctive faces, the base-level distinctiveness of the faces is preserved in the face-space coding. These data provide insight into how the psychophysically well-established inverse relationship between the typicality and recognizability of faces might operate for recognition across changes in viewpoint.

1 Psychophysical background

The recognition of familiar faces is something that people do very well. This is true even with relatively dramatic changes in faces that can occur daily (e.g., hair style), and over longer periods of time as faces age. Somewhat distinct from the problem of recognizing faces that have changed in various ways is the problem of generalizing this recognition ability across varying viewpoints. In this case, the 3D structure of a face/head is relatively constant. The problem is to determine whether or not we know a face when we see it from different, even completely novel, viewpoints. Notably, the problem of recognition requires the ability to represent the information in individual faces that makes them unique. The problem of generalization across views entails the additional requirement that this unique information be accessible across viewpoint variations. It is evident that individual faces vary in the quality of the uniqueness information they provide for a face recognizer, whether human or computational. More simply stated, individual

Shimon Edelman
Department of Applied Mathematics and Computer Science
The Weizmann Institute of Science
Rehovot 76100, Israel
[email protected]

faces vary in how "distinctive" or unusual they are, and hence in how likely they are to be mistaken for other faces. The relationship between the distinctiveness of a face (as rated by human subjects) and the accuracy with which human observers recognize the face has been well established in the psychological literature: not surprisingly, distinctive or unusual faces are more accurately recognized than are typical faces [9, 10, 15]. This finding has implications both for theoretical accounts of human memory for faces and for more applied issues concerning the factors that affect, e.g., the accuracy of eyewitness identification. From a theoretical perspective, many psychological and computational models of face processing have posited a representation of faces in a "face space," with a prototype/average face at the center, e.g., [2, 14, 15]. By this account, individual faces are encoded in terms of their deviation from the prototype face; typical faces are harder to recognize than unusual faces, because the face space is more "crowded" close to the prototype, making it easier to confuse typical faces with other (un)familiar faces. While these data are well established, they have been collected and applied almost exclusively to the problem of face recognition from a single viewpoint (though see [12] for an exception). These data suggest that human performance depends on the statistical structure of the set of faces to which the observer has been exposed. This observation serves as the main guiding principle behind the model we describe next. This computational model builds on the basic psychological findings and extends them to consider the effects of face distinctiveness for recognition over viewpoint change.

2 Computational background

The central role of the statistics of the stimuli in our model is motivated both by the psychological considerations surveyed above, and by the growing importance attributed to the statistical structure of the visual world in current theories of visual processing. A number of researchers have attempted to derive the shapes of the receptive fields found at the early stages of the visual system from the statistics of natural images ([6]; see [13] for a review). More recently, it has been suggested that a similar approach may be productive at the higher levels of vision, which should be tuned to the statistics of natural objects (such as faces), rather than random scenes [16]. Our model relies on the statistics of a collection of face shapes in two ways. First, the common manner in which images of faces change with viewpoint (due to the common 3D structure of faces) is exploited at the initial stage of the model, which performs normalization of the input image to a "standard" view of the face. The normalized image is then compared to a number of reference faces, which span our version of the face space. At this second stage, the statistics of the collection of faces with respect to a set of reference faces constitute the system's internal representation of the face space. The rest of this section describes the two stages of the model in some detail.
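For concreteness, the two-stage architecture just described can be sketched in Python as follows. This is purely an illustration of the flow of processing, not the model's actual implementation; `W`, `refs`, `sigma`, and `theta` are hypothetical parameters standing in for the learned view-mapper, the reference faces, the similarity width, and the familiarity threshold.

```python
import numpy as np

def recognize(image, W, refs, sigma, theta):
    """Two-stage sketch: (1) linearly view-map the input image to a
    standard view; (2) code it by Gaussian similarities to reference
    faces and threshold the best match for a familiar/unfamiliar
    decision.  All parameters are illustrative assumptions."""
    normalized = image @ W                      # stage 1: view-mapping
    d2 = np.sum((refs - normalized) ** 2, axis=1)
    code = np.exp(-d2 / (2 * sigma ** 2))       # stage 2: face-space code
    return code, bool(code.max() > theta)
```

In a full system, `W` would be learned from known faces (Section 2.1) and the face space would be spanned by learned reference faces (Section 2.2).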

2.1 The view space

As noted frequently in the vision literature, the human visual system is usually able to make sense of a 2D image, even when the object to which it corresponds was never before encountered under that particular combination of viewing conditions. A possible solution to this difficult computational problem is via class-based processing: assuming that the stimulus belongs to a familiar class, the visual system can take advantage of its prior experience with other members of that class in processing the image of a new member. For example, a normalizing transformation that brings familiar members of the class of faces into a normal form can be used to estimate the appearance of a less familiar face from some standard viewpoint, facilitating subsequent recognition of that face [8].

Figure 1: The view-mapper. The way in which known faces change across viewpoint is exploited in deriving a normalized representation of a novel face seen from a familiar orientation.

2.2 The shape space

At the recognition stage, the system must deal with a stimulus that may have been normalized (e.g., by class-based processing), but may still turn out to be unfamiliar, i.e., may not match any of the stimuli for which internal representations are available in long-term memory. Just as the problem of making sense of an unfamiliar viewpoint can be dealt with by exploiting the similarity of the view space of a given face to those of other members of the class of faces, we treat the problem of making sense of unfamiliar shapes by exploiting the similarity structure of the shape space, to which all the members of a certain class of objects (such as faces) belong. This is done by representing explicitly the similarities of the stimulus to a number of reference shapes [4]. The resulting scheme constitutes a useful method for class-based dimensionality reduction [5], and can support the representation of a potentially infinite variety of shapes from a given class.

Figure 2: The entire model. Following normalization of the stimulus image by the view-mapper [7], it is projected into a view-specific face space spanned by a set of reference faces [5].

3 Experiments

We carried out two sets of simulations, the first of which assessed the effects of face distinctiveness on the performance of the normalization procedure, and the second its effects on the quality of the resulting face-space representations.

3.1 The stimuli

To characterize the effect of face distinctiveness on the functioning of the model, we had (1) to quantify the distinctiveness itself, and (2) to obtain a series of faces varying along the distinctiveness dimension. For this latter purpose, one may use synthetic parametrically controlled shapes [3, 8], or derive the parameter space from a set of real faces.


Figure 3: The nine faces used in the generation of the stimulus set.

Figure 4: The weights of the nine faces used in the generation of the stimulus set, in the space of the first two eigenfaces.

Because synthetic faces offer only a crude approximation to the rich 3D structure of the human face (unless a large investment is made in computer graphics), we decided to derive the dimensions of the shape space from a principal component analysis (PCA) of nine 3D laser scans of human faces (see Figure 3; three of these are distributed with SGI systems, and the other six are available over the Internet, courtesy of Cyberware Inc., as a part of their demonstration software).1 This approach to the parameterization of the face spaces leads to a natural quantification of distinctiveness in terms of the parameter-space distance between a given face and the mean face, the parameters of a face being its projections onto the eigenfaces obtained by the PCA. The locations of the nine faces used in the PCA in the subspace spanned by the first two eigenfaces appear in Figure 4. We used eight of the faces2 to generate 80 face stimuli, in the following manner. For each of the eight points in the face space, 10 versions were generated, corresponding to 10 equally spaced locations along the line connecting that point to the origin. For convenience and for later reference, faces numbered 1, 11, 21, ..., 71 were the least distinctive versions of the eight faces, while faces 10, 20, ..., 80 were the most distinctive versions of these faces. Each of the 80 faces was rendered from four viewpoints, starting at a full-face orientation, and proceeding by rotation around the vertical axis (in 22.5° increments) to 67.5°.

1 A similar approach to the generation of parametrically controlled face stimuli has been recently proposed in [1]. Also, a low-dimensional PCA-based representation of faces has been shown to be useful for quantifying gender, a part of the categorical information available in faces [11].
2 Omitting face P, whose direction relative to the origin in the face space nearly coincided with that of Ha.
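The stimulus construction just described can be sketched as follows. This is a schematic reconstruction under stated assumptions: random vectors stand in for the laser scans, the PCA uses all faces passed in, and the full-scale version of each face is the face itself.

```python
import numpy as np

def make_stimuli(faces, n_versions=10):
    """Generate graded-distinctiveness stimuli: for each face, place
    `n_versions` equally spaced points on the line from the mean face
    (the origin of the face space) to the face itself.
    `faces`: (n, d) array of flattened head-shape vectors."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # eigenfaces: principal axes of the centered face set
    _, _, eigenfaces = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ eigenfaces.T        # parameters of each face
    versions = []
    for c in coords:
        for k in range(1, n_versions + 1):
            scale = k / n_versions          # 0.1 (typical) .. 1.0 (distinct)
            versions.append(mean + (scale * c) @ eigenfaces)
    return np.array(versions)

rng = np.random.default_rng(1)
faces = rng.standard_normal((8, 300))       # stand-ins for the 8 scans
stimuli = make_stimuli(faces)               # 8 faces x 10 versions = 80
```

The distinctiveness of version k of a face is then simply the parameter-space distance scale * ||c|| from the mean face, matching the quantification above.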

3.2 Procedure

Our procedure was designed to assess the effects of distinctiveness at both stages of the computational model. Intuitively, the quality of the view-mapped faces is expected to be best for typical faces (i.e., the ones that are closest to the average face). Typical faces are expected, however, to be the most difficult to recognize, while distinct faces, due to their location in the face space, should have the advantage at the second stage of the model's representation.

3.2.1 Face distinctiveness and the view-mapper

Separate linear view-mappers were trained to produce estimates of the full-face view from each of three other views: 22.5°, 45°, and 67.5°. To test the generalization performance of the view-mappers, we employed standard "leave-one-out" cross-validation: a view-mapper was trained with all 10 distinctiveness versions of seven faces and was tested with all 10 distinctiveness versions of the "left-out" face. This procedure was repeated for all eight faces, resulting in view-mapped full-face estimates for all eight faces from each of the three views.
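This leave-one-out procedure can be sketched as below. The linear view-mapper is fit here by ordinary least squares, and quality is the cosine measure used in Analysis 1; the array shapes and the fitting method are our illustrative assumptions, not a description of the original code.

```python
import numpy as np

def loo_view_map_quality(side, front):
    """Leave-one-face-out test of a linear view-mapper.
    side, front: (n_faces, n_versions, d) arrays of view vectors.
    Returns (n_faces, n_versions) cosines between each true frontal
    view and its view-mapped estimate."""
    n_faces, n_versions, d = side.shape
    cosines = np.empty((n_faces, n_versions))
    for i in range(n_faces):
        keep = [j for j in range(n_faces) if j != i]
        # least-squares linear map trained on the other faces' versions
        W, *_ = np.linalg.lstsq(side[keep].reshape(-1, d),
                                front[keep].reshape(-1, d), rcond=None)
        est = side[i] @ W
        cosines[i] = np.sum(est * front[i], axis=1) / (
            np.linalg.norm(est, axis=1) * np.linalg.norm(front[i], axis=1))
    return cosines
```

When the frontal view really is a linear function of the input view, the held-out cosines approach 1; distinctiveness effects would then appear as systematic deviations from 1 across versions.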


Figure 5: The performance of the view-mapper declines with face distinctiveness and with the disparity between the input and normal views.

Analysis 1. We first assessed the quality of the view-mapped face estimates as a function of face distinctiveness. View-map quality was measured as the cosine of the angle between the original full-face view and the view-mapper's estimate of this view (both defined as vectors). The results (Figure 5) show that: (1) view-map quality declines as view-map angle increases; and (2) view-map quality declines as face distinctiveness increases (i.e., typical faces were better preserved than distinct faces in the normalization process, as expected).

Analysis 2. Recognition of faces across viewpoint depends not only on the quality of the normalized (view-mapped) face estimate, but also, critically, on the extent to which the structure of the face space is preserved across the normalization transformations. We examined this latter issue by analyzing the Procrustes distortion between the original full-face views and their view-mapped versions. This was done by applying Procrustes transformations3 to compare the similarity of original and view-mapped configurations, in which each face was represented by its coordinates in the space of the two leading eigenvectors derived from the face images. The Procrustes distance (the residual that remains after the application of the optimal transformation, which measures the discrepancy between the two configurations) was 2.91 for the 22.5° view-map condition, 3.183 for the 45° view-map condition, and 4.04 for the 67.5° view-map condition, all significantly better than estimates of the expected random distance, obtained by bootstrap, indicating the preservation of the original similarity structure of the face space by the view-mappers.4

Analysis 3. Finally, we examined the extent to which face distinctiveness influenced the distortion of the face space under view-mapping, by comparing Procrustes distances between the original frontal views and view-mapped versions of the faces for different levels of distinctiveness (see Figure 6). We found that the face-space distortion increased with the size of the view change. In the two smaller view-change conditions, the distortion was lower than the expected random distortion, estimated by bootstrap, in all 10 distinctiveness cases.5 Moreover, there was a relatively consistent relationship between face-space distortion and distinctiveness, with the lowest distortion for the least and the most distinct faces. Thus, while Figure 5 shows that view-map quality declines with increasing distinctiveness, the extent to which the structure of the similarity space is preserved does not follow a similar decline. Note that the rise in the distortion with distinctiveness suggests that the view-mapper loses more information from the distinct faces than from the typical faces. There is, however, more uniqueness information in the distinct faces to begin with; this effect, apparently, more than cancels the previous one, resulting in a downward trend in the Procrustes distortion as the distinctiveness continues to grow.

3 The optimal combination of scale, rotation, translation, and reflection that minimizes the sum of squared distances between the corresponding points of two configurations.
4 Note that the above analysis was concerned with the preservation of the information in face images, rather than in the 3D head data. Procrustes analysis of the relationship between the similarity space of the 3D head data and that of its 2D representation (a full-face view) indicated that the 3D head and 2D view face spaces did not match well. In other words, view-based and 3D face codes make rather different predictions about the distinctiveness of individual faces; cf. [11].
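The Procrustes residual of footnote 3 can be computed as in the following sketch, a standard formulation in which both configurations are first normalized for translation and scale (our illustration, not the authors' code; the raw distances reported above were evidently computed under a different normalization, so only relative comparisons carry over).

```python
import numpy as np

def procrustes_distance(X, Y):
    """Residual discrepancy between two point configurations after the
    optimal translation, scaling, and rotation/reflection of Y onto X.
    X, Y: (n_points, dims) arrays; returns a value in [0, 1]."""
    X0 = X - X.mean(axis=0)                 # remove translation
    Y0 = Y - Y.mean(axis=0)
    X0 = X0 / np.linalg.norm(X0)            # remove scale
    Y0 = Y0 / np.linalg.norm(Y0)
    s = np.linalg.svd(X0.T @ Y0, compute_uv=False)
    # the optimal rotation/reflection aligns the configurations; the
    # leftover sum of squared distances is 1 - (sum of singular values)^2
    return 1.0 - s.sum() ** 2
```

By construction the distance is zero whenever one configuration is a similarity transform (translated, scaled, rotated, or reflected copy) of the other.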


3.2.2 Distinctiveness and the face space

The effects of face distinctiveness on the face-space representation were examined by projecting novel faces onto a set of reference faces and analyzing the resulting representations. We used 40 faces to train a Radial Basis Function (RBF) network. These reference faces were interleaved by distinctiveness (i.e., every other face: 1, 3, 5, ..., 11, 13, etc.), comprising 5 out of the 10 distinctiveness versions of the 8 original faces. The remaining 40 faces served for testing and were projected into the face space spanned by the responses of the reference-face RBF modules. To assess the effects of face distinctiveness on the discriminability of novel faces projected into the face space, we plotted the corresponding projections directly, for different levels of distinctiveness (Figure 7). As expected, the face projections show maxima along the diagonals, due to the fact that these novel test faces were "neighbors" in the distinctiveness space to the learned faces. The extent to which there is activation off the diagonals is an indication that the model projections are confusable with other "non-target" faces. The plotted data can therefore be seen to represent a confusion table of sorts. Note, first, that the relatively high activation levels on the diagonal indicate that the similarity of the test faces to their neighbors in the learned set was sufficient to activate the RBF nodes of the learned neighbors. Of more direct interest, however, is the decrease in off-diagonal activation in the projection patterns for our parametrically more distinct face versions, effectively indicating lesser confusability of the distinct faces with other faces.

5 The largest view change condition was different, so we will not interpret it further. The apparent difference between this result and that reported for the large view change condition in Analysis 2 is likely to be due to the loss of statistical power incurred in analyzing 10 faces, grouped by distinctiveness, as opposed to 80 faces.
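A minimal sketch of this confusability analysis follows; the random face directions, the RBF width `sigma`, and the distinctiveness scales are illustrative assumptions of ours, not the paper's parameters.

```python
import numpy as np

def project(faces, refs, sigma=0.6):
    """Face-space code: each row holds one test face's RBF activations
    over the reference faces."""
    d2 = ((faces[:, None, :] - refs[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
dirs = rng.standard_normal((8, 50))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # 8 face directions

# reference faces: five distinctiveness levels of each face direction
refs = np.vstack([s * dirs for s in (0.2, 0.4, 0.6, 0.8, 1.0)])

def off_diag_mean(act):
    """Mean activation of non-target reference faces (confusability)."""
    target = np.tile(np.eye(8, dtype=bool), (1, 5))    # same-face refs
    return act[~target].mean()

typical = project(0.3 * dirs, refs)     # test faces near the mean face
distinct = project(0.9 * dirs, refs)    # test faces far from the mean
```

In this toy setup the off-diagonal (non-target) activation comes out lower for the distinct test faces than for the typical ones, mirroring the reduced confusability of distinct faces reported above.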


Figure 6: Procrustes distance between original and view-mapped faces as a function of face distinctiveness version and view-map condition.


Figure 7: Face space projections for four levels of face distinctiveness, top left least distinctive, top right second most distinctive, etc. (the plot for the fifth level of distinctiveness, omitted to save space, was similar to the fourth one).

4 Summary

We have presented a computational model of face recognition that is sensitive to the statistical characterization of faces on a number of levels, mirroring a similar sensitivity of human observers, and of numerous other models. In spite of the importance of this issue and its potential for bringing together human and computational data on face processing, the effects of individual face distinctiveness on the accuracy and nature of face processing have been little investigated in the context of computational models. Our preliminary investigation into this matter indicates that in recognizing faces over changes in viewpoint, the effects of face distinctiveness seem to operate paradoxically. The normalization process we apply to standardize viewpoint, while useful for preserving the richness of the perceptual information in faces, operates most efficiently for faces that are lacking in highly distinct perceptual information. The coding of faces in terms of their similarity structure with respect to a set of reference faces, while operating at a level of abstraction beyond the richness of the perceptual representation, retains information about the distinctiveness of faces. At this level, the confusability of a face with other faces is directly dependent on the statistical characteristics of the entire set of faces, and can be used to make psychophysical predictions about individual faces.

References

[1] J. J. Atick, P. A. Griffin, and A. N. Redlich. The vocabulary of shape: principal shapes for probing perception and neural response. Network, 7:1-5, 1996.
[2] M. Bichsel and A. Pentland. Human face recognition and the face image set's topology. Computer Vision, Graphics, and Image Processing: Image Understanding, 59:254-261, 1994.
[3] S. Edelman. Representation of similarity in 3D object discrimination. Neural Computation, 7:407-422, 1995.
[4] S. Edelman, F. Cutzu, and S. Duvdevani-Bar. Similarity to reference shapes as a basis for shape representation. In G. Cottrell, editor, Proceedings of COGSCI'96, San Diego, CA, July 1996. To appear.
[5] S. Edelman, D. Reisfeld, and Y. Yeshurun. Learning to recognize faces from examples. In G. Sandini, editor, Proc. 2nd European Conf. on Computer Vision, Lecture Notes in Computer Science, volume 588, pages 787-791. Springer Verlag, 1992.
[6] D. J. Field. Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America, A 4:2379-2394, 1987.
[7] M. Lando and S. Edelman. Generalization from a single view in face recognition. CS-TR 95-02, Weizmann Institute of Science, 1995.
[8] M. Lando and S. Edelman. Receptive field spaces and class-based generalization from a single view in face recognition. Network, 6:551-576, 1995.
[9] L. L. Light, F. Kayra-Stuart, and S. Hollander. Recognition memory for typical and unusual faces. Journal of Experimental Psychology: Human Learning and Memory, 5:212-228, 1979.
[10] A. O'Toole, K. Deffenbacher, D. Valentin, and H. Abdi. Structural aspects of face recognition and the other-race effect. Memory and Cognition, 22:208-224, 1994.
[11] A. O'Toole, T. Vetter, N. Troje, and H. Bülthoff. Sex classification is better with three-dimensional head structure than with image intensity information. Perception, accepted.
[12] A. J. O'Toole and S. Edelman. Modeling face recognition across viewpoint. MPIK TR 21, Max Planck Institut für biologische Kybernetik, Tübingen, Germany, October 1995.
[13] D. L. Ruderman. The statistics of natural images. Network, 5:517-548, 1994.
[14] M. Turk and A. Pentland. Eigenfaces for recognition. J. of Cognitive Neuroscience, 3:71-86, 1991.
[15] T. Valentine and V. Bruce. The effects of distinctiveness in recognising and classifying faces. Perception, 15:525-535, 1986.
[16] Y. Weiss and S. Edelman. Representation of similarity as a goal of early visual processing. Network, 6:19-41, 1995.

5 Acknowledgements

This work was supported by a grant from the National Institute of Mental Health (1R29MH5176501A1) to A.O'T.