
Group-specific Score Normalization for Biometric Systems

Norman Poh and Josef Kittler
University of Surrey, Guildford, GU2 7XH, Surrey, UK. {n.poh,j.kittler}@surrey.ac.uk

Ajita Rattani and Massimo Tistarelli
Università di Sassari, Dipartimento di Architettura e Pianificazione, Piazza Duomo, 6, 07041 Alghero SS, Italy. {ajita02,mtista}@gmail.com

Abstract

The problem of the biometric menagerie, first pointed out by Doddington et al. (1998), is one that plagues all biometric systems. They observed that only a handful of clients (enrolled users in the gallery) contribute disproportionately to recognition errors. While the prior literature attempting to reduce this effect focuses on either client-specific score normalization or client-specific decision strategies, in this study we explore a novel category of approaches: group-specific score normalization. While client-specific score normalization can be negatively impacted by the paucity of genuine score samples, group-specific score normalization is less affected since the matching score samples of different clients belonging to the same group are aggregated. Experimental evidence based on face, fingerprint and iris modalities shows that our proposal generally outperforms client-specific score normalization as well as the baseline systems (without any normalization) across all possible operating points (obtained by varying the decision threshold).

1. Introduction

1.1. Motivations

An automatic biometric system works by first building a reference (model or template) specific to a user. During the operational phase, the system compares a scanned biometric sample with the registered model to decide whether an identity claim is genuine or not (i.e., from a different person). It has been noted [21] that the system may exhibit user-dependent behavior in terms of output scores when presented with genuine and impostor biometric samples. The consequence is that some user models are better than others at representing the user's identity. Doddington et al.'s initial study [7] attempted to characterize the user models by how easy or difficult it is for them to be recognized, and by how easily query samples can impersonate others in the database. As a result of this study, users were categorized as sheep, goats, lambs and wolves. Sheep are characterized by high genuine (similarity) matching scores whereas goats are characterized by low genuine matching scores. Lambs are defined symmetrically to goats, i.e., as attracting high impostor matching scores. Finally, wolves are persons who can consistently produce high impostor similarity scores when matched against all the references (i.e., the enrolled templates/models in the gallery). While sheep dominate the population of client models, goats (resp. lambs) constitute only a small fraction of the population; however, the latter categories account for a disproportionately large portion of the false rejection (resp. false acceptance) errors. Although Doddington et al.'s original study concerned speaker verification, the same phenomenon was independently observed in the context of the 2D face [22, 26, 27], 3D face [27], fingerprint [22, 12, 3, 27], speech [27], iris [27, 22] and keystroke [27] biometric modalities. Using finger-vein and fingerprint as case studies, Une et al. [25] proposed a measure known as the wolf attack probability, which quantifies the maximum probability of success of impersonating a victim by feeding wolves into a biometric system. The above-mentioned studies provide mounting evidence that the biometric menagerie is a general phenomenon inherent in all biometric experiments.

1.2. Prior Works on Analysis of the Biometric Menagerie

Research efforts that stem from Doddington et al.'s original study can be divided into two categories: techniques that contribute to the understanding of the biometric menagerie, and techniques that reduce its effect. In this section, we review the first category of techniques; the second is discussed in the next section.

Yager and Dunstone [27] devise methods aiming to distinguish the users by considering the characteristics of the genuine and impostor matching score distributions simultaneously, for each claimed identity. The method is thus able to find four categories of users, as shown in Table 1. This categorization of clients is more meaningful than Doddington et al.'s original categorization because both the genuine and impostor score characteristics are necessary to describe the degree of recognizability of each client. The above argument is independently confirmed by Poh and Kittler [22], who investigate the feasibility of ranking the users using discriminative criteria such as the F-ratio [18], d-prime [6] and the Fisher ratio for binary classification [2]. In [23], the same authors propose an index to quantify the extent of the biometric menagerie.

Table 1. The biometric menagerie proposed by Yager and Dunstone.

Category     Genuine  Impostor
Chameleons   High     High
Worms        Low      High
Phantoms     Low      Low
Doves        High     Low

1.3. Existing Methods to Minimize the Impact of the Biometric Menagerie

The existence of different animal species implies that one should design a different strategy for each species. For instance, lowering the similarity decision threshold (relative to a globally pre-set value) for the goats is likely to compensate for their disproportionately high false rejection errors. Similarly, increasing the decision threshold for the lambs will compensate for their disproportionately high false acceptance errors. This strategy is called the client/user-specific decision. Examples of such strategies abound: [9, 17, 4, 24, 13, 14, 11].

Rather than adjusting the thresholds, one can instead transform the matching score distribution. This alternative strategy is called client/user-specific score normalization. Examples are the Z-norm [1], the F-norm [18], the EER-norm [8] and model-specific log-likelihood ratio (LLR)-based normalization [20]. The Z-norm is considered impostor-centric because its normalization parameters are derived solely from the impostor scores (those specific to the reference associated with a client). The F-norm is considered client-impostor centric because its normalization parameters are derived from both the client-specific genuine and impostor score distributions; in fact, the F-norm is designed to simultaneously align the genuine and impostor score distributions to common axes up to the first-order moments. An alternative solution using a Bayesian formulation of this problem can be found in [20]. (A minimal sketch contrasting the two families of normalization is given at the end of this section.)

It is argued in [21] that the client/user-specific decision and score normalization are but the same methodology despite their implementation differences. A practical advantage of the score normalization strategy over its decision-based counterpart is that the normalized scores can be further combined in the context of multimodal biometrics. In comparison, the output of a client-specific decision is by definition binary, and such results can only be combined using binary operators such as AND and OR. From the point of view of information fusion, this decision-level process constitutes a potential weakness (compared to score-level fusion, which is made possible via client-specific score normalization) because the confidence of the match, as reflected by the magnitude of a score, is effectively ignored.
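To make the contrast concrete, the following minimal Python sketch (ours, not taken from any of the cited systems; the array names are illustrative) implements the impostor-centric Z-norm and the client-impostor centric F-norm for a claimed client j:

```python
import numpy as np

def z_norm(y, imp_scores_j):
    """Impostor-centric Z-norm [1]: standardize the raw score y using the
    impostor score statistics specific to the claimed client j."""
    return (y - imp_scores_j.mean()) / imp_scores_j.std()

def f_norm(y, gen_scores_j, imp_scores_j, mu_G_global, gamma=0.5):
    """Client-impostor centric F-norm [18]: map the client-specific impostor
    mean to 0 and the MAP-smoothed genuine mean to 1 (cf. Eqs. (6)-(7))."""
    mu_I = imp_scores_j.mean()
    mu_G = gamma * gen_scores_j.mean() + (1 - gamma) * mu_G_global
    return (y - mu_I) / (mu_G - mu_I)
```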

1.4. Our Proposal, Contributions and Paper Organization

An intrinsic weakness of the client-specific score normalization strategy is that there are never enough samples, especially client-specific genuine scores, to estimate the parameters required to carry out such normalization. The F-norm as well as the Bayesian score normalization of [20] overcome the paucity of client-specific genuine scores by using maximum a posteriori (MAP) adaptation, wherein the prior mean score value is given by the system-wide genuine mean score (hence independent of any client). As shown in [20], MAP adaptation is not necessary for the client-specific impostor parameters because there are often abundant client-specific impostor matching scores with which to estimate the distribution parameters reliably.

The aim of this study is to investigate an alternative strategy: normalizing the scores at the group level, i.e., using one score normalization procedure for each semantic category shown in Table 1. Our approach directly extends the work of Yager and Dunstone [27], who devise a statistical existence test to identify a user category. However, the problem dealt with here, group-specific score normalization, differs from the context of [27] because every client (enrolled user) now has to be assigned to one of the four categories. In comparison, the original study of Yager and Dunstone may not assign a user to any group at all; a user is labeled only if the null hypothesis is rejected at a pre-specified confidence level (the "null hypothesis" here being that the probability of a particular person falling into one of the four categories is 1/16).

There are two challenges in designing an effective group-specific score normalization. First, one needs a way to partition the clients into the four categories. Second, for each category, an appropriate normalization procedure has to be devised. As a preliminary investigation, for the first challenge we devise a simple heuristic that partitions the clients using their joint genuine and impostor mean scores. For the second challenge, we simply apply the client-specific score normalization developed in the existing literature to the grouped scores.

In summary, the contributions of this paper are as follows:

• we propose a novel method for group-specific score normalization, and

• we provide experimental evidence illustrating the merit of our proposal using face, fingerprint and iris modalities.

Finally, it is worth noting that the software developed in this paper is made publicly available¹. Readers are encouraged to download the software and submit their results via the developed web interface.

This paper is organized as follows: Section 2 gives a detailed account of our proposal, Section 3 validates its merit, and Section 4 concludes the paper, pointing out some future research directions.

¹ REMOVED

2. Methodology

Our proposal can be divided into two sub-methods: a method to categorize the clients, and another to normalize the aggregated scores of clients belonging to the same group. These two methods are discussed in the following two sub-sections, respectively.

2.1. Strategy to Categorize the Clients

Inspired by the classification of Yager and Dunstone shown in Table 1, a simple way to categorize a client is by its joint score characteristics (µ_j^I, µ_j^G), where µ_j^I is the client-specific mean impostor score and µ_j^G is its genuine counterpart. The parameter µ_j^k, for k ∈ {G, I} and a client identity j ∈ {1, . . . , J}, is defined as:

    µ_j^k = E_{y ∼ p(y|j,k)}[y],    (1)

where E_{z ∼ p(z)}[z] is the expectation operator over the range of the variable z subject to the distribution specified by p(z); this operator is defined as:

    E_{z ∼ p(z)}[z] = ∫_z z p(z) dz.

Note that the expectation in (1) is taken with respect to the density of the matching score y ∈ R conditioned on the client j and on the comparison type k (genuine or impostor). The global class-conditional mean score is defined as:

    µ^k = E_{y ∼ p(y|k)}[y],    (2)

which can be shown to be related to the client-specific mean parameters by:

    µ^k = E_{j ∼ p(j|k)}[µ_j^k].    (3)

Based on the joint observation (µ_j^I, µ_j^G) for a particular client j, one way to categorize each user is by the following deterministic function:

    categorize(µ_j^I, µ_j^G) =
        Q1  if µ_j^I ≥ µ^I ∧ µ_j^G ≥ µ^G
        Q2  if µ_j^I ≥ µ^I ∧ µ_j^G < µ^G
        Q3  if µ_j^I < µ^I ∧ µ_j^G < µ^G
        Q4  if µ_j^I < µ^I ∧ µ_j^G ≥ µ^G    (4)

where ∧ denotes the logical AND operator. Figure 1 illustrates the resulting partition of the client population in the (µ_j^I, µ_j^G) space.

[Figure 1 here. Scatter plot of Impostor mean (x-axis) versus Client mean (y-axis) for system ft2; legend: Q1: lamb + high gen; Q2: goat + lamb; Q3: goat + low imp; Q4: sheep; mean.]
Figure 1. An example of data samples in (µ_j^I, µ_j^G) space. The system used here is the right index fingerprint of the Biosecure DS2 data set (ft2).
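As an illustration, the rule in (4) translates directly into code. The following minimal Python sketch is ours (not the authors' released software); it assumes the client-specific and global means have already been estimated:

```python
# Quadrant assignment of Eq. (4); the comments map the quadrants to the
# menagerie-style labels used in the legend of Figure 1.
def categorize(mu_I_j, mu_G_j, mu_I, mu_G):
    if mu_I_j >= mu_I and mu_G_j >= mu_G:
        return "Q1"  # high impostor & high genuine mean (lamb + high gen)
    if mu_I_j >= mu_I and mu_G_j < mu_G:
        return "Q2"  # high impostor & low genuine mean (goat + lamb)
    if mu_I_j < mu_I and mu_G_j < mu_G:
        return "Q3"  # low impostor & low genuine mean (goat + low imp)
    return "Q4"      # low impostor & high genuine mean (sheep)
```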

2.2. Group-specific Score Normalization

In the client-specific score normalization literature, as briefly reviewed in Section 1.3, a number of score normalization procedures have been suggested. Instead of normalizing the scores of each client, we propose to normalize the scores at the group level. Therefore, although the form of normalization is not new, its application at the group level is novel. Among the available normalization methods, we select one that is client-impostor centric, i.e., one that is capable of taking into account both the (client-specific) genuine and impostor score distributions. This choice is motivated by earlier findings that this strategy is potentially more powerful than the impostor-centric one alone [21]. However, for client-impostor centric score normalization there is a potential estimation problem associated with µ_j^G, because there are often very few samples with which to estimate this quantity. One such method that is robust to this estimation error is the F-norm. For instance, it was reported in [21, 19] that the parameter µ_j^G was estimated using only one genuine score sample.

Guided by the above reasoning, we use the F-norm as our choice of client-impostor centric score normalization method. We apply the F-norm in two different contexts: as a group-specific score normalization procedure as well as a client-specific one. The group-specific procedure first aggregates the matching scores of all claimed identities belonging to the same group. This is an advantage over its client-specific counterpart because the resultant procedure no longer suffers from the paucity of genuine samples.

Consistent with the previous section, let Q_g denote a particular user group, g ∈ {1, 2, 3, 4}, and let the group-specific mean parameter, µ_g^k, be:

    µ_g^k = E_{y ∼ p(y|Q_g,k)}[y].    (5)

Note that the corresponding global mean, µ^k, is defined by (2). The group-specific F-norm estimates the group-specific genuine mean by maximum a posteriori (MAP) adaptation as follows:

    µ̃_g^G(γ) = γ µ_g^G + (1 − γ) µ^G,    (6)

where γ ∈ [0, 1] reflects the relative confidence between the estimates µ_g^G and µ^G. The (group-specific) F-norm is then defined as:

    y_g^F = (y − µ_g^I) / (µ̃_g^G(γ) − µ_g^I).    (7)

From the MAP literature [10], the theoretically optimal value for γ is

    γ = N_g^G / (N_g^G + r),    (8)

where r, known as the relevance factor, takes only positive values. Consistent with the literature on MAP adaptation, when the number of samples N_g^G is large, i.e., N_g^G ≫ r, the MAP estimate converges to the maximum likelihood estimate since γ → 1 (the consequence is that the a priori parameter, µ^G, plays no role in the estimate). In our context, since N_g^G is small (it depends on the number of available clients), it is necessary to tune γ ∈ [0, 1] by experimentation. The above discussion extends directly to client-specific score normalization by replacing the subscript g with the subscript j throughout.

If the group-specific statistics were known exactly, the class-conditional mean values of the F-normalized scores would be as follows:

    E_{y_g^F ∼ p(y_g^F|G)}[y_g^F] = 1, for all g,    (9)

    E_{y_g^F ∼ p(y_g^F|I)}[y_g^F] = 0, for all g.    (10)

[Figure 2 here. Two panels of score densities (likelihood versus scores): (a) Original, (b) F-norm.]
Figure 2. Effect of the client-specific F-norm: (a) before and (b) after transformation. γ = 0.5 was used here.

Both (9) and (10) show that the F-norm has the effect of simultaneously projecting the genuine and impostor F-normalized score distributions to the expected values of one and zero, respectively. This effect is illustrated in Figure 2 in the context of the client-specific F-norm; the same behavior holds for the group-specific F-norm. In summary, the overall procedure for group-specific score normalization is as follows: first, determine to which group a client belongs on the basis of the enrollment data; then, during a query, employ the group-specific score normalization designed for the group to which the claimed client belongs.
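To make the procedure concrete, here is a minimal sketch of the two phases (our illustration of Section 2.2, not the released software; the dictionary-based data layout and function names are assumptions):

```python
import numpy as np

def fit_group_fnorm(gen_by_group, imp_by_group, mu_G_global, gamma):
    """Development phase: estimate per-group F-norm parameters.
    gen_by_group / imp_by_group: dict g -> 1-D array of the scores
    aggregated over all clients assigned to group g."""
    params = {}
    for g in gen_by_group:
        mu_G_g = gen_by_group[g].mean()                      # Eq. (5), k = G
        mu_I_g = imp_by_group[g].mean()                      # Eq. (5), k = I
        mu_G_map = gamma * mu_G_g + (1 - gamma) * mu_G_global  # Eq. (6)
        params[g] = (mu_I_g, mu_G_map)
    return params

def apply_group_fnorm(y, client_group, params):
    """Query phase: normalize a raw score y for a claim whose enrolled
    client was assigned to group `client_group` at enrollment time."""
    mu_I_g, mu_G_map = params[client_group]
    return (y - mu_I_g) / (mu_G_map - mu_I_g)                # Eq. (7)
```

In this sketch, gamma would be tuned on the development set, as discussed above.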

2.3. Implementation

This final sub-section is devoted to estimating the parameters, as well as the single free parameter (γ), of the F-norm. In the following, we describe the estimators for the client-specific F-norm; the group-specific F-norm is realized in exactly the same way by replacing the subscript j with g. The client-specific F-norm is explained here because it will be used to establish a baseline comparison.

Let y_{j,i}^k denote the i-th score of client j given the hypothesis k, which can be either genuine or impostor, k ∈ {G, I}, with i ∈ {1, . . . , N_j^k}. The client-specific and global mean parameters are estimated by:

    µ̂_j^k = (1 / N_j^k) Σ_{i=1}^{N_j^k} y_{j,i}^k    and    µ̂^k = (1/J) Σ_{j=1}^{J} µ̂_j^k,

respectively. As can be observed, the fraction 1/J is a consequence of the assumption that P(j|k) is a uniform distribution (in (2)).

In our database, there is only a single client-specific genuine score sample; it was obtained by comparing the reference (a template) with another sample also acquired during enrollment. For this reason, the client-specific genuine mean estimate simply takes the value of the single observed sample:

    µ̂_j^G = y_{j,1}^G.

Naturally, this implies that the estimated client-specific genuine mean parameter is unreliable. Recall that this unreliable estimate is compensated for by MAP adaptation, via (6). Consistent with the experiments reported in [21, 19] (wherein a single genuine score was available for each client), γ = 0.5 is used throughout this study for the client-specific F-norm. For the group-specific F-norm, however, one still has to tune the γ parameter, because µ_g^G, for any group g, can be estimated more reliably thanks to score aggregation. The effect of tuning γ is examined in the experimental section.
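The estimators above translate directly into code. The following sketch is ours (the scores data layout is an assumption) and computes µ̂_j^k and µ̂^k under the uniform P(j|k) assumption:

```python
import numpy as np

# `scores[j][k]` is assumed to be a 1-D numpy array holding the N_j^k
# observed scores for client j and hypothesis k in {"G", "I"}.
def client_and_global_means(scores, k):
    mu_j = {j: scores[j][k].mean() for j in scores}   # client-specific means
    mu = float(np.mean(list(mu_j.values())))          # global mean, uniform P(j|k)
    return mu_j, mu
```

With a single genuine sample per client, mu_j for k = "G" reduces to the single observed score y_{j,1}^G, which is precisely why the MAP-smoothed estimate of (6) with γ = 0.5 is used for the client-specific baseline.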

Table 3. The experimental protocol: number of matching scores in the development (dev) and evaluation (eva) sets.

                     No. of matching scores
                     dev (51 persons)    eva (156 persons)
S1    Genuine        1 × 51              1 × 156
      Impostor       103 × 4 × 51        51 × 4 × 156
S2    Genuine        2 × 51              2 × 156
      Impostor       103 × 4 × 51        126 × 4 × 156

Acronyms: S1 = session one; S2 = session two; dev = development (training) set; eva = evaluation (test) set.
Example: the entry "103 × 4 × 51" in column dev and row S2:Impostor is the number of scores obtained by comparing 51 client references against the queries of 103 impostors, each making 4 attempts. The entry "2 × 156" in column eva and row S2:Genuine is the number of genuine matching scores obtained by comparing 156 client references, each having two genuine samples.

3. Experiments

3.1. Database, Reference Systems and Experimental Protocols

The data used in our evaluation is taken from the Biosecure database. Biosecure² was a European project whose aim was to integrate multi-disciplinary research efforts in biometric-based identity authentication. Application examples include building access systems using a desktop-based or a mobile-based platform, as well as applications over the Internet such as tele-working and web- or remote-banking services. As far as the data collection is concerned, three scenarios were identified, each simulating the use of biometrics in remote-access authentication via the Internet (the "Internet" scenario), physical access control (the "desktop" scenario), and authentication via mobile devices (the "mobile" scenario) [16].

For the purpose of our experiments, we used the desktop scenario subset³, which contains still face, six-finger and iris modalities, denoted by fa1, ft1–6 and ir1, respectively. This is the so-called "cost-sensitive" configuration, as reported in [19]. These 8 channels of data, as well as the reference systems, are summarized in Table 2, whereas the experimental protocols (defining how the database is partitioned for training and testing) are shown in Table 3. Note that for the purpose of performance assessment, the data set and experimental protocols are not of primary concern; any database could have been used. The only requirement is that a wide variety of biometric modalities is covered, in order to illustrate the generality of our approach.

It is important to note that there are two score data sets: the development and the evaluation sets (see Table 3). In this table, S1 denotes the session-1 data (the enrollment session) whereas S2 denotes the session-2 data (the query session). For each client, the data in S1 consists of two samples collected within the same session; these were collected to facilitate the development of a baseline system (i.e., for enrollment). The first sample serves as a reference/template, and the second is used to obtain an estimate of the client-specific genuine mean parameter (as discussed in Section 2.3).

It is known that intra-session performance is biased [15]. To illustrate this systematic bias, we compare the same-session (S1) performance against the different-session (S2) performance, for each of the 8 channels of data, in terms of Equal Error Rate (EER), in Figure 3. As can be observed, the same-session performance is systematically better than the different-session performance. This directly implies that the estimated client-specific genuine mean, µ̂_j^G, is biased. The data set used here has been designed to benchmark client-specific score normalization procedures [19].

In our experiments, the development set is used to partition the clients into the four groups, as well as to tune any free parameters (e.g., γ in the F-norm). The evaluation set is used solely for performance reporting.

The performance of the iris baseline system used here is far from that claimed for Daugman's implementation [5]. We established that this was due to bad iris segmentation and a suboptimal threshold for distinguishing eyelashes from the iris. Being baselines, no effort was made to optimize the systems' performance; the only requirement is that all systems output matching scores.

² http://www.biosecure.info
³ The matching scores used in the experiments are downloaded from http://face.ee.surrey.ac.uk/qfusion.

Table 2. A list of channels of data for each biometric modality captured using a given device.

Label  No.   Modality      Sensor    Reference system                               Remarks
fa     1     Still face    Web cam   Omniperception's Affinity SDK face detector;   Frontal face images (low resolution)
                                     LDA-based face verifier
ft     1–6   Fingerprint   Thermal   NIST Fingerprint system                        1/4 is right/left thumb; 2/5 is right/left
                                                                                    index; 3/6 is right/left middle finger
ir     1     Iris image    LG        A variant of Libor Masek's iris system         1 denotes the left iris

[Figure 4 here. DET curves (FMR [%] versus FNMR [%]) on the evaluation set for three of the eight systems, (a) face (fa1), (b) left index fingerprint (ft5) and (c) iris (ir1), each comparing gF-norm (0), gF-norm (.5), gF-norm (1), F-norm (0.5) and the unnormalized scores (orig).]

3.2. Results and Discussions

We compared the performance of the baseline system (without any normalization), the client-specific F-norm and the proposed group-specific F-norm using the 8 channels of Biosecure DS2 data. The experimental results in terms of DET curves for three of the eight channels of data are shown in Figure 4. The DET curves are summarized by evaluating the False Non-Match Rate (FNMR) at a given False Match Rate (FMR). FNMR is an empirical estimate of the probability of falsely rejecting a genuine user, whereas FMR is the counterpart estimate of the probability of falsely accepting an impostor. We assess the FNMR at three FMR values, i.e., {0.1%, 1%, 10%}, which effectively summarizes each DET curve by three operating points of interest. These FNMR values are shown in Table 4. Last but not least, we also compare all the systems in terms of the Equal Error Rate (EER) in Table 5; the EER is the operating point at which the probability of false acceptance equals the probability of false rejection. The comparison between client-specific and group-specific normalization reveals the following:
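For clarity, the operating points reported below can be computed in a few lines. This is a sketch under the assumption that the genuine and impostor evaluation scores are available as numpy arrays (not the authors' evaluation scripts):

```python
import numpy as np

def fnmr_at_fmr(gen_scores, imp_scores, fmr_targets=(0.001, 0.01, 0.1)):
    """For each target FMR, pick the threshold at which the fraction of
    impostor scores above it equals the target, then report the fraction
    of genuine scores falling below that threshold (the FNMR)."""
    summary = {}
    for fmr in fmr_targets:
        thr = np.quantile(imp_scores, 1.0 - fmr)
        summary[fmr] = float(np.mean(gen_scores < thr))
    return summary

def eer(gen_scores, imp_scores):
    """Crude EER estimate: sweep thresholds over the pooled scores and
    return the point where FMR and FNMR are closest."""
    thrs = np.unique(np.concatenate([gen_scores, imp_scores]))
    fmr = np.array([np.mean(imp_scores >= t) for t in thrs])
    fnmr = np.array([np.mean(gen_scores < t) for t in thrs])
    i = int(np.argmin(np.abs(fmr - fnmr)))
    return float((fmr[i] + fnmr[i]) / 2.0)
```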

[Figure 3 here. Bar chart of EER (%) for the eight systems fa1, ft1–ft6 and ir1.]
Figure 3. The error of the development set (blue) versus that of the evaluation set (red) of the 8 systems used in the cost-sensitive evaluation of the original Biosecure data set.

Table 4. Summary of results in terms of FNMR.

(a) FNMR at FMR = 0.001
Trait  F (.5)  gF (0)  gF (.5)  gF (1)  orig
fa1    0.50    0.35    0.34     0.39    0.35
ft1    0.97    0.56    0.65     0.88    0.57
ft2    0.49    0.38    0.37     0.51    0.41
ft3    0.58    0.47    0.48     0.66    0.47
ft4    0.84    0.57    0.63     0.80    0.57
ft5    0.67    0.39    0.39     0.57    0.39
ft6    0.52    0.41    0.40     0.54    0.42
ir1    1.00    0.69    0.79     0.90    0.68

(b) FNMR at FMR = 0.01
Trait  F (.5)  gF (0)  gF (.5)  gF (1)  orig
fa1    0.19    0.22    0.20     0.19    0.24
ft1    0.47    0.42    0.41     0.50    0.42
ft2    0.20    0.24    0.18     0.18    0.23
ft3    0.31    0.31    0.28     0.33    0.33
ft4    0.47    0.39    0.40     0.43    0.40
ft5    0.23    0.26    0.23     0.21    0.27
ft6    0.28    0.28    0.25     0.25    0.29
ir1    0.46    0.42    0.42     0.45    0.42

(c) FNMR at FMR = 0.1
Trait  F (.5)  gF (0)  gF (.5)  gF (1)  orig
fa1    0.10    0.12    0.11     0.11    0.13
ft1    0.18    0.21    0.17     0.15    0.21
ft2    0.05    0.08    0.06     0.06    0.09
ft3    0.13    0.16    0.14     0.12    0.17
ft4    0.19    0.23    0.19     0.18    0.23
ft5    0.09    0.10    0.09     0.08    0.12
ft6    0.15    0.19    0.16     0.14    0.21
ir1    0.19    0.19    0.18     0.17    0.19

Note: the minimum value in each row marks the best result. "F (.5)" denotes the F-norm with γ = 0.5, "gF (.5)" denotes the group-specific F-norm with γ = 0.5, and "orig" is the baseline system without normalization.

Table 5. Summary of results in terms of EER.

Trait  F (.5)  gF (0)  gF (.5)  gF (1)  orig
fa1    0.08    0.09    0.09     0.09    0.10
ft1    0.14    0.15    0.13     0.12    0.16
ft2    0.06    0.08    0.07     0.06    0.09
ft3    0.11    0.12    0.11     0.10    0.13
ft4    0.14    0.15    0.14     0.13    0.16
ft5    0.09    0.10    0.09     0.08    0.11
ft6    0.11    0.13    0.12     0.11    0.14
ir1    0.14    0.14    0.14     0.13    0.14

• The group-specific F-norm with γ = 1 outperforms the client-specific F-norm at FMR = 0.1 as well as at the EER for the majority of the experiments.

• For applications requiring very low FMR (0.001), the group-specific F-norm again outperforms the client-specific F-norm.

• For the group-specific F-norm, setting γ = 0 appears adequate for high-security applications (requiring very low FMR), whereas setting γ = 1 is more suitable where user convenience is more important.

Although the relative improvement gained by group-specific score normalization is modest, we stress that the improvements are systematic and consistent across the three biometric modalities under investigation.

4. Conclusions and Future Research Directions

In this paper, we investigated a novel category of methods capable of reducing the impact of the biometric menagerie, called group-specific score normalization. The core idea of our proposal lies in partitioning the clients into four categories according to their score characteristics after the enrollment process. Assuming that the group membership remains the same, the operational system normalizes the matching score of a given claim using the parameters of the group to which the claimed client belongs. Experimental results based on three biometric modalities show that group-specific score normalization compares favorably with the baseline system as well as with client-specific score normalization. In particular, group-specific score normalization systematically outperforms client-specific score normalization at low FMR, and it systematically outperforms the baseline systems (without any normalization) at the EER. Some future research directions are outlined below:

• Extension to fusion experiments: our next immediate investigation is to explore the impact of group-specific score normalization on multimodal fusion. Since some improvement is already observed for each system, we conjecture that the aggregate improvement at the multimodal level can be significant.

• Soft partitioning of the clients: the current technique for partitioning the clients is deterministic. However, since membership of the biometric menagerie is not deterministic, a probabilistic way of partitioning the clients is desirable. This constitutes another possible future research direction.

Acknowledgments

This work was partially supported by COST Action 2101 (cost2101.org) and by the EU-funded Mobio project (www.mobioproject.org), grant IST-214324. The author N.P. is also supported by the advanced researcher fellowship grant PA0022 121477 of the Swiss National Science Foundation.

References

[1] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas. Score Normalization for Text-Independent Speaker Verification Systems. Digital Signal Processing, 10:42–54, 2000.
[2] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1999.
[3] J. Breebaart, T. Akkermans, and E. Kelkboom. Intersubject differences in false nonmatch rates for a fingerprint-based authentication system. EURASIP Journal on Advances in Signal Processing, 2009. Article ID 896383.
[4] K. Chen. Towards Better Making a Decision in Speaker Verification. Pattern Recognition, 36(2):329–346, 2003.
[5] J. Daugman. How Iris Recognition Works, chapter 6. Kluwer Publishers, 1999.
[6] J. Daugman. Biometric decision landscapes. Technical Report TR482, University of Cambridge Computer Laboratory, 2000.
[7] G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. Reynolds. Sheep, Goats, Lambs and Wolves: A Statistical Analysis of Speaker Performance in the NIST 1998 Speaker Recognition Evaluation. In Int'l Conf. Spoken Language Processing (ICSLP), Sydney, 1998.
[8] J. Fierrez-Aguilar, J. Ortega-Garcia, and J. Gonzalez-Rodriguez. Target Dependent Score Normalisation Techniques and Their Application to Signature Verification. In LNCS 3072, Int'l Conf. on Biometric Authentication (ICBA), pages 498–504, Hong Kong, 2004.
[9] S. Furui. Cepstral Analysis for Automatic Speaker Verification. IEEE Trans. Acoustics, Speech and Signal Processing, 29(2):254–272, 1981.
[10] J. Gauvain and C.-H. Lee. Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains. IEEE Trans. Speech and Audio Processing, 2:290–298, 1994.
[11] D. Genoud. Reconnaissance et Transformation de Locuteur. PhD thesis, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland, 1998.
[12] A. Hicklin and B. Ulery. The myth of goats: How many people have fingerprints that are hard to match? Technical Report NISTIR 7271, National Institute of Standards and Technology, 2005.
[13] K. Jonsson, J. Kittler, Y. P. Li, and J. Matas. Support vector machines for face authentication. Image and Vision Computing, 20:269–275, 2002.
[14] J. Lindberg, J. Koolwaaij, H.-P. Hutter, D. Genoud, M. Blomberg, J.-B. Pierrot, and F. Bimbot. Techniques for a priori Decision Threshold Estimation in Speaker Verification. In Proc. of the Workshop Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (RLA2C), pages 89–92, Avignon, 1998.
[15] A. Martin, M. Przybocki, and J. P. Campbell. The NIST Speaker Recognition Evaluation Program, chapter 8. Springer, 2005.
[16] J. Ortega-Garcia, J. Fierrez, F. Alonso-Fernandez, J. Galbally, M. R. Freire, J. Gonzalez-Rodriguez, C. Garcia-Mateo, J.-L. Alba-Castro, E. Gonzalez-Agulla, E. Otero-Muras, S. Garcia-Salicetti, L. Allano, B. Ly-Van, B. Dorizzi, J. Kittler, T. Bourlai, N. Poh, F. Deravi, M. W. Ng, M. Fairhurst, J. Hennebert, A. Humm, M. Tistarelli, L. Brodo, J. Richiardi, A. Drygajlo, H. Ganster, F. M. Sukno, S.-K. Pavani, A. Frangi, L. Akarun, and A. Savran. The Multiscenario Multienvironment BioSecure Multimodal Database (BMDB). IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:1097–1111, 2010.
[17] J.-B. Pierrot. Élaboration et Validation d'Approches en Vérification du Locuteur. PhD thesis, ENST, Paris, September 1998.
[18] N. Poh and S. Bengio. F-ratio Client-Dependent Normalisation on Biometric Authentication Tasks. In IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP), pages 721–724, Philadelphia, 2005.
[19] N. Poh, T. Bourlai, and J. Kittler. A multimodal biometric test bed for quality-dependent, cost-sensitive and client-specific score-level fusion algorithms. Pattern Recognition, 43(3):1094–1105, 2010.
[20] N. Poh and J. Kittler. On the Use of Log-Likelihood Ratio Based Model-Specific Score Normalisation in Biometric Authentication. In LNCS 4542, IEEE/IAPR Proc. Int'l Conf. Biometrics (ICB'07), pages 614–624, Seoul, 2007.
[21] N. Poh and J. Kittler. Incorporating Variation of Model-Specific Score Distribution in Speaker Verification Systems. IEEE Transactions on Audio, Speech and Language Processing, 16(3):594–606, 2008.
[22] N. Poh and J. Kittler. A Methodology for Separating Sheep from Goats for Controlled Enrollment and Multimodal Fusion. In Proc. of the 6th Biometrics Symposium, pages 17–22, Tampa, 2008.
[23] N. Poh and J. Kittler. A biometric menagerie index for characterising template/model-specific variation. In Proc. of the 3rd Int'l Conf. on Biometrics, pages 816–827, Sardinia, 2009.
[24] J. Saeta and J. Hernando. On the Use of Score Pruning in Speaker Verification for Speaker Dependent Threshold Estimation. In The Speaker and Language Recognition Workshop (Odyssey), pages 215–218, Toledo, 2004.
[25] M. Une, A. Otsuka, and H. Imai. Wolf attack probability: A theoretical security measure in biometric authentication systems. IEICE Transactions on Information and Systems, E91-D(5):1380–1389, 2008.
[26] M. Wittman, P. Davis, and P. Flynn. Empirical studies of the existence of the biometric menagerie in the FRGC 2.0 color image corpus. In Proc. Conf. on Computer Vision and Pattern Recognition Workshop, pages 33–33, June 2006.
[27] N. Yager and T. Dunstone. Worms, chameleons, phantoms and doves: New additions to the biometric menagerie. In Proc. IEEE Workshop on Automatic Identification Advanced Technologies, pages 1–6, June 2007.