Identification of Deletion Polymorphisms from Haplotypes

Erik Corona*, Benjamin Raphael**, and Eleazar Eskin*

* Department of Computer Science & Engineering, University of California, San Diego
** Center for Computational Molecular Biology & Department of Computer Science, Brown University

September 30, 2006

Abstract

Numerous efforts are underway to catalog genetic variation in human populations. While the majority of studies of genetic variation have focused on single base pair differences between individuals, i.e. single nucleotide polymorphisms (SNPs), several recent studies have demonstrated that larger-scale structural variation, including copy number polymorphisms and inversion polymorphisms, is also common. However, direct techniques for detection and validation of structural variants are generally much more expensive than detection and validation of SNPs. For some types of structural variation, in particular deletions, the polymorphism produces a distinct signature in the SNP data. In this paper, we describe a new probabilistic method for detecting deletion polymorphisms from SNP data. The key idea in our method is that we estimate the frequency of the haplotypes in a region of the genome both with and without the possibility of a deletion in the region, and apply a generalized likelihood ratio test to assess the significance of a deletion. Application of our method to the HapMap Phase I data revealed 319 candidate deletions; 142 of these overlap with variants identified in earlier studies, while 177 are novel. In addition, our method is applicable to the Phase II HapMap data, where we predict 1933 deletions.

1 Introduction

Since the completion of the reference human genome sequence, efforts have been undertaken to identify genetic variants in the human population. The most comprehensive of these efforts is the International HapMap Project, which aims to produce a genome-wide catalog of variation in a population of 270 individuals [1]. Both HapMap and the majority of studies of genetic variation have focused on single base pair differences between individuals, or single nucleotide polymorphisms (SNPs), due to recent technological advances which have significantly reduced the cost of measuring SNPs. However, a number of recent studies have demonstrated that variation at the single nucleotide level is only part of the picture, and that other larger-scale variants are also common, including copy number polymorphisms, i.e. gains and losses of specific chromosomal regions, as well as inversion polymorphisms [10]. The relative contribution of such structural polymorphisms to human genetic variation is still unclear. Direct techniques for detection and validation of structural variants are generally much more expensive than detection and validation of SNPs. However, for some types of structural variation such as deletion polymorphisms, the structural variation leaves a distinct signature in the SNP data. Specifically, a genotyping assay will incorrectly identify a heterozygous deletion as a homozygous genotype of the undeleted allele, while a homozygous deletion will result in a missing SNP. If we observe the genotypes of both parents and a child, then deletions will typically be observed as inconsistencies with the rules of Mendelian inheritance (Figure 1). This inconsistency occurs when one parent transmits a deletion to the child and the other parent

does not. The child's measured genotype will then be homozygous in the undeleted allele. If the parent transmitting this deletion does not have this allele, then a Mendelian error results. Two recent papers [2], [8] cleverly exploited this idea and introduced techniques for detecting deletion polymorphisms from the HapMap data by looking for Mendelian errors in trios. Conrad et al. [2] search for groups of Mendelian errors that are physically close on the genome; candidate deletions are then filtered using a series of heuristics. McCarroll et al. [8] create a binary vector for each SNP that indicates the presence or absence of a Mendelian error for each parent-child pair in the population. They then cluster these vectors across different SNPs and look for clusters that contain SNPs physically close on the genome. These methods report 375 and 315 deletion polymorphisms respectively, but only 225 of these are common to both studies, suggesting that the sensitivity and/or specificity of the methods is relatively poor [10]. The difficulty with detecting deletions from SNP data is that there is no one-to-one relationship between Mendelian errors and deletion polymorphisms. First, most observed Mendelian errors are due to experimental errors in genotyping assays rather than the presence of deletions; indeed, these errors are routinely filtered out of the official HapMap releases. Second, not every deletion produces Mendelian errors at the deleted SNPs: whether or not there is an error depends on the undeleted alleles.

Figure 1: A. A parent-child trio where both parents are homozygous (father G/G, mother A/A) and the child is heterozygous (A/G) at a particular SNP. In this case, the genotypes are consistent with Mendelian inheritance. B. A parent-child trio where the father has a hemizygous deletion (true genotype G/-) and is incorrectly genotyped as homozygous. In this case, the deletion is transmitted and the child (true genotype A/-) is also incorrectly genotyped as homozygous, resulting in a Mendelian error.
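The Mendelian-error signature described above can be checked mechanically at a single SNP. The following is a minimal illustrative sketch (our own, not the authors' code), with genotypes represented as allele pairs:

```python
# Detect a Mendelian inconsistency at one SNP from unphased trio genotypes.
from itertools import product

def mendelian_error(father, mother, child):
    """Return True if the child's genotype cannot be formed by
    transmitting one allele from each parent."""
    for f_allele, m_allele in product(father, mother):
        if set(child) == {f_allele, m_allele}:
            return False
    return True

# Panel A of Figure 1: a consistent trio.
assert not mendelian_error(("G", "G"), ("A", "A"), ("A", "G"))
# Panel B: the father's hemizygous deletion is typed as G/G, and the child
# (who inherits the deletion) as A/A -- a Mendelian error.
assert mendelian_error(("G", "G"), ("A", "A"), ("A", "A"))
```

Note that this check alone cannot distinguish a transmitted deletion from a genotyping error, which is exactly the ambiguity the probabilistic method below is designed to resolve.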
For these reasons, using Mendelian errors as a signal for deletions requires sensitive methods that distinguish deletions from experimental errors. In this paper, we describe a new probabilistic method for detecting deletion polymorphisms from SNP data. The key idea in our method is that we estimate the frequency of the haplotypes in a region of the genome both with and without the possibility of a deletion in the region, and apply a generalized likelihood ratio test to assess the significance of a deletion. We argue that the use of haplotypes is a more precise way to examine patterns of Mendelian errors than the ad hoc clustering used in earlier studies. Our method has several additional advantages over the earlier methods, neither of which takes into account the haplotype structure in the region. First, our method provides a p-value confidence level for each prediction and estimates the frequency of the deletion in a region. Second, the method explicitly combines information from multiple trios, achieving higher specificity than [2], which predicts deletions using information only from individual trios. Finally, our method incorporates a flexible model for genotyping errors, which can be adapted to the case when neighboring errors are correlated. This is an important advantage over earlier methods that assume errors are independent, because it allows us to directly apply our method to the Phase II HapMap data. This data includes approximately 4 million SNPs, but due to the long-range PCR technology [5] used by Perlegen Sciences to collect the data, errors in neighboring SNPs are inherently correlated. Application of our method to the HapMap Phase I data with 1,109,711 SNPs using a very conservative threshold revealed 319 candidate deletions; 142 of these overlap with variants identified in earlier studies, while 177 are novel. An advantage of our method is that, by using less conservative thresholds, we can potentially discover many more deletions. We applied our approach to a portion of the HapMap Phase II data, which contains 3.8 million SNPs in total, and discovered 317 candidate deletions. Based on this result, we expect to discover approximately 1933 candidate deletions in the complete Phase II data.

2 Methods

2.1 Probabilistic Model for Deletion Polymorphisms

Our method for detecting deletion polymorphisms relies on a generative probabilistic model for trios of mother-father-child genotypes. For a window of L SNPs, we are given a set of haplotypes h_D, h_1, ..., h_N with frequencies p_D, p_1, ..., p_N, where p_D + \sum_{j=1}^{N} p_j = 1. Here, h_D represents the deletion haplotype with corresponding frequency p_D. Each haplotype h_j \in \{0,1\}^L is a string of length L consisting of 0's and 1's, which represent the two possible alleles of a SNP. Let h_j^i denote the ith SNP of haplotype h_j. We denote the set of haplotype frequencies with the vector P = (p_1, p_2, ..., p_N, p_D). Each mother-father-child trio is characterized by a four-tuple (F_T, F_U, M_T, M_U) of haplotypes, where F_T and F_U are the haplotypes of the father, and M_T and M_U are the haplotypes of the mother. Here the subscript T denotes the haplotype that is transmitted from parent to child, and U denotes the untransmitted haplotype, so that the haplotypes of the child are F_T and M_T. Given a pair of haplotypes h_j and h_k, with h_j \neq h_D and h_k \neq h_D, we define a genotype G(h_j, h_k) \in \{0,1,2\}^L as the string of length L where

    G^i(h_j, h_k) = \begin{cases} h_j^i, & h_j^i = h_k^i, \\ 2, & h_j^i \neq h_k^i, \end{cases}    (1)

for i = 1, ..., L. Thus, 2 represents a heterozygous genotype. We also define G(h_j, h_D) = G(h_D, h_j) = h_j. We use the character 3 to indicate a missing SNP with no measured genotype. Consequently, G(h_D, h_D) is defined to be the string of L 3's. We define the genotypes F = G(F_T, F_U), M = G(M_T, M_U), and C = G(F_T, M_T) for the father, mother, and child, respectively. The experimental procedure used to measure genotypes is subject to errors that either mutate an individual SNP or fail to assign a value to a SNP. We assume that errors are independent and identically distributed for each SNP, which is a reasonable assumption for most genotyping platforms. (See below for exceptions.)
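As an illustration, the genotype operator G can be transcribed directly into code. This is our own sketch, not the paper's implementation; haplotypes are 0/1 strings and the deletion haplotype is represented by the hypothetical token "D":

```python
# Genotype operator G from Section 2.1. Genotype characters:
# 0/1 homozygous, 2 heterozygous, 3 missing.

def genotype(hj, hk, L):
    if hj == "D" and hk == "D":
        return "3" * L   # homozygous deletion: every SNP is missing
    if hj == "D":
        return hk        # hemizygous deletion looks homozygous in hk
    if hk == "D":
        return hj
    # 2 marks the heterozygous positions of an undeleted pair
    return "".join(a if a == b else "2" for a, b in zip(hj, hk))

assert genotype("0101", "0101", 4) == "0101"
assert genotype("0101", "0110", 4) == "0122"
assert genotype("D", "0110", 4) == "0110"   # hemizygous deletion
assert genotype("D", "D", 4) == "3333"
```

The second assertion shows why hemizygous deletions masquerade as homozygous calls: the deleted parent's genotype is simply the undeleted haplotype.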
Thus, the error process is given by a set of probabilities \{e_{i \to j}\}_{i,j=0}^{3}, where e_{i \to j} indicates the probability that a SNP with the value i is measured as the value j. The likelihood of observing genotype \hat{G} when the true genotype is G is then

    L(G, \hat{G}) = \prod_{i=1}^{L} e_{G^i \to \hat{G}^i}.

Given the observed genotypes in a father-mother-child trio, we are uncertain of the underlying haplotypes due to the possibility of genotype errors as well as the inherent ambiguity of genotypes. However, if we know the underlying haplotype frequencies P, then we can explicitly model this uncertainty by summing over all possible choices of haplotypes to compute the likelihood of the observed trio of father-mother-child genotypes (\hat{F}, \hat{M}, \hat{C}):

    L(\hat{F}, \hat{M}, \hat{C} \mid P) = \sum_{F_T} \sum_{F_U} \sum_{M_T} \sum_{M_U} p_{F_T} p_{F_U} p_{M_T} p_{M_U} \, L(\hat{F}, F) \, L(\hat{M}, M) \, L(\hat{C}, C).    (2)
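Equation (2) can be read as a brute-force sum over all four-tuples of parental haplotypes. A toy sketch (our own, restricted for brevity to windows without the deletion haplotype; `e` is an assumed mapping from (true, observed) genotype characters to error probabilities):

```python
from itertools import product

def geno(hj, hk):
    # Genotype of an undeleted haplotype pair: 2 marks heterozygous sites.
    return "".join(a if a == b else "2" for a, b in zip(hj, hk))

def geno_lik(true_g, obs_g, e):
    # Product of per-SNP error probabilities e[(true, observed)].
    p = 1.0
    for t, o in zip(true_g, obs_g):
        p *= e[(t, o)]
    return p

def trio_likelihood(obs_f, obs_m, obs_c, freqs, e):
    # Sum over all four-tuples of transmitted/untransmitted haplotypes,
    # weighted by haplotype frequencies, as in equation (2).
    total = 0.0
    for ft, fu, mt, mu in product(freqs, repeat=4):
        total += (freqs[ft] * freqs[fu] * freqs[mt] * freqs[mu]
                  * geno_lik(geno(ft, fu), obs_f, e)
                  * geno_lik(geno(mt, mu), obs_m, e)
                  * geno_lik(geno(ft, mt), obs_c, e))
    return total

# Error-free toy window of one SNP: father 0/0, mother 1/1, heterozygous child.
e = {(t, o): 1.0 if t == o else 0.0 for t in "0123" for o in "0123"}
assert abs(trio_likelihood("0", "1", "2", {"0": 0.5, "1": 0.5}, e) - 0.0625) < 1e-12
```

The only contributing tuple is (F_T, F_U, M_T, M_U) = (0, 0, 1, 1), with probability 0.5^4 = 0.0625, which is why the brute-force cost of (N+1)^4 terms per trio becomes the bottleneck addressed in Section 2.2.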

For a window W of L SNPs, the likelihood of a data set consisting of K trios (\hat{F}_1, \hat{M}_1, \hat{C}_1), ..., (\hat{F}_K, \hat{M}_K, \hat{C}_K) is then the product of the likelihoods of each trio:

    L(W \mid P) = \prod_{k=1}^{K} L(\hat{F}_k, \hat{M}_k, \hat{C}_k).    (3)

In our model, the possibility of a deletion within a window is determined by the value of p_D: if p_D > 0 then a deletion is present in the window, while if p_D = 0, then no deletion is present. In order to determine whether a deletion occurs in a region, we use a generalized likelihood ratio test to distinguish the null hypothesis p_D = 0 from the alternative p_D > 0. For the window W we define the score S(W) by the log likelihood ratio

    S(W) = 2 \log \frac{\max_P L(W \mid P)}{\max_{P, p_D = 0} L(W \mid P)},    (4)

which intuitively measures the increase in likelihood of the data when deletions can be used to explain the observed genotypes. Since equation (4) is a generalized likelihood ratio test, S(W) is asymptotically chi-square distributed with one degree of freedom []. Thus, we obtain a p-value ρ(W) indicating the statistical significance of a deletion in W from the chi-square distribution.
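Given the two maximized likelihoods, the score and p-value computation takes a few lines. A sketch under the assumption that the likelihoods have already been maximized; the chi-square (1 df) survival function is written via the complementary error function to avoid external dependencies:

```python
import math

def deletion_p_value(lik_alt, lik_null):
    """p-value for S(W) = 2 log(lik_alt / lik_null) under chi-square, 1 df."""
    s = 2.0 * (math.log(lik_alt) - math.log(lik_null))
    s = max(s, 0.0)  # the alternative nests the null, so S(W) >= 0
    # chi-square(1 df) survival function: P(X > s) = erfc(sqrt(s / 2))
    return math.erfc(math.sqrt(s / 2.0))

# No likelihood gain from allowing a deletion -> p = 1; a large gain -> small p.
assert deletion_p_value(1.0, 1.0) == 1.0
assert deletion_p_value(math.e ** 10, 1.0) < 1e-4
```

The identity used here, that the chi-square CDF with one degree of freedom is erf(sqrt(x/2)), follows from S(W) being the square of a standard normal variate under the null.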

2.2 Efficient scoring of candidate deletions

For each window, equation (4) can be computed by maximizing the likelihood over the haplotype frequencies using the Expectation Maximization (EM) algorithm [3]. Unfortunately, direct computation of equation (4) is impractical because the number of terms in the sum in equation (2) is (N+1)^4 for a single trio, and we must perform this computation during each iteration of the EM algorithm. Since the HapMap contains genotypes for close to 4 million SNPs, this computation becomes intractable for a genome-wide analysis.

In order to make the computation feasible for the HapMap data, we take advantage of recent progress in haplotype phasing algorithms. A recent benchmarking study [7] has shown that several of the state-of-the-art phasing algorithms perform extremely well on trio data. Thus, we make the assumption that these algorithms will accurately predict the haplotype frequencies p*_1, ..., p*_N assuming there is no deletion (p_D = 0), and use one of these algorithms, HAP [4], to predict the haplotype frequencies. Since deletions are typically rare in the population, we also assume that in the presence of a deletion, p_D > 0, the remaining haplotypes are correspondingly scaled as p_i = (1 - p_D) p*_i for i = 1, ..., N. We use the notation p_D ∘ P = ((1 - p_D) p*_1, ..., (1 - p_D) p*_N, p_D) to denote the vector of scaled probabilities.

Under these assumptions, we can efficiently compute the likelihood of the observed data for any value of p_D using the likelihood in the case that p_D = 0. We denote the likelihood in the special case where p_D = 0 as L_0 = L(\hat{F}, \hat{M}, \hat{C} \mid P, p_D = 0). To compute the likelihood for any other value of p_D, we decompose the sum in the likelihood (equation (2)) into two parts: one contains all of the terms without any deletions, and the other contains terms with at least one deletion. Using the notation |D| to denote the number of deletions in the set of haplotypes F_T, F_U, M_T, M_U, the likelihood is then

    L(\hat{F}, \hat{M}, \hat{C} \mid P) = \sum_{F_T, F_U, M_T, M_U, |D| = 0} p_{F_T} p_{F_U} p_{M_T} p_{M_U} \, L(\hat{F}, F) \, L(\hat{M}, M) \, L(\hat{C}, C)    (5)
        + \sum_{F_T, F_U, M_T, M_U, |D| > 0} p_{F_T} p_{F_U} p_{M_T} p_{M_U} \, L(\hat{F}, F) \, L(\hat{M}, M) \, L(\hat{C}, C)    (6)
    = (1 - p_D)^4 L_0    (7)
        + \sum_{F_T, F_U, M_T, M_U, |D| > 0} p_{F_T} p_{F_U} p_{M_T} p_{M_U} \, L(\hat{F}, F) \, L(\hat{M}, M) \, L(\hat{C}, C).    (8)
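The scaling assumption behind the p_D ∘ P notation can be made concrete in a one-line sketch (our own helper name):

```python
def scale_frequencies(p_star, p_d):
    """Return p_D ∘ P: phased frequencies p* scaled by (1 - p_D), with the
    deletion haplotype frequency appended, so the vector still sums to 1."""
    return [(1.0 - p_d) * p for p in p_star] + [p_d]

probs = scale_frequencies([0.6, 0.4], 0.1)
assert abs(sum(probs) - 1.0) < 1e-12
assert abs(probs[0] - 0.54) < 1e-12
```

This is what makes the decomposition above work: the no-deletion terms pick up exactly the factor (1 - p_D)^4, so L_0 only needs to be computed once.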

Using this decomposition to compute the likelihood significantly improves the running time, since we only need to compute L_0 once; then, for any value of p_D, we can compute the likelihood by considering only the 4(N+1)^3 terms in the sum (8).

We then use EM [3] to compute the maximum likelihood estimate \hat{p}_D. The likelihood function (3) can be viewed as a weighted sum of multinomials, which motivates the following update. Given an estimate β of \hat{p}_D, we iteratively compute a new estimate β' as follows. For each observed trio (\hat{F}_k, \hat{M}_k, \hat{C}_k) we compute

    β'_k = \frac{\sum_{F_T, F_U, M_T, M_U, |D| > 0} |D| \, p_{F_T} p_{F_U} p_{M_T} p_{M_U} \, L(\hat{F}_k, F) \, L(\hat{M}_k, M) \, L(\hat{C}_k, C)}{4 \, L(\hat{F}_k, \hat{M}_k, \hat{C}_k \mid β ∘ P)}.    (9)

The new estimate β' is the average of the per-trio estimates, β' = (1/K) \sum_{k=1}^{K} β'_k. Since EM is a local optimization method, for each window we initialize p_D to multiple starting points and repeatedly apply the update until convergence.
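The outer loop of this procedure, iterating the update from several starting points until convergence, can be sketched generically. Here `update` is a stand-in for the per-window update of equation (9); the test drives it with a toy contraction so the loop is runnable:

```python
def iterate_to_convergence(update, starts=(0.01, 0.1, 0.5),
                           tol=1e-9, max_iter=10_000):
    """Run beta <- update(beta) from several starting points (EM is a
    local optimizer) and return the converged estimate from each start."""
    results = []
    for beta in starts:
        for _ in range(max_iter):
            new_beta = update(beta)
            if abs(new_beta - beta) < tol:
                break
            beta = new_beta
        results.append(beta)
    return results

# Toy update with fixed point 0.3; every starting point converges there.
ests = iterate_to_convergence(lambda b: 0.5 * (b + 0.3))
assert all(abs(e - 0.3) < 1e-6 for e in ests)
```

In the real method one would keep, among the converged candidates, the one with the highest likelihood L(W | β ∘ P).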

3 Results

3.1 Data Preparation and Parameter Selection

We downloaded SNPs for thirty CEPH parent-child trios, a Caucasian population from Utah (CEU), from Build 16c.1 of the "redundant-unfiltered" version of the Phase I HapMap data. We filtered the data using the same criteria as described in [2], obtaining a total of 1,109,711 SNPs.¹ The SNP data was phased using HAP [4]. Our model depends on two sets of parameters: the vector P of haplotype frequencies and the error probabilities e_{i→j}. We determine the haplotype frequencies p_i, i = 1, ..., N, directly from the phased haplotype data according to

    p_i = (# of times h_i occurred over all parents) / (2 × # of parents).    (10)
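Equation (10) amounts to counting haplotypes among the parents only, since each parent carries two. A sketch with made-up haplotype strings:

```python
from collections import Counter

def haplotype_frequencies(parent_haplotypes):
    """Estimate p_i by counting each haplotype among the parental
    haplotypes (two per parent), as in equation (10)."""
    counts = Counter(parent_haplotypes)
    total = len(parent_haplotypes)  # = 2 * (# of parents)
    return {h: c / total for h, c in counts.items()}

# Two parents -> four parental haplotypes.
freqs = haplotype_frequencies(["01", "01", "10", "11"])
assert freqs["01"] == 0.5
assert abs(sum(freqs.values()) - 1.0) < 1e-12
```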

Note that the haplotypes of the child are transmitted from the parents, and thus are not considered when estimating haplotype frequencies. To determine the parameters of the error model, we estimate the probability of a genotyping error by examining specific cases where we can predict either the presence or absence of an error. We examined every SNP in which the father and mother are homozygous and the offspring has no missing data. In these cases, the parents' genotypes determine the child's genotype. If the observed child genotype does not match the

¹ Conrad et al. [2] report 1,108,950 SNPs after filtering the same data. The 761 extra SNPs in our study are likely due to minor implementation differences. These extra SNPs represent only 0.068% of the total, and thus are unlikely to impact a comparison of the two methods.


expected genotype, we consider this an observed error. The genotype error frequency was set to the total number of observed errors divided by the total number of SNPs examined. We found a total of 23,641 errors in 18,093,695 SNPs, yielding an overall genotyping error rate of 1.31E-3. Per chromosome, the maximum and minimum estimated genotyping error rates were 3.82E-3 (chromosome 20) and 1.71E-4 (chromosome 15), respectively. This is not unexpected, since the genotyping centers responsible for different chromosomes used different criteria for filtering SNPs with errors prior to releasing the data. Thus we set \{e_{i→j}\}_{i,j=0}^{2} separately for each chromosome to the empirical values. A similar approach was taken for errors involving missing data. We set e_{0→3} = e_{1→3} = e_{2→3} equal to the number of missing SNPs in a chromosome divided by the total number of genotyped SNPs, so there was one estimate for each chromosome. We also noticed a large disparity in the rate of missing SNPs among individuals. In chromosome 1, the highest reported missing SNP rate was 0.0316 (individual NA12761) and the lowest was 2.14E-4 (NA12248), a difference of over 100-fold, which indicates great disparity in the quality of data for each individual. Due to this extreme difference, we also tested our model using the empirical frequency of missing SNPs for each individual. The results were similar to using the average value over all individuals.
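The error-rate estimate can be sketched as follows, using the paper's 0/1/2/3 genotype coding (0/1 homozygous, 2 heterozygous, 3 missing); the trio tuples below are illustrative, not HapMap data:

```python
def estimate_error_rate(trios):
    """trios: iterable of (father, mother, child) genotype characters at
    one SNP. At SNPs where both parents are homozygous and the child call
    is not missing, the child genotype is determined, so any mismatch is
    counted as an observed genotyping error."""
    examined = errors = 0
    for f, m, c in trios:
        if f in "01" and m in "01" and c != "3":  # both parents homozygous
            examined += 1
            expected = f if f == m else "2"       # determined child genotype
            if c != expected:
                errors += 1
    return errors / examined if examined else 0.0

# Two informative SNPs, one with an erroneous child call; the third SNP has
# a heterozygous father and is skipped.
rate = estimate_error_rate([("0", "0", "0"), ("0", "1", "0"), ("2", "1", "1")])
assert rate == 0.5
```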

3.2 Selecting and filtering windows

To improve the efficiency of our approach, we examine only windows for which our method has a reasonable chance of detecting a deletion. Specifically, we examine only windows containing L = 1, ..., 40 SNPs, at most twenty unique haplotypes (out of the 120 possible haplotypes for 30 trios), and at least one Mendelian error compatible with a deletion. We make predictions over the whole genome by computing a p-value for each window and merging all windows that exceed a specified p-value threshold. We are then left with a set of non-overlapping windows which are predicted to contain deletions, together with their associated p-values.
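These filtering criteria amount to a simple predicate on per-window summary statistics (the parameter names are our own, not the authors'):

```python
def window_is_candidate(num_snps, num_unique_haplotypes,
                        num_deletion_compatible_mendel_errors):
    """A window is scored only if it passes the Section 3.2 filters."""
    return (1 <= num_snps <= 40
            and num_unique_haplotypes <= 20
            and num_deletion_compatible_mendel_errors >= 1)

assert window_is_candidate(12, 8, 2)
assert not window_is_candidate(12, 35, 2)  # too many distinct haplotypes
assert not window_is_candidate(12, 8, 0)   # no compatible Mendelian error
```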

3.3 Predictions and comparison with earlier studies

We directly compare our method to the method presented in Conrad et al. [2]. Briefly, that method categorizes each SNP, for each trio, into one of seven categories labeled A through G. A and B denote a Mendelian error consistent with a deletion present in the mother and father, respectively. C represents a Mendelian error that is incompatible with a deletion in either parent. D represents a configuration with child, mother, and father being homozygous or having missing data. E and F represent configurations that include missing data with a possibility of a deletion in the mother and father, respectively. G represents a configuration incompatible with a deletion. A window within any trio was considered a potential deletion if all SNPs were in states A, D, or E for the mother, or B, D, or F for the father. Another condition was that there had to be at least two SNPs in the A category (in an A, D, or E run of consecutive SNPs, in the case of the mother having a deletion) or two SNPs in the B category (in a B, D, or F run of consecutive SNPs, in the case of the father having a deletion). The maximum range of the deletion was bounded by SNPs in categories compatible with the deletion, while the minimum range was bounded by the first and last SNP in the A category in the case of the mother having a deletion (B for the father).

Our proposed method has several advantages over the method in Conrad et al. [2]. First, our probabilistic model takes advantage of information from the complete set of genotypes, not only one trio at a time. Second, since our method uses a generalized likelihood ratio test, we can obtain a p-value for our predictions. Third, we take into account information from the haplotype structure in the region to help predict the deletions. Finally, we can estimate the parameters of the model (the rates of genotype errors and missing data) directly from the data.


P-Value Cutoff   Novel Deletions   Previously Detected Deletions   Conrad      McCarroll
0.05             1527              334                             254/1861    156/1861
0.01             782               280                             214/1062    134/1062
0.001            350               192                             152/542     106/542
1E-4             177               142                             111/319     81/319
1E-5             64                98                              80/162      62/162
1E-7             9                 59                              56/68       45/68
1E-9             1                 39                              36/40       33/40

Table 1: The number of deletions predicted by our method at different p-value thresholds that overlap with a deletion found by Conrad et al. or McCarroll et al., or are novel.

Table 1 shows the number of deletions detected by our method and the number reported in the Conrad et al. [2] and McCarroll et al. [8] studies, which used the same data for deletion identification. Of the 216 deletions reported in the CEU population by Conrad et al. [2], our method discovers 111 at a p-value threshold of 1E-4 (at which our method predicts 319 deletions in total). Similarly, we discover 81 of the total 304 deletions reported in [8] at a p-value threshold of 1E-4. As noted in a recent review paper [10], since many of the Conrad and McCarroll deletions are non-overlapping, it is likely that many more deletions exist in the data which have not been predicted by either study. However, without a complete set of experimentally verified deletions in the HapMap individuals, it is unclear whether our novel predictions are true deletions or false positives. In order to evaluate our results, we compared our predictions to the combined predictions of several existing studies of deletion polymorphisms, in order to examine the sensitivity and specificity of our method. We formed the following sets of deletions:

1. Set 1: Deletion polymorphisms reported in at least one of the following earlier studies: deletions identified by experimental techniques, namely array-CGH [9, 6] or fosmid end sequencing [11], that contain at least one SNP in the reported deletion; and deletions inferred from SNPs by the methods of Conrad et al. and McCarroll et al. [8, 2]. (606 total.)

2. Set 2: Deletions reported in at least one of the experimental studies only. (173 total.)

3. Set 3: Deletions reported in at least two of the experimental studies. (18 total.) In this case, we say that a deletion is reported twice if there is any overlap in the deleted intervals.
For each set of assumed correct deletions, we define a set of intervals over the genome as follows. We define an interval for each deletion in the set and label that interval as a positive example. Starting at the ends of each deletion, we partition the remaining portion of the genome into 50kb intervals. Each interval immediately adjoining a deletion is labelled as a gap and ignored for the purposes of the evaluation. The remaining intervals are considered not to be deletions and labelled as negative examples. Figure 2 shows the specificity and sensitivity of our method and the method of [2] when each of these three sets of deletions is taken as the "true" set. It is apparent that our method consistently outperforms the method of [2]. For example, on Set 3, the method of Conrad et al. reports only 3 true positives and 608 false positives, achieving extremely poor performance. Even with Set 1, which includes all predictions of Conrad as "true positives", their true positive and false positive fractions are 0.256 and 0.743, respectively. Conrad et al. [2] performed an experimental validation of 97 detected deletions and validated 80 of them. These 80 validated deletions correspond to 41 non-overlapping regions in the genome. Based on these experimental results, they estimated that their method has a false positive rate of approximately 5% to 14%. It is very difficult for us to directly compute our false positive rate because we cannot distinguish between false positives and novel deletions. However, by comparing our findings to all known deletions (Set


1), we estimate an upper bound on our false positive rate of 2.5%, 13.2%, and 53.9% for p-value thresholds of 1E-9, 1E-7, and 1E-4, respectively. However, this evaluation methodology is overly conservative for several reasons. Since many of the experimental studies were performed on a different set of individuals, it is likely that some of the experimentally detected deletion polymorphisms are simply not present in the HapMap data. Similarly, since the experimental studies were performed on a small set of individuals, it is likely that the HapMap contains many more deletion polymorphisms. Most likely, many of the false positives in Figure 2 are in fact novel predicted deletions. Yet even at these extremely conservative upper bounds on the false positive rates, we still discover novel candidate deletions.

Figure 2: Receiver operating characteristic (ROC) curves showing the sensitivity and specificity of our method of deletion detection on the three different sets of deletion polymorphisms (Sets 1, 2, and 3). By comparison, the method of Conrad (red points) has significantly lower performance.
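The interval-labelling scheme used for this evaluation can be sketched as follows (our own reading of the procedure: deletions are positives, the 50 kb bins immediately flanking a deletion are ignored as gaps, and the remaining bins are negatives):

```python
def label_intervals(chrom_length, deletions, bin_size=50_000):
    """Label the genome as positive (deletion), gap (adjoining a deletion,
    ignored), or negative intervals. `deletions` holds sorted,
    non-overlapping (start, end) pairs on one chromosome."""
    labels = []
    pos = 0
    for start, end in sorted(deletions):
        # 50 kb bins covering the stretch before this deletion
        bins = []
        b = pos
        while b < start:
            bins.append((b, min(b + bin_size, start)))
            b += bin_size
        for i, interval in enumerate(bins):
            # a bin adjoins a deletion if it touches this one or the previous one
            adjoins = i == len(bins) - 1 or (i == 0 and pos > 0)
            labels.append((interval, "gap" if adjoins else "negative"))
        labels.append(((start, end), "positive"))
        pos = end
    # trailing bins after the last deletion
    b, first = pos, True
    while b < chrom_length:
        interval = (b, min(b + bin_size, chrom_length))
        labels.append((interval, "gap" if first and pos > 0 else "negative"))
        first, b = False, b + bin_size
    return labels

labs = label_intervals(250_000, [(100_000, 120_000)])
assert ((100_000, 120_000), "positive") in labs
assert ((50_000, 100_000), "gap") in labs and ((120_000, 170_000), "gap") in labs
assert ((0, 50_000), "negative") in labs
```

Predicted deletions are then scored against the positive and negative intervals to produce the ROC curves of Figure 2.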

3.4 Phase II Data

We applied our method to the latest Phase II HapMap data as of Aug 12, 2006, containing 3,806,476 SNPs, subjected to the same filtering process as above and phased with HAP. Phase II presents some unique challenges due to the long-range PCR technology [5] used by Perlegen Sciences to collect the data. This technology causes errors in neighboring SNPs that are inherently correlated. Since our method takes into account the haplotype structure, it is less sensitive to these correlated errors than previously proposed methods. Table 2 contains our predictions for different p-value thresholds on chromosomes 1 and 2, which contain approximately 16.4% of the Phase II data. Using a p-value threshold of INSERT², we find 317 deletions on chromosomes 1 and 2. Extrapolating these results to the complete set of Phase II data, we estimate approximately 1933 candidate deletion polymorphisms, a six-fold increase over the Phase I data. This result suggests that there may be a significantly larger number of deletion polymorphisms present in the human genome. The complete Phase II results will be presented in the full version of the paper.

4 Discussion

We describe a new probabilistic method for detecting deletion polymorphisms from SNP and phased haplotype data of parent-child trios. Unlike previous heuristic techniques for the same problem, our method

² We use a more conservative p-value threshold for Phase II because of the presence of correlated SNPs.


[Table 2: predictions for different p-value thresholds on chromosomes 1 and 2; truncated in the source.]