Thousand Genomes And HLA Typing By NGS: Hidden Treasures In Public Short Read Data
A. Bérces 1, E. Major 1, S. Juhos 1, K. Rigó 1, T. Hague 1, P. Gourraud 2
Department of Neurology – Omixon Biocomputing (www.omixon.com)
Introduction
One of the important goals of the 1000 Genomes (1KG) project was to find common mutations in diverse populations with the help of next-generation sequencing (NGS). For HLA genes, NGS promises to resolve phase information and opens up the possibility of large-scale HLA typing.

We present a sufficiently fast algorithm that uses 1KG whole-exome Illumina data to obtain HLA types for the HLA-A, B, C, DRB1 and DQB1 genes. For validation, the results of Sanger capillary sequencing-based HLA typing were used for over a thousand Coriell samples.

Methods
Whole-exome Illumina samples were filtered for reads that can be aligned with no or very few mismatches to the collection of alleles in the IMGT/HLA database. For HLA typing, these filtered reads were aligned to the exons in the IMGT/HLA reference allele sequences, allowing no or very few mismatches and soft clips at read ends. After alignment, allele candidates were ranked by coverage depth and coverage % (the extent of the exons covered). In the next step, we filtered the allele candidates using this coverage data and kept only those with a high enough number of reads covering the allele. Finally, we searched for allele pairs, optimizing for both coverage depth and coverage %. Allele pairs were reported when both alleles at a locus had a high number of mapped reads and adequate exon coverage.

Figure: One SNP difference in MIC-A alleles
Figure: QC-failed sample

Mistypings
For samples where reads cover only some of the exons, there are too many candidates and ambiguity is high. Since the targeting method was not specific to the HLA region but was intended to capture whole-exome sequences, this is a consequence of the sequencing strategy rather than of the typing algorithm.

Reads from similar genes or pseudogenes can be another source of discordance. Some of these mistypings can be corrected by excluding reads that are mappable to more than one gene, but systematic cross-mappings will still remain.
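The candidate ranking and filtering described under Methods, together with the exclusion of reads that map to more than one gene, can be sketched roughly as follows. This is a minimal illustration, not the implementation used for the poster: the function name `rank_alleles`, the input tuple layout, the `min_reads` threshold, the use of read count as a stand-in for coverage depth, and the sort order (coverage % first, then read count) are all assumptions, and the toy alignments are invented.

```python
from collections import defaultdict


def rank_alleles(alignments, exon_lengths, min_reads=2, drop_multigene=True):
    """Rank allele candidates by exon coverage % and number of supporting reads.

    alignments:   iterable of (read_id, gene, allele, start, end) tuples, with
                  start/end (end exclusive) in the allele's exon coordinates.
    exon_lengths: dict mapping allele name to total exon length in bases.
    All names and thresholds here are hypothetical.
    """
    alignments = list(alignments)

    # Record which genes each read aligns to, so cross-mapping reads can be dropped.
    genes_per_read = defaultdict(set)
    for read_id, gene, _allele, _start, _end in alignments:
        genes_per_read[read_id].add(gene)

    covered = defaultdict(set)  # allele -> set of covered exon positions
    support = defaultdict(set)  # allele -> set of supporting read ids
    for read_id, _gene, allele, start, end in alignments:
        if drop_multigene and len(genes_per_read[read_id]) > 1:
            continue  # read is mappable to more than one gene
        covered[allele].update(range(start, end))
        support[allele].add(read_id)

    ranked = []
    for allele, positions in covered.items():
        n_reads = len(support[allele])
        if n_reads < min_reads:
            continue  # too few reads covering this candidate
        coverage_pct = 100.0 * len(positions) / exon_lengths[allele]
        ranked.append((allele, n_reads, coverage_pct))

    # Assumed ordering: coverage % first, read count as tie-breaker.
    ranked.sort(key=lambda item: (item[2], item[1]), reverse=True)
    return ranked


# Toy data (invented): read r4 also maps to a pseudogene allele and is excluded.
alignments = [
    ("r1", "HLA-A", "A*01:01", 0, 50),
    ("r2", "HLA-A", "A*01:01", 40, 100),
    ("r2", "HLA-A", "A*02:01", 40, 100),
    ("r3", "HLA-A", "A*02:01", 0, 30),
    ("r4", "HLA-A", "A*01:01", 90, 100),
    ("r4", "HLA-H", "H*01:01", 90, 100),
]
exon_lengths = {"A*01:01": 100, "A*02:01": 100, "H*01:01": 100}
ranked = rank_alleles(alignments, exon_lengths)
print(ranked)
```

The allele-pair search would then operate on the surviving candidates; how depth and coverage % are combined into a single pair score is not specified in the text, so it is left out of this sketch.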
Figure: Comparing similar alleles

Conclusion
Concordance is around 95% for MHC-I and 90% for MHC-II. Not all of the 2126 filtered samples gave reliable results, so quality check (QC) measures had to be included. For example:
• coverage % for exons 2 and 3 (or only for exon 2 for MHC-II genes) has to be over 80%
• read length has to be longer than 75 base pairs

The algorithm is capable of typing other genes such as MIC-A, MIC-B or KIR, although the determination of copy-number variations (e.g. for KIR) is unreliable. Six-digit precision is available for samples with at least a few thousand reads aligned to the reference allele.
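The two example QC criteria listed in the Conclusion could be expressed as a simple predicate. The sketch below is only illustrative: the function name `passes_qc`, the dictionary layout for per-exon coverage, and the choice to apply the exon 2 and 3 rule to any gene not listed as MHC-II are assumptions, and the actual QC may include further measures not listed on the poster.

```python
# MHC-II genes typed on the poster; every other gene is treated as MHC-I here (assumption).
MHC_II_GENES = {"HLA-DRB1", "HLA-DQB1"}


def passes_qc(gene, exon_coverage_pct, read_length,
              min_coverage_pct=80.0, min_read_length=75):
    """Return True if a sample passes the two example QC criteria for one gene.

    exon_coverage_pct: dict mapping exon number to coverage % (0 to 100).
    read_length:       read length in base pairs.
    """
    # Read length has to be longer than 75 base pairs.
    if read_length <= min_read_length:
        return False
    # Exons 2 and 3 must be covered for MHC-I, only exon 2 for MHC-II.
    required_exons = (2,) if gene in MHC_II_GENES else (2, 3)
    # Coverage % of every required exon has to be over 80%.
    return all(exon_coverage_pct.get(exon, 0.0) > min_coverage_pct
               for exon in required_exons)


print(passes_qc("HLA-A", {2: 95.0, 3: 60.0}, read_length=100))
print(passes_qc("HLA-DRB1", {2: 95.0, 3: 10.0}, read_length=100))
```

In this toy call the HLA-A sample fails because exon 3 is poorly covered, while the HLA-DRB1 sample passes because only exon 2 is checked for MHC-II genes.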
Concordance and exact match
Concordance

Gene       Total QC passed   Mistyped   Concordance %
HLA-A      621               30.5       95.1
HLA-B      673               34.5       94.9
HLA-C      714               32.5       95.4
HLA-DRB1   787               76.5       90.3
HLA-DQB1   817               110.5      86.5
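The Concordance % column is consistent with 100 × (1 − mistyped / total QC passed). The poster does not state the formula or explain the fractional mistyped counts, so the check below is a sketch under that assumption; the function name `concordance_pct` is hypothetical, and the numbers are copied from the table.

```python
# (gene, total QC passed, mistyped, reported concordance %) as listed in the table
rows = [
    ("HLA-A", 621, 30.5, 95.1),
    ("HLA-B", 673, 34.5, 94.9),
    ("HLA-C", 714, 32.5, 95.4),
    ("HLA-DRB1", 787, 76.5, 90.3),
    ("HLA-DQB1", 817, 110.5, 86.5),
]


def concordance_pct(total, mistyped):
    """Assumed definition: share of QC-passed samples that were not mistyped."""
    return 100.0 * (1.0 - mistyped / total)


for gene, total, mistyped, reported in rows:
    computed = concordance_pct(total, mistyped)
    print(f"{gene:9s} computed {computed:.1f}  reported {reported:.1f}")
```

For every gene the computed value agrees with the reported one to one decimal place, which supports this reading of the table.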
Figure: MHC-I concordance and MHC-I exact match (0 to 100) plotted against coverage % (x-axis ticks: 5, 16, 29, 44, 59, 76, 94, 98, 99)
Figure: MHC-II concordance and MHC-II exact match (0 to 100) plotted against coverage % (x-axis ticks: 17, 38, 53, 68, 78, 87, 95, 99, 100)

Contact: 1 Omixon Biocomputing, Budapest, Hungary; 2 Department of Neurology, University of California San Francisco, San Francisco, CA, USA.
See related results in our recent publication in PLOS ONE by scanning this code
Corresponding author: [email protected]