Identifying recessive gene candidates with GEMINI
Aaron Quinlan
University of Utah
quinlanlab.org
Please refer to the following GitHub Gist to find each command for this session. Commands should be copied and pasted from this Gist:
https://gist.github.com/arq5x/9e1928638397ba45da2e#file-autosomal-recessive-sh
Compound heterozygote detective work with GEMINI
Compound het refresher
Example compound heterozygote
[Figure: haplotype diagram for Dad, Mom, and Kid.]
Phasing genotypes
[Figure credit: Jessica Chong]
The result of phasing by transmission
[Figure credit: Jessica Chong]
“Phasing” a VCF file by transmission with GATK
[Figure (credit: Jessica Chong): pedigree with genotypes at one site, built up over several slides. Genotypes shown include G/G, A/A, and A/G; a final group of three A/G genotypes is marked "?".]
“Phasing” a VCF file by transmission with GEMINI
[Figure: the same pedigree build-up. Genotypes shown include G/G, A/A, and A/G; the final group of three A/G genotypes is marked "?".]
Phasing by transmission
* Convention for phased genotype is maternal allele first
[Figure: four trio examples, each showing parent and kid genotypes at two sites (A/G and C/T).]
Kid phased G|A and C|T. Both sites phasable: high confidence as deleterious alleles on different chromosomes
Kid phased G|A and T|C. Both sites phasable: yet exclude as deleterious alleles on same chromosomes
Kid phased at one site only (other site "?"). Only one site is phasable: lower confidence but cannot necessarily exclude.
Kid unphased at both sites ("?" and "?"). Neither site is phasable: lower confidence but cannot necessarily exclude (recombination?).
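The transmission logic on these slides can be sketched in a few lines of shell. This is only an illustration, not GEMINI's or GATK's implementation: the function names (has_allele, phase_kid, alt_side, classify_pair) are hypothetical, genotypes are passed as plain strings such as G/A, and the reference allele at each site is supplied by the caller as an assumption. Phased output follows the slide's convention of maternal allele first.

```shell
# has_allele GT ALLELE: succeeds if the unphased genotype GT carries ALLELE.
has_allele() { case "/$1/" in *"/$2/"*) return 0 ;; *) return 1 ;; esac; }

# phase_kid MOM DAD KID: print the kid's genotype as "maternal|paternal" when
# exactly one transmission pattern fits; otherwise print it unchanged.
phase_kid() (
  mom=$1; dad=$2; kid=$3
  a=${kid%%/*}; b=${kid##*/}
  ab=0; ba=0
  # Pattern 1: allele a from mom, allele b from dad.
  if has_allele "$mom" "$a" && has_allele "$dad" "$b"; then ab=1; fi
  # Pattern 2: allele b from mom, allele a from dad.
  if has_allele "$mom" "$b" && has_allele "$dad" "$a"; then ba=1; fi
  if [ "$a" != "$b" ] && [ "$ab" -eq 1 ] && [ "$ba" -eq 1 ]; then
    echo "$kid"            # both patterns fit: not phasable
  elif [ "$ab" -eq 1 ]; then echo "$a|$b"
  elif [ "$ba" -eq 1 ]; then echo "$b|$a"
  else echo "$kid"         # neither fits (de novo or genotyping error)
  fi
)

# alt_side REF GT: which parental chromosome carries the non-reference allele.
alt_side() (
  ref=$1; gt=$2
  case $gt in
    *"|"*) m=${gt%%|*}; p=${gt##*|} ;;
    *) echo unknown; exit 0 ;;
  esac
  if [ "$m" != "$ref" ] && [ "$p" = "$ref" ]; then echo maternal
  elif [ "$m" = "$ref" ] && [ "$p" != "$ref" ]; then echo paternal
  else echo unknown
  fi
)

# classify_pair REF1 GT1 REF2 GT2: trans (different chromosomes), cis (same
# chromosome), or unresolved when either site is unphased.
classify_pair() (
  s1=$(alt_side "$1" "$2"); s2=$(alt_side "$3" "$4")
  if [ "$s1" = unknown ] || [ "$s2" = unknown ]; then echo unresolved
  elif [ "$s1" != "$s2" ]; then echo trans
  else echo cis
  fi
)

# Slide examples, assuming A and C are the reference alleles at the two sites.
classify_pair A "G|A" C "C|T"   # alleles on different chromosomes
classify_pair A "G|A" C "T|C"   # alleles on the same chromosome
```

Under these assumptions the first call prints trans and the second prints cis, matching the first two cases on the slide. A kid whose parents are both heterozygous for the same alleles comes back unphased, which corresponds to the "?" cases.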
Compound het test case
[Figure credit: Jessica Chong]
The comp_hets tool in GEMINI
Requires a PED file
#family_id sample_id paternal_id maternal_id sex phenotype
family1 1805 -9 -9 2 1
family1 1847 -9 -9 1 1
family1 4805 1847 1805 1 2
[Figure: trio pedigree with father 1847, mother 1805, and affected child 4805.]
http://gemini.readthedocs.org/en/latest/content/tools.html#comp-hets-identifying-potential-compound-heterozygotes
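Because phasing by transmission depends on the parent columns of the PED file, it can be worth checking that every child's parents are actually present before loading. The sketch below is one way to do that with awk; the function name trios_in_ped is hypothetical, and it assumes the six-column, whitespace-separated layout shown on the slide, with phenotype 2 meaning affected.

```shell
# trios_in_ped: read a PED file on stdin and print one line per sample whose
# father and mother both appear as samples in the same family.
trios_in_ped() {
  awk '
    /^#/ || NF < 6 { next }          # skip the header and malformed lines
    {
      n++
      fam[n] = $1; id[n] = $2; dad[n] = $3; mom[n] = $4; pheno[n] = $6
      seen[$1 SUBSEP $2] = 1         # remember every (family, sample) pair
    }
    END {
      for (i = 1; i <= n; i++) {
        dkey = fam[i] SUBSEP dad[i]
        mkey = fam[i] SUBSEP mom[i]
        if ((dkey in seen) && (mkey in seen))
          print fam[i], id[i], dad[i], mom[i], (pheno[i] == 2 ? "affected" : "unaffected")
      }
    }'
}

# The PED file from the slide.
trios_in_ped <<'EOF'
#family_id sample_id paternal_id maternal_id sex phenotype
family1 1805 -9 -9 2 1
family1 1847 -9 -9 1 1
family1 4805 1847 1805 1 2
EOF
```

For the slide's PED file this reports a single complete trio, with 4805 as the affected child of 1847 and 1805.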
Create a GEMINI database from a VCF
Notes:
1. The VCF has been normalized and decomposed with VT
2. The VCF has been annotated with VEP.
http://gemini.readthedocs.org/en/latest/content/preprocessing.html#step-1-split-left-align-and-trim-variants
$ curl https://s3.amazonaws.com/gemini-tutorials/trio.trim.vep.vcf.gz > trio.trim.vep.vcf.gz
$ curl https://s3.amazonaws.com/gemini-tutorials/recessive.ped > recessive.ped
$ gemini load --cores 4 \
    -v trio.trim.vep.vcf.gz \
    -t VEP \
    --skip-gene-tables \
    -p recessive.ped \
    trio.trim.vep.recessive.db
Note: copy and paste the full command from the Github Gist to avoid errors
Running the comp_hets tool.
$ gemini comp_hets trio.trim.vep.recessive.db
Note: copy and paste the full command from the Github Gist
Again, we can limit the attributes returned w/ the --columns option.
$ gemini comp_hets --columns "chrom, start, end, gene, impact, cadd_raw" trio.trim.vep.recessive.db
Note: copy and paste the full command from the Github Gist
chrom start end gene impact cadd_raw variant_id family_id family_members family_genotypes samples family_count comp_het_id priority
chr2 69702114 69702115 AAK1 UTR_3_prime -3.52 1638 family1 1805;unaffected,1847;unaffected,4805;affected G/C,G/C,G/C 4805 1 1_1638_1646 2
chr2 69870243 69870244 AAK1 UTR_5_prime 1.04 1646 family1 1805;unaffected,1847;unaffected,4805;affected G/G,G/A,G|A 4805 1 1_1638_1646 2
chr2 69702108 69702109 AAK1 UTR_3_prime -2.51 1637 family1 1805;unaffected,1847;unaffected,4805;affected A/C,A/C,A/C 4805 1 2_1637_1638 2
chr2 69702114 69702115 AAK1 UTR_3_prime -3.52 1638 family1 1805;unaffected,1847;unaffected,4805;affected G/C,G/C,G/C 4805 1 2_1637_1638 2
chr2 69702101 69702102 AAK1 UTR_3_prime 0.39 1636 family1 1805;unaffected,1847;unaffected,4805;affected T/T,T/T,T/C 4805 1 3_1636_1646 2
chr2 69870243 69870244 AAK1 UTR_5_prime 1.04 1646 family1 1805;unaffected,1847;unaffected,4805;affected G/G,G/A,G|A 4805 1 3_1636_1646 2
chr2 69702114 69702115 AAK1 UTR_3_prime -3.52 1638 family1 1805;unaffected,1847;unaffected,4805;affected G/C,G/C,G/C 4805 1 4_1638_1645 2
chr2 69870216 69870218 AAK1 UTR_5_prime None 1645 family1 1805;unaffected,1847;unaffected,4805;affected AT/A,AT/A,AT/A 4805 1 4_1638_1645 2
chr2 69702101 69702102 AAK1 UTR_3_prime 0.39 1636 family1 1805;unaffected,1847;unaffected,4805;affected T/T,T/T,T/C 4805 1 5_1636_1638 2
Start with highest priority compound heterozygote candidates
[Figure: trio genotypes at two sites; the kid is phased G|A at the first site and C|T at the second.]
Both sites phasable: high confidence as deleterious alleles on different chromosomes
Restrict to highest priority (i.e., priority == 1) candidates
$ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    trio.trim.vep.recessive.db \
    | awk '$14==1' \
    | head
chrom start end gene impact cadd_raw variant_id family_id family_members family_genotypes samples family_count comp_het_id priority
chr15 89398552 89398553 ACAN non_syn_coding 2.57 9519 family1 1805;unaffected,1847;unaffected,4805;affected C/A,C/C,A|C 4805 1 52_9519_9534 1
chr15 89417237 89417238 ACAN non_syn_coding -0.06 9534 family1 1805;unaffected,1847;unaffected,4805;affected A/A,A/G,A|G 4805 1 52_9519_9534 1
chr15 89398552 89398553 ACAN non_syn_coding 2.57 9519 family1 1805;unaffected,1847;unaffected,4805;affected C/A,C/C,A|C 4805 1 53_9519_9535 1
chr15 89417628 89417629 ACAN splice_region -0.18 9535 family1 1805;unaffected,1847;unaffected,4805;affected G/G,G/A,G|A 4805 1 53_9519_9535 1
chr2 111598957 111598958 ACOXL non_syn_coding 0.95 3247 family1 1805;unaffected,1847;unaffected,4805;affected C/C,C/T,C|T 4805 1 86_3247_3251 1
chr2 111850514 111850515 ACOXL non_syn_coding 3.17 3251 family1 1805;unaffected,1847;unaffected,4805;affected C/T,C/C,T|C 4805 1 86_3247_3251 1
chr17 55182877 55182878 AKAP1 non_syn_coding 2.57 13305 family1 1805;unaffected,1847;unaffected,4805;affected C/T,C/C,T|C 4805 1 104_13305_13308 1
chr17 55183316 55183317 AKAP1 synonymous_coding 0.43 13308 family1 1805;unaffected,1847;unaffected,4805;affected G/G,G/A,G|A 4805 1 104_13305_13308 1
chr2 29448409 29448410 ALK non_syn_coding 1.49 839 family1 1805;unaffected,1847;unaffected,4805;affected T/T,T/G,T|G 4805 1 186_839_841 1
chr2 29455198 29455199 ALK non_syn_coding 3.51 841 family1 1805;unaffected,1847;unaffected,4805;affected A/T,A/A,T|A 4805 1 186_839_841 1
$ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    trio.trim.vep.recessive.db \
    | awk '$14==1' \
    | wc -l
Note: copy and paste the full command from the Github Gist
612 lines
Each compound heterozygote is a set of two lines, so we have 306 (612 / 2) compound heterozygote candidates
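Dividing the line count by two works because each candidate occupies exactly two lines. An alternative is to count distinct comp_het_id values directly, looking up the columns by header name so the count does not depend on priority being field 14. The sketch below assumes whitespace-separated output with a header row containing comp_het_id and priority, as shown on the slides. The function name count_comp_hets is hypothetical, and the sample rows are trimmed to three columns for brevity.

```shell
# count_comp_hets [PRIORITY]: read comp_hets-style output (with header) on
# stdin and print the number of distinct comp_het_id values at that priority.
count_comp_hets() {
  awk -v want="${1:-1}" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map names to fields
    $(col["priority"]) == want { ids[$(col["comp_het_id"])] = 1 }
    END { n = 0; for (id in ids) n++; print n }'
}

# Trimmed sample built from rows shown on the slides.
count_comp_hets 1 <<'EOF'
gene comp_het_id priority
ACAN 52_9519_9534 1
ACAN 52_9519_9534 1
ACAN 53_9519_9535 1
ACAN 53_9519_9535 1
AAK1 1_1638_1646 2
AAK1 1_1638_1646 2
EOF
```

The sample contains two priority 1 candidates, so the call prints 2. With GEMINI installed, piping the comp_hets output into count_comp_hets 1 should agree with the 612 / 2 figure, provided every candidate contributes exactly two priority 1 lines.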
So many candidates. Time to start --filtering!
$ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    --filter "impact_severity != 'LOW'" \
    trio.trim.vep.recessive.db \
    | awk '$14==1' \
    | wc -l
Note: copy and paste the full command from the Github Gist
260 lines (130 comp_hets)
Use ESP and ExAC to focus on rare variants
$ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    --filter "impact_severity != 'LOW' \
    and ((aaf_esp_ea