Identifying recessive gene candidates with GEMINI

Aaron Quinlan, University of Utah, quinlanlab.org

Please refer to the following Github Gist to find each command for this session. Commands should be copied and pasted from this Gist: https://gist.github.com/arq5x/9e1928638397ba45da2e#file-autosomal-recessive-sh

Compound heterozygote detective work with GEMINI


Compound het refresher


Example compound heterozygote

[Figure: trio pedigree (Dad, Mom, Kid) showing each individual's alleles.]

Phasing genotypes


Jessica Chong

The result of phasing by transmission


“Phasing” a VCF file by transmission with GATK

[Figure: pedigree built up over several slides, with genotypes G/G, A/A, and A/G at one site; the last step adds three A/G genotypes and a "?".]

“Phasing” a VCF file by transmission with GEMINI

[Figure: pedigree built up over several slides, with genotypes G/G, A/A, and A/G at one site; the last step adds three A/G genotypes and a "?".]

Phasing by transmission

[Figure: four trio pedigrees, each showing genotypes at two sites (A/G and C/T) and the kid's phased genotypes where they can be determined.]

1. Parents A/A and A/G at the first site, C/T and C/C at the second; kid G|A and C|T. Both sites phasable: high confidence as deleterious alleles on different chromosomes.
2. Parents A/A and G/A at the first site, C/C and C/T at the second; kid G|A and T|C. Both sites phasable: yet exclude as deleterious alleles on same chromosome.
3. Only one site is phasable: lower confidence but cannot necessarily exclude.
4. Neither site is phasable: lower confidence but cannot necessarily exclude (recombination?).

* Convention for phased genotype is maternal allele first
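The transmission logic on these slides can be sketched in a few lines of shell. This is an illustration only, not how GATK or GEMINI implement phasing. The function names `has_allele` and `phase_by_transmission` are hypothetical, and the sketch assumes unphased biallelic genotypes written as `X/Y` with no missing calls.

```shell
# has_allele GENOTYPE ALLELE: succeeds if ALLELE is one of the two alleles.
has_allele() {
  case "/$1/" in
    */"$2"/*) return 0 ;;
    *)        return 1 ;;
  esac
}

# phase_by_transmission MOM DAD KID
# Prints the kid's genotype as maternal|paternal when exactly one assignment
# of alleles to parents is possible; otherwise prints it unchanged (unphased).
phase_by_transmission() (
  mom=$1 dad=$2 kid=$3
  k1=${kid%%/*}
  k2=${kid##*/}
  fwd=0 rev=0
  # Ordering 1: first allele from mom, second from dad.
  if has_allele "$mom" "$k1" && has_allele "$dad" "$k2"; then fwd=1; fi
  # Ordering 2: second allele from mom, first from dad.
  if has_allele "$mom" "$k2" && has_allele "$dad" "$k1"; then rev=1; fi
  if [ "$fwd" -eq 1 ] && { [ "$k1" = "$k2" ] || [ "$rev" -eq 0 ]; }; then
    echo "$k1|$k2"
  elif [ "$rev" -eq 1 ] && [ "$fwd" -eq 0 ]; then
    echo "$k2|$k1"
  else
    echo "$kid"   # both orderings possible, or neither: leave unphased
  fi
)

phase_by_transmission A/G A/A A/G   # prints G|A
phase_by_transmission C/C C/T C/T   # prints C|T
phase_by_transmission A/G A/G A/G   # prints A/G (all heterozygous, not phasable)
```

The last call is the "?" case from the pedigrees: when mom, dad, and kid are all heterozygous, either allele could have come from either parent.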

Compound het test case


The comp_hets tool in GEMINI

Requires a PED file

#family_id  sample_id  paternal_id  maternal_id  sex  phenotype
family1     1805       -9           -9           2    1
family1     1847       -9           -9           1    1
family1     4805       1847         1805         1    2

[Figure: trio pedigree with father 1847, mother 1805, and child 4805.]

http://gemini.readthedocs.org/en/latest/content/tools.html#comp-hets-identifying-potential-compound-heterozygotes
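Because comp_hets depends on the relationships in the PED file, it can help to confirm that the file describes a complete trio before loading it. The following is a minimal sketch, not part of GEMINI: the function name `list_complete_trios` is hypothetical, and it assumes the six-column layout shown above, with phenotype 2 meaning affected and -9 meaning an unknown parent.

```shell
# list_complete_trios PED_FILE
# Prints "kid father mother" for every affected sample (phenotype == 2)
# whose father and mother are both listed as samples in the same file.
list_complete_trios() {
  awk '
    /^#/ || NF < 6 { next }   # skip the header and short lines
    {
      seen[$2] = 1            # sample_id
      dad[$2] = $3            # paternal_id
      mom[$2] = $4            # maternal_id
      pheno[$2] = $6          # phenotype
      order[++n] = $2
    }
    END {
      for (i = 1; i <= n; i++) {
        s = order[i]
        if (pheno[s] == 2 && (dad[s] in seen) && (mom[s] in seen))
          print s, dad[s], mom[s]
      }
    }
  ' "$1"
}

# Demo on a temporary copy of the PED content shown above.
ped=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  '#family_id' sample_id paternal_id maternal_id sex phenotype \
  family1 1805 -9 -9 2 1 \
  family1 1847 -9 -9 1 1 \
  family1 4805 1847 1805 1 2 > "$ped"
list_complete_trios "$ped"   # prints: 4805 1847 1805
rm -f "$ped"
```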

Create a GEMINI database from a VCF

Notes:
1. The VCF has been normalized and decomposed with VT.
2. The VCF has been annotated with VEP.

http://gemini.readthedocs.org/en/latest/content/preprocessing.html#step-1-split-left-align-and-trim-variants

$ curl https://s3.amazonaws.com/gemini-tutorials/trio.trim.vep.vcf.gz > trio.trim.vep.vcf.gz
$ curl https://s3.amazonaws.com/gemini-tutorials/recessive.ped > recessive.ped
$ gemini load --cores 4 \
              -v trio.trim.vep.vcf.gz \
              -t VEP \
              --skip-gene-tables \
              -p recessive.ped \
              trio.trim.vep.recessive.db

Note: copy and paste the full command from the Github Gist to avoid errors

Running the comp_hets tool

$ gemini comp_hets trio.trim.vep.recessive.db

Note: copy and paste the full command from the Github Gist

Again, we can limit the attributes returned w/ the --columns option.

$ gemini comp_hets --columns "chrom, start, end, gene, impact, cadd_raw" trio.trim.vep.recessive.db

Note: copy and paste the full command from the Github Gist

chrom  start     end       gene  impact       cadd_raw  variant_id  family_id  family_members                                 family_genotypes  samples  family_count  comp_het_id  priority
chr2   69702114  69702115  AAK1  UTR_3_prime  -3.52     1638        family1    1805;unaffected,1847;unaffected,4805;affected  G/C,G/C,G/C       4805     1             1_1638_1646  2
chr2   69870243  69870244  AAK1  UTR_5_prime  1.04      1646        family1    1805;unaffected,1847;unaffected,4805;affected  G/G,G/A,G|A       4805     1             1_1638_1646  2
chr2   69702108  69702109  AAK1  UTR_3_prime  -2.51     1637        family1    1805;unaffected,1847;unaffected,4805;affected  A/C,A/C,A/C       4805     1             2_1637_1638  2
chr2   69702114  69702115  AAK1  UTR_3_prime  -3.52     1638        family1    1805;unaffected,1847;unaffected,4805;affected  G/C,G/C,G/C       4805     1             2_1637_1638  2
chr2   69702101  69702102  AAK1  UTR_3_prime  0.39      1636        family1    1805;unaffected,1847;unaffected,4805;affected  T/T,T/T,T/C       4805     1             3_1636_1646  2
chr2   69870243  69870244  AAK1  UTR_5_prime  1.04      1646        family1    1805;unaffected,1847;unaffected,4805;affected  G/G,G/A,G|A       4805     1             3_1636_1646  2
chr2   69702114  69702115  AAK1  UTR_3_prime  -3.52     1638        family1    1805;unaffected,1847;unaffected,4805;affected  G/C,G/C,G/C       4805     1             4_1638_1645  2
chr2   69870216  69870218  AAK1  UTR_5_prime  None      1645        family1    1805;unaffected,1847;unaffected,4805;affected  AT/A,AT/A,AT/A    4805     1             4_1638_1645  2
chr2   69702101  69702102  AAK1  UTR_3_prime  0.39      1636        family1    1805;unaffected,1847;unaffected,4805;affected  T/T,T/T,T/C       4805     1             5_1636_1638  2

Start with highest priority compound heterozygote candidates

[Figure: trio pedigree with parents A/A and A/G at the first site and C/T and C/C at the second; the kid's genotypes A/G and C/T phase to G|A and C|T.]

Both sites phasable: high confidence as deleterious alleles on different chromosomes

Restrict to highest priority (i.e., priority==1) candidates

$ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    trio.trim.vep.recessive.db \
    | awk '$14==1' \
    | head

chrom  start      end        gene   impact             cadd_raw  variant_id  family_id  family_members                                 family_genotypes  samples  family_count  comp_het_id      priority
chr15  89398552   89398553   ACAN   non_syn_coding     2.57      9519        family1    1805;unaffected,1847;unaffected,4805;affected  C/A,C/C,A|C       4805     1             52_9519_9534     1
chr15  89417237   89417238   ACAN   non_syn_coding     -0.06     9534        family1    1805;unaffected,1847;unaffected,4805;affected  A/A,A/G,A|G       4805     1             52_9519_9534     1
chr15  89398552   89398553   ACAN   non_syn_coding     2.57      9519        family1    1805;unaffected,1847;unaffected,4805;affected  C/A,C/C,A|C       4805     1             53_9519_9535     1
chr15  89417628   89417629   ACAN   splice_region      -0.18     9535        family1    1805;unaffected,1847;unaffected,4805;affected  G/G,G/A,G|A       4805     1             53_9519_9535     1
chr2   111598957  111598958  ACOXL  non_syn_coding     0.95      3247        family1    1805;unaffected,1847;unaffected,4805;affected  C/C,C/T,C|T       4805     1             86_3247_3251     1
chr2   111850514  111850515  ACOXL  non_syn_coding     3.17      3251        family1    1805;unaffected,1847;unaffected,4805;affected  C/T,C/C,T|C       4805     1             86_3247_3251     1
chr17  55182877   55182878   AKAP1  non_syn_coding     2.57      13305       family1    1805;unaffected,1847;unaffected,4805;affected  C/T,C/C,T|C       4805     1             104_13305_13308  1
chr17  55183316   55183317   AKAP1  synonymous_coding  0.43      13308       family1    1805;unaffected,1847;unaffected,4805;affected  G/G,G/A,G|A       4805     1             104_13305_13308  1
chr2   29448409   29448410   ALK    non_syn_coding     1.49      839         family1    1805;unaffected,1847;unaffected,4805;affected  T/T,T/G,T|G       4805     1             186_839_841      1
chr2   29455198   29455199   ALK    non_syn_coding     3.51      841         family1    1805;unaffected,1847;unaffected,4805;affected  A/T,A/A,T|A       4805     1             186_839_841      1
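The `awk '$14==1'` filter works because priority is the 14th column for this particular --columns list; adding or removing a column shifts it. One hedged alternative is to locate the column by its header name. The function name `filter_priority` is hypothetical, and the sketch assumes the first line of the output is a header containing a field named `priority` and that no field contains whitespace, as in the output shown above.

```shell
# filter_priority LEVEL
# Reads comp_hets-style output on stdin, finds the "priority" column from
# the header line, and prints the header plus rows whose priority == LEVEL.
filter_priority() {
  awk -v level="$1" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "priority") col = i
      if (!col) { print "no priority column in header" > "/dev/stderr"; exit 1 }
      print
      next
    }
    $col == level
  '
}

# Usage (assumes the database built earlier in this session):
#   gemini comp_hets --columns "chrom, start, end, gene, impact, cadd_raw" \
#       trio.trim.vep.recessive.db | filter_priority 1 | head
```

Unlike the positional filter, this version keeps the header line, so the result can still be read by column name downstream.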

$ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    trio.trim.vep.recessive.db \
    | awk '$14==1' \
    | wc -l

Note: copy and paste the full command from the Github Gist

612 lines

Each compound heterozygote is a set of two lines, so we have 306 (612 / 2) compound heterozygote candidates.
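Dividing the line count by two relies on every candidate contributing exactly two lines. As a cross-check, the distinct comp_het_id values can be counted directly. This is a sketch under stated assumptions: the function name `count_comp_hets` is hypothetical, and the default column 13 is the position of comp_het_id for the --columns list used in this session.

```shell
# count_comp_hets [COLUMN]
# Reads comp_hets-style output on stdin and prints the number of distinct
# values in COLUMN (default 13, where comp_het_id sits in the output above).
count_comp_hets() {
  awk -v col="${1:-13}" '
    NF < col { next }                        # ignore short or empty lines
    $col == "comp_het_id" { next }           # skip a header line if present
    !($col in seen) { seen[$col] = 1; n++ }  # first time this ID is seen
    END { print n + 0 }
  '
}

# Usage (assumes the database built earlier in this session):
#   gemini comp_hets --columns "chrom, start, end, gene, impact, cadd_raw" \
#       trio.trim.vep.recessive.db | awk '$14==1' | count_comp_hets
```

If the two-lines-per-candidate assumption holds, this count should equal half the `wc -l` result.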

So many candidates. Time to start --filtering!

$ gemini comp_hets \
    --columns "chrom, start, end, gene, impact, cadd_raw" \
    --filter "impact_severity != 'LOW'" \
    trio.trim.vep.recessive.db \
    | awk '$14==1' \
    | wc -l

Note: copy and paste the full command from the Github Gist

260 lines (130 comp_hets)

Use ESP and ExAC to focus on rare variants

$ gemini comp_hets \ --columns "chrom, start, end, gene, impact, cadd_raw" \ --filter "impact_severity != 'LOW' \ and ((aaf_esp_ea