Taking Advantage of Long RNA-Seq Reads
Vince Magrini
Pacific Biosciences User Group Meeting
September 18, 2013
Overview • Proof-of-Principle • SMART-cDNA Synthesis • PB-SBL size distributions
• Gene Annotation • A. ceylanicum – Parasitic Nematode • Method • Observations • Advantages
• Validating Gene Fusions • Prostate Cancer • ETS Transcription Factor Fusions • Expression Validation Experiments
Advantages of cDNA sequencing • Extremely data-rich sequencing method • Human genome contains ~2% coding sequence
• Alternative to sequencing difficult highly repetitive genomes • Initial characterization of genes within novel genomes • Companion data set to the genomes the GSC produces • Annotation • Structure
• Detection in resequencing projects • Expressed somatic variants • Isoforms • Structural variants • lncRNAs
• It all begins with the RNA sample
RNA Quality
Agilent Total RNA 6000 Pico Kit (50-5000 pg)
A. Flash Frozen
B. FFPE
PacBio RNA-Seq: Proof-of-Principle
Work Flow
DNase-Treated: 20 µg LNCaP Total RNA
* Turbo DNA Free
* RIN: 10
mRNA DIRECT Micro Kit (~17 µg RNA)
* Single round of polyA selection
Clontech SMARTer (17.5 ng polyA RNA)
* Single Primer Amplification
* qPCR Cycle Optimize (15 cycles)
Work Flow
Normalization: Evrogen Trimmer (DSN)
* Input: 300 ng cDNA/rxn (4 rxns)
* Enzyme: 1 U/rxn
* qPCR Optimize (15 cycles)
PacBio 2kb Sample Prep
* Input: 500 ng cDNA
* AMPure XP – 1:1
Sequencing Options:
* Pre-RSII Upgrade
* 2x55 min collection
* 1x120 min collection
* 1x120 min collection + "Stage start" (similar to hot start for PCR)
BLAT Reads
BLAT Results - chr5:113,698,816-113,832,197
Primary Analysis Alignment = 133,382 bp - read = 1,747 bp
chr2:238,657,005-238,672,096
Primary Analysis Alignment = 15,092 bp - read = 1,390 bp
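The two BLAT examples above show short reads covering long genomic intervals, which is what a spliced cDNA alignment looks like. A minimal sketch of that arithmetic follows, assuming the browser coordinates are 1-based and inclusive. The helper names `parse_region` and `genomic_span` are illustrative and not part of BLAT.

```python
def parse_region(region: str) -> tuple[str, int, int]:
    """Split a browser-style region such as 'chr5:1,000-2,000' into parts."""
    chrom, coords = region.split(":")
    start_text, end_text = coords.replace(",", "").split("-")
    return chrom, int(start_text), int(end_text)


def genomic_span(region: str) -> int:
    """Length of the interval, assuming 1-based inclusive coordinates."""
    _, start, end = parse_region(region)
    return end - start + 1


# Regions and read lengths copied from the two BLAT result slides.
alignments = [
    ("chr5:113,698,816-113,832,197", 1747),
    ("chr2:238,657,005-238,672,096", 1390),
]

for region, read_length in alignments:
    span = genomic_span(region)
    # The gap between span and read length is mostly intronic sequence.
    print(f"{region}: span {span:,} bp, read {read_length:,} bp")
```

Under that coordinate assumption, the computed spans match the 133,382 bp and 15,092 bp figures on the slides.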
Gene Annotation using PacBio RNA-Seq data
A. ceylanicum – Improving Genome Annotation
A. ceylanicum is a parasitic hookworm that attaches to the intestine of animals, particularly hamsters, causing anemia in the host.
Species: Ancylostoma ceylanicum (hookworm)
Genome Size: 350 Mb
N50 #: 263 Mb
454 Newbler assembly
Illumina-based improvements
Protein coding genes: 15,892
Maker: Evidence-Based
454 cDNA
Illumina RNAseq
[Figure: Life cycle of the intestinal hookworm infection. Labels: Infective Larvae, Penetration, Larvae, Eggs, "Easy Livin'"]
PacBio Sequencing Adult Female Total RNA C2/C2 Sequencing (8-Pack) P4/C2 Sequencing (8-Pack)
A. ceylanicum Work Flow
Hamster Hookworm Total RNA SMARTer Non-norm PCR
Hamster Hookworm Total RNA SMARTer Norm PCR
Hamster Hookworm Total RNA SMARTer Norm 2kb PBSBL
PacBio RNA-Seq Gene Stats
Metric          | Reference* | Ref + C2 | Ref + P4
Genes (Maker)   | 15,892     | 17,690   | 17,540
PacBio Evidence | N/A        | 8,448    | 8,257
Longest Read    |            | 3,791    | 3,572
CEGMA genes     | 80%        | 88%      | 90%
Genes w UTR     | 1,891      | 7,466    | 7,297
* Reference database includes: A. ceylanicum ESTs, nematoda.est, in-house 454 nematoda, nematoda.nr
CEGMA (Core Eukaryotic Genes Mapping Approach): 458 core proteins highly conserved across taxa (Korf et al.)
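To read the Gene Stats values as relative changes, a small sketch like the one below can help. It only re-uses numbers from the slide. The dictionary layout and the function names `fold_change` and `point_change` are assumptions made for illustration. Note that it compares gene counts, which is a different quantity from the UTR base counts mentioned later in the talk.

```python
# Values transcribed from the Gene Stats slide (Reference, Ref + C2, Ref + P4).
gene_stats = {
    "Genes (Maker)": {"Reference": 15892, "Ref + C2": 17690, "Ref + P4": 17540},
    "Genes w UTR": {"Reference": 1891, "Ref + C2": 7466, "Ref + P4": 7297},
}
cegma_percent = {"Reference": 80, "Ref + C2": 88, "Ref + P4": 90}


def fold_change(row: dict, column: str, baseline: str = "Reference") -> float:
    """Ratio of a data set's count to the reference annotation's count."""
    return row[column] / row[baseline]


def point_change(row: dict, column: str, baseline: str = "Reference") -> int:
    """Difference in percentage points relative to the reference."""
    return row[column] - row[baseline]


for metric, row in gene_stats.items():
    for column in ("Ref + C2", "Ref + P4"):
        print(f"{metric}, {column}: {fold_change(row, column):.2f}x reference")

for column in ("Ref + C2", "Ref + P4"):
    print(f"CEGMA genes, {column}: +{point_change(cegma_percent, column)} points")
```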
Gene Predictions
PacBio Original
Gene Predictions
Individual 1º PacBio read alignments to assembly using Exonerate
Collapsed PB Isoforms
* EST DB aligned to assembly 1,379 PacBio BLASTN to assembly
* BLASTX: GenBank Nematode DB
* TBLASTX: GenBank Nematode DB
* Processed without PacBio Data
Isoforms or artifacts? SMART cDNA synthesis may be unable to efficiently reverse-transcribe the 5' end of transcripts. (Figure tracks: Exonerate, Raw BLAST)
Annotation Highlights
[Figure: Venn diagram, C2 genes vs P4 genes. C2 only: 1,379; shared: 16,331; P4 only: 1,229]
Advantages of long cDNA reads using PacBio:
• There is an increase in full-length gene predictions.
• ~50% of the genes show a structural change.
• ~80% of the genes show changes to gene start/stop positions.
• ~10x increase in prediction of UTR bases
• ~10% increase in CEGMA genes
• C2 and P4 data sets are very similar
• Add expression data from Life Cycle Stages
• Vaccine Development (White Paper)
Validation of gene fusion predictions using PacBio
ChimeraScan: A tool for detecting chimeric transcripts in NGS data
(PNAS - Maher et al., 2009) (Bioinformatics – Iyer et al., 2011)
Example Illustrating ETS fusions
M.A. Rubin 2012
Individual samples can have numerous candidates to validate
[Figure: bar chart, y-axis 0-30, series: Read-through, Intra-chromosomal, Inter-chromosomal. x-axis samples: NOR1, NOR3, NOR5 (Normal); PCA2, PCA4, PCA6, PCA8, PCA10, PCA12 (ETS-, ETS+)]
ETS rearrangement negative samples have fewer genomic aberrations
ETS-: no previously identified gene fusion involving ERG, ETV1, ETV5 (any ETS family member)
ETS+: known ETS fusion has been documented in the sample
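The ETS+/ETS- definition above can be written as a simple lookup. This is only a sketch: `ETS_GENES` holds just the three family members named on the slide (the real family is larger), `ets_status` is a hypothetical helper, and the fusion pairs are illustrative inputs rather than results from this study.

```python
# Only the ETS family members named on the slide; the full family is larger.
ETS_GENES = {"ERG", "ETV1", "ETV5"}


def ets_status(fusions: list[tuple[str, str]]) -> str:
    """Return 'ETS+' if any documented fusion involves a listed ETS gene."""
    for five_prime_gene, three_prime_gene in fusions:
        if five_prime_gene in ETS_GENES or three_prime_gene in ETS_GENES:
            return "ETS+"
    return "ETS-"


# Illustrative inputs: (5' partner, 3' partner) gene symbols.
print(ets_status([("TMPRSS2", "ERG")]))
print(ets_status([("GENE_A", "GENE_B")]))
```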
Application across cohorts necessitates more streamlined validation
49 Cell lines + 40 Tissues
RNA Isolation
Paired-end Library Prep
Sequencing (Illumina GAII or HiSeq 2000)
Reads Aligned to human genome + transcriptome
Identify fusions with ChimeraScan
Identify Recurrent Candidates
Identify Functionally Recurrent Candidates
qPCR Validation and/or FISH
Functional assays
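The "Identify Recurrent Candidates" step amounts to counting how many samples share each fusion call. A minimal sketch follows, assuming the calls have already been reduced to (5' gene, 3' gene) pairs per sample. The function `recurrent_fusions`, the `min_samples` threshold, and the placeholder sample and gene names are all hypothetical, and this does not reproduce ChimeraScan's own output format.

```python
from collections import defaultdict


def recurrent_fusions(calls_by_sample: dict, min_samples: int = 2) -> dict:
    """Map each fusion seen in at least `min_samples` samples to those samples."""
    samples_by_fusion = defaultdict(set)
    for sample, fusions in calls_by_sample.items():
        for five_prime_gene, three_prime_gene in fusions:
            # A set keeps each sample once even if a fusion is listed twice.
            samples_by_fusion[(five_prime_gene, three_prime_gene)].add(sample)
    return {
        fusion: sorted(samples)
        for fusion, samples in samples_by_fusion.items()
        if len(samples) >= min_samples
    }


# Placeholder data, not results from the study.
example_calls = {
    "S1": [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")],
    "S2": [("GENE_A", "GENE_B")],
    "S3": [("GENE_E", "GENE_F")],
}

for fusion, samples in recurrent_fusions(example_calls).items():
    print("-".join(fusion), "found in", ", ".join(samples))
```

Raising `min_samples` tightens the definition of "recurrent" when the cohort is large.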
Rationale for requiring PacBio
• Need a high-throughput, orthogonal sequencing method to rapidly validate fusions
• High depth of coverage to confirm gene fusions expressed at lower levels
• Accurately identify in-frame fusions
  • In-frame fusions are important for discovering new fusion peptides
  • In-frame predictions can be misleading without validating the full-length transcript of the upstream fusion partner
    • I.e., using the most common isoform may result in an out-of-frame prediction; identifying the full-length transcript may reveal an exon-skipping event or novel isoform that does create an in-frame fusion
• Approach?
  • Gene Specific RT -> SMARTer/SMRT
  • cDNA Capture
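The in-frame point above comes down to modular arithmetic on coding bases. The sketch below assumes the reading frame is preserved when the coding bases kept from the upstream partner and the coding bases removed from the downstream partner leave the same remainder modulo 3. The function `is_in_frame` and the exon lengths are hypothetical, chosen only to show how an exon-skipping isoform can change the prediction.

```python
def is_in_frame(upstream_coding_exon_lengths: list[int],
                downstream_cds_bases_removed: int) -> bool:
    """Check whether the downstream partner keeps its original codon phase.

    upstream_coding_exon_lengths: coding bases per upstream exon retained
        5' of the breakpoint.
    downstream_cds_bases_removed: coding bases of the downstream partner
        that lie 5' of the breakpoint and are lost in the fusion.
    """
    upstream_bases = sum(upstream_coding_exon_lengths)
    return upstream_bases % 3 == downstream_cds_bases_removed % 3


# Hypothetical upstream partner: the common isoform has three coding exons,
# the alternative isoform skips the middle one.
common_isoform = [100, 50, 60]
exon_skipping_isoform = [100, 60]
downstream_removed = 91  # hypothetical

print("common isoform in frame:", is_in_frame(common_isoform, downstream_removed))
print("exon-skipping isoform in frame:",
      is_in_frame(exon_skipping_isoform, downstream_removed))
```

With these made-up lengths the common isoform predicts an out-of-frame fusion while the exon-skipping isoform is in frame, which is why a full-length read of the upstream partner matters.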
Acknowledgements
WUSTL
PacBio
Sean McGrath Amy Ly Ryan Demeter
Nick Sisneros Jeff Palas
Makedonka Mitreva Kym Hallsworth-Pepin Chris Maher Nicole Maher Chris Cabanski
Cheryl Heiner Primo Meredith Ashby Tyson Clark Marty Badgett