Taking Advantage of Long RNA-Seq Reads

Vince Magrini, Pacific Biosciences User Group Meeting, September 18, 2013

Overview
• Proof-of-Principle
  • SMART-cDNA Synthesis
  • PB-SBL size distributions
• Gene Annotation
  • A. ceylanicum – Parasitic Nematode
  • Method
  • Observations
  • Advantages
• Validating Gene Fusions
  • Prostate Cancer
  • ETS Transcription Factor Fusions
  • Expression Validation Experiments

Advantages of cDNA sequencing
• Extremely data-rich sequencing method
  • Human genome contains ~2% coding sequence
• Alternative to sequencing difficult, highly repetitive genomes
• Initial characterization of genes within novel genomes
• Companion data set to the genomes the GSC produces
  • Annotation
  • Structure
• Detection in resequencing projects
  • Expressed somatic variants
  • Isoforms
  • Structural variants
  • LncRNAs
• All begins with the RNA sample

RNA Quality: Agilent Total RNA 6000 Pico Kit (50-5000 pg)
A. Flash Frozen
B. FFPE

PacBio RNA-Seq: Proof-of-Principle

Work Flow
• DNase-treated: 20 µg LNCaP total RNA
  • Turbo DNA Free
  • RIN: 10
• mRNA DIRECT Micro Kit (~17 µg RNA)
  • Single round of polyA selection
• Clontech SMARTer (17.5 ng polyA RNA)
  • Single primer amplification
  • qPCR cycle optimization (15 cycles)
• Normalization: Evrogen Trimmer (DSN)
  • Input: 300 ng cDNA/rxn (4 rxns)
  • Enzyme: 1 U/rxn
  • qPCR cycle optimization (15 cycles)
• PacBio 2 kb sample prep
  • Input: 500 ng cDNA
  • Ampure XP, 1:1
• Sequencing options: pre-RSII upgrade
  • 2 x 55 min collection
  • 1 x 120 min collection
  • 1 x 120 min collection + "Stage start" (similar to hot start for PCR)

BLAT Reads

BLAT Results - chr5:113,698,816-113,832,197

Primary Analysis Alignment = 133,382 bp - read = 1,747 bp

chr2:238,657,005-238,672,096

Primary Analysis Alignment = 15,092 bp - read = 1,390 bp
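The alignment spans quoted above follow from the BLAT coordinates: an inclusive range start-end covers end - start + 1 bases. The short Python sketch below reproduces both spans and compares each with its read length; the large ratios are what one would expect when a spliced cDNA read is aligned across introns. The function name `span_bp` is ours for illustration and is not part of BLAT.

```python
def span_bp(region: str) -> int:
    """Length in bp of an inclusive 'chrom:start-end' region (commas allowed in numbers)."""
    _, coords = region.split(":")
    start, end = (int(x.replace(",", "")) for x in coords.split("-"))
    return end - start + 1

# (region, read length in bp) as reported on the slides
hits = [
    ("chr5:113,698,816-113,832,197", 1747),
    ("chr2:238,657,005-238,672,096", 1390),
]

for region, read_len in hits:
    span = span_bp(region)
    # Ratio of genomic span to read length: a rough indicator of intronic sequence skipped
    print(f"{region}: span {span:,} bp, read {read_len:,} bp, ratio {span / read_len:.1f}x")
```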

Gene Annotation using PacBio RNA-Seq data

A. ceylanicum – Improving Genome Annotation
[Figure: life cycle of the intestinal hookworm infection; labels: Eggs, Larvae, Infective Larvae, Penetration, "Easy Livin'"]
A. ceylanicum is a parasitic hookworm that attaches to the intestine of animals, particularly hamsters, causing anemia in the host.
• Species: Ancylostoma ceylanicum (hookworm)
• Genome size: 350 Mb
• N50 #: 263 Mb
• 454 Newbler assembly
• Illumina-based improvements
• Protein-coding genes: 15,892
• Maker (evidence-based): 454 cDNA, Illumina RNA-seq

PacBio Sequencing
• Adult female total RNA
• C2/C2 sequencing (8-pack)
• P4/C2 sequencing (8-pack)

A. ceylanicum Work Flow
• Hamster -> Hookworm -> Total RNA -> SMARTer -> Non-norm -> PCR
• Hamster -> Hookworm -> Total RNA -> SMARTer -> Norm -> PCR
• Hamster -> Hookworm -> Total RNA -> SMARTer -> Norm -> 2kb PBSBL

PacBio RNA-Seq Gene Stats

                   Reference*   Ref + C2   Ref + P4
Genes (Maker)      15,892       17,690     17,540
PacBio Evidence    N/A          8,448      8,257
Longest Read                    3,791      3,572
CEGMA genes        80%          88%        90%
Genes w UTR        1,891        7,466      7,297

* Reference database includes: A. ceylanicum ESTs, nematoda.est, in-house 454 nematoda, nematoda.nr
CEGMA (Core Eukaryotic Genes Mapping Approach): 458 core proteins highly conserved across taxa (Korf et al.)
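To make the comparison between the reference annotation and the PacBio-supplemented runs easier to read, the sketch below computes the absolute change and fold change for three rows of the gene stats table. The values are transcribed from the slide; the dictionary layout and the helper name `fold_change` are illustrative assumptions, not part of the Maker output.

```python
# Values transcribed from the gene stats table.
# Column order: Reference, Ref + C2, Ref + P4
stats = {
    "Genes (Maker)": (15892, 17690, 17540),
    "Genes w UTR": (1891, 7466, 7297),
    "CEGMA genes (%)": (80, 88, 90),
}

def fold_change(ref: float, new: float) -> float:
    """Ratio of the new value to the reference value."""
    return new / ref

for name, (ref, c2, p4) in stats.items():
    # Show signed differences and fold changes relative to the reference column
    print(
        f"{name}: C2 {c2 - ref:+,} ({fold_change(ref, c2):.2f}x), "
        f"P4 {p4 - ref:+,} ({fold_change(ref, p4):.2f}x)"
    )
```

Note that for the CEGMA row the difference is in percentage points, not a relative percent change.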

Gene Predictions
Tracks:
• Gene predictions: PacBio vs. original
• Individual 1º PacBio read alignments to assembly using Exonerate
• Collapsed PB isoforms
• EST DB aligned to assembly*
• PacBio BLASTN to assembly
• BLASTX: GenBank nematode DB*
• TBLASTX: GenBank nematode DB*
* Processed without PacBio data

Isoforms or artifacts? SMART cDNA synthesis may be unable to efficiently reverse-transcribe the 5' end of transcripts.
[Figure panels: Exonerate; Raw BLAST]

Annotation Highlights
C2 genes vs P4 genes (Venn diagram): C2 only 1,379; shared 16,331; P4 only 1,229

Advantages of long cDNA reads using PacBio:
• There is an increase in full-length gene predictions.
• ~50% of the genes show a structural change.
• ~80% of the genes show changes to gene start/stop positions.
• ~10x increase in prediction of UTR bases
• ~10% increase in CEGMA genes
• C2 and P4 data sets are very similar
• Add expression data from life cycle stages
• Vaccine development (white paper)

Validation of gene fusion predictions using PacBio

ChimeraScan: A tool for detecting chimeric transcripts in NGS data

(PNAS - Maher et al., 2009) (Bioinformatics – Iyer et al., 2011)

Example Illustrating ETS fusions

M.A. Rubin 2012

Individual samples can have numerous candidates to validate
[Figure: bar chart of candidates per sample (y-axis 0 to 30), split into read-through, intra-chromosomal, and inter-chromosomal; samples NOR1, NOR3, NOR5 (Normal) and PCA2, PCA4, PCA6, PCA8, PCA10, PCA12 (ETS- and ETS+)]

ETS rearrangement-negative samples have fewer genomic aberrations

ETS-: no previously identified gene fusion involving ERG, ETV1, ETV5 (any ETS family member)
ETS+: known ETS fusion has been documented in the sample
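The ETS-/ETS+ definition above amounts to a simple membership test on a sample's documented fusions. The sketch below is a minimal illustration under stated assumptions: the gene set contains only the three ETS members named on the slide (the real family is larger), and the function name `ets_status` and the input format (a list of 5'/3' gene pairs) are hypothetical.

```python
# ETS family members named on the slide; assumption: extend this set for real use
ETS_GENES = {"ERG", "ETV1", "ETV5"}

def ets_status(fusions):
    """Return 'ETS+' if any documented fusion involves an ETS gene, else 'ETS-'.

    fusions: iterable of (five_prime_gene, three_prime_gene) tuples.
    """
    for gene5, gene3 in fusions:
        if gene5 in ETS_GENES or gene3 in ETS_GENES:
            return "ETS+"
    return "ETS-"

# Illustrative inputs only, not data from the study
print(ets_status([("TMPRSS2", "ERG")]))
print(ets_status([("GENE_A", "GENE_B")]))
```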

Application across cohorts necessitates more streamlined validation
• 49 cell lines + 40 tissues
• RNA isolation
• Paired-end library prep
• Sequencing (Illumina GAII or HiSeq 2000)
• Reads aligned to human genome + transcriptome
• Identify fusions with ChimeraScan
• Identify recurrent candidates
• Identify functionally recurrent candidates
• qPCR validation and/or FISH
• Functional assays
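The "identify recurrent candidates" step in the pipeline above can be sketched as counting how many samples share a fusion call. This is a hedged illustration, not ChimeraScan's own logic: the input format, the function names, and the threshold `min_samples=2` are assumptions, and reading "functionally recurrent" as "the same gene recurring with different partners" is our interpretation.

```python
from collections import Counter

def recurrent_fusions(calls, min_samples=2):
    """Fusions (5' gene, 3' gene) called in at least min_samples samples.

    calls: dict mapping sample name -> list of (five_prime_gene, three_prime_gene).
    """
    counts = Counter()
    for fusions in calls.values():
        counts.update(set(fusions))  # count each fusion once per sample
    return {fusion: n for fusion, n in counts.items() if n >= min_samples}

def recurrent_genes(calls, min_samples=2):
    """Genes involved in any fusion in at least min_samples samples, regardless of partner."""
    counts = Counter()
    for fusions in calls.values():
        counts.update({gene for pair in fusions for gene in pair})
    return {gene: n for gene, n in counts.items() if n >= min_samples}

# Hypothetical calls for illustration only
calls = {
    "S1": [("GENE_A", "GENE_X"), ("GENE_B", "GENE_Y")],
    "S2": [("GENE_A", "GENE_X")],
    "S3": [("GENE_C", "GENE_X")],
}
print(recurrent_fusions(calls))
print(recurrent_genes(calls))
```

In this toy input, GENE_X recurs in three samples with two different partners, which is the kind of candidate the gene-level count surfaces and the pair-level count alone would understate.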

Rationale for requiring PacBio
• Need a high-throughput, orthogonal sequencing method to rapidly validate fusions
• High depth of coverage to confirm gene fusions expressed at lower levels
• Accurately identify in-frame fusions
  • In-frame fusions are important for discovering new fusion peptides
  • In-frame predictions can be misleading without validating the full-length transcript of the upstream fusion partner
    • I.e., using the most common isoform may result in an out-of-frame prediction; identifying the full-length transcript may reveal an exon-skipping event or novel isoform that does create an in-frame fusion
• Approach?
  • Gene-specific RT -> SMARTer/SMRT
  • cDNA capture
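The point about isoform choice and reading frame can be illustrated with simple modular arithmetic: a fusion keeps the downstream partner in frame when the number of coding bases contributed by the upstream partner is congruent, modulo 3, to the number of coding bases that precede the junction in the downstream partner's native transcript. The sketch below assumes this simplified model (no UTR breakpoints, no inserted bases); the exon lengths and the function name `is_in_frame` are hypothetical.

```python
def is_in_frame(upstream_cds_exons, downstream_cds_offset):
    """Check whether a fusion preserves the downstream reading frame.

    upstream_cds_exons: coding lengths (bp) of the upstream-partner exons
        retained before the fusion junction.
    downstream_cds_offset: coding bases (bp) that precede the junction in the
        downstream partner's native transcript.
    """
    return sum(upstream_cds_exons) % 3 == downstream_cds_offset % 3

# Hypothetical upstream partner: most common isoform vs. an exon-skipping isoform
common_isoform = [120, 94, 141]   # 355 coding bp before the junction
exon2_skipped = [120, 141]        # 261 coding bp before the junction
downstream_offset = 300           # hypothetical; junction falls on a codon boundary

print(is_in_frame(common_isoform, downstream_offset))  # prediction from the common isoform
print(is_in_frame(exon2_skipped, downstream_offset))   # prediction from the full-length read
```

With these made-up lengths, the common isoform predicts an out-of-frame fusion while the exon-skipping isoform is in frame, which is why the full-length upstream transcript matters for the prediction.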

Acknowledgements

WUSTL: Sean McGrath, Amy Ly, Ryan Demeter, Makedonka Mitreva, Kym Hallsworth-Pepin, Chris Maher, Nicole Maher, Chris Cabanski

PacBio: Nick Sisneros, Jeff Palas, Cheryl Heiner, Primo, Meredith Ashby, Tyson Clark, Marty Badgett