BAC sequencing supercharged by single molecule technologies

Targeted genomic assembly: BAC sequencing supercharged by single molecule technologies
Kevin Fengler
Trait Informatics, DuPont Pioneer

BAC sequencing to the rescue…

1) Existing reference sequence is inadequate

Not robust or complete
• Reference sequence has local mis-assemblies
• WGS assembly does not contain enough flanking sequence
• Structural variation cannot be resolved (CNVs, PAVs, duplications, large INDELs, repeats, etc.)

Does not contain target region
• Disease resistance loci are often novel in resistant source
• Characterize complex transgenic events (fosmids)

2) Need high-quality assembly for regions of interest only
• Diversity in maize limits use of single reference
• At given maize locus, >50% sequence can be non-shared between two inbreds

Example BAC region

Plant Cell (2005)

Complex Trait Locus (CTL) Concept for Trait Stacking
CTL = co-located SSI landing pads for transgene insertion

Maize Chromosome

Expected benefits
• No endogenous gene disruption = regulatory friendly
• Simplifies trait introgression, improves conversion quality = less cost
• Insertion sites can be evaluated and validated prior to use = less risk

Need robust assemblies spanning CTLs in transformation lines and for gene editing targets

Targeted Genomic Reference Assembly Workflow
Generate a robust reference assembly specifically for any: gene, locus, region, transgenic event, QTL, CTL, etc.

1) Create BAC library
2) BAC superpool sequencing
3) In silico BAC selection
4) Sequencing of BAC Pool
5) Assembly
6) Assembly Validation

Single BAC: 150 kb contig (Gene target)
Multi BAC region: 900 kb contig (QTL target)

Deconvolution by Sequencing (DbS)
Method for assigning reads from a sequenced BAC pool to individual BACs

[Figure: 2,688 Clones in a SUPERPOOL (7 Plates*) are combined into Matrix Pools by Plate, Columns, and Rows. Figure labels: A-H; Control; 5, 11, 17, 23. Deconvolution of individual reads assigns them to BAC 1, BAC 2, BAC 3, … BAC n, followed by Alignment to available reference and Assembly.]

Enables in silico BAC selection
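
The matrix-pool idea can be illustrated with a minimal sketch. It assumes that each clone contributes to exactly one plate pool, one row pool, and one column pool, and that a read sequence found in exactly one pool per dimension can be assigned to that well. The `(dimension, label)` encoding and the function name `deconvolve` are hypothetical; the slides do not describe the actual DbS implementation.

```python
def deconvolve(pool_hits):
    """Map the set of pools containing a read to one well address.

    pool_hits: iterable of (dimension, label) tuples, where dimension is
    "plate", "row", or "column" (a hypothetical encoding for this sketch).
    Returns (plate, row, column), or None when the read is ambiguous.
    """
    by_dim = {"plate": set(), "row": set(), "column": set()}
    for dim, label in pool_hits:
        by_dim[dim].add(label)
    # A read is deconvolutable only if it hits exactly one pool per dimension.
    if all(len(labels) == 1 for labels in by_dim.values()):
        return tuple(next(iter(by_dim[d])) for d in ("plate", "row", "column"))
    return None
```

Reads seen in several pools of the same dimension (for example, repeats shared between clones) return `None`, which corresponds to the "non-deconvolutable" regions mentioned on the next slide.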

Aligning DbS Reads to Reference Creates BAC Map

Single BAC (139 kb) aligned to maize reference genome
• Uniquely mapped reads
• Spaces are non-shared with reference, non-deconvolutable or non-unique regions of BAC

Many BACs aligned to maize reference genome (1.4 Mb region)
• Structural variation
• Lack of coverage
• No uniquely mapping reads

Alignments useful for reference-based BAC tiling path selection

Selecting Minimal BAC Tiling Paths
Generating targeted assembly for 1.8 Mb region from 3 maize lines with 4x DbS

• Line A (18 BACs, no expected gaps)
• Line B (18 BACs, 3 expected gaps)
• Line C (13 BACs, 3 expected gaps)

Probes designed from contig ends can be used to screen additional un-sequenced BAC library resources
Unfilled gaps introduce uncertainty
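
Reference-based tiling path selection can be sketched as a greedy interval cover. This is an illustration only: it assumes each BAC has already been reduced to a `(name, ref_start, ref_end)` interval from its DbS read alignments, and the function name `minimal_tiling_path` is hypothetical. The slides do not state which selection algorithm was actually used.

```python
def minimal_tiling_path(bacs, start, end):
    """Greedy interval cover of the target region [start, end).

    bacs: list of (name, ref_start, ref_end) from BAC-to-reference alignments.
    Returns (path, gaps): chosen BAC names and uncovered (start, end) intervals.
    """
    bacs = sorted(bacs, key=lambda b: b[1])
    path, gaps = [], []
    pos, i = start, 0
    while pos < end:
        best = None
        # Among BACs starting at or before pos, keep the one reaching furthest.
        while i < len(bacs) and bacs[i][1] <= pos:
            if best is None or bacs[i][2] > best[2]:
                best = bacs[i]
            i += 1
        if best is not None and best[2] > pos:
            path.append(best[0])
            pos = best[2]
        elif i < len(bacs):
            # Nothing extends coverage: record an expected gap and jump ahead.
            nxt = min(bacs[i][1], end)
            gaps.append((pos, nxt))
            pos = nxt
        else:
            gaps.append((pos, end))
            pos = end
    return path, gaps
```

For example, with BACs `A` (0-100), `B` (50-180), `C` (90-150) and `D` (250-400) over the region 0-400, the sketch selects `A`, `B`, `D` and reports one expected gap at 180-250, the kind of gap that contig-end probes could then be used to fill.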

The Power of BAC Pooling
Bypass barcoding limitations & lower cost

Assembly

Pool of ~15-35 BACs (~150 kb) with 1 library + 1 SMRT cell

Robust assemblies self “sort”

Typical Sequencing Results

Protocol
• 20 kb template preparation using BP
• gTube sheared
• Size-selection 10-50 kb BP
• Run: 6 hr movie

16 BAC pool
• 128-184 kb
• 282-511x coverage per BAC

BAC Assembly Catch-22

7 kb vector + >150 kb insert

BAC vector sequence is problematic (and a blessing) with long reads

Challenge of assembling overlapping circles

Need to remove vector sequence, without losing BAC vector junction info

BAC Assembly Method with PacBio

BAC (insert + vector)

Correct and assemble PacBio reads

Remove vector sequence

Assemble sub-contigs

Validation
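
The "remove vector sequence" step, while keeping the BAC vector junction information, might look like the following sketch. It uses exact string matching as a stand-in for a real alignment step and ignores reverse-complement and approximate matches; the function name `split_at_vector` and the tuple layout are assumptions for illustration, not the pipeline's actual method.

```python
def split_at_vector(contig, vector):
    """Cut exact vector matches out of a contig.

    Returns a list of (sequence, left_is_junction, right_is_junction) tuples,
    so each sub-contig remembers which of its ends bordered the vector.
    """
    if not vector:
        raise ValueError("vector sequence must not be empty")
    pieces = []
    pos = 0
    left_junction = False
    while True:
        hit = contig.find(vector, pos)
        if hit == -1:
            break
        if hit > pos:
            # Sequence before the vector ends at an insert/vector junction.
            pieces.append((contig[pos:hit], left_junction, True))
        pos = hit + len(vector)
        left_junction = True
    if pos < len(contig):
        pieces.append((contig[pos:], left_junction, False))
    return pieces
```

The junction flags let a later step recognise that two sub-contigs flanking the vector belong to the same BAC insert and mark its true ends.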

Pooled BAC Assembly Method with PacBio

Mix of potentially overlapping BACs

Correct and assemble PacBio reads, polish

Remove vector sequence

Assemble sub-contigs together

Assemble all contigs, polish

Validation

Assembly Validation Aided By Visualization
Aligning sequencing data to contigs helps validate assemblies

Non-overlapping BAC in pool (155 kb)

DbS Illumina reads confirm identity of selected BAC

PacBio reads from BAC end junctions confirm completeness

PacBio corrected insert reads in agreement with assembly confirm sequence

9 overlapping BACs in pool (954 kb)

BAC Deluxe Assembly Pipeline
~90% of maize BACs auto-assemble completely with >99.99% accuracy
Remaining BACs require manual curation to complete/correct

Manual Curation of Assemblies

BAC A.1 (38kb)

DbS BAC Reads

BAC A.2 (122kb)

BAC end junction reads

Corrected insert reads

Soft clipped, corrected reads from ends useful

Assemble contigs + longest overhanging reads (repeat until closed)
Repeat validation
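
One way to find the "longest overhanging reads" is to inspect soft clips in the SAM CIGAR strings of corrected reads aligned to a contig end. The sketch below assumes the reads have already been filtered to those aligning at the relevant contig end; the helper names `soft_clips` and `longest_overhang` are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clips(cigar):
    """Return (leading, trailing) soft-clip lengths from a SAM CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may flank soft clips; they carry no sequence, so skip them.
    ops = [(n, op) for n, op in ops if op != "H"]
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return lead, trail

def longest_overhang(reads, end="right"):
    """Pick the read whose soft clip extends furthest past a contig end.

    reads: list of (name, cigar) for reads aligned at that contig end.
    Returns the read name, or None if no read overhangs the end.
    """
    idx = 1 if end == "right" else 0
    best = max(reads, key=lambda r: soft_clips(r[1])[idx], default=None)
    if best is None or soft_clips(best[1])[idx] == 0:
        return None
    return best[0]
```

The selected read would then be added to the contigs for another assembly round, repeating until the gap is closed.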

Detecting Mis-Assemblies

BAC A

BAC B

DbS BAC Reads

BAC end junction reads

Corrected insert reads

What about BACs without DbS data?

2 overlapping BACs (248 kb)

No DbS data, which BACs are these?

BAC end junction reads

Corrected insert reads

1) Digest assembly into k-mers
2) Align to reference (maize chromosome)

Determine physical coordinates in genome
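
A minimal sketch of this k-mer placement follows. It uses exact, forward-strand string search on an in-memory reference as a stand-in for a real aligner, and keeps only k-mers that occur exactly once. The function names and the default `k` and `step` values are illustrative assumptions, not parameters from the slides.

```python
def kmers(seq, k, step):
    """Yield (offset, kmer) pairs sampled every `step` bases."""
    for i in range(0, len(seq) - k + 1, step):
        yield i, seq[i:i + k]

def locate(assembly, reference, k=25, step=25):
    """Return (ref_start, ref_end) spanned by uniquely placed k-mers, or None."""
    hits = []
    for _, kmer in kmers(assembly, k, step):
        first = reference.find(kmer)
        # Keep only k-mers found exactly once; repeats give no reliable position.
        if first != -1 and reference.find(kmer, first + 1) == -1:
            hits.append(first)
    if not hits:
        return None
    return min(hits), max(hits) + k
```

Because the assembly may contain sequence that is not shared with the reference, the returned span is only an approximate physical location for the BACs.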

BioNano Genome Mapping*

Essential for regaining long-range connectivity lost from fragmenting and amplifying DNA

*BioNano Genomics website: http://www.bionanogenomics.com/technology/why-genome-mapping/

Genome Mapping Provides Comprehensive View of BAC Assemblies In Physical Space
Single Genome Map (assembly of nicked molecules)

• Maize Genome Map
• BspQI (17 nicks/100 kb)
• Throughput (>150 kb): 197 Gb (3 chips)
• Assembly size: 1.99 Gb
• 1,983 genome maps
• Single molecule N50: 289 kb
• Genome map N50: 1.44 Mb

• Check integrity of contig sequence
• Confirm contig order and orientation
• Characterize gaps between contigs

[Figure: Genome Maps aligned to BAC Assemblies (CtgA, CtgB, CtgC, CtgD; 160 kb label)]
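
To compare a sequence contig with a genome map, the contig first needs an in silico nick pattern. The sketch below locates BspQI recognition sites (GCTCTTC, searched on both strands) in a contig and reports their density per 100 kb. It treats the start of the recognition site as the nick position, which ignores the enzyme's exact nick offset, and it does not perform the alignment of that pattern to genome maps. The function names are hypothetical.

```python
BSPQI_SITE = "GCTCTTC"

def revcomp(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def nick_sites(contig):
    """Return sorted 0-based start positions of BspQI sites on either strand."""
    contig = contig.upper()
    positions = []
    for motif in (BSPQI_SITE, revcomp(BSPQI_SITE)):
        pos = contig.find(motif)
        while pos != -1:
            positions.append(pos)
            pos = contig.find(motif, pos + 1)
    return sorted(positions)

def nicks_per_100kb(contig):
    """Site density, comparable to the 17 nicks/100 kb figure quoted above."""
    return 100_000 * len(nick_sites(contig)) / len(contig)
```

The distances between consecutive sites give the label spacing that a genome map of the same region would be expected to show.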