Targeted genomic assembly: BAC sequencing supercharged by single molecule technologies
Kevin Fengler
Trait Informatics, DuPont Pioneer
BAC sequencing to the rescue…
1) Existing reference sequence is inadequate
Not robust or complete
• Reference sequence has local mis-assemblies
• WGS assembly does not contain enough flanking sequence
• Structural variation cannot be resolved: CNVs, PAVs, duplications, large INDELs, repeats, etc.
Does not contain target region
• Disease resistance loci are often novel in resistant source
• Characterize complex transgenic events (fosmids)
2) Need high-quality assembly for regions of interest only
Diversity in maize limits use of single reference
At given maize locus, >50% sequence can be non-shared between two inbreds
Example BAC region
Plant Cell (2005)
Complex Trait Locus (CTL) Concept for Trait Stacking
CTL = co-located SSI landing pads for transgene insertion
[Figure: Maize Chromosome]
Expected benefits
• No endogenous gene disruption = regulatory friendly
• Simplifies trait introgression, improves conversion quality = less cost
• Insertion sites can be evaluated and validated prior to use = less risk
Need robust assemblies spanning CTLs in transformation lines and for gene editing targets
Targeted Genomic Reference Assembly Workflow
Generate a robust reference assembly specifically for any: gene, locus, region, transgenic event, QTL, CTL, etc.
1) Create BAC library
2) BAC superpool sequencing
3) In silico BAC selection
4) Sequencing of BAC Pool
5) Assembly
6) Assembly Validation
Examples
• Multi BAC region: 900 kb contig (QTL target)
• Single BAC: 150 kb contig (Gene target)
Deconvolution by Sequencing (DbS)
Method for assigning reads from a sequenced BAC pool to individual BACs
[Figure: DbS pooling scheme. 2,688 clones; matrix pools (plate, columns, rows) form a SUPERPOOL (7 plates*); row labels A to H; labels "Control" and "5, 11, 17, 23"; deconvolution of individual reads to BAC 1, BAC 2, BAC 3 … BAC n; alignment to available reference; assembly]
Enables in silico BAC selection
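The slide does not spell out the assignment logic. The following is a minimal sketch, assuming a conventional three-dimensional matrix design in which every clone sits in exactly one plate pool, one row pool and one column pool, and a sequence tag seen in exactly one pool per dimension is assigned to the well where they intersect. The 384-well layout (16 rows x 24 columns) is an assumption; it is consistent with 7 plates holding 2,688 clones. Pool names and function names are hypothetical.

```python
from itertools import product

ROWS = "ABCDEFGHIJKLMNOP"      # assumed 16 rows of a 384-well plate
COLUMNS = range(1, 25)         # assumed 24 columns
PLATES = range(1, 8)           # 7 plates per superpool (from the slide)


def all_wells():
    """Enumerate every (plate, row, column) address in the superpool."""
    return list(product(PLATES, ROWS, COLUMNS))


def deconvolute(pools_with_tag):
    """Assign a sequence tag to one clone address, or None if ambiguous.

    pools_with_tag: set of pool names in which the tag was observed,
    e.g. {"plate:3", "row:C", "col:11"} (hypothetical naming scheme).
    """
    plates = [p for p in pools_with_tag if p.startswith("plate:")]
    rows = [p for p in pools_with_tag if p.startswith("row:")]
    cols = [p for p in pools_with_tag if p.startswith("col:")]
    # Require exactly one hit per dimension; anything else is treated as
    # non-deconvolutable (for example a repeat shared by several clones).
    if not (len(plates) == len(rows) == len(cols) == 1):
        return None
    return (int(plates[0].split(":")[1]),
            rows[0].split(":")[1],
            int(cols[0].split(":")[1]))


clone_count = len(all_wells())
assignment = deconvolute({"plate:3", "row:C", "col:11"})
ambiguous = deconvolute({"plate:3", "plate:5", "row:C", "col:11"})
```

Tags that appear in more than one pool of a dimension are left unassigned here, which is one plausible source of the "non-deconvolutable" regions mentioned on the following slide.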
Aligning DbS Reads to Reference Creates BAC Map
Single BAC (139 kb) aligned to maize reference genome
• Uniquely mapped reads
• Spaces are non-shared with reference, non-deconvolutable or non-unique regions of BAC
Many BACs aligned to maize reference genome (1.4 Mb region)
• Structural variation
• Lack of coverage
• No uniquely mapping reads
Alignments useful for reference-based BAC tiling path selection
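One way to turn such alignments into a BAC map is to collapse each BAC's uniquely mapped reads into a single reference interval. The sketch below assumes a hypothetical input of (bac_id, ref_start, ref_end) tuples on one chromosome. The min_reads threshold is an illustrative parameter, not something stated on the slide.

```python
from collections import defaultdict


def bac_map(hits, min_reads=2):
    """Summarise uniquely mapped reads as one reference interval per BAC.

    hits: iterable of (bac_id, ref_start, ref_end) for uniquely mapped,
    deconvoluted reads (hypothetical input format).
    Returns {bac_id: (start, end, n_reads)}.
    """
    per_bac = defaultdict(list)
    for bac_id, start, end in hits:
        per_bac[bac_id].append((start, end))
    intervals = {}
    for bac_id, spans in per_bac.items():
        if len(spans) < min_reads:
            continue  # too little evidence to place this BAC
        intervals[bac_id] = (min(s for s, _ in spans),
                             max(e for _, e in spans),
                             len(spans))
    return intervals


example_hits = [
    ("BAC1", 1000, 1100), ("BAC1", 90000, 90100), ("BAC1", 139900, 140000),
    ("BAC2", 120000, 120100), ("BAC2", 250000, 250100),
    ("BAC3", 500000, 500100),
]
placed = bac_map(example_hits)
```

Taking the extreme coordinates is sensitive to a single mis-mapped read, so a real implementation would likely trim outliers before reporting an interval.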
Selecting Minimal BAC Tiling Paths
Generating targeted assembly for 1.8 Mb region from 3 maize lines with 4x DbS
• Line A (18 BACs, no expected gaps)
• Line B (18 BACs, 3 expected gaps)
• Line C (13 BACs, 3 expected gaps)
Probes designed from contig ends can be used to screen additional un-sequenced BAC library resources
Unfilled gaps introduce uncertainty
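The slides do not say how the tiling path is computed. A common choice is the greedy interval-cover algorithm, sketched here under the assumption that each BAC has already been reduced to a (start, end) interval on the reference. The function name and input format are hypothetical. The function also reports stretches that no BAC covers, corresponding to the "expected gaps" above.

```python
def minimal_tiling_path(bacs, region_start, region_end):
    """Greedy interval cover.

    bacs: {bac_id: (start, end)} reference intervals.
    Returns (path, gaps): chosen BAC ids in order, and the uncovered
    (start, end) stretches that no BAC spans.
    """
    ordered = sorted(bacs.items(), key=lambda kv: kv[1][0])
    path, gaps = [], []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best_id, best_end = None, covered_to
        # Among BACs starting at or before the covered edge, keep the
        # one that reaches furthest to the right.
        while i < len(ordered) and ordered[i][1][0] <= covered_to:
            bac_id, (_, end) = ordered[i]
            if end > best_end:
                best_id, best_end = bac_id, end
            i += 1
        if best_id is None:
            # Nothing overlaps the edge: record a gap and jump ahead.
            if i == len(ordered) or ordered[i][1][0] >= region_end:
                gaps.append((covered_to, region_end))
                break
            gaps.append((covered_to, ordered[i][1][0]))
            covered_to = ordered[i][1][0]
            continue
        path.append(best_id)
        covered_to = best_end
    return path, gaps


example_bacs = {"a": (0, 150), "b": (100, 260), "c": (120, 300),
                "d": (280, 450), "e": (500, 640)}
path, gaps = minimal_tiling_path(example_bacs, 0, 640)
```

In this toy example BAC "b" is skipped because "c" starts within the covered region and extends further, and the stretch between 450 and 500 is reported as a gap.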
The Power of BAC Pooling
Bypass barcoding limitations & lower cost
Pool of ~15-35 BACs (~150 kb) with 1 library + 1 SMRT cell
Assembly
Robust assemblies self “sort”
Typical Sequencing Results
• Protocol: 20 kb Template preparation using BP, gTube sheared
• Size-selection: 10-50 kb BP
• Run: 6 hr movie
• 16 BAC pool
• 128-184 kb
• 282-511x coverage per BAC
BAC Assembly Catch-22
7 kb vector + >150 kb insert
BAC vector sequence is problematic (and a blessing) with long reads
Challenge of assembling overlapping circles
Need to remove vector sequence, without losing BAC vector junction info
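To illustrate the vector problem on a circular contig, the sketch below cuts the circle at the vector, returns the insert, and keeps a short sequence across each vector/insert boundary so junction information is not lost. It assumes an exact, single, forward-strand copy of the vector. A real pipeline would use sequence alignment rather than exact string matching. All names are hypothetical.

```python
def linearize_at_vector(circular_seq, vector_seq, flank=4):
    """Cut a circular BAC contig at its vector and return the insert.

    Returns (insert, left_junction, right_junction), where each junction
    keeps `flank` bases on either side of a vector/insert boundary.
    """
    n = len(circular_seq)
    # Doubling the sequence lets a vector match wrap around the origin.
    doubled = circular_seq + circular_seq
    start = doubled.find(vector_seq)
    if start == -1:
        raise ValueError("vector not found in contig")
    # Rotate so the sequence reads: vector first, then insert.
    rotated = doubled[start:start + n]
    insert = rotated[len(vector_seq):]
    # Boundary where the vector ends and the insert begins.
    left_junction = vector_seq[-flank:] + insert[:flank]
    # Boundary where the insert ends and the vector begins again.
    right_junction = insert[-flank:] + vector_seq[:flank]
    return insert, left_junction, right_junction


# Toy example: the contig origin falls inside the vector.
toy_vector = "ACGTACGG"
toy_contig = "CGG" + "TTTTCCCCAAAAGGGG" + "ACGTA"
insert, left_jn, right_jn = linearize_at_vector(toy_contig, toy_vector)
```

Keeping the two junction strings alongside the trimmed insert is one simple way to retain evidence that both BAC ends were reached.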
BAC Assembly Method with PacBio
Input: BAC (insert + vector)
1) Correct and assemble PacBio reads
2) Remove vector sequence
3) Assemble sub-contigs
4) Validation
Pooled BAC Assembly Method with PacBio
Input: Mix of potentially overlapping BACs
1) Correct and assemble PacBio reads, polish
2) Remove vector sequence
3) Assemble sub-contigs together
4) Assemble all contigs, polish
5) Validation
Assembly Validation Aided By Visualization
Aligning sequencing data to contigs helps validate assemblies
Non-overlapping BAC in pool (155 kb)
• DbS Illumina reads confirm identity of selected BAC
• PacBio reads from BAC end junctions confirm completeness
• PacBio corrected insert reads in agreement with assembly confirm sequence
9 overlapping BACs in pool (954 kb)
BAC Deluxe Assembly Pipeline
~90% of maize BACs auto-assemble completely with >99.99% accuracy
Remaining BACs require manual curation to complete/correct
Manual Curation of Assemblies
[Figure: contigs BAC A.1 (38 kb) and BAC A.2 (122 kb) with tracks for DbS BAC reads, BAC end junction reads and corrected insert reads]
Soft clipped, corrected reads from ends useful
Assemble contigs + longest overhanging reads (repeat until closed)
Repeat validation
Detecting Mis-Assemblies
[Figure: BAC A and BAC B with tracks for DbS BAC reads, BAC end junction reads and corrected insert reads]
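The slides rely on visual inspection of the read tracks. As a complementary illustration, and not the method described here, a simple automated check is to look for contig positions that no corrected insert read spans with a safety margin on both sides. Such positions are candidate mis-joins. The alignment format, margin and step values below are assumptions.

```python
def unspanned_positions(contig_len, alignments, margin=500, step=100):
    """Return contig positions that no aligned read spans with `margin`
    bases on both sides (candidate mis-join sites).

    alignments: list of (start, end) read alignments on the contig.
    """
    flagged = []
    for pos in range(margin, contig_len - margin + 1, step):
        spanned = any(s <= pos - margin and e >= pos + margin
                      for s, e in alignments)
        if not spanned:
            flagged.append(pos)
    return flagged


# Toy example: reads tile the contig except around 5.6-6.6 kb.
toy_alignments = [(0, 4000), (2000, 6000), (6200, 10000)]
flagged = unspanned_positions(10000, toy_alignments)
```

A flagged stretch is only a prompt for closer inspection, since low coverage alone can also leave positions unspanned.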
What about BACs without DbS data?
2 overlapping BACs (248 kb)
No DbS data, which BACs are these?
[Figure tracks: BAC end junction reads, corrected insert reads]
1) Digest assembly into kmers
2) Align to reference (maize chromosome)
3) Determine physical coordinates in genome
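The k-mer placement steps can be sketched as follows. This is a minimal illustration that assumes exact forward-strand matches and votes on a single offset, so it ignores indels and reverse-strand placements. A production workflow would use a read aligner instead. Function names and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def index_reference(ref, k):
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def place_contig(contig, ref_index, k, step=1):
    """Digest a contig into k-mers and vote on its reference offset.

    Only k-mers that occur exactly once in the reference are used.
    Returns (ref_start, ref_end, supporting_kmers) or None.
    """
    votes = Counter()
    for i in range(0, len(contig) - k + 1, step):
        hits = ref_index.get(contig[i:i + k], [])
        if len(hits) == 1:
            votes[hits[0] - i] += 1
    if not votes:
        return None
    offset, support = votes.most_common(1)[0]
    return offset, offset + len(contig), support


toy_ref = "ACGTTGCAAGGCTTAGCCATGGATCCGATTACAGGCTAAC"
toy_contig = toy_ref[10:30]
placement = place_contig(toy_contig, index_reference(toy_ref, 8), 8)
```

Restricting the vote to unique k-mers mirrors the emphasis on uniquely mapping reads elsewhere in the deck.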
BioNano Genome Mapping*
Essential for regaining long-range connectivity lost from fragmenting and amplifying DNA
*BioNano Genomics website: http://www.bionanogenomics.com/technology/why-genome-mapping/
Genome Mapping Provides Comprehensive View of BAC Assemblies In Physical Space
Single Genome Map (assembly of nicked molecules)
• Maize Genome Map
• BspQI (17 nicks/100 kb)
• Throughput (>150kb): 197Gb (3 chips)
• Assembly size: 1.99 Gb
• 1,983 genome maps
• Single molecule N50: 289 kb
• Genome map N50: 1.44 Mb
• Check integrity of contig sequence
• Confirm contig order and orientation
• Characterize gaps between contigs
[Figure: Genome Maps aligned to BAC Assemblies; labels CtgA, CtgB, CtgC, CtgD; 160 kb]
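To compare a sequence contig with a genome map, the contig is typically converted into an in silico nick map first. The sketch below assumes the nicking enzyme recognition sequence is GCTCTTC and scans both strands. It reports site positions and a label density that can be set against the ~17 nicks/100 kb quoted above. The toy contig is invented for illustration.

```python
import re

SITE = "GCTCTTC"  # assumed BspQI nicking-enzyme recognition sequence


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def nick_sites(seq, site=SITE):
    """Return sorted 0-based start positions of the site on either strand."""
    seq = seq.upper()
    positions = set()
    for motif in {site, reverse_complement(site)}:
        # A lookahead pattern reports overlapping occurrences as well.
        positions.update(m.start() for m in re.finditer(f"(?={motif})", seq))
    return sorted(positions)


def label_density_per_100kb(seq, site=SITE):
    """Number of nick sites per 100 kb of sequence."""
    return 100_000 * len(nick_sites(seq, site)) / len(seq)


toy_contig = "AAGCTCTTCAA" + "T" * 20 + "GAAGAGC" + "CC"
sites = nick_sites(toy_contig)
spacings = [b - a for a, b in zip(sites, sites[1:])]
```

The spacing pattern between consecutive sites is what gets matched against the genome map to check contig integrity, order and orientation.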