Supplementary Information

Scaffolding of a bacterial genome using MinION nanopore sequencing

Karlsson, E. 1§, Lärkeryd, A. 1§, Sjödin, A. 1,2, Forsman, M. 1 and Stenberg, P. 1,2,3*

1 Swedish Defence Research Agency, Umeå, Sweden
2 Department of Chemistry, Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden
3 Molecular Biology, Umeå University, Umeå, Sweden
§ Equal contribution
* Correspondence: [email protected]

Supplementary Tables

Sequencing run        R7.3 (FSC996)   R7 (FSC1006)
Number of 2D reads    30423           20099
Number of bases       190148225       117183952
Mean length (bp)      6250            5830
Median length (bp)    6038            5580
Max length (bp)       34370           30578
Mapped bases          128164261       82872274

Supplementary Table S1. Sequence output (all 2D reads) from the MinION R7 (FSC1006 genome) and the R7.3 (FSC996 genome) sequencing runs.
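As a quick consistency check on the table, the mean read length should equal the number of bases divided by the number of 2D reads. The sketch below recomputes it from the values in Supplementary Table S1. The "mapped fraction" (mapped bases divided by total bases) is a derived ratio introduced here for illustration; it is not a quantity reported in the table, and the function and key names are hypothetical.

```python
# Values copied from Supplementary Table S1.
runs = {
    "R7.3 (FSC996)": {"reads": 30423, "bases": 190148225, "mapped_bases": 128164261},
    "R7 (FSC1006)": {"reads": 20099, "bases": 117183952, "mapped_bases": 82872274},
}


def summarise(run):
    """Return (mean read length, mapped fraction) for one sequencing run."""
    mean_length = run["bases"] / run["reads"]
    # Derived ratio, not reported in the table (assumption for illustration).
    mapped_fraction = run["mapped_bases"] / run["bases"]
    return mean_length, mapped_fraction


for name, run in runs.items():
    mean_length, mapped_fraction = summarise(run)
    print(f"{name}: mean length {mean_length:.0f} bp, "
          f"mapped fraction {mapped_fraction:.3f}")
```

Rounded to the nearest base pair, the recomputed means agree with the tabulated mean lengths for both runs.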

Supplementary Figures

[Figure S1. Panel a, "Read lengths": x-axis Read length (kbp), y-axis Fraction of total reads. Panel b, "Mapability": x-axis Proportion of read aligned, y-axis Cumulative fraction of reads. Legend for both panels: 0 - 4000 bp (n = 5721), 4000 - 8000 bp (n = 10436), 8000 - 30578 bp (n = 3942).]

Supplementary Figure S1. Quality of MinION (R7) sequencing reads from FSC1006. (a) Length distribution of the reads. MinION reads are divided into three length categories that are coloured separately. Note that the large number of MinION reads of about 3.5 kb originates from the ligation control fragment. (b) Mapability of MinION reads divided into the same length categories as in (a). The proportion of a read aligned is the fraction of that read covered by the BLAST alignment against the reference genome.
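The mapability measure in panel (b) can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes alignments are given as 1-based inclusive query coordinates (as in BLAST tabular `qstart`/`qend` columns) and that overlapping alignments of the same read are merged so that no base is counted twice. The source does not state how multiple alignments per read were handled, nor whether the length category boundaries are inclusive, so both choices below are assumptions.

```python
def proportion_aligned(read_length, hsps):
    """Fraction of a read covered by at least one alignment interval.

    hsps: list of (start, end) query coordinates, 1-based and inclusive.
    Overlapping intervals are merged (assumption) before counting.
    """
    covered = 0
    current_end = 0  # right-most read position counted so far
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hsps):
        start = max(start, current_end + 1)  # skip bases already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / read_length


def length_category(read_length):
    """Assign a read to one of the three length bins of Figure S1.

    Lower bounds are treated as inclusive (assumption).
    """
    if read_length < 4000:
        return "0 - 4000 bp"
    if read_length < 8000:
        return "4000 - 8000 bp"
    return "8000 - 30578 bp"


# Hypothetical read of 200 bp with two overlapping alignments.
print(proportion_aligned(200, [(1, 100), (51, 150)]), length_category(200))
```

Sorting the per-read proportions within each length category and plotting their cumulative fraction would give curves of the kind shown in panel (b).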

[Figure S2. Two groups of panels, MinION (FSC996) and PacBio (FSC996), each with Deletions, Insertions and Substitutions sub-panels. Tick labels give the BLAST parameter combinations (MM and Gap). Axes: Mean error/bp/read and Coverage. Axis ticks span roughly 0.00 to 0.18 (error) and 0 to 70 (coverage) for MinION, and 0.0005 to 0.004 (error) and 28.7 to 29.4 (coverage) for PacBio.]

Supplementary Figure S2. Error rates and genomic coverage both vary with BLAST parameters. Mean error rates (deletions, insertions and substitutions) per base pair per read and genomic coverage (calculated as the summed aligned length of all reads divided by the genome size) after mapping MinION (R7.3) and PacBio reads to the FSC996 reference genome using different BLAST parameters. MM = match and mismatch scores; Gap = gap opening and gap extension penalties. Note that for match and mismatch scores of 2 and -3, respectively, BLAST does not allow a gap opening penalty of 0 combined with a gap extension penalty of 2. A gap extension penalty of 4 was therefore used instead.
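The two quantities plotted in this figure can be written down in a few lines. The sketch below follows the caption's definition of genomic coverage directly. For the error rate it assumes that "mean error per base pair per read" means a per-read rate (errors of one type divided by that read's aligned length) averaged over reads; the source does not spell out the exact normalisation, so treat that as an assumption. The record layout and the toy numbers are hypothetical and are not data from the study.

```python
from statistics import mean


def genomic_coverage(aligned_lengths, genome_size):
    """Summed aligned length of all reads divided by the genome size."""
    return sum(aligned_lengths) / genome_size


def mean_error_per_bp_per_read(alignments, error_type):
    """Average over reads of (errors of one type / aligned length).

    Assumes one record per read with an 'aligned_length' count and
    per-type error counts (hypothetical layout).
    """
    return mean(a[error_type] / a["aligned_length"] for a in alignments)


# Hypothetical toy alignments, not data from the study.
alignments = [
    {"aligned_length": 1000, "deletions": 50, "insertions": 20, "substitutions": 30},
    {"aligned_length": 2000, "deletions": 60, "insertions": 40, "substitutions": 100},
]

coverage = genomic_coverage(
    [a["aligned_length"] for a in alignments], genome_size=1500
)
deletion_rate = mean_error_per_bp_per_read(alignments, "deletions")
print(coverage, deletion_rate)
```

Averaging per-read rates gives every read equal weight regardless of length; pooling all errors and dividing by the total aligned length would weight long reads more heavily and can give a different number.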

[Figure S3. "Basecalling errors (Genome)": x-axis Error type (Deletions, Insertions, Substitutions), y-axis Mean error/bp/read. Legend (Sequence): MinION (R7) - FSC1006, MinION (R7.3) - FSC996, PacBio - FSC1006, PacBio - FSC996.]

Supplementary Figure S3. Error rates in the sequence reads generated by the two MinION (R7 and R7.3) and PacBio runs. Mean error rates (deletions, insertions and substitutions) per base pair per read across the FSC996 and FSC1006 genomes are shown.

[Figure S4. "Error rates in all 2D reads": x-axis Passed 2D reads and Failed 2D reads, y-axis Mean error/bp/read. Legend (Error type): Indels, Substitutions.]

Supplementary Figure S4. Boxplot showing the difference in indel and substitution rates between 2D MinION reads (R7.3) that passed and failed quality filtering. Thick black lines and boxes indicate median values and the 25th to 75th percentile range, respectively. Whiskers represent 1.5x the inter-quartile range and black dots denote outliers.
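The boxplot elements described in the caption can be computed as in the sketch below. It assumes the common Tukey convention, in which whiskers extend to the most extreme data point lying within 1.5x the inter-quartile range of the box and anything beyond is drawn as an outlier. Quartile estimates differ slightly between plotting tools, so the "inclusive" method used here is an assumption and need not match the software used for the figure. The input values are hypothetical.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Median, quartiles, whisker ends and outliers (Tukey convention)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        # Whiskers stop at the most extreme points inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Hypothetical per-read error rates, not data from the study.
stats = boxplot_stats([0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.40])
print(stats["median"], stats["outliers"])
```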

[Figure S5. "MinION Basecalling errors": x-axis Error type (Deletions, Insertions, Substitutions), y-axis Mean error/bp/read. Legend (Region): Genome, GC, MonoA repeats, MonoT repeats, MonoG repeats, MonoC repeats, Dinucleotide repeats, Nonanucleotide repeats.]

Supplementary Figure S5. Error rates within different genomic regions in the sequence reads generated by MinION (R7.3) sequencing. Mean error rates (deletions, insertions and substitutions) per base pair per read in the genome (32% GC), high-GC regions (47.8% GC), monomer repeats (A, T, G and C), dimer repeats and nonamer repeats. All repeats are at least five repeat units long.
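One way to locate repeat regions of the kind described in the caption (monomer, dimer or nonamer units repeated at least five times) is a back-referencing regular expression. The sketch below is an illustration under stated assumptions, not the method used in the study: the function name is hypothetical, only uppercase A, C, G and T are considered, and units that are themselves repetitions of a shorter motif (for example "AA" as a dimer) are skipped so that homopolymers are not reported again as dinucleotide repeats.

```python
import re


def find_repeats(sequence, unit_length, min_units=5):
    """Yield (start, end, unit) for tandem repeats of a given unit length.

    Coordinates are 0-based and end-exclusive. A repeat is a unit of
    unit_length bases occurring at least min_units times in a row.
    """
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_length, min_units - 1))
    for match in pattern.finditer(sequence):
        unit = match.group(1)
        # Skip units that are repetitions of a shorter motif, e.g. "AA".
        if unit in (unit + unit)[1:-1]:
            continue
        yield match.start(), match.end(), unit


# Hypothetical sequence: a run of six A's and five AC units.
seq = "GGC" + "A" * 6 + "CT" + "AC" * 5 + "G"
print(list(find_repeats(seq, 1)))  # monomer repeats
print(list(find_repeats(seq, 2)))  # dinucleotide repeats
```

Filtering the monomer hits by their unit ("A", "T", "G" or "C") would give the four MonoA to MonoC classes, and `unit_length=9` would give nonamer repeats.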