Cloud Computing

NC State Forensic Sciences Institute

DNA Forensics in the Cloud
Seth A. Faith and Melissa K. Scheible
NC State University, Forensic Sciences Institute, Raleigh, NC, USA

Short Tandem Repeats (STRs)

Security Setup Secure Shell (SSH) Security token/key

Remote Desktop connection (encrypted) to EC2 instance IP, Amazon generated root certificate

Login with system user name & security key password/token

User id

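EC2 instance specifications, STRaitRazor processing time, and yearly costs (20,000 samples) a,b

| Family | Type | vCPU | ECU | Processor | Clock Speed | Memory (GiB) | Storage | Fee ($/hr) | Processing time per sample, 1E5 reads (s) | Processing time per sample, 1E6 reads (s) | Unit cost ($) | Yearly processing ($) | Yearly storage ($) | Yearly total ($) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| General Purpose | t2.micro | 1 | Variable | Intel Xeon family | 2.5 GHz | 1 | EBS Only | 0.013 | 1305 | - | 0.005 | 94.25 | 155.95 | 250.20 |
| General Purpose | t2.small | 1 | Variable | Intel Xeon family | 2.5 GHz | 2 | EBS Only | 0.026 | - | - | - | - | - | - |
| General Purpose | t2.medium | 2 | Variable | Intel Xeon family | 2.5 GHz | 4 | EBS Only | 0.052 | - | - | - | - | - | - |
| General Purpose | m3.medium | 1 | 3 | Intel Xeon E5-2670 v2 | 2.5 GHz | 3.75 | 1 x 4 SSD | 0.070 | 2697 | - | 0.052 | 1,048.83 | 155.95 | 1,204.78 |
| General Purpose | m3.large | 2 | 6.5 | Intel Xeon E5-2670 v2 | 2.5 GHz | 7.5 | 1 x 32 SSD | 0.140 | 1231 | - | 0.048 | 957.44 | 155.95 | 1,113.39 |
| General Purpose | m3.xlarge | 4 | 13 | Intel Xeon E5-2670 v2 | 2.5 GHz | 15 | 2 x 40 SSD | 0.280 | 615 | - | 0.048 | 956.67 | 155.95 | 1,112.62 |
| General Purpose | m3.2xlarge | 8 | 26 | Intel Xeon E5-2670 v2 | 2.5 GHz | 30 | 2 x 80 SSD | 0.560 | 306 | - | 0.048 | 952.00 | 155.95 | 1,107.95 |
| Compute Optimized | c4.large | 2 | 8 | Intel Xeon E5-2666 v3 | 2.9 GHz | 3.75 | EBS Only | 0.116 | 1061 | - | 0.034 | 683.76 | 155.95 | 839.71 |
| Compute Optimized | c4.xlarge | 4 | 16 | Intel Xeon E5-2666 v3 | 2.9 GHz | 7.5 | EBS Only | 0.232 | 531 | - | 0.034 | 684.40 | 155.95 | 840.35 |
| Compute Optimized | c4.2xlarge | 8 | 31 | Intel Xeon E5-2666 v3 | 2.9 GHz | 15 | EBS Only | 0.464 | 266 | 2660 | 0.034 | 685.69 | 155.95 | 841.64 |
| Compute Optimized | c4.4xlarge | 16 | 62 | Intel Xeon E5-2666 v3 | 2.9 GHz | 30 | EBS Only | 0.928 | 129 | - | 0.033 | 665.07 | 155.95 | 821.02 |
| Compute Optimized | c4.8xlarge | 36 | 132 | Intel Xeon E5-2666 v3 | 2.9 GHz | 60 | EBS Only | 1.856 | 61 | 609 | 0.031 | 628.98 | 155.95 | 784.93 |
| Compute Optimized | c3.large | 2 | 7 | Intel Xeon E5-2680 v2 | 2.8 GHz | 3.75 | 2 x 16 SSD | 0.105 | 1152 | - | 0.034 | 672.00 | 155.95 | 827.95 |
| Compute Optimized | c3.xlarge | 4 | 14 | Intel Xeon E5-2680 v2 | 2.8 GHz | 7.5 | 2 x 40 SSD | 0.210 | 573 | - | 0.033 | 668.50 | 155.95 | 824.45 |
| Compute Optimized | c3.2xlarge | 8 | 28 | Intel Xeon E5-2680 v2 | 2.8 GHz | 15 | 2 x 80 SSD | 0.420 | 287 | - | 0.033 | 669.67 | 155.95 | 825.62 |
| Compute Optimized | c3.4xlarge | 16 | 55 | Intel Xeon E5-2680 v2 | 2.8 GHz | 30 | 2 x 160 SSD | 0.840 | - | - | - | - | - | - |
| Compute Optimized | c3.8xlarge | 32 | 108 | Intel Xeon E5-2680 v2 | 2.8 GHz | 60 | 2 x 320 SSD | 1.680 | - | - | - | - | - | - |
| Memory Optimized | r3.large | 2 | 6.5 | Intel Xeon E5-2670 v2 | 2.5 GHz | 15 | 1 x 32 SSD | 0.175 | 1235 | - | 0.060 | 1,200.69 | 155.95 | 1,356.64 |
| Memory Optimized | r3.xlarge | 4 | 13 | Intel Xeon E5-2670 v2 | 2.5 GHz | 30.5 | 1 x 80 SSD | 0.350 | 619 | - | 0.060 | 1,203.61 | 155.95 | 1,359.56 |
| Memory Optimized | r3.2xlarge | 8 | 26 | Intel Xeon E5-2670 v2 | 2.5 GHz | 61 | 1 x 160 SSD | 0.700 | 309 | - | 0.060 | 1,201.67 | 155.95 | 1,357.62 |
| Memory Optimized | r3.4xlarge | 16 | 52 | Intel Xeon E5-2670 v2 | 2.5 GHz | 122 | 1 x 320 SSD | 1.400 | - | - | - | - | - | - |
| Memory Optimized | r3.8xlarge | 32 | 104 | Intel Xeon E5-2670 v2 | 2.5 GHz | 244 | 2 x 320 SSD | 2.800 | - | - | - | - | - | - |
| Storage Optimized (High I/O) | i2.xlarge | 4 | 14 | Intel Xeon E5-2670 v2 | 2.5 GHz | 30.5 | 1 x 800 SSD | 0.853 | 616 | - | 0.146 | 2,919.16 | 155.95 | 3,075.11 |
| Storage Optimized (High I/O) | i2.2xlarge | 8 | 27 | Intel Xeon E5-2670 v2 | 2.5 GHz | 61 | 2 x 800 SSD | 1.705 | - | - | - | - | - | - |
| Storage Optimized (High I/O) | i2.4xlarge | 16 | 53 | Intel Xeon E5-2670 v2 | 2.5 GHz | 122 | 4 x 800 SSD | 3.410 | 155 | - | 0.147 | 2,936.39 | 155.95 | 3,092.34 |
| Storage Optimized (High I/O) | i2.8xlarge | 32 | 104 | Intel Xeon E5-2670 v2 | 2.5 GHz | 244 | 8 x 800 SSD | 6.820 | - | - | - | - | - | - |

ECU = Elastic Compute Cloud Unit, EBS = Elastic Block Storage, SSD = Solid State Drive, - = not tested
a ~50 MB file per sample, 1,667 samples uploaded per month
b No upload fees to S3 and no data transfer fees between S3 and EC2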
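The processing costs reported above appear to follow from simple arithmetic: the per-sample time multiplied by the hourly fee, then scaled to 20,000 samples per year. The sketch below is an inference from the reported numbers, not a formula stated by the authors; the function names are illustrative, and the flat storage figure ($155.95) is taken as given rather than derived.

```python
SECONDS_PER_HOUR = 3600
SAMPLES_PER_YEAR = 20_000  # yearly caseload used in the cost table


def unit_cost(seconds_per_sample: float, fee_per_hour: float) -> float:
    """Processing cost in USD for one sample (assumes billing proportional to run time)."""
    return seconds_per_sample * fee_per_hour / SECONDS_PER_HOUR


def yearly_processing_cost(seconds_per_sample: float, fee_per_hour: float,
                           samples: int = SAMPLES_PER_YEAR) -> float:
    """Processing cost in USD for a year's worth of samples."""
    return unit_cost(seconds_per_sample, fee_per_hour) * samples


# c4.8xlarge: 61 s per 1E5-read sample at $1.856/hr (values from the table)
print(f"unit cost:   ${unit_cost(61, 1.856):.3f}")
print(f"yearly cost: ${yearly_processing_cost(61, 1.856):.2f}")
```

Under these assumptions the c4.8xlarge and t2.micro rows reproduce the reported yearly processing figures to the cent, which suggests the table ignores per-hour billing minimums and instance start-up time.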

[Figure: chart of processing time for 1E5 reads (minutes, 0 to 50) and yearly cost ($0 to $3,500) by EC2 instance type]

Private environments could be successfully implemented to securely access both MS Windows and Linux AMIs through encrypted channels, using Remote Desktop and SSH, respectively. Security groups and security tokens/keys can be generated for each EC2 instance from the Amazon dashboard. Access to the Windows and Linux AMIs can be further controlled through user-profile login credentials. The S3 storage buckets can also be protected with AES-256 encryption and with policy settings for user, group, IP, and token access (not shown).
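The bucket policy settings were not shown on the poster. As a hedged illustration only, the sketch below builds an S3 bucket policy document that denies uploads lacking AES-256 server-side encryption and denies access from outside a given network range. The bucket name and CIDR range are hypothetical placeholders, and a real deployment would also need to allow the EC2 instances that read from the bucket.

```python
import json


def build_bucket_policy(bucket: str, allowed_cidr: str) -> dict:
    """Return an S3 bucket policy (as a dict) enforcing AES-256 and an IP allow-range.

    `bucket` and `allowed_cidr` are illustrative inputs, not values from the poster.
    """
    objects = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any upload that does not request AES-256 server-side encryption
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": objects,
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
                },
            },
            {
                # Reject requests that originate outside the laboratory network range
                "Sid": "DenyOutsideLabNetwork",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{bucket}", objects],
                "Condition": {"NotIpAddress": {"aws:SourceIp": allowed_cidr}},
            },
        ],
    }


# Hypothetical bucket name; 203.0.113.0/24 is a reserved documentation range
policy = build_bucket_policy("example-crime-lab-fastq", "203.0.113.0/24")
print(json.dumps(policy, indent=2))
```

The resulting JSON could be attached to a bucket through the Amazon console or CLI; user, group, and token restrictions would be layered on separately through IAM.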

Possible Cloud Architecture


EC2 Instance Type

Region

Variant

HV1

16519C

[Diagram labels: Crime lab controlled cloud; SFTP FASTQ; Amazon S3 Storage Bucket (Encrypted AES-256, Redundant); Access Control List (ACL) & Bucket Policies; FASTQ Bucket Policies; EC2 with CLI and Amazon Linux AMI; EC2 with Windows AMI; Mitochondrial Analysis; Amazon Glacier LTS (Archive Data)]

Iterative testing of STR data produced on a MiSeq and analyzed with STRaitRazor demonstrated that compute-optimized Linux instances (C4) provided the quickest turnaround time and the lowest cost. A FASTQ file of 100,000 reads could be analyzed in about 1 min on a c4.8xlarge instance for about 3 cents (USD). No benefit was observed for memory-optimized or storage-optimized instances.
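One way to read the C4 timings is as a scaling test: how much faster each larger instance is than c4.large, relative to how many more vCPUs it has. The sketch below computes that from the reported 1E5-read times; the efficiency measure (speedup divided by vCPU ratio) is an added illustration, not an analysis from the poster.

```python
# (instance, vCPU, seconds per 1E5-read sample) as reported for the C4 family
C4_RUNS = [
    ("c4.large", 2, 1061),
    ("c4.xlarge", 4, 531),
    ("c4.2xlarge", 8, 266),
    ("c4.4xlarge", 16, 129),
    ("c4.8xlarge", 36, 61),
]


def scaling(runs):
    """Return (instance, speedup, efficiency) relative to the first run in the list."""
    _, base_vcpu, base_secs = runs[0]
    rows = []
    for name, vcpu, secs in runs:
        speedup = base_secs / secs                  # how many times faster than baseline
        efficiency = speedup / (vcpu / base_vcpu)   # 1.0 means perfectly linear scaling
        rows.append((name, speedup, efficiency))
    return rows


for name, speedup, efficiency in scaling(C4_RUNS):
    print(f"{name:<11} speedup {speedup:5.1f}x  efficiency {efficiency:.2f}")
```

Efficiency values near 1.0 across the family would be consistent with the near-constant per-sample cost in the table: doubling the instance size roughly halves the run time while doubling the hourly fee.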

STR Mito SNP

Authentication: Security Token & User/Group Id

SSL

263G 315.1C (required examiner) 477C

Group

H1C

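Forensic Data Summary File

| A-STR | Allele 1 | Allele 2 |
|---|---|---|
| FGA | 20 | 23 |
| D7S820 | 8 | 11 |
| D2S441 | 10 | 14 |
| D16S539 | 9 | 13 |
| D13S317 | 9 | 11 |
| D12S391 | 18 | 23 |
| D21S11 | 29 | 31.2 |
| PentaE | 7 | 14 |
| D1S1656 | 12 | 13 |
| D19S433 | 13 | 14 |
| TH01 | 6 | 9.3 |
| D8S1179 | 14 | 15 |
| D10S1248 | 13 | 15 |
| PentaD | 12 | 13 |
| TPOX | 11 | 11 |
| D5S818 | 12 | 12 |
| D3S1358 | 17 | 18 |
| VWA | 16 | 19 |
| CSF1PO | 12 | 12 |
| D22S1045 | 16 | 16 |
| D2S1338 | 22 | 25 |
| D18S51 | 16 | 18 |
| Amel | X | Y |

| Y-STR | Allele |
|---|---|
| DYS576 | 18 |
| DYS389I | 14 |
| DYS389II | 31 |
| DYS448 | 19 |
| DYS19 | 14 |
| DYS391 | 10 |
| DYS481 | 22 |
| DYS549 | 13 |
| DYS533 | 12 |
| DYS438 | 9 |
| DYS437 | 14 |
| DYS570 | 17 |
| DYS635 | 21 |
| DYS390 | 24 |
| DYS439 | 12 |
| DYS392 | 13 |
| DYS643 | 10 |
| DYS393 | 13 |
| DYS458 | 17 |
| DYS385a/b | 13,16 |
| DYS456 | 17 |
| Y-GATA-H4 | 11 |

| Mito | Variant |
|---|---|
| HV1 | 16519C |
| HV2 | 152C, 263G, 315.1C, 477C |

Group: H1C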

Local Backup Storage

A complete profile of autosomal STRs, Y-STRs, and mitochondrial variants was produced on a secured cloud for the 2800M reference, matching concordance data generated by CE and Sanger sequencing. This type of profile could potentially be uploaded to and used in CODIS if this technology gains NDIS approval. Sequence information for each allele was also produced, but it is not reported here because no forensic system currently uses sequence-based matching.

Forensic Sample

SSH

152C

Remote Desktop

HV2

Amazon EC2 IP

STRaitRazor command

•  NGS data for STRs and mitochondria can be securely analyzed in the Amazon Cloud at low cost with Linux and Microsoft OSs
•  STR analysis tools, such as STRaitRazor, that perform string matching and interpretation have optimal performance with compute-optimized clusters using parallel processing
•  Current NGS forensic analysis tools still require significant development for DNA examiner use, including:
    •  Thresholding functions
    •  Visualization modalities
    •  Reporting and export routines
    •  Audit trail and LIMS connectivity
•  Optimization of settings for commercial reference aligners needs to be conducted for accurate mitochondrial analysis
•  It may be advisable to de-identify samples (alphanumeric) and limit metadata associated with files analyzed in the Cloud to further reduce privacy/security concerns
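The de-identification suggested in the last conclusion could be sketched as follows: each case identifier is replaced with a random alphanumeric code before upload, and the lookup table stays inside the laboratory. The function names, code length, and example case numbers are hypothetical choices for illustration, not a procedure described on the poster.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits  # alphanumeric codes only


def new_sample_code(length: int = 10) -> str:
    """Draw a random alphanumeric code from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def deidentify(case_ids, length: int = 10) -> dict:
    """Map each case identifier to a unique random code.

    The returned mapping is the only link back to the case and should be
    kept on local (non-cloud) storage.
    """
    mapping = {}
    used = set()
    for case_id in case_ids:
        code = new_sample_code(length)
        while code in used:  # re-draw on the unlikely event of a collision
            code = new_sample_code(length)
        used.add(code)
        mapping[case_id] = code
    return mapping


# Hypothetical case identifiers
codes = deidentify(["2015-000123", "2015-000124"])
for case_id, code in codes.items():
    print(f"{case_id} -> {code}.fastq")
```

Random codes are used here instead of hashes of the case number so that the cloud-side file name cannot be reversed by guessing plausible case numbers.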


Estimated Yearly Cost for 20,000 Samples

Memory (GiB)

2800M Control DNA was amplified with Promega PowerSeq™, sequenced on an Illumina MiSeq, and analyzed on an Amazon EC2 Windows instance (m3.xlarge: 4 vCPU, 15 GB memory) with SoftGenetics NextGENe® software for reference alignment and variant analysis. The HVI/HVII variants point to an estimated haplogroup of H1C. However, the homopolymer required human intervention to assign the correct indel according to forensic guidelines.

Linux AMI

MS Windows AMI

Clock Speed

Conclusions

SFTP

Data Synthesis: Control DNA (2800M standard, 1 ng) was amplified with Promega PowerSeq™ Auto/Y/mito, made into NGS libraries with TruSeq DNA LT PCR-free (Illumina), and sequenced on a MiSeq (Illumina) for SE 300 bp using V3 chemistries. FASTQ files were uploaded to an Amazon S3 storage bucket. Data were transferred to EC2 compute instances using Amazon command line interface (CLI) tools or SFTP. FASTQ data were split to contain either 1E5 or 1E6 reads for the STR performance testing. A whole FASTQ file of 3.6 million reads was used for the mitochondrial testing.

STRs: An Amazon Linux AMI was set up in the US-east region to conduct STR analysis using the STRaitRazor tool (Warshauer et al., 2013, 2015). The base AMI was updated with the Perl library and tre-agrep dependencies, as well as the Amazon CLI packages. For each instance of the AMI, either 1E5 or 1E6 reads were analyzed for 86 loci (-typeselection ALL), including autosomal and Y-chromosome STRs. Output files from STRaitRazor were post-processed in MS Excel to call alleles.

Mito: An Amazon EC2 Microsoft Windows Server 2012 R2 base instance (64-bit architecture, EBS, m3.xlarge: 4 vCPU, 15 GB memory) was set up and installed with a trial version of SoftGenetics NextGENe® software for reference alignment and variant analysis (using default alignment settings). The rCRS mtGenome (NC_012920.1) was recombined at the HVI/HVII boundaries (nt positions 1 and 16,569) and used as the reference genome. The variants discovered through analysis were referenced against the PhyloTree (van Oven and Kayser, 2009) and EMPOP3 (Parson and Dur, 2007) databases to estimate the haplogroup.
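The poster does not say which tool was used to split the FASTQ data into 1E5-read and 1E6-read subsets. A minimal sketch of one way to do it is shown below, assuming standard four-line FASTQ records (as produced by the MiSeq) with no line wrapping; the function names and the in-memory demo reads are illustrative.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, List


def read_fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines (header, sequence, '+', quality)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        yield record


def take_reads(lines: Iterable[str], n_reads: int) -> List[str]:
    """Return the lines of the first n_reads records (e.g. n_reads=100_000 for 1E5)."""
    subset = islice(read_fastq_records(lines), n_reads)
    return [line for record in subset for line in record]


# Tiny in-memory demo with three made-up reads; a real run would pass an open file
demo = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
    "@read3\nGGCC\n+\nIIII\n"
)
print("".join(take_reads(demo, 2)), end="")
```

Taking the first N records is the simplest choice; if read order on the flow cell were a concern, a random subsample would be the alternative.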

Processor


Approach

ECU


Here we tested the potential of the Amazon Cloud Computing (EC2) environment for forensic microsatellite (STR) and mitochondrial genotyping.

vCPU


One current barrier to implementation of NGS by forensic scientists is the lack of an effective analytical toolkit that can process the large and complex datasets produced by NGS instrumentation in a secured environment.

Type


DNA forensics is on the brink of a revolution with new techniques in next-generation sequencing (NGS) and bioinformatics enabling:
•  Larger panels of STRs with additional sequence data
•  In-parallel mitochondrial genome analysis
•  SNP data for identity, physical appearance, and ancestry
•  Enhanced kinship and mixture analysis

Amazon Linux AMI (us-east)

Processing Time for 1E5 Reads (minutes)

Introduction

FASTQ

Sequencer

STR Mito SNP

LDIS

SDIS

NDIS

DNA examiner

Crime Lab

LTS = Long Term Storage, EC2 = Elastic Compute Cloud, CLI = Command Line Interface, AMI = Amazon Machine Image, S3 = Simple Storage Service, STR = Short Tandem Repeat, SSL = Secure Sockets Layer, SSH = Secure Shell, SFTP = SSH File Transfer Protocol, CODIS = Combined DNA Index System, LDIS = Local DNA Index System, SDIS = State DNA Index System, NDIS = National DNA Index System, SNP = Single Nucleotide Polymorphism

References
- Warshauer, D. H., et al. (2013). STRait razor: A length-based forensic STR allele-calling tool for use with second generation sequencing data. Forensic Sci Int Genet, 7(4), 409-17.
- Warshauer, D. H., King, J. L., & Budowle, B. (2015). STRait razor v2.0: The improved STR allele identification tool-razor. Forensic Sci Int Genet, 14, 182-186.
- Van Oven, M., & Kayser, M. (2009). Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat, 30(2), E386-E394. http://www.phylotree.org.
- Parson, W., & Dur, A. (2007). EMPOP--a forensic mtDNA database. Forensic Sci Int Genet, 1(2), 88-92.

Seth A. Faith PhD NC State Forensic Sciences Institute [email protected] 919-513-8099