Large Scale Genomics
Thomas Keane, EMBL-EBI
Oliver Hofmann, University of Melbourne
genomicsandhealth.org
Core mission
“Create standardized methods for accessing large-scale genomic data by file-based, API-based, cloud-based, and distributed access.”
Guiding principles
● Engage with driver projects and the wider genomics community to identify requirements and use-cases
● Build on existing standards to ensure a gradual transition to new standards
● Engage with key community software tool maintainers to drive adoption of standards
● Engage with key large data repositories to drive community adoption
● Metric for workstream success will be adoption of standards
Driver and partner engagement
Workstream structure (Y1)
● File formats task team
● Streaming API task team
● Reference Sequence Retrieval task team
● RNA-Seq task team
Workstream structure (Y1)
● File formats task team
  ○ Long reads, molecular barcoding, standardised encryption container, VCF representation for structural variation
● Streaming API task team
● Reference Sequence Retrieval task team
● RNA-Seq task team
Workstream structure (Y1)
● File formats task team
● Streaming API task team
  ○ Launched v1.0 of htsget!
  ○ VCF streaming, multi-sample/multi-region support, metrics for adoption
● Reference Sequence Retrieval task team
● RNA-Seq task team
Workstream structure (Y1)
● File formats task team
● Streaming API task team
● Reference Sequence Retrieval task team
  ○ Convene the task team (jointly with GKS workstream)
  ○ Identify requirements
  ○ Initial draft specification, initial implementations, basic interop demo
● RNA-Seq task team
Workstream structure (Y1)
● File formats task team
● Streaming API task team
● Reference Sequence Retrieval task team
● RNA-Seq task team
  ○ Convene an initial discussion with driver projects and tool developers
  ○ Identify goals and review progress to date
  ○ Topics: initial discussion on expression data format(s), compression approaches, streaming/slicing
Workstream structure (Y2)
● File formats task team
  ○ VCF: formal grammar, major release (breaking changes)
  ○ SAM/BAM/CRAM: formalise the representation of modified bases
● Streaming API task team
● Reference Sequence Retrieval task team
● RNA-Seq task team
Workstream structure (Y2)
● File formats task team
● Streaming API task team
  ○ Introspection of a service: what are the internal processes? What does the server provide?
  ○ Interoperability matrix update for Y1 new features
  ○ Compliance testing
● Reference Sequence Retrieval task team
● RNA-Seq task team
Workstream structure (Y2)
● File formats task team
● Streaming API task team
● Reference Sequence Retrieval task team
  ○ v1.0 specification and deeper interop
  ○ Public release with multiple server and client implementations
● RNA-Seq task team
Workstream structure (Y2)
● Variants API
  ○ Draft a wishlist of use-cases for a variants API/data model
  ○ Identify driver projects and form a task team
  ○ What do we do with the existing Variants API?
  ○ Do we excise it from the existing GA4GH data model?
  ○ Could this be a joint activity with GKS?
Coordination with other workstreams
● Discovery & Genomic Knowledge Standards
  ○ Variant API
● Genomic Knowledge Standards
  ○ Reference Sequence API
● Cloud
  ○ Integration challenge
● Security
For discussion | Out of scope
● Graph genomes
  ○ Wait for representations, algorithms to mature?