Large Scale Genomics

Thomas Keane, EMBL-EBI
Oliver Hofmann, University of Melbourne
genomicsandhealth.org

Core mission

“Create standardized methods for accessing large-scale genomic data by file-based, API-based, cloud-based, and distributed access.”

Guiding principles

● Engage with driver projects and the wider genomics community to identify requirements and use-cases
● Build on existing standards to ensure a gradual transition to new standards
● Engage with key community software tool maintainers to drive adoption of standards
● Engage with key large data repositories to drive community adoption
● Metric for workstream success will be adoption of standards

Driver and partner engagement

Workstream structure (Y1)

● File formats task team
● Streaming API task team
● Reference Sequence Retrieval task team
● RNA-Seq task team

Workstream structure (Y1)

● File formats task team
  ○ Long reads, molecular barcoding, standardised encryption container, VCF representation for structural variation

Workstream structure (Y1)

● Streaming API task team
  ○ Launched v1.0 of htsget!
  ○ VCF streaming, multi-sample/multi-region support, metrics for adoption
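To make the htsget item concrete, here is a minimal sketch of the client side of the protocol as published in the htsget v1.0 specification: the client asks for a genomic region, receives a JSON "ticket" listing data blocks, and concatenates the blocks in order. The server address `htsget.example.org`, the read ID `sample1`, and the helper names are hypothetical and chosen for illustration; no network call is made.

```python
import base64
import json
from urllib.parse import urlencode


def build_reads_url(base_url, read_id, reference_name, start, end, fmt="BAM"):
    """Build an htsget reads request for one region (start is 0-based, end is exclusive)."""
    query = urlencode(
        {"format": fmt, "referenceName": reference_name, "start": start, "end": end}
    )
    return f"{base_url.rstrip('/')}/reads/{read_id}?{query}"


def parse_ticket(ticket_text):
    """Return the data format and the ordered list of (url, headers) blocks from a ticket."""
    ticket = json.loads(ticket_text)["htsget"]
    blocks = [(item["url"], item.get("headers", {})) for item in ticket["urls"]]
    return ticket["format"], blocks


def decode_inline_block(url):
    """Decode a block that the server embedded inline as a base64 data: URI."""
    prefix, _, payload = url.partition(",")
    if not (prefix.startswith("data:") and prefix.endswith(";base64")):
        raise ValueError("not a base64 data: URI")
    return base64.b64decode(payload)


# Hypothetical ticket: one inline header block, then one byte range of a remote file.
EXAMPLE_TICKET = json.dumps(
    {
        "htsget": {
            "format": "BAM",
            "urls": [
                {"url": "data:application/vnd.ga4gh.bam;base64,QkFNAQ=="},
                {
                    "url": "https://data.example.org/sample1.bam",
                    "headers": {"Range": "bytes=65536-131071"},
                },
            ],
        }
    }
)

if __name__ == "__main__":
    print(build_reads_url("https://htsget.example.org", "sample1", "chr1", 1000, 2000))
    fmt, blocks = parse_ticket(EXAMPLE_TICKET)
    print(fmt, len(blocks), decode_inline_block(blocks[0][0]))
```

A real client would fetch each non-inline block with the headers given in the ticket and append the bytes in ticket order to obtain a valid BAM stream for the requested region.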

Workstream structure (Y1)

● Reference Sequence Retrieval task team
  ○ Convene the task team (jointly with GKS workstream)
  ○ Identify requirements
  ○ Initial draft specification, initial implementations, basic interop demo

Workstream structure (Y1)

● RNA-Seq task team
  ○ Convene an initial discussion with driver projects and tool developers
  ○ Identify goals and review progress to date
  ○ Topics: initial discussion on expression data format(s), compression approaches, streaming/slicing

Workstream structure (Y2)

● File formats task team
  ○ VCF: formal grammar, major release (breaking changes)
  ○ SAM/BAM/CRAM: formalise the representation of base modifications
● Streaming API task team
● Reference Sequence Retrieval task team
● RNA-Seq task team

Workstream structure (Y2)

● Streaming API task team
  ○ Introspection of a service: what are the internal processes? What does the server provide?
  ○ Interoperability matrix update for Y1 new features
  ○ Compliance testing

Workstream structure (Y2)

● Reference Sequence Retrieval task team
  ○ v1.0 specification and deeper interop
  ○ Public release with multiple server and client implementations
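The slides do not say how a reference sequence retrieval service identifies sequences. As an illustration only, the sketch below assumes checksum-derived identifiers (an MD5 hex digest and a truncated SHA-512 digest in URL-safe base64) and a `/sequence/{id}` endpoint with optional `start`/`end` parameters. The normalisation rule, the helper names, and the host `refget.example.org` are assumptions made for this example, not details taken from the presentation.

```python
import base64
import hashlib
from urllib.parse import urlencode


def normalise(sequence):
    """Assumed normalisation: drop whitespace and upper-case before hashing."""
    return "".join(sequence.split()).upper().encode("ascii")


def md5_id(sequence):
    """MD5 hex digest of the normalised sequence."""
    return hashlib.md5(normalise(sequence)).hexdigest()


def truncated_sha512_id(sequence, n_bytes=24):
    """First n_bytes of the SHA-512 digest, encoded as URL-safe base64."""
    digest = hashlib.sha512(normalise(sequence)).digest()[:n_bytes]
    return base64.urlsafe_b64encode(digest).decode("ascii")


def sequence_url(base_url, seq_id, start=None, end=None):
    """Build a retrieval URL, optionally restricted to a sub-sequence."""
    params = {k: v for k, v in (("start", start), ("end", end)) if v is not None}
    url = f"{base_url.rstrip('/')}/sequence/{seq_id}"
    return f"{url}?{urlencode(params)}" if params else url


if __name__ == "__main__":
    seq_id = md5_id("acgt\n")
    print(seq_id, truncated_sha512_id("ACGT"))
    print(sequence_url("https://refget.example.org", seq_id, start=0, end=2))
```

Deriving the identifier from the sequence content means two servers holding the same sequence expose it under the same ID, which is the kind of property the interop goals above depend on.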

Workstream structure (Y2)

● Variants API
  ○ Draft a wishlist of use-cases from a variants API/data model
  ○ Identify driver projects and form a task team
  ○ What do we do with the existing Variants API?
  ○ Do we excise it from the existing GA4GH data model?
  ○ Could this be a joint activity with GKS?

Coordination with other workstreams

● Discovery & Genomic Knowledge Standards
  ○ Variant API
● Genomic Knowledge Standards
  ○ Reference Sequence API
● Cloud
  ○ Integration challenge
● Security

For discussion | Out of scope

● Graph genomes
  ○ Wait for representations, algorithms to mature?
● Genome annotation
  ○ File formats (GFF3, others)
  ○ Covered through GKS workstream?
● APIs for Clinical Data
  ○ Covered by Clinical & Phenotypical Data Capture workstream?

Q&A