AMIA TBI Summit 2017 Horiz

Translational Time Pattern Search in Transcriptomic Profiling

Shahrzad Eslamian, MS, Guenter Tusch, PhD

Medical and Bioinformatics Graduate Program, School of Computing and Information Systems, Grand Valley State University, Grand Rapids, MI, USA

Summary

One of the challenges of translational informatics, the translation of information captured in increasingly voluminous biomedical and genomic data sets into actionable health care or prevention, is the treatment of temporal data. An important part of temporal translational research is based on stimulus-response studies and includes searching for temporal effects or time patterns in gene sets or pathways across studies and platforms. A growing body of information is available in public repositories such as NCBI GEO and ArrayExpress, based on current transcriptomic profiling techniques such as DNA microarrays, cDNA amplified fragment length polymorphism, expressed sequence tag sequencing, serial analysis of gene expression, massively parallel signature sequencing, and RNA-seq. This study explores the feasibility of searching for temporal patterns based on statistical models and temporal abstraction using our SPOT software.

Methods

Temporal modeling was achieved by knowledge-based temporal abstractions, which convert expression values or counts (RNA-seq) into an interval-based qualitative representation [1, 2].

System Implementation

For implementation we used SPOT [1], a software platform built on the open-source tools R and Bioconductor. We evaluated our approach on a wide array of temporal studies from NCBI GEO.
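As a rough illustration of the temporal abstraction idea, the following Python sketch converts a single expression profile into labeled intervals and then asks a qualitative question of them. It is not the SPOT implementation, which is built on R and Bioconductor. The names `Interval`, `abstract_trends`, `has_peak_within`, and the fixed `min_change` threshold are all hypothetical; in particular, the threshold stands in for the statistical test of change that the approach actually relies on.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Interval:
    start: float  # hours after stimulus
    end: float    # hours after stimulus
    trend: str    # "increasing", "decreasing", or "steady"


def abstract_trends(times: Sequence[float], values: Sequence[float],
                    min_change: float = 1.0) -> List[Interval]:
    """Turn a time series into a qualitative, interval-based representation.

    ASSUMPTION: a change of at least `min_change` (e.g. in log2 units)
    between consecutive time points counts as a real change. This is a
    placeholder for a significance test, not the authors' criterion.
    """
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    intervals: List[Interval] = []
    for i in range(len(times) - 1):
        delta = values[i + 1] - values[i]
        if delta >= min_change:
            trend = "increasing"
        elif delta <= -min_change:
            trend = "decreasing"
        else:
            trend = "steady"
        if intervals and intervals[-1].trend == trend:
            # Merge with the previous interval when the label is unchanged.
            intervals[-1] = Interval(intervals[-1].start, times[i + 1], trend)
        else:
            intervals.append(Interval(times[i], times[i + 1], trend))
    return intervals


def has_peak_within(intervals: Sequence[Interval], window_end: float) -> bool:
    """True if an increasing interval is directly followed by a decreasing
    one and the turning point lies at or before `window_end`.

    Simplification: a plateau between the rise and the fall is not
    recognized as a peak by this sketch.
    """
    for left, right in zip(intervals, intervals[1:]):
        if (left.trend == "increasing" and right.trend == "decreasing"
                and left.end <= window_end):
            return True
    return False


if __name__ == "__main__":
    # Two made-up profiles sampled at different time points (hours).
    study_a = abstract_trends([0, 2, 6, 12, 24, 48],
                              [5.0, 5.2, 7.5, 8.6, 6.0, 5.1])
    study_b = abstract_trends([0, 4, 8, 16, 72],
                              [3.0, 6.0, 4.5, 3.1, 3.0])
    for name, study in (("A", study_a), ("B", study_b)):
        print(name, has_peak_within(study, window_end=24))
```

Because the query is phrased over intervals rather than raw time points, the two made-up studies can be compared even though they were sampled on different schedules.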

Introduction

Temporal patterns in gene profiles, such as peaks, can represent a biological effect that is reversed over time. In temporal gene expression studies a peak can be identified by a significant change from one time point to the next. In temporal translational research a researcher typically obtains an expression profile and tries to retrieve similar profiles for a set of genes or features in public databases. As data we used normalized microarray or RNA-seq datasets from NCBI GEO, either in curated form (GDS) or aggregated using R and Bioconductor packages. We selected only stimulus-response studies based on time series data.

The underlying model assumes that a researcher has found a temporal pattern, e.g., in a KEGG pathway, and tries to extend the finding by searching for the same pattern in similar public datasets. Here "similarity" could mean that a peak is found within a specified time interval, e.g., the first 24 hours. Temporal abstractions make the search independent of the particular time points that the experimenters chose for their studies. The researcher can search for "peak within 24 hours" across different microarray platforms, provided they contain the same genes or features. If the same biological signal is expressed in each study, the researcher should be able to find it based on statistical significance, independent of technology and time points. The p-value as a measure of statistical significance, however, depends on the sample sizes of the respective gene expression studies, which vary from study to study. Because sample sizes are small, the studies typically have little statistical power. We therefore take the empirical Bayesian approach, which gives far more stable inference when the number of arrays is small. For microarrays we use the moderated t-statistic [3], in which standard errors are moderated across genes by borrowing information from the ensemble of genes. For RNA-seq data we apply the voom transformation first.
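To make the moderation step concrete, here is a minimal Python sketch of the moderated t-statistic for a single gene in a two-group comparison, following the posterior variance formula in Smyth [3]. It is only an illustration: the function name `moderated_t` is hypothetical, and the prior degrees of freedom and prior variance are passed in as assumed values, whereas in practice they are estimated from the whole ensemble of genes (for example by limma's `eBayes` in Bioconductor).

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def moderated_t(group_a: Sequence[float], group_b: Sequence[float],
                prior_df: float, prior_var: float) -> Tuple[float, float, float]:
    """Return (ordinary t, moderated t, moderated degrees of freedom).

    ASSUMPTION: `prior_df` (d0) and `prior_var` (s0^2) are supplied by the
    caller. A real analysis estimates them from all genes on the array.
    Each group needs at least two observations.
    """
    n_a, n_b = len(group_a), len(group_b)
    resid_df = n_a + n_b - 2

    # Gene-wise pooled sample variance (s_g^2).
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / resid_df

    # Posterior variance: weighted average of prior and gene-wise variance,
    # s~_g^2 = (d0 * s0^2 + d_g * s_g^2) / (d0 + d_g).
    post_var = (prior_df * prior_var + resid_df * pooled_var) / (prior_df + resid_df)

    effect = mean(group_b) - mean(group_a)
    unscaled = 1.0 / n_a + 1.0 / n_b  # unscaled variance of the contrast

    ordinary_t = effect / math.sqrt(pooled_var * unscaled)
    mod_t = effect / math.sqrt(post_var * unscaled)
    return ordinary_t, mod_t, prior_df + resid_df


if __name__ == "__main__":
    # Made-up log2 expression values, three arrays per time point.
    baseline = [5.0, 5.1, 4.9]
    later = [7.0, 7.05, 6.95]
    ordinary, moderated, df = moderated_t(baseline, later,
                                          prior_df=4.0, prior_var=0.05)
    print(f"ordinary t = {ordinary:.2f}, moderated t = {moderated:.2f}, df = {df}")
```

With only three arrays per group, the gene-wise variance can be very small by chance, which inflates the ordinary t-statistic. Shrinking the variance toward the prior value pulls such statistics back and adds the prior degrees of freedom, which is the source of the more stable inference.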

Evaluation and Conclusion

We evaluated our approach on a set of 644 temporal GDS covering 171 different platforms from NCBI GEO, comparing against genes that had been confirmed, e.g., by qPCR. Preliminary results indicate that the approach has significant potential if the gene ensembles chosen are large enough. This approach is exploratory in nature and not intended for modelling purposes.

Acknowledgements

We would like to thank the following individuals, without whose support this study would not have been possible: Ramya Gunda, Vincent K. Sam, Lakshmi Mammidi, Yuka Kutsumi, Olvi Tole, Krishna Nadiminti (GVSU), Dr. Amar Das (IBM), Martin O'Connor (Stanford U), Dr. Craig Webb, Dr. Jeremy Miller (VAI), Dr. Timothy Redmond, Dr. Mark Musen (Stanford U).

References

1. Tusch G, Tole O, Hoinski ME. A Model for Cross-Platform Searches in Temporal Microarray Data. In: Conference on Artificial Intelligence in Medicine in Europe, 2015 Jun 17: 153-158. Springer International Publishing.
2. Tusch G, Bretl C, O'Connor M, Das A. SPOT - towards temporal data mining in medicine and bioinformatics. AMIA Annu Symp Proc. 2008: 1157.
3. Smyth G. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3: Article 3, 2004.

SPOT: S - Protégé – OWL/SWRL – Temporal Abstraction

Figure 1: SPOT – Overview

Figure 2: Selection of data of interest

Figure 3: Inspect and modify time intervals