Graduate Category: Interdisciplinary Research Degree Level: Ph.D. Abstract ID# 68
Ph.D. student Aven Samareh Advising committee: Dr. James Kong
Dynamic Analysis of High-Dimensional Microarray Time Series Data Using Various Dimension Reduction Methods
Abstract: The research methodology includes two tasks. First, several dimension reduction methods are applied to two microarray data sets. Second, a sparse vector autoregressive model is fitted to the time series microarray experiments, and a one-step-ahead forecast is performed to compare the models' predictive performance.
Analyzing Microarray Data
• RNA is extracted from the cells and labeled with different colors [1].
• The DNA is hybridized on the glass slides [2].
Dimension reduction methods
Pre-processing steps: filling missing values, normalization, log-transformation.
Data sets:
• The first data set is a human cancer cell line (HeLa) cell-cycle gene expression data set collected by Whitfield et al. (2002).
• The second data set is the Drosophila life-cycle gene expression data (Arbeitman et al., 2002).
Figure: (a) time series plot of Drosophila genes, and (b) human cancer cell line (HeLa).
Challenges in dynamic analysis of microarray time series data
• The system is underdetermined.
• High correlation between genes.
• Selecting an efficient model for such data.
• Least squares estimators have large variances.
Significance of dynamic analysis in microarray time series data
Principal component analysis (PCA), Factor analysis, Eigen Laplacian
• Monitoring a gene's behavior at multiple points over time.
• More quantitative knowledge of dynamic gene regulation.
• Provides an understanding of gene activities.
• Can provide an understanding of drug effectiveness over time.
Regression: It should be noted that each gene may depend not only on its own past values but also on the past values of the other genes, which motivates a vector autoregressive (VAR) model. With far more genes than time points, the OLS estimators will be estimated imprecisely.
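The VAR step above can be sketched as follows. This is a minimal illustration on synthetic data (not the poster's actual data sets), using a ridge-penalized VAR(1) in closed form for brevity; the poster also evaluates LASSO and SCAD penalties.

```python
import numpy as np

# Synthetic (T time points) x (p genes) expression matrix; with p >> T
# the per-gene regressions are underdetermined, so a penalty is needed.
rng = np.random.default_rng(0)
T, p = 20, 50                      # few time points, many genes
X = rng.normal(size=(T, p))

Y_past, Y_next = X[:-1], X[1:]     # pairs (x_t, x_{t+1})

lam = 1.0                          # ridge penalty
# B_hat solves (Y_past' Y_past + lam I) B = Y_past' Y_next
B_hat = np.linalg.solve(Y_past.T @ Y_past + lam * np.eye(p),
                        Y_past.T @ Y_next)

# 1-step-ahead forecast from the last observed time point
forecast = X[-1] @ B_hat
print(forecast.shape)              # (50,)
```

Each column of `B_hat` is one gene's regression on the lagged values of all genes; the forecast is compared against held-out data when evaluating the models.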
Multicollinearity produces large variances and covariances for the least squares estimators of the regression coefficients, and hence large standard errors.
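A small synthetic check of this effect: under OLS, Var(β̂) = σ²(XᵀX)⁻¹, and the diagonal entries blow up when regressors are nearly collinear (all data below is illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
sigma2 = 1.0

# Uncorrelated regressors
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X_ind = np.column_stack([x1, x2])

# Nearly collinear regressors (x3 is almost a copy of x1)
x3 = x1 + 0.01 * rng.normal(size=n)
X_col = np.column_stack([x1, x3])

def coef_variances(X):
    # Var(beta_hat) = sigma^2 * diag((X'X)^{-1}) for OLS
    return sigma2 * np.diag(np.linalg.inv(X.T @ X))

var_ind = coef_variances(X_ind)
var_col = coef_variances(X_col)
# the collinear design yields far larger coefficient variances
print(var_col.max() / var_ind.max())
```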
How to deal with collinearity: 1. Increase the sample size. 2. Orthogonalise the correlated regressor variables.
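The second remedy can be sketched with PCA-based orthogonalisation (principal component regression). The data below are synthetic and illustrative; after projecting onto the principal components, the new regressors are mutually orthogonal, so the least squares problem decouples.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 5
base = rng.normal(size=(n, 1))
X = base + 0.1 * rng.normal(size=(n, p))   # strongly correlated columns
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

Xc = X - X.mean(axis=0)
# PCA via SVD: the score columns Z = U * S are mutually orthogonal
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = U * S

# Off-diagonal entries of Z'Z vanish, so coefficients decouple
G = Z.T @ Z
off_diag = G - np.diag(np.diag(G))
print(np.abs(off_diag).max())   # numerically ~0

beta_z = np.linalg.solve(G, Z.T @ (y - y.mean()))
```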
[1] Al-Akwaa, Fadhl M. Analysis of Gene Expression Using Biclustering.
[2] Arbeitman, M. N., et al. Gene Expression During the Life Cycle of Drosophila melanogaster (2002).
[3] Kiyoung, Y., et al. On the Stationarity of Multivariate Time Series. IEEE Computer Society (2005).
Orthogonality results in reasonable least squares estimation with low standard errors.
R-squared values by dimension reduction method, number of genes, and regression method:

Dimension reduction     Number of genes   LASSO   SCAD   Ridge
Nystrom Method          500               0.35    0.32   0.32
Nystrom Method          250               0.30    0.30   0.30
Indian Buffet Process   500               0.17    0.20   0.21
Indian Buffet Process   250               0.17    0.17   0.16
Original data set       500               0.58    0.55   0.53
Original data set       250               0.50    0.50   0.41
R-squared values for the Drosophila data
Motivation
Applying various dimension reduction methods
Method
Dimension reduction methods
Challenges In Analyzing Time Series Microarray Data
• The data lie on a low-dimensional manifold.
• Almost all of the original information is preserved.
• Computational cost is significantly reduced.
• Complexity reduction of massive microarray data sets.
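A quick synthetic illustration of the low-dimensional-manifold point: when 500 "genes" are driven by only 3 latent factors plus noise (hypothetical data, not the poster's), the first 3 principal components already retain almost all of the variance.

```python
import numpy as np

rng = np.random.default_rng(3)
T, p, k = 30, 500, 3
factors = rng.normal(size=(T, k))       # latent dynamics
loadings = rng.normal(size=(k, p))      # how genes load on factors
X = factors @ loadings + 0.05 * rng.normal(size=(T, p))

Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)
explained = np.cumsum(S**2) / np.sum(S**2)
# the first 3 components capture nearly all of the variance
print(explained[2])
```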
The processed data can be represented in the form of a matrix (genes × time points).
Significance of dimension reduction techniques in microarray data
R-squared values for the HeLa data
The expression level for each gene is presented as an image.
Challenges in using dimension reduction techniques for microarray time series data
• Microarray data are noisy.
• High computational cost.
• Selecting an efficient technique.
• Need for a high-quality dimension reduction technique.
1. The results show that explanatory power under factor analysis and principal component analysis is better than under multiple linear regression for both data sets, as it is for essentially any method that produces orthogonal explanatory variables.
2. Factor analysis has the highest R-squared value among all of the dimension reduction techniques used.
3. Among all the regression methods, LASSO has the best performance.
4. Reduced-dimension models that provide orthogonal variables are expected to perform relatively better with regard to low-variance estimators, because orthogonality leads to reasonable coefficient estimation with low standard errors.
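Conclusion 3 concerns the LASSO penalty. The following is a minimal numpy coordinate-descent sketch (not the implementation used in the study, and on synthetic data) showing how the L1 penalty zeroes out inactive coefficients, which is what makes it attractive in the underdetermined many-genes setting.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = X[:, j] @ (y - X @ b + X[:, j] * b[j]) / n
            b[j] = soft_threshold(r_j, lam) / col_sq[j]
    return b

rng = np.random.default_rng(4)
n, p = 50, 10
X = rng.normal(size=(n, p))
true_b = np.zeros(p)
true_b[:2] = [3.0, -2.0]            # only 2 "genes" truly active
y = X @ true_b + 0.1 * rng.normal(size=n)

b_hat = lasso_cd(X, y, lam=0.2)
print(np.nonzero(np.abs(b_hat) > 1e-6)[0])   # sparse support
```

The recovered coefficients are shrunk toward zero (a known LASSO bias), but inactive variables are set exactly to zero, unlike with ridge.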