Graduate Category: Interdisciplinary Research Degree Level: Ph.D. Abstract ID# 68

Ph.D. student: Aven Samareh. Advising committee: Dr. James Kong

Dynamic Analysis Of High Dimensional Microarray Time Series Data Using Various Dimension Reduction Methods

Abstract
The research methodology includes two tasks. First, several dimension reduction methods are applied to two microarray data sets. Second, sparse vector autoregressive models are fitted to the time series microarray data, and 1-step-ahead forecasts are performed to compare the models' performance.

Analyzing Microarray Data
RNA is extracted from the cells and labeled with different colors [1]. DNA is hybridized onto glass slides [2].

Data sets
• The first data set is a human cancer cell line (HeLa) cell-cycle gene expression data set collected by Whitfield et al. (2002).
• The second data set is Drosophila life-cycle gene expression data (Arbeitman et al., 2002).
Pre-processing steps applied to both data sets: filling missing values, normalization, and log-transformation.
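A minimal sketch of these pre-processing steps, assuming a genes × time-points matrix of positive expression ratios (the function name and the gene-wise mean fill strategy are illustrative, not necessarily what the poster used):

```python
import numpy as np

def preprocess(X):
    """Fill missing values, log-transform, and normalize a
    genes x time-points expression matrix of positive ratios."""
    X = X.astype(float)
    # Fill missing values with the gene-wise (row) mean
    row_means = np.nanmean(X, axis=1, keepdims=True)
    X = np.where(np.isnan(X), row_means, X)
    # Log-transformation: expression ratios -> log ratios
    X = np.log2(X)
    # Normalization: zero mean, unit variance per gene
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
```

After this step every gene's series is on a comparable scale, which the later regression comparisons rely on.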

Challenges in dynamic analysis of microarray time series data
• The system is underdetermined
• High correlation between genes
• Selecting an efficient model for such data
• Least squares estimators have large variances

Significance of dynamic analysis in microarray time series data


• Monitoring a gene's behavior at multiple points over time
• More quantitative knowledge of dynamic gene regulation
• An understanding of gene activities
• An understanding of drug effectiveness over time

Dimension reduction methods applied: principal component analysis (PCA), factor analysis (FA), Eigen Laplacian, Nystrom method, and Indian buffet process.

R-squared values (method × number of genes × regression method):

Method                 Genes  LASSO  SCAD  Ridge
Nystrom Method           500   0.53  0.53   0.45
Nystrom Method           250   0.46  0.46   0.43
Indian Buffet Process    500   0.51  0.50   0.48
Indian Buffet Process    250   0.39  0.35   0.35
Original Data set        500   0.80  0.78   0.80
Original Data set        250   0.72  0.72   0.73



Test of stationarity [3]
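Reference [3] concerns stationarity testing for multivariate time series; as an illustration (not necessarily the test used there), a simple no-lag Dickey–Fuller-style statistic can be computed per gene:

```python
import numpy as np

def dickey_fuller_stat(x):
    """t-statistic from regressing diff(x) on lagged x (both demeaned).
    Strongly negative values argue against a unit root, i.e. for stationarity."""
    dx = np.diff(x)
    xl = x[:-1] - x[:-1].mean()
    dx = dx - dx.mean()
    beta = (xl @ dx) / (xl @ xl)           # slope of the DF regression
    resid = dx - beta * xl
    s2 = (resid @ resid) / (len(dx) - 2)   # residual variance
    return beta / np.sqrt(s2 / (xl @ xl))
```

The statistic would then be compared against Dickey–Fuller critical values; a stationary series yields a much more negative value than a random walk.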

Regression
It should be noted that each gene may depend not only on its own past values, but also on the past values of the other genes. With far more genes than time points, OLS estimators will be estimated imprecisely.
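This dependence structure is what the sparse vector autoregression captures. As a simplified sketch, assuming a VAR(1) model and using a ridge penalty as a stand-in for the sparse (LASSO/SCAD) penalties, the lag matrix and a 1-step-ahead forecast can be computed as:

```python
import numpy as np

def fit_var1_ridge(X, lam=1.0):
    """Fit X[t] ~ A @ X[t-1] by ridge regression.
    X: (T, p) array of T time points for p genes. Returns A, shape (p, p)."""
    Y, Z = X[1:], X[:-1]          # responses and lagged predictors
    p = X.shape[1]
    # Closed-form ridge solution for all p gene equations at once
    A_t = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ Y)
    return A_t.T

def forecast_one_step(A, x_last):
    """1-step-ahead forecast from the last observed expression vector."""
    return A @ x_last
```

Row i of A collects how every gene's past value influences gene i; a sparse penalty would additionally zero out most of these entries.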

Multicollinearity leads to large variances and covariances for the least squares estimators of the regression coefficients, and hence large standard errors.

How to deal with collinearity: 1. Increase the sample size. 2. Orthogonalize the correlated regressor variables.
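The second option is what PCA provides. A minimal sketch (illustrative function name) that replaces correlated regressors with mutually orthogonal principal component scores:

```python
import numpy as np

def orthogonal_scores(Z, k):
    """Return k orthogonal PCA score columns for a regressor matrix Z (n, m)."""
    Zc = Z - Z.mean(axis=0)                        # center each regressor
    U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    return U[:, :k] * s[:k]                        # scores are orthogonal by construction
```

Regressing on these scores instead of the raw correlated variables is exactly why the reduced-dimension models below have lower-variance estimators.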


[1] Al-Akwaa, Fadhl M. "Analysis of Gene Expression Using Biclustering."
[2] Arbeitman et al. "Gene Expression During the Life Cycle of Drosophila melanogaster" (2002).
[3] Kiyoung, Y., et al. "On the Stationarity of Multivariate Time Series" (2005). IEEE Computer Society.


Research Objectives
Project the data onto a low-dimensional manifold using a dimension reduction method (PCA, factor analysis, Eigen Laplacian, Nystrom method, or Indian buffet process), producing orthogonal variables; eliminate the trend; fit a multivariate time series model; and compare R-squared values across regression methods (LASSO, SCAD, Ridge) and numbers of genes. Orthogonality results in reasonable least squares estimation with low standard errors.
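The comparison metric used throughout the results is R-squared; for reference, a minimal definition:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2) # total sum of squares
    return 1.0 - ss_res / ss_tot
```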

Method                 Genes  LASSO  SCAD  Ridge
Nystrom Method           500   0.35  0.32   0.32
Nystrom Method           250   0.30  0.30   0.30
Indian Buffet Process    500   0.17  0.20   0.21
Indian Buffet Process    250   0.17  0.17   0.16
Original Data set        500   0.58  0.55   0.53
Original Data set        250   0.50  0.50   0.41

R-squared values for the Drosophila data

Motivation
Applying various dimension reduction methods to massive microarray time series data.

Challenges In Analyzing Time Series Microarray Data

• The data lie on a low-dimensional manifold
• Almost all of the original information is preserved
• Computational cost is significantly reduced
• The complexity of massive microarray data sets is reduced


Figure: (a) time series plot of Drosophila genes; (b) human cancer cell line (HeLa) genes.

The processed data are represented in the form of a matrix.

Significance of dimension reduction techniques in microarray data

R-squared values for the HeLa data


The expression level for each gene is presented as an image.

Challenges in using dimension reduction techniques for microarray time series data
• Microarray data are noisy
• High computational cost
• Selecting an efficient technique
• Achieving a high-quality dimension reduction

Results

Conclusion

R-squared values by dimension reduction method (PCA, Eigen Laplacian, Factor Analysis), number of genes (1000, 500, 250), and regression method (LASSO, SCAD, Ridge), one table per data set:

Method           Genes  LASSO  SCAD  Ridge
PCA               1000   0.39  0.35   0.40
PCA                500   0.54  0.45   0.55
PCA                250   0.53  0.40   0.41
Eigen Laplacian   1000   0.52  0.50   0.50
Eigen Laplacian    500   0.50  0.51   0.49
Eigen Laplacian    250   0.49  0.42   0.47
Factor Analysis   1000   0.77  0.78   0.73
Factor Analysis    500   0.63  0.53   0.52
Factor Analysis    250   0.53  0.60   0.47

Method           Genes  LASSO  SCAD  Ridge
PCA               1000   0.58  0.30   0.21
PCA                500   0.42  0.35   0.25
PCA                250   0.44  0.44   0.21
Eigen Laplacian   1000   0.23  0.17   0.16
Eigen Laplacian    500   0.20  0.12   0.15
Eigen Laplacian    250   0.22  0.20   0.13
Factor Analysis   1000   0.49  0.44   0.45
Factor Analysis    500   0.50  0.53   0.55
Factor Analysis    250   0.60  0.62   0.63

1. The results show that the explanatory power under factor analysis and principal component analysis is better than that of multiple linear regression for both data sets, as is that of essentially any method producing orthogonal regressor variables.
2. Factor analysis has the highest R-squared value among all of the dimension reduction techniques used.
3. Among all the regression methods, LASSO has the best performance.
4. Reduced-dimension models that provide orthogonal variables are expected to perform relatively better in terms of low-variance estimators, because orthogonality leads to reasonable coefficient estimation with low standard errors.