Time Delay Estimation in Gene Regulatory Networks - Semantic Scholar

Report 2 Downloads 83 Views
Time Delay Estimation in Gene Regulatory Networks Elmira Mohyedinbonab

Omid Ghasemi

Mo Jamshidi

Yu-Fang Jin

Department of Electrical and Computer Engineering University of Texas at San Antonio San Antonio, TX, USA [email protected]

[email protected]

Abstract – Studying Gene regulation helps to reveal important biological functions in progression to diseases. As more time-course datasets become available, gene regulatory networks can be constructed using interactions among regulators and their associated target genes at every time point. Currently, most gene regulation networks are constructed without considering the delayed responses. Such delay is embedded in the network due to the intrinsic temporal process of gene expression. We developed time delay regression model to predict time-dependent interactions which offers an opportunity to further investigate the evolution of gene regulatory networks with time delay and to shed light on the underlying regulations. Average Square Difference Function and Least square estimation methods are applied to estimate the parameters of the proposed model. In our study, time-course gene expressions in mice post-myocardial infarction were used. This method is evaluated by comparing with no-delay and cross correlation-based method. Simulation results and statistical analysis were provided. Keywords: Gene regulation, temporal network, estimation, delay.

1

Introduction

Gene expression is the fundamental biological process and has been extensively studied during the past decade with the availability of high throughput microarray data. Inside a cell, messenger RNA (mRNA) is continuously transcribed to produce proteins responsible for vital functions. Particular types of proteins, called transcription factors (TFs), regulate the concentration of mRNA by repressing or initiating transcription. The regulatory relationships between regulators and the associated genes (target genes) have the potential to be mathematically modeled as functional-response models by exploring gene expression profiles. With the accumulation of tremendous gene expression data, construction of descriptive network models of gene regulation using computational approaches is one of the efficient ways to study the regulatory mechanism. Computational analysis can imply physical interactions and reveal association relationships which help to identify functional modules or predict the behavior of the

[email protected]

[email protected]

system following external perturbations. While a priori knowledge about physical connection of transcription factors to their target genes is indispensable, it is not enough to meet transcription initiation process. To further study gene regulation, temporal gene expression analysis discloses substantial information about transient biological function, which leads to identification of transcription factors and target genes interactions in regulatory networks. Further, some of the suggestive interactions may not be accurate in which case the transcription factors are not able to bind to their target genes. Therefore, incorporating microarray analysis with binding information will address issues related to both problems. Identifying potential binding sites of transcription factors to DNA and combination of the results with microarray technology were investigated in multiple papers [1-4]. In addition, there exist delayed responses in the gene regulation process. Time-delayed phenomenon underlying gene regulatory networks can be conceived through different modeling approaches such as correlation analysis [5], decision-tree-related classifiers [6], and Boolean models [7]. Many methods are capable to estimate delay as one unit time lag, while in many regulation process, there are multiple units of time lag during which the change of the regulator expression is transmitted to changes in associated target gene expression. A model, that poses the ability to estimate multiple-time delays of gene regulations, will expand our knowledge of development of biological processes. Assuming that the gene expression is continuous and changes with a linear trend within each stage, we are able to adopt linear regression models to uncover the network structure and time-delayed responses. In the proposed model, the delay between each pair of target gene and transcription factor is considered. In addition, transcription factors that are regulating same target gene may have different values of time delay. The structure of this paper is described as follows. First, the proposed time-delay regression model for our transcriptional regulatory network is presented. Then, through an optimization approach, average square difference function (ASDF), the time delays in the regression model are estimated. In the third section, statistical criteria to evaluate simulation results are

discussed. This section is followed by the description of the gene expression data and data cleaning process. In the fourth section, simulation results for the proposed algorithm are illustrated and temporal networks after every delay value are constructed. In the final section, we show our conclusion and make some discussions.

2

Method

As a primary step of analysis, we need to remove inactive genes, whose expressions do not change significantly among all time points. Based on the transcription factor database (Transfac), the potential transcription factor binding sites can be predicted to construct the regulatory interactions between target genes and transcription factors. A linear regression model is selected to represent the regulatory effects of transcription factors on target gene expressions. Moreover, to estimate the delay between each pair, ASDF method was applied. With the estimated time delay, the regression model coefficients were assessed based on least square method. To evaluate the performance of this approach, the goodness of fit of the proposed model was analyzed with different statistical tests.

2.1

Multiple-delayed regression model

According to Granger causality analysis, a target gene expression will not be affected before its regulator activation. When a transcription factor regulates its target genes through repression or activation, its impact will appear instantly (zero delay) or within a reasonable amount of delay (the cause comes before its effect). The linear regression model proposed in this study illustrates the interactions between transcription factors and their target genes as follow: m

yˆ j [k] = α 0, j + ∑∑α i, j,τ xi, j [k − ni, j ] ,

(1)

i=1 τ

where, yˆ j [k ] is the estimated target gene expression which interacts with transcription factors xi , j s, m is the number of transcription factors affecting target gene y j , ni, j shows the delay between the transcription factor xi , j and target gene y j , and α i, j s are the regression model coefficients. In the regulation of a target gene, regression coefficient of a regulator may vary at different time points. This is due to the shared impact of transcription factors in a regulatory module. Therefore, it is possible that their coefficients change when a new transcription factor is introduced to the model. In equation (1), this concept is represented by the variable τ in α i , j ,τ which indicates the

variability of the regression coefficients at different time points for transcription factor i affecting target gene j .

2.2

Parameter estimation

A traditional technique to estimate delays between two discrete signals is ASDF. It is a fast algorithm offered by Jacovitti et al. that allows us to find the global minimum of the cost function defined as equation 2 [8]

J = argmin(

1 N

( N −1)T

2

∑ ( y [k ] − yˆ [k ]) ) , j

j

(2)

k =0

where N is the total number of time samples, T is sampling time period, y j [k ] and yˆ j [k ] represent the experimental values of the target genes j , and its estimation, respectively. In our regression model, Least Square Estimation (LSE) was adopted to determine the regression coefficients. The objective of LSE is to minimize the sum of squared errors (SSE), between the estimated target gene expressions and their available experimental values, shown in the following equation: (N−1)T

J = argmin(



m

(y j [k]− α 0, j − ∑∑αi, j,τ xi, j [k − ni, j ]) (3)

k=0

i=1 τ

Figure 1 shows the steps taken in our method in more detail.

2.3

Statistical analysis

The derived regression model was evaluated with two statistical tests, adjusted R 2 and ANOVA F-test. R 2 measures the percentage of total variation in the dependent variable y j . The adjusted R 2 is a modified version of R 2 which contains the number of time points and transcription factors in its formula. It is defined as:

adjusted R 2 = 1 − (1 − R 2 )

N −1 , N − ntf − 1

(4)

where ntf is the number of transcription factors that interacts with an specific target gene and N is the total number of time points. The second statistical analysis is an ANOVA F-test. It is used to test the significance of overall regression results (goodness of fit). It can be derived by dividing the sum of square residuals (SSR) by SSE as following:

F=

SSR ~ F (υ1 ,υ2 ) , SSE (n − 2)

(6)

This test follows an F distribution with two degrees of freedom υ1 and υ 2 . The two degree of freedom is computed using N − ntf and ntf − 1, respectively. Then, the p-value is calculated based on the obtained F-statistic. The overall regression equation is statistically significant at the level of significance of 0.05 (p-value ≤ 0.05).

transcriptional regulators. A total of 110 pairs were found in which several genes and transcription factors were redundant (target genes number = 66, transcription factors number = 39). Then, the micro-array data set was used to identify expressed target genes. First, the measurements of the gene expressions were converted into the log scale. Then, we calculated the fold change of the expression data at each time point based on their control expressions (day 0). The fold change values were standardized afterwards by computing their mean and variance values. Our gene expression data set has few time points that are not equally spaced. To find a meaningful correlation for the delay in our model, it is required to have samples with equal intervals. Another issue associated with our dataset is its long gap between last time points. In order to solve this problem, we need to generate more samples between existed sampling time points. In this study, the minimum time interval in data set was used as the sampling period. The longer time intervals were divided to smaller sections.

4

Results

A temporal gene regulatory network is built based on the resultant delay-based regression model. Different amounts of delay are estimated for our interaction database. In the network, an interaction (connecting edge between two genes) is drawn when the regression coefficient associated with the regulator exists. Hence, after every time delay, the network structure is updated by adding new interactions while keeping the previous interactions among genes. The expression level of target genes at the given time point are affected by the expression level of their regulators at previous time points (to shows the impact of delay). Therefore, the final structure of the network includes all target genes and their regulators. It is important to mention that except the first network in which there is a time lag of zero between target genes and their regulator, there are no concurrent interactions in other constructed networks.

Figure 1. Flowchart of our estimation model.

3

Materials / Data

The data used in this study were the expressions of the genes acquired from the liver of young C57BL/6J mice (total number = 7 with control group) post-myocardial infarction (MI). The time points were control at time 0, and 12 hours, 24 hours (1 day), 72 hours (3 days), 120 hours (5 days), 240 hours (10 days), 720 hours (30 days) post MI. Total number of genes were 30,774 and was represented with their entrez IDs, out of which 7435 of them had pvalues of less than 0.05. The official gene symbols were found using R and Bioconductor. Then, the transfac data set was used to find the set of genes that was bound to sets of

In figure 2, the network models at different time delays are plotted. The interactions (edges) shown in the red color are the old interactions remained from the previous networks. The blue edges indicate interactions where their target genes at the given time point are regulated by their associated transcription factors in some delay ago. The estimated delays are as follow: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and more. In figure 2, as the number of connections of a gene (node) to other gene increases (its degree increases), the size of node also increases. Those high-degree nodes (called hubs), such as trance-acting transcription factor 1 (SP1) and v-rel reticuloendotheliosis viral oncogene homolog A (Rela), may be important since they participate in many interactions and also increase the connectivity of the network.

n=0

n=1

n=2

n=4

n=3

n=5

n=6

n=7

n=8

n=9

n=10

n=11

Final graph

Figure 2. Regulatory network structure at each time delay (n = 0, 1, …, 11). The blue edges show the interaction between activated target genes at that present time and its delayed transcription factor, red edges are interactions remained from previous network structure. The size of node increase as their degree increases.

Table 1. Statistical comparison for three different models when the target gene is regulated by 2 and 3 transcription factors. In both groups with 2 and 3 transcription factors, our model shows better average adjusted R 2 and p-value compared to the cross correlation-based delay model and no-delay model. Number of regulators of target genes

Number of TFs

Possible # of Parameters

Average adjusted R 2

Average p-value

Two

Our Model

3,4

0.60

0.0086

transcription factors

Cross correlationbased delay model

3,4

0.54

0.0189

No-delay model

3

0.53

0.0312

Three

Our Model

4,5,6,7

0.82

3.39E-6

transcription factors

Cross correlationbased delay model

4,5,6,7

0.80

1.026E-5

No-delay model

4

0.75

3.75E-3

Another approach to estimate the delay in the regression model is through cross-correlation analysis. To evaluate our model, we compared the results obtained by our estimation method with those realized by two other models, no-delay regression model, and cross correlationbased delay model. The adjusted R 2 and p-value (from ANOVA F-test) were used for the statistical analysis.. The total number of target genes studied in this study was 66.Based on calculated F-score.- Our model successfully identified 29 significant target genes regulated by transcription factors at different time points (p-value ≤ 0.05) while this number decreased to 27 in cross correlation-based delay model and to 15 in no delay model. Our model discovered all target genes regulated by two transcription factors while the other two models failed to do that. In such a way, 22 target genes were regulated with 2 transcription factors in our model, however; only 20 and 8 targets are selected in cross correlation-based delay model and no delay model, respectively. Target genes regulated by 3, 4, and 6 transcription factors were identified in all three models introduced in table 1. Among target genes associated with two regulators, 10, 9 and 8 target genes have the adjusted R 2 value of over 0.60, for our model, cross correlation-based model and no-delay model, respectively. The range of adjusted R 2 value were between [0.13, 0.96] in our model, [0.037, 0.98] in cross correlation-based model, and [0.006, 0.98] in no delay model. Also the maximum p-values in our model, cross correlation-based model, and no delay model were 0.033, 0.037, and 0.34, correspondingly. In case of having target genes regulated by 3 transcription factors, the adjusted R 2 in our model were above 0.73 while the other two models had at least one

target gene that has an adjusted R 2 less than 0.7. For the same type of target genes, the maximum p-value was 6.55E-6 in our model versus 2.91E-5 in cross correlation model and finally 3.56E-3 in no-delay model. The adjusted R 2 for target genes with 4 transcription factors was 0.657 for both delay-based models while decreased to 0.497 for no-delay model. Furthermore, the estimated p-value was 5.23E-10 for delay-based models versus 4.54E-4 in no-delay model. Table 1 represents the comparison of the average adjusted R 2 and average p-values for all mentioned three models when target genes are regulated with 2 or 3 transcription factors. As we discussed, our model showed better scores compared to those obtained by the cross correlation-based delay model and no-delay model in overall.

5

Conclusions

We introduced a multiple-delayed linear regression model to estimate the delay between transcription factors and their target genes. This model was implemented in two steps, estimating the amount of delay and approximating the regression coefficients for the estimated delay. Furthermore, we applied this model to the temporal gene expressions of mice post MI. After indicating the transcription factors and their target genes in Transfac database, we applied our modeling method to temporal mice data and construct the regulatory network with different time delays. Some genes in the network have high degree of connectivity comparing to other genes, suggesting potential biomarkers for myocardial infarction.

We also compared our model with a cross correlationbased delay model and no-delay model. Adjusted R 2 and ANOVA F-test were used to perform the statistical analysis of the aforementioned models. The statistical measures showed that our model performs better than the crosscorrelation-based delay model and no-delay model.

References [1] T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thompson, I. Simon, J. Zeitlinger, E. G. Jennings, H. L. Murray, D. B. Gordon, B. Ren, J. J. Wyrick, J.-B. Tagne, T. L. Volkert, E. Fraenkel, D. K. Gifford, and R. A. Young, "Transcriptional Regulatory Networks in Saccharomyces cerevisiae," Science, vol. 298, pp. 799-804, October 25, 2002 2002. [2] B. Ren, F. Robert, J. J. Wyrick, O. Aparicio, E. G. Jennings, I. Simon, J. Zeitlinger, J. Schreiber, N. Hannett, E. Kanin, T. L. Volkert, C. J. Wilson, S. P. Bell, and R. A. Young, "Genome-Wide Location and Function of DNA Binding Proteins," Science, vol. 290, pp. 2306-2309, December 22, 2000 2000. [3] Z. Li and C. Chan, "Extracting novel information from gene expression data," Trends in Biotechnology, vol. 22, pp. 381-383, 2004. [4] A.-L. Boulesteix and K. Strimmer, "Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach," Theoretical Biology and Medical Modelling, vol. 2, p. 23, 2005. [5] X. Li, S. Rao, W. Jiang, C. Li, Y. Xiao, Z. Guo, Q. Zhang, L. Wang, L. Du, J. Li, L. Li, T. Zhang, and Q. K. Wang, "Discovery of Time-Delayed Gene Regulatory Networks based on temporal gene expression profiling," Bmc Bioinformatics, vol. 7, p. 26, 2006. [6] L. A. Soinov, M. A. Krestyaninova, and A. Brazma, "Towards reconstruction of gene networks from expression data by supervised learning," Genome Biol, vol. 4, p. R6, 2003. [7] A. Silvescu and V. Honavar. (2001, Temporal boolean network models of genetic networks and their inference from gene expression time series. Complex Systems 13, 54 - 75. [8] G. Jacovitti and G. Scarano, "Discrete-Time Techniques for Time-Delay Estimation," Ieee Transactions on Signal Processing, vol. 41, pp. 525-533, Feb 1993.