JOURNAL OF COMPUTATIONAL BIOLOGY, Volume 16, Number 00, 2009, pp. 1-14. © Mary Ann Liebert, Inc. DOI: 10.1089/cmb.2008.0129
Inference in Signal Transduction Pathways Using EM Algorithm and an Implicit Algorithm: Incomplete Data Case

HANEN BEN HASSEN,1 AFIF MASMOUDI,2 and AHMED REBAI1

1 Unit of Bioinformatics and Biostatistics, Centre of Biotechnology of Sfax, Sfax, Tunisia.
2 Laboratory of Probability and Statistics, Faculty of Sciences of Sfax, Sfax, Tunisia.
ABSTRACT

We summarize here the Implicit statistical inference approach as an alternative to Bayesian networks, and we give an effective iterative algorithm, analogous to the Expectation-Maximization algorithm, to infer a signal transduction network when the dataset is incomplete. We prove the convergence of our algorithm, which we call the Implicit algorithm, and we apply it to simulated data for a simplified signal transduction pathway of the EGFR protein.

Key words: EM algorithm, implicit algorithm, learning incomplete data, signal transduction network.
1. INTRODUCTION
An implicit network (IN) is a model that can be used to define a graphical representation of assumed dependencies and influences among a set of variables. Such a model, when used with statistical techniques, is attractive for its ability to learn causal relationships, and hence to understand and analyze biological data such as protein interaction data. An IN, like a Bayesian network (BN), is represented by nodes that express random variables and edges that represent direct probabilistic dependencies among them. Nodes and edges together form a directed acyclic graph (DAG). Interactions between the nodes are quantified by node probability tables associated with each variable. In classical probabilistic inference, Bayesian networks are based on priors. We recently proposed an Implicit approach that does not require any priors and that can be used when priors are missing or expensive to obtain; Implicit networks can successfully replace Bayesian networks (Ben Hassen et al., 2008). Learning the parameters of INs or BNs when data are complete is straightforward and has been previously addressed, but real-life databases are rarely complete, and an important question concerning these incomplete data is whether the absence of an observation depends on the actual states of the variables. Learning the parameters of a Bayesian or Implicit network from incomplete data is a hard problem. Dempster et al. (1977) proposed the Expectation-Maximization (EM) algorithm, a well-known technique for learning the parameters of statistical models from incomplete data. Lauritzen (1995) described how to apply the EM algorithm to a Bayesian network of known structure to learn the
parameters of the network from incomplete data. This algorithm is iterative and alternates between two steps until convergence towards a stationary point, the maximum-likelihood (ML) parameter estimates or the maximum a posteriori (MAP) estimates (McLachlan and Krishnan, 1997). This article is concerned with learning the parameters of a signal transduction network when the random sample is incomplete. In fact, signal transduction cascades, where proteins interact via activation and inhibition to convert a signal into a physiological response, can be formally represented by a directed acyclic graph and probabilistic dependencies between the proteins. For example, in the phosphorylation cascade of the epidermal growth factor receptor, not all experimental states of the proteins are observed; accordingly, we can apply the EM algorithm and a novel iterative algorithm, named the Implicit algorithm (IM), to predict the probability of the states of each protein in the biological network. The outline of this article is as follows. In Section 2, we briefly recall the principle of inference by the Implicit method. In Section 3, we describe our algorithm for learning parameters when data are incomplete, and in Section 4, we apply it to a signal transduction network of fixed structure for the tyrosine kinase receptor. Finally, in Section 5, we summarize our work and conclude with a discussion of future work.
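To give a concrete picture of how EM operates in this setting, the sketch below (our Python illustration, not the authors' implementation; the two-node structure, names, and counts are hypothetical) alternates the two steps for a parent-child pair in which the parent state is sometimes unobserved: the E-step distributes each incomplete case over the parent states according to the current parameters, and the M-step re-normalizes the resulting expected counts to obtain the ML estimates.

```python
import numpy as np

def em_two_node(complete_cases, child_only_cases, r_p, r_c, n_iter=200, tol=1e-9):
    """Hypothetical two-node example (parent P -> child C), not the paper's code.
    complete_cases   : list of (p, c) pairs with both states observed
    child_only_cases : list of c values whose parent state is missing
    Returns (theta_p, theta_c_given_p), the ML estimates."""
    theta_p = np.full(r_p, 1.0 / r_p)           # P(P = p)
    theta_c = np.full((r_p, r_c), 1.0 / r_c)    # P(C = c | P = p)
    for _ in range(n_iter):
        # E-step: hard counts from the complete cases ...
        n_p = np.zeros(r_p)
        n_pc = np.zeros((r_p, r_c))
        for p, c in complete_cases:
            n_p[p] += 1.0
            n_pc[p, c] += 1.0
        # ... plus expected counts from the cases with a missing parent,
        # distributed over parent states via P(P = p | C = c)
        for c in child_only_cases:
            post = theta_p * theta_c[:, c]
            post /= post.sum()
            n_p += post
            n_pc[:, c] += post
        # M-step: re-normalize the expected counts
        new_theta_p = n_p / n_p.sum()
        new_theta_c = n_pc / n_pc.sum(axis=1, keepdims=True)
        converged = (np.abs(new_theta_p - theta_p).max() < tol and
                     np.abs(new_theta_c - theta_c).max() < tol)
        theta_p, theta_c = new_theta_p, new_theta_c
        if converged:
            break
    return theta_p, theta_c

# Toy usage: parent and child are binary (0 = low, 1 = high)
complete = [(1, 1)] * 50 + [(1, 0)] * 10 + [(0, 0)] * 25 + [(0, 1)] * 5
child_only = [1] * 8 + [0] * 2
print(em_two_node(complete, child_only, r_p=2, r_c=2))
```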
2. INFERENCE WITH THE IMPLICIT METHOD

2.1. General view of the implicit method

Our objective is to estimate the phosphorylation state of each protein in the Epidermal Growth Factor Receptor (EGFR) signal transduction cascade given the observed activity states of the other proteins in the cascade. If we represent this cascade by a network, which is a directed acyclic graph, then each protein in this graph is a node and the apparent causal relationships between these proteins are the directed arrows (the concept of network is further defined in Section 3). This structural representation of the phosphorylation cascade is well suited for applying a Bayesian or an Implicit network methodology.

The Implicit method is similar to the Bayesian method, but it does not need to specify any priors for the parameters. In fact, in the context of Bayesian theory (Robert, 1994), the unknown parameter θ in the statistical model is assumed to be a random variable with a known prior distribution. This prior information is used, together with the data, to derive the posterior distribution of θ. The choice of a prior is generally based on preliminary knowledge of the problem. So, the basic idea of Bayesian theory is to consider any parameter θ as a random variable and to determine its posterior (conditional) distribution given the data and the assumed prior. Alternatively, the concept of Implicit distribution was previously proposed by Hassairi et al. (2005) and can be described as a kind of posterior distribution of a parameter given the data. To explain the principle of the Implicit distribution, consider a family of probability distributions {p(x/θ), θ ∈ Θ} parameterized by an unknown parameter θ in a set Θ, where x is the observed data. The Implicit distribution p(θ/x) is obtained by multiplying the likelihood function p(x/θ) by a counting measure σ if Θ is a countable set, or by a Lebesgue measure σ if Θ is an open set (σ depends only on the topological structure of Θ), and then dividing by the norming constant c(x) = ∫_Θ p(x/θ) σ(dθ). Therefore the Implicit distribution is given by the formula p(θ/x) = (c(x))^{-1} p(x/θ) σ(θ) and plays the role of a posterior distribution of θ given x in the Bayesian method, corresponding to a particular improper prior which depends only on the topology of Θ (without any statistical assumption). Provided it exists (which holds for most statistical models), the Implicit distribution can be used for the estimation of the parameter θ following a Bayesian methodology. The Implicit estimator θ̂ of θ is simply the mean of the Implicit distribution. To avoid any misunderstanding, it is important to emphasize that the Implicit approach is neither a non-informative Bayesian analysis nor a fiducial-like method, as has been discussed by some commenters (Mukhopadhyay, 2006). For a presentation of the theoretical foundations of Implicit inference and some selected applications, readers are referred to Hassairi et al. (2005).
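As a simple illustration of this definition (our worked example, not one taken from Hassairi et al., 2005), consider a binomial model X ~ Bin(n, θ) with Θ = (0, 1). Since Θ is an open set, σ is the Lebesgue measure and the Implicit distribution is simply the normalized likelihood,

\[
p(\theta / x) \;=\; \frac{\theta^{x}(1-\theta)^{\,n-x}}{\int_0^1 u^{x}(1-u)^{\,n-x}\,du}
\;=\; \mathrm{Beta}(x+1,\; n-x+1),
\qquad
\hat\theta \;=\; E(\theta / x) \;=\; \frac{x+1}{n+2},
\]

so the Implicit estimator coincides here with the Laplace-type estimator (x + 1)/(n + 2).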
2.2. Implicit method in the multinomial case

Let X = (N_1, ..., N_r) be a random variable following a multinomial distribution with unknown parameters N = Σ_{i=1}^{r} N_i and θ = (θ_2, ..., θ_r). We first estimate N by the Implicit method and then use the estimate N̂ to estimate θ. After some calculations, we obtain
\[
P(N/X) \;=\; \frac{P(X/N)}{C(X)} \;=\; \frac{C_{N}^{\bar N_1}\,\theta_1^{\,N-\bar N_1}\,(1-\theta_1)^{\bar N_1+1}}{C(X)},
\qquad \text{where } \bar N_1 = N - N_1 = \sum_{i=2}^{r} N_i.
\]

So, the Implicit distribution of N given X = (N_1, ..., N_r) is a Pascal distribution with parameters 1 - θ_1 and \bar N_1 + 1. Suppose that θ_1 is known; the Implicit estimator N̂ of N is the mean of this Pascal distribution:

\[
\hat N \;=\; E(N/X) \;=\; \sum_{N \ge 0} N\, C_{N}^{\bar N_1}\,\theta_1^{\,N-\bar N_1}\,(1-\theta_1)^{\bar N_1+1}.
\]
Denote by N_ob the number of observations and take

\[
\theta_{k_0} \;=\; \max\Big\{ \frac{N_k}{N_{ob}} \;:\; \frac{N_k}{N_{ob}} \le \frac{1}{r-1},\ 1 \le k \le r \Big\}.
\]

After some calculations, we have

\[
\hat N \;=\; \frac{\bar N_{k_0}+1}{1-\theta_{k_0}} \;=\; N_{ob} + \frac{N_{k_0}}{\bar N_{k_0}},
\qquad \text{where } \bar N_{k_0} = N_{ob} - N_{k_0}.
\]

Consequently, the probability that the next observation is in state x_k given a dataset D is

\[
\hat\theta_k \;=\; P(X_{N_{ob}+1} = x_k / D) \;=\; \frac{N_k+1}{\hat N + r}, \quad 1 \le k \le r,\ k \ne k_0,
\qquad \text{and} \qquad
\hat\theta_{k_0} \;=\; 1 - \sum_{i \ne k_0} \hat\theta_i. \tag{2.1}
\]
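To make Equation (2.1) concrete, the following Python sketch (our illustration; the function and variable names are hypothetical) computes the Implicit estimates N̂ and θ̂_k from the observed counts N_1, ..., N_r:

```python
import numpy as np

def implicit_multinomial(counts):
    """Sketch of the implicit estimator of Section 2.2 as we read Eq. (2.1);
    counts holds the observed occurrences N_1, ..., N_r of the r states."""
    counts = np.asarray(counts, dtype=float)
    r = counts.size
    n_ob = counts.sum()
    freqs = counts / n_ob
    # k0: state of maximum frequency among those with frequency <= 1/(r-1)
    candidates = np.where(freqs <= 1.0 / (r - 1))[0]
    k0 = candidates[np.argmax(freqs[candidates])]
    # implicit estimate of the unknown total number of observations
    n_hat = n_ob + counts[k0] / (n_ob - counts[k0])
    # implicit estimates of the state probabilities, Eq. (2.1)
    theta = (counts + 1.0) / (n_hat + r)
    theta[k0] = 1.0 - np.sum(np.delete(theta, k0))
    return n_hat, theta

# Toy usage with r = 3 states
print(implicit_multinomial([12, 30, 58]))
```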
2.3. Implicit inference with Bayesian networks

We define an IN as a set of variables X = {X_1, ..., X_n} with: (1) a network structure S that encodes a set of conditional independence assertions about the variables in X, and (2) a set P of local probability distributions associated with each variable. Together, these components define the joint probability distribution for X. The network structure S is a directed acyclic graph (DAG), well suited for exploring relationships among all the variables. The nodes in S correspond to the variables in X. Each X_i denotes both the variable and its corresponding node, and Pa(X_i) denotes the parents of node X_i in S as well as the variables corresponding to
FIG. 1. Basic structure of an Implicit network (the node X3 is the child of X1 and X2, which are its parents).
those parents (Fig. 1). The absence of possible arcs in S encodes conditional independencies. In particular, given the structure S, the joint probability distribution for X is given by the product of all the specified conditional probabilities:

\[
P(X_1, \ldots, X_n) \;=\; \prod_{i=1}^{n} P\big(X_i / Pa(X_i)\big). \tag{3.1}
\]
The local probability distributions P are the distributions corresponding to the terms in this product of conditional distributions. When building an IN without prior knowledge, the probabilities depend only on the structure of the parameter set.
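As a minimal sketch of how the factorization (3.1) is evaluated in practice, the following Python fragment (ours; the node names and probability values are hypothetical and chosen to mimic the toy structure of Fig. 1) multiplies the conditional probabilities along the DAG:

```python
def joint_probability(assignment, parents, cpt):
    """Evaluate Eq. (3.1) for a full assignment of states.
    assignment : dict node -> observed state
    parents    : dict node -> tuple of parent node names (empty for a root)
    cpt        : dict node -> {(parent-states tuple, state): probability}"""
    p = 1.0
    for node, state in assignment.items():
        parent_states = tuple(assignment[q] for q in parents[node])
        p *= cpt[node][(parent_states, state)]
    return p

# Toy usage on the Fig. 1 structure (X3 child of X1 and X2), binary states H/L
parents = {"X1": (), "X2": (), "X3": ("X1", "X2")}
cpt = {
    "X1": {((), "H"): 0.7, ((), "L"): 0.3},
    "X2": {((), "H"): 0.8, ((), "L"): 0.2},
    "X3": {(("H", "H"), "H"): 0.9, (("H", "H"), "L"): 0.1,
           (("H", "L"), "H"): 0.6, (("H", "L"), "L"): 0.4,
           (("L", "H"), "H"): 0.5, (("L", "H"), "L"): 0.5,
           (("L", "L"), "H"): 0.1, (("L", "L"), "L"): 0.9},
}
print(joint_probability({"X1": "H", "X2": "H", "X3": "H"}, parents, cpt))  # 0.7*0.8*0.9
```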
2.4. Learning probabilities

In recent years, learning graphical probabilistic models such as Bayesian and Markov networks has become a very active research area, and many papers have addressed probabilistic network learning (Chrisman, 1996; Krause, 1996; Chickering et al., 2004). In this section, we show how to refine the structure and local probability distributions of an IN given data. The result of learning an IN from data is a directed graph that is often directly interpretable and can be used to make quantitative predictions of outcomes together with error estimates. Let n be the number of nodes of a DAG. With every node i we associate a random variable X_i having r_i states:

node 1 → X_1 ∈ {x_1^1, ..., x_1^{r_1}}
node 2 → X_2 ∈ {x_2^1, ..., x_2^{r_2}}
...
node i → X_i ∈ {x_i^1, ..., x_i^{r_i}}
...
node n → X_n ∈ {x_n^1, ..., x_n^{r_n}}.
Let D be a dataset and let N_ijk be the number of cases in D in which node i is in state k and its parents are in state j, that is, X_i = x_i^k and Pa(X_i) = x^j. The distribution of X_i is multinomial with parameters N_ij and θ_ij = (θ_ij2, ..., θ_ijr_i), where N_ij = Σ_{k=1}^{r_i} N_ijk, θ_ijk = P(X_i = x_i^k / Pa(X_i) = x^j) for k = 1, ..., r_i, Σ_{k=1}^{r_i} θ_ijk = 1, and

\[
P\big(X_i = (N_{ij1}, \ldots, N_{ijr_i}) \,/\, Pa(X_i) = x^j\big) \;=\; N_{ij}!\, \prod_{k=1}^{r_i} \frac{\theta_{ijk}^{N_{ijk}}}{N_{ijk}!}.
\]

Then N_ij and θ_ij are unknown parameters that will be estimated by the Implicit method. Given a network S, consider for node i the quantity N_ijob, the observed number of occurrences of node i with its parents in state j. Let

\[
\theta_{ijk(0)} \;=\; \frac{N_{ijk(0)}}{N_{ijob}} \;=\; \max\Big\{ \frac{N_{ijk}}{N_{ijob}} \;:\; \frac{N_{ijk}}{N_{ijob}} \le \frac{1}{r_i-1},\ 1 \le k \le r_i \Big\}.
\]

The application of the Implicit method of Section 2.2 gives the following estimates of N_ij and θ_ij:

\[
\hat N_{ij} \;=\; N_{ijob} + \frac{N_{ijk(0)}}{\bar N_{ijk(0)}}, \tag{3.2}
\]

\[
\hat\theta_{ijk} \;=\; \frac{N_{ijk}+1}{\hat N_{ij}+r_i} \quad \text{if } k \ne k(0), \tag{3.3}
\]

where \bar N_{ijk(0)} = N_ijob - N_ijk(0) and θ̂_ijk(0) = 1 - Σ_{k≠k(0)} θ̂_ijk.
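The next Python sketch (ours, with hypothetical counts) applies Equations (3.2) and (3.3) to every parent configuration of a node in order to fill its conditional probability table from complete-data counts; it reuses the same maximum-frequency rule as the single-node estimator above:

```python
import numpy as np

def implicit_cpt(counts_ij):
    """Sketch of Eqs. (3.2)-(3.3): implicit estimates of one node's conditional
    probability table from counts.
    counts_ij : array of shape (n_parent_configs, r_i) where entry [j, k] is
                N_ijk, the number of cases with node i in state k and its
                parents in configuration j."""
    counts_ij = np.asarray(counts_ij, dtype=float)
    r_i = counts_ij.shape[1]
    theta = np.zeros_like(counts_ij)
    for j, row in enumerate(counts_ij):
        n_ijob = row.sum()
        freqs = row / n_ijob
        ok = np.where(freqs <= 1.0 / (r_i - 1))[0]
        k0 = ok[np.argmax(freqs[ok])]                       # state k(0)
        n_ij_hat = n_ijob + row[k0] / (n_ijob - row[k0])    # Eq. (3.2)
        theta[j] = (row + 1.0) / (n_ij_hat + r_i)           # Eq. (3.3), k != k(0)
        theta[j, k0] = 1.0 - np.sum(np.delete(theta[j], k0))
    return theta

# Toy usage: binary node with two binary parents (4 parent configurations)
print(implicit_cpt([[80, 20], [55, 45], [30, 70], [10, 90]]))
```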
3. LEARNING FROM INCOMPLETE DATA

We now apply the Implicit method to learning the parameters of an IN when the dataset is incomplete. Given a dataset D with missing data, we compute the Implicit distribution P(θ/D) and use it in turn to compute the expectations of interest. Let X be a random variable that follows a multinomial distribution with parameters N and θ = (θ_1, ..., θ_r), and let Y = (N_1, ..., N_r) ⊂ X and Z = (N̄_1, ..., N̄_r) ⊂ X denote the observed and unobserved variables, respectively. So X = (N_1 + N̄_1, ..., N_r + N̄_r) and

\[
P(\theta / Y) \;=\; \sum_{Z} P(Z / Y)\, P(\theta / Y, Z).
\]
3.1. Implicit algorithm

To estimate the parameters θ_ijk of the network from an incomplete dataset, we propose a new iterative algorithm, named the Implicit algorithm. First we introduce some notation. Consider a node i with its parents in state j and a dataset D which contains N_ij^(0) observed and unobserved values in such a state. Let N_ijob^(0) be the number of observed values in D, so that N_ij^(0) > N_ijob^(0) and N_ij^(0) - N_ijob^(0) represents the number of unobserved states. The initial conditions for a node i are therefore: N_ij^(0), the number of observed and unobserved states; θ_ijk^(0), the observed frequency of node i in state k given its parents in state j; N_ijk^(0) = N_ij^(0) θ_ijk^(0), the number of observed occurrences of node i in state k with its parents in state j; and

\[
N_{ijob}^{(0)} \;=\; \sum_{k=1}^{r_i} N_{ijk}^{(0)}.
\]
The Implicit algorithm is iterative and involves three steps: the first step consists in the choice of the maximum of the conditional frequencies, the second step estimates the number of observations from the first step, and the third computes the other conditional probabilities. The algorithm then iterates through the following steps until convergence, which is proved below.

Iteration (t = 0):
\[
N_{ij}^{(0)}, \qquad \theta_{ijk}^{(0)}, \qquad N_{ijk}^{(0)} = N_{ij}^{(0)}\theta_{ijk}^{(0)}, \qquad N_{ijob}^{(0)} = \sum_{k=1}^{r_i} N_{ijk}^{(0)}.
\]

Iteration (t = 1):
\[
k(0) = \arg\max_{1\le k\le r_i}\Big\{ \frac{N_{ijk}^{(0)}}{N_{ijob}^{(0)}} : \frac{N_{ijk}^{(0)}}{N_{ijob}^{(0)}} \le \frac{1}{r_i-1} \Big\}
\quad \text{(the state with maximum frequency)},
\]
\[
N_{ij}^{(1)} = N_{ijob}^{(0)} + \frac{N_{ijk(0)}^{(0)}}{N_{ijob}^{(0)} - N_{ijk(0)}^{(0)}}, \qquad
\theta_{ijk}^{(1)} = \frac{N_{ijk}^{(0)}+1}{N_{ij}^{(1)}+r_i} \ \ (k \ne k(0)),
\]
\[
N_{ijk}^{(1)} = N_{ij}^{(1)}\theta_{ijk}^{(1)}, \qquad N_{ijob}^{(1)} = \sum_{k=1}^{r_i} N_{ijk}^{(1)}.
\]

Iteration (t): the current quantities are θ_ijk^(t), N_ijk^(t), and N_ijob^(t) = Σ_{k=1}^{r_i} N_ijk^(t).
Iteration (t + 1):
\[
k(t) = \arg\max_{1\le k\le r_i}\Big\{ \frac{N_{ijk}^{(t)}}{N_{ijob}^{(t)}} : \frac{N_{ijk}^{(t)}}{N_{ijob}^{(t)}} \le \frac{1}{r_i-1} \Big\},
\qquad
N_{ij}^{(t+1)} = N_{ijob}^{(t)} + \frac{N_{ijk(t)}^{(t)}}{N_{ijob}^{(t)} - N_{ijk(t)}^{(t)}},
\]
\[
\theta_{ijk}^{(t+1)} = \frac{N_{ijk}^{(t)}+1}{N_{ij}^{(t+1)}+r_i} \ \ (k \ne k(t)),
\qquad
\theta_{ijk(t)}^{(t+1)} = 1 - \sum_{k \ne k(t)} \theta_{ijk}^{(t+1)} = \frac{N_{ijk(t)}^{(t)}+1}{N_{ij}^{(t+1)}+r_i},
\]
\[
N_{ijk}^{(t+1)} = N_{ij}^{(t+1)}\theta_{ijk}^{(t+1)}, \qquad N_{ijob}^{(t+1)} = \sum_{k=1}^{r_i} N_{ijk}^{(t+1)}.
\]
The convergence of this algorithm is established by the following theorem.

Theorem:
(i) Σ_{k=1}^{r_i} N_ijk^(t+1) > Σ_{k=1}^{r_i} N_ijk^(t).
(ii) Since Σ_{k=1}^{r_i} N_ijk^(t) is strictly increasing, Σ_{k=1}^{r_i} N_ijk^(t) converges to N_ij^(0) as t → +∞.
Proof: By induction on t.

(i) For k ≠ k(t),
\[
N_{ijk}^{(t+1)} - N_{ijk}^{(t)} = \theta_{ijk}^{(t+1)} N_{ij}^{(t+1)} - N_{ijk}^{(t)}
= \frac{(N_{ijk}^{(t)}+1)N_{ij}^{(t+1)} - N_{ijk}^{(t)}N_{ij}^{(t+1)} - r_i N_{ijk}^{(t)}}{N_{ij}^{(t+1)}+r_i}
= \frac{N_{ij}^{(t+1)} - r_i N_{ijk}^{(t)}}{N_{ij}^{(t+1)}+r_i}.
\]

Then,
\[
\sum_{k=1}^{r_i} N_{ijk}^{(t+1)} - \sum_{k=1}^{r_i} N_{ijk}^{(t)}
= \frac{r_i N_{ij}^{(t+1)} - r_i N_{ijob}^{(t)}}{N_{ij}^{(t+1)}+r_i}
= \frac{r_i\,\big(N_{ij}^{(t+1)} - N_{ijob}^{(t)}\big)}{N_{ij}^{(t+1)}+r_i} > 0,
\]
because N_ij^(t+1) - N_ijob^(t) > 0.

(ii) According to (i), Σ_{k=1}^{r_i} N_ijk^(t) = N_ijob^(t) is strictly increasing in t, so
\[
N_{ijob}^{(1)} < N_{ijob}^{(2)} < \cdots < N_{ijob}^{(t)} < \cdots.
\]
Hence there exists s such that
\[
N_{ijob}^{(s)} \le N_{ij}^{(0)} < N_{ijob}^{(s+1)}. \qquad \blacksquare
\]
This means that the algorithm will stop at iteration t if the number of observed data N_ijob at iteration t + 1 is larger than the number N_ij^(0) of data in the complete dataset (observed and unobserved states in the initial condition). For a node i in state k with parents in state j, we have N_ijob observed data. By applying our algorithm, we go from one iteration to the next to estimate the number of occurrences N_ijk (observed and unobserved states) and the conditional probabilities θ_ijk characterizing the state of the node given its parents. The algorithm converges when Σ_{k=1}^{r_i} N_ijk ≥ N_ij^(0).
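The following Python sketch (our reading of Section 3.1, not the authors' R program) runs these iterations and the stopping rule for a single node i with its parents fixed in configuration j:

```python
import numpy as np

def implicit_algorithm(observed_counts, n_total, max_iter=10_000):
    """Sketch of the Implicit algorithm for one node i and one parent
    configuration j.
    observed_counts : observed occurrences N_ijk of each of the r_i states
    n_total         : N_ij(0), number of cases (observed + unobserved) for
                      this parent configuration
    Returns the estimated conditional probabilities theta_ijk."""
    n_ijk = np.asarray(observed_counts, dtype=float)   # N_ijk(0)
    r_i = n_ijk.size
    theta = n_ijk / n_total                            # theta_ijk(0)
    n_ijob = n_ijk.sum()                               # N_ijob(0)
    for _ in range(max_iter):
        if n_ijob >= n_total:                          # convergence criterion
            break
        freqs = n_ijk / n_ijob
        ok = np.where(freqs <= 1.0 / (r_i - 1))[0]
        k_t = ok[np.argmax(freqs[ok])]                 # state k(t)
        n_ij = n_ijob + n_ijk[k_t] / (n_ijob - n_ijk[k_t])   # N_ij(t+1)
        theta = (n_ijk + 1.0) / (n_ij + r_i)           # theta_ijk(t+1), k != k(t)
        theta[k_t] = 1.0 - np.sum(np.delete(theta, k_t))
        n_ijk = n_ij * theta                           # N_ijk(t+1)
        n_ijob = n_ijk.sum()                           # N_ijob(t+1)
    return theta

# Toy usage: a binary node, 80 cases observed (30 low / 50 high), 20 missing
print(implicit_algorithm([30, 50], n_total=100))
```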
4. APPLICATION TO THE EGFR SIGNALING PATHWAY

The Epidermal Growth Factor Receptor (EGFR) protein is a member of the ErbB family of transmembrane tyrosine kinase receptors, which are central components of cellular signaling pathways and are involved in many cellular processes such as cell proliferation, metabolism, survival, and apoptosis (Linggi and Carpenter, 2006; Normanno et al., 2006). Several studies have provided evidence that EGFR is involved in the pathogenesis and progression of different carcinoma types. The EGFR protein has three domains: an extracellular domain which binds ligands, a transmembrane domain, and an intracellular domain with tyrosine kinase activity. When a ligand binds to the extracellular domain, two EGFR molecules aggregate to form a dimer. Then, the tyrosine kinase domain of one molecule phosphorylates the C-terminal tyrosine residues of the other molecule (Aifa et al., 2006). This phosphorylation produces binding sites for proteins with SH2 domains, such as Grb2. Grb2 is an adapter protein that binds to the active EGFR, and the complex is a branch point that leads to several signalling pathways through binding to different potential targets. One of these pathways is the Ras Mitogen Activated Protein Kinase (MAPK) pathway that induces cell division (Kholodenko et al., 1999). In order to model the EGFR signaling pathway and its relationship to human pathologies, we consider in Figure 2 a simplified structure of the network in which only the following nodes are used: ligand (EGF), receptor (EGFR), receptor-ligand dimer in the active state (EGFR*), adapter protein (Grb2), complex of EGFR* and Grb2 (Grb2*), and cellular response through the Ras/MAPK pathway (Ras). The relationships between the variables are as follows:
- The protein expression level of EGF is either high or low (H/L).
- The protein expression level of EGFR is either high or low (H/L).
- The level of the (EGFR:EGF) dimer (denoted EGFR*) can be high or low (H/L), depending on the expression levels of both receptor and ligand.
- The protein expression level of Grb2 (adapter protein) is high or low (H/L).
- The level of the EGFR*/Grb2 protein complex is high or low (H/L), depending on the levels of both EGFR* and Grb2.
- Ras is activated and initiates a cascade of reactions that leads to a cellular response (Yes/No).
FIG. 2. Scheme of a simplified EGFR signaling pathway.
We simulated 1,000 datasets of 1,000 experiments each. An experiment corresponds to the measurement of the status of all proteins in the network; the data were generated according to the prior probabilities in Appendix 1 and the network structure in Figure 2. We repeated the simulation with 5, 10, 15, and 20 percent missing data. Using our algorithm or the EM algorithm, we are able to compute the probability of the state of each signaling node of the model given the simulated data, when some of these data are missing. In order to compare our algorithm to the EM algorithm, we calculated, by the Implicit and the Bayesian (maximum a posteriori) methods, the tables of conditional probabilities of the network (probabilities θ_ijk of the state of each variable given its parents). We used a program implemented in the R language for the simulations and the calculation of the parameter estimates (available on request from the authors). In Table 1 we give the means of the estimates θ̂_ijk and the standard errors for all nodes of the signaling network, obtained by EM (with a uniform Bayesian prior, EMU, and with a Bayesian prior based on the true values, EMP) and by the IM algorithm, for 10 percent missing data. The corresponding tables for 5, 15, and 20 percent missing data are given in Appendices 2, 3, and 4, respectively. Globally, there is a good concordance between the two approaches. In fact, when we compare the Implicit algorithm estimates to the EM algorithm estimates based on the true values as priors (the most favorable setting for Bayesian estimation), the precision of the two sets of estimates is essentially the same. However, we see a better precision of the Implicit estimates compared to the EM estimates with uniform priors (MSE IM = 0.00015 versus MSE EMU = 0.77688) for 10 percent missing data. This is illustrated in Figure 3, which shows the difference between the results of the three methods and, in particular, that our method differs from and performs better than the uniform Bayesian method (EM algorithm with uniform priors).

Table 1. Means of simulated conditional probabilities and standard errors (SE) with 10 percent missing data, estimated by the Implicit (IM), EM with uniform Bayesian prior (EMU), and EM with Bayesian prior (EMP) algorithms.

Parameter   IM        EMU       EMP(a)    TV(b)    SE(IM)    SE(EMU)   SE(EMP)
θ111        0.2963    0.3979    0.2979    0.3      0.0005    0.0003    0.0003
θ112        0.7037    0.6021    0.7021    0.7      0.0005    0.0003    0.0003
θ211        0.1953    0.3477    0.1978    0.2      0.0005    0.0002    0.0002
θ212        0.8047    0.6523    0.8022    0.8      0.0005    0.0002    0.0002
θ311        0.2955    0.3984    0.2988    0.3      0.0005    0.0003    0.0003
θ312        0.7045    0.6016    0.7012    0.7      0.0005    0.0003    0.0003
θ411        0.8030    0.6421    0.8008    0.8      0.0018    0.0009    0.0009
θ412        0.1970    0.3579    0.1992    0.2      0.0018    0.0009    0.0009
θ421        0.3971    0.4527    0.3992    0.4      0.0015    0.0007    0.0007
θ422        0.6029    0.5477    0.6008    0.6      0.0015    0.0007    0.0007
θ431        0.9009    0.6900    0.9007    0.9      0.0007    0.0003    0.0003
θ432        0.0991    0.3100    0.0993    0.1      0.0007    0.0003    0.0003
θ441        0.0998    0.3100    0.1001    0.1      0.0005    0.0002    0.0002
θ442        0.9002    0.6900    0.8999    0.9      0.0005    0.0002    0.0002
θ511        0.8995    0.6903    0.9000    0.9      0.0010    0.0005    0.0005
θ512        0.1005    0.3097    0.1000    0.1      0.0010    0.0005    0.0005
θ521        0.7007    0.5944    0.7006    0.7      0.0010    0.0005    0.0005
θ522        0.2993    0.4056    0.2994    0.3      0.0010    0.0005    0.0005
θ531        0.9003    0.6902    0.8998    0.9      0.0008    0.0004    0.0004
θ532        0.0997    0.3098    0.1002    0.1      0.0008    0.0004    0.0004
θ541        0.0999    0.3097    0.1004    0.1      0.0005    0.0003    0.0003
θ542        0.9001    0.6903    0.8996    0.9      0.0005    0.0003    0.0003
θ611        0.9002    0.6902    0.8998    0.9      0.0005    0.0002    0.0002
θ612        0.0998    0.3098    0.1002    0.1      0.0005    0.0002    0.0002
θ621        0.1003    0.3106    0.1000    0.1      0.0005    0.0002    0.0002
θ622        0.8997    0.6894    0.9000    0.9      0.0005    0.0002    0.0002
MSE(c)      0.00015   0.77688   0.00003

(a) Estimates obtained using the true values as prior parameters.
(b) True values used to simulate the data.
(c) Mean squared error, calculated as Σ (θ̂_ijk − θ_ijk)².
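For concreteness, the following Python sketch (ours; the authors used an R program, available on request) mimics the simulation scheme described above with the conditional probabilities of Appendix 1. The probability P(Ras = Yes / Grb2* = L) is not listed in Appendix 1 and is set here to 0.1, which is consistent with the true values in Table 1; the entry-wise way of masking missing values is also our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Conditional probability tables of Figure 2 (values from Appendix 1);
# each entry is P(node = 'H' | parent configuration), except Ras, which is
# P(response = 'Yes' | Grb2*).
p_egf, p_egfr, p_grb2 = 0.7, 0.8, 0.7
p_egfr_star = {("H", "H"): 0.9, ("H", "L"): 0.6, ("L", "L"): 0.2, ("L", "H"): 0.1}
p_grb2_star = {("H", "H"): 0.9, ("H", "L"): 0.3, ("L", "L"): 0.1, ("L", "H"): 0.1}
p_ras = {"H": 0.9, "L": 0.1}   # value for Grb2* = L is our assumption

def draw(p):                    # 'H' with probability p, else 'L'
    return "H" if rng.random() < p else "L"

def simulate_experiment():
    egf, egfr, grb2 = draw(p_egf), draw(p_egfr), draw(p_grb2)
    egfr_star = draw(p_egfr_star[(egf, egfr)])
    grb2_star = draw(p_grb2_star[(grb2, egfr_star)])
    ras = "Yes" if rng.random() < p_ras[grb2_star] else "No"
    return [egf, egfr, egfr_star, grb2, grb2_star, ras]

def simulate_dataset(n_experiments=1000, missing_rate=0.10):
    data = [simulate_experiment() for _ in range(n_experiments)]
    for row in data:                          # hide entries at random
        for i in range(len(row)):
            if rng.random() < missing_rate:
                row[i] = None
    return data

dataset = simulate_dataset()
```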
FIG. 3. Mean values of the parameters estimated using 1,000 simulated datasets (t stands for θ, and the parameters are given in the same order as in Table 1).
Furthermore, Figure 4 shows that, for most simulations, the Implicit algorithm makes fewer iterations and is more stable when the number of parents of a node increases.
5. DISCUSSION

In this article, we have recalled the concept of Implicit networks (IN) as an alternative approach to Bayesian networks, and we have generalized the Implicit method to handle incomplete datasets using an iterative process similar to the Expectation-Maximization algorithm. The algorithm was run with 5, 10, 15, and 20 percent of missing data and was applied to the estimation of the states of the nodes of a simplified signal transduction network.
FIG. 4. Number of iterations made by the Implicit (IM), EM with uniform Bayesian prior (EMU), and EM with Bayesian prior (EMP) algorithms.
As is well known, the choice of prior information in Bayesian approaches has always been problematic and is considered by many to be the major weakness of such methods. The IN approach avoids the problem of priors in an elegant manner and leads to more tractable formulas that are easier to implement. When the data are incomplete, learning the parameters of Bayesian or Implicit networks is a difficult problem, and many approaches use algorithms that very often converge to local optima; in contrast, the convergence of our novel iterative algorithm is proved. By comparing the parameters of the signaling network estimated by the Implicit and by the EM algorithms, we demonstrated that our algorithm performs very well in all simulations. In fact, the estimates given by the two algorithms are close to those given by complete-case Bayesian learning. In most simulation cases, the Implicit algorithm makes fewer iterations than the EM algorithm. Thus, INs and the Implicit algorithm constitute an original and promising alternative in situations where BNs would be recommended but the data are incomplete and the priors are missing or difficult to obtain. The Implicit algorithm might therefore become a reference method for many applications in biology, particularly in the modelling of gene regulatory or signaling pathways. We can also use the Implicit algorithm in all fields where the EM algorithm has been shown to be a useful tool, such as gene expression analysis (Choi et al., 2007), protein-protein interaction (Mamitsuka, 2005), and genetic association studies of multifactorial diseases (Nakayama et al., 2006). As we have shown in the example in Section 4, the Implicit algorithm can be efficiently applied to experimental data with incomplete observations to infer probabilities in molecular pathways. From a practical standpoint, the predictions given by an IN can provide a starting point for drug target selection or drug response prediction. For example, the IN approach presented in the example could be used to model the whole EGFR pathway and predict the effect of various kinase inhibitors on the physiological response. Another interesting issue is the quantitative modelling of signaling pathways (Kholodenko et al., 1999). In this case, the nodes are no longer discrete variables but continuously distributed variables. Bayesian networks have already been generalized to continuous variables or to a mixture of discrete and continuous variables (Bøttcher and Dethlefsen, 2003). With IN this is also possible, since the Implicit inference framework accommodates very well any type of distribution (Hassairi et al., 2005).
6. APPENDIX 1

6.1. Tables of prior probabilities for the Bayesian network approach

Pr(EGF):   H = 0.7, L = 0.3
Pr(EGFR):  H = 0.8, L = 0.2
Pr(Grb2):  H = 0.7, L = 0.3

Pr(EGFR* = H / EGF, EGFR):
  EGF   EGFR    Pr
  H     H       0.9
  H     L       0.6
  L     L       0.2
  L     H       0.1

Pr(Grb2* = H / Grb2, EGFR*):
  Grb2  EGFR*   Pr
  H     H       0.9
  H     L       0.3
  L     L       0.1
  L     H       0.1

Pr(Ras / Grb2* = H):
  Ras   Pr
  Y     0.9
  N     0.1
7. APPENDIX 2

7.1. Means of simulated conditional probabilities and standard errors (SE) with 5 percent missing data, estimated by the Implicit (IM), EM with uniform Bayesian prior (EMU), and EM with Bayesian prior (EMP) algorithms

Parameter   IM        EMU       EMP(a)    TV(b)    SE(IM)    SE(EMU)   SE(EMP)
θ111        0.2986    0.3996    0.2996    0.3      0.0005    0.0002    0.0003
θ112        0.7014    0.6004    0.7004    0.7      0.0005    0.0002    0.0003
θ211        0.1987    0.3494    0.1998    0.2      0.0004    0.0002    0.0002
θ212        0.8013    0.6506    0.8002    0.8      0.0004    0.0002    0.0002
θ311        0.2980    0.4003    0.2991    0.3      0.0005    0.0002    0.0002
θ312        0.7020    0.5997    0.7009    0.7      0.0005    0.0002    0.0002
θ411        0.8028    0.6453    0.8008    0.8      0.0017    0.0008    0.0009
θ412        0.1972    0.3547    0.1992    0.2      0.0017    0.0008    0.0009
θ421        0.3982    0.4514    0.4002    0.4      0.0015    0.0007    0.0007
θ422        0.6018    0.5488    0.5998    0.6      0.0015    0.0007    0.0007
θ431        0.9009    0.6951    0.8994    0.9      0.0007    0.0003    0.0003
θ432        0.0991    0.3049    0.1006    0.1      0.0007    0.0003    0.0003
θ441        0.1005    0.3051    0.1003    0.1      0.0004    0.0002    0.0002
θ442        0.8995    0.6949    0.8997    0.9      0.0004    0.0002    0.0002
θ511        0.9018    0.6945    0.8998    0.9      0.0010    0.0005    0.0005
θ512        0.0982    0.3055    0.1002    0.1      0.0010    0.0005    0.0005
θ521        0.7007    0.5979    0.6996    0.7      0.0009    0.0005    0.0005
θ522        0.2993    0.4021    0.3004    0.3      0.0009    0.0005    0.0005
θ531        0.9005    0.6946    0.8999    0.9      0.0007    0.0004    0.0004
θ532        0.0995    0.3054    0.1001    0.1      0.0007    0.0004    0.0004
θ541        0.0987    0.3051    0.1002    0.1      0.0005    0.0002    0.0002
θ542        0.9013    0.6949    0.8998    0.9      0.0005    0.0002    0.0002
θ611        0.9011    0.6954    0.9002    0.9      0.0004    0.0002    0.0002
θ612        0.0989    0.3046    0.0998    0.1      0.0004    0.0002    0.0002
θ621        0.0997    0.3046    0.0999    0.1      0.0004    0.0002    0.0002
θ622        0.9003    0.6954    0.9001    0.9      0.0004    0.0002    0.0002
MSE(c)      0.000054  0.75      0.000005

(a) Estimates obtained using the true values as prior parameters.
(b) True values used to simulate the data.
(c) Mean squared error, calculated as Σ (θ̂_ijk − θ_ijk)².
8. APPENDIX 3

8.1. Means of simulated conditional probabilities and standard errors (SE) with 15 percent missing data, estimated by the Implicit (IM), EM with uniform Bayesian prior (EMU), and EM with Bayesian prior (EMP) algorithms

Parameter   IM        EMU       EMP(a)    TV(b)    SE(IM)    SE(EMU)   SE(EMP)
θ111        0.2916    0.3954    0.2954    0.3      0.0006    0.0003    0.0003
θ112        0.7084    0.6046    0.7046    0.7      0.0006    0.0003    0.0003
θ211        0.1903    0.3444    0.1952    0.2      0.0006    0.0003    0.0003
θ212        0.8097    0.6556    0.8048    0.8      0.0006    0.0003    0.0003
θ311        0.2913    0.3955    0.2955    0.3      0.0006    0.0003    0.0003
θ312        0.7087    0.6045    0.7045    0.7      0.0006    0.0003    0.0003
θ411        0.8070    0.6394    0.7997    0.8      0.0020    0.0009    0.0010
θ412        0.1930    0.3606    0.2003    0.2      0.0020    0.0009    0.0010
θ421        0.3974    0.4547    0.4008    0.4      0.0016    0.0008    0.0008
θ422        0.6026    0.5456    0.5992    0.6      0.0016    0.0008    0.0008
θ431        0.9001    0.6850    0.9002    0.9      0.0008    0.0004    0.0004
θ432        0.0999    0.3150    0.0998    0.1      0.0008    0.0004    0.0004
θ441        0.0985    0.3153    0.1004    0.1      0.0005    0.0002    0.0002
θ442        0.9015    0.6847    0.8996    0.9      0.0005    0.0002    0.0002
θ511        0.9000    0.6861    0.8998    0.9      0.0011    0.0005    0.0005
θ512        0.1000    0.3139    0.1002    0.1      0.0011    0.0005    0.0005
θ521        0.7013    0.5936    0.6990    0.7      0.0011    0.0005    0.0005
θ522        0.2987    0.4064    0.3010    0.3      0.0011    0.0005    0.0005
θ531        0.9009    0.6844    0.9000    0.9      0.0009    0.0004    0.0004
θ532        0.0991    0.3156    0.1000    0.1      0.0009    0.0004    0.0004
θ541        0.1007    0.3148    0.1000    0.1      0.0006    0.0003    0.0003
θ542        0.8993    0.6852    0.9000    0.9      0.0006    0.0003    0.0003
θ611        0.9007    0.6852    0.8998    0.9      0.0005    0.0002    0.0002
θ612        0.0993    0.3148    0.1002    0.1      0.0005    0.0002    0.0002
θ621        0.0995    0.3150    0.0998    0.1      0.0005    0.0002    0.0002
θ622        0.9005    0.6850    0.9002    0.9      0.0005    0.0002    0.0002
MSE(c)      0.0000    0.8051    0.0001

(a) Estimates obtained using the true values as prior parameters.
(b) True values used to simulate the data.
(c) Mean squared error, calculated as Σ (θ̂_ijk − θ_ijk)².
9. APPENDIX 4

9.1. Means of simulated conditional probabilities and standard errors (SE) with 20 percent missing data, estimated by the Implicit (IM), EM with uniform Bayesian prior (EMU), and EM with Bayesian prior (EMP) algorithms

Parameter   IM        EMU       EMP(a)    TV(b)    SE(IM)    SE(EMU)   SE(EMP)
θ111        0.2827    0.3921    0.2924    0.3      0.0006    0.0003    0.0003
θ112        0.7173    0.6079    0.7076    0.7      0.0006    0.0003    0.0003
θ211        0.1811    0.3406    0.1906    0.2      0.0006    0.0003    0.0003
θ212        0.8189    0.6594    0.8094    0.8      0.0006    0.0003    0.0003
θ311        0.2834    0.3912    0.2915    0.3      0.0006    0.0003    0.0003
θ312        0.7166    0.6088    0.7085    0.7      0.0006    0.0003    0.0003
θ411        0.8068    0.6349    0.7986    0.8      0.0021    0.0009    0.0010
θ412        0.1932    0.3651    0.2014    0.2      0.0021    0.0009    0.0010
θ421        0.3968    0.4552    0.4013    0.4      0.0018    0.0008    0.0008
θ422        0.6032    0.5455    0.5987    0.6      0.0018    0.0008    0.0008
θ431        0.9010    0.6806    0.9009    0.9      0.0008    0.0004    0.0004
θ432        0.0990    0.3194    0.0991    0.1      0.0008    0.0004    0.0004
θ441        0.0991    0.3200    0.1002    0.1      0.0005    0.0003    0.0003
θ442        0.9009    0.6800    0.8998    0.9      0.0005    0.0003    0.0003
θ511        0.9035    0.6795    0.9009    0.9      0.0011    0.0005    0.0006
θ512        0.0965    0.3205    0.0991    0.1      0.0011    0.0005    0.0006
θ521        0.7022    0.5903    0.6994    0.7      0.0012    0.0006    0.0006
θ522        0.2978    0.4097    0.3006    0.3      0.0012    0.0006    0.0006
θ531        0.9007    0.6800    0.9000    0.9      0.0009    0.0004    0.0004
θ532        0.0993    0.3200    0.1000    0.1      0.0009    0.0004    0.0004
θ541        0.0987    0.3206    0.1003    0.1      0.0006    0.0003    0.0003
θ542        0.9013    0.6794    0.8997    0.9      0.0006    0.0003    0.0003
θ611        0.8996    0.6799    0.8999    0.9      0.0005    0.0002    0.0002
θ612        0.1004    0.3201    0.1001    0.1      0.0005    0.0002    0.0002
θ621        0.0997    0.3203    0.1001    0.1      0.0005    0.0002    0.0002
θ622        0.9003    0.6797    0.8999    0.9      0.0005    0.0002    0.0002
MSE(c)      0.0020    0.8360    0.0004

(a) Estimates obtained using the true values as prior parameters.
(b) True values used to simulate the data.
(c) Mean squared error, calculated as Σ (θ̂_ijk − θ_ijk)².
ACKNOWLEDGMENTS

This work was supported by the Ministry of Higher Education, Research and Technology, Tunisia.
DISCLOSURE STATEMENT
No competing financial interests exist.
REFERENCES

Aifa, S., Miled, N., Frikha, F., et al. 2006. Electrostatic interactions of peptides flanking the tyrosine kinase domain in the epidermal growth factor receptor provides a model for intracellular dimerization and autophosphorylation. Proteins 62, 1036-1043.
Ben Hassen, H., Masmoudi, A., and Rebai, A. 2008. Causal inference in biomolecular pathways using a Bayesian network approach and an implicit method. J. Theoret. Biol. 4, 717-724.
Bøttcher, S.G., and Dethlefsen, C. 2003. Deal: a package for learning Bayesian networks. J. Stat. Software 8, 18-40.
Chickering, D., Heckerman, D., and Meek, C. 2004. Large-sample learning of Bayesian networks is NP-hard. J. Mach. Learn. Res. 5, 1287-1330.
Choi, H., Shen, R., Chinnaiyan, A.M., et al. 2007. A latent variable approach for meta-analysis of gene expression data from multiple microarray experiments. BMC Bioinform. 8, 364.
Chrisman, L. 1996. A road map to research on Bayesian networks and other decomposable probabilistic models. Technical report. School of Computer Science, CMU, Pittsburgh, PA.
Dempster, A.P., Laird, N.M., and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1-38.
Hassairi, A., Masmoudi, A., and Kokonendji, C. 2005. Implicit distributions and estimation. Comm. Statist. Theory Methods 34, 245-252.
Kholodenko, B.N., Demin, O.V., Moehren, G., et al. 1999. Quantification of short-term signaling by the epidermal growth factor receptor. J. Biol. Chem. 274, 30169-30181.
Krause, P. 1996. Learning probabilistic networks. Technical report. Philips Research Laboratories, UK.
Lauritzen, S.L. 1995. The EM algorithm for graphical association models with missing data. Comput. Statist. Data Anal. 19, 191-201.
Linggi, B., and Carpenter, G. 2006. ErbB receptors: new insights on mechanisms and biology. Trends Cell. Biol. 16, 649-656.
Mamitsuka, H. 2005. Essential latent knowledge for protein-protein interactions: analysis by unsupervised learning approach. IEEE/ACM Trans. Comput. Biol. Bioinform. 2, 119-130.
McLachlan, G.J., and Krishnan, T. 1997. The EM Algorithm and Extensions. Wiley, New York.
Mukhopadhyay, N. 2006. Some comments on Hassairi et al.'s implicit distributions and estimation. Comm. Statist. Theory Methods 35, 293-297.
Nakayama, T., Asai, S., Sato, N., et al. 2006. Genotype and haplotype association study of the STRK1 region on 5q12 among Japanese: a case-control study. Stroke 37, 69-76.
Normanno, N., De Luca, A., Bianco, C., et al. 2006. Epidermal growth factor receptor (EGFR) signaling in cancer. Gene 366, 2-16.
Robert, C.P. 1994. The Bayesian Choice: A Decision-Theoretic Motivation. Springer-Verlag, New York.
Address correspondence to:
Dr. Ahmed Rebai
Unit of Bioinformatics and Biostatistics
Centre of Biotechnology of Sfax
Route Sidi Mansour
Sfax 1177, Tunisia

E-mail: [email protected]