Finding feasible pathways in metabolic networks

Esa Pitkänen (1,3), Ari Rantanen (1), Juho Rousu (2), and Esko Ukkonen (1)

(1) Department of Computer Science, University of Helsinki, Finland
(2) Royal Holloway, University of London, UK
(3) Email: [email protected]. Surface mail: Department of Computer Science, P.O.Box 68, FIN-00014 University of Helsinki, Finland

Abstract. Recent small-world studies of the global structure of metabolic networks have been based on the shortest-path distance. In this paper, we propose new distance measures that are based on the structure of feasible metabolic pathways between metabolites. We argue that these distances capture the complexity of the underlying biochemical processes more accurately than the shortest-path distance. To test our approach in practice, we calculated our distances and shortest-path distances in two microbial organisms, S. cerevisiae and E. coli. The results show that metabolite interconversion is significantly more complex than was suggested in previous small-world studies. We also studied the effect of reaction removals (gene knock-outs) on the connectivity of the S. cerevisiae network and found out that the network is not particularly robust against such mutations.

1 Introduction

Information on both biochemical reactions and the enzymatic function of gene products has been made available in databases such as KEGG [6] and MetaCyc [8]. This knowledge has made it possible to analyze and predict genome-scale properties of metabolism in various organisms. Two main approaches have been proposed for the global analysis of metabolism: a graph-theoretical one [5, 9, 1] that focuses on the topology of the metabolic network, and an approach that studies the capabilities of a metabolic network in steady-state conditions using stoichiometric equations [12, 11].

In the graph-theoretical approach, the focus is on identifying node ranks, path lengths and clustering properties of the metabolic network. Considering metabolic networks simply as graphs consisting of nodes and edges, and using shortest-path length as a distance function, it was suggested that metabolic networks in general possess the scale-free property: the metabolite rank distribution $P(k)$ follows a power law $P(k) \approx k^{-\gamma}$, with $\gamma \approx 2.2$ for many organisms [5]. Consequently, the network contains a small number of hub nodes that connect otherwise distant parts of the network. Because the average path length in a scale-free network is relatively short and the network exhibits a high degree of co-clustering, such a network is an example of a so-called small world (see, e.g., [7]). Furthermore, it was observed that random deletions of metabolites from metabolic networks have little effect on the average path length between metabolites [5]. This served as a basis for claims that metabolic networks are robust against mutations.

It was quickly noticed, however, that in metabolic networks typical hub metabolites include energy and redox cofactors (e.g., ATP and NAD) that are involved in several reactions that may be otherwise unrelated. Thus, the shortest paths between metabolites were often routed through these cofactors, which can be considered misleading. This observation led to studies that corrected the problem by manually removing the cofactors from the analysis [9], and to the computational approach by Arita [1] that circumvented the cofactor problem by looking at the atom-level behaviour of metabolic pathways; he suggested that in a valid metabolic pathway at least one (carbon) atom should be transferred from the source to the target. Analysis of these atom-level pathways in [1] yielded significantly longer average path lengths for E. coli than the analysis of [5] or [9]. However, both analyses consider pathways that transfer only one metabolite or atom from source to target, and disregard the other metabolites or atoms involved in the pathway.

In this paper we use stricter criteria for a valid metabolic pathway. Namely, we require that on a valid pathway all atoms of the target metabolite are reachable from the source. We argue that this definition is biologically more realistic than the previous definitions described above, as the production of a metabolite via a pathway requires its atoms to be transferred from source to target. Based on this definition, we introduce two novel distance measures between metabolites. We study the structural properties of two high-quality metabolic networks, the networks of S. cerevisiae and E. coli. We also compare our distance functions against the shortest-path distance function. It turns out that in these two metabolic worlds the two approaches give clearly differing results.
The structure of the paper is as follows. In Section 2, we formalize metabolic networks, give our distance functions and show how they relate to the shortest-path distance. In Section 3, we study the computational complexity of these distances and give algorithms for evaluating them. In Section 4, we present the results of evaluating these as well as the shortest-path distances for all metabolite pairs in the metabolic networks of two microbial organisms, and study the effect of reaction deletions on the distances between metabolites. Section 5 concludes the article with a discussion.

2 Metabolic networks, and-or graphs, and metabolic distances

A metabolic reaction is a pair $(I, P)$ where $I = (I_1, \ldots, I_m)$ are the $m$ input metabolites and $P = (P_1, \ldots, P_n)$ are the $n$ product metabolites of the reaction. Each member of $I$ and $P$ belongs to the set $M$ of the metabolites of the metabolic system under consideration. Note that by this definition a metabolic reaction is directed and that we omit the stoichiometric coefficients, which are not relevant for our current study. Bidirectional reactions are modeled by pairs of unidirectional reactions $(I, P)$ and $(P, I)$. Also note that when applying our theory, we want to follow how the atoms are transmitted by the reactions and will therefore omit cofactor metabolites from $M$, $I$, and $P$.

A metabolic network is given by listing the metabolic reactions that form the network. Let $R = (R_1, \ldots, R_k)$ be a set of $k$ reactions where each $R_i = (I_i, P_i)$ for some subsets $I_i$ and $P_i$ of $M$. The corresponding metabolic graph, which we also call a metabolic network, has nodes $M \cup R$ and arcs as follows: there is a directed arc from $M_j \in M$ to $R_i \in R$ iff $M_j \in I_i$, and a directed arc from $R_i \in R$ to $M_j \in M$ iff $M_j \in P_i$. We call the nodes of the network that are in $M$ the metabolite nodes and the nodes in $R$ the reaction nodes. Figure 1 gives an example graph in which the reaction nodes are shown as bullets and metabolite nodes contain abbreviated metabolite names.

A metabolic pathway in a metabolic network is a concept that is used somewhat loosely in biochemistry. It seems clear, however, that it is not sufficient to consider only simple paths in a metabolic graph. The metabolic interpretation of the network has to be taken into account: a reaction can operate only if all its input substrates are present in the system. Correspondingly, a metabolite can become present in a system only if it is produced by at least one reaction. We consider some (source) metabolites to be always present in a system, and denote these metabolites by $A$. Therefore, our metabolic network is in fact an and-or graph [10] with reactions as and-nodes and metabolites as or-nodes. A similar interpretation of a metabolic network has been used in a previous study [3].

To properly take this interpretation into account, we need to define distance measures for metabolite pairs that relate to the complexity of the and-or graphs connecting the pair. Let us start with reachability from source metabolites $A$:

– A reaction $R_i = (I_i, P_i)$ is reachable from $A$ in $R$, if each metabolite in $I_i$ is reachable from $A$ in $R$.
– A metabolite $C$ is reachable from $A$ in $R$, if $C \in A$ or some reaction $R_j = (I_j, P_j)$ such that $C \in P_j$ is reachable from $A$ in $R$.

We will define metabolic pathways from $A$ as certain minimal sets of reactions that are reachable from $A$ and produce the target metabolite. To this end, for any $F \subseteq R$, we let $Inputs(F)$ denote the set of the input metabolites and $Products(F)$ denote the set of the output metabolites of $F$. Moreover, we denote by $W(A, F)$ the subset of $R$ that is reachable from $A$ in $F$. Hence $W(A, F)$ consists of the reactions in $R$ that can be reached from $A$ without going outside $F$.

A feasible metabolism from $A$ is a set $F \subseteq R$ which satisfies (i) $F = W(A, F)$, that is, the entire $F$ is reachable from $A$ without going outside $F$ itself. Specifically, a feasible metabolism from $A$ to $t$ is a set $F$ for which it additionally holds that (ii) $t \in Products(F)$. We then define a metabolic pathway from $A$ to $t$ to be any minimal feasible metabolism $F$ from $A$ to $t$, that is, one where removing any reaction from $F$ leads to a violation of requirement (i) or (ii). Thus, a metabolic pathway is a minimal subnetwork capable of performing the conversion from $A$ to $t$.

Now, different distance measures can be defined. We define the metabolic distance from $A$ to $t$ to be the size of the smallest metabolic pathway from $A$ to $t$. This distance captures the idea that the distance equals the minimum total number of reactions needed to produce $t$ from $A$. The production distance from $A$ to $t$ is the smallest diameter taken over all metabolic pathways from $A$ to $t$, where the diameter of a metabolic pathway is taken as the length of the longest simple path in the pathway. Hence, production distance is the minimum number of sequential (successive) reactions needed to convert $A$ to $t$.

In the following we restrict ourselves to a single source metabolite, that is $|A| = 1$, to be better able to compare with shortest-path analysis. We denote by $d_p(A, t)$ the production distance and by $d_m(A, t)$ the metabolic distance from $A$ to $t$. Moreover, $d_s(A, t)$ denotes the standard shortest-path distance. It should be immediately clear that these distances satisfy:

Theorem 1. $d_s(A, t) \le d_p(A, t) \le d_m(A, t)$.

Indeed, every metabolic pathway contains a simple path from $A$ to $t$, which is at least as long as the shortest path, and a simple path cannot use more reactions than the pathway contains.

Figure 1 shows a feasible metabolism producing alanine from pyruvate. The reader can easily verify that this metabolism is in fact a metabolic pathway according to our technical definition: removal of any reaction would destroy the integrity of the network. Note that pyruvate is a sufficient precursor to produce all intermediates in this pathway, and no additional input substrates are needed.

[Figure 1: network diagram with metabolite nodes PYR, OA, OAm, CITm, GLU, ICITm, ICIT, OSUC, ALA, AKG and CIT.]

Fig. 1. A feasible metabolism (a pathway, in fact) from pyruvate (PYR) to alanine (ALA). In this network, pyruvate and glutamate (GLU) are combined to produce alanine. Here, $d_s(PYR, ALA) = 1$, $d_p(PYR, ALA) = 9$, and $d_m(PYR, ALA) = 10$.

Let us conclude this section by relating the metabolic distance to the shortest-path distance. The two distances can be seen as two extremes in a continuum in the following sense. We denote by $S$ the set of auxiliary metabolites that are available as reaction substrates without explicitly producing them from $A$. In metabolic distance, the set $S$ of auxiliary metabolites is empty. Therefore, all reaction substrates required for the conversion to $t$ need to be produced from $A$. Let us now consider gradually extending the set of auxiliary metabolites to include all metabolite subsets of size $1, 2, 3, \ldots, |M|$.
Let $S_1 \subset S_2 \subset \cdots \subset M$ be any such sequence, and denote by $d_{m,S}(A, t)$ the size of the minimum feasible metabolism from $A$ to $t$ with $S$ being the set of auxiliary metabolites. It is easy to see that the distances satisfy

$$d_{m,M}(A, t) \le \cdots \le d_{m,S_1}(A, t) \le d_{m,\emptyset}(A, t) = d_m(A, t),$$

as adding more and more metabolites to the set of auxiliary metabolites can only decrease the size of the required subgraph. Moreover, from some $1 \le \ell \le |M|$ onwards the distances equal the shortest-path distance,

$$d_{m,S_\ell}(A, t) = d_{m,S_{\ell+1}}(A, t) = \cdots = d_{m,M}(A, t) = d_s(A, t),$$

as the length of the shortest path is a lower bound on the size of the feasible subgraph, and the shortest path becomes a feasible metabolism when all intermediate metabolites along the path are reachable.
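The definitions of this section can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation: the function names and the five-reaction network are hypothetical, and the network was chosen so that five reactions are needed when no auxiliary metabolite is available and one suffices when C is auxiliary. Auxiliary metabolites are modelled simply by adding them to the source set.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# A reaction is an (inputs, products) pair; stoichiometry is omitted as in the text.
Reaction = Tuple[FrozenSet[str], FrozenSet[str]]


def reachable_reactions(reactions: List[Reaction], sources: Iterable[str]) -> Set[int]:
    """Indices of the reactions reachable from `sources` within `reactions`, i.e. W(A, F)."""
    available = set(sources)
    fired: Set[int] = set()
    changed = True
    while changed:
        changed = False
        for idx, (inputs, products) in enumerate(reactions):
            # And-node: a reaction fires only when all of its inputs are available.
            if idx not in fired and inputs <= available:
                fired.add(idx)
                available |= products  # or-node: one producing reaction suffices
                changed = True
    return fired


def is_feasible_metabolism(reactions, sources, target) -> bool:
    """Check conditions (i) F = W(A, F) and (ii) t in Products(F)."""
    products = set().union(*(p for _, p in reactions))
    all_fire = len(reachable_reactions(reactions, sources)) == len(reactions)
    return all_fire and target in products


def is_metabolic_pathway(reactions, sources, target) -> bool:
    """A feasible metabolism that stops being one when any single reaction is removed."""
    if not is_feasible_metabolism(reactions, sources, target):
        return False
    return not any(
        is_feasible_metabolism(reactions[:i] + reactions[i + 1:], sources, target)
        for i in range(len(reactions))
    )


def rxn(inputs: str, products: str) -> Reaction:
    """Hypothetical helper: build a reaction from space-separated metabolite names."""
    return frozenset(inputs.split()), frozenset(products.split())


# Hypothetical network: A -> D -> E -> F -> C, and A + C -> B.
chain = [rxn("A", "D"), rxn("D", "E"), rxn("E", "F"), rxn("F", "C"), rxn("A C", "B")]

# Without auxiliary metabolites all five reactions form a pathway from A to B.
print(is_metabolic_pathway(chain, {"A"}, "B"))
# With C auxiliary, the single reaction A + C -> B is already a pathway.
print(is_metabolic_pathway(chain[4:], {"A", "C"}, "B"))
```

Checking minimality by single-reaction removal follows the definition given above; it does not search for the smallest pathway, which is the hard problem discussed in Section 3.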

[Figure 2: network diagram with metabolite nodes A, B, C, D, E and F.]

Fig. 2. In this network, metabolic distance $d_{m,S}(A, t) = 5$ when the set of auxiliary metabolites $S$ is empty, and $d_{m,S}(A, B) = 1$ when $C \in S$.

3 Algorithms and complexity

In this section, we discuss the computation of the metabolic and production distances. We then give an algorithm to quickly find a feasible, but possibly non-minimal, metabolism. Unfortunately, the exact metabolic distance cannot be computed efficiently unless P = NP.

Definition 1. (MINIMAL-FEASIBLE-PATHWAY). Given a set of reactions $R$, a set of source metabolites $A$, a target metabolite $t$ and an integer $k$, is the metabolic distance $d_m(A, t)$ at most $k$?

The intractability of this problem is proven via a reduction from the propositional STRIPS planning problem PLANMIN, which concerns the existence of a plan from an initial state to a goal state consisting of at most $k$ operations [2]. We omit the proof due to lack of space.

Theorem 2. MINIMAL-FEASIBLE-PATHWAY is NP-complete.

The special case with $|A| = 1$ is NP-complete as well. Next, we concentrate on calculating lower and upper bounds for the metabolic distance.

Production distance $d_p(A, M_i)$ can be computed efficiently with Algorithm 1. The search starts from the source metabolites $A$ and proceeds in breadth-first order, visiting a reaction node only after all its input metabolite nodes have been visited, and a metabolite node after any of its producing reaction nodes has been visited. The production distance to metabolite nodes is stored in table $d$ and to reaction nodes in table $w$. The running time is linear in the size of the network because each metabolite node is put in the queue $Q$ at most once.

Taking advantage of production distances, we can quickly find some feasible metabolism from $A$ to $t$ with Algorithm 2. The size of this metabolism gives an upper bound $\hat{d}_m$ for the metabolic distance $d_m$. The algorithm maintains a list of unsatisfied metabolites. Initially the list contains only the metabolite $t$. The idea of the algorithm is to satisfy one unsatisfied metabolite $M_i$ at a time by adding to the network a reaction that produces $M_i$. If the metabolite $M_i$ has multiple producers, a reaction with the smallest production distance is chosen, breaking ties arbitrarily. The running time is again linear.

Algorithm 1 Calculate production distances from A to all other metabolites
Input: A set of reactions R, a set of input metabolites A
Output: Pair (d, w), where d[i] = d_p(A, M_i) and w[i] = max{d_p(A, M_j) | M_j ∈ Inputs(R_i)}
Procedure CalculateProductionDistances(R, A):
    for all M_i ∈ M do
        if M_i ∈ A then
            d[i] ← 0
        else
            d[i] ← ∞
    for all R_i ∈ R do
        B[i] ← |Inputs(R_i)|    % unsatisfied inputs
        w[i] ← ∞
    Q : queue
    Q ← Q ∪ A
    while Q ≠ ∅ do
        M_i ← remove_first(Q)
        for all R_j ∈ Consumers(M_i) do
            B[j] ← B[j] − 1
            if B[j] = 0 then
                w[j] ← d[i]
                for all M_k ∈ Products(R_j) do
                    if d[k] = ∞ then
                        d[k] ← w[j] + 1
                        append(Q, M_k)
    return (d, w)
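Algorithm 1 translates almost line by line into a short program. The following Python sketch is one possible rendering under stated assumptions: reactions are given as (inputs, products) pairs of metabolite names, tables d and w are a dictionary and a list, and the function name and example network are hypothetical.

```python
from collections import deque
from math import inf


def production_distances(reactions, sources):
    """Return (d, w): production distance per metabolite and per reaction index."""
    sources = set(sources)
    metabolites = set(sources)
    for inputs, products in reactions:
        metabolites |= set(inputs) | set(products)

    d = {m: (0 if m in sources else inf) for m in metabolites}
    w = [inf] * len(reactions)
    # B in the pseudocode: number of inputs of each reaction not yet visited.
    unsatisfied = [len(set(inputs)) for inputs, _ in reactions]

    # Consumers(M_i): reactions that take the metabolite as an input.
    consumers = {m: [] for m in metabolites}
    for j, (inputs, _) in enumerate(reactions):
        for m in set(inputs):
            consumers[m].append(j)

    queue = deque(sorted(sources))
    while queue:
        m = queue.popleft()
        for j in consumers[m]:
            unsatisfied[j] -= 1
            if unsatisfied[j] == 0:
                # Breadth-first order: the last input to arrive has the largest distance.
                w[j] = d[m]
                for p in reactions[j][1]:
                    if d[p] == inf:
                        d[p] = w[j] + 1
                        queue.append(p)
    return d, w


# Hypothetical network: A -> D -> E -> F -> C, and A + C -> B.
example = [({"A"}, {"D"}), ({"D"}, {"E"}), ({"E"}, {"F"}), ({"F"}, {"C"}), ({"A", "C"}, {"B"})]
d, w = production_distances(example, {"A"})
print(d["B"], w)
```

Metabolites that cannot be produced from the sources keep the distance infinity, which is how the pseudocode signals that no feasible metabolism exists.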

Algorithm 2 Find a feasible metabolism from A to t
Input: A set of reactions R, a set of input metabolites A, a target metabolite t
Output: Feasible metabolism G ⊆ R, or infeasible if no feasible metabolism exists
Procedure FindFeasibleMetabolism(R, A, t):
    (d, w) ← CalculateProductionDistances(R, A)
    if d[t] = ∞ then
        return infeasible
    V ← {t}    {set of visited metabolites}
    Q : queue    {unsatisfied metabolites}
    append(Q, t)
    G ← ∅
    while Q ≠ ∅ do
        M_i ← remove_first(Q)
        R_j ← argmin_{R_j} {w[j] | M_i ∈ Products(R_j)}
        G ← G ∪ {R_j}
        for all M_k ∈ Inputs(R_j) do
            if M_k ∉ V then
                append(Q, M_k)
                V ← V ∪ {M_k}
    return G
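Algorithm 2 can be sketched in Python in the same way. The block below is an illustration under assumptions rather than a reference implementation: it repeats a compact version of the production-distance computation so that it is self-contained, it uses hypothetical function names and a hypothetical six-reaction network, and it treats the source metabolites as already satisfied so that no producer is sought for them (the pseudocode does not spell this out).

```python
from collections import deque
from math import inf


def production_distances(reactions, sources):
    """Compact version of Algorithm 1; returns (d, w)."""
    d = {m: inf for inputs, products in reactions for m in set(inputs) | set(products)}
    d.update({m: 0 for m in sources})
    w = [inf] * len(reactions)
    unsatisfied = [len(set(inputs)) for inputs, _ in reactions]
    queue = deque(sorted(sources))
    while queue:
        m = queue.popleft()
        for j, (inputs, products) in enumerate(reactions):
            if m in inputs:
                unsatisfied[j] -= 1
                if unsatisfied[j] == 0:
                    w[j] = d[m]
                    for p in products:
                        if d[p] == inf:
                            d[p] = w[j] + 1
                            queue.append(p)
    return d, w


def find_feasible_metabolism(reactions, sources, target):
    """Return a set of reaction indices forming a feasible metabolism, or None."""
    sources = set(sources)
    d, w = production_distances(reactions, sources)
    if d.get(target, inf) == inf:
        return None  # "infeasible" in the pseudocode

    producers = {}
    for j, (_, products) in enumerate(reactions):
        for p in products:
            producers.setdefault(p, []).append(j)

    # Assumption: sources count as visited, so they are never queued as unsatisfied.
    visited = sources | {target}
    queue = deque([] if target in sources else [target])
    chosen = set()
    while queue:
        m = queue.popleft()
        # Pick a producer with the smallest production distance (ties: first found).
        j = min(producers[m], key=lambda r: w[r])
        chosen.add(j)
        for k in reactions[j][0]:
            if k not in visited:
                visited.add(k)
                queue.append(k)
    return chosen


# Hypothetical network: A -> D -> E -> F -> C, A + C -> B, plus a shortcut D -> C.
example = [({"A"}, {"D"}), ({"D"}, {"E"}), ({"E"}, {"F"}), ({"F"}, {"C"}),
           ({"A", "C"}, {"B"}), ({"D"}, {"C"})]
metabolism = find_feasible_metabolism(example, {"A"}, "B")
print(sorted(metabolism), len(metabolism))
```

The size of the returned set plays the role of the upper bound on the metabolic distance; the greedy choice of producers does not guarantee that it is minimal.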

4 Experiments

To test our definition of a metabolic pathway, we studied the genome-scale metabolic networks of two microbial organisms, namely Saccharomyces cerevisiae (yeast) [4] and Escherichia coli (see footnote 4). We calculated simple path lengths and production distances (Algorithm 1) in both metabolic networks. In addition, we used Algorithm 2 to calculate a feasible metabolism for every metabolite pair for which such a metabolism existed. While this metabolism is not necessarily minimal, its size gives us an upper bound for the metabolic distance. To concentrate on primary metabolism and to be able to compare with previous results, we deleted 89 cofactors, such as energy and redox metabolites, from the models. We also removed metabolites designated as externals and the reactions either consuming or producing them.
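The difference between a connecting simple path and a feasible metabolism can be seen on a very small example. The sketch below is illustrative only: the function names are hypothetical, and the one-reaction network (pyruvate and glutamate combining into alanine and AKG) is a toy fragment assumed for the example, not one of the genome-scale models.

```python
from collections import deque


def shortest_path_length(reactions, source, target):
    """Number of reactions on a shortest simple path, or None if there is none."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        m = queue.popleft()
        if m == target:
            return dist[m]
        for inputs, products in reactions:
            # A simple path may step through a reaction using only one of its inputs.
            if m in inputs:
                for p in products:
                    if p not in dist:
                        dist[p] = dist[m] + 1
                        queue.append(p)
    return None


def producible(reactions, source, target):
    """True if `target` is reachable from `source` in the and-or sense."""
    available = {source}
    changed = True
    while changed:
        changed = False
        for inputs, products in reactions:
            # A reaction contributes only when all of its inputs are available.
            if set(inputs) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return target in available


# Toy fragment: PYR + GLU -> ALA + AKG.
toy = [({"PYR", "GLU"}, {"ALA", "AKG"})]
print(shortest_path_length(toy, "PYR", "ALA"))  # a one-step simple path exists
print(producible(toy, "PYR", "ALA"))            # but GLU cannot be made from PYR here
```

In this fragment the pair (PYR, ALA) is connected by a simple path of length one, yet no feasible metabolism exists because the second substrate is unavailable; pairs of this kind are counted as connected by shortest-path analysis only.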

[Figure 3: plot of distance (vertical axis) against metabolite pair (horizontal axis) for three series: shortest-path length $d_s$, upper bound for metabolic distance $d_m$, and production distance $d_p$.]

Fig. 3. Production distances, upper bounds for metabolic distance given by Algorithm 2 and shortest-path lengths in the metabolic network of S. cerevisiae between all metabolite pairs sorted in ascending production distance order. Only every tenth metabolite pair is included to reduce clutter.

Metabolic distances in two metabolic networks

In S. cerevisiae, we found that production distance was defined for 23.2% of the 154803 metabolite pairs (see footnote 5) for which there was a connecting simple path (33.6% for E. coli). Table 1 summarizes the results. Average production distance in both networks is significantly higher than average simple path length, implying that metabolic distance is higher as well. Results for simple paths only include paths between metabolite pairs for which a feasible metabolism exists. Figure 3 shows results for all pairs for which a metabolic pathway exists, in ascending production distance order. We observe that the size of a smallest metabolic pathway is largely independent of the corresponding shortest-path length.

Footnote 4: Metabolic network models of S. cerevisiae (iFF708, 1175 reactions) and E. coli (iJE660a, 739 reactions) were obtained from http://systemsbiology.ucsd.edu/organisms/
Footnote 5: The total number of pairs is 352242 (cofactors excluded).

Table 1. Means and standard deviations (in parentheses) of shortest-path lengths ($d_s$), production distances ($d_p$), and upper bounds for metabolic distance ($\hat{d}_m$) given by Algorithm 2 for S. cerevisiae and E. coli.

Organism         d_s            d_p             \hat{d}_m
E. coli          5.78 (2.30)    14.55 (12.5)    19.06 (6.40)
S. cerevisiae    6.11 (2.40)    16.72 (11.3)    20.34 (7.74)

[Figure 4: histogram of frequency (vertical axis) against distance (horizontal axis) for three series: shortest-path length $d_s$, upper bound for metabolic distance $d_m$, and production distance $d_p$.]

Fig. 4. Histograms of production distances, upper bounds for metabolic distance given by Algorithm 2 and shortest-path lengths for S. cerevisiae.

This result demonstrates the shortfall of approaches using simple paths that do not transfer all atoms from source to target: far from all simple paths can be interpreted as biologically plausible pathways. Analysis based on simple paths does not take into account the inherent nature of metabolic networks as a system of chemical reactions. In order for a chemical reaction to proceed, all of its substrates must be present in the system. Therefore, since two thirds of metabolite pairs with a connecting simple path in yeast do not have a connecting metabolic pathway, we claim that previous small-world analyses may produce misleading results.

Furthermore, even if two metabolites can be connected with both a simple path and a metabolic pathway, the metabolic pathway is much more complex than the simple path. This can be seen in Figures 4 and 5, which show the histograms of the three distances for S. cerevisiae and E. coli, respectively. We can observe that the upper bound obtained with Algorithm 2 and the production distance together give us, on average, a good estimate of the metabolic distance. The distribution of simple path lengths follows the normal distribution as expected. However, the distribution of production distances does not have a similar shape: it gradually increases up to a distance of 20 in S. cerevisiae and then descends. In E. coli, production distances are shorter on average but still significantly longer than the small-world hypothesis would suggest. In particular, we get an average production distance of 14.6 for E. coli, which clearly exceeds the average atom-pathway distance of 8.4 reported in a previous study [1].

[Figure 5: histogram of frequency (vertical axis) against distance (horizontal axis) for three series: shortest-path length $d_s$, upper bound for metabolic distance $d_m$, and production distance $d_p$.]

Fig. 5. Histograms of production distances, upper bounds for metabolic distance given by Algorithm 2 and shortest-path lengths for E. coli.

Effect of reaction deletions on distances

In addition to calculating pairwise distances, we observed the effect of random reaction deletions from the yeast network on our distances. Reaction deletions simulate knocking out genes with enzymatic end-products from the genome. Figure 6 shows the effect of deletions on the number of feasible metabolisms for S. cerevisiae. The ratio of metabolite pairs connected with a simple path in both the original network and the deletion variant decreased as the number of reaction deletions increased. However, at the same time, the ratio of connections via feasible metabolisms dropped more rapidly. This indicates that the metabolic network of S. cerevisiae is in fact more vulnerable to gene knockouts than would be evident just by considering the simple paths.
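A deletion experiment of this kind can be sketched as follows. This is a hedged illustration, not the authors' experimental code: the function names, the six-reaction toy network and the fixed random seed are assumptions, and the ratio is computed as the fraction of originally connected metabolite pairs that remain connected after the deletions.

```python
import random
from collections import deque


def path_reachable(reactions, source):
    """Metabolites connected to `source` by a simple path."""
    seen = {source}
    queue = deque([source])
    while queue:
        m = queue.popleft()
        for inputs, products in reactions:
            if m in inputs:
                for p in products - seen:
                    seen.add(p)
                    queue.append(p)
    return seen


def producible(reactions, source):
    """Metabolites reachable from `source` in the and-or sense."""
    available = {source}
    changed = True
    while changed:
        changed = False
        for inputs, products in reactions:
            if inputs <= available and not products <= available:
                available |= products
                changed = True
    return available


def connected_pairs(reactions, metabolites, reach):
    return {(a, t) for a in metabolites for t in reach(reactions, a) if t != a}


def conserved_ratio(reactions, n_deleted, reach, rng):
    """Fraction of originally connected pairs still connected after random deletions."""
    metabolites = sorted({m for inputs, products in reactions for m in inputs | products})
    before = connected_pairs(reactions, metabolites, reach)
    survivors = rng.sample(reactions, len(reactions) - n_deleted)
    after = connected_pairs(survivors, metabolites, reach)
    return len(before & after) / len(before) if before else 1.0


def rxn(inputs, products):
    return frozenset(inputs.split()), frozenset(products.split())


# Hypothetical toy network; the real experiment used the yeast model.
network = [rxn("A", "D"), rxn("D", "E"), rxn("E", "F"), rxn("F", "C"),
           rxn("A C", "B"), rxn("D", "C")]

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
for n in range(len(network) + 1):
    simple = sum(conserved_ratio(network, n, path_reachable, rng) for _ in range(100)) / 100
    feasible = sum(conserved_ratio(network, n, producible, rng) for _ in range(100)) / 100
    print(n, round(simple, 2), round(feasible, 2))
```

Averaging over repeated random deletions mirrors the averaging over repetitions reported for Figure 6; on a network this small the two curves are only indicative.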

5 Discussion and conclusions

In this paper we have proposed distance measures for metabolite distances in metabolic networks. We argue that these distances have a more natural biochemical interpretation than the simple shortest-path distance used in previous research. The shortest-path distance does not always have a direct correspondence to the inherent difficulty of producing one metabolite from another. The distances as we defined them take into account the fact that, in order to produce the target metabolite, all atoms of the target should be reachable from the source metabolites. Our metabolic distance, in addition, measures the genomic capacity (in terms of the number of enzymes involved) that is required for the conversion.

[Figure 6: plot of the ratio of conserved pathways (vertical axis, 0 to 1) against the number of deleted reactions (horizontal axis, 0 to 1000) for two series: simple paths and feasible metabolisms.]

Fig. 6. Robustness of yeast against reaction deletions. The red (upper) curve gives the relative amount of connections via simple paths that are preserved after n = 0 . . . 1000 deletions. The green curve gives the same quantity for connections via feasible metabolisms and indicates quite weak robustness. Averages over 100 repetitions.

In addition, we showed that there is a unified way to interpret the metabolic and shortest-path distances: metabolic distance is the size of the minimal feasible metabolism from $A$ to $t$ when no metabolite other than $A$ is available initially, whereas shortest-path distance is the size of the minimal feasible metabolism from $A$ to $t$ when all metabolites required by the reactions along the path are available at the outset. An interesting direction for further research is to study the continuum between the two extremes by allowing some subsets of metabolites to be available besides the source metabolites $A$, either by allowing some biologically interesting nutrients or by conducting a more systematic study that looks for possible phase transitions in the distances as a function of the number of allowed metabolites.

In our experiments we discovered that the average metabolic distance between pairs of metabolites in the metabolic network of S. cerevisiae is considerably longer than the corresponding shortest-path distance. In E. coli we observed the average metabolic distance to be longer than the average atom-path distance reported in a previous study. This is because atom-path distance relates to transferring a single atom between metabolites, while our distance measures the complexity of the total conversion. Also, the distribution of the distances takes a different shape: the normal-like distribution of the shortest-path lengths is not reproduced when our more realistic measures are used.

In the second experiment, we studied the effect of random deletions of enzymes on metabolite distances. Our simulations show that the metabolic network of S. cerevisiae may not be as robust to mutations as stated previously.

Another future direction is to make a more comprehensive study of the effect of reaction deletions on important biological pathways, such as amino acid production and DNA synthesis. In addition, we plan to apply our analysis to other organisms that have publicly available metabolic network models.

Acknowledgements. This work has been supported by the SYSBIO programme of the Academy of Finland. In addition, the work by Juho Rousu has been supported by Marie Curie Individual Fellowship grant HPMF-CT-2002-02110.

References

1. M. Arita. The metabolic world of Escherichia coli is not small. PNAS, 101(6):1543–1547, 2004.
2. T. Bylander. The computational complexity of propositional STRIPS planning. Artificial Intelligence, 69(1-2):165–204, 1994.
3. O. Ebenhöh, T. Handorf, and R. Heinrich. Structural analysis of expanding metabolic networks. Genome Informatics, 15(1):35–45, 2004.
4. J. Förster, I. Famili, P. Fu, B. Palsson, and J. Nielsen. Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Research, 13:244–253, 2003.
5. H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási. The large-scale organization of metabolic networks. Nature, 407:651–654, 2000.
6. M. Kanehisa and S. Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28:27–30, 2000.
7. J. Kleinberg. The small-world phenomenon: An algorithmic perspective. In Proc. 32nd ACM Symposium on Theory of Computing, 2000.
8. C. J. Krieger, P. Zhang, L. A. Mueller, A. Wang, S. Paley, M. Arnaud, J. Pick, S. Y. Rhee, and P. D. Karp. MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Research, 32(1):D438–42, 2004.
9. H.-W. Ma and A.-P. Zeng. Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics, 19(2):270–277, 2003.
10. S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, 2003.
11. C. H. Schilling, D. Letscher, and B. Palsson. Theory for the systemic definition of metabolic pathways and their use in interpreting metabolic function from a pathway-oriented perspective. Journal of Theoretical Biology, 203:228–248, 2003.
12. S. Schuster, D. A. Fell, and T. Dandekar. A general definition of metabolic pathways useful for systematic organization and analysis of complex metabolic network. Nature Biotechnology, 18:326–332, March 2000.