Evolving protein interaction networks through gene duplication
Keywords
1. Introduction
Since the discovery of the structure of the DNA molecule, a dominant view of molecular biology has been the understanding of the microscopic mechanisms operating at the gene level. Some authors have indeed defined molecular cell biology as an explanation of organisms and cells in terms of their individual molecules (Lodish et al., 2000). This so-called reductionist view has been extremely successful and has widely enlarged our view of genetics and evolution at the smallest scales. In approaching the richness of biocomplexity in this way, we might, however, ignore the other side of the coin: the presence of higher-order phenomena beyond the molecular level. This other view takes into account the interactions among components as an essential part of the whole picture and suggests that there exist emergent properties, not reducible to the properties displayed by the individual components (Goodwin, 1994). The debate between both schools goes back to the early origins of molecular biology (Monod, 1970).
Our perspective of molecular biology might be changing and the emerging picture might help to reach a more balanced interaction between both views. Two important findings help to see how collective properties might play a leading role. The first is the observation of the extraordinary resilience exhibited by some simple organisms against gene removal. Experiments with systematic mutagenesis in yeast Saccharomyces cerevisiae have shown a great tolerance of this organism to gene removal (Ross-Macdonald et al., 1999; Wagner, 2000). These and other studies carried out in order to explore the minimum limits allowed to genome size suggest that many genes might not play a key phenotypic role, being somehow functionally replaced by other genes. Secondly, the network perspective of gene and protein systems is becoming more and more accepted as new data accumulate. In particular, it is becoming obvious that not only genes, but also interactions among specific groups of genes (modules) have been conserved through evolution (Hartwell et al., 1999). In this context, subsets of complex gene networks are also the target of selective forces.
Recent large-scale studies of the global properties of the yeast proteome reinforce the relevance of the network perspective (Jeong et al., 2001a; Wagner, 2001; Gavin, 2002; Ho, 2002). These studies have revealed that the available data from protein–protein interaction networks in the yeast S. cerevisiae share some unexpected features with other complex networks (Jeong et al., 2001a; Wagner, 2001; Goh et al., 2002). In particular, these are very heterogeneous networks, whose degree distribution P(k) (i.e. the probability that a protein interacts with any other k proteins) displays a scale-free behavior, P(k)≈k−γ, with a characteristic exponent γ≈2.5, for a certain range of values of k, and with a well-defined cut-off for large k. Additionally, they also display the so-called small-world (SW) effect: they are highly clustered (each vertex has a well-defined neighborhood of “close” vertices) but the minimum distance between any two randomly chosen vertices in the graph is short, a characteristic feature of random graphs (Watts and Strogatz, 1998; Watts, 1999). Scale-free (SF) networks appear to be present in many natural and artificial systems (Bornholdt and Schuster, 2002), ranging from technological networks (Albert et al., 2000; Amaral et al., 2000; Ferrer i Cancho et al., 2001a; Pastor-Satorras et al., 2001), neural networks (Watts and Strogatz, 1998; Stephan et al., 2000), metabolic pathways (Fell and Wagner, 2000; Jeong et al., 2001b; Podani et al., 2001), and food webs (Montoya and Solé, 2002; Williams et al., 2001) to the human language graph (Ferrer i Cancho et al., 2001b). It is remarkable, in particular, that the exponents observed in Internet, metabolic, and protein networks are very similar. This fact hints towards the presence of common principles of organization, a finding which might have deep consequences in our understanding of how large-scale nets emerge through evolution.
Previous studies on protein networks have emphasized dynamical or computational aspects of interacting proteins as well as their potential links with other classes of nets, such as neural nets (Bray, 1995). The importance of allosteric interactions (and their nonlinear character) was early highlighted as an essential piece in the understanding of cell biology and as a step towards a general systems theory of biocomplexity (Monod, 1970). Here, however, we are mainly interested in the topological properties derived from the process of proteome evolution. These properties can be summarized as follows (Jeong et al., 2001a; Wagner, 2001): (1) the proteome map is a sparse graph indicating a small average number of edges per protein. This observation is also consistent with the study of the global organization of the Escherichia coli gene network from available information on transcriptional regulation (Thieffry et al., 1998); (2) it exhibits a small-world pattern, very different from the properties displayed by purely random (Poissonian) graphs (Bollobás, 1985); and (3) the degree distribution is a power law with a well-defined cut-off.
In this paper we present a model of proteome evolution based on a gene duplication plus re-wiring process that includes the basic ingredients of proteome growth and intends to reproduce the previous set of observations. The first component of the model allows the system to grow by means of the copy process of previous units (together with their wiring). The second introduces novelty by means of changes in the wiring pattern, constrained in our approach to the newly created genes (Solé et al., 2002b). This constraint is required if we assume that conservation of gene (protein) interactions is due to functional restrictions and that further changes in the regulation map are limited. Such constraint would be strongly relaxed when involving a newly created (and redundant) unit. The evolution of the globin gene family provides an example of genome evolution by gene duplication. Here the primitive, single-chain globin molecule (found in many insects and primitive fish) is closely related to the hemoglobin molecule present in higher vertebrates. The hemoglobin is composed of a tetramer formed by two copies of two slightly different globin chains. About 500 million years ago, a series of gene duplications and mutations took place, establishing the two different globin genes which allow, through the interaction of their protein products, a more efficient transport of oxygen.
The model does not include functionality or dynamics in the proteins involved. It is a topological-based approximation to the overall features of the proteome graph which aims to capture some of the (possibly) generic features of real proteome evolution. A preliminary account of the present model was previously given in a short communication (Solé et al., 2002b). It is also worth noting the work by Vázquez et al. (2003), in which a related model of proteome evolution, showing multifractal degree properties, is described and analysed.
This paper is organized as follows. In Section 2 we review the known topological properties of proteome networks, obtained by several authors by analysing the published proteome maps of the yeast S. cerevisiae. In Section 3 we describe our model of proteome growth. The main ingredients of the model are protein duplication plus correlated random re-wiring. 4 Analytical study of the model: mean-field rate equation for the average degree, 5 Analytical study of the model: rate equation for the vertex distribution are devoted to an analytical study of the model. In Section 4 we discuss a mean-field approximation for the evolution of the average degree, that will allow us to restrict the range of values of the model's parameters, while Section 5 presents a study of the rate equation for the vertex distribution nk within an approximation that imposes an uncorrelated re-wiring of connections after each vertex duplication. Our study is completed in Section 6 by means of computer simulations. In Section 7 we present a discussion of our results.
2. Topological properties of real proteome maps
Protein–protein interaction maps have been studied, at different levels, in a variety of organisms including viruses (Bartel et al., 1996; Flajolet et al., 2000; McCraith et al., 2000), prokaryotes (Rain et al., 2001), yeast (Ito et al., 2000), and multicellular organisms such as Caenorhabitis elegans (Walhout et al., 2000). Previous studies have mainly used the so-called two-hybrid assay (Fromont-Racine et al., 1997), based on the properties of site-specific transcriptional activators. Although differences exist between different two-hybrid projects (Hazbun and Fields, 2001), the statistical patterns used in our study seem to be robust. Recent systematic analyses of protein complexes by means of mass spectrometry provide very similar results, together with a better understanding of the internal organization of protein complexes (Kumar and Snyder, 2001).
From a statistical point of view, protein–protein interaction maps can be viewed as a random network (Bollobás, 1985), in which the vertices represent the proteins and an edge between two vertices indicates the presence of an interaction between the respective proteins. Mathematically, the proteome graph is defined by a pair , where is the set of N proteins and Ep={{pi,pj}} is the set of edges/connections between proteins. The adjacency matrix ξij indicates that an interaction exists between proteins or that the interaction is absent (ξij=0). Two connected proteins are thus called adjacent and the degree of a given protein is the number of edges that connect it with other proteins. The underlying key unit here is the so-called protein domain. A domain is defined as a subpart of a protein chain that can fold independently into a stable structure. These domains act as the modules from which complex proteins are built, and different domains are associated to different functions. A given protein can have domains involved in regulatory activity and others in catalytic activity. Domain networks have been recently explored, revealing the presence of scaling at different levels (Wuchty, 2001; Rzhetsky and Gomez, 2001): here two domains are considered to be connected if they are simultaneously present in the same protein. In this context, it is remarkable that scaling laws seem to be universal, from bacteria to metazoa, suggesting common scenarios of proteome domain evolution. In our model (see below) no explicit domains are included, but they are implicitly introduced in terms of edges among proteins (Fig. 1). Fig. 1. An example of three interacting proteins describing a ternary complex. Here interactions involve physical contact between residues of different molecules. (Figure courtesy of Baldo Oliva.)
The network representation of the protein interactions, shown in Fig. 2(a),1 reveals a very complex topology, characterized by the presence of several highly connected hubs, while most of the proteins have very few connections. The network topology can be statistically characterized by means of the degree distribution P(k), defined as the probability that any vertex is connected to exactly k other vertices. The analysis of the protein map from the yeast S. cerevisiae, containing 1870 vertices and 2240 edges, corresponding to an average degree (average number of edges emanating from a vertex) 〈k〉=2.40, shows that the degree distribution can be fitted to a power law with an exponential cut-off, of the form:(1)The estimated values for the yeast are k0≃1, γ≃2.4 and kc≃20 (Jeong et al., 2001a). That is, the protein map is a SF network with a characteristic cut-off. This value is confirmed by the independent analysis of Wagner (2001), who found a power-law behavior with γ≃2.5 for a relatively smaller protein map (985 vertices with average degree 〈k〉=1.83). In Fig. 2(b) we have checked this functional dependence on the cumulated degree distribution of the protein map used in Jeong et al. (2001a) (available at the web site http://www.nd.edu/~networks/database/index.html). A fit to form (1) yields the values , and γ=2.6±0.2, compatible with the results found in Jeong et al. (2001a) and Wagner (2001). Fig. 2. (a) Topology of a real yeast proteome map obtained from the MIPS database (Mewes et al., 1999). (b) Cumulative degree distribution for the yeast proteome map from Jeong et al. (2001a). The proteome map is available at the web site http://www.nd.edu/~networks/database/index.html. The degree distribution has been fitted to the scaling behavior P(k)≈(k0+k)−γe−k/kc, with an exponent γ≃2.6 and a sharp cut-off kc≃15.
An additional observation from Wagner's study of the yeast proteome is the presence of SW properties (Watts and Strogatz, 1998). The SW pattern can be detected from the analysis of two basic statistical quantities: the clustering coefficient C and the average path length . Since the proteome map is a disconnected network, these quantities are defined on the giant component , defined as the largest cluster of connected vertices in the network (Bollobás, 1985). Let us consider the adjacency matrix of the giant component, ξij∞ and indicate by the set of nearest neighbors of a protein . The clustering coefficient for this protein is defined as the number of connections between the proteins pj∈Γi (Watts and Strogatz, 1998). Denoting(2)where N∞ is the size of the giant component, we define the clustering coefficient of the ith protein as(3)where ki is the degree of the ith protein. The clustering coefficient is defined as the average of C(i) over all the proteins,(4)and it provides a measure of the average fraction of pairs of neighbors of a vertex that are also neighbors of each other.
The average path length is defined as follows: Given two proteins , let ℓij be the length of the shortest path connecting these two proteins on the network. The average path length will be(5)
Random graphs, where vertices are randomly connected with a given probability p (Bollobás, 1985), have a clustering coefficient inversely proportional to the network size, Crand≈〈k〉/N, and an average path length proportional to the logarithm of the network size, . At the other extreme, regular lattices with only nearest-neighbor connections among units are typically clustered and exhibit a large average path length. Graphs with SW structure are characterized by a high clustering, C⪢Crand, while possessing an average path comparable with a random graph with the same average degree and number of vertices, .
In Table 1 we summarize the most relevant results for the proteome map of the yeast, as reported in Wagner (2001). In order to compare with other results, we report the values we have calculated for the map used in Jeong et al. (2001a), as well as for a random graph with size and average degree comparable with the real data. These values support the conjecture of the SW properties of the protein network put forward in Wagner (2001).
Table 1. Comparison between the observed regularities in the yeast proteome reported by Wagner (2001), those calculated from the proteome map used in Jeong et al. (2001a), the model predictions with and average degree 〈k〉=2.5 (see Section 6), and a random network with the same size and average degree as the model
Wagner (2001) | Map from Jeong et al. (2001a) | Network model | Random network | |
---|---|---|---|---|
〈k〉 | 1.83 | 2.40 | 2.4±0.6 | 2.50±0.05 |
γ | 2.5 | 2.4 | 2.5±0.1 | — |
C | 2.2×10−2 | 7.1×10−2 | 1.0×10−2 | 1×10−3 |
7.14 | 6.81 | 5.5±0.7 | 8.0±0.2 |
3. Proteome growth model
In this work we will consider the scenario of single-gene duplications. Although multiple-gene duplications should also be taken into account (even whole genome duplication), here we restrict our attention to the most common ones (Ohno, 1970), which are known to occur due to unequal crossover (Patthy, 1999). After duplication of a single, randomly chosen gene, new connections can be added and previous connections deleted. Both re-wiring rules can be implemented in a correlated or uncorrelated manner. The first involves changes that affect the just duplicated gene and its connections. The second involves any edge in the network. Both processes (creation and deletion of edges) might be associated or not to the newly created unit. Four possible combinations are thus allowed in principle: (1) correlated creation and deletion of edges; (2) correlated creation and uncorrelated deletion of edges; (3) uncorrelated creation and correlated deletion of edges; (4) uncorrelated creation and deletion of edges. The rules associated with each variation of the model are summarized in Fig. 3. Fig. 3. Rules of proteome growth in the four possible scenarios. First, (1) duplication occurs after randomly selecting a vertex (small arrow). Then (2) deletion of connections occurs with probability δ. This event can be correlated (C) when the deleted edges are connected to the newly generated vertex or uncorrelated (NC), when all edges are considered for deletion. Finally (3) new connections are generated with probability α, again in a correlated or uncorrelated way.
In this work we will focus in the first variation of the model, in which created and deleted edges occur in relation with the newly duplicated vertex. The reason to consider correlations has to do with the assumption that the evolutionary significance of gene duplication lies in the fact that changes in the newly created genes can lead to the emergence of novelty (Patthy, 1999). After gene duplication, one of the two copies becomes redundant and either one of them becomes non-functional (i.e. a pseudogene) or accumulates molecular changes that provide a new function. The new function might be very different. An example is provided by mouse lysozyme genes. One of them has a digestive function in the intestine and the second has a bactericide action in myeolid tissues. Strong divergences from the original function displayed by the ancestor can develop. Moreover, from a numerical point of view, the analysis of the models in which creation or deletion of edges is uncorrelated yield results which are in disagreement with experimental observations. Additionally, we do not consider differences associated to the number of edges of a given chosen vertex. In this sense, evidence from proteome data indicate that the degree of well-conserved proteins is negatively correlated with their rate of evolution (Fraser et al., 2002).
The model we will consider is defined by the following rules. We start from a set N0 of connected vertices, and each time step we perform the following operations:
- (i)
One vertex of the graph is selected at random and duplicated.
- (ii)
The edges emanating from the newly generated vertex are removed with probability δ. Available data indicate that δ is very large.
- (iii)
New edges (not previously present after the duplication step) are created between the new vertex and all the rest of the vertices with probability α. Available data indicate that this number is very small (much smaller than δ).
The model we have just defined is intended to capture the topological properties of the proteome map. No explicit functionality is included in the description of the proteins and this is certainly a drawback. But by ignoring the specific features of the protein–protein interactions and the underlying regulation dynamics, we can explore the question of how much the network topology is due to the duplication and diversification processes.
4. Analytical study of the model: mean-field rate equation for the average degree
Since we have two free parameters in our model, namely the deletion probability δ and the addition probability α, we should first constrain their possible values by using the available empirical data. One first average property that can be determined is the evolution of the average number of interactions per protein/gene through time, which can be compared with the evidence from real proteomes (Jeong et al., 2001a; Wagner, 2001), as well as recent analysis of large-scale perturbation experiments. This can be done for any model with vertex duplication plus addition/deletion of vertices by considering the discrete dynamics of the number or edges LN at a given step N, where N is the number of vertices in the network (see also Vázquez et al. (2003)). In general, we can write the evolution equation(6)where KN=2LN/N indicates average degree at the Nth duplication event, and φa and φd stand for the general rates of addition/deletion of vertices, respectively. Here different functional forms might be chosen, including rates of change that depend on the degree of the vertex, as suggested by some studies (Fraser et al., 2002). Although duplication rate would in principle depend on the number of edges too, available data indicate that no such association seems to be present (Wagner, 2001).
For the particular case of the model defined in the previous section, the rate equation takes the form:(7)where the last two terms correspond to the addition of edges to a fraction α of the N−KN units not connected to the duplicated vertex, plus the deletion of any of the new KN edges, with probability δ. Using the continuous approximation(8)Eq. (7) can be written(9)whose solution is(10)where Γ=1−2(α+δ) and K1 is the initial degree at N=1. For any value of α and δ this version leads to an increasing degree through time. In this context, the α=δ=0 case would correspond to a pure duplication process, where KN increases linearly with time.
In order to have a final sparse graph with low numbers of edges per protein, we need to consider two possible scenarios. The first would assume fixed α and δ values and a finite N, that we take as the proteome size. Assuming that δ+α>1/2 in order to ensure Γ<0, the asymptotic behavior of KN is dominated by the first, linear term. If the desired degree is indicated as , the required number of vertices will be(11)where ⌈x⌉ indicates the integer part of x.
Another possibility is to assume that the rate of edge creation scales as the inverse of N, i.e. α=β/N, where β>0 is some constant. This would be interpreted in terms of underlying constraints to the number of edges imposed by some network-level property (such as efficient communication at low cost; see Ferrer i Cancho and Solé, 2001). In any case, addition of new edges occurs at a very slow rate and under our assumptions a fixed, very small number of edges will be introduced at each step. Our analysis, as well as other independent studies (Vázquez et al., 2003; Kim et al., 2002), indicate that the key statistical features of the final proteome map (such as the scaling exponent γ) do not depend on the rate of edge addition.
Here the rate of addition of new edges (the establishment of new viable interactions between proteins) is inversely proportional to the network size, and thus much smaller than the deletion rate δ, in agreement with the rates experimentally observed. Using this scaling form, the rate equation for KN reads as(12)The time-dependent solution of this equation is(13)where C is an integration constant and Γ(a,z) is the incomplete Gamma function (Abramowitz and Stegun, 1972). For large values of N (small z) we can use the Taylor expansion of Γ(a,z), given by(14)that yields(15)For δ>1/2, a finite average degree is reached at infinite N,(16)This is thus consistent with the data analysis by Wagner (2001). Eq. (16) can be used to restrict the number of independent parameters of the model, by fixing K∞ to the values experimentally found in real proteome maps. Thus, we can fix the value of β by(17)
5. Analytical study of the model: rate equation for the vertex distribution nk
The rate equation approach to evolving networks (Krapivsky et al., 2000) can be fruitfully applied to the proteome model under consideration. This approach focuses on the time evolution of the number nk(t) of vertices in the network with exactly k edges at time t. Defining our network by the set of numbers nk(t), we have that the total number of vertices N is given by(18)while the total number of edges is given by(19)since the sum over vertex connections double-counts edges.
Time is divided into periods. In each period, t→t+1, one vertex is duplicated at random, so that N→N+1. If, after each duplication, there is a probability δ to delete each edge from the just-duplicated vertex, the probability of increasing the number of vertices at degree k, by direct duplication without edge deletion, is given by(20)In this expression nk/N represents the probability of selecting a vertex of degree k and 1−kδ is the probability of preserving all edges in the just duplicated vertex. It is important to note that in Eq. (20) we are ignoring the possibility of deleting more than one edge in each duplication event, which will contribute with an amount proportional to δ2 or smaller. Obviously, this approximation is correct for small δ. We will see that this fact has important consequences when interpreting the results obtained in this section.
On the other hand, a vertex of degree k can be created from the duplication of a vertex of degree k+1 in which a edge is deleted, contributing with a probability(21)In this expression, the factor (k+1)δ represents the probability of deleting one of the k+1 connections of the duplicated vertex. The probability of degree change, from duplication of a vertex connected to a degree-k vertex, is given by(22)because knk is the total number of vertices connected to all vertices of degree k. In Eq. (22) we have corrected for the probability δ that the crucial connecting edge was deleted.
Finally, in the same period, we proceed to add N−kd edges with probability α=β/N, where kd is the degree of the just duplicated vertex. In the limit N≫kd, we can simply consider the addition of Nα new edges to the graph. When this last step is performed with the correlated rule (i.e. adding edges from the duplicated vertex to the rest of the edges in the graph), it leads to a non-local rate equation for the functions nk. For the sake of simplicity, we will consider now the simpler case of a uncorrelated addition of edges (new edges created between any two vertices in the graph).
The case of uncorrelated addition of edges can be represented as the distribution of 2αN new edge ends among the N vertices in the network. This event contributes with a probability(23)Probabilities , , , define the rate equation for the degree distribution:(24)The point to note in Eq. (24) is the first term proportional to nk/N. This is the unaltered duplication event, which can create a vertex of degree nk only by duplicating another such vertex. It is separated from the edge addition probabilities in the rest of the terms, because for re-wired edges, there is no correlation between the likelihood that a vertex of degree k will be created by duplication, and that it will be gained or lost by edge addition. Since in each time step a new vertex is added, Eq. (24) satisfies the condition(25)that yields the expected result N(t)=N0+t, where N0 is the initial number of vertices in the network. In order to solve Eq. (24), we impose the homogenous condition on the population number(26)where pk is the probability of finding a vertex of degree k, which we assume to be independent of time. With this approximation, the rate equation reads(27)Eq. (27) can be solved using the generating functional method (Gardiner, 1985). Let us define the generating functional(28)In terms of φ, Eq. (27) can be written as(29)The solution of this last equation, with the boundary condition φ(1)=∑kpk=1, is(30)Knowing the form of φ(x) we can compute immediately the average degree(31)in agreement with the mean-field prediction of Eq. (16).
On the other hand, performing a Taylor expansion of φ(x) around x=0 we can obtain pk as(32)where φ(k)(x) is the kth derivative of φ(x). Applying this formula on function (30), we are led to(33)By using Stirling's approximation, we can obtain the asymptotic behavior of pk for large k, which is given by(34)with(35)As we observe from Eq. (34), we recover the same functional form experimentally observed in Jeong et al. (2001a). However, it is important to note that for all the parameter range in which the exponential cut-off kc is well-defined, we obtain a value of the degree exponent, as given by Eq. (35), that is γ⩽1. This result is unsatisfactory, because, as we will see in the next section, it does not correspond with the results from numerical simulations of the model. This discrepancy is explained by the fact that the N→∞ solution that we have constructed only applies for δ>1/2 (see Eq. (31)). Yet the master equation was defined on the basis of an independent-event approximation that only makes sense for δ⪡1/2.
In summary, the master equation correctly reproduces the scale-free distribution but does not give the appropriate exponent (something that the model correctly reproduces when simulated, see next section). The singular character of this model has been stressed in a recent study by Kim et al. (2002). In this sense, it has been shown that the fluctuations displayed by the model are very important. By following a somewhat different approach, these authors also obtained scale-free distributions with a characteristic exponent in the limit of large k (infinite large network)(36)This relation yields an exponent γ surprisingly independent of the addition probability factor β. The solution of this equation for δ≈0.56 is γ≈2.7, consistent with numerical simulations (see next section).
6. Numerical results
The proteome model defined in Section 3 depends effectively on two independent parameters: the average degree of the network 〈k〉 and the deletion rate of newly created edges δ; given these two parameters, the rate β can be computed from Eq. (17). The average degree can be estimated from the experimental results from real proteome maps. Examination of Table 1 yields a value 〈k〉≃2.40 from the data analysed in Jeong et al. (2001a). As a safe estimate, we impose the value 〈k〉=2.5 in our model. In Solé et al. (2002b) the rate δ was roughly estimated from the experimentally calculated ratio of addition and deletion rates in the yeast proteome, α/δ (Wagner, 2001). However, it is clear that this estimate is strongly dependent on the assumed value α/δ. In this work we will consider instead the more general case of a δ-dependent model. Guided by the analytical study in Section 5, we should expect the model to yield, for each value of δ, the functional form Eq. (1) of the degree distribution, with a degree exponent γ which is a function of δ (for a fixed average degree 〈k〉=2.5). From numerical simulations of the model we will compute the function γ(δ) and select the value of δ that yields a degree exponent in agreement with the experimental observations.
Simulations of the model start from a connected ring of N0=5 vertices and proceed by iterating the rules of the model until the desired network size is achieved. Given the size of the maps analysed by Jeong et al. (2001a), we consider networks with N=2×103 vertices. In Fig. 4 we plot the values of γ estimated from functional form (1) for the degree distribution obtained from computer simulations of our model, averaging over 1000 network realizations. The exponent γ is computed performing a nonlinear regression of the corresponding degree distribution in the range k∈[1,80]. In this figure we observe that, apart from a concave region for δ very close to 1/2, γ is an increasing function of δ. We thus conclude that the value of δ yielding the degree exponent closest to the experimentally observed one is(37)We will use this value thorough the rest of the paper. Fig. 4. Degree exponent γ as a function of the deletion rate δ from computer simulations of the proteome model with average degree 〈k〉=2.5. Network size N=2×103. The degree distributions is averaged over 1000 different network realizations.
In Fig. 5(a) we show the topology of the giant component of a typical realization of the network model of size N=2×103. This figure clearly resembles the giant component of real yeast networks, as we can see comparing with Fig. 2(a); we can appreciate the presence of a few highly connected hubs plus many vertices with a relatively small number of connections. On the other hand, in Fig. 5(b) we plot the degree P(k) obtained for networks of size N=2×103, averaged of realizations. In this figure we observe that the resulting degree distribution can be fitted to a power law with an exponential cut-off, of the form given by Eq. (1), with parameters γ=2.5±0.1 and kc≃37, in fair agreement with the measurements reported by Wagner (2001) and Jeong et al. (2001a) (see also Table 1). This value is also to be compared with the theoretical prediction from Kim et al. (2002), γ≈2.7. The slight discrepancy in the value is to be attributed to the small size of the simulated networks, comparable to the size of real proteome maps, while Eq. (36) is expected to be valid in the limit N→∞. Fig. 5. (a) Topology of the giant component of the map obtained with the proteome model with parameters 〈k〉=2.5 and δ=0.562. Network size N=2×103. (b) Degree distribution for the same model, averaged over different network realizations.
We have also computed the SW properties of the model. In Table 1 we report the values of 〈k〉, , and obtained for our model, compared with the values reported for the yeast S. cerevisiae (Jeong et al., 2001a), and the values corresponding to a random graph with size and degree comparable with both the model and the real data. All the magnitudes displayed by the model compare quite well with the values measured for the yeast, and represent a further confirmation of the SW conjecture for the protein networks advanced by Wagner (2001).
7. Discussion
In this paper a detailed analysis of a model of proteome evolution (Solé et al., 2002b) has been presented. The model is a simple approximation to the evolution of the real proteome map, and no functionality is considered (i.e. no cellular network dynamics is explicitly introduced). This simplification imposes some limitations to the conclusions reachable by our study. Nevertheless, the success in reproducing the observed statistical features of real interaction maps suggests that our mechanism is able to capture the essential ingredients that shape large-scale proteome evolution, at least those that can be extracted from topological data, without introducing selective forces that would fine-tune the proteome structure. In this context, it is important to mention that, regardless of the limitations and biases imposed by different large-scale molecular methods (from two-hybrid assays to mass spectrometry) there seems to be a strong consistency in the overall pattern that results from these different sources (Kumar and Snyder, 2001).
Two essential components define the model: growth by single-gene duplication plus correlated re-wiring. The second rule is inspired in the assumption that novelties derived from changes in regulation patterns will be constrained by the functional properties present in already established interacting networks or subnetworks. Such constraints are likely to be relaxed when new genes are created through duplication. Following available data, the interactions associated with the redundant copies are rapidly lost, whereas new connections emerge rarely through proteome evolution. Estimates of these rates have been used in our analysis. Besides, genome-wide studies on the fate of redundant genes reveals that 50% of duplicated genes lose their function after duplication, whereas the other half experience functional divergence (Nadeau and Sankoff, 1997). Although no functionality is used here, it is interesting to mention that more than 40% of vertices in the proteome model proposed here become disconnected from the main component. If such a disconnection can be related to loss of function, then the fraction of loss genes seems to be close to observed rates.
We derived the rate equations for the evolution of the degree distribution P(k) and its stationary states under some constraints imposed by available data from the analysis of yeast proteome. The rate equation successfully reproduces the scaling law reported from proteome analysis, although the exact exponent is no reproduced due to the strong effects of fluctuations implicit in this type of growth mechanism. Although we concentrated our study in comparing model and data distributions (which are assumed to represent steady states) future extensions should consider the time-dependent behavior of the model as well as possible extensions that would treat the problem of how resident genomes degrade in time (Andersson and Kurland, 1998).
Together with the rules that define the evolution of our model proteome, we introduced characteristic rates that are estimated from available information. The rates of change are of course very important, since they are also responsible for the final connectedness, clustering and sparseness of the graph. What are the factors that tune the rates of edge addition and deletion? One possible source of tuning might be related to the cost of wiring. Additional, functional edges require higher transcription levels and are constrained by different sources of regulation feedbacks. A sparse graph might be a topological blueprint of the underlying optimization process operating at the level of protein wiring. Actually, optimization of graphs has been shown to lead to scale-free networks when both edge density and graph distance are minimized simultaneously (Ferrer i Cancho and Solé, 2001). Since network communication plus low cost leads to heterogeneous maps with scaling properties, it might be the case that the resulting homeostasis characteristic of scale-free nets is actually a byproduct of evolutionary dynamics. Selection would be unnecessary in order to explain the overall patterns displayed by this network architecture: by simply maintaining the system communicated at low average degree 〈k〉, the multiplicative character of proteome growth spontaneously leads to scaling in the degree. However, not any rates of edge removal allows to obtain scale-free edge distributions, and thus selection might have operated at a more broad level.
Further developments of this model should consider different components of proteome structure and the underlying dynamics of protein–protein interactions. The modular structure of cellular networks (Hartwell et al., 1999; Clarcke and Mittenthal, 2001; Ravasz et al., 2002) or the presence of degeneracy and redundancy (Edelman and Gally, 2001) and its relation with other natural and artificial systems, should be explored. It is also important to understand the relative role of the three cellular networks (genome, proteome and metabolome) in shaping evolutionary paths. In this context, global analysis of metabolic maps might help expand our approach into more appropriate models including functionality (see for example Ouzounis and Karp, 2000).
The fact that scale-free nets seem so widespread might actually provide a new framework for the study of evolutionary convergence: heterogeneous nets might actually result from optimal searches in high-dimensional parameter spaces. In this context, the proteome map would offer an excellent example of a system where selection, optimization, and tinkering might be at work (Solé et al., 2002a).
Acknowledgements
The authors thank Jay Mittenthal, Ramon Ferrer, Jose Montoya, Stuart Kauffman, Baldo Oliva, and Andy Wuensche for useful discussions. This work has been supported by a grant BFM 2001-2154 and by the Santa Fe Institute (RVS). R.P.-S. acknowledges financial support from the Ministerio de Ciencia y Tecnologı́a (Spain).
References
- Abramowitz and Stegun 1972
- M. Abramowitz, I.A. StegunHandbook of Mathematical Functions, Dover, New York (1972)
- Albert 2000
- R.A. Albert, H. Jeong, A.-L. BarabásiError and attack tolerance of complex networksNature, 406 (2000), pp. 378-382
- Amaral 2000
- L.A.N. Amaral, A. Scala, M. Barthélémy, H.E. StanleyClasses of small-world networksProc. Natl Acad. Sci. USA, 97 (2000), pp. 11149-11152
- Andersson and Kurland 1998
- S.G.E. Andersson, C. KurlandReductive evolution of resident genomesTrends Microbiol., 6 (1998), pp. 263-268
- Bartel 1996
- P.L. Bartel, J.A. Roecklein, D. SenGupta, S.A. FieldsA protein linkage map of Escherichia coli bacteriophage t7Nature Genet., 12 (1996), pp. 72-77
- Bollobás 1985
- B. BollobásRandom Graphs, Academic Press, London (1985)
- Bornholdt and Schuster 2002
- S. Bornholdt, H.G. SchusterHandbook of Graphs and Networks: From the Genome to the Internet, Springer, Berlin (2002)
- Bray 1995
- D. BrayProtein molecules as computational elements in living cellsNature, 376 (1995), pp. 307-312
- Clarcke and Mittenthal 2001
- B. Clarcke, J.E. MittenthalModularity and reliability in the organization of organismsBull. Math. Biol., 54 (2001), pp. 1-20
- Edelman and Gally 2001
- G.M. Edelman, J.A. GallyDegeneracy and complexity in biological systemsProc. Natl Acad. Sci. USA, 98 (2001), pp. 13763-13768
- Fell and Wagner 2000
- D. Fell, A. WagnerThe small world of metabolismNature Biotech., 18 (2000), p. 1121
- Ferrer i Cancho 2001a
- R. Ferrer i Cancho, C. Janssen, R.V. SoléThe topology of technology graphs: small world pattern in electronic circuitsPhys. Rev. E, 63 (2001), p. 32767
- Ferrer i Cancho 2001b
- R. Ferrer i Cancho, C. Janssen, R.V. SoléThe small world of human languageProc. Roy. Soc. London B, 268 (2001), pp. 2261-2266
- Ferrer i Cancho and Solé 2001
- Ferrer i Cancho, R., Solé, R.V., 2001. Optimization in complex networks. Santa Fe working paper 01-11-068.
- Flajolet 2000
- M. Flajolet, G. Rotondo, L. Daviet, F. Bergametti, G. Inchauspe, P. Tiollais, C. Transy, P. LegrainA genomic approach to the hepatitis c virus generates a protein interaction mapGene, 242 (2000), pp. 369-379
- Fraser 2002
- H.B. Fraser, A.E. Hirsh, L.M. Steinmetz, C. Scharfe, M.W. FeldmanEvolutionary rate in the protein interaction networkScience, 296 (2002), pp. 750-752
- Fromont-Racine 1997
- M. Fromont-Racine, J.C. Rain, P. LegrainTowards a functional analysis of the yeast genome through exhaustive two-hybrid screensNature Genet., 16 (1997), pp. 277-282
- Gardiner 1985
- C.W. GardinerHandbook of Stochastic Methods (2nd Edition), Springer, Berlin (1985)
- Gavin 2002
- A. GavinFunctional organization of the yeast proteome by systematic analysis of protein complexesNature, 415 (2002), pp. 141-147
- Goh 2002
- K. Goh, E. Oh, H. Jeong, B. Kahng, D. KimClassification of scale-free networksProc. Natl Acad. Sci. USA, 99 (2002), pp. 12583-12588
- Goodwin 1994
- B. GoodwinHow the Leopard Changed its Spots: The Evolution of Complexity, Weidenfeld & Nicholson, London (1994)
- Hartwell 1999
- L.H. Hartwell, J.J. Hopfield, S. Leibler, A.W. MurrayFrom molecular to modular cell biologyNature, 402 (1999), pp. C47-C52
- Hazbun and Fields 2001
- T.R. Hazbun, S. FieldsNetworking proteins in yeastProc. Natl Acad. Sci. USA, 98 (2001), pp. 4277-4278
- Ho 2002
- Y. HoSystematic identification of protein complexes in Saccharomyces cerevisiaeNature, 415 (2002), pp. 180-183
- Ito 2000
- T. Ito, K. Tashiro, S. Muta, R. Ozawa, T. Chiba, M. Nishizawa, K. Yamamoto, S. Kuhara, Y. SakakiToward a protein–protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteinsProc. Natl Acad. Sci. USA, 97 (2000), pp. 1143-1147
- Jeong 2001a
- H. Jeong, S. Mason, A.L. Barabási, Z.N. OltvaiLethality and centrality in protein networksNature, 411 (2001), p. 41
- Jeong 2001b
- H. Jeong, B. Tombor, R. Albert, Z. N.Oltvai, A.-L. BarabasiThe large-scale organization of metabolic networksNature, 407 (2001), pp. 651-654
- Kim 2002
- J. Kim, P.L. Krapivsky, B. Kahng, S. RednerInfinite-order percolation and giant fluctuations in a protein interaction networkPhys. Rev. E, 66 (2002), p. 055101
- Krapivsky 2000
- P.L. Krapivsky, S. Redner, F. LeyvrazConnectivity of growing random networksPhys. Rev. Lett., 85 (2000), p. 4629
- Kumar and Snyder 2001
- A. Kumar, M. SnyderProtein complexes take the baitNature, 415 (2001), pp. 123-124
- Lodish 2000
- H. Lodish, A. Berk, S.L. Zipursky, P. MatsudairaMolecular Cell Biology (4th Edition), W. H. Freeman, New York (2000)
- McCraith 2000
- S. McCraith, T. Holtzman, B. Moss, S. FieldsGenome-wide analysis of vaccinia virus protein–protein interactionsProc. Natl Acad. Sci. USA, 97 (2000), pp. 4879-4884
- Mewes 1999
- H.W. Mewes, K. Heumann, A. Kaps, K. Mayer, F. Pfeiffer, S. Stocker, D. FrishmanMips: a database for genomes and protein sequencesNucleic Acids Res., 27 (1999), pp. 44-48
- Monod 1970
- J. MonodLe hasard et la nécessité, Editions du Seuil, Paris (1970)
- Montoya and Solé 2002
- J.M. Montoya, R.V. SoléSmall world patters in fodd websJ. theor. Biol., 214 (2002), pp. 405-412
- Nadeau and Sankoff 1997
- J.H. Nadeau, D. SankoffComparable rates of gene loss and functional divergence after genome duplications early in vertebrate evolutionGenetics, 147 (1997), pp. 1259-1266
- Ohno 1970
- S. OhnoEvolution by Gene Duplication, Springer, Berlin (1970)
- Ouzounis and Karp 2000
- C.A. Ouzounis, P.D. KarpGlobal properties of the metabolic map of Escherichia coliGenome Res., 10 (2000), pp. 568-576
- Pastor-Satorras 2001
- R. Pastor-Satorras, A. Vázquez, A. VespignaniDynamical and correlation properties of the internetPhys. Rev. Lett., 87 (2001), p. 258701
- Patthy 1999
- L. PatthyProtein Evolution, Blackwell, Oxford (1999)
- Podani 2001
- J. Podani, Z. Oltvai, H. Jeong, B. Tombor, A.-L. Barabási, E. SzathmáryComparable system-level organization of Archaea and EukaryotesNature Genet., 29 (2001), pp. 54-56
- Rain 2001
- J.C. Rain, L. Selig, H. De Reuse, V. Battaglia, C. Reverdy, S. Simon, G. Lenzen, F. Petel, J. Wojcik, V. Schachter, Y. Chemama, A.S. Labigne, P. LegrainThe protein–protein interaction map of Helicobacter pyloriNature, 409 (2001), p. 743
- Ravasz 2002
- E. Ravasz, S.L. Somera, D.A. Mongru, Z.N. Oltvai, A.-L. BarabásiHierarchical organization of modularity in metabolic networksScience, 297 (2002), pp. 1551-1555
- Ross-Macdonald 1999
- P. Ross-Macdonald, P.S.R. Coelho, T. Roemer, S. Agarwal, A. Kumar, R. Jansen, K.H. Cheung, A. Sheehan, D. Symoniatis, L. Umansky, M. Heldtman, F.K. Nelson, H. Iwasaki, K. Hager, M. Gerstein, P. Miller, G.S. Roeder, M. SnyderLarge-scale analysis of the yeast genome by transposon tagging and gene disruptionNature, 402 (1999), pp. 413-418
- Rzhetsky and Gomez 2001
- A. Rzhetsky, S.M. GomezBirth of scale-free molecular networks and the number of distinct dna and protein domains per genomeBioinformatics, 17 (2001), pp. 988-996
- Solé 2002a
- R.V. Solé, R. Ferrer, J.M. Montoya, S. ValverdeSelection, tinkering, and emergence in complex networksComplexity, 8 (2002), pp. 20-33
- Solé 2002b
- R.V. Solé, R. Pastor-Satorras, E. Smith, T. KeplerA model of large-scale proteome evolutionAdv. Complex. Systems, 5 (2002), pp. 43-54
- Stephan 2000
- K.A. Stephan, C.-C. Hilgetag, G.A.P.C. Burns, M.A. O'Neill, M.P. Young, R. KötterComputational analysis of functional connectivity between areas of primate cerebral cortexPhilos. Trans. Roy. Soc. B, 355 (2000), pp. 111-126
- Thieffry 1998
- D. Thieffry, A.M. Huerta, E. Pérez-Rueda, J. Collado-VivesFrom specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coliBioEssays, 20 (1998), pp. 433-440
- Vázquez 2003
- A. Vázquez, A. Flammini, A. Maritan, A. VespignaniModelling of protein interaction networksComplexus, 1 (2003), pp. 38-44
- Wagner 2000
- A. WagnerRobustness against mutations in genetic networks of yeastNature Genet., 24 (2000), pp. 355-361
- Wagner 2001
- A. WagnerThe yeast protein interaction network evolves rapidly and contains few redundant duplicate genesMol. Biol. Evol., 18 (2001), pp. 1283-1292
- Walhout 2000
- A.J.M. Walhout, R. Sordella, X.W. Lu, J.L. Hartley, G.F. Temple, M.A. Brasch, N. Thierry-Mieg, M. VidalProtein interaction mapping in c. elegans using proteins involved in vulval developmentScience, 287 (2000), pp. 116-122
- Watts 1999
- D.J. WattsSmall Worlds, Princeton University Press, Princeton (1999)
- Watts and Strogatz 1998
- D.J. Watts, S.H. StrogatzCollective dynamics of ‘small-world’ networksNature, 393 (1998), pp. 440-442
- Williams 2001
- Williams, R.J., Martinez, N.D., Berlow, E.L., Dunne, J.A., Barabási, A.-L., 2001. Two degrees of separation in complex food webs. Santa Fe working paper 01-07-036.
- Wuchty 2001
- S. WuchtyScale-free behavior in protein domain networksMol. Biol. Evol., 18 (2001), pp. 1694-1702
- 1
Figure kindly provided by W. Basalaj (see http://www.cl.cam.uk/~wb204/GD99/#Mewes).