Abstract
Despite its increasing role in communication, the World-Wide Web remains uncontrolled: any individual or institution can create a website with any number of documents and links. This unregulated growth leads to a huge and complex web, which becomes a large directed graph whose vertices are documents and whose edges are links (URLs) that point from one document to another. The topology of this graph determines the web's connectivity and consequently how effectively we can locate information on it. But its enormous size (estimated to be at least 8 × 10^8 documents [1]) and the continual changing of documents and links make it impossible to catalogue all the vertices and edges.
The extent of the challenge in obtaining a complete topological map of the web is illustrated by the limitations of the commercial search engines: Northern Light, the search engine with the largest coverage, is estimated to index only 38% of the web [1]. Although much work has been done to map and characterize the Internet's infrastructure [2], little is known about what really matters in the search for information — the topology of the web. Here we take a step towards filling this gap: we have used local connectivity measurements to construct a topological model of the World-Wide Web, which has enabled us to explore and characterize its large-scale properties.
To determine the local connectivity of the web, we constructed a robot that adds to its database all URLs found on a document and recursively follows these to retrieve the related documents and URLs. We used the data collected to determine the probabilities P_out(k) and P_in(k) that a document has k outgoing and incoming links, respectively. We find that both P_out(k) and P_in(k) follow a power law over several orders of magnitude, remarkably different not only from the Poisson distribution predicted by the classical theory of random graphs [3,4], but also from the bounded distribution found in models of random networks [5].
The power-law tail indicates that the probability of finding documents with a large number of links is significant, as the network connectivity is dominated by highly connected web pages. Similarly, for incoming links, the probability of finding very popular addresses, to which a large number of other documents point, is non-negligible, an indication of the flocking nature of the web. Furthermore, while the owner of each web page has complete freedom in choosing the number of links on a document and the addresses to which they point, the overall system obeys scaling laws characteristic only of highly interactive self-organized systems and critical phenomena [6].
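To make the measurement concrete, here is a minimal Python sketch (not the robot used in the study) that tallies outgoing and incoming link counts from a crawled edge list and converts them into the distributions P_out(k) and P_in(k); the file name links.txt and its one-pair-per-line format are assumptions made only for illustration.

```python
from collections import Counter

out_deg = Counter()   # number of outgoing links found on each document
in_deg = Counter()    # number of incoming links pointing to each document

# Assumed format: one "source_url target_url" pair per line in links.txt
with open("links.txt") as f:
    for line in f:
        src, dst = line.split()
        out_deg[src] += 1
        in_deg[dst] += 1

def degree_distribution(deg):
    """Return {k: P(k)}: the fraction of documents with exactly k links."""
    hist = Counter(deg.values())
    n = sum(hist.values())
    return {k: c / n for k, c in sorted(hist.items())}

P_out = degree_distribution(out_deg)
P_in = degree_distribution(in_deg)
# On double-logarithmic axes the tails should be straight lines with slopes
# close to -2.45 (outgoing) and -2.1 (incoming), as in Fig. 1a, b.
```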
To investigate the connectivity and the large-scale topological properties of the web, we constructed a directed random graph consisting of N vertices, assigning to each vertex k outgoing (or incoming) links, such that k is drawn from the power-law distribution of Fig. 1a,b. To achieve this, we randomly selected a vertex i and increased its outgoing (or incoming) connectivity to k_i + 1 if the total number of vertices with k_i + 1 outgoing (or incoming) links is less than N·P_out(k_i + 1) (or N·P_in(k_i + 1)).
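The degree-assignment step can be sketched as follows (an illustrative implementation under assumptions the text leaves open: a pure power law P(k) ∝ k^(−γ) for k ≥ 1 truncated at k = N, a toy system size, and a guard on the number of attempts):

```python
import random

def assign_degrees(N, gamma, seed=0, max_attempts=None):
    """Assign each of N vertices a link count k_i following P(k) ~ k^(-gamma),
    using the acceptance rule described above: a randomly chosen vertex i is
    promoted from k_i to k_i + 1 only if fewer than N * P(k_i + 1) vertices
    already have k_i + 1 links."""
    rng = random.Random(seed)
    kmax = N                                  # truncation of the power law (assumption)
    norm = sum(k ** (-gamma) for k in range(1, kmax + 1))
    P = [0.0] + [k ** (-gamma) / norm for k in range(1, kmax + 1)]

    degree = [0] * N                          # current k_i of every vertex
    count = [0] * (kmax + 2)                  # how many vertices currently have each k
    count[0] = N
    target_links = int(N * sum(k * P[k] for k in range(1, kmax + 1)))
    max_attempts = max_attempts or 100 * target_links

    links = attempts = 0
    while links < target_links and attempts < max_attempts:
        attempts += 1
        i = rng.randrange(N)
        k_new = degree[i] + 1
        if k_new <= kmax and count[k_new] < N * P[k_new]:
            count[degree[i]] -= 1
            count[k_new] += 1
            degree[i] = k_new
            links += 1
    return degree

# Toy outgoing and incoming degree sequences with the exponents measured in Fig. 1
out_degrees = assign_degrees(10_000, 2.45)   # gamma_out = 2.45
in_degrees = assign_degrees(10_000, 2.1)     # gamma_in  = 2.1
```

The outgoing and incoming degree sequences generated this way can then be wired into a directed graph, for instance by randomly matching outgoing link stubs to incoming ones.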
Figure 1: Distribution of links on the World-Wide Web.

a, Outgoing links (URLs found on an HTML document); b, incoming links (URLs pointing to a certain HTML document). Data were obtained from the complete map of the nd.edu domain, which contains 325,729 documents and 1,469,680 links. Dotted lines represent analytical fits used as input distributions in constructing the topological model of the web; the tails of the distributions follow P(k) ∝ k^(−γ), with γ_out = 2.45 and γ_in = 2.1. c, Average of the shortest path between two documents as a function of system size, as predicted by the model. To check the validity of our predictions, we determined ⟨d⟩ for documents in the domain nd.edu. The measured ⟨d_nd.edu⟩ = 11.2 agrees well with the prediction ⟨d(3 × 10^5)⟩ = 11.6 obtained from our model. To show that the power-law tail of P(k) is a universal feature of the web, the inset shows P_out(k) obtained by starting from whitehouse.gov (squares), yahoo.com (triangles) and snu.ac.kr (inverted triangles). The slope of the dashed line is γ_out = 2.45, as obtained from nd.edu in a.
A particularly important quantity in a search process is the shortest path between two documents, d, defined as the smallest number of URL links that must be followed to navigate from one document to the other. We find that the average of d over all pairs of vertices is ⟨d⟩ = 0.35 + 2.06 log(N) (Fig. 1c), indicating that the web forms a small-world network [5,7], which characterizes social or biological systems. For N = 8 × 10^8, ⟨d_web⟩ = 18.59; that is, two randomly chosen documents on the web are on average 19 clicks away from each other.
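For a model graph stored as an adjacency list, ⟨d⟩ can be estimated by breadth-first search from a sample of source documents, as in the sketch below (the dictionary-of-out-neighbours representation and the sampling of sources are assumptions of the illustration; pairs that are not reachable along directed links are simply excluded from the average):

```python
from collections import deque
import random

def average_shortest_path(graph, n_sources=100, seed=0):
    """Estimate <d> by breadth-first search from a sample of source vertices.
    graph: dict mapping each vertex to a list of its out-neighbours."""
    rng = random.Random(seed)
    vertices = list(graph)
    total = pairs = 0
    for src in rng.sample(vertices, min(n_sources, len(vertices))):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())          # distances from src to reachable vertices
        pairs += len(dist) - 1               # exclude the source itself
    return total / pairs if pairs else float("inf")

# Measuring <d> on model graphs of increasing N and fitting it against log N
# recovers the logarithmic relation <d> = 0.35 + 2.06 log(N) quoted above.
```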
For a given N, d follows a Gaussian distribution, so ⟨d⟩ can be interpreted as the diameter of the web, a measure of the shortest distance between any two points in the system. Despite its huge size, our results indicate that the web is a highly connected graph with an average diameter of only 19 links. The logarithmic dependence of ⟨d⟩ on N is important to the future potential of the web: we find that the expected 1,000% increase in the size of the web over the next few years will change ⟨d⟩ very little, from 19 to only 21.
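Plugging numbers into the fitted relation (reading log as log10, which reproduces the quoted figures to within the rounding of the fit coefficients) illustrates both claims:

```python
import math

def avg_path(N):
    """Fitted small-world relation <d> = 0.35 + 2.06 log10(N) from Fig. 1c."""
    return 0.35 + 2.06 * math.log10(N)

print(avg_path(8e8))   # ~18.7, i.e. roughly 19 clicks (18.59 quoted in the text)
print(avg_path(8e9))   # a tenfold (1,000%) larger web: ~20.7, i.e. about 21 clicks
```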
The relatively small value of ⟨d⟩ indicates that an intelligent agent, who can interpret the links and follow only the relevant one, can find the desired information quickly by navigating the web. But this is not the case for a robot that locates the information based on matching strings. We find that such a robot, aiming to identify a document at distance ⟨d⟩, needs to search M(⟨d⟩) ≈ 0.53 N^0.92 documents, which, with N = 8 × 10^8, leads to M = 8 × 10^7, or 10% of the whole web. This indicates that robots cannot benefit from the highly connected nature of the web, their only successful strategy being to index as much of the web as possible.
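The same back-of-the-envelope check applies to this search-cost estimate:

```python
N = 8e8                                   # estimated number of documents on the web
M = 0.53 * N ** 0.92                      # documents a string-matching robot must examine
print(f"M = {M:.1e}")                     # ~8e7
print(f"fraction of web = {M / N:.0%}")   # ~10%
```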
The scale-free nature of the link distributions indicates that collective phenomena play a previously unsuspected role in the development of the web [8], forcing us to look beyond the traditional random graph models [3-5,7]. A better understanding of the web's topology, aided by modelling efforts, is crucial in developing search algorithms or designing strategies for making information widely accessible on the World-Wide Web. Fortunately, the surprisingly small diameter of the web means that all that information is just a few clicks away.