Stacking models for nearly optimal link prediction in complex networks
Edited by Luís A. Nunes Amaral, Northwestern University, Evanston, IL, and accepted by Editorial Board Member Simon A. Levin August 6, 2020 (received for review September 2, 2019)
Significance
Networks are a powerful tool for modeling complex biological and social systems. However, most networks are incomplete, and missing connections can negatively affect scientific analyses. Today, many algorithms can predict missing connections, but it is unknown how accuracy varies across algorithms and networks and whether link predictability varies across scientific domains. Analyzing 203 link prediction algorithms applied to 550 diverse real-world networks, we show that no predictor is best or worst overall. We then combine these many predictors into a single state-of-the-art algorithm that achieves nearly optimal performance on both synthetic networks with known optimality and real-world networks. Not all networks are equally predictable, however, and we find that social networks are easiest, while biological and technological networks are hardest.
Abstract
Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speed up network data collection and improve network model validation. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability varies across methods and networks from different domains, and how close to optimality current methods are. We answer these questions by systematically evaluating 203 individual link predictor algorithms, representing three popular families of methods, applied to a large corpus of 550 structurally diverse networks from six scientific domains. We first show that individual algorithms exhibit a broad diversity of prediction errors, such that no one predictor or family is best, or worst, across all realistic inputs. We then exploit this diversity using network-based metalearning to construct a series of “stacked” models that combine predictors into a single algorithm. Applied to a broad range of synthetic networks, for which we may analytically calculate optimal performance, these stacked models achieve optimal or nearly optimal levels of accuracy. Applied to real-world networks, stacked models are superior, but their accuracy varies strongly by domain, suggesting that link prediction may be fundamentally easier in social networks than in biological or technological networks. These results indicate that the state of the art for link prediction comes from combining individual algorithms, which can achieve nearly optimal predictions. We close with a brief discussion of limitations and opportunities for further improvements.
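The stacking idea described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes three simple topological predictors (common neighbors, Jaccard coefficient, preferential attachment) as stand-ins for the 203 predictors, and it uses a small logistic-regression meta-learner, chosen here only for brevity since the abstract does not name the meta-learner. All function names (`build_adj`, `stack_predict`, `auc`, and so on) are hypothetical.

```python
import math
import random
from itertools import combinations

# Three simple topological predictors (illustrative stand-ins only).
def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def pref_attachment(adj, u, v):
    return len(adj[u]) * len(adj[v])

PREDICTORS = [common_neighbors, jaccard, pref_attachment]

def build_adj(nodes, edges):
    """Undirected adjacency sets from an edge list."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def features(adj, u, v):
    """One feature per individual predictor for the node pair (u, v)."""
    return [float(p(adj, u, v)) for p in PREDICTORS]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.05, epochs=300):
    """Meta-learner (assumption): logistic regression fitted by SGD."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            w[0] -= lr * err
            for j, xj in enumerate(xi):
                w[j + 1] -= lr * err * xj
    return w

def stack_predict(nodes, edges, holdout_frac=0.2, seed=0):
    """Score every unobserved pair with a stacked model.

    Hold out a fraction of observed edges, compute predictor features on
    the remaining graph, train the meta-learner to separate held-out edges
    from sampled non-edges, then score all non-edges of the full graph.
    """
    rng = random.Random(seed)
    edges = sorted({tuple(sorted(e)) for e in edges})
    edge_set = set(edges)
    non_edges = [p for p in combinations(sorted(nodes), 2) if p not in edge_set]
    k = max(1, int(holdout_frac * len(edges)))
    held = rng.sample(edges, k)
    held_set = set(held)
    train_adj = build_adj(nodes, [e for e in edges if e not in held_set])
    negatives = rng.sample(non_edges, min(k, len(non_edges)))
    X = [features(train_adj, u, v) for u, v in held + negatives]
    y = [1] * len(held) + [0] * len(negatives)
    w = fit_logistic(X, y)
    full_adj = build_adj(nodes, edges)
    return {p: predict(w, features(full_adj, *p)) for p in non_edges}

def auc(pos_scores, neg_scores):
    """Probability that a true missing link outscores a non-link (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

Calling `stack_predict(nodes, edges)` returns a dictionary mapping each unobserved node pair to a score in [0, 1]; higher-scoring pairs are the candidates for missing links. Given a separate set of known missing links, `auc` gives one way to compare the stacked scores against any single predictor.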
Footnotes
- ¹To whom correspondence may be addressed. Email: amir.ghasemianlangroodi@colorado.edu or aaron.clauset@colorado.edu.
Author contributions: A. Ghasemian, H.H., A. Galstyan, E.M.A., and A.C. designed research; A. Ghasemian and H.H. performed research; A. Ghasemian and H.H. analyzed data; and A. Ghasemian and A.C. wrote the paper.
The authors declare no competing interest.
This article is a PNAS Direct Submission. L.A.N.A. is a guest editor invited by the Editorial Board.
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1914950117/-/DCSupplemental.
Data Availability.
Network data and code for replication and reuse have been deposited in GitHub (https://github.com/Aghasemian/OptimalLinkPrediction).
Published under the PNAS license.
Article Classifications
- Physical Sciences
- Computer Sciences
- Biological Sciences
- Applied Biological Sciences