PT - JOURNAL ARTICLE
AU - Rogini Runghen
AU - Daniel B Stouffer
AU - Giulio V Dalla Riva
TI - Exploiting node metadata to predict interactions in large networks using graph embedding and neural networks
AID - 10.1101/2021.06.10.447991
DP - 2021 Jan 01
TA - bioRxiv
PG - 2021.06.10.447991
4099 - http://biorxiv.org/content/early/2021/06/11/2021.06.10.447991.short
4100 - http://biorxiv.org/content/early/2021/06/11/2021.06.10.447991.full
AB - Collecting network interaction data is difficult. Non-exhaustive sampling and complex hidden processes often result in an incomplete data set. Thus, identifying potentially present but unobserved interactions is crucial both in understanding the structure of large-scale data, and in predicting how previously unseen elements will interact. Recent studies in network analysis have shown that accounting for metadata (such as node attributes) can improve both our understanding of how nodes interact with one another, and the accuracy of link prediction. However, the dimension of the object we need to learn to predict interactions in a network grows quickly with the number of nodes. Learning it therefore becomes computationally and conceptually challenging for large networks. Here, we present a new predictive procedure combining a graph embedding method with machine learning techniques to predict interactions on the basis of nodes’ metadata. Graph embedding methods project the nodes of a network onto a low-dimensional latent feature space. The position of the nodes in the latent feature space can then be used to predict interactions between nodes. Learning a mapping of the nodes’ metadata to their position in a latent feature space corresponds to a classic, low-dimensional machine learning problem. In our current study we used the Random Dot Product Graph model to estimate the embedding of an observed network, and we tested different neural network architectures to predict the position of nodes in the latent feature space.
Flexible machine learning techniques for mapping the nodes onto their latent positions make it possible to account for multivariate and possibly complex node metadata. To illustrate the utility of the proposed procedure, we applied it to a large dataset of tourist visits to destinations across New Zealand. We found that our procedure accurately predicts interactions for both existing nodes and nodes newly added to the network, while being computationally feasible even for very large networks. Overall, our study highlights that by exploiting the properties of a well-understood statistical model for complex networks and combining it with standard machine learning techniques, we can simplify the link prediction problem when incorporating multivariate node metadata. Our procedure can be immediately applied to different types of networks, and to a wide variety of data from different systems. As such, from both a network science and a data science perspective, our work offers a flexible and generalisable procedure for link prediction.
Competing Interest Statement: The authors have declared no competing interest.
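
The two-step procedure described in the abstract (embed the observed network, then learn a map from node metadata to latent positions) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a rectangular interaction matrix, uses a truncated SVD as the spectral embedding, and substitutes a linear least-squares map for the neural networks tested in the study. The function names (`rdpg_embed`, `fit_metadata_map`, `predict_positions`) and the noiseless synthetic data are hypothetical and chosen only so the result can be checked.

```python
import numpy as np


def rdpg_embed(A, d):
    """Rank-d spectral embedding of an interaction matrix A (rows x columns).

    Returns row positions L and column positions R such that L @ R.T is the
    best rank-d approximation of A in the least-squares sense.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    root = np.sqrt(s[:d])  # split each singular value evenly between L and R
    return U[:, :d] * root, Vt[:d, :].T * root


def _with_intercept(X):
    """Append a column of ones so the linear map can learn an offset."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_metadata_map(X, Z):
    """Least-squares map from metadata X (n x p) to latent positions Z (n x d).

    Assumption: a linear model stands in for the neural networks of the study.
    """
    W, *_ = np.linalg.lstsq(_with_intercept(X), Z, rcond=None)
    return W


def predict_positions(X, W):
    """Predict latent positions for nodes from their metadata alone."""
    return _with_intercept(X) @ W


# Synthetic, noiseless example (hypothetical data, for checking only):
# row-node positions depend linearly on three metadata features.
rng = np.random.default_rng(0)
n_rows, n_cols, n_features, d = 60, 40, 3, 2
X = rng.normal(size=(n_rows, n_features))
L_true = X @ rng.normal(size=(n_features, d))
R_true = rng.normal(size=(n_cols, d))
A = L_true @ R_true.T

# Embed using only the first 50 row nodes; treat the last 10 as new nodes.
train, new = np.arange(50), np.arange(50, 60)
L_hat, R_hat = rdpg_embed(A[train], d)

# Learn metadata -> latent position, then predict interactions of new nodes
# as dot products between predicted row positions and column positions.
W = fit_metadata_map(X[train], L_hat)
A_new_pred = predict_positions(X[new], W) @ R_hat.T
A_new_true = A[new]

print("max abs error on new nodes:", np.abs(A_new_pred - A_new_true).max())
```

The point of the design is that the learning problem has only `d` outputs per node, regardless of how many nodes the network has. With real, noisy, binary interaction data the recovery would not be exact, and a more flexible regressor would take the place of `fit_metadata_map`.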