Elsevier

Physics Reports

Volume 486, Issues 3–5, February 2010, Pages 75-174

Community detection in graphs

https://doi.org/10.1016/j.physrep.2009.11.002

Abstract

The modern science of networks has brought significant advances to our understanding of complex systems. One of the most relevant features of graphs representing real systems is community structure, or clustering, i.e. the organization of vertices in clusters, with many edges joining vertices of the same cluster and comparatively few edges joining vertices of different clusters. Such clusters, or communities, can be considered as fairly independent compartments of a graph, playing a role similar to that of, e.g., the tissues or the organs in the human body. Detecting communities is of great importance in sociology, biology and computer science, disciplines where systems are often represented as graphs. This problem is very hard and not yet satisfactorily solved, despite the huge effort of a large interdisciplinary community of scientists working on it over the past few years. We will attempt a thorough exposition of the topic, from the definition of the main elements of the problem, to the presentation of most methods developed, with a special focus on techniques designed by statistical physicists, from the discussion of crucial issues like the significance of clustering and how methods should be tested and compared against each other, to the description of applications to real networks.

Introduction

The origin of graph theory dates back to Euler’s solution of the puzzle of Königsberg’s bridges in 1736 [1]. Since then a lot has been learned about graphs and their mathematical properties [2]. In the 20th century they also became extremely useful as the representation of a wide variety of systems in different areas. Biological, social, technological, and information networks can be studied as graphs, and graph analysis has become crucial to understanding the features of these systems. For instance, social network analysis started in the 1930s and has become one of the most important topics in sociology [3], [4]. In recent times, the computer revolution has provided scholars with a huge amount of data and the computational resources to process and analyze these data. The size of real networks one can potentially handle has also grown considerably, reaching millions or even billions of vertices. The need to deal with such a large number of units has produced a deep change in the way graphs are approached [5], [6], [7], [8], [9], [10].

Graphs representing real systems are not regular like, e.g., lattices. They are objects where order coexists with disorder. The paradigm of a disordered graph is the random graph, introduced by Erdös and Rényi [11]. In it, the probability of having an edge between a pair of vertices is equal for all possible pairs (see Appendix). In a random graph, the distribution of edges among the vertices is highly homogeneous. For instance, the distribution of the number of neighbors of a vertex, or degree, is binomial, so most vertices have equal or similar degree. Real networks are not random graphs, as they display large inhomogeneities, revealing a high level of order and organization. The degree distribution is broad, with a tail that often follows a power law: therefore, many vertices with low degree coexist with some vertices with large degree. Furthermore, the distribution of edges is not only globally, but also locally inhomogeneous, with high concentrations of edges within special groups of vertices, and low concentrations between these groups. This feature of real networks is called community structure [12], or clustering, and is the topic of this review (for earlier reviews see Refs. [13], [14], [15], [16], [17]). Communities, also called clusters or modules, are groups of vertices which probably share common properties and/or play similar roles within the graph. In Fig. 1 a schematic example of a graph with communities is shown.
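The homogeneity of the random graph can be checked numerically. The following is a minimal Python sketch, not taken from the review: it assumes the G(n, p) formulation, in which each pair of vertices is linked independently with probability p, and the function names (erdos_renyi, degrees) are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Return the edge list of a G(n, p) random graph on vertices 0..n-1."""
    rng = random.Random(seed)
    # Every one of the n*(n-1)/2 vertex pairs is linked independently
    # with the same probability p.
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def degrees(n, edges):
    """Return a list whose i-th entry is the degree of vertex i."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


# Illustrative usage: the observed mean degree should lie close to p*(n-1),
# and the degree histogram should be concentrated around that value.
n, p = 500, 0.02
deg = degrees(n, erdos_renyi(n, p, seed=42))
print("expected mean degree:", p * (n - 1))
print("observed mean degree:", sum(deg) / n)
print("degree histogram:", sorted(Counter(deg).items()))
```

In such a graph each vertex has n-1 potential neighbors, each present with probability p, which is why the degree follows a binomial distribution with mean p(n-1).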

Society offers a wide variety of possible group organizations: families, working and friendship circles, villages, towns, nations. The spread of the Internet has also led to the creation of virtual groups that live on the Web, like online communities. Indeed, social communities have been studied for a long time [18], [19], [20], [21]. Communities also occur in many networked systems from biology, computer science, engineering, economics, politics, etc. In protein–protein interaction networks, communities are likely to group proteins having the same specific function within the cell [22], [23], [24]; in the graph of the World Wide Web they may correspond to groups of pages dealing with the same or related topics [25], [26]; in metabolic networks they may be related to functional modules such as cycles and pathways [27], [28]; in food webs they may identify compartments [29], [30]; and so on.

Communities can have concrete applications. Clustering Web clients who have similar interests and are geographically near to each other may improve the performance of services provided on the World Wide Web, in that each cluster of clients could be served by a dedicated mirror server [31]. Identifying clusters of customers with similar interests in the network of purchase relationships between customers and products of online retailers (e.g., www.amazon.com) enables one to set up efficient recommendation systems [32], which better guide customers through the retailer's list of items and enhance business opportunities. Clusters of large graphs can be used to create data structures that store the graph data efficiently and handle navigational queries, like path searches [33], [34]. Ad hoc networks [35], i.e. self-configuring networks formed by communication nodes acting in the same region and rapidly changing (because the devices move, for instance), usually have no centrally maintained routing tables that specify how nodes have to communicate with other nodes. Grouping the nodes into clusters enables one to generate compact routing tables while the choice of the communication paths is still efficient [36].

Community detection is important for other reasons, too. Identifying modules and their boundaries allows for a classification of vertices, according to their structural position in the modules. So, vertices with a central position in their clusters, i.e. sharing a large number of edges with the other group partners, may have an important function of control and stability within the group; vertices lying at the boundaries between modules play an important role of mediation and lead the relationships and exchanges between different communities (akin to Csermely’s “creative elements” [37]). Such a classification seems to be meaningful in social [38], [39], [40] and metabolic networks [27]. Finally, one can study the graph where vertices are the communities and edges are set between clusters if there are connections between some of their vertices in the original graph and/or if the modules overlap. In this way one attains a coarse-grained description of the original graph, which unveils the relationships between modules. Recent studies indicate that networks of communities have a degree distribution different from that of the full graphs [28]; however, the origin of their structures can be explained by the same mechanism [43].
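The coarse-graining step described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the procedure of any cited paper: the function name community_graph is hypothetical, communities are assumed not to overlap, their labels are assumed to be mutually comparable, and the edge weights (counts of original edges between two communities) are an addition that the text does not require.

```python
def community_graph(edges, membership):
    """Coarse-grain a graph into one super-vertex per community.

    edges: iterable of (u, v) pairs of the original graph.
    membership: dict mapping each vertex to its (single) community label.
    Returns a dict mapping each unordered pair of distinct communities to
    the number of original edges running between them.
    """
    weights = {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu == cv:
            continue  # edges inside a community do not link super-vertices
        key = (cu, cv) if cu <= cv else (cv, cu)
        weights[key] = weights.get(key, 0) + 1
    return weights


# Illustrative usage: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(community_graph(edges, membership))  # {('a', 'b'): 1}
```

Two communities are adjacent in the coarse-grained graph exactly when their pair appears as a key of the returned dictionary.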

Another important aspect related to community structure is the hierarchical organization displayed by most networked systems in the real world. Real networks are usually composed of communities including smaller communities, which in turn include smaller communities, etc. The human body offers a paradigmatic example of hierarchical organization: it is composed of organs, organs are composed of tissues, tissues of cells, etc. Another example is represented by business firms, which are characterized by a pyramidal organization, going from the workers to the president, with intermediate levels corresponding to work groups, departments and management. Herbert A. Simon has emphasized the crucial role played by hierarchy in the structure and evolution of complex systems [44]. The generation and evolution of a system organized in interrelated stable subsystems are much quicker than if the system were unstructured, because it is much easier to assemble the smallest subparts first and use them as building blocks to get larger structures, until the whole system is assembled. In this way it is also far less likely that errors (mutations) occur along the process.

The aim of community detection in graphs is to identify the modules and, possibly, their hierarchical organization, by only using the information encoded in the graph topology. The problem has a long tradition and it has appeared in various forms in several disciplines. The first analysis of community structure was carried out by Weiss and Jacobson [45], who searched for work groups within a government agency. The authors studied the matrix of working relationships between members of the agency, which were identified by means of private interviews. Work groups were separated by removing the members working with people of different groups, who act as connectors between them. This idea of cutting the bridges between groups is at the basis of several modern algorithms of community detection (Section 5). Research on communities actually started even earlier than the paper by Weiss and Jacobson. Already in 1927, Stuart Rice looked for clusters of people in small political bodies, based on the similarity of their voting patterns [46]. Two decades later, George Homans showed that social groups could be revealed by suitably rearranging the rows and the columns of matrices describing social ties, until they take an approximate block-diagonal form [47]. This procedure is now standard. Meanwhile, traditional techniques to find communities in social networks are hierarchical clustering and partitional clustering (Sections 4.2 and 4.3), where vertices are joined into groups according to their mutual similarity.

Identifying graph communities is a popular topic in computer science, too. In parallel computing, for instance, it is crucial to know the best way to allocate tasks to processors so as to minimize the communications between them and enable a rapid performance of the calculation. This can be accomplished by splitting the computer cluster into groups with roughly the same number of processors, such that the number of physical connections between processors of different groups is minimal. The mathematical formalization of this problem is called graph partitioning (Section 4.1). The first algorithms for graph partitioning were proposed in the early 1970s.

In a seminal paper that appeared in 2002, Girvan and Newman proposed a new algorithm, aiming at the identification of edges lying between communities and their successive removal, a procedure that after some iterations leads to the isolation of the communities [12]. The intercommunity edges are detected according to the values of a centrality measure, the edge betweenness, that expresses the importance of the role of the edges in processes where signals are transmitted across the graph following paths of minimal length. The paper triggered a great deal of activity in the field, and many new methods have been proposed in recent years. In particular, physicists entered the game, bringing in their tools and techniques: spin models, optimization, percolation, random walks, synchronization, etc., became ingredients of new original algorithms. The field has also taken advantage of concepts and methods from computer science, nonlinear dynamics, sociology and discrete mathematics.
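The edge-removal idea can be illustrated with a short Python sketch. It is an illustration under assumptions rather than the authors' implementation: the graph is taken to be undirected, unweighted and free of self-loops, shortest-path edge betweenness is computed with a breadth-first accumulation in the style of Brandes, and the function names (edge_betweenness, components, girvan_newman_split) are hypothetical. The sketch performs a single split, stopping as soon as the number of connected components increases.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness; adj maps each vertex to a set of neighbors."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist, order = {s: 0}, []
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path fractions from the farthest vertices back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    # Every vertex pair was counted from both of its endpoints.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    """Return the connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)  # recomputed after every removal
        if not bet:
            break  # no edges left to remove
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Illustrative usage: two triangles joined by the edge (2, 3).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(sorted(sorted(c) for c in girvan_newman_split(graph)))  # [[0, 1, 2], [3, 4, 5]]
```

In this toy graph all nine shortest paths between the two triangles run through the edge (2, 3), so it has the largest betweenness and is removed first.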

In this manuscript we try to cover in some detail the work done in this area. We shall pay special attention to the contributions made by physicists, but we shall also give proper credit to important results obtained by scholars of other disciplines. Section 2 introduces communities in real networks, and is meant to acquaint the reader with the problem and its relevance. In Section 3 we define the basic elements of community detection, i.e. the concepts of community and partition. Traditional clustering methods in computer and social sciences, i.e. graph partitioning, hierarchical, partitional and spectral clustering, are reviewed in Section 4. Modern methods, divided into categories based on the type of approach, are presented in Sections 5 to 10. Algorithms to find overlapping communities, and multiresolution and hierarchical techniques, are separately described in Sections 11 and 12, respectively, whereas Section 13 is devoted to the detection of communities evolving in time. We stress that our categorization of the algorithms is not sharp, because many algorithms may fall into more than one category: we tried to classify them based on what we believe is their main feature/purpose, even if other aspects may be present. Sections 14 and 15 are devoted to the issues of defining when community structure is significant, and deciding about the quality of algorithms’ performances. In Sections 16 and 17 we describe general properties of clusters found in real networks, and specific applications of clustering algorithms. Section 18 contains the summary of the review, along with a discussion about future research directions in this area. The review makes use of several concepts of graph theory, which are defined and explained in the Appendix. Readers not acquainted with these concepts are urged to read the Appendix first.

Communities in real-world networks

In this section we shall present some striking examples of real networks with community structure. In this way we shall see what communities look like and why they are important.

Social networks are paradigmatic examples of graphs with communities. The word community itself refers to a social context. People naturally tend to form groups, within their work environment, family, friends.

In Fig. 2 we show some examples of social networks. The first example (Fig. 2a) is Zachary’s network of karate

Elements of community detection

The problem of graph clustering, intuitive at first sight, is actually not well defined. The main elements of the problem themselves, i.e. the concepts of community and partition, are not rigorously defined, and require some degree of arbitrariness and/or common sense. Indeed, some ambiguities are hidden and there are often many equally legitimate ways of resolving them. Therefore, it is not surprising that there are plenty of recipes in the literature and that people do not even try to ground

Graph partitioning

The problem of graph partitioning consists of dividing the vertices into g groups of predefined size, such that the number of edges lying between the groups is minimal. The number of edges running between clusters is called the cut size. Fig. 9 presents the solution of the problem for a graph with fourteen vertices, for g=2 and clusters of equal size.
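To make the definition concrete, the following Python sketch computes the cut size of a given split and finds a minimal bisection (g=2, equal sizes) by exhaustive search. It is only an illustration: the function names (cut_size, min_bisection) are hypothetical, the number of vertices is assumed to be even, and the search is exponential in the number of vertices, so it is usable on tiny graphs only.

```python
from itertools import combinations


def cut_size(edges, group_of):
    """Number of edges whose endpoints lie in different groups."""
    return sum(1 for u, v in edges if group_of[u] != group_of[v])


def min_bisection(vertices, edges):
    """Exhaustively search for two equal halves with minimal cut size.

    Returns (cut, group_of), where group_of maps each vertex to 0 or 1.
    Assumes an even number of vertices (at least two).
    """
    vertices = list(vertices)
    half = len(vertices) // 2
    first, best = vertices[0], None
    # Pin the first vertex to group 0 so mirror-image splits are not revisited.
    for rest in combinations(vertices[1:], half - 1):
        group0 = {first, *rest}
        group_of = {v: (0 if v in group0 else 1) for v in vertices}
        c = cut_size(edges, group_of)
        if best is None or c < best[0]:
            best = (c, group_of)
    return best


# Illustrative usage: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cut, groups = min_bisection(range(6), edges)
print(cut, groups)  # 1 {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

Here the only bisection with cut size 1 separates the two triangles; every other equal split cuts at least three edges.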

Specifying the number of clusters of the partition is necessary. If one simply imposed a partition with the minimal cut size, and left the number of

Divisive algorithms

A simple way to identify communities in a graph is to detect the edges that connect vertices of different communities and remove them, so that the clusters get disconnected from each other. This is the philosophy of divisive algorithms. The crucial point is to find a property of intercommunity edges that could allow for their identification. Divisive methods do not introduce substantial conceptual advances with respect to traditional techniques, as they just perform hierarchical clustering on

Modularity-based methods

Newman–Girvan modularity Q (Section 3.3.2), originally introduced to define a stopping criterion for the algorithm of Girvan and Newman, has rapidly become an essential element of many clustering methods. Modularity is by far the most used and best known quality function. It represented one of the first attempts to achieve a first-principles understanding of the clustering problem, and it embeds in its compact form all essential ingredients and questions, from the definition of community, to the

Spectral algorithms

In Sections 4.1 and 4.4 we have learned that spectral properties of graph matrices are frequently used to find partitions. A paradigmatic example is spectral graph clustering, which makes use of the eigenvectors of Laplacian matrices (Section 4.4). We have also seen that Newman–Girvan modularity can be optimized by using the eigenvectors of the modularity matrix (Section 6.1.4). Most spectral methods have been introduced and developed in computer science and

Dynamic algorithms

This section describes methods employing processes running on the graph, focusing on spin–spin interactions, random walks and synchronization.

Methods based on statistical inference

Statistical inference [279] aims at deducing properties of data sets, starting from a set of observations and model hypotheses. If the data set is a graph, the model, based on hypotheses on how vertices are connected to each other, has to fit the actual graph topology. In this section we review those clustering techniques attempting to find the best fit of a model to the graph, where the model assumes that vertices have some sort of classification, based on their connectivity patterns. We

Alternative methods

In this section we describe some algorithms that do not fit in the previous categories, although some overlap is possible.

Raghavan et al. [322] have designed a simple and fast method based on label propagation. Vertices are initially given unique labels (e.g. their vertex labels). At each iteration, a sweep over all vertices, in random sequential order, is performed: each vertex takes the label shared by the majority of its neighbors. If there is no unique majority, one of the majority labels
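The label propagation sweep described above can be sketched in Python as follows. This is a hedged illustration, not the implementation of Raghavan et al.: the function name label_propagation and the max_sweeps cap are hypothetical, and the handling of ties (a vertex keeps its current label if it is among the most frequent ones, otherwise one of them is drawn uniformly at random) is an assumption made here.

```python
import random
from collections import Counter


def label_propagation(adj, seed=None, max_sweeps=100):
    """Return a dict vertex -> label; adj maps each vertex to a set of neighbors."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}  # every vertex starts with a unique label
    vertices = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(vertices)  # random sequential order for this sweep
        changed = False
        for v in vertices:
            if not adj[v]:
                continue  # isolated vertices keep their own label
            counts = Counter(labels[w] for w in adj[v])
            top = max(counts.values())
            majority = [lab for lab, c in counts.items() if c == top]
            # Assumed tie rule: keep the current label if it is already one
            # of the most frequent, otherwise pick one of them at random.
            if labels[v] in majority:
                continue
            labels[v] = rng.choice(majority)
            changed = True
        if not changed:
            break  # every vertex already carries a majority label
    return labels


# Illustrative usage: two disconnected triangles end up with two labels.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
print(label_propagation(graph, seed=0))
```

Vertices that share a label at the end form one community. Because of the random update order and tie handling, different seeds can yield different partitions on less clear-cut graphs.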

Methods to find overlapping communities

Most of the methods discussed in the previous sections aim at detecting standard partitions, i.e. partitions in which each vertex is assigned to a single community. However, in real graphs vertices are often shared between communities (Section 2), and the issue of detecting overlapping communities has become quite popular in the last few years. We devote this section to the main techniques to detect overlapping communities.

Multiresolution methods and cluster hierarchy

The existence of a resolution limit for Newman–Girvan modularity (Section 6.3) implies that the straight optimization of quality functions yields a coarse description of the cluster structure of the graph, at a scale which has a priori nothing to do with the actual scale of the clusters. In the absence of information on the cluster sizes of the graph, a method should be able to explore all possible scales, to make sure that it will eventually identify the right communities. Multiresolution

Detection of dynamic communities

The analysis of dynamic communities is still in its infancy. Studies in this direction have been mostly hindered by the fact that the problem of graph clustering is already controversial on single graph realizations, so it is understandable that most efforts still concentrate on the “static” version of the problem. Another difficulty is represented by the dearth of timestamped data on real graphs. Recently, several data sets have become available, making it possible to monitor the evolution in time of

Significance of clustering

Given a network, many partitions could represent meaningful clusterings in some sense, and it could be difficult for some methods to discriminate between them. Quality functions evaluate the goodness of a partition (Section 3.3.2), so one could say that high quality corresponds to meaningful partitions. But this is not necessarily true. In Section 6.3 we have seen that high values of the modularity of Newman and Girvan do not necessarily indicate that a graph has a definite cluster structure.
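As a concrete reference for the quality function mentioned above, the following Python sketch evaluates modularity for a given partition. It assumes the standard form for undirected, unweighted graphs, Q = sum over communities c of [ l_c/m - (d_c/(2m))^2 ], where m is the number of edges, l_c the number of edges inside c and d_c the total degree of the vertices of c; the function name modularity is hypothetical.

```python
def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping each vertex to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    inside = {}      # l_c: number of edges with both endpoints in c
    degree_sum = {}  # d_c: total degree of the vertices in c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Illustrative usage: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, split))                        # 5/14, about 0.357
print(modularity(edges, {v: "all" for v in range(6)}))  # 0.0
```

A single number of this kind says nothing by itself about whether the structure is significant, which is the point made in this section: the value has to be compared with what one would obtain on suitable null models.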

Testing algorithms

When a clustering algorithm is designed, it is necessary to test its performance, and compare it with that of other methods. In the previous sections we have said very little about the performance of the algorithms, other than their computational complexity. Indeed, the issue of testing algorithms has received very little attention in the literature on graph clustering. This is a serious limitation of the field. Because of that, it is still impossible to state which method (or subset of methods) is

General properties of real clusters

What are the general properties of partitions and clusters of real graphs? Many papers on graph clustering present applications to real systems. In spite of the variety of clustering methods that one could employ, in many cases partitions derived from different techniques are rather similar to each other, so the general properties of clusters do not depend much on the particular algorithm used. The analysis of clusters and their properties delivers a mesoscopic description of the

Applications on real-world networks

The ultimate goal of clustering algorithms is to infer properties of, and relationships between, vertices that are not available from direct observation/measurement. If the scientific community agrees on a set of reliable techniques, one could then proceed with careful investigations of systems in various domains. So far, most works in the literature on graph clustering focused on the development of new algorithms, and applications were limited to those few benchmark graphs that one

Outlook

Despite its remote origins and its great popularity in recent years, research on graph clustering has not yet given a satisfactory solution to the problem and leaves us with a number of important open issues. From our exposition it appears that the field has grown in a rather chaotic way, without a precise direction or guidelines. In some cases, interesting new ideas and tools have been presented, in others existing methods have been improved, becoming more accurate and/or faster.

What the

Acknowledgements

I am indebted to these people for giving useful suggestions and advice to improve this manuscript at various stages: A. Arenas, J. W. Berry, A. Clauset, P. Csermely, S. Gómez, S. Gregory, V. Gudkov, R. Guimerà, Y. Ispolatov, R. Lambiotte, A. Lancichinetti, J.-P. Onnela, G. Palla, M. A. Porter, F. Radicchi, J. J. Ramasco, C. Wiggins. I gratefully acknowledge ICTeCollective, grant number 238597 of the European Commission.

References (457)

  • L. Euler, Solutio problematis ad geometriam situs pertinentis, Commentarii Academiae Petropolitanae (1736)
  • B. Bollobas, Modern Graph Theory (1998)
  • S. Wasserman et al., Social Network Analysis (1994)
  • J. Scott, Social Network Analysis: A Handbook (2000)
  • R. Albert et al., Statistical mechanics of complex networks, Rev. Mod. Phys. (2002)
  • J.F.F. Mendes et al., Evolution of Networks: From Biological Nets to the Internet and WWW (2003)
  • M.E.J. Newman, The structure and function of complex networks, SIAM Rev. (2003)
  • R. Pastor-Satorras et al., Evolution and Structure of the Internet: A Statistical Physics Approach (2004)
  • A. Barrat et al., Dynamical Processes on Complex Networks (2008)
  • P. Erdös et al., On random graphs. I., Publ. Math. Debrecen (1959)
  • M. Girvan et al., Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA (2002)
  • M.E.J. Newman, Detecting community structure in networks, Eur. Phys. J. B (2004)
  • L. Danon et al.
  • S. Fortunato et al., Community structure in graphs
  • M.A. Porter et al., Communities in networks, Notices of the American Mathematical Society (2009)
  • J.S. Coleman, An Introduction to Mathematical Sociology (1964)
  • L.C. Freeman, The Development of Social Network Analysis: A Study in the Sociology of Science (2004)
  • C.P. Kottak, Cultural Anthropology (2004)
  • J. Moody et al., Structural cohesion and embeddedness: A hierarchical concept of social groups, Am. Sociol. Rev. (2003)
  • A.W. Rives et al., Modular organization of cellular networks, Proc. Natl. Acad. Sci. USA (2003)
  • V. Spirin et al., Protein complexes and functional modules in molecular networks, Proc. Natl. Acad. Sci. USA (2003)
  • J. Chen et al., Detecting functional modules in the yeast protein–protein interaction network, Bioinformatics (2006)
  • G.W. Flake et al., Self-organization and identification of web communities, IEEE Computer (2002)
  • Y. Dourisboure et al., Extraction and classification of dense communities in the web
  • R. Guimerà et al., Functional cartography of complex metabolic networks, Nature (2005)
  • G. Palla et al., Uncovering the overlapping community structure of complex networks in nature and society, Nature (2005)
  • A.E. Krause et al., Compartments revealed in food-web structure, Nature (2003)
  • B. Krishnamurthy et al., On network-aware clustering of web clients, SIGCOMM Comput. Commun. Rev. (2000)
  • K.P. Reddy et al., A graph based approach to extract a neighborhood customer community for collaborative filtering
  • R. Agrawal et al., Algorithms for searching massive graphs, Knowl. Data Eng. (1994)
  • A.Y. Wu et al., Mining scale-free networks using geodesic clustering
  • C.E. Perkins, Ad Hoc Networking (2001)
  • M. Steenstrup, Cluster-Based Networks (2001)
  • M. Granovetter, The strength of weak ties, Am. J. Sociol. (1973)
  • R.S. Burt, Positions in networks, Soc. Forces (1976)
  • L.C. Freeman, A set of measures of centrality based on betweenness, Sociometry (1977)
  • D. Gfeller et al., Spectral coarse graining of complex networks, Phys. Rev. Lett. (2007)
  • D. Gfeller et al., Spectral coarse graining and synchronization in oscillator networks, Phys. Rev. Lett. (2008)
  • P. Pollner et al., Preferential attachment of communities: The same principle, but a higher level, Europhys. Lett. (2006)
  • H. Simon, The architecture of complexity, Proc. Am. Phil. Soc. (1962)