Structure & Strangeness

 

"Fast Modularity" Community Structure Inference Algorithm

This page documents and supports the fast modularity maximization algorithm I developed jointly with Mark Newman and Cristopher Moore. This algorithm is being widely used in the community of complex network researchers, and was originally designed for the express purpose of analyzing the community structure of extremely large networks (i.e., hundreds of thousand or millions of vertices). The original version worked only with unweighted, undirected networks. I've recently posted a version that works on weighted, undirected networks.

Update February 2007: Please see my recent blog entry about this algorithm.

Update May 2007: See also the igraph library, a pretty comprehensive library of network utility functions (generation and analysis), including an implementation of the fast modularity algorithm described here (along with a few other nice clustering heuristics).

Update October 2008: I've finally gotten around to posting a version that works with weighted networks, which is available here. This version wants a .wpairs file as input, which is an edge list with integer weights, e.g., "54\t91\t3\n" would be an edge with weight 3. Otherwise, it should work just the same as the unweighted version.

Update December 2009: If you're going to use this algorithm, you should also read this paper, which describes the performance (pro and con) of modularity maximization in practical contexts.

B. H. Good, Y.-A. de Montjoye and A. Clauset, "The performance of modularity maximization in practical contexts."
Phys. Rev. E 81, 046106 (2010).

Journal Reference
A. Clauset, M.E.J. Newman and C. Moore, "Finding community structure in very large networks."
Phys. Rev. E 70, 066111 (2004).

My request
For the past year, I have passed out the code personally and have asked each person who uses it to please send me a pre-print of any papers they produce with the code. I ask the same of each of you, as it's nice to know what projects this algorithm is being used in. You may email me directly at aaron@xx.xxx.edu with the rest of the address being obvious.

Get the code
The code is available in both .tgz and .zip formats. It contains the following files (with corresponding descriptions):

gpl.txt GPL (version 2)
Makefile makefile for compiling the executable
fastcommunity_mh.cc main algorithm file - this has the heart of the algorithm described in the paper.
maxheap.h max-heap datastructure for storing the dQ values (linked to vektor data structure)
vektor.h sparse matrix row datastructure for storing dQ values (linked to maxheap data structure)

Update October 2008: A version that works with weighted networks is available here. This version wants a .wpairs file as input, which is an edge list with integer weights, e.g., "54\t91\t3\n" would be an edge with weight 3. Otherwise, it should work just the same as the unweighted version.

Compiling
The Makefile provided should be sufficient to compile the executable (built in ANSI C/C++ with platform independent code).

Input file requirements
The program was written to handle networks in the form of a flat text file containing edge adjacencies. I call this file format a .pairs file. Specifically, the network topology must conform to the following requirements:
• .pairs is a list of tab-delimited pairs of numeric indices, e.g., "54\t91\n"
• the network described is a SINGLE COMPONENT
• there are NO SELF-LOOPS or MULTI-EDGES in the file
• the MINIMUM NODE ID = 0 in the input file
• the MAXIMUM NODE ID can be anything, but the program will use less memory if nodes are labeled sequentially

This file format may seem a bit peculiar, but it was sufficient for the demonstration that the algorithm performs as promised. You are free to alter the file import function readInputFile() to fit your needs.

An example input file, for Zachary's karate club network is here.

Common error messages
The most common error message returned by the program is of the form "WARNING: invalid join (X Y) found at top of heap". If you receive this error message, it is almost surely because your input .pairs file does not meet the above requirements. In particular, it likely contains self-loops, multi-edges or is not a single connected component. If you receive this message, try fixing these issues with your network and trying again.

Running the program
The Makefile provided should be sufficient to compile the executable (built in ANSI C/C++ with platform independent code). This usage information can be retrieved by running the executable with no arguments:

-f <filename> give the target .pairs file to be processed
-l <text> the text label for this run; used to build output filenames
-t <int> timer period for reporting progress of file input to screen
-s calculate and record the support of the dQ matrix
-v --v ---v differing levels of screen output verbosity
-c <int> record the aglomerated network at step <int> (typically, this is the step location at which the modularity is known to be maximum)

(Please see the notes in the .cc file for the most up-to-date version of this information.) The typical usage will be to first create the .pairs file containing your network, to run the program like

./fastcommunity_mh -f myNetwork.pairs -l firstRun

and then consult the file outputs as described below. If you want to then examine the communities that have been built by the algorithm, you would run the algorithm a second time like so

./fastcommunity_mh -f myNetwork.pairs -l secondRun -c X

where X is the integer value for the maximum modularity that is reported in the .info file. Again, this could probably be automated, either by modifying the code, or wrapping it all in another script.

File outputs
Running the program on some input network will produce a set of files, each with a common naming convention, where the file type is encoded by the suffix. This information can also be retrieved by running the executable with the argument -files.

Mandatory file outputs
.info Various information about the program's running. Includes a listing of the files it generates, number of vertices and edges processed, the maximum modularity found and the corresponding step (you can re-run the program with this value in the -c argument to have it output the contents of the clusters, etc. when it reaches that step again (not the most efficient solution, but it works)), start/stop time, and when -c is used, it records some information about the distribution of cluster sizes.
.joins The dendrogram and modularity information from the algorithm. The file format is tab-delimited columns of data, where the columns are:
1. the community index which absorbs
2. the community index which was absorbed
3. the modularity value Q after the join
4. the time step of the join

Optional file outputs (generated at step t=C when the -c C argument is used:

.wpairs The connectivity of the clustered graph in a .wpairs file format (i.e., weighted edges). The edge weights should be the dQ values associated with that clustered edge at time C. From this format, it's easy to convert into another for visualization (e.g., pajek's .net format).
.hist The size distribution of the clusters.
.groups A list of each group and the names of the vertices which compose it (this is particularly useful for verifying that the clustering makes sense - tedious but important).

Bugs and incompleteness
All of the features documented on this page work as advertised (although, if you find any bugs, please let me know). There are some other, undocumented features that may not work fully (if you read through the code, you may spot some of these), so use them at your own risk.

Code maintenence
I am not actively working on this code since this project complete with its publication in 2004. However, I'll try to give as much verbal support to interested users as is reasonable.

 

 

: Creative :
.: Photography :.
.: Artistic :.
.: Blog :.
.: Thinking :.
.: Research :.

: Persona :
.: About :.
.: .plan :.
.: Vitae :.

: Website :
.: Search :.
.: Copyright :.
.: Sitemap :.
.: Links :.

© Aaron Clauset

05 October 2005