"Fast Modularity"
Community Structure Inference Algorithm
This page documents and supports the fast modularity
maximization algorithm I developed jointly with Mark
Newman and Cristopher
Moore. This algorithm is being widely used in the community
of complex network researchers, and was originally designed for
the express purpose of analyzing the community structure of extremely
large networks (i.e., hundreds of thousand or millions of vertices).
The original version worked only with unweighted, undirected networks.
I've recently posted a version that works on weighted, undirected
networks.
Update February 2007: Please see my recent blog entry about this algorithm.
Update May 2007: See also the igraph library, a pretty comprehensive library of network utility functions (generation and analysis), including an implementation of the fast modularity algorithm described here (along with a few other nice clustering heuristics).
Update October 2008: I've finally gotten around to posting a version that works with weighted networks, which is available here. This version wants a .wpairs file as input, which is an edge list with integer weights, e.g., "54\t91\t3\n" would be an edge with weight 3. Otherwise, it should work just the same as the unweighted version.
Update December 2009: If you're going to use this algorithm, you should also read this paper, which describes the performance (pro and con) of modularity maximization in practical contexts.
B. H. Good, Y.-A. de Montjoye and A. Clauset, "The performance of modularity maximization in practical contexts." Phys. Rev. E 81, 046106 (2010).
Journal Reference
A. Clauset, M.E.J. Newman and C. Moore, "Finding
community structure in very large networks."
Phys. Rev. E 70, 066111 (2004).
My request
For the past year, I have passed out the code personally and have
asked each person who uses it to please send me a pre-print of
any papers they produce with the code. I ask the same of each
of you, as it's nice to know what projects this algorithm is being
used in. You may email me directly at aaron@xx.xxx.edu with the
rest of the address being obvious.
Get the code
The code is available in both .tgz and .zip formats.
It contains the following files (with corresponding descriptions):
gpl.txt |
GPL (version 2) |
Makefile |
makefile for compiling the executable |
fastcommunity_mh.cc |
main algorithm file - this has the heart of
the algorithm described in the paper. |
maxheap.h |
max-heap datastructure for storing the dQ values
(linked to vektor data structure) |
vektor.h |
sparse matrix row datastructure for storing
dQ values (linked to maxheap data structure) |
Update October 2008: A version that works with weighted networks is available here. This version wants a .wpairs file as input, which is an edge list with integer weights, e.g., "54\t91\t3\n" would be an edge with weight 3. Otherwise, it should work just the same as the unweighted version.
Compiling
The Makefile provided should be sufficient to compile the executable
(built in ANSI C/C++ with platform independent code).
Input file requirements
The program was written to handle networks in the form of a flat
text file containing edge adjacencies. I call this file format
a .pairs file. Specifically, the network topology must conform
to the following requirements:
• .pairs is a list of tab-delimited pairs of numeric indices,
e.g., "54\t91\n"
• the network described is a SINGLE COMPONENT
• there are NO SELF-LOOPS or MULTI-EDGES in the file
• the MINIMUM NODE ID = 0 in the input file
• the MAXIMUM NODE ID can be anything, but the program will
use less memory if nodes are labeled sequentially
This file format may seem a bit peculiar, but
it was sufficient for the demonstration that the algorithm performs
as promised. You are free to alter the file import function readInputFile()
to fit your needs.
An example input file, for Zachary's karate club network
is here.
Common error messages
The most common error message returned by the program is of the form "WARNING: invalid join (X Y) found at top of heap". If you receive this error message, it is almost surely because your input .pairs file does not meet the above requirements. In particular, it likely contains self-loops, multi-edges or is not a single connected component. If you receive this message, try fixing these issues with your network and trying again.
Running the program
The Makefile provided should be sufficient to compile the executable
(built in ANSI C/C++ with platform independent code). This usage
information can be retrieved by running the executable with no
arguments:
-f <filename> |
give the target .pairs file to be processed |
-l <text> |
the text label for this run; used to build
output filenames |
-t <int> |
timer period for reporting progress of file
input to screen |
-s |
calculate and record the support of the dQ
matrix |
-v --v ---v |
differing levels of screen output verbosity |
-c <int> |
record the aglomerated network at step <int>
(typically, this is the step location at which the modularity
is known to be maximum) |
(Please see the notes in the .cc file for the most up-to-date version
of this information.) The typical usage will be to first create the
.pairs file containing
your network, to run the program like
• ./fastcommunity_mh -f myNetwork.pairs -l firstRun
and then consult the file outputs as described below. If you
want to then examine the communities that have been built by the
algorithm, you would run the algorithm a second time like so
• ./fastcommunity_mh -f myNetwork.pairs -l secondRun
-c X
where X is the integer value for the maximum modularity that
is reported in the .info file. Again, this could probably be automated,
either by modifying the code, or wrapping it all in another script.
File outputs
Running the program on some input network will produce a set of
files, each with a common naming convention, where the file type
is encoded by the suffix. This information can also be retrieved
by running the executable with the argument -files.
Mandatory file outputs
.info |
Various information about the program's running.
Includes a listing of the files it generates, number of vertices
and edges processed, the maximum modularity found and the
corresponding step (you can re-run the program with this value
in the -c argument to have it output the contents of the clusters,
etc. when it reaches that step again (not the most efficient
solution, but it works)), start/stop time, and when -c is
used, it records some information about the distribution of
cluster sizes. |
.joins |
The dendrogram and modularity information
from the algorithm. The file format is tab-delimited columns
of data, where the columns are:
1. the community index which absorbs
2. the community index which was absorbed
3. the modularity value Q after the join
4. the time step of the join |
Optional file outputs (generated at step t=C when the -c
C argument is used:
.wpairs |
The connectivity of the clustered graph in
a .wpairs file format (i.e., weighted edges). The edge weights
should be the dQ values associated with that clustered edge
at time C. From this format, it's easy to convert into another
for visualization (e.g., pajek's .net format). |
.hist |
The size distribution of the clusters. |
.groups |
A list of each group and the names of the vertices
which compose it (this is particularly useful for verifying
that the clustering makes sense - tedious but important). |
Bugs and incompleteness
All of the features documented on this page work as advertised
(although, if you find any bugs, please let me know). There are
some other, undocumented features that may not work fully (if
you read through the code, you may spot some of these), so use
them at your own risk.
Code maintenence
I am not actively working on this code since this project complete
with its publication in 2004. However, I'll try to give as much
verbal support to interested users as is reasonable.
|