|   "Fast Modularity" 
                Community Structure Inference Algorithm This page documents and supports the fast modularity 
                maximization algorithm I developed jointly with Mark 
                Newman and Cristopher 
                Moore. This algorithm is being widely used in the community 
                of complex network researchers, and was originally designed for 
                the express purpose of analyzing the community structure of extremely 
                large networks (i.e., hundreds of thousand or millions of vertices). 
                The original version worked only with unweighted, undirected networks. 
                I've recently posted a version that works on weighted, undirected 
                networks. Update February 2007: Please see my recent blog entry about this algorithm. Update May 2007: See also the igraph library, a pretty comprehensive library of network utility functions (generation and analysis), including an implementation of the fast modularity algorithm described here (along with a few other nice clustering heuristics). Update October 2008: I've finally gotten around to posting a version that works with weighted networks, which is available here. This version wants a .wpairs file as input, which is an edge list with integer weights, e.g., "54\t91\t3\n" would be an edge with weight 3. Otherwise, it should work just the same as the unweighted version. Update December 2009: If you're going to use this algorithm, you should also read this paper, which describes the performance (pro and con) of modularity maximization in practical contexts.
B. H. Good, Y.-A. de Montjoye and A. Clauset, "The performance of modularity maximization in practical contexts." Phys. Rev. E 81, 046106 (2010).

Journal Reference
A. Clauset, M. E. J. Newman and C. Moore, "Finding community structure in very large networks." Phys. Rev. E 70, 066111 (2004).
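For reference, the quantity this algorithm maximizes is the modularity Q, and the dQ values mentioned in the file descriptions below are the changes in Q produced by joining a pair of communities; at each step the algorithm greedily performs the join with the largest dQ. The standard definition is summarized here only for convenience (see the papers above for the full treatment), with A the adjacency matrix, k_i the degree of vertex i, m the number of edges, and c_i the community containing vertex i:

Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)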
 
My request
For the past year, I have passed out the code personally and have asked each person who uses it to please send me a pre-print of any papers they produce with the code. I ask the same of each of you, as it's nice to know what projects this algorithm is being used in. You may email me directly at aaron@xx.xxx.edu, with the rest of the address being obvious.
Get the code
The code is available in both .tgz and .zip formats. It contains the following files (with corresponding descriptions):

| gpl.txt | GPL (version 2) |
| Makefile | makefile for compiling the executable |
| fastcommunity_mh.cc | main algorithm file - this has the heart of the algorithm described in the paper |
| maxheap.h | max-heap data structure for storing the dQ values (linked to the vektor data structure) |
| vektor.h | sparse matrix row data structure for storing the dQ values (linked to the maxheap data structure) |

(As noted in the October 2008 update above, a version that works with weighted networks, taking a .wpairs file as input, is available here.)
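For the weighted version mentioned above, the only difference on the input side is the .wpairs format: tab-delimited "u\tv\tw" lines with integer weights. As a small illustration of producing such a file, the sketch below (not part of the distributed code; the program name pairs_to_wpairs and all filenames are placeholders) collapses repeated edges in a raw edge list into integer weights. It assumes the other input requirements described below (single component, no self-loops) still have to be satisfied separately.

// pairs_to_wpairs.cc -- illustrative sketch (not part of the fastcommunity_mh
// distribution). Collapses repeated edges in a raw edge list into integer
// weights and writes a .wpairs file ("u\tv\tw" per line).
#include <cstdio>
#include <map>
#include <utility>

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s <raw edge list> <output .wpairs>\n", argv[0]);
        return 1;
    }
    std::FILE* in = std::fopen(argv[1], "r");
    if (!in) { std::perror("fopen"); return 1; }

    // count the multiplicity of each undirected edge (canonical order u < v)
    std::map<std::pair<long,long>, long> weight;
    long u, v;
    while (std::fscanf(in, "%ld %ld", &u, &v) == 2) {
        if (u == v) continue;                 // skip self-loops; not allowed as input
        if (u > v) { long t = u; u = v; v = t; }
        weight[std::make_pair(u, v)] += 1;    // multi-edges become integer weights
    }
    std::fclose(in);

    // write tab-delimited "u\tv\tw" lines, e.g., "54\t91\t3"
    std::FILE* out = std::fopen(argv[2], "w");
    if (!out) { std::perror("fopen"); return 1; }
    for (std::map<std::pair<long,long>, long>::iterator it = weight.begin();
         it != weight.end(); ++it)
        std::fprintf(out, "%ld\t%ld\t%ld\n", it->first.first, it->first.second, it->second);
    std::fclose(out);
    return 0;
}

For a network that is already simple (no repeated edges), every weight will simply come out as 1.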
Compiling
The Makefile provided should be sufficient to compile the executable (built in ANSI C/C++ with platform-independent code).
Input file requirements
The program was written to handle networks in the form of a flat text file containing edge adjacencies. I call this file format a .pairs file. Specifically, the network topology must conform to the following requirements:
• .pairs is a list of tab-delimited pairs of numeric indices, e.g., "54\t91\n"
• the network described is a SINGLE COMPONENT
• there are NO SELF-LOOPS or MULTI-EDGES in the file
• the MINIMUM NODE ID = 0 in the input file
• the MAXIMUM NODE ID can be anything, but the program will use less memory if nodes are labeled sequentially

This file format may seem a bit peculiar, but it was sufficient for the demonstration that the algorithm performs as promised. You are free to alter the file import function readInputFile() to fit your needs. An example input file, for Zachary's karate club network, is here.

Common error messages
The most common error message returned by the program is of the form "WARNING: invalid join (X Y) found at top of heap". If you receive this message, it is almost surely because your input .pairs file does not meet the above requirements; in particular, it likely contains self-loops or multi-edges, or is not a single connected component. If you see this warning, fix these issues in your network and run the program again.
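Since raw network data often violates one or more of these requirements, a small preprocessing step is usually needed before running the program. The following is a minimal, illustrative sketch (not part of the distributed code; the program name clean_pairs and the filenames are placeholders) that reads a whitespace-delimited edge list, drops self-loops and multi-edges, keeps only the largest connected component, relabels the surviving vertices sequentially from 0, and writes a .pairs file:

// clean_pairs.cc -- illustrative preprocessing sketch (not part of the
// fastcommunity_mh distribution). Reads a whitespace-delimited edge list,
// removes self-loops and multi-edges, keeps only the largest connected
// component, relabels the surviving vertices 0..n-1, and writes a .pairs file.
#include <cstdio>
#include <map>
#include <set>
#include <utility>
#include <vector>

// union-find root lookup with path halving
static int findRoot(std::vector<int>& parent, int x) {
    while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
    return x;
}

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s <raw edge list> <output .pairs>\n", argv[0]);
        return 1;
    }
    std::FILE* in = std::fopen(argv[1], "r");
    if (!in) { std::perror("fopen"); return 1; }

    // read edges, dropping self-loops; the set removes multi-edges
    std::set<std::pair<long,long> > edges;
    long a, b;
    while (std::fscanf(in, "%ld %ld", &a, &b) == 2) {
        if (a == b) continue;
        if (a > b) { long t = a; a = b; b = t; }
        edges.insert(std::make_pair(a, b));
    }
    std::fclose(in);

    // map original labels to dense indices 0..n-1
    std::map<long,int> index;
    for (std::set<std::pair<long,long> >::iterator e = edges.begin(); e != edges.end(); ++e) {
        if (index.count(e->first)  == 0) { int id = (int)index.size(); index[e->first]  = id; }
        if (index.count(e->second) == 0) { int id = (int)index.size(); index[e->second] = id; }
    }

    // union-find to identify connected components
    std::vector<int> parent(index.size());
    for (size_t i = 0; i < parent.size(); ++i) parent[i] = (int)i;
    for (std::set<std::pair<long,long> >::iterator e = edges.begin(); e != edges.end(); ++e) {
        int u = findRoot(parent, index[e->first]);
        int v = findRoot(parent, index[e->second]);
        if (u != v) parent[u] = v;
    }

    // find the root of the largest component
    std::map<int,int> compSize;
    for (size_t i = 0; i < parent.size(); ++i) compSize[findRoot(parent, (int)i)] += 1;
    int bigRoot = -1, bigSize = 0;
    for (std::map<int,int>::iterator s = compSize.begin(); s != compSize.end(); ++s)
        if (s->second > bigSize) { bigRoot = s->first; bigSize = s->second; }

    // keep only edges in the largest component, relabel from 0, and write the .pairs file
    std::map<int,int> relabel;
    std::FILE* out = std::fopen(argv[2], "w");
    if (!out) { std::perror("fopen"); return 1; }
    for (std::set<std::pair<long,long> >::iterator e = edges.begin(); e != edges.end(); ++e) {
        int u = index[e->first], v = index[e->second];
        if (findRoot(parent, u) != bigRoot) continue;
        if (relabel.count(u) == 0) { int id = (int)relabel.size(); relabel[u] = id; }
        if (relabel.count(v) == 0) { int id = (int)relabel.size(); relabel[v] = id; }
        std::fprintf(out, "%d\t%d\n", relabel[u], relabel[v]);
    }
    std::fclose(out);
    return 0;
}

Compile it with your usual C++ compiler and run it as, e.g., ./clean_pairs rawEdges.txt myNetwork.pairs (both filenames are placeholders).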
Running the program
The following usage information can be retrieved by running the executable with no arguments:
 
                 
| -f <filename> | give the target .pairs file to be processed |
| -l <text> | the text label for this run; used to build output filenames |
| -t <int> | timer period for reporting progress of file input to screen |
| -s | calculate and record the support of the dQ matrix |
| -v --v ---v | differing levels of screen output verbosity |
| -c <int> | record the agglomerated network at step <int> (typically, this is the step at which the modularity is known to be maximum) |

(Please see the notes in the .cc file for the most up-to-date version of this information.)

The typical usage is first to create the .pairs file containing your network, then to run the program like
• ./fastcommunity_mh -f myNetwork.pairs -l firstRun
and then consult the file outputs as described below. If you then want to examine the communities that the algorithm has built, run the program a second time like so
• ./fastcommunity_mh -f myNetwork.pairs -l secondRun -c X
where X is the integer step at which the maximum modularity is reported in the .info file. Again, this could probably be automated, either by modifying the code or by wrapping it all in another script; a sketch of such a script is given after the description of the .joins file below.

File outputs

Mandatory file outputs
Running the program on some input network will produce a set of files, each with a common naming convention, where the file type is encoded by the suffix. This information can also be retrieved by running the executable with the argument -files.
 
 
                 
| .info | Various information about the program's run. Includes a listing of the files it generates, the number of vertices and edges processed, the maximum modularity found and the corresponding step (you can re-run the program with this value in the -c argument to have it output the contents of the clusters, etc. when it reaches that step again; not the most efficient solution, but it works), start/stop time, and, when -c is used, some information about the distribution of cluster sizes. |
| .joins | The dendrogram and modularity information from the algorithm. The file format is tab-delimited columns of data, where the columns are:
1. the community index which absorbs
2. the community index which was absorbed
3. the modularity value Q after the join
4. the time step of the join |
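Since the second run described above needs the step at which Q is maximized, that lookup can be scripted rather than read by hand from the .info file. The following is a small illustrative sketch (not part of the distributed code; the program name best_step is a placeholder) that scans a .joins file, using columns 3 and 4 as described above, and prints the step with the largest modularity, which can then be passed to the -c argument:

// best_step.cc -- illustrative sketch (not part of the fastcommunity_mh
// distribution). Scans a .joins file (tab-delimited columns: absorbing
// community, absorbed community, modularity Q after the join, time step)
// and prints the step at which Q is largest.
#include <cstdio>

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::fprintf(stderr, "usage: %s <.joins file>\n", argv[0]);
        return 1;
    }
    std::FILE* in = std::fopen(argv[1], "r");
    if (!in) { std::perror("fopen"); return 1; }

    long absorber, absorbed, step, bestStep = -1;
    double q, bestQ = -2.0;
    // column order per the .joins description: absorber, absorbed, Q, step
    while (std::fscanf(in, "%ld %ld %lf %ld", &absorber, &absorbed, &q, &step) == 4) {
        if (q > bestQ) { bestQ = q; bestStep = step; }
    }
    std::fclose(in);

    if (bestStep < 0) { std::fprintf(stderr, "no joins found\n"); return 1; }
    std::printf("maximum Q = %f at step %ld\n", bestQ, bestStep);
    std::printf("re-run with: ./fastcommunity_mh -f <your .pairs file> -l <label> -c %ld\n", bestStep);
    return 0;
}

The printed step is the value to pass to -c as described above. If your version of the program writes any extra header or summary lines into the .joins file, adjust the parsing accordingly.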
Optional file outputs (generated at step t=C when the -c C argument is used):

| .wpairs | The connectivity of the clustered graph in .wpairs file format (i.e., weighted edges). The edge weights should be the dQ values associated with that clustered edge at time C. From this format, it's easy to convert into another format for visualization (e.g., pajek's .net format); a conversion sketch is given below, after the Bugs and incompleteness notes. |
| .hist | The size distribution of the clusters. |
| .groups | A list of each group and the names of the vertices which compose it (this is particularly useful for verifying that the clustering makes sense; tedious but important). |

Bugs and incompleteness
All of the features documented on this page work as advertised (although, if you find any bugs, please let me know). There are some other, undocumented features that may not work fully (if you read through the code, you may spot some of these), so use them at your own risk.
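As a companion to the .wpairs output described above (produced with -c), here is an illustrative sketch of the kind of conversion mentioned there: it reads a weighted edge list and writes a Pajek .net file. This is not part of the distributed code; the program name wpairs_to_pajek and the filenames are placeholders. It assumes vertex IDs in the .wpairs file are integers and that the weights may be real-valued (as the dQ values are); Pajek numbers vertices from 1, so the IDs are remapped accordingly.

// wpairs_to_pajek.cc -- illustrative sketch (not part of the fastcommunity_mh
// distribution). Converts a weighted edge list (.wpairs: "u\tv\tw" per line)
// into Pajek's .net format for visualization. Vertex IDs are remapped to the
// 1-based indices that Pajek expects.
#include <cstdio>
#include <map>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s <input .wpairs> <output .net>\n", argv[0]);
        return 1;
    }
    std::FILE* in = std::fopen(argv[1], "r");
    if (!in) { std::perror("fopen"); return 1; }

    // read the weighted edges and collect the set of vertex labels
    std::vector<long> us, vs;
    std::vector<double> ws;
    std::map<long,int> pajekId;               // original label -> 1-based Pajek id
    long u, v;
    double w;
    while (std::fscanf(in, "%ld %ld %lf", &u, &v, &w) == 3) {
        us.push_back(u); vs.push_back(v); ws.push_back(w);
        if (pajekId.count(u) == 0) { int id = (int)pajekId.size() + 1; pajekId[u] = id; }
        if (pajekId.count(v) == 0) { int id = (int)pajekId.size() + 1; pajekId[v] = id; }
    }
    std::fclose(in);

    std::FILE* out = std::fopen(argv[2], "w");
    if (!out) { std::perror("fopen"); return 1; }

    // vertex section: "*Vertices n" followed by one 'id "label"' line per vertex
    std::fprintf(out, "*Vertices %d\n", (int)pajekId.size());
    for (std::map<long,int>::iterator it = pajekId.begin(); it != pajekId.end(); ++it)
        std::fprintf(out, "%d \"%ld\"\n", it->second, it->first);

    // edge section: "*Edges" followed by "u v weight" lines (undirected)
    std::fprintf(out, "*Edges\n");
    for (size_t i = 0; i < us.size(); ++i)
        std::fprintf(out, "%d %d %g\n", pajekId[us[i]], pajekId[vs[i]], ws[i]);

    std::fclose(out);
    return 0;
}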
Code maintenance
I am not actively working on this code, since this project was completed with its publication in 2004. However, I'll try to give as much verbal support to interested users as is reasonable.