August 18, 2012
Wanting to hire
Postdoctoral Fellowship in Study of Networks
Along with Cris Moore, I am looking to hire a postdoc in the area of complex networks and statistical inference. There are two such positions available, one located at the Santa Fe Institute (working with Cris) and one at the University of Colorado Boulder (working with me). Both are funded by a grant from DARPA to investigate the use of generative models to perform statistical inference in complex networks. The larger team includes Mark Newman at the University of Michigan, and there will be ample opportunity to travel and collaborate among the three institutions.
The grant has a particular emphasis on community detection methods, including methods for detecting changes in community structure in dynamic graphs; functional groups that are not merely "clumps" of densely connected nodes; predicting missing links and identifying spurious ones; building on incomplete or noisy information about the network, generalizing from known attributes of nodes and edges to unknown ones; and identifying surprising or anomalous structures or events.
If you are interested in applying, or know someone who has a strong quantitative background in physics, statistics or computer science, see the application information.
The application deadline is 13 January 2013 with an anticipated start date of May or August 2013.
March 29, 2010
The trouble with community detection
Attention conservation notice: this is a posting about a talk I'm giving tomorrow at Dalhousie University in Nova Scotia.
For most of this week, I'll be visiting the math department of Dalhousie University in Halifax Nova Scotia, as a speaker in the Modelling and Mining of Network Information Spaces seminar series and a guest of Jeannette Janssen.
For my part, I'm giving a talk (see below) on the results of my summer student Ben Good's project on the difficulties of identifying dense "communities" (or "modules", or "compartments") in networks using topological information alone. I'm pleased to say that this paper was recently accepted at Physical Review E. 
The problem of detecting communities in networks has received an enormous amount of attention (more than it deserves, in my opinion), and there are now literally dozens of reasonable-sounding ways to find the "clusters" in networks. To give you a sense of just how much attention, the first few papers in the field have received hundreds of citations and a few have even received thousands. And yet, I'm increasingly skeptical that all this effort has produced much of lasting value. On the up side, it's produced lots of clever methodological tricks and insights, and certainly I've enjoyed chewing on these problems myself . But, I'm increasingly pessimistic about the goal of automatically extracting meaningful "clusters" from interaction data alone. In short, I don't believe there is a universally useful definition of a network cluster and I'm skeptical that any of the community detection methods currently available actually produce results that can be trusted.
Current methods do okay on trivial test cases of various kinds, but they all have methodological problems (some of them quite severe) that make it difficult to unambiguously interpret the scientific significance of their output. And, every method makes assumptions that are almost surely highly unrealistic for almost any system you might care to think about. On the other hand, some standard data analysis methods have similar problems (e.g., hierarchical clustering algorithms for spatial data) but still manage to be useful. I think this is partly because we understand pretty well how these methods fail, and thus how their output should be interpreted and under what conditions they can be expected to perform unambiguously. I don't think we're there yet with network clustering methods, but perhaps one day we'll get there.
If you're in the Halifax area and are interested in the talk, here are the details:
Date: Tuesday March 30, 2010 at 2:30 p.m.
Location: Jacob Slonim Conference Room (430), 6050 University Ave., Halifax
Coffee and cookies will be provided, courtesy of Faculty of Computer Science.
The trouble with community detection
Although widely used in practice, the performance of the popular network clustering technique called "modularity maximization" is not well understood when applied to networks with unknown modular structure. In this talk, I'll show that precisely in the case we want it to perform the best--that is, on modular networks--the modularity function Q exhibits extreme degeneracies, in which the global maximum is hidden among an exponential number of high-modularity solutions. Further, these degenerate solutions can be structurally very dissimilar, suggesting that any particular high-modularity partition, or statistical summary of its structure, should not be taken as representative of the other degenerate solutions. These results partly explain why so many heuristics do well at finding high-modularity partitions and why different heuristics can disagree on the modular composition the same network. I'll conclude with some forward-looking thoughts about the general problem of identifying network modules from connectivity data alone, and the likelihood of circumventing this degeneracy problem.
Update 31 March 2010: For those of you interested in reproducing our results or applying our methods to your own networks, Ben has placed implementations online here for his simulated annealing code for sampling the local optima of the modularity function and his code for taking those sampled optima and reconstructing the 3D visualization of the modularity landscape.
Update 15 April 2010: Updated the journal ref.
 B. H. Good, Y.-A. de Montjoye and A. Clauset. " The performance of modularity maximization in practical contexts." Physical Review E 81, 046106 (2010).
 My most cited paper, by far, is my first paper on detecting communities by maximizing modularity using a greedy agglomerative algorithm.
November 03, 2009
The trouble with community detection
I'm a little (a month!) late in posting it, but here's a new paper, largely by my summer student Ben Good, about the trouble with community detection algorithms.
The short story is that the popular quality function called "modularity" (invented by Mark Newman and Michelle Girvan) admits serious degeneracies that make it somewhat impractical to use in situations where the network is large or has a non-trivial number of communities (a.k.a. modules). At the end of the paper, we briefly survey some ways to potentially mitigate this problem in practical contexts.
Benjamin H. Good, Yves-Alexandre de Montjoye, Aaron Clauset, arxiv:0910.0165 (2009).
Although widely used in practice, the behavior and accuracy of the popular module identification technique called modularity maximization is not well understood. Here, we present a broad and systematic characterization of its performance in practical situations. First, we generalize and clarify the recently identified resolution limit phenomenon. Second, we show that the modularity function Q exhibits extreme degeneracies: that is, the modularity landscape admits an exponential number of distinct high-scoring solutions and does not typically exhibit a clear global maximum. Third, we derive the limiting behavior of the maximum modularity Q_max for infinitely modular networks, showing that it depends strongly on the size of the network and the number of module-like subgraphs it contains. Finally, using three real-world examples of metabolic networks, we show that the degenerate solutions can fundamentally disagree on the composition of even the largest modules. Together, these results significantly extend and clarify our understanding of this popular method. In particular, they explain why so many heuristics perform well in practice at finding high-scoring partitions, why these heuristics can disagree on the composition of the identified modules, and how the estimated value of Q_max should be interpreted. Further, they imply that the output of any modularity maximization procedure should be interpreted cautiously in scientific contexts. We conclude by discussing avenues for mitigating these behaviors, such as combining information from many degenerate solutions or using generative models.
August 30, 2008
Having come of age during the rise of the Internet, it's hard to imagine what it was like to do science back in the communication dark ages without tools like email, electronic journals, Wikipedia, etc. Most of my research has some component that requires quickly communicating with people over vast distances . This ease of interaction is all based on a few critical components of the Internet. First, the Internet is fast, in the sense that the Internet's routers and transmission lines allow information to get from A to B extremely quickly. Second, the Internet is navigable, meaning that A knows how to get to B using one of those quick routes. If either of these things failed, the Internet would quickly fall apart and people would go back to phoning and faxing each other . Yuck.
What's not widely known is that there's a real danger that the second of these two won't hold in the future. If so, trying to quickly reach Google, your humble blogger or anyone else may become as difficult as trying to drive from New York to Santa Fe without the help of a map or road signs. The problem is that the system that makes the Internet navigable is fundamentally flawed, and it's not clear how to fix it now that everyone depends so heavily on there being a working Internet. That is, we can't just turn the whole Internet off while we move everyone over to the new improved system.
One flaw is that the system wasn't designed to have the whole, or even a large fraction of the, world online. It was always thought that we would roll out a new series of tubes  down the road, but then something happened: the Internet became wildly popular, and that idea was ruined. A second flaw is that the system assumes everyone is always honest. The Internet's navigability comes from, basically, a massive but highly accurate game of telephone. In the children's version of this game, errors are introduced accidentally, and everyone laughs at the end about how strange the messages become. In the Internet version, a malicious person can introduce an error strategically, allowing them to eavesdrop on other messages (a la the NSA) or hijack messages before they reach their destination. It was recently demonstrated that these kinds of attacks are, in fact, relatively easy to do. We've been lucky so far that these kinds of attacks haven't been more widely used.
These and other issues make it very clear that the future of the Internet (and my scientific productivity!) depends on designing a more robust system, to which we can smoothly transition while still using the current broken version. But how exactly would a better system work? Earlier this summer, I coörganized a mini-workshop at SFI with some folks from CAIDA (Cooperative Association for Internet Data Analysis) based at UC San Diego about exactly this question. The attendees were primarily folks from the Internet-branch of the network science community, and the talks were focused heavily on alternative ways to make the Internet navigable. The result of the meeting was not a solution to the problem, but rather a set of questions that we think probably need to be answered before a real solution can be made.
 Usually, this is because my collaborators are remote; one of them I wrote two papers with before meeting him in person for the first time earlier this year.
 Increasingly phone calls depend on the same technology (packet switching) that runs the Internet, but originally, phone calls depended on a different kind of system (circuit switching), which guaranteed the delivery of information (i.e., no garbled conversations because of network congestion) but was significantly less flexible.
 This was a naive view, of course, but it's hard to make accurate predictions, especially about the future. The future was supposed to be based on something called IPv6, but if everyone used IPv6 today, it would break the Internet even faster. Fortunately, or unfortunately, almost no one uses IPv6 and it seems that no one is really planning to, either.
July 09, 2008
Announcement: DIMACS Workshop on Network Models of Biological and Social Contagion
This workshop looks pretty interesting, and that's not because it's being organized by my friends. Comfortably, the topics align with several of what I think are the "future" of network science (tip to Jake).
Update 18 July 2008: Having just received an invitation to speak from the organizers, I think it's likely that I'll be attending. In addition to learning about new science at DIMACS, it'll be a great opportunity to also visit some friends and colleagues in New York City.
November 3 - 4, 2008 at DIMACS, Rutgers
Description: The spread of infectious diseases and the flow of ideas and information through populations fundamentally depend on the complex structure of the underlying network of interactions between individuals. Disease ecologists and sociologists have historically studied the dynamics of contagion using models that assume very simple population structures. Recently, however, network modeling has revolutionized both fields by enabling the rigorous exploration of the relationship between complex individual-level behavior and the higher-level emergence of outbreaks. The field draws on advanced statistical tools for inferring network structure from often limited data, data-driven algorithms for generating realistic network structures, and mathematical approximations for predicting transmission dynamics that draw from the methods of percolation theory and other fields within statistical physics.
While network models are more complex than their mass-action predecessors, they are remarkably tractable, often reducing to low-dimensional descriptions and allowing straightforward calculations of the dynamics of contagion. The fields of infectious disease epidemiology and sociology are simultaneously experiencing an explosion of computationally-intensive agent-based simulation models, that allow much higher-resolution representations of populations but often preclude comprehensive analysis. Selecting among the diversity of modeling approaches is non-trivial, and may be highly dependent on the system and the questions.
This workshop will focus on network models for biological and social contagion, and how they compare to alternative approaches. It will address the challenges of inferring network structure from sociological and/or epidemiological data, understanding the emergence of such network structure from simple individual-level behavior, and predicting the dynamics of contagion from simple characterizations of the underlying network.
- Inferring network structure from data
- Generative models of social and epidemiological networks
- Modeling the dynamics of biological and social contagion on networks
- Modeling feedback from contagion dynamics to network structure
- Model selection -- choosing the right level of complexity
May 01, 2008
Hierarchical structure of networks
Many scientists believe that complex networks, like those we use to describe the interactions of genes, social relationships, and food webs, have a modular structure, in which genes or people or critters tend to cluster into densely interacting groups that are only loosely connected to each other. This idea is appealing since it agrees with a lot of our everyday experience or beliefs about how the world works. But, within those groups, are interactions uniformly random? Some folks believe that these modules themselves can be decomposed into sub-modules, and they into sub-sub-modules, etc. Similarly, modules may group together into super-modules, etc. This kind of recursive structure is what I mean by hierarchical group structure. 
There's been a lot of interest among both physicists and biologists in methods for extracting either modular or hierarchical structure in networks. In fact, one of my first papers in grad school was a fast algorithm for clustering nodes in very large networks. Many of the methods for getting at the hierarchical structure of networks are rather ad hoc, with the hierarchy produced being largely a byproduct of the particular behavior of the algorithm, rather than something inherent to the network itself. What was missing was a direct model of hierarchy.
Many of you will know (perhaps from here or here), that I've done work in this area with Cris Moore and Mark Newman, and that I care a lot about null models and making appropriate inferences from data. Our first paper on hierarchy is on the arxiv; in it, we showed some fancy things you could do with a model of hierarchy, such as assign connections a "surprisingness" value based on how unlikely they were under our model. Our second paper, in which we show that hierarchy is a very good predictor of missing connections in networks appeared today in Nature. [2,3] There's also a very nice accompanying News & Views piece by Sid Redner. Accurately predicting missing connections has many applications, including the obvious one for homeland security, but also for laboratory or field scientists who construct networks laboriously, testing or looking for one or a few edges at a time.
Another nice thing that came out of this work is that the hierarchy we extract from real networks seems to be extremely good at simultaneously reproducing many other commonly measured statistical features of networks, including things like a right-skewed degree distribution, high (or low) clustering coefficients, etc. In some sense, this suggests that hierarchy may be a fundamental principle of organization for these networks. That is, it may turn out that different kinds of hierarchies of modules is partly what causes real-world networks to look the way they do. General principles like this are wonderful (but not easy) to find, as they suggest we're on the right track to boiling a complex system down to its fundamental parts.
Of course, there are several important missing pieces from this picture, one of which is that real networks are often functional, while the hierarchical model may not completely circumscribe the networks that accomplish the necessary functions for the biological or social context they exist in. In that sense, we still have a long way to go before we understand why things like genetic regulatory networks are shaped the way they are, but hierarchy at least gives us a reasonable way to think about the large-scale organization of these fantastically complex systems.
Update 5 May 2008: Coverage of our results have appeared on Roland Piquepaille's Technology Trends, and also on Slashdot. Now I can live my days out in peace knowing that something I did made it on /. ...
 Hierarchical group structure is different from a hierarchy on the nodes themselves, which is more like a military hierarchy or an org-chart, where control or information flows from individuals higher in the hierarchy to other individuals lower in the hierarchy. For gene networks, there is probably some of both kinds of hierarchy, as there are certainly genes that control the behavior of large numbers of other genes. For instance, see
G. Halder, P. Callaerts and W.J Gehring. "Induction of ectopic eyes by targeted expression of the eyeless gene in Drosophila". Science 267, 1788–1792 (1995).
 "Hierarchical structure and the prediction of missing links in networks." A. Clauset, C. Moore and M. E. J. Newman. Nature 453, 98 - 101 (2008).
The code for fitting the model to network data (C++), for predicting missing connections in networks (C++), and for visualizing the inferred hierarchical structure (Matlab) is available on my website.
 It's especially nice to have this paper in print now as it was the last remaining unpublished chapter of my dissertation. Time for new projects!
June 05, 2007
Thoughts on NetSci 2007
Clearly, attending four conferences in four different states over five weeks has been too much for your humble blogger to handle in a timely fashion. This entry covers some of the notes I took while I was in New York City for the International Conference on Network Science (NetSci 2007), held at the New York Hall of Science.
Stefan Bornholdt (Bremen) gave an reasonably interesting talk on computation in molecular networks on Day 1. The basic idea is that while we pretty much understand how the storage (DNA), transmission (reproduction) and alteration (mutation, crossover) of heritable traits works, we have very little idea about how the machinery that interacts with them actually computes (regulates, manages, etc.) the things they code for. Using a very simple model of this computation, and a set of inferred interactions for the genes governing cell division in yeast, he constructed a simple dynamical system and analyzed its behavior. His main result was a claim that the simple dynamical model, combined with the topology of which genes talk to each other, reproduces the actual 13-state sequence for cell division. The one problem with this claim was that the figure he showed of this sequence was a chain ending in a fixed point, rather than a limit cycle. Still, interesting work, and it suggests that combining the right dynamical model with a network topology can lead to interesting dynamical behaviors.
Stephen North (AT&T; also one of the contact people for GraphViz) gave a brief talk about network visualization procedures. I think the consequences of visualization are under-appreciated by the networks community. That is, "good" visualizations should show us things about the data that we didn't know were there (or maybe we did), but it should try hard not to create artifacts in the visualization. I wasn't surprised to recently learn that the ubiquitous spring-force layout algorithms exhibit strong pathologies in terms of their getting stuck in local energy-minima that can give the appearance of structure that may not exist. For visualization, the desirable features he suggests are purely local -- we want few edge crossings and we want edges to be shot and straight -- but for very large networks, I think it would be more informative to get the cluster structure right than to make the inner details of the clusters pretty. I'm not aware of any layout algorithms that do this, but this is at least partially because getting the cluster structure right can require very sophisticated pre-processing.
Jennifer Dunne (Santa Fe Institute) opened Day 2 with discussion of her recent work on understanding paleolithic foodwebs and their similarity (or dissimilarity) to modern-day foodwebs. Amazingly, experts on paleolithic species can reconstruct a large fraction of the corresponding foodwebs from fossils, based on lucky fossilizations (animals caught in the act of feeding, or fossilized stomach contents), morphological considerations (jaw structure, sizes, etc.), damage patterns, etc. Each edge is assigned a confidence value by the expert. Dunne's analysis of these inferred topologies (e.g., from the Burgess Shale) seems pretty careful, showing that paleo webs and modern webs actually don't seem to differ in the network-features that ecologists typically study, although they do differ in their details. Her work is unusual in the networks community for explicitly doing an analysis of the uncertainty of the conclusions in light of the known uncertainties in the topology -- this was a topic that came up during the panel discussion at the IPAM workshop earlier in the month, and I think it deserves a lot more attention in the networks world. That is, there's a strong need for better methods to understand how to quote reasonable uncertainties in our conclusions.
One of my favorite talks of the week was by James Collins (Boston University) on his work on reverse engineering gene networks. He used a multiple linear regression model to infer an (weighted?) adjacency matrix from gene expression data for the E. coli SOS network (a particular set of responses E. coli initiates under high stress), and then matched its predictions for what genes to target to knockout the overall response of the network with in vivo experiments. I asked if a more complicated model might do even better -- he suggested that the fact that the linear model does so well suggests that the system is highly damped, and that more complicated models (e.g., those with say, non-linear effects built-in) do seem to do slightly better.
Another of my favorite talks of the week was by Jon Kleinberg (Cornell) on the difficulties of anonymizing large social network data sets. He pointed out that many companies have and will release anonymized version of their social networks (e.g., facebook or LiveJournal), where the anonymization is done simply by assigning each node a random unique label. Jon points out that, and then shows us exactly how, a malicious attacker could, by introducing a few additional edges or nodes to the network (e.g., by opening up some new email accounts and sending some emails), specifically de-anonymize a set of edges and nodes in the network. The results he showed relied on some very beautiful and very old results from combinatorics (e.g., some work by Paul Erdos on Ramsey theory). He then pitched an open question, are there good privacy-preserving mechanisms for anonymizing social network data? He suggested that zero-knowledge proofs or interactive proof schemes might be one way to guarantee anonymity, but at the expense of severely limiting the kinds of analysis that researchers could do with the network data.
Lise Getoor (U. Maryland) gave a short talk on her work on entity resolution for citation and coauthor networks (code available here; x86 chips only). Although she didn't speak directly about the effect of this kind of aliasing on the topological patterns that people are often interested in, the implication was clear that the topology (and thus both small-scale and large-scale patterns) can change quite dramatically when you appropriately clean the data. I'm curious to see what happens to the basic network statistics like the degree distribution, clustering coefficient and geodesic-length distribution, along with larger things like the community structure change when these things are fixed. (This would also give us some insight into just how much uncertainty we should include in our conclusions when we don't do this kind of alias resolution.)
May 27, 2007
This week, I'm in Snowbird, UT for SIAM's conference on Applications of Dynamical Systems (DS07). I'm here for a mini-symposium on complex networks organized by Mason Porter and Peter Mucha. I'll be blogging about these (and maybe other) network sessions as time allows (I realize that I still haven't blogged about NetSci last week - that will be coming soon...).
May 21, 2007
This week, I'm in New York City for the International Conference on Network Science, being held at the New York Hall of Science Museum in Queens. I may not be able to blog each day about the events, but I'll be posting my thoughts and comments as things progress. Stay tuned. In the meantime, here's the conference talk schedule.
IPAM - Random and Dynamic Graphs and Networks (Days 4 & 5)
Rather than my usual format of summarizing the things that got me thinking during the last few days, I'm going to go with a more free-form approach.
Thursday began with Jennifer Chayes (MSR) discussing some analytical work on adapting convergence-in-distribution proof techniques to ensembles of graphs. She introduced the cut-norm graph distance metric (useful on dense graphs; says that they have some results for sparse graphs, but that it's more difficult for those). The idea of graph distance seems to pop up in many different areas (including several I've been thinking of) and is closely related to the GRAPH ISOMORPHISM problem (which is not known to be NP-complete, but nor is it known to be in P). For many reasons, it would be really useful to be able to calculate in polynomial time the minimum edge-edit distance between two graphs; this would open up a lot of useful techniques based on transforming one graph into another.
Friday began with a talk by Jeannette Janssen (Dalhousie University) on a geometric preferential attachment model, which is basically a geometric random graph but where nodes have a sphere of attraction (for new edges) that has volume proportional to the node's in-degree. She showed some very nice mathematical results on this model. I wonder if this idea could be generalized to arbitrary manifolds (with a distance metric on them) and attachment kernels. That is, imagine that our complex network has actually imbedded on some complicated manifold and the attachment is based on some function of the distance on that manifold between the two nodes. The trick would be then to infer both the structure of the manifold and the attachment function from real data. Of course, without some constraints on both features, it would be easy to construct an arbitrary pair (manifold and kernel) that would give you exactly the network you observed. Is it sufficient to get meaningful results that both should be relatively smooth (continuous, differentiable, etc.)?
Jeannette's talk was followed by Filippo Menczer's talk on mining traffic data from the Internet2/Abilene network. The data set was based on daily dumps of end-to-end communications (packet headers with client and server IDs anonymized) and looked at a variety of behaviors of this traffic. He used this data to construct interaction graphs betwen clients and servers, clients and applications (e.g., "web"), and a few other things. The analysis seems relatively preliminary in the sense that there are a lot of data issues that are lurking in the background (things like aggregated traffic from proxies, aliasing and masking effects, etc.) that make it difficult to translate conclusions about the traffic into conclusions about real individual users. But, fascinating stuff, and I'm looking forward to seeing what else comes out of that data.
The last full talk I saw was by Raissa D'Souza on competition-induced preferential attachment, and a little bit at the end on dynamic geometric graphs and packet routing on them. I've seen the technical content of the preferential attachment talk before, but it was good to have the reminder that power-law distributions are not necessarily the only game in town for heavy-tailed distributions, and that even though the traditional preferential attachment mechanism may not be a good model of the way real growing networks change, it may be that another mechanism that better models the real world can look like preferential attachment. This ties back to Sidney Redner's comment a couple of days before about the citation network: why does the network look like one grown by preferential attachment, when we know that's not how individual authors choose citations?
May 09, 2007
IPAM - Random and Dynamic Graphs and Networks (Day 3)
This week, I'm in Los Angeles for the Institute for Pure and Applied Mathematics' (IPAM, at UCLA) workshop on Random and Dynamic Graphs and Networks; this is the third of five entries based on my thoughts from each day. As usual, these topics are a highly subjective slice of the workshop's subject matter...
The impact of mobility networks on the worldwide spread of epidemics
I had the pleasure of introducing Alessandro Vespignani (Indiana University) for the first talk of the day on epidemics in networks, and his work in modeling the effect that particles (people) moving around on the airport network have on models of the spread of disease. I've seen most of this stuff before from previous versions of Alex's talk, but there were several nice additions. The one that struck the audience the most was a visualization of all of the individual flights over the space of a couple of days in the eastern United States; the animation was made by Aaron Koblin for a different project, but was still quite effective in conveying the richness of the air traffic data that Alex has been using to do epidemic modeling and forecasting.
On the structure of growing networks
Sidney Redner gave the pre-lunch talk about his work on the preferential attachment growing-network model. Using the master equation approach, Sid explored an extremely wide variety of properties of the PA model, such as the different regimes of degree distribution behavior for sub-, exact, and different kinds of super- linear attachment rates, the first-mover advantage in the network, the importance of initial degree in determining final degree, along with several variations on the initial model. The power of the master equation approach was clearly evident, I should really learn more about.
He also discussed his work analyzing 100 years of citation data from the Physical Review journal (about 350,000 papers and 3.5 million citations; in 1890, the average number of references in a paper was 1, while in 1990, the average number had increased to 10), particularly with respect to his trying to understand the evidence for linear preferential attachment as a model of citation patterns. Quite surprisingly, he showed that for the first 100 or so citations, papers in PR have nearly linear attachment rates. One point Sid made several times in his talk is that almost all of the results for PA models are highly sensitive to variations in the precise details of the attachment mechanism, and that it's easy to get something quite different (so, no power laws) without trying very hard.
Finally, a question he ended with is why does linear PA seem to be a pretty good model for how citations acrue to papers, even though real citation patterns are clearly not dictated by the PA model?
The last talk-slot of the day was replaced by a panel discussion, put together by Walter Willinger and chaired by Mark Newman. Instead of the usual situation where the senior people of a field sit on the panel, this panel was composed of junior people (with the expectation that the senior people in the audience would talk anyway). I was asked to sit on the panel, along with Ben Olding (Harvard), Lea Popovic (Cornell), Leah Shaw (Naval Research Lab), and Lilit Yeghiazarian (UCLA). We each made a brief statement about what we liked about the workshop so far, and what kinds of open questions we would be most interested in seeing the community study.
For my on part, I mentioned many of the questions and themes that I've blogged about the past two days. In addition, I pointed out that function is more than just structure, being typically structure plus dynamics, and that our models currently do little to address the dynamics part of this equation. (For instance, can dynamical generative models of particular kinds of structure tell us more about why networks exhibit those structures specifically, and not some other variety?) Lea and Leah also emphasized dynamics as being a huge open area in terms of both modeling and mechanisms, with Lea pointing out that it's not yet clear what are the right kinds of dynamical processes that we should be studying with networks. (I made a quick list of processes that seem important, but only came up with two main caterogies, branching-contact-epidemic-percolation processes and search-navigation-routing processes. Sid later suggested that consensus-voting style processes, akin to the Ising model, might be another, although there are probably others that we haven't thought up yet.) Ben emphasized the issues of sampling, for instance, sampling subgraphs of our model, e.g., the observable WWW or even just the portion we can crawl in an afternoon, and dealing with sampling effects (i.e., uncertainty) in our models.
The audience had a lot to say on these and other topics, and particularly so on the topics of what statisticians can contribute to the field (and also why there are so few statisticians working in this area; some suggestions that many statisticians are only interested in proving asymptotic results for methods, and those that do deal with data are working on bio-informatics-style applications), and on the cultural difference between the mathematicians who want to prove nice things about toy models (folks like Christian Borgs, Microsoft Research) as a way of understanding the general propeties of networks and of their origin, and the empiricists (like Walter Willinger) who want accurate models of real-world systems that they can use to understand their system better. Mark pointed out that there's a third way in modeling, which relies on using an appropriately defined null model as a probe to explore the structure of your network, i.e., a null model that reproduces some of the structure you see in your data, but is otherwise maximally random, can be used to detect the kind of structure the model doesn't explain (so-called "modeling errors", in contrast to "measurement errors"), and thus be used in the standard framework of error modeling that science has used successfully in the past to understand complex systems.
All-in-all, I think the panel discussion was a big success, and the conversation certainly could have gone on well past the one-hour limit that Mark imposed.
May 08, 2007
IPAM - Random and Dynamic Graphs and Networks (Day 2)
This week, I'm in Los Angeles for the Institute for Pure and Applied Mathematics' (IPAM, at UCLA) workshop on Random and Dynamic Graphs and Networks; this is the second of five entries based on my thoughts from each day. As usual, these topics are a highly subjective slice of the workshop's subject matter...
Biomimetic searching strategies
Massimo Vergassola (Institut Pasteur) started the day with an interesting talk that had nothing to do with networks. Massimo discussed the basic problem of locating a source of smelly molecules in the macroscopic world where air currents cause pockets of the smell to be sparsely scattered across a landscape, thus spoiling the chemotaxis (gradient ascent) strategy used by bacteria, and a clever solution for it (called "infotaxis") based on trading off exploration and exploitation via an adaptive entropy minimization strategy.
Diversity of graphs with highly variable connectivity
Following lunch, David Alderson (Naval Postgraduate School) described his work with Lun Li (Caltech) on understanding just how different networks with a given degree distribution can be from each other. The take-home message of Dave's talk is, essentially, that the degree distribution is a pretty weak constraint on other patterns of connectivity, and is not a sufficient statistical characterization of the global structure of the network with respect to many (most?) of the other (say, topological and functional) aspects we might care about. Although he focused primarily on degree assortativity, the same kind of analysis could in principle be done for other network measures (clustering coefficient, distribution, diameter, vertex-vertex distance distribution, etc.), none of which are wholly independent of the degree distribution, or of each other! (I've rarely seen the interdependence of these measures discussed (mentioned?) in the literature, even though they are often treated as such.)
In addition to describing his numerical experiments, Dave sounded a few cautionary notes about the assumptions that are often made in the complex networks literature (particularly by theoreticians using random-graph models) on the significance of the degree distribution. For instance, the configration model with a power-law degree sequence (and similarly, graphs constructed via preferential attachment) yields networks that look almost nothing like any real-world graph that we know, except for making vaguely similar degree distributions, and yet they are often invoked as reasonable models of real-world systems. In my mind, it's not enough to simply fix-up our existing random-graph models to instead define an ensemble with a specific degree distribution, and a specific clustering coefficient, and a diameter, or whatever our favorite measures are. In some sense all of these statistical measures just give a stylized picture of the network, and will always be misleading with respect to other important structural features of real-world networks. For the purposes of proving mathematical theorems, I think these simplistic toy models are actually very necessary -- since their strong assumptions make analytic work significantly easier -- so long as we also willfully acknowledge that they are a horrible model of the real world. For the purposes of saying something concrete about real networks, we need more articulate models, and, probably, models that are domain specific. That is, I'd like a model of the Internet that respects the idiosyncracies of this distributed engineered and evolving system; a model of metabolic networks that respects the strangeness of biochemistry; and a model of social networks that understands the structure of individual human interactions. More accurately, we probably need models that understand the function that these networks fulfill, and respect the dynamics of the network in time.
Greedy search in social networks
David Liben-Nowell (Carleton College) then closed the day with a great talk on local search in social networks. The content of this talk largely mirrored that of Ravi Kumar's talk at GA Tech back in January, which covered an empirical study of the distribution of the (geographic) distance covered by friendship links in the LiveJournal network (from 2003, when it had about 500,000 users located in the lower 48 states). This work combined some nice data analysis with attempts to validate some of the theoretical ideas due to Kleinberg for locally navigable networks, and a nice generalization of those ideas to networks with non-uniform population distributions.
An interesting point that David made early in his talk was that homophily is not sufficient to explain the presense of either the short paths that Milgrams' original 6-degrees-of-seperation study demonstrated, or even the existence of a connected social graph! That is, without a smoothly varying notion of "likeness", then homophily would lead us to expect disconnected components in the social network. If both likeness and the population density in the likeness space varied smoothly, then a homophilic social web would cover the space, but the average path length would be long, O(n). In order to get the "small world" that we actually observe, we need some amount of non-homophilic connections, or perhaps multiple kinds of "likeness", or maybe some diversity in the preference functions that individuals use to link to each other. Also, it's still not clear what mechanism would lead to the kind of link-length distribution predicted by Kleinberg's model of optimally navigable networks - an answer to this question would, presumably, tell us something about why modern societies organize themselves the way they do.
May 07, 2007
IPAM - Random and Dynamic Graphs and Networks (Day 1)
This week, I'm in Los Angeles for the Institute for Pure and Applied Mathematics' (IPAM, at UCLA) workshop on random and dynamic graphs and networks. This workshop is the third of four in their Random Shapes long program. The workshop has the usual format, with research talks throughout the day, punctuated by short breaks for interacting with your neighbors and colleagues. I'll be trying to do the same for this event as I did for the DIMACS workshop I attended back in January, which is to blog each day about interesting ideas and topics. As usual, this is a highly subjective slice of the workshop's subject matter.
Detecting and understanding the large-scale structure of networks
Mark Newman (U. Michigan) kicked off the morning by discussing his work on clustering algorithms for networks. As he pointed out, in the olden days of network analysis (c. 30 years ago), you could write down all the nodes and edges in a graph and understand its structure visually. These days, our graphs are too big for this, and we're stuck using statistical probes to understand how these things are shaped. And yet, many papers include figures of networks as incoherent balls of nodes and edges (Mark mentioned that Marc Vidal calls these figures "ridiculograms").
I've seen the technical content of Mark's talk before, but he always does an excellent job of making it seem fresh. In this talk, there was a brief exchange with the audience regarding the NP-completeness of the MAXIMUM MODULARITY problem, which made me wonder what exactly are the kind of structures that would make an instance of the MM problem so hard. Clearly, polynomial time algorithms that approximate the maximum modularity Q exist because we have many heuristics that work well on (most) real-world graphs. But, if I was an adversary and wanted to design a network with particularly difficult structure to partition, what kind would I want to include? (Other than reducing another NPC problem using gadgets!)
Walter Willinger raised a point here (and again in a few later talks) about the sensitivity of most network analysis methods to topological uncertainty. That is, just about all the techniques we have available to us assume that the edges as given are completely correct (no missing or spurious edges). Given the classic result due to Watts and Strogatz (1998) of the impact that a few random links added to a lattice have on the diameter of the graph, it's clear that in some cases, topological errors can have a huge impact on our conclusions about the network. So, developing good ways to handle uncertainty and errors while analyzing the structure of a network is a rather large, gaping hole in the field. Presumably, progress in this area will require having good error models of our uncertainty, which, necessary, depend on the measurement techniques used to produce the data. In the case of traceroutes on the Internet, this kind of inverse problem seems quite tricky, but perhaps not impossible.
Probability and Spatial Networks
David Aldous (Berkeley) gave the second talk and discussed some of his work on spatial random graphs, and, in particular, on the optimal design and flow through random graphs. As an example, David gave us a simple puzzle to consider:
Given a square of area N with N nodes distributed uniformly at random throughout. Now, subdivided this area into L^2 subsquares, and choose one node in each square to be a "hub." Then, connect each of the remaining nodes in a square to the hub, and connect the hubs together in a complete graph. The question is, what is the size L that minimizes the total (Euclidean) length of the edges in this network?
He then talked a little about other efficient ways to connect up uniformly scattered points in an area. In particular, Steiner trees are the standard way to do this, and have a cost O(N). The downside for this efficiency is that the tree-distance between physically proximate points on the plane is something polynomial in N (David suggested that he didn't have a rigorous proof for this, but it seems quite reasonable). As it turns out, you can dramatically lower this cost by adding just a few random lines across the plane -- the effect is analagous to the one in the Watts-Strogatz model. Naturally, I was thinking about the structure of real road networks here, and it would seem that the effect of highways in the real world is much the same as David's random lines. That is, it only takes a few of these things to dramatically reduce the topological distance between arbitrary points. Of course, road networks have other things to worry about, such as congestion, that David's highways don't!
April 26, 2007
The month of May
The month of May is a busy one for me. For some reason, it's when most of the big networks-related workshops and conferences happen, so I end up spending most of it on the road. This year, I'm attending four conferences, in four states, two of which are on opposite coasts. The agenda:
Algorithms, Inference, and Statistical Physics (AISP), run by CNLS of Los Alamos National Lab and hosted in Santa Fe. This workshop runs May 1 - 4, and I'm giving a short talk on power-law distribution in empirical data.
I get a short reprieve, and then its off to New York City for the over-named International Conference on Network Science (NetSci), which is trying to position itself as the main event in the field of complex networks each year. Given the number of physicists that present work at the APS March Meeting, that's going to be quite a task. But, at least NetSci attracts some folks outside of physics, such as a few folks in sociology, ecology and microbiology.
And then finally, it's over to Utah for the SIAM's conference on Applications of Dynamical Systems (DS07). There I'll be giving a talk on the hierarchical organization of networks at a mini-symposium on complex networks organized by Mason Porter and Peter Mucha.
Then, I'll return to Santa Fe exhausted, but enlightened from interacting with my esteemed colleagues, and seeing a few friends that live in faraway places. Ah, conference season. How I love thee. How I loathe thee.
February 15, 2007
Fast-modularity made really fast
The other day on the arxiv mailing, a very nice paper appeared (cs.CY/0702048) that optimizes the performance of the fast-modularity algorithm that I worked on with Newman and Moore several years ago. Our algorithm's best running time was O(n log^2 n) on sparse graphs with roughly balanced dendrograms, and we applied it to a large network of about a half million vertices (my implementation is available here).
Shortly after we posted the paper on the arxiv, I began studying its behavior on synthetic networks to understand whether a highly right-skewed distribution of community sizes in the final partition was a natural feature of the real-world network we studied, or whether it was caused by the algorithm itself . I discovered that the distribution probably was not entirely a natural feature because the algorithm almost always produces a few super-communities, i.e., clusters that contain a large fraction of the entire network, even on synthetic networks with no significant community structure. For instance, in the network we analyzed, the top 10 communities account for 87% of the vertices.
Wakita and Tsurumi's paper begins with this observation and then shows that the emergence of these super-communities actually slows the algorithm down considerably, making the running time more like O(n^2) than we would like. They then show that by forcing the algorithm to prefer to merge communities of like sizes - and thus guaranteeing that the dendrogram it constructs will be fairly balanced - the algorithm achieves the bound of essentially linear running time that we proved in our paper. This speed-up yields truly impressive results - they cluster a 4 million node network in about a half an hour - and I certainly hope they make their implementation available to the public. If I have some extra time (unlikely), I may simply modify my own implementation. (Alternatively, if someone would like to make that modification, I'm happy to host their code on this site.)
Community analysis algorithm proposed by Clauset, Newman, and Moore (CNM algorithm) finds community structure in social networks. Unfortunately, CNM algorithm does not scale well and its use is practically limited to networks whose sizes are up to 500,000 nodes. The paper identifies that this inefficiency is caused from merging communities in unbalanced manner. The paper introduces three kinds of metrics (consolidation ratio) to control the process of community analysis trying to balance the sizes of the communities being merged. Three flavors of CNM algorithms are built incorporating those metrics. The proposed techniques are tested using data sets obtained from existing social networking service that hosts 5.5 million users. All the methods exhibit dramatic improvement of execution efficiency in comparison with the original CNM algorithm and shows high scalability. The fastest method processes a network with 1 million nodes in 5 minutes and a network with 4 million nodes in 35 minutes, respectively. Another one processes a network with 500,000 nodes in 50 minutes (7 times faster than the original algorithm), finds community structures that has improved modularity, and scales to a network with 5.5 million.
K. Wakita and T. Tsurumi, "Finding Community Structure in Mega-scale Social Networks." e-print (2007) cs.CY/0702048
Update 14 April 2008: Ken Wakita tells me that their code is now publicly available online.
 Like many heuristics, fast-modularity achieves its speed by being highly biased in the set of solutions it considers. See footnote 7 in the previous post. So, without knowing more about why the algorithm behaves in the way it does, a number of things are not clear, e.g., how close to the maximum modularity the partition it returns is, how sensitive its partition is to small perturbations in the input (removing or adding an edge), whether supplementary information such as the dendrogram formed by the sequence of agglomerations is at all meaningful, whether there is an extremely different partitioning with roughly the same modularity, etc. You get the idea. This is why it's wise to be cautious in over-interpreting the output of these biased methods.
February 03, 2007
Modernizing Kevin Bacon
Via Mason Porter and Arcane Gazebo comes a pointer to a modernization (geekification?) of the Kevin Bacon game, this time using Wikipedia (and suitably enough, this version has its own wikipedia page). Here's how it works:
- Go to Wikipedia.
- Click the random article link in the sidebar.
- Open a second random article in another tab.
- Try to find a chain of links (as short as possible) starting from the first article that leads to the second.
Many people are fascinated by the fact that these short paths, which is popularly called the "small world phenomenon." This behavior isn't really so amazing, since even purely random graphs have very short paths between arbitrary nodes. What makes the success of these games truly strange is the fact that we can find these short paths using only the information at our current location in the search, and some kind of mental representation of movies and actors / human knowledge and concepts. That is, both the movie-actors network and wikipedia are locally navigable.
The US road system is only partially navigable in this sense. For instance, with only a rudimentary mental map of the country, you could probably get between major cities pretty easily using only the information you see on highway signs. But, cities are not locally navigable because the street signs give you no indication of where to find anything. In order to efficiently navigate them, you either need a global map of the city in hand, or a complex mental map of the city (this is basically what cab drivers do it, but they devote a huge amount of mental space to creating it).
Mason also points me to a tool that will find you the shortest path between two wikipedia articles. However, I'm sure this program isn't finding the path the way a human would. Instead, I'm sure that it's just running a breadth-first search from both pages and returning the path formed when the two trees first touch. What would be more interesting, I think, would be a lexicographic webcrawler that would navigate from the one page to the other using only the text available on its current page (and potentially its history of where it's been), and some kind of simple model of concepts / human knowledge (actually, it would only need a function to tell it whether one concept is closer to its target or not). If such a program could produce chains between random articles that are about as short as those that humans produce, then that would be pretty impressive.
(These questions all relate to the process of navigation on a static network, but an equally important question is the one about how the network produces the structure necessary to be locally navigable in the first place. Although it's a highly idealized and unrealistic model, I humbly point to the results of my first big research project in graduate school as potentially having something to say on this topic.)
January 25, 2007
DIMACS - Complex networks and their applications (Day 3)
The third day of the workshop focused on applications to biochemical networks (no food webs), with a lot of that focus being on the difficulties of taking fuzzy biological data (e.g., gene expression data) and converting it into an accurate and meaningful form for further analysis or for hypothesis testing. Only a few of the talks were theoretical, but this perhaps reflects the current distribution of focus in biology today. After the workshop was done, I wondered just how much information crossed between the various disciplines represented at the workshop - certainly, I came away from it with a few new ideas, and a few new insights from the good talks I attended. And I think that's the sign of a successful workshop.
Complex Networks in Biology
Chris Wiggins (Columbia) delivered a great survey of interesting connections between machine learning and biochemical networks. It's probably fair to say that biologists are interested in constructing an understanding of cellular-level systems that compares favorably to an electrical engineer's understanding of circuits (Pointer: Can a Biologist Fix a Radio?). But, this is hard because living stuff is messy, inconsistent in funny ways, and has a tendency to change while you're studying it. So, it's harder to get a clean view of what's going on under the hood than it was with particle physics. This, of course, is where machine learning is going to save us - ML offers powerful and principled ways to sift through (torture) all this data.
The most interesting part of his talk, I think, was his presentation of NetBoost, a mechanism discriminator that can tell you which (among a specific suite of existing candidates) is the most likely to have generated your observed network data . For instance, was it preferential attachment (PA) or duplication-mutation-complementation (DMC) that produced a given protein-interaction network (conclusion: the latter is better supported). The method basically works by constructing a decision tree that looks at the subgraph decomposition of a network and scores it's belief that each of the various mechanisms produced it . With the ongoing proliferation of network mechanisms (theorists really don't have enough to do these days), this kind of approach serves as an excellent way to test a new mechanism against the data it's supposed to be emulating.
One point Chris made that resonated strongly with me - and which Cris and Mark made yesterday - is the problem with what you might call "soft validation" . Typically, a study will cluster or do some other kind of analysis with the data, and then tell a biological story about why these results make sense. On the other hand, forcing the clustering to make testable predictions would be a stronger kind of validation.
Network Inference and Analysis for Systems Biology
Just before lunch, Joel Bader (Johns Hopkins) gave a brief talk about his work on building a good view of the protein-protein interaction network (PPIN). The main problems with this widely studied data are the high error rate, both for false positives (interactions that we think exist, but don't) and false negatives (interactions that we think don't exist, but do). To drive home just how bad the data is, he pointed out that two independent studies of the human PPIN showed just 1% overlap in the sets of "observed" interactions.
He's done a tremendous amount of work on trying to improve the accuracy of our understanding of PPINs, but here he described a recent approach that fits degree-based generative models  to the data using our old friend expectation-maximization (EM) . His results suggest that we're seeing about 30-40% of the real edges, but that our false positive rate is about 10-15%. This is a depressing signal-to-noise ratio (roughly 1%), because the number of real interactions is O(n), while our false positive rate is O(n^2). Clearly, the biological methods used to infer the interactions need to be improved before we have a clear idea of what this network looks like, but it also suggests that a lot of the previous results on this network are almost surely wrong. Another question is whether it's possible to incorporate these kinds of uncertainties into our analyses of the network structure.
Activating Interaction Networks and the Dynamics of Biological Networks
Meredith Betterton (UC-Boulder) presented some interesting work on signaling and regulatory networks. One of the more surprising tidbits she used in her motivation is the following. In yeast, the mRNA transcription undergoes a consistent 40-minute genome-wide oscillation, but when exposed to an antidepressant (in this case, phenelzine), the period doubles . (The fact that gene expression oscillates like this poses another serious problem for the results of gene expression analysis that doesn't account for such oscillations.)
The point Meredith wanted to drive home, though, was we shouldn't just think of biochemical networks as static objects - they also represent the form that the cellular dynamics must follow. Using a simple dynamical model of activation and inhibition, she showed that the structure (who points to who, and whether an edge inhibits or activates its target) of a real-world circadian rhythm network and a real-world membrane-based signal cascade basically behave exactly as you would expect - one oscillates and the other doesn't. But, then she showed that it only takes a relatively small number of flips (activation to inhibition, or vice versa) to dramatically change the steady-state behavior of these cellular circuits. In a sense, this suggests that these circuits are highly adaptable, given a little pressure.
There are several interesting questions that came to mind while she was presenting. For instance, if we believe there are modules within the signaling pathways that accomplish a specific function, how can we identify them? Do sparsely connected dense subgraphs (assortative community structure) map onto these functional modules? What are the good models for understanding these dynamics, systems of differential equations, discrete time and matrix multiplication, or something more akin to a cellular version of Ohm's Law? 
 M. Middendorf, E. Ziv and C. Wiggins, "Inferring Network Mechanisms: The Drosophila melanogaster Protein Interaction Network." PNAS USA 102 (9), 3192 (2005).
 Technically, it's using these subgraphs as generic features and then crunching the feature vectors from examples of each mechanism through a generalized decision tree in order to learn how to discriminate among them. Boosting is used within this process in order to reduce the error rates. The advantage of this approach to model selection and validation, as Chris pointed out, is that it doesn't assume a priori which features (e.g., degree distribution, clustering coefficient, distance distribution, whatever) are interesting, but rather chooses the ones that can actually discriminate between things we believe are different.
 Chris called it "biological validation," but the same thing happens in sociology and Internet modeling, too.
 I admit that I'm a little skeptical of degree-based models of these networks, since they seem to assume that we're getting the degree distribution roughly right. That assumption is only reasonable if our sampling of the interactions attached to a particular vertex is unbiased, which I'm not sure about.
 After some digging, I couldn't find the reference for this work. I did find this one, however, which illustrates a different technique for a related problem. I. Iossifov et al., "Probabilistic inference of molecular networks from noisy data sources." 20 (8), 1205 (2004).
 C. M. Li and R. R. Klevecz, "A rapid genome-scale response of the transcriptional oscillator to perturbation reveals a period-doubling path to phenotypic change." PNAS USA 103 (44), 16254 (2006).
 Maribeth Oscamou pointed out to me during the talk that any attempt to construct such rules have to account for processes like the biochemical degradation of the signals. That is, unlike electric circuits, there's no strict conservation of the "charge" carrier.
January 24, 2007
DIMACS - Complex networks and their applications (Day 2)
There were several interesting talks today, or rather, I should say that there were several talks today that made me think about things beyond just what the presenters said. Here's a brief recap of the ones that made me think the most, and some commentary about what I thought about. There were other good talks today, too. For instance, I particularly enjoyed Frank McSherry's talk on doing PageRank on his laptop. There was also one talk on power laws and scale-free graphs that stimulated a lot of audience, ah, interaction - it seems that there's a lot of confusion both over what a scale-free graph is (admittedly the term has no consistent definition in the literature, although there have been some recent attempts to clarify it in a principled manner), and how to best show that some data exhibit power-law behavior. Tomorrow's talks will be more about networks in various biological contexts.
Complex Structures in Complex Networks
Mark Newman's (U. Michigan) plenary talk mainly focused on the importance of having good techniques to extract information from networks, and being able to do so without making a lot of assumptions about what the technique is supposed to look for. That is, rather than assume that some particular kind of structure exists and then look for it in our data, why not let the data tell you what kind of interesting structure it has to offer?  The tricky thing about this approach to network analysis, though, is working out a method that is flexible enough to find many different kinds of structure, and to present only that which is unusually strong. (Point to ponder: what should we mean by "unusually strong"?) This point was a common theme in a couple of the talks today. The first example that Mark gave of a technique that has this nice property was a beautiful application of spectral graph theory to the task of find a partition of the vertices that give an extremal value of modularity. If we ask for the maximum modularity, this heuristic method , using the positive eigenvalues of the resulting solution, gives us a partition with very high modularity. But, using the negative eigenvalues gives a partition that minimizes the modularity. I think we normally think of modules meaning assortative structures, i.e., sparsely connected dense subgraphs. But, some networks exhibit modules that are approximately bipartite, i.e., they are disassortative, being densely connected sparse subgraphs. Mark's method naturally allows you to look for either. The second method he presented was a powerful probabilistic model of node clustering that can be appropriately parameterized (fitted to data) via expectation-maximization (EM). This method can be used to accomplish much the same results as the previous spectral method, except that it can look for both assortative and disassortative structure simultaneously in the same network.
Hierarchical Structure and the Prediction of Missing Links
In an afternoon talk, Cris Moore (U. New Mexico) presented a new and powerful model of network structure, the hierarchical random graph (HRG) . (Disclaimer: this is joint work with myself and Mark Newman.) A lot of people in the complex networks literature have talked about hierarchy, and, presumably, when they do so, they mean something roughly along the lines of the HRG that Cris presented. That is, they mean that nodes with a common ancestor low in the hierarchical structure are more likely to be connected to each other, and that different cuts across it should produce partitions that look like communities. The HRG model Cris presented makes these notions explicit, but also naturally captures the kind of assortative hierarchical structure and the disassortative structure that Mark's methods find. (Test to do: use HRG to generate mixture of assortative and disassortative structure, then use Mark's second method to find it.) There are several other attractive qualities of the HRG, too. For instance, using a Monte Carlo Markov chain, you can find the hierarchical decomposition of a single real-world network, and then use the HRG to generate a whole ensemble of networks that are statistically similar to the original graph . And, because the MCMC samples the entire posterior distribution of models-given-the-data, you can look not only at models that give the best fit to the data, but you can look at the large number of models that give an almost-best fit. Averaging properties over this ensemble can give you more robust estimates of unusual topological patterns, and Cris showed how it can also be used to predict missing edges. That is, suppose I hide some edges and then ask the model to predict which ones I hid. If it can do well at this task, then we've shown that the model is capturing real correlations in the topology of the real graph - it has the kind of explanatory power that comes from making correct predictions. These kinds of predictions could be extremely useful for laboratory or field scientists who manually collect network data (e.g., protein interaction networks or food webs) . Okay, enough about my own work!
The Optimization Origins of Preferential Attachment
Although I've seen Raissa D'Souza (UC Davis) talk about competition-induced preferential attachment  before, it's such an elegant generalization of PA that I enjoyed it a second time today. Raissa began by pointing out that most power laws in the real-world can't extend to infinity - in most systems, there are finite limits to the size that things can be (the energy released in an earthquake or the number of edges a vertex can have), and these finite effects will typically manifest themselves as exponential cutoffs in the far upper tail of the distribution, which takes the probability of these super-large events to zero. She used this discussion as a springboard to introduce a relatively simple model of resource constraints and competition among vertices in a growing network that produces a power-law degree distribution with such an exponential cutoff. The thing I like most about this model is that it provides a way for (tempered) PA to emerge from microscopic and inherently local interactions (normally, to get pure PA to work, you need global information about the system). The next step, of course, is to find some way to measure evidence for this mechanism in real-world networks . I also wonder how brittle the power-law result is, i.e., if you tweak the dynamics a little, does the power-law behavior disappear?
Web Search and Online Communities
Andrew Tomkins (of Yahoo! Reserch) is a data guy, and his plenary talk drove home the point that Web 2.0 applications (i.e., things that revolve around user-generated content) are creating a huge amount of data, and offering unparalleled challenges for combining, analyzing, and visualizing this data in meaningful ways. He used Flickr (a recent Y! acquisition) as a compelling example by showing an interactive (with fast-rewind and fast-forward features) visual stream of the trends in user-generated tags for user-posted images, annotated with notable examples of those images. He talked a little about the trickiness of the algorithms necessary to make such an application, but what struck me most was his plea for help and ideas in how to combine information drawn from social networks with user behavior with blog content, etc. to make more meaningful and more useful applications - there's all this data, and they only have a few ideas about how to combine it. The more I learn about Y! Research, the more impressed I am with both the quality of their scientists (they recently hired Duncan Watts), and the quality of their data. Web 2.0 stuff like this gives me the late-1990s shivers all over again. (Tomkins mentioned that in Korea, unlike in the US, PageRank-based search has been overtaken by an engine called Naver, which is driven by users building good sets of responses to common search queries.)
 To be more concrete, and perhaps in lieu of having a better way of approaching the problem, much of the past work on network analysis has taken the following approach. First, think of some structure that you think might be interesting (e.g., the density of triangles or the division into sparsely connected dense subgraphs), design a measure that captures that structure, and then measure it in your data (it turns out to be non-trivial to do this in an algorithm independent way). Of course, the big problem with this approach is that you'll never know whether there is other structure that's just as important as, or maybe more important than, the kind you looked for, and that you just weren't clever enough to think to look for it.
 Heuristic because Mark's method is a polynomial time algorithm, while the problem of modularity maximization was recently (finally...) shown to be NP-complete. The proof is simple, and, in retrospect, obvious - just as most such proofs inevitably end up being. See U. Brandes et al. "Maximizing Modularity is hard." Preprint (2006).
 M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices." PRE 74, 036104 (2006).
 M. E. J. Newman and E. A. Leicht, "Mixture models and exploratory data analysis in networks." Submitted to PNAS USA (2006).
 A. Clauset, C. Moore and M. E. J. Newman, "Structural Inference of Hierarchies in Networks." In Proc. of the 23rd ICML, Workshop on "Statistical Network Analysis", Springer LNCS (Pittsburgh, June 2006).
 This capability seems genuinely novel. Given that there are an astronomical number of ways to rearrange the edges on a graph, it's kind of amazing that the hierarchical decomposition gives you a way to do such a rearrangement, but one which preserves the statistical regularities in the original graph. We've demonstrated this for the degree distribution, the clustering coefficient, and the distribution of pair-wise distances. Because of the details of the model, it sometimes gets the clustering coefficient a little wrong, but I wonder just how powerful / how general this capability is.
 More generally though, I think the idea of testing a network model by asking how well it can predict things about real-world problems is an important step forward for the field; previously, "validation" consisted of showing only a qualitative (or worse, a subjective) agreement between some statistical measure of the model's behavior (e.g., degree distribution is right-skewed) and the same statistical measure on a real-world network. By being more quantitative - by being more stringent - we can say stronger things about the correctness of our mechanisms and models.
 R. M. D'Souza, C. Borgs, J. T. Chayes, N. Berger, and R. Kleinberg, "Emergence of Tempered Preferential Attachment From Optimization", To appear in PNAS USA, (2007).
 I think the best candidate here would be the BGP graph, since there is clearly competition there, although I suspect that the BGP graph structure is a lot more rich than the simple power-law-centric analysis has suggested. This is primarily due to the fact that almost all previous analyses have ignored the fact that the BGP graph exists as an expression of the interaction of business interests with the affordances of the Border Gateway Protocol itself. So, its topological structure is meaningless without accounting for the way it's used, and this means accounting for complexities of the customer-provider and peer-to-peer relationships on the edges (to say nothing of the sampling issues involved in getting an accurate BGP map).
January 23, 2007
DIMACS - Complex networks and their applications (Day 1)
Today and tomorrow, I'm at the DIMACS workshop on complex networks and their applications, held at Georgia Tech's College of Computing. Over the course of the workshop, I'll be blogging about the talks I see and whatever ideas they stimulate (sadly, I missed most of the first day because of travel).
The most interesting talk I saw Monday afternoon was by Ravi Kumar (Yahoo! Research), who took location data of users on LiveJournal, and asked Do we see the same kind of routable structure - i.e., an inverses-square law relationship in the distance between people and the likelihood that they have a LJ connection - that Kleinberg showed was optimal for distributed / local search? Surprisingly, they were able to show that in the US, once you correct for the fact that there can be many people at a single "location" in geographic space (approximated to the city level), you do indeed observe exactly the kind of power-law that Kleinberg predicted . Truly, this was a kind of stunning confirmation of Kleinberg's theory. So now, the logical question would be, What mechanism might produce this kind of structure in geographic space? Although you could probably get away with assuming a priori the population distribution, what linking dynamics would construct the observed topological pattern? My first project in graduate school asked exactly this question for the pure Kleinberg model, and I wonder if it could be adapted to the geographic version that Kumar et al. consider.
 D. Liben-Nowell, et al. "Geographic Routing in Social Networks." PNAS USA 102, 33 11623-1162 (2005).