February 02, 2008
The Importance of Mathematics
W. Timothy Gowers (homepage) won the Fields Medal in 1998 for work in functional analysis and combinatorics. Pleasantly, he also writes regularly at Gowers's Blog, although much of what he writes is over my head (I am probably more to blame for that fact than Gowers, however). But, this very pleasant talk he gave at the Clay Mathematics Institute for their Millennium Meeting back in 2000. His topic is generally the "Importance of Mathematics" (which of course is a familiar notion to long-time readers here), and he gives a highly entertaining intellectual meditation on the subject, and touches briefly on problems like graph coloring, computational complexity (as a way of distinguishing those things practical in theory and those things practical in practice), knot theory, and the Erdos-Kac theorem that the number of factors for a randomly chosen integer is normally distributed (something I hadn't heard before). Along the way, he gives a good explanation about why it's dangerous to cut funding to "useless" parts of mathematics (or, science) in favor of funding only the "useful" bits, and tries to convey the idea that the most important mathematics is also often the most beautiful.
Parts 2, 3, 4, 5, 6, 7 and 8. (Total length is about 60 minutes.)
posted February 2, 2008 04:06 PM in Interdisciplinarity | permalink | Comments (0)
June 08, 2007
Power laws and all that jazz
With apologies to Tolkien:
Three Power Laws for the Physicists, mathematics in thrall,
Four for the biologists, species and all,
Eighteen behavioral, our will carved in stone,
One for the Dark Lord on his dark throne.
In the Land of Science where Power Laws lie,
One Paper to rule them all, One Paper to find them,
One Paper to bring them all and in their moments bind them,
In the Land of Science, where Power Laws lie.
From an interest that grew directly out of my work chracterizing the frequency of severe terrorist attacks, I'm happy to say that the review article I've been working on with Cosma Shalizi and Mark Newman -- on accurately characterizing power-law distributions in empirical data -- is finally finished. The paper covers all aspects of the process, from fitting the distribution to testing the hypothesis that the data is distributed according to a power law, and to make it easy for folks in the community to use the methods we recommend, we've also made our code available.
So, rejoice, rejoice all ye people of Science! Go forth, fit and validate your power laws!
For those still reading, I have a few thoughts about this paper now that it's been released into the wild. First, I naturally hope that people read the paper and find it interesting and useful. I also hope that we as a community start asking ourselves what exactly we mean when we say that such-and-such a quantity is "power-law distributed," and whether our meaning would be better served at times by using less precise terms such as "heavy-tailed" or simply "heterogeneous." For instance, we might simply mean that visually it looks roughly straight on a log-log plot. To which I might reply (a) power-law distributions are not the only thing that can do this, (b) we haven't said what we mean by roughly straight, and (c) we haven't been clear about why we might prefer a priori such a form over alternatives.
The paper goes into the first two points in some detail, so I'll put those aside. The latter point, though, seems like one that's gone un-addressed in the literature for some time now. In some cases, there are probably legitimate reasons to prefer an explanation that assumes large events (and especially those larger than we've observed so far) are distributed according to a power law -- for example, cases where we have some convincing theoretical explanations that match the microscopic details of the system, are reasonably well motivated, and whose predictions have held up under some additional tests. But I don't think most places where power-law distributions have been "observed" have this degree of support for the power-law hypothesis. (In fact, most simply fit a power-law model and assume that it's correct!) We also rarely ask why a system necessarily needs to exhibit a power-law distribution in the first place. That is, would the system behave fundamentally differently, perhaps from a functional perspective, if it instead exhibited a log-normal distribution in the upper tail?
Update 15 June: Cosma also blogs about the paper, making many excellent points about the methods we describe for dealing with data, as well as making several very constructive points about the general affair of power-law research. Well worth the time to read.
posted June 8, 2007 10:00 AM in Complex Systems | permalink | Comments (3)
May 27, 2007
DS07
This week, I'm in Snowbird, UT for SIAM's conference on Applications of Dynamical Systems (DS07). I'm here for a mini-symposium on complex networks organized by Mason Porter and Peter Mucha. I'll be blogging about these (and maybe other) network sessions as time allows (I realize that I still haven't blogged about NetSci last week - that will be coming soon...).
posted May 27, 2007 11:38 PM in Scientifically Speaking | permalink | Comments (1)
May 21, 2007
NetSci 2007
This week, I'm in New York City for the International Conference on Network Science, being held at the New York Hall of Science Museum in Queens. I may not be able to blog each day about the events, but I'll be posting my thoughts and comments as things progress. Stay tuned. In the meantime, here's the conference talk schedule.
posted May 21, 2007 11:41 AM in Scientifically Speaking | permalink | Comments (0)
IPAM - Random and Dynamic Graphs and Networks (Days 4 & 5)
Two weeks ago, I was in Los Angeles for the Institute for Pure and Applied Mathematics' (IPAM, at UCLA) workshop on Random and Dynamic Graphs and Networks; this is the fourth and fifth entry.
Rather than my usual format of summarizing the things that got me thinking during the last few days, I'm going to go with a more free-form approach.
Thursday began with Jennifer Chayes (MSR) discussing some analytical work on adapting convergence-in-distribution proof techniques to ensembles of graphs. She introduced the cut-norm graph distance metric (useful on dense graphs; says that they have some results for sparse graphs, but that it's more difficult for those). The idea of graph distance seems to pop up in many different areas (including several I've been thinking of) and is closely related to the GRAPH ISOMORPHISM problem (which is not known to be NP-complete, but nor is it known to be in P). For many reasons, it would be really useful to be able to calculate in polynomial time the minimum edge-edit distance between two graphs; this would open up a lot of useful techniques based on transforming one graph into another.
Friday began with a talk by Jeannette Janssen (Dalhousie University) on a geometric preferential attachment model, which is basically a geometric random graph but where nodes have a sphere of attraction (for new edges) that has volume proportional to the node's in-degree. She showed some very nice mathematical results on this model. I wonder if this idea could be generalized to arbitrary manifolds (with a distance metric on them) and attachment kernels. That is, imagine that our complex network has actually imbedded on some complicated manifold and the attachment is based on some function of the distance on that manifold between the two nodes. The trick would be then to infer both the structure of the manifold and the attachment function from real data. Of course, without some constraints on both features, it would be easy to construct an arbitrary pair (manifold and kernel) that would give you exactly the network you observed. Is it sufficient to get meaningful results that both should be relatively smooth (continuous, differentiable, etc.)?
Jeannette's talk was followed by Filippo Menczer's talk on mining traffic data from the Internet2/Abilene network. The data set was based on daily dumps of end-to-end communications (packet headers with client and server IDs anonymized) and looked at a variety of behaviors of this traffic. He used this data to construct interaction graphs betwen clients and servers, clients and applications (e.g., "web"), and a few other things. The analysis seems relatively preliminary in the sense that there are a lot of data issues that are lurking in the background (things like aggregated traffic from proxies, aliasing and masking effects, etc.) that make it difficult to translate conclusions about the traffic into conclusions about real individual users. But, fascinating stuff, and I'm looking forward to seeing what else comes out of that data.
The last full talk I saw was by Raissa D'Souza on competition-induced preferential attachment, and a little bit at the end on dynamic geometric graphs and packet routing on them. I've seen the technical content of the preferential attachment talk before, but it was good to have the reminder that power-law distributions are not necessarily the only game in town for heavy-tailed distributions, and that even though the traditional preferential attachment mechanism may not be a good model of the way real growing networks change, it may be that another mechanism that better models the real world can look like preferential attachment. This ties back to Sidney Redner's comment a couple of days before about the citation network: why does the network look like one grown by preferential attachment, when we know that's not how individual authors choose citations?
posted May 21, 2007 11:38 AM in Scientifically Speaking | permalink | Comments (4)
May 09, 2007
IPAM - Random and Dynamic Graphs and Networks (Day 3)
This week, I'm in Los Angeles for the Institute for Pure and Applied Mathematics' (IPAM, at UCLA) workshop on Random and Dynamic Graphs and Networks; this is the third of five entries based on my thoughts from each day. As usual, these topics are a highly subjective slice of the workshop's subject matter...
The impact of mobility networks on the worldwide spread of epidemics
I had the pleasure of introducing Alessandro Vespignani (Indiana University) for the first talk of the day on epidemics in networks, and his work in modeling the effect that particles (people) moving around on the airport network have on models of the spread of disease. I've seen most of this stuff before from previous versions of Alex's talk, but there were several nice additions. The one that struck the audience the most was a visualization of all of the individual flights over the space of a couple of days in the eastern United States; the animation was made by Aaron Koblin for a different project, but was still quite effective in conveying the richness of the air traffic data that Alex has been using to do epidemic modeling and forecasting.
On the structure of growing networks
Sidney Redner gave the pre-lunch talk about his work on the preferential attachment growing-network model. Using the master equation approach, Sid explored an extremely wide variety of properties of the PA model, such as the different regimes of degree distribution behavior for sub-, exact, and different kinds of super- linear attachment rates, the first-mover advantage in the network, the importance of initial degree in determining final degree, along with several variations on the initial model. The power of the master equation approach was clearly evident, I should really learn more about.
He also discussed his work analyzing 100 years of citation data from the Physical Review journal (about 350,000 papers and 3.5 million citations; in 1890, the average number of references in a paper was 1, while in 1990, the average number had increased to 10), particularly with respect to his trying to understand the evidence for linear preferential attachment as a model of citation patterns. Quite surprisingly, he showed that for the first 100 or so citations, papers in PR have nearly linear attachment rates. One point Sid made several times in his talk is that almost all of the results for PA models are highly sensitive to variations in the precise details of the attachment mechanism, and that it's easy to get something quite different (so, no power laws) without trying very hard.
Finally, a question he ended with is why does linear PA seem to be a pretty good model for how citations acrue to papers, even though real citation patterns are clearly not dictated by the PA model?
Panel discussion
The last talk-slot of the day was replaced by a panel discussion, put together by Walter Willinger and chaired by Mark Newman. Instead of the usual situation where the senior people of a field sit on the panel, this panel was composed of junior people (with the expectation that the senior people in the audience would talk anyway). I was asked to sit on the panel, along with Ben Olding (Harvard), Lea Popovic (Cornell), Leah Shaw (Naval Research Lab), and Lilit Yeghiazarian (UCLA). We each made a brief statement about what we liked about the workshop so far, and what kinds of open questions we would be most interested in seeing the community study.
For my on part, I mentioned many of the questions and themes that I've blogged about the past two days. In addition, I pointed out that function is more than just structure, being typically structure plus dynamics, and that our models currently do little to address the dynamics part of this equation. (For instance, can dynamical generative models of particular kinds of structure tell us more about why networks exhibit those structures specifically, and not some other variety?) Lea and Leah also emphasized dynamics as being a huge open area in terms of both modeling and mechanisms, with Lea pointing out that it's not yet clear what are the right kinds of dynamical processes that we should be studying with networks. (I made a quick list of processes that seem important, but only came up with two main caterogies, branching-contact-epidemic-percolation processes and search-navigation-routing processes. Sid later suggested that consensus-voting style processes, akin to the Ising model, might be another, although there are probably others that we haven't thought up yet.) Ben emphasized the issues of sampling, for instance, sampling subgraphs of our model, e.g., the observable WWW or even just the portion we can crawl in an afternoon, and dealing with sampling effects (i.e., uncertainty) in our models.
The audience had a lot to say on these and other topics, and particularly so on the topics of what statisticians can contribute to the field (and also why there are so few statisticians working in this area; some suggestions that many statisticians are only interested in proving asymptotic results for methods, and those that do deal with data are working on bio-informatics-style applications), and on the cultural difference between the mathematicians who want to prove nice things about toy models (folks like Christian Borgs, Microsoft Research) as a way of understanding the general propeties of networks and of their origin, and the empiricists (like Walter Willinger) who want accurate models of real-world systems that they can use to understand their system better. Mark pointed out that there's a third way in modeling, which relies on using an appropriately defined null model as a probe to explore the structure of your network, i.e., a null model that reproduces some of the structure you see in your data, but is otherwise maximally random, can be used to detect the kind of structure the model doesn't explain (so-called "modeling errors", in contrast to "measurement errors"), and thus be used in the standard framework of error modeling that science has used successfully in the past to understand complex systems.
All-in-all, I think the panel discussion was a big success, and the conversation certainly could have gone on well past the one-hour limit that Mark imposed.
posted May 9, 2007 11:38 PM in Scientifically Speaking | permalink | Comments (0)
May 08, 2007
IPAM - Random and Dynamic Graphs and Networks (Day 2)
This week, I'm in Los Angeles for the Institute for Pure and Applied Mathematics' (IPAM, at UCLA) workshop on Random and Dynamic Graphs and Networks; this is the second of five entries based on my thoughts from each day. As usual, these topics are a highly subjective slice of the workshop's subject matter...
Biomimetic searching strategies
Massimo Vergassola (Institut Pasteur) started the day with an interesting talk that had nothing to do with networks. Massimo discussed the basic problem of locating a source of smelly molecules in the macroscopic world where air currents cause pockets of the smell to be sparsely scattered across a landscape, thus spoiling the chemotaxis (gradient ascent) strategy used by bacteria, and a clever solution for it (called "infotaxis") based on trading off exploration and exploitation via an adaptive entropy minimization strategy.
Diversity of graphs with highly variable connectivity
Following lunch, David Alderson (Naval Postgraduate School) described his work with Lun Li (Caltech) on understanding just how different networks with a given degree distribution can be from each other. The take-home message of Dave's talk is, essentially, that the degree distribution is a pretty weak constraint on other patterns of connectivity, and is not a sufficient statistical characterization of the global structure of the network with respect to many (most?) of the other (say, topological and functional) aspects we might care about. Although he focused primarily on degree assortativity, the same kind of analysis could in principle be done for other network measures (clustering coefficient, distribution, diameter, vertex-vertex distance distribution, etc.), none of which are wholly independent of the degree distribution, or of each other! (I've rarely seen the interdependence of these measures discussed (mentioned?) in the literature, even though they are often treated as such.)
In addition to describing his numerical experiments, Dave sounded a few cautionary notes about the assumptions that are often made in the complex networks literature (particularly by theoreticians using random-graph models) on the significance of the degree distribution. For instance, the configration model with a power-law degree sequence (and similarly, graphs constructed via preferential attachment) yields networks that look almost nothing like any real-world graph that we know, except for making vaguely similar degree distributions, and yet they are often invoked as reasonable models of real-world systems. In my mind, it's not enough to simply fix-up our existing random-graph models to instead define an ensemble with a specific degree distribution, and a specific clustering coefficient, and a diameter, or whatever our favorite measures are. In some sense all of these statistical measures just give a stylized picture of the network, and will always be misleading with respect to other important structural features of real-world networks. For the purposes of proving mathematical theorems, I think these simplistic toy models are actually very necessary -- since their strong assumptions make analytic work significantly easier -- so long as we also willfully acknowledge that they are a horrible model of the real world. For the purposes of saying something concrete about real networks, we need more articulate models, and, probably, models that are domain specific. That is, I'd like a model of the Internet that respects the idiosyncracies of this distributed engineered and evolving system; a model of metabolic networks that respects the strangeness of biochemistry; and a model of social networks that understands the structure of individual human interactions. More accurately, we probably need models that understand the function that these networks fulfill, and respect the dynamics of the network in time.
Greedy search in social networks
David Liben-Nowell (Carleton College) then closed the day with a great talk on local search in social networks. The content of this talk largely mirrored that of Ravi Kumar's talk at GA Tech back in January, which covered an empirical study of the distribution of the (geographic) distance covered by friendship links in the LiveJournal network (from 2003, when it had about 500,000 users located in the lower 48 states). This work combined some nice data analysis with attempts to validate some of the theoretical ideas due to Kleinberg for locally navigable networks, and a nice generalization of those ideas to networks with non-uniform population distributions.
An interesting point that David made early in his talk was that homophily is not sufficient to explain the presense of either the short paths that Milgrams' original 6-degrees-of-seperation study demonstrated, or even the existence of a connected social graph! That is, without a smoothly varying notion of "likeness", then homophily would lead us to expect disconnected components in the social network. If both likeness and the population density in the likeness space varied smoothly, then a homophilic social web would cover the space, but the average path length would be long, O(n). In order to get the "small world" that we actually observe, we need some amount of non-homophilic connections, or perhaps multiple kinds of "likeness", or maybe some diversity in the preference functions that individuals use to link to each other. Also, it's still not clear what mechanism would lead to the kind of link-length distribution predicted by Kleinberg's model of optimally navigable networks - an answer to this question would, presumably, tell us something about why modern societies organize themselves the way they do.
posted May 8, 2007 10:52 PM in Scientifically Speaking | permalink | Comments (0)
May 07, 2007
IPAM - Random and Dynamic Graphs and Networks (Day 1)
This week, I'm in Los Angeles for the Institute for Pure and Applied Mathematics' (IPAM, at UCLA) workshop on random and dynamic graphs and networks. This workshop is the third of four in their Random Shapes long program. The workshop has the usual format, with research talks throughout the day, punctuated by short breaks for interacting with your neighbors and colleagues. I'll be trying to do the same for this event as I did for the DIMACS workshop I attended back in January, which is to blog each day about interesting ideas and topics. As usual, this is a highly subjective slice of the workshop's subject matter.
Detecting and understanding the large-scale structure of networks
Mark Newman (U. Michigan) kicked off the morning by discussing his work on clustering algorithms for networks. As he pointed out, in the olden days of network analysis (c. 30 years ago), you could write down all the nodes and edges in a graph and understand its structure visually. These days, our graphs are too big for this, and we're stuck using statistical probes to understand how these things are shaped. And yet, many papers include figures of networks as incoherent balls of nodes and edges (Mark mentioned that Marc Vidal calls these figures "ridiculograms").
I've seen the technical content of Mark's talk before, but he always does an excellent job of making it seem fresh. In this talk, there was a brief exchange with the audience regarding the NP-completeness of the MAXIMUM MODULARITY problem, which made me wonder what exactly are the kind of structures that would make an instance of the MM problem so hard. Clearly, polynomial time algorithms that approximate the maximum modularity Q exist because we have many heuristics that work well on (most) real-world graphs. But, if I was an adversary and wanted to design a network with particularly difficult structure to partition, what kind would I want to include? (Other than reducing another NPC problem using gadgets!)
Walter Willinger raised a point here (and again in a few later talks) about the sensitivity of most network analysis methods to topological uncertainty. That is, just about all the techniques we have available to us assume that the edges as given are completely correct (no missing or spurious edges). Given the classic result due to Watts and Strogatz (1998) of the impact that a few random links added to a lattice have on the diameter of the graph, it's clear that in some cases, topological errors can have a huge impact on our conclusions about the network. So, developing good ways to handle uncertainty and errors while analyzing the structure of a network is a rather large, gaping hole in the field. Presumably, progress in this area will require having good error models of our uncertainty, which, necessary, depend on the measurement techniques used to produce the data. In the case of traceroutes on the Internet, this kind of inverse problem seems quite tricky, but perhaps not impossible.
Probability and Spatial Networks
David Aldous (Berkeley) gave the second talk and discussed some of his work on spatial random graphs, and, in particular, on the optimal design and flow through random graphs. As an example, David gave us a simple puzzle to consider:
Given a square of area N with N nodes distributed uniformly at random throughout. Now, subdivided this area into L^2 subsquares, and choose one node in each square to be a "hub." Then, connect each of the remaining nodes in a square to the hub, and connect the hubs together in a complete graph. The question is, what is the size L that minimizes the total (Euclidean) length of the edges in this network?
He then talked a little about other efficient ways to connect up uniformly scattered points in an area. In particular, Steiner trees are the standard way to do this, and have a cost O(N). The downside for this efficiency is that the tree-distance between physically proximate points on the plane is something polynomial in N (David suggested that he didn't have a rigorous proof for this, but it seems quite reasonable). As it turns out, you can dramatically lower this cost by adding just a few random lines across the plane -- the effect is analagous to the one in the Watts-Strogatz model. Naturally, I was thinking about the structure of real road networks here, and it would seem that the effect of highways in the real world is much the same as David's random lines. That is, it only takes a few of these things to dramatically reduce the topological distance between arbitrary points. Of course, road networks have other things to worry about, such as congestion, that David's highways don't!
posted May 7, 2007 11:49 PM in Scientifically Speaking | permalink | Comments (1)
January 25, 2007
DIMACS - Complex networks and their applications (Day 3)
The third day of the workshop focused on applications to biochemical networks (no food webs), with a lot of that focus being on the difficulties of taking fuzzy biological data (e.g., gene expression data) and converting it into an accurate and meaningful form for further analysis or for hypothesis testing. Only a few of the talks were theoretical, but this perhaps reflects the current distribution of focus in biology today. After the workshop was done, I wondered just how much information crossed between the various disciplines represented at the workshop - certainly, I came away from it with a few new ideas, and a few new insights from the good talks I attended. And I think that's the sign of a successful workshop.
Complex Networks in Biology
Chris Wiggins (Columbia) delivered a great survey of interesting connections between machine learning and biochemical networks. It's probably fair to say that biologists are interested in constructing an understanding of cellular-level systems that compares favorably to an electrical engineer's understanding of circuits (Pointer: Can a Biologist Fix a Radio?). But, this is hard because living stuff is messy, inconsistent in funny ways, and has a tendency to change while you're studying it. So, it's harder to get a clean view of what's going on under the hood than it was with particle physics. This, of course, is where machine learning is going to save us - ML offers powerful and principled ways to sift through (torture) all this data.
The most interesting part of his talk, I think, was his presentation of NetBoost, a mechanism discriminator that can tell you which (among a specific suite of existing candidates) is the most likely to have generated your observed network data [1]. For instance, was it preferential attachment (PA) or duplication-mutation-complementation (DMC) that produced a given protein-interaction network (conclusion: the latter is better supported). The method basically works by constructing a decision tree that looks at the subgraph decomposition of a network and scores it's belief that each of the various mechanisms produced it [2]. With the ongoing proliferation of network mechanisms (theorists really don't have enough to do these days), this kind of approach serves as an excellent way to test a new mechanism against the data it's supposed to be emulating.
One point Chris made that resonated strongly with me - and which Cris and Mark made yesterday - is the problem with what you might call "soft validation" [3]. Typically, a study will cluster or do some other kind of analysis with the data, and then tell a biological story about why these results make sense. On the other hand, forcing the clustering to make testable predictions would be a stronger kind of validation.
Network Inference and Analysis for Systems Biology
Just before lunch, Joel Bader (Johns Hopkins) gave a brief talk about his work on building a good view of the protein-protein interaction network (PPIN). The main problems with this widely studied data are the high error rate, both for false positives (interactions that we think exist, but don't) and false negatives (interactions that we think don't exist, but do). To drive home just how bad the data is, he pointed out that two independent studies of the human PPIN showed just 1% overlap in the sets of "observed" interactions.
He's done a tremendous amount of work on trying to improve the accuracy of our understanding of PPINs, but here he described a recent approach that fits degree-based generative models [4] to the data using our old friend expectation-maximization (EM) [5]. His results suggest that we're seeing about 30-40% of the real edges, but that our false positive rate is about 10-15%. This is a depressing signal-to-noise ratio (roughly 1%), because the number of real interactions is O(n), while our false positive rate is O(n^2). Clearly, the biological methods used to infer the interactions need to be improved before we have a clear idea of what this network looks like, but it also suggests that a lot of the previous results on this network are almost surely wrong. Another question is whether it's possible to incorporate these kinds of uncertainties into our analyses of the network structure.
Activating Interaction Networks and the Dynamics of Biological Networks
Meredith Betterton (UC-Boulder) presented some interesting work on signaling and regulatory networks. One of the more surprising tidbits she used in her motivation is the following. In yeast, the mRNA transcription undergoes a consistent 40-minute genome-wide oscillation, but when exposed to an antidepressant (in this case, phenelzine), the period doubles [6]. (The fact that gene expression oscillates like this poses another serious problem for the results of gene expression analysis that doesn't account for such oscillations.)
The point Meredith wanted to drive home, though, was we shouldn't just think of biochemical networks as static objects - they also represent the form that the cellular dynamics must follow. Using a simple dynamical model of activation and inhibition, she showed that the structure (who points to who, and whether an edge inhibits or activates its target) of a real-world circadian rhythm network and a real-world membrane-based signal cascade basically behave exactly as you would expect - one oscillates and the other doesn't. But, then she showed that it only takes a relatively small number of flips (activation to inhibition, or vice versa) to dramatically change the steady-state behavior of these cellular circuits. In a sense, this suggests that these circuits are highly adaptable, given a little pressure.
There are several interesting questions that came to mind while she was presenting. For instance, if we believe there are modules within the signaling pathways that accomplish a specific function, how can we identify them? Do sparsely connected dense subgraphs (assortative community structure) map onto these functional modules? What are the good models for understanding these dynamics, systems of differential equations, discrete time and matrix multiplication, or something more akin to a cellular version of Ohm's Law? [7]
-----
[1] M. Middendorf, E. Ziv and C. Wiggins, "Inferring Network Mechanisms: The Drosophila melanogaster Protein Interaction Network." PNAS USA 102 (9), 3192 (2005).
[2] Technically, it's using these subgraphs as generic features and then crunching the feature vectors from examples of each mechanism through a generalized decision tree in order to learn how to discriminate among them. Boosting is used within this process in order to reduce the error rates. The advantage of this approach to model selection and validation, as Chris pointed out, is that it doesn't assume a priori which features (e.g., degree distribution, clustering coefficient, distance distribution, whatever) are interesting, but rather chooses the ones that can actually discriminate between things we believe are different.
[3] Chris called it "biological validation," but the same thing happens in sociology and Internet modeling, too.
[4] I admit that I'm a little skeptical of degree-based models of these networks, since they seem to assume that we're getting the degree distribution roughly right. That assumption is only reasonable if our sampling of the interactions attached to a particular vertex is unbiased, which I'm not sure about.
[5] After some digging, I couldn't find the reference for this work. I did find this one, however, which illustrates a different technique for a related problem. I. Iossifov et al., "Probabilistic inference of molecular networks from noisy data sources." 20 (8), 1205 (2004).
[6] C. M. Li and R. R. Klevecz, "A rapid genome-scale response of the transcriptional oscillator to perturbation reveals a period-doubling path to phenotypic change." PNAS USA 103 (44), 16254 (2006).
[7] Maribeth Oscamou pointed out to me during the talk that any attempt to construct such rules have to account for processes like the biochemical degradation of the signals. That is, unlike electric circuits, there's no strict conservation of the "charge" carrier.
posted January 25, 2007 01:20 PM in Scientifically Speaking | permalink | Comments (0)
January 24, 2007
DIMACS - Complex networks and their applications (Day 2)
There were several interesting talks today, or rather, I should say that there were several talks today that made me think about things beyond just what the presenters said. Here's a brief recap of the ones that made me think the most, and some commentary about what I thought about. There were other good talks today, too. For instance, I particularly enjoyed Frank McSherry's talk on doing PageRank on his laptop. There was also one talk on power laws and scale-free graphs that stimulated a lot of audience, ah, interaction - it seems that there's a lot of confusion both over what a scale-free graph is (admittedly the term has no consistent definition in the literature, although there have been some recent attempts to clarify it in a principled manner), and how to best show that some data exhibit power-law behavior. Tomorrow's talks will be more about networks in various biological contexts.
Complex Structures in Complex Networks
Mark Newman's (U. Michigan) plenary talk mainly focused on the importance of having good techniques to extract information from networks, and being able to do so without making a lot of assumptions about what the technique is supposed to look for. That is, rather than assume that some particular kind of structure exists and then look for it in our data, why not let the data tell you what kind of interesting structure it has to offer? [1] The tricky thing about this approach to network analysis, though, is working out a method that is flexible enough to find many different kinds of structure, and to present only that which is unusually strong. (Point to ponder: what should we mean by "unusually strong"?) This point was a common theme in a couple of the talks today. The first example that Mark gave of a technique that has this nice property was a beautiful application of spectral graph theory to the task of find a partition of the vertices that give an extremal value of modularity. If we ask for the maximum modularity, this heuristic method [2], using the positive eigenvalues of the resulting solution, gives us a partition with very high modularity. But, using the negative eigenvalues gives a partition that minimizes the modularity. I think we normally think of modules meaning assortative structures, i.e., sparsely connected dense subgraphs. But, some networks exhibit modules that are approximately bipartite, i.e., they are disassortative, being densely connected sparse subgraphs. Mark's method naturally allows you to look for either. The second method he presented was a powerful probabilistic model of node clustering that can be appropriately parameterized (fitted to data) via expectation-maximization (EM). This method can be used to accomplish much the same results as the previous spectral method, except that it can look for both assortative and disassortative structure simultaneously in the same network.
Hierarchical Structure and the Prediction of Missing Links
In an afternoon talk, Cris Moore (U. New Mexico) presented a new and powerful model of network structure, the hierarchical random graph (HRG) [5]. (Disclaimer: this is joint work with myself and Mark Newman.) A lot of people in the complex networks literature have talked about hierarchy, and, presumably, when they do so, they mean something roughly along the lines of the HRG that Cris presented. That is, they mean that nodes with a common ancestor low in the hierarchical structure are more likely to be connected to each other, and that different cuts across it should produce partitions that look like communities. The HRG model Cris presented makes these notions explicit, but also naturally captures the kind of assortative hierarchical structure and the disassortative structure that Mark's methods find. (Test to do: use HRG to generate mixture of assortative and disassortative structure, then use Mark's second method to find it.) There are several other attractive qualities of the HRG, too. For instance, using a Monte Carlo Markov chain, you can find the hierarchical decomposition of a single real-world network, and then use the HRG to generate a whole ensemble of networks that are statistically similar to the original graph [6]. And, because the MCMC samples the entire posterior distribution of models-given-the-data, you can look not only at models that give the best fit to the data, but you can look at the large number of models that give an almost-best fit. Averaging properties over this ensemble can give you more robust estimates of unusual topological patterns, and Cris showed how it can also be used to predict missing edges. That is, suppose I hide some edges and then ask the model to predict which ones I hid. If it can do well at this task, then we've shown that the model is capturing real correlations in the topology of the real graph - it has the kind of explanatory power that comes from making correct predictions. These kinds of predictions could be extremely useful for laboratory or field scientists who manually collect network data (e.g., protein interaction networks or food webs) [7]. Okay, enough about my own work!
The Optimization Origins of Preferential Attachment
Although I've seen Raissa D'Souza (UC Davis) talk about competition-induced preferential attachment [8] before, it's such an elegant generalization of PA that I enjoyed it a second time today. Raissa began by pointing out that most power laws in the real-world can't extend to infinity - in most systems, there are finite limits to the size that things can be (the energy released in an earthquake or the number of edges a vertex can have), and these finite effects will typically manifest themselves as exponential cutoffs in the far upper tail of the distribution, which takes the probability of these super-large events to zero. She used this discussion as a springboard to introduce a relatively simple model of resource constraints and competition among vertices in a growing network that produces a power-law degree distribution with such an exponential cutoff. The thing I like most about this model is that it provides a way for (tempered) PA to emerge from microscopic and inherently local interactions (normally, to get pure PA to work, you need global information about the system). The next step, of course, is to find some way to measure evidence for this mechanism in real-world networks [9]. I also wonder how brittle the power-law result is, i.e., if you tweak the dynamics a little, does the power-law behavior disappear?
Web Search and Online Communities
Andrew Tomkins (of Yahoo! Reserch) is a data guy, and his plenary talk drove home the point that Web 2.0 applications (i.e., things that revolve around user-generated content) are creating a huge amount of data, and offering unparalleled challenges for combining, analyzing, and visualizing this data in meaningful ways. He used Flickr (a recent Y! acquisition) as a compelling example by showing an interactive (with fast-rewind and fast-forward features) visual stream of the trends in user-generated tags for user-posted images, annotated with notable examples of those images. He talked a little about the trickiness of the algorithms necessary to make such an application, but what struck me most was his plea for help and ideas in how to combine information drawn from social networks with user behavior with blog content, etc. to make more meaningful and more useful applications - there's all this data, and they only have a few ideas about how to combine it. The more I learn about Y! Research, the more impressed I am with both the quality of their scientists (they recently hired Duncan Watts), and the quality of their data. Web 2.0 stuff like this gives me the late-1990s shivers all over again. (Tomkins mentioned that in Korea, unlike in the US, PageRank-based search has been overtaken by an engine called Naver, which is driven by users building good sets of responses to common search queries.)
-----
[1] To be more concrete, and perhaps in lieu of having a better way of approaching the problem, much of the past work on network analysis has taken the following approach. First, think of some structure that you think might be interesting (e.g., the density of triangles or the division into sparsely connected dense subgraphs), design a measure that captures that structure, and then measure it in your data (it turns out to be non-trivial to do this in an algorithm independent way). Of course, the big problem with this approach is that you'll never know whether there is other structure that's just as important as, or maybe more important than, the kind you looked for, and that you just weren't clever enough to think to look for it.
[2] Heuristic because Mark's method is a polynomial time algorithm, while the problem of modularity maximization was recently (finally...) shown to be NP-complete. The proof is simple, and, in retrospect, obvious - just as most such proofs inevitably end up being. See U. Brandes et al. "Maximizing Modularity is hard." Preprint (2006).
[3] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices." PRE 74, 036104 (2006).
[4] M. E. J. Newman and E. A. Leicht, "Mixture models and exploratory data analysis in networks." Submitted to PNAS USA (2006).
[5] A. Clauset, C. Moore and M. E. J. Newman, "Structural Inference of Hierarchies in Networks." In Proc. of the 23rd ICML, Workshop on "Statistical Network Analysis", Springer LNCS (Pittsburgh, June 2006).
[6] This capability seems genuinely novel. Given that there are an astronomical number of ways to rearrange the edges on a graph, it's kind of amazing that the hierarchical decomposition gives you a way to do such a rearrangement, but one which preserves the statistical regularities in the original graph. We've demonstrated this for the degree distribution, the clustering coefficient, and the distribution of pair-wise distances. Because of the details of the model, it sometimes gets the clustering coefficient a little wrong, but I wonder just how powerful / how general this capability is.
[7] More generally though, I think the idea of testing a network model by asking how well it can predict things about real-world problems is an important step forward for the field; previously, "validation" consisted of showing only a qualitative (or worse, a subjective) agreement between some statistical measure of the model's behavior (e.g., degree distribution is right-skewed) and the same statistical measure on a real-world network. By being more quantitative - by being more stringent - we can say stronger things about the correctness of our mechanisms and models.
[8] R. M. D'Souza, C. Borgs, J. T. Chayes, N. Berger, and R. Kleinberg, "Emergence of Tempered Preferential Attachment From Optimization", To appear in PNAS USA, (2007).
[9] I think the best candidate here would be the BGP graph, since there is clearly competition there, although I suspect that the BGP graph structure is a lot more rich than the simple power-law-centric analysis has suggested. This is primarily due to the fact that almost all previous analyses have ignored the fact that the BGP graph exists as an expression of the interaction of business interests with the affordances of the Border Gateway Protocol itself. So, its topological structure is meaningless without accounting for the way it's used, and this means accounting for complexities of the customer-provider and peer-to-peer relationships on the edges (to say nothing of the sampling issues involved in getting an accurate BGP map).
posted January 24, 2007 02:18 AM in Scientifically Speaking | permalink | Comments (1)
January 23, 2007
DIMACS - Complex networks and their applications (Day 1)
Today and tomorrow, I'm at the DIMACS workshop on complex networks and their applications, held at Georgia Tech's College of Computing. Over the course of the workshop, I'll be blogging about the talks I see and whatever ideas they stimulate (sadly, I missed most of the first day because of travel).
The most interesting talk I saw Monday afternoon was by Ravi Kumar (Yahoo! Research), who took location data of users on LiveJournal, and asked Do we see the same kind of routable structure - i.e., an inverses-square law relationship in the distance between people and the likelihood that they have a LJ connection - that Kleinberg showed was optimal for distributed / local search? Surprisingly, they were able to show that in the US, once you correct for the fact that there can be many people at a single "location" in geographic space (approximated to the city level), you do indeed observe exactly the kind of power-law that Kleinberg predicted [1]. Truly, this was a kind of stunning confirmation of Kleinberg's theory. So now, the logical question would be, What mechanism might produce this kind of structure in geographic space? Although you could probably get away with assuming a priori the population distribution, what linking dynamics would construct the observed topological pattern? My first project in graduate school asked exactly this question for the pure Kleinberg model, and I wonder if it could be adapted to the geographic version that Kumar et al. consider.
[1] D. Liben-Nowell, et al. "Geographic Routing in Social Networks." PNAS USA 102, 33 11623-1162 (2005).
posted January 23, 2007 06:24 AM in Scientifically Speaking | permalink | Comments (0)
October 04, 2006
Fighting the dominant paradigm
Sean over at Cosmic Variance has a nice review of Lee Smolin's The Trouble With Physics, which is itself a critique of theoretical physic's focus on string theory as the way to unify gravity with the other forces. Most of the review focuses on Smolin's criticism of string theory's dominance, but Sean points out that Smolin is actually making two arguments, one about string theory and one about supporting interesting alternative ideas.
Smolin talks a great deal about the need for physics, and academia more generally, to support plucky upstart ideas and scholars with the courage and vision to think big and go against the grain. This is a larger point than the specific argument about how to best quantize gravity, and ultimately far more persuasive; it is likely, unfortunately, to be lost amidst the conflict between string theory and its discontents. Faculty positions and grant money are scarce commodities, and universities and funding agencies are naturally risk-averse. Under the current system, a typical researcher might spend five years in graduate school, three to six as a postdoc, and another six or seven as an assistant professor before getting tenure – with an expectation that they will write several competent papers in every one of those years. Nobody should be surprised that, apart from a few singular geniuses, the people who survive this gauntlet are more likely to be those who show technical competence within a dominant paradigm, rather than those who will take risks and pursue their idiosyncratic visions. The dogged pursuit of string theory through the 1970’s by Green and Schwarz is a perfect example of the ultimate triumph of the latter approach, and Smolin is quite correct to lament the lack of support for this kind of research today.
Although he's talking about theoretical physicists, the same applies just as much to other disciplines (perhaps with shorter postdoc periods) and their relationship to upstart ideas. Of course, finding the right balance between "normal science" and "paradigm-shifting science" is not easy, and there is a big difference between supporting interesting new ideas and supporting crackpots. Sometimes, that distinction can be hard to see at first, but all good new ideas ultimately lead to really excellent science. Fortunately, there are places that actively encourage both both excellent work and thinking about crazy ideas.
Update Oct. 4: Suresh blogs about Sean's review as well, and also zeros in on the same passage. He makes some valuable points about how important it is to build your own model of how to do good research. Separately, Dave Bacon blogs about Peter Shor's review of Smolin's book, in which Shor likens Nature to an insurance salesman.
posted October 4, 2006 09:42 AM in Interdisciplinarity | permalink | Comments (0)
July 06, 2006
An ontological question about complex systems
Although I've been reading Nature News for several years now (as part of my daily trawl for treasure in the murky waters of science), I first came to recognize one of their regular writers Philip Ball when he wrote about my work on terrorism with Maxwell Young. His essay, now hidden behind Nature's silly subscription-only barrier, sounded an appropriately cautionary note about using statistical patterns of human behavior to predict the future, and was even titled "Don't panic, it might never happen."
The idea that there might be statistical laws that govern human behavior can be traced, as Ball does in his essay, back to the English philosopher Thomas Hobbes (1588-1679) in The Leviathan and to the French positivist philosopher Auguste Comte (1798-1857; known as the father of sociology, and who also apparently coined the term "altruism"), who were inspired by the work of physicists in mechanizing the behavior of nature to try to do the same with human societies.
It seems, however, that somewhere between then and now, much of sociology has lost interest in such laws. A good friend of mine in graduate school for sociology (who shall remain nameless to protect her from the politics of academia) says that her field is obsessed with the idea that context, or nurture, drives all significant human behavior, and that it rejects the idea that overarching patterns or laws of society might exist. These, apparently, are the domain of biology, and thus Not Sociology. I'm kind of stunned that any field that takes itself seriously would so thoroughly cling to the nearly medieval notion of the tabula rasa (1) in the face of unrelenting scientific evidence to the contrary. But, if this territory has been abandoned by sociologists (2), it has recently, and enthusiastically, been claimed by physicists (who may or may not recognize the similarity of their work to a certain idea in science fiction).
Ball's background is originally in chemistry and statistical physics, and having spent many years as an editor at Nature, he apparently now has a broad perspective on modern science. But, what makes his writing so enjoyable is the way he places scientific advances in their proper historical context, showing both where the inspiration may have come from, and how other scientists were developing similar or alternative ideas concurrently. These strengths are certainly evident in his article about the statistical regularity of terrorism, but he puts them to greater use in several books and, in particular, one on physicists' efforts to create something he calls sociophysics.
As it turns out, however, this connection between physics and sociology is not a new one, and the original inspiration for statistical physics (one of the three revolutionary ideas in modern physics; the other two are quantum mechanics and relativity) is owed to social scientists.
In the mid 1800s, James Clerk Maxwell, one of the fathers of statistical physics, read Henry Thomas Buckle's lengthy History of Civilization. Buckle was a historian by trade, and a champion of the idea that society's machinations are bound by fundamental laws. Maxwell, struggling with the question of how to describe the various motions of particles in a gas, was struck by Buckle's descriptions of the statistical nature of studies of society. Such studies sought not to describe each individual and their choices exactly, but instead represent the patterns of behavior statistically, and often pointed to surprising regularities, e.g., the near-stable birth or suicide rates in a particular region. As a result, Maxwell abandoned the popular approach of describing gas particles only using Newtonian mechanics, i.e., an attempt to describe every particle's position and motion exactly, in favor for a statistical approach that focused on the distribution of velocities.
It was the profound success of these statistical descriptions that helped cement this approach as one of the most valuable tools available to physicists, and brought about some pretty profound shifts in our understanding of gasses, materials and even astrophysics. So, it seems fitting that statistical physicists are now returning to their roots by considering statistical laws of human behavior. Alas, I doubt that most such physicists appreciate this fact.
These efforts, which Ball surveys in "Critical Mass" (Farrar, Straus and Giroux, 2004) via a series of well-written case studies, have dramatically altered our understanding of phenomena as varied as traffic patterns (which have liquid, gaseous, solid and meta-stable states along with the corresponding phase transitions), voting patterns in parliamentary elections (which display nice heavy-tailed statistics), the evolution of pedestrian traffic trails across a university quad, economics and the statistics of businesses and markets, and a very shallow discussion of social networks. Although his exposition is certainly aimed at the layman, he does not shy away from technical language when appropriate. Pleasantly, he even reproduces figures from the original papers when it serves his explanations. Given that these phenomena were drawn from a burgeoning field of interdisciplinary research, it's easy to forgive him for omitting some of my favorite topics, treating others only shallowly, and mercifully leaving out the hobby horses of cellular automata, genetic algorithms and artificial life.
Now, after seeing that list of topics, you might think that "Critical Mass" was a book about complex systems, and you might be right. But, you might be wrong, too, which is the problem when there's no strict definition of a term. So, let's assume he has, and see what this offers in terms of clarifying the corresponding ontological question. For one thing, Ball's choices suggest that perhaps we do not need other ill-defined properties like emergence, self-organization or robustness (3) to define a complex system. Instead, perhaps when we say we are studying a "complex system," we simply mean that it has a highly heterogeneous composition that we seek to explain using statistical mechanisms. To me, the former means that I, because of my limited mental capacity to grasp complicated equations, relationships or a tremendously large configuration space, pretty much have to use a statistical characterization that omits most of the detailed structure of the system; also, I say heterogeneous because homogeneous systems are much easier to explain using traditional statistical mechanics. The latter means that I'm not merely interested in describing the system, which can certainly be done using traditional statistics, but rather in explaining the rules and laws that govern the formation, persistence and evolution of that structure. For me, this definition is attractive both for its operational and utilitarian aspects, but also because it doesn't require me to wave my hands, use obfuscating jargon or otherwise change the subject.
In general, it's the desire to establish laws that reflects complex systems' roots in physics, and it is this that distinguishes it from traditional statistics and machine learning. In those areas, the focus seems to me to be more on predictive power ("Huzzah! My error rate is lower than yours.") and less on mechanisms. My machine learning friends tell me that people are getting more interested in the "interpretability" of their models, but I'm not sure this is the same thing as building models that reflect the true mechanical nature of the underlying system... of course, one fundamental difference between much of statistical learning and what I've described above is that for many systems, there's no underlying mechanism! We shouldn't expect problems like keeping the spam out of my inbox to exhibit nice mechanistic behavior, and there are a tremendous number of such problems out there today. Fortunately, I'm happy to leave those to people who care more about error rates than mechanisms, and I hope they're happy to leave studying the (complex) natural world, mechanisms and all, to me.
Updates, July 7
(1) The notion of the tabula rasa is not antithetical to the idea that there are patterns in social behavior, but patterns per se are not the same as the kind of societal laws that the founders of sociology were apparently interested in, i.e., sociology apparently believes these patterns to be wholly the results of culture and not driven by things that every human shares like our evolutionary history as a species. I suppose there's a middle ground here, in which society has created the appearance of laws, which the sociophysicists then discover and mistake for absolutes. Actually, I'm sure that much of what physicists have done recently can be placed into this category.
(2) It may be the case that it is merely the portion of sociology that my friend is most familiar with that expresses this odd conviction, and that there are subfields that retain the idea that true mechanistic laws do operate in social systems. For all I know, social network analysis people may be of this sort; it would be nice to have an insider's perspective on this.
(3) Like the notions of criticality and universality, these terms actually do have precise, technical definitions in their proper contexts, but they've recently been co-opted in imprecisely ways and are now, unfortunately and in my opinion, basically meaningless in most of the complex systems literature.
posted July 6, 2006 07:09 PM in Reviews | permalink | Comments (0)
March 01, 2006
The scenic view
In my formal training in physics and computer science, I never did get much exposure to statistics and probability theory, yet I have found myself consistently using them in my research (partially on account of the fact that I deal with real data quite often). What little formal exposure I did receive was always in some specific context and never focused on probability as a topic itself (e.g., statistical mechanics, which could hardly be called a good introduction to probability theory). Generally, my training played-out in the crisp and clean neighborhoods of logical reasoning, algebra and calculus, with the occasional day-trip to the ghetto of probability. David Mumford, a Professor of Mathematics at Brown University, opines about ongoing spread of that ghetto throughout the rest science and mathematics, i.e., how probability theory deserves a respect at least equal to that of abstract algebra, in a piece from 1999 on The Dawning of the Age of Stochasticity. From the abstract,
For over two millennia, Aristotle's logic has rules over the thinking of western intellectuals. All precise theories, all scientific models, even models of the process of thinking itself, have in principle conformed to the straight-jacket of logic. But from its shady beginnings devising gambling strategies and counting corpses in medieval London, probability theory and statistical inference now emerge as better foundations for scientific models ... [and] even the foundations of mathematics itself.
It may sound it, but I doubt that Mumford is actually overstating his case here, especially given the deep connection between probability theory, quantum mechanics (c.f. the recent counter-intuitive result on quantum interrogation) and complexity theory.
A neighborhood I'm more familiar with is that of special functions; things like the Gamma distribution, the Riemann Zeta function (a personal favorite), and the Airy functions. Sadly, these familiar friends show up very rarely in the neighborhood of traditional computer science, but instead hang out in the district of mathematical modeling. Robert Batterman, a Professor of Philosophy at Ohio State University, writes about why exactly these functions are so interesting in On the Specialness of Special Functions (The Nonrandom Effusions of the Divine Mathematician).
From the point of view presented here, the shared mathematical features that serve to unify the special functions - the universal form of their asymptotic expansions - depends upon certain features of the world.
(Emphasis his.) That is, the physical world itself, by presenting a patterned appearance, must be governed by a self-consistent set of rules that create that pattern. In mathematical modeling, these rules are best represented by asymptotic analysis and, you guessed it, special functions, that reveal the universal structure of reality in their asymptotic behavior. Certainly this approach to modeling has been hugely successful, and remains so in current research (including my own).
My current digs, however, are located in the small nexus that butts up against these neighborhoods and those in computer science. Scott Aaronson, who occupies an equivalent juncture between computer science and physics, has written several highly readable and extremely interesting pieces on the commonalities he sees in his respective locale. I've found them to be a particularly valuable way to see beyond the unfortunately shallow exploration of computational complexity that is given in most graduate-level introductory classes.
In NP-complete Problems and Physical Reality Aaronson looks out of his East-facing window toward physics for hints about ways to solve NP-complete problems by using physical processes (e.g., simulated annealing). That is, can physical reality efficiently solve instances of "hard" problems? Although he concludes that the evidence is not promising, he points to a fundamental connection between physics and computer science.
Then turning to look out his West-facing window towards computer science, he asks Is P Versus NP Formally Indepenent?, where he considers formal logic systems and the implications of Godel's Incompleteness Theorem for the likelihood of resolving the P versus NP question. It's stealing his thunder a little, but the most quotable line comes from his conclusion:
So I'll state, as one of the few definite conclusions of this survey, that P \not= NP is either true or false. It's one or the other. But we may not be able to prove which way it goes, and we may not be able to prove that we can't prove it.
There's a little nagging question that some researchers are only just beginning to explore, which is, are certain laws of physics formally independent? I'm not even entirely sure what that means, but it's an interesting kind of question to ponder on a lazy Sunday afternoon.
There's something else embedded in these topics, though. Almost all of the current work on complexity theory is logic-oriented, essentially because it was born of the logic and formal mathematics of the first half of the 20th century. But, if we believe Mumford's claim that statistical inference (and in particular Bayesian inference) will invade all of science, I wonder what insights it can give us about solving hard problems, and perhaps why they're hard to begin with.
I'm aware of only anecdotal evidence of such benefits, in the form of the Survey Propagation Algorithm and its success at solving hard k-SAT formulas. The insights from the physicists' non-rigorous results has even helped improve our rigorous understanding of why problems like random k-SAT undergo a phase transition from mostly easy to mostly hard. (The intuition is, in short, that as the density of constraints increases, the space of valid solutions fragments into many disconnected regions.) Perhaps there's more being done here than I know of, but it seems that a theory of inferential algorithms as they apply to complexity theory (I'm not even sure what that means, precisely; perhaps it doesn't differ significantly from PPT algorithms) might teach us something fundamental about computation.
posted March 1, 2006 02:32 PM in Interdisciplinarity | permalink | Comments (0)
December 19, 2005
On modeling the human response time function; Part 3.
Much to my surprise, this morning I awoke to find several emails in my inbox apparently related to my commentary on the Barabasi paper in Nature. This morning, Anders Johansen pointed out to myself and Luis Amaral (I can only assume that he has already communicated this to Barabasi) that in 2004 he published an article entitled Probing human response times in Physica A about the very same topic using the very same data as that of Barabasi's paper. In it, he displays the now familiar heavy-tailed distribution of response times and fits a power law of the form P(t) ~ 1/(t+c) where c is a constant estimated from the data. Asymptotically, this is the same as Barabasi's P(t) ~ 1/t; it differs in the lower tail, i.e., for t < c where it scales more uniformly. As an originating mechanism, he suggests something related to a spin-glass model of human dynamics.
Although Johansen's paper raises other issues, which I'll discuss briefly in a moment, let's step back and think about this controversy from a scientific perspective. There are two slightly different approaches to modeling that are being employed to understand the response-time function of human behavior. The first is a purely "fit-the-data" approach, which is largely what Johansen has done, and certainly what Amaral's group has done. The other, employed by Barabasi, uses enough data analysis to extract some interesting features, posits a mechanism for the origin of those and then sets about connecting the two. The advantage of developing such a mechanistic explanation is that (if done properly) it provides falsifiable hypotheses and can move the discussion past simple data-analysis techniques. The trouble begins, as I've mentioned before, when either a possible mechanistic model is declared to be "correct" before being properly vetted, or when an insufficient amount of data analysis is done before positing a mechanism. This latter kind of trouble allows for a debate over how much support the data really provides to the proposed mechanism, and is exactly the source of the exchange between Barabasi et al. and Stouffer et al.
I tend to agree with the idea implicitly put forward by Stouffer et al.'s comment that Barabasi should have done more thorough data analysis before publishing, or alternatively, been a little more cautious in his claims of the universality of his mechanism. In light of Johansen's paper and Johansen's statement that he and Barabasi spoke at the talk in 2003 where Johansen presented his results, there is now the specter that either previous work was not cited that should have been, or something more egregious happened. While not to say that this aspect of the story isn't an important issue in itself, it is a separate one from the issues regarding the modeling, and it is those with which I am primarily concerned. But, given the high profile of articles published in journals like Nature, this kind of gross error in attribution does little to reassure me that such journals are not aggravating certain systemic problems in the scientific publication system. This will probably be a topic of a later post, if I ever get around to it. But let's get back to the modeling questions.
Seeking to be more physics and less statistics, the ultimate goal of such a study of human behavior should be to understand the mechanism at play, and at least Barabasi did put forward and analyze a plausible suggestion there, even if a) he may not have done enough data analysis to properly support it or his claims of universality, and b) his model assumes some reasonably unrealistic behavior on the part of humans. Indeed, the former is my chief complaint about his paper, and why I am grateful for the Stouffer et al. comment and the ensuing discussion. With regard to the latter, my preference would have been for Barabasi to have discussed the fragility of his model with respect to the particular assumptions he describes. That is, although he assumes it, humans probably don't assign priorities to their tasks with anything like a uniformly random distribution and nor do humans always execute their highest priority task next. For instance, can you decide, right now without thinking, what the most important email in your inbox is at this moment? Instead, he commits the crime of hubris and neglects these details in favor of the suggestiveness of his model given the data. On the other hand, regardless of their implausibility, both of these assumptions about human behavior can be tested through experiments with real people and through numerical simulation. That is, these assumptions become predictions about the world that, if they fail to agree with experiment, would falsify the model. This seems to me an advantage of Barabasi's mechanism over that proposed by Johansen, which, by relying on a spin glass model of human behavior, seems quite trickier to falsify.
But let's get back to the topic of the data analysis and the argument between Stouffer et al. and Barabasi et al. (now also Johansen) over whether the data better supports a log-normal or a power-law distribution. The importance of this point is that if the log-normal is the better fit, then the mathematical model Barabasi proposes cannot be the originating mechanism. From my experience with distributions with heavy tails, it can be difficult to statistically (let alone visually) distinguish between a log-normal and various kinds of power laws. In human systems, there is almost never enough data (read: orders of magnitude) to distinguish these without using standard (but sophisticated) statistical tools. This is because for any finite sample of data from an asymptotic distribution, there will be deviations that will blur the functional form just enough to look rather like the other. For instance, if you look closely at the data of Barabasi or Johansen, there are deviations from the power-law distribution in the far upper tail. Stouffer et al. cite these as examples of the poor fit of the power law and as evidence supporting the log-normal. Unfortunately, they could simply be due to deviations due to finite-sample effects (not to be confused with finite-size effects), and the only way to determine if they could have been is to try resampling the hypothesized distribution and measuring the sample deviation against the observed one.
The approach that I tend to favor for resolving this kind of question combines a goodness-of-fit test with a statistical power test to distinguish between alternative models. It's a bit more labor-intensive than the Bayesian model selection employed by Stouffer et al., but this approach offers, in addition to others that I'll describe momentarily, the advantage of being able to say that, given the data, neither model is good or that both models are good.
Using Monte Carlo simulation and something like the Kolmogorov-Smirnov goodness-of-fit test, you can quantitatively gauge how likely a random sample drawn from your hypothesized function F (which can be derived using maximum likelihood parameter estimation or by something like a least-squares fit; it doesn't matter) will have a deviation from F at least as big as the one observed in the data. By then comparing the deviations with an alternative function G (e.g., a power law versus a log-normal), you get a measure of the power of F over G as an originating model of the data. For heavy-tailed distributions, particularly those with a sample-mean that converges slowly or never at all (as is the case for something like P(t) ~ 1/t), sampling deviations can cause pretty significant problems with model selection, and I suspect that the Bayesian model selection approach is sensitive to these. On the other hand, by incorporating sampling variation into the model selection process itself, one can get an idea of whether it is even possible to select one model over another. If someone were to use this approach to analyze the data of human response times, I suspect that the pure power law would be a poor fit (the data looks too curved for that), but that the power law suggested in Johansen's paper would be largely statistically indistinguishable from a log-normal. With this knowledge in hand, one is then free to posit mechanisms that generate either distribution and then proceed to validate the theory by testing its predictions (e.g., its assumptions).
So, in the end, we may not have gained much in arguing about which heavy-tailed distribution the data likely came from, and instead should consider whether or not an equally plausible mechanism for generating the response-time data could be derived from the standard mechanisms for producing log-normal distributions. If we had such an alternative mechanism, then we could devise some experiments to distinguish between them and perhaps actually settle this question like scientists.
As a closing thought, my interest in this debate is not particularly in its politics. Rather, I think this story suggests some excellent questions about the practice of modeling, the questions a good modeler should ponder on the road to truth, and some of the pot holes strewn about the field of complex systems. It also, unfortunately, provides some anecdotal evidence of some systemic problems with attribution, the scientific publishing industry and the current state of peer-review at high-profile, fast turn-around-time journals.
References for those interested in reading the source material.
A. Johansen, "Probing human response times." Physica A 338 (2004) 286-291.
A.-L. Barabasi, "The origin of bursts and heavy tails in human dynamics." Nature 435 (2005) 207-211.
D. B. Stouffer, R. D. Malmgren and L. A. N. Amaral "Comment on 'The origin of bursts and heavy tails in human dynamics'." e-print (2005).
J.-P. Eckmann, E. Moses and D. Sergi, "Entropy of dialogues creates coherent structures in e-mail traffic." PNAS USA 101 (2004) 14333-14337.
A.-L. Barabasi, K.-I. Goh, A. Vazquez, "Reply to Comment on 'The origin of bursts and heavy tails in human dynamics'." e-print (2005).
posted December 19, 2005 04:32 PM in Scientifically Speaking | permalink | Comments (0)
November 27, 2005
Irrational exuberance plus indelible sniping yields delectable entertainment
In a past entry (which sadly has not yet scrolled off the bottom of the front page - sad because it indicates how infrequently I am posting these days), I briefly discussed the amusing public debate by Barabasi et al. and Souffer et al. over Barabasi's model of correspondence. At that point, I found the exchange amusing and was inclined to agree with the response article. However, let me rehash this topic and expose a little more light on the subject.
From the original abstract of the article posted on arxiv.org by Barabasi:
Current models of human dynamics, used from risk assessment to communications, assume that human actions are randomly distributed in time and thus well approximated by Poisson processes. In contrast, ... the timing of many human activities, ranging from communication to entertainment and work patterns, [are] ... characterized by bursts of rapidly occurring events separated by long periods of inactivity. Here we show that the bursty nature of human behavior is a consequence of a decision based queuing process: when individuals execute tasks based on some perceived priority, the timing of the tasks will be heavy tailed, most tasks being rapidly executed, while a few experience very long waiting times.
(Emphasis is mine.) Barabasi is not one to shy away from grand claims of universality. As such, he epitomizes the thing that many of those outside of the discipline hate about physicists, i.e., their apparent arrogance. My opinion is that most physicists accused of intellectual arrogant are misunderstood, but that's a topic for another time.
Stouffer et al. responded a few months after Barabasi's original idea, as published in Nature, with the following (abstract):
In a recent letter, Barabasi claims that the dynamics of a number of human activities are scale-free. He specifically reports that the probability distribution of time intervals tau between consecutive e-mails sent by a single user and time delays for e-mail replies follow a power-law with an exponent -1, and proposes a priority-queuing process as an explanation of the bursty nature of human activity. Here, we quantitatively demonstrate that the reported power-law distributions are solely an artifact of the analysis of the empirical data and that the proposed model is not representative of e-mail communication patterns.
(Emphasis is mine.) In this comment, Stouffer et al. strongly criticize the data analysis that Barabasi uses to argue for the plausibility and, indeed, the correctness of his priority-based queueing model. I admit that when I first read Barabasi's queueing model, I thought that surely the smart folks who have been dealing with queueing theory (a topic nearly a century old!) knew something like this already. Even if that were the case, the idea certainly qualifies as interesting, and I'm happy to see a) the idea published, although Nature was likely not the appropriate place and b) the press attention that Barabasi has brought to the discipline of complex systems and modeling. Anyway, the heart of the data-analysis based critique of Barabasi's work lies in distinguishing two different kinds of heavy-tailed distributions: the log-normal and the power law. Because of a heavy tail is an asymptotic property, these two distributions can be extremely difficult to differentiate when the data only spans a few orders of magnitude (as is the case here). Fortunately, statisticians (and occasionally, myself) enjoy this sort of thing. Stouffer et al. employ such statistical tools in the form of Bayesian model selection to choose between the two hypotheses and find the evidence of the power law lacking. It was quite dissatisfying, however, that Stouffer et al. neglected to discuss their model selection procedure in detail, and instead chose to discuss the politicking over Barabasi's publication in Nature.
And so, it should come as no surprise that a rejoinder from Barabasi was soon issued. With each iteration of this process, the veneer of professionalism cracks away a little more:
[Stouffer et al.] revisit the datasets [we] studied..., making four technical observations. Some of [their] observations ... are based on the authors' unfamiliarity with the details of the data collection process and have little relevance to [our] findings ... and others are resolved in quantitative fashion by other authors.
In the response, Barabasi discusses the details of the dataset that Stouffer et al. fixated on: that the extreme short-time behavior of the data is actually an artifact of the way messages to multiple recipients were logged. They rightly emphasize that it is the existence of a heavy tail that is primarily interesting, rather than its exact form (of course, Barabasi made some noise about the exact form in the original paper). However, it is not sufficient to simply observe a heavy tail, posit an apparently plausible model that produces some kind of such tail and then declare victory, universality and issue a press release. (I'll return to this thought in a moment.) As a result, Barabasi's response, while clarifying a few details, does not address the fundamental problems with the original work. Problems that Stouffer et al. seem to intuit, but don't directly point out.
A month ago, Suresh over at the Geomblog published a comment on the controversy by Michael Mitzenmacher (whose work I greatly enjoy) in which he touches briefly on the real issue here.
While the rebuttal suggests the data is a better fit for the lognormal distribution, I am not a big believer in the fit-the-data approach to distinguish these distributions. The Barabasi paper actually suggested a model, which is nice, although the problem of how to verify such a model is challenge... This seems to be the real problem. Trust me, anyone can come up with a power law model. The challenge is figuring out how to show your model is actually right.
That is, first and foremost, the bursty nature of human activity is odd and, in that alluring voice only those fascinated by complex systems can hear, begs for an explanation. Second, a priority-based queueing process is merely one possible explanation (out of perhaps many) for the heaviness and burstiness. The emphasis is to point out that there is a real difficulty in nailing down causal mechanisms in human systems. often the best we can do is concoct a theory and see if the data supports it. That is, it is exceedingly difficult to go beyond mere plausibility without an overwhelming weight of empirical evidence and, preferably, the vetting of falsifiable hypotheses. The theory of natural selection is an excellent example that has been validated by just such a method (and continues to be). Unfortunately, simply looking at the response time statistics for email or letters by Darwin or Einstein, while interesting from the socio-historical perspective, does not prove the model. On the contrary: it merely suggests it.
That is, Barabasi's work demonstrates the empirical evidence (heavy-tails in the response times of correspondence) and offers a mathematical model that generates statistics of a similar form. It does not show causality, nor does it provide falsifiable hypotheses by which it could be invalidated. Barabasi's work in this case is suggestive but not explanatory, and should be judged accordingly. To me, it seems that the contention over the result derives partly from the overstatement of its generality, i.e., the authors claims their model to be explanatory. Thus, the argument over the empirical data is really just an argument about how much plausibility it imparts to the model. Had Barabasi gone beyond suggestion, I seriously doubt the controversy would exist.
Considering the issues raised here, personally, I think it's okay to publish a results that is merely suggestive so long as it is honestly made, diligently investigated and embodies a compelling and plausible story. That is to say that, ideally, authors should discuss the weakness of their model, empirical results and/or mathematical analysis, avoid overstating the generality of the result (sadly, a frequent problem in many of the papers I referee), carefully investigate possible biases and sources of error, and ideally, discuss alternative explanations. Admittedly, this last one may be asking a bit much. In a sense, these are the things I think about when I read any paper, but particularly when I referee something. This thread of thought seems to be fashionable right now, as I just noticed that Cosma's latest post discusses criteria for accepting or rejecting papers in the peer review process.
posted November 27, 2005 05:00 AM in Scientifically Speaking | permalink | Comments (0)
November 06, 2005
Finding your audience
Some time ago, a discussion erupted on Crooked Timber about the ettiquete of interdisciplinary research. This conversation was originally sparked by Eszter Hargittai, a sociologist with a distinct interest in social network analysis, who complained about some physicists working on social networks and failing to appropriately cite previous work in the area. I won't rehash the details, since you can read them for yourself. However, the point of the discussion that is salient for this post is the question of where and how one should publish and promote interdisciplinary work.
Over the better half of this past year, I have had my own journey with doing interdisciplinary research in political science. Long-time readers will know that I'm referring to my work with here, here and here). In our paper (old version via arxiv), we use tools from extremal statistics and physics to think carefully about the nature and evolution of terrorism, and, I think, uncover some interesting properties and trends at the global level. Throughout the process of getting our results published in an appropriate technical venue, I have espoused the belief that it should either go to an interdisciplinary journal or one that political scientists will read. That is, I felt that it should go to a journal with an audience that would both appreciate the results and understand their implications.
This idea of appropriateness and audience, I think, is a central problem for interdisciplinary researchers. In an ideal world, every piece of novel research would be communicated to exactly that group of people who would get the most out of learning about the new result and who would be able to utilize the advance to further deepen our knowledge of the natural world. Academic journals and conferences are a poor approximation of this ideal, but currently they're the best institutional mechanism we have. To correct for the non-idealness of these institutions, academics have always distributed preprints of their work to their colleagues (who often pass them to their own friends, etc.). Blogs, e-print archives and the world wide web in general constitute interesting new developments in this practice and show how the fundamental need to communicate ideas will co-opt whatever technology is available. Returning to the point, however, what is interesting about interdisciplinary research is that by definition it has multiple target audiences to which it could, or should, be communicated. Choosing that audience can become a question of choosing what aspects of the work you think are most important to science in general, i.e., what audience has the most potential to further develop your ideas? For physicists working on networks, some of their work can and should be sent to sociology journals, as its main contribution is in the form of understanding social structure and implication, and sociologists are best able to use these discoveries to explain other complex social phenomena and to incorporate them into their existing theoretical frameworks.
In our work on the statistics of terrorism, Maxwell and I have chosen a compromise strategy to address this question: while we selected general science or interdisciplinary journals to send our first manuscript on the topic, we have simultaneously been making contacts and promoting our ideas in political science so as to try to understand how to further develop these ideas within their framework (and perhaps how to encourage the establishment to engage in these ideas directly). This process has been educational in a number of ways, and recently has begun to bear fruit. For instance, at the end of October, Maxwell and I attended the International Security Annual Conference (in Denver this year) where we presented our work in the second of two panels on terrorism. Although it may have been because we announced ourselves as computer scientists, stood up to speak, used slides and showed lots of colorful figures, the audience (mostly political scientists, with apparently some government folk present as well) was extremely receptive to our presentation (despite the expected questions about statistics, the use of randomness and various other technical points that were unfamiliar to them). This led to several interesting contacts and conversations after the session, and an invitation to the both of us to attend a workshop in Washington DC on predictive analysis for terrorism that will be attended by people from the entire alphabet soup of spook agencies. Also, thanks to the mention of our work in The Economist over the summer, we have similarly been contacted be a handful of political scientists who are doing rigorous quantitative work in a similar vein as ours. We're cautiously optimistic that this may all lead to some fruitful collaborations, and ultimately to communicating our ideas to the people to whom they will matter the most.
Despite the current popularity of the idea of interdisciplinary research (not to be confused with excitement about the topic itself, which would take the form of funding), if you are interested in pursuing a career in it, like many aspects of an academic career, there is little education about its pitfalls. The question of etiquette in academic research deserves much more attention in graduate school than it currently receives, as does its subtopic of interdisciplinary etiquette. Essentially, it is this last idea that lays at the heart of Eszter Hargittai's original complaint about physicists working on social networks: because science is a fundamentally social exercise, there are social consequences for not observing the accepted etiquette, and those consequences can be a little unpredictable when the etiquette is still being hammered out as in the case of interdisciplinary research. For our work on terrorism, our compromise strategy has worked so far, but I fully expect that, as we continue to work in the area, we will need to more fully adopt the mode and convention of our target audience in order to communicate effectively with them.
posted November 6, 2005 01:15 PM in Simply Academic | permalink | Comments (1)
October 27, 2005
Links, links, links.
The title is perhaps a modern variation on Hamlet's famous "words, words, words" quip to Lord Polonius. Some things I've read recently, with mild amounts of editorializing:
Tim Burke (History professor at Swarthmore College) recently discussed (again) his thoughts on the future of academia. That is, why would it take for college costs to actually decrease. I assume this arises at least partially as a result of the recent New York Times article on the ever increasing tuition rates for colleges in this country. He argues that modern college costs rise at least partially as a result of pressure from lawsuits and parents to provide in loco parentis to the kids attending. Given the degree of hand-holding I experienced at Haverford, perhaps the closest thing to Swarthmore without actually being Swat, this makes a lot of sense. I suspect, however, that tuition prices will continue to increase apace for the time being, if only because enrollment rates continue to remain high.
Speaking of high enrollment rates, Burke makes the interesting point
... the more highly selective a college or university is in its admission policies, the more useful it is for an employer as a device for identifying potentially valuable employees, even if the employer doesn’t know or care what happened to the potential employee while he or she was a student.
This assertion belies an assumption about whose pervasiveness I wonder. Basically, Burke is claiming that selectivity is an objective measure of something. Indeed, it is. It's an objective measure of the popularity of the school, filtered through the finite size of a freshman class that the school can reasonably admit, and nothing else. A huge institution could catapult itself higher in the selectivity rankings simply by cutting the number of students it admits.
Barabasi's recent promotion of his ideas about the relationship between "bursty behavior" among humans and our managing a queue of tasks to accomplish continues to generate press. New Scientist and Physics Web both picked the piece of work on Darwin's, Einstein's and modern email-usage communication patterns. To briefly summarize from Barabasi's own paper:
Here we show that the bursty nature of human behavior is a consequence of a decision based queueing process: when individuals execute tasks based on some perceived priority, the timing of the tasks will be heavy tailed, most tasks being rapidly executed, while a few experience very long waiting times.
A.-L. Barabasi (2005) "The origin of bursts and heavy tails in human dynamics." Nature 435, 207.
That is, the response times are described by a power law with exponent between 1.0 and 1.5. Once again, power laws are everywhere. (NB: In the interest of full disclosure, power laws are one focus of my research, although I've gone on record saying that there's something of an irrational exuberance for them these days.) To those of you experiencing power-law fatigue, it may not come as any surprise that last night in the daily arXiv mailing of new work, a very critical (I am even tempted to say scathing) comment on Barabasi's work appeared. Again, to briefly summarize from the comment:
... we quantitatively demonstrate that the reported power-law distributions are solely an artifact of the analysis of the empirical data and that the proposed model is not representative of e-mail communication patterns.
D. B. Stouffer, R. D. Malmgren and L. A. N. Amaral (2005) "Comment on The origin of bursts and heavy tails in human dynamics." e-print.
There are several interesting threads imbedded in this discussion, the main one being on the twin supports of good empirical research: 1) rigorous quantitative tools for data analysis, and 2) a firm basis in empirical and statistical methods to support whatever conclusions you draw with aforementioned tools. In this case, Stouffer, Malmgren and Amaral utilize Bayesian model selection to eliminate the power law as a model, and instead show that the distributions are better described by a log-normal distribution. This idea of the importance of good tools and good statistics is something I've written on before. Cosma Shalizi is a continual booster of these issues, particularly among physicists working in extremal statistics and social science.
And finally, Carl Zimmer, always excellent, on the evolution of language.
[Update: After Cosma linked to my post, I realized it needed a little bit of cleaning up.]
posted October 27, 2005 01:23 AM in Thinking Aloud | permalink | Comments (0)
September 29, 2005
Networks in our nation's capital
This past week, I attended the Statistics on Networks workshop at the National Academies of Science in Washington DC, where I saw many familiar faces and many new ones. In particular, I was very happy to finally meet Jon Kleinberg, John Doyle, Steve Borgatti and my collaborator Dimitris Achlioptas. And it was nice to see Walter Willinger and Chris Wiggins again, both of whom I met at the MSRI workshop on networks earlier this year. And naturally, it was nice to see my collaborator Mark Newman again, even though we correspond pretty regularly. Now that I've distributed the appropriate linkage for the search engines, let me get on with my thoughts.
This workshop was interesting for a couple of reasons. First, the audience contained statisticians, social scientists, computer science/physics people, and engineers/biologists. Certainly the latter two groups presented very different perspectives on networks, with the former being interested in universality properties and random models of networks, while the latter was much more interested in building or decomposing a particular kind or instance of a network. The social scientists present (and there were many of them) seemed to have a nicely balanced perspective on the usefulness of random models, with perhaps a slight leaning toward the computer science/physics side. Naturally, this all made for interesting dinner and wrap-up discussion. For myself, my bias is naturally in the direction of appreciating models that incorporate randomness. However, it's true that when translated through a particular network model, randomness can itself generate structure (e.g., random graphs with power law degree distributions tend to have a densely connected core of high degree vertices, a structure that is a poor model for the core of the internet, where mixing is disassortative). In the case of real world networks, I think random models yield the most benefit when used to explore the space of viable solutions to a particular constraint or control problem. Eve Marder's work (also at the workshop) on small networks of self-regulating neurons (in this case, those of the lobster gut) is a particularly good example of this approach.
Second, although there were very few graduate students in attendance (I counted three, myself included), the environment was friendly, supportive and generally interesting. The workshop coordinators did a good job of inviting people doing interesting work, and I enjoyed just about all of the talks. Finally, it was interesting to see inside the National Academies a little. This institution is the one that fulfills the scientific inquiries of Congress, although I can't imagine this Congress listens to its scientists very much.
posted September 29, 2005 09:43 PM in Simply Academic | permalink | Comments (0)
May 21, 2005
The inter-disciplinary politics of interdisciplinary research or, "Hey, that was my idea first."
A few days ago, Eszter Hargittai posted a rant on the joint-blog Crooked Timber about the entre of physicists into the subfield in sociology of social networks and her perception of their contributing mostly nothing of value. Her entry was prompted by this paper about the EuroVision Contest. I learned about the entry first when she reproduced it on the social networking listserv SOCNET; a list on which I lurk mostly because I'm too cheap to pay the membership fee and also because I mainly use it as a way to collect journal references for sociology literature. References which I imagine to myself that I'll read or use one day, although given the poor job I'm currently doing at keeping up with the recent papers in my own field, I may realistically never get around to. (This point is salient, and I'll return to it momentarily.) In the ensuing and relatively lively debate in the post's comments section, someone called for and then received attention from friend Cosma Shalizi, who blogs his own thoughts on the subject in his usual lengthy, heavily cross-referenced and edifying way.
Several meta-commentary thoughts come immediately to mind:
1. Cosma's points are extremely thoughtful and are likely right on the money in terms of seeing the merits of both physicists contributions to social sciences and the argument of their reinvention of wheels. Most relevant to the rant about physicists not contributing anything of value to the field of social networks, he gives four excellent and broad examples of how physicists have added to our knowledge.
2. One of these points, which bears rehashing here, is that physicists are not just interested in social networks (it unfortunately illustrates the irony of the sociologists claims of academic injustice that this observation is abscent from their complaints). Physics training, and particularly that of statistical mechanics, the subfield that most physicists interested in social networks hail from, emphasizes that items of inquiry can, to as great an extent as possible, be treated as interchangeable. Thus, complex networks is the idea that social networks are just one kind of network. The progress physicists have made in carving out the field of complex networks has been somewhat spotty, perhaps because of their not knowing entirely how much of statistical mechanics to import and how much of a reliance on numerical simulation is reasonable (this touches on a related point, that there is not a firm consensus on how computational modeling and simulation should be incorporated into science to the same degree that theory and empiricism have been). If they have been arrogant toward other fields in their attempts to do this, then they should be chastised through letters to the editor of the journals that publish the offending articles. With regard to the EuroVision Contest article, Eszter Hargittai and Kieran Healy's best recourse is to write such a letter to Physica A illustrating that the work is not novel.
3. A point which Cosma omits in his list is connection to social network analysis, via complex network analysis, a large body of mathematical techniques from physics such as percolation theory (he does point out the contribution via network epidemiology), group renormalization, random graph theory, ideas of entropy and techniques for modeling dynamic systems. I may be wrong on these contributions, since I will easily admit that I don't read enough sociology literature. (Update: Cosma notes that sociologists and affiliated statisticians were familiar with Erdos-Renyi random graph theory before the physicists came along.)
4. There's a deeper issue at play here, which Cosma has also discussed (his prolificness is truly impressive, even more so given its high quality). Namely, that there are more physicists than there is funding (or interest?) for physics problems. While I was at Haverford, one of my physics professors told me, without a hint of a smile, that in order to get a job in traditional physics, you basically had to work at one of the national laboratories, work at a particle accelerator laboratory, or work in condensed matter physics. None of these seemed particularly appealing, yet the ideas and approaches of physics were. So, it is perhaps entirely expected that similar folks in my position eventually branch out into other fields. This is, after all, the nature of interdisciplinary research, and physicists (along with mathematicians and, to a lesser degree, chemists) seem particularly well-equipped for this kind of adventure. With the rising emphasis among both funding agencies and universities for interdisciplinary research (which may or may not be simply lip-service), the future likelihood of inter-disciplinary ego-bruising seems high.
5. Obviously, in any scientific endeavor, interdisciplinary or otherwise, scientists should understand the literature that came before (I dislike the term "master", because it implies an amount of time-commitment that I think few people can honestly claim to have spent with literature). In my recent referee work for Physical Review E, I have routinely chastised authors for not writing better introductions that leave a reader with a firm understanding of the context (past and present) in which the fundamental questions they seek to address sit. When it comes to interdisciplinary work, these problems are particularly acute; not only do you have multiple bodies of literature to quickly and succinctly review, but you must also do so in a way accessible to the members of the each field. Some (but, by no means, all) physicists are certainly guilty of this when it comes to writing about social networks, as they are prone to reinventing the wheel. The most egregious example of which is the preferential attachment model of Barabasi and Albert, but it can (and should) be argued that this reinvention was extremely valuable, as it helped spark a wide degree of interest in the previous work and has prompted some excellent work on developing that idea since. So, the fundamental question that I think all of we who claim to be interdisciplinary must face and ultimately answer (in a way that can be communicated to future generations of interdisciplinary researchers, many of whom are in college right now) is, What is the most principled and reasonable way, given the constraints on attention, energy, time, knowledge, intelligence, etc., to allocate proper recognition (typically via citations and coauthorships) to previous and on-going work that is relevant to some interdisciplinary effort?
Or, more succinctly, what's the most practical way to mitigate the inter-disciplinary politics of interdisciplinary research while encouraging it to the fullest extent possible? Closely related are questions about adequately evaluating the merit of research that does not fall squarely within the domain of a large enough body of experts for peer-review. As is the question of how academic departments should value interdisciplinary researchers and what role they should fill in the highly compartmentalized and territorial realm of academic disciplines.
Manual TrackBack: Three-toed Sloth
posted May 21, 2005 04:09 PM in Simply Academic | permalink | Comments (1)
January 31, 2005
On being interdisciplinary
I've been on the East Coast for two weeks for reasons of both work and play. I started at the MIT Media Lab working with Nathan Eagle on some really amazing network analysis stuff. It was cold, there was some peripheral unpleasantness not connected with him (he was actually good about it as it was happening), but it was great to be in a totally new environment thinking about totally new things. Plus, I got to hang out with friends from the Santa Fe Institute summer school I went to a couple of years ago. Then I went to Holyoke where I got to visit an old old friend that I haven't seen in ages. Jessica drove me down to Yale, where I visisted my friend Robin Herlands which was wonderful and stimulating and fun, despite having 18 inches of snow dump