2011 in ML; My Favorite Paper

2011 was the Year of the Baby for me, so I didn’t really make it to as many of the core conferences as I would have liked to.  (Also part of why I hardly posted here at all.  :-P )  Still, some trends stood out in 2011:

Sparsity-inducing norms continued to be hot.  L1 and its brethren made a strong showing at AISTATS, and there appears to be a lot of first-principles work on why L1 is good for the “d>>n” problem (dimensionality far larger than data set size).  Work also appears to be continuing both on mixed norms (e.g., L1,2 or L1,inf) and on optimization.  This is exciting stuff, because many scientific data sets (at least) are characterized by d>>n, and mixed norms give a pointer for manipulating more structured kinds of data.

Optimization and high-performance algorithms were all the rage at KDD.  Everything from variants of MapReduce to statistics thrashing on GPUs.  Clearly, we all need to sit down and talk with my colleague Dorian Arnold, who knows way more about the practicalities of generalizing MapReduce than most ML folks do, I suspect.

Out of it all, though, my favorite paper was a little abstract by Rich Caruana at Snowbird.  (I don’t know if he has a full paper version of it yet.  I can’t find one in a couple of minutes of Googling, but it may be out there.)

Very briefly, the experiment was the following: he trained gradient boosted trees on a document prediction task, and the question he asked was, “what is the right number of trees to keep in the ensemble?”  This is essentially a question about tuning the regularization parameter: too small and the model lacks sufficient capacity to capture the concept and underfits; too large and you have too much capacity and overfit.  We’re used to searching over such regularization hyperparameters via, say, cross validation.

But where Rich’s point diverged, and became really interesting, is that he then split the test set into three chunks and showed that, on held-out test data, the exact same model underfit one chunk, overfit a second, and was just right on the third.  That is, different parts of the test space required different regularization parameters.  Essentially, the target concept has different degrees of smoothness in different parts of the space.

In retrospect, this is, perhaps, not surprising.  After all, there’s nothing that requires that a function have a constant smoothness.  But most of our mathematical tools and theory are grounded in global measures of smoothness, such as function norms or the VC dimension of a given RKHS.  Rich’s result suggests that these global summary measures are insufficient; instead, we may need a more localized measure of smoothness and a more localized way to set both model parameters and hyperparameters.
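
A toy sketch of the effect, not a reproduction of Rich’s experiment: the snippet below swaps the boosted trees for a plain k-nearest-neighbor regressor on a synthetic 1-D target, where k plays the role of the regularization knob (larger k means more smoothing).  The target function, the noise level, the candidate values of k, and all of the names are assumptions made up for illustration.  Because the data are synthetic, errors are scored against the noise-free target.

```python
import math
import random

NOISE_SD = 0.5        # assumed label noise (hypothetical)
KS = [1, 4, 16, 64]   # candidate neighborhood sizes (hypothetical)

def target(x):
    """Toy concept: slowly varying on [0, 0.5), rapidly varying on [0.5, 1)."""
    if x < 0.5:
        return math.sin(2 * math.pi * x)
    return math.sin(2 * math.pi * (x - 0.5) / 0.012)

def squared_errors(train, x, ks):
    """Squared error of the k-NN average at x, for every k in ks."""
    by_distance = sorted(train, key=lambda p: abs(p[0] - x))
    labels = [y for _, y in by_distance[:max(ks)]]
    truth = target(x)
    return {k: (sum(labels[:k]) / k - truth) ** 2 for k in ks}

random.seed(0)
train = []
for _ in range(2000):
    x = random.random()
    train.append((x, target(x) + random.gauss(0.0, NOISE_SD)))

# Two test "chunks", one per region of the input space (evenly spaced grids).
chunks = {
    "smooth": [0.5 * (i + 0.5) / 200 for i in range(200)],
    "wiggly": [0.5 + 0.5 * (i + 0.5) / 200 for i in range(200)],
}

# Mean squared error per chunk and per k.
mse = {name: {k: 0.0 for k in KS} for name in chunks}
for name, xs in chunks.items():
    for x in xs:
        for k, err in squared_errors(train, x, KS).items():
            mse[name][k] += err / len(xs)

# Best k for each chunk on its own, and the single k that is best overall.
best_k = {name: min(KS, key=lambda k: mse[name][k]) for name in chunks}
global_k = min(KS, key=lambda k: sum(mse[name][k] for name in chunks))

for name in chunks:
    print(name, "best k:", best_k[name],
          "| MSE at global k:", round(mse[name][global_k], 4),
          "| MSE at its own best k:", round(mse[name][best_k[name]], 4))
print("global k:", global_k)
```

On this toy problem the smooth chunk should prefer a noticeably larger k than the wiggly chunk, so any single global setting is a compromise that leaves at least one chunk mis-regularized.  That is the same qualitative picture as above, just with a much simpler model.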

It’s an intriguing paper, even if a very brief one.  I encourage people to read it.  I have a feeling that this opens up a new way of thinking about model spaces.

ROC Curves and AUC: A Rant

My students have heard this rant a thousand times, but since I’m in the midst of Reviewing Season and am annoyed, I might as well share the vitriol…

I am sick to death of papers that report only or primarily “Area under the ROC curve” (AUC) as their empirical performance metric.  It’s the brand new way to lie with statistics.  You can scoot all sorts of malfeasance under this particular rug, and I’m never really sure how to interpret such values.

AUTHORS TAKE NOTE: If I get a paper that only reports AUC values, I will whack you for it.


Bayes’ Rule Meets the Law

News of the bizarre and frightening today: A UK appeals court has banned the use of Bayes’ Rule in court cases.


Learning in Economics

The Nobel Prize in Economics was just awarded to Thomas Sargent (NYU) and Christopher Sims (Princeton).  This is of interest to ML hackers because their contributions are essentially econometric: estimation of the parameters of the “economic system” from observational data.  Even more interestingly, one of their key contributions is to establish the role of partial information and learning on the part of economic agents.  (That is, they assume that different economic actors have only incomplete knowledge of key parameters of the economic system, and that the agents update their knowledge state over time from observation.)


RL, Nonstationarity, and International Econo-Politics

This morning I was reading a recent National Geographic article about the collapse of biodiversity in food staple plants and animals.  Over roughly the last century, it appears, the rise of industrial farming has drastically curtailed the number of distinct breeds of everything from potatoes to chickens — many breeds have gone extinct because of the enormous focus (and economic incentive) on factors like mass production, consistency of product, storability/shipability, etc.  The focus of the article was on documenting these biodiversity losses, efforts to maintain seed banks and breeding stocks of some breeds that are on the edge of extinction, and the risks of biodiversity losses.

All of this brought me back to a subject that I’ve been mulling over for some time: decision theory and nonstationarity.
