2011 was the Year of the Baby for me, so I didn’t really make it to as many of the core conferences as I would have liked to. (Also part of why I hardly posted here at all. :-P ) Still, some trends stood out in 2011:
Sparsity-inducing norms continued to be hot. L1 and its brethren made a strong showing at AISTATS, and there appears to be a lot of first-principles work on why L1 is good for the “d>>n” problem (dimensionality far larger than data set size). Work also appears to be continuing both on mixed norms (e.g., L1,2 or L1,inf) and on optimization. This is exciting stuff, because many scientific data sets (at least) are characterized by d>>n, and mixed norms give a pointer for manipulating more structured kinds of data.
Optimization and high-performance algorithms were all the rage at KDD. Everything from variants of MapReduce to statistics thrashing on GPUs. Clearly, we all need to sit down and talk with my colleague Dorian Arnold, who knows way more about the practicalities of generalizing MapReduce than most ML folks do, I suspect.
Out of it all, though, my favorite paper was a little abstract by Rich Caruana at Snowbird. (I don’t know if he has a full paper version of it yet — I can’t find one in a couple of minutes Googling, but it may be out there.)
Very briefly, the experiment was the following: he trained a gradient boosted tree ensemble on a document prediction task, and the question he asked was, “what is the right number of trees to keep in the ensemble?” This is essentially a question about tuning the regularization parameter: with too few trees the model lacks sufficient capacity to capture the concept and underfits; with too many it has too much capacity and overfits. We’re used to searching over such regularization hyperparameters via, say, cross validation.
But where Rich’s point diverged, and became really interesting, is that he then split the test set into three chunks and showed that, on held-out test data, the exact same model underfit one chunk, overfit a second, and was just right on the third. That is, different parts of the test space required different regularization parameters. Essentially, the target concept has different degrees of smoothness in different parts of the space.
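To make the idea concrete, here is a toy sketch of that kind of experiment. It is not Rich’s setup: the 1-D target function, the noise level, the learning rate, the use of depth-1 stumps, and the choice to chunk the test set by equal-width slices of the input are all assumptions of mine, and every function name below is made up for illustration. It boosts regression stumps on a function that is flat in one region, gently varying in another, and wiggly in a third, then tracks held-out error per chunk as trees are added.

```python
import math
import random


def target(x):
    # Hypothetical target with region-dependent smoothness (an assumption,
    # not taken from the original experiment).
    if x < 1 / 3:
        return 0.0                                        # flat
    if x < 2 / 3:
        return math.sin(2 * math.pi * 3 * (x - 1 / 3))    # one slow period
    return math.sin(2 * math.pi * 15 * (x - 2 / 3))       # five fast periods


def make_data(n, rng, noise=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [target(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_stump(xs, res):
    """Least-squares single-split stump; xs must already be sorted."""
    n, total = len(xs), sum(res)
    best_score, best = -1.0, None
    left = 0.0
    for k in range(1, n):
        left += res[k - 1]
        if xs[k - 1] == xs[k]:
            continue  # cannot split between identical inputs
        right = total - left
        # Maximizing this score is equivalent to minimizing squared error.
        score = left * left / k + right * right / (n - k)
        if score > best_score:
            best_score = score
            best = (0.5 * (xs[k - 1] + xs[k]), left / k, right / (n - k))
    return best  # (threshold, left value, right value)


def boost(xs, ys, n_trees, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the residuals."""
    pairs = sorted(zip(xs, ys))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    base = sum(ys) / len(ys)
    pred = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        res = [y - p for y, p in zip(ys, pred)]
        thr, lv, rv = fit_stump(xs, res)
        lv, rv = lr * lv, lr * rv  # shrink each stump's contribution
        stumps.append((thr, lv, rv))
        pred = [p + (lv if x <= thr else rv) for x, p in zip(xs, pred)]
    return base, stumps


def chunk_error_curves(base, stumps, xs, ys, n_chunks=3):
    """Held-out MSE after each boosting round, per equal-width slice of [0, 1)."""
    chunks = [[] for _ in range(n_chunks)]
    for x, y in zip(xs, ys):
        chunks[min(int(x * n_chunks), n_chunks - 1)].append((x, y))
    curves = []
    for pts in chunks:
        pred = [base] * len(pts)
        curve = []
        for thr, lv, rv in stumps:
            pred = [p + (lv if x <= thr else rv) for (x, _), p in zip(pts, pred)]
            curve.append(sum((y - p) ** 2 for (_, y), p in zip(pts, pred)) / len(pts))
        curves.append(curve)
    return curves


rng = random.Random(0)
train_x, train_y = make_data(300, rng)
test_x, test_y = make_data(600, rng)
base, stumps = boost(train_x, train_y, n_trees=200)
curves = chunk_error_curves(base, stumps, test_x, test_y)
# Ensemble size that minimizes held-out error, separately for each chunk.
best_sizes = [curve.index(min(curve)) + 1 for curve in curves]
print(best_sizes)
```

If the intuition holds, the three entries of best_sizes should not agree: the flat chunk has little to gain from extra trees, while the wiggly chunk may keep improving long after the others have started to overfit. A single global choice of ensemble size then has to compromise between them.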
In retrospect, this is, perhaps, not surprising. After all, there’s nothing that requires that a function have a constant smoothness. But most of our mathematical tools and theory are grounded in global measures of smoothness, such as function norms or the VC dimension of a given RKHS. Rich’s result suggests that these global summary measures are insufficient; instead, we may need a more localized measure of smoothness and a more localized way to set both model parameters and hyperparameters.
It’s an intriguing paper, even if a very brief one. I encourage people to read it. I have a feeling that this opens up a new way of thinking about model spaces.