Opening up the black box: faster optimization methods for non-smooth and big-data problems
In the lab meeting on Wednesday, 5th February, Mark will give a talk about faster optimization methods. The meeting will be in TASC1 9408, starting at 11:30. The details of the talk are below.
The way we solve continuous optimization problems is changing. The ever-growing size of data sets, along with the increasingly important use of non-smooth objective functions, has forced various scientific and engineering communities to consider ‘opening up the black box’ and developing custom solvers for their problems. In this talk I will give an overview of my recent work along these lines, highlighting the various applications that motivated the work. The talk will be accessible to a general computer science audience and will cover four topics:
Sparsity: How a medical imaging application led me to be interested in feature selection, regularization, and their combination through L1-regularization. I will discuss methods for optimizing a general smooth loss with L1-regularization, and how methods are increasingly using the idea of two-metric minimum-norm sub-gradient projection.
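For concreteness, here is a minimal sketch (not the two-metric method from the talk) of how L1-regularization is often handled numerically: a proximal-gradient loop that alternates a gradient step on a smooth least-squares loss with elementwise soft-thresholding. The data X and y, the regularization weight lam, and the step size are placeholders; the step size should be no larger than the inverse Lipschitz constant of the gradient.

    import numpy as np

    def soft_threshold(z, t):
        # Elementwise proximal operator of t * ||.||_1.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def l1_proximal_gradient(X, y, lam, step, iters=500):
        # Minimize 0.5*||X w - y||^2 + lam*||w||_1 by proximal-gradient steps.
        # `step` should be at most 1/L, where L is the largest eigenvalue of X^T X.
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            grad = X.T @ (X @ w - y)            # gradient of the smooth loss
            w = soft_threshold(w - step * grad, step * lam)
        return w

The soft-thresholding step is what produces exact zeros in the solution, which is why this penalty performs feature selection rather than merely shrinking weights.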
Group Sparsity: How the problem of learning the graph structure in conditional random fields motivates the use of group L1-regularization. This has applications ranging from computer vision to analyzing mutual funds to non-Pearlian causality. I will discuss the class of inexact proximal quasi-Newton methods, which are especially suited to optimization problems where we have an expensive objective function but a simple regularizer or simple constraints.
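As a point of reference for the regularizer itself (independent of the quasi-Newton machinery discussed in the talk), the proximal operator of the group L1 penalty has a simple closed form when the groups do not overlap: each group is shrunk toward zero as a unit. The sketch below assumes `groups` is a list of integer index arrays and is only meant to illustrate why "simple regularizer" is the operative phrase.

    import numpy as np

    def group_soft_threshold(w, groups, t):
        # Proximal operator of t * sum_g ||w_g||_2 for non-overlapping groups.
        # `groups` is a list of integer index arrays, one per group.
        w = w.copy()
        for g in groups:
            norm = np.linalg.norm(w[g])
            scale = max(0.0, 1.0 - t / norm) if norm > 0 else 0.0
            w[g] = scale * w[g]                 # shrink the whole group toward zero
        return w

Because an entire group is either kept or zeroed out together, this is the natural penalty when a single structural decision (such as including an edge in the graph) corresponds to a whole block of parameters.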
Structured Sparsity: How the problem of structure learning with higher-order interactions motivates the use of structured sparsity, which allows sparsity patterns such as hierarchical constraints. I will discuss inexact proximal-gradient methods for solving this class of problems.
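When the structured penalty's proximal operator has no closed form (as can happen with overlapping or hierarchical group penalties), it is typically computed only approximately by an inner solver. The sketch below shows the generic shape of such an inexact proximal-gradient loop; `grad_f` and `approx_prox` are placeholder callbacks, and the tightening inner tolerance is just one illustrative schedule, reflecting the requirement that the inner errors shrink as the outer iterations proceed.

    import numpy as np

    def inexact_proximal_gradient(grad_f, approx_prox, w0, step, iters=200):
        # Generic proximal-gradient loop in which the proximal step is only
        # computed approximately, as is typical when the regularizer encodes
        # structured (e.g. hierarchical) sparsity and has no closed-form prox.
        w = np.asarray(w0, dtype=float).copy()
        for k in range(iters):
            z = w - step * grad_f(w)            # gradient step on the smooth part
            # Placeholder inner solver; the tolerance tightens with k so that
            # the approximation errors decrease across iterations.
            w = approx_prox(z, step, tol=1.0 / (k + 1) ** 2)
        return w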
Big-N Problems: In many data-fitting problems we need to optimize the sum of a large number of functions. First-order deterministic-gradient methods achieve a linear convergence rate for this problem, but they need to evaluate every function on every iteration. Stochastic-gradient methods need to evaluate only one function per iteration, but they only achieve a sublinear convergence rate. I will present the stochastic average gradient (SAG) algorithm, which achieves the best of both worlds: a linear rate at the cost of one function evaluation per iteration. SAG not only provably outperforms all possible standard stochastic-gradient methods, but for certain problems it provably outperforms all possible first-order deterministic-gradient methods in terms of passes through the data.
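To make the per-iteration cost comparison concrete, here is a minimal sketch of a SAG-style update: a table of the most recently evaluated gradient for each of the n functions is kept, one entry is refreshed per iteration, and the step uses the average of the table. The callback `grad_i(i, w)`, the problem dimensions, and the step size are placeholders, and practical details such as step-size selection are omitted.

    import numpy as np

    def sag(grad_i, n, dim, step, iters=10000, seed=0):
        # Keep the most recently evaluated gradient for each of the n functions,
        # refresh one entry per iteration, and step along their average.
        rng = np.random.default_rng(seed)
        w = np.zeros(dim)
        memory = np.zeros((n, dim))             # last gradient seen for each f_i
        total = np.zeros(dim)                   # running sum of the stored gradients
        for _ in range(iters):
            i = rng.integers(n)
            g = grad_i(i, w)                    # gradient of the single function f_i at w
            total += g - memory[i]              # keep the running sum consistent
            memory[i] = g
            w -= step * total / n               # step along the average gradient
        return w

The key point the sketch illustrates is that only one of the n gradients is recomputed per iteration, yet the update direction aggregates information from all n functions.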