Stephane Boucheron (joint work with Pascal Massart (Orsay))
Title: A poor man's Wilks
phenomenon
Abstract:
The Wilks phenomenon asserts that, in regular models,
twice the difference between the maximum log-likelihood and
the log-likelihood at the true parameter is asymptotically distributed
according to a chi-square distribution with a number of degrees of
freedom that coincides with the dimension of the model. We attempt to
generalize this phenomenon to other contrast minimization
problems such as those encountered in statistical learning theory. We provide
(non-asymptotic) concentration inequalities for the empirical excess
risk by combining (almost) classical tail bounds.
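For reference, the classical statement reads as follows (the notation is mine: $\ell_n$ is the log-likelihood of an $n$-sample, $\hat\theta_n$ the maximum likelihood estimator, and $\theta_0$ the true parameter of a regular $d$-dimensional model):
\[
2\bigl(\ell_n(\hat\theta_n) - \ell_n(\theta_0)\bigr) \;\xrightarrow{\ d\ }\; \chi^2_d \qquad (n \to \infty).
\]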
Gilles Blanchard (joint with E. Roquain and S. Arlot)
Title: Resampling-based
confidence regions in high dimension from a non-asymptotic point of view
Abstract:
We study generalized bootstrapped confidence regions for the
mean of a random vector whose coordinates have an unknown dependence
structure, with a non-asymptotic control of the confidence level. The
random vector is supposed to be either Gaussian or to have a symmetric
bounded distribution.
We consider two approaches, the first based on a concentration
principle and the second on a direct bootstrapped quantile. The first
one allows us to deal with a very large class of resampling weights,
while our results for the second are restricted to Rademacher weights.
The non-asymptotic point of view developed here is strongly inspired by
recent work in the area of learning theory.
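As a rough illustration of the second approach (this is a sketch in my own notation, not the authors' construction; the non-asymptotic calibration of the confidence level, which is the point of the talk, is omitted), a Rademacher-weighted resampled quantile can serve as the radius of a sup-norm region around the empirical mean:

```python
import numpy as np

def rademacher_sup_quantile(X, alpha=0.05, n_resamples=1000, seed=0):
    """(1 - alpha)-quantile of the sup-norm of the Rademacher-weighted,
    centered empirical mean; illustrative region radius only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centered = X - X.mean(axis=0)                      # center each coordinate
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        eps = rng.choice([-1.0, 1.0], size=n)          # Rademacher weights
        stats[b] = np.abs(centered.T @ eps).max() / n  # sup-norm of weighted mean
    return np.quantile(stats, 1 - alpha)

# Sup-norm region {mu : max_k |mean_k - mu_k| <= radius} around the empirical mean.
X = np.random.default_rng(1).normal(size=(100, 50))
radius = rademacher_sup_quantile(X)
print("illustrative region radius:", radius)
```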
Albert Cohen
Title: Matching vs. basis
pursuit for approximation and learning: a comparison
Abstract:
In approximation and learning problems, one often tries to
exploit an underlying sparsity property of the target function to be
estimated. This property means that this function is well described as
a combination of a few terms in a given dictionary. Two main classes of
algorithms have been developed for this purpose: basis pursuit, which
is based on a convex minimization procedure, and
matching pursuit, in which the relevant components in the dictionary are
sequentially identified. Both algorithms turn out to be equivalent in
the case where the dictionary is an orthonormal basis. In this talk,
their similarities and differences will be explored in the case of a
general dictionary, in particular from the perspective of approximation
theory.
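As a toy comparison of the two families (a sketch with arbitrary parameters, using generic scikit-learn implementations rather than anything specific to the talk), one can look at the supports recovered by orthogonal matching pursuit and by an $\ell_1$-penalized (lasso-form basis pursuit) fit on a random dictionary:

```python
import numpy as np
from sklearn.linear_model import Lasso, OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, p, k = 100, 256, 5                        # samples, dictionary size, sparsity
D = rng.normal(size=(n, p)) / np.sqrt(n)     # a generic (non-orthonormal) dictionary
coef = np.zeros(p)
coef[rng.choice(p, size=k, replace=False)] = rng.normal(size=k)
y = D @ coef + 0.01 * rng.normal(size=n)     # sparse combination plus noise

# Matching pursuit flavour: greedily select k atoms.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(D, y)

# Basis pursuit flavour: convex l1-penalized least squares (lasso form).
bp = Lasso(alpha=1e-3, max_iter=100000).fit(D, y)

print("true support:", np.sort(np.flatnonzero(coef)))
print("OMP support :", np.sort(np.flatnonzero(omp.coef_)))
print("BP support  :", np.sort(np.flatnonzero(np.abs(bp.coef_) > 1e-6)))
```

In the orthonormal case both essentially reduce to selecting the largest coefficients, which is one way to see the equivalence mentioned above.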
Ingrid Daubechies: Convergence
results and counterexamples for AdaBoost and related algorithms
Nira Dyn: Two
algorithms for adaptive approximation of bivariate functions by
piecewise linear polynomials on triangulations
Maya Gupta
Title: Functional Bregman
Divergence, Bayesian Estimation of Distributions, and Completely Lazy
Classifiers
Abstract:
We generalize Bregman divergences to a functional Bregman divergence
and show that direct Bayesian estimation of a distribution such that
the expected functional Bregman risk is minimized leads to the mean
distribution, generalizing the well-known result that "the mean
minimizes the average squared error." This result, together with some intuition
from multiresolution theory, leads to the first effective Bayesian
quadratic discriminant analysis classifier. We propose a new approach
to reduce the bias of Bayesian QDA that avoids the difficulties of
Gaussian mixture models, and extend the Bayesian framework to create a
completely-lazy classifier that has average-performance guarantees and
in practice achieves state-of-the-art classification performance for
high-dimensional problems without requiring the cross-validation of any
parameters.
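For reference, the (vector) Bregman divergence being generalized is, for a strictly convex differentiable $\phi$ (the notation is mine),
\[
d_\phi(x, y) \;=\; \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle,
\]
which recovers the squared Euclidean error for $\phi(x) = \|x\|^2$; roughly speaking, the functional version replaces $\phi$ by a functional defined on a set of functions and the gradient by a Fréchet derivative.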
Lee Jones
Title: Finite sample
minimax estimation, fusion in machine learning, and overcoming the
curse of dimensionality
Abstract:
A minimax approach is described for learning the value of an unknown
function f(x) at a query point x_0, based on knowledge of the function's
values (with possible noise added) at given points
{x_j}. As a result, optimal accuracy bounds, based on assumptions
about the behavior of f(x), may be determined for several new and old
algorithms for learning f(x_0).
Empirical evidence indicates that averaging the learned estimate of
f(x_0) from many different algorithms (estimates) often leads to
increased accuracy, as with random subspace methods, random forests,
and varying kernel covariances. Furthermore, when statistically
analyzed, different algorithms are often estimating different
conditional expectations of the corresponding response
variable. A method (fusion) is presented for optimally
combining the estimates from a given set (ensemble) of such algorithms
and obtaining an accuracy bound on the combination. The approach is to
introduce information-theoretic concepts relating accuracy bounds
of an aggregate estimate to the degree of conditioning of
the aggregate estimate.
The minimax approach allows the fusion procedure to account for the
possibility of spurious features causing (with a known high
probability) only a small number of algorithms to overfit.
When a large number of the algorithms are highly predictive,
the fusion estimate remains accurate and the curse of dimensionality
may be avoided.
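A minimal sketch of the plain averaging mentioned in the second paragraph (not the fusion procedure itself; the models, data, and query point are arbitrary illustrations):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 10))               # the given points x_j
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200)   # noisy values of f
x0 = np.zeros((1, 10))                                    # the query point x_0

# Three different algorithms, each producing its own estimate of f(x_0).
models = [
    RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y),
    KernelRidge(kernel="rbf", gamma=1.0, alpha=0.1).fit(X, y),
    KNeighborsRegressor(n_neighbors=10).fit(X, y),
]
estimates = np.array([m.predict(x0)[0] for m in models])
print("individual estimates of f(x_0):", estimates)
print("averaged estimate             :", estimates.mean())
```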
(Research supported by the Norbert Wiener Center (NWC) and the
Submillimeter-Wave Technology Laboratory (STL).)
Dominique Picard (joint work with Gerard Kerkyacharian)
Title: A 'Frame-work' in
Learning Theory
Abstract: available as a separate PDF file
Vladimir Koltchinskii
Title: Sparse Recovery Problems
in Learning Theory
Abstract:
We will consider the problem of learning a target function based on
its noisy observations at random points via penalized empirical risk
minimization with a convex loss function and a convex complexity penalty.
Depending on the loss function, this framework includes various
regression problems as well as large margin classification
(such as boosting and SVM).
We will be interested in two specific cases. In the simpler case, the
target function belongs to the linear span of a very large dictionary
of given functions. In the more general case, it belongs to the linear
span of a very large number of reproducing kernel Hilbert spaces
(RKHS). This includes several important examples such as
aggregation problems for large ensembles of kernel machines and
traditional additive models in Statistics.
In both cases, it is assumed that the target function has a "sparse
representation" in the "dictionary" and the goal is to recover the
function with an error that depends on the "sparsity" of the problem
(rather than on the size of the dictionary).
We study an approach based on $\ell_1$-type complexity penalization and
establish several probabilistic inequalities showing the relationship
between the sparsity of the empirical solution and the sparsity of the
target function and providing bounds on the excess risk and L2-error
that depend on the sparsity of the problem.
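Schematically, in the dictionary case $\{h_1, \dots, h_N\}$ the estimator has the form (the notation is mine)
\[
\hat\lambda \;=\; \operatorname*{argmin}_{\lambda \in \mathbb{R}^N}\;
\frac{1}{n} \sum_{i=1}^n \ell\Bigl(Y_i,\ \sum_{j=1}^N \lambda_j h_j(X_i)\Bigr)
\;+\; \varepsilon\, \|\lambda\|_{\ell_1},
\]
with $\ell$ the convex loss and $\varepsilon$ the regularization parameter; the inequalities relate the sparsity of $\hat\lambda$ to that of the target.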
(Some parts of this talk are based on
joint work and on discussions with Ming Yuan and with Alexandre
Tsybakov).
Tomaso Poggio
Title: Learning: neuroscience
and engineering applications
Abstract:
The problem of learning is one of the main gateways to making
intelligent machines and to understanding how the brain works. In this
talk I will sketch a model of the ventral stream of the visual cortex
in primates, describing how the brain may learn to recognize objects,
and show that the resulting model is capable of performing recognition
on datasets of complex images at the level of human performance in
rapid categorization tasks. The model performs surprisingly well when
compared with state-of-the-art computer vision systems in the categorization
of complex images, suggesting that neuroscience may begin to effectively
guide engineering efforts in the domain of vision.
Relevant papers can be downloaded from
http://www.ai.mit.edu/projects/cbcl/publications/all-year.html
Christoph Schwab
Title: Elliptic PDEs
with random field input -- numerical analysis of forward solvers and of
goal-oriented input learning
Abstract:
The numerical analysis of gPC FEM for elliptic PDEs in polygonal or
polyhedral domains with random field inputs is addressed; the input
data is assumed to be a random field that is represented as a
(nonlinearly transformed) Karhunen-Loève expansion.
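For reference, such an expansion of a random input field $a(x, \omega)$ reads (the notation is mine)
\[
a(x, \omega) \;=\; \mathbb{E}[a](x) \;+\; \sum_{m \ge 1} \sqrt{\lambda_m}\, \varphi_m(x)\, Y_m(\omega),
\]
where $(\lambda_m, \varphi_m)$ are the eigenpairs of the covariance operator of $a$ and the $Y_m$ are uncorrelated random variables; the talk allows a nonlinear transformation of such an expansion.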
Assuming completely known inputs, we present an a priori analysis of the
complexity of the deterministic `forward' solve, as a function of the
regularity of the spatial two-point correlation function of the input
data.
Adaptive selection of dimension and spectral order of active parameters
in the gPC representation of the random field solution of the PDE will
be addressed.
Compressive sampling of the input field with respect to the
response of the sPDE will also be addressed.
Steve Smale
Title: Vision and learning
Abstract:
The goal of this work is to represent the space of images in a
high-dimensional Euclidean space via a map which abstracts the notion of
a neural response.
Ingo Steinwart
Title: Approximation
Theoretical Questions for Support Vector Machines
Abstract:
Support Vector Machines (SVMs) are one of the state-of-the-art learning
algorithms for both classification and regression problems. This talk
presents an overview of some of the techniques used to
analyze their learning performance. Particular emphasis will be
placed on open questions related to approximation theory.
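For orientation, an SVM decision function solves a regularized empirical risk problem over the reproducing kernel Hilbert space $H$ of its kernel (the notation is mine),
\[
f_{D,\lambda} \;=\; \operatorname*{argmin}_{f \in H}\; \lambda \|f\|_H^2 \;+\; \frac{1}{n} \sum_{i=1}^n L\bigl(y_i, f(x_i)\bigr),
\]
and the approximation-theoretic questions concern, for instance, how fast the approximation error $\inf_{f \in H} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_L(f) - \mathcal{R}_L^*\bigr)$ decays as $\lambda \to 0$, where $\mathcal{R}_L$ denotes the $L$-risk and $\mathcal{R}_L^*$ its infimum.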
Vladimir Temlyakov
Title: Universality and
Lebesgue inequalities in approximation and estimation
Abstract:
The concept of the Kolmogorov width provides a very nice
theoretical way of selecting an optimal approximation method. The major
drawback of this approach (from a practical point of view) is
that in order to initialize such a selection procedure we need to
know a function class F. In
many contemporary practical problems we have no idea which class to
choose in place of F. There
are two ways to overcome the above problem. The first one is to return
(in spirit) to the classical setting that goes back to Chebyshev and
Weierstrass. In this setting we fix a priori a form of an
approximant and look for an approximation method that is
optimal or near optimal for each individual function from X. Also, we specify not only a form
of an approximant but also choose a specific method of approximation
(for instance, one which is known to perform well in practical
implementations). Now we have a precise mathematical problem of
studying the efficiency of our specific method of approximation. This
setting leads to the Lebesgue inequalities. The second way to overcome
the above-mentioned drawback of the width-based approach
consists in weakening the a priori assumption f in F. Instead of looking for an
approximation method that is optimal (near optimal) for a given single
class F we look for an
approximation method that is near optimal for each class from a given
collection F
of classes. Such a method is called universal
for F. We
will discuss a realization of the above two ways in approximation and
in estimation.
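For reference, the Kolmogorov width mentioned above is (the notation is mine)
\[
d_n(F, X) \;=\; \inf_{\substack{L \subset X \\ \dim L \le n}}\; \sup_{f \in F}\; \inf_{g \in L}\; \|f - g\|_X,
\]
the best worst-case error achievable over the class $F$ by $n$-dimensional linear subspaces of $X$; a Lebesgue inequality, by contrast, bounds the error of a fixed method on an individual $f$ in terms of the best possible approximation of that $f$.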
Alessandro Verri
Title: Regularization Algorithms for Learning
Abstract:
In this talk we investigate the close relationship between learning and
regularization by importing into the learning domain algorithms developed
in the context of regularization theory. We describe a nonlinear
regularization algorithm which seems to be well suited to address the
problem of feature selection. We discuss theoretical properties,
implementation issues, and experimental results on real-world problems.
Patrick Wolfe (with Mohamed-Ali Belabbas)
Title: The Nystrom
Extension and Spectral Methods in Learning: A New Algorithm for
Low-Rank Approximation of Quadratic Forms
Abstract:
Spectral methods are of fundamental importance in statistical
learning, as they underlie algorithms from classical principal
components analysis to more recent approaches that exploit manifold
structure. In most cases, the core technical problem can be reduced to
computing a low-rank approximation to a positive-definite
kernel. Using traditional methods, such an approximation can be
obtained with computational complexity that scales as the cube of the
number of training examples. For the growing number of applications
dealing with very large or high-dimensional data sets, however, these
techniques are too costly. A known alternative is the Nystrom extension
from finite element methods. While its application to machine
learning has previously been suggested in the literature, we introduce
here what is, to the best of our knowledge, the first randomized
algorithm of this type to yield a relative
approximation error bound. Our results follow from a new class of
algorithms for the approximation of matrix products, which reveal
connections between classical linear algebraic quantities such as Schur
complements and techniques from theoretical computer science such as
the notion of volume sampling.
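For context, here is a plain Nystrom approximation of a positive semi-definite kernel matrix with uniformly sampled landmark columns (a sketch only; the randomized sampling scheme and the relative error bound of the talk are not reproduced):

```python
import numpy as np

def nystrom_approximation(K, landmarks):
    """Low-rank approximation of a PSD kernel matrix K from m sampled columns."""
    C = K[:, landmarks]                      # the sampled columns
    W = K[np.ix_(landmarks, landmarks)]      # the landmark-landmark block
    return C @ np.linalg.pinv(W) @ C.T       # K ~= C W^+ C^T

# Example on an RBF kernel matrix, with landmarks drawn uniformly at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists)
landmarks = rng.choice(500, size=50, replace=False)
K_hat = nystrom_approximation(K, landmarks)
print("relative Frobenius error:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```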
Ding-Xuan Zhou
Title: Learnability of
Gaussians with Flexible Variances
Abstract:
Gaussian kernels with flexible variances provide a rich family of
Mercer kernels for learning algorithms. We show that the union of the
unit balls of reproducing kernel Hilbert spaces generated by Gaussian
kernels with flexible variances is a uniform Glivenko-Cantelli class.
This result confirms a conjecture concerning learnability of Gaussian
kernels and verifies the uniform convergence of many learning
algorithms involving Gaussians with changing variances. Rademacher
averages and empirical covering numbers are used to estimate sample
errors of multi-kernel regularization schemes associated with general
loss functions. It is then shown that the regularization error
associated with the least squares loss and Gaussian kernels can be
greatly improved when flexible variances are allowed. Finally, for
regularization schemes generated by Gaussian kernels with flexible
variances, we present explicit learning rates for regression with the
least squares loss and for classification with the hinge loss.
Extensions to manifold and Markov sampling settings will also be
discussed.
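In notation of my own choosing, the class shown to be uniform Glivenko-Cantelli is the union of unit balls
\[
\bigcup_{\sigma > 0} \bigl\{ f \in \mathcal{H}_{K_\sigma} : \|f\|_{K_\sigma} \le 1 \bigr\},
\qquad K_\sigma(x, y) = \exp\bigl(-\|x - y\|^2 / \sigma^2\bigr),
\]
where $\mathcal{H}_{K_\sigma}$ is the reproducing kernel Hilbert space of the Gaussian kernel with variance parameter $\sigma$.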