Stephane Boucheron (joint work with Pascal Massart (Orsay))
Title: A poor man's Wilks
phenomenon
Abstract:
The Wilks phenomenon asserts that, in regular models,
twice the difference between the maximum log-likelihood and
the log-likelihood at the true parameter is asymptotically distributed
according to a chi-square distribution with a number of degrees of
freedom that coincides with the dimension of the model. We attempt to
generalize this phenomenon to other contrast minimization
problems such as those encountered in statistical learning theory. We provide
(non-asymptotic) concentration inequalities for the empirical excess
risk by combining (almost) classical tail bounds.
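For reference, the classical statement reads as follows (the notation is mine: $\ell_n$ is the log-likelihood of an $n$-sample, $\hat\theta_n$ the maximum likelihood estimator, and $\theta_0$ the true parameter of a regular $d$-dimensional model):
\[
2\bigl(\ell_n(\hat\theta_n) - \ell_n(\theta_0)\bigr) \;\xrightarrow{\ d\ }\; \chi^2_d \qquad (n \to \infty).
\]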
Gilles Blanchard (joint with E. Roquain and S. Arlot)
Title: Resampling-based
confidence regions in high dimension from a non-asymptotic point of view
Abstract:
We study generalized bootstrapped confidence regions for the
mean of a random vector whose coordinates have an unknown dependence
structure, with a non-asymptotic control of the confidence level. The
random vector is supposed to be either Gaussian or to have a symmetric
bounded distribution.
We consider two approaches, the first based on a concentration
principle and the second on a direct bootstrapped quantile. The first
one allows us to deal with a very large class of resampling weights,
while our results for the second are restricted to Rademacher weights.
The non-asymptotic point of view developed here is strongly inspired by
recent work in the area of learning theory.
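As a rough illustration of the second approach (this is a sketch in my own notation, not the authors' construction; the non-asymptotic calibration of the confidence level, which is the point of the talk, is omitted), a Rademacher-weighted resampled quantile can serve as the radius of a sup-norm region around the empirical mean:

```python
import numpy as np

def rademacher_sup_quantile(X, alpha=0.05, n_resamples=1000, seed=0):
    """(1 - alpha)-quantile of the sup-norm of the Rademacher-weighted,
    centered empirical mean; illustrative region radius only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centered = X - X.mean(axis=0)                      # center each coordinate
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        eps = rng.choice([-1.0, 1.0], size=n)          # Rademacher weights
        stats[b] = np.abs(centered.T @ eps).max() / n  # sup-norm of weighted mean
    return np.quantile(stats, 1 - alpha)

# Sup-norm region {mu : max_k |mean_k - mu_k| <= radius} around the empirical mean.
X = np.random.default_rng(1).normal(size=(100, 50))
radius = rademacher_sup_quantile(X)
print("illustrative region radius:", radius)
```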
Albert Cohen
Title: Matching vs. basis
pursuit for approximation and learning: a comparison
Abstract:
In approximation and learning problems, one often tries to
exploit an underlying sparsity property of the target function to be
estimated. This property means that this function is well described as
a combination of a few terms in a given dictionary. Two main classes of
algorithms have been developed for this purpose: basis pursuit, which
is based on a convex minimization procedure, and
matching pursuit, in which the relevant components in the dictionary are
sequentially identified. Both algorithms turn out to be equivalent in
the case where the dictionary is an orthonormal basis. In this talk,
their similarities and differences will be explored in the case of a
general dictionary, in particular from the perspective of approximation
theory.
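As a toy comparison of the two families (a sketch with arbitrary parameters, using generic scikit-learn implementations rather than anything specific to the talk), one can look at the supports recovered by orthogonal matching pursuit and by an $\ell_1$-penalized (lasso-form basis pursuit) fit on a random dictionary:

```python
import numpy as np
from sklearn.linear_model import Lasso, OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, p, k = 100, 256, 5                        # samples, dictionary size, sparsity
D = rng.normal(size=(n, p)) / np.sqrt(n)     # a generic (non-orthonormal) dictionary
coef = np.zeros(p)
coef[rng.choice(p, size=k, replace=False)] = rng.normal(size=k)
y = D @ coef + 0.01 * rng.normal(size=n)     # sparse combination plus noise

# Matching pursuit flavour: greedily select k atoms.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(D, y)

# Basis pursuit flavour: convex l1-penalized least squares (lasso form).
bp = Lasso(alpha=1e-3, max_iter=100000).fit(D, y)

print("true support:", np.sort(np.flatnonzero(coef)))
print("OMP support :", np.sort(np.flatnonzero(omp.coef_)))
print("BP support  :", np.sort(np.flatnonzero(np.abs(bp.coef_) > 1e-6)))
```

In the orthonormal case both essentially reduce to selecting the largest coefficients, which is one way to see the equivalence mentioned above.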
Ingrid Daubechies: Convergence
results and counterexamples for AdaBoost and related algorithms
Nira Dyn: Two
algorithms for adaptive approximation of bivariate functions by
piecewise linear polynomials on triangulations
Maya Gupta
Title: Functional Bregman
Divergence, Bayesian Estimation of Distributions, and Completely Lazy
Classifiers
Abstract:
We generalize Bregman divergences to a functional Bregman divergence
and show that direct Bayesian estimation of a distribution such that
the expected functional Bregman risk is minimized leads to the mean
distribution, generalizing the well-known result that "the mean
minimizes the average squared error." This result, together with some intuition
from multiresolution theory, leads to the first effective Bayesian
quadratic discriminant analysis classifier. We propose a new approach
to reduce the bias of Bayesian QDA that avoids the difficulties of
Gaussian mixture models, and extend the Bayesian framework to create a
completely-lazy classifier that has average-performance guarantees and
in practice achieves state-of-the-art classification performance for
high-dimensional problems without requiring the cross-validation of any
parameters.
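For reference, the (vector) Bregman divergence being generalized is, for a strictly convex differentiable $\phi$ (the notation is mine),
\[
d_\phi(x, y) \;=\; \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle,
\]
which recovers the squared Euclidean error for $\phi(x) = \|x\|^2$; roughly speaking, the functional version replaces $\phi$ by a functional defined on a set of functions and the gradient by a Fréchet derivative.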
Lee Jones
Title: Finite sample
minimax estimation, fusion in machine learning, and overcoming the
curse of dimensionality
Abstract:
A minimax approach is described for learning the value of an unknown
function f(x) at a query point x_0, based on knowledge of the function's
values (with possible noise added) at given points
{x_j}. As a result, optimal accuracy bounds, based on assumptions
about the behavior of f(x), may be determined for several new and old
algorithms for learning f(x_0).
Empirical evidence indicates that averaging the learned estimate of
f(x_0) from many different algorithms (estimates) often leads to
increased accuracy, as with random subspace methods, random forests,
and varying kernel covariances. Furthermore, when statistically
analyzed, different algorithms are often estimating different
conditional expectations of the corresponding response
variable. A method (fusion) is presented for optimally
combining the estimates from a given set (ensemble) of such algorithms
and obtaining an accuracy bound on the combination. The approach is to
introduce information-theoretic concepts relating accuracy bounds
of an aggregate estimate to the degree of conditioning of
the aggregate estimate.
The minimax approach allows the fusion procedure to account for the
possibility of spurious features causing (with a known high
probability) only a small number of algorithms to overfit.
When a large number of the algorithms are highly predictive,
the fusion estimate remains accurate and the curse of dimensionality
may be avoided.
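A minimal sketch of the plain averaging mentioned in the second paragraph (not the fusion procedure itself; the models, data, and query point are arbitrary illustrations):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 10))               # the given points x_j
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200)   # noisy values of f
x0 = np.zeros((1, 10))                                    # the query point x_0

# Three different algorithms, each producing its own estimate of f(x_0).
models = [
    RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y),
    KernelRidge(kernel="rbf", gamma=1.0, alpha=0.1).fit(X, y),
    KNeighborsRegressor(n_neighbors=10).fit(X, y),
]
estimates = np.array([m.predict(x0)[0] for m in models])
print("individual estimates of f(x_0):", estimates)
print("averaged estimate             :", estimates.mean())
```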
(Research supported by the Norbert Wiener Center (NWC) and the
Submillimeter-Wave Technology Laboratory (STL).)
Dominique Picard (joint work with Gerard Kerkyacharian)
Title: A 'Frame-work' in
Learning Theory
Abstract: available as a separate PDF file
Vladimir Koltchinskii
Title: Sparse Recovery Problems
in Learning Theory
Abstract:
We will consider the problem of learning a target function based on
its noisy observations at random points via penalized empirical risk
minimization with a convex loss function and a convex complexity penalty.
Depending on the loss function, this framework includes various
regression problems as well as large margin classification
(such as boosting and SVM).
We will be interested in two specific cases. In the simpler case, the
target function belongs to the linear span of a very large dictionary
of given functions. In the more general case, it belongs to the linear
span of a very large number of reproducing kernel Hilbert spaces
(RKHS). This includes several important examples such as
aggregation problems for large ensembles of kernel machines and
traditional additive models in Statistics.
In both cases, it is assumed that the target function has a "sparse
representation" in the "dictionary" and the goal is to recover the
function with an error that depends on the "sparsity" of the problem
(rather than on the size of the dictionary).
We study an approach based on $\ell_1$-type complexity penalization and
establish several probabilistic inequalities showing the relationship
between the sparsity of the empirical solution and the sparsity of the
target function and providing bounds on the excess risk and L2-error
that depend on the sparsity of the problem.
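Schematically, in the dictionary case $\{h_1, \dots, h_N\}$ the estimator has the form (the notation is mine)
\[
\hat\lambda \;=\; \operatorname*{argmin}_{\lambda \in \mathbb{R}^N}\;
\frac{1}{n} \sum_{i=1}^n \ell\Bigl(Y_i,\ \sum_{j=1}^N \lambda_j h_j(X_i)\Bigr)
\;+\; \varepsilon\, \|\lambda\|_{\ell_1},
\]
with $\ell$ the convex loss and $\varepsilon$ the regularization parameter; the inequalities relate the sparsity of $\hat\lambda$ to that of the target.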
(Some parts of this talk are based on
joint work and on discussions with Ming Yuan and with Alexandre
Tsybakov).
Tomaso Poggio
Title: Learning: neuroscience
and engineering applications
Abstract:
The problem of learning is one of the main gateways to making
intelligent machines and to understanding how the brain works. In this
talk I will sketch a model of the ventral stream of the visual cortex
in primates, describing how the brain may learn to recognize objects,
and show that the resulting model is capable of performing recognition
on datasets of complex images at the level of human performance in
rapid categorization tasks. The model performs surprisingly well when
compared with state-of-the-art computer vision systems in the categorization
of complex images, suggesting that neuroscience may begin to effectively
guide engineering efforts in the domain of vision.
Relevant papers can be downloaded from
http://www.ai.mit.edu/projects/cbcl/publications/all-year.html
Christoph Schwab
Title: Elliptic PDEs
with random field input -- numerical analysis of forward solvers and of
goal-oriented input learning
Abstract:
The numerical analysis of gPC FEM for elliptic PDEs in polygonal or
polyhedral domains with random field inputs is addressed; the input
data is assumed to be a random field that is represented as a
(nonlinearly transformed) Karhunen-Loève expansion.
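For reference, such an expansion of a random input field $a(x, \omega)$ reads (the notation is mine)
\[
a(x, \omega) \;=\; \mathbb{E}[a](x) \;+\; \sum_{m \ge 1} \sqrt{\lambda_m}\, \varphi_m(x)\, Y_m(\omega),
\]
where $(\lambda_m, \varphi_m)$ are the eigenpairs of the covariance operator of $a$ and the $Y_m$ are uncorrelated random variables; the talk allows a nonlinear transformation of such an expansion.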
Assuming completely known inputs, we present an a priori analysis of the
complexity of the deterministic `forward' solve, as a function of the
regularity of the spatial two-point correlation function of the input
data.
Adaptive selection of dimension and spectral order of active parameters
in the gPC representation of the random field solution of the PDE will
be addressed.
Compressive sampling of the input field with respect to the
response of the sPDE will also be addressed.
Steve Smale
Title: Vision and learning
Abstract:
The goal of this work is to represent the space of images in a
high-dimensional Euclidean space via a map which abstracts the notion of
a neural response.
Ingo Steinwart
Title: Approximation
Theoretical Questions for Support Vector Machines
Abstract:
Support Vector Machines (SVMs) are one of the state-of-the-art learning
algorithms for both classification and regression problems. This talk
presents an overview of some of the techniques used to
analyze their learning performance. Particular emphasis will be
placed on open questions related to approximation theory.
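For orientation, an SVM decision function solves a regularized empirical risk problem over the reproducing kernel Hilbert space $H$ of its kernel (the notation is mine),
\[
f_{D,\lambda} \;=\; \operatorname*{argmin}_{f \in H}\; \lambda \|f\|_H^2 \;+\; \frac{1}{n} \sum_{i=1}^n L\bigl(y_i, f(x_i)\bigr),
\]
and the approximation-theoretic questions concern, for instance, how fast the approximation error $\inf_{f \in H} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_L(f) - \mathcal{R}_L^*\bigr)$ decays as $\lambda \to 0$, where $\mathcal{R}_L$ denotes the $L$-risk and $\mathcal{R}_L^*$ its infimum.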
Vladimir Temlyakov
Title: Universality and
Lebesgue inequalities in approximation and estimation
Abstract:
The concept of the Kolmogorov width provides a very nice
theoretical way of selecting an optimal approximation method. The major
drawback of this approach (from a practical point of view) is
that in order to initialize such a selection procedure we need to
know a function class F. In
many contemporary practical problems we have no idea which class to
choose in place of F. There
are two ways to overcome the above problem. The first one is to return
(in spirit) to the classical setting that goes back to Chebyshev and
Weierstrass. In this setting we fix a priori a form of an
approximant and look for an approximation method that is
optimal or near optimal for each individual function from X. Also, we specify not only a form
of an approximant but also choose a specific method of approximation
(for instance, one which is known to perform well in practical
implementations). Now we have a precise mathematical problem of
studying the efficiency of our specific method of approximation. This
setting leads to the Lebesgue inequalities. The second way to overcome
the above-mentioned drawback of the width-based approach
consists in weakening the a priori assumption f in F. Instead of looking for an
approximation method that is optimal (near optimal) for a given single
class F we look for an
approximation method that is near optimal for each class from a given
collection F
of classes. Such a method is called universal
for F. We
will discuss a realization of the above two ways in approximation and
in estimation.
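For reference, the Kolmogorov width mentioned above is (the notation is mine)
\[
d_n(F, X) \;=\; \inf_{\substack{L \subset X \\ \dim L \le n}}\; \sup_{f \in F}\; \inf_{g \in L}\; \|f - g\|_X,
\]
the best worst-case error achievable over the class $F$ by $n$-dimensional linear subspaces of $X$; a Lebesgue inequality, by contrast, bounds the error of a fixed method on an individual $f$ in terms of the best possible approximation of that $f$.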
Alessandro Verri
Title: Regularization Algorithms for Learning
Abstract:
In this talk we investigate the close relationship between learning and
regularization by importing into the learning domain algorithms developed
in the context of regularization theory. We describe a nonlinear
regularization algorithm which seems to be well suited to address the
problem of feature selection. We discuss theoretical properties,
implementation issues, and experimental results on real-world problems.
Patrick Wolfe (with Mohamed-Ali Belabbas)
Title: The Nystrom
Extension and Spectral Methods in Learning: A New Algorithm for
Low-Rank Approximation of Quadratic Forms
Abstract:
Spectral methods are of fundamental importance in statistical
learning, as they underlie algorithms from classical principal
components analysis to more recent approaches that exploit manifold
structure. In most cases, the core technical problem can be reduced to
computing a low-rank approximation to a positive-definite
kernel. Using traditional methods, such an approximation can be
obtained with computational complexity that scales as the cube of the
number of training examples. For the growing number of applications
dealing with very large or high-dimensional data sets, however, these
techniques are too costly. A known alternative is the Nystrom extension
from finite element methods. While its application to machine
learning has previously been suggested in the literature, we introduce
here what is, to the best of our knowledge, the first randomized
algorithm of this type to yield a relative
approximation error bound. Our results follow from a new class of
algorithms for the approximation of matrix products, which reveal
connections between classical linear algebraic quantities such as Schur
complements and techniques from theoretical computer science such as
the notion of volume sampling.
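For context, here is a plain Nystrom approximation of a positive semi-definite kernel matrix with uniformly sampled landmark columns (a sketch only; the randomized sampling scheme and the relative error bound of the talk are not reproduced):

```python
import numpy as np

def nystrom_approximation(K, landmarks):
    """Low-rank approximation of a PSD kernel matrix K from m sampled columns."""
    C = K[:, landmarks]                      # the sampled columns
    W = K[np.ix_(landmarks, landmarks)]      # the landmark-landmark block
    return C @ np.linalg.pinv(W) @ C.T       # K ~= C W^+ C^T

# Example on an RBF kernel matrix, with landmarks drawn uniformly at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists)
landmarks = rng.choice(500, size=50, replace=False)
K_hat = nystrom_approximation(K, landmarks)
print("relative Frobenius error:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```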
Ding-Xuan Zhou
Title: Learnability of
Gaussians with Flexible Variances
Abstract:
Gaussian kernels with flexible variances provide a rich family of
Mercer kernels for learning algorithms. We show that the union of the
unit balls of reproducing kernel Hilbert spaces generated by Gaussian
kernels with flexible variances is a uniform Glivenko-Cantelli class.
This result confirms a conjecture concerning learnability of Gaussian
kernels and verifies the uniform convergence of many learning
algorithms involving Gaussians with changing variances. Rademacher
averages and empirical covering numbers are used to estimate sample
errors of multi-kernel regularization schemes associated with general
loss functions. It is then shown that the regularization error
associated with the least squares loss and Gaussian kernels can be
greatly improved when flexible variances are allowed. Finally, for
regularization schemes generated by Gaussian kernels with flexible
variances, we present explicit learning rates for regression with the
least squares loss and for classification with the hinge loss.
Extensions to manifold and Markov sampling settings will also be
discussed.
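In notation of my own choosing, the class shown to be uniform Glivenko-Cantelli is the union of unit balls
\[
\bigcup_{\sigma > 0} \bigl\{ f \in \mathcal{H}_{K_\sigma} : \|f\|_{K_\sigma} \le 1 \bigr\},
\qquad K_\sigma(x, y) = \exp\bigl(-\|x - y\|^2 / \sigma^2\bigr),
\]
where $\mathcal{H}_{K_\sigma}$ is the reproducing kernel Hilbert space of the Gaussian kernel with variance parameter $\sigma$.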