Big Data Analytics: Optimization and Randomization

KDD 2015 Tutorial

Tianbao Yang, Qihang Lin, Rong Jin

Overview

As the scale and dimensionality of data continue to grow in many applications of data analytics (e.g., bioinformatics, finance, computer vision, medical informatics), it becomes critical to develop efficient and effective algorithms to solve numerous machine learning and data mining problems. This tutorial will focus on simple yet practically effective techniques and algorithms for big data analytics.

In the first part, we will present state-of-the-art large-scale optimization algorithms, including stochastic gradient descent methods, stochastic coordinate descent methods, and distributed optimization algorithms, for solving machine learning problems. In the second part, we will focus on randomized approximation algorithms for learning from large-scale data. We will discuss (i) randomized algorithms for low-rank matrix approximation; (ii) approximation techniques for solving kernel learning problems; and (iii) randomized reduction methods for addressing the challenge of high dimensionality. Alongside the algorithm descriptions, we will present empirical results to aid understanding of the algorithms and comparison among them.
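As a rough illustration of the stochastic gradient descent methods mentioned above, the following is a minimal sketch for least-squares regression. It is not taken from the tutorial slides: the function name `sgd_least_squares`, the constant step size, the per-epoch shuffling, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def sgd_least_squares(X, y, step=0.01, epochs=20, seed=0):
    """Plain SGD on the loss 0.5 * (x_i . w - y_i)^2, one example at a time.

    Hypothetical helper for illustration; the constant step size and the
    random reshuffling of examples each epoch are assumed design choices.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        # Visit the examples in a fresh random order every epoch.
        for i in rng.permutation(n):
            # Gradient of the single-example squared loss with respect to w.
            grad = (X[i] @ w - y[i]) * X[i]
            w -= step * grad
    return w


# Synthetic problem (assumed): 500 examples, 5 features, small noise.
rng = np.random.default_rng(42)
X = rng.standard_normal((500, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.01 * rng.standard_normal(500)

w_hat = sgd_least_squares(X, y)
print(np.round(w_hat, 2))
```

Each update touches a single example, so the cost per step is independent of the number of examples; this is the property that makes such methods attractive at scale.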

Target Audience

This tutorial is intended for researchers, graduate students, and practitioners who are interested in solving big data analytics problems. Through this tutorial, the audience will gain a comprehensive understanding of the basic theories and techniques for improving the scalability of learning algorithms, and will learn which algorithms are appropriate for which problems. Big data analytics has become one of the core areas of KDD, attracting researchers and papers on efficient and distributed data mining platforms and algorithms, distributed computing (cloud, map-reduce, MPI), large-scale optimization, and novel statistical techniques for big data.

The audience is expected to have basic knowledge of machine learning, convex optimization, and linear algebra. Experience with stochastic optimization or randomized algorithms for solving big data analytics problems is a plus.

Content

Part I: Basics
- Introduction
- Basics in Linear Algebra
- Basics in Convex Optimization
Part II: Optimization
- Stochastic Gradient Descent/Stochastic Variance Reduced Gradient
- Stochastic Coordinate Descent/Stochastic Dual Coordinate Ascent
- Distributed Optimization
Part III: Randomization
- Randomized Dimension Reduction
  - JL transforms
  - Subspace embeddings
  - Column sampling
- Randomized Algorithms
  - Randomized Low-rank Approximation
  - Randomized K-means Clustering
  - Randomized Least-squares Regression
  - Randomized Classification (Regression)
  - Randomized Kernel methods
Concluding Remarks
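
To give a flavor of the JL transforms item listed under Part III, the following is a minimal sketch of a Gaussian random projection that approximately preserves pairwise Euclidean distances. It is an illustrative assumption rather than the tutorial's own code: the helper names `gaussian_projection` and `pairwise_distances`, the dense Gaussian matrix, and the problem sizes are all hypothetical choices.

```python
import numpy as np


def gaussian_projection(X, k, seed=0):
    """Project the rows of X from d to k dimensions with a random matrix.

    Hypothetical helper: entries of R are i.i.d. N(0, 1/k), so squared
    norms are preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ R


def pairwise_distances(A):
    """Euclidean distances between all pairs of rows of A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(d2, 0.0))


# Synthetic data (assumed): 20 points in 2000 dimensions, reduced to 512.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 2000))
Z = gaussian_projection(X, k=512)

# Compare each pairwise distance after projection with the original one.
iu = np.triu_indices(20, k=1)
ratio = pairwise_distances(Z)[iu] / pairwise_distances(X)[iu]
max_distortion = np.max(np.abs(ratio - 1.0))
print(Z.shape, max_distortion)
```

The projection matrix is drawn without looking at the data, which is why such reductions are cheap to apply and easy to distribute.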

Slides

Slides presented at the tutorial are here.