2013年6月20日 星期四

[Summary] Large-Scale Machine Learning at Twitter

Topic: Large-Scale Machine Learning at Twitter

Author: J. Lin and A. Kolcz

Summary:

Hadoop, the open-source implementation of MapReduce, has emerged as a popular framework for large-scale data processing. Among its advantages are the ability to horizon-
tally scale to petabytes of data on thousands of commodity servers, easy-to-understand programming semantics, and a high degree of fault tolerance.

Background
Let X be the input space and Y be the output space. Given a set of training samples D={(x1,y1), (x2,y2),..., (xn,yn)} from the space X Y (called labeled examples or instances), the supervised machine learning task is to induce a function f : X→Y that best explains the training data. There are three main components of a machine learning solution: the data, features extracted from the data, and the model.

The size of the dataset is the most important factor. Studies have repeatedly shown that simple models trained over enormous quantities of data outperform more sophisticated models trained on less data. This has led to the growing dominance of simple, data-driven solutions.

Training Model
Analytics at Twitter is performed mostly using Pig, a high-level data ow language that compiles into physical plans that are executed on Hadoop.

A training instance in Pig has the following schema: (label: int, features: map[]).
As an alternative, training instances can be represented in SVMLight format, a simple sparse-vector encoding format supported by many off -the-shelf packages. SVMLight is a
line-based format, with one training instance per line. Each line begins with a label, i.e.,   {+1, -1} followed by one or more space separated (feature id, feature value) pairs internally delimited by a colon (:).

In this case, the storage function receives output records and feeds them to the learner (without writing any output). Only when all records have been processed (via an API hook in Pig) does the storage function write the learned model to disk. That is, the input is a series of tuples representing (label, feature vector) pairs, and the output is the trained model itself.
For example, if we set the parallel factor to one, then all training instances are shu ed to a single reducer and fed to exactly one learner. By setting larger parallel factors n>1, we can learn simple ensembles|the training data is partitioned n-ways and models are independently trained on each partition.



2013年6月6日 星期四

[Summary] Online Dictionary Learning for Sparse Coding

Topic: Online Dictionary Learning for Sparse Coding

Author: Mairal

Summary:

Sparse coding—that is, modeling data vectors as sparse linear combination of basis elements. This paper focuses on learning the basis set, also called dictionary, to adapt it to specific data.

Problem Statement
Traditional dictionary learning can be considered as optimizing the empirical cost function

where D is the dictionary, and l is a loss function such that l(x,D) should be small if D is “good” at representing the signal x.

The the loss function l, which is defined as the distance between the original signal and the dictionary representation, plus the regularization term

Online Dictionary Learning
The algorithm is as below:

Assuming the training set composed of i.i.d. samples of a distribution p(x), its inner loop draws one element xt at a time, as in stochastic gradient descent, and alternates classical
sparse coding steps for computing the decomposition alpha t of xt over the dictionary Dt−1 obtained at the previous iteration, with dictionary update steps where the new dictionary
Dt is computed by minimizing cost function.

Dictionary Update

The algorithm for updating the dictionary uses blockcoordinate descent with warm restarts, and one of its main advantages is that it is parameter-free and does not require any learning rate tuning.

Optimizing the Algorithm
  • Handling Fixed-Size Datasets: The size of the training set is often finite. In this situation, the same data points may be examined several times, and it is very common in online algorithms to simulate an i.i.d. sampling of p( x ) by cycling over a randomly permuted training set.
  • Mini-Batch Extension: Improve the convergence speed of algorithm by drawing η>1 signals at each iteration instead of a single one, which is a classical heuristic in stochastic gradient descent algorithm.
  • Purging the Dictionary from Unused Atoms: Some of the dictionary atoms are never used. Replace them during the optimization by elements of the training set.