Author: J. Lin and A. Kolcz
Summary:
Hadoop, the open-source implementation of MapReduce, has emerged as a popular framework for large-scale data processing. Among its advantages are the ability to horizon-
tally scale to petabytes of data on thousands of commodity servers, easy-to-understand programming semantics, and a high degree of fault tolerance.
Background
Let X be the input space and Y be the output space. Given a set of training samples D={(x1,y1), (x2,y2),..., (xn,yn)} from the space X Y (called labeled examples or instances), the supervised machine learning task is to induce a function f : X→Y that best explains the training data. There are three main components of a machine learning solution: the data, features extracted from the data, and the model.
The size of the dataset is the most important factor. Studies have repeatedly shown that simple models trained over enormous quantities of data outperform more sophisticated models trained on less data. This has led to the growing dominance of simple, data-driven solutions.
Training Model
Analytics at Twitter is performed mostly using Pig, a high-level data ow language that compiles into physical plans that are executed on Hadoop.
A training instance in Pig has the following schema: (label: int, features: map[]).
As an alternative, training instances can be represented in SVMLight format, a simple sparse-vector encoding format supported by many off -the-shelf packages. SVMLight is a
line-based format, with one training instance per line. Each line begins with a label, i.e., {+1, -1} followed by one or more space separated (feature id, feature value) pairs internally delimited by a colon (:).
In this case, the storage function receives output records and feeds them to the learner (without writing any output). Only when all records have been processed (via an API hook in Pig) does the storage function write the learned model to disk. That is, the input is a series of tuples representing (label, feature vector) pairs, and the output is the trained model itself.
For example, if we set the parallel factor to one, then all training instances are shu ed to a single reducer and fed to exactly one learner. By setting larger parallel factors n>1, we can learn simple ensembles|the training data is partitioned n-ways and models are independently trained on each partition.