2013年5月30日 星期四

[Summary] Hive - A Warehousing Solution Over a Map-Reduce Framework

Topic: Hive - A Warehousing Solution Over a Map-Reduce Framework

Author: Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy

Summary:

The size of data sets being collected and analyzed is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source map-reduce implementation. However, the map-reduce programming model is very low level and the custom programs are hard to maintain and reuse.

Data in Hive is organized into:
  • Tables: These are analogous to tables in relational databases.
  • Partitions: Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory.
  • Buckets: Data in each partition may in turn be divided into buckets based on the hash of a column in the table.

Hive provides a SQL-like query language called HiveQL which supports select, project, join, aggregate, union all and sub-queries in the from clause. HiveQL supports data definition
(DDL) statements to create tables with specific serialization formats, and partitioning and bucketing columns. It supports user defined column transformation (UDF) and aggregation (UDAF) functions implemented in Java.

Hive Architecture
  • External Interfaces: Hive provides both user interfaces like command line (CLI) and web UI, and application programming interfaces (API) like JDBC and ODBC
  • Hive Thrift Server: a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages
  • Metastore: the system catalog
  • Driver: manage the life cycle of a HiveQL statement during compilation, optimization and execution
  • Compiler: translate the statement into a plan which consists of a DAG of mapreduce jobs
 

The metastore contains the following objects:
  • Database: a namespace for tables
  • Table: list of columns and their types, owner, storage and SerDe information
  • Partition: Each partition can have its own columns and SerDe and storage information.

Future Work
  • Make HiveQL subsume SQL syntax
  • Build a costbased optimizer and adaptive optimization techniques to come up with more efficient plans
  • Explore columnar storage and more intelligent data placement to improve scan performance
  • Enhance the JDBC and ODBC drivers for Hive for integration with commercial BI tools which only work with traditional relational warehouses
  • Explore methods for multi-query optimization techniques and performing generic n-way joins in a single map-reduce job.

2013年5月18日 星期六

[Summary] A Global Geometric Framework for Nonlinear Dimensionality Reduction

Topic: A Global Geometric Framework for Nonlinear Dimensionality Reduction

Author: J.B. Tenenbaum

Summary:

Scientists working with large volumes of high-dimensional data confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations.

The challenge of non-linearity with data lying on a two-dimensional “Swiss roll”: points
far apart on the underlying manifold, as measured by their geodesic, or shortest path,  distances, may appear deceptively close in the high-dimensional input space, as measured by their straight-line Euclidean distance.



The complete isometric feature mapping, or Isomap, algorithm has three steps:

1. Construct neighborhood graph: Determine which points are neighbors on the manifold M, based on the distances dX (i, j) between pairs of points i, j in the input space X. These neighborhood relations are represented as a weighted graph G over the data points, with edges of weight dX(i, j) between neighboring points.

2. Compute shortest graphs: Estimate the geodesic distances dM(i, j) between all pairs
of points on the manifold M by computing their shortest path distances dG(i, j) in the graph G.

3. Construct d-dimensional embedding: Applies classical MDS to the matrix of graph distances DG={dG(i, j)}, constructing an embedding of the data in a d-dimensional Euclidean space Y that best preserves the manifold’s estimated intrinsic geometry. The coordinate vectors yi for points in Y are chosen to minimize the cost function:





As with PCA or MDS, the true dimensionality of the data can be estimated from the decrease in error as the dimensionality of Y is increased. Isomap is guaranteed asymptotically to recover the true dimensionality and geometric structure of a strictly larger class of nonlinear manifolds.

The guarantees of asymptotic convergence rest on a proof that as the number of data points increases, the graph distances dG(i, j) provide increasingly better approximations to the intrinsic geodesic distances dM(i, j), becoming arbitrarily accurate in the limit of infinite data. How quickly dG(i, j) converges to dM(i, j) depends on certain parameters of the manifold as it lies within the high-dimensional space (radius of curvature and branch separation) and on the density of points.