Thursday, May 30, 2013

[Summary] Hive - A Warehousing Solution Over a Map-Reduce Framework

Topic: Hive - A Warehousing Solution Over a Map-Reduce Framework

Authors: Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy

Summary:

The size of data sets being collected and analyzed is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source map-reduce implementation. However, the map-reduce programming model is very low level, and the custom programs it requires are hard to maintain and reuse. Hive is an open-source data warehousing solution built on top of Hadoop that addresses these problems.

Data in Hive is organized into:
  • Tables: These are analogous to tables in relational databases.
  • Partitions: Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory.
  • Buckets: Data in each partition may in turn be divided into buckets based on the hash of a column in the table.
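
To make the layout concrete, here is a minimal sketch of how a row could map to a partition sub-directory and a bucket. The table path, the column names, the `bucket_NNNNN` file naming, and the use of `crc32` as the hash are all assumptions for illustration; Hive uses its own hash function and file naming.

```python
import zlib


def bucket_for(value: str, num_buckets: int) -> int:
    """Pick a bucket from a hash of the bucketing column value.

    crc32 is a deterministic stand-in here; it is NOT Hive's hash function.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def storage_path(table_dir: str, partition: dict, bucket_value: str, num_buckets: int) -> str:
    """Build a hypothetical path: table directory / partition sub-directories / bucket file."""
    # One sub-directory per partition column, in the order given.
    parts = [f"{col}={val}" for col, val in partition.items()]
    bucket = bucket_for(bucket_value, num_buckets)
    return "/".join([table_dir.rstrip("/"), *parts, f"bucket_{bucket:05d}"])


# Hypothetical table partitioned by date and country, bucketed on a user id.
path = storage_path(
    "/warehouse/page_views",
    {"ds": "2013-05-30", "country": "US"},
    "user_42",
    32,
)
print(path)
```

Because the partition values are encoded in the directory names, a query that filters on the partition columns only needs to read the matching sub-directories.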

Hive provides a SQL-like query language called HiveQL which supports select, project, join, aggregate, union all and sub-queries in the FROM clause. HiveQL supports data definition language (DDL) statements to create tables with specific serialization formats, and partitioning and bucketing columns. It supports user-defined column transformation (UDF) and aggregation (UDAF) functions implemented in Java.
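
The difference between a UDF and a UDAF can be sketched conceptually. In Hive these are Java classes; the Python below is only an illustration of the two shapes, and every name in it (`normalize_url`, `CountDistinct`, the sample rows) is made up for this example rather than taken from the Hive API.

```python
from collections import defaultdict


def normalize_url(url: str) -> str:
    """UDF-style function: one column value in, one column value out, per row."""
    return url.strip().lower().rstrip("/")


class CountDistinct:
    """UDAF-style function: folds the rows of a group into a single value."""

    def __init__(self):
        self.seen = set()

    def iterate(self, value):
        # Called once per row in the group.
        self.seen.add(value)

    def terminate(self):
        # Called once at the end to produce the aggregate.
        return len(self.seen)


# Hypothetical (user, url) rows.
rows = [
    ("alice", "http://Example.com/"),
    ("alice", "http://example.com"),
    ("bob", "http://example.com/a"),
]

# Roughly "select user, count_distinct(normalize_url(url)) ... group by user".
groups = defaultdict(CountDistinct)
for user, url in rows:
    groups[user].iterate(normalize_url(url))

result = {user: agg.terminate() for user, agg in groups.items()}
print(result)
```

The UDF runs once per row, while the UDAF keeps state across all rows of a group and emits one value per group.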

Hive Architecture
  • External Interfaces: Hive provides both user interfaces, such as the command line (CLI) and web UI, and application programming interfaces (APIs), such as JDBC and ODBC.
  • Hive Thrift Server: exposes a simple client API for executing HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages.
  • Metastore: the system catalog.
  • Driver: manages the life cycle of a HiveQL statement during compilation, optimization and execution.
  • Compiler: translates the statement into a plan which consists of a DAG of map-reduce jobs.
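
A plan shaped as a DAG of map-reduce jobs can be run by visiting the jobs in dependency order. The sketch below shows that idea with Python's standard `graphlib`; the job names and the `run_plan` helper are hypothetical, and this is not how Hive itself is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each key is a job, each value is the set of jobs it depends on.
plan = {
    "join_users_pages": set(),
    "aggregate_by_country": {"join_users_pages"},
    "aggregate_by_day": {"join_users_pages"},
    "union_results": {"aggregate_by_country", "aggregate_by_day"},
}


def run_plan(plan, run_job):
    """Run every job after all of its dependencies; return the execution order."""
    order = []
    for job in TopologicalSorter(plan).static_order():
        run_job(job)
        order.append(job)
    return order


executed = run_plan(plan, lambda job: print("running", job))
```

The two aggregation jobs do not depend on each other, so their relative order is not fixed; only the constraint that the join runs first and the union runs last is guaranteed.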
 

The metastore contains the following objects:
  • Database: a namespace for tables
  • Table: list of columns and their types, owner, storage and SerDe information
  • Partition: each partition can have its own columns, SerDe and storage information.
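
One way to picture these objects is as a small set of nested records. This is only a sketch under assumptions: the class names, field names, and the fallback rule in `effective_storage` are invented for illustration and do not reflect the actual metastore schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StorageInfo:
    location: str  # where the data lives
    serde: str     # name of the serializer/deserializer


@dataclass
class Partition:
    values: dict                           # partition column -> value
    columns: Optional[list] = None         # set only if it differs from the table
    storage: Optional[StorageInfo] = None  # set only if it differs from the table


@dataclass
class Table:
    name: str
    owner: str
    columns: list  # (column name, type) pairs
    storage: StorageInfo
    partitions: list = field(default_factory=list)

    def effective_storage(self, partition: Partition) -> StorageInfo:
        """Assumed rule: a partition's own storage info wins over the table's."""
        return partition.storage or self.storage


@dataclass
class Database:
    name: str  # namespace for tables
    tables: dict = field(default_factory=dict)


# Hypothetical catalog entries.
table_storage = StorageInfo("/warehouse/page_views", "DefaultSerDe")
old = Partition({"ds": "2013-05-29"}, storage=StorageInfo("/archive/page_views", "LegacySerDe"))
new = Partition({"ds": "2013-05-30"})
page_views = Table("page_views", "alice", [("user", "string"), ("url", "string")], table_storage, [old, new])
db = Database("default", {"page_views": page_views})
```

Here the older partition carries its own SerDe and location, while the newer one falls back to the table's.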

Future Work
  • Make HiveQL subsume SQL syntax
  • Build a cost-based optimizer and adaptive optimization techniques to come up with more efficient plans
  • Explore columnar storage and more intelligent data placement to improve scan performance
  • Enhance the JDBC and ODBC drivers for Hive for integration with commercial BI tools which only work with traditional relational warehouses
  • Explore multi-query optimization techniques and ways to perform generic n-way joins in a single map-reduce job
