Paper Reading for aMMAI: [Summary] Hive - A Warehousing Solution Over a Map-Reduce Framework

Topic: Hive - A Warehousing Solution Over a Map-Reduce Framework

Authors: Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy

Summary:

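The size of data sets being collected and analyzed is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source map-reduce implementation used as an alternative for storing and processing very large data sets on commodity hardware. However, the map-reduce programming model is very low level, and the custom programs it requires are hard to maintain and reuse. Hive is an open-source data warehousing solution built on top of Hadoop to address this.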

Data in Hive is organized into:

Tables: These are analogous to tables in relational databases.
Partitions: Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory.
Buckets: Data in each partition may in turn be divided into buckets based on the hash of a column in the table.
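
As a rough illustration of this layout, the sketch below maps a row to the directory and bucket it would land in. The warehouse root `/wh`, the `bucket_NNNNN` file naming, and the use of CRC32 as the hash are assumptions made for this example; Hive uses its own hash function and file naming.

```python
import zlib


def partition_dir(table_dir, partition_values):
    """Build the sub-directory for one partition: one path level per partition column."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([table_dir] + parts)


def bucket_index(value, num_buckets):
    """Pick a bucket from the hash of the bucketing column (CRC32 is a stand-in hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def storage_path(table_dir, partition_values, bucket_value, num_buckets):
    """Hypothetical full location of a row: table dir / partition dirs / bucket file."""
    idx = bucket_index(bucket_value, num_buckets)
    return f"{partition_dir(table_dir, partition_values)}/bucket_{idx:05d}"


if __name__ == "__main__":
    parts = {"ds": "20090101", "ctry": "US"}
    print(partition_dir("/wh/T", parts))  # /wh/T/ds=20090101/ctry=US
    print(storage_path("/wh/T", parts, bucket_value=42, num_buckets=32))
```

Because the partition values are encoded in the path, a query that filters on `ds` or `ctry` only needs to read the matching sub-directories.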

Hive provides a SQL-like query language called HiveQL, which supports select, project, join, aggregate, union all, and sub-queries in the from clause. HiveQL also supports data definition (DDL) statements to create tables with specific serialization formats, and with partitioning and bucketing columns. Users can plug in user-defined column transformation (UDF) and aggregation (UDAF) functions implemented in Java.

Hive Architecture

External Interfaces: Hive provides both user interfaces like command line (CLI) and web UI, and application programming interfaces (API) like JDBC and ODBC
Hive Thrift Server: exposes a simple client API to execute HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages
Metastore: the system catalog
Driver: manages the life cycle of a HiveQL statement during compilation, optimization and execution
Compiler: translates the statement into a plan which consists of a DAG of map-reduce jobs
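
To make the compiler's output more concrete, the sketch below hand-writes the single map-reduce job that a simple grouped count (for example, counting rows per school) might correspond to. It is an in-memory toy, not Hive's actual plan format; the record fields and function names are invented for illustration.

```python
from collections import defaultdict


def map_phase(rows, key_col):
    """Map: emit (group key, 1) for every input row."""
    for row in rows:
        yield row[key_col], 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values for each key to get the per-group count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by(rows, key_col):
    """Run the three phases in order, like one map-reduce job."""
    return reduce_phase(shuffle(map_phase(rows, key_col)))


if __name__ == "__main__":
    rows = [
        {"userid": 1, "school": "A"},
        {"userid": 2, "school": "B"},
        {"userid": 3, "school": "A"},
    ]
    print(count_by(rows, "school"))  # {'A': 2, 'B': 1}
```

A query with several such stages, say a join followed by an aggregation, would generally need more than one job, which is why the plan is a DAG rather than a single job.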

The metastore contains the following objects:

Database: a namespace for tables
Table: list of columns and their types, owner, storage and SerDe information
Partition: Each partition can have its own columns and SerDe and storage information.
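
One way to picture these objects is as nested records. The sketch below models them with Python dataclasses; every class and field name here is hypothetical, and the rule that a partition falls back to the table's storage information when it defines none of its own is an assumption of this example, not something stated in the summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class StorageInfo:
    location: str     # where the data lives
    serde: str        # name of the serializer/deserializer
    file_format: str  # hypothetical label for the storage format


@dataclass
class Partition:
    values: Dict[str, str]                           # partition column -> value
    storage: Optional[StorageInfo] = None            # None: fall back to the table's
    columns: Optional[List[Tuple[str, str]]] = None  # None: fall back to the table's


@dataclass
class Table:
    name: str
    owner: str
    columns: List[Tuple[str, str]]  # (column name, type)
    storage: StorageInfo
    partitions: List[Partition] = field(default_factory=list)

    def effective_storage(self, partition):
        """Assumed rule: a partition's own storage info wins over the table's."""
        return partition.storage or self.storage


@dataclass
class Database:
    name: str  # namespace for tables
    tables: Dict[str, Table] = field(default_factory=dict)


if __name__ == "__main__":
    table_storage = StorageInfo("/wh/T", "ExampleSerDe", "text")
    t = Table("T", "alice", [("userid", "int"), ("status", "string")], table_storage)
    t.partitions.append(Partition({"ds": "20090101"}))
    db = Database("default", {"T": t})
    print(db.tables["T"].effective_storage(t.partitions[0]).location)  # /wh/T
```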

Future Work

Make HiveQL subsume SQL syntax
Build a cost-based optimizer and adaptive optimization techniques to come up with more efficient plans
Explore columnar storage and more intelligent data placement to improve scan performance
Enhance the JDBC and ODBC drivers for Hive for integration with commercial BI tools which only work with traditional relational warehouses
Explore multi-query optimization techniques and ways to perform generic n-way joins in a single map-reduce job

Thursday, May 30, 2013
