Wednesday, May 29, 2013

HIVE EXTERNAL TABLE



Hive is a data warehousing infrastructure based on the Hadoop which provides massive scale out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

The difference between the normal tables and external tables can be seen in LOAD and DROP operations.

Normal Tables: Hive manages the normal tables created and moves the data into its warehouse directory.

As an example, consider the table creation and loading of data into the table.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;

LOAD DATA INPATH ‘/user/hduser/sampledata.txt’ INTO TABLE page_view;

By default, when data file is loaded, /user/${USER}/warehouse/page_view is created automatically.

This LOAD will move the file data.txt from HDFS into Hive’s warehouse directory for the table. If the table is dropped, then the table metadata and the data will be deleted.

External Tables: An external table refers to the data that is outside of the warehouse directory.
As an example, consider the table creation and loading of data into the external table.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;

LOCATION ‘/user/hduser/page_view’;
LOAD DATA INPATH ‘/user/husr/sampledata.txt’ INTO page_view;

In case of external tables, Hive does not move the data into its warehouse directory. If the external table is dropped, then the table metadata is deleted but not the data.

Hive does not check whether the external table location exists or not at the time the external table is created.

No comments:

Popular Posts