Friday, July 06, 2012

HIVE - HADOOP: INSTALLATION, EXECUTION OF SAMPLE PROGRAM


The size of data sets being collected and analyzed in industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source MapReduce implementation used at companies like Yahoo and Facebook to store and process extremely large data sets on commodity hardware. However, the MapReduce programming model is very low level and requires developers to write custom programs that are hard to maintain and reuse. Hive is an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into MapReduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug custom MapReduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying I/O libraries can be extended to query data in custom formats. Hive also includes a system catalog, the Metastore, which contains schemas and statistics that are useful in data exploration, query optimization, and query compilation.

Installing and Running Hive


Download the latest version of Hive (release 0.9.0 is used here) and unpack the tarball:

$ tar xzf hive-0.9.0.tar.gz

Set the Hive environment variables:

$ export HIVE_INSTALL=/home/user1/hive-0.9.0
$ export PATH=$PATH:$HIVE_INSTALL/bin

You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.
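For example (the JDK path below is only an illustration; point it at your own Java installation):

$ export JAVA_HOME=/usr/lib/jvm/java-6-sun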

Hive has two execution types:

1) Local mode: Hive runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets.

$ hive

hive>


This is the Hive interactive shell.


2) MapReduce mode: In MapReduce mode, Hive translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudodistributed or fully distributed cluster.


Set the HADOOP_HOME environment variable so Hive can find which Hadoop client to run.


Configure hive-site.xml to specify the namenode and jobtracker of the Hadoop cluster.
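A minimal hive-site.xml might look like the following (the host names and ports are placeholders; substitute the values for your own cluster):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>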

You can also set these Hadoop properties on a per-session basis:

$ hive -hiveconf fs.default.name=hdfs://localhost -hiveconf mapred.job.tracker=localhost:8021

Alternatively, you can use the SET command from within the shell:

hive> SET fs.default.name=hdfs://localhost;
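Running SET with no arguments lists every property that has been explicitly set in the session, and SET -v also includes the Hadoop defaults:

hive> SET;
hive> SET -v;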

Hive Services

There are four ways to access Hive:


cli: The command-line interface to Hive (the shell). This is the default service.


hiveserver: Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written in different languages. Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to communicate with Hive. Set the HIVE_PORT environment variable to specify the port the server will listen on (defaults to 10000; see the example after this list).

hwi: The Hive Web Interface.

jar: The Hive equivalent to hadoop jar, a convenient way to run Java applications that include both Hadoop and Hive classes on the classpath.
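For example, to start the Thrift server on a non-default port (10001 here is an arbitrary illustrative choice):

$ HIVE_PORT=10001 hive --service hiveserver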

CLI: There are two modes of interaction:

1) Interactive Mode
2) Non-Interactive Mode

1) Interactive Mode (Hive Shell)

The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL. HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL.


$ hive


hive> show tables;



OK
dummy
maxcgpa
records
Time taken: 3.756 seconds

Like SQL, HiveQL is generally case-insensitive (except for string comparisons), so SHOW TABLES; works equally well here.

2) Non-Interactive Mode:

$ hive -f script.q


The -f option runs the commands in the specified file, script.q in this case.


For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required:

$ hive -e 'SELECT * FROM dummy'

Hive history file=/tmp/tom/hive_job_log_tom_201005042112_1906486281.txt
OK
X

Time taken: 4.734 seconds

You can suppress these messages using the -S option at launch time, which has the effect of showing only the output result for queries:


$ hive -S -e 'SELECT * FROM dummy'


X

Example of Hive in Non-Interactive Mode 

max_cgpa.q

-- max_cgpa.q: Finds the maximum cgpa of a specialization



CREATE TABLE maxcgpa (name STRING, spl STRING, cgpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'hivesample.txt' OVERWRITE INTO TABLE maxcgpa;

SELECT spl, MAX(cgpa) FROM maxcgpa WHERE cgpa > 0 AND cgpa <= 10 GROUP BY spl;

The above Hive script finds the maximum CGPA for each specialization.

hivesample.txt (the input to Hive):

raghu    ece    9
kumar    cse    8.5
biju     ece    8
mukul    cse    8.6
ashish   ece    7.0
subha    cse    8.3
ramu     ece    -8.3
rahul    cse    11.4
budania  ece    5.4

The first column is the name, the second is the specialization, and the third is the CGPA; the columns are separated by tabs. Note that ramu's CGPA (-8.3) and rahul's (11.4) are invalid values, which the WHERE clause in the query filters out.

$ hive -f max_cgpa.q

Output :


cse     8.6
ece     9.0

Analysis : 

Statement : 1

CREATE TABLE maxcgpa (name STRING, spl STRING, cgpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';


The first statement declares a maxcgpa table with three columns: name, spl, and cgpa. The type of each column must be specified, too. Here, name is a string, spl is a string, and cgpa is a float.

The ROW FORMAT clause, however, is particular to HiveQL. This declaration is saying that each row in the data file is tab-delimited text. Hive expects there to be three fields in each row, corresponding to the table columns, with fields separated by tabs and rows by newlines.
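You can check the declared schema from the shell (the exact output formatting may vary slightly across Hive versions):

hive> DESCRIBE maxcgpa;
name    string
spl     string
cgpa    float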


Statement : 2

LOAD DATA LOCAL INPATH 'hivesample.txt' OVERWRITE INTO TABLE maxcgpa;

Hive copies the specified local file into its warehouse directory.

The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any existing files in the directory for the table. If it is omitted, the new files are simply added to the table's directory.
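You can verify this by listing the table's directory (assuming the default warehouse location, /user/hive/warehouse):

$ hadoop fs -ls /user/hive/warehouse/maxcgpa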


Statement : 3

SELECT spl, MAX(cgpa) FROM maxcgpa WHERE cgpa > 0 AND cgpa <= 10 GROUP BY spl;

It is a SELECT statement with a GROUP BY clause that groups rows by spl and uses the MAX() aggregate function to find the maximum cgpa for each spl group. Hive transforms this query into a MapReduce job, which it executes on our behalf, then prints the results to the console.
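To see the plan Hive generates for this query, including the MapReduce stages it will run, prefix the query with the EXPLAIN keyword:

hive> EXPLAIN SELECT spl, MAX(cgpa) FROM maxcgpa WHERE cgpa > 0 AND cgpa <= 10 GROUP BY spl;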

3) Hive Web Interface (HWI):

To use HWI you need to install Apache Ant and set the Ant environment variables:

$ export ANT_HOME=/scratch/rjuluri/HADOOP/apache-ant-1.8.4
$ export ANT_LIB=/scratch/rjuluri/HADOOP/apache-ant-1.8.4/lib

$ hive --service hwi

Navigate to http://localhost:9999/hwi to access HWI.
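If port 9999 is already taken, the web interface's host and port can be changed in hive-site.xml via the hive.hwi.listen.host and hive.hwi.listen.port properties (the values shown below are the defaults):

<property>
  <name>hive.hwi.listen.host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>hive.hwi.listen.port</name>
  <value>9999</value>
</property>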

I will cover Hive clients, the Metastore, and HiveQL in the next blog...
