HIVE - HADOOP : INSTALLATION, EXECUTION OF SAMPLE PROGRAM
The size of data sets being collected and analyzed in industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source MapReduce implementation, used at companies like Yahoo! and Facebook to store and process extremely large data sets on commodity hardware. However, the MapReduce programming model is very low level and requires developers to write custom programs that are hard to maintain and reuse.

Hive is an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into MapReduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug custom map-reduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog, the Metastore, which contains schemas and statistics that are useful in data exploration, query optimization and query compilation.
Installing and Running hive
Download the latest version of Hive from the following link (Hive Installation) and unpack the tarball:
$ tar xzf hive-0.9.0.tar.gz
Set the Hive environment variables. Note that HIVE_INSTALL must point to the unpacked directory, not the tarball:
$ export HIVE_INSTALL=/home/user1/hive-0.9.0
$ export PATH=$PATH:$HIVE_INSTALL/bin
You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.
Hive has two execution modes:
1) local mode : Hive runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets.
$ hive
hive>
This is the Hive interactive shell.
2) MapReduce mode : In MapReduce mode, Hive translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster.
Set the HADOOP_HOME environment variable so Hive can find which Hadoop client to run.
Configure hive-site.xml to specify the name node and job tracker of the Hadoop cluster.
You can also set Hadoop properties on a per-session basis:
$ hive -hiveconf fs.default.name=localhost -hiveconf mapred.job.tracker=localhost:8021
Alternatively, use the SET command from within the shell:
hive> SET fs.default.name=localhost;
Hive Services
There are four ways to access Hive:
cli : The command-line interface to Hive (the shell). This is the default service.
hiveserver : Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written in different languages. Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to communicate with Hive. Set the HIVE_PORT environment variable to specify the port the server will listen on (defaults to 10000).
hwi : The Hive Web Interface.
jar : The Hive equivalent to hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes on the classpath.
CLI : There are two modes of interaction:
1) Interactive Mode
2) Non-Interactive Mode
1) Interactive Mode ( Hive Shell ) :
The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL. HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL.
$ hive
hive> show tables;
OK
dummy
maxcgpa
records
Time taken: 3.756 seconds
Like SQL, HiveQL is generally case-insensitive (except for string comparisons), so SHOW TABLES; works equally well here.
2) Non-Interactive Mode :
$ hive -f script.q
The -f option runs the commands in the specified file, script.q in this case.
For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required:
$ hive -e 'SELECT * FROM dummy'
Hive history file=/tmp/tom/hive_job_log_tom_201005042112_1906486281.txt
OK
X
Time taken: 4.734 seconds
You can suppress these messages using the -S option at launch time, which has the effect of showing only the output result for queries:
$ hive -S -e 'SELECT * FROM dummy'
X
Example of Hive in Non-Interactive Mode
max_cgpa.q
-- max_cgpa.q: Finds the maximum cgpa of a specialization
CREATE TABLE maxcgpa (name STRING, spl STRING, cgpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH 'hivesample.txt' OVERWRITE INTO TABLE maxcgpa;
SELECT spl, MAX(cgpa) FROM maxcgpa WHERE cgpa >0 AND cgpa <= 10 GROUP BY spl;
The above Hive script finds the maximum cgpa for each specialization.
hivesample.txt (input to Hive)
raghu ece 9
kumar cse 8.5
biju ece 8
mukul cse 8.6
ashish ece 7.0
subha cse 8.3
ramu ece -8.3
rahul cse 11.4
budania ece 5.4
The first column is the name, the second the specialization, and the third the cgpa; by default each column is separated by a tab character.
$ hive -f max_cgpa.q
Output :
cse 8.6
ece 9.0
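As a quick sanity check outside Hive, the same filter and aggregation can be reproduced with awk on a local copy of the input file (the data below is the sample data from this post):

```shell
# Recreate hivesample.txt locally; fields are tab-separated
printf 'raghu\tece\t9\nkumar\tcse\t8.5\nbiju\tece\t8\nmukul\tcse\t8.6\nashish\tece\t7.0\nsubha\tcse\t8.3\nramu\tece\t-8.3\nrahul\tcse\t11.4\nbudania\tece\t5.4\n' > hivesample.txt

# Keep rows with 0 < cgpa <= 10 (the WHERE clause), then take the
# maximum cgpa per specialization (the GROUP BY / MAX aggregation)
result=$(awk -F'\t' '$3 > 0 && $3 <= 10 { if ($3 > max[$2]) max[$2] = $3 }
                     END { for (s in max) printf "%s\t%.1f\n", s, max[s] }' hivesample.txt | sort)
echo "$result"
```

This prints the same cse/ece maxima as the Hive output above, confirming that the rows with cgpa -8.3 and 11.4 are excluded by the filter.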
Analysis :
Statement : 1
CREATE TABLE maxcgpa (name STRING, spl STRING, cgpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
The first statement declares a maxcgpa table with three columns: name, spl, and cgpa. The type of each column must be specified, too: here name is a string, spl is a string, and cgpa is a float.
The ROW FORMAT clause, however, is particular to HiveQL.
This declaration is saying that each row in the data file is tab-delimited text. Hive
expects there to be three fields in each row, corresponding to the table columns, with
fields separated by tabs and rows by newlines.
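To see what this tab-delimited layout means in practice, here is a minimal shell illustration using one row of hivesample.txt (the variable names are just for this example; this is not something Hive requires you to do):

```shell
# One tab-delimited row, laid out as the ROW FORMAT declaration describes
row="$(printf 'kumar\tcse\t8.5')"

# cut splits on tabs by default: one field per table column
name=$(printf '%s\n' "$row" | cut -f1)
spl=$(printf '%s\n' "$row" | cut -f2)
cgpa=$(printf '%s\n' "$row" | cut -f3)
echo "name=$name spl=$spl cgpa=$cgpa"
```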
Statement : 2
LOAD DATA LOCAL INPATH 'hivesample.txt' OVERWRITE INTO TABLE maxcgpa;
With the LOCAL keyword, Hive copies the specified local file into its warehouse directory.
The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any existing files in the directory for the table. If it is omitted, the new files are simply added to the table's directory.
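The effect of OVERWRITE on the table's directory can be pictured with ordinary filesystem operations. This is only an analogy, not what Hive literally runs, and warehouse/maxcgpa below is a stand-in for the real warehouse path:

```shell
# Pretend an earlier load left a file in the table's directory
mkdir -p warehouse/maxcgpa
echo old > warehouse/maxcgpa/old_data.txt

# LOAD DATA ... OVERWRITE: clear the directory first, then add the new file
rm -f warehouse/maxcgpa/*
echo new > warehouse/maxcgpa/hivesample.txt

# Without OVERWRITE, old_data.txt would have been left in place
# and hivesample.txt simply added alongside it
ls warehouse/maxcgpa
```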
Statement : 3
SELECT spl, MAX(cgpa) FROM maxcgpa WHERE cgpa >0 AND cgpa <= 10 GROUP BY spl;
It is a SELECT statement with a GROUP BY clause that groups rows by spl and uses the MAX() aggregate function to find the maximum cgpa for each spl group. Hive transforms this query into a MapReduce job, executes it on our behalf, and prints the results to the console.
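Conceptually, the generated job has a map phase that applies the WHERE filter and emits (spl, cgpa) pairs, a shuffle that groups the pairs by key, and a reduce phase that keeps the maximum value per key. The pipeline below sketches those three stages in plain shell, in the spirit of Hadoop Streaming; it illustrates the idea only and is not the actual job Hive produces:

```shell
result=$(printf 'raghu\tece\t9\nkumar\tcse\t8.5\nbiju\tece\t8\nmukul\tcse\t8.6\nashish\tece\t7.0\nsubha\tcse\t8.3\nramu\tece\t-8.3\nrahul\tcse\t11.4\nbudania\tece\t5.4\n' |
  awk -F'\t' '$3 > 0 && $3 <= 10 { print $2 "\t" $3 }' |  # map: apply WHERE, emit (spl, cgpa)
  sort |                                                  # shuffle: group records by key
  awk -F'\t' '$2 > max[$1] { max[$1] = $2 }               # reduce: running max per key
              END { for (k in max) printf "%s\t%.1f\n", k, max[k] }' |
  sort)
echo "$result"
```

The sort in the middle plays the role of the shuffle: once records with the same spl are adjacent, a single pass suffices to compute each group's maximum.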
Hive Web Interface (HWI) :
To use HWI, you need to install Apache Ant and configure the environment variables for Ant (see ANT Installation).
I will cover Hive Clients, the Metastore and HiveQL in the next blog...