Sunday, November 25, 2012

REST WEBSERVICE - JERSEY CLIENT API

Jersey is an open-source implementation of JAX-RS used to create RESTful web services in Java.

import com.sun.jersey.api.client.Client;

Client client = Client.create();

Creating a Client instance is an expensive operation, so re-use it after initializing it for the first time.
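A minimal sketch of that reuse pattern (the holder class and field name are just illustrations, not from the original post):

    import com.sun.jersey.api.client.Client;

    public class RestClientHolder {
        // Create the Jersey Client once and reuse it for every request instead of per call.
        public static final Client CLIENT = Client.create();
    }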

Create a WebResource object, which encapsulates a web resource for the client.

   import com.sun.jersey.api.client.WebResource;

   WebResource webResource = client.resource("http://example.com/base");


Use the WebResource object to build requests to send to the web resource and to process the responses it returns. For example, you can use the WebResource object for HTTP GET, PUT, POST, and DELETE requests.

String s = webResource.get(String.class);

Download the Jersey bundle (jersey-bundle.jar) and add the jars to the classpath.

import javax.ws.rs.core.MediaType;

import com.sun.jersey.api.client.Client;
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.WebResource;

    public String sendRestRequest(String restUrl) {

        String response = null;
        Client client = Client.create();
        WebResource webResource = client.resource(restUrl);

        // Accept either JSON or XML from the REST resource
        ClientResponse clientResponse =
            webResource.accept(MediaType.APPLICATION_JSON_TYPE,
                               MediaType.APPLICATION_XML_TYPE).get(ClientResponse.class);
        try {
            // If the status code returned is not 200, report the error and return null
            if (clientResponse.getStatus() != 200) {
                System.out.println("\nRest URL : " + restUrl +
                                   " failed with HTTP error code : " +
                                   clientResponse.getStatus());
                return null;
            }
            // Read the response body (JSON or XML) as a String
            response = clientResponse.getEntity(String.class);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return response;
    }

How to pass authentication parameters along with the client request

import com.sun.jersey.api.client.filter.HTTPBasicAuthFilter;

Use HTTPBasicAuthFilter to pass the username and password.

      HTTPBasicAuthFilter authFilter =
            new HTTPBasicAuthFilter(username, pwd);
      client.addFilter(authFilter);
      WebResource webResource = client.resource(restUrl);

How to pass custom headers along with the client request

 ClientResponse clientResponse = webResource
         .accept(MediaType.APPLICATION_JSON_TYPE, MediaType.APPLICATION_XML_TYPE)
         .header("Custom-header", headerValue)
         .get(ClientResponse.class);
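The post only shows GET requests; the same WebResource builder can also send a request body. Below is a minimal sketch (not from the original post) that POSTs a JSON string; the endpoint URL, header value, and payload are assumptions used purely for illustration.

    import javax.ws.rs.core.MediaType;

    import com.sun.jersey.api.client.Client;
    import com.sun.jersey.api.client.ClientResponse;
    import com.sun.jersey.api.client.WebResource;

    public class JerseyPostExample {
        public static void main(String[] args) {
            Client client = Client.create();
            // Hypothetical endpoint, used only to illustrate the builder calls
            WebResource webResource = client.resource("http://example.com/base/employees");

            String json = "{\"firstName\":\"Raghu\",\"lastName\":\"Juluri\"}";

            ClientResponse clientResponse = webResource
                    .type(MediaType.APPLICATION_JSON_TYPE)       // Content-Type of the request body
                    .accept(MediaType.APPLICATION_JSON_TYPE)     // expected response type
                    .header("Custom-header", "some-value")       // optional custom header
                    .post(ClientResponse.class, json);

            System.out.println("Status : " + clientResponse.getStatus());
            System.out.println("Body : " + clientResponse.getEntity(String.class));
        }
    }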

Saturday, November 24, 2012

JSON PARSER - USING JERSEY

Nowadays most companies build web services using REST instead of SOAP.

Representational State Transfer (REST) basically means that each unique URL is a representation of some object. You can get the contents of that object using an HTTP GET, and use POST, PUT, or DELETE to modify or delete it.

REST relies on existing principles and protocols of the Web, which are enough to create robust web services. This means that developers who understand HTTP and XML can start building web services right away, without needing any toolkits beyond what they normally use for Internet application development.

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.

JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

JSON is built on two structures:
  • A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
  • An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
In JSON, they take on these forms:

An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma).

An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).

A value can be a string in double quotes, or a number, or true or false or null, or an object or an array. These structures can be nested.

A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. A character is represented as a single character string. A string is very much like a C or Java string.

A number is very much like a C or Java number, except that the octal and hexadecimal formats are not used.


{
"employees": [
{ "firstName":"Raghu" , "lastName":"Juluri" }, 
{ "firstName":"Dinesh" , "lastName":"Gunda" }, 
{ "firstName":"Kumar" , "lastName":"Prabhu" }
]
}

A JSON object can be parsed using the JSON library distributed with Jersey.

Download the Jersey bundle (jersey-bundle.jar) and the JSON jar, and add them to the classpath.


// JSONObject, JSONArray and JSONException come from the JSON jar added to the
// classpath (an org.json-style API); adjust the imports to match the jar you actually use.
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

    public void parseJson(String json) {

        try {
            JSONObject jsonObj = new JSONObject(json);

            // Iterate over all top-level keys of the JSON object
            String[] keys = JSONObject.getNames(jsonObj);
            for (String key : keys) {

                if (jsonObj.get(key) instanceof JSONObject) {
                    JSONObject innerData = jsonObj.getJSONObject(key);
                    // process the nested JSON object
                } else if (jsonObj.get(key) instanceof JSONArray) {
                    JSONArray innerArray = jsonObj.getJSONArray(key);
                    for (int i = 0; i < innerArray.length(); i++) {
                        JSONObject arrayElement = innerArray.getJSONObject(i);
                        // process each JSON object present in the array
                    }
                } else {
                    // plain value (string, number, boolean or null)
                    System.out.println("Key : " + key + " Value : " + jsonObj.get(key));
                }
            }
        } catch (JSONException e) {
            e.printStackTrace();
        }
    }
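A minimal, hypothetical driver for the method above, feeding it the employees JSON shown earlier in this post (the enclosing class name JsonParserExample is assumed):

    public static void main(String[] args) {
        // The employees JSON from earlier in this post, as an inline string
        String json = "{ \"employees\": ["
                    + "{ \"firstName\":\"Raghu\", \"lastName\":\"Juluri\" },"
                    + "{ \"firstName\":\"Dinesh\", \"lastName\":\"Gunda\" },"
                    + "{ \"firstName\":\"Kumar\", \"lastName\":\"Prabhu\" } ] }";

        new JsonParserExample().parseJson(json);
    }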




Tuesday, October 16, 2012

PLUGGABLE DATABASE : ORACLE 12C


Disclaimer: The content of this blog refers to Andrew's speech at Oracle OpenWorld (OOW 2012) (Oracle Database : OOW 2012) and other material found on the net. It is intended for informational and learning purposes only. The development and release of these features, including the release dates, remain at the sole discretion of Oracle.


Pluggable Database is a new feature proposed in Oracle Database 12c, which is expected to be released in 2013.

The main drawback of the existing database architecture is that each database server requires its own memory, processes, and database files. The core Oracle system dictionary (tables, objects, etc.), i.e. the Oracle metadata, is tightly integrated with the application dictionary, i.e. the tenant's metadata, so separate memory, processes, and files are required for each database (ERP, CRM, DW).


In Oracle Database 12c there is an architectural separation (a split data dictionary) between the core Oracle system and the user applications; they are loosely coupled. The Pluggable Database architecture consists of:
1) Container Database (CDB)
2) Pluggable Database (PDB)

1) Container Database (CDB): Holds the Oracle system dictionary, functionality, and metadata required to run the database. The memory and processes required for multiple PDBs are handled by a single CDB.

2) Pluggable Database (PDB): Holds only the user application metadata (the customer's metadata or dictionary). It has read-only permission on the Oracle dictionary. In the cloud, each pluggable database is analogous to a single customer, and there is a clean separation between pluggable databases.

As shown in the above image, a single container database instance supports the ERP, CRM, and DW applications. All PDBs share memory and processes based on Oracle's resource management capability, and a single CDB will support 250 pluggable databases or more. This new feature is also compatible with older Oracle databases, and pluggable databases can be plugged into or unplugged from the container database at any point in time.
Compared to separate databases, pluggable databases are highly efficient: roughly 6x less hardware resources and 5x more scalable.


Patching and upgrades: Apply a patch once and all pluggable databases are updated. You can clone a PDB within the same CDB or into another CDB. PDBs can also be provisioned very quickly, as each CDB comes with a "PDB Seed" from which a new PDB can be fast-provisioned.

Redeployment becomes much easier, as we can unplug the database from one platform or CDB version and then plug it into a CDB on another platform or version. This makes upgrade, patching, and redeployment efforts much faster. When you upgrade the CDB, all the PDBs get upgraded; if you would like to control when the PDBs are upgraded, you can create a CDB on the new release, unplug from the old release, and plug into the new one. Each PDB's contents are separate from every other PDB, so "separation of duties" works very well too.

Backup: Backing up data present in multiple PDBs is very simple; treat the entire database as one and recover all PDBs at once, or restore an individual PDB.
Pluggable databases are a perfect fit for SaaS, as each PDB can be allocated to a customer.

Thursday, July 26, 2012

PROVIDENT FUND (PF) WITHDRAWAL : IBM INDIA PVT LTD


In this blog post I want to describe the guidelines for withdrawing Provident Fund (PF) at IBM India Pvt Ltd, and explain the step-by-step procedure to be followed, as many of my friends have asked me about the withdrawal process.



Note: Provident Fund withdrawal should be initiated only after the completion of 60 days from the date of leaving; otherwise it will be rejected.


Please fill in the attached forms (Form-10C.pdf, Form-19.pdf, along with Guidelines-form-10C.pdf and Guidelines-Form-19.pdf) and send the duly filled and signed hardcopies of the forms to the address below for withdrawal of PF.



Please refer to the attached sample filled form for reference. The details shown in the sample form are mandatory; if you miss any of them, your form will be rejected and sent back to you.

Please ensure the points below are taken care of before sending the forms to IBM:

1. Application Forms – Please ensure that the Forms are printed back to back.

2. Filling the Application - Please use only blue ink to fill the Forms.

3. Name – Please ensure you fill in your name exactly as it appears in IBM records. Also ensure the name matches your bank account. Any mismatch in this regard will lead to rejection of the claim.

4. Overwriting – Please avoid overwriting in the form, as overwriting will lead to rejection of the claim.


5. Signatures - Please ensure that your application is complete with all mandatory employee signatures. (3 Signatures on Page No.3 of the Form 19 and 2 signatures on Page No.3 of the Form 10C).


6. Cancelled cheque leaf - Please ensure that a cancelled cheque leaf (with the name and IFSC code printed on it) is attached along with the PF withdrawal form. Also ensure that the bank details filled in the form match the details appearing on the attached cancelled cheque leaf. In case the name is not printed on the cheque, attach a bank statement with the last few transactions along with the cheque leaf. Please make sure the IFSC code appears on both the cancelled cheque leaf and the bank statement. Also make sure that the bank account number you are providing is for an individual account.


7. ID proof - In case you are applying for PF withdrawal after 2 years and 6 months from the date of leaving, ensure that you have attached a photo ID proof such as PAN card, passport, or voter ID along with the forms.



8. Please enclose the following documents along with the forms:
    a) Photocopy of the full and final settlement letter.
    b) A crossed cheque leaf marked as cancelled.
    c) Cancelled cheque.

9. Scanned or photocopied applications are not acceptable for processing.

10. Please mention your mobile number on top of the application (written in pencil).

The above documents need to be couriered to the following address.

IBM India Pvt Ltd.
Retirals Team
Global Process Services - HR Delivery
Manyata Embassy Business Park,
D1, 4th Floor, Outer Ring Road,
Nagawara, Bangalore - 560 045
ibmretirals@aonhewitt.com

Once the above forms are couriered to the IBM address, they will be filed with the PF office.

After about a month you will receive a PF withdrawal tracking number on the mobile number you mentioned earlier.

Then you need to wait about 3 months for the PF amount to be credited to the bank account you mentioned earlier. If you file during the March-July months it will take longer, as Provident Fund office employees are busy with the year-end financial calculations; it might take up to 6 months.

PF withdrawal takes less time compared to PF transfer; sometimes a PF transfer can take up to 2 years. So from my perspective, PF withdrawal is a better option than PF transfer.

I have uploaded all 4 documents to Google Docs, so please download them from the following URLs:








HIVE - HADOOP : MYSQL AS METASTORE - Part III



My previous article (Hive with Derby) described the drawback of using an embedded metastore: only one Hive session can open a connection to it, so at any point in time only one user is active while the others remain passive. To overcome this, use a standalone database as the metastore.


If you use a standalone database, Hive supports multiple sessions, so multiple users can connect at the same instant; this is referred to as a local metastore. Any JDBC-compliant database can be used as the metastore. In the previous article I demonstrated how to install and start the Derby database and how to configure Hive to connect to Derby, with an example.




In this blog I want to demonstrate how to install and run MySQL Server, and the configuration to be made in Hive to connect to the MySQL server and store metadata there. MySQL is the world's most used open-source Relational Database Management System (RDBMS).


MYSQL Installation :


1) Download MySql software from  MYSQL Software  
2) Modify the password for MySql 


If you face an exception during the password change, such as ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO), reset the root password as follows:



/etc/init.d/mysqld stop


mysqld_safe --skip-grant-tables &
mysql -u root
mysql> use mysql;
mysql> update user set password=PASSWORD("newrootpassword") where User='root';
mysql> flush privileges;
mysql> quit
/etc/init.d/mysqld stop
/etc/init.d/mysqld start





For Example:



[root@slc01mcd ~]# mysqld_safe --skip-grant-tables &
[1] 12943
[root@slc01mcd ~]# Starting mysqld daemon with databases from /var/lib/mysql
mysql -u root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.0.77 Source distribution


Type 'help;' or '\h' for help. Type '\c' to clear the buffer.


mysql> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A


Database changed
mysql> update user set password=PASSWORD("Welcome1") where User='root';
Query OK, 2 rows affected (0.00 sec)
Rows matched: 3  Changed: 2  Warnings: 0


mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)


mysql>  quit
Bye

Once the password is set, verify the tables present in the MySQL DB:

[root@slc01mcd ~]# /etc/init.d/mysqld start
Starting MySQL:                                            [  OK  ]
[root@slc01mcd ~]# mysql --user=root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.0.77 Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show tables;
ERROR 1046 (3D000): No database selected

/home/HADOOP/hive-0.9.0/conf/hive-site.xml

HIVE Configuration (hive-site.xml)

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>controls whether to connect to a remote metastore server or open a new metastore server in the Hive client JVM</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hostname:3306/metastore_db?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>Welcome1</value>
  <description>password to use against metastore database</description>
</property>

Jpox.properties

javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
javax.jdo.option.ConnectionURL=jdbc:mysql://hostname:3306/metastore_db?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionUserName=root
javax.jdo.option.ConnectionPassword=Welcome1

Copy the MySQL JDBC driver jar to the Hive lib folder:

cp mysql-connector-java-5.1.11/*.jar  /home/HADOOP/hive-0.9.0/lib

Sample Program :

Hive> show tables;

OK
Time taken: 0.048 seconds

Execute max_cgpa.q script (Previous article Hive - Part-I )


[root@slc01mcd hive-0.9.0]# hive -f max_cgpa.q 

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/scratch/rjuluri/HADOOP/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201207180701_1297257265.txt
OK
Time taken: 3.819 seconds
Copying data from file:/scratch/rjuluri/HADOOP/hive-0.9.0/hivesample.txt
Copying file: file:/scratch/rjuluri/HADOOP/hive-0.9.0/hivesample.txt
Loading data to table default.maxcgpa1
Deleted file:/user/hive/warehouse/maxcgpa1
OK
Time taken: 0.383 seconds
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/root/root_20120718070101_e67d60e9-5f66-482a-8151-eebc5cfefaf4.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2012-07-18 07:01:23,161 null map = 0%,  reduce = 0%
2012-07-18 07:01:26,174 null map = 100%,  reduce = 0%
2012-07-18 07:01:29,186 null map = 100%,  reduce = 100%
Ended Job = job_local_0001

Execution completed successfully

Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
cse     8.6
ece     9.0
Time taken: 10.476 seconds

[root@slc01mcd hive-0.9.0]# hive

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/scratch/rjuluri/HADOOP/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201207180701_545279755.txt
hive> show tables;
OK
maxcgpa1
Time taken: 2.769 seconds

Now verify whether the newly created table maxcgpa1 is in the MySQL database or not:

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema | 
| mysql              | 
| test               | 
+--------------------+
3 rows in set (0.00 sec)

This shows that the MySQL database is used as the metastore by Hive from now on, and multiple sessions (users) can connect to Hive and perform Hadoop operations.





Friday, July 20, 2012

HIVE - HADOOP : INSTALLATION , EXECUTION OF SAMPLE PROGRAM - Part II


Hive Clients

To start the Hive server, use the following command:

$hive --service hiveserver

The following are the different Hive clients that can connect to this server:

1) CommandLine
2) Thrift Client
3) JDBC Driver
4) ODBC Driver


Courtesy: Hadoop: The Definitive Guide

1) Command line: operates in embedded mode only; it needs to have access to the Hive libraries.

This section is briefly described in my previous article (Hive - Part I).

2) Thrift Client : 

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages. Thrift Software

The Hive server is implemented in Java, so to query the Hive server from Ruby, PHP, C++, or another language, you need to build a Thrift client specific to that language.

Steps to build Thrift Client for Ruby

a) Install Thrift Thrift
b)  Download Thrift Source Thrift Source
c) Navigate to Hive Source directory 

thrift --gen rb -I service/include metastore/if/hive_metastore.thrift
thrift --gen rb -I service/include -I . service/if/hive_service.thrift
thrift --gen rb service/include/thrift/fb303/if/fb303.thrift
thrift --gen rb serde/if/serde.thrift
thrift --gen rb ql/if/queryplan.thrift
thrift --gen rb service/include/thrift/if/reflection_limited.thrift

Or you can download Thrift client for ruby from Thrift Ruby Client

Thrift Java client: operates in embedded mode and against a standalone server.
Thrift C++ client: operates only in embedded mode.

3) JDBC Driver :
For embedded mode, the URI is just "jdbc:hive://".
For a standalone server, the URI is "jdbc:hive://host:port/dbname", where host and port are determined by where the Hive server is running.
Example: "jdbc:hive://localhost:10000/default". Currently, the only dbname supported is "default".
JDBC Client Sample Code JDBC Client Sample Code
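A minimal, hypothetical JDBC client sketch for a standalone Hive 0.x server (the driver class name, host, port, and query are assumptions; adjust them to your setup):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcClient {
        public static void main(String[] args) throws Exception {
            // JDBC driver class for the Hive 0.x (HiveServer1) driver
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

            // Standalone server URI: jdbc:hive://host:port/dbname ("default" is the only supported dbname)
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // Hypothetical query against the maxcgpa table from Hive - Part I
            ResultSet rs = stmt.executeQuery("SELECT spl, MAX(cgpa) FROM maxcgpa GROUP BY spl");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
            con.close();
        }
    }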

4) ODBC Driver : 

The Hive ODBC Driver is a software library that implements the Open Database Connectivity (ODBC) API standard for the Hive database management system, enabling ODBC compliant applications to interact seamlessly (ideally) with Hive through a standard interface. This driver will NOT be built as a part of the typical Hive build process and will need to be compiled and built separately ODBC Client

The Metastore :


The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration




The disadvantage of the embedded metastore is that only one Hive session can be opened against the Hive server. For multiple Hive session support, move the metastore to a relational database that supports JDO (Derby server, MySQL, etc.).

MetaStore : Derby Server

wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz

tar -xzf db-derby-10.4.2.0-bin.tar.gz


mv db-derby-10.4.2.0-bin derby


mkdir derby/data


Log in as root and add the following variables in /etc/profile.d/derby.sh:



export DERBY_INSTALL=/home/HADOOP/derby
export DERBY_HOME=/home/HADOOP/derby



cd /home/HADOOP/derby/bin/


./startNetworkServer -h 0.0.0.0 &

/scratch/rjuluri/HADOOP/hive/conf

vi hive-site.xml

<property>
  <name>hive.test.mode.nosamplelist</name>
  <value></value>
  <description>if hive is running in test mode, don't sample the above comma separated list of tables</description>
</property>

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>controls whether to connect to a remote metastore server or open a new metastore server in the Hive client JVM</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://hostname:port/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

Add the following file:

vi /home/HADOOP/hive/conf/jpox.properties

javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL=jdbc:derby://hostname:port/metastore_db;create=true
javax.jdo.option.ConnectionUserName=APP
javax.jdo.option.ConnectionPassword=mine

cp /home/HADOOP/derby/lib/derbyclient.jar  /home/HADOOP/hive/lib
cp  /home/HADOOP/derby/lib/derbytools.jar  /home/HADOOP/hive/lib

Hive> show tables;

OK
Time taken: 0.048 seconds

Execute max_cgpa.q script (Previous article Hive - Part I )

[root@slc01mcd hive-0.9.0]# hive -f max_cgpa.q


Hadoop job information for null: number of mappers: 0; number of reducers: 0
2012-07-18 07:01:23,161 null map = 0%,  reduce = 0%
2012-07-18 07:01:26,174 null map = 100%,  reduce = 0%
2012-07-18 07:01:29,186 null map = 100%,  reduce = 100%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK


cse     8.6
ece     9.0


Time taken: 10.476 seconds


Hive> show tables;

OK

maxcgpa1

Time taken: 2.769 seconds

Friday, July 06, 2012

HIVE - HADOOP : INSTALLATION , EXECUTION OF SAMPLE PROGRAM


The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source map-reduce implementation which is being used in companies like Yahoo and Facebook to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. Hive is an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug custom map-reduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog, the Metastore, that contains schemas and statistics, which are useful in data exploration, query optimization and query compilation.

Installing and Running Hive


Download latest version of hive from the following link (Hive Installation).

$ tar xzf hive-0.9.0.tar.gz

Set the Hive environment variables:

$ export HIVE_INSTALL=/home/user1/hive-0.9.0
$ export PATH=$PATH:$HIVE_INSTALL/bin

You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.

Hive has two execution types : 

1) local mode : Hive runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets.

$ hive

hive>


This is the Hive interactive shell.


2) MapReduce mode : In MapReduce mode, Hive translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster.


Set the HADOOP_HOME environment variable so Hive can find which Hadoop client to run.


Configure hive-site.xml to specify the name node and job tracker of Hadoop.

You can also set Hadoop properties on a per-session basis:

% hive -hiveconf fs.default.name=localhost -hiveconf mapred.job.tracker=localhost:8021

Or you can use the SET command:

hive> SET fs.default.name=localhost;

Hive Services

There are four ways to access Hive:


cli : The command-line interface to Hive (the shell). This is the default service.


hiveserver : Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written in different languages. Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to communicate with Hive. Set the HIVE_PORT environment variable to specify the port the server will listen on (defaults to 10,000).

hwi : The Hive Web Interface.

jar : The Hive equivalent to hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes on the classpath.

CLI: There are two modes of interaction:

1) Interactive Mode
2) Non-Interactive Mode

1) Interactive Mode ( Hive Shell ) 

The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL. HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL.


$ hive


hive> show tables;



OK

dummy
maxcgpa


records
Time taken: 3.756 seconds

Like SQL, HiveQL is generally case-insensitive (except for string comparisons), so SHOW TABLES; works equally well here.

2) Non-Interactive Mode :

$ hive -f script.q


The -f option runs the commands in the specified file, script.q in this case.


For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required:

$ hive -e 'SELECT * FROM dummy'

Hive history file=/tmp/tom/hive_job_log_tom_201005042112_1906486281.txt
OK
X

Time taken: 4.734 seconds

You can suppress these messages using the -S option at launch time, which has the effect of showing only the output result for queries:


% hive -S -e 'SELECT * FROM dummy'


X

Example of Hive in Non-Interactive Mode 

max_cgpa.q

-- max_cgpa.q: Finds the maximum cgpa of a specialization



CREATE TABLE maxcgpa (name STRING, spl STRING, cgpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'hivesample.txt' OVERWRITE INTO TABLE maxcgpa;

SELECT spl, MAX(cgpa) FROM maxcgpa WHERE cgpa >0 AND cgpa <= 10  GROUP BY spl;

Above hive script finds the maximum cgpa of a specialization.

hivesample.txt (input to Hive)

raghu     ece     9
kumar    cse      8.5
biju       ece      8
mukul    cse      8.6
ashish   ece      7.0
subha    cse      8.3
ramu     ece     -8.3
rahul     cse      11.4
budania ece      5.4

The first column represents the name, the second the specialization, and the third the CGPA; by default each column is separated by a tab.

$hive -f max_cgpa.q

Output :


cse     8.6
ece     9.0

Analysis : 

Statement : 1

CREATE TABLE maxcgpa (name STRING, spl STRING, cgpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';


The first statement declares a maxcgpa table with three columns: name, spl, and cgpa. The type of each column must be specified too. Here, name is a string, spl is a string, and cgpa is a float.

The ROW FORMAT clause, however, is particular to HiveQL. This declaration is saying that each row in the data file is tab-delimited text. Hive expects there to be three fields in each row, corresponding to the table columns, with fields separated by tabs and rows by newlines.


Statement : 2

LOAD DATA LOCAL INPATH 'hivesample.txt' OVERWRITE INTO TABLE maxcgpa;

Hive puts the specified local file in its warehouse directory.

The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any existing files in the directory for the table. If it is omitted, the new files are simply added to the table’s directory


Statement : 3

SELECT spl, MAX(cgpa) FROM maxcgpa WHERE cgpa >0 AND cgpa <= 10  GROUP BY spl;

It is a SELECT statement with a GROUP BY clause that groups rows by spl, and it uses the MAX() aggregate function to find the maximum cgpa for each spl group. Hive transforms this query into a MapReduce job, which it executes on our behalf, then prints the results to the console.








3) Hive Web Interface ( HWI ) :





To use HWI you need to install Apache Ant and configure the environment variables for Ant (ANT Installation).





setenv ANT_HOME /scratch/rjuluri/HADOOP/apache-ant-1.8.4
setenv ANT_LIB /scratch/rjuluri/HADOOP/apache-ant-1.8.4/lib

$ hive --service hwi

Navigate to http://localhost:9999/hwi for accessing hwi .









I will cover Hive clients, the metastore, and HiveQL in the next blog post...
