Thursday, July 26, 2012

PROVIDENT FUND (PF) WITHDRAWAL : IBM INDIA PVT LTD


In this blog post I want to describe the guidelines for withdrawing Provident Fund (PF) from IBM India Pvt Ltd. Since many of my friends have asked me about the procedure, I am laying out the steps to be followed.



Note: PF withdrawal should be initiated only after completion of 60 days from the date of leaving; otherwise it will be rejected.


Please fill in the attached forms (Form-10C.pdf and Form-19.pdf; see Guidelines-form-10C.pdf and Guidelines-Form-19.pdf) and send the duly filled and signed hard copies to the address below for withdrawal of PF.



Please refer to the attached sample filled form for reference. The details shown in the sample form are mandatory; if you miss any of them, your form will be rejected and sent back to you.

Please ensure the following points are taken care of before sending the forms to IBM:

1. Application Forms – Please ensure that the Forms are printed back to back.

2. Filling the Application - Please use only blue ink to fill the Forms.

3. Name – Please ensure you fill in your name exactly as it appears in IBM records, and that it matches the name on your bank account. Any mismatch will lead to rejection of the claim.

4. Overwriting – Please avoid overwriting in the form, as it will lead to rejection of the claim.


5. Signatures - Please ensure that your application is complete with all mandatory employee signatures (3 signatures on page 3 of Form 19 and 2 signatures on page 3 of Form 10C).


6. Cancelled cheque leaf - Please ensure that a cancelled cheque leaf (with your name and the IFSC code printed on it) is attached along with the PF withdrawal form, and that the bank details filled in the form match the details on the attached cheque leaf. If your name is not printed on the cheque, also attach a bank statement showing the last few transactions. Make sure the IFSC code appears on both the cancelled cheque leaf and the bank statement, and that the account number you provide is for an individual account.


7. ID Proof - If you are applying for PF withdrawal more than 2 years and 6 months after the date of leaving, ensure that you attach a photo ID proof such as PAN card, Passport or Voter's ID along with the forms.



8. Please enclose the following documents along with the forms:
    a) Photocopy of the Full and Final settlement letter.
    b) The cancelled cheque leaf (a crossed cheque with "cancelled" written across it), as described in point 6.

9. Scanned or photocopied applications are not accepted for processing.

10. Please write your mobile number (in pencil) at the top of the application.

The above documents need to be couriered to the following address:

IBM India Pvt Ltd.
Retirals Team
Global Process Services - HR Delivery
Manyata Embassy Business Park,
D1, 4th Floor, Outer Ring Road,
Nagawara, Bangalore - 560 045
ibmretirals@aonhewitt.com

Once the above forms are couriered to the IBM address, IBM will file them with the PF office.

After about a month you will receive a PF withdrawal tracking number on the mobile number you mentioned on the application.

Then you need to wait about 3 months for the PF amount to be credited to the bank account you mentioned in the form. If you file during March to July it will take longer, since the Provident Fund office staff are busy with year-end financial calculations; it might take up to 6 months.

PF withdrawal takes less time than PF transfer; a transfer can sometimes take up to 2 years. So from my perspective, withdrawal is the better option.

I have uploaded all 4 documents to Google Docs; please download them from the following URLs:








HIVE - HADOOP : MYSQL AS METASTORE - Part III



In my previous article (Hive with Derby) I described the drawback of using an embedded metastore: only one Hive session can open a connection to it, so at any point of time only one user can be active while the others wait. To overcome this, use a standalone database as the metastore.


If you use a standalone database, Hive supports multiple sessions, so multiple users can connect at the same instant; this is referred to as a local metastore. Any JDBC-compliant database can be used as the metastore. In the previous article I demonstrated, with an example, how to install and start the Derby database and how to configure Hive to connect to it.




In this blog post I want to demonstrate how to install and run MySQL Server, and the configuration needed in Hive to connect to MySQL and store its metadata there. MySQL is the world's most used open-source Relational Database Management System (RDBMS).


MYSQL Installation :


1) Download the MySQL software from MYSQL Software.
2) Set the password for the MySQL root user.


If you face the error "ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)" while changing the password, reset it as follows:



/etc/init.d/mysqld stop


mysqld_safe --skip-grant-tables &
mysql -u root
mysql> use mysql;
mysql> update user set password=PASSWORD("newrootpassword") where User='root';
mysql> flush privileges;
mysql> quit
/etc/init.d/mysqld stop
/etc/init.d/mysqld start





For Example:



[root@slc01mcd ~]# mysqld_safe --skip-grant-tables &
[1] 12943
[root@slc01mcd ~]# Starting mysqld daemon with databases from /var/lib/mysql
mysql -u root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.0.77 Source distribution


Type 'help;' or '\h' for help. Type '\c' to clear the buffer.


mysql> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A


Database changed
mysql> update user set password=PASSWORD("Welcome1") where User='root';
Query OK, 2 rows affected (0.00 sec)
Rows matched: 3  Changed: 2  Warnings: 0


mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)


mysql>  quit
Bye

Once the password is set, restart MySQL and verify that you can log in and query it:

[root@slc01mcd ~]# /etc/init.d/mysqld start
Starting MySQL:                                            [  OK  ]
[root@slc01mcd ~]# mysql --user=root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.0.77 Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show tables;
ERROR 1046 (3D000): No database selected
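
The error above simply means that no database has been selected yet. At this point you can also create a dedicated metastore database and Hive user instead of letting Hive connect as root; a minimal sketch (the names metastore_db, hive and hivepass are illustrative choices, not from the original setup):

mysql> CREATE DATABASE metastore_db;
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hivepass';
mysql> GRANT ALL PRIVILEGES ON metastore_db.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;

If you create a dedicated user, use the same credentials in hive-site.xml and jpox.properties below.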

HIVE configuration: edit hive-site.xml (here /homeHADOOP/hive-0.9.0/conf/hive-site.xml) and add the following properties, pointing the metastore at MySQL:

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>controls whether to connect to a remote metastore server or open a new metastore server in the Hive client JVM</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hostname:3306/metastore_db?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>Welcome1</value>
  <description>password to use against metastore database</description>
</property>

Jpox.properties

javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
javax.jdo.option.ConnectionURL=jdbc:mysql://hostname:3306/metastore_db?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionUserName=root
javax.jdo.option.ConnectionPassword=Welcome1

Copy the MySQL JDBC driver jar to the Hive lib folder:

cp mysql-connector-java-5.1.11/*.jar  /homeHADOOP/hive-0.9.0/lib

Sample Program :

Hive> show tables;

OK
Time taken: 0.048 seconds

Execute the max_cgpa.q script (from the previous article, Hive - Part I):


[root@slc01mcd hive-0.9.0]# hive -f max_cgpa.q 

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/scratch/rjuluri/HADOOP/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201207180701_1297257265.txt
OK
Time taken: 3.819 seconds
Copying data from file:/scratch/rjuluri/HADOOP/hive-0.9.0/hivesample.txt
Copying file: file:/scratch/rjuluri/HADOOP/hive-0.9.0/hivesample.txt
Loading data to table default.maxcgpa1
Deleted file:/user/hive/warehouse/maxcgpa1
OK
Time taken: 0.383 seconds
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/root/root_20120718070101_e67d60e9-5f66-482a-8151-eebc5cfefaf4.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2012-07-18 07:01:23,161 null map = 0%,  reduce = 0%
2012-07-18 07:01:26,174 null map = 100%,  reduce = 0%
2012-07-18 07:01:29,186 null map = 100%,  reduce = 100%
Ended Job = job_local_0001

Execution completed successfully

Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
cse     8.6
ece     9.0
Time taken: 10.476 seconds

[root@slc01mcd hive-0.9.0]# hive

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/scratch/rjuluri/HADOOP/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201207180701_545279755.txt
hive> show tables;
OK
maxcgpa1
Time taken: 2.769 seconds

Now verify whether the newly created table maxcgpa1 is recorded in the MySQL database:

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema | 
| mysql              | 
| test               | 
+--------------------+
3 rows in set (0.00 sec)

This shows that MySQL is now being used by Hive as its metastore, so multiple sessions (users) can connect to Hive and perform Hadoop operations.
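
As a further check, a minimal sketch assuming the metastore database was created under the name used in the ConnectionURL (metastore_db here); TBLS is one of the tables in Hive's metastore schema:

mysql> USE metastore_db;
mysql> SELECT TBL_NAME FROM TBLS;

This should list maxcgpa1 if Hive is writing its metadata to MySQL.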





Friday, July 20, 2012

HIVE - HADOOP : INSTALLATION , EXECUTION OF SAMPLE PROGRAM - Part II


Hive Clients

To start the Hive server, use the following command:

$hive --service hiveserver

The following Hive clients can connect to this server:

1) CommandLine
2) Thrift Client
3) JDBC Driver
4) ODBC Driver


Courtesy: Hadoop: The Definitive Guide

1) CommandLine : operates in embedded mode only; it needs access to the Hive libraries.

This is briefly described in my previous article (Hive - Part I).

2) Thrift Client : 

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages. Thrift Software

The Hive server is implemented in Java, so to query it from Ruby, PHP, C++ or another language you need to build the Thrift client bindings for that language.

Steps to build Thrift Client for Ruby

a) Install Thrift (Thrift)
b) Download the Thrift source (Thrift Source)
c) Navigate to the Hive source directory and generate the Ruby bindings:

thrift --gen rb -I service/include metastore/if/hive_metastore.thrift
thrift --gen rb -I service/include -I . service/if/hive_service.thrift
thrift --gen rb service/include/thrift/fb303/if/fb303.thrift
thrift --gen rb serde/if/serde.thrift
thrift --gen rb ql/if/queryplan.thrift
thrift --gen rb service/include/thrift/if/reflection_limited.thrift

Or you can download Thrift client for ruby from Thrift Ruby Client

Thrift Java Client : operates in embedded mode and against a standalone server
Thrift C++ Client : operates only against a standalone server

3) JDBC Driver :
For embedded mode, the URI is just "jdbc:hive://".
For a standalone server, the URI is "jdbc:hive://host:port/dbname", where host and port are determined by where the Hive server is running.
Example: "jdbc:hive://localhost:10000/default". Currently the only dbname supported is "default".
JDBC client sample code: JDBC Client Sample Code

4) ODBC Driver : 

The Hive ODBC Driver is a software library that implements the Open Database Connectivity (ODBC) API standard for the Hive database management system, enabling ODBC-compliant applications to interact seamlessly (ideally) with Hive through a standard interface. This driver will NOT be built as part of the typical Hive build process and needs to be compiled and built separately; see ODBC Client.

The Metastore :


The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration




The disadvantage of the embedded metastore is that only one Hive session can be opened against the Hive server at a time. For multiple Hive sessions, move the metastore to a relational database that supports JDO (Derby network server, MySQL, etc.).

MetaStore : Derby Server

wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz

tar -xzf db-derby-10.4.2.0-bin.tar.gz


mv db-derby-10.4.2.0-bin derby


mkdir derby/data


Log in as root and add the following variables in /etc/profile.d/derby.sh:



export DERBY_INSTALL=/home/HADOOP/derby
export DERBY_HOME=/home/HADOOP/derby



cd /home/HADOOP/derby/bin/


./startNetworkServer -h 0.0.0.0 &
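
Before pointing Hive at it, you can optionally check that the network server is reachable with the ij tool that ships with Derby (a quick sketch; localhost and 1527 are the defaults for a server started as above, and metastore_db matches the database name used in the configuration below):

cd /home/HADOOP/derby/bin
./ij
ij> connect 'jdbc:derby://localhost:1527/metastore_db;create=true';
ij> exit;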

Then edit hive-site.xml in the Hive conf directory (/scratch/rjuluri/HADOOP/hive/conf):

vi hive-site.xml

<property>
  <name>hive.test.mode.nosamplelist</name>
  <value></value>
  <description>if hive is running in test mode, don't sample the above comma separated list of tables</description>
</property>
<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>controls whether to connect to a remote metastore server or open a new metastore server in the Hive client JVM</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://hostname:port/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

Add the following file:

vi /home/HADOOP/hive/conf/jpox.properties

javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL=jdbc:derby://hostname:port/metastore_db;create=true
javax.jdo.option.ConnectionUserName=APP
javax.jdo.option.ConnectionPassword=mine

cp /home/HADOOP/derby/lib/derbyclient.jar  /home/HADOOP/hive/lib
cp  /home/HADOOP/derby/lib/derbytools.jar  /home/HADOOP/hive/lib

Hive> show tables;

OK
Time taken: 0.048 seconds

Execute the max_cgpa.q script (from the previous article, Hive - Part I):

[root@slc01mcd hive-0.9.0]# hive -f max_cgpa.q


Hadoop job information for null: number of mappers: 0; number of reducers: 0
2012-07-18 07:01:23,161 null map = 0%,  reduce = 0%
2012-07-18 07:01:26,174 null map = 100%,  reduce = 0%
2012-07-18 07:01:29,186 null map = 100%,  reduce = 100%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK


cse     8.6
ece     9.0


Time taken: 10.476 seconds


Hive> show tables;

OK

maxcgpa1

Time taken: 2.769 seconds

Friday, July 06, 2012

HIVE - HADOOP : INSTALLATION , EXECUTION OF SAMPLE PROGRAM


The size of data sets being collected and analyzed in industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source MapReduce implementation, used in companies like Yahoo and Facebook to store and process extremely large data sets on commodity hardware. However, the MapReduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. Hive is an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into MapReduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug custom MapReduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog, the Metastore, which contains schemas and statistics that are useful in data exploration, query optimization and query compilation.

Installing and Running hive


Download the latest version of Hive from the following link (Hive Installation).

$ tar xzf hive-0.9.0.tar.gz

Set the Hive environment variables (HIVE_INSTALL should point to the extracted directory, not the tarball):

$ export HIVE_INSTALL=/home/user1/hive-0.9.0
$ export PATH=$PATH:$HIVE_INSTALL/bin

You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.

Hive has two execution types : 

1) local mode : Hive runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets.

$ hive

hive>


This is Hive interactive shell


2) MapReduce mode : In MapReduce mode, Hive translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster.


Set the HADOOP_HOME environment variable so Hive can find which Hadoop client to run.


Configure hive-site.xml to specify the Hadoop namenode and jobtracker, for example as shown below.
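
A minimal sketch of the relevant hive-site.xml entries (the host names and ports below are placeholders, not values from the original post):

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:8020</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:8021</value>
</property>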

You can also set these Hadoop properties on a per-session basis:

% hive -hiveconf fs.default.name=localhost -hiveconf mapred.job.tracker=localhost:8021

Or you can use the SET command inside the Hive shell:

hive> SET fs.default.name=localhost;

Hive Services

Hive provides four services through which you can access it:


cli : The command-line interface to Hive (the shell). This is the default service.


hiveserver : Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written in different languages. Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to communicate with Hive. Set the HIVE_PORT environment variable to specify the port the server will listen on (defaults to 10,000).

hwi : The Hive Web Interface.

jar : The Hive equivalent to hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes on the classpath.
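
For example, to start the hiveserver service on a non-default port via the HIVE_PORT variable mentioned above (a minimal sketch):

$ export HIVE_PORT=10001
$ hive --service hiveserver &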

CLI : there are two modes of interaction:

1) Interactive Mode
2) Non-Interactive Mode

1) Interactive Mode ( Hive Shell ) 

The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL. HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL.


$ hive


hive> show tables;



OK

dummy
maxcgpa


records
Time taken: 3.756 seconds

Like SQL, HiveQL is generally case-insensitive (except for string comparisons), so SHOW TABLES; works equally well here.

2) Non-Interactive Mode :

$ hive -f script.q


The -f option runs the commands in the specified file, script.q in this example.


For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required:

$ hive -e 'SELECT * FROM dummy'

Hive history file=/tmp/tom/hive_job_log_tom_201005042112_1906486281.txt
OK
X

Time taken: 4.734 seconds

You can suppress these messages using the -S option at launch time, which has the effect of showing only the output of the query:


% hive -S -e 'SELECT * FROM dummy'


X

Example of Hive in Non-Interactive Mode 

max_cgpa.q

-- max_cgpa.q: Finds the maximum cgpa of a specialization



CREATE TABLE maxcgpa (name STRING, spl STRING, cgpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'hivesample.txt' OVERWRITE INTO TABLE maxcgpa;

SELECT spl, MAX(cgpa) FROM maxcgpa WHERE cgpa >0 AND cgpa <= 10  GROUP BY spl;

The above Hive script finds the maximum CGPA for each specialization.

hivesample.txt (input to Hive)

raghu     ece     9
kumar    cse      8.5
biju       ece      8
mukul    cse      8.6
ashish   ece      7.0
subha    cse      8.3
ramu     ece     -8.3
rahul     cse      11.4
budania ece      5.4

The first column is the name, the second the specialization, and the third the CGPA; by default the columns are separated by tabs.

$hive -f max_cgpa.q

Output :


cse     8.6
ece     9.0

Analysis : 

Statement : 1

CREATE TABLE maxcgpa (name STRING, spl STRING, cgpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';


The first statement declares a  maxcgpa  table with three columns: name, spl, and cgpa. The type of each column must be specified, too. Here the name is a string, spl is a string and cgpa is a float.

The ROW FORMAT clause, however, is particular to HiveQL. This declaration says that each row in the data file is tab-delimited text. Hive expects there to be three fields in each row, corresponding to the table columns, with fields separated by tabs and rows by newlines.


Statement : 2

LOAD DATA LOCAL INPATH 'hivesample.txt' OVERWRITE INTO TABLE maxcgpa;

Hive puts the specified local file into its warehouse directory.

The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any existing files in the directory for the table. If it is omitted, the new files are simply added to the table’s directory


Statement : 3

SELECT spl, MAX(cgpa) FROM maxcgpa WHERE cgpa >0 AND cgpa <= 10  GROUP BY spl;

This is a SELECT statement with a GROUP BY clause that groups the rows by spl and uses the MAX() aggregate function to find the maximum cgpa for each spl group. Hive transforms this query into a MapReduce job, executes it on our behalf, and then prints the results to the console.
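
Before running the aggregation you can also sanity-check the table definition from the Hive shell (a small sketch; exact output formatting varies between Hive versions):

hive> DESCRIBE maxcgpa;

This should list the three columns declared above: name (string), spl (string) and cgpa (float).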








3) Hive Web Interface ( HWI ) :





To use HWI you need to install Apache Ant and configure the ANT environment variables (see ANT Installation).





setenv ANT_HOME /scratch/rjuluri/HADOOP/apache-ant-1.8.4
setenv ANT_LIB /scratch/rjuluri/HADOOP/apache-ant-1.8.4/lib

$ hive --service hwi

Navigate to http://localhost:9999/hwi to access HWI.
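
If HWI does not come up on the default port, these are the standard HWI-related properties you can set in hive-site.xml (a sketch; the war file name depends on your Hive version, so lib/hive-hwi-0.9.0.war is an assumption for Hive 0.9.0):

<property>
  <name>hive.hwi.listen.host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>hive.hwi.listen.port</name>
  <value>9999</value>
</property>
<property>
  <name>hive.hwi.war.file</name>
  <value>lib/hive-hwi-0.9.0.war</value>
</property>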









I will cover Hive clients, the metastore and HiveQL in the next blog post.

Wednesday, July 04, 2012

PIG - HADOOP : Installation , Execution of Sample Program


Pig raises the level of abstraction for processing large datasets. MapReduce allows you, as the programmer, to specify a map function followed by a reduce function, but working out how to fit your data processing into this pattern, which often requires multiple MapReduce stages, can be a challenge. With Pig, the data structures are much richer, typically being multivalued and nested, and the set of transformations you can apply to the data are much more powerful.


Pig is made up of two pieces:

• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.


A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output. Taken as a whole, the operations describe a data flow, which the Pig execution environment translates into an executable representation and then runs. Under the covers, Pig turns the transformations into a series of MapReduce jobs

Installing and Running Pig

Download the latest version of Pig from the following link (Pig Installation).

$ tar xzf pig-0.7.0.tar.gz

Set the Pig environment variables (PIG_INSTALL should point to the extracted directory, not the tarball):

$ export PIG_INSTALL=/home/user1/pig-0.7.0
$ export PATH=$PATH:$PIG_INSTALL/bin

You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.

Pig has two execution types or modes: 

1) local mode : Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets.

$ pig -x local

grunt>

This starts Grunt, the Pig interactive shell

2) MapReduce mode : In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster.


set the HADOOP_HOME environment variable for finding which Hadoop client to run.

$ pig (or $ pig -x mapreduce) runs Pig in MapReduce mode.

Running Pig Programs

There are three ways of executing Pig programs, all of which work in both local and MapReduce mode


Script : Pig can run a script file that contains Pig commands. For example, pig
script.pig runs the commands in the local file script.pig
$ pig script.pig

Grunt : Grunt is an interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using run and exec.


Embedded :
You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java.

PigPen is an Eclipse plug-in that provides an environment for developing Pig programs.

PigTools and EditorPlugins for pig can be downloaded from PigTools

Example of Pig in Interactive Mode (Grunt)

max_cgpa.pig


-- max_cgpa.pig: Finds the maximum cgpa of a user

records = LOAD 'pigsample.txt'
AS (name:chararray, spl:chararray, cgpa:float);
filtered_records = FILTER records BY cgpa > 0 AND cgpa < 10;
grouped_records = GROUP filtered_records BY spl;
max_cgpa = FOREACH grouped_records GENERATE group, MAX(filtered_records.cgpa);
STORE max_cgpa INTO 'output/cgpa_out';

The above Pig script finds the maximum CGPA for each specialization.

pigsample.txt (input to Pig)

raghu     ece     9
kumar    cse      8.5
biju       ece      8
mukul    cse      8.6
ashish   ece      7.0
subha    cse      8.3
ramu     ece     -8.3
rahul     cse      11.4
budania ece      5.4

The first column is the name, the second the specialization, and the third the CGPA; by default the columns are separated by tabs.

$ pig max_cgpa.pig

Output : 

(cse,8.6F)
(ece,9.0F)

Analysis : 

Statement : 1
records = LOAD 'pigsample.txt'AS (name:chararray, spl:chararray, cgpa:float);

This loads the input file from the file system (HDFS, local, or Amazon S3). The name:chararray notation describes the field's name and type; chararray is like a Java String, and a float is like a Java float.

grunt> DUMP records;

(raghu,ece,9.0F)
(kumar,cse,8.5F)
(biju,ece,8.0F)
(mukul,cse,8.6F)
(ashish,ece,7.0F)
(subha,cse,8.3F)
(ramu,ece,-8.3F)
(rahul,cse,11.4F)
(budania,ece,5.4F)

Each input row is converted into a tuple, with the fields separated by commas in the dumped output.

grunt> DESCRIBE records;
records: {name: chararray,spl: chararray,cgpa: float}

Statement : 2
filtered_records = FILTER records BY cgpa > 0 AND cgpa < 10;

grunt> DUMP filtered_records;

This keeps only the records where the CGPA is between 0 and 10 (exclusive), filtering out the negative and out-of-range values:

(raghu,ece,9.0F)
(kumar,cse,8.5F)
(biju,ece,8.0F)
(mukul,cse,8.6F)
(ashish,ece,7.0F)
(subha,cse,8.3F)
(budania,ece,5.4F)

grunt> DESCRIBE filtered_records;
filtered_records: {name: chararray,spl: chararray,cgpa: float}

Statement : 3

The third statement uses the GROUP operator to group the filtered_records relation by the specialization field.

grouped_records = GROUP filtered_records BY spl;

grunt> DUMP  grouped_records ;

(cse,{(kumar,cse,8.5F),(mukul,cse,8.6F),(subha,cse,8.3F)})
(ece,{(raghu,ece,9.0F),(biju,ece,8.0F),(ashish,ece,7.0F),(budania,ece,5.4F)})

grunt> DESCRIBE  grouped_records;
grouped_records: {group: chararray,filtered_records: {name: chararray,spl: chararray,cgpa: float}}

We now have two rows, or tuples, one for each specialization in the input data. The first field in each tuple is the field being grouped by (the specialization), and the second field is a bag of tuples
for that  specialization. A bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces.

By grouping the data in this way, we have created a row per  specialization , so now all that remains is to find the maximum cgpa for the tuples in each bag.

Statement : 4


max_cgpa = FOREACH grouped_records GENERATE group,
MAX(filtered_records.cgpa);

FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row. In this example, the first field is group, which is just the specialization. The second field is a little more complex.

The filtered_records.cgpa reference is to the cgpa field of the
filtered_records bag in the grouped_records relation. MAX is a built-in function for calculating the maximum value of fields in a bag. In this case, it calculates the maximum cgpa for the fields in each filtered_records bag.

grunt> DUMP    max_cgpa  ;

(cse,8.6F)
(ece,9.0F)

grunt> DESCRIBE    max_cgpa  ;

max_cgpa : {group: chararray,float}

Statement : 5

STORE max_cgpa INTO 'output/cgpa_out'

This command writes the output of the script to a directory (local or HDFS) instead of printing it to the console.
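
To look at the stored result afterwards (a minimal sketch, assuming the default part-file naming):

# local mode
$ cat output/cgpa_out/part-*

# MapReduce mode (output on HDFS)
$ hadoop fs -cat output/cgpa_out/part-*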

we’ve successfully calculated the maximum cgpa for each specialization.

With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably complete and concise sample dataset.
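
For example, running it on the final relation produces the sample tables shown below:

grunt> ILLUSTRATE max_cgpa;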


--------------------------------------------------------------------
| records     | name: bytearray | spl: bytearray | cgpa: bytearray | 
--------------------------------------------------------------------
|             | kumar           | cse            | 8.5             | 
|             | mukul           | cse            | 8.6             | 
|             | ramu            | ece            | -8.3            | 
--------------------------------------------------------------------
----------------------------------------------------------------
| records     | name: chararray | spl: chararray | cgpa: float | 
----------------------------------------------------------------
|             | kumar           | cse            | 8.5         | 
|             | mukul           | cse            | 8.6         | 
|             | ramu            | ece            | -8.3        | 
----------------------------------------------------------------
-------------------------------------------------------------------------
| filtered_records     | name: chararray | spl: chararray | cgpa: float | 
-------------------------------------------------------------------------
|                      | kumar           | cse            | 8.5         | 
|                      | mukul           | cse            | 8.6         | 
-------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------
| grouped_records     | group: chararray | filtered_records: bag({name: chararray,spl: chararray,cgpa: float}) | 
----------------------------------------------------------------------------------------------------------------
|                     | cse              | {(kumar, cse, 8.5), (mukul, cse, 8.6)}                              | 
----------------------------------------------------------------------------------------------------------------
-------------------------------------------
|  max_cgpa   | group: chararray | float | 
-------------------------------------------
|              | cse              | 8.6   |


EXPLAIN max_cgpa  

Use the above command to see the logical and physical plans created by Pig.
