Thursday, July 26, 2012

PROVIDENT FUND (PF) WITHDRAWAL : IBM INDIA PVT LTD


In this blog post I want to describe the guidelines for withdrawing Provident Fund (PF) from IBM India Pvt Ltd. Since many of my friends have asked me about the procedure, I am laying out the steps to be followed.



Note: PF withdrawal should be initiated only after completion of 60 days from the date of leaving; otherwise it will be rejected.


Please fill in the attached forms (Form-10C.pdf and Form-19.pdf; see Guidelines-form-10C.pdf and Guidelines-Form-19.pdf) and send the duly filled and signed hard copies to the address below for withdrawal of PF.



Please refer to the attached sample filled form for reference. The details shown in the sample form are mandatory; if you miss any of them, your form will be rejected and sent back to you.

Please ensure the following points are taken care of before sending the forms to IBM:

1. Application Forms – Please ensure that the Forms are printed back to back.

2. Filling the Application - Please use only blue ink to fill the Forms.

3. Name – Please ensure you fill in your name exactly as it appears in IBM records, and that it matches the name on your bank account. Any mismatch will lead to rejection of the claim.

4. Overwriting – Please avoid overwriting in the form, as it will lead to rejection of the claim.


5. Signatures - Please ensure that your application is complete with all mandatory employee signatures (3 signatures on page 3 of Form 19 and 2 signatures on page 3 of Form 10C).


6. Cancelled cheque leaf - Please ensure that a cancelled cheque leaf (with your name and the IFSC code printed on it) is attached along with the PF withdrawal form, and that the bank details filled in the form match the details on the attached cheque leaf. If your name is not printed on the cheque, also attach a bank statement showing the last few transactions. Make sure the IFSC code appears on both the cancelled cheque leaf and the bank statement, and that the account number you provide is for an individual account.


7. ID Proof - If you are applying for PF withdrawal more than 2 years and 6 months after the date of leaving, ensure that you attach a photo ID proof such as PAN card, Passport or Voter's ID along with the forms.



8. Please enclose the following documents along with the forms:
    a) Photocopy of the Full and Final settlement letter.
    b) The cancelled cheque leaf (a crossed cheque with "cancelled" written across it), as described in point 6.

9. Scanned or photocopied applications are not accepted for processing.

10. Please write your mobile number (in pencil) at the top of the application.

The above documents need to be couriered to the following address:

IBM India Pvt Ltd.
Retirals Team
Global Process Services - HR Delivery
Manyata Embassy Business Park,
D1, 4th Floor, Outer Ring Road,
Nagawara, Bangalore - 560 045
ibmretirals@aonhewitt.com

Once the above forms are couriered to the IBM address, IBM will file them with the PF office.

After about a month you will receive a PF withdrawal tracking number on the mobile number you mentioned on the application.

Then you need to wait about 3 months for the PF amount to be credited to the bank account you mentioned in the form. If you file during March to July it will take longer, since the Provident Fund office staff are busy with year-end financial calculations; it might take up to 6 months.

PF withdrawal takes less time than PF transfer; a transfer can sometimes take up to 2 years. So from my perspective, withdrawal is the better option.

I have uploaded all 4 documents to Google Docs; please download them from the following URLs:








HIVE - HADOOP : MYSQL AS METASTORE - Part III



In my previous article (Hive with Derby) I described the drawback of using an embedded metastore: only one Hive session can open a connection to it, so at any point of time only one user can be active while the others wait. To overcome this, use a standalone database as the metastore.


If you use a standalone database, Hive supports multiple sessions, so multiple users can connect at the same instant; this is referred to as a local metastore. Any JDBC-compliant database can be used as the metastore. In the previous article I demonstrated, with an example, how to install and start the Derby database and how to configure Hive to connect to it.




In this blog post I want to demonstrate how to install and run MySQL Server, and the configuration needed in Hive to connect to MySQL and store its metadata there. MySQL is the world's most used open-source Relational Database Management System (RDBMS).


MYSQL Installation :


1) Download the MySQL software from MYSQL Software.
2) Set the password for the MySQL root user.


If you face the error "ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)" while changing the password, reset it as follows:



/etc/init.d/mysqld stop


mysqld_safe --skip-grant-tables &
mysql -u root
mysql> use mysql;
mysql> update user set password=PASSWORD("newrootpassword") where User='root';
mysql> flush privileges;
mysql> quit
/etc/init.d/mysqld stop
/etc/init.d/mysqld start





For Example:



[root@slc01mcd ~]# mysqld_safe --skip-grant-tables &
[1] 12943
[root@slc01mcd ~]# Starting mysqld daemon with databases from /var/lib/mysql
mysql -u root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.0.77 Source distribution


Type 'help;' or '\h' for help. Type '\c' to clear the buffer.


mysql> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A


Database changed
mysql> update user set password=PASSWORD("Welcome1") where User='root';
Query OK, 2 rows affected (0.00 sec)
Rows matched: 3  Changed: 2  Warnings: 0


mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)


mysql>  quit
Bye

Once the password is set, restart MySQL and verify that you can log in and query it:

[root@slc01mcd ~]# /etc/init.d/mysqld start
Starting MySQL:                                            [  OK  ]
[root@slc01mcd ~]# mysql --user=root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.0.77 Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show tables;
ERROR 1046 (3D000): No database selected
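
The error above simply means that no database has been selected yet. At this point you can also create a dedicated metastore database and Hive user instead of letting Hive connect as root; a minimal sketch (the names metastore_db, hive and hivepass are illustrative choices, not from the original setup):

mysql> CREATE DATABASE metastore_db;
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hivepass';
mysql> GRANT ALL PRIVILEGES ON metastore_db.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;

If you create a dedicated user, use the same credentials in hive-site.xml and jpox.properties below.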

HIVE configuration: edit hive-site.xml (here /homeHADOOP/hive-0.9.0/conf/hive-site.xml) and add the following properties, pointing the metastore at MySQL:

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>controls whether to connect to a remote metastore server or open a new metastore server in the Hive client JVM</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hostname:3306/metastore_db?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>Welcome1</value>
  <description>password to use against metastore database</description>
</property>

Jpox.properties

javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
javax.jdo.option.ConnectionURL=jdbc:mysql://hostname:3306/metastore_db?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionUserName=root
javax.jdo.option.ConnectionPassword=Welcome1

Copy the MySQL JDBC driver jar to the Hive lib folder:

cp mysql-connector-java-5.1.11/*.jar  /homeHADOOP/hive-0.9.0/lib

Sample Program :

Hive> show tables;

OK
Time taken: 0.048 seconds

Execute the max_cgpa.q script (from the previous article, Hive - Part I):


[root@slc01mcd hive-0.9.0]# hive -f max_cgpa.q 

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/scratch/rjuluri/HADOOP/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201207180701_1297257265.txt
OK
Time taken: 3.819 seconds
Copying data from file:/scratch/rjuluri/HADOOP/hive-0.9.0/hivesample.txt
Copying file: file:/scratch/rjuluri/HADOOP/hive-0.9.0/hivesample.txt
Loading data to table default.maxcgpa1
Deleted file:/user/hive/warehouse/maxcgpa1
OK
Time taken: 0.383 seconds
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/root/root_20120718070101_e67d60e9-5f66-482a-8151-eebc5cfefaf4.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2012-07-18 07:01:23,161 null map = 0%,  reduce = 0%
2012-07-18 07:01:26,174 null map = 100%,  reduce = 0%
2012-07-18 07:01:29,186 null map = 100%,  reduce = 100%
Ended Job = job_local_0001

Execution completed successfully

Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
cse     8.6
ece     9.0
Time taken: 10.476 seconds

[root@slc01mcd hive-0.9.0]# hive

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/scratch/rjuluri/HADOOP/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201207180701_545279755.txt
hive> show tables;
OK
maxcgpa1
Time taken: 2.769 seconds

Now verify whether the newly created table maxcgpa1 is recorded in the MySQL database:

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema | 
| mysql              | 
| test               | 
+--------------------+
3 rows in set (0.00 sec)

This shows that MySQL is now being used by Hive as its metastore, so multiple sessions (users) can connect to Hive and perform Hadoop operations.
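
As a further check, a minimal sketch assuming the metastore database was created under the name used in the ConnectionURL (metastore_db here); TBLS is one of the tables in Hive's metastore schema:

mysql> USE metastore_db;
mysql> SELECT TBL_NAME FROM TBLS;

This should list maxcgpa1 if Hive is writing its metadata to MySQL.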





Friday, July 20, 2012

HIVE - HADOOP : INSTALLATION , EXECUTION OF SAMPLE PROGRAM - Part II


Hive Clients

To start the Hive server, use the following command:

$hive --service hiveserver

The following Hive clients can connect to this server:

1) CommandLine
2) Thrift Client
3) JDBC Driver
4) ODBC Driver


Courtesy: Hadoop: The Definitive Guide

1) CommandLine : operates in embedded mode only; it needs access to the Hive libraries.

This is briefly described in my previous article (Hive - Part I).

2) Thrift Client : 

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages. Thrift Software

The Hive server is implemented in Java, so to query it from Ruby, PHP, C++ or another language you need to build the Thrift client bindings for that language.

Steps to build Thrift Client for Ruby

a) Install Thrift (Thrift)
b) Download the Thrift source (Thrift Source)
c) Navigate to the Hive source directory and generate the Ruby bindings:

thrift --gen rb -I service/include metastore/if/hive_metastore.thrift
thrift --gen rb -I service/include -I . service/if/hive_service.thrift
thrift --gen rb service/include/thrift/fb303/if/fb303.thrift
thrift --gen rb serde/if/serde.thrift
thrift --gen rb ql/if/queryplan.thrift
thrift --gen rb service/include/thrift/if/reflection_limited.thrift

Or you can download Thrift client for ruby from Thrift Ruby Client

Thrift Java Client : operates in embedded mode and against a standalone server
Thrift C++ Client : operates only against a standalone server

3) JDBC Driver :
For embedded mode, the URI is just "jdbc:hive://".
For a standalone server, the URI is "jdbc:hive://host:port/dbname", where host and port are determined by where the Hive server is running.
Example: "jdbc:hive://localhost:10000/default". Currently the only dbname supported is "default".
JDBC client sample code: JDBC Client Sample Code

4) ODBC Driver : 

The Hive ODBC Driver is a software library that implements the Open Database Connectivity (ODBC) API standard for the Hive database management system, enabling ODBC-compliant applications to interact seamlessly (ideally) with Hive through a standard interface. This driver will NOT be built as part of the typical Hive build process and needs to be compiled and built separately; see ODBC Client.

The Metastore :


The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration




The disadvantage of the embedded metastore is that only one Hive session can be opened against the Hive server at a time. For multiple Hive sessions, move the metastore to a relational database that supports JDO (Derby network server, MySQL, etc.).

MetaStore : Derby Server

wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz

tar -xzf db-derby-10.4.2.0-bin.tar.gz


mv db-derby-10.4.2.0-bin derby


mkdir derby/data


Log in as root and add the following variables in /etc/profile.d/derby.sh:



export DERBY_INSTALL=/home/HADOOP/derby
export DERBY_HOME=/home/HADOOP/derby



cd /home/HADOOP/derby/bin/


./startNetworkServer -h 0.0.0.0 &
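
Before pointing Hive at it, you can optionally check that the network server is reachable with the ij tool that ships with Derby (a quick sketch; localhost and 1527 are the defaults for a server started as above, and metastore_db matches the database name used in the configuration below):

cd /home/HADOOP/derby/bin
./ij
ij> connect 'jdbc:derby://localhost:1527/metastore_db;create=true';
ij> exit;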

Then edit hive-site.xml in the Hive conf directory (/scratch/rjuluri/HADOOP/hive/conf):

vi hive-site.xml

<property>
  <name>hive.test.mode.nosamplelist</name>
  <value></value>
  <description>if hive is running in test mode, don't sample the above comma separated list of tables</description>
</property>
<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>controls whether to connect to a remote metastore server or open a new metastore server in the Hive client JVM</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://hostname:port/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

Add the following file:

vi /home/HADOOP/hive/conf/jpox.properties

javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL=jdbc:derby://hostname:port/metastore_db;create=true
javax.jdo.option.ConnectionUserName=APP
javax.jdo.option.ConnectionPassword=mine

cp /home/HADOOP/derby/lib/derbyclient.jar  /home/HADOOP/hive/lib
cp  /home/HADOOP/derby/lib/derbytools.jar  /home/HADOOP/hive/lib

Hive> show tables;

OK
Time taken: 0.048 seconds

Execute the max_cgpa.q script (from the previous article, Hive - Part I):

[root@slc01mcd hive-0.9.0]# hive -f max_cgpa.q


Hadoop job information for null: number of mappers: 0; number of reducers: 0
2012-07-18 07:01:23,161 null map = 0%,  reduce = 0%
2012-07-18 07:01:26,174 null map = 100%,  reduce = 0%
2012-07-18 07:01:29,186 null map = 100%,  reduce = 100%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK


cse     8.6
ece     9.0


Time taken: 10.476 seconds


Hive> show tables;

OK

maxcgpa1

Time taken: 2.769 seconds

Friday, July 06, 2012

HIVE - HADOOP : INSTALLATION , EXECUTION OF SAMPLE PROGRAM


The size of data sets being collected and analyzed in industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source MapReduce implementation, used in companies like Yahoo and Facebook to store and process extremely large data sets on commodity hardware. However, the MapReduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. Hive is an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into MapReduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug custom MapReduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog, the Metastore, which contains schemas and statistics that are useful in data exploration, query optimization and query compilation.

Installing and Running hive


Download the latest version of Hive from the following link (Hive Installation).

$ tar xzf hive-0.9.0.tar.gz

Set the Hive environment variables (HIVE_INSTALL should point to the extracted directory, not the tarball):

$ export HIVE_INSTALL=/home/user1/hive-0.9.0
$ export PATH=$PATH:$HIVE_INSTALL/bin

You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.

Hive has two execution types : 

1) local mode : Hive runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets.

$ hive

hive>


This is Hive interactive shell


2) MapReduce mode : In MapReduce mode, Hive translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster.


Set the HADOOP_HOME environment variable so Hive can find which Hadoop client to run.


Configure hive-site.xml to specify the Hadoop namenode and jobtracker, for example as shown below.
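
A minimal sketch of the relevant hive-site.xml entries (the host names and ports below are placeholders, not values from the original post):

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:8020</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:8021</value>
</property>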

You can also set these Hadoop properties on a per-session basis:

% hive -hiveconf fs.default.name=localhost -hiveconf mapred.job.tracker=localhost:8021

Or you can use the SET command inside the Hive shell:

hive> SET fs.default.name=localhost;

Hive Services

Hive provides four services through which you can access it:


cli : The command-line interface to Hive (the shell). This is the default service.


hiveserver : Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written in different languages. Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to communicate with Hive. Set the HIVE_PORT environment variable to specify the port the server will listen on (defaults to 10,000).

hwi : The Hive Web Interface.

jar : The Hive equivalent to hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes on the classpath.
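
For example, to start the hiveserver service on a non-default port via the HIVE_PORT variable mentioned above (a minimal sketch):

$ export HIVE_PORT=10001
$ hive --service hiveserver &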

CLI : there are two modes of interaction:

1) Interactive Mode
2) Non-Interactive Mode

1) Interactive Mode ( Hive Shell ) 

The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL. HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL.


$ hive


hive> show tables;



OK

dummy
maxcgpa


records
Time taken: 3.756 seconds

Like SQL, HiveQL is generally case-insensitive (except for string comparisons), so SHOW TABLES; works equally well here.

2) Non-Interactive Mode :

$ hive -f script.q


The -f option runs the commands in the specified file, script.q in this example.


For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required:

$ hive -e 'SELECT * FROM dummy'

Hive history file=/tmp/tom/hive_job_log_tom_201005042112_1906486281.txt
OK
X

Time taken: 4.734 seconds

You can suppress these messages using the -S option at launch time, which has the effect of showing only the output of the query:


% hive -S -e 'SELECT * FROM dummy'


X

Example of Hive in Non-Interactive Mode 

max_cgpa.q

-- max_cgpa.q: Finds the maximum cgpa of a specialization



CREATE TABLE maxcgpa (name STRING, spl STRING, cgpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'hivesample.txt' OVERWRITE INTO TABLE maxcgpa;

SELECT spl, MAX(cgpa) FROM maxcgpa WHERE cgpa >0 AND cgpa <= 10  GROUP BY spl;

The above Hive script finds the maximum CGPA for each specialization.

hivesample.txt (input to Hive)

raghu     ece     9
kumar    cse      8.5
biju       ece      8
mukul    cse      8.6
ashish   ece      7.0
subha    cse      8.3
ramu     ece     -8.3
rahul     cse      11.4
budania ece      5.4

The first column is the name, the second the specialization, and the third the CGPA; by default the columns are separated by tabs.

$hive -f max_cgpa.q

Output :


cse     8.6
ece     9.0

Analysis : 

Statement : 1

CREATE TABLE maxcgpa (name STRING, spl STRING, cgpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';


The first statement declares a  maxcgpa  table with three columns: name, spl, and cgpa. The type of each column must be specified, too. Here the name is a string, spl is a string and cgpa is a float.

The ROW FORMAT clause, however, is particular to HiveQL. This declaration says that each row in the data file is tab-delimited text. Hive expects there to be three fields in each row, corresponding to the table columns, with fields separated by tabs and rows by newlines.


Statement : 2

LOAD DATA LOCAL INPATH 'hivesample.txt' OVERWRITE INTO TABLE maxcgpa;

Hive puts the specified local file into its warehouse directory.

The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any existing files in the directory for the table. If it is omitted, the new files are simply added to the table’s directory


Statement : 3

SELECT spl, MAX(cgpa) FROM maxcgpa WHERE cgpa >0 AND cgpa <= 10  GROUP BY spl;

This is a SELECT statement with a GROUP BY clause that groups the rows by spl and uses the MAX() aggregate function to find the maximum cgpa for each spl group. Hive transforms this query into a MapReduce job, executes it on our behalf, and then prints the results to the console.
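
Before running the aggregation you can also sanity-check the table definition from the Hive shell (a small sketch; exact output formatting varies between Hive versions):

hive> DESCRIBE maxcgpa;

This should list the three columns declared above: name (string), spl (string) and cgpa (float).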








3) Hive Web Interface ( HWI ) :





To use HWI you need to install Apache Ant and configure the ANT environment variables (see ANT Installation).





setenv ANT_HOME /scratch/rjuluri/HADOOP/apache-ant-1.8.4
setenv ANT_LIB /scratch/rjuluri/HADOOP/apache-ant-1.8.4/lib

$ hive --service hwi

Navigate to http://localhost:9999/hwi to access HWI.
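
If HWI does not come up on the default port, these are the standard HWI-related properties you can set in hive-site.xml (a sketch; the war file name depends on your Hive version, so lib/hive-hwi-0.9.0.war is an assumption for Hive 0.9.0):

<property>
  <name>hive.hwi.listen.host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>hive.hwi.listen.port</name>
  <value>9999</value>
</property>
<property>
  <name>hive.hwi.war.file</name>
  <value>lib/hive-hwi-0.9.0.war</value>
</property>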









I will cover Hive clients, the metastore and HiveQL in the next blog post.

Wednesday, July 04, 2012

PIG - HADOOP : Installation , Execution of Sample Program


Pig raises the level of abstraction for processing large datasets. MapReduce allows you, as the programmer, to specify a map function followed by a reduce function, but working out how to fit your data processing into this pattern, which often requires multiple MapReduce stages, can be a challenge. With Pig, the data structures are much richer, typically being multivalued and nested, and the set of transformations you can apply to the data are much more powerful.


Pig is made up of two pieces:

• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.


A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output. Taken as a whole, the operations describe a data flow, which the Pig execution environment translates into an executable representation and then runs. Under the covers, Pig turns the transformations into a series of MapReduce jobs

Installing and Running Pig

Download the latest version of Pig from the following link (Pig Installation).

$ tar xzf pig-0.7.0.tar.gz

Set the Pig environment variables (PIG_INSTALL should point to the extracted directory, not the tarball):

$ export PIG_INSTALL=/home/user1/pig-0.7.0
$ export PATH=$PATH:$PIG_INSTALL/bin

You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.

Pig has two execution types or modes: 

1) local mode : Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets.

$ pig -x local

grunt>

This starts Grunt, the Pig interactive shell

2) MapReduce mode : In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster.


set the HADOOP_HOME environment variable for finding which Hadoop client to run.

$ pig (or $ pig -x mapreduce) runs Pig in MapReduce mode.

Running Pig Programs

There are three ways of executing Pig programs, all of which work in both local and MapReduce mode


Script : Pig can run a script file that contains Pig commands. For example, pig
script.pig runs the commands in the local file script.pig
$ pig script.pig

Grunt : Grunt is an interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using run and exec.


Embedded :
You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java.

PigPen is an Eclipse plug-in that provides an environment for developing Pig programs.

PigTools and EditorPlugins for pig can be downloaded from PigTools

Example of Pig in Interactive Mode (Grunt)

max_cgpa.pig


-- max_cgpa.pig: Finds the maximum cgpa of a user

records = LOAD 'pigsample.txt'
AS (name:chararray, spl:chararray, cgpa:float);
filtered_records = FILTER records BY cgpa > 0 AND cgpa < 10;
grouped_records = GROUP filtered_records BY spl;
max_cgpa = FOREACH grouped_records GENERATE group, MAX(filtered_records.cgpa);
STORE max_cgpa INTO 'output/cgpa_out';

The above Pig script finds the maximum CGPA for each specialization.

pigsample.txt (input to Pig)

raghu     ece     9
kumar    cse      8.5
biju       ece      8
mukul    cse      8.6
ashish   ece      7.0
subha    cse      8.3
ramu     ece     -8.3
rahul     cse      11.4
budania ece      5.4

The first column is the name, the second the specialization, and the third the CGPA; by default the columns are separated by tabs.

$ pig max_cgpa.pig

Output : 

(cse,8.6F)
(ece,9.0F)

Analysis : 

Statement : 1
records = LOAD 'pigsample.txt'AS (name:chararray, spl:chararray, cgpa:float);

This loads the input file from the file system (HDFS, local, or Amazon S3). The name:chararray notation describes the field's name and type; chararray is like a Java String, and a float is like a Java float.

grunt> DUMP records;

(raghu,ece,9.0F)
(kumar,cse,8.5F)
(biju,ece,8.0F)
(mukul,cse,8.6F)
(ashish,ece,7.0F)
(subha,cse,8.3F)
(ramu,ece,-8.3F)
(rahul,cse,11.4F)
(budania,ece,5.4F)

Each input row is converted into a tuple, with the fields separated by commas in the dumped output.

grunt> DESCRIBE records;
records: {name: chararray,spl: chararray,cgpa: float}

Statement : 2
filtered_records = FILTER records BY cgpa > 0 AND cgpa < 10;

grunt> DUMP filtered_records;

This keeps only the records where the CGPA is between 0 and 10 (exclusive), filtering out the negative and out-of-range values:

(raghu,ece,9.0F)
(kumar,cse,8.5F)
(biju,ece,8.0F)
(mukul,cse,8.6F)
(ashish,ece,7.0F)
(subha,cse,8.3F)
(budania,ece,5.4F)

grunt> DESCRIBE filtered_records;
filtered_records: {name: chararray,spl: chararray,cgpa: float}

Statement : 3

The third statement uses the GROUP operator to group the filtered_records relation by the specialization field.

grouped_records = GROUP filtered_records BY spl;

grunt> DUMP  grouped_records ;

(cse,{(kumar,cse,8.5F),(mukul,cse,8.6F),(subha,cse,8.3F)})
(ece,{(raghu,ece,9.0F),(biju,ece,8.0F),(ashish,ece,7.0F),(budania,ece,5.4F)})

grunt> DESCRIBE  grouped_records;
grouped_records: {group: chararray,filtered_records: {name: chararray,spl: chararray,cgpa: float}}

We now have two rows, or tuples, one for each specialization in the input data. The first field in each tuple is the field being grouped by (the specialization), and the second field is a bag of tuples
for that  specialization. A bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces.

By grouping the data in this way, we have created a row per  specialization , so now all that remains is to find the maximum cgpa for the tuples in each bag.

Statement : 4


max_cgpa = FOREACH grouped_records GENERATE group,
MAX(filtered_records.cgpa);

FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row. In this example, the first field is group, which is just the specialization. The second field is a little more complex.

The filtered_records.cgpa reference is to the cgpa field of the
filtered_records bag in the grouped_records relation. MAX is a built-in function for calculating the maximum value of fields in a bag. In this case, it calculates the maximum cgpa for the fields in each filtered_records bag.

grunt> DUMP    max_cgpa  ;

(cse,8.6F)
(ece,9.0F)

grunt> DESCRIBE    max_cgpa  ;

max_cgpa : {group: chararray,float}

Statement : 5

STORE max_cgpa INTO 'output/cgpa_out'

This command writes the output of the script to a directory (local or HDFS) instead of printing it to the console.
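
To look at the stored result afterwards (a minimal sketch, assuming the default part-file naming):

# local mode
$ cat output/cgpa_out/part-*

# MapReduce mode (output on HDFS)
$ hadoop fs -cat output/cgpa_out/part-*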

we’ve successfully calculated the maximum cgpa for each specialization.

With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably complete and concise sample dataset.
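
For example, running it on the final relation produces the sample tables shown below:

grunt> ILLUSTRATE max_cgpa;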


--------------------------------------------------------------------
| records     | name: bytearray | spl: bytearray | cgpa: bytearray | 
--------------------------------------------------------------------
|             | kumar           | cse            | 8.5             | 
|             | mukul           | cse            | 8.6             | 
|             | ramu            | ece            | -8.3            | 
--------------------------------------------------------------------
----------------------------------------------------------------
| records     | name: chararray | spl: chararray | cgpa: float | 
----------------------------------------------------------------
|             | kumar           | cse            | 8.5         | 
|             | mukul           | cse            | 8.6         | 
|             | ramu            | ece            | -8.3        | 
----------------------------------------------------------------
-------------------------------------------------------------------------
| filtered_records     | name: chararray | spl: chararray | cgpa: float | 
-------------------------------------------------------------------------
|                      | kumar           | cse            | 8.5         | 
|                      | mukul           | cse            | 8.6         | 
-------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------
| grouped_records     | group: chararray | filtered_records: bag({name: chararray,spl: chararray,cgpa: float}) | 
----------------------------------------------------------------------------------------------------------------
|                     | cse              | {(kumar, cse, 8.5), (mukul, cse, 8.6)}                              | 
----------------------------------------------------------------------------------------------------------------
-------------------------------------------
|  max_cgpa   | group: chararray | float | 
-------------------------------------------
|              | cse              | 8.6   |


EXPLAIN max_cgpa  

Use the above command to see the logical and physical plans created by Pig.
