Thursday, June 21, 2012

HADOOP INSTALLATION ON LINUX



In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.
  
                                                                             —Grace Hopper


We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006 and forecast a tenfold growth by 2011, to 1.8 zettabytes. A zettabyte is 10²¹ bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s roughly the same order of magnitude as one disk drive for every person in the world.


Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.




The core Hadoop projects are described briefly here:

Common
A set of components and interfaces for distributed filesystems and general I/O
(serialization, Java RPC, persistent data structures).


Avro
A serialization system for efficient, cross-language RPC and persistent data
storage.

MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines.

HDFS
A distributed filesystem that runs on large clusters of commodity machines.

Pig
A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates into MapReduce jobs, for querying the data.

HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying
storage, and supports both batch-style computations using MapReduce and point queries (random reads).

ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.


Sqoop
A tool for efficient bulk transfer of data between structured data stores (such as relational databases) and HDFS.

Oozie
A service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs).

Download a stable release, which is packaged as a gzipped tar file, from one of the Apache download mirrors (http://www.apache.org/dyn/closer.cgi/hadoop/common/).

At the time of writing, Hadoop 2.0.0 is the latest version (hadoop-2.0.0-alpha.tar.gz); it can be downloaded from http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-2.0.0-alpha/.

Unpack the file:

% tar xzf hadoop-2.0.0-alpha.tar.gz

Set the JAVA_HOME and HADOOP_INSTALL environment variables. Hadoop is written in Java, so it needs to know the location of your Java installation.


% export HADOOP_INSTALL=/usr/rjuluri/HADOOP/hadoop-2.0.0-alpha

% export JAVA_HOME=/usr/rjuluri/middleware/Jdev11.1.3/jdk160_18

% export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
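
To make these settings persist across login sessions, the same exports can be added to your shell startup file (a sketch, assuming a bash shell; the paths above are specific to this machine and should be adjusted to your own installation):

% cat >> ~/.bashrc <<'EOF'
export HADOOP_INSTALL=/usr/rjuluri/HADOOP/hadoop-2.0.0-alpha
export JAVA_HOME=/usr/rjuluri/middleware/Jdev11.1.3/jdk160_18
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
EOF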

To verify the installation, run the following command:

% hadoop version

Hadoop 2.0.0-alpha
Subversion http://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.0.0-alpha/hadoop-common-project/hadoop-common -r 1338348
Compiled by hortonmu on Wed May 16 01:28:50 UTC 2012
From source with checksum 954e3f6c91d058b06b1e81a02813303f

Hadoop can be run in one of three modes:

Standalone (or local) mode
There are no daemons running and everything runs in a single JVM. Standalone
mode is suitable for running MapReduce programs during development, since it
is easy to test and debug them.

Component   Property                       Standalone mode
Common      fs.default.name                file:/// (default)
HDFS        dfs.replication                N/A
YARN        yarn.resourcemanager.address   N/A
In standalone mode, there is no further action to take, since the default properties are set for standalone mode and there are no daemons to run.
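
As a quick check of standalone mode, you can run one of the example MapReduce jobs that ship with the distribution against some local files (a sketch: the examples jar path assumes the Hadoop 2.x tarball layout, and the exact jar name depends on your version):

% mkdir input
% cp $HADOOP_INSTALL/etc/hadoop/*.xml input
% hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
% cat output/*

This runs the grep example in a single local JVM, reading from and writing to the local filesystem.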

Pseudodistributed mode
The Hadoop daemons run on the local machine, thus simulating a cluster on a
small scale.

Component   Property                       Pseudodistributed mode
Common      fs.default.name                hdfs://localhost/
HDFS        dfs.replication                1
YARN        yarn.resourcemanager.address   localhost:8032

Modify the configuration files under the etc/hadoop directory of the Hadoop installation to set the following properties:

Common:
fs.default.name = hdfs://localhost/

HDFS:
dfs.replication = 1

YARN:
yarn.resourcemanager.address = localhost:8032
yarn.nodemanager.aux-services = mapreduce.shuffle

core-site.xml:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

hdfs-site.xml:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

If you are running MapReduce 1, use the mapred-site.xml file:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

If you are running YARN, use the yarn-site.xml file:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
</configuration>






Make sure that SSH is installed and that an SSH server is running. Then, to enable password-less login, generate a new SSH key with an empty passphrase:

% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Test this with:

% ssh localhost
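
Before starting the daemons for the first time, format a new HDFS filesystem (a one-time step for a fresh installation; it erases any data in HDFS's configured storage directories):

% hdfs namenode -format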

To start the HDFS and YARN daemons, type:

% start-dfs.sh
% start-yarn.sh

These commands start the HDFS daemons and, for YARN, a resource manager and a node manager. The resource manager web UI is at http://localhost:8088/.
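
As a quick smoke test of the pseudodistributed cluster, you can create a directory in HDFS and list the filesystem root (the directory name here is just an example):

% hadoop fs -mkdir /user
% hadoop fs -ls /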

You can stop the daemons with:

% stop-dfs.sh
% stop-yarn.sh



Fully distributed mode
The Hadoop daemons run on a cluster of machines.

Component   Property                       Fully distributed mode
Common      fs.default.name                hdfs://namenode/
HDFS        dfs.replication                3 (default)
YARN        yarn.resourcemanager.address   resourcemanager:8032




1 comment:

rashmi said...

In a cluster environment, there would be multiple namenode machines to use federation and HA. What would be the HDFS host and port? Will it be the same for all namenodes?

How would a client connect to an HDFS host in this cluster?
