Wednesday, May 29, 2013

How to disable password prompts in Ubuntu


Open a terminal window from Applications --> Accessories --> Terminal and run the command:
sudo visudo
Find the line that says
%admin ALL=(ALL) ALL
and change it to
%admin ALL=(ALL) NOPASSWD: ALL
Save and exit the file
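
To verify the change, a quick check (a sketch, assuming your user belongs to the admin group covered by the %admin rule) is to drop any cached sudo credentials and run a harmless command; with NOPASSWD in place there should be no password prompt:

sudo -k          # forget any cached sudo credentials
sudo whoami      # should print "root" without asking for a password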

HIVE EXTERNAL TABLE



Hive is a data warehousing infrastructure built on top of Hadoop that provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

The difference between normal tables and external tables can be seen in the LOAD and DROP operations.

Normal Tables: Hive manages the normal tables created and moves the data into its warehouse directory.

As an example, consider the table creation and loading of data into the table.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;

LOAD DATA INPATH '/user/hduser/sampledata.txt' INTO TABLE page_view;

By default, when the data file is loaded, the directory /user/${USER}/warehouse/page_view is created automatically.

This LOAD will move the file sampledata.txt from HDFS into Hive's warehouse directory for the table. If the table is dropped, then both the table metadata and the data will be deleted.

External Tables: An external table refers to the data that is outside of the warehouse directory.
As an example, consider the table creation and loading of data into the external table.
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE
    LOCATION '/user/hduser/page_view';

LOAD DATA INPATH '/user/hduser/sampledata.txt' INTO TABLE page_view;

In the case of external tables, Hive does not move the data into its warehouse directory. If an external table is dropped, the table metadata is deleted but the data is left in place.

Note that Hive does not check whether the external table's location exists at the time the table is created.
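
To see the difference in practice, drop the external table and check that its data files survive (a sketch, assuming the paths used above and a working Hive/Hadoop setup; for a normal table the same DROP would also remove the files under the warehouse directory):

hive -e "DROP TABLE page_view;"
hadoop fs -ls /user/hduser/page_view     # data files are still present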

Tuesday, May 28, 2013

How to Configure Pidgin to Connect to Oracle Beehive Messaging Server

1. On the Pidgin menu bar, select Accounts, and then select either Add/Edit or Manage Accounts, depending on your version. The Accounts window opens.

2. Click Add.

3. On the Basic tab, choose XMPP from the Protocol list.

4. In the Screen Name box, enter your username. For example, if your e-mail address is abc.def@example.com, then you would enter abc.def as your username.

5. In the Domain field, enter your domain name. For example, if your e-mail address is abc.def@example.com, then you would enter example.com as your domain name. 

Resource: any name (for example, Laptop)

Password: enter the password for the user

6. Click the Advanced tab.

Connection Security: Use old-style SSL
Connection Port: 5223
Connect Server: beehive.example.com
File Transfer Proxies: beehive.example.com
Select the Show Custom Smileys check box.

7. On the third tab (Proxy), set Proxy type to No Proxy.
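
If the connection fails, it can help to first confirm that the server really accepts legacy ("old-style") SSL on port 5223. A quick check from a terminal (the server name is the example one used above):

openssl s_client -connect beehive.example.com:5223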

ldap_add: Invalid DN syntax

The "ldap_add: Invalid DN syntax" error occurs while trying to add/create a new user (from an LDIF file) in Oracle Internet Directory (OID).

LDIF file:

version: 1
DN: cn=udstest1,cn=users,dc=us,DC=ORACLE,DC=COM
objectclass: orclBeehive
objectclass: top
objectclass: person
objectclass: orcluserv2
objectclass: orclcorpperson
objectclass: organizationalperson
objectclass: inetorgperson
cn: udstest1
givenname: udstest1
mail: udstest1@us.oracle.com
orclbeehiveuserstatus: false
orclisenabled: ENABLED
sn: udstest1
uid: udstest1@us.oracle.com
manager: udstest2600
userpassword: Welcome1


The Invalid DN error occurs when trying to add the above LDIF in OID.

The solution is that the manager attribute must contain the full DN:

cn=udstest2600,cn=users,dc=us,dc=oracle,dc=com

Modified LDIF:
version: 1
DN: cn=udstest1,cn=users,dc=us,DC=ORACLE,DC=COM
objectclass: orclBeehive
objectclass: top
objectclass: person
objectclass: orcluserv2
objectclass: orclcorpperson
objectclass: organizationalperson
objectclass: inetorgperson
cn: udstest1
givenname: udstest1
mail: udstest1@us.oracle.com
orclbeehiveuserstatus: false
orclisenabled: ENABLED
sn: udstest1
uid: udstest1@us.oracle.com
manager: cn=udstest2600,cn=users,dc=us,dc=oracle,dc=com
userpassword: Welcome1
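
With the manager value corrected, the entry can be loaded with ldapadd. The host, port, bind DN, and password below are placeholders for your OID instance, and the LDIF is assumed to be saved as udstest1.ldif:

ldapadd -h oid.example.com -p 3060 -D "cn=orcladmin" -w <password> -f udstest1.ldif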

Monday, May 13, 2013

Hadoop: Remote Host Identification Has Changed error and solution


When you run Hadoop in pseudo-distributed mode, all of the daemons run on the local server (localhost), so you need to be able to SSH to localhost.


Then, to enable password-less login, generate a new SSH key with an empty passphrase:
         % ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
         % cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Test this with:
         % ssh localhost

If successful, you should not have to type in a password.
Sometimes you will see the error below; to resolve it, follow any one of the approaches described afterwards.



@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
5c:9b:16:56:a6:cd:11:10:3a:cd:1b:a2:91:cd:e5:1c.
Please contact your system administrator.
Add correct host key in /home/user/.ssh/known_hosts to get rid of this message.
Offending key in /home/user/.ssh/known_hosts:1
RSA host key for ras.mydomain.com has changed and you have requested strict checking.
Host key verification failed.

How do I get rid of this message?

If you have reinstalled Linux or UNIX (which regenerates the OpenSSH host keys), you will get the above error. To get rid of this problem:

Solution #1: Remove keys

Use the -R option to remove all keys belonging to a hostname from the known_hosts file. This option is also useful for deleting hashed host entries. If your remote hostname is server.example.com, enter:
$ ssh-keygen -R {server.name.com}
$ ssh-keygen -R {ssh.server.ip.address}
$ ssh-keygen -R server.example.com

$ ssh-keygen -R localhost


Sample output:

/home/vivek/.ssh/known_hosts updated.
Original contents retained as /home/vivek/.ssh/known_hosts.old
Now, you can connect to the host without a problem.
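
On the next connection, ssh will prompt you to accept the new host key. If you prefer to pre-populate it instead, ssh-keyscan can append it for you (the hostname below is the example one from above; verify the fingerprint out of band before trusting it):

$ ssh-keyscan -t rsa server.example.com >> ~/.ssh/known_hosts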

Solution #2: Add correct host key in /home/user/.ssh/known_hosts

It is not necessary to delete the entire known_hosts file, just the offending line in that file. For example, if you have 3 servers as follows:

myserver1.com,64.2.5.111 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA11FV0EnGahT2EK8qElocjuHTsu1jaCfxkyIgBTlxlrOIRchb2pw8IzJLOs2bcuYYfa8nSXGEcWyaFD1ifUjfHelj94AAAAB3NzaC1yc2EAAAABIwAAAIEA11FV0E
nGahT2EK8qElocjuHTsu1jaCfxkyIgBTlxlrOIRchb2pw8IzJLOs2bcuYYfa8nSXGEcWyaFD1ifUjfHelj94H+uv304/ZDz6xZb9ZWsdm+264qReImZzruAKxnwTo4dcHkgKXKHeefnBKyEvvp/2ExMV9WT5DVe1viVwk=

myserver2.com,125.1.12.5 ssh-rsa
AAAAB3NzaC1yc2EAAAABIwAAAQEAtDiERucsZzJGx/1kUNIOYhJbczbZHN2Z1gCnTjvO/0mO2R6KiQUP4hOdLppIUc9GNvlp1kGc3w7B9tREH6kghXFiBjrIn6VzUO4uwrnsMbnAnscD5EktgI7fG4ZcNUP 5+J7sa3o+rtmOuiFxCA690DXUJ8nX8yDHaJfzMUTKTGxQz4M/H2P8L2R//qLj5s3ofzNmgSM9lSEhZL/IyI4NxHhhpltYZKW/Qz4M/H2P8L2R//qLj5s3ofzNmgSM9lSEhZL/M7L0vKeTObue1SgAsXADtK3162a/Z6MGnAazIviHBldxtGrFwvEnk82+GznkO3IBZt5vOK2heBnqQBfw=

myserver3.com,125.2.1.15 ssh-rsa
5+J7sa3o+rtmOuiFxCA690DXUJ8nX8yDHaJfzMUTKTGx0lVkphVsvYD5hJzm0eKHv+oUXRT9v+QMIL+um/IyI4NxHhhpltYZKW
as3533dka//sd33433////44632Z6MGnAazIviHBldxtGrFwvEnk82/Qz4M/H2P8L2R//qLj5s3ofzNmgSM9lSEhZL/M7L0vKeTObue1SgAsXADtK3162a/Z6MGnAazIviHBldxtGrFwvEnk82+GznkO3IBZt5vOK2heBnqQBfw==

To delete the 2nd server (myserver2.com), open the file at that line:
# vi +2 .ssh/known_hosts

and press dd to delete the line. Save and close the file. Or use the following:
$ vi ~/.ssh/known_hosts

Now go to line 2 by typing the following command:
:2

Now delete the line with dd and exit:
dd
:wq
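
If you prefer a non-interactive alternative to vi, a sed one-liner removes the same line (here line 2; adjust the number to match the offending line reported in the warning):

$ sed -i '2d' ~/.ssh/known_hosts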

Solution #3: Just delete the known_hosts file (if you have only ever connected to one SSH server)

$ cd
$ rm .ssh/known_hosts
$ ssh ras.mydomain.com

Now you should be able to connect to your server via SSH.

Sunday, May 05, 2013

Cloudera Impala Overview




Cloudera, a provider of Apache Hadoop solutions for the enterprise, recently announced the general availability of Cloudera Impala, its open-source, interactive SQL query engine for analyzing data stored in Hadoop clusters in real time.
Cloudera claims to have been first to market with a SQL-on-Hadoop offering, releasing Impala to open source as a public beta offering in October 2012. Since that time, the company has worked closely with customers and open-source users, testing and refining the platform in real-world applications to deliver a production-hardened and customer-validated release, designed from the ground up for enterprise workloads, said Mike Olson, CEO of Cloudera.

As the Cloudera engineering team described in their blog, their work was inspired by the Google Dremel paper, which is also the basis for Google BigQuery. Cloudera Impala provides a HiveQL-like query language for a wide variety of SELECT statements with WHERE, GROUP BY and HAVING clauses, ORDER BY (though currently LIMIT is mandatory with ORDER BY), joins (LEFT, RIGHT, FULL, OUTER, INNER), UNION ALL, external tables, etc. It also supports arithmetic and logical operators and Hive built-in functions such as COUNT, SUM, LIKE, IN or BETWEEN. It can access data stored on HDFS, but it does not use MapReduce; instead it is based on its own distributed query engine.
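
As a small illustration of the ORDER BY/LIMIT restriction mentioned above (the table and column names simply reuse the page_view example from the Hive post and are only for illustration), a query could be run non-interactively from impala-shell like this:

$ impala-shell -q "SELECT page_url, COUNT(*) AS views FROM page_view WHERE country = 'US' GROUP BY page_url ORDER BY views DESC LIMIT 10;"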

The current Impala release (Impala 1.0beta) does not support DDL statements (CREATE, ALTER, DROP TABLE); all table creation/modification/deletion has to be executed via Hive and then refreshed in the Impala shell.
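
A minimal sketch of that workflow (the table name page_view_copy is made up, and the exact metadata-refresh statement varies by Impala version; later releases use INVALIDATE METADATA for tables newly created in Hive):

$ hive -e "CREATE TABLE page_view_copy (viewTime INT, userid BIGINT, page_url STRING);"
$ impala-shell -q "REFRESH page_view_copy;"    # make the new table visible to Impala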

Cloudera Impala is open source under the Apache License; the code can be retrieved from GitHub. Its components are written in C++, Java and Python.

Commercial Hadoop distributor Cloudera is first out of the gate with a true SQL layer that sits atop Hadoop.

It lets normal people – if you can call people who've mastered SQL normal – perform ad hoc queries in real time against information crammed into the Hadoop Distributed File System or the HBase database that rides atop it.

Every Hadoop distie dev is working on their own standards-compliant SQL interface for HDFS and HBase, and that's because these are inherently batch-mode systems, like mainframes in the 1960s and 1970s. Hadoop is a perfectly acceptable platform into which you can pull vast amounts of data and run algorithms against it to make correlations between different data sets for fraud detection, risk analysis, web page ad serving, and any number of other jobs.

But if you want to do a quick random query of all or part of a dataset residing out on HDFS or organized in HBase tables, you have to wait.

It took a while for interactive capabilities to be added to mainframes, and it's no surprise that it has taken many years and much complaining to transform the Hadoop stack from batch to interactive mode – but now there are a number of competing methods available.

Cloudera has been working on the Project Impala SQL query engine for HDFS and HBase for years, and formally launched the project at Hadoop World last October when it entered an open beta program in which nearly 50 customers put Impala through some serious paces. Mike Olson, CEO at Cloudera, tells El Reg that there were over 1,000 unique customers who downloaded Impala in the past six months to see how it works, and after all of that testing, Cloudera is ready to sell support services to customers as they put it into production on their big data munchers.

Olson dissed the other methods his rivals have come up with to add true SQL functionality to Hadoop.

"Many of the methods we see are rear-guard actions to try to preserve legacy approaches," Olson told us. "This is not really just about putting SQL on Hadoop, but rather making one big-data repository and allowing access to that data in a lot of different ways."

Cloudera went right to the source to get some help to create the Impala real-time SQL engine for HDFS and HBase: Google. Or more precisely, Cloudera hired Marcel Kornacker, one of the architects of the query engine in Google's F1 fault tolerant, distributed relational database that now underpins its AdWords ad serving engine. Impala also borrows heavily from Google's Dremel ad hoc query tool, which has been cloned as the Apache Drill project.

This is the way it works on the modern Internet: Google invents something, publishes a paper, and the Hadoop community clones it. In a way, this is how Google gives back to the open source community, merely by proving to smart people that something can be done so they will imitate it.

Hadoop's MapReduce batch algorithm for chewing on large data sets and its HDFS are riffs on ideas that were part of Google's search engine infrastructure years ago. HBase is a riff on Google's BigTable distributed database overlay for its original Google File System, which has been replaced by a much more elegant solution called Spanner.

In Cloudera's case with Impala, the company not only got the idea for Impala from the F1 and Dremel papers, but hired a Googler to clone many of its core ideas.

The Hadoop stack not only has the HBase distributed tabular data store, but also the Hive query tool, which has an SQL-like query function. But in business, SQL-like don't cut it. What companies – the kind who will pay thousands and thousands of dollars per node for a support contract – want is actual SQL.

In this case, the Impala query engine is compliant with the ANSI-92 SQL standard. What that means is that most tools that use Microsoft's ODBC query tool are compliant with Impala, and marketeers and other line-of-business people who use SQL queries to do their jobs can therefore query data in Hadoop without having to learn to create MapReduce algorithms in Java.

"Large customers running Hadoop today have tens of users who know how to do MapReduce," says Justin Erickson, senior product manager at Cloudera. "But there are a lot more SQL users, on the order of hundreds to thousands, and they want access to the data stored in Hadoop in a way that is familiar to them."

And not surprisingly, as is the case on mainframes even today, there is a mix of batch and interactive workloads running on these behemoths, and this is exactly what Olson expects to see happen eventually on Hadoop systems thanks to tools like Impala.

Based on his inside knowledge of the workloads at Facebook and Google, Olson says that these organizations have already passed the point where the total number of cycles burned on SQL interactive jobs exceeds MapReduce batch jobs. MapReduce will never go to zero. Just as batch jobs have not gone away in the data center (you still need to generate and print statements, bills, and other business documents), MapReduce jobs are not going away in Hadoop clusters.

The Impala tool replaces bits of the HBase tabular data store and the Hive query tool without breaking API compatibility but radically speeding up the processing of ad hoc queries and providing ANSI-92 SQL compliance. Hive uses MapReduce, which is why it is so damned slow, but Impala has a new massively parallel distributed database. Customers can use Impala without changing their HDFS or HBase file formats, and because it is open source, you can download it and not pay Cloudera a dime. It will snap right into the Apache Hadoop distribution if you want to roll your own big-data muncher.

While having ad hoc SQL query capability that happens in real time is important, there is more to it than that – this is all about money, as you might expect. "If you are spending tens of thousands of dollars per terabyte in a data warehouse," explains Olson, "you have an imperative to actually throw data away. Impala and Hadoop will let you do it for hundreds of dollars per terabyte, and you never have to throw anything away."

Cloudera is unnecessarily cagey about pricing for its Hadoop tools, but here is how it works. The Cloudera Distribution for Hadoop is open source, and that includes the code for Impala. If you want support and the closed-source Cloudera Manager tool, then you have to pay more. The base Cloudera Manager tool costs $2,600 per server node per year, including 24x7 support, but Cloudera says it is going to split up the functionality for Cloudera Manager into Core (for MapReduce), Real Time Data (for HBase), and Real Time Query (for Impala) data-extraction methods. Cloudera won't say what Core and RTD cost, but RTQ, which manages Impala and provides the support contract, costs $1,500 per node per year.
