@raghujuluri: Cloudera Impala Overview

Cloudera, a provider of Apache Hadoop solutions for the enterprise, recently announced the general availability of Cloudera Impala, its open-source, interactive SQL query engine for analyzing data stored in Hadoop clusters in real time.
Cloudera claims to have been first to market with a SQL-on-Hadoop offering, releasing Impala to open source as a public beta offering in October 2012. Since that time, the company has worked closely with customers and open-source users, testing and refining the platform in real-world applications to deliver a production-hardened and customer-validated release, designed from the ground up for enterprise workloads, said Mike Olson, CEO of Cloudera.

As Cloudera Engineering team described in their blog, their work was inspired by Google Dremel paper which is also the basis for Google BigQuery. Cloudera Impala provides a HiveQL-like query language for wide variety of SELECT statements with WHERE, GROUP BY, HAVING clauses, with ORDER BY – though currently LIMIT is mandatory with ORDER BY -, joins (LEFT, RIGTH, FULL, OUTER, INNER), UNION ALL, external tables, etc. It also supports arithmetic and logical operators and Hive built-in functions such as COUNT, SUM, LIKE, IN or BETWEEN. It can access data stored on HDFS but it does not use mapreduce, instead it is based on its own distributed query engine.

The current Impala release (Impala 1.0beta) does not support DDL statements (CREATE, ALTER, DROP TABLE), all the table creation/modification/deletion functions have to be executed via Hive and then refreshed in Impala shell.

Cloudera Impala is open-source under Apache Licence, the code can be retrieved fromGithub. Its components are written in C++, Java and Python.

Commercial Hadoop distributor Cloudera is first out of the gate with a true SQL layer that sits atop Hadoop.

It lets normal people – if you can call people who've mastered SQL normal – perform ad hoc queries in real time against information crammed into the Hadoop Distributed File System or the HBase database that rides atop it.

Every Hadoop distie dev is working on their own standards-compliant SQL interface for HDFS and HBase, and that's because these are inherently batch-mode systems, like mainframes in the 1960s and 1970s. Hadoop is a perfectly acceptable platform into which you can pull vast sums and run algorithms against to make correlations between different data sets to do fraud detection, risk analysis, web page ad serving, and any other number of jobs.

But if you want to do a quick random query of all or part of a dataset residing out on HDFS or organized in HBase tables, you have to wait.

It took a while for interactive capabilities to be added to mainframes, and it's no surprise that it has taken many years and much complaining to transform the Hadoop stack from batch to interactive mode – but now there are a number of competing methods available.

Cloudera has been working on the Project Impala SQL query engine for HDFS and HBase for years, and formally launched the project at Hadoop World last October when it entered an open beta program in which nearly 50 customers put Impala through some serious paces. Mike Olson, CEO at Cloudera, tells El Reg that there were over 1,000 unique customers who downloaded Impala in the past six months to see how it works, and after all of that testing, Cloudera is ready to sell support services to customers as they put it into production on their big data munchers.

Olson dissed the other methods his rivals have come up to add true SQL functionality to Hadoop.

"Many of the methods we see are rear-guard actions to try to preserve legacy approaches," Olson told us. "This is not really just about putting SQL on Hadoop, but rather making one big-data repository and allowing access to that data in a lot of different ways."

Cloudera went right to the source to get some help to create the Impala real-time SQL engine for HDFS and HBase: Google. Or more precisely, Cloudera hired Marcel Kornacker, one of the architects of the query engine in Google's F1 fault tolerant, distributed relational database that now underpins its AdWords ad serving engine. Impala also borrows heavily from Google's Dremel ad hoc query tool, which has been cloned as the Apache Drill project.

This is the way it works on the modern Internet: Google invents something, publishes a paper, and the Hadoop community clones it. In a way, this is how Google gives back to the open source community, merely by proving to smart people that something can be done so they will imitate it.

Hadoop's MapReduce batch algorithm for chewing on large data sets and its HDFS are riffs on ideas that were part of Google's search engine infrastructure years ago. HBase is a riff on Google's BigTable distributed database overlay for its original Google File System, which has been replaced by a much more elegant solution called Spanner.

In Cloudera's case with Impala, the company not only got the idea for Impala from the F1 and Dremel papers, but hired a Googler to clone many of its core ideas.

The Hadoop stack not only has the HBase distributed tabular data store, but also the Hive query tool, which has an SQL-like query function. But in business, SQL-like don't cut it. What companies – the kind who will pay thousands and thousands of dollars per node for a support contract – want is actual SQL.

In this case, the Impala query engine is compliant with the ANSI-92 SQL standard. What that means is that most tools that use Microsoft's ODBC query tool are compliant with Impala, and marketeers and other line-of-business people who use SQL queries to do their jobs can therefore query data in Hadoop without having to learn to create MapReduce algorithms in Java.

"Large customers running Hadoop today have tens of users who know how to do MapReduce," says Justin Erickson, senior product manager at Cloudera. "But there are a lot more SQL users, on the order of hundreds to thousands, and they want access to the data stored in Hadoop in a way that is familiar to them."

And not surprisingly, as is the case on mainframes even today, there is a mix of batch and interactive workloads running on these behemoths, and this is exactly what Olson expects to see happen eventually on Hadoop systems thanks to tools like Impala.

Based on his inside knowledge of the workloads at Facebook and Google, Olson says that these organizations have already passed the point where the total number of cycles burned on SQL interactive jobs exceeds MapReduce batch jobs. MapReduce will never go to zero, just like batch jobs have not gone away in the data center – you still need to generate and print statements, bills, and other business documents – they are not going away in Hadoop clusters.

The Impala tool replaces bits of the HBase tabular data store and the Hive query tool without breaking API compatibility but radically speeding up the processing of ad hoc queries and providing ANSI-92 SQL compliance. Hive uses MapReduce, which is why it is so damned slow, but Impala has a new massively parallel distributed database. Customers can use Impala without changing their HDFS or HBase file formats, and because it is open source, you can download it and not pay Cloudera a dime. It will snap right into the Apache Hadoop distribution if you want to roll your own big-data muncher.

While having ad hoc SQL query capability that happens in real time is important, there is more to it than that – this is all about money, as you might expect. "If you are spending tens of thousands of dollars per terabyte in a data warehouse," explains Olson, "you have an imperative to actually throw data away. Impala and Hadoop will let you do it for hundreds of dollars per terabyte, and you never have to throw anything away."

Cloudera is unnecessarily cagey about pricing for its Hadoop tools, but here is how it works. The Cloudera Distribution for Hadoop is open source, and that includes the code for Impala. If you want support and the closed-source Cloudera Manager tool, then you have to pay more. The base Cloudera Manager tool cost $2,600 per server node per year, including 24x7 support, but Cloudera says it is going to split up the functionality for Cloudera Manager into Core (for MapReduce), Real Time Data (for HBase), and Real Time Query (for Impala) data-extraction methods. Cloudera won't say what Core and RTD cost, but RTQ, which manages Impala and provides the support contract, costs $1,500 per node per year

@raghujuluri

Sunday, May 05, 2013

Cloudera Impala Overview

1 comment:

Blog Archive

About Me

Popular Posts