Pages

Wednesday, 3 February 2016

What is Cloudera Impala ? Impala vs Hive

Cloudera Impala is an open source, and one of the leading analytic massively parallelprocessing (MPP) SQL query engine that runs natively in Apache Hadoop. Cloudera Impala project was announced in October 2012 and after successful beta test distribution and became generally available in May 2013.Its preferred users are analysts doing ad-hoc queries over the massive data sets stored in Hadoop.

The main feature of Impala is that with Impala we can run low-latency Adhoc SQL queries directly on the data stored in a cluster, stored either in unstructured flat files in the file system, or in structured HBase tables without requiring data movement or transformation. Performance is increased due to the fact that we need not migrate data sets to dedicated processing systems or convert data formats prior to analysis.

Another important feature of Impala is that it is workable to the data formats metadata, security and resource management frameworks used by Map Reduce, Apache Hive, 
Apache Pig and other components of the Hadoop stack.

Impala also supports all Hadoop file formats, including new format Apache Parquet. Apache Parquet is a columnar storage format for the Hadoop ecosystem created with advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.


Impala queries are executed as follows:


  • Queries are submitted using Impala-shell command-line tool, or from a business application through an ODBC or JDBC driver.
  • Impala distributed query engine builds and distributes the query plan across the cluster.
  • It runs separate Impala Daemon (impalad) which runs on data nodes and responds to impala shell. These daemons can return data quickly without having to go through a whole Map/Reduce job.
  • Impalad is a process that runs on designated nodes in the cluster. It coordinates and runs queries.


Comparison With Hive 



When we compare to Hive and MapReduce ,both optimized for long running batch-oriented tasks such as 
ETL(Read more:What is ETL), Impala is more compatible for running interactive analytical SQL queries over small amounts of a huge data. What makes it different form HIVE is that Impala does not rely on Map Reduce, it avoids the start-up overhead of Map Reduce jobs and instead uses its own t’s own set of execution daemons which need to be installed alongside your data nodes. 
Hive in Hadoop ecosystem is intended for a data warehouse system to support with easy data aggregations, adhoc queries over large datasets which are stored in Hadoop HDFS file systems whereas Cloudera Impala is a query engine for data stored in HDFS and HBase. 
Both Hive and Impala supports the existing HiveQL(a SQL-like language-Read more on :Apache Hive ).

Because Impala and Hive share the same metastore database and their tables are often used interchangeably. This cross-compatibility applies to Hive tables that use Impala-compatible types for all columns.

Partitions in Impala 


As in large scale Data warehouse how we make use of partitioned tables (Read more on: Partitions in Oracle ) to speed up queries, the same way in Impala we make use of Partitioned tables. Data is partitioned based on values in one column and instead of looking up one row at a time from widely scattered items, the rows with identical partition keys are physically grouped together. Impala also takes advantage of the partitioning present in Hive tables.


Cloudera Impala makes use of the following two technologies
  • Columnar Storage:  Since data stored in columnar fashion it gives high compression ratio and efficient scanning.
  • Tree Architecture: The architecture forms a massively parallel distributed multi-level serving tree for pushing down a query to the tree and then aggregating the results from the leaves.

Impala provides the following benefits:

  • Efficient resource usage: Impala can handle concurrent client requests in shared workload environment. Each Impala daemon can handle multiple concurrent client requests
  • Impala doesn't provide fault-tolerance compared to Hive. Just in case the node fails in the middle of processing, the whole query has to be re-run. But Impala has the advantage that even if node fails and we start over, its total runtime is so fast that it will accomplish for the time loss.
  • Time savings because you do not have to move around data and Impala does not write the intermediate results to disk.
  • Supports Hadoop Security (Kerberos authentication) and role-based authorization through the Apache Sentry project.
  • Far-reaching accessibility of Hadoop data to the business community.
  • More complete analysis of full raw and historical data, without information loss from aggregations or conforming to fixed schemas.


If you like this post, please share it on google by clicking on the Google +1 button.


Please go through our latest post TOP 6 BIG DATA TRENDS IN THE NEAR FUTURE

2 comments:

  1. so in conclusion, overall, Impala is better than Hive ?

    ReplyDelete
    Replies
    1. It depends on the use case. Hive is better for longer running data warehouse since Impala does not provide fault tolerance. Impala is much faster for adhoc analytics on reasonably sized datasets.

      Delete