Cloudera Impala is an open source massively parallel processing (MPP) SQL query
engine, and one of the leading analytic engines of its kind, that runs natively
in Apache Hadoop. The Cloudera Impala project was announced in October 2012 and,
after a successful beta distribution, became generally available in May 2013.
Its intended users are analysts running ad hoc queries over the massive data
sets stored in Hadoop.
Impala's main feature is that it runs low-latency ad hoc SQL queries directly on data stored in a cluster, whether in unstructured flat files in the file system or in structured HBase tables, without requiring data movement or transformation. Performance improves because data sets do not need to be migrated to dedicated processing systems or converted to another format before analysis.
Another important feature is that Impala integrates with the data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other components of the Hadoop stack.
Impala also supports the common Hadoop file formats, including the newer
Apache Parquet format. Apache Parquet is a columnar storage format for the
Hadoop ecosystem, created to make compressed, efficient columnar data
representation available to any project in the ecosystem, regardless of
the choice of data processing framework, data model, or programming language.
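To make the idea of columnar storage concrete, here is a minimal Python sketch. It is not Parquet's actual on-disk encoding; the sample table, the column names, and the helper functions are all hypothetical and only illustrate why a column layout lets a query read a single column and why runs of repeated values compress well.

```python
from itertools import groupby

# Hypothetical sample table with columns (user_id, country, amount).
rows = [
    (1, "US", 10.0),
    (2, "US", 12.5),
    (3, "US", 7.0),
    (4, "IN", 3.0),
    (5, "IN", 4.5),
]


def to_columns(rows, names):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, ["user_id", "country", "amount"])

# An aggregate over one column touches only that column's values.
total_amount = sum(columns["amount"])

# Similar values sit next to each other in a column, so they compress well.
encoded_country = run_length_encode(columns["country"])

print(total_amount)
print(encoded_country)
```

Run-length encoding is used here only as a simple stand-in for the kinds of encodings a columnar format can apply.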
Impala queries are executed as follows:
- Queries are submitted using the impala-shell command-line tool, or from a business application through an ODBC or JDBC driver.
- Impala's distributed query engine builds the query plan and distributes it across the cluster.
- A separate Impala daemon (impalad) runs on each data node and responds to the Impala shell. These daemons can return data quickly without having to go through a whole MapReduce job.
- Impalad is a process that runs on designated nodes in the cluster. It coordinates and runs queries.
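The steps above can be sketched as a toy scatter-gather simulation in Python. This is not Impala's code or API: the node names, the data, and the functions `run_fragment` and `coordinate_query` are hypothetical, and threads stand in for daemons running on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data blocks, keyed by the node that stores them.
BLOCKS_BY_NODE = {
    "node1": [4, 8, 15],
    "node2": [16, 23],
    "node3": [42],
}


def run_fragment(node):
    """Stand-in for a daemon scanning its local data: returns (count, sum)."""
    values = BLOCKS_BY_NODE[node]
    return len(values), sum(values)


def coordinate_query():
    """Stand-in for the coordinator: fan out fragments, merge partial results."""
    nodes = sorted(BLOCKS_BY_NODE)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(run_fragment, nodes))
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_count, total_sum, total_sum / total_count


count, total, average = coordinate_query()
print(count, total, average)
```

Each simulated daemon returns only a small partial result (a count and a sum), so the coordinator can compute the overall average without pulling the raw values to one place.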
Comparison With Hive
Compared with Hive and MapReduce, both of which are optimized for long-running, batch-oriented tasks such as ETL, Impala is better suited to interactive analytical SQL queries over small subsets of a very large data set. What makes it different from Hive is that Impala does not rely on MapReduce: it avoids the start-up overhead of MapReduce jobs and instead uses its own set of execution daemons, which need to be installed alongside your data nodes.
Hive, in the Hadoop
ecosystem, is intended as a data warehouse system that supports easy data
aggregation and ad hoc queries over large data sets stored in HDFS,
whereas Cloudera Impala is a query engine for data stored in HDFS
and HBase.
Because Impala and Hive share the same metastore database, their tables are often used interchangeably. This cross-compatibility applies to Hive tables that use Impala-compatible types for all columns.
Partitions in Impala
Cloudera Impala makes use of the following two technologies:
- Columnar Storage: Because data is stored in columnar fashion, it achieves a high compression ratio and efficient scanning.
- Tree Architecture: The architecture forms a massively parallel, distributed, multi-level serving tree that pushes a query down the tree and then aggregates the results from the leaves.
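As a rough illustration of a multi-level serving tree, the Python sketch below pushes a filter down to the leaves and sums the partial counts on the way back up. The `Node` class, the tree shape, and the sample values are hypothetical and say nothing about how Impala lays out its own execution tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    local_values: List[int] = field(default_factory=list)  # data held by a leaf
    children: List["Node"] = field(default_factory=list)   # empty for a leaf


def execute(node, predicate):
    """Push the predicate down the tree; return the count of matching values."""
    if not node.children:
        # Leaf: scan local data and return a partial count.
        return sum(1 for value in node.local_values if predicate(value))
    # Inner node: forward the query to each child and aggregate their answers.
    return sum(execute(child, predicate) for child in node.children)


tree = Node("root", children=[
    Node("mid-a", children=[
        Node("leaf-1", [1, 20, 35]),
        Node("leaf-2", [50, 5]),
    ]),
    Node("mid-b", children=[
        Node("leaf-3", [70, 8, 90]),
    ]),
])

# Roughly: SELECT COUNT(*) ... WHERE value > 10
matches = execute(tree, lambda value: value > 10)
print(matches)
```

Because every level only combines the answers of its children, the amount of data moving up the tree stays small even when the leaves hold a lot of data.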
Impala provides the following benefits:
- Efficient resource usage: Impala can handle concurrent client requests in a shared workload environment. Each Impala daemon can handle multiple concurrent client requests.
- Fast re-runs: Impala does not provide the fault tolerance that Hive does. If a node fails in the middle of processing, the whole query has to be re-run. However, queries typically run fast enough that starting over makes up for the lost time.
- Time savings, because you do not have to move data around and Impala does not write intermediate results to disk.
- Supports Hadoop Security (Kerberos authentication) and role-based authorization through the Apache Sentry project.
- Far-reaching accessibility of Hadoop data to the business community.
- More complete analysis of full raw and historical data, without information loss from aggregations or conforming to fixed schemas.
So in conclusion, overall, is Impala better than Hive?
It depends on the use case. Hive is better for longer-running data warehouse jobs, since Impala does not provide fault tolerance. Impala is much faster for ad hoc analytics on reasonably sized data sets.