Imagine if reports that currently take many minutes to run in Hadoop could come back with results in seconds. Get answers to detailed questions about sales figures and customer trends in real time. Make revenue predictions based on up-to-date customer metrics across a spectrum of sources. Iterate more quickly simulating different business decisions to achieve better outcomes. The Actian Vector in Hadoop (VectorH) analytics platform can deliver those improvements in your Apache Hadoop big data environment.
Actian VectorH has demonstrated one to three orders of magnitude better query performance in a comparison with other major SQL in Hadoop alternatives. In this first of a three-part blog describing the results, we’ll show the astounding performance results and explain the factors that contribute to such a large advantage. Part two will cover the unique abilities VectorH has to handle updates, and part three will go into the efficiencies of the VectorH file format.
Actian performance engineering used the full set of 22 TPC-H queries to run unaudited benchmarks on several of the SQL on Hadoop solutions in the market, and the results may surprise you (but not us). Here is a quick summary:
These results have been published in an academic paper submitted to and presented at the 2016 International Conference on Management of Data (ACM SIGMOD) in San Francisco, June 26-July 1, 2016. That paper goes into many technical reasons why VectorH is able to achieve such a performance advantage – here is the short version:
- Efficient, multi-core parallel and vectorized execution – VectorH is designed to take advantage of the performance features in the Intel CPU architecture, including the AVX2 vector instruction set and large, multi-layer caches.
- Well-tuned query optimizer – VectorH extends the mature optimizer from its original SMP version to exploit the multiple levels of parallelism and advantages of data locality in an MPP Hadoop system. The VectorH optimizer can change the join order or partition data tables to improve parallel operations, steps that have to be done manually for queries in the other alternatives.
- Control over HDFS block locality – since VectorH operates natively within HDFS and YARN, it can participate in resource management and make allocation decisions in the context of the larger cluster workload. At the same time, specific table storage optimizations reduce overhead, accelerate reads, maximize disk utilization, and reduce data skew to help deliver faster query results.
- Effective I/O filtering – tracking the range of values in a column (MinMax) allows skipping the reading of blocks which fall outside the range of the query, reducing disk I/O and read delays, and avoiding decompression computations, sometimes significantly.
- Lightweight compression – VectorH compression achieves good levels of compaction at high speed, achieving faster vectorized execution by minimizing branches and instruction counts. Our compression algorithms are capable of running fully in CPU cache, effectively increasing memory bandwidth. Different compression algorithms are tailored for the various data types and VectorH automatically calibrates and chooses among them to reach higher levels of compression and efficiency when compared to general purpose compression algorithms.
How was the testing conducted?
- Actian performance engineering built a 10-node Hadoop cluster, each node 2xIntel 3.0GHz E5-2690v2 CPUs, 256GB RAM, 24x600GB HDD, 10Gb Ethernet, Hadoop 2.6.0. There was one name node and nine SQL-on-Hadoop nodes, set up using Cloudera Express 5.5.
- These tests were conducted in Dec 2015 and Jan 2016, running the then-most-current release of each of the SQL on Hadoop alternatives (Actian VectorH 4.2.2, Apache Hive 1.2.1, Cloudera Impala 2.3, Apache Drill 1.5, Apache Spark SQL 1.5.2, and Pivotal HAWQ 1.3.1). Reasonable efforts were made to tune each platform to make fair comparisons.
- Here are the actual individual query execution times and the speed-up factor for VectorH versus each of the alternatives:
In part two of this blog series, we will cover the advantages VectorH 5.0 delivers in SQL functionality and data updates capability compared to the other alternatives, and part three will show the benefits of the VectorH file format for faster query performance and lower storage requirements.