Actian Vector for Hadoop File Format is Faster and More Efficient

By Pradeep Bhanot
June 5, 2020

In this third and last part of the series on Actian Vector in Hadoop (VectorH), we cover how the VectorH file format supports the performance and efficiency of our data analytics platform to accelerate business insights, as well as some of the other enterprise features that can help businesses move their Hadoop applications into production. Part one of this series showed the large performance advantages VectorH has over other SQL on Hadoop alternatives, while part two explored the benefits of the richer SQL implementation and the ability to perform data updates in VectorH.

The file format for VectorH is one of the key contributors to its industry-leading performance. A columnar orientation allows VectorH to choose compression techniques optimized by data type, and VectorH uses various measures described in the SIGMOD paper to make more efficient use of storage and I/O bandwidth. In some simple benchmarks described in that paper, we compared the speed and efficiency of VectorH to other query engines (such as Impala and Presto) and other file formats (such as Parquet and ORC). Three observations become clear from the benchmark results.

First, VectorH handles queries much faster than the alternatives when the data is already in memory, from 26x to over 110x faster, primarily due to the efficiency of decompression using vectorized processing. The accompanying chart shows query times for each of the alternatives and how they vary with the percentage of data selected out of the entire set of tables. VectorH and Presto avoid processing data outside the selected range, while Impala does not and performs much worse in the 10% and 30% cases.

Second, VectorH is also significantly faster when data has not yet been loaded into memory. VectorH reduces the amount of I/O required for data residing on disk by using I/O filtering, where MinMax indexes held in memory allow it to skip read operations for blocks on disk that contain no data in the selected range. A second chart, similar to the first, plots the same percentages of data in the selected range, and only VectorH shows significant savings in read operations as less data fits the selection criteria. Although some other formats also keep range information, it is stored as metadata inside the data blocks, so every block still needs to be read at least partly before the engine can decide whether its data is relevant. VectorH performed significantly less I/O, from 20% to 98% less, compared to Impala and Presto.

Third, VectorH has the most effective compression across a variety of data types, requiring only 11 GB of storage compared to 18 GB for Parquet and 19 GB for ORC, a savings of 39-42%. Imagine the savings over a multi-petabyte data store!

Additional advantages for VectorH that contribute to deploying successful analytics solutions:

Spark integration is an example of Actian's continuing commitment to incorporating open interfaces and frameworks directly into the VectorH solution.

Actian VectorH 6.0 integrates with the latest Hadoop distributions and can be deployed both on-premises and in the cloud, for example on Microsoft Azure HDInsight.

Actian VectorH 6.0 supports multiple file systems as well as multiple data formats (Parquet, ORC, CSV, and many others through the Spark connector). By leveraging the Spark connector, users can execute queries in VectorH on data stored in any file format supported by Spark. This is fully transparent to the user: full ANSI SQL can be used to query data in any file format without even knowing that Spark is involved.

With the Spark connector, data stored in VectorH can also be processed in Spark through DataFrames or Spark SQL. Any Spark operation can be performed on data backed by a VectorH table.
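The I/O filtering described earlier, in which in-memory MinMax indexes let a range scan skip blocks that cannot contain matching rows, can be illustrated with a minimal Python sketch. This is not VectorH's actual implementation: the names `Block`, `build_minmax`, and `range_scan` are hypothetical, each "block" is just an in-memory list standing in for an on-disk block, and the data is made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for one on-disk block of a single column."""
    values: List[int]


def build_minmax(blocks: List[Block]) -> List[Tuple[int, int]]:
    """Build a small in-memory index holding (min, max) for every block."""
    return [(min(b.values), max(b.values)) for b in blocks]


def range_scan(blocks: List[Block],
               minmax: List[Tuple[int, int]],
               lo: int, hi: int) -> Tuple[List[int], int]:
    """Return values in [lo, hi] and the number of blocks actually read."""
    hits: List[int] = []
    blocks_read = 0
    for block, (bmin, bmax) in zip(blocks, minmax):
        # The index alone proves this block has nothing in range: skip the read.
        if bmax < lo or bmin > hi:
            continue
        blocks_read += 1  # simulated disk read
        hits.extend(v for v in block.values if lo <= v <= hi)
    return hits, blocks_read


# Ten blocks of 100 sorted values each: 0..99, 100..199, ..., 900..999.
blocks = [Block(list(range(i * 100, (i + 1) * 100))) for i in range(10)]
minmax = build_minmax(blocks)

hits, blocks_read = range_scan(blocks, minmax, 250, 349)
print(len(hits), "matching values from", blocks_read, "of", len(blocks), "blocks")
```

In this toy data set the query range touches only two of the ten blocks, so eight reads are avoided without opening those blocks at all. A format that keeps its range metadata inside each block would have to touch every block at least partly to reach the same decision, which is the difference the benchmark highlights.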
Overall, Actian provides more complete enterprise-grade functionality to support moving analytics applications from development into a production environment.

Role- and row-based security is built into VectorH, providing the access control needed to support privacy policies and regulatory requirements.

Actian Director provides a web-based tool for monitoring and managing VectorH and cluster resources.

Actian Management Console automates provisioning, deploying, and monitoring analytics in the cloud, making it quicker and easier to get your new project started.

This three-part blog series (see parts one and two) shows how Actian provides customers with the performance, flexibility, and support needed when integrating with other big data technologies to deliver faster and richer insights to make better business decisions.

About Pradeep Bhanot

Product Marketing professional, author, father and photographer. Born in Kenya. Lived in England through disco, punk and new romance eras. Moved to California just in time for grunge. Worked with Oracle databases at Oracle Corporation for 13 years. Database Administration for mainframe IBM DB2 and its predecessor SQL/DS at British Telecom and Watson Wyatt. Worked with IBM VSAM at CA Technologies and Serena Software. Microsoft SQL Server powered solutions from 1E and BDNA.