Vector in Hadoop 5.0 – New Features You Should Care About

Emma McGrattan

September 19, 2017

Today we announce the next release of Actian Vector in Hadoop, which extends our Apache Spark support to include direct access to native Hadoop file formats and tighter integration with Spark SQL and SparkR applications. This release also brings performance improvements, integration with Hadoop security frameworks, and administrative enhancements. I’ll cover each of these in more detail below.

Combine Native Hadoop Tables With Vector Tables

In previous releases, Vector in Hadoop required data to be stored in a proprietary format that optimizes analytics performance and delivers strong compression to reduce access latency. Vector in Hadoop 5.0 adds the ability to register Hadoop data files (such as Parquet, ORC, and CSV files) as tables in VectorH and to join these external tables with native Vector tables. Vector in Hadoop will provide the fastest analytics execution against data in these formats, even faster than their native query engines. However, queries against external tables will never be as fast as queries against native Vector data. If performance matters, we suggest that you load the data into Vector in Hadoop using our high-speed loader.

This feature lets customers who have standardized on a particular file format, and who want to avoid copying data into a proprietary format, still get the performance acceleration VectorH offers. The storage benchmark that we conducted as part of our SIGMOD paper showed the Vector file format to be more efficient in terms of query performance, data read, and data compression. See our blog post from July 2016, which explains that benchmark further.

True Enterprise Hadoop Security Integration

A Forrester survey last year indicated that data security is the number one concern with Hadoop deployments. Vector in Hadoop natively provides the enterprise-grade security that one expects in a mature EDW platform: discretionary access control (control over who can read, write, and update what data in the database), column-level data-at-rest encryption, data-in-motion encryption, security auditing with SQL-addressable audit logs, and security alarms. For the rest of the Hadoop ecosystem, these concerns have driven the development of Hadoop security frameworks through projects like Apache Knox and Apache Ranger. As we see these frameworks starting to appear on customer RFIs, we’ve provided documentation on how to configure VectorH for integration with Apache Knox and Apache Ranger.

Significant Performance Enhancements

The performance enhancements which resulted in Vector 5.0 claiming top performance in the TPC-H 3000GB benchmark for non-clustered systems are now available in Vector in Hadoop 5.0, where we typically see linear or better than linear scalability.

Automatic Histogram Generation

Database query execution plans rely heavily on knowledge of the underlying data. Without statistics, the query optimizer has to make assumptions about data distribution: for example, it will assume that all zip codes have the same number of residents, or that customer last names are as likely to begin with an X as with an M. VectorH 5.0 includes an implementation of automatic statistics/histogram generation for Vector tables. When a query references a column in a WHERE, HAVING, or ON clause and that column has no explicitly created histogram (from optimizedb or CREATE STATISTICS), a histogram is automatically created and cached in memory.
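To see why the uniform assumption hurts, here is a minimal Python sketch. It is not VectorH's optimizer: the function names and the toy data are invented for illustration, and the histogram is a simple per-value frequency table rather than the bucketed structure a real engine would keep.

```python
from collections import Counter


def uniform_selectivity(num_distinct):
    """Fraction of rows expected to match `col = value` with no statistics:
    every distinct value is assumed to be equally common."""
    return 1.0 / num_distinct


def build_histogram(values):
    """Map each distinct value to the fraction of rows that hold it."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def histogram_selectivity(histogram, value):
    """Fraction of rows expected to match `col = value` given a histogram."""
    return histogram.get(value, 0.0)


# Toy column of last-name initials (hypothetical data): M is common, X is rare.
initials = ["M"] * 90 + ["S"] * 9 + ["X"] * 1
histogram = build_histogram(initials)
rows = len(initials)

for initial in ("M", "X"):
    guess = rows * uniform_selectivity(len(histogram))
    informed = rows * histogram_selectivity(histogram, initial)
    print(f"{initial}: uniform guess {guess:.1f} rows, histogram {informed:.1f} rows")
```

With the uniform guess, both predicates look identical to the optimizer; with the histogram, it can tell that one predicate matches most of the table and the other almost none of it, which is the kind of difference that changes join order and access strategy.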

Accelerate Startup and Shutdown With Distributed Write Ahead Log

In earlier Vector in Hadoop releases, the write-ahead log file, which holds details of updates in the system, was managed on the VectorH Leader Node. This memory-resident log file consumed a lot of the Leader Node's memory and became a bottleneck at startup, because the log had to be replayed during startup and that process could take several minutes. In VectorH 5.0 we have implemented a distributed Write Ahead Log (WAL), where each node has a local WAL. This alleviates pressure on memory, improves our startup times, and as a side effect also results in much faster COMMIT processing.
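The startup benefit can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not VectorH's log format or recovery code: log records are hypothetical `(key, value)` updates, and a key is assumed to be owned by node `key % num_nodes`.

```python
from collections import defaultdict


def replay(log, state=None):
    """Apply (key, value) update records in order; later records win."""
    state = {} if state is None else state
    for key, value in log:
        state[key] = value
    return state


def split_by_node(log, num_nodes):
    """Give each node only the records for the keys it owns.
    Ownership by `key % num_nodes` is an assumption of this sketch."""
    per_node = defaultdict(list)
    for key, value in log:
        per_node[key % num_nodes].append((key, value))
    return per_node


# One central log, as a single leader would hold it (hypothetical records).
central_log = [(k, f"v{i}") for i, k in enumerate([0, 1, 2, 3, 0, 1, 4, 5])]

# Centralized recovery: one node replays every record.
central_state = replay(central_log)

# Distributed recovery: each node replays only its local log.
local_logs = split_by_node(central_log, num_nodes=4)
merged_state = {}
for node_log in local_logs.values():
    merged_state.update(replay(node_log))

longest_local = max(len(node_log) for node_log in local_logs.values())
print(f"central replay: {len(central_log)} records")
print(f"longest local replay: {longest_local} records")
```

Because every key belongs to exactly one node and order is preserved within each local log, the merged result matches the centralized replay, while no single node has to hold or replay the whole log.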

Speed Up Queries With Distributed Indexes

In earlier releases, the VectorH Leader Node was responsible for maintaining the automatic min-max indexes for all partitions. As a reminder, the min-max index keeps track of the minimum and maximum value stored within each data block; this internal index lets us quickly identify which blocks will participate in solving a query and which ones don’t need to be read. The index is memory resident and is built on server startup. In VectorH 5.0 each node is responsible for maintaining its own portion of the index, which alleviates pressure on memory on the Leader Node, improves our startup times by distributing the work, and speeds up DML queries.
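The block-skipping idea behind a min-max index can be sketched in a few lines of Python. The block layout, function names, and data below are illustrative assumptions, not VectorH internals.

```python
def build_minmax_index(blocks):
    """Record the (min, max) of the values stored in each block."""
    return [(min(block), max(block)) for block in blocks]


def blocks_to_read(index, lo, hi):
    """Return positions of blocks whose [min, max] range overlaps [lo, hi].
    Every other block cannot contain a match and is skipped."""
    return [
        i for i, (block_min, block_max) in enumerate(index)
        if block_max >= lo and block_min <= hi
    ]


def range_query(blocks, index, lo, hi):
    """Scan only the candidate blocks and return values in [lo, hi]."""
    matches = []
    for i in blocks_to_read(index, lo, hi):
        matches.extend(v for v in blocks[i] if lo <= v <= hi)
    return matches


# Four hypothetical data blocks of one column.
blocks = [[1, 5, 9], [10, 14, 19], [20, 25, 29], [12, 30, 31]]
index = build_minmax_index(blocks)

candidates = blocks_to_read(index, 11, 15)
print(f"blocks read: {candidates} of {len(blocks)}")
print(f"matching values: {range_query(blocks, index, 11, 15)}")
```

For the predicate `BETWEEN 11 AND 15`, two of the four blocks are skipped without being read. Distributing the index means each node builds and consults only the entries for its own blocks.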

Simplified Partition Management With Partition Specification

We found that a number of VectorH customers encountered performance problems because they didn’t know to include the PARTITION clause when creating tables, especially when using CREATE TABLE AS SELECT (CTAS). Suppose they had an existing table distributed across 15 partitions and wanted to create a new table based on it. They assumed the new table would also have 15 partitions, but that’s not the way the SQL standard intended it, and in this case being true to the SQL standard hurt us. To alleviate this, we have added a configuration parameter that can be set to require the use of either NOPARTITION or PARTITION= when creating a Vector table explicitly or via CTAS.

Simplify Backup and Restore With Database Cloning

VectorH 5.0 introduces a new utility, clonedb, which lets users make an exact copy of their database in a separate Vector instance, for example to copy a production database into a development environment for testing. This feature was requested by one of our existing customers but has been very well received across all Vector/VectorH accounts.

Faster Exports With Spark Connector Parallel Unload

The Vector Spark Connector can now be used to unload large data volumes in parallel across all nodes.

Simplified Loading With SQL Syntax for vwload

VectorH 5.0 includes the ability to utilize vwload with the SQL COPY statement for fast parallel data load from within SQL.

Simplified Creation of CSV Exports From SQL

VectorH 5.0 includes the ability to export data in CSV format from SQL using the following syntax:

INSERT INTO EXTERNAL CSV 'filename' SELECT ... [WITH NULL_MARKER='NULL', FIELD_SEPARATOR=',', RECORD_SEPARATOR='\n']
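To make the three options in the WITH clause concrete, here is a small Python sketch of how a null marker, field separator, and record separator shape the exported text. It only illustrates what the options mean; it is not how VectorH writes the file. The function name is hypothetical, quoting and escaping are ignored, and ending every record (including the last) with the record separator is an assumption.

```python
def to_csv(rows, null_marker="NULL", field_separator=",", record_separator="\n"):
    """Render rows as delimited text.
    None stands in for SQL NULL and is written as `null_marker`."""
    records = []
    for row in rows:
        fields = [null_marker if value is None else str(value) for value in row]
        records.append(field_separator.join(fields))
    # Assumption of this sketch: every record ends with the record separator.
    return "".join(record + record_separator for record in records)


# Hypothetical result set of a SELECT with some NULL values.
result_rows = [(1, "alice", None), (2, None, "Dublin")]

print(to_csv(result_rows), end="")
print(to_csv(result_rows, null_marker="\\N", field_separator="|"), end="")
```

Choosing a null marker that cannot appear as real data keeps NULLs distinguishable from empty strings when the file is loaded elsewhere.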

Next Steps

To learn more, request a demo or a trial version of VectorH to try within your Hadoop cluster. You can also explore the single-server version of Actian Vector running on Linux, distributed free as a community edition, available for download.

About Emma McGrattan

As SVP of Engineering at Actian, Emma leads research and development for the Actian Vector, Actian Vector in Hadoop, Actian X and Ingres teams. A recognized authority in DBMS and Big Data technologies, Emma is a sought-after speaker at industry conferences. Emma has recently celebrated over 25 years in Ingres and Actian Engineering. Educated in Ireland, Emma holds a Bachelors of Electrical Engineering degree from Dublin City University.