Blog | Data Analytics | | 5 min read

Delivering Real-Time Reporting at Speed and Scale

When a major UK logistics company wanted to improve reporting for its large accounts, it turned to Actian to design, implement and support the underlying database system (“LARS”), built on the Ingres, HVR and Vector products.

The Brief

The customer had around 100 customer accounts representatives dedicated to large accounts, with each rep manually producing their own set of spreadsheet-based daily, twice-daily and ad hoc reports for emailing to their account contacts, based on a range of daily extracts from an Ingres operational-level database.

The customer wanted to standardize the format of the reports and to automate their production in order to save reps’ time, to deliver reports to their accounts in a consistent and timely manner, and ultimately to make it feasible to outsource the function.

The challenge was not just to produce the volume of scheduled complex analytical reports (over 1,000 per day, tightly clustered around critical times in mid-morning and mid-afternoon) while simultaneously supporting ad hoc complex report production for 200 users with response times of seconds, but also to do this a) without significant overhead on the source operational-level database and b) while reducing the need for the range of existing extracts from that database. An additional requirement was that it should be possible to ‘switch’ other existing applications from the operational-level database to this new database at a future stage, which mandated that the new database design be as similar as possible to the existing operational-level design.

Because of delays to the start of the project (due to changes within the customer’s organization), there was considerable pressure to deliver the project in as short a timescale as possible.

The Architecture

To provide the user-visible front-end analytical and reporting facility a semi-customized package from a partner organization was chosen, based on the Logi Analytics product.

The database schema design was constrained by the source database schema design, which resulted in the need to provide a range of database views involving joins over 12 tables, with some of the tables having over 300 million rows. In order to provide interactive users with realistic response times whilst also servicing the needs of scheduled reports, Vector was chosen as the ideal DBMS for this database, due to its very high speed of processing complex retrieval queries and its ability to mirror the Ingres source database structure virtually unchanged.

Since the source Ingres database and the target Vector database had essentially similar schemas, HVR (High Volume Replicator) was chosen as the software solution for keeping the Vector database in line with the source Ingres database. The HVR Capture process reads the Ingres transaction log and passes insert and update operations via the HVR Hub to the target machine, where the HVR Integrate process applies them as ‘upserts’ to the Vector database, placing very little load on the source database machine. (Deletes were suppressed within HVR, so that the regular purges of the source database would not also purge the target database.)

The Implementation

The Ingres source database runs on an older HP-UX platform, so HVR was installed on a dedicated Linux server to act as its Hub. The Vector database sits on a separate dedicated Linux server. An HVR ‘capture’ component runs on the Ingres machine, captures the source database changes from the transaction log and sends them via the HVR Hub to the HVR ‘integrate’ component running on the Vector server, which applies the same changes (via ‘upserts’) to the target Vector database.

To meet the customer’s need for reduced development timescales the project was delivered ready for user acceptance testing in 3 months from the start of development, thanks to Vector’s ability to mirror an Ingres schema with little change.

In order to reduce the number of table joins in the views from 12 down to a more manageable 9, a regularly-scheduled job (running every 10 minutes) was created to maintain a de-normalized table.
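As a sketch of how such a job might look (the table and column names are invented for illustration, and this assumes MERGE support; an UPDATE/INSERT pair works where MERGE is unavailable):

```sql
-- Hypothetical sketch of the 10-minute denormalization refresh.
-- Table and column names are invented for illustration.
MERGE INTO consignment_denorm d
USING (
    SELECT c.consignment_id, c.status, a.postcode, r.route_code
    FROM   consignment c
    JOIN   address a ON a.address_id = c.delivery_address_id
    JOIN   route   r ON r.route_id   = c.route_id
) s ON (d.consignment_id = s.consignment_id)
WHEN MATCHED THEN UPDATE SET
    status = s.status, postcode = s.postcode, route_code = s.route_code
WHEN NOT MATCHED THEN INSERT
    (consignment_id, status, postcode, route_code)
    VALUES (s.consignment_id, s.status, s.postcode, s.route_code);
```

Each frequently joined table folded into the denormalized table this way removes one join from every view that uses it.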

The denormalization update job, HVR’s ‘upsert’ job, the large number of scheduled reports, and the interactive users happily co-exist on the Vector server.

Vector Performance

It is often fairly meaningless to quote retrieval response times from a system since there are so many variables involved, but we can provide a flavour of the retrieval performance of the Vector database compared with its Ingres source database. A member of the customer’s IT staff needed to run an unreasonably heavy ad-hoc SQL query against the Ingres source database which ran for 10 minutes before she killed it as untenable. We ran the same SQL against the live Vector database, during ‘prime-time’ activity – it completed in 0.05 seconds. Although this is not a direct comparison since the two databases were running on different platforms and hardware configurations, it does illustrate the dramatic retrieval speed of which Vector is capable.

In fact the performance of Vector was so impressive that it changed the specified requirements from the client-facing team. The envisioned work practice was to allow up to ~200 complex reports to run between 10:00 AM and 10:30 AM, but Vector was so fast and comfortable at scale that these reports now all run within 5 minutes of 10:00 AM, limited only by the resources (cores, memory, etc.) on the machine.

Customer Satisfaction

The customer was sufficiently impressed with the novel architecture of the LARS implementation that they commissioned a second more challenging Vector-based project to be fed from a continuous message stream. This will be the subject of a future blog entry.


Blog | Insights | | 6 min read

Vector in Hadoop 5.0 – New Features You Should Care About

Actian Vector was renamed to Actian Analytics Engine in 2026.

Today we announce the introduction of the next release of Actian Vector in Hadoop, extending our support of Apache Spark to include direct access to native Hadoop file formats and tighter integration with Spark SQL and Spark R applications. In this release, we also incorporate performance improvements, integration with Hadoop security frameworks, and administrative enhancements. I’ll cover each of these in greater detail below.

Combine Native Hadoop Tables With Vector Tables

In previous releases, Vector in Hadoop required data to be stored in a proprietary format which optimized analytics performance and delivered great compression to reduce access latency. Vector in Hadoop 5.0 provides the ability to register Hadoop data files (such as Parquet, ORC, and CSV files) as tables in VectorH and to join these external tables with native Vector tables. Vector in Hadoop will provide the fastest analytics execution against data in these formats, even faster than their native query engines. However, query execution will never be as fast with external tables as with native Vector data. If performance matters we suggest that you load that data into Vector in Hadoop using our high-speed loader.
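For illustration, registering a Parquet file and joining it with a native table might look like the following (the table, column and path names are invented; check the VectorH 5.0 documentation for the exact external-table clause):

```sql
-- Illustrative: expose a Parquet file in HDFS as an external table,
-- then join it with a native Vector table.
CREATE EXTERNAL TABLE clicks_ext (
    user_id INTEGER,
    url     VARCHAR(1024),
    ts      TIMESTAMP
) USING SPARK
WITH REFERENCE = 'hdfs://namenode/data/clicks.parquet',
     FORMAT    = 'parquet';

SELECT u.name, COUNT(*) AS clicks
FROM   users u                        -- native Vector table
JOIN   clicks_ext c ON c.user_id = u.user_id
GROUP BY u.name;
```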

This feature enables customers who have standardized on a particular file format, and who want to avoid copying data into a proprietary format, to still get the performance acceleration VectorH offers. The storage benchmark we conducted as part of our SIGMOD paper showed the Vector file format to be more efficient in both query performance (data read) and data compression. See our blog post from July 2016, which explains that benchmark further.

True Enterprise Hadoop Security Integration

A Forrester survey last year indicated that data security is the number one concern with Hadoop deployments. Vector in Hadoop natively provides the enterprise-grade security one expects in a mature EDW platform: discretionary access control (control over who can read, write, and update what data in the database), column-level data-at-rest encryption, data-in-motion encryption, security auditing with SQL-addressable audit logs, and security alarms. For the rest of the Hadoop ecosystem, these concerns have driven the development of Hadoop security frameworks through projects like Apache Knox and Apache Ranger. As we see these frameworks starting to appear on customer RFIs, we’ve provided documentation on how to configure VectorH for integration with Apache Knox and Apache Ranger.

Significant Performance Enhancements

The performance enhancements which resulted in Vector 5.0 claiming top performance in the TPC-H 3000GB benchmark for non-clustered systems are now available in Vector in Hadoop 5.0, where we typically see linear or better than linear scalability.

Automatic Histogram Generation

Query execution plans rely heavily on knowledge of the underlying data; without statistics, the optimizer has to make assumptions about data distribution, e.g. that all zip codes have the same number of residents, or that customer last names are as likely to begin with an X as with an M. VectorH 5.0 includes automatic statistics/histogram generation for Vector tables: histograms are automatically created and cached in memory when a query references a column in a WHERE, HAVING or ON clause that has no explicitly created (by optimizedb or CREATE STATISTICS) histogram.
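In practice this means a query can get good estimates without a prior optimizedb run. A sketch (table and column names invented; the explicit statement mirrors what the automatic mechanism now does for you):

```sql
-- Previously, histograms had to be created explicitly, e.g.:
CREATE STATISTICS FOR customers (zip_code);

-- In VectorH 5.0, a query like this triggers automatic histogram
-- creation on zip_code if no histogram exists yet:
SELECT COUNT(*) FROM customers WHERE zip_code = '94111';
```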

Accelerate Startup and Shutdown With Distributed Write Ahead Log

In earlier Vector in Hadoop releases the write-ahead log file, which holds details of updates in the system, was managed on the VectorH Leader Node. This memory-resident log file consumed a lot of the Leader Node memory and became a bottleneck at startup, as the log file needed to be replayed and that process could take several minutes. In VectorH 5.0 we have implemented a distributed Write Ahead Log (WAL), where each node has a local WAL. This alleviates pressure on memory, improves our startup times and, as a side effect, also results in much faster COMMIT processing.

Speed Up Queries With Distributed Indexes

In earlier releases, the VectorH Leader Node was responsible for maintaining the automatic min-max indexes for all partitions. As a reminder, the min-max index keeps track of the minimum and maximum value stored within a data block; this internal index allows us to quickly identify which blocks will participate in solving a query and which ones don’t need to be read. The index is memory-resident and is built on server startup. In VectorH 5.0 each node is responsible for maintaining its own portion of the index, which alleviates pressure on memory on the Leader Node, improves our startup times by distributing the work, and speeds up DML queries.

Simplified Partition Management With Partition Specification

We found that a number of VectorH customers encountered performance problems because they didn’t know to include the PARTITION clause when creating tables, especially when using CREATE TABLE AS SELECT (CTAS). Say a customer had an existing table distributed across 15 partitions and created a new table based on it: the natural assumption was that the new table would also have 15 partitions. That is not what the SQL standard intends, and in this case being true to the standard hurt us. To alleviate this we have added a configuration parameter which can be set to require the use of either NOPARTITION or PARTITION= when creating a Vector table, whether explicitly or via CTAS.
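With the parameter enabled, the partitioning intent has to be spelled out, for example (table names and partition count invented for illustration):

```sql
-- CTAS inherits nothing: without a PARTITION clause the copy is
-- unpartitioned. Stating it explicitly avoids the surprise:
CREATE TABLE orders_copy AS
    SELECT * FROM orders
WITH PARTITION = (HASH ON order_id 15 PARTITIONS);

-- Or declare that no partitioning is intended:
CREATE TABLE orders_small AS
    SELECT * FROM orders WHERE order_date > DATE '2016-01-01'
WITH NOPARTITION;
```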

Simplify Backup and Restore With Database Cloning

VectorH 5.0 introduces a new utility, clonedb, which enables users to make an exact copy of their database into a separate Vector instance e.g. take a copy of a production database into a development environment for testing purposes. This feature was requested by one of our existing customers but has been very well received across all Vector/VectorH accounts.

Faster Exports With Spark Connector Parallel Unload

The Vector Spark Connector can now be used to unload large data volumes in parallel across all nodes.

Simplified Loading With SQL Syntax for vwload

VectorH 5.0 includes the ability to utilize vwload with the SQL COPY statement for fast parallel data load from within SQL.
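For example, a parallel load from HDFS files might be expressed as follows (file paths and options are illustrative; see the vwload/COPY documentation for the full option list):

```sql
-- Illustrative: parallel bulk load via SQL, using vwload under the hood.
COPY lineitem()
VWLOAD FROM 'hdfs://namenode/data/lineitem1.csv',
            'hdfs://namenode/data/lineitem2.csv'
WITH FDELIM = '|';
```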

Simplified Creation of CSV Exports From SQL

VectorH 5.0 includes the ability to export data in CSV format from SQL using the following syntax:

INSERT INTO EXTERNAL CSV 'filename' SELECT ... [WITH NULL_MARKER='NULL', FIELD_SEPARATOR=',', RECORD_SEPARATOR='\n']
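For example, to export a query result (file path and table names invented for illustration):

```sql
-- Illustrative: write a filtered result set out as CSV.
INSERT INTO EXTERNAL CSV 'hdfs://namenode/out/top_customers.csv'
    SELECT name, total_spend
    FROM   customers
    WHERE  total_spend > 10000
WITH NULL_MARKER = 'NULL', FIELD_SEPARATOR = ',';
```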

Next Steps

To learn more, request a demo or a trial version of VectorH to try within your Hadoop cluster. You can also explore the single-server version of Actian Vector running on Linux, distributed free as a community edition, available for download.


Blog | Actian Life | | 3 min read

The Essential Guide to Gartner Catalyst 2017 in San Diego, CA

If you’re headed to Gartner Catalyst 2017 in San Diego, CA August 21 – 24, or if this is your first time at a Gartner event, here’s your essential guide to get the most out of the upcoming conference. We hope you find it useful.

This is a conference for technical professionals. You’ll have plenty of opportunities to meet with your peers across all disciplines including CIOs, CTOs, solution architects, developers, database admins, data scientists, data engineers, business analysts and DevOps, amongst others.

Gartner has lined up a host of hot topics and session tracks, so be sure to check out the official session calendar to build your personalized schedule that you can access from the Gartner Events Navigator. They also have a mobile app (Android, iOS and Windows) that you can use after you have registered, and this will come in handy as you move around at the conference between sessions, roundtables, meetups, networking, and breakfast/lunches.

Actian CTO Mike Hoskins will be sharing Actian’s Hybrid Data Vision as part of his talk titled “Actian: Drowning in Data? How to Bridge the Gap to Business Insights.” The talk will be held in the TechZone Theatre/Harbor Ballroom, Second Level on Wednesday, August 23 @ 1:30 PM PT, so be sure to add this one to your calendar. This talk is in the same area as, and during, the coffee/desserts break, so seats tend to fill up fast… don’t be late or you’ll be left standing!

Remember that the event is in San Diego, California and not in San Diego, Texas! The nearest airport is San Diego International Airport (SAN), formerly known as Lindbergh Field, which is located a short distance by car/taxi from the event hotel. Most local school districts have either just started or are starting their new school year, so the event is perfectly timed to miss the peak Summer vacations for many US tourists. Be sure to check out the local weather forecast before you pack your suitcase. Remember to find some time to stretch your legs and explore the nearby Gaslamp District and Seaport Village. Check out Gartner’s latest event-related venue and travel information, as there are some travel alerts to be aware of.

The Actian team will be there, and we look forward to meeting you in person at the Actian Booth #108 in the Harbor Ballroom, second level of the Manchester Grand Hyatt San Diego.

We’ll be sharing our hybrid data vision and will have subject matter experts available onsite to walk you through our portfolio of hybrid data-management, analytics and integration products and services for Technology Professionals like you.

If you’re new to Actian products, here are a few of the portfolio highlights:

We hope you’ll stop by to say “Hi” to the team and learn about Actian’s products, community, and customers.

Follow us on Twitter, and on LinkedIn to stay connected with what we are up to. If you fancy a job to pursue your passion in data management, data integration, and data analytics, check out our careers page and come join our team – WE’RE HIRING!


One important trend in database management is integrating location data better to improve insights about events and activities that matter to your business.

“…Interest in analyzing geospatial/location data has increased over the past four years from 26% to 36%.”
Source: Gartner Survey Analysis: Big Data Investments

Tracking customer location can be critical for offering location-based services, particularly for travelers (think Uber matching cars to riders, or restaurants making offers to customers nearby) and for shoppers (to optimize shelf locations for popular items and perhaps make real-time offers).  Tracking and managing assets by location can not only improve response time to failures but also track potential interactions that ultimately predict future failures.

Actian Ingres has supported geospatial data for a few years now, recognizing location as a data type to improve the validity, accuracy, and processing of location data. Earlier this year, we extended that support in Ingres by introducing a plugin for ESRI ArcGIS 10.x users to view and manipulate geospatial data. ArcGIS is ESRI’s geographic information system (GIS) for creating and working with maps and geographic information.

The ESRI plugin supports two of the tools, ArcMap and ArcCatalog, in versions 10.x of ArcGIS on Windows, and Actian supports the plugin on Ingres 10S and 10.2. ArcMap is the primary application used in ArcGIS for mapping, editing, analysis, and data management. With the ESRI plugin and ArcMap, users can access geospatial data to create maps and to visualize, filter, summarize, analyze, compare, and interpret spatial data. ArcCatalog allows users to store and organize geospatial data (like a Windows Explorer for geospatial data).
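Once the data is in Ingres as a geometry type, standard spatial predicates apply. A hypothetical query (the depot table and coordinates are invented; function names follow the OGC simple-features convention that Ingres’s geospatial support is based on):

```sql
-- Hypothetical: find depots inside a bounding polygon (SRID 4326 = WGS 84).
SELECT d.depot_name
FROM   depots d
WHERE  ST_Within(
           d.location,
           ST_GeomFromText(
               'POLYGON((-1 51, 1 51, 1 53, -1 53, -1 51))', 4326));
```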

Actian is working with a couple of partners to help our customers get the most out of Ingres and ArcGIS:

  • Critigen provides implementation services with ESRI expertise to develop and deploy geospatial applications.
  • Safe Software supports Ingres through their FME integration tool and complements Actian DataConnect.

The ESRI plugin and documentation are available to existing Actian customers for download at esd.actian.com. To find out more about geospatial features, go to docs.actian.com.

Download the ESRI plugin and let us know what you think!


Summary

  • Data management is evolving from traditional business transactions to include complex human and machine-generated event trails.
  • A hybrid data landscape requires systems that can integrate and analyze diverse data types across on-premises and cloud environments.
  • Correlating machine observations with business events enables a “closed-loop” for more timely and accurate analytic decisions.
  • The ability to manage stunningly heterogeneous data at scale is the primary key to succeeding with modern enterprise analytics.

The Age of Data has arrived, with new data sources, targets and processing models proliferating madly across enterprises of all sizes. While data has never been more valuable to a business — it now informs the who, what, where, when and how of decision-making — this new hybrid data landscape introduces new challenges. We anticipate the following innovative efforts in data management, integration and analytics to address these challenges.

The Rise of HTAP – Best of Both Worlds in Data Management

One of the most exciting trends for the balance of this decade will be HTAP (Hybrid Transactional/Analytical Processing), which is a Gartner-coined term representing a hybrid, converged software infrastructure that can handle both traditional transactional data management workloads AND modern analytic data management workloads.

Every business is struggling to find tools and techniques to effectively analyze the volume, variety and velocity of data. A new generation of columnar analytic SQL databases (like Actian Analytics Engine) will be critical to delivering on the promise of data-driven decisions. At the same time, organizations are familiar with, and trying to preserve, their investment in traditional transactional SQL databases (like Actian Ingres) that represent the backbone of data management in most organizations. How to marry those two data management needs?

What if you could have both capabilities in the same database? What if you could have the best of both worlds? Robust, enterprise-class OLTP database capabilities that leverage a 30+ year history of pioneering work in data management. And then add the world’s highest-performance columnar analytic database engine (with vector processing) into the same database infrastructure. One database, one security model, one SQL, one vendor – providing an innovative hybrid of operational and analytic processing that covers the entire spectrum of data management! With the ability to deploy to the cloud or on-premise. Now that is something to get excited about.

The Rise of Edge Databases for IoT Data Management

The emerging IoT stacks and solutions are missing one important element of scalable architectures – an elastic middle tier that can sit at the “edge” of the network and deliver robust processing services to the onboarding and analysis of IoT data. Most conventional IoT architectures focus simply on the two main end-points – the sensors themselves, spitting out low-level data, and the cloud, where sensor events should eventually “land” for analysis.

The sheer volume and repetition of sensor data make it impractical to imagine “landing” all sensor data in the cloud. The smarter IoT architectures will provide an intelligent middle tier – a kind of gateway function that resides near the sensors, at the edge. This layer is intended for early capture, processing and local analysis of the sensor data before only vital information is sent to the cloud.

The natural technology to deploy at the onboarding “edge” of the network is a bullet-proof embedded IoT edge database. Apart from the obvious advantages of deploying an embedded IoTDB at the “edge” of the network (persistence, security, etc.), you could also apply crucial local filtering (e.g. duplicates, errors, steady states, etc.) and data operations (e.g. sorts, aggregates, model application and local analytics) on the data prior to “landing” the data in the cloud – a much more efficient and productive setup for cloud-based analytics of sensor data.

The Rise of Hybrid Integration Platforms

It seems that regardless of how much we invest, integration remains an unsolved problem – permanently atop the priority list in all IT shops and organizations. The diversity of IT systems guarantees a baseline of integration challenges. An uncountable number of new end-points every year exacerbates the situation. Factor in that old and new end-points are changing constantly, and you multiply the problem further. Add the requirement for different integration patterns and delivery models and you begin to see the many intimidating dimensions of the integration problem.

Is there hope? Yes, tools that surpass the limited nature of today’s typical integration offerings are making their way into the market. Instead of focusing on one dimension of today’s integration problem – legacy on-premises ETL, heavy EAI tooling or lightweight cloud services, we will see customers turn to hybrid integration platforms – modern, dynamic and cloud-based solutions – to tackle all dimensions. Whether it is the variety of end-points (cloud, mobile or on-prem), or the variety of patterns (A2A via APIs or B2B via data), or the variety of skills (IT expert to LoB practitioner) or the variety of delivery models (cloud or on-premise), a modern hybrid integration platform like the Actian DataCloud will enable customers to adapt to today’s data integration needs.

The Rise of Graph Analytics in the Cloud

Neo4J, the leading commercial provider of on-premises graph database technology, recently raised a funding round of $36 million. This funding establishes graph databases (and the associated graph analytics space) as first class citizens in the pantheon of modern analytic techniques.

Why graph? In the now-immortal words of Donald Rumsfeld, there are “known knowns” (handled via BI and reporting), there are “known unknowns” (handled via predictive analytics to get a grip on a known analytic challenge such as fraud), and then there are “unknown unknowns.” These are the questions you never knew to ask, the queries you never knew to write. What are the unknown/unseen patterns hidden away in your data, and how do you find them? This is one of the great analytic challenges in datasets – what are the inherent (but unseen) relationships in the data – what objects are “close” to what other objects? What objects are “outliers”? What heretofore seemingly unrelated events share space and time?

It is exactly for this reason that graph is an important new analytic weapon. Graph analytics in the cloud are the ideal implementation platform, and we expect to see offerings that let you transfer your data into the cloud, load it into a back-end graph datastore like Actian Versant, and then “graph it” to see patterns inherent in the data (and even see new patterns emerge spontaneously as you add more data).


We are proud to introduce the latest release of Actian DataConnect. We listened to our customers and adopted a ‘back to basics’ approach with the new product architecture. A lightweight desktop installation for design and a flexible SDK and CLI for run-time are the core components which will plug into any existing job management infrastructure you may have already built around a previous version. For users who want to take advantage of our out-of-the-box, robust and secure cloud infrastructure, DataCloud is the preferred deployment option.

Backwards compatibility with Actian DataConnect versions 9 and 10 is another core theme in the version 11 release. Pervasive Data Integrator version 9 users can skip Version 10 altogether and upgrade directly to Version 11. Those maps, schema, processes, and other artifacts can be imported to Version 11 without the need to perform a migration. Simply import and use them.

The design environment focuses on developer productivity and integration architecture simplification with a small-footprint, desktop IDE installation. It includes all the familiar mapping and event features that were available in Data Integrator version 9. We’ve also added more development tools to help you iterate faster, both when developing new integration projects and when modifying existing ones.

We wanted to get this release into the field so our version 9 users could begin to take advantage of it immediately. Version 10 users will be able to upgrade in a subsequent release targeted later in 2017.

What’s New and Different About DataConnect 11?

Architecture:

  • Lightweight desktop design interface built on a widely adopted extensible open-source IDE framework.
  • Ability to import, rather than migrate, integration artifacts from prior DataConnect versions.
  • Full support for Data Integrator Version 9 Events and Actions for backward compatibility.
  • Open, file system-based metadata repository that enables use of your existing source control systems.
  • Flexible software development kit (SDK) and command line interface (CLI) to support your custom job management infrastructure.
  • DataCloud deployment option: Manage in the cloud, run time on-premises via agents.

Integration Features:

  • REST Invoker 3.0: Easy-to-use and standardized approach to RESTful web service APIs.
  • Engine execution profiler provides immediate, interactive performance feedback.
  • Built-in XML and Text editors for power users to directly modify metadata.
  • Content assist in the script editor (aka code completion).
  • Reject connection tab for improved ease of use.
  • Optional support for macro sets and encrypted values.
  • Improved “Search and Replace” functionality and Help system.

Want to Learn More?

For the hands-on users, here is a short series of videos showing the new user interface in action.

Download data sheet and whitepaper: click here


Blog | Data Management | | 4 min read

Architecting Next-Generation Data Management Solutions

Hybrid for the Data Driven Enterprise

This is part 2 of our conversation with Forrester analyst Michele Goetz. Please click here to read the first post: Rethink Hybrid for the Data-Driven Enterprise.

After a recent Actian webinar featuring Forrester Research, John Bard, senior director of product marketing at Actian, asked Forrester principal analyst Michele Goetz more about next-generation data management solutions. Here is the second part of that conversation (see part one here):

John Bard, Actian:  What are key business imperatives that are forcing a greater priority of speed of query processing for systems of insight? 

Michele Goetz, Forrester:  More and more businesses are becoming digital. Retailers are creating digital experiences in their brick-and-mortar stores. Oil and gas companies are placing thousands of sensors on wells to get information on production and equipment states in real-time. And the mobile mind shift is driving more and more consumer and business engagement through mobile apps. Everything is in real-time, delivered through a web of microservices, and increasingly sophisticated analytics are embedded in streams and processes. This places a significant demand on systems that have to hit high-performance levels on massively orchestrated data services to get insight on demand, make decisions quickly, take action quickly, and achieve outcomes that meet business goals.

JB:  How important is it for operational data and systems of insight to be tightly linked? What are some applications/use cases driving that integration?

MG:  More and more, transactional systems have to operate on insight and not just as entry points to capture a transactional event. Analytics are running on streams of data and individual transactions such as purchases and business process events and transactions. These analytics provide suggestions and instructions to inform pricing, offers, next best action, and security/fraud patterns, along with automating manual processes. Today’s modern data platform has to run analytic and operational workloads side by side to not only enable a process but also capitalize on opportunities and threats as they occur.

JB:  How does an enterprise strike a balance between best-in-class solutions that often require integration versus all-in-one platforms that often force compromises?

MG:  For each business process, customer engagement, automated process, and partner engagement, there are different service-level needs for data and analytics. Data and data services have to be more personalized to the tasks at hand and desired outcomes. Upstream in-development applications are designed with specific requirements for data, insights, and the cadence for when data and insight are needed. These requirements manifest within the data and application APIs that drive microservices and business services. A monolithic all-in-one platform creates rigidity as a purpose-built system that is inflexible to business changes. The cost to purchase and maintain is significant and has an impact on the ability to modernize, thus building up technical debt. Additionally, for every new capability, a new silo is built, further fragmenting data and inhibiting insight. Companies need to move toward a hybrid approach that takes into account the cloud, data variety, service levels, best-in-class technologies, and open source for innovation. Hybrid systems allow flexibility and adaptability to drive service-oriented data toward business value without the cost and delivery bottlenecks that one-size-fits-all systems create.

JB:  What is the best design approach to accelerate development to achieve faster deployment to production and therefore business value?

MG:  Start with what the solution is supporting and the service levels it requires. Have an understanding of how that fits into specific data architecture patterns: data science for advanced analytics and visualization, intelligent transactional data, or analytic and BI workspaces. These patterns guide the choices for database, integration, and cloud while also helping to establish governance that guides trusted sources, repeatable and reusable data APIs and services, and the management of security policies.

JB:  What sort of new applications and services can be created from these new hybrid data architectures?

MG:  Hybrid data management is about putting the right data services and systems to the task and outcome at hand. It provides more freedom to introduce modern data technologies to quickly take advantage of capabilities to scale, get to insights you couldn’t see because of lack of data access, and deliver data and insight in real time without the lag from nightly batch processing and reconciliation. Additionally, hybrid data management has better administrative layers to help manage the peaks and valleys across the ecosystem and avoid performance bottlenecks, as well as right-cost data service levels between cloud and on-premises systems. Going hybrid means getting access to all the data to create customer 360s that take personalization to the next level. It allows analytics to mature toward machine learning, advanced visualizations, and AI by providing a better data infrastructure backbone. And apps and products become more intelligent as hybrid systems create engagement that is insightful and adaptive to the way the solutions are used.


After a recent Actian webinar featuring Forrester Research, John Bard, senior director of product marketing at Actian, asked Forrester principal analyst Michele Goetz more about the trends in today’s enterprise data market.  Here is the first part of that conversation:

John Bard, Actian:  The enterprise market tends to think of “hybrid” as on-premises or cloud, but there are several other dimensions for hybrid. Can you elaborate on other ways “hybrid” applies to the data management and integration markets?

Michele Goetz, Forrester:  Hybrid architecture is really about spanning a number of dimensions: deployment, data types, access, and owner ecosystem. Analysts and data consumers can’t be hindered by technology and platform constraints that limit the reach into needed information to drive strategy, decisions, and actions to be competitive in fast-paced business environments. Information architects are required to think about aligning to data service levels and expectations, forcing them to make hybrid architecture decisions about cloud, operational and analytic workloads; self-service and security; and where information sits internally or with trusted partners.

JB:  What factors do you think are important to customers evaluating databases when it comes to satisfying both transactional and analytic workloads?

MG:  Traditional approaches to database selection fell into either operational or analytic. Database environments were designed for one or the other. Today, operational and analytic workloads converge as transactional and log events are analyzed in streams and intelligently drive capabilities such as robotic process automation, just-in-time maintenance, and next-best-action or advise workers in their activities. Databases need the ability to run Lambda architectures and manage workloads across historical and stream data in a manner that supports real-time actions.

JB:  What are some of the market forces driving these other aspects of “hybrid” in data management?

MG:  Hybrid offers companies the ability to build adaptive composable systems that are flexible to changing business demands for data and insight. New data marts can spin up and be retired at will, allowing organizations to reduce legacy marts and conflicting data silos. Hybrid data management provides a platform where services can be built on top using APIs and connectors to connect any application. Cloud helps lower the total cost of ownership as new capabilities are spun up, while management layers allow administrators to easily shift workloads and data between cloud and on-premises to further optimize cost. Additionally, data service levels are better met by hybrid data management, as users can independently source, wrangle, and build out insights with lightweight tools for integration and analytics. In each of these examples, engineering and administration resource hours are reduced or current processes are optimized for rapid deployment and faster time-to-value for data.

JB:  What about hybrid data integration? That can span both data integration and application integration. What about business-to-business (B2B) integration? What about integration specialists versus “citizen integrators”?

MG:  Hybrid integration is defined by service-oriented data that spans data integration and application integration capabilities. Rather than relying strictly on extract, transform, and load (ETL)/extract, load, and transform (ELT) and change data capture, integration specialists have more integration tools in their toolbox to design data services based on service-level requirements. Streams allow integration to happen in real time with embedded analytics. Virtualization lets data views come into applications and business services without the burden of mass data movement. Open source ingestion provides support for a wider variety of data types and formats to take advantage of all data. APIs containerize data with application requirements and connectivity for event-driven views and insight. Data becomes tailored to business needs.

The other wave in data integration is the emergence of self-service, or the citizen integrator. With little more than an understanding of data and how to manipulate it in Excel with simple formulas, people with less technical skills can leverage and reuse APIs to get access to data or use catalog and data preparation tools to wrangle data and create data sets and APIs for data sharing. Data administrators and engineers still have visibility into and control over citizen integrator content, but they are able to focus on complex solutions and open up the bottlenecks to data that users experienced in the past.

Overall, these two trends extend flexibility, allow deployments to scale, and get to data value faster.

Hybrid data management and integration is the next-generation strategy for enterprises to go from data rich to data driven. As companies retool their businesses for digital, the internet of things (IoT), and new competitive threats, the ability to have architectures that are flexible and adapt and scale for real-time data demands will be critical to keep up with the pace of business change. Ultimately, companies will be defined and valued by the market according to their ability to harness data to stay ahead and viable.


Imagine this scenario – you have just “clicked” on an item that you are ordering online. What kind of “data trail” have you generated? Well, you are sure to have generated a transaction – the kind of “business event” that goes into a seller’s accounting system, and then on to their data warehouse for subsequent sales analysis. This was pretty much your entire data trail until just a few years ago.

In recent times, the whole notion of data trails has exploded. The first wave of new data entering your data trail consisted of web and mobile interactions – those dozens or hundreds (or even thousands) of “human events” – research clicks and social media postings that you execute leading up to and after an online order. It turns out that these human interactions, when blended with business transactions, are critical to yielding more insight into behavior.

And now we are entering the next wave of new data – the observations made by the ever-increasing number of intelligent sensors that record every “machine event.” In our example above, for each human interaction supporting your online order, there may be hundreds or thousands of software, network, location, and device metrics being gathered and added to your data trail. Further integrating and correlating these machine observations into your particular flow of business transactions and human interactions would enable game-changing advanced analytic capabilities – promising a “closed-loop” of ever more timely and accurate decisions.

The bottom line is that we find ourselves in a hybrid data landscape of such stunning heterogeneity that it forever changes both the challenges and the opportunities around the capture and analysis of relevant operational data – the business, human, and machine events that make up your data trail. The ability to manage, integrate, and analyze all these hybrid data events at price/performant scale – to build the necessary data-to-decision pipelines – becomes the key to modern data infrastructure and succeeding with modern analytics.


It is rare in life that one gets an opportunity to step back, take a fresh look and reset one’s mission and trajectory. For Actian, today is such a day, as we launch a new vision, a new product solution portfolio and of course, a new tagline. Got to have a new tagline! Although time will tell whether we have hit the mark, I can safely say that we are excited to reveal our new thinking and shine a bright light on it for all the world to see.

Our new vision is built on three observations and a call to action:

1. The World is Flat
It is an incontrovertible fact that data is “flattening” within organizations today. Diverse data is being created and consumed in every corner of a company and across its data ecosystem. Increasingly, the traditional one-place-for-everything data warehouse and today’s centralized data lake just seem like old tired thinking.

2. Data is a Social Animal
Data doesn’t like to live alone – to be effective, data needs to live in an ecosystem that is constantly changing and expanding as it is touched by entities both within and outside a company’s four walls. To truly extract insight from data, one needs context, and that context more often than not comes from other applications, processes, and data sources.

3. Think Big When You Think of the Cloud
Today, the cloud is much more than a place to deploy apps and data. Although the agility and economics of hybrid cloud computing are compelling, it is just the start. A “true” cloud solution is designed to enable companies to blend together data without physically moving it and derive actionable insight, including machine learning, that can be put to work at the speed of an organization’s business (e.g., make a real-time offer to a customer on your website). The traditional static monthly report, for most companies, has the same value as yesterday’s news – zero!

4. Activate Your Data
It seems clear that now is the time for a simple call to action – a call for organizations to “activate their data.” Forward-thinking companies are applying best-fit design tools and innovative technologies to embrace their data ecosystems to ensure their data makes a difference. Whether it is powering a real-time e-commerce offer, detecting financial fraud before it happens or predicting supply chain disruptions, it is critical that the underlying insight garnered can be acted upon at the speed of a company’s business.

This is a reversal of the traditional thinking that analytics tools dictate to the business what the data can and cannot do. Now, the business dictates what insight is needed—where, when and for whom. If an organization’s IT department can’t address these needs in an economical and agile fashion, then knowledge workers are increasingly finding alternative ways, often through a new generation of SaaS solutions, to get their needs met. Serve or be served…out the door!

Meet a Big Idea – Hybrid Data!
And behind all this new thinking is the powerful new concept of hybrid data. Hybrid data has multiple dimensions, including diverse data type and format, operational and transactional data, self-service access, external B2B data exchange and hybrid cloud deployment. Our view is simple – all data needs to be viewed as hybrid data that can be joined and blended with other data across an enterprise’s data ecosystem by anyone at any point in time. It is only when an organization can adopt this progressive approach that it can address the inherent limitations of traditional monolithic data repositories (a nice way to say Oracle or SAP) or alternative siloed point solutions.


Back when I started off in the industry, some 20-something years ago (I do pretend I am still in my 20s so that number has a nice ring to it) there was only one IT Department with one manager in most large organizations. Now there are multiple managers within different departments, some aligned to different parts of the organization. Some pieces are outsourced, some in-sourced and some have contractors working on them.

When it comes to connecting most systems together, the industry is focused on “having a connector to this or that” while the real hard part is how to connect to that particular implementation of that system.

As the technologies evolved over the years, so did the pillars (or silos) of teams. So providing an integration solution to connect multiple systems together is more of a project management (herding cats) nightmare than a connector nightmare. Let’s take a typical mid-sized company that wants to connect its cloud-based applications (CRM, HR, etc.) to its on-premises applications (SAP, Oracle Finance, Dynamics, databases, etc.). This is a pretty simple task, as we have all the connector options, and in the worst case we can always fall back on a web-service-based JSON/XML connector and database connectors. The problem of “do we have a connector to each system?” is solved within minutes.

The real problem and the time killer is how to connect and to whom we will give access. If we consider the layers of technology involved (taking OSI model as a method of stepping through access):

  • Physical Layer – how is the server connected and what speed limits could this be restricted by (is the server connected?).
  • Data Link Layer – what level of QoS do we have, are there any restrictions, which VLAN are we on and what does that VLAN have/not have access to?
  • Network Layer – can we perform a network test to each system we need to connect?
  • Transport Layer – can we retain a connection and what is the performance of that connection?
  • Session Layer – what are the authentication mechanisms for each system? Can we authenticate?
  • Presentation Layer – can we gain access to the metadata behind each system? Do we have sufficient rights?
  • Application Layer – can we see a sample of the data that we are connecting to? Does the data look like what we expected? Can we perform updates, inserts, upserts, deletes, and reads? Has the application been customized and can we access those customizations?
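
The lower-layer checks above can be scripted before any integration work begins. Here is a minimal sketch using only Python’s standard library; the function names are my own, and real target hosts and ports would be substituted in:

```python
import socket
from typing import Optional

def resolve(host: str) -> Optional[str]:
    """Network-layer check: does the hostname resolve to an address?"""
    try:
        return socket.gethostbyname(host)
    except socket.gaierror:
        return None

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Transport-layer check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Session-layer concerns and above (authentication, metadata rights, sample reads) still need the system-specific clients, but a script like this answers the physical-to-transport questions in seconds rather than in an email thread.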

Achieving all of this requires working with different IT teams both internally and externally. It may require working with vendors or other developers outside of the organization as well. Consider the following roles (not an exhaustive list) that would require gaining their trust and knowledge/assistance:

  • Server/Hardware Manager – Virtual server, capacity, server install.
  • Operating System Specialists – Windows / Linux / AIX / etc. Ability to run your integration software? Installation, patching and maintenance? Remote access to the server?
  • Network Manager – In which zone was the server installed? Does it have connectivity to each system? Remote access to the server?
  • Security/Firewall – Which ports are locked down and needed opening for this new service? Is the anti-virus software causing issues? Remote access to the server? Browser access to the server?
  • Cloud Application Specialist – Method of access, security, ability to access? Can we log in?
  • Database Administrators – Database access, rights, simple database read tests.
  • Specialist Applications (SAP BAPI Developers) – Are there some custom BAPIs that need to be used? Which of the standard BAPIs should not be used? Can we use the fat client/web application to view and query the system? Can we use a test/development system?
  • Application Developers – Is there a standard method for requirements gathering, development methodology, peer reviews, user acceptance testing, system testing, load testing?

When we are required to prove we can connect to a system, we spend 90% of our time working with the people above and 10% in doing the actual connection. Knowing who to work with and gaining their trust and buy-in is the real hard yards.


Recently I worked on a POC that required some non-standard thinking. The challenge was that the customer’s use case needed not only high-performance SQL analytics but also a healthy amount of ETL (Extract, Transform, and Load). More specifically, the requirement was for ELT (or even ETLT if we want to be absolutely precise).

Why “might” this have been an issue? Well, typically analytical databases and ETL-style processing don’t play well together; the latter tends to be row-orientated, while the typical analytical database definitely prefers to deal with data in a “chunky” fashion. Analytical databases are usually able to load data in bulk at very high speed but tend to offer modest row-by-row throughput.

Another typical characteristic is the use of table-level write locking – serializing write transactions to one at a time. This is generally accepted as the use cases for analytical databases tend to be about queries rather than any kind of transaction processing. However, when some form of ETL is required it is perhaps even more problematic than the row-by-row throughput as it requires the designer and the loading tool to be aware of this characteristic. The designer often has to “jump through hoops” to figure out how to get the data into the analytical database in a way that other team members can understand and that the tool can deliver.

I’m setting the scene here for the “big reveal” that Actian’s vector processing databases do not suffer from these drawbacks. They can deliver high-end analytical capabilities while also offering “OLTP capabilities” in the manner of HTAP (Hybrid Transactional/Analytical Processing) technologies.

Note the quotes around “OLTP capabilities” – just to be clear, we at Actian wouldn’t position these as high-performance OLTP databases; we’re just saying that the capabilities (row-level locking and concurrent table modifications) are there even though the database is a columnar, in-memory, vector processing engine.

However they are viewed, it was these capabilities that allowed us to achieve the customer’s goals – albeit with a little cajoling. In the rest of this post, I’ll describe the steps we went through and the results we achieved. If you’re not currently a user of either Actian Vector or Actian Vector in Hadoop (VectorH), you might just skip to the end; if you are using the technology, read on.

Configuring for ETL

So, coming back to the use case: this customer needed to load large volumes of data from different sources in parallel into the same tables. Above we said that we offer “OLTP capabilities”; however, out of the box the configuration is geared towards one bulk update per table, so we needed to alter the configuration to handle multiple concurrent bulk modifications.

At their core, Actian databases have a columnar architecture, and in all cases the underlying column store is modified in a single transaction. The concurrent update feature comes from some clever technology that buffers updates in memory in a seamless and ACID-compliant way. The default configuration assumes a small memory model, and so routes large-scale changes directly to the column store while smaller updates are routed to the in-memory buffer. The maintenance operations performed on the in-memory buffer – such as flushing changes to the column store – are triggered by resource thresholds set in the configuration.

It’s here where, with the default configuration, you can face a challenge – situations arise where large-scale updates sent directly to the column store clash with the maintenance routine of the in-memory buffer. To make this work well, we need to adjust the configuration to cater for the fact that there is – almost certainly – more memory available than the default configuration assumes. Perhaps the installer could set these values accordingly, but with a large installed base it’s safer to keep the default behaviour the same for consistency between versions.

So we needed to do two things: first, route all changes through the in-memory buffer, and second, configure the in-memory buffer to be large enough for the amount of data we were going to load. A third, optional step would have been to make the maintenance routines manual and bake the commands that trigger them into the ETL processes themselves, giving those processes complete control of what happens when.

Routing all changes through the in-memory buffer is done using the insertmode setting. Changing this means that bulk operations that would normally go straight to the column store now go through the in-memory buffer allowing multiple bulk operations to be done concurrently.

Sizing the in-memory buffer is simply a matter of adjusting the threshold values to match the amount of memory available or, as suggested above, putting maintenance fully under the control of the ETL process.

The following configuration options affect the process:

  • update_propagation – Whether automatic maintenance is enabled.
  • max_global_update_memory – The total amount of memory that can be used by the in-memory buffer.
  • max_update_memory_per_transaction – As above, but per transaction.
  • max_table_update_ratio – The percentage of a table that can be held in the buffer before the maintenance process is initiated.
  • min_propagate_table_count – The minimum row count a table must have to be considered by the maintenance process.
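
As an illustration only, the changes might look like this in the server configuration. Treat this fragment as a sketch: the section names, value formats, and numbers here are assumptions to be verified against the Vector documentation for your version, not copy-paste settings.

```ini
; Illustrative fragment only -- confirm section names, valid values, and
; units in the Actian Vector documentation for your release.
[engine]
; Route bulk operations through the in-memory buffer so that multiple
; bulk modifications of the same table can run concurrently.
insertmode=row

[memory]
; Size the buffer for the expected volume of concurrent loads.
max_global_update_memory=4G
max_update_memory_per_transaction=1G
; Raise the threshold so automatic propagation to the column store
; triggers less often during heavy load windows.
max_table_update_ratio=50
```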

To trigger the maintenance process manually, execute:

modify <table> to combine

If you want to see more technical details of how to implement this processing, a knowledge base article is available here:

Results

The initial load run of the customer’s data – with the default configuration – took around 13 minutes. With some tuning of the memory parameters so that the maintenance routine was invoked less often, this came down to just over 9 minutes. Switching to all in-memory (still a single stream at this point) moved the needle to just under 9 minutes. This was an interesting aspect of the testing: routing everything through the in-memory buffer did not slow down the process – in fact it improved the time, albeit by a small factor.

Once the load was going via the in-memory buffer, the load could be done in parallel streams. The final result was being able to load the data in just over a minute via eight parallel streams. This was a nice result given that the customer’s existing – OLTP-based – system took over 90 minutes to load the same data with ten parallel streams.
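
The parallel streams in our POC were driven by the ETL tooling, but the pattern can be sketched with a small orchestration script. Here `vwload` is Vector’s bulk loader; the exact flags shown and the helper functions are illustrative assumptions, not a tested invocation:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def partition(files, streams):
    """Round-robin the input files across the requested number of streams."""
    buckets = [[] for _ in range(streams)]
    for i, name in enumerate(files):
        buckets[i % streams].append(name)
    return [b for b in buckets if b]  # drop empty buckets

def load_stream(dbname, table, files):
    """Run one loader process for one bucket of files (flags are illustrative)."""
    cmd = ["vwload", "-t", table, dbname] + files
    return subprocess.run(cmd, check=True)

def parallel_load(dbname, table, files, streams=8):
    """Launch the loader processes concurrently, one per bucket."""
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(load_stream, dbname, table, bucket)
                   for bucket in partition(files, streams)]
        for f in futures:
            f.result()  # propagate any loader failure
```

Because the in-memory buffer accepts concurrent bulk modifications to the same table, each stream can run as a straightforward bulk load with no coordination between them.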

Conclusion

Analytical databases typically face challenges when loading data via traditional ETL tools and methods, being characterised by low row-by-row processing speed and, most notably, table-level write locking.

Actian’s vector processing databases have innovative technology that allows them to avoid these problems and offer “OLTP capabilities”. While stopping short of targeting OLTP use cases, these capabilities allow Actian’s databases to perform high-performance loading concurrently and thereby deliver good performance for ETL workloads.

Read KB Article