Data Architecture

Operationalizing Your Hadoop Data Lake

Actian Corporation

July 15, 2020

Have you ever tried to join your fact and dimension tables together to generate operational analytics? If you want to perform large-scale data analysis on things like customer churn, you’re probably going to need to do this. The problem is, that these tables are very large, and when you bring them together, the resulting materialized data table is enormous (as in exponentially bigger than the source tables) and likely to time out before it ever completes. If you can successfully pre-join the data together, the resulting data is probably out of date (sorry). This is because the source data is constantly changing, and the resulting data is so big the queries you try to execute against it aren’t going to perform very well either. There has to be a better way!

Vector vs. Traditional Databases – Columnar Orientation

Vector for Hadoop is designed to help address this problem by enabling you to bypass the intermediary step of pre-joining data in a materialized data table, and instead perform high-performance “on-demand” joins. The Vector database starts by reorienting your data into vectors where SIMD (single instruction multiple data) operations can be performed. Essentially, Vector is reorienting your data to focus on the relationships between attributes (across multiple records) instead of focusing on the attributes associated with a single record. This is important for a couple of reasons.

Most of the business questions that you are trying to solve with analytics relate to aggregate data (average transaction size, number of sales in a period, etc.). You are trying to understand a process holistically instead of tracing individual transactions. This business purpose is important because it is the basis for why you need to be running analytics against a data warehouse instead of your transactional systems. Transactional systems are optimized for the processing of individual transactions (hence the name). Data warehouses are optimized for analyzing batches of data. Both systems are capable of doing the other’s job, but if you are looking for operational analytics, you need to be using a data warehouse and ideally one that has a columnar orientation.

Performance Comes From Specialization

Databases and analytics systems are just performing a bunch of mathematical operations on your data – comparing strings of characters, integers, etc. Different types of data have different operations that can be performed on them. Different sets of operations can be performed on different types of data. For example, you can do add, subtract, greater than, less than, min and max operations on numbers, but for strings of characters you can only really do character matching (equal to, contains, starts with, etc.). Computers are designed to do mathematical operations on numbers, so these types of transactions perform faster.

The problem with traditional row-based databases is that the attributes associated with a record of data will likely have mixed data types. To analyze this data, the system essentially must revert to the “least common denominator,” which, in most cases, is string comparisons. By reoriented the data into columns, Actian Vector enables you to perform (faster) numerical operations when the data supports it, thus speeding up overall query performance.

No Need to Wait for Your Data to Update

The traditional approach to pre-joining data to enable analytics works fine if you are dealing with historical data that is essentially static. If it takes a few hours to join the data, that isn’t a big deal. It becomes problematic when the source data is constantly changing, and the analytics insights you are trying to develop are being used for real-time decision making. Increasingly, more companies are looking to operationalize their Hadoop data lakes and use the data they contain to power operations dashboards and real-time process monitoring capabilities that have little tolerance for data latency. A big benefit of the Actian Vector solution is that it bypasses the need to pre-aggregate and pre-process data into the materialized data tables and instead run on-demand joins of the source data tables. This is precisely what operational analytics demands.

To learn more about Actian Vector for Hadoop, visit https://www.actian.com/lp/actian-vector-sql-accelerator-for-hadoop/

About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale. Actian data intelligence solutions help streamline complex data environments and accelerate the delivery of AI-ready data. Designed to be flexible, Actian solutions integrate seamlessly and perform reliably across on-premises, cloud, and hybrid environments. Learn more about Actian, the data division of HCLSoftware, at actian.com.

Actian Data Intelligence Platform

Core Capabilities

Actian Data Observability

Core Capabilities

Actian Data Platform

Core Capabilities

Data Management

Products

Databases

Products

Product Overview

All Products

Operationalizing Your Hadoop Data Lake

Vector vs. Traditional Databases – Columnar Orientation

Performance Comes From Specialization

No Need to Wait for Your Data to Update

About Actian Corporation

Ready to Get Started?

Operationalizing Your Hadoop Data Lake

Vector vs. Traditional Databases – Columnar Orientation

Performance Comes From Specialization

No Need to Wait for Your Data to Update

About Actian Corporation

Related Tags

Subscribe to the Actian Blog

Subscribe

Thank you for subscribing to the Actian Blog!

Ready to Get Started?