Operationalizing your Hadoop Data Lake

By Pradeep Bhanot
July 15, 2020

Have you ever tried to join your fact and dimension tables together to generate operational analytics? If you want to perform large-scale data analysis on things like customer churn, you will probably need to. The problem is that these tables are very large, and when you bring them together, the resulting materialized data table is enormous (potentially far larger than either source table) and the join is likely to time out before it ever completes. Even if you do successfully pre-join the data, the result is probably already out of date (sorry). The source data is constantly changing, and the resulting table is so big that the queries you try to execute against it won't perform very well either. There has to be a better way!

Vector vs. Traditional Databases – Columnar Orientation

Vector for Hadoop is designed to help address this problem by enabling you to bypass the intermediary step of pre-joining data into a materialized data table, and instead perform high-performance "on-demand" joins. The Vector database starts by re-orienting your data into vectors on which SIMD (single instruction, multiple data) operations can be performed. Essentially, Vector re-orients your data to focus on the relationships between attributes across multiple records instead of on the attributes associated with a single record.

This is important for a couple of reasons. Most of the business questions you are trying to solve with analytics relate to aggregate data (average transaction size, number of sales in a period, etc.). You are trying to understand a process holistically instead of tracing individual transactions. This business purpose is important because it is the basis for why you need to be running analytics against a data warehouse instead of your transactional systems.
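The row-versus-column distinction can be sketched in a few lines of Python. This is a simplified illustration of the storage idea, not Vector's actual engine; the table and values are made up for the example:

```python
# Row orientation: each record keeps all of its attributes together,
# mixing data types (strings and floats) within one record.
rows = [
    {"customer": "A", "region": "west", "amount": 120.0},
    {"customer": "B", "region": "east", "amount": 75.5},
    {"customer": "C", "region": "west", "amount": 210.0},
]

# Column orientation: each attribute is stored as its own vector,
# so every vector holds values of a single, uniform type.
columns = {
    "customer": ["A", "B", "C"],
    "region":   ["west", "east", "west"],
    "amount":   [120.0, 75.5, 210.0],
}

# An aggregate like "total sales" has to walk every record in the row
# layout, but touches only one contiguous, uniformly typed vector in
# the columnar layout -- which is what lets SIMD hardware process
# several values with a single instruction.
total_row_oriented = sum(r["amount"] for r in rows)
total_columnar = sum(columns["amount"])
assert total_row_oriented == total_columnar == 405.5
```

Both layouts produce the same answer; the columnar one simply puts the data in the shape that aggregate queries and vectorized hardware want.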
Transactional systems are optimized for the processing of individual transactions (hence the name). Data warehouses are optimized for analyzing batches of data. Both systems are capable of doing the other's job, but if you are looking for operational analytics, you need to be using a data warehouse, and ideally one that has a columnar orientation.

Performance Comes from Specialization

Databases and analytics systems are ultimately just performing a large number of mathematical operations on your data – comparing strings of characters, integers, etc. Different types of data support different sets of operations. For example, you can do add, subtract, greater-than, less-than, min, and max operations on numbers, but on strings of characters you can really only do character matching (equal to, contains, starts with, etc.). Computers are designed to do mathematical operations on numbers, so these operations perform faster. The problem with traditional row-based databases is that the attributes associated with a record of data will likely have mixed data types. To analyze this data, the system essentially must revert to the "least common denominator," which, in most cases, is string comparison. By re-orienting the data into columns, Actian Vector enables you to perform faster numerical operations wherever the data supports them, thus speeding up overall query performance.

No Need to Wait for Your Data to Update

The traditional approach of pre-joining data to enable analytics works fine if you are dealing with historical data that is essentially static. If it takes a few hours to join the data, that isn't a big deal. It becomes problematic when the source data is constantly changing, and the analytics insights you are trying to develop are being used for real-time decision making.
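The staleness problem with pre-joined data can be made concrete with a small sketch. The fact and dimension tables below are hypothetical, and the join is a toy in-memory version of what a database would do at query time, but it shows why a materialized snapshot drifts out of date while an on-demand join does not:

```python
# Hypothetical fact and dimension tables, keyed by customer_id.
facts = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 2, "amount": 250.0},
]
customers = {
    1: {"name": "Acme", "segment": "retail"},
    2: {"name": "Globex", "segment": "wholesale"},
}

def join_on_demand(facts, customers):
    """Join at query time, so results always reflect current source data."""
    return [{**f, **customers[f["customer_id"]]} for f in facts]

# A pre-materialized join is a snapshot: later updates to the source
# tables are invisible until the (expensive) join is re-run.
materialized = join_on_demand(facts, customers)

customers[2]["segment"] = "enterprise"   # the source data changes...
fresh = join_on_demand(facts, customers)

assert materialized[1]["segment"] == "wholesale"   # snapshot is stale
assert fresh[1]["segment"] == "enterprise"         # on-demand sees the update
```

In a real deployment the "snapshot" is a materialized table that takes hours to rebuild, which is exactly the latency an on-demand join avoids.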
Increasingly, companies are looking to operationalize their Hadoop data lakes and use the data they contain to power operations dashboards and real-time process-monitoring capabilities that have little tolerance for data latency. A big benefit of the Actian Vector solution is that it bypasses the need to pre-aggregate and pre-process data into materialized data tables and instead runs on-demand joins against the source data tables. This is precisely what operational analytics demands.

To learn more about Actian Vector for Hadoop, visit https://www.actian.com/lp/actian-vector-sql-accelerator-for-hadoop/

About Pradeep Bhanot

Product marketing professional, author, father, and photographer. Born in Kenya. Lived in England through the disco, punk, and new-romance eras. Moved to California just in time for grunge. Worked with Oracle databases at Oracle Corporation for 13 years. Database administration for mainframe IBM DB2 and its predecessor SQL/DS at British Telecom and Watson Wyatt. Worked with IBM VSAM at CA Technologies and Serena Software. Microsoft SQL Server-powered solutions from 1E and BDNA.