Big Data engineering has invented a seemingly endless supply of workarounds for both scalability and performance problems. Ultimately, how we approach these problems dictates how sustainable our solutions will be. This post compares some modern best practices with pre-processing cube environments.
3 Ways A Hadoop Database Competes With Cube Analytics
- Software design trumps pre-processing
- Data management capability is crucial
- Control data explosion and platform complexity
Cube (OLAP) engines for analyzing data warehouses are nothing new, but applying them in today’s distributed data management architectures is. This post compares high-performance SQL analytic databases with cube analytics in that distributed environment.
Enterprise cube platforms help users wade through massive amounts of data easily. However, as data analytics moves to massively parallel computing architectures with distributed data sources, it can be hard to tell whether cube-based analytics will help your next project.
Hadoop itself offers little help in getting high-performance analytics out of the box. While you may gain transparent access to a wide variety of data, you often sacrifice performance, data management and data expansion control: three key components for mastering the data lake as it grows.
This is why Hadoop adopters likely do not have cube analytics in their future: “fast” databases bypass the overhead and complexity of maintaining a cube-based system.
Software design trumps pre-processing
Cube analytics are the bailout for bulging data warehouses or under-equipped analytic databases when they get too big, too heavy and too slow to keep up with increasingly difficult workloads. Why have these databases performed so poorly as data volumes grew, even when given more advanced hardware? Because legacy approaches to software engineering have limited performance improvements.
As a result, the industry gets excited when query performance improves merely linearly with hardware improvements. Are hardware improvements (à la Moore’s Law) making all our queries that much faster, and do we even notice? I bet not. How does that make sense when, for example, chip vendors regularly add more ways to optimize processing?
Actian has taken advantage of several improvements in both hardware and software to power highly optimized analytic environments.
Leveraging Better Hardware
Simply put, most systems do not take advantage of the inherent power of modern computing platforms. Put in faster disks (SSD), add better RAM and improve the network connections, and naturally things can run faster. Add cube analytics on top of that and you still improve performance, but only relative to legacy systems running on similarly designed architectures.
Modern databases that exploit the latest hardware and software improvements (chip cache, disk management, memory sizes, columnar storage, etc.) all deliver a performance gain over the legacy approaches. These gains can be better than linear, often exponential, compared with other popular solutions on the market. This is where Actian hangs its hat in the Big Data space (see Actian’s Vector on Hadoop platform).
We should all come to expect significant improvement between generations of both hardware and software. After all, today’s engineers are better educated than ever before and are using the best hardware ever developed.
If you aren’t leveraging these advantages then you’ve already lost the opportunity for a “moon shot” via big data analytics. You won’t be able to plug the dam with old concrete – you need to pour it afresh.
High-performance databases are removing the limits that have strangled legacy databases and data warehouses over the past decades. While cube engines can still run on top of these new analytic database platforms, the platforms are often so fast that they do not need the help. Instead, common Business Intelligence (BI) tools can plug into them directly and maintain excellent query performance.
Data management capability is crucial
Back-end database management capabilities are crucial to any sustainable database. Front-end users merely need SQL access, but DBAs always need tools for modifying tables, optimizing storage, backing up data and cleaning up after one another. This is another area that differentiates next-generation analytical databases from cube engines.
Many tools in the Hadoop ecosystem do one thing well, for example reading various types of data or running analytical processes. This means they cannot do all the things that an enterprise database usually requires. Cube engines are no exception here: their strength is in summarizing data and building queries against it.
When your data is handled by a cube system you no longer have an enterprise SQL database. Granted, you may have SQL access, but you have likely lost insert, update, and rollback capabilities, among others. These are what you should expect your analytic database to bring to the table: ACID compliance, full SQL compliance, a vector-based columnar approach, native operation in Hadoop, and other data management necessities.
Closely related to data management is the ability to get at raw data from the source. With a fast relational database there is no separation of summary data from detailed records. You are always just one query away from higher-granularity data, from the same database and the same table. When the database is updated, all the new data points are readily available for your next query, regardless of whether they’ve been pre-processed or not.
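As a rough illustration of what update and rollback capabilities buy you, here is a minimal Python sketch that uses the standard library's sqlite3 module as a stand-in for any ACID-compliant SQL database. The orders table and both function names are hypothetical, made up for this example; nothing here is specific to Actian Vector or to any cube engine.

```python
import sqlite3


def make_orders_db():
    """Create an in-memory database with one committed order (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.execute("INSERT INTO orders VALUES (1, 20.0)")
    conn.commit()
    return conn


def apply_correction(conn, order_id, new_amount):
    """Update one row inside a transaction; undo the change if it is invalid."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE orders SET amount = ? WHERE id = ?",
                (new_amount, order_id),
            )
            if new_amount < 0:
                raise ValueError("amount must not be negative")
        return True
    except ValueError:
        return False


def current_amount(conn, order_id):
    """Read the stored amount back for one order."""
    row = conn.execute(
        "SELECT amount FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0]
```

A valid correction is committed and visible to the next query, while an invalid one is rolled back and leaves the stored value untouched. That is the kind of behavior to look for in an analytic database, as opposed to a read-only summary layer.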
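To make the "one query away" point concrete, here is a small Python sketch, again using the standard library's sqlite3 module as a stand-in for a relational analytic database. The sales table, its columns and the function names are hypothetical and exist only for illustration. The summary and the detail both come from the same table, and a newly inserted row shows up in the next summary without any pre-processing step.

```python
import sqlite3


def build_sales_db():
    """Create an in-memory table of detail rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("east", "widget", 10.0),
            ("east", "gadget", 5.0),
            ("west", "widget", 7.0),
        ],
    )
    conn.commit()
    return conn


def totals_by_region(conn):
    """Summary computed at query time, straight from the detail rows."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())


def detail_for_region(conn, region):
    """Drill down to the individual records behind one summary figure."""
    rows = conn.execute(
        "SELECT product, amount FROM sales WHERE region = ? ORDER BY product",
        (region,),
    )
    return rows.fetchall()
```

Calling totals_by_region, then detail_for_region for one of the regions, then inserting another row and calling totals_by_region again exercises the whole idea: there is no separate summary store to rebuild before the new row counts.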
Data management matters!
Data changes as new data is ingested but also because users want to modify it, clean it or aggregate it in different ways than before. We all need this flexibility and power to continue leveraging our skills and expertise in data handling.
Control data explosion and platform complexity
Data explosion is real. The adage that we are continually multiplying the rate at which data grows raises the question: how can we keep it as manageable as possible?
This is where I have a fundamental issue with cube approaches to analytics. We should strive to avoid tools that duplicate and explode our exponentially growing data volumes even further. Needless to say, we should also not be moving data out of Hadoop as some products do.
Wouldn’t it make more sense to engineer a solution that directly deals with the performance bottlenecks in the software rather than Band-Aid a slow analytic database by pre-processing dimensions that might not even get used in the future?
Unfortunately, cube approaches inherently grow your data volumes. For example, the Kylin project leaders have said that they see “6-10x data expansion” by using cubes. This also assumes adequately trained personnel who can build and clean up cubes over time. It quickly becomes impossible to estimate future storage needs if you cannot be sure how much room your analytics will require.
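A back-of-the-envelope sketch in Python shows how wide that uncertainty gets. It rests on two assumptions of mine: that the quoted 6-10x means the cube occupies six to ten times the size of the source data, and that a fully materialized cube, before any pruning, holds one aggregate per combination of dimensions, which is 2^n combinations for n dimensions. The function names are hypothetical.

```python
def cuboid_count(num_dimensions):
    """Number of group-by combinations in a fully materialized cube: 2**n."""
    return 2 ** num_dimensions


def cube_storage_range(source_tb, low_factor=6, high_factor=10):
    """Return the (low, high) cube size in TB for the assumed expansion range."""
    return source_tb * low_factor, source_tb * high_factor


if __name__ == "__main__":
    low, high = cube_storage_range(5)
    print(f"5 TB of source data -> between {low} and {high} TB of cube data")
    print(f"10 dimensions -> {cuboid_count(10)} possible aggregates")
```

Under these assumptions, 5 TB of source data needs somewhere between 30 and 50 TB for the cube alone, and each added dimension doubles the number of possible aggregates. The spread between the low and high estimates is what makes capacity planning hard.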
Avoid Juggling Complex Platforms
Many platforms require juggling more technological pieces than a modern analytic database. Keeping data sources loaded and processed in a database is hard enough, so adding layers on top of it for normalization, cube generation, querying, cube regeneration, etc. makes a system even harder to maintain.
Apache’s Kylin project, as just one example, requires quite a lot of moving pieces: Hive for aggregating the source data, Hive for storing a denormalized copy of the data to be analyzed, HBase for storing the resulting cube data, a query engine on top to make it SQL compliant, etc. You can start to imagine that you might need additional nodes to handle various parts of this design.
That’s a lot of baggage; let’s hope that if you are using it, you really need it!
Consider the alternative, like Actian Vector on Hadoop. You compile your data together from operational sources. You create your queries in SQL. Done. Just because many Hadoop options run slowly does not mean they have to, and we don’t need to engineer more complexity into the platform to make up for it.
With an optimized platform you won’t have to batch up your queries to run in the background to get good performance and you don’t need to worry about resource contention between products in your stack. It’s all one system. Everything from block management to query optimization is within the same underlying system and that’s the way it should be.
SQL Analysts vs. Jugglers
The final thing you should consider is your human resources. People can be experts at only a limited number of things, and not all platforms are easy to manage and support over the lifetime of your investment.
We work with a lot of open source projects, but at the end of the day we know our own product best, inside and out. We can improve and optimize parts of the stack at any level. When you use a system with many sub-components that are developed and managed by different teams, different companies, even different volunteer communities, you sacrifice the ability to leverage the power of a tightly coupled solution. In the long term you will want those solutions to be professionally supported and maintained with your needs in mind.
From a practical standpoint I have aimed to show how many of the problems that cubes seek to solve are less of an issue when better relational databases are available. Likewise, careful consideration is important when deciding if the additional overhead of maintaining such a solution is wise. Obviously this varies by situation but I hope these general comparisons are useful when qualifying technology for a given requirement.