And why is it better than a Data Lake or an Analytics Hub?
In the opening installment of this blog series, “Data Lakes, Data Warehouses and Data Hubs: Do we need another choice?”, I explore why simply migrating these on-prem data integration, management, and analytics platforms to the Cloud does not fully address modern data analytics needs. In comparing these three platforms, it becomes clear that while all of them meet certain critical needs, none of them meets the needs of business end-users without significant support from IT. What we in fact need is a platform that combines the optimal operational and analytical elements of these platforms with features and functionality that directly address the real-time operational and self-serve needs of business (rather than IT) users.
Since the current implementations of Data Hubs, Data Lakes, and Data Warehouses do not effectively incorporate or identify the combinatorial and analytical needs of real-world users, you might think that a more straightforward and descriptive term like “Analytics Hub” would flip the focus in the right direction. Sadly, this is one of those garden paths that just leads to disappointment and soul-searching.
Why not just call it an Analytics Hub?
Simply put, the term is already in use, and in unhelpful ways. Some Analytics Hubs focus on consolidating small, disparate datasets (such as those in Excel spreadsheets and other sources) that a data scientist might want to exploit. Other Analytics Hubs can access and analyze disparate data sources, but solely within the confines of that particular tool and only for immediate consumption. Few of these offerings can return sub-second query results over multi-terabyte datasets, or run complex advanced analytics as operational workloads.
Indeed, these Analytics Hubs operate more like switches than actual hubs, just as the miscategorized Data Hub does. They do not persist data at the point of unification; instead, they depend on an external cloud data warehouse or data lake to store and supply their input data. There is no effort to curate data across multiple projects, users, and long-term use. The one central, redeeming quality of these Analytics Hubs is that their intended users are business analysts, business data scientists, and similar power users. Consequently, Analytics Hubs focus on simple drop-down menus, avoid coding for access to data, and allow for self-service, particularly for pick-up files that are largely under control of the end user anyway.
To get comprehensive, real-time insights from analytics, users need a single consolidated picture of all the relevant data. That data must then be presented for analysis by many different stakeholders using many different tools. The point of data unification must balance disparate data AND disparate analytics tools. Analytics Hubs tend not to handle more than a couple of different inputs and outputs at any given time, let alone data curation.
Call it a Data Analytics Hub Instead
What kind of platform would do this? Let’s call it a Data Analytics Hub.
That might seem like an obvious refinement, but it turns out that the obvious isn’t always so obvious. Terms like “Data Hub,” “Data Lake,” and “Data Warehouse” all have search frequencies in the tens to hundreds of thousands per month. “Data Analytics Hub” has a lower per-month search frequency than I have years on this planet. I’m making it my mission to change that. Given the relative obscurity of the term, though, I feel it’s important to explore what a data analytics hub is, how it differs from an “Analytics Hub,” and why it’s better for modern analytics than any of the aforementioned options.
A Data Analytics Hub draws elements from all four technologies above (and if you haven’t read the initial blog in this series and don’t know the differences between Data Hubs, Data Lakes, and Data Warehouses, I would encourage you to take eight minutes to go back and read it).
- Like a Data Hub, a Data Analytics Hub provides connectivity to disparate data sources in both batch and streaming modes. However, unlike a Data Hub, a Data Analytics Hub provides persistence in a cloud repository. Further, it provides curation for a diverse set of disparate data types that may be ingested in both batch and streaming modes, with self-service, low-to-zero-code options through drop-down menus for non-IT users.
- Like a Data Lake, a Data Analytics Hub’s cloud storage repository can handle all data types and leverage industry standards for data movement and analysis (a la Kafka and Spark). However, unlike today’s typical Data Lake, a Data Analytics Hub also provides structure and support for end-user facing BI and advanced analytics workloads through use of SQL (more in the manner that a Cloud Data Warehouse does). In essence, it’s a bi-directional hub, supporting multiple inputs and outputs, solving for all permutations of input data and output tools used by a diverse set of non-IT users.
- Indeed, a Data Analytics Hub provides downstream (meaning in the direction of the end-user) support for most popular BI, reporting, visualization, and advanced analytics tools. However, unlike today’s Data Hubs, Data Lakes, and Data Warehouses, a Data Analytics Hub provides user-friendly self-service tools that enable non-technical users to link any data source to any end-user tool — without the need for IT intervention (on either a one-off or day-to-day basis).
In short, a Data Analytics Hub combines the critical data collection and analytical features of these well-known solutions but exposes all those features in ways that key business users can access easily and incorporate into programs and processes. The figure below shows a baker’s dozen key features, drawn from these four technologies, combined into a single integrated platform.
In layman’s terms, it’s a curated data store with management and analytics capabilities that acts as a bi-directional hub for disparate and diverse data sets on one end and analytics tools on the other, directly usable by business analysts and data scientists to rapidly and iteratively generate insights.
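To make the “bi-directional hub” idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any actual product works: an in-memory SQLite database stands in for the cloud repository, and the `sales` table and the function names (`open_hub`, `ingest_batch`, `ingest_stream`, `totals_by_region`) are hypothetical names invented for this example. Batch and streaming inputs land in one persistent store, and the output side is plain SQL of the kind a BI or reporting tool might issue.

```python
import sqlite3


def open_hub():
    """Create the stand-in store (in-memory SQLite, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL, source TEXT)")
    return conn


def ingest_batch(conn, rows, source):
    """Input side, batch mode: load many (region, amount) rows at once."""
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [(region, amount, source) for region, amount in rows],
    )
    conn.commit()


def ingest_stream(conn, events, source):
    """Input side, streaming mode: persist each event as it arrives."""
    for region, amount in events:
        conn.execute(
            "INSERT INTO sales VALUES (?, ?, ?)", (region, amount, source)
        )
        conn.commit()


def totals_by_region(conn):
    """Output side: an ordinary SQL query over the unified data."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


hub = open_hub()
ingest_batch(hub, [("east", 100.0), ("west", 50.0)], source="spreadsheet")
ingest_stream(hub, iter([("east", 25.0)]), source="event_feed")
print(totals_by_region(hub))  # [('east', 125.0), ('west', 50.0)]
```

The point of the sketch is simply that the data is persisted at the point of unification, so any number of downstream tools can query the same curated picture regardless of how each record arrived.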
Why is a Data Analytics Hub better than a Data Lake?
In the last blog, I suggested in passing that it would be inaccurate to equate Hadoop, the foremost on-premises Data Lake, to AWS S3, Microsoft Azure ADLS and Google Cloud Storage (the three major public cloud storage repositories). A more apt comparison would be between the Hadoop Distributed File System (HDFS) and those cloud-based repositories plus the AWS/Azure/Google-accessible equivalents of the components Hadoop provides for data and systems management, queries, ML, etc. (including YARN, Hive, MapReduce, Pig, Mahout, Flume, and on and on). Once you get past the alphabet soup, yes, you’ll find several different database options, a data warehouse, renamed or embedded versions of Kafka and Spark, a separate ETL tool, and a vendor’s in-house analytics tool. The clear upside to this is the economics of the cloud. The downside, though, is that this cloud-based data lake remains a complex platform that is only navigable and usable by IT and developers.
Don’t get me wrong, this isn’t a rant against Open Source. Embedding Open Source in a platform, particularly for functionality that has become commoditized, makes perfect sense. All vendors should do this. Nor is this a knock on having a prescriptive recommendation on which Analytics tools your platform works best with. But historically, this type of platform has spiraled down into the trough of disillusionment all too often. It becomes inscrutable to end users such as the business analyst and power user who specialize in a particular line of business and who use data science as a tool to make sense of their business.
In other words, once you’ve transitioned from the pure science of Data Science, or once you’re at the point where you want to use traditional BI workloads, reporting, and visualization tools for insights into operational workloads, a Data Lake is the wrong platform. Your end users are business analysts, power users, and data scientists who need to monitor and tweak processes that are deployed and ongoing and that leverage AI/ML that they or their cohorts have devised. They need to be able to interact with both the data and the analytics in relative real time (that is, not when it’s convenient for IT to respond).
In the next installment in this blog series, I’ll delve further into the use cases that make the most sense for a Data Analytics Hub. Oh, and I’ll put to rest any concerns you might have that I’m merely conjuring up a vision of some fabulous hub that will appear in some distant future. I haven’t simply made up a name for something that doesn’t yet exist. As you’ll see, a Data Analytics Hub is out there now.