Data Lakes, Data Warehouses, and Data Hubs – Do we need another choice?
There’s a long-standing debate, dating back to the early days of Hadoop, about what kind of data repository is best for a given data analytics use case. A Data Lake? A Data Hub? A Data Warehouse? Despite Hadoop’s fall from grace, the debate not only persists but grows more complicated. Today’s cloud-based repositories, including AWS S3, Microsoft Azure Data Lake Storage (ADLS), and Google Cloud Storage, look very much like Data Lakes in the cloud. Similarly, cloud-based offerings like Snowflake look very much like Enterprise Data Warehouses, but in the cloud. Granted, for an apples-to-apples comparison of Data Lakes, you’d need either to pare Hadoop down to just HDFS or to add in the tools for data repository management, query, and so forth from the three public cloud providers’ portfolios.
At the same time, it should be noted that none of the vendors promoting these offerings are using those terms. Microsoft, Amazon, and Google identify their cloud repositories as “Enterprise Data Hubs.” Snowflake positions itself as a Cloud Data Warehouse and is pivoting to call itself a Cloud Data Platform on the strength of its expansive ecosystem, but standalone it’s really an “analytics engine.”
Changing the descriptor doesn’t change the question driving the debate, though, and the simple truth is that no data lake, data hub, or data warehouse, whether on-premises or in the cloud, has ever been able to effectively support all the multi-disciplinary teams of business analysts, data engineers, data scientists, and power users within different lines of business. That was evident before there was a cloud, and it’s only become more evident as teams try to incorporate new data sets (think web services and IoT) and try to merge semi-structured data into structured repositories. Don’t even get me started on the stream of Excel data sheets that were supposed to go away (but never did) when we got more sophisticated about analysis and data management.
But here’s the thing: There are real differences between these platforms, and it’s important to understand those differences. In the end, though (watch for it), the operational differences between these platforms aren’t the root cause of why they’re not providing the support that all the different stakeholders expect.
Let’s start by talking about what we’re actually talking about:
Here, we’ll define a data hub as a gateway through which virtual or physical data can be merged, transformed, and queued up for passage to another destination. That destination might be an application or a database or some other kind of repository (such as a data lake or data warehouse). In any event, data in a data hub is transient; it is not locally stored and has no persistence.
An example of a data hub would be something like Informatica, which can accommodate every imaginable data type and link both upstream and downstream data sources and destinations. Historically, data hubs have been managed and used by IT personnel who work with separate, siloed groups from across the enterprise to create integrations where none naturally existed.
Unlike a data hub, a data lake acts as a repository for persistent data. It is not simply a pass-through. Data lakes can typically ingest and manage almost any type of data and, as exemplified by Hadoop (historically the most popular type of data lake), they provide tools for enriching, querying, and analyzing the data they hold. The problem is that data lakes are generally sandboxes for dumping large sets of data used in experimental projects by highly skilled technical resources, largely IT and developers.
A data warehouse differs from a data lake in that it acts as a repository for persistent and primarily structured data, incrementally built over time from multiple upstream data source silos. A data warehouse also differs from a data lake in that it requires some sort of data hub technology to prepare the data for ingestion. On-premises data warehouses, such as those from big legacy players like Oracle, IBM, and Teradata, are very IT-centric, managed by one or more database administrators (DBAs). While the bulk of data used by business users may ultimately reside in a data warehouse, most of these users have no direct interaction with the data warehouse and may not even know they have one or what it is.
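To make the three definitions concrete, here is a minimal Python sketch. Everything in it is hypothetical and exists only for illustration: the names `data_hub`, `DataLake`, and `DataWarehouse`, the sample records, and the `normalize` step are assumptions, not any vendor’s API. It shows only the distinction drawn above: the hub transforms records in flight and keeps nothing, the lake persists whatever it is given, and the warehouse persists only rows that a hub has already prepared to fit its columns.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator


def data_hub(sources: Iterable[Iterable[dict]],
             transform: Callable[[dict], dict]) -> Iterator[dict]:
    """Merge and transform records in flight; nothing is stored here."""
    for source in sources:
        for record in source:
            yield transform(record)


@dataclass
class DataLake:
    """Persists records of any shape, exactly as they arrive."""
    objects: list = field(default_factory=list)

    def ingest(self, record: object) -> None:
        self.objects.append(record)


@dataclass
class DataWarehouse:
    """Persists only rows that match a fixed set of columns."""
    columns: tuple
    rows: list = field(default_factory=list)

    def load(self, prepared: Iterable[dict]) -> None:
        for record in prepared:
            if set(record) != set(self.columns):
                raise ValueError(f"record does not match schema: {record}")
            self.rows.append(tuple(record[c] for c in self.columns))


# Hypothetical source silos with inconsistent field names and types.
crm = [{"Customer": "Acme", "Amount": "120.50"}]
erp = [{"Customer": "Globex", "Amount": "75"}]


def normalize(rec: dict) -> dict:
    """Hub-side preparation: rename fields and coerce types."""
    return {"customer": rec["Customer"].lower(), "amount": float(rec["Amount"])}


lake = DataLake()
lake.ingest({"raw": "<clickstream blob>"})  # any shape is accepted
lake.ingest(crm[0])

warehouse = DataWarehouse(columns=("customer", "amount"))
warehouse.load(data_hub([crm, erp], normalize))  # the hub feeds the warehouse
print(warehouse.rows)
```

Note that the warehouse rejects the raw CRM record that the lake accepts without complaint, which is the dependency on hub technology described above.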
Virtual Rubber Meets Virtual Road
Historically, Data Hubs, Data Lakes, and Data Warehouses have all had one thing in common: They each require personnel with specialized skills to set them up, maintain them, and manage them, as well as experts who can convert the requests of non-technical business users and analysts into queries and reports that can be run against these data repositories.
As an aside, the complexity of these platforms is one reason for the demise of Hadoop. Hadoop data lakes tended to become dumping grounds for data, and they were only manageable by developers and very skilled (and costly) IT personnel, which limited the business value a Hadoop data lake could generate. It’s not entirely surprising that, as a result, of the big three vendors formerly supporting Hadoop, Cloudera is the last “man” standing.
This need for specialized resources has affected the use of Data Hubs, Data Lakes, and Data Warehouses in other ways as well, and this in turn has further complicated the original question about which platform is best for different use cases. With the move from on-premises to cloud-based infrastructures, there’s been a reduction in demand for all these specialized resources. More and more operational support has been provided by the cloud vendors, which has helped to reduce operating costs. Moreover, the architectural changes in the most recent generations of cloud offerings (separation of compute and storage, pay for what you use, etc.) have created further incentives to move to the cloud to reduce costs.
Increasing Complexity Still Further
While all these structural changes have been taking place, though, the fundamental demand for data-based insights has not changed. The question of how best to gain these insights has only become more difficult to answer. The data that used to be going into on-prem data lakes or data warehouses (via data hubs) is going to the cloud, but the offerings in the cloud are not quite the same as they were on-prem. Their object storage models differ. Microsoft, Amazon, and Google offer persistent data stores and, in that way, may resemble a data lake, but they rely on other tools to perform the data hub functions and cannot therefore be defined as anything more than data stores. They still require data integration or data hub functionality, and their business value is limited in the same way it always has been. The people who directly generate business value (the business analysts, data scientists, and, for lack of a specific title, the other line-of-business power users) still cannot easily access and unlock the insights bound up in the data.
These days, most business analysts and power users are using the built-in analytics and visualization capabilities of siloed applications like Salesforce, Marketo, or whatever ERP platform they rely on to understand business operations or historical outcomes. At the same time, they strive to do more. Business users may try to incorporate data from flat files such as Excel or semi-structured JSON data exposed through web services APIs. Oftentimes, they will get help from IT to export data out of one or more systems, combine it with Excel spreadsheets, and send it to a cube on a periodic basis. The result is painfully familiar: siloed data pipelines tied to siloed analytics and visualization results. Unbeknownst to these business users, when they enlist help from IT, they may actually be leveraging a data hub. Because there’s no data persistence in the hub, they’ve simply used it as a switch to tie a set of data silos and an analytics silo together, creating an ad hoc organizational or project silo.
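As a rough illustration of the kind of one-off pipeline described above, the following Python sketch joins a flat-file export with semi-structured JSON from a web service. The file contents, field names, and helper functions (`load_accounts`, `flatten_orders`, `revenue_by_region`) are all made up for this example, and the data is inlined as strings rather than read from a real spreadsheet or API. Nothing is persisted along the way, which is exactly why the result ends up as yet another silo.

```python
import csv
import io
import json

# Hypothetical flat-file export (e.g. a spreadsheet saved as CSV).
ACCOUNTS_CSV = """account_id,region
A1,EMEA
A2,APAC
"""

# Hypothetical JSON payload, shaped the way a web services API might return it.
ORDERS_JSON = """
[
  {"order": {"id": 1, "account": "A1", "total": 250.0}},
  {"order": {"id": 2, "account": "A2", "total": 99.5}},
  {"order": {"id": 3, "account": "A1", "total": 10.0}}
]
"""


def load_accounts(text):
    """Read the flat file into a lookup of account_id -> region."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["account_id"]: row["region"] for row in reader}


def flatten_orders(text):
    """Flatten the nested JSON into simple, table-like records."""
    for item in json.loads(text):
        order = item["order"]
        yield {
            "order_id": order["id"],
            "account_id": order["account"],
            "total": order["total"],
        }


def revenue_by_region(accounts, orders):
    """Join the two sources on account_id and total revenue per region."""
    totals = {}
    for order in orders:
        region = accounts.get(order["account_id"], "UNKNOWN")
        totals[region] = totals.get(region, 0.0) + order["total"]
    return totals


result = revenue_by_region(load_accounts(ACCOUNTS_CSV), flatten_orders(ORDERS_JSON))
print(result)
```

Every new question (a different grouping, one more source) means another script like this one, written and maintained by whoever has the skills to do it.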
Data scientists and data engineers may be using many of the same data silos, but they may also be using semi-structured data sets such as clickstreams, IoT, and web services. Their destinations may include the same visualization tools but, of course, also include advanced analytics tools to support AI/ML. They may enlist IT to get the data for them and, in turn, create the same point-to-point spaghetti network.
Put simply, the single shared repository of data promised by data lakes, data warehouses, and data hubs remains a dream unrealized. A true analytics hub has yet to materialize, either on-prem or in the cloud.
Shifting the Focus
Cloud vendors are starting to realize the problem, and some are rapidly shifting to address it. However, most of them are doing so by making sure that a Cloud Data Warehouse can act as an upstream data repository to any downstream analytics, reporting, and visualization tool. Often this is being tried through a partner ecosystem, as with Snowflake. This is necessary but insufficient for the Analytics Hub we all really need.
But wait. An Analytics Hub? Where was that in the definitions above?
The fact of the matter is that the Cloud Data Warehouse is currently an analytics engine, but one without a data hub built in on the back end and with a focus on separate point-to-point connections to various BI and analytics tools on the front end. Vendors like Snowflake do not mention analytics hubs, let alone claim to be one. Further, without the ability to easily get data from data sources and tie together composite elements of data from those various sources for presentation to the analytics tools, you don’t really have an analytics hub, chiefly because you don’t have a data hub.
Instead of just a data hub or analytics hub both usable only by IT, what’s really needed is a data analytics hub that is used by a broad array of IT and business users. More on what this is and why it matters in the next blog.