This year, I’ve spent the first three months and change talking to people on both sides of the line, software vendors and IT users, about what a Big Data Information Architecture is, or will turn out to be as time passes. There is fairly strong agreement about some things, so let me list a few of them:
- The atom of data is the event.
- We have entered a world of data flows, and there will be two primary data flows.
- Hadoop has an important role to play as a data reservoir.
- The Enterprise Data Warehouse will not be usurped by Hadoop (at least not any time soon)
The diagram below summarizes what this means.
If we think of an event, whether it is a record from an embedded sensor, a data stream, a log file, or anything else, as constituting an atom of data, we can easily envisage a constant flow of such atoms. Many will come from outside the business and many will come from inside. While some such data may arrive in batches, we can think of all of it as flowing: if you read serially from a batch file, you naturally create a data flow.
All of the event records arriving in this data flow need to pass through a Filtering and Routing job, which will most likely divide the data in the following way:
- Any data that needs to be processed immediately, by streaming applications of any kind, is duplicated with one (or more) copies routed to the streaming applications and one copy sent to the Data Reservoir.
- All other data is sent to the Data Reservoir for ingest.
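The routing logic above can be sketched in a few lines. This is a simplified stand-in, not a real streaming engine: the predicate and the two sinks are assumptions, standing in for whatever classifier and delivery mechanisms a production pipeline would use.

```python
def filter_and_route(events, needs_streaming, to_streaming, to_reservoir):
    """Split the incoming event flow in two.

    Events that need immediate processing are duplicated: one copy goes
    to the streaming applications, while one copy of every event, urgent
    or not, is sent to the Data Reservoir for ingest.
    """
    for event in events:
        if needs_streaming(event):
            to_streaming(event)   # copy for real-time processing
        to_reservoir(event)       # every event lands in the reservoir

# Hypothetical usage: alerts are streamed; all events reach the reservoir.
streamed, reservoir = [], []
filter_and_route(
    [{"type": "alert", "id": 1}, {"type": "log", "id": 2}],
    needs_streaming=lambda e: e["type"] == "alert",
    to_streaming=streamed.append,
    to_reservoir=reservoir.append,
)
```

Note that the real-time path is a duplicate, not a diversion: the reservoir still receives everything, which is what lets it ultimately gather all corporate data.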
So this gives us two data flows: one is a “real-time” flow of data that is used instantly, and the other is a flow into the Data Reservoir, which ultimately gathers all corporate data. From the Data Reservoir, data will flow to other places, for example to the Enterprise Data Warehouse (EDW), although not all data will necessarily leave the Data Reservoir.
As regards the “real-time” data flow, some of the streaming applications may use other data aside from the real-time stream, particularly historical data, possibly drawn from the Data Reservoir or the EDW.
The Data Reservoir
The data reservoir is a massively scalable staging area that governs the second, not-so-real-time data flow. Two things happen here. First, data is prepared and cleansed for later use, preferably on ingest. Second, ETL (or ELT) jobs are run and data is extracted for use elsewhere.
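To give a flavor of what “prepared and cleansed on ingest” means, here is a deliberately simplified sketch; the specific rules (trimming whitespace, normalizing field names, dropping empty values) are illustrative assumptions, standing in for whatever preparation logic a real pipeline applies:

```python
def cleanse_on_ingest(raw_event):
    """Prepare one event as it enters the reservoir: trim whitespace,
    normalize field names to lowercase, and drop empty values.
    A toy stand-in for real data-preparation logic.
    """
    return {
        key.strip().lower(): value.strip()
        for key, value in raw_event.items()
        if value and value.strip()
    }
```

Doing this once, on ingest, means every later consumer of the reservoir sees clean data rather than repeating the same fixes downstream.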
The most obvious thing that changes with Big Data is that data volumes are higher – possibly very high indeed. This creates a natural architectural preference for bringing processing to the data rather than moving data around, wherever that is possible. However, we have to acknowledge the reality that while Hadoop is a parallel environment, and hence some jobs can run reasonably fast, it is not an optimized environment. Hadoop is really a general-purpose file system rather than one optimized for a particular set of workloads.
This means that there is a choice to be made about what to run on Hadoop. Bear in mind that Hadoop’s primary goal as a Data Reservoir is to ingest data, prepare it, and serve it up. Any workloads that could interfere with the timeliness of that work need to be executed elsewhere.
The Logical Data Warehouse
It is probably better to think in terms of a logical data warehouse rather than assume that a single physical database can do the whole job. One of the things that the NoSQL movement undeniably proved is that hierarchical data structures are both useful and important. As a consequence, so is JSON, the standard means of getting at them in document stores like MongoDB. With less fanfare, the virtue of graph databases has also been demonstrated in recent years. It is thus entirely feasible to think in terms of having several physical database engines for all the possible workloads that might arise on a large collection of data – the logical data warehouse.
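The appeal of hierarchical structures is easy to see in a small example. The order document below is invented for illustration: one JSON record carries its own nested structure, which a relational model would spread across several joined tables, and navigating it is a path expression rather than a join.

```python
import json

# A hierarchical "document": customer and order lines nest inside
# the order itself rather than living in separate joined tables.
order_json = """
{
  "order_id": 1001,
  "customer": {"name": "Acme Corp", "city": "Boston"},
  "lines": [
    {"sku": "A-1", "qty": 2},
    {"sku": "B-7", "qty": 1}
  ]
}
"""
order = json.loads(order_json)

# Path navigation instead of joins:
city = order["customer"]["city"]
total_qty = sum(line["qty"] for line in order["lines"])
```

Document stores apply the same idea at database scale, indexing and querying along those paths directly.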
Nevertheless some databases – Teradata offers a good example – try to cater for every possible kind of query workload.
As has always been the case, and as we indicate in the diagram, there is also the possibility of pulling data marts out of the logical data warehouse, out of Hadoop, or perhaps out of both. Traditional BI applications can obviously run from such data marts.
The Analytical Question
Having described all of this, the question that remains unanswered is the analytical one: where should analytical workloads run? This is not a trivial question to answer, at least not once you’ve committed to having a data reservoir. We’ll address it in the next blog.