I keep hearing the term “Data Lake” and I’m beginning to get edgy about it. The concept is simple enough. There needs to be a catchment area or, if you prefer, a landing zone, for data entering the organization. In the past most businesses didn’t need to organize such a data store because almost all data was internal. It traveled via traditional ETL mechanisms from transactional systems to a data warehouse and then got sprayed around the business as required.
When a good deal of data comes from external sources, or even from internal sources like log files, which never previously made it into the data warehouse, there is a need for a catchment area. This has definitely become the premier application for Hadoop. And it makes perfect sense to me that such technology be used for a data catchment area. The neat things about Hadoop for this application are that:
- It scales out “as far as the eye can see,” so there’s no likelihood of it being unable to manage the data volumes even when they grow beyond the petabyte level.
- Paired with HBase, it can serve as a key-value store, which means that you don’t need to expend much effort in modeling data when you decide to accommodate a new data source. You just define a key and define the metadata at leisure (in HCatalog, perhaps).
- The cost of the software and the storage is really very low.
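The key-value point above is worth making concrete. Here is a minimal sketch of the schema-on-read idea, using a toy Python class as a stand-in for a real key-value store such as HBase: records land under a generated key with no up-front modeling, and a schema for the source is registered later, the way one might eventually add a table definition in HCatalog. The class and method names are illustrative, not any actual Hadoop API.

```python
import time
import uuid

class LandingZone:
    """Toy stand-in for a key-value landing zone (e.g. HBase).
    Raw records are stored under a generated key; metadata for a
    source can be registered later, once we know what the data means."""

    def __init__(self):
        self.records = {}   # key -> raw payload, stored as-is
        self.metadata = {}  # source -> schema notes, added at leisure

    def ingest(self, source, payload):
        # Key = source + timestamp + unique suffix; no modeling needed.
        key = f"{source}:{time.time():.0f}:{uuid.uuid4().hex[:8]}"
        self.records[key] = payload
        return key

    def register_schema(self, source, schema):
        # Deferred: done when we decide how to interpret the data.
        self.metadata[source] = schema

zone = LandingZone()
key = zone.ingest("weblogs", '{"ip": "10.0.0.1", "path": "/home"}')
zone.register_schema("weblogs", {"ip": "string", "path": "string"})
```

The point of the sketch is the ordering: ingestion happens immediately and cheaply, while the modeling effort is deferred until the data has a known use.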
So let’s imagine that we do indeed have a requirement for a data catchment area, because we have decided to collect data from log files, mobile devices, social networks, public data sources, or whatever. Let us also imagine that for this purpose we have implemented Hadoop and some of its useful components (HBase, HCatalog, Hive, Pig, etc.) and we have begun to collect data.
Is it reasonable to describe this as a data lake?
Getting Picky About a Metaphor
OK, I know I’m getting picky about a metaphor, but metaphors have the power to influence behavior, so the choice of such words can make a difference. A lake is a hollow in the land in which water collects, because geography made it that way. Normally it is fed by various streams or even rivers, and it can either be the final destination for all that water or, possibly, water may flow out from it at some point.
A Hadoop implementation should not be a set of servers randomly placed at the confluence of various data flows. The placement needs to be carefully considered, and if the implementation is to resemble a “data lake” in any way, then it needs to be a well-engineered, man-made lake. And since the data doesn’t just sit there until it evaporates, but eventually flows on to various applications, we should think of this as a “data reservoir” rather than a “data lake.”
And let’s not kid ourselves that this is some kind of data warehouse. The term “data lake” has arisen, in my opinion, because nowadays we don’t get data delivered to the main data store, after the fashion of trucks delivering pallet loads of products to a warehouse. Nowadays the data doesn’t arrive in batches so much as flow in like water.
And there is no point in arranging all that data neatly along the aisles of the warehouse when we get it, because for some data we may not know what we want to do with it at the time it arrives. We’ll do the organizing when we know.
Another reason we should think of this as more like a reservoir than a lake is that we might like to purify the data a little before sending it down the pipes to applications or users that want to use it. So let’s think about that a little.
Hadoop As The Reservoir
Once we’ve collected the data in our reservoir we probably want to purify it in situ – and if Hadoop is our reservoir technology of choice, we should be able to schedule that – indeed, using YARN it might be possible to have a job running continually whose main function is to ingest and clean all data.
Most people are now well aware that Hadoop is not a fast processing environment, but that’s not the point of a data reservoir. If you want fast processing then you will no doubt send the data to a muscle-bound database and clean it in-flight before it gets there, perhaps dropping a copy of the data into the reservoir for the sake of archiving.
And if you want Hadoop to go really fast, then there are some products which will deliver speed over Hadoop (I’m thinking of Actian’s DataFlow). But I tend not to think of the data reservoir in that way. I think of there being many jobs running on it and sharing its scale-out capability. I see this as the data warehouse story rewritten.
Once data warehouses were built, it soon became obvious that there was far too much useful data in the warehouse to allow everyone who wanted to use it direct access. Thus the data mart was invented. We will discover the same with the data reservoir. The important thing will be to keep the data flowing in and to make sure the data it holds can flow to where it needs to go in a timely fashion. Cleaning data in situ is going to be necessary and, if there’s spare capacity, it may be possible to run other jobs on the data: a little BI dashboard work here, a bit of predictive analytics work there, perhaps.
But if any of that introduces unacceptable latency into the main function of the data reservoir, then those jobs will be moved elsewhere.
And by the way, if anyone asks you, it’s called a DATA RESERVOIR, not a data lake.