There’s been a great deal of noise about Big Data. You could say that the PR volume has been almost deafening. We don’t often think of IT as a fashion industry, and indeed in many of its sectors it is not. But right now Big Data is high fashion by any reasonable definition. Indeed, I believe Gartner has recently declared it to have reached the top of its “hype cycle” and to now be destined to descend, for a while, into the “Trough of Disillusionment.” That may well be the case, although I’m not so sure you can lump everything that sits under the umbrella of “Big Data” together as part of the same trend. The real trend here is toward more powerful and useful data analytics. So let’s consider this in an end-to-end manner.
With BI and analytics applications, you have to capture data, perhaps cleanse it, present it to data analysts in a usable form, and give the analysts the ability to interact with it in a way that suits their activities. Finally, there may be a need to pass the results to some operational application so that the knowledge (intelligence) that is distilled is put to use. So that’s:
- Data Capture
- Data Cleansing
- Data Serving
- Analytical Interaction
- Knowledge Implementation
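The five stages above can be sketched as a simple pipeline. The stage functions below are hypothetical placeholders invented for illustration, not a real framework; the point is only that each stage hands its output to the next:

```python
# A minimal sketch of the five-stage analytics pipeline described above.
# Every function and field name here is a hypothetical placeholder.

def capture(source):
    """Data Capture: pull raw records from a source."""
    return list(source)

def cleanse(records):
    """Data Cleansing: drop records that fail basic validity rules."""
    return [r for r in records if r.get("value") is not None]

def serve(records):
    """Data Serving: present records in a usable, uniform form."""
    return [{"id": r["id"], "value": float(r["value"])} for r in records]

def analyse(records):
    """Analytical Interaction: the analyst works on the served data."""
    values = [r["value"] for r in records]
    return sum(values) / len(values) if values else None

def implement(result):
    """Knowledge Implementation: hand the result to an operational process."""
    return {"metric": "mean_value", "result": result}

raw = [{"id": 1, "value": "3.0"}, {"id": 2, "value": None}, {"id": 3, "value": "5.0"}]
outcome = implement(analyse(serve(cleanse(capture(raw)))))
print(outcome)  # {'metric': 'mean_value', 'result': 4.0}
```

In practice each stage may be a separate system, but the hand-off structure is the same.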
Here we encounter open-source Hadoop, especially where Big Data (large volumes of data) is concerned. Many companies have discovered that Hadoop can usefully serve as a “data reservoir”: a storage area where you can happily park data prior to its usage, and possibly before you even decide whether you are going to use it. The virtue of Hadoop is twofold. First, it is open source and therefore low cost, until your usage expands to the point where you need support and, possibly, good consultancy advice. Second, Hadoop is not a database. Consequently, there is no necessity to do much data design work when taking on data from a new source. You need to record the metadata, but you don’t need to do any database design.
Now if you want to make use of data quickly via this route, you can run MapReduce jobs to pull data out of Hadoop. Ultimately the process here is not so different from what we do with data warehouses; it’s just more flexible and faster to implement. If you don’t use Hadoop, you probably serve data analysts from a data warehouse.
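The MapReduce pattern behind those jobs can be illustrated in miniature. This is a pure-Python simulation of the map, shuffle, and reduce phases with an invented log format; a real job would run across the cluster (for instance via Hadoop Streaming), not in a single process:

```python
from collections import defaultdict

# A miniature, pure-Python illustration of MapReduce: map emits
# key/value pairs, the shuffle groups them by key, and reduce
# aggregates each group.

def map_phase(records, mapper):
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}

# Example: total sales per region from raw log lines (hypothetical format).
logs = ["east,100", "west,250", "east,50"]
mapper = lambda line: [(line.split(",")[0], int(line.split(",")[1]))]
reducer = lambda key, values: sum(values)

totals = reduce_phase(shuffle(map_phase(logs, mapper)), reducer)
print(totals)  # {'east': 150, 'west': 250}
```

The appeal for a data reservoir is exactly this shape: the mapper imposes structure on raw records at read time, so no up-front database design is needed.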
But there’s a health warning here. Hadoop is not a well-tuned multi-user environment (at least not right now), and while you can use it for many things, it is wise not to mix its usage across the first four stages described above. If you do, it may slow things down.
I listed data cleansing second, but it is one of those difficult processes that may need to run at more than one point in the end-to-end analytics process. We can define rules and cleanse the data, if it needs it, as we serve it up, but we may not know that some of the data is “dirty” until the data analyst discovers it to be so. This depends on context.
We can include data transformation along with data cleansing. If we are going to take a pass at a set of data, then we should do both together if we can.
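That single pass might look like the following sketch. The cleansing rules, field names, and transformations here are all invented for illustration; the point is that one loop over the data does both jobs:

```python
# One pass over the data that applies cleansing rules and transformations
# together. The rules and field names are hypothetical examples.

def cleanse_and_transform(records):
    out = []
    for r in records:
        # Cleansing rules: skip records with missing or malformed amounts.
        amount = r.get("amount")
        if amount is None:
            continue
        try:
            amount = float(amount)
        except ValueError:
            continue
        # Transformation: normalise the country code and convert cents.
        out.append({
            "country": r.get("country", "").strip().upper(),
            "amount": round(amount / 100, 2),  # cents -> currency units
        })
    return out

raw = [
    {"country": " us ", "amount": "1250"},
    {"country": "de", "amount": None},    # dropped: missing amount
    {"country": "fr", "amount": "oops"},  # dropped: malformed amount
]
clean = cleanse_and_transform(raw)
print(clean)  # [{'country': 'US', 'amount': 12.5}]
```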
Ideally, the data analyst will have a good deal of control over data acquisition, so that there are at least elements of data self-service. Whether the analyst needs a specialized data engine, say a column-store database or even a scale-out NoSQL database, is going to depend on data volumes and the kind of analytics activity that is intended for that collection of data.
This is the most variable step in the process. The data analyst may not know exactly what he or she will do with the data, once there is a collection of data to work with. This is extremely contextual. In some situations, a good deal of exploration may be required prior to “mining the data” for actionable knowledge. Alternatively, the data analyst may know exactly which algorithms he or she intends to run – although even then, how the project will proceed is unlikely to be predictable ahead of time. There are many possibilities. If, for example, a specific correlation is found in the data, the data analyst may want to check a broad range of historical data to see whether the correlation always existed or whether it has recently emerged.
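That “has the correlation recently emerged?” check can be done with a rolling window over the history. The two series below are made-up illustrative data, and the Pearson calculation is written out in plain Python rather than taken from any particular library:

```python
import math

# Rolling Pearson correlation over a historical window, to see whether a
# correlation between two series has always existed or recently emerged.
# The two series below are invented illustrative data.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rolling_correlation(xs, ys, window):
    return [pearson(xs[i:i + window], ys[i:i + window])
            for i in range(len(xs) - window + 1)]

# Early history: x and y move independently; later they move together.
x = [1, 5, 2, 6, 3, 1, 2, 3, 4, 5]
y = [4, 1, 5, 2, 6, 1, 2, 3, 4, 5]
for i, r in enumerate(rolling_correlation(x, y, window=5)):
    print(f"window {i}: r = {r:+.2f}")
```

Early windows show little or negative correlation while the final window shows a strong positive one – the signature of a correlation that has recently emerged.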
This is even more difficult to predict ahead of time. It might be that the knowledge discovered can be used by specific individuals or even by specific application software, in which case a software development project may be spawned. Alternatively, it may be possible, for example, simply to create a special BI dashboard and make it available to the right person. An important point to note here is that whatever knowledge has been discovered by the analyst is useless until it starts to participate in some business process. That’s where the gold is actually buried.
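Making the knowledge participate in a process can be as simple as encoding it as an operational rule. The churn-risk scoring below, its threshold, and the field names are all hypothetical, invented purely to show the shape of the hand-off from analysis to operations:

```python
# A minimal example of knowledge implementation: a relationship the
# analyst discovered (here, a hypothetical churn-risk rule) is encoded
# and wired into an operational workflow step.

CHURN_RISK_THRESHOLD = 0.7  # hypothetical value found during analysis

def churn_risk(customer):
    """Hypothetical scoring rule distilled from the analysis."""
    score = 0.0
    if customer["months_inactive"] > 3:
        score += 0.5
    if customer["support_tickets"] > 5:
        score += 0.3
    return score

def process_customer(customer):
    """Operational step: act on the knowledge, don't just report it."""
    if churn_risk(customer) >= CHURN_RISK_THRESHOLD:
        return "escalate_to_retention_team"
    return "no_action"

print(process_customer({"months_inactive": 6, "support_tickets": 8}))
# escalate_to_retention_team
```

Until a rule like this sits inside a live process, the discovery is just a report.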
Necessity or Nice To Have
The question we asked in the title of this blog was: is end-to-end analytics a necessity or a “nice to have”?
Even if you think about analytics only in terms of the fairly simple process I have described here, it quickly becomes clear that it can get very messy. Boiling it down, there are really two things to be concerned about:
The first is that data analysts may not actually know what data they need to examine, and once they start to delve into it they may decide that they actually need more data than they first thought. They may also wish to manipulate the data. Obviously it is in no-one’s interest for the analyst to have to co-ordinate with IT regularly just to move the project along. So, in my view, that part of it is clear. The data analyst needs to be able to control data access from gathering through to the end of the analytics step; i.e., the first four steps. And that means end-to-end analytics is necessary if you want to prevent it being a slow and tortuous process.
The second is knowledge implementation. That may not be includable in the end-to-end process anyway, but nevertheless it also needs to be reasonably easy to achieve.