I predict that for the next few years, and possibly even longer, there is going to be confusion about analytic workloads. And this in turn will probably result in some IT users making some poor product selections. It’s a safe prediction. Here’s why:
Remember back, if you can, 25 years, when a good deal of the excitement in the IT industry surrounded databases and OLTP benchmarks. A new kind of database (the relational database) had emerged and in its first few releases (no matter whether you used Ingres or Oracle) it was hopeless at OLTP. This was truly inconvenient for the vendors of relational databases – there were many more such products in gestation at the time – and it was also truly inconvenient for RDBMS enthusiasts, who had drunk the Ted Codd Kool-Aid and believed that relational databases would eventually do everything databases needed to do.
This hopeful new database technology couldn’t actually perform the only workload that really mattered at the time – executing transactions. So the emerging database industry quickly took to benchmarking, to measure the performance improvement of the RDBMS products as time passed. And, of course, the performance did improve. It improved partly because the vendors genuinely did tune up the software, but it improved mostly because Moore’s Law kept ramping up the speed of the hardware. RDBMSs were thus deployed first in OLTP applications where poor latency was acceptable, and then eventually in all OLTP applications, because they had become fast enough. And then BI took off and RDBMSs found a huge area of application in Data Warehouses and Data Marts.
The Transaction Processing Performance Council (TPC) was duly formed to produce standardized benchmarks and it eventually came up with a whole suite of different benchmarks (TPC-A thru TPC-H) that it claimed were appropriate to particular database workloads. The database vendors thus spent many years and many dollars producing benchmark results to try to demonstrate the superior competitive performance of their technology.
The Weakness of Benchmarks
Personally I am not a great advocate of looking at benchmarks. Database vendors are after a speed figure and in order to get the best figure possible, they run their databases in an unnatural way – in a configuration you would never use in an operational environment. In time, benchmark results became important enough for database vendors to write specific routines aimed only at optimizing the benchmark result, and some did exactly that. When they ran benchmarks they turned on special routines that would rarely or never be used otherwise.
Benchmarks are not completely worthless, of course. Comparing a product against its previous benchmarks can give some indication of improvement. And when, as happened for example with Vectorwise (now called Vector), a database demonstrates a dramatic advantage over its competition in a standard benchmark it proves it’s doing something new. With Vectorwise that was, of course, the in-chip vector instructions which, surprisingly, no other database vendor had ever tried to exploit.
So, if we already accept the fact that benchmarks are at best a rough measure of performance, when we encounter the area of analytic workloads we know we’re in real trouble. Here’s why…
The Analytic Workload
Let us begin with a simple truth. The huge interest in Big Data comes mainly from the opportunity businesses see in performing analytics on the data. There are other areas of Big Data activity, such as 3D rendering and some areas of “high performance computing”, that have nothing to do with analytics, but in the main Big Data is about analytics.
So what exactly is an analytic workload?
Well, first of all, it involves a selection of data from some pool of data. So there will be at least one query involved. After that selection, there will be some calculation done on the selected data. This is also bound to be the case. But that, unfortunately, is pretty much all we can say for sure. There will be some i/o and some calculation.
So imagine on the one hand that we have over a petabyte of data and we wish to examine every record among the many billions it includes and just count which records have a particular value in a particular item. We can actually do the count as we read the records, so almost the whole of the workload is the i/o activity.
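To make the i/o-heavy end of the spectrum concrete, here is a minimal sketch of that kind of count in Python. The function name, the field name and the three-record sample are all made up for illustration; a real petabyte scan would stream from distributed storage rather than an in-memory list. The point is only that the work per record is a single comparison, so reading the data dominates.

```python
from typing import Any, Iterable, Mapping


def count_matching(records: Iterable[Mapping[str, Any]], field: str, value: Any) -> int:
    """Count records whose `field` equals `value` in one streaming pass.

    Nothing is kept in memory except the running total, so the cost is
    essentially the cost of reading the records.
    """
    count = 0
    for record in records:
        if record.get(field) == value:
            count += 1
    return count


# Hypothetical sample standing in for the record stream.
sample = [{"region": "EU"}, {"region": "US"}, {"region": "EU"}]
print(count_matching(iter(sample), "region", "EU"))
```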
Now imagine that we want to select only ten columns from a petabyte of records, each with a thousand columns, and we wish to run a sophisticated correlation between any combination of columns against other columns from those 10. If the data is stored in a columnar fashion then we will only read in about 10 terabytes of data. It is entirely possible that we will be able to hold that in memory, if we have enough servers with enough memory. It will not take long to read in the data. However, I can easily envisage a correlation test that will, even with all the data held in memory and using the latest x86 processors, take days to run.
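The arithmetic behind that example can be sketched in a few lines of Python. The figures rest on assumptions that are mine, not measurements: every column takes an equal share of the storage, compression is ignored, and “any combination of columns against other columns” is read as pairing one non-empty group of columns with another, disjoint, non-empty group. The function names are purely illustrative.

```python
import math

PETABYTE = 10**15  # decimal petabyte, in bytes
TERABYTE = 10**12  # decimal terabyte, in bytes


def columnar_read_bytes(total_bytes: int, total_columns: int, selected_columns: int) -> int:
    """Bytes read when only the selected columns are scanned.

    Assumes equal-sized columns and no compression.
    """
    return total_bytes * selected_columns // total_columns


def column_pairs(n: int) -> int:
    """Number of distinct single-column pairs among n columns."""
    return math.comb(n, 2)


def subset_pairings(n: int) -> int:
    """Ordered pairs (A, B) of disjoint, non-empty column groups from n columns.

    Each column goes to A, to B, or to neither (3**n ways); inclusion-exclusion
    removes the assignments where A or B is empty.
    """
    return 3**n - 2 * 2**n + 1


read_tb = columnar_read_bytes(PETABYTE, 1000, 10) / TERABYTE
print(f"data read: {read_tb:.0f} TB")
print(f"single-column pairs: {column_pairs(10)}")
print(f"group-versus-group pairings: {subset_pairings(10)}")
```

Under those assumptions the read shrinks to 10 TB, while the number of candidate comparisons grows from 45 simple pairs to 57,002 group pairings, each of which has to be evaluated over every record. That is how the calculation, rather than the i/o, comes to dominate.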
The Variability of Analytics
I hope you’re seeing the issue here. There will not be any meaningful benchmarks for anyone to peruse in respect of analytical workloads. The workloads are far too variable. I’ve even had to be simplistic in describing two “ends of the spectrum” because neither example I’ve suggested could be thought of as a genuine example of analytics. Both are simplifications of what a data analyst might do in a particular situation.
It isn’t just that there’s a workload spectrum that could be i/o heavy or could be calculation heavy. The individual activity varies between modeling (running small sample workloads) and analysis of the full data resource. It is also the case that the project that a data analyst is pursuing doesn’t necessarily imply the use of any particular set of data or any particular statistical algorithm.
It might, for example, be as loose as: “try to find exploitable trends in a collection of retail data.” It’s an open target. Initially the data analyst has no idea what other data he might choose to include and what techniques he will use. He probably could not even estimate the computer resource required because at the outset he has no idea of where his explorative activity will lead.
So in such circumstances, how can we identify which technologies to choose?
It will be necessary to construct realistic “proof of concept” situations. And that is not so easy to do.