Google, welcome to the DataFlow bandwagon! Having led a team that has been developing, supporting and aggressively enhancing a highly parallel dataflow implementation for some 6-7 years, I was intrigued to see that Google has discovered the power of using dataflow for its own offering announced this week. It’s great to welcome an industry giant’s validation of a strategic technology I’m hugely excited about.
A year ago, Robin Bloor wrote a nice summary of the beauty and structure of a dataflow architecture as it relates to event streaming. Like Google, Actian applies dataflow to analytics on very high-volume streams of data, to do massive ingest for data integration and preparation, and for implementing multistep pipelines to (in Google software engineer Frances Perry’s words) “extract deep insight from datasets of any size.”
Anyone who has spent any time in Hadoop recognizes the limitations of the extremely primitive MapReduce environment. With MapReduce, the design-time performance is terrible and the run-time performance is terrible (that’s a bad combination :)). The introduction of YARN frees developers from the shackles of MapReduce, which is why we were first out of the gate last fall with our YARN-certified DataFlow implementation running 100% natively in Hadoop, including a joint reference architecture with Hortonworks.
In many years of fast-paced DataFlow development, we’ve produced an incredibly rich offering that serves as the backbone of the Actian Analytics Platform with:
- A comprehensive set of almost 100 highly parallel data preparation and advanced analytics dataflow operators
- A drag-and-drop dataflow visual interface for an amazingly fast and simple design-time experience (via our multiyear collaboration with KNIME)
- The full spectrum of horizontal, vertical, pipeline and broadcast parallelism under the covers, exploiting fine-grained thread-level parallelism in the nodes and Hadoop scale across the nodes, so your applications gain all of the power of scaling up and out on multicore and multinode without you having to understand any of the complexities of parallel programming (queueing, threading, memory management). We take care of it all for you
- The ability to run natively at every hardware scale, from desktop (easy to test!) to server to cluster, and to scale automagically at runtime to fully consume all cores and nodes without changing a line of code
- Access to this full set of dataflow capabilities AND the SQL you know and love with the recent launch of the world’s highest-performing industrial-grade SQL-in-Hadoop platform
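For readers who have never looked under the covers of a dataflow engine, here is a minimal sketch of the pipeline parallelism mentioned in the list above, written in plain Python. It is an illustration only and not Actian DataFlow code or its API: the names `run_stage`, `run_pipeline` and `SENTINEL` are hypothetical, and the example assumes each operator is a simple function running in its own thread, connected to its neighbors by bounded queues. Because of Python's global interpreter lock, it shows the structure of a pipeline rather than real multicore speedup.

```python
import queue
import threading

# Hypothetical end-of-stream marker passed down the pipeline.
SENTINEL = object()


def run_stage(func, inbox, outbox):
    """Apply func to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            # Propagate end-of-stream to the next stage, then stop.
            outbox.put(SENTINEL)
            return
        outbox.put(func(item))


def run_pipeline(records, stages, queue_size=4):
    """Run every stage in its own thread, linked by bounded queues.

    Bounded queues give back-pressure: a fast stage blocks on put()
    until the slower stage downstream catches up.
    """
    queues = [queue.Queue(maxsize=queue_size) for _ in range(len(stages) + 1)]

    def source():
        # Feed the first queue from a separate thread so the caller
        # is free to drain the last queue while data is still flowing.
        for record in records:
            queues[0].put(record)
        queues[0].put(SENTINEL)

    threads = [threading.Thread(target=source)]
    threads += [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for thread in threads:
        thread.start()

    # Drain the final queue; one thread per stage plus FIFO queues
    # means results arrive in input order.
    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)

    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Two toy operators: add one, then double.
    print(run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2]))
    # prints [2, 4, 6, 8, 10]
```

Even this toy version has to deal with queueing, threading and shutdown signaling by hand, which is exactly the kind of plumbing a dataflow engine is meant to hide from the application developer.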
So welcome, Google, to embracing the goodness of DataFlow – the optimal platform for building a whole new generation of data and computationally intensive analytic applications.
And for those who want to leapfrog straight to a mature, rich and robust implementation of DataFlow – welcome to the wonderful post-MapReduce world of Actian DataFlow. Best of all, we are compatible with every major Hadoop distribution so you can get started now on your Hadoop implementation of choice.
P.S. Still wondering “who said that”? Check out Charles Babcock’s InformationWeek article.