I am pleased that my first Actian blog post is an exciting systems announcement. By the end of June, Actian will have available a new product – the Actian Analytics Platform, Hadoop SQL Edition – that allows to use the Actian Vector system in MPP mode on Hadoop clusters. I will talk about all this more extensively at the upcoming Hadoop Summit on June 3 in San Jose, California. So please come and attend!
As you may know, Actian Vector, previously called Vectorwise, is an analytical database system that originated from my research group at CWI. It can be considered a follow-up system from our earlier experiences with MonetDB, the open source system from CWI that pioneered columnar storage and CPU-cache optimized query processing for the new wave of analytical database systems that since have appeared. Whereas MonetDB is a main-memory system, Vector also works if your dataset is larger than RAM, and while sticking to columnar storage, it added a series of innovative ideas across the board such as lightweight compression methods, cooperative I/O scheduling and a new vectorized query execution method, which gives the system its name. Columnar storage and vectorized execution are now being adopted by the relational database industry, with Microsoft SQLserver including vectorized operations on columnstore tables, IBM introducing BLU in DB2 and recently also Oracle announcing an in-memory columnar subsystem. However, these systems have not yet ended the TPC-H Vectorwise domination in the industry standard benchmark for analytical database systems. So far this domination only concerns the single-server category, as Vectorwise was a single-server system – but not anymore thanks to project “Vortex”: the Hadoop-enabled MPP version of Vectorwise.
To scale further than a single-server allows, one can create MPP database systems that use a cluster consisting of many machines. Big Data is heavily in vogue and many organizations are now developing data analysis pipelines that work on Hadoop clusters. Hadoop is not only relevant for the scalability and features it offers, but also because it is becoming the cluster management standard. The new YARN version of Hadoop adds many useful features for resource management. Hadoop distributors like Hortonworks, MapR and Cloudera are among the highest-valued tech startups now. An important side effect is strongly increasing availability of Hadoop skills, as Hadoop is not only getting much mindshare in industry but also making its way into Data Science curricula at universities (which are booming). Though I would not claim that managing a Hadoop cluster is a breeze, it is certainly moreeasy to find expertise on that, than finding expertise in managing and tuning MPP database systems. Further, Big Data pipelines typically process huge unstructured data, with MapReduce and other technologies being employed to find useful information hidden in the raw data. The end product is cleaner and more structured data that one could query with SQL. Being able to use SQL applications right on Hadoop significantly enriches Hadoop infrastructures. No wonder that systems offering to do SQL in Hadoop are the hottest thing in analytical database systems now, with new products being introduced regularly. The marketplace is made up by three categories of vendors:
- those that run MPP database systems outside Hadoop and provide a connector (examples are Teradata and Vertica). In this approach, system complexity is not reduced, and one needs to buy and manage two separate clusters: a database cluster for the structured data, and a Hadoop cluster to process the unstructured data. Plus, there is a lot of data copying.
- those that take an legacy open-source DBMS (typically Postgres) and deploy that on Hadoop with a wrapping layer (examples are Citusdata and HAWQ). Legacy database systems like Derby and Postgres were not designed for an append-only file system like HDFS and the resulting database functionality is therefore often also append-only. Note that I call these older database engines “legacy” from the analytical database point of view, since modern vectorized columnar engines are typically an order of magnitude faster on analytical workloads. The Splice Machine is an interesting design point in that it ported Derby execution onto HBase storage. HBase is designed for HDFS, so this system can accomodate update-intensive workloads better than analytical-oriented SQL-on-Hadoop alternatives, but its reliance on the row-wise tuple-at-a time Derby engine and the access paths through HBase are bound to slow it down on analytical workloads compared with analytical engines.
- native implementations that store all data in HDFS and are YARN integrated. Most famous here are Hive and Impala. However, I can tell you from experience that developing a feature-complete analytical database system takes… roughly a decade. Consequently, query execution and query optimization are very immature still in Impala and Hive. These young systems, while certainly serious attempts (both now store data column-wised and compressed, and Hive even adopted vectorized execution) still miss multiple of the advanced features that the demanding SQL user base has come to expect: workload management, internationalization, advanced SQL features such as ROLLUP and windowing functions, not to mention user authentication and access control.
Interestingly, the new Actian Hadoop product that leverages Vector is of the native category above, as it purely uses HDFS for storage, and is YARN integrated. However, its optimizer and feature set is quite mature. The resulting system is also much faster than Impala and Hive. Of course, it was not trivial to port Vector to HDFS, but true compressed column-stores are in their access patterns quite compatible with an append-only file system that prefers large sequential I/Os (i.e. HDFS). Unlike Hive and Impala, Vector also supports fine-grained updates, since updates (at first) go to a separate data structure called the Positional Delta Tree (PDT), yet another CWI innovation. This has been an important piece the puzzle of making Vector a native inhabitant of HDFS.
One last link I provide is to the Master Thesis of two Vector engineers which one could consider the baby stage of this release. Back then, we envisioned using a cluster file system like RedHat’s GlusterFS (it turned out dead slow), but HDFS took its place in the development. The “A Team” (as we used to call Adrian and Andrei) has been instrumental in making this come alive. Similar thanks extend to the entire Actian team in Amsterdam which has been working very hard on this, as well as to the many other Actian people worldwide who contributed vital parts.
My blog post is becoming too long, so I cannot provide all the juicy bits on how Vortex is designed. But, the above picture does give quite a bit of information for the trained eye. Vector now supports partitioned tables, and when deploying on Hadoop, the various partitions get stored in HDFS, which by default replicates every data block three times. HDFS when reading remote data is quite slow, so we give hints to the NameNode where to store the blocks, and when a Vector session starts and X100 backend processes are started on (the “worker set”, a designated subset of) the Hadoop cluster, we make sure that the processes that become responsible for a particular partition have it local. In the above figure I have tried to suggest this decoupling of partitions and database processes: in principle and thanks to HDFS everybody can read anything, but we try to optimize things such that most data access is in fact a “shortcut read”, as HDFS calls it. We are also working hard to make the solution work well with YARN, such that the system can increase and reduce its footprint (#nodes and #cores per node) on the Hadoop cluster as the query workload varies, and avoids as much as possible getting slowed down by concurrent Hadoop jobs.