The Actian technology teams have recently posted a number of technical tools and snippets to the Actian account on GitHub that will be of interest to customers, partners and prospects. We encourage all of you to take a look and contribute, either by enhancing these tools or by letting us know about other tools that you have created for yourselves, which we will then mention here. Our intention is to publish new contributions here over time, and to publish future blog entries that go into more detail on some of these tools and contributions.
Examples of the projects you can already find on GitHub include:
- The Actian Spark Connector for Vector in Hadoop (VectorH) is maintained here.
- A Vagrant package that will take a downloaded Vector .tgz file and automatically install it into a freshly-built CentOS virtual machine.
- A Unit Testing framework for OpenROAD.
- A collection of scripts for testing VectorH alongside other Hadoop data analysis engines, referenced as part of a forthcoming conference paper.
- A Maven-based template for creating new custom operators in Dataflow, together with a couple of examples that use this template, including a Dataflow JSONpath expression parser and an XML and XPath parser.
- A utility called MQI which is designed to make it easier to run an operating system command across all of the nodes in a VectorH Hadoop cluster.
- A Hadoop cluster health-checking component that reports on problems such as HDFS file replication issues, whether short-circuit reads are enabled, whether the transfer threads setting is too small, or whether disk throughput is too slow.
- A collection of small Vector Tools that calculate the appropriate default number of partitions for a large table, look for data skew within a table, and check whether the Vector min/max indexes are sorted (performance is better if your data is sorted on disk, and the min/max indexes will show this). The collection also includes a tool that takes a set of SQL scripts and turns them into a concurrent user throughput test, complete with some stats on overall runtime.
- A collection of new operators for Dataflow, including one for passing runtime parameters into a Dataflow as a service, a ‘sessionize’ operator to group timestamped data into ‘sessions’, a lead/lag node for handling timestamped data, and various others.
- A performance benchmark test suite for Actian Vector, based on the DBT3 test data and queries. This project creates test data at a scale factor you choose (the default is Scale Factor 1, which is around 1 GB of data in total), loads that test data into Vector/VectorH, and then executes a series of queries and times the results.
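
To give a flavour of what grouping timestamped data into sessions involves, here is a minimal Python sketch of the general technique. It is not the code of the Dataflow operator listed above. The function name `sessionize`, the `(key, timestamp)` input shape and the 30-minute inactivity gap are all assumptions made purely for illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap=timedelta(minutes=30)):
    """Assign a session number to each (key, timestamp) event.

    Hypothetical helper for illustration only. A new session starts for a
    key whenever the time since that key's previous event exceeds `gap`.
    Returns (key, timestamp, session_id) tuples sorted by key, then time.
    """
    result = []
    # Sort by key and then timestamp so each key's events are contiguous
    # and in time order, which groupby requires.
    ordered = sorted(events, key=itemgetter(0, 1))
    for key, group in groupby(ordered, key=itemgetter(0)):
        session_id = 0
        previous = None
        for _, ts in group:
            # A gap longer than the threshold closes the current session.
            if previous is not None and ts - previous > gap:
                session_id += 1
            result.append((key, ts, session_id))
            previous = ts
    return result


if __name__ == "__main__":
    sample = [
        ("alice", datetime(2016, 1, 1, 9, 0)),
        ("alice", datetime(2016, 1, 1, 9, 10)),
        ("alice", datetime(2016, 1, 1, 11, 0)),
        ("bob", datetime(2016, 1, 1, 9, 5)),
    ]
    for row in sessionize(sample):
        print(row)
```

Session numbers here restart at zero for each key. A real operator may well number sessions differently or use a configurable timeout, so treat this only as a sketch of the idea.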
Please take a look, download the tools, and contribute extensions and enhancements so that they meet your needs!