Actian Blog / Taking Advantage of Ordered Data

Taking Advantage of Ordered Data

Caos Theory Top 3 Takeaways Impressive Analytics Event

Actian Vector and Vector in Hadoop (VectorH) have a lightweight, but very potent, feature that can give a significant performance boost when querying data that possesses some kind of ordering. This entry takes a look at this feature and describes how to use it.

The feature utilises what are referred to as MinMax structures. These structures contain entries that track column values over ranges of blocks used to store the columns data. When querying, the optimizer can use the values in the structure to filter out ranges of blocks that do not need to be scanned.

The MinMax structures are useful for all columns used in query conditions. However, the big win occurs when one of the columns is ordered; in that case the condition values for that column will likely return a vastly reduced range of blocks that need to be read.

As an example, consider a 100 billion row table that is ordered by transaction_date and where there are 1,000 unique values of transaction_date in the table. These MinMax structures contain, by default, around 1,000 entries so any query containing a condition like “transaction_date = date” will return (assuming an even spread of rows per date value) a single range of blocks that target 100 million rows. Even for a racehorse like Vector/VectorH scanning 100 million rows is faster than scanning 100 billion rows.

The ordering described above is called “natural ordering”. The data is naturally ordered just by the nature of the system. Almost all systems have at least one column that fits this criteria, usually a date field (loaded_date, valid_date, etc).

In some cases, natural ordering may not be present or may not be a useful query condition (although it is worth revisiting your query design to make sure that any opportunities are not being missed). In such cases Vector and VectorH allow for data order to be explicitly specified. Doing so requires extra work at the time of data loading/modification but if that part of the system is not under pressure then it can be worth doing so. Vector and VectorH allow for explicit ordering through the create index statement (although it is a little misleading as what is being specified is an ordering rather than a traditional index structure).

For a real example I’m going to use the same cluster and query as in previous entries. The cluster has 5 nodes (1 NameNode and 4 DataNodes) running, in this case, HDP 2.2 on top of RHEL 6.5. The DataNodes are HP DL380 G7s with 2 x 6 cores @2.8Ghz, 288GB RAM, 1x 10GbE network card, and 16x300GB SFF SAS drives (again a good spec as of 2012 but a long way from today’s “state of art”).

The data for the test is a star schema with around 25 dimension tables ranging in size from 100 to 50 million rows, and a fact table with 2 billion rows.

The queries in the demo join two or more of the dimension tables to the fact table and filter on a date range along with other predicates.

select
        d1.DateLabel, d2.MeasureLabel, sum(f.MeasureValue)
from
         Fact           f
        ,Date_Dim       d1
        ,Measure_Dim    d2
where
        f.DateId = d1.DateId
and   d2.MeasureId = f.MeasureId
and   d1.DateLabel in (’05-Feb-2015′)
group by
         d1.DateLabel
        ,d2.MeasureLabel

I’m using two copies of the fact table, one that is ordered by the date column (a natural ordering – the data was loaded one date at a time) and one that isn’t (the 2 billion rows were loaded in the order they appeared in the data files). The query uses a date value as a condition.

Querying the unordered data returns in 0.6 seconds – which is quite a good result, most systems would be pleased to execute the query in that time. However, querying the ordered data returns in 0.1 seconds – 6 times faster (and from a zero cost feature).

In the query against the unsorted table then MinMax structures are still in play – but their effectiveness is reduced as the data is only partially (by chance) ordered on the date column. As an additional test I created a third version that was deliberately ordered on a high cardinality column (thereby completely removing any ordering on date). Using this version of the table, the query returned in 0.8 seconds. The difference now 8 times (although still a good result).

A few additional points.

The MinMax structures are relatively small and do not grow with the size of the data (and are therefore very lightweight and transparent).

On the Actian GitHub site you will find a tool that will let you examine how ordered or otherwise your data is.

Increasingly schemas are being created with very wide (thousands of columns) tables. In such cases, even though they are relatively lightweight, the net cost of the MinMax structures can add up and it becomes desirable to be selective about which columns have the structures created for them (particularly when most of the columns are holding measure values rather than dimensional attributes). A feature allowing such control will be available shortly.

About Lawrence Stoker

Lawrence Stoker has worked for database specialist companies for over 20 years and operated in the Big Data space before it was called Big Data. Lawrence joined the Actian Sales Engineering team in 2013. His focus is on technologies that take alternative approaches to achieving extreme performance on large data sets which today is the confluence of columnar architecture, in-memory processing, and Hadoop.