Data Management

How Partitioning on Your Data Platform Improves Performance

Colm Ginty

December 14, 2023

One of my goals as Customer Success Manager for Actian is to help organizations improve the efficiency and usability of our modern product suite. That’s why I recently wrote an extensive article on partitioning best practices for the Actian Data Platform in the Actian communities resource.

In this blog, I’d like to share how partitioning can help improve the manageability and performance of the Actian platform. Partitioning is a useful and powerful function that divides tables and indexes into smaller pieces and can subdivide those pieces further still. It’s like taking thousands of books and arranging them into categories: the difference between a massive pile of books in one big room and books strategically arranged into smaller topic areas, like you see in a modern library.

You can gain several business and IT benefits by using the partitioning function that’s available on our platform. For example, partitioning can lower costs by storing data more efficiently and boost performance by executing queries in parallel across smaller, divided tables.

Why Distributing and Partitioning Tables Are Critical to Performance

When we work in the cloud, we use distributed systems. So instead of using one large server, we use multiple regular-sized servers that are networked together and function like the nodes of a single enormous system. Traditionally, these nodes would both store and process data because storing data on the same node it is processed on enables fast performance.

Today, modern object storage in the cloud allows for highly efficient data retrieval by the processing node, regardless of where the data is stored. As a result, we no longer need to place data on the same node that will process it to gain a performance advantage.

Yet, even though we no longer need to worry about where data is stored, we do need to pay attention to the most efficient way to process it. Often, the tables in our data warehouse contain too much data to be processed efficiently by a single node. Therefore, the tables are distributed among multiple nodes.

If a specific table has too much data to be processed by a single node, the table is split into partitions. These partitions are then distributed among the many nodes—this is the essence of a “distributed system,” and it lends itself to fast performance.
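To make the idea of splitting a table and spreading it over nodes concrete, here is a minimal Python sketch. It is an illustration only, not how the Actian Data Platform works internally: the CRC32 hash, the round-robin placement, and the names `partition_of`, `split_into_partitions`, and `assign_partitions_to_nodes` are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    """Map a key to a partition number using a stable hash (CRC32, chosen only for illustration)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def split_into_partitions(rows, key_column, num_partitions):
    """Group rows into partitions based on the hash of one column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_of(row[key_column], num_partitions)].append(row)
    return dict(partitions)


def assign_partitions_to_nodes(num_partitions, nodes):
    """Place partitions on nodes round-robin (a hypothetical placement policy)."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


# A toy "orders" table with 20 rows, split into 4 partitions over 2 nodes.
orders = [{"order_id": i, "customer_id": i % 7} for i in range(20)]
partitions = split_into_partitions(orders, "order_id", 4)
placement = assign_partitions_to_nodes(4, ["node-a", "node-b"])

for partition_id, rows in sorted(partitions.items()):
    print(partition_id, placement[partition_id], len(rows))
```

Each node then only has to scan the partitions placed on it, which is what lets a query run in parallel.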

Partitioning in the Actian Data Platform

Having a partitioning strategy and a cloud data management strategy can help you get the most value from your data platform. You can partition data in many ways depending on, for example, an application’s needs and the data’s content. If performance is the primary goal, you can spread the load evenly to get the most throughput. Several partitioning methods are available on the Actian Data Platform.

Partitioning is important with our platform because it is architected for parallelism. Distributing rows of a large table to smaller sub-tables, or partitions, helps with fast query performance.

Users have a say in how the Actian platform handles partitions. If you choose not to manage partitioning yourself, the platform defaults to the automatic setting. In that case, the server makes its best effort to partition data in the most appropriate way. The downside of this approach is that joining or grouping data that’s assigned to different nodes can require moving data across the network between nodes, which can increase costs.

Another option is to control the partitions yourself using a hash value to distribute rows evenly among partitions. This allows you to optimize partitioning for joins and aggregations. For example, if your queries involve many SQL joins or groupings, you can partition tables so that rows with the same values in the join or grouping columns are assigned to the same node, which makes joins more efficient.
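The benefit for joins can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the platform's implementation: the tables, the CRC32 hash, and the helper names `hash_partition` and `local_join` are hypothetical. The point it shows is that when two tables are hash-partitioned on the same column with the same number of partitions, matching rows land in the same partition, so each partition can be joined on its own without moving rows between nodes.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    """Stable hash of a key to a partition number (CRC32 is an illustrative choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def hash_partition(rows, key_column, num_partitions):
    """Distribute rows among partitions by hashing one column."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[key_column], num_partitions)].append(row)
    return parts


def local_join(customers_part, orders_part):
    """Join one customers partition with the matching orders partition."""
    by_id = {c["customer_id"]: c for c in customers_part}
    return [
        {**by_id[o["customer_id"]], **o}
        for o in orders_part
        if o["customer_id"] in by_id
    ]


NUM_PARTITIONS = 4
customers = [{"customer_id": c, "name": f"customer-{c}"} for c in range(10)]
orders = [{"order_id": o, "customer_id": o % 10} for o in range(30)]

# Both tables are partitioned on the join column, customer_id.
customer_parts = hash_partition(customers, "customer_id", NUM_PARTITIONS)
order_parts = hash_partition(orders, "customer_id", NUM_PARTITIONS)

# Each partition is joined independently; no row has to look outside its partition.
joined = []
for p in range(NUM_PARTITIONS):
    joined.extend(local_join(customer_parts[p], order_parts[p]))

print(len(joined), "joined rows")
```

If `orders` were partitioned on `order_id` instead, an order and its customer could sit in different partitions, and the join would need rows to be moved first.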

When Should You Partition?

It’s a best practice to use the partitioning function in the Actian Data Platform when you create tables and load data. However, you probably have non-partitioned tables in your data warehouse, and redistributing this data can improve performance.

You can perform queries that will tell you how evenly distributed the data is in its current state in the data warehouse. You can then determine if partitioning is needed.
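On the platform itself you would run queries for this check. The Python sketch below only illustrates the kind of measurement involved, under assumptions of my own: the CRC32 hash, the sample keys, and the `skew_ratio` metric (largest partition divided by the average partition size) are hypothetical and not part of the product.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Stable hash of a key to a partition number (CRC32 is an illustrative choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def rows_per_partition(keys, num_partitions):
    """Count how many rows each partition would receive."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(counts):
    """Largest partition divided by the mean; 1.0 means perfectly even."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


# A low-cardinality key (country code) versus a high-cardinality key (row id).
skewed_keys = ["IE"] * 900 + ["US"] * 50 + ["DE"] * 50
even_keys = list(range(1000))

for label, keys in (("country", skewed_keys), ("row id", even_keys)):
    counts = rows_per_partition(keys, 4)
    print(label, counts, round(skew_ratio(counts), 2))
```

A ratio well above 1.0 suggests one node is doing most of the work, and that a different partitioning column may spread the rows more evenly.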

With Actian, you have the option to choose the best number of partitions for your needs. You can use the default option, which results in the platform automatically choosing the optimal number of partitions based on the size of your data warehouse.

I encourage customers to start with the default and then, if needed, set the number of partitions manually. Because the Actian Data Platform is architected for parallelism, running queries that give insights into how your data is distributed and then partitioning tables as needed allows you to operate efficiently with optimal performance.

For details on how to perform partitioning, including examples, graphics, and code, join the Actian community and view my article on partitioning best practices. You can learn everything you need to know about partitioning on the Actian Data Platform in just 15 minutes.

Colm Ginty

About Colm Ginty

Colm is a Customer Success Engineer at Actian, helping key customers to get the most out of our Data Platform. Before that, he was a Data Engineer for 8 years, mostly working with distributed systems like Spark and Kafka. He enjoys cooking, scuba diving, coffee and wine.