Vectorization
Vectorization, as defined by the microprocessor industry, is the process that transforms a scalar operation acting on individual data elements (Single Instruction, Single Data, or SISD) into an operation where a single instruction operates concurrently on multiple data elements (Single Instruction, Multiple Data, or SIMD).
More broadly, the term is also applied to restructuring work that would otherwise run in a single thread so that it executes as multiple parallel, concurrent threads. Depending on the workload and the hardware available, this can accelerate performance substantially, in some cases by orders of magnitude.
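To make the SISD versus SIMD distinction concrete, here is a minimal sketch in Python that only models the idea. It does not issue real vector instructions, and the lane count of 8 is an assumption (for example, eight 32-bit values in a 256-bit register). The function names are hypothetical.

```python
LANES = 8  # assumed lane count, e.g. eight 32-bit values in a 256-bit register


def scalar_add(a, b):
    """SISD model: each 'instruction' adds exactly one pair of values."""
    out, instructions = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        instructions += 1
    return out, instructions


def simd_add(a, b, lanes=LANES):
    """SIMD model: each 'instruction' adds up to `lanes` pairs at once."""
    out, instructions = [], 0
    for i in range(0, len(a), lanes):
        # In hardware, this whole slice would be handled by one vector add.
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        instructions += 1
    return out, instructions


a = list(range(20))
b = [10] * 20
scalar_result, scalar_count = scalar_add(a, b)
simd_result, simd_count = simd_add(a, b)
```

Both functions return the same sums, but the modeled instruction count drops from one per element to one per group of lanes, which is the essence of the transformation.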
Why is Vectorization Important?
In the database industry, vectorization is the basis for reducing the response time of large queries through parallel execution of multiple concurrent threads. A database system's query optimizer chooses the most efficient method to obtain the results for a given query: indexes are used to access single rows quickly, range scans are used for small subsets, and full table scans are reserved as a last resort for when every table row must be read. A single-threaded full-table scan could take hours on a table with billions of rows. Vectorization can shorten the response time of such operations from hours to minutes, assuming the table is partitioned and sufficient CPU cores are available.
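The partitioned scan described above can be sketched as follows. This is an illustration only, not how any particular database implements it: the table, the partitioning scheme, and the function names are all hypothetical, and Python threads are used purely to show the shape of the technique.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(rows, predicate):
    """Full scan of one partition: count the rows matching the predicate."""
    return sum(1 for row in rows if predicate(row))


def parallel_scan(partitions, predicate, workers=4):
    """Scan every partition concurrently and combine the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda p: scan_partition(p, predicate), partitions)
        return sum(partials)


# Hypothetical table of 1,000 order amounts, split into 4 partitions.
table = list(range(1000))
partitions = [table[i::4] for i in range(4)]
matches = parallel_scan(partitions, lambda amount: amount >= 900)
```

Each worker reads only its own partition, so in an engine with true parallel execution the elapsed time approaches the single-threaded time divided by the number of cores. In CPython the interpreter lock prevents that speedup for CPU-bound work, so this sketch shows the structure rather than the performance.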
Vectorization at the Chip Level
As processor clock speeds began to plateau, manufacturers began to increase the number of CPU cores that could be packaged into a single chip. Today, an Intel Xeon server processor chip can have 32 or more processor cores.
The Actian Vector analytics database system exploits a feature of Intel server processors, designed to support parallel processing for High-Performance Computing (HPC), called Single Instruction, Multiple Data (SIMD). Whereas a scalar instruction operates on one value at a time, a SIMD instruction operates on several values packed side by side in a wide register within a core. Because every core has its own cache and SIMD units, the benefit multiplies across cores: if the server has 32 CPU cores, Actian Vector will load database column data into all their CPU caches and process it in parallel. A single vectorized machine instruction can then do work that would take a traditional database many instructions, because a traditional engine processes only one database record per instruction. Analytic queries are very data-intensive, so processing many data values per instruction is highly beneficial.
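As a rough illustration of processing column data in batches rather than one record at a time, the sketch below evaluates a query of the form SUM(price) WHERE qty > N over column blocks. It is a simplified model, not Actian Vector's implementation: the block size, column names, and function name are assumptions, and the inner loops stand in for what a real engine would execute as SIMD instructions.

```python
from array import array

BLOCK = 1024  # assumed number of column values handled per batch


def sum_where(price, qty, min_qty, block=BLOCK):
    """SUM(price) WHERE qty > min_qty, evaluated one column block at a time."""
    total = 0.0
    for start in range(0, len(price), block):
        p = price[start:start + block]
        q = qty[start:start + block]
        # Step 1: evaluate the predicate over the whole block (selection mask).
        mask = [v > min_qty for v in q]
        # Step 2: aggregate only the selected values of the block.
        total += sum(v for v, keep in zip(p, mask) if keep)
    return total


# Two hypothetical columns stored as contiguous typed arrays.
price = array("d", [1.5, 2.0, 3.5, 4.0, 5.0])
qty = array("i", [5, 20, 15, 1, 30])
result = sum_where(price, qty, min_qty=10, block=2)
```

Keeping each column contiguous and working on one block at a time gives the processor tight loops over values of a single type, which is the pattern that SIMD hardware handles well.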
Exploiting Large Cache Memory
The benefit of SIMD processing goes beyond parallelism alone because it takes advantage of the fastest kind of memory in a system, Level 1 (L1) cache, which can be accessed roughly 100 times faster than main memory.
Cache memory operates in a hierarchy of levels. All four of the levels listed below are faster than RAM:
- L1 cache is up to 112 KB per core.
- L2 cache is up to 2 MB per core.
- L3 cache is up to 408 MB.
- L4 cache is up to 64 GB.
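To see why cache capacity matters for batch sizes, here is a back-of-the-envelope sketch that estimates how many column values fit in a cache level when several column blocks share it. The value width, the number of columns, and the 50% headroom factor are arbitrary assumptions, and treating the whole L1 figure from the list above as available for data is optimistic, since part of L1 typically holds instructions.

```python
L1_BYTES = 112 * 1024        # per-core L1 figure from the list above
L2_BYTES = 2 * 1024 * 1024   # per-core L2 figure from the list above


def values_per_block(cache_bytes, value_bytes, columns, fill=0.5):
    """Values per column block if `columns` blocks share the cache.

    `fill` leaves headroom for other data; 0.5 is an arbitrary assumption.
    """
    return int(cache_bytes * fill) // (value_bytes * columns)


# Four columns of 8-byte values processed together.
l1_block = values_per_block(L1_BYTES, value_bytes=8, columns=4)
l2_block = values_per_block(L2_BYTES, value_bytes=8, columns=4)
```

Under these assumptions, a block of a couple of thousand values per column stays resident in L1, while blocks in the tens of thousands would spill into L2.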
Some industry analysts might class Actian Vector as an in-memory database, but its ability to exploit all CPU caches and gracefully spill over to disk makes it more than an in-memory analytic database.
Vectorization Across Clustered Servers
So far, we have discussed vectorization within a single server. When a workload is bigger than a single server can handle, businesses turn to Massively Parallel Processing (MPP) clusters of servers. The servers in an MPP cluster are typically installed close to each other, in adjacent racks in a single room, to minimize network latency. They communicate with each other over a high-speed interconnect running at 10 or 30 Gigabits per second.
Early MPP systems were costly and limited to research institutions and large corporations. With the advent of open-source software in the form of the Apache Hadoop project, the software cost went away, and Intel-based server hardware made MPP systems mainstream.
Actian created a variation of the Vector analytic database that works with Hadoop software. It scales database storage across a clustered file system and transparently distributes vectorized queries across the nodes of a Hadoop cluster to maximize analytic query performance.
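Distributing a query across cluster nodes generally follows a scatter-gather pattern: each node computes a partial result over its local data, and a coordinator merges the partials. The sketch below simulates this for an average over one column. The three-node layout and the function names are hypothetical, and nodes are modeled as plain lists rather than real servers.

```python
def node_partial(values):
    """Work done locally on one node: partial sum and row count."""
    return sum(values), len(values)


def coordinator_avg(node_data):
    """Gather partials from every node and merge them into a global AVG."""
    partials = [node_partial(values) for values in node_data]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical three-node cluster, each holding a slice of one column.
cluster = [[10, 20, 30], [40, 50], [60]]
average = coordinator_avg(cluster)
```

Each node returns a sum and a count rather than its own average, because averaging per-node averages would give the wrong answer when nodes hold different numbers of rows. Only these small partial results cross the interconnect, not the underlying rows.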
The Actian Data Platform
The Actian Data Platform provides a unified experience for ingesting, transforming, analyzing, and storing data. Actian Vector is an integrated component of the data platform, delivering ultra-fast query performance, even for complex workloads, both on-premises and on the Google, AWS, and Azure cloud platforms.