There are many myths and misconceptions about cloud data warehouses. One of the biggest is that all cloud data warehouses cost the same. On the surface, cloud data warehouse vendors may speak the same language, describing similar features and benefits and touting the performance gains of operating in the cloud. But when you start looking at the details of implementation, migration, performance and scalability, the differences become apparent.
“We’re moving our data warehouse to the cloud to save money.”
Migrating from an on-premise data warehouse to a cloud data warehouse is a great way to gain more control over your IT costs, improve performance, and achieve the scalability your business needs. How much of each benefit you gain depends on which cloud data warehouse you select and how you implement it. Most cloud data warehouse solutions give you several deployment options: on-premise, private cloud, public cloud, multi-cloud, and hybrid. If the one you are looking at doesn’t give you these choices, you might want to pause here and consider how confident you are in the solution you’re implementing.
Deployment choices give you the flexibility to change course in the future (and considering how fast business environments evolve, flexibility is essential). If the solutions you are looking at all offer the standard deployment options, you might assume that costs and performance will be effectively the same. After all, if it’s running on AWS, it’s all the same cloud infrastructure, right?
The cloud environment, whether public or private, is only one piece of the performance puzzle. Most cloud providers offer a wide variety of capabilities that software solution providers can choose from. The design and configuration of the solution will have a significant impact on your costs and the performance benefits you receive in your implementation. Here are three key issues you should understand to know how your cloud data warehouse solution stacks up.
Elasticity for minimizing waste and scaling for increased demand
One of the most significant value propositions of moving your data warehouse to the cloud is minimizing the waste that comes from under-utilized infrastructure and idle capacity. Cloud systems are intended to be scaled up for peak demand periods, and then scaled back down when capacity is not needed to save resources (and costs). When it comes to cloud data warehouses, each provider has their own capabilities for optimizing resource utilization (supply) against consumption (demand). Some solutions require full database backups in order to shut down services and a full restoration to bring the service back online. This means that it isn’t practical to “turn off the lights when you aren’t in the office.”
Other cloud data warehouse providers take a stepwise approach to scaling up capacity, adding a new instance for every eight or so users. Whenever your user count falls between steps, you end up paying for more than you really need. The key when it comes to elasticity and scaling is having fine-grained control over how much capacity you are using (and paying for) and being able to adjust it up and down to match your unique usage patterns. If you have greater control over your costs, you can minimize waste and save money.
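To make the cost of stepwise scaling concrete, here is a minimal sketch of the arithmetic. It assumes one instance per eight users, taken from the "eight or so" figure above; the function names are hypothetical and do not reflect any vendor's actual sizing rules or pricing.

```python
import math

# Assumption for illustration: one instance is added per 8 users,
# following the "every eight or so users" figure in the text.
USERS_PER_INSTANCE = 8


def stepwise_instances(users: int, users_per_instance: int = USERS_PER_INSTANCE) -> int:
    """Whole instances required when capacity is only sold in fixed steps."""
    return math.ceil(users / users_per_instance)


def unused_fraction(users: int, users_per_instance: int = USERS_PER_INSTANCE) -> float:
    """Share of provisioned (and paid-for) capacity that nobody is using."""
    if users <= 0:
        return 0.0
    provisioned = stepwise_instances(users, users_per_instance) * users_per_instance
    return (provisioned - users) / provisioned


if __name__ == "__main__":
    for users in (8, 9, 12, 16, 17):
        print(users, stepwise_instances(users), f"{unused_fraction(users):.1%}")
```

Under these assumptions, moving from eight users to nine doubles the number of instances while adding only one user. With finer-grained scaling, provisioned capacity could track the user count more closely.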
Performance – Be sure you understand what you get in a “resource unit.”
In on-premise data centers, it’s easy to measure what resources you are using: it’s this host, this memory and these CPUs. You know that because it’s the hardware your data warehouse runs on. In the cloud, because the infrastructure has been optimized for shared use, vendors define “resource units” as a simple way of describing capacity. But here’s the catch: not all resource units are equal, and each vendor defines its own unit of measure. You need to understand what you are getting in a resource unit in terms of speed, performance, scale, and resource size. In some cases, things like memory are bundled with compute; in other cases, they are measured separately. Read the fine print and know what you are getting.
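One way to compare resource units is to break each one down into common terms, such as price per vCPU-hour and price per GiB of memory per hour. The sketch below does this for two made-up vendors. The vendor names, unit sizes, and prices are all hypothetical placeholders, not real offerings; substitute the figures from each vendor's fine print.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceUnit:
    """One vendor-defined 'resource unit' (all values here are hypothetical)."""
    vendor: str
    vcpus: float
    memory_gib: float
    price_per_hour: float

    def price_per_vcpu_hour(self) -> float:
        # Cost of one vCPU for one hour within this unit.
        return self.price_per_hour / self.vcpus

    def price_per_gib_hour(self) -> float:
        # Cost of one GiB of memory for one hour within this unit.
        return self.price_per_hour / self.memory_gib


# Placeholder numbers for illustration only.
units = [
    ResourceUnit("Vendor A", vcpus=4, memory_gib=32, price_per_hour=2.00),
    ResourceUnit("Vendor B", vcpus=8, memory_gib=32, price_per_hour=3.00),
]

if __name__ == "__main__":
    for u in units:
        print(f"{u.vendor}: {u.price_per_vcpu_hour():.4f} per vCPU-hour, "
              f"{u.price_per_gib_hour():.4f} per GiB-hour")
```

With these placeholder figures, Vendor B's unit costs more per hour but less per vCPU, while Vendor A's is cheaper per GiB of memory. Which one is the better deal depends on whether your workload is bound by compute or by memory.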
Efficiency and parallel processing
Parallel processing is one of the biggest differentiators between cloud data warehouse solutions. If you process data in a linear fashion (one record at a time), big data sets take time to process. Some vendors speed things up by running multiple operations in parallel across a set of different CPUs. It’s faster than going in a single-file line, but there is another option that is even faster. Vectorization enables a single CPU instruction to operate on multiple data values at once, so more work gets done in each CPU cycle. This means you get much of the speed of parallel processing without the overhead cost of parallel hardware.
There are a lot of myths out there about cloud data warehouses, and this was just one of them.
To learn more, check out the whitepaper from Early Adopter Research, “Persistent myths of cloud data warehouses”.