Data warehouse software, what is it?
Data warehousing software manages and stores data from many sources, making it easier to analyze and visualize to support business decision-making.
What is data warehouse software?
A data warehouse is essentially a database that is structured for easy searching. Transactional databases are used for applications such as order entry, so they are designed to handle large volumes of inserts for new orders and have few indexes to maintain. Data warehouses often have elaborate star schemas that are organized into a single fact table with all the attributes related to a subject area, surrounded by many dimension tables that correspond to popular queries and are indexed to support those queries. A more modern data warehouse architecture, columnar databases, stores tables as individual columns. This is the case for the data warehousing engine used by the Actian Avalanche Cloud Data Platform.
Why use data warehouse software?
Before data warehouses became popular, businesses simply unloaded data from operational databases on a nightly basis, created a copy database, added additional indexes and ran reports. The primary reason they did this was to keep the operational database free to process mission-critical transactions, such as taking orders from customers without having to share resources with internal users. Below are more examples of why businesses employ data warehouses
- A data warehouse can be structured to support business decision-making, such as reporting where customers are purchasing certain products. This data can help decide which distribution should stock certain products
- A data warehouse can be used as a repository for data from multiple business units and applications to create a central hub for business data. An enterprise data warehouse can be used to populate dashboards that show management how a business is performing relative to KPIs
- A distributed data warehouse can be used in a geographic division of a business with local analytics with minimal network latency
- A data lake can be used as an extension of a data warehouse for ad hoc queries
- Without a data warehouse and easy-to-use business intelligence (BI) tools, business analysts would have to rely on busy IT teams, leading to delays and potentially missed market opportunities
- A retail sales data warehouse can be used to perform market basket analysis to optimize product stock levels and product placement
- Healthcare data warehouses can be used to analyze clinical trial data to find correlations between symptoms and treatments
- IT service organizations can use a data warehouse to look for root causes across incidents and fix long-term problems in their software or service
- A data warehouse can be used to populate decision support tools such as cubes that contain pre-aggregated slices and dices of data for rapid analysis.
How does data warehouse software work?
A core of most data warehouse software systems is a relational database management system (RDBMS) instance that manages data movement between memory and secondary file storage. The database instance connects to client applications so it can process submitted queries and other requests. Queries are parsed, query plans are created and executed, and result sets are returned to the requesting session. For update operations, the database instance performs transaction management, enabling multiple SQL INSERT, UPDATE and DELETE operations to be committed or rolled back.
- Session management manages the connection to client applications which can be managed as a pool by a load balancer. The database instance is also responsible for maintaining the consistency of the data, so any user query that aggregates or groups results will always be consistent with the instant the query began, so your tabulations always add up.
- Maintaining database integrity is critical, so the data warehouse must make sure any write operations in memory always make it back to secondary storage. All data warehouses have transaction logs to ensure database integrity is maintained. Suppose there is an unexpected server failure upon restart. In that case, the database instance uses the transaction log to perform a rollback to the last checkpoint before rolling forward to a consistent point.
- High availability is assured by having multiple copies of the data in storage and multiple instances to take over when a worker task fails. In the case of a situation where an instance becomes unresponsive or hangs, missing interrupt handlers wait for a specified time before failing over to a standby instance.
Maintaining data consistency across multiple servers in a clustered data warehouse requires a distributed lock manager that ensures that when a data update is performed, only one server is the owner of the block of data being changed and other sessions wanting to make a change wait their turn.
Data warehouse architecture
Every vendor of data warehousing solutions differentiates their solutions in multiple ways to meet their customer needs and maintain a competitive edge. You can run a data warehouse on a Windows or Linux server on-premise or in the cloud.
- On-premise data warehouses have the advantage that you don’t need to pay a provider for the CPU, IO and storage resources you consume. The downside of on-premise solutions is that you need to manage them; when they run out of capacity, you need to buy a bigger server and in-house IT teams to operate and optimize them.
- Cloud-based data warehouses have advantages, including the flexibility to only pay for what you use, infinite elasticity to grow with your needs without buying hardware up-front, and you get to leave the IT management tasks to the cloud service provider.
Data warehouses can be architected to run on a symmetric multiprocessing (SMP) system with limited capacity or as a massively parallel processing (MPP) on a clustered set of servers to support larger parallel queries or larger user populations.
A data warehouse’s secondary storage can be as simple as a set of files or physical disk volumes. A more sophisticated approach is to use virtualization software to abstract storage beyond physical device capacities. A further abstraction decouples the relationship between servers and storage devices in the realm of cloud-based data warehouses. The advantage of using such an architecture is that you can scale storage independently from compute resources to have your server architecture match your storage profile or compute-bound application.
In some cases, data warehouses take advantage of containerization and microservices in the cloud to provide serverless compute architecture so you can be billed based on queries executed rather than virtual servers you licensed.
A very important distinction between data warehouse technologies is whether they use row or column-based table storage. Column-based approaches have distinct performance advantages over row-based stores as your query only touches the columns you are interested in, they are self-indexing to reduce administration costs, and the smaller column objects can take advantage of faster on-processor caches. The Avalanche Cloud Data Platform uses an advanced column-based architecture in the cloud and on-premise.
Below are suggested considerations for making your data warehouse selection:
- Does it provide high data integrity?
- Is it secure?
- Is there a way to get high availability?
- Is it proven to be fast using industry-standard benchmarks?
- Does it use standard SQL, so you don’t need to retrain your users?
- Is it an MPP architecture to scale as workloads grow?
- Will it run in the cloud so I can avoid up-front capital expenditure and procurement costs?
- Will it operate on multiple cloud services so I can put my analytics engine where my data is?
- Can I run the same data warehouse software on-premise for my regulated data?
- Does it keep data warehouse data in a column store?
- Is it cost-effective to administrate and tune?