Data Lakes: The Future of Data Management for Enterprises
In the early 2000s, VMware enabled organizations to virtualize their servers (compute) and storage (data warehouses). You still had to pay for licenses, and the impact on your network was significant, but virtualizing your IT provided breathing space until the cloud arrived. Cloud infrastructure and tools meant you no longer had to maintain, or even know in advance, the amount of compute and storage required at any given moment.
The cloud scaled flexibly up and down, and its capacity to house data vastly exceeded the architecture found in most company data centers. This gave rise to the rapid adoption of the new cloud vendor-based data lake infrastructures when they were introduced: AWS data lakes from Amazon, Azure data lakes from Microsoft, and Google Cloud data lakes from Google.
Consider that a data warehouse remains the same size once built. If you outgrow it, you must build a bigger one, which takes time and money. The cloud lets you add or remove entire environments or applications within minutes and at minimal cost. Further, most cloud pricing models charge for compute use, not storage. Imagine building a vast physical warehouse and only being charged when you entered it and did something with what was inside; that is how the cloud works.
What was missing was a way to house all of the various data types that became available as the internet grew in importance. IoT, audio, blogs, vlogs, news, and real-time data feeds all needed to be consumed by organizations to remain current and relevant. Data warehouses could not be designed quickly enough, so the term data lake was introduced by James Dixon in 2010. Think of it as a way to stop data silos by creating a pool of information from any required source, on cloud technology such as AWS and Azure. Data went from being extracted, transformed, and loaded (ETL) into your applications to being extracted, loaded, and transformed (ELT) when you request it.
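The shift from ETL to ELT can be sketched in a few lines of code. This is a minimal illustration using in-memory lists as stand-ins for a warehouse and a lake; the records and field names are invented, and no vendor's API is involved:

```python
# Sketch of ETL vs. ELT with in-memory "storage" (illustrative only).
raw_events = [
    {"user": "a", "amount": "19.99", "source": "web"},
    {"user": "b", "amount": "5.00", "source": "iot"},
]

def transform(record):
    # Cast types and keep only the fields the target schema allows.
    return {"user": record["user"], "amount": float(record["amount"])}

# ETL (warehouse): transform first, then load -- only conforming data is stored.
warehouse = [transform(r) for r in raw_events]

# ELT (lake): load everything as-is; transform later, at query time.
lake = list(raw_events)
query_result = [transform(r) for r in lake if r["source"] == "web"]

print(warehouse)
print(query_result)
```

Note that in the ELT version the raw `source` field survives in the lake, so a future query can still use it; the warehouse discarded it at load time.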
Big data analytics, full-text search, real-time use, machine learning, and artificial intelligence are all outcomes of data lakes. Data is the primary commodity of any organization, and how you manage and manipulate it determines your survival, compliance, competitiveness, resilience, and profitability. Data warehouses were the original information storage strategy: you knew what you had, what it looked like, and who was using it for what, all on infrastructure you managed. But you kept running out of room, until cloud and virtualization on inexpensive commodity infrastructure arrived, just as Google, AWS, and Azure data lakes were introduced. Now you could scale up and down based on your needs, add whatever format of data you required, and use a plethora of tools to analyze data and make fast decisions in times of uncertainty (such as COVID-19), or simply to keep your organization relevant, competitive, secure, and compliant.
In 2017, an Aberdeen survey showed that businesses using data lakes outperformed their competitors by 9%. There are caveats for creating and using data lakes, as we will see, but the benefits clearly outweigh the risks.
What is a Data Warehouse?
To understand data lakes, you need to go back to 1992, when Bill Inmon coined the term data warehouse (with Ralph Kimball later popularizing the dimensional approach) to describe the rules and schemas that would control data for the next two decades. Data could be arranged into marts, like filing cabinets, then placed logically into a data warehouse to ensure security and usability. Enterprise data management became a board-level strategy, as what you knew, and when you knew it, was proving to be important.
The Wikipedia definition of a data warehouse highlights both its use and its weakness: “central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.” Like a physical warehouse, a guard (application security) allowed approved persons to benefit from the data inside. But, and here lies the weakness, you needed someone to shape the data into that useful format. Tools alone were of no use to the ordinary company user.
Data Lake vs. Data Warehouse
The main differences between a data warehouse and a data lake are shown in the table below. While not all-inclusive, the differences should help you appreciate that data is a strategic requirement for leaders. Not properly managing data can lead to reputational risk, fines, and insolvency.
| Characteristics | Data Warehouse | Data Lake |
| --- | --- | --- |
| Data | Relational, from transactional systems, operational databases, and line-of-business applications. All sources are known before placement into the warehouse. A better strategy for PII data. | Non-relational and relational, from IoT devices, websites, mobile apps, social media, and corporate applications. All data is accepted as long as it passes the security to enter the lake. Weaker support for transactional data compared to a warehouse. |
| Analytics | Reports, transaction management, business intelligence, dashboards | Analysis and modelling, artificial intelligence, profiling |
| Cost, speed, reliability | Fastest query results, using higher-cost storage | Query results get faster using low-cost storage, but if the lake is not managed, swamps can impact performance and reliability |
| Data quality | Highly curated data that serves as the central version of the truth | Any data, which may or may not be curated (i.e., raw data) |
| Users of data | Business analysts | Data scientists, data developers, and business analysts (using curated data) |
| Time and effort to manage data | Upfront, to find the data, cleanse it, and create a model for analysis and reporting. Easier to secure. | Minimal upfront effort, since data is loaded as-is and shaped on read. Harder to secure. |
| Skill needed to use data | Developers creating scripts | Data analysts and modelers are needed, but when available, the lake is very agile in use |
| Strengths | Data sources are known and fit the needs of users once manipulation and placement into the warehouse is complete | Cost is based on data acquired and how often data is accessed; one lake can meet most needs |
| Weaknesses | Difficult to change schemas or reports without changing the structure of the data warehouse. Easy to remain compliant. | Can become a swamp of data as you accept things you do not need. Harder to secure, and easier to break regulatory rules, as the data lake accepts any data source if not properly managed. |
An example of a data lake in action: marketing wants to know which customers are using social media and to what extent, but also needs their buying history and, if possible, what they turned down or returned. In addition, marketing wants to know customer churn, loyalty, who benefited from rewards, and the impact on the company. Using data warehouses, developers would have to extract information from several sources to build the report, and the social media information would prove the most difficult, if it were readable and usable at all. All this information can be easily found in a data lake, and the marketing team, using a tool like Tableau, could build the report in a couple of hours.
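The marketing scenario above is essentially a join across disparate sources. A rough sketch of that combination, with invented sample records rather than a real lake query or Tableau workflow:

```python
# Hypothetical records pulled from a lake: purchase history and social mentions.
purchases = [
    {"customer": "c1", "item": "shoes", "returned": False},
    {"customer": "c1", "item": "hat", "returned": True},
    {"customer": "c2", "item": "bag", "returned": False},
]
social = {"c1": ["twitter", "instagram"], "c3": ["tiktok"]}

# Combine the sources per customer, tolerating customers missing from either side.
report = {}
for p in purchases:
    row = report.setdefault(
        p["customer"],
        {"bought": 0, "returned": 0, "channels": social.get(p["customer"], [])},
    )
    row["bought"] += 1
    row["returned"] += int(p["returned"])

print(report["c1"])
```

In a warehouse, the social feed would first need a schema and an ETL pipeline; in a lake, the raw feed is already sitting next to the purchase data, waiting to be joined on read.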
The truth is that you will need and use both warehouses and lakes. You might decide to break up your data warehouse into data marts (filing cabinets for HR or Finance, for example) and throw them into your lake, but you will find you need both. The question is not the architecture; the question is the purpose. Standard, fast, repeatable queries benefit from a data warehouse. Analytics and modeling where the sources of data are disparate require a data lake.
Data Lake Architecture
Using analogies is a good way to understand the differences between data warehouses and data lakes. A warehouse is built for a purpose and to a specific design, allowing everything to be in its proper place after being approved for storage. Terms like relational, extract-transform-load (ETL), and schema-on-write are associated with data warehouses. Developers go to the correct data warehouse, find what they need, use it if access is approved, and create the relevant information for business use. If they need to change the data, it depends on whether the existing warehouse can be used or a new one must be built. The same applies to adding more data, as warehouses do not automatically grow.
Lakes, on the other hand, change shape when a new stream or water source arrives, shrink if a stream dries up, or even turn into a swamp if they fill with garbage or weeds. A data lake can scale up and down depending on the data sources and what is created and stored in the lake. No programming is required to do this, as cloud infrastructure has this capability natively if you pay for the service. Data lakes can also become data swamps of corrupted data, so care is required.
In a data warehouse, all the schemas needed to use the data must be created by developers who understand the data structure and intended use. In a data lake, the variety of data is made usable by a variety of analysis and modeling tools. A data analyst may be better suited to ensure appropriate information management, but arguably any approved user can benefit from the data in the lake: hence load, then transform. Terms like fluid, tagged for use, catalog, data mining, and schema-on-read are associated with data lakes.
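The idea of data being "tagged for use" in a catalog can be sketched with plain Python structures. The paths, tags, and dates below are invented for illustration and do not reflect any particular catalog product:

```python
from datetime import date

# Hypothetical lake catalog: each object is stored raw, with tags describing
# what it is, where it came from, and when it was created (schema-on-read).
catalog = [
    {"path": "s3://lake/clicks-0001.json", "kind": "clickstream",
     "source": "web", "created": date(2021, 3, 1)},
    {"path": "s3://lake/sensor-0001.csv", "kind": "telemetry",
     "source": "iot", "created": date(2021, 3, 2)},
]

def find(kind):
    # Consumers locate data through its tags instead of a fixed warehouse schema.
    return [entry["path"] for entry in catalog if entry["kind"] == kind]

print(find("telemetry"))
```

The point is that nothing about the files themselves had to conform to a schema before landing in the lake; the tags carry just enough metadata for a requester to find and interpret them on read.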
In the cloud, data is stored on commodity infrastructure for both data warehouses and data lakes. The main difference is that you need specific software to interrogate, analyze, and produce the information requested from the lake. The most prominent software stack designed for this purpose is the Hadoop data lake, which uses HDFS (Hadoop Distributed File System) for storage, along with catalogs of tags that mark each piece of data with what it is, where it came from, when it was created, and so on, which the requester then uses to create their model or analysis. YARN (Yet Another Resource Negotiator) and MapReduce, the core of Hadoop programming, support analysis and modeling of any data source. There is now a long list of other tools offering varying degrees of sophistication.
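The shape of a MapReduce job can be shown with the classic word count, reduced to plain Python. Real Hadoop jobs are typically written in Java against the MapReduce API and run distributed across a cluster; this sketch only illustrates the two phases:

```python
from collections import defaultdict

# Toy MapReduce: the map step emits (key, 1) pairs, the reduce step sums per key.
docs = ["data lake", "data warehouse", "data lake"]

def map_phase(doc):
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

mapped = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(mapped))  # {'data': 3, 'lake': 2, 'warehouse': 1}
```

In a cluster, the map calls run in parallel near the data on HDFS, and the framework shuffles the pairs so each reducer sees all values for its keys; the program logic stays this simple.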
Typical data lake characteristics include:
- Highly available SLA (a warehouse's availability must be planned).
- Data is masked and encrypted (not always in a warehouse).
- Tools for automated monitoring and alerting of use or illegal access are abundant.
- Requires training on security and regulatory aspects of data for developers and users.
- If in the cloud, scalable up/down.
- Technology agnostic: Spark, Hive, MapReduce, HBase, Storm, Kafka, and R-Server.
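The masking point above can be illustrated with a one-way hash applied before data enters the lake. This is a simplified sketch, not the mechanism of any particular platform, and the salt shown is a placeholder rather than a real secret:

```python
import hashlib

# Illustrative masking: replace PII with a one-way hash so analysts can still
# join on the masked value without seeing the original.
def mask(value, salt="example-salt"):  # placeholder salt for the sketch
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

record = {"email": "user@example.com", "spend": 120.50}
safe = {**record, "email": mask(record["email"])}
print(safe["email"] != record["email"])  # True
```

Because the hash is deterministic, the same customer masks to the same token across data sets, preserving joins while keeping the raw identifier out of the lake.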
AWS, Azure and Google Data Lakes
Commercial data lakes are available from Google, Amazon, and Microsoft. While other options become available every day, these companies built their cloud offerings with data lakes in mind. To leverage these data lake architectures, the Actian Data Platform has been designed from the ground up to deliver high performance and scale across all dimensions: data volume, concurrent users, and query complexity. The Actian Data Platform is a true hybrid platform that can be deployed on-premises as well as on multiple clouds, including AWS, Azure, and Google Cloud, making it easy for an organization to migrate or offload applications and data to the cloud at its own pace.