Data Lake

Data Lake vs. Data Warehouse: What is the Difference?

Data warehouses have been around for decades in forms ranging from specialized servers and appliances to databases and cloud services. Data lakes are an evolution of the big data concept, used to store vast amounts of data in its native format. Data lakes store full-fidelity data in all forms, including flat files, log files, and spreadsheet exports. A data warehouse, by contrast, is designed to make prepared data available to users through Business Intelligence (BI) dashboards, visualization, Artificial Intelligence (AI), and Machine Learning (ML) tools.

What is a Data Lake?

A data lake is primarily a data repository that accumulates large volumes of structured, semi-structured, and unstructured data in its full-fidelity native format.

What is a Data Warehouse?

A data warehouse is a relational database system containing data from multiple sources, structured for easy analysis and decision support.

Data Lake vs. Data Warehouse: How to Choose Between Them

Businesses that have large volumes of raw data can use a data lake to consolidate structured and unstructured data. Data warehouses are needed to analyze that data and gain business insights. Because they serve different functions, data warehouses and data lakes complement each other and can co-exist.

As the need to analyze data is vital to every business, the data warehouse is the natural starting point. A data lake can be justified as the business accumulates data that does not need to be analyzed immediately or must be retained for potential audits. Most businesses maintain multiple data warehouses that gather data from various transactional systems to support lines of business such as Sales, Finance, and Logistics.

Web visitor logs are an obvious example of a high-volume data source that can be mined for insights, and a data lake can be used to hold all of the log files. Sales and Marketing functions are usually interested in the most recent logs in support of current campaigns and active prospects. The data warehouse may be structured to maintain a rolling window of current fiscal-year visitors in detail, with summary data sufficing for past periods. In this case, the data lake maintains the full logs for all years, which can be loaded into the data warehouse to support specific projects.

A pragmatic approach to maintaining a data lake is to keep it close to where the data is generated, avoiding extensive data movement between clouds, for example. A business may maintain a logical data lake that spans data pools in AWS, Azure, Google Cloud, and regional data centers. Data warehouses based on solutions such as the Actian Data Platform can run analytics at each pool while maintaining a logical, centralized global view.

The Actian Data Platform has the additional benefit of allowing a business to create a hybrid data warehouse that can access data in the data lake as an extension of the objects stored in the data warehouse. Thanks to its Spark connector, all Hadoop data formats can be accessed using SQL queries submitted to the Actian Data Platform.
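As a simple illustration of this hybrid approach, the following sketch joins a native warehouse table with data lake files exposed as an external table (the table and column names are hypothetical; defining external tables is covered in the sections that follow):

-- Hypothetical example: orders is a native warehouse table and
-- ext_visitor_logs is an external table defined over files in the data lake.
SELECT o.order_id, o.order_total, v.page_url
FROM orders o
JOIN ext_visitor_logs v ON o.session_id = v.session_id
WHERE v.visit_date >= DATE '2024-01-01';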

Conceptual design representing the data lake vs data warehouse

Using the Actian Data Platform Data Warehouse to Access a Hadoop Data Lake

The Actian Data Platform (formerly known as Avalanche) can be used to query and write Hadoop data using the external tables feature. The data source must be one that Apache Spark can read from and write to, such as files stored in formats like Parquet, ORC, JSON, or tables in external database systems.

The CREATE EXTERNAL TABLE statement creates a table in the Actian Data Platform that points to existing data files in locations outside of the Actian Data Platform. This feature eliminates the need to import the data into a new table when the data files are already in a known location and in the desired file format.

After the data file structure is mapped to the Actian Data Platform format using the CREATE EXTERNAL TABLE statement, you can:

  • Select, join, or sort external table data.
  • Create views for external tables.
  • Insert data into external tables.
  • Import and store the data into an Actian Data Platform database.

The data is queried from its original locations, and the Actian Data Platform leaves the data files in place when you drop the table.
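For example, once an external table has been defined, it can be wrapped in a view or its contents imported into a native table. The sketch below assumes a hypothetical external table named ext_visitor_logs and a hypothetical native table named visitor_logs_local:

-- Create a view over the external table.
CREATE VIEW recent_visits AS
SELECT session_id, page_url, visit_date
FROM ext_visitor_logs
WHERE visit_date >= DATE '2024-01-01';

-- Import and store the external data in a native warehouse table.
INSERT INTO visitor_logs_local SELECT * FROM ext_visitor_logs;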

The Actian Data Platform leverages Apache Spark’s extensive connectivity for external tables functionality.

The Actian Data Platform receives queries operating on external tables from the user or client app and rewrites them into JSON requests for external data. These requests are then translated into Spark jobs, which are launched. These jobs typically issue queries (to SparkSQL) like “INSERT INTO avalanche_table SELECT * FROM external_resource” for reading external data or “INSERT INTO external_resource SELECT * FROM avalanche_table” for writing to external systems. Finally, these jobs push and pull data in and out of the Actian Data Platform.
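As a rough sketch of that flow (the table names are illustrative, and the internal statement is paraphrased from the description above rather than captured from a running system):

-- User-facing query against an external table:
SELECT col1 FROM test_table_csv WHERE col1 > 10;

-- Conceptually, the platform rewrites this into a JSON request, launches a
-- Spark job, and that job issues a SparkSQL statement of the form
--   INSERT INTO avalanche_table SELECT * FROM external_resource
-- so that the qualifying rows are pushed into the Actian Data Platform,
-- where query processing completes.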

You can even insert new records into an external table.

The syntax for defining an external table is:

CREATE EXTERNAL TABLE table_name (column_name data_type {,column_name data_type})
USING SPARK
WITH REFERENCE='reference'
[,FORMAT='format']
[,OPTIONS=('key'=value {,'key'='value'})]

For more information, see CREATE EXTERNAL TABLE in the SQL Language Guide.

Reading and Writing to an External Table

After external tables are defined with the CREATE EXTERNAL TABLE syntax, they behave like regular tables. You can issue queries such as the following:

SELECT * FROM test_table_csv

INSERT INTO my_table_orc SELECT some_column FROM other_table

Predicate Pushdown From the Actian Data Platform to Spark

If a SQL query is issued to an EXTERNAL TABLE using Spark, predicates from the WHERE clause may already be evaluated by Spark, reducing the number of tuples sent to the Actian Data Platform. Only predicates that can be translated into a column value range are supported. For logical connections of simple predicates, IN and AND are supported on a single column and across columns; OR is supported only on a single column.

If the complex predicate contains a single OR spanning different columns, nothing is pushed down to Spark. In this case, Spark transfers all tuples to the Actian Data Platform, and the filtering is done solely on the Actian side.
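For instance, under these rules a range predicate on one column, or simple predicates on different columns combined with AND, can be evaluated by Spark, while a single OR spanning two columns cannot (the table and column names below are hypothetical):

-- Pushed down to Spark: simple predicates combined with AND, across columns.
SELECT * FROM ext_visitor_logs
WHERE visit_date >= DATE '2024-01-01' AND session_id = 42;

-- Not pushed down: one OR spanning different columns; Spark transfers all
-- tuples and the filtering happens on the Actian side.
SELECT * FROM ext_visitor_logs
WHERE session_id = 42 OR page_url = '/home';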

CREATE EXTERNAL TABLE Examples

  1. Define an external table for a CSV data source residing in Amazon S3:
CREATE EXTERNAL TABLE ext_csv (col1 INT4 NOT NULL, col2 VARCHAR(20) NOT NULL)
USING SPARK
WITH REFERENCE='s3a://<bucket>/file.csv',
OPTIONS=('DELIMITER' = '|', 'HEADER' = 'TRUE','SCHEMA'='col1 INT4 NOT NULL, col2 VARCHAR(20) NOT NULL');
  2. Define an external table for a CSV data source residing in Amazon S3. The CSV data does not have a header row:

CREATE EXTERNAL TABLE nation_s3 (
n_nationkey INTEGER NOT NULL,
n_name CHAR(25) NOT NULL,
n_regionkey INTEGER NOT NULL
) USING SPARK WITH REFERENCE='s3a://<bucket>/nation.csv',FORMAT='csv',
OPTIONS=('HEADER'='FALSE','DELIMITER'='|','SCHEMA'='n_nationkey INT NOT NULL, n_name STRING NOT NULL, n_regionkey INT NOT NULL');

  3. Define an external table for an ORC data source from Azure:
CREATE EXTERNAL TABLE my_table_orc(a INT8 NOT NULL)
USING SPARK WITH REFERENCE='abfs://loadtest@avalanchetest.dfs.core.windows.net/my_table.orc';
  4. Define an external table for an AVRO data source from Azure:
CREATE EXTERNAL TABLE tweets
(username VARCHAR(20),
tweet VARCHAR(100),
timestamp VARCHAR(50))
USING SPARK
WITH REFERENCE='abfs://loadtest@avalanchetest.dfs.core.windows.net/twitter.avro',
FORMAT='com.databricks.spark.avro';

  5. Define an external table for a JSON data source using the following JSON file:
[{
    "symbol": "MSFT",
    "company": "Microsoft Corporation",
    "stock_tstamp": "2020-01-31T21:00:00+00:00",
    "price": 170.777,
    "volume": 36142690,
    "sector": "TECH"
},
{
    "symbol": "AAPL",
    "company": "Apple Inc.",
    "stock_tstamp": "2020-01-31T21:00:00+00:00",
    "price": 309.51,
    "volume": 49897096,
    "sector": "TECH"
},
{
    "symbol": "GOOG",
    "company": "Alphabet Inc.",
    "stock_tstamp": "2020-01-31T21:00:00+00:00",
    "price": 1434.23,
    "volume": 2417214,
    "sector": "TECH"
},
{
    "symbol": "AMZN",
    "company": "Amazon.com, Inc.",
    "stock_tstamp": "2020-01-31T21:00:00+00:00",
    "price": 2008.72,
    "volume": 15567283,
    "sector": "TECH"
}]

Example SQL to create an external table to reference that JSON file stored on an Azure ADL:

CREATE EXTERNAL TABLE techstocks (
symbol VARCHAR(4),
company VARCHAR(20),
stock_tstamp TIMESTAMP,
price FLOAT,
volume INTEGER,
sector CHAR(5)
) USING SPARK WITH REFERENCE='abfs://stockquotes@eastusstockdata.dfs.core.windows.net/tech1.json',
FORMAT='json',
OPTIONS=(
'multiline'='true',
'SCHEMA'= 'symbol string, company string, stock_tstamp string, price double, volume integer, sector string');

  6. Define an external table for a CSV data source residing in Google Cloud Storage. The CSV data has no header row:

CREATE EXTERNAL TABLE stations (
n_stationkey INTEGER NOT NULL,
n_stationname CHAR(25) NOT NULL,
n_locationkey INTEGER NOT NULL
) USING SPARK WITH REFERENCE='gs://<path>/stations.csv',FORMAT='csv',
OPTIONS=('HEADER'='FALSE','DELIMITER'='|','SCHEMA'='n_stationkey INT NOT NULL, n_stationname STRING NOT NULL, n_locationkey INT NOT NULL');

Make your own comparisons and evaluate the capabilities of the Actian Data Platform for yourself.