ETL streaming through a data pipeline

Getting practical insights from data analytics requires refined data. Getting raw data to the destination data warehouse to support business decisions requires a multi-step process to turn raw data into usable data.

What is ETL (Extract, Load, Transform)?

Data engineers use ETL processes to prepare data for analysis. ETL is an acronym for Extract, Transform, and Load. It describes a three-step process used to extract raw data from multiple source systems and transforms it into a suitable format before loading it into a data warehouse.

Step One: Extraction
The first phase of the ETL process extracts raw operational data from one or more source systems. This can happen using a daily batch job if the dataset you are loading only changes daily. The dataset can be overwritten after a TRUNCATE TABLE SQL command if the dataset is small. Larger source datasets can be extracted incrementally or partially, taking only changes from the previous extract.

A partial extraction with an update notification technique is used when the source data needs to be synched as changes occur to provide near-real-time updates to the data warehouse. The source system can push changes to several subscribing target systems if it is one-to-many connections. Micro-batches are useful for systems that change rapidly. The target system can poll the source system every five seconds for changes, which can be applied to the target system in near-real-time.

Source databases can implement replication using row-level triggers that fire when an event such as an insert, delete or update operation has occurred. These triggers can be programmed to update a log of changes for downstream systems to extract data from. Using a messaging system such as Amazon SNS enables you to send notifications directly to your target systems to inform them of updates to the source system. This kind of asynchronous relationship between source and target data systems ensures the operational source systems never get held up because a target system cannot accommodate the update for any reason.

The relations between source and target systems can be a simple point-to-point connection which is simple to set up and maintain. Over time, businesses end up creating hundreds of such one-to-one connections, which become too burdensome to maintain and administer. At this point, a vended ETL tool can consolidate the management of all the connections, making them an excellent investment. An important aspect of extraction is remembering the data source, so systems can catalog it to maintain and audit the trail of where data came from. Data provenance (also referred to as “data lineage”) is metadata that is paired with records that detail the origin, changes to, and details supporting data source origins. Tracking sources in this way helps control duplication and informs whether data can be trustworthy or reliable.

Extracting data can be as simple as importing a .CSV version of a spreadsheet or as complex as using a web service Application Programming Interface (API). Most business applications provide APIs to make extraction possible using an application or ETL tool. The format of the source data often dictates how it can be extracted. A JSON stream, for example, requires the use of a web services API. A flat file can be read sequentially. A database source usually has an export utility to unload data into a flat file.

Step Two: Transform
Raw data from an operational source usually needs some manipulation and cleansing to make it suitable for general consumption. Data transformation often takes place in an intermediate or staging database. Raw source data can have excess data that needs to be filtered out. It can have gaps that can be filled with default values or interpolated data. Duplicate data can be aggregated or removed depending on the intended use in the target database. If related data is being extracted from multiple sources, it can be merged to create a single record. Data values can be normalized. For example, different sources of customer data may record them differently. One system might spell out the state as in “New York”, and others use an abbreviation such as “NY”. Standardizing on the abbreviation can be done during data transformation. Transformation can include data quality operations such as verifying values across multiple sources before passing them downstream as a trusted value.

Step Three: Load
Data loading into the target system is the final step in the ETL process. The data load can be done using a bulk loading utility, data streaming, asynchronous messaging, or a custom program. Existing data can be overwritten entirely, augmented, or partially updated with fresh values. The frequency of update can be performed in batches such as overnight, periodically in micro-batches or continuously in a data stream.

What is Extract, Load, Transform (ELT)?

ELT stands for Extract, Load and Transform. ELT departs from the traditional ELT process by performing data transforms inside a target or intermediate staging database, typically using SQL to do so. ELT completes data transformation before loading the data warehouse to avoid having raw data in the destination data warehouse.

ETL vs. ELT. What Differences Do They Have?

There are two primary differences when we talk about ETL vs. ELT:

  • ETL is focused on loading transformed, cleaner data into the target data warehouse.
  • ELT transforms raw data in the data warehouse.
  • ETL uses an intermediate data store and database server to transform data.
  • ELT performs data transforms within the target data warehouse.

ETL can be seen as being more complex because it uses more servers and data stores than ELT. This makes ELT appear to be a more streamlined data pipeline. Data loading and unloading can be slow, so performing the transformation in the same server can be faster. The idea of ELT was proposed and popularized by data warehouse providers such as Teradata as it encouraged customers to license more extensive data warehouse systems.

ETL vs. ELT: Which is Better?

There are several factors to consider when you decide to use ETL or ELT. If you have limited capacity at your data warehouse, then use ETL. Because ELT only needs good SQL skills, it can save your team from learning external tools to do data transformations. If the source data needs minimal change, such as deduping, then ELT is a good approach. If the source data needs gaps filled, needs to be merged with other sources, and data types changed, then ELT is a better choice.

ETL with Actian DataConnect

To better support ETL, Actian DataConnect provides an Extract Editor that allows you to markup unstructured data, extract the required data fields from various lines in the file, and assemble the fields into a flat data record. The extracted content is presented in row and column tabular format, which you can view before converting the data to a target format. The extracted data is used as a source in Map Editor for further data transformation activities. Extract Editor provides a rich set of pattern recognition rules and actions to assist in clean data extraction.

Actian DataConnect can parse columnar data in selected text using a given column separator. This will split a line of text into multiple data fields. Tagged data can use a user-specified tag separator value to split a line of text into a line recognition pattern and a data field of fixed start and end columns.

Extract scripts can be generated by saving extract files as a Content Extraction Language (CXL) script file. Actian DataConnect provides a runtime engine, which is an embedded, high-performance, cross-platform engine. It executes the integration artifacts created using the Actian DataConnect Studio IDE or Studio SDK on Windows or Linux servers. The same portable integration package (.djar file) can be run on any of these platforms with no code changes.

The Runtime Engine can be invoked through a Command Line Interface (CLI), Java SDK, Integration Manager, or its API. This section provides information about using the Runtime Engine through the CLI. Actian Data Platform includes one free run time engine, which can be upgraded to support multiple parallel job streams for loading data into the cloud data warehouse. Actian makes ETL and ETL much more manageable by including the Actian Data Platform, a robust cloud-based data warehouse with the data integration capabilities to populate it from multiple sources on-premise and SaaS data sources.

It’s time to make your data easy; visit our website to learn more about the suite of Actian data products and solutions.