Data Extraction: The Ultimate Guide to Extracting Data from Any Source
Data extraction is the process of moving data out of a source data set. It is often the first step of the Extract, Transform, Load (ETL) process in a data pipeline. Data engineers are responsible for performing data extraction, which feeds data analytics and machine learning (ML) functions.
Data Extraction Sources
After data is extracted, it can be cleaned, transformed and loaded into analytic databases. Below are some examples of how data is extracted, organized by data source type:
Flat files are two-dimensional structures consisting of bytes of data, stored in an operating system's file system or a cloud file store. A file is structured as a stream of bytes, with special character strings denoting the end-of-file (EOF) or the end of a line (a carriage return/line feed, or CRLF), allowing the file to be represented as a set of records. Each record in the file can be fixed-length or variable-length, with variable-length records terminated by the CRLF character string. Within a record, a delimiter string logically separates the fields; in a CSV file, for example, the delimiter is the comma character. Data extraction utilities understand this format, making flat files easy to read: the tool reads the file field by field, assigning data types as directed. Unlike data streams, flat files have a well-defined life cycle of creation, opening, appending, closing, and deletion.
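The field-by-field reading described above can be sketched with Python's standard `csv` module. The file contents and type assignments below are illustrative, not from any particular system; an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# A small in-memory stand-in for a comma-delimited flat file;
# each variable-length record is terminated by a newline.
flat_file = io.StringIO(
    "id,name,amount\n"
    "1,Alice,19.99\n"
    "2,Bob,5.00\n"
)

reader = csv.DictReader(flat_file)  # the comma is the field delimiter
records = []
for row in reader:
    # Assign data types field by field, as a data extraction tool would.
    records.append({"id": int(row["id"]),
                    "name": row["name"],
                    "amount": float(row["amount"])})

print(records[0])  # {'id': 1, 'name': 'Alice', 'amount': 19.99}
```

A real extraction tool would also handle quoting, escape characters, and malformed records, which `csv.DictReader` covers through its dialect parameters.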
Most applications and operating system functions produce log files, which are used for exception handling, auditing and as a source of analytic data. Log files are usually flat files. Because these files are often configured with limited retention periods to save storage space, they must be extracted before the retention period expires and the files are overwritten.
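Extracting analytic data from a log file usually means parsing each line into structured fields. The log layout below is a hypothetical "timestamp level message" format, sketched with Python's standard `re` module.

```python
import re

# Hypothetical log line in a simple "date time level message" layout.
log_line = "2024-05-01 12:00:03 ERROR payment service timed out"

# Extract structured fields from the flat-file log before it is rotated away.
pattern = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>\w+) (?P<message>.*)$"
)
record = pattern.match(log_line).groupdict()

print(record["level"])  # ERROR
```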
Data streams differ from flat files in that they don’t have an end, so once a stream is opened, the data extraction utility will continue to wait for more data. Streamed data is handled by stream-management platforms such as Apache Kafka, which ingest the data stream at its source and store it in a queue that data loaders and data integration tools subscribe to. As data is created, it is ingested and made available to consuming applications via the stream manager. This publish-and-subscribe model keeps administration costs down and saves a lot of coding on the consuming-application side.
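A real Kafka deployment needs a running broker, so the publish-and-subscribe idea is sketched here with a minimal in-memory topic; this is not Kafka's API, only an illustration of why subscribers write so little plumbing code.

```python
from collections import deque

class StreamTopic:
    """Minimal in-memory stand-in for a stream manager's topic queue."""

    def __init__(self):
        self.queue = deque()       # retained events
        self.subscribers = []      # registered consumer callbacks

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, event):
        self.queue.append(event)
        # The stream manager pushes each event to every subscriber;
        # consumers never poll the producer directly.
        for callback in self.subscribers:
            callback(event)

received = []
topic = StreamTopic()
topic.subscribe(received.append)   # a data loader subscribing to the topic
topic.publish({"sensor": "s1", "value": 42})
topic.publish({"sensor": "s1", "value": 43})
print(len(received))  # 2
```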
Some applications need to be notified immediately of changes, such as stock trading systems and automated driving systems; however, most systems can tolerate a short delay. Rather than being notified of each change, which is expensive in CPU resources, it is often better to design consuming systems to pull the data periodically in batches or micro-batches. This kind of data extraction protects the consuming servers from becoming overwhelmed by the data streams they consume. Not all applications can tolerate the delay, which is why streaming workloads are often hosted in cloud environments.
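The micro-batch pull described above can be sketched as a consumer-paced loop. The queue and batch size below are illustrative; the point is that the consumer, not the producer, decides how much data to take at a time.

```python
from collections import deque

# Hypothetical upstream queue that a stream manager keeps filling.
stream_queue = deque(range(10))

def pull_batch(queue, max_batch=4):
    """Pull up to max_batch records; the consumer controls the pace."""
    batch = []
    while queue and len(batch) < max_batch:
        batch.append(queue.popleft())
    return batch

batches = []
while stream_queue:
    batches.append(pull_batch(stream_queue))

print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Capping the batch size is what protects the consuming server: a burst upstream lengthens the queue rather than flooding the consumer.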
All applications are designed to receive data, process it, and output the result. Legacy applications tend to use nonstandard data formats, so developers must read the data from, for example, the flat file containing an output report. Modern web applications are designed to interoperate within larger systems. They typically use standard, self-describing formats such as JSON, which carry metadata such as field names, formats, and length information.
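"Self-describing" means the field names travel with the data, so a consumer can discover them without an external schema. A minimal sketch with Python's standard `json` module, using a made-up payload:

```python
import json

# A hypothetical self-describing JSON payload from a web application:
# the field names are embedded alongside the values.
payload = '{"order_id": 1001, "customer": "Acme", "total": 249.50}'

record = json.loads(payload)
# No separate schema file is needed to discover the fields.
print(sorted(record.keys()))  # ['customer', 'order_id', 'total']
```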
Data can be extracted from databases in three ways: by writing a custom application, by using a data-export tool, or by using a vendor-provided interface such as ODBC. Most database providers include an export utility that unloads the data into a flat file; data can be exported in a comma-delimited format for maximum portability. Drivers such as ODBC and JDBC provide an application programming interface (API) for developers and data integration tools to use.
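The driver-based approach can be sketched with Python's built-in `sqlite3` module, which exposes the same DB-API shape that ODBC wrappers do; the table and rows below are invented for illustration. The result is the comma-delimited unload described above.

```python
import csv
import io
import sqlite3

# sqlite3 stands in for any driver exposing Python's DB-API;
# a production pipeline might use an ODBC driver instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, "east", 100.0), (2, "west", 250.5)])

# Extract every row and unload it into a comma-delimited format.
cursor = conn.execute("SELECT id, region, amount FROM sales")
out = io.StringIO()
writer = csv.writer(out)
writer.writerow([col[0] for col in cursor.description])  # header row
writer.writerows(cursor.fetchall())

print(out.getvalue())
```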
Data can also be extracted from databases for operational agility, such as maintaining asynchronously updated replicas so that globally distributed offices or regional outlets have a local copy that lets them work autonomously. In this case, log-based change-data-capture (CDC) tools such as HVR are used to extract and distribute the data.
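CDC tools read the database's transaction log; the toy change list below only mimics that idea to show how captured changes are replayed against a replica. None of the operations or keys here come from any real CDC product.

```python
# A toy change log, standing in for what a CDC tool captures from
# the source database's transaction log.
change_log = [
    ("insert", 1, {"name": "Alice"}),
    ("insert", 2, {"name": "Bob"}),
    ("update", 1, {"name": "Alicia"}),
    ("delete", 2, None),
]

replica = {}  # the asynchronously maintained local copy

def apply_change(replica, op, key, row):
    """Replay one captured change against the replica, in log order."""
    if op in ("insert", "update"):
        replica[key] = row
    elif op == "delete":
        replica.pop(key, None)

for op, key, row in change_log:
    apply_change(replica, op, key, row)

print(replica)  # {1: {'name': 'Alicia'}}
```

Because changes are applied in log order, the replica converges to the source's state without re-extracting unchanged rows.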
Another major reason to extract data from a database is backup and recovery, to maintain business continuity. In these cases, data can be extracted as physical blocks, bypassing the SQL layer for maximum throughput.
There are many ways to share data for extraction. Data can be encrypted to protect it from theft at rest and in transit. The publish-and-subscribe model is one means of sharing data. A less sophisticated method is to push files to the consuming systems using protocols such as FTP and SFTP.
Pull mechanisms allow consumers to download data from a web browser using HTTP, so network administrators don’t need to open inbound ports that could become an attack vector for hackers. Downloading from a website creates the file on the consuming side of the connection, inside the firewall.
Actian and Data Extraction
The Actian Data Platform provides a unified experience for ingesting, transforming, analyzing, and storing data. The Actian Data Platform can be set up and loaded in just minutes for instant access to your analytic data. Built-in data integration, ultra-fast performance, and the flexibility to deploy in multiple clouds or on-premises let you analyze your data wherever it currently resides.
Try the Actian Data Platform by visiting our website and signing up for a free trial.