What Is Data Ingestion?
Data must be ingested by an application program or database management system before it can be processed or analyzed. In other words, whenever an application or database system reads external data, it is performing data ingestion.
Where Data Ingestion Fits in the Data Preparation Process
Data preparation is a process that encompasses extract, transform & load (ETL) and includes a data ingestion step. Once the source data is identified, it must be brought into the analysis tools to gain business insights. The data will likely be a combination of structured and semi-structured data in different file systems and repositories. Consolidating all the source data sets into a common repository is necessary for the subsequent steps in the data pipeline. Data preparation is a multi-step process outlined below:
- Access source data sets.
- Ingest the data.
- Cleanse the data.
- Format the data.
- Combine datasets.
- Analyze data.
How Data Is Ingested
Ingestion Through Parameters
Application programs and microservices can be coded to receive parameters when executed. For example, a program called ADD might accept two numbers as parameters, from which it computes the sum.
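A minimal sketch of such a program in Python might look like the following; the name `add.py` and the command-line invocation are illustrative assumptions, not part of the original description.

```python
def add(params):
    # Parameters arrive from the command line as strings, so the
    # program converts them to numbers before computing the sum.
    return sum(float(p) for p in params)

# A program invoked as `python add.py 2 3` would receive ["2", "3"]
# in sys.argv[1:]; here we pass the equivalent list directly:
print(add(["2", "3"]))  # 5.0
```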
Ingestion of Array Data
Applications can also be passed fixed or variable length arrays of data. A more sophisticated version of the ADD program might accept a list of numbers to sum. Web applications expose modern application programming interfaces (APIs) that declare the data they expect to ingest, which makes it straightforward for other systems to integrate with them. Formats such as JSON allow a variable number of elements to be passed in a single structure, such as an array.
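As a sketch, a JSON array lets the caller pass however many elements it likes, and the receiving program ingests them all at once:

```python
import json

def ingest_numbers(payload: str):
    # A JSON array carries a variable number of elements, so the
    # receiving program does not need a fixed parameter count.
    return json.loads(payload)

numbers = ingest_numbers("[1, 2.5, 3]")
print(sum(numbers))  # 6.5
```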
Transaction Record Entry
Traditional applications collect input fields from terminals or graphical user interfaces, which are ingested by applications as records for processing. A bill payment application might ingest records containing fields such as payee name, amount, account number, routing number, or sort code.
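A minimal sketch of this kind of record ingestion is shown below; the field names and sample values are hypothetical, chosen to mirror the bill payment example.

```python
from dataclasses import dataclass

@dataclass
class PaymentRecord:
    payee_name: str
    amount: float
    account_number: str
    routing_number: str

def ingest_form_fields(fields: dict) -> PaymentRecord:
    # Input fields collected from a terminal or GUI arrive as strings;
    # ingestion converts them into a typed record for processing.
    return PaymentRecord(
        payee_name=fields["payee_name"],
        amount=float(fields["amount"]),
        account_number=fields["account_number"],
        routing_number=fields["routing_number"],
    )

record = ingest_form_fields({
    "payee_name": "Acme Utilities",   # hypothetical sample data
    "amount": "142.50",
    "account_number": "00123456",
    "routing_number": "021000021",
})
print(record.amount)  # 142.5
```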
File Record Ingestion
Transaction systems such as enterprise resource planning (ERP) systems often journal transactions to files for later batch processing; when the overnight batch run occurs, those files are ingested one record at a time. Flat files mimic mainframe tape records, which use specific bit strings as end-of-record and end-of-file markers to standardize ingestion processing.
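The record-at-a-time pattern can be sketched as follows; the pipe-delimited layout and sample journal contents are assumptions for illustration.

```python
import io

def ingest_records(flat_file):
    # Each line is one journaled transaction record; the newline acts
    # as the end-of-record marker, and the file's end signals
    # end-of-file to the batch job.
    for line in flat_file:
        yield line.rstrip("\n").split("|")

# Simulate an overnight batch file with pipe-delimited fields.
journal = io.StringIO("1001|PAY|142.50\n1002|PAY|77.10\n")
for record in ingest_records(journal):
    print(record)
```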
Ingesting Cloud Data
Data stored in the cloud is ingested using APIs very similar to on-premise ones. The added steps usually involve authenticating to the cloud service; the API's back end behaves differently from the on-premise version, but this difference is transparent to the reading application.
Ingesting Trading and Gaming Data
When the reading application receives streamed data and misses updates due to network failures or resource shortages, it can be configured to drop intermediate messages when connectivity is restored, because only the current stock price or game score matters. If every message matters, unacknowledged messages are queued at the server until all the subscribing clients are updated. If live data is critical, a network connection time-out can trigger a failover and retry event on a backup server.
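The "only the current value matters" case can be sketched as keeping just the most recent update per symbol when a backlog of missed messages arrives after reconnecting; the symbols and prices below are made up.

```python
def latest_only(messages):
    # When only the current value matters (a stock price, a game
    # score), intermediate messages received after a reconnect can
    # be dropped; keep just the most recent update per symbol.
    current = {}
    for symbol, price in messages:
        current[symbol] = price
    return current

backlog = [("ACME", 10.0), ("ACME", 10.5), ("ACME", 10.2)]
print(latest_only(backlog))  # {'ACME': 10.2}
```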
Ingesting Database Records
Applications ingest database data by submitting a SQL query to an RDBMS using an API such as ODBC or a proprietary interface. The result set is fetched one record at a time.
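A sketch of this fetch loop, using Python's built-in sqlite3 module as a stand-in for an ODBC or proprietary interface (the table and rows are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payee TEXT, amount REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)",
                 [("Acme", 142.5), ("Globex", 77.1)])

# Submit a SQL query, then fetch the result set one record at a time;
# fetchone() returns None once the result set is exhausted.
cursor = conn.execute("SELECT payee, amount FROM payments")
rows = []
row = cursor.fetchone()
while row is not None:
    rows.append(row)
    row = cursor.fetchone()
conn.close()
print(rows)
```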
Ingesting Data Into a Database
Most database management systems include fast loader utilities. These provide the ability to load data faster than an application program because they can parallelize the load operation and preformat database blocks to bypass the SQL layer if needed.
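The batching principle behind fast loaders can be illustrated with sqlite3's executemany(), which submits many rows in one call rather than one INSERT per row; real loader utilities go further by parallelizing and preformatting blocks, which this sketch does not show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")

rows = [("s1", 1.0), ("s2", 2.0), ("s3", 3.0)]

# Load many rows per call instead of issuing one INSERT statement
# per row; this batching is the core idea a fast loader exploits.
with conn:
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(count)  # 3
```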
Streaming Data Ingestion
Applications that process streaming data usually subscribe to a queue populated when new data is created. Examples of streaming data managers include AWS SNS, IBM MQ, Apache Flink and Kafka. As applications become less batch-oriented and more online, the real-time nature of streamed data is becoming the norm.
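The subscribe-to-a-queue pattern can be sketched with Python's standard library; a real deployment would use a streaming manager such as those above, and the event names here are placeholders.

```python
import queue
import threading

stream = queue.Queue()

def producer():
    # New data is published onto the queue as it is created.
    for event in ("tick-1", "tick-2", "tick-3"):
        stream.put(event)
    stream.put(None)  # sentinel marking end of stream

received = []

def subscriber():
    # The ingesting application blocks until the next message arrives.
    while True:
        event = stream.get()
        if event is None:
            break
        received.append(event)

t = threading.Thread(target=producer)
t.start()
subscriber()
t.join()
print(received)
```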
IoT Data Ingestion
In an IoT network, many devices generate data streams for analysis. To ease processing, the data streams are sent to a gateway server that filters out irrelevant data and compresses data for faster network traversal to a streaming data service that client applications subscribe to.
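A sketch of the gateway's filter-and-compress step is below; the sensor readings, the "None means faulty reading" rule, and the choice of zlib are all assumptions for illustration.

```python
import json
import zlib

readings = [
    {"sensor": "t1", "value": 21.7},
    {"sensor": "t1", "value": None},   # faulty reading: irrelevant
    {"sensor": "t2", "value": 19.4},
]

# The gateway filters out irrelevant data...
relevant = [r for r in readings if r["value"] is not None]

# ...and compresses the remainder for faster network traversal.
payload = zlib.compress(json.dumps(relevant).encode("utf-8"))

# The streaming service decompresses before delivering to subscribers.
restored = json.loads(zlib.decompress(payload))
print(restored)
```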
Data integration eases data ingestion by providing standardized connectors to hundreds of data sources and the tools to transform data before analytics systems ingest it. The second significant benefit of data integration solutions is that they provide centralized visibility and orchestration of data preparation pipelines. This is particularly important when the analysis is not a one-off process and the pipeline needs to be reusable.
Actian and Data Ingestion
The Actian Cloud Data Platform is designed for fast analysis using a columnar database. Data ingestion is managed by a data integration service that manages data flow from multiple sources to the analytics database. This eases the creation of real-time insights. Analytics can run on both on-premise and cloud platforms, including AWS, Google Cloud and Microsoft Azure. DataConnect supports file-based and stream-based ingestion from JMS, Kafka, MSMQ, RabbitMQ and WebSphere MQ.
Sign up now and try the Cloud Data Platform for free.