Data Management

Data Cleansing

Why is Data Cleansing Important?

Thanks to the digitization of business processes, an abundance of data is available to analyze. Getting value from that data depends significantly on the quality and integrity of the data used to drive analysis and decision-making. Decisions based on poor-quality or inaccurate data increase the chances of bad outcomes because they are built on misinformation. Confident decisions need to be based on high-quality data.

Where Data Cleansing Fits

Data cleansing is part of a broader data preparation process that typically moves through the following steps before an organization can analyze data:

  • Access data.
  • Ingest (or fetch) data.
  • Cleanse data.
  • Format data.
  • Combine datasets.

Data cleansing describes the tools and techniques used to clean raw or dirty data by filling gaps, filtering out irrelevant data, deduplicating records, and formatting values so the data is accurate and of higher quality.
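
As a rough illustration only, the sketch below strings these stages together using the pandas library; the file names, columns, and join key are all hypothetical.

```python
import pandas as pd

# 1. Access / 2. Ingest: read raw data from source systems (hypothetical files).
orders = pd.read_csv("orders_export.csv")
customers = pd.read_csv("crm_customers.csv")

# 3. Cleanse: drop exact duplicates and fill gaps with a sensible default.
orders = orders.drop_duplicates()
orders["discount"] = orders["discount"].fillna(0.0)

# 4. Format: standardize types so downstream joins and math behave predictably.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")

# 5. Combine: merge the datasets into one analysis-ready table.
analysis_ready = orders.merge(customers, on="customer_id", how="left")
```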

Data Cleansing Techniques

There are many ways data can become unsuitable for analysis. Below is a selection of techniques used to refine raw data and improve data quality:

Deduplication and Data Cleansing

Removing duplicated records is essential to avoid double counting, and it is easy to create duplicates when merging multiple datasets. In a relational database, ignoring duplicates in a query is easy with the DISTINCT qualifier, but it still makes sense to dedupe the stored data to save space and CPU cycles. Extract, transform, and load (ETL) tools compare strings to find duplicates when processing outside a database. When deduping inside a database with an extract, load, and transform (ELT) approach, it is more straightforward in SQL: SELECT DISTINCT from the source and INSERT the results into a target table.
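
For the out-of-database case, here is a minimal sketch of deduplication using pandas (an assumed tool here, not a prescribed one); the lead data and column names are hypothetical.

```python
import pandas as pd

# Two hypothetical lead records that overlap after merging source systems.
leads = pd.DataFrame({
    "email": ["ana@example.com", "ana@example.com", "bo@example.com"],
    "state": ["CA", "CA", "NY"],
})

# Exact duplicates: every column matches.
deduped = leads.drop_duplicates()

# Near-duplicates keyed on a business identifier: keep the first record per email.
deduped_by_email = leads.drop_duplicates(subset=["email"], keep="first")

print(deduped_by_email)
```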

Improving Consistency With Data Cleansing

Most data input validation checks for allowable values, but different systems often have their own rules for what they consider valid. For example, if one system records US states as two characters and another uses a drop-down with state names spelled out, it’s a good idea to standardize on one format and update records to keep the format consistent. Using the same case and data type also improves consistency.
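
A small sketch of that kind of standardization, assuming pandas and a deliberately abbreviated lookup table of state names; a production version would cover every state and territory.

```python
import pandas as pd

# Hypothetical records from two systems with different state conventions.
contacts = pd.DataFrame({"state": ["CA", "california", "New York", "ny"]})

# Partial lookup table; a real one would be complete.
STATE_CODES = {"california": "CA", "new york": "NY"}

def standardize_state(value: str) -> str:
    """Return a two-character, upper-case state code."""
    cleaned = value.strip().lower()
    if len(cleaned) == 2:
        return cleaned.upper()
    return STATE_CODES.get(cleaned, value)  # leave unknowns untouched for review

contacts["state"] = contacts["state"].map(standardize_state)
print(contacts)
```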

Filling Gaps

If a field is optional, the resulting data file can contain null values. If a reasonable default value exists, it can be inserted to remove the nulls. In a data series, an extrapolated or interpolated value can help smooth the data.
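
A brief sketch of both approaches, assuming pandas; the default value and the numeric series are illustrative.

```python
import pandas as pd

# Hypothetical readings with missing values (None becomes NaN in pandas).
readings = pd.DataFrame({
    "region": ["east", None, "west", "west"],
    "temperature": [21.5, None, None, 24.0],
})

# Categorical gap: substitute a reasonable default value.
readings["region"] = readings["region"].fillna("unknown")

# Numeric series: interpolate between neighboring points to smooth the data.
readings["temperature"] = readings["temperature"].interpolate(method="linear")

print(readings)
```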

Filtering

Records and fields that are not useful downstream can be filtered out to increase relevance and data quality. Outliers often need to be removed so they don’t skew analytic results.
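
One way to sketch both kinds of filtering, assuming pandas; the interquartile-range fence used here is just one of several reasonable outlier cutoffs.

```python
import pandas as pd

# Hypothetical sales records, including internal test traffic and one extreme outlier.
sales = pd.DataFrame({
    "channel": ["web", "web", "store", "web", "test", "store", "web", "store", "web"],
    "amount": [120.0, 95.0, 110.0, 105.0, 1.0, 98.0, 130.0, 102.0, 50_000.0],
})

# Filter out records that are not useful downstream (e.g., internal test traffic).
sales = sales[sales["channel"] != "test"]

# Trim outliers with the interquartile-range rule so they don't skew averages.
q1, q3 = sales["amount"].quantile([0.25, 0.75])
fence_low, fence_high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
sales = sales[sales["amount"].between(fence_low, fence_high)]

print(sales)
```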

Transformation

Fields such as dates often arrive in inconsistent formats across source datasets. Data integration tools like DataConnect allow you to set rules that automate field transformations. Private data may need to be masked or obfuscated to maintain compliance.
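
DataConnect expresses such transformations as reusable rules rather than code; purely to illustrate the idea, here is a hand-coded sketch in pandas with hypothetical date and SSN fields.

```python
import pandas as pd

# Hypothetical source records with mixed date formats and private data.
customers = pd.DataFrame({
    "signup_date": ["2023-01-05", "01/07/2023", "March 2, 2023"],
    "ssn": ["123-45-6789", "987-65-4321", "555-12-3456"],
})

# Normalize inconsistent date strings by parsing each value individually.
customers["signup_date"] = customers["signup_date"].apply(
    lambda value: pd.to_datetime(value, errors="coerce")
)

# Mask the private field, keeping only the last four digits for reference.
customers["ssn"] = "***-**-" + customers["ssn"].str[-4:]

print(customers)
```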

Cross Validation

Data integrity leads to trust in data. One way to assess data validity is to compare it to other systems of record that should have the same value. Records that are identical are considered valid; exceptions are flagged and stored separately for data cleansing. Data catalogs can record that the data is valid and score its quality accordingly.
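
A sketch of that record-by-record comparison, assuming pandas and two hypothetical systems of record keyed on customer_id.

```python
import pandas as pd

# Hypothetical extracts from two systems that should agree on each customer's email.
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@example.com", "b@example.org", "c@example.com"],
})

# Compare the systems side by side on the shared key.
merged = crm.merge(billing, on="customer_id", suffixes=("_crm", "_billing"))

valid = merged[merged["email_crm"] == merged["email_billing"]]
exceptions = merged[merged["email_crm"] != merged["email_billing"]]

# Exceptions are set aside for cleansing; valid rows can be scored in the catalog.
print(exceptions)
```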

Who Performs Data Cleansing?

Data cleansing is most commonly handled by data engineers, who are responsible for data pipelines and ETL processing. The Chief Data Officer (CDO) ensures overall data quality and availability for consumers. Analytics users and data scientists consume the cleansed data. Data engineers should also specify changes to source systems so they provide cleaner data in the first place.

Causes of Dirty Data

Below are some scenarios that lead to dirty data:

  • Most applications operate in a silo as they are designed to meet a specific use case defined by a product manager, business analyst, software designer or other users.
  • Poor user interface (UI) design can mean optional fields are left blank, and a lack of radio buttons or drop-down values means free-form typing often fills fields with garbage.
  • Bad data design, such as allowing multiple records with the same primary key value, usually leads to trouble.
  • A lack of referential integrity constraints can corrupt a schema, allowing one-to-many relationships where the design calls for one-to-one relationships.
  • Software maintenance and business evolution often conspire to create bugs that impact data integrity.
  • Lack of validation in data integration processing can lead to bad data leaking into a downstream system. For example, a poorly set up marketing automation system can feed duplicate leads into a sales system designed for a single contact record with a unique email address as the primary key.
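
Staying with that last scenario, here is a minimal sketch of a validation step that could sit in the integration layer, assuming pandas and hypothetical lead and contact data.

```python
import pandas as pd

# Hypothetical incoming marketing leads and the sales system's existing contacts.
incoming_leads = pd.DataFrame({
    "email": ["new@example.com", "known@example.com", "NEW@example.com"],
})
existing_contacts = pd.DataFrame({"email": ["known@example.com"]})

# Normalize, then drop leads that already exist or repeat within the batch.
incoming_leads["email"] = incoming_leads["email"].str.strip().str.lower()
new_leads = incoming_leads.drop_duplicates(subset=["email"])
new_leads = new_leads[~new_leads["email"].isin(existing_contacts["email"].str.lower())]

print(new_leads)  # only the genuinely new lead survives validation
```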

Automating Data Quality

Data integration solutions such as DataConnect allow data engineers to set up data pipelines that perform data quality processing using rules. Fields can be filled with defaults, value ranges can be checked, datatypes can be changed, and transformation functions can perform more sophisticated updates.
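
DataConnect defines such rules through its low-code interface; the hand-coded pandas sketch below is only meant to illustrate the kinds of rules described, with hypothetical fields and limits.

```python
import pandas as pd

def apply_quality_rules(df: pd.DataFrame) -> pd.DataFrame:
    """Apply simple quality rules to a hypothetical orders table."""
    cleaned = df.copy()

    # Rule 1: fill missing fields with defaults.
    cleaned["currency"] = cleaned["currency"].fillna("USD")

    # Rule 2: check value ranges by flagging rows that fall outside them.
    cleaned["quantity_ok"] = cleaned["quantity"].between(1, 1_000)

    # Rule 3: change datatypes so downstream systems receive consistent values.
    cleaned["order_date"] = pd.to_datetime(cleaned["order_date"], errors="coerce")

    # Rule 4: apply a transformation function for more sophisticated updates.
    cleaned["sku"] = cleaned["sku"].str.upper().str.strip()

    return cleaned

orders = pd.DataFrame({
    "currency": [None, "EUR"],
    "quantity": [5, -2],
    "order_date": ["2024-03-01", "2024-03-02"],
    "sku": [" ab-100 ", "cd-200"],
})
print(apply_quality_rules(orders))
```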

Actian Helps With Data Quality

The Actian Data Platform includes a highly scalable hybrid integration solution for unifying, transforming, and orchestrating data pipelines that delivers high-quality data. DataConnect is an intelligent, low-code integration platform that addresses complex use cases with automated, intuitive, and reusable integrations.

Learn more by visiting our page about the Actian Data Platform.