Data Preprocessing

For data to be used effectively by analytics and machine learning applications, it must be preprocessed. Preprocessing data makes it easier to use by applying operations such as removing outliers, filtering, transforming and normalizing data from its source form.


Why is Data Preprocessing Important?

Unrefined source data must be optimized for its intended use before contributing to dependable insights. Basing decisions on data that is not preprocessed will result in poorly informed decisions that are more likely to lead to unintended outcomes. Using unrepresentative samples will skew analytical results. Investments in cutting-edge analytics software are wasted if it is fed garbage data. As the adage goes, “Garbage in, garbage out.”

Data Preprocessing Steps

The general flow for data Preprocessing can be summarized by the following steps:

  1. Data Profiling
  2. Data Cleansing
  3. Data Reduction
  4. Data Transformation
  5. Data Enrichment
  6. Data Validation

Preprocessing Data

Data preprocessing takes place in the early stage of a data pipeline. Preprocessing aims to enable it to accurately answer specific questions using analytics and training machine learning models. Below are some techniques used to Preprocess data.

Profiling Data

Data integration solutions like Actian DataConnect include data profiling functions that will scan a source file to count records, duplicates, and cardinality. Actian DataConnect can perform more advanced profiling operations, including separating distinct values, binning data values into ranges, and performing fuzzy matching for potentially duplicate values. In addition, statistics such as Min, Max, Mean, Median, Mode, Standard Deviation, Sum and Variance can be calculated.

Cleansing Data

Cleansing data increases the consistency of the data by verifying data formats, for example. Actian DataConnect provides the ability to make field data formats consistent in a data file.

Data Reduction

Outlying values can be removed to avoid analysis being unduly skewed or biased by outlying values. Filtering is another form of data reduction which deletes unnecessary data. Raw data often contains duplicate records for various reasons. Duplicate records can be deleted. Records with duplicate key fields and spare data can be intelligently reconciled and merged.

Data Transformation

Data fields need to be uniform to facilitate matching. Data formats can be transformed to have a uniform data type and format.

Data Enrichment

Data files can be enriched from multiple sources or can have new calculated values added. For example, it may only be necessary to group specific field values into ranges, in which case the respective data range can replace the discrete values.

Filling Gaps

Gaps can be filled by drawing from multiple data sources and assigning default values. In many cases, an extrapolated or interpolated value can fill any gaps.


If the result of an analytic process is time-critical, data can be pre-partitioned to accelerate processing time. Partitioning can be based on a key value and value ranges or a hash to distribute evenly across partitions. Partitioning massively accelerates processing times for large datasets by making parallel processing more efficient. Range scan queries can also be accelerated by making it easy to skip partitions with values that don’t match the range criteria.

Transforming Data

Data integration tools such as Actian DataConnect can be used to change data formats to improve matching, remove leading or trailing spaces, and add leading zeros. Regulated data can be masked or obfuscated to protect customer privacy.

Data Validation

Data can be validated by comparing existing values against multiple sources.

Automating Data Preprocessing

A data pipeline process combined with a data integration solution can orchestrate data preprocessing steps. Pre-programmed steps can be executed based on a schedule.

The Benefits of Data Preprocessing

The benefits of data preprocessing include:

  • Investing in data preprocessing automated pipelines makes a business more agile and competitive because they are always ready to analyze and adapt to changing customer needs and market dynamics.
  • Avoid delays in data analysis by having data proactively preprocessed.
  • Improved data quality.
  • Automation of data preprocessing using reusable building blocks makes data engineers more productive.

Actian and Data Preprocessing

The Actian Data Platform makes it easy to automate data preprocessing thanks to its built-in data integration capabilities. Businesses can increase the proportion of high-quality, analysis-ready data assets. Organizations cannot fully exploit their available data without the ability to unify, transform, and orchestrate data pipelines easily. Actian DataConnect provides an intelligent, low-code integration platform to address complex use cases with automated, intuitive, and reusable integrations. Actian DataConnect includes a graphical studio for visually designing data flows, mapping data fields and data transformations. Data pipelines can be centrally managed for scalability and reduced administration costs.

The Actian Vector Database makes it easier to perform high-speed data analysis due to its columnar storage capability that minimizes the need for pre-existing data indexes. Actian Vector processing speeds queries by exploiting multiple CPU caches from a single instruction.

The Actian Data Platform runs on-premise and multiple cloud platforms, including AWS, Azure and Google Cloud, so analytics can run wherever the data resides.

A 30-day free trial with a resource credit makes trying the Actian Cloud Data Platform easy.


What is SQL Server?

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec nec pellentesque elit. Suspendisse et eleifend eros.


What is SQL Server?

The original SQL Server code was developed in the 1980s by the former Sybase Inc., which is now owned by SAP. In 1988 Microsoft created SQL Server for OS/2 as a joint effort between Sybase, Microsoft, and Ashton- Tate. The partnership ended in 1990, and Microsoft retained the name SQL Server. Today SQL Server is available on 64-bit Windows, Linux and the Azure Cloud platform. SQL Server is an RDBMS. Microsoft and Sybase released version 1.0 in 1989.

Ashton-Tate stepped away after that, but Microsoft and Sybase continued their partnership until 1994 when Microsoft took over all development and marketing of SQL Server for its own operating systems. The year before, with the Sybase relationship starting to unravel, Microsoft had also made the software available on the newly released Windows NT after modifying the 16-bit OS/2 code base to create a 32-bit implementation with added features; it focused on the Windows code going forward. In 1996, Sybase renamed its version Adaptive Server Enterprise, leaving the SQL Server name to Microsoft.

SQL Server Management Studio provides a graphical administration user interface. Transact-SQL is a procedural language stored and executed within a SQL Server database management system.

SQL Server Editions

Microsoft offers SQL Server in four primary editions that provide different levels of the bundled services. The full-featured Developer edition is free is the Express edition that can be used to run small databases with up to 10 GB of disk storage capacity. The developer edition is not licensed for production use. Larger applications that require production-level support is licensed as Enterprise edition. The Standard edition has a reduced feature set and restricted scalability by limiting the number of CPU cores it can use and memory size. Due to increased competition, in late 2016, Microsoft made Enterprise features available to Standard Edition. These included In-Memory OLTP, PolyBase, columnstore indexes, partitioning, data compression and change data capture capabilities.

What is The Difference Between SQL and SQL Server?

The term SQL (Structured Query Language) normally refers to the standard language used to interact with an RDBMS. SQL Server is the Microsoft name for its own RDBMS product. However, it is not uncommon for people to abbreviate SQL Server to SQL.

What is The Difference Between MySQL and SQL Server?

MySQL is the world’s most popular and widely used open-source relational database. It was originally founded and developed in Sweden by two Swedes and a Finn: David Axmark, Allan Larsson and Michael “Monty” Widenius, who had worked together since the 1980s.

Microsoft SQL Server is an RDBMS product that is often abbreviated to SQL Server. MySQL and SQL Server are direct competitors. MySQL is owned by Oracle and uses a different code base than traditional Oracle Databases.

Is SQL Server a Database?

Microsoft SQL Server is a relational database management system. As a database server that stores and retrieves data as requested by other software applications on the same computer or a remote computer using the client-server model. Microsoft provides APIs to access SQL Server over the internet as a web service. An RDBMS does a lot more than get data for client applications. Internal functions such as buffer management ensure the most accessed data is available in the fastest form of available storage to speed access.

SQL Server is based on the relational model and also enforces referential integrity between objects to maintain data consistency. As with other relational databases, the principles of atomicity, consistency, transaction isolation and durability, collectively known as the ACID properties, are implemented to maintain integrity.

Logging and Replication

Every change to data must be protected from computer failures by logging transactions.

Disk failure must be protected against by performing redundant writes. The RDBMS can be configured to protect from server failures by using failover servers.

SQL Server Replication Services are used to replicate database objects, either in their entirety or a subset on other database servers across the network. Replication Services uses a publisher/subscriber model to push changes to subscribing SQL Server instances.

Backup and recovery operations protect the database from physical failures. Replication can help to protect a database from logical failures, such as accidental transactions. SQL Server uses transaction replication to maintain multiple replicates of a master database.

Merge replication is useful for creating a centralized database for reporting. In this case, multiple distributed databases periodically synchronize bi-directionally between the distributed publisher and central subscriber. The bi-directional nature of this relationship can result in conflicts that are usually handled using policies like giving the last change priority or are handled manually.

Snapshot replication publishes a snapshot of the data to replicates. This might be useful for batch reporting at night of an operational database. It can be more efficient than continuous replication because it does not have to be further synchronized. It would typically be overwritten the following night.

Who uses SQL Server?

Applications use SQL Server using many interfaces. The ODBC interface provides a high-level SQL interface that enables users to embed database calls into applications such as Microsoft Excel. Java applications use the JDBC driver to allow them to access databases using SQL. Application developers use Application Programming Interfaces (API) to embed SQL statements into their applications which can be written in C, Java, and Python, for example. Database rows can be fetched one at a time or in batches or arrays.

Microsoft Visual Studio includes native support for Microsoft SQL Server. Visual Studio includes a data designer to create, view or edit database schemas graphically. Queries can also be created visually.

Business Intelligence tools such as Microsoft Power BI, Qlik, Tableau and MicroStrategy. Vendors can create their own native database interfaces using Microsoft’s APIs or common drivers such as ODBC to connect to SQL Server and other databases.

Databases are useful to any application that needs to efficiently access data. Many business applications, such as Microsoft, are built on SQL Server. SAP, for example, uses a ‘lowest common denominator approach to give their customers a choice of databases to use with their business applications. By avoiding vendor-specific SQL extensions, application providers can certify multiple databases giving customers choices. An application for a small business with a moderate transaction rate can use SQL Server, while more demanding users might prefer to use Oracle.

Microsoft Management products such as Endpoint Manager depend on SQL Server to manage client system configurations.

Why use Actian Zen Over MySQL or SQL Server?

Actian Zen is a zero-maintenance, embedded, high-performance Database Management System (DBMS) that is ideal for systems that require higher speeds than MySQL and SQL Server can provide on edge systems. Zen provides both a SQL and No-SQL or Key-index based API to give developers a choice of access methods for IoT applications, which is particularly important in memory-constrained systems.

Applications often outgrow simple flat-file systems as transaction loads increase or maintenance becomes burdensome due to the number of systems to be managed. A DBMS that is designed with low administration, such as Actian Zen, can significantly impact the total cost of ownership (TCO) when compared to MySQL and SQL Server, which were not designed for embedded use cases from the outset.

Why Actian X?

Actian X started as the Ingres RDBMS, based on research at California’s Berkeley, which still uses it for internal applications even today. Businesses such as HSS Hire in the UK and the Irish Revenue Commissioners selected it for its transactional workload features. Recently, the transaction-centric database engine has been augmented with a columnar storage-based engine optimized for decision support and business intelligence applications. This hybrid transactional and business intelligence capability makes Actian X suitable for a broader array of mixed applications. HSS Hire uses the row-storage database for its order entry application and the columnar database for decision support and reporting functions as that engine is self-indexing and takes advantage of vector processing capabilities at the CPU chip level to achieve industry-leading query speeds.

Avalanche Cloud Data Platform

As Microsoft customers move their database workloads to the cloud, they soon realize that Azure Synapse is a different database technology than SQL Server on-premises. The Avalanche Cloud Data Platform has the distinct advantage of using the same high-performance analytics database on-prem and in the cloud. The  Avalanche Cloud Data Platform is also available on AWS and Google Cloud, and hybrid architectures allowing customers to put their analytics database processing closer to the data source. Moving data out of cloud platforms can incur significant egress fees, so it makes sense to process data in whatever cloud platform and region that holds its source data.

The Avalanche Cloud Data Platform has a major advantage for Hadoop data lakes because it can operate on native Hadoop data formats such as ORC and Parquet directly. You can create a reference to the external file system in Actian and access it with standard SQL. You can even push predicates in the SQL WHERE clause down to the file system to take advantage of available indexes.

Actian achieved compatibility between its Ingres and Avalanche databases when it acquired columnar database technology. Actian added the SQL parser and enterprise features that have been proven with Ingres to Avalanche. Microsoft’s SMP and MPP databases cannot claim this level of compatibility. To learn more, visit our website.

facebooklinkedinrsstwitterBlogAsset 1PRDatasheetDatasheetAsset 1DownloadForumGuideLinkWebinarPRPresentationRoad MapVideo