Abraham Lincoln might easily have been discussing data preparation steps for analytics when he said, “If I had 8 hours to chop down a tree, I would spend 6 sharpening my axe.” Spending 75% of the allotted time on preparation may seem like a lot. But in fact, most industry observers report that data preparation steps for business analysis or machine learning consume 70 to 80% of the time spent by data scientists and analysts.
Data Preparation Steps in Detail
The data preparation pipeline consists of the following steps:
- Access the data.
- Ingest (or fetch) the data.
- Cleanse the data.
- Format the data.
- Combine the data.
- Finally, analyze the data.
There are many sources of business data within any organization. Examples include endpoint data, customer data, marketing data, and all their associated repositories. This first essential data preparation step involves identifying the necessary data and its repositories. This is not simply a matter of identifying all possible data sources and repositories, but of identifying those applicable to the desired analysis. That means there must first be a plan that includes the specific questions to be answered by the data analysis.
Once the data is identified, it needs to be brought into the analysis tools. The data will likely be some combination of structured and semi-structured data in different types of repositories. Importing it all into a common repository is necessary for the subsequent steps in the pipeline. Access and ingest tend to be manual processes with significant variations in exactly what needs to be done. Both data preparation steps require a combination of business and IT expertise and are therefore best done by a small team. This step is also the first opportunity for data validation.
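As a minimal sketch of the ingest step, assuming pandas and two hypothetical sources (a structured CSV export and a semi-structured JSON feed), both can be pulled into one common frame, with a first validation check at the end:

```python
import io
import pandas as pd

# Hypothetical sources: a CSV export and a semi-structured JSON feed.
csv_source = io.StringIO("customer_id,region\n1,EMEA\n2,APAC\n")
json_source = io.StringIO('[{"customer_id": 3, "region": "AMER"}]')

# Ingest both into one common DataFrame for the later pipeline steps.
frames = [pd.read_csv(csv_source), pd.read_json(json_source)]
combined = pd.concat(frames, ignore_index=True)

# First validation opportunity: check row counts and required columns.
assert len(combined) == 3
assert {"customer_id", "region"} <= set(combined.columns)
```

In practice the sources would be files, APIs, or database extracts rather than in-memory strings, but the pattern of loading each source and concatenating into a common repository is the same.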
Cleansing the data ensures that the data set can provide valid answers when the data is analyzed. This step could be done manually for small data sets but requires automation for most realistically sized data sets. There are software tools available for this processing. If custom processing is needed, many data engineers rely on applications coded in Python. There are many different problems possible with the ingested data. There could be missing values, out-of-range values, nulls, and whitespaces that obfuscate values, as well as outlier values that could skew analysis results. Outliers are particularly challenging when they are the result of combining two or more variables in the data set. Data engineers need to plan carefully for how they are going to cleanse their data.
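A small pandas sketch of the cleansing problems named above, using a hypothetical data set with a missing value, stray whitespace, a null, and an out-of-range value (the valid range of 0 to 1,000 is an assumption for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount": [100.0, 105.0, np.nan, 98.0, 9_999.0],   # NaN and an out-of-range value
    "status": [" open", "closed ", "open", None, "open"],  # stray whitespace, null
})

# Strip whitespace that obfuscates values, then drop rows with nulls.
df["status"] = df["status"].str.strip()
df = df.dropna()

# Remove out-of-range values (assuming valid amounts fall between 0 and 1,000;
# the right rule depends on the data and should be planned in advance).
df = df[df["amount"].between(0, 1_000)]
```

After these steps only the two fully valid rows remain. Real cleansing recipes are longer, but each rule should be explicit and scripted so it can be reviewed and rerun.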
Once the data set has been cleansed, it needs to be formatted. This step includes resolving issues like multiple date formats in the data or inconsistent abbreviations. It is also possible that some data variables are not needed for the analysis and should therefore be deleted from the analysis data set. This is another data preparation step that will benefit from automation. Cleansing and formatting steps should be saved into a repeatable recipe that data scientists or engineers can apply to similar data sets in the future. For example, a monthly analysis of sales and support data would likely have the same sources that need the same cleansing and formatting steps each month.
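Such a repeatable recipe can be captured as a function. This sketch assumes pandas 2.x (for `format="mixed"` date parsing) and hypothetical column names from a monthly sales extract:

```python
import pandas as pd

def format_recipe(df):
    """Repeatable formatting recipe for a hypothetical monthly sales extract."""
    df = df.copy()
    # Resolve multiple date formats into one canonical form.
    df["order_date"] = (
        pd.to_datetime(df["order_date"], format="mixed").dt.strftime("%Y-%m-%d")
    )
    # Normalize inconsistent abbreviations.
    df["region"] = df["region"].replace({"Cal.": "CA", "Calif": "CA"})
    # Drop variables the analysis does not need.
    return df.drop(columns=["internal_notes"])

df = pd.DataFrame({
    "order_date": ["2024-01-05", "01/06/2024"],  # two different date formats
    "region": ["Cal.", "CA"],                    # inconsistent abbreviations
    "internal_notes": ["x", "y"],                # not needed for analysis
})
clean = format_recipe(df)
```

Because the recipe is a plain function, next month's extract can be run through the exact same steps with no manual rework.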
When the data set has been cleansed and formatted, it may be transformed by merging, splitting, or joining the input sets. Once the combining step is complete, the data is ready to be moved to the data warehouse staging area. Once data is loaded into the staging area, there is a second opportunity for validation.
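The combining step and the second validation opportunity can be sketched as a join of two prepared input sets, here hypothetical sales and support frames keyed by `customer_id`:

```python
import pandas as pd

sales = pd.DataFrame({"customer_id": [1, 2], "amount": [250.0, 125.0]})
support = pd.DataFrame({"customer_id": [1, 2], "open_tickets": [0, 3]})

# Join the prepared input sets on a shared key before loading to staging.
staged = sales.merge(support, on="customer_id", how="inner")

# Second validation opportunity: no rows lost, no duplicated keys.
assert len(staged) == len(sales)
assert staged["customer_id"].is_unique
```

The same pattern covers splitting (selecting column subsets into separate frames) and merging (concatenating frames with the same schema).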
Once the analysis has begun, changes to the data set should only be made with careful consideration. During analysis, algorithms are often adjusted and compared to other results. Changes to the data can skew analysis results and make it impossible to determine whether the different results are caused by changes to the data or the algorithms.
Data Preparation Principles and Best Practices
Many of the principles of functional programming can be applied to data preparation. It is not necessary to use a functional programming language to automate data preparation, but such languages are often used to do so.
- Understand the data consumer – who is going to use the data and what questions do they need answered.
- Understand the data – where it is coming from and how it was generated.
- Save the raw data. If the data engineer has the raw data, then all the data transformations can be recreated. Additionally, don’t move or delete the raw data once it is saved.
- If possible, store all the data, raw and processed. Of course, privacy regulations like the European Union (EU)’s General Data Protection Regulation (GDPR) will influence what data can be saved and for how long.
- Ensure that transforms are reproducible, deterministic, and idempotent. Each transform must produce the same results each time it is executed given the same input data set, without harmful side effects.
- Future proof your data pipeline. Version not only the data and the code that performs the analysis, but also the transforms that have been applied to the data.
- Ensure that there is adequate separation between the online system and the offline analysis so that the ingest step does not impact user-facing services.
- Monitor the data pipeline for consistency across data sets.
- Employ data governance early, and be proactive. IT's need for security and compliance means governance capabilities like data masking, retention, lineage, and role-based permissions are all important aspects of the pipeline.
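The reproducibility principle above can be illustrated with a small sketch. The transform below (a hypothetical region normalizer) is deterministic and idempotent: the same input always produces the same output, and applying it to its own output changes nothing.

```python
import pandas as pd

def normalize_region(df):
    # Deterministic: same input always yields the same output.
    # Idempotent: applying it to its own output changes nothing.
    df = df.copy()  # no side effects on the caller's frame
    df["region"] = df["region"].str.strip().str.upper()
    return df

df = pd.DataFrame({"region": [" emea", "Apac "]})
once = normalize_region(df)
twice = normalize_region(once)
assert once.equals(twice)  # idempotent: a second pass is a no-op
```

Transforms with these properties can be safely re-run after a pipeline failure, which is why they are worth versioning alongside the data and the analysis code.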
Know your data, know your customers’ needs, and set up a reproducible process for constructing your data preparation pipeline.
Making Data Integration Easier
Actian DataConnect is a versatile hybrid integration solution. It allows you to connect to virtually any data source, regardless of format, location, or protocol, and empowers business users, integration specialists, SaaS admins, and line-of-business owners. Users can design and manage integrations and move data quickly while IT maintains corporate governance. Find out how Actian can help with all your data integration, data management, and data storage needs here.