Abraham Lincoln is often quoted as saying, “If I had 8 hours to chop down a tree, I would spend 6 sharpening my axe.” Spending 75% of the allotted time on preparation may seem like a lot, but most industry observers report that preparing data for business analysis or machine learning does in fact consume 70 to 80% of the time spent by data scientists and analysts.
The data preparation pipeline consists of the following steps:
- Access the data.
- Ingest (or Fetch) the data.
- Cleanse the data.
- Format the data.
- Combine the data.
- Analyze the data.
Access: There are many sources of business data within any organization: endpoint data, customer data, marketing data, and their associated repositories, for example. This first step involves identifying the necessary data and its repositories. The goal is not to identify every possible data source, but every applicable one, which presupposes a plan for the questions the data analysis needs to answer.
Ingest: Once the data is identified, it needs to be brought into the analysis tools. The data will likely be some combination of structured and semi-structured data in different types of repositories. Importing it all into a common repository is necessary for the subsequent steps in the pipeline. Access and ingest tend to be manual processes with significant variations in exactly what needs to be done. These steps require a combination of business and IT expertise and are therefore best done by a small team. This is the first opportunity for data validation.
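A minimal sketch of the ingest step using only the Python standard library. The CSV and JSON sources here are hypothetical in-memory stand-ins for real repositories; the common repository is simply a list of uniform dicts.

```python
import csv
import io
import json

# Hypothetical sources: a CSV export of customer records (structured)
# and a JSON feed (semi-structured). Field names are illustrative.
CSV_SOURCE = "customer_id,region\n101,EMEA\n102,APAC\n"
JSON_SOURCE = '[{"customer_id": 103, "region": "AMER"}]'

def ingest(csv_text, json_text):
    """Pull both sources into one common in-memory repository."""
    records = list(csv.DictReader(io.StringIO(csv_text)))
    # Normalize types so both sources match (CSV fields arrive as strings).
    for row in records:
        row["customer_id"] = int(row["customer_id"])
    records.extend(json.loads(json_text))
    return records

rows = ingest(CSV_SOURCE, JSON_SOURCE)
# First opportunity for validation: every record has the expected fields.
assert all({"customer_id", "region"} <= set(r) for r in rows)
```

In practice the sources would be files, databases, or APIs rather than strings, but the pattern is the same: normalize each source into a shared schema at the point of ingestion.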
Cleanse: Cleansing ensures that the data set is capable of providing valid answers when the data is analyzed. This step could be done manually for small data sets but requires automation for most realistically sized ones. Software tools are available for this processing; when custom processing is needed, many data engineers rely on applications coded in Python. Many different problems are possible with the ingested data: missing values, out-of-range values, nulls, and whitespace that obfuscates values, as well as outlier values that could skew analysis results. Outliers are particularly challenging when they are the result of combining two or more variables in the data set. Data engineers need to plan carefully for how they are going to cleanse their data.
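A small cleansing sketch in Python with pandas, assuming pandas is available. The data, column names, and valid range are all illustrative; it handles three of the defects mentioned above: obfuscating whitespace, nulls, and out-of-range or outlier values.

```python
import pandas as pd

# Hypothetical raw data exhibiting common defects: leading/trailing
# whitespace, a null, an out-of-range value, and an extreme outlier.
raw = pd.DataFrame({
    "region": [" EMEA", "APAC ", None, "AMER"],
    "units":  [12, -3, 8, 10_000],  # -3 is out of range; 10_000 is an outlier
})

def cleanse(df, low=0, high=1_000):
    """Return a copy of df with defective rows repaired or removed."""
    df = df.copy()
    # Strip whitespace that obfuscates values, then drop rows with nulls.
    df["region"] = df["region"].str.strip()
    df = df.dropna(subset=["region"])
    # Keep only values inside the assumed valid window [low, high].
    return df[df["units"].between(low, high)]

clean = cleanse(raw)
```

Real cleansing logic is usually more nuanced (imputing rather than dropping, statistical outlier tests), but encapsulating it in one function makes the step automatable and repeatable.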
Format: Once the data set has been cleansed, it needs to be formatted. This step includes resolving issues like multiple date formats in the data or inconsistent abbreviations. Some data variables may not be needed for the analysis and should, therefore, be deleted from the analysis data set. Again, this is a step that benefits from automation. The combined cleansing and formatting steps should be saved into a recipe, which allows the data engineer to apply it to a similar data set in the future. For example, an analysis of monthly sales and support data would likely have the same sources that need the same cleansing and formatting steps each month.
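The formatting step can be sketched the same way, again assuming pandas; the column names, date strings, and abbreviation mapping are hypothetical. Capturing the step as a function is what makes it a reusable recipe.

```python
import pandas as pd

# Hypothetical monthly extract with mixed date formats, inconsistent
# abbreviations, and a variable the analysis does not need.
df = pd.DataFrame({
    "date":     ["2024-01-05", "01/06/2024"],
    "country":  ["USA", "U.S.A."],
    "debug_id": [1, 2],
})

def format_step(df):
    """The formatting half of a saved recipe."""
    df = df.copy()
    # Resolve multiple date formats by parsing each value independently.
    df["date"] = df["date"].apply(pd.to_datetime)
    # Normalize inconsistent abbreviations to one canonical form.
    df["country"] = df["country"].replace({"U.S.A.": "USA"})
    # Drop variables not needed for the analysis.
    return df.drop(columns=["debug_id"])

monthly = format_step(df)
```

Next month's extract can be run through the same function unchanged, which is exactly the recipe reuse described above.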
Combine: When the data set has been cleansed and formatted, it may be transformed by merging, splitting, or joining the input sets. When the combining step is complete, the data is ready to be moved to the data warehouse staging area, where there is a second opportunity for validation.
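A brief joining sketch with pandas, using hypothetical sales and support inputs keyed by customer. An outer join keeps records that appear in either set; the resulting missing values are exactly the kind of thing the second validation pass in the staging area should catch.

```python
import pandas as pd

# Hypothetical cleansed-and-formatted input sets, keyed by customer.
sales   = pd.DataFrame({"customer_id": [101, 102], "units":   [12, 7]})
support = pd.DataFrame({"customer_id": [101, 103], "tickets": [2, 5]})

# Outer join: keep customers present in either input set. Customers
# missing from one side surface as NaN, flagging rows to re-validate.
combined = sales.merge(support, on="customer_id", how="outer")
```

Splitting (e.g. partitioning one set by region) and merging follow the same pattern: a small, explicit transformation from validated inputs to a staged output.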
Analyze: Once the analysis has begun, changes to the data set should only be made with careful consideration. During analysis, algorithms are often adjusted and their results compared. If the data also changes, the results are skewed, and it is not possible to determine whether the different results stem from the changes in the data or in the algorithms.
Data Preparation Principles and Best Practices
Many of the principles of functional programming can be applied to data preparation. A functional programming language is not necessary to automate data preparation, but such languages are often used for it.
- Understand the data consumer – who is going to use the data and what questions they want answered.
- Understand the data – where it is coming from and how it was generated.
- Save the raw data. If the data engineer has the raw data, then all of the data transformations can be recreated. Additionally, don’t move or delete the raw data once it is saved.
- If possible, store all of the data, raw and processed. Of course, privacy regulations like GDPR will influence the length of time the data can be saved.
- Transforms need to be reproducible – they need to be deterministic and idempotent. That is, they will produce the same results each and every time they are executed given the same input data set, and they need to do that without harmful side effects.
- Future proof your data pipeline by versioning not only the data and the code that performs the analysis, but also the transformations that have been applied to the data.
- Ensure that there is adequate separation between the online system and the offline analysis so that the ingest step does not impact user-facing services.
- Monitor the data pipeline for consistency between data sets.
- Employ data governance early – be proactive. Incorporating governance capabilities like data masking, retention, lineage, and role-based permissions addresses IT’s need for security and compliance; all are important aspects of the pipeline.
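The reproducibility principle above can be sketched in a few lines of plain Python. The function and field names are illustrative; the point is that a pure, side-effect-free transform is deterministic and idempotent by construction.

```python
# A deterministic, idempotent transform: a pure function that builds new
# records instead of mutating its input or touching external state.
def normalize_regions(rows):
    """Trim and uppercase each record's region; never modify rows in place."""
    return [{**row, "region": row["region"].strip().upper()} for row in rows]

data = [{"region": " emea "}, {"region": "APAC"}]
once = normalize_regions(data)
twice = normalize_regions(once)

assert once == twice                   # idempotent: re-running is safe
assert data[0]["region"] == " emea "   # no harmful effect on the raw input
```

Because the raw input is left untouched, this also supports the "save the raw data" principle: every transformation can be recreated at any time.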
Know your data, know your customers’ needs, and set up a reproducible process for constructing your data preparation pipeline.
Making Data Integration Easier
Actian DataConnect is a versatile hybrid integration solution that allows you to connect to virtually any data source, format, or location, using any protocol. It empowers business users, integration specialists, SaaS administrators, and line-of-business owners to design and manage integrations and move data quickly, while IT maintains corporate governance. Find out how Actian can help with all your data integration, data management, and data storage needs here.