Data health lifecycle: data cleaning & transformation strategies



Listen to this blog
Disclaimer

A healthy system with clean data is crucial for organizations aiming to have a healthy bottom line. Unhealthy data leads to inferior decision-making, squandered opportunities, and higher costs. Maintaining quality data requires diligently cleaning data to keep it healthy.

Research from G. Loabovitz and Y. Chang led to the 1-10-100 rule of data quality. This rule highlights the increasing cost of data errors due to poor data quality: if it costs $1 to verify a record, it costs $10 per record to fix errors once created and $100 per record if errors go unaddressed. This research states that the more proactive you are in ensuring the health of your data, the more data quality costs are avoided.

A complete data health strategy that considers data health from database structure to applications, leads to higher-quality data. Data and systems need to be cleaned and assessed at three levels.

Database Structural Level

database and data tables need to be organized effectively and operate correctly to avoid errors. Effective database repair processes fix data quality problems at the source.

Data at Rest

data within your database must undergo a regular data cleansing process that looks for errors, inconsistencies, duplications, and missing entries.

Data Transformation

new data errors are created as data is converted and moved. Organizations need strategies to ensure data health through different transformations, including ETL (Extract, Transform, Load) and data wrangling processes.

Database Repair

Ensuring that your databases and their structure are in good health can significantly impact your data health strategy. By focusing on the data source, repetitive downstream data quality issues are reduced. The first steps to keeping your data healthy are upgrading database technology and repairing poorly structured databases.

Evaluating how databases are organized and structured, reduces the occurrence of unhealthy data. Consolidating data formats into a common structure can decrease conflicting and duplicate data. Creating a standard phone number format is one example. Instead of maintaining multiple formats with dashes, dots, area codes, and country codes, organizations can define a number-only text string format to be used in all systems. This tactic reduces the number of data types, lowering the chances of confusion.

Changing schema to improve performance is another way to organize databases to promote data health. Examples include consolidating primary keys to increase consistency. Fixing and standardizing source tables can reduce the effort required to standardize data each time it is pulled for analysis. While adjusting the schema to be more uniform can limit errors in the future, it can cause near-term problems as downstream apps need to be adjusted to incorporate the schema changes. A federation layer that functions as a gateway to your database can help detect schema changes to ensure downstream apps are updated appropriately.

Essential Data Cleaning Practices

Unlike database repair, data cleaning focuses on the data itself. Database cleansing is a process that requires consistent and regular attention. In a hospital, contaminants are constantly finding their way into the facility, so regular sterilization is necessary to keep germs from making people sick. Your databases are no different.

Regular cleaning processes should focus on:

Eliminating Duplicates
Eliminating Duplicates

Duplicate data wastes space in a database and creates confusion, hindering accurate information. With the amount of data collected and stored across data silos, multiple versions of the same record often occur. Data de-duplication is a vital data cleansing exercise, as double counting the same data points skews your analysis.

Repairing Incorrect Data
Repairing Incorrect Data

Errors such as misspellings, capitalization errors, and naming inconsistencies can confuse applications and analytics relying on standardized formats. Finding and fixing these errors helps keep data clean.

Handling Missing Data
Handling Missing Data

Missing data can distort the analysis. Addressing missing data issues may require adding data with reasonable assumptions or eliminating an entire column or data set if missing data is too excessive.

Removing Irrelevant or Old Data
Removing Irrelevant or Old Data

old or irrelevant data takes up storage space, creating confusion. Old data can also conflict with newer data or be mistaken for fresh data, tainting analysis.

Filtering Outlying Data

Outliers in your data set can signal an error, but not always. Identifying outliers, determining their cause, and taking appropriate action reduces the number of erroneous data points in your data sets.

Data validation can help avoid some of these errors during data input. Building validation rules and incorporating them into your data stack can support real-time testing to identify errors early.

Data Prep and Transformation

Processes get complicated when data is transferred from one database to another, transformed, or merged with a new data set. Ensuring that your transformation processes don’t introduce or perpetuate bad data is essential for maintaining healthy data. When data originates from separate databases, the chances that data gets duplicated are high. Different labeling between databases can also lead to errors.

Data is transformed in many ways, but typically, it is either transformed and prepared for analysis or standardized to be loaded into a database for applications or future analysis.

Extract, Transform, and Load (ETL) processes are the most common transformation approach for moving data from one database to another. ETL process extract data from one database, transform it, and then load it into the target database. The structure of the target database typically dictates these transformations, which are well-defined and systematic. They scale well and are appropriate for large data sets.

Data wrangling processes are different and are much more agile. These processes are key to prepping data sets for analysis. Data wrangling technologies accommodate both structured and unstructured data.

Understanding these differences is important to support healthier data.

ETL

ETL processes typically start with well-organized and structured data. This structure enables more automated and systematic transformation processes. These transformation programs are built by technically oriented developers using powerful technologies like Python and Scala. While greater automation is great for efficiency, it can also perpetuate bad data and errors much more quickly. These automations are also challenging to create, so they are hard to change, making the ETL process much less agile. Getting it right the first time is key to keeping data healthy.

The first step in the ETL process is to extract data from the source database. Errors can occur when the source data is not clean, so it is important to do some data profiling to understand the quality of the data. If data quality is not great to start, plagued by missing values, duplicates, and values that are out of range, you will have to go back to do some data cleansing.

The transformation process reformats data so it can be ingested into target databases. This includes normalizing data so that formats are consistent between the source and target database. Errors can occur if data is mislabeled or data structures are not aligned, making data cleaning an important part of the transformation process. Routines for identifying dirty data and rectifying errors can be built into the ETL program. Filtering data removes unwanted data, reducing complexity. Finally, validation checks are run to check for inconsistencies or to identify outliers.

The final step in the process is loading data into the target database. If errors occur at this stage, it is essential to have a versioning process in place so you can fall back to an earlier version of your code to troubleshoot errors.

Data Wrangling

Like ETL, data wrangling is a transformation process where data errors can occur. Data wrangling is the process of creating data assets to be used in a defined analytical use case. The requirements of the analytical use case drive transformations. This requirement necessitates more agility and the ability to incorporate unstructured data.

With data wrangling, a more manual and dynamic process is required. To ensure analysis is based on the highest quality data, teams must discover, organize, clean, enrich, and validate the data.

With data wrangling, the first step to quality analysis is understanding the health and applicability of the underlying data sets. This means understanding what is available, what will fit the requirement, and assessing the quality of the data sets. Data quality metrics such as distribution tests and validity range tests provide insights into data health. Understanding existing issues, the size of the data set, and data formats are factors that need to be considered before using data to support analysis. Unlike ETL processes, data wrangling is not constraint by prebuild automations. This difference means analysts have the flexibility to search for the best data sets, avoiding less sanitized data. Robust data catalogs that are accessible and maintain and track this data are beneficial in measuring data health.

Organizing data sets to be used in analysis is another critical process. When working with unstructured data, this process becomes much more difficult. Unstructured data needs to be organized into rows and columns to be analyzed. Tagging techniques can be used to create more structure around this data and provide more information about unstructured documents so they can be more efficiently organized. This process is emerging as a viable use case for GenAI. These technologies can analyze documents to understand their meaning and mark them accordingly. Healthy data labeling reduces the chances that GenAI will make mistakes.

Just as in ELT, data cleansing processes are essential in data wrangling. Cleaning strategies can identify outliers and prep data for analysis by eliminating noise in the data and ensuring that datasets are relevant. Fixing spelling and standardizing abbreviations is much more complex when working with unstructured text. AI can be leveraged to fix these.

The data normalization process in data wrangling differs slightly from ETL because standards are driven by the business use case instead of ridged database structure and schema. Robust metadata management engines can improve this process. With integrated GenAI and business glossaries, business users can normalize and combine data based on business definitions and calculations.

Data enrichment is another technique that involves combining data sets or prebuilt data products to provide greater context for deeper and cleaner analysis. Filling in missing data with synthetic data is another strategy to bolster the quality of data sets.

Like ETL processes, data validation tests are vital to ensure data cleanliness after data wrangling processes are executed.

Data Build Tools

Having the right tools and processes to build data transformations is the best way to adhere to best practices and reduce the potential for errors in your data transformations. Like DevOps tools in the software space, Data Build Tools support data transformation processes with development workbenches and quality controls. Features include version controls, testing, and logging.

Data transformations are constantly evolving. To ensure quality, they need to be tested every time the code is updated. Creating or adjusting queries can lead to a multitude of issues including, bad data joins that introduce errors downstream in the pipeline. Having a process for creating transformations is key to avoiding errors.

Tools like dbt (data build tool) enable heathy transformation workflows. The tool enables analysts to:

  • Build complex SQL queries and transformations using business logic.
  • Test transformations to proactively predict how changes will affect dependencies downstream.
  • Execute the code.
  • Create documentation.

After all the work that you have done cleaning data, reducing the chance that you will introduce bad data with effective tools and workflows just makes sense.

Integrating these tools with your data stack makes them easier to use and streamline your data transformation process. Templates and pre-tested transformations can be easily accessed and implemented. This capability not only reduces potential errors but also lowers the effort to create data products and processes.

Combining tools, workflows, and data creates a framework to support healthy data. Through a dpt adapter, users of the Data to AI platform, Avrio, can write and test complex SQL transformations that utilize Avrio’s federated query engine, supporting your data quality requirements.

Understanding how to clean data and keep it clean throughout its journey from collection to analysis is extremely important for building trust in data. The increasing capabilities of GenAI to prep unstructured data for analysis are creating new opportunities for greater insights, but new complexities could lead to dirty data. Having a complex data health strategy across structured and unstructured data from the source system to analysis will help ensure that clean and relevant data is put into decision-makers hands.

Discover the Latest in Data and AI Innovation

  • E-book

    Unstructured data with the modern data stack

    Read More

  • Blog

    Building a reliable data quality strategy in the age of AI

    Read More

  • Blog

    AWS re:Invent recap

    Read More

Request a Demo TODAY!

Take the leap from data to AI