Chapter 1: The data factory: How data quality degrades and why it matters
February 10, 2025
Welcome to “Use AI to modernize your data quality strategy,” a series spotlighting insights from our O’Reilly book, Automating Data Quality Monitoring. This first post corresponds to Chapter 1: The Data Quality Imperative.
Almost all data begins as high quality. But too often, that’s not how it ends up. While all the tools and techniques at our disposal to aggregate, transform, and analyze data have made it more useful and valuable than ever, they’ve also introduced copious opportunities for things to go wrong.
In fact, the vast majority of data quality issues are never caught. Many are minor and imperceptible, but some really matter. Data-driven initiatives implemented with the best of intentions have cost lives, jobs, and large sums of money because of errors that weren’t anticipated and were discovered too late. And it’s likely to get worse: the more businesses lean on AI, the greater the risk of data errors causing real-world problems.
Given the complexity and volume of today’s data systems, it’s impossible to anticipate and test for every problem. However, you can use a monitor that learns from the historical patterns in the data, adapts over time, and effortlessly scales across your entire data estate.
Unsupervised machine learning (ML) is the basis of automated data quality monitoring. Over this eight-part series, we’ll walk you through how and why it works. But first, let’s look at the ways data quality issues arise at modern companies, and at their impact on everything from BI to AI.
Welcome to the data factory
Today’s enterprise data systems do much more than store and transport data. That’s why, instead of the traditional warehouse metaphor, we prefer data factory, reflecting how today’s data stacks transform inputs from diverse sources into useful outputs.
This metaphor is useful because data is not static. Every movement and transformation introduces opportunities for error. Through comparison with a physical process, we can better understand the places where data quality can be compromised.
| | Physical factory | Data factory |
| --- | --- | --- |
| Input | Raw materials, as well as those produced from another process. | Data with a wide range of purity, from raw files in data feeds to those already manipulated and delivered via third-party API. |
| Manipulation | Machines, conveyor belts, and other equipment follow a process to refine, combine, change, reshape, and otherwise transform inputs. | ETL, orchestration, transformation, and other tools guide and change data by combining, discarding, comparing, calculating, etc. |
| Humans | Workers and supervisors monitor for quality, respond to alerts, find ways to improve, do some things manually, and accidentally introduce errors. | Analysts and engineers monitor for quality, respond to alerts, find ways to improve, do some things manually, and accidentally introduce errors. |
| Output | May be packaged for end users, or serve as an input to another manufacturing process. | May be used directly by a user or software, or processed further by tools such as analytics platforms or generative AI. |
How the data factory can introduce issues
Let’s look at how things can go wrong in the data factory:
Every time a given piece of data is transported, copied, or manipulated, the goal is to make it more useful, but there is a chance it will be corrupted.
- Poor-quality inputs. Some errors occur when the data is created, such as by a damaged physical sensor or a typo; others originate in upstream data factories.
- Improperly packaged materials. Problems in how the data is described, such as faulty dataset metadata or unannounced changes in an API, can lead systems to treat data incorrectly, turning good data into bad.
- Broken machines. Software is subject to bugs, outages, and other factors that can lose, delay, or corrupt data.
- Scheduling or sequence errors. Processes can run in the wrong order or at the wrong time, causing duplicate, missing, or incorrectly calculated data.
- Incorrect parts. Errors in code can mess up data by improperly transforming, aggregating, or joining it (see the sketch after this list).
- Incorrect configuration. Properly maintained software can still malfunction if not set up appropriately for the inputs.
- New equipment operates differently. Improved or net-new software can introduce unexpected ways of encoding or transforming data. The more different these systems, the more likely the upgrade is to introduce errors. (Watch out for on-prem to cloud migrations!)
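To make the “incorrect parts” failure mode concrete, here is a minimal, hypothetical Python sketch (the tables and column names are invented for illustration): a join that looks correct silently inflates a revenue total the moment one of its inputs stops being unique, while a cardinality check catches the issue right at the merge.

```python
import pandas as pd

# Hypothetical order and customer tables.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": ["a", "b", "b"],
    "amount": [100.0, 50.0, 25.0],
})

# The customer table was supposed to have one row per customer,
# but an upstream change introduced a duplicate for customer "b".
customers = pd.DataFrame({
    "customer_id": ["a", "b", "b"],
    "region": ["US", "EU", "EU"],
})

# A many-to-many join silently duplicates orders for customer "b"...
joined = orders.merge(customers, on="customer_id")
print(joined["amount"].sum())   # 250.0 -- revenue inflated by 75.0

# ...whereas validating the expected join cardinality surfaces the problem.
try:
    orders.merge(customers, on="customer_id", validate="many_to_one")
except pd.errors.MergeError as err:
    print(f"Data quality issue caught: {err}")
```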
Things people do can also harm data quality
In addition to problems with the inputs and equipment, the people stewarding and working with the data will inadvertently introduce other issues.
- New features. Decisions that change the shape of data, such as its level of granularity or number of columns, can have unintended downstream effects.
- Bug fixes. Counterintuitive, but true: if downstream systems have already compensated for a bug, fixing it turns those compensations into distortions (see the sketch after this list).
- Refactors. Whenever engineers rework the code or structure of an existing system, they introduce the possibility of errors.
- Optimizations. Smaller adjustments to make things work faster or more efficiently may affect the data itself, changing its granularity, reliability, or uniqueness.
- People. When team members move on to other roles or leave the company, they rarely leave behind perfect documentation. Without full context, their successors are liable to introduce unintended changes in how data is processed and presented.
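As a hypothetical illustration of the bug-fix problem above: suppose an upstream service has always sent amounts in cents, and a downstream transform quietly compensates by dividing by 100. When the upstream bug is finally fixed and dollars start arriving, the unchanged compensation turns good data into bad.

```python
# Hypothetical downstream transform that compensates for a known upstream bug.
def normalize_amount(raw_amount: float) -> float:
    """Upstream has historically sent cents instead of dollars,
    so we divide by 100 to compensate."""
    return raw_amount / 100

# Before the upstream fix: 1999 (cents) becomes 19.99 dollars. Correct.
print(normalize_amount(1999))   # 19.99

# After upstream quietly fixes its bug and starts sending dollars,
# the same compensation shrinks every value a hundredfold.
print(normalize_amount(19.99))  # 0.1999
```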
The lasting impact: data scars and data shocks
Data errors have two types of impacts: the incorrect data itself, and the sharp adjustments in the trendline caused by the onset of and recovery from the error.
Data scars are the anomalous or invalid records that should not be trusted. They ought to be clearly flagged and addressed in analysis, and, if at all possible, kept out of AI training data. The longer a quality issue drags on, the more your data is scarred and the less of it you can trust. (Or you can put in the work to clean up the damaged records, which can be very expensive and might not fully repair the scar anyway.)
Data shocks are unexpected, large changes. Some reflect reality, such as the patterns that changed nearly overnight around COVID-19, but others are caused by data quality issues. The onset of a scar is itself a shock, and fixing the issue can cause an equally destabilizing one. Unless specifically accounted for, data shocks can lead human and computer analysis alike to misread trends and make wildly inaccurate predictions.
How incidents accumulate to erode data quality and trust over time. Each bar is a data scar left by the incident. Each X (marking when the incident occurred) is a data shock. Notably, the checkmark (when Incident 2 was resolved) is also a data shock.
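To see how a scar and the shocks around it can distort analysis, here is a small synthetic example (the numbers are invented): a daily metric that is roughly flat, except that an upstream bug doubles the values for the final two weeks. A naive trend fit over the whole series is dragged sharply upward, while excluding the scarred window recovers the real, flat trend.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily metric: roughly flat around 100 with mild noise.
days = np.arange(90)
metric = 100 + rng.normal(0, 2, size=90)

# A data scar: an upstream bug doubles the values for the last two weeks.
# The onset of the bug is a data shock; the eventual fix will be another.
scar = slice(76, 90)
metric[scar] *= 2

# A naive linear trend fit over the whole series is dragged sharply upward...
scarred_slope = np.polyfit(days, metric, 1)[0]

# ...while excluding the scarred window recovers the flat underlying trend.
mask = np.ones(len(days), dtype=bool)
mask[scar] = False
true_slope = np.polyfit(days[mask], metric[mask], 1)[0]

print(f"trend with the scar included: {scarred_slope:+.3f} per day")
print(f"trend with the scar excluded: {true_slope:+.3f} per day")
```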
AI multiplies the impact of bad data
When AI trains on imperfect data, at best it outputs less-than-optimal results; at worst, it fails in unexpected, hard-to-catch ways. Anomalous data, missing data (NULL values), and data shocks all cause models to behave erratically, which is why comprehensive data quality monitoring needs to be in place if you’re doubling down on AI. Particularly as AI makes business intelligence (BI) more accessible and powerful across the enterprise, companies face a growing risk of bad data leading to bad decisions with real financial and opportunity costs.
The imperative for quality data is even greater with generative AI, which works in ways we barely understand on top of enormous amounts of largely unstructured data. If assessing and monitoring the quality of structured data is a challenge, doing the same for the unstructured data that feeds large language models (LLMs) is in another league.
AI is also the answer to bad data
Here’s a powerful concept that underpins everything we believe in at Anomalo, and everything we’ll be teaching you in this series: you can train a machine learning algorithm to learn the historical patterns in your data over time and point out when something’s out of the ordinary. If AI/ML is the reason you’re investing in data quality, it should also be how you monitor your data. Unsupervised ML compares all sorts of characteristics within datasets to establish baseline patterns and find errors you’d never have thought to write rules for. It scales extremely well, across virtually any number of tables and records; it can be fine-tuned to reduce both false negatives and false positives; and it takes very little effort to set up.
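As a rough illustration of the idea (a minimal sketch, not Anomalo’s implementation), the snippet below profiles each day of a hypothetical table into a few summary statistics, uses scikit-learn’s IsolationForest to learn what normal days look like, and flags days whose profile deviates, such as a sudden spike in NULL values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def daily_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each day of data into a small vector of characteristics."""
    return df.groupby("load_date").agg(
        row_count=("amount", "size"),
        null_rate=("optional_field", lambda s: s.isna().mean()),
        amount_mean=("amount", "mean"),
        amount_std=("amount", "std"),
    )

# Hypothetical history: 60 normal days of records, plus one day where an
# upstream issue causes a spike in NULL values.
rng = np.random.default_rng(1)
frames = []
for day in pd.date_range("2025-01-01", periods=61, freq="D"):
    n = 1000
    null_rate = 0.30 if day == pd.Timestamp("2025-02-20") else 0.02
    frames.append(pd.DataFrame({
        "load_date": day,
        "amount": rng.normal(100, 10, size=n),
        "optional_field": np.where(rng.random(n) < null_rate, np.nan, 1.0),
    }))
history = pd.concat(frames, ignore_index=True)

profiles = daily_profile(history)

# Learn what "normal" days look like, then score every day; outliers get -1.
model = IsolationForest(contamination=0.02, random_state=0)
flags = model.fit_predict(profiles)

print(profiles[flags == -1])  # the NULL-spike day stands out
```

In practice, an automated monitor would track many more characteristics per table and retrain as the data evolves, but the core pattern is the same: learn the baseline from history, then flag deviations.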
In the next post, we’ll compare unsupervised ML to other strategies, including the traditional rules-based approaches, and show how it works alongside data observability.
Can’t wait? Get the whole book now, for free.