Chapter 2: A Four-Pillar Approach to Data Quality Monitoring
February 24, 2025
Welcome to “Use AI to modernize your data quality strategy,” a series spotlighting insights from our O’Reilly book, Automating Data Quality Monitoring. This post corresponds to Chapter 2: Data Quality Monitoring Strategies and the Role of Automation.
We’ve all had some version of the nightmare: Someone on the team was working on a report for the board and accidentally discovered a major problem with the data. There’s a chance it’s been broken for a while, but it’s hard to tell. Manually checking for residual damage will take days, and the board needs the report tomorrow morning.
But it doesn’t have to be this way. Somewhere out there, a similar company is living a different dream. The monitoring system has automatically flagged a problem with the data and pinged the on-call engineers in Slack. The issue affects two downstream tables, but the system has already done the heavy lifting and tracked down the root cause of the error.
The difference here is the approach to data quality monitoring. The team in our nightmare scenario relies on reactive, ad-hoc solutions, while the dream team leverages proactive, automated data quality monitoring. But not every automated monitoring system is built the same, so it’s vital to ensure your chosen solution aligns with your goals and incorporates the four core monitoring pillars we’ll discuss below.
A marathon and a sprint: Aligning on monitoring goals
It’s hard to reach your goals if you can’t articulate what they are. When evaluating your own goals around data quality monitoring, consider both short-term and long-term needs. Forget the old saying that “it’s a marathon, not a sprint”—data quality monitoring is both.
A useful monitoring solution should surface actionable information as soon as an issue occurs. A successful strategy will also streamline resolution by making it easier to find an issue’s root cause. Consider as well your needs around alert management and scaling.
Traditional monitoring approaches fall flat
For many organizations, the thinking around data quality is that “if it’s a big enough problem, someone will catch it eventually.” Of course, you’re unlikely to be successful catching every issue this way. But even more than that, the issues you do catch will probably be caught too late. This leads to the data shocks and data scars we discussed in Chapter 1.
Other traditional monitoring approaches are more sophisticated but still insufficient. Analysts and infrastructure engineers set up manual checks to run at specified intervals and validate the data. These checks are usually hard and fast rules, like “this column must not contain any nulls.” Although this approach is more effective than relying on ad-hoc discoveries, it’s difficult to maintain and doesn’t scale well.
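To make the maintenance burden concrete, here’s a minimal sketch of such a hard-and-fast rule in Python. The table, column, and the SQLite stand-in for a real warehouse are all hypothetical:

```python
# One hand-written rule: fail if the column contains any NULLs. The table
# and column names are hypothetical; sqlite3 stands in for your warehouse.
import sqlite3

def check_no_nulls(conn, table: str, column: str) -> bool:
    """Return True if `column` has no NULL values in `table`."""
    null_count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    if null_count:
        print(f"FAIL: {table}.{column} has {null_count} null(s)")
    return null_count == 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(101,), (None,)])

# Every table needs its own copy-pasted invocation (and often its own rule).
check_no_nulls(conn, "orders", "customer_id")  # prints FAIL: ... 1 null(s)
```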
In fact, creating and maintaining manual data quality rules becomes a Sisyphean task over time. When you only have ten or twenty tables, copying and pasting these rules isn’t too much of a hassle. But scaling your manual checks up to a hundred, a thousand, or ten thousand tables? That’s a recipe for disaster. Each table is different and requires customizations to the rules, if not fully custom rules.
With this kind of rules-based approach, you’re always playing catch-up. Most manual rules are reactive, checking for issues that have occurred in the past. Proactive manual checks are difficult to do well—how do you know what’s likely to fail if it hasn’t failed before?
In short, neither of these traditional monitoring approaches (ad-hoc or manual rules) satisfies our short- and long-term goals. You and your data deserve a more complete solution.
The four pillars of automated data monitoring
A multi-pronged approach integrates your team’s subject matter expertise with automated checks and the power of unsupervised machine learning. We believe there are four pillars of successful automated data monitoring systems, each of which represents a distinct type of data monitoring check.
These four pillars are data observability, validation rules, key metrics, and unsupervised machine learning (UML). Check out the following table, and then read on as we dive into each pillar.
| | Data observability | Validation rules | Key metrics | UML checks |
|---|---|---|---|---|
| Quick to set up | ✅ | ❌ | ❌ | ✅ |
| Scales easily | ✅ | ❌ | ❌ | ✅ |
| Monitors for unknown unknowns | ❌ | ❌ | ❌ | ✅ |
| Takes history into account | ❌ | ❌ | ✅ | ✅ |
| Catches needle-in-a-haystack errors | ❌ | ✅ | ❌ | ❌ |
| Catches preexisting errors | ❌ | ✅ | ❌ | ❌ |
| Closely monitors a small slice of the table | ❌ | ✅ | ✅ | ❌ |
You can see that each monitoring pillar contributes different strengths toward our monitoring goals. At the same time, no individual pillar can catch everything on its own. That’s the lesson here: Traditional monitoring isn’t bad, it’s just incomplete. It’s like tracking your budget by only counting the cash in your wallet. Complete monitoring solutions require a more holistic approach incorporating each of the four pillars of automated data monitoring.
Data observability for an eagle-eye view
Data observability is all about breadth. It’s well suited for scaling lightweight monitoring across thousands of tables, but it won’t tell you whether the data inside a table is actually correct.
Data observability checks are computationally inexpensive because they reference a table’s metadata but do not query the data itself. Most data observability checks are also one-size-fits-all and require no customization. This makes them a popular choice for monitoring the vital signs of many tables at once. They’ll tell you if a table is not available, or if it hasn’t been updated as expected.
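As a sketch of how lightweight these checks can be, here’s a freshness-and-existence monitor in Python. It assumes a warehouse whose information_schema.tables exposes a last-altered timestamp and row count (as Snowflake does, for example); the SLA, connection, and parameter style are hypothetical and vary by platform:

```python
# A metadata-only observability check: is the table present, fresh, and
# non-empty? No data rows are ever scanned. Assumes information_schema.tables
# exposes LAST_ALTERED and ROW_COUNT; column names and the DB-API
# parameter style vary by warehouse.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=24)  # hypothetical per-table freshness SLA

def observe_table(conn, table_name: str) -> None:
    cur = conn.cursor()
    cur.execute(
        "SELECT last_altered, row_count FROM information_schema.tables "
        "WHERE table_name = %s",  # use your driver's parameter style
        (table_name.upper(),),
    )
    row = cur.fetchone()
    if row is None:
        print(f"ALERT: {table_name} does not exist")
        return
    last_altered, row_count = row
    if datetime.now(timezone.utc) - last_altered > FRESHNESS_SLA:
        print(f"ALERT: {table_name} last updated {last_altered}, past SLA")
    elif row_count == 0:
        print(f"ALERT: {table_name} exists but is empty")

# Because every check is identical, scaling to thousands of tables is a loop:
# for name in all_table_names: observe_table(conn, name)
```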
Validation rules for the customization your data team needs
Validation rules check a set of data for characteristics you specify as “good” or “bad.” Among many other use cases, validation rules can tell you if you have any nulls in a column or if your dates suddenly have the wrong format. These are hard-line rules that can identify both emerging and pre-existing issues in your data.
You might have similar rules in your monitoring setup already, and this pillar leverages that pre-built material. In general, validation rules rely on your team’s subject matter expertise, because your team knows what failure conditions the system should be looking for.
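A sketch of what this pillar can look like in practice: rules declared as data and executed by a shared runner. The tables, columns, and rule wording below are invented for illustration, and real rule engines offer far richer condition types:

```python
# Declarative validation rules: domain experts add entries to RULES, and a
# shared runner executes them all. Tables, columns, and rules are invented.
import sqlite3

RULES = [
    # (table, human-readable description, SQL returning a count of bad rows)
    ("orders", "order_total must be non-negative",
     "SELECT COUNT(*) FROM orders WHERE order_total < 0"),
    ("orders", "ship_date must look like YYYY-MM-DD",
     "SELECT COUNT(*) FROM orders WHERE ship_date NOT GLOB "
     "'[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'"),
]

def run_rules(conn) -> None:
    for table, description, sql in RULES:
        bad_rows = conn.execute(sql).fetchone()[0]
        status = "PASS" if bad_rows == 0 else f"FAIL ({bad_rows} rows)"
        print(f"[{status}] {table}: {description}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_total REAL, ship_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(19.99, "2025-02-24"), (-5.0, "02/24/2025")])
run_rules(conn)  # both rules fail: one negative total, one bad date format
```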
Key metrics to make predictions given a table’s history
When you want to monitor a statistic that’s important to your business, key metrics are the best choice. These checks incorporate time series models, which treat time as a variable that affects your table’s metrics. With cyclical fluctuations in mind, the check can predict what your metrics are likely to be on any given day. And all of this is built into key metrics monitoring without the need for manual maintenance.
Take a look at the visualization below for an example of how seasonally sensitive predictions can help discover data anomalies. The red dot at the end of the top graph shows a value that’s unexpectedly high for the season, even though it falls within the table’s overall historical range. Unlike a hard validation rule, key metrics checks use fluid boundaries that account for expected variability while still flagging abnormal changes.
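Alongside the visualization, here’s a toy numeric version in Python: it judges today’s value only against the same weekday in prior weeks, a crude stand-in for the real time series models described above. The metric, window size, and three-sigma threshold are all illustrative choices:

```python
# A toy seasonal key-metric check: compare today's value against the same
# weekday in recent weeks, so a quiet Sunday isn't judged by Monday's
# standards. Real systems use proper time series models; the window and
# three-sigma threshold here are arbitrary illustrative choices.
import statistics
from datetime import date, timedelta

def seasonal_bounds(history, today, weeks=8, sigmas=3.0):
    """Expected range for `today`, based on the same weekday in prior weeks."""
    peers = [history[today - timedelta(weeks=w)]
             for w in range(1, weeks + 1)
             if today - timedelta(weeks=w) in history]
    mean, spread = statistics.mean(peers), sigmas * statistics.stdev(peers)
    return mean - spread, mean + spread

# Hypothetical daily row counts: Sundays run far lower than weekdays.
history = {}
for i in range(49):
    d = date(2025, 1, 1) + timedelta(days=i)
    history[d] = 2_000.0 if d.weekday() == 6 else 10_000.0 + i

value, today = 6_000.0, date(2025, 2, 19)  # within the table's overall range
low, high = seasonal_bounds(history, today)
if not low <= value <= high:
    print(f"ALERT: {value} is outside the expected range [{low:.0f}, {high:.0f}]")
```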
Unsupervised machine learning to catch unknown unknowns
We’ve saved the most innovative pillar for last. Unsupervised machine learning is the perfect solution for rooting out “unknown unknowns”: complex issues you never would have thought to check for.
Broadly, an unsupervised machine learning check learns a table’s inherent patterns and relationships, interpreting new inputs based on everything it has seen so far. Because data is usually highly interrelated, unsupervised ML checks are perfectly positioned to find those difficult-to-predict errors. For example, an ML check could alert you to a change in the average time between your order processing dates and order shipping dates.
These ML checks require very little effort to set up across any number of tables, because they adapt to each table on their own. And when they catch issues, they can use their knowledge of the relationships between your columns to provide detailed explanations of the problem.
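For a flavor of how such a check might work, here’s a minimal sketch using scikit-learn’s IsolationForest, one common unsupervised anomaly detection technique (real products may use very different models). It reuses the order-timing example: the model learns the usual gap between processing and shipping times, then flags new rows where that gap has silently doubled:

```python
# A minimal unsupervised check with IsolationForest: the model learns the
# joint pattern across columns from history, then flags new rows that
# break it. The columns and numbers are invented for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Historical rows: [processing_hours, shipping_hours]; shipping normally
# lands about 24 hours after processing.
processing = rng.normal(12, 2, size=(500, 1))
normal_rows = np.hstack([processing, processing + rng.normal(24, 2, (500, 1))])

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(normal_rows)

# New rows where the gap between the two columns has silently doubled.
new_processing = rng.normal(12, 2, size=(20, 1))
new_rows = np.hstack([new_processing, new_processing + 48])

flags = model.predict(new_rows)  # -1 marks an anomaly
print(f"{(flags == -1).sum()} of {len(new_rows)} new rows flagged as anomalous")
```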
Although we’re strong proponents of unsupervised ML checks, it’s worth introducing two cautions here. First, not everything that’s marketed as an “AI-powered monitoring solution” is the same under the hood. Some data monitoring solutions that claim to be built on machine learning are really just running a standardized set of key metrics checks, which don’t address complex relationships between columns.
Second, as with other types of automation, there’s a critical threshold at which the value proposition is greatest. Once you hit that threshold, you’ll get more value from diversifying your monitoring abilities than from adding more of the same checks.
Both pitfalls are easy to avoid with a little research, and unsupervised ML checks are more than worth building into your monitoring solution. They won’t catch preexisting errors or “needle in a haystack” issues, but they’ll find problems you never would have caught otherwise.
A strong foundation for your data factory
As we saw in the comparison table earlier, no individual pillar can provide robust monitoring on its own. Your data factory depends on a strong foundation, and all four pillars working in tandem provide the best base for it. When you support your data factory with a multi-pronged data monitoring solution, you reap the benefits of every pillar while smoothing out each one’s shortcomings.
So what’s the catch?
In Chapter 3, we’ll discuss whether the setup cost of a robust four-pillar system is worth it for your team and your stakeholders. You’ll learn how to meaningfully assess your data, industry, maturity, and more. We’ll also discuss different types of stakeholders and how a robust monitoring strategy can align with their needs.
Can’t wait? Get the whole book now, for free.