
The Enterprise Data Quality Imperative

Data can dramatically transform enterprise businesses; some media outlets have even called data the new gold or oil. However, this is only the case if you have high-quality data. We strongly believe that bad-quality data is worse than having no data at all. In our recent State of Enterprise Data Quality executive brief, 100% of respondents (including data leaders, data scientists, data engineers, and architects) reported data quality issues, and 95% said that a data quality issue had a direct impact on their business outcomes.

If you are a data governance professional, a data scientist, a data engineer, or a data leader planning to guide your enterprise to leverage any form of ML or AI, then you need to consider data quality as a table-stakes part of your strategy. If you can’t trust your data, you risk disastrous consequences. However, it’s harder than ever to maintain high data quality, and the traditional rules-based methods just won’t cut it when you need to deploy enterprise-level data quality at scale.

Data quality as an enterprise imperative

Many data teams face an unfortunate reality today. They struggle to keep up with questions about and issues with their company’s data, to the point that they barely have time to effectively analyze the data and identify insights for the business. Meanwhile, key business leaders are left doubtful of the metrics they see on their dashboards, and as the company scales (and the complexity of their data problems scales with it), the distrust deepens and problems only grow. As a result, 91% of IT decision-makers believe they need to improve data quality at their company, and 77% say they lack trust in their organization’s business data.

These concerns are completely valid, as data quality issues have notable, and sometimes very costly, negative impacts on business outcomes. Consider these examples:

  • Data issues in the feeds informing dynamic pricing algorithms have led to airline “mistake fares” that cost airlines hundreds or even thousands of dollars per ticket.
  • Unity lost $110M on its AI ad system after ingesting bad training data from a third-party vendor.
  • In 2020, a data error led to the loss of almost 16,000 positive COVID-19 test results, possibly resulting in up to 50,000 people not being told to self-isolate.

On the other hand, smart use of high-quality data can unlock exponential growth and market innovation for companies across many sectors. In fact, on closer inspection, many of today’s most successful companies are, at their core, data companies:

  • Amazon became a leader in the e-commerce industry by leveraging data to personalize recommendations, price in real time, and optimize logistics.
  • Within the financial sector, Capital One has successfully utilized data to personalize its marketing and make smarter underwriting decisions.
  • Netflix takes advantage of large amounts of user data to recommend what users should watch next, helping it stand out early in the streaming industry.

Data analytics is democratized

As data has become more widely available and accessible, organizations have empowered more of their workers to embrace the use of data even in non-technical settings like marketing or finance. This democratization of data means that there is no longer a small, centralized team interacting with the business’s data. Rather, the data is dispersed and managed by a much wider group of people across the organization.

In this next era, data quality is even more critical than before. With so many different stakeholders relying on data to make business decisions, your business cannot afford to stop and question whether the data is trustworthy at every turn. Additional tools and software that interact with your data also mean more opportunities for errors to be introduced into your data or dashboards. As your company moves toward a more democratized future for its data, you cannot forget about data quality.

AI and ML are differentiators

Data leaders are well aware that data quality is important for AI and ML solutions. A strong ML strategy relies on the fundamental assumption that you have high-quality data.

It simply doesn’t make sense to invest in new data science, machine learning, or generative AI projects on top of data you can’t trust. Data quality makes or breaks models. Especially with ML, you need to ensure you have enough reliable, high-quality data for both training and inference; otherwise, your models will fail when they’re presented with data outside of the distribution they have seen before.
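To make the out-of-distribution risk concrete, here is a minimal sketch of a drift check that compares a feature’s serving-time distribution against its training distribution. The column names and the significance threshold are illustrative assumptions, not prescriptions.

```python
# Sketch: flag features whose serving data has drifted from the training data.
# A two-sample Kolmogorov-Smirnov test is one simple, generic choice here;
# column names and the 0.05 threshold are hypothetical.
from scipy.stats import ks_2samp

def drifted_features(train_df, serving_df, features, alpha=0.05):
    """Return (feature, statistic) pairs whose serving distribution differs from training."""
    drifted = []
    for col in features:
        result = ks_2samp(train_df[col].dropna(), serving_df[col].dropna())
        if result.pvalue < alpha:  # distributions differ more than chance would suggest
            drifted.append((col, result.statistic))
    return drifted
```

A check like this can run before each batch of predictions, catching out-of-distribution inputs before they silently degrade model quality.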

Data quality is probably even more important when using Generative AI.

GenAI often relies on unstructured data, which can be even more difficult to monitor since traditional tactics cannot scale to these larger, more irregular data formats. Because unstructured data does not adhere to a traditional, standardized format, data leaders face challenges organizing, retrieving, storing, and analyzing it. Unstructured data can also contain information that enterprise companies would prefer LLMs not learn from, such as personally identifiable information (PII). As HBR found in its coverage of an AWS/MIT survey, 46% of data leaders have already identified data quality as the greatest challenge to realizing GenAI’s potential in their organizations.

Companies responding to growth and change

As more companies have invested in their data, more solutions have also popped up to address a wide variety of needs across the “modern data stack”. A set of software-as-a-service (SaaS) vendors today can accomplish what would have taken 100 full-time data engineers 10 years ago. Businesses are migrating from legacy databases to cloud systems that make it easier than ever to leverage more data.

Figure 1: The “modern data stack”

However, these solutions still rely on the fundamental assumption that they’re running on high-quality data. While we’d argue that almost all data begins as high-quality data, data doesn’t exist in a vacuum. Enterprise companies are constantly adapting and improving their products, which directly impacts the data related to those products.

The “modern data stack” is a large, continuous investment for your company, and your efforts may be undermined if you move forward without considering data quality monitoring. Otherwise, you may complete a migration to a brand-new system only to find that it has left your data in a bad state, which can be even harder to diagnose while everyone is still learning how the new system works.

More Data, More Problems: The Rise of the Data Factory

With the new ways companies are working with data, the “modern data stack” metaphor is no longer complete. Companies today are operating a data factory: a complex environment that transforms raw materials (streaming datasets, raw files from data feeds, API extracts from SaaS apps, replicas of databases) into useful products (like dashboards and insights).

This metaphorical factory is built upon a foundation of cloud data warehouses and data lakes. The “machines” are orchestration platforms; extract, transform, and load (ETL) tools; and transformation tools, while the workers on the floor are the data and analytics engineers using these tools. They then deliver data products that power the decisions made by business users and data professionals, train ML algorithms, and feed other downstream data systems.

As there are many ways things can go wrong in a physical factory, many things can go wrong in a data factory:

  • Broken machines: data processing or orchestration tools break down, stopping or degrading the flow of data
  • Scheduling errors: processing jobs run out of order or at the wrong cadence, causing missing data, incorrect computations, or duplicate data
  • Poor raw materials: bad data fed into the factory can have adverse effects that propagate throughout the warehouse
  • Botched upgrades: changes introduce subtle but pervasive differences in how data is encoded and transformed
  • Communication failures: well-intentioned changes are not properly managed and lead to inconsistencies in data processing logic

In these, and many other ways, you can visualize the various potential breakdown points of a “modern data stack”. It’s important to evolve your company’s thinking around the maintenance and production of data so you can create a robust data quality monitoring strategy in response.

Figure 2: The data factory and what can go wrong

Data quality monitoring as table stakes

Ultimately, when (not if) data quality issues occur, it’s critical to address them head-on. The longer a data quality issue goes unfixed, the greater the impact your business will face down the line. And when multiple issues go unaddressed, they compound over time, further degrading data quality.

It’s hard to backfill data, and even harder to backfill trust.

Addressing these issues is also not a one-off project: as detailed above, there are many ways that data issues can continually arise over time from a variety of sources. Anyone responsible for data governance needs to consider data quality monitoring as part of the strategy.

However, not all forms of data quality monitoring are created equal. While manual checks and rule-based tests may work on a very small database, these methods can never scale to the size of a large enterprise. Metrics monitoring can better identify issues at an aggregate level, but the number of metrics you’d need to monitor explodes as the data collected grows in nuance and complexity. While each of these methods has its uses, any robust data quality monitoring strategy needs another approach to reach enterprise scale.
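To see why hand-written rules struggle at scale, consider this rough sketch of rule-based checks for a single table. The table and column names are invented for illustration; a real enterprise warehouse would need rules like these for every column of every table.

```python
# Sketch: hand-written, rule-based data quality checks on one hypothetical table.
import pandas as pd

def check_orders_table(orders: pd.DataFrame) -> list[str]:
    """Run a handful of hard-coded rules and return human-readable failures."""
    failures = []
    if orders["order_id"].duplicated().any():
        failures.append("order_id contains duplicates")
    if orders["customer_id"].isna().mean() > 0.01:
        failures.append("more than 1% of customer_id values are null")
    if not orders["amount_usd"].between(0, 100_000).all():
        failures.append("amount_usd falls outside the expected 0-100,000 range")
    if orders["created_at"].max() < pd.Timestamp.now() - pd.Timedelta(days=1):
        failures.append("no orders ingested in the last 24 hours")
    return failures
```

Each rule encodes one failure mode somebody already anticipated; as tables, columns, and teams multiply, the number of rules to write and maintain grows faster than any team can keep up with.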

Automated data quality monitoring: A new frontier

We believe that the next generation of data quality monitoring must leverage automation to scale to the needs of large enterprise companies. While rules-based monitoring, metrics, and profiling work well in situations where your organization is aware of or understands the issues that may occur, these tactics leave your enterprise organization vulnerable to unknown unknowns: problems that nobody thought to look for.

Anomalo’s enterprise data quality monitoring allows you to set up rules and monitor metrics like any other solution. What’s different is an unsupervised ML model that automatically learns the historical patterns in your data, so it can notify you when there are changes worth investigating. By combining these approaches, Anomalo gives our enterprise customers confidence that issues will be caught before they affect the business.
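As a rough illustration of the general technique, and not a description of Anomalo’s production model, an unsupervised detector can be fit on daily summary statistics of a table and flag days that look unlike the history it has learned. The profile metrics and column names below are assumptions for the sake of the example.

```python
# Sketch: generic unsupervised anomaly detection over daily table profiles.
# Illustrative only; not Anomalo's actual model or feature set.
import pandas as pd
from sklearn.ensemble import IsolationForest

def profile(day_df: pd.DataFrame) -> dict:
    """Reduce one day's rows to a handful of summary statistics."""
    return {
        "row_count": len(day_df),
        "null_rate": day_df.isna().mean().mean(),
        "mean_amount": day_df["amount_usd"].mean(),
        "distinct_customers": day_df["customer_id"].nunique(),
    }

def fit_detector(history: list[pd.DataFrame]) -> IsolationForest:
    """Learn what 'normal' daily profiles look like from past days."""
    X = pd.DataFrame([profile(day) for day in history])
    return IsolationForest(random_state=0).fit(X)

def is_anomalous(detector: IsolationForest, today_df: pd.DataFrame) -> bool:
    """True if today's profile looks unlike the learned history."""
    return detector.predict(pd.DataFrame([profile(today_df)]))[0] == -1
```

Because the detector learns from the data itself rather than from rules somebody wrote, it can surface the unknown unknowns described above: changes no one thought to check for.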

Are you ready to embark on the journey to enable your enterprise business to scalably monitor, detect, and resolve your data quality issues? Learn more by reading our book, Automating Data Quality: Scaling Beyond Rules with Machine Learning (which this blog post is based on), or reach out to request a demo!
