An O’Reilly Book for Data Quality in the Age of AI

September 11, 2024

By Paige Schwartz

Over the past year, we’ve been sharing previews of our new technical textbook with O’Reilly: Automating Data Quality Monitoring: Scaling Beyond Rules with Machine Learning. After a lot of anticipation (and editing…and re-editing!), the complete book is now available! Best of all, you can download it for free from our website for a limited time.

Data quality is the #1 roadblock businesses face when building AI and analytics applications. This book will show you how to overcome the many challenges of data quality monitoring. It will teach you how to use automation with machine learning to build trust in your organization’s data.

Click here to download your free copy!

Why did we write this book?

At Instacart, Anomalo co-founders Elliot Shmukler and Jeremy Stanley witnessed first-hand the difficulties of monitoring data effectively at enterprise scale. Rules-based testing was too manual and brittle; monitoring metadata like lineage and observability didn’t go deep enough.

That’s what led to Anomalo, the comprehensive, automated data quality monitoring platform powered by unsupervised machine learning, which learns the patterns in your data and alerts you to changes across your entire data lake or warehouse.

When we speak with technical teams, the same kinds of questions tend to come up again and again:

How does unsupervised ML actually work to detect data issues? How is the model trained, evaluated, and fine-tuned?
How do you determine the severity of an issue, perform root cause analysis, and explain where in the data the issue occurs?
How do you avoid alert fatigue / over-alerting?
How does this approach integrate with the modern data stack?
How should you reason about the ROI of automated data quality monitoring? What about the build vs. buy decision?
What’s the best way to roll out this solution to the rest of the org and maintain it over time?
Are rules-based testing, monitoring metrics, and data observability still important? (Short answer: yes!)

After five years of evolving the Anomalo platform and working with large enterprises across nearly every industry, we realized we knew enough to write a book that would answer all these questions and more.

So Anomalo CTO Jeremy Stanley teamed up with technical writing pro Paige Schwartz and the publishers and editors at O’Reilly to share our philosophy and help readers discover the cutting-edge techniques we’ve learned while building Anomalo.

Who is the book for?

We focused on in-depth strategic, operational, and technical considerations in this book, with three main audiences in mind:

Chief data and analytics officers (CDAOs) and VPs of data
Heads of data governance
Data practitioners of all kinds: data scientists, analysts, and data engineers

However, this is a great resource for anyone curious about improving data quality at their organization.

Here’s what others have said about the book:

“This book expertly lays out the entire data quality lifecycle from rules definitions and machine learning to scaling and alert fatigue. A canonical resource.”
—Chris Riccomini, Author of The Missing README: A Guide for the New Software Engineer

“Excellent overview of practical monitoring solutions powered by machine learning and adapted to the maturity of your modern data stack. You don’t have to earn your data scars—you can just read this book.”
—Monica Rogati, Independent Data Science Advisor

“In an era where AI is reshaping enterprises, this book offers a powerful roadmap for ensuring foundational data quality, a must-read for forward-thinking data leaders.”
—Chen Peng, VP and Head of Data at Faire

What is the book about?

The book is divided into eight chapters, plus an appendix covering the most common data quality issues and how to monitor for them. It features a preface by former US Chief Data Scientist DJ Patil.

Here’s a chapter-by-chapter breakdown of the topics we cover:

Chapter 1: Why data quality matters and how mistakes affect your business
Chapter 2: What a comprehensive data quality solution looks like, and why that must go beyond rules-based testing to include automation with machine learning
Chapter 3: How to assess the ROI your business would get from comprehensive data quality monitoring
Chapter 4: An algorithm for automating data quality monitoring at scale with machine learning
Chapter 5: Tuning and testing your model to ensure it performs well on real-world data
Chapter 6: How to implement notifications and avoid the common pitfall of over-alerting
Chapter 7: Why and how to integrate your monitoring with other data tools and systems
Chapter 8: Deploying your solution, onboarding users, and continually improving data quality

Get your copy today

This book is a comprehensive guide—it weighs in at over 200 pages!—and will serve as a reference that you can turn to time and time again on your data quality journey.

While it’s free from our website right now, it won’t be that way forever.

So, download the book now, and let us know what you think by emailing us at automating.data.quality.monitoring@anomalo.com.