AI-Powered Data Quality

Data Anomaly Detection That Works at Scale

Your pipelines pass. Your dashboards break anyway. DataHub's ML-powered anomaly detection finds what your rules miss, before your stakeholders do.

  • Detect volume, freshness, schema, and column anomalies automatically
  • ML models learn seasonality so Monday spikes never trigger false alerts
  • Cover hundreds of datasets with one monitoring rule, zero manual setup

See anomaly detection in your environment

A DataHub engineer will walk through your specific stack.

The real cost

What breaks when anomaly detection is manual

Manual thresholds can't keep up with modern pipelines. By the time a rule fires, the damage is already downstream.

Rules that break every sprint

Static thresholds require constant tuning. One schema change and your entire monitoring layer goes silent.

Anomalies found in the standup

Stakeholders surface broken data before your monitors do. That is not a tooling gap. That is a trust gap.

No coverage on new datasets

Every new table is unmonitored until someone writes a rule. In fast-moving pipelines, that window is weeks.

Seasonality breaks your alerts

Monday volume spikes. Weekend nulls drop. Static rules can't distinguish a real anomaly from a known pattern.

How DataHub helps

A better way to detect data anomalies

Smart Assertions use ML to learn what normal looks like across your datasets, then alert you the moment something shifts.

ML learns your data patterns

Smart Assertions build a statistical baseline for every monitored dataset. No thresholds to write. No rules to maintain. The model adapts as your data evolves.

  • Learns seasonality, trends, and expected variance
  • Reduces false positives from known traffic patterns
  • No manual recalibration after schema changes
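The idea behind a learned baseline can be sketched in a few lines. This is an illustrative model, not DataHub's actual implementation: it judges a Monday volume against past Mondays rather than against all days, so a known weekly spike is normal while a genuine drop is flagged.

```python
from statistics import mean, stdev

def is_anomaly(history, day_of_week, value, z_threshold=3.0):
    """Compare a new observation against the baseline for the same weekday.

    history: list of (day_of_week, row_count) observations.
    Returns True when value deviates more than z_threshold standard
    deviations from that weekday's historical mean.
    """
    same_day = [v for d, v in history if d == day_of_week]
    if len(same_day) < 2:
        return False  # not enough history to judge yet
    mu, sigma = mean(same_day), stdev(same_day)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold

# Four weeks of Monday volumes around 10k: a 50% drop is flagged,
# while another typical Monday is not.
mondays = [(0, v) for v in (9800, 10100, 9950, 10200)]
print(is_anomaly(mondays, 0, 5000))   # large drop
print(is_anomaly(mondays, 0, 10050))  # normal Monday
```

A production model layers trend and variance adaptation on top of this, but the weekday grouping is what keeps a Monday spike from paging anyone.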

Five detection types, one platform

DataHub monitors volume, freshness, schema drift, column distribution, and custom SQL assertions in a single unified view. Anomaly detection across every dimension of data health is built in.

  • Volume, freshness, schema, column, and custom checks
  • Unified alert routing to Slack, PagerDuty, or webhook
  • Assertion history tracked per dataset over time

Scale anomaly detection across every dataset

One monitoring rule can cover an entire schema or warehouse. DataHub propagates assertions automatically as new tables appear, so detection coverage grows with your platform.

  • Auto-propagate rules to new tables in a schema
  • Monitor hundreds of assets without per-table config
  • Works across Snowflake, BigQuery, Databricks, and more
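The coverage model is simple to picture. The rule shape and function below are hypothetical, not DataHub's API; the point is that one pattern-based rule matches every current and future table in a schema:

```python
from fnmatch import fnmatch

# Hypothetical rule: one freshness assertion applied to a whole schema.
rules = [
    {"pattern": "analytics.prod.*", "assertion": "freshness", "max_lag_hours": 6},
]

def assertions_for(table, rules):
    """Return the assertions that apply to a fully qualified table name."""
    return [r for r in rules if fnmatch(table, r["pattern"])]

# An existing table and a table created after the rule was written
# both inherit the same assertion, with no per-table config.
print(assertions_for("analytics.prod.orders", rules))
print(assertions_for("analytics.prod.new_events", rules))
print(assertions_for("analytics.staging.tmp", rules))  # not covered
```

New tables matching the pattern are monitored the moment they appear, which closes the "unmonitored until someone writes a rule" window.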

Data Health Dashboard at a glance

The Data Health Dashboard surfaces assertion pass rates, incident trends, and asset health scores in one view. Engineers and data owners see the same picture without switching tools.

  • Asset health scores tied to lineage context
  • Incident history and resolution tracking built in
  • Shareable views for data owners and stakeholders

Getting started

How it works

Three steps from connection to active anomaly detection across your entire data platform.

Connect your sources

  • Ingest from Snowflake, BigQuery, Databricks, dbt, and 80+ sources
  • No pipeline rebuilds required to start collecting metadata
  • Scheduled or event-driven ingestion to fit your cadence
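An ingestion run is driven by a recipe. The sketch below assembles a minimal Snowflake recipe as a Python dict; the exact config keys vary by connector version, so treat the field names as indicative and check the connector docs for your release.

```python
import json

# Minimal ingestion recipe, expressed as a dict. Field names are
# indicative; verify against the Snowflake connector docs for your version.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            "account_id": "my_account",        # hypothetical values
            "warehouse": "COMPUTE_WH",
            "username": "${SNOWFLAKE_USER}",   # read-only credentials
            "password": "${SNOWFLAKE_PASS}",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

# Recipes are normally written as YAML and run with:
#   datahub ingest -c recipe.yml
print(json.dumps(recipe, indent=2))
```

Credentials stay in environment variables, and the sink points at your own DataHub instance, so nothing about the pipeline itself changes.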

Contextualize with ML baselines

  • Smart Assertions build statistical baselines per dataset automatically
  • Models account for seasonality, trends, and expected variance
  • Lineage context links anomalies to upstream root causes

Activate alerts and monitoring

  • Route alerts to Slack, PagerDuty, or any webhook endpoint
  • Data Health Dashboard tracks incidents and resolution over time
  • New tables inherit monitoring rules without manual configuration
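On the receiving side, a webhook alert folds into an existing incident workflow with a small handler. The payload shape below is a stand-in, not DataHub's documented schema; read the field names from an actual alert in your environment.

```python
import json

def route_alert(raw_payload):
    """Parse an assertion-failure webhook and decide where it goes.

    The payload fields here are hypothetical stand-ins for whatever
    your webhook endpoint actually receives.
    """
    alert = json.loads(raw_payload)
    severity = "page" if alert.get("assertionType") == "freshness" else "notify"
    return {
        "channel": "#data-oncall" if severity == "page" else "#data-quality",
        "summary": f'{alert["dataset"]}: {alert["assertionType"]} assertion failed',
    }

# A stale critical table pages on-call; lesser failures go to a channel.
sample = json.dumps({
    "dataset": "analytics.prod.orders",
    "assertionType": "freshness",
})
print(route_alert(sample))
```

The same handler pattern works behind Slack, PagerDuty, or a home-grown incident queue.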

Enterprise ready

Built for enterprise-grade scale and security

DataHub deploys in your environment, integrates with your existing stack, and meets the security requirements your organization demands.

Deployment options

  • Self-hosted on Kubernetes or managed cloud via DataHub Cloud
  • VPC deployment available for regulated industries
  • Apache 2.0 open source core with no vendor lock-in

API-first architecture

  • GraphQL and REST APIs for programmatic assertion management
  • Integrate anomaly alerts into existing incident workflows
  • Python SDK and CLI for infrastructure-as-code teams
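Programmatic access goes through the GraphQL endpoint. The sketch below assembles a request against `/api/graphql` with only the standard library; the query text itself is illustrative, so verify field names against the schema served by your DataHub instance (for example, via its GraphiQL explorer).

```python
import json
import urllib.request

# Illustrative query; confirm fields against your instance's GraphQL schema.
query = """
query datasetHealth($urn: String!) {
  dataset(urn: $urn) {
    name
  }
}
"""

def build_request(server, urn, token):
    """Assemble the HTTP request for DataHub's /api/graphql endpoint."""
    body = json.dumps({"query": query, "variables": {"urn": urn}}).encode()
    return urllib.request.Request(
        url=f"{server}/api/graphql",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # personal access token
        },
    )

req = build_request(
    "http://localhost:8080",
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.prod.orders,PROD)",
    "<token>",
)
print(req.full_url)
# To execute: urllib.request.urlopen(req), then json-decode the response body.
```

The same pattern drives assertion management from CI or infrastructure-as-code, with the token scoped by role-based access control.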

Security and compliance

Role-based access control on all monitored assets
SSO via SAML 2.0 and OIDC for enterprise identity providers
Audit logs for every assertion change and alert action

Peer validation

Trusted by modern data teams

Gartner Peer Insights · Enterprise Data Platform Team

Outcome: reduced anomaly detection time from days to hours

"DataHub gave us the ability to detect data anomalies across our warehouse without writing a single threshold rule. The ML-based assertions caught issues our manual checks had missed for months."

— Gartner Peer Insights Reviewer, Senior Data Platform Engineer, Financial Services

Common questions

Frequently asked questions about data anomaly detection

How long before Smart Assertions start detecting anomalies?

Smart Assertions begin building a statistical baseline from the first ingestion run. Meaningful anomaly detection typically requires seven to fourteen days of historical data, depending on the cadence of your pipeline. For datasets with strong seasonality, a full four-week window produces the most reliable results. DataHub surfaces confidence indicators so you know when a baseline is mature enough to trust.

Does DataHub read or store my underlying data?

For volume and freshness assertions, DataHub queries metadata and system tables rather than row-level data. Column-level distribution checks do execute lightweight aggregate queries against your warehouse. DataHub never moves or stores your underlying data. All queries run within your existing warehouse permissions and are visible in your warehouse's query history.

How does DataHub keep warehouse costs under control on large tables?

DataHub uses statistical sampling and aggregate metrics rather than full table scans for large datasets. This keeps warehouse compute costs predictable while maintaining detection accuracy. You can configure sampling rates and assertion schedules per dataset or apply them at the schema level. For tables with billions of rows, the approach is designed to be cost-aware by default.

Can alerts feed into our existing incident tooling?

Yes. DataHub supports outbound webhooks, Slack notifications, and PagerDuty integrations out of the box. For teams with custom incident pipelines, the GraphQL and REST APIs allow you to subscribe to assertion failure events programmatically. Alert payloads include asset context, lineage information, and the specific assertion that failed so your on-call team has what they need without switching tools.

How is this different from dbt tests or Great Expectations?

dbt tests and Great Expectations are rule-based: you define the expected value and the test checks against it. DataHub's Smart Assertions are model-based: they learn what normal looks like and flag deviations without requiring you to specify thresholds. The two approaches are complementary. DataHub ingests dbt test results and surfaces them alongside ML-based assertions in a unified health view, so you do not have to choose one or the other.

What does setup look like for Snowflake and dbt?

DataHub has native ingestion connectors for both Snowflake and dbt. You configure the connectors with read-only credentials, run the first ingestion, and DataHub begins building the metadata graph and baseline models. Most teams complete initial setup in a single day. The timeline to meaningful anomaly detection coverage depends on how many datasets you want to monitor and how much historical data is available for baseline training.

Ready to stop finding anomalies in the standup?

DataHub's ML-powered anomaly detection covers your entire data platform without manual threshold maintenance. See how it works in your environment.

  • No manual thresholds required
  • Works with your existing stack
  • Scoped to your environment

You will speak with a DataHub engineer about your specific environment. Not a generic walkthrough.