ML-Powered Observability

Data Anomaly Detection for Modern Data Platforms

Your pipelines pass. Your dashboards break anyway. DataHub learns your data patterns and alerts your team before downstream consumers notice.

  • Catch volume spikes, stale tables, and schema drift automatically
  • ML models learn seasonality and trends, no threshold tuning required
  • Runs inside your VPC, your data never leaves your network

See data anomaly detection live

A self-guided tour of DataHub's observability layer. No sales call required.

The real cost

Why data anomalies are so hard to catch

By the time a broken dashboard surfaces in standup, the anomaly is hours old. The cost is rarely the fix; it is the time spent finding it.

Volume spikes go unnoticed

A table doubles in size overnight. No alert fires. A downstream model trains on corrupted data before anyone checks.

Stale data breaks trust

A pipeline silently stops updating. Stakeholders make decisions on data that is 18 hours old and do not know it.

Schema changes break pipelines

A column gets renamed upstream. Three dashboards break. You spend the morning tracing the source.

Column drift hides in plain sight

Null rates creep up. Cardinality shifts. No single row looks wrong, but the aggregate tells a different story.

How DataHub helps

A better way to detect data anomalies

DataHub monitors your data landscape continuously, surfacing issues before they reach stakeholders.

Smart Assertions

ML models learn what normal looks like

DataHub's Smart Assertions use time-series forecasting to learn seasonality, trends, and normal variation in your data. No manual thresholds. No rules to maintain.

  • Prophet-based forecasting adapts to weekly and monthly patterns
  • Sensitivity controls let you tune signal-to-noise per dataset
  • Mark false positives to retrain models and improve accuracy
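As a rough illustration of the forecasting idea, the toy sketch below learns a per-weekday baseline from history and flags large deviations. It is a plain-Python stand-in for forecast-based detection, not DataHub's Prophet implementation, and every name in it is hypothetical:

```python
from statistics import mean, stdev

def detect_volume_anomaly(history, today_count, weekday, k=3.0):
    """Flag today's row count if it deviates from the learned weekly
    pattern by more than k standard deviations.

    history: list of (weekday, row_count) observations.
    A toy stand-in for Prophet-style forecasting, which additionally
    models trend, monthly seasonality, and holiday effects.
    """
    same_day = [count for day, count in history if day == weekday]
    if len(same_day) < 2:
        return False  # not enough data to learn a baseline
    mu, sigma = mean(same_day), stdev(same_day)
    if sigma == 0:
        return today_count != mu
    return abs(today_count - mu) > k * sigma

# Mondays historically land near 100k rows; a sudden doubling is flagged.
history = [(0, 100_000), (0, 101_500), (0, 99_200), (0, 100_800)]
print(detect_volume_anomaly(history, 205_000, weekday=0))  # True
print(detect_volume_anomaly(history, 100_300, weekday=0))  # False
```

Marking a false positive in the UI corresponds, in this toy model, to adding the flagged point back into `history` so the baseline widens.
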

Volume and freshness

Catch stale and bloated tables early

Volume assertions detect unexpected row count changes, including spikes, drops, and growth-rate shifts. Freshness assertions monitor update patterns so stale tables surface before stakeholders notice.

  • Track total volume and change rates between scheduled checks
  • Monitor freshness via audit logs, information schema, or last-modified columns
  • Alert when tables miss expected refresh windows
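A freshness check of this kind reduces to comparing a table's last update against its expected refresh window. The sketch below is illustrative only; the function and parameter names are assumptions, and in practice the timestamp would come from audit logs, the information schema, or a last-modified column as described above:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_modified, expected_interval_hours, grace_hours=1, now=None):
    """Return True when a table has missed its expected refresh window.

    grace_hours gives the pipeline slack beyond its normal cadence
    before an alert fires. All names here are hypothetical.
    """
    now = now or datetime.now(timezone.utc)
    deadline = timedelta(hours=expected_interval_hours + grace_hours)
    return now - last_modified > deadline

now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)   # updated 3h ago
stale = datetime(2024, 4, 30, 6, 0, tzinfo=timezone.utc)  # updated 30h ago
print(is_stale(fresh, expected_interval_hours=6, now=now))  # False
print(is_stale(stale, expected_interval_hours=6, now=now))  # True
```
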

Column and schema

Surface column drift automatically

Column-level assertions track null counts, cardinality, distributions, and value ranges. Schema assertions catch unexpected column additions, type changes, and deletions before they propagate downstream.

  • Detect outliers in null rates, min/max values, and cardinality
  • Validate column values against regex patterns, ranges, or allowed sets
  • Alert on schema changes before downstream consumers break
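The schema half of this can be pictured as a diff between two column-to-type mappings. A minimal sketch, with all names hypothetical (DataHub derives real schemas from ingested metadata, not hand-built dicts):

```python
def diff_schema(before, after):
    """Report columns added, removed, or retyped between two snapshots.

    before/after: column-name -> type mappings. Illustrative only.
    """
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(col for col in set(before) & set(after)
                     if before[col] != after[col])
    return {"added": added, "removed": removed, "type_changed": changed}

before = {"user_id": "BIGINT", "email": "VARCHAR", "signup_ts": "TIMESTAMP"}
after = {"user_id": "VARCHAR", "email_address": "VARCHAR",
         "signup_ts": "TIMESTAMP"}
print(diff_schema(before, after))
# {'added': ['email_address'], 'removed': ['email'], 'type_changed': ['user_id']}
```

Any non-empty field in the result would trigger a schema alert before downstream consumers hit the change.
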

Monitoring Rules

Scale detection across thousands of tables

Monitoring Rules apply anomaly monitors automatically across your data landscape using search predicates. New datasets matching your criteria get monitors. Removed datasets lose them.

  • Filter by DataHub domain, platform, schema, tag, or custom attribute
  • Auto-apply monitors as new datasets are ingested
  • Manage coverage at scale without per-table configuration
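Conceptually, a Monitoring Rule is a predicate evaluated against each dataset's metadata: datasets that match get a monitor attached. The sketch below shows the matching idea with hypothetical field names; it is not DataHub's actual query syntax:

```python
def matches(dataset, rule):
    """True when a dataset satisfies every predicate in a rule.

    A rule field set to None (or omitted) matches anything; listed
    tags must all be present. Field names are illustrative.
    """
    return (
        rule.get("platform") in (None, dataset["platform"])
        and rule.get("domain") in (None, dataset["domain"])
        and set(rule.get("tags", [])) <= set(dataset["tags"])
    )

rule = {"platform": "snowflake", "tags": ["tier:critical"]}
datasets = [
    {"name": "fct_orders", "platform": "snowflake", "domain": "sales",
     "tags": ["tier:critical"]},
    {"name": "stg_events", "platform": "bigquery", "domain": "growth",
     "tags": []},
]
monitored = [d["name"] for d in datasets if matches(d, rule)]
print(monitored)  # ['fct_orders']
```

Re-evaluating the predicate on each ingestion run is what lets coverage follow the catalog: newly matching datasets gain monitors, removed ones lose them.
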

How it works

From connection to alert in three steps

DataHub works with your existing infrastructure. No pipeline rewrites. No new data stores.

Connect your data sources

  • DataHub ingests metadata from warehouses, lakes, and pipelines
  • Pre-built connectors for Snowflake, BigQuery, Redshift, and Databricks
  • No code changes to your existing infrastructure required

Contextualize with assertions

  • Define what good looks like using Smart Assertions or manual thresholds
  • Monitoring Rules apply coverage automatically as new datasets arrive
  • ML models begin learning your data patterns from the first ingestion

Activate alerts with lineage context

  • Route alerts to Slack, PagerDuty, or your incident management tool
  • Every alert includes lineage context so you know what is affected downstream

Teams like yours have already moved from reactive to proactive detection.
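The lineage context attached to an alert amounts to a reachability query over the dependency graph. A minimal sketch, assuming a hand-built edge map rather than DataHub's real lineage store:

```python
from collections import deque

def downstream_of(node, edges):
    """Collect everything reachable downstream of a failing dataset,
    so an alert can list the dashboards and models it affects.

    edges: dataset -> list of direct downstream datasets (a toy graph;
    DataHub derives real lineage from ingested metadata).
    """
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

lineage = {
    "raw.events": ["stg.events"],
    "stg.events": ["fct.sessions", "ml.churn_model"],
    "fct.sessions": ["dash.weekly_kpis"],
}
print(downstream_of("raw.events", lineage))
# ['dash.weekly_kpis', 'fct.sessions', 'ml.churn_model', 'stg.events']
```
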

Works with your stack

Connects to the platforms your team already uses

DataHub supports data anomaly detection across the most common cloud warehouses and lakes. A remote executor option keeps query traffic inside your network.

Supported warehouses and lakes

Snowflake
Google BigQuery
Amazon Redshift
Databricks
dbt and Airflow

Alerting integrations

Slack
PagerDuty
Microsoft Teams
Email

Remote executor keeps all query traffic inside your VPC

Gartner Peer Insights

Trusted by modern data teams

Financial Services

Data Engineering Lead

Outcome

Proactive anomaly detection before stakeholder impact

"DataHub gives our data engineering team visibility into quality issues we previously had no way to detect until a stakeholder reported them."

Data Engineering Lead

Financial Services

Quote sourced from Gartner Peer Insights. Gartner does not endorse any vendor, product, or service depicted in its research publications.

Common questions

Questions teams ask before getting started

How long does setup take?

Most teams complete initial ingestion and assertion configuration within a day. Monitoring Rules can extend coverage to new datasets without additional setup time. The complexity depends on how many sources you are connecting and whether you are using Smart Assertions or manual thresholds.

Which platforms does DataHub integrate with?

DataHub connects to Snowflake, BigQuery, Redshift, Databricks, and other common platforms. It also integrates with dbt, Airflow, and major alerting tools including Slack, PagerDuty, and Microsoft Teams. If your stack is not listed, the API layer supports custom integrations.

How do Smart Assertions avoid alert fatigue?

Smart Assertions learn your data's seasonal patterns before alerting. The Prophet-based forecasting model accounts for weekly and monthly variation so expected fluctuations do not trigger alerts. You can mark false positives directly in the UI to refine model accuracy over time. Sensitivity controls let you adjust the signal-to-noise ratio per dataset.

Does our data leave our network?

No. DataHub can run entirely within your VPC using the remote executor. Metadata is processed inside your network and never transmitted to DataHub's infrastructure. This is a common requirement for financial services and healthcare teams, and the remote executor is designed to support it without additional configuration overhead.

How do teams measure success?

Teams typically track mean time to detection and the volume of stakeholder-reported issues. Reducing reactive incidents is the most common outcome teams report. A secondary measure is the reduction in time engineers spend tracing the root cause of a broken dashboard or failed pipeline run.

See anomaly detection working on your data

The product tour walks through Smart Assertions, Monitoring Rules, and alerting using realistic examples. No installation needed to get started.

Runs inside your VPC
No installation required
Self-guided, no sales call