ML-Powered Observability

Data Anomaly Detection for Modern Data Platforms

Your pipelines pass. Your dashboards break anyway. DataHub learns your data patterns and alerts your team before downstream consumers notice.

  • Catch volume spikes, stale tables, and schema drift automatically
  • ML models learn seasonality and trends, no threshold tuning required
  • Runs inside your VPC, your data never leaves your network

See data anomaly detection live

A self-guided tour of DataHub's observability layer. No sales call required.

The real cost

Why data anomalies are so hard to catch

By the time a broken dashboard surfaces in standup, the anomaly is hours old. The cost is rarely the fix; it is the time spent finding it.

Volume spikes go unnoticed

A table doubles in size overnight. No alert fires. A downstream model trains on corrupted data before anyone checks.

Stale data breaks trust

A pipeline silently stops updating. Stakeholders make decisions on data that is 18 hours old and do not know it.

Schema changes break pipelines

A column gets renamed upstream. Three dashboards break. You spend the morning tracing the source.

Column drift hides in plain sight

Null rates creep up. Cardinality shifts. No single row looks wrong, but the aggregate tells a different story.

How DataHub helps

A better way to detect data anomalies

DataHub monitors your data landscape continuously, surfacing issues before they reach stakeholders.

Smart Assertions

ML models learn what normal looks like

DataHub's Smart Assertions use time-series forecasting to learn seasonality, trends, and normal variation in your data. No manual thresholds. No rules to maintain.

  • Prophet-based forecasting adapts to weekly and monthly patterns
  • Sensitivity controls let you tune signal-to-noise per dataset
  • Mark false positives to retrain models and improve accuracy
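As a rough illustration of the forecasting idea, the toy sketch below learns a per-weekday baseline from history and flags large deviations. It is a plain-Python stand-in for forecast-based detection, not DataHub's Prophet implementation, and every name in it is hypothetical:

```python
from statistics import mean, stdev

def detect_volume_anomaly(history, today_count, weekday, k=3.0):
    """Flag today's row count if it deviates from the learned weekly
    pattern by more than k standard deviations.

    history: list of (weekday, row_count) observations.
    A toy stand-in for Prophet-style forecasting, which additionally
    models trend, monthly seasonality, and holiday effects.
    """
    same_day = [count for day, count in history if day == weekday]
    if len(same_day) < 2:
        return False  # not enough data to learn a baseline
    mu, sigma = mean(same_day), stdev(same_day)
    if sigma == 0:
        return today_count != mu
    return abs(today_count - mu) > k * sigma

# Mondays historically land near 100k rows; a sudden doubling is flagged.
history = [(0, 100_000), (0, 101_500), (0, 99_200), (0, 100_800)]
print(detect_volume_anomaly(history, 205_000, weekday=0))  # True
print(detect_volume_anomaly(history, 100_300, weekday=0))  # False
```

Marking a false positive in the UI corresponds, in this toy model, to adding the flagged point back into `history` so the baseline widens.
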

Volume and freshness

Catch stale and bloated tables early

Volume assertions detect unexpected row count changes, including spikes, drops, and growth-rate shifts. Freshness assertions monitor update patterns so stale tables surface before stakeholders notice.

  • Track total volume and change rates between scheduled checks
  • Monitor freshness via audit logs, information schema, or last-modified columns
  • Alert when tables miss expected refresh windows
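A freshness check of this kind reduces to comparing a table's last update against its expected refresh window. The sketch below is illustrative only; the function and parameter names are assumptions, and in practice the timestamp would come from audit logs, the information schema, or a last-modified column as described above:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_modified, expected_interval_hours, grace_hours=1, now=None):
    """Return True when a table has missed its expected refresh window.

    grace_hours gives the pipeline slack beyond its normal cadence
    before an alert fires. All names here are hypothetical.
    """
    now = now or datetime.now(timezone.utc)
    deadline = timedelta(hours=expected_interval_hours + grace_hours)
    return now - last_modified > deadline

now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)   # updated 3h ago
stale = datetime(2024, 4, 30, 6, 0, tzinfo=timezone.utc)  # updated 30h ago
print(is_stale(fresh, expected_interval_hours=6, now=now))  # False
print(is_stale(stale, expected_interval_hours=6, now=now))  # True
```
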

Column and schema

Surface column drift automatically

Column-level assertions track null counts, cardinality, distributions, and value ranges. Schema assertions catch unexpected column additions, type changes, and deletions before they propagate downstream.

  • Detect outliers in null rates, min/max values, and cardinality
  • Validate column values against regex patterns, ranges, or allowed sets
  • Alert on schema changes before downstream consumers break
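The schema half of this can be pictured as a diff between two column-to-type mappings. A minimal sketch, with all names hypothetical (DataHub derives real schemas from ingested metadata, not hand-built dicts):

```python
def diff_schema(before, after):
    """Report columns added, removed, or retyped between two snapshots.

    before/after: column-name -> type mappings. Illustrative only.
    """
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(col for col in set(before) & set(after)
                     if before[col] != after[col])
    return {"added": added, "removed": removed, "type_changed": changed}

before = {"user_id": "BIGINT", "email": "VARCHAR", "signup_ts": "TIMESTAMP"}
after = {"user_id": "VARCHAR", "email_address": "VARCHAR",
         "signup_ts": "TIMESTAMP"}
print(diff_schema(before, after))
# {'added': ['email_address'], 'removed': ['email'], 'type_changed': ['user_id']}
```

Any non-empty field in the result would trigger a schema alert before downstream consumers hit the change.
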

Monitoring Rules

Scale detection across thousands of tables

Monitoring Rules apply anomaly monitors automatically across your data landscape using search predicates. New datasets matching your criteria get monitors. Removed datasets lose them.

  • Filter by DataHub domain, platform, schema, tag, or custom attribute
  • Auto-apply monitors as new datasets are ingested
  • Manage coverage at scale without per-table configuration
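Conceptually, a Monitoring Rule is a predicate evaluated against each dataset's metadata: datasets that match get a monitor attached. The sketch below shows the matching idea with hypothetical field names; it is not DataHub's actual query syntax:

```python
def matches(dataset, rule):
    """True when a dataset satisfies every predicate in a rule.

    A rule field set to None (or omitted) matches anything; listed
    tags must all be present. Field names are illustrative.
    """
    return (
        rule.get("platform") in (None, dataset["platform"])
        and rule.get("domain") in (None, dataset["domain"])
        and set(rule.get("tags", [])) <= set(dataset["tags"])
    )

rule = {"platform": "snowflake", "tags": ["tier:critical"]}
datasets = [
    {"name": "fct_orders", "platform": "snowflake", "domain": "sales",
     "tags": ["tier:critical"]},
    {"name": "stg_events", "platform": "bigquery", "domain": "growth",
     "tags": []},
]
monitored = [d["name"] for d in datasets if matches(d, rule)]
print(monitored)  # ['fct_orders']
```

Re-evaluating the predicate on each ingestion run is what lets coverage follow the catalog: newly matching datasets gain monitors, removed ones lose them.
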

How it works

From connection to alert in three steps

DataHub works with your existing infrastructure. No pipeline rewrites. No new data stores.

Connect your data sources

  • DataHub ingests metadata from warehouses, lakes, and pipelines
  • Pre-built connectors for Snowflake, BigQuery, Redshift, and Databricks
  • No code changes to your existing infrastructure required

Contextualize with assertions

  • Define what good looks like using Smart Assertions or manual thresholds
  • Monitoring Rules apply coverage automatically as new datasets arrive
  • ML models begin learning your data patterns from the first ingestion

Activate alerts with lineage context

  • Route alerts to Slack, PagerDuty, or your incident management tool
  • Every alert includes lineage context so you know what is affected downstream

Teams like yours have already moved from reactive to proactive detection.
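The lineage context attached to an alert amounts to a reachability query over the dependency graph. A minimal sketch, assuming a hand-built edge map rather than DataHub's real lineage store:

```python
from collections import deque

def downstream_of(node, edges):
    """Collect everything reachable downstream of a failing dataset,
    so an alert can list the dashboards and models it affects.

    edges: dataset -> list of direct downstream datasets (a toy graph;
    DataHub derives real lineage from ingested metadata).
    """
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

lineage = {
    "raw.events": ["stg.events"],
    "stg.events": ["fct.sessions", "ml.churn_model"],
    "fct.sessions": ["dash.weekly_kpis"],
}
print(downstream_of("raw.events", lineage))
# ['dash.weekly_kpis', 'fct.sessions', 'ml.churn_model', 'stg.events']
```
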

Works with your stack

Connects to the platforms your team already uses

DataHub supports data anomaly detection across the most common cloud warehouses and lakes. A remote executor option keeps query traffic inside your network.

Supported warehouses and lakes

Snowflake
Google BigQuery
Amazon Redshift
Databricks
dbt and Airflow

Alerting integrations

Slack
PagerDuty
Microsoft Teams
Email

Remote executor keeps all query traffic inside your VPC

Gartner Peer Insights

Trusted by modern data teams

Financial Services

Data Engineering Lead

Outcome

Proactive anomaly detection before stakeholder impact

"DataHub gives our data engineering team visibility into quality issues we previously had no way to detect until a stakeholder reported them."

Data Engineering Lead

Financial Services

Quote sourced from Gartner Peer Insights. Gartner does not endorse any vendor, product, or service depicted in its research publications.

Common questions

Questions teams ask before getting started

How long does setup take?

Most teams complete initial ingestion and assertion configuration within a day. Monitoring Rules can extend coverage to new datasets without additional setup time. The complexity depends on how many sources you are connecting and whether you are using Smart Assertions or manual thresholds.

Which platforms does DataHub integrate with?

DataHub connects to Snowflake, BigQuery, Redshift, Databricks, and other common platforms. It also integrates with dbt, Airflow, and major alerting tools including Slack, PagerDuty, and Microsoft Teams. If your stack is not listed, the API layer supports custom integrations.

How do Smart Assertions avoid alert fatigue?

Smart Assertions learn your data's seasonal patterns before alerting. The Prophet-based forecasting model accounts for weekly and monthly variation so expected fluctuations do not trigger alerts. You can mark false positives directly in the UI to refine model accuracy over time. Sensitivity controls let you adjust the signal-to-noise ratio per dataset.

Does our data leave our network?

No. DataHub can run entirely within your VPC using the remote executor. Metadata is processed inside your network and never transmitted to DataHub's infrastructure. This is a common requirement for financial services and healthcare teams, and the remote executor is designed to support it without additional configuration overhead.

How do teams measure success?

Teams typically track mean time to detection and the volume of stakeholder-reported issues. Reducing reactive incidents is the most common outcome teams report. A secondary measure is the reduction in time engineers spend tracing the root cause of a broken dashboard or failed pipeline run.

See anomaly detection working on your data

The product tour walks through Smart Assertions, Monitoring Rules, and alerting using realistic examples. No installation needed to get started.

Runs inside your VPC
No installation required
Self-guided, no sales call