Data monitoring

Data Quality and Anomaly Detection

Your pipelines pass. Your dashboards break anyway. DataHub learns your data's normal behavior and alerts you when something actually goes wrong.

  • Detect volume, freshness, schema, and field anomalies automatically

  • Auto-adjusting thresholds trained on 60 days of historical data

  • Lineage-aware alerts show which downstream systems are at risk

See anomaly detection live on your data

A DataHub engineer walks through your specific environment, not a generic script.

The real cost

Why do data anomalies keep slipping through?

Manual thresholds miss edge cases. Alert fatigue buries real failures. By the time your team finds the issue, it is already in a dashboard.

Alert fatigue is real

Too many low-signal alerts train teams to ignore notifications, so genuine failures go unnoticed until a stakeholder reports a broken dashboard.

Thresholds break at scale

Static rules written for last quarter's data volumes fail silently as pipelines grow, leaving coverage gaps no one catches.

Incidents surface too late

Without lineage context, teams spend hours tracing which reports or APIs consumed bad data before a fix can begin.

Root cause takes hours

Disconnected tools mean engineers manually correlate logs, schemas, and pipeline runs to find one upstream issue.

Smart assertions

Anomaly detection that learns your data

DataHub's ML-powered assertions establish baselines from your historical data and flag real deviations before they reach downstream consumers.

Volume and freshness monitoring

DataHub tracks row counts, byte volumes, and update cadences across every monitored dataset, flagging deviations before downstream consumers notice a problem.

  • Row-count and byte-volume checks on every run
  • Freshness assertions with configurable SLA windows
  • Alerts fire before consumers detect a problem
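As a rough sketch of the idea (not DataHub's implementation), a freshness assertion boils down to comparing a dataset's last update time against its SLA window. The timestamps and six-hour window below are hypothetical:

```python
from datetime import datetime, timedelta

def freshness_ok(last_updated: datetime, sla: timedelta, now: datetime) -> bool:
    """Return True if the dataset was updated within its SLA window."""
    return now - last_updated <= sla

# Hypothetical example: a table expected to refresh every 6 hours
now = datetime(2024, 1, 15, 12, 0)
print(freshness_ok(datetime(2024, 1, 15, 8, 0), timedelta(hours=6), now))   # True: fresh
print(freshness_ok(datetime(2024, 1, 14, 12, 0), timedelta(hours=6), now))  # False: stale
```

In practice the SLA window would come from the assertion's configuration rather than a hard-coded value.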

Field-level and schema validation

Assertions run at the column level, catching null rates, value distributions, and unexpected schema changes that aggregate checks miss entirely.

  • Null-rate and uniqueness checks per column
  • Distribution shift detection on numeric fields
  • Schema change alerts with field-level diff view
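To illustrate what a column-level check computes (a simplified stand-in, not DataHub's engine), a null-rate assertion can be expressed in a few lines; the column values and thresholds are made up:

```python
def null_rate(values: list) -> float:
    """Fraction of values in a column that are None."""
    return sum(v is None for v in values) / len(values)

def null_rate_assertion(values: list, max_rate: float) -> bool:
    """Pass when the observed null rate stays at or below the threshold."""
    return null_rate(values) <= max_rate

col = [1, 2, None, 4, None, 6, 7, 8, 9, 10]   # hypothetical column, 20% nulls
print(null_rate(col))                          # 0.2
print(null_rate_assertion(col, max_rate=0.25)) # True: within tolerance
print(null_rate_assertion(col, max_rate=0.1))  # False: too many nulls
```

Uniqueness and distribution-shift checks follow the same shape: compute a per-column statistic, then compare it to an expected range.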

Custom SQL rules with anomaly detection

Write assertions in SQL against any metric your team defines. DataHub's ML layer learns the expected range and flags anomalies without manual threshold tuning.

  • SQL-based assertions on any custom metric
  • ML-learned thresholds replace manual tuning
  • Version-controlled rules stored alongside metadata
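The pattern can be sketched in plain Python with sqlite3 standing in for your warehouse. The `orders` table, the metric query, and the z-score rule are illustrative assumptions, not DataHub's actual API:

```python
import sqlite3
import statistics

def run_metric(conn, sql: str) -> float:
    """Execute a single-value SQL metric query."""
    return float(conn.execute(sql).fetchone()[0])

def within_learned_range(value: float, history: list, k: float = 3.0) -> bool:
    """Flag values more than k standard deviations from the historical mean."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(value - mu) <= k * sigma

# Toy warehouse with a custom metric defined in SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?)", [(v,) for v in (10, 12, 11, 9, 13)])

today = run_metric(conn, "SELECT AVG(amount) FROM orders")   # 11.0
history = [10.8, 11.2, 10.9, 11.1, 11.0]                     # hypothetical prior runs
print(within_learned_range(today, history))                  # True: within learned range
```

The point is the division of labor: your team writes the SQL metric, and the learned range replaces the hand-tuned threshold.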

Incident management and lineage impact

When an assertion fails, DataHub opens an incident, maps affected downstream assets via lineage, and routes alerts to the right team through your existing channels.

  • Automatic incident creation on assertion failure
  • Lineage graph shows all affected downstream assets
  • Routes to Slack, PagerDuty, or your existing tools
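A simplified sketch of the flow: walk a lineage graph to find everything downstream of the failing dataset, then assemble a webhook-ready payload. The lineage map, asset names, and payload shape are made up for illustration; DataHub's real incident API is not shown:

```python
import json

# Toy lineage: dataset -> direct downstream consumers (assumed example data)
LINEAGE = {
    "raw.orders": ["dbt.fct_orders", "dash.revenue"],
    "dbt.fct_orders": ["dash.finance"],
}

def downstream(asset: str, lineage: dict) -> list:
    """Walk the lineage graph to collect every transitively affected asset."""
    seen, queue = [], list(lineage.get(asset, []))
    while queue:
        node = queue.pop(0)
        if node not in seen:
            seen.append(node)
            queue.extend(lineage.get(node, []))
    return seen

def incident_payload(asset: str, check: str) -> str:
    """Build a webhook-ready JSON body for a failed assertion."""
    return json.dumps({
        "incident": f"{check} failed on {asset}",
        "affected_downstream": downstream(asset, LINEAGE),
    })

print(incident_payload("raw.orders", "volume_check"))
# The JSON body can then be POSTed to Slack, PagerDuty, or any webhook endpoint.
```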

How it works

From connection to continuous monitoring

Three steps. Works with your existing stack. No rebuilding pipelines.

Connect your data sources

  • 50+ native connectors for warehouses, lakes, and transformation tools
  • Snowflake, BigQuery, Databricks, dbt, Looker, and more
  • No custom engineering work required to get started

Baselines established automatically

  • ML engine analyzes up to 60 days of historical pipeline runs
  • Expected ranges set for volume, freshness, and field metrics
  • Thresholds adapt to seasonality and growth trends over time
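The seasonality idea can be illustrated with a toy per-weekday baseline (mean ± 3 standard deviations). The row counts and dates below are synthetic, and DataHub's actual model is more sophisticated than this sketch:

```python
import statistics
from collections import defaultdict
from datetime import date, timedelta

def learn_baselines(history, k=3.0):
    """Learn a per-weekday expected row-count range (mean ± k·stdev), so a
    dataset that naturally spikes every Monday doesn't alert every Monday."""
    by_weekday = defaultdict(list)
    for day, rows in history:
        by_weekday[day.weekday()].append(rows)
    return {wd: (statistics.mean(v) - k * statistics.stdev(v),
                 statistics.mean(v) + k * statistics.stdev(v))
            for wd, v in by_weekday.items() if len(v) > 1}

def anomalous(day, rows, baselines):
    """True when the observed count falls outside the learned range."""
    low, high = baselines[day.weekday()]
    return not (low <= rows <= high)

# Four weeks of synthetic history: Mondays run hot (~1200 rows), other days ~1000
history = []
for week in range(4):
    monday = date(2024, 1, 1) + timedelta(weeks=week)   # 2024-01-01 is a Monday
    history.append((monday, 1200 + week))
    for d in range(1, 7):
        history.append((monday + timedelta(days=d), 1000 + week + d))

baselines = learn_baselines(history)
print(anomalous(date(2024, 1, 29), 1205, baselines))   # False: normal Monday spike
print(anomalous(date(2024, 1, 29), 1000, baselines))   # True: Monday volume dropped
```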

Smart assertions monitor continuously

  • Assertions run on every pipeline execution automatically
  • Failures open incidents and map lineage impact immediately
  • Alerts route to your team through Slack, PagerDuty, or webhooks
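Conceptually, the monitoring step reduces to evaluating a set of predicates against each pipeline run and turning failures into incident records. This sketch uses a made-up snapshot shape and checks, not DataHub's internals:

```python
# A snapshot of one pipeline run for a dataset (assumed example shape)
snapshot = {"name": "raw.orders", "row_count": 0, "hours_since_update": 2}

# Each assertion maps a name to a predicate over the snapshot
checks = {
    "volume_check": lambda s: s["row_count"] > 0,
    "freshness_check": lambda s: s["hours_since_update"] <= 6,
}

def run_assertions(snapshot: dict, checks: dict) -> list:
    """Evaluate every assertion; each failure becomes an incident record."""
    return [f"{name} failed on {snapshot['name']}"
            for name, check in checks.items()
            if not check(snapshot)]

print(run_assertions(snapshot, checks))   # ['volume_check failed on raw.orders']
```

In the real system the incident list would feed the lineage mapping and alert routing described above.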

Enterprise ready

Built for enterprise data quality at scale

DataHub fits into your existing infrastructure and governance model. Deployment is flexible, access is controlled, and every assertion is programmable.

Deployment and access control

  • Deploy on-premises, in your VPC, or as a managed cloud service
  • Role-based access control governs who can create or modify assertions

Integrations and extensibility

  • API-first design lets engineering teams automate assertion management
  • 50+ connectors cover warehouses, lakes, BI tools, and orchestrators

Peer reviewed

Trusted by modern data teams

"DataHub gives our data engineering team a single place to define, monitor, and act on data quality across every pipeline. The lineage integration is what sets it apart."
Gartner Peer Insights Reviewer
Engineer, Enterprise Services, 1B-10B USD

Frequently asked questions about data quality

What types of anomalies does DataHub detect?

DataHub detects volume, freshness, schema, field-level, and custom SQL-defined anomalies. The ML layer identifies deviations from learned baselines without requiring manual threshold configuration. Detection covers both structural changes and statistical shifts in your data's behavior over time.

How do auto-adjusting thresholds reduce alert fatigue?

Thresholds auto-adjust based on historical patterns, including seasonality and growth trends, so alerts reflect genuine deviations rather than expected variation in your data. This means a dataset that naturally grows 20% every Monday will not trigger a volume alert on Monday mornings. The result is fewer notifications and higher signal quality for the ones that do fire.

Can I define my own data quality rules?

Yes. DataHub supports custom SQL assertions against any metric your team defines. Rules are version-controlled and stored alongside your dataset metadata in the catalog. The ML layer then learns the expected range for those custom metrics and flags deviations automatically, so you get the precision of hand-written rules with the adaptability of learned thresholds.

What tools does DataHub integrate with?

DataHub routes incident alerts to Slack, PagerDuty, Microsoft Teams, and any webhook-compatible system. It also exposes a full API for custom integrations with your existing tooling. On the data source side, 50+ connectors cover the major warehouses, lakes, BI platforms, and orchestration tools your team already uses.

How long does it take to get up and running?

Most teams connect their first data source and activate initial assertions within a single day. Broader rollout timelines depend on the number of sources and your team's assertion strategy. The ML baseline period requires historical data to be available, so the quality of anomaly detection improves over the first few weeks as the system learns your data's normal patterns.

Catch data anomalies before they reach production

The demo is a working session with a DataHub engineer, focused on your environment and your data quality goals.

Apache 2.0 open source
60+ pre-built connectors
Self-hosted or managed deployment