Data Catalog Software Built for Platform Engineers
Your pipelines pass. Your dashboards break anyway. DataHub connects 60+ sources, tracks lineage down to the column, and surfaces quality issues before your stakeholders do.
- Connect Snowflake, dbt, Airflow, and 60+ sources without rebuilding pipelines
- Column-level lineage across your full stack, from ingestion to BI
- Catch data quality failures before they reach a dashboard or standup
See DataHub in your environment
A DataHub engineer will walk through your specific stack, not a generic script.
What does metadata debt cost your team?
Every hour spent tracing a broken pipeline is an hour not spent building. The cost compounds quietly until it surfaces in an incident.
Lineage gaps slow every incident
When a pipeline breaks, engineers spend hours tracing impact manually. Without automated lineage, every incident takes longer than it should.
Schema changes break downstream assets
A renamed column in Snowflake silently breaks three dashboards and two ML features. No alert fires. You find out in the standup.
No single source of truth
Definitions live in Confluence, Slack, and tribal knowledge. Different teams trust different numbers. Decisions stall while people argue about the data.
Governance is manual and fragile
Access policies live in spreadsheets. Compliance reviews require manual audits. One personnel change and a sensitive dataset is exposed.
A better way to manage your metadata
DataHub gives platform engineers the automation, depth, and extensibility that legacy catalog tools were never designed to provide.
Automated metadata discovery
DataHub crawls your stack and builds a unified metadata graph automatically. No manual tagging sprints, no stale wikis.
- 60+ pre-built ingestion connectors, maintained and versioned
- Scheduled and event-driven ingestion pipelines
- Custom source support via the open ingestion framework
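A connector is configured with a small recipe describing a source and a sink. The sketch below mirrors the documented source/sink recipe shape as a Python dict; the Snowflake config keys and credential values shown are illustrative placeholders, so check the connector docs for your version.

```python
import json

# Illustrative ingestion recipe, mirroring the shape of DataHub's YAML
# recipe files (a source block plus a sink block). The Snowflake config
# values are placeholders, not real defaults.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            "account_id": "my_account",        # placeholder
            "warehouse": "COMPUTE_WH",         # placeholder
            "username": "${SNOWFLAKE_USER}",   # resolved from the environment
            "password": "${SNOWFLAKE_PASS}",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},  # your DataHub endpoint
    },
}

print(json.dumps(recipe, indent=2))
```

The same structure, written as YAML, is what you would hand to the `datahub ingest` CLI or schedule from your orchestrator.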
Column-level lineage end to end
Trace any field from its source table through every transformation to the BI layer. Understand impact before you make a change, not after.
- Automated lineage from dbt, Spark, Airflow, and SQL parsing
- Upstream and downstream impact analysis in one view
- Cross-platform lineage across warehouses, lakes, and BI tools
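Impact analysis over column-level lineage is, at its core, a graph walk. This is a minimal sketch of the idea, not DataHub's implementation: a toy lineage map with hypothetical asset names, traversed breadth-first to find everything downstream of a column you plan to change.

```python
from collections import deque

# Toy column-level lineage graph: each key maps a column to the downstream
# columns that read from it. All asset names here are hypothetical.
LINEAGE = {
    "snowflake.orders.amount": ["dbt.fct_revenue.gross_amount"],
    "dbt.fct_revenue.gross_amount": [
        "looker.revenue_dashboard.total_revenue",
        "ml.churn_model.spend_feature",
    ],
}

def downstream_impact(column: str) -> list[str]:
    """Breadth-first walk to every asset affected by a change to `column`."""
    seen, queue, impacted = {column}, deque([column]), []
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted

# Renaming orders.amount would touch one dbt model, one dashboard tile,
# and one ML feature, two hops away.
print(downstream_impact("snowflake.orders.amount"))
```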
Data quality assertions and tracking
Define quality assertions on your datasets and track pass or fail status over time. Surface incidents in the catalog before they reach a stakeholder.
- Native assertions for freshness, volume, schema, and custom SQL
- Incident tracking linked to affected downstream assets
- Integrates with Great Expectations, dbt tests, and Soda
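To make the assertion types concrete, here is a minimal sketch of freshness and volume checks evaluated against a snapshot of dataset stats. DataHub's native assertions are richer than this; the stat fields and thresholds below are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Minimal sketch of freshness and volume assertions. The stats dict stands
# in for profiling metadata collected about a dataset.
def check_freshness(last_updated: datetime, max_age: timedelta) -> bool:
    """Pass if the dataset was updated within the allowed window."""
    return datetime.now(timezone.utc) - last_updated <= max_age

def check_volume(row_count: int, min_rows: int) -> bool:
    """Pass if the dataset has at least the expected number of rows."""
    return row_count >= min_rows

stats = {
    "last_updated": datetime.now(timezone.utc) - timedelta(hours=2),
    "row_count": 1_250_000,
}
results = {
    "freshness_under_24h": check_freshness(stats["last_updated"], timedelta(hours=24)),
    "volume_at_least_1m": check_volume(stats["row_count"], 1_000_000),
}
print(results)  # both assertions pass for this snapshot
```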
API-first platform automation
DataHub exposes every capability through GraphQL and REST APIs. Automate metadata workflows, build internal tooling, and integrate with your existing platform stack.
- Full GraphQL and OpenAPI surface for every metadata operation
- Python SDK for programmatic ingestion and enrichment
- Webhook and event stream support for real-time metadata updates
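As a taste of the API surface, the sketch below builds a GraphQL search request with only the standard library. The `/api/graphql` path and the `search(input: ...)` query shape follow DataHub's GraphQL API, but verify the field names against the schema of your deployed version; the endpoint URL and token are placeholders.

```python
import json
import urllib.request

# Sketch of a GraphQL dataset search. Field names follow DataHub's
# GraphQL search API; confirm them against your deployment's schema.
query = """
query findDatasets($text: String!) {
  search(input: { type: DATASET, query: $text, start: 0, count: 10 }) {
    searchResults { entity { urn } }
  }
}
"""
payload = json.dumps({"query": query, "variables": {"text": "orders"}}).encode()
req = urllib.request.Request(
    "http://localhost:8080/api/graphql",              # placeholder endpoint
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <token>",            # personal access token
    },
)
# urllib.request.urlopen(req) would execute the search; it is left out so
# the sketch runs without a live server.
print(json.loads(payload)["variables"])
```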
Three steps to a governed data platform
DataHub works with your existing stack. No rip-and-replace, no rebuilding pipelines, no months-long implementation projects.
Connect your existing sources
- Deploy the ingestion framework against Snowflake, BigQuery, dbt, Airflow, Looker, and 55+ more sources
- Schedule recurring syncs or trigger ingestion from your existing orchestration layer
- Custom sources connect via the open Python SDK without modifying your pipelines
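Triggering ingestion from an existing orchestration layer can be as simple as shelling out to the `datahub ingest` CLI from a scheduled task. The recipe path below is a placeholder; a real deployment would point at your own recipe file.

```python
import shlex

# Sketch of wiring ingestion into an orchestrator: the scheduled task just
# builds and runs the `datahub ingest` CLI command against a recipe file.
def ingest_command(recipe_path: str) -> list[str]:
    """Build the argv for a scheduled ingestion run."""
    return ["datahub", "ingest", "-c", recipe_path]

cmd = ingest_command("recipes/snowflake.yml")  # placeholder path
print(shlex.join(cmd))  # datahub ingest -c recipes/snowflake.yml
```

An Airflow task or cron entry would then invoke this with `subprocess.run(cmd, check=True)` so a failed sync fails the task.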
Contextualize with lineage and quality
- Column-level lineage is built automatically from SQL parsing, dbt manifests, and Spark plans
- Quality assertions run on a schedule and surface failures directly on the affected dataset page
- Business glossary terms and ownership metadata enrich every asset in the catalog
Activate governance across your org
- Role-based access control policies enforce who can view, edit, or own each asset
- Data products and domains give business teams a governed view without platform complexity
- Audit logs and compliance reports are available via API for your security and legal teams
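To illustrate the kind of rule resource-level RBAC enforces, here is a deliberately simplified policy model. DataHub's real policies are managed in the UI or via API and carry more structure; the roles, privileges, and domains below are hypothetical.

```python
from dataclasses import dataclass

# Toy model of a resource-level access policy. The role/privilege/domain
# triple is a simplification of what a real policy engine evaluates.
@dataclass(frozen=True)
class Policy:
    role: str
    privilege: str   # e.g. "VIEW", "EDIT"
    domain: str      # which data domain the policy covers

POLICIES = [
    Policy("analyst", "VIEW", "finance"),
    Policy("data-engineer", "EDIT", "finance"),
]

def is_allowed(role: str, privilege: str, domain: str) -> bool:
    """Grant access only when an explicit policy matches; deny by default."""
    return any(
        p.role == role and p.privilege == privilege and p.domain == domain
        for p in POLICIES
    )

print(is_allowed("analyst", "EDIT", "finance"))  # False: analysts can only view
```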
Built for enterprise security and scale
DataHub is deployed by some of the largest data teams in the world. It is designed to run at scale, on your infrastructure, with the security controls your organization requires.
Deployment options
- Self-hosted on Kubernetes via Helm chart, with full infrastructure control
- DataHub Cloud for managed deployment with SLA-backed uptime
- VPC and private link options for air-gapped or regulated environments
Security and access control
- Fine-grained RBAC with resource-level and metadata-level policies
- SSO via OIDC and SAML, with support for Okta, Azure AD, and Google Workspace
- Full audit log of every metadata read, write, and policy change
API and extensibility
- GraphQL and REST APIs cover every metadata entity and relationship
- Custom metadata models via the extensible metadata schema framework
- Kafka-based event stream for real-time metadata change notifications
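A consumer of the change event stream reacts to records describing which entity and aspect changed. The event below is loosely shaped after DataHub's metadata change log records (entity URN, aspect name, change type); exact field names vary by version, so treat this shape as an assumption to verify.

```python
import json

# Illustrative metadata change event, as a consumer might receive it off
# the Kafka change log. Field names are an assumption; check your version.
raw_event = json.dumps({
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.orders,PROD)",
    "aspectName": "schemaMetadata",
    "changeType": "UPSERT",
})

event = json.loads(raw_event)
if event["aspectName"] == "schemaMetadata":
    # e.g. re-run impact analysis or notify downstream owners
    print(f"schema change on {event['entityUrn']}")
```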
What platform engineers say about DataHub
"DataHub gave us column-level lineage across our entire warehouse and BI layer. We went from spending half a day tracing a broken pipeline to knowing the impact in minutes. The API-first design meant we could wire it into our existing platform tooling without rebuilding anything."



