Open Source Data Governance at Scale
Your pipelines pass. Your dashboards break anyway. DataHub gives platform engineers automated lineage, fine-grained access control, and 90+ integrations with no vendor lock-in.
- Automated table- and column-level lineage across your full stack
- Policy-based access control targeting domains, tags, and containers
- 91 native source integrations including Snowflake, dbt, and Looker
See DataHub govern your data
A DataHub engineer will scope the demo to your environment.
What does ungoverned data infrastructure cost?
Broken trust in data assets costs more than engineer-hours. It costs credibility, compliance standing, and the ability to ship.
Lineage gaps break trust
When a dashboard breaks, you spend hours tracing upstream dependencies manually. The pipeline passed. The data was wrong.
Access control is manual
Permissions live in spreadsheets or tribal knowledge. Auditors ask who can see what. You cannot answer quickly.
No shared data vocabulary
Finance defines revenue one way. Product defines it another. Governance without a glossary is governance in name only.
Vendor lock-in compounds risk
Proprietary governance tools own your metadata model. Switching costs grow every quarter you stay.
A better way to govern your data platform
DataHub replaces manual, fragmented governance with automated lineage, policy enforcement, and a shared metadata layer your whole organization can trust.
End-to-end lineage visibility
DataHub tracks table- and column-level lineage automatically across your full stack. When a schema changes upstream, you know what breaks downstream before your stakeholders do.
- Column-level lineage across Snowflake, dbt, Looker, and Spark
- Upstream and downstream impact analysis on every asset
- Lineage propagation for tags, glossary terms, and ownership
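To make impact analysis concrete, here is a minimal sketch of the idea in plain Python. It does not use the DataHub API: the edge list, the asset names, and the `downstream_impact` helper are all hypothetical, and the traversal is an ordinary breadth-first search over upstream-to-downstream edges.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (upstream, downstream). Asset names are made up.
EDGES = [
    ("snowflake.raw.orders", "dbt.stg_orders"),
    ("dbt.stg_orders", "dbt.fct_revenue"),
    ("dbt.fct_revenue", "looker.revenue_dashboard"),
    ("snowflake.raw.customers", "dbt.dim_customers"),
]


def downstream_impact(edges, changed_asset):
    """Return every asset downstream of changed_asset, nearest first."""
    children = defaultdict(list)
    for upstream, downstream in edges:
        children[upstream].append(downstream)

    seen = {changed_asset}  # guards against cycles and self-listing
    impacted = []
    queue = deque([changed_asset])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


impacted = downstream_impact(EDGES, "snowflake.raw.orders")
print(impacted)
# ['dbt.stg_orders', 'dbt.fct_revenue', 'looker.revenue_dashboard']
```

A schema change on the raw orders table reaches the revenue dashboard two hops later; the customers branch is untouched and does not appear.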
Policy-based access control
Policy-based access control lets you define permissions by resource type, domain, tag, container, or glossary term. Role-based defaults get teams productive without custom configuration.
- Platform and metadata policies with granular resource targeting
- RBAC roles: Admin, Editor, and Reader out of the box
- Target policies by domain, tag, container, or glossary term
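As a rough illustration of resource targeting, the sketch below models a policy as a set of roles, privileges, and optional filters on domain, tag, and container. This is a simplified stand-in, not DataHub's policy engine: the `Asset` and `Policy` classes, the `is_allowed` helper, the privilege name, and the URNs are hypothetical. Only the role names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    urn: str
    domain: str
    tags: frozenset = frozenset()
    container: str = ""


@dataclass(frozen=True)
class Policy:
    name: str
    roles: frozenset
    privileges: frozenset
    domains: frozenset = frozenset()
    tags: frozenset = frozenset()
    containers: frozenset = frozenset()

    def targets(self, asset):
        # An empty filter means "no restriction on this dimension".
        if self.domains and asset.domain not in self.domains:
            return False
        if self.tags and not (self.tags & asset.tags):
            return False
        if self.containers and asset.container not in self.containers:
            return False
        return True


def is_allowed(policies, role, privilege, asset):
    """True if any policy grants this role the privilege on this asset."""
    return any(
        role in p.roles and privilege in p.privileges and p.targets(asset)
        for p in policies
    )


# Hypothetical policy: Editors may edit anything in the finance domain.
POLICIES = [
    Policy(
        name="finance-editors",
        roles=frozenset({"Editor"}),
        privileges=frozenset({"EDIT"}),
        domains=frozenset({"finance"}),
    ),
]

revenue = Asset(urn="example:fct_revenue", domain="finance", tags=frozenset({"pii"}))
campaigns = Asset(urn="example:campaigns", domain="marketing")

print(is_allowed(POLICIES, "Editor", "EDIT", revenue))    # True
print(is_allowed(POLICIES, "Reader", "EDIT", revenue))    # False
print(is_allowed(POLICIES, "Editor", "EDIT", campaigns))  # False
```

Targeting by domain rather than by individual asset is what lets one policy cover every new table that lands in that domain.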
Shared business glossary
DataHub's business glossary supports hierarchical term groups, ownership assignment, and propagation to linked assets so definitions stay consistent across every team.
- Hierarchical term groups with assigned stewards and owners
- Glossary terms propagate automatically to linked data assets
- Consistent definitions across every domain and data product
Searchable data catalog
DataHub indexes metadata from every connected source into a searchable catalog. Engineers and analysts find verified, documented datasets without filing tickets or pinging Slack.
- Full-text search across tables, dashboards, pipelines, and features
- Dataset health signals: freshness, schema changes, and ownership
- Verified dataset badges surfaced directly in search results
From connection to governed data catalog
Three steps from your existing stack to a fully governed, searchable metadata layer.
Connect your data sources
- 91 pre-built connectors, no pipeline rebuilding required
- Schedule-based or event-driven ingestion cadence
- CLI or UI-based setup for your first pipeline
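For a sense of what a first pipeline looks like, the sketch below builds a recipe as a Python dictionary: one source, one sink. Treat the details as assumptions to check against the connector documentation: the Snowflake field names, the server URL, and the environment variable names are placeholders, and `missing_keys` is a hypothetical helper, not part of DataHub.

```python
import json

# Recipe shape: one source and one sink. Every value here is a placeholder.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            # Field names are assumptions; confirm them in the connector docs.
            "account_id": "my_account",
            "username": "${SNOWFLAKE_USER}",
            "password": "${SNOWFLAKE_PASSWORD}",
        },
    },
    "sink": {
        "type": "datahub-rest",
        # Assumed local address; point this at your own deployment.
        "config": {"server": "http://localhost:8080"},
    },
}


def missing_keys(r):
    """Return the required keys that are absent (hypothetical sanity check)."""
    problems = []
    for section in ("source", "sink"):
        block = r.get(section)
        if not isinstance(block, dict) or "type" not in block:
            problems.append(f"{section}.type")
    return problems


print(missing_keys(recipe))
print(json.dumps(recipe, indent=2))
```

Recipes are usually written as YAML files and passed to the CLI with `datahub ingest -c <file>`; the dictionary above mirrors that structure so it can live in version control or be generated per environment.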
Contextualize assets with ownership
- Ownership and stewardship assigned per asset or domain
- Glossary terms and tags propagate through the lineage graph
- Context follows data across every downstream consumer
Activate policies across your platform
- Policies enforced by role, domain, tag, or container
- Governance framework maps to your org structure
- Teams like yours have already done this at scale
Built to fit your stack, not replace it
DataHub exposes a GraphQL API and Python SDK so your platform team can automate ingestion, extend the metadata model, and integrate governance into existing CI/CD workflows.
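As a sketch of what calling the GraphQL API from a script might look like, the example below builds a dataset search request using only the Python standard library. The endpoint URL, the token placeholder, and the `build_search_request` helper are assumptions for illustration; verify the query fields against the GraphQL schema of your DataHub version before relying on them. The request is built but not sent, so the sketch runs offline.

```python
import json
import urllib.request

# Assumed endpoint; the host, port, and path depend on your deployment.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults { entity { urn type } }
  }
}
"""


def build_search_request(text, token, url=GRAPHQL_URL, count=10):
    """Build (but do not send) a POST request that searches datasets."""
    payload = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": text, "start": 0, "count": count}
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_search_request("revenue", token="<access-token>")
print(request.get_method(), request.full_url)

# To execute against a live instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The same request shape works from a CI job, which is one way to wire governance checks into an existing pipeline.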
Self-hosted on your infrastructure
- Kubernetes deployment in your own cloud account
- Full control over data, network, and upgrade cadence
- Apache 2.0 license, no feature gating
DataHub Cloud, fully managed
- Managed deployment, zero operational overhead
- Open metadata model preserved, no proprietary lock-in
- Same GraphQL API and Python SDK as self-hosted
Data governance framework tools and extensibility
- GraphQL API for metadata reads and writes
- Python SDK for ingestion automation
- Custom metadata aspects via extensible model
- CI/CD integration for governance-as-code
- Event-driven ingestion via Kafka integration
- 12,000+ GitHub stars, active open source community
Rated by practitioners, not analysts
"DataHub gave us a single place to understand our data assets, their lineage, and who owns them. The open source model meant we could extend it to fit our architecture."



