Apache 2.0 Licensed

Open source data governance at scale

Your pipelines pass. Your dashboards break anyway. DataHub gives platform engineers automated lineage, fine-grained access control, and 91 native integrations with no vendor lock-in.

  • Automated table- and column-level lineage across your full stack
  • Policy-based access control targeting domains, tags, and containers
  • 91 native source integrations including Snowflake, dbt, and Looker

See DataHub govern your data

A DataHub engineer will scope the demo to your environment.

Trusted by modern data teams

The real cost

What does ungoverned data infrastructure cost?

Broken trust in data assets costs more than engineer-hours. It costs credibility, compliance standing, and the ability to ship.

Data governance solution gaps in modern platforms

Lineage gaps break trust

When a dashboard breaks, you spend hours tracing upstream dependencies manually. The pipeline passed. The data was wrong.

Access control is manual

Permissions live in spreadsheets or tribal knowledge. Auditors ask who can see what. You cannot answer quickly.

No shared data vocabulary

Finance defines revenue one way. Product defines it another. Governance without a glossary is governance in name only.

Vendor lock-in compounds risk

Proprietary governance tools own your metadata model. Switching costs grow every quarter you stay.

Data governance software

A better way to govern your data platform

DataHub replaces manual, fragmented governance with automated lineage, policy enforcement, and a shared metadata layer your whole organization can trust.

Data governance automation tools built into the platform

End-to-end lineage visibility

DataHub tracks table- and column-level lineage automatically across your full stack. When a schema changes upstream, you know what breaks downstream before your stakeholders do.

  • Column-level lineage across Snowflake, dbt, Looker, and Spark
  • Upstream and downstream impact analysis on every asset
  • Lineage propagation for tags, glossary terms, and ownership
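
To make the impact-analysis idea concrete, here is a minimal sketch that walks a column-level lineage graph breadth-first and lists everything downstream of a changed column. It illustrates the concept only and is not DataHub's implementation; the `LINEAGE` mapping, the column names, and `downstream_impact` are hypothetical.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns in its list.
# All asset names are invented for illustration.
LINEAGE = {
    "snowflake.orders.amount": ["dbt.fct_revenue.gross_amount"],
    "dbt.fct_revenue.gross_amount": [
        "looker.revenue_dashboard.total_revenue",
        "dbt.fct_margin.margin",
    ],
    "dbt.fct_margin.margin": ["looker.margin_dashboard.margin_pct"],
}


def downstream_impact(start, lineage):
    """Return every column reachable downstream of `start`, breadth-first."""
    seen = {start}  # guards against cycles in the graph
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream_impact("snowflake.orders.amount", LINEAGE)
for column in impacted:
    print(column)
```

In this toy graph, a change to `snowflake.orders.amount` reaches two dbt columns and two Looker fields, which is the kind of list you would want before merging a schema change.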

Policy-based access control

Policy-based access control lets you define permissions by resource type, domain, tag, container, or glossary term. Role-based defaults get teams productive without custom configuration.

  • Platform and metadata policies with granular resource targeting
  • RBAC roles: Admin, Editor, and Reader out of the box
  • Target policies by domain, tag, container, or glossary term
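
As a rough illustration of how a policy that targets domains and tags could be evaluated, the sketch below models policies and assets as plain Python objects. The `Policy` and `Asset` classes, the `edit_docs` privilege name, and the matching rules are assumptions made for this example, not DataHub's policy engine or schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    roles: frozenset
    privileges: frozenset
    # An empty target set means "do not filter on this attribute".
    domains: frozenset = frozenset()
    tags: frozenset = frozenset()


@dataclass(frozen=True)
class Asset:
    urn: str
    domain: str
    tags: frozenset = frozenset()


def is_allowed(role, privilege, asset, policies):
    """Return True if any policy grants `privilege` on `asset` to `role`."""
    for policy in policies:
        if role not in policy.roles or privilege not in policy.privileges:
            continue
        if policy.domains and asset.domain not in policy.domains:
            continue
        if policy.tags and not (policy.tags & asset.tags):
            continue
        return True
    return False


# Hypothetical policy: Editors may edit documentation in the Finance domain.
policies = [
    Policy(
        name="finance-editors",
        roles=frozenset({"Editor"}),
        privileges=frozenset({"edit_docs"}),
        domains=frozenset({"Finance"}),
    )
]
revenue_table = Asset(urn="example:revenue", domain="Finance")
clickstream = Asset(urn="example:clicks", domain="Product")

print(is_allowed("Editor", "edit_docs", revenue_table, policies))
print(is_allowed("Editor", "edit_docs", clickstream, policies))
```

Keeping targets as sets makes "who can see what" a query over data rather than a lookup in a spreadsheet.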

Shared business glossary

DataHub's business glossary supports hierarchical term groups, ownership assignment, and propagation to linked assets so definitions stay consistent across every team.

  • Hierarchical term groups with assigned stewards and owners
  • Glossary terms propagate automatically to linked data assets
  • Consistent definitions across every domain and data product

Searchable data catalog

DataHub indexes metadata from every connected source into a searchable catalog. Engineers and analysts find verified, documented datasets without filing tickets or pinging Slack.

  • Full-text search across tables, dashboards, pipelines, and features
  • Dataset health signals: freshness, schema changes, and ownership
  • Verified dataset badges surfaced directly in search results

How it works

From connection to governed data catalog

Three steps from your existing stack to a fully governed, searchable metadata layer.

DataHub as your data governance platform

Connect your data sources

DataHub ingests metadata from 91 native integrations including Snowflake, dbt, Kafka, Looker, and Airflow. Ingestion runs on a schedule or triggers on change events.

  • 91 pre-built connectors, no pipeline rebuilding required
  • Schedule-based or event-driven ingestion cadence
  • CLI or UI-based setup for your first pipeline
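
For a sense of what a first ingestion pipeline looks like, the sketch below builds a recipe as a Python dictionary with a source and a sink, the same shape a recipe file passed to `datahub ingest -c` uses. The Snowflake config field names, the environment-variable placeholders, the warehouse name, and the server URL are assumptions to verify against the connector documentation; `validate_recipe` is a hypothetical helper.

```python
import json

# Assumed recipe shape: one source and one sink, each with a type and config.
# Field names under "config" are illustrative and should be checked against
# the Snowflake connector documentation before use.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            "account_id": "${SNOWFLAKE_ACCOUNT_ID}",
            "username": "${SNOWFLAKE_USER}",
            "password": "${SNOWFLAKE_PASSWORD}",
            "warehouse": "COMPUTE_WH",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}


def validate_recipe(candidate):
    """Return a list of missing top-level pieces (empty if the shape is OK)."""
    missing = [key for key in ("source", "sink") if key not in candidate]
    missing += [
        f"{key}.type"
        for key in ("source", "sink")
        if key in candidate and "type" not in candidate[key]
    ]
    return missing


problems = validate_recipe(recipe)
print(json.dumps(recipe, indent=2) if not problems else problems)
```

A shape check like this can run in CI before a recipe is scheduled, so a malformed file fails at review time rather than at ingestion time.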

Contextualize assets with ownership

Assign owners, apply glossary terms, tag domains, and document datasets. Metadata propagates through lineage so context follows data wherever it moves.

  • Ownership and stewardship assigned per asset or domain
  • Glossary terms and tags propagate through the lineage graph
  • Context follows data across every downstream consumer
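
One way to picture propagation is as a fixed-point computation over the lineage graph: each asset passes its glossary terms to its downstream consumers until nothing new is added. The sketch below is a conceptual model under that assumption, not DataHub's propagation logic; the asset names, terms, and `propagate_terms` are hypothetical.

```python
def propagate_terms(direct_terms, lineage):
    """Return the terms each asset holds once terms have flowed downstream."""
    effective = {asset: set(terms) for asset, terms in direct_terms.items()}
    pending = list(direct_terms)
    while pending:
        asset = pending.pop()
        for child in lineage.get(asset, []):
            child_terms = effective.setdefault(child, set())
            new_terms = effective[asset] - child_terms
            if new_terms:
                # Only revisit a child when it actually gained a term,
                # which also guarantees termination on cyclic graphs.
                child_terms |= new_terms
                pending.append(child)
    return effective


# Hypothetical lineage and directly applied glossary terms.
lineage = {
    "raw.orders": ["analytics.revenue"],
    "analytics.revenue": ["dash.revenue_overview"],
}
direct_terms = {
    "raw.orders": {"Revenue"},
    "analytics.revenue": {"Certified"},
}

effective_terms = propagate_terms(direct_terms, lineage)
for asset in sorted(effective_terms):
    print(asset, sorted(effective_terms[asset]))
```

Here the dashboard ends up with both terms even though nobody tagged it directly, which is the practical meaning of context following data.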

Activate policies across your platform

Define access policies by role, domain, tag, or container. DataHub enforces them consistently so your governance framework reflects how your organization actually works.

  • Policies enforced by role, domain, tag, or container
  • Governance framework maps to your org structure
  • Teams like yours have already done this at scale

API-first architecture

Built to fit your stack, not replace it

DataHub exposes a GraphQL API and Python SDK so your platform team can automate ingestion, extend the metadata model, and integrate governance into existing CI/CD workflows.
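
As a hedged sketch of what automating against the GraphQL API might look like, the example below builds (but does not send) a search request using only the Python standard library. The endpoint path, the query shape, the token placeholder, and `build_search_request` are assumptions for illustration; check them against the GraphQL schema of your DataHub version.

```python
import json
import urllib.request

# Assumed endpoint for a local deployment; adjust host, port, and path.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

# Assumed query shape; verify field names against your schema.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults { entity { urn type } }
  }
}
"""


def build_search_request(text, token, count=10):
    """Build a POST request that searches datasets matching `text`."""
    payload = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": text, "start": 0, "count": count}
        },
    }
    return urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_search_request("revenue", token="<personal-access-token>")
print(request.get_method(), request.full_url)
# To execute it against a running instance:
#     with urllib.request.urlopen(request) as response:
#         print(json.load(response))
```

Building the request separately from sending it keeps the example runnable without a live server and makes the payload easy to unit test in CI.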

Open source data governance tools and integrations

Self-hosted on your infrastructure

Run DataHub on Kubernetes in your own cloud account. You control the data, the network, and the upgrade cadence.

  • Kubernetes deployment in your own cloud account
  • Full control over data, network, and upgrade cadence
  • Apache 2.0 license, no feature gating

DataHub Cloud, fully managed

DataHub Cloud removes operational overhead while preserving the open metadata model. No proprietary lock-in on your metadata.

  • Managed deployment, zero operational overhead
  • Open metadata model preserved, no proprietary lock-in
  • Same GraphQL API and Python SDK as self-hosted

Data governance framework tools and extensibility

  • GraphQL API for metadata reads and writes
  • Python SDK for ingestion automation
  • Custom metadata aspects via extensible model
  • CI/CD integration for governance-as-code
  • Event-driven ingestion via Kafka integration
  • 12,000+ GitHub stars, active open source community

Peer reviews

Rated by practitioners, not analysts

DataHub on Gartner Peer Insights

Native source integrations: 91 connectors

"DataHub gave us a single place to understand our data assets, their lineage, and who owns them. The open source model meant we could extend it to fit our architecture."
Gartner Peer Insights Reviewer, Data Engineering

Common questions

Questions engineers ask before evaluating DataHub

Comparing data governance software vendors

DataHub is licensed under Apache 2.0. The core platform, metadata model, and all 91 ingestion connectors are open source with no feature gating. You can inspect the code, contribute to it, and deploy it without a commercial agreement.

Most proprietary vendors lock your metadata into a closed model. DataHub's metadata graph is open, extensible via GraphQL, and portable if you ever change platforms. The trade-off is that self-hosted deployments require your team to manage infrastructure, which DataHub Cloud addresses for teams that prefer a managed option.

Most teams complete their first ingestion pipeline in under a day using the CLI or UI-based ingestion wizard. Complex multi-source setups typically take one to two sprints. The timeline depends on how many sources you are connecting and whether you need custom metadata aspects.

Yes. DataHub ships with platform and metadata policies, RBAC roles, and resource-level targeting by domain, tag, container, and glossary term. Admin, Editor, and Reader roles are available out of the box. Custom policies can be defined without writing code.

You can self-host DataHub on Kubernetes in your own cloud environment, or use DataHub Cloud for a managed deployment that preserves the open metadata model. Both options use the same GraphQL API and Python SDK, so your automation and integrations work identically across either deployment path.

See open source data governance in your environment

Book a demo and a DataHub engineer will walk through lineage, access control, and discovery using your stack as the example.

  • Apache 2.0 licensed
  • 91 native integrations
  • Scoped to your environment