Apache 2.0 Licensed

Open source data governance at scale

Your pipelines pass. Your dashboards break anyway. DataHub gives platform engineers automated lineage, fine-grained access control, and 90+ integrations with no vendor lock-in.

Automated table- and column-level lineage across your full stack
Policy-based access control targeting domains, tags, and containers
91 native source integrations including Snowflake, dbt, and Looker

Request a demo
Explore the platform

See DataHub govern your data

A DataHub engineer will scope the demo to your environment.

Trusted by modern data teams

The real cost

What does ungoverned data infrastructure cost?

Broken trust in data assets costs more than engineer-hours. It costs credibility, compliance standing, and the ability to ship.

Data governance solution gaps in modern platforms

Lineage gaps break trust

When a dashboard breaks, you spend hours tracing upstream dependencies manually. The pipeline passed. The data was wrong.

Access control is manual

Permissions live in spreadsheets or tribal knowledge. Auditors ask who can see what. You cannot answer quickly.

No shared data vocabulary

Finance defines revenue one way. Product defines it another. Governance without a glossary is governance in name only.

Vendor lock-in compounds risk

Proprietary governance tools own your metadata model. Switching costs grow every quarter you stay.

Data governance software

A better way to govern your data platform

DataHub replaces manual, fragmented governance with automated lineage, policy enforcement, and a shared metadata layer your whole organization can trust.

Data governance automation tools built into the platform

End-to-end lineage visibility

DataHub tracks table- and column-level lineage automatically across your full stack. When a schema changes upstream, you know what breaks downstream before your stakeholders do.

Column-level lineage across Snowflake, dbt, Looker, and Spark
Upstream and downstream impact analysis on every asset
Lineage propagation for tags, glossary terms, and ownership
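
As a rough illustration of downstream impact analysis, the sketch below walks a small column-level lineage graph breadth-first. The graph, the column names, and the `downstream_impact` helper are hypothetical and are not DataHub's API; DataHub builds the real graph from ingested metadata.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns listed in its value.
LINEAGE = {
    "snowflake.orders.amount": ["dbt.fct_revenue.gross_amount"],
    "dbt.fct_revenue.gross_amount": [
        "looker.revenue_dashboard.total_revenue",
        "dbt.fct_margin.margin",
    ],
    "dbt.fct_margin.margin": ["looker.margin_dashboard.margin_pct"],
}


def downstream_impact(lineage, start):
    """Return every column reachable downstream of `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guard against revisiting shared or cyclic nodes
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Which columns are affected if the Snowflake source column changes?
impacted = downstream_impact(LINEAGE, "snowflake.orders.amount")
```

Here `impacted` lists the dbt and Looker columns that depend on the changed source column, nearest dependents first.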

Policy-based access control

Policy-based access control lets you define permissions by resource type, domain, tag, container, or glossary term. Role-based defaults get teams productive without custom configuration.

Platform and metadata policies with granular resource targeting
RBAC roles: Admin, Editor, and Reader out of the box
Target policies by domain, tag, container, or glossary term
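
To make the targeting idea concrete, here is a minimal sketch of how a policy scoped by domain and tag might be evaluated. The `Policy` and `Asset` classes, the privilege names, and `is_allowed` are illustrative assumptions, not DataHub's policy engine; only the role names (Admin, Editor, Reader) come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    roles: frozenset          # roles the policy applies to
    privileges: frozenset     # actions the policy grants
    domains: frozenset = frozenset()  # empty means "any domain"
    tags: frozenset = frozenset()     # empty means "any tag"


@dataclass(frozen=True)
class Asset:
    urn: str
    domain: str
    tags: frozenset = frozenset()


def is_allowed(policies, role, privilege, asset):
    """Grant access if any policy matches the role, privilege, and asset scope."""
    for policy in policies:
        if role not in policy.roles or privilege not in policy.privileges:
            continue
        domain_ok = not policy.domains or asset.domain in policy.domains
        tags_ok = not policy.tags or bool(policy.tags & asset.tags)
        if domain_ok and tags_ok:
            return True
    return False  # deny by default


POLICIES = [
    Policy("finance-editors", frozenset({"Editor"}), frozenset({"EDIT_DOCS"}),
           domains=frozenset({"Finance"})),
    Policy("readers-view-all", frozenset({"Reader", "Editor"}), frozenset({"VIEW"})),
]
revenue = Asset("urn:example:fct_revenue", "Finance", frozenset({"certified"}))
clicks = Asset("urn:example:clickstream", "Product")
```

In this sketch an Editor can edit documentation on `revenue` because it sits in the Finance domain, but not on `clicks`, while both roles can view either asset.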

Shared business glossary

DataHub's business glossary supports hierarchical term groups, ownership assignment, and propagation to linked assets so definitions stay consistent across every team.

Hierarchical term groups with assigned stewards and owners
Glossary terms propagate automatically to linked data assets
Consistent definitions across every domain and data product

Searchable data catalog

DataHub indexes metadata from every connected source into a searchable catalog. Engineers and analysts find verified, documented datasets without filing tickets or pinging Slack.

Full-text search across tables, dashboards, pipelines, and features
Dataset health signals: freshness, schema changes, and ownership
Verified dataset badges surfaced directly in search results

How it works

From connection to governed data catalog

Three steps from your existing stack to a fully governed, searchable metadata layer.

DataHub as your data governance platform

Connect your data sources

DataHub ingests metadata from 91 native integrations including Snowflake, dbt, Kafka, Looker, and Airflow. Ingestion runs on a schedule or triggers on change events.

91 pre-built connectors, no pipeline rebuilding required

Schedule-based or event-driven ingestion cadence

CLI or UI-based setup for your first pipeline
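
A first pipeline is typically described by a recipe that names one source and one sink. The sketch below assembles such a recipe as a Python dictionary. The `build_recipe` helper, the environment variable names, and the placeholder values are assumptions for illustration, and the Snowflake config field names should be checked against the connector documentation before use.

```python
import os


def build_recipe(source_type, source_config, server="http://localhost:8080"):
    """Assemble an ingestion recipe: one metadata source and one sink."""
    return {
        "source": {"type": source_type, "config": dict(source_config)},
        # Assumed sink: a DataHub instance reachable over REST at `server`.
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


recipe = build_recipe(
    "snowflake",
    {
        # Placeholder credentials read from hypothetical environment variables.
        "account_id": os.environ.get("SNOWFLAKE_ACCOUNT", "example_account"),
        "username": os.environ.get("SNOWFLAKE_USER", "example_user"),
        "password": os.environ.get("SNOWFLAKE_PASSWORD", ""),
    },
)

# With the acryl-datahub package installed, a recipe like this can be run
# programmatically (not executed here so the sketch stays dependency-free):
#   from datahub.ingestion.run.pipeline import Pipeline
#   pipeline = Pipeline.create(recipe)
#   pipeline.run()
#   pipeline.raise_from_status()
```

Keeping the recipe as plain data makes it easy to store in version control and to reuse the same structure for other sources.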

Contextualize assets with ownership

Assign owners, apply glossary terms, tag domains, and document datasets. Metadata propagates through lineage so context follows data wherever it moves.

Ownership and stewardship assigned per asset or domain

Glossary terms and tags propagate through the lineage graph

Context follows data across every downstream consumer
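
One way to picture propagation is as terms flowing along lineage edges until nothing new is added. The sketch below is a simplified model under that assumption; the edge list, asset names, and `propagate_terms` helper are hypothetical and do not reflect how DataHub implements propagation internally.

```python
def propagate_terms(edges, terms):
    """Copy glossary terms from each asset to everything downstream of it."""
    result = {asset: set(assigned) for asset, assigned in terms.items()}
    changed = True
    while changed:  # repeat until no asset gains a new term
        changed = False
        for upstream, downstream in edges:
            inherited = result.get(upstream, set())
            target = result.setdefault(downstream, set())
            if not inherited <= target:
                target |= inherited
                changed = True
    return result


# Hypothetical lineage edges (upstream, downstream) and directly assigned terms.
EDGES = [
    ("raw.orders", "stg.orders"),
    ("stg.orders", "fct.revenue"),
    ("raw.customers", "fct.revenue"),
]
TERMS = {"raw.orders": {"Revenue"}, "raw.customers": {"PII"}}

propagated = propagate_terms(EDGES, TERMS)
```

In this model `fct.revenue` ends up carrying both terms because it has two upstream sources, which is the sense in which context follows the data.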

Activate policies across your platform

Define access policies by role, domain, tag, or container. DataHub enforces them consistently so your governance framework reflects how your organization actually works.

Policies enforced by role, domain, tag, or container

Governance framework maps to your org structure

Teams like yours have already done this at scale

API-first architecture

Built to fit your stack, not replace it

DataHub exposes a GraphQL API and Python SDK so your platform team can automate ingestion, extend the metadata model, and integrate governance into existing CI/CD workflows.
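
As a sketch of what API-driven automation can look like, the snippet below builds (but does not send) a GraphQL search request using only the Python standard library. The endpoint path, the query shape, the base URL, and the token placeholder are assumptions to verify against the GraphQL schema of your DataHub version; `build_search_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumed query shape; confirm field names against your instance's schema.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults { entity { urn type } }
  }
}
"""


def build_search_request(base_url, token, text, count=10):
    """Prepare a POST request that searches datasets matching `text`."""
    payload = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": text, "start": 0, "count": count}
        },
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/graphql",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_search_request(
    "http://localhost:8080", "<personal-access-token>", "revenue"
)
# To execute against a running instance: urllib.request.urlopen(request)
```

Separating request construction from sending keeps the example testable without a live server and makes the same helper usable from a CI job.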

Open source data governance tools and integrations

Self-hosted on your infrastructure

Run DataHub on Kubernetes in your own cloud account. You control the data, the network, and the upgrade cadence.

Kubernetes deployment in your own cloud account

Full control over data, network, and upgrade cadence

Apache 2.0 license, no feature gating

DataHub Cloud, fully managed

DataHub Cloud removes operational overhead while preserving the open metadata model. No proprietary lock-in on your metadata.

Managed deployment, zero operational overhead

Open metadata model preserved, no proprietary lock-in

Same GraphQL API and Python SDK as self-hosted

Data governance framework tools and extensibility

GraphQL API for metadata reads and writes

Python SDK for ingestion automation

Custom metadata aspects via extensible model

CI/CD integration for governance-as-code

Event-driven ingestion via Kafka integration

12,000+ GitHub stars, active open source community
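
Governance-as-code can be as simple as a check that runs in CI and fails the build when metadata is incomplete. The sketch below assumes a hypothetical manifest of datasets kept in version control; the manifest format and `find_violations` are illustrative and are not a DataHub feature.

```python
def find_violations(datasets):
    """Return human-readable problems; an empty list means the check passes."""
    problems = []
    for dataset in datasets:
        if not dataset.get("owners"):
            problems.append(f"{dataset['name']}: no owner assigned")
        if not dataset.get("description", "").strip():
            problems.append(f"{dataset['name']}: missing description")
    return problems


# Hypothetical manifest; in practice this might be loaded from a file in the repo.
MANIFEST = [
    {"name": "fct_revenue", "owners": ["finance-data"],
     "description": "Daily revenue by product."},
    {"name": "stg_orders", "owners": [], "description": ""},
]

violations = find_violations(MANIFEST)
# In a CI job, a non-empty list would typically cause a non-zero exit code.
```

A check like this gives reviewers a concrete failure message in the pull request instead of a governance gap discovered later.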

Peer reviews

DataHub

Gartner Peer Insights

Native source integrations

91 connectors

"DataHub gave us a single place to understand our data assets, their lineage, and who owns them. The open source model meant we could extend it to fit our architecture."

Gartner Peer Insights Reviewer

Data Engineering

Common questions

Questions engineers ask before evaluating DataHub

Comparing data governance software vendors

Is DataHub truly open source?

DataHub is licensed under Apache 2.0. The core platform, metadata model, and all 91 ingestion connectors are open source with no feature gating. You can inspect the code, contribute to it, and deploy it without a commercial agreement.

How does DataHub compare to other data governance software vendors?

Most proprietary vendors lock your metadata into a closed model. DataHub's metadata graph is open, extensible through custom metadata aspects, accessible through the GraphQL API, and portable if you ever change platforms. The trade-off is that self-hosted deployments require your team to manage infrastructure, which DataHub Cloud addresses for teams that prefer a managed option.

How long does initial ingestion take to configure?

Most teams complete their first ingestion pipeline in under a day using the CLI or UI-based ingestion wizard. Complex multi-source setups typically take one to two sprints. The timeline depends on how many sources you are connecting and whether you need custom metadata aspects.

Does DataHub support policy-based access control out of the box?

Yes. DataHub ships with platform and metadata policies, RBAC roles, and resource-level targeting by domain, tag, container, and glossary term. Admin, Editor, and Reader roles are available out of the box. Custom policies can be defined without writing code.

What deployment options are available?

You can self-host DataHub on Kubernetes in your own cloud environment, or use DataHub Cloud for a managed deployment that preserves the open metadata model. Both options use the same GraphQL API and Python SDK, so your automation and integrations work identically across either deployment path.

See open source data governance in your environment

Book a demo and a DataHub engineer will walk through lineage, access control, and discovery using your stack as the example.