Apache 2.0 Licensed

Open Source Metadata Management Platform

Your pipelines pass. Your dashboards break. DataHub gives your team the lineage, context, and governance to find out why before the standup.

  • Connect 80+ sources: Snowflake, dbt, Looker, Kafka, and more

  • Automated column-level lineage from source to dashboard

  • Fine-grained governance without rebuilding your stack

See DataHub govern your data

A DataHub engineer will scope the demo to your environment.

Trusted by modern data teams
The real cost

What does fragmented metadata cost you?

Every hour your team spends hunting schemas or tracing broken lineage is an hour not spent building. The invisible cost compounds fast.

Lineage gaps at 3 a.m.

A dashboard breaks. No one can trace which upstream table changed or who owns it. Your team loses hours they do not have.

Governance you cannot prove

Auditors ask who accessed what and when. Without a metadata layer, the answer is a spreadsheet and a prayer.

Stale docs, wrong answers

Static wikis go out of date within days. Engineers stop trusting them. New hires make decisions on bad context.

Stack sprawl, no single view

Your data infrastructure spans multiple tools and clouds. Without a unified metadata layer, no one sees the full picture.

Open source data catalog

A better way to manage your metadata

DataHub gives platform engineers a production-ready open source data catalog with governance, lineage, and observability built in from day one.

Data catalog for data governance

A unified asset inventory across every source in your stack. Search, tag, and document datasets, pipelines, dashboards, and ML models in one place your team will actually use.

  • Full-text search across all metadata assets
  • Automated schema and ownership discovery
  • Business glossary linked to physical assets

Automated lineage across your stack

Column-level lineage captured automatically from Snowflake, dbt, Spark, Airflow, and the rest of the 80+ supported sources. No manual mapping. No stale diagrams. Trace any broken asset back to its root in seconds.

  • Column-level lineage, not just table-level
  • Cross-platform lineage stitched automatically
  • Impact analysis before any schema change
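To make impact analysis concrete, here is a minimal sketch of the idea in Python. It is not DataHub's implementation: the lineage edges, the column names, and the `downstream_impact` helper are all hypothetical, and the sketch simply walks a small in-memory graph breadth-first to list everything downstream of a column you plan to change.

```python
from collections import deque

# Hypothetical column-level lineage edges: upstream column -> downstream columns.
# The names are illustrative only and do not come from a real deployment.
LINEAGE = {
    "snowflake.orders.amount": ["dbt.fct_orders.amount_usd"],
    "dbt.fct_orders.amount_usd": [
        "looker.revenue_dashboard.total_revenue",
        "dbt.fct_finance.revenue",
    ],
    "dbt.fct_finance.revenue": ["looker.finance_dashboard.revenue"],
}


def downstream_impact(column, lineage):
    """Return every column reachable downstream of `column`, nearest first."""
    seen = set()
    order = []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guard against cycles and duplicate paths
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream_impact("snowflake.orders.amount", LINEAGE)
for name in impacted:
    print(name)
```

In this toy graph, changing `snowflake.orders.amount` touches two dbt models and two Looker dashboards, which is the kind of list you would want in hand before merging a schema change.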

Open source data governance at scale

Role-based access control, fine-grained policies, and audit logs built into the platform. Governance that your compliance team can verify and your engineers can enforce without a separate tool.

  • Fine-grained RBAC with policy inheritance
  • Full audit trail for access and changes
  • Tag-based classification for sensitive data
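As an illustration of how scoped policies and an audit trail can fit together, the sketch below models a policy check in plain Python. Everything in it is an assumption made for the example: the `Policy` class, the `is_allowed` function, the scope names, and the rule that the most specific scope wins are hypothetical and are not DataHub's documented policy semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    scope: str             # "platform", a domain name, or a dataset name (hypothetical)
    role: str
    privileges: frozenset  # privileges granted to the role at this scope


def is_allowed(role, privilege, dataset, domain, policies, audit_log):
    """Check scopes from most to least specific; the first scope that has a
    policy for the role decides. Every decision is appended to the audit log."""
    for scope in (dataset, domain, "platform"):
        matching = [p for p in policies if p.scope == scope and p.role == role]
        if matching:
            decision = any(privilege in p.privileges for p in matching)
            audit_log.append((role, privilege, dataset, scope, decision))
            return decision
    audit_log.append((role, privilege, dataset, None, False))
    return False


policies = [
    Policy("platform", "analyst", frozenset({"view"})),
    Policy("finance", "analyst", frozenset({"view", "edit_tags"})),
    Policy("finance.payroll", "analyst", frozenset()),  # sensitive: no access
]
audit_log = []

payroll_view = is_allowed("analyst", "view", "finance.payroll", "finance", policies, audit_log)
ledger_tags = is_allowed("analyst", "edit_tags", "finance.ledger", "finance", policies, audit_log)
leads_view = is_allowed("analyst", "view", "sales.leads", "sales", policies, audit_log)
```

Here the dataset-level policy on `finance.payroll` overrides the broader domain grant, while `sales.leads` falls back to the platform-wide rule. Each call leaves one entry in `audit_log`, recording which scope made the decision.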

An open platform engineers control

Apache 2.0 licensed. Deploy on your infrastructure via Helm and Kubernetes. Extend via GraphQL and OpenAPI. No vendor lock-in and no black-box behavior, backed by an open source community with 12,000+ GitHub stars.

  • Apache 2.0 license, self-hosted or managed
  • GraphQL and OpenAPI for full extensibility
  • Helm chart deployment on Kubernetes

How it works

Three steps to a governed, searchable, and trusted metadata layer across your entire data platform.

Connect your data sources

  • Ingest from Snowflake, dbt, Looker, Kafka, and the rest of the 80+ supported sources
  • No pipeline rebuilds required to get started
  • Push or pull ingestion via API or scheduled recipes
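For a sense of what a scheduled recipe looks like, the sketch below assembles one as a Python dictionary. The `source` and `sink` layout follows the general shape used in DataHub's recipe documentation, but treat the details as assumptions: `build_recipe` is a hypothetical helper, and the config keys, account name, and server URL are placeholders to check against the connector you actually use.

```python
import json


def build_recipe(source_type, source_config, server):
    """Assemble a recipe: where metadata is read from (source) and where it
    is sent (sink). Nothing is contacted; this only builds the structure."""
    return {
        "source": {"type": source_type, "config": dict(source_config)},
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


# Placeholder values for illustration only.
recipe = build_recipe(
    source_type="snowflake",
    source_config={"account_id": "my_account", "warehouse": "COMPUTE_WH"},
    server="http://localhost:8080",
)

print(json.dumps(recipe, indent=2))
```

In practice a recipe like this is usually kept as a YAML file and run on a schedule, with credentials injected from a secret store rather than written inline.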

Contextualize every asset

  • Column-level lineage stitched across platforms automatically
  • Ownership, tags, and glossary terms applied at scale
  • Schema history and change detection tracked over time
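Schema change detection comes down to comparing two snapshots of the same table. The sketch below shows that comparison in plain Python. It is a simplified illustration rather than DataHub's logic, and the `diff_schema` helper, column names, and types are hypothetical.

```python
def diff_schema(old, new):
    """Compare two {column: type} snapshots and report what changed."""
    old_cols, new_cols = set(old), set(new)
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
        "retyped": sorted(c for c in old_cols & new_cols if old[c] != new[c]),
    }


# Two hypothetical snapshots of the same table, taken a day apart.
yesterday = {"id": "NUMBER", "amount": "FLOAT", "region": "VARCHAR"}
today = {"id": "NUMBER", "amount": "NUMBER(18,2)", "country": "VARCHAR"}

changes = diff_schema(yesterday, today)
print(changes)
```

A removed or retyped column is the kind of change that breaks a dashboard downstream, so pairing a diff like this with lineage shows who needs to know.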

Activate governance org-wide

  • Enforce RBAC policies across every data asset
  • Surface trusted data to analysts and downstream consumers
  • Audit logs ready for compliance and security reviews
Enterprise ready

Built for enterprise-grade scale and security

DataHub is deployed by enterprise data teams that need production-grade reliability, security controls, and flexible deployment options.

Deployment options

  • Self-hosted via Helm chart on Kubernetes
  • Managed cloud option available for teams that prefer it
  • Apache 2.0 license with no usage restrictions

Security and access control

  • Role-based access control with fine-grained policy engine
  • SSO integration via OIDC and SAML
  • Full audit logging for compliance and security reviews

Extensibility and integrations

  • GraphQL and OpenAPI for custom integrations
  • 80+ pre-built source connectors maintained by the DataHub team and the community
  • Event-driven metadata updates via Kafka
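A custom integration over GraphQL could start from something like the sketch below, which builds (but does not send) a dataset search request using only the Python standard library. The endpoint path `/api/graphql`, the query and field names, the bearer-token header, and the `build_search_request` helper are assumptions for illustration. Verify them against the GraphQL schema exposed by your own deployment.

```python
import json
import urllib.request

# Assumed query shape; confirm field names against your instance's schema.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults { entity { urn type } }
  }
}
"""


def build_search_request(base_url, token, text, count=10):
    """Build a POST request that searches datasets matching `text`."""
    payload = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": text, "start": 0, "count": count}
        },
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/graphql",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder URL and token; nothing is sent over the network here.
request = build_search_request("http://localhost:8080/", "YOUR_TOKEN", "orders")
```

Passing `request` to `urllib.request.urlopen` would send it. The sketch stops short of that so it can be read and adapted without a running instance.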
Trusted by data teams

Trusted by modern data teams

Platform engineers and data leaders at enterprise organizations rely on DataHub to govern their metadata at scale.

Gartner Peer Insights

Verified review

Outcome
Unified metadata layer across a complex multi-cloud stack
"DataHub gave us the lineage and governance layer we had been trying to build manually for two years. We finally have a single place where engineers and analysts can trust what they are looking at."

Gartner Peer Insights Reviewer

Data Platform Engineer, Enterprise Technology Company

Frequently asked questions about open source metadata management

Most teams have DataHub ingesting metadata from their first source within a day using the Helm chart and a pre-built connector recipe. Full deployment across a complex multi-source environment depends on the number of sources and your infrastructure setup. DataHub is designed to work with what you already have, so you are not rebuilding pipelines to get started. A DataHub engineer can walk through your specific environment during a scoped demo.

DataHub ships with 80+ pre-built connectors covering data warehouses (Snowflake, BigQuery, Redshift), transformation tools (dbt, Spark), orchestrators (Airflow, Dagster), BI platforms (Looker, Tableau, Power BI), and streaming systems (Kafka). Connectors are maintained by both the DataHub team and the open source community. If your source is not on the list, the GraphQL and OpenAPI layer lets your team build a custom integration without modifying the core platform.

DataHub includes a fine-grained policy engine that controls who can view, edit, or manage metadata for specific assets, asset types, or domains. Policies can be applied at the platform level or scoped to individual datasets and columns. Role-based access control integrates with your existing SSO provider via OIDC or SAML. Every access and change event is logged, giving your compliance team a verifiable audit trail without a separate tool.

DataHub is distributed as a Helm chart and runs on any Kubernetes cluster, including EKS, GKE, and AKS. The Apache 2.0 license means you can deploy it on your own infrastructure with no usage restrictions. For teams that prefer a managed option, DataHub Cloud is available. The architecture is the same in both cases, so migrating between self-hosted and managed is straightforward if your requirements change.

The open source version of DataHub gives you the full metadata management platform under the Apache 2.0 license, self-hosted on your infrastructure. DataHub Cloud adds managed infrastructure, enterprise support SLAs, and additional features built on top of the open source core. Both share the same underlying platform, so capabilities like lineage, governance, and the connector library are available in both. The right choice depends on whether your team wants to operate the infrastructure or have that handled for you.

Take control of your metadata

You will speak with a DataHub engineer about your specific environment, not a generic walkthrough. Bring your stack, your questions, and your constraints.

Apache 2.0 open source
80+ pre-built connectors
Self-hosted or managed deployment