Data catalog platform

Data Catalog Software Built for Platform Engineers

Your pipelines pass. Your dashboards break anyway. DataHub connects 60+ sources, tracks lineage to the column, and surfaces quality issues before your stakeholders do.

  • Connect Snowflake, dbt, Airflow, and 60+ sources without rebuilding pipelines

  • Column-level lineage across your full stack, from ingestion to BI

  • Catch data quality failures before they reach a dashboard or standup

See DataHub in your environment

A DataHub engineer will walk through your specific stack, not a generic script.

Trusted by modern data teams
The real cost

What does metadata debt cost your team?

Every hour spent tracing a broken pipeline is an hour not spent building. The cost compounds quietly until it surfaces in an incident.

Lineage gaps slow every incident

When a pipeline breaks, engineers spend hours tracing impact manually. Without automated lineage, every incident takes longer than it should.

Schema changes break downstream assets

A renamed column in Snowflake silently breaks three dashboards and two ML features. No alert fires. You find out in the standup.

No single source of truth

Definitions live in Confluence, Slack, and tribal knowledge. Different teams trust different numbers. Decisions stall while people argue about the data.

Governance is manual and fragile

Access policies live in spreadsheets. Compliance reviews require manual audits. One personnel change and a sensitive dataset is exposed.

Data catalog platform

A better way to manage your metadata

DataHub gives platform engineers the automation, depth, and extensibility that legacy catalog tools were never designed to provide.

Automated metadata discovery

DataHub crawls your stack and builds a unified metadata graph automatically. No manual tagging sprints, no stale wikis.

  • 60+ pre-built ingestion connectors, maintained and versioned
  • Scheduled and event-driven ingestion pipelines
  • Custom source support via the open ingestion framework
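To give a feel for how little setup a connector needs, here is an ingestion recipe expressed as a plain Python dict. DataHub recipes are typically written in YAML and run with the `datahub ingest` CLI; the account and credential values below are placeholders, and exact config fields vary by connector and version.

```python
import json

# Illustrative ingestion recipe for a Snowflake source, expressed as
# a plain dict. The same structure is usually written as YAML and run
# with `datahub ingest -c recipe.yml`. Field names are indicative and
# may differ across connector versions.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            "account_id": "my_account",           # placeholder values
            "username": "datahub_reader",
            "password": "${SNOWFLAKE_PASSWORD}",  # resolved from the environment
            "warehouse": "COMPUTE_WH",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

print(json.dumps(recipe, indent=2))
```

The same source/sink split applies to every connector: swap the `source` block for BigQuery, dbt, or Looker and the sink stays untouched.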

Column-level lineage end to end

Trace any field from its source table through every transformation to the BI layer. Understand impact before you make a change, not after.

  • Automated lineage from dbt, Spark, Airflow, and SQL parsing
  • Upstream and downstream impact analysis in one view
  • Cross-platform lineage across warehouses, lakes, and BI tools
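To make the SQL-parsing mechanism concrete, here is a deliberately minimal sketch of the idea: extract column-to-column mappings from a simple `CREATE TABLE AS SELECT` statement. This is an illustration of the concept only; a production parser like DataHub's handles expressions, joins, CTEs, and multiple dialects.

```python
import re

def ctas_column_lineage(sql: str) -> dict:
    """Map output columns to source columns for a simple
    CREATE TABLE ... AS SELECT statement.

    A minimal sketch of SQL-based column lineage, not a real parser:
    it only handles one source table and plain column references.
    """
    m = re.search(
        r"create\s+table\s+(\w+)\s+as\s+select\s+(.*?)\s+from\s+(\w+)",
        sql,
        re.IGNORECASE | re.DOTALL,
    )
    if not m:
        return {}
    target, select_list, source = m.groups()
    lineage = {}
    for item in select_list.split(","):
        # "amount_usd AS amount" -> source column amount_usd, output column amount
        parts = re.split(r"\s+as\s+", item.strip(), flags=re.IGNORECASE)
        src_col, out_col = parts[0].strip(), parts[-1].strip()
        lineage[f"{target}.{out_col}"] = f"{source}.{src_col}"
    return lineage

sql = "CREATE TABLE orders_clean AS SELECT id, amount_usd AS amount FROM raw_orders"
print(ctas_column_lineage(sql))
# {'orders_clean.id': 'raw_orders.id', 'orders_clean.amount': 'raw_orders.amount_usd'}
```

Run across every transformation in a stack, mappings like these stitch together into the cross-platform lineage graph described above.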

Data quality assertions and tracking

Define quality assertions on your datasets and track pass or fail status over time. Surface incidents in the catalog before they reach a stakeholder.

  • Native assertions for freshness, volume, schema, and custom SQL
  • Incident tracking linked to affected downstream assets
  • Integrates with Great Expectations, dbt tests, and Soda
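What a freshness assertion actually evaluates is simple. The sketch below shows the core check; DataHub's native assertions run checks like this against the warehouse on a schedule and record the result on the dataset, but the function and field names here are illustrative, not DataHub's API.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_updated: datetime, max_staleness: timedelta) -> dict:
    """Evaluate a freshness assertion: was the dataset updated
    within the allowed window? Conceptual sketch only."""
    age = datetime.now(timezone.utc) - last_updated
    return {
        "type": "freshness",
        "status": "PASS" if age <= max_staleness else "FAIL",
        "observed_age_minutes": round(age.total_seconds() / 60, 1),
    }

# A table last loaded two hours ago, against a one-hour freshness SLA:
result = check_freshness(
    datetime.now(timezone.utc) - timedelta(hours=2),
    timedelta(hours=1),
)
print(result["status"])  # FAIL
```

Volume, schema, and custom SQL assertions follow the same pattern: a scheduled check, a pass/fail status, and a record attached to the dataset it protects.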

API-first platform automation

DataHub exposes every capability through GraphQL and REST APIs. Automate metadata workflows, build internal tooling, and integrate with your existing platform stack.

  • Full GraphQL and OpenAPI surface for every metadata operation
  • Python SDK for programmatic ingestion and enrichment
  • Webhook and event stream support for real-time metadata updates
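As a sketch of what API-first looks like in practice, here is a GraphQL search request built with nothing but the standard library. The endpoint path and query shape are illustrative; consult your DataHub version's GraphQL schema for the exact fields.

```python
import json
import urllib.request

# Build (but do not send) a GraphQL search request. Query shape and
# endpoint are illustrative and should be checked against your
# DataHub version's GraphQL schema.
query = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults { entity { urn type } }
  }
}
"""
payload = {
    "query": query,
    "variables": {"input": {"type": "DATASET", "query": "orders", "start": 0, "count": 10}},
}

req = urllib.request.Request(
    "http://localhost:8080/api/graphql",  # hypothetical endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would run the search against a live instance.
print(req.get_method())  # POST
```

Because every catalog operation is reachable this way, internal tooling can search, tag, and annotate assets without touching the UI.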

How it works

Three steps to a governed data platform

DataHub works with your existing stack. No rip-and-replace, no rebuilding pipelines, no months-long implementation projects.

Connect your existing sources

  • Deploy the ingestion framework against Snowflake, BigQuery, dbt, Airflow, Looker, and 55+ more sources
  • Schedule recurring syncs or trigger ingestion from your existing orchestration layer
  • Custom sources connect via the open Python SDK without modifying your pipelines
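The shape of a custom source is straightforward: read a system that has no connector and emit one metadata record per asset. The sketch below uses plain dicts to show the idea; a real custom source subclasses the SDK's source class and emits typed work units, and the CSV inventory here is hypothetical.

```python
import csv
import io

def custom_inventory_source(inventory_csv: str):
    """Yield one metadata record per asset from a CSV inventory.

    Conceptual sketch of a custom ingestion source. A real one is
    built on DataHub's Python SDK rather than returning plain dicts,
    but the flow is the same: read the system, emit records.
    """
    for row in csv.DictReader(io.StringIO(inventory_csv)):
        yield {
            # DataHub-style dataset URN built from the inventory row
            "urn": f"urn:li:dataset:(urn:li:dataPlatform:{row['platform']},{row['name']},PROD)",
            "owner": row["owner"],
        }

inventory = "platform,name,owner\ninternal_db,billing.invoices,finance-data\n"
records = list(custom_inventory_source(inventory))
print(records[0]["urn"])
```

The point is that the pipelines feeding the source system never change; the custom source only reads metadata about them.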

Contextualize with lineage and quality

  • Column-level lineage is built automatically from SQL parsing, dbt manifests, and Spark plans
  • Quality assertions run on a schedule and surface failures directly on the affected dataset page
  • Business glossary terms and ownership metadata enrich every asset in the catalog

Activate governance across your org

  • Role-based access control policies enforce who can view, edit, or own each asset
  • Data products and domains give business teams a governed view without platform complexity
  • Audit logs and compliance reports are available via API for your security and legal teams
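Policies themselves are plain data, which is what makes them automatable. The dict below is shaped like the JSON bodies DataHub's policy API accepts, but the field names and privilege strings are indicative; verify them against the API reference for your version before relying on them.

```python
import json

# Illustrative access-control policy. Field names and privilege
# strings are indicative, not authoritative; check your version's
# policy API reference.
policy = {
    "type": "METADATA",
    "name": "Finance domain editors",  # hypothetical policy name
    "state": "ACTIVE",
    "actors": {"groups": ["urn:li:corpGroup:finance-data"]},
    "privileges": ["EDIT_ENTITY_DOCS", "EDIT_ENTITY_TAGS"],
    "resources": {"type": "dataset", "allResources": False},
}
print(json.dumps(policy, indent=2))
```

Because policies are created and read through the same API surface as everything else, they can be version-controlled and reviewed like any other config.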

Enterprise scale

Built for enterprise security and scale

DataHub is deployed by some of the largest data teams in the world. It is designed to run at scale, on your infrastructure, with the security controls your organization requires.

Deployment options

  • Self-hosted on Kubernetes via Helm chart, with full infrastructure control
  • DataHub Cloud for managed deployment with SLA-backed uptime
  • VPC and private link options for air-gapped or regulated environments

Security and access control

  • Fine-grained RBAC with resource-level and metadata-level policies
  • SSO via OIDC and SAML, with support for Okta, Azure AD, and Google Workspace
  • Full audit log of every metadata read, write, and policy change

API and extensibility

  • GraphQL and REST APIs cover every metadata entity and relationship
  • Custom metadata models via the extensible metadata schema framework
  • Kafka-based event stream for real-time metadata change notifications
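A consumer of the change stream can be very small. The sketch below routes a single event; the field names (`entityUrn`, `aspectName`, `changeType`) follow the shape of MetadataChangeLog events, but verify them against your version before building on them.

```python
import json

def route_change_event(raw: bytes) -> str:
    """Decide how to react to one metadata change event.

    Sketch of a change-stream consumer. Field names follow the shape
    of MetadataChangeLog events but should be verified per version.
    """
    event = json.loads(raw)
    if event.get("aspectName") == "schemaMetadata":
        return f"schema change on {event['entityUrn']}: notify downstream owners"
    return "ignore"

# A schema change on a warehouse table, as it might arrive off the stream:
sample = json.dumps({
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.orders,PROD)",
    "aspectName": "schemaMetadata",
    "changeType": "UPSERT",
}).encode()
print(route_change_event(sample))
```

This is the hook for the scenario described earlier: a renamed column fires an event here instead of surfacing in the standup.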

Trusted by data teams

What platform engineers say about DataHub

Gartner Peer Insights

"DataHub gave us column-level lineage across our entire warehouse and BI layer. We went from spending half a day tracing a broken pipeline to knowing the impact in minutes. The API-first design meant we could wire it into our existing platform tooling without rebuilding anything."

Verified reviewer: Data platform engineer, financial services, 10,000+ employees

Frequently asked questions about data catalog software

How long does DataHub take to implement?

It depends on the complexity of your stack and how many sources you are connecting. Most teams have DataHub running and ingesting metadata from their primary sources within one to two weeks. Getting to full production coverage across all sources, with lineage and quality assertions configured, typically takes four to eight weeks. DataHub does not require a professional services engagement to get started, though one is available for teams with complex environments or accelerated timelines.
Which data sources does DataHub support?

DataHub ships with 60+ maintained ingestion connectors covering data warehouses (Snowflake, BigQuery, Redshift, Databricks), transformation tools (dbt, Spark), orchestration (Airflow, Dagster, Prefect), BI tools (Looker, Tableau, Power BI, Metabase), and databases (Postgres, MySQL, SQL Server, and others). If your source is not on the list, you can build a custom connector using the open Python SDK. The connector library is actively maintained and updated with each release.
How does DataHub generate column-level lineage?

DataHub generates column-level lineage through several mechanisms depending on your stack. For dbt, it parses the manifest file directly. For Spark and Airflow, it uses runtime instrumentation via the DataHub Spark agent and Airflow plugin. For SQL-based transformations, it uses a SQL parser that extracts column-level relationships from CREATE TABLE AS SELECT and INSERT INTO statements. The result is a unified lineage graph that spans tools and platforms. Lineage accuracy depends on how transformations are expressed, and the SQL parser handles most standard SQL dialects.
How does DataHub handle data quality?

DataHub supports data quality through two approaches. First, it ingests test results from external quality tools including dbt tests, Great Expectations, and Soda, and surfaces pass or fail status on each dataset page in the catalog. Second, DataHub Cloud includes native assertions that run directly against your warehouse on a schedule, covering freshness, row count, column nullness, and custom SQL conditions. Failures are tracked as incidents and linked to the affected downstream assets so you can assess impact immediately.
Is DataHub open source or a managed service?

DataHub is available as open source software you deploy on your own infrastructure, or as DataHub Cloud, a fully managed SaaS offering. The open source version is deployed via a Helm chart on Kubernetes and runs on AWS, GCP, or Azure. It requires managing your own infrastructure, upgrades, and scaling. DataHub Cloud removes that operational burden and adds enterprise features including managed ingestion, native assertions, and SLA-backed uptime. Both options use the same underlying platform and metadata model, so migration between them is straightforward.

Bring order to your metadata stack

DataHub connects your sources, maps your lineage, and surfaces quality issues before they become incidents. A DataHub engineer will walk through your specific environment, not a generic demo script.

Apache 2.0 open source
60+ pre-built connectors
Self-hosted or managed deployment