Data catalog platform

Data Catalog Software Built for Platform Engineers

Your pipelines pass. Your dashboards break anyway. DataHub connects 60+ sources, tracks lineage to the column, and surfaces quality issues before your stakeholders do.

  • Connect Snowflake, dbt, Airflow, and 60+ sources without rebuilding pipelines

  • Column-level lineage across your full stack, from ingestion to BI

  • Catch data quality failures before they reach a dashboard or standup

See DataHub in your environment

A DataHub engineer will walk through your specific stack, not a generic script.

Trusted by modern data teams
The real cost

What does metadata debt cost your team?

Every hour spent tracing a broken pipeline is an hour not spent building. The cost compounds quietly until it surfaces in an incident.

Lineage gaps slow every incident

When a pipeline breaks, engineers spend hours tracing impact manually. Without automated lineage, every incident takes longer than it should.

Schema changes break downstream assets

A renamed column in Snowflake silently breaks three dashboards and two ML features. No alert fires. You find out in the standup.

No single source of truth

Definitions live in Confluence, Slack, and tribal knowledge. Different teams trust different numbers. Decisions stall while people argue about the data.

Governance is manual and fragile

Access policies live in spreadsheets. Compliance reviews require manual audits. One personnel change and a sensitive dataset is exposed.

Data catalog platform

A better way to manage your metadata

DataHub gives platform engineers the automation, depth, and extensibility that legacy catalog tools were never designed to provide.

Automated metadata discovery

DataHub crawls your stack and builds a unified metadata graph automatically. No manual tagging sprints, no stale wikis.

  • 60+ pre-built ingestion connectors, maintained and versioned
  • Scheduled and event-driven ingestion pipelines
  • Custom source support via the open ingestion framework
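To give a feel for how little setup a connector needs, here is an ingestion recipe expressed as a plain Python dict. DataHub recipes are typically written in YAML and run with the `datahub ingest` CLI; the account and credential values below are placeholders, and exact config fields vary by connector and version.

```python
import json

# Illustrative ingestion recipe for a Snowflake source, expressed as
# a plain dict. The same structure is usually written as YAML and run
# with `datahub ingest -c recipe.yml`. Field names are indicative and
# may differ across connector versions.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            "account_id": "my_account",           # placeholder values
            "username": "datahub_reader",
            "password": "${SNOWFLAKE_PASSWORD}",  # resolved from the environment
            "warehouse": "COMPUTE_WH",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

print(json.dumps(recipe, indent=2))
```

The same source/sink split applies to every connector: swap the `source` block for BigQuery, dbt, or Looker and the sink stays untouched.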

Column-level lineage end to end

Trace any field from its source table through every transformation to the BI layer. Understand impact before you make a change, not after.

  • Automated lineage from dbt, Spark, Airflow, and SQL parsing
  • Upstream and downstream impact analysis in one view
  • Cross-platform lineage across warehouses, lakes, and BI tools
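To make the SQL-parsing mechanism concrete, here is a deliberately minimal sketch of the idea: extract column-to-column mappings from a simple `CREATE TABLE AS SELECT` statement. This is an illustration of the concept only; a production parser like DataHub's handles expressions, joins, CTEs, and multiple dialects.

```python
import re

def ctas_column_lineage(sql: str) -> dict:
    """Map output columns to source columns for a simple
    CREATE TABLE ... AS SELECT statement.

    A minimal sketch of SQL-based column lineage, not a real parser:
    it only handles one source table and plain column references.
    """
    m = re.search(
        r"create\s+table\s+(\w+)\s+as\s+select\s+(.*?)\s+from\s+(\w+)",
        sql,
        re.IGNORECASE | re.DOTALL,
    )
    if not m:
        return {}
    target, select_list, source = m.groups()
    lineage = {}
    for item in select_list.split(","):
        # "amount_usd AS amount" -> source column amount_usd, output column amount
        parts = re.split(r"\s+as\s+", item.strip(), flags=re.IGNORECASE)
        src_col, out_col = parts[0].strip(), parts[-1].strip()
        lineage[f"{target}.{out_col}"] = f"{source}.{src_col}"
    return lineage

sql = "CREATE TABLE orders_clean AS SELECT id, amount_usd AS amount FROM raw_orders"
print(ctas_column_lineage(sql))
# {'orders_clean.id': 'raw_orders.id', 'orders_clean.amount': 'raw_orders.amount_usd'}
```

Run across every transformation in a stack, mappings like these stitch together into the cross-platform lineage graph described above.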

Data quality assertions and tracking

Define quality assertions on your datasets and track pass or fail status over time. Surface incidents in the catalog before they reach a stakeholder.

  • Native assertions for freshness, volume, schema, and custom SQL
  • Incident tracking linked to affected downstream assets
  • Integrates with Great Expectations, dbt tests, and Soda
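What a freshness assertion actually evaluates is simple. The sketch below shows the core check; DataHub's native assertions run checks like this against the warehouse on a schedule and record the result on the dataset, but the function and field names here are illustrative, not DataHub's API.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_updated: datetime, max_staleness: timedelta) -> dict:
    """Evaluate a freshness assertion: was the dataset updated
    within the allowed window? Conceptual sketch only."""
    age = datetime.now(timezone.utc) - last_updated
    return {
        "type": "freshness",
        "status": "PASS" if age <= max_staleness else "FAIL",
        "observed_age_minutes": round(age.total_seconds() / 60, 1),
    }

# A table last loaded two hours ago, against a one-hour freshness SLA:
result = check_freshness(
    datetime.now(timezone.utc) - timedelta(hours=2),
    timedelta(hours=1),
)
print(result["status"])  # FAIL
```

Volume, schema, and custom SQL assertions follow the same pattern: a scheduled check, a pass/fail status, and a record attached to the dataset it protects.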

API-first platform automation

DataHub exposes every capability through GraphQL and REST APIs. Automate metadata workflows, build internal tooling, and integrate with your existing platform stack.

  • Full GraphQL and OpenAPI surface for every metadata operation
  • Python SDK for programmatic ingestion and enrichment
  • Webhook and event stream support for real-time metadata updates
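As a sketch of what API-first looks like in practice, here is a GraphQL search request built with nothing but the standard library. The endpoint path and query shape are illustrative; consult your DataHub version's GraphQL schema for the exact fields.

```python
import json
import urllib.request

# Build (but do not send) a GraphQL search request. Query shape and
# endpoint are illustrative and should be checked against your
# DataHub version's GraphQL schema.
query = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults { entity { urn type } }
  }
}
"""
payload = {
    "query": query,
    "variables": {"input": {"type": "DATASET", "query": "orders", "start": 0, "count": 10}},
}

req = urllib.request.Request(
    "http://localhost:8080/api/graphql",  # hypothetical endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would run the search against a live instance.
print(req.get_method())  # POST
```

Because every catalog operation is reachable this way, internal tooling can search, tag, and annotate assets without touching the UI.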

How it works

Three steps to a governed data platform

DataHub works with your existing stack. No rip-and-replace, no rebuilding pipelines, no months-long implementation projects.

Connect your existing sources

  • Deploy the ingestion framework against Snowflake, BigQuery, dbt, Airflow, Looker, and 55+ more sources
  • Schedule recurring syncs or trigger ingestion from your existing orchestration layer
  • Custom sources connect via the open Python SDK without modifying your pipelines
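The shape of a custom source is straightforward: read a system that has no connector and emit one metadata record per asset. The sketch below uses plain dicts to show the idea; a real custom source subclasses the SDK's source class and emits typed work units, and the CSV inventory here is hypothetical.

```python
import csv
import io

def custom_inventory_source(inventory_csv: str):
    """Yield one metadata record per asset from a CSV inventory.

    Conceptual sketch of a custom ingestion source. A real one is
    built on DataHub's Python SDK rather than returning plain dicts,
    but the flow is the same: read the system, emit records.
    """
    for row in csv.DictReader(io.StringIO(inventory_csv)):
        yield {
            # DataHub-style dataset URN built from the inventory row
            "urn": f"urn:li:dataset:(urn:li:dataPlatform:{row['platform']},{row['name']},PROD)",
            "owner": row["owner"],
        }

inventory = "platform,name,owner\ninternal_db,billing.invoices,finance-data\n"
records = list(custom_inventory_source(inventory))
print(records[0]["urn"])
```

The point is that the pipelines feeding the source system never change; the custom source only reads metadata about them.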

Contextualize with lineage and quality

  • Column-level lineage is built automatically from SQL parsing, dbt manifests, and Spark plans
  • Quality assertions run on a schedule and surface failures directly on the affected dataset page
  • Business glossary terms and ownership metadata enrich every asset in the catalog

Activate governance across your org

  • Role-based access control policies enforce who can view, edit, or own each asset
  • Data products and domains give business teams a governed view without platform complexity
  • Audit logs and compliance reports are available via API for your security and legal teams
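Policies themselves are plain data, which is what makes them automatable. The dict below is shaped like the JSON bodies DataHub's policy API accepts, but the field names and privilege strings are indicative; verify them against the API reference for your version before relying on them.

```python
import json

# Illustrative access-control policy. Field names and privilege
# strings are indicative, not authoritative; check your version's
# policy API reference.
policy = {
    "type": "METADATA",
    "name": "Finance domain editors",  # hypothetical policy name
    "state": "ACTIVE",
    "actors": {"groups": ["urn:li:corpGroup:finance-data"]},
    "privileges": ["EDIT_ENTITY_DOCS", "EDIT_ENTITY_TAGS"],
    "resources": {"type": "dataset", "allResources": False},
}
print(json.dumps(policy, indent=2))
```

Because policies are created and read through the same API surface as everything else, they can be version-controlled and reviewed like any other config.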

Enterprise scale

Built for enterprise security and scale

DataHub is deployed by some of the largest data teams in the world. It is designed to run at scale, on your infrastructure, with the security controls your organization requires.

Deployment options

  • Self-hosted on Kubernetes via Helm chart, with full infrastructure control
  • DataHub Cloud for managed deployment with SLA-backed uptime
  • VPC and private link options for air-gapped or regulated environments

Security and access control

  • Fine-grained RBAC with resource-level and metadata-level policies
  • SSO via OIDC and SAML, with support for Okta, Azure AD, and Google Workspace
  • Full audit log of every metadata read, write, and policy change

API and extensibility

  • GraphQL and REST APIs cover every metadata entity and relationship
  • Custom metadata models via the extensible metadata schema framework
  • Kafka-based event stream for real-time metadata change notifications
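A consumer of the change stream can be very small. The sketch below routes a single event; the field names (`entityUrn`, `aspectName`, `changeType`) follow the shape of MetadataChangeLog events, but verify them against your version before building on them.

```python
import json

def route_change_event(raw: bytes) -> str:
    """Decide how to react to one metadata change event.

    Sketch of a change-stream consumer. Field names follow the shape
    of MetadataChangeLog events but should be verified per version.
    """
    event = json.loads(raw)
    if event.get("aspectName") == "schemaMetadata":
        return f"schema change on {event['entityUrn']}: notify downstream owners"
    return "ignore"

# A schema change on a warehouse table, as it might arrive off the stream:
sample = json.dumps({
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.orders,PROD)",
    "aspectName": "schemaMetadata",
    "changeType": "UPSERT",
}).encode()
print(route_change_event(sample))
```

This is the hook for the scenario described earlier: a renamed column fires an event here instead of surfacing in the standup.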

Trusted by data teams

What platform engineers say about DataHub

Gartner Peer Insights

"DataHub gave us column-level lineage across our entire warehouse and BI layer. We went from spending half a day tracing a broken pipeline to knowing the impact in minutes. The API-first design meant we could wire it into our existing platform tooling without rebuilding anything."

Verified reviewer: Data platform engineer, financial services, 10,000+ employees

Frequently asked questions about data catalog software

How long does DataHub take to implement?

It depends on the complexity of your stack and how many sources you are connecting. Most teams have DataHub running and ingesting metadata from their primary sources within one to two weeks. Getting to full production coverage across all sources, with lineage and quality assertions configured, typically takes four to eight weeks. DataHub does not require a professional services engagement to get started, though one is available for teams with complex environments or accelerated timelines.
Which data sources does DataHub support?

DataHub ships with 60+ maintained ingestion connectors covering data warehouses (Snowflake, BigQuery, Redshift, Databricks), transformation tools (dbt, Spark), orchestration (Airflow, Dagster, Prefect), BI tools (Looker, Tableau, Power BI, Metabase), and databases (Postgres, MySQL, SQL Server, and others). If your source is not on the list, you can build a custom connector using the open Python SDK. The connector library is actively maintained and updated with each release.
How does DataHub generate column-level lineage?

DataHub generates column-level lineage through several mechanisms depending on your stack. For dbt, it parses the manifest file directly. For Spark and Airflow, it uses runtime instrumentation via the DataHub Spark agent and Airflow plugin. For SQL-based transformations, it uses a SQL parser that extracts column-level relationships from CREATE TABLE AS SELECT and INSERT INTO statements. The result is a unified lineage graph that spans tools and platforms. Lineage accuracy depends on how transformations are expressed, and the SQL parser handles most standard SQL dialects.
How does DataHub handle data quality?

DataHub supports data quality through two approaches. First, it ingests test results from external quality tools including dbt tests, Great Expectations, and Soda, and surfaces pass or fail status on each dataset page in the catalog. Second, DataHub Cloud includes native assertions that run directly against your warehouse on a schedule, covering freshness, row count, column nullness, and custom SQL conditions. Failures are tracked as incidents and linked to the affected downstream assets so you can assess impact immediately.
Is DataHub open source or a managed service?

DataHub is available as open source software you deploy on your own infrastructure, or as DataHub Cloud, a fully managed SaaS offering. The open source version is deployed via a Helm chart on Kubernetes and runs on AWS, GCP, or Azure. It requires managing your own infrastructure, upgrades, and scaling. DataHub Cloud removes that operational burden and adds enterprise features including managed ingestion, native assertions, and SLA-backed uptime. Both options use the same underlying platform and metadata model, so migration between them is straightforward.

Bring order to your metadata stack

DataHub connects your sources, maps your lineage, and surfaces quality issues before they become incidents. A DataHub engineer will walk through your specific environment, not a generic demo script.

Apache 2.0 open source
60+ pre-built connectors
Self-hosted or managed deployment