Data Lineage and Impact Analysis
Data lineage tracks the origin, movement, transformation, and consumption of data throughout its lifecycle in your systems. It answers "where did this data come from?" and "what would break if I changed this?" — two of the most common and important questions in data management.
Why Data Lineage Matters
- Debugging: When a metric is wrong, lineage lets you trace it upstream, step by step, to the pipeline stage where the error was introduced
- Impact analysis: Before changing a table or column, lineage reveals all downstream reports and models that depend on it
- Compliance: Under GDPR and in regulated sectors, lineage provides evidence of where personal data flows and where it is processed
- Audit: Demonstrates to auditors how reported figures were derived from source data
- Trust: Users trust data more when they can see where it came from
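The debugging and impact-analysis cases above both reduce to walking a dependency graph, downstream for impact and upstream for debugging. The following is a minimal sketch of that idea, not a description of any particular tool. The dataset names and the `DOWNSTREAM` mapping are hypothetical, invented for illustration.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets listed as its value.
DOWNSTREAM = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue", "marts.customer_ltv"],
    "staging.customers": ["marts.customer_ltv"],
    "marts.revenue": ["dashboard.finance"],
    "marts.customer_ltv": [],
    "dashboard.finance": [],
}


def invert(edges):
    """Flip the edge direction so the graph can be walked upstream."""
    flipped = {node: [] for node in edges}
    for parent, children in edges.items():
        for child in children:
            flipped.setdefault(child, []).append(parent)
    return flipped


def reachable(edges, start):
    """Breadth-first walk returning every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Impact analysis: what could break if staging.orders changes?
impacted = reachable(DOWNSTREAM, "staging.orders")

# Debugging: which upstream datasets could have caused a bad dashboard figure?
suspects = reachable(invert(DOWNSTREAM), "dashboard.finance")

print(sorted(impacted))
print(sorted(suspects))
```

Storing the edges in one direction and inverting them on demand keeps a single source of truth for the graph. A real lineage store would typically index both directions.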
Types of Lineage
- Technical lineage: Table-to-table and column-to-column data flow — often captured automatically by data pipeline tools
- Business lineage: A higher-level description in business terms, such as "the Revenue metric comes from the Orders table, filtered to completed orders, summing the amount column", which is more useful for non-technical stakeholders
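One way to see how the two types relate is to store technical lineage as column-level edges and derive a business-style sentence from them. The sketch below uses the Revenue example and assumes each target column has exactly one source, which real pipelines often violate. The `ColumnEdge` type, the intermediate `completed_orders` table, and all column names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnEdge:
    """One column-to-column hop of technical lineage."""
    source: str          # "table.column" the value is read from
    target: str          # "table.column" the value is written to
    transformation: str  # short description of the logic applied


# Hypothetical technical lineage for the Revenue metric.
TECHNICAL = [
    ColumnEdge("orders.amount", "completed_orders.amount", "filtered to completed orders"),
    ColumnEdge("completed_orders.amount", "revenue.total", "summed"),
]


def trace(target, edges):
    """Follow edges backwards from target to its original source column.

    Assumes a single source per target and no cycles.
    """
    by_target = {edge.target: edge for edge in edges}
    path = []
    while target in by_target:
        edge = by_target[target]
        path.append(edge)
        target = edge.source
    return list(reversed(path))


def describe(metric, target, edges):
    """Render the technical path as a business-lineage sentence."""
    path = trace(target, edges)
    if not path:
        return f"{metric} has no recorded lineage"
    steps = ", then ".join(edge.transformation for edge in path)
    return f"{metric} comes from {path[0].source}: {steps}"


print(describe("Revenue", "revenue.total", TECHNICAL))
```

In practice, business lineage is usually curated by people and not generated, so this is only meant to show that both views describe the same underlying path.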
Tools for Data Lineage
- dbt: Generates a lineage DAG (directed acyclic graph) automatically from transformation code
- OpenLineage / Marquez: OpenLineage is an open standard for collecting lineage metadata; Marquez is an open-source metadata server that implements it
- Data catalogue tools: Collibra, Alation, and DataHub are enterprise data catalogues that include lineage tracking
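To give a feel for what a lineage standard records, the sketch below assembles an event that says "this job read these datasets and wrote those". The field names are modeled on the OpenLineage run event as commonly documented, but treat them as assumptions and check the specification before relying on them. Other fields the specification defines, such as a schema URL, are omitted here. The `build_run_event` helper, the namespace, the producer URL, and the job and dataset names are all hypothetical, and no client library is used.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, namespace="example-namespace"):
    """Assemble a plain dict shaped like a lineage run event (assumed field names)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/hypothetical-pipeline",  # placeholder URI
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
    }


event = build_run_event(
    job_name="build_revenue",
    inputs=["staging.orders"],
    outputs=["marts.revenue"],
)

# Serialized form that a pipeline could send to a lineage server.
payload = json.dumps(event, indent=2)
print(payload)
```

Because each event names its inputs and outputs, a server that collects many such events can stitch them into the table-to-table graph described earlier.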