What Is Data Lineage?
Data lineage is the complete record of how a data asset moves from its original source through every transformation, pipeline step, and system to its final destination in a report, model, dashboard, or operational application.
Lineage answers five questions for any data asset:
- Where did it come from? The source system, database, API, or feed where the data originated.
- What happened to it? Every transformation applied: SQL joins, aggregations, filters, calculations, format conversions.
- Where does it go? Every downstream report, dashboard, model, and system that depends on it.
- When did each step happen? Timestamps for every movement and transformation in the lineage chain.
- Who touched it? The pipelines, jobs, and users that accessed or modified the data at each step.
Data Lineage Definition
Data lineage is the end-to-end visibility of data as it flows through an organization’s systems. It provides a traceable record of data’s origin, transformation history, and consumption — making it possible to trust, audit, and manage data at scale.
A lineage record is not a static diagram drawn once and forgotten. Modern lineage is captured automatically by observing pipeline execution, parsing SQL transformations, and monitoring schema changes. It updates continuously as data moves and pipelines run, so the lineage record reflects the current state of the data estate rather than a snapshot from months ago.
Types of Data Lineage
| Type | What it tracks | Primary use case |
|---|---|---|
| Technical lineage | Data movement through pipelines, ETL jobs, SQL queries, and systems | Debugging, impact analysis, pipeline documentation |
| Business lineage | How datasets map to business terms, KPIs, and reports | Analytics trust, stakeholder alignment, metric consistency |
| Column-level lineage | Which specific fields went through which transformations to produce each output field | Compliance traceability, PII tracking, precise impact analysis |
| Dataset-level lineage | Relationships between tables and datasets | High-level dependency mapping and discovery |
| Temporal lineage | Historical versions and schema changes over time | Audit history, rollback analysis, schema drift detection |
| AI/ML lineage | Training datasets, feature pipelines, and model version history | Model reproducibility, AI governance, regulatory compliance |
Column-level vs. dataset-level: Dataset-level lineage shows that Table A feeds Table B. Column-level lineage shows that the net_revenue field in Table B is calculated from gross_revenue minus discount_amount in Table A, filtered by transaction_status = 'completed'. For regulated environments and complex analytics, column-level lineage is required — dataset-level alone cannot support field-level audit trails or precise impact analysis.
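The difference can be made concrete with a small data model. The following is a minimal sketch, not the schema of any particular lineage tool: the class names `ColumnRef` and `ColumnLineage`, the `filter_columns` field, and the helper `dataset_edges` are all hypothetical. It records the `net_revenue` example at column level and then collapses it to the coarser dataset-level view.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ColumnRef:
    table: str
    column: str


@dataclass(frozen=True)
class ColumnLineage:
    target: ColumnRef
    sources: Tuple[ColumnRef, ...]              # columns whose values feed the expression
    expression: str
    filter_columns: Tuple[ColumnRef, ...] = ()  # columns used only to filter rows
    row_filter: str = ""


# Column-level record for the net_revenue example (table names are placeholders).
net_revenue = ColumnLineage(
    target=ColumnRef("table_b", "net_revenue"),
    sources=(
        ColumnRef("table_a", "gross_revenue"),
        ColumnRef("table_a", "discount_amount"),
    ),
    expression="gross_revenue - discount_amount",
    filter_columns=(ColumnRef("table_a", "transaction_status"),),
    row_filter="transaction_status = 'completed'",
)


def dataset_edges(records):
    """Collapse column-level records into (source_table, target_table) edges."""
    edges = set()
    for rec in records:
        for src in rec.sources + rec.filter_columns:
            edges.add((src.table, rec.target.table))
    return edges


print(dataset_edges([net_revenue]))
```

The dataset-level view keeps only the edge from `table_a` to `table_b`. The expression, the filter, and the individual fields are what a field-level audit needs, and they exist only in the column-level record.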
How Data Lineage Is Captured
Modern data lineage is captured automatically through three mechanisms:
Metadata extraction: Lineage tools connect to data sources, orchestration platforms, BI tools, and warehouses and extract the metadata that describes data movement: query logs, pipeline execution records, schema definitions, and API call histories. This metadata forms the raw material for lineage reconstruction.
SQL and transformation parsing: For SQL-based transformations, lineage tools parse queries to identify source tables, source columns, join relationships, filters, aggregations, and derived columns. This produces column-level lineage from the transformation logic itself without requiring engineers to document it manually.
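To illustrate the idea of deriving column-level lineage from the query text, here is a deliberately simplified sketch. It uses regular expressions rather than a real SQL parser, so it only handles a single flat `SELECT` with `table AS alias` references and alias-qualified columns. It will misread subqueries, CTEs, `CAST(x AS ...)`, and function calls that contain commas. The function name `column_lineage` and the sample tables are assumptions made for illustration.

```python
import re

# Matches "FROM schema.table AS alias" and "JOIN schema.table AS alias".
ALIAS_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)\s+as\s+(\w+)", re.IGNORECASE)
# Captures the select list between SELECT and the first FROM.
SELECT_RE = re.compile(r"\bselect\b(.*?)\bfrom\b", re.IGNORECASE | re.DOTALL)
# Matches alias-qualified column references such as "a.gross_revenue".
COLUMN_RE = re.compile(r"\b(\w+)\.(\w+)\b")


def column_lineage(sql):
    """Map each output column to the fully qualified source columns it reads."""
    aliases = {alias.lower(): table.lower() for table, alias in ALIAS_RE.findall(sql)}
    select_list = SELECT_RE.search(sql).group(1)
    lineage = {}
    for item in select_list.split(","):
        parts = re.split(r"\s+as\s+", item.strip(), flags=re.IGNORECASE)
        expression = parts[0]
        # Without an explicit alias, the output name is the bare column name.
        output = parts[1] if len(parts) > 1 else expression.split(".")[-1]
        sources = [
            f"{aliases[alias.lower()]}.{column}"
            for alias, column in COLUMN_RE.findall(expression)
            if alias.lower() in aliases
        ]
        lineage[output] = list(dict.fromkeys(sources))  # de-duplicate, keep order
    return lineage


sql = """
CREATE TABLE analytics.table_b AS
SELECT a.order_id, a.gross_revenue - a.discount_amount AS net_revenue
FROM sales.table_a AS a
JOIN sales.customers AS c ON c.customer_id = a.customer_id
WHERE a.transaction_status = 'completed'
"""

for output, sources in column_lineage(sql).items():
    print(output, "<-", sources)
```

A production tool would build a full syntax tree so that nested queries and dialect-specific syntax resolve correctly. The regex version is only meant to show what "lineage from the transformation logic itself" means in practice.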
Event-driven updates: When a pipeline runs, a schema changes, or a new source connects, the lineage graph updates automatically. Active lineage reflects the current state of the data estate continuously rather than on a scheduled batch refresh.
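A rough sketch of the event-driven pattern, assuming a hypothetical event format with `type`, `target`, `sources`, and `asset` keys (real platforms define their own event schemas): each incoming event mutates an in-memory graph immediately, so the graph never waits for a batch rebuild.

```python
from collections import defaultdict


class LineageGraph:
    """In-memory lineage graph kept current by applying events as they arrive."""

    def __init__(self):
        self.upstream = defaultdict(set)  # asset -> assets it reads from

    def apply(self, event):
        kind = event["type"]
        if kind == "pipeline_run":
            # Replace the target's inputs with what the latest run actually read.
            self.upstream[event["target"]] = set(event["sources"])
        elif kind == "asset_dropped":
            # Remove the asset itself and every edge that pointed at it.
            self.upstream.pop(event["asset"], None)
            for inputs in self.upstream.values():
                inputs.discard(event["asset"])
        else:
            raise ValueError(f"unknown event type: {kind}")


graph = LineageGraph()
graph.apply({"type": "pipeline_run", "target": "table_b", "sources": ["table_a", "pricing"]})
# A later run no longer reads the pricing table, so that edge disappears.
graph.apply({"type": "pipeline_run", "target": "table_b", "sources": ["table_a"]})
print(sorted(graph.upstream["table_b"]))
```

Replacing the input set on every run, rather than only adding edges, is one possible design choice. It keeps stale dependencies from lingering after a pipeline stops reading a source.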
What Data Lineage Looks Like in Practice
An analyst finds an unexpected number in a dashboard. With lineage: the analyst opens the dashboard field in the data catalog, follows the lineage upstream through the transformation that produced it, identifies the source table it came from, and sees that the source table was refreshed six hours late due to a pipeline delay. Total investigation time: five minutes. Without lineage: the analyst escalates to the data team, who manually traces the number through query histories and pipeline logs. Total investigation time: two to four hours.
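The analyst's upstream walk can be sketched in a few lines. The asset names, timestamps, and the `late_upstream_assets` helper below are invented for illustration. The sketch assumes the lineage graph is available as a mapping from each asset to its direct inputs, together with the expected and actual refresh time of each asset.

```python
from datetime import datetime, timedelta

# asset -> direct upstream inputs (hypothetical assets)
UPSTREAM = {
    "dashboard.revenue": ["model.daily_revenue"],
    "model.daily_revenue": ["raw.transactions"],
    "raw.transactions": [],
}

# asset -> (expected refresh, actual refresh); values are made up for the example
REFRESHES = {
    "dashboard.revenue": (datetime(2024, 1, 15, 4, 0), datetime(2024, 1, 15, 4, 0)),
    "model.daily_revenue": (datetime(2024, 1, 15, 3, 0), datetime(2024, 1, 15, 3, 0)),
    "raw.transactions": (datetime(2024, 1, 15, 2, 0), datetime(2024, 1, 15, 8, 0)),
}


def late_upstream_assets(asset, upstream, refreshes):
    """Walk upstream from an asset and report every input that refreshed late."""
    late, stack, seen = [], [asset], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        expected, actual = refreshes[node]
        if actual > expected:
            late.append((node, actual - expected))
        stack.extend(upstream.get(node, []))
    return late


for name, delay in late_upstream_assets("dashboard.revenue", UPSTREAM, REFRESHES):
    print(f"{name} refreshed {delay} late")
```

In this example the dashboard and the model ran on schedule but read stale data, and the walk points at the source table that arrived six hours late.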
An engineer prepares to change a source table schema. With lineage: the engineer queries the lineage graph to see every downstream table, pipeline, report, and model that depends on any column in the table being changed. Three downstream reports and one ML feature pipeline would break. The engineer coordinates fixes before the change ships. Without lineage: the change ships, three reports break, and the ML pipeline produces incorrect features for 48 hours before the issue is traced and resolved.
A compliance officer responds to a GDPR right-to-erasure request. With lineage: the officer queries the lineage graph for every downstream asset derived from the subject’s customer record. Lineage returns a complete list across six systems in under a minute. All six are updated, and deletion is confirmed. Without lineage: the officer manually contacts each system owner to ask whether they hold the subject’s data. Three weeks later, the investigation is still not complete.
Data Lineage vs. Related Concepts
Data lineage vs. data provenance: Data provenance is the broader concept of documenting the origin and history of data. Data lineage is the specific operational implementation of provenance within a pipeline context: the end-to-end map of how data flows through systems and transforms. Lineage is provenance made queryable and automatable.
Data lineage vs. data cataloging: A data catalog is a searchable inventory of data assets with metadata: definitions, ownership, quality scores, and access information. Data lineage is one component of the catalog — the record of how each asset was produced and what depends on it. Lineage without a catalog has no interface for business users. A catalog without lineage lacks the provenance information that makes assets trustworthy and auditable.
Data lineage vs. data observability: Data observability monitors the health of data in real time: detecting anomalies, schema changes, and pipeline failures. Data lineage provides the context for those observations: when an observability alert fires on an anomaly, lineage identifies where the anomaly originated and what downstream assets are affected. The two capabilities are complementary — observability detects the problem, lineage explains it.
FAQ
What is data lineage in simple terms?
Data lineage is the record of where a piece of data came from, what happened to it along the way, and where it ended up. If you are looking at a number in a report and want to know how it was calculated and where the underlying data originated, data lineage answers that question.
What is an example of data lineage?
A retail company’s weekly revenue report shows an unexpected drop. A data analyst opens the revenue field in the data catalog and follows its lineage upstream. The lineage shows that the field is calculated from a transactions table joined with a product pricing table. The pricing table was updated two days ago with a new discount category that the revenue calculation was not designed to handle. Lineage surfaced the source of the discrepancy in minutes rather than hours.
What is column-level lineage?
Column-level lineage tracks which specific fields in source tables went through which specific transformations to produce each field in downstream tables. It is a more granular and more expensive form of lineage than dataset-level tracking, but it is required for regulatory traceability, precise impact analysis, PII tracking, and AI training data governance.
What is the difference between data lineage and data governance?
Data governance is the framework of policies, roles, and standards that determines how data is managed. Data lineage is one of the operational capabilities that makes governance work: it provides the audit trails that compliance requires, the impact analysis that change management depends on, and the provenance records that data certification needs. Lineage is a component of governance, not a substitute for it.
How is data lineage captured automatically?
Automated lineage tools connect to orchestration and transformation tools such as Airflow and dbt, databases and warehouses, BI tools, and data integration platforms. They extract metadata from query logs, pipeline execution records, and schema definitions, and they parse SQL to recover column-level transformation logic from queries. The extracted metadata is assembled into a lineage graph that updates continuously as pipelines run.
Why does data lineage matter for AI?
AI models require traceable, governed training data. Data lineage documents which datasets were used to train each model, what transformations were applied to produce the training data, and what quality standards the data met at the time of training. Without lineage, a training run is hard to reproduce and a model audit is hard to complete. As AI governance regulations mature, training data lineage is becoming a compliance requirement rather than a best practice.
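One way to make training data traceable is to store a small lineage record next to each model version. The sketch below is an assumption-laden illustration, not a standard format: the field names, the `fingerprint` and `training_record` helpers, and the sample rows are hypothetical. It fingerprints the training rows with SHA-256 so that a later audit can check whether the same data is being used.

```python
import hashlib
import json


def fingerprint(rows):
    """Content hash of the training rows; any change to the data changes the hash."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def training_record(model_version, dataset_name, rows, transformations, quality_checks=()):
    """Bundle what an auditor would need to tie a model version to its data."""
    return {
        "model_version": model_version,
        "dataset": dataset_name,
        "dataset_sha256": fingerprint(rows),
        "row_count": len(rows),
        "transformations": list(transformations),
        "quality_checks": list(quality_checks),
    }


rows = [
    {"customer_id": 1, "net_revenue": 120.0},
    {"customer_id": 2, "net_revenue": 80.5},
]
record = training_record(
    "churn-v3",
    "features.customer_ltv",
    rows,
    transformations=["drop_nulls", "scale_net_revenue"],
    quality_checks=["no_null_customer_id"],
)
print(json.dumps(record, indent=2))
```

Hashing a JSON dump works for small in-memory examples. For large datasets, a team would more likely hash files or partitions, or rely on a table format that versions snapshots.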
What is impact analysis in data lineage?
Impact analysis is the process of identifying every downstream asset that will be affected by a proposed change to a data source or pipeline. Given a lineage graph, impact analysis traverses every downstream edge from the change point to produce a complete list of affected tables, reports, dashboards, and models. This allows engineers to assess risk before a change ships rather than discovering breakage in production.
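The traversal itself is an ordinary breadth-first search over downstream edges. A minimal sketch, assuming the lineage graph is available as a mapping from each asset to its direct dependents (the asset names and the `downstream_impact` function are hypothetical):

```python
from collections import deque

# asset -> direct downstream dependents (hypothetical assets)
DOWNSTREAM = {
    "orders": ["orders_clean"],
    "orders_clean": ["report.weekly_revenue", "report.refunds", "features.customer_ltv"],
    "features.customer_ltv": ["model.churn"],
}


def downstream_impact(graph, changed):
    """Return every asset reachable from the changed asset via downstream edges."""
    affected, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child not in affected:  # also guards against cycles
                affected.add(child)
                queue.append(child)
    return affected


for asset in sorted(downstream_impact(DOWNSTREAM, "orders")):
    print(asset)
```

Because the search follows transitive edges, it also reports assets such as `model.churn` that never read the changed table directly.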