Data lineage is the complete record of how a data asset moves from its original source through every transformation, pipeline step, and system to its final destination in a report, model, or operational application.
When lineage is in place, an analyst looking at an unexpected number in a quarterly report can trace it back to the exact transformation that produced it, the source table it came from, and the pipeline that moved it. An engineer preparing to change a schema can see every downstream asset that will be affected before the change ships. A compliance officer responding to an audit can pull a complete data trail without a manual investigation.
This guide covers what data lineage is, the types of lineage, how it works technically, who uses it, how it connects to data governance and compliance, and how to implement it.
What Is Data Lineage?
Data lineage is the end-to-end visibility of data as it flows through an organization’s systems: from the source where it was created or ingested, through every transformation, join, aggregation, and pipeline step, to the reports, dashboards, models, and operational systems that consume it.
Lineage answers five questions for every data asset:
- Where did it come from? The original source system, database, API, or external feed where this data originated.
- What happened to it? Every transformation applied: SQL joins, aggregations, filters, calculations, format conversions.
- Where does it go? Every downstream system, report, dashboard, model, or application that depends on this asset.
- When did each step happen? The timestamps of every movement and transformation in the lineage chain.
- Who touched it? The pipelines, jobs, and users that accessed or modified the data at each step.
Types of Data Lineage
Not all lineage serves the same purpose. Organizations need different types depending on their governance and analytics requirements.
| Type | What it tracks | Granularity | Primary use case |
|---|---|---|---|
| Technical lineage | How data moves through technical systems: databases, pipelines, ETL jobs, APIs | Table and column level | Impact analysis, root cause investigation, pipeline documentation |
| Business lineage | How data maps to business concepts and metrics: where “revenue” comes from, how “active customer” is calculated | Business term level | Business user trust, metric consistency, cross-team alignment |
| Column-level lineage | Which specific fields in which specific tables went through which transformations to produce each output field | Individual column | Regulatory traceability, impact analysis for schema changes, AI training data governance |
| Table-level lineage | Which tables feed which downstream tables and reports | Table | High-level dependency mapping, broad impact analysis |
| Operational lineage | How data moves through operational systems in real time: event streams, microservices, APIs | Event and message level | Real-time system debugging, operational data governance |
| AI/ML lineage | Which training datasets, feature pipelines, and transformation logic produced each model version | Dataset and feature level | Model reproducibility, regulatory compliance, AI governance |
Column-level vs. table-level lineage: Table-level lineage shows that Table A feeds Table B. Column-level lineage shows that the net_revenue column in Table B is calculated from the gross_revenue column in Table A minus the discount_amount column in Table A, filtered by transaction_status = 'completed'. For regulated industries and complex analytics environments, column-level lineage is required — table-level alone cannot satisfy impact analysis or audit requirements.
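As a minimal sketch, the net_revenue example above could be recorded at both granularities as follows. The names `table_a` and `table_b` and the dictionary layout are illustrative assumptions, not the format of any particular lineage tool.

```python
# Table-level lineage: only records that table_a feeds table_b.
table_lineage = {"table_b": ["table_a"]}

# Column-level lineage: records the input columns, the expression,
# and the filter behind one output column (hypothetical layout).
column_lineage = {
    "table_b.net_revenue": {
        "inputs": ["table_a.gross_revenue", "table_a.discount_amount"],
        "filter_columns": ["table_a.transaction_status"],
        "expression": "gross_revenue - discount_amount",
        "filter": "transaction_status = 'completed'",
    }
}


def columns_affected_by(changed_column):
    """Return output columns that read the changed column as an input or a filter."""
    return sorted(
        output
        for output, record in column_lineage.items()
        if changed_column in record["inputs"] + record["filter_columns"]
    )


print(columns_affected_by("table_a.discount_amount"))
print(columns_affected_by("table_a.customer_id"))
```

With only `table_lineage`, a change to any column in `table_a` would look like it affects all of `table_b`. The column-level record narrows the answer to the fields that actually depend on the change.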
How Data Lineage Works
Modern data lineage is captured automatically by observing how data moves through the systems in the data estate. The technical process involves four stages.
1. Metadata collection
Lineage tools connect to every source in the data estate — databases, data warehouses, data lakes, ETL and ELT platforms, BI tools, streaming systems, and ML feature stores — and extract the metadata that describes data movement:
- Table and column names, schemas, and data types.
- Query logs and execution records from databases and warehouses.
- Pipeline configurations from orchestration tools like Airflow and dbt.
- API call logs from integration platforms.
- Job execution records from ETL platforms.
This metadata forms the raw material that lineage systems use to reconstruct how data moves.
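To make the first bullet concrete, here is a small sketch of schema metadata collection. It assumes a SQLite database purely because SQLite ships with Python; the `orders` table is hypothetical, and a real lineage tool would read each platform's own catalog views and query logs instead.

```python
import sqlite3


def collect_schema_metadata(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite database."""
    metadata = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [(col[1], col[2]) for col in columns]
    return metadata


# Hypothetical source table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, gross_revenue REAL, transaction_status TEXT)"
)
schema = collect_schema_metadata(conn)
print(schema)
```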
2. Transformation parsing
For lineage to show what happened to data at each step, the system must parse the transformation logic. For SQL-based transformations, lineage tools parse queries to identify source tables and columns, join relationships, filters, aggregations, and derived columns. For pipeline-based transformations, they parse configuration files and execution logs to extract the same information.
The output is a graph of nodes (data assets) and edges (transformations) that represents the complete lineage chain from source to consumption.
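The sketch below shows the idea of turning a SQL statement into lineage edges. It is deliberately naive: it uses regular expressions to find the table after `INSERT INTO`, `FROM`, and `JOIN`, and would miss subqueries, CTEs, comments, and quoted identifiers. Production lineage tools rely on a full SQL parser. The statement and the `customers` table are made up for illustration.

```python
import re

# Naive patterns: the identifier that follows FROM/JOIN or INSERT INTO.
_SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
_TARGET_RE = re.compile(r"\bINSERT\s+INTO\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def parse_table_edges(sql):
    """Return sorted (source_table, target_table) edges for one INSERT ... SELECT."""
    target = _TARGET_RE.search(sql)
    if target is None:
        return []
    sources = sorted(set(_SOURCE_RE.findall(sql)))
    return [(source, target.group(1)) for source in sources]


sql = """
INSERT INTO table_b
SELECT a.order_id, a.gross_revenue - a.discount_amount AS net_revenue
FROM table_a AS a
JOIN customers AS c ON c.customer_id = a.customer_id
WHERE a.transaction_status = 'completed'
"""
edges = parse_table_edges(sql)
print(edges)
```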
3. Graph construction and storage
The extracted metadata and parsed transformation logic are assembled into a lineage graph stored in the metadata repository. Each node represents a data asset — a table, column, file, stream, or model. Each edge represents a relationship — a transformation, a pipeline run, a join, a copy. The graph is queryable: given any asset, the system can traverse upstream to find its sources or downstream to find everything that depends on it.
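A lineage graph of this kind can be sketched with two adjacency maps and a breadth-first traversal. The class name, method names, and asset names below are assumptions made for the example; real metadata repositories typically use a graph database or an equivalent store.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph: nodes are data assets, edges point from source to target."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, source, target):
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    def _walk(self, start, adjacency):
        # Breadth-first traversal that tolerates cycles.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        seen.discard(start)
        return seen

    def upstream(self, asset):
        """Every asset that the given asset is derived from."""
        return self._walk(asset, self._upstream)

    def downstream(self, asset):
        """Every asset that depends on the given asset."""
        return self._walk(asset, self._downstream)


graph = LineageGraph()
graph.add_edge("raw.orders", "staging.orders")
graph.add_edge("staging.orders", "marts.revenue")
graph.add_edge("raw.customers", "marts.revenue")
graph.add_edge("marts.revenue", "dashboard.quarterly_report")

print(sorted(graph.downstream("staging.orders")))
print(sorted(graph.upstream("marts.revenue")))
```

The downstream call is the basis of impact analysis, and the upstream call is the basis of root cause investigation.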
4. Continuous update
Lineage diagrams that are built once and never updated drift out of date as pipelines change and data evolves. Modern lineage systems update continuously: when a pipeline runs, the lineage graph updates; when a schema changes, affected edges in the graph are flagged; when a new source connects, its lineage is added automatically. Active lineage reflects the current state of the data estate rather than a point-in-time snapshot.
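The update loop can be sketched as two event handlers, one for pipeline runs and one for schema changes. The edge layout and function names are hypothetical; the point is only that edges are upserted on every run and flagged when a column they read disappears.

```python
from datetime import datetime, timezone

# (source, target) -> {"columns": set, "last_run": datetime, "flagged": bool}
edges = {}


def record_pipeline_run(source, target, columns):
    """Upsert the edge each time a pipeline run from source to target is observed."""
    edge = edges.setdefault((source, target), {"columns": set(), "flagged": False})
    edge["columns"] |= set(columns)
    edge["last_run"] = datetime.now(timezone.utc)


def flag_schema_change(table, removed_columns):
    """Flag every edge that reads a column removed from the given table."""
    removed = set(removed_columns)
    flagged = []
    for (source, target), edge in edges.items():
        if source == table and edge["columns"] & removed:
            edge["flagged"] = True
            flagged.append((source, target))
    return flagged


record_pipeline_run("table_a", "table_b", ["gross_revenue", "discount_amount"])
record_pipeline_run("table_a", "table_c", ["order_id"])
print(flag_schema_change("table_a", ["discount_amount"]))
```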
Who Uses Data Lineage and How
Data engineer: Uses lineage for impact analysis before making changes. Before altering the schema of a source table, the engineer queries lineage to see every downstream table, pipeline, report, and model that depends on any column in that table. Changes that would break downstream consumers are identified and addressed before they ship rather than discovered after they have caused failures.
When a pipeline fails or a report shows unexpected values, the engineer traces the lineage upstream from the affected output to find the transformation step where the issue was introduced. Root cause investigation that previously took hours of manual query review can take minutes.
Data analyst: Uses lineage to trust the data in reports and dashboards. When a metric looks wrong, the analyst follows the lineage from the dashboard field back through the transformation logic to the source table to understand how the number was produced. When the lineage shows a clean, well-governed path from a certified source, the analyst trusts the number without escalating to the data team.
Data steward: Uses lineage to understand the downstream impact of data quality issues. When a quality check flags an anomaly in a source dataset, the steward follows the lineage downstream to identify every report, model, and operational system that may have been affected. This scoping determines the severity of the incident and the breadth of the remediation required.
Compliance officer: Uses lineage to answer regulatory questions without manual investigation. Where did this number in our regulatory submission come from? Which systems hold this customer’s personal data? What transformations did this dataset go through before it was used to train this model? Lineage answers each question from records maintained automatically rather than assembled under audit pressure.
Data scientist: Uses lineage to document training data provenance. Every dataset used to train or fine-tune a model has a lineage record showing its source, transformations, quality certification, and access history. This record makes model training reproducible — given the lineage record, any training run can be reconstructed exactly — and auditable for regulatory purposes.
Chief Data Officer: Uses lineage as a governance health signal. Lineage coverage — the percentage of data assets with complete, automated lineage — is a leading indicator of governance program maturity. Low coverage in a domain signals that pipelines in that domain are undocumented and ungoverned.
Data Lineage and Data Governance
Lineage is the operational backbone of data governance. Governance policies require lineage to be enforced and audited.
| Governance requirement | How lineage supports it |
|---|---|
| Compliance audit trails | Lineage provides the complete data trail that regulators require: source, transformation history, access records, and consumption |
| Data quality root cause analysis | Lineage traces quality failures to their origin, enabling targeted remediation rather than broad investigation |
| Impact analysis | Lineage shows every downstream asset affected by a change before the change is made |
| Data certification | Certified assets require documented lineage as a condition of certification — users trust certified assets partly because their provenance is traceable |
| Access governance | Lineage shows who has accessed data at each stage, supporting access review and audit |
| AI training data governance | Lineage documents the provenance of every training dataset, meeting model reproducibility and regulatory requirements |
| Right-to-erasure compliance | Lineage identifies every system where a specific individual’s data exists, enabling complete deletion across all systems |
Data Lineage in Regulated Industries
Financial services: BCBS 239 requires banks to demonstrate that risk data is accurate and that its lineage from source to regulatory submission is traceable and documented. Manual lineage documentation produced quarterly is insufficient: regulators expect lineage to be maintained continuously and available on demand. SOX compliance requires audit trails for financial reporting data, which automated lineage produces as a byproduct of pipeline operations.
Healthcare: HIPAA requires documented accountability for PHI: where it came from, how it moved, who accessed it, and what it was used for. Lineage provides this documentation automatically for every PHI dataset in the data estate. When a breach investigation requires identifying every system that touched a specific patient record, lineage answers the question in minutes.
Pharmaceuticals: FDA 21 CFR Part 11 and GxP regulations require data integrity documentation for clinical and manufacturing data. Lineage tracks the provenance of every dataset used in regulatory submissions, demonstrating that source data was not altered without documentation and that every transformation is traceable.
Insurance: Actuarial models feeding regulatory capital calculations require demonstrable lineage from raw data inputs through every transformation to the final model output. Lineage makes these models auditable and their inputs reproducible.
Retail and e-commerce: Customer data feeding personalization models, pricing algorithms, and fraud detection systems requires lineage for GDPR and CCPA compliance. When a customer submits a right-to-erasure request, lineage identifies every system in the data estate that holds derived data from that customer’s records.
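For the right-to-erasure case, the lookup amounts to walking downstream from the asset that holds the customer's records and collecting the systems that host each derived asset. The sketch below assumes lineage edges are available as `(source, target)` pairs and that each asset is mapped to a hosting system; all asset and system names are invented for the example.

```python
def systems_holding_derived_data(edges, asset_system, source_asset):
    """Return the systems that hold source_asset or any asset derived from it."""
    downstream = {}
    for source, target in edges:
        downstream.setdefault(source, []).append(target)

    seen, stack = {source_asset}, [source_asset]
    while stack:
        for neighbor in downstream.get(stack.pop(), []):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return {asset_system[asset] for asset in seen}


# Hypothetical lineage edges and asset-to-system mapping.
edges = [
    ("crm.customers", "warehouse.customer_dim"),
    ("warehouse.customer_dim", "ml.churn_features"),
    ("warehouse.customer_dim", "bi.retention_dashboard"),
    ("erp.invoices", "warehouse.invoice_fact"),
]
asset_system = {
    "crm.customers": "crm",
    "warehouse.customer_dim": "warehouse",
    "ml.churn_features": "feature_store",
    "bi.retention_dashboard": "bi_tool",
    "erp.invoices": "erp",
    "warehouse.invoice_fact": "warehouse",
}

erasure_scope = systems_holding_derived_data(edges, asset_system, "crm.customers")
print(sorted(erasure_scope))
```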
Data Lineage and AI
AI governance is driving new lineage requirements that traditional lineage tools were not designed for.
Training data lineage: Every dataset used to train or fine-tune a model requires a lineage record: source system, transformation history, quality certification at the time of training, PII classification review, and the identity of the steward who certified it. Without this record, model training cannot be reproduced, and model audits cannot be completed.
Feature pipeline lineage: Feature engineering pipelines transform raw data into the features that models consume. Column-level lineage through feature pipelines shows exactly which source fields contributed to each model feature, enabling impact analysis when source schemas change and providing the documentation that AI audits require.
Model version lineage: Each model version is a product of specific training data versions, specific feature pipeline versions, and specific hyperparameters. Model lineage tracks all of these so that any model version can be reproduced exactly and the difference between two model versions can be understood precisely.
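One way to make model versions comparable is to store the three inputs named above in a single record and derive a fingerprint from them. This is a sketch under assumed field names and version strings, not a description of any specific model registry.

```python
import hashlib
import json


def model_lineage_record(training_data_versions, feature_pipeline_version, hyperparameters):
    """Bundle the inputs of a model version and fingerprint them deterministically."""
    record = {
        "training_data_versions": training_data_versions,
        "feature_pipeline_version": feature_pipeline_version,
        "hyperparameters": hyperparameters,
    }
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(record, sort_keys=True)
    record["fingerprint"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record


def diff_records(a, b):
    """Return the names of the lineage fields that differ between two versions."""
    return sorted(key for key in a if key != "fingerprint" and a[key] != b[key])


# Hypothetical versions for illustration.
v1 = model_lineage_record({"orders": "2024-01-15"}, "features-v3", {"learning_rate": 0.1})
v2 = model_lineage_record({"orders": "2024-01-15"}, "features-v3", {"learning_rate": 0.05})

print(v1["fingerprint"] == v2["fingerprint"])
print(diff_records(v1, v2))
```

Two versions with identical inputs share a fingerprint, and `diff_records` names the inputs that changed when they do not.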
RAG pipeline lineage: Retrieval-augmented generation pipelines pull documents and datasets into LLM context windows at query time. Lineage tracks which source documents and datasets are eligible for retrieval, which were used in specific query responses, and what access controls governed each retrieval — the audit trail that AI governance programs and emerging regulations require.
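A retrieval audit trail of this kind can be sketched as a filter that logs every access decision. The document fields (`doc_id`, `source`, `allowed_groups`) and the group-based access check are assumptions chosen to keep the example small.

```python
from datetime import datetime, timezone


def retrieve_with_audit(query_id, user_groups, candidates, audit_log):
    """Filter candidate documents by access control and log each decision."""
    permitted_ids = []
    for doc in candidates:
        permitted = bool(set(doc["allowed_groups"]) & set(user_groups))
        audit_log.append(
            {
                "query_id": query_id,
                "doc_id": doc["doc_id"],
                "source": doc["source"],
                "permitted": permitted,
                "logged_at": datetime.now(timezone.utc).isoformat(),
            }
        )
        if permitted:
            permitted_ids.append(doc["doc_id"])
    return permitted_ids


# Hypothetical retrieval candidates.
candidates = [
    {"doc_id": "doc-1", "source": "wiki/pricing-policy", "allowed_groups": ["sales", "finance"]},
    {"doc_id": "doc-2", "source": "hr/salary-bands", "allowed_groups": ["hr"]},
]
audit_log = []
allowed = retrieve_with_audit("q-001", ["sales"], candidates, audit_log)
print(allowed)
print(len(audit_log))
```

Denied candidates are logged as well, so the trail shows both what reached the context window and what was withheld.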
Best Practices for Implementing Data Lineage
Start with the highest-risk pipelines: Begin lineage implementation with the pipelines that feed regulatory reporting, business-critical dashboards, and AI models in production. Complete lineage coverage across the entire data estate takes time. Starting where the stakes are highest produces governance value immediately.
Automate from the start: Manual lineage documentation — written by engineers in wikis and spreadsheets — becomes inaccurate within weeks as pipelines change. Automated lineage that captures pipeline execution metadata directly from orchestration tools, databases, and BI platforms stays current without human maintenance.
Require column-level lineage for regulated data: Table-level lineage is insufficient for regulatory traceability and impact analysis in complex environments. Define column-level lineage as a requirement for every pipeline handling regulated data and for every pipeline feeding business-critical reports or AI models.
Integrate lineage with the data catalog: Lineage stored in a separate tool from the metadata catalog requires users to consult two systems. Lineage integrated into the data catalog appears alongside the asset’s definition, quality score, and ownership in a single view, making it accessible to analysts and business users rather than only to engineers.
Make lineage visible to business users: Technical lineage diagrams showing SQL joins and pipeline configurations are useful for engineers but not for analysts and stewards. Business lineage views that translate technical dependencies into business-term relationships make lineage useful for the full spectrum of people who need to trust and act on data.
Monitor lineage coverage as a governance KPI: Track the percentage of data assets with automated lineage documentation as a governance program metric. Low coverage in a domain is a leading indicator that pipelines in that domain are undocumented and that governance policies for those assets cannot be enforced or audited.
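The coverage metric is a simple ratio per domain. A minimal sketch, assuming each asset in the catalog export carries a `domain` and a `has_lineage` flag (both field names are hypothetical):

```python
def lineage_coverage(assets):
    """Return {domain: percent of assets with automated lineage}, rounded to 0.1."""
    totals, covered = {}, {}
    for asset in assets:
        domain = asset["domain"]
        totals[domain] = totals.get(domain, 0) + 1
        covered[domain] = covered.get(domain, 0) + (1 if asset["has_lineage"] else 0)
    return {domain: round(100 * covered[domain] / totals[domain], 1) for domain in totals}


# Hypothetical catalog export.
assets = [
    {"name": "finance.ledger", "domain": "finance", "has_lineage": True},
    {"name": "finance.revenue", "domain": "finance", "has_lineage": True},
    {"name": "finance.forecast", "domain": "finance", "has_lineage": True},
    {"name": "finance.adjustments", "domain": "finance", "has_lineage": False},
    {"name": "marketing.campaigns", "domain": "marketing", "has_lineage": True},
    {"name": "marketing.leads", "domain": "marketing", "has_lineage": False},
]
coverage = lineage_coverage(assets)
print(coverage)
```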
FAQ
Data lineage tracks the flow and transformation of data across systems. Data governance defines the policies, standards, and controls for managing and protecting data. Within a data intelligence platform, lineage operationalizes governance by providing visibility into how governed data is actually used in analytics and AI workflows.
Data lineage is the complete record of where a piece of data came from, what happened to it along the way, and where it ended up. It answers the question: if I am looking at this number in a report, where exactly did it come from and how was it calculated?
Data provenance is the broader concept of documenting the origin and history of data — where it came from and how it was produced. Data lineage is the specific implementation of provenance tracking within a data pipeline context: the end-to-end map of how data flows through systems and transforms along the way. Lineage is the operational form of provenance for enterprise data environments.
Table-level lineage shows which tables feed which downstream tables and reports. Column-level lineage shows which specific fields in specific tables went through which specific transformations to produce each output field. Column-level lineage is required for regulatory traceability, serious impact analysis, and AI training data governance. Table-level lineage is useful for high-level dependency mapping but insufficient for complex environments.
A data catalog is a searchable inventory of data assets with metadata: definitions, ownership, quality scores, and access information. Data lineage is one component of a data catalog — the record of how each asset was produced and what depends on it. Lineage without a catalog has no interface for business users. A catalog without lineage lacks the provenance information that makes assets trustworthy and auditable.
Automated lineage tools connect to orchestration platforms like Airflow and dbt, databases and warehouses, BI tools, and data integration platforms. They extract metadata from query logs, pipeline execution records, API call histories, and configuration files. SQL parsing extracts column-level transformation logic from queries. The extracted metadata is assembled into a lineage graph that updates continuously as pipelines run.
Impact analysis is the process of identifying every downstream asset that will be affected by a change to a data source or pipeline. Given a lineage graph, impact analysis traverses every edge downstream from the proposed change point to produce a complete list of affected tables, reports, dashboards, and models. This allows engineers to assess the risk and scope of a change before it is made rather than discovering breakage after deployment.
GDPR requires organizations to know where personal data exists, how it flows, and how to delete it on request. Lineage identifies every system in the data estate that holds personal data or data derived from it, traces how it moved from its source through every transformation, and documents every access event. When a right-to-erasure request arrives, lineage identifies every system where deletion is required and confirms completeness after deletion.
AI data lineage extends traditional lineage to cover AI systems: which training datasets fed each model, which feature engineering pipelines transformed raw data into model inputs, which hyperparameters and evaluation criteria produced each model version, and which source documents or datasets were retrieved in RAG pipeline responses. AI lineage makes model training reproducible, model audits completable, and AI governance programs defensible.
Lineage and quality are complementary governance capabilities. Quality monitoring detects anomalies in data. Lineage identifies where those anomalies originated. When a quality check flags an unexpected null rate in a reporting table, lineage traces the issue upstream to the specific source table or transformation step that introduced it, enabling targeted remediation rather than broad investigation.