Data Contracts: Definition, Components, and Implementation Guide

A guide to data contracts

A data contract is a formal agreement between a data producer and a data consumer that defines the structure, quality standards, ownership, SLAs, and terms of use for a specific dataset — so that downstream teams can trust what they receive and upstream teams know exactly what they are accountable for delivering.

Without a data contract, a schema change in a source system silently breaks a downstream report. A pipeline delivers data two hours late with no notification. A field that analysts rely on disappears without warning. Data contracts prevent all three by making expectations explicit, versioned, and enforceable before problems reach production.


What Is a Data Contract?

A data contract is a document — typically machine-readable YAML or JSON — that specifies what data a producer will deliver, in what format, at what quality level, on what schedule, and under what terms of use. It is the data equivalent of an API contract: a defined interface between the system that produces data and the systems that consume it.

Where an API contract governs how software systems communicate, a data contract governs how data flows between teams, pipelines, and systems. When a producer and consumer agree on a data contract and embed it in their pipelines, schema changes trigger alerts rather than silent failures, quality violations are caught at the source rather than discovered in a quarterly report, and ownership is documented rather than left to institutional memory.


What a Data Contract Contains

The emerging industry standard for data contracts is the Open Data Contract Standard (ODCS), maintained by Bitol, a Linux Foundation project. A complete data contract covers seven components.

| Component | What it defines | Example |
| --- | --- | --- |
| Fundamentals | Contract ID, name, version, status, and the standard version used | id: orders-v1 / version: 1.0.0 / status: active |
| Schema | Table structure, column names, data types, primary keys, and business semantics for each field | order_id: string, primary key / order_status: string, valid values: [pending, complete, cancelled] |
| Data quality rules | Validation checks that must pass before data is considered deliverable | Row count above 1,000 / null rate below 2% on required fields / custom SQL validation logic |
| Team | Who owns and maintains the contract, their roles, and how consumers can reach them | Owner: data engineering / Steward: revenue analytics / Support: #data-contracts Slack |
| SLAs | Delivery schedule, freshness guarantee, and availability commitments | Data refreshed by 6 AM UTC daily / 99.5% availability / maximum 15-minute latency |
| Terms of use | What consumers can and cannot do with the data | Approved for internal analytics / not approved for external sharing / PII handling required |
| Servers | Where the data lives and how to connect to it | Snowflake database, schema, and table reference / environment-specific connection details |

A minimal contract can start with fundamentals, schema, and quality rules. SLAs, terms of use, and server details are added as the program matures.


Data Contract vs. Related Concepts

| | Data Contract | Data Quality Rule | SLA | Schema |
| --- | --- | --- | --- | --- |
| What it is | A formal agreement covering schema, quality, ownership, SLAs, and terms of use for a dataset | A single validation check applied to a field or table | A service level commitment (uptime, freshness, delivery time) | The structural definition of a table or dataset |
| Scope | End-to-end: producer to consumer | A single quality dimension | Delivery and availability only | Structure only |
| Who sets it | Agreed between producer and consumer | Set by data engineering or stewardship | Set by data engineering or platform team | Set by data engineering |
| Enforced how | Pipeline validation, alerting, governance workflows | Automated quality checks | Monitoring and alerting | Schema registry, DDL |
| Relationship | Contains quality rules, SLAs, and schema as components | A component of a data contract | A component of a data contract | A component of a data contract |

Data contract vs. data governance policy: A data governance policy sets organization-wide standards for how data is managed. A data contract applies those standards to a specific dataset exchanged between a specific producer and consumer. Governance policies are organizational; data contracts are dataset-specific.

Data contract vs. data catalog: A data catalog documents what data assets exist and their metadata. A data contract formalizes the commitments a producer makes about a specific asset. A catalog entry describes a dataset; a data contract governs its delivery.


How Data Contracts Work

Without a data contract: A data engineering team changes the schema of the orders table — renaming customer_id to cust_id and dropping the discount_code field. The change ships on a Tuesday. By Wednesday morning, three downstream pipelines have failed, two dashboards show null values, and a machine learning feature pipeline has been silently receiving zeros for a field it uses as a key predictor. The analytics team identifies the problem when a VP inquires about the drop in revenue numbers.

With a data contract: The same schema change triggers a contract validation failure before the change ships. The data contract specifies that customer_id is a required field of type string and that discount_code must be present with a null rate below 5%. The engineering team receives an alert identifying which downstream consumers depend on the affected fields. They coordinate the change with consumers, update the contract version, and deploy with a migration path. No pipelines break.
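
The pre-deployment check in this scenario can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real contract tool: the `find_breaking_changes` function, the dictionary-based schema representation, and the consumer names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical contract: required fields and their types, plus which
# consumers depend on each field (all names are illustrative only).
CONTRACT_FIELDS: Dict[str, str] = {
    "order_id": "string",
    "customer_id": "string",
    "discount_code": "string",
}
FIELD_CONSUMERS: Dict[str, List[str]] = {
    "customer_id": ["revenue_dashboard", "ml_feature_pipeline"],
    "discount_code": ["promo_report"],
}


def find_breaking_changes(
    contract: Dict[str, str], proposed: Dict[str, str]
) -> List[str]:
    """Return one message per contracted field that is missing or retyped."""
    problems = []
    for field, expected_type in contract.items():
        consumers = ", ".join(FIELD_CONSUMERS.get(field, [])) or "none recorded"
        if field not in proposed:
            problems.append(f"{field}: missing (consumers: {consumers})")
        elif proposed[field] != expected_type:
            problems.append(
                f"{field}: type {proposed[field]} != {expected_type} "
                f"(consumers: {consumers})"
            )
    return problems


# The proposed change renames customer_id to cust_id and drops discount_code.
proposed_schema = {"order_id": "string", "cust_id": "string"}
violations = find_breaking_changes(CONTRACT_FIELDS, proposed_schema)
for message in violations:
    print(message)
```

Run as a check before deployment, a non-empty `violations` list would block the change and tell the engineering team which consumers to contact.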


How Data Contracts Are Implemented

Step 1: Define the producer and consumer

Identify the team that owns the data asset (the producer) and the teams that consume it (the consumers). Data contracts work best when producers and consumers negotiate the contract together rather than having it imposed on one side.

Step 2: Write the contract

Start with the three essential components: fundamentals (ID, version, status), schema (field names, types, constraints), and quality rules (validation checks). Use a machine-readable format — YAML is the standard for ODCS contracts. Store the contract file in version control alongside the pipeline code it governs.

Minimal contract example (ODCS-style YAML; validate exact field names against the ODCS version you adopt):

apiVersion: v3.1.0
kind: DataContract
id: orders-daily
name: Orders
version: 1.0.0
status: active

schema:
  - name: orders
    physicalType: TABLE
    properties:
      - name: order_id
        logicalType: string
        primaryKey: true
      - name: order_status
        logicalType: string
      - name: customer_id
        logicalType: string
      - name: order_amount
        logicalType: number

quality:
  - type: rowCount
    mustBeGreaterThan: 1000
  - type: completeness
    column: order_id
    mustBeGreaterThan: 0.99

team:
  - name: Data Engineering
    role: owner

Step 3: Embed validation in the pipeline

Connect the contract to the pipeline that produces the data. Quality rules in the contract run automatically when the pipeline executes. If a rule fails — row count below threshold, null rate above limit, schema mismatch — the pipeline alerts the owner and optionally halts delivery rather than propagating bad data downstream.
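
One way to picture this step is a small evaluator that runs the contract's quality rules against each batch before it is published. The sketch below assumes the rule shape from the example contract above (`rowCount` and `completeness`, each with `mustBeGreaterThan`); the `validate_batch` and `deliver` functions are hypothetical and not part of ODCS or any library.

```python
from typing import Any, Dict, List

# Quality rules mirroring the example contract's shape; the evaluator
# below is a hypothetical sketch, not an ODCS reference implementation.
QUALITY_RULES = [
    {"type": "rowCount", "mustBeGreaterThan": 1000},
    {"type": "completeness", "column": "order_id", "mustBeGreaterThan": 0.99},
]


def validate_batch(
    rows: List[Dict[str, Any]], rules: List[Dict[str, Any]]
) -> List[str]:
    """Evaluate each rule against the batch and return failure messages."""
    failures = []
    for rule in rules:
        threshold = rule["mustBeGreaterThan"]
        if rule["type"] == "rowCount":
            observed = float(len(rows))
        elif rule["type"] == "completeness":
            column = rule["column"]
            non_null = sum(1 for row in rows if row.get(column) is not None)
            observed = non_null / len(rows) if rows else 0.0
        else:
            failures.append(f"unknown rule type: {rule['type']}")
            continue
        if not observed > threshold:
            failures.append(
                f"{rule['type']}: observed {observed:.4g}, required > {threshold}"
            )
    return failures


def deliver(rows, rules, halt_on_failure=True) -> bool:
    """Alert on every failure; return True if the batch may be published."""
    failures = validate_batch(rows, rules)
    for failure in failures:
        print(f"ALERT to contract owner: {failure}")
    return not (failures and halt_on_failure)


good_batch = [{"order_id": f"o{i}"} for i in range(1500)]
bad_batch = [{"order_id": None if i % 10 == 0 else f"o{i}"} for i in range(500)]
print(deliver(good_batch, QUALITY_RULES))
print(deliver(bad_batch, QUALITY_RULES))
```

The `halt_on_failure` flag reflects the "optionally halts delivery" choice: with it set to `False`, the owner is still alerted but the batch is published anyway.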

Step 4: Version the contract

When the schema or quality standards change, increment the contract version. Maintain backward compatibility where possible. When a breaking change is unavoidable, communicate with all consumers identified in the contract before the change ships. The version history in source control provides an audit trail of every change and who approved it.
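
As an illustration, the version increment can be derived from the schema change itself. The sketch below assumes a semantic-versioning policy that the text does not prescribe: removing or retyping a field counts as breaking (major bump), and adding a field counts as backward compatible (minor bump). Both function names are hypothetical.

```python
from typing import Dict


def classify_change(old: Dict[str, str], new: Dict[str, str]) -> str:
    """Classify a schema change as 'major', 'minor', or 'none'.

    Assumed policy: removing or retyping a field is breaking (major);
    adding a field is backward compatible (minor).
    """
    removed_or_retyped = any(
        name not in new or new[name] != dtype for name, dtype in old.items()
    )
    if removed_or_retyped:
        return "major"
    if set(new) - set(old):
        return "minor"
    return "none"


def bump_version(version: str, change: str) -> str:
    """Increment a MAJOR.MINOR.PATCH version string for the given change."""
    major, minor, _patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    return version


v1 = {"order_id": "string", "customer_id": "string"}
additive = {**v1, "order_amount": "number"}
breaking = {"order_id": "string", "cust_id": "string"}

print(bump_version("1.0.0", classify_change(v1, additive)))  # 1.1.0
print(bump_version("1.0.0", classify_change(v1, breaking)))  # 2.0.0
```

A major bump is the signal to notify every consumer listed in the contract before the change ships.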

Step 5: Publish to the data catalog

Register the contract in the data catalog so consumers can discover it, review its current status, and subscribe to change notifications. A contract visible in the catalog makes data assets self-documenting — consumers can see the schema, quality standards, SLAs, and terms of use without asking the producing team directly.

Step 6: Monitor and enforce

Set up monitoring that checks contract compliance on each pipeline run: schema validation, quality rule evaluation, and SLA adherence. Route violations to the owner’s stewardship workflow. Track violation rates over time as a quality health metric.
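
A compact sketch of this monitoring loop follows. It assumes the freshness SLA from the components table ("data refreshed by 6 AM UTC daily") and records one pass/fail result per check per run. The function names and the run history are invented for illustration.

```python
from datetime import date, datetime, time, timezone
from typing import Dict, List

# Assumed SLA, taken from the example "Data refreshed by 6 AM UTC daily".
FRESHNESS_DEADLINE_UTC = time(6, 0)


def met_freshness_sla(day: date, refreshed_at: datetime) -> bool:
    """True if the refresh for `day` finished at or before the UTC deadline."""
    deadline = datetime.combine(day, FRESHNESS_DEADLINE_UTC, tzinfo=timezone.utc)
    return refreshed_at <= deadline


def violation_rate(runs: List[Dict[str, bool]]) -> float:
    """Share of runs in which any check (schema, quality, SLA) failed."""
    if not runs:
        return 0.0
    failed = sum(1 for run in runs if not all(run.values()))
    return failed / len(runs)


def utc(day: int, hour: int, minute: int) -> datetime:
    """Helper for building illustrative May 2024 timestamps in UTC."""
    return datetime(2024, 5, day, hour, minute, tzinfo=timezone.utc)


# Made-up history: one quality failure and one late delivery in four runs.
run_history = [
    {"schema": True, "quality": True,
     "sla": met_freshness_sla(date(2024, 5, 1), utc(1, 5, 40))},
    {"schema": True, "quality": False,
     "sla": met_freshness_sla(date(2024, 5, 2), utc(2, 5, 55))},
    {"schema": True, "quality": True,
     "sla": met_freshness_sla(date(2024, 5, 3), utc(3, 7, 10))},
    {"schema": True, "quality": True,
     "sla": met_freshness_sla(date(2024, 5, 4), utc(4, 6, 0))},
]
print(f"violation rate: {violation_rate(run_history):.0%}")
```

Tracking this single number per contract over time gives the quality health metric described above; individual failed runs would still be routed to the owner.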


Data Contracts and Data Mesh

Data contracts are a foundational component of data mesh architecture. In a data mesh, domain teams own and publish data products for consumption by other domains. A data contract is the formal interface that makes a data product usable: it defines what the product delivers, to what standard, and under what terms, so consuming domains can build reliably on it.

Without data contracts, a data mesh produces data products that other teams are afraid to depend on. With data contracts, each data product has a versioned, enforceable interface: the same principle that makes microservices architectures reliable, applied to data.


Data Contracts and AI

AI governance is creating new demand for data contracts. Every dataset used to train or fine-tune a model requires the same commitments a data contract provides: a defined schema, documented quality standards, clear ownership, and terms governing appropriate use.

Training data contracts: A data contract for a training dataset specifies which fields are included, what quality thresholds they must meet, whether the dataset contains PII (and if so, under what handling requirements), and who certified it for AI use. This contract becomes the documentation artifact that model reproducibility and AI audit requirements need.

RAG pipeline contracts: Retrieval-augmented generation pipelines pull documents and datasets into LLM context windows at query time. A data contract for a RAG-eligible dataset specifies which fields can be retrieved, under what access conditions, and with what freshness guarantee.
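
To make the RAG case concrete, the sketch below applies three contract terms at retrieval time: which fields may be retrieved, who may retrieve them, and how fresh the record must be. The `RAG_CONTRACT` structure, the role name, and the seven-day freshness limit are hypothetical choices for illustration only.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional

# Hypothetical contract terms for a RAG-eligible dataset.
RAG_CONTRACT = {
    "retrievable_fields": {"title", "summary"},
    "allowed_roles": {"support_agent"},
    "max_age": timedelta(days=7),
}


def retrieve_for_context(
    record: Dict[str, Any],
    caller_role: str,
    now: datetime,
    contract: Dict[str, Any] = RAG_CONTRACT,
) -> Optional[Dict[str, Any]]:
    """Return only contract-approved fields, or None if access or freshness fails."""
    if caller_role not in contract["allowed_roles"]:
        return None
    if now - record["updated_at"] > contract["max_age"]:
        return None
    return {
        key: value
        for key, value in record.items()
        if key in contract["retrievable_fields"]
    }


now = datetime(2024, 5, 10, tzinfo=timezone.utc)
fresh = {
    "title": "Refund policy",
    "summary": "Refunds within 30 days",
    "internal_notes": "do not share",
    "updated_at": now - timedelta(days=2),
}
print(retrieve_for_context(fresh, "support_agent", now))
```

Fields outside the contract, such as `internal_notes` here, never reach the LLM context window.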

Feature store contracts: Machine learning feature pipelines transform raw data into model features. A data contract between the feature store and the model training pipeline specifies the feature schema, the statistical distribution expectations (basis for drift detection), and the latency SLA for feature delivery at inference time.
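
The distribution expectation can be illustrated with a deliberately simple mean-shift check. This is a stand-in for fuller drift tests (for example a population stability index or a Kolmogorov-Smirnov test), and the expected mean, standard deviation, and tolerance shown are invented values, as is the `drifted_features` function.

```python
from statistics import fmean
from typing import Dict, List

# Hypothetical distribution expectations recorded in a feature contract.
FEATURE_EXPECTATIONS = {
    "order_amount": {"mean": 50.0, "std": 10.0, "max_shift_in_std": 0.5},
}


def drifted_features(
    batch: Dict[str, List[float]], expectations: Dict[str, Dict[str, float]]
) -> List[str]:
    """Return features whose batch mean moved more than the allowed shift."""
    drifted = []
    for name, spec in expectations.items():
        # Shift of the batch mean, measured in contracted standard deviations.
        shift = abs(fmean(batch[name]) - spec["mean"]) / spec["std"]
        if shift > spec["max_shift_in_std"]:
            drifted.append(name)
    return drifted


stable = {"order_amount": [48.0, 52.0, 51.0, 49.0]}
shifted = {"order_amount": [70.0, 65.0, 72.0, 68.0]}
print(drifted_features(stable, FEATURE_EXPECTATIONS))   # []
print(drifted_features(shifted, FEATURE_EXPECTATIONS))  # ['order_amount']
```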


Data Contracts in Regulated Industries

Financial services: BCBS 239 requires banks to demonstrate data accuracy and lineage for risk reporting. A data contract between a source system and the risk reporting pipeline formalizes the quality standards and schema commitments that BCBS 239 requires — and generates the documentation that auditors ask for as a byproduct of daily operations.

Healthcare: HIPAA requires documented accountability for PHI. A data contract for a dataset containing PHI specifies the sensitivity classification, access restrictions, handling requirements, and approved uses. When an audit asks how PHI was governed in a specific pipeline, the data contract provides the answer.

Pharmaceuticals: FDA 21 CFR Part 11 and GxP regulations require data integrity for clinical data. Data contracts applied to clinical datasets document the schema, quality standards, and chain of custody that data integrity requirements demand.

Frequently Asked Questions

What is a data contract?

A data contract is a formal agreement between a data producer and a data consumer that defines the schema, quality standards, ownership, delivery SLAs, and terms of use for a specific dataset. It makes expectations explicit and enforceable so that downstream teams can build reliably on data they receive.

What is the difference between a data contract and an SLA?

An SLA covers one dimension of a data contract: the delivery schedule, freshness guarantee, and availability commitment. A data contract is broader: it includes the SLA plus the schema definition, quality rules, ownership, and terms of use. An SLA tells consumers when data will arrive; a data contract tells them what it will contain and what standards it will meet.

What is the Open Data Contract Standard (ODCS)?

The Open Data Contract Standard (ODCS) is a machine-readable format for data contracts, maintained by Bitol, a Linux Foundation project. It covers fundamentals, schema, quality rules, team ownership, SLAs, terms of use, and server configuration in a YAML document. It is the emerging industry standard for data contracts across modern data stacks.

Are data contracts legally binding?

Not in most cases. Data contracts are internal governance documents used to coordinate and enforce standards between teams. They are not legal contracts in the commercial sense. Their value is operational: they make expectations explicit, create accountability, and enable automated enforcement.

Who writes a data contract?

The data producer writes the initial draft in collaboration with data consumers. The producer defines what they can commit to delivering; the consumer defines what they need. The contract reflects the negotiated outcome. A data steward or governance lead may review and approve the contract before it is published.

How do data contracts prevent pipeline failures?

By embedding schema and quality validation directly into the pipeline that produces data. When a contract specifies that a field must be present with a null rate below 2% and the pipeline produces data with a 15% null rate, the contract validation fails and alerts the owner before bad data reaches downstream consumers. Schema changes that would break consumer pipelines are caught before deployment rather than after.

How do data contracts relate to data quality rules?

Data quality rules are a component of a data contract. The contract defines what quality thresholds the data must meet; the quality validation layer enforces them at pipeline execution time. A data contract without quality rules only governs schema and delivery, so it does not guarantee that the data inside the schema is trustworthy.

How do data contracts fit into a data mesh?

In a data mesh, domain teams own and publish data products. A data contract is the formal interface that makes a data product consumable: it defines the schema, quality standards, SLAs, and terms of use that consuming domains can depend on. Without data contracts, data mesh products are undocumented black boxes. With them, each product has a versioned, enforceable interface.