How to Build a Data Validation Layer in Python Pipelines

Most data automation scripts handle the happy path correctly. An API returns the expected fields, the types match, the values are within range -- the pipeline runs clean. The problem is that real API responses are not always the happy path. Upstream systems add new fields without warning. Optional fields that were never null start arriving as null. A numeric field returns a string for one record out of a thousand. The pipeline continues, writes corrupt data to the database, and nobody notices until a report shows a gap or a downstream calculation fails months later.

The fix is a validation layer that runs before you process or store anything. Not a try/except block around the database write, and not a data quality check after the fact -- an explicit validation step that inspects each record against your expectations, rejects anything that does not match, and routes the rejected records somewhere you can review them. This article covers how to build that layer in Python using schema validation and business rule validation, and where to place it in a typical automation pipeline.

Two Categories of Validation: You Need Both

Data validation splits into two categories that are easy to conflate but genuinely different.

Schema validation checks that data has the right structure: required fields are present, types match expectations, strings are in the right format. Schema validation is mechanical -- a field either exists or it does not, a value is either a valid ISO date string or it is not. Libraries automate this entirely.

Business rule validation checks semantic correctness: a purchase date cannot be in the future, an amount cannot be negative, a customer ID must correspond to an existing record. These rules come from your business domain, not from the data schema, and no library can generate them for you. They require code that understands what the values mean, not just what type they have.

You need both. Schema validation without business rules catches type errors but lets through logically impossible values. Business rule validation without schema validation fails badly when fields are missing or wrongly typed and you try to evaluate rules against them. The combination catches the most common failure modes before they reach your database.
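To make the distinction concrete, here is a minimal standard-library sketch. The record, the field names, and the two helpers (`schema_errors`, `rule_errors`) are hypothetical, chosen only to show a record that is structurally fine but semantically impossible.

```python
from datetime import date

# Hypothetical record: every field has the right type, but the values are
# logically impossible (negative amount, purchase date far in the future).
record = {"order_id": "A-1001", "amount": -50.0, "purchase_date": date(2999, 1, 1)}

EXPECTED_TYPES = {"order_id": str, "amount": float, "purchase_date": date}


def schema_errors(rec: dict) -> list[str]:
    """Structural checks only: presence and type of each expected field."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in rec:
            problems.append(f"{field}: missing")
        elif not isinstance(rec[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems


def rule_errors(rec: dict) -> list[str]:
    """Semantic checks; these assume the schema checks already passed."""
    problems = []
    if rec["amount"] < 0:
        problems.append("amount: cannot be negative")
    if rec["purchase_date"] > date.today():
        problems.append("purchase_date: cannot be in the future")
    return problems


schema_problems = schema_errors(record)
# Only evaluate business rules once the structure is known to be sound.
rule_problems = rule_errors(record) if not schema_problems else []
print(schema_problems, rule_problems)
```

The schema check finds nothing wrong with this record; only the rule check flags it. Running the rule check on a record with a missing field would raise a KeyError, which is why the schema check goes first.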

Schema Validation With Pydantic

Pydantic is the most widely used Python library for schema validation in data automation contexts. Version 2 is significantly faster than V1 because its validation core is written in Rust. It also coerces fewer types by default and offers an opt-in strict mode, which makes it more appropriate for automation pipelines where you want to detect problems rather than silently fix them.

Define a model for each record type your pipeline receives:

from pydantic import BaseModel, ValidationError, field_validator
from datetime import date
from typing import Optional

class OrderEvent(BaseModel):
    order_id: str
    customer_id: int
    amount: float
    event_date: date
    status: str
    notes: Optional[str] = None

Pydantic validates each field on construction. A record missing customer_id or passing amount as a non-numeric string will raise a ValidationError before your processing code ever sees the data. The error message names the field and the problem, which makes debugging straightforward.
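As an illustration, the sketch below feeds the model a hypothetical bad payload, then shows a detail worth knowing: in Pydantic's default (lax) mode, numeric strings are coerced rather than rejected, while passing `strict=True` to `model_validate` turns that coercion off. The sample values are invented for the example.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, ValidationError


class OrderEvent(BaseModel):
    order_id: str
    customer_id: int
    amount: float
    event_date: date
    status: str
    notes: Optional[str] = None


# Hypothetical payload: customer_id is missing and amount is not numeric.
bad_record = {
    "order_id": "A-1001",
    "amount": "twelve",
    "event_date": "2024-03-01",
    "status": "pending",
}

try:
    OrderEvent(**bad_record)
except ValidationError as exc:
    # Each error's "loc" is a tuple path to the failing field.
    failed_fields = sorted(str(err["loc"][0]) for err in exc.errors())
    print(failed_fields)

# Hypothetical payload where numbers and the date arrive as strings.
stringly_typed = {
    "order_id": "A-1002",
    "customer_id": "42",
    "amount": "19.99",
    "event_date": "2024-03-01",
    "status": "pending",
}

# Default (lax) mode: the strings are coerced to int, float, and date.
coerced = OrderEvent(**stringly_typed)

# Strict mode: the same payload is rejected instead of coerced.
try:
    OrderEvent.model_validate(stringly_typed, strict=True)
    strict_rejected = False
except ValidationError:
    strict_rejected = True
```

Whether lax or strict is the better fit depends on the source: lax mode is forgiving of APIs that serialize numbers as strings, while strict mode surfaces that drift as an error.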

To validate a batch of records:

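valid_records = []
invalid_records = []

# api_response is assumed to be the already-parsed JSON body of the fetch.
for raw in api_response['records']:
    try:
        # model_validate also raises ValidationError for items that are not dicts,
        # whereas OrderEvent(**raw) would fail with a TypeError.
        record = OrderEvent.model_validate(raw)
        valid_records.append(record)
    except ValidationError as e:
        # Keep the raw payload next to the errors for later review.
        invalid_records.append({'raw': raw, 'errors': e.errors()})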

The e.errors() method returns a list of dictionaries, one per validation failure. Each entry carries the location of the failing field (loc), an error type code (type), a human-readable message (msg), and the offending value (input). Log this alongside the raw record so you can reconstruct what happened when you review the error queue later.
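A short sketch of what those error entries look like in practice. The model is trimmed to two fields and the payload is hypothetical; the keys read from each entry (`loc`, `type`, `msg`, `input`) are the ones Pydantic V2 provides.

```python
from pydantic import BaseModel, ValidationError


class OrderEvent(BaseModel):
    # Trimmed to two fields to keep the example short.
    customer_id: int
    amount: float


summary = []
try:
    OrderEvent.model_validate({"customer_id": "abc", "amount": 12.5})
except ValidationError as exc:
    summary = [
        {
            # "loc" is a tuple; join it so nested paths read as "a.b.0".
            "field": ".".join(str(part) for part in err["loc"]),
            "type": err["type"],
            "message": err["msg"],
            "input": err["input"],
        }
        for err in exc.errors()
    ]

for entry in summary:
    print(entry)
```

Flattening each error into a small dictionary like this makes the entries easy to write to a log line or an error queue.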

Business Rule Validation With Field Validators

Pydantic supports custom validators that run after type validation. Use @field_validator decorators to add business rule checks directly to the model:

import datetime
from datetime import date

from pydantic import BaseModel, field_validator

class OrderEvent(BaseModel):
    order_id: str
    customer_id: int
    amount: float
    event_date: date
    status: str

    @field_validator('amount')
    @classmethod
    def amount_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError(f'amount must be positive, got {v}')
        return v

    @field_validator('event_date')
    @classmethod
    def date_cannot_be_future(cls, v):
        if v > datetime.date.today():
            raise ValueError(f'event_date cannot be in the future: {v}')
        return v

    @field_validator('status')
    @classmethod
    def status_must_be_valid(cls, v):
        allowed = {'pending', 'processing', 'completed', 'cancelled'}
        if v not in allowed:
            raise ValueError(f'status "{v}" not in allowed set {allowed}')
        return v

This pattern keeps validation co-located with the schema definition. New engineers reading the model see both the type expectations and the business rules in one place, rather than hunting through the processing code for scattered checks.
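The sketch below exercises two such validators against a hypothetical record whose types are fine but whose values break the rules. The model is a trimmed copy for brevity. It also shows one practical wrinkle: for errors raised by custom validators, Pydantic V2 stores the original exception object in the error's `ctx`, so `json.dumps` needs a fallback such as `default=str`.

```python
import datetime
import json

from pydantic import BaseModel, ValidationError, field_validator


class OrderEvent(BaseModel):
    order_id: str
    amount: float
    event_date: datetime.date

    @field_validator("amount")
    @classmethod
    def amount_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError(f"amount must be positive, got {v}")
        return v

    @field_validator("event_date")
    @classmethod
    def date_cannot_be_future(cls, v: datetime.date) -> datetime.date:
        if v > datetime.date.today():
            raise ValueError(f"event_date cannot be in the future: {v}")
        return v


# Hypothetical record: types are valid, values violate both rules.
record = {"order_id": "A-1003", "amount": -5, "event_date": "2999-01-01"}

rule_failures = {}
serialized = ""
try:
    OrderEvent(**record)
except ValidationError as exc:
    # Map each failing field to the message produced by its validator.
    rule_failures = {str(err["loc"][0]): err["msg"] for err in exc.errors()}
    # default=str handles the exception object stored under "ctx".
    serialized = json.dumps(exc.errors(), default=str)

print(rule_failures)
```

Both rule violations are reported in a single ValidationError, so one pass over the record is enough to learn everything that is wrong with it.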

For cross-field validation -- where one field's validity depends on another -- use @model_validator:

from typing import Optional

from pydantic import BaseModel, model_validator

class OrderEvent(BaseModel):
    amount: float
    refund_amount: Optional[float] = None

    @model_validator(mode='after')
    def refund_cannot_exceed_amount(self):
        if self.refund_amount is not None and self.refund_amount > self.amount:
            raise ValueError('refund_amount cannot exceed amount')
        return self
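A brief usage sketch for the cross-field check, repeating the model so the block runs on its own. The amounts are hypothetical sample values.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, model_validator


class OrderEvent(BaseModel):
    amount: float
    refund_amount: Optional[float] = None

    @model_validator(mode="after")
    def refund_cannot_exceed_amount(self):
        if self.refund_amount is not None and self.refund_amount > self.amount:
            raise ValueError("refund_amount cannot exceed amount")
        return self


# A partial refund passes, and so does a record with no refund at all.
partial = OrderEvent(amount=100.0, refund_amount=40.0)
no_refund = OrderEvent(amount=100.0)

# A refund larger than the original amount is rejected.
try:
    OrderEvent(amount=100.0, refund_amount=150.0)
    rejected = False
except ValidationError as exc:
    rejected = True
    messages = [err["msg"] for err in exc.errors()]
    print(messages)
```

Because the validator runs in `after` mode, it sees fields that have already been type-checked, so the comparison never has to guard against a string or a missing amount.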

"The most expensive data bugs are the ones that pass validation silently -- your pipeline runs, your dashboard looks normal, and the corruption only surfaces three months later during an audit. A validation layer at the intake boundary is worth ten monitoring dashboards." - Dennis Traina, founder of 137Foundry

Where to Place the Validation Layer

The validation layer belongs immediately after data ingestion and before any transformation, enrichment, or database write. In a typical pipeline:

API fetch → validate → transform → enrich → write to database

Not:

API fetch → transform → enrich → write to database → validate after

Validating after transformation obscures the original source of the error. A transformation function that fails on an unexpected null may produce a stack trace pointing at your own code, not at the upstream API. Validating before transformation makes the error location unambiguous: the data is wrong before your code ever touched it.
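The ordering can be sketched with plain functions. Everything here is a stand-in: `validate` does a deliberately simple check and `transform` is a placeholder for real transformation logic. The point is only that `transform` never receives a record that failed validation.

```python
def validate(raw_records):
    """Split records into (valid, rejected) before any other stage runs."""
    valid, rejected = [], []
    for raw in raw_records:
        amount = raw.get("amount") if isinstance(raw, dict) else None
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(amount, (int, float)) and not isinstance(amount, bool):
            valid.append(raw)
        else:
            rejected.append(raw)
    return valid, rejected


def transform(records):
    """Stand-in transformation; it would raise TypeError on a None amount."""
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in records]


def run_pipeline(raw_records):
    valid, rejected = validate(raw_records)  # 1. validate first
    transformed = transform(valid)           # 2. transform only clean data
    return transformed, rejected             # 3. caller writes and queues


# Hypothetical fetch result with one null amount.
fetched = [
    {"order_id": "A-1", "amount": 12.5},
    {"order_id": "A-2", "amount": None},
]
transformed, rejected = run_pipeline(fetched)
print(transformed, rejected)
```

If the two stages were swapped, the null amount would surface as a TypeError inside `transform`, which is exactly the misleading stack trace described above.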

For multi-step pipelines with intermediate storage (a staging table, a queue, a file), validate both at intake and when reading from staging. Intermediate storage can accumulate records from multiple sources over time, and assumptions that held at intake for source A may not hold for source B added six months later.

The validation layer should be a discrete, testable function or class -- not inline checks scattered through the processing code. Centralizing validation makes it easy to add new rules, update existing ones, and write unit tests that confirm the validation behavior independently of the pipeline logic.
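One way to package the layer as a discrete unit is a single function that takes raw records and a model class, shown here with a unit test beside it. The name `validate_batch` and the trimmed `OrderEvent` model are illustrative assumptions, not a prescribed interface.

```python
from typing import Any, Iterable

from pydantic import BaseModel, ValidationError


class OrderEvent(BaseModel):
    order_id: str
    customer_id: int
    amount: float


def validate_batch(raw_records: Iterable[Any], model: type[BaseModel]):
    """Return (valid_models, rejected), where rejected keeps raw data and errors."""
    valid, rejected = [], []
    for raw in raw_records:
        try:
            valid.append(model.model_validate(raw))
        except ValidationError as exc:
            rejected.append({"raw": raw, "errors": exc.errors()})
    return valid, rejected


def test_validate_batch_separates_bad_records():
    batch = [
        {"order_id": "A-1", "customer_id": 7, "amount": 10.0},
        {"order_id": "A-2", "amount": 10.0},  # missing customer_id
        "not even a dict",                    # wrong shape entirely
    ]
    valid, rejected = validate_batch(batch, OrderEvent)
    assert [r.order_id for r in valid] == ["A-1"]
    assert len(rejected) == 2
    assert rejected[1]["raw"] == "not even a dict"


test_validate_batch_separates_bad_records()
```

Because the function has no dependency on the fetch or the database, the test needs no mocks, and a new rule on the model is covered by adding one more record to the batch.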

Managing Invalid Records

A validation layer is only useful if invalid records go somewhere reviewable rather than being silently dropped or crashing the pipeline.

The minimum viable approach is logging with full context. Log the raw record, the validation errors, the source (which API, which endpoint, which timestamp), and enough surrounding context to reproduce the fetch. Emit each entry in a structured format, such as one JSON object per line, so that downstream monitoring can parse it and alert on it.
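A minimal sketch of such a log entry using only the standard library. The logger name, the `event` label, and the function signature are assumptions made for the example.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical logger name; use whatever your pipeline already configures.
logger = logging.getLogger("pipeline.validation")


def log_invalid_record(raw, errors, source, endpoint):
    """Emit one JSON object per rejected record and return the line."""
    payload = {
        "event": "validation_failed",
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "endpoint": endpoint,
        "errors": errors,
        "raw": raw,
    }
    # default=str keeps dates, Decimals, and exception objects serializable.
    line = json.dumps(payload, default=str)
    logger.warning(line)
    return line


line = log_invalid_record(
    raw={"order_id": "A-2", "amount": 10.0},
    errors=[{"loc": ["customer_id"], "type": "missing"}],
    source="orders-api",
    endpoint="/v1/orders",
)
```

Keeping the whole entry in one JSON object means a log aggregator can filter on `event` or `source` without regex parsing.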

A more robust approach adds a dead letter pattern:

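import csv
import json
import os
from datetime import datetime, timezone

FIELDNAMES = ['timestamp', 'source', 'raw', 'errors']

def write_to_error_queue(record, errors, source, path='error_queue.csv'):
    row = {
        # datetime.utcnow() is deprecated as of Python 3.12; use an aware UTC timestamp.
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'source': source,
        # default=str avoids a TypeError on dates or on the exception objects
        # that Pydantic stores in an error's 'ctx' for custom validators.
        'raw': json.dumps(record, default=str),
        'errors': json.dumps(errors, default=str),
    }
    # Write the header once, when the file is new or empty.
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if is_new:
            writer.writeheader()
        writer.writerow(row)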

Review the error queue on a regular schedule. High error rates often indicate upstream schema changes -- an API added required fields you do not yet handle, or changed a field from a string to an object. Catching these early prevents accumulation of corrupt records over days or weeks.

Set an alert threshold. If more than 1% to 2% of records in a run fail validation, something is wrong that warrants immediate investigation rather than queuing for the next review cycle.
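The threshold check itself is a few lines. In this sketch the function names and the 2% default are assumptions; pick the value that matches your own tolerance.

```python
def failure_rate(valid_count: int, invalid_count: int) -> float:
    """Fraction of records in a run that failed validation."""
    total = valid_count + invalid_count
    # An empty run has no failures to report.
    return invalid_count / total if total else 0.0


def should_alert(valid_count: int, invalid_count: int, threshold: float = 0.02) -> bool:
    """True when the failure rate is strictly above the threshold."""
    return failure_rate(valid_count, invalid_count) > threshold


# Hypothetical run: 950 valid records, 50 rejected (5% failure rate).
if should_alert(valid_count=950, invalid_count=50):
    print("validation failure rate above threshold, investigate now")
```

Calling `should_alert` at the end of every run, with the lengths of the valid and invalid lists, turns the review schedule into a safety net rather than the only line of defense.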

Choosing Validation Depth by Risk Level

Not every field in every record needs exhaustive validation. A practical approach calibrates validation depth to data criticality.

Critical fields (primary keys, amounts, dates used in calculations): Full schema and business rule validation. Fail hard on any violation.

Important fields (status flags, category labels, secondary identifiers): Schema validation plus basic business rules. Log violations but allow processing to continue if the downstream computation can still run.

Informational fields (optional notes, metadata, display strings): Schema validation only, or no validation. Corruption here affects display but not calculations.

This tiering lets you keep the validation layer lightweight for high-volume pipelines without compromising accuracy on the data points that actually matter.
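The tiering decision can be expressed as a small lookup over the names of the fields that failed. The tier assignments below are hypothetical; the real sets depend on which fields your downstream calculations rely on.

```python
# Hypothetical tier assignments for an order record.
CRITICAL_FIELDS = {"order_id", "amount", "event_date"}
IMPORTANT_FIELDS = {"status", "category"}
# Any other field (notes, metadata, display strings) counts as informational.


def triage(failed_fields: set[str]) -> str:
    """Decide what to do with a record given the fields that failed validation."""
    if failed_fields & CRITICAL_FIELDS:
        return "reject"  # fail hard: send to the error queue
    if failed_fields & IMPORTANT_FIELDS:
        return "warn"    # log the violation, keep processing
    return "accept"      # informational problems only, or none


print(triage({"amount", "notes"}))
print(triage({"status"}))
print(triage({"notes"}))
```

The set of failed fields can be built from the first element of each error's `loc`, so this function sits naturally right after the schema validation step.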

Tools and Libraries Worth Knowing

Pydantic handles most schema and basic business rule validation needs and integrates well with FastAPI, SQLAlchemy (for example through SQLModel), and other common Python tooling. For pipelines where a small pure-Python dependency matters more than raw speed or feature richness, Cerberus is a lighter alternative with a simpler, dictionary-based API.

PyPI hosts both libraries plus several others in the validation space (marshmallow, voluptuous, pandera for dataframe validation). Browse by download count and recent activity to find what the broader Python community is using actively.

Python.org maintains the official documentation on type annotations and data classes, which underlie much of how Pydantic works -- useful background if you want to understand why field validators behave the way they do with complex nested types.

Integration With 137Foundry's Data Work

Building a robust validation layer is foundational to data automation that stays reliable over time. At 137Foundry's data integration practice, validation layers are a standard component in every pipeline we build -- not an afterthought added when something breaks. The pattern described here -- schema validation at intake, business rules co-located with the schema, dead letter queue for invalid records, alert thresholds -- is the same structure we use in production.

The 137Foundry AI automation services team handles pipelines where data quality is particularly critical because downstream AI models amplify validation errors. Garbage in, garbage out applies with even more force when a language model, rather than a human analyst, is downstream of your pipeline.

If you are building data automation in-house and running into persistent quality issues, reviewing the 137Foundry services hub shows where external help is available for the parts that benefit from outside expertise.

The Core Rule

Every byte of data from an external source should cross a validation boundary before your code acts on it. The boundary does not have to be complex -- a Pydantic model with a handful of field validators covers most real-world cases. The important part is that the boundary exists, that it runs consistently, and that rejected records are captured in a way that lets you understand what upstream systems are sending you.
