Two systems that are supposed to contain the same data rarely do. The CRM shows 847 active customers. The billing system shows 839. The analytics warehouse shows 851. All three are pulling from the same underlying customer base, but each has accumulated slightly different state over months of asynchronous updates, failed syncs, and schema differences that were patched in one place and not another.
Data reconciliation is the process of comparing two or more representations of the same data, identifying where they disagree, and either resolving the discrepancy or escalating it for human review. Unlike ETL, which moves data from source to destination, reconciliation validates that two systems agree. Its output is not transformed data - it is a list of discrepancies, their severity, and their state.
Without automation, reconciliation is a quarterly audit exercise that happens after someone notices a report doesn't add up. With automation, it is a continuous background process that catches drift within hours of it occurring.
What Makes Reconciliation Different From Validation
Data validation checks that individual records meet quality rules: a date field contains a valid date, a required field is not null, a foreign key references a valid record. Reconciliation compares two representations of the same logical entity across system boundaries.
Validation answers "is this record well-formed?" Reconciliation answers "does system A's record for customer #1042 agree with system B's record for the same customer?"
This distinction matters for architecture. Validation runs within a single pipeline on data it has direct access to. Reconciliation requires reading from two separate sources, identifying matching records across them, and comparing the fields that should be equivalent.
The complexity compounds when the two systems use different identifiers for the same entity, different field names for the same concept, and different update cadences that create windows where one system has new data the other has not yet received.
When Reconciliation Is Worth Building
Not every data integration problem requires a dedicated reconciliation system. The cases where building one pays off:
Financial and billing data. Discrepancies between what your application recorded and what your payment processor recorded create revenue leakage, fraud exposure, and regulatory risk. Even a one-percent discrepancy rate on transaction records can represent meaningful money at volume.
Multi-system customer records. When the same customer entity lives in a CRM, a support system, and a billing platform, these systems drift. Detecting which system has the most recent correct state - and when they meaningfully disagree - prevents customer-facing errors and operational confusion.
Microservice data ownership. When separate services maintain their own representations of shared business state, reconciliation between them identifies cases where an event was processed in one service but not propagated correctly to another.
Regulatory reporting. When reported numbers must match source records and be auditable, automated reconciliation creates a paper trail showing what was compared, when, and what the outcome was.
Core Architecture
A reconciliation system has four functional components regardless of implementation technology.
Source connectors fetch the current state of each data set being compared. For database sources, this typically means a SQL query against a snapshot or a change log. For API sources, it means a paginated pull of records with appropriate rate limiting. The connector is responsible for producing a normalized in-memory or on-disk representation of each data set that the comparison engine can work with.
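As a minimal sketch of the connector's normalization responsibility (using SQLite purely for illustration; your source will differ), a database connector can return rows as the list-of-dicts shape the comparison engine works with:

```python
import sqlite3

def fetch_normalized(db_path, query, params=()):
    """Run a query and return rows as a list of dicts keyed by column name."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    try:
        rows = conn.execute(query, params).fetchall()
        return [dict(row) for row in rows]
    finally:
        conn.close()
```

The same contract - a list of dicts with consistent field names - applies to API connectors after pagination and rate limiting are handled.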
The comparison engine takes two normalized data sets, matches records across them using comparison keys, and for each matched pair, compares the fields that should be equivalent. It produces a list of discrepancies, each of which identifies the record pair, the specific fields that disagree, and the values from each side.
The discrepancy tracker persists discrepancies and manages their lifecycle. A discrepancy is not just a point-in-time event - it has a state: new, acknowledged, resolved, or escalated. The tracker prevents the same discrepancy from being re-reported on every run and allows downstream systems to query the current set of open discrepancies.
The remediation handler determines what to do with a new discrepancy: alert a monitoring channel, write to an incident queue, trigger an automated fix for known recoverable cases, or log for batch review. Many reconciliation systems start with alerts only and add automated remediation selectively as confidence in specific fix patterns grows.
Defining Comparison Keys
The comparison key is the identifier that allows you to say "record X in system A corresponds to record Y in system B." This is often more complex than it appears.
When both systems use the same identifier (both use your internal customer UUID), comparison is straightforward. When they use different identifiers (system A uses your UUID, system B uses the external vendor's ID), you need a mapping table or a derivable relationship between them.
When identifiers are not reliable (systems that use email address as a primary key will fail when customers change their email), you need a composite key or a probabilistic matching approach for records where the primary key has changed.
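One way to sketch this (field names like vendor_id, email, and signup_date are hypothetical, and id_map stands in for your mapping table) is a key-derivation function that translates external IDs where a mapping exists and falls back to a composite key where it does not:

```python
def comparison_key(record, id_map=None):
    """
    Derive a comparison key for a record. External vendor IDs are
    translated through the mapping table when possible; otherwise fall
    back to a composite of fields unlikely to change together.
    """
    if id_map is not None and record.get("vendor_id") in id_map:
        return id_map[record["vendor_id"]]
    # Composite fallback: email alone is fragile (customers change it),
    # so pair it with a second stable field.
    return (record.get("email"), record.get("signup_date"))
```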
Define your comparison key before writing any comparison logic. The key determines which records can be compared and which will show up as unmatched - records that exist in one source but have no counterpart in the other.
def match_records(source_a, source_b, key_field):
    """
    Returns three groups: matched pairs, only-in-a, only-in-b.
    source_a and source_b are lists of dicts.
    """
    index_b = {r[key_field]: r for r in source_b}
    matched = []
    only_in_a = []
    for record in source_a:
        key = record[key_field]
        if key in index_b:
            matched.append((record, index_b.pop(key)))
        else:
            only_in_a.append(record)
    only_in_b = list(index_b.values())
    return matched, only_in_a, only_in_b
Unmatched records are themselves a class of discrepancy - a record present in one system but absent from the other. Handle them separately from field-level disagreements in matched pairs.
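One way to fold unmatched records into the discrepancy stream (a sketch, with the type labels as illustrative names) is to emit them as their own discrepancy type, distinct from field-level disagreements:

```python
def unmatched_discrepancies(only_in_a, only_in_b, key_field):
    """Convert unmatched records into (key, discrepancy_type) tuples."""
    discrepancies = [(r[key_field], "missing_in_b") for r in only_in_a]
    discrepancies += [(r[key_field], "missing_in_a") for r in only_in_b]
    return discrepancies
```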

Comparison Logic
For matched pairs, field-level comparison requires deciding what counts as a disagreement.
Exact equality works for identifiers, statuses, and categorical fields. Two records either agree on a customer's subscription plan or they do not.
Tolerance-based comparison is necessary for floating-point amounts, calculated totals, and values that may represent the same underlying quantity with minor rounding differences. Comparing a charge amount of $42.500 against $42.50 requires a tolerance threshold, not exact equality.
Timestamp handling requires agreeing on what time window constitutes a meaningful discrepancy. A one-second difference in updated_at between two systems that sync asynchronously is not a discrepancy. A four-hour difference probably is. Define your lag tolerance explicitly.
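A lag-tolerance check can be sketched as follows (the one-hour tolerance is an illustrative value, not a recommendation; set it from your actual sync cadence):

```python
from datetime import datetime, timedelta

LAG_TOLERANCE = timedelta(hours=1)  # policy choice: tune to your sync cadence

def timestamps_agree(ts_a, ts_b, tolerance=LAG_TOLERANCE):
    """Treat two ISO-8601 timestamps as agreeing if they differ by no more than the lag tolerance."""
    delta = abs(datetime.fromisoformat(ts_a) - datetime.fromisoformat(ts_b))
    return delta <= tolerance
```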
Null handling requires a policy: is a null in system A and a populated field in system B a discrepancy, or is null a valid "not yet propagated" state? This depends on the field and the sync pattern. Document the policy per-field rather than applying a universal rule.
def compare_fields(record_a, record_b, field_config):
    """
    field_config: dict mapping field name to comparison settings.
    Returns list of (field_name, value_a, value_b) discrepancy tuples.
    """
    discrepancies = []
    for field, config in field_config.items():
        val_a = record_a.get(field)
        val_b = record_b.get(field)
        tolerance = config.get("tolerance", 0)
        if val_a is None and val_b is None:
            continue
        if tolerance and val_a is not None and val_b is not None:
            if abs(float(val_a) - float(val_b)) <= tolerance:
                continue
        if val_a != val_b:
            discrepancies.append((field, val_a, val_b))
    return discrepancies
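A field_config for a function like the one above might look like this (the field names and the one-cent tolerance are illustrative):

```python
field_config = {
    "plan": {},                      # categorical: exact equality
    "status": {},                    # categorical: exact equality
    "balance": {"tolerance": 0.01},  # monetary: one-cent rounding tolerance
}
```

Keeping comparison settings in data rather than code means the per-field null and tolerance policies can be reviewed and versioned alongside the rest of the configuration.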
Tracking Discrepancy State
The discrepancy tracker is what separates a reconciliation system from a reconciliation script. A script produces a report and exits. A system maintains state across runs.
Each discrepancy should have a stable identifier derived from the comparison key and the field name so the same discrepancy is recognized across runs. On each run:
- Compare the new discrepancy list against the open discrepancy set.
- Discrepancies that appear in both lists have persisted - update their last-seen timestamp.
- Discrepancies that appear only in the new list are new - create a record and trigger alerting.
- Discrepancies that appear only in the open set have resolved since the last run - mark them resolved.
This structure eliminates alert storms on repeated runs and provides a clean interface for querying "what is currently wrong" versus "what was ever wrong."
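The run-over-run logic above can be sketched as set operations over stable discrepancy identifiers, with the identifier derived by hashing the comparison key and field name as described:

```python
import hashlib

def discrepancy_id(comparison_key, field_name):
    """Stable identifier so the same discrepancy is recognized across runs."""
    raw = f"{comparison_key}:{field_name}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]

def diff_runs(open_ids, current_ids):
    """Classify this run's discrepancies against the currently open set."""
    persisted = open_ids & current_ids   # update last-seen timestamp
    new = current_ids - open_ids         # create record, trigger alerting
    resolved = open_ids - current_ids    # mark resolved
    return new, persisted, resolved
```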
PostgreSQL is a natural backend for the discrepancy tracker if you are already in the relational world. The discrepancy table schema needs: identifier, first_seen, last_seen, status, source_a_value, source_b_value, and any metadata fields useful for debugging.
Scheduling and Triggering
How often to run reconciliation depends on the acceptable lag for detecting discrepancies and the cost of reading from both sources.
Scheduled runs (cron-style) work well when both sources support efficient incremental reads. Pull only records modified since the last run, compare only those records, and limit the full-dataset comparison to a weekly or monthly deep scan.
Event-driven runs trigger reconciliation when a write to one system occurs and the corresponding write to the other system is expected within a defined window. This catches propagation failures faster than scheduled runs but requires instrumentation in the upstream write path.
On-demand runs allow operators to trigger reconciliation for a specific record or date range when investigating a suspected discrepancy. Include a manual trigger endpoint in any production reconciliation system.
"Data reconciliation systems are some of the most valuable automation you can build, because they surface problems that already exist in your data - problems that are silently driving bad decisions right now." - Dennis Traina, founder of 137Foundry

Production Considerations
Alert routing. Not every discrepancy warrants the same response: some should be alerted immediately, some batched daily (field-level disagreements in non-critical fields), and some logged but not alerted (within-tolerance differences tracked for trend analysis).
Performance at scale. Comparing millions of records requires a different approach than comparing thousands. Consider chunked comparison (partition by date range or ID range, process in parallel), database-side comparison (push comparison logic into SQL where possible to avoid pulling full data sets into memory), and summary-level checks before record-level checks (if the count of records matches and the sum of amounts matches, skip the full field comparison).
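A summary-level pre-check can be sketched as follows (the amount field name and the rounding tolerance are assumptions; adapt them to your schema):

```python
def summary_check(source_a, source_b, amount_field="amount"):
    """
    Return True if record counts and amount sums agree, in which case
    the full record-level field comparison can be skipped for this chunk.
    """
    if len(source_a) != len(source_b):
        return False
    sum_a = sum(float(r[amount_field]) for r in source_a)
    sum_b = sum(float(r[amount_field]) for r in source_b)
    return abs(sum_a - sum_b) < 0.01  # tolerance for float rounding
```

Run this per chunk (per date range or ID range) so a single mismatched partition triggers a record-level pass only for that partition.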
The 137Foundry data integration service and AI automation service both involve designing systems that handle these production constraints at client scale.
Tools Worth Evaluating
Great Expectations focuses on data validation rather than cross-system reconciliation, but its expectation framework is useful for defining what field values should look like on each side before comparison.
dbt is primarily a transformation tool, but its testing framework supports cross-source comparisons and relationship checks that can serve reconciliation purposes for teams already in the dbt ecosystem.
Apache Kafka is worth considering for event-driven reconciliation architectures where both sources publish change events to a shared log - the reconciliation system consumes from both topics and compares events for the same key.
Martin Fowler's patterns catalog documents patterns like Event Sourcing and Change Data Capture that inform how to design the source connector layer in a reconciliation system.
Getting Started Without Overbuilding
The minimum useful reconciliation system is often simpler than teams assume: a script that reads from two sources, compares a specific set of fields for matched records, and writes a discrepancy report to a file or database table. Start there.
Validate that the comparison logic is correct before building the state management layer. Validate that the alert routing works before building automated remediation. The architecture described here is the target state after several iterations, not the starting point.
The 137Foundry services hub covers how we approach building these systems for clients - particularly the combination of data integration design, automation pipeline engineering, and ongoing monitoring that makes reconciliation useful rather than just theoretically correct.