How to Build a Bi-Directional Data Sync Between Business Applications

Bi-directional data sync -- where two systems each send and receive changes -- looks similar to one-way ETL but behaves completely differently in production. In a one-way pipeline, you have a clear source of truth and a clear destination. In a bi-directional sync, two sources both change, and the sync layer must figure out which changes are authoritative and in which direction.

The failure modes of bi-directional sync are specific and often quiet. This guide covers the architecture decisions, conflict resolution strategies, and operational patterns that make the difference between a sync that works and one that silently corrupts data over time.

Why Bi-Directional Sync Is a Different Problem Class

In a one-way pipeline, the flow is simple: read from source, transform, write to destination. Conflict is not a concept because there is only one authoritative source.

In a bi-directional sync, both systems can update the same record simultaneously. When the sync layer reads System A and finds that record 1234 was updated at 14:32:07, and reads System B and finds the same record was updated at 14:32:09, it has a conflict. Something needs to decide which version wins.

The naive approach -- apply changes in timestamp order -- fails when clocks are not synchronized between systems. Clock skew between independently operated systems can reach several seconds when time synchronization drifts or is misconfigured, and you rarely control the clocks of third-party applications. A timestamp comparison that relies on synchronized clocks will make the wrong decision whenever the skew exceeds the gap between conflicting writes.
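
To make that failure concrete, here is a small sketch with invented numbers: System B's clock is assumed to run 3 seconds behind System A's, and the write in B really happens one second after the write in A. The skew value and variable names are illustrative assumptions, not measurements.

```python
from datetime import datetime, timedelta

# Hypothetical scenario: the true (wall-clock) write times.
true_write_a = datetime(2024, 1, 1, 14, 32, 7)
true_write_b = true_write_a + timedelta(seconds=1)  # B is really the later write

# Assumed skew: System B's clock runs 3 seconds behind System A's.
skew_b = timedelta(seconds=-3)

# What each system records as updated_at.
stamp_a = true_write_a
stamp_b = true_write_b + skew_b

# Naive comparison of the recorded timestamps versus the real order.
naive_winner = "A" if stamp_a > stamp_b else "B"
true_winner = "A" if true_write_a > true_write_b else "B"

print(naive_winner, true_winner)  # prints "A B": the naive rule keeps the stale write
```

The gap between the writes (1 second) is smaller than the skew (3 seconds), so the recorded order is the reverse of the real one.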

Beyond conflicts, bi-directional sync introduces the risk of sync loops: System A updates a record, the sync propagates the change to System B, System B fires an "on change" hook that updates the record again, which the sync propagates back to System A. A sync loop does not usually cause data corruption -- it causes infinite API calls that rate-limit your integration and fill logs with meaningless change events.

Change Data Capture: Detecting What Changed

Before you can sync changes, you need to know what changed. Change Data Capture (CDC) is the pattern for detecting data changes as they happen rather than scanning entire tables for differences.

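Database-level CDC (PostgreSQL logical replication, MySQL binlog) streams row-level changes directly from the database's transaction log: the write-ahead log in PostgreSQL, the binary log in MySQL. This is the most reliable approach for high-volume systems because it captures every insert, update, and delete at the row level. Whether the previous field values are included alongside the new ones depends on configuration (for example, REPLICA IDENTITY FULL in PostgreSQL or binlog_row_image=FULL in MySQL).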

PostgreSQL logical replication setup:

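-- In postgresql.conf (requires a server restart): wal_level = logical
-- Publish the tables whose changes should be streamed
CREATE PUBLICATION my_sync_pub FOR TABLE customers, orders;
-- Create a slot that decodes changes with the built-in pgoutput plugin
SELECT pg_create_logical_replication_slot('my_sync_slot', 'pgoutput');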

A consumer reads from this slot using the psycopg2 replication API or a tool like Debezium.

Application-level CDC uses updated_at timestamps and a high-watermark query:

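from sqlalchemy import text

# Table names cannot be bound as query parameters, so restrict them to a known set.
ALLOWED_TABLES = {"customers", "orders"}

def poll_changes(engine, table_name, since):
    """Return rows changed after the `since` watermark, oldest first."""
    if table_name not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table_name!r}")
    query = text(
        f"SELECT * FROM {table_name} "
        "WHERE updated_at > :since ORDER BY updated_at ASC"
    )
    with engine.connect() as conn:
        return conn.execute(query, {"since": since}).fetchall()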

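Application-level CDC is less invasive and works with any database, but it cannot capture hard deletes (a deleted row simply stops appearing in the query), and it sees only the latest state of each row, so intermediate updates between polls are lost.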
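
One common workaround for the hard-delete gap is a soft delete: mark the row as deleted and bump its timestamp so the poller sees the deletion as an ordinary change. The sketch below assumes a hypothetical `deleted_at` tombstone column and integer timestamps, and uses an in-memory SQLite database as a stand-in for the real one.

```python
import sqlite3

def poll_changes_with_tombstones(conn, since):
    """Return (rows, new_watermark) for rows changed after `since`."""
    rows = conn.execute(
        "SELECT id, name, updated_at, deleted_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at ASC",
        (since,),
    ).fetchall()
    # Advance the watermark only as far as the rows actually seen.
    new_watermark = rows[-1][2] if rows else since
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, "
    "updated_at INTEGER NOT NULL, deleted_at INTEGER)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 100, NULL)")
conn.execute("INSERT INTO customers VALUES (2, 'Bob', 110, NULL)")

first_batch, watermark = poll_changes_with_tombstones(conn, since=0)

# Soft delete: mark the row instead of removing it, and bump updated_at
# so the next poll picks the deletion up like any other change.
conn.execute("UPDATE customers SET deleted_at = 120, updated_at = 120 WHERE id = 2")

second_batch, watermark = poll_changes_with_tombstones(conn, since=watermark)
deleted_ids = [row[0] for row in second_batch if row[3] is not None]
```

The consumer then propagates a delete for each id in `deleted_ids`. This only helps if every writer uses the soft-delete path; a direct DELETE statement still goes unnoticed.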

Conflict Resolution Strategies

How you resolve conflicts is a business decision before it is a technical one. The right strategy depends on which system is authoritative for which fields.

Last-write-wins with clock skew tolerance. The record with the later updated_at timestamp wins. Add a tolerance window to handle clock skew: if the two timestamps fall within the window, their order cannot be trusted, so route the conflict to a dead-letter queue (DLQ) instead of picking a winner.

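from datetime import timedelta
from typing import Optional

# Should be at least the worst-case clock skew observed between the two systems.
CLOCK_SKEW_TOLERANCE = timedelta(milliseconds=500)

def resolve_last_write_wins(record_a: dict, record_b: dict,
                            tolerance: timedelta = CLOCK_SKEW_TOLERANCE) -> Optional[dict]:
    """Return the newer record, or None when the timestamps are too close to order."""
    delta = abs(record_a["updated_at"] - record_b["updated_at"])
    if delta <= tolerance:
        return None  # caller routes the pair to the DLQ for review
    return record_a if record_a["updated_at"] > record_b["updated_at"] else record_b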

Field-authority mapping. Designate each field as owned by a specific system. The CRM owns name and email. The billing system owns payment_status. When syncing, only propagate changes for fields the source system owns.

FIELD_AUTHORITY = {
    "name": "crm", "email": "crm",
    "payment_status": "billing", "payment_method": "billing",
    "last_activity": "both",
}

def apply_sync(local, remote, source_system):
    result = dict(local)
    for field, authority in FIELD_AUTHORITY.items():
        if authority in (source_system, "both") and field in remote:
            result[field] = remote[field]
    return result

Dead-letter queue for unresolvable conflicts. Any conflict the sync layer cannot resolve deterministically should go to a DLQ rather than being silently dropped or arbitrarily resolved. Build a lightweight admin interface for reviewing DLQ entries -- this is a regular operational need, not a corner case.
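
As a rough sketch of what a DLQ entry might carry so that a reviewer can act on it: both versions of the record, the reason it could not be resolved, and a review status. The `DLQEntry` and `DeadLetterQueue` names are hypothetical, and the in-memory list stands in for a durable table or queue.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class DLQEntry:
    record_id: str
    version_a: dict
    version_b: dict
    reason: str
    status: str = "pending"  # pending -> resolved | dismissed
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class DeadLetterQueue:
    """In-memory stand-in for a durable DLQ store."""

    def __init__(self) -> None:
        self._entries: List[DLQEntry] = []

    def add(self, entry: DLQEntry) -> None:
        self._entries.append(entry)

    def pending(self) -> List[DLQEntry]:
        return [e for e in self._entries if e.status == "pending"]

    def resolve(self, entry: DLQEntry, status: str) -> None:
        if status not in ("resolved", "dismissed"):
            raise ValueError(f"unknown status: {status}")
        entry.status = status

dlq = DeadLetterQueue()
entry = DLQEntry("1234", {"name": "Ada"}, {"name": "Ada L."}, "timestamps within tolerance")
dlq.add(entry)
depth_before = len(dlq.pending())
dlq.resolve(entry, "resolved")
depth_after = len(dlq.pending())
```

Keeping both versions in the entry means the reviewer does not have to reconstruct the conflict from logs, and `len(dlq.pending())` doubles as the queue-depth metric.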

Idempotency: Handling Duplicates Safely

Any distributed system that retries failed operations will eventually send the same message twice. Your sync consumers must be idempotent. We use Redis for the deduplication cache and SQLAlchemy for database access in the examples below.

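import redis

dedup_client = redis.Redis()
DEDUP_TTL = 86400  # seconds (24 hours)

def process_change_idempotent(change_event, process_fn):
    change_id = change_event["change_id"]
    dedup_key = f"sync:processed:{change_id}"
    # SET NX succeeds only for the first claimant of this change_id
    if not dedup_client.set(dedup_key, "1", nx=True, ex=DEDUP_TTL):
        return  # already processed (or in progress elsewhere)
    try:
        process_fn(change_event)
    except Exception:
        # Release the claim so a retry is not mistaken for a duplicate
        dedup_client.delete(dedup_key)
        raise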

Combine with database-level upsert semantics (INSERT ... ON CONFLICT DO UPDATE) to ensure that applying the same change twice is safe even if the deduplication cache misses.
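
A minimal sketch of that upsert, using an in-memory SQLite database as a stand-in (PostgreSQL accepts the same ON CONFLICT clause, with its own parameter style). The table layout is hypothetical, and the WHERE guard that rejects older changes is an addition for illustration, not something the approach requires.

```python
import sqlite3

UPSERT = """
INSERT INTO customers (id, name, updated_at)
VALUES (:id, :name, :updated_at)
ON CONFLICT (id) DO UPDATE SET
    name = excluded.name,
    updated_at = excluded.updated_at
WHERE excluded.updated_at >= customers.updated_at
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)

change = {"id": 1, "name": "Ada", "updated_at": 200}
conn.execute(UPSERT, change)
conn.execute(UPSERT, change)  # duplicate delivery: same end state

# An older change arriving late does not overwrite newer data.
conn.execute(UPSERT, {"id": 1, "name": "Old name", "updated_at": 150})

rows = conn.execute("SELECT id, name, updated_at FROM customers").fetchall()
```

After all three statements the table holds a single row with the newest values, which is the property the consumer relies on when the deduplication cache misses.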

Preventing Sync Loops

Tag changes made by the sync layer with an application identifier and filter those out from CDC events.

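from sqlalchemy import text

# `engine`, `new_name`, and `record_id` come from the surrounding sync code.
with engine.connect() as conn:
    # SET LOCAL scopes the tag to this transaction, so it does not leak
    # to other work that later reuses the pooled connection.
    conn.execute(text("SET LOCAL application_name = 'data_sync'"))
    conn.execute(text(
        "UPDATE customers SET name = :name WHERE id = :id"
    ), {"name": new_name, "id": record_id})
    conn.commit()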

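In your CDC consumer, filter out events that carry the data_sync tag so the sync-originated change is not re-propagated. Be aware that PostgreSQL's logical decoding stream does not include the session's application_name, so the tag is only usable if your change capture records it, for example with a trigger that copies current_setting('application_name') into an audit column. With WAL-based CDC, a replication origin or an explicit last_modified_by column serves the same purpose.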
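
On the consumer side the filter itself is small. The sketch below assumes each change event carries an `origin` field identifying who wrote it; the field name and event shape are hypothetical, and how that marker reaches the event depends on your CDC mechanism.

```python
SYNC_ORIGIN = "data_sync"  # assumed marker value written by the sync layer

def should_propagate(event: dict) -> bool:
    """Return False for events that the sync layer itself produced."""
    return event.get("origin") != SYNC_ORIGIN

events = [
    {"id": 1, "origin": "crm_ui", "name": "Ada"},
    {"id": 1, "origin": "data_sync", "name": "Ada"},  # echo of our own write
    {"id": 2, "name": "Bob"},                          # no origin: treated as a user change
]
to_propagate = [e for e in events if should_propagate(e)]
```

Treating events without an origin as user changes errs on the side of propagating too much, so it is worth pairing this filter with idempotent writes on the receiving side.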

Operational Monitoring

Track these metrics as first-class operational concerns:

Sync lag per direction. Measure the time from a change in System A to that change appearing in System B, and vice versa. Alert when lag exceeds a threshold for more than a few minutes.

Conflict rate. Track conflicts per 1,000 sync events. A baseline rate is normal. A spike indicates both systems are being written to simultaneously more than expected.

DLQ depth. Track dead-letter queue depth. Alert when it grows. A growing DLQ means unresolvable conflicts are accumulating and waiting for human review.

Record count parity. Periodically compare record counts between systems. A growing divergence indicates missed changes.
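
The four metrics above reduce to a few small calculations. This is a simplified sketch: the threshold values and function names are hypothetical, and it checks a single sample rather than requiring a breach to persist for several minutes or tracking growth over time.

```python
from datetime import datetime, timedelta, timezone
from typing import List

# Hypothetical thresholds; tune them to your own baseline.
MAX_LAG_SECONDS = 120.0
MAX_DLQ_DEPTH = 50
MAX_COUNT_DIVERGENCE = 10

def sync_lag_seconds(changed_at: datetime, applied_at: datetime) -> float:
    """Time between a change in the source and its arrival in the target."""
    return (applied_at - changed_at).total_seconds()

def conflicts_per_thousand(conflicts: int, total_events: int) -> float:
    return 0.0 if total_events == 0 else 1000.0 * conflicts / total_events

def health_alerts(lag_a_to_b: float, lag_b_to_a: float,
                  dlq_depth: int, count_a: int, count_b: int) -> List[str]:
    alerts = []
    if lag_a_to_b > MAX_LAG_SECONDS:
        alerts.append("lag A->B")
    if lag_b_to_a > MAX_LAG_SECONDS:
        alerts.append("lag B->A")
    if dlq_depth > MAX_DLQ_DEPTH:
        alerts.append("DLQ depth")
    if abs(count_a - count_b) > MAX_COUNT_DIVERGENCE:
        alerts.append("record count divergence")
    return alerts

changed = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
lag = sync_lag_seconds(changed, changed + timedelta(seconds=45))
rate = conflicts_per_thousand(conflicts=3, total_events=2000)
fired = health_alerts(lag, 300.0, dlq_depth=4, count_a=10_000, count_b=9_950)
```

Measuring lag separately per direction matters because one direction can stall while the other stays healthy, as in the sample call where only B->A breaches its threshold.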

"The hardest part of bi-directional sync is not writing the code. It is knowing what healthy looks like. Every production sync we run has a dashboard that shows sync lag in both directions and DLQ depth. If those two metrics are green, the sync is healthy." - Dennis Traina, founder of 137Foundry

Retry Patterns and Dead-Letter Queues

Transient failures should be retried with exponential backoff. Permanent failures should go to the DLQ immediately without retries.

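import time

MAX_RETRIES = 5

class TransientSyncError(Exception):
    """Failure worth retrying, e.g. a timeout or rate limit."""

class PermanentSyncError(Exception):
    """Failure that retrying cannot fix, e.g. a validation error."""

def route_to_dlq(change_event, reason):
    """Placeholder: persist the event and reason to your DLQ store."""
    raise NotImplementedError

def sync_with_retry(change_event, process_fn):
    for attempt in range(MAX_RETRIES + 1):
        try:
            process_fn(change_event)
            return
        except PermanentSyncError as e:
            route_to_dlq(change_event, str(e))
            return
        except TransientSyncError as e:
            if attempt == MAX_RETRIES:
                route_to_dlq(change_event, str(e))
                return
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s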

For the data integration work at 137Foundry, every production sync has a DLQ admin interface that shows the failed event payload, the error message, and controls to retry or permanently dismiss the entry.

Schema Drift and Versioning

Bi-directional syncs fail silently when one system adds or removes a field without the other system knowing. Log warnings for any field in an incoming event that is not in the consumer schema. Alert on warnings. This does not prevent drift, but it makes drift visible as soon as the first event with the new field arrives.

Treat sync events as loosely typed: use explicit get() calls rather than required field access. This way, unknown fields are logged and skipped, and missing optional fields fall back to a default, rather than either case causing consumer failures.
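
A minimal sketch of such a consumer-side parser: it logs a warning for any unexpected field and reads the known ones with get(). The `KNOWN_FIELDS` set, the logger name, and the default for payment_status are hypothetical choices for illustration.

```python
import logging

logger = logging.getLogger("sync.schema")

# Hypothetical consumer schema.
KNOWN_FIELDS = {"id", "name", "email", "payment_status"}

def parse_event(event: dict) -> dict:
    """Extract known fields and warn about anything unexpected."""
    unknown = sorted(set(event) - KNOWN_FIELDS)
    if unknown:
        # Surfaced as a warning so alerting can pick up schema drift.
        logger.warning("schema drift: unknown fields %s", unknown)
    return {
        "id": event.get("id"),
        "name": event.get("name"),
        "email": event.get("email"),
        "payment_status": event.get("payment_status", "unknown"),
    }

parsed = parse_event({"id": 7, "name": "Ada", "loyalty_tier": "gold"})
```

The event with the new `loyalty_tier` field is still processed, and the warning gives the alerting layer something to fire on.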

What to Build With This Foundation

The patterns above -- CDC for change detection, field-authority conflict resolution, idempotent consumers, sync loop prevention, and DLQ monitoring -- are the baseline for any production bi-directional sync.

For teams that need to integrate multiple business applications without building and maintaining this infrastructure themselves, the data integration services at 137Foundry cover the full stack: architecture design, CDC setup, conflict resolution logic, and operational monitoring. See the 137Foundry services overview for integration work in broader technical contexts.

The code patterns in this guide are starting points. The specific conflict resolution strategy, CDC mechanism, and monitoring thresholds depend on the systems being connected and the business requirements. Getting those decisions right before writing the first line of sync code is where most of the value lies.

Testing Bi-Directional Sync Before Production

Testing a bi-directional sync differs from testing a one-way pipeline. One-way pipeline tests can verify that a given input produces a given output. Bi-directional sync tests must also verify behavior under conflict scenarios, duplicate delivery, partial failures, and sync loops.

The test suite for a bi-directional sync should cover: the standard happy path (change in A propagates to B, change in B propagates to A); the conflict resolution path (simultaneous changes with various timestamps); the DLQ routing path (changes that should be unresolvable); the idempotency path (same change delivered twice produces the same result); and the sync loop prevention path (changes made by the sync layer are not re-propagated).
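
The pure-logic paths in that list can be covered with plain unit tests before any database is involved. The sketch below uses stand-in versions of a last-write-wins resolver and an idempotent consumer so that it is self-contained; the function names and the in-memory store are hypothetical, and the tests are written pytest-style as bare functions.

```python
from datetime import datetime, timedelta

TOLERANCE = timedelta(milliseconds=500)
T0 = datetime(2024, 1, 1, 14, 32, 7)

def resolve(record_a, record_b, tolerance=TOLERANCE):
    """Stand-in resolver: newer wins, None means 'send to the DLQ'."""
    delta = abs(record_a["updated_at"] - record_b["updated_at"])
    if delta <= tolerance:
        return None
    return record_a if record_a["updated_at"] > record_b["updated_at"] else record_b

def apply_once(store, seen, event):
    """Stand-in idempotent consumer; returns True only when the event is applied."""
    if event["change_id"] in seen:
        return False
    seen.add(event["change_id"])
    store[event["id"]] = event["name"]
    return True

def test_later_write_wins():
    a = {"id": 1, "name": "from A", "updated_at": T0}
    b = {"id": 1, "name": "from B", "updated_at": T0 + timedelta(seconds=2)}
    assert resolve(a, b) is b

def test_near_simultaneous_writes_go_to_dlq():
    a = {"id": 1, "name": "from A", "updated_at": T0}
    b = {"id": 1, "name": "from B", "updated_at": T0 + timedelta(milliseconds=200)}
    assert resolve(a, b) is None

def test_duplicate_delivery_is_idempotent():
    store, seen = {}, set()
    event = {"change_id": "c-1", "id": 1, "name": "Ada"}
    assert apply_once(store, seen, event) is True
    assert apply_once(store, seen, event) is False
    assert store == {1: "Ada"}
```

Tests like these pin down the decision logic only. The happy path and loop prevention depend on how the two real systems behave, so they belong in integration tests.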

Test data setup matters. Bi-directional sync tests should use a real or near-real database pair rather than mocks, because mock-based tests will not catch the integration-level failures (replication slot behavior, trigger timing, network timeouts) that dominate production incidents.

For the complete architecture guide and testing recommendations for bi-directional data sync, see the 137Foundry data integration resources. For teams integrating multiple business applications and looking for implementation support, the data integration services at 137Foundry and the broader services overview cover the full integration lifecycle.

Why This Matters for Production Reliability

The failure modes of bi-directional sync are almost always discovered in production, not in testing. Test environments rarely replicate the exact conditions that cause clock skew conflicts: in a test setup, both systems often run on the same host or share a time source, so skew is close to zero. Test environments rarely replicate the specific bulk operation patterns that create consistency gaps. And test environments rarely run long enough to reveal the slow drift that accumulates when a field authority map is not updated after a schema change.

This is not an argument against testing -- it is an argument for investing in observability alongside testing. The monitoring patterns in this guide (sync lag, DLQ depth, conflict rate, record count parity) give you visibility into problems that tests will not catch before they affect users.

For teams building a bi-directional sync for the first time, the practical recommendation is: build the operational baseline (DLQ, monitoring, idempotency, loop prevention) before the first production deployment, not after the first production incident. The upfront cost is modest. The incident prevention value is significant.

For technical implementation guidance, see 137Foundry and the data integration resources. For production architecture review and implementation support, the 137Foundry services team works with teams across the integration lifecycle.