How to Handle Idempotency in Data Integration Pipelines

Every data integration pipeline that touches more than one system eventually has to answer the same question: what happens when a message is delivered twice? The honest answer is that retries are not a rare edge case. They are the normal operating mode of any pipeline that crosses a network boundary, and the only pipelines that do not retry are the ones that have not yet hit the failures that force them to.

A pipeline that processes the same message twice and produces two different outcomes (two charged invoices, two created users, two synced records) is broken in a way that no amount of monitoring will catch in time. The fix is idempotency, designed into the operations themselves rather than bolted on at the edges.

This guide walks through what idempotency actually means in a data integration context, why most retry strategies fail without it, the patterns that work in practice, and the trade-offs each pattern carries.


What Idempotency Means in This Context

Wikipedia defines idempotence as a property of an operation that produces the same result whether it is applied once or many times. In a data integration context, this means an operation that can be safely retried without producing duplicate side effects.

The specific guarantees vary by operation type. A "create user" operation is idempotent if calling it twice produces a single user record, not two. A "charge $50" operation is idempotent if calling it twice charges $50 total, not $100. A "sync inventory count to 7" operation is idempotent because the final state is the same regardless of how many times it runs.

The third example is the easy case: any operation that sets state to an absolute value (rather than applying a relative change) is naturally idempotent. The first two examples require explicit design work to make them safe to retry, and that design work is what most pipelines skip until production teaches them otherwise.

Why Retries Are Not Optional

The most common reason pipelines lack idempotency is a hopeful belief that retries can be avoided through better error handling. This is wrong in a way that becomes obvious only after the first major outage.

Networks fail. Receivers finish processing, but their acknowledgment is lost or arrives after the sender has already timed out. Senders crash between sending and recording the send. Brokers redeliver after a consumer rebalance. Any of these scenarios produces a duplicate delivery, and the sender has no way to tell the difference between "the receiver did not get my message" and "the receiver got my message but I never saw the acknowledgment."

The distributed systems community has spent decades chasing exactly-once delivery and has consistently arrived at the same answer: exactly-once effects are achievable only by accepting at-least-once delivery at the transport layer and making the operations themselves idempotent. Wikipedia's overview of extract, transform, load processes describes the same fundamental constraint applied to batch data movement.

This is why retry safety has to be designed into the integration, not assumed away. Every operation that crosses a network boundary will eventually be delivered more than once, and the integration either survives that or it produces silent data corruption.

The Idempotency Key Pattern

The most common pattern for making an operation idempotent is the idempotency key. The sender generates a stable, unique key for each logical operation and includes it with every retry of that operation. The receiver records which keys it has already processed and returns the original result without re-applying side effects on duplicates.

An idempotency key has two requirements: it must be unique per operation, and it must be stable across retries. Both matter. If the key is not unique per operation, two different operations collide and one of them is silently lost. If the key is not stable across retries, the receiver cannot tell that two requests represent the same logical operation.

In practice, idempotency keys are usually UUIDs generated at the moment the originating event is first recorded (not at send time). The pattern is well documented by practitioners such as Martin Fowler (martinfowler.com), and it is the same approach used by payment APIs (Stripe's "Idempotency-Key" header is the canonical implementation) and by major message queue systems.

A typical implementation:

The sender records the operation in its local store with a generated UUID.
The sender sends the operation to the receiver with the UUID in a header or body field.
The receiver looks up the UUID in its own "processed operations" store.
If found, return the original result. Do not re-apply side effects.
If not found, apply the operation, record the UUID and result, and return the result.
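The steps above can be sketched in Python, with the standard library's sqlite3 module standing in for the receiver's database. Treat this as a minimal sketch under stated assumptions, not a production implementation. The table names (processed_operations, users) and the create_user function are hypothetical, and "create user" stands in for any side effect.

```python
import json
import sqlite3
import uuid

# Hypothetical receiver-side store; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE processed_operations (
        idempotency_key TEXT PRIMARY KEY,
        result TEXT NOT NULL
    );
    CREATE TABLE users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        email TEXT NOT NULL
    );
    """
)


def create_user(idempotency_key: str, email: str) -> dict:
    """Apply a 'create user' operation at most once per idempotency key."""
    with conn:  # the two INSERTs below commit together or roll back together
        row = conn.execute(
            "SELECT result FROM processed_operations WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        if row is not None:
            # Duplicate delivery: return the original result, skip the side effect.
            return json.loads(row[0])
        cur = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        result = {"user_id": cur.lastrowid, "email": email}
        # Dedup record written in the same transaction as the side effect.
        conn.execute(
            "INSERT INTO processed_operations (idempotency_key, result) VALUES (?, ?)",
            (idempotency_key, json.dumps(result)),
        )
        return result


# Usage: the key is generated once, when the event is first recorded.
key = str(uuid.uuid4())
first = create_user(key, "ada@example.com")
retry = create_user(key, "ada@example.com")  # same key: no second user row
```

The PRIMARY KEY on idempotency_key does more than speed up lookups. If two deliveries of the same key race past the SELECT on a multi-connection database, the second INSERT fails, and its transaction rolls back along with the duplicate user row.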

The receiver's "processed operations" store can be a database table, a key-value cache with a TTL, or, at high volume, a dedicated dedup layer.


The Natural Idempotency Pattern

The cleanest idempotent operations are those that set state to an absolute value. A "set inventory count to 7" message is naturally idempotent because the result is the same regardless of how many times the message is applied.

This contrasts with relative operations like "increment inventory by 3," which are not naturally idempotent: applying the message twice adds 6 instead of 3.
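Here is a toy illustration of the difference. An in-memory dictionary plays the receiver's state, and the message shapes are hypothetical. A duplicated absolute message is harmless, while a duplicated relative message changes the result.

```python
# Illustrative in-memory inventory; message shapes are hypothetical.
inventory = {"sku-123": 4}


def apply_set(msg: dict) -> None:
    """Absolute-state message ('set count to N'): safe to apply repeatedly."""
    inventory[msg["sku"]] = msg["count"]


def apply_increment(msg: dict) -> None:
    """Relative message ('add N'): every duplicate delivery changes the result."""
    inventory[msg["sku"]] += msg["delta"]


set_msg = {"sku": "sku-123", "count": 7}
apply_set(set_msg)
apply_set(set_msg)  # duplicate delivery
print(inventory["sku-123"])  # 7

inventory["sku-123"] = 4
inc_msg = {"sku": "sku-123", "delta": 3}
apply_increment(inc_msg)
apply_increment(inc_msg)  # duplicate delivery
print(inventory["sku-123"])  # 10, not the intended 7
```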

When designing the integration's message format, the choice between absolute and relative operations is one of the most consequential decisions. Absolute operations are easier to make safe but require more state in the message. Relative operations are smaller but require an idempotency key or sequence number to handle retries safely.

137Foundry's data integration practice defaults to absolute-state messages for any operation that does not have a strong reason to use deltas, because the production cost of getting delta semantics wrong is almost always larger than the bandwidth saved.

The trade-off is bandwidth and storage. Sending the full state of a 50-field record every time something changes is wasteful when only one field moved. The right answer depends on the volume and the cost of bandwidth versus the cost of an integration outage from a missed retry.

The Versioning Pattern

When relative operations are unavoidable, a version number on the entity being modified handles the same retry-safety question slightly differently.

The sender includes the expected current version of the entity. The receiver checks that the entity is actually at that version before applying the change. If a retry arrives after the change has already been applied (incrementing the version), the second attempt fails the version check and is harmlessly ignored.

This is the classic optimistic concurrency control pattern, and it doubles as retry protection in distributed integrations. Most relational databases support it at the row level through an explicit version column, and PostgreSQL can also use its xmin system column for the same purpose.
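The sketch below shows the version check as a single conditional UPDATE, again with sqlite3 as a stand-in database. The inventory table and the apply_increment function are hypothetical. The table uses the portable explicit-version-column approach rather than PostgreSQL's xmin.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, count INTEGER NOT NULL, "
    "version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES ('sku-123', 4, 1)")
conn.commit()


def apply_increment(sku: str, delta: int, expected_version: int) -> bool:
    """Apply a relative change only if the row is still at expected_version.

    Returns True if the change was applied, False if the version check failed.
    """
    with conn:
        cur = conn.execute(
            "UPDATE inventory SET count = count + ?, version = version + 1 "
            "WHERE sku = ? AND version = ?",
            (delta, sku, expected_version),
        )
    return cur.rowcount == 1


print(apply_increment("sku-123", 3, expected_version=1))  # True
print(apply_increment("sku-123", 3, expected_version=1))  # False: the retry is ignored
print(conn.execute("SELECT count, version FROM inventory").fetchone())  # (7, 2)
```

A False result only says the row was not at the expected version. That could be a harmless retry or a genuinely concurrent write from another sender. A sender that needs to tell the two apart has to re-read the row, which is the round-trip read cost this pattern carries.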

The trade-off is that the sender must know the current version, which usually means a round-trip read before the write. For high-throughput integrations, that read overhead can be significant. The idempotency key pattern is usually preferred when reads are expensive.

Deduplication at the Broker Level

Some message brokers (notably Kafka with idempotent producer mode) can handle a subset of the dedup problem at the transport layer. The producer attaches a sequence number to each message, and the broker rejects out-of-order or duplicate sequence numbers from the same producer.

This solves duplicate delivery from producer to broker. It does not solve the consumer side of the problem: a consumer that crashes after processing but before committing the offset will still re-receive the message and need to handle it idempotently.
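The consumer-side gap can be illustrated with a toy, in-memory simulation rather than a real Kafka client. A list stands in for a partition, a dictionary entry for the committed offset, and a set for the consumer's processed-key store. All names are hypothetical. The consumer processes a message and then "crashes" before committing its offset, so the message is redelivered on restart.

```python
# Toy stand-in for a single partition; not a real broker client.
log = [
    {"key": "evt-1", "amount": 50},
    {"key": "evt-2", "amount": 20},
]
state = {"committed_offset": 0}  # consumer's last committed position
processed_keys: set[str] = set()  # consumer-side dedup store (in memory here)
ledger: list[int] = []  # side effects, e.g. charges applied


def handle(msg: dict) -> None:
    """Idempotent handler: skip messages whose key was already processed."""
    if msg["key"] in processed_keys:
        return
    ledger.append(msg["amount"])
    processed_keys.add(msg["key"])


def consume(crash_before_commit_at=None) -> None:
    for offset in range(state["committed_offset"], len(log)):
        handle(log[offset])
        if offset == crash_before_commit_at:
            return  # simulated crash: processed, but offset never committed
        state["committed_offset"] = offset + 1


consume(crash_before_commit_at=0)  # processes evt-1, crashes before commit
consume()  # restart: evt-1 is redelivered, and the handler skips it
print(ledger)  # [50, 20]
```

In a real consumer, processed_keys would live in a durable store and be updated in the same transaction as the side effect. An in-memory set would be lost in the very crash it is meant to survive.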

The right framing for broker-level dedup is "necessary but not sufficient." It eliminates one class of duplicate, but the end-to-end integration still needs idempotent consumer operations to be fully safe.

State Stores and Cleanup

Whichever pattern is used, the receiver needs to store the dedup state somewhere. This raises two practical questions: how long to retain the state, and how to bound the size of the store.

The retention window has to be longer than the maximum possible retry window. If a producer retries for up to 24 hours after the original send, the receiver's dedup store has to retain idempotency keys for at least 24 hours. Most production systems set this to a multiple of the longest practical retry window (48 to 72 hours is common).

The store size is bounded by retention window × message rate. For high-volume integrations, the dedup store can grow to billions of rows. Time-partitioned tables with automated cleanup of partitions older than the retention window are the standard approach.
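The sizing rule is simple arithmetic. The sketch below uses hypothetical inputs: a 24-hour maximum retry window, retention set to three times that (the top of the 48-to-72-hour range above), and an assumed steady rate of 5,000 messages per second.

```python
def retention_hours_for(max_retry_hours: float, multiple: float = 3.0) -> float:
    """Retention as a multiple of the longest retry window (3x is an assumption)."""
    return max_retry_hours * multiple


def dedup_store_rows(retention_hours: float, messages_per_second: float) -> int:
    """Approximate steady-state row count: retention window x message rate."""
    return int(retention_hours * 3600 * messages_per_second)


retention = retention_hours_for(24)  # 72.0 hours
rows = dedup_store_rows(retention, 5_000)  # hypothetical 5,000 messages/s
print(f"{rows:,}")  # 1,296,000,000
```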

A garbage-collected dedup table with a clear retention policy is the operational pattern that makes the idempotency key approach sustainable at scale. The alternative (an unbounded dedup table) eventually becomes the single largest cost in the integration.
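The sketch below shows the cleanup side, using sqlite3 and a hypothetical processed_operations table with a created_at timestamp. SQLite has no table partitioning, so it uses a row-level DELETE against an indexed timestamp. On a partitioned PostgreSQL table, the equivalent step would be dropping whole partitions older than the cutoff, which is much cheaper at high volume. The 72-hour retention window is an assumption.

```python
import sqlite3
import time

RETENTION_SECONDS = 72 * 3600  # assumed retention window: 72 hours

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_operations ("
    " idempotency_key TEXT PRIMARY KEY,"
    " created_at REAL NOT NULL)"
)
conn.execute("CREATE INDEX idx_processed_created_at ON processed_operations (created_at)")


def purge_expired(now: float) -> int:
    """Delete dedup records older than the retention window; return rows removed."""
    cutoff = now - RETENTION_SECONDS
    with conn:
        cur = conn.execute(
            "DELETE FROM processed_operations WHERE created_at < ?", (cutoff,)
        )
    return cur.rowcount


now = time.time()
conn.executemany(
    "INSERT INTO processed_operations VALUES (?, ?)",
    [("old-key", now - 100 * 3600), ("fresh-key", now - 1 * 3600)],
)
conn.commit()
print(purge_expired(now))  # 1: only the 100-hour-old record is removed
```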

"Most of the integration incidents we get called in to triage trace back to a missing idempotency key on an operation that was assumed to be safe to retry. The fix is rarely complicated, but the cleanup of the corrupted state is always expensive." - Dennis Traina, founder of 137Foundry


The Coordination Trap

A common mistake is trying to coordinate idempotency across multiple receivers using something like a two-phase commit (2PC) protocol. This is technically correct but operationally fragile: the coordinator becomes a single point of failure, and if it crashes partway through the protocol, participants can be left blocked, holding locks, until it recovers.

The more pragmatic pattern for multi-receiver integrations is the saga pattern. Each step in a multi-system integration is idempotent and locally consistent, and if a later step fails, compensating actions undo the steps that already completed, without requiring distributed coordination. Wikipedia's overview of the saga pattern covers the trade-offs.

Saga-based integrations are operationally simpler than 2PC and degrade more gracefully under partial failure. They do require designing the compensating actions explicitly, which is real design work, but that work is what buys the simplicity.
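Below is a minimal, in-process sketch of a saga orchestrator. It assumes each step is a (name, action, compensation) triple and that both actions and compensations are idempotent. The run_saga function and the step names are hypothetical. A production saga would also persist its progress so that a crashed orchestrator can resume or compensate. That persistence is omitted here.

```python
from typing import Callable

# (name, action, compensation); both callables are assumed to be idempotent.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
        except Exception:
            # No global coordinator: undo only what this saga already did.
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True


# Usage with hypothetical steps; the payment step fails.
events: list[str] = []


def fail_charge() -> None:
    raise RuntimeError("payment service unavailable")


ok = run_saga([
    ("reserve", lambda: events.append("reserved"), lambda: events.append("released")),
    ("charge", fail_charge, lambda: events.append("refunded")),
    ("ship", lambda: events.append("shipped"), lambda: events.append("cancelled")),
])
print(ok, events)  # False ['reserved', 'released']
```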

A Practical Implementation Checklist

For a team adding idempotency to an existing integration:

Identify every operation that crosses a network boundary. Each one needs explicit retry safety.
For each operation, decide which idempotency pattern fits: absolute-state messaging, idempotency keys, version checks, or saga compensation.
Generate idempotency keys at the source-of-truth event, not at send time. The key must be stable across retries.
Add a dedup store to each receiver with a retention window longer than the maximum retry duration.
Add metrics on dedup hits. A dedup hit rate above zero is normal; a sudden spike is a signal that retries are happening more often than expected.
Document the chosen pattern for each integration so the next engineer maintaining the code does not silently break it.

For broader context on integration design tradeoffs across all of these patterns, the 137Foundry services overview covers how the data integration practice fits into the broader engineering work, and the 137Foundry homepage has further examples of the kinds of systems these patterns get applied to.

Common Mistakes

A few patterns reliably break idempotency in subtle ways.

Generating the idempotency key at send time rather than event time. A retry that regenerates the key looks like a new operation to the receiver. The key must be stable across the entire retry chain.
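The difference is easy to see on the sender side. In this sketch the key is assigned once, when a hypothetical OutboundOperation record is created, and every retry reuses it. The transport argument is a stand-in for a real network call. The flaky version below fails on its first attempt after the receiver has already seen the request, which simulates a lost acknowledgment.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class OutboundOperation:
    """Hypothetical sender-side record, written when the event is first recorded."""
    payload: dict
    # Assigned once at event time and persisted with the operation.
    idempotency_key: str = field(default_factory=lambda: str(uuid.uuid4()))


def send_with_retries(
    op: OutboundOperation,
    transport: Callable[[str, dict], str],
    max_attempts: int = 3,
) -> str:
    """Retry one logical operation; every attempt reuses op.idempotency_key."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transport(op.idempotency_key, op.payload)
        except ConnectionError:
            if attempt == max_attempts:
                raise


# Usage: a flaky transport whose first acknowledgment is "lost".
seen_keys: list[str] = []


def flaky_transport(key: str, payload: dict) -> str:
    seen_keys.append(key)  # the receiver saw this request
    if len(seen_keys) == 1:
        raise ConnectionError("acknowledgment lost")
    return "ok"


op = OutboundOperation(payload={"email": "ada@example.com"})
print(send_with_retries(op, flaky_transport))  # ok
print(len(set(seen_keys)))  # 1: both attempts carried the same key
```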

Dedup store without TTL or cleanup. The store grows unbounded, eventually dominating storage cost or becoming slow enough to be the bottleneck.

Treating partial-failure responses as success. A receiver that applies the side effect and returns 200 OK before the idempotency key is durably recorded leaves a window: if the key write fails or the process crashes, the next retry repeats the side effect. The dedup record must be written atomically with the side effect, in the same transaction where possible.

Assuming sender retries are byte-for-byte identical. Different client libraries handle retries differently, and the same operation can arrive with subtly different request bodies if the sender's serialization is not stable. The idempotency key, not the request body, is the source of truth for "is this a retry."

The Honest Framing

Idempotency is not optional in any data integration that touches more than one system. The choice is between designing it in deliberately or discovering its absence the hard way in production.

The patterns are well understood and have been documented for decades. The work is real but not exotic, and the operational cost of not doing it (silent data corruption, double-charged customers, drifted record counts) is much larger than the design cost of doing it.

A pipeline that survives retries is one that the team can sleep through. A pipeline that does not survive retries is one that produces a quiet stream of data corruption that takes weeks to detect and longer to clean up.

The right time to design for idempotency is before the first production retry. The wrong time is after the corruption has already happened.