Handling Late-Arriving Data in a Streaming Pipeline

The first month a streaming integration goes live, someone in finance opens yesterday's dashboard and the totals do not match the source system. The variance is small, maybe a fraction of a percent, but it is not zero. By the end of the week the gap has grown. By the end of the month, the dashboard and the source system disagree by a few thousand dollars, and nobody can explain why.

This is the late-arriving data problem. Events that should have been counted in Monday's totals showed up Tuesday or Wednesday, after the rollup ran. The dashboard never went back and updated Monday's number. The source system did, because it has no concept of "yesterday's rollup." The two are now permanently divergent, by exactly the volume of late events.

An aerial view of a freight container yard with shipping containers arranged in rows
Photo by MINEIA MARTINS on Pexels

This is a guide for handling that problem deliberately, so the streaming pipeline either prevents the divergence or corrects it on a schedule the business can plan around.

What "late" actually means

Late-arriving data comes in two distinct shapes, and treating them the same way produces the wrong fix.

The first is event-time skew. The event happened in the source system at 11:59 PM Monday, but the source system did not emit it to the integration until 12:01 AM Tuesday. The event is "late" relative to event time but on time relative to processing time. This is the common case for any source system that batches outbound emissions, has clock drift, or sits behind a queue.

The second is producer delay. The event was produced on time but a downstream system (a message broker, a connector, a queue) buffered or retried it, so it arrived hours or days after it was generated. This is the case for retry storms, outage backfills, and any flaky producer.

Both look the same in the consumer: an event with a timestamp older than the current watermark. But they are caused by different problems and the right tradeoff for one is wrong for the other.

The watermark and the grace period

Every modern streaming engine, from Apache Flink to Apache Beam to Kafka Streams, has the concept of a watermark. The Apache Flink documentation explains the formalism in detail, but the working definition is: a watermark for time T means "I believe I have received all events with event time at or before T."

Once the watermark passes T, any windowed aggregation for T closes and emits its result. After that, late events for T are either dropped, sent to a side output, or used to emit a correction. Which behavior you choose is the central design decision.

A common pattern is to set the watermark several minutes behind real time and allow a grace period during which late events are still accepted into the original window. For most low-volume integrations a five to fifteen minute grace period catches the vast majority of late events without delaying downstream reports unacceptably.

For higher-stakes pipelines (financial reporting, billing, regulatory data), the grace period might be hours or a full day, with explicit late-event tolerance set to match the source system's worst-case emission delay.

Side outputs for the truly late

Events that arrive after the grace period closes are "truly late" by the pipeline's definition. Dropping them silently is the single most common cause of dashboard drift. Do not drop them.

Instead, route them to a side output (Flink's side outputs, Beam's deadletter sinks, or a separate Kafka topic in a Kafka Streams app). The side output is then processed on a slower cadence, usually daily, to issue corrections to the affected windows.

The mechanics of issuing a correction depend on what downstream looks like. For an OLAP store like ClickHouse or BigQuery, the simplest correction is a versioned-fact pattern: every emitted aggregate row carries a version number, and the corrected row is written with an incremented version. Downstream queries select the highest version for each window. Old versions are kept for audit but never returned in current reports.

A close-up of a network operations console with timestamp logs scrolling
Photo by AMORIE SAM on Pexels

The versioned-fact pattern

The versioned-fact pattern deserves its own section because it is the most useful tool for handling corrections cleanly.

Each emitted aggregate row carries three timestamps: the window start, the window end, and a "version time" or "emitted at" stamp. Each row also has a deterministic primary key built from the window key plus the version stamp.

When a correction is emitted, the corrected aggregate is written with a later version stamp. Downstream queries use a view or a window function to select the latest version per window.

SELECT window_key, value
FROM (
  SELECT window_key, value, emitted_at,
         ROW_NUMBER() OVER (PARTITION BY window_key ORDER BY emitted_at DESC) AS rn
  FROM aggregate_facts
)
WHERE rn = 1

The view always reflects the latest correction. The underlying table keeps every version for audit. The dashboard sees the latest. The auditor can reconstruct any historical view by filtering on emitted_at.

This pattern works equally well in PostgreSQL, BigQuery, Snowflake, ClickHouse, and most other analytical stores. It does not require database features beyond window functions and primary keys.

When to pick lambda-style re-runs instead

The alternative to streaming corrections is the lambda architecture pattern: a streaming layer for near-real-time approximate results, and a batch layer that recomputes from raw events on a daily cadence and overwrites the streaming results.

This pattern is correct for pipelines where the business is comfortable with "yesterday's number is final, today's is approximate." Many financial reporting pipelines work this way explicitly: the streaming dashboard shows live progress, but the official number for any closed day comes from the batch re-run that happens at 2 AM the next morning.

Lambda is simpler to reason about than per-window corrections, and it makes audits easier because the batch layer is the source of truth. The cost is operational: you maintain two pipelines, and the streaming and batch logic must produce identical results given the same input, which is harder than it sounds. Martin Kleppmann's Designing Data-Intensive Applications is the canonical reference for the tradeoff, and the chapter on stream processing covers this directly.

"On every data integration project, the question is not whether late data exists. It does. The question is whether the pipeline accepts it deliberately, with a window and a correction, or accepts it silently as drift in a downstream report." - Dennis Traina, founder of 137Foundry

Idempotent emit on the consumer side

Whichever correction approach you pick, the consumer side of the integration has to be idempotent against re-emits. A late-arriving event that triggers a corrected aggregate must not produce duplicate rows downstream.

The clean pattern is to make the downstream load operation upsert against a deterministic key (window_key + emitted_at, or window_key + version_id), so re-emits land as updates rather than inserts. For database targets this is straightforward with ON CONFLICT DO UPDATE in PostgreSQL or MERGE in BigQuery and Snowflake. For data lake targets, the same pattern applies via merge operations on Delta Lake or Iceberg tables.

The Apache Iceberg specification covers the table-format requirements for idempotent merges in some detail, and most modern lakehouse tools support the pattern out of the box.

A close-up of fiber optic strands glowing in a server room
Photo by Nic Wood on Pexels

Observability for late events

The pipeline has to surface how often events arrive late and how late they typically are. Without observability, the team has no way to tune the grace period or detect a producer regression.

Three metrics are usually enough.

The first is "events past watermark" per minute, broken down by how late (less than the grace period, within the side-output window, or past the side-output window). A sudden change in this rate is the first signal of upstream trouble.

The second is the gap between event time and processing time at the consumer, percentiles fifty through ninety-nine. A growing tail tells you producer delay is increasing and the grace period needs to expand or the producer needs to be fixed.

The third is the rate of corrections emitted per day. Spikes here usually correlate with upstream incidents and are useful to have on the same dashboard as the source system's reliability metrics.

For practical implementations, Prometheus plus Grafana is the most common open-source stack, and the metrics above are straightforward to emit from any of the streaming engines.

A pragmatic default

For most data integration pipelines, the workable default is:

A five to fifteen minute grace period for late events to land in the original window. A side output for events arriving after the grace period, processed daily into a versioned-fact correction. Downstream queries that select the latest version per window. Observability that surfaces late-event rate and emit-correction rate.

This handles roughly 95 percent of late-data cases cleanly. The remaining 5 percent (extremely late events, producer outages lasting days) is usually handled by a manual backfill process, which is its own design problem and not worth automating until you have hit it three or four times.

If your team is wrestling with this on a specific integration and the pattern needs adapting to your stack, 137Foundry's data integration service has worked through this on enough pipelines (Stripe webhooks, Salesforce events, Kafka-to-warehouse loads) to share specific recipes. The services hub is at 137foundry.com if a longer conversation would help.

The key reframe is that late-arriving data is not a bug; it is a property of the world your pipeline operates in. The bug is when the pipeline pretends it does not exist. The fix is to put a window, a side output, and a correction policy in place from day one.

What changes downstream

The most underrated benefit of handling late data deliberately is that downstream consumers stop asking "why does this dashboard not match the source." The answer becomes "for the current window, it might lag by up to fifteen minutes; for closed windows, the corrected version is final after 2 AM the next day." That is a sentence the business can plan around.

Without the framework, the answer is "I am not sure, let me investigate," every time someone notices a discrepancy, forever. That is not a pipeline problem. That is an integration design problem, and the fix is upstream of any specific bug.

What "late" actually means

The watermark and the grace period

Side outputs for the truly late

The versioned-fact pattern

When to pick lambda-style re-runs instead

Idempotent emit on the consumer side

Observability for late events

A pragmatic default

What changes downstream

More Articles

How to Design a Multi-Step Form Wizard That Doesn't Lose Users Halfway Through

How to Build a Background Job Queue That Survives a Server Restart

How to Build an Icon System That Doesn't Fall Apart as Your Product Grows