How to Design Integration Error Queues Your Team Will Drain

Every data integration system eventually has an error queue. The first version of the pipeline assumes nothing will fail. The second version, after a bad week, adds a place for failed records to land. The third version, after that error queue grows to 200,000 items that nobody has triaged in six months, raises the question we're going to spend this article on: how do you actually design an error queue so the team drains it instead of ignoring it?

The answer is not "more alerting" or "more rigorous engineering culture." Both of those help on the margin, but the structural design of the queue itself is what determines whether a team treats it as a daily workstream or a dumping ground.

This is a guide to the structural part: queue shape, ownership, triage rules, and the few alerts that actually drive action.

Why most error queues become graveyards

The pattern is predictable. The queue is set up as a single FIFO list of "things that failed." Items arrive at any rate. When the rate is low, nobody looks. When the rate is high, the queue grows faster than anyone can triage. Either way, no individual engineer is responsible for it, the SLA is "we'll get to it," and the queue's signal value collapses.

By month three, the queue is large enough that opening it is psychologically unrewarding. There are 2,400 items. Even if you fix 30 today, the queue still has 2,370 items tomorrow. Engineers who try to make a dent feel like they're not making progress. The queue stops being a tool and becomes a place to look away from.

The fix is not "try harder." It is to design the queue so it cannot reach that state in the first place.

Principle 1: separate transient from terminal failures

The biggest single design decision is to never put two kinds of failure into the same queue.

Transient failures are things that should succeed on retry: a downstream API returned a 503, a database hit a deadlock, a network connection dropped mid-request. These need automatic retry with backoff, not human triage. They should never reach a human queue unless the retries themselves have been exhausted.

Terminal failures are things that no retry can fix: a payload that violates a schema, a referenced entity that doesn't exist, a permission that was revoked. These need human triage immediately, because retrying them just wastes compute.

A single queue conflating both kinds means engineers can't tell which items are "wait it out" versus "fix something." So the default becomes "wait it out," and the schema-violation items rot. Split them at write time, with two physically distinct queues (or two distinct status values within one queue), and the team always knows what they're looking at when they open the right one.
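
As a rough sketch of the write-time split, the classifier below sorts a failure into one of the two kinds and computes a simple exponential backoff delay for the transient ones. The status-code set, the error-class names, and the `max_attempts` default are assumptions for illustration, not a standard; tune them per integration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"  # retry automatically with backoff
    TERMINAL = "terminal"    # route to human triage


# Assumed values for illustration; adjust per integration.
TRANSIENT_STATUS_CODES = {429, 502, 503, 504}
TRANSIENT_ERROR_CLASSES = {"DeadlockError", "ConnectionDropped", "Timeout"}


@dataclass
class Failure:
    error_class: str
    status_code: Optional[int] = None
    attempts: int = 0


def classify(failure: Failure, max_attempts: int = 5) -> FailureKind:
    """Decide at write time which queue a failure belongs in."""
    retryable = (
        failure.status_code in TRANSIENT_STATUS_CODES
        or failure.error_class in TRANSIENT_ERROR_CLASSES
    )
    # A retryable error that has used up its retries now needs a human.
    if retryable and failure.attempts < max_attempts:
        return FailureKind.TRANSIENT
    return FailureKind.TERMINAL


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff delay: base * 2^attempt, capped."""
    return min(cap, base * 2 ** attempt)
```

In a real pipeline you would likely add random jitter to the delay so that retries from many workers do not line up; it is left out here to keep the sketch deterministic.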

The Amazon Builders' Library has a clean writeup of this distinction in its reliability articles, framing it as the difference between recoverable and non-recoverable errors. The naming matters less than the separation.

Principle 2: enforce a cap on terminal-queue size

This is the most counterintuitive design rule on the list. If your terminal-failure queue grows past a defined cap (say, 500 items), the pipeline should refuse to add new items to it. Either it should page someone immediately, or it should pause the producer that's generating the failures.

The instinct is the opposite: "we don't want to lose anything; just keep adding to the queue." In practice, an unbounded queue behaves like a memory leak. Once the queue is too big to drain in a reasonable timebox, it stops being actionable, and the data is effectively lost anyway, because nobody is going to spend three weeks paging through 50,000 items.

A bounded queue creates back-pressure. If the queue is filling faster than the team can drain it, the system tells someone to either fix the upstream cause or expand the team handling it. Either action is better than silently growing.
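
A minimal in-memory sketch of the cap, assuming a hypothetical `on_full` hook that either pages someone or pauses the producer. A production queue would live in a database or broker; the class and exception names here are made up for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict


class QueueFull(Exception):
    """Raised when the terminal queue is at its cap."""


class BoundedTerminalQueue:
    def __init__(self, cap: int, on_full: Callable[[int], None]) -> None:
        self.cap = cap
        # Hypothetical hook: page someone or pause the producer.
        self.on_full = on_full
        self._items: Deque[Dict] = deque()

    def __len__(self) -> int:
        return len(self._items)

    def add(self, item: Dict) -> None:
        if len(self._items) >= self.cap:
            # Refuse the write and push the problem back upstream.
            self.on_full(len(self._items))
            raise QueueFull(f"terminal queue at cap ({self.cap})")
        self._items.append(item)

    def drain_one(self) -> Dict:
        return self._items.popleft()
```

The exception is the back-pressure: the producer has to stop and hold the record rather than hand it off and move on, which is what forces a decision about the upstream cause.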

Principle 3: name an owner per integration, not per queue

A common mistake is to put one engineer "on error queue rotation." This makes the queue a chore that rotates, never a system anyone owns.

A better pattern is to make each integration's failures owned by the team that built or operates that integration. The Salesforce-to-NetSuite sync's terminal failures go to the team that built that sync. The marketing-platform webhook failures go to the marketing-tools team. The Stripe events go to the billing team.

Each team gets its own triage view filtered to its integration. The view is a daily standup item for that team. The queue's existence ties to a specific team's work, not to a centralized cleanup rotation.
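
One way to make ownership explicit is a lookup with no catch-all default, so an integration without a registered owner fails loudly instead of landing in a shared pile. The integration keys and team names below are hypothetical, loosely based on the examples above.

```python
from typing import Dict, List

# Hypothetical integration keys and team names, for illustration only.
OWNERS: Dict[str, str] = {
    "salesforce_netsuite_sync": "sync-team",
    "marketing_webhooks": "marketing-tools",
    "stripe_events": "billing",
}


def owner_for(integration: str) -> str:
    """Return the owning team; there is deliberately no default owner."""
    try:
        return OWNERS[integration]
    except KeyError:
        raise LookupError(f"no owner registered for {integration!r}") from None


def triage_view(items: List[dict], team: str) -> List[dict]:
    """Filter queue items down to the ones a given team owns."""
    return [i for i in items if owner_for(i["integration"]) == team]
```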

The Google Site Reliability Engineering book makes a similar argument in its chapter on alerting: alerts should reach the team that can actually act on the underlying cause, not a generic ops queue.

Principle 4: every queue item carries enough context to triage in 60 seconds

The queue is only as useful as the information in each item. If an item is just {error: "validation failed"}, every triage step requires opening logs, finding the request, finding the response, finding the source data. That's 10 minutes per item, which is why nobody drains the queue.

Each item in the queue should carry:

The full source payload that failed (not a reference, the payload itself)
The integration name and the specific stage that failed
The full error from the receiving system (status code, response body, error class)
A correlation ID linking to all related log lines for this attempt
The number of prior retry attempts and timestamps
A "suggested next action" field auto-populated by the pipeline based on the error class

That last field is the underrated one. If the pipeline can detect "this is a schema validation error on field X" and write "fix the source data for record Y" into the suggested-action field, the engineer triaging it doesn't have to derive the action; they just have to do it. Triage time drops from 10 minutes to 60 seconds.
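
The fields listed above could be carried in a record like the following sketch. The field names, error classes, and action wording are assumptions for illustration; the point is that the suggested action is derived from the error class when the item is written, not at triage time.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List

# Hypothetical error classes and wording, for illustration only.
ACTION_TEMPLATES = {
    "SchemaViolation": "Fix the source data for record {record_id} (field {field}).",
    "MissingReference": "Create the referenced entity, then replay record {record_id}.",
}
DEFAULT_ACTION = "No rule for this error class; investigate record {record_id}."


def suggest_action(error_class: str, record_id: str, field_name: str = "unknown") -> str:
    template = ACTION_TEMPLATES.get(error_class, DEFAULT_ACTION)
    # str.format ignores keyword arguments a template does not use.
    return template.format(record_id=record_id, field=field_name)


@dataclass
class QueueItem:
    integration: str
    stage: str
    payload: Dict[str, Any]   # the full source payload, not a reference
    error_class: str
    status_code: int
    response_body: str
    correlation_id: str
    attempt_times: List[datetime] = field(default_factory=list)
    suggested_action: str = ""


item = QueueItem(
    integration="salesforce_netsuite_sync",
    stage="validate",
    payload={"id": "A-17", "email": "not-an-email"},
    error_class="SchemaViolation",
    status_code=422,
    response_body='{"error": "validation failed", "field": "email"}',
    correlation_id="req-0001",
    suggested_action=suggest_action("SchemaViolation", "A-17", "email"),
)
```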

"We treat the error queue as a product surface, not a debugging surface. If a record sits in the queue for more than an hour, the queue itself failed its job; the design needs to surface enough context that an engineer can act immediately, not investigate." - Dennis Traina, founder of 137Foundry

Principle 5: alert on rate-of-arrival, not on queue size

A queue size alert ("alert when queue > 100 items") is almost always wrong, because it fires both when the producer just spiked AND when the consumer is slow, and you can't tell which from the alert.

A rate-of-arrival alert ("alert when more than 30 items entered the terminal queue in the last 15 minutes") is much more actionable. It fires when something upstream broke, which is the case where you need someone to investigate the source, not just drain the queue.

Pair it with a flow-stalled alert ("alert when no items have entered the queue in 4 hours during business hours and producer is supposedly running"). This catches the opposite failure mode: the producer is silently broken and you don't even know there should be data flowing.
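
Both alerts reduce to simple checks over arrival timestamps. This is a sketch under stated assumptions: the 9-to-17 business-hours window and the function names are invented for the example, and the thresholds are the ones quoted above.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def rate_of_arrival_alert(
    arrivals: Sequence[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=15),
    threshold: int = 30,
) -> bool:
    """Fire when more than `threshold` items arrived within the last `window`."""
    recent = sum(1 for t in arrivals if now - window < t <= now)
    return recent > threshold


def flow_stalled_alert(
    arrivals: Sequence[datetime],
    now: datetime,
    producer_running: bool,
    quiet: timedelta = timedelta(hours=4),
    business_hours: Tuple[int, int] = (9, 17),  # assumed window, local time
) -> bool:
    """Fire when nothing has arrived for `quiet` while the producer should be active."""
    start, end = business_hours
    if not producer_running or not (start <= now.hour < end):
        return False
    last = max(arrivals, default=None)
    return last is None or now - last >= quiet
```

The stalled check only looks at whether the current time falls in business hours, so it can fire early in the day on a gap that began overnight; a stricter version would count only business-hour time toward the quiet period.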

The PagerDuty community has good writeups on the difference between symptom alerts and cause alerts; this falls firmly in the symptom-alert category. The broader application-architecture principles behind these alerting choices are also well-covered in the Twelve-Factor App methodology, particularly the logs and processes factors.

Principle 6: triage rules are part of the system, not the team's memory

If "we usually replay these" is a rule that lives in someone's head, the rule will be applied inconsistently and forgotten when that person leaves.

Codify common triage decisions:

"Items with error class X are auto-replayed up to N times before reaching the human queue."
"Items with error class Y are auto-routed to a downstream team's queue rather than ours."
"Items older than 30 days with no resolution are escalated to a weekly architecture review."

The rules live in code or configuration, not in a wiki page. When someone wants to change a rule, they change the code, the change gets reviewed, and the team's behavior shifts together rather than drifting per individual.
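
Here is one possible shape for rules-as-code, mirroring the three example rules above. The error class names, the route target, and the returned action strings are hypothetical; what matters is that the decision is a reviewed function rather than a habit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TriageRule:
    error_class: str
    action: str                    # "auto_replay" or "route"
    max_replays: int = 0
    route_to: Optional[str] = None


# Hypothetical rules for illustration; real ones belong in reviewed config.
RULES: List[TriageRule] = [
    TriageRule("RateLimited", "auto_replay", max_replays=3),
    TriageRule("UnknownCustomer", "route", route_to="billing"),
]
ESCALATE_AFTER_DAYS = 30


def decide(error_class: str, replays_so_far: int, age_days: int) -> str:
    """Return the next step for an unresolved item."""
    if age_days > ESCALATE_AFTER_DAYS:
        return "escalate:architecture-review"
    for rule in RULES:
        if rule.error_class != error_class:
            continue
        if rule.action == "auto_replay" and replays_so_far < rule.max_replays:
            return "auto_replay"
        if rule.action == "route":
            return f"route:{rule.route_to}"
    # No rule matched, or replays are used up: a human looks at it.
    return "human_queue"
```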

Principle 7: a "drained today" metric beats "queue depth"

The most common dashboard metric for an error queue is current depth. It is the wrong metric.

Depth tells you the queue exists. It doesn't tell you whether the team is making progress, whether the rate of arrival is sustainable, or whether yesterday's incidents are recurring today. Two queues with the same depth can have wildly different health.

Better metrics:

Items entered today vs. items drained today (the trend).
Median age of items currently in the queue.
Top three error classes by count (so you can fix the root cause, not the symptoms).
Replay success rate (the percentage of items that, when replayed, succeed; low replay success means the queue is collecting non-replayable items and you need to triage them rather than replay them).

These four metrics on a single dashboard let the team see whether they're winning or losing. Depth alone tells you nothing.
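
The four metrics are cheap to compute from data the queue already has. A sketch, assuming each open item is a dict with hypothetical `entered_at` and `error_class` keys and that the daily counters are tracked elsewhere:

```python
from collections import Counter
from datetime import datetime
from statistics import median
from typing import Any, Dict, List


def queue_metrics(
    open_items: List[Dict[str, Any]],
    entered_today: int,
    drained_today: int,
    replays_attempted: int,
    replays_succeeded: int,
    now: datetime,
) -> Dict[str, Any]:
    ages_hours = [
        (now - item["entered_at"]).total_seconds() / 3600 for item in open_items
    ]
    classes = Counter(item["error_class"] for item in open_items)
    return {
        # Positive means the queue grew today; negative means it shrank.
        "net_today": entered_today - drained_today,
        "median_age_hours": median(ages_hours) if ages_hours else 0.0,
        "top_error_classes": classes.most_common(3),
        # None when nothing was replayed, to avoid implying a 0% success rate.
        "replay_success_rate": (
            replays_succeeded / replays_attempted if replays_attempted else None
        ),
    }
```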

Principle 8: a weekly "what's still here" review

Even with the rules above, some items will sit in the queue for weeks because they require coordination with another team, or a vendor, or a customer. Without explicit attention, those items become permanent residents and the queue starts to grow.

Schedule a 30-minute weekly review where the team looks at every item older than 7 days. Three outcomes per item: fix it now, escalate it to a defined owner with a defined date, or document why it can't be fixed and accept the loss with a note. No item stays in the queue without one of those three outcomes.
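
The review agenda can be generated rather than assembled by hand. The sketch below, with hypothetical field and function names, lists items older than 7 days and refuses to record an outcome that is missing its owner, date, or note.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Any, Dict, List, Optional


class Outcome(Enum):
    FIX_NOW = "fix_now"
    ESCALATE = "escalate"
    ACCEPT_LOSS = "accept_loss"


def stale_items(
    items: List[Dict[str, Any]],
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> List[Dict[str, Any]]:
    """Items older than `max_age`, oldest first: the weekly review agenda."""
    stale = [i for i in items if now - i["entered_at"] > max_age]
    return sorted(stale, key=lambda i: i["entered_at"])


def record_outcome(
    item: Dict[str, Any],
    outcome: Outcome,
    owner: str = "",
    due: Optional[datetime] = None,
    note: str = "",
) -> Dict[str, Any]:
    if outcome is Outcome.ESCALATE and not (owner and due):
        raise ValueError("escalation needs a defined owner and a defined date")
    if outcome is Outcome.ACCEPT_LOSS and not note:
        raise ValueError("accepting the loss needs a note explaining why")
    return {**item, "outcome": outcome.value, "owner": owner, "due": due, "note": note}
```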

This sounds like a small thing. It is the single most effective practice for keeping a queue actually drained rather than nominally drained.

What an actually-drained queue looks like

A team running this pattern well has:

Two queues per integration (transient with retry, terminal for triage)
Each terminal queue capped at 500 items
Every item carrying full context and a suggested action
Two alerts (rate-of-arrival, flow-stalled), no depth alert
Triage rules codified in the pipeline itself
A four-metric dashboard tracked daily by the owning team
A weekly 30-minute review of stale items

The terminal queue typically holds fewer than 50 items for most of the day. It spikes during incidents and drains within a day or two. It should not sit at zero for long stretches (that would be suspicious; something is probably filtering errors silently), and it should never reach four digits (that would mean the system is broken in a structural way).

The integration is healthy when the queue is healthy, and the queue is healthy when the team treats it as a daily artifact, not an occasional cleanup project.

Where 137Foundry can help

If your team is staring at an error queue that has been growing for months and you want help redesigning the integration so the queue becomes a tool rather than a tax, our 137Foundry data integration service is built around exactly this kind of work. We've done it across Salesforce, NetSuite, HubSpot, Stripe, and a dozen other integration shapes. The patterns above apply to all of them with minor variations.

For the broader engineering and architecture work we cover, see 137Foundry. The about page has more context on the team's background.

A well-designed error queue is not a luxury or a stretch goal. It is the difference between an integration that quietly fails for months and one that surfaces every problem the day it happens. Build it once, build it right, and your team gets months of operational signal back.