Data pipelines fail quietly. A script that pulled 50,000 records yesterday returns 200 records today because the API throttled every request after the first few hundred. The pipeline logs show success -- the HTTP response was 200, just with empty pages -- and no one notices until someone checks the data.
API rate limits are one of the most reliable sources of brittle automation. They are not edge cases: nearly every production API enforces them, the limits can change without notice, and handling them at scale requires strategy, not just try-except blocks.
This guide covers the patterns that make API-dependent Python automation genuinely resilient: reading rate limit signals correctly, exponential backoff, production-grade retry logic with the tenacity library, and token bucket rate limiting for per-endpoint control.
What Rate Limits Are and Why They Fail Pipelines
A rate limit is a constraint on how frequently a client can make API requests, typically expressed as requests per second (RPS), requests per minute (RPM), or requests per day. When you exceed the limit, the API returns HTTP 429 Too Many Requests, often with headers telling you how long to wait.
The failure mode in automation is not usually a single 429. It is the response to a 429. Naive pipelines do one of three things: they crash, they retry immediately (making the situation worse by hammering a throttled endpoint), or they silently skip the failed request and continue -- which is the worst outcome because the data loss is invisible.
Correct rate limit handling requires:
- Detecting 429 responses and the headers that accompany them
- Waiting the right amount of time before retrying
- Distinguishing between transient rate limits (wait and retry) and permanent errors (stop and alert)
- Maintaining throughput as close to the API limit as possible without exceeding it
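The third requirement above, separating transient failures from permanent ones, can be sketched as a small classifier. This is a minimal illustration, not a complete policy: the `Action` enum and `classify_status` helper are hypothetical names, and treating every 5xx as retryable and every other non-2xx code as permanent is an assumption you may need to adjust per API.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"        # request succeeded
    RETRY = "retry"  # transient: wait, then try again
    FAIL = "fail"    # permanent: stop and alert


def classify_status(status_code):
    """Map an HTTP status code to what the pipeline should do next."""
    if 200 <= status_code < 300:
        return Action.OK
    # Assumption: 429 and all 5xx responses are worth retrying
    if status_code == 429 or status_code >= 500:
        return Action.RETRY
    # Everything else (401, 403, 404, ...) will not fix itself on retry
    return Action.FAIL
```

Routing every response through one function like this makes the "silently skip" failure mode harder to write by accident, because each status code has to map to an explicit action.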

Reading Rate Limit Headers
Most APIs return rate limit state in response headers. The requests library exposes these through response.headers.get(). The most common header patterns are:
import requests
response = requests.get("https://api.example.com/data", headers={"Authorization": "Bearer TOKEN"})
# Common rate limit headers
retry_after = response.headers.get("Retry-After") # seconds to wait
x_ratelimit_remaining = response.headers.get("X-RateLimit-Remaining")
x_ratelimit_reset = response.headers.get("X-RateLimit-Reset") # Unix timestamp
x_ratelimit_limit = response.headers.get("X-RateLimit-Limit") # total limit
The Retry-After header is the most directly useful. When present on a 429 response, it tells you exactly how long to wait. Some APIs use an integer (seconds), some use an HTTP date string. Parse both:
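import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def get_retry_after(response):
    """Return the Retry-After wait in seconds, or None if absent or unparseable."""
    retry_after = response.headers.get("Retry-After")
    if not retry_after:
        return None
    try:
        # Numeric form: seconds to wait
        return max(0.0, float(retry_after))
    except ValueError:
        pass
    try:
        # HTTP date format: "Wed, 21 Oct 2015 07:28:00 GMT"
        reset_dt = parsedate_to_datetime(retry_after)
        now = datetime.now(timezone.utc)
        return max(0.0, (reset_dt - now).total_seconds())
    except (TypeError, ValueError):
        # Malformed or timezone-less date: let the caller fall back to backoff
        return None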
Not all APIs provide Retry-After. When it is absent, you need to derive a wait time from other headers or fall back to exponential backoff.
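One way to derive a wait time from the other headers is to read X-RateLimit-Reset. The sketch below assumes the header carries a Unix timestamp in seconds, as in the header listing above; some APIs send "seconds remaining" instead, so check the provider's documentation first. The helper name `wait_from_reset_header` and its `now` parameter are hypothetical, added to make the logic easy to test.

```python
import time


def wait_from_reset_header(headers, now=None):
    """Return seconds until X-RateLimit-Reset, or None if it cannot be used."""
    raw = headers.get("X-RateLimit-Reset")
    if raw is None:
        return None
    try:
        reset_at = float(raw)  # assumption: Unix timestamp in seconds
    except ValueError:
        return None
    if now is None:
        now = time.time()
    # Never return a negative wait if the reset time has already passed
    return max(0.0, reset_at - now)
```

When this returns None, exponential backoff remains the fallback.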
The Problem with Fixed Retry Delays
A fixed retry delay -- sleep for 5 seconds, then retry -- is better than no delay, but it fails in two ways.
First, it is often the wrong duration. A 5-second sleep when the API resets every 60 seconds means you will hit the limit again immediately after retrying. A 60-second sleep when you only needed to wait 3 seconds wastes throughput.
Second, fixed delays do not degrade gracefully under sustained load. If your pipeline generates 100 requests per minute against an API that allows 10 requests per minute, a fixed retry delay just creates a queue of retries, each of which will also be throttled. The pipeline becomes a burst-and-stall pattern that is both slow and hard on the API server.
Exponential backoff solves both problems by increasing wait time with each successive failure, reducing retry frequency under sustained throttling while recovering quickly when the throttle is temporary.
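To make the growth concrete, the delay sequence can be computed directly. This is a small arithmetic sketch; `backoff_schedule` is a hypothetical helper, and the base delay of 1 second and cap of 60 seconds are illustrative values, not recommendations for any particular API.

```python
def backoff_schedule(max_retries=5, base_delay=1.0, max_delay=60.0):
    """Return the wait before each retry: base_delay * 2^attempt, capped at max_delay."""
    return [min(base_delay * (2 ** attempt), max_delay) for attempt in range(max_retries)]


print(backoff_schedule())               # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_schedule(max_retries=8))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The first retries happen quickly, which suits a brief throttle, while five consecutive failures already add up to 31 seconds of waiting, which eases pressure on an endpoint that stays throttled.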
Implementing Exponential Backoff
A minimal exponential backoff implementation:
import time
import random
import requests


def fetch_with_backoff(url, headers, max_retries=5, base_delay=1.0, max_delay=60.0):
    for attempt in range(max_retries):
        # A timeout keeps a hung connection from stalling the pipeline
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 200:
            return response
        if response.status_code == 429:
            # Prefer the server's own wait hint when it sends one
            retry_after = get_retry_after(response)
            if retry_after:
                wait = retry_after
            else:
                # Exponential backoff with jitter
                wait = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
            print(f"Rate limited. Waiting {wait:.1f}s before retry {attempt + 1}/{max_retries}")
            time.sleep(wait)
            continue
        if response.status_code >= 500:
            # Server error: retry with backoff
            wait = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(wait)
            continue
        # 4xx other than 429: do not retry
        response.raise_for_status()
        # Any other non-error status (for example 204) is returned as-is
        return response
    raise RuntimeError(f"Max retries exceeded for {url}")
The jitter term (random.uniform(0, 1)) is important. Without it, multiple parallel workers that all hit the rate limit at the same time will all retry at the same moment, creating a thundering herd that immediately triggers the limit again.
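The effect is easy to see by computing the delay for several workers that fail on the same attempt. This is an illustrative sketch: `retry_delay` is a hypothetical helper that mirrors the formula used above, and the per-worker seeded generators exist only to make the demonstration repeatable.

```python
import random


def retry_delay(attempt, base_delay=1.0, max_delay=60.0, jitter=True, rng=random):
    """Exponential delay for a given attempt, optionally with up to 1s of jitter."""
    delay = base_delay * (2 ** attempt)
    if jitter:
        delay += rng.uniform(0, 1)
    return min(delay, max_delay)


workers = 4
# Without jitter, every worker computes the identical delay and retries in lockstep
no_jitter = [retry_delay(2, jitter=False) for _ in range(workers)]
# With jitter, each worker lands at a different point in the following second
with_jitter = [retry_delay(2, rng=random.Random(seed)) for seed in range(workers)]

print(no_jitter)
print(with_jitter)
```

A jitter window of one second is small relative to later delays such as 16 or 32 seconds. If many workers share one API key, a wider window that scales with the delay may spread retries better, at the cost of longer average waits.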

Production-Grade Retry Logic with Tenacity
For production pipelines, the tenacity library provides a cleaner, more configurable retry decorator than hand-rolled backoff:
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential_jitter,
    retry_if_exception_type,
    before_sleep_log,
)
import logging
import requests

logger = logging.getLogger(__name__)


class RateLimitError(Exception):
    def __init__(self, message, retry_after=None):
        super().__init__(message)
        # Seconds the server asked us to wait, or None if it did not say
        self.retry_after = retry_after


class ServerError(Exception):
    pass


def check_response(response):
    if response.status_code == 429:
        # get_retry_after is the header parser defined earlier in this guide
        retry_after = get_retry_after(response)
        raise RateLimitError(
            f"Rate limited: {response.headers.get('Retry-After', 'unknown wait')}",
            retry_after=retry_after,
        )
    if response.status_code >= 500:
        raise ServerError(f"Server error {response.status_code}")
    response.raise_for_status()
    return response


@retry(
    retry=retry_if_exception_type((RateLimitError, ServerError)),
    wait=wait_exponential_jitter(initial=1, max=60),
    stop=stop_after_attempt(6),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def fetch_api_data(url, headers):
    response = requests.get(url, headers=headers, timeout=30)
    return check_response(response)
Tenacity handles the backoff math, logs each retry attempt, and raises a RetryError if all attempts are exhausted. The before_sleep_log parameter writes a warning to your log system before each sleep, which makes it trivial to monitor retry patterns in production.
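For APIs that return Retry-After, you can implement a custom wait strategy and pass it to the decorator as wait=WaitRetryAfter(). It relies on the raised exception exposing a retry_after attribute in seconds; without that attribute it always falls back to exponential backoff:
from tenacity.wait import wait_base


class WaitRetryAfter(wait_base):
    """Wait for the server-provided Retry-After when the exception carries one."""

    def __call__(self, retry_state):
        exc = retry_state.outcome.exception()
        if hasattr(exc, "retry_after") and exc.retry_after:
            return float(exc.retry_after)
        # Fall back to exponential backoff, capped at 60 seconds
        return min(2 ** retry_state.attempt_number, 60)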
Token Bucket for Proactive Rate Control
Exponential backoff is reactive: it responds after hitting the limit. A token bucket implementation is proactive: it throttles your own request rate to stay below the limit, reducing the number of 429 responses you encounter in the first place.
import threading
import time

import requests


class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate  # tokens added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self._lock = threading.Lock()

    def acquire(self, tokens=1):
        """Reserve tokens and return how many seconds the caller should sleep."""
        with self._lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            # Always deduct, letting the balance go negative. A caller that is
            # told to wait has then already paid for its request, so the tokens
            # refilled during its sleep are not handed out a second time.
            self.tokens -= tokens
            if self.tokens >= 0:
                return 0.0  # no wait needed
            return -self.tokens / self.rate


def throttled_fetch(bucket, url, headers):
    wait = bucket.acquire()
    if wait > 0:
        time.sleep(wait)
    return requests.get(url, headers=headers, timeout=30)
For an API that allows 10 requests per second, you would initialize:
bucket = TokenBucket(rate=10, capacity=10)
A per-endpoint bucket lets you independently control rate for different API sections that have different limits, which is common in APIs like Salesforce, HubSpot, or Stripe where bulk endpoints have different limits than record-level endpoints.
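A per-endpoint setup can be sketched as a dictionary of buckets keyed by endpoint group. Everything named here is hypothetical: the `EndpointBucket` class is a compact token bucket variant that lets its balance go negative to track owed wait time, the `ENDPOINT_LIMITS` values are placeholders rather than the real limits of any vendor, and the injectable `clock` exists only so the behavior can be checked without sleeping.

```python
import threading
import time


class EndpointBucket:
    """Small token bucket; `clock` is injectable so it can be tested."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last_refill = clock()
        self._lock = threading.Lock()

    def acquire(self):
        """Reserve one token; return seconds the caller should sleep."""
        with self._lock:
            now = self.clock()
            refill = (now - self.last_refill) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
            self.last_refill = now
            self.tokens -= 1  # may go negative: the caller owes a wait
            return max(0.0, -self.tokens / self.rate)


# Placeholder limits: (requests per second, burst capacity) per endpoint group
ENDPOINT_LIMITS = {
    "bulk": (1, 2),
    "records": (10, 10),
}


def build_buckets(limits, clock=time.monotonic):
    """Create one independent bucket per endpoint group."""
    return {name: EndpointBucket(rate, cap, clock) for name, (rate, cap) in limits.items()}
```

A caller would look up the bucket for the endpoint it is about to hit, for example `time.sleep(buckets["bulk"].acquire())`, so that exhausting the bulk quota never slows record-level requests.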

Combining the Patterns
In a production data pipeline, the patterns work together:
import time

import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type

# TokenBucket, RateLimitError, ServerError, and check_response come from the earlier sections
bucket = TokenBucket(rate=5, capacity=10)  # Stay well under limit


@retry(
    retry=retry_if_exception_type((RateLimitError, ServerError)),
    wait=wait_exponential_jitter(initial=2, max=120),
    stop=stop_after_attempt(5),
)
def fetch_page(session, url, params=None):
    wait = bucket.acquire()
    if wait > 0:
        time.sleep(wait)
    response = session.get(url, params=params, timeout=30)
    return check_response(response)


def paginate_api(base_url, headers, params=None):
    results = []
    session = requests.Session()
    session.headers.update(headers)
    page = 1
    while True:
        # Merge caller-supplied query parameters with the current page number
        page_params = {**(params or {}), "page": page}
        response = fetch_page(session, base_url, params=page_params)
        data = response.json()
        if not data.get("items"):
            break
        results.extend(data["items"])
        page += 1
    return results
The token bucket prevents most 429 responses proactively. The tenacity decorator handles the ones that slip through. The session reuses the TCP connection across requests, reducing overhead.
What Changes in Production
Dennis Traina, founder of 137Foundry, notes: "Most rate limit issues we see in client pipelines are not about backoff logic -- it is that no one has documented which endpoints have which limits, so when a limit changes silently, no one knows what broke or why. The fix starts with treating rate limit headers as first-class instrumentation, not an edge case."
In practice, that means:
- Logging the current X-RateLimit-Remaining on every response, not just on 429s
- Setting alerts when remaining drops below 20% of the limit
- Storing rate limit header values alongside your data for audit purposes
- Testing your retry logic against a mock API that returns 429 responses deliberately
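The first two items in that list can be sketched as a single check that runs on every response. This is a minimal illustration under stated assumptions: the `check_headroom` name and the "ratelimit" logger name are hypothetical, the API is assumed to send both X-RateLimit-Remaining and X-RateLimit-Limit as integers, and the returned boolean stands in for whatever alerting hook your pipeline uses.

```python
import logging

logger = logging.getLogger("ratelimit")


def check_headroom(headers, threshold=0.2):
    """Log remaining quota; return True when it falls below `threshold` of the limit."""
    try:
        remaining = int(headers["X-RateLimit-Remaining"])
        limit = int(headers["X-RateLimit-Limit"])
    except (KeyError, ValueError):
        return False  # headers missing or not numeric; nothing to report
    # Logged on every response, not only on 429s
    logger.info("rate limit remaining=%d limit=%d", remaining, limit)
    if limit > 0 and remaining / limit < threshold:
        logger.warning("rate limit headroom low: %d of %d left", remaining, limit)
        return True
    return False
```

Calling `check_headroom(response.headers)` after each request gives a steady record of quota usage, so a silent limit change shows up as a shift in the logged values before it shows up as missing data.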
One underused practice is version-pinning your rate limit assumptions. Document the limits observed for each endpoint -- in a config file or code comment -- along with the date. When a pipeline starts failing unexpectedly, the first diagnostic question is whether the API silently changed its limits. Having a documented baseline makes that diagnosis fast rather than speculative, and it creates a natural record of how your API usage has evolved over time.
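A documented baseline can also be checked automatically against what the API reports. The sketch below is one possible shape, not a prescribed format: the endpoint paths, limits, and dates in `DOCUMENTED_LIMITS` are invented placeholders, `limit_drift` is a hypothetical helper, and it assumes the API exposes its current limit in X-RateLimit-Limit.

```python
# Placeholder baseline: endpoint -> documented limit and the date it was observed
DOCUMENTED_LIMITS = {
    "/v1/records": {"limit": 600, "observed": "2024-01-15"},
    "/v1/bulk": {"limit": 60, "observed": "2024-01-15"},
}


def limit_drift(endpoint, headers, baseline=DOCUMENTED_LIMITS):
    """Return (documented, observed) if the API reports a different limit, else None."""
    entry = baseline.get(endpoint)
    raw = headers.get("X-RateLimit-Limit")
    if entry is None or raw is None:
        return None  # nothing documented, or the API does not report a limit
    try:
        observed = int(raw)
    except ValueError:
        return None
    if observed != entry["limit"]:
        return (entry["limit"], observed)
    return None
```

A non-None result is a direct answer to the first diagnostic question: the limit the pipeline was built around is no longer the limit the API enforces.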
Rate limit handling is not a feature you add once and forget. API limits change, new endpoints get different limits, and your data volume grows in ways that change your request patterns. Treating rate limit resilience as an operational concern rather than a coding exercise is what separates brittle pipelines from ones that run unattended for months. Schedule time to review your rate limit configuration alongside regular infrastructure reviews as data volume and API dependencies evolve.
To learn more about how 137Foundry builds rate-limit-resilient data automation into production pipelines, visit our data automation services page.
For more technical guides on data automation and pipeline reliability, browse the 137Foundry articles archive.