Most data syncing projects start as a weekend task and turn into a months-long maintenance commitment. A developer writes a script to move data from the CRM into the analytics database, it works, and then six months later the CRM adds a new required field, the script starts silently dropping records, and nobody notices for three weeks.
The problem is almost never the happy path. Extract, Transform, Load is straightforward when everything works. The work is in handling the cases where it doesn't: rate limits, schema changes, duplicate records, partial failures, and data that arrives out of order. This guide covers how to design an ETL pipeline that handles these cases without requiring full rewrites every time something changes.
What ETL Actually Means in a Business Context
Extract, Transform, Load describes the movement of data from one or more sources into a destination where it can be used for reporting, operations, or downstream automation.
In business practice, this looks like:
- Syncing customer records from your CRM (Salesforce, HubSpot) into a data warehouse for analytics
- Moving order data from an e-commerce platform into accounting software
- Aggregating data from multiple SaaS tools into a single operational database
- Replicating data between environments for staging and testing
The source systems didn't design their data models to match each other. The destination system expects specific formats. The transform step is where you reconcile the gap between what exists and what you need.

Extracting Data: Know Your Source Constraints
The extraction step is the one most developers underestimate. Pull-based extraction via API is the most common pattern - you send requests to the source system's API and receive data in response. The constraints:
Rate limits. Most SaaS APIs enforce request limits per minute, hour, or day. Naively pulling records in a tight loop will hit these limits, trigger backoff logic, and slow your pipeline. Design for rate limits from the start: use bulk endpoints where they exist, batch requests, and implement exponential backoff.
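The backoff pattern can be sketched as follows. This is a minimal illustration, not a specific client library: `RateLimitError` and the `fetch` callable are placeholder names you would adapt to whatever HTTP client you use.

```python
import random
import time

class RateLimitError(Exception):
    """Raised by the caller's fetch function when the API signals a rate limit (e.g. HTTP 429)."""

def request_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call fetch(); on a rate-limit error, wait and retry with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimitError:
            # Delays grow 1s, 2s, 4s, ... plus jitter so parallel
            # workers don't all retry at the same instant.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay / 2))
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```

The jitter matters more than it looks: without it, several workers that hit the limit together will retry together and hit it again.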
Pagination. APIs paginate results. A common bug: a developer handles the first page correctly but doesn't loop through all pages. Records get silently dropped because they were on page 2 and the code never asked for page 2. Always implement full pagination - and test it with a large dataset, not a three-record test environment.
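A cursor-based pagination loop can be written once and reused. This sketch assumes a `fetch_page(cursor)` callable that returns `(records, next_cursor)`, with `next_cursor` being `None` on the last page; real APIs vary (offset, page number, link headers), but the shape is the same.

```python
def fetch_all_pages(fetch_page):
    """Iterate every page of a paginated API, yielding individual records.

    fetch_page(cursor) must return (records, next_cursor);
    next_cursor is None when the last page has been reached.
    """
    cursor = None
    while True:
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break
```

Because this is a generator, downstream code can stream records without holding every page in memory.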
Incremental vs full extraction. Full extraction pulls all records every run. This works for small datasets, becomes expensive for large ones, and often triggers API rate limits. Incremental extraction pulls only records modified since the last successful run. Most modern APIs support filtering by updated_at or an equivalent field. Use it. Store the high-water mark of the last successful run and resume from there.
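Storing the high-water mark can be as simple as a small state file. A minimal sketch, assuming the source supports an `updated_at` filter and that writing the mark only after a successful run is acceptable (a failed run then safely re-extracts the same window):

```python
import json
from pathlib import Path

def load_high_water_mark(state_file, default="1970-01-01T00:00:00Z"):
    """Return the updated_at timestamp of the last successful run, or a default."""
    path = Path(state_file)
    if path.exists():
        return json.loads(path.read_text())["last_updated_at"]
    return default

def save_high_water_mark(state_file, timestamp):
    # Written only after the load succeeds, so a failed run
    # re-extracts the same records rather than skipping them.
    Path(state_file).write_text(json.dumps({"last_updated_at": timestamp}))
```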
Schema discovery. APIs add fields, deprecate fields, and change field types. A pipeline written against an API schema documented six months ago may encounter fields that now exist and fields that have moved. Pull schema documentation programmatically where the API supports it, or design your extraction to handle unexpected keys gracefully rather than failing on them.
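One way to tolerate schema drift is to extract only the fields the pipeline knows about and report anything unexpected instead of crashing. A sketch (the field names and callback are illustrative):

```python
def extract_known_fields(record, expected_fields, on_unexpected=None):
    """Keep only known fields; surface unknown ones via a callback rather than failing."""
    known = {k: record.get(k) for k in expected_fields}
    unexpected = set(record) - set(expected_fields)
    if unexpected and on_unexpected:
        # e.g. log or count these so schema drift is visible, not silent
        on_unexpected(unexpected)
    return known
```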
Transforming Data: Where Most Bugs Live
The transform step converts source data into the format your destination expects. This is where the most maintenance-heavy code lives, because business logic is inherently unstable.
Mapping fields. A source system calls a field contact_email. Your destination expects email_address. This sounds trivial and it is - once. When the source system adds secondary_email and your destination needs both, or when a field gets renamed in an API update, field mapping tables become load-bearing code. Keep mappings in configuration, not hardcoded in logic, so they can be updated without touching code.
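Keeping the mapping as data rather than logic might look like this. The dict here is inline for brevity, but in practice it would be loaded from a version-controlled JSON or YAML file; the field names are examples:

```python
# Mapping lives in configuration, not in transformation logic.
FIELD_MAP = {
    "contact_email": "email_address",
    "secondary_email": "email_address_2",
}

def apply_field_map(record, field_map):
    """Rename source fields to destination fields; ignore anything unmapped."""
    return {dest: record[src] for src, dest in field_map.items() if src in record}
```

Adding `secondary_email` later becomes a one-line config change instead of a code change.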
Data type normalization. Dates are the canonical example. Source system A returns 2026-04-28T14:30:00Z. Source system B returns 04/28/2026. Both need to become the same format in your destination. Write normalization functions that are explicit about their input assumptions and log cases where the input doesn't match.
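A normalization function along those lines, explicit about which input formats it accepts and loud about anything else (the format list is an assumption you would extend per source):

```python
from datetime import datetime, timezone

# The only input formats this pipeline accepts; anything else is an error, not a guess.
KNOWN_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%m/%d/%Y"]

def normalize_date(value, formats=KNOWN_FORMATS):
    """Convert a known date string into canonical UTC ISO 8601, or raise."""
    for fmt in formats:
        try:
            dt = datetime.strptime(value, fmt)
            if dt.tzinfo is None:
                # Naive dates are assumed UTC; make that assumption explicit here.
                dt = dt.replace(tzinfo=timezone.utc)
            return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")
```

Raising on unknown formats (and logging the raw value) is the point: a date silently parsed wrong is far harder to find than one that fails loudly.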
Deduplication. If you're pulling from multiple sources that contain overlapping records, or if your incremental logic has any overlap between runs, duplicates are a structural risk. Implement deduplication based on a stable identifier (customer ID, order number, email address) rather than assuming each record is unique. Idempotency at the load step is easier to guarantee when deduplication happens before the data reaches the destination.
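A first-seen-wins deduplication pass over a batch might look like this; the `key` callable is where you plug in whatever stable identifier your data has:

```python
def deduplicate(records, key):
    """Keep the first record seen for each stable identifier, preserving order."""
    seen = set()
    out = []
    for record in records:
        identifier = key(record)
        if identifier not in seen:
            seen.add(identifier)
            out.append(record)
    return out
```

Whether first-seen or last-seen wins is a business decision; the important part is that the rule is explicit and tested.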

Loading Data: Idempotency Before Speed
The load step inserts or updates records in the destination system. The most important design property here is idempotency: running the same load operation twice should produce the same result as running it once.
This matters because pipelines fail. Network timeouts, destination API rate limits, and deployment interruptions mean your pipeline will sometimes write partial results and need to be re-run from a checkpoint. If your load step is not idempotent, a re-run may create duplicate records or overwrite data that was correct.
The standard pattern for idempotent loads:
- Use upsert semantics (insert if not exists, update if exists) based on a stable unique key
- Write the successful checkpoint after confirming the load succeeded, not before
- Test re-runs explicitly: run the pipeline twice on the same data and verify the destination has the same count and values as after the first run
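The upsert pattern in SQL terms, sketched against SQLite for portability (the table and columns are examples; the same `ON CONFLICT ... DO UPDATE` shape works in PostgreSQL):

```python
import sqlite3

def upsert_customers(conn, records):
    """Idempotent load: applying the same batch twice yields the same table state."""
    conn.executemany(
        """
        INSERT INTO customers (customer_id, email)
        VALUES (:customer_id, :email)
        ON CONFLICT(customer_id) DO UPDATE SET email = excluded.email
        """,
        records,
    )
    conn.commit()
```

The re-run test from the list above is then a two-line assertion: load the batch twice and verify the row count and values match the single-load result.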
"The data automation projects that cause the most problems six months post-launch are the ones that were never designed to fail gracefully. A pipeline that works on the happy path and silently drops records on partial failures is worse than having no pipeline at all - you end up with analytics you can't trust." - Dennis Traina, founder of 137Foundry
Error Handling That Surfaces Problems
Silent failures are worse than loud ones. A pipeline that crashes with a clear error message is recoverable. A pipeline that runs to completion but writes the wrong data is a data quality problem that may not surface for days.
Design your error handling to distinguish between:
Transient errors - rate limit hits, temporary network failures, destination unavailable. These should trigger retry with backoff. Most HTTP client libraries have built-in retry logic; use it.
Structural errors - source data that doesn't match the expected schema, field values outside expected ranges, missing required fields. Log these with enough context to reproduce them (the source record, the field, the value). Don't skip the record silently; write it to a dead-letter log so it can be reviewed and reprocessed.
Destination errors - write failures, constraint violations, duplicate key errors. These often indicate either a data quality problem upstream or an idempotency bug. Log the specific error and the record being loaded.
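The dead-letter pattern for structural errors can be sketched like this. The exception types caught and the JSONL file format are illustrative choices, not a prescription:

```python
import json
import logging

logger = logging.getLogger("pipeline")

def process_record(record, transform, dead_letter_path="dead_letter.jsonl"):
    """Transform one record; on a structural error, dead-letter it instead of dropping it."""
    try:
        return transform(record)
    except (KeyError, ValueError) as exc:
        # Keep the full record and the error so the failure is reproducible later.
        with open(dead_letter_path, "a") as f:
            f.write(json.dumps({"record": record, "error": str(exc)}) + "\n")
        logger.warning("structural error, dead-lettered record: %s", exc)
        return None
```

Transient errors would be handled separately with retry logic; only errors that retrying cannot fix belong in the dead-letter log.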
Alert on errors, not on completions. If your pipeline runs successfully every night, you don't need a notification. If it fails or encounters more than N structural errors in a run, you want to know immediately.
Monitoring That Tells You Something Useful
Basic monitoring for a data pipeline:
Record counts. Track how many records were extracted, how many passed transformation, and how many were successfully loaded each run. Alert when the count drops significantly from the previous run - a 90% drop in extracted records usually means an API issue or pagination bug.
Latency. Track how long each step takes. A transformation step that takes three times as long as usual often indicates unexpected data volume or a logic change in the source system that changed the data shape.
Error rate by error type. Track transient errors (should stay low if your backoff logic is working) and structural errors (should be zero or near-zero in a stable pipeline).
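The record-count alert described above reduces to a small comparison against the previous run's count. A sketch, with the 50% threshold as an arbitrary illustrative default:

```python
def should_alert(current_count, previous_count, drop_threshold=0.5):
    """True when the record count dropped sharply versus the previous run."""
    if previous_count == 0:
        # No baseline (first run, or previous run extracted nothing): don't alert.
        return False
    return (previous_count - current_count) / previous_count >= drop_threshold
```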
These three metrics can be implemented with basic logging and a time-series store. Apache's project ecosystem includes Airflow for pipeline orchestration and Kafka for event-driven architectures if you're building at scale. For simpler business data syncing, a straightforward scripted pipeline with structured logging covers most needs.

When to Use a Managed Automation Platform
Building a custom ETL pipeline is the right choice when your data requirements are complex enough that generic tools can't handle them, or when you need fine-grained control over transformation logic.
For many business data syncing needs, a managed platform like Zapier handles the infrastructure problems (rate limits, retries, scheduling) and lets you focus on the business logic. The tradeoff is cost at volume and limited flexibility for complex transformations.
The decision framework: if your pipeline is moving data between well-supported SaaS tools with minimal transformation, a managed platform is almost certainly faster and cheaper than custom code. If you have complex business logic, custom source systems, or data volume that makes managed platforms expensive, a custom pipeline using Python or Node.js against a database like PostgreSQL gives you the control you need.
137Foundry (https://137foundry.com) works with businesses on both approaches - evaluating when custom pipelines make sense and building them with the operational characteristics (idempotency, error handling, monitoring) that make them maintainable long-term. Its data integration service covers the full range from architecture decisions to implementation.
Designing for Change
The most important quality of a business data pipeline is not performance - it's the ability to change. Source APIs add fields. Destination schemas evolve. Business rules change. A pipeline designed to be changed is worth more than a pipeline designed to be fast.
Practical principles:
- Keep transformation logic in configuration where possible, not in code
- Use version-controlled schema definitions for source and destination field mappings
- Write tests against real sample data, not synthetic test cases
- Document the business rules encoded in the transform step - these are the hardest to reconstruct later
For help designing data automation architecture that covers these considerations, the AI automation and data integration services at 137foundry.com/services/ai-automation are a starting point for understanding what a properly designed pipeline looks like in your specific context.
The happy path is easy. Build for the cases where it isn't.
