How to Implement Change Data Capture Without Polling

You have a production database. You want every change in it (inserts, updates, deletes) reflected somewhere else: an analytics warehouse, a search index, a downstream service, a cache. The naive answer is to poll the database every minute, ask "what changed?", and copy the deltas across. That works until it does not.

Polling fails at scale for predictable reasons. The query gets expensive as the table grows. The poll interval forces a tradeoff between latency and load. Intermediate changes that happen inside a single poll window get lost, because the poll only ever sees the latest state of each row. The polling job becomes the most fragile piece of your data infrastructure, and you only find out when it has been silently dropping changes for a week.

Change data capture (CDC) is the umbrella term for techniques that move the work away from polling. There are three main patterns, each with different cost, latency, and operational profiles. Picking the right one for your situation is most of the work.

What you actually want from CDC

Before picking a pattern, get clear on the requirements. Most CDC mistakes are made by teams who jumped to a tool before they specified the problem.

The three dimensions that matter:

Latency. How fresh does the downstream consumer need the data? Sub-second for fraud detection. A few minutes for an analytics dashboard. Hourly for nightly batch jobs.
Throughput. How many writes per second is the source database handling at peak? Hundreds, thousands, millions?
Completeness. Do you need every individual change, or is a final-state snapshot enough? For audit logs, every change matters. For search indexing, the final state is usually enough.

These three drive the pattern choice. A low-latency, high-throughput, complete-history requirement points at log-based CDC. A latency-tolerant, low-throughput, final-state-only requirement can survive with timestamp polling for a long time. The middle ground is where the design work lives.

Worth reading the Wikipedia entry on Change Data Capture once for the historical context. The patterns below are concrete implementations of the ideas described there.

Pattern 1: log-based CDC

The cleanest pattern available. Most production databases keep a transaction log (PostgreSQL has the write-ahead log, MySQL has the binlog, SQL Server has the transaction log). The log records every committed change, in order, before the change is applied to the data files. Log-based CDC consumes that log directly.

The advantages are significant. You see every change including deletes (which timestamp polling cannot reliably catch). The load on the source database is minimal because you are reading the log, not running queries. Latency is bounded by how often the log is flushed, which on a busy database is roughly real-time.

The cost is operational complexity. You need replication-level access to the database (which has security implications), a tool that knows how to parse the specific database's log format, and a downstream consumer that can keep up with the log rate. Lose connectivity to the log for too long and the database may reclaim the log space, at which point you have a gap you cannot fill without a full reload.

A widely used open-source tool for this pattern is Debezium, which reads logs from PostgreSQL, MySQL, SQL Server, MongoDB, and a few others and emits changes to Kafka. If you are already running Apache Kafka for other reasons, Debezium fits cleanly. If you are not, the operational overhead of standing up Kafka just for CDC is real and worth measuring against the alternatives.
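To make the consumer side concrete, here is a minimal sketch of applying a stream of change events to a downstream copy. The event shape follows the Debezium-style envelope (an "op" code plus "before" and "after" row images); the in-memory replica, the "id" key column, and the sample events are assumptions made for illustration, not part of any real pipeline.

```python
from typing import Any

# Debezium-style envelope: "op" is "c" (create), "u" (update), "d" (delete)
# or "r" (snapshot read). Other op codes exist; this sketch handles these four.
ChangeEvent = dict[str, Any]


def apply_change(replica: dict[int, dict], event: ChangeEvent) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "u", "r"):
        row = event["after"]
        replica[row["id"]] = row  # upsert, so replaying an event is harmless
    elif op == "d":
        replica.pop(event["before"]["id"], None)  # tolerate repeated deletes
    else:
        raise ValueError(f"unhandled op: {op!r}")


# Hypothetical events, in the order the log produced them.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "b@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "c@example.com"}},
    {"op": "d", "before": {"id": 2, "email": "c@example.com"}, "after": None},
]

replica: dict[int, dict] = {}
for event in events:
    apply_change(replica, event)
```

Writing the apply step as an upsert plus a tolerant delete means the consumer can safely reprocess events after a reconnect, which matters because most log-based pipelines deliver at least once rather than exactly once.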

Pattern 2: trigger-based CDC

If you cannot get replication-level access to the source database, but you can write to its schema, trigger-based CDC is the next-best pattern. You install database triggers that fire on every insert, update, and delete, and the trigger writes a row to an audit table (often called a "change log" or "outbox" table). A separate process reads the audit table on whatever cadence you want and ships the changes downstream.

The strengths: you do not need replication access, the audit table is queryable by anything that speaks SQL, and the trigger guarantees you see every change atomically with the original write (because it runs inside the same transaction).
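The mechanism and the atomicity guarantee can be sketched with SQLite from Python's standard library. The accounts table, the accounts_changes audit table, and the one-letter op codes are hypothetical names chosen for this example; trigger syntax differs on PostgreSQL, MySQL, and SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);

-- Audit table: one row per change, in the order the changes happened.
CREATE TABLE accounts_changes (
    change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    op         TEXT NOT NULL,
    account_id INTEGER NOT NULL,
    balance    INTEGER
);

CREATE TRIGGER accounts_ins AFTER INSERT ON accounts BEGIN
    INSERT INTO accounts_changes (op, account_id, balance)
    VALUES ('I', NEW.id, NEW.balance);
END;
CREATE TRIGGER accounts_upd AFTER UPDATE ON accounts BEGIN
    INSERT INTO accounts_changes (op, account_id, balance)
    VALUES ('U', NEW.id, NEW.balance);
END;
CREATE TRIGGER accounts_del AFTER DELETE ON accounts BEGIN
    INSERT INTO accounts_changes (op, account_id, balance)
    VALUES ('D', OLD.id, NULL);
END;
""")

# A committed transaction: both changes land in the audit table.
with conn:
    conn.execute("INSERT INTO accounts VALUES (1, 100)")
    conn.execute("UPDATE accounts SET balance = 80 WHERE id = 1")

# A failed transaction: the insert and its audit row roll back together.
try:
    with conn:
        conn.execute("INSERT INTO accounts VALUES (2, 50)")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

# Deletes are captured too, using the OLD row image.
with conn:
    conn.execute("DELETE FROM accounts WHERE id = 1")

changes = conn.execute(
    "SELECT op, account_id, balance FROM accounts_changes ORDER BY change_id"
).fetchall()
```

The rolled-back insert leaves no trace in accounts_changes, which is the property that makes the audit table trustworthy: it can never disagree with what was actually committed.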

The costs: triggers add write overhead to every transaction on the source table. On a high-throughput workload this is measurable, sometimes meaningfully so. The audit table grows fast and needs its own retention strategy. If the trigger logic is wrong (and it often is at first), you can introduce subtle bugs into your source database that take weeks to diagnose.

Reference reading: database triggers on Wikipedia covers the basics. The PostgreSQL documentation at postgresql.org has detailed reference material on trigger semantics if you are working on that database specifically.

Pattern 3: timestamp-based polling

The pattern most teams start with, and the pattern most teams keep using longer than they should. You add an updated_at column to every source table, index it, and poll periodically for rows where updated_at > last_seen.

The advantages are real for the right situation. Zero infrastructure beyond a cron job and the existing database. No special access required. Easy to reason about, easy to debug, easy to recover from failure (just rewind the last_seen cursor to an earlier value and re-run).

The cost is what it cannot do. Deletes are invisible (the row is gone, there is no updated_at to scan). Intermediate states within a single poll window are lost (a row that was updated twice between polls shows up once, with only its final value, so an update that was immediately reverted is indistinguishable from a no-op write). High-throughput workloads make the polling query expensive even with an index, because you are scanning a fast-growing range every poll. Latency is bounded below by the poll interval, and shrinking the interval drives up source-database load.
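A small sketch makes both the mechanism and the blind spots visible. It uses SQLite and integer logical timestamps so the behavior is deterministic; the products table and the poll helper are hypothetical names for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    id         INTEGER PRIMARY KEY,
    price      INTEGER NOT NULL,
    updated_at INTEGER NOT NULL  -- logical clock for the sketch, not wall time
);
CREATE INDEX products_updated_at ON products (updated_at);
""")


def poll(conn, last_seen):
    """Return rows changed since last_seen, plus the cursor for the next poll."""
    rows = conn.execute(
        "SELECT id, price, updated_at FROM products "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # Caveat: with a strict '>' comparison, a row that commits later with the
    # same timestamp as the cursor would be skipped.
    new_cursor = rows[-1][2] if rows else last_seen
    return rows, new_cursor


conn.execute("INSERT INTO products VALUES (1, 10, 1)")
conn.execute("INSERT INTO products VALUES (2, 20, 2)")
first_rows, cursor = poll(conn, 0)

# Between polls: two updates to product 1 and a hard delete of product 2.
conn.execute("UPDATE products SET price = 15, updated_at = 3 WHERE id = 1")
conn.execute("UPDATE products SET price = 12, updated_at = 4 WHERE id = 1")
conn.execute("DELETE FROM products WHERE id = 2")
second_rows, cursor = poll(conn, cursor)
```

The second poll returns a single row for product 1 at its final price. The intermediate price of 15 was never observed, and nothing in the result says that product 2 no longer exists.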

A common workaround for the delete problem is to add an is_deleted flag instead of actually deleting rows (soft-delete pattern). This works but it pushes the cleanup problem somewhere else and complicates every downstream consumer. Use this pattern when you control the source schema and the downstream consumers tolerate the workaround.
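Here is a minimal sketch of the soft-delete workaround, again on SQLite with a logical clock. The products table, the soft_delete and sync helpers, and the dict standing in for a search index are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE products (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    is_deleted INTEGER NOT NULL DEFAULT 0,
    updated_at INTEGER NOT NULL
)
""")


def soft_delete(conn, product_id, now):
    """Mark the row deleted and bump updated_at so the next poll sees it."""
    conn.execute(
        "UPDATE products SET is_deleted = 1, updated_at = ? WHERE id = ?",
        (now, product_id),
    )


def sync(conn, index, last_seen):
    """Apply changes since last_seen to a downstream index; return new cursor."""
    rows = conn.execute(
        "SELECT id, name, is_deleted, updated_at FROM products "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    for product_id, name, is_deleted, updated_at in rows:
        if is_deleted:
            index.pop(product_id, None)  # the flag acts as a tombstone
        else:
            index[product_id] = name
        last_seen = updated_at
    return last_seen


conn.execute("INSERT INTO products (id, name, updated_at) VALUES (1, 'kettle', 1)")
conn.execute("INSERT INTO products (id, name, updated_at) VALUES (2, 'toaster', 2)")

index = {}
cursor = sync(conn, index, 0)
soft_delete(conn, 2, now=3)
cursor = sync(conn, index, cursor)

# The source table still holds the flagged row until something purges it.
rows_in_source = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

The delete now reaches the index, but the flagged row is still in the source table, and every other query against products has to remember to filter on is_deleted. That is the cleanup cost the workaround moves rather than removes.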

How to pick between the three

A practical decision rule:

If you can get replication access to the database AND latency must be sub-second AND throughput is above a few hundred writes per second, use log-based CDC.
If you have schema-write access but not replication access, OR throughput is moderate (tens to low hundreds of writes per second), use trigger-based CDC.
If latency tolerance is more than a minute AND throughput is low (fewer than ten or so writes per second) AND you can soft-delete, use timestamp polling.

Most teams skip the first option because the operational cost feels high, and end up running trigger-based CDC at throughputs where log-based would be cleaner. The right time to switch is when the trigger overhead starts showing up in your source-database write latency, which is a measurable signal.
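The decision rule above can be written down as a small function, which forces the fuzzy thresholds to become explicit. The numeric cutoffs (300 and 10 writes per second) are assumptions standing in for "a few hundred" and "tens"; the rules are checked in the order they are listed, and inputs that match none of them are reported rather than guessed at.

```python
from dataclasses import dataclass

# Hypothetical cutoffs; tune them to your own measurements.
HIGH_WPS = 300  # stands in for "a few hundred writes per second"
LOW_WPS = 10    # stands in for "tens of writes per second"


@dataclass
class Requirements:
    replication_access: bool
    schema_write_access: bool
    can_soft_delete: bool
    max_latency_seconds: float
    peak_writes_per_second: float


def pick_pattern(r: Requirements) -> str:
    """Apply the three rules in the order they are listed in the text."""
    if (r.replication_access
            and r.max_latency_seconds < 1
            and r.peak_writes_per_second > HIGH_WPS):
        return "log-based"
    moderate = LOW_WPS <= r.peak_writes_per_second <= HIGH_WPS
    if (r.schema_write_access and not r.replication_access) or moderate:
        return "trigger-based"
    if (r.max_latency_seconds > 60
            and r.peak_writes_per_second < LOW_WPS
            and r.can_soft_delete):
        return "timestamp polling"
    return "no rule matches"


fraud_feed = Requirements(True, True, False, 0.5, 2000)
internal_tool = Requirements(False, False, True, 300, 5)
choice_a = pick_pattern(fraud_feed)
choice_b = pick_pattern(internal_tool)
```

The "no rule matches" branch is the useful part: it marks the middle ground where the requirements do not point cleanly at one pattern and the choice needs actual design work.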

For deeper architectural reading, the 137Foundry data integration service page lays out how we approach the build-versus-buy tradeoff on CDC infrastructure specifically; the underlying patterns are the same as above but the buy decision depends heavily on team headcount and existing infrastructure.

Common implementation mistakes

A few patterns that bite teams repeatedly:

Treating CDC as fire-and-forget. A CDC pipeline that ships changes downstream is half the work. The other half is observability: you need to know when the pipeline lags, when it drops messages, when the source-database log is filling up. Without monitoring, you find out about CDC problems from your data consumers, who notice the staleness before you do.
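Two of those signals are cheap to compute. The sketch below measures lag as the distance between the newest commit at the source and the newest commit applied downstream, and detects dropped messages by looking for holes in a sequence number. It assumes each change carries a source commit time and a gap-free sequence number; both are assumptions about your pipeline, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(newest_source_commit: datetime,
                    newest_applied_commit: datetime) -> timedelta:
    """How far the downstream copy trails the source, in source commit time."""
    return newest_source_commit - newest_applied_commit


def find_gaps(sequence_numbers: list[int]) -> list[tuple[int, int]]:
    """Return inclusive (first_missing, last_missing) ranges in a sorted list."""
    gaps = []
    for previous, current in zip(sequence_numbers, sequence_numbers[1:]):
        if current != previous + 1:
            gaps.append((previous + 1, current - 1))
    return gaps


source_head = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
applied_head = datetime(2024, 1, 1, 11, 58, 30, tzinfo=timezone.utc)

lag = replication_lag(source_head, applied_head)
lagging = lag > timedelta(seconds=60)   # hypothetical alert threshold
missing = find_gaps([1, 2, 3, 6, 7, 10])
```

Either value can be exported to whatever alerting you already run; the point is that the pipeline reports its own staleness instead of waiting for a consumer to notice.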

Ignoring schema evolution. The source database schema changes over time. Columns get added, renamed, dropped. A CDC pipeline that does not handle schema evolution gracefully will break on the first schema migration after it goes live. Log-based tools mostly handle this; trigger-based and timestamp-based pipelines require explicit handling.

Underestimating the snapshot problem. When you first turn on CDC, the downstream consumer has nothing. You need a one-time snapshot of the source table to seed it, and the snapshot has to be consistent with the change stream from the same moment forward. Getting the snapshot-to-stream handoff right is harder than it looks; most CDC tools have explicit support for it, and you should use that support rather than rolling your own.
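The shape of the handoff can be sketched with an in-memory stand-in for the source. The Source class, its positions, and the catch_up helper are hypothetical; the point is only that the snapshot and the log position are captured together, and the stream is resumed from exactly that position.

```python
class Source:
    """Toy source: a table plus an ordered change log with 1-based positions."""

    def __init__(self):
        self.rows = {}
        self.log = []  # entries are (position, op, key, value)

    def write(self, key, value):
        self.rows[key] = value
        self.log.append((len(self.log) + 1, "upsert", key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.log.append((len(self.log) + 1, "delete", key, None))

    def consistent_snapshot(self):
        # Real databases need a transaction or an exported snapshot to make
        # the copy and the position agree; here they are trivially in step.
        return dict(self.rows), len(self.log)


def catch_up(source, replica, position):
    """Apply every log entry after `position`; return the new position."""
    for entry_position, op, key, value in source.log[position:]:
        if op == "upsert":
            replica[key] = value
        else:
            replica.pop(key, None)
        position = entry_position
    return position


source = Source()
source.write("a", 1)
source.write("b", 2)

# Seed the consumer, remembering where in the log the snapshot was taken.
replica, position = source.consistent_snapshot()

# Writes keep arriving while the snapshot is being loaded downstream.
source.write("a", 3)
source.delete("b")
source.write("c", 4)

position = catch_up(source, replica, position)
```

If the position were recorded before or after the copy instead of atomically with it, the consumer would either replay changes the snapshot already contains or skip changes it does not, and the second case is silent data loss.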

Filtering too late. If you only need a subset of columns or rows downstream, push the filter as early in the pipeline as possible (ideally at the source). Shipping every change through the network and discarding 90 percent at the consumer is a waste of bandwidth that scales linearly with source-database write volume.

"The teams that succeed with CDC are not the ones who pick the cleverest tool. They are the ones who measure their actual write throughput and consumer latency tolerance before designing anything. Almost every CDC project that fails was built for assumed requirements that never matched reality." - Dennis Traina, founder of 137Foundry

Operational considerations

Once you pick a pattern, the operational work is just starting.

For log-based CDC, the critical operational concerns are:

Log retention on the source database. If the consumer falls behind, the database may reclaim the log space and you lose the gap.
Connection failure handling. The CDC tool must reconnect cleanly without missing messages.
Schema registry. If you are using Kafka with Debezium, you will usually want a schema registry to track how the source schema changes over time.
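For the log-retention concern, the number to watch is how much log the database is holding on behalf of a slow consumer. On PostgreSQL that is the distance between the current write position (pg_current_wal_lsn()) and the replication slot's restart_lsn in pg_replication_slots; the server can compute it with pg_wal_lsn_diff. The sketch below does the same arithmetic client-side on LSN strings. The helper names, the sample LSNs, and the 1 GiB alert threshold are assumptions for illustration.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte offset.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL the server must keep for a consumer stuck at restart_lsn."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


ALERT_BYTES = 1 << 30  # hypothetical threshold: 1 GiB of retained log

retained = retained_wal_bytes("0/3000000", "0/1000000")
should_alert = retained > ALERT_BYTES
```

Alerting on retained bytes rather than on consumer downtime catches the failure that actually hurts: a consumer that is connected but too slow, while the log quietly piles up behind it.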

For trigger-based CDC, the critical operational concerns are:

Audit table growth. The table grows monotonically; you need a retention policy or it will fill the database disk.
Trigger performance. Profile the triggers under realistic load before going live.
Cleanup of processed audit rows. The reader process needs to track which rows it has shipped, and a separate process needs to clean up rows that have been confirmed downstream.
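The reader and cleanup responsibilities for a trigger-based pipeline can be sketched as two small functions over the audit table. This uses SQLite; the changes and reader_state tables, the "warehouse" reader name, and the send callback are hypothetical, and the sketch treats "shipped" as "confirmed", which a real pipeline may need to separate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- AUTOINCREMENT keeps change_id values from being reused after a purge,
-- which the cursor comparison below relies on.
CREATE TABLE changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload   TEXT NOT NULL
);
CREATE TABLE reader_state (
    name         TEXT PRIMARY KEY,
    last_shipped INTEGER NOT NULL
);
INSERT INTO reader_state VALUES ('warehouse', 0);
""")


def ship_batch(conn, send, batch_size=100):
    """Send the next batch downstream, then advance the cursor."""
    (last_shipped,) = conn.execute(
        "SELECT last_shipped FROM reader_state WHERE name = 'warehouse'"
    ).fetchone()
    rows = conn.execute(
        "SELECT change_id, payload FROM changes "
        "WHERE change_id > ? ORDER BY change_id LIMIT ?",
        (last_shipped, batch_size),
    ).fetchall()
    if not rows:
        return 0
    send(rows)  # if this raises, the cursor stays put and the batch is retried
    with conn:
        conn.execute(
            "UPDATE reader_state SET last_shipped = ? WHERE name = 'warehouse'",
            (rows[-1][0],),
        )
    return len(rows)


def purge_shipped(conn):
    """Delete audit rows at or below the cursor; return how many were removed."""
    with conn:
        cur = conn.execute(
            "DELETE FROM changes WHERE change_id <= "
            "(SELECT last_shipped FROM reader_state WHERE name = 'warehouse')"
        )
    return cur.rowcount


with conn:
    conn.executemany(
        "INSERT INTO changes (payload) VALUES (?)",
        [(f"change-{i}",) for i in range(1, 6)],
    )

sent = []
first = ship_batch(conn, sent.extend, batch_size=3)
purged = purge_shipped(conn)
second = ship_batch(conn, sent.extend, batch_size=3)
remaining = conn.execute("SELECT COUNT(*) FROM changes").fetchone()[0]
```

Advancing the cursor only after a successful send gives at-least-once delivery: a crash between the send and the cursor update causes a resend, so the downstream side should tolerate duplicates.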

For timestamp polling, the critical operational concerns are:

Index maintenance on the updated_at column. The index has to stay fast as the table grows.
Cursor recovery. If the polling process crashes, you need to be able to restart from the last successful cursor.
Delete handling. Either soft-delete or accept that deletes are invisible.
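Cursor recovery mostly comes down to two habits: save the cursor only after the batch has shipped, and save it in a way a crash cannot corrupt. A minimal sketch, assuming the cursor lives in a small JSON file; the file name, the run_once helper, and the fake source rows are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def load_cursor(path: Path, default: int = 0) -> int:
    """Read the last successful cursor, or the default on a first run."""
    try:
        return json.loads(path.read_text())["last_seen"]
    except FileNotFoundError:
        return default


def save_cursor(path: Path, last_seen: int) -> None:
    """Write to a temp file, then rename, so a crash never leaves half a file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"last_seen": last_seen}))
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def run_once(path: Path, fetch_changes, ship) -> int:
    """One poll cycle: the cursor advances only after ship() returns."""
    last_seen = load_cursor(path)
    rows, new_cursor = fetch_changes(last_seen)
    if rows:
        ship(rows)
        save_cursor(path, new_cursor)
    return new_cursor


cursor_path = Path(tempfile.mkdtemp()) / "cursor.json"
source = [(1, "a"), (2, "b"), (3, "c")]  # (updated_at, payload)


def fetch_changes(last_seen):
    rows = [row for row in source if row[0] > last_seen]
    return rows, (rows[-1][0] if rows else last_seen)


def failing_ship(rows):
    raise ConnectionError("downstream unavailable")


# First attempt fails mid-cycle; the saved cursor must not move.
try:
    run_once(cursor_path, fetch_changes, failing_ship)
except ConnectionError:
    pass
cursor_after_crash = load_cursor(cursor_path)

# The restart picks up from the same place and ships everything.
shipped = []
final_cursor = run_once(cursor_path, fetch_changes, shipped.extend)
```

As with any save-after-ship scheme, a crash between shipping and saving leads to a repeat of the last batch on restart, so the consumer needs to handle duplicates.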

When CDC is the wrong answer

A few situations where CDC is the wrong tool:

If you only need a snapshot once a day, just run a nightly export. CDC is overkill.

If the downstream consumer can accept eventual consistency on the order of minutes and the source database is small, just dump and reload. The operational cost is lower than maintaining a CDC pipeline.

If the source data is already a streaming system (Kafka, Kinesis, Pub/Sub), you do not need CDC at all. Subscribe to the stream directly.

If the downstream consumer is another database with built-in replication support (read replicas, native replication), use the native replication features instead of building CDC on top.

The pattern in all these cases is: simpler answers are better when they fit. CDC is the right tool when you genuinely need the change stream from a transactional source database into a different system, and the simpler alternatives no longer fit.

For teams thinking about CDC as part of a larger data integration build, the question of build-versus-buy almost always comes up. The honest answer depends on team size: small teams should buy the managed CDC service even when it is more expensive, because the operational burden of self-hosted log-based CDC will eat more engineering time than the licensing cost.

Where to start

If you have never run CDC in production and you need to add it to an existing system, the smallest first step is:

Decide which one table is most valuable to capture changes from.
Profile the write rate on that table for a week. Note the peak.
Decide which downstream consumer is most valuable to feed.
Pick the pattern by the decision rule above.
Build the smallest version that ships changes for that one table to that one consumer.
Add monitoring before expanding.
Add more tables only after the first one has been stable in production for a month.

Most CDC projects that succeed start small and expand. Most that fail try to capture everything from the start and never reach a stable state on any single table. The discipline is in the sequencing.

For a longer read on adjacent topics, the 137Foundry services page covers our broader data and integration work. CDC is one piece of a larger picture, but it is the piece where teams most often get the architecture wrong, and the cost of that mistake is paid in downstream data quality for months after.

Pick the right pattern. Build the smallest version. Add monitoring before expanding. The rest is operational discipline.
