How to Build a Health Check Endpoint Worth Trusting

Almost every web application of any size has a route named /health or /healthz or /status. Most of them lie. They return 200 OK while the database is unreachable. They return 200 OK while the queue is backed up. They return 200 OK because the route handler runs no checks at all and a 200 is the default. When production starts smoking, the team realizes that the monitor everyone trusted for the last year never actually told them anything.

This guide walks through how to build a health check endpoint that earns the trust of the orchestrator, the load balancer, the on-call engineer, and the dashboards in front of them. It is not glamorous work. It is the work that turns "the system was up" into a number you can actually defend in a postmortem.

What a health check is actually for

The concept of a health check seems obvious. The endpoint says yes or no, the system is up or down. In practice, a health check serves three different audiences, each with different needs:

Orchestrators like Kubernetes use liveness and readiness probes to decide whether to restart a container or remove it from a load balancer's rotation. They want a fast, simple answer.
Load balancers and reverse proxies like Nginx or HAProxy use health checks to decide whether to route traffic to a given instance. They also want fast and simple.
External monitors and dashboards (Prometheus-style scrapers, uptime checkers, on-call alerting tools) want a richer signal that distinguishes "the process is alive" from "the process is functioning well enough to serve real requests."

The single biggest mistake in health-check design is conflating these audiences. A single endpoint that tries to satisfy all three usually fails all three.

Three endpoints, not one

The modern pattern that holds up under production traffic is to expose three separate endpoints, each with a clearly defined scope.

/livez or /healthz. Liveness. This endpoint answers one question: is the process alive and able to respond to HTTP at all? The check should do almost nothing: return 200 OK with a small body whenever the process can answer. No database calls, no external network calls, no cache lookups. Do not fail it during a graceful shutdown either; a draining process is still alive, and signaling the drain is the readiness endpoint's job. The orchestrator polls this endpoint to decide whether to restart the container. A liveness check that calls the database can return 503 because the database is having a bad minute, and the orchestrator will then restart a perfectly healthy process for no reason.
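
A minimal sketch of a liveness handler in this spirit, using only Python's standard library. The function name liveness, the class name ProbeHandler, and the response body are illustrative assumptions, not part of any framework.

```python
import json
from http.server import BaseHTTPRequestHandler


def liveness():
    """Answer "can this process respond to HTTP at all?" and nothing more.

    Deliberately no database, cache, or network calls.
    """
    return 200, json.dumps({"status": "alive"}).encode()


class ProbeHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves only the liveness route."""

    def do_GET(self):
        if self.path == "/livez":
            status, body = liveness()
        else:
            status, body = 404, b'{"error": "not found"}'
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging so probes do not flood the logs.
        pass
```

To try it locally, one option is to pass ProbeHandler to http.server.ThreadingHTTPServer and call serve_forever(). A real application would register the same function as a route in whatever framework it already uses.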

/readyz. Readiness. This endpoint answers a different question: is this instance ready to serve real production traffic right now? Here you check the dependencies the application cannot function without: the primary database, the cache layer, the message broker, any internal services in the request path. If any of them are unreachable, return 503 so the load balancer routes traffic to a healthier instance. Keep the check short, parallelize the dependency checks, and cache the result for a second or two so a flood of probes does not become a workload on the database.
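
One way to parallelize the dependency checks and keep the probe short is a thread pool with an overall deadline. This is a sketch under assumptions: the names run_checks and readiness are hypothetical, each check is a zero-argument callable that returns True when its dependency is reachable, and result caching is left out to keep the example small.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_checks(checks, timeout_s=1.0):
    """Run all dependency checks in parallel and return {name: passed}.

    `checks` maps a dependency name to a zero-argument callable that
    returns True when the dependency is reachable.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    try:
        futures = {pool.submit(check): name for name, check in checks.items()}
        done, _pending = wait(futures, timeout=timeout_s)
        results = {}
        for future, name in futures.items():
            # Pass only if the check finished in time, did not raise,
            # and returned a truthy value.
            results[name] = (
                future in done
                and future.exception() is None
                and bool(future.result())
            )
        return results
    finally:
        # Do not block the probe on stragglers; they finish in the background.
        pool.shutdown(wait=False, cancel_futures=True)


def readiness(checks, timeout_s=1.0):
    """Return (http_status, per_dependency_results) for a readiness probe."""
    results = run_checks(checks, timeout_s)
    status = 200 if all(results.values()) else 503
    return status, results
```

A check that misses the deadline counts as a failure, so one slow dependency cannot hold the probe open. The cancel_futures argument needs Python 3.9 or newer.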

/statusz or /health/details. Detailed status. This endpoint is for humans and dashboards. It returns a structured JSON document showing per-dependency status, version information, recent error rates, queue depths, and anything else the on-call engineer needs at 3 AM. This is the page the support team opens, the JSON a status dashboard renders, and the answer the chatbot quotes back when someone asks "is the API healthy?"

The separation matters. The first two endpoints have to be fast and conservative because they drive automated decisions. The third can be slower and richer because it serves humans.

What to check in each tier

The list below is a working starting point, not exhaustive.

The liveness check should verify:

The HTTP server is accepting connections.
The process can run the handler and write a response.
Nothing else. A graceful shutdown in progress is a readiness concern, not a liveness failure.

The readiness check should verify:

The primary database is reachable with a fast query (typically SELECT 1).
The primary cache is reachable.
The message broker the request path depends on is reachable.
Internal upstream services in the synchronous request path are reachable.
The process is not in the middle of startup migrations, still warming critical caches, or draining for a graceful shutdown.

The detailed status check can additionally include:

Application version and build metadata.
Connection pool counts and saturation.
Recent error rates by endpoint or category.
Background worker queue depths.
External (non-critical) dependency status.
Per-region replica lag if applicable.

Avoid the most common failure modes

A few patterns ruin health checks. Each one has caused production outages worth remembering.

The "always 200" handler. Someone scaffolds a /health route that returns the string "ok" and never updates it. The endpoint stays green for the life of the project. Monitors trust it. Production fails silently. The fix is to write at least one real check, even if it is just a database SELECT 1, on day one.
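
As a sketch of what "one real check" can look like, the function below runs SELECT 1 against a database and reports a boolean. It uses the standard library's sqlite3 module purely as a stand-in for the primary database; the function name database_reachable and the connect argument are assumptions for illustration, and a real service would pass in its own driver's connection factory.

```python
import sqlite3


def database_reachable(connect):
    """Return True only if a fresh connection can run SELECT 1.

    `connect` is a zero-argument callable that returns a DB-API style
    connection. Any error at all counts as "not reachable".
    """
    try:
        conn = connect()
        try:
            row = conn.execute("SELECT 1").fetchone()
            return row is not None and row[0] == 1
        finally:
            conn.close()
    except Exception:
        return False


# Stand-in for the primary database in this example.
def connect_to_example_db():
    return sqlite3.connect(":memory:")
```

Calling conn.execute directly is a sqlite3 convenience. Most other DB-API drivers need an explicit cursor, so that line would change with the driver.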

The deep dependency tree. Someone writes a readiness check that recursively calls the health endpoints of every downstream service. Now one stalled downstream check propagates a 503 to every service upstream, taking the whole graph offline because a non-critical sidecar had a hiccup. The fix is to check only the dependencies that block the synchronous request path on this instance, not the whole world.

The slow check that becomes the bottleneck. Someone writes a thorough readiness check that takes 2 seconds to run. The orchestrator polls it every 5 seconds. The check itself becomes a meaningful load on the database. The fix is to cache the result for 1 to 2 seconds and parallelize the dependency probes inside the check.
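
The caching half of that fix can be sketched as a small wrapper that reuses the last result for a short time-to-live. The class name CachedCheck and the injectable clock parameter are assumptions made for this example.

```python
import threading
import time


class CachedCheck:
    """Wrap a check so it runs at most once per `ttl_s` seconds."""

    def __init__(self, check, ttl_s=2.0, clock=time.monotonic):
        self._check = check
        self._ttl_s = ttl_s
        self._clock = clock
        self._lock = threading.Lock()
        self._expires_at = None  # None means nothing is cached yet
        self._value = None

    def __call__(self):
        # Holding the lock while the check runs means concurrent probes
        # wait for one fresh result instead of each hitting the database.
        with self._lock:
            now = self._clock()
            if self._expires_at is None or now >= self._expires_at:
                self._value = self._check()
                self._expires_at = now + self._ttl_s
            return self._value
```

If the wrapped check raises, nothing is cached and the exception reaches the caller, which should treat it as a failed check. The TTL also bounds how stale an answer can be, so it should stay well below the probe's failure threshold window.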

The check that returns 200 while burning. Someone writes a readiness check that catches all exceptions and returns 200. Now any error inside the check silently turns into a green light. The fix is to be explicit: if a dependency call throws, return 503. If the check itself errors out for any reason, return 503. Never let the catch-all be 200.
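
A fail-closed handler can be sketched as follows. The name readiness_handler and the response bodies are hypothetical; the point is that 200 is returned on exactly one path and every other path ends in 503.

```python
def readiness_handler(compute_status):
    """Wrap a readiness computation so that any failure is a 503.

    `compute_status` is a zero-argument callable that returns True when
    every critical dependency is healthy.
    """
    try:
        healthy = compute_status()
    except Exception as exc:
        # An error inside the check is itself a reason to be "not ready".
        return 503, {"status": "unavailable", "error": type(exc).__name__}
    if healthy is True:
        return 200, {"status": "ready"}
    # Anything other than an explicit True (False, None, a stray string)
    # falls through to 503.
    return 503, {"status": "unavailable"}
```

Comparing against True explicitly is a deliberate choice here: a check that accidentally returns None or an error string should not read as healthy.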

The version drift. Someone deploys a new version that adds a critical dependency but forgets to add it to the readiness check. The new dependency goes down, traffic keeps flowing to the broken instance, and the monitor stays green. The fix is to treat the readiness check as code that every dependency addition must update, the same way a database migration must update the schema.
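
One possible guard against this drift is a test in CI that compares the dependencies a service declares as critical with the checks its readiness endpoint actually runs. Everything named here (missing_readiness_checks, CRITICAL_DEPENDENCIES, READINESS_CHECKS, and the dependency names) is hypothetical example data; where the declared list comes from depends on the project.

```python
def missing_readiness_checks(declared_dependencies, readiness_checks):
    """Return critical dependencies that have no readiness check, sorted."""
    return sorted(set(declared_dependencies) - set(readiness_checks))


# Hypothetical example data: what the service declares versus what it probes.
CRITICAL_DEPENDENCIES = ["postgres", "redis", "payments-api"]
READINESS_CHECKS = {
    "postgres": lambda: True,
    "redis": lambda: True,
}
```

A unit test can then assert that missing_readiness_checks(...) returns an empty list, so a pull request that adds a dependency without a check fails before it ships. With the example data above the function reports payments-api as uncovered.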

Adding observability to the health endpoint

A health check endpoint should also be the place where you stamp the version of the running code and surface a few canonical metrics. The detailed /statusz payload typically looks like this:

version: the build SHA and tag.
commit_time: when the running code was committed.
started_at: when the process started.
dependencies: an array of per-dependency status entries, each with name, status, latency_ms, and last_checked_at.
metrics_endpoint: a pointer to the Prometheus-style metrics endpoint if separate.
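
A sketch of a function that assembles a payload with those fields. The field names come from the list above; the function names probe and build_status, the status strings "ok" and "failing", and the /metrics path are assumptions for illustration.

```python
import time
from datetime import datetime, timezone


def _iso(timestamp):
    """Format a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def probe(name, check):
    """Run one dependency check and describe the outcome."""
    started = time.perf_counter()
    try:
        ok = bool(check())
    except Exception:
        ok = False
    return {
        "name": name,
        "status": "ok" if ok else "failing",
        "latency_ms": round((time.perf_counter() - started) * 1000, 1),
        "last_checked_at": _iso(time.time()),
    }


def build_status(build, started_at, checks):
    """Assemble the detailed status document as a JSON-serializable dict.

    `build` holds "version" and "commit_time" strings stamped at build
    time, `started_at` is the process start as a Unix timestamp, and
    `checks` maps dependency names to zero-argument callables.
    """
    return {
        "version": build["version"],
        "commit_time": build["commit_time"],
        "started_at": _iso(started_at),
        "dependencies": [probe(name, check) for name, check in checks.items()],
        "metrics_endpoint": "/metrics",  # assumed path; adjust to the real one
    }
```

The result can be passed straight to json.dumps. On a publicly reachable endpoint, the version and dependency names would need trimming before they are returned.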

Tools like Prometheus scrape the metrics endpoint on a regular schedule and turn the time series into service level indicators the team can build alerts and SLOs against. The health check itself is not the metrics endpoint; they are different endpoints, with different polling cadences, and they answer different questions.

For a deeper picture of how the metrics, logs, and health endpoints fit together, the services hub at 137Foundry covers the reliability and observability work we do for production web applications.

"A health check that lies for six months is worth less than no health check at all. The team that trusted it has built an entire alerting strategy on top of a green light that means nothing. We treat health endpoints as the first piece of any production system to audit, because the cost of getting them wrong is invisible until the worst possible moment." - Dennis Traina, founder of 137Foundry

Making the orchestrator and the engineer agree

A subtle problem most teams run into: the orchestrator's idea of healthy and the engineer's idea of healthy diverge. The orchestrator wants a binary signal. The engineer wants nuance. If you only expose the binary signal, the engineer ends up writing one-off scripts to introspect the running system. If you only expose nuance, the orchestrator either misinterprets it or ignores it.

The three-endpoint pattern resolves this. Liveness and readiness give the orchestrator clean binary signals. The detailed endpoint gives the engineer the nuance. The two views stay consistent because they are derived from the same underlying checks, just at different aggregation levels.
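
To illustrate "same checks, different aggregation levels", the sketch below derives both the binary readiness code and the detailed view from one list of results. The CheckResult type, the critical flag, and the "ok", "degraded", and "unavailable" labels are assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    critical: bool  # True if the synchronous request path depends on it
    ok: bool


def readiness_code(results):
    """Binary view for the orchestrator: only critical checks count."""
    return 200 if all(r.ok for r in results if r.critical) else 503


def detailed_view(results):
    """Nuanced view for humans, built from the same results."""
    if not all(r.ok for r in results if r.critical):
        overall = "unavailable"
    elif not all(r.ok for r in results):
        overall = "degraded"
    else:
        overall = "ok"
    return {
        "status": overall,
        "dependencies": [
            {
                "name": r.name,
                "critical": r.critical,
                "status": "ok" if r.ok else "failing",
            }
            for r in results
        ],
    }
```

Because both functions read the same results, the dashboard cannot show a critical dependency as failing while the readiness probe returns 200. A failing non-critical dependency shows up as "degraded" for the engineer without pulling the instance out of rotation.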

Production gotchas worth knowing

Three more patterns that catch teams out:

Authentication and the health check. The liveness and readiness endpoints should be unauthenticated, because the orchestrator and load balancer cannot easily carry credentials. The detailed endpoint should be either unauthenticated and limited in what it reveals, or authenticated with a service-only credential. Do not expose internal version numbers, full dependency hostnames, or recent error messages on a publicly reachable endpoint.

Rate limits and probes. Probes from the orchestrator and from every load balancer in front of the app can add up to several requests per second per instance. Make sure your rate limiter does not throttle them. Many teams have learned this the hard way: the application starts dropping liveness probes under load, the orchestrator restarts the instance, the surviving instances absorb more load, and the cascade continues until everything is restarting in a loop.
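
A sketch of exempting probe paths from an application-level rate limiter. The TokenBucket class here is a deliberately simple, single-threaded stand-in for whatever limiter the application really uses, and PROBE_PATHS and allow_request are hypothetical names.

```python
import time

# Paths that must never be throttled (assumed to match the routes in use).
PROBE_PATHS = frozenset({"/livez", "/healthz", "/readyz"})


class TokenBucket:
    """Minimal token bucket; not thread-safe, for illustration only."""

    def __init__(self, capacity, refill_per_s, clock=time.monotonic):
        self._capacity = capacity
        self._refill_per_s = refill_per_s
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, capped at capacity.
        self._tokens = min(
            self._capacity, self._tokens + elapsed * self._refill_per_s
        )
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def allow_request(path, limiter):
    """Let probes through unconditionally; rate-limit everything else."""
    if path in PROBE_PATHS:
        return True
    return limiter.try_acquire()
```

The exemption is checked before the limiter is touched, so probes neither consume tokens nor get rejected when the bucket is empty. If the probe endpoints are reachable from the public internet, this exemption is better scoped to the orchestrator's source addresses.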

Graceful shutdown coordination. When a process is being shut down (a deploy, a scale-down event), the readiness check should start failing immediately so the load balancer stops routing new requests, while in-flight requests continue to be served. The liveness check should keep passing during this window so the orchestrator does not kill the process before in-flight work finishes. The shutdown sequence is: stop being ready, drain in-flight work, then exit.
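
The sequence "stop being ready, drain in-flight work, then exit" can be sketched as a small lifecycle object that the probe handlers and the request path share. The class name Lifecycle and its method names are assumptions for this example.

```python
import threading


class Lifecycle:
    """Track shutdown state and in-flight requests for the probe handlers."""

    def __init__(self):
        # Condition() uses a re-entrant lock by default.
        self._cond = threading.Condition()
        self._draining = False
        self._in_flight = 0

    def request_started(self):
        with self._cond:
            self._in_flight += 1

    def request_finished(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def begin_shutdown(self, *_signal_args):
        """Step 1: stop being ready. Accepts signal-handler arguments."""
        with self._cond:
            self._draining = True

    def readiness(self):
        with self._cond:
            return 503 if self._draining else 200

    def liveness(self):
        # Still alive while draining, so the orchestrator does not kill
        # the process before in-flight work finishes.
        return 200

    def wait_for_drain(self, timeout_s):
        """Step 2: block until in-flight work is done or the timeout hits.

        Returns True if everything drained, False on timeout.
        """
        with self._cond:
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=timeout_s
            )
```

In a real service, begin_shutdown would typically be registered with signal.signal(signal.SIGTERM, ...) from the main thread, and the process would exit after wait_for_drain returns. The drain timeout should be shorter than the orchestrator's termination grace period.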

A reverse proxy or edge service in front of the application, such as Nginx or Cloudflare, will respect these signals only if your health endpoints are correctly wired to the process lifecycle. The configuration is rarely the hard part; the wiring is.

A short reading list

For background and deeper reading:

The Kubernetes documentation on probes for the orchestrator perspective.
The Prometheus documentation for the metrics and monitoring perspective.
Background reading on service level indicators (SLIs), for how health and reliability metrics fit into SLO work.
The Cloudflare and Nginx docs for reverse-proxy health check configuration patterns.

For applied work on production web reliability, 137Foundry's data integration service is where most of the cross-system health-check design conversations end up living for client systems.

The takeaway

A health check endpoint worth trusting is three endpoints, not one. Liveness is fast and shallow. Readiness checks the dependencies you cannot serve without. Detailed status is for humans. The most common failure modes are the always-200 handler, the deep dependency tree, the slow check, the catch-all that hides errors, and the version drift that lets new dependencies silently bypass the check.

None of this is glamorous. None of it shows up in a feature demo. It is the work that turns "we think the system was up" into a number the team can defend. The cost of building it well is a few days of careful design. The cost of building it badly is the next time something breaks at 3 AM and the green light on the dashboard turns out to mean nothing.

If you want help auditing the health-check story for your own production stack, the home page is the place to start that conversation.
