How to Establish Engineering Productivity Metrics That Work

Most engineering productivity metrics fail in the same predictable way. A leader sees that performance feels uneven, picks a few measurable proxies (lines of code, pull request count, story points), publishes them on a dashboard, and watches engineers optimize for the proxy while the underlying problems get worse.

The fix is not adding more metrics. It is choosing metrics that genuinely correlate with the outcome you care about, while building enough context around them that the team can see what to actually change. That requires understanding which metrics are well-established, which are useful but easy to misuse, and which should never be on a dashboard at all.

What Productivity Metrics Are For (and What They Are Not For)

Engineering productivity metrics serve two purposes well: they help leadership understand where systemic friction lives, and they help teams see whether the changes they make to their process are actually improving outcomes. They serve other purposes badly.

Specifically, they are not useful for individual performance evaluation. Research on software engineering productivity, including the work by the DORA team at Google Cloud, consistently points to the same conclusion: software delivery performance is a property of the system, not the individual. Measuring individuals against output metrics produces gaming behaviors and provides no information about how to make the team better.

Before publishing any metric, decide who gets to see it and what decision they will make based on it. A metric that informs no decision is decoration. A metric that informs the wrong decision is a hazard.

The DORA Metrics: The Reliable Starting Point

The DORA Four Key Metrics emerged from over a decade of research on what distinguishes high-performing software organizations. They have held up across organizations of all sizes, technology stacks, and product types, and they remain the most reliably validated software delivery metrics available.

The four metrics:

Deployment frequency: how often code reaches production.
Lead time for changes: time from code commit to running in production.
Mean time to restore service (MTTR): time from incident detection to resolution.
Change failure rate: percentage of deployments that result in a failure requiring rollback or hotfix.

Each metric is straightforward to measure with standard CI/CD tooling, and improvements in these metrics correlate strongly with improvements in business outcomes. The full DORA research, published annually as the State of DevOps Report and referenced extensively on the DORA website, is the best place to find current industry benchmarks for what "elite," "high," "medium," and "low" performance actually look like.

Two notes on using DORA metrics in practice:

First, the metrics work best as a system. Improving deployment frequency without watching change failure rate produces faster delivery of more bugs. Improving lead time without watching MTTR produces faster delivery without the operational capability to handle what the faster delivery surfaces. Look at all four together.

Second, the cluster you fall into matters more than the absolute number. In the DORA benchmarks, elite performers deploy multiple times per day with under one hour lead time and under one hour MTTR, while low performers deploy monthly with multi-month lead time. The exact thresholds shift from one annual report to the next, so check the current report before setting targets. Moving from low to medium is a much larger win than tightening already-elite metrics.

SPACE Framework: Adding Developer Experience to the Picture

DORA metrics describe what delivery looks like from the outside. They do not capture what producing that delivery feels like from inside the team. The SPACE framework, published in a widely cited ACM Queue paper by researchers including Nicole Forsgren and Margaret-Anne Storey, adds five dimensions that fill that gap:

Satisfaction: how engineers feel about their work and tools.
Performance: quality and effect of work produced.
Activity: counts of actions taken (PRs, commits, etc.).
Communication and collaboration: team interactions and knowledge sharing.
Efficiency and flow: ability to do work without interruption or blockers.

The SPACE framework's main contribution is the recommendation to measure across multiple dimensions and at multiple levels (individual, team, and system), precisely because no single metric captures productivity. It also acknowledges that some of the most important signals (satisfaction, flow, collaboration) are best measured through structured surveys rather than instrumentation.

In practice, the most useful combination is the DORA delivery metrics paired with a quarterly developer experience survey covering satisfaction, blockers, and tool effectiveness. The survey data and the delivery metrics together tell a story about where the team is and what is producing the gap between current and desired performance.

Metrics That Sound Useful But Are Not

Several common engineering metrics deserve special attention because they are popular, easy to measure, and almost always do more harm than good when published.

Lines of code (LOC) and commits per developer. These measure activity, not value. The engineers who genuinely make the codebase better usually write fewer lines than the engineers who add unnecessary complexity. Publishing these metrics produces churn, not progress.

Story points completed. Story points were designed as a planning tool, not a productivity metric. Publishing them as a productivity measure turns estimation into a negotiation about scoring, which is worse than not having story points at all. Guidance from agile practitioners (see the Scrum.org resources or similar references) warns against using them this way.

Bug counts per developer. When bugs are counted against whoever touches them, engineers who pick up and fix more bugs look worse on this metric than engineers who fix none. Engineers who write code in safer areas of the codebase look better than engineers working on harder problems.

Pull request count. PRs vary by 10x or more in size and complexity. A team optimizing for PR count splits work into trivial PRs that add review overhead without adding value.

Test coverage percentage. Useful as a check that teams are writing tests at all, but the percentage rapidly stops correlating with quality once teams start gaming it. Tests written only to execute lines, without asserting anything about the behavior, do not catch regressions; they just hit the lines.

The pattern in all of these: each metric is easy to measure, easy to game, and weakly correlated with the actual outcome leadership cares about. The act of publishing them creates incentives for gaming rather than improvement.

"The metrics that work in practice are the ones engineers would still care about even if they were not being measured. Anything that produces a behavior change only because it is being watched is a leading indicator of a future problem, not a productivity gain." - Dennis Traina, founder of 137Foundry

How to Measure DORA Metrics Without Heavy Tooling

The four DORA metrics are measurable on most CI/CD platforms without buying anything new. The mechanics:

Deployment frequency comes from counting production deploys. Tag every production deploy in your CI/CD platform (GitHub Actions, GitLab CI, CircleCI, Jenkins, etc.) and pull the counts. For organizations with multiple services, track per-service frequency, since service-level performance is what matters and aggregating across services hides the slow services behind the fast ones.
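As a rough illustration of the per-service count, here is a minimal Python sketch. It assumes deploy events have already been exported from the CI/CD platform as (service, timestamp) pairs; that record shape and the function name `deploys_per_week` are hypothetical, not any platform's API.

```python
from collections import Counter
from datetime import datetime, timedelta


def deploys_per_week(deploys, window_days=28, now=None):
    """Return average production deploys per week for each service.

    `deploys` is an iterable of (service_name, deployed_at) tuples.
    That shape is an assumption for this sketch, not a real export format.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=window_days)
    # Count only deploys that fall inside the window, keyed by service.
    counts = Counter(
        service for service, deployed_at in deploys
        if cutoff <= deployed_at <= now
    )
    weeks = window_days / 7
    return {service: count / weeks for service, count in counts.items()}
```

Services with no deploys in the window do not appear in the result at all, so compare the output against the full service list to catch the ones that have gone quiet.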

Lead time for changes requires linking commits to deploys. Most CI/CD platforms can record the commit SHA at each deploy, so the lead time per change is the timestamp delta between the commit being authored and the deploy happening. The 50th and 90th percentile values are usually more informative than the mean.
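The percentile calculation can be sketched in a few lines of Python. This assumes each change has already been reduced to an (authored, deployed) timestamp pair; the function name `lead_time_percentiles` and the input shape are hypothetical.

```python
from datetime import datetime
from statistics import quantiles


def lead_time_percentiles(changes):
    """Return (p50, p90) lead time in hours.

    `changes` is a list of (authored_at, deployed_at) datetime pairs,
    an assumed shape for this sketch.
    """
    hours = [
        (deployed - authored).total_seconds() / 3600
        for authored, deployed in changes
    ]
    if len(hours) < 2:
        raise ValueError("need at least two changes to compute percentiles")
    # n=10 yields nine cut points; index 4 is the 50th percentile
    # and index 8 is the 90th.
    cuts = quantiles(hours, n=10, method="inclusive")
    return cuts[4], cuts[8]
```

A wide gap between the two values usually means a minority of changes are getting stuck somewhere, which a mean would blur into a single unremarkable number.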

Change failure rate requires tracking which deploys cause incidents. The simplest approach is a binary tag on each deploy: did it cause a rollback, hotfix, or user-impacting incident? Pull the counts over a rolling window (90 days is typical).
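A minimal Python sketch of the rolling-window calculation follows. It assumes each deploy record carries a timestamp and the binary failure tag described above; the tuple shape and the function name `change_failure_rate` are hypothetical.

```python
from datetime import datetime, timedelta


def change_failure_rate(deploys, window_days=90, now=None):
    """Return the fraction of recent deploys tagged as failures.

    `deploys` is an iterable of (deployed_at, caused_failure) tuples,
    where caused_failure is True for a rollback, hotfix, or
    user-impacting incident. Returns None when the window is empty.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=window_days)
    recent = [
        failed for deployed_at, failed in deploys
        if cutoff <= deployed_at <= now
    ]
    if not recent:
        return None
    # True counts as 1, so the sum is the number of failed deploys.
    return sum(recent) / len(recent)
```

Returning None for an empty window, rather than zero, keeps a service that never deploys from looking like a service that never fails.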

MTTR requires incident detection and resolution timestamps. Most teams already have this in their incident tracker (PagerDuty, Opsgenie, or equivalent) and can pull duration data directly.
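Once the timestamps are exported, the duration math is short. The Python sketch below assumes incidents arrive as (detected, resolved) pairs with `None` for anything still open; that shape and the function name `time_to_restore_hours` are hypothetical and do not reflect any incident tracker's API.

```python
from datetime import datetime
from statistics import mean, median


def time_to_restore_hours(incidents):
    """Return mean and median time to restore, in hours.

    `incidents` is a list of (detected_at, resolved_at) pairs;
    resolved_at is None for incidents that are still open, and
    those are skipped. Returns None if nothing has been resolved.
    """
    durations = [
        (resolved - detected).total_seconds() / 3600
        for detected, resolved in incidents
        if resolved is not None
    ]
    if not durations:
        return None
    return {"mean": mean(durations), "median": median(durations)}
```

Reporting the median next to the mean is a design choice for this sketch: one long outage can dominate the mean, and seeing both makes that visible.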

The CNCF maintains open source tooling references for collecting these metrics, and several open source DORA dashboards exist that can ingest data from common CI/CD platforms. The point is to start with simple instrumentation and improve as the metric becomes load-bearing in the team's decisions.

When the Metrics Surface Real Problems

DORA metrics that point at a real problem usually look like one of three patterns:

Slow deployment frequency with high change failure rate. The team is shipping rarely because each ship is dangerous, and each ship is dangerous because it is rare. The fix is investment in the deployment pipeline (smaller batch sizes, better automated testing, feature flags, canary deploys) so deploys become routine.

Fast deployment frequency with high MTTR. The team is shipping fast but cannot recover when something breaks. The fix is investment in observability (better logging, distributed tracing, runtime monitoring), on-call practices, and rollback automation.

Long lead time despite high deployment frequency. Code is taking a long time from author to production even though deploys are frequent. The fix is usually in the review and merge pipeline: long PR queues, slow CI, manual approval gates, or large PRs that are hard to review.

Each of these patterns is more useful than a single number. The metric tells the team where to look; the team's investigation tells leadership what to fund.
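The three patterns above can be sketched as a small check over a metrics snapshot. Everything here is illustrative: the `DoraSnapshot` fields, the `diagnose` function, and especially the threshold constants are assumptions for the example, not DORA benchmarks, and a team would tune them to its own baseline.

```python
from dataclasses import dataclass


@dataclass
class DoraSnapshot:
    deploys_per_week: float
    lead_time_p50_hours: float
    change_failure_rate: float    # fraction between 0 and 1
    time_to_restore_hours: float


# Illustrative cut-offs only; these are NOT published DORA thresholds.
FREQUENT_DEPLOYS_PER_WEEK = 5.0
HIGH_FAILURE_RATE = 0.30
SLOW_RESTORE_HOURS = 24.0
LONG_LEAD_TIME_HOURS = 72.0


def diagnose(snapshot):
    """Return a list of the patterns a snapshot matches, as hints on where to look."""
    findings = []
    frequent = snapshot.deploys_per_week >= FREQUENT_DEPLOYS_PER_WEEK
    if not frequent and snapshot.change_failure_rate >= HIGH_FAILURE_RATE:
        findings.append("rare, risky deploys: look at the deployment pipeline")
    if frequent and snapshot.time_to_restore_hours >= SLOW_RESTORE_HOURS:
        findings.append("fast shipping, slow recovery: look at observability and rollback")
    if frequent and snapshot.lead_time_p50_hours >= LONG_LEAD_TIME_HOURS:
        findings.append("long lead time: look at the review and merge pipeline")
    return findings
```

The output is a prompt for investigation, not a verdict. A snapshot that matches no pattern returns an empty list, which says nothing about whether the team is healthy.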

Engineering Tools and Practices That Move These Metrics

For organizations early in their productivity-metric journey, several investments reliably improve the DORA metrics without producing the gaming behaviors of weaker metrics:

Trunk-based development with feature flags. Smaller batch sizes improve deployment frequency, lead time, and change failure rate simultaneously.
Comprehensive automated testing in CI. Reduces change failure rate and shortens lead time by reducing manual verification steps.
Standardized deploy and rollback automation. Reduces MTTR and removes deployment as a source of judgment calls under pressure.
Observability and alerting. Reduces MTTR by shortening detection time.
Investment in developer experience platforms. Tools like internal developer platforms, paved-road infrastructure, and ergonomic local development environments improve flow and satisfaction.

The right sequence depends on which DORA pattern the team is in. Investing in observability when the bottleneck is large PR queues produces no improvement. Investing in feature flags when the bottleneck is slow incident response produces no improvement. Read the metrics to find the bottleneck, then invest there.

Where 137Foundry Helps

If your engineering team wants help interpreting what its DORA metrics are telling it, or building the tooling to surface them in the first place, 137Foundry's services hub covers the kinds of work that show up in these conversations. That includes data integration to pull metrics from disparate CI/CD and incident systems, internal tool development to surface them in a way the team will actually use, and the AI automation services that increasingly factor into how engineering teams compose their workflow.

The 137Foundry homepage has more on the firm and the kinds of engineering organizations it works with. The about page covers the team's background.

The harder organizational question that surfaces once metrics are in place is what to do with the answer. Metrics show where the system is friction-laden, but acting on the friction requires resources (engineering time, infrastructure spend, hiring, tooling) and political will. The teams that move their DORA metrics meaningfully are usually the teams whose leadership treats the metric movement as a signal that justifies investment, not as a target to hit with cosmetic changes.

The metrics are the easy part. The conversations the metrics enable are where the actual work happens.
