Your Best Engineer Barely Writes Code Anymore

Two skills have been trending among developers lately: Caveman and Ponytail. Both aim at the same sore spot - that AI coding agents tend to over-engineer and emit more code than the job needs. Caveman more or less tells the model to rein it in and write less. Ponytail is subtler: it hands the agent a decision ladder so it asks, first, whether the thing is even worth writing.

Watching them, two things struck me. One, that even this already cuts against today's AI metrics - the ones that reward more output, more tokens, more lines. Two, that I'd go a step further than either. Because the question isn't how large a snippet an agent generates - it's whether the thing needs to be built at all.

That's what brought a particular moment to mind. A team lead sitting across from his manager, laying out the case to put one of his engineers on a performance plan. He had the numbers ready: commit count at the bottom of the team, lines of code at the bottom, and a line on the AI throughput dashboard everyone watches now that barely registered next to his teammates'. Every metric told the same story - this person had checked out.

The manager looked at the screen and asked one question that ended the conversation: "Isn't he the one who killed the billing rewrite?"

The numbers made the case

It's worth being fair to that dashboard, because it wasn't lying. The engineer really had shipped less code than anyone else on the team. Fewer commits, fewer merged pull requests, far fewer lines.

For most of this team lead's career, that would have been a real signal. When writing code was the slow, hard part of building software, the person producing the most working code was usually the one carrying the most weight. The correlation was never perfect, but it was close enough that nobody questioned it. Measuring output wasn't lazy - it was a reasonable shortcut in a world where output was expensive.

The problem is that the world changed and the instrument didn't.

The decision that left no trace

Three weeks before that review, the same engineer sat in a different meeting: planning for the next big feature, a full billing rewrite. Roughly three months of work, scoped and ready to start.

He asked four questions. Who is actually asking for this? What happens if we don't build it? Which customers hit the current limit? How many, last quarter?

Nobody had clean answers, so they went and looked. The limit had been hit twice, by a single customer, who had since churned for an unrelated reason. The rewrite quietly came off the roadmap.

Three months of work the team never had to spend - and the decision generated nothing. No commit, no pull request, no closed ticket. On every dashboard the team lead used, the most valuable thing anyone did that quarter simply did not exist.

What the dashboard was actually counting

Generating code was never really the bottleneck - it was just expensive enough to look like one. An agent now turns a paragraph into a working feature faster than you can review it, and as the price of code collapsed, the real constraint showed through behind it. Which means the chart that counts output is now counting the cheapest part of the job.

The expensive part didn't appear, and it didn't move - it was there the whole time, hidden behind the cost of code. The real work was always deciding what is worth building at all - spotting the feature that solves a problem nobody has, knowing where a boundary goes so three other things stay simple, saying "not yet" in a room full of momentum.

That kind of work shows up as code that didn't get written, complexity that isn't there to count, three months that never hit the timesheet. The engineer at the bottom of the dashboard was doing the most valuable work on the team. The instrument just couldn't see it, because it was built for the old job.

"But you can't have judgment without coding"

There's a fair objection here, and the team lead raised it himself: if this engineer barely writes code, won't he lose the instincts that made him good?

It's a real risk. Judgment isn't free-floating. He can ask those four questions precisely because he spent years building systems, shipping them, and living with what broke. An architect who hasn't touched a codebase in five years gives you opinions, not judgment.

But "barely writes code" was never quite the point. He still wrote code, and still dug into the hard parts by hand. What changed was the ratio. The hours that used to go into producing the obvious eighty percent now go into the questions a model will never ask on its own. His value didn't leave the code - it moved upstream of it.

Taste still comes from having built things. It just stopped expressing itself as line count.

What to measure instead

The team didn't put him on a plan. They changed what they looked at.

Throughput stayed a useful operational metric, but stopped being a measure of who was contributing. In reviews, they started asking harder questions:

What did this person prevent, decide, or simplify?
What got cut because someone asked the right thing early?
Who on the team is better because of how this person reviews their work?

None of that fits in a column, and all of it is the job now. The team lead put it bluntly afterwards: he had almost rewarded the engineers generating the most code he would later have to maintain, and punished the one keeping the codebase small enough to understand. The dashboard would have told him everything was going great - right up to the moment it wasn't.

When generating code is nearly free, output stops being a proxy for value. What's scarce - and always was - is knowing which code is worth generating at all. That judgment is the part of the job AI hasn't touched - and increasingly, it's the whole job.
