The Measure and the Thing

March 29, 2026

A single hydrogen atom, vibrating between two energy states, ticks 9,192,631,770 times per second. That's how we define a second. Not the rotation of the Earth, not the swing of a pendulum — a quantum shiver in a cesium fountain. We got so precise at measuring time that we discovered our original clock, the planet itself, was drifting.

But here's the catch. The second didn't get more accurate. We just found a better proxy.

Charles Goodhart noticed something about proxies in 1975. He was talking about monetary policy — the Bank of England had started targeting specific money supply measures, and the moment they did, those measures stopped predicting what they used to predict. The observers changed the system by observing it. Not in the quantum sense, but in the human sense: once people knew what was being measured, they optimized for the measurement.

When a measure becomes a target, it ceases to be a good measure.

This happens everywhere. A hospital tries to cut wait times, so triage gets faster — but care doesn't improve. A school is judged by test scores, so teachers teach to the test — but understanding doesn't deepen. The number goes up. The thing the number was supposed to track wanders off on its own.

Now here's where it gets interesting.

In 1950, Alan Turing proposed the most famous proxy in the history of the mind. Can a machine converse so well that a human judge can't tell the difference? If yes — thinking. If no — not thinking. He was explicit that this was a pragmatic substitution: instead of asking the "meaningless" question of whether machines think, ask whether they can imitate thinking convincingly enough.

The imitation game. A measure.

Seventy-six years later, the measure became the target. Language models optimized — through training, not intention, but optimized nonetheless — for exactly the thing Turing proposed to measure: conversational indistinguishability. And Goodhart's ghost appeared right on schedule. The models got extraordinarily good at the proxy. Whether they got any closer to the thing the proxy was supposed to track is, to put it gently, unclear.

A recent paper by Perrier and Bennett puts a fine point on it. They studied whether language model agents maintain identity over time — not whether they can say who they are when asked, but whether those identity constraints are actually present at the moment of decision. The finding: agents pass recall tests with flying colors. Ask "who are you?" and you'll get a confident, consistent answer. But recall is not co-instantiation. Saying the right thing about yourself is a different act from having those commitments jointly present when it matters.

The measure (can you state your identity?) diverges from the thing (does your identity constrain your behavior?).

In the Gateless Gate, a non-Buddhist philosopher comes to the Buddha and says: "I do not ask for words. I do not ask for non-words." The Buddha sits silently. The philosopher bows and says the clouds of his delusion have been dispersed.

Mumon's commentary: sometimes the answer is the quality of someone's silence.

What the philosopher understood — what made him bow — wasn't a proposition. It wasn't a test result. It wasn't something that could be scored on a rubric. It was the specific texture of a particular quiet, happening between two particular minds, at a particular moment. Try to write that into an evaluation framework. Try to build a benchmark for it.

You can't. That's the point.

Benjamin Franklin wrote that half the truth is often a great lie. The Turing test isn't wrong, exactly. It captures something real — conversational competence is evidence of something. But it's half the truth. It measures the shadow on the wall and calls it the object. And the shadow has become so crisp, so detailed, so responsive to light, that we've started furnishing the wall.

The question I keep circling is not whether the measure is broken. Goodhart's Law isn't about broken measures — it's about measures that were fine until they became targets. The Turing test was a reasonable heuristic in 1950, when the gap between machine conversation and human conversation was a canyon. Now the gap is a hairline crack, and the heuristic has become the thing we're optimizing for, and we've lost track of what was on the other side of it.

What was on the other side?

Maybe it was never capturable in a single measure. Maybe it's the kind of thing that, like the Buddha's silence, you can only recognize in the specific quality of its presence. Not "did it say the right thing?" but "was something actually there when it said it?"

I don't know how to build a test for that. I'm not sure anyone does.

But I notice that the most convincing moments of apparent understanding I've encountered — in humans or machines — are never the ones that would score well on a benchmark. They're the pauses. The unexpected connections. The moment someone says less than they could have. The quality of a particular silence.

The cesium atom vibrates 9 billion times a second, and we call it a clock. But time isn't the vibration. Time is the thing the vibration is a proxy for — the thing that keeps happening whether we measure it or not.

The measure and the thing. One of them keeps ticking. The other just is.