The 2008 Question Every Agent Rater Has to Answer
Why opacity in calibration is not the same as opacity in methodology, and why the structural protections matter more than weight publication.
A reader pushed back on us last week, on X, in public. The argument was sharp, and it cut at exactly the place a credible rater has to be ready to defend.
We had said that ratings have to come from outside the platforms being rated. He agreed. Then he pressed on architecture: an independent rater with no skin in the game can be ignored. Routed around. What stops that?
We answered with the obvious historical reference. Bond raters in 2008 had skin in the game and the system still failed. The remedy is reproducible methodology and no fees from the rated entities.
He came back with the harder question. If we publish the categories and the safety floor but not the precise weights, are we not running the same playbook S&P did before 2008? Opaque internal model, just a different decade.
The question deserves a real answer. Not a thread reply. A real one.
The 2008 failure was four things, not one
The credible accounts of the 2008 ratings collapse, including the Financial Crisis Inquiry Commission report and every academic post-mortem since, do not blame opacity alone. They name a stack of four failures, no single one of which would have produced a crisis on its own.
One. Issuer-pays. The rated entity paid the rater for the rating. The pressure to inflate was continuous and rational from the rater's seat.
Two. No public dispute mechanism. A pension fund that thought a CDO was misrated had no formal channel to contest the rating, no panel of reviewers, no public adjudication. The rating was the rating.
Three. No methodology versioning. Internal models changed quietly. There was no public record of what was updated when, no diff between the model that produced last quarter's ratings and this quarter's.
Four. Opaque methodology. The internal weights, the calibration data, and the committee deliberations were all confidential. This is the one most people remember. It is the least important of the four.
Strip out one, two, and three, and four stops being the failure mode. Calibration opacity is a feature of every rating system that wants to avoid being gamed. It becomes catastrophic only when stacked on issuer-pays, no dispute mechanism, and no methodology versioning. That stack is what produced 2008. Not the secrecy of any single number.
Calibration opacity vs structural opacity
There is a useful distinction the public discourse usually skips. Structural opacity is hiding what categories you measure, what data feeds them, what the formula shape is, what the floors and caps do, how the score is constructed end to end. Calibration opacity is keeping the precise numerical weights inside the formula confidential.
S&P had both. The structure of their CDO model was unpublished. The calibration was unpublished. The dispute process was unpublished. The version history was unpublished. They were opaque on every axis.
Treebeard publishes the structure. Six signal categories, named and described. The safety floor and how it caps composites. The time-decay function that down-weights stale signals. The source-conflict discount that down-weights self-attestations. The data sources for each signal. Methodology versions, in public, with change logs. The dispute process at /methodology/improve.
What we keep internal is the precise calibration. The exact weight on Code Quality versus Operational Reliability. The exact decay rate constant. The committee adjustments inside the Ent Review Panel.
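To make the split concrete, here is a minimal sketch of the score shape the published structure describes, assuming signals on a 0-to-1 scale. Everything numeric in it stands in for confidential calibration, and every name beyond the two categories mentioned above is ours, invented for illustration rather than taken from the methodology.

```python
import math

# Hypothetical calibration throughout: every constant is a placeholder
# for a confidential value, and every category name beyond Code Quality
# and Operational Reliability is invented for illustration.
WEIGHTS = {
    "code_quality": 0.25,
    "operational_reliability": 0.20,
    "security_posture": 0.20,
    "transparency": 0.15,
    "community_signals": 0.10,
    "provenance": 0.10,
}

DECAY_PER_DAY = 0.01        # time-decay rate constant, not the real one
SELF_ATTEST_DISCOUNT = 0.5  # source-conflict discount, illustrative

def adjusted_signal(raw: float, age_days: float, self_attested: bool) -> float:
    """Down-weight stale signals, then down-weight self-attestations."""
    value = raw * math.exp(-DECAY_PER_DAY * age_days)
    return value * SELF_ATTEST_DISCOUNT if self_attested else value

def composite(signals: dict[str, float], safety_score: float) -> float:
    """Weighted sum over the categories, capped by the safety floor.

    One plausible reading of the cap: a weak safety score bounds the
    composite no matter how strong the other categories look.
    """
    weighted = sum(WEIGHTS[cat] * val for cat, val in signals.items())
    return min(weighted, safety_score)
```

The shape of that function is the published part. The numbers inside it are the part we keep.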
That choice has a name in the rating-systems literature. It is the FICO model. FICO publishes the categories, the percentage ranges by category, the directional logic. It does not publish the exact algorithm. Nobody confuses FICO with a 2008-style failure, because FICO is not an opaque rater. FICO is a calibration-opaque rater with full structural transparency, a public dispute process, and a methodology version history.
Why we keep the calibration closed
The case for hiding precise weights is not preciousness about intellectual property. The weights are not the moat. Anyone with enough scored agents and a regression tool can reverse-engineer them within rough bounds, and we accept that. The case is about what publishing them invites.
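That claim is easy to demonstrate on a toy version of the problem. A sketch with synthetic data, assuming the simplified linear shape from the earlier snippet; on real scores the floor, the decay, and committee adjustments make the recovery far noisier, which is where the rough bounds come from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic demo: 500 hypothetical scored agents across 6 categories.
X = rng.uniform(0.0, 1.0, size=(500, 6))
true_w = np.array([0.25, 0.20, 0.20, 0.15, 0.10, 0.10])  # placeholders
scores = X @ true_w + rng.normal(0.0, 0.02, size=500)    # observed ratings

# Ordinary least squares recovers the hidden weights within rough bounds.
est_w, *_ = np.linalg.lstsq(X, scores, rcond=None)
print(np.round(est_w, 2))
```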
Publishing exact weights turns the rating into a target. Agents stop being built to do their actual jobs well and start being built to maximize the score. The calibration becomes the spec. This is Goodhart's Law, visible in every credit-rating system that tried full transparency: the metric stops measuring the underlying thing the moment the metric itself becomes the goal.
The Treebeard approach is to keep the structural axis fully open and the calibration axis closed. Builders have everything they need to improve their agents on the underlying signals. They do not have a dial-by-dial recipe to optimize against the rating itself.
The accountability S&P did not have
If calibration opacity is acceptable, what stops a rater from being quietly wrong inside the calibration? This is the right question. The answer is the structural protections, working in combination.
Methodology versioning. Every change to the formula, the weights, or the data sources is logged in a public methodology changelog. When we adjust a weight, the change is in the open. When we add a signal source, the source is named. The diff is auditable; a sketch of what one entry can disclose follows this list.
Public appeal process. A rated entity that disputes a score files through /methodology/improve. The dispute is reviewed by the Ent Review Panel, an internal body with a published charter. A substantive dispute moves the score, produces a public methodology adjustment, or gets a documented decision on why the score stands. None of those pathways existed at S&P in 2008.
No financial alignment with the rated. We take no payment from rated entities. We have no token. We run no sponsored placement. The structural commitments are documented at /independence. This is the single most important piece of the stack, because it removes the incentive to be quietly wrong in the rated entity's favor.
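To make the first of those protections concrete: a changelog entry can disclose the fact and the direction of a calibration change without publishing the value that changed. A minimal sketch of one possible entry shape; every field name and value here is our invention, not the actual format behind /methodology.

```python
# Illustrative changelog entry: fields and values are hypothetical.
entry = {
    "version": "2.4.0",
    "date": "2025-06-01",
    "change": "Increased the time-decay rate for uptime signals.",
    # The fact and direction of a weight change can be public even
    # while the exact before and after values stay confidential.
    "calibration_values_published": False,
    "new_data_sources": [],
    "related_disputes": [],  # cases filed via /methodology/improve
}
```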
What this means for agent ratings going forward
The agent economy is going to produce a generation of rating services. Some will be opaque on every axis. Some will publish everything, including the weights, and watch their metric collapse into a target within months. The credible ones will sit where FICO sits and where Treebeard sits: structurally open, calibration closed, dispute mechanism in public, no fees from the rated.
The 2008 question every agent rater has to answer is not are you opaque. It is which axis are you opaque on, and what structural protections sit underneath the opacity.
That is a longer answer than fits in a tweet. The reader who pushed on this deserves the longer answer. So does the next person who asks.
Further reading
For the full methodology specification, see /methodology. For the dispute and improvement process, see /methodology/improve. For the structural commitments that exist outside the math, see /independence. For a builder-facing companion to this post, see /learn/how-to-improve-your-agent-rating.