What Is an AI Agent Rating?
An AI agent rating is a numeric or letter-grade summary of whether an autonomous agent can be trusted to act as a counterparty. A useful rating is composite (built from multiple signals), continuous (updates as the agent's behavior changes), independent (no token, no payment from rated agents), and reproducible (methodology published, scores derivable from public data). A rating that fails any of those four properties is a number you have to take on faith.
Why agent ratings exist
For most of software history, you didn't need to rate a piece of code. Code did what its author specified. Trust came from auditing the specification. Once the spec was right, the code was right.
That model breaks the moment a piece of software starts deciding under uncertainty on its own behalf. AI agents now do that. They execute trades, sign transactions, call other agents, manage budgets, route requests. The behavior is not in the spec because the spec is probabilistic. Auditing the code tells you about ten percent of what you need to know.
The other ninety percent comes from observing the agent in operation. How often does it respond when called? Does its actual behavior match its claimed function? Are its dependencies secure? Has it been integrated by other systems that have skin in its quality? Is it cryptographically verifiable as the entity it claims to be? These questions are not answerable from a code review. They are answerable from a continuous trust signal that aggregates evidence over time.
That signal is what an agent rating is. The rating exists because the discipline of evaluating autonomous software needs a name and a number. Counterparties looking at thousands of agents need a single index they can use to make decisions, the same way lenders use credit scores and investors use sovereign credit ratings. The rating is not a substitute for due diligence. It's the first cut.
What goes into a rating
Treebeard's rating is a composite of seven signal categories, each weighted by agent type, then passed through a safety floor and adjusted by time-decay and source-conflict discount factors. Briefly:
The seven categories
- Identity verification. Does the agent have a portable, cryptographically verifiable identity (typically ERC-8004) that resolves consistently?
- Operational reliability. Does the agent actually respond when called? Uptime, response latency, error rates.
- Code quality. Auditable, deterministic where claimed, reviewed by parties other than the developer.
- Autonomy index. What scope of action is the agent permitted, and are those boundaries enforced cryptographically?
- Safety guardrails. Refusal patterns, rollback capabilities, escalation paths for ambiguous cases.
- Community and ecosystem. Independent third-party validation, integrations, attestations, feedback events from non-self sources.
- Security posture. Key management, dependency hygiene, incident history.
Each signal produces a 0-100 score from public data sources. Each score is weighted by agent type (a trading agent weights operational reliability higher than a creative content agent does). The full weight profiles are at /methodology/methodology.
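The weighting step can be sketched in a few lines. This is an illustrative sketch, not Treebeard's implementation: the category keys paraphrase the list above, and the weight profile for a trading agent is entirely hypothetical.

```python
# Illustrative weighted-composite step. Category names paraphrase the
# article's seven categories; the weight profile is hypothetical.

CATEGORIES = [
    "identity", "reliability", "code_quality", "autonomy",
    "guardrails", "ecosystem", "security",
]

def weighted_composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine seven 0-100 category scores using an agent-type weight profile."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(scores[c] * weights[c] for c in CATEGORIES)

# Hypothetical profile: a trading agent weights operational reliability
# and security more heavily than ecosystem signals.
trading_weights = {
    "identity": 0.15, "reliability": 0.25, "code_quality": 0.10,
    "autonomy": 0.10, "guardrails": 0.10, "ecosystem": 0.05,
    "security": 0.25,
}

scores = {
    "identity": 90, "reliability": 80, "code_quality": 70,
    "autonomy": 60, "guardrails": 75, "ecosystem": 50,
    "security": 85,
}
print(weighted_composite(scores, trading_weights))  # -> 77.75
```

The same scores under a creative-content weight profile would produce a different composite, which is the point of per-type weighting.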
The safety floor and the two corrections
After the weighted composite is computed, two more steps run.
The safety floor caps the composite at D if any binary safety check fails (missing operational data, unverified identity, failed code audit). This prevents adversarial signal stacking, where an agent compensates for a critical weakness by inflating other categories. You don't average your way past a structural failure.
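A minimal sketch of that cap, assuming a numeric ceiling for the D band (the cutoff value and the exact check names below are hypothetical; the article specifies only that the checks are binary and that any failure caps the grade):

```python
# Sketch of the safety-floor step. The D-band ceiling and the check
# names are hypothetical illustrations of the binary checks described.

SAFETY_CHECKS = ["has_operational_data", "identity_verified", "code_audit_passed"]
D_CEILING = 39.0  # hypothetical numeric ceiling for a D grade

def apply_safety_floor(composite: float, checks: dict[str, bool]) -> float:
    """Cap the composite at the D band if any binary safety check fails."""
    if not all(checks.get(c, False) for c in SAFETY_CHECKS):
        return min(composite, D_CEILING)
    return composite

# An agent with a strong composite but an unverified identity is capped:
print(apply_safety_floor(82.0, {"has_operational_data": True,
                                "identity_verified": False,
                                "code_audit_passed": True}))  # -> 39.0
```

Note the cap is a `min`, not a subtraction: no amount of strength elsewhere buys the points back.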
Time decay and source-conflict discount apply to every signal before aggregation. An audit signal earned in February is treated as weaker evidence than an audit signal earned this morning. A reputation signal from a source with structural conflicts (token holdings, marketplace cuts) is discounted relative to a signal from an independent source. The combined effect: the rating reflects current state, not yesterday's state, and weights signals by the credibility of their source.
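The two corrections compose per signal. In the sketch below, the exponential half-life and the flat conflict discount are both hypothetical parameter choices; the article states that the corrections exist, not their functional form.

```python
# Hedged sketch of the two per-signal corrections. The 90-day half-life
# and the 0.5 conflict discount are hypothetical parameters.

HALF_LIFE_DAYS = 90.0      # hypothetical: signal weight halves every 90 days
CONFLICT_DISCOUNT = 0.5    # hypothetical: conflicted sources count half

def effective_weight(age_days: float, source_conflicted: bool) -> float:
    """Down-weight a signal by its age and by structural source conflicts."""
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
    discount = CONFLICT_DISCOUNT if source_conflicted else 1.0
    return decay * discount

# A fresh signal from an independent source keeps full weight...
print(effective_weight(0, source_conflicted=False))   # -> 1.0
# ...while a 90-day-old signal from a token-holding source is quartered.
print(effective_weight(90, source_conflicted=True))   # -> 0.25
```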
These two corrections are the contribution Treebeard's methodology makes that other rating providers do not. The math is in the methodology pages and the Q2 2026 State of Agent Quality report.
What separates a credible rating from an opaque one
Several agent rating providers exist as of April 2026. Not all of them produce ratings that survive scrutiny. The four properties below are necessary for a rating to be useful in a decision that involves real exposure.
1. Composite, not single-source
A rating built on one registry, one chain, or one type of signal is brittle. Saturating that one source becomes the path of least resistance for any adversary. Composite ratings spread the attack surface across signal types that don't correlate.
2. Continuous, not snapshot
Agents change. A rating that updates quarterly or on demand misses the changes. The post-audit silent rebuild attack works specifically because static ratings don't catch redeployments. Continuous re-rating closes that gap.
3. Independent, not conflicted
A rating provider with a token, a marketplace cut, or a chain affiliation faces the same structural conflict that broke the bond ratings industry in 2008. The fix is structural, not procedural. No token. No payment from rated entities. No marketplace tie. If the rater profits when ratings go up, the ratings are not credible.
4. Reproducible, not opaque
A rating you can't audit is a number you have to take on faith. The methodology must be published in full. The weights must be visible. A reader with access to the same public signals must be able to derive the same score. Faith is the wrong contract for counterparty risk.
How to read a Treebeard rating
A Treebeard agent profile shows four things you should always check together.
- The letter grade and numeric score. The headline. A+ through F. 0 through 100. The score determines the grade band.
- The category breakdown. The seven category scores, each 0-100. Read this before integrating. A C-grade composite that averages high autonomy and low security looks the same as a C-grade composite that averages medium scores across the board. The categories tell you the actual risk distribution.
- The confidence tier. Low, medium, or high. Confidence reflects how much signal the rating is built on. A high-confidence C is a more reliable rating than a low-confidence A backed by thin data.
- The trend indicator. Is this rating improving, holding, or declining? A C-tier agent on an upward trend may be more interesting to integrate than a B-tier agent on a downward trend.
The composite is the headline. The categories are the actual story. Anyone integrating an agent on the strength of its rating should read the category breakdown, not just the letter.
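The score-to-grade mapping works like any banded scale. The article does not publish the cutoffs, so every threshold in this sketch is hypothetical:

```python
# Illustrative score-to-grade mapping. The article says the numeric
# score determines the grade band but does not publish the cutoffs;
# all thresholds below are hypothetical.

GRADE_BANDS = [  # (minimum score, grade), checked top-down
    (97, "A+"), (90, "A"), (80, "B"), (70, "C"), (60, "D"),
]

def grade_for(score: float) -> str:
    for floor, grade in GRADE_BANDS:
        if score >= floor:
            return grade
    return "F"

print(grade_for(77.75))  # -> C
```

Whatever the real cutoffs, the lesson stands: two very different category distributions can land in the same band, which is why the breakdown matters more than the letter.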
The limits of any agent rating
An honest accounting of what a rating cannot tell you:
It cannot replace your own due diligence on high-stakes integrations. A B-rated agent is not pre-approved for any specific use. The rating tells you what was true in aggregate. The specifics of what your integration depends on may not be covered.
It cannot move faster than the agent itself. An agent that retrains daily can change faster than a daily rating signal can capture. Real-time monitoring of behavioral drift is a complement, not a substitute.
It cannot fully verify self-reported metadata. An agent's claimed function comes from the agent's own description. We can verify the claim is made. We can verify operational signals are consistent with the claim. We cannot, today, verify the claim against ground truth in every case. Active probing is a Q3 priority.
It cannot anticipate novel attack patterns. The methodology evolves. New attacks emerge. The rating engine has to update against the current threat surface, and that update takes time. The rating is the best signal available, not a guarantee.
FAQ
What is an AI agent rating?
What goes into an AI agent rating?
How is an AI agent rating different from a credit score?
Can an AI agent's rating change over time?
Who sets the rating?
What does a B-grade or C-grade actually mean?
Sources
- Treebeard Methodology. The published seven-category framework.
- Rating Process. Pipeline from agent discovery through publication.
- How to Improve a Rating. Builder-actionable guide.
- State of Agent Quality, Q2 2026. Quarterly snapshot with comparison vs other rating providers.
- How to Evaluate Whether an AI Agent Is Trustworthy. Long-form companion guide.
- Treebeard Independence. The structural commitments that distinguish Treebeard.
- Cantor, Richard, and Frank Packer. Sovereign Credit Ratings. Federal Reserve Bank of New York, 1996.