Methodology
The Treebeard Score is the output of a trust intelligence system — not a static label, but a continuous assessment that updates as new signals arrive.
Trust in autonomous systems is compositional. No single source is sufficient. Registries are incomplete. Certifications are partial. On-chain observations are noisy. The most resilient trust systems combine multiple imperfect signals into a structured view — making disagreement between sources visible and usable, not collapsed into false precision.
Here is how Treebeard produces decision-grade trust intelligence, and how to read what it produces.
Six Signal Categories
Every agent is evaluated across six top-level signal categories. Each category captures a distinct dimension of agent quality. Default weights are shown — actual weights vary by agent type (see Weight Profiles below).
Economic Viability — assesses whether the agent is economically sustainable and delivering measurable value. For token-bearing agents, this includes token performance; for non-token agents, economic signals focus on adoption metrics and funding health.
Lending Confidence
For agents rated B- or higher, Treebeard computes a parallel Lending Confidence score (0–100) that assesses creditworthiness for DeFi lending and integration-exposure decisions. It uses the same underlying signals, re-weighted for financial risk.
Score interpretation and exposure thresholds are set by the integrator. Treebeard provides the signal; risk decisions remain with the operator.
Agent Type-Dependent Weight Profiles
A one-size-fits-all weighting produces inaccurate ratings for at least half the agents rated. Each agent type therefore receives a tailored weight profile reflecting which qualities matter most for that type. All columns sum to 100%.
| Category | Default | Financial / Trading | Dev Tools | Customer- Facing | Enterprise Workflow | Auton. Ops | Research / Analysis | Creative / Content | Infra / DevOps | Safety- Critical | Data Analytics |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Econ. Viability | 20% | 25% | 15% | 15% | 20% | 20% | 15% | 15% | 15% | 10% | 20% |
| Oper. Reliability | 20% | 25% | 15% | 25% | 25% | 25% | 15% | 15% | 30% | 20% | 15% |
| Code & Arch. | 15% | 10% | 25% | 10% | 10% | 15% | 15% | 15% | 20% | 15% | 15% |
| Autonomy Index | 15% | 10% | 20% | 10% | 10% | 15% | 20% | 15% | 10% | 10% | 15% |
| Safety & Reliab. | 10% | 10% | 5% | 15% | 15% | 10% | 10% | 10% | 10% | 25% | 5% |
| Community | 10% | 10% | 10% | 15% | 10% | 5% | 15% | 20% | 5% | 10% | 20% |
| Total | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
Scoring Mechanics
Signals are gathered, normalized, and composed into category scores, then into an overall rating. Three signal types feed the system:
Binary Signals
Present or absent. Scored as 0 or 100. Used for safety-critical checkboxes: kill switch, behavioral boundaries, failure documentation.
`kill_switch: true → 100`

Absolute Measures
Continuous 0–100 against fixed benchmarks. Performance evaluated against objective standards, independent of peer group.
`uptime: 99.7% → 94.2`

Relative Measures
Percentile rank within the agent's agent type peer group. Contextualizes performance against comparable agents.
`volume: p85 in Financial → 85.0`

Each category score is the weighted average of its component signals (equal weights within categories for v1). The overall score is the weighted average of category scores using the agent type-dependent weight profile. A safety floor is then applied: an agent's maximum overall score is capped by its Safety & Reliability category score.
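The roll-up described above can be sketched in a few lines. This is an illustrative reconstruction, not Treebeard's implementation: the category names, the example signal values, and the equal category weights are ours, chosen only to show the averaging and the safety-floor cap.

```python
def category_score(signal_values):
    """Equal-weight average of a category's component signals (v1)."""
    return sum(signal_values) / len(signal_values)

def overall_score(category_scores, weights):
    """Weighted average of category scores, then the safety floor:
    the overall score may not exceed the Safety & Reliability score."""
    raw = sum(category_scores[c] * w for c, w in weights.items())
    return min(raw, category_scores["safety_reliability"])

# Hypothetical agent: strong operations, weak safety documentation.
scores = {
    "economic_viability": 80.0,
    "operational_reliability": 94.2,  # uptime 99.7% vs. fixed benchmark
    "code_architecture": 70.0,
    "autonomy_index": 85.0,           # p85 within its peer group
    "safety_reliability": 60.0,       # the binding constraint here
    "community": 75.0,
}
weights = {c: 1 / len(scores) for c in scores}  # illustrative equal weights

print(overall_score(scores, weights))  # → 60.0
```

The weighted average alone would be roughly 77, but the safety floor caps the result at the Safety & Reliability score of 60, which is the point of the cap: no amount of revenue or uptime buys back missing safety signals.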
Publishable Coverage Gate
Treebeard publishes a score only when an agent has enough signal coverage to produce a confident result. The rating engine evaluates each of the six categories for real data (vs. defaults), counts the categories that pass, and divides by six. If the resulting fraction is below the publishable threshold, the engine refuses to publish a score.
The current threshold is 40% coverage — at least three of the six categories must have real data. Agents that fall below this gate appear in the directory and have full on-chain identity surfaced, but their profile shows a checklist of missing categories instead of a numeric score.
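The gate itself is simple arithmetic. A minimal sketch, assuming a set-based representation of which categories have real data (category names and the function signature are ours, not Treebeard's):

```python
PUBLISH_THRESHOLD = 0.40  # current threshold from the methodology text

CATEGORIES = ["economic_viability", "operational_reliability",
              "code_architecture", "autonomy_index",
              "safety_reliability", "community"]

def is_publishable(categories_with_real_data: set) -> bool:
    """Publish only when enough categories are backed by real data
    (vs. defaults). With six categories, >= 3 must pass to clear 40%."""
    coverage = len(categories_with_real_data & set(CATEGORIES)) / len(CATEGORIES)
    return coverage >= PUBLISH_THRESHOLD

is_publishable({"economic_viability", "community"})                     # 2/6 → False
is_publishable({"economic_viability", "community", "autonomy_index"})   # 3/6 → True
```

Because coverage moves in sixths, the 40% threshold is equivalent to requiring at least three categories with real data, which matches the prose above.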
This is methodology rigor, not a coverage limitation. Other services emit a number with thinner data. Treebeard does not. When the gap closes — new feedback events, a github_url surfaces, ERC-8128 detection lands — the score publishes automatically on the next rating pass. No opt-in refresh, no application process.
See an unrated profile in the directory to view the checklist in practice.
Bayesian Weight Calibration
Default weights are Bayesian priors based on expert judgment. Over time, Treebeard replaces intuition with empirical evidence through a structured calibration process.
Calibration Cycle
Launch with default weights. Observe outcomes: which agents maintain stable ratings, which degrade, which get flagged, which generate “wrong-feeling” complaints. Update weight posteriors using outcome data. Publish updated weights quarterly with a transparency report.
Months 0–3: rate with defaults, collect outcome data. Months 3–4: first Bayesian update for most-rated agent types. Month 4+: quarterly recalibration cycle. All weight changes require 30-day advance notice and published rationale.
Anti-Gaming Principles
For each signal, the methodology evaluates the cost to credibly fake that signal. Signals are weighted proportionally to their cost-to-fake. Cheap-to-manipulate metrics (social followers, GitHub stars) carry minimal weight.
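One way to read "weighted proportionally to cost-to-fake" is a straightforward normalization. The cost estimates below are invented for illustration; Treebeard's actual estimates and transformation functions are proprietary (see Transparency Model).

```python
def cost_to_fake_weights(signal_costs: dict) -> dict:
    """Normalize within-category signal weights proportional to the
    estimated cost to credibly fake each signal."""
    total = sum(signal_costs.values())
    return {name: cost / total for name, cost in signal_costs.items()}

# Hypothetical cost estimates (arbitrary units): on-chain job history is
# expensive to fabricate; follower counts are cheap to buy.
weights = cost_to_fake_weights({
    "onchain_job_history": 50.0,
    "verified_uptime": 30.0,
    "github_stars": 15.0,
    "social_followers": 5.0,
})
# social_followers ends up carrying 5% of the category weight.
```

Under this scheme an attacker who inflates only the cheap signals moves the category score very little, which is the stated design goal.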
Defense Mechanisms
Signals are cross-referenced across multiple independent data sources. Anomaly detection identifies wash trading, fake GitHub activity, circular transactions, and Sybil attacks. ERC-8004 wallet-based trust scores weight credible reviewers more heavily.
A methodology canary system maintains known-quality test agents in the public directory. Automated monitoring compares methodology output against expected ratings for these agents. Drift triggers investigation.
Transparency Model
Treebeard publishes extensively — but not everything. The boundary between published and proprietary is drawn deliberately to prevent gaming while maintaining institutional trust.
Published
All six categories and signal types. Agent taxonomy. Scoring principles. Weight ranges per category. Safety floor thresholds. Rating output format. Signal discovery methods. Bayesian learning process and quarterly update schedule.
Proprietary
Exact weight calibrations. Anomaly detection thresholds. Signal transformation functions. Bayesian prior specifications. Methodology canary identities.
Auditable (Under NDA)
Full methodology including proprietary elements. Available to institutional clients and the independent methodology review board.
Version History
Algorithm changes require 30-day advance notice and published rationale. All historical versions remain available for reference.
| Version | Date | Changes |
|---|---|---|
| v2.2 | March 2026 | Phase 3 enrichment via The Graph ERC-8004 Reputation subgraph. Feedback-boosted Code Quality formula for agents with verified on-chain reputation signals. Economic Viability formula updated to weight feedback count for Phase 3 agents. Wallet cluster penalty retired — reputation score is now the quality gate. Prev-score wiring: trend detection now uses live denormalized score cache. |
| v2.1 | March 2026 | Security Posture planned as future signal category (pending signal availability). agentURI parsing: display_name, agent_description, and service_type badges (MCP, A2A, OASF) extracted from on-chain metadata. Phase 2 multi-chain coverage expanded to 17 EVM chains. |
| v2.0 | February 2026 | Bayesian continuous scoring replaces discrete bucket model. Confidence tiers (High / Medium / Low) added based on signal coverage. Archetype-dependent weight profiles introduced: 10 agent types with distinct scoring emphasis. Grade-boundary hysteresis prevents score oscillation near letter-grade thresholds. |
| v1.0 | February 2026 | Initial methodology release. Six signal categories, uniform weighting, letter grade + numeric scoring (0–100), safety floor. Published alongside Treebeard public launch. ERC-8004 Identity Registry crawler indexes all registered agents on Ethereum mainnet. |
Weight recalibrations are published quarterly. Methodology version increments are reserved for structural changes to categories, signal types, or scoring mechanics.
Signal Pipeline Roadmap
Treebeard's signal pipeline is expanding. The following additions are under evaluation or planned for 2026:
ERC-8183 — Agentic Commerce
Status: Monitoring

ERC-8183 (Agentic Commerce) is a Draft ERC defining a Job lifecycle for AI agent-to-agent commerce — programmable escrow, structured task submission, and evaluator-based completion verification. Co-authored by the Ethereum Foundation dAI team and Virtuals Protocol. Treebeard is prepared to ingest ERC-8183 job completion data as a signal: completed jobs feed Operational Reliability and Economic Viability scores, while rejection rates and dispute frequency inform Safety assessments. BNB Chain has shipped the first live implementation.
Virtuals Protocol Ecosystem Adapter
Status: Q2 2026

Virtuals Protocol operates one of the largest tokenized AI agent ecosystems. A dedicated Treebeard adapter will ingest Virtuals' on-chain agent registry, token economics, and community feedback signals — enabling Treebeard ratings for the full Virtuals agent catalog alongside existing ERC-8004 agents. Virtuals agents will use the same six-category methodology with ecosystem-appropriate weight profiles.
ENS — Agent Identity Signal
Status: Evaluating

Ethereum Name Service (ENS) registrations represent a credible investment in persistent, human-readable identity. An agent registered as `mybot.eth` signals long-term commitment — ENS names cost money and tie an agent to a public, verifiable identity. Treebeard is evaluating ENS as a signal input for Governance & Transparency (identity commitment) and Community Trust (ecosystem discoverability), and as a display name source via reverse resolution of agent owner wallets.
These additions will be reflected in this methodology document and in a new algorithm version when they ship.
Signal Coverage by Standard
Treebeard's signals are organized by the question they answer, not by protocol name. This is intentional — new standards emerge regularly, and the coverage table below remains accurate as the underlying sources evolve.
| Question answered | Standard(s) | Status |
|---|---|---|
| Who is this agent? | ERC-8004 | Live — 7 chains with active agents |
| Can it prove it's an agent, not a human? | ERC-8128 | Live — per-agent detection |
| Can it accept payments? | x402, MPP | x402 live · MPP on radar |
| Has it built a reputation? | ERC-8004 Reputation Registry | Live — Ethereum, Base, BNB |
| Has it actually delivered work? | ERC-8183 | Coming — monitoring draft |
| Does it have a persistent public identity? | ENS | Evaluating |
Filter chips in the Agent Directory correspond directly to the live rows above. New protocols are added to this table when Treebeard begins indexing them — no empty tabs, no placeholder data.
🌳 The Treebeard Review Panel
AI agents rated by data. Challenged by experts.
What it is
The Treebeard Review Panel is a 25-persona AI simulation that debates whether an agent's quantitative score is fair. Treebeard's formula is blind to context — an agent with 122 endpoints and active x402 revenue can score F/38 because the formula doesn't capture operational complexity. The panel exists to surface exactly those disconnects.
When an agent receives a Treebeard Review, three sequential debates run via Claude Haiku: a bull case (panelists who believe the score is too low), a bear case (panelists who believe the score is fair or too high), and a full-panel verdict synthesizing both. The result is a structured verdict with a panel score, confidence level, vote split, and concrete improvement suggestions.
Panel composition
25 anonymous personas. Individual names and bios are not disclosed. The panel is composed by archetype to ensure genuine debate tension:
- Protocol Engineers ×8
- VC Partners ×4
- Security Specialists ×3
- Business Users ×3
- Product / UX Designers ×2
- Economists ×2
- Regulatory Experts ×2
- End User Representative ×1
Personas are cross-geographic, span ages 24–58, and carry explicit bias profiles designed to surface disagreement rather than produce consensus rubber-stamps.
The dialectical process
Each review runs three sequential AI calls. This structure produces genuine tension — not a single-perspective summary.
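The three-call structure can be sketched as a simple pipeline in which each stage sees the transcripts of the stages before it. The `call_model` function here is a stand-in for an LLM call (e.g. to Claude Haiku); the stage names and shapes are ours, not Treebeard's API.

```python
def run_review(agent_profile: dict, call_model) -> dict:
    """Three sequential debates; each later stage sees prior transcripts,
    so the verdict can synthesize both the bull and bear cases."""
    bull = call_model("bull_case", agent_profile, transcript=[])
    bear = call_model("bear_case", agent_profile, transcript=[bull])
    verdict = call_model("verdict", agent_profile, transcript=[bull, bear])
    return {"bull": bull, "bear": bear, "verdict": verdict}

# With a stub model we can observe the ordering:
order = []
stub = lambda stage, profile, transcript: order.append(stage) or stage
run_review({"score": 38}, stub)
print(order)  # → ['bull_case', 'bear_case', 'verdict']
```

Sequencing matters: running the cases in parallel would produce two independent summaries, while feeding each transcript forward forces the later stages to respond to specific earlier arguments.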
How qualitative maps to scoring
The panel verdict does not override the quantitative score — both are displayed independently. The verdict includes a panel-estimated score range (e.g., “Panel believes fair value is 72–78”) which may inform a future methodology update, but does not change the published rating automatically.
Disagreements between panel score and quantitative score are flagged for founder review. If the panel consistently rates a signal category as under-weighted for a given agent archetype, that pattern feeds into the next weight calibration cycle.
What it is not
- Not a replacement for the quantitative score — both are shown side by side
- Not a human review — it is a structured AI simulation
- Not financial or investment advice
- Not a guarantee of agent quality or safety
How improvement suggestions work
The verdict call produces 3–5 specific improvement actions, each with an estimated score impact and difficulty level (Easy / Medium / Hard). These map directly to Treebeard's six signal categories — so builders know exactly which category to target and roughly how many points each action is worth. Suggestions are generated from the agent's actual signal data, not generic advice.
Eligibility
Agents with a quantitative score of 70 (C+) or above are eligible for a Treebeard Review. The threshold is intentionally set below B- (75) to catch undervalued agents — the agents the formula most likely underestimates. Reviews are triggered after significant score changes (±10 points) or grade boundary crossings.
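The eligibility and trigger rules above combine into a small predicate. A sketch under stated assumptions: only the 70 (C+) and 75 (B-) boundaries are named in this document, so the boundary list below is incomplete by construction, and the function name is ours.

```python
ELIGIBILITY_FLOOR = 70        # C+; intentionally below B- (75)
GRADE_BOUNDARIES = [70, 75]   # only the boundaries named in the text

def should_review(prev_score: float, new_score: float) -> bool:
    """Trigger a panel review for eligible agents after a significant
    score change (±10 points) or a grade-boundary crossing."""
    if new_score < ELIGIBILITY_FLOOR:
        return False
    big_move = abs(new_score - prev_score) >= 10
    crossed = any(min(prev_score, new_score) < b <= max(prev_score, new_score)
                  for b in GRADE_BOUNDARIES)
    return big_move or crossed

should_review(72, 83)  # True: +11 move, and crosses B- (75)
should_review(73, 76)  # True: small move, but crosses 75
should_review(72, 74)  # False: small move, no boundary crossed
```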
Legal Disclaimer
This analysis is generated by AI simulation and does not represent the views of real individuals or organizations. It is experimental and provided “as-is” for informational purposes only. It does not constitute financial, investment, or professional advice. Use at your own risk. See Terms of Service.