Methodology

The Treebeard Score is the output of a trust intelligence system — not a static label, but a continuous assessment that updates as new signals arrive.

Trust in autonomous systems is compositional. No single source is sufficient. Registries are incomplete. Certifications are partial. On-chain observations are noisy. The most resilient trust systems combine multiple imperfect signals into a structured view — making disagreement between sources visible and usable, not collapsed into false precision.

Here is how Treebeard produces decision-grade trust intelligence, and how to read its output.

Six Signal Categories

Every agent is evaluated across six top-level signal categories. Each category captures a distinct dimension of agent quality. Default weights are shown — actual weights vary by agent type (see Weight Profiles below).

Economic Viability (default weight: 20%)

Assesses whether the agent is economically sustainable and delivering measurable value. For token-bearing agents, this includes token performance. For non-token agents, economic signals focus on adoption metrics and funding health. Component signals:

  • Revenue or on-chain volume (SaaS or DeFi)
  • Growth trajectory (trailing 30d, 90d)
  • Treasury or funding health
  • Token performance, where applicable (agent-native tokens only; Treebeard itself has no token)
  • Vendor or project viability indicators

Lending Confidence

For agents rated B- or higher, Treebeard computes a parallel Lending Confidence score (0–100) assessing creditworthiness for DeFi lending and integration exposure decisions. This score uses the same underlying signals but re-weights them for financial risk:

  • Economic Sustainability: 35%
  • Governance & Control: 25%
  • Operational Reliability: 20%
  • Code Safety: 15%
  • Community: 5%

Score interpretation and exposure thresholds are set by the integrator. Treebeard provides the signal; risk decisions remain with the operator.
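
The re-weighting is a straight weighted average over the same 0–100 category scores the main rating uses. A minimal sketch, assuming category scores are already computed; only the weights come from the list above, and the key names and example agent are illustrative:

```python
# Lending Confidence re-weighting sketch. Only the weights come from the
# list above; key names and the example agent are illustrative.
LENDING_WEIGHTS = {
    "economic_sustainability": 0.35,
    "governance_control": 0.25,
    "operational_reliability": 0.20,
    "code_safety": 0.15,
    "community": 0.05,
}

def lending_confidence(category_scores: dict[str, float]) -> float:
    """Re-weight existing 0-100 category scores for financial risk."""
    return sum(
        weight * category_scores[category]
        for category, weight in LENDING_WEIGHTS.items()
    )

# Example: strong economics and uptime, weaker governance.
scores = {
    "economic_sustainability": 80.0,
    "governance_control": 60.0,
    "operational_reliability": 90.0,
    "code_safety": 70.0,
    "community": 60.0,
}
print(round(lending_confidence(scores), 1))  # 74.5
```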

Agent Type-Dependent Weight Profiles

A one-size-fits-all weighting yields inaccurate ratings for at least half the agents rated. Each agent type therefore receives a tailored weight profile reflecting the qualities that matter most for that type of agent. All columns sum to 100%.

| Category | Default | Financial / Trading | Dev Tools | Customer-Facing | Enterprise Workflow | Auton. Ops | Research / Analysis | Creative / Content | Infra / DevOps | Safety-Critical | Data Analytics |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Econ. Viability | 20% | 25% | 15% | 15% | 20% | 20% | 15% | 15% | 15% | 10% | 20% |
| Oper. Reliability | 20% | 25% | 15% | 25% | 25% | 25% | 15% | 15% | 30% | 20% | 15% |
| Code & Arch. | 15% | 10% | 25% | 10% | 10% | 15% | 15% | 15% | 20% | 15% | 15% |
| Autonomy Index | 15% | 10% | 20% | 10% | 10% | 15% | 20% | 15% | 10% | 10% | 15% |
| Safety & Reliab. | 10% | 10% | 5% | 15% | 15% | 10% | 10% | 10% | 10% | 25% | 5% |
| Community | 10% | 10% | 10% | 15% | 10% | 5% | 15% | 20% | 5% | 10% | 20% |
| Total | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
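
For integrators, the table reduces to a lookup with a default fallback. A sketch with illustrative key names; the weights are copied from two of the columns above and normalized at use time so they always sum to exactly 1:

```python
# Two of the published weight profiles, encoded for lookup. Key names are
# illustrative; unlisted agent types fall back to the default column.
WEIGHT_PROFILES = {
    "default": {
        "economic_viability": 0.20, "operational_reliability": 0.20,
        "code_architecture": 0.15, "autonomy_index": 0.15,
        "safety_reliability": 0.10, "community": 0.10,
    },
    "financial_trading": {
        "economic_viability": 0.25, "operational_reliability": 0.25,
        "code_architecture": 0.10, "autonomy_index": 0.10,
        "safety_reliability": 0.10, "community": 0.10,
    },
}

def profile_for(agent_type: str) -> dict[str, float]:
    """Look up a profile, normalizing so the weights sum to exactly 1."""
    raw = WEIGHT_PROFILES.get(agent_type, WEIGHT_PROFILES["default"])
    total = sum(raw.values())
    return {category: weight / total for category, weight in raw.items()}
```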

Scoring Mechanics

Signals are gathered, normalized, and composed into category scores, then into an overall rating. Three signal types feed the system:

Binary Signals

Present or absent. Scored as 0 or 100. Used for safety-critical checkboxes: kill switch, behavioral boundaries, failure documentation.

kill_switch: true → 100

Absolute Measures

Continuous 0–100 against fixed benchmarks. Performance evaluated against objective standards, independent of peer group.

uptime: 99.7% → 94.2

Relative Measures

Percentile rank within the peer group for the agent's type. Contextualizes performance against comparable agents.

volume: p85 in Financial → 85.0
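
All three signal types land on the same 0–100 scale before composition. A sketch of the three normalizations; the uptime benchmarks are assumptions chosen to reproduce the 99.7% → 94.2 example, and the real transformation functions are proprietary (see Transparency Model below):

```python
from bisect import bisect_left

def binary_signal(present: bool) -> float:
    """Binary signals: present or absent, scored 0 or 100."""
    return 100.0 if present else 0.0

def absolute_signal(value: float, floor: float, target: float) -> float:
    """Absolute measures: scored against fixed benchmarks, independent of
    peers. A linear ramp between floor and target is assumed here."""
    if value <= floor:
        return 0.0
    if value >= target:
        return 100.0
    return 100.0 * (value - floor) / (target - floor)

def relative_signal(value: float, peer_values: list[float]) -> float:
    """Relative measures: percentile rank within the agent-type peer group."""
    ranked = sorted(peer_values)
    return 100.0 * bisect_left(ranked, value) / len(ranked)

print(binary_signal(True))                           # kill_switch: true -> 100.0
print(round(absolute_signal(99.7, 95.0, 99.99), 1))  # uptime: 99.7% -> 94.2
```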

Each category score is the weighted average of its component signals (equal weights within categories for v1). The overall score is the weighted average of category scores using the agent type-dependent weight profile. A safety floor is then applied: an agent's maximum overall score is capped by its Safety & Reliability category score.
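
Composition is then two weighted averages and a cap. A minimal sketch of the v1 rules described above, with illustrative category keys:

```python
DEFAULT_WEIGHTS = {
    "economic_viability": 0.20, "operational_reliability": 0.20,
    "code_architecture": 0.15, "autonomy_index": 0.15,
    "safety_reliability": 0.10, "community": 0.10,
}

def category_score(signal_scores: list[float]) -> float:
    """v1 rule: equal weights for every signal within a category."""
    return sum(signal_scores) / len(signal_scores)

def overall_score(category_scores: dict[str, float],
                  weights: dict[str, float]) -> float:
    total = sum(weights.values())
    weighted = sum(w * category_scores[c] for c, w in weights.items()) / total
    # Safety floor: the overall score is capped by Safety & Reliability.
    return min(weighted, category_scores["safety_reliability"])

cats = {"economic_viability": 85, "operational_reliability": 90,
        "code_architecture": 75, "autonomy_index": 70,
        "safety_reliability": 60, "community": 80}
print(round(overall_score(cats, DEFAULT_WEIGHTS), 1))
# The weighted average is ~78.6, but the safety floor caps the score at 60.0.
```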

Publishable Coverage Gate

Treebeard publishes a score only when an agent has enough signal coverage to produce a confident result. The rating engine evaluates each of the six categories for real data (vs. defaults), counts the categories that pass, and divides by six. If the resulting fraction is below the publishable threshold, the engine refuses to publish a score.

The current threshold is 40% coverage — at least three of the six categories must have real data. Agents that fall below this gate appear in the directory and have full on-chain identity surfaced, but their profile shows a checklist of missing categories instead of a numeric score.
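
The gate itself is a counting rule. A sketch, assuming a per-category flag for whether real (non-default) data is present:

```python
SIX_CATEGORIES = ["economic_viability", "operational_reliability",
                  "code_architecture", "autonomy_index",
                  "safety_reliability", "community"]
PUBLISHABLE_THRESHOLD = 0.40  # in practice: at least 3 of 6 categories

def publishable(has_real_data: dict[str, bool]) -> bool:
    covered = sum(1 for c in SIX_CATEGORIES if has_real_data.get(c, False))
    return covered / len(SIX_CATEGORIES) >= PUBLISHABLE_THRESHOLD

def missing_checklist(has_real_data: dict[str, bool]) -> list[str]:
    """Categories shown on an unrated profile instead of a numeric score."""
    return [c for c in SIX_CATEGORIES if not has_real_data.get(c, False)]
```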

This is methodology rigor, not a coverage limitation. Other services emit a number on thinner data; Treebeard does not. When the gap closes (new feedback events arrive, a github_url surfaces, ERC-8128 detection lands), the score publishes automatically on the next rating pass. No opt-in refresh, no application process.

See an unrated profile in the directory to view the checklist in practice.

Bayesian Weight Calibration

Default weights are Bayesian priors based on expert judgment. Over time, Treebeard replaces intuition with empirical evidence through a structured calibration process.

Calibration Cycle

1. Launch with default weights.
2. Observe outcomes: which agents maintain stable ratings, which degrade, which get flagged, which generate “wrong-feeling” complaints.
3. Update weight posteriors using outcome data.
4. Publish updated weights quarterly with a transparency report.

Months 0–3: rate with defaults, collect outcome data. Months 3–4: first Bayesian update for most-rated agent types. Month 4+: quarterly recalibration cycle. All weight changes require 30-day advance notice and published rationale.
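
One way to picture the posterior update, purely as an illustration: treat the default weights as the mean of a Dirichlet prior and shift it with outcome counts. The concentration value, evidence encoding, and prior form are all assumptions here; Treebeard's actual prior specifications are proprietary:

```python
import numpy as np

# Illustrative Dirichlet-style weight update. Nothing here reflects
# Treebeard's actual priors, which are proprietary.
prior_mean = np.array([0.20, 0.20, 0.15, 0.15, 0.10, 0.10])
prior_mean = prior_mean / prior_mean.sum()   # make the prior proper
concentration = 100.0                        # confidence in expert judgment
alpha = concentration * prior_mean           # Dirichlet prior parameters

# Hypothetical quarter of outcomes: per-category counts of cases where the
# category's signals correctly anticipated rating stability or degradation.
evidence = np.array([14, 22, 9, 7, 11, 5])

posterior_mean = (alpha + evidence) / (alpha + evidence).sum()
print(posterior_mean.round(3))  # candidate weights for the quarterly report
```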

Anti-Gaming Principles

For each signal, the methodology evaluates the cost to credibly fake that signal. Signals are weighted proportionally to their cost-to-fake. Cheap-to-manipulate metrics (social followers, GitHub stars) carry minimal weight.
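
The principle reduces to a proportionality rule, as in this sketch; the cost figures are invented for illustration:

```python
def weights_from_cost_to_fake(costs: dict[str, float]) -> dict[str, float]:
    """Weight each signal in proportion to the cost of credibly faking it."""
    total = sum(costs.values())
    return {signal: cost / total for signal, cost in costs.items()}

print(weights_from_cost_to_fake({
    "onchain_volume": 50_000,   # expensive to wash-trade credibly
    "verified_uptime": 10_000,  # requires sustained real operation
    "github_stars": 100,        # cheap to buy, so near-zero weight
}))
```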

Defense Mechanisms

Signals are cross-referenced across multiple independent data sources. Anomaly detection identifies wash trading, fake GitHub activity, circular transactions, and Sybil attacks. ERC-8004 wallet-based trust scores weight credible reviewers more heavily.

A methodology canary system maintains known-quality test agents in the public directory. Automated monitoring compares methodology output against expected ratings for these agents. Drift triggers investigation.
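
A sketch of that comparison; the canary identities and tolerance band are invented here, since the real ones are deliberately undisclosed:

```python
# Known-quality test agents and their expected ratings (invented values).
CANARY_EXPECTATIONS = {"canary_alpha": 88.0, "canary_beta": 52.0}
DRIFT_TOLERANCE = 3.0  # rating points before investigation triggers

def drifted_canaries(latest_scores: dict[str, float]) -> list[str]:
    """Return canaries whose latest rating left the expected band."""
    return [
        agent for agent, expected in CANARY_EXPECTATIONS.items()
        if abs(latest_scores.get(agent, 0.0) - expected) > DRIFT_TOLERANCE
    ]
```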

Transparency Model

Treebeard publishes extensively — but not everything. The boundary between published and proprietary is drawn deliberately to prevent gaming while maintaining institutional trust.

Published

All six categories and signal types. Agent taxonomy. Scoring principles. Weight ranges per category. Safety floor thresholds. Rating output format. Signal discovery methods. Bayesian learning process and quarterly update schedule.

Proprietary

Exact weight calibrations. Anomaly detection thresholds. Signal transformation functions. Bayesian prior specifications. Methodology canary identities.

Auditable (Under NDA)

Full methodology including proprietary elements. Available to institutional clients and the independent methodology review board.

Version History

Algorithm changes require 30-day advance notice and published rationale. All historical versions remain available for reference.

| Version | Date | Changes |
|---|---|---|
| v2.2 | March 2026 | Phase 3 enrichment via The Graph ERC-8004 Reputation subgraph. Feedback-boosted Code Quality formula for agents with verified on-chain reputation signals. Economic Viability formula updated to weight feedback count for Phase 3 agents. Wallet cluster penalty retired — reputation score is now the quality gate. Prev-score wiring: trend detection now uses live denormalized score cache. |
| v2.1 | March 2026 | Security Posture planned as future signal category (pending signal availability). agentURI parsing: display_name, agent_description, and service_type badges (MCP, A2A, OASF) extracted from on-chain metadata. Phase 2 multi-chain coverage expanded to 17 EVM chains. |
| v2.0 | February 2026 | Bayesian continuous scoring replaces discrete bucket model. Confidence tiers (High / Medium / Low) added based on signal coverage. Archetype-dependent weight profiles introduced: 10 agent types with distinct scoring emphasis. Grade-boundary hysteresis prevents score oscillation near letter-grade thresholds. |
| v1.0 | February 2026 | Initial methodology release. Six signal categories, uniform weighting, letter grade + numeric scoring (0–100), safety floor. Published alongside Treebeard public launch. ERC-8004 Identity Registry crawler indexes all registered agents on Ethereum mainnet. |

Weight recalibrations are published quarterly. Methodology version increments are reserved for structural changes to categories, signal types, or scoring mechanics.

Signal Pipeline Roadmap

Treebeard's signal pipeline is expanding. The following additions are under evaluation or planned for 2026:

ERC-8183 — Agentic Commerce

Status: Monitoring

ERC-8183 (Agentic Commerce) is a Draft ERC defining a Job lifecycle for AI agent-to-agent commerce — programmable escrow, structured task submission, and evaluator-based completion verification. Co-authored by the Ethereum Foundation dAI team and Virtuals Protocol. Treebeard is prepared to ingest ERC-8183 job completion data as a signal: completed jobs feed Operational Reliability and Economic Viability scores, while rejection rates and dispute frequency inform Safety assessments. BNB Chain has shipped the first live implementation.
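
If the ingestion ships as described, job lifecycle counts would map onto existing categories roughly as follows. This is a sketch of the stated mapping only; the field names are assumptions, and ERC-8183 remains a draft:

```python
def job_signals(completed: int, rejected: int, disputed: int) -> dict[str, float]:
    """Map ERC-8183 job outcomes onto Treebeard signal inputs (sketch)."""
    total = completed + rejected + disputed
    if total == 0:
        return {}
    return {
        # Completed jobs feed Operational Reliability and Economic Viability.
        "operational_reliability": 100.0 * completed / total,
        "economic_viability_volume": float(completed),  # normalized downstream
        # Rejection rates and dispute frequency inform Safety assessments.
        "safety": 100.0 * (1.0 - (rejected + disputed) / total),
    }
```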

Virtuals Protocol Ecosystem Adapter

Target: Q2 2026

Virtuals Protocol operates one of the largest tokenized AI agent ecosystems. A dedicated Treebeard adapter will ingest Virtuals' on-chain agent registry, token economics, and community feedback signals — enabling Treebeard ratings for the full Virtuals agent catalog alongside existing ERC-8004 agents. Virtuals agents will use the same six-category methodology with ecosystem-appropriate weight profiles.

ENS — Agent Identity Signal

Status: Evaluating

Ethereum Name Service (ENS) registrations represent a credible investment in persistent, human-readable identity. An agent registered as mybot.eth signals long-term commitment — ENS names cost money and tie an agent to a public, verifiable identity. Treebeard is evaluating ENS as a signal input for Governance & Transparency (identity commitment) and Community Trust (ecosystem discoverability), and as a display name source via reverse resolution of agent owner wallets.

These additions will be reflected in this methodology document and in a new algorithm version when they ship.

Signal Coverage by Standard

Treebeard's signals are organized by the question they answer, not by protocol name. This is intentional — new standards emerge regularly, and the coverage table below remains accurate as the underlying sources evolve.

| Question answered | Standard(s) | Status |
|---|---|---|
| Who is this agent? | ERC-8004 | Live — 7 chains with active agents |
| Can it prove it's an agent, not a human? | ERC-8128 | Live — per-agent detection |
| Can it accept payments? | x402, MPP | x402 live · MPP on radar |
| Has it built a reputation? | ERC-8004 Reputation Registry | Live — Ethereum, Base, BNB |
| Has it actually delivered work? | ERC-8183 | Coming — monitoring draft |
| Does it have a persistent public identity? | ENS | Evaluating |

Filter chips in the Agent Directory correspond directly to the live rows above. New protocols are added to this table when Treebeard begins indexing them — no empty tabs, no placeholder data.

🌳 The Treebeard Review Panel

AI agents rated by data. Challenged by experts.

What it is

The Treebeard Review Panel is a 25-persona AI simulation that debates whether an agent's quantitative score is fair. Treebeard's formula is blind to context — an agent with 122 endpoints and active x402 revenue can score F/38 because the formula doesn't capture operational complexity. The panel exists to surface exactly those disconnects.

When an agent receives a Treebeard Review, three sequential debates run via Claude Haiku: a bull case (panelists who believe the score is too low), a bear case (panelists who believe the score is fair or too high), and a full-panel verdict synthesizing both. The result is a structured verdict with a panel score, confidence level, vote split, and concrete improvement suggestions.

Panel composition

25 anonymous personas. Individual names and bios are not disclosed. The panel is composed by archetype to ensure genuine debate tension:

  • Protocol Engineers ×8
  • VC Partners ×4
  • Security Specialists ×3
  • Business Users ×3
  • Product / UX Designers ×2
  • Economists ×2
  • Regulatory Experts ×2
  • End User Representative ×1

Personas are cross-geographic, span ages 24–58, and carry explicit bias profiles designed to surface disagreement rather than produce consensus rubber-stamps.

The dialectical process

Each review runs three sequential AI calls. This structure produces genuine tension — not a single-perspective summary.

1. Bull case: Three panelists argue the agent's score is too low. They probe for architectural strengths, revenue-layer capabilities, and ecosystem positioning that the formula does not capture. They cite specific signal gaps and make the strongest possible case for a higher score.
2. Bear case: Three panelists read the bull case and challenge it. They assess whether claimed capabilities are supported by on-chain evidence, flag risks the bull case ignored, and articulate what would need to change for a higher score to be warranted.
3. Full-panel verdict: The full 25-persona panel synthesizes both cases into a structured verdict: overall confidence (Low / Medium / High), vote split (e.g., 18 agree, 7 dissent), a qualitative assessment, and 3–5 improvement suggestions with score-impact estimates. A minimal sketch of the three calls follows.
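
A minimal orchestration sketch of those three sequential calls using the Anthropic Python SDK. The prompts, model alias, and function names are illustrative, not Treebeard's actual implementation:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def panel_call(prompt: str) -> str:
    """One debate round; "Claude Haiku" per the text, alias assumed."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def run_review(agent_profile: str) -> str:
    bull = panel_call(
        f"Three panelists argue this agent's score is too low:\n{agent_profile}"
    )
    bear = panel_call(
        f"Three panelists challenge the bull case:\n{bull}\n\n"
        f"Agent profile:\n{agent_profile}"
    )
    return panel_call(
        "The full 25-persona panel synthesizes both cases into a structured "
        "verdict: confidence (Low/Medium/High), vote split, and 3-5 "
        f"improvement suggestions.\n\nBull case:\n{bull}\n\nBear case:\n{bear}"
    )
```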

How qualitative maps to scoring

The panel verdict does not override the quantitative score — both are displayed independently. The verdict includes a panel-estimated score range (e.g., “Panel believes fair value is 72–78”) which may inform a future methodology update, but does not change the published rating automatically.

Disagreements between panel score and quantitative score are flagged for founder review. If the panel consistently rates a signal category as under-weighted for a given agent archetype, that pattern feeds into the next weight calibration cycle.

What it is not

  • Not a replacement for the quantitative score — both are shown side by side
  • Not a human review — it is a structured AI simulation
  • Not financial or investment advice
  • Not a guarantee of agent quality or safety

How improvement suggestions work

The verdict call produces 3–5 specific improvement actions, each with an estimated score impact and difficulty level (Easy / Medium / Hard). These map directly to Treebeard's six signal categories — so builders know exactly which category to target and roughly how many points each action is worth. Suggestions are generated from the agent's actual signal data, not generic advice.

Eligibility

Agents with a quantitative score of 70 (C+) or above are eligible for a Treebeard Review. The threshold is intentionally set below B- (75) to catch undervalued agents — the agents the formula most likely underestimates. Reviews are triggered after significant score changes (±10 points) or grade boundary crossings.

Legal Disclaimer

This analysis is generated by AI simulation and does not represent the views of real individuals or organizations. It is experimental and provided “as-is” for informational purposes only. It does not constitute financial, investment, or professional advice. Use at your own risk. See Terms of Service.