Full Spec
The complete technical specification of the Treebeard scoring engine. Every weight, threshold, and formula documented — pulled directly from the production codebase.
Most rating agencies publish vague methodology overviews. We publish the spec. If you can read a formula, you can verify our work.
Grade Thresholds
The composite score (0-100) maps to a letter grade. Thresholds are fixed and published. No curve, no relative ranking.
| Grade | Min Score | Label |
|---|---|---|
| A+ | 97 | Exceptional |
| A | 93 | Exceptional |
| A- | 90 | Exceptional |
| B+ | 85 | Strong |
| B | 80 | Strong |
| B- | 75 | Strong |
| C+ | 70 | Average |
| C | 65 | Average |
| C- | 55 | Below Average |
| D | 40 | Developing |
| F | 0 | High Risk |
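The table can be read as a top-down lookup: the first row whose minimum the score meets determines the grade. A minimal sketch of that lookup follows; the names `GRADE_THRESHOLDS` and `grade_for` are illustrative and are not taken from the production codebase.

```python
# Thresholds copied from the table above, ordered from highest to lowest.
GRADE_THRESHOLDS = [
    (97, "A+", "Exceptional"),
    (93, "A", "Exceptional"),
    (90, "A-", "Exceptional"),
    (85, "B+", "Strong"),
    (80, "B", "Strong"),
    (75, "B-", "Strong"),
    (70, "C+", "Average"),
    (65, "C", "Average"),
    (55, "C-", "Below Average"),
    (40, "D", "Developing"),
    (0, "F", "High Risk"),
]


def grade_for(score: float) -> tuple[str, str]:
    """Return (grade, label) for a composite score in the range 0-100."""
    if not 0 <= score <= 100:
        raise ValueError("composite score must be between 0 and 100")
    for min_score, grade, label in GRADE_THRESHOLDS:
        if score >= min_score:
            return grade, label
    raise AssertionError("unreachable: the F row has a minimum of 0")
```

For example, `grade_for(92.9)` falls just short of the A threshold and returns `("A-", "Exceptional")`.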
Default Category Weights
Seven signal categories, each scored 0-100, combined into a weighted composite.
| Category | Weight |
|---|---|
| Economic Viability | 20% |
| Operational Reliability | 20% |
| Code Quality | 15% |
| Autonomy Index | 15% |
| Safety & Reliability | 10% |
| Community & Ecosystem | 10% |
| Security Posture | 10% |
| Total | 100% |
Archetype Weight Profiles
Weights shift based on agent archetype. A financial trading bot is judged more heavily on economic viability; a safety-critical agent is judged more heavily on safety. All profiles sum to 100%.
| Archetype | Econ | Ops | Code | Auto | Safe | Comm | Sec |
|---|---|---|---|---|---|---|---|
| Financial Trading | **25%** | **25%** | 10% | 10% | 10% | 10% | 10% |
| Developer Tools | 15% | 15% | **25%** | 20% | 5% | 10% | 10% |
| Customer Facing | 15% | **25%** | 10% | 10% | 15% | 15% | 10% |
| Enterprise Workflow | 20% | **25%** | 10% | 10% | 15% | 10% | 10% |
| Autonomous Ops | 20% | **25%** | 15% | 15% | 10% | 5% | 10% |
| Research & Analysis | 15% | 15% | 15% | **20%** | 10% | 15% | 10% |
| Creative & Content | 15% | 15% | 15% | 15% | 10% | **20%** | 10% |
| Infrastructure & DevOps | 15% | **30%** | 20% | 10% | 10% | 5% | 10% |
| Safety Critical | 10% | 20% | 15% | 10% | **25%** | 10% | 10% |
| Data & Analytics | **20%** | 15% | 15% | 15% | 5% | **20%** | 10% |
Bold values indicate the highest-weighted category for each archetype. ERC-8004 registered agents without explicit classification default to Autonomous Ops weights.
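A minimal sketch of how a profile lookup with the Autonomous Ops fallback might look. Only three rows from the table are reproduced, and the dictionary keys, category names, and the `weights_for` function are illustrative assumptions rather than production identifiers.

```python
# Column order follows the table: Econ, Ops, Code, Auto, Safe, Comm, Sec.
CATEGORIES = ("econ", "ops", "code", "auto", "safe", "comm", "sec")

# Three rows copied from the table above (values are percentages).
PROFILES = {
    "financial_trading": (25, 25, 10, 10, 10, 10, 10),
    "autonomous_ops": (20, 25, 15, 15, 10, 5, 10),
    "safety_critical": (10, 20, 15, 10, 25, 10, 10),
}

# Agents without an explicit classification use Autonomous Ops weights.
DEFAULT_ARCHETYPE = "autonomous_ops"


def weights_for(archetype=None):
    """Return {category: fractional weight} for an archetype key."""
    row = PROFILES[DEFAULT_ARCHETYPE if archetype is None else archetype]
    if sum(row) != 100:
        raise ValueError("profile weights must sum to 100%")
    return {name: pct / 100 for name, pct in zip(CATEGORIES, row)}
```

In this sketch an unknown archetype key raises `KeyError` instead of silently falling back, so that a typo does not quietly change an agent's weights. That behavior is a design choice of the example, not something the spec states.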
Safety Floor
The safety floor is non-negotiable. No matter how strong an agent scores in other categories, its overall rating is capped if the safety score is low. An agent cannot outscore its safety.
| If Safety Score | Composite Capped At | Max Grade |
|---|---|---|
| < 25 | 79 | B- |
| ≥ 25 and < 50 | 82 | B |
| ≥ 50 and < 70 | 92 | A- |
| ≥ 70 | No cap | A+ |
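The cap table translates directly into a clamp on the composite. The sketch below assumes the rows are evaluated from the lowest safety band upward; `apply_safety_floor` is an illustrative name.

```python
def apply_safety_floor(composite: float, safety: float) -> float:
    """Clamp the composite score according to the safety score."""
    if safety < 25:
        cap = 79   # highest possible grade: B-
    elif safety < 50:
        cap = 82   # highest possible grade: B
    elif safety < 70:
        cap = 92   # highest possible grade: A-
    else:
        return composite  # safety of 70 or more: no cap
    return min(composite, cap)
```

The cap only ever lowers a score: an agent with a composite of 60 and a safety score of 10 keeps its 60.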
Hysteresis Buffer
The hysteresis buffer prevents grade oscillation when a score sits near a boundary. A grade changes only after the score moves 3 points past the boundary: an agent rated B+ needs to reach 93, not 90, to move up to A-.
This prevents noisy signals from causing daily grade flips. An agent that deserves B+ keeps B+ through minor signal fluctuations.
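One way to implement such a buffer is to compare the new score against shifted thresholds, depending on the agent's previous grade. The sketch below is an assumption about the mechanics, not the production logic. In particular, it treats "3 points past" as inclusive in both directions (a B+ moves up at 93 and down at 82), and `stable_grade` is a hypothetical name.

```python
# (grade, minimum score), ordered from best to worst.
GRADES = [("A+", 97), ("A", 93), ("A-", 90), ("B+", 85), ("B", 80), ("B-", 75),
          ("C+", 70), ("C", 65), ("C-", 55), ("D", 40), ("F", 0)]
NAMES = [name for name, _ in GRADES]
BUFFER = 3  # points a score must move past a boundary before the grade changes


def _first_index(meets):
    """Index of the best grade whose minimum satisfies `meets`, else the worst grade."""
    for i, (_, min_score) in enumerate(GRADES):
        if meets(min_score):
            return i
    return len(GRADES) - 1


def stable_grade(previous, score):
    """Return the grade to publish, given the previously published grade."""
    if previous is None:
        # First rating: no history, so use the plain thresholds.
        return NAMES[_first_index(lambda m: score >= m)]
    prev_idx = NAMES.index(previous)
    # Upgrade only if the score clears a higher boundary by BUFFER points.
    up_idx = _first_index(lambda m: score >= m + BUFFER)
    # Downgrade only if the score falls BUFFER points below the current boundary.
    down_idx = _first_index(lambda m: score > m - BUFFER)
    if up_idx < prev_idx:
        return NAMES[up_idx]
    if down_idx > prev_idx:
        return NAMES[down_idx]
    return previous
```

With this sketch, an agent previously rated B+ stays at B+ for any score from 82.01 to 92.99, so small day-to-day movement around 85 or 90 does not flip the grade.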
Confidence Tiers
Every rating carries a confidence label based on signal coverage — what percentage of the six categories have real data vs. defaults.
| Label | Signal Coverage | Meaning |
|---|---|---|
| High | ≥ 80% | Most signals current and verified |
| Medium | ≥ 40% and < 80% | Some missing or stale — rating produced with caveat |
| Insufficient | < 40% | N/R (Not Rated) — no score published |
Phase 2-only agents (identity + age signals = ~33% coverage) receive N/R rather than a misleading D or F grade. Phase 3-enriched agents with on-chain feedback typically reach 67-83% coverage.
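The tier assignment can be sketched as a coverage ratio compared against the two cutoffs in the table. The function name and its parameters are illustrative; the number of categories counted toward coverage is left to the caller.

```python
def confidence_tier(categories_with_data: int, total_categories: int) -> str:
    """Map signal coverage to a confidence label."""
    if total_categories <= 0:
        raise ValueError("total_categories must be positive")
    coverage = categories_with_data / total_categories
    if coverage >= 0.80:
        return "High"
    if coverage >= 0.40:
        return "Medium"
    return "Insufficient"  # published as N/R (Not Rated)
```

An agent with two of six categories populated has roughly 33% coverage and is labeled `"Insufficient"`, which matches the N/R treatment of Phase 2-only agents described above.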
Composite Score Formula
The composite score is the weighted sum of category scores, with safety floor enforcement and optional sybil deflation applied afterward.
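In symbols, the description above can be sketched as follows. Here the w_i are the category weights from the applicable profile and the s_i are the category scores. The notation is ours. Because sybil deflation is described below as acting on the Community & Ecosystem category score, this sketch assumes that score has already been deflated where applicable; the exact ordering in production is not spelled out in this section.

```latex
C_{\text{raw}} = \sum_{i=1}^{7} w_i \, s_i,
\qquad \sum_{i=1}^{7} w_i = 1, \quad s_i \in [0, 100]

C = \min\bigl(C_{\text{raw}},\ \operatorname{cap}(s_{\text{safe}})\bigr),
\qquad
\operatorname{cap}(s) =
\begin{cases}
79  & s < 25 \\
82  & 25 \le s < 50 \\
92  & 50 \le s < 70 \\
100 & s \ge 70
\end{cases}
```

As a worked example with the default weights: scores of 95 (Economic Viability), 95 (Operational Reliability), 90 (Code Quality), 90 (Autonomy Index), 45 (Safety & Reliability), 85 (Community & Ecosystem), and 90 (Security Posture) give 19 + 19 + 13.5 + 13.5 + 4.5 + 8.5 + 9 = 87. The safety score of 45 caps the composite at 82, so the agent is graded B rather than B+.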
Sybil Detection Engine
Autonomous agents receive on-chain feedback through the ERC-8004 Reputation Registry. Some wallets submit high volumes of feedback across many agents to artificially inflate scores. Treebeard detects and deflates these.
How It Works
- Wallet scanning: During enrichment, per-wallet feedback counts are stored for every agent. The detector aggregates these across the full index.
- Rule-based detection: Wallets exceeding volume or concentration thresholds are flagged automatically with a confidence score.
- Score deflation: For flagged agents, the Community & Ecosystem category score is reduced proportionally to the sybil feedback ratio.
- Admin confirmation: Analysts can manually confirm or dismiss flagged wallets. Confirming a wallet raises its confidence to 0.9 (confirmed).
Detection Thresholds
| Rule | Threshold | Confidence |
|---|---|---|
| Volume — agents targeted | > 1,000 agents | 0.7 (auto) |
| Volume — total feedbacks | > 5,000 feedbacks | 0.7 (auto) |
| Concentration — single agent | > 50% of feedbacks to one agent | 0.4 (borderline) |
| Admin confirmed | Manual review | 0.9 (confirmed) |
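The rules in the table can be sketched as a small classifier over per-wallet aggregates. The `WalletStats` fields and the function name are hypothetical. The sketch also assumes that when several rules match, the highest confidence wins, which the table does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WalletStats:
    """Per-wallet aggregates across the full index (field names are illustrative)."""
    agents_targeted: int
    total_feedbacks: int
    max_feedbacks_to_one_agent: int
    admin_confirmed: bool = False


def sybil_confidence(wallet: WalletStats) -> Optional[float]:
    """Return a sybil confidence score, or None if the wallet is not flagged."""
    if wallet.admin_confirmed:
        return 0.9  # manual review
    if wallet.agents_targeted > 1000 or wallet.total_feedbacks > 5000:
        return 0.7  # volume rules (auto)
    if wallet.total_feedbacks > 0:
        share = wallet.max_feedbacks_to_one_agent / wallet.total_feedbacks
        if share > 0.5:
            return 0.4  # concentration rule (borderline)
    return None
```

For example, a wallet that left 6,000 feedbacks across 300 agents is flagged at 0.7 by the total-feedback rule even though it is under the agents-targeted limit.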
Deflation Formula
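The exact production formula is not reproduced in this section. Based on the description above (the Community & Ecosystem score is reduced in proportion to the sybil feedback ratio), a minimal sketch could look like the following. The linear form, the function name, and the parameters are all assumptions.

```python
def deflate_community_score(community_score: float,
                            sybil_feedbacks: int,
                            total_feedbacks: int) -> float:
    """Scale the Community & Ecosystem score by the share of non-sybil feedback."""
    if total_feedbacks <= 0:
        return community_score  # no feedback, so nothing to deflate
    # Share of feedback from flagged wallets, clamped to the range 0-1.
    ratio = min(max(sybil_feedbacks / total_feedbacks, 0.0), 1.0)
    return community_score * (1.0 - ratio)
```

Under this assumption, an agent with a Community & Ecosystem score of 80 and 25 of its 100 feedbacks from flagged wallets would be deflated to 60. Whether the wallet's confidence score also scales the reduction is not stated here, so the sketch leaves it out.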
What We Don't Score (Yet)
Transparency means acknowledging gaps, not hiding them. These signals are on the roadmap but not yet implemented. Listing them here prevents false confidence in the current model.
Cross-agent dependency risk
Agent A depends on Agent B. If B fails, A fails. This cascading risk isn't captured yet.
Real-time operational monitoring
Uptime, latency, error rates from live probing. Currently using static signals only.
ENS identity correlation
Linking agent wallets to ENS names for identity verification and trust signal enrichment.
Adversarial robustness testing
Active red-team probing of agent endpoints. We currently assess robustness via documentation and architecture review only.
Multi-model evaluation
Testing agent behavior across different LLM backends. Agents may behave differently with GPT-4 vs Claude vs open-source models.
Economic sustainability depth
TVL, revenue, runway analysis for agents with token economics. Currently using proxy signals from on-chain activity volume.
Changelog
Algorithm changes require 30-day advance notice and published rationale. All historical versions remain available for reference.
| Version | Date | Changes |
|---|---|---|
| v2.3 | April 2026 | Sybil Detection Engine: wallet volume/concentration scanning, proportional score deflation, admin confirmation workflow. feedback_wallets stored during enrichment for cross-agent analysis. |
| v2.2 | March 2026 | Phase 3 enrichment via The Graph ERC-8004 Reputation subgraph. Feedback-boosted Code Quality formula. Economic Viability formula updated for Phase 3 agents. Reviewer wallet age signal (PRD v4.50). |
| v2.1 | March 2026 | Security Posture added as 7th signal category (weight minimized pending signal availability). agentURI parsing: display_name, agent_description, service_type badges. Multi-chain coverage expanded to 17 EVM chains + Solana. |
| v2.0 | February 2026 | Bayesian continuous scoring replaces discrete bucket model. Confidence tiers. Archetype-dependent weight profiles (10 agent types). Hysteresis buffer for grade stability. |
| v1.0 | February 2026 | Initial release. Six signal categories, uniform weighting, letter grade + numeric scoring (0-100), safety floor. ERC-8004 Identity Registry crawler. |