Full Spec

The complete technical specification of the Treebeard scoring engine. Every weight, threshold, and formula documented — pulled directly from the production codebase.

Most rating agencies publish vague methodology overviews. We publish the spec. If you can read a formula, you can verify our work.

Grade Thresholds

The composite score (0-100) maps to a letter grade. Thresholds are fixed and published. No curve, no relative ranking.

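| Grade | Min Score | Label |
|---|---|---|
| A+ | 97 | Exceptional |
| A | 93 | Exceptional |
| A- | 90 | Exceptional |
| B+ | 85 | Strong |
| B | 80 | Strong |
| B- | 75 | Strong |
| C+ | 70 | Average |
| C | 65 | Average |
| C- | 55 | Below Average |
| D | 40 | Developing |
| F | 0 | High Risk |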
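
As an illustration of how the table reads, here is a minimal Python sketch of the lookup. The names `GRADE_THRESHOLDS` and `grade_for` are hypothetical and not taken from the production codebase; only the numbers come from the table.

```python
# Thresholds copied from the grade table, highest first.
GRADE_THRESHOLDS = [
    ("A+", 97, "Exceptional"),
    ("A", 93, "Exceptional"),
    ("A-", 90, "Exceptional"),
    ("B+", 85, "Strong"),
    ("B", 80, "Strong"),
    ("B-", 75, "Strong"),
    ("C+", 70, "Average"),
    ("C", 65, "Average"),
    ("C-", 55, "Below Average"),
    ("D", 40, "Developing"),
    ("F", 0, "High Risk"),
]


def grade_for(composite: float) -> tuple[str, str]:
    """Return (grade, label) for a composite score in [0, 100]."""
    if not 0 <= composite <= 100:
        raise ValueError("composite must be between 0 and 100")
    # The first threshold the score meets or exceeds wins.
    for grade, min_score, label in GRADE_THRESHOLDS:
        if composite >= min_score:
            return grade, label
    raise AssertionError("unreachable: the F threshold is 0")
```

Because the list is ordered from highest to lowest, a score of 84.9 falls through A+ to B+ and lands on B.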

Default Category Weights

Seven signal categories, each scored 0-100, combined into a weighted composite.

| Category | Weight |
|---|---|
| Economic Viability | 20% |
| Operational Reliability | 20% |
| Code Quality | 15% |
| Autonomy Index | 15% |
| Safety & Reliability | 10% |
| Community & Ecosystem | 10% |
| Security Posture | 10% |
| Total | 100% |
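
A short Python sketch of the weighted sum under the default weights follows. The dictionary keys and the function name `weighted_composite` are illustrative assumptions; the weights themselves are the ones listed above.

```python
# Default weights in percent, as listed in the table.
DEFAULT_WEIGHTS = {
    "economic_viability": 20,
    "operational_reliability": 20,
    "code_quality": 15,
    "autonomy_index": 15,
    "safety_reliability": 10,
    "community_ecosystem": 10,
    "security_posture": 10,
}


def weighted_composite(scores: dict[str, float],
                       weights: dict[str, int] = DEFAULT_WEIGHTS) -> float:
    """Combine per-category scores (0-100) into a 0-100 composite."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same categories")
    return sum(scores[name] * weights[name] / 100 for name in weights)


example = {
    "economic_viability": 80,
    "operational_reliability": 70,
    "code_quality": 90,
    "autonomy_index": 60,
    "safety_reliability": 85,
    "community_ecosystem": 50,
    "security_posture": 75,
}
composite = weighted_composite(example)  # 16 + 14 + 13.5 + 9 + 8.5 + 5 + 7.5 = 73.5
```

A composite of 73.5 maps to C+ under the published thresholds.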

Archetype Weight Profiles

Weights shift based on agent archetype. A financial trading bot is judged more heavily on economic viability; a safety-critical agent is judged more heavily on safety. All profiles sum to 100%.

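| Archetype | Econ | Ops | Code | Auto | Safe | Comm | Sec |
|---|---|---|---|---|---|---|---|
| Financial Trading | **25%** | **25%** | 10% | 10% | 10% | 10% | 10% |
| Developer Tools | 15% | 15% | **25%** | 20% | 5% | 10% | 10% |
| Customer Facing | 15% | **25%** | 10% | 10% | 15% | 15% | 10% |
| Enterprise Workflow | 20% | **25%** | 10% | 10% | 15% | 10% | 10% |
| Autonomous Ops | 20% | **25%** | 15% | 15% | 10% | 5% | 10% |
| Research & Analysis | 15% | 15% | 15% | **20%** | 10% | 15% | 10% |
| Creative & Content | 15% | 15% | 15% | 15% | 10% | **20%** | 10% |
| Infrastructure & DevOps | 15% | **30%** | 20% | 10% | 10% | 5% | 10% |
| Safety Critical | 10% | 20% | 15% | 10% | **25%** | 10% | 10% |
| Data & Analytics | **20%** | 15% | 15% | 15% | 5% | **20%** | 10% |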

Bold values indicate the highest-weighted category for each archetype. ERC-8004 registered agents without explicit classification default to Autonomous Ops weights.
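
The profiles can be expressed as a lookup table. The sketch below is a hypothetical Python rendering, not the production data structure: `ARCHETYPE_WEIGHTS`, `CATEGORIES`, and `weights_for` are assumed names, and treating a missing classification as `None` is an assumption about how "without explicit classification" is represented.

```python
from typing import Optional

# Column order matches the archetype table: Econ, Ops, Code, Auto, Safe, Comm, Sec.
CATEGORIES = ("econ", "ops", "code", "auto", "safe", "comm", "sec")

ARCHETYPE_WEIGHTS = {
    "Financial Trading": (25, 25, 10, 10, 10, 10, 10),
    "Developer Tools": (15, 15, 25, 20, 5, 10, 10),
    "Customer Facing": (15, 25, 10, 10, 15, 15, 10),
    "Enterprise Workflow": (20, 25, 10, 10, 15, 10, 10),
    "Autonomous Ops": (20, 25, 15, 15, 10, 5, 10),
    "Research & Analysis": (15, 15, 15, 20, 10, 15, 10),
    "Creative & Content": (15, 15, 15, 15, 10, 20, 10),
    "Infrastructure & DevOps": (15, 30, 20, 10, 10, 5, 10),
    "Safety Critical": (10, 20, 15, 10, 25, 10, 10),
    "Data & Analytics": (20, 15, 15, 15, 5, 20, 10),
}

# Unclassified ERC-8004 agents fall back to this profile.
DEFAULT_ARCHETYPE = "Autonomous Ops"


def weights_for(archetype: Optional[str] = None) -> dict[str, int]:
    """Return the weight profile (in percent) for an archetype."""
    name = DEFAULT_ARCHETYPE if archetype is None else archetype
    row = ARCHETYPE_WEIGHTS[name]  # KeyError for an unrecognized archetype
    return dict(zip(CATEGORIES, row))


# Every profile must sum to 100%.
assert all(sum(row) == 100 for row in ARCHETYPE_WEIGHTS.values())
```

Raising on an unrecognized name, rather than silently falling back, keeps a typo in the archetype label from being scored under the wrong profile.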

Safety Floor

The safety floor is non-negotiable. No matter how strong an agent scores in other categories, its overall rating is capped if the safety score is low. An agent cannot outscore its safety.

| If Safety Score | Composite Capped At | Max Grade |
|---|---|---|
| < 25 | 79 | B- |
| < 50 | 82 | B |
| < 70 | 92 | A- |
| ≥ 70 | No cap | A+ |
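
A minimal Python sketch of the cap, assuming the rows are checked from the strictest band down and only the first match applies. The tuple layout and the function name `apply_safety_floor` are illustrative choices, not the production code.

```python
# (safety_below, max_composite), strictest band first, from the safety floor table.
SAFETY_FLOOR_CAPS = [(25, 79), (50, 82), (70, 92)]


def apply_safety_floor(composite_raw: float, safety_score: float) -> float:
    """Cap the composite according to the safety score; never raises it."""
    for safety_below, max_composite in SAFETY_FLOOR_CAPS:
        if safety_score < safety_below:
            # min() keeps a composite that is already below the cap unchanged.
            return min(composite_raw, max_composite)
    return composite_raw  # safety >= 70: no cap
```

For example, a raw composite of 95 with a safety score of 20 is capped at 79 (B-), while a raw composite of 60 with the same safety score stays at 60.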

Hysteresis Buffer

Prevents grade oscillation when a score sits near a boundary. A grade changes only once the score has moved at least 3 points past the boundary.

    // Example: Agent currently B+ (min 85)
    // Score drops to 84 → stays B+ (within 3pt buffer)
    // Score drops to 82 → downgraded to B (crossed 85 - 3 = 82)
    HYSTERESIS_POINTS = 3

This prevents noisy signals from causing daily grade flips. An agent that deserves B+ keeps B+ through minor signal fluctuations.
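
The buffer can be sketched in Python as follows. Two points are assumptions rather than documented behavior: the boundary checked is the edge of the agent's current grade band, and upgrades use the same 3-point buffer as the downgrade shown in the example. The function names are hypothetical.

```python
HYSTERESIS_POINTS = 3

# (grade, min_score), highest first, from the Grade Thresholds table.
GRADES = [("A+", 97), ("A", 93), ("A-", 90), ("B+", 85), ("B", 80), ("B-", 75),
          ("C+", 70), ("C", 65), ("C-", 55), ("D", 40), ("F", 0)]


def raw_grade(composite: float) -> str:
    """Grade from thresholds alone, ignoring hysteresis."""
    return next(grade for grade, min_score in GRADES if composite >= min_score)


def grade_with_hysteresis(current_grade: str, composite: float) -> str:
    """Keep the current grade unless the score is 3+ points past its band."""
    if not 0 <= composite <= 100:
        raise ValueError("composite must be between 0 and 100")
    idx = [grade for grade, _ in GRADES].index(current_grade)
    lower = GRADES[idx][1]                           # bottom of the current band
    upper = GRADES[idx - 1][1] if idx > 0 else None  # bottom of the next band up
    if composite < lower and lower - composite < HYSTERESIS_POINTS:
        return current_grade  # dipped below, but still inside the buffer
    if upper is not None and composite >= upper and composite - upper < HYSTERESIS_POINTS:
        return current_grade  # rose above, but still inside the buffer
    return raw_grade(composite)
```

With these assumptions, a B+ agent at 84 stays B+, the same agent at 82 becomes B, and a B agent needs 88 before it moves up to B+.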

Confidence Tiers

Every rating carries a confidence label based on signal coverage — what percentage of the six categories have real data vs. defaults.

| Label | Signal Coverage | Meaning |
|---|---|---|
| High | ≥ 80% | Most signals current and verified |
| Medium | ≥ 40% and < 80% | Some missing or stale; rating produced with caveat |
| Insufficient | < 40% | N/R (Not Rated); no score published |

Phase 2-only agents (identity + age signals = ~33% coverage) receive N/R rather than a misleading D or F grade. Phase 3-enriched agents with on-chain feedback typically reach 67-83% coverage.
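
The tier assignment can be sketched as below, assuming coverage is simply the count of categories with real data divided by the six categories mentioned above. The function name and signature are hypothetical.

```python
def confidence_label(categories_with_data: int, total_categories: int = 6) -> str:
    """Map signal coverage to the published confidence label."""
    if not 0 <= categories_with_data <= total_categories:
        raise ValueError("categories_with_data out of range")
    coverage = categories_with_data / total_categories
    if coverage >= 0.80:
        return "High"
    if coverage >= 0.40:
        return "Medium"
    return "Insufficient"  # published as N/R, with no score
```

Under this reading, a Phase 2-only agent with 2 of 6 categories (about 33%) is Insufficient, 4 of 6 (about 67%) is Medium, and 5 of 6 (about 83%) is High.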

Composite Score Formula

The composite score is the weighted sum of category scores, with safety floor enforcement and optional sybil deflation applied afterward.

    // Step 1: Weighted sum
    composite_raw = sum(category_score[i] * weight[i] / 100)
        for i in [EV, OR, CQ, AI, SR, CE, SP]

    // Step 2: Safety floor enforcement
    // Caps are checked strictest first (< 25, < 50, < 70); only the first match applies
    composite = composite_raw
    for cap in SAFETY_FLOOR_CAPS:
        if safety_score < cap.safety_below:
            composite = min(composite_raw, cap.max_composite)
            break

    // Step 3: Sybil deflation (if applicable)
    if sybil_ratio > 0 and sybil_deflation_applied:
        community_score *= (1 - sybil_ratio * deflation_weight * 0.8)
        // deflation_weight: 0.9 confirmed, 0.7 auto, 0.4 borderline

    // Step 4: Hysteresis check
    if |composite - grade_boundary| < 3:
        keep_current_grade()

    // Step 5: Grade assignment
    grade = first threshold where composite >= min_score
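
To show how Steps 1, 2, and 5 interact, here is a compact Python sketch using the default weights. It leaves out sybil deflation and hysteresis, and it assumes that `safety_score` refers to the SR (Safety & Reliability) category. The function name `rate` is hypothetical.

```python
# Default weights (percent), keyed by the abbreviations used in Step 1.
WEIGHTS = {"EV": 20, "OR": 20, "CQ": 15, "AI": 15, "SR": 10, "CE": 10, "SP": 10}
SAFETY_FLOOR_CAPS = [(25, 79), (50, 82), (70, 92)]  # (safety_below, max_composite)
GRADES = [("A+", 97), ("A", 93), ("A-", 90), ("B+", 85), ("B", 80), ("B-", 75),
          ("C+", 70), ("C", 65), ("C-", 55), ("D", 40), ("F", 0)]


def rate(scores: dict[str, float]) -> tuple[float, float, str]:
    """Return (composite_raw, composite, grade) for one agent."""
    # Step 1: weighted sum
    composite_raw = sum(scores[k] * WEIGHTS[k] / 100 for k in WEIGHTS)
    # Step 2: safety floor; only the first (strictest) matching cap applies
    composite = composite_raw
    for safety_below, max_composite in SAFETY_FLOOR_CAPS:
        if scores["SR"] < safety_below:
            composite = min(composite_raw, max_composite)
            break
    # Step 5: first threshold the capped composite meets
    grade = next(g for g, min_score in GRADES if composite >= min_score)
    return composite_raw, composite, grade


# Strong everywhere except safety (SR = 40).
strong_but_unsafe = {"EV": 98, "OR": 98, "CQ": 95, "AI": 95, "SR": 40, "CE": 90, "SP": 90}
raw, capped, grade = rate(strong_but_unsafe)
```

Here the raw composite is about 89.7, which would be a B+ on its own, but the safety score of 40 caps the composite at 82 and the published grade is B.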

Sybil Detection Engine

Autonomous agents receive on-chain feedback through the ERC-8004 Reputation Registry. Some wallets submit high volumes of feedback across many agents to artificially inflate scores. Treebeard detects these wallets and deflates the scores they affect.

How It Works

  1. Wallet scanning: During enrichment, per-wallet feedback counts are stored for every agent. The detector aggregates these across the full index.
  2. Rule-based detection: Wallets exceeding volume or concentration thresholds are flagged automatically with a confidence score.
  3. Score deflation: For flagged agents, the Community & Ecosystem category score is reduced proportionally to the sybil feedback ratio.
  4. Admin confirmation: Analysts can manually confirm or dismiss flagged wallets. Confirming a wallet raises its confidence to 0.9 (confirmed).

Detection Thresholds

| Rule | Threshold | Confidence |
|---|---|---|
| Volume: agents targeted | > 1,000 agents | 0.7 (auto) |
| Volume: total feedbacks | > 5,000 feedbacks | 0.7 (auto) |
| Concentration: single agent | > 50% of feedbacks to one agent | 0.4 (borderline) |
| Admin confirmed | Manual review | 0.9 (confirmed) |
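
The rules can be sketched as a single classification function. Everything below is illustrative: the `WalletStats` fields and the function name are hypothetical, and taking the highest confidence when several rules match is an assumption. The concentration rule is applied literally, with no minimum feedback count, because the table does not state one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WalletStats:
    agents_targeted: int             # distinct agents this wallet left feedback for
    total_feedbacks: int             # all feedbacks submitted by this wallet
    max_feedbacks_to_one_agent: int  # largest count sent to any single agent
    admin_confirmed: bool = False


def sybil_confidence(wallet: WalletStats) -> Optional[float]:
    """Return the sybil confidence for a wallet, or None if no rule fires."""
    if wallet.admin_confirmed:
        return 0.9  # manual review overrides the automatic rules
    confidences = []
    # Volume rules: strictly greater than the published thresholds.
    if wallet.agents_targeted > 1000 or wallet.total_feedbacks > 5000:
        confidences.append(0.7)
    # Concentration rule: more than half of all feedbacks go to one agent.
    if (wallet.total_feedbacks > 0
            and wallet.max_feedbacks_to_one_agent / wallet.total_feedbacks > 0.5):
        confidences.append(0.4)
    return max(confidences) if confidences else None
```

A wallet that targets 1,500 agents returns 0.7, while a wallet sitting exactly at 1,000 agents and 5,000 feedbacks is not flagged because both thresholds are strict.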

Deflation Formula

    // sybil_ratio = flagged_feedbacks / total_feedbacks for this agent
    // deflation_weight = max confidence of flagged wallets targeting this agent
    // 0.8 = global dampening factor (conservative: caps max deflation at 80%)
    deflated_community = community_raw * (1 - sybil_ratio * deflation_weight * 0.8)

    // Example: Agent with 60% sybil feedback, auto-detected (weight 0.7):
    //   community_raw = 75
    //   deflated = 75 * (1 - 0.60 * 0.70 * 0.80) = 75 * 0.664 = 49.8
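
The formula translates directly into a small Python function. The function name, the parameter names, and the handling of agents with no feedback are assumptions made for this sketch.

```python
DAMPENING = 0.8  # global dampening factor from the formula


def deflate_community(community_raw: float,
                      flagged_feedbacks: int,
                      total_feedbacks: int,
                      wallet_confidences: list[float]) -> float:
    """Reduce the Community & Ecosystem score in proportion to sybil feedback."""
    if total_feedbacks == 0 or not wallet_confidences:
        return community_raw  # nothing flagged, nothing to deflate
    sybil_ratio = flagged_feedbacks / total_feedbacks
    # Highest confidence among the flagged wallets targeting this agent.
    deflation_weight = max(wallet_confidences)
    return community_raw * (1 - sybil_ratio * deflation_weight * DAMPENING)


# The worked example above: 60% sybil feedback, auto-detected (0.7).
deflated = deflate_community(75, 60, 100, [0.7, 0.4])  # about 49.8
```

Since the highest documented confidence is 0.9, the largest reduction this formula can produce is 1.0 * 0.9 * 0.8 = 72% of the raw community score.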

What We Don't Score (Yet)

Transparency means acknowledging gaps, not hiding them. These signals are on the roadmap but not yet implemented. Listing them here prevents false confidence in the current model.

Cross-agent dependency risk

Agent A depends on Agent B. If B fails, A fails. This cascading risk isn't captured yet.

Real-time operational monitoring

Uptime, latency, error rates from live probing. Currently using static signals only.

ENS identity correlation

Linking agent wallets to ENS names for identity verification and trust signal enrichment.

Adversarial robustness testing

Active red-team probing of agent endpoints. We currently assess robustness via documentation and architecture review only.

Multi-model evaluation

Testing agent behavior across different LLM backends. Agents may behave differently with GPT-4 vs Claude vs open-source models.

Economic sustainability depth

TVL, revenue, runway analysis for agents with token economics. Currently using proxy signals from on-chain activity volume.

Changelog

Algorithm changes require 30-day advance notice and published rationale. All historical versions remain available for reference.

| Version | Date | Changes |
|---|---|---|
| v2.3 | April 2026 | Sybil Detection Engine: wallet volume/concentration scanning, proportional score deflation, admin confirmation workflow. feedback_wallets stored during enrichment for cross-agent analysis. |
| v2.2 | March 2026 | Phase 3 enrichment via The Graph ERC-8004 Reputation subgraph. Feedback-boosted Code Quality formula. Economic Viability formula updated for Phase 3 agents. Reviewer wallet age signal (PRD v4.50). |
| v2.1 | March 2026 | Security Posture added as 7th signal category (weight minimized pending signal availability). agentURI parsing: display_name, agent_description, service_type badges. Multi-chain coverage expanded to 17 EVM chains + Solana. |
| v2.0 | February 2026 | Bayesian continuous scoring replaces discrete bucket model. Confidence tiers. Archetype-dependent weight profiles (10 agent types). Hysteresis buffer for grade stability. |
| v1.0 | February 2026 | Initial release. Six signal categories, uniform weighting, letter grade + numeric scoring (0-100), safety floor. ERC-8004 Identity Registry crawler. |