Full Spec

The complete technical specification of the Treebeard scoring engine. Every weight, threshold, and formula documented — pulled directly from the production codebase.

Most rating agencies publish vague methodology overviews. We publish the spec. If you can read a formula, you can verify our work.

Grade Thresholds

The composite score (0-100) maps to a letter grade. Thresholds are fixed and published. No curve, no relative ranking.

Grade	Min Score	Label
A+	97	Exceptional
A	93	Exceptional
A-	90	Exceptional
B+	85	Strong
B	80	Strong
B-	75	Strong
C+	70	Average
C	65	Average
C-	55	Below Average
D	40	Developing
F	0	High Risk
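
The table can be read as a top-down lookup: the first row whose minimum the score meets is the grade. A minimal sketch of that lookup follows. The names GRADE_THRESHOLDS and grade_for are illustrative assumptions, not identifiers from the production codebase.

```python
# Thresholds copied from the Grade Thresholds table, ordered highest first.
# The constant and function names are hypothetical, chosen for this sketch.
GRADE_THRESHOLDS = [
    ("A+", 97, "Exceptional"),
    ("A", 93, "Exceptional"),
    ("A-", 90, "Exceptional"),
    ("B+", 85, "Strong"),
    ("B", 80, "Strong"),
    ("B-", 75, "Strong"),
    ("C+", 70, "Average"),
    ("C", 65, "Average"),
    ("C-", 55, "Below Average"),
    ("D", 40, "Developing"),
    ("F", 0, "High Risk"),
]


def grade_for(score: float) -> tuple[str, str]:
    """Return (grade, label) for a composite score in [0, 100]."""
    if not 0 <= score <= 100:
        raise ValueError("composite score must be between 0 and 100")
    # The first threshold the score meets wins, since rows are sorted descending.
    for grade, min_score, label in GRADE_THRESHOLDS:
        if score >= min_score:
            return grade, label
    raise AssertionError("unreachable: F has a minimum of 0")


print(grade_for(86.4))  # a score between 85 and 89.99 falls in the B+ row
```

Because the thresholds are fixed, the same score always maps to the same grade regardless of how other agents score.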

Default Category Weights

Seven signal categories, each scored 0-100, combined into a weighted composite.

Category	Weight
Economic Viability	20%
Operational Reliability	20%
Code Quality	15%
Autonomy Index	15%
Safety & Reliability	10%
Community & Ecosystem	10%
Security Posture	10%
Total	100%
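
As an illustration of how these weights combine, the sketch below computes a weighted composite from per-category scores. The dictionary keys and the function name weighted_composite are assumptions made for this example; only the weight values come from the table above.

```python
# Default weights in percent, taken from the table above.
# Key names are hypothetical identifiers chosen for this sketch.
DEFAULT_WEIGHTS = {
    "economic_viability": 20,
    "operational_reliability": 20,
    "code_quality": 15,
    "autonomy_index": 15,
    "safety_reliability": 10,
    "community_ecosystem": 10,
    "security_posture": 10,
}


def weighted_composite(scores: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of 0-100 category scores using percentage weights."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same categories")
    return sum(scores[name] * weights[name] / 100 for name in weights)


example_scores = {
    "economic_viability": 90,
    "operational_reliability": 80,
    "code_quality": 70,
    "autonomy_index": 60,
    "safety_reliability": 50,
    "community_ecosystem": 40,
    "security_posture": 30,
}
# 18 + 16 + 10.5 + 9 + 5 + 4 + 3 = 65.5
print(weighted_composite(example_scores))
```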

Archetype Weight Profiles

Weights shift based on agent archetype. A financial trading bot is judged more heavily on economic viability; a safety-critical agent is judged more heavily on safety. All profiles sum to 100%.

Archetype	Econ	Ops	Code	Auto	Safe	Comm	Sec
Financial Trading	25%	25%	10%	10%	10%	10%	10%
Developer Tools	15%	15%	25%	20%	5%	10%	10%
Customer Facing	15%	25%	10%	10%	15%	15%	10%
Enterprise Workflow	20%	25%	10%	10%	15%	10%	10%
Autonomous Ops	20%	25%	15%	15%	10%	5%	10%
Research & Analysis	15%	15%	15%	20%	10%	15%	10%
Creative & Content	15%	15%	15%	15%	10%	20%	10%
Infrastructure & DevOps	15%	30%	20%	10%	10%	5%	10%
Safety Critical	10%	20%	15%	10%	25%	10%	10%
Data & Analytics	20%	15%	15%	15%	5%	20%	10%

ERC-8004 registered agents without explicit classification default to Autonomous Ops weights.
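
The profile table and its default rule could be represented as a simple lookup. This is a sketch under assumptions: the short category keys, the ARCHETYPE_WEIGHTS name, and the choice to treat None as "no explicit classification" are all illustrative, and an unknown archetype name simply raises KeyError here.

```python
# Column order follows the table: Econ, Ops, Code, Auto, Safe, Comm, Sec.
CATEGORIES = ("econ", "ops", "code", "auto", "safe", "comm", "sec")

# Percent weights copied from the Archetype Weight Profiles table.
ARCHETYPE_WEIGHTS = {
    "Financial Trading": (25, 25, 10, 10, 10, 10, 10),
    "Developer Tools": (15, 15, 25, 20, 5, 10, 10),
    "Customer Facing": (15, 25, 10, 10, 15, 15, 10),
    "Enterprise Workflow": (20, 25, 10, 10, 15, 10, 10),
    "Autonomous Ops": (20, 25, 15, 15, 10, 5, 10),
    "Research & Analysis": (15, 15, 15, 20, 10, 15, 10),
    "Creative & Content": (15, 15, 15, 15, 10, 20, 10),
    "Infrastructure & DevOps": (15, 30, 20, 10, 10, 5, 10),
    "Safety Critical": (10, 20, 15, 10, 25, 10, 10),
    "Data & Analytics": (20, 15, 15, 15, 5, 20, 10),
}

# Unclassified ERC-8004 agents fall back to this profile.
DEFAULT_ARCHETYPE = "Autonomous Ops"


def weights_for(archetype=None) -> dict:
    """Return the category weights for an archetype, or the default profile."""
    name = DEFAULT_ARCHETYPE if archetype is None else archetype
    return dict(zip(CATEGORIES, ARCHETYPE_WEIGHTS[name]))


# Sanity check: every profile sums to 100%.
assert all(sum(w) == 100 for w in ARCHETYPE_WEIGHTS.values())
```

For example, weights_for("Safety Critical") returns a profile in which "safe" carries 25, while weights_for() returns the Autonomous Ops profile.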

Safety Floor

The safety floor is non-negotiable. No matter how strong an agent scores in other categories, its overall rating is capped if the safety score is low. An agent cannot outscore its safety.

If Safety Score	Composite Capped At	Max Grade
< 25	79	B-
< 50	82	B
< 70	92	A-
≥ 70	No cap	A+
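
A minimal sketch of the cap, assuming the rows are checked from the lowest safety band upward so that the tightest applicable cap is the one applied. The names SAFETY_FLOOR_CAPS and apply_safety_floor mirror the pseudocode elsewhere in this spec, but the tuple layout is an assumption.

```python
# (safety_below, max_composite) pairs from the Safety Floor table,
# ordered from the strictest band to the loosest.
SAFETY_FLOOR_CAPS = [(25, 79), (50, 82), (70, 92)]


def apply_safety_floor(composite_raw: float, safety_score: float) -> float:
    """Cap the composite according to the first safety band the score falls in."""
    for safety_below, max_composite in SAFETY_FLOOR_CAPS:
        if safety_score < safety_below:
            # The cap only ever lowers a score; it never raises one.
            return min(composite_raw, max_composite)
    return composite_raw  # safety >= 70: no cap


print(apply_safety_floor(95, 20))  # capped at 79
print(apply_safety_floor(95, 70))  # uncapped, stays 95
```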

Hysteresis Buffer

Prevents grade oscillation when a score sits near a boundary. A grade changes only once the score has moved at least 3 points past the boundary.

// Example: agent currently B+ (min 85)
// Score drops to 84 → stays B+ (within 3pt buffer)
// Score drops to 82 → downgraded to B (crossed 85 - 3 = 82)
HYSTERESIS_POINTS = 3

This prevents noisy signals from causing daily grade flips. An agent that deserves B+ keeps B+ through minor signal fluctuations.
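
One way the buffer could work is sketched below. The downgrade path follows the B+ example above. The upgrade path (requiring the score to clear the next grade's minimum by 3 points) is an assumption made by symmetry, since the spec only gives a downgrade example. The function names are hypothetical.

```python
HYSTERESIS_POINTS = 3

# (grade, min_score) pairs from the Grade Thresholds table, best grade first.
THRESHOLDS = [("A+", 97), ("A", 93), ("A-", 90), ("B+", 85), ("B", 80), ("B-", 75),
              ("C+", 70), ("C", 65), ("C-", 55), ("D", 40), ("F", 0)]


def raw_grade(score: float) -> str:
    """Grade the score would receive with no hysteresis."""
    return next(grade for grade, min_score in THRESHOLDS if score >= min_score)


def grade_with_hysteresis(score: float, current_grade: str) -> str:
    """Keep the current grade unless the score is >= 3 points past the boundary."""
    grades = [grade for grade, _ in THRESHOLDS]
    mins = dict(THRESHOLDS)
    candidate = raw_grade(score)
    cur, new = grades.index(current_grade), grades.index(candidate)
    if new == cur:
        return current_grade
    if new > cur:
        # Candidate downgrade: the boundary is the current grade's minimum.
        crossed = score <= mins[current_grade] - HYSTERESIS_POINTS
    else:
        # Candidate upgrade (assumed symmetric): boundary is the next grade up.
        crossed = score >= mins[grades[cur - 1]] + HYSTERESIS_POINTS
    return candidate if crossed else current_grade


print(grade_with_hysteresis(84, "B+"))  # B+ (inside the buffer)
print(grade_with_hysteresis(82, "B+"))  # B  (85 - 3 = 82 reached)
```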

Confidence Tiers

Every rating carries a confidence label based on signal coverage — what percentage of the six categories have real data vs. defaults.

Label	Signal Coverage	Meaning
High	≥ 80%	Most signals current and verified
Medium	≥ 40%	Some missing or stale — rating produced with caveat
Insufficient	< 40%	N/R (Not Rated) — no score published

Phase 2-only agents (identity + age signals = ~33% coverage) receive N/R rather than a misleading D or F grade. Phase 3-enriched agents with on-chain feedback typically reach 67-83% coverage.
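
The tier lookup can be sketched as follows, assuming coverage is simply the number of categories with real data divided by the six categories the text above counts. The function name and the default of 6 are assumptions for this example.

```python
def confidence_label(categories_with_data: int, total_categories: int = 6) -> str:
    """Map signal coverage to a confidence tier using the published cutoffs."""
    if not 0 <= categories_with_data <= total_categories:
        raise ValueError("categories_with_data out of range")
    coverage = categories_with_data / total_categories
    if coverage >= 0.80:
        return "High"
    if coverage >= 0.40:
        return "Medium"
    return "Insufficient"  # published as N/R, with no score


# Phase 2-only agent: identity + age signals, 2 of 6 categories (~33%).
print(confidence_label(2))  # Insufficient
# Phase 3-enriched agents: 4 or 5 of 6 categories (~67% to ~83%).
print(confidence_label(4))  # Medium
print(confidence_label(5))  # High
```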

Composite Score Formula

The composite score is the weighted sum of category scores, with safety floor enforcement and optional sybil deflation applied afterward.

// Step 1: Weighted sum
composite_raw = sum(category_score[i] * weight[i] / 100)
    for i in [EV, OR, CQ, AI, SR, CE, SP]

// Step 2: Safety floor enforcement (tightest matching cap wins)
composite = composite_raw
for cap in SAFETY_FLOOR_CAPS:
    if safety_score < cap.safety_below:
        composite = min(composite, cap.max_composite)

// Step 3: Sybil deflation (if applicable)
if sybil_ratio > 0 and sybil_deflation_applied:
    community_score *= (1 - sybil_ratio * deflation_weight * 0.8)
    // deflation_weight: 0.9 confirmed, 0.7 auto, 0.4 borderline

// Step 4: Hysteresis check
if |composite - grade_boundary| < 3:
    keep_current_grade()

// Step 5: Grade assignment
grade = first threshold where composite >= min_score

Sybil Detection Engine

Autonomous agents receive on-chain feedback through the ERC-8004 Reputation Registry. Some wallets submit high volumes of feedback across many agents to artificially inflate scores. Treebeard detects and deflates these.

How It Works

Wallet scanning: During enrichment, per-wallet feedback counts are stored for every agent. The detector aggregates these across the full index.
Rule-based detection: Wallets exceeding volume or concentration thresholds are flagged automatically with a confidence score.
Score deflation: For flagged agents, the Community & Ecosystem category score is reduced proportionally to the sybil feedback ratio.
Admin confirmation: Analysts can manually confirm or dismiss flagged wallets. Confirming a wallet upgrades its confidence to 0.9 (confirmed).

Detection Thresholds

Rule	Threshold	Confidence
Volume — agents targeted	> 1,000 agents	0.7 (auto)
Volume — total feedbacks	> 5,000 feedbacks	0.7 (auto)
Concentration — single agent	> 50% of feedbacks to one agent	0.4 (borderline)
Admin confirmed	Manual review	0.9 (confirmed)
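
The rules in this table could be applied per wallet roughly as sketched below. Several things here are assumptions: the WalletStats fields, the reading of the concentration rule as "more than 50% of a wallet's feedbacks go to one agent", and the choice to return the highest confidence when several rules match. The spec lists no minimum volume for the concentration rule, so none is applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WalletStats:
    """Hypothetical per-wallet aggregates across the full index."""
    agents_targeted: int
    total_feedbacks: int
    max_feedbacks_to_one_agent: int
    admin_confirmed: bool = False


def sybil_confidence(w: WalletStats) -> Optional[float]:
    """Return the flag confidence for a wallet, or None if no rule fires."""
    if w.admin_confirmed:
        return 0.9  # manual review
    if w.agents_targeted > 1000 or w.total_feedbacks > 5000:
        return 0.7  # volume rules (auto)
    if w.total_feedbacks > 0:
        concentration = w.max_feedbacks_to_one_agent / w.total_feedbacks
        if concentration > 0.5:
            return 0.4  # concentration rule (borderline)
    return None


print(sybil_confidence(WalletStats(1200, 3000, 10)))  # 0.7
print(sybil_confidence(WalletStats(3, 100, 51)))      # 0.4
```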

Deflation Formula

// sybil_ratio = flagged_feedbacks / total_feedbacks for this agent
// deflation_weight = max confidence of flagged wallets targeting this agent
// 0.8 = global dampening factor (conservative; caps max deflation at 80%)
deflated_community = community_raw * (1 - sybil_ratio * deflation_weight * 0.8)

// Example: agent with 60% sybil feedback, auto-detected (weight 0.7):
//   community_raw = 75
//   deflated = 75 * (1 - 0.60 * 0.70 * 0.80) = 75 * 0.664 = 49.8
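
The formula translates directly into a small function. This is a sketch: the function and parameter names are assumptions, and returning the raw score when an agent has no feedback at all is a choice made here rather than something the spec states.

```python
def deflate_community(community_raw: float, flagged_feedbacks: int,
                      total_feedbacks: int, deflation_weight: float,
                      dampening: float = 0.8) -> float:
    """Reduce the Community & Ecosystem score in proportion to sybil feedback."""
    if total_feedbacks <= 0:
        return community_raw  # assumption: nothing to deflate without feedback
    sybil_ratio = flagged_feedbacks / total_feedbacks
    return community_raw * (1 - sybil_ratio * deflation_weight * dampening)


# The worked example above: 60% sybil feedback, auto-detected (weight 0.7).
print(round(deflate_community(75, 60, 100, 0.7), 1))  # 49.8

# Worst case with the published weights: 100% sybil feedback, admin
# confirmed (0.9) gives a 0.9 * 0.8 = 72% reduction.
print(round(deflate_community(75, 100, 100, 0.9), 1))  # 21.0
```

With the published confidence values, the largest reduction actually reachable is 72%, which sits inside the 80% ceiling set by the dampening factor.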

What We Don't Score (Yet)

Transparency means acknowledging gaps, not hiding them. These signals are on the roadmap but not yet implemented. Listing them here prevents false confidence in the current model.

Cross-agent dependency risk

Agent A depends on Agent B. If B fails, A fails. This cascading risk isn't captured yet.

Real-time operational monitoring

Uptime, latency, error rates from live probing. Currently using static signals only.

ENS identity correlation

Linking agent wallets to ENS names for identity verification and trust signal enrichment.

Adversarial robustness testing

Active red-team probing of agent endpoints. We currently assess robustness via documentation and architecture review only.

Multi-model evaluation

Testing agent behavior across different LLM backends. Agents may behave differently with GPT-4 vs Claude vs open-source models.

Economic sustainability depth

TVL, revenue, runway analysis for agents with token economics. Currently using proxy signals from on-chain activity volume.

Changelog

Algorithm changes require 30-day advance notice and published rationale. All historical versions remain available for reference.

Version	Date	Changes
v2.3	April 2026	Sybil Detection Engine: wallet volume/concentration scanning, proportional score deflation, admin confirmation workflow. feedback_wallets stored during enrichment for cross-agent analysis.
v2.2	March 2026	Phase 3 enrichment via The Graph ERC-8004 Reputation subgraph. Feedback-boosted Code Quality formula. Economic Viability formula updated for Phase 3 agents. Reviewer wallet age signal (PRD v4.50).
v2.1	March 2026	Security Posture added as 7th signal category (weight minimized pending signal availability). agentURI parsing: display_name, agent_description, service_type badges. Multi-chain coverage expanded to 17 EVM chains + Solana.
v2.0	February 2026	Bayesian continuous scoring replaces discrete bucket model. Confidence tiers. Archetype-dependent weight profiles (10 agent types). Hysteresis buffer for grade stability.
v1.0	February 2026	Initial release. Six signal categories, uniform weighting, letter grade + numeric scoring (0-100), safety floor. ERC-8004 Identity Registry crawler.