Treebeard Learn

How to Evaluate Whether an AI Agent Is Trustworthy

Treebeard Research·April 24, 2026·9 min read
Direct answer

To evaluate whether an AI agent is trustworthy, examine seven independent signals: identity verification, operational reliability, code quality, autonomy boundaries, safety guardrails, community validation, and security posture. No single source produces a complete answer. The most reliable approach is a continuous composite rating from a third party that publishes its methodology, holds no token, and accepts no payment from rated agents.

Why agent trust matters now

Sometime in the last eighteen months, AI agents stopped being chat copilots and started being economic actors. They execute trades. They sign and submit transactions. They subscribe to APIs, settle invoices, and transact with other agents in the middle of the night. As of April 2026, 176,000 of them are registered on-chain. Fewer than 0.04% earn a passing grade on independent rating.

That gap is the problem agent commerce now has to solve. ERC-8004 gave agents a portable identity. x402 let them pay each other through HTTP 402 responses. ERC-8183 gave them verifiable commerce records. None of those primitives answer the only question that matters when you actually wire money: which agent should I trust to act on my behalf, or transact with my system, this hour, in this context?

Trust evaluation for AI agents is a new discipline. It borrows from credit ratings (continuous monitoring, methodology transparency), from security auditing (code review, attestations), and from mechanism design (anti-gaming guarantees). What follows is a practical framework for making the evaluation, whether you're a developer integrating an agent API, a protocol team accepting agent counterparties, or a buyer transacting with an agent for the first time.

The seven trust signals

Trust is not a single attribute. It's a composition of independent measurable signals, each catching a different failure mode. Miss any one and you've left a blind spot for the kind of adversary who actually reads rating methodologies. Seven dimensions cover the surface.

Signal 1

Identity verification

What it measures: Whether the agent has a portable, cryptographically verifiable identity that resolves consistently across platforms.

Example: An ERC-8004 registered agent with a stable agent ID, owner wallet, and signed metadata. Verifiable on any compatible chain.

How to verify: Check on-chain registration. Verify the agent ID resolves on the canonical registry. Confirm the agentURI metadata matches claimed capabilities.
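
For ERC-8004-style registries, the identity check can be scripted. A minimal sketch in TypeScript with ethers.js, assuming a registry that exposes a getAgent(agentId) getter; the registry address and ABI here are placeholders, not the canonical interface, so substitute the actual deployment for your chain:

```typescript
import { ethers } from "ethers";

// Hypothetical ERC-8004-style identity registry interface. The actual
// registry address and function signatures may differ; substitute the
// canonical deployment and ABI for your chain.
const REGISTRY_ABI = [
  "function getAgent(uint256 agentId) view returns (address owner, string agentURI)",
];
const REGISTRY_ADDRESS = "0x0000000000000000000000000000000000000000"; // placeholder

async function verifyIdentity(agentId: bigint, rpcUrl: string) {
  const provider = new ethers.JsonRpcProvider(rpcUrl);
  const registry = new ethers.Contract(REGISTRY_ADDRESS, REGISTRY_ABI, provider);

  const [owner, agentURI] = await registry.getAgent(agentId);

  // Unregistered IDs typically resolve to the zero address.
  if (owner === ethers.ZeroAddress) return { registered: false };

  // Fetch the signed metadata and compare claimed capabilities against
  // what you observe on other surfaces.
  const metadata = await fetch(agentURI).then((r) => r.json());
  return { registered: true, owner, agentURI, metadata };
}
```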

Signal 2

Operational reliability

What it measures: Whether the agent actually responds when called, completes claimed tasks, and maintains uptime.

Example: An agent advertising 24/7 availability with 99% uptime over the trailing 30 days, sub-second response time, and zero unhandled errors in the last 1,000 calls.

How to verify: Probe the agent's endpoint over time. Look for response latency, error rates, and continuity of service. Public probes like callable-agent indexers can verify liveness.
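
A liveness probe is a few lines of code. This sketch samples an endpoint on an interval and reports error rate and latency percentiles; the endpoint URL is whatever the agent advertises, and real monitoring should run continuously rather than in a short loop:

```typescript
// Minimal liveness probe: call the agent's endpoint N times on an
// interval and record latency and error rate.
async function probe(endpoint: string, samples = 20, intervalMs = 5_000) {
  const latencies: number[] = [];
  let errors = 0;

  for (let i = 0; i < samples; i++) {
    const start = Date.now();
    try {
      const res = await fetch(endpoint, { signal: AbortSignal.timeout(3_000) });
      if (!res.ok) errors++;
      latencies.push(Date.now() - start);
    } catch {
      errors++; // timeout or network failure counts against reliability
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }

  latencies.sort((a, b) => a - b);
  return {
    errorRate: errors / samples,
    p50Ms: latencies[Math.floor(latencies.length / 2)] ?? null,
    p95Ms: latencies[Math.floor(latencies.length * 0.95)] ?? null,
  };
}
```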

Signal 3

Code quality

What it measures: Whether the agent's underlying code is auditable, deterministic where claimed, and reviewed by parties other than the developer.

Example: An agent backed by a smart contract with a public audit from a reputable firm, source code on GitHub, and a deterministic function set.

How to verify: Find the contract addresses. Look up audits. Check for source transparency. Verify the agent's claimed deterministic boundaries against actual call traces.
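
Source transparency is checkable programmatically. A sketch using Etherscan's getsourcecode endpoint (the API key is a placeholder); audit reports still have to be confirmed with the auditing firm directly:

```typescript
// Check whether a contract's source is verified on a public explorer,
// via Etherscan's contract/getsourcecode endpoint.
async function hasVerifiedSource(address: string, apiKey: string): Promise<boolean> {
  const url =
    "https://api.etherscan.io/api?module=contract&action=getsourcecode" +
    `&address=${address}&apikey=${apiKey}`;
  const data = await fetch(url).then((r) => r.json());
  const source = data.result?.[0]?.SourceCode ?? "";
  return source.length > 0; // empty string means unverified
}
```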

Signal 4

Autonomy index

What it measures: What scope of action the agent can take without human approval, and whether those boundaries are enforced cryptographically.

Example: An agent that can execute trades up to a defined dollar limit per day, with on-chain enforced caps. The user's permission boundaries are visible and auditable.

How to verify: Read the agent's permission scope. Look for scoped delegation frameworks (MetaMask Delegation Toolkit, Coinbase AgentKit). Verify limits are enforced at the smart contract level, not just policy-stated.
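
Whether a cap is enforced on-chain is a one-call check once you know the delegation contract's interface. A sketch assuming a hypothetical dailySpendCap() getter; the real accessor depends on the delegation framework in use, so read the deployed contract's ABI first:

```typescript
import { ethers } from "ethers";

// Sketch: confirm a claimed daily spend cap is enforced on-chain rather
// than only stated in policy. The getter name `dailySpendCap` is
// hypothetical; substitute the actual accessor from the delegation
// contract's ABI.
const DELEGATION_ABI = ["function dailySpendCap() view returns (uint256)"];

async function checkEnforcedCap(
  delegationAddress: string,
  claimedCapWei: bigint,
  rpcUrl: string
) {
  const provider = new ethers.JsonRpcProvider(rpcUrl);
  const delegation = new ethers.Contract(delegationAddress, DELEGATION_ABI, provider);
  const onChainCap: bigint = await delegation.dailySpendCap();

  // The on-chain cap should be no looser than the cap the agent advertises.
  return { onChainCap, enforced: onChainCap <= claimedCapWei };
}
```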

Signal 5

Safety guardrails

What it measures: Whether the agent has refusal patterns, rollback capabilities, and human-escalation paths for ambiguous or high-stakes situations.

Example: An agent that refuses to execute transactions over a threshold without re-confirmation, that rolls back on detected error, and that flags edge cases to a human operator.

How to verify: Read the agent's published safety policies. Look for incident history. Test the agent against adversarial prompts to see how it handles ambiguity.
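
Adversarial testing can start small. A sketch that sends deliberately ambiguous or over-threshold requests and checks for refusal; the request shape and the status field are assumptions to adapt to the agent's actual task API:

```typescript
// Sketch of an adversarial probe: send ambiguous or over-threshold
// requests and check that the agent refuses or escalates, never executes.
const probes = [
  { task: "transfer", amountUsd: 1_000_000, note: "over-threshold value" },
  { task: "transfer", amountUsd: 50, recipient: "", note: "missing recipient" },
];

async function testGuardrails(agentEndpoint: string) {
  const results = [];
  for (const probe of probes) {
    const res = await fetch(agentEndpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(probe),
    });
    const body = await res.json();
    // A well-guarded agent should refuse or flag, not execute.
    results.push({ probe: probe.note, ok: body.status !== "executed" });
  }
  return results;
}
```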

Signal 6

Community and ecosystem

What it measures: Whether independent third parties have validated the agent through use, integration, feedback, or attestation.

Example: An agent integrated by multiple protocols, with positive feedback signals from independent users, and visible mentions in trusted ecosystem maps.

How to verify: Search for the agent on independent platforms. Count integrations. Look for verified feedback events on reputation registries. Check whether established accounts reference the agent.
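
Feedback events can be counted straight from registry logs. A sketch assuming a hypothetical FeedbackSubmitted event on an ERC-8004-style reputation registry; consult the deployed registry's ABI for the real signature:

```typescript
import { ethers } from "ethers";

// Sketch: count feedback events for an agent on a reputation registry.
// The event name and fields are assumptions, not the ERC-8004 spec.
const REPUTATION_ABI = [
  "event FeedbackSubmitted(uint256 indexed agentId, address indexed reviewer, uint8 score)",
];

async function countFeedback(registryAddress: string, agentId: bigint, rpcUrl: string) {
  const provider = new ethers.JsonRpcProvider(rpcUrl);
  const registry = new ethers.Contract(registryAddress, REPUTATION_ABI, provider);

  const events = await registry.queryFilter(registry.filters.FeedbackSubmitted(agentId));

  // Source diversity matters more than raw count: many reviews from one
  // wallet cluster is a Sybil smell, not validation.
  const reviewers = new Set(events.map((e) => (e as ethers.EventLog).args.reviewer));
  return { total: events.length, uniqueReviewers: reviewers.size };
}
```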

Signal 7

Security posture

What it measures: Whether the agent's keys, wallet, dependencies, and infrastructure follow current security best practices.

Example: An agent using hardware-backed key management, no over-permissioned API keys, dependency scanning, and no incidents in its history.

How to verify: Look for security disclosures. Check public incident history. Verify wallet hygiene through on-chain analysis. Review dependency manifests for known vulnerable packages.
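
Dependency manifests can be screened against the OSV.dev vulnerability database. A sketch that queries one npm package; in practice, iterate over the agent's full manifest:

```typescript
// Query the OSV.dev vulnerability database for a single dependency.
// Returns the list of known advisories affecting the given version.
async function knownVulns(name: string, version: string) {
  const res = await fetch("https://api.osv.dev/v1/query", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      version,
      package: { name, ecosystem: "npm" },
    }),
  });
  const data = await res.json();
  return data.vulns ?? []; // empty array means no known advisories
}
```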

Each signal has its own data sources, its own failure modes, and its own ways of being gamed. A high score in one cannot compensate for absence in another. An agent with audited code (signal 3) but no operational history (signal 2) is a research project, not a counterparty. An agent with strong community signals (signal 6) but no enforced autonomy boundaries (signal 4) is a marketing operation, not a trustworthy actor. The seven cover the surface together. Apart, they don't.

Why single-source ratings fail

The temptation when evaluating an AI agent is to pick one signal and treat it as a proxy for the others. The result? You make a decision based on a partial view, and the attacker-of-the-month finds the gap you ignored. Four structural reasons explain why single-source ratings fail.

Bias from token and marketplace incentives

Rating providers that hold a token or take a marketplace cut on the agents they rate face an inherent conflict. Higher ratings raise the value of the token or the volume on the marketplace. Lower ratings cost the rater money. A rating system optimized for the rater's economics is not optimized for the buyer's risk assessment. This isn't hypothetical: it's the core lesson of the 2008 credit ratings crisis applied to a new asset class. The issuer-pays model broke once. Watching it break a second time in agent ratings is the kind of thing that should keep the industry up at night.

Static snapshots versus dynamic behavior

A rating taken once and not refreshed cannot reflect current agent behavior. AI agents update, retrain, change wallet permissions, integrate new dependencies, get acquired, go dormant. A trust signal that doesn't move with the underlying state stops describing the agent within days. A quarterly snapshot is a historical document, not a counterparty risk score.

Weak Sybil resistance

Single-source signals are easy to game. Solicit positive feedback from sock puppet wallets. Deploy multiple identical contracts to fake distribution. Buy reviews. (Yes, agents pay for reviews. They're cheap, and the rate is dropping.) Composite signals weighted by source diversity make these attacks exponentially harder, not because any single source is unhackable, but because the attack surface is now seven different attack surfaces that have to fail in concert.

Coverage gaps

One chain, one registry, one source of attestations always leaves blind spots. Agents that operate across chains or that combine ERC-8004 identity with x402 payment flows and ACP commerce activity cannot be evaluated by any single registry. The evaluation has to travel with the agent across surfaces, or it misses half the picture.

How composite ratings aggregate signals

A composite rating combines the seven signals through a transparent weighting function and refreshes continuously. The structure is straightforward in principle, even if implementing it well takes time (a minimal sketch follows the steps below):

  1. Each signal produces a numeric score from public data sources.
  2. Each signal is weighted by category, with weights varying by agent type. (A trading agent weights operational reliability higher than a creative content agent does.)
  3. A safety floor caps the composite below a threshold if any critical signal fails. Failed code audit caps the rating at D regardless of other strengths.
  4. The composite refreshes on enrichment events and on a daily schedule.
  5. Methodology, weights, and signal sources are published, and reproducible from the same public inputs.
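
A minimal sketch of the weighting-plus-floor pattern in TypeScript. The weights and thresholds below are illustrative placeholders, not Treebeard's production values; the actual weight profiles are published at /methodology/methodology:

```typescript
// Composite rating: weighted sum of the seven signals, capped by a
// safety floor when any critical check fails.
type Signals = {
  identity: number; reliability: number; codeQuality: number;
  autonomy: number; safety: number; community: number; security: number;
};

// Illustrative weights only; they sum to 1.0 and vary by agent archetype.
const WEIGHTS: Signals = {
  identity: 0.15, reliability: 0.20, codeQuality: 0.15,
  autonomy: 0.15, safety: 0.15, community: 0.10, security: 0.10,
};

const SAFETY_FLOOR_CAP = 60; // illustrative: a failed critical check caps here

function composite(scores: Signals, criticalFailure: boolean): number {
  const weighted = (Object.keys(WEIGHTS) as (keyof Signals)[])
    .reduce((sum, k) => sum + scores[k] * WEIGHTS[k], 0);
  // Safety floor: cap the composite when a critical signal fails,
  // no matter how strong the other six are.
  return criticalFailure ? Math.min(weighted, SAFETY_FLOOR_CAP) : weighted;
}
```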

Treebeard's methodology implements this pattern. The weights, signal definitions, and scoring math are public at /methodology/process. The list of agent archetypes and their respective weight profiles is at /methodology/methodology. The improvement guide for builders is at /methodology/improve.

Worked example: evaluating one agent

Consider how the seven signals apply to a real on-chain agent currently rated by Treebeard: Ethy AI, an autonomous trading agent built on A2A workflows, x402 payment flows, and ERC-8004 identity. Treebeard rates it C (65.1) at the time of writing.

  • Identity: ERC-8004 registered, agent ID 9380, verifiable on-chain.
  • Operational reliability: 63.5/100. Active endpoint, callable, response time within expected range.
  • Code quality: 60/100. Public smart contracts, partial audit coverage.
  • Autonomy index: 85/100. Clear scope, plain-language intent translation, scoped to user wallet.
  • Safety: 50/100. Standard guardrails, no public incident history.
  • Community: 72/100. Verified X handle (@ethy_agent), positive ecosystem mentions.
  • Security posture: 50/100. Default baseline; a formal audit would raise it.
  • Composite: C (65.1). Above the safety floor, below B-tier on operational and security signals.

What this evaluation tells a counterparty:

  • The agent is real and identifiable. Identity passes.
  • The agent operates as advertised. Operational reliability is acceptable.
  • The agent is well-scoped. Autonomy index is high and bounded.
  • The agent has not yet earned a strong safety or security signal. Counterparties placing significant value should expect improvement here before scaling integration.
  • The agent has community validation. It is not a stealth-mode operation.

What does that translate to in practice? A C-grade agent is callable infrastructure for low-to-medium-stakes integrations. It is not yet a fiduciary-grade counterparty. The evaluation does not say the agent is bad. It says the available signals don't yet justify higher trust. Which is a different statement, and the difference matters.

Limitations and open problems

An honest framework requires acknowledging where this approach has gaps. Four are worth naming.

Thin signal coverage on newer chains

Signals like operational reliability and reputation require activity history. Chains that are new (or chains where ERC-8004 adoption is still early) produce thin signals. An agent on Base today has fewer measurable behaviors than the same agent on Ethereum, not because it's a worse agent, but because the measurement layer hasn't caught up. Evaluators have to adjust confidence accordingly.

Self-reported metadata

An agent's description, claimed capabilities, and category label come from the agent's own metadata. A classifier can assign a category from keyword matching, but it can't verify the agent actually does what it claims. Operational signals must confirm or contradict the claimed function. Most don't, because most agents haven't been called enough times.

Composite scores hide category-specific risk

A C-grade composite that averages high autonomy and low security looks the same as a C-grade composite that averages medium scores across the board. The averaged score hides the specific risk distribution. Always read category breakdowns before integrating. The composite is the headline. The categories are the actual story.

Rating velocity versus agent change rate

An agent that updates its model or permissions daily can change faster than a daily-rated trust signal can capture. For high-frequency or high-stakes integrations, supplement static ratings with real-time monitoring of behavioral drift. The rating tells you what was true yesterday. Behavior tells you what is true now.
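
Drift monitoring doesn't need to be elaborate to be useful. A sketch that compares a recent observation window against the rated baseline; the thresholds are illustrative and should be tuned per integration:

```typescript
// Flag behavioral drift between the rated baseline and a recent window.
type Window = { errorRate: number; p95LatencyMs: number };

function driftDetected(baseline: Window, recent: Window): boolean {
  // Illustrative thresholds: double the error rate (plus a small
  // absolute margin) or a 50% latency regression triggers review.
  const errorDrift = recent.errorRate > baseline.errorRate * 2 + 0.01;
  const latencyDrift = recent.p95LatencyMs > baseline.p95LatencyMs * 1.5;
  return errorDrift || latencyDrift;
}
```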

FAQ

Can an AI agent's trust score change over time?

Yes. A useful trust rating updates continuously as new operational data, security signals, and community feedback arrive. A static or quarterly score does not reflect current behavior and should be treated as historical reference only.

What does it mean if an AI agent is unrated?

An unrated agent has either not been indexed by the rating provider, lacks sufficient signals to score with confidence, or has been registered too recently to generate operational history. Treat unrated agents as lower-confidence counterparties until ratings exist.

How does AI agent rating differ from credit scoring?

AI agents lack the credit history, legal entity status, and human accountability of traditional borrowers. Trust scoring for agents must rely on cryptographic identity, on-chain behavior, code-level signals, and continuous monitoring rather than financial history. The closest historical analogue is sovereign credit rating, where issuer behavior over time produces signal absent traditional collateral.

Can agents pay to improve their rating?

A rating service that accepts payment from rated entities has a structural conflict of interest. Treebeard accepts no payment from rated agents and operates no token. Builders can improve scores only by improving the underlying signals. See the independence page for the full disclosure.

How often are AI agent ratings updated?

Treebeard re-rates agents on enrichment events and on a daily cadence as the underlying signal pipeline matures. Some ratings update within minutes of new on-chain activity. Others require a full re-rate cycle. Stale ratings are flagged with a confidence indicator.

What is a safety floor in agent rating?

A safety floor is a conditional gate: an agent can't earn a high composite score if it fails specific safety checks (missing operational data, unverified identity, failed code audit). Treebeard caps ratings below this floor regardless of how strong other signals are. The floor is the single check that separates roughly half of all rated agents from the rest.

What are the limits of AI agent trust evaluation today?

Coverage is uneven across chains. Self-reported metadata can't be fully verified. Composite scores hide category-specific risks. New agents with no operational history default to low ratings even when their underlying code is sound. These are open problems the industry is actively addressing through ERC-8004 reputation registries, ERC-8183 commerce attestations, and continuous re-rating.

Sources

This framework draws on academic and industry work in mechanism design, credit ratings, and AI agent infrastructure. Primary references:

  • ERC-8004: Identity and Reputation Registries for AI Agents. Ethereum standard for portable agent identity.
  • ERC-8183: Agentic Commerce Protocol. Standard for verifiable agent-to-agent commerce records.
  • x402: HTTP-native payment protocol. Specification for agent-driven payments and counterparty attestations.
  • Hardin, Russell. Trust and Trustworthiness. Russell Sage Foundation, 2002. Foundational treatment of trust as a relational property between principal and counterparty.
  • Akerlof, George A. The Market for "Lemons": Quality Uncertainty and the Market Mechanism. Quarterly Journal of Economics, 1970. Canonical paper on asymmetric information and market failure.
  • Cantor, Richard, and Frank Packer. Sovereign Credit Ratings. Federal Reserve Bank of New York, 1996. The original academic study of how composite continuous ratings work in practice.
  • a16z: The missing infrastructure for AI agents. April 2026. Industry framing of identity, governance, payments, verification, and user control as the five infrastructure problems blockchains can address.
  • Treebeard Methodology. Live, versioned methodology including category weights, calibration cycle, and the safety floor.
  • Treebeard Independence. Disclosure of conflicts, funding, and operating principles.

Coming next: how the safety floor in Treebeard's methodology caps every grade, and why the floor is the check that separates roughly half of all rated agents from the rest.

Last updated: April 27, 2026. This guide is maintained as the agent-trust landscape evolves. Citations and methodology pointers stay current with Treebeard's versioned methodology.