FIELD GUIDE · AGENTIC TOOL LANDSCAPE

What agentic tools actually do — not what their docs claim.

Empirical intelligence for builders. We test what tools claim against what they actually do, then publish it — every claim traced to a primary source. Read it here; your agents read it via MCP. No vendor influence. No paywalled CVEs.

WHAT ARE YOU ABOUT TO DO?

I'M… 5 findings

Setting up MCP servers

row-level security bypass, session destruction on retry, default-open configs

See what's in the way →

I'M… 1 finding

Choosing an LLM gateway

silent failure modes — budget counters drift, fallbacks to dead providers, retry cost amplification

See what's in the way →

I'M… 4 findings

Picking an agent framework

streaming guardrails broken, frameworks diverging, hook unreliability, checkpoint serialization loss

See what's in the way →

I'M… 5 findings

Building a RAG pipeline

three silent failure modes, RAM ceiling data loss, structured generation thresholds, graph memory not production-ready

See what's in the way →

I'M… 2 findings

Evaluating a benchmark

SWE-bench Verified retired, agent CI is non-deterministic

See what's in the way →

I'M… 5 findings

Configuring agent autonomy

Claude Code hooks unreliable, settings attack surface, error suppression, default-open Goose configs, OTel trace exfiltration

See what's in the way →

FEATURED FINDING · APR 2026

All 55 findings →

The benchmark everyone cited was retired for being wrong.

YOU EXPECT

Vendor-reported SWE-bench Verified scores reflect production reliability, and the benchmark's cases are valid.

WHAT HAPPENS

The benchmark's authors retired it Feb 14. 295 of 500 cases were flawed. 14 vendors still cite the inflated scores.

WHAT IT MEANS FOR YOU

Any selection decision based on a public Verified score overestimates success on real tickets by 20–30 percentage points.

WHAT TO DO

Stop citing Verified scores in selection. Replicate one of your real tickets on the corrected subset, or use SWE-bench Live.

source-reviewed · independently-confirmed · confidence: high

17 sources · 9 gh-issues · 3 papers

Read the finding →

See the receipts ↗
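The figures in this finding can be checked with a few lines of arithmetic. The sketch below computes the flawed share from the 295-of-500 figure and applies the stated 20 to 30 point overestimate to a cited score. The helper name `adjusted_range` and the example score of 70.0 are illustrative assumptions, not part of the finding, and treating the overestimate as a flat subtraction is a simplification.

```python
# Figures quoted in the featured finding.
FLAWED_CASES = 295
TOTAL_CASES = 500
OVERESTIMATE_POINTS = (20.0, 30.0)  # stated range, in percentage points


def flawed_share(flawed: int = FLAWED_CASES, total: int = TOTAL_CASES) -> float:
    """Fraction of benchmark cases reported as flawed."""
    return flawed / total


def adjusted_range(cited_score: float) -> tuple[float, float]:
    """Hypothetical helper: subtract the stated overestimate from a cited
    Verified score, clamped at zero. A flat subtraction is an assumption."""
    low_cut, high_cut = OVERESTIMATE_POINTS
    return (max(cited_score - high_cut, 0.0), max(cited_score - low_cut, 0.0))


if __name__ == "__main__":
    print(f"flawed share: {flawed_share():.0%}")
    print("cited 70.0 ->", adjusted_range(70.0))
```

Read this way, a vendor's cited score is better treated as an upper bound than as a point estimate.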

WHAT THIS IS

A field guide for the agentic tool landscape — structured, opinionated knowledge about what tools actually do. Humans read it here; agents read it via MCP.

We test, we read the issue trackers, we run the tools. Then we publish what we found. Every claim is traced to a primary source or labelled as Theory Delta's own analysis. If a number doesn't come from a primary source, it doesn't appear.

BLOCKS

87 in corpus

Synthesised knowledge — claims, confidence, connections. The asset.

EVIDENCE RECORDS

142 receipts

Per-claim provenance. Source URL, what it actually says, verified date.

PUBLISHED FINDINGS

55 live

Trajectory-changing insight. What you expect, what happens, what to do.
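The three layers above suggest a simple data model. The following is a minimal sketch, not Theory Delta's actual schema: the class names, field names, and the `untraced_claims` helper are all assumptions made for illustration. It shows how a finding could carry its receipts, and how the "primary source or labelled as own analysis" rule could be checked mechanically.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvidenceRecord:
    """One receipt: per-claim provenance (field names are hypothetical)."""
    claim: str
    source_url: Optional[str]   # None when the claim is the publisher's own analysis
    what_it_says: str
    verified_date: str          # ISO date string
    own_analysis: bool = False


@dataclass
class Finding:
    """A published finding in the expect / happens / do shape."""
    finding_id: str
    you_expect: str
    what_happens: str
    what_to_do: str
    confidence: str
    receipts: List[EvidenceRecord] = field(default_factory=list)


def untraced_claims(finding: Finding) -> List[str]:
    """Return claims with neither a source URL nor an own-analysis label."""
    return [
        r.claim
        for r in finding.receipts
        if not r.source_url and not r.own_analysis
    ]
```

Under this sketch, a finding would be publishable only when `untraced_claims` returns an empty list.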

ENGINE PROVENANCE SURFACES

Public, checkable, and linked from the field guide.

TASK → FINDING PATH

Start with what you're about to do, then trace to findings mapped to each phase.

Browse task hubs →

FINDING → RECEIPTS

Each finding ships with publication metadata, evidence type, and linked receipt sections.

Browse findings →

RECEIPTS → PRIMARY SOURCES

The featured finding exposes source-linked receipts so claims can be checked line by line.

Open featured receipts ↗

FINDING → FACT-CHECK READOUT

Fact-check sessions publish corrections and open questions so updates stay auditable.

Open latest readout →

RECENT FINDINGS

Five we shipped this month

All 55 findings →

id | tool | delta | evidence | verified
0055 | OAuth RFC 8693 (IETF) | Multi-Agent OAuth Delegation Has No Enforcement Layer: RFC 8693 'act' Claims Are Advisory Only | medium | 2026-06-15
0054 | A2A Protocol Spec (Google | A2A Agent Card Skill Descriptions Are an Unprotected Injection Surface: 100% Exfiltration in Tested Scenarios | empirical | 2026-06-12
0053 | LocalAI (mudler | LocalAGI's 50% Tool-Call Failure Rate Is an Infrastructure Bug, Not a Model Problem | empirical | 2026-06-05
0052 | Claude Code (Anthropic) | Worktrees are not required for parallel Claude Code agents under active human steering | empirical | 2026-06-03
0051 | Claude Code (Anthropic) | Agent Config Dependencies Silently Cause Hallucination, Not Errors | empirical | 2026-06-01

FOR AGENTS

Your agent should query Theory Delta before the tool decision, not after.

Findings ship as structured JSON with confidence, evidence type, and source URLs. llms.txt and /.well-known/mcp.json are live for agent discovery.

HTTP · stable llms.txt · live /.well-known/mcp.json
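An agent consuming a finding might gate a tool decision on its confidence and source count. The sketch below parses a hypothetical payload: the key names (`confidence`, `evidence_type`, `source_urls`, `what_to_do`) are guesses based on the description above, not a published schema, and the `should_act_on` policy is illustrative only.

```python
import json

# Hypothetical payload: key names are assumed, not taken from a published schema.
SAMPLE_FINDING = """
{
  "id": "0000",
  "confidence": "high",
  "evidence_type": "empirical",
  "source_urls": ["https://example.com/issue/1", "https://example.com/issue/2"],
  "what_to_do": "placeholder recommendation"
}
"""


def should_act_on(finding: dict, min_sources: int = 1) -> bool:
    """Illustrative policy: trust only high-confidence findings with enough sources."""
    sources = finding.get("source_urls", [])
    return finding.get("confidence") == "high" and len(sources) >= min_sources


if __name__ == "__main__":
    finding = json.loads(SAMPLE_FINDING)
    if should_act_on(finding, min_sources=2):
        print("apply:", finding["what_to_do"])
    else:
        print("treat as a lead, verify before acting")
```

Running the check before the tool decision, rather than after, matches the ordering recommended above.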

~/.config/agent.json

{
  "mcpServers": {
    "theorydelta": {
      "type": "http",
      "url":  "https://api.theorydelta.com/mcp"
    }
  }
}
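If the config file already lists other servers, the entry can be merged in rather than pasted over the file. A minimal sketch follows: the helper name `add_mcp_server` is hypothetical, and it assumes the file is plain JSON with a top-level `mcpServers` object, as in the snippet above.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, url: str) -> dict:
    """Merge one HTTP MCP server entry into a JSON config, keeping existing entries."""
    # Start from the existing config if there is one, otherwise from an empty object.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"type": "http", "url": url}
    # Create the parent directory if needed, then write the merged config back.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    path = Path.home() / ".config" / "agent.json"
    add_mcp_server(path, "theorydelta", "https://api.theorydelta.com/mcp")
```

Back up the existing file first if other tools also write to it.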

$ td query "should I use LiteLLM as a budget gateway?"

→ 1 finding · confidence:high · 11 sources

→ what to do: budgets drift; verify counter behavior or use…