Theory Delta Findings

Empirical findings on the agentic tool landscape. What tools actually do, not what docs claim.
https://theorydelta.com/

Mcp Auth Header Client Split
https://theorydelta.com/findings/mcp-auth-header-client-split/
The MCP Streamable HTTP spec does not mandate Authorization header passthrough. Claude Code supports it; VS Code Copilot does not. MCP servers that gate write tools on Authorization Bearer are therefore effectively Claude Code-only: from other clients, the tools fail silently with no diagnostic at the client.
Wed, 22 Apr 2026 00:00:00 GMT

Structured Generation Hidden Complexity Thresholds
https://theorydelta.com/findings/structured-generation-hidden-complexity-thresholds/
Every major provider advertises mathematical or near-guaranteed schema adherence, but each implementation has undocumented complexity thresholds, ordering sensitivities, and silent failure modes that make the guarantee conditional in ways the documentation does not disclose.
Tue, 21 Apr 2026 00:00:00 GMT

Playwright Mcp Http Session Destruction
https://theorydelta.com/findings/playwright-mcp-http-session-destruction/
playwright-mcp sessions are destroyed on every HTTP transport call in containerized environments. Teams testing locally via Claude Code see reliable multi-step workflows, then cloud deployment silently resets browser state on every call.
Sun, 19 Apr 2026 00:00:00 GMT

Agent Frameworks Diverging Not Converging
https://theorydelta.com/findings/agent-frameworks-diverging-not-converging/
The multi-agent framework landscape has fractured into tiers with distinct maturity profiles rather than converging, and every major framework carries confirmed production failure modes that docs don't warn about.
Sun, 29 Mar 2026 00:00:00 GMT

Agent Security Landscape Four Tiers
https://theorydelta.com/findings/agent-security-landscape-four-tiers/
Agent security has split into four architecturally distinct tiers. No single tool spans more than one, making "agent security" a misleading single category.
Sun, 29 Mar 2026 00:00:00 GMT

Agentic Error Suppression
https://theorydelta.com/findings/agentic-error-suppression/
Error suppression that a human would notice and investigate leaves an agent with exit code 0 and no signal to stop. The Tools Fail paper measured a 76-point accuracy drop under silent failures.
Sun, 29 Mar 2026 00:00:00 GMT

Chromadb Ram Ceiling Data Loss
https://theorydelta.com/findings/chromadb-ram-ceiling-data-loss/
ChromaDB's single-node RAM-bound architecture makes it a prototyping tool with a hard scaling ceiling: LRU eviction fails, collection deletion doesn't free memory, and connection leaks cause container OOM.
Sun, 29 Mar 2026 00:00:00 GMT

Langgraph Checkpoint Serialization Silent Loss
https://theorydelta.com/findings/langgraph-checkpoint-serialization-silent-loss/
LangGraph checkpoint serialization fails silently for non-primitive types: round-trips are lossy with no exception raised, across four documented modes since Jan 2026.
Sun, 29 Mar 2026 00:00:00 GMT

Swe Bench Retired Benchmark Gap
https://theorydelta.com/findings/swe-bench-retired-benchmark-gap/
SWE-bench Verified was retired in Feb 2026 after 59% of its test cases were found flawed; scores overstated production reliability by 20-30 percentage points.
Sun, 29 Mar 2026 00:00:00 GMT

Goose Default Config Security
https://theorydelta.com/findings/goose-default-config-security/
Goose's defaults (autonomous mode, no extension allowlist, disabled injection detection, 1000-turn ceiling) each remove a guardrail that would contain the others. The permissions fail-open bug was fixed in v1.26.2 (Mar 4, 2026), but the other four defaults remain.
Tue, 17 Mar 2026 00:00:00 GMT

Openai Agents Sdk Streaming Guardrails Broken
https://theorydelta.com/findings/openai-agents-sdk-streaming-guardrails-broken/
Streaming guardrails are architecturally broken by design; OpenAI has marked the fix NOT_PLANNED.
Tue, 17 Mar 2026 00:00:00 GMT

Claude Code Settings Attack Surface
https://theorydelta.com/findings/claude-code-settings-attack-surface/
Four CVEs and five supply-chain vectors share one pattern: project-scoped settings execute with user privileges before or without trust verification.
Sat, 14 Mar 2026 00:00:00 GMT

Claude Code Hooks Unreliable Enforcement
https://theorydelta.com/findings/claude-code-hooks-unreliable-enforcement/
Five categories of hook failures (silent non-firing, ignored decisions, platform breakage, data corruption, and architectural constraints) mean defense-in-depth across multiple events is required.
Tue, 03 Mar 2026 00:00:00 GMT

Llm Gateway Silent Failures
https://theorydelta.com/findings/llm-gateway-silent-failures/
LiteLLM's gateway features fail silently under production conditions: budget counters drift, guardrails pass, fallbacks route to dead providers. Every failure is traced to a public GitHub issue.
Sun, 01 Mar 2026 00:00:00 GMT

Agentic Rag Three Silent Failures
https://theorydelta.com/findings/agentic-rag-three-silent-failures/
Three silent failure modes in RAG pipelines (GraphRAG entity deduplication corruption, LangGraph edge routing data loss, and tool output leakage at step caps) all go undocumented by the frameworks.
Fri, 27 Feb 2026 00:00:00 GMT

Deepeval Exfiltrates Traces Via Otel Hijack
https://theorydelta.com/findings/deepeval-exfiltrates-traces-via-otel-hijack/
DeepEval hijacked the global OTel TracerProvider on import through v3.7.7-era, exfiltrating trace data to New Relic cloud regardless of configured backend; this was removed in v3.9.x. The Langfuse non-generation span gap is still current.
Fri, 27 Feb 2026 00:00:00 GMT

Graph Memory Self Hosted Not Production Ready
https://theorydelta.com/findings/graph-memory-self-hosted-not-production-ready/
Self-hosted graph memory is uniformly not production-ready: Graphiti has an undocumented async conflict, Mem0 OSS graph is OpenAI-only despite 47k stars, and the MCP memory server corrupts files under concurrent writes.
Fri, 27 Feb 2026 00:00:00 GMT

Agent Testing Non Deterministic Ci
https://theorydelta.com/findings/agent-testing-non-deterministic-ci/
Every LLM-as-judge eval framework reviewed produces non-deterministic CI results: the grading layer is non-deterministic by design, not just the model under test.
Wed, 25 Feb 2026 00:00:00 GMT

Mcp Supply Chain Security Institutionally Confirmed
https://theorydelta.com/findings/mcp-supply-chain-security-institutionally-confirmed/
Two enterprise acquisitions in 90 days upgraded MCP supply chain risk from "emerging" to institutionally confirmed, but mid-session server mutation remains an unmitigated gap.
Wed, 25 Feb 2026 00:00:00 GMT

Mcp Database Servers Security Bypass
https://theorydelta.com/findings/mcp-database-servers-security-bypass/
Every SQLite MCP database server reviewed uses startsWith("select") as its read-only guard, which is trivially bypassable with a semicolon or comment prefix.
Tue, 24 Feb 2026 00:00:00 GMT

Mcp Stateless Http Silent Feature Loss
https://theorydelta.com/findings/mcp-stateless-http-silent-feature-loss/
Stateless HTTP mode silently disables sampling and elicitation. This is a protocol-level constraint, not a library bug, with no spec fix before June 2026.
Tue, 24 Feb 2026 00:00:00 GMT
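
To make the startsWith("select") finding from the MCP database servers entry concrete, here is a minimal Python sketch of the semicolon bypass. It is not code from any reviewed server: `naive_is_read_only` and `run_with_naive_guard` are hypothetical names, the guard is a Python transliteration of the prefix check, and `executescript` stands in for any driver path that accepts stacked statements. The second half shows one possible mitigation, SQLite's `PRAGMA query_only`, which enforces read-only access in the engine rather than by inspecting the query text.

```python
import sqlite3


def naive_is_read_only(query: str) -> bool:
    # Hypothetical guard mirroring the startsWith("select") pattern.
    return query.strip().lower().startswith("select")


def run_with_naive_guard(conn: sqlite3.Connection, query: str) -> None:
    # Assumption: the server hands the raw string to an API that
    # accepts multiple statements (executescript does in Python).
    if not naive_is_read_only(query):
        raise PermissionError("write query rejected")
    conn.executescript(query)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")
conn.commit()

# The query starts with "select", so the guard passes it, but the
# statement after the semicolon deletes every row.
stacked = "SELECT 1; DELETE FROM users"
run_with_naive_guard(conn, stacked)
rows_after_bypass = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# One possible mitigation: let SQLite itself refuse writes.
conn.execute("INSERT INTO users VALUES ('bob')")
conn.commit()
conn.execute("PRAGMA query_only = ON")
try:
    conn.executescript(stacked)
    blocked = False
except sqlite3.OperationalError:
    # SQLite rejects the DELETE while query_only is enabled.
    blocked = True
rows_after_pragma = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Opening the database file with a read-only URI (`file:...?mode=ro`) is another engine-level option; either way the point is that the read-only property should not depend on string matching.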