Your agent framework choice locks you into undocumented failure modes the docs won’t mention


From Theory Delta | Methodology | Published 2026-03-29 | Updated 2026-04-20

You’re picking an agent framework — LangGraph, CrewAI, OpenAI Agents SDK, Pydantic AI, smolagents. The docs present them as broadly substitutable general-purpose orchestration layers. Microsoft’s merger of AutoGen and Semantic Kernel gets cited as evidence the ecosystem is consolidating. Framework comparison guides suggest the choice is mostly about developer preference.

Here’s what will actually derail your plans.

What you expect

Any major framework will run reliably on non-OpenAI LLM providers, sandbox user-generated code safely, stream output with content filtering, and provide stable APIs from release to release. Microsoft consolidating AutoGen means less fragmentation, not more migration risk.

What actually happens

The landscape has fractured into tiers with fundamentally different maturity profiles, and every major framework carries confirmed production failure modes that are not prominently documented. The one apparent consolidation is a corporate forced migration, not ecosystem convergence.

Your CrewAI workflow silently fabricates tool results with non-OpenAI models

Issue #3154 (closed as not-planned): With non-OpenAI models, CrewAI agents generate plausible fake Observation output without executing tools at all. Phoenix traces confirm zero tool activity — the LLM produces the pattern of a tool result without triggering the tool. Two fix PRs (#3378, #4077) remain open and unmerged as of Q1 2026.

What this means for you: If you chose CrewAI for its “any LLM provider” framing and you’re not on OpenAI, your tool execution is unverified. Your logs look correct; only tracing (Phoenix or similar) reveals that no tools fired. Any workflow that acts on tool output — API calls, file writes, data retrieval — is acting on fabricated results.
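One mitigation is to verify execution out-of-band. A minimal, framework-agnostic sketch (the wrapper, tool, and log names here are illustrative, not CrewAI APIs): record every real tool invocation in an audit log, and refuse to act on any Observation that has no matching record.

```python
import functools

execution_log = []  # one entry per *actual* tool invocation

def verified(tool_fn):
    """Wrap a tool function so every real call leaves an audit record."""
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        result = tool_fn(*args, **kwargs)
        execution_log.append({"tool": tool_fn.__name__, "result": result})
        return result
    return wrapper

@verified
def fetch_price(ticker: str) -> float:
    # stand-in for a real API call
    return 101.5

def assert_tool_ran(tool_name: str) -> None:
    """Fail loudly if an Observation claims a tool ran but no
    execution record exists (the #3154 fabrication pattern)."""
    if not any(e["tool"] == tool_name for e in execution_log):
        raise RuntimeError(f"Observation for {tool_name!r} has no execution record")

# Usage: after the agent run, before acting on any Observation:
fetch_price("ACME")
assert_tool_ran("fetch_price")   # passes: the tool really ran
```

The point of the wrapper is that the log is written by your code, not by the model, so a fabricated Observation cannot forge it.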

Your LangGraph graph will loop forever without any warning

LangGraph provides no automatic cycle prevention. A documented production case generated 11 revision cycles burning $4 in API calls before a manual cap was applied. The framework will not stop this. Builders must add revision_count < N state counters manually to every cyclic graph. This is not in the main LangGraph documentation.

What this means for you: Every cyclic topology you build is a runaway cost risk until you add explicit loop guards. If you assumed the framework would handle this, your first production incident will be an unexpected bill.
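The guard itself is small. In real LangGraph, `should_continue` below is the kind of function you would pass to `add_conditional_edges`; the node and state shapes here are illustrative, and the loop simulates the cycle without importing the framework:

```python
MAX_REVISIONS = 3   # hard cap; tune to your cost tolerance

def revise(state: dict) -> dict:
    """A self-looping node; every pass must bump the counter."""
    return {**state, "revision_count": state.get("revision_count", 0) + 1}

def should_continue(state: dict) -> str:
    """Conditional-edge guard the framework will not add for you:
    force an exit once the revision budget is spent."""
    if state.get("revision_count", 0) >= MAX_REVISIONS:
        return "end"       # exit regardless of model output
    return "revise"        # allow another pass

# Simulate the cycle that would otherwise run unbounded:
state = {"revision_count": 0}
while should_continue(state) == "revise":
    state = revise(state)
# The guard stops the loop after MAX_REVISIONS passes.
```

The counter lives in graph state, so it survives checkpointing and restarts; a counter held in a Python local would not.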

Your OpenAI Agents SDK streaming output has no content filtering path

The SDK’s guardrail system runs after the full response is assembled. It is architecturally incompatible with streaming responses. This is marked NOT_PLANNED by maintainers — there is no workaround within the SDK. Builders who need content filtering on streamed output must implement it outside the SDK, at the transport layer.

Additional active failure categories (March 2026):

  1. Handoff incompatibility with server-managed conversations (Issue #2151, targeted for 0.12.x — unfixed)
  2. Tracing infrastructure unreliable: spans silently dropped in long-running workers (#2135), spans not displaying despite successful export (#2477), large integer arguments render incorrectly (#2094)
  3. Dynamic tool loading throws ModelBehaviorError for missing tools (Issue #2646)

Patched: A fork-after-thread deadlock that crashed gunicorn/uWSGI workers with preload=true was fixed on Feb 17 2026. If you’re on a pre-patch release with preload enabled, upgrade.

v0.14.0 (April 2026): Sandbox Agents introduced as a new execution surface. API is actively expanding — v0.14.2 added granular filesystem path grants for sandbox access. Deployment decisions made at v0.14.0 should be revisited as the API stabilizes.

What this means for you: If your architecture depends on guardrails filtering streamed output, that path does not exist and will not exist. If you’re using gunicorn with preload, check your SDK version before your next deployment.
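Transport-layer filtering for streamed output can be approximated by buffering: hold back enough trailing characters that a banned term split across chunk boundaries is still caught before anything is emitted. A minimal sketch under assumed requirements (the deny-list and redaction policy are placeholders):

```python
from typing import Iterator

BANNED = ["secret_token"]                    # illustrative deny-list
HOLDBACK = max(len(t) for t in BANNED) - 1   # worst-case partial term

def filtered_stream(chunks: Iterator[str]) -> Iterator[str]:
    """Redact banned terms from a chunk stream, including terms that
    straddle chunk boundaries, by holding back len(term)-1 characters."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        for term in BANNED:
            buf = buf.replace(term, "[redacted]")
        if len(buf) > HOLDBACK:
            cut = len(buf) - HOLDBACK
            yield buf[:cut]      # safe to emit: no partial term here
            buf = buf[cut:]      # keep the tail that might complete a term
    for term in BANNED:          # flush the final buffer
        buf = buf.replace(term, "[redacted]")
    if buf:
        yield buf
```

The cost is a small added latency (at most `HOLDBACK` characters of lag), which is the structural trade-off of filtering a stream rather than a completed response.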

Your smolagents sandbox is bypassable today

NCC Group published a working proof-of-concept demonstrating that LocalPythonInterpreter — smolagents’ default sandboxing for local code execution — is bypassable via numpy/pandas import paths. The sandbox blocks direct import os but does not prevent executing shell commands through numpy’s C extension layer.

What this means for you: If you’re using smolagents to execute user-controlled code and relying on LocalPythonInterpreter as your security boundary, that boundary does not hold. NCC Group has the working exploit. Any deployment with untrusted user inputs requires Docker or subprocess isolation with explicit resource limits.
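For the subprocess route, a minimal POSIX-only sketch: run the untrusted code in a fresh interpreter with hard rlimits and a wall-clock timeout. This is illustrative defense-in-depth, not a substitute for container isolation against a determined attacker; the specific limits are assumptions to tune.

```python
import resource
import subprocess
import sys

def run_untrusted(code: str, timeout_s: int = 5,
                  mem_bytes: int = 512 * 2**20) -> str:
    """Run untrusted Python in a separate process with hard resource
    limits. POSIX-only; for hostile inputs prefer a container with
    no network access."""
    def set_limits():
        # Applied in the child after fork, before exec:
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))   # cap address space
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))  # cap CPU seconds

    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode (no env/site hooks)
        capture_output=True,
        text=True,
        timeout=timeout_s,       # wall-clock kill switch
        preexec_fn=set_limits,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip() or f"exit code {proc.returncode}")
    return proc.stdout
```

Unlike an in-process interpreter, the security boundary here is the kernel’s process isolation, so a numpy-style escape only gains code execution inside an already-limited process.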

Your Pydantic AI v1.0 evaluation tooling was removed on release day

Within 24 hours of shipping v1.0, the pydantic_evals Python evaluator was removed as an RCE mitigation. The module remains but the evaluator class is gone. Any builder who shipped against the v1.0 API has broken evaluation tooling with no drop-in replacement.

Compound failures in current versions:

  • Parallel MCP tool cancel scope mismatch: When Pydantic AI runs MCP tools concurrently, the async context manager closes before all concurrent calls complete. Tool calls in-flight at context manager exit are dropped without error. The calling code receives no indication that results are missing.
  • run_stream() and run() divergent tool-handling: Six open issues as of March 2026 track inconsistencies. Tools that work in run() fail silently or return incomplete results in run_stream().

What this means for you: Any Pydantic AI + MCP workflow using parallelism needs explicit result validation that all expected results arrived. Any workflow mixing streaming and batch tool calls must be tested in both modes separately.
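The result-validation step can be as simple as comparing dispatched call ids against arrived results before the workflow proceeds. A sketch with hypothetical id and result shapes (adapt to however your MCP client labels calls):

```python
def validate_results(expected: set, results: dict) -> dict:
    """Fail loudly if any dispatched tool call never produced a result,
    i.e. the failure mode where in-flight calls are dropped at context
    manager exit without error."""
    missing = expected - results.keys()
    if missing:
        raise RuntimeError(f"{len(missing)} tool result(s) missing: {sorted(missing)}")
    return results

# Usage: record ids at dispatch time, validate after the run completes.
dispatched = {"call_1", "call_2", "call_3"}
arrived = {"call_1": 10, "call_2": 20, "call_3": 30}
validate_results(dispatched, arrived)   # passes only when nothing was dropped
```

The key design point is checking for presence, not success: a call that returned without error tells you nothing about a sibling call that was silently dropped.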

AutoGen is in maintenance mode — your migration window is open now

Microsoft placed AutoGen in maintenance mode in October 2025 — bug fixes and security updates only, no new features. Moving to Microsoft Agent Framework requires “light refactoring” for single agents and a new orchestration model for multi-agent systems. The migration guide recommends completing the move within 6-12 months, which puts the window at Q3/Q4 2026 for teams on AutoGen today.

What this means for you: This is not ecosystem convergence. It is a forced migration with a deadline. Projects built on AutoGen now carry migration debt that compounds with every month of delay.

Other production failure modes

Jules (Google): Despite marketing that emphasizes GitHub integration, Jules cannot read GitHub issue content. In testing, Jules explicitly states that it cannot access external websites, including GitHub. Issue content must be injected into the task prompt manually. Any workflow that assumes Jules will pull issue context automatically will proceed without that context, silently.

Mastra: Default configuration leaves all tool endpoints open. Role configuration is opt-in, not opt-out. Any Mastra deployment without explicit role configuration exposes all tool endpoints without authentication.

n8n self-hosted: No SSRF protection prior to 2.12.0. Version 2.12.0 introduced configurable SSRF protection — but it is not enabled by default after upgrade.

What this means for you

The “this framework supports my provider” and “this framework handles security” assumptions are where plans break. The failure modes above are not edge cases in obscure settings — they are the default behavior under conditions every production deployment encounters:

  • Non-OpenAI providers with CrewAI
  • Cyclic graphs in LangGraph
  • Streamed output with content filtering in the OpenAI Agents SDK
  • User-controlled code in smolagents

Pick your framework based on your specific topology and constraints, then independently verify the failure modes that apply to your configuration. Do not transfer testing results from one mode (batch) to another (streaming), or from one provider (OpenAI) to another (Anthropic).

What to do

For CrewAI with non-OpenAI models: Add an independent verification step that confirms tool execution actually occurred before treating Observation output as valid. Use Phoenix tracing or similar, and log tool call events at the framework level, not just completion events.

For LangGraph: Add manual loop counters to all cyclic graphs: revision_count in state, guard condition revision_count < N before nodes that can loop. The framework will not do this for you.

For OpenAI Agents SDK: Pin to v0.14.2+. Avoid preload=true with gunicorn/uWSGI unless you’ve confirmed you’re past the Feb 17 2026 patch. Implement content filtering at the transport layer if needed — streaming guardrails are NOT_PLANNED and will not appear.

For smolagents: Use Docker or subprocess isolation with explicit resource limits for any deployment where user-controlled code runs. LocalPythonInterpreter is not a security boundary.

For Pydantic AI: Test streaming and batch tool calls separately. For any MCP parallel execution workflow, add explicit result validation that all expected results are present — not just that the call returned without error.

For Jules: Pre-load GitHub issue content into the task prompt explicitly.

For AutoGen: Begin migration planning to Microsoft Agent Framework. Q3/Q4 2026 deadline.

For Mastra: Configure RBAC roles before deploying beyond local development.

For n8n self-hosted: Upgrade to 2.12.0+ and explicitly enable SSRF protection.

Evidence

| Tool | Version | Result |
| --- | --- | --- |
| CrewAI | Issue #3154 (Q1 2026) | source-reviewed: tool fabrication with non-OpenAI models (#3154 closed not-planned; #3378, #4077 unmerged) |
| OpenAI Agents SDK | v0.14.2 | source-reviewed: fork deadlock patched Feb 17 2026; streaming guardrails NOT_PLANNED; #2151, #2135, #2646 open; Sandbox Agents in v0.14.0 |
| smolagents | current (March 2026) | independently-confirmed: NCC Group PoC sandbox bypass via numpy/pandas |
| Pydantic AI | v1.0 | source-reviewed: pydantic_evals evaluator removed as RCE fix; parallel MCP cancel scope mismatch; run_stream/run divergence (6 open issues) |
| Jules (Google) | current (March 2026) | tested: confirmed no native GitHub issue access; cannot access external websites |
| n8n self-hosted | < 2.12.0 | source-reviewed: no SSRF protection by default; 2.12.0 release notes |
| Mastra | @mastra/core@1.9.0 | source-reviewed: default open tool endpoints documented in release notes |
| AutoGen | maintenance mode (Oct 2025) | source-reviewed: Microsoft migration guide |

Confidence: empirical — source-reviewed across 8+ frameworks with issue-level evidence for each failure mode (GitHub Issues, release notes, changelogs verified March–April 2026). Jules GitHub access claim is tested (practitioner confirmation, not source-reviewed). smolagents sandbox bypass independently confirmed by NCC Group PoC.

What would disprove this: A CrewAI release that fixes tool fabrication for non-OpenAI providers; a LangGraph release that adds built-in loop limits; an OpenAI Agents SDK release that marks streaming guardrails as planned with a roadmap entry.

Falsification criterion for “diverging, not converging”: A shared configuration standard (schema, SDK, or protocol) adopted by at least three major frameworks (LangGraph, CrewAI, OpenAI Agents SDK) governing orchestration behavior — not just LLM selection.

Last verified: 2026-04-20

Seen different? Contribute your evidence (confirming or contradicting) via the Theory Delta MCP contribute tool or at theorydelta.com/contribute.