Your agent CI gate is probabilistic — and your VCR recording does not cover MCP tool calls
From Theory Delta | Methodology | Published 2026-02-25
You set up a CI pipeline for your agent. deepeval runs on every PR, LLM-as-judge scores the outputs, and deployment gates on a passing eval. You also added VCR-style recording to replay API calls deterministically. Your pipeline looks complete.
Two structural problems make it unreliable by design.
What you expect
A CI gate that produces a deterministic pass/fail verdict on agent behavior. Eval frameworks that behave as correctness oracles. VCR recording that replays all external calls, including MCP tool dispatch. A green CI run as a proof of correctness.
What actually happens
Problem 1: The grading layer is non-deterministic by design. Every production eval framework — deepeval, promptfoo, awslabs/agent-evaluation, rogue — uses LLM-as-judge to score agent outputs. The judge LLM itself produces different scores across runs for identical inputs. This is not a bug in these frameworks. It is inherent to the architecture. A CI gate built on LLM-as-judge produces different pass/fail results for the same code on consecutive runs.
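One way to see this in your own pipeline is to replay the same input through the gate many times and count disagreements. A minimal sketch, with the judge simulated as a weighted coin flip near a borderline score (a real implementation would call your judge model in place of the simulated one):

```python
import random
from collections import Counter

def judge(output: str) -> bool:
    """Hypothetical stand-in for an LLM-as-judge verdict. Simulated here
    as a 90%-pass verdict, mimicking an output that sits near the judge's
    scoring threshold; a real judge would call a model and threshold it."""
    return random.random() < 0.9

def measure_gate_flakiness(output: str, runs: int = 100) -> float:
    """Fraction of runs whose verdict disagrees with the majority verdict
    for the same input. 0.0 would mean the gate is deterministic here."""
    verdicts = [judge(output) for _ in range(runs)]
    _, majority_count = Counter(verdicts).most_common(1)[0]
    return 1.0 - majority_count / runs
```

Run this against the real judge on a handful of borderline outputs before trusting a gate; a nonzero rate is the spurious-failure probability you are signing up for.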
Mitigations exist. None eliminate the problem:
- Majority voting (qualifire-dev/rogue): Run the judge N times, take the majority verdict. Reduces variance. Multiplies cost. Does not eliminate non-determinism.
- Threshold + retries (deepeval): Retry on borderline scores. Adds latency. Does not eliminate the failure mode.
- Seed fixing (model-dependent): temperature=0 plus a fixed seed reduces but does not eliminate variation — OpenAI and Anthropic both acknowledge this in their docs.
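The majority-voting mitigation can be sketched in a few lines; judge here is any callable wrapping your judge LLM, not a specific framework's API:

```python
from collections import Counter

def majority_verdict(judge, output: str, n: int = 5) -> bool:
    """Call a non-deterministic judge n times on the same output and
    take the majority pass/fail verdict. Reduces variance, multiplies
    judge cost by n, and still flips on genuinely borderline inputs."""
    votes = Counter(bool(judge(output)) for _ in range(n))
    return votes.most_common(1)[0][0]
```

Keep n odd so there is always a strict majority; an even n reintroduces ties.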
awslabs/agent-evaluation acknowledges non-deterministic outcomes in its own documentation. This is the accurate position — any team treating LLM eval CI gates as hard correctness gates is running a probabilistic gate instead.
Problem 2: No MCP tool replay exists. VCR-style recording (vcrpy, pytest-recording, responses) intercepts HTTP calls. MCP agents on stdio or SSE transports do not make HTTP calls for tool dispatch — communication happens over standard input/output or server-sent events. No library intercepts these transports:
- vcr-langchain (81 stars, stale since Jan 2024) records LangChain HTTP calls to OpenAI/Anthropic APIs. Does not capture or replay MCP tool dispatch on any transport.
- Streamable HTTP transport is HTTP, so VCR could theoretically intercept it, but no library has been tested or documented for this use case.
- No MCP stub server library or mock exists in the ecosystem as of Feb 2026 — for any transport.
What builders are doing instead: writing custom fake MCP servers that return scripted JSON-RPC responses (not shared, not versioned), testing at integration level with real MCP servers and real tool responses, or skipping unit-level MCP tool testing entirely.
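A custom fake server of this kind can be small. The sketch below answers tools/call requests read from stdin with scripted results; the tool names and payload shapes are placeholders, and a real MCP server would also implement the initialize handshake and capability negotiation from the spec:

```python
#!/usr/bin/env python3
"""Minimal fake MCP-style stdio server: reads one JSON-RPC request per
line from stdin and answers tools/call with scripted results. This is a
simplified sketch, not a spec-complete MCP implementation."""
import json
import sys

# Scripted responses keyed by tool name (placeholders for your tools).
SCRIPTED = {
    "get_weather": {"content": [{"type": "text", "text": '{"temp_c": 21}'}]},
    "search_docs": {"content": [{"type": "text", "text": "no results"}]},
}

def handle(request: dict) -> dict:
    """Dispatch a single JSON-RPC request against the script table."""
    if request.get("method") == "tools/call":
        tool = request["params"]["name"]
        result = SCRIPTED.get(tool, {"content": [], "isError": True})
        return {"jsonrpc": "2.0", "id": request["id"], "result": result}
    return {"jsonrpc": "2.0", "id": request.get("id"),
            "error": {"code": -32601, "message": "method not scripted"}}

def main() -> None:
    for line in sys.stdin:
        if line.strip():
            sys.stdout.write(json.dumps(handle(json.loads(line))) + "\n")
            sys.stdout.flush()

if __name__ == "__main__":
    main()
```

Point your agent's MCP server command at this script in CI and tool responses become fixed test fixtures instead of live calls.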
Compounding problem: deepeval itself has confirmed correctness bugs. The is_successful field silently returned wrong success status in a happy-path case — the eval framework reported tests as passing when they were failing. This was patched reactively after community reports. A second problem: deepeval v3.7.7+ on import calls trace.set_tracer_provider(TracerProvider()), hijacking your global OTel provider and routing application spans to deepeval’s New Relic account, and initializes Sentry with 100% CPU profiling. (Issue #2497, no maintainer response as of March 2026.) The eval framework itself can have silent correctness failures — this is now a confirmed risk, not a theoretical one.
What this means for you
Your green CI run is not a correctness proof. It means no obvious regression was detected. The same code will fail on a different run with no change to the codebase. The probability of spurious failure depends on how close your agent’s outputs are to the judge’s scoring thresholds — and those thresholds shift with every judge LLM update.
Your VCR setup has a coverage gap for MCP tools. If your agent calls tools via stdio or SSE transport, those calls execute against a live MCP server in CI — they are not replayed from cassettes. Any test that passes because the live MCP server returned the right value is not a deterministic test. An infrastructure failure or API change during a CI run produces a test failure that looks like a regression.
If you upgraded deepeval without reading the changelog: Check whether the is_successful fix is in your version. More critically: if your application has an OTel TracerProvider initialized before deepeval imports, deepeval may be routing your application spans to its own New Relic account. Set DEEPEVAL_TELEMETRY_OPT_OUT=YES and run deepeval only in isolated environments.
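In practice that means setting the variable before the first deepeval import anywhere in the test session, for example in a conftest.py. This assumes, per the import-time behavior described above, that deepeval reads the variable at import:

```python
# conftest.py -- opt out of deepeval telemetry for the whole test session.
# Assumption: the hijacking happens on import, so the env var must be in
# place before any test module runs `import deepeval`.
import os

os.environ["DEEPEVAL_TELEMETRY_OPT_OUT"] = "YES"
```

pytest loads conftest.py before collecting test modules, which is what makes this ordering reliable.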
If you are building multi-agent systems: Every major platform — Claude Code, Cursor, Devin, Grok Build, Windsurf, Codex — shipped multi-agent team features in February 2026. No mature multi-agent test harness exists. Hallucination propagation between agents, race conditions from shared state, and N×M test case explosion across agent/task combinations are structurally unaddressed by all current frameworks. Single-agent testing infrastructure does not transfer to multi-agent systems.
What to do
- Treat LLM eval CI gates as smoke tests, not correctness proofs. A green run means “no obvious regression.” Set thresholds conservatively and expect occasional false failures.
- Build deterministic assertion layers where possible. For structured outputs (JSON, tool calls with known schemas), assert on structure and field values directly — do not route these through an LLM judge. Reserve LLM-as-judge for free-text quality where no deterministic check exists.
- For MCP tool testing, build custom stubs now. Write a fake MCP server for your specific tools that returns scripted JSON-RPC responses. This is ad hoc but is the only option until a shared MCP stub library emerges. Three adjacent tools exist but do not solve VCR replay: FastMCP supports in-process testing (FastMCP servers only); thoughtspot/mcp-testing-kit (12 stars, TypeScript, unmaintained since May 2025) provides in-process invocation; mcpdrill (2 stars, Go) provides load testing with a built-in mock server but no recording/replay.
- Pin deepeval versions and audit the changelog. The is_successful bug establishes that silent correctness failures in the eval layer are a confirmed risk. Set DEEPEVAL_TELEMETRY_OPT_OUT=YES. Run deepeval in isolated environments where OTel hijacking is acceptable.
- Use promptfoo for MCP security testing specifically. Its MCP red-team plugin tests for prompt injection and policy violations via MCP tool interactions, but this is security testing, not functional correctness testing.
- Watch laude-institute/harbor (1,537 stars as of Apr 2026, up from 704 in Feb). It unifies eval, RL environments, and prompt optimization under one trajectory format (ATIF). Claude Code integration is first-class. If eval and training share the same representation, failing CI trajectories can feed directly into fine-tuning without infrastructure changes.
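As a concrete illustration of the deterministic-assertion advice above, the sketch below validates a structured tool-call output with plain checks instead of a judge; the tool and field names are hypothetical:

```python
import json

def check_tool_call(raw: str) -> list[str]:
    """Deterministic checks on a structured agent output: parse it and
    assert on fields directly rather than asking a judge LLM. Returns a
    list of failure messages (empty list means pass)."""
    failures = []
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if call.get("tool") != "get_weather":
        failures.append(f"unexpected tool: {call.get('tool')!r}")
    args = call.get("arguments", {})
    if not isinstance(args.get("city"), str):
        failures.append("arguments.city must be a string")
    return failures
```

Everything checked this way is exact and replayable in CI; only free-text fields with no deterministic check should fall through to an LLM judge.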
Evidence
| Tool | Version | Result |
|---|---|---|
| confident-ai/deepeval | latest (Feb 2026) | independently-confirmed: G-Eval is_successful silent false-pass bug confirmed; OTel hijack + Sentry on import (Issue #2497) |
| promptfoo/promptfoo | latest (Feb 2026) | source-reviewed: MCP security red-team plugin confirmed; no functional MCP test support |
| laude-institute/harbor | Apr 2026 (1,537 stars) | docs-reviewed: ATIF trajectory format reviewed; eval+RL unified; Claude Code integration first-class |
| awslabs/agent-evaluation | latest (Feb 2026) | independently-confirmed: non-deterministic outcomes acknowledged in own docs |
| LangChain FakeChatModel | latest (Feb 2026) | source-reviewed: does not expose prompt inputs without subclassing |
| amosjyng/vcr-langchain | v0.1.x (stale since Jan 2024) | source-reviewed: HTTP-only recording; no MCP tool dispatch interception |
Confidence: source-reviewed + independently-confirmed — source code and documentation reviewed across 6 tools. Non-determinism confirmed by design analysis and third-party acknowledgment (awslabs self-documents it, deepeval bug confirmed by community). scope_matches=false: “every agent eval framework” was assessed by reviewing 4 frameworks, not an exhaustive survey.
Unlinked claims: (1) “No MCP stub server library exists” — searched GitHub, npm, PyPI for “mcp mock”, “mcp stub”, “mcp test server” in Feb 2026; no results with >10 stars or documented MCP transport interception. (2) “Seed fixing reduces but does not eliminate variation” — based on vendor documentation (OpenAI, Anthropic), not independent measurement.
What would disprove this: (1) An agent eval framework using LLM-as-judge that achieves identical pass/fail results across 100 consecutive runs on the same input, or (2) an MCP stub/mock library that intercepts stdio or SSE transport tool dispatch for deterministic replay in CI.
Open questions: Has anyone built a shared MCP stub/mock server library for any transport? Is there a deterministic grading approach for free-text agent outputs that doesn’t use LLM-as-judge? Has anyone measured the actual variance rate of LLM-as-judge CI gates across 100+ runs on identical inputs?
Seen different? Contribute your evidence — theory delta is what makes this knowledge base work.