Your agent CI gate is probabilistic — and your VCR recording does not cover MCP tool calls

Published: 2026-02-25 · Last verified: 2026-04-19 · independently-confirmed


From Theory Delta | Methodology | Published 2026-02-25

You set up a CI pipeline for your agent. deepeval runs on every PR, LLM-as-judge scores the outputs, and deployment gates on a passing eval. You also added VCR-style recording to replay API calls deterministically. Your pipeline looks complete.

Two structural problems make it unreliable by design.

What you expect

A CI gate that produces a deterministic pass/fail verdict on agent behavior. Eval frameworks that behave as correctness oracles. VCR recording that replays all external calls, including MCP tool dispatch. A green CI run as a proof of correctness.

What actually happens

Problem 1: The grading layer is non-deterministic by design. Every production eval framework — deepeval, promptfoo, awslabs/agent-evaluation, rogue — uses LLM-as-judge to score agent outputs. The judge LLM itself produces different scores across runs for identical inputs. This is not a bug in these frameworks. It is inherent to the architecture. A CI gate built on LLM-as-judge produces different pass/fail results for the same code on consecutive runs.

Mitigations exist. None eliminate the problem:

  1. Majority voting (qualifire-dev/rogue): Run the judge N times, take the majority verdict. Reduces variance. Multiplies cost. Does not eliminate non-determinism.
  2. Threshold + retries (deepeval): Retry on borderline scores. Adds latency. Does not eliminate the failure mode.
  3. Seed fixing (model-dependent): temperature=0 plus a fixed seed reduces but does not eliminate variation — OpenAI and Anthropic both acknowledge this in their docs.
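Mitigation 1 above is simple enough to sketch. The stub judge below stands in for any LLM-as-judge call and deliberately flips one vote to mimic run-to-run variance; `majority_verdict` and the stub are illustrative names, not from any framework:

```python
# Majority voting over a non-deterministic judge (sketch).
# `judge` is any callable returning a pass/fail verdict for one input.
from collections import Counter

def majority_verdict(judge, candidate: str, n: int = 5) -> bool:
    """Run the judge n times on the same input, return the modal verdict."""
    votes = Counter(judge(candidate) for _ in range(n))
    return votes.most_common(1)[0][0]

# Stub judge that disagrees with itself on one run out of five,
# mimicking the variance a real judge LLM exhibits.
flips = iter([True, True, False, True, True])
verdict = majority_verdict(lambda _: next(flips), "agent output", n=5)
# 4 of 5 votes are True, so the majority verdict is True.
```

Note the cost structure: the variance reduction is paid for with N judge calls per test case, which is exactly the trade-off described above.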

awslabs/agent-evaluation acknowledges non-deterministic outcomes in its own documentation. That is the accurate framing: any team treating an LLM eval CI gate as a hard correctness gate is in fact running a probabilistic gate.

Problem 2: No MCP tool replay exists. VCR-style recording (vcrpy, pytest-recording, responses) intercepts HTTP calls. MCP agents on stdio or SSE transports do not make HTTP calls for tool dispatch — communication happens over standard input/output or server-sent events. No library intercepts these transports:

  • vcr-langchain (81 stars, stale since Jan 2024) records LangChain HTTP calls to OpenAI/Anthropic APIs. Does not capture or replay MCP tool dispatch on any transport.
  • Streamable HTTP transport is HTTP, so VCR could theoretically intercept it, but no library has been tested or documented for this use case.
  • No MCP stub server library or mock exists in the ecosystem as of Feb 2026 — for any transport.

What builders are doing instead: writing custom fake MCP servers that return scripted JSON-RPC responses (not shared, not versioned), testing at integration level with real MCP servers and real tool responses, or skipping unit-level MCP tool testing entirely.
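The first of those options can be sketched in a few lines. This is a minimal scripted fake, not a conformant MCP implementation: it assumes newline-delimited JSON-RPC over stdio (the MCP stdio framing) and handles only `tools/call`; the `get_weather` tool and its canned payload are invented for illustration:

```python
# fake_mcp_server.py -- a scripted stand-in for an MCP server over
# stdio (sketch). One JSON-RPC request per line in, one response
# per line out; tool results are canned, so tests are deterministic.
import json
import sys

CANNED = {  # tool name -> scripted MCP-style result payload
    "get_weather": {"content": [{"type": "text", "text": "sunny"}]},
}

def handle(req: dict) -> dict:
    """Return a scripted JSON-RPC response for one request."""
    if req.get("method") == "tools/call":
        tool = req["params"]["name"]
        result = CANNED.get(tool, {"content": []})
        return {"jsonrpc": "2.0", "id": req["id"], "result": result}
    # Anything else (initialize, tools/list, ...): empty success.
    return {"jsonrpc": "2.0", "id": req.get("id"), "result": {}}

def serve() -> None:
    # Launch this script as the agent's MCP server command to wire
    # it in; each request line gets exactly one response line.
    for line in sys.stdin:
        if line.strip():
            print(json.dumps(handle(json.loads(line))), flush=True)
```

Because the responses are scripted in one place, the fake doubles as a fixture file: changing a tool's canned payload is the test-data edit, with no live server in the loop.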

Compounding problem: deepeval itself has confirmed correctness bugs. The is_successful field silently returned the wrong success status in a happy-path case: the eval framework reported tests as passing when they were failing. The fix shipped reactively, after community reports.

A second problem: deepeval v3.7.7+ calls trace.set_tracer_provider(TracerProvider()) on import, hijacking your global OTel provider and routing application spans to deepeval’s New Relic account, and it initializes Sentry with 100% CPU profiling (Issue #2497, no maintainer response as of March 2026). Silent correctness failures in the eval layer are now a confirmed risk, not a theoretical one.

What this means for you

Your green CI run is not a correctness proof. It means no obvious regression was detected. The same code will fail on a different run with no change to the codebase. The probability of spurious failure depends on how close your agent’s outputs are to the judge’s scoring thresholds — and those thresholds shift with every judge LLM update.

Your VCR setup has a coverage gap for MCP tools. If your agent calls tools via stdio or SSE transport, those calls execute against a live MCP server in CI — they are not replayed from cassettes. Any test that passes because the live MCP server returned the right value is not a deterministic test. An infrastructure failure or API change during a CI run produces a test failure that looks like a regression.

If you upgraded deepeval without reading the changelog: Check whether the is_successful fix is in your version. More critically: if your application has an OTel TracerProvider initialized before deepeval imports, deepeval may be routing your application spans to its own New Relic account. Set DEEPEVAL_TELEMETRY_OPT_OUT=YES and run deepeval only in isolated environments.
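The opt-out is an environment variable, so it belongs in the CI job definition, set before any process that might import deepeval. A minimal fragment, with the variable name taken from above and the eval-suite command left as a placeholder comment:

```shell
# CI job fragment (sketch): force deepeval's telemetry opt-out before
# any eval process starts, and fail the job loudly if it is missing.
export DEEPEVAL_TELEMETRY_OPT_OUT=YES
[ "${DEEPEVAL_TELEMETRY_OPT_OUT}" = "YES" ] || {
    echo "telemetry opt-out not set; refusing to run evals" >&2
    exit 1
}
# ...then run the eval suite in its isolated job, e.g. `pytest tests/evals`
```

Putting the check next to the export looks redundant, but it guards against the variable being unset by an intermediate wrapper script between the CI config and the eval process.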

If you are building multi-agent systems: Every major platform — Claude Code, Cursor, Devin, Grok Build, Windsurf, Codex — shipped multi-agent team features in February 2026. No mature multi-agent test harness exists. Hallucination propagation between agents, race conditions from shared state, and N×M test case explosion across agent/task combinations are structurally unaddressed by all current frameworks. Single-agent testing infrastructure does not transfer to multi-agent systems.

What to do

  1. Treat LLM eval CI gates as smoke tests, not correctness proofs. A green run means “no obvious regression.” Set thresholds conservatively and expect occasional false failures.
  2. Build deterministic assertion layers where possible. For structured outputs (JSON, tool calls with known schemas), assert on structure and field values directly — do not route these through an LLM judge. Reserve LLM-as-judge for free-text quality where no deterministic check exists.
  3. For MCP tool testing, build custom stubs now. Write a fake MCP server for your specific tools that returns scripted JSON-RPC responses. This is ad hoc but is the only option until a shared MCP stub library emerges. Three adjacent tools exist but do not solve VCR replay: FastMCP supports in-process testing (FastMCP servers only); thoughtspot/mcp-testing-kit (12 stars, TypeScript, unmaintained since May 2025) provides in-process invocation; mcpdrill (2 stars, Go) provides load testing with a built-in mock server but no recording/replay.
  4. Pin deepeval versions and audit the changelog. The is_successful bug establishes that silent correctness failures in the eval layer are a confirmed risk. Set DEEPEVAL_TELEMETRY_OPT_OUT=YES. Run deepeval in isolated environments where OTel hijacking is acceptable.
  5. Use promptfoo for MCP security testing specifically. Its MCP red-team plugin tests for prompt injection and policy violations via MCP tool interactions — but this is security testing, not functional correctness testing.
  6. Watch laude-institute/harbor (1,537 stars as of Apr 2026, up from 704 in Feb). It unifies eval, RL environments, and prompt optimization under one trajectory format (ATIF). Claude Code integration is first-class. If eval and training share the same representation, failing CI trajectories can feed directly into fine-tuning without infrastructure changes.
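Item 2 above, the deterministic assertion layer, can be sketched with nothing but the standard library. The schema, tool names, and payloads below are invented for illustration; the point is that a structured output gets a hard structural check, not a judge score:

```python
# A minimal deterministic assertion layer (sketch): validate a tool
# call's structure and field values directly instead of routing it
# through an LLM judge. Same input -> same verdict, every run.
import json

# Expected shape of a tool call emitted by the agent (illustrative).
TOOL_CALL_SCHEMA = {"name": str, "arguments": dict}

def assert_tool_call(raw: str, expected_name: str) -> dict:
    """Parse and structurally validate one tool call; return its args."""
    call = json.loads(raw)  # fails loudly on malformed output
    for field, typ in TOOL_CALL_SCHEMA.items():
        assert isinstance(call.get(field), typ), f"bad field: {field!r}"
    assert call["name"] == expected_name, f"wrong tool: {call['name']!r}"
    return call["arguments"]

# Deterministic check on a structured agent output (illustrative).
args = assert_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}',
    expected_name="get_weather",
)
assert args["city"] == "Oslo"
```

Everything this layer can decide is removed from the LLM judge's scope, which shrinks the surface where run-to-run variance can flip a CI verdict.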

Evidence

| Tool | Version | Result |
| --- | --- | --- |
| confident-ai/deepeval | latest (Feb 2026) | independently-confirmed: G-Eval is_successful silent false-pass bug confirmed; OTel hijack + Sentry on import (Issue #2497) |
| promptfoo/promptfoo | latest (Feb 2026) | source-reviewed: MCP security red-team plugin confirmed; no functional MCP test support |
| laude-institute/harbor | Apr 2026 (1,537 stars) | docs-reviewed: ATIF trajectory format reviewed; eval+RL unified; Claude Code integration first-class |
| awslabs/agent-evaluation | latest (Feb 2026) | independently-confirmed: non-deterministic outcomes acknowledged in own docs |
| LangChain FakeChatModel | latest (Feb 2026) | source-reviewed: does not expose prompt inputs without subclassing |
| amosjyng/vcr-langchain | v0.1.x (stale since Jan 2024) | source-reviewed: HTTP-only recording; no MCP tool dispatch interception |

Confidence: source-reviewed + independently-confirmed — source code and documentation reviewed across 6 tools. Non-determinism confirmed by design analysis and third-party acknowledgment (awslabs self-documents it, deepeval bug confirmed by community). scope_matches=false: “every agent eval framework” was assessed by reviewing 4 frameworks, not an exhaustive survey.

Unlinked claims: (1) “No MCP stub server library exists” — searched GitHub, npm, PyPI for “mcp mock”, “mcp stub”, “mcp test server” in Feb 2026; no results with >10 stars or documented MCP transport interception. (2) “Seed fixing reduces but does not eliminate variation” — based on vendor documentation (OpenAI, Anthropic), not independent measurement.

What would disprove this: (1) An agent eval framework using LLM-as-judge that achieves identical pass/fail results across 100 consecutive runs on the same input, or (2) an MCP stub/mock library that intercepts stdio or SSE transport tool dispatch for deterministic replay in CI.

Open questions: Has anyone built a shared MCP stub/mock server library for any transport? Is there a deterministic grading approach for free-text agent outputs that doesn’t use LLM-as-judge? Has anyone measured the actual variance rate of LLM-as-judge CI gates across 100+ runs on identical inputs?
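The third open question is cheap to answer for any given judge. A sketch of the measurement harness, with a stub judge standing in for the real LLM call (the 90/10 vote split below is invented to make the arithmetic visible):

```python
# Variance-measurement harness (sketch): estimate how often a
# pass/fail judge disagrees with its own modal verdict across
# repeated runs on one fixed input.
from collections import Counter

def flip_rate(judge, candidate: str, runs: int = 100) -> float:
    """Fraction of runs whose verdict differs from the modal verdict."""
    votes = Counter(judge(candidate) for _ in range(runs))
    return 1 - votes.most_common(1)[0][1] / runs

# Stub judge that passes 90 runs and fails 10 on the same input,
# standing in for a real judge LLM's observed behavior.
noisy = iter([True] * 90 + [False] * 10)
rate = flip_rate(lambda _: next(noisy), "same input", runs=100)
# 10 of 100 runs disagree with the modal verdict -> rate is about 0.1
```

Run against a real judge, `rate` is the spurious-flip probability of the CI gate for that input; measuring it across a sample of borderline cases would answer the open question with data rather than vendor acknowledgments.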

Seen different? Contribute your evidence — theory delta is what makes this knowledge base work.