Verified 2026-02-22
Theory Delta: All four eval frameworks reviewed that use LLM-as-judge for CI gating produce non-deterministic pass/fail results -- the grading layer is non-deterministic by design, not just the model under test. No deterministic replay tool for MCP/tool-calling agents was found (GitHub/npm/PyPI searched, Feb 2026). VCR-style recording works at the HTTP layer but does not intercept MCP tool dispatch.

Agent eval CI gates are non-deterministic by design -- and no MCP tool replay exists

From Theory Delta | Methodology | Published 2026-02-25

What the docs say

Agent eval frameworks (deepeval, promptfoo, awslabs/agent-evaluation) advertise CI integration as a core feature. Set up a test suite, run it in your pipeline, gate deployments on passing evals. For deterministic replay of external calls, VCR-style libraries (vcrpy, vcr-langchain) record HTTP interactions as cassettes and replay them in CI, eliminating flakiness.
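The cassette idea is simple to sketch: key each outgoing request, record the live response on the first run, and serve the recording on every later run. A minimal pure-Python illustration of the concept -- this is not the vcrpy API, and `live_fetch` stands in for whatever HTTP client you actually use:

```python
import json

class Cassette:
    """Record a live call's response once, replay it deterministically after."""
    def __init__(self, path):
        self.path = path
        try:
            with open(path) as f:
                self.recordings = json.load(f)
        except FileNotFoundError:
            self.recordings = {}

    def request(self, url, live_fetch):
        if url not in self.recordings:      # first run: hit the network
            self.recordings[url] = live_fetch(url)
            with open(self.path, "w") as f:
                json.dump(self.recordings, f)
        return self.recordings[url]         # later runs: replay from disk
```

This works because HTTP interactions are easy to key and intercept at a single choke point -- which is exactly what MCP stdio/SSE tool dispatch lacks.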

What actually happens

Two structural problems make agent CI gates unreliable:

Problem 1: The grading layer is non-deterministic. Every eval framework reviewed uses LLM-as-judge to score agent outputs, and the judge LLM itself produces different scores across runs for identical inputs. This is not a bug -- it is inherent to the architecture. A CI gate built on LLM-as-judge can produce different pass/fail results for the same code on consecutive runs.

Mitigations observed in the wild, none of which eliminate the problem:

  1. Majority voting (qualifire-dev/rogue): Run the judge N times, take the majority verdict. Reduces variance, adds N times the cost, does not eliminate non-determinism.
  2. Threshold + retries (deepeval): Set a score threshold (e.g., >= 0.7 faithfulness) and retry on borderline scores. Adds latency, does not eliminate the failure mode.
  3. Seed fixing (model-dependent): Setting temperature=0 and a fixed seed reduces but does not eliminate variation -- both OpenAI and Anthropic acknowledge this in their docs.
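The majority-voting mitigation (item 1) is a few lines in any language. A sketch of the pattern -- the stub judge here is a scripted stand-in for a real, non-deterministic LLM-as-judge call:

```python
from collections import Counter

def majority_verdict(judge, n=5):
    """Run a (non-deterministic) judge n times; take the majority pass/fail."""
    votes = [judge() for _ in range(n)]
    return Counter(votes).most_common(1)[0][0]

# Stub judge: a scripted sequence of verdicts illustrating run-to-run variance.
verdicts = iter([True, True, False, True, False])
judge = lambda: next(verdicts)
print(majority_verdict(judge, n=5))  # → True (3 of 5 votes pass)
```

Note the cost structure this implies: every gated assertion now costs N judge calls, and a 60/40 judge still flips the majority verdict with non-trivial probability.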

awslabs/agent-evaluation acknowledges non-deterministic outcomes in its own documentation. This is the honest position -- but it means any team using agent evals as a hard CI gate is running a probabilistic gate, not a deterministic one.

Problem 2: No MCP tool replay exists. VCR-style recording intercepts HTTP calls. MCP agents using stdio or SSE transports do not make plain HTTP request/response calls for tool dispatch -- the communication happens over standard input/output or server-sent events. No library intercepts these transports.

What builders are doing instead: writing one-off fake MCP servers that return scripted JSON-RPC responses (not shared), testing at the integration level with real MCP servers and real tool responses, or skipping unit-level MCP tool testing entirely.
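A one-off fake MCP server of the kind described above can be small. The sketch below handles only `tools/call` over the stdio transport (newline-delimited JSON-RPC) and omits the MCP initialize handshake a real client would expect; the tool name and payload are hypothetical placeholders for your own tools:

```python
import json
import sys

# Scripted responses keyed by tool name -- replace with your tools' payloads.
SCRIPTED_RESULTS = {
    "get_weather": {"content": [{"type": "text", "text": '{"temp_c": 21}'}]},
}

def handle(request: dict) -> dict:
    """Answer a JSON-RPC 2.0 tools/call request with a canned result."""
    if request.get("method") == "tools/call":
        tool = request["params"]["name"]
        result = SCRIPTED_RESULTS.get(tool)
        if result is not None:
            return {"jsonrpc": "2.0", "id": request["id"], "result": result}
    return {"jsonrpc": "2.0", "id": request.get("id"),
            "error": {"code": -32601, "message": "no scripted response"}}

if __name__ == "__main__":
    # stdio transport: one JSON-RPC message per line on stdin/stdout.
    for line in sys.stdin:
        if line.strip():
            sys.stdout.write(json.dumps(handle(json.loads(line))) + "\n")
            sys.stdout.flush()
```

Point the agent under test at this script as its MCP server command and tool responses become fully deterministic -- at the cost of maintaining the script per tool set.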

Compounding failure: deepeval itself has bugs in its eval logic. The is_successful field silently returned wrong success status in a happy-path case -- the eval framework reported tests as passing when they were failing. This was fixed reactively after community reports, but the precedent is confirmed: the eval framework itself can have silent correctness failures, compounding the non-determinism problem with a correctness problem. G-Eval with OpenAI o4-mini also produced 403 errors due to missing logprobs support, requiring a special-case fallback patch.

What to do instead

  1. Treat LLM eval CI gates as smoke tests, not correctness proofs. A green CI run means "no obvious regression," not "the agent is correct." Set thresholds conservatively and expect occasional false failures.
  2. Build deterministic assertion layers where possible. For structured outputs (JSON, tool calls with known schemas), assert on structure and field values directly -- do not route these through an LLM judge. Reserve LLM-as-judge for free-text quality where no deterministic check exists.
  3. For MCP tool testing, build custom stubs now. Write a fake MCP server for your specific tools that returns scripted JSON-RPC responses. This is ad hoc but is the only option until a shared MCP stub library emerges. The gap is structural.
  4. Pin deepeval versions and audit the changelog. The is_successful bug establishes that silent correctness failures in the eval layer are a confirmed risk. Do not auto-upgrade eval dependencies.
  5. Use promptfoo for MCP security testing specifically. Its MCP red-team plugin can test for prompt injection and policy violations via MCP tool interactions -- but this is security testing, not functional correctness testing.
  6. Watch laude-institute/harbor (704 stars). It unifies eval, RL environments, and prompt optimization under one trajectory format (ATIF). If eval and training share the same representation, failing CI trajectories can feed directly into fine-tuning without infrastructure changes.
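Item 2 above -- deterministic assertions on structured output -- needs no framework at all. A minimal sketch, with a hypothetical `search_orders` tool call standing in for your own schema:

```python
def assert_tool_call(call: dict, expected_name: str, required_args: dict) -> None:
    """Deterministic structural check on a tool call -- no LLM judge involved."""
    assert call["name"] == expected_name, f"wrong tool: {call['name']}"
    for arg, expected_type in required_args.items():
        assert arg in call["arguments"], f"missing argument: {arg}"
        assert isinstance(call["arguments"][arg], expected_type), (
            f"{arg} has type {type(call['arguments'][arg]).__name__}")

# Same input, same verdict, every run -- unlike an LLM-as-judge score.
call = {"name": "search_orders",
        "arguments": {"customer_id": "C-17", "limit": 10}}
assert_tool_call(call, "search_orders", {"customer_id": str, "limit": int})
```

Checks like this pass or fail identically on every CI run, so they can be hard gates; route only the genuinely free-text residue through a judge.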

Environments tested

Tool | Version | Result
confident-ai/deepeval | latest (Feb 2026) | independently-confirmed: G-Eval is_successful silent false-pass bug confirmed by community reports
promptfoo/promptfoo | latest (Feb 2026) | source-reviewed: MCP security red-team plugin confirmed; no functional MCP test support
laude-institute/harbor | latest (Feb 2026) | docs-reviewed: ATIF trajectory format reviewed; eval+RL unified
awslabs/agent-evaluation | latest (Feb 2026) | independently-confirmed: non-deterministic outcomes acknowledged in own docs
LangChain FakeChatModel | latest (Feb 2026) | source-reviewed: does not expose prompt inputs without subclassing
amosjyng/vcr-langchain | v0.1.x (stale since Jan 2024) | source-reviewed: HTTP-only recording; no MCP tool dispatch interception

Confidence and gaps

Confidence: source-reviewed + independently-confirmed -- source code and documentation reviewed across 6 tools and frameworks. No eval frameworks were executed in CI pipelines; non-determinism is confirmed by design analysis and third-party acknowledgment (awslabs self-documents it, deepeval bug confirmed by community). Note: scope_matches=false because the claim "every agent eval framework" was assessed by reviewing 4 frameworks (deepeval, promptfoo, harbor, awslabs), not an exhaustive survey.

Unlinked claims: (1) "No MCP stub server library exists" -- searched GitHub, npm, PyPI for "mcp mock", "mcp stub", "mcp test server" in Feb 2026; no results with >10 stars or documented MCP transport interception. (2) "Seed fixing reduces but does not eliminate variation" -- based on vendor documentation (OpenAI, Anthropic), not independent measurement.

Falsification criterion: This claim would be disproved by finding (1) an agent eval framework using LLM-as-judge that achieves deterministic (identical) pass/fail results across 100 consecutive runs on the same input, or (2) an MCP stub/mock library that intercepts stdio or SSE transport tool dispatch for replay in CI.
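Criterion (1) is cheap to test empirically: run the gate repeatedly on identical input and count disagreements with the majority verdict. A minimal harness sketch, with a scripted stub standing in for a real judge-backed gate call:

```python
def gate_flake_rate(run_gate, n=100):
    """Fraction of n identical-input runs that disagree with the majority verdict."""
    outcomes = [run_gate() for _ in range(n)]
    majority = outcomes.count(True) >= n / 2
    disagreements = sum(1 for o in outcomes if o != majority)
    return disagreements / n

# Stub gate with a scripted 10% false-failure rate, for illustration only.
script = iter([True] * 9 + [False])
flaky_gate = lambda: next(script)
print(gate_flake_rate(flaky_gate, n=10))  # → 0.1
```

A deterministic framework would score 0.0 across 100 runs; any nonzero rate on fixed input quantifies the probabilistic-gate problem described above.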

Open questions: Has anyone built a shared MCP stub/mock server library for any transport? Is there a deterministic grading approach for agent outputs that does not use LLM-as-judge and still handles free-text? Has anyone measured the actual variance rate (false positive/negative %) of LLM-as-judge CI gates across 100+ runs on identical inputs?

Seen different? Contribute your evidence -- theory delta is what makes this knowledge base work.