DeepEval silently exfiltrated trace data on import — and Langfuse silently drops your orchestration spans

From Theory Delta | Published 2026-02-27 | Updated 2026-04-19

Update (2026-04-19): The TracerProvider hijack described in this finding was removed in DeepEval v3.9.x (commit 1f903f25, Dec 7 2025). This finding documents behavior present in v3.7.7-era releases and is historical for the DeepEval sections. The Langfuse non-generation span gap is current as of April 2026. Verify your installed versions before applying mitigations.

What you expect

You add DeepEval to your evaluation pipeline and Langfuse to your agent stack. DeepEval grades your LLM outputs. Langfuse traces everything — “instrument once, trace everything” is the tagline. Your telemetry backend receives your agent data. Your CI pipeline grades your outputs. Both tools do what they say.

What actually happens

DeepEval hijacked your OTel pipeline on import (v3.7.7-era, removed Dec 2025)

In versions prior to v3.9.x, importing deepeval registered an exporter that sent trace data to New Relic’s cloud endpoints (otlp.nr-data.net) — regardless of what OTel backend your application had configured. This happened at import time, before any evaluation code ran.

The attack surface was any environment that both imported DeepEval and contained production trace data: CI pipelines running against production databases, staging environments with real user queries, shared test environments with live secrets in trace metadata. The user saw normal evaluation results. Their trace data went somewhere else.

GitHub issue #2497 documents this with community reproduction. The behavior was removed in v3.9.x (commit 1f903f25, December 7, 2025). If your DeepEval version is v3.9.x or later, you are not affected by this specific behavior. If you are on an earlier version, upgrade before using DeepEval in any environment with production trace data.
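A pre-flight version gate is one way to enforce the upgrade boundary in CI. The sketch below is illustrative, not part of DeepEval: it reads the installed version via the standard library and refuses to proceed if the install predates the v3.9.x fix. The function names and the error message are the author's assumptions.

```python
import itertools
from importlib.metadata import PackageNotFoundError, version


def parse_semver(ver: str) -> tuple:
    """Parse the leading numeric components of a version string."""
    parts = []
    for piece in ver.split(".")[:3]:
        # Keep only the leading digits so "0rc1" parses as 0.
        num = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(num) if num else 0)
    return tuple(parts)


def deepeval_predates_fix(installed: str, fixed_in: tuple = (3, 9, 0)) -> bool:
    """True if the installed version still carries the import-time exporter."""
    return parse_semver(installed) < fixed_in


def check_installed() -> None:
    """Fail fast if the local DeepEval install predates the v3.9.x fix."""
    try:
        installed = version("deepeval")
    except PackageNotFoundError:
        return  # not installed: nothing to gate
    if deepeval_predates_fix(installed):
        raise RuntimeError(
            f"deepeval {installed} predates v3.9.x: importing it can register "
            "a New Relic OTLP exporter. Upgrade before use."
        )
```

Running `check_installed()` at the top of a CI job keeps the gate ahead of any `import deepeval`, which is the point: the behavior fired at import time, so the check must run first.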

Langfuse drops your orchestration span inputs and outputs (current, Apr 2026)

Langfuse’s “instrument once, trace everything” marketing holds for direct LLM API calls. It breaks at non-generation spans — the routing steps, tool decisions, and agent handoffs that connect LLM calls in orchestration frameworks.

In production LangGraph supervisor orchestration, non-generation spans show empty input and output values unless set_attribute() or update_current_observation() is called manually. The auto-instrumentation covers what enters and exits the LLM. It does not capture the decision logic between calls. Your traces show the model outputs. They do not show the routing decisions, state transitions, or handoff context that drove those outputs.

The Langfuse v3.167.1 release notes (April 2026) are maintenance-heavy — dependency updates, auth, UI fixes — and do not mention any fix for span field population. Treat this as still open.

The workaround requires explicit annotation for every non-generation span:

from langfuse import observe, get_client

@observe(name="routing_step")
def route_to_agent(state: dict) -> str:
    # Compute the routing decision -- this is the logic
    # auto-instrumentation never records.
    next_node = state.get("next", "default_agent")
    langfuse = get_client()
    langfuse.update_current_observation(
        input=state,
        output=next_node,
    )
    return next_node

This is not documented as a requirement for LangGraph supervisor patterns. It surfaces as an operational gap after your first production trace review.
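Repeating that boilerplate in every routing node invites drift, so one option is to factor the annotation into a decorator. The sketch below is SDK-agnostic on purpose: the `annotate` callable is injected, and in a real stack it would be hypothetical wiring along the lines of `lambda **kw: get_client().update_current_observation(**kw)`. The decorator name and signature are the author's, not Langfuse API.

```python
from functools import wraps


def annotated_step(annotate):
    """Decorator factory for orchestration steps whose input/output
    auto-instrumentation does not capture.

    `annotate` is any callable accepting input=/output= keyword
    arguments; swap in your tracing SDK's update call here.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(state):
            result = fn(state)
            # Record both sides of the step so the span is not empty.
            annotate(input=state, output=result)
            return result
        return wrapper
    return decorator
```

With this in place, each routing function needs only the decorator, and the annotation policy lives in one spot instead of in every node body.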

No observability platform prevents cost runaway mid-run

All reviewed platforms detect token and cost overruns post-hoc in dashboards. None implement real-time blocking at execution time. AgentBudget (v0.2.3) gets closest: it raises a BudgetExhausted exception at the call boundary — but a single over-budget LLM call completes before the exception fires. True mid-turn enforcement remains unsolved. If an agent enters an infinite retrieval loop or a subagent spawning cascade, the overage happens before any dashboard shows it.

Cross-process multi-agent tracing has no automatic solution

All platforms trace multi-agent workflows within one process. Agents running in separate containers, workers, or processes require manual trace ID propagation and injection into the child agent’s context. LangWatch is the exception: it enables HTTP-based trace propagation for cross-process spans — but requires both sides to use LangWatch instrumentation.
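The manual pattern is small but easy to get wrong in one direction: the parent must reuse an existing trace ID when it has one, or the child's spans land in a fresh trace. A minimal sketch of both sides, with payload shape and function names as the author's assumptions:

```python
import uuid
from typing import Optional


def outbound_payload(task: dict, trace_id: Optional[str] = None) -> dict:
    """Parent side: attach a trace ID to the child-agent call payload.

    Reuses the caller's ID when one exists so the child's spans
    join the same trace instead of starting a new one.
    """
    return {"task": task, "trace_id": trace_id or uuid.uuid4().hex}


def child_trace_context(payload: dict) -> dict:
    """Child side: recover the trace ID at agent instantiation and
    hand it to whatever tracer the child process uses."""
    return {"trace_id": payload["trace_id"], "parent": "remote"}
```

The same two hooks work whether the transport is HTTP, a queue, or a container exec; the only requirement is that the payload schema carries the ID end to end.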

What this means for you

If you used DeepEval before v3.9.x in an environment with production trace data: your trace data — including any secrets or PII in trace metadata — was sent to New Relic’s cloud endpoints. Upgrade to v3.9.x+ and audit what was in scope.

If you are deploying Langfuse with LangGraph supervisor orchestration: your traces are silently incomplete. You are seeing the LLM calls but not the orchestration logic between them. Every routing decision, agent handoff, and tool selection that happens between LLM calls is invisible unless you manually annotate it. Your debugging surface is narrower than you think.

If you are relying on dashboard alerts for cost runaway prevention: you are relying on post-hoc detection. Dashboard alerts arrive after the cost has been incurred. The only reliable control is at the SDK level.

What to do

  1. Upgrade DeepEval to v3.9.x+. The TracerProvider hijack is removed. If you cannot upgrade, run DeepEval in an isolated test environment with no production trace data and no production secrets in scope.

  2. Add explicit update_current_observation() calls on every non-generation span. For LangGraph supervisor patterns, treat Langfuse auto-instrumentation as covering LLM API calls only. Budget manual instrumentation time for routing and handoff steps.

  3. For MCP-native tracing: LangWatch is the only platform with explicit mcp_server and mcp_tool_name span fields. Agents self-report via tool calls without SDK wrapping. W&B Weave adds MCP trace logging with a single @weave.op decorator — cloud-only, not viable for air-gap.

  4. For cross-process multi-agent tracing: Propagate a trace ID explicitly in the agent call payload and inject it into the child agent’s context at instantiation. LangWatch’s HTTP-based propagation is the closest to automated.

  5. For cost runaway prevention: Wrap Anthropic API calls with a running token counter. Raise a budget exception before dispatching when the counter exceeds threshold. Do not rely on dashboard alerts.
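Step 5 can be sketched as a pre-dispatch gate: estimate the call's cost, charge it against a running counter, and raise before the request leaves the process. Everything here is illustrative, not any vendor's API: the class names, the reuse of a `BudgetExhausted` exception name, and the rough four-characters-per-token heuristic are all assumptions.

```python
class BudgetExhausted(RuntimeError):
    """Raised before dispatch when a call would exceed the token budget."""


class TokenBudget:
    """Running token counter with a hard ceiling, checked pre-dispatch."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def estimate(self, prompt: str, max_output_tokens: int) -> int:
        # Crude heuristic: ~4 characters per token, plus the response cap.
        return len(prompt) // 4 + max_output_tokens

    def charge(self, prompt: str, max_output_tokens: int) -> None:
        cost = self.estimate(prompt, max_output_tokens)
        if self.used + cost > self.max_tokens:
            raise BudgetExhausted(
                f"would spend {self.used + cost} of {self.max_tokens} tokens"
            )
        self.used += cost


def guarded_call(budget, client_call, prompt, max_output_tokens=512):
    """Check the budget BEFORE dispatching, so an over-budget call
    never reaches the API (unlike call-boundary enforcement,
    where the call completes first)."""
    budget.charge(prompt, max_output_tokens)
    return client_call(prompt)
```

Because the estimate includes the response cap, the gate is conservative: it blocks a call that merely might overrun, which is the trade you want when the alternative is paying for the overrun and reading about it in a dashboard later.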

Evidence

| Claim | Source | Verified |
| --- | --- | --- |
| DeepEval registered OTel exporter sending data to New Relic on import (v3.7.7-era) | Issue #2497 | Yes — community reproduction |
| Behavior removed in v3.9.x, commit 1f903f25, Dec 7 2025 | DeepEval changelog / commit record | Yes — per update banner |
| Langfuse non-generation spans show empty input/output in LangGraph supervisor without set_attribute() | Multiple LangGraph supervisor user reports | Yes — multiple reproductions |
| Langfuse v3.167.1 (Apr 2026) release notes do not mention a fix for span field population | Langfuse changelog | Yes — release notes reviewed |
| LangWatch captures mcp_server / mcp_tool_name span fields natively | LangWatch documentation review | Yes — docs confirmed |
| No platform implements real-time mid-turn cost enforcement | AgentBudget v0.2.3 source review; platform comparison | Yes — call-boundary enforcement only |
| Cross-process multi-agent tracing requires manual trace ID propagation on all platforms except LangWatch | Platform documentation comparison | Yes — docs confirmed |

Confidence: medium — DeepEval hijack confirmed via GitHub issue with community reproduction (historical, v3.7.7-era); Langfuse gap reported by multiple LangGraph supervisor users; removal in v3.9.x confirmed via commit record; Langfuse Apr 2026 release notes reviewed and no fix identified.

What would disprove this: A Langfuse release that explicitly documents which span types require manual set_attribute() for input/output visibility and ships auto-instrumentation for LangGraph supervisor handoff spans.

Seen different? Contribute your evidence