LangGraph checkpoints silently corrupt non-primitive types — your resume will not restore what you saved

Published: 2026-03-29 | Last verified: 2026-03-01 | empirical
6 claims | 0 tested | finding

From Theory Delta | Methodology | Published 2026-03-29

What you expect

LangGraph checkpointing is the foundation of human-in-the-loop and stateful RAG workflows. You save graph state (including Pydantic models, Enums, or custom classes) to a checkpoint, then resume from it. The resume should give you back exactly what you saved, with the same values and the same types.

What actually happens

LangGraph checkpoint round-trips are lossy for non-primitive types. Four distinct silent failure modes have been confirmed since January 2026 in open bugs, all affecting LangGraph v1.0.10:

1. JsonPlusSerializer null-on-failure (bug #6970, open as of 2026-02-28): When deserialization fails, JsonPlusSerializer replaces the failed value with None instead of raising an exception. The graph continues with a corrupted state object. No warning, no log entry, no exception. This affects any complex type stored in checkpoint state.

2. StrEnum coerced to plain str (bug #6598, January 2026): StrEnum values silently become plain str after a checkpoint round-trip. The value survives but the type information is lost, so isinstance(value, MyStrEnum) returns False after a resume. Any state machine logic that routes on enum type (rather than enum value) takes the wrong branch without raising an error.

3. Nested Enum fields become None (bug #6718, February 2026): Nested Enum fields in checkpoint state deserialize as None rather than raising. Like bug #6970, this is silent replacement — the state object looks valid but contains corrupted values.

4. BinaryOperatorAggregate wrapper leak (bug #6909, 2026-02-27): When a channel starts MISSING, BinaryOperatorAggregate with Overwrite returns the wrapper object rather than the unwrapped payload. Downstream code receives a BinaryOperatorAggregate instance where it expects the actual state value.

These are not edge cases in obscure usage paths. They affect any LangGraph pipeline storing Pydantic models, Enums, or custom classes through checkpointing — which is most production agentic RAG pipelines.
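The StrEnum coercion in item 2 is easy to picture with a plain JSON round trip. The sketch below does not use LangGraph or JsonPlusSerializer. It uses the standard library `json` module as a stand-in, and `Route` is a hypothetical enum, to show how a string-backed enum can come back as a plain `str` with its value intact but its type gone.

```python
import json
from enum import Enum


class Route(str, Enum):
    """Hypothetical string-backed enum (same shape as a StrEnum)."""
    APPROVE = "approve"
    REJECT = "reject"


# Stand-in for a checkpoint round trip: serialize the state, then load it back.
saved = {"route": Route.APPROVE}
restored = json.loads(json.dumps(saved))

# Comparing by value still succeeds, because the string content is unchanged.
value_survived = restored["route"] == Route.APPROVE
# Checking the enum type does not, because the restored object is a plain str.
type_survived = isinstance(restored["route"], Route)

print(value_survived, type_survived)
```

Routing on `isinstance(value, Route)` takes the wrong branch after such a round trip, while routing on the value still works.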

Human-in-the-loop workflows with chained interrupts are also broken. Bug #6956 (open, 2026-02-27): get_state().next returns an empty tuple () after resuming from the first of two interrupt() calls in the same node. The graph is still paused — but the snapshot reports it as complete. Any code checking state.next to determine whether a graph is still running will silently misread a paused graph as finished. An agent waiting for human approval may receive a “complete” signal and proceed without it.

Conditional edge routing has a separate footgun. A string literal written as an inline docstring inside a Python dict literal used as a conditional edge mapping silently corrupts the routing key: Python concatenates adjacent string literals, so the docstring becomes part of the dictionary key and the lookup raises a KeyError at runtime. A newer variant (bug #6770): KeyError('__end__') is raised when a conditional router returns '__end__' but path_map does not explicitly include an __end__/END key.
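The docstring footgun follows from ordinary Python syntax rather than anything LangGraph-specific: adjacent string literals are joined at compile time, so a string placed directly in front of a key merges into it. A minimal sketch, with hypothetical key and node names:

```python
# A string literal meant as an inline "docstring" sits directly before the key.
# Python concatenates adjacent string literals, so the two become one key.
path_map = {
    "Send to human review. "
    "review": "review_node",
    "done": "finish_node",
}

print(list(path_map))

# The router returns "review", but that key no longer exists in the mapping.
try:
    target = path_map["review"]
except KeyError as exc:
    target = None
    print("KeyError:", exc)
```

Using a `#` comment instead of a string literal avoids the problem, because comments never take part in concatenation.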

What this means for you

Your LangGraph stateful pipeline will appear to work in development, and will silently corrupt state in production under specific type patterns you may already be using.

The failure path: you store a Pydantic model or Enum in graph state, checkpoint it (either for human review or fault recovery), resume — and the resumed state has None where a value should be. Your downstream logic receives None, makes a bad decision or throws an unhelpful error, and the root cause traces back to a checkpoint round-trip that never raised an exception.

The interrupt snapshot bug has a more direct impact on human-in-the-loop flows: your approval workflow will see “complete” and proceed. The human never approved. The workflow has no record that approval was skipped.

If you are on LangGraph v1.0.10 and any of these types appear in your checkpoint state — Pydantic models, StrEnum, nested Enum, BinaryOperatorAggregate channels — treat your checkpoint round-trips as unreliable until bugs #6970, #6598, #6718, and #6909 are resolved.

What to do

For LangGraph stateful pipelines: Treat checkpoint round-trips as lossy for non-primitive types until bugs #6970, #6598, #6718, and #6909 are closed. Add explicit checkpoint validation after every resume call:

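# After resuming a LangGraph graph, validate critical fields before acting on them.
# `graph`, `config`, and `MyExpectedType` come from your own application.
state = graph.get_state(config)
value = state.values.get("my_enum")
# Raise explicitly instead of using assert, which is skipped under `python -O`.
if value is None:
    raise ValueError("checkpoint deserialization failure: my_enum is None")
if not isinstance(value, MyExpectedType):
    raise TypeError(f"type corrupted after resume: {type(value)}")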

For state that must survive checkpoint round-trips, prefer primitive types (str, int, dict with primitive values) over Pydantic models and Enums where possible. If Enums are required, serialize them to their .value before storing in graph state and reconstruct on read.
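A minimal sketch of the store-the-value, reconstruct-on-read pattern follows. `Status`, `to_state`, and `status_from_state` are hypothetical names chosen for illustration, and the plain dict stands in for graph state.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical workflow status used only for this sketch."""
    PENDING = "pending"
    APPROVED = "approved"


def to_state(status: Status) -> dict:
    # Store only the primitive value so the checkpoint holds a plain str.
    return {"status": status.value}


def status_from_state(state: dict) -> Status:
    # Reconstruct on read. Status(...) raises ValueError for None or unknown
    # values, so a corrupted field fails loudly instead of flowing downstream.
    return Status(state["status"])


state = to_state(Status.APPROVED)
print(state, status_from_state(state))
```

Because the enum is rebuilt at the point of use, a field that came back as None surfaces as a ValueError at the read site rather than as a confusing failure several nodes later.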

For human-in-the-loop workflows with chained interrupts: Do not rely solely on state.next to determine if a graph is paused. Track interrupt state explicitly in your application layer until bug #6956 is closed.
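One way to track interrupt state outside the snapshot is to keep an application-owned record of the approvals a run still needs, and to treat the run as paused while any remain. The sketch below is a hypothetical in-memory tracker: `ApprovalTracker`, its method names, and the approval labels are illustrative and not part of LangGraph. The `snapshot_next` argument stands for whatever `get_state(config).next` returned.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalTracker:
    """Hypothetical application-layer record of outstanding approvals per thread."""
    pending: dict[str, set[str]] = field(default_factory=dict)

    def expect(self, thread_id: str, *approvals: str) -> None:
        # Register every approval the run must collect before it may finish.
        self.pending.setdefault(thread_id, set()).update(approvals)

    def record(self, thread_id: str, approval: str) -> None:
        # Mark one approval as received from a human.
        self.pending.get(thread_id, set()).discard(approval)

    def is_paused(self, thread_id: str, snapshot_next: tuple) -> bool:
        # Paused if either the snapshot or our own record says work remains.
        return bool(snapshot_next) or bool(self.pending.get(thread_id))


tracker = ApprovalTracker()
tracker.expect("thread-1", "legal", "finance")
tracker.record("thread-1", "legal")

# The snapshot reports an empty `next`, but one approval is still outstanding.
print(tracker.is_paused("thread-1", snapshot_next=()))
```

In a real deployment the record would need to live in durable storage alongside the checkpoint, since an in-memory dict is lost on restart.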

For conditional edge routing: Do not use inline docstrings inside Python dict literals in edge mappings. Always include an explicit "__end__": "__end__" entry in path_map for any conditional router that may return __end__.
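A small guard can make a missing end key visible when the graph is built rather than when it runs. The helpers below are a hypothetical sketch, not a LangGraph API. They assume the end sentinel is the string "__end__" quoted in the KeyError described earlier, and define it locally so the example runs on its own.

```python
END = "__end__"  # assumption: matches the sentinel named in the KeyError


def with_explicit_end(path_map: dict[str, str]) -> dict[str, str]:
    """Return a copy of path_map that always maps END to END."""
    completed = dict(path_map)
    completed.setdefault(END, END)
    return completed


def check_router_outputs(path_map: dict[str, str], outputs: set[str]) -> None:
    # Fail early if the router can return a key the mapping lacks.
    missing = outputs - set(path_map)
    if missing:
        raise ValueError(f"path_map is missing keys: {sorted(missing)}")


# Hypothetical router outputs and node names.
path_map = with_explicit_end({"retry": "retrieve", "answer": "generate"})
check_router_outputs(path_map, {"retry", "answer", END})
print(path_map)
```

Listing the router's possible return values in one place also catches keys that were corrupted by the docstring issue, since the expected key would show up as missing.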

This finding would be disproved by: LangGraph v1.0.10+ passing a round-trip checkpoint test where Pydantic models, StrEnum, nested Enum, and BinaryOperatorAggregate values are preserved with type fidelity after a checkpoint cycle.
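A round-trip test of that kind needs a comparison that checks types as well as values, since `==` alone misses the StrEnum coercion. The sketch below is a hypothetical helper, `fidelity_diff`, that walks two state dicts and reports every path where the value or the type changed. It is exercised here with a `json` round trip as a stand-in; wiring it to a real LangGraph checkpoint cycle is left to the reader.

```python
import json
from enum import Enum
from typing import Any


class Mode(str, Enum):
    """Hypothetical string-backed enum used as test data."""
    FAST = "fast"


def fidelity_diff(before: Any, after: Any, path: str = "state") -> list[str]:
    """List every path where the type or value differs after a round trip."""
    if type(before) is not type(after):
        return [f"{path}: {type(before).__name__} -> {type(after).__name__}"]
    if isinstance(before, dict):
        problems = []
        for key in sorted(set(before) | set(after), key=str):
            if key not in before or key not in after:
                problems.append(f"{path}[{key!r}]: key added or dropped")
            else:
                problems.extend(
                    fidelity_diff(before[key], after[key], f"{path}[{key!r}]")
                )
        return problems
    if isinstance(before, (list, tuple)):
        if len(before) != len(after):
            return [f"{path}: length {len(before)} -> {len(after)}"]
        problems = []
        for i, (b, a) in enumerate(zip(before, after)):
            problems.extend(fidelity_diff(b, a, f"{path}[{i}]"))
        return problems
    return [] if before == after else [f"{path}: {before!r} -> {after!r}"]


saved = {"mode": Mode.FAST, "retries": 3, "tags": ["a", "b"]}
restored = json.loads(json.dumps(saved))  # stand-in for a checkpoint cycle
report = fidelity_diff(saved, restored)
print(report)
```

An empty report means the round trip preserved both values and types for everything the helper knows how to walk.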

Evidence

Tool | Version | Result
LangGraph | v1.0.10 | source-reviewed: JsonPlusSerializer replaces deserialization failures with None (#6970, open)
LangGraph | v1.0.10 | source-reviewed: StrEnum coerced to str after checkpoint round-trip (#6598)
LangGraph | v1.0.10 | source-reviewed: nested Enum fields become None after resume (#6718)
LangGraph | v1.0.10 | source-reviewed: BinaryOperatorAggregate returns wrapper instead of payload (#6909)
LangGraph | v1.0.10 | source-reviewed: get_state().next empty after first of two interrupt() calls (#6956, open)
Microsoft GraphRAG | v3.0.5 | source-reviewed: v3 pipeline extremely slow vs v2 after NetworkX removal (#2250, open)

Confidence: empirical — all four LangGraph serialization bugs are confirmed in open GitHub issues by third-party reporters (not Theory Delta). Not tested by execution in Theory Delta’s environment — these are source-reviewed from the respective GitHub issue trackers. The bug status (open vs closed) reflects the state as of 2026-03-01; some may have been addressed in subsequent LangGraph releases.

Strongest case against: These bugs may already be fixed in LangGraph versions later than v1.0.10. Open issues do not guarantee unfixed behavior — LangGraph releases frequently. The serialization failures affect specific type patterns; pipelines using only primitive types in checkpoint state are unaffected.

Open questions: Which LangGraph version (if any) closes all four serialization bugs? Is there a LangGraph release where checkpoint round-trips can be considered reliable for Pydantic models?
