Structured generation guarantees break silently above undocumented complexity thresholds — no provider publishes where the line is

Published: 2026-04-21 · Last verified: 2026-04-21 · secondary-research
8 claims · 0 tested · finding
Staleness risk: high — provider APIs in this area change frequently. Test specific limits and failure modes in your environment before acting.

Quick reference

| Provider | Failure mode | Observable signal | Recovery |
| --- | --- | --- | --- |
| Anthropic | Schema exceeds hard limit | "Schema is too complex for compilation" — no diagnostic on which limit | Split tools; reduce union types; flatten optional params |
| Anthropic | Context limit mid-generation | Truncated output that violates schema — no API error | Check stop_reason; reduce payload |
| OpenAI | JSON mode — wrong types / missing keys | Silent: valid JSON, schema not enforced | Migrate to strict:true Structured Outputs |
| OpenAI | temperature + logitBias in same request | Silent JSON truncation — shortened output, no error | Remove logitBias when using structured output |
| Gemini | Prompt property order ≠ responseSchema order | Valid JSON but wrong field values or ordering — deterministic, not random | Align property order in prompt to match responseSchema |
| Gemini | Python SDK rejects additionalProperties | SDK validation error at client layer | Call API directly until issue #1815 resolves |
| vLLM (Outlines) | Synchronous FSM compilation | Entire batch stalls under concurrent load | Use XGrammar (default in current vLLM); pre-compile schemas at startup |

What the docs say

Anthropic, OpenAI, and Gemini each advertise schema-adherent structured generation. Anthropic calls it “a mathematical guarantee.” OpenAI’s Structured Outputs reference states it “guarantees the model will always generate responses that adhere to your supplied JSON Schema.” Gemini’s controlled generation documentation describes responseSchema as enforcing response structure. For open-source inference, Outlines and XGrammar provide grammar-constrained decoding with similar guarantees at the token level.

What actually happens

Every provider implementation has an inflection point where schema complexity produces unpredictable behavior. Below the threshold, the guarantee holds. Above it, failures are either silent (wrong output, no error) or crash-level (cryptic 500 / compilation failure). The thresholds are not published in main API documentation. No provider publishes production failure rates.

Anthropic: hard limits with no useful diagnostic

Anthropic’s constrained decoding applies a mathematical guarantee, conditional on stop_reason being neither "refusal" nor "max_tokens". When a request hits the context limit mid-generation, the result is truncated output that violates the schema, not a valid partial that still conforms to it.
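That conditionality can be made a hard gate before any parse. A minimal sketch, assuming only that the caller has the response's stop_reason string and raw text in hand (the helper name is illustrative, not an SDK function):

```python
import json

def parse_structured_response(stop_reason: str, text: str) -> dict:
    """Refuse to parse structured output when generation was cut off.

    The schema guarantee is conditional on stop_reason being neither
    "refusal" nor "max_tokens" -- a max_tokens stop means the JSON may
    be truncated even though no API error was raised.
    """
    if stop_reason in ("refusal", "max_tokens"):
        raise ValueError(f"unreliable structured output: stop_reason={stop_reason}")
    return json.loads(text)
```

Gating on stop_reason before json.loads turns the silent truncation into a loud, retryable failure.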

Four hard limits trigger "Schema is too complex for compilation" when exceeded (Anthropic Structured Outputs docs):

  • Compilation timeout: 180 seconds
  • Maximum strict tools: 20
  • Maximum optional parameters per tool: 24
  • Maximum union types in schema: 16

When any limit is exceeded, the request fails entirely — not gracefully with a degraded subset. The error message does not indicate which limit was hit or by how much. Recovery requires schema redesign: flatten nested structures, split required and optional fields into separate tools, reduce union breadth.
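Because the error does not say which limit was hit, a pre-flight check against the documented numbers is worth running client-side. A sketch under stated assumptions: Anthropic does not specify how union types are counted, so counting anyOf sites recursively is a heuristic, and the 0.75 safety margin is ours:

```python
def count_unions(schema: dict) -> int:
    """Recursively count anyOf union sites in a JSON schema (heuristic)."""
    n = 1 if isinstance(schema.get("anyOf"), list) else 0
    for value in schema.values():
        if isinstance(value, dict):
            n += count_unions(value)
        elif isinstance(value, list):
            n += sum(count_unions(v) for v in value if isinstance(v, dict))
    return n

def check_tool_limits(tools: list[dict], margin: float = 0.75) -> list[str]:
    """Flag tools approaching the documented hard limits with margin.

    Documented limits: 20 strict tools, 24 optional params per tool,
    16 union types per schema.
    """
    warnings = []
    if len(tools) > 20 * margin:
        warnings.append(f"{len(tools)} strict tools (limit 20)")
    for tool in tools:
        schema = tool.get("input_schema", {})
        props = schema.get("properties", {})
        required = set(schema.get("required", []))
        optional = [p for p in props if p not in required]
        if len(optional) > 24 * margin:
            warnings.append(f"{tool.get('name')}: {len(optional)} optional params (limit 24)")
        unions = count_unions(schema)
        if unions > 16 * margin:
            warnings.append(f"{tool.get('name')}: {unions} union types (limit 16)")
    return warnings
```

Running this at schema-definition time surfaces the specific limit before the API returns its undifferentiated error.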

Additionally, these schema features are not supported. Each produces either silent wrong output or a compilation failure — no validation error in either case:

| Unsupported feature | Result |
| --- | --- |
| Recursive schemas | Incorrect output or compilation failure |
| Numerical constraints (minimum, maximum) | Incorrect output or compilation failure |
| String length constraints (minLength, maxLength) | Incorrect output or compilation failure |
| Complex regex patterns | Incorrect output or compilation failure |
| additionalProperties | Incorrect output or compilation failure |
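One mitigation is to strip these keywords before submission and enforce them in a post-hoc validator instead. A sketch, assuming the keyword list above is complete for your schemas; note that removing "pattern" wholesale is a blunt stand-in for "complex regex patterns", since the docs do not define where simple ends and complex begins:

```python
# Keywords from the unsupported-features table above; enforce these
# client-side after generation rather than in the provider's compiler.
UNSUPPORTED_KEYWORDS = {
    "minimum", "maximum", "minLength", "maxLength",
    "pattern", "additionalProperties",
}

def strip_unsupported(schema):
    """Return a deep copy of the schema with unsupported keywords removed,
    so the request cannot silently misbehave or fail at compile time."""
    if isinstance(schema, dict):
        return {k: strip_unsupported(v) for k, v in schema.items()
                if k not in UNSUPPORTED_KEYWORDS}
    if isinstance(schema, list):
        return [strip_unsupported(v) for v in schema]
    return schema
```

The stripped constraints still need checking: validate the parsed output against the original schema with a local validator after the response arrives.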

Grammar cache artifact

Anthropic caches compiled grammars for 24 hours. The first compilation of a complex schema may take seconds; subsequent calls within the cache window return immediately. A benchmark that evaluates the same schema on back-to-back calls is therefore measuring warm-cache latency, not compilation latency, and misses the real first-call cost.
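A cache-busting benchmark needs a unique schema variant per run. A sketch under a stated assumption: we assume the cache key covers the full schema text, including description strings, so perturbing a description forces recompilation — verify this against the provider before trusting the numbers:

```python
import copy
import uuid

def cache_busted(schema: dict) -> dict:
    """Return a schema variant intended to defeat the 24h grammar cache
    by perturbing the top-level description, so each benchmark run pays
    the first-call compilation cost rather than a warm-cache hit.

    Assumption (unverified): the cache key includes description text.
    """
    variant = copy.deepcopy(schema)
    base = schema.get("description", "")
    variant["description"] = f"{base} [benchmark run {uuid.uuid4().hex}]"
    return variant
```

Each call yields a structurally identical schema with a distinct description, leaving the original untouched.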

OpenAI: JSON mode never enforced schemas — use strict:true

Watch out: combining high temperature with logitBias in the same request may trigger silent JSON truncation. The model stops generating before the schema is complete — no API error, just shortened output. Remove logitBias when using structured output. (This failure mode is community-observed; not currently documented in OpenAI’s official reference.)

OpenAI JSON mode enforces syntactically valid JSON. It does not enforce schema adherence. Wrong types, missing required keys, and invented keys all pass without error. This was not a temporary gap — JSON mode has always been a syntax-only guarantee. OpenAI describes Structured Outputs as “the evolution of JSON mode” and recommends it over JSON mode when possible; the production path is Structured Outputs with strict:true, which applies token-level masking against the schema (OpenAI Structured Outputs reference).
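The migration is a change to the response_format block. A minimal helper following the Chat Completions response_format shape in OpenAI's Structured Outputs reference — the enforced path is "json_schema" with strict set, versus bare JSON mode's {"type": "json_object"}:

```python
def strict_response_format(name: str, schema: dict) -> dict:
    """Build a response_format block for OpenAI Structured Outputs with
    strict schema enforcement, replacing a syntax-only JSON-mode request."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "strict": True,   # enables token-level masking against the schema
            "schema": schema,
        },
    }
```

Pass the result as the response_format argument of a chat completions request; with strict omitted or False, enforcement falls back toward best-effort adherence.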

Builders migrating from JSON mode to strict:true may discover that schemas which were silently accepted by JSON mode now trigger parameter-conflict failures or hit complexity limits they were unaware of.

Gemini: property ordering sensitivity and SDK/API divergence

Gemini’s responseSchema compliance is model-capability-dependent — required fields with insufficient context produce hallucinated values rather than null or an error. The documented mitigation is marking fields nullable: true when the model may not have context to fill them (Google Developer Blog — Mastering Controlled Generation with Gemini 1.5).

The undocumented failure: if the order of properties in the prompt differs from the order of properties in responseSchema, Gemini produces output where properties appear in the wrong order, required values are missing or hallucinated, or field names don’t match the schema. The output is valid JSON — it parses without error — but does not conform to responseSchema. The failure is ordering-dependent and deterministic: same prompt structure, same violation, every run.
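Because the output parses cleanly, the violation has to be detected by comparing the response against responseSchema directly. A minimal conformance check — json.loads preserves the object's key order, so order drift in the raw response is observable here (the function is ours, not an SDK API):

```python
import json

def check_order(output_text: str, response_schema: dict) -> list[str]:
    """Compare a JSON response against responseSchema: flag key-order
    drift and missing required fields that a plain parse would accept."""
    data = json.loads(output_text)
    expected = list(response_schema.get("properties", {}))
    got = list(data)
    problems = []
    # Compare relative order of the keys both sides share.
    if [k for k in got if k in expected] != [k for k in expected if k in got]:
        problems.append(f"order drift: expected {expected}, got {got}")
    missing = [k for k in response_schema.get("required", []) if k not in data]
    if missing:
        problems.append(f"missing required: {missing}")
    return problems
```

Since the failure is deterministic, running this once per prompt template in CI is enough to catch a misaligned prompt before production.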

SDK/API divergence: the additionalProperties field has been accepted by the Gemini API since November 2025, but the official Python SDK (python-genai) still rejects it at the client layer. googleapis/python-genai issue #1815 was closed as “not planned,” so the SDK-side rejection will not be fixed. The workaround is to use response_json_schema instead of response_schema, which bypasses SDK validation but loses the Pydantic integration ergonomics.

For Gemini 2.0+, property ordering is no longer an implicit sensitivity but an explicit requirement: Gemini 2.0 requires a propertyOrdering list in the JSON input to define preferred structure. Omitting it produces unordered output, not a documented error.
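Since Python dicts preserve insertion order, a propertyOrdering list can be derived mechanically from the schema itself rather than maintained by hand. A sketch, assuming the schema is a plain dict in the Gemini schema shape (propertyOrdering is the field name per the Gemini docs; the helper is ours):

```python
def with_property_ordering(schema: dict) -> dict:
    """Return a copy of a Gemini-style schema with an explicit
    propertyOrdering list derived from each object's own property
    insertion order, applied recursively to nested objects."""
    out = dict(schema)
    props = schema.get("properties")
    if isinstance(props, dict):
        out["propertyOrdering"] = list(props)
        out["properties"] = {
            k: with_property_ordering(v) if isinstance(v, dict) else v
            for k, v in props.items()
        }
    return out
```

This keeps the ordering declaration from drifting out of sync with the schema as fields are added.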

Outlines / vLLM: FSM compilation is synchronous and error-prone

A January 2026 benchmark (arXiv 2501.10868) documented 42 compilation errors across tested grammar categories for Outlines. The vLLM engineering blog characterizes the FSM engine as one that “occasionally crashes the engine” under complex grammars; complex Pydantic models produce “poorly constructed regex” that is often unviable in practice (vLLM blog — Structured Decoding Introduction).

The specific production failure for batch workloads: Outlines’ FSM compilation is synchronous. A single complex schema blocks the entire batch. Under concurrent load, one slow schema stalls all parallel requests.
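Pre-compiling every schema at startup moves that stall out of the request path. A sketch using a single worker thread with a timeout; compile_fn stands in for whatever compile entry point your backend exposes (illustrative, not a specific Outlines API). Note a timed-out compile keeps running in its thread, so the pool drains it at shutdown:

```python
import concurrent.futures

def precompile_all(schemas: dict, compile_fn, timeout_s: float = 10.0) -> dict:
    """Compile each schema at startup so one that would block a
    synchronous FSM build is identified before it can stall a batch.

    Returns a map of schema name -> failure description; empty means
    everything compiled within the time budget.
    """
    blocking = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for name, schema in schemas.items():
            future = pool.submit(compile_fn, schema)
            try:
                future.result(timeout=timeout_s)
            except concurrent.futures.TimeoutError:
                blocking[name] = f"timeout after {timeout_s}s"
            except Exception as exc:
                blocking[name] = f"error: {exc}"
    return blocking
```

Refusing to serve until this returns empty turns a production batch stall into a failed deploy, which is the cheaper place to find it.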

XGrammar, using pushdown automata, replaced Outlines as vLLM’s default structured decoding backend. Outlines is now the fallback. XGrammar enables batch compilation and supports recursive grammars. The fact that Outlines remains as a fallback confirms that XGrammar does not have complete coverage — some schemas still require the FSM path.

The trade-off is not zero-sum: the January 2026 benchmark shows XGrammar had only 3 compilation errors vs. Outlines’ 42, but XGrammar produced 38 under-constrained outputs — cases where the grammar was accepted but the output was not fully constrained by the schema. Outlines had only 8. More compilation success with XGrammar does not mean tighter schema enforcement across the board.

What to do instead

For Anthropic: stay under the documented hard limits with margin — treat 20 strict tools as 15 in practice. If you hit "Schema is too complex for compilation", split tools by extracting optional params into a separate tool call. Benchmark schema compilation with cache-busting (unique schema variant per run, or wait >24 hours) to measure real first-call cost before production deployment.

For OpenAI: use strict:true Structured Outputs, not JSON mode. Before migrating existing pipelines, test each schema against the Structured Outputs endpoint explicitly — JSON mode silently absorbed failures that strict:true will surface as errors. Do not combine high temperature with logitBias when using structured output.

For Gemini: align prompt property order to responseSchema property order; on Gemini 2.0+, set propertyOrdering explicitly. Mark fields nullable: true for any field the model may lack context to fill. If using additionalProperties, use response_json_schema instead of response_schema to bypass SDK validation — issue #1815 was closed as “not planned.”

For vLLM / open-source: use XGrammar (default in current vLLM). If falling back to Outlines for schema coverage reasons, test compilation at startup rather than at request time — identify blocking schemas before they reach production. Do not use Outlines for batch workloads without an explicit compilation queue.

Universal: test schemas at complexity boundaries — nested objects, large union counts, optional field density — under the same token budget constraints as production requests. Retry logic does not fix compilation failures or ordering sensitivity. Schema redesign is the only recovery path.
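Boundary probing is easy to parameterize. Two small generators for the complexity axes named above — nesting depth and union breadth — which can be fed through a provider's endpoint at increasing sizes until the guarantee degrades (the function names are ours):

```python
def nested_schema(depth: int) -> dict:
    """Build a schema nested `depth` object levels deep, for probing
    where a provider's structured-output guarantee starts to degrade."""
    schema = {"type": "string"}
    for level in range(depth):
        schema = {
            "type": "object",
            "properties": {f"level_{level}": schema},
            "required": [f"level_{level}"],
        }
    return schema

def union_schema(width: int) -> dict:
    """Build a schema with `width` anyOf branches, for probing union
    limits (e.g. Anthropic's documented maximum of 16)."""
    return {"anyOf": [
        {"type": "object", "properties": {f"variant_{i}": {"type": "string"}}}
        for i in range(width)
    ]}
```

Sweeping depth and width under the production token budget locates the threshold empirically, which no provider currently publishes.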

Evidence

| Tool | Version | Method | What was observed |
| --- | --- | --- | --- |
| Anthropic Claude API | reviewed 2026-04-21 | docs-reviewed | Hard limits enforced: 180s compilation timeout, max 20 strict tools, max 24 optional params, max 16 union types. Exceeding any returns "Schema is too complex for compilation" with no indication of which limit was hit. |
| OpenAI API | reviewed 2026-04-21 | docs-reviewed | JSON mode enforces JSON syntax only — wrong types, missing keys, and invented keys pass silently. strict:true is the production path; OpenAI describes Structured Outputs as “the evolution of JSON mode.” |
| Gemini API / python-genai | python-genai (2025-11-onwards) | source-reviewed | Python SDK client rejects additionalProperties at the validation layer; Gemini API has accepted the field since November 2025. Issue #1815 closed as “not planned” — workaround: use response_json_schema. |
| Gemini API | Gemini 1.5 (2024) | docs-reviewed | Property ordering sensitivity documented in Google engineering blog: output does not conform to responseSchema when prompt property order diverges from schema property order. |
| Outlines (dottxt-ai) | January 2026 benchmark | independently-confirmed | 42 compilation errors documented across tested grammar categories in arXiv 2501.10868. Synchronous FSM compilation confirmed in vLLM engineering blog. |
| vLLM | 2026 | independently-confirmed | vLLM engineering blog: FSM engine “occasionally crashes the engine” under complex grammars; XGrammar adopted as default backend, Outlines retained as fallback. |

Confidence and gaps

Falsification criterion: This finding would be disproved by any provider publishing documented complexity thresholds and failure rates that match the behavior described above, OR by demonstrating that the Anthropic compilation limits, Gemini ordering sensitivity, or Outlines FSM errors described here are version-specific and resolved in current releases.

Confidence: secondary-research — no claims were reproduced by execution in Theory Delta’s environment. Evidence sources are: Anthropic documentation (docs-reviewed), OpenAI documentation (docs-reviewed), Google engineering blog (docs-reviewed), a third-party GitHub issue (source-reviewed, now closed as “not planned”), a peer-reviewed benchmark paper (independently-confirmed), and a vLLM engineering blog post (independently-confirmed). The block carries staleness_risk: high — provider limits and defaults shift frequently. Anthropic, OpenAI, and Gemini have each changed structured generation behavior within the 6 months prior to publication.

Strongest case against: Several claims may already be obsolete. Anthropic, OpenAI, and Gemini release frequently and do not version their structured generation implementations in a way that makes “which version has this bug” easy to determine. The Gemini property ordering claim is sourced from a 2024 blog post against Gemini 1.5; current Gemini 2.x behavior may differ. The Outlines compilation error count is from a January 2026 benchmark snapshot.

Open questions: Do the Anthropic hard limits apply identically to all Claude model versions, or are they implementation-dependent? Does the Gemini property ordering sensitivity persist in Gemini 2.5+ or is it addressed by the explicit propertyOrdering requirement introduced in 2.0? Which grammar patterns still fall back to Outlines in XGrammar’s current coverage, and does the under-constrained output rate differ across schema types? Has any provider added failure rate telemetry to their structured output APIs?

Seen different? Contribute your evidence — Theory Delta is what makes this knowledge base work.