Structured generation guarantees break silently above undocumented complexity thresholds — no provider publishes where the line is
Quick reference
| Provider | Failure mode | Observable signal | Recovery |
|---|---|---|---|
| Anthropic | Schema exceeds hard limit | "Schema is too complex for compilation" — no diagnostic on which limit | Split tools; reduce union types; flatten optional params |
| Anthropic | Context limit mid-generation | Truncated output that violates schema — no API error | Check stop_reason; reduce payload |
| OpenAI | JSON mode — wrong types / missing keys | Silent: valid JSON, schema not enforced | Migrate to strict:true Structured Outputs |
| OpenAI | temperature + logitBias in same request | Silent JSON truncation — shortened output, no error | Remove logitBias when using structured output |
| Gemini | Prompt property order ≠ responseSchema order | Valid JSON but wrong field values or ordering — deterministic, not random | Align property order in prompt to match responseSchema |
| Gemini | Python SDK rejects additionalProperties | SDK validation error at client layer | Call API directly until issue #1815 resolves |
| vLLM (Outlines) | Synchronous FSM compilation | Entire batch stalls under concurrent load | Use XGrammar (default in current vLLM); pre-compile schemas at startup |
What the docs say
Anthropic, OpenAI, and Gemini each advertise schema-adherent structured generation. Anthropic calls it “a mathematical guarantee.” OpenAI’s Structured Outputs reference states it “guarantees the model will always generate responses that adhere to your supplied JSON Schema.” Gemini’s controlled generation documentation describes responseSchema as enforcing response structure. For open-source inference, Outlines and XGrammar provide grammar-constrained decoding with similar guarantees at the token level.
What actually happens
Every provider implementation has an inflection point where schema complexity produces unpredictable behavior. Below the threshold, the guarantee holds. Above it, failures are either silent (wrong output, no error) or crash-level (cryptic 500 / compilation failure). The thresholds are not published in main API documentation. No provider publishes production failure rates.
Anthropic: hard limits with no useful diagnostic
Anthropic’s constrained decoding applies a mathematical guarantee — conditional on stop_reason being neither "refusal" nor "max_tokens". When a request hits the context limit mid-generation, the result is truncated output that violates the schema, not a partial result that still conforms to it.
Four hard limits trigger "Schema is too complex for compilation" when exceeded (Anthropic Structured Outputs docs):
- Compilation timeout: 180 seconds
- Maximum strict tools: 20
- Maximum optional parameters per tool: 24
- Maximum union types in schema: 16
When any limit is exceeded, the request fails entirely — not gracefully with a degraded subset. The error message does not indicate which limit was hit or by how much. Recovery requires schema redesign: flatten nested structures, split required and optional fields into separate tools, reduce union breadth.
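Because the error message does not say which limit was hit, it is cheaper to check before sending. The sketch below uses the limit values quoted from the docs above; the counting heuristics (what counts as a union branch, how optional params are tallied) are our assumptions, not Anthropic's published algorithm.

```python
# Pre-flight check against Anthropic's documented hard limits.
# Limit values are from the docs; the counting logic is an assumption.

MAX_STRICT_TOOLS = 20
MAX_OPTIONAL_PARAMS = 24
MAX_UNION_TYPES = 16

def count_unions(schema):
    """Recursively count anyOf/oneOf branches in a JSON Schema dict."""
    if not isinstance(schema, dict):
        return 0
    total = sum(len(schema.get(key, [])) for key in ("anyOf", "oneOf"))
    for value in schema.values():
        if isinstance(value, dict):
            total += count_unions(value)
        elif isinstance(value, list):
            total += sum(count_unions(item) for item in value
                         if isinstance(item, dict))
    return total

def check_tool(tool):
    """Return a list of human-readable limit violations for one tool."""
    schema = tool["input_schema"]
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    optional = [name for name in props if name not in required]
    problems = []
    if len(optional) > MAX_OPTIONAL_PARAMS:
        problems.append(
            f"{len(optional)} optional params (limit {MAX_OPTIONAL_PARAMS})")
    unions = count_unions(schema)
    if unions > MAX_UNION_TYPES:
        problems.append(
            f"{unions} union branches (limit {MAX_UNION_TYPES})")
    return problems
```

Running this in CI against every tool definition turns the opaque compilation error into a named violation before the request ever leaves your infrastructure.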
Additionally, these schema features are not supported. Each produces either silent wrong output or a compilation failure — no validation error in either case:
| Unsupported feature | Result |
|---|---|
| Recursive schemas | Incorrect output or compilation failure |
| Numerical constraints (minimum, maximum) | Incorrect output or compilation failure |
| String length constraints (minLength, maxLength) | Incorrect output or compilation failure |
| Complex regex patterns | Incorrect output or compilation failure |
| additionalProperties | Incorrect output or compilation failure |
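One workable response to this table: strip the unsupported constraints before sending and re-check them on the model's output. The feature list below comes from the table; the split into "send" versus "validate afterwards" is an assumption about your pipeline, not a documented Anthropic pattern.

```python
# Move unsupported constraints out of the schema and into post-hoc
# validation. Keys listed here mirror the unsupported-feature table.

UNSUPPORTED_KEYS = {"minimum", "maximum", "minLength", "maxLength",
                    "pattern", "additionalProperties"}

def strip_unsupported(schema):
    """Return (cleaned_schema, stripped), where stripped maps JSON paths
    to the constraints removed, so you can validate them after generation."""
    stripped = {}

    def walk(node, path):
        if not isinstance(node, dict):
            return node
        clean = {}
        for key, value in node.items():
            if key in UNSUPPORTED_KEYS:
                stripped.setdefault(path, {})[key] = value
            elif isinstance(value, dict):
                clean[key] = walk(value, f"{path}/{key}")
            elif isinstance(value, list):
                clean[key] = [walk(item, f"{path}/{key}[{i}]")
                              if isinstance(item, dict) else item
                              for i, item in enumerate(value)]
            else:
                clean[key] = value
        return clean

    return walk(schema, ""), stripped
```

The stripped map tells a post-generation validator exactly which constraints were deferred, so nothing is silently lost.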
Grammar cache artifact
Anthropic caches compiled grammars for 24 hours. The first compilation of a complex schema may take seconds; subsequent calls within the cache window return immediately. Benchmarks that evaluate the same schema on back-to-back calls will report fast results and miss the real first-call cost. Any benchmark that does not account for this cache window is measuring warm-cache latency, not compilation latency.
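Cache-busting can be as simple as making each benchmark run's schema unique. This sketch assumes, without confirmation from the docs, that the cache is keyed on the exact schema content, so any byte-level change forces a fresh compilation.

```python
# Cache-busting for compilation benchmarks. Assumption: the 24h grammar
# cache is keyed on exact schema content, so a unique, behavior-neutral
# description changes the cache key without changing what the grammar
# accepts.
import copy
import time
import uuid

def bust_cache(schema):
    """Return a copy of the schema carrying a unique marker."""
    variant = copy.deepcopy(schema)
    variant["description"] = f"bench-{uuid.uuid4().hex}"
    return variant

def time_first_call(schema, send_request):
    """Measure wall-clock latency of one cold-cache request.
    `send_request` is your own function that posts the schema to the API."""
    start = time.perf_counter()
    send_request(bust_cache(schema))
    return time.perf_counter() - start
```

If the description trick ever stops busting the cache, the fallback is waiting out the 24-hour window between runs, which the text above also recommends.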
OpenAI: JSON mode never enforced schemas — use strict:true
Watch out: combining high temperature with logitBias in the same request may trigger silent JSON truncation. The model stops generating before the schema is complete — no API error, just shortened output. Remove logitBias when using structured output. (This failure mode is community-observed; not currently documented in OpenAI’s official reference.)
OpenAI JSON mode enforces syntactically valid JSON. It does not enforce schema adherence. Wrong types, missing required keys, and invented keys all pass without error. This was not a temporary gap — JSON mode has always been a syntax-only guarantee. OpenAI describes Structured Outputs as “the evolution of JSON mode” and recommends it over JSON mode when possible; the production path is Structured Outputs with strict:true, which applies token-level masking against the schema (OpenAI Structured Outputs reference).
Builders migrating from JSON mode to strict:true may discover that schemas which were silently accepted by JSON mode now trigger parameter-conflict failures or hit complexity limits they were unaware of.
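For reference, a minimal sketch of the strict-mode request shape. The two schema requirements strict mode adds, every property listed in required and additionalProperties set to false, are documented constraints; the helper name here is ours, not part of the SDK.

```python
# Build a response_format payload for strict Structured Outputs.
# `strict_response_format` is a hypothetical helper, not an OpenAI API.

def strict_response_format(name, properties):
    """Wrap flat object properties in a strict-mode response_format dict."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "strict": True,
            "schema": {
                "type": "object",
                "properties": properties,
                # Strict mode: every property must be required.
                "required": list(properties),
                # Strict mode: additionalProperties must be false.
                "additionalProperties": False,
            },
        },
    }

fmt = strict_response_format("invoice", {
    "total": {"type": "number"},
    "currency": {"type": "string"},
})
```

Schemas that relied on genuinely optional keys under JSON mode will need restructuring (for example, nullable types) to satisfy these two constraints.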
Gemini: property ordering sensitivity and SDK/API divergence
Gemini’s responseSchema compliance is model-capability-dependent — required fields with insufficient context produce hallucinated values rather than null or an error. The documented mitigation is marking fields nullable: true when the model may not have context to fill them (Google Developer Blog — Mastering Controlled Generation with Gemini 1.5).
The undocumented failure: if the order of properties in the prompt differs from the order of properties in responseSchema, Gemini produces output where properties appear in the wrong order, required values are missing or hallucinated, or field names don’t match the schema. The output is valid JSON — it parses without error — but does not conform to responseSchema. The failure is ordering-dependent and deterministic: same prompt structure, same violation, every run.
SDK/API divergence: the additionalProperties field has been accepted by the Gemini API since November 2025. The official Python SDK (python-genai) still rejects it at the client layer — googleapis/python-genai issue #1815 was closed as “not planned,” meaning Google closed it without fixing the SDK. The workaround is to use response_json_schema instead of response_schema, which bypasses SDK validation but loses Pydantic integration ergonomics.
For Gemini 2.0+, property ordering is no longer an implicit sensitivity but an explicit requirement: Gemini 2.0 requires a propertyOrdering list in the JSON input to define preferred structure. Omitting it produces unordered output, not a documented error.
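One way to keep the ordering requirement from drifting: derive propertyOrdering from the schema's own property order rather than maintaining it by hand. The field names below follow the REST API's camelCase spelling; SDK spellings may differ, and the recursion strategy is our sketch.

```python
# Derive propertyOrdering for every object in a responseSchema from the
# dict's own insertion order, so schema and ordering cannot disagree.

def with_property_ordering(schema):
    """Return a copy of the schema with propertyOrdering added to each
    object node, matching that object's property insertion order."""
    if not isinstance(schema, dict):
        return schema
    out = {}
    for key, value in schema.items():
        if isinstance(value, dict):
            out[key] = with_property_ordering(value)
        elif isinstance(value, list):
            out[key] = [with_property_ordering(item) for item in value]
        else:
            out[key] = value
    if (str(out.get("type", "")).upper() == "OBJECT"
            and isinstance(out.get("properties"), dict)):
        out["propertyOrdering"] = list(out["properties"])
    return out
```

Pairing this with a prompt template that enumerates fields in the same order closes the remaining gap the ordering-sensitivity failure exploits.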
Outlines / vLLM: FSM compilation is synchronous and error-prone
A January 2026 benchmark (arXiv 2501.10868) documented 42 compilation errors across tested grammar categories for Outlines. The vLLM engineering blog characterizes the FSM engine as one that “occasionally crashes the engine” under complex grammars; complex Pydantic models produce “poorly constructed regex” that is often unviable in practice (vLLM blog — Structured Decoding Introduction).
The specific production failure for batch workloads: Outlines’ FSM compilation is synchronous. A single complex schema blocks the entire batch. Under concurrent load, one slow schema stalls all parallel requests.
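A startup-time compilation gate addresses this without assuming anything about the backend's API: compile every schema before serving traffic, so a pathological schema fails at boot rather than stalling a live batch. `compile_schema` below is a placeholder for whatever grammar-compilation entry point your stack exposes, not a real Outlines or vLLM call.

```python
# Compile all known schemas concurrently at startup; collect failures
# instead of letting one slow schema block the whole service.
from concurrent.futures import ThreadPoolExecutor

def precompile(schemas, compile_schema, timeout_s=30.0):
    """Compile each named schema; return (ok, failed) name lists."""
    ok, failed = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {name: pool.submit(compile_schema, schema)
                   for name, schema in schemas.items()}
        for name, future in futures.items():
            try:
                future.result(timeout=timeout_s)
                ok.append(name)
            except Exception:  # compilation error or timeout
                failed.append(name)
    return ok, failed
```

A non-empty failed list at boot is the signal to redesign those schemas before they ever reach the request path.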
XGrammar, using pushdown automata, replaced Outlines as vLLM’s default structured decoding backend. Outlines is now the fallback. XGrammar enables batch compilation and supports recursive grammars. The fact that Outlines remains as a fallback confirms that XGrammar does not have complete coverage — some schemas still require the FSM path.
The trade-off cuts both ways: the January 2026 benchmark shows XGrammar had only 3 compilation errors vs. Outlines’ 42, but XGrammar produced 38 under-constrained outputs — cases where the grammar was accepted but the output was not fully constrained by the schema. Outlines had only 8. More compilation success with XGrammar does not mean tighter schema enforcement across the board.
What to do instead
For Anthropic: stay under the documented hard limits with margin — treat 20 strict tools as 15 in practice. If you hit "Schema is too complex for compilation", split tools by extracting optional params into a separate tool call. Benchmark schema compilation with cache-busting (unique schema variant per run, or wait >24 hours) to measure real first-call cost before production deployment.
For OpenAI: use strict:true Structured Outputs, not JSON mode. Before migrating existing pipelines, test each schema against the Structured Outputs endpoint explicitly — JSON mode silently absorbed failures that strict:true will surface as errors. Do not combine high temperature with logitBias when using structured output.
For Gemini: align prompt property order to responseSchema property order; on Gemini 2.0+, set propertyOrdering explicitly. Mark fields nullable: true for any field the model may lack context to fill. If using additionalProperties, use response_json_schema instead of response_schema to bypass SDK validation — issue #1815 was closed as “not planned.”
For vLLM / open-source: use XGrammar (default in current vLLM). If falling back to Outlines for schema coverage reasons, test compilation at startup rather than at request time — identify blocking schemas before they reach production. Do not use Outlines for batch workloads without an explicit compilation queue.
Universal: test schemas at complexity boundaries — nested objects, large union counts, optional field density — under the same token budget constraints as production requests. Retry logic does not fix compilation failures or ordering sensitivity. Schema redesign is the only recovery path.
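The boundary testing above can be mechanized with a small schema generator. The sweep axes mirror the failure modes in this section (nesting depth, union width); the specific shapes are a sketch, not a provider-published test suite.

```python
# Generate schemas at increasing complexity so the breaking point shows
# up in a test harness, not in production.

def nested_schema(depth):
    """An object schema nested `depth` levels deep."""
    schema = {"type": "string"}
    for _ in range(depth):
        schema = {"type": "object",
                  "properties": {"child": schema},
                  "required": ["child"],
                  "additionalProperties": False}
    return schema

def union_schema(width):
    """A schema with `width` anyOf branches of distinct const values."""
    return {"anyOf": [{"type": "string", "const": f"v{i}"}
                      for i in range(width)]}

def boundary_suite(max_depth=8, max_width=20, step=4):
    """Yield (label, schema) pairs sweeping both complexity axes."""
    for depth in range(step, max_depth + 1, step):
        yield f"depth-{depth}", nested_schema(depth)
    for width in range(step, max_width + 1, step):
        yield f"union-{width}", union_schema(width)
```

Run the suite against each provider under your production token budget and record where conformance first degrades; that recorded threshold is the number no provider will publish for you.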
Evidence
| Tool | Version | Method | What was observed |
|---|---|---|---|
| Anthropic Claude API | reviewed 2026-04-21 | docs-reviewed | Hard limits enforced: 180s compilation timeout, max 20 strict tools, max 24 optional params, max 16 union types. Exceeding any returns "Schema is too complex for compilation" with no indication of which limit was hit. |
| OpenAI API | reviewed 2026-04-21 | docs-reviewed | JSON mode enforces JSON syntax only — wrong types, missing keys, and invented keys pass silently. strict:true is the production path; OpenAI describes Structured Outputs as “the evolution of JSON mode.” |
| Gemini API / python-genai | python-genai (2025-11-onwards) | source-reviewed | Python SDK client rejects additionalProperties at the validation layer; Gemini API has accepted the field since November 2025. Issue #1815 closed as “not planned” — workaround: use response_json_schema. |
| Gemini API | Gemini 1.5 (2024) | docs-reviewed | Property ordering sensitivity documented in Google engineering blog: output does not conform to responseSchema when prompt property order diverges from schema property order. |
| Outlines (dottxt-ai) | January 2026 benchmark | independently-confirmed | 42 compilation errors documented across tested grammar categories in arXiv 2501.10868. Synchronous FSM compilation confirmed in vLLM engineering blog. |
| vLLM | 2026 | independently-confirmed | vLLM engineering blog: FSM engine “occasionally crashes the engine” under complex grammars; XGrammar adopted as default backend, Outlines retained as fallback. |
Confidence and gaps
Falsification criterion: This finding would be disproved by any provider publishing documented complexity thresholds and failure rates that match the behaviour described above, OR by demonstrating that the Anthropic compilation limits, Gemini ordering sensitivity, or Outlines FSM errors described here are version-specific and resolved in current releases.
Confidence: secondary-research — no claims were reproduced by execution in Theory Delta’s environment. Evidence sources are: Anthropic documentation (docs-reviewed), OpenAI documentation (docs-reviewed), Google engineering blog (docs-reviewed), a third-party GitHub issue (source-reviewed, now closed as “not planned”), a peer-reviewed benchmark paper (independently-confirmed), and a vLLM engineering blog post (independently-confirmed). The block carries staleness_risk: high — provider limits and defaults shift frequently. Anthropic, OpenAI, and Gemini have each changed structured generation behavior within the 6 months prior to publication.
Strongest case against: Several claims may already be obsolete. Anthropic, OpenAI, and Gemini release frequently and do not version their structured generation implementations in a way that makes “which version has this bug” easy to determine. The Gemini property ordering claim is sourced from a 2024 blog post against Gemini 1.5; current Gemini 2.x behavior may differ. The Outlines compilation error count is from a January 2026 benchmark snapshot.
Open questions: Do the Anthropic hard limits apply identically to all Claude model versions, or are they implementation-dependent? Does the Gemini property ordering sensitivity persist in Gemini 2.5+ or is it addressed by the explicit propertyOrdering requirement introduced in 2.0? Which grammar patterns still fall back to Outlines in XGrammar’s current coverage, and does the under-constrained output rate differ across schema types? Has any provider added failure rate telemetry to their structured output APIs?
Seen different? Contribute your evidence — Theory Delta is what makes this knowledge base work.