Structured generation guarantees break silently above undocumented complexity thresholds — no provider publishes where the line is
Quick reference
| Provider | Failure mode | Observable signal | Recovery |
|---|---|---|---|
| Anthropic | Schema exceeds hard limit | "Schema is too complex for compilation" — no diagnostic on which limit | Split tools; reduce union types; flatten optional params |
| Anthropic | Context limit mid-generation | Truncated output that violates schema — no API error | Check stop_reason; reduce payload |
| OpenAI | JSON mode — wrong types / missing keys | Silent: valid JSON, schema not enforced | Migrate to strict:true Structured Outputs |
| OpenAI | temperature + logitBias in same request | Silent JSON truncation — shortened output, no error | Remove logitBias when using structured output |
| Gemini | Prompt property order ≠ responseSchema order | Valid JSON but wrong field values or ordering — deterministic, not random | Align property order in prompt to match responseSchema |
| Gemini | Python SDK rejects additionalProperties | SDK validation error at client layer | Call API directly until issue #1815 resolves |
| vLLM (Outlines) | Synchronous FSM compilation | Entire batch stalls under concurrent load | Use XGrammar (default in current vLLM); pre-compile schemas at startup |
What the docs say
Anthropic, OpenAI, and Gemini each advertise schema-adherent structured generation. Anthropic calls it “a mathematical guarantee.” OpenAI’s Structured Outputs reference states it “guarantees the model will always generate responses that adhere to your supplied JSON Schema.” Gemini’s controlled generation documentation describes responseSchema as enforcing response structure. For open-source inference, Outlines and XGrammar provide grammar-constrained decoding with similar guarantees at the token level.
What actually happens
Every provider implementation has an inflection point where schema complexity produces unpredictable behavior. Below the threshold, the guarantee holds. Above it, failures are either silent (wrong output, no error) or crash-level (cryptic 500 / compilation failure). The thresholds are not published in main API documentation. No provider publishes production failure rates.
Anthropic: hard limits with no useful diagnostic
Anthropic’s constrained decoding applies a mathematical guarantee — conditional on stop_reason being neither "refusal" nor "max_tokens". When a request hits the context limit mid-generation, the result is truncated output that violates the schema, not a partial result that still conforms to it.
Four hard limits trigger "Schema is too complex for compilation" when exceeded (Anthropic Structured Outputs docs):
- Compilation timeout: 180 seconds
- Maximum strict tools: 20
- Maximum optional parameters per tool: 24
- Maximum union types in schema: 16
When any limit is exceeded, the request fails entirely — not gracefully with a degraded subset. The error message does not indicate which limit was hit or by how much. Recovery requires schema redesign: flatten nested structures, split required and optional fields into separate tools, reduce union breadth.
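Because the error message does not say which limit was hit, it is cheaper to check before sending. The sketch below uses the limit values quoted from the docs above; the counting heuristics (what counts as a union branch, how optional params are tallied) are our assumptions, not Anthropic's published algorithm.

```python
# Pre-flight check against Anthropic's documented hard limits.
# Limit values are from the docs; the counting logic is an assumption.

MAX_STRICT_TOOLS = 20
MAX_OPTIONAL_PARAMS = 24
MAX_UNION_TYPES = 16

def count_unions(schema):
    """Recursively count anyOf/oneOf branches in a JSON Schema dict."""
    if not isinstance(schema, dict):
        return 0
    total = sum(len(schema.get(key, [])) for key in ("anyOf", "oneOf"))
    for value in schema.values():
        if isinstance(value, dict):
            total += count_unions(value)
        elif isinstance(value, list):
            total += sum(count_unions(item) for item in value
                         if isinstance(item, dict))
    return total

def check_tool(tool):
    """Return a list of human-readable limit violations for one tool."""
    schema = tool["input_schema"]
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    optional = [name for name in props if name not in required]
    problems = []
    if len(optional) > MAX_OPTIONAL_PARAMS:
        problems.append(
            f"{len(optional)} optional params (limit {MAX_OPTIONAL_PARAMS})")
    unions = count_unions(schema)
    if unions > MAX_UNION_TYPES:
        problems.append(
            f"{unions} union branches (limit {MAX_UNION_TYPES})")
    return problems
```

Running this in CI against every tool definition turns the opaque compilation error into a named violation before the request ever leaves your infrastructure.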
Additionally, these schema features are not supported. Each produces either silent wrong output or a compilation failure — no validation error in either case:
| Unsupported feature | Result |
|---|---|
| Recursive schemas | Incorrect output or compilation failure |
| Numerical constraints (minimum, maximum) | Incorrect output or compilation failure |
| String length constraints (minLength, maxLength) | Incorrect output or compilation failure |
| Complex regex patterns | Incorrect output or compilation failure |
| additionalProperties | Incorrect output or compilation failure |
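One workable response to this table: strip the unsupported constraints before sending and re-check them on the model's output. The feature list below comes from the table; the split into "send" versus "validate afterwards" is an assumption about your pipeline, not a documented Anthropic pattern.

```python
# Move unsupported constraints out of the schema and into post-hoc
# validation. Keys listed here mirror the unsupported-feature table.

UNSUPPORTED_KEYS = {"minimum", "maximum", "minLength", "maxLength",
                    "pattern", "additionalProperties"}

def strip_unsupported(schema):
    """Return (cleaned_schema, stripped), where stripped maps JSON paths
    to the constraints removed, so you can validate them after generation."""
    stripped = {}

    def walk(node, path):
        if not isinstance(node, dict):
            return node
        clean = {}
        for key, value in node.items():
            if key in UNSUPPORTED_KEYS:
                stripped.setdefault(path, {})[key] = value
            elif isinstance(value, dict):
                clean[key] = walk(value, f"{path}/{key}")
            elif isinstance(value, list):
                clean[key] = [walk(item, f"{path}/{key}[{i}]")
                              if isinstance(item, dict) else item
                              for i, item in enumerate(value)]
            else:
                clean[key] = value
        return clean

    return walk(schema, ""), stripped
```

The stripped map tells a post-generation validator exactly which constraints were deferred, so nothing is silently lost.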
Grammar cache artifact
Anthropic caches compiled grammars for 24 hours. The first compilation of a complex schema may take seconds; subsequent calls within the cache window return immediately. Benchmarks that evaluate the same schema on back-to-back calls will report fast results and miss the real first-call cost. Any benchmark that does not account for this cache window is measuring warm-cache latency, not compilation latency.
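Cache-busting can be as simple as making each benchmark run's schema unique. This sketch assumes, without confirmation from the docs, that the cache is keyed on the exact schema content, so any byte-level change forces a fresh compilation.

```python
# Cache-busting for compilation benchmarks. Assumption: the 24h grammar
# cache is keyed on exact schema content, so a unique, behavior-neutral
# description changes the cache key without changing what the grammar
# accepts.
import copy
import time
import uuid

def bust_cache(schema):
    """Return a copy of the schema carrying a unique marker."""
    variant = copy.deepcopy(schema)
    variant["description"] = f"bench-{uuid.uuid4().hex}"
    return variant

def time_first_call(schema, send_request):
    """Measure wall-clock latency of one cold-cache request.
    `send_request` is your own function that posts the schema to the API."""
    start = time.perf_counter()
    send_request(bust_cache(schema))
    return time.perf_counter() - start
```

If the description trick ever stops busting the cache, the fallback is waiting out the 24-hour window between runs, which the text above also recommends.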
OpenAI: JSON mode never enforced schemas — use strict:true
Watch out: combining high temperature with logitBias in the same request may trigger silent JSON truncation. The model stops generating before the schema is complete — no API error, just shortened output. Remove logitBias when using structured output. (This failure mode is community-observed; not currently documented in OpenAI’s official reference.)
OpenAI JSON mode enforces syntactically valid JSON. It does not enforce schema adherence. Wrong types, missing required keys, and invented keys all pass without error. This was not a temporary gap — JSON mode has always been a syntax-only guarantee. OpenAI describes Structured Outputs as “the evolution of JSON mode” and recommends it over JSON mode when possible; the production path is Structured Outputs with strict:true, which applies token-level masking against the schema (OpenAI Structured Outputs reference).
Builders migrating from JSON mode to strict:true may discover that schemas which were silently accepted by JSON mode now trigger parameter-conflict failures or hit complexity limits they were unaware of.
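For reference, a minimal sketch of the strict-mode request shape. The two schema requirements strict mode adds, every property listed in required and additionalProperties set to false, are documented constraints; the helper name here is ours, not part of the SDK.

```python
# Build a response_format payload for strict Structured Outputs.
# `strict_response_format` is a hypothetical helper, not an OpenAI API.

def strict_response_format(name, properties):
    """Wrap flat object properties in a strict-mode response_format dict."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "strict": True,
            "schema": {
                "type": "object",
                "properties": properties,
                # Strict mode: every property must be required.
                "required": list(properties),
                # Strict mode: additionalProperties must be false.
                "additionalProperties": False,
            },
        },
    }

fmt = strict_response_format("invoice", {
    "total": {"type": "number"},
    "currency": {"type": "string"},
})
```

Schemas that relied on genuinely optional keys under JSON mode will need restructuring (for example, nullable types) to satisfy these two constraints.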
Gemini: property ordering sensitivity and SDK/API divergence
Gemini’s responseSchema compliance is model-capability-dependent — required fields with insufficient context produce hallucinated values rather than null or an error. The documented mitigation is marking fields nullable: true when the model may not have context to fill them (Google Developer Blog — Mastering Controlled Generation with Gemini 1.5).
The undocumented failure: if the order of properties in the prompt differs from the order of properties in responseSchema, Gemini produces output where properties appear in the wrong order, required values are missing or hallucinated, or field names don’t match the schema. The output is valid JSON — it parses without error — but does not conform to responseSchema. The failure is ordering-dependent and deterministic: same prompt structure, same violation, every run.
SDK/API divergence: the additionalProperties field has been accepted by the Gemini API since November 2025. The official Python SDK (python-genai) still rejects it at the client layer — googleapis/python-genai issue #1815 was closed as “not planned,” meaning Google closed it without fixing the SDK. The workaround is to use response_json_schema instead of response_schema, which bypasses SDK validation but loses Pydantic integration ergonomics.
For Gemini 2.0+, property ordering is no longer an implicit sensitivity but an explicit requirement: Gemini 2.0 requires a propertyOrdering list in the JSON input to define preferred structure. Omitting it produces unordered output, not a documented error.
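One way to keep the ordering requirement from drifting: derive propertyOrdering from the schema's own property order rather than maintaining it by hand. The field names below follow the REST API's camelCase spelling; SDK spellings may differ, and the recursion strategy is our sketch.

```python
# Derive propertyOrdering for every object in a responseSchema from the
# dict's own insertion order, so schema and ordering cannot disagree.

def with_property_ordering(schema):
    """Return a copy of the schema with propertyOrdering added to each
    object node, matching that object's property insertion order."""
    if not isinstance(schema, dict):
        return schema
    out = {}
    for key, value in schema.items():
        if isinstance(value, dict):
            out[key] = with_property_ordering(value)
        elif isinstance(value, list):
            out[key] = [with_property_ordering(item) for item in value]
        else:
            out[key] = value
    if (str(out.get("type", "")).upper() == "OBJECT"
            and isinstance(out.get("properties"), dict)):
        out["propertyOrdering"] = list(out["properties"])
    return out
```

Pairing this with a prompt template that enumerates fields in the same order closes the remaining gap the ordering-sensitivity failure exploits.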
Outlines / vLLM: FSM compilation is synchronous and error-prone
A January 2026 benchmark (arXiv 2501.10868) documented 42 compilation errors across tested grammar categories for Outlines. The vLLM engineering blog characterizes the FSM engine as one that “occasionally crashes the engine” under complex grammars; complex Pydantic models produce “poorly constructed regex” that is often unviable in practice (vLLM blog — Structured Decoding Introduction).
The specific production failure for batch workloads: Outlines’ FSM compilation is synchronous. A single complex schema blocks the entire batch. Under concurrent load, one slow schema stalls all parallel requests.
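A startup-time compilation gate addresses this without assuming anything about the backend's API: compile every schema before serving traffic, so a pathological schema fails at boot rather than stalling a live batch. `compile_schema` below is a placeholder for whatever grammar-compilation entry point your stack exposes, not a real Outlines or vLLM call.

```python
# Compile all known schemas concurrently at startup; collect failures
# instead of letting one slow schema block the whole service.
from concurrent.futures import ThreadPoolExecutor

def precompile(schemas, compile_schema, timeout_s=30.0):
    """Compile each named schema; return (ok, failed) name lists."""
    ok, failed = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {name: pool.submit(compile_schema, schema)
                   for name, schema in schemas.items()}
        for name, future in futures.items():
            try:
                future.result(timeout=timeout_s)
                ok.append(name)
            except Exception:  # compilation error or timeout
                failed.append(name)
    return ok, failed
```

A non-empty failed list at boot is the signal to redesign those schemas before they ever reach the request path.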
XGrammar, using pushdown automata, replaced Outlines as vLLM’s default structured decoding backend. Outlines is now the fallback. XGrammar enables batch compilation and supports recursive grammars. The fact that Outlines remains as a fallback confirms that XGrammar does not have complete coverage — some schemas still require the FSM path.
The trade-off cuts both ways: the January 2026 benchmark shows XGrammar had only 3 compilation errors vs. Outlines’ 42, but XGrammar produced 38 under-constrained outputs — cases where the grammar was accepted but the output was not fully constrained by the schema. Outlines had only 8. More compilation success with XGrammar does not mean tighter schema enforcement across the board.
What to do instead
For Anthropic: stay under the documented hard limits with margin — treat 20 strict tools as 15 in practice. If you hit "Schema is too complex for compilation", split tools by extracting optional params into a separate tool call. Benchmark schema compilation with cache-busting (unique schema variant per run, or wait >24 hours) to measure real first-call cost before production deployment.
For OpenAI: use strict:true Structured Outputs, not JSON mode. Before migrating existing pipelines, test each schema against the Structured Outputs endpoint explicitly — JSON mode silently absorbed failures that strict:true will surface as errors. Do not combine high temperature with logitBias when using structured output.
For Gemini: align prompt property order to responseSchema property order; on Gemini 2.0+, set propertyOrdering explicitly. Mark fields nullable: true for any field the model may lack context to fill. If using additionalProperties, use response_json_schema instead of response_schema to bypass SDK validation — issue #1815 was closed as “not planned.”
For vLLM / open-source: use XGrammar (default in current vLLM). If falling back to Outlines for schema coverage reasons, test compilation at startup rather than at request time — identify blocking schemas before they reach production. Do not use Outlines for batch workloads without an explicit compilation queue.
Universal: test schemas at complexity boundaries — nested objects, large union counts, optional field density — under the same token budget constraints as production requests. Retry logic does not fix compilation failures or ordering sensitivity. Schema redesign is the only recovery path.
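The boundary testing above can be mechanized with a small schema generator. The sweep axes mirror the failure modes in this section (nesting depth, union width); the specific shapes are a sketch, not a provider-published test suite.

```python
# Generate schemas at increasing complexity so the breaking point shows
# up in a test harness, not in production.

def nested_schema(depth):
    """An object schema nested `depth` levels deep."""
    schema = {"type": "string"}
    for _ in range(depth):
        schema = {"type": "object",
                  "properties": {"child": schema},
                  "required": ["child"],
                  "additionalProperties": False}
    return schema

def union_schema(width):
    """A schema with `width` anyOf branches of distinct const values."""
    return {"anyOf": [{"type": "string", "const": f"v{i}"}
                      for i in range(width)]}

def boundary_suite(max_depth=8, max_width=20, step=4):
    """Yield (label, schema) pairs sweeping both complexity axes."""
    for depth in range(step, max_depth + 1, step):
        yield f"depth-{depth}", nested_schema(depth)
    for width in range(step, max_width + 1, step):
        yield f"union-{width}", union_schema(width)
```

Run the suite against each provider under your production token budget and record where conformance first degrades; that recorded threshold is the number no provider will publish for you.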
Evidence
| Tool | Version | Method | What was observed |
|---|---|---|---|
| Anthropic Claude API | reviewed 2026-04-21 | docs-reviewed | Hard limits enforced: 180s compilation timeout, max 20 strict tools, max 24 optional params, max 16 union types. Exceeding any returns "Schema is too complex for compilation" with no indication of which limit was hit. |
| OpenAI API | reviewed 2026-04-21 | docs-reviewed | JSON mode enforces JSON syntax only — wrong types, missing keys, and invented keys pass silently. strict:true is the production path; OpenAI describes Structured Outputs as “the evolution of JSON mode.” |
| Gemini API / python-genai | python-genai (2025-11-onwards) | source-reviewed | Python SDK client rejects additionalProperties at the validation layer; Gemini API has accepted the field since November 2025. Issue #1815 closed as “not planned” — workaround: use response_json_schema. |
| Gemini API | Gemini 1.5 (2024) | docs-reviewed | Property ordering sensitivity documented in Google engineering blog: output does not conform to responseSchema when prompt property order diverges from schema property order. |
| Outlines (dottxt-ai) | January 2026 benchmark | independently-confirmed | 42 compilation errors documented across tested grammar categories in arXiv 2501.10868. Synchronous FSM compilation confirmed in vLLM engineering blog. |
| vLLM | 2026 | independently-confirmed | vLLM engineering blog: FSM engine “occasionally crashes the engine” under complex grammars; XGrammar adopted as default backend, Outlines retained as fallback. |
Confidence and gaps
Falsification criterion: This finding would be disproved by any provider publishing documented complexity thresholds and failure rates that match the behaviour described above, OR by demonstrating that the Anthropic compilation limits, Gemini ordering sensitivity, or Outlines FSM errors described here are version-specific and resolved in current releases.
Confidence: secondary-research — no claims were reproduced by execution in Theory Delta’s environment. Evidence sources are: Anthropic documentation (docs-reviewed), OpenAI documentation (docs-reviewed), Google engineering blog (docs-reviewed), a third-party GitHub issue (source-reviewed, now closed as “not planned”), a peer-reviewed benchmark paper (independently-confirmed), and a vLLM engineering blog post (independently-confirmed). The block carries staleness_risk: high — provider limits and defaults shift frequently. Anthropic, OpenAI, and Gemini have each changed structured generation behavior within the 6 months prior to publication.
Strongest case against: Several claims may already be obsolete. Anthropic, OpenAI, and Gemini release frequently and do not version their structured generation implementations in a way that makes “which version has this bug” easy to determine. The Gemini property ordering claim is sourced from a 2024 blog post against Gemini 1.5; current Gemini 2.x behavior may differ. The Outlines compilation error count is from a January 2026 benchmark snapshot.
Open questions: Do the Anthropic hard limits apply identically to all Claude model versions, or are they implementation-dependent? Does the Gemini property ordering sensitivity persist in Gemini 2.5+ or is it addressed by the explicit propertyOrdering requirement introduced in 2.0? Which grammar patterns still fall back to Outlines in XGrammar’s current coverage, and does the under-constrained output rate differ across schema types? Has any provider added failure rate telemetry to their structured output APIs?
Seen different? Contribute your evidence — Theory Delta is what makes this knowledge base work.