
What Breaks When You Ask an LLM for JSON

I tested structured output from 288 real model calls across every major provider, and what I found changed how I build things.


There's a moment in every LLM integration project where you write json.loads(response) for the first time and it works, and you think the whole thing is going to be straightforward — the model gives you JSON, you parse it, you use the data.

Then you go to production.

The First Failure Is Always a Surprise

The first time an LLM gave me broken JSON, I added a try/except, logged the raw output, and realized it had wrapped the JSON in markdown fences. Easy enough to fix — strip the fences and move on.

Then it was trailing commas, then Python True instead of true because the model had apparently decided it was writing Python but only for the boolean values, then a truncated object because the response hit the token limit mid-key, then unescaped double quotes inside a string value, which is the kind of thing that makes you question whether the model understands what JSON actually is.

By failure number ten, I had a growing pile of regex substitutions and string manipulations scattered across the codebase, none of them tested, all of them interfering with each other in ways I hadn't thought through. The fix for unescaped quotes was breaking the fix for markdown fences when both appeared in the same response, and the truncation recovery was occasionally "recovering" valid JSON into something different.

This is the part of working with LLMs that nobody wants to talk about at conferences — not the prompting, not the fine-tuning, not the RAG pipeline, but the unglamorous reality that these models produce text that almost conforms to a specification, and "almost" is the most dangerous word in parsing.

So I Built a Test Suite

Not for my application, but for the models themselves. I wanted to know, empirically, what kinds of broken output the major models actually produce and whether the failure modes are consistent across providers. I ran structured output prompts through everything I could get my hands on via OpenRouter — GPT-4o, Claude, Gemini, Llama 3, Mistral, Command R, DeepSeek, Qwen, and a handful of smaller models I'd never heard of — totaling 288 calls, all asking for structured JSON output with a defined schema.

The results were more interesting than I expected, not because any individual failure was surprising (I'd seen all of them by that point), but because the patterns were so consistent across models.

A Taxonomy of Broken Output

Here's what actually breaks, roughly by frequency.

Markdown fences. The single most common failure mode. Models wrap their JSON in ```json ... ``` blocks because they've been trained on conversations where JSON is presented in markdown, so the model isn't confused — it thinks it's showing you code when you asked it to produce data. Nearly every model does this some percentage of the time, even with explicit instructions not to, and while JSON mode helps where it's available, not every model supports it and not every provider exposes it even when the model does.
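Stripping the fences is mechanical enough to show. Here's a minimal sketch of the idea — not outputguard's actual implementation — that pulls the payload out of a fenced block if one exists and otherwise passes the text through:

```python
import re

# Matches a ```json ... ``` (or bare ```) block and captures the payload.
_FENCE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL | re.IGNORECASE)

def strip_markdown_fences(text: str) -> str:
    """Return the contents of the first fenced block, or the text as-is."""
    match = _FENCE.search(text)
    return match.group(1) if match else text.strip()
```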

Trailing commas. Valid in JavaScript but not in JSON, and models that have ingested a lot of JS code will happily produce {"items": [1, 2, 3,]}. Easy to fix in isolation, but a naive regex that strips trailing commas can mangle strings that legitimately contain commas, which makes it trickier in combination with other failures.
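Here's a sketch of what a string-aware comma fix looks like, as opposed to the naive regex. It's a simplified stand-in for illustration, not the library's actual strategy: walk the text, copy string literals verbatim, and only drop a comma whose next non-whitespace character closes an object or array.

```python
def strip_trailing_commas(text: str) -> str:
    """String-aware sketch. The naive version, re.sub(r",\s*([}\]])", r"\1", text),
    works on clean input but will also rewrite ",]" sequences inside string values."""
    out, i, n = [], 0, len(text)
    while i < n:
        ch = text[i]
        if ch == '"':                       # copy string literals verbatim
            out.append(ch)
            i += 1
            while i < n:
                out.append(text[i])
                if text[i] == "\\":         # keep escaped characters as-is
                    i += 1
                    if i < n:
                        out.append(text[i])
                elif text[i] == '"':
                    i += 1
                    break
                i += 1
            continue
        if ch == ",":
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j < n and text[j] in "}]":   # trailing comma: drop it
                i += 1
                continue
        out.append(ch)
        i += 1
    return "".join(out)
```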

Wrong booleans and nulls. Python True, False, and None where JSON wants true, false, and null, and also occasionally TRUE, FALSE, NULL, Yes, No, and in one memorable case, nil. The model is essentially code-switching between programming languages within a single JSON object.

Comments. JSON doesn't support comments, but models don't care — I've seen // single-line, /* */ block, and # comments, sometimes all in the same response. The model is annotating its own output, which would be charming if it didn't break everything downstream.

Unescaped quotes in strings. This is the subtle one: the model produces something like {"bio": "She said "hello" to everyone"}, and now your parser can't tell where the string ends. Fixing this programmatically is harder than it sounds because you need to distinguish between quotes that are part of the string and quotes that are structural delimiters, and the model has destroyed exactly the information you'd need to make that distinction.

Truncated objects. The model hits its output token limit and just stops, leaving you with something like {"users": [{"name": "Alice", "age": 30}, {"name": "Bo and nothing else. The object is structurally incomplete, and what makes this particularly frustrating is that the data you did get is probably correct — there's just not enough of it, and the partial last entry is going to blow up your parser.
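Truncation recovery can be sketched as "walk backwards until something parses." The version below is a deliberately simple illustration (and quadratic in the worst case), not outputguard's implementation: try progressively shorter prefixes, close whatever brackets are still open, and return the first candidate that json.loads accepts.

```python
import json

def recover_truncated(text: str) -> dict | list | None:
    """Best-effort recovery of a truncated JSON document (sketch only)."""
    for cut in range(len(text), 0, -1):
        prefix = text[:cut]
        # Track which brackets are still open in this prefix, skipping strings.
        stack, in_string, escape = [], False, False
        for ch in prefix:
            if in_string:
                if escape:
                    escape = False
                elif ch == "\\":
                    escape = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch in "{[":
                stack.append("}" if ch == "{" else "]")
            elif ch in "}]" and stack:
                stack.pop()
        if in_string:
            continue  # the cut landed inside a string; back up further
        candidate = prefix.rstrip().rstrip(",") + "".join(reversed(stack))
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None

# recover_truncated('{"users": [{"name": "Alice", "age": 30}, {"name": "Bo')
# -> {'users': [{'name': 'Alice', 'age': 30}]}  (the partial last entry is dropped)
```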

Ellipsis placeholders. Instead of generating all the data, the model produces {"items": ["first", "second", "...", "last"]} or uses literal ... to indicate "and more like this," being lazy in the most human way possible.

Encoding issues. Broken Unicode escapes (a truncated \u00 where something like \u0041 — the letter A — was presumably intended), mojibake from double-encoding, and the occasional raw UTF-8 byte sequence in what should be an ASCII-safe JSON string. These are rarer but much harder to diagnose when they show up, because the error messages from JSON parsers are essentially useless for encoding problems.

Repair Order Matters More Than You'd Think

Once I had the taxonomy, I started building repair strategies, and this is where it got interesting — the order you apply fixes matters way more than I initially assumed.

Take a simple case: a response with markdown fences and trailing commas. If you fix the commas first, your regex is operating on text that includes the fence markers, which means it might match commas in the fence delimiters or get confused by the code block boundaries. If you strip the fences first, you're operating on clean JSON that happens to have trailing commas, and the comma fix becomes straightforward.

Now scale that to fifteen strategies and outputs that have three or four problems at once, and the interaction effects get real. Fixing encoding before structure matters because there's no point trying to fix commas in text that has broken byte sequences. Fixing quotes before keys matters because key repair assumes it can identify string boundaries, which requires quotes to already be valid.

I ended up with a two-pass system where the first pass applies all strategies in sequence — encoding first, then extraction, then structural fixes — and tries to parse the result. If that works, we're done. If not, the second pass applies strategies one at a time, re-parsing after each, so a later strategy can't silently undo an earlier fix. The second pass is slower but catches cases where the combined application of multiple strategies introduces new problems.
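In rough Python, the two-pass idea looks like this. The strategy functions are placeholders and this is a sketch of the control flow, not the library's code:

```python
import json
from typing import Callable

Strategy = Callable[[str], str]  # each strategy takes text and returns (possibly) repaired text

def try_parse(text: str):
    try:
        return json.loads(text), True
    except json.JSONDecodeError:
        return None, False

def two_pass_repair(text: str, strategies: list[Strategy]):
    # Pass 1: run the full ordered pipeline, then parse once.
    repaired = text
    for strategy in strategies:
        repaired = strategy(repaired)
    data, ok = try_parse(repaired)
    if ok:
        return data

    # Pass 2: incremental -- re-parse after every strategy and stop at the
    # first version that is valid JSON, so a later fix can't undo an earlier one.
    repaired = text
    for strategy in strategies:
        repaired = strategy(repaired)
        data, ok = try_parse(repaired)
        if ok:
            return data
    raise ValueError("no repair strategy produced parseable JSON")
```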

This sounds overengineered until you've debugged a case where fixing the commas produced valid JSON, but fixing the commas and then fixing the quotes turned it back into invalid JSON because the quote fixer misidentified a comma-fix artifact as an unescaped quote. I'm speaking from experience.

JSON Mode Doesn't Save You

I should address the obvious objection: just use JSON mode.

JSON mode, where available, guarantees syntactically valid JSON, and that's useful enough that you should use it whenever you can. But it doesn't solve the underlying problem for a few reasons.

First, JSON mode guarantees valid syntax, not valid schema — the model can return perfectly parseable JSON that doesn't match the structure you asked for. Missing required fields, wrong types, extra properties, nested objects where you expected arrays — all of these are valid JSON that will still ruin your day.

Second, not every model supports it, so if you're running open models locally or using providers that don't expose the feature, you're on your own.

And third — the one people don't think about — even with JSON mode enabled, truncation still happens. The model can still hit its token limit and produce a partial response, because JSON mode ensures the output it produces is valid, not that it finishes producing it.

So in practice, JSON mode moves you from "output is sometimes syntactically broken" to "output is always syntactically valid but sometimes structurally wrong or incomplete," which is real progress but not a complete solution.
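To make the first point concrete, here's a small illustration using the jsonschema package (not outputguard's own API): output that parses cleanly and still fails schema validation.

```python
import json
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# Perfectly parseable JSON that still violates the schema: "age" is a string
# and "name" is missing entirely.
response = '{"age": "thirty", "email": "a@example.com"}'
data = json.loads(response)   # parses fine -- JSON mode got us this far

try:
    validate(instance=data, schema=schema)
except ValidationError as err:
    print(err.message)  # e.g. "'name' is a required property"
```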

Beyond JSON

The other thing I didn't anticipate when I started is that not everything wants to be JSON in the first place.

I've been working on projects where the model output is YAML (because the downstream consumer is a configuration system), TOML (because the output feeds into a Python config), or Python literals (because eval() is right there and sometimes the model just produces Python dicts). Each of these formats has its own set of failure modes, some overlapping with JSON and some unique.

YAML has the infamous Norway problem, where the unquoted string NO — Norway's country code — gets interpreted as the boolean false. TOML has strict datetime requirements that models routinely violate. Python literals mostly work but occasionally produce syntax that ast.literal_eval() chokes on.
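The Norway problem is easy to demonstrate with PyYAML, which follows YAML 1.1's boolean rules:

```python
import yaml  # PyYAML resolves yes/no/on/off (in any case) as booleans

doc = "country: NO\nlanguage: Norwegian"
print(yaml.safe_load(doc))
# {'country': False, 'language': 'Norwegian'} -- the country code became a boolean
```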

The repair strategies I'd built for JSON turned out to be partially applicable — markdown fence stripping works regardless of what's inside the fences, and encoding fixes are format-agnostic — but the structural fixes needed format-aware logic because you can't apply JSON comma rules to YAML.

What I Built

All of this became outputguard, which I eventually packaged properly because I was importing the same unversioned file into too many projects and it was getting ridiculous.

It does three things:

Validates structured output against JSON Schema, with error messages that tell you what's wrong using JSON path notation ($.items[0].name is required) instead of the parser's raw exception.

Repairs broken output using 15 strategies applied in a deliberately ordered pipeline — the two-pass system I described above, where each strategy is small and testable on its own.

Generates retry prompts, which are human-readable correction messages you can feed back to the model for another attempt, like "Your output was missing the required field 'email' at $.users[0]. Here's the schema. Try again."

There's also guarded_generate(), which wraps your model call (any provider — you pass a callable that returns a string) and runs the whole validate→repair→retry loop for you, though the pieces work independently if you don't want the orchestration.
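For flavor, here's roughly what using it might look like. The function name guarded_generate comes from above, but every parameter name in this sketch is an assumption rather than documented API — check the repo for the real signature.

```python
# Hypothetical usage sketch: kwargs below are assumptions, not documented API.
from outputguard import guarded_generate

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "email": {"type": "string"}},
    "required": ["name", "email"],
}

def call_model(prompt: str) -> str:
    # Any provider works: this just has to return the raw model text.
    # (Stubbed here; wire it to your client of choice.)
    ...

# Assumed shape: pass the callable and the schema, get parsed-and-validated
# data back after the validate -> repair -> retry loop.
data = guarded_generate(call_model, schema=schema)
```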

The test suite has 2,001 tests, including the 288 real model outputs, adversarial inputs, truncation recovery, multi-strategy interactions, and a format matrix across JSON/YAML/TOML/Python. The test suite is honestly the part I'm most proud of, because repair code without good tests is just a different flavor of broken.

Why This Matters

I spend most of my time on security programs, compliance frameworks, and the organizational problems that come with running a one-person security and engineering function. Outputguard lives in a completely different part of my work, but it comes from the same instinct — the boring infrastructure problems are the ones that actually determine whether your systems work reliably.

Nobody wants to talk about JSON repair at a conference, and nobody's writing thought pieces about trailing commas, but if you're building production systems on LLM output, the gap between "the model returned something" and "the model returned something I can actually use" is where your reliability lives. The fancy parts of the stack get the attention, but the parsing layer is where things actually break.


outputguard is MIT-licensed, Python 3.10+, and has no LLM provider dependencies.

pip install outputguard
GitHub - ndcorder/outputguard: Validate, repair, and retry LLM structured outputs. 13 repair strategies for common JSON malformations, JSON Schema validation, and retry-with-feedback prompts.