
    Generation & Checker Analysis: What 300 Exercises Revealed About LLM Quality Control

    The findings come from measuring an end-to-end pipeline across three consecutive generation batches totalling 300 exercises at B1 and B2 level. The pipeline has two filtering stages: 16% of exercises are discarded during generation (the LLM's own self-assessment) and another 43% are removed by an independent checker. The exercises that survive are good. The process of getting there is expensive in content, not in cost.

    The generation pipeline in numbers

    Each batch targeted 100 exercises at either B1 or B2 level.

    Batch | Level | Generated | Dropped (gen) | Finalized | Cost
    A     | B1    | 100       | 16 (16%)      | 84        | $2.73
    B     | B2    | 100       | 16 (16%)      | 84        | $1.23
    C     | B2    | 100       | 17 (17%)      | 83        | $3.94
    Total |       | 300       | 49 (16%)      | 251       | ~$7.90

    The generation drop rate is consistent at 16–17% across all three batches. These are exercises that failed the LLM’s self-assessment during the generate–assess–regenerate loop described in a previous article. They were not saved to the database.
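    For context, that loop reduces to something like the sketch below; a minimal sketch, assuming the LLM calls are passed in as functions. The names and the retry budget are illustrative, not the production code.

        from dataclasses import dataclass
        from typing import Callable, Optional

        @dataclass
        class Verdict:
            ok: bool
            reason: str = ""

        def produce(
            generate: Callable[[str], dict],    # LLM generation call
            assess: Callable[[dict], Verdict],  # the LLM's own self-assessment
            level: str,                         # "B1" or "B2"
            max_attempts: int = 3,              # assumed retry budget
        ) -> Optional[dict]:
            """Generate-assess-regenerate: only exercises that pass the
            self-assessment are saved; the rest become the ~16% drop."""
            for _ in range(max_attempts):
                exercise = generate(level)
                if assess(exercise).ok:
                    return exercise             # saved to the database
            return None                         # dropped during generation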

    The interesting finding is the consistency. Three batches, two levels, and the self-assessment gate removes the same proportion every time. That suggests the 16% floor reflects a real boundary in what the generation prompt can reliably produce rather than random variation.

    The checker finds what generation misses

    After generation, each batch goes through the checker — an independent LLM pass that reviews every exercise for grammar errors, semantic problems, and structural defects.

    The initial checker pass is where the real filtering happens.

    Batch  | Exercises checked | Flagged | OK | Flag rate
    A (B1) | 84                | 36      | 48 | 43%
    B (B2) | 84                | 42      | 42 | 50%
    C (B2) | 83                | 43      | 40 | 52%

    That gap between the two stages is important. The generation pipeline’s assess step approves 84 out of 100 exercises. Then the checker looks at those 84 and flags 36–43 of them. The generator and the checker are measuring different things.
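    For illustration, an independent pass can be as small as one structured call per exercise. A sketch assuming the OpenAI Python SDK and a JSON verdict; the model name follows the article, the prompt wording is my assumption, not the production checker.

        import json
        from openai import OpenAI

        client = OpenAI()

        # Prompt wording is an assumption, not the production prompt.
        CHECKER_PROMPT = (
            "You are an independent reviewer of German B1/B2 exercises. "
            "Check for grammar errors, semantic problems, and structural "
            'defects. Reply with JSON: {"flag": true|false, "reason": "..."}'
        )

        def check(exercise_text: str) -> dict:
            resp = client.chat.completions.create(
                model="gpt-5.2",  # model name as given in the article
                messages=[
                    {"role": "system", "content": CHECKER_PROMPT},
                    {"role": "user", "content": exercise_text},
                ],
                response_format={"type": "json_object"},
            )
            return json.loads(resp.choices[0].message.content)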

    Multi-pass checking converges quickly

    Each batch was run through multiple checker passes. After the first pass removes the obvious errors, subsequent passes flag very few additional exercises.

    Pass          | Total checked | Total flagged | Share of all 127 flags
    1st (initial) | 251           | 121           | 95%
    2nd           | 172           | 2             | 1.6%
    3rd–8th       | 274           | 4             | 3.1%

    The first checker pass discovers 95% of all errors across all three batches. The remaining 5% trickle in over subsequent passes — and roughly half of those turned out to be false positives on manual re-analysis.

    That has a practical implication: running the checker more than twice yields almost nothing. One pass is sufficient for the bulk of error detection. A second pass is cheap insurance. Beyond that, the signal-to-noise ratio degrades.
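    As a stopping rule, that observation translates into something like the sketch below; the pass budget is an assumption drawn from the numbers above, not production config. It composes with the check() sketch earlier.

        def run_checker_passes(exercises, check, max_passes=2):
            """Remove flagged exercises, re-checking survivors until the
            pass budget is spent or a pass flags nothing new. Given the
            95% first-pass catch rate, two passes are the practical ceiling."""
            active = list(exercises)
            for _ in range(max_passes):
                flagged = [ex for ex in active if check(ex)["flag"]]
                if not flagged:
                    break                       # converged: nothing new found
                active = [ex for ex in active if ex not in flagged]
            return active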

    What the checker actually costs

    The checker is cheap. Measured across all 13 runs (613 exercise-checks total, using gpt-5.2 via the Batch API):

    Metric                                        | Value
    Total input tokens                            | 158,898
    Total output tokens                           | 17,449
    Average input per exercise                    | 259 tokens
    Average output per exercise                   | 28 tokens
    Cost per exercise-check                       | ~$0.0007
    B1 batch: all 3 checker runs (178 checks)     | $0.13
    Checking all 2,120 exercises in DB (one pass) | ~$1.50

    That last number matters. It means the entire exercise library can be audited for about $1.50, a fraction of the ~$7.90 it cost to generate these three batches. Checker cost is 4.5% of total batch cost. Generation dominates.
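    The per-check figure follows directly from the token averages once per-token rates are plugged in. The rates below are placeholders, not the model's actual Batch API pricing:

        # Back-of-the-envelope cost from the measured token averages.
        # The $/token rates are placeholders (assumed, not real pricing).
        AVG_INPUT_TOKENS = 259
        AVG_OUTPUT_TOKENS = 28
        INPUT_RATE = 1.25 / 1_000_000    # $ per input token, assumed
        OUTPUT_RATE = 10.00 / 1_000_000  # $ per output token, assumed

        per_check = AVG_INPUT_TOKENS * INPUT_RATE + AVG_OUTPUT_TOKENS * OUTPUT_RATE
        print(f"~${per_check:.4f} per exercise-check")   # ~$0.0006 at these rates
        print(f"~${per_check * 2_120:.2f} for one pass over the full library")

    With the article's measured ~$0.0007 per check, the full-library pass lands at the quoted ~$1.50.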

    The end-to-end yield

    Combining generation and checking gives the full picture.

    Stage                             | Count | % of generated
    Generated                         | 300   | 100%
    Passed generation self-assessment | 251   | 84%
    Active after checking             | 124   | 41%

    For every 100 exercises the LLM generates, 41 survive to reach a learner. The rest are filtered: 16 by the generation pipeline's own assessment, 43 by the independent checker.

    The cost per surviving exercise is roughly $0.06, based on the B1 batch where end-to-end costs are best tracked ($2.86 total for 46 active exercises).

    What is the checker actually finding?

    I categorized all 121 exercises flagged in the initial checker passes across the three batches.

    The doubled-word bug

    The most striking category is doubled words: patterns like "kurzer kurzer Prüfung", "alle alle Hemden", "eleganten elegant Outfit". These appeared in roughly 30 of the 121 flagged exercises (25%).

    A related pattern is doubled punctuation: "Frau Dr. Neumann,,", "Coach,,", "Chinas,,". Another 8 exercises had this defect.

    Together, doubled words and doubled punctuation account for roughly 31% of all bulk flags. A learner would see these directly in the exercise text. They are low-hanging fruit to fix.

    Systematic grammar blind spots

    Several grammar-rule violations recur across batches, suggesting the generator LLM has consistent weak spots (a prompt sketch targeting them follows the list):

    • "denn" + word order (~8 exercises): The generator places the verb in subordinate-clause position after denn, which requires main-clause V2 order. This is a rule that intermediate learners are explicitly tested on.
    • Missing passive auxiliary (~5 exercises): Constructions like "können sie schnell gelöst." are missing werden. The sentence is incomplete.
    • Konjunktiv I in indirect speech (~10 exercises): The generator uses Indikativ where strict grammar requires Konjunktiv I (erklärte, dass man … durfte instead of dürfe). This is borderline: Indikativ after dass is increasingly accepted in spoken German, but exam preparation material should teach the formal rule.
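    One cheap mitigation is to name these blind spots explicitly in the checker prompt so they are tested on every pass. A sketch; the wording is mine, not the production prompt:

        # Illustrative checker-prompt addendum targeting the recurring
        # blind spots above. Wording is an assumption.
        TARGETED_CHECKS = """
        In addition to the general review, check explicitly for:
        1. "denn" clauses: denn takes main-clause V2 word order, never
           subordinate-clause verb-final order.
        2. Passive constructions: a participle after können/müssen/...
           requires the auxiliary werden ("kann gelöst werden").
        3. Indirect speech after "dass" in formal register: prefer
           Konjunktiv I ("dürfe"), and flag plain Indikativ ("durfte").
        """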

    Genuine grammar and formatting errors

    The remaining categories are a mix of real problems:

    Error type                       | Count | Nature
    Capitalization after punctuation | ~10   | Lowercase after a period, uppercase mid-sentence
    Missing quotation marks          | ~8    | Direct speech without opening marks
    Case/adjective agreement         | ~10   | Wrong declension ending for the grammatical case
    Pronoun congruence               | ~8    | Singular subject, plural pronoun reference
    Semantic contradiction           | ~5    | Sentence logic contradicts itself
    Style/idiomaticity               | ~15   | Mix of valid concerns and overcritical flags

    Most of these are genuine problems that would confuse a learner. The style category is the noisiest — some flags are valid (unidiomatic phrasing that sounds wrong to a native speaker), others are the checker being overly strict about register or word choice.

    The checker's false positive problem

    Once the obvious errors are removed, the checker’s reliability drops. I re-analyzed the exercises flagged in the smaller 2nd–8th passes (exercises that had already survived the bulk check):

    Verdict                                     | Count
    Clear false positive                        | 3
    Borderline (valid grammar, debatable usage) | 3
    Genuine error                               | 1

    Three of the seven were unambiguously wrong calls by the checker:

    1. "Ausgehzeiten" flagged as invalid word — it is a standard compound noun. The checker's reasoning was incoherent.
    2. "ein Kaffee" after a colon flagged as wrong case — nominative in elliptical enumeration after a colon is standard German.
    3. "Bis zum Ende des Semesters werden wir ... entwickelt haben" flagged as unidiomatic — this is a textbook Futur II construction, exactly the use case it exists for.

    The three borderline cases involved valid grammar with arguable semantic or stylistic issues — the kind of thing a strict editor might flag but a grammar checker should probably leave alone.

    That ~50% false positive rate on residual flags means the checker is removing some genuinely good exercises from the corpus. This is the cost of a conservative checking strategy: fewer bad exercises reach learners, but some good exercises are lost.

    What this means for the product

    The exercises that survive are good

    The 124 active exercises across 3 batches have passed generation self-assessment, at least one independent checker pass, and in many cases multiple re-checks. On manual re-analysis of a sample, the surviving exercises are well-formed, level-appropriate, and pedagogically useful.

    The quality gate works.

    The doubled-word bug

    This is the most actionable finding. Roughly 31% of all flagged exercises can be repaired, rather than discarded, with a simple regex pass: \b(\w+) \1\b for repeated words, [,;]{2} for doubled punctuation.
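    A minimal pre-filter along those lines, run before the LLM checker ever sees the text. Note that near-duplicates like "eleganten elegant", where the endings differ, need a looser match than exact repetition:

        import re

        # Pre-filter sketch: collapse exact doubled words and doubled
        # punctuation before spending an LLM checker call.
        DOUBLED_WORD = re.compile(r"\b(\w+) \1\b")   # "kurzer kurzer" etc.
        DOUBLED_PUNCT = re.compile(r"([,;])[,;]+")   # ",," and ";;" runs

        def fix_doubles(text: str) -> str:
            text = DOUBLED_WORD.sub(r"\1", text)
            return DOUBLED_PUNCT.sub(r"\1", text)

        assert fix_doubles("alle alle Hemden,, bitte") == "alle Hemden, bitte"

    The \b guards keep the word pattern from firing inside longer words ("der dermatologische" is untouched), so the repair is safe to apply blindly.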

    One checker pass is enough

    95% of all errors are caught in the first pass. The marginal value of additional passes is near zero, and the false positive rate rises. The operational recommendation: run the checker once, remove what it flags, and accept the ~2–3% residual error rate as the cost of avoiding false-positive removals.

    The full corpus should be checked

    At ~$0.0007 per exercise, checking all 2,120 exercises in the database costs about $1.50, on the order of a single 100-exercise generation batch. A single checker pass over the full library would catch the same categories of errors found in these three batches, especially the doubled-word artifacts, which likely exist in older exercises generated with the same pipeline.
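    Mechanically, a one-pass audit is a single batch job. A sketch assuming the OpenAI Batch API shape (a JSONL file of requests, 24-hour completion window); load_all_exercises and CHECKER_PROMPT are stand-ins, not real helpers from the pipeline:

        import json
        from openai import OpenAI

        client = OpenAI()

        # Write one chat-completion request per exercise to a JSONL file.
        with open("audit.jsonl", "w") as f:
            for ex in load_all_exercises():   # hypothetical DB accessor
                f.write(json.dumps({
                    "custom_id": str(ex["id"]),
                    "method": "POST",
                    "url": "/v1/chat/completions",
                    "body": {
                        "model": "gpt-5.2",
                        "messages": [
                            {"role": "system", "content": CHECKER_PROMPT},
                            {"role": "user", "content": ex["text"]},
                        ],
                    },
                }) + "\n")

        # Submit as one batch; results arrive within the completion window.
        batch_file = client.files.create(file=open("audit.jsonl", "rb"), purpose="batch")
        job = client.batches.create(
            input_file_id=batch_file.id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )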

    The uncomfortable economics

    The end-to-end yield — 41 usable exercises per 100 generated — is not a sign that the pipeline is broken. It is the actual cost of producing exercises that meet a non-trivial quality bar.

    The pipeline does three things well: it generates exercises cheaply (~$0.03 per generated exercise), it checks them cheaply (~$0.0007 per check), and it catches the vast majority of problems reliably.

    Where it struggles is producing exercises that pass on the first try. The generator and the checker disagree on nearly half the output. That disagreement is where the quality actually lives — it is the gap between “the LLM thinks this is correct” and “an independent review confirms it.”

    For an educational product, a wrong exercise is worse than no exercise. The 59% rejection rate is the price of that constraint.