
    Generation & Checker Analysis: What 300 Exercises Revealed About LLM Quality Control

    The findings come from measuring an end-to-end pipeline across three consecutive generation batches totalling 300 exercises at B1 and B2 level. The pipeline has two filtering stages: 16% of exercises are discarded during generation (the LLM's own self-assessment) and another 43% are removed by an independent checker. The exercises that survive are good. The process of getting there is expensive in content, not in cost.

    The generation pipeline in numbers

    Each batch targeted 100 exercises at either B1 or B2 level.

    Batch | Level | Generated | Dropped (gen) | Finalized | Cost
    A     | B1    | 100       | 16 (16%)      | 84        | $2.73
    B     | B2    | 100       | 16 (16%)      | 84        | $1.23
    C     | B2    | 100       | 17 (17%)      | 83        | $3.94
    Total |       | 300       | 49 (16%)      | 251       | ~$7.90

    The generation drop rate is consistent at 16–17% across all three batches. These are exercises that failed the LLM’s self-assessment during the generate–assess–regenerate loop described in a previous article. They were not saved to the database.
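    For context, that loop reduces to something like the sketch below; a minimal sketch, assuming the LLM calls are passed in as functions. The names and the retry budget are illustrative, not the production code.

        from dataclasses import dataclass
        from typing import Callable, Optional

        @dataclass
        class Verdict:
            ok: bool
            reason: str = ""

        def produce(
            generate: Callable[[str], dict],    # LLM generation call
            assess: Callable[[dict], Verdict],  # the LLM's own self-assessment
            level: str,                         # "B1" or "B2"
            max_attempts: int = 3,              # assumed retry budget
        ) -> Optional[dict]:
            """Generate-assess-regenerate: only exercises that pass the
            self-assessment are saved; the rest become the ~16% drop."""
            for _ in range(max_attempts):
                exercise = generate(level)
                if assess(exercise).ok:
                    return exercise             # saved to the database
            return None                         # dropped during generation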

    The interesting finding is the consistency. Three batches, two levels, and the self-assessment gate removes the same proportion every time. That suggests the 16% floor reflects a real boundary in what the generation prompt can reliably produce rather than random variation.

    The checker finds what generation misses

    After generation, each batch goes through the checker — an independent LLM pass that reviews every exercise for grammar errors, semantic problems, and structural defects.

    The initial checker pass is where the real filtering happens.

    Batch  | Exercises checked | Flagged | OK | Flag rate
    A (B1) | 84                | 36      | 48 | 43%
    B (B2) | 84                | 42      | 42 | 50%
    C (B2) | 83                | 43      | 40 | 52%

    That gap between the two stages is important. The generation pipeline’s assess step approves 84 out of 100 exercises. Then the checker looks at those 84 and flags 36–43 of them. The generator and the checker are measuring different things.
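    For illustration, an independent pass can be as small as one structured call per exercise. A sketch assuming the OpenAI Python SDK and a JSON verdict; the model name follows the article, the prompt wording is my assumption, not the production checker.

        import json
        from openai import OpenAI

        client = OpenAI()

        # Prompt wording is an assumption, not the production prompt.
        CHECKER_PROMPT = (
            "You are an independent reviewer of German B1/B2 exercises. "
            "Check for grammar errors, semantic problems, and structural "
            'defects. Reply with JSON: {"flag": true|false, "reason": "..."}'
        )

        def check(exercise_text: str) -> dict:
            resp = client.chat.completions.create(
                model="gpt-5.2",  # model name as given in the article
                messages=[
                    {"role": "system", "content": CHECKER_PROMPT},
                    {"role": "user", "content": exercise_text},
                ],
                response_format={"type": "json_object"},
            )
            return json.loads(resp.choices[0].message.content)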

    Multi-pass checking converges quickly

    Each batch was run through multiple checker passes. After the first pass removes the obvious errors, subsequent passes flag very few additional exercises.

    Pass          | Total checked | Total flagged | Share of all 127 flags
    1st (initial) | 251           | 121           | 95%
    2nd           | 172           | 2             | 1.6%
    3rd–8th       | 274           | 4             | 3.1%

    The first checker pass discovers 95% of all errors across all three batches. The remaining 5% trickle in over subsequent passes — and roughly half of those turned out to be false positives on manual re-analysis.

    That has a practical implication: running the checker more than twice yields almost nothing. One pass is sufficient for the bulk of error detection. A second pass is cheap insurance. Beyond that, the signal-to-noise ratio degrades.
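    As a stopping rule, that observation translates into something like the sketch below; the pass budget is an assumption drawn from the numbers above, not production config. It composes with the check() sketch earlier.

        def run_checker_passes(exercises, check, max_passes=2):
            """Remove flagged exercises, re-checking survivors until the
            pass budget is spent or a pass flags nothing new. Given the
            95% first-pass catch rate, two passes are the practical ceiling."""
            active = list(exercises)
            for _ in range(max_passes):
                flagged = [ex for ex in active if check(ex)["flag"]]
                if not flagged:
                    break                       # converged: nothing new found
                active = [ex for ex in active if ex not in flagged]
            return active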

    What the checker actually costs

    The checker is cheap. Measured across all 13 runs (613 exercise-checks total, using gpt-5.2 via the Batch API):

    Metric                                        | Value
    Total input tokens                            | 158,898
    Total output tokens                           | 17,449
    Average input per exercise                    | 259 tokens
    Average output per exercise                   | 28 tokens
    Cost per exercise-check                       | ~$0.0007
    B1 batch: all 3 checker runs (178 checks)     | $0.13
    Checking all 2,120 exercises in DB (one pass) | ~$1.50

    That last number matters. It means the entire exercise library can be audited for about $1.50, a fraction of the ~$7.90 it cost to generate these three batches. Checker cost is 4.5% of total batch cost. Generation dominates.
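    The per-check figure follows directly from the token averages once per-token rates are plugged in. The rates below are placeholders, not the model's actual Batch API pricing:

        # Back-of-the-envelope cost from the measured token averages.
        # The $/token rates are placeholders (assumed, not real pricing).
        AVG_INPUT_TOKENS = 259
        AVG_OUTPUT_TOKENS = 28
        INPUT_RATE = 1.25 / 1_000_000    # $ per input token, assumed
        OUTPUT_RATE = 10.00 / 1_000_000  # $ per output token, assumed

        per_check = AVG_INPUT_TOKENS * INPUT_RATE + AVG_OUTPUT_TOKENS * OUTPUT_RATE
        print(f"~${per_check:.4f} per exercise-check")   # ~$0.0006 at these rates
        print(f"~${per_check * 2_120:.2f} for one pass over the full library")

    With the article's measured ~$0.0007 per check, the full-library pass lands at the quoted ~$1.50.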

    The end-to-end yield

    Combining generation and checking gives the full picture.

    Stage                             | Count | % of generated
    Generated                         | 300   | 100%
    Passed generation self-assessment | 251   | 84%
    Active after checking             | 124   | 41%

    For every 100 exercises the LLM generates, 41 survive to reach a learner. The rest are filtered: 16 by the generation pipeline's own assessment, 43 by the independent checker.

    The cost per surviving exercise is roughly $0.06, based on the B1 batch where end-to-end costs are best tracked ($2.86 total for 46 active exercises).

    What is the checker actually finding?

    I categorized all 121 exercises flagged in the initial checker passes across the three batches.

    The doubled-word bug

    The most striking category is doubled words: patterns like "kurzer kurzer Prüfung", "alle alle Hemden", "eleganten elegant Outfit". These appeared in roughly 30 of the 121 flagged exercises (25%).

    A related pattern is doubled punctuation: "Frau Dr. Neumann,,", "Coach,,", "Chinas,,". Another 8 exercises had this defect.

    Together, doubled words and doubled punctuation account for roughly 31% of all bulk flags. A learner would see these directly in the exercise text. They are low-hanging fruit to fix.

    Systematic grammar blind spots

    Several grammar-rule violations recur across batches, suggesting the generator LLM has consistent weak spots (a prompt sketch targeting them follows the list):

    • "denn" + word order (~8 exercises): The generator places the verb in subordinate-clause position after denn, which requires main-clause V2 order. This is a rule that intermediate learners are explicitly tested on.
    • Missing passive auxiliary (~5 exercises): Constructions like "können sie schnell gelöst." are missing werden. The sentence is incomplete.
    • Konjunktiv I in indirect speech (~10 exercises): The generator uses Indikativ where strict grammar requires Konjunktiv I (erklärte, dass man … durfte instead of dürfe). This is borderline: Indikativ after dass is increasingly accepted in spoken German, but exam preparation material should teach the formal rule.
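    One cheap mitigation is to name these blind spots explicitly in the checker prompt so they are tested on every pass. A sketch; the wording is mine, not the production prompt:

        # Illustrative checker-prompt addendum targeting the recurring
        # blind spots above. Wording is an assumption.
        TARGETED_CHECKS = """
        In addition to the general review, check explicitly for:
        1. "denn" clauses: denn takes main-clause V2 word order, never
           subordinate-clause verb-final order.
        2. Passive constructions: a participle after können/müssen/...
           requires the auxiliary werden ("kann gelöst werden").
        3. Indirect speech after "dass" in formal register: prefer
           Konjunktiv I ("dürfe"), and flag plain Indikativ ("durfte").
        """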

    Genuine grammar and formatting errors

    The remaining categories are a mix of real problems:

    Error type                       | Count | Nature
    Capitalization after punctuation | ~10   | Lowercase after a period, uppercase mid-sentence
    Missing quotation marks          | ~8    | Direct speech without opening marks
    Case/adjective agreement         | ~10   | Wrong declension ending for the grammatical case
    Pronoun congruence               | ~8    | Singular subject, plural pronoun reference
    Semantic contradiction           | ~5    | Sentence logic contradicts itself
    Style/idiomaticity               | ~15   | Mix of valid concerns and overcritical flags

    Most of these are genuine problems that would confuse a learner. The style category is the noisiest — some flags are valid (unidiomatic phrasing that sounds wrong to a native speaker), others are the checker being overly strict about register or word choice.

    The checker's false positive problem

    Once the obvious errors are removed, the checker’s reliability drops. I re-analyzed the exercises flagged in the smaller 2nd–8th passes (exercises that had already survived the bulk check):

    Verdict                                     | Count
    Clear false positive                        | 3
    Borderline (valid grammar, debatable usage) | 3
    Genuine error                               | 1

    Three of the seven were unambiguously wrong calls by the checker:

    1. "Ausgehzeiten" flagged as invalid word — it is a standard compound noun. The checker's reasoning was incoherent.
    2. "ein Kaffee" after a colon flagged as wrong case — nominative in elliptical enumeration after a colon is standard German.
    3. "Bis zum Ende des Semesters werden wir ... entwickelt haben" flagged as unidiomatic — this is a textbook Futur II construction, exactly the use case it exists for.

    The three borderline cases involved valid grammar with arguable semantic or stylistic issues — the kind of thing a strict editor might flag but a grammar checker should probably leave alone.

    That ~50% false positive rate on residual flags means the checker is removing some genuinely good exercises from the corpus. This is the cost of a conservative checking strategy: fewer bad exercises reach learners, but some good exercises are lost.

    What this means for the product

    The exercises that survive are good

    The 124 active exercises across 3 batches have passed generation self-assessment, at least one independent checker pass, and in many cases multiple re-checks. On manual re-analysis of a sample, the surviving exercises are well-formed, level-appropriate, and pedagogically useful.

    The quality gate works.

    The doubled-word bug

    This is the most actionable finding. Roughly 31% of all flagged exercises can be repaired, rather than discarded, with a simple regex pass: \b(\w+) \1\b for repeated words, [,;]{2} for doubled punctuation.
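    A minimal pre-filter along those lines, run before the LLM checker ever sees the text. Note that near-duplicates like "eleganten elegant", where the endings differ, need a looser match than exact repetition:

        import re

        # Pre-filter sketch: collapse exact doubled words and doubled
        # punctuation before spending an LLM checker call.
        DOUBLED_WORD = re.compile(r"\b(\w+) \1\b")   # "kurzer kurzer" etc.
        DOUBLED_PUNCT = re.compile(r"([,;])[,;]+")   # ",," and ";;" runs

        def fix_doubles(text: str) -> str:
            text = DOUBLED_WORD.sub(r"\1", text)
            return DOUBLED_PUNCT.sub(r"\1", text)

        assert fix_doubles("alle alle Hemden,, bitte") == "alle Hemden, bitte"

    The \b guards keep the word pattern from firing inside longer words ("der dermatologische" is untouched), so the repair is safe to apply blindly.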

    One checker pass is enough

    95% of all errors are caught in the first pass. The marginal value of additional passes is near zero, and the false positive rate rises. The operational recommendation: run the checker once, remove what it flags, and accept the ~2–3% residual error rate as the cost of avoiding false-positive removals.

    The full corpus should be checked

    At ~$0.0007 per exercise, checking all 2,120 exercises in the database costs about $1.50, on the order of a single 100-exercise generation batch. A single checker pass over the full library would catch the same categories of errors found in these three batches, especially the doubled-word artifacts, which likely exist in older exercises generated with the same pipeline.
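    Mechanically, a one-pass audit is a single batch job. A sketch assuming the OpenAI Batch API shape (a JSONL file of requests, 24-hour completion window); load_all_exercises and CHECKER_PROMPT are stand-ins, not real helpers from the pipeline:

        import json
        from openai import OpenAI

        client = OpenAI()

        # Write one chat-completion request per exercise to a JSONL file.
        with open("audit.jsonl", "w") as f:
            for ex in load_all_exercises():   # hypothetical DB accessor
                f.write(json.dumps({
                    "custom_id": str(ex["id"]),
                    "method": "POST",
                    "url": "/v1/chat/completions",
                    "body": {
                        "model": "gpt-5.2",
                        "messages": [
                            {"role": "system", "content": CHECKER_PROMPT},
                            {"role": "user", "content": ex["text"]},
                        ],
                    },
                }) + "\n")

        # Submit as one batch; results arrive within the completion window.
        batch_file = client.files.create(file=open("audit.jsonl", "rb"), purpose="batch")
        job = client.batches.create(
            input_file_id=batch_file.id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )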

    The uncomfortable economics

    The end-to-end yield — 41 usable exercises per 100 generated — is not a sign that the pipeline is broken. It is the actual cost of producing exercises that meet a non-trivial quality bar.

    The pipeline does three things well: it generates exercises cheaply (~$0.03 per generated exercise), it checks them cheaply (~$0.0007 per check), and it catches the vast majority of problems reliably.

    Where it struggles is producing exercises that pass on the first try. The generator and the checker disagree on nearly half the output. That disagreement is where the quality actually lives — it is the gap between “the LLM thinks this is correct” and “an independent review confirms it.”

    For an educational product, a wrong exercise is worse than no exercise. The 59% rejection rate is the price of that constraint.