The interesting part of the generation pipeline is not only the prompt.
It is the operating model around the prompt.
Once the product requirement becomes "generate and review hundreds of narrowly targeted grammar exercises, keep the output structured, retry weak drafts, and do it cheaply enough that the library can still expand," the main constraint stops being raw model capability.
It becomes throughput, resumability, and cost discipline.
That is why the generation system ended up as a batch-processing workflow.
The synchronous loop was fine for testing and bad for production
The first version was synchronous:
- generate an exercise,
- wait,
- assess it,
- if it fails, regenerate,
- repeat until pass or retry limit.
This is fine for prompt testing.
It is not especially good for building a content library.
One exercise is not one LLM call. It is potentially several: generation, assessment, and multiple regeneration-assessment cycles. Once this logic is multiplied across many exercises, the pipeline becomes expensive, slow to operate, and fragile to interruptions.
For a product built around depth section by section, that matters.
The question is not whether one exercise can be produced. The question is whether the content operation can run repeatedly and cheaply enough that the library keeps growing where coverage is thin.
Batch processing fits the shape of the problem
At any given moment, the system contains sets of exercises waiting for the next action:
- exercises waiting to be generated,
- generated exercises waiting to be assessed,
- failed exercises waiting to be regenerated.
That makes the workflow look much more like a state machine than like a live conversational app.
A batch run therefore looks like this:
- initialize a run with grammar sections and content topics,
- submit generation jobs,
- collect outputs and mark them `pending_assessment`,
- submit assessment jobs,
- move passes to `completed` and failures to `pending_regeneration`,
- repeat until everything is either `completed` or `dropped`.
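Since the run behaves like a state machine, the transitions can be sketched as a small function. This is an illustrative sketch, not the actual implementation; the status names mirror the stats block shown later, while the function name and retry constant are assumptions:

```python
# Hypothetical sketch of the run-level state machine.
MAX_ITERATIONS = 5  # retry budget before a draft is dropped (assumed value)

def next_status(status, passed, iteration):
    """Return the next status for one exercise after a single batch step."""
    if status in ("pending_generation", "pending_regeneration"):
        return "pending_assessment"      # a draft exists, awaits review
    if status == "pending_assessment":
        if passed:
            return "completed"
        if iteration >= MAX_ITERATIONS:
            return "dropped"             # retry budget exhausted
        return "pending_regeneration"
    return status                        # completed / dropped are terminal
```

The useful property is that each batch step only needs the current status and the latest assessment verdict, never the full call history.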
The important point is that the logic is run-level, not request-level.
That is where batch processing becomes useful. It turns the pipeline from a sequence of blocking calls into a controlled asynchronous workflow.
The state file became the actual control surface
Once the workflow moved into batch mode, the most important artifact was no longer the prompt alone.
It was the state file.
Each run is stored as a structured JSON document that records configuration, exercise state, intermediate outputs, and outcomes.
A simplified entry looks like this:
```json
{
  "id": "ex-0001",
  "iteration": 1,
  "status": "completed",
  "grammar_section": { "id": "adjektivdeklination_einstieg", "level": "A2" },
  "content_topic": { "id": "freizeit_10_wochenende" },
  "current_exercise": "{...}",
  "assessment": "{\"status\": \"pass\", \"explanation\": \"\"}",
  "exercise_id": "5f866f44-faea-4f04-a319-9fe9d1f6a960"
}
```
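Note that `current_exercise` and `assessment` are stored as JSON-encoded strings inside the entry, so reading an entry back requires a second decode step. A minimal loader sketch, assuming the field names from the example above (the helper itself is hypothetical):

```python
import json

def load_entry(raw):
    """Decode one state-file entry, including the nested JSON-string fields."""
    entry = json.loads(raw)
    # current_exercise and assessment are themselves JSON-encoded strings
    for field in ("current_exercise", "assessment"):
        if isinstance(entry.get(field), str):
            try:
                entry[field] = json.loads(entry[field])
            except json.JSONDecodeError:
                pass  # keep malformed payloads as raw strings for inspection
    return entry

raw = '{"id": "ex-0001", "status": "completed", "assessment": "{\\"status\\": \\"pass\\"}"}'
entry = load_entry(raw)
```

Keeping malformed nested payloads as raw strings, rather than raising, means one bad draft cannot block reloading the whole run.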
At run level, the file also stores aggregate counts:
```json
{
  "stats": {
    "total": 100,
    "pending_generation": 0,
    "pending_assessment": 80,
    "pending_regeneration": 0,
    "dropped": 0,
    "completed": 20
  }
}
```
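Because the aggregate counts are fully derivable from the per-exercise statuses, the stats block can be recomputed as a consistency check at any time. A sketch (the function name is an assumption):

```python
from collections import Counter

STATUSES = ("pending_generation", "pending_assessment",
            "pending_regeneration", "dropped", "completed")

def compute_stats(exercises):
    """Recompute the run-level stats block from per-exercise statuses."""
    counts = Counter(ex["status"] for ex in exercises)
    stats = {status: counts.get(status, 0) for status in STATUSES}
    stats["total"] = len(exercises)
    return stats

# Mirrors the example stats block: 20 completed, 80 awaiting assessment.
exercises = [{"status": "completed"}] * 20 + [{"status": "pending_assessment"}] * 80
```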
This is not just an implementation detail.
The state file is what makes the whole process resumable and inspectable. If a batch finishes while the main process is offline, or a run is interrupted halfway through, the pipeline can resume from state rather than reconstructing its position from partial outputs.
That matters because content generation at this scale is not one transaction. It is a multi-step operation that can span hours or days.
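Resuming then amounts to reloading the state file and dispatching whatever is still pending, rather than replaying earlier calls. A sketch of that entry point, where the file layout and function name are assumptions rather than the actual implementation:

```python
import json

def resume_run(state_path):
    """Reload a run and return the remaining work, grouped by next action."""
    with open(state_path) as f:
        state = json.load(f)
    exercises = state["exercises"]
    return {
        "to_generate":   [e for e in exercises if e["status"] == "pending_generation"],
        "to_assess":     [e for e in exercises if e["status"] == "pending_assessment"],
        "to_regenerate": [e for e in exercises if e["status"] == "pending_regeneration"],
    }
```

Nothing here depends on the process that started the run still being alive, which is the point.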
The same model still runs the loop
One point is worth stating clearly.
The system does not depend on one model for generation and another for assessment. In practice, the same model is used for generation, assessment, and regeneration.
What changes between the steps is not the model family. It is the role, temperature, and message history.
- generation runs with more variation,
- assessment runs with more consistency,
- regeneration gets the critique and another chance to repair the draft.
Conceptually, the logic is still simple:
```python
messages = [system_prompt, generation_request]
exercise = call_llm(model, messages, temperature=1.0)

for iteration in range(max_iterations):
    # Assessment sees the current draft and judges it at low temperature.
    assessment = call_llm(model, messages + [exercise, assessment_prompt], temperature=0.3)
    if assessment_passes(assessment):
        break
    # Regeneration sees the draft plus the critique and tries again.
    exercise = call_llm(model, messages + [exercise, assessment, regeneration_prompt], temperature=1.0)
```
The important thing is that in production this logic is staged into batches rather than executed inline.
So the per-exercise logic stays conversational, while the run-level system stays asynchronous.
The economics are what made this worthwhile
This was the practical reason for building the batch workflow.
The cost is low enough to make library expansion realistic: roughly €0.03 per exercise using the OpenAI Batch API.
That number matters because the product strategy depends on depth. If each grammar section needs many exercises rather than a few, generation cost is part of the product model, not an accounting footnote.
The other useful number is completion rate.
A good working expectation is that roughly 85 exercises out of 100 pass the loop within 5 retries. The rest are dropped.
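Those two numbers combine into one derived figure worth tracking: the effective cost per completed exercise, since dropped drafts still consume generation and assessment calls. A back-of-the-envelope sketch using the figures quoted above (treating the €0.03 as the cost per attempted exercise is an assumption):

```python
def cost_per_completed(cost_per_attempt, completion_rate):
    """Effective cost of one completed exercise, amortizing dropped drafts."""
    return cost_per_attempt / completion_rate

# With ~EUR 0.03 per attempt and an 85% completion rate, each completed
# exercise effectively costs ~EUR 0.035.
effective = cost_per_completed(0.03, 0.85)
```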
That tells two stories at once.
First, the system is productive enough to grow the library. Second, the loop is not magical. A visible share of drafts still fail the quality gate even after several retries.
That drop rate is not just waste. It is a signal about how hard the specification really is.
What batch processing solves and what it does not
Batch processing solves a practical problem.
It makes large-scale content generation cheap enough, resumable enough, and structured enough to be useful as an ongoing operation.
What it does not solve is correctness.
A completed exercise is an exercise that passed the pipeline. It is not a proof that the exercise is perfect.
That distinction matters.
The state machine makes production practical. It does not remove the deeper quality problem. The pipeline can still approve weak or subtly wrong exercises. It can still drop drafts that are perhaps salvageable. It can still route borderline content through several retries before it passes.
So the real contribution of the batch system is not "better exercises."
It is this:
It makes the economics and the workflow of exercise production realistic enough that the product can keep expanding while still enforcing a non-trivial review loop.
