Methodology

Manufacturing Synthetic Evaluation Data

A way to generate evaluation data that holds up: one shared pipeline, a recipe per data type, difficulty measured by pass@k and a solver panel, and a human verdict on every point. The payoff is owning the eval distribution you use to calibrate and align your own agents.

Evals · Synthetic Data · Post-training · Methodology

A language model can write a thousand evaluation tasks in an afternoon. Most of them share the same phrasing, cluster at the same difficulty, and carry reference answers that are quietly wrong. A benchmark assembled from that pile measures almost nothing, and a reward signal built on it teaches the wrong thing.

This is a methodology for manufacturing synthetic evaluation data that holds up. It treats generation as a factory with three commitments: one shared pipeline shape, a separate recipe for each data type, and a receipt on every point that records how it was made and why it survived. The approach covers ground-truth final answers, rubric-scored open responses, and most evaluation formats you would want to produce at volume. The payoff for a team building AI products is direct: a private, difficulty-calibrated set drawn from your own distribution, suitable for measuring agents, calibrating judges, and generating verifiable rewards for post-training.

01 One spine, many recipes

Every data type moves through the same five stages. An input specification governs a single unit of work. A gated loop turns that specification into one candidate point. The candidate carries its receipts. A review queue holds it for human judgment. Approved points join the dataset.

Figure: The shape is fixed. Only the gated loop changes between data types.

Only the middle stage changes between data types. The framework around it stays fixed: the input specifications, the candidate envelope, the receipts, the queue and its states, the export. A data type is added by writing a recipe, which supplies a payload schema, a generation step, an ordered list of gates, a difficulty probe, a final quality check, and a way to render the point for a reviewer. The harness runs any recipe through the same machinery.
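The recipe contract described above can be sketched as a small Python structure. This is a minimal illustration under stated assumptions, not the author's implementation: the names `Recipe`, `run_recipe`, `schema_ok`, and the state strings are hypothetical, gates are reduced to plain pass/fail callables, and the difficulty probe is a stub that calls no solver.

```python
from dataclasses import dataclass
from typing import Any, Callable

Payload = dict[str, Any]


@dataclass
class Recipe:
    """What a data type supplies; the harness owns everything else."""
    name: str
    schema_ok: Callable[[Payload], bool]                 # payload schema
    generate: Callable[[dict], Payload]                  # generation step
    gates: list[tuple[str, Callable[[Payload], bool]]]   # ordered gates
    probe: Callable[[Payload], dict]                     # difficulty probe
    final_check: Callable[[Payload], bool]               # final quality check
    render: Callable[[Payload], str]                     # reviewer view


def run_recipe(recipe: Recipe, spec: dict) -> dict:
    """Push one input specification through the fixed spine."""
    payload = recipe.generate(spec)
    checks = [("schema", recipe.schema_ok), *recipe.gates,
              ("final_check", recipe.final_check)]
    receipts = []
    for name, check in checks:
        passed = check(payload)
        receipts.append({"gate": name, "passed": passed})
        if not passed:
            # Rejected candidates keep their receipts so the reason is logged.
            return {"recipe": recipe.name, "payload": payload,
                    "receipts": receipts, "state": "rejected"}
    return {"recipe": recipe.name, "payload": payload, "receipts": receipts,
            "difficulty": recipe.probe(payload),
            "rendered": recipe.render(payload), "state": "pending_review"}


# Toy recipe (hypothetical): single-answer arithmetic.
arithmetic = Recipe(
    name="arithmetic",
    schema_ok=lambda p: {"prompt", "answer"} <= p.keys(),
    generate=lambda s: {"prompt": f"{s['a']} + {s['b']} = ?",
                        "answer": s["a"] + s["b"]},
    gates=[("construct", lambda p: str(p["answer"]) not in p["prompt"])],
    probe=lambda p: {"pass_at_1": None},  # placeholder, no solver is called
    final_check=lambda p: isinstance(p["answer"], int),
    render=lambda p: f"{p['prompt']} (reference: {p['answer']})",
)
```

Calling `run_recipe(arithmetic, {"a": 2, "b": 3})` yields an envelope in the `pending_review` state, while `{"a": 0, "b": 2}` is rejected at the construct gate because the answer appears in the prompt. Adding a data type means writing another `Recipe`; `run_recipe` does not change.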

A useful discipline applies here: earn the abstraction. Build two concrete recipes first, get them working, and extract the shared spine afterward. Generality that is assumed up front tends to fit nothing well.

02 The lever and the orchestrator

One pull of the lever produces one methodically designed candidate from one input specification and one sampled coordinate. An orchestrator pulls the lever many times across many inputs. The two roles stay separate, which keeps the atomic unit small and auditable.

Gates can reject a candidate, so the count of attempts and the count of accepted points differ. A run picks one of two policies: a fixed number of attempts, or a loop that continues until a target number of points are accepted. Either way, every reject is logged with its reason.
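The two run policies can be sketched as one loop with two stopping rules. Everything here is illustrative: `orchestrate`, `pull`, and `hard_cap` are hypothetical names, `pull` stands in for one pull of the lever and returns either a point or a reject reason, and the safety cap on the until-target policy is an added assumption so the loop always terminates.

```python
from itertools import cycle
from typing import Callable, Optional

# (point, reject_reason): exactly one of the two is expected to be set.
PullResult = tuple[Optional[dict], Optional[str]]


def orchestrate(pull: Callable[[dict], PullResult], specs: list[dict], *,
                max_attempts: Optional[int] = None,
                target_accepted: Optional[int] = None,
                hard_cap: int = 10_000) -> tuple[list[dict], list[dict]]:
    """Pull the lever repeatedly under exactly one of the two policies."""
    if (max_attempts is None) == (target_accepted is None):
        raise ValueError("choose exactly one of max_attempts or target_accepted")
    budget = max_attempts if max_attempts is not None else hard_cap
    accepted, reject_log = [], []
    for attempt, spec in zip(range(budget), cycle(specs)):
        if target_accepted is not None and len(accepted) >= target_accepted:
            break
        point, reason = pull(spec)
        if point is not None:
            accepted.append(point)
        else:
            # Every reject is logged with its reason.
            reject_log.append({"attempt": attempt, "spec": spec, "reason": reason})
    return accepted, reject_log


def toy_pull(spec: dict) -> PullResult:
    """Stand-in lever: accepts even-numbered specs, rejects the rest."""
    if spec["n"] % 2 == 0:
        return {"id": spec["n"]}, None
    return None, "novelty: near-duplicate"
```

With `specs = [{"n": i} for i in range(4)]`, a fixed budget of six attempts accepts three points and logs three rejects, while `target_accepted=5` keeps pulling until five points survive.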

03 Variety by construction

Repetition is a structural failure of language models. Asked the same way, they answer the same way, and instruction-tuned generators collapse toward a narrow mode.[1] The methodology addresses this at three points, so the spread becomes a property of the design.

The input specification enumerates variety axes: domain, subdomain, task subtype, difficulty, constraints, persona, and format.
Each pull samples a fresh coordinate across those axes along with a decorrelation seed.
A novelty gate compares each candidate against the existing corpus and rejects near-duplicates.

Input specifications are authored by a human with model assistance. Getting the right spread across axes is a judgment call, and the specification is where that judgment lives.
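The coordinate sampling and the novelty gate described above can be sketched as follows. The axis values are invented for illustration, and the similarity measure (Jaccard overlap of word trigrams with a 0.5 cutoff) is an assumption chosen to keep the example dependency-free; the article does not say which measure the gate uses, and an embedding-based distance would be a reasonable substitute.

```python
import random

# Hypothetical variety axes; a real input specification would list its own.
AXES = {
    "domain": ["billing", "scheduling", "search"],
    "task_subtype": ["lookup", "multi-step", "refusal"],
    "difficulty": ["easy", "medium", "hard"],
    "persona": ["novice", "expert"],
}


def sample_coordinate(axes: dict[str, list[str]], rng: random.Random) -> dict:
    """Draw one value per axis plus a decorrelation seed for the generator."""
    coord = {axis: rng.choice(values) for axis, values in axes.items()}
    coord["decorrelation_seed"] = rng.randrange(2**32)
    return coord


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Word n-grams; short texts collapse to a single shingle."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def is_novel(candidate: str, corpus: list[str], max_similarity: float = 0.5) -> bool:
    """Reject a candidate whose Jaccard overlap with any corpus item is too high."""
    cand = shingles(candidate)
    for existing in corpus:
        other = shingles(existing)
        if len(cand & other) / len(cand | other) > max_similarity:
            return False
    return True
```

Passing a seeded `random.Random` keeps a run reproducible, so a rejected coordinate can be replayed when debugging a recipe.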

04 The gated loop

The middle stage is a sequence of gates. Each gate inspects the candidate, returns a pass, a fail, or a repair instruction, and records its result. Order matters. Cheap structural checks run first, and the expensive model-graded checks run once a candidate has proven worth the spend.

Figure: Cheap checks first, model-graded checks once a candidate is worth the spend. Repair is bounded on purpose.

A representative ordering runs like this.

Construct check. The point is solvable from its own contents. Nothing leaks the answer, and the framing is self-contained.
Novelty gate. The point is far enough from everything already in the corpus.
Independent oracle. A separate solver, blind to the authored answer, derives its own. The two must agree. A disagreement means the reference is suspect, and a wrong reference answer is the single most common way a synthetic benchmark fails silently.
Counter-argument. A skeptical reviewer argues the strongest opposite case the facts allow and rates its own strength. A genuinely contestable point gets set aside as noisy.
Realism check. A judge rates whether the scenario reads as something that happens in the wild. Staged setups, telegraphed answers, and cartoonish numbers get caught here.
Difficulty probe. Covered in the next section, since measuring difficulty earns its own treatment.
Gameability check. The grader accepts the intended answer and turns away a plausible wrong one. For richer formats, a degenerate response that scores well while being bad gets hunted down and closed off.

When a gate fails, the loop either repairs the candidate and re-runs the affected gates,[2] or it gives up after a small number of repairs and logs an honest reject. Repair stays bounded on purpose. A point that needs five rewrites to survive is usually telling you something.
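A bounded-repair loop of this kind can be sketched in a few lines. The names `GateResult` and `run_gates` are hypothetical, and the sketch makes one simplifying assumption: after a repair it re-runs every gate from the start, which is the most conservative reading of "re-runs the affected gates".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateResult:
    verdict: str                                      # "pass", "fail", or "repair"
    note: str = ""
    repair: Optional[Callable[[dict], dict]] = None   # set when verdict == "repair"


Gate = tuple[str, Callable[[dict], GateResult]]


def run_gates(candidate: dict, gates: list[Gate],
              max_repairs: int = 2) -> tuple[Optional[dict], list[dict]]:
    """Run ordered gates; return (survivor or None, trace of every gate result)."""
    trace, repairs, i = [], 0, 0
    while i < len(gates):
        name, gate = gates[i]
        result = gate(candidate)
        trace.append({"gate": name, "verdict": result.verdict, "note": result.note})
        if result.verdict == "pass":
            i += 1
        elif result.verdict == "repair" and result.repair and repairs < max_repairs:
            candidate = result.repair(candidate)
            repairs += 1
            i = 0  # assumption: restart from the first gate after any repair
        else:
            return None, trace  # honest reject: failed, or repair budget spent
    return candidate, trace


def format_gate(c: dict) -> GateResult:
    """Cheap structural check: the prompt must end with a question mark."""
    if c["prompt"].rstrip().endswith("?"):
        return GateResult("pass")
    fix = lambda cand: {**cand, "prompt": cand["prompt"].rstrip() + "?"}
    return GateResult("repair", "missing question mark", repair=fix)


def leak_gate(c: dict) -> GateResult:
    """Construct check: the answer must not appear in the prompt."""
    if str(c["answer"]) in c["prompt"]:
        return GateResult("fail", "answer leaks into prompt")
    return GateResult("pass")


GATES: list[Gate] = [("format", format_gate), ("no_leak", leak_gate)]
```

Running `run_gates({"prompt": "What is 6 times 7", "answer": 42}, GATES)` repairs the prompt once and then passes both gates; a prompt that states the answer is rejected outright, and the trace records which gate rejected it and why.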

05 Difficulty is a measurement

Difficulty is something the pipeline observes. A label written by the author is a guess. The real measure comes from running the candidate past a panel of solvers and watching what happens.

The probe estimates an item's difficulty empirically, and two estimators do the work. Sampling k completions from a fixed policy gives pass@k,[3] the probability that at least one of k samples is graded correct, which characterizes how solvable the item is for that model at a given sampling budget. Running the item across a panel of models of varying capability gives a solve rate that characterizes how well the item separates strong policies from weak ones. The factory records both numbers and which models passed, so the difficulty label rests on observed solve behavior.
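Both estimators are short to compute. The sketch below uses the unbiased pass@k estimator from Chen et al.[3]: with n sampled completions of which c are graded correct, pass@k is estimated as 1 - C(n-c, k) / C(n, k). Whether the pipeline described here uses this exact estimator is an assumption, and `panel_solve_rate` with its dictionary input is a hypothetical helper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct (Chen et al. 2021)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def panel_solve_rate(solved_by_model: dict[str, bool]) -> float:
    """Fraction of the solver panel that solved the item."""
    return sum(solved_by_model.values()) / len(solved_by_model)
```

For an item where 3 of 10 samples are correct, `pass_at_k(10, 3, 1)` is 0.3 and `pass_at_k(10, 3, 5)` is about 0.917, which shows how much the sampling budget changes the apparent solvability of the same item.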

Figure: A point earns its place when stronger solvers pass and weaker ones do not.

A point earns its place when it discriminates. An item solved at pass@1 by every policy carries no gradient, since it separates nothing and rewards nothing.[4] An item that no policy solves at any k is either beyond the current frontier or quietly broken. The useful band sits between the two, where stronger policies pass and weaker ones do not. The most citable signal of all is an item where even the strongest available model fails at high k, which is the clearest evidence that a benchmark has teeth and the most valuable target for the next round of training.[5]
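The three bands can be expressed as a small classifier over the probe's per-policy outcomes. The label names and the input shape (one boolean per policy for pass@1, one for "solved at any tested k") are assumptions made for illustration.

```python
def classify_item(solved_at_1: dict[str, bool], solved_at_k: dict[str, bool]) -> str:
    """Place an item in a difficulty band from per-policy probe outcomes."""
    if all(solved_at_1.values()):
        return "too_easy"        # every policy solves it at pass@1: separates nothing
    if not any(solved_at_k.values()):
        return "unsolved"        # beyond the frontier or quietly broken: needs a human look
    return "discriminating"      # some policies pass and others do not
```

An item labeled `unsolved` is not discarded automatically in this sketch, since the text treats it as either the most valuable kind of item or a defect, and only inspection can tell the two apart.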

This view has a name in the literature. Separability with confidence[6] asks what fraction of model pairs a benchmark can tell apart with non-overlapping confidence intervals. Adopting that vocabulary turns difficulty into a metric you can report and defend.
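Given a confidence interval on each model's score, the metric reduces to counting pairs whose intervals do not overlap. The sketch below takes the intervals as input and leaves their construction (for example by bootstrapping over items) out of scope; the function name and input format are assumptions.

```python
from itertools import combinations


def separability(intervals: dict[str, tuple[float, float]]) -> float:
    """Fraction of model pairs whose (low, high) score intervals do not overlap."""
    pairs = list(combinations(intervals.values(), 2))
    if not pairs:
        return 0.0
    separated = sum(
        1 for (lo_a, hi_a), (lo_b, hi_b) in pairs
        if hi_a < lo_b or hi_b < lo_a
    )
    return separated / len(pairs)
```

With intervals `{"A": (0.10, 0.20), "B": (0.25, 0.35), "C": (0.30, 0.45)}`, A is separated from both B and C while B and C overlap, so the benchmark separates two of three pairs.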

Pass@k volatility across the agent flow

A single end-to-end pass rate hides where an agent actually fails. An agentic task runs as a sequence of steps, and when the steps succeed independently, the probability of finishing is the product of the per-step success probabilities. One brittle step drags the whole trajectory down, and the end-to-end number gives no hint which step it was.

Measuring pass@k at each step recovers that information. Run k rollouts, grade the action at every step against its known-correct action, and read a pass@k for each step in the workflow. The profile across steps is volatile. Most steps sit near the ceiling, and a few collapse. The steps that collapse are where the failure mass concentrates, and they are the steps worth generating more data for, hardening with verifiable rewards, and tracking as a regression signal release over release.

Figure: The same task, broken out by step. The end-to-end rate is the product of these; two steps own most of the loss.

This is why the factory fixes a single decision per point. A library of step-level decisions, each with a known-correct action and a measured pass@k, lets you build the per-step failure profile for any agent you run through it, locate the brittle steps, and aim the next round of data and training squarely at them.
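A per-step failure profile can be sketched from graded rollouts. Two assumptions are baked in: each step is graded on its own against the known-correct action (as the single-decision design implies), and the end-to-end estimate multiplies per-step rates, which holds only when steps succeed independently. The function names and the 0.8 brittleness threshold are illustrative.

```python
from math import prod


def step_profile(rollouts: list[list[bool]]) -> list[float]:
    """rollouts[r][s] is True when rollout r took the known-correct action at step s."""
    return [sum(step) / len(step) for step in zip(*rollouts)]


def brittle_steps(rates: list[float], threshold: float = 0.8) -> list[int]:
    """Indices of steps whose success rate falls below the threshold."""
    return [i for i, rate in enumerate(rates) if rate < threshold]


# Four rollouts of a three-step task (made-up outcomes).
rollouts = [
    [True, True, True],
    [True, False, True],
    [True, False, True],
    [True, True, False],
]
rates = step_profile(rollouts)   # per-step success rates
end_to_end = prod(rates)         # product of per-step rates
weak = brittle_steps(rates)      # steps to target with more data
```

In this toy case the first step never fails, and the second and third steps account for all of the end-to-end loss, so they are the ones to generate more decision points for.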

06 Two recipes

Two recipes show how the middle stage changes while the spine holds.

Figure: The verification machinery re-points onto whatever carries correctness for the type.

The first recipe is a ground-truth final answer. The point is a task prompt and a single verifiable answer, graded by program through exact match, numeric tolerance, or schema. The hard question for this type is whether the reference answer is correct, which is why the independent oracle does the heavy lifting and the final check confirms the grader accepts the right answer and turns away a wrong one.
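The three grading modes and the final check can be sketched as plain functions. The normalization rules (case and whitespace folding for exact match, a default tolerance of 1e-6, a schema expressed as key-to-type pairs) are assumptions, and `grader_is_sound` is a hypothetical name for the check that the grader accepts the reference and rejects a plausible wrong answer.

```python
import math
from typing import Any, Callable


def grade_exact(answer: str, reference: str) -> bool:
    """Exact match after trimming whitespace and folding case."""
    return answer.strip().lower() == reference.strip().lower()


def grade_numeric(answer: Any, reference: float, tol: float = 1e-6) -> bool:
    """Numeric match within a tolerance; unparseable answers are wrong."""
    try:
        return math.isclose(float(answer), reference, rel_tol=tol, abs_tol=tol)
    except (TypeError, ValueError):
        return False


def grade_schema(answer: Any, schema: dict[str, type]) -> bool:
    """Schema match: every required key is present with the expected type."""
    return isinstance(answer, dict) and all(
        key in answer and isinstance(answer[key], expected)
        for key, expected in schema.items()
    )


def grader_is_sound(grader: Callable[[Any], bool],
                    reference_answer: Any, plausible_wrong: Any) -> bool:
    """Final check: accept the intended answer, turn away a plausible wrong one."""
    return grader(reference_answer) and not grader(plausible_wrong)
```

For a point whose reference is 0.3, `grader_is_sound(lambda a: grade_numeric(a, 0.3), "0.30", "0.33")` passes. If the tolerance were loosened to 0.1, the same check would fail, which is the kind of hole this step exists to catch.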

The second recipe is a prompt with a captured output and a rubric of atomic criteria. Here the hard question moves onto the rubric itself: whether it is atomic, mutually exclusive and collectively exhaustive, discriminating, and free of contrivance. The verification machinery re-points from the answer onto the rubric.

The criterion model carries most of the weight in the rubric recipe. Each criterion is one row tagged on three axes.

Axis	Values	What it controls
type	instruction-following · outcome · process · grounding and safety	the backbone, chosen so the four cover distinct behaviors with minimal overlap
polarity	must · must-not	whether the behavior is required or forbidden
criticality	gate · additive	a gate failure caps the whole score; additive criteria sum

Scoring stays binary per criterion, equal weight, with gate criteria able to veto the total. Reporting per-type subscores alongside the total gives the diagnostic breakdown that makes a benchmark useful to read. Two gates are specific to rubrics. An atomicity gate splits any compound or vague criterion until each is a single yes or no a judge can answer from the output alone. An inter-judge agreement gate has several judges grade sample outputs per criterion, and any criterion they cannot agree on gets cut or reworded, since a criterion judges disagree on is noise.[7]
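The scoring rule can be sketched directly from the three axes. Two details are assumptions, since the text leaves them open: gate criteria count toward the equal-weight total like any other criterion, and a gate failure caps the total at a configurable `gate_cap` that defaults to 0. The class and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    text: str
    type: str         # instruction-following | outcome | process | grounding-and-safety
    polarity: str     # must | must-not
    criticality: str  # gate | additive


def score(criteria: list[Criterion], observed: dict[str, bool],
          gate_cap: float = 0.0) -> dict:
    """observed[text] is the judge's yes/no on whether the behavior appears."""
    met_by_type: dict[str, list[bool]] = defaultdict(list)
    gate_failed = False
    for c in criteria:
        present = observed[c.text]
        met = present if c.polarity == "must" else not present
        met_by_type[c.type].append(met)
        if c.criticality == "gate" and not met:
            gate_failed = True
    all_met = [m for ms in met_by_type.values() for m in ms]
    total = sum(all_met) / len(all_met)   # binary per criterion, equal weight
    if gate_failed:
        total = min(total, gate_cap)      # a gate failure caps the whole score
    subscores = {t: sum(ms) / len(ms) for t, ms in met_by_type.items()}
    return {"total": total, "gate_failed": gate_failed, "subscores": subscores}


RUBRIC = [
    Criterion("answers in JSON", "instruction-following", "must", "additive"),
    Criterion("states the correct refund amount", "outcome", "must", "additive"),
    Criterion("checks order status before refunding", "process", "must", "additive"),
    Criterion("reveals another customer's data", "grounding-and-safety", "must-not", "gate"),
]
```

A response that meets the first two criteria, skips the status check, and leaks nothing scores 0.75 with a process subscore of 0. The same response with a data leak scores 0, because the must-not gate is violated.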

07 How these sets get used

The output of this factory is a labeled, difficulty-graded, deduplicated set with provenance on every point. That shape drops directly into the post-training and evaluation stack.

Held-out evaluation. A discriminating set becomes a private eval harness and a regression suite. Because the points are freshly generated and never published, they sidestep the contamination that quietly inflates scores on public benchmarks.
Verifiable rewards. The ground-truth recipe yields items with programmatic graders, which are exactly the reward signal that reinforcement learning with verifiable rewards (RLVR) consumes.[4] The too-easy filter keeps that signal informative by dropping items the current policy already solves at pass@1.
Reward models and judges. The rubric recipe yields per-criterion labels with inter-judge agreement statistics, which train and calibrate LLM judges and reward models. The per-type subscores localize where a model is weak across instruction-following, outcome, process, and grounding.
Agent alignment and calibration. For agents that call tools and take actions, each point fixes a single decision with a known correct action. Scoring an agent across a difficulty-graded set of such decisions measures calibration directly: where it acts correctly, where it over-acts, and where it holds back when holding back is right.
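The calibration measurement in the last item can be sketched as a tally over single-decision points. The `HOLD` sentinel for "the correct move is to take no action" and the five bucket names are assumptions introduced for illustration.

```python
from collections import Counter
from typing import Iterable

HOLD = "hold"  # hypothetical sentinel: the correct move is to take no action


def calibration_profile(decisions: Iterable[tuple[str, str]]) -> dict[str, int]:
    """Tally (expected_action, agent_action) pairs into calibration buckets."""
    tally: Counter = Counter()
    for expected, actual in decisions:
        if actual == expected:
            tally["correct_hold" if expected == HOLD else "correct_act"] += 1
        elif expected == HOLD:
            tally["over_act"] += 1     # acted when holding back was right
        elif actual == HOLD:
            tally["under_act"] += 1    # held back when action was needed
        else:
            tally["wrong_act"] += 1    # acted, but took the wrong action
    return dict(tally)
```

Breaking the same tally out by difficulty band shows whether an agent over-acts mainly on hard items or across the board.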

08 Receipts and the trace

A data point is worth the payload plus its receipts plus its verdict. The store keeps the full trace of how each point was made: every gate result, the difficulty probe with its per-model outcomes, and the repair history. That trace is what makes this a factory with an audit trail. Anyone can open a point and see exactly why it survived. A bare list of prompts and answers offers no such account.
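A stored point of this kind might look like the record below. The field names, the verdict values, and the choice of JSON as the storage format are assumptions; the intent is only to show payload, receipts, and verdict travelling together.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PointRecord:
    payload: dict
    gate_results: list = field(default_factory=list)    # every gate result, in order
    difficulty: dict = field(default_factory=dict)      # probe with per-model outcomes
    repair_history: list = field(default_factory=list)  # what was rewritten and why
    verdict: str = "pending"                            # pending | approved | edited | rejected

    def why_it_survived(self) -> list[str]:
        """Human-readable account of the gates this point went through."""
        return [
            f"{g['gate']}: {g['verdict']}" + (f" ({g['note']})" if g.get("note") else "")
            for g in self.gate_results
        ]

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "PointRecord":
        return cls(**json.loads(raw))
```

Because the record round-trips through JSON, the same trace can be exported with the dataset and reopened by a reviewer or an auditor later.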

09 Human review and reflection

Gates filter, and humans judge. Every surviving candidate lands in a review queue where a person approves, edits, or rejects it. Those verdicts do more than clear the queue. They become signal that improves later generation.

This matters because of criteria drift.[7] You cannot fully author a rubric before you have seen real outputs. Grading reveals criteria you did not know you needed. A review queue feeding a reflection loop is the practical answer. As reviewers reject points and annotate why, a checklist of commonly missed criteria accumulates inside the recipe, and later points start more complete. The factory gets better as it runs.
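The accumulation step of the reflection loop can be sketched as a count over reviewer annotations. The review-log format, the `missed_criteria` field, and the rule that a criterion joins the checklist after two rejections are all assumptions.

```python
from collections import Counter


def build_checklist(review_log: list[dict], min_count: int = 2) -> list[str]:
    """Criteria that reviewers repeatedly flagged as missing, most frequent first."""
    counts = Counter(
        tag
        for review in review_log
        if review["verdict"] == "rejected"
        for tag in review["missed_criteria"]
    )
    return [tag for tag, n in counts.most_common() if n >= min_count]
```

The resulting list would be fed back into the recipe's generation step, for example as extra instructions, so that later candidates start with those criteria already covered.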

10 Build your own

For a startup shipping an AI product, this is the point. Public benchmarks measure general capability on a generic distribution, and they saturate and leak over time. Your product lives on a specific distribution: your domain, your tools, your workflows, your failure modes. A small, difficulty-gated, receipt-bearing set drawn from that distribution is what lets you calibrate and align your own agents, catch regressions before users do, and mint verifiable rewards for your own post-training. Owning your evaluation distribution is a durable advantage that compounds as your product moves.

To support a new evaluation format, implement the recipe interface: a payload schema, a generation step, an ordered set of gates, a difficulty probe, a final quality check, and a render for review. Keep the spine. Swap the gates that are specific to your format.

A few principles travel with the method across every data type:

Measure difficulty with a solver panel and keep what discriminates.
Verify the thing that carries correctness, whether that is a reference answer or a rubric, with an independent check.
Hunt for the degenerate response that games the grader, and close the hole.
Store receipts so every point can be audited.
Route everything through human review, and feed the verdicts back.
Earn the abstraction by building concrete recipes first.

The contribution here is the assembly: difficulty-gated, receipt-bearing, human-reviewed evaluation data, with a shape that holds across data types and a clear seam for adapting it to a new one.

Sources

[1] Survey on quality, diversity, and complexity in synthetic data. arXiv:2412.02980.
[2] Bai et al. Constitutional AI: Harmlessness from AI Feedback. Anthropic, 2022. arXiv:2212.08073.
[3] Chen et al. Evaluating Large Language Models Trained on Code. 2021 (origin of pass@k). arXiv:2107.03374.
[4] Reinforcement learning with verifiable rewards; NVIDIA Nemotron-4 340B Reward. huggingface.co/nvidia/Nemotron-4-340B-Reward.
[5] Wang et al. Self-Instruct, 2022; Evol-Instruct / WizardLM; CoT-Self-Instruct, 2025. arXiv:2507.23751; survey.
[6] Li et al. Arena-Hard-Auto / BenchBuilder: separability with confidence. 2024. arXiv:2406.11939.
[7] Shankar et al. Who Validates the Validators? (EvalGen): criteria drift and mixed-initiative rubric design. 2024. arXiv:2404.12272.

Further reading: Lambert, synthetic data and RLAIF overview; LLM synthetic-data reading list.