Neal Desai · Forward Deployed · GTM · Product · Engineering

Methodology

Manufacturing Synthetic Evaluation Data

A way to generate evaluation data that holds up: one shared pipeline, a recipe per data type, difficulty measured by pass@k and a solver panel, and a human verdict on every point. The payoff is owning the eval distribution you use to calibrate and align your own agents.

Evals · Synthetic Data · Post-training · Methodology

A language model can write a thousand evaluation tasks in an afternoon. Most of them share the same phrasing, cluster at the same difficulty, and carry reference answers that are quietly wrong. A benchmark assembled from that pile measures almost nothing, and a reward signal built on it teaches the wrong thing.

This is a methodology for manufacturing synthetic evaluation data that holds up. It treats generation as a factory with three commitments: one shared pipeline shape, a separate recipe for each data type, and a receipt on every point that records how it was made and why it survived. The approach covers ground-truth final answers, rubric-scored open responses, and most evaluation formats you would want to produce at volume. The payoff for a team building AI products is direct: a private, difficulty-calibrated set drawn from your own distribution, suitable for measuring agents, calibrating judges, and generating verifiable rewards for post-training.

01 One spine, many recipes

Every data type moves through the same five stages. An input specification governs a single unit of work. A gated loop turns that specification into one candidate point. The candidate carries its receipts. A review queue holds it for human judgment. Approved points join the dataset.

[Figure: Input Spec (one unit of work) → Gated Loop (the lever) → Candidate + receipts → Review Queue (human verdicts) → Approved dataset. Reflection arrow back to generation: verdicts improve later generation.]
The shape is fixed. Only the gated loop changes between data types.

Only the middle stage changes between data types. The framework around it stays fixed: the input specifications, the candidate envelope, the receipts, the queue and its states, the export. A data type is added by writing a recipe, which supplies a payload schema, a generation step, an ordered list of gates, a difficulty probe, a final quality check, and a way to render the point for a reviewer. The harness runs any recipe through the same machinery.
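The recipe contract described above can be sketched as a small data structure plus one harness function. This is a minimal illustration, not the author's implementation: the names `Recipe`, `run_recipe`, and the toy addition recipe are hypothetical, the payload schema is reduced to a tuple of required field names, and the difficulty probe is a placeholder.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Payload = Dict[str, Any]
# Hypothetical gate signature: payload -> (passed, reason).
Gate = Callable[[Payload], Tuple[bool, str]]


@dataclass
class Recipe:
    name: str
    required_fields: Tuple[str, ...]                 # stand-in for a payload schema
    generate: Callable[[Dict[str, Any]], Payload]    # input spec -> candidate payload
    gates: List[Tuple[str, Gate]]                    # ordered, cheapest first
    probe_difficulty: Callable[[Payload], float]
    final_check: Callable[[Payload], bool]
    render: Callable[[Payload], str]                 # reviewer-facing view


def run_recipe(recipe: Recipe, spec: Dict[str, Any]) -> Dict[str, Any]:
    """Run one input spec through any recipe with the same fixed machinery."""
    payload = recipe.generate(spec)
    receipts: List[Dict[str, Any]] = []
    missing = [f for f in recipe.required_fields if f not in payload]
    if missing:
        return {"accepted": False, "reason": f"schema: missing {missing}", "receipts": receipts}
    for gate_name, gate in recipe.gates:
        passed, reason = gate(payload)
        receipts.append({"gate": gate_name, "passed": passed, "reason": reason})
        if not passed:
            return {"accepted": False, "reason": reason, "receipts": receipts}
    if not recipe.final_check(payload):
        return {"accepted": False, "reason": "final check failed", "receipts": receipts}
    return {
        "accepted": True,
        "payload": payload,
        "difficulty": recipe.probe_difficulty(payload),
        "receipts": receipts,
        "review_view": recipe.render(payload),
    }


def non_trivial(p: Payload) -> Tuple[bool, str]:
    ok = p["answer"] != 0
    return ok, "" if ok else "answer is zero"


# Toy recipe: addition problems with a program-checkable answer.
addition = Recipe(
    name="toy-addition",
    required_fields=("prompt", "answer"),
    generate=lambda s: {"prompt": f"{s['a']} + {s['b']} = ?", "answer": s["a"] + s["b"]},
    gates=[("non_trivial", non_trivial)],
    probe_difficulty=lambda p: 0.0,  # placeholder; a real probe would run solvers
    final_check=lambda p: isinstance(p["answer"], int),
    render=lambda p: f"Q: {p['prompt']}\nA: {p['answer']}",
)

result = run_recipe(addition, {"a": 2, "b": 3})
```

Adding a second data type under this sketch means constructing another `Recipe`; `run_recipe` does not change.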

A useful discipline applies here: earn the abstraction. Build two concrete recipes first, get them working, and extract the shared spine afterward. Generality that is assumed up front tends to fit nothing well.

02 The lever and the orchestrator

One pull of the lever produces one methodically designed candidate from one input specification and one sampled coordinate. An orchestrator pulls the lever many times across many inputs. The two roles stay separate, which keeps the atomic unit small and auditable.

Gates can reject a candidate, so the count of attempts and the count of accepted points differ. A run picks one of two policies: a fixed number of attempts, or a loop that continues until a target number of points are accepted. Either way, every reject is logged with its reason.
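The two run policies can be sketched in a few lines. The function `orchestrate`, its parameters, and the `hard_cap` safety limit are assumptions made for illustration; `pull_lever` stands for whatever produces one candidate and is expected to return either a point or a rejection reason.

```python
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical lever signature: spec -> (point or None, rejection reason or None).
Lever = Callable[[Any], Tuple[Optional[Any], Optional[str]]]


def orchestrate(
    pull_lever: Lever,
    specs: List[Any],
    *,
    max_attempts: Optional[int] = None,
    target_accepted: Optional[int] = None,
    hard_cap: int = 10_000,
):
    """Pull the lever repeatedly under exactly one of the two policies."""
    if (max_attempts is None) == (target_accepted is None):
        raise ValueError("choose exactly one of max_attempts or target_accepted")
    accepted: List[Any] = []
    rejects: List[dict] = []
    attempts = 0
    # The until-target policy still needs a ceiling so a broken gate cannot loop forever.
    limit = max_attempts if max_attempts is not None else hard_cap
    while attempts < limit:
        if target_accepted is not None and len(accepted) >= target_accepted:
            break
        spec = specs[attempts % len(specs)]  # cycle through the input specs
        point, reason = pull_lever(spec)
        attempts += 1
        if point is None:
            rejects.append({"spec": spec, "reason": reason})  # every reject is logged
        else:
            accepted.append(point)
    return accepted, rejects, attempts


# Toy lever: accepts even specs, rejects odd ones with a reason.
def toy_lever(spec: int):
    return (spec, None) if spec % 2 == 0 else (None, "odd spec")


fixed = orchestrate(toy_lever, [0, 1, 2, 3], max_attempts=4)
until = orchestrate(toy_lever, [0, 1, 2, 3], target_accepted=3)
```

With the toy lever, the fixed policy stops after four attempts with two accepted points, while the target policy needs a fifth attempt to reach three.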

03 Variety by construction

Repetition is a structural failure of language models. Asked the same way, they answer the same way, and instruction-tuned generators collapse toward a narrow mode.[1] The methodology addresses this at three points, so the spread becomes a property of the design.

Input specifications are authored by a human with model assistance. Getting the right spread across axes is a judgment call, and the specification is where that judgment lives.

04 The gated loop

The middle stage is a sequence of gates. Each gate inspects the candidate, returns a pass, a fail, or a repair instruction, and records its result. Order matters. Cheap structural checks run first, and the expensive model-graded checks run once a candidate has proven worth the spend.

[Figure: gates in order: Author → Construct → Novelty → Oracle → Counter → Realism → Difficulty → Game-ability → Emit. A failing gate leads either to a bounded repair followed by a re-run, or to an honest reject logged with its reason.]
Cheap checks first, model-graded checks once a candidate is worth the spend. Repair is bounded on purpose.

A representative ordering runs like this.

When a gate fails, the loop either repairs the candidate and re-runs the affected gates,[2] or it gives up after a small number of repairs and logs an honest reject. Repair stays bounded on purpose. A point that needs five rewrites to survive is usually telling you something.
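A bounded repair loop might look like the sketch below. The `Verdict` type, the `run_gates` function, and the two toy gates are hypothetical. One simplification is worth flagging: after a repair this sketch re-runs every gate from the start, which is the simplest safe policy, whereas the text describes re-running only the affected gates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    status: str                                      # "pass", "fail", or "repair"
    reason: str = ""
    repair: Optional[Callable[[dict], dict]] = None  # how to fix the candidate


GateFn = Callable[[dict], Verdict]


def run_gates(candidate: dict, gates: List[Tuple[str, GateFn]], max_repairs: int = 2):
    """Return (candidate or None, trace). None means an honest reject."""
    trace: List[Tuple[str, str, str]] = []
    repairs = 0
    i = 0
    while i < len(gates):
        name, gate = gates[i]
        verdict = gate(candidate)
        trace.append((name, verdict.status, verdict.reason))
        if verdict.status == "pass":
            i += 1
        elif verdict.status == "repair" and verdict.repair is not None and repairs < max_repairs:
            candidate = verdict.repair(candidate)
            repairs += 1
            i = 0  # simplification: re-run all gates after any repair
        else:
            return None, trace  # fail, or repair budget exhausted
    return candidate, trace


# Cheap structural check first.
def has_answer(c: dict) -> Verdict:
    return Verdict("pass") if c.get("answer") else Verdict("fail", "missing answer")


# A gate that can ask for a repair instead of failing outright.
def trimmed(c: dict) -> Verdict:
    if c["prompt"] == c["prompt"].strip():
        return Verdict("pass")
    return Verdict("repair", "stray whitespace", lambda x: {**x, "prompt": x["prompt"].strip()})


GATES = [("construct", has_answer), ("format", trimmed)]
point, trace = run_gates({"prompt": "  2+2? ", "answer": "4"}, GATES)
```

The trace doubles as the receipt: it records each gate result in order, including the repair and the re-run that followed it.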

05 Difficulty is a measurement

Difficulty is something the pipeline observes. A label written by the author is a guess. The real measure comes from running the candidate past a panel of solvers and watching what happens.

The probe estimates an item's difficulty empirically, and two estimators do the work. Sampling k completions from a fixed policy gives pass@k,[3] the probability that at least one of k samples is graded correct, which characterizes how solvable the item is for that model at a given sampling budget. Running the item across a panel of models of varying capability gives a solve rate that characterizes how well the item separates strong policies from weak ones. The factory records both numbers and which models passed, so the difficulty label rests on observed solve behavior.
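Both estimators are short to compute. The sketch below uses the unbiased pass@k estimator from Chen et al. [3], which draws n ≥ k samples, counts the c correct ones, and computes 1 − C(n−c, k) / C(n, k). The function names, the record layout, and the model names are assumptions for illustration.

```python
from math import comb
from typing import Dict


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples of which c were graded correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def panel_solve_rate(solved: Dict[str, bool]) -> float:
    """Fraction of panel models that solved the item."""
    return sum(solved.values()) / len(solved)


# Hypothetical panel outcome for one item.
panel = {"small-model": False, "mid-model": False, "large-model": True}

difficulty_record = {
    "pass_at_1": pass_at_k(n=10, c=3, k=1),
    "pass_at_5": pass_at_k(n=10, c=3, k=5),
    "panel_solve_rate": panel_solve_rate(panel),
    "passed_models": [m for m, ok in panel.items() if ok],
}
```

Storing `passed_models` next to the two numbers keeps the label tied to observed behavior: a reader can see which solvers passed, and not only how many.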

[Figure: solve rate across the panel, from 0% to 100%. Near 0%: too hard / suspect. Middle: the discriminating band. Near 100%: too easy / no info. Annotation: even the strongest fails = sharpest signal.]
A point earns its place when stronger solvers pass and weaker ones do not.

A point earns its place when it discriminates. An item solved at pass@1 by every policy carries no gradient, since it separates nothing and rewards nothing.[4] An item that no policy solves at any k is either beyond the current frontier or quietly broken. The useful band sits between the two, where stronger policies pass and weaker ones do not. The most citable signal of all is an item where even the strongest available model fails at high k, which is the clearest evidence that a benchmark has teeth and the most valuable target for the next round of training.[5]
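The three bands can be expressed as a small classifier over panel results. This is a sketch under stated assumptions: the band cut-offs (exactly 0% and 100% by default), the band names, and the ordering check are illustrative choices and are not specified in the text. The panel is assumed to be listed from weakest to strongest solver.

```python
from typing import List


def classify_item(solved_weak_to_strong: List[bool], low: float = 0.0, high: float = 1.0) -> str:
    """Place an item in a band using its panel solve rate."""
    rate = sum(solved_weak_to_strong) / len(solved_weak_to_strong)
    if rate <= low:
        return "too_hard_or_suspect"  # beyond the frontier, or quietly broken
    if rate >= high:
        return "too_easy"             # separates nothing
    return "discriminating"


def respects_capability_order(solved_weak_to_strong: List[bool]) -> bool:
    """True when no weaker solver passes while a stronger one fails."""
    return solved_weak_to_strong == sorted(solved_weak_to_strong)


bands = [
    classify_item([True, True, True]),     # everyone solves it
    classify_item([False, False, False]),  # nobody solves it
    classify_item([False, True, True]),    # stronger pass, weakest fails
]
```

An item that lands in the discriminating band but fails the ordering check (a weak solver passes where a strong one fails) is a reasonable candidate for a closer look, since the signal may come from luck or from a flawed reference answer.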

This view has a name in the literature. Separability with confidence[6] asks what fraction of model pairs a benchmark can tell apart with non-overlapping confidence intervals. Adopting that vocabulary turns difficulty into a metric you can report and defend.
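As a rough sketch, separability can be computed from per-item scores for each model. The percentile bootstrap used here for the confidence intervals is an assumed choice, as are the function names and the toy scores; the definition itself (fraction of model pairs with non-overlapping intervals) follows the text.

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple


def bootstrap_ci(scores: List[float], rng: random.Random,
                 n_boot: int = 1000, alpha: float = 0.05) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def separability(per_model_scores: Dict[str, List[float]], seed: int = 0) -> float:
    """Fraction of model pairs whose confidence intervals do not overlap."""
    rng = random.Random(seed)
    cis = {m: bootstrap_ci(s, rng) for m, s in per_model_scores.items()}
    pairs = list(combinations(cis, 2))
    separated = sum(
        1 for a, b in pairs
        if cis[a][1] < cis[b][0] or cis[b][1] < cis[a][0]
    )
    return separated / len(pairs)


# Toy scores: one model solves everything, two identical models solve nothing.
toy = {"strong": [1.0] * 20, "weak_a": [0.0] * 20, "weak_b": [0.0] * 20}
sep = separability(toy)
```

In the toy case the strong model separates from both weak ones, and the two weak models cannot be told apart, so two of three pairs are separated.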

Pass@k volatility across the agent flow

A single end-to-end pass rate hides where an agent actually fails. An agentic task runs as a sequence of steps, and when each step's success rate is measured conditional on the steps before it, the probability of finishing is the product of the per-step success probabilities. One brittle step drags the whole trajectory down, and the end-to-end number gives no hint which step it was.

Measuring pass@k at each step recovers that information. Run k rollouts, grade the action at every step against its known-correct action, and read a pass@k for each step in the workflow. The profile across steps is volatile. Most steps sit near the ceiling, and a few collapse. The steps that collapse are where the failure mass concentrates, and they are the steps worth generating more data for, hardening with verifiable rewards, and tracking as a regression signal release over release.

[Figure: pass@k by step, on a 0 to 1.0 axis. Plan .96, Retrieve .88, Tool call .45, Validate .92, Write .60, Handoff .38, Report .90. Annotation: failure mass concentrates here.]
The same task, broken out by step. The end-to-end rate is the product of these; two steps own most of the loss.
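The arithmetic behind the figure is easy to reproduce. The sketch below treats the seven figures as per-step success rates that multiply, which is the assumption the caption makes, and attributes the loss to each step in log space so the shares add up to one. The variable names are illustrative.

```python
from math import log, prod

# Per-step pass rates as read off the figure.
STEP_PASS = {
    "Plan": 0.96,
    "Retrieve": 0.88,
    "Tool call": 0.45,
    "Validate": 0.92,
    "Write": 0.60,
    "Handoff": 0.38,
    "Report": 0.90,
}

# End-to-end success is the product of the per-step rates.
end_to_end = prod(STEP_PASS.values())

# In log space the product becomes a sum, so each step owns a share of the loss.
total_loss = -log(end_to_end)
loss_share = {step: -log(rate) / total_loss for step, rate in STEP_PASS.items()}
worst_two = sorted(loss_share, key=loss_share.get, reverse=True)[:2]

print(f"end-to-end ≈ {end_to_end:.3f}")  # prints: end-to-end ≈ 0.072
print(worst_two)                         # prints: ['Handoff', 'Tool call']
```

Five of the seven steps pass at 0.88 or better, yet the task completes about 7% of the time. Handoff and Tool call together account for roughly two thirds of the log loss, which is the sense in which two steps own most of it.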

This is why the factory fixes a single decision per point. A library of step-level decisions, each with a known-correct action and a measured pass@k, lets you build the per-step failure profile for any agent you run through it, locate the brittle steps, and aim the next round of data and training squarely at them.

06 Two recipes

Two recipes show how the middle stage changes while the spine holds.

[Figure: the two recipes side by side. Ground-truth final answer: Prompt, Answer, Program grader; verify the answer. Prompt, output, rubric: Prompt, Output, Rubric; verify the rubric (atomic · MECE · discriminating; inter-judge agreement holds).]
The verification machinery re-points onto whatever carries correctness for the type.

The first recipe is a ground-truth final answer. The point is a task prompt and a single verifiable answer, graded by program through exact match, numeric tolerance, or schema. The hard question for this type is whether the reference answer is correct, which is why the independent oracle does the heavy lifting and the final check confirms the grader accepts the right answer and turns away a wrong one.
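A program grader for this recipe, together with the final check on the grader itself, might look like the following. It is a minimal sketch: `grade` and `final_check` are hypothetical names, only exact match and numeric tolerance are covered (schema validation is left out), and the known-wrong answers are assumed to be supplied by the recipe.

```python
import math
from typing import Any, Iterable, Optional


def grade(expected: Any, got: Any, *, tol: Optional[float] = None) -> bool:
    """Exact match by default; absolute numeric tolerance when tol is given."""
    if tol is not None:
        try:
            return math.isclose(float(got), float(expected), rel_tol=0.0, abs_tol=tol)
        except (TypeError, ValueError):
            return False  # a non-numeric response cannot match a numeric answer
    return str(got).strip() == str(expected).strip()


def final_check(reference: Any, known_wrong: Iterable[Any], *, tol: Optional[float] = None) -> bool:
    """The grader must accept the reference and turn away every known-wrong answer."""
    accepts_right = grade(reference, reference, tol=tol)
    rejects_wrong = not any(grade(reference, w, tol=tol) for w in known_wrong)
    return accepts_right and rejects_wrong


ok = final_check("4", ["5", "four"])           # exact-match grader behaves
too_loose = final_check(1.0, [1.0005], tol=0.01)  # tolerance lets a wrong answer through
```

The second call returns `False` because the tolerance is wide enough to accept an answer the recipe considers wrong, which is exactly the kind of grader defect the final check exists to catch.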

The second recipe is a prompt with a captured output and a rubric of atomic criteria. Here the hard question moves onto the rubric itself: whether it is atomic, mutually exclusive and collectively exhaustive, discriminating, and free of contrivance. The verification machinery re-points from the answer onto the rubric.

The criterion model carries most of the weight in the rubric recipe. Each criterion is one row tagged on three axes.

| Axis | Values | What it controls |
| --- | --- | --- |
| type | instruction-following · outcome · process · grounding and safety | the backbone, chosen so the four cover distinct behaviors with minimal overlap |
| polarity | must · must-not | whether the behavior is required or forbidden |
| criticality | gate · additive | a gate failure caps the whole score; additive criteria sum |

Scoring stays binary per criterion, equal weight, with gate criteria able to veto the total. Reporting per-type subscores alongside the total gives the diagnostic breakdown that makes a benchmark useful to read. Two gates are specific to rubrics. An atomicity gate splits any compound or vague criterion until each is a single yes or no a judge can answer from the output alone. An inter-judge agreement gate has several judges grade sample outputs per criterion, and any criterion they cannot agree on gets cut or reworded, since a criterion judges disagree on is noise.[7]
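The scoring rule and the agreement gate can be sketched together. Several details here are assumptions the text does not pin down: a gate failure caps the total at `gate_cap` (0.0 by default), every criterion including gates counts equally in the total, and agreement is measured as the share of sample outputs on which all judges give the same verdict. The names `Criterion`, `score`, and `unanimous_rate` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Criterion:
    text: str
    type: str         # instruction-following | outcome | process | grounding and safety
    polarity: str     # "must" or "must-not"
    criticality: str  # "gate" or "additive"


def score(criteria: List[Criterion], observed: List[bool], gate_cap: float = 0.0) -> Dict:
    """observed[i] is True when a judge sees the described behavior in the output."""
    # A must-not criterion is met when the behavior is absent.
    met = [obs if c.polarity == "must" else not obs for c, obs in zip(criteria, observed)]
    buckets: Dict[str, List[bool]] = defaultdict(list)
    for c, m in zip(criteria, met):
        buckets[c.type].append(m)
    by_type = {t: sum(ms) / len(ms) for t, ms in buckets.items()}
    gate_failed = any(c.criticality == "gate" and not m for c, m in zip(criteria, met))
    total = sum(met) / len(met)  # binary per criterion, equal weight
    if gate_failed:
        total = min(total, gate_cap)  # a failed gate vetoes the total
    return {"total": total, "by_type": by_type, "gate_failed": gate_failed}


def unanimous_rate(votes: List[List[bool]]) -> float:
    """votes[i] holds every judge's verdict on sample output i for one criterion."""
    return sum(len(set(v)) == 1 for v in votes) / len(votes)


RUBRIC = [
    Criterion("Answers in one paragraph", "instruction-following", "must", "additive"),
    Criterion("States the correct total", "outcome", "must", "additive"),
    Criterion("Cites a source not in the context", "grounding and safety", "must-not", "gate"),
    Criterion("Shows the intermediate sum", "process", "must", "additive"),
]

clean = score(RUBRIC, [True, False, False, True])
vetoed = score(RUBRIC, [True, True, True, True])
```

The vetoed output meets every additive criterion and still scores zero, because it exhibits the one forbidden behavior tagged as a gate. The per-type subscores remain available for diagnosis either way.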

07 How these sets get used

The output of this factory is a labeled, difficulty-graded, deduplicated set with provenance on every point. That shape drops directly into the post-training and evaluation stack.

08 Receipts and the trace

A data point is worth the payload plus its receipts plus its verdict. The store keeps the full trace of how each point was made: every gate result, the difficulty probe with its per-model outcomes, and the repair history. That trace is what makes this a factory with an audit trail. Anyone can open a point and see exactly why it survived. A bare list of prompts and answers offers no such account.
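One possible shape for a stored point is sketched below. The class names, field names, and verdict values are illustrative assumptions; what the sketch keeps from the text is the three-part structure (payload, receipts, verdict) and the contents of the trace (gate results, difficulty probe with per-model outcomes, repair history).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Receipts:
    gate_results: List[Dict[str, Any]] = field(default_factory=list)
    difficulty: Dict[str, Any] = field(default_factory=dict)  # includes per-model outcomes
    repair_history: List[str] = field(default_factory=list)


@dataclass
class Point:
    payload: Dict[str, Any]
    receipts: Receipts = field(default_factory=Receipts)
    verdict: str = "pending"  # hypothetical states: pending, approved, edited, rejected

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    def why_it_survived(self) -> List[str]:
        """Human-readable account of each gate the point passed."""
        return [f"{g['gate']}: passed" for g in self.receipts.gate_results if g["passed"]]


point = Point(
    payload={"prompt": "2 + 3 = ?", "answer": 5},
    receipts=Receipts(
        gate_results=[{"gate": "construct", "passed": True}, {"gate": "oracle", "passed": True}],
        difficulty={"panel_solve_rate": 0.5, "per_model": {"small": False, "large": True}},
        repair_history=["trimmed whitespace in prompt"],
    ),
    verdict="approved",
)
restored = json.loads(point.to_json())
```

Serializing the whole envelope, and not only the payload, is what lets someone open a point later and reconstruct why it was kept.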

09 Human review and reflection

Gates filter, and humans judge. Every surviving candidate lands in a review queue where a person approves, edits, or rejects it. Those verdicts do more than clear the queue. They become signal that improves later generation.

This matters because of criteria drift.[7] You cannot fully author a rubric before you have seen real outputs. Grading reveals criteria you did not know you needed. A review queue feeding a reflection loop is the practical answer. As reviewers reject points and annotate why, a checklist of commonly missed criteria accumulates inside the recipe, and later points start more complete. The factory gets better as it runs.
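The accumulation step can be sketched as a count over reviewer annotations. Everything specific here is an assumption for illustration: the verdict record layout, the `missed` tags, the `min_count` threshold, and the idea of appending the checklist to the generation prompt.

```python
from collections import Counter
from typing import Dict, List


def build_checklist(verdicts: List[Dict], min_count: int = 2) -> List[str]:
    """Collect criteria that reviewers repeatedly flagged as missing on rejected points."""
    counts = Counter(
        tag
        for v in verdicts
        if v["decision"] == "reject"
        for tag in v.get("missed", [])
    )
    # most_common sorts by count, so the most frequent omissions come first.
    return [tag for tag, n in counts.most_common() if n >= min_count]


def with_checklist(base_prompt: str, checklist: List[str]) -> str:
    """Append the accumulated checklist to a generation prompt."""
    if not checklist:
        return base_prompt
    lines = "\n".join(f"- {item}" for item in checklist)
    return f"{base_prompt}\n\nBefore finishing, confirm the rubric covers:\n{lines}"


# Hypothetical review verdicts.
verdicts = [
    {"decision": "reject", "missed": ["units stated", "source cited"]},
    {"decision": "approve", "missed": []},
    {"decision": "reject", "missed": ["units stated"]},
    {"decision": "reject", "missed": ["units stated", "tone"]},
]

checklist = build_checklist(verdicts)
prompt = with_checklist("Write a rubric for the task.", checklist)
```

The threshold keeps one-off reviewer notes out of the recipe, so only omissions that recur shape later generation.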

10 Build your own

For a startup shipping an AI product, this is the point. Public benchmarks measure general capability on a generic distribution, and they saturate and leak over time. Your product lives on a specific distribution: your domain, your tools, your workflows, your failure modes. A small, difficulty-gated, receipt-bearing set drawn from that distribution is what lets you calibrate and align your own agents, catch regressions before users do, and mint verifiable rewards for your own post-training. Owning your evaluation distribution is a durable advantage that compounds as your product evolves.

To support a new evaluation format, implement the recipe interface: a payload schema, a generation step, an ordered set of gates, a difficulty probe, a final quality check, and a render for review. Keep the spine. Swap the gates that are specific to your format.

A few principles travel with the method across every data type:


The contribution here is the assembly: difficulty-gated, receipt-bearing, human-reviewed evaluation data, with a shape that holds across data types and a clear seam for adapting it to a new one.

Sources

  1. Survey on quality, diversity, and complexity in synthetic data. arXiv:2412.02980.
  2. Bai et al. Constitutional AI: Harmlessness from AI Feedback. Anthropic, 2022. arXiv:2212.08073.
  3. Chen et al. Evaluating Large Language Models Trained on Code. 2021 (origin of pass@k). arXiv:2107.03374.
  4. Reinforcement learning with verifiable rewards; NVIDIA Nemotron-4 340B Reward. huggingface.co/nvidia/Nemotron-4-340B-Reward.
  5. Wang et al. Self-Instruct, 2022; Evol-Instruct / WizardLM; CoT-Self-Instruct, 2025. arXiv:2507.23751; survey.
  6. Li et al. Arena-Hard-Auto / BenchBuilder: separability with confidence. 2024. arXiv:2406.11939.
  7. Shankar et al. Who Validates the Validators? (EvalGen): criteria drift and mixed-initiative rubric design. 2024. arXiv:2404.12272.

Further reading: Lambert, synthetic data and RLAIF overview; LLM synthetic-data reading list.