
Training data that reflects production reality, not annotation convenience.

Annotation quality is model quality. Vague guidelines produce inconsistent labels — and inconsistent labels produce models that generalize poorly precisely where it matters most. We design annotation schemas with measurable inter-annotator agreement targets before any labeling begins, and instrument IAA throughout the project so quality degradation is visible before it reaches training.

AI Training & Data Annotation
The Challenge

The annotation process is where model quality is determined — not at training time. A model trained on ambiguously annotated data learns the annotator's inconsistencies, not the underlying task. A training distribution that over-represents common cases and under-represents edge cases produces a model that looks strong on benchmarks and weak on the production long tail.

Inter-annotator agreement (IAA) — Cohen's kappa or Krippendorff's alpha — is the measurement that validates annotation quality. Low IAA is a leading indicator of poor model performance. It means different annotators answered the same question differently, and the training labels contain noise rather than signal. Most annotation projects measure IAA late — after the full dataset is annotated — when the cost of finding systematic disagreement is highest. We measure from the first validation batch.
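Measured concretely, the early gate is simple. The sketch below uses toy labels and a hand-rolled two-annotator Cohen's kappa for illustration; a real project would use a library implementation and, for more than two annotators or missing labels, Krippendorff's alpha.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same samples."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a if c in freq_b) / (n * n)
    if expected == 1.0:  # degenerate case: both annotators used one identical label
        return 1.0
    return (observed - expected) / (1 - expected)

# A toy first validation batch: two annotators, the same 8 samples.
ann_a = ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]
ann_b = ["spam", "ham", "spam", "spam", "spam", "ham", "spam", "ham"]
kappa = cohens_kappa(ann_a, ann_b)
print(f"kappa = {kappa:.2f}")  # below the project threshold -> revise guidelines
```

Running this on the first 50-100 sample batch, rather than the finished dataset, is what makes guideline ambiguity cheap to fix.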

What makes training data fail in production
  • Annotation guidelines ambiguous on edge cases — different annotators label them differently
  • Training distribution that does not represent the production input distribution
  • Missing adversarial examples — model learns easy surface patterns, not robust signal
  • No long-tail coverage — model fails on rare but important cases
  • RLHF preference data collected without clear quality criteria produces noisy reward signals
  • Label noise from unresolved annotator disagreement
Our Approach

We design annotation processes as software engineering problems: guidelines are versioned, IAA is measured continuously, and edge cases are explicitly catalogued and covered. Annotation guidelines go through a small-batch validation round before full-scale annotation — a 50-100 sample batch with IAA measurement catches guideline ambiguity before it propagates through the full dataset, where it is far costlier to fix.

Training dataset composition is designed against the production distribution, not the available data distribution. If production data contains a long tail of rare cases that available data under-represents, we design targeted collection for those cases before training starts. For LLM fine-tuning with RLHF, we design preference comparison workflows where annotators compare model output pairs on defined quality dimensions — the criteria that the reward model will learn from.

Training data development process

01
Task definition and guideline validation

Define the annotation task precisely. Guidelines go through a 50-100 sample validation batch with IAA measurement before full-scale annotation. Low IAA on the validation batch means guideline revision, not full-scale annotation with ambiguous instructions.

02
Production distribution analysis

Analyze what the model will encounter in production. Identify under-represented input types, rare but important cases, and adversarial examples relevant to the task. Design targeted data collection for coverage gaps.

03
Annotation workflow setup

Configure Label Studio or Prodigy for the task type. Set up IAA measurement. Establish annotator calibration — a shared batch where annotators discuss disagreements before independent annotation begins.

04
Annotation with continuous quality control

Continuous IAA monitoring throughout annotation. Disagreements resolved through adjudication by a senior annotator or consensus. Samples below IAA threshold re-annotated. Guidelines updated when systematic disagreements reveal ambiguity.
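The routing logic behind this step can be sketched in a few lines. The batch shape below (sample id mapped to per-annotator labels) is illustrative, not a platform export format:

```python
def route_for_adjudication(batch):
    """Split a doubly-annotated batch into agreed samples and samples that
    need senior-annotator adjudication. `batch` maps sample_id to a dict
    of {annotator: label}."""
    agreed, disputed = {}, {}
    for sample_id, labels in batch.items():
        distinct = set(labels.values())
        if len(distinct) == 1:
            agreed[sample_id] = distinct.pop()
        else:
            disputed[sample_id] = labels  # goes to the adjudication queue
    return agreed, disputed

batch = {
    "s1": {"ann_a": "positive", "ann_b": "positive"},
    "s2": {"ann_a": "positive", "ann_b": "negative"},  # disagreement
}
agreed, disputed = route_for_adjudication(batch)
```

Recurring entries in the disputed queue for the same input type are the signal that the guidelines, not the annotators, need revision.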

05
Dataset validation before training

Validate class balance, edge case coverage, adversarial example inclusion, and final IAA metrics. RLHF preference data: validate criteria consistency and pair quality before reward model training.

What Is Included
  1. IAA-validated annotation guidelines

    Guidelines are tested on a small validation batch before full-scale annotation begins. Inter-annotator agreement is measured throughout using Cohen's kappa or Krippendorff's alpha depending on task type — not only at the end. Low-agreement samples are adjudicated and guidelines updated to eliminate the ambiguity that caused them.

  2. Production distribution coverage analysis

    We analyze the actual distribution of production inputs and design data collection to match it — including edge cases and rare inputs that benchmark datasets systematically under-represent. A model trained on a skewed distribution passes evals and fails in the wild; coverage analysis is how you avoid that.

  3. Adversarial example inclusion

    For classification and detection tasks, we include near-boundary examples in the training set: inputs with similar surface features but different labels, known misclassification patterns from prior model versions, and synthetic hard negatives. This directly reduces the long-tail failure modes that evaluation sets miss.
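One way to surface near-boundary candidates is to filter a sample pool by a prior model version's scores. This is a minimal sketch under assumed names: `score` is any callable returning P(positive), and the band is a tunable choice, not a fixed recipe:

```python
def mine_near_boundary(samples, score, band=(0.35, 0.65)):
    """Keep samples whose prior-model positive-class score falls in a band
    around the decision boundary — candidates for targeted annotation."""
    lo, hi = band
    return [s for s in samples if lo <= score(s) <= hi]

# Toy scores standing in for a prior model version.
scores = {"easy_pos": 0.97, "easy_neg": 0.03, "hard_1": 0.55, "hard_2": 0.41}
hard = mine_near_boundary(list(scores), scores.get)
```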

  4. RLHF preference data workflows

    For LLM fine-tuning, we design pairwise preference comparison workflows where annotators rate output pairs on defined quality dimensions — helpfulness, factuality, safety, or domain-specific criteria. Each dimension gets explicit criteria with examples; vague criteria produce noisy reward signals that degrade reward model quality.

  5. Active learning with selection review

    Active learning selects unlabeled samples where the current model is most uncertain, prioritizing annotation effort where it moves the model most. We add a selection review step to catch distribution-edge artifacts — cases where model uncertainty is high for reasons unrelated to the actual task, which active learning would otherwise over-sample.
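A common selection criterion is predictive entropy over the model's class probabilities. The sketch below uses a toy probability table in place of a real model; in practice the returned queue goes through the human selection-review pass before any annotation:

```python
import math

def predictive_entropy(probs):
    """Entropy of a predicted class distribution — higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(pool, predict_proba, k):
    """Rank the unlabeled pool by model uncertainty and return the top k
    as the candidate annotation queue."""
    return sorted(pool, key=lambda x: predictive_entropy(predict_proba(x)),
                  reverse=True)[:k]

# Toy predictions standing in for a real model.
proba = {"a": [0.98, 0.02], "b": [0.55, 0.45], "c": [0.70, 0.30]}
queue = select_for_annotation(list(proba), proba.get, k=2)
```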

Deliverables
  • Annotation guidelines with edge cases, examples, and IAA results
  • Label Studio or Prodigy setup with IAA measurement configured
  • Annotated dataset with full adjudication log and IAA report
  • Production distribution analysis and coverage gap assessment
  • Active learning pipeline with selection review process
  • RLHF preference workflow with per-dimension quality criteria
Projected Impact

Rigorous annotation schema design reduces post-deployment model failures by catching distribution mismatches, edge case underrepresentation, and label ambiguity early. The cost of fixing an annotation schema before labeling starts is roughly 1% of the cost of retraining a model on labels discovered to be inconsistent after the fact.

FAQ

Frequently asked questions

How do we know when we have enough training data?

Learning curves tell you. Train models on increasing subsets of your data and plot performance vs. dataset size. When the curve flattens — additional data produces diminishing improvement — you have likely reached data sufficiency for the current model architecture. If performance has not reached your target by that point, the problem is more likely model architecture, task framing, or annotation quality than data volume.
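The procedure can be sketched generically. Here `train_and_eval` is an assumed callable mapping a training subset to a validation score, and the toy scorer only simulates diminishing returns:

```python
import math

def learning_curve(train_and_eval, dataset, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Score a model on growing subsets of the data; returns (n, score) points."""
    points = []
    for f in fractions:
        n = max(1, int(len(dataset) * f))
        points.append((n, train_and_eval(dataset[:n])))
    return points

def plateaued(curve, min_gain=0.005):
    """True when the last increment of data bought less than `min_gain` score."""
    return curve[-1][1] - curve[-2][1] < min_gain

# Toy scorer with log-like diminishing returns in dataset size.
fake_score = lambda subset: 0.6 + 0.05 * math.log(len(subset) + 1)
curve = learning_curve(fake_score, list(range(1000)))
```

If `plateaued` fires before the score reaches target, more data of the same kind is unlikely to close the gap.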

Should we use synthetic data generation?

Synthetic data is most valuable for augmenting rare cases in the training distribution — not replacing real data. LLM-generated synthetic examples for text tasks, augmented images for vision tasks. The risk is distributional mismatch: synthetic data that does not match real production inputs adds noise, not signal. Validate synthetic data quality against held-out real samples before including it in training.
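One cheap first check for distributional mismatch (a proxy, not a full validation) is to compare a simple discrete feature — binned text length, say — between synthetic samples and held-out real ones:

```python
from collections import Counter

def total_variation(real_feats, synth_feats):
    """Total variation distance between two empirical distributions of a
    discrete feature. 0 = identical, 1 = disjoint."""
    h_real, h_synth = Counter(real_feats), Counter(synth_feats)
    n_real, n_synth = len(real_feats), len(synth_feats)
    keys = set(h_real) | set(h_synth)
    return 0.5 * sum(abs(h_real[k] / n_real - h_synth[k] / n_synth)
                     for k in keys)

# Toy binned-length feature for held-out real vs. synthetic samples.
real = ["short", "short", "long", "long"]
synth = ["short", "short", "short", "long"]
tv = total_variation(real, synth)
```

A large distance on even a crude feature like this flags synthetic data that would add noise rather than signal.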

What annotation platforms do you work with?

Label Studio (open-source, flexible, self-hostable), Prodigy (tight spaCy integration, active learning support), Scale AI (managed annotation at scale), and Labelbox (enterprise annotation management). Platform selection is driven by task type, budget, data sensitivity requirements, and whether in-house or outsourced annotation is appropriate.

How do you design RLHF preference data collection?

RLHF preference quality depends on the clarity of comparison criteria. We define explicit quality dimensions for each pairwise comparison: accuracy, helpfulness, safety, tone — whatever matters for your use case. Annotators rate which output is better on each dimension, not a single overall preference. Clear criteria reduce annotator variance and produce more useful reward signals.
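Per-dimension judgments also aggregate per dimension. A minimal sketch, with illustrative dimension names and a simple majority vote (real workflows may weight annotators or require adjudication on ties):

```python
from collections import Counter

def dimension_winner(judgments, dimension):
    """Majority vote across annotators for one quality dimension.
    Each judgment maps dimension -> "a", "b", or "tie". Returns the
    winning side, or "tie" when there is no majority."""
    votes = Counter(j[dimension] for j in judgments if dimension in j)
    if not votes:
        return "tie"
    (top, top_n), *rest = votes.most_common()
    if rest and rest[0][1] == top_n:
        return "tie"
    return top

judgments = [
    {"helpfulness": "a", "factuality": "a", "safety": "tie"},
    {"helpfulness": "a", "factuality": "b", "safety": "tie"},
    {"helpfulness": "b", "factuality": "b", "safety": "a"},
]
```

Keeping dimensions separate, rather than collapsing to one overall preference, is what lets you see that output A wins on helpfulness while losing on factuality.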

Ready to get started?

Tell us what you are building. We will scope it, price it honestly, and give you a clear plan.

Start a Conversation

Free 30-min scoping call