
Training Data Curator

Verified by Community

Guides training data preparation including collection strategies, annotation workflows, quality assurance, data augmentation, class balancing, bias mitigation, versioning, and documentation for building reliable ML datasets at any scale.

Tags: training-data, annotation, data-quality, datasets, curation


Guides the end-to-end process of building high-quality training datasets for machine learning. Covers data collection strategies, annotation tool selection, labeling guideline creation, annotator management, quality assurance with inter-rater agreement, data augmentation techniques, class imbalance handling, bias identification and mitigation, dataset versioning, and comprehensive documentation with datasheets.

Usage

Describe your ML task, the type of data needed (text, images, audio, tabular), target dataset size, and available resources (budget for annotation, existing raw data). Specify any domain constraints or quality requirements. This skill provides a complete data curation plan with tooling recommendations, annotation guidelines, and quality benchmarks.

Examples

  • "Create annotation guidelines and QA process for labeling 50K customer support tickets into 20 intent categories"
  • "Design a data augmentation pipeline for a medical image dataset with only 500 labeled examples per condition"
  • "Build a data collection strategy for training a named entity recognition model on financial news articles"

Guidelines

  • Write detailed annotation guidelines with edge cases and examples before starting any labeling work
  • Use at least 2-3 annotators per example and measure inter-annotator agreement (Cohen's kappa for annotator pairs, Fleiss' kappa for three or more; target > 0.8)
  • Start with a pilot batch of 100-200 examples to validate guidelines before scaling annotation
  • Implement tiered quality review: automated checks, sample audits, and full review of disagreements
  • Apply data augmentation (synonym replacement, back-translation, mixup) to expand small datasets
  • Address class imbalance through stratified sampling, oversampling minority classes, or class-weighted loss functions
  • Document dataset composition, collection methodology, and known limitations in a datasheet
  • Version datasets with content hashes and changelogs so training runs are fully reproducible
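As a minimal sketch of the agreement check in the guidelines above, Cohen's kappa for two annotators can be computed with the standard library alone (the label values here are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Example: two annotators on six support tickets
a = ["billing", "billing", "refund", "refund", "other", "billing"]
b = ["billing", "refund", "refund", "refund", "other", "billing"]
kappa = cohens_kappa(a, b)  # well below the 0.8 bar, so guidelines need work
```

Values below the 0.8 threshold are a signal to revise the annotation guidelines (or retrain annotators) before scaling, not to average away the disagreement.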
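The first tier of the review pipeline (automated checks) is cheap to run on every batch. A sketch, assuming examples are dicts with hypothetical `text` and `label` fields:

```python
def automated_checks(examples, allowed_labels):
    """First review tier: structural checks run before any human audit.

    Returns (index, problem) pairs for downstream triage.
    """
    errors = []
    seen_texts = {}
    for i, ex in enumerate(examples):
        text, label = ex.get("text", ""), ex.get("label")
        if not text or not text.strip():
            errors.append((i, "empty text"))
        if label not in allowed_labels:
            errors.append((i, f"unknown label: {label!r}"))
        # Exact-duplicate detection; near-duplicate detection would need
        # fuzzy matching on top of this.
        if text in seen_texts:
            errors.append((i, f"duplicate of example {seen_texts[text]}"))
        else:
            seen_texts[text] = i
    return errors
```

Examples that pass these checks then flow to sample audits, with full review reserved for items where annotators disagreed.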
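Of the augmentation techniques listed, synonym replacement is the simplest to sketch. The synonym table below is a hypothetical hand-built stand-in; a real pipeline would draw on a thesaurus resource such as WordNet:

```python
import random

# Illustrative only -- replace with a real thesaurus lookup.
SYNONYMS = {
    "refund": ["reimbursement"],
    "broken": ["faulty", "defective"],
    "help": ["assistance"],
}

def synonym_replace(text, p=0.3, rng=None):
    """Replace each word that has a known synonym with probability p.

    A seeded RNG keeps augmentation reproducible across dataset versions.
    """
    rng = rng or random.Random(0)
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        out.append(rng.choice(options) if options and rng.random() < p else word)
    return " ".join(out)
```

Augmented copies should be tagged as derived so they never leak into evaluation splits alongside their source examples.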
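Random oversampling, one of the imbalance remedies above, can be sketched in a few lines (assuming each example is a dict with a `label` key):

```python
import random
from collections import defaultdict

def oversample(examples, rng=None):
    """Duplicate minority-class examples at random until every class
    matches the largest class's count."""
    rng = rng or random.Random(42)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Sample with replacement to fill the gap to the majority class.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced
```

Oversampling belongs after the train/test split: duplicating before splitting leaks copies of training examples into evaluation data.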
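For the versioning guideline, one way to fingerprint a dataset is an order-independent content hash over canonically serialized records (a sketch assuming JSON-serializable examples):

```python
import hashlib
import json

def dataset_fingerprint(examples):
    """Order-independent content hash for a dataset version.

    Each record is serialized canonically (sorted keys) and hashed;
    the sorted record hashes are then hashed together, so reordering
    records does not change the fingerprint but editing any record does.
    """
    record_hashes = sorted(
        hashlib.sha256(
            json.dumps(ex, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        for ex in examples
    )
    return hashlib.sha256("".join(record_hashes).encode("ascii")).hexdigest()
```

Recording this fingerprint in the changelog alongside each training run ties model artifacts to the exact dataset version they were trained on.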