
Training Data Curator

Verified by Community

Guides training data preparation including collection strategies, annotation workflows, quality assurance, data augmentation, class balancing, bias mitigation, versioning, and documentation for building reliable ML datasets at any scale.

Tags: training-data, annotation, data-quality, datasets, curation


Guides the end-to-end process of building high-quality training datasets for machine learning. Covers data collection strategies, annotation tool selection, labeling guideline creation, annotator management, quality assurance with inter-rater agreement, data augmentation techniques, class imbalance handling, bias identification and mitigation, dataset versioning, and comprehensive documentation with datasheets.

Usage

Describe your ML task, the type of data needed (text, images, audio, tabular), target dataset size, and available resources (budget for annotation, existing raw data). Specify any domain constraints or quality requirements. This skill provides a complete data curation plan with tooling recommendations, annotation guidelines, and quality benchmarks.

Examples

  • "Create annotation guidelines and QA process for labeling 50K customer support tickets into 20 intent categories"
  • "Design a data augmentation pipeline for a medical image dataset with only 500 labeled examples per condition"
  • "Build a data collection strategy for training a named entity recognition model on financial news articles"

Guidelines

  • Write detailed annotation guidelines with edge cases and examples before starting any labeling work
  • Use at least 2-3 annotators per example and measure inter-annotator agreement (Cohen's kappa for annotator pairs, Fleiss' kappa for three or more; target > 0.8)
  • Start with a pilot batch of 100-200 examples to validate guidelines before scaling annotation
  • Implement tiered quality review: automated checks, sample audits, and full review of disagreements
  • Apply data augmentation (synonym replacement, back-translation, mixup) to expand small datasets
  • Address class imbalance through stratified sampling, oversampling minority classes, or class-weighted loss functions
  • Document dataset composition, collection methodology, and known limitations in a datasheet
  • Version datasets with content hashes and changelogs so training runs are fully reproducible
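As a minimal sketch of the agreement check in the guidelines above, Cohen's kappa for two annotators can be computed with the standard library alone (the label values here are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Example: two annotators on six support tickets
a = ["billing", "billing", "refund", "refund", "other", "billing"]
b = ["billing", "refund", "refund", "refund", "other", "billing"]
kappa = cohens_kappa(a, b)  # well below the 0.8 bar, so guidelines need work
```

Values below the 0.8 threshold are a signal to revise the annotation guidelines (or retrain annotators) before scaling, not to average away the disagreement.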
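The first tier of the review pipeline (automated checks) is cheap to run on every batch. A sketch, assuming examples are dicts with hypothetical `text` and `label` fields:

```python
def automated_checks(examples, allowed_labels):
    """First review tier: structural checks run before any human audit.

    Returns (index, problem) pairs for downstream triage.
    """
    errors = []
    seen_texts = {}
    for i, ex in enumerate(examples):
        text, label = ex.get("text", ""), ex.get("label")
        if not text or not text.strip():
            errors.append((i, "empty text"))
        if label not in allowed_labels:
            errors.append((i, f"unknown label: {label!r}"))
        # Exact-duplicate detection; near-duplicate detection would need
        # fuzzy matching on top of this.
        if text in seen_texts:
            errors.append((i, f"duplicate of example {seen_texts[text]}"))
        else:
            seen_texts[text] = i
    return errors
```

Examples that pass these checks then flow to sample audits, with full review reserved for items where annotators disagreed.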
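Of the augmentation techniques listed, synonym replacement is the simplest to sketch. The synonym table below is a hypothetical hand-built stand-in; a real pipeline would draw on a thesaurus resource such as WordNet:

```python
import random

# Illustrative only -- replace with a real thesaurus lookup.
SYNONYMS = {
    "refund": ["reimbursement"],
    "broken": ["faulty", "defective"],
    "help": ["assistance"],
}

def synonym_replace(text, p=0.3, rng=None):
    """Replace each word that has a known synonym with probability p.

    A seeded RNG keeps augmentation reproducible across dataset versions.
    """
    rng = rng or random.Random(0)
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        out.append(rng.choice(options) if options and rng.random() < p else word)
    return " ".join(out)
```

Augmented copies should be tagged as derived so they never leak into evaluation splits alongside their source examples.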
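Random oversampling, one of the imbalance remedies above, can be sketched in a few lines (assuming each example is a dict with a `label` key):

```python
import random
from collections import defaultdict

def oversample(examples, rng=None):
    """Duplicate minority-class examples at random until every class
    matches the largest class's count."""
    rng = rng or random.Random(42)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Sample with replacement to fill the gap to the majority class.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced
```

Oversampling belongs after the train/test split: duplicating before splitting leaks copies of training examples into evaluation data.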
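For the versioning guideline, one way to fingerprint a dataset is an order-independent content hash over canonically serialized records (a sketch assuming JSON-serializable examples):

```python
import hashlib
import json

def dataset_fingerprint(examples):
    """Order-independent content hash for a dataset version.

    Each record is serialized canonically (sorted keys) and hashed;
    the sorted record hashes are then hashed together, so reordering
    records does not change the fingerprint but editing any record does.
    """
    record_hashes = sorted(
        hashlib.sha256(
            json.dumps(ex, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        for ex in examples
    )
    return hashlib.sha256("".join(record_hashes).encode("ascii")).hexdigest()
```

Recording this fingerprint in the changelog alongside each training run ties model artifacts to the exact dataset version they were trained on.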