CSV Data Cleaner

Cleans messy CSV files to produce standardized, analysis-ready datasets using command-line tools, Python pandas, or SQL. Covers encoding detection and conversion, delimiter and quoting problems, header normalization, missing value imputation, duplicate detection and removal, data type validation, format standardization (dates, phones, addresses), and column transformations.

Usage

Describe the CSV file issues you are encountering: encoding problems, inconsistent formats, missing values, duplicates, or specific transformation needs. Specify your preferred tool (pandas, awk, csvkit, SQL) and the desired output format. The skill provides step-by-step cleaning commands or scripts with explanations.

Examples

"Clean a CSV with mixed date formats (MM/DD/YYYY, YYYY-MM-DD, DD-Mon-YY) into ISO 8601 format"
"Deduplicate a customer CSV by email, keeping the most recent record based on updated_at column"
"Fix a CSV with encoding issues (Latin-1 exported as UTF-8) and embedded newlines in quoted fields"
"Standardize phone numbers in a CSV from various formats (555-1234, (555) 123-4567, +1 5551234567) to E.164"
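As a rough illustration of what the date and phone requests above might produce, here is a minimal standard-library sketch. The helper names to_iso_date and to_e164_us are hypothetical, the date formats are the three listed in the first example, and the phone helper assumes US/NANP numbers (country code +1). A number without an area code, such as 555-1234, cannot be converted to E.164, so the sketch returns None to flag it for review.

```python
import re
from datetime import datetime

# Formats from the first example prompt: YYYY-MM-DD, MM/DD/YYYY, DD-Mon-YY.
# Note: %b (month abbreviation) depends on the locale; English is assumed here.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%y")


def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


def to_e164_us(raw):
    """Return a +1XXXXXXXXXX string, or None if the number is incomplete.

    Assumption: every number is a US/NANP number.
    """
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 10:
        return "+1" + digits
    if len(digits) == 11 and digits.startswith("1"):
        return "+" + digits
    return None  # e.g. a 7-digit number with no area code


if __name__ == "__main__":
    for value in ("03/05/2021", "2021-03-05", "05-Mar-21", "not a date"):
        print(value, "->", to_iso_date(value))
    for value in ("555-1234", "(555) 123-4567", "+1 5551234567"):
        print(value, "->", to_e164_us(value))
```

Returning None instead of guessing keeps ambiguous values visible, so they can be reviewed rather than silently rewritten.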

Guidelines

Detect the file encoding before processing, using chardet or the file command (common cases: UTF-8, Latin-1, Windows-1252)
Use csvkit (csvclean, csvsql, csvstat) for quick command-line CSV validation and statistics
Handle missing values intentionally: decide between drop, fill with default, interpolate, or flag for review
Normalize column names: lowercase, replace spaces with underscores, remove special characters
Validate data types column by column: parse dates, convert numeric strings, identify categorical values
Pass explicit dtype specifications to pandas when reading to prevent silent type coercion (for example, an integer column containing NaN is read as float)
Process large CSVs in chunks (the pandas chunksize parameter) to avoid memory issues on files over 1 GB
Always preserve the original file and write cleaned output to a new file for reproducibility and auditing
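Several of these guidelines can be combined in one pandas pass. The sketch below is illustrative only: the sample data, the column names, and the helpers normalize_column and clean_in_chunks are hypothetical. It uses the nullable "Int64" dtype so an ID column with a missing value is not coerced to float, and it streams chunks to a separate destination so the source is left untouched.

```python
import io
import re

import pandas as pd

# Hypothetical sample data: one missing ID and one missing total.
RAW_CSV = """Customer ID,Full Name,Order Total ($)
1,Ada Lovelace,10.50
,Grace Hopper,
3,Alan Turing,7
"""

# Explicit dtypes keyed by the original header names.
# "Int64" (capital I) is the pandas nullable integer type.
DTYPES = {"Customer ID": "Int64", "Order Total ($)": "float64"}


def normalize_column(name):
    """Lowercase, replace whitespace with underscores, drop special characters."""
    name = name.strip().lower()
    name = re.sub(r"\s+", "_", name)
    name = re.sub(r"[^0-9a-z_]", "", name)
    return name.strip("_")


def clean_in_chunks(source, destination, chunksize=2):
    """Stream source to destination chunk by chunk; return the row count."""
    rows = 0
    reader = pd.read_csv(source, dtype=DTYPES, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        chunk = chunk.rename(columns=normalize_column)
        # Write the header only once, with the first chunk.
        chunk.to_csv(destination, index=False, header=(i == 0))
        rows += len(chunk)
    return rows


if __name__ == "__main__":
    cleaned = io.StringIO()  # stands in for a new output file
    count = clean_in_chunks(io.StringIO(RAW_CSV), cleaned)
    print(count, "rows written")
    print(cleaned.getvalue())
```

With a real file, source and destination would be two different paths, and chunksize would be far larger (tens or hundreds of thousands of rows) depending on available memory.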
