CSV Data Cleaner
Processes and cleans messy CSV files to produce standardized, analysis-ready datasets. Handles encoding detection and conversion, delimiter issues, quoting problems, header normalization, missing-value imputation, duplicate detection and removal, data type validation, format standardization (dates, phones, addresses), and column transformations. Works with command-line tools, Python pandas, or SQL.
Usage
Describe the CSV file issues you are encountering: encoding problems, inconsistent formats, missing values, duplicates, or specific transformation needs. Specify your preferred tool (pandas, awk, csvkit, SQL) and the desired output format. The skill provides step-by-step cleaning commands or scripts with explanations.
Examples
- "Clean a CSV with mixed date formats (MM/DD/YYYY, YYYY-MM-DD, DD-Mon-YY) into ISO 8601 format"
- "Deduplicate a customer CSV by email, keeping the most recent record based on updated_at column"
- "Fix a CSV with encoding issues (Latin-1 exported as UTF-8) and embedded newlines in quoted fields"
- "Standardize phone numbers in a CSV from various formats (555-1234, (555) 123-4567, +1 5551234567) to E.164"
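As an illustration of the kind of script these requests lead to, here is a minimal sketch in standard-library Python (not pandas) covering the date, deduplication, and phone examples. The helper names to_iso_date, dedupe_latest, and to_e164 are hypothetical, and the sketch rests on stated assumptions: ambiguous dates are read month-first, timestamps are already ISO 8601 strings, and phone numbers belong to the North American plan with country code 1.

```python
import re
from datetime import datetime

# Assumed input formats, tried in order. An ambiguous value such as
# 03/04/2021 is read as MM/DD/YYYY because that format comes first.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%y")


def to_iso_date(value):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def dedupe_latest(rows, key="email", timestamp="updated_at"):
    """Keep one row per key (case-insensitive), preferring the latest timestamp.

    Assumes the timestamp column holds ISO 8601 strings in one consistent
    format, so comparing them as text matches chronological order.
    """
    latest = {}
    for row in rows:
        k = row[key].strip().lower()
        if k not in latest or row[timestamp] > latest[k][timestamp]:
            latest[k] = row
    return list(latest.values())


def to_e164(raw, default_country="1"):
    """Format a 10- or 11-digit number as E.164; return None otherwise.

    Assumption: numbers without a country code belong to default_country.
    A 7-digit value such as 555-1234 has no area code and is left for review.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return "+" + default_country + digits
    if len(digits) == 11 and digits.startswith(default_country):
        return "+" + digits
    return None
```

Returning None instead of guessing keeps unparseable values visible, so they can be flagged for review rather than silently rewritten.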
Guidelines
- Detect the file encoding before processing, using chardet or the file command (common candidates: UTF-8, Latin-1, Windows-1252); detection is heuristic, so spot-check the decoded text
- Use csvkit (csvclean, csvsql, csvstat) for quick command-line CSV validation and statistics
- Handle missing values intentionally: decide whether to drop them, fill with a default, interpolate, or flag for review
- Normalize column names: lowercase, replace spaces with underscores, remove special characters
- Validate data types column by column: parse dates, convert numeric strings, identify categorical values
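- Pass dtype specifications to pandas read_csv to prevent silent type coercion (by default, an integer column containing NaN is read as float; the nullable Int64 dtype avoids this)
- Process large CSVs in chunks (pandas chunksize parameter) to avoid memory issues on files that do not fit comfortably in memory, roughly 1 GB and up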
- Always preserve the original file and write cleaned output to a new file for reproducibility and auditing
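Several of the guidelines above can be sketched together in standard-library Python; this is an illustrative outline under stated assumptions, not a replacement for chardet, csvkit, or pandas. The names decode_bytes, normalize_header, clean_stream, and MISSING_TOKENS are hypothetical. The encoding fallback simply tries candidates in order (it is not real detection), the missing-value tokens are an assumed list, and non-ASCII characters in headers are dropped.

```python
import csv
import re

# Assumed set of strings treated as missing; adjust for your data.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "nan"}


def decode_bytes(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode with the first candidate that succeeds; return (text, encoding).

    This is a fallback chain, not detection: latin-1 accepts any byte
    sequence, so the result should still be spot-checked.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")


def normalize_header(name):
    """Lowercase, turn whitespace and hyphens into underscores, drop other symbols."""
    name = name.strip().lower()
    name = re.sub(r"[\s\-]+", "_", name)
    name = re.sub(r"[^a-z0-9_]", "", name)
    return name.strip("_")


def clean_stream(src, dst):
    """Stream rows from src to dst, one at a time, so memory use stays flat.

    Headers are normalized, cells are stripped, and missing tokens become
    empty strings. Returns the count of missing cells per column for review.
    Assumes normalized headers are unique and rows are not longer than the
    header (extra cells would be dropped by zip).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    headers = [normalize_header(h) for h in next(reader)]
    writer.writerow(headers)
    missing = dict.fromkeys(headers, 0)
    for row in reader:
        cleaned = []
        for col, cell in zip(headers, row):
            value = cell.strip()
            if value.lower() in MISSING_TOKENS:
                missing[col] += 1
                value = ""
            cleaned.append(value)
        writer.writerow(cleaned)
    return missing
```

When calling clean_stream on real files, open both with newline="" and an explicit encoding, and point dst at a new path so the original file stays untouched.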