Data Quality Audit Guide
Systematically assess and improve data quality across your organization.
Usage
- Define data quality dimensions: completeness, accuracy, consistency, timeliness, validity, uniqueness
- Profile your data to establish baseline quality metrics
- Identify root causes of quality issues (source systems, ETL, manual entry, schema drift)
- Prioritize fixes by business impact — not all data quality issues matter equally
- Implement automated monitoring to catch quality degradation early
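The profiling step above can be sketched in a few lines of stdlib Python. The record layout, field names, and sample data below are illustrative assumptions, not part of any specific system:

```python
# Minimal data profiler: baseline completeness and uniqueness metrics
# for a list of record dicts (a stand-in for a real table).

def profile(rows, columns):
    """Return per-column null percentage and overall duplicate percentage."""
    total = len(rows)
    null_pct = {}
    for col in columns:
        nulls = sum(1 for r in rows if r.get(col) in (None, ""))
        null_pct[col] = round(100.0 * nulls / total, 2)
    seen, dups = set(), 0
    for r in rows:
        key = tuple(r.get(c) for c in columns)
        if key in seen:
            dups += 1
        seen.add(key)
    return {"null_pct": null_pct, "dup_pct": round(100.0 * dups / total, 2)}

# Hypothetical sample: two of four rows lack an email, one row is a duplicate.
rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": ""},
    {"id": 1, "email": "a@x.com"},
]
baseline = profile(rows, ["id", "email"])
```

Run this once to establish the baseline, then re-run on a schedule and alert when a metric drifts.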
Examples
- Completeness check: for each column, count nulls and compute the null percentage: SELECT COUNT(*) AS total, SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) AS nulls, ROUND(100.0 * SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) / COUNT(*), 2) AS null_pct FROM table. Flag columns with >5% nulls, then investigate: are the nulls valid (an optional field) or data loss (a required field not being captured)?
- Consistency audit: the same customer has different addresses in CRM vs billing vs shipping; the same product has different names in the sales and inventory systems. Count entity mismatches across systems. The root cause is usually some combination of no master data management, multiple entry points, and no validation rules
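A consistency audit like this can be sketched as a cross-system comparison. The two "systems" below are plain dicts keyed by customer ID, and the normalization is deliberately crude; both are illustrative assumptions:

```python
# Count cross-system mismatches: customers whose address differs
# between two systems (CRM vs. billing here; the data is made up).

def normalize(addr):
    """Cheap normalization so trivial formatting differences don't count."""
    return " ".join(addr.lower().replace(",", " ").split())

def count_mismatches(crm, billing):
    """Both inputs: {customer_id: address}. Returns (compared, mismatched)."""
    shared = crm.keys() & billing.keys()
    mismatched = sum(
        1 for cid in shared if normalize(crm[cid]) != normalize(billing[cid])
    )
    return len(shared), mismatched

crm = {1: "12 Oak St, Springfield", 2: "9 Elm Ave", 3: "1 Main St"}
billing = {1: "12 oak st  Springfield", 2: "90 Elm Ave", 4: "7 Pine Rd"}
compared, mismatched = count_mismatches(crm, billing)
# Customer 1 matches after normalization, customer 2 differs,
# customers 3 and 4 exist in only one system.
```

The mismatch rate (mismatched / compared) is the metric to track over time; customers present in only one system are a separate completeness finding.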
- Timeliness assessment: Data pipeline SLA: dashboard refreshes by 7am. Actual: 3 of last 30 days were late (10% failure rate). Worst delay: 4 hours. Root cause: upstream system delays on month-end processing. Fix: add buffer for month-end, set up alerting for pipeline delays exceeding 30 minutes
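A timeliness check of this shape can be computed directly from refresh timestamps. The 7:00 SLA matches the example above; the timestamps themselves are invented to reproduce the same 10% late rate and 4-hour worst delay:

```python
from datetime import datetime, time, timedelta

# Timeliness check: compare actual dashboard refresh times against a
# 7:00 SLA and report the late rate and the worst delay.

SLA = time(7, 0)

def sla_report(refresh_times):
    """refresh_times: datetimes when the dashboard actually refreshed."""
    late = [dt for dt in refresh_times if dt.time() > SLA]
    worst = max(
        (dt - dt.replace(hour=SLA.hour, minute=SLA.minute,
                         second=0, microsecond=0)
         for dt in late),
        default=timedelta(0),
    )
    return len(late) / len(refresh_times), worst

# Illustrative month: 27 on-time days, 3 late days, worst delay 4 hours.
runs = [datetime(2024, 5, d, 6, 45) for d in range(1, 28)]
runs += [datetime(2024, 5, 28, 8, 0), datetime(2024, 5, 29, 7, 30),
         datetime(2024, 5, 30, 11, 0)]
late_rate, worst_delay = sla_report(runs)
```

The same function, fed from pipeline run logs, is enough to drive the "alert on delays over 30 minutes" rule.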
- Duplicate detection: Apply fuzzy matching on customer records: exact email match, Levenshtein distance on names (threshold 2), phone number normalization. Found: 8% of customer records are likely duplicates. Impact: inflated customer count by 8%, distorted churn metrics, duplicate marketing emails
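The three matching rules above (exact email, Levenshtein distance ≤ 2 on names, normalized phone) can be sketched in stdlib Python. The field names and sample records are illustrative assumptions:

```python
import itertools
import re

# Fuzzy duplicate detection on customer records: exact email match,
# Levenshtein distance <= 2 on names, normalized phone comparison.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def norm_phone(p):
    """Strip everything but digits, keep the last 10 (drops country code)."""
    return re.sub(r"\D", "", p or "")[-10:]

def likely_duplicate(r1, r2):
    if r1["email"] and r1["email"].lower() == r2["email"].lower():
        return True
    if norm_phone(r1["phone"]) and norm_phone(r1["phone"]) == norm_phone(r2["phone"]):
        return True
    return levenshtein(r1["name"].lower(), r2["name"].lower()) <= 2

records = [
    {"name": "Jon Smith", "email": "jon@x.com", "phone": "+1 (555) 010-2000"},
    {"name": "John Smith", "email": "j@y.com", "phone": "555.010.2000"},
    {"name": "Ada Lovelace", "email": "ada@z.org", "phone": "555-777-1234"},
]
pairs = [(a, b) for a, b in itertools.combinations(records, 2)
         if likely_duplicate(a, b)]
```

Here the first two records are flagged (same normalized phone, and the names are one edit apart); a human should still review flagged pairs before merging.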
Guidelines
- Data quality is a spectrum, not a binary — the goal is "fit for purpose," not perfection
- Fix data at the source, not downstream. Cleaning data in your analytics layer means the dirty data still exists in the source system
- The cost of bad data compounds: wrong data → wrong analysis → wrong decision → wrong outcome. One corrupted field can invalidate an entire dashboard
- Automate quality checks in your data pipeline: row count assertions, schema validation, null percentage thresholds, value range checks. Fail the pipeline on quality violations
- Start with business-critical data: revenue numbers, customer counts, conversion metrics. Don't try to audit everything at once
- Assign data owners for each critical dataset — someone must be accountable for quality, otherwise nobody is
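The automated checks recommended above (row count assertions, null percentage thresholds, value range checks, fail-on-violation) can be expressed as a small quality gate. The thresholds, column names, and ranges are illustrative assumptions:

```python
# Pipeline quality gate: raise (failing the run) on row-count,
# null-percentage, or value-range violations.

class QualityViolation(Exception):
    pass

def quality_gate(rows, min_rows=1, max_null_pct=5.0, ranges=None):
    """rows: list of dicts. ranges: {column: (lo, hi)} value-range checks."""
    if len(rows) < min_rows:
        raise QualityViolation(f"row count {len(rows)} < {min_rows}")
    for col in rows[0].keys():
        nulls = sum(1 for r in rows if r.get(col) is None)
        pct = 100.0 * nulls / len(rows)
        if pct > max_null_pct:
            raise QualityViolation(f"{col}: {pct:.1f}% nulls > {max_null_pct}%")
    for col, (lo, hi) in (ranges or {}).items():
        for r in rows:
            v = r.get(col)
            if v is not None and not lo <= v <= hi:
                raise QualityViolation(f"{col}: value {v} outside [{lo}, {hi}]")

good = [{"amount": 10.0}, {"amount": 25.5}]
quality_gate(good, ranges={"amount": (0, 1_000_000)})  # passes silently

try:
    quality_gate([{"amount": None}] * 10 + [{"amount": 5.0}],
                 ranges={"amount": (0, 1_000_000)})
except QualityViolation as e:
    failure = str(e)
```

Call the gate between pipeline stages so a violation stops the run before bad data reaches dashboards; orchestrators treat the uncaught exception as a failed task.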