Data Quality Audit Guide
Systematically assess and improve data quality across your organization.
Usage
- Define data quality dimensions: completeness, accuracy, consistency, timeliness, validity, uniqueness
- Profile your data to establish baseline quality metrics
- Identify root causes of quality issues (source systems, ETL, manual entry, schema drift)
- Prioritize fixes by business impact — not all data quality issues matter equally
- Implement automated monitoring to catch quality degradation early
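The profiling step above can be sketched in a few lines of stdlib Python. The record layout, field names, and sample data below are illustrative assumptions, not part of any specific system:

```python
# Minimal data profiler: baseline completeness and uniqueness metrics
# for a list of record dicts (a stand-in for a real table).

def profile(rows, columns):
    """Return per-column null percentage and overall duplicate percentage."""
    total = len(rows)
    null_pct = {}
    for col in columns:
        nulls = sum(1 for r in rows if r.get(col) in (None, ""))
        null_pct[col] = round(100.0 * nulls / total, 2)
    seen, dups = set(), 0
    for r in rows:
        key = tuple(r.get(c) for c in columns)
        if key in seen:
            dups += 1
        seen.add(key)
    return {"null_pct": null_pct, "dup_pct": round(100.0 * dups / total, 2)}

# Hypothetical sample: two of four rows lack an email, one row is a duplicate.
rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": ""},
    {"id": 1, "email": "a@x.com"},
]
baseline = profile(rows, ["id", "email"])
```

Run this once to establish the baseline, then re-run on a schedule and alert when a metric drifts.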
Examples
- Completeness check: for each column, count nulls and compute the null percentage: SELECT COUNT(*) AS total, SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) AS nulls, ROUND(100.0 * SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) / COUNT(*), 2) AS null_pct FROM table. Flag columns with >5% nulls, then investigate: are the nulls valid (an optional field) or data loss (a required field not being captured)?
- Consistency audit: the same customer has different addresses in CRM vs billing vs shipping; the same product has different names in the sales and inventory systems. Count entity mismatches across systems. The root cause is usually some combination of no master data management, multiple entry points, and no validation rules
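A consistency audit like this can be sketched as a cross-system comparison. The two "systems" below are plain dicts keyed by customer ID, and the normalization is deliberately crude; both are illustrative assumptions:

```python
# Count cross-system mismatches: customers whose address differs
# between two systems (CRM vs. billing here; the data is made up).

def normalize(addr):
    """Cheap normalization so trivial formatting differences don't count."""
    return " ".join(addr.lower().replace(",", " ").split())

def count_mismatches(crm, billing):
    """Both inputs: {customer_id: address}. Returns (compared, mismatched)."""
    shared = crm.keys() & billing.keys()
    mismatched = sum(
        1 for cid in shared if normalize(crm[cid]) != normalize(billing[cid])
    )
    return len(shared), mismatched

crm = {1: "12 Oak St, Springfield", 2: "9 Elm Ave", 3: "1 Main St"}
billing = {1: "12 oak st  Springfield", 2: "90 Elm Ave", 4: "7 Pine Rd"}
compared, mismatched = count_mismatches(crm, billing)
# Customer 1 matches after normalization, customer 2 differs,
# customers 3 and 4 exist in only one system.
```

The mismatch rate (mismatched / compared) is the metric to track over time; customers present in only one system are a separate completeness finding.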
- Timeliness assessment: Data pipeline SLA: dashboard refreshes by 7am. Actual: 3 of last 30 days were late (10% failure rate). Worst delay: 4 hours. Root cause: upstream system delays on month-end processing. Fix: add buffer for month-end, set up alerting for pipeline delays exceeding 30 minutes
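A timeliness check of this shape can be computed directly from refresh timestamps. The 7:00 SLA matches the example above; the timestamps themselves are invented to reproduce the same 10% late rate and 4-hour worst delay:

```python
from datetime import datetime, time, timedelta

# Timeliness check: compare actual dashboard refresh times against a
# 7:00 SLA and report the late rate and the worst delay.

SLA = time(7, 0)

def sla_report(refresh_times):
    """refresh_times: datetimes when the dashboard actually refreshed."""
    late = [dt for dt in refresh_times if dt.time() > SLA]
    worst = max(
        (dt - dt.replace(hour=SLA.hour, minute=SLA.minute,
                         second=0, microsecond=0)
         for dt in late),
        default=timedelta(0),
    )
    return len(late) / len(refresh_times), worst

# Illustrative month: 27 on-time days, 3 late days, worst delay 4 hours.
runs = [datetime(2024, 5, d, 6, 45) for d in range(1, 28)]
runs += [datetime(2024, 5, 28, 8, 0), datetime(2024, 5, 29, 7, 30),
         datetime(2024, 5, 30, 11, 0)]
late_rate, worst_delay = sla_report(runs)
```

The same function, fed from pipeline run logs, is enough to drive the "alert on delays over 30 minutes" rule.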
- Duplicate detection: Apply fuzzy matching on customer records: exact email match, Levenshtein distance on names (threshold 2), phone number normalization. Found: 8% of customer records are likely duplicates. Impact: inflated customer count by 8%, distorted churn metrics, duplicate marketing emails
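The three matching rules above (exact email, Levenshtein distance ≤ 2 on names, normalized phone) can be sketched in stdlib Python. The field names and sample records are illustrative assumptions:

```python
import itertools
import re

# Fuzzy duplicate detection on customer records: exact email match,
# Levenshtein distance <= 2 on names, normalized phone comparison.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def norm_phone(p):
    """Strip everything but digits, keep the last 10 (drops country code)."""
    return re.sub(r"\D", "", p or "")[-10:]

def likely_duplicate(r1, r2):
    if r1["email"] and r1["email"].lower() == r2["email"].lower():
        return True
    if norm_phone(r1["phone"]) and norm_phone(r1["phone"]) == norm_phone(r2["phone"]):
        return True
    return levenshtein(r1["name"].lower(), r2["name"].lower()) <= 2

records = [
    {"name": "Jon Smith", "email": "jon@x.com", "phone": "+1 (555) 010-2000"},
    {"name": "John Smith", "email": "j@y.com", "phone": "555.010.2000"},
    {"name": "Ada Lovelace", "email": "ada@z.org", "phone": "555-777-1234"},
]
pairs = [(a, b) for a, b in itertools.combinations(records, 2)
         if likely_duplicate(a, b)]
```

Here the first two records are flagged (same normalized phone, and the names are one edit apart); a human should still review flagged pairs before merging.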
Guidelines
- Data quality is a spectrum, not a binary — the goal is "fit for purpose," not perfection
- Fix data at the source, not downstream. Cleaning data in your analytics layer means the dirty data still exists in the source system
- The cost of bad data compounds: wrong data → wrong analysis → wrong decision → wrong outcome. One corrupted field can invalidate an entire dashboard
- Automate quality checks in your data pipeline: row count assertions, schema validation, null percentage thresholds, value range checks. Fail the pipeline on quality violations
- Start with business-critical data: revenue numbers, customer counts, conversion metrics. Don't try to audit everything at once
- Assign data owners for each critical dataset — someone must be accountable for quality, otherwise nobody is
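The automated checks recommended above (row count assertions, null percentage thresholds, value range checks, fail-on-violation) can be expressed as a small quality gate. The thresholds, column names, and ranges are illustrative assumptions:

```python
# Pipeline quality gate: raise (failing the run) on row-count,
# null-percentage, or value-range violations.

class QualityViolation(Exception):
    pass

def quality_gate(rows, min_rows=1, max_null_pct=5.0, ranges=None):
    """rows: list of dicts. ranges: {column: (lo, hi)} value-range checks."""
    if len(rows) < min_rows:
        raise QualityViolation(f"row count {len(rows)} < {min_rows}")
    for col in rows[0].keys():
        nulls = sum(1 for r in rows if r.get(col) is None)
        pct = 100.0 * nulls / len(rows)
        if pct > max_null_pct:
            raise QualityViolation(f"{col}: {pct:.1f}% nulls > {max_null_pct}%")
    for col, (lo, hi) in (ranges or {}).items():
        for r in rows:
            v = r.get(col)
            if v is not None and not lo <= v <= hi:
                raise QualityViolation(f"{col}: value {v} outside [{lo}, {hi}]")

good = [{"amount": 10.0}, {"amount": 25.5}]
quality_gate(good, ranges={"amount": (0, 1_000_000)})  # passes silently

try:
    quality_gate([{"amount": None}] * 10 + [{"amount": 5.0}],
                 ranges={"amount": (0, 1_000_000)})
except QualityViolation as e:
    failure = str(e)
```

Call the gate between pipeline stages so a violation stops the run before bad data reaches dashboards; orchestrators treat the uncaught exception as a failed task.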