ETL Pipeline Design
Designs Extract, Transform, Load (ETL) and ELT pipeline architectures for data integration, warehousing, and analytics. Covers source extraction patterns (CDC, full load, incremental), transformation logic (cleaning, enrichment, aggregation), loading strategies (upsert, append, slowly changing dimensions), error handling, data quality checks, orchestration with Airflow or Prefect, and SQL-based transformation with dbt.
Usage
Describe your data sources (databases, APIs, files), destination (data warehouse, data lake), transformation requirements, data volume, and freshness needs. Specify your preferred tools and cloud platform. The skill designs a complete pipeline architecture with extraction, transformation, and loading patterns for your specific scenario.
Examples
- "Design an ETL pipeline that extracts from PostgreSQL and 3 REST APIs, transforms in Python, loads to BigQuery"
- "Create a CDC pipeline using Debezium to stream database changes to a Kafka topic for real-time analytics"
- "Build a dbt transformation layer that creates dimensional models from raw Snowflake staging tables"
- "Design an incremental pipeline that processes only new/changed records using watermark timestamps"
Guidelines
- Prefer ELT over ETL when your warehouse has the compute power — transform in SQL using dbt for maintainability
- Implement idempotent loads: rerunning a pipeline for the same time period should produce identical results (see the partition-swap sketch after this list)
- Use incremental extraction with high-watermark columns (updated_at) to avoid full table scans on each run (see the watermark sketch below)
- Build data quality checks between stages: row counts, null checks, referential integrity, freshness assertions (see the quality-check sketch below)
- Log extraction metadata (row counts, timestamps, source versions) for debugging and lineage tracking (see the run-log sketch below)
- Handle late-arriving data and out-of-order events with appropriate windowing and upsert strategies (see the upsert sketch below)
- Use dead-letter queues for records that fail transformation, with alerting for human review (see the dead-letter sketch below)
- Design for backfill capability: any pipeline should be able to reprocess historical data on demand
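The guidelines above are illustrated with minimal Python sketches. Table names, schemas, and the use of SQLite are assumptions made so each example is self-contained and runnable, not part of the skill itself. First, the partition-swap sketch for idempotent loads: each run deletes and re-inserts one date partition inside a single transaction, so rerunning a day, or replaying a whole historical range for a backfill, produces identical results.

```python
import sqlite3
from datetime import date

# Illustrative target table; a real pipeline would target a warehouse, but the
# pattern is the same: swap exactly one partition per run inside a transaction.
DDL = """CREATE TABLE IF NOT EXISTS fct_orders (
    order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"""

def load_partition(conn: sqlite3.Connection, run_date: date, rows: list[tuple]) -> None:
    """Replace one date partition of fct_orders; safe to rerun for any date."""
    with conn:  # one transaction: either the whole partition swaps or nothing does
        conn.execute("DELETE FROM fct_orders WHERE order_date = ?", (run_date.isoformat(),))
        conn.executemany(
            "INSERT INTO fct_orders (order_id, order_date, amount) VALUES (?, ?, ?)",
            rows,
        )

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    day = date(2024, 1, 15)
    batch = [(1, day.isoformat(), 19.99), (2, day.isoformat(), 5.00)]
    load_partition(conn, day, batch)
    load_partition(conn, day, batch)  # rerun or backfill: still exactly 2 rows for this day
    print(conn.execute("SELECT COUNT(*) FROM fct_orders").fetchone()[0])  # -> 2
```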
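The watermark sketch: incremental extraction that pulls only rows changed since the last run. The orders source table, its updated_at column, and the etl_state bookkeeping table are illustrative assumptions.

```python
import sqlite3

# Assumed bookkeeping table, created once:
#   CREATE TABLE etl_state (pipeline TEXT PRIMARY KEY, watermark TEXT)

def get_watermark(state: sqlite3.Connection, pipeline: str) -> str:
    row = state.execute(
        "SELECT watermark FROM etl_state WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    return row[0] if row else "1970-01-01T00:00:00"  # first run: extract everything

def extract_incremental(source: sqlite3.Connection, state: sqlite3.Connection,
                        pipeline: str = "orders") -> list[tuple]:
    """Return only rows changed since the last run, then advance the watermark."""
    watermark = get_watermark(state, pipeline)
    rows = source.execute(
        "SELECT order_id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    if rows:
        with state:  # persist the new high watermark for the next run
            state.execute(
                "INSERT INTO etl_state (pipeline, watermark) VALUES (?, ?) "
                "ON CONFLICT(pipeline) DO UPDATE SET watermark = excluded.watermark",
                (pipeline, rows[-1][2]),
            )
    return rows
```

In practice the watermark should only be advanced once the downstream load has committed; otherwise a failed run would skip those rows on retry.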
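The quality-check sketch: batch-level assertions an orchestrator task could run between extract and load. The record shape (dictionaries with ISO 8601 UTC timestamps), the thresholds, and the check names are illustrative; referential-integrity checks would typically run in the warehouse itself, for example as dbt tests.

```python
from datetime import datetime, timedelta, timezone

class DataQualityError(Exception):
    """Raised when a batch fails a check so the orchestrator can halt the load."""

def check_batch(rows: list[dict], *, min_rows: int, not_null: list[str],
                freshness_field: str, max_lag: timedelta) -> None:
    # Row count: catches silently empty or truncated extracts.
    if len(rows) < min_rows:
        raise DataQualityError(f"expected at least {min_rows} rows, got {len(rows)}")
    # Null checks on required columns.
    for col in not_null:
        nulls = sum(1 for r in rows if r.get(col) is None)
        if nulls:
            raise DataQualityError(f"{nulls} null value(s) in required column {col!r}")
    # Freshness: the newest record must be recent enough. Timestamps are assumed
    # to be ISO 8601 with an explicit offset, e.g. '2024-01-15T08:00:00+00:00'.
    if rows:
        newest = max(datetime.fromisoformat(r[freshness_field]) for r in rows)
        if datetime.now(timezone.utc) - newest > max_lag:
            raise DataQualityError(f"stale batch: newest record is {newest.isoformat()}")
```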
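The run-log sketch: recording extraction metadata per run so failures can be debugged and lineage traced. The pipeline_runs schema is an assumption; a real deployment might write to a warehouse audit table or the orchestrator's metadata store.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed audit table for per-run metadata.
RUN_LOG_DDL = """CREATE TABLE IF NOT EXISTS pipeline_runs (
    pipeline TEXT, run_started_at TEXT, row_count INTEGER, source_version TEXT)"""

def log_run(meta: sqlite3.Connection, pipeline: str, started_at: datetime,
            row_count: int, source_version: str) -> None:
    """Record what ran, when, and how much data moved."""
    with meta:
        meta.execute(RUN_LOG_DDL)
        meta.execute(
            "INSERT INTO pipeline_runs (pipeline, run_started_at, row_count, source_version) "
            "VALUES (?, ?, ?, ?)",
            (pipeline, started_at.astimezone(timezone.utc).isoformat(),
             row_count, source_version),
        )
```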
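The upsert sketch for late-arriving and out-of-order events: a conditional upsert keeps the newest version of each key regardless of arrival order. SQLite's ON CONFLICT syntax stands in for a warehouse MERGE, and the orders_current table with order_id as its primary key is assumed.

```python
import sqlite3

# Assumed target table:
#   CREATE TABLE orders_current (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)
UPSERT = """
INSERT INTO orders_current (order_id, amount, updated_at)
VALUES (?, ?, ?)
ON CONFLICT(order_id) DO UPDATE SET
    amount = excluded.amount,
    updated_at = excluded.updated_at
WHERE excluded.updated_at > orders_current.updated_at
"""

def apply_events(conn: sqlite3.Connection, events: list[tuple]) -> None:
    """Apply (order_id, amount, updated_at) events in any arrival order;
    the newest updated_at per key wins, so stale duplicates are ignored."""
    with conn:
        conn.executemany(UPSERT, events)
```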
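The dead-letter sketch: records that fail transformation are quarantined instead of failing the whole run. A JSON-lines file stands in for a real dead-letter queue such as a Kafka topic or an SQS queue, and a warning is emitted for the alerting layer to pick up.

```python
import json
import logging

logger = logging.getLogger("pipeline")

def transform_batch(records: list[dict], transform, dead_letter_path: str) -> list[dict]:
    """Apply `transform` to each record; route failures to a dead-letter file."""
    good, failed = [], 0
    with open(dead_letter_path, "a", encoding="utf-8") as dlq:
        for record in records:
            try:
                good.append(transform(record))
            except Exception as exc:  # keep the pipeline moving past bad records
                failed += 1
                dlq.write(json.dumps({"record": record, "error": str(exc)}) + "\n")
    if failed:
        # Wire this into alerting (a metric or on-call notification) for human review.
        logger.warning("routed %d of %d records to %s", failed, len(records), dead_letter_path)
    return good
```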