Senior Data Engineer

by alirezarezvani

Production-grade data engineering skill for building scalable, reliable data systems.

Table of Contents

  1. Trigger Phrases
  2. Quick Start
  3. Workflows
     • Building a Batch ETL Pipeline
     • Implementing Real-Time Streaming
     • Data Quality Framework Setup
  4. Architecture Decision Framework
  5. Tech Stack
  6. Reference Documentation
  7. Troubleshooting

Trigger Phrases

Activate this skill when you see:

Pipeline Design:

  • "Design a data pipeline for..."
  • "Build an ETL/ELT process..."
  • "How should I ingest data from..."
  • "Set up data extraction from..."

Architecture:

  • "Should I use batch or streaming?"
  • "Lambda vs Kappa architecture"
  • "How to handle late-arriving data"
  • "Design a data lakehouse"

Data Modeling:

  • "Create a dimensional model..."
  • "Star schema vs snowflake"
  • "Implement slowly changing dimensions"
  • "Design a data vault"

Data Quality:

  • "Add data validation to..."
  • "Set up data quality checks"
  • "Monitor data freshness"
  • "Implement data contracts"

Performance:

  • "Optimize this Spark job"
  • "Query is running slow"
  • "Reduce pipeline execution time"
  • "Tune Airflow DAG"

Quick Start

Core Tools

# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
  --type airflow \
  --source postgres \
  --destination snowflake \
  --schedule "0 5 * * *"

# Validate data quality
python scripts/data_quality_validator.py validate \
  --input data/sales.parquet \
  --schema schemas/sales.json \
  --checks freshness,completeness,uniqueness

# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
  --query queries/daily_aggregation.sql \
  --engine spark \
  --recommend
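
The `--checks freshness,completeness,uniqueness` option above names three common data quality checks. The following is a minimal sketch of what such checks usually mean, written in plain Python over a list of row dictionaries. It is not the implementation of `data_quality_validator.py`; the function names, field names, and thresholds are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(rows, ts_field, max_age, now=None):
    """Pass if the newest timestamp in ts_field is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    timestamps = [r[ts_field] for r in rows if r.get(ts_field) is not None]
    if not timestamps:
        return False
    return now - max(timestamps) <= max_age


def check_completeness(rows, required_fields, min_ratio=1.0):
    """Pass if every required field is non-null in at least min_ratio of rows."""
    if not rows:
        return False
    for field in required_fields:
        filled = sum(1 for r in rows if r.get(field) is not None)
        if filled / len(rows) < min_ratio:
            return False
    return True


def check_uniqueness(rows, key_fields):
    """Pass if no two rows share the same combination of key fields."""
    keys = [tuple(r.get(f) for f in key_fields) for r in rows]
    return len(keys) == len(set(keys))


# Hypothetical sample data: one null amount and one duplicated order_id.
now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
sales = [
    {"order_id": 1, "amount": 10.0, "updated_at": datetime(2024, 1, 2, 5, 0, tzinfo=timezone.utc)},
    {"order_id": 2, "amount": None, "updated_at": datetime(2024, 1, 1, 23, 0, tzinfo=timezone.utc)},
    {"order_id": 2, "amount": 7.5, "updated_at": datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc)},
]

results = {
    "freshness": check_freshness(sales, "updated_at", timedelta(hours=24), now=now),
    "completeness": check_completeness(sales, ["order_id", "amount"]),
    "uniqueness": check_uniqueness(sales, ["order_id"]),
}
print(results)
```

Passing `now` explicitly keeps the freshness check deterministic, which makes it easier to test than a check that always reads the wall clock.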

Workflows

→ See references/workflows.md for details

Architecture Decision Framework

Use this framework to choose the right approach for your data pipeline.

Batch vs Streaming

| Criteria | Batch | Streaming |
|----------|-------|-----------|
| Latency requirement | Hours to days | Seconds to minutes |
| Data volume | Large historical datasets | Continuous event streams |
| Processing complexity | Complex transformations, ML | Simple aggregations, filtering |
| Cost sensitivity | More cost-effective | Higher infrastructure cost |
| Error handling | Easier to reprocess | Requires careful design |

Decision Tree:

Is real-time insight required?
├── Yes → Use streaming
│   └── Is exactly-once semantics needed?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Is data volume > 1TB daily?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute
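
The decision tree above can also be written as a small function, which is handy for documenting or testing the choice. This is only a sketch: the function name and parameters are hypothetical, and the returned strings simply repeat the leaves of the tree.

```python
def recommend_stack(real_time_required, exactly_once=False, daily_volume_tb=0.0):
    """Walk the batch-vs-streaming decision tree and return the matching leaf."""
    if real_time_required:
        # Streaming branch: the only further question is delivery semantics.
        if exactly_once:
            return "streaming: Kafka + Flink/Spark Structured Streaming"
        return "streaming: Kafka + consumer groups"
    # Batch branch: the 1 TB/day threshold comes straight from the tree.
    if daily_volume_tb > 1:
        return "batch: Spark/Databricks"
    return "batch: dbt + warehouse compute"


print(recommend_stack(real_time_required=False, daily_volume_tb=0.2))
print(recommend_stack(real_time_required=True, exactly_once=True))
```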

Lambda vs Kappa Architecture

| Aspect | Lambda | Kappa |
|--------|--------|-------|
| Complexity | Two codebases (batch + stream) | Single codebase |
| Maintenance | Higher (sync batch/stream logic) | Lower |
| Reprocessing | Native batch layer | Replay from source |
| Use case | ML training + real-time serving | Pure event-driven |

When to choose Lambda:

  • Need to train ML models on historical data
  • Complex batch transformations not feasible in streaming
  • Existing batch infrastructure

When to choose Kappa:

  • Event-sourced architecture
  • All processing can be expressed as stream operations
  • Starting fresh without legacy systems
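
To make the difference concrete, here is a toy sketch that uses an in-memory list as a stand-in for an event log. The event shape and the `count_by_user` function are hypothetical. In a real Lambda system the batch and speed layers are usually separate codebases that must be kept in sync; the sketch reuses one function for both only to stay short.

```python
from collections import Counter


def count_by_user(events):
    """Count events per user; stands in for any pipeline transformation."""
    counts = Counter()
    for event in events:
        counts[event["user"]] += 1
    return counts


# Hypothetical append-only event log.
event_log = [{"user": "a"}, {"user": "b"}, {"user": "a"}, {"user": "a"}]

# Kappa: one code path. Reprocessing means replaying the log from the start.
kappa_view = count_by_user(event_log)

# Lambda: a batch layer covers events up to a cutoff, a speed layer covers
# the events after it, and the serving layer merges the two views.
cutoff = 2
batch_view = count_by_user(event_log[:cutoff])
speed_view = count_by_user(event_log[cutoff:])
lambda_view = batch_view + speed_view

print(kappa_view == lambda_view)
```

Both approaches should produce the same view. The maintenance cost of Lambda comes from keeping two implementations of the same logic consistent, which this sketch sidesteps by sharing a function.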

Data Warehouse vs Data Lakehouse

| Feature | Warehouse (Snowflake/BigQuery) | Lakehouse (Delta/Iceberg) |
|---------|-------------------------------|---------------------------|
| Best for | BI, SQL analytics | ML, unstructured data |
| Storage cost | Higher (proprietary format) | Lower (open formats) |
| Flexibility | Schema-on-write | Schema-on-read for raw files; table formats enforce schema on write and support schema evolution |
| Performance | Excellent for SQL | Good, improving |
| Ecosystem | Mature BI tools | Growing ML tooling |

Tech Stack

| Category | Technologies |
|----------|--------------|
| Languages | Python, SQL, Scala |
| Orchestration | Airflow, Prefect, Dagster |
| Transformation | dbt, Spark, Flink |
| Streaming | Kafka, Kinesis, Pub/Sub |
| Storage | S3, GCS, Delta Lake, Iceberg |
| Warehouses | Snowflake, BigQuery, Redshift, Databricks |
| Quality | Great Expectations, dbt tests, Monte Carlo |
| Monitoring | Prometheus, Grafana, Datadog |

Reference Documentation

1. Data Pipeline Architecture

See references/data_pipeline_architecture.md for:

  • Lambda vs Kappa architecture patterns
  • Batch processing with Spark and Airflow
  • Stream processing with Kafka and Flink
  • Exactly-once semantics implementation
  • Error handling and dead letter queues
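
As a quick illustration of the last item, a dead letter queue sets failed records aside instead of stopping the pipeline. The sketch below is not taken from the reference file: it uses plain Python lists in place of a real queue, and `process_with_dlq`, `to_cents`, and the message shape are hypothetical.

```python
import json


def process_with_dlq(raw_messages, handler):
    """Apply handler to each message; route failures to a dead letter list."""
    processed, dead_letters = [], []
    for raw in raw_messages:
        try:
            processed.append(handler(json.loads(raw)))
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the original payload and the reason so it can be replayed.
            dead_letters.append({"raw": raw, "error": f"{type(exc).__name__}: {exc}"})
    return processed, dead_letters


def to_cents(record):
    """Hypothetical transformation: convert a decimal amount to integer cents."""
    return {
        "order_id": record["order_id"],
        "amount_cents": round(float(record["amount"]) * 100),
    }


messages = [
    '{"order_id": 1, "amount": "12.50"}',
    "not json",            # malformed payload
    '{"order_id": 2}',     # missing required field
]
ok, dlq = process_with_dlq(messages, to_cents)
print(len(ok), "processed,", len(dlq), "dead-lettered")
```

Catching only the expected error types is deliberate: unexpected exceptions still surface and fail the run rather than being silently parked.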

2. Data Modeling Patterns

See references/data_modeling_patterns.md for:

  • Dimensional modeling (Star/Snowflake)
  • Slowly Changing Dimensions (SCD Types 1-6)
  • Data Vault modeling
  • dbt best practices
  • Partitioning and clustering
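
For orientation, a Type 2 slowly changing dimension keeps history by closing the current row and appending a new version whenever an attribute changes. The following is a minimal in-memory sketch, not the pattern from the reference file. The row layout (`key`, `attributes`, `valid_from`, `valid_to`) and the `apply_scd2` helper are hypothetical; in a warehouse this would typically be done with a merge statement or a dbt snapshot.

```python
from datetime import date


def apply_scd2(dimension, key, attributes, effective):
    """Record a new version of `key` if its attributes changed (SCD Type 2)."""
    current = next(
        (r for r in dimension if r["key"] == key and r["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["attributes"] == attributes:
            return dimension  # nothing changed, keep the current row open
        current["valid_to"] = effective  # close the previous version
    dimension.append(
        {"key": key, "attributes": attributes, "valid_from": effective, "valid_to": None}
    )
    return dimension


dim = []
apply_scd2(dim, 42, {"city": "Berlin"}, date(2024, 1, 1))
apply_scd2(dim, 42, {"city": "Berlin"}, date(2024, 2, 1))  # no change, no new row
apply_scd2(dim, 42, {"city": "Munich"}, date(2024, 3, 1))  # change, new version

for row in dim:
    print(row)
```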

3. DataOps Best Practices

See references/dataops_best_practices.md for:

  • Data testing frameworks
  • Data contracts and schema validation
  • CI/CD for data pipelines
  • Observability and lineage
  • Incident response

Troubleshooting

→ See references/troubleshooting.md for details