Data Engineering Projects

Building modern data pipelines with orchestration, transformation, and cloud data warehousing.

01

Airflow + dbt + Snowflake Data Pipeline

Production-ready, end-to-end data pipeline orchestration using Apache Airflow with Astronomer Cosmos. Implements dbt for analytics engineering and data transformation in the Snowflake cloud data warehouse.

Features include dimensional modeling with staging layers and fact/dimension tables, automated data quality tests, incremental loads, a GitOps workflow with containerized deployments, and comprehensive monitoring with custom logging.
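
As a hedged illustration of the custom-logging piece, the sketch below shows one common Airflow pattern: task-level callbacks that emit structured log lines on success and failure. The logger name and message format are assumptions, not the project's actual monitoring code.

```python
# callbacks.py -- illustrative monitoring hooks (an assumed pattern;
# the project's real custom logging may differ).
import logging

log = logging.getLogger("pipeline.monitoring")


def on_failure(context):
    # Airflow passes the task context dict to failure callbacks.
    ti = context["task_instance"]
    log.error("FAILED dag=%s task=%s try=%s", ti.dag_id, ti.task_id, ti.try_number)


def on_success(context):
    ti = context["task_instance"]
    log.info("OK dag=%s task=%s duration=%ss", ti.dag_id, ti.task_id, ti.duration)
```

Callbacks like these are attached through a DAG's `default_args` (e.g. `{"on_failure_callback": on_failure}`).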

The project uses Astronomer's Astro Runtime to deploy Airflow in Docker containers, with a dedicated dbt virtual environment and automated dependency management. The DAG runs on a daily schedule, with Cosmos rendering each dbt model as its own Airflow task for fine-grained orchestration and retries.
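
For context, a minimal Cosmos DAG definition for this kind of setup might look like the sketch below. The dag_id, connection ID, file paths, and Snowflake profile values are illustrative assumptions, not the project's actual configuration.

```python
# dags/dbt_snowflake_pipeline.py -- minimal Cosmos sketch; paths,
# connection IDs, and profile values are assumptions.
from datetime import datetime

from cosmos import DbtDag, ExecutionConfig, ProfileConfig, ProjectConfig
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

profile_config = ProfileConfig(
    profile_name="snowflake_dw",       # hypothetical profile name
    target_name="prod",
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="snowflake_conn",      # assumed Airflow connection ID
        profile_args={"database": "ANALYTICS", "schema": "MARTS"},
    ),
)

dbt_snowflake_pipeline = DbtDag(
    project_config=ProjectConfig("/usr/local/airflow/dbt/my_project"),
    profile_config=profile_config,
    execution_config=ExecutionConfig(
        # dbt lives in its own virtualenv inside the Astro image
        dbt_executable_path="/usr/local/airflow/dbt_venv/bin/dbt",
    ),
    schedule="@daily",                 # daily run, as described above
    start_date=datetime(2024, 1, 1),
    catchup=False,
    dag_id="dbt_snowflake_pipeline",
)
```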

Key Features

  • Automated ELT pipeline with dbt transformations orchestrated by Airflow
  • Dimensional data modeling with staging, intermediate, and mart layers
  • Comprehensive data quality testing with custom SQL tests and generic schema tests
  • Containerized deployment using Docker with Astronomer Astro Runtime
  • Modular project structure with reusable macros and dbt packages (dbt_utils)
  • Snowflake integration with optimized warehouse utilization and role-based access
Apache Airflow · Astronomer Cosmos · dbt · Snowflake · Docker · Python · SQL · Data Modeling
Live Demo: Airflow DAG Execution
02

AWS EMR Spark Data Processing with Terraform

Scalable big data processing pipeline using AWS Elastic MapReduce (EMR) with Apache Spark to analyze food establishment inspection data. Infrastructure is fully automated with Terraform for reproducible deployments.

The project provisions a complete EMR cluster with auto-scaling, processing CSV datasets from S3 with PySpark transformations. Features include automated cluster provisioning, VPC networking, IAM security policies, SSH access configuration, and cost optimization through auto-termination policies.

Deployed on AWS with Terraform managing VPC creation, security groups, IAM roles, S3 buckets for data storage and logs, and EMR cluster configuration. Spark jobs run in cluster mode with dynamic resource allocation and custom memory/core settings.
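
A condensed sketch of what the PySpark job might look like; the S3 bucket names and the dataset's column names are assumptions based on the description above.

```python
# inspection_analysis.py -- illustrative PySpark job; bucket names and
# column names are assumptions, not the project's actual schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FoodInspectionAnalysis").getOrCreate()

# Read the raw inspection CSV from S3 (header row, inferred types).
inspections = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-input-bucket/food_establishment_data.csv")
)
inspections.createOrReplaceTempView("inspections")

# SQL-based violation analysis: count red violations per establishment.
top_violators = spark.sql("""
    SELECT name, COUNT(*) AS red_violations
    FROM inspections
    WHERE violation_type = 'RED'
    GROUP BY name
    ORDER BY red_violations DESC
    LIMIT 10
""")

# Persist results as Parquet for downstream consumption.
top_violators.write.mode("overwrite").parquet("s3://my-output-bucket/results/")

spark.stop()
```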

Key Features

  • Infrastructure as Code using Terraform for complete AWS resource management
  • EMR cluster with master and core instance groups, EBS volumes, and Spark 3.x runtime
  • PySpark data transformation pipeline with SQL queries for violation analysis
  • VPC networking with public subnets, internet gateway, and route table configuration
  • S3 integration for input data, script storage, logs, and Parquet output results
  • IAM security with service roles, EC2 instance profiles, and least-privilege policies
  • Auto-termination after a configurable idle period, with manual job execution via SSH or the AWS CLI (see the step-submission sketch after this list)
  • Port forwarding for YARN UI monitoring and application log analysis
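
For the manual-execution path, one way to submit the Spark job as an EMR step is through the SDK, sketched below with boto3 (the CLI equivalent is `aws emr add-steps`). The cluster ID, region, and S3 script path are placeholders.

```python
# submit_step.py -- sketch of submitting the Spark job as an EMR step
# via boto3; cluster ID, region, and S3 paths are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # cluster ID from Terraform output or console
    Steps=[
        {
            "Name": "food-inspection-analysis",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-input-bucket/scripts/inspection_analysis.py",
                ],
            },
        }
    ],
)
print("Submitted step:", response["StepIds"][0])
```
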
AWS EMR · Apache Spark · Terraform · PySpark · AWS S3 · AWS VPC · IAM · EC2 · Infrastructure as Code · Python
Live Demo: EMR Cluster Deployment & Spark Job Execution
03

Kubernetes Spark Operator for E-commerce Analytics

Cloud-native big data processing using the Spark Operator on a k0s Kubernetes cluster to analyze e-commerce event data. Demonstrates containerized Spark workloads with persistent storage and automated job orchestration.

Implements the Kubeflow Spark Operator for declarative Spark application deployment on Kubernetes. Features include PersistentVolumeClaim for data persistence, ConfigMap-based script deployment, ServiceAccount RBAC configuration, and SparkApplication CRD for managing driver and executor pods.

The pipeline processes CSV datasets mounted from PVCs, performs data analysis with PySpark SQL, and writes results to persistent storage. It uses local-path-provisioner for dynamic volume provisioning on the k0s minimal Kubernetes distribution.
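
A minimal sketch of the kind of PySpark script that gets shipped via ConfigMap; the /data mount path and the event schema (event_type, category_code, price) are assumptions.

```python
# ecommerce_analysis.py -- illustrative PySpark script of the kind shipped
# via ConfigMap; the /data mount path and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("EcommerceEventAnalysis").getOrCreate()

# Events CSV is mounted into driver and executor pods from the PVC.
events = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/input/events.csv")
)

# Aggregate purchase counts and revenue per product category.
category_stats = (
    events.filter(F.col("event_type") == "purchase")
    .groupBy("category_code")
    .agg(
        F.count("*").alias("purchases"),
        F.round(F.sum("price"), 2).alias("revenue"),
    )
    .orderBy(F.col("revenue").desc())
)

# Write results back to the PVC so they survive pod termination.
category_stats.coalesce(1).write.mode("overwrite").option("header", "true").csv(
    "/data/output/category_stats"
)

spark.stop()
```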

Key Features

  • Spark Operator deployment using Helm with webhook integration and namespace configuration
  • Dynamic storage provisioning with local-path StorageClass for PVC management
  • ConfigMap-based PySpark script deployment without external dependencies
  • SparkApplication CRD with declarative driver/executor resource allocation
  • Data persistence across pod lifecycles using PersistentVolumeClaims
  • Temporary pods for data loading and result validation workflows
  • CSV data analysis with product categorization and aggregation queries
  • Results exported to CSV for downstream consumption and validation
Kubernetes · k0s · Apache Spark · Spark Operator · Helm · PySpark · PVC/PV · ConfigMap · kubectl · CRD
Live Demo: Spark on Kubernetes Execution
04

Serverless Data Pipeline with AWS Lambda & Step Functions

Event-driven serverless ETL pipeline orchestrating AWS Lambda, Step Functions, and Glue for automated data processing. Fully managed infrastructure using Terraform with S3-triggered workflows and AWS Glue Data Catalog integration.

The pipeline implements a multi-stage processing workflow: an S3 event triggers a Lambda function for CSV transformation with Pandas, a Step Functions state machine orchestrates validation and cataloging, and an AWS Glue Crawler automatically updates metadata schemas for query engines.

The architecture includes two Lambda functions (a processor and a validator), versioned S3 buckets for raw and processed data, IAM roles with least-privilege policies, a Step Functions state machine with error handling, and a Glue crawler for automated schema discovery.
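
A hedged sketch of the processor Lambda's shape; the environment variable, bucket layout, and calculated column are assumptions, and Pandas is assumed to come from the attached Lambda layer.

```python
# processor.py -- sketch of the processor Lambda; bucket names, env vars,
# and column names are assumptions, not the project's actual schema.
import io
import os

import boto3
import pandas as pd  # provided by the Pandas Lambda layer

s3 = boto3.client("s3")
PROCESSED_BUCKET = os.environ["PROCESSED_BUCKET"]  # assumed, set by Terraform


def handler(event, context):
    # The S3 put notification carries the bucket and key that triggered us.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Load the raw CSV and add a calculated column.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_csv(io.BytesIO(body))
    df["total_price"] = df["quantity"] * df["unit_price"]  # illustrative columns

    # Write the transformed CSV to the processed bucket for the Glue crawler.
    out = io.StringIO()
    df.to_csv(out, index=False)
    s3.put_object(Bucket=PROCESSED_BUCKET, Key=f"processed/{key}", Body=out.getvalue())

    return {"status": "ok", "rows": len(df), "output_key": f"processed/{key}"}
```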

Key Features

  • Event-driven architecture with S3 notifications triggering Lambda functions automatically
  • Lambda data processor with Pandas layer for CSV transformation and calculated columns
  • AWS Step Functions state machine orchestrating validation and Glue crawler execution
  • Dual S3 buckets with versioning for raw ingestion and processed output storage
  • AWS Glue Crawler for automatic schema discovery and Data Catalog updates
  • IAM security with separate roles for Lambda, Step Functions, and Glue services
  • Error handling with retry logic, catch blocks, and validation failure paths
  • Terraform automation for reproducible infrastructure deployment and updates
AWS Lambda · Step Functions · AWS Glue · S3 · Terraform · Python · Pandas · IAM · CloudWatch · AWS CLI
Live Demo: Serverless Pipeline Execution

More data engineering projects coming soon...