Data Engineering Projects

Building modern data pipelines with orchestration, transformation, and cloud data warehousing.

01

Airflow + dbt + Snowflake Data Pipeline

Production-ready, end-to-end data pipeline orchestration using Apache Airflow with Astronomer Cosmos. Implements dbt for analytics engineering and data transformation in the Snowflake cloud data warehouse.

Features include dimensional modeling with staging layers and fact/dimension tables, automated data quality tests, incremental loads, a GitOps workflow with containerized deployments, and comprehensive monitoring with custom logging.
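
As a hedged illustration of the custom-logging piece, the sketch below shows one common Airflow pattern: task-level callbacks that emit structured log lines on success and failure. The logger name and message format are assumptions, not the project's actual monitoring code.

```python
# callbacks.py -- illustrative monitoring hooks (an assumed pattern;
# the project's real custom logging may differ).
import logging

log = logging.getLogger("pipeline.monitoring")


def on_failure(context):
    # Airflow passes the task context dict to failure callbacks.
    ti = context["task_instance"]
    log.error("FAILED dag=%s task=%s try=%s", ti.dag_id, ti.task_id, ti.try_number)


def on_success(context):
    ti = context["task_instance"]
    log.info("OK dag=%s task=%s duration=%ss", ti.dag_id, ti.task_id, ti.duration)
```

Callbacks like these are attached through a DAG's `default_args` (e.g. `{"on_failure_callback": on_failure}`).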

The project uses Astronomer's Astro Runtime to deploy Airflow in Docker containers, with a dedicated dbt virtual environment and automated dependency management. The DAG runs on a daily schedule, with Cosmos rendering each dbt model as its own Airflow task for fine-grained orchestration and retries.
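
For context, a minimal Cosmos DAG definition for this kind of setup might look like the sketch below. The dag_id, connection ID, file paths, and Snowflake profile values are illustrative assumptions, not the project's actual configuration.

```python
# dags/dbt_snowflake_pipeline.py -- minimal Cosmos sketch; paths,
# connection IDs, and profile values are assumptions.
from datetime import datetime

from cosmos import DbtDag, ExecutionConfig, ProfileConfig, ProjectConfig
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

profile_config = ProfileConfig(
    profile_name="snowflake_dw",       # hypothetical profile name
    target_name="prod",
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="snowflake_conn",      # assumed Airflow connection ID
        profile_args={"database": "ANALYTICS", "schema": "MARTS"},
    ),
)

dbt_snowflake_pipeline = DbtDag(
    project_config=ProjectConfig("/usr/local/airflow/dbt/my_project"),
    profile_config=profile_config,
    execution_config=ExecutionConfig(
        # dbt lives in its own virtualenv inside the Astro image
        dbt_executable_path="/usr/local/airflow/dbt_venv/bin/dbt",
    ),
    schedule="@daily",                 # daily run, as described above
    start_date=datetime(2024, 1, 1),
    catchup=False,
    dag_id="dbt_snowflake_pipeline",
)
```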

Key Features

  • Automated ELT pipeline with dbt transformations orchestrated by Airflow
  • Dimensional data modeling with staging, intermediate, and mart layers
  • Comprehensive data quality testing with custom SQL tests and generic schema tests
  • Containerized deployment using Docker with Astronomer Astro Runtime
  • Modular project structure with reusable macros and dbt packages (dbt_utils)
  • Snowflake integration with optimized warehouse utilization and role-based access
Apache Airflow · Astronomer Cosmos · dbt · Snowflake · Docker · Python · SQL · Data Modeling
Live Demo: Airflow DAG Execution
02

AWS EMR Spark Data Processing with Terraform

Scalable big data processing pipeline using AWS Elastic MapReduce (EMR) with Apache Spark to analyze food establishment inspection data. Infrastructure is fully automated with Terraform for reproducible deployments.

The project provisions a complete EMR cluster with auto-scaling, processing CSV datasets from S3 with PySpark transformations. Features include automated cluster provisioning, VPC networking, IAM security policies, SSH access configuration, and cost optimization through auto-termination policies.

Deployed on AWS with Terraform managing VPC creation, security groups, IAM roles, S3 buckets for data storage and logs, and EMR cluster configuration. Spark jobs run in cluster mode with dynamic resource allocation and custom memory/core settings.
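
A condensed sketch of what the PySpark job might look like; the S3 bucket names and the dataset's column names are assumptions based on the description above.

```python
# inspection_analysis.py -- illustrative PySpark job; bucket names and
# column names are assumptions, not the project's actual schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FoodInspectionAnalysis").getOrCreate()

# Read the raw inspection CSV from S3 (header row, inferred types).
inspections = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-input-bucket/food_establishment_data.csv")
)
inspections.createOrReplaceTempView("inspections")

# SQL-based violation analysis: count red violations per establishment.
top_violators = spark.sql("""
    SELECT name, COUNT(*) AS red_violations
    FROM inspections
    WHERE violation_type = 'RED'
    GROUP BY name
    ORDER BY red_violations DESC
    LIMIT 10
""")

# Persist results as Parquet for downstream consumption.
top_violators.write.mode("overwrite").parquet("s3://my-output-bucket/results/")

spark.stop()
```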

Key Features

  • Infrastructure as Code using Terraform for complete AWS resource management
  • EMR cluster with master and core instance groups, EBS volumes, and Spark 3.x runtime
  • PySpark data transformation pipeline with SQL queries for violation analysis
  • VPC networking with public subnets, internet gateway, and route table configuration
  • S3 integration for input data, script storage, logs, and Parquet output results
  • IAM security with service roles, EC2 instance profiles, and least-privilege policies
  • Auto-termination after a configurable idle period, with manual job execution via SSH or the AWS CLI (see the step-submission sketch after this list)
  • Port forwarding for YARN UI monitoring and application log analysis
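
For the manual-execution path, one way to submit the Spark job as an EMR step is through the SDK, sketched below with boto3 (the CLI equivalent is `aws emr add-steps`). The cluster ID, region, and S3 script path are placeholders.

```python
# submit_step.py -- sketch of submitting the Spark job as an EMR step
# via boto3; cluster ID, region, and S3 paths are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # cluster ID from Terraform output or console
    Steps=[
        {
            "Name": "food-inspection-analysis",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-input-bucket/scripts/inspection_analysis.py",
                ],
            },
        }
    ],
)
print("Submitted step:", response["StepIds"][0])
```
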
AWS EMR · Apache Spark · Terraform · PySpark · AWS S3 · AWS VPC · IAM · EC2 · Infrastructure as Code · Python
Live Demo: EMR Cluster Deployment & Spark Job Execution
03

Kubernetes Spark Operator for E-commerce Analytics

Cloud-native big data processing using the Spark Operator on a k0s Kubernetes cluster to analyze e-commerce event data. Demonstrates containerized Spark workloads with persistent storage and automated job orchestration.

Implements the Kubeflow Spark Operator for declarative Spark application deployment on Kubernetes. Features include PersistentVolumeClaim for data persistence, ConfigMap-based script deployment, ServiceAccount RBAC configuration, and SparkApplication CRD for managing driver and executor pods.

The pipeline processes CSV datasets mounted from PVCs, performs data analysis with PySpark SQL, and writes results to persistent storage. It uses local-path-provisioner for dynamic volume provisioning on the k0s minimal Kubernetes distribution.
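
A minimal sketch of the kind of PySpark script that gets shipped via ConfigMap; the /data mount path and the event schema (event_type, category_code, price) are assumptions.

```python
# ecommerce_analysis.py -- illustrative PySpark script of the kind shipped
# via ConfigMap; the /data mount path and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("EcommerceEventAnalysis").getOrCreate()

# Events CSV is mounted into driver and executor pods from the PVC.
events = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/input/events.csv")
)

# Aggregate purchase counts and revenue per product category.
category_stats = (
    events.filter(F.col("event_type") == "purchase")
    .groupBy("category_code")
    .agg(
        F.count("*").alias("purchases"),
        F.round(F.sum("price"), 2).alias("revenue"),
    )
    .orderBy(F.col("revenue").desc())
)

# Write results back to the PVC so they survive pod termination.
category_stats.coalesce(1).write.mode("overwrite").option("header", "true").csv(
    "/data/output/category_stats"
)

spark.stop()
```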

Key Features

  • Spark Operator deployment using Helm with webhook integration and namespace configuration
  • Dynamic storage provisioning with local-path StorageClass for PVC management
  • ConfigMap-based PySpark script deployment without external dependencies
  • SparkApplication CRD with declarative driver/executor resource allocation
  • Data persistence across pod lifecycles using PersistentVolumeClaims
  • Temporary pods for data loading and result validation workflows
  • CSV data analysis with product categorization and aggregation queries
  • Results exported to CSV for downstream consumption and validation
Kubernetes · k0s · Apache Spark · Spark Operator · Helm · PySpark · PVC/PV · ConfigMap · kubectl · CRD
Live Demo: Spark on Kubernetes Execution
04

Serverless Data Pipeline with AWS Lambda & Step Functions

Event-driven serverless ETL pipeline orchestrating AWS Lambda, Step Functions, and Glue for automated data processing. Fully managed infrastructure using Terraform with S3-triggered workflows and AWS Glue Data Catalog integration.

The pipeline implements a multi-stage processing workflow: an S3 event triggers a Lambda function for CSV transformation with Pandas, a Step Functions state machine orchestrates validation and cataloging, and an AWS Glue Crawler automatically updates metadata schemas for query engines.

The architecture includes two Lambda functions (a processor and a validator), versioned S3 buckets for raw and processed data, IAM roles with least-privilege policies, a Step Functions state machine with error handling, and a Glue crawler for automated schema discovery.
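
A hedged sketch of the processor Lambda's shape; the environment variable, bucket layout, and calculated column are assumptions, and Pandas is assumed to come from the attached Lambda layer.

```python
# processor.py -- sketch of the processor Lambda; bucket names, env vars,
# and column names are assumptions, not the project's actual schema.
import io
import os

import boto3
import pandas as pd  # provided by the Pandas Lambda layer

s3 = boto3.client("s3")
PROCESSED_BUCKET = os.environ["PROCESSED_BUCKET"]  # assumed, set by Terraform


def handler(event, context):
    # The S3 put notification carries the bucket and key that triggered us.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Load the raw CSV and add a calculated column.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_csv(io.BytesIO(body))
    df["total_price"] = df["quantity"] * df["unit_price"]  # illustrative columns

    # Write the transformed CSV to the processed bucket for the Glue crawler.
    out = io.StringIO()
    df.to_csv(out, index=False)
    s3.put_object(Bucket=PROCESSED_BUCKET, Key=f"processed/{key}", Body=out.getvalue())

    return {"status": "ok", "rows": len(df), "output_key": f"processed/{key}"}
```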

Key Features

  • Event-driven architecture with S3 notifications triggering Lambda functions automatically
  • Lambda data processor with Pandas layer for CSV transformation and calculated columns
  • AWS Step Functions state machine orchestrating validation and Glue crawler execution
  • Dual S3 buckets with versioning for raw ingestion and processed output storage
  • AWS Glue Crawler for automatic schema discovery and Data Catalog updates
  • IAM security with separate roles for Lambda, Step Functions, and Glue services
  • Error handling with retry logic, catch blocks, and validation failure paths
  • Terraform automation for reproducible infrastructure deployment and updates
AWS Lambda · Step Functions · AWS Glue · S3 · Terraform · Python · Pandas · IAM · CloudWatch · AWS CLI
Live Demo: Serverless Pipeline Execution

More data engineering projects coming soon...