PORTFOLIO

Our Work & Projects

Real-world case studies and engineering builds — the messy problems, the actual decisions made, and the outcomes.

Case Study

Zero-Downtime Migration: Self-Managed K8s to EKS with 2TB S3 Transfer

Full infrastructure migration for a live product — cross-account S3 data transfer and Kubernetes cluster migration with no user-facing downtime

The Problem

A growing SaaS product needed to move from a self-managed kubeadm cluster on EC2 to fully managed EKS, in an entirely new AWS account. 2TB of user-uploaded assets, ML training data, and backups had to be transferred without data loss, and zero downtime was non-negotiable.

What We Did

Used rclone for the cross-account S3 transfer, running 32 parallel streams with MD5 checksum verification. Exported all Kubernetes resources, stripped server-side fields, and rebuilt them in EKS using eksctl. Migrated services in dependency order: ConfigMaps → StatefulSets → Deployments → Ingress. Switched from nginx-ingress to the AWS Load Balancer Controller with ACM certificates. Cut over using weighted Route 53 DNS routing: 10% → 50% → 100% over 48 hours.
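The transfer and export steps above can be sketched roughly as follows. The remote names, bucket names, and resource names are placeholders, not the actual values from this project:

```shell
# Cross-account S3 copy: 32 parallel transfers, checksums verified in flight
# ("old-account" / "new-account" are hypothetical rclone remotes)
rclone copy old-account:prod-assets new-account:prod-assets \
  --transfers 32 --checksum --progress

# Separate verification pass: compares size and hash on both sides
rclone check old-account:prod-assets new-account:prod-assets

# Export a resource and strip server-side fields before re-applying into EKS
kubectl get deployment api -n prod -o yaml \
  | yq 'del(.metadata.uid) | del(.metadata.resourceVersion) | del(.metadata.creationTimestamp) | del(.status)' \
  > api-deployment.yaml
```

Cleaning out `uid`, `resourceVersion`, and `status` matters because those fields belong to the old cluster and will be rejected or silently regenerated on apply.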

Results

  • 2TB transferred with zero data loss — verified via rclone checksum
  • 12 services migrated to EKS with zero user-facing downtime
  • Deploy time dropped from 25 minutes (manual) to 4 minutes (GitHub Actions → EKS)
  • Infrastructure cost reduced ~20% through better node right-sizing on managed nodegroups
  • Full rollback available at every stage via blue-green DNS weighted routing
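The weighted cutover and rollback path described above can be illustrated with a Route 53 change batch. The hosted zone ID, record name, and load balancer hostname here are placeholders:

```shell
# Shift 10% of traffic to the new EKS load balancer (weights are relative:
# the old record keeps Weight 90). Repeating with 50/50 and then 0/100
# completes the cutover; reversing the weights rolls back instantly.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": "eks-new",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{"Value": "new-alb-123.us-east-1.elb.amazonaws.com"}]
      }
    }]
  }'
```

A low TTL (60s here) is what makes each weight change, and any rollback, take effect quickly.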

Technologies

rclone, AWS EKS, eksctl, GitHub Actions, Route 53, AWS Load Balancer Controller, Terraform, Docker

EKS migration, S3 rclone migration, Kubernetes migration, zero downtime, cross-account AWS

Case Study

Cloud Native CI/CD Automation Platform

Replaced manual deployments causing release delays and production errors with a fully automated pipeline

The Problem

An 8-person engineering team was spending 25+ minutes per deployment on manual steps — pushing Docker images, SSHing into servers, restarting services by hand. Deployment errors were frequent. New engineers couldn't deploy safely without pairing with a senior.

What We Did

Designed an end-to-end CI/CD pipeline using GitHub Actions: automated test runs on every PR, Docker image builds with layer caching, image publishing to ECR with environment-specific tags, and Kubernetes rolling deployments via kubectl. Added Slack notifications for deploy status and automatic rollback triggers on health-check failures.
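The deploy stage of a pipeline like this boils down to a handful of commands. The registry variable, image name, and deployment name here are illustrative:

```shell
# Build with layer caching, tag per commit, and push to ECR
docker build --cache-from "$ECR_REGISTRY/api:latest" -t "$ECR_REGISTRY/api:$GIT_SHA" .
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin "$ECR_REGISTRY"
docker push "$ECR_REGISTRY/api:$GIT_SHA"

# Rolling deployment; if the rollout doesn't go healthy in time, undo it
kubectl set image deployment/api api="$ECR_REGISTRY/api:$GIT_SHA"
kubectl rollout status deployment/api --timeout=120s \
  || kubectl rollout undo deployment/api
```

The `rollout status || rollout undo` pattern is one simple way to get the automatic-rollback behavior: `rollout status` exits non-zero when the deployment fails its readiness checks within the timeout.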

Results

  • Deployment time cut from 25 minutes to under 4 minutes
  • Any engineer on the team can deploy independently and safely
  • Zero manual SSH steps remaining in release process
  • Automatic rollback triggered within 60 seconds of failed health checks

Technologies

Docker, Kubernetes, GitHub Actions, AWS ECR, Node.js, Slack Webhooks

CI/CD pipeline, GitHub Actions, Docker, Kubernetes, DevOps automation

Case Study

Kubernetes Infrastructure for Machine Learning Workloads

Built GPU-enabled K8s cluster for an ML platform running heavy training and inference jobs

The Problem

An ML platform needed to run GPU-intensive training jobs that were crashing on shared infrastructure. Jobs were queuing for hours, GPU utilization was under 30%, and there was no isolation between training and inference workloads.

What We Did

Deployed a Kubernetes cluster with dedicated GPU node pools, using the NVIDIA GPU Operator for driver management. Configured node selectors and taints to isolate training from inference workloads. Added a Kubernetes autoscaler to spin GPU nodes up and down based on job queue depth, cutting idle GPU costs significantly.
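The isolation described above rests on taints and matching tolerations. A minimal sketch, with hypothetical label and taint names:

```shell
# Taint every node in the training pool so that only pods which
# explicitly tolerate the taint can be scheduled there
kubectl taint nodes -l nodegroup=gpu-training workload=training:NoSchedule

# Training pods then carry the matching toleration, a node selector,
# and a GPU resource limit in their pod spec, e.g.:
#   nodeSelector:
#     nodegroup: gpu-training
#   tolerations:
#     - key: workload
#       value: training
#       effect: NoSchedule
#   resources:
#     limits:
#       nvidia.com/gpu: 1
```

Inference pods carry neither the toleration nor the selector, so the scheduler can never place them on training nodes, which is what stops the two workloads from contending for the same GPUs.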

Results

  • GPU utilization improved from ~30% to consistent 85%+ during training runs
  • Training jobs no longer compete with inference — separate node pools with taints
  • Autoscaler reduces GPU node count to zero when queue is empty (major cost saving)
  • Stable training environment — no more OOM crashes from resource contention

Technologies

Kubernetes, NVIDIA GPU Operator, AWS EKS, Node Autoscaler, Prometheus, Grafana

Kubernetes GPU, machine learning infrastructure, GPU clusters, ML deployment

Case Study

Production Monitoring and Observability Stack

Built full observability layer for a team flying blind — no dashboards, no alerts, incidents discovered by users

The Problem

The team had no infrastructure monitoring. Incidents were discovered when users complained. There was no visibility into CPU, memory, pod restarts, or error rates. Debugging required SSH access and manual log grep.

What We Did

Deployed Prometheus with custom scrape configs for all services, Grafana dashboards for system metrics and business KPIs, and AlertManager with PagerDuty routing for on-call. Added structured logging with correlation IDs to make cross-service tracing possible without a paid tool.
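As an example of the kind of check this stack enables, pod restart loops can be caught with a single PromQL expression over kube-state-metrics. The endpoint and threshold here are illustrative, not the project's actual alert rule:

```shell
# Pods that restarted more than 3 times in the last 10 minutes.
# The same expression can back an AlertManager alerting rule,
# which is how restart loops get caught before users notice.
curl -sG 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode \
  'query=increase(kube_pod_container_status_restarts_total[10m]) > 3'
```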

Results

  • Mean time to detect (MTTD) went from "user reports" to under 2 minutes
  • On-call team gets actionable alerts with context — not just "something is down"
  • Pod restart loops caught automatically before they affect users
  • Grafana dashboards used in weekly engineering reviews to track reliability trends

Technologies

Prometheus, Grafana, AlertManager, PagerDuty, Kubernetes, Node Exporter

Prometheus monitoring, Grafana dashboards, observability, incident detection

Engineering Build

Scrum CLI Tool

Command-line sprint management tool built for engineering teams who live in the terminal

The Problem

Engineering teams needed a better way to manage agile workflows directly from the terminal without switching tools.

What We Did

Built a full-featured CLI application with sprint management, task tracking, and developer-focused productivity features.
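A terminal session with a tool like this might look as follows. The command names and flags are purely illustrative, not the tool's actual interface:

```shell
scrum sprint start --name "W12" --days 14       # open a new two-week sprint
scrum task add "Fix flaky auth test" --points 3 # add a task to the sprint
scrum task move AUTH-42 in-progress             # update a task's status
scrum board                                     # render the sprint board
```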

Results

  • Streamlined terminal-based workflow management
  • Reduced context switching for developers
  • Improved team productivity and agility

Technologies

Node.js, CLI Development, TypeScript, Git Integration

Scrum CLI, agile tool, developer productivity, command-line interface

Have a similar problem?

Whether it's a cloud migration, K8s setup, or full infrastructure build — let's talk about what you're working with.

Start a Project →