Projects & outcomes
Real work. Real challenges. The details that matter — not marketing copy.
High-Scale Video Platform Migration to Kubernetes
Designed and executed a complete infrastructure migration for a high-scale video platform, moving from a legacy on-prem environment to a production Kubernetes cluster in the cloud. Built the entire video processing pipeline (ingest, transcode, storage, CDN delivery) as cloud-native workloads.
The Challenge
The client was running a high-traffic video processing platform on aging on-prem hardware. The system was brittle, hard to scale during traffic spikes, and required manual intervention for deployments. Processing queues would back up under load, and there was no reliable failover. They needed a path to cloud-native infrastructure without disrupting live video delivery.
What I Did
- Assessed existing infrastructure and mapped all workloads, dependencies, and data flows
- Designed target architecture on cloud Kubernetes (EKS) with autoscaling worker pools
- Built full IaC with Terraform: VPC, node groups, storage, networking
- Containerized all services and built Helm charts for each workload
- Implemented ArgoCD for GitOps-based deployments across environments
- Built the video processing pipeline with autoscaling, FFmpeg-based job workers
- Set up CDN integration for video delivery and origin failover
- Executed zero-downtime cutover with DNS-based traffic shifting
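As a sketch of the GitOps layer described above: an ArgoCD Application per workload and environment points at the Helm chart in Git, so merging to the tracked branch is the deployment action. The repository URL, chart path, and names below are placeholders, not the client's actual values.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: video-transcoder          # hypothetical workload name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/charts.git  # placeholder repo
    targetRevision: main
    path: charts/transcoder
    helm:
      valueFiles:
        - values-production.yaml  # per-environment overrides
  destination:
    server: https://kubernetes.default.svc
    namespace: video-pipeline
  syncPolicy:
    automated:
      prune: true      # delete cluster resources removed from Git
      selfHeal: true   # revert out-of-band (manual) changes
```

With automated sync and self-heal enabled, the cluster continuously converges on what Git declares, which is what removes the manual deployment steps the legacy setup required.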
Stack & Tools
Kubernetes (EKS) · Terraform · Helm · ArgoCD · FFmpeg · CDN with origin failover · DNS-based traffic shifting
NVIDIA A100 GPU Integration on Kubernetes with MIG Partitioning
Designed and deployed a production multi-tenant GPU cluster on Kubernetes using NVIDIA A100s with full MIG (Multi-Instance GPU) partitioning — matching MIG profile sizes to model sizes so every GPU cycle counts. Small models get small slices; large models get the full card.
The Challenge
The client was building a multi-tenant AI inference platform and needed to serve dozens of models simultaneously — from lightweight 7B models to large 70B+ models — on a fixed pool of NVIDIA A100 80GB GPUs. Giving each model a full GPU was wasteful and expensive. Running everything on shared GPUs without isolation caused memory conflicts and unstable latency. They needed fine-grained, isolated GPU partitioning with Kubernetes-native scheduling.
What I Did
- Deployed NVIDIA GPU Operator on Kubernetes to manage drivers, container runtime, and device plugins automatically
- Enabled MIG mode on all A100 nodes and planned profile allocation based on model size tiers
- Configured 1g.10gb MIG instances for small models (≤7B params) — up to 7 instances per GPU
- Configured 2g.20gb MIG instances for mid-size models (7B–13B params)
- Configured 4g.40gb MIG instances for large models (30B–40B params)
- Reserved full 7g.80gb instances for 70B+ models needing the entire card
- Applied custom Kubernetes node labels per MIG profile for precise pod scheduling
- Built a dynamic MIG reconfiguration pipeline using mig-parted to reshape profiles on demand without node reboots
- Set up resource quotas and LimitRanges per namespace to enforce fair GPU allocation across teams
- Integrated vLLM inference server as the serving layer, pinned to specific MIG instances via device plugin
- Built Prometheus + Grafana dashboards for per-MIG GPU utilization, memory, and inference throughput
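To illustrate how the scheduling pieces above fit together: with the GPU Operator's mixed MIG strategy, each profile is exposed as its own extended resource, so a small-model deployment can request exactly one isolated slice. The model, image tag, and label values below are illustrative placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-7b-vllm                 # hypothetical small-model deployment
spec:
  replicas: 2
  selector:
    matchLabels: { app: llama-7b-vllm }
  template:
    metadata:
      labels: { app: llama-7b-vllm }
    spec:
      nodeSelector:
        nvidia.com/mig.config: all-1g.10gb   # nodes partitioned for the small-model tier
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest     # placeholder image tag
          args: ["--model", "meta-llama/Llama-2-7b-hf"]
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1      # one isolated MIG slice, not a full A100
```

Because each MIG instance has its own memory and compute partition, a noisy neighbor on the same physical GPU cannot evict this pod's KV cache or spike its latency, which was the core problem with unpartitioned sharing.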
Stack & Tools
Kubernetes · NVIDIA GPU Operator · A100 MIG · mig-parted · vLLM · Prometheus · Grafana
Working on something similar?
Let's talk. Book a free discovery call and we'll figure out if I'm the right fit for your project.
Book a Call