Friday, May 8, 2026

GPU clustering and splitting strategies using Kubernetes

Kubernetes manages GPUs by treating them as specialized "allocatable" resources. While it handles the orchestration (scheduling and monitoring), it does not automatically break a single process into parallel work across multiple GPUs.

1. How Kubernetes Manages GPUs

Kubernetes doesn't "know" what a GPU is out of the box; it uses a Device Plugin framework to communicate with hardware. 
  • NVIDIA GPU Operator: Most AI/ML projects use the NVIDIA GPU Operator, which automates installing drivers, the container runtime, and monitoring tools like DCGM.
  • Resource Requests: You request a GPU in your Pod's YAML just like CPU or memory: nvidia.com/gpu: 1 (a sample manifest follows this list).
  • Scheduling & Taints: To prevent non-AI apps from "stealing" expensive GPU nodes, admins often use Taints and Tolerations, ensuring only specific AI workloads land on those machines. 
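For reference, a minimal Pod manifest along these lines might look as follows. The taint key nvidia.com/gpu is an assumption (match whatever taint your admins actually apply to GPU nodes), and the CUDA image is just a convenient smoke test:

  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-smoke-test
  spec:
    tolerations:
      - key: nvidia.com/gpu            # assumed taint key; match your cluster's GPU node taint
        operator: Exists
        effect: NoSchedule
    containers:
      - name: cuda
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
        command: ["nvidia-smi"]        # prints the GPU the device plugin exposed to this pod
        resources:
          limits:
            nvidia.com/gpu: 1          # extended resources go under limits; whole GPUs only by default

Extended resources like nvidia.com/gpu cannot be overcommitted, so the request always equals the limit.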

2. Can it Parallelize a Single Process?

No, not automatically. Kubernetes is an orchestrator, not a compiler or a distributed computing engine. 
  • Whole Instances: By default, Kubernetes can only assign whole GPUs to a Pod. If you request 1 GPU, that Pod gets the entire card.
  • Application-Level Parallelism: To use multiple GPUs for a single process (like training a Large Language Model), your code must be written to support it using libraries like PyTorch DistributedDataParallel (DDP) or Horovod.
  • Kubernetes Add-ons: You use tools like Kubeflow or Volcano to manage these distributed jobs. These tools handle "Gang Scheduling," ensuring all 8 GPUs needed for a training job are ready at the same time so the process doesn't hang (a sample job spec follows this list). 
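As a rough sketch, a Kubeflow PyTorchJob that gang-schedules 8 GPUs across two pods might look like this; the training image is hypothetical and is assumed to launch your DDP script:

  apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    name: llm-train
  spec:
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        template:
          spec:
            containers:
              - name: pytorch                                  # the training operator expects this container name
                image: registry.example.com/ddp-train:latest   # hypothetical image running your DDP script
                resources:
                  limits:
                    nvidia.com/gpu: 4
      Worker:
        replicas: 1
        template:
          spec:
            containers:
              - name: pytorch
                image: registry.example.com/ddp-train:latest
                resources:
                  limits:
                    nvidia.com/gpu: 4

The operator injects MASTER_ADDR, WORLD_SIZE, and RANK into each pod, which torch.distributed reads when initializing DDP.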

3. Splitting GPUs for Smaller Tasks

If your project doesn't need a full GPU (e.g., for simple AI inference), you can "slice" one using:
  • Multi-Instance GPU (MIG): Hard-partitioning a single A100/H100 into up to 7 smaller, isolated GPUs.
  • Time-Slicing: Allowing multiple Pods to take turns using the same GPU.
  • NVIDIA MPS: A software layer that lets multiple processes run concurrently on one GPU with lower overhead than time-slicing. 

How is splitting done?

Splitting GPUs for smaller tasks in Kubernetes is primarily a cluster configuration task managed through the NVIDIA GPU Operator. You generally do not need to rewrite your application code, though you must configure your K8s resource requests correctly.
There are three main methods for splitting a GPU, each with different setup requirements:

1. Time-Slicing (Overprovisioning)

  • What it is: Software-level sharing where multiple pods take turns using the GPU compute.
  • K8s Configuration: Create a ConfigMap (specifying the number of replicas, e.g., 4) and apply it to the NVIDIA Device Plugin (see the sketch after this list).
  • Tooling: Requires the NVIDIA GPU Operator.
  • Code Changes: None required. Your app sees one full GPU, but you must manually limit its memory usage in your code (e.g., gpu_memory_utilization in vLLM) because time-slicing does not provide memory isolation. 
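A sketch of that ConfigMap, assuming the GPU Operator runs in the gpu-operator namespace (the data key "any" is just a config name you reference later):

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: time-slicing-config
    namespace: gpu-operator
  data:
    any: |-
      version: v1
      sharing:
        timeSlicing:
          resources:
            - name: nvidia.com/gpu
              replicas: 4              # each physical GPU is advertised as 4 allocatable GPUs

You then point the operator at it, e.g. kubectl patch clusterpolicies.nvidia.com/cluster-policy -n gpu-operator --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}', after which each node reports nvidia.com/gpu: 4 per physical card.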

2. MIG (Multi-Instance GPU)

  • What it is: Hardware-level partitioning that creates up to 7 fully isolated "mini-GPUs" on A100 or H100 cards.
  • K8s Configuration: Set the mig-strategy to single or mixed in the GPU Operator and label your nodes with a specific MIG profile, e.g., nvidia.com/mig.config=all-1g.5gb (an example request follows this list).
  • Tooling: Managed by the NVIDIA GPU Operator.
  • Code Changes: None. Each partition appears to your pod as a standalone physical GPU with its own dedicated VRAM. 
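With the mixed strategy, each MIG profile shows up as its own resource name, so a pod simply requests a slice. A sketch, assuming the node has already been carved into 1g.5gb slices via the label above, and using a hypothetical image:

  apiVersion: v1
  kind: Pod
  metadata:
    name: mig-inference
  spec:
    containers:
      - name: app
        image: registry.example.com/inference:latest   # hypothetical inference image
        resources:
          limits:
            nvidia.com/mig-1g.5gb: 1                   # one isolated 1g.5gb slice with its own VRAM

Under the single strategy, by contrast, all partitions use one profile and are exposed as plain nvidia.com/gpu resources.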

3. MPS (Multi-Process Service)

  • What it is: Allows multiple processes to run concurrently on one GPU by sharing the same context.
  • K8s Configuration: Enable it via labels (e.g., nos.nebuly.com/gpu-partitioning=mps) and ensure pods include hostIPC: true (a pod sketch follows this list).
  • Tooling: Often requires third-party extensions like Nebuly's nos or specialized NVIDIA Helm charts.
  • Code Changes: Minimal to none, but containers must often run with the same User ID as the MPS server (default 1000). 
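A sketch of an MPS client pod under nos. The fractional resource name follows nos's memory-based naming (nvidia.com/gpu-10gb for a 10 GB slice) but is an assumption to verify against your setup, and the image is hypothetical:

  apiVersion: v1
  kind: Pod
  metadata:
    name: mps-inference
  spec:
    hostIPC: true                    # MPS clients must share IPC with the MPS control daemon
    securityContext:
      runAsUser: 1000                # must match the MPS server's user ID (default 1000)
    containers:
      - name: app
        image: registry.example.com/inference:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu-10gb: 1                     # assumed nos-style slice with a 10 GB soft limit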

Summary Table

Strategy     | Hardware       | Isolation             | Best For                  | Code Changes?
-------------|----------------|-----------------------|---------------------------|----------------------
Time-Slicing | Any NVIDIA GPU | Low (Shared VRAM)     | Dev/Testing               | No (but watch VRAM)
MIG          | A100 / H100    | High (Dedicated VRAM) | Production / Multi-tenant | No
MPS          | Any NVIDIA GPU | Medium (Soft limits)  | High-throughput Inference | Minimal (User ID/IPC)
