
Managing AI Workloads in Kubernetes and OpenShift with Modern GPUs (NVIDIA H100/H200)

 AI workloads demand significant computational resources, especially for training large models or
performing real-time inference. Modern GPUs like NVIDIA's H100 and H200 are designed to handle these demands effectively, but maximizing their utilization requires careful management. This article explores strategies for managing AI workloads in Kubernetes and OpenShift with GPUs, focusing on features like MIG (Multi-Instance GPU), time slicing, MPS (Multi-Process Service), and vGPU (Virtual GPU). Practical examples are included to make these concepts approachable and actionable.


1. Why GPUs for AI Workloads?

GPUs are ideal for AI workloads due to their massive parallelism and ability to perform complex computations faster than CPUs. However, these resources are expensive, so efficient utilization is crucial.

Modern GPUs like NVIDIA H100/H200 come with features like:

  • MIG (Multi-Instance GPU): Partitioning a single GPU into smaller instances.

  • Time slicing: Efficiently sharing GPU resources among multiple tasks.

  • MPS (Multi-Process Service): Reducing kernel launch overhead when multiple processes share a GPU.

  • vGPU (Virtual GPU): Enabling GPU resource sharing across virtual machines.


2. Kubernetes and OpenShift Overview

Kubernetes and OpenShift are popular platforms for orchestrating containerized workloads. Both support GPU-accelerated workloads through:

  • Device plugins: Enabling Kubernetes to detect and allocate GPUs (typically deployed by the NVIDIA GPU Operator; see the installation sketch after this list).

  • Node labeling and taints: Scheduling GPU-specific workloads efficiently.

  • Custom resource definitions (CRDs): Allowing users to define GPU resources like MIG partitions.
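
On OpenShift, this GPU stack is usually delivered through OperatorHub: the Node Feature Discovery Operator labels GPU nodes, and the NVIDIA GPU Operator then installs the driver, device plugin, and monitoring components. Below is a minimal Subscription sketch for the GPU Operator; the namespace and channel are illustrative, so check OperatorHub for the values currently supported in your cluster.

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator          # illustrative; create or reuse a namespace for the operator
spec:
  channel: stable                         # illustrative; verify the current channel in OperatorHub
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace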


3. Key Strategies for GPU Management

A. Using MIG for GPU Partitioning

MIG allows you to partition a single high-performance GPU (e.g., NVIDIA H100) into smaller, isolated GPU instances. This is especially useful for running multiple smaller AI workloads simultaneously.

Steps to Use MIG in Kubernetes:

  1. Enable MIG Mode:

    sudo nvidia-smi -mig 1
  2. Configure MIG Instances:

    sudo nvidia-smi mig -cgi 19,19,19 -C

    This creates three of the smallest (1g) GPU instances (profile ID 19) along with their compute instances (the -C flag). Note that profile names and memory sizes depend on the GPU model: the 1g profile is 1g.5gb on a 40 GB A100 but 1g.10gb on an 80 GB H100, and the resource name in the Pod spec below must match what your GPU actually exposes.

  3. Deploy a Workload with MIG Resources: Define MIG profiles in a Kubernetes Pod spec:

    apiVersion: v1
    kind: Pod
    metadata:
      name: ai-workload
    spec:
      containers:
      - name: app
        image: my-ai-model:latest
        resources:
          limits:
            nvidia.com/mig-1g.5gb: 1
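
If the nodes are managed by the NVIDIA GPU Operator, MIG can also be configured declaratively instead of running nvidia-smi on each node: the operator's MIG manager applies a profile based on the nvidia.com/mig.config node label, and the ClusterPolicy controls how MIG devices are advertised to Kubernetes. Below is a sketch of the relevant ClusterPolicy excerpt; the object name is whatever your installation created, and the strategy and profile names should be verified against the operator's documentation for your GPU model.

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  mig:
    strategy: mixed    # "mixed" exposes each MIG profile as its own resource, e.g. nvidia.com/mig-1g.10gb;
                       # "single" exposes uniform slices as plain nvidia.com/gpu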

B. Time Slicing for High Utilization

Time slicing lets multiple workloads share a GPU by dividing execution time. This is suitable for scenarios where latency is less critical.

Example:

In Kubernetes, time slicing is not automatic: you enable it by configuring the NVIDIA device plugin (or the GPU Operator) with a sharing configuration that advertises multiple replicas of each physical GPU. Once that configuration is applied, several Pods can request the same GPU and their work is interleaved on it.
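
A minimal sketch of such a sharing configuration, packaged as a ConfigMap for the GPU Operator (the namespace, ConfigMap name, and data key are illustrative, and the operator's ClusterPolicy must reference the ConfigMap through its devicePlugin.config section):

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator            # illustrative namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4                # one physical GPU is advertised as four schedulable GPUs

Pods then request the GPU exactly as before; with four replicas advertised, up to four such Pods can be scheduled onto one physical GPU: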

apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-workload
spec:
  containers:
  - name: app
    image: my-ai-model:latest
    resources:
      limits:
        nvidia.com/gpu: 1

C. Using MPS for Low Latency

MPS optimizes GPU sharing by funneling work from multiple processes through a single GPU context, which avoids costly context switches and lets their kernels run concurrently. This is ideal for inference workloads where low latency is critical.

Steps:

  1. Enable MPS:

    export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=100
    nvidia-cuda-mps-control -d
  2. Deploy Workloads: Processes that run on the same node and target the same GPU route their work through the MPS control daemon automatically while it is running; in Kubernetes this means scheduling the sharing Pods onto the node where MPS is active (a device-plugin based sketch follows).
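
Recent releases of the NVIDIA Kubernetes device plugin can also manage the MPS daemon for you, using the same sharing-configuration mechanism shown above for time slicing. This is a sketch and assumes a device-plugin version with MPS support; verify the exact format against the plugin's documentation:

version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 4        # up to four Pods share one GPU through a single MPS daemon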

D. vGPU for Virtualized Environments

vGPU allows multiple VMs or containers to share a single GPU. It’s particularly useful for environments running mixed workloads or with multiple tenants.

Steps to Set Up vGPU:

  1. Install NVIDIA vGPU Software: Follow NVIDIA’s vGPU documentation for installation.

  2. Allocate vGPU Profiles: Expose the vGPU profiles to Kubernetes or OpenShift and request them as extended resources in the Pod spec; the exact resource name below depends on the profiles configured in your environment.

    apiVersion: v1
    kind: Pod
    metadata:
      name: vgpu-app
    spec:
      containers:
      - name: app
        image: my-ai-model:latest
        resources:
          limits:
            nvidia.com/vgpu-profile-4c: 1
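
When the GPU consumers are virtual machines rather than containers, for example with OpenShift Virtualization (KubeVirt), the vGPU is attached to the VM definition instead of a Pod. A minimal sketch follows; the deviceName must match a mediated-device (vGPU) profile actually exposed on your hosts, so the value below is purely illustrative:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vgpu-vm
spec:
  domain:
    devices:
      gpus:
      - name: vgpu1
        deviceName: nvidia.com/GRID_H100-4C    # illustrative vGPU profile name
    resources:
      requests:
        memory: 8Gi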

4. Monitoring and Optimization

Efficient GPU utilization requires monitoring and tuning:

  • Monitoring Tools: Use NVIDIA DCGM (Data Center GPU Manager) to inspect GPUs on a node, or the DCGM exporter to feed GPU metrics into Prometheus and Grafana.

    dcgmi discovery -l
  • Optimize Scheduling: Use Kubernetes taints and tolerations to ensure only GPU-optimized workloads are scheduled on GPU nodes (a sketch follows the labeling example below).

Example of Node Labeling:

kubectl label node gpu-node-1 gpu=true
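
To keep general workloads off these nodes, you would also taint them (for example: kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule) and give GPU workloads a matching toleration plus a selector for the label above. A sketch of the relevant Pod spec fields; the taint key and value are assumptions, so use whatever convention your cluster follows:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-only-workload
spec:
  nodeSelector:
    gpu: "true"                # matches the label applied above
  tolerations:
  - key: nvidia.com/gpu        # assumed taint key
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: my-ai-model:latest
    resources:
      limits:
        nvidia.com/gpu: 1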

5. Practical Use Case: Training and Inference

Training:

  • Use a full GPU (or the largest MIG instance you can spare) for maximum training throughput.

  • Example:

    resources:
      limits:
        nvidia.com/gpu: 1

Inference:

  • Use a small MIG slice, MPS, or time slicing to serve low-latency inference workloads efficiently.

  • Example:

    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
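
Putting the sharing strategies together, several replicas of an inference server can be packed onto one physical GPU once time slicing or MPS is enabled. A sketch, assuming a node configured earlier to advertise four replicas per GPU (the image name is reused from the examples above):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
spec:
  replicas: 4                      # all four replicas can land on a single time-sliced GPU
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
      - name: app
        image: my-ai-model:latest
        resources:
          limits:
            nvidia.com/gpu: 1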

Conclusion

Managing AI workloads on Kubernetes and OpenShift with modern GPUs requires leveraging advanced GPU features like MIG, time slicing, MPS, and vGPU. These strategies enable efficient resource utilization, reduce costs, and maximize performance. By combining these approaches with proper monitoring and optimization, you can ensure your AI workloads run smoothly and effectively.

Start experimenting with these strategies today and unlock the full potential of your GPUs!

For more details, see NVIDIA's official guide: https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/

