TKGm PR-DR Site on vCloud Director Architecture

You build:

- vSphere + vCD + NSX-T + CSE on both sites.
- TKGm clusters deployed on the primary site.
- Velero set up to back up YAMLs and volumes (see the sketch after the table).
- The Harbor registry mirrored to the DR site.
- A tested restore of a cluster on the DR site using CSE + Velero.
- DNS (manual or automated) prepared to point to DR when needed.

Primary & DR Site Layer Comparison Table

| Layer | Component | Primary Site | DR Site | What Happens During DR? | Notes / Tools |
|-------|-----------|--------------|---------|-------------------------|---------------|
| 1️⃣ | Infrastructure | vSphere (ESXi, vCenter) | Same setup | DR vSphere takes over | Ensure hardware compatibility |
| 2️⃣ | Networking | NSX-T | Same NSX-T setup | DR NSX routes traffic | Replicate NSX segments, edge configs |
| 3️⃣ | Cloud Management | vCloud Director | vCloud Director | DR vCD deploys new VMs | Must sync templates across sites |
| 4️⃣ | K8s Provisioning | CSE (TKGm enabled) | CSE (same version) | DR CSE deploys TKGm cluster | Sync catalog/templates |
| 5️⃣ | Kubernetes Cluster | TKGm Cluster (Running) | TKGm Cluster (Rebuilt) | Apps are restored | Velero |
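To make the Velero step concrete, here is a minimal sketch of a backup taken on the primary site and the matching restore applied on the DR site after CSE rebuilds the TKGm cluster. It assumes Velero is installed in the `velero` namespace on both sites with a shared object-storage backup location; the `my-app` namespace and the resource names are illustrative.

```yaml
# Taken on the primary site: captures manifests and volume data
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: primary-apps-backup   # illustrative name
  namespace: velero
spec:
  includedNamespaces:
    - my-app                  # illustrative app namespace
  snapshotVolumes: true       # back up persistent volumes, not just YAMLs
  ttl: 720h0m0s               # keep the backup for 30 days
---
# Applied on the DR site once the rebuilt TKGm cluster is up
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: primary-apps-restore  # illustrative name
  namespace: velero
spec:
  backupName: primary-apps-backup
```

Because the DR cluster is rebuilt fresh by CSE rather than replicated, the restore only has to replay manifests and volume data on top of an empty TKGm cluster, which keeps the two sites loosely coupled.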
AI workloads demand significant computational resources, especially when training large models or serving real-time inference. Modern GPUs like NVIDIA's H100 and H200 are designed to handle these demands effectively, but maximizing their utilization requires careful management. This article explores strategies for managing AI workloads in Kubernetes and OpenShift with GPUs, focusing on features like MIG (Multi-Instance GPU), time slicing, MPS (Multi-Process Service), and vGPU (Virtual GPU). Practical examples are included to make these concepts approachable and actionable.

1. Why GPUs for AI Workloads?

GPUs are ideal for AI workloads because their massive parallelism lets them perform complex computations far faster than CPUs. However, these resources are expensive, so efficient utilization is crucial. Modern GPUs like the NVIDIA H100/H200 come with features like:

- MIG (Multi-Instance GPU): Partitioning a single GPU into smaller, isolated instances (see the sketch after this list).
- Time slicing: Efficiently sharing GPU resources across multiple workloads.
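As a concrete illustration of MIG, below is a minimal sketch of a pod requesting a single MIG slice. It assumes the NVIDIA GPU Operator is installed with MIG enabled in the mixed strategy, so slices are exposed as per-profile resources; the pod name, image tag, and the `1g.10gb` H100 profile are illustrative and will vary with your GPU model and MIG configuration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference               # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
      command: ["nvidia-smi", "-L"] # lists the MIG device visible to the pod
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1 # one 1g.10gb slice of an H100
```

Because each MIG slice has its own memory and compute partition, the container sees only its slice, so several such pods can share one physical H100 without interfering with each other.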