
TKGM PR-DR SITE ON VCLOUD DIRECTOR ARCHITECTURE

What you build, step by step:

  1. Deploy vSphere, vCloud Director (vCD), NSX-T, and Container Service Extension (CSE) on both sites.

  2. Deploy the TKGm clusters on the primary site.

  3. Set up Velero to back up the Kubernetes YAML manifests and the persistent volumes.

  4. Mirror the Harbor registry to the DR site.

  5. Test restoring a cluster on the DR site using CSE and Velero.

  6. Prepare DNS (manual or automated) so it can point to the DR site when needed.
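The backup and restore steps above can be sketched as the Velero CLI calls they come down to. This is a minimal sketch, not a definitive procedure: the backup name, the namespaces, and the 720-hour retention are hypothetical values, and the functions only build the command lines rather than run them. The flag that enables file-system backup of volumes differs between Velero versions, so it is left out here and should be taken from the documentation of the version you run.

```python
import shlex
from typing import List


def velero_backup_cmd(name: str, namespaces: List[str], ttl: str = "720h0m0s") -> List[str]:
    """Build (but do not run) a `velero backup create` command line.

    `name`, `namespaces` and `ttl` are illustrative values chosen by the caller.
    """
    return [
        "velero", "backup", "create", name,
        "--include-namespaces", ",".join(namespaces),
        "--ttl", ttl,
    ]


def velero_restore_cmd(backup_name: str) -> List[str]:
    """Build the matching `velero restore create` command for the DR cluster."""
    return ["velero", "restore", "create", "--from-backup", backup_name]


if __name__ == "__main__":
    # Hypothetical backup name and namespaces, for illustration only.
    print(shlex.join(velero_backup_cmd("apps-nightly", ["shop", "payments"])))
    print(shlex.join(velero_restore_cmd("apps-nightly")))
```

The backup command runs against the primary cluster and the restore command against the rebuilt DR cluster. For the restore to find the backup, both Velero installations need to point at the same backup storage location.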


Primary & DR Site Layer Comparison Table

| Layer | Component | Primary Site | DR Site | What Happens During DR? | Notes / Tools |
| --- | --- | --- | --- | --- | --- |
| 1️⃣ | Infrastructure | vSphere (ESXi, vCenter) | Same setup | DR vSphere takes over | Ensure hardware compatibility |
| 2️⃣ | Networking | NSX-T | Same NSX-T setup | DR NSX routes traffic | Replicate NSX segments, edge configs |
| 3️⃣ | Cloud Management | vCloud Director | vCloud Director | DR vCD deploys new VMs | Must sync templates across sites |
| 4️⃣ | K8s Provisioning | CSE (TKGM enabled) | CSE (same version) | DR CSE deploys TKGm cluster | Sync catalog/templates |
| 5️⃣ | Kubernetes Cluster | TKGm Cluster (Running) | TKGm Cluster (Rebuilt) | Apps are restored on DR cluster | Use Velero / GitOps to restore |
| 6️⃣ | Persistent Storage (PV) | CSI Volumes / Datastore | Restored from backup or replication | Apps regain their data | Use Velero+Restic, Zerto, or vSphere Replication |
| 7️⃣ | Container Images | Harbor Registry | Mirror / Backup Harbor | DR cluster pulls same images | Enable Harbor replication between sites |
| 8️⃣ | K8s Configs / YAMLs | GitOps (Flux / ArgoCD) or Velero | Same tools | Re-apply YAMLs in DR | Use Git source or Velero backup |
| 9️⃣ | DNS Failover | DNS entry points to primary | DNS updated to DR IP | DNS points to DR cluster ingress | Use manual switch or automated failover (Route53, Cloudflare) |
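One way to read the layer table is as a gate for the last layer: DNS should only be pointed at the DR site once every layer beneath it is in place there. The sketch below encodes that reading as a simple readiness check. The gating rule itself is an assumption made for illustration, and the `ready` flags are hypothetical inputs that an operator or a monitoring job would have to supply.

```python
from typing import Dict, List

# Layers 1-8 from the comparison table; layer 9 (DNS failover) is the step being gated.
DR_LAYERS: List[str] = [
    "Infrastructure",
    "Networking",
    "Cloud Management",
    "K8s Provisioning",
    "Kubernetes Cluster",
    "Persistent Storage (PV)",
    "Container Images",
    "K8s Configs / YAMLs",
]


def missing_layers(ready: Dict[str, bool]) -> List[str]:
    """Return the layers that are not yet confirmed ready on the DR site.

    A layer that is absent from `ready` counts as not ready.
    """
    return [layer for layer in DR_LAYERS if not ready.get(layer, False)]


def can_switch_dns(ready: Dict[str, bool]) -> bool:
    """Assumed rule: switch DNS only when no layer is missing."""
    return not missing_layers(ready)


if __name__ == "__main__":
    status = {layer: True for layer in DR_LAYERS}
    status["Persistent Storage (PV)"] = False  # e.g. volume restore still running
    print("Blocked by:", missing_layers(status))
    print("Switch DNS now?", can_switch_dns(status))
```

Treating an unreported layer as not ready is a deliberately conservative choice: a missing status blocks the DNS switch instead of silently allowing it.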

🥶 Cold vs 🔥 Hot Standby Table

| Type | What It Means | Pros | Cons | When to Use |
| --- | --- | --- | --- | --- |
| 🧊 Cold Standby | DR site is ready but not running TKGm | Cheaper | Slow failover (10–60 mins) | Most common, low-cost DR |
| 🔥 Hot Standby | DR cluster runs live + in sync | Fast failover | High cost, complexity | For mission-critical workloads |

🌐 DNS Redirection Table

| Method | Description | Tools | Speed | Recommended When |
| --- | --- | --- | --- | --- |
| 🛠️ Manual DNS Switch | You change DNS IP after failover | GoDaddy, Cloudflare, etc. | Slow (few minutes) | OK for small/low-impact apps |
| ⚙️ Automated Failover | Health check + switch IP automatically | Route53, NS1, F5 GSLBFast | Fast (seconds to 1 min) | Critical apps needing <1 min downtime |
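The "health check + switch IP" idea behind automated failover can be illustrated with a small state machine. This is a sketch of the general logic only, not how Route53, NS1, or F5 GSLB implement it: the threshold of three consecutive failures, the class name, and the example IP addresses (taken from the documentation ranges) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    """Tracks health-check results and decides which IP DNS should serve."""

    primary_ip: str
    dr_ip: str
    failure_threshold: int = 3      # assumed: fail over after 3 failed checks in a row
    consecutive_failures: int = 0
    failed_over: bool = False

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the IP to publish."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.failed_over = True
        # Once failed over, stay on DR until an operator resets the state.
        return self.dr_ip if self.failed_over else self.primary_ip


if __name__ == "__main__":
    state = FailoverState(primary_ip="192.0.2.10", dr_ip="198.51.100.10")
    for healthy in [True, False, False, False, True]:
        print(healthy, "->", state.record_check(healthy))
```

Requiring several consecutive failures avoids switching on a single dropped probe. Keeping the switch sticky means failback stays a deliberate manual step, which suits a cold-standby setup where the primary may come back before it is actually ready to serve traffic.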

