NVIDIA GPU Operator with Edera zones
This guide installs the NVIDIA GPU Operator configured for Edera, then runs a GPU workload on an Edera-backed pod. By the end you’ll have:
- The GPU Operator running with host-level components disabled and Edera’s sandbox device plugin advertising GPUs to kubelet.
- A pod using runtimeClassName: edera that gets a full NVIDIA GPU inside its zone.
This is the Kubernetes path for NVIDIA GPUs. For the standalone protect CLI path, see NVIDIA GPU passthrough to an Edera zone.
How it works
The GPU Operator normally installs the NVIDIA driver, container toolkit, and a device plugin on the host. With Edera, the NVIDIA driver runs inside the Edera zone kernel, not on the host—so those host-level components are turned off. Edera swaps in its own sandbox device plugin to advertise GPUs to kubelet.
This mirrors NVIDIA’s Kata “sandboxed workloads” deployment, but Edera does the passthrough differently. Like KVM, Edera passes the physical PCI device through to the guest—but it uses Xen with PVH zones and vPCI passthrough rather than VFIO. From the zone’s perspective it drives the real PCI device directly; there is no emulated PCI bus. Two things follow from this:
- The zone must run in PVH mode. PVH is a hard requirement for GPU passthrough to a zone.
- No device is bound to vfio-pci. Edera only passes through devices that are not attached to any host driver. This is correct for Edera, but it trips one of the GPU Operator’s validators; see Known issues and limitations.
For the design background—how GPUs reach a zone and why the driver runs inside it—see GPU support in Edera.
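The "not attached to any host driver" condition can be checked on a node through sysfs, where a bound PCI device carries a driver symlink in its device directory. The following is a minimal sketch rather than an Edera tool: the function name pci_bound_driver and its optional sysfs-root argument are hypothetical, and the PCI address in the usage note is only an example.

```shell
# Print the host driver bound to a PCI device, or "none" if it is unbound.
# $1: full PCI address including domain, e.g. 0000:2b:00.0
# $2: sysfs root (hypothetical override for testing; defaults to /sys)
pci_bound_driver() {
  addr="$1"
  sysfs="${2:-/sys}"
  link="$sysfs/bus/pci/devices/$addr/driver"
  if [ -L "$link" ]; then
    # The symlink target ends in the driver name, e.g. .../drivers/nvidia
    basename "$(readlink "$link")"
  else
    echo "none"
  fi
}
```

For example, pci_bound_driver 0000:2b:00.0 printing none would indicate that no host driver holds the device. Addresses with the domain prefix can be listed with lspci -D | grep -i nvidia.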
Prerequisites
Before starting:
- Edera runtime installed on your GPU nodes, with the edera RuntimeClass. The GPU Operator does not install or manage the runtime. See Install Edera. Verify the RuntimeClass:
  kubectl get runtimeclass edera
  If it’s missing, see Apply the Edera RuntimeClass.
- One or more NVIDIA GPUs on the target nodes. Confirm with lspci | grep -i nvidia on the node.
- Helm 3.x and kubectl configured against the cluster.
The images this guide uses are publicly available from ghcr.io/edera-dev—the sandbox device plugin (nvidia-sandbox-device-plugin) and an NVIDIA zone kernel image (zone-nvidiagpu-kernel). No credentials or access request are required to pull them.
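A quick way to catch a missing command-line tool before starting is to probe for each one. This is a small sketch under the assumption that the commands used in this guide (kubectl, helm, jq, and lspci on the node) are invoked by those names; missing_tools is a hypothetical helper name.

```shell
# Print the name of every listed command that is not found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Running missing_tools kubectl helm jq on your workstation, and missing_tools lspci on a GPU node, should print nothing if everything is in place. It does not check versions, so Helm 3.x still needs to be confirmed separately.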
Install the GPU Operator
Create the values file
Create values.yaml. This disables every host-level component and enables Edera’s sandbox device plugin:
# values.yaml
driver:
  enabled: false
toolkit:
  enabled: false
devicePlugin:
  enabled: false
dcgmExporter:
  enabled: false
gfd:
  enabled: false
migManager:
  enabled: false
vgpuDeviceManager:
  enabled: false
vfioManager:
  enabled: false
kataManager:
  enabled: false
sandboxWorkloads:
  enabled: true
  defaultWorkload: vm-passthrough
sandboxDevicePlugin:
  enabled: true
  repository: ghcr.io/edera-dev
  image: nvidia-sandbox-device-plugin
  version: v1.4.0-edera
  imagePullPolicy: Always
What each setting does:
| Setting | Value | Why |
|---|---|---|
| driver.enabled | false | The NVIDIA driver runs inside the Edera zone kernel, not on the host. |
| toolkit.enabled | false | The NVIDIA Container Toolkit isn’t used; Edera handles GPU exposure to the zone. |
| devicePlugin.enabled | false | Replaced by sandboxDevicePlugin below. |
| dcgmExporter.enabled | false | Host-level DCGM metrics don’t apply because the driver isn’t on the host. |
| gfd.enabled | false | GPU Feature Discovery is not used in this mode. |
| migManager.enabled | false | MIG is not managed from the host. |
| vgpuDeviceManager.enabled | false | vGPU is not used. |
| vfioManager.enabled | false | Edera passes GPUs through over vPCI, not VFIO, so devices are not bound to vfio-pci. |
| kataManager.enabled | false | Edera is not Kata-based. |
| sandboxWorkloads.enabled | true | Enables the sandboxed-workloads code path the device plugin runs under. |
| sandboxWorkloads.defaultWorkload | vm-passthrough | The workload config Edera nodes use. |
| sandboxDevicePlugin | Edera image | Edera’s device plugin, which advertises GPUs to kubelet. |
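Because one stray enabled: true would reinstall a host-level component, it can help to list which components the values file actually enables before installing. The sketch below is a rough text check, not a YAML parser: it assumes each component is a top-level key with an indented enabled: line beneath it, as in the file above, and enabled_components is a hypothetical helper name.

```shell
# List top-level keys in a values file whose nested "enabled:" is true.
# $1: path to the values file
enabled_components() {
  awk '
    # An unindented line starting with a letter opens a new top-level component.
    /^[A-Za-z]/ { key = $0; sub(/:.*/, "", key); next }
    # An indented "enabled: true" belongs to the most recent component.
    /^[ \t]+enabled:[ \t]*true[ \t]*$/ { print key }
  ' "$1"
}
```

For the values.yaml in this guide, enabled_components values.yaml should list only sandboxWorkloads and sandboxDevicePlugin.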
Install the chart
Add the NVIDIA Helm repository and install the chart, pinning the version:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 --values values.yaml
The initial pods come up shortly after install:
kubectl get pods -n gpu-operator
Expected output (names and suffixes vary):
NAME READY STATUS RESTARTS AGE
gpu-operator-...-node-feature-discovery-gc-... 1/1 Running 0 15s
gpu-operator-...-node-feature-discovery-master-... 1/1 Running 0 15s
gpu-operator-...-node-feature-discovery-worker-... 1/1 Running 0 15s
gpu-operator-... 1/1 Running 0 15s
After the operator reconciles, it rolls out the sandbox device plugin and a validator daemonset:
nvidia-sandbox-device-plugin-daemonset-... 1/1 Running 0 2m38s
nvidia-sandbox-validator-... 0/1 Init:Error 4 2m38s
The nvidia-sandbox-validator pod failing with Init:Error is expected on Edera and drives the ClusterPolicy into a notReady state. This does not stop GPUs from being advertised, but it has implications you should understand before going to production. See ClusterPolicy reports notReady.
Verify the install
List the GPUs discovered on a node:
kubectl get nodes -l nvidia.com/gpu.present -o json | \
  jq '.items[0].status.allocatable
    | with_entries(select(.key | startswith("nvidia.com/")))
    | with_entries(select(.value != "0"))'
Expected output (the exact product key depends on your hardware):
{
  "nvidia.com/GH100_H100L_94GB": "1"
}
This nvidia.com/<PRODUCT> key is the resource name you request in a pod spec. Note it for the next step.
Run a GPU workload
Create a deployment manifest, saved here as cuda-gpu.yaml, that runs in an Edera zone and requests one GPU. Replace the kernel image with the one you have access to, and the nvidia.com/<PRODUCT> resource name with the value from the previous step:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-gpu
  template:
    metadata:
      labels:
        app: cuda-gpu
      annotations:
        dev.edera/kernel: "ghcr.io/edera-dev/zone-nvidiagpu-kernel:6.18.33-nvidia-595.71.05"
        dev.edera/initial-memory-request: "8192"
        dev.edera/resource-policy: "static"
    spec:
      runtimeClassName: edera
      containers:
        - name: cuda
          image: nvidia/cuda:13.1.2-devel-ubuntu24.04
          command: ["/bin/sh", "-c"]
          args: ["sleep infinity"]
          resources:
            limits:
              nvidia.com/GH100_H100L_94GB: 1
Key fields:
- runtimeClassName: edera - schedules the pod into an Edera zone. See Deploy your app to Edera.
- dev.edera/kernel - the NVIDIA zone kernel image (it includes the matching driver).
- dev.edera/initial-memory-request - initial zone memory in MiB.
- dev.edera/resource-policy: static - recommended for GPU zones.
- resources.limits.nvidia.com/<PRODUCT> - the GPU resource advertised by the sandbox device plugin.
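Since the resource name differs by hardware, one option is to substitute it when applying the manifest instead of editing the file by hand. This is only a sketch: set_gpu_resource is a hypothetical helper, and it assumes the manifest contains a single nvidia.com/... key followed by a colon, as the example above does.

```shell
# Rewrite the nvidia.com/<PRODUCT> limit key in a manifest read from stdin.
# $1: the resource name advertised on your node, e.g. nvidia.com/<PRODUCT>
set_gpu_resource() {
  sed "s|nvidia\.com/[A-Za-z0-9_.-]*:|$1:|"
}
```

A possible use is set_gpu_resource "nvidia.com/<PRODUCT>" < cuda-gpu.yaml | kubectl apply -f -, with the placeholder replaced by the key noted in the previous step.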
Apply it and wait for the pod to reach Running:
kubectl apply -f cuda-gpu.yaml
kubectl rollout status deploy/cuda-gpu
Verify the GPU is visible inside the zone:
kubectl exec deploy/cuda-gpu -- nvidia-smi
Expected output (abridged):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05 Driver Version: 595.71.05 CUDA Version: 13.2 |
|-------------------------------+----------------------+----------------------+
| 0 NVIDIA H100 NVL Off | 00000000:00:00.0 Off | 0 |
+-------------------------------+----------------------+----------------------+
Success: a pod running in an isolated Edera zone now has a dedicated NVIDIA GPU.
Clean up the demo workload:
kubectl delete deploy/cuda-gpu
Known issues and limitations
ClusterPolicy reports notReady from the vfio-pci validation
The GPU Operator runs a validator (nvidia-validator) that, in sandboxed-workload mode, checks that GPUs are bound to the vfio-pci driver. Edera passes the GPU through to a PVH zone over vPCI rather than VFIO, and only passes through devices that are unbound from any host driver—so no device is bound to vfio-pci and this check fails.
What you see:
kubectl logs -n gpu-operator <nvidia-sandbox-validator-pod> -c vfio-pci-validation
level=info msg="GPU workload configuration: vm-passthrough"
level=info msg="Error: error validating vfio-pci driver installation: device not bound to 'vfio-pci'; device: 0000:2b:00.0 driver: ''"
This drives the ClusterPolicy into a notReady state:
kubectl get clusterpolicy cluster-policy -o jsonpath='{.status.state}'
notReady
The GPU Operator provides no supported way to disable only this check. We are investigating long-term solutions for this problem, and it will be fixed in a future release. Contact support@edera.dev if this issue is blocking your use of this feature.
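To tell this expected failure apart from an unrelated validator error, the log text can be matched directly. The sketch below assumes the message wording stays as shown above, which may change between GPU Operator versions; is_expected_vfio_failure is a hypothetical helper name.

```shell
# Succeed if the log text on stdin contains the vfio-pci binding error
# that is expected on Edera; fail for any other content.
is_expected_vfio_failure() {
  grep -q "error validating vfio-pci driver installation: device not bound to 'vfio-pci'"
}
```

For example, piping the kubectl logs command above into is_expected_vfio_failure && echo "expected on Edera" prints the message only for this known case. Any other error in the validator logs still deserves a closer look.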
Impact is environment-dependent. On a single-node setup, GPUs are still advertised and Edera GPU workloads still run despite the notReady state. We have not validated the impact on a production, multi-node cluster, where ClusterPolicy readiness may be consumed by other systems. Before relying on this in production, check whether any of the following apply in your environment:
- GitOps / CD health gates. Tools like Argo CD or Flux may mark the GPU Operator release as degraded because the ClusterPolicy never becomes ready.
- helm install --wait/--atomic. A strict readiness gate can time out, and --atomic would roll the release back. Avoid --atomic for this chart, and don’t block automation on ClusterPolicy readiness.
- Monitoring and alerting. Alerts keyed on ClusterPolicy readiness or on the nvidia-sandbox-validator pod will fire. Scope them accordingly.
- Per-node noise. Each GPU node runs its own nvidia-sandbox-validator pod stuck in Init:Error.
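For automation that should not block on ClusterPolicy readiness, one alternative is to gate on the GPU resource itself. The following is a sketch under assumptions, not a supported readiness check: gpu_allocatable and gpu_ready are hypothetical helpers, and they rely on kubectl's JSONPath output accepting a key whose dots are escaped with a backslash.

```shell
# Print the allocatable count of one GPU resource on one node.
# $1: node name   $2: resource name, e.g. nvidia.com/<PRODUCT>
gpu_allocatable() {
  node="$1"
  # Escape dots so JSONPath treats the resource name as a single key.
  escaped=$(printf '%s' "$2" | sed 's/\./\\./g')
  kubectl get node "$node" -o "jsonpath={.status.allocatable.$escaped}"
}

# Succeed when the node advertises at least one unit of the resource.
gpu_ready() {
  count=$(gpu_allocatable "$1" "$2") || return 1
  [ "${count:-0}" -ge 1 ] 2>/dev/null
}
```

A pipeline step could then poll with until gpu_ready <node> "nvidia.com/<PRODUCT>"; do sleep 5; done, ideally with its own timeout, rather than waiting for the ClusterPolicy to become ready.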
Despite the notReady state, GPUs should still be advertised and your workloads should still acquire and use them. Confirm GPUs are advertised on the node:
kubectl get nodes -l nvidia.com/gpu.present -o json | \
  jq '.items[].status.allocatable | with_entries(select(.key | startswith("nvidia.com/")))'
Then confirm a GPU workload can actually use its GPU. Run nvidia-smi (or your workload’s own GPU check) inside the pod that requested the GPU, substituting your own pod or deployment name. If you deployed the example from Run a GPU workload:
kubectl exec deploy/cuda-gpu -- nvidia-smi
A GPU can become unresponsive after a zone kernel crash
If a zone’s (guest) kernel crashes while it holds a GPU, the device can be left in an unresponsive state that a new zone cannot recover. You may observe:
- PCI hotplug errors on the host.
- The zone kernel no longer seeing the NVIDIA device.
- nvidia-smi or CUDA calls failing in subsequent workloads scheduled onto that GPU.
In this state the GPU is functionally unrecoverable—including from the host (dom0). Returning it to a working state generally requires fully draining power to the card so its onboard firmware re-initializes, and software cannot drain the auxiliary power connectors. On some hosts a PCIe slot power-management reset clears it, but not all hosts support that. A full host reboot is the only reliable recovery today.
Recovery requires a host reboot. Cordon and drain the affected node, then reboot the host to power-cycle the GPU:
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# reboot the host, then:
kubectl uncordon <node>
There is no software workaround for the wedged state itself today. Use a static resource policy for GPU zones, and contact support@edera.dev if you hit this repeatedly.
Uninstall
Find the generated release name and uninstall it:
helm list -n gpu-operator
helm uninstall <release-name> -n gpu-operator
helm uninstall does not remove the namespace created with --create-namespace. Remove it once you’ve confirmed nothing else uses it:
kubectl delete namespace gpu-operator
See also
- GPU support in Edera - how GPUs reach a zone
- NVIDIA GPU passthrough to an Edera zone - the standalone protect CLI path
- Deploy your app to Edera - using runtimeClassName: edera
- Install the DRA Driver for Edera Zones
- Troubleshooting