NVIDIA GPU Operator with Edera zones

8 min read · Intermediate


This guide installs the NVIDIA GPU Operator configured for Edera, then runs a GPU workload on an Edera-backed pod. By the end you’ll have:

  • The GPU Operator running with host-level components disabled and Edera’s sandbox device plugin advertising GPUs to kubelet.
  • A pod using runtimeClassName: edera that gets a full NVIDIA GPU inside its zone.

This is the Kubernetes path for NVIDIA GPUs. For the standalone protect CLI path, see NVIDIA GPU passthrough to an Edera zone.

How it works

The GPU Operator normally installs the NVIDIA driver, container toolkit, and a device plugin on the host. With Edera, the NVIDIA driver runs inside the Edera zone kernel, not on the host—so those host-level components are turned off. Edera swaps in its own sandbox device plugin to advertise GPUs to kubelet.

This mirrors NVIDIA’s Kata “sandboxed workloads” deployment, but Edera does the passthrough differently. Like KVM, Edera passes the physical PCI device through to the guest—but it uses Xen with PVH zones and vPCI passthrough rather than VFIO. From the zone’s perspective it drives the real PCI device directly; there is no emulated PCI bus. Two things follow from this:

  • The zone must run in PVH mode. PVH is a hard requirement for GPU passthrough to a zone.
  • No device is bound to vfio-pci. Edera only passes through devices that are not attached to any host driver. This is correct for Edera, but it trips one of the GPU Operator’s validators—see Known issues and limitations.

For the design background—how GPUs reach a zone and why the driver runs inside it—see GPU support in Edera.
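
To check the “not attached to any host driver” condition on a node, a small sysfs walk can list each NVIDIA PCI function together with whatever host driver is bound to it. This is a minimal sketch, not an Edera tool: the function name list_nvidia_host_drivers and its optional root-directory argument are hypothetical, while 0x10de (NVIDIA’s PCI vendor ID) and the /sys/bus/pci/devices layout are standard Linux.

```shell
# List NVIDIA PCI functions and the host driver bound to each ("none" if unbound).
# $1 (optional): sysfs PCI device directory; defaults to the real one.
list_nvidia_host_drivers() {
  root="${1:-/sys/bus/pci/devices}"
  for dev in "$root"/*; do
    [ -r "$dev/vendor" ] || continue
    # 0x10de is NVIDIA's PCI vendor ID.
    [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
    if [ -L "$dev/driver" ]; then
      # The driver entry is a symlink whose last path component is the driver name.
      drv=$(basename "$(readlink "$dev/driver")")
    else
      drv="none"
    fi
    printf '%s %s\n' "$(basename "$dev")" "$drv"
  done
}
```

Run list_nvidia_host_drivers on the GPU node. Per the description above, a GPU intended for passthrough should report none. The vendor match also picks up non-GPU NVIDIA functions, such as audio controllers, so expect extra rows on some hardware.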

Prerequisites

Before starting:

  • Edera runtime installed on your GPU nodes, with the edera RuntimeClass. The GPU Operator does not install or manage the runtime. See Install Edera. Verify the RuntimeClass:

    kubectl get runtimeclass edera

    If it’s missing, see Apply the Edera RuntimeClass.

  • One or more NVIDIA GPUs on the target nodes. Confirm with lspci | grep -i nvidia on the node.

  • Helm 3.x and kubectl configured against the cluster. The verification commands in this guide also use jq.

The images this guide uses are publicly available from ghcr.io/edera-dev—the sandbox device plugin (nvidia-sandbox-device-plugin) and an NVIDIA zone kernel image (zone-nvidiagpu-kernel). No credentials or access request are required to pull them.
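
A quick preflight on your workstation can catch a missing client tool before the install starts. The helper below is a sketch; missing_tools is a hypothetical name, and it only checks that each command is on PATH, not its version.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage:
#   missing_tools kubectl helm jq
# No output means all three are available.
```

The RuntimeClass check (kubectl get runtimeclass edera) still needs to be run separately, since it depends on cluster state rather than on local tooling.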

Install the GPU Operator

Create the values file

Create values.yaml. This disables every host-level component and enables Edera’s sandbox device plugin:

# values.yaml
driver:
  enabled: false
toolkit:
  enabled: false
devicePlugin:
  enabled: false
dcgmExporter:
  enabled: false
gfd:
  enabled: false
migManager:
  enabled: false
vgpuDeviceManager:
  enabled: false
vfioManager:
  enabled: false
kataManager:
  enabled: false

sandboxWorkloads:
  enabled: true
  defaultWorkload: vm-passthrough

sandboxDevicePlugin:
  enabled: true
  repository: ghcr.io/edera-dev
  image: nvidia-sandbox-device-plugin
  version: v1.4.0-edera
  imagePullPolicy: Always

What each setting does:

| Setting | Value | Why |
| --- | --- | --- |
| driver.enabled | false | The NVIDIA driver runs inside the Edera zone kernel, not on the host. |
| toolkit.enabled | false | The NVIDIA Container Toolkit isn’t used; Edera handles GPU exposure to the zone. |
| devicePlugin.enabled | false | Replaced by sandboxDevicePlugin below. |
| dcgmExporter.enabled | false | Host-level DCGM metrics don’t apply because the driver isn’t on the host. |
| gfd.enabled | false | GPU Feature Discovery is not used in this mode. |
| migManager.enabled | false | MIG is not managed from the host. |
| vgpuDeviceManager.enabled | false | vGPU is not used. |
| vfioManager.enabled | false | Edera passes GPUs through over vPCI, not VFIO, so devices are not bound to vfio-pci. |
| kataManager.enabled | false | Edera is not Kata-based. |
| sandboxWorkloads.enabled | true | Enables the sandboxed-workloads code path the device plugin runs under. |
| sandboxWorkloads.defaultWorkload | vm-passthrough | The workload config Edera nodes use. |
| sandboxDevicePlugin | Edera image | Edera’s device plugin, which advertises GPUs to kubelet. |
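
Before installing, it can help to confirm that the file on disk matches the table. The sketch below prints each top-level section’s enabled flag. It assumes the simple two-space layout shown in this guide and is not a general YAML parser; show_enabled_flags is a hypothetical helper name.

```shell
# Print "<section>=<value>" for every top-level section that has an
# "enabled:" key indented by two spaces directly beneath it.
show_enabled_flags() {
  awk '
    /^[A-Za-z_][A-Za-z0-9_]*:/ { section = $1; sub(/:$/, "", section) }
    /^  enabled:/              { print section "=" $2 }
  ' "$1"
}

# Usage:
#   show_enabled_flags values.yaml
```

For the values.yaml above, this should print nine =false lines for the host-level components, followed by sandboxWorkloads=true and sandboxDevicePlugin=true.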

Install the chart

Add the NVIDIA Helm repository and install the chart, pinning the version:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.1 --values values.yaml

The initial pods come up shortly after install:

kubectl get pods -n gpu-operator

Expected output (names and suffixes vary):

NAME                                                             READY   STATUS    RESTARTS   AGE
gpu-operator-...-node-feature-discovery-gc-...                   1/1     Running   0          15s
gpu-operator-...-node-feature-discovery-master-...               1/1     Running   0          15s
gpu-operator-...-node-feature-discovery-worker-...               1/1     Running   0          15s
gpu-operator-...                                                 1/1     Running   0          15s

After the operator reconciles, it rolls out the sandbox device plugin and a validator daemonset:

nvidia-sandbox-device-plugin-daemonset-...                       1/1     Running      0          2m38s
nvidia-sandbox-validator-...                                     0/1     Init:Error   4          2m38s

⚠️

The nvidia-sandbox-validator pod failing with Init:Error is expected on Edera and drives the ClusterPolicy into a notReady state. This does not stop GPUs from being advertised, but it has implications you should understand before going to production. See ClusterPolicy reports notReady from the vfio-pci validation.
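
Because one failing pod is expected here, a plain “are all pods Running?” check is not a useful health signal. As a sketch, the filter below reads kubectl get pods --no-headers output and separates the expected validator failure from anything else that is unhealthy. classify_pods is a hypothetical helper, and the pod-name prefix is taken from the output shown above.

```shell
# Read `kubectl get pods --no-headers` output on stdin.
# Columns: NAME READY STATUS RESTARTS AGE
# Healthy pods are skipped; the sandbox validator is reported as expected.
classify_pods() {
  awk '
    $3 == "Running" || $3 == "Completed" { next }
    $1 ~ /^nvidia-sandbox-validator-/     { print "expected", $1, $3; next }
                                          { print "unexpected", $1, $3 }
  '
}

# Usage:
#   kubectl get pods -n gpu-operator --no-headers | classify_pods
```

Any line starting with unexpected points at a problem other than the known vfio-pci validation failure.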

Verify the install

List the GPUs discovered on a node:

kubectl get nodes -l nvidia.com/gpu.present -o json | \
  jq '.items[0].status.allocatable
    | with_entries(select(.key | startswith("nvidia.com/")))
    | with_entries(select(.value != "0"))'

Expected output (the exact product key depends on your hardware):

{
  "nvidia.com/GH100_H100L_94GB": "1"
}

This nvidia.com/<PRODUCT> key is the resource name you request in a pod spec. Note it for the next step.

Run a GPU workload

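Create a deployment that runs in an Edera zone and requests one GPU, and save it as cuda-gpu.yaml. Replace the kernel image with the zone-nvidiagpu-kernel tag you intend to use, and replace the nvidia.com/<PRODUCT> resource name (nvidia.com/GH100_H100L_94GB in this example) with the value from the previous step: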

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-gpu
  template:
    metadata:
      labels:
        app: cuda-gpu
      annotations:
        dev.edera/kernel: "ghcr.io/edera-dev/zone-nvidiagpu-kernel:6.18.33-nvidia-595.71.05"
        dev.edera/initial-memory-request: "8192"
        dev.edera/resource-policy: "static"
    spec:
      runtimeClassName: edera
      containers:
        - name: cuda
          image: nvidia/cuda:13.1.2-devel-ubuntu24.04
          command: ["/bin/sh", "-c"]
          args: ["sleep infinity"]
          resources:
            limits:
              nvidia.com/GH100_H100L_94GB: 1

Key fields:

  • runtimeClassName: edera - schedules the pod into an Edera zone. See Deploy your app to Edera.
  • dev.edera/kernel - the NVIDIA zone kernel image (it includes the matching driver).
  • dev.edera/initial-memory-request - initial zone memory in MiB.
  • dev.edera/resource-policy: static - recommended for GPU zones.
  • resources.limits.nvidia.com/<PRODUCT> - the GPU resource advertised by the sandbox device plugin.
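
If your node advertises a different product key, you can substitute it without editing the file by hand. The following is a sketch only: set_gpu_resource is a hypothetical helper, and it assumes the manifest still contains the example key nvidia.com/GH100_H100L_94GB.

```shell
# Print the manifest with the example GPU resource name replaced.
# $1: manifest path, $2: resource name advertised on your node.
set_gpu_resource() {
  manifest="$1"; resource="$2"
  case "$resource" in
    nvidia.com/*) ;;
    *) echo "resource name must start with nvidia.com/" >&2; return 1 ;;
  esac
  sed "s|nvidia\.com/GH100_H100L_94GB|$resource|g" "$manifest"
}

# Usage (substitute the key reported for your node):
#   set_gpu_resource cuda-gpu.yaml "nvidia.com/<PRODUCT>" | kubectl apply -f -
```

Writing the result to stdout leaves the original file untouched, so the same manifest can be reused across node types.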

Apply it and wait for the pod to reach Running:

kubectl apply -f cuda-gpu.yaml
kubectl rollout status deploy/cuda-gpu

Verify the GPU is visible inside the zone:

kubectl exec deploy/cuda-gpu -- nvidia-smi

Expected output (abridged):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05    Driver Version: 595.71.05    CUDA Version: 13.2     |
|-------------------------------+----------------------+----------------------+
|   0  NVIDIA H100 NVL      Off | 00000000:00:00.0 Off |                    0 |
+-------------------------------+----------------------+----------------------+

Success—a pod running in an isolated Edera zone now has a dedicated NVIDIA GPU.
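
In the example, the driver version that nvidia-smi reports (595.71.05) matches the suffix of the kernel image tag. A scripted comparison could look like the sketch below. It assumes the tag follows the <kernel>-nvidia-<driver> pattern seen in the example, which this guide does not state as a guarantee, and both function names are hypothetical.

```shell
# Extract the driver version from nvidia-smi output read on stdin.
driver_from_smi() {
  sed -n 's/.*Driver Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Extract the driver version from a kernel image reference whose tag
# looks like <kernel>-nvidia-<driver> (assumed format).
driver_from_kernel_tag() {
  tag="${1##*:}"   # keep only the part after the last colon (the tag)
  case "$tag" in
    *-nvidia-*) printf '%s\n' "${tag##*-nvidia-}" ;;
    *) return 1 ;;
  esac
}

# Usage:
#   smi=$(kubectl exec deploy/cuda-gpu -- nvidia-smi | driver_from_smi)
#   img=$(driver_from_kernel_tag "ghcr.io/edera-dev/zone-nvidiagpu-kernel:6.18.33-nvidia-595.71.05")
#   [ "$smi" = "$img" ] && echo "driver matches kernel image"
```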

Clean up the demo workload:

kubectl delete deploy/cuda-gpu

Known issues and limitations

ClusterPolicy reports notReady from the vfio-pci validation

The GPU Operator runs a validator (nvidia-validator) that, in sandboxed-workload mode, checks that GPUs are bound to the vfio-pci driver. Edera passes the GPU through to a PVH zone over vPCI rather than VFIO, and only passes through devices that are unbound from any host driver—so no device is bound to vfio-pci and this check fails.

What you see:

kubectl logs -n gpu-operator <nvidia-sandbox-validator-pod> -c vfio-pci-validation
level=info msg="GPU workload configuration: vm-passthrough"
level=info msg="Error: error validating vfio-pci driver installation: device not bound to 'vfio-pci'; device: 0000:2b:00.0 driver: ''"

This drives the ClusterPolicy into a notReady state:

kubectl get clusterpolicy cluster-policy -o jsonpath='{.status.state}'
notReady

The GPU Operator provides no supported way to disable only this check. We are investigating long-term solutions for this problem, and it will be fixed in a future release. Contact support@edera.dev if this issue is blocking your use of this feature.

⚠️

Impact is environment-dependent. On a single-node setup, GPUs are still advertised and Edera GPU workloads still run despite the notReady state. We have not validated the impact on a production, multi-node cluster, where ClusterPolicy readiness may be consumed by other systems. Before relying on this in production, check whether any of the following apply in your environment:

  • GitOps / CD health gates. Tools like Argo CD or Flux may mark the GPU Operator release as degraded because the ClusterPolicy never becomes ready.
  • helm install --wait / --atomic. A strict readiness gate can time out, and --atomic would roll the release back. Avoid --atomic for this chart, and don’t block automation on ClusterPolicy readiness.
  • Monitoring and alerting. Alerts keyed on ClusterPolicy readiness or on the nvidia-sandbox-validator pod will fire. Scope them accordingly.
  • Per-node noise. Each GPU node runs its own nvidia-sandbox-validator pod stuck in Init:Error.

Despite the notReady state, GPUs should still be advertised and your workloads should still acquire and use them. Confirm GPUs are advertised on the node:

kubectl get nodes -l nvidia.com/gpu.present -o json | \
  jq '.items[].status.allocatable | with_entries(select(.key | startswith("nvidia.com/")))'

Then confirm a GPU workload can actually use its GPU. Run nvidia-smi (or your workload’s own GPU check) inside the pod that requested the GPU—substitute your own pod or deployment name. If you deployed the example from Run a GPU workload:

kubectl exec deploy/cuda-gpu -- nvidia-smi
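
If your automation needs a readiness gate, one option is to key it on advertised GPUs rather than on ClusterPolicy state. The sketch below assumes it is fed the pretty-printed allocatable object that the jq command above emits, one key per line; gpus_advertised is a hypothetical helper.

```shell
# Exit 0 if stdin contains at least one nvidia.com/ resource with a
# non-zero quantity, in the form:   "nvidia.com/<PRODUCT>": "1"
gpus_advertised() {
  awk -F'"' '
    $2 ~ /^nvidia\.com\// && $4 != "0" { found = 1 }
    END { exit !found }
  '
}

# Usage (requires cluster access and jq):
#   kubectl get nodes -l nvidia.com/gpu.present -o json | \
#     jq '.items[].status.allocatable | with_entries(select(.key | startswith("nvidia.com/")))' | \
#     gpus_advertised && echo "GPUs advertised"
```

This only shows that a GPU is schedulable. It does not replace the nvidia-smi check inside a zone.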

A GPU can become unresponsive after a zone kernel crash

If a zone’s (guest) kernel crashes while it holds a GPU, the device can be left in an unresponsive state that a new zone cannot recover. You may observe:

  • PCI hotplug errors on the host.
  • The zone kernel no longer seeing the NVIDIA device.
  • nvidia-smi or CUDA calls failing in subsequent workloads scheduled onto that GPU.

In this state the GPU is functionally unrecoverable—including from the host (dom0). Returning it to a working state generally requires fully draining power to the card so its onboard firmware re-initializes, and software cannot drain the auxiliary power connectors. On some hosts a PCIe slot power-management reset clears it, but not all hosts support that. A full host reboot is the only reliable recovery today.

⚠️

Recovery requires a host reboot. Cordon and drain the affected node, then reboot the host to power-cycle the GPU:

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# reboot the host, then:
kubectl uncordon <node>

There is no software workaround for the wedged state itself today. Use a static resource policy for GPU zones, and contact support@edera.dev if you hit this repeatedly.
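
When the symptom shows up in a pod rather than on a known node, the first step is finding which node hosts it. The sketch below wraps the cordon and drain commands from above around that lookup. drain_node_for_pod is a hypothetical helper, and it deliberately stops before the reboot, which you still perform yourself.

```shell
# Cordon and drain the node that hosts a given pod, then print the node name.
# $1: pod name, $2 (optional): namespace, defaults to "default".
drain_node_for_pod() {
  pod="$1"; ns="${2:-default}"
  node=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}') || return 1
  [ -n "$node" ] || { echo "no node found for pod $pod" >&2; return 1; }
  # Send kubectl's own messages to stderr so stdout carries only the node name.
  kubectl cordon "$node" >&2 &&
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data >&2 &&
    printf '%s\n' "$node"
}

# Usage:
#   node=$(drain_node_for_pod <pod-name>)
#   # reboot the host, then:
#   kubectl uncordon "$node"
```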

Uninstall

Find the generated release name and uninstall it:

helm list -n gpu-operator
helm uninstall <release-name> -n gpu-operator
ℹ️

helm uninstall does not remove the namespace created with --create-namespace. Remove it once you’ve confirmed nothing else uses it:

kubectl delete namespace gpu-operator
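
To script the uninstall, you can filter the release list instead of copying the name by hand. This sketch assumes releases created with --generate-name start with the chart name gpu-operator-, which matches the pod names shown earlier; gpu_operator_releases is a hypothetical helper.

```shell
# Read release names (one per line) on stdin and print the ones that
# start with "gpu-operator-".
gpu_operator_releases() {
  grep -E '^gpu-operator-'
}

# Usage (helm list -q prints release names only):
#   helm list -n gpu-operator -q | gpu_operator_releases
```

Review the printed names before passing any of them to helm uninstall.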
