tlaguz's webpage

28 Mar 2024

How to configure multiple GPU passthrough with amdgpu, rocm 6.0 and proxmox (kvm)

The problem

Configuring GPU passthrough is pretty straightforward. Alas, even though everything may seem to work correctly, including torch, torch.distributed will fail with a mysterious error:

peugeot:1741:1741 [0] /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/enqueue.cc:1013 NCCL WARN Cuda failure 'the operation cannot be performed in the present state'
peugeot:1741:1741 [0] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/group.cc:184 -> 1
peugeot:1741:1741 [0] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/group.cc:346 -> 1
peugeot:1741:1741 [0] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/group.cc:431 -> 1
peugeot:1741:1741 [0] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/group.cc:116 -> 1

peugeot:1742:1742 [1] /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/enqueue.cc:1013 NCCL WARN Cuda failure 'the operation cannot be performed in the present state'
peugeot:1742:1742 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/group.cc:184 -> 1
peugeot:1742:1742 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/group.cc:346 -> 1
peugeot:1742:1742 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/group.cc:431 -> 1
peugeot:1742:1742 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/group.cc:116 -> 1
Traceback (most recent call last):
  File "/home/tlaguz/.virtualenvs/pytorch/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/tlaguz/.virtualenvs/pytorch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
Traceback (most recent call last):
    work = group.broadcast([tensor], opts)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
...
RuntimeError: HIP error: the operation cannot be performed in the present state
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

This happens because ROCm requires the PCI-E atomics capability on the GPUs and on the PCI-E root port. You can verify whether atomics are available by checking dmesg:

PCI-E atomics are available and not a problem:

root@peugeot:~# dmesg | grep -i atomi
[    0.159869] DMA: preallocated 4096 KiB GFP_KERNEL pool for atomic allocations
[    0.160000] DMA: preallocated 4096 KiB GFP_KERNEL|GFP_DMA pool for atomic allocations
[    0.160126] DMA: preallocated 4096 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations

PCI-E atomics are not available:
root@peugeot:~# dmesg | grep -i atomi
[    0.160296] DMA: preallocated 4096 KiB GFP_KERNEL pool for atomic allocations
[    0.160434] DMA: preallocated 4096 KiB GFP_KERNEL|GFP_DMA pool for atomic allocations
[    0.160572] DMA: preallocated 4096 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
[    4.450501] amdgpu 0000:01:00.0: amdgpu: PCIE atomic ops is not supported
[    6.830826] amdgpu 0000:02:00.0: amdgpu: PCIE atomic ops is not supported

and lspci, but here is the trap. A while ago there was a problem with passing the PCI-E atomics capability of the GPU through to the VM. The fix back then was to set the register at 0x80 to 40, indicating that the GPU is capable of PCI-E atomics. This is no longer the solution: as of 2024-03 the problem lies elsewhere. For PCI-E atomics to work, every node between the CPU and the GPU also has to support atomics. QEMU adds a "PCI bridge: Red Hat, Inc. QEMU PCIe Root port" which doesn't advertise the atomics capability.

root@peugeot:~# lspci -d 1002:73bf -vvv | grep -i atom
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
			 AtomicOpsCtl: ReqEn+
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
			 AtomicOpsCtl: ReqEn+
root@peugeot:~/nvtop# lspci | grep Root
00:1c.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
root@peugeot:~/nvtop# lspci -s 00:1c -vvv | grep -i atomi
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
			 AtomicOpsCtl: ReqEn- EgressBlck-
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
			 AtomicOpsCtl: ReqEn- EgressBlck-
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
			 AtomicOpsCtl: ReqEn- EgressBlck-
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
			 AtomicOpsCtl: ReqEn- EgressBlck-

Here we can see that atomics are indeed available on the GPUs, but not on the root ports.

There is an ongoing discussion in the community about how to implement this advertising. Meanwhile, there is no easy way to force-enable atomics on the root port.
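To check the whole chain at once instead of grepping device by device, here is a small helper (a sketch; the check_atomics name is mine) that reads lspci -vvv output and flags every device whose AtomicOpsCap lacks 64-bit completion support:

```shell
# Sketch: scan `lspci -vvv` output and report, per device, whether the
# capability block advertises 64-bit AtomicOps completion.
check_atomics() {
  awk '
    # A device header starts at column 0, e.g. "01:00.0 VGA compatible ..."
    /^[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7]/ { dev = $1 }
    /AtomicOpsCap:/ {
      if ($0 ~ /64bit[+]/) print dev " atomics OK"
      else                 print dev " missing 64-bit atomic completion"
    }'
}
# Usage on the guest:
#   lspci -vvv | check_atomics
```

With the stock QEMU root ports, every 00:1c.* entry should come back as missing, matching the output above.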

The solution

I found a patch which proposes adding an x-atomic-completion property to the pcie-root-port device: https://patchew.org/QEMU/20230420153839.167418-1-robin@streamhpc.com/. The site suggests that it has been merged, but I couldn't find it in the QEMU repo.

My solution is to compile a custom version of QEMU which force-advertises atomics on every pcie-root-port device. Why not make it configurable? I failed to find a way to configure GPU passthrough in Proxmox such that the cards would be connected to a pcie-root-port added via args: in the VM's qm.conf file.

  This solution is ugly and may break things. It lets guests wrongly assume that every device is connected to a bus which supports atomics. For my use case this is not a problem.

Prerequisites

Below is my full configuration. Check it against yours and assess which options are relevant for you.

Hardware:

  • Motherboard: GENOAD8X-2T/BCM
  • CPU: AMD EPYC 9274F 24-Core Processor
  • GPU: 2x Gigabyte Radeon RX 6900 XT GAMING OC

BIOS:

  • Advanced -> Chipset Configuration -> PCIE ASPM -> Disabled (I found it to cause some PCI-E link errors)
  • Advanced -> PCI Subsystem Settings -> Re-Size BAR Support -> On
  • Advanced -> PCI Subsystem Settings -> SR-IOV Support -> On
  • Advanced -> AMD CBS -> NBIO Common Options -> IOMMU -> Enabled

OS:

  • Proxmox: pve-manager/8.1.5/60e01c6ac2325b3f (running kernel: 6.5.13-3-pve)
  • booted in UEFI mode
  • GRUB_CMDLINE_LINUX=“amd_iommu=on iommu=pt initcall_blacklist=sysfb_init pcie_aspm=off”
  • blacklist amdgpu and radeon drivers

Guest:

  • Ubuntu 22.04 running HWE kernel: 6.5.0-26-generic
  • Machine type q35; BIOS SeaBIOS, not OVMF (UEFI) (OVMF looks like it causes problems with the PCIe BAR, making GPU p2p communication impossible)
  • GPUs are passed with options: All Functions, ROM-Bar, PCI-Express – all available from Proxmox GUI
  • ROCM 6.0 installed using the apt method
  • torch installed from AMD’s source: https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/install-pytorch.html

The patch

We’ll be building the pve-qemu-kvm package. To download and configure the pve-qemu repository and install the build requirements:

~# apt update
~# apt install build-essential git
~# apt install debhelper-compat=13 check libacl1-dev libaio-dev libattr1-dev libcap-ng-dev libcurl4-gnutls-dev libepoxy-dev libfdt-dev libgbm-dev libglusterfs-dev libgnutls28-dev libiscsi-dev libjpeg-dev libnuma-dev libpci-dev libpixman-1-dev libproxmox-backup-qemu0-dev librbd-dev libsdl1.2-dev libseccomp-dev libslirp-dev libspice-protocol-dev libspice-server-dev libsystemd-dev liburing-dev libusb-1.0-0-dev libusbredirparser-dev libvirglrenderer-dev libzstd-dev meson python3-sphinx python3-sphinx-rtd-theme python3-venv quilt uuid-dev xfslibs-dev lintian
~# git clone git://git.proxmox.com/git/pve-qemu.git --recurse-submodules
~# cd pve-qemu/qemu/
~/pve-qemu/qemu/# meson subprojects download

Next we have to extract the patch from https://patchew.org/QEMU/20230420153839.167418-1-robin@streamhpc.com/:

~# mkdir ~/tmp
~# cd ~/tmp
~/tmp# git clone https://github.com/patchew-project/qemu.git
~/tmp# cd qemu
~/tmp/qemu# git format-patch -1 21771e5
~/tmp/qemu# cp 0001-pcie-Allow-generic-PCIE-root-port-to-enable-atomic-c.patch ~/pve-qemu/qemu/

Apply the patch. First validate that it applies cleanly:

~# cd pve-qemu/qemu/
~/pve-qemu/qemu# git apply --stat 0001-pcie-Allow-generic-PCIE-root-port-to-enable-atomic-c.patch
 hw/pci-bridge/gen_pcie_root_port.c |    2 ++
 hw/pci/pcie.c                      |    6 ++++++
 include/hw/pci/pcie_port.h         |    3 +++
 3 files changed, 11 insertions(+)
 
~/pve-qemu/qemu# git am 0001-pcie-Allow-generic-PCIE-root-port-to-enable-atomic-c.patch

Now you have a patched version where you can set x-atomic-completion on a PCIe root port. If you can configure Proxmox in such a way that this flag is set to true, then you can continue from here on your own.
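For reference, with the unmodified patch the property would be set on a manually defined root port, e.g. through the args: line in the VM's qm.conf. This is a sketch only; the id/addr values are placeholders, and as noted earlier I found no way to make Proxmox attach the passed-through GPUs to such a port:

```
args: -device pcie-root-port,id=atomicport,bus=pcie.0,addr=0x10,x-atomic-completion=true
```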

I simply changed the if statement in hw/pci/pcie.c to always advertise the capability:

-    if (s->atomic_completion) {
+    if (true) {
        /* PCIe requires setting both comp32 and comp64 if either is supported */
        pci_set_long(dev->config + dev->exp.exp_cap + PCI_EXP_DEVCAP2,
                     PCI_EXP_DEVCAP2_ATOMIC_COMP32 | PCI_EXP_DEVCAP2_ATOMIC_COMP64);
    }
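If you prefer not to edit the file by hand, the same change can be scripted with sed. The force_atomics wrapper below is my own name, and it assumes the condition looks exactly as in the patched source above:

```shell
# Sketch: rewrite the condition in a given pcie.c so the atomic-completion
# capability is always advertised, mirroring the manual edit above.
force_atomics() {
  sed -i 's/if (s->atomic_completion)/if (true)/' "$1"
}
# Usage (from ~/pve-qemu/qemu):
#   force_atomics hw/pci/pcie.c
```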

Next, compile and install the new qemu-kvm:

~# cd ~/pve-qemu
~/pve-qemu# make
~/pve-qemu# apt install ./pve-qemu-kvm_8.1.5-4_amd64.deb

There is no need to restart the hypervisor; a full shutdown and start of the guest machine is sufficient.

Verification

root@peugeot:~# lspci -d 1002:73bf 
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c0)
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c0)
root@peugeot:~# lspci -d 1002:73bf -vvv | grep -i atom
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
			 AtomicOpsCtl: ReqEn+
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
			 AtomicOpsCtl: ReqEn+
root@peugeot:~# lspci | grep Root
00:1c.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
root@peugeot:~# lspci -s 00:1c -vvv | grep -i atomi
			 AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS-
			 AtomicOpsCtl: ReqEn- EgressBlck-
			 AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS-
			 AtomicOpsCtl: ReqEn- EgressBlck-
			 AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS-
			 AtomicOpsCtl: ReqEn- EgressBlck-
			 AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS-
			 AtomicOpsCtl: ReqEn- EgressBlck-
root@peugeot:~# dmesg | grep -i atomi
[    0.160097] DMA: preallocated 4096 KiB GFP_KERNEL pool for atomic allocations
[    0.160235] DMA: preallocated 4096 KiB GFP_KERNEL|GFP_DMA pool for atomic allocations
[    0.160364] DMA: preallocated 4096 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations

We can see that atomics (32bit+ 64bit+) are now present on the root ports and the error is gone from dmesg. torch.distributed now works without an issue.