4. With my new udev rules in place, I was able to get specific NVMe controllers (selected by bus-device-function) to unbind from the Linux nvme driver and bind to vfio-pci. However, I made a couple of observations in the kernel log (dmesg). In particular, I was drawn to the following for an NVMe controller at BDF 0000:40:00.0, for which I had a udev rule to unbind from nvme and bind to vfio-pci:

[ 35.534279] nvme nvme1: pci function 0000:40:00.0
[ 37.964945] nvme nvme1: failed to mark controller live
[ 37.964947] nvme nvme1: Removing after probe failure status: 0

One theory I have for the above is that my udev RUN rule was invoked while the nvme driver’s probe() was still running on this controller, and the unbind request came in before probe() completed, hence the “nvme1: failed to mark controller live”. This has left me wondering whether, instead of triggering on (Event 2) when the bind occurs, I should instead derive a trigger from the “last” udev event, an “add”, where the NVMe namespaces are instantiated. Of course, I’d need to know ahead of time how many namespaces exist on that controller so that I could trigger on the last one. I’m wondering if that may help avoid what looks like a complaint in the middle of probe() on that particular controller. Then again, maybe I can just safely ignore it and not worry about it at all? Thoughts?
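For reference, a rule along these lines is what I mean by triggering on (Event 2), the nvme driver binding to the controller. This is only a sketch: the rules-file name and the helper-script path are assumptions, and the BDF is the example controller from above.

```
# /etc/udev/rules.d/99-vfio-nvme.rules  (hypothetical file name)
# On the nvme driver binding to this specific PCI function, run a helper
# that unbinds it from nvme and hands it to vfio-pci.
ACTION=="bind", SUBSYSTEM=="pci", DRIVER=="nvme", KERNEL=="0000:40:00.0", \
    RUN+="/usr/local/sbin/nvme-to-vfio.sh 0000:40:00.0"
```

Note that matching on ACTION=="bind" requires a udev/kernel combination that emits bind/unbind uevents (kernel 4.14+).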
[Jim] Can you confirm your suspicion? Maybe add a 1- or 2-second delay after detecting Event 2 before unbinding, and see if that eliminates the probe failures. I’m not suggesting that as a workaround or solution; I just want to know for sure whether we need to worry about deferring the unbind until after the kernel driver’s probe has completed. It sounds like these error messages are benign, but it would be nice to avoid them.
For experimentation purposes, yes, I might be able to instrument a delay to see if the kernel nvme probe failures go away. I don’t know whether udev runs its RUN programs concurrently, and thus whether such a delay would block other udev events from being processed while mine sleeps, but I can explore this at least as an experiment.
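A sketch of what that delay experiment could look like in the RUN helper. The commands are printed rather than executed here so the sequence can be reviewed first (or piped to sh on a real system); the BDF and the 2-second default delay are assumptions, and the sysfs paths are the standard driver bind/unbind interface.

```shell
#!/bin/sh
# Experimental helper: wait briefly so the kernel nvme probe() can finish
# before we request the unbind, then emit the rebind-to-vfio-pci sequence.
delayed_rebind() {
    bdf="$1"
    delay="${2:-2}"
    sleep "$delay"          # give nvme probe() time to complete (experiment only)
    echo "echo $bdf > /sys/bus/pci/drivers/nvme/unbind"
    echo "echo vfio-pci > /sys/bus/pci/devices/$bdf/driver_override"
    echo "echo $bdf > /sys/bus/pci/drivers/vfio-pci/bind"
}

delayed_rebind 0000:40:00.0 2
```

One caveat: udev expects RUN programs to be short-lived and enforces an event timeout, so sleeping in the helper is strictly for this experiment, not a production approach.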
Let me emphasize another point. While playing with this further, I subsequently discovered that the end result, at least with my particular NVMe drives, was in fact not benign. Although the NVMe controller did appear to be successfully bound to vfio-pci, any SPDK app (e.g. perf, identify) failed when attempting to communicate with the controller. I then removed my udev rule, manually unbound the controller from vfio-pci, and rebound it to the kernel’s nvme driver. After doing that, inspection of dmesg revealed a complaint from the nvme driver accessing the device. So I then rebooted the system, having ensured that my udev rule was not in place (neither in my rootfs nor in the initramfs), to see how the controller would behave after coming up with the kernel nvme driver in the default scenario. Again, dmesg revealed complaints about accessing that particular NVMe controller. Finally, I power-cycled the host, and lo and behold, after doing that the NVMe controller came up fine.
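For completeness, the manual recovery steps I took are the usual sysfs sequence: detach from vfio-pci, clear the driver_override, and let the kernel reprobe the device (which picks up nvme again). Printed rather than executed so the sequence can be reviewed; the BDF is the example controller.

```shell
#!/bin/sh
# Emit the commands to hand a controller back from vfio-pci to the kernel.
emit_restore_cmds() {
    bdf="$1"
    echo "echo $bdf > /sys/bus/pci/drivers/vfio-pci/unbind"
    echo "echo > /sys/bus/pci/devices/$bdf/driver_override"   # clear override
    echo "echo $bdf > /sys/bus/pci/drivers_probe"             # reprobe; nvme rebinds
}

emit_restore_cmds 0000:40:00.0
```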
In summary, I will at least attempt the delay experiment and see if that helps us sidestep the probe failure and avoid leaving the NVMe controller in a bad state. If that works, I may then alter the udev rule to trigger on the add action of the last namespace instead of on the bind action to the nvme driver and see how that behaves.
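The namespace-triggered variant I have in mind would look roughly like this. Again a sketch only: the helper path is an assumption, and the “wait for the last namespace” logic (which requires knowing the namespace count up front) would have to live in the helper, since udev itself can’t express it.

```
# Hypothetical variant: trigger on the "add" of an NVMe namespace block
# device whose ancestor is the target PCI function, rather than on the
# controller's bind event. KERNELS== matches parent devices in the chain.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="nvme*n*", KERNELS=="0000:40:00.0", \
    RUN+="/usr/local/sbin/nvme-to-vfio.sh 0000:40:00.0"
```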
Overall this seems like a reasonable approach though. How do you see this working if a system has multiple NVMe SSDs – one of which has the OS install, and the rest should be assigned to uio/vfio?
We do have this exact scenario; i.e., systems with NVMe controllers (on which file systems are mounted) that depend on the kernel nvme driver, while other NVMe controllers are ‘reserved’ for SPDK use. Among my udev rule’s trigger criteria is the BDF (bus-device-function), so this should work fine. We just have to make abundantly clear how careful one must be when configuring the system to use this mechanism, to avoid inadvertently triggering on an NVMe controller that’s needed by the kernel nvme driver.
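One defensive measure worth considering in the helper script: keep an explicit allowlist of SPDK-reserved BDFs and refuse to act on anything else, so a mistyped rule can’t steal the controller hosting the OS. The allowlist contents below are examples only.

```shell
#!/bin/sh
# Only controllers explicitly listed here may be moved to vfio-pci.
SPDK_BDFS="0000:40:00.0 0000:41:00.0"

is_spdk_bdf() {
    for b in $SPDK_BDFS; do
        [ "$1" = "$b" ] && return 0
    done
    return 1
}

# Example checks: a listed BDF passes, an unlisted one is refused.
if is_spdk_bdf 0000:40:00.0; then echo "allowed"; fi
if ! is_spdk_bdf 0000:03:00.0; then echo "refused"; fi
```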