Best practices on driver binding for SPDK in production environments
by Lance Hartmann ORACLE
This email to the SPDK list is a follow-on to a brief discussion held during a recent SPDK community meeting (Tue Jun 26 UTC 15:00).
Lifted and edited from the Trello agenda item (https://trello.com/c/U291IBYx/91-best-practices-on-driver-binding-for-spd...):
During development, many (most?) people rely on running SPDK's scripts/setup.sh to perform a number of initializations, among them unbinding the Linux kernel nvme driver from the NVMe controllers targeted for SPDK use and then binding them to either uio_pci_generic or vfio-pci. This script is suitable for development environments, but it is not targeted for production systems employing SPDK.
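For reference, the driver juggling amounts to a handful of sysfs writes, roughly like the following sketch using the driver_override mechanism (the BDF is illustrative, and setup.sh itself may use the new_id mechanism instead):
#!/bin/bash
# Rebind one controller from nvme to vfio-pci (BDF illustrative).
bdf=0000:30:00.0
modprobe vfio-pci
echo "$bdf" > /sys/bus/pci/drivers/nvme/unbind
echo vfio-pci > /sys/bus/pci/devices/$bdf/driver_override
echo "$bdf" > /sys/bus/pci/drivers_probe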
I'd like to confer with my fellow SPDK community members on ideas, suggestions, and best practices for handling this driver unbinding/binding. I wrote some udev rules, along with updates to some other Linux system conf files, to automatically load either the uio_pci_generic or vfio-pci module. I also had to update my initramfs so that when the system comes all the way up, the desired NVMe controllers are already bound to the driver needed for SPDK operation. As a bonus, it should "just work" when a hotplug occurs as well. However, there may be additional considerations I've overlooked, and I'd appreciate input on those. Further, there's the matter of how, and whether, to semi-automate this configuration via some kind of script; how that might vary across Linux distros; and, separately, how to decide between uio_pci_generic and vfio-pci.
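A sketch of the conf-file side (the file name is illustrative; dracut -f rebuilds the initramfs on RedHat-family distros):
# /etc/modules-load.d/vfio-pci.conf -- load the module at boot
vfio-pci
# Then rebuild the initramfs so the rules/modules are present at early boot:
dracut -f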
And, now some details:
1. I performed this on an Oracle Linux (OL) distro. I'm currently unaware of how the relevant configuration files might differ across distros. Oracle Linux is RedHat-compatible, so I'm confident my implementation should work similarly on RedHat-based systems, but I've yet to delve into other distros like Debian, SuSE, etc.
2. In preparation for writing my own udev rules, I unbound a specific NVMe controller from the Linux nvme driver by hand. Then, in another window, I launched "udevadm monitor -k -p" so that I could observe the usual udev events when an NVMe controller is bound to the nvme driver. On my system, I observed four (4) udev kernel events (output abbreviated/edited to avoid this becoming excessively long):
(Event 1)
KERNEL[382128.187273] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0 (nvme)
ACTION=add
DEVNAME=/dev/nvme0
…
SUBSYSTEM=nvme
(Event 2)
KERNEL[382128.244658] bind /devices/pci0000:00/0000:00:02.2/0000:30:00.0 (pci)
ACTION=bind
DEVPATH=/devices/pci0000:00/0000:00:02.2/0000:30:00.0
DRIVER=nvme
…
SUBSYSTEM=pci
(Event 3)
KERNEL[382130.697832] add /devices/virtual/bdi/259:0 (bdi)
ACTION=add
DEVPATH=/devices/virtual/bdi/259:0
...
SUBSYSTEM=bdi
(Event 4)
KERNEL[382130.698192] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1 (block)
ACTION=add
DEVNAME=/dev/nvme0n1
DEVPATH=/devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1
DEVTYPE=disk
...
SUBSYSTEM=block
3. My udev rule triggers on (Event 2) above: the bind action. Upon this action, my rule appends operations to the special udev RUN variable so that udev essentially mirrors what SPDK's scripts/setup.sh does: unbind from the nvme driver and bind to, in my case, the vfio-pci driver. A sketch of this kind of rule appears after item 4 below.
4. With my new udev rules in place, I successfully got specific NVMe controllers (selected by bus-device-function) to unbind from the Linux nvme driver and bind to vfio-pci. However, I made a couple of observations in the kernel log (dmesg). In particular, I was drawn to the following for an NVMe controller at BDF 0000:40:00.0, for which I had a udev rule to unbind from nvme and bind to vfio-pci:
[ 35.534279] nvme nvme1: pci function 0000:40:00.0
[ 37.964945] nvme nvme1: failed to mark controller live
[ 37.964947] nvme nvme1: Removing after probe failure status: 0
One theory I have for the above is that my udev RUN rule was invoked while the nvme driver's probe() was still running on this controller, and the unbind request came in before probe() completed, hence this "nvme1: failed to mark controller live". This has left me wondering whether, instead of triggering on (Event 2) when the bind occurs, I should try to derive a trigger on the "last" udev event, an "add", where the NVMe namespaces are instantiated. Of course, I'd need to know ahead of time how many namespaces exist on that controller in order to trigger on the last one. I'm wondering if that might avoid what looks like a complaint from the middle of probe() on that particular controller. Then again, maybe I can just safely ignore it and not worry about it at all? Thoughts?
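For concreteness, here is a sketch of the kind of rule I'm describing (the rule-file name, BDF, and helper path are illustrative, not my exact files):
# /etc/udev/rules.d/99-spdk-vfio.rules
# On the PCI bind event for this controller, run a helper that performs the
# same unbind-from-nvme / bind-to-vfio-pci sysfs sequence sketched earlier.
ACTION=="bind", SUBSYSTEM=="pci", KERNEL=="0000:40:00.0", DRIVER=="nvme", RUN+="/usr/local/sbin/spdk-rebind.sh %k"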
I discovered another issue during this experimentation that is somewhat tangential to this task, but I’ll write a separate email on that topic.
thanks for any feedback,
--
Lance Hartmann
lance.hartmann(a)oracle.com
2 years, 8 months
Topic from last week's community meeting
by Luse, Paul E
Hi Shuhei,
I was out of town last week and missed the meeting but saw on Trello you had the topic below:
"a few idea: log structured data store , data store with compression, and metadata replication of Blobstore"
I'd be pretty interested in working on that with you, or at least hearing more about it. When you get a chance (no hurry), can you please expand a little on how the conversation went and what you're looking at specifically?
Thanks!
Paul
2 years, 9 months
Add py-spdk client for SPDK
by We We
Hi, all
I have submitted the py-spdk code at https://review.gerrithub.io/#/c/379741/; please take some time to review it. I will be very grateful.
py-spdk is a client that helps upper-level apps communicate with SPDK-based apps (such as nvmf_tgt, vhost, iscsi_tgt, etc.). Should I submit it to a separate repo of my own rather than the SPDK repo? I think it is a relatively independent kit built on top of SPDK.
If you have any thoughts about py-spdk, please share them with me.
Regards,
Helloway
2 years, 9 months
Re: [SPDK] anyone ran the SPDK ( app/iscsi_tgt/iscsi_tgt ) with VPP?
by Isaac Otsiabah
Hi Tomasz, I got the SPDK patch. My network topology is simple, but making the network IP address accessible to the iscsi_tgt application and to vpp is not working. From my understanding, vpp is started first on the target host, and then the iscsi_tgt application is started after the network setup is done (please correct me if this is not the case).
 --------  192.168.2.10
|        |  initiator
 --------
    |
    |
    |
 ---+------------------------------ 192.168.2.0
    |
    |
    |  192.168.2.20
 --------------
|              |  vpp, vppctl
|              |  iscsi_tgt
 --------------
Both systems have a 10Gb NIC.
(On target Server):
I set up the vpp environment variables through the sysctl command.
I unbound the kernel driver and loaded the DPDK uio_pci_generic driver for the first 10Gb NIC (device address 0000:82:00.0), roughly as sketched below.
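A sketch of those two steps (the hugepage value is illustrative):
sysctl -w vm.nr_hugepages=1024
modprobe uio_pci_generic
./usertools/dpdk-devbind.py -u 0000:82:00.0
./usertools/dpdk-devbind.py -b uio_pci_generic 0000:82:00.0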
That worked, so I started the vpp application, and the startup output shows the NIC is in use by vpp:
[root@spdk2 ~]# vpp -c /etc/vpp/startup.conf
vlib_plugin_early_init:356: plugin path /usr/lib/vpp_plugins
load_one_plugin:184: Loaded plugin: acl_plugin.so (Access Control Lists)
load_one_plugin:184: Loaded plugin: dpdk_plugin.so (Data Plane Development Kit (DPDK))
load_one_plugin:184: Loaded plugin: flowprobe_plugin.so (Flow per Packet)
load_one_plugin:184: Loaded plugin: gtpu_plugin.so (GTPv1-U)
load_one_plugin:184: Loaded plugin: ila_plugin.so (Identifier-locator addressing for IPv6)
load_one_plugin:184: Loaded plugin: ioam_plugin.so (Inbound OAM)
load_one_plugin:114: Plugin disabled (default): ixge_plugin.so
load_one_plugin:184: Loaded plugin: kubeproxy_plugin.so (kube-proxy data plane)
load_one_plugin:184: Loaded plugin: l2e_plugin.so (L2 Emulation)
load_one_plugin:184: Loaded plugin: lb_plugin.so (Load Balancer)
load_one_plugin:184: Loaded plugin: libsixrd_plugin.so (IPv6 Rapid Deployment on IPv4 Infrastructure (RFC5969))
load_one_plugin:184: Loaded plugin: memif_plugin.so (Packet Memory Interface (experimetal))
load_one_plugin:184: Loaded plugin: nat_plugin.so (Network Address Translation)
load_one_plugin:184: Loaded plugin: pppoe_plugin.so (PPPoE)
load_one_plugin:184: Loaded plugin: stn_plugin.so (VPP Steals the NIC for Container integration)
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/acl_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/dpdk_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/flowprobe_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/gtpu_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/ioam_export_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/ioam_pot_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/ioam_trace_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/ioam_vxlan_gpe_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/kubeproxy_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/lb_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/memif_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/nat_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/pppoe_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/udp_ping_test_plugin.so
vpp[4168]: load_one_plugin:63: Loaded plugin: /usr/lib/vpp_api_test_plugins/vxlan_gpe_ioam_export_test_plugin.so
vpp[4168]: dpdk_config:1240: EAL init args: -c 1 -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:82:00.0 --master-lcore 0 --socket-mem 64,64
EAL: No free hugepages reported in hugepages-1048576kB
EAL: VFIO support initialized
DPDK physical memory layout:
Segment 0: IOVA:0x2200000, len:2097152, virt:0x7f919c800000, socket_id:0, hugepage_sz:2097152, nchannel:0, nrank:0
Segment 1: IOVA:0x3e000000, len:16777216, virt:0x7f919b600000, socket_id:0, hugepage_sz:2097152, nchannel:0, nrank:0
Segment 2: IOVA:0x3fc00000, len:2097152, virt:0x7f919b200000, socket_id:0, hugepage_sz:2097152, nchannel:0, nrank:0
Segment 3: IOVA:0x54c00000, len:46137344, virt:0x7f917ae00000, socket_id:0, hugepage_sz:2097152, nchannel:0, nrank:0
Segment 4: IOVA:0x1f2e400000, len:67108864, virt:0x7f8f9c200000, socket_id:1, hugepage_sz:2097152, nchannel:0, nran
STEP1:
Then, from the vppctl command prompt, I set up an IP address for the 10Gb interface and brought it up. From vpp, I can ping the initiator machine and vice versa, as shown below.
vpp# show int
Name Idx State Counter Count
TenGigabitEthernet82/0/0 1 down
local0 0 down
vpp# set interface ip address TenGigabitEthernet82/0/0 192.168.2.20/24
vpp# set interface state TenGigabitEthernet82/0/0 up
vpp# show int
Name Idx State Counter Count
TenGigabitEthernet82/0/0 1 up
local0 0 down
vpp# show int address
TenGigabitEthernet82/0/0 (up):
192.168.2.20/24
local0 (dn):
/* ping initiator from vpp */
vpp# ping 192.168.2.10
64 bytes from 192.168.2.10: icmp_seq=1 ttl=64 time=.0779 ms
64 bytes from 192.168.2.10: icmp_seq=2 ttl=64 time=.0396 ms
64 bytes from 192.168.2.10: icmp_seq=3 ttl=64 time=.0316 ms
64 bytes from 192.168.2.10: icmp_seq=4 ttl=64 time=.0368 ms
64 bytes from 192.168.2.10: icmp_seq=5 ttl=64 time=.0327 ms
(On Initiator):
/* ping vpp interface from initiator*/
[root@spdk1 ~]# ping -c 2 192.168.2.20
PING 192.168.2.20 (192.168.2.20) 56(84) bytes of data.
64 bytes from 192.168.2.20: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 192.168.2.20: icmp_seq=2 ttl=64 time=0.031 ms
STEP2:
However, when I start the iscsi_tgt server, it does not have access to the above 192.168.2.x subnet, so I ran these commands on the target server to create a veth pair and then connected it to a vpp host-interface as follows:
ip link add name vpp1out type veth peer name vpp1host
ip link set dev vpp1out up
ip link set dev vpp1host up
ip addr add 192.168.2.201/24 dev vpp1host
vpp# create host-interface name vpp1out
vpp# set int state host-vpp1out up
vpp# set int ip address host-vpp1out 192.168.2.202/24
vpp# show int addr
TenGigabitEthernet82/0/0 (up):
192.168.2.20/24
host-vpp1out (up):
192.168.2.202/24
local0 (dn):
vpp# trace add af-packet-input 10
/* From host, ping vpp */
[root@spdk2 ~]# ping -c 2 192.168.2.202
PING 192.168.2.202 (192.168.2.202) 56(84) bytes of data.
64 bytes from 192.168.2.202: icmp_seq=1 ttl=64 time=0.130 ms
64 bytes from 192.168.2.202: icmp_seq=2 ttl=64 time=0.067 ms
/* From vpp, ping host */
vpp# ping 192.168.2.201
64 bytes from 192.168.2.201: icmp_seq=1 ttl=64 time=.1931 ms
64 bytes from 192.168.2.201: icmp_seq=2 ttl=64 time=.1581 ms
64 bytes from 192.168.2.201: icmp_seq=3 ttl=64 time=.1235 ms
64 bytes from 192.168.2.201: icmp_seq=4 ttl=64 time=.1032 ms
64 bytes from 192.168.2.201: icmp_seq=5 ttl=64 time=.0688 ms
Statistics: 5 sent, 5 received, 0% packet loss
From the target host, I still cannot ping the initiator (192.168.2.10); the traffic does not go through the vpp interface, so my vpp interface connection is not correct.
Please, how does one create the vpp host-interface and connect it so that host applications (i.e. iscsi_tgt) can communicate on the 192.168.2 subnet? In STEP2, should I use a different subnet like 192.168.3.x, turn on IP forwarding, and add a route to the routing table?
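For example, something roughly like this on the host side is what I'm contemplating (addresses illustrative, and presumably vpp would need a matching address/route on its side of the veth):
ip addr add 192.168.3.1/24 dev vpp1host
sysctl -w net.ipv4.ip_forward=1
ip route add 192.168.2.0/24 via 192.168.3.2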
Isaac
From: Zawadzki, Tomasz [mailto:tomasz.zawadzki@intel.com]
Sent: Thursday, April 12, 2018 12:27 AM
To: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Cc: Harris, James R <james.r.harris(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>; Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>
Subject: RE: anyone ran the SPDK ( app/iscsi_tgt/iscsi_tgt ) with VPP?
Hello Isaac,
Are you using the following patch? (I suggest cherry-picking it)
https://review.gerrithub.io/#/c/389566/
The SPDK iSCSI target can be started without a specific interface to bind on, by not specifying any target nodes or portal groups. They can be added later via RPC: http://www.spdk.io/doc/iscsi.html#iscsi_rpc.
Please see https://github.com/spdk/spdk/blob/master/test/iscsi_tgt/lvol/iscsi.conf for an example of a minimal iSCSI config.
Suggested flow of starting up applications is:
1. Unbind interfaces from kernel
2. Start VPP and configure the interface via vppctl
3. Start SPDK
4. Configure the iSCSI target via RPC, at this time it should be possible to use the interface configured in VPP
Please note, there is some leeway here; the only hard requirement is having the VPP app started before the SPDK app.
Interfaces in VPP can be created (like tap or veth) and configured at runtime, and are then available for use in SPDK as well. A rough sketch of the whole flow is below.
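(Interface name, BDF, and RPC arguments are only examples; RPC names are as in SPDK 18.x.)
./usertools/dpdk-devbind.py -u 0000:82:00.0
vpp -c /etc/vpp/startup.conf
vppctl set interface ip address TenGigabitEthernet82/0/0 192.168.2.20/24
vppctl set interface state TenGigabitEthernet82/0/0 up
./app/iscsi_tgt/iscsi_tgt -c iscsi.conf &
./scripts/rpc.py add_portal_group 1 192.168.2.20:3260
./scripts/rpc.py add_initiator_group 2 ANY 192.168.2.0/24
./scripts/rpc.py construct_target_node disk1 "Data Disk1" "Malloc0:0" "1:2" 64 -d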
Let me know if you have any questions.
Tomek
From: Isaac Otsiabah [mailto:IOtsiabah@us.fujitsu.com]
Sent: Wednesday, April 11, 2018 8:47 PM
To: Zawadzki, Tomasz <tomasz.zawadzki(a)intel.com>
Cc: Harris, James R <james.r.harris(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>; Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>
Subject: anyone ran the SPDK ( app/iscsi_tgt/iscsi_tgt ) with VPP?
Hi Tomasz, Daniel and Jim, I am trying to test VPP, so I built VPP on CentOS 7.4 (x86_64), built SPDK, and tried to run the ./app/iscsi_tgt/iscsi_tgt application.
For VPP, I first unbind the NIC from the kernel and start the VPP application:
./usertools/dpdk-devbind.py -u 0000:07:00.0
vpp unix {cli-listen /run/vpp/cli.sock}
Unbinding the NIC takes down the interface; however, the ./app/iscsi_tgt/iscsi_tgt -m 0x101 application needs an interface to bind to during startup, so it fails to start. The information at:
"Running SPDK with VPP
VPP application has to be started before SPDK iSCSI target, in order to enable usage of network interfaces. After SPDK iSCSI target initialization finishes, interfaces configured within VPP will be available to be configured as portal addresses. Please refer to Configuring iSCSI Target via RPC method<http://www.spdk.io/doc/iscsi.html#iscsi_rpc>."
is not clear, because the instructions at "Configuring iSCSI Target via RPC method" suggest the iscsi_tgt server must already be running for one to execute the RPC commands; but how do I get the iscsi_tgt server running without an interface to bind on during its initialization?
Please, can any of you explain how to run the SPDK iscsi_tgt application with VPP? For instance: what should change in iscsi.conf? After unbinding the NIC, how do I get the iscsi_tgt server to start without an interface to bind to? What address should be assigned to the Portal in iscsi.conf, etc.?
I would appreciate it if anyone could help. Thank you.
Isaac
3 years, 9 months
Building spdk on CentOS6
by Shahar Salzman
Hi,
Finally got around to looking at supporting the spdk build on CentOS6; things look good except for one issue.
spdk is the latest 18.01.x version, dpdk is 16.07 (+3 dpdk patches to allow compilation) plus some minor patches (mainly some memory configuration stuff), and the kernel is a patched 4.9.6.
The build succeeded except for the usage of the dpdk function pci_vfio_is_enabled.
I had to apply the patch below, removing the usage of this function, and then compilation completed without any issues.
It seems that I am missing some sort of dpdk configuration, as I see that the function is built but not packaged into the generated archive; a way to check this is sketched below.
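For example (archive/object paths per my dpdk build tree, illustrative):
nm dpdk/build/lib/librte_eal.a | grep pci_vfio_is_enabled
find dpdk -name 'eal_pci_vfio.o' -exec nm {} \; | grep pci_vfio_is_enabled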
I went back to square one and ran the instructions at http://www.spdk.io/doc/getting_started.html, but I see no mention of dpdk there, even though ./configure actually requires it.
My next step is to try a more recent dpdk, but shouldn't this work with my version? Am I missing some dpdk configuration?
BTW, as we are not using vhost, on our 17.07 version we simply set CONFIG_VHOST=n to skip it, but I would be happier if we used a better solution.
Shahar
P.S. Here is the patch to remove use of this function:
diff --git a/lib/env_dpdk/vtophys.c b/lib/env_dpdk/vtophys.c
index 92aa256..f38929f 100644
--- a/lib/env_dpdk/vtophys.c
+++ b/lib/env_dpdk/vtophys.c
@@ -53,8 +53,10 @@
 #define SPDK_VFIO_ENABLED 1
 #include <linux/vfio.h>
 
+#if 0
 /* Internal DPDK function forward declaration */
 int pci_vfio_is_enabled(void);
+#endif
 
 struct spdk_vfio_dma_map {
 	struct vfio_iommu_type1_dma_map map;
@@ -341,9 +343,11 @@ spdk_vtophys_iommu_init(void)
 	DIR *dir;
 	struct dirent *d;
 
+#if 0
 	if (!pci_vfio_is_enabled()) {
 		return;
 	}
+#endif
 
 	dir = opendir("/proc/self/fd");
 	if (!dir) {
3 years, 10 months
SPDK + user space appliance
by Shahar Salzman
Hi all,
Sorry for the delay; I had to solve a quarantine issue in order to get access to the list.
Some clarifications regarding the user space application:
1. The application is not the nvmf_tgt, we have an entire applicance to which we are integrating spdk
2. We are currently using nvmf_tgt functions in order to activate spdk, and the bdev_user in order to handle IO
3. This is all in user space (I am used to the kernel/user distinction as a way to separate protocol from appliance).
4. The bdev_user will also notify spdk of changes to namespaces (e.g. a new namespace has been added, and can be attached to the spdk subsystem)
I am glad that this is your intention. The question is: do you think it would be useful to create such a bdev_user module, which would allow other users to integrate spdk into their appliances using such a simple threading model? Perhaps such a module would allow easier integration of spdk.
I am attaching a reference application which does NULL IO via bdev_user.
Regarding the RPC, we have an implementation of it, and will be happy to push it upstream.
I am not sure that using the RPC for this type of bdev_user namespace is the correct approach in the long run, since the user appliance is the one adding/removing namespaces (like hot-plugging a new NVME device), so it can just call the "add_namespace_to_subsystem" interface directly and does not need to use an RPC for it.
Thanks,
Shahar
3 years, 11 months
Re: [SPDK] BDEV-IO Lifecycle - Need your input.
by Kaligotla, Srikanth
Hello,
The first revision of changes to extend the lifecycle of bdev_io is available for review. I would like to solicit your input on the proposed API/code-flow changes.
https://review.gerrithub.io/c/spdk/spdk/+/415860
Thanks,
Srikanth
From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com<mailto:Srikanth.Kaligotla@netapp.com>>
Date: Friday, May 11, 2018 at 2:27 PM
To: "Walker, Benjamin" <benjamin.walker(a)intel.com<mailto:benjamin.walker@intel.com>>, "Harris, James R" <james.r.harris(a)intel.com<mailto:james.r.harris@intel.com>>
Cc: "raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala@broadcom.com>" <raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala@broadcom.com>>, "Meneghini, John" <John.Meneghini(a)netapp.com<mailto:John.Meneghini@netapp.com>>, "Rodriguez, Edwin" <Ed.Rodriguez(a)netapp.com<mailto:Ed.Rodriguez@netapp.com>>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com<mailto:Madhusudan.Pai@netapp.com>>, "NGC-john.barnard-broadcom.com" <john.barnard(a)broadcom.com<mailto:john.barnard@broadcom.com>>, "spdk(a)lists.01.org<mailto:spdk@lists.01.org>" <spdk(a)lists.01.org<mailto:spdk@lists.01.org>>
Subject: RE: BDEV-IO Lifecycle - Need your input.
CC: List
Hi Ben,
Your proposal (https://review.gerrithub.io/#/c/spdk/spdk/+/386166/) to interface with the backend to acquire and release buffers is good. You have accurately stated that the challenge is in developing intuitive semantics, and that has been my struggle. To me, there are two problem statements:
1. It is expected that the bdev_io pool is sized correctly, so that the call to get a bdev_io succeeds. Failure to acquire a bdev_io results in DEVICE-ERROR. The transport is already capable of handling temporary memory failures by moving the request to the PENDING queue. Hence the proposal to change the bdev_io lifecycle and perhaps connect it with the spdk_nvmf_request object, so that all buffer needs are addressed at the beginning of an I/O request.
2. I/O data buffers are sourced and managed by the backend. One of the challenges I see with your proposed interface is the lack of details such as whether the resource is being acquired for a READ or a WRITE operation; the handling is quite different in each case. Since bdev_io->type is overloaded, the type of I/O operation is lost. I suppose one can cast the cb_arg (nvmf_request) and then proceed. Also, the bdev_io must be present in order to RELEASE the buffer. Zero-copy semantics warrant that the data buffer stays until the controller-to-host transfer has occurred; in other words, the bdev_io lives until the REQUEST comes to the COMPLETE state.
What are your thoughts on introducing spdk_bdev_init() and spdk_bdev_fini() as an alternative approach to extend the lifecycle of bdev_io and allow data buffer management via the bdev fn_table?
I hope I’m making sense…
Thanks,
Srikanth
From: Walker, Benjamin <benjamin.walker(a)intel.com>
Sent: Friday, May 11, 2018 12:28 PM
To: Harris, James R <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>
Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com>; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>
Subject: Re: BDEV-IO Lifecycle - Need your input.
Hi Srikanth,
Yes - we'll need to introduce some way to acquire and release buffers from the bdev layer earlier in the state machine that processes an NVMe-oF request. I've had this patch out for review for several months as a proposal for this scenario:
https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
It doesn't pass the tests - it's just a proposal for the interface. 90% of the challenge here is in developing intuitive semantics.
Thanks,
Ben
P.S. This is the kind of discussion that would fit perfectly on the mailing list.
On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
Hi Ben, Hi James,
I would like to solicit opinions on the lifecycle of the bdev-io resource object. Attached is an image of the RDMA state machine in its current implementation. When the REQUEST enters the NEED-BUFFER state, the buffers necessary for carrying out the I/O operation are allocated/acquired from the memory pool. An instance of BDEV-IO comes into existence after the REQUEST reaches the READY-TO-EXECUTE state. The BDEV-IO is torn down as soon as the backend returns. From a BDEV perspective, BDEV-IO is simply a translation unit that facilitates I/O buffers from the backend. The driver context embedded within bdev_io holds a great deal of information pertaining to the I/O under execution; it assists in error handling, in dereferencing the buffers upon I/O completion, and in abort handling. In summary, the bdev_io stays alive until the request has come to the COMPLETE state. I'd like to hear people's thoughts on introducing the plumbing to acquire the BDEV-IO resource in the REQUEST-NEED-BUFFER state and release it in the REQUEST-COMPLETE state. I will shortly have a patch available for review that introduces spdk_bdev_init and spdk_bdev_fini, which in turn invoke the corresponding bdev fn_table entries to initialize/clean up.
I wanted to use this email to communicate our intent and solicit your feedback. We have a working implementation of the above proposal and, prior to pushing it upstream for review, would like to hear your thoughts. These proposed changes to upstream are the result of FC transport work in collaboration with the Broadcom team, who are also copied on this mail. John and I will be at the SPDK Dev conference, and if required we can elaborate further on this proposal.
Thanks,
Srikanth
3 years, 11 months
Debugging /sys driver_override affecting driver binding
by Lance Hartmann ORACLE
During my experimentation with unbinding NVMe controllers from the Linux nvme driver and then binding them to vfio-pci for use with SPDK, I encountered unusual behavior with one of the controllers. For some initially inexplicable reason, one of the NVMe controllers did get unbound from the nvme driver as desired, but it refused to bind to vfio-pci, whereas all the other NVMe controllers had no trouble at all binding to vfio-pci. Inspection of the kernel log (dmesg) didn't help. After a bunch of debugging, I uncovered the culprit: the /sys driver attribute, driver_override. By default, all of my NVMe controllers appeared to have that attribute empty/null, e.g.:
# cat /sys/bus/pci/devices/0000:40:00.0/driver_override
(null)
However, I discovered that for the NVMe controller that refused to bind to vfio-pci, its driver_override attribute contained the string “nvme”:
# cat /sys/bus/pci/devices/0000:40:00.0/driver_override
nvme
Per Linux kernel documentation, ABI/testing/sysfs-bus-pci:
This file allows the driver for a device to be specified which
will override standard static and dynamic ID matching. When
specified, only a driver with a name matching the value written
to driver_override will have an opportunity to bind to the
device.
…
Eureka! So, that explains why I had a particular NVMe device that refused to bind to vfio-pci. I wanted to share this discovery in case other folks run into a similar issue. Now, the mystery that remains: how and why did this particular NVMe controller get its driver_override attribute set to "nvme"? It's not being used as a boot device; I've never attempted to use it with LVM (Linux Logical Volume Management), nor built any file systems on it, or any such thing. I grep'd through my real rootfs's /etc and searched through my initramfs as well, but I've yet to discover what's responsible for setting that particular NVMe controller's driver_override. Anyone have some ideas?
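For anyone else who hits this: the attribute can be inspected and cleared from the shell (BDF illustrative); per the same sysfs-bus-pci documentation, writing an empty string clears the override so normal ID matching applies again:
cat /sys/bus/pci/devices/0000:40:00.0/driver_override
echo > /sys/bus/pci/devices/0000:40:00.0/driver_override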
thanks,
--
Lance Hartmann
lance.hartmann(a)oracle.com
3 years, 11 months
Shared Library Alone Doesn't Allow BlobFS Creation
by Great Wizard
Hi.
I saw a commit referencing the ability to build a shared library in SPDK now, stating "The combined library includes all components of SPDK and is intended to make linking against SPDK easier." I tried to make a simple program initializing a (broken) blobfs, but at link time I'm required to add -lspdk_thread to get it to build.
#include "../../../include/spdk/blob.h"
int main(int argc, char const *argv[])
{
spdk_fs_init(NULL, NULL, NULL, NULL, NULL);
return 0;
}
gcc test.c -o test -lspdk gives errors about undefined references to
spdk_io_channel and thread functions.
gcc test.c -o test -lspdk -lspdk_thread builds fine.
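One way to check what the combined library actually exports (library path illustrative):
nm -D build/lib/libspdk.so | grep spdk_get_io_channel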
Am I misunderstanding when I think of the shared library as a sort of "one library fits all" library, meaning I typically wouldn't need to link anything else?
Thanks for your time.
3 years, 12 months