Best practices on driver binding for SPDK in production environments
by Lance Hartmann ORACLE
This email to the SPDK list is a follow-on to a brief discussion held during a recent SPDK community meeting (Tue Jun 26 UTC 15:00).
Lifted and edited from the Trello agenda item (https://trello.com/c/U291IBYx/91-best-practices-on-driver-binding-for-spd...):
During development, many (most?) people rely on running SPDK's scripts/setup.sh to perform a number of initializations, among them unbinding the Linux kernel nvme driver from the NVMe controllers targeted for use by SPDK and then binding them to either uio_pci_generic or vfio-pci. This script is fine for development environments, but it is not targeted at production systems employing SPDK.
I'd like to confer with my fellow SPDK community members on ideas, suggestions and best practices for handling this driver unbinding/binding. I wrote some udev rules, along with updates to some other Linux system conf files, to automatically load either the uio_pci_generic or vfio-pci module. I also had to update my initramfs so that when the system comes all the way up, the desired NVMe controllers are already bound to the driver needed for SPDK operation. As a bonus, it should "just work" when a hotplug occurs as well. However, there may be additional considerations I have overlooked, and I'd appreciate input on those. Further, there's the question of how (and whether) to semi-automate this configuration via some kind of script, how that might vary by Linux distro, and, beyond that, how to decide between uio_pci_generic and vfio-pci.
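For concreteness, the net effect per targeted controller is the same rebind that scripts/setup.sh performs. Done by hand for an example controller at 0000:30:00.0 (the BDF that also appears in the udev trace below), run as root, it looks roughly like this; driver_override needs a reasonably recent kernel, and setup.sh goes about it a bit differently under the hood, but the end state is the same:

    # detach the controller from the kernel nvme driver
    echo 0000:30:00.0 > /sys/bus/pci/drivers/nvme/unbind
    # make sure the target driver is loaded, steer the device to it, and re-probe
    modprobe vfio-pci
    echo vfio-pci > /sys/bus/pci/devices/0000:30:00.0/driver_override
    echo 0000:30:00.0 > /sys/bus/pci/drivers_probe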
And, now some details:
1. I performed this on an Oracle Linux (OL) distro. I'm not yet sure which configuration files might differ from one distro to another. Oracle Linux is RedHat-compatible, so I'm confident my implementation should work similarly on RedHat-based systems, but I've yet to delve into other distros like Debian, SuSE, etc.
2. In preparation for writing my own udev rules, I unbound a specific NVMe controller from the Linux nvme driver by hand. Then, in another window, I launched "udevadm monitor -k -p" so that I could observe the usual udev events when an NVMe controller is bound to the nvme driver. On my system, I observed four (4) udev kernel events (output abbreviated/edited to keep this from becoming excessively long):
(Event 1)
KERNEL[382128.187273] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0 (nvme)
ACTION=add
DEVNAME=/dev/nvme0
…
SUBSYSTEM=nvme
(Event 2)
KERNEL[382128.244658] bind /devices/pci0000:00/0000:00:02.2/0000:30:00.0 (pci)
ACTION=bind
DEVPATH=/devices/pci0000:00/0000:00:02.2/0000:30:00.0
DRIVER=nvme
…
SUBSYSTEM=pci
(Event 3)
KERNEL[382130.697832] add /devices/virtual/bdi/259:0 (bdi)
ACTION=add
DEVPATH=/devices/virtual/bdi/259:0
...
SUBSYSTEM=bdi
(Event 4)
KERNEL[382130.698192] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1 (block)
ACTION=add
DEVNAME=/dev/nvme0n1
DEVPATH=/devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1
DEVTYPE=disk
...
SUBSYSTEM=block
3. My udev rule triggers on (Event 2) above, the bind action. Upon this action, my udev rule appends operations to the special udev RUN variable so that udev essentially mirrors what SPDK's scripts/setup.sh does: unbind from the nvme driver and bind to, in my case, the vfio-pci driver. (A sketch of the rule appears below, after item 4.)
4. With my new udev rules in place, I was able to get specific NVMe controllers (selected by bus-device-function) to unbind from the Linux nvme driver and bind to vfio-pci. However, I made a couple of observations in the kernel log (dmesg). In particular, I was drawn to the following for an NVMe controller at BDF 0000:40:00.0, for which I had a udev rule to unbind from nvme and bind to vfio-pci:
[ 35.534279] nvme nvme1: pci function 0000:40:00.0
[ 37.964945] nvme nvme1: failed to mark controller live
[ 37.964947] nvme nvme1: Removing after probe failure status: 0
One theory I have for the above is that my udev RUN rule was invoked while the nvme driver's probe() was still running on this controller, and the unbind request came in before probe() completed, hence the "nvme1: failed to mark controller live". This has left me wondering whether, instead of triggering on (Event 2) when the bind occurs, I should instead try to trigger on the "last" udev event, an "add", where the NVMe namespaces are instantiated. Of course, I'd need to know ahead of time how many namespaces exist on that controller in order to trigger on the last one. I'm wondering if that would help avoid what looks like a complaint in the middle of probe() for that particular controller. Then again, maybe I can just safely ignore it and not worry about it at all? Thoughts?
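For reference, here is roughly what the rules look like. The rules file name and the helper script are mine and purely illustrative; the helper simply runs the unbind/driver_override/re-probe sequence shown near the top of this email for the given BDF:

    # /etc/udev/rules.d/99-spdk-vfio.rules
    # (spdk-rebind.sh is a hypothetical helper that performs the rebind for the given BDF)
    # current approach: trigger on the PCI bind event (Event 2)
    ACTION=="bind", SUBSYSTEM=="pci", DRIVER=="nvme", KERNEL=="0000:40:00.0", RUN+="/usr/local/sbin/spdk-rebind.sh 0000:40:00.0"
    # hypothetical alternative: trigger on the namespace (block) add event instead (Event 4);
    # KERNELS== also matches parent devices, so this fires only for namespaces of that controller
    ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNELS=="0000:40:00.0", RUN+="/usr/local/sbin/spdk-rebind.sh 0000:40:00.0"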
I discovered another issue during this experimentation that is somewhat tangential to this task, but I’ll write a separate email on that topic.
thanks for any feedback,
--
Lance Hartmann
lance.hartmann(a)oracle.com
Topic from last week's community meeting
by Luse, Paul E
Hi Shuhei,
I was out of town last week and missed the meeting but saw on Trello you had the topic below:
"a few idea: log structured data store , data store with compression, and metadata replication of Blobstore"
I'd be pretty interested in working on that with you, or at least in hearing more about it. When you get a chance (no hurry), can you please expand a little on how the conversation went and what you're looking at specifically?
Thanks!
Paul
Add py-spdk client for SPDK
by We We
Hi, all
I have submitted the py-spdk code at https://review.gerrithub.io/#/c/379741/; please take some time to review it, I will be very grateful.
py-spdk is a client that helps upper-level applications communicate with SPDK-based apps (such as nvmf_tgt, vhost, iscsi_tgt, etc.). Should I submit it to a separate repo that I create, rather than into the SPDK repo? I ask because I think it is a relatively independent kit built on top of SPDK.
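In case it helps reviewers: the communication path is the JSON-RPC interface those apps already expose, i.e. the same interface you can exercise from the shell with the in-tree script, for example:

    # with an SPDK app (e.g. nvmf_tgt) running; pass -s/-p if your RPC listen
    # address differs from the default
    ./scripts/rpc.py get_rpc_methods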
If you have some thoughts about the py-spdk, please share with me.
Regards,
Helloway
SPDK v18.10 & compatibility with DPDK versions
by Rajesh Ravi
Hi,
I noticed some compilation issues, and then an fio failure with RDMA errors,
when I used SPDK v18.10 together with DPDK v18.11.
1) Is SPDK v18.10 compatible with DPDK v17.11 (LTS)?
2) Is SPDK v18.10 supported with DPDK v18.11?
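For reference, in both cases I mean building SPDK against an external DPDK tree with the --with-dpdk configure flag, e.g. (the path below is only an example):

    ./configure --with-dpdk=/path/to/dpdk-17.11/x86_64-native-linuxapp-gcc
    make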
Thanks in advance.
--
Regards,
Rajesh
Null bdev read/write performance with NVMf/TCP target
by Andrey Kuzmin
I'm getting some rather counter-intuitive results in my first hands-on with the
recently announced SPDK NVMe-oF/TCP target. The only difference between the two
runs below (over a 10GigE link) is that the first one does sequential reads
while the second one does sequential writes, and note that both run against the
same null bdev target. Any ideas as to why null bdev writes are 2x slower than
reads, and where that extra 1 ms of latency is coming from?
Regards,
Andrey
sudo ./examples/nvme/perf/perf -c 0x1 -q 256 -o 4096 -w read -t 10 \
    -r "trtype:TCP adrfam:IPv4 traddr:192.168.100.105 trsvcid:4420 subnqn:nqn.test_tgt"
Starting SPDK v19.01-pre / DPDK 18.08.0 initialization...
[ DPDK EAL parameters: perf --no-shconf -c 0x1 --no-pci
--base-virtaddr=0x200000000000 --file-prefix=spdk_pid3870 ]
EAL: Detected 56 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
Initializing NVMe Controllers
Attaching to NVMe over Fabrics controller at 192.168.100.105:4420:
nqn.test_tgt
Attached to NVMe over Fabrics controller at 192.168.100.105:4420:
nqn.test_tgt
Associating SPDK bdev Controller (TESTTGTCTL) with lcore 0
Initialization complete. Launching workers.
Starting thread on core 0
========================================================
                                                                 Latency(us)
Device Information                            :       IOPS       MB/s    Average        min        max
SPDK bdev Controller (TESTTGTCTL) from core 0 :  203888.90     796.44    1255.62     720.37    1967.79
========================================================
Total                                         :  203888.90     796.44    1255.62     720.37    1967.79
sudo ./examples/nvme/perf/perf -c 0x1 -q 256 -o 4096 -w write -t 10 \
    -r "trtype:TCP adrfam:IPv4 traddr:192.168.100.105 trsvcid:4420 subnqn:nqn.test_tgt"
Starting SPDK v19.01-pre / DPDK 18.08.0 initialization...
[ DPDK EAL parameters: perf --no-shconf -c 0x1 --no-pci
--base-virtaddr=0x200000000000 --file-prefix=spdk_pid3873 ]
EAL: Detected 56 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
Initializing NVMe Controllers
Attaching to NVMe over Fabrics controller at 192.168.100.105:4420:
nqn.test_tgt
Attached to NVMe over Fabrics controller at 192.168.100.105:4420:
nqn.test_tgt
Associating SPDK bdev Controller (TESTTGTCTL) with lcore 0
Initialization complete. Launching workers.
Starting thread on core 0
========================================================
                                                                 Latency(us)
Device Information                            :       IOPS       MB/s    Average        min        max
SPDK bdev Controller (TESTTGTCTL) from core 0 :   92426.00     361.04    2770.15     773.45    5265.16
========================================================
Total                                         :   92426.00     361.04    2770.15     773.45    5265.16
Questions about vhost memory registration
by Nikos Dragazis
Hi all,
I would like to raise a couple of questions about vhost target.
My first question is:
During vhost-user negotiation, the master sends its memory regions to
the slave. The slave maps each region into its own address space. The mmap
addresses are page aligned (that is, 4 KB aligned) but not necessarily 2 MB
aligned. When vhost registers the memory regions in
spdk_vhost_dev_mem_register(), it aligns the mmap addresses to 2MB here:
https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L534
The aligned addresses may not have a valid page table entry. So, in case
of uio, it is possible that during vtophys translation, the aligned
addresses are touched here:
https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
and this could lead to a segfault. Is this a possible scenario?
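To make the concern concrete with a made-up address (illustrative numbers only):
rounding a 4 KB-aligned mmap address down to a 2 MB boundary can land below the
start of the mapping, e.g.

    $ printf '0x%x\n' $(( 0x7f2a4be01000 & ~(2*1024*1024 - 1) ))
    0x7f2a4be00000

i.e. the "aligned" start lies 0x1000 bytes before the region the master actually
shared with us, and those bytes may have no valid page table entry at all.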
My second question is:
The commit message here:
https://review.gerrithub.io/c/spdk/spdk/+/410071
says:
“We've had cases (especially with vhost) in the past where we have
a valid vaddr but the backing page was not assigned yet.”.
This refers to the vhost target, where shared memory is allocated by the
QEMU process and the SPDK process maps this memory.
Let’s consider this case. After mapping vhost-user memory regions, they
are registered to the vtophys map. In case vfio is disabled,
vtophys_get_paddr_pagemap() finds the corresponding physical addresses.
These addresses must refer to pinned memory because vfio is not there to
do the pinning. Therefore, VM’s memory has to be backed by hugepages.
Hugepages are allocated by the QEMU process, way before vhost memory
registration. After their allocation, hugepages will always have a
backing page because they never get swapped out. So, I do not see any
such case where backing page is not assigned yet and thus I do not see
any need to touch the mapped page.
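As a sanity check on a running setup, one can confirm that the guest RAM
mappings are hugepage-backed with something like the following (assuming a
single qemu-system-x86_64 process; hugetlbfs mappings report a 2048 kB or
1048576 kB KernelPageSize):

    grep KernelPageSize /proc/$(pidof qemu-system-x86_64)/smaps | sort | uniq -c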
This is my current understanding in brief and I'd welcome any feedback
you may have:
1. address alignment in spdk_vhost_dev_mem_register() is buggy, because the
aligned address may not have a valid page table entry and thus can trigger a
segfault when it is touched in vtophys_get_paddr_pagemap() ->
rte_atomic64_read().
2. touching the page in vtophys_get_paddr_pagemap() is unnecessary, because
the VM's memory has to be backed by hugepages, and hugepages are not subject
to demand paging and are never swapped out.
I am looking forward to your feedback.
Thanks,
Nikos
[NVME-OF TCP] Crash in perf running on more than single core
by Sasha Kotchubievsky
Hi,
I'm testing NVMe-oF TCP. I can run the "perf" application only on a single
core. If I try to use 2 or more cores, it crashes immediately.
I tried different block sizes, numbers of cores on the target, and read/write
operations.
Am I missing something in the configuration, or is this just a bug?
Details:
Version: "7a39a68"
Command lines:
Target: sudo ./app/nvmf_tgt/nvmf_tgt -c ./nvmf.conf -m 0x3
Client: sudo examples/nvme/perf/perf -q 5 -o 1036288 -w randwrite -t 60 -c 0x1100 -D \
    -r 'trtype:TCP adrfam:IPv4 traddr:1.1.75.1 trsvcid:1023 nqn.2016-06.io.spdk.r-dcs75:rd0'
Backtrace:
#0 0x0000000000420515 in nvme_tcp_qpair_process_send_queue
(tqpair=0xe19040) at nvme_tcp.c:447
447             pdu_length = pdu->hdr.common.plen - pdu->writev_offset;
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-222.el7.x86_64 libaio-0.3.109-13.el7.x86_64
libgcc-4.8.5-28.el7_5.1.x86_64
libibverbs-41mlnx1-OFED.4.5.0.1.0.45037.x86_64
libnl3-3.2.28-4.el7.x86_64 librdmacm-41mlnx1-OFED.4.2.0.1.3.45037.x86_64
libuuid-2.23.2-52.el7.x86_64 numactl-libs-2.0.9-7.el7.x86_64
openssl-libs-1.0.2k-12.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0 0x0000000000420515 in nvme_tcp_qpair_process_send_queue
(tqpair=0xe19040) at nvme_tcp.c:447
#1 0x000000000042322c in nvme_tcp_qpair_process_completions
(qpair=0xe19040, max_completions=0) at nvme_tcp.c:1555
#2 0x000000000041e01a in nvme_transport_qpair_process_completions
(qpair=0xe19040, max_completions=0) at nvme_transport.c:224
#3 0x000000000041a78d in spdk_nvme_qpair_process_completions
(qpair=0xe19040, max_completions=0) at nvme_qpair.c:413
#4 0x000000000041b586 in spdk_nvme_wait_for_completion_robust_lock
(qpair=0xe19040, status=0x7ffe66d56dd0, robust_mutex=0x0) at nvme.c:128
#5 0x000000000041b5fe in spdk_nvme_wait_for_completion (qpair=0xe19040,
status=0x7ffe66d56dd0) at nvme.c:142
#6 0x0000000000411542 in nvme_fabric_prop_get_cmd (ctrlr=0xe1ed60,
offset=20, size=0 '\000', value=0x7ffe66d56e88) at nvme_fabric.c:97
#7 0x0000000000411664 in nvme_fabric_ctrlr_get_reg_4 (ctrlr=0xe1ed60,
offset=20, value=0x7ffe66d56f40) at nvme_fabric.c:130
#8 0x000000000042024c in nvme_tcp_ctrlr_get_reg_4 (ctrlr=0xe1ed60,
offset=20, value=0x7ffe66d56f40) at nvme_tcp.c:369
#9 0x000000000041d91f in nvme_transport_ctrlr_get_reg_4
(ctrlr=0xe1ed60, offset=20, value=0x7ffe66d56f40) at nvme_transport.c:139
#10 0x000000000040b846 in nvme_ctrlr_get_cc (ctrlr=0xe1ed60,
cc=0x7ffe66d56f40) at nvme_ctrlr.c:49
#11 0x000000000040bd4f in spdk_nvme_ctrlr_alloc_io_qpair
(ctrlr=0xe1ed60, user_opts=0x7ffe66d56f80, opts_size=12) at nvme_ctrlr.c:251
#12 0x000000000040666d in init_ns_worker_ctx (ns_ctx=0xe18d20) at perf.c:826
#13 0x0000000000406739 in work_fn (arg=0xe198c0) at perf.c:862
#14 0x00000000004088ff in main (argc=14, argv=0x7ffe66d57148) at perf.c:1709
Environment:
OS: CentOS Linux release 7.5.1804 (Core)
Kernel: 3.10.0-862.el7.x86_64
Target output"
Starting SPDK v19.01-pre / DPDK 18.08.0 initialization...
[ DPDK EAL parameters: nvmf --no-shconf -c 0x3
--base-virtaddr=0x200000000000 --file-prefix=spdk_pid30815 ]
EAL: Detected 24 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
app.c: 609:spdk_app_start: *NOTICE*: Total cores available: 2
reactor.c: 293:_spdk_reactor_run: *NOTICE*: Reactor started on core 1
reactor.c: 293:_spdk_reactor_run: *NOTICE*: Reactor started on core 0
EAL: PCI device 0000:81:00.0 on NUMA socket 1
EAL: probe driver: 8086:2700 spdk_nvme
conf.c: 167:spdk_nvmf_parse_nvmf_tgt: *ERROR*: Deprecated options detected for the NVMe-oF target.
The following options are no longer controlled by the target
and should be set in the transport on a per-transport basis:
MaxQueueDepth, MaxQueuesPerSession, InCapsuleDataSize, MaxIOSize, IOUnitSize
This can be accomplished by setting the options through the create_nvmf_transport RPC.
You may also continue to configure these options in the conf file under each transport.
tcp.c: 566:spdk_nvmf_tcp_create: *NOTICE*: *** TCP Transport Init ***
tcp.c: 767:spdk_nvmf_tcp_listen: *NOTICE*: *** NVMe/TCP Target Listening on 1.1.75.1 port 1023 ***
Thanks
Sasha
Re: [SPDK] SPDK environment initialization from a DPDK application
by Harris, James R
On 11/30/18, 8:29 AM, "SPDK on behalf of Nalla, Pradeep" <spdk-bounces(a)lists.01.org on behalf of Pradeep.Nalla(a)cavium.com> wrote:
Hello
I have a requirement for a DPDK application that forwards packets and also handles NVMe commands. As DPDK's EAL is already initialized
by the forwarding application, calling rte_eal_init again in spdk_env_init causes an issue, with rte_errno being set to EALREADY. Is there
a way for SPDK's env_dpdk library to accept an already-initialized DPDK?
Hi Pradeep,
There's no way to do that currently, but it's a very reasonable request. Would you mind adding an issue in GitHub for this? Since it's an enhancement request, there's no need to fill out the template - adding what you have here is sufficient.
http://github.com/spdk/spdk/issues
I'll add some comments on how it might be implemented in SPDK once the issue is filed.
Thanks,
-Jim
Thanks
Pradeep.
SPDK file system.
by paul.kim@daqscribe.com
Hello my name is Paul Kim.
I am trying to improve write performance through SPDK because of the NVMe file
write rate limit on Linux.
Does SPDK support a file system that can be accessed and verified by end users?
thank you.
SPDK environment initialization from a DPDK application
by Nalla, Pradeep
Hello
I have a requirement for a DPDK application that forwards packets and also handles NVMe commands. As DPDK's EAL is already initialized
by the forwarding application, calling rte_eal_init again in spdk_env_init causes an issue, with rte_errno being set to EALREADY. Is there
a way for SPDK's env_dpdk library to accept an already-initialized DPDK?
Thanks
Pradeep.