ibv_poll_cq returns corrupted wr_id causes application to crash
by Shahar Salzman
Hi,
I have been running into some "strange" issues lately where my application is crashing in a new switched environment.
I am not sure that all the components are tuned to the new environment, where I want to use PFC instead of global pause and the switches run Cumulus Linux instead of Mellanox/Dell OS. I am working on getting the low-level configuration in order, and I will update if the issue reproduces once both the switches and the end nodes are properly configured.
The crashes are a result of ibv_poll_cq returning an incorrect wr_id, so the RDMA request contains garbage.
In the crash below, we segfault when attempting to access rqpair->state_queue[rdma_req->state]: the state is invalid (2005399), so we get a protection fault when accessing this offset in the array.
The code is slightly outdated (SPDK 18.10), but given the nature of the corruption, this will also crash upstream SPDK or the 19.01 release, just in a different location.
I opened an issue with Mellanox, but I would also like to notify the list and ask whether this type of behavior from ibv_poll_cq is known.
I am currently working around this issue by testing the rdma_req for signs of corruption and, if it looks corrupted, dropping it at the risk of losing some data buffers.
The problem starts with QPs being removed; then one of the I/Os is returned corrupted and we crash:
Mar 7 10:02:41 kblock01-knode01 reactor_1[19958]: rdma.c:2701:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f5c28204770, Request 0x140033786935248 (12): transport retry counter exceeded
Mar 7 10:02:41 kblock01-knode01 reactor_1[19958]: rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#0 changed to: IBV_QPS_ERR
Mar 7 10:02:41 kblock01-knode01 reactor_1[19958]: rdma.c:2701:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f5c28204770, Request 0x140033786935248 (5): Work Request Flushed Error
Mar 7 10:02:41 kblock01-knode01 reactor_1[19958]: rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#0 changed to: IBV_QPS_ERR
Mar 7 10:02:41 kblock01-knode01 reactor_1[19958]: rdma.c:2701:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f5c28204770, Request 0x140033786936568 (5): Work Request Flushed Error
Mar 7 10:02:41 kblock01-knode01 reactor_1[19958]: rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#0 changed to: IBV_QPS_ERR
Mar 7 10:02:41 kblock01-knode01 reactor_1[19958]: rdma.c:2701:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f5c28204770, Request 0x140033786937888 (5): Work Request Flushed Error
Mar 7 10:02:41 kblock01-knode01 reactor_1[19958]: rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#0 changed to: IBV_QPS_ERR
Mar 7 10:02:41 kblock01-knode01 reactor_1[19958]: rdma.c:2701:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f5c28204770, Request 0x140033785369568 (5): Work Request Flushed Error
Looking at the rdma request, we can see that the state is invalid:
(gdb) p rdma_req->state
$14 = 2005399
The array we are trying to access has only 12 entries:
(gdb) p sizeof(rqpair->state_queue)/sizeof(*rqpair->state_queue)
$17 = 12
I looked at the other fields in the rdma request and they all seem corrupt, so with later versions of SPDK that no longer have the state_queue, we would probably crash when attempting to return invalid buffers to the buffer pool.
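The kind of sanity check described above as a workaround could look roughly like the following. This is only a minimal sketch under stated assumptions, not the actual SPDK code: the demo_* types and names are hypothetical stand-ins, it assumes wr_id carries a pointer into a contiguous array of request structures, and it takes the state count of 12 from the gdb output above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real SPDK structures are larger and differ. */
enum { DEMO_REQ_STATE_COUNT = 12 }; /* matches the state_queue size seen in gdb */

struct demo_rdma_req {
    int state;     /* index into the per-QP state queues */
    void *payload; /* placeholder for the remaining fields */
};

struct demo_req_pool {
    struct demo_rdma_req *reqs; /* contiguous array owned by the QP */
    size_t count;               /* number of entries in reqs */
};

/*
 * Map a completion's wr_id back to a request, or return NULL when the
 * value does not look like a request from this pool. The caller would
 * then drop the completion instead of dereferencing garbage.
 */
static struct demo_rdma_req *
demo_req_from_wr_id(const struct demo_req_pool *pool, uint64_t wr_id)
{
    assert(pool != NULL && pool->reqs != NULL);

    uintptr_t base = (uintptr_t)pool->reqs;
    uintptr_t end = base + pool->count * sizeof(*pool->reqs);
    uintptr_t addr = (uintptr_t)wr_id;

    /* The pointer must fall inside the pool... */
    if (addr < base || addr >= end) {
        return NULL;
    }
    /* ...and land exactly on the start of an element. */
    if ((addr - base) % sizeof(*pool->reqs) != 0) {
        return NULL;
    }

    struct demo_rdma_req *req = (struct demo_rdma_req *)addr;

    /* A state outside the known range means the contents are corrupt. */
    if (req->state < 0 || req->state >= DEMO_REQ_STATE_COUNT) {
        return NULL;
    }
    return req;
}
```

A poller would call demo_req_from_wr_id() on each completion and skip the entry when it returns NULL. As noted above, dropping a request this way may leak its data buffers, so it only hides the symptom.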
I will update if this reproduces once the environment is 100% PFC and both OFED and the firmware are updated.
Shahar
1 year, 11 months
New contributor.
by haris iqbal
Hi,
I have created a Trello account and have identified a few low-hanging
fruits to work on for starters. However, I am unable to comment on
them. Can I be added to the board?
My name on the Trello website is Md Haris Iqbal (mdharisiqbal)
--
With regards,
Md Haris Iqbal,
Contact: +91 8861996962
1 year, 11 months
FW: DPDK development process and tools survey
by Honnappa Nagarahalli
The DPDK community is trying to improve DPDK's development process. We are conducting a survey to understand the pain points. The survey itself takes no more than 10 minutes. If you have worked with the community and have feedback, please consider taking the survey.
The survey link is: https://forms.office.com/Pages/ResponsePage.aspx?id=eVlO89lXqkqtTbEipmIYT...
The survey is open till 23rd March 2019.
Thank you,
Honnappa
> > -----Original Message-----
> > From: Honnappa Nagarahalli <Honnappa.Nagarahalli(a)arm.com>
> > Sent: Wednesday, February 27, 2019 3:08 PM
> > To: announce(a)dpdk.org
> > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli(a)arm.com>; nd
> > <nd(a)arm.com>
> > Subject: DPDK development process and tools survey
> >
> > Hello,
> > There have been questions/comments in the past DPDK summits on
> > improving the development process and the tools being used. This
> > survey is being conducted to better understand the pain points and
> > arrive at a set of tools to use going forward.
> >
> > The survey itself will be done in 2 stages.
> > 1) Understand the problems faced by the community (this survey)
> > 2) *If required*, another survey to choose from available solutions
> >
> > The survey itself does not take more than 10mns. It is for all of us
> > in the community to improve the way we contribute, I highly
> > encourage you to take the survey.
> >
> > The survey is open till 13th March 2019, 6:00PM CST.
> >
> > Thank you,
> > Honnappa
> >
> > Survey Link:
> > https://forms.office.com/Pages/ResponsePage.aspx?id=eVlO89lXqkqtTbEipmIYTcwgJ8psxytOnArCkHeSZSZUREdIN09QOEVRSUJWN0I2TzNYUTk5STVJRC4u
1 year, 11 months
Commit 0fe8cd17111f5870aad56387a57e927d67a20bec
by Andrey Kuzmin
Recently I've spent some time debugging an issue that turned out to be
addressed by the recent commit
https://github.com/spdk/spdk/commit/0fe8cd17111f5870aad56387a57e927d67a20bec.
Could someone give me a hint on what release this is going to be
merged into so that I can properly ifdef the temporary fix in my code?
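For what it is worth, one way to gate such a temporary fix on the SPDK release is sketched below. This is only an illustration under assumptions: it assumes the tree ships spdk/version.h providing SPDK_VERSION and SPDK_VERSION_NUM, the stand-in definitions are used only when that header is not found, and FIX_RELEASE_HYPOTHETICAL is a placeholder value, since the release that will contain the commit is exactly the open question here.

```c
#include <assert.h>
#include <stdbool.h>

/* Use the real header when it is on the include path. */
#if defined(__has_include)
#  if __has_include(<spdk/version.h>)
#    include <spdk/version.h>
#  endif
#endif

#ifndef SPDK_VERSION
/*
 * Stand-ins so the sketch compiles without SPDK installed. The encoding
 * and the 19.01 value are assumptions made for illustration only.
 */
#define SPDK_VERSION_NUM(major, minor, patch) \
    (((major) * 100 + (minor)) * 100 + (patch))
#define SPDK_VERSION SPDK_VERSION_NUM(19, 1, 0)
#endif

/* HYPOTHETICAL: replace with the first release that contains the commit. */
#define FIX_RELEASE_HYPOTHETICAL SPDK_VERSION_NUM(19, 4, 0)

#if SPDK_VERSION < FIX_RELEASE_HYPOTHETICAL
#define NEED_LOCAL_WORKAROUND 1 /* building against a release without the fix */
#else
#define NEED_LOCAL_WORKAROUND 0 /* upstream fix present, skip the local one */
#endif

/* Runtime-visible form of the compile-time decision, e.g. for logging. */
static bool
need_local_workaround(void)
{
    return NEED_LOCAL_WORKAROUND != 0;
}
```

The temporary code would then sit inside `#if NEED_LOCAL_WORKAROUND ... #endif`, so it drops out once the build moves to a release that includes the commit.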
On a separate note, I've discovered a similar issue when using
bdevperf, which seems to be due to bdevperf opening bdevs without an
unregister callback, hitting the no-callback corner case in
bdev_close. A bdevperf unregister callback is fairly straightforward to
provide (the code is basically already there); let me know if there is
any interest in a patch for this.
Thanks,
Andrey
1 year, 11 months
SPDK Continuous Integration Log Change
by Howell, Seth
Hi All,
As you may or may not know, we have been working on moving our continuous integration status page and build logs from a privately hosted site to a cloud-based site. Currently, the cloud site is fully functional and running in parallel with the local site.
This last week we have done localized testing of the site and worked out the initial quirks. Starting tomorrow, we will start pointing users to the new site from GerritHub. This will mean that log URLs will look slightly different:
https://ci.spdk.io/spdk-jenkins/public_build/autotest-per-patch_25634.html -> https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_2563...
Also, the top level status page can be accessed from https://dqtibwqq6s6ux.cloudfront.net.
There is still an overarching CI page located at https://spdk.io/ci/ from which you will be able to access the status page and other relevant CI links, such as the Trello boards related to the project.
We are slated to shut down ci.spdk.io as of March 31. Please provide as much feedback about the new hosting solution as you can before that day since making changes to the new site will be much easier while we still have a redundant backup.
Thanks,
Seth Howell
1 year, 12 months
Questions about probing DPDK-compatible NIC in SPDK
by 김도훈
Hi all,
I have a requirement for a DPDK application that handles both packets and
NVMe commands. So I am trying to build an SPDK application that also uses
DPDK network functions (such as those in rte_ethdev.h).
For that, I called *rte_eal_init()* in my SPDK application, expecting it
to probe the NICs. However, it probes nothing, and
*rte_eth_dev_count_avail()* returns 0. On the other hand, when I test
*rte_eal_init()* with other DPDK example applications in the same
environment, it probes the network cards correctly during initialization.
I tested both with the same DPDK submodule, using an Intel 10G network
card (X520), with both the igb_uio and ixgbe NIC drivers.
In that context, I have the following questions:
1. Is there any other configuration needed to probe NICs in SPDK?
2. How do I enable a network card in SPDK?
Thanks for your time and patience.
Thanks,
Dohun.
1 year, 12 months
Compiling error on different types of Intel CPU
by Ernest Zed
Hi,
In continuation of this issue from May 2018, I don't see that anything has
changed in the makefile. SPDK is still built with `-march=native`, which hit
us hard two days ago when we upgraded our build machine to the latest Xeon,
which supports AVX512. Our test machines do not have this extension, so
binaries built on the newer machine won't run on the older ones. Meanwhile,
I am going to change `-march=native` to something like `-march=core-avx-i`.
But is it possible to change the `configure` script so that it passes
additional compilation flags to `make`?
1 year, 12 months
SW Quality Improvements
by Luse, Paul E
Hi Everyone,
I added a new Trello board today called "SW Quality Improvements" https://trello.com/b/2Qx4F4X5 and it's kind of a catch-all list of various ideas on the topic. Everyone is welcome to add ideas, comment on existing ones, even take on a few cards!
Thanks,
Paul
1 year, 12 months
SPDK build and submodules
by E.W.Z.
Hi,
Consider the following limitation when building SPDK: no submodules can be used, but I can build isa-l/ipsec/DPDK outside of the SPDK tree. What versioning scheme is used for the submodules? For example, how does SPDK 19.01 map to releases of the other three modules? Is it the latest release of each at the time 19.01 was delivered? Where can I get this information? Thanks!
Sincerely,
Ernest
1 year, 12 months