Ben,
Thanks for the detailed explanation; it makes sense to me. We’re happy to help resolve
any issues with the SPDK. We intend to keep using it since it is so much more efficient
than the kernel driver.
FYI my tests in the meantime have not uncovered any more issues and the fix worked as
intended.
Thanks for being so responsive and helpful!
-Will
From: SPDK <spdk-bounces(a)lists.01.org> on behalf of "Walker, Benjamin" <benjamin.walker(a)intel.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Thursday, July 7, 2016 at 4:03 PM
To: "spdk(a)lists.01.org" <spdk(a)lists.01.org>
Subject: Re: [SPDK] Callback passed to spdk_nvme_ns_cmd_read not being called sometimes
Glad you found it. Here's an explanation of what was wrong in the NVMe driver:
The queue pairs are implemented as two arrays - one of 64-byte commands and the other of
16-byte completions - and two sets of head/tail indices. The driver submits commands by
copying them into the submission array and then doing an MMIO write to the
"doorbell" register with the new value of the driver's submission queue tail
index. The SSD completes commands by copying a completion into the completion array and
toggling its phase bit, as I explained before. When the completion has been consumed,
the NVMe driver writes an updated value for its completion queue head index to a doorbell
register. This final step is important - it is what signals which entries in the
completion array are open for new completions.
MMIO is expensive (writes less so than reads, but still expensive), so we try very hard to
avoid doing any. Instead of writing a new completion queue head index after each
completion, we recently added a patch that writes out the head index only after we're
done processing a whole batch of completions. Unfortunately, we didn't consider the
case where an entire queue depth's worth of completions is processed in one call to check
for completions. We didn't originally think this was possible because the queue is 256
deep but we only allow 128 actual commands to be outstanding. It turns out it is possible
under very specific circumstances, though, which is what you hit.
Here's a concrete example of one way to hit the problem. Let's say that the
current completion queue head index is 0 and there is an active completion there. When the
function that polls for completions checks that entry and finds the phase bit toggled, it will
immediately call the user callback. If that callback submits I/O and then does something
computationally expensive, it's possible that by the time the user callback returns,
the I/O that was submitted is already complete. That means the polling function will find
the next completion and repeat. It's possible that this goes on until the completion
queue is entirely full (256 entries), at which point no new completions come and we exit
the loop. When we exit the loop, we write the location where we expect the next completion
to be placed - but that's 0 (it's a circular queue), which is the old value. The device
can't detect that we wrote the same value as was previously there, so it acts as if
there was no update. This is responsible for the hang you were seeing. At high queue
depths, the user callback doesn't even have to be particularly expensive to make this
happen on low-latency devices.
We talked about different ways to solve this and the one we've agreed on is to limit
the maximum number of completions in one batch of polling to the queue depth - 1 (255).
I'm submitting a patch for this now.
Thanks for your help digging through this.
On Thu, 2016-07-07 at 18:45 +0000, Will Del Genio wrote:
Ben,
I was able to track down the bug I mentioned in the previous email to an issue in my code.
I believe SPDK is working correctly now, thanks to the change you suggested.
-Will
From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Will Del Genio <wdelgenio(a)xeograph.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Thursday, July 7, 2016 at 10:36 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Callback passed to spdk_nvme_ns_cmd_read not being called sometimes
Ben,
I have run more tests and experienced another failure (the same as before). I wasn’t able
to attach gdb to debug it, so I’ve been trying to reproduce the failure, without luck so
far. I will continue trying.
-Will
From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Will Del Genio <wdelgenio(a)xeograph.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Thursday, July 7, 2016 at 10:09 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Callback passed to spdk_nvme_ns_cmd_read not being called sometimes
Ben,
I manually reverted the change from that commit and it looks like that resolved the issue.
At qd 256 I have not seen any failures.
For reference here is the relevant section of my nvme_qpair.c file now:
		if (++qpair->cq_head == qpair->num_entries) {
			qpair->cq_head = 0;
			qpair->phase = !qpair->phase;
		}

		spdk_mmio_write_4(qpair->cq_hdbl, qpair->cq_head);

		if (++num_completions == max_completions) {
			break;
		}
	}

	return num_completions;
}
Let me know if you need any more help/info/testing. Thank you very much for your help.
-Will
On 7/6/16, 6:23 PM, "SPDK on behalf of Walker, Benjamin" <spdk-bounces(a)lists.01.org on behalf of benjamin.walker(a)intel.com> wrote:
On Wed, 2016-07-06 at 22:56 +0000, Will Del Genio wrote:
Ben,
I tried capping my queue depth to 128 and it significantly reduced the problem (I’m not
entirely sure if it was completely eliminated).
Can you try reverting commit ff7e2122c74b09e5961cbcb2622fda9c0087f48f and see if that
fixes the problem? I believe we see the problem now, and it will only occur with a queue
depth of 256 or greater. If reverting that commit solves the problem, we'll submit a
patch to fix it the right way and I'll explain what happened.
To get the vars I had to put the qd back to 256:

(gdb) p *qpair
$1 = {sq_tdbl = 0x7ffff7ff2008, cq_hdbl = 0x7ffff7ff200c, cmd = 0x7fffdaae0000, cpl = 0x7fffdaade000,
  free_tr = {lh_first = 0x0}, outstanding_tr = {lh_first = 0x7fffdaabf000}, tr = 0x7fffdaa5d000,
  queued_req = {stqh_first = 0x7fffea91a500, stqh_last = 0x7fffea8efec0}, id = 1, num_entries = 256,
  sq_tail = 121, cq_head = 249, phase = 0 '\000', is_enabled = true, sq_in_cmb = false,
  qprio = 0 '\000', ctrlr = 0x7fffddca6740, tailq = {tqe_next = 0x0, tqe_prev = 0x7fffddca7b70},
  cmd_bus_addr = 35430203392, cpl_bus_addr = 35430195200}
(gdb) p qpair->cpl[qpair->cq_head - 1]
$2 = {cdw0 = 0, rsvd1 = 0, sqhd = 250, sqid = 1, cid = 97, status = {p = 0, sc = 0, sct = 0, rsvd2 = 0, m = 0, dnr = 0}}
(gdb) p *(qpair->sq_tdbl)
$4 = 121
(gdb) p *(qpair->cq_hdbl)
$5 = 249
On 7/6/16, 4:40 PM, "SPDK on behalf of Walker, Benjamin" <spdk-bounces(a)lists.01.org on behalf of benjamin.walker(a)intel.com> wrote:
On Wed, 2016-07-06 at 21:00 +0000, Will Del Genio wrote:
> Ben,
>
> Thanks, you explained that very well. I’m working with a random 4k read-only workload of queue depth 256.

Can you try capping your queue depth at 128? That's the maximum I/O we allow outstanding
at the hardware. The NVMe driver should be doing software queueing beyond that
automatically, but this data point will help narrow down the problem.
> I’m using 4 drives and one thread per drive. If it’s true that there are just no more completions to handle, then I will recheck the code I wrote to keep track of the number of outstanding read requests.
>
> Here is the qpair:
>
> (gdb) p *qpair
> $1 = {sq_tdbl = 0x7ffff7ff2008, cq_hdbl = 0x7ffff7ff200c, cmd = 0x7fffdaae0000, cpl = 0x7fffdaade000,
>   free_tr = {lh_first = 0x0}, outstanding_tr = {lh_first = 0x7fffdaad8000}, tr = 0x7fffdaa5d000,
>   queued_req = {stqh_first = 0x7fffea9a1780, stqh_last = 0x7fffea9cdf40}, id = 1, num_entries = 256,
>   sq_tail = 249, cq_head = 121, phase = 0 '\000', is_enabled = true, sq_in_cmb = false,
Note how sq_tail - cq_head is 128, meaning the driver believes there to be 128 commands
outstanding. The driver's view of the world (128 commands outstanding) doesn't line up
with us not getting any NVMe completions - there is definitely a problem here.
>   qprio = 0 '\000', ctrlr = 0x7fffddca6740, tailq = {tqe_next = 0x0, tqe_prev = 0x7fffddca7b70},
>   cmd_bus_addr = 35430203392, cpl_bus_addr = 35430195200}
> (gdb) p qpair->phase
> $2 = 0 '\000'
> (gdb) p qpair->cpl[qpair->cq_head]
> $3 = {cdw0 = 0, rsvd1 = 0, sqhd = 132, sqid = 1, cid = 112, status = {p = 1, sc = 0, sct = 0, rsvd2 = 0, m = 0, dnr = 0}}
Can you print out the following 3 things:
- qpair->cpl[qpair->cq_head - 1]
- qpair->sq_tdbl
- qpair->cq_hdbl
>
> -Will
>
> On 7/6/16, 3:50 PM, "SPDK on behalf of Walker, Benjamin" <spdk-bounces(a)lists.01.org on behalf of benjamin.walker(a)intel.com> wrote:
>
> On Wed, 2016-07-06 at 20:33 +0000, Will Del Genio wrote:
> > Andrey,
> >
> > I was able to step into the spdk_nvme_qpair_process_completions() function with gdb and found the reason it isn’t returning any completions is that this check is failing at line 469: if (cpl->status.p != qpair->phase)
> >
> > Relevant gdb info here:
> > Thread 4 "xg:nvmeIo:9" hit Breakpoint 5, spdk_nvme_qpair_process_completions (qpair=0x7ffff0a5baa0, max_completions=0) at nvme_qpair.c:463
> > 463 in nvme_qpair.c
> > (gdb) p qpair->cpl[qpair->cq_head]
> > $11 = {cdw0 = 0, rsvd1 = 0, sqhd = 51, sqid = 1, cid = 27, status = {p = 0, sc = 0, sct = 0, rsvd2 = 0, m = 0, dnr = 0}}
> > (gdb) p qpair->phase
> > $12 = 1 '\001'
> >
> > What does this mean? Does this information help at all?
>
> The NVMe hardware queue pairs consist of two arrays - one of commands and the other of responses - and a set of head and tail indices. The arrays are circular, so you can loop back around to the beginning. Each entry in the array contains a phase bit, which is either 1 or 0. On the first pass through the array, new entries in the queue are marked by setting their phase to 1. On the next pass through the array, new elements are marked by setting their phase bit to 0, etc. The current iteration's expected phase value is stored in qpair->phase. So the code you are looking at on lines 467-470 is basically saying:
>
> 1) Grab the completion entry at the head of the queue
> 2) Check its phase bit. If it hasn't toggled, there is no new completion, so exit
>
> All that means is that there are no completions outstanding for any commands according to the SSD, which doesn't narrow much down. At the point where you are broken in, can you just dump out the whole qpair structure? Something like "p *qpair" should do it. That way I can see if there are any commands actually pending at the device and what state the device is in. My expectation is that there aren't any commands outstanding.
>
> Can you also provide some background as to what kind of I/O you're submitting (read or write, size, queue depth, etc.) when this occurs?
>
>
> -Will
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Date: Wednesday, July 6, 2016 at 2:35 PM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: Re: [SPDK] Callback passed to spdk_nvme_ns_cmd_read not being called sometimes
> >
> > On Wed, Jul 6, 2016, 20:56 Will Del Genio <wdelgenio(a)xeograph.com> wrote:
> >
> >
> > Andrey,
> > That sounds exactly like what we are experiencing; however, we’re working off the spdk codebase that was current as of last week and are still experiencing the issue. Do you know what the resource allocation fault was and how we might be able to determine if that is still occurring?
> >
> > I'll take a look at the commit log, both SPDK's and mine, and will get back to you.
> >
> > Regards,
> >
> > Andrey
> >
> > Ben,
> > We’re ASSERTing that the result of spdk_nvme_ns_cmd_read() == 0. If I set our queue depth high enough, it will fail that assertion, as would be expected. Whatever other failure we’re experiencing does not seem to be causing spdk_nvme_ns_cmd_read() to return an error code.
> >
> > Also, I performed some tests with the spdk perf tool and was not able to replicate our problem. It ran fine at various queue depths and core masks. When the qd was set too high, it failed gracefully with an error message. This is all as expected.
> >
> > I’d like to continue down the path of investigating whether some resource allocation or something else is failing silently for us. Any specific ideas?
> >
> > Thanks!
> > Will
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Date: Wednesday, July 6, 2016 at 12:01 PM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: Re: [SPDK] Callback passed to spdk_nvme_ns_cmd_read not being called sometimes
> >
> > On Wed, Jul 6, 2016 at 6:35 PM, Walker, Benjamin <benjamin.walker(a)intel.com> wrote:
> >
> > Hi Will,
> >
> > Since I can't see the code for your application, I'd like to try and reproduce the problem with code that I have some visibility into. Are you able to reproduce the problem using our perf tool (examples/nvme/perf)? If you aren't, this is likely a problem with your test application and not SPDK.
> >
> > I had been witnessing a similar issue with an earlier SPDK release, back around Feb, where the submit call was failing due to a resource allocation fault and neither returning an error nor invoking the callback, but my issue has been fixed in the recent release (I can't recall the actual commit, but there definitely was one dealing exactly with the cause).
> >
> > Based on the symptoms, my best guess is that your memory pool ran out of request objects. The first thing to check is whether spdk_nvme_ns_cmd_read failed. If it fails, it won't call the callback. You can check for failure by looking at the return value - see the documentation here. Your application allocates this memory pool up front - all of our examples allocate 8k requests (see line 1097 in examples/nvme/perf/perf.c). You need to allocate a large enough pool to handle the maximum number of outstanding requests you plan to have. We recently added a "hello_world" example with tons of comments. One of the comments explains this memory pool in detail.
> >
> > That memory pool allocation is a bit of a wart on our otherwise clean API. We're looking at different strategies to clean that up. Let me know what the result of the debugging is and I'll shoot you some more ideas to try if necessary.
> >
> > Are there any plans regarding the global request pool rework?
> >
> > Regards,
> >
> > Andrey
> >
> > Thanks,
> >
> > Ben
> >
> > On Tue, 2016-07-05 at 21:03 +0000, Will Del Genio wrote:
> >
> > Hello,
> > We have written a test application that uses the spdk library to benchmark a set of 3 Intel P3700 drives and a single 750 drive (concurrently). We’ve done some testing using fio and the kernel nvme drivers and have had no problem achieving the claimed IOPS (4k random read) of all drives on our system.
> >
> > What we have found during our testing is that spdk will sometimes start to silently fail to call the callback passed to spdk_nvme_ns_cmd_read in the following situations:
> > 1. Testing a single drive and passing in 0 for max_completions to spdk_nvme_qpair_process_completions(). We haven’t seen any issues with single-drive testing when max_completions was > 0.
> > 2. Testing all four drives at once will result in one drive failing to receive callbacks, seemingly regardless of what number we pass for max_completions (1 through 128).
> >
> > Here are other observations we’ve made:
> > - When the callbacks fail to be called for a drive, they fail to be called for the remaining duration of the test.
> > - The drive that ‘fails’ when testing 4 drives concurrently varies from test to test.
> > - ‘Failure’ of a drive seems to be correlated with the number of outstanding read operations, though it is not a strict correlation.
> > Our system is a dual-socket E5-2630 v3. One drive is on a PCI slot for CPU 0 and the other 3 are on PCI slots on CPU 1. The master/slave threads are on the same CPU socket as the nvme device they are talking to.
> >
> > We’d like to know what is causing this issue and what we can do to help investigate the problem. What other information can we provide? Is there some part of the spdk code that we can look at to help determine the cause?
> >
> > Thanks,
> > Will
_______________________________________________
SPDK mailing list
SPDK@lists.01.org
https://lists.01.org/mailman/listinfo/spdk