Hello Andrey,
Thank you for your information.
That's very helpful.
I will turn on the --enable-asan flag and try again.
Thank you
On Wed, Nov 14, 2018 at 7:02 PM Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com> wrote:
I suggest you open a ticket for the issue under
https://github.com/spdk/spdk/issues/.
If you have reason to believe that this is caused by memory corruption,
try running with the address sanitizer enabled (./configure --enable-asan).
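For reference, the ASAN rebuild suggested here would look roughly like the sketch below (run from the SPDK source tree; --enable-asan and --enable-debug are SPDK configure options, and the -j value is just an example):

```shell
# Reconfigure and rebuild SPDK with AddressSanitizer instrumentation.
./configure --enable-asan --enable-debug
make -j8

# Then run the target as before. Instead of a delayed segfault, ASAN
# aborts at the first invalid access and prints the offending stack
# along with the allocation/free stacks of the memory involved.
```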
Regards,
Andrey
On Wed, Nov 14, 2018 at 12:32 PM Vincent <cockroach1136(a)gmail.com> wrote:
> Hello Andrey,
>
> Actually, we have still been stuck on the disk hot-remove problem for the
> past two weeks... :)
>
> We ran some experiments, and sometimes the system crashes with the call
> stack attached below.
>
> It seems that the (tr->req) is 0xffffffffffff
>
> We also found an interesting phenomenon: if we increase the hotplug
> poller interval, for example from 0 to 10 ms, the crash rate increases.
>
> We also found another strange phenomenon: if we remove two disks within a
> short interval (for example, 5 seconds), IO on the other attached disks
> gets no response.
>
> I know it looks like there is memory corruption in my program, but I have
> checked again and again and cannot find any memory corruption in it.
>
> So, do you have any hint about the phenomena that I am seeing?
>
> Any advice is appreciated.
>
> Thank you so much.
>
>
> (gdb) bt
>
> #0  nvme_io_qpair_print_command (qpair=0x7f827e53adf8, cmd=0xffffffffffffffff) at nvme_qpair.c:121
> #1  nvme_qpair_print_command (qpair=qpair@entry=0x7f827e53adf8, cmd=cmd@entry=0xffffffffffffffff) at nvme_qpair.c:156
> #2  0x00000000004121b6 in nvme_pcie_qpair_complete_tracker (qpair=qpair@entry=0x7f827e53adf8, tr=0x7f827feb4000, cpl=cpl@entry=0x7f827e36b920, print_on_error=print_on_error@entry=true) at nvme_pcie.c:1144
> #3  0x0000000000413c50 in nvme_pcie_qpair_process_completions (qpair=qpair@entry=0x7f827e53adf8, max_completions=64, max_completions@entry=0) at nvme_pcie.c:2013
> #4  0x0000000000415deb in nvme_transport_qpair_process_completions (qpair=qpair@entry=0x7f827e53adf8, max_completions=max_completions@entry=0) at nvme_transport.c:201
> #5  0x000000000041450d in spdk_nvme_qpair_process_completions (qpair=0x7f827e53adf8, max_completions=max_completions@entry=0) at nvme_qpair.c:368
> #6  0x000000000040a289 in bdev_nvme_poll (arg=0x7f7fb00018d0) at bdev_nvme.c:222
> #7  0x000000000049a2ea in _spdk_reactor_run (arg=0x49b2dc0) at reactor.c:452
> #8  0x00000000004a49c4 in eal_thread_loop ()
> #9  0x00007f87e9402e25 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007f87e912cbad in clone () from /lib64/libc.so.6
>
> (gdb)
>
>
>
> On Sun, Oct 28, 2018 at 7:33 PM Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com> wrote:
>
> > On Sun, Oct 28, 2018 at 7:19 AM Vincent <cockroach1136(a)gmail.com> wrote:
> >
> > > Hello all,
> > > Recently we have been trying the disk hot-remove feature of SPDK.
> > >
> > > We have a counter that records the IOs sent out on an IO channel.
> > >
> > > The rough hot-remove procedure in my code is:
> > >
> > > (1) when we receive the disk hot-remove callback from SPDK, we stop
> > > sending IO
> > > (2) because we have a counter recording the IOs sent out on each IO
> > > channel, we wait for all IOs to complete (call back)
> > > (3) close the IO channel
> > > (4) close the bdev desc
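The counter-based drain in steps (1)-(4) can be modeled with a small self-contained sketch using C11 atomics; all names here are illustrative, not SPDK API:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the per-channel in-flight counter described
 * above: submissions bump the counter, completions drop it, and the
 * channel/desc may be closed only once it reaches zero. */
struct channel_state {
	atomic_int inflight;   /* IOs submitted but not yet completed */
	atomic_bool draining;  /* set when the hot-remove callback fires */
};

/* Called before submitting an IO; refuses new IO while draining (step 1). */
static bool io_submit(struct channel_state *ch)
{
	if (atomic_load(&ch->draining)) {
		return false;
	}
	atomic_fetch_add(&ch->inflight, 1);
	return true;
}

/* Called from the IO completion callback. */
static void io_complete(struct channel_state *ch)
{
	atomic_fetch_sub(&ch->inflight, 1);
}

/* Step (2): returns true only when every outstanding IO has completed,
 * i.e. when it is safe to perform steps (3) and (4). */
static bool ready_to_close(struct channel_state *ch)
{
	atomic_store(&ch->draining, true);
	return atomic_load(&ch->inflight) == 0;
}
```

Note that even when this drain logic is correct, it only guarantees that the application's own callbacks have fired; it says nothing about what the driver's completion path does after the callback returns.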
> > >
> > > But sometimes we crash (the crash rate is about 1/10); the call
> > > stack is attached. The crash is in function nvme_free_request:
> > > void
> > > nvme_free_request(struct nvme_request *req)
> > > {
> > > 	assert(req != NULL);
> > > 	assert(req->num_children == 0);
> > > 	assert(req->qpair != NULL);
> > >
> > > 	STAILQ_INSERT_HEAD(&req->qpair->free_req, req, stailq);   <------- this line
> > > }
> > >
> > > Can anyone give me a hint?
> > >
> >
> > It looks like nvme_pcie_qpair is not reference counted, and thus the
> > nvme completion path below does not account for the possibility that the
> > user callback fired by nvme_complete_request() closes the I/O channel
> > (which, for the nvme bdev, destroys the underlying qpair) before the
> > associated nvme request is freed. If that happens, nvme_free_request()
> > is entered after the underlying qpair has been destroyed, potentially
> > crashing the app.
> >
> > Regards,
> > Andrey
> >
> > static void
> > nvme_pcie_qpair_complete_tracker(struct spdk_nvme_qpair *qpair, struct nvme_tracker *tr,
> >                                  struct spdk_nvme_cpl *cpl, bool print_on_error)
> > {
> > [snip]
> > 	if (retry) {
> > 		req->retries++;
> > 		nvme_pcie_qpair_submit_tracker(qpair, tr);
> > 	} else {
> > 		if (was_active) {
> > 			/* Only check admin requests from different processes. */
> > 			if (nvme_qpair_is_admin_queue(qpair) && req->pid != getpid()) {
> > 				req_from_current_proc = false;
> > 				nvme_pcie_qpair_insert_pending_admin_request(qpair, req, cpl);
> > 			} else {
> > 				nvme_complete_request(req, cpl);
> > 			}
> > 		}
> >
> > 		if (req_from_current_proc == true) {
> > 			nvme_free_request(req);
> > 		}
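One way to close the window Andrey describes, sketched here as a self-contained toy (illustrative names, not the actual SPDK code or fix): have the completion path take its own reference on the qpair before firing the user callback, so a close from inside the callback only drops the owner's reference and actual destruction is deferred until the completion path has finished touching the request:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Toy qpair with a reference count; 'freed' stands in for real
 * destruction so the sequence below can be observed safely. */
struct toy_qpair {
	atomic_int refs;
	int freed;
};

static struct toy_qpair *qpair_create(void)
{
	struct toy_qpair *q = calloc(1, sizeof(*q));
	atomic_init(&q->refs, 1);   /* the owner's (channel's) reference */
	return q;
}

static void qpair_get(struct toy_qpair *q)
{
	atomic_fetch_add(&q->refs, 1);
}

/* Drop one reference; "destroy" the qpair when the last one goes away. */
static void qpair_put(struct toy_qpair *q)
{
	if (atomic_fetch_sub(&q->refs, 1) == 1) {
		q->freed = 1;   /* stand-in for the real teardown */
	}
}

/* Mimics the completion path: pin the qpair across the user callback,
 * so the request could still be freed safely afterwards even if the
 * callback closed the channel. Returns 1 on the expected outcome:
 * alive during completion, destroyed only after the final put. */
static int hot_remove_during_completion(void)
{
	struct toy_qpair *q = qpair_create();

	qpair_get(q);                  /* completion path pins the qpair */
	qpair_put(q);                  /* user callback closes the channel */
	int alive_during = !q->freed;  /* nvme_free_request() would run here */
	qpair_put(q);                  /* completion path releases its pin */
	int freed_after = q->freed;

	free(q);
	return alive_during && freed_after;
}
```

An alternative on the application side is simply to defer the channel/desc close out of the IO completion callback (e.g. to a separate event), so the qpair is never torn down while the completion path is still running.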
> >
> >
> >
> > >
> > > Any suggestion is appreciated
> > >
> > > Thank you in advance
> > >
> > >
> > >
> >
>
--------------------------------------------------------------------------------------------------------------------------
> > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > Core was generated by `./smistor_iscsi_tgt -c /usr/smistor/config/smistor_iscsi_perf.conf'.
> > > Program terminated with signal 11, Segmentation fault.
> > > #0  0x0000000000414b87 in nvme_free_request (req=req@entry=0x7fe4edbf3100) at nvme.c:227
> > > 227	nvme.c: No such file or directory.
> > > Missing separate debuginfos, use: debuginfo-install
> > > bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.170-4.el7.x86_64
> > > elfutils-libs-0.170-4.el7.x86_64 glibc-2.17-222.el7.x86_64
> > > libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64
> > > libcap-2.22-9.el7.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64
> > > libuuid-2.23.2-52.el7.x86_64 lz4-1.7.5-2.el7.x86_64
> > > numactl-libs-2.0.9-7.el7.x86_64 openssl-libs-1.0.2k-12.el7.x86_64
> > > systemd-libs-219-57.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64
> > > zlib-1.2.7-17.el7.x86_64
> > > (gdb) bt
> > > #0  0x0000000000414b87 in nvme_free_request (req=req@entry=0x7fe4edbf3100) at nvme.c:227
> > > #1  0x0000000000412056 in nvme_pcie_qpair_complete_tracker (qpair=qpair@entry=0x7fe4c5376ef8, tr=0x7fe4ca8ad000, cpl=cpl@entry=0x7fe4c8e0a840, print_on_error=print_on_error@entry=true) at nvme_pcie.c:1170
> > > #2  0x0000000000413be0 in nvme_pcie_qpair_process_completions (qpair=qpair@entry=0x7fe4c5376ef8, max_completions=64, max_completions@entry=0) at nvme_pcie.c:2013
> > > #3  0x0000000000415d7b in nvme_transport_qpair_process_completions (qpair=qpair@entry=0x7fe4c5376ef8, max_completions=max_completions@entry=0) at nvme_transport.c:201
> > > #4  0x000000000041449d in spdk_nvme_qpair_process_completions (qpair=0x7fe4c5376ef8, max_completions=max_completions@entry=0) at nvme_qpair.c:368
> > > #5  0x000000000040a289 in bdev_nvme_poll (arg=0x7fe08c0012a0) at bdev_nvme.c:208
> > > #6  0x0000000000499baa in _spdk_reactor_run (arg=0x6081dc0) at reactor.c:452
> > > #7  0x00000000004a4284 in eal_thread_loop ()
> > > #8  0x00007fe8fb276e25 in start_thread () from /lib64/libpthread.so.0
> > > #9  0x00007fe8fafa0bad in clone () from /lib64/libc.so.6
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk