On 8/3/20 4:31 AM, Dave Chinner wrote:
On Wed, Jul 29, 2020 at 02:15:18PM +0530, Ritesh Harjani wrote:
> On systems that do not have CONFIG_PREEMPT set, a heavy multi-threaded
> load/store workload on pmem, sometimes combined with device latencies,
> can trigger softlockup warnings like the one below. This was seen on
> Power, where the page size is 64K.
> To avoid the softlockup, this patch adds a cond_resched() in this path.
> watchdog: BUG: soft lockup - CPU#31 stuck for 22s!
> CPU: 31 PID: 15627 <..> 5.3.18-20
> NIP memcpy_power7+0x43c/0x7e0
> LR memcpy_flushcache+0x28/0xa0
> Call Trace:
> memcpy_power7+0x274/0x7e0 (unreliable)
> write_pmem+0xa0/0x100 [nd_pmem]
> pmem_do_bvec+0x1f0/0x420 [nd_pmem]
> pmem_make_request+0x14c/0x370 [nd_pmem]
> xfs_zero_extent+0x90/0xc0 [xfs]
> xfs_bmapi_convert_unwritten+0x198/0x230 [xfs]
> xfs_bmapi_write+0x284/0x630 [xfs]
> xfs_iomap_write_direct+0x1f0/0x3e0 [xfs]
> xfs_file_iomap_begin+0x344/0x690 [xfs]
> __xfs_filemap_fault+0x26c/0x2b0 [xfs]
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
> Signed-off-by: Ritesh Harjani <riteshh(a)linux.ibm.com>
> drivers/nvdimm/pmem.c | 1 +
> 1 file changed, 1 insertion(+)
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 2df6994acf83..fcf7af13897e 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -214,6 +214,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct
> bio->bi_status = rc;
> + cond_resched();
There are already cond_resched() calls between submitted bios in
blkdev_issue_zeroout() via both __blkdev_issue_zero_pages() and
__blkdev_issue_write_zeroes(), so I'm kinda wondering where the
problem is coming from here.
The problem is coming from the bio submission itself, i.e. the submit_bio() call.
Just how big is the bio being issued here that it spins for 22s
trying to copy it?
It's 256 (due to BIO_MAX_PAGES) * 64KB (pagesize) = 16MB.
So this is definitely not an easy trigger; as per the tester, it was
mainly seen on a VM.
Looking at the cond_resched() inside dax_writeback_mapping_range(),
in the xas_for_each_marked() loop, I thought it would be good to have a
cond_resched() in the above path as well.
Hence an RFC for discussion.
> And, really, if the system is that bound on cacheline bouncing that
> it prevents memcpy() from making progress, I think we probably
> should be issuing a soft lockup warning like this...