On Sat, May 16, 2015 at 6:19 PM, Elliott, Robert (Server Storage) wrote:
> -----Original Message-----
> From: Linux-nvdimm [mailto:email@example.com] On Behalf Of
> Dan Williams
> Sent: Tuesday, April 28, 2015 1:26 PM
> To: linux-nvdimm(a)lists.01.org
> Cc: Ingo Molnar; Neil Brown; Greg KH; Dave Chinner; linux-
> kernel(a)vger.kernel.org; Andy Lutomirski; Jens Axboe; H. Peter Anvin;
> Christoph Hellwig
> Subject: [Linux-nvdimm] [PATCH v2 19/20] nd_btt: atomic sector updates
> From: Vishal Verma <vishal.l.verma(a)linux.intel.com>
> BTT stands for Block Translation Table, and is a way to provide power
> fail sector atomicity semantics for block devices that have the ability
> to perform byte granularity IO. It relies on the ->rw_bytes() capability
> of provided nd namespace devices.
> The BTT works as a stacked block device, and reserves a chunk of space
> from the backing device for its accounting metadata. BLK namespaces may
> mandate use of a BTT and expect the bus to initialize a BTT if not
> already present. Otherwise if a BTT is desired for other namespaces (or
> partitions of a namespace) a BTT may be manually configured.
Running btt above pmem with a variety of workloads, I see an awful lot
of time spent in two places:
This occurs for fio to raw /dev/ndN devices, ddpt over ext4 or xfs,
cp -R of large directories, and running make on the linux kernel.
Some specific results:
fio 4 KiB random reads, WC cache type, memcpy:
* 43175 MB/s, 8 M IOPS pmem0 and pmem1
* 18500 MB/s, 1.5 M IOPS nd0 and nd1
fio 4 KiB random reads, WC cache type, memcpy with non-temporal
loads (when everything is 64-byte aligned):
* 33814 MB/s, 4.3 M IOPS nd0 and nd1
Zeroing out 32 GiB with ddpt:
* 19 s, 1800 MiB/s pmem
* 55 s, 625 MiB/s btt
If btt_make_request needs to stall this much, maybe it'd be better
to utilize the blk-mq request queues, keeping requests in per-CPU
queues while they're waiting, and using IPIs for completion
interrupts when they're finally done.
2 items to check:
1/ make sure you have your btt sector size set to 4k, which cuts down
the overhead by a factor of 8.
2/ boot with nr_cpus=256 or lower.
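To put a number on the factor of 8 in item 1: the BTT does its translation
work once per sector, so the count of those operations for a given I/O
scales inversely with the sector size. A minimal userspace sketch, assuming
the comparison is against 512-byte sectors; btt_sectors_touched() is a
made-up helper for illustration, not a function in the driver:

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of nd_btt): number of BTT sectors, and
 * hence per-sector translation steps, needed to cover `bytes` of I/O
 * at the given sector size. Rounds up for partial sectors.
 */
static unsigned long btt_sectors_touched(unsigned long bytes,
                                         unsigned long sector_size)
{
    assert(sector_size != 0);
    return (bytes + sector_size - 1) / sector_size;
}

/*
 * Ratio of per-sector work between a small and a large sector size
 * for the same I/O size.
 */
static unsigned long btt_overhead_ratio(unsigned long bytes,
                                        unsigned long small_sector,
                                        unsigned long large_sector)
{
    return btt_sectors_touched(bytes, small_sector) /
           btt_sectors_touched(bytes, large_sector);
}
```

For one 4 KiB fio request, btt_sectors_touched(4096, 512) is 8 and
btt_sectors_touched(4096, 4096) is 1, which is where the factor of 8
comes from.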
Ross noticed that CONFIG_NR_CPUS is set quite high on distro kernels,
which revealed that we should have been using nr_cpu_ids and percpu
variables for nd_region_acquire_lane() from the outset. This fix is
coming in v3.
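A rough userspace illustration of why the CPU count used for the
comparison matters. This is not the v3 patch: lanes_need_locking() and
lane_for_cpu() are hypothetical names, the modulo mapping is an assumption
about how CPUs could share a fixed set of lanes, and the numbers in the
note after the code (256 lanes, 8192 as a distro-style compile-time limit,
16 CPUs actually present) are examples only:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical: when there are more CPUs than lanes, several CPUs map
 * to the same lane, so every acquire has to take a lock. When each CPU
 * can own a lane outright, no lock is needed.
 */
static bool lanes_need_locking(unsigned int cpu_count, unsigned int num_lanes)
{
    assert(num_lanes > 0);
    return cpu_count > num_lanes;
}

/*
 * Hypothetical CPU-to-lane mapping: share lanes by modulo only when
 * the CPU count exceeds the lane count, otherwise use the CPU number.
 */
static unsigned int lane_for_cpu(unsigned int cpu, unsigned int cpu_count,
                                 unsigned int num_lanes)
{
    if (lanes_need_locking(cpu_count, num_lanes))
        return cpu % num_lanes;
    return cpu;
}
```

With cpu_count taken from a compile-time maximum such as 8192,
lanes_need_locking(8192, 256) is true and the locked path is used even on
a small machine. With cpu_count taken from the runtime value, for example
16, lanes_need_locking(16, 256) is false and each CPU gets its own lane.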