[PATCH] powerpc/papr_scm: Add PAPR command family to pass-through command-set
by Vaibhav Jain
Add NVDIMM_FAMILY_PAPR to the list of valid 'dimm_family_mask'
values accepted by papr_scm. This is needed because, since commit
92fe2aa859f5 ("libnvdimm: Validate command family indices"), libnvdimm
validates the 'nd_cmd_pkg.nd_family' received as part of
ND_CMD_CALL processing to ensure that only known command families can
use the general ND_CMD_CALL pass-through functionality.
Without this change, ND_CMD_CALL pass-through requests targeting
NVDIMM_FAMILY_PAPR error out with -EINVAL.
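The rejection path can be modeled in plain C. This is a userspace sketch with pared-down types: the enum values and field names follow the kernel sources, but the structs and the check_family() harness around them are simplifications, not the kernel implementation.

```c
#include <errno.h>

/* Simplified model of the check added by commit 92fe2aa859f5: the bus
 * accepts an ND_CMD_CALL package only if the package's family index is
 * set in the bus descriptor's dimm_family_mask. Types are pared down
 * for illustration; only the mask test mirrors the kernel logic.
 */
enum { NVDIMM_FAMILY_INTEL, NVDIMM_FAMILY_HPE1, NVDIMM_FAMILY_HPE2,
       NVDIMM_FAMILY_MSFT, NVDIMM_FAMILY_HYPERV, NVDIMM_FAMILY_PAPR };

struct bus_desc { unsigned long dimm_family_mask; };
struct nd_cmd_pkg { unsigned long nd_family; };

int check_family(const struct bus_desc *desc, const struct nd_cmd_pkg *pkg)
{
	if (!(desc->dimm_family_mask & (1UL << pkg->nd_family)))
		return -EINVAL;	/* unknown family: reject pass-through */
	return 0;
}
```

Before the fix the PAPR bit was never set in papr_scm's bus descriptor, so every PDSM request failed this test; after the one-line set_bit() the same request passes.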
Fixes: 92fe2aa859f5 ("libnvdimm: Validate command family indices")
Signed-off-by: Vaibhav Jain <vaibhav(a)linux.ibm.com>
---
arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 5493bc847bd08..27268370dee00 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -898,6 +898,9 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
p->bus_desc.of_node = p->pdev->dev.of_node;
p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
+ /* Set the dimm command family mask to accept PDSMs */
+ set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
+
if (!p->bus_desc.provider_name)
return -ENOMEM;
--
2.26.2
[PATCH v5 00/17] device-dax: support sub-dividing soft-reserved
ranges
by Dan Williams
Changes since v4 [1]:
- Rebased on
device-dax-move-instance-creation-parameters-to-struct-dev_dax_data.patch
in -mm [2]. I.e. patches that did not need fixups from v4 are not
included.
- Folded all fixes
- Replaced "device-dax: kill dax_kmem_res" with:
device-dax/kmem: introduce dax_kmem_range()
device-dax/kmem: move resource name tracking to drvdata
device-dax/kmem: replace release_resource() with release_mem_region()
...to address David's request to make those cleanups easier to review.
Note that I dropped changes to how IORESOURCE_BUSY is manipulated since
David and I are still debating the best way forward there.
- Broke out some of dax-bus reworks in "device-dax: introduce 'seed'
devices" to a new "device-dax: introduce 'struct dev_dax' typed-driver
operations"
- Added a conversion of xen_alloc_unallocated_pages() from pgmap.res to
pgmap.range. I found it odd that there is no corresponding
memunmap_pages() triggered by xen_free_unallocated_pages()?
- Not included, a conversion of virtio_fs to use pgmap.range for its new
usage of devm_memremap_pages(). It appears the virtio_fs changes are
merged after -mm? My mental model of -mm was that it applies on top of
linux-next? In any event, Vivek, you will need to coordinate a
conversion to pgmap.range for the virtio_fs dax-support merge. Maybe
that should go through Andrew as well?
- Lowercase all the subject lines per akpm's preference
- Received a 0day robot build-success notification over 122 configs
- Thanks to Joao for looking after this set while I was out.
[1]: http://lore.kernel.org/r/159625229779.3040297.11363509688097221416.stgit@...
[2]: https://ozlabs.org/~akpm/mmots/broken-out/device-dax-move-instance-creati...
---
Andrew, this series replaces
device-dax-make-pgmap-optional-for-instance-creation.patch
...through...
dax-hmem-introduce-dax_hmemregion_idle-parameter.patch
...in your stack.
Let me know if there is a different / preferred way to refresh a bulk of
patches in your queue when only a subset need updates.
---
The device-dax facility allows an address range to be directly mapped
through a chardev, or optionally hotplugged to the core kernel page
allocator as System-RAM. It is the mechanism for converting persistent
memory (pmem) to be used as another volatile memory pool i.e. the
current Memory Tiering hot topic on linux-mm.
In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [3]. This series provides
a sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.
The motivations for this facility are:
1/ Allow performance differentiated memory ranges to be split between
kernel-managed and directly-accessed use cases.
2/ Allow physical memory to be provisioned along performance relevant
address boundaries. For example, divide a memory-side cache [4] along
cache-color boundaries.
3/ Parcel out soft-reserved memory to VMs using device-dax as a security
/ permissions boundary [5]. Specifically I have seen people (ab)using
memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
device-dax interface on custom address ranges. A follow-on for the VM
use case is to teach device-dax to dynamically allocate 'struct page' at
runtime to reduce the duplication of 'struct page' space in both the
guest and the host kernel for the same physical pages.
[3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@...
[4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@...
[5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com
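The 'struct page' duplication cost that motivates the follow-on work is easy to estimate. On x86-64 a struct page is 64 bytes and describes one 4 KiB page, so the metadata overhead is range/64, paid once in the host and again in the guest when device-dax memory is passed through to a VM:

```c
#include <stdint.h>

/* Back-of-envelope for the 'struct page' duplication mentioned above:
 * 64 bytes of metadata per 4 KiB page, so a mapped range costs
 * range/64 bytes of struct page -- and that cost is currently paid
 * twice (guest and host) for the same physical pages.
 */
uint64_t page_meta_bytes(uint64_t range_bytes)
{
	const uint64_t page_size = 4096, struct_page_size = 64;

	return range_bytes / page_size * struct_page_size;
}
```

For 1 TiB of pmem that works out to 16 GiB of struct page per kernel, 32 GiB in total with the duplication, which is what dynamic allocation at runtime would claw back.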
---
Dan Williams (14):
device-dax: make pgmap optional for instance creation
device-dax/kmem: introduce dax_kmem_range()
device-dax/kmem: move resource name tracking to drvdata
device-dax/kmem: replace release_resource() with release_mem_region()
device-dax: add an allocation interface for device-dax instances
device-dax: introduce 'struct dev_dax' typed-driver operations
device-dax: introduce 'seed' devices
drivers/base: make device_find_child_by_name() compatible with sysfs inputs
device-dax: add resize support
mm/memremap_pages: convert to 'struct range'
mm/memremap_pages: support multiple ranges per invocation
device-dax: add dis-contiguous resource support
device-dax: introduce 'mapping' devices
device-dax: add an 'align' attribute
Joao Martins (3):
device-dax: make align a per-device property
dax/hmem: introduce dax_hmem.region_idle parameter
device-dax: add a range mapping allocation attribute
arch/powerpc/kvm/book3s_hv_uvmem.c | 14
drivers/base/core.c | 2
drivers/dax/bus.c | 1039 ++++++++++++++++++++++++++++++--
drivers/dax/bus.h | 11
drivers/dax/dax-private.h | 58 ++
drivers/dax/device.c | 112 ++-
drivers/dax/hmem/hmem.c | 17 -
drivers/dax/kmem.c | 178 +++--
drivers/dax/pmem/compat.c | 2
drivers/dax/pmem/core.c | 14
drivers/gpu/drm/nouveau/nouveau_dmem.c | 15
drivers/nvdimm/badrange.c | 26 -
drivers/nvdimm/claim.c | 13
drivers/nvdimm/nd.h | 3
drivers/nvdimm/pfn_devs.c | 13
drivers/nvdimm/pmem.c | 27 -
drivers/nvdimm/region.c | 21 -
drivers/pci/p2pdma.c | 12
drivers/xen/unpopulated-alloc.c | 45 +
include/linux/memremap.h | 11
include/linux/range.h | 6
lib/test_hmm.c | 15
mm/memremap.c | 299 +++++----
tools/testing/nvdimm/dax-dev.c | 22 -
tools/testing/nvdimm/test/iomap.c | 2
25 files changed, 1557 insertions(+), 420 deletions(-)
base-commit: 6764736525f27a411ba2c0c430aaa2df7375f3ac
[PATCH 00/22] add Object Storage Media Pool (mpool)
by nmeeramohide@micron.com
From: Nabeel M Mohamed <nmeeramohide(a)micron.com>
This patch series introduces the mpool object storage media pool driver.
Mpool implements a simple transactional object store on top of block
storage devices.
Mpool was developed for the Heterogeneous-Memory Storage Engine (HSE)
project, which is a high-performance key-value storage engine designed
for SSDs. HSE stores its data exclusively in mpool.
Mpool is readily applicable to other storage systems built on immutable
objects, such as the many databases that store records in immutable
SSTables organized as an LSM-tree or similar data structure.
We developed mpool for HSE storage, rather than using a file system or
raw block device, for several reasons.
A primary motivator was the need for a storage model that maps naturally
to conventional block storage devices, as well as to emerging device
interfaces we plan to support in the future, such as
* NVMe Zoned Namespaces (ZNS)
* NVMe Streams
* Persistent memory accessed via CXL or similar technologies
Another motivator was the need for a storage model that readily supports
multiple classes of storage devices or media in a single storage pool,
such as
* QLC SSDs for storing the bulk of objects, and
* 3DXP SSDs or persistent memory for storing objects requiring
low-latency access
The mpool object storage model meets these needs. It also provides
other features that benefit storage systems built on immutable objects,
including
* Facilities to memory-map a specified collection of objects into a
linear address space
* Concurrent access to object data directly and memory-mapped to greatly
reduce page cache pollution from background operations such as
LSM-tree compaction
* Proactive eviction of object data from the page cache, based on
object-level metrics, to avoid excessive memory pressure and its
associated performance impacts
* High concurrency and short code paths for efficient access to
low-latency storage devices
HSE takes advantage of all these mpool features to achieve high
throughput with low tail-latencies.
Mpool is implemented as a character device driver where
* /dev/mpoolctl is the control file (minor number 0) supporting mpool
management ioctls
* /dev/mpool/<mpool-name> are mpool files (minor numbers >0), one per
mpool, supporting object management ioctls
CLI/UAPI access to /dev/mpoolctl and /dev/mpool/<mpool-name> are
controlled by their UID, GID, and mode bits. To provide a familiar look
and feel, the mpool management model and CLI are intentionally aligned
to those of LVM to the degree practical.
An mpool is created with a block storage device specified for its
required capacity media class, and optionally a second block storage
device specified for its staging media class. We recommend virtual
block devices (such as LVM logical volumes) to aggregate the performance
and capacity of multiple physical block devices, to enable sharing of
physical block devices between mpools (or for other uses), and to
support extending the size of a block device used for an mpool media
class. The libblkid library recognizes mpool formatted block devices as
of util-linux v2.32.
Mpool implements a transactional object store with two simple object
abstractions: mblocks and mlogs.
Mblock objects are containers comprising a linear sequence of bytes that
can be written exactly once, are immutable after writing, and can be
read in whole or in part as needed until deleted. Mblocks in a media
class are currently fixed size, which is configured when an mpool is
created, though the amount of data written to mblocks will differ.
Mlog objects are containers for record logging. Records of arbitrary
size can be appended to an mlog until it is full. Once full, an mlog
must be erased before additional records can be appended. Mlog records
can be read sequentially from the beginning at any time. Mlogs in a
media class are always a multiple of the mblock size for that media
class.
Mblock and mlog writes avoid the page cache. Mblocks are written,
committed, and made immutable before they can be read either directly
(avoiding the page cache) or mmaped. Mlogs are always read and updated
directly (avoiding the page cache) and cannot be mmaped.
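The write-once lifecycle described above (allocate, write, commit, read until deleted) can be sketched as a small state machine. The names and return codes here are made up for illustration and are not mpool's actual UAPI:

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative model of the mblock lifecycle: written exactly once,
 * immutable after commit, readable in whole or in part until deleted.
 */
enum mb_state { MB_ALLOCATED, MB_COMMITTED, MB_DELETED };

struct mblock { enum mb_state state; size_t written, capacity; };

int mblock_write(struct mblock *mb, size_t len)
{
	if (mb->state != MB_ALLOCATED)
		return -EROFS;		/* immutable once committed */
	if (mb->written + len > mb->capacity)
		return -ENOSPC;		/* fixed size per media class */
	mb->written += len;
	return 0;
}

int mblock_commit(struct mblock *mb)
{
	if (mb->state != MB_ALLOCATED)
		return -EINVAL;
	mb->state = MB_COMMITTED;	/* readable from now on, never writable */
	return 0;
}

int mblock_read(const struct mblock *mb, size_t off, size_t len)
{
	if (mb->state != MB_COMMITTED)
		return -EINVAL;		/* reads only after commit */
	if (off + len > mb->written)
		return -EINVAL;
	return 0;
}
```

The commit boundary is what makes the direct and memory-mapped read paths safe to mix: nothing readable can change underneath a mapping.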
Mpool also provides the metadata container (MDC) APIs that clients can
use to simplify storing and maintaining metadata. These MDC APIs are
helper functions built on a pair of mlogs per MDC.
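A metadata container built on a pair of mlogs typically appends records to the active mlog and, when it fills, compacts only the live records into the peer mlog and swaps roles. The sketch below is our reading of that pattern, with invented names and fields, not the mpool MDC API:

```c
#include <errno.h>

/* Hypothetical model of an MDC over two mlogs: append to the active
 * mlog; on overflow, rewrite the live records into the peer mlog,
 * erase the old one, and swap roles (compaction).
 */
struct mlog { int nrec, cap; };
struct mdc { struct mlog log[2]; int active; };

int mdc_append(struct mdc *mdc, int live_records)
{
	struct mlog *a = &mdc->log[mdc->active];

	if (a->nrec == a->cap) {		/* active mlog is full */
		struct mlog *b = &mdc->log[!mdc->active];

		b->nrec = live_records;		/* compact live state over */
		a->nrec = 0;			/* erase the old active mlog */
		mdc->active = !mdc->active;
		a = b;
		if (a->nrec == a->cap)
			return -ENOSPC;		/* MDC undersized */
	}
	a->nrec++;
	return 0;
}
```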
The mpool Wiki contains full details on the
* Management model in the "Configure mpools" section
* Object model in the "Develop mpool Applications" section
* Kernel module architecture in the "Explore mpool Internals" section,
which provides context for reviewing this patch series
See https://github.com/hse-project/mpool/wiki
The mpool UAPI and kernel module (not the patchset) are available on
GitHub at:
https://github.com/hse-project/mpool
https://github.com/hse-project/mpool-kmod
The HSE key-value storage engine is available on GitHub at:
https://github.com/hse-project/hse
Nabeel M Mohamed (22):
mpool: add utility routines and ioctl definitions
mpool: add in-memory struct definitions
mpool: add on-media struct definitions
mpool: add pool drive component which handles mpool IO using the block
layer API
mpool: add space map component which manages free space on mpool
devices
mpool: add on-media pack, unpack and upgrade routines
mpool: add superblock management routines
mpool: add pool metadata routines to manage object lifecycle and IO
mpool: add mblock lifecycle management and IO routines
mpool: add mlog IO utility routines
mpool: add mlog lifecycle management and IO routines
mpool: add metadata container or mlog-pair framework
mpool: add utility routines for mpool lifecycle management
mpool: add pool metadata routines to create persistent mpools
mpool: add mpool lifecycle management routines
mpool: add mpool control plane utility routines
mpool: add mpool lifecycle management ioctls
mpool: add object lifecycle management ioctls
mpool: add support to mmap arbitrary collection of mblocks
mpool: add support to proactively evict cached mblock data from the
page-cache
mpool: add documentation
mpool: add Kconfig and Makefile
drivers/Kconfig | 2 +
drivers/Makefile | 1 +
drivers/mpool/Kconfig | 28 +
drivers/mpool/Makefile | 11 +
drivers/mpool/assert.h | 25 +
drivers/mpool/init.c | 126 ++
drivers/mpool/init.h | 17 +
drivers/mpool/mblock.c | 432 +++++
drivers/mpool/mblock.h | 161 ++
drivers/mpool/mcache.c | 1036 ++++++++++++
drivers/mpool/mcache.h | 102 ++
drivers/mpool/mclass.c | 103 ++
drivers/mpool/mclass.h | 137 ++
drivers/mpool/mdc.c | 486 ++++++
drivers/mpool/mdc.h | 106 ++
drivers/mpool/mlog.c | 1667 ++++++++++++++++++
drivers/mpool/mlog.h | 212 +++
drivers/mpool/mlog_utils.c | 1352 +++++++++++++++
drivers/mpool/mlog_utils.h | 63 +
drivers/mpool/mp.c | 1086 ++++++++++++
drivers/mpool/mp.h | 231 +++
drivers/mpool/mpcore.c | 987 +++++++++++
drivers/mpool/mpcore.h | 354 ++++
drivers/mpool/mpctl.c | 2801 +++++++++++++++++++++++++++++++
drivers/mpool/mpctl.h | 59 +
drivers/mpool/mpool-locking.rst | 90 +
drivers/mpool/mpool_ioctl.h | 636 +++++++
drivers/mpool/mpool_printk.h | 44 +
drivers/mpool/omf.c | 1320 +++++++++++++++
drivers/mpool/omf.h | 593 +++++++
drivers/mpool/omf_if.h | 381 +++++
drivers/mpool/params.h | 116 ++
drivers/mpool/pd.c | 426 +++++
drivers/mpool/pd.h | 202 +++
drivers/mpool/pmd.c | 2046 ++++++++++++++++++++++
drivers/mpool/pmd.h | 379 +++++
drivers/mpool/pmd_obj.c | 1569 +++++++++++++++++
drivers/mpool/pmd_obj.h | 499 ++++++
drivers/mpool/reaper.c | 692 ++++++++
drivers/mpool/reaper.h | 71 +
drivers/mpool/sb.c | 625 +++++++
drivers/mpool/sb.h | 162 ++
drivers/mpool/smap.c | 1031 ++++++++++++
drivers/mpool/smap.h | 334 ++++
drivers/mpool/sysfs.c | 48 +
drivers/mpool/sysfs.h | 48 +
drivers/mpool/upgrade.c | 138 ++
drivers/mpool/upgrade.h | 128 ++
drivers/mpool/uuid.h | 59 +
49 files changed, 23222 insertions(+)
create mode 100644 drivers/mpool/Kconfig
create mode 100644 drivers/mpool/Makefile
create mode 100644 drivers/mpool/assert.h
create mode 100644 drivers/mpool/init.c
create mode 100644 drivers/mpool/init.h
create mode 100644 drivers/mpool/mblock.c
create mode 100644 drivers/mpool/mblock.h
create mode 100644 drivers/mpool/mcache.c
create mode 100644 drivers/mpool/mcache.h
create mode 100644 drivers/mpool/mclass.c
create mode 100644 drivers/mpool/mclass.h
create mode 100644 drivers/mpool/mdc.c
create mode 100644 drivers/mpool/mdc.h
create mode 100644 drivers/mpool/mlog.c
create mode 100644 drivers/mpool/mlog.h
create mode 100644 drivers/mpool/mlog_utils.c
create mode 100644 drivers/mpool/mlog_utils.h
create mode 100644 drivers/mpool/mp.c
create mode 100644 drivers/mpool/mp.h
create mode 100644 drivers/mpool/mpcore.c
create mode 100644 drivers/mpool/mpcore.h
create mode 100644 drivers/mpool/mpctl.c
create mode 100644 drivers/mpool/mpctl.h
create mode 100644 drivers/mpool/mpool-locking.rst
create mode 100644 drivers/mpool/mpool_ioctl.h
create mode 100644 drivers/mpool/mpool_printk.h
create mode 100644 drivers/mpool/omf.c
create mode 100644 drivers/mpool/omf.h
create mode 100644 drivers/mpool/omf_if.h
create mode 100644 drivers/mpool/params.h
create mode 100644 drivers/mpool/pd.c
create mode 100644 drivers/mpool/pd.h
create mode 100644 drivers/mpool/pmd.c
create mode 100644 drivers/mpool/pmd.h
create mode 100644 drivers/mpool/pmd_obj.c
create mode 100644 drivers/mpool/pmd_obj.h
create mode 100644 drivers/mpool/reaper.c
create mode 100644 drivers/mpool/reaper.h
create mode 100644 drivers/mpool/sb.c
create mode 100644 drivers/mpool/sb.h
create mode 100644 drivers/mpool/smap.c
create mode 100644 drivers/mpool/smap.h
create mode 100644 drivers/mpool/sysfs.c
create mode 100644 drivers/mpool/sysfs.h
create mode 100644 drivers/mpool/upgrade.c
create mode 100644 drivers/mpool/upgrade.h
create mode 100644 drivers/mpool/uuid.h
--
2.17.2
[PATCH v9 0/2] Renovate memcpy_mcsafe with copy_mc_to_{user, kernel}
by Dan Williams
Changes since v8 [1]:
- Rebase on v5.9-rc6
- Fix a performance regression in the x86 copy_mc_to_user()
implementation that was duplicating copies in the "fragile" case.
- Refreshed the cover letter.
[1]: http://lore.kernel.org/r/159630255616.3143511.18110575960499749012.stgit@...
---
The motivations to go rework memcpy_mcsafe() are that the benefit of
doing slow and careful copies is obviated on newer CPUs, and that the
current opt-in list of cpus to instrument recovery is broken relative to
those cpus. There is no need to keep an opt-in list up to date on an
ongoing basis if pmem/dax operations are instrumented for recovery by
default. With recovery enabled by default the old "mcsafe_key" opt-in to
careful copying can be made a "fragile" opt-out. Where the "fragile"
list takes steps to not consume poison across cachelines.
The discussion with Linus made clear that the current "_mcsafe" suffix
was imprecise to a fault. The operations that are needed by pmem/dax are
to copy from a source address that might throw #MC to a destination that
may write-fault, if it is a user page. So copy_to_user_mcsafe() becomes
copy_mc_to_user() to indicate the separate precautions taken on source
and destination. copy_mc_to_kernel() is introduced as a version that
does not expect write-faults on the destination, but is still prepared
to abort with an error code upon taking #MC.
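The contract both variants share, tolerating #MC on the source, can be modeled in userspace. This is a sketch only: 'poison_off' stands in for where hardware would raise the machine check, and the kernel-style return convention (bytes not copied, 0 meaning success) is the part being illustrated.

```c
#include <string.h>

/* Userspace model of the copy_mc_*() contract: the source may take a
 * machine check part-way through, in which case the copy stops at the
 * poisoned byte and the return value is the number of bytes NOT
 * copied (0 == complete copy).
 */
unsigned long copy_mc(char *dst, const char *src, unsigned long len,
		      long poison_off)
{
	unsigned long n = len;

	if (poison_off >= 0 && (unsigned long)poison_off < len)
		n = (unsigned long)poison_off;	/* stop at the poison */
	memcpy(dst, src, n);
	return len - n;		/* bytes remaining, reported to the caller */
}
```

copy_mc_to_kernel() would hand this short count back to its caller; copy_mc_to_user() additionally has to tolerate a write fault on the destination, which this model does not simulate.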
These patches have received a kbuild-robot build success notification
across 114 configs, the rebase to v5.9-rc6 did not encounter any
conflicts, and the merge with tip/master is conflict-free.
---
Dan Williams (2):
x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user,kernel}()
x86/copy_mc: Introduce copy_mc_generic()
arch/powerpc/Kconfig | 2
arch/powerpc/include/asm/string.h | 2
arch/powerpc/include/asm/uaccess.h | 40 +++--
arch/powerpc/lib/Makefile | 2
arch/powerpc/lib/copy_mc_64.S | 4
arch/x86/Kconfig | 2
arch/x86/Kconfig.debug | 2
arch/x86/include/asm/copy_mc_test.h | 75 +++++++++
arch/x86/include/asm/mcsafe_test.h | 75 ---------
arch/x86/include/asm/string_64.h | 32 ----
arch/x86/include/asm/uaccess.h | 21 +++
arch/x86/include/asm/uaccess_64.h | 20 --
arch/x86/kernel/cpu/mce/core.c | 8 -
arch/x86/kernel/quirks.c | 9 -
arch/x86/lib/Makefile | 1
arch/x86/lib/copy_mc.c | 65 ++++++++
arch/x86/lib/copy_mc_64.S | 165 ++++++++++++++++++++
arch/x86/lib/memcpy_64.S | 115 --------------
arch/x86/lib/usercopy_64.c | 21 ---
drivers/md/dm-writecache.c | 15 +-
drivers/nvdimm/claim.c | 2
drivers/nvdimm/pmem.c | 6 -
include/linux/string.h | 9 -
include/linux/uaccess.h | 9 +
include/linux/uio.h | 10 +
lib/Kconfig | 7 +
lib/iov_iter.c | 43 +++--
tools/arch/x86/include/asm/mcsafe_test.h | 13 --
tools/arch/x86/lib/memcpy_64.S | 115 --------------
tools/objtool/check.c | 5 -
tools/perf/bench/Build | 1
tools/perf/bench/mem-memcpy-x86-64-lib.c | 24 ---
tools/testing/nvdimm/test/nfit.c | 48 +++---
.../testing/selftests/powerpc/copyloops/.gitignore | 2
tools/testing/selftests/powerpc/copyloops/Makefile | 6 -
.../selftests/powerpc/copyloops/copy_mc_64.S | 1
.../selftests/powerpc/copyloops/memcpy_mcsafe_64.S | 1
37 files changed, 452 insertions(+), 526 deletions(-)
rename arch/powerpc/lib/{memcpy_mcsafe_64.S => copy_mc_64.S} (98%)
create mode 100644 arch/x86/include/asm/copy_mc_test.h
delete mode 100644 arch/x86/include/asm/mcsafe_test.h
create mode 100644 arch/x86/lib/copy_mc.c
create mode 100644 arch/x86/lib/copy_mc_64.S
delete mode 100644 tools/arch/x86/include/asm/mcsafe_test.h
delete mode 100644 tools/perf/bench/mem-memcpy-x86-64-lib.c
create mode 120000 tools/testing/selftests/powerpc/copyloops/copy_mc_64.S
delete mode 120000 tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S
base-commit: ba4f184e126b751d1bffad5897f263108befc780
[PATCH] device-dax: include bus.h in super.c
by Jason Yan
This addresses the following sparse warning:
drivers/dax/super.c:452:6: warning: symbol 'run_dax' was not declared.
Should it be static?
Reported-by: Hulk Robot <hulkci(a)huawei.com>
Signed-off-by: Jason Yan <yanaijie(a)huawei.com>
---
drivers/dax/super.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index edc279be3e59..2cf6faf265c5 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -16,6 +16,7 @@
#include <linux/dax.h>
#include <linux/fs.h>
#include "dax-private.h"
+#include "bus.h"
static dev_t dax_devt;
DEFINE_STATIC_SRCU(dax_srcu);
--
2.25.4
[PATCH] nvdimm: Use kobj_to_dev() API
by Wang Qing
Use kobj_to_dev() instead of container_of().
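kobj_to_dev() is just container_of() specialized for the 'kobj' member of struct device, so the conversion is purely cosmetic. A userspace demonstration of the equivalence (struct layouts abbreviated, container_of() simplified from the kernel definition):

```c
#include <stddef.h>

/* container_of(), simplified from the kernel definition: recover a
 * pointer to the enclosing structure from a pointer to one of its
 * members, by subtracting the member's offset.
 */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct kobject { int refcount; };
struct device { int id; struct kobject kobj; };

/* The helper the patch switches to: container_of() for struct device */
struct device *kobj_to_dev(struct kobject *kobj)
{
	return container_of(kobj, struct device, kobj);
}
```

Using the named helper states the intent and gets the type of the enclosing structure right by construction.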
Signed-off-by: Wang Qing <wangqing(a)vivo.com>
---
drivers/nvdimm/namespace_devs.c | 2 +-
drivers/nvdimm/region_devs.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 6da67f4..1d11ca7
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -1623,7 +1623,7 @@ static struct attribute *nd_namespace_attributes[] = {
static umode_t namespace_visible(struct kobject *kobj,
struct attribute *a, int n)
{
- struct device *dev = container_of(kobj, struct device, kobj);
+ struct device *dev = kobj_to_dev(kobj);
if (a == &dev_attr_resource.attr && is_namespace_blk(dev))
return 0;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index ef23119..92adfaf
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -644,7 +644,7 @@ static struct attribute *nd_region_attributes[] = {
static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
{
- struct device *dev = container_of(kobj, typeof(*dev), kobj);
+ struct device *dev = kobj_to_dev(kobj);
struct nd_region *nd_region = to_nd_region(dev);
struct nd_interleave_set *nd_set = nd_region->nd_set;
int type = nd_region_to_nstype(nd_region);
@@ -759,7 +759,7 @@ REGION_MAPPING(31);
static umode_t mapping_visible(struct kobject *kobj, struct attribute *a, int n)
{
- struct device *dev = container_of(kobj, struct device, kobj);
+ struct device *dev = kobj_to_dev(kobj);
struct nd_region *nd_region = to_nd_region(dev);
if (n < nd_region->ndr_mappings)
--
2.7.4
[PATCH] powerpc/papr_scm: Support dynamic enable/disable of performance statistics
by Vaibhav Jain
Collection of performance statistics of an NVDIMM can be dynamically
enabled/disabled from the Hypervisor Management Console even when the
guest lpar is running. The current implementation, however, checks
whether performance-statistics collection is supported only during
NVDIMM probe and assumes that result holds from then on.
Hence we update papr_scm to remove this assumption from the code by
eliminating the 'stat_buffer_len' member from 'struct papr_scm_priv'
that was used to cache the max buffer size needed to fetch NVDIMM
performance stats from PHYP. With that struct member gone, various
functions that depended on it are updated. Specifically
perf_stats_show() is updated to query the PHYP first for the size of
buffer needed to hold all performance statistics instead of relying on
'stat_buffer_len'
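The reworked perf_stats_show() follows the common two-call pattern: call once with a NULL buffer to learn the size currently needed, then allocate exactly that much and call again. The stub below illustrates the shape of that pattern; STATS_NEEDED and the error value are invented for the sketch and are not the H_SCM_PERFORMANCE_STATS interface.

```c
#include <stddef.h>

/* Stand-in for the hypervisor query: a NULL buffer asks only for the
 * size needed to hold all stats; a non-NULL buffer of at least that
 * size gets the stats "copied" in.
 */
#define STATS_NEEDED 96

long query_stats(void *buf, size_t buf_size)
{
	if (!buf)
		return STATS_NEEDED;	/* size-only query */
	if (buf_size < STATS_NEEDED)
		return -22;		/* buffer too small */
	return 0;			/* stats written to buf */
}
```

Querying the size on every read is what lets the Hypervisor Management Console enable or disable collection while the lpar runs, at the cost of one extra hcall per sysfs read.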
Signed-off-by: Vaibhav Jain <vaibhav(a)linux.ibm.com>
---
arch/powerpc/platforms/pseries/papr_scm.c | 53 +++++++++++------------
1 file changed, 25 insertions(+), 28 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 27268370dee00..6697e1c3b9ebe 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -112,9 +112,6 @@ struct papr_scm_priv {
/* Health information for the dimm */
u64 health_bitmap;
-
- /* length of the stat buffer as expected by phyp */
- size_t stat_buffer_len;
};
static LIST_HEAD(papr_nd_regions);
@@ -230,14 +227,15 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
* - If buff_stats == NULL the return value is the size in byes of the buffer
* needed to hold all supported performance-statistics.
* - If buff_stats != NULL and num_stats == 0 then we copy all known
- * performance-statistics to 'buff_stat' and expect to be large enough to
- * hold them.
+ * performance-statistics to 'buff_stat' and expect it to be large enough to
+ * hold them. The 'buff_size' args contains the size of the 'buff_stats'
* - if buff_stats != NULL and num_stats > 0 then copy the requested
* performance-statistics to buff_stats.
*/
static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
struct papr_scm_perf_stats *buff_stats,
- unsigned int num_stats)
+ unsigned int num_stats,
+ size_t buff_size)
{
unsigned long ret[PLPAR_HCALL_BUFSIZE];
size_t size;
@@ -261,12 +259,18 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
size = sizeof(struct papr_scm_perf_stats) +
num_stats * sizeof(struct papr_scm_perf_stat);
else
- size = p->stat_buffer_len;
+ size = buff_size;
} else {
/* In case of no out buffer ignore the size */
size = 0;
}
+ /* verify that the buffer size needed is sufficient */
+ if (size > buff_size) {
+ __WARN();
+ return -EINVAL;
+ }
+
/* Do the HCALL asking PHYP for info */
rc = plpar_hcall(H_SCM_PERFORMANCE_STATS, ret, p->drc_index,
buff_stats ? virt_to_phys(buff_stats) : 0,
@@ -277,6 +281,10 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
dev_err(&p->pdev->dev,
"Unknown performance stats, Err:0x%016lX\n", ret[0]);
return -ENOENT;
+ } else if (rc == H_AUTHORITY) {
+ dev_dbg(&p->pdev->dev,
+ "Performance stats in-accessible\n");
+ return -EPERM;
} else if (rc != H_SUCCESS) {
dev_err(&p->pdev->dev,
"Failed to query performance stats, Err:%lld\n", rc);
@@ -526,10 +534,6 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
struct papr_scm_perf_stat *stat;
struct papr_scm_perf_stats *stats;
- /* Silently fail if fetching performance metrics isn't supported */
- if (!p->stat_buffer_len)
- return 0;
-
/* Allocate request buffer enough to hold single performance stat */
size = sizeof(struct papr_scm_perf_stats) +
sizeof(struct papr_scm_perf_stat);
@@ -543,9 +547,11 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
stat->stat_val = 0;
/* Fetch the fuel gauge and populate it in payload */
- rc = drc_pmem_query_stats(p, stats, 1);
+ rc = drc_pmem_query_stats(p, stats, 1, size);
if (rc < 0) {
dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+ /* Silently fail if unable to fetch performance metric */
+ rc = 0;
goto free_stats;
}
@@ -786,23 +792,25 @@ static ssize_t perf_stats_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
int index;
- ssize_t rc;
+ ssize_t rc, buff_len;
struct seq_buf s;
struct papr_scm_perf_stat *stat;
struct papr_scm_perf_stats *stats;
struct nvdimm *dimm = to_nvdimm(dev);
struct papr_scm_priv *p = nvdimm_provider_data(dimm);
- if (!p->stat_buffer_len)
- return -ENOENT;
+ /* fetch the length of buffer needed to get all stats */
+ buff_len = drc_pmem_query_stats(p, NULL, 0, 0);
+ if (buff_len <= 0)
+ return buff_len;
/* Allocate the buffer for phyp where stats are written */
- stats = kzalloc(p->stat_buffer_len, GFP_KERNEL);
+ stats = kzalloc(buff_len, GFP_KERNEL);
if (!stats)
return -ENOMEM;
/* Ask phyp to return all dimm perf stats */
- rc = drc_pmem_query_stats(p, stats, 0);
+ rc = drc_pmem_query_stats(p, stats, 0, buff_len);
if (rc)
goto free_stats;
/*
@@ -891,7 +899,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
struct nd_region_desc ndr_desc;
unsigned long dimm_flags;
int target_nid, online_nid;
- ssize_t stat_size;
p->bus_desc.ndctl = papr_scm_ndctl;
p->bus_desc.module = THIS_MODULE;
@@ -962,16 +969,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
list_add_tail(&p->region_list, &papr_nd_regions);
mutex_unlock(&papr_ndr_lock);
- /* Try retriving the stat buffer and see if its supported */
- stat_size = drc_pmem_query_stats(p, NULL, 0);
- if (stat_size > 0) {
- p->stat_buffer_len = stat_size;
- dev_dbg(&p->pdev->dev, "Max perf-stat size %lu-bytes\n",
- p->stat_buffer_len);
- } else {
- dev_info(&p->pdev->dev, "Dimm performance stats unavailable\n");
- }
-
return 0;
err: nvdimm_bus_unregister(p->bus);
--
2.26.2
[RFC] nvfs: a filesystem for persistent memory
by Mikulas Patocka
Hi
I am developing a new filesystem suitable for persistent memory - nvfs.
The goal is to have a small and fast filesystem that can be used on
DAX-based devices. Nvfs maps the whole device into linear address space
and it completely bypasses the overhead of the block layer and buffer
cache.
In the past, there was the NOVA filesystem for pmem, but it was
abandoned a year ago (the last version is for kernel 5.1 -
https://github.com/NVSL/linux-nova ). Nvfs is smaller and performs better.
The design of nvfs is similar to ext2/ext4, so that it fits into the VFS
layer naturally, without too much glue code.
I'd like to ask you to review it.
tarballs:
http://people.redhat.com/~mpatocka/nvfs/
git:
git://leontynka.twibright.com/nvfs.git
the description of filesystem internals:
http://people.redhat.com/~mpatocka/nvfs/INTERNALS
benchmarks:
http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
TODO:
- programs run approximately 4% slower when running from Optane-based
persistent memory. Therefore, programs and libraries should use page cache
and not DAX mapping.
- when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
buffer cache for the mapping. The buffer cache slows down fsck by a factor
of 5 to 10. Could it be possible to change the kernel so that it maps DAX
based block devices directly?
- __copy_from_user_inatomic_nocache doesn't flush cache for leading and
trailing bytes.
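On that last point: the nocache copy only bypasses the CPU cache for full, aligned cachelines, so a copy's unaligned head and tail still land in cache and need explicit write-back. The helpers below compute those head/tail byte counts, assuming 64-byte cachelines; they are an illustration of the arithmetic, not nvfs code, and a real implementation would follow them with clwb/clflushopt.

```c
#include <stdint.h>

/* How many leading/trailing bytes of a copy to dst of length len fall
 * in partial cachelines, and therefore are NOT written around the
 * cache by __copy_from_user_inatomic_nocache() and still need an
 * explicit flush. 64-byte cachelines assumed.
 */
#define CL 64ULL

uint64_t head_flush_len(uint64_t dst, uint64_t len)
{
	uint64_t head = (CL - (dst & (CL - 1))) & (CL - 1);

	return head < len ? head : len;	/* partial leading cacheline */
}

uint64_t tail_flush_len(uint64_t dst, uint64_t len)
{
	uint64_t end = dst + len;

	/* whole copy inside the head's cacheline: head already covers it */
	if (len == 0 || (dst & ~(CL - 1)) == ((end - 1) & ~(CL - 1)))
		return 0;
	return end & (CL - 1);		/* partial trailing cacheline */
}
```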
Mikulas
[PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved
ranges
by Dan Williams
Changes since v3 [1]:
- Update x86 boot options documentation for 'nohmat' (Randy)
- Fixup a handful of kbuild robot reports, the most significant being
moving usage of PUD_SIZE and PMD_SIZE under
#ifdef CONFIG_TRANSPARENT_HUGEPAGE protection.
[1]: http://lore.kernel.org/r/159625229779.3040297.11363509688097221416.stgit@...
---
Merge notes:
Well, no v5.8-rc8 to line this up for v5.9, so next best is early
integration into -mm before other collisions develop.
Chatted with Justin offline and it currently appears that the missing
numa information is the fault of the platform firmware to populate all
the necessary NUMA data in the NFIT.
---
Cover:
The device-dax facility allows an address range to be directly mapped
through a chardev, or optionally hotplugged to the core kernel page
allocator as System-RAM. It is the mechanism for converting persistent
memory (pmem) to be used as another volatile memory pool i.e. the
current Memory Tiering hot topic on linux-mm.
In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [3]. This series provides
a sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.
The motivations for this facility are:
1/ Allow performance differentiated memory ranges to be split between
kernel-managed and directly-accessed use cases.
2/ Allow physical memory to be provisioned along performance relevant
address boundaries. For example, divide a memory-side cache [4] along
cache-color boundaries.
3/ Parcel out soft-reserved memory to VMs using device-dax as a security
/ permissions boundary [5]. Specifically I have seen people (ab)using
memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
device-dax interface on custom address ranges. A follow-on for the VM
use case is to teach device-dax to dynamically allocate 'struct page' at
runtime to reduce the duplication of 'struct page' space in both the
guest and the host kernel for the same physical pages.
[2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@oracle.com
[3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@...
[4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@...
[5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com
---
Dan Williams (19):
x86/numa: Cleanup configuration dependent command-line options
x86/numa: Add 'nohmat' option
efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
resource: Report parent to walk_iomem_res_desc() callback
mm/memory_hotplug: Introduce default phys_to_target_node() implementation
ACPI: HMAT: Attach a device for each soft-reserved range
device-dax: Drop the dax_region.pfn_flags attribute
device-dax: Move instance creation parameters to 'struct dev_dax_data'
device-dax: Make pgmap optional for instance creation
device-dax: Kill dax_kmem_res
device-dax: Add an allocation interface for device-dax instances
device-dax: Introduce 'seed' devices
drivers/base: Make device_find_child_by_name() compatible with sysfs inputs
device-dax: Add resize support
mm/memremap_pages: Convert to 'struct range'
mm/memremap_pages: Support multiple ranges per invocation
device-dax: Add dis-contiguous resource support
device-dax: Introduce 'mapping' devices
Joao Martins (4):
device-dax: Make align a per-device property
device-dax: Add an 'align' attribute
dax/hmem: Introduce dax_hmem.region_idle parameter
device-dax: Add a range mapping allocation attribute
Documentation/x86/x86_64/boot-options.rst | 4
arch/powerpc/kvm/book3s_hv_uvmem.c | 14
arch/x86/include/asm/numa.h | 8
arch/x86/kernel/e820.c | 16
arch/x86/mm/numa.c | 11
arch/x86/mm/numa_emulation.c | 3
arch/x86/xen/enlighten_pv.c | 2
drivers/acpi/numa/hmat.c | 76 --
drivers/acpi/numa/srat.c | 9
drivers/base/core.c | 2
drivers/dax/Kconfig | 4
drivers/dax/Makefile | 3
drivers/dax/bus.c | 1046 +++++++++++++++++++++++++++--
drivers/dax/bus.h | 28 -
drivers/dax/dax-private.h | 60 +-
drivers/dax/device.c | 134 ++--
drivers/dax/hmem.c | 56 --
drivers/dax/hmem/Makefile | 6
drivers/dax/hmem/device.c | 100 +++
drivers/dax/hmem/hmem.c | 65 ++
drivers/dax/kmem.c | 199 +++---
drivers/dax/pmem/compat.c | 2
drivers/dax/pmem/core.c | 22 -
drivers/firmware/efi/x86_fake_mem.c | 12
drivers/gpu/drm/nouveau/nouveau_dmem.c | 15
drivers/nvdimm/badrange.c | 26 -
drivers/nvdimm/claim.c | 13
drivers/nvdimm/nd.h | 3
drivers/nvdimm/pfn_devs.c | 13
drivers/nvdimm/pmem.c | 27 -
drivers/nvdimm/region.c | 21 -
drivers/pci/p2pdma.c | 12
include/acpi/acpi_numa.h | 14
include/linux/dax.h | 8
include/linux/memory_hotplug.h | 5
include/linux/memremap.h | 11
include/linux/numa.h | 11
include/linux/range.h | 6
kernel/resource.c | 11
lib/test_hmm.c | 15
mm/memory_hotplug.c | 10
mm/memremap.c | 299 +++++---
tools/testing/nvdimm/dax-dev.c | 22 -
tools/testing/nvdimm/test/iomap.c | 2
44 files changed, 1825 insertions(+), 601 deletions(-)
delete mode 100644 drivers/dax/hmem.c
create mode 100644 drivers/dax/hmem/Makefile
create mode 100644 drivers/dax/hmem/device.c
create mode 100644 drivers/dax/hmem/hmem.c
base-commit: 01830e6c042e8eb6eb202e05d7df8057135b4c26
[PATCH v5 2/3] memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC
by Roger Pau Monne
This is in preparation for the logic behind MEMORY_DEVICE_DEVDAX also
being used by non DAX devices.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Ira Weiny <ira.weiny(a)intel.com>
Acked-by: Andrew Morton <akpm(a)linux-foundation.org>
---
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.ibm.com>
Cc: Johannes Thumshirn <jthumshirn(a)suse.de>
Cc: Logan Gunthorpe <logang(a)deltatee.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: linux-nvdimm(a)lists.01.org
Cc: xen-devel(a)lists.xenproject.org
Cc: linux-mm(a)kvack.org
---
drivers/dax/device.c | 2 +-
include/linux/memremap.h | 9 ++++-----
mm/memremap.c | 2 +-
3 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 4c0af2eb7e19..1e89513f3c59 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -429,7 +429,7 @@ int dev_dax_probe(struct device *dev)
return -EBUSY;
}
- dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
+ dev_dax->pgmap.type = MEMORY_DEVICE_GENERIC;
addr = devm_memremap_pages(dev, &dev_dax->pgmap);
if (IS_ERR(addr))
return PTR_ERR(addr);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 5f5b2df06e61..e5862746751b 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -46,11 +46,10 @@ struct vmem_altmap {
* wakeup is used to coordinate physical address space management (ex:
* fs truncate/hole punch) vs pinned pages (ex: device dma).
*
- * MEMORY_DEVICE_DEVDAX:
+ * MEMORY_DEVICE_GENERIC:
* Host memory that has similar access semantics as System RAM i.e. DMA
- * coherent and supports page pinning. In contrast to
- * MEMORY_DEVICE_FS_DAX, this memory is access via a device-dax
- * character device.
+ * coherent and supports page pinning. This is for example used by DAX devices
+ * that expose memory using a character device.
*
* MEMORY_DEVICE_PCI_P2PDMA:
* Device memory residing in a PCI BAR intended for use with Peer-to-Peer
@@ -60,7 +59,7 @@ enum memory_type {
/* 0 is reserved to catch uninitialized type fields */
MEMORY_DEVICE_PRIVATE = 1,
MEMORY_DEVICE_FS_DAX,
- MEMORY_DEVICE_DEVDAX,
+ MEMORY_DEVICE_GENERIC,
MEMORY_DEVICE_PCI_P2PDMA,
};
diff --git a/mm/memremap.c b/mm/memremap.c
index 03e38b7a38f1..006dace60b1a 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -216,7 +216,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
return ERR_PTR(-EINVAL);
}
break;
- case MEMORY_DEVICE_DEVDAX:
+ case MEMORY_DEVICE_GENERIC:
need_devmap_managed = false;
break;
case MEMORY_DEVICE_PCI_P2PDMA:
--
2.28.0