[PATCH 1/1] ndctl/namespace: Fix disable-namespace accounting relative to seed devices
by Redhairer Li
Seed namespaces are included in "ndctl disable-namespace all". However,
since the user never "creates" them, it is surprising to see
"disable-namespace" report 1 more namespace relative to the number that
have been created. Catch attempts to disable a zero-sized namespace:
Before:
{
"dev":"namespace1.0",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1"
}
{
"dev":"namespace1.1",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.1"
}
{
"dev":"namespace1.2",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.2"
}
disabled 4 namespaces
After:
{
"dev":"namespace1.0",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1"
}
{
"dev":"namespace1.3",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.3"
}
{
"dev":"namespace1.1",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.1"
}
disabled 3 namespaces
Signed-off-by: Redhairer Li <redhairer.li(a)intel.com>
---
ndctl/lib/libndctl.c | 11 ++++++++---
ndctl/region.c | 4 +++-
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index ee737cb..49f362b 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -4231,6 +4231,7 @@ NDCTL_EXPORT int ndctl_namespace_disable_safe(struct ndctl_namespace *ndns)
const char *bdev = NULL;
char path[50];
int fd;
+ unsigned long long size = ndctl_namespace_get_size(ndns);
if (pfn && ndctl_pfn_is_enabled(pfn))
bdev = ndctl_pfn_get_block_device(pfn);
@@ -4260,9 +4261,13 @@ NDCTL_EXPORT int ndctl_namespace_disable_safe(struct ndctl_namespace *ndns)
devname, bdev, strerror(errno));
return -errno;
}
- } else
- ndctl_namespace_disable_invalidate(ndns);
-
+ } else {
+ if (size == 0)
+ /* Don't try to disable idle namespace (no capacity allocated) */
+ return -ENXIO;
+ else
+ ndctl_namespace_disable_invalidate(ndns);
+ }
return 0;
}
diff --git a/ndctl/region.c b/ndctl/region.c
index 7945007..0014bb9 100644
--- a/ndctl/region.c
+++ b/ndctl/region.c
@@ -72,6 +72,7 @@ static int region_action(struct ndctl_region *region, enum device_action mode)
{
struct ndctl_namespace *ndns;
int rc = 0;
+ unsigned long long size;
switch (mode) {
case ACTION_ENABLE:
@@ -80,7 +81,8 @@ static int region_action(struct ndctl_region *region, enum device_action mode)
case ACTION_DISABLE:
ndctl_namespace_foreach(region, ndns) {
rc = ndctl_namespace_disable_safe(ndns);
- if (rc)
+ size = ndctl_namespace_get_size(ndns);
+ if (rc && size != 0)
return rc;
}
rc = ndctl_region_disable_invalidate(region);
--
2.20.1.windows.1
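
The accounting rule in this patch can be modeled in a few lines of
standalone C. This is only a minimal sketch, not the libndctl code:
'struct fake_ns', fake_disable_safe() and count_disabled() are
hypothetical stand-ins, and the only behavior taken from the patch is
that a zero-sized (seed) namespace returns -ENXIO and is not counted.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ndctl_namespace: only the size matters here. */
struct fake_ns {
	unsigned long long size; /* 0 means an idle seed namespace */
	int enabled;
};

/* Mirrors the patched rule: refuse to disable a namespace with no capacity. */
static int fake_disable_safe(struct fake_ns *ns)
{
	if (ns->size == 0)
		return -ENXIO;
	ns->enabled = 0;
	return 0;
}

/* Count only the namespaces that were really disabled, skipping seeds. */
static int count_disabled(struct fake_ns *ns, size_t n)
{
	int disabled = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (fake_disable_safe(&ns[i]) == 0)
			disabled++;
	return disabled;
}
```

With three 492 MiB namespaces plus one zero-sized seed, count_disabled()
reports 3, which matches the "After" output in the commit message.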
Feedback requested: Exposing NVDIMM performance statistics in a generic way
by Vaibhav Jain
Hello,
I am looking for some community feedback on these two Problem-statements:
1. How to expose NVDIMM performance statistics in an arch or nvdimm vendor
agnostic manner?
2. Is there a common set of performance statistics for NVDIMMs that all
vendors should provide?
Problem context
===============
While working on bring-up of PAPR SCM based NVDIMMs[1] for arch/powerpc
we want to expose certain DIMM performance statistics like "Media
Read/Write Counts", "Power-on Seconds" etc. to user-space [2]. These
performance statistics are similar to what ipmctl[3] reports for Intel®
Optane™ persistent memory via the '-show performance' command line
arg. However, the reported set of performance stats doesn't cover the
entirety of all performance stats supported by PAPR SCM based NVDIMMs.
For example, here is a subset of performance stats which are specific to
PAPR SCM NVDIMMs and that are not reported by ipmctl:
* Controller Reset Count
* Controller Reset Elapsed Time
* Power-on Seconds
* Cache Read Hit Count
* Cache Write Hit Count
The possibility of updating ipmctl to add support for these performance
statistics is greatly hampered by the lack of ACPI support on the Powerpc
arch. Secondly, vendors who don't support an ACPI/NFIT command set
similar to Intel® Optane™ (for example MSFT) are also left in the
lurch. Problem-statement#1 points to this specific problem.
Additionally, in the absence of any pre-agreed set of performance statistics
which all vendors should support, adding support for such a
functionality in ipmctl may not bode well for other nvdimm vendors. For
example, if support for reporting "Controller Reset Count" is added to
ipmctl then it may not be applicable to other vendors such as Intel®
Optane™. This issue is what Problem-statement#2 refers to.
Possible Solution for Problem#1
===============================
One possible solution to Problem#1 can be to add support for reporting
NVDIMM performance statistics in 'ndctl'. 'libndctl' already has a layer
that abstracts underlying NVDIMM vendors (via struct ndctl_dimm_ops),
making supporting different NVDIMM vendors fairly easy. Also ndctl is
more widely used compared to 'ipmctl', hence adding such a functionality
to ndctl would make the functionality more widely available.
The above solution was implemented as an RFC patch-set[2] that exposes these
performance statistics through a generic abstraction in libndctl and
adds a presentation layer for this data in ndctl[4]. It adds a new
command line flag '--stats' to ndctl to report *all* nvdimm vendor
reported performance stats. The output is similar to the one below:
# ndctl list -D --stats
[
{
"dev":"nmem0",
"stats":{
"Power-on Seconds":603931,
"Media Read Count":0,
"Media Write Count":6313
}
}
]
This was done by adding two new dimm-ops callbacks that were
implemented by the papr_scm implementation within libndctl. These
callbacks are invoked by newly introduced code in 'util/json-smart.c'
that formats the stats returned from these new dimm-ops and transforms
them into a json-object for later presentation. I would request you to
look at RFC patch-set[2] to understand the implementation details.
Possible Solution for Problem#2
===============================
A solution to Problem-statement#2 is what eludes me though. If there is a
minimal set of performance stats (similar to what ndctl enforces for
health-stats) then such a functionality would be easy to implement in
ndctl/ipmctl. But is it really possible to
have such a common set of performance stats that NVDIMM vendors can
expose?
Patch-set[2] though tries to bypass this problem by letting the vendor
decide which performance stats to expose. This opens up the possibility
of this functionality being abused by dimm vendors to report arbitrary data
through this flag that may not be performance-stats.
Summing-up
==========
In light of the above, I am requesting your feedback as to how
problem-statements#{1, 2} can be addressed within the ndctl subsystem. Also,
are these problems even worth solving?
References
==========
[1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/papr_...
[2] "[ndctl RFC-PATCH 0/4] Add support for reporting PAPR NVDIMM
Statistics"
https://lore.kernel.org/linux-nvdimm/20200518110814.145644-1-vaibhav@linu...
[3] https://docs.pmem.io/ipmctl-user-guide/instrumentation/show-device-perfor...
[4] "[RFC-PATCH 1/4] ndctl,libndctl: Implement new dimm-ops 'new_stats'
and 'get_stat'"
https://lore.kernel.org/linux-nvdimm/20200514225258.508463-2-vaibhav@linu...
Thanks,
~ Vaibhav
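
To make the "generic abstraction" idea concrete, here is a minimal
sketch of a vendor-agnostic stats lookup built on a per-vendor ops
table. It is an illustration under assumptions, not the RFC code: the
callback names 'new_stats' and 'get_stat' come from the title of
reference [4], but their signatures, 'struct dimm_stat' and
'struct fake_dimm_ops' are hypothetical, and the values are the sample
numbers from the output above.

```c
#include <stddef.h>
#include <string.h>

/* One named statistic; the name is vendor-defined, as in the RFC output. */
struct dimm_stat {
	const char *name;
	unsigned long long value;
};

/* Hypothetical per-vendor ops table, loosely modeled on struct ndctl_dimm_ops. */
struct fake_dimm_ops {
	/* Returns the number of stats available and points *out at them. */
	size_t (*new_stats)(const struct dimm_stat **out);
};

/* Example vendor backend with fixed sample values. */
static const struct dimm_stat papr_stats[] = {
	{ "Power-on Seconds", 603931 },
	{ "Media Read Count", 0 },
	{ "Media Write Count", 6313 },
};

static size_t papr_new_stats(const struct dimm_stat **out)
{
	*out = papr_stats;
	return sizeof(papr_stats) / sizeof(papr_stats[0]);
}

static const struct fake_dimm_ops papr_ops = { .new_stats = papr_new_stats };

/* Generic lookup: works for any vendor that fills in the ops table. */
static int get_stat(const struct fake_dimm_ops *ops, const char *name,
		    unsigned long long *value)
{
	const struct dimm_stat *stats;
	size_t i, n;

	if (!ops || !ops->new_stats)
		return -1; /* vendor exposes no statistics */
	n = ops->new_stats(&stats);
	for (i = 0; i < n; i++) {
		if (strcmp(stats[i].name, name) == 0) {
			*value = stats[i].value;
			return 0;
		}
	}
	return -1; /* unknown statistic */
}
```

Because the stat names are free-form strings chosen by the vendor, this
shape also shows the concern raised under Problem#2: nothing in the
generic layer constrains what a vendor reports.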
[RFC PATCH 0/4] powerpc/papr_scm: Add support for reporting NVDIMM performance statistics
by Vaibhav Jain
The patch-set proposes to add support for fetching and reporting
performance statistics for PAPR compliant NVDIMMs as described in
documentation for H_SCM_PERFORMANCE_STATS hcall Ref[1]. The patch-set
also implements mechanisms to expose NVDIMM performance stats via
sysfs and newly introduced PDSMs[2] for libndctl.
This patch-set combined with corresponding ndctl and libndctl changes
proposed at Ref[3] should enable users to fetch performance statistics
for PAPR compliant NVDIMMs using the following command:
# ndctl list -D --stats
[
{
"dev":"nmem0",
"stats":{
"Controller Reset Count":2,
"Controller Reset Elapsed Time":603331,
"Power-on Seconds":603931,
"Life Remaining":"100%",
"Critical Resource Utilization":"0%",
"Host Load Count":5781028,
"Host Store Count":8966800,
"Host Load Duration":975895365,
"Host Store Duration":716230690,
"Media Read Count":0,
"Media Write Count":6313,
"Media Read Duration":0,
"Media Write Duration":9679615,
"Cache Read Hit Count":5781028,
"Cache Write Hit Count":8442479,
"Fast Write Count":8969912
}
}
]
The patch-set is dependent on the existing patch-set "[PATCH v7 0/5]
powerpc/papr_scm: Add support for reporting nvdimm health" available
at Ref[2] that adds support for reporting the health of PAPR compliant
NVDIMMs in the 'papr_scm' kernel module.
Structure of the patch-set
==========================
The patch-set starts with implementing functionality in the papr_scm
module to issue the H_SCM_PERFORMANCE_STATS hcall, fetch and parse dimm
performance stats, and expose them as a PAPR specific libnvdimm
attribute named 'perf_stats'.
Patch-2 introduces a new PDSM named FETCH_PERF_STATS that can be
issued by libndctl asking papr_scm to issue the
H_SCM_PERFORMANCE_STATS hcall using helpers introduced earlier and
storing the results in a dimm specific perf-stats-buffer.
Patch-3 introduces a new PDSM named READ_PERF_STATS that can be
issued by libndctl to read the perf-stats-buffer in an incremental
manner to work around the 256-byte envelope limitation of libnvdimm.
Finally Patch-4 introduces a new PDSM named GET_PERF_STAT that can be
issued by libndctl to read values of a specific NVDIMM performance
stat like "Life Remaining".
References
==========
[1] Documentation/powerpc/papr_hcalls.rst
[2] https://lore.kernel.org/linux-nvdimm/20200508104922.72565-1-vaibhav@linux...
[3] https://github.com/vaibhav92/ndctl/tree/papr_scm_stats_v1
Vaibhav Jain (4):
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
powerpc/papr_scm: Add support for PAPR_SCM_PDSM_FETCH_PERF_STATS
powerpc/papr_scm: Implement support for PAPR_SCM_PDSM_READ_PERF_STATS
powerpc/papr_scm: Add support for PDSM GET_PERF_STAT
Documentation/ABI/testing/sysfs-bus-papr-scm | 27 ++
arch/powerpc/include/uapi/asm/papr_scm_pdsm.h | 60 +++
arch/powerpc/platforms/pseries/papr_scm.c | 391 ++++++++++++++++++
3 files changed, 478 insertions(+)
--
2.26.2
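
The incremental read described for READ_PERF_STATS can be sketched in
plain C. This is a model under assumptions rather than the papr_scm
code: read_chunk() and read_all() are hypothetical helpers, and the
only detail taken from the cover letter is the 256-byte envelope limit
that forces the stats buffer to be transferred in several requests.

```c
#include <stddef.h>
#include <string.h>

#define ENVELOPE_MAX 256 /* per-request payload limit named in the cover letter */

/*
 * Hypothetical model of one READ_PERF_STATS request: copy at most
 * ENVELOPE_MAX bytes of the stats buffer starting at 'offset'.
 * Returns the number of bytes copied (0 once the buffer is exhausted).
 */
static size_t read_chunk(const unsigned char *stats, size_t stats_len,
			 size_t offset, unsigned char *envelope)
{
	size_t n;

	if (offset >= stats_len)
		return 0;
	n = stats_len - offset;
	if (n > ENVELOPE_MAX)
		n = ENVELOPE_MAX;
	memcpy(envelope, stats + offset, n);
	return n;
}

/* Caller side: loop until the whole buffer has been transferred. */
static size_t read_all(const unsigned char *stats, size_t stats_len,
		       unsigned char *dst, size_t *requests)
{
	unsigned char envelope[ENVELOPE_MAX];
	size_t offset = 0, n;

	*requests = 0;
	while ((n = read_chunk(stats, stats_len, offset, envelope)) > 0) {
		memcpy(dst + offset, envelope, n);
		offset += n;
		(*requests)++;
	}
	return offset;
}
```

A 600-byte stats buffer, for instance, would take three requests in
this model (256 + 256 + 88 bytes).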
[PATCH v2] ACPI: Drop rcu usage for MMIO mappings
by Dan Williams
Recently a performance problem was reported for a process invoking a
non-trivial ASL program. The method call in this case ends up
repetitively triggering a call path like:
acpi_ex_store
acpi_ex_store_object_to_node
acpi_ex_write_data_to_field
acpi_ex_insert_into_field
acpi_ex_write_with_update_rule
acpi_ex_field_datum_io
acpi_ex_access_region
acpi_ev_address_space_dispatch
acpi_ex_system_memory_space_handler
acpi_os_map_cleanup.part.14
_synchronize_rcu_expedited.constprop.89
schedule
The end result of frequent synchronize_rcu_expedited() invocation is
tiny sub-millisecond spurts of execution where the scheduler freely
migrates this apparently sleepy task. The overhead of frequent scheduler
invocation multiplies the execution time by a factor of 2-3X.
For example, performance improves from 16 minutes to 7 minutes for a
firmware update procedure across 24 devices.
Perhaps the rcu usage was intended to allow for not taking a sleeping
lock in the acpi_os_{read,write}_memory() path which ostensibly could be
called from an APEI NMI error interrupt? Neither rcu_read_lock() nor
ioremap() are interrupt safe, so add a WARN_ONCE() to validate that rcu
was not serving as a mechanism to avoid direct calls to ioremap(). Even
the original implementation had a spin_lock_irqsave(), but that is not
NMI safe.
APEI itself already has some concept of avoiding ioremap() from
interrupt context (see erst_exec_move_data()). If the new warning
triggers, it means that APEI either needs more instrumentation like that
to pre-emptively fail, or more infrastructure to arrange for pre-mapping
the resources it needs in NMI context.
Cc: <stable(a)vger.kernel.org>
Fixes: 620242ae8c3d ("ACPI: Maintain a list of ACPI memory mapped I/O remappings")
Cc: Len Brown <lenb(a)kernel.org>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: James Morse <james.morse(a)arm.com>
Cc: Erik Kaneda <erik.kaneda(a)intel.com>
Cc: Myron Stowe <myron.stowe(a)redhat.com>
Cc: "Rafael J. Wysocki" <rjw(a)rjwysocki.net>
Cc: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
Changes since v1 [1]:
- Actually cc: the most important list for ACPI changes (Rafael)
- Cleanup unnecessary variable initialization (Andy)
Link: https://lore.kernel.org/linux-nvdimm/158880834905.2183490.156163294694202...
drivers/acpi/osl.c | 117 +++++++++++++++++++++++++---------------------------
1 file changed, 57 insertions(+), 60 deletions(-)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 762c5d50b8fe..a44b75aac5d0 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -214,13 +214,13 @@ acpi_physical_address __init acpi_os_get_root_pointer(void)
return pa;
}
-/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */
static struct acpi_ioremap *
acpi_map_lookup(acpi_physical_address phys, acpi_size size)
{
struct acpi_ioremap *map;
- list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
+ lockdep_assert_held(&acpi_ioremap_lock);
+ list_for_each_entry(map, &acpi_ioremaps, list)
if (map->phys <= phys &&
phys + size <= map->phys + map->size)
return map;
@@ -228,7 +228,6 @@ acpi_map_lookup(acpi_physical_address phys, acpi_size size)
return NULL;
}
-/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */
static void __iomem *
acpi_map_vaddr_lookup(acpi_physical_address phys, unsigned int size)
{
@@ -263,7 +262,8 @@ acpi_map_lookup_virt(void __iomem *virt, acpi_size size)
{
struct acpi_ioremap *map;
- list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
+ lockdep_assert_held(&acpi_ioremap_lock);
+ list_for_each_entry(map, &acpi_ioremaps, list)
if (map->virt <= virt &&
virt + size <= map->virt + map->size)
return map;
@@ -360,7 +360,7 @@ void __iomem __ref
map->size = pg_sz;
map->refcount = 1;
- list_add_tail_rcu(&map->list, &acpi_ioremaps);
+ list_add_tail(&map->list, &acpi_ioremaps);
out:
mutex_unlock(&acpi_ioremap_lock);
@@ -374,20 +374,13 @@ void *__ref acpi_os_map_memory(acpi_physical_address phys, acpi_size size)
}
EXPORT_SYMBOL_GPL(acpi_os_map_memory);
-/* Must be called with mutex_lock(&acpi_ioremap_lock) */
-static unsigned long acpi_os_drop_map_ref(struct acpi_ioremap *map)
-{
- unsigned long refcount = --map->refcount;
-
- if (!refcount)
- list_del_rcu(&map->list);
- return refcount;
-}
-
-static void acpi_os_map_cleanup(struct acpi_ioremap *map)
+static void acpi_os_drop_map_ref(struct acpi_ioremap *map)
{
- synchronize_rcu_expedited();
+ lockdep_assert_held(&acpi_ioremap_lock);
+ if (--map->refcount > 0)
+ return;
acpi_unmap(map->phys, map->virt);
+ list_del(&map->list);
kfree(map);
}
@@ -408,7 +401,6 @@ static void acpi_os_map_cleanup(struct acpi_ioremap *map)
void __ref acpi_os_unmap_iomem(void __iomem *virt, acpi_size size)
{
struct acpi_ioremap *map;
- unsigned long refcount;
if (!acpi_permanent_mmap) {
__acpi_unmap_table(virt, size);
@@ -422,11 +414,8 @@ void __ref acpi_os_unmap_iomem(void __iomem *virt, acpi_size size)
WARN(true, PREFIX "%s: bad address %p\n", __func__, virt);
return;
}
- refcount = acpi_os_drop_map_ref(map);
+ acpi_os_drop_map_ref(map);
mutex_unlock(&acpi_ioremap_lock);
-
- if (!refcount)
- acpi_os_map_cleanup(map);
}
EXPORT_SYMBOL_GPL(acpi_os_unmap_iomem);
@@ -461,7 +450,6 @@ void acpi_os_unmap_generic_address(struct acpi_generic_address *gas)
{
u64 addr;
struct acpi_ioremap *map;
- unsigned long refcount;
if (gas->space_id != ACPI_ADR_SPACE_SYSTEM_MEMORY)
return;
@@ -477,11 +465,8 @@ void acpi_os_unmap_generic_address(struct acpi_generic_address *gas)
mutex_unlock(&acpi_ioremap_lock);
return;
}
- refcount = acpi_os_drop_map_ref(map);
+ acpi_os_drop_map_ref(map);
mutex_unlock(&acpi_ioremap_lock);
-
- if (!refcount)
- acpi_os_map_cleanup(map);
}
EXPORT_SYMBOL(acpi_os_unmap_generic_address);
@@ -700,55 +685,71 @@ int acpi_os_read_iomem(void __iomem *virt_addr, u64 *value, u32 width)
return 0;
}
+static void __iomem *acpi_os_rw_map(acpi_physical_address phys_addr,
+ unsigned int size, bool *did_fallback)
+{
+ void __iomem *virt_addr;
+
+ if (WARN_ONCE(in_interrupt(), "ioremap in interrupt context\n"))
+ return NULL;
+
+ /* Try to use a cached mapping and fallback otherwise */
+ *did_fallback = false;
+ mutex_lock(&acpi_ioremap_lock);
+ virt_addr = acpi_map_vaddr_lookup(phys_addr, size);
+ if (virt_addr)
+ return virt_addr;
+ mutex_unlock(&acpi_ioremap_lock);
+
+ virt_addr = acpi_os_ioremap(phys_addr, size);
+ *did_fallback = true;
+
+ return virt_addr;
+}
+
+static void acpi_os_rw_unmap(void __iomem *virt_addr, bool did_fallback)
+{
+ if (did_fallback) {
+ /* in the fallback case no lock is held */
+ iounmap(virt_addr);
+ return;
+ }
+
+ mutex_unlock(&acpi_ioremap_lock);
+}
+
acpi_status
acpi_os_read_memory(acpi_physical_address phys_addr, u64 *value, u32 width)
{
- void __iomem *virt_addr;
unsigned int size = width / 8;
- bool unmap = false;
+ bool did_fallback = false;
+ void __iomem *virt_addr;
u64 dummy;
int error;
- rcu_read_lock();
- virt_addr = acpi_map_vaddr_lookup(phys_addr, size);
- if (!virt_addr) {
- rcu_read_unlock();
- virt_addr = acpi_os_ioremap(phys_addr, size);
- if (!virt_addr)
- return AE_BAD_ADDRESS;
- unmap = true;
- }
-
+ virt_addr = acpi_os_rw_map(phys_addr, size, &did_fallback);
+ if (!virt_addr)
+ return AE_BAD_ADDRESS;
if (!value)
value = &dummy;
error = acpi_os_read_iomem(virt_addr, value, width);
BUG_ON(error);
- if (unmap)
- iounmap(virt_addr);
- else
- rcu_read_unlock();
-
+ acpi_os_rw_unmap(virt_addr, did_fallback);
return AE_OK;
}
acpi_status
acpi_os_write_memory(acpi_physical_address phys_addr, u64 value, u32 width)
{
- void __iomem *virt_addr;
unsigned int size = width / 8;
- bool unmap = false;
+ bool did_fallback = false;
+ void __iomem *virt_addr;
- rcu_read_lock();
- virt_addr = acpi_map_vaddr_lookup(phys_addr, size);
- if (!virt_addr) {
- rcu_read_unlock();
- virt_addr = acpi_os_ioremap(phys_addr, size);
- if (!virt_addr)
- return AE_BAD_ADDRESS;
- unmap = true;
- }
+ virt_addr = acpi_os_rw_map(phys_addr, size, &did_fallback);
+ if (!virt_addr)
+ return AE_BAD_ADDRESS;
switch (width) {
case 8:
@@ -767,11 +768,7 @@ acpi_os_write_memory(acpi_physical_address phys_addr, u64 value, u32 width)
BUG();
}
- if (unmap)
- iounmap(virt_addr);
- else
- rcu_read_unlock();
-
+ acpi_os_rw_unmap(virt_addr, did_fallback);
return AE_OK;
}
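
The refcounted mapping cache that this patch simplifies can be modeled
in userspace C. The sketch below is single-threaded and omits locking,
whereas the kernel code holds acpi_ioremap_lock around both lookup and
release. 'struct fake_map', map_lookup(), map_add() and map_drop_ref()
are hypothetical names, and malloc()/free() stand in for the kernel's
mapping and allocation calls.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of struct acpi_ioremap. */
struct fake_map {
	struct fake_map *next;
	uint64_t phys;
	uint64_t size;
	unsigned long refcount;
};

/* Return the cached mapping that fully contains [phys, phys + size). */
static struct fake_map *map_lookup(struct fake_map *head, uint64_t phys,
				   uint64_t size)
{
	struct fake_map *map;

	for (map = head; map; map = map->next)
		if (map->phys <= phys && phys + size <= map->phys + map->size)
			return map;
	return NULL;
}

/* Push a new mapping with one reference; returns NULL on allocation failure. */
static struct fake_map *map_add(struct fake_map **head, uint64_t phys,
				uint64_t size)
{
	struct fake_map *map = malloc(sizeof(*map));

	if (!map)
		return NULL;
	map->phys = phys;
	map->size = size;
	map->refcount = 1;
	map->next = *head;
	*head = map;
	return map;
}

/* Drop one reference; unlink and free when the last one goes away. */
static void map_drop_ref(struct fake_map **head, struct fake_map *map)
{
	struct fake_map **pos;

	if (--map->refcount > 0)
		return;
	for (pos = head; *pos; pos = &(*pos)->next) {
		if (*pos == map) {
			*pos = map->next;
			break;
		}
	}
	free(map);
}
```

Doing the unlink and the free in one step, under the same lock as the
lookup, is what lets the patch drop the separate cleanup function and
its synchronize_rcu_expedited() call.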
[PATCH ndctl v1 00/10] daxctl: Support for sub-dividing soft-reserved regions
by Joao Martins
Hey,
This series introduces the daxctl support for sub-dividing soft-reserved
regions created by EFI/HMAT/efi_fake_mem. It's the userspace counterpart
of this recent patch series [0].
These new 'dynamic' regions can be partitioned into multiple different devices
whose subdivisions can consist of one or more ranges. This
is in contrast to static dax regions -- created with ndctl-create-namespace
-m devdax -- which can be neither subdivided nor discontiguous.
See also cover-letter of [0].
The daxctl changes in these patches are depicted as:
* {create,destroy,disable,enable}-device:
These orchestrate/manage the sub-division devices.
They mimic the equivalent namespace commands.
* Allow reconfigure-device to change the size of an existing *dynamic* dax
device.
* Add test coverage (so far tried to cover all range allocation code paths,
but I am still fishing for bugs). Additionally, there are bugs so applying
[0] may not make it pass the added functional test yet.
I am sending the series early (i.e. before the kernel patches get merged)
mainly to share a common set of unit tests, and also to let others try it out.
The only TODOs left are documentation, and perhaps listing of the mappingX sysfs
entries.
Thoughts, comments appreciated. :)
Thanks!
Joao
[0] "device-dax: Support sub-dividing soft-reserved ranges",
https://lore.kernel.org/linux-nvdimm/158500767138.2088294.171316462598039...
Dan Williams (1):
daxctl: Cleanup whitespace
Joao Martins (9):
libdaxctl: add daxctl_dev_set_size()
daxctl: add resize support in reconfigure-device
daxctl: add command to disable devdax device
daxctl: add command to enable devdax device
libdaxctl: add daxctl_region_create_dev()
daxctl: add command to create device
libdaxctl: add daxctl_region_destroy_dev()
daxctl: add command to destroy device
daxctl/test: Add tests for dynamic dax regions
daxctl/builtin.h | 4 +
daxctl/daxctl.c | 4 +
daxctl/device.c | 310 ++++++++++++++++++++++++++++++++++++++-
daxctl/lib/libdaxctl.c | 67 +++++++++
daxctl/lib/libdaxctl.sym | 7 +
daxctl/libdaxctl.h | 3 +
test/Makefile.am | 1 +
test/daxctl-create.sh | 293 ++++++++++++++++++++++++++++++++++++
util/filter.c | 2 +-
9 files changed, 686 insertions(+), 5 deletions(-)
create mode 100755 test/daxctl-create.sh
--
2.17.1
[PATCH 00/12] device-dax: Support sub-dividing soft-reserved ranges
by Dan Williams
The device-dax facility allows an address range to be directly mapped
through a chardev, or turned around and hotplugged to the core kernel
page allocator as System-RAM. It is the baseline mechanism for
converting persistent memory (pmem) to be used as another volatile
memory pool i.e. the current Memory Tiering hot topic on linux-mm.
In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [1]. This series provides
a sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.
The motivations for this facility are:
1/ Allow performance differentiated memory ranges to be split between
kernel-managed and directly-accessed use cases.
2/ Allow physical memory to be provisioned along performance relevant
address boundaries. For example, divide a memory-side cache [2] along
cache-color boundaries.
3/ Parcel out soft-reserved memory to VMs using device-dax as a security
/ permissions boundary [3]. Specifically I have seen people (ab)using
memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
device-dax interface on custom address ranges.
The baseline for this series is today's next/master + "[PATCH v2 0/6]
Manual definition of Soft Reserved memory devices" [4].
Big thanks to Joao for the early testing and feedback on this series!
Given the dependencies on the memremap_pages() reworks in Andrew's tree
and the proximity to v5.7 this is clearly v5.8 material. The patches in
most need of a second opinion are the memremap_pages() reworks to switch
from 'struct resource' to 'struct range' and allow for an array of
ranges to be mapped at once.
[1]: https://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit...
[2]: https://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit...
[3]: https://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com/
[4]: http://lore.kernel.org/r/158489354353.1457606.8327903161927980740.stgit@d...
---
Dan Williams (12):
device-dax: Drop the dax_region.pfn_flags attribute
device-dax: Move instance creation parameters to 'struct dev_dax_data'
device-dax: Make pgmap optional for instance creation
device-dax: Kill dax_kmem_res
device-dax: Add an allocation interface for device-dax instances
device-dax: Introduce seed devices
drivers/base: Make device_find_child_by_name() compatible with sysfs inputs
device-dax: Add resize support
mm/memremap_pages: Convert to 'struct range'
mm/memremap_pages: Support multiple ranges per invocation
device-dax: Add dis-contiguous resource support
device-dax: Introduce 'mapping' devices
arch/powerpc/kvm/book3s_hv_uvmem.c | 14 -
drivers/base/core.c | 2
drivers/dax/bus.c | 877 ++++++++++++++++++++++++++++++--
drivers/dax/bus.h | 28 +
drivers/dax/dax-private.h | 36 +
drivers/dax/device.c | 97 ++--
drivers/dax/hmem/hmem.c | 18 -
drivers/dax/kmem.c | 170 +++---
drivers/dax/pmem/compat.c | 2
drivers/dax/pmem/core.c | 22 +
drivers/gpu/drm/nouveau/nouveau_dmem.c | 4
drivers/nvdimm/badrange.c | 26 -
drivers/nvdimm/claim.c | 13
drivers/nvdimm/nd.h | 3
drivers/nvdimm/pfn_devs.c | 13
drivers/nvdimm/pmem.c | 27 +
drivers/nvdimm/region.c | 21 -
drivers/pci/p2pdma.c | 12
include/linux/memremap.h | 9
include/linux/range.h | 6
mm/memremap.c | 297 ++++++-----
tools/testing/nvdimm/dax-dev.c | 22 +
tools/testing/nvdimm/test/iomap.c | 2
23 files changed, 1318 insertions(+), 403 deletions(-)
[PATCH v2 1/5] powerpc/pmem: Add new instructions for persistent storage and sync
by Aneesh Kumar K.V
POWER10 introduces two new variants of dcbf instructions (dcbstps and dcbfps)
that can be used to write modified locations back to persistent storage.
Additionally, POWER10 also introduces phwsync and plwsync which can be used
to establish order of these writes to persistent storage.
This patch exposes these instructions to the rest of the kernel. The existing
dcbf and hwsync instructions in P9 are adequate to enable appropriate
synchronization with OpenCAPI-hosted persistent storage. Hence the new
instructions are added as a variant of the old ones that old hardware
won't differentiate.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
---
arch/powerpc/include/asm/ppc-opcode.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index c1df75edde44..45eccd842f84 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -216,6 +216,8 @@
#define PPC_INST_STWCX 0x7c00012d
#define PPC_INST_LWSYNC 0x7c2004ac
#define PPC_INST_SYNC 0x7c0004ac
+#define PPC_INST_PHWSYNC 0x7c8004ac
+#define PPC_INST_PLWSYNC 0x7ca004ac
#define PPC_INST_SYNC_MASK 0xfc0007fe
#define PPC_INST_ISYNC 0x4c00012c
#define PPC_INST_LXVD2X 0x7c000698
@@ -281,6 +283,8 @@
#define PPC_INST_TABORT 0x7c00071d
#define PPC_INST_TSR 0x7c0005dd
+#define PPC_INST_DCBF 0x7c0000ac
+
#define PPC_INST_NAP 0x4c000364
#define PPC_INST_SLEEP 0x4c0003a4
#define PPC_INST_WINKLE 0x4c0003e4
@@ -529,6 +533,14 @@
#define STBCIX(s,a,b) stringify_in_c(.long PPC_INST_STBCIX | \
__PPC_RS(s) | __PPC_RA(a) | __PPC_RB(b))
+#define PPC_DCBFPS(a, b) stringify_in_c(.long PPC_INST_DCBF | \
+ ___PPC_RA(a) | ___PPC_RB(b) | (4 << 21))
+#define PPC_DCBSTPS(a, b) stringify_in_c(.long PPC_INST_DCBF | \
+ ___PPC_RA(a) | ___PPC_RB(b) | (6 << 21))
+
+#define PPC_PHWSYNC stringify_in_c(.long PPC_INST_PHWSYNC)
+#define PPC_PLWSYNC stringify_in_c(.long PPC_INST_PLWSYNC)
+
/*
* Define what the VSX XX1 form instructions will look like, then add
* the 128 bit load store instructions based on that.
--
2.26.2
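
The macros added here build each new instruction word by OR-ing
register fields and a variant selector into the base dcbf opcode. The
sketch below reproduces that arithmetic in standalone C. The helper
names are hypothetical, and the field positions (RA at bit 16, RB at
bit 11, 5 bits each) are an assumption about ___PPC_RA()/___PPC_RB(),
which are defined outside the quoted hunk.

```c
#include <stdint.h>

#define INST_DCBF 0x7c0000acu /* base opcode, as PPC_INST_DCBF in the patch */

/* Register fields are assumed 5 bits wide: RA at bit 16, RB at bit 11. */
static uint32_t field_ra(unsigned int r) { return (uint32_t)(r & 0x1f) << 16; }
static uint32_t field_rb(unsigned int r) { return (uint32_t)(r & 0x1f) << 11; }

/* Same composition as PPC_DCBFPS(a, b): base opcode with 4 in the field at bit 21. */
static uint32_t encode_dcbfps(unsigned int ra, unsigned int rb)
{
	return INST_DCBF | field_ra(ra) | field_rb(rb) | (4u << 21);
}

/* Same composition as PPC_DCBSTPS(a, b): base opcode with 6 in the field at bit 21. */
static uint32_t encode_dcbstps(unsigned int ra, unsigned int rb)
{
	return INST_DCBF | field_ra(ra) | field_rb(rb) | (6u << 21);
}
```

Under those assumptions, encode_dcbfps(0, 3) yields 0x7c8018ac and
encode_dcbstps(0, 3) yields 0x7cc018ac. Both differ from a plain dcbf
only in the field at bit 21, which is why the commit message says older
hardware won't differentiate them.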
[RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.
by Aneesh Kumar K.V
With POWER10, the architecture is adding new pmem flush and sync instructions.
The kernel should prevent the usage of MAP_SYNC if applications are not using
the new instructions on newer hardware.
This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
the usage of MAP_SYNC. The kernel config option is added to allow the user
to control whether MAP_SYNC should be enabled by default or not.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
---
include/linux/sched/coredump.h | 13 ++++++++++---
include/uapi/linux/prctl.h | 3 +++
kernel/fork.c | 8 +++++++-
kernel/sys.c | 18 ++++++++++++++++++
mm/Kconfig | 3 +++
mm/mmap.c | 4 ++++
6 files changed, 45 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index ecdc6542070f..9ba6b3d5f991 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -72,9 +72,16 @@ static inline int get_dumpable(struct mm_struct *mm)
#define MMF_DISABLE_THP 24 /* disable THP for all VMAs */
#define MMF_OOM_VICTIM 25 /* mm is the oom victim */
#define MMF_OOM_REAP_QUEUED 26 /* mm was queued for oom_reaper */
-#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
+#define MMF_DISABLE_MAP_SYNC 27 /* disable MAP_SYNC for this mm */
+#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
+#define MMF_DISABLE_MAP_SYNC_MASK (1 << MMF_DISABLE_MAP_SYNC)
-#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
- MMF_DISABLE_THP_MASK)
+#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK | \
+ MMF_DISABLE_THP_MASK | MMF_DISABLE_MAP_SYNC_MASK)
+
+static inline bool map_sync_enabled(struct mm_struct *mm)
+{
+ return !(mm->flags & MMF_DISABLE_MAP_SYNC_MASK);
+}
#endif /* _LINUX_SCHED_COREDUMP_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..ee4cde32d5cf 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,7 @@ struct prctl_mm_map {
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58
+#define PR_SET_MAP_SYNC_ENABLE 59
+#define PR_GET_MAP_SYNC_ENABLE 60
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 8c700f881d92..d5a9a363e81e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -963,6 +963,12 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
+#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
+unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
+#else
+unsigned long default_map_sync_mask = 0;
+#endif
+
static int __init coredump_filter_setup(char *s)
{
default_dump_filter =
@@ -1039,7 +1045,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->flags = current->mm->flags & MMF_INIT_MASK;
mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
} else {
- mm->flags = default_dump_filter;
+ mm->flags = default_dump_filter | default_map_sync_mask;
mm->def_flags = 0;
}
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a..f6127cf4128b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2450,6 +2450,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
clear_bit(MMF_DISABLE_THP, &me->mm->flags);
up_write(&me->mm->mmap_sem);
break;
+
+ case PR_GET_MAP_SYNC_ENABLE:
+ if (arg2 || arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = !test_bit(MMF_DISABLE_MAP_SYNC, &me->mm->flags);
+ break;
+ case PR_SET_MAP_SYNC_ENABLE:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (down_write_killable(&me->mm->mmap_sem))
+ return -EINTR;
+ if (arg2)
+ clear_bit(MMF_DISABLE_MAP_SYNC, &me->mm->flags);
+ else
+ set_bit(MMF_DISABLE_MAP_SYNC, &me->mm->flags);
+ up_write(&me->mm->mmap_sem);
+ break;
+
case PR_MPX_ENABLE_MANAGEMENT:
case PR_MPX_DISABLE_MANAGEMENT:
/* No longer implemented: */
diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c35..38fd7cfbfca8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -867,4 +867,7 @@ config ARCH_HAS_HUGEPD
config MAPPING_DIRTY_HELPERS
bool
+config ARCH_MAP_SYNC_DISABLE
+ bool
+
endmenu
diff --git a/mm/mmap.c b/mm/mmap.c
index f609e9ec4a25..613e5894f178 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1464,6 +1464,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
case MAP_SHARED_VALIDATE:
if (flags & ~flags_mask)
return -EOPNOTSUPP;
+
+ if ((flags & MAP_SYNC) && !map_sync_enabled(mm))
+ return -EOPNOTSUPP;
+
if (prot & PROT_WRITE) {
if (!(file->f_mode & FMODE_WRITE))
return -EACCES;
--
2.26.2
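
The gating logic in this RFC can be followed end to end with a small
userspace model. It deliberately does not call prctl(): the option
numbers in the patch belong to an unmerged RFC. 'struct fake_mm',
set_map_sync_enable(), check_mmap_flags() and FAKE_MAP_SYNC are
hypothetical. Only the bit number, the inverted "disable" sense of the
flag, and the -EOPNOTSUPP result are taken from the diff.

```c
#include <errno.h>
#include <stdbool.h>

#define MMF_DISABLE_MAP_SYNC      27 /* bit number, as in the patch */
#define MMF_DISABLE_MAP_SYNC_MASK (1UL << MMF_DISABLE_MAP_SYNC)
#define FAKE_MAP_SYNC             0x80000UL /* illustrative flag value */

/* Hypothetical stand-in for the flags word of struct mm_struct. */
struct fake_mm {
	unsigned long flags;
};

static bool map_sync_enabled(const struct fake_mm *mm)
{
	return !(mm->flags & MMF_DISABLE_MAP_SYNC_MASK);
}

/* Models PR_SET_MAP_SYNC_ENABLE: non-zero arg enables, zero disables. */
static void set_map_sync_enable(struct fake_mm *mm, unsigned long arg)
{
	if (arg)
		mm->flags &= ~MMF_DISABLE_MAP_SYNC_MASK;
	else
		mm->flags |= MMF_DISABLE_MAP_SYNC_MASK;
}

/* Models the added do_mmap() check: reject MAP_SYNC when it is disabled. */
static int check_mmap_flags(const struct fake_mm *mm, unsigned long flags)
{
	if ((flags & FAKE_MAP_SYNC) && !map_sync_enabled(mm))
		return -EOPNOTSUPP;
	return 0;
}
```

Starting from an mm with the disable bit set (the
CONFIG_ARCH_MAP_SYNC_DISABLE default in the patch), a MAP_SYNC request
fails until the process opts in, after which the same request passes.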
[PATCH v5 0/2] Renovate memcpy_mcsafe with copy_mc_to_{user, kernel}
by Dan Williams
Changes since v4 [1]:
- Fix up .gitignore for PowerPC test artifacts (Michael)
- Collect Michael's Ack.
[1]: http://lore.kernel.org/r/159010126119.975921.6614194205409771984.stgit@dw...
---
The primary motivation to go touch memcpy_mcsafe() is that the existing
benefit of doing slow "handle with care" copies is obviated on newer
CPUs. With that concern lifted it also obviates the need to continue to
update the MCA-recovery capability detection code currently gated by
"mcsafe_key". Now the old "mcsafe_key" opt-in to perform the copy with
concerns for recovery fragility can instead be made an opt-out from the
default fast copy implementation (enable_copy_mc_fragile()).
The discussion with Linus on the first iteration of this patch
identified that memcpy_mcsafe() was misnamed relative to its usage. The
new names copy_mc_to_user() and copy_mc_to_kernel() clearly indicate the
intended use case and let the architecture organize the implementation
accordingly.
For both powerpc and x86 a copy_mc_generic() implementation is added as
the backend for these interfaces.
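To illustrate the kind of contract such a copy helper offers, here is a small userspace model. It is a sketch under stated assumptions, not the kernel implementation: model_copy_mc() and MODEL_NO_POISON are hypothetical names, the "poison" offset stands in for a machine check raised while reading the source, and the return convention (number of bytes not copied, 0 on success) reflects my understanding of the kernel helpers rather than anything spelled out in this cover letter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical marker meaning "the source buffer has no poisoned byte". */
#define MODEL_NO_POISON ((size_t)-1)

/*
 * Model of a machine-check-aware copy: copy bytes up to, but not
 * including, the poisoned source offset and then stop.
 * Returns the number of bytes NOT copied, so 0 means full success.
 */
static size_t model_copy_mc(void *dst, const void *src, size_t len,
			    size_t poison_off)
{
	size_t ok = len;

	/* MODEL_NO_POISON is SIZE_MAX, so it is never below len. */
	if (poison_off < len)
		ok = poison_off;
	memcpy(dst, src, ok);
	return len - ok;
}
```

A caller treats a non-zero return as a short copy: the first len minus the return value bytes of dst are valid, and the remainder must be handled as an I/O error instead of crashing.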
Patches are relative to tip/master.
---
Dan Williams (2):
x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user,kernel}()
x86/copy_mc: Introduce copy_mc_generic()
arch/powerpc/Kconfig | 2
arch/powerpc/include/asm/string.h | 2
arch/powerpc/include/asm/uaccess.h | 40 +++--
arch/powerpc/lib/Makefile | 2
arch/powerpc/lib/copy_mc_64.S | 4
arch/x86/Kconfig | 2
arch/x86/Kconfig.debug | 2
arch/x86/include/asm/copy_mc_test.h | 75 +++++++++
arch/x86/include/asm/mcsafe_test.h | 75 ---------
arch/x86/include/asm/string_64.h | 32 ----
arch/x86/include/asm/uaccess.h | 21 +++
arch/x86/include/asm/uaccess_64.h | 20 --
arch/x86/kernel/cpu/mce/core.c | 8 -
arch/x86/kernel/quirks.c | 9 -
arch/x86/lib/Makefile | 1
arch/x86/lib/copy_mc.c | 64 ++++++++
arch/x86/lib/copy_mc_64.S | 165 ++++++++++++++++++++
arch/x86/lib/memcpy_64.S | 115 --------------
arch/x86/lib/usercopy_64.c | 21 ---
drivers/md/dm-writecache.c | 15 +-
drivers/nvdimm/claim.c | 2
drivers/nvdimm/pmem.c | 6 -
include/linux/string.h | 9 -
include/linux/uaccess.h | 9 +
include/linux/uio.h | 10 +
lib/Kconfig | 7 +
lib/iov_iter.c | 43 +++--
tools/arch/x86/include/asm/mcsafe_test.h | 13 --
tools/arch/x86/lib/memcpy_64.S | 115 --------------
tools/objtool/check.c | 5 -
tools/perf/bench/Build | 1
tools/perf/bench/mem-memcpy-x86-64-lib.c | 24 ---
tools/testing/nvdimm/test/nfit.c | 48 +++---
.../testing/selftests/powerpc/copyloops/.gitignore | 2
tools/testing/selftests/powerpc/copyloops/Makefile | 6 -
.../selftests/powerpc/copyloops/copy_mc_64.S | 1
.../selftests/powerpc/copyloops/memcpy_mcsafe_64.S | 1
37 files changed, 451 insertions(+), 526 deletions(-)
rename arch/powerpc/lib/{memcpy_mcsafe_64.S => copy_mc_64.S} (98%)
create mode 100644 arch/x86/include/asm/copy_mc_test.h
delete mode 100644 arch/x86/include/asm/mcsafe_test.h
create mode 100644 arch/x86/lib/copy_mc.c
create mode 100644 arch/x86/lib/copy_mc_64.S
delete mode 100644 tools/arch/x86/include/asm/mcsafe_test.h
delete mode 100644 tools/perf/bench/mem-memcpy-x86-64-lib.c
create mode 120000 tools/testing/selftests/powerpc/copyloops/copy_mc_64.S
delete mode 120000 tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S
base-commit: 229aaa8c059f2c908e0561453509f996f2b2d5c4
[RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
by Shiyang Ruan
This patchset is an attempt to resolve the shared 'page cache' problem
for fsdax.
In order to track multiple mappings and indexes on one page, I
introduced a dax-rmap rb-tree to manage the relationship. A dax entry
will be associated more than once if it is shared. The second time we
associate this entry, we create the rb-tree and store its root in
page->private (not used in fsdax). We insert (->mapping, ->index) in
dax_associate_entry() and delete it in dax_disassociate_entry().
We can iterate the dax-rmap rb-tree before any other operations on the
mappings of files, such as memory-failure and rmap.
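The bookkeeping described above can be sketched in userspace C. This is only a rough model with hypothetical names (model_page, model_owner, model_associate(), model_disassociate()). It keeps the (mapping, index) pairs in a small sorted array instead of an rb-tree, and it does not model the lazy creation of the tree on the second association.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_OWNERS 16

/* One (mapping, index) pair that shares the page. */
struct model_owner {
	uintptr_t mapping;   /* stand-in for a struct address_space pointer */
	unsigned long index; /* page offset within that mapping */
};

/* Stand-in for the per-page root stored in page->private. */
struct model_page {
	struct model_owner owners[MODEL_MAX_OWNERS]; /* sorted by (mapping, index) */
	size_t nr;
};

static int model_cmp(const struct model_owner *a, const struct model_owner *b)
{
	if (a->mapping != b->mapping)
		return a->mapping < b->mapping ? -1 : 1;
	if (a->index != b->index)
		return a->index < b->index ? -1 : 1;
	return 0;
}

/* Insert a pair; returns 0 on success, -1 on duplicate or when full. */
static int model_associate(struct model_page *page, uintptr_t mapping,
			   unsigned long index)
{
	struct model_owner key = { mapping, index };
	size_t pos = 0;

	while (pos < page->nr && model_cmp(&page->owners[pos], &key) < 0)
		pos++;
	if (pos < page->nr && model_cmp(&page->owners[pos], &key) == 0)
		return -1;
	if (page->nr == MODEL_MAX_OWNERS)
		return -1;
	for (size_t i = page->nr; i > pos; i--)
		page->owners[i] = page->owners[i - 1];
	page->owners[pos] = key;
	page->nr++;
	return 0;
}

/* Remove a pair; returns 0 on success, -1 if it was not present. */
static int model_disassociate(struct model_page *page, uintptr_t mapping,
			      unsigned long index)
{
	struct model_owner key = { mapping, index };

	for (size_t pos = 0; pos < page->nr; pos++) {
		if (model_cmp(&page->owners[pos], &key) != 0)
			continue;
		for (size_t i = pos; i + 1 < page->nr; i++)
			page->owners[i] = page->owners[i + 1];
		page->nr--;
		return 0;
	}
	return -1;
}
```

A consumer such as memory-failure or rmap would then walk owners[0] through owners[nr - 1] and act on each mapping in turn, which is the role the rb-tree iteration plays in the patchset.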
As before, I borrowed and made some changes to Goldwyn's patchsets.
These patches make up for the lack of a CoW mechanism in fsdax.
The rest are dax & reflink support for xfs.
(Rebased to 5.7-rc2)
Shiyang Ruan (8):
fs/dax: Introduce dax-rmap btree for reflink
mm: add dax-rmap for memory-failure and rmap
fs/dax: Introduce dax_copy_edges() for COW
fs/dax: copy data before write
fs/dax: replace mmap entry in case of CoW
fs/dax: dedup file range to use a compare function
fs/xfs: handle CoW for fsdax write() path
fs/xfs: support dedupe for fsdax
fs/dax.c | 343 +++++++++++++++++++++++++++++++++++++----
fs/ocfs2/file.c | 2 +-
fs/read_write.c | 11 +-
fs/xfs/xfs_bmap_util.c | 6 +-
fs/xfs/xfs_file.c | 10 +-
fs/xfs/xfs_iomap.c | 3 +-
fs/xfs/xfs_iops.c | 11 +-
fs/xfs/xfs_reflink.c | 79 ++++++----
include/linux/dax.h | 11 ++
include/linux/fs.h | 9 +-
mm/memory-failure.c | 63 ++++++--
mm/rmap.c | 54 +++++--
12 files changed, 498 insertions(+), 104 deletions(-)
--
2.26.2