[RFC v4 0/1] PMEM device emulation without nfit dependency
by Santosh Sivaraj
The current test module cannot be used for testing (make check) on platforms
that do not have support for NFIT. In order to get the ndctl tests working,
we need a module which can emulate NVDIMM devices without relying on
ACPI/NFIT.
The emulated PMEM device is made part of the PAPR family.
Corresponding changes to ndctl are also required, to add the attributes needed
for the test; they will be sent as a reply to this patch.
The following are the test results, run on an x86 guest:
PASS: libndctl
PASS: dsm-fail
PASS: dpa-alloc
PASS: parent-uuid
PASS: multi-pmem
PASS: create.sh
FAIL: clear.sh
FAIL: pmem-errors.sh
FAIL: daxdev-errors.sh
PASS: multi-dax.sh
PASS: btt-check.sh
FAIL: label-compat.sh
PASS: blk-exhaust.sh
PASS: sector-mode.sh
FAIL: inject-error.sh
SKIP: btt-errors.sh
PASS: hugetlb
PASS: btt-pad-compat.sh
SKIP: firmware-update.sh
FAIL: ack-shutdown-count-set
PASS: rescan-partitions.sh
FAIL: inject-smart.sh
FAIL: monitor.sh
PASS: max_available_extent_ns.sh
FAIL: pfn-meta-errors.sh
PASS: track-uuid.sh
============================================================================
Testsuite summary for ndctl 70.6.ge117f22
============================================================================
# TOTAL: 26
# PASS: 15
# SKIP: 2
# XFAIL: 0
# FAIL: 9
# XPASS: 0
# ERROR: 0
The following are the test results from a PowerPC64 guest:
PASS: libndctl
PASS: dsm-fail
PASS: dpa-alloc
PASS: parent-uuid
PASS: multi-pmem
PASS: create.sh
FAIL: clear.sh
FAIL: pmem-errors.sh
FAIL: daxdev-errors.sh
PASS: multi-dax.sh
PASS: btt-check.sh
FAIL: label-compat.sh
PASS: blk-exhaust.sh
PASS: sector-mode.sh
FAIL: inject-error.sh
SKIP: btt-errors.sh
SKIP: hugetlb
PASS: btt-pad-compat.sh
SKIP: firmware-update.sh
FAIL: ack-shutdown-count-set
PASS: rescan-partitions.sh
FAIL: inject-smart.sh
FAIL: monitor.sh
PASS: max_available_extent_ns.sh
FAIL: pfn-meta-errors.sh
PASS: track-uuid.sh
============================================================================
Testsuite summary for ndctl 70.gite29f6c57
============================================================================
# TOTAL: 26
# PASS: 14
# SKIP: 3
# XFAIL: 0
# FAIL: 9
# XPASS: 0
# ERROR: 0
Error injection tests and SMART are not yet implemented. To run the tests,
disable CONFIG_ACPI_NFIT; the kernel will then pick the new driver.
Santosh Sivaraj (1):
testing/nvdimm: Add test module for non-nfit platforms
drivers/nvdimm/region_devs.c | 5 +
tools/testing/nvdimm/config_check.c | 3 +-
tools/testing/nvdimm/test/Kbuild | 6 +-
tools/testing/nvdimm/test/ndtest.c | 1174 +++++++++++++++++++++++++++
tools/testing/nvdimm/test/ndtest.h | 84 ++
5 files changed, 1270 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/nvdimm/test/ndtest.c
create mode 100644 tools/testing/nvdimm/test/ndtest.h
--
2.26.2
[RFC][PATCH 1/2] libnvdimm: Introduce ND_CMD_GET_STAT to retrieve nvdimm statistics
by Vaibhav Jain
Implement support for exposing generic nvdimm statistics via the newly
introduced dimm command ND_CMD_GET_STAT, which can be handled by the nvdimm
command handler function to provide values for these statistics back
to libnvdimm. The following generic nvdimm statistics are defined as an
enumeration in 'uapi/ndctl.h':
* "media_reads" : Number of media reads that have occurred since reboot.
* "media_writes" : Number of media writes that have occurred since reboot.
* "read_requests" : Number of read requests that have occurred since reboot.
* "write_requests" : Number of write requests that have occurred since reboot.
* "total_media_reads" : Total number of media reads that have occurred.
* "total_media_writes" : Total number of media writes that have occurred.
* "total_read_requests" : Total number of read requests that have occurred.
* "total_write_requests" : Total number of write requests that have occurred.
Apart from the ND_CMD_GET_STAT ioctl, these nvdimm statistics are also
exposed via the sysfs '<nvdimm-device>/stats' directory for easy user-space
access, as below:
/sys/class/nd/ndctl0/device/nmem0/stats # tail -n +1 *
==> media_reads <==
252197707602
==> media_writes <==
20684685172
==> read_requests <==
658810924962
==> write_requests <==
404464081574
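The ioctl path can be exercised directly as well. The sketch below is
illustrative only: the patch does not add an ND_IOCTL_GET_STAT define, so
the ioctl number is constructed by hand here, assuming the standard
ND_IOCTL ('N') scheme used by the other dimm commands in 'uapi/ndctl.h',
and assuming /dev/nmem0 as the dimm device node:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ndctl.h>

int main(void)
{
	/* assumed ioctl number; ND_IOCTL_GET_STAT is not defined by the patch */
	unsigned long request = _IOWR(ND_IOCTL, ND_CMD_GET_STAT,
				      struct nd_cmd_get_dimm_stat);
	struct nd_cmd_get_dimm_stat stat = {
		.stat_id = ND_DIMM_STAT_MEDIA_READS,
	};
	int fd = open("/dev/nmem0", O_RDWR);

	if (fd < 0)
		return 1;
	if (ioctl(fd, request, &stat) == 0)
		printf("media_reads: %lld\n", (long long)stat.int_val);
	close(fd);
	return 0;
}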
In case a specific nvdimm statistic is not supported, the nvdimm
command handler function can simply return an error (e.g. -ENOENT) for the
request to read that statistic.
The value for a specific nvdimm stat is exchanged via the newly introduced
'struct nd_cmd_get_dimm_stat', which holds a single statistic id and a
union of possible value types. Presently only the '__s64' type of generic
attribute is supported. These attributes are defined in
'nvdimm/dimm_devs.c' via the helper macro 'NVDIMM_STAT_ATTR'.
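For illustration, a bus driver's command handler could service
ND_CMD_GET_STAT along the lines of the minimal sketch below;
read_media_reads_counter() is hypothetical and stands in for whatever
platform mechanism actually fetches the counter:

static int example_ndctl(struct nvdimm_bus_descriptor *nd_desc,
			 struct nvdimm *nvdimm, unsigned int cmd,
			 void *buf, unsigned int buf_len, int *cmd_rc)
{
	struct nd_cmd_get_dimm_stat *stat = buf;

	if (cmd != ND_CMD_GET_STAT || buf_len < sizeof(*stat))
		return -EINVAL;

	switch (stat->stat_id) {
	case ND_DIMM_STAT_MEDIA_READS:
		/* hypothetical platform counter read */
		stat->int_val = read_media_reads_counter(nvdimm);
		*cmd_rc = 0;
		break;
	default:
		/* statistic not supported by this platform */
		*cmd_rc = -ENOENT;
		break;
	}
	/* stat_id is echoed back unchanged so the caller's check passes */
	return 0;
}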
Signed-off-by: Vaibhav Jain <vaibhav(a)linux.ibm.com>
---
drivers/nvdimm/bus.c | 6 ++
drivers/nvdimm/dimm_devs.c | 109 +++++++++++++++++++++++++++++++++++++
drivers/nvdimm/nd.h | 5 ++
include/uapi/linux/ndctl.h | 27 +++++++++
4 files changed, 147 insertions(+)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 2304c6183822..d53564477437 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -794,6 +794,12 @@ static const struct nd_cmd_desc __nd_cmd_dimm_descs[] = {
.out_num = 1,
.out_sizes = { UINT_MAX, },
},
+ [ND_CMD_GET_STAT] = {
+ .in_num = 1,
+ .in_sizes = { sizeof(struct nd_cmd_get_dimm_stat), },
+ .out_num = 1,
+ .out_sizes = { sizeof(struct nd_cmd_get_dimm_stat), },
+ },
};
const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd)
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index b59032e0859b..68aaa294def7 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -555,6 +555,114 @@ static umode_t nvdimm_firmware_visible(struct kobject *kobj, struct attribute *a
return a->mode;
}
+/* Request a dimm stat from the bus driver */
+static int __request_dimm_stat(struct nvdimm_bus *nvdimm_bus,
+ struct nvdimm *dimm, u64 stat_id,
+ s64 *stat_val)
+{
+ struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
+ struct nd_cmd_get_dimm_stat stat = { .stat_id = stat_id };
+ int rc, cmd_rc;
+
+ if (!test_bit(ND_CMD_GET_STAT, &dimm->cmd_mask)) {
+ pr_debug("CMD_GET_STAT not set for bus driver 0x%lx\n",
+ nd_desc->cmd_mask);
+ return -ENOENT;
+ }
+
+ /* Is the requested stat known & does the bus driver support fetching stats? */
+ if (stat_id <= ND_DIMM_STAT_INVALID || stat_id > ND_DIMM_STAT_MAX) {
+ WARN(1, "Unknown stat-id(%llu) requested", stat_id);
+ return -ENOENT;
+ }
+
+ /* Ask bus driver for its stat value */
+ rc = nd_desc->ndctl(nd_desc, dimm, ND_CMD_GET_STAT,
+ &stat, sizeof(stat), &cmd_rc);
+ if (rc || cmd_rc) {
+ pr_debug("Unable to request stat %lld. Error (%d,%d)\n",
+ stat_id, rc, cmd_rc);
+ return rc ? rc : cmd_rc;
+ }
+
+ /* Report an error in case the returned struct doesn't echo back the stat_id */
+ if (stat.stat_id != stat_id) {
+ pr_debug("Invalid statid %llu returned\n", stat.stat_id);
+ return -ENOENT;
+ }
+
+ *stat_val = stat.int_val;
+ return 0;
+}
+
+static ssize_t nvdimm_stat_attr_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct nvdimm_stat_attr *nattr = container_of(attr, typeof(*nattr), attr);
+ struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+ struct nvdimm *nvdimm = to_nvdimm(dev);
+ s64 stat_val;
+ int rc;
+
+ rc = __request_dimm_stat(nvdimm_bus, nvdimm, nattr->stat_id, &stat_val);
+
+ if (rc)
+ return rc;
+
+ return snprintf(buf, PAGE_SIZE, "%lld", stat_val);
+}
+
+static umode_t nvdimm_stats_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+ struct nvdimm_stat_attr *nattr = container_of(a, typeof(*nattr), attr.attr);
+ struct device *dev = container_of(kobj, typeof(*dev), kobj);
+ struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+ struct nvdimm *nvdimm = to_nvdimm(dev);
+ s64 stat_val;
+ int rc;
+
+ /* Request the dimm stat from the bus driver and, if successful, mark the attribute visible */
+ rc = __request_dimm_stat(nvdimm_bus, nvdimm, nattr->stat_id, &stat_val);
+ if (rc)
+ pr_info("Unable to query stat %llu. Error(%d)\n", nattr->stat_id, rc);
+
+ return rc ? 0 : a->mode;
+}
+
+#define NVDIMM_STAT_ATTR(_name, _stat_id) \
+ struct nvdimm_stat_attr nvdimm_stat_attr_##_name = { \
+ .attr = __ATTR(_name, 0400, nvdimm_stat_attr_show, NULL), \
+ .stat_id = _stat_id, \
+ }
+
+static NVDIMM_STAT_ATTR(media_reads, ND_DIMM_STAT_MEDIA_READS);
+static NVDIMM_STAT_ATTR(media_writes, ND_DIMM_STAT_MEDIA_WRITES);
+static NVDIMM_STAT_ATTR(read_requests, ND_DIMM_STAT_READ_REQUESTS);
+static NVDIMM_STAT_ATTR(write_requests, ND_DIMM_STAT_WRITE_REQUESTS);
+static NVDIMM_STAT_ATTR(total_media_reads, ND_DIMM_STAT_TOTAL_MEDIA_READS);
+static NVDIMM_STAT_ATTR(total_media_writes, ND_DIMM_STAT_TOTAL_MEDIA_WRITES);
+static NVDIMM_STAT_ATTR(total_read_requests, ND_DIMM_STAT_TOTAL_READ_REQUESTS);
+static NVDIMM_STAT_ATTR(total_write_requests, ND_DIMM_STAT_TOTAL_WRITE_REQUESTS);
+
+static struct attribute *nvdimm_stats_attributes[] = {
+ &nvdimm_stat_attr_media_reads.attr.attr,
+ &nvdimm_stat_attr_media_writes.attr.attr,
+ &nvdimm_stat_attr_read_requests.attr.attr,
+ &nvdimm_stat_attr_write_requests.attr.attr,
+ &nvdimm_stat_attr_total_media_reads.attr.attr,
+ &nvdimm_stat_attr_total_media_writes.attr.attr,
+ &nvdimm_stat_attr_total_read_requests.attr.attr,
+ &nvdimm_stat_attr_total_write_requests.attr.attr,
+ NULL,
+};
+
+static const struct attribute_group nvdimm_stats_group = {
+ .name = "stats",
+ .attrs = nvdimm_stats_attributes,
+ .is_visible = nvdimm_stats_visible,
+};
+
static const struct attribute_group nvdimm_firmware_attribute_group = {
.name = "firmware",
.attrs = nvdimm_firmware_attributes,
@@ -565,6 +673,7 @@ static const struct attribute_group *nvdimm_attribute_groups[] = {
&nd_device_attribute_group,
&nvdimm_attribute_group,
&nvdimm_firmware_attribute_group,
+ &nvdimm_stats_group,
NULL,
};
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 696b55556d4d..ea9e846ae245 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -223,6 +223,11 @@ enum nd_async_mode {
ND_ASYNC,
};
+struct nvdimm_stat_attr {
+ struct device_attribute attr;
+ u64 stat_id;
+};
+
int nd_integrity_init(struct gendisk *disk, unsigned long meta_size);
void wait_nvdimm_bus_probe_idle(struct device *dev);
void nd_device_register(struct device *dev);
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 8cf1e4884fd5..81b76986b423 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -97,6 +97,31 @@ struct nd_cmd_clear_error {
__u64 cleared;
} __packed;
+/* Various generic dimm stats that can be reported */
+enum {
+ ND_DIMM_STAT_INVALID = 0,
+ ND_DIMM_STAT_MEDIA_READS = 1, /* Media reads after power cycle */
+ ND_DIMM_STAT_MEDIA_WRITES = 2, /* Media writes after power cycle */
+ ND_DIMM_STAT_READ_REQUESTS = 3, /* Read requests after power cycle */
+ ND_DIMM_STAT_WRITE_REQUESTS = 4, /* Write requests after power cycle */
+ ND_DIMM_STAT_TOTAL_MEDIA_READS = 5, /* Total Media Reads */
+ ND_DIMM_STAT_TOTAL_MEDIA_WRITES = 6, /* Total Media Writes */
+ ND_DIMM_STAT_TOTAL_READ_REQUESTS = 7, /* Total Read Requests */
+ ND_DIMM_STAT_TOTAL_WRITE_REQUESTS = 8, /* Total Write Requests */
+ ND_DIMM_STAT_MAX = 8,
+};
+
+struct nd_cmd_get_dimm_stat {
+ /* See enum above for valid values */
+ __u64 stat_id;
+
+ /* Holds a single dimm stat value */
+ union {
+ __s64 int_val;
+ char str_val[120];
+ };
+} __packed;
+
enum {
ND_CMD_IMPLEMENTED = 0,
@@ -117,6 +142,7 @@ enum {
ND_CMD_VENDOR_EFFECT_LOG = 8,
ND_CMD_VENDOR = 9,
ND_CMD_CALL = 10,
+ ND_CMD_GET_STAT = 11,
};
enum {
@@ -151,6 +177,7 @@ static inline const char *nvdimm_cmd_name(unsigned cmd)
case ND_CMD_VENDOR_EFFECT_LOG: return "effect_log";
case ND_CMD_VENDOR: return "vendor";
case ND_CMD_CALL: return "cmd_call";
+ case ND_CMD_GET_STAT: return "get_stat";
default: return "unknown";
}
}
--
2.28.0
[PATCH] libnvdimm/label: Return -ENXIO for no slot in __blk_label_update
by Zhang Qilong
The error code was not set when nd_label_alloc_slot() failed; set it
so that the failure is not masked by a stale return code.
Fixes: 0ba1c634892b3 ("libnvdimm: write blk label set")
Signed-off-by: Zhang Qilong <zhangqilong3(a)huawei.com>
---
drivers/nvdimm/label.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index 47a4828b8b31..05c1f186a6be 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -999,8 +999,10 @@ static int __blk_label_update(struct nd_region *nd_region,
if (is_old_resource(res, old_res_list, old_num_resources))
continue; /* carry-over */
slot = nd_label_alloc_slot(ndd);
- if (slot == UINT_MAX)
+ if (slot == UINT_MAX) {
+ rc = -ENXIO;
goto abort;
+ }
dev_dbg(ndd->dev, "allocated: %d\n", slot);
nd_label = to_label(ndd, slot);
--
2.25.4
mapcount corruption regression
by Dan Williams
Kirill, Willy, compound page experts,
I am seeking some debug ideas about the following splat:
BUG: Bad page state in process lt-pmem-ns pfn:121a12
page:0000000051ef73f7 refcount:0 mapcount:-1024
mapping:0000000000000000 index:0x0 pfn:0x121a12
flags: 0x2ffff800000000()
raw: 002ffff800000000 dead000000000100 0000000000000000 0000000000000000
raw: 0000000000000000 ffff8a6914886b48 00000000fffffbff 0000000000000000
page dumped because: nonzero mapcount
[..]
CPU: 26 PID: 6127 Comm: lt-pmem-ns Tainted: G OE 5.10.0-rc4+ #450
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x8b/0xb0
bad_page.cold+0x63/0x94
free_pcp_prepare+0x224/0x270
free_unref_page+0x18/0xd0
pud_free_pmd_page+0x146/0x160
ioremap_pud_range+0xe3/0x350
ioremap_page_range+0x108/0x160
__ioremap_caller.constprop.0+0x174/0x2b0
? memremap+0x7a/0x110
memremap+0x7a/0x110
devm_memremap+0x53/0xa0
pmem_attach_disk+0x4ed/0x530 [nd_pmem]
It triggers on v5.10-rc4 but not on v5.9, yet the bisect comes up with an
ambiguous result. I've run the bisect 3 times and landed on:
032c7ed95817 Merge tag 'arm64-upstream' of
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
...which does not touch anything near _mapcount. I suspect there is
something unique about the build that determines whether or not the
corruption happens.
The test is a simple namespace creation test that results in a
memremap() / ioremap() over several gigabytes of memory capacity. The
-1024 was interesting because that's the GUP_PIN_COUNTING_BIAS, but
that applies to _refcount; I did not see any questionable changes to how
_mapcount is manipulated post v5.9. The problem should be reproducible by
running:
make -j TESTS="pmem-ns" check
...in qemu-kvm with some virtual pmem defined:
-object memory-backend-file,id=mem1,share,mem-path=${mem}1,size=$((mem_size+label_size))
-device nvdimm,memdev=mem1,id=nv1,label-size=${label_size}
...where ${mem}1 is a 128GB sparse file, $mem_size is 127GB, and
$label_size is 128KB.
Re: [PATCH v3 3/6] mm: support THP migration to device private memory
by Christoph Hellwig
[adding a few of the usual suspects]
On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> There are 4 types of ZONE_DEVICE struct pages:
> MEMORY_DEVICE_PRIVATE, MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_GENERIC, and
> MEMORY_DEVICE_PCI_P2PDMA.
>
> Currently, memremap_pages() allocates struct pages for a physical address range
> with a page_ref_count(page) of one and increments the pgmap->ref per CPU
> reference count by the number of pages created since each ZONE_DEVICE struct
> page has a pointer to the pgmap.
>
> The struct pages are not freed until memunmap_pages() is called which
> calls put_page() which calls put_dev_pagemap() which releases a reference to
> pgmap->ref. memunmap_pages() blocks waiting for pgmap->ref reference count
> to be zero. As far as I can tell, the put_page() in memunmap_pages() has to
> be the *last* put_page() (see MEMORY_DEVICE_PCI_P2PDMA).
> My RFC [1] breaks this put_page() -> put_dev_pagemap() connection so that
> the struct page reference count can go to zero and back to non-zero without
> changing the pgmap->ref reference count.
>
> Q1: Is that safe? Is there some code that depends on put_page() dropping
> the pgmap->ref reference count as part of memunmap_pages()?
> My testing of [1] seems OK but I'm sure there are lots of cases I didn't test.
It should be safe, but the audit you've done is important to make sure
we do not miss anything important.
> MEMORY_DEVICE_PCI_P2PDMA:
> Struct pages are created in pci_p2pdma_add_resource() and represent device
> memory accessible by PCIe bar address space. Memory is allocated with
> pci_alloc_p2pmem() based on a byte length but the gen_pool_alloc_owner()
> call will allocate memory in a minimum of PAGE_SIZE units.
> Reference counting is +1 per *allocation* on the pgmap->ref reference count.
> Note that this is not +1 per page which is what put_page() expects. So
> currently, a get_page()/put_page() works OK because the page reference count
> only goes 1->2 and 2->1. If it went to zero, the pgmap->ref reference count
> would be incorrect if the allocation size was greater than one page.
>
> I see pci_alloc_p2pmem() is called by nvme_alloc_sq_cmds() and
> pci_p2pmem_alloc_sgl() to create a command queue and a struct scatterlist *.
> Looks like sg_page(sg) returns the ZONE_DEVICE struct page of the scatterlist.
> There are a huge number of places sg_page() is called so it is hard to tell
> whether or not get_page()/put_page() is ever called on MEMORY_DEVICE_PCI_P2PDMA
> pages.
Nothing should call get_page/put_page on them, as they are not treated
as refcountable memory. More importantly nothing is allowed to keep
a reference longer than the time of the I/O.
> pci_p2pmem_virt_to_bus() will return the physical address and I guess
> pfn_to_page(physaddr >> PAGE_SHIFT) could return the struct page.
>
> Since there is a clear allocation/free, pci_alloc_p2pmem() can probably be
> modified to increment/decrement the MEMORY_DEVICE_PCI_P2PDMA struct page
> reference count. Or maybe just leave it at one like it is now.
And yes, doing that is probably a sensible safe guard.
> MEMORY_DEVICE_FS_DAX:
> Struct pages are created in pmem_attach_disk() and virtio_fs_setup_dax() with
> an initial reference count of one.
> The problem I see is that there are 3 states that are important:
> a) memory is free and not allocated to any file (page_ref_count() == 0).
> b) memory is allocated to a file and in the page cache (page_ref_count() == 1).
> c) some gup() or I/O has a reference even after calling unmap_mapping_pages()
> (page_ref_count() > 1). ext4_break_layouts() basically waits until the
> page_ref_count() == 1 with put_page() calling wake_up_var(&page->_refcount)
> to wake up ext4_break_layouts().
> The current code doesn't seem to distinguish (a) and (b). If we want to use
> the 0->1 reference count to signal (c), then the page cache would have to hold
> entries with a page_ref_count() == 0, which doesn't match the general page cache
I think the sensible model here is to grab a reference when it is
added to the page cache. That is exactly how normal system memory pages
work.
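To make the three states above concrete, here is a sketch of that model
in C; the helper name is illustrative, not an existing dax.c entry point:

/*
 * Proposed FS-DAX refcount model:
 *   (a) free               : page_ref_count(page) == 0
 *   (b) in the page cache  : page_ref_count(page) == 1 (ref taken on add)
 *   (c) gup or I/O holds it: page_ref_count(page) >  1
 */
static inline bool dax_page_idle(struct page *page)
{
	/* ext4_break_layouts() waits for this to become true */
	return page_ref_count(page) == 1;
}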
[PATCH v13 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
by Mike Rapoport
From: Mike Rapoport <rppt(a)linux.ibm.com>
Hi,
@Andrew, this is based on v5.10-rc2-mmotm-2020-11-07-21-40, I can rebase on
current mmotm if you prefer.
This is an implementation of "secret" mappings backed by a file descriptor.
The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call. The desired protection mode for the
memory is configured using the flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present in
the direct map and will be present only in the page table of the owning mm.
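For illustration, a minimal user of the new call might look like the
sketch below. Passing 0 for flags (default protection mode) and sizing the
descriptor with ftruncate() before mmap() are assumptions based on the
series description and the selftest; __NR_memfd_secret is as wired up by
this series:

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* flags == 0 assumed to select the default protection mode */
	int fd = syscall(__NR_memfd_secret, 0);

	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;

	/* pages mapped here are removed from the kernel direct map */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	p[0] = 's';	/* store the secret */
	munmap(p, 4096);
	close(fd);
	return 0;
}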
Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.
Additionally, in the future the secret mappings may be used as a means to
protect guest memory in a virtual machine host.
For demonstration of secret memory usage we've created a userspace library
https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloa...
that does two things: first, it acts as a preloader for openssl to
redirect all the OPENSSL_malloc calls to secret memory, meaning any secret
keys get automatically protected this way; second, it exposes the API to
the user who needs it. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.
Hiding secret memory mappings behind an anonymous file allows (ab)use of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.
The anonymous file may also be used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.
To limit fragmentation of the direct map to splitting only PUD-size pages,
I've added an amortizing cache of PMD-size pages to each file descriptor
that is used as an allocation pool for the secret memory areas.
As the memory allocated by secretmem becomes unmovable, we use CMA to back
large page caches so that the page allocator won't be surprised by a failing
attempt to migrate these pages.
v13:
* Added Reviewed-by, thanks Catalin and David
* s/mod_node_page_state/mod_lruvec_page_state/ as Shakeel suggested
v12: https://lore.kernel.org/lkml/20201125092208.12544-1-rppt@kernel.org
* Add detection of whether set_direct_map has actual effect on arm64 and bail
out of CMA allocation for secretmem and the memfd_secret() syscall if pages
would not be removed from the direct map
v11: https://lore.kernel.org/lkml/20201124092556.12009-1-rppt@kernel.org
* Drop support for uncached mappings
v10: https://lore.kernel.org/lkml/20201123095432.5860-1-rppt@kernel.org
* Drop changes to arm64 compatibility layer
* Add Roman's Ack for memcg accounting
v9: https://lore.kernel.org/lkml/20201117162932.13649-1-rppt@kernel.org
* Fix build with and without CONFIG_MEMCG
* Update memcg accounting to avoid copying memcg_data, per Roman comments
* Fix issues in secretmem_fault(), thanks Matthew
* Do not wire up syscall in arm64 compatibility layer
Older history:
v8: https://lore.kernel.org/lkml/20201110151444.20662-1-rppt@kernel.org
v7: https://lore.kernel.org/lkml/20201026083752.13267-1-rppt@kernel.org
v6: https://lore.kernel.org/lkml/20200924132904.1391-1-rppt@kernel.org
v5: https://lore.kernel.org/lkml/20200916073539.3552-1-rppt@kernel.org
v4: https://lore.kernel.org/lkml/20200818141554.13945-1-rppt@kernel.org
v3: https://lore.kernel.org/lkml/20200804095035.18778-1-rppt@kernel.org
v2: https://lore.kernel.org/lkml/20200727162935.31714-1-rppt@kernel.org
v1: https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org
Mike Rapoport (10):
mm: add definition of PMD_PAGE_ORDER
mmap: make mlock_future_check() global
set_memory: allow set_direct_map_*_noflush() for multiple pages
set_memory: allow querying whether set_direct_map_*() is actually enabled
mm: introduce memfd_secret system call to create "secret" memory areas
secretmem: use PMD-size pages to amortize direct map fragmentation
secretmem: add memcg accounting
PM: hibernate: disable when there are active secretmem users
arch, mm: wire up memfd_secret system call where relevant
secretmem: test: add basic selftest for memfd_secret(2)
arch/arm64/include/asm/Kbuild | 1 -
arch/arm64/include/asm/cacheflush.h | 6 -
arch/arm64/include/asm/set_memory.h | 17 +
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/arm64/kernel/machine_kexec.c | 1 +
arch/arm64/mm/mmu.c | 6 +-
arch/arm64/mm/pageattr.c | 23 +-
arch/riscv/include/asm/set_memory.h | 4 +-
arch/riscv/include/asm/unistd.h | 1 +
arch/riscv/mm/pageattr.c | 8 +-
arch/x86/Kconfig | 2 +-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/set_memory.h | 4 +-
arch/x86/mm/pat/set_memory.c | 8 +-
fs/dax.c | 11 +-
include/linux/pgtable.h | 3 +
include/linux/secretmem.h | 30 ++
include/linux/set_memory.h | 16 +-
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 6 +-
include/uapi/linux/magic.h | 1 +
kernel/power/hibernate.c | 5 +-
kernel/power/snapshot.c | 4 +-
kernel/sys_ni.c | 2 +
mm/Kconfig | 5 +
mm/Makefile | 1 +
mm/filemap.c | 3 +-
mm/gup.c | 10 +
mm/internal.h | 3 +
mm/mmap.c | 5 +-
mm/secretmem.c | 439 ++++++++++++++++++++++
mm/vmalloc.c | 5 +-
scripts/checksyscalls.sh | 4 +
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +-
tools/testing/selftests/vm/memfd_secret.c | 298 +++++++++++++++
tools/testing/selftests/vm/run_vmtests | 17 +
38 files changed, 906 insertions(+), 51 deletions(-)
create mode 100644 arch/arm64/include/asm/set_memory.h
create mode 100644 include/linux/secretmem.h
create mode 100644 mm/secretmem.c
create mode 100644 tools/testing/selftests/vm/memfd_secret.c
base-commit: 9f8ce377d420db12b19d6a4f636fecbd88a725a5
--
2.28.0
Re: [RFC PATCH] powerpc/papr_scm: Implement scm async flush
by Pankaj Gupta
> The patch implements the SCM async-flush hcall and sets the
> ND_REGION_ASYNC capability when the platform device tree
> has "ibm,async-flush-required" set.
So, you are reusing the existing ND_REGION_ASYNC flag for the
hypercall-based async flush with device tree discovery?
Out of curiosity, does virtio-based flush work on ppc? I was just thinking
whether we could reuse the virtio-based flush present in virtio-pmem. Or is
there anything else we are trying to achieve here?
Thanks,
Pankaj
>
> The below demonstration shows the map_sync behavior when
> ibm,async-flush-required is present in device tree.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master...)
>
> The pmem0 is from nvdimm without async-flush-required,
> and pmem1 is from nvdimm with async-flush-required, mounted as
> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
>
> #./mapsync /mnt1/newfile ----> Without async-flush-required
> #./mapsync /mnt2/newfile ----> With async-flush-required
> Failed to mmap with Operation not supported
>
> Signed-off-by: Shivaprasad G Bhat <sbhat(a)linux.ibm.com>
> ---
> The HCALL semantics are in review, not final.
Any link to the discussion?
>
> Documentation/powerpc/papr_hcalls.rst | 14 ++++++++++
> arch/powerpc/include/asm/hvcall.h | 3 +-
> arch/powerpc/platforms/pseries/papr_scm.c | 39 +++++++++++++++++++++++++++++
> 3 files changed, 55 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/powerpc/papr_hcalls.rst b/Documentation/powerpc/papr_hcalls.rst
> index 48fcf1255a33..cc310814f24c 100644
> --- a/Documentation/powerpc/papr_hcalls.rst
> +++ b/Documentation/powerpc/papr_hcalls.rst
> @@ -275,6 +275,20 @@ Health Bitmap Flags:
> Given a DRC Index collect the performance statistics for NVDIMM and copy them
> to the resultBuffer.
>
> +**H_SCM_ASYNC_FLUSH**
> +
> +| Input: *drcIndex*
> +| Out: *continue-token*
> +| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
> +
> +Given a DRC Index, flush the data to the backend NVDIMM device.
> +
> +The hcall returns H_BUSY when the flush takes a long time and the hcall needs
> +to be issued multiple times in order to be completely serviced. The
> +*continue-token* from the output is to be passed in the argument list of
> +subsequent hcalls to the hypervisor until the hcall is completely serviced,
> +at which point H_SUCCESS is returned by the hypervisor.
> +
> References
> ==========
> .. [1] "Power Architecture Platform Reference"
> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> index c1fbccb04390..4a13074bc782 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -306,7 +306,8 @@
> #define H_SCM_HEALTH 0x400
> #define H_SCM_PERFORMANCE_STATS 0x418
> #define H_RPT_INVALIDATE 0x448
> -#define MAX_HCALL_OPCODE H_RPT_INVALIDATE
> +#define H_SCM_ASYNC_FLUSH 0x4A0
> +#define MAX_HCALL_OPCODE H_SCM_ASYNC_FLUSH
>
> /* Scope args for H_SCM_UNBIND_ALL */
> #define H_UNBIND_SCOPE_ALL (0x1)
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index 835163f54244..1f8c5153cb3d 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -93,6 +93,7 @@ struct papr_scm_priv {
> uint64_t block_size;
> int metadata_size;
> bool is_volatile;
> + bool async_flush_required;
>
> uint64_t bound_addr;
>
> @@ -117,6 +118,38 @@ struct papr_scm_priv {
> size_t stat_buffer_len;
> };
>
> +static int papr_scm_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> +{
> + unsigned long ret[PLPAR_HCALL_BUFSIZE];
> + struct papr_scm_priv *p = nd_region_provider_data(nd_region);
> + int64_t rc;
> + uint64_t token = 0;
> +
> + do {
> + rc = plpar_hcall(H_SCM_ASYNC_FLUSH, ret, p->drc_index, token);
> +
> + /* Check if we are stalled for some time */
> + token = ret[0];
> + if (H_IS_LONG_BUSY(rc)) {
> + msleep(get_longbusy_msecs(rc));
> + rc = H_BUSY;
> + } else if (rc == H_BUSY) {
> + cond_resched();
> + }
> +
> + } while (rc == H_BUSY);
> +
> + if (rc)
> + dev_err(&p->pdev->dev, "flush error: %lld\n", rc);
> + else
> + dev_dbg(&p->pdev->dev, "flush drc 0x%x complete\n",
> + p->drc_index);
> +
> + dev_dbg(&p->pdev->dev, "Flush call complete\n");
> +
> + return rc;
> +}
> +
> static LIST_HEAD(papr_nd_regions);
> static DEFINE_MUTEX(papr_ndr_lock);
>
> @@ -943,6 +976,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
> ndr_desc.num_mappings = 1;
> ndr_desc.nd_set = &p->nd_set;
>
> + if (p->async_flush_required) {
> + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> + ndr_desc.flush = papr_scm_pmem_flush;
> + }
> +
> if (p->is_volatile)
> p->region = nvdimm_volatile_region_create(p->bus, &ndr_desc);
> else {
> @@ -1088,6 +1126,7 @@ static int papr_scm_probe(struct platform_device *pdev)
> p->block_size = block_size;
> p->blocks = blocks;
> p->is_volatile = !of_property_read_bool(dn, "ibm,cache-flush-required");
> + p->async_flush_required = of_property_read_bool(dn, "ibm,async-flush-required");
>
> /* We just need to ensure that set cookies are unique across */
> uuid_parse(uuid_str, (uuid_t *) uuid);
>
[RFC PATCH] powerpc/papr_scm: Implement scm async flush
by Shivaprasad G Bhat
The patch implements the SCM async-flush hcall and sets the
ND_REGION_ASYNC capability when the platform device tree
has "ibm,async-flush-required" set.
The below demonstration shows the map_sync behavior when
ibm,async-flush-required is present in device tree.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master...)
The pmem0 is from nvdimm without async-flush-required,
and pmem1 is from nvdimm with async-flush-required, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
#./mapsync /mnt1/newfile ----> Without async-flush-required
#./mapsync /mnt2/newfile ----> With async-flush-required
Failed to mmap with Operation not supported
Signed-off-by: Shivaprasad G Bhat <sbhat(a)linux.ibm.com>
---
The HCALL semantics are in review, not final.
Documentation/powerpc/papr_hcalls.rst | 14 ++++++++++
arch/powerpc/include/asm/hvcall.h | 3 +-
arch/powerpc/platforms/pseries/papr_scm.c | 39 +++++++++++++++++++++++++++++
3 files changed, 55 insertions(+), 1 deletion(-)
diff --git a/Documentation/powerpc/papr_hcalls.rst b/Documentation/powerpc/papr_hcalls.rst
index 48fcf1255a33..cc310814f24c 100644
--- a/Documentation/powerpc/papr_hcalls.rst
+++ b/Documentation/powerpc/papr_hcalls.rst
@@ -275,6 +275,20 @@ Health Bitmap Flags:
Given a DRC Index collect the performance statistics for NVDIMM and copy them
to the resultBuffer.
+**H_SCM_ASYNC_FLUSH**
+
+| Input: *drcIndex*
+| Out: *continue-token*
+| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
+
+Given a DRC Index, flush the data to the backend NVDIMM device.
+
+The hcall returns H_BUSY when the flush takes a long time and the hcall needs
+to be issued multiple times in order to be completely serviced. The
+*continue-token* from the output is to be passed in the argument list of
+subsequent hcalls to the hypervisor until the hcall is completely serviced,
+at which point H_SUCCESS is returned by the hypervisor.
+
References
==========
.. [1] "Power Architecture Platform Reference"
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index c1fbccb04390..4a13074bc782 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -306,7 +306,8 @@
#define H_SCM_HEALTH 0x400
#define H_SCM_PERFORMANCE_STATS 0x418
#define H_RPT_INVALIDATE 0x448
-#define MAX_HCALL_OPCODE H_RPT_INVALIDATE
+#define H_SCM_ASYNC_FLUSH 0x4A0
+#define MAX_HCALL_OPCODE H_SCM_ASYNC_FLUSH
/* Scope args for H_SCM_UNBIND_ALL */
#define H_UNBIND_SCOPE_ALL (0x1)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 835163f54244..1f8c5153cb3d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -93,6 +93,7 @@ struct papr_scm_priv {
uint64_t block_size;
int metadata_size;
bool is_volatile;
+ bool async_flush_required;
uint64_t bound_addr;
@@ -117,6 +118,38 @@ struct papr_scm_priv {
size_t stat_buffer_len;
};
+static int papr_scm_pmem_flush(struct nd_region *nd_region, struct bio *bio)
+{
+ unsigned long ret[PLPAR_HCALL_BUFSIZE];
+ struct papr_scm_priv *p = nd_region_provider_data(nd_region);
+ int64_t rc;
+ uint64_t token = 0;
+
+ do {
+ rc = plpar_hcall(H_SCM_ASYNC_FLUSH, ret, p->drc_index, token);
+
+ /* Check if we are stalled for some time */
+ token = ret[0];
+ if (H_IS_LONG_BUSY(rc)) {
+ msleep(get_longbusy_msecs(rc));
+ rc = H_BUSY;
+ } else if (rc == H_BUSY) {
+ cond_resched();
+ }
+
+ } while (rc == H_BUSY);
+
+ if (rc)
+ dev_err(&p->pdev->dev, "flush error: %lld\n", rc);
+ else
+ dev_dbg(&p->pdev->dev, "flush drc 0x%x complete\n",
+ p->drc_index);
+
+ dev_dbg(&p->pdev->dev, "Flush call complete\n");
+
+ return rc;
+}
+
static LIST_HEAD(papr_nd_regions);
static DEFINE_MUTEX(papr_ndr_lock);
@@ -943,6 +976,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
ndr_desc.num_mappings = 1;
ndr_desc.nd_set = &p->nd_set;
+ if (p->async_flush_required) {
+ set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
+ ndr_desc.flush = papr_scm_pmem_flush;
+ }
+
if (p->is_volatile)
p->region = nvdimm_volatile_region_create(p->bus, &ndr_desc);
else {
@@ -1088,6 +1126,7 @@ static int papr_scm_probe(struct platform_device *pdev)
p->block_size = block_size;
p->blocks = blocks;
p->is_volatile = !of_property_read_bool(dn, "ibm,cache-flush-required");
+ p->async_flush_required = of_property_read_bool(dn, "ibm,async-flush-required");
/* We just need to ensure that set cookies are unique across */
uuid_parse(uuid_str, (uuid_t *) uuid);
Re: regression from 5.10.0-rc3: BUG: Bad page state in process kworker/41:0 pfn:891066 during fio on devdax
by Jason Gunthorpe
On Mon, Nov 09, 2020 at 01:54:42PM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 09, 2020 at 09:26:19AM -0800, Dan Williams wrote:
> > On Mon, Nov 9, 2020 at 6:12 AM Jason Gunthorpe <jgg(a)ziepe.ca> wrote:
> > >
> > > Wow, this is surprising
> > >
> > > This has been widely backported already, Dan please check??
> > >
> > > I thought pgprot_decrypted was a NOP on most x86 platforms -
> > > sme_me_mask == 0:
> > >
> > > #define __sme_set(x) ((x) | sme_me_mask)
> > > #define __sme_clr(x) ((x) & ~sme_me_mask)
> > >
> > > ??
> > >
> > > Confused how this can be causing DAX issues
> >
> > Does that correctly preserve the "soft" pte bits? Especially
> > PTE_DEVMAP that DAX uses?
> >
> > I'll check...
>
> extern u64 sme_me_mask;
> #define __pgprot(x) ((pgprot_t) { (x) } )
> #define pgprot_val(x) ((x).pgprot)
> #define __sme_clr(x) ((x) & ~sme_me_mask)
> #define pgprot_decrypted(prot) __pgprot(__sme_clr(pgprot_val(prot)))
>
> static inline int io_remap_pfn_range(struct vm_area_struct *vma,
> unsigned long addr, unsigned long pfn,
> unsigned long size, pgprot_t prot)
> {
> return remap_pfn_range(vma, addr, pfn, size, pgprot_decrypted(prot));
> }
>
> Not seeing how that could change the pgprot in any harmful way?
>
> Yi, are you using a platform where sme_me_mask != 0 ?
>
> That code looks clearly like it would only trigger on AMD SME systems,
> is that what you are using?
Can't be, the system is too old:
[ 398.455914] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
I'm at a total loss as to how this change could even do anything on a
non-AMD system, let alone how it intersects in any way with DEVDAX,
which I could not find being used with io_remap_pfn_range().
How confident are you in the bisection?
Jason