[LSF/MM TOPIC] The end of the DAX experiment
by Dan Williams
Before people get too excited this isn't a proposal to kill DAX. The
topic proposal is a discussion to resolve lingering open questions
that currently motivate ext4 and xfs to scream "EXPERIMENTAL" when the
current DAX facilities are enabled. The are 2 primary concerns to
resolve. Enumerate the remaining features/fixes, and identify a path
to implement it all without regressing any existing application use
cases.
An enumeration of remaining projects follows, please expand this list
if I missed something:
* "DAX" has no specific meaning by itself, users have 2 use cases for
"DAX" capabilities: userspace cache management via MAP_SYNC, and page
cache avoidance where the latter aspect of DAX has no current api to
discover / use it. The project is to supplement MAP_SYNC with a
MAP_DIRECT facility and MADV_SYNC / MADV_DIRECT to indicate the same
dynamically via madvise. Similar to O_DIRECT, MAP_DIRECT would be an
application hint to avoid / minimiize page cache usage, but no strict
guarantee like what MAP_SYNC provides.
* Resolve all "if (dax) goto fail;" patterns in the kernel. Outside of
longterm-GUP (a topic in its own right) the projects here are
XFS-reflink and XFS-realtime-device support. DAX+reflink effectively
requires a given physical page to be mapped into two different inodes
at different (page->index) offsets. The challenge is to support
DAX-reflink without violating any existing application visible
semantics, the operating assumption / strawman to debate is that
experimental status is not blanket permission to go change existing
semantics in backwards incompatible ways.
* Deprecate, but not remove, the DAX mount option. Too many flows
depend on the option so it will never go away, but the facility is too
coarse. Provide an option to enable MAP_SYNC and
more-likely-to-do-something-useful-MAP_DIRECT on a per-directory
basis. The current proposal is to allow this property to only be
toggled while the directory is empty to avoid the complications of
racing page invalidation with new DAX mappings.
Secondary projects, i.e. important but I would submit are not in the
critical path to removing the "experimental" designation:
* Filesystem-integrated badblock management. Hook up the media error
notifications from libnvdimm to the filesystem to allow for operations
like "list files with media errors" and "enumerate bad file offsets on
a granulatiy smaller than a page". Another consideration along these
lines is to integrate machine-check-handling and dynamic error
notification into a filesystem interface. I've heard complaints that
the sigaction() based mechanism to receive BUS_MCEERR_* information,
while sufficient for the "System RAM" use case, is not precise enough
for the "Persistent Memory / DAX" use case where errors are repairable
and sub-page error information is useful.
* Userfaultfd for file-backed mappings and DAX
Ideally all the usual DAX, persistent memory, and GUP suspects could
be in the room to discuss this:
* Jan Kara
* Dave Chinner
* Christoph Hellwig
* Jeff Moyer
* Johannes Thumshirn
* Matthew Wilcox
* John Hubbard
* Jérôme Glisse
* MM folks for the reflink vs 'struct page' vs Xarray considerations
2 years, 5 months
[RFC v3 00/19] kunit: introduce KUnit, the Linux kernel unit testing framework
by Brendan Higgins
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
and does not require tests to be written in userspace running on a host
kernel. Additionally, KUnit is fast: From invocation to completion KUnit
can run several dozen tests in under a second. Currently, the entire
KUnit test suite for KUnit runs in under a second from the initial
invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitudes faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily solving the classic problem
of difficulty in exercising error handling code.
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here:
https://google.github.io/kunit-docs/third_party/kernel/docs/
Additionally for convenience, I have applied these patches to a branch:
https://kunit.googlesource.com/linux/+/kunit/rfc/4.19/v3
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/4.19/v3 branch.
## Changes Since Last Version
- Changed namespace prefix from `test_*` to `kunit_*` as requested by
Shuah.
- Started converting/cleaning up the device tree unittest to use KUnit.
- Started adding KUnit expectations with custom messages.
--
2.20.0.rc0.387.gc7a69e6b6c-goog
2 years, 8 months
[PATCH] libnvdimm, namespace: check nsblk->uuid immediately after its allocation
by Wei Yang
When creating nd_namespace_blk, its uuid is copied from nd_label->uuid.
In case the memory allocation fails, it goes to the error branch.
This check is better to be done immediately after memory allocation,
while current implementation does this after assigning claim_class.
This patch moves the check immediately after uuid allocation.
Signed-off-by: Wei Yang <richardw.yang(a)linux.intel.com>
---
drivers/nvdimm/namespace_devs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 681af3a8fd62..9471b9ca04f5 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -2240,11 +2240,11 @@ static struct device *create_namespace_blk(struct nd_region *nd_region,
nsblk->lbasize = __le64_to_cpu(nd_label->lbasize);
nsblk->uuid = kmemdup(nd_label->uuid, NSLABEL_UUID_LEN,
GFP_KERNEL);
+ if (!nsblk->uuid)
+ goto blk_err;
if (namespace_label_has(ndd, abstraction_guid))
nsblk->common.claim_class
= to_nvdimm_cclass(&nd_label->abstraction_guid);
- if (!nsblk->uuid)
- goto blk_err;
memcpy(name, nd_label->name, NSLABEL_NAME_LEN);
if (name[0])
nsblk->alt_name = kmemdup(name, NSLABEL_NAME_LEN,
--
2.19.1
2 years, 10 months
[PATCH v4 00/18] btrfs dax support
by Goldwyn Rodrigues
This patch set adds support for dax on the BTRFS filesystem.
In order to support for CoW for btrfs, there were changes which had to be
made to the dax handling. The important one is copying blocks into the
same dax device before using them which is performed by iomap
type IOMAP_DAX_COW.
Snapshotting and CoW features are supported (including mmap preservation
across snapshots).
Git: https://github.com/goldwynr/linux/tree/btrfs-dax
Changes since v3:
- Fixed memcpy bug
- used flags for dax_insert_entry instead of bools for dax_insert_entry()
Changes since v2:
- Created a new type IOMAP_DAX_COW as opposed to flag IOMAP_F_COW
- CoW source address is presented in iomap.inline_data
- Split the patches to more elaborate dax/iomap patches
Changes since v1:
- use iomap instead of redoing everything in btrfs
- support for mmap writeprotecting on snapshotting
fs/btrfs/Makefile | 1
fs/btrfs/ctree.h | 38 +++++
fs/btrfs/dax.c | 289 +++++++++++++++++++++++++++++++++++++++++--
fs/btrfs/disk-io.c | 4
fs/btrfs/file.c | 37 ++++-
fs/btrfs/inode.c | 114 ++++++++++++----
fs/btrfs/ioctl.c | 29 +++-
fs/btrfs/send.c | 4
fs/btrfs/super.c | 30 ++++
fs/dax.c | 183 ++++++++++++++++++++++++---
fs/iomap.c | 9 -
fs/ocfs2/file.c | 2
fs/read_write.c | 11 -
fs/xfs/xfs_reflink.c | 2
include/linux/dax.h | 15 +-
include/linux/fs.h | 8 +
include/linux/iomap.h | 7 +
include/trace/events/btrfs.h | 56 ++++++++
18 files changed, 752 insertions(+), 87 deletions(-)
2 years, 11 months
[GIT PULL] device-dax for 5.1: PMEM as RAM
by Dan Williams
Hi Linus, please pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/devdax-for-5.1
...to receive new device-dax infrastructure to allow persistent memory
and other "reserved" / performance differentiated memories, to be
assigned to the core-mm as "System RAM".
While it has soaked in -next with only a simple conflict reported, and
Michal looked at this and said "overall design of this feature makes a
lot of sense to me" [1], it's lacking non-Intel review/ack tags. For
that reason, here's some more commentary on the motivation and
implications:
[1]: https://lore.kernel.org/lkml/20190123170518.GC4087@dhcp22.suse.cz/
Some users want to use persistent memory as additional volatile
memory. They are willing to cope with potential performance
differences, for example between DRAM and 3D Xpoint, and want to use
typical Linux memory management apis rather than a userspace memory
allocator layered over an mmap() of a dax file. The administration
model is to decide how much Persistent Memory (pmem) to use as System
RAM, create a device-dax-mode namespace of that size, and then assign
it to the core-mm. The rationale for device-dax is that it is a
generic memory-mapping driver that can be layered over any "special
purpose" memory, not just pmem. On subsequent boots udev rules can be
used to restore the memory assignment.
One implication of using pmem as RAM is that mlock() no longer keeps
data off persistent media. For this reason it is recommended to enable
NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
at rest. We considered making this recommendation an actively enforced
requirement, but in the end decided to leave it as a distribution /
administrator policy to allow for emulation and test environments that
lack security capable NVDIMMs.
Here is the resolution for the aforementioned conflict:
diff --cc mm/memory_hotplug.c
index a9d5787044e1,b37f3a5c4833..c4f59ac21014
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@@ -102,28 -99,21 +102,24 @@@ u64 max_mem_size = U64_MAX
/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
{
- struct resource *res, *conflict;
+ struct resource *res;
+ unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+ char *resource_name = "System RAM";
+ if (start + size > max_mem_size)
+ return ERR_PTR(-E2BIG);
+
- res = kzalloc(sizeof(struct resource), GFP_KERNEL);
- if (!res)
- return ERR_PTR(-ENOMEM);
-
- res->name = "System RAM";
- res->start = start;
- res->end = start + size - 1;
- res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
- conflict = request_resource_conflict(&iomem_resource, res);
- if (conflict) {
- if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
- pr_debug("Device unaddressable memory block "
- "memory hotplug at %#010llx !\n",
- (unsigned long long)start);
- }
- pr_debug("System RAM resource %pR cannot be added\n", res);
- kfree(res);
+ /*
+ * Request ownership of the new memory range. This might be
+ * a child of an existing resource that was present but
+ * not marked as busy.
+ */
+ res = __request_region(&iomem_resource, start, size,
+ resource_name, flags);
+
+ if (!res) {
+ pr_debug("Unable to reserve System RAM region:
%016llx->%016llx\n",
+ start, start + size);
return ERR_PTR(-EEXIST);
}
return res;
* Note, I'm sending this with Gmail rather than Evolution (which goes
through my local Exchange server) as the latter mangles the message
into something the pr-tracker-bot decides to ignore. As a result,
please forgive white-space damage.
---
The following changes since commit bfeffd155283772bbe78c6a05dec7c0128ee500c:
Linux 5.0-rc1 (2019-01-06 17:08:20 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/devdax-for-5.1
for you to fetch changes up to c221c0b0308fd01d9fb33a16f64d2fd95f8830a4:
device-dax: "Hotplug" persistent memory for use like normal RAM
(2019-02-28 10:41:23 -0800)
----------------------------------------------------------------
device-dax for 5.1
* Replace the /sys/class/dax device model with /sys/bus/dax, and include
a compat driver so distributions can opt-in to the new ABI.
* Allow for an alternative driver for the device-dax address-range
* Introduce the 'kmem' driver to hotplug / assign a device-dax
address-range to the core-mm.
* Arrange for the device-dax target-node to be onlined so that the newly
added memory range can be uniquely referenced by numa apis.
----------------------------------------------------------------
Dan Williams (11):
device-dax: Kill dax_region ida
device-dax: Kill dax_region base
device-dax: Remove multi-resource infrastructure
device-dax: Start defining a dax bus model
device-dax: Introduce bus + driver model
device-dax: Move resource pinning+mapping into the common driver
device-dax: Add support for a dax override driver
device-dax: Add /sys/class/dax backwards compatibility
acpi/nfit, device-dax: Identify differentiated memory with a
unique numa-node
device-dax: Auto-bind device after successful new_id
device-dax: Add a 'target_node' attribute
Dave Hansen (5):
mm/resource: Return real error codes from walk failures
mm/resource: Move HMM pr_debug() deeper into resource code
mm/memory-hotplug: Allow memory resources to be children
mm/resource: Let walk_system_ram_range() search child resources
device-dax: "Hotplug" persistent memory for use like normal RAM
Vishal Verma (1):
device-dax: Add a 'modalias' attribute to DAX 'bus' devices
Documentation/ABI/obsolete/sysfs-class-dax | 22 ++
arch/powerpc/platforms/pseries/papr_scm.c | 1 +
drivers/acpi/nfit/core.c | 8 +-
drivers/acpi/numa.c | 1 +
drivers/base/memory.c | 1 +
drivers/dax/Kconfig | 28 +-
drivers/dax/Makefile | 6 +-
drivers/dax/bus.c | 503 +++++++++++++++++++++++++++++
drivers/dax/bus.h | 61 ++++
drivers/dax/dax-private.h | 34 +-
drivers/dax/dax.h | 18 --
drivers/dax/device-dax.h | 25 --
drivers/dax/device.c | 363 +++++----------------
drivers/dax/kmem.c | 108 +++++++
drivers/dax/pmem.c | 153 ---------
drivers/dax/pmem/Makefile | 7 +
drivers/dax/pmem/compat.c | 73 +++++
drivers/dax/pmem/core.c | 71 ++++
drivers/dax/pmem/pmem.c | 40 +++
drivers/dax/super.c | 41 ++-
drivers/nvdimm/e820.c | 1 +
drivers/nvdimm/nd.h | 2 +-
drivers/nvdimm/of_pmem.c | 1 +
drivers/nvdimm/region_devs.c | 1 +
include/linux/acpi.h | 5 +
include/linux/libnvdimm.h | 1 +
kernel/resource.c | 18 +-
mm/memory_hotplug.c | 33 +-
tools/testing/nvdimm/Kbuild | 7 +-
tools/testing/nvdimm/dax-dev.c | 16 +-
30 files changed, 1112 insertions(+), 537 deletions(-)
create mode 100644 Documentation/ABI/obsolete/sysfs-class-dax
create mode 100644 drivers/dax/bus.c
create mode 100644 drivers/dax/bus.h
delete mode 100644 drivers/dax/dax.h
delete mode 100644 drivers/dax/device-dax.h
create mode 100644 drivers/dax/kmem.c
delete mode 100644 drivers/dax/pmem.c
create mode 100644 drivers/dax/pmem/Makefile
create mode 100644 drivers/dax/pmem/compat.c
create mode 100644 drivers/dax/pmem/core.c
create mode 100644 drivers/dax/pmem/pmem.c
3 years
[PATCH v7 0/6] virtio pmem driver
by Pankaj Gupta
This patch series has implementation for "virtio pmem".
"virtio pmem" is fake persistent memory(nvdimm) in guest
which allows to bypass the guest page cache. This also
implements a VIRTIO based asynchronous flush mechanism.
Sharing guest kernel driver in this patchset with the
changes suggested in v4. Tested with Qemu side device
emulation [6] for virtio-pmem. Documented the impact of
possible page cache side channel attacks with suggested
countermeasures.
Incorporated all the review suggestions.
Details of project idea for 'virtio pmem' flushing interface
is shared [3] & [4].
Implementation is divided into two parts:
New virtio pmem guest driver and qemu code changes for new
virtio pmem paravirtualized device.
1. Guest virtio-pmem kernel driver
---------------------------------
- Reads persistent memory range from paravirt device and
registers with 'nvdimm_bus'.
- 'nvdimm/pmem' driver uses this information to allocate
persistent memory region and setup filesystem operations
to the allocated memory.
- virtio pmem driver implements asynchronous flushing
interface to flush from guest to host.
2. Qemu virtio-pmem device
---------------------------------
- Creates virtio pmem device and exposes a memory range to
KVM guest.
- At host side this is file backed memory which acts as
persistent memory.
- Qemu side flush uses aio thread pool API's and virtio
for asynchronous guest multi request handling.
David Hildenbrand CCed also posted a modified version[7] of
qemu virtio-pmem code based on updated Qemu memory device API.
Virtio-pmem security implications and countermeasures:
-----------------------------------------------------
In previous posting of kernel driver, there was discussion [9]
on possible implications of page cache side channel attacks with
virtio pmem. After thorough analysis of details of known side
channel attacks, below are the suggestions:
- Depends entirely on how host backing image file is mapped
into guest address space.
- virtio-pmem device emulation, by default shared mapping is used
to map host backing file. It is recommended to use separate
backing file at host side for every guest. This will prevent
any possibility of executing common code from multiple guests
and any chance of inferring guest local data based based on
execution time.
- If backing file is required to be shared among multiple guests
it is recommended to don't support host page cache eviction
commands from the guest driver. This will avoid any possibility
of inferring guest local data or host data from another guest.
- Proposed device specification [8] for virtio-pmem device with
details of possible security implications and suggested
countermeasures for device emulation.
Virtio-pmem errors handling:
----------------------------------------
Checked behaviour of virtio-pmem for below types of errors
Need suggestions on expected behaviour for handling these errors?
- Hardware Errors: Uncorrectable recoverable Errors:
a] virtio-pmem:
- As per current logic if error page belongs to Qemu process,
host MCE handler isolates(hwpoison) that page and send SIGBUS.
Qemu SIGBUS handler injects exception to KVM guest.
- KVM guest then isolates the page and send SIGBUS to guest
userspace process which has mapped the page.
b] Existing implementation for ACPI pmem driver:
- Handles such errors with MCE notifier and creates a list
of bad blocks. Read/direct access DAX operation return EIO
if accessed memory page fall in bad block list.
- It also starts backgound scrubbing.
- Similar functionality can be reused in virtio-pmem with MCE
notifier but without scrubbing(no ACPI/ARS)? Need inputs to
confirm if this behaviour is ok or needs any change?
Changes from PATCH v6: [1]
- Corrected comment format in patch 5 & patch 6. [Dave]
- Changed variable declaration indentation in patch 6 [Darrick]
- Add Reviewed-by tag by 'Jan Kara' in patch 4 & patch 5
Changes from PATCH v5: [2]
Changes suggested in by - [Cornelia, Yuval]
- Remove assignment chaining in virtio driver
- Better error message and remove not required free
- Check nd_region before use
Changes suggested by - [Jan Kara]
- dax_synchronous() for !CONFIG_DAX
- Correct 'daxdev_mapping_supported' comment and non-dax implementation
Changes suggested by - [Dan Williams]
- Pass meaningful flag 'DAXDEV_F_SYNC' to alloc_dax
- Gate nvdimm_flush instead of additional async parameter
- Move block chaining logic to flush callback than common nvdimm_flush
- Use NULL flush callback for generic flush for better readability [Dan, Jan]
- Use virtio device id 27 from 25(already used) - [MST]
Changes from PATCH v4:
- Factor out MAP_SYNC supported functionality to a common helper
[Dave, Darrick, Jan]
- Comment, indentation and virtqueue_kick failure handle - Yuval Shaia
Changes from PATCH v3:
- Use generic dax_synchronous() helper to check for DAXDEV_SYNC
flag - [Dan, Darrick, Jan]
- Add 'is_nvdimm_async' function
- Document page cache side channel attacks implications &
countermeasures - [Dave Chinner, Michael]
Changes from PATCH v2:
- Disable MAP_SYNC for ext4 & XFS filesystems - [Dan]
- Use name 'virtio pmem' in place of 'fake dax'
Changes from PATCH v1:
- 0-day build test for build dependency on libnvdimm
Changes suggested by - [Dan Williams]
- Split the driver into two parts virtio & pmem
- Move queuing of async block request to block layer
- Add "sync" parameter in nvdimm_flush function
- Use indirect call for nvdimm_flush
- Don’t move declarations to common global header e.g nd.h
- nvdimm_flush() return 0 or -EIO if it fails
- Teach nsio_rw_bytes() that the flush can fail
- Rename nvdimm_flush() to generic_nvdimm_flush()
- Use 'nd_region->provider_data' for long dereferencing
- Remove virtio_pmem_freeze/restore functions
- Remove BSD license text with SPDX license text
- Add might_sleep() in virtio_pmem_flush - [Luiz]
- Make spin_lock_irqsave() narrow
Changes from RFC v3
- Rebase to latest upstream - Luiz
- Call ndregion->flush in place of nvdimm_flush- Luiz
- kmalloc return check - Luiz
- virtqueue full handling - Stefan
- Don't map entire virtio_pmem_req to device - Stefan
- request leak, correct sizeof req - Stefan
- Move declaration to virtio_pmem.c
Changes from RFC v2:
- Add flush function in the nd_region in place of switching
on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan
Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent
memory and other operations instead of creating an entirely
new block driver.
- Use VIRTIO driver to register memory information with
nvdimm_bus and create region_type accordingly.
- Call VIRTIO flush from existing pmem driver.
Pankaj Gupta (6):
libnvdimm: nd_region flush callback support
virtio-pmem: Add virtio-pmem guest driver
libnvdimm: add nd_region buffered dax_dev flag
dax: check synchronous mapping is supported
ext4: disable map_sync for virtio pmem
xfs: disable map_sync for virtio pmem
[1] https://lkml.org/lkml/2019/4/23/1092
[2] https://lkml.org/lkml/2019/4/10/3
[3] https://www.spinics.net/lists/kvm/msg149761.html
[4] https://www.spinics.net/lists/kvm/msg153095.html
[5] https://lkml.org/lkml/2018/8/31/413
[6] https://marc.info/?l=linux-kernel&m=153572228719237&w=2
[7] https://marc.info/?l=qemu-devel&m=153555721901824&w=2
[8] https://lists.oasis-open.org/archives/virtio-dev/201903/msg00083.html
[9] https://lkml.org/lkml/2019/1/9/1191
drivers/acpi/nfit/core.c | 4 -
drivers/dax/bus.c | 2
drivers/dax/super.c | 13 +++-
drivers/md/dm.c | 3
drivers/nvdimm/claim.c | 6 +
drivers/nvdimm/nd.h | 1
drivers/nvdimm/pmem.c | 16 +++--
drivers/nvdimm/region_devs.c | 33 ++++++++++
drivers/nvdimm/virtio_pmem.c | 114 +++++++++++++++++++++++++++++++++++++
drivers/virtio/Kconfig | 10 +++
drivers/virtio/Makefile | 1
drivers/virtio/pmem.c | 118 +++++++++++++++++++++++++++++++++++++++
fs/ext4/file.c | 10 +--
fs/xfs/xfs_file.c | 9 +-
include/linux/dax.h | 25 +++++++-
include/linux/libnvdimm.h | 9 ++
include/linux/virtio_pmem.h | 60 +++++++++++++++++++
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 10 +++
19 files changed, 420 insertions(+), 25 deletions(-)
3 years
[PATCH ndctl] ndctl, region: return the rc value in region_action
by Yi Zhang
if region_action failed, return the rc valude instead of 0
Signed-off-by: Yi Zhang <yi.zhang(a)redhat.com>
---
ndctl/region.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ndctl/region.c b/ndctl/region.c
index 993b3f8..7945007 100644
--- a/ndctl/region.c
+++ b/ndctl/region.c
@@ -89,7 +89,7 @@ static int region_action(struct ndctl_region *region, enum device_action mode)
break;
}
- return 0;
+ return rc;
}
static int do_xable_region(const char *region_arg, enum device_action mode,
--
2.17.2
3 years
[PATCH v2] drivers/dax: Allow to include DEV_DAX_PMEM as builtin
by Aneesh Kumar K.V
This move the dependency to DEV_DAX_PMEM_COMPAT such that only
if DEV_DAX_PMEM is built as module we can allow the compat support.
This allows to test the new code easily in a emulation setup where we
often build things without module support.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
---
Changes from V1:
* Make sure we only build compat code as module
drivers/dax/Kconfig | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 5ef624fe3934..a59f338f520f 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -23,7 +23,6 @@ config DEV_DAX
config DEV_DAX_PMEM
tristate "PMEM DAX: direct access to persistent memory"
depends on LIBNVDIMM && NVDIMM_DAX && DEV_DAX
- depends on m # until we can kill DEV_DAX_PMEM_COMPAT
default DEV_DAX
help
Support raw access to persistent memory. Note that this
@@ -50,7 +49,7 @@ config DEV_DAX_KMEM
config DEV_DAX_PMEM_COMPAT
tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
- depends on DEV_DAX_PMEM
+ depends on m && DEV_DAX_PMEM=m
default DEV_DAX_PMEM
help
Older versions of the libdaxctl library expect to find all
--
2.20.1
3 years
Hang / zombie process from Xarray page-fault conversion (bisected)
by Dan Williams
Hi Willy,
We're seeing a case where RocksDB hangs and becomes defunct when
trying to kill the process. v4.19 succeeds and v4.20 fails. Robert was
able to bisect this to commit b15cd800682f "dax: Convert page fault
handlers to XArray".
I see some direct usage of xa_index and wonder if there are some more
pmd fixups to do?
Other thoughts?
3 years
[PATCH v6 00/12] mm: Sub-section memory hotplug support
by Dan Williams
Changes since v5 [1]:
- Rebase on next-20190416 and the new 'struct mhp_restrictions'
infrastructure.
- Extend mhp_restrictions to the 'remove' case so the sub-section policy
can be clarified with respect to the memblock-api in a symmetric
manner with the 'add' case.
- Kill is_dev_zone() since cleanups have now made it moot
[1]: https://lwn.net/Articles/783808/
---
The memory hotplug section is an arbitrary / convenient unit for memory
hotplug. 'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace. The section-size constraint, while mostly benign for typical
memory hotplug, has and continues to wreak havoc with 'device-memory'
use cases, persistent memory (pmem) in particular. Recall that pmem uses
devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
'struct page' memmap for pmem. However, it does not use the 'bottom
half' of memory hotplug, i.e. never marks pmem pages online and never
exposes the userspace memblock interface for pmem. This leaves an
opening to redress the section-size constraint.
To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory(). Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes this physical memory alignment of pmem from one boot to the
next. Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.
It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages(). Here is an analysis
of the current design assumptions in the current code and how they are
addressed in the new implementation:
Current design assumptions:
- Sections that describe boot memory (early sections) are never
unplugged / removed.
- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
valid_section() check
- __add_pages() and helper routines assume all operations occur in
PAGES_PER_SECTION units.
- The memblock sysfs interface only comprehends full sections
New design assumptions:
- Sections are instrumented with a sub-section bitmask to track (on x86)
individual 2MB sub-divisions of a 128MB section.
- Partially populated early sections can be extended with additional
sub-sections, and those sub-sections can be removed with
arch_remove_memory(). With this in place we no longer lose usable memory
capacity to padding.
- pfn_valid() is updated to look deeper than valid_section() to also check the
active-sub-section mask. This indication is in the same cacheline as
the valid_section() so the performance impact is expected to be
negligible. So far the lkp robot has not reported any regressions.
- Outside of the core vmemmap population routines which are replaced,
other helper routines like shrink_{zone,pgdat}_span() are updated to
handle the smaller granularity. Core memory hotplug routines that deal
with online memory are not touched.
- The existing memblock sysfs user api guarantees / assumptions are
not touched since this capability is limited to !online
!memblock-sysfs-accessible sections.
Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them. The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt. Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem
ranges with other pmem ranges by default [3]. In short,
devm_memremap_pages() has pushed the venerable section-size constraint
past the breaking point, and the simplicity of section-aligned
arch_add_memory() is no longer tenable.
These patches are exposed to the kbuild robot on my libnvdimm-pending
branch [4], and a preview of the unit test for this functionality is
available on the 'subsection-pending' branch of ndctl [5].
[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@d...
[3]: https://github.com/pmem/ndctl/issues/76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=li...
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c
---
Dan Williams (12):
mm/sparsemem: Introduce struct mem_section_usage
mm/sparsemem: Introduce common definitions for the size and mask of a section
mm/sparsemem: Add helpers track active portions of a section at boot
mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal
mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
mm/hotplug: Add mem-hotplug restrictions for remove_memory()
mm: Kill is_dev_zone() helper
mm/sparsemem: Prepare for sub-section ranges
mm/sparsemem: Support sub-section hotplug
mm/devm_memremap_pages: Enable sub-section remap
libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
libnvdimm/pfn: Stop padding pmem namespaces to section alignment
arch/ia64/mm/init.c | 4
arch/powerpc/mm/mem.c | 5 -
arch/s390/mm/init.c | 2
arch/sh/mm/init.c | 4
arch/x86/mm/init_32.c | 4
arch/x86/mm/init_64.c | 9 +
drivers/nvdimm/dax_devs.c | 2
drivers/nvdimm/pfn.h | 12 -
drivers/nvdimm/pfn_devs.c | 93 +++-------
include/linux/memory_hotplug.h | 12 +
include/linux/mm.h | 4
include/linux/mmzone.h | 72 ++++++--
kernel/memremap.c | 70 +++-----
mm/hmm.c | 2
mm/memory_hotplug.c | 148 +++++++++-------
mm/page_alloc.c | 8 +
mm/sparse-vmemmap.c | 21 ++
mm/sparse.c | 371 +++++++++++++++++++++++++++-------------
18 files changed, 503 insertions(+), 340 deletions(-)
3 years