> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Luck, Tony
> Sent: Wednesday, June 21, 2017 12:54 PM
> To: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Borislav Petkov <bp@suse.de>; Dave Hansen <dave.hansen@intel.com>;
> x86@kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1
> mappings of poison pages

(adding linux-nvdimm list in this reply)
> On Wed, Jun 21, 2017 at 02:12:27AM +0000, Naoya Horiguchi wrote:
> > Shouldn't we have a reverse operation of this to cancel the unmapping
> > when unpoisoning?
> When we have unpoisoning, we can add something. We don't seem to have
> an inverse function for "set_memory_np" that just flips the _PAGE_PRESENT
> bit back on again. But it would be trivial to write a set_memory_pp().
> Since we'd be doing this after the poison has been cleared, we wouldn't
> need to play games with the address. We'd just use:
> set_memory_pp((unsigned long)pfn_to_kaddr(pfn), 1);
Persistent memory does have unpoisoning and would require this inverse
operation - see drivers/nvdimm/pmem.c pmem_clear_poison() and core.c
Robert Elliott, HPE Persistent Memory
When servicing mmap() reads from file holes the current DAX code allocates
a page cache page of all zeroes and places the struct page pointer in the
mapping->page_tree radix tree. This has two major drawbacks:
1) It consumes memory unnecessarily. For every 4k page that is read via a
DAX mmap() over a hole, we allocate a new page cache page. This means that
if you read 1GiB worth of pages, you end up using 1GiB of zeroed memory.
2) The fact that we had to check for both DAX exceptional entries and for
page cache pages in the radix tree made the DAX code more complex.
This series solves these issues by following the lead of the DAX PMD code
and using a common 4k zero page instead. This reduces memory usage for
some workloads, and it also simplifies the code in fs/dax.c, removing about
100 lines of code.
My hope is to have this reviewed and merged in time for v4.13 via the MM
tree, so if you could spare some review cycles I'd be grateful.
Changes since v1:
- Leave vm_insert_mixed() intact with its previous functionality and add
vm_insert_mixed_mkwrite() as a peer so it is more readable/greppable.
Ross Zwisler (3):
mm: add vm_insert_mixed_mkwrite()
dax: relocate dax_load_hole()
dax: use common 4k zero page for dax mmap reads
Documentation/filesystems/dax.txt | 5 +-
fs/dax.c | 265 ++++++++++++--------------------------
fs/ext2/file.c | 25 +---
fs/ext4/file.c | 32 +----
fs/xfs/xfs_file.c | 2 +-
include/linux/dax.h | 13 +-
include/linux/mm.h | 2 +
include/trace/events/fs_dax.h | 2 -
mm/memory.c | 49 ++++++-
9 files changed, 141 insertions(+), 254 deletions(-)
Quoting PATCH 2/2:
To date, the full promise of byte-addressable access to persistent
memory has only been half realized via the filesystem-dax interface. The
current filesystem-dax mechanism allows an application to consume (read)
data from persistent storage at byte-size granularity, bypassing the
full page reads required by traditional storage devices.
Now, for writes, applications still need to contend with
page-granularity dirtying and flushing semantics as well as filesystem
coordination for metadata updates after any mmap write. The current
situation precludes use cases that leverage byte-granularity / in-place
updates to persistent media.
To get around this limitation there are some specialized applications
that are using the device-dax interface to bypass the overhead and
data-safety problems of the current filesystem-dax mmap-write path.
QEMU-KVM is forced to use device-dax to safely pass through persistent
memory to a guest. Some specialized databases are using device-dax
for byte-granularity writes. Outside of those cases, device-dax is
difficult for general purpose persistent memory applications to consume.
There is demand for access to pmem without needing to contend with
special device configuration and other device-dax limitations.
The 'daxfile' interface satisfies this demand and realizes one of Dave
Chinner's ideas for allowing pmem applications to safely bypass
fsync/msync requirements. The idea is to make the file immutable with
respect to the offset-to-block mappings for every extent in the file.
It turns out that filesystems already need to make this guarantee
today: the same property is required for files marked as swap files.
The new daxctl() syscall manages setting a file into 'static-dax' mode
whereby it arranges for the file to be treated as a swapfile as far as
the filesystem is concerned, but not registered with the core-mm as
swapfile space. A file in this mode is then safe to be mapped and
written without the requirement to fsync/msync the writes. The cpu
cache management for flushing data to persistence can be handled
completely in userspace.
As can be seen in the patches there are still some TODOs to resolve in
the code, but this otherwise appears to solve the problem of persistent
memory applications needing to coordinate any and all writes to a file
mapping with fsync/msync.
Dan Williams (2):
mm: introduce bmap_walk()
mm, fs: daxfile, an interface for byte-addressable updates to pmem
arch/x86/entry/syscalls/syscall_64.tbl | 1
include/linux/dax.h | 9 ++
include/linux/fs.h | 3 +
include/linux/syscalls.h | 1
include/uapi/linux/dax.h | 8 +
mm/Kconfig | 5 +
mm/Makefile | 1
mm/daxfile.c | 186 ++++++++++++++++++++++++++++++++
mm/page_io.c | 117 +++++++++++++++++---
9 files changed, 312 insertions(+), 19 deletions(-)
create mode 100644 include/uapi/linux/dax.h
create mode 100644 mm/daxfile.c