[LSF/MM TOPIC] The end of the DAX experiment
by Dan Williams
Before people get too excited this isn't a proposal to kill DAX. The
topic proposal is a discussion to resolve lingering open questions
that currently motivate ext4 and xfs to scream "EXPERIMENTAL" when the
current DAX facilities are enabled. There are 2 primary concerns to
resolve: enumerate the remaining features/fixes, and identify a path
to implement them all without regressing any existing application use
cases.
An enumeration of remaining projects follows, please expand this list
if I missed something:
* "DAX" has no specific meaning by itself, users have 2 use cases for
"DAX" capabilities: userspace cache management via MAP_SYNC, and page
cache avoidance where the latter aspect of DAX has no current api to
discover / use it. The project is to supplement MAP_SYNC with a
MAP_DIRECT facility and MADV_SYNC / MADV_DIRECT to indicate the same
dynamically via madvise. Similar to O_DIRECT, MAP_DIRECT would be an
application hint to avoid / minimize page cache usage, but no strict
guarantee like what MAP_SYNC provides.
* Resolve all "if (dax) goto fail;" patterns in the kernel. Outside of
longterm-GUP (a topic in its own right) the projects here are
XFS-reflink and XFS-realtime-device support. DAX+reflink effectively
requires a given physical page to be mapped into two different inodes
at different (page->index) offsets. The challenge is to support
DAX-reflink without violating any existing application visible
semantics, the operating assumption / strawman to debate is that
experimental status is not blanket permission to go change existing
semantics in backwards incompatible ways.
* Deprecate, but not remove, the DAX mount option. Too many flows
depend on the option so it will never go away, but the facility is too
coarse. Provide an option to enable MAP_SYNC and
more-likely-to-do-something-useful-MAP_DIRECT on a per-directory
basis. The current proposal is to allow this property to only be
toggled while the directory is empty to avoid the complications of
racing page invalidation with new DAX mappings.
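For reference, the MAP_SYNC half of the first item already exists in
mainline. A minimal userspace sketch of how an application might request
it, and fall back when the file cannot provide it, is below. The helper
name map_prefer_sync() is made up for illustration, and MAP_DIRECT is
deliberately absent because it is only a proposal at this point.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers; values match the generic Linux UAPI. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Hypothetical helper: map `len` bytes of `fd` shared read/write.
 * *synced is set to 1 if the kernel honoured MAP_SYNC, and to 0 if we
 * fell back to plain MAP_SHARED (msync()/fsync() are then required for
 * durability). Returns MAP_FAILED on error.
 */
static void *map_prefer_sync(int fd, size_t len, int *synced)
{
	void *p;

	assert(synced != NULL);

	/* MAP_SYNC is only valid together with MAP_SHARED_VALIDATE. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED) {
		*synced = 1;
		return p;
	}

	*synced = 0;
	/* EOPNOTSUPP: file is not DAX capable. EINVAL: kernel too old. */
	if (errno != EOPNOTSUPP && errno != EINVAL)
		return MAP_FAILED;

	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

The caller learns from *synced whether it may rely on CPU cache flushes
alone, which is the discovery step that page cache avoidance currently
lacks.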
Secondary projects, i.e. important but I would submit are not in the
critical path to removing the "experimental" designation:
* Filesystem-integrated badblock management. Hook up the media error
notifications from libnvdimm to the filesystem to allow for operations
like "list files with media errors" and "enumerate bad file offsets on
a granularity smaller than a page". Another consideration along these
lines is to integrate machine-check-handling and dynamic error
notification into a filesystem interface. I've heard complaints that
the sigaction() based mechanism to receive BUS_MCEERR_* information,
while sufficient for the "System RAM" use case, is not precise enough
for the "Persistent Memory / DAX" use case where errors are repairable
and sub-page error information is useful.
* Userfaultfd for file-backed mappings and DAX
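To make the sigaction() complaint in the badblocks item concrete, here is
a rough sketch of what an application gets today. The siginfo fields
(si_code, si_addr, si_addr_lsb) and the BUS_MCEERR_* codes are the
existing interface; struct poison_range and the helper names are
illustrative only. The reported range is a power-of-two block derived
from si_addr_lsb, which in practice tends to be a page or larger, so
sub-page detail is not available this way.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative type: the memory range covered by one SIGBUS report. */
struct poison_range {
	uintptr_t start;
	size_t len;
};

static volatile sig_atomic_t poison_seen;
static struct poison_range last_poison;

/* Expand (address, least significant valid bit) into an aligned range. */
static struct poison_range poison_range_from(uintptr_t addr, unsigned lsb)
{
	struct poison_range r;

	r.len = (size_t)1 << lsb;
	r.start = addr & ~((uintptr_t)r.len - 1);
	return r;
}

/* Only async-signal-safe work here: record the range and return. */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
		last_poison = poison_range_from((uintptr_t)si->si_addr,
						(unsigned)si->si_addr_lsb);
		poison_seen = 1;
	}
}

/* Returns 0 on success, -1 on failure (as sigaction() does). */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

With an lsb of 12 the whole 4 KiB page is reported as bad, even if only
a single cache line is actually poisoned.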
Ideally all the usual DAX, persistent memory, and GUP suspects could
be in the room to discuss this:
* Jan Kara
* Dave Chinner
* Christoph Hellwig
* Jeff Moyer
* Johannes Thumshirn
* Matthew Wilcox
* John Hubbard
* Jérôme Glisse
* MM folks for the reflink vs 'struct page' vs Xarray considerations
1 year, 1 month
[RFC v3 00/19] kunit: introduce KUnit, the Linux kernel unit testing framework
by Brendan Higgins
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
and does not require tests to be written in userspace running on a host
kernel. Additionally, KUnit is fast: From invocation to completion KUnit
can run several dozen tests in under a second. Currently, the entire
KUnit test suite for KUnit runs in under a second from the initial
invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitude faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily, solving the classic problem
of difficulty in exercising error handling code.
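As a rough illustration of the error handling point, here is a plain C
sketch (not KUnit API; every name in it is made up for this example). The
unit under test takes its allocator as a parameter, so a test can swap in
one that always fails and exercise the error path deterministically.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Allocator hook so a test can replace the real dependency. */
typedef void *(*alloc_fn)(size_t size);

/* Unit under test: duplicate a buffer, returning -1 if allocation fails. */
static int dup_buffer(alloc_fn alloc, const void *src, size_t len, void **out)
{
	void *copy = alloc(len);

	if (!copy) {
		*out = NULL;
		return -1;
	}
	memcpy(copy, src, len);
	*out = copy;
	return 0;
}

/* Test double: an allocator that always fails. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/* Returns 0 when both the failure path and the success path behave. */
static int test_dup_buffer(void)
{
	const char src[] = "pmem";
	void *out = (void *)src;	/* non-NULL sentinel */
	int same;

	/* Error path: must report failure and clear the output pointer. */
	if (dup_buffer(failing_alloc, src, sizeof(src), &out) != -1 || out)
		return 1;

	/* Success path: must hand back an identical copy. */
	if (dup_buffer(malloc, src, sizeof(src), &out) != 0)
		return 2;
	same = memcmp(out, src, sizeof(src)) == 0;
	free(out);
	return same ? 0 : 3;
}
```

Nothing outside the test controls the outcome, so a non-zero return
always points at dup_buffer() itself.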
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here:
https://google.github.io/kunit-docs/third_party/kernel/docs/
Additionally for convenience, I have applied these patches to a branch:
https://kunit.googlesource.com/linux/+/kunit/rfc/4.19/v3
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/4.19/v3 branch.
## Changes Since Last Version
- Changed namespace prefix from `test_*` to `kunit_*` as requested by
Shuah.
- Started converting/cleaning up the device tree unittest to use KUnit.
- Started adding KUnit expectations with custom messages.
--
2.20.0.rc0.387.gc7a69e6b6c-goog
1 year, 4 months
Re: Picking 0th namespace if it is idle
by Aneesh Kumar K.V
aneesh.kumar(a)linux.ibm.com (Aneesh Kumar K.V) writes:
> Hi Dan,
>
> With the patch series to mark the namespace disabled if we have a mismatch
> in the pfn superblock, we can end up with namespace0 marked idle/disabled.
>
> I am wondering why we do the below in ndctl.
>
>
> static struct ndctl_namespace *region_get_namespace(struct ndctl_region *region)
> {
> 	struct ndctl_namespace *ndns;
>
> 	/* prefer the 0th namespace if it is idle */
> 	ndctl_namespace_foreach(region, ndns)
> 		if (ndctl_namespace_get_id(ndns) == 0
> 				&& !is_namespace_active(ndns))
> 			return ndns;
> 	return ndctl_region_get_namespace_seed(region);
> }
>
> I have a kernel patch that will create a namespace_seed even if we fail
> to enable a pfn backing device. Something like below
>
> @@ -747,12 +752,23 @@ static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
> }
> }
> if (dev->parent && is_nd_region(dev->parent) && probe) {
> nd_region = to_nd_region(dev->parent);
> nvdimm_bus_lock(dev);
> if (nd_region->ns_seed == dev)
> nd_region_create_ns_seed(nd_region);
> nvdimm_bus_unlock(dev);
> }
> +
> + if (dev->parent && is_nd_region(dev->parent) && !probe && (ret == -EOPNOTSUPP)) {
> + nd_region = to_nd_region(dev->parent);
> + nvdimm_bus_lock(dev);
> + if (nd_region->ns_seed == dev)
> + nd_region_create_ns_seed(nd_region);
> + nvdimm_bus_unlock(dev);
> + }
> +
>
> With that we can end up with something like the below after boot.
> :/sys/bus/nd/devices/region0$ sudo ndctl list -Ni
> [
> {
> "dev":"namespace0.1",
> "mode":"fsdax",
> "map":"mem",
> "size":0,
> "uuid":"00000000-0000-0000-0000-000000000000",
> "state":"disabled"
> },
> {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"mem",
> "size":2147483648,
> "uuid":"094e703b-4bf8-4078-ad42-50bebc03e538",
> "state":"disabled"
> }
> ]
>
> namespace0.0 is the one we failed to initialize due to PAGE_SIZE
> mismatch.
>
> We do have namespace_seed pointing to namespace0.1 correctly. But an ndctl
> create-namespace will pick namespace0.0 even if we have the seed file
> pointing to namespace0.1.
>
>
> I am trying to resolve the issues related to creation of new namespaces
> when we have some namespace marked disabled due to pfn_sb setting
> mismatch.
>
> -aneesh
With that ndctl namespace0.0 selection commented out, we do pick the
right idle namespace.
#ndctl list -Ni
[
{
"dev":"namespace0.1",
"mode":"fsdax",
"map":"mem",
"size":0,
"uuid":"00000000-0000-0000-0000-000000000000",
"state":"disabled"
},
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"mem",
"size":2147483648,
"uuid":"0c31ae4b-b053-43c7-82ff-88574e2585b0",
"state":"disabled"
}
]
after ndctl create-namespace -s 2G -r region0
# ndctl list -Ni
[
{
"dev":"namespace0.2",
"mode":"fsdax",
"map":"mem",
"size":0,
"uuid":"00000000-0000-0000-0000-000000000000",
"state":"disabled"
},
{
"dev":"namespace0.1",
"mode":"fsdax",
"map":"dev",
"size":2130706432,
"uuid":"60970059-9412-4eeb-9e7a-b314585a4da3",
"align":65536,
"blockdev":"pmem0.1",
"supported_alignments":[
65536
]
},
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"mem",
"size":2147483648,
"uuid":"0c31ae4b-b053-43c7-82ff-88574e2585b0",
"state":"disabled"
}
]
1 year, 6 months
[PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines
by Vivek Goyal
Hi,
Here are the RFC patches for V2 of virtio-fs. These patches apply on top
of 5.1 kernel. These patches are also available here.
https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
Patches for V1 were posted here.
https://lwn.net/ml/linux-fsdevel/20181210171318.16998-1-vgoyal@redhat.com/
This is still work in progress. As of now one can pass through a host
directory into a guest and it works reasonably well. The pjdfstests test
suite passes and blogbench runs. But this directory can't be shared
between guests, and the host can't modify files in the directory yet.
That's still TBD.
Posting another version to gather feedback and comments on progress so far.
More information about the project can be found here.
https://virtio-fs.gitlab.io/
Changes from V1
===============
- Various bug fixes
- virtio-fs dax huge page size working, leading to improved performance.
- Fixed kernel automated tests warnings.
- Better handling of shared cache region reporting by virtio device.
Description from V1 posting
---------------------------
Problem Description
===================
We want to be able to take a directory tree on the host and share it with
guest[s]. Our goal is to be able to do it in a fast, consistent and secure
manner. Our primary use case is kata containers, but it should be usable in
other scenarios as well.
Containers may rely on local file system semantics for shared volumes,
read-write mounts that multiple containers access simultaneously. File
system changes must be visible to other containers with the same consistency
expected of a local file system, including mmap MAP_SHARED.
Existing Solutions
==================
We looked at existing solutions and virtio-9p already provides basic shared
file system functionality although does not offer local file system semantics,
causing some workloads and test suites to fail. In addition, virtio-9p
performance has been an issue for Kata Containers and we believe this cannot
be alleviated without major changes that do not fit into the 9P protocol.
Design Overview
===============
With the goal of designing something with better performance and local file
system semantics, a bunch of ideas were proposed.
- Use fuse protocol (instead of 9p) for communication between guest
and host. Guest kernel will be fuse client and a fuse server will
run on host to serve the requests. Benchmark results are encouraging and
show this approach performs well (2x to 8x improvement depending on test
being run).
- For data access inside guest, mmap portion of file in QEMU address
space and guest accesses this memory using dax. That way guest page
cache is bypassed and there is only one copy of data (on host). This
will also enable mmap(MAP_SHARED) between guests.
- For metadata coherency, there is a shared memory region which contains
version number associated with metadata and any guest changing metadata
updates version number and other guests refresh metadata on next
access. This is yet to be implemented.
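Since the version table is not implemented yet, the following is only a
sketch of the idea in the last item, not the planned code. It assumes one
64-bit counter per inode in the shared region; struct cached_meta and the
function names are hypothetical. C11 atomics stand in for whatever
ordering primitives the real guest/host code would use.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-inode state kept privately by each guest (hypothetical layout). */
struct cached_meta {
	uint64_t seen_version;	/* version observed when the cache was filled */
	bool valid;
};

/* Writer side: a guest that changes metadata bumps the shared counter. */
static void meta_changed(_Atomic uint64_t *shared_slot)
{
	atomic_fetch_add_explicit(shared_slot, 1, memory_order_release);
}

/* Reader side: true if cached metadata must be refetched from the host. */
static bool meta_needs_refresh(const struct cached_meta *c,
			       _Atomic uint64_t *shared_slot)
{
	uint64_t now = atomic_load_explicit(shared_slot, memory_order_acquire);

	return !c->valid || c->seen_version != now;
}

/* Called after a refetch to remember which version the cache reflects. */
static void meta_refreshed(struct cached_meta *c, _Atomic uint64_t *shared_slot)
{
	c->seen_version = atomic_load_explicit(shared_slot,
					       memory_order_acquire);
	c->valid = true;
}
```

In the common case where nothing changed, the reader pays for one load
from shared memory and no round trip to the host.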
How virtio-fs differs from existing approaches
==============================================
The unique idea behind virtio-fs is to take advantage of the co-location
of the virtual machine and hypervisor to avoid communication (vmexits).
DAX allows file contents to be accessed without communication with the
hypervisor. The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.
By replacing expensive communication with cheaper shared memory accesses,
we expect to achieve better performance than approaches based on network
file system protocols. In addition, this also makes it easier to achieve
local file system semantics (coherency).
These techniques are not applicable to network file system protocols since
the communications channel is bypassed by taking advantage of shared memory
on a local machine. This is why we decided to build virtio-fs rather than
focus on 9P or NFS.
HOWTO
======
We have put instructions on how to use it here.
https://virtio-fs.gitlab.io/
Caching Modes
=============
Like virtio-9p, different caching modes are supported which determine the
coherency level as well. The “cache=FOO” and “writeback” options control the
level of coherence between the guest and host filesystems. The “shared” option
only has an effect on coherence between virtio-fs filesystem instances
running inside different guests.
- cache=none
metadata, data and pathname lookup are not cached in guest. They are always
fetched from host and any changes are immediately pushed to host.
- cache=always
metadata, data and pathname lookup are cached in guest and never expire.
- cache=auto
metadata and pathname lookup cache expires after a configured amount of time
(default is 1 second). Data is cached while the file is open (close to open
consistency).
- writeback/no_writeback
These options control the writeback strategy. If writeback is disabled,
then normal writes will immediately be synchronized with the host fs. If
writeback is enabled, then writes may be cached in the guest until the file
is closed or an fsync(2) performed. This option has no effect on mmap-ed
writes or writes going through the DAX mechanism.
- shared/no_shared
These options control the use of the shared version table. If shared mode
is enabled then metadata and pathname lookup is cached in guest, but is
refreshed due to changes in another virtio-fs instance.
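The cache= modes above boil down to a simple validity rule for cached
metadata and pathname lookups. A small sketch, with made-up enum and
function names rather than anything from the patches:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three cache= settings. */
enum guest_cache_mode {
	GUEST_CACHE_NONE,	/* cache=none */
	GUEST_CACHE_AUTO,	/* cache=auto */
	GUEST_CACHE_ALWAYS,	/* cache=always */
};

/* Default expiry for cache=auto, per the description above: 1 second. */
#define GUEST_CACHE_AUTO_TIMEOUT_MS 1000u

/*
 * May a cached metadata or lookup entry that is `age_ms` old be used
 * without asking the host again?
 */
static bool meta_cache_usable(enum guest_cache_mode mode, uint64_t age_ms,
			      uint64_t timeout_ms)
{
	switch (mode) {
	case GUEST_CACHE_NONE:
		return false;			/* always refetch */
	case GUEST_CACHE_ALWAYS:
		return true;			/* never expires */
	case GUEST_CACHE_AUTO:
		return age_ms < timeout_ms;	/* expires after timeout */
	}
	return false;				/* unknown mode: be safe */
}
```

Data caching under cache=auto follows open/close rather than a timer, so
it is intentionally not modelled here.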
DAX
===
- dax can be turned on/off when mounting virtio-fs inside guest.
TODO
====
- Implement "cache=shared" option.
- Improve error handling on host. If page fault on host fails, we need
to propagate it into guest.
- Try to fine tune for performance.
- Bug fixes
RESULTS
=======
- pjdfstests are passing (have tried cache=none/auto/always and dax on/off).
https://github.com/pjd/pjdfstest
(one symlink test fails and that seems to be due to xfs on host. Yet to
look into it).
- Ran blogbench and that works too.
Thanks
Vivek
Miklos Szeredi (2):
fuse: delete dentry if timeout is zero
fuse: Use default_file_splice_read for direct IO
Sebastien Boeuf (3):
virtio: Add get_shm_region method
virtio: Implement get_shm_region for PCI transport
virtio: Implement get_shm_region for MMIO transport
Stefan Hajnoczi (10):
fuse: export fuse_end_request()
fuse: export fuse_len_args()
fuse: export fuse_get_unique()
fuse: extract fuse_fill_super_common()
fuse: add fuse_iqueue_ops callbacks
virtio_fs: add skeleton virtio_fs.ko module
dax: remove block device dependencies
fuse, dax: add fuse_conn->dax_dev field
virtio_fs, dax: Set up virtio_fs dax_device
fuse, dax: add DAX mmap support
Vivek Goyal (15):
fuse: Clear setuid bit even in cache=never path
fuse: Export fuse_send_init_request()
fuse: Separate fuse device allocation and installation in fuse_conn
dax: Pass dax_dev to dax_writeback_mapping_range()
fuse: Keep a list of free dax memory ranges
fuse: Introduce setupmapping/removemapping commands
fuse, dax: Implement dax read/write operations
fuse: Define dax address space operations
fuse, dax: Take ->i_mmap_sem lock during dax page fault
fuse: Maintain a list of busy elements
fuse: Add logic to free up a memory range
fuse: Release file in process context
fuse: Reschedule dax free work if too many EAGAIN attempts
fuse: Take inode lock for dax inode truncation
virtio-fs: Do not provide abort interface in fusectl
drivers/dax/super.c | 3 +-
drivers/virtio/virtio_mmio.c | 32 +
drivers/virtio/virtio_pci_modern.c | 108 +++
fs/dax.c | 23 +-
fs/ext2/inode.c | 2 +-
fs/ext4/inode.c | 2 +-
fs/fuse/Kconfig | 11 +
fs/fuse/Makefile | 1 +
fs/fuse/control.c | 4 +-
fs/fuse/cuse.c | 5 +-
fs/fuse/dev.c | 80 +-
fs/fuse/dir.c | 28 +-
fs/fuse/file.c | 953 ++++++++++++++++++++++-
fs/fuse/fuse_i.h | 206 ++++-
fs/fuse/inode.c | 307 ++++++--
fs/fuse/virtio_fs.c | 1129 ++++++++++++++++++++++++++++
fs/splice.c | 3 +-
fs/xfs/xfs_aops.c | 2 +-
include/linux/dax.h | 6 +-
include/linux/fs.h | 2 +
include/linux/virtio_config.h | 17 +
include/uapi/linux/fuse.h | 34 +
include/uapi/linux/virtio_fs.h | 44 ++
include/uapi/linux/virtio_ids.h | 1 +
include/uapi/linux/virtio_mmio.h | 11 +
include/uapi/linux/virtio_pci.h | 10 +
26 files changed, 2875 insertions(+), 149 deletions(-)
create mode 100644 fs/fuse/virtio_fs.c
create mode 100644 include/uapi/linux/virtio_fs.h
--
2.20.1
1 year, 6 months
[RESEND PATCH] nvdimm: fix some compilation warnings
by Qian Cai
Several places (dimm_devs.c, core.c etc) include label.h but only
label.c uses NSINDEX_SIGNATURE, so move its definition to label.c
instead.
In file included from drivers/nvdimm/dimm_devs.c:23:
drivers/nvdimm/label.h:41:19: warning: 'NSINDEX_SIGNATURE' defined but
not used [-Wunused-const-variable=]
The commit d9b83c756953 ("libnvdimm, btt: rework error clearing") left
an unused variable.
drivers/nvdimm/btt.c: In function 'btt_read_pg':
drivers/nvdimm/btt.c:1272:8: warning: variable 'rc' set but not used
[-Wunused-but-set-variable]
Last, some places abuse "/**", which is reserved for kernel-doc comments.
drivers/nvdimm/bus.c:648: warning: cannot understand function prototype:
'struct attribute_group nd_device_attribute_group = '
drivers/nvdimm/bus.c:677: warning: cannot understand function prototype:
'struct attribute_group nd_numa_attribute_group = '
Reviewed-by: Vishal Verma <vishal.l.verma(a)intel.com>
Signed-off-by: Qian Cai <cai(a)lca.pw>
---
drivers/nvdimm/btt.c | 6 ++----
drivers/nvdimm/bus.c | 4 ++--
drivers/nvdimm/label.c | 2 ++
drivers/nvdimm/label.h | 2 --
4 files changed, 6 insertions(+), 8 deletions(-)
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 4671776f5623..9f02a99cfac0 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1269,11 +1269,9 @@ static int btt_read_pg(struct btt *btt, struct bio_integrity_payload *bip,
ret = btt_data_read(arena, page, off, postmap, cur_len);
if (ret) {
- int rc;
-
/* Media error - set the e_flag */
- rc = btt_map_write(arena, premap, postmap, 0, 1,
- NVDIMM_IO_ATOMIC);
+ btt_map_write(arena, premap, postmap, 0, 1,
+ NVDIMM_IO_ATOMIC);
goto out_rtt;
}
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 7ff684159f29..2eb6a6cfe9e4 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -642,7 +642,7 @@ static struct attribute *nd_device_attributes[] = {
NULL,
};
-/**
+/*
* nd_device_attribute_group - generic attributes for all devices on an nd bus
*/
struct attribute_group nd_device_attribute_group = {
@@ -671,7 +671,7 @@ static umode_t nd_numa_attr_visible(struct kobject *kobj, struct attribute *a,
return a->mode;
}
-/**
+/*
* nd_numa_attribute_group - NUMA attributes for all devices on an nd bus
*/
struct attribute_group nd_numa_attribute_group = {
diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index f3d753d3169c..02a51b7775e1 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -25,6 +25,8 @@ static guid_t nvdimm_btt2_guid;
static guid_t nvdimm_pfn_guid;
static guid_t nvdimm_dax_guid;
+static const char NSINDEX_SIGNATURE[] = "NAMESPACE_INDEX\0";
+
static u32 best_seq(u32 a, u32 b)
{
a &= NSINDEX_SEQ_MASK;
diff --git a/drivers/nvdimm/label.h b/drivers/nvdimm/label.h
index e9a2ad3c2150..4bb7add39580 100644
--- a/drivers/nvdimm/label.h
+++ b/drivers/nvdimm/label.h
@@ -38,8 +38,6 @@ enum {
ND_NSINDEX_INIT = 0x1,
};
-static const char NSINDEX_SIGNATURE[] = "NAMESPACE_INDEX\0";
-
/**
* struct nd_namespace_index - label set superblock
* @sig: NAMESPACE_INDEX\0
--
2.20.1 (Apple Git-117)
1 year, 6 months
[PATCH] libnvdimm, namespace: check nsblk->uuid immediately after its allocation
by Wei Yang
When creating nd_namespace_blk, its uuid is copied from nd_label->uuid.
If the memory allocation fails, it goes to the error branch.
This check is better done immediately after the memory allocation,
whereas the current implementation does it after assigning claim_class.
This patch moves the check to immediately after the uuid allocation.
Signed-off-by: Wei Yang <richardw.yang(a)linux.intel.com>
---
drivers/nvdimm/namespace_devs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 681af3a8fd62..9471b9ca04f5 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -2240,11 +2240,11 @@ static struct device *create_namespace_blk(struct nd_region *nd_region,
nsblk->lbasize = __le64_to_cpu(nd_label->lbasize);
nsblk->uuid = kmemdup(nd_label->uuid, NSLABEL_UUID_LEN,
GFP_KERNEL);
+ if (!nsblk->uuid)
+ goto blk_err;
if (namespace_label_has(ndd, abstraction_guid))
nsblk->common.claim_class
= to_nvdimm_cclass(&nd_label->abstraction_guid);
- if (!nsblk->uuid)
- goto blk_err;
memcpy(name, nd_label->name, NSLABEL_NAME_LEN);
if (name[0])
nsblk->alt_name = kmemdup(name, NSLABEL_NAME_LEN,
--
2.19.1
1 year, 7 months
[PATCH v2 0/8] EFI Specific Purpose Memory Support
by Dan Williams
Changes since the initial RFC [1]
* Split the generic detection of the attribute from any policy /
mechanism that leverages the EFI_MEMORY_SP designation (Ard).
* Various cleanups to the lib/memregion implementation (Willy)
* Rebase on v5.2-rc2
* Several fixes resulting from testing with efi_fake_mem and the
work-in-progress patches that add HMAT support to qemu. Details in
patch3 and patch8.
[1]: https://lore.kernel.org/lkml/155440490809.3190322.15060922240602775809.st...
---
The EFI 2.8 Specification [2] introduces the EFI_MEMORY_SP ("specific
purpose") memory attribute. This attribute bit replaces the deprecated
ACPI HMAT "reservation hint" that was introduced in ACPI 6.2 and removed
in ACPI 6.3.
Given the increasing diversity of memory types that might be advertised
to the operating system, there is a need for platform firmware to hint
which memory ranges are free for the OS to use as general purpose memory
and which ranges are intended for application specific usage. For
example, an application with prior knowledge of the platform may expect
to be able to exclusively allocate a precious / limited pool of high
bandwidth memory. Alternatively, for the general purpose case, the
operating system may want to make the memory available on a best effort
basis as a unique numa-node with performance properties reported by
the new CONFIG_HMEM_REPORTING [3] facility.
In support of optionally allowing either application-exclusive or
core-kernel-mm managed access to differentiated memory, claim
EFI_MEMORY_SP ranges for exposure as device-dax instances by default.
Such instances can be directly owned / mapped by a
platform-topology-aware application. Alternatively, with the new kmem
facility [4], the administrator has the option to instead designate that
those memory ranges be hot-added to the core-kernel-mm as a unique
memory numa-node. In short, allow for the decision about what software
agent manages specific-purpose memory to be made at runtime.
The patches are based on the new HMAT+HMEM_REPORTING facilities merged
for v5.2-rc1. The implementation is tested with qemu emulation of HMAT
[5] plus the efi_fake_mem facility for applying the EFI_MEMORY_SP
attribute.
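As a simplified illustration of the default policy described above (not
the code in these patches), the decision for one EFI memory descriptor
can be sketched as follows. The EFI_MEMORY_SP bit value and the
EfiConventionalMemory type number are from the UEFI specification;
struct mem_desc, enum range_policy and classify_range() are made up for
the example.

```c
#include <stdint.h>

/* From the UEFI 2.8 specification. */
#define EFI_CONVENTIONAL_MEMORY	7
#define EFI_MEMORY_WB		0x0000000000000008ULL
#define EFI_MEMORY_SP		0x0000000000040000ULL

/* Cut-down stand-in for an EFI memory descriptor. */
struct mem_desc {
	uint32_t type;
	uint64_t attribute;
};

enum range_policy {
	RANGE_SYSTEM_RAM,	/* general purpose, handed to the page allocator */
	RANGE_DEVICE_DAX,	/* held back for a device-dax instance */
	RANGE_OTHER,		/* not conventional memory; untouched here */
};

/* Default policy: specific-purpose conventional memory goes to device-dax. */
static enum range_policy classify_range(const struct mem_desc *md)
{
	if (md->type != EFI_CONVENTIONAL_MEMORY)
		return RANGE_OTHER;
	if (md->attribute & EFI_MEMORY_SP)
		return RANGE_DEVICE_DAX;
	return RANGE_SYSTEM_RAM;
}
```

The runtime choice mentioned above (device-dax versus hot-add through
kmem) happens after this boot-time classification, so it is not part of
the sketch.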
[2]: https://uefi.org/sites/default/files/resources/UEFI_Spec_2_8_final.pdf
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit...
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit...
[5]: http://patchwork.ozlabs.org/cover/1096737/
---
Dan Williams (8):
acpi: Drop drivers/acpi/hmat/ directory
acpi/hmat: Skip publishing target info for nodes with no online memory
efi: Enumerate EFI_MEMORY_SP
x86, efi: Reserve UEFI 2.8 Specific Purpose Memory for dax
lib/memregion: Uplevel the pmem "region" ida to a global allocator
device-dax: Add a driver for "hmem" devices
acpi/hmat: Register HMAT at device_initcall level
acpi/hmat: Register "specific purpose" memory as an "hmem" device
arch/x86/Kconfig | 20 +++++
arch/x86/boot/compressed/eboot.c | 5 +
arch/x86/boot/compressed/kaslr.c | 2
arch/x86/include/asm/e820/types.h | 9 ++
arch/x86/kernel/e820.c | 9 ++
arch/x86/kernel/setup.c | 1
arch/x86/platform/efi/efi.c | 37 ++++++++-
drivers/acpi/Kconfig | 13 +++
drivers/acpi/Makefile | 2
drivers/acpi/hmat.c | 149 +++++++++++++++++++++++++++++++++----
drivers/acpi/hmat/Kconfig | 11 ---
drivers/acpi/hmat/Makefile | 2
drivers/acpi/numa.c | 15 +++-
drivers/dax/Kconfig | 27 +++++--
drivers/dax/Makefile | 2
drivers/dax/hmem.c | 58 ++++++++++++++
drivers/firmware/efi/efi.c | 5 +
drivers/nvdimm/Kconfig | 1
drivers/nvdimm/core.c | 1
drivers/nvdimm/nd-core.h | 1
drivers/nvdimm/region_devs.c | 13 +--
include/linux/efi.h | 15 ++++
include/linux/ioport.h | 1
include/linux/memblock.h | 7 ++
include/linux/memregion.h | 11 +++
lib/Kconfig | 7 ++
lib/Makefile | 1
lib/memregion.c | 15 ++++
mm/memblock.c | 4 +
29 files changed, 387 insertions(+), 57 deletions(-)
rename drivers/acpi/{hmat/hmat.c => hmat.c} (81%)
delete mode 100644 drivers/acpi/hmat/Kconfig
delete mode 100644 drivers/acpi/hmat/Makefile
create mode 100644 drivers/dax/hmem.c
create mode 100644 include/linux/memregion.h
create mode 100644 lib/memregion.c
1 year, 7 months
[PATCH v4 00/18] kunit: introduce KUnit, the Linux kernel unit testing framework
by Brendan Higgins
## TLDR
A quick follow up to yesterday's revision. I got some feedback that I
wanted to incorporate before anyone else read the update. For this
reason, I will leave a TLDR of the biggest changes since v2.
Biggest things to look out for (since v2):
- KUnit core now outputs results in TAP14.
- Heavily reworked tools/testing/kunit/kunit.py
- Changed how parsing works.
- Added testing.
- Greg, Logan, you might want to re-review this.
- Added documentation on how to use KUnit on non-UML kernels. You can
see the docs rendered here[1].
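For anyone unfamiliar with TAP, the general shape of a version 14 report
is a version line, a plan line, and one "ok" / "not ok" line per test. The
userspace sketch below only illustrates that shape; struct case_result
and format_tap() are hypothetical, and the real KUnit output carries more
detail than this.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record of one finished test case. */
struct case_result {
	const char *name;
	int passed;
};

/*
 * Render results as a minimal TAP version 14 report into buf.
 * Returns the number of bytes written (excluding the NUL), or -1 if the
 * report does not fit in `size` bytes.
 */
static int format_tap(char *buf, size_t size,
		      const struct case_result *res, size_t n)
{
	int off = snprintf(buf, size, "TAP version 14\n1..%zu\n", n);

	for (size_t i = 0; i < n; i++) {
		int len;

		if (off < 0 || (size_t)off >= size)
			return -1;
		len = snprintf(buf + off, size - (size_t)off, "%s %zu - %s\n",
			       res[i].passed ? "ok" : "not ok", i + 1,
			       res[i].name);
		if (len < 0)
			return -1;
		off += len;
	}
	return (off < 0 || (size_t)off >= size) ? -1 : off;
}
```

A line-oriented format like this is what lets a wrapper script parse
results without knowing anything about the tests themselves.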
There is still some discussion going on on the [PATCH v2 00/17] thread,
but I wanted to get some of these updates out before they got too stale
(and too difficult for me to keep track of). I hope no one minds.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in under a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitude faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily, solving the classic problem
of difficulty in exercising error handling code.
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[2].
Additionally for convenience, I have applied these patches to a
branch[3].
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.1/v4 branch.
## Changes Since Last Version
As I mentioned above, there are a significant number of updates since
v2:
- Converted KUnit core to print test results in TAP14 format as
suggested by Greg and Frank.
- Heavily reworked tools/testing/kunit/kunit.py
- Changed how parsing works.
- Added testing.
- Added documentation on how to use KUnit on non-UML kernels. You can
see the docs rendered here[1].
- Added a new set of EXPECTs and ASSERTs for pointer comparison.
- Removed more function indirection as suggested by Logan.
- Added a new patch that adds `kunit_try_catch_throw` to objtool's
noreturn list.
- Fixed a number of minorish issues pointed out by Shuah, Masahiro, and
kbuild bot.
Nevertheless, there are only a couple of minor updates since v3:
- Added more context to the changelog on the objtool patch, as per
Peter's request.
- Moved all KUnit documentation under the Documentation/dev-tools/
directory as per Jonathan's suggestion.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#ku...
[2] https://google.github.io/kunit-docs/third_party/kernel/docs/
[3] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.1/v4
--
2.21.0.1020.gf2820cf01a-goog
1 year, 7 months
[PATCH, RFC 0/2] Share PMDs for FS/DAX on x86
by Larry Bassel
This patchset implements sharing of page table entries pointing
to 2MiB pages (PMDs) for FS/DAX on x86.
Only shared mappings of files (i.e. neither private mappings nor
anonymous pages) are eligible for PMD sharing.
Due to the characteristics of DAX, this code is simpler and
less intrusive than the general case would be.
In our use case (high end Oracle database using DAX/XFS/PMEM/2MiB
pages) there would be significant memory savings.
A future system might have 6 TiB of PMEM on it and
there might be 10000 processes each mapping all of this 6 TiB.
Here the savings would be approximately
(6 TiB / 2 MiB) * 8 bytes (PMD entry size) * 10000 = 240000 MiB (~234 GiB)
(and these page tables themselves would be in non-PMEM (ordinary RAM)).
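The estimate is easy to reproduce. The helper below is just the
arithmetic spelled out, with a made-up function name: 3145728 entries of
8 bytes come to 24 MiB of PMD entries per process, or 240000 MiB across
10000 processes, which is roughly 234 GiB in binary units.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1ULL << 20)
#define GiB (1ULL << 30)
#define TiB (1ULL << 40)

/*
 * Bytes of page table entries needed when `nprocs` processes each map
 * `mapped_bytes` using pages of `huge_page_bytes`, at `entry_bytes` per
 * entry, with no sharing between processes.
 */
static uint64_t pmd_table_bytes(uint64_t mapped_bytes, uint64_t huge_page_bytes,
				uint64_t entry_bytes, uint64_t nprocs)
{
	assert(huge_page_bytes != 0);
	return mapped_bytes / huge_page_bytes * entry_bytes * nprocs;
}
```

Calling pmd_table_bytes(6 * TiB, 2 * MiB, 8, 10000) gives 251658240000
bytes. With sharing, close to all of that except a single copy would be
saved.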
There would also be a reduction in page faults because in
some cases the page fault has already been satisfied and
the page table entry has been filled in (and so the processes
after the first would not take a fault).
The code for detecting whether PMDs can be shared and
the implementation of sharing and unsharing are based
on, but somewhat different from, that in mm/hugetlb.c,
though some of the code from this file could be reused and
thus was made non-static.
Larry Bassel (2):
Add config option to enable FS/DAX PMD sharing.
Implement sharing/unsharing of PMDs for FS/DAX.
arch/x86/Kconfig | 3 ++
include/linux/hugetlb.h | 4 ++
mm/huge_memory.c | 32 ++++++++++++++
mm/hugetlb.c | 21 ++++++++--
mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 163 insertions(+), 5 deletions(-)
--
1.8.3.1
[PATCH v10 0/7] virtio pmem driver
by Pankaj Gupta
This patch series is ready to be merged via the nvdimm tree,
as discussed with Dan. We have acks/reviews on the XFS, EXT4 &
VIRTIO patches. An ack is still needed on the device mapper
change in patch 4.
Mike, can you please review patch 4, which has the change for
dax with device mapper?
Incorporated all the changes suggested in v9. This version
has minor changes in patch 2 (virtio) and does not change the
existing functionality. Kept all the existing reviews/acks.
Jakob (CCed) also tested v9 of the patch series and confirmed
that it works.
---
This patch series implements "virtio pmem".
"virtio pmem" is fake persistent memory (nvdimm) in the guest
which allows bypassing the guest page cache. The series also
implements a VIRTIO-based asynchronous flush mechanism.
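Because the flush goes through the host asynchronously, the series
disables MAP_SYNC for ext4 and XFS on virtio pmem (patches 6 and 7).
From an application's point of view this could look like the sketch
below: try a MAP_SYNC mapping, and if the kernel refuses it, fall back
to a plain shared mapping and use msync()/fsync() for durability. The
helper names are hypothetical, and the fallback flag values are the
x86 uapi values, assumed here only for libc headers that lack them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers (assumption: x86 uapi values). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Hypothetical helper: map the first len bytes of path read/write.
 * MAP_SYNC is tried first; MAP_SHARED_VALIDATE makes the kernel reject
 * the flag instead of silently ignoring it. On refusal, fall back to a
 * plain shared mapping. *is_sync reports which kind was obtained.
 * Returns MAP_FAILED if the file cannot be opened or mapped at all.
 */
static void *map_pmem_file(const char *path, size_t len, int *is_sync)
{
	void *p;
	int fd;

	assert(path && is_sync && len > 0);
	fd = open(path, O_RDWR);
	if (fd < 0)
		return MAP_FAILED;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	*is_sync = (p != MAP_FAILED);
	if (p == MAP_FAILED)
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	close(fd);	/* the mapping stays valid after close */
	return p;
}

/*
 * Make stores durable on a mapping that is not MAP_SYNC.
 * addr must be page aligned, as msync() requires.
 */
static int persist_fallback(void *addr, size_t len)
{
	return msync(addr, len, MS_SYNC);
}
```

On a filesystem backed by virtio pmem the first mmap() is expected to
fail (EOPNOTSUPP per mmap(2)), so is_sync ends up 0 and the caller
has to go through persist_fallback() or fsync().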
This patchset shares the guest kernel driver with the
changes suggested in v4. Tested with Qemu-side device
emulation [6] for virtio-pmem. Documented the impact of
possible page cache side channel attacks with suggested
countermeasures.
Details of the project idea for the 'virtio pmem' flushing
interface are shared in [3] & [4].
Implementation is divided into two parts:
New virtio pmem guest driver and qemu code changes for new
virtio pmem paravirtualized device.
1. Guest virtio-pmem kernel driver
---------------------------------
- Reads persistent memory range from paravirt device and
registers with 'nvdimm_bus'.
- The 'nvdimm/pmem' driver uses this information to allocate
a persistent memory region and set up filesystem operations
on the allocated memory.
- virtio pmem driver implements asynchronous flushing
interface to flush from guest to host.
2. Qemu virtio-pmem device
---------------------------------
- Creates virtio pmem device and exposes a memory range to
KVM guest.
- On the host side this is file-backed memory which acts as
persistent memory.
- The Qemu-side flush uses the aio thread pool APIs and virtio
for asynchronous handling of multiple guest requests.
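The split between a generic flush and a driver-provided one, as
described in the changelog below (a NULL callback means generic flush,
the flush returns 0 or -EIO, and driver state hangs off provider_data),
can be modelled in plain userspace C. This is only a sketch of the
dispatch idea: every name ending in _model is invented for
illustration, and unlike the real virtio path the model is synchronous
and does not talk to any host.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct region_model;
typedef int (*flush_fn)(struct region_model *r);

/* Userspace stand-in for a region; not the kernel's struct nd_region. */
struct region_model {
	flush_fn flush;			/* NULL: use the generic flush */
	void *provider_data;		/* driver-private state */
	unsigned int generic_flushes;	/* counts generic flush calls */
};

/* Stand-in for the generic flush of real persistent memory. */
static int generic_flush_model(struct region_model *r)
{
	r->generic_flushes++;
	return 0;
}

/* Dispatch: driver callback if present, generic flush otherwise. */
static int region_flush_model(struct region_model *r)
{
	assert(r != NULL);
	if (!r->flush)
		return generic_flush_model(r);
	/* Any callback failure is reported to the caller as -EIO. */
	return r->flush(r) ? -EIO : 0;
}

/*
 * Example callback: pretend to ask the host to flush. provider_data
 * points at an int holding the simulated host result (0 = success).
 */
static int host_flush_model(struct region_model *r)
{
	const int *host_result = r->provider_data;

	return *host_result;
}
```

A region with .flush left NULL takes the generic path, while one set
up with host_flush_model reports -EIO whenever the simulated host
result is non-zero, which is what lets callers notice a failed flush.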
David Hildenbrand (CCed) also posted a modified version [7] of the
qemu virtio-pmem code based on the updated Qemu memory device API.
Virtio-pmem security implications and countermeasures:
-----------------------------------------------------
In a previous posting of the kernel driver, there was discussion [9]
on possible implications of page cache side channel attacks with
virtio pmem. After thorough analysis of the details of known side
channel attacks, below are the suggestions:
- The exposure depends entirely on how the host backing image file
is mapped into the guest address space.
- In the virtio-pmem device emulation, a shared mapping is used by
default to map the host backing file. It is recommended to use a
separate backing file on the host side for every guest. This will
prevent any possibility of executing common code from multiple guests
and any chance of inferring guest local data based on
execution time.
- If the backing file is required to be shared among multiple guests,
it is recommended not to support host page cache eviction
commands from the guest driver. This will avoid any possibility
of inferring guest local data or host data from another guest.
- The proposed device specification [8] for the virtio-pmem device
includes details of possible security implications and suggested
countermeasures for device emulation.
Virtio-pmem error handling:
----------------------------------------
Checked the behaviour of virtio-pmem for the types of errors below.
Suggestions are needed on the expected behaviour for handling them.
- Hardware Errors: Uncorrectable recoverable Errors:
a] virtio-pmem:
- As per the current logic, if the error page belongs to the Qemu
process, the host MCE handler isolates (hwpoison) that page and
sends SIGBUS. The Qemu SIGBUS handler injects an exception into
the KVM guest.
- The KVM guest then isolates the page and sends SIGBUS to the guest
userspace process which has mapped the page.
b] Existing implementation for ACPI pmem driver:
- Handles such errors with an MCE notifier and creates a list
of bad blocks. Read/direct access DAX operations return EIO
if the accessed memory page falls in the bad block list.
- It also starts background scrubbing.
- Similar functionality can be reused in virtio-pmem with an MCE
notifier but without scrubbing (no ACPI/ARS)? Need inputs to
confirm whether this behaviour is ok or needs any change.
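The bad block behaviour described under b] amounts to an overlap check
before each access. A minimal userspace sketch follows, assuming a
simple array of byte ranges; the struct and function names are
hypothetical and do not correspond to the kernel's badblocks code,
which tracks sectors rather than byte offsets.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One poisoned range, in bytes from the start of the device. */
struct bad_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Return -EIO if [off, off + len) overlaps any recorded bad range,
 * 0 otherwise. A zero-length access never fails.
 */
static int check_access(const struct bad_range *list, size_t n,
			uint64_t off, uint64_t len)
{
	size_t i;

	assert(list != NULL || n == 0);
	if (len == 0)
		return 0;

	for (i = 0; i < n; i++) {
		uint64_t bad_end = list[i].start + list[i].len;

		/* Two half-open ranges overlap if each starts before the other ends. */
		if (off < bad_end && list[i].start < off + len)
			return -EIO;
	}
	return 0;
}
```

An MCE notifier, as mentioned above, would be the component that adds
entries to such a list; that part is left out of the sketch.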
Changes from PATCH v9: [1]
- Kconfig help text add two spaces - Randy
- Fixed libnvdimm 'bio' include warning - Dan
- virtio-pmem, separate request/resp struct and
move to uapi file with updated license - DavidH
- Use virtio32* type for req/resp endianness - DavidH
- Added tested-by & ack-by of Jakob
- Rebased to 5.2-rc1
Changes from PATCH v8: [2]
- Set device mapper synchronous if all target devices support - Dan
- Move virtio_pmem.h to nvdimm directory - Dan
- Style, indentation & better error messages in patch 2 - DavidH
- Added MST's ack in patch 2.
Changes from PATCH v7:
- Corrected pending request queue logic (patch 2) - Jakub Staroń
- Used unsigned long flags for passing DAXDEV_F_SYNC (patch 3) - Dan
- Fixed typo => vma 'flag' to 'vm_flag' (patch 4)
- Added rob in patch 6 & patch 2
Changes from PATCH v6:
- Corrected comment format in patch 5 & patch 6. [Dave]
- Changed variable declaration indentation in patch 6 [Darrick]
- Add Reviewed-by tag by 'Jan Kara' in patch 4 & patch 5
Changes from PATCH v5:
Changes suggested by - [Cornelia, Yuval]
- Remove assignment chaining in virtio driver
- Better error message and remove not required free
- Check nd_region before use
Changes suggested by - [Jan Kara]
- dax_synchronous() for !CONFIG_DAX
- Correct 'daxdev_mapping_supported' comment and non-dax implementation
Changes suggested by - [Dan Williams]
- Pass meaningful flag 'DAXDEV_F_SYNC' to alloc_dax
- Gate nvdimm_flush instead of additional async parameter
- Move block chaining logic to flush callback rather than common nvdimm_flush
- Use NULL flush callback for generic flush for better readability [Dan, Jan]
- Use virtio device id 27 instead of 25 (already used) - [MST]
Changes from PATCH v4:
- Factor out MAP_SYNC supported functionality to a common helper
[Dave, Darrick, Jan]
- Comment, indentation and virtqueue_kick failure handling - Yuval Shaia
Changes from PATCH v3:
- Use generic dax_synchronous() helper to check for DAXDEV_SYNC
flag - [Dan, Darrick, Jan]
- Add 'is_nvdimm_async' function
- Document page cache side channel attacks implications &
countermeasures - [Dave Chinner, Michael]
Changes from PATCH v2:
- Disable MAP_SYNC for ext4 & XFS filesystems - [Dan]
- Use name 'virtio pmem' in place of 'fake dax'
Changes from PATCH v1:
- 0-day build test for build dependency on libnvdimm
Changes suggested by - [Dan Williams]
- Split the driver into two parts virtio & pmem
- Move queuing of async block request to block layer
- Add "sync" parameter in nvdimm_flush function
- Use indirect call for nvdimm_flush
- Don’t move declarations to common global header e.g. nd.h
- nvdimm_flush() return 0 or -EIO if it fails
- Teach nsio_rw_bytes() that the flush can fail
- Rename nvdimm_flush() to generic_nvdimm_flush()
- Use 'nd_region->provider_data' for long dereferencing
- Remove virtio_pmem_freeze/restore functions
- Replace BSD license text with SPDX license text
- Add might_sleep() in virtio_pmem_flush - [Luiz]
- Make spin_lock_irqsave() narrow
Pankaj Gupta (7):
libnvdimm: nd_region flush callback support
virtio-pmem: Add virtio-pmem guest driver
libnvdimm: add nd_region buffered dax_dev flag
dax: check synchronous mapping is supported
dm: Enable synchronous dax
ext4: disable map_sync for virtio pmem
xfs: disable map_sync for virtio pmem
[1] https://lkml.org/lkml/2019/5/14/465
[2] https://lkml.org/lkml/2019/5/10/447
[3] https://www.spinics.net/lists/kvm/msg149761.html
[4] https://www.spinics.net/lists/kvm/msg153095.html
[5] https://lkml.org/lkml/2018/8/31/413
[6] https://marc.info/?l=linux-kernel&m=153572228719237&w=2
[7] https://marc.info/?l=qemu-devel&m=153555721901824&w=2
[8] https://lists.oasis-open.org/archives/virtio-dev/201903/msg00083.html
[9] https://lkml.org/lkml/2019/1/9/1191
20 files changed, 468 insertions(+), 25 deletions(-)
drivers/acpi/nfit/core.c | 4 -
drivers/dax/bus.c | 2
drivers/dax/super.c | 19 +++++
drivers/md/dm-table.c | 14 ++++
drivers/md/dm.c | 3
drivers/nvdimm/Makefile | 1
drivers/nvdimm/claim.c | 6 +
drivers/nvdimm/nd.h | 1
drivers/nvdimm/nd_virtio.c | 124 +++++++++++++++++++++++++++++++++++++++
drivers/nvdimm/pmem.c | 18 +++--
drivers/nvdimm/region_devs.c | 33 +++++++++-
drivers/nvdimm/virtio_pmem.c | 122 ++++++++++++++++++++++++++++++++++++++
drivers/nvdimm/virtio_pmem.h | 55 +++++++++++++++++
drivers/virtio/Kconfig | 11 +++
fs/ext4/file.c | 10 +--
fs/xfs/xfs_file.c | 9 +-
include/linux/dax.h | 26 +++++++-
include/linux/libnvdimm.h | 10 ++-
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 35 +++++++++++
20 files changed, 479 insertions(+), 25 deletions(-)