[PATCH v2] Documentation: nvdimm: Fix typo
by Shiyang Ruan
Remove the extra 'we '.
Signed-off-by: Shiyang Ruan <ruansy.fnst(a)cn.fujitsu.com>
---
Documentation/nvdimm/nvdimm.txt | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/nvdimm/nvdimm.txt b/Documentation/nvdimm/nvdimm.txt
index e894de69915a..1669f626b037 100644
--- a/Documentation/nvdimm/nvdimm.txt
+++ b/Documentation/nvdimm/nvdimm.txt
@@ -284,8 +284,8 @@ A bus has a 1:1 relationship with an NFIT. The current expectation for
ACPI based systems is that there is only ever one platform-global NFIT.
That said, it is trivial to register multiple NFITs, the specification
does not preclude it. The infrastructure supports multiple busses and
-we we use this capability to test multiple NFIT configurations in the
-unit test.
+we use this capability to test multiple NFIT configurations in the unit
+test.
LIBNVDIMM: control class device in /sys/class
--
2.17.0
[PATCH v3 00/18] kunit: introduce KUnit, the Linux kernel unit testing framework
by Brendan Higgins
## TLDR
I mostly wanted to incorporate feedback I got over the last week and a
half.
Biggest things to look out for:
- KUnit core now outputs results in TAP14.
- Heavily reworked tools/testing/kunit/kunit.py
- Changed how parsing works.
- Added testing.
- Greg, Logan, you might want to re-review this.
- Added documentation on how to use KUnit on non-UML kernels. You can
see the docs rendered here[1].
There is still some discussion going on in the [PATCH v2 00/17] thread,
but I wanted to get some of these updates out before they got too stale
(and too difficult for me to keep track of). I hope no one minds.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in under a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitude faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily, solving the classic problem
of difficulty in exercising error handling code.
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well-tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space, which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[2].
Additionally for convenience, I have applied these patches to a
branch[3].
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.1/v3 branch.
## Changes Since Last Version
- Converted KUnit core to print test results in TAP14 format as
suggested by Greg and Frank.
- Heavily reworked tools/testing/kunit/kunit.py
- Changed how parsing works.
- Added testing.
- Added documentation on how to use KUnit on non-UML kernels. You can
see the docs rendered here[1].
- Added a new set of EXPECTs and ASSERTs for pointer comparison.
- Removed more function indirection as suggested by Logan.
- Added a new patch that adds `kunit_try_catch_throw` to objtool's
noreturn list.
- Fixed a number of minorish issues pointed out by Shuah, Masahiro, and
kbuild bot.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#ku...
[2] https://google.github.io/kunit-docs/third_party/kernel/docs/
[3] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.1/v3
--
2.21.0.1020.gf2820cf01a-goog
[PATCH v8 00/12] mm: Sub-section memory hotplug support
by Dan Williams
Changes since v7 [1]:
- Make subsection helpers pfn based rather than physical-address based
(Oscar and Pavel)
- Make subsection bitmap definition scalable for different section and
sub-section sizes across architectures. As a result:
unsigned long map_active
...is converted to:
DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION)
...and the helpers are renamed with a 'subsection' prefix. (Pavel)
- New in this version is a touch of arch/powerpc/include/asm/sparsemem.h
in "[PATCH v8 01/12] mm/sparsemem: Introduce struct mem_section_usage"
to define ARCH_SUBSECTION_SHIFT.
- Drop "mm/sparsemem: Introduce common definitions for the size and mask
of a section" in favor of Robin's "mm/memremap: Rename and consolidate
SECTION_SIZE" (Pavel)
- Collect some more Reviewed-by tags. Patches that still lack review
tags: 1, 3, 9 - 12
[1]: https://lore.kernel.org/lkml/155677652226.2336373.8700273400832001094.stg...
---
[merge logistics]
Hi Andrew,
These are too late for v5.2; I'm posting this v8 during the merge window
to maintain the review momentum.
---
[cover letter]
The memory hotplug section is an arbitrary / convenient unit for memory
hotplug. 'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace. The section-size constraint, while mostly benign for typical
memory hotplug, has wreaked and continues to wreak havoc with 'device-memory'
use cases, persistent memory (pmem) in particular. Recall that pmem uses
devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
'struct page' memmap for pmem. However, it does not use the 'bottom
half' of memory hotplug, i.e. never marks pmem pages online and never
exposes the userspace memblock interface for pmem. This leaves an
opening to redress the section-size constraint.
To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory(). Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes the physical memory alignment of pmem from one boot to the
next. Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.
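To make the cost of the padding approach concrete, here is a small
illustrative calculation; it is not the libnvdimm code. The helper name
pad_to_align() and the example address are assumptions. It shows how much
capacity is skipped to reach the next 128MB section boundary compared
with a 2MB sub-section boundary.

```c
#include <stdint.h>

#define SZ_2M	(UINT64_C(2) << 20)
#define SZ_128M	(UINT64_C(128) << 20)

/*
 * Bytes to skip so that 'start' becomes a multiple of 'align'.
 * 'align' must be a power of two.
 */
static uint64_t pad_to_align(uint64_t start, uint64_t align)
{
	return (align - (start & (align - 1))) & (align - 1);
}
```

For a hypothetical pmem range starting at 4GB + 2MB (0x100200000),
pad_to_align(start, SZ_128M) skips 126MB, while
pad_to_align(start, SZ_2M) skips nothing.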
It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages(). Here is an analysis
of the current design assumptions in the current code and how they are
addressed in the new implementation:
Current design assumptions:
- Sections that describe boot memory (early sections) are never
unplugged / removed.
- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
valid_section() check
- __add_pages() and helper routines assume all operations occur in
PAGES_PER_SECTION units.
- The memblock sysfs interface only comprehends full sections
New design assumptions:
- Sections are instrumented with a sub-section bitmask to track (on x86)
individual 2MB sub-divisions of a 128MB section.
- Partially populated early sections can be extended with additional
sub-sections, and those sub-sections can be removed with
arch_remove_memory(). With this in place we no longer lose usable memory
capacity to padding.
- pfn_valid() is updated to look deeper than valid_section() to also check the
active-sub-section mask. This indication is in the same cacheline as
the valid_section() so the performance impact is expected to be
negligible. So far the lkp robot has not reported any regressions.
- Outside of the core vmemmap population routines which are replaced,
other helper routines like shrink_{zone,pgdat}_span() are updated to
handle the smaller granularity. Core memory hotplug routines that deal
with online memory are not touched.
- The existing memblock sysfs user api guarantees / assumptions are
not touched since this capability is limited to !online
!memblock-sysfs-accessible sections.
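As a rough illustration of the sub-section tracking described in the
list above, the following userspace sketch models a single section with
a 64-bit bitmap, one bit per 2MB sub-section. It is not the kernel
implementation: the names model_section, model_activate() and
model_pfn_valid() are hypothetical, and the constants assume x86-64 with
4KB pages and 128MB sections.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86-64 geometry: 4KB pages, 2MB sub-sections, 128MB sections. */
#define PAGE_SHIFT		12
#define SUBSECTION_SHIFT	21
#define SECTION_SIZE_BITS	27
#define PFN_SUBSECTION_SHIFT	(SUBSECTION_SHIFT - PAGE_SHIFT)
#define SUBSECTIONS_PER_SECTION	(1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

/* Hypothetical stand-in for a section plus its usage bitmap. */
struct model_section {
	bool valid;			/* models valid_section() */
	uint64_t subsection_map;	/* bit N set: sub-section N is active */
};

static unsigned int pfn_to_subsection_idx(unsigned long pfn)
{
	return (pfn >> PFN_SUBSECTION_SHIFT) & (SUBSECTIONS_PER_SECTION - 1);
}

/*
 * Activate [pfn, pfn + nr_pages). The range must be non-empty and must
 * stay within one section in this simplified model.
 */
static void model_activate(struct model_section *ms, unsigned long pfn,
			   unsigned long nr_pages)
{
	unsigned int first = pfn_to_subsection_idx(pfn);
	unsigned int last = pfn_to_subsection_idx(pfn + nr_pages - 1);
	unsigned int i;

	for (i = first; i <= last; i++)
		ms->subsection_map |= UINT64_C(1) << i;
	ms->valid = true;
}

/* A pfn is valid only if its section is valid AND its sub-section bit is set. */
static bool model_pfn_valid(const struct model_section *ms, unsigned long pfn)
{
	if (!ms->valid)
		return false;
	return ms->subsection_map & (UINT64_C(1) << pfn_to_subsection_idx(pfn));
}
```

Activating pfns 512..1535 (the second and third 2MB sub-sections) leaves
pfn 0 and pfn 1536 invalid even though the section itself is valid,
which is the distinction a valid_section()-only check cannot make.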
Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them. The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt. Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem
ranges with other pmem ranges by default [3]. In short,
devm_memremap_pages() has pushed the venerable section-size constraint
past the breaking point, and the simplicity of section-aligned
arch_add_memory() is no longer tenable.
These patches are exposed to the kbuild robot on my libnvdimm-pending
branch [4], and a preview of the unit test for this functionality is
available on the 'subsection-pending' branch of ndctl [5].
[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@d...
[3]: https://github.com/pmem/ndctl/issues/76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=li...
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c
---
Dan Williams (11):
mm/sparsemem: Introduce struct mem_section_usage
mm/sparsemem: Add helpers track active portions of a section at boot
mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal
mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
mm/hotplug: Kill is_dev_zone() usage in __remove_pages()
mm: Kill is_dev_zone() helper
mm/sparsemem: Prepare for sub-section ranges
mm/sparsemem: Support sub-section hotplug
mm/devm_memremap_pages: Enable sub-section remap
libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
libnvdimm/pfn: Stop padding pmem namespaces to section alignment
Robin Murphy (1):
mm/memremap: Rename and consolidate SECTION_SIZE
arch/powerpc/include/asm/sparsemem.h | 3
arch/x86/mm/init_64.c | 4
drivers/nvdimm/dax_devs.c | 2
drivers/nvdimm/pfn.h | 15 -
drivers/nvdimm/pfn_devs.c | 95 +++------
include/linux/memory_hotplug.h | 7 -
include/linux/mm.h | 4
include/linux/mmzone.h | 93 +++++++--
kernel/memremap.c | 63 ++----
mm/hmm.c | 2
mm/memory_hotplug.c | 172 +++++++++-------
mm/page_alloc.c | 8 -
mm/sparse-vmemmap.c | 21 +-
mm/sparse.c | 369 +++++++++++++++++++++++-----------
14 files changed, 511 insertions(+), 347 deletions(-)
[PATCH v7 00/12] mm: Sub-section memory hotplug support
by Dan Williams
Changes since v6 [1]:
- Rebase on next-20190501, no related conflicts or updates
- Fix boot crash due to inaccurate setup of the initial section
->map_active bitmask caused by multiple activations of the same
section. (Jane, Jeff)
- Fix pmem startup crash when devm_memremap_pages() needs to instantiate
a new section. (Jeff)
- Drop mhp_restrictions for the __remove_pages() path in favor of
find_memory_block() to detect cases where section-aligned remove is
required (David)
- Add "[PATCH v7 06/12] mm/hotplug: Kill is_dev_zone() usage in
__remove_pages()"
- Cleanup shrink_{zone,pgdat}_span to remove no longer necessary @ms
section variables. (Oscar)
- Add subsection_check() to the __add_pages() path to prevent
inadvertent sub-section misuse.
[1]: https://lore.kernel.org/lkml/155552633539.2015392.2477781120122237934.stg...
---
[merge logistics]
Hi Andrew,
I believe this is ready for another spin in -mm now that the boot
regression has been squashed. In a chat with Michal last night at LSF/MM
I submitted to his assertion that the boot regression validates the
general concern that there were/are subtle dependencies on sections
beyond what I found to date by code inspection. Of course I want to
relieve the pain that the section constraint inflicts on libnvdimm and
devm_memremap_pages() as soon as possible (i.e. v5.2), but deferment to
v5.3 to give Michal time to do an in-depth look is also acceptable.
---
[cover letter]
The memory hotplug section is an arbitrary / convenient unit for memory
hotplug. 'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace. The section-size constraint, while mostly benign for typical
memory hotplug, has wreaked and continues to wreak havoc with 'device-memory'
use cases, persistent memory (pmem) in particular. Recall that pmem uses
devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
'struct page' memmap for pmem. However, it does not use the 'bottom
half' of memory hotplug, i.e. never marks pmem pages online and never
exposes the userspace memblock interface for pmem. This leaves an
opening to redress the section-size constraint.
To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory(). Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes the physical memory alignment of pmem from one boot to the
next. Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.
It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages(). Here is an analysis
of the current design assumptions in the current code and how they are
addressed in the new implementation:
Current design assumptions:
- Sections that describe boot memory (early sections) are never
unplugged / removed.
- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
valid_section() check
- __add_pages() and helper routines assume all operations occur in
PAGES_PER_SECTION units.
- The memblock sysfs interface only comprehends full sections
New design assumptions:
- Sections are instrumented with a sub-section bitmask to track (on x86)
individual 2MB sub-divisions of a 128MB section.
- Partially populated early sections can be extended with additional
sub-sections, and those sub-sections can be removed with
arch_remove_memory(). With this in place we no longer lose usable memory
capacity to padding.
- pfn_valid() is updated to look deeper than valid_section() to also check the
active-sub-section mask. This indication is in the same cacheline as
the valid_section() so the performance impact is expected to be
negligible. So far the lkp robot has not reported any regressions.
- Outside of the core vmemmap population routines which are replaced,
other helper routines like shrink_{zone,pgdat}_span() are updated to
handle the smaller granularity. Core memory hotplug routines that deal
with online memory are not touched.
- The existing memblock sysfs user api guarantees / assumptions are
not touched since this capability is limited to !online
!memblock-sysfs-accessible sections.
Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them. The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt. Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem
ranges with other pmem ranges by default [3]. In short,
devm_memremap_pages() has pushed the venerable section-size constraint
past the breaking point, and the simplicity of section-aligned
arch_add_memory() is no longer tenable.
These patches are exposed to the kbuild robot on my libnvdimm-pending
branch [4], and a preview of the unit test for this functionality is
available on the 'subsection-pending' branch of ndctl [5].
[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@d...
[3]: https://github.com/pmem/ndctl/issues/76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=li...
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c
---
Dan Williams (12):
mm/sparsemem: Introduce struct mem_section_usage
mm/sparsemem: Introduce common definitions for the size and mask of a section
mm/sparsemem: Add helpers track active portions of a section at boot
mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal
mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
mm/hotplug: Kill is_dev_zone() usage in __remove_pages()
mm: Kill is_dev_zone() helper
mm/sparsemem: Prepare for sub-section ranges
mm/sparsemem: Support sub-section hotplug
mm/devm_memremap_pages: Enable sub-section remap
libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
libnvdimm/pfn: Stop padding pmem namespaces to section alignment
arch/x86/mm/init_64.c | 4
drivers/nvdimm/dax_devs.c | 2
drivers/nvdimm/pfn.h | 12 -
drivers/nvdimm/pfn_devs.c | 93 +++-------
include/linux/memory_hotplug.h | 7 -
include/linux/mm.h | 4
include/linux/mmzone.h | 72 ++++++--
kernel/memremap.c | 63 +++----
mm/hmm.c | 2
mm/memory_hotplug.c | 172 ++++++++++---------
mm/page_alloc.c | 8 +
mm/sparse-vmemmap.c | 21 ++
mm/sparse.c | 370 ++++++++++++++++++++++++++++------------
13 files changed, 490 insertions(+), 340 deletions(-)
Re: [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT)
by Dan Williams
On Tue, May 7, 2019 at 11:32 PM Tao Xu <tao3.xu(a)intel.com> wrote:
>
> This series of patches will build Heterogeneous Memory Attribute Table (HMAT)
> according to the command line. The ACPI HMAT describes the memory attributes,
> such as memory side cache attributes and bandwidth and latency details,
> related to the System Physical Address (SPA) Memory Ranges.
> The software is expected to use this information as hint for optimization.
>
> OSPM evaluates HMAT only during system initialization. Any changes to the HMAT
> state at runtime or information regarding HMAT for hot plug are communicated
> using the _HMA method.
[..]
Hi,
I gave these patches a try while developing support for the new EFI
v2.8 Specific Purpose Memory attribute [1]. I have a gap / feature
request to note to make this implementation capable of emulating
current shipping platform BIOS implementations for persistent memory
platforms.
The NUMA configuration I tested was:
-numa node,mem=4G,cpus=0-19,nodeid=0
-numa node,mem=4G,cpus=20-39,nodeid=1
-numa node,mem=4G,nodeid=2
-numa node,mem=4G,nodeid=3
...and it produced an entry like the following for proximity domain 2.
[0C8h 0200 2] Structure Type : 0000 [Memory Proximity Domain Attributes]
[0CAh 0202 2] Reserved : 0000
[0CCh 0204 4] Length : 00000028
[0D0h 0208 2] Flags (decoded below) : 0002
Processor Proximity Domain Valid : 0
[0D2h 0210 2] Reserved1 : 0000
[0D4h 0212 4] Processor Proximity Domain : 00000002
[0D8h 0216 4] Memory Proximity Domain : 00000002
[0DCh 0220 4] Reserved2 : 00000000
[0E0h 0224 8] Reserved3 : 0000000240000000
[0E8h 0232 8] Reserved4 : 0000000100000000
Notice that the "Processor Proximity Domain Valid" bit is clear. I
understand that the implementation is keying off of whether cpus are
defined for that same node or not, but that's not how current
persistent memory platforms implement "Processor Proximity Domain". On
these platforms persistent memory indeed has its own proximity domain,
but the Processor Proximity Domain is expected to be assigned to the
domain that houses the memory controller for that persistent memory.
So to emulate that configuration it would be useful to have a way to
specify "Processor Proximity Domain" without needing to define CPUs in
that domain.
Something like:
-numa node,mem=4G,cpus=0-19,nodeid=0
-numa node,mem=4G,cpus=20-39,nodeid=1
-numa node,mem=4G,nodeid=2,localnodeid=0
-numa node,mem=4G,nodeid=3,localnodeid=1
...to specify that node2 memory is connected / local to node0 and
node3 memory is connected / local to node1. In general HMAT specifies
that all performance differentiated memory ranges have their own
proximity domain, but those are expected to still be associated with a
local/host/home-socket memory controller.
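A minimal sketch of the requested behaviour, assuming a hypothetical
'localnodeid' setting as in the command line above. None of these names
come from QEMU; the structure models only the three fields discussed
here, and bit 0 of Flags is taken to be the "Processor Proximity Domain
Valid" bit, consistent with the decoded dump.

```c
#include <stdbool.h>
#include <stdint.h>

#define HMAT_PROC_PD_VALID	(1u << 0)	/* Flags bit 0 (assumed) */
#define NO_LOCAL_NODE		(-1)

/* Hypothetical per-node configuration parsed from -numa options. */
struct numa_node_cfg {
	bool has_cpus;		/* node was given a cpus= list */
	int local_node;		/* hypothetical localnodeid=, or NO_LOCAL_NODE */
};

/* Only the fields of the HMAT entry that matter for this discussion. */
struct hmat_mem_attrs {
	uint16_t flags;
	uint32_t processor_pd;
	uint32_t memory_pd;
};

static struct hmat_mem_attrs build_mem_attrs(const struct numa_node_cfg *nodes,
					     uint32_t nodeid)
{
	struct hmat_mem_attrs e = {
		.flags = 0, .processor_pd = 0, .memory_pd = nodeid,
	};

	if (nodes[nodeid].has_cpus) {
		/* Current behaviour as understood in this mail. */
		e.flags |= HMAT_PROC_PD_VALID;
		e.processor_pd = nodeid;
	} else if (nodes[nodeid].local_node != NO_LOCAL_NODE) {
		/* Requested: point at the node hosting the memory controller. */
		e.flags |= HMAT_PROC_PD_VALID;
		e.processor_pd = (uint32_t)nodes[nodeid].local_node;
	}
	return e;
}
```

With node 2 configured as CPU-less but local to node 0, the entry for
memory proximity domain 2 would carry a valid Processor Proximity Domain
of 0 instead of leaving the valid bit clear.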
[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-May/021668.html
[PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
by Dan Williams
Changes since v1 [1]:
- Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
- Refresh the p2pdma patch headers to match the format of other p2pdma
patches (Bjorn)
- Collect Ira's reviewed-by
[1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgi...
---
Logan audited the devm_memremap_pages() shutdown path and noticed that
it was possible to proceed to arch_remove_memory() before all
potential page references have been reaped.
Introduce a new ->cleanup() callback to do the work of waiting for any
straggling page references and then perform the percpu_ref_exit() in
devm_memremap_pages_release() context.
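The intended teardown ordering can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions:
model_pgmap and its helpers are hypothetical, a plain counter stands in
for the percpu_ref, and draining the counter stands in for waiting on
straggling page references.

```c
#include <stdbool.h>

struct model_pgmap {
	long refs;		/* outstanding page references */
	bool dying;		/* set by kill: no new references allowed */
	bool memory_removed;	/* models arch_remove_memory() having run */
	void (*kill)(struct model_pgmap *);
	void (*cleanup)(struct model_pgmap *);
};

/* Take a reference; fails once the mapping is being torn down. */
static bool model_get(struct model_pgmap *p)
{
	if (p->dying)
		return false;
	p->refs++;
	return true;
}

static void model_kill(struct model_pgmap *p)
{
	p->dying = true;
}

/* Stand-in for waiting: a real ->cleanup() would block until refs hit zero. */
static void model_cleanup(struct model_pgmap *p)
{
	while (p->refs > 0)
		p->refs--;
}

/* Release path: kill, then cleanup, and only then remove the memory. */
static int model_release(struct model_pgmap *p)
{
	p->kill(p);
	p->cleanup(p);
	if (p->refs != 0)
		return -1;	/* never remove memory with live references */
	p->memory_removed = true;
	return 0;
}
```

The point of the model is the ordering in model_release(): memory
removal is reachable only after the cleanup step has observed zero
outstanding references.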
For p2pdma this involves some deeper reworks to reference count
resources on a per-instance basis rather than a per pci-device basis. A
modified genalloc api is introduced to convey a driver-private pointer
through gen_pool_{alloc,free}() interfaces. Also, a
devm_memunmap_pages() api is introduced since p2pdma does not
auto-release resources on a setup failure.
The dax and pmem changes pass the nvdimm unit tests, and the p2pdma
changes should now pass testing with the pci_p2pdma_release() fix.
Jérôme, how does this look for HMM?
In general, I think these patches / fixes are suitable for v5.2-rc1 or
v5.2-rc2, and since they touch kernel/memremap.c, and other various
pieces of the core, they should go through the -mm tree. These patches
merge cleanly with the current state of -next, pass the nvdimm unit
tests, and are exposed to the 0day robot with no issues reported
(https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=li...).
---
Dan Williams (6):
drivers/base/devres: Introduce devm_release_action()
mm/devm_memremap_pages: Introduce devm_memunmap_pages
PCI/P2PDMA: Fix the gen_pool_add_virt() failure path
lib/genalloc: Introduce chunk owners
PCI/P2PDMA: Track pgmap references per resource, not globally
mm/devm_memremap_pages: Fix final page put race
drivers/base/devres.c | 24 +++++++-
drivers/dax/device.c | 13 +---
drivers/nvdimm/pmem.c | 17 ++++-
drivers/pci/p2pdma.c | 115 +++++++++++++++++++++++--------------
include/linux/device.h | 1
include/linux/genalloc.h | 55 ++++++++++++++++--
include/linux/memremap.h | 8 +++
kernel/memremap.c | 23 ++++++-
lib/genalloc.c | 51 ++++++++--------
mm/hmm.c | 14 +----
tools/testing/nvdimm/test/iomap.c | 2 +
11 files changed, 217 insertions(+), 106 deletions(-)
[PATCH v4 00/18] btrfs dax support
by Goldwyn Rodrigues
This patch set adds support for dax on the BTRFS filesystem.
In order to support CoW for btrfs, changes had to be
made to the dax handling. The important one is copying blocks into the
same dax device before using them, which is performed by the iomap
type IOMAP_DAX_COW.
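The copy step can be pictured with a plain-memory sketch. This is not
the btrfs or fs/dax.c code: cow_write_block() and MODEL_BLOCK_SIZE are
hypothetical names, and ordinary buffers stand in for the source and
destination blocks on the dax device. A partial-block write first
preserves the untouched head and tail from the shared source block, so
the new block is complete before it is used.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_BLOCK_SIZE 4096

/*
 * Write 'len' bytes of 'data' at offset 'off' into the new block 'dst',
 * filling the rest of 'dst' from the shared source block 'src'.
 * Returns 0 on success, -1 if the write does not fit in one block.
 */
static int cow_write_block(unsigned char *dst, const unsigned char *src,
			   size_t off, const void *data, size_t len)
{
	if (off > MODEL_BLOCK_SIZE || len > MODEL_BLOCK_SIZE - off)
		return -1;

	memcpy(dst, src, off);				/* unmodified head */
	memcpy(dst + off, data, len);			/* new data */
	memcpy(dst + off + len, src + off + len,
	       MODEL_BLOCK_SIZE - off - len);		/* unmodified tail */
	return 0;
}
```

The source block is never written, which is what lets a snapshot keep
referring to it.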
Snapshotting and CoW features are supported (including mmap preservation
across snapshots).
Git: https://github.com/goldwynr/linux/tree/btrfs-dax
Changes since v3:
- Fixed memcpy bug
- Used flags instead of bools for dax_insert_entry()
Changes since v2:
- Created a new type IOMAP_DAX_COW as opposed to flag IOMAP_F_COW
- CoW source address is presented in iomap.inline_data
- Split the patches to more elaborate dax/iomap patches
Changes since v1:
- use iomap instead of redoing everything in btrfs
- support for mmap writeprotecting on snapshotting
fs/btrfs/Makefile | 1
fs/btrfs/ctree.h | 38 +++++
fs/btrfs/dax.c | 289 +++++++++++++++++++++++++++++++++++++++++--
fs/btrfs/disk-io.c | 4
fs/btrfs/file.c | 37 ++++-
fs/btrfs/inode.c | 114 ++++++++++++----
fs/btrfs/ioctl.c | 29 +++-
fs/btrfs/send.c | 4
fs/btrfs/super.c | 30 ++++
fs/dax.c | 183 ++++++++++++++++++++++++---
fs/iomap.c | 9 -
fs/ocfs2/file.c | 2
fs/read_write.c | 11 -
fs/xfs/xfs_reflink.c | 2
include/linux/dax.h | 15 +-
include/linux/fs.h | 8 +
include/linux/iomap.h | 7 +
include/trace/events/btrfs.h | 56 ++++++++
18 files changed, 752 insertions(+), 87 deletions(-)
[RFC PATCH V2 1/3] mm/nvdimm: Add PFN_MIN_VERSION support
by Aneesh Kumar K.V
This allows us to make changes in a backward incompatible way. I have
kept the PFN_MIN_VERSION in this patch '0' because we are not introducing
any incompatible changes in this patch. We also may want to backport this
to older kernels.
The error looks like:
dax0.1: init failed, superblock min version 1, kernel support version 0
and the namespace is marked disabled:
$ndctl list -Ni
[
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"mem",
"size":10737418240,
"uuid":"9605de6d-cefa-4a87-99cd-dec28b02cffe",
"state":"disabled"
}
]
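The probe flow this patch aims for can be sketched outside the kernel as
follows. The function names and the function-pointer table are
hypothetical; only the return-code convention follows the diff in this
mail: 0 means an info block was claimed, and -EOPNOTSUPP means a valid
info block was found that this kernel does not support.

```c
#include <errno.h>
#include <stddef.h>

typedef int (*probe_fn)(void);

/*
 * Returns 0 when a raw pmem disk should be attached, or a negative
 * errno when the device must not be set up as raw pmem.
 */
static int model_pmem_probe(const probe_fn *probes, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret = probes[i]();

		if (ret == 0)
			return -ENXIO;	/* will come back as that personality */
		if (ret == -EOPNOTSUPP)
			return ret;	/* valid info block, but too new */
	}
	return 0;	/* no info block found: attach as raw pmem */
}

/* Stand-ins for the btt/pfn/dax probes. */
static int probe_none(void)	{ return -ENODEV; }
static int probe_claimed(void)	{ return 0; }
static int probe_too_new(void)	{ return -EOPNOTSUPP; }
```

With { probe_none, probe_too_new, probe_claimed } the sketch stops at
the second probe and reports -EOPNOTSUPP, so no raw pmem disk is
created on top of a namespace that a newer kernel has initialized.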
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
---
drivers/nvdimm/pfn.h | 9 ++++++++-
drivers/nvdimm/pfn_devs.c | 8 ++++++++
drivers/nvdimm/pmem.c | 26 ++++++++++++++++++++++----
3 files changed, 38 insertions(+), 5 deletions(-)
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index dde9853453d3..5fd29242745a 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -20,6 +20,12 @@
#define PFN_SIG_LEN 16
#define PFN_SIG "NVDIMM_PFN_INFO\0"
#define DAX_SIG "NVDIMM_DAX_INFO\0"
+/*
+ * Increment this when making changes such that older kernels
+ * should fail to initialize the namespace.
+ */
+
+#define PFN_MIN_VERSION 0
struct nd_pfn_sb {
u8 signature[PFN_SIG_LEN];
@@ -36,7 +42,8 @@ struct nd_pfn_sb {
__le32 end_trunc;
/* minor-version-2 record the base alignment of the mapping */
__le32 align;
- u8 padding[4000];
+ __le16 min_version;
+ u8 padding[3998];
__le64 checksum;
};
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 01f40672507f..a2268cf262f5 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -439,6 +439,13 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
if (nvdimm_read_bytes(ndns, SZ_4K, pfn_sb, sizeof(*pfn_sb), 0))
return -ENXIO;
+ if (le16_to_cpu(pfn_sb->min_version) > PFN_MIN_VERSION) {
+ dev_err(&nd_pfn->dev,
+ "init failed, superblock min version %d, kernel support version %d\n",
+ le16_to_cpu(pfn_sb->min_version), PFN_MIN_VERSION);
+ return -EOPNOTSUPP;
+ }
+
if (memcmp(pfn_sb->signature, sig, PFN_SIG_LEN) != 0)
return -ENODEV;
@@ -769,6 +776,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
pfn_sb->version_major = cpu_to_le16(1);
pfn_sb->version_minor = cpu_to_le16(2);
+ pfn_sb->min_version = cpu_to_le16(PFN_MIN_VERSION);
pfn_sb->start_pad = cpu_to_le32(start_pad);
pfn_sb->end_trunc = cpu_to_le32(end_trunc);
pfn_sb->align = cpu_to_le32(nd_pfn->align);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 845c5b430cdd..406427c064d9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -490,6 +490,7 @@ static int pmem_attach_disk(struct device *dev,
static int nd_pmem_probe(struct device *dev)
{
+ int ret;
struct nd_namespace_common *ndns;
ndns = nvdimm_namespace_common_probe(dev);
@@ -505,12 +506,29 @@ static int nd_pmem_probe(struct device *dev)
if (is_nd_pfn(dev))
return pmem_attach_disk(dev, ndns);
- /* if we find a valid info-block we'll come back as that personality */
- if (nd_btt_probe(dev, ndns) == 0 || nd_pfn_probe(dev, ndns) == 0
- || nd_dax_probe(dev, ndns) == 0)
+ ret = nd_btt_probe(dev, ndns);
+ if (ret == 0)
return -ENXIO;
+ else if (ret == -EOPNOTSUPP)
+ return ret;
- /* ...otherwise we're just a raw pmem device */
+ ret = nd_pfn_probe(dev, ndns);
+ if (ret == 0)
+ return -ENXIO;
+ else if (ret == -EOPNOTSUPP)
+ return ret;
+
+ ret = nd_dax_probe(dev, ndns);
+ if (ret == 0)
+ return -ENXIO;
+ else if (ret == -EOPNOTSUPP)
+ return ret;
+ /*
+ * We have two failure conditions here: either there is no
+ * info reserve block, or we found a valid info reserve block
+ * but failed to initialize the pfn superblock.
+ * Don't create a raw pmem disk for the second case.
+ */
return pmem_attach_disk(dev, ndns);
}
--
2.21.0
[ndctl PATCH v4 00/10] daxctl: add a new reconfigure-device command
by Vishal Verma
Changes in v4:
- Don't fail add_dax_dev for kmod failures. Instead fail only when the kmod
list is actually used, i.e. during daxctl-reconfigure-device
Changes in v3:
- In daxctl_dev_get_mode(), remove the subsystem warning, detect dax-class
and simply make it return devdax
Changes in v2:
- Add examples to the documentation page (Dave Hansen)
- Clarify documentation regarding the conversion from system-ram to devdax
- Remove any references to a persistent config from the documentation -
those can be added when the feature is added.
- device.c: validate option compatibility
- daxctl-list: display numa_node for device listings
- daxctl-list: display mode for device listings
- make the options more consistent by adding a '-O' short option
for --attempt-offline
Add a new daxctl-reconfigure-device command that lets us reconfigure DAX
devices back and forth between 'system-ram' and 'device-dax' modes. It
also includes facilities to online any newly hot-plugged memory
(default), and attempt to offline memory before converting away from the
system-ram mode (not default, requires a --attempt-offline option).
Currently missing from this series is a way to persistently store which
devices have been 'marked' for use as system-ram. This depends on a
config system overhaul in ndctl, and patches for those will follow
separately and are independent of this work.
Example invocations:
1. Reconfigure dax0.0 to system-ram mode, don’t online the memory
# daxctl reconfigure-device --mode=system-ram --no-online dax0.0
[
{
"chardev":"dax0.0",
"size":16777216000,
"numa_node":2,
"mode":"system-ram"
}
]
2. Reconfigure dax0.0 to devdax mode, attempt to offline the memory
# daxctl reconfigure-device --human --mode=devdax --attempt-offline dax0.0
{
"chardev":"dax0.0",
"size":"15.63 GiB (16.78 GB)",
"numa_node":2,
"mode":"devdax"
}
3. Reconfigure all dax devices on region0 to system-ram mode
# daxctl reconfigure-device --mode=system-ram --region=0 all
[
{
"chardev":"dax0.0",
"size":16777216000,
"numa_node":2,
"mode":"system-ram"
},
{
"chardev":"dax0.1",
"size":16777216000,
"numa_node":3,
"mode":"system-ram"
}
]
These patches can also be found in the 'kmem-pending' branch on github:
https://github.com/pmem/ndctl/tree/kmem-pending
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Vishal Verma (10):
libdaxctl: add interfaces in support of device modes
libdaxctl: cache 'subsystem' in daxctl_ctx
libdaxctl: add interfaces to enable/disable devices
libdaxctl: add interfaces to get/set the online state for a node
daxctl/list: add numa_node for device listings
libdaxctl: add an interface to get the mode for a dax device
daxctl: add a new reconfigure-device command
Documentation/daxctl: add a man page for daxctl-reconfigure-device
contrib/ndctl: fix region-id completions for daxctl
contrib/ndctl: add bash-completion for daxctl-reconfigure-device
Documentation/daxctl/Makefile.am | 3 +-
.../daxctl/daxctl-reconfigure-device.txt | 118 ++++
contrib/ndctl | 34 +-
daxctl/Makefile.am | 2 +
daxctl/builtin.h | 1 +
daxctl/daxctl.c | 1 +
daxctl/device.c | 237 ++++++++
daxctl/lib/Makefile.am | 3 +-
daxctl/lib/libdaxctl-private.h | 21 +
daxctl/lib/libdaxctl.c | 548 +++++++++++++++++-
daxctl/lib/libdaxctl.sym | 14 +
daxctl/libdaxctl.h | 16 +
util/json.c | 22 +
13 files changed, 1009 insertions(+), 11 deletions(-)
create mode 100644 Documentation/daxctl/daxctl-reconfigure-device.txt
create mode 100644 daxctl/device.c
--
2.20.1
[v6 0/3] "Hotremove" persistent memory
by Pavel Tatashin
Changelog:
v6
- A few minor changes and added reviewed-by's.
- Spent time studying lock ordering issue that was reported by Vishal
Verma, but that issue already exists in Linux, and can be reproduced
with exactly the same steps with ACPI memory hotplugging.
v5
- Addressed comments from Dan Williams: made remove_memory() return
an error code, and use this function from dax.
v4
- Addressed comments from Dave Hansen
v3
- Addressed comments from David Hildenbrand. Don't release
lock_device_hotplug after checking memory status, and rename
memblock_offlined_cb() to check_memblock_offlined_cb()
v2
- Dan Williams mentioned that drv->remove() return is ignored
by unbind. Unbind always succeeds. Because we cannot guarantee
that memory can be offlined from the driver, don't even
attempt to do so. Simply check that every section is offlined
beforehand and only then proceed with removing dax memory.
---
Support for using persistent memory like regular RAM was recently
added to Linux. This work extends that functionality to also allow
hot-removing persistent memory.
We (Microsoft) have an important use case for this functionality.
The requirement is for physical machines with a small amount of RAM (~8G)
to be able to reboot in a very short period of time (<1s). Yet, there is
a userland state that is expensive to recreate (~2G).
The solution is to boot machines with 2G preserved for persistent
memory. Copy the state, and hotadd the persistent memory so the machine
still has all 8G available for runtime. Before reboot, offline and
hotremove the device-dax 2G, copy the memory that needs to be preserved
to the pmem0 device, and reboot.
The series of operations looks like this:
1. After boot, restore /dev/pmem0 to ramdisk to be consumed by apps,
and free ramdisk.
2. Convert raw pmem0 to devdax
ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
3. Hotadd to System RAM
echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
echo online_movable > /sys/devices/system/memory/memoryXXX/state
4. Before reboot hotremove device-dax memory from System RAM
echo offline > /sys/devices/system/memory/memoryXXX/state
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
5. Create raw pmem0 device
ndctl create-namespace --mode raw -e namespace0.0 -f
6. Copy the state that was stored by apps to ramdisk to pmem device
7. Do kexec reboot, or reboot through firmware if firmware does not
zero memory in pmem0 region. (These machines have only regular
volatile memory, so to have a pmem0 device either the memmap kernel
parameter is used, or device nodes in dtb are specified.)
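Step 4 only works if every memory block backing the device is offline
before the kmem unbind. The following is a minimal sketch of that ordering,
not the series' implementation. The function names and the SYSFS_ROOT
override are hypothetical (added so the logic can be exercised outside
/sys), and the caller is assumed to already know which memoryN blocks
belong to the device. Memory block state files live under
/sys/devices/system/memory/memoryN/state.

```shell
#!/bin/sh
# SYSFS_ROOT is an assumption for testing; on a real system it is /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# blocks_offline BLOCK...: succeed only if every named memory block
# (e.g. memory32) currently reports "offline".
blocks_offline() {
    for blk in "$@"; do
        state_file="$SYSFS_ROOT/devices/system/memory/$blk/state"
        [ "$(cat "$state_file" 2>/dev/null)" = "offline" ] || return 1
    done
    return 0
}

# hotremove_dax DEVICE BLOCK...: request offline for each block, verify
# that all of them are offline, and only then unbind DEVICE from kmem.
hotremove_dax() {
    dev="$1"
    shift
    for blk in "$@"; do
        echo offline > "$SYSFS_ROOT/devices/system/memory/$blk/state" || return 1
    done
    blocks_offline "$@" || return 1
    echo "$dev" > "$SYSFS_ROOT/bus/dax/drivers/kmem/unbind"
}
```

A caller might run "hotremove_dax dax0.0 memory32 memory33" as root, where
the block names are placeholders. If any block refuses to go offline, the
function returns non-zero and the unbind is never attempted.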
Pavel Tatashin (3):
device-dax: fix memory and resource leak if hotplug fails
mm/hotplug: make remove_memory() interface useable
device-dax: "Hotremove" persistent memory that is used like normal RAM
drivers/dax/dax-private.h | 2 ++
drivers/dax/kmem.c | 46 +++++++++++++++++++++---
include/linux/memory_hotplug.h | 8 +++--
mm/memory_hotplug.c | 64 +++++++++++++++++++++++-----------
4 files changed, 92 insertions(+), 28 deletions(-)
--
2.21.0