[added to the 4.1 stable tree] x86/mm: Fix vmalloc_fault() to handle large pages properly
by Sasha Levin
From: Toshi Kani <toshi.kani(a)hpe.com>
This patch has been added to the 4.1 stable tree. If you have any
objections, please let us know.
===============
[ Upstream commit f4eafd8bcd5229e998aa252627703b8462c3b90f ]
A kernel page fault oops with the callstack below was observed
when a read syscall was made to a pmem device after a huge amount
(>512GB) of vmalloc ranges was allocated by ioremap() on an x86_64
system:
BUG: unable to handle kernel paging request at ffff880840000ff8
IP: vmalloc_fault+0x1be/0x300
PGD c7f03a067 PUD 0
Oops: 0000 [#1] SMP
Call Trace:
__do_page_fault+0x285/0x3e0
do_page_fault+0x2f/0x80
? put_prev_entity+0x35/0x7a0
page_fault+0x28/0x30
? memcpy_erms+0x6/0x10
? schedule+0x35/0x80
? pmem_rw_bytes+0x6a/0x190 [nd_pmem]
? schedule_timeout+0x183/0x240
btt_log_read+0x63/0x140 [nd_btt]
:
? __symbol_put+0x60/0x60
? kernel_read+0x50/0x80
SyS_finit_module+0xb9/0xf0
entry_SYSCALL_64_fastpath+0x1a/0xa4
Since v4.1, ioremap() supports large page (pud/pmd) mappings on
x86_64 and PAE. vmalloc_fault(), however, assumes that the vmalloc
range is limited to pte mappings.
vmalloc faults do not normally happen in ioremap'd ranges since
ioremap() sets up the kernel page tables, which are shared with
user processes: pgd_ctor() copies the kernel's PGD entries into a
process's PGD during fork(). When allocation of the vmalloc ranges
crosses a 512GB boundary, however, ioremap() allocates a new pud
table and updates the kernel PGD entry to point to it. If a user
process's PGD entry does not have this update yet, a read/write
syscall to the range causes a vmalloc fault, which hits the Oops
above because the fault handler does not handle large pages
properly.
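For reference, the 64-bit PGD sync at the top of vmalloc_fault()
looks roughly like this (a paraphrase of this kernel's
arch/x86/mm/fault.c, not an exact quote):

	pgd_t *pgd = pgd_offset(current->active_mm, address);
	pgd_t *pgd_ref = pgd_offset_k(address);

	if (pgd_none(*pgd_ref))
		return -1;              /* not a valid vmalloc address */

	if (pgd_none(*pgd))
		set_pgd(pgd, *pgd_ref); /* copy the missing 512GB entry */
	else
		BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));

The validation code below this sync, which walks the pud/pmd/pte
levels, is what assumed pte-only mappings.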
The following changes are made to vmalloc_fault():
64-bit:
- No change for the PGD sync operation as it handles large
pages already.
- Add pud_huge() and pmd_huge() to the validation code to
handle large pages.
- Change pud_page_vaddr() to pud_pfn() since an ioremap range
is not directly mapped (while the if-statement still works
with a bogus addr).
- Change pmd_page() to pmd_pfn() since an ioremap range is not
backed by struct page (while the if-statement still works
with a bogus addr).
32-bit:
- No change for the sync operation since the index3 PGD entry
covers the entire vmalloc range, which is always valid.
(A separate change to sync the PGD entry would be necessary if this
memory layout were ever changed, regardless of the page size.)
- Add pmd_huge() to the validation code to handle large pages.
This is for completeness since vmalloc_fault() won't happen
in ioremap'd ranges as its PGD entry is always valid.
Reported-by: Henning Schild <henning.schild(a)siemens.com>
Signed-off-by: Toshi Kani <toshi.kani(a)hpe.com>
Acked-by: Borislav Petkov <bp(a)alien8.de>
Cc: <stable(a)vger.kernel.org> # 4.1+
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Brian Gerst <brgerst(a)gmail.com>
Cc: Denys Vlasenko <dvlasenk(a)redhat.com>
Cc: H. Peter Anvin <hpa(a)zytor.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof(a)suse.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Toshi Kani <toshi.kani(a)hp.com>
Cc: linux-mm(a)kvack.org
Cc: linux-nvdimm(a)lists.01.org
Link: http://lkml.kernel.org/r/1455758214-24623-1-git-send-email-toshi.kani@hpe...
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Sasha Levin <sasha.levin(a)oracle.com>
---
arch/x86/mm/fault.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 181c53b..62855ac 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -285,6 +285,9 @@ static noinline int vmalloc_fault(unsigned long address)
if (!pmd_k)
return -1;
+ if (pmd_huge(*pmd_k))
+ return 0;
+
pte_k = pte_offset_kernel(pmd_k, address);
if (!pte_present(*pte_k))
return -1;
@@ -356,8 +359,6 @@ void vmalloc_sync_all(void)
* 64-bit:
*
* Handle a fault on the vmalloc area
- *
- * This assumes no large pages in there.
*/
static noinline int vmalloc_fault(unsigned long address)
{
@@ -399,17 +400,23 @@ static noinline int vmalloc_fault(unsigned long address)
if (pud_none(*pud_ref))
return -1;
- if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref))
+ if (pud_none(*pud) || pud_pfn(*pud) != pud_pfn(*pud_ref))
BUG();
+ if (pud_huge(*pud))
+ return 0;
+
pmd = pmd_offset(pud, address);
pmd_ref = pmd_offset(pud_ref, address);
if (pmd_none(*pmd_ref))
return -1;
- if (pmd_none(*pmd) || pmd_page(*pmd) != pmd_page(*pmd_ref))
+ if (pmd_none(*pmd) || pmd_pfn(*pmd) != pmd_pfn(*pmd_ref))
BUG();
+ if (pmd_huge(*pmd))
+ return 0;
+
pte_ref = pte_offset_kernel(pmd_ref, address);
if (!pte_present(*pte_ref))
return -1;
--
2.5.0
[PATCH] dax: check return value of dax_radix_entry()
by Ross Zwisler
dax_pfn_mkwrite() previously wasn't checking the return value of the call
to dax_radix_entry(), which was a mistake.
Instead, capture this return value and pass it up the stack if it is an
error.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
---
fs/dax.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index 7111724..5a587dc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1056,6 +1056,7 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct file *file = vma->vm_file;
+ int error;
/*
* We pass NO_SECTOR to dax_radix_entry() because we expect that a
@@ -1065,7 +1066,11 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
* saves us from having to make a call to get_block() here to look
* up the sector.
*/
- dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
+ error = dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false,
+ true);
+ if (error)
+ return error;
+
return VM_FAULT_NOPAGE;
}
EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
--
2.5.0
acpi_nfit_find_poison() question
by Linda Knippers
Hi Vishal,
I was looking at acpi_nfit_find_poison() and if I'm reading this
right, I think it's throwing away some ARS results and re-running
an ARS unnecessarily. More comments below...
-- ljk
> static int acpi_nfit_find_poison(struct acpi_nfit_desc *acpi_desc,
> struct nd_region_desc *ndr_desc)
> {
> struct nvdimm_bus_descriptor *nd_desc = &acpi_desc->nd_desc;
> struct nvdimm_bus *nvdimm_bus = acpi_desc->nvdimm_bus;
> struct nd_cmd_ars_status *ars_status = NULL;
> struct nd_cmd_ars_start *ars_start = NULL;
> struct nd_cmd_ars_cap *ars_cap = NULL;
> u64 start, len, cur, remaining;
> int rc;
>
> ars_cap = kzalloc(sizeof(*ars_cap), GFP_KERNEL);
> if (!ars_cap)
> return -ENOMEM;
>
> start = ndr_desc->res->start;
> len = ndr_desc->res->end - ndr_desc->res->start + 1;
>
> rc = ars_get_cap(nd_desc, ars_cap, start, len);
> if (rc)
> goto out;
>
> /*
> * If ARS is unsupported, or if the 'Persistent Memory Scrub' flag in
> * extended status is not set, skip this but continue initialization
> */
> if ((ars_cap->status & 0xffff) ||
> !(ars_cap->status >> 16 & ND_ARS_PERSISTENT)) {
> dev_warn(acpi_desc->dev,
> "ARS unsupported (status: 0x%x), won't create an error list\n",
> ars_cap->status);
> goto out;
> }
>
> /*
> * Check if a full-range ARS has been run. If so, use those results
> * without having to start a new ARS.
> */
> ars_status = kzalloc(ars_cap->max_ars_out + sizeof(*ars_status),
> GFP_KERNEL);
> if (!ars_status) {
> rc = -ENOMEM;
> goto out;
> }
>
> rc = ars_get_status(nd_desc, ars_status);
> if (rc)
> goto out;
>
> if (ars_status->address <= start &&
> (ars_status->address + ars_status->length >= start + len)) {
> rc = ars_status_process_records(nvdimm_bus, ars_status, start);
> goto out;
> }
The above code will process the records if the ARS ran to completion but
not if the ARS overflowed. It won't process partial results because it's
checking both the start and the length against the total range.
>
> /*
> * ARS_STATUS can overflow if the number of poison entries found is
> * greater than the maximum buffer size (ars_cap->max_ars_out)
> * To detect overflow, check if the length field of ars_status
> * is less than the length we supplied. If so, process the
> * error entries we got, adjust the start point, and start again
> */
This comment seems like the right idea, but that's not what the code is doing.
> ars_start = kzalloc(sizeof(*ars_start), GFP_KERNEL);
> if (!ars_start)
> return -ENOMEM;
>
> cur = start;
> remaining = len;
If we get here, we're starting over at the beginning, losing the
previous results. Shouldn't we process the previous results and
then enter this loop using
cur = ars_status->address + ars_status->length;
remaining = len - ars_status->length;
?
Or restructure the loop so that the existing results, if any, are
processed before doing an ars_do_start()? (A rough sketch of that
restructuring follows the quoted code below.) Or did I miss something?
> do {
> u64 done, end;
>
> rc = ars_do_start(nd_desc, ars_start, cur, remaining);
> if (rc)
> goto out;
>
> rc = ars_get_status(nd_desc, ars_status);
> if (rc)
> goto out;
>
> rc = ars_status_process_records(nvdimm_bus, ars_status, cur);
> if (rc)
> goto out;
>
> end = min(cur + remaining,
> ars_status->address + ars_status->length);
> done = end - cur;
> cur += done;
> remaining -= done;
> } while (remaining);
>
> out:
> kfree(ars_cap);
> kfree(ars_start);
> kfree(ars_status);
> return rc;
> }
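A rough sketch of that restructuring, reusing the function's
existing locals and helpers and assuming the earlier
ars_get_status() reported a (possibly partial) result for this
range; a suggestion only, not tested code:

	/* process whatever the earlier, possibly overflowed, ARS found */
	rc = ars_status_process_records(nvdimm_bus, ars_status, start);
	if (rc)
		goto out;

	/* resume where the previous results ended instead of at 'start' */
	cur = ars_status->address + ars_status->length;
	remaining = (start + len) - cur;

	while (remaining) {
		rc = ars_do_start(nd_desc, ars_start, cur, remaining);
		if (rc)
			goto out;

		rc = ars_get_status(nd_desc, ars_status);
		if (rc)
			goto out;

		rc = ars_status_process_records(nvdimm_bus, ars_status, cur);
		if (rc)
			goto out;

		cur = ars_status->address + ars_status->length;
		remaining = (start + len) - cur;
	}

This computes 'remaining' from the end of the reported range rather
than from 'len', in case the previous ARS did not begin exactly at
'start'.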
[PATCH 0/8] nfit, libnvdimm: async address range scrub
by Dan Williams
Given the capacities of next-generation persistent memory devices, a
scrub operation to find all poison may take tens of seconds. We want
this scrub work to be done asynchronously with the rest of system
initialization, so we move it out of line from the NFIT probing, i.e.
acpi_nfit_add().
However, we may want to synchronously wait for that scrubbing to
complete before we probe any pmem devices. Consider the case where
consuming poison triggers a machine check and a reboot. That event will
trigger platform firmware to initiate a scrub. The kernel should
complete any firmware-initiated scrubs, as those likely indicate the
presence of known poison.
When errors are not present, i.e. when platform firmware did not
initiate scrubbing, we still scrub, but asynchronously. This trades
a risk of hitting new, unknown poison ranges against making the data
available faster after loading the driver.
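In rough terms, the async/sync split maps onto the standard
workqueue API; the identifiers below are hypothetical, not the ones
this series actually introduces:

	struct scrub_ctx {
		struct work_struct work;
		struct acpi_nfit_desc *acpi_desc;
	};

	static void scrub_fn(struct work_struct *work)
	{
		struct scrub_ctx *ctx =
			container_of(work, struct scrub_ctx, work);

		/* run ARS, record badblocks, register regions */
		do_region_scrub(ctx->acpi_desc);        /* hypothetical */
	}

	/* async path: kick off the scrub and return from probe */
	INIT_WORK(&ctx->work, scrub_fn);
	queue_work(system_long_wq, &ctx->work);

	/* sync path (firmware-initiated scrub): wait for completion */
	flush_work(&ctx->work);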
This async scrub capability is also useful in the future when we
integrate Tony Luck's mcsafe_copy() (or whatever it is
eventually called). After a machine check recovery event we can scrub
the pmem namespace to see if there are any other latent errors and
otherwise update the 'badblocks' list with the new entries.
This passes the libndctl unit test suite, with some minor updates
to account for the fact that, when "modprobe nfit_test" returns,
not all regions are registered yet.
---
Dan Williams (8):
libnvdimm, nfit: centralize command status translation
libnvdimm: protect nvdimm_{bus|namespace}_add_poison() with nvdimm_bus_lock()
libnvdimm: async notification support
nfit, tools/testing/nvdimm: unify common init for acpi_nfit_desc
nfit, libnvdimm: async region scrub workqueue
nfit: scrub and register regions in a workqueue
nfit: disable userspace initiated ars during scrub
tools/testing/nvdimm: expand ars unit testing
drivers/acpi/nfit.c | 761 +++++++++++++++++++++++++++-----------
drivers/acpi/nfit.h | 24 +
drivers/nvdimm/bus.c | 46 ++
drivers/nvdimm/core.c | 110 ++++-
drivers/nvdimm/dimm_devs.c | 6
drivers/nvdimm/nd.h | 2
drivers/nvdimm/pmem.c | 15 +
drivers/nvdimm/region.c | 12 +
include/linux/libnvdimm.h | 5
include/linux/nd.h | 7
tools/testing/nvdimm/test/nfit.c | 133 +++++--
11 files changed, 809 insertions(+), 312 deletions(-)
[PATCH v2 0/3] ACPI 6.1 update for NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates the NVDIMM Control Region Structure
as follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date are
carved out of the previously reserved range. There is no change in
the structure size.
- IDs defined as SPD values are arrays of bytes. The spec clarifies
that they need to be represented as arrays of bytes in the
structure as well.
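Concretely, the changed fields of 'struct acpi_nfit_control_region'
look roughly like this (a sketch of the post-6.1 layout using the
ACPICA naming convention; see patch 1 for the exact hunk):

	struct acpi_nfit_control_region {
		struct acpi_nfit_header header;
		u16 region_index;
		u16 vendor_id;
		u16 device_id;
		u16 revision_id;
		u16 subsystem_vendor_id;
		u16 subsystem_device_id;
		u16 subsystem_revision_id;
		u8 valid_fields;                /* was reserved */
		u8 manufacturing_location;      /* was reserved */
		u16 manufacturing_date;         /* was reserved */
		u8 reserved[2];
		u32 serial_number;
		/* remaining fields are unchanged */
	};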
Patch 1 changes 'struct acpi_nfit_control_region' and the NFIT driver to
comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show the NVDIMM ID defined in ACPI 6.1.
Patch 3 changes the nfit test driver.
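For patch 2, a sysfs attribute like "id" is typically wired up
along the following lines (a sketch; the format string and the
to_nfit_dcr() helper are assumptions here, not necessarily the
patch's exact code):

	static ssize_t id_show(struct device *dev,
			struct device_attribute *attr, char *buf)
	{
		struct acpi_nfit_control_region *dcr = to_nfit_dcr(dev);

		/* assumed layout: vendor-location-date-serial */
		return sprintf(buf, "%04x-%02x-%04x-%08x\n",
				be16_to_cpu(dcr->vendor_id),
				dcr->manufacturing_location,
				be16_to_cpu(dcr->manufacturing_date),
				be32_to_cpu(dcr->serial_number));
	}
	static DEVICE_ATTR_RO(id);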
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
---
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (3):
1/3 ACPI/NFIT: Update Control Region Structure to comply ACPI 6.1
2/3 ACPI/NFIT: Add NVDIMM ID "id" under sysfs
3/3 nfit_test: Update SPD ID init handlings
---
drivers/acpi/nfit.c | 41 ++++++++++++++++++++-----
include/acpi/actbl1.h | 24 +++++++++------
tools/testing/nvdimm/test/nfit.c | 64 ++++++++++++++++++++++++----------------
3 files changed, 88 insertions(+), 41 deletions(-)
[PATCH 0/2] devm_memremap_pages vs section-misaligned pmem
by Dan Williams
Recent testing uncovered two bugs around the handling of section-misaligned
pmem regions:
1/ If the pmem section overlaps "System RAM" we need to fail the
devm_memremap_pages() request. Previously we would mis-detect a
memory map like the following:
100000000-37bffffff : System RAM
37c000000-837ffffff : Persistent Memory
2/ If the pmem section is misaligned, but otherwise does not overlap
memory from another zone, the altmap needs to be fixed up to add the
alignment padding to the 'reserved' pfns of the altmap (see the
sketch below).
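A minimal sketch of the fix-up arithmetic for item 2, assuming
128MB memory sections on x86_64 (PA_SECTION_SHIFT == 27):

	unsigned long section_mask = (1UL << PA_SECTION_SHIFT) - 1;
	unsigned long base = start & ~section_mask;
	struct vmem_altmap altmap = {
		/* the altmap must begin on a section boundary... */
		.base_pfn = base >> PAGE_SHIFT,
		/* ...so the pfns between that boundary and the real
		 * start of the pmem range are accounted as reserved */
		.reserve = (start - base) >> PAGE_SHIFT,
	};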
---
Dan Williams (2):
libnvdimm, pmem: fix 'pfn' support for section-misaligned namespaces
mm: fix mixed zone detection in devm_memremap_pages
drivers/nvdimm/pmem.c | 29 +++++++++++++++++++++++++++--
kernel/memremap.c | 9 ++++-----
2 files changed, 31 insertions(+), 7 deletions(-)