[ndctl PATCH] ndctl: fix enabled namespace count
by Dan Williams
We under-report the number of enabled namespaces when those namespaces
are claimed by another interface. This is because
ndctl_namespace_enable() returns a positive number when the enable
successfully triggers a different device to go active.
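For reference, callers are expected to treat the return value like this
(illustrative snippet, not part of this patch):

        #include <ndctl/libndctl.h>

        static int enable_and_count(struct ndctl_namespace *ndns, int *success)
        {
                int rc = ndctl_namespace_enable(ndns);

                if (rc < 0)
                        return rc;      /* only a negative value is a failure */
                /*
                 * rc == 0: the namespace itself went active
                 * rc > 0:  the enable activated a different device instead
                 *          (e.g. the btt or pfn instance claiming it)
                 * Both cases count toward the enabled total.
                 */
                (*success)++;
                return 0;
        }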
Reported-by: Linda Knippers <linda.knippers(a)hpe.com>
Tested-by: Linda Knippers <linda.knippers(a)hpe.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
builtin-xaction-namespace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/builtin-xaction-namespace.c b/builtin-xaction-namespace.c
index f4645d866e0f..28999b3f8311 100644
--- a/builtin-xaction-namespace.c
+++ b/builtin-xaction-namespace.c
@@ -621,7 +621,7 @@ static int do_xaction_namespace(const char *namespace,
case ACTION_CREATE:
return namespace_reconfig(region, ndns);
}
- if (rc == 0)
+ if (rc >= 0)
success++;
}
}
[PATCH 0/2] DAX bdev fixes - move flushing calls to FS
by Ross Zwisler
The first patch in the series just adds a bdev argument to
dax_clear_blocks(), and should be relatively straightforward.
The second patch is slightly more controversial. During testing of raw block
devices + DAX I noticed that the struct block_device that we were using for DAX
operations was incorrect. For the fault handlers, etc. we can just get the
correct bdev via get_block(), which is passed in as a function pointer, but for
the flushing code we don't have access to get_block(). This is also an issue
for XFS real-time devices, whenever we get those working.
In short, somehow we need to get dax_writeback_mapping_range() a valid bdev.
Right now it is called via filemap_write_and_wait_range(), which can't provide
either the bdev or a get_block() function pointer. So, our options seem to
be:
a) Move the calls to dax_writeback_mapping_range() into the filesystems.
This is implemented by patch 2 in this series (a rough sketch follows
below).
b) Keep the calls to dax_writeback_mapping_range() in the mm code, and
provide a generic way to ask a filesystem for an inode's bdev. I did a
version of this using a superblock operation here:
https://lkml.org/lkml/2016/2/2/941
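As a rough illustration of option a), the filesystem's fsync path would
drive the DAX flush itself, where it already knows the correct bdev. This
is only a sketch using the dax_writeback_mapping_range(mapping, start, end)
signature from the v8 fsync/msync patches; patch 2 of this series also has
to get the correct struct block_device down to the flush code:

        static int ext2_dax_fsync_sketch(struct file *file, loff_t start,
                                         loff_t end, int datasync)
        {
                struct address_space *mapping = file->f_mapping;
                int ret;

                /*
                 * Called from the filesystem rather than mm/filemap.c, so
                 * the fs can substitute whatever bdev is correct for this
                 * inode (e.g. an XFS real-time device).
                 */
                ret = dax_writeback_mapping_range(mapping, start, end);
                if (ret)
                        return ret;

                return generic_file_fsync(file, start, end, datasync);
        }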
It has been noted that we may need to expand the coverage of our DAX
flushing code to include support for the sync() and syncfs() userspace
calls. This is still under discussion, but if we do end up needing to add
support for sync(), I don't think that it is v4.5 material for the reasons
stated here:
https://lkml.org/lkml/2016/2/4/962
I think that for v4.5 we either need patch 2 of this series, or the
get_bdev() patch listed for option b) above.
Ross Zwisler (2):
dax: pass bdev argument to dax_clear_blocks()
dax: move writeback calls into the filesystems
fs/block_dev.c | 7 +++++++
fs/dax.c | 9 ++++-----
fs/ext2/file.c | 10 ++++++++++
fs/ext2/inode.c | 5 +++--
fs/ext4/fsync.c | 10 +++++++++-
fs/xfs/xfs_aops.c | 2 +-
fs/xfs/xfs_aops.h | 1 +
fs/xfs/xfs_bmap_util.c | 4 +++-
fs/xfs/xfs_file.c | 12 ++++++++++--
include/linux/dax.h | 7 ++++---
mm/filemap.c | 6 ------
11 files changed, 52 insertions(+), 21 deletions(-)
--
2.5.0
[PATCH v2] x86/lib/copy_user_64.S: Handle 4-byte nocache copy
by Toshi Kani
Data corruption issues were observed in tests which initiated
a system crash/reset while accessing BTT devices. This problem
is reproducible.
The BTT driver calls pmem_rw_bytes() to update data in pmem
devices. This interface calls __copy_user_nocache(), which
uses non-temporal stores so that the stores to pmem are
persistent.
__copy_user_nocache() uses non-temporal stores when a request
size is 8 bytes or larger (and is aligned by 8 bytes). The
BTT driver updates the BTT map table, whose entries are 4 bytes
each. Therefore, updates to the map table entries remain
cached, and are not written to pmem after a crash.
Change __copy_user_nocache() to use non-temporal stores when
a request size is 4 bytes. The change extends the current
byte-copy path for a less-than-8-bytes request, and does not
add any overhead to the regular path.
Also add comments to the code, and clarify the cases that
lead to a cached copy.
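For reference, a simplified C-level sketch of the resulting store
selection (illustration only, not the kernel code; memcpy() stands in for
the movnti non-temporal stores, and the cached ALIGN_DESTINATION step is
omitted):

        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        static void nocache_copy_sketch(char *dst, const char *src, size_t len)
        {
                while (len >= 8 && ((uintptr_t)dst & 7) == 0) {
                        memcpy(dst, src, 8);    /* asm: movnti %r8,(%rdi) */
                        dst += 8; src += 8; len -= 8;
                }
                if (len >= 4 && ((uintptr_t)dst & 3) == 0) {
                        memcpy(dst, src, 4);    /* new 4-byte movnti path */
                        dst += 4; src += 4; len -= 4;
                }
                while (len--)                   /* remainder: cached movb copy */
                        *dst++ = *src++;
        }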
Reported-and-tested-by: Micah Parrish <micah.parrish(a)hpe.com>
Reported-and-tested-by: Brian Boylston <brian.boylston(a)hpe.com>
Signed-off-by: Toshi Kani <toshi.kani(a)hpe.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: H. Peter Anvin <hpa(a)zytor.com>
Cc: Borislav Petkov <bp(a)suse.de>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
---
v2:
- Add comments (Ingo Molnar).
- Make this an individual patch, since the v2 debug changes will
not depend on it.
---
arch/x86/lib/copy_user_64.S | 74 +++++++++++++++++++++++++++++++++++--------
1 file changed, 61 insertions(+), 13 deletions(-)
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 982ce34..1641327 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -232,17 +232,30 @@ ENDPROC(copy_user_enhanced_fast_string)
/*
* copy_user_nocache - Uncached memory copy with exception handling
- * This will force destination/source out of cache for more performance.
+ * This will force destination out of cache for more performance.
+ *
+ * Note: Cached memory copy is used when destination or size is not
+ * naturally aligned. That is:
+ * - Require 8-byte alignment when size is 8 bytes or larger.
+ * - Require 4-byte alignment when size is 4 bytes.
*/
ENTRY(__copy_user_nocache)
ASM_STAC
+
+ /* If size is less than 8 bytes, goto 4-byte copy */
cmpl $8,%edx
- jb 20f /* less then 8 bytes, go to byte copy loop */
+ jb 20f
+
+ /* If destination is not 8-byte aligned, "cache" copy to align it */
ALIGN_DESTINATION
+
+ /* Set 4x8-byte copy count and remainder */
movl %edx,%ecx
andl $63,%edx
shrl $6,%ecx
- jz 17f
+ jz 17f /* If count is 0, goto 8-byte copy */
+
+ /* Perform 4x8-byte nocache loop-copy */
1: movq (%rsi),%r8
2: movq 1*8(%rsi),%r9
3: movq 2*8(%rsi),%r10
@@ -263,26 +276,57 @@ ENTRY(__copy_user_nocache)
leaq 64(%rdi),%rdi
decl %ecx
jnz 1b
+
+ /* Set 8-byte copy count and remainder */
17: movl %edx,%ecx
andl $7,%edx
shrl $3,%ecx
- jz 20f
+ jz 20f /* If count is 0, goto 4-byte copy */
+
+ /* Perform 8-byte nocache loop-copy */
18: movq (%rsi),%r8
19: movnti %r8,(%rdi)
leaq 8(%rsi),%rsi
leaq 8(%rdi),%rdi
decl %ecx
jnz 18b
+
+ /* If no byte left, we're done */
20: andl %edx,%edx
- jz 23f
+ jz 26f
+
+ /* If destination is not 4-byte aligned, goto byte copy */
+ movl %edi,%ecx
+ andl $3,%ecx
+ jnz 23f
+
+ /* Set 4-byte copy count (1 or 0) and remainder */
movl %edx,%ecx
-21: movb (%rsi),%al
-22: movb %al,(%rdi)
+ andl $3,%edx
+ shrl $2,%ecx
+ jz 23f /* If count is 0, goto byte copy */
+
+ /* Perform 4-byte nocache copy */
+21: movl (%rsi),%r8d
+22: movnti %r8d,(%rdi)
+ leaq 4(%rsi),%rsi
+ leaq 4(%rdi),%rdi
+
+ /* If no byte left, we're done */
+ andl %edx,%edx
+ jz 26f
+
+ /* Perform byte "cache" loop-copy for the remainder */
+23: movl %edx,%ecx
+24: movb (%rsi),%al
+25: movb %al,(%rdi)
incq %rsi
incq %rdi
decl %ecx
- jnz 21b
-23: xorl %eax,%eax
+ jnz 24b
+
+ /* Finished copying; fence the prior stores */
+26: xorl %eax,%eax
ASM_CLAC
sfence
ret
@@ -290,11 +334,13 @@ ENTRY(__copy_user_nocache)
.section .fixup,"ax"
30: shll $6,%ecx
addl %ecx,%edx
- jmp 60f
+ jmp 70f
40: lea (%rdx,%rcx,8),%rdx
- jmp 60f
-50: movl %ecx,%edx
-60: sfence
+ jmp 70f
+50: lea (%rdx,%rcx,4),%rdx
+ jmp 70f
+60: movl %ecx,%edx
+70: sfence
jmp copy_user_handle_tail
.previous
@@ -318,4 +364,6 @@ ENTRY(__copy_user_nocache)
_ASM_EXTABLE(19b,40b)
_ASM_EXTABLE(21b,50b)
_ASM_EXTABLE(22b,50b)
+ _ASM_EXTABLE(24b,60b)
+ _ASM_EXTABLE(25b,60b)
ENDPROC(__copy_user_nocache)
[PATCH] x86/mm/vmfault: Make vmalloc_fault() handle large pages
by Toshi Kani
Since 4.1, ioremap() supports large page (pud/pmd) mappings in
x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
range is limited to pte mappings.
pgd_ctor() copies the kernel's pgd entries into a user process's
pgd during fork(), which makes user processes share the same page
tables for the kernel ranges. When an ioremap() call at run-time
allocates a new 2nd-level table (pud in 64-bit, pmd in PAE), a
user process needs to re-sync with the updated kernel pgd entry
via vmalloc_fault().
The following changes are made to vmalloc_fault():
64-bit:
- No change for the sync operation as set_pgd() takes care of
huge pages as well.
- Add pud_huge() and pmd_huge() to the validation code to
handle huge pages.
- Change pud_page_vaddr() to pud_pfn() since an ioremap range
is not directly mapped (although the if-statement still works
with a bogus addr).
- Change pmd_page() to pmd_pfn() since an ioremap range is not
backed by struct page (although the if-statement still works
with a bogus addr).
PAE:
- No change for the sync operation since the index3 pgd entry
covers the entire vmalloc range, which is always valid.
(A separate change will be needed if this assumption gets
changed regardless of the page size.)
- Add pmd_huge() to the validation code to handle huge pages.
This is only for completeness since vmalloc_fault() won't
happen for ioremap'd ranges as its pgd entry is always valid.
(I was unable to test this part of the changes as a result.)
Reported-by: Henning Schild <henning.schild(a)siemens.com>
Signed-off-by: Toshi Kani <toshi.kani(a)hpe.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Borislav Petkov <bp(a)alien8.de>
---
When this patch is accepted, please copy to stable up to 4.1.
---
arch/x86/mm/fault.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9..e830c71 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -287,6 +287,9 @@ static noinline int vmalloc_fault(unsigned long address)
if (!pmd_k)
return -1;
+ if (pmd_huge(*pmd_k))
+ return 0;
+
pte_k = pte_offset_kernel(pmd_k, address);
if (!pte_present(*pte_k))
return -1;
@@ -360,8 +363,6 @@ void vmalloc_sync_all(void)
* 64-bit:
*
* Handle a fault on the vmalloc area
- *
- * This assumes no large pages in there.
*/
static noinline int vmalloc_fault(unsigned long address)
{
@@ -403,17 +404,23 @@ static noinline int vmalloc_fault(unsigned long address)
if (pud_none(*pud_ref))
return -1;
- if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref))
+ if (pud_none(*pud) || pud_pfn(*pud) != pud_pfn(*pud_ref))
BUG();
+ if (pud_huge(*pud))
+ return 0;
+
pmd = pmd_offset(pud, address);
pmd_ref = pmd_offset(pud_ref, address);
if (pmd_none(*pmd_ref))
return -1;
- if (pmd_none(*pmd) || pmd_page(*pmd) != pmd_page(*pmd_ref))
+ if (pmd_none(*pmd) || pmd_pfn(*pmd) != pmd_pfn(*pmd_ref))
BUG();
+ if (pmd_huge(*pmd))
+ return 0;
+
pte_ref = pte_offset_kernel(pmd_ref, address);
if (!pte_present(*pte_ref))
return -1;
[ndctl PATCH] ndctl: quiet ndctl_bind()
by Dan Williams
When a namespace is claimed by a 'btt' or 'pfn' instance it is expected
that attempts to bind the namespace will fail. Instead it is the 'btt'
or 'pfn' device that is bound to the driver.
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
lib/libndctl.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/libndctl.c b/lib/libndctl.c
index b85ae259827b..56838c8a2a92 100644
--- a/lib/libndctl.c
+++ b/lib/libndctl.c
@@ -3004,7 +3004,7 @@ static int ndctl_bind(struct ndctl_ctx *ctx, struct kmod_module *module,
closedir(dir);
if (rc) {
- err(ctx, "%s: bind failed\n", devname);
+ dbg(ctx, "%s: bind failed\n", devname);
return -ENXIO;
}
return 0;
Re: [PATCH v8 6/9] dax: add support for fsync/msync
by Ross Zwisler
On Sat, Feb 06, 2016 at 05:33:07PM +0300, Dmitry Monakhov wrote:
> Ross Zwisler <ross.zwisler(a)linux.intel.com> writes:
<>
> > +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> IMHO it would be sane to call that function as dax_radix_entry_insert()
I think I may have actually had it named that at some point. :) I changed it
because it doesn't always insert an entry - in the read case for example we
insert a clean entry, and then on the following dax_pfn_mkwrite() we call back
in and mark it as dirty.
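Schematically (the first call is the one visible in dax_insert_mapping()
further down; the second is paraphrased and the exact arguments used by
dax_pfn_mkwrite() may differ):

        /* read fault (dax_insert_mapping): insert a clean radix tree entry */
        dax_radix_entry(mapping, vmf->pgoff, dax.sector, false /* pmd */,
                        false /* dirty */);

        /* later write fault routed to dax_pfn_mkwrite(): the same helper
         * finds the existing entry and just sets the dirty tag */
        dax_radix_entry(mapping, vmf->pgoff, dax.sector, false, true);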
<>
> > +/*
> > + * Flush the mapping to the persistent domain within the byte range of [start,
> > + * end]. This is required by data integrity operations to ensure file data is
> > + * on persistent storage prior to completion of the operation.
> > + */
> > +int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> > + loff_t end)
> > +{
> > + struct inode *inode = mapping->host;
> > + struct block_device *bdev = inode->i_sb->s_bdev;
> > + pgoff_t indices[PAGEVEC_SIZE];
> > + pgoff_t start_page, end_page;
> > + struct pagevec pvec;
> > + void *entry;
> > + int i, ret = 0;
> > +
> > + if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
> > + return -EIO;
> > +
> > + rcu_read_lock();
> > + entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
> > + rcu_read_unlock();
> > +
> > + /* see if the start of our range is covered by a PMD entry */
> > + if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > + start &= PMD_MASK;
> > +
> > + start_page = start >> PAGE_CACHE_SHIFT;
> > + end_page = end >> PAGE_CACHE_SHIFT;
> > +
> > + tag_pages_for_writeback(mapping, start_page, end_page);
> > +
> > + pagevec_init(&pvec, 0);
> > + while (1) {
> > + pvec.nr = find_get_entries_tag(mapping, start_page,
> > + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
> > + pvec.pages, indices);
> > +
> > + if (pvec.nr == 0)
> > + break;
> > +
> > + for (i = 0; i < pvec.nr; i++) {
> > + ret = dax_writeback_one(bdev, mapping, indices[i],
> > + pvec.pages[i]);
> > + if (ret < 0)
> > + return ret;
> > + }
> I think it would be more efficient to use batched locking as follows:
> spin_lock_irq(&mapping->tree_lock);
> for (i = 0; i < pvec.nr; i++) {
> struct blk_dax_ctl dax[PAGEVEC_SIZE];
> radix_tree_tag_clear(page_tree, indices[i], PAGECACHE_TAG_TOWRITE);
> /* It is also reasonable to merge adjacent dax
> * regions in to one */
> dax[i].sector = RADIX_DAX_SECTOR(entry);
> dax[i].size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
>
> }
> spin_unlock_irq(&mapping->tree_lock);
> if (blk_queue_enter(q, true) != 0)
> goto error;
> for (i = 0; i < pvec.nr; i++) {
> rc = bdev_direct_access(bdev, dax[i]);
> wb_cache_pmem(dax[i].addr, dax[i].size);
> }
> ret = blk_queue_exit(q, true)
I guess this could be more efficient, but as Jan said in his response we're
currently focused on correctness. I also wonder if it would be measurably
better?
In any case, Jan is right - you have to clear the TOWRITE tag only after
you've flushed, and you also need to include the entry verification code from
dax_writeback_one() after you grab the tree lock. Basically, I believe all
the code in dax_writeback_one() is needed - this change would essentially just
be inlining that code in dax_writeback_mapping_range() so you could do
multiple operations without giving up a lock.
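For concreteness, that ordering would look roughly like the following
(a sketch built only from the helpers visible in the quoted code, not a
tested patch; the full entry re-verification from dax_writeback_one() is
abbreviated to a tag check):

        struct blk_dax_ctl dax[PAGEVEC_SIZE];
        pgoff_t todo[PAGEVEC_SIZE];
        void *entry;
        int i, n = 0;

        /* collect and verify entries under the tree lock */
        spin_lock_irq(&mapping->tree_lock);
        for (i = 0; i < pvec.nr; i++) {
                entry = radix_tree_lookup(&mapping->page_tree, indices[i]);
                if (!entry || !radix_tree_tag_get(&mapping->page_tree,
                                indices[i], PAGECACHE_TAG_TOWRITE))
                        continue;       /* raced: gone or already written */
                dax[n].sector = RADIX_DAX_SECTOR(entry);
                dax[n].size = RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD ?
                                PMD_SIZE : PAGE_SIZE;
                todo[n++] = indices[i];
        }
        spin_unlock_irq(&mapping->tree_lock);

        /* flush outside the lock */
        for (i = 0; i < n; i++) {
                if (bdev_direct_access(bdev, &dax[i]) < 0)
                        return -EIO;
                wb_cache_pmem(dax[i].addr, dax[i].size);
        }
        wmb_pmem();

        /* only now is it safe to drop the TOWRITE tags */
        spin_lock_irq(&mapping->tree_lock);
        for (i = 0; i < n; i++)
                radix_tree_tag_clear(&mapping->page_tree, todo[i],
                                PAGECACHE_TAG_TOWRITE);
        spin_unlock_irq(&mapping->tree_lock);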
[PATCH 0/3] ACPI 6.1 NFIT Compatibility
by Dan Williams
The latest NFIT specification clarifies that a memory device will have a
control region table per function interface. Without the fix in [PATCH
1/3] Linux will miscount the number of DIMMs and fail to load the
driver. The fix is tagged for -stable on the expectation that ACPI 6.1
compatible systems will overlap current -stable development branches.
The latter two patches in the series are verification tests that can be
integrated for 4.6.
---
Dan Williams (3):
nfit: fix multi-interface dimm handling, acpi6.1 compatibility
nfit, tools/testing/nvdimm: add format interface code definitions
nfit, tools/testing/nvdimm: test multiple control regions per-dimm
drivers/acpi/nfit.c | 71 +++++++++++-----------
drivers/acpi/nfit.h | 6 ++
tools/testing/nvdimm/test/nfit.c | 125 ++++++++++++++++++++++++++++++--------
3 files changed, 141 insertions(+), 61 deletions(-)
[PATCH 1/2] block: fix pfn_mkwrite() DAX fault handler
by Ross Zwisler
Previously the pfn_mkwrite() fault handler for raw block devices called
blkdev_dax_fault() -> __dax_fault() to do a full DAX page fault. Really
what the pfn_mkwrite() fault handler needs to do is call dax_pfn_mkwrite()
to make sure that the radix tree entry for the given PTE is marked as dirty
so that a follow-up fsync or msync call will flush it durably to media.
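For context, dax_pfn_mkwrite() in the fsync/msync series boils down to
re-tagging the existing radix tree entry as dirty -- roughly the following,
paraphrased from memory, so names such as NO_SECTOR may differ slightly:

        int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
        {
                struct file *file = vma->vm_file;

                /*
                 * The entry was inserted at fault time; NO_SECTOR means no
                 * block mapping is needed here, only the dirty tag is set.
                 */
                dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR,
                                false, true);
                return VM_FAULT_NOPAGE;
        }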
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Fixes: 5a023cdba50c ("block: enable dax for raw block devices")
---
fs/block_dev.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7b9cd49..fa0507a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1730,6 +1730,12 @@ static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return __dax_fault(vma, vmf, blkdev_get_block, NULL);
}
+static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
+ struct vm_fault *vmf)
+{
+ return dax_pfn_mkwrite(vma, vmf);
+}
+
static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, unsigned int flags)
{
@@ -1761,7 +1767,7 @@ static const struct vm_operations_struct blkdev_dax_vm_ops = {
.close = blkdev_vm_close,
.fault = blkdev_dax_fault,
.pmd_fault = blkdev_dax_pmd_fault,
- .pfn_mkwrite = blkdev_dax_fault,
+ .pfn_mkwrite = blkdev_dax_pfn_mkwrite,
};
static const struct vm_operations_struct blkdev_default_vm_ops = {
--
2.5.0
Re: [PATCH v8 6/9] dax: add support for fsync/msync
by Jan Kara
On Sat 06-02-16 17:33:07, Dmitry Monakhov wrote:
> > +int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> > + loff_t end)
> > +{
> > + struct inode *inode = mapping->host;
> > + struct block_device *bdev = inode->i_sb->s_bdev;
> > + pgoff_t indices[PAGEVEC_SIZE];
> > + pgoff_t start_page, end_page;
> > + struct pagevec pvec;
> > + void *entry;
> > + int i, ret = 0;
> > +
> > + if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
> > + return -EIO;
> > +
> > + rcu_read_lock();
> > + entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
> > + rcu_read_unlock();
> > +
> > + /* see if the start of our range is covered by a PMD entry */
> > + if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > + start &= PMD_MASK;
> > +
> > + start_page = start >> PAGE_CACHE_SHIFT;
> > + end_page = end >> PAGE_CACHE_SHIFT;
> > +
> > + tag_pages_for_writeback(mapping, start_page, end_page);
> > +
> > + pagevec_init(&pvec, 0);
> > + while (1) {
> > + pvec.nr = find_get_entries_tag(mapping, start_page,
> > + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
> > + pvec.pages, indices);
> > +
> > + if (pvec.nr == 0)
> > + break;
> > +
> > + for (i = 0; i < pvec.nr; i++) {
> > + ret = dax_writeback_one(bdev, mapping, indices[i],
> > + pvec.pages[i]);
> > + if (ret < 0)
> > + return ret;
> > + }
> I think it would be more efficient to use batched locking as follows:
> spin_lock_irq(&mapping->tree_lock);
> for (i = 0; i < pvec.nr; i++) {
> struct blk_dax_ctl dax[PAGEVEC_SIZE];
> radix_tree_tag_clear(page_tree, indices[i], PAGECACHE_TAG_TOWRITE);
> /* It is also reasonable to merge adjacent dax
> * regions in to one */
> dax[i].sector = RADIX_DAX_SECTOR(entry);
> dax[i].size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
>
> }
> spin_unlock_irq(&mapping->tree_lock);
> if (blk_queue_enter(q, true) != 0)
> goto error;
> for (i = 0; i < pvec.nr; i++) {
> rc = bdev_direct_access(bdev, dax[i]);
> wb_cache_pmem(dax[i].addr, dax[i].size);
> }
> ret = blk_queue_exit(q, true)
We need to clear the radix tree tag only after flushing caches. But in
principle I agree that some batching of radix tree tag manipulations should
be doable. But frankly so far we have issues with correctness so speed is
not our main concern.
> > + }
> > + wmb_pmem();
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
> > +
> > static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> > struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > @@ -363,6 +532,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> > }
> > dax_unmap_atomic(bdev, &dax);
> >
> > + error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> > + vmf->flags & FAULT_FLAG_WRITE);
> > + if (error)
> > + goto out;
> > +
> > error = vm_insert_mixed(vma, vaddr, dax.pfn);
> >
> > out:
> > @@ -487,6 +661,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > delete_from_page_cache(page);
> > unlock_page(page);
> > page_cache_release(page);
> > + page = NULL;
> > }
> I've realized that I do not understand why the dax_fault code works at all.
> During dax_fault we want to remove the page from the mapping and insert a
> dax entry. Basically the code looks as follows:
> 0 page = find_get_page()
> 1 lock_page(page)
> 2 delete_from_page_cache(page);
> 3 unlock_page(page);
> 4 dax_insert_mapping(inode, &bh, vma, vmf);
>
> BUT what on earth protects us from another process reinserting the page
> again after step (2) but before (4)?
Nothing, it's a bug and Ross / Matthew are working on fixing it...
Honza
--
Jan Kara <jack(a)suse.com>
SUSE Labs, CR
[PATCH] devm_memremap: Fix error value when memremap failed
by Toshi Kani
devm_memremap() returns an ERR_PTR() value in case of error.
However, it returns NULL when memremap() fails. This causes
callers, such as the pmem driver, to proceed and oops later.
Change devm_memremap() to return ERR_PTR(-ENXIO) when memremap()
fails.
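For illustration, the usual devm_memremap() caller pattern that the NULL
return breaks (hypothetical snippet, not taken from the pmem driver):

        void *addr = devm_memremap(dev, offset, size, MEMREMAP_WB);

        if (IS_ERR(addr))       /* callers only test for ERR_PTR() values... */
                return PTR_ERR(addr);
        /*
         * ...so a NULL return on memremap() failure slips through here and
         * is dereferenced later.  With this change the failure is caught.
         */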
Signed-off-by: Toshi Kani <toshi.kani(a)hpe.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/memremap.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 70ee377..3427cca 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -136,8 +136,10 @@ void *devm_memremap(struct device *dev, resource_size_t offset,
if (addr) {
*ptr = addr;
devres_add(dev, ptr);
- } else
+ } else {
devres_free(ptr);
+ return ERR_PTR(-ENXIO);
+ }
return addr;
}