[PATCH 1/2] arch: consolidate CONFIG_STRICT_DEVMEM in lib/Kconfig.debug
by Dan Williams
Let all the archs that implement CONFIG_STRICT_DEVMEM use a common
definition in lib/Kconfig.debug.
Note, the 'depends on !SPARC' is due to sparc not implementing
devmem_is_allowed().
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Russell King <linux(a)arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: "David S. Miller" <davem(a)davemloft.net>
Suggested-by: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
arch/arm/Kconfig.debug | 14 --------------
arch/arm64/Kconfig.debug | 14 --------------
arch/powerpc/Kconfig.debug | 12 ------------
arch/s390/Kconfig.debug | 12 ------------
arch/tile/Kconfig | 3 ---
arch/unicore32/Kconfig.debug | 14 --------------
arch/x86/Kconfig.debug | 17 -----------------
lib/Kconfig.debug | 19 +++++++++++++++++++
8 files changed, 19 insertions(+), 86 deletions(-)
diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug
index 259c0ca9c99a..e356357d86bb 100644
--- a/arch/arm/Kconfig.debug
+++ b/arch/arm/Kconfig.debug
@@ -15,20 +15,6 @@ config ARM_PTDUMP
kernel.
If in doubt, say "N"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- depends on MMU
- ---help---
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to memory mapped peripherals.
-
- If in doubt, say Y.
-
# RMK wants arm kernels compiled with frame pointers or stack unwinding.
# If you know what you are doing and are willing to live without stack
# traces, you can get a slightly smaller kernel by setting this option to
diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
index 04fb73b973f1..e13c4bf84d9e 100644
--- a/arch/arm64/Kconfig.debug
+++ b/arch/arm64/Kconfig.debug
@@ -14,20 +14,6 @@ config ARM64_PTDUMP
kernel.
If in doubt, say "N"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- depends on MMU
- help
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to memory mapped peripherals.
-
- If in doubt, say Y.
-
config PID_IN_CONTEXTIDR
bool "Write the current PID to the CONTEXTIDR register"
help
diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 3a510f4a6b68..a0e44a9c456f 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -335,18 +335,6 @@ config PPC_EARLY_DEBUG_CPM_ADDR
platform probing is done, all platforms selected must
share the same address.
-config STRICT_DEVMEM
- def_bool y
- prompt "Filter access to /dev/mem"
- help
- This option restricts access to /dev/mem. If this option is
- disabled, you allow userspace access to all memory, including
- kernel and userspace memory. Accidental memory access is likely
- to be disastrous.
- Memory access is required for experts who want to debug the kernel.
-
- If you are unsure, say Y.
-
config FAIL_IOMMU
bool "Fault-injection capability for IOMMU"
depends on FAULT_INJECTION
diff --git a/arch/s390/Kconfig.debug b/arch/s390/Kconfig.debug
index c56878e1245f..26c5d5beb4be 100644
--- a/arch/s390/Kconfig.debug
+++ b/arch/s390/Kconfig.debug
@@ -5,18 +5,6 @@ config TRACE_IRQFLAGS_SUPPORT
source "lib/Kconfig.debug"
-config STRICT_DEVMEM
- def_bool y
- prompt "Filter access to /dev/mem"
- ---help---
- This option restricts access to /dev/mem. If this option is
- disabled, you allow userspace access to all memory, including
- kernel and userspace memory. Accidental memory access is likely
- to be disastrous.
- Memory access is required for experts who want to debug the kernel.
-
- If you are unsure, say Y.
-
config S390_PTDUMP
bool "Export kernel pagetable layout to userspace via debugfs"
depends on DEBUG_KERNEL
diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 106c21bd7f44..7b2d40db11fa 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -116,9 +116,6 @@ config ARCH_DISCONTIGMEM_DEFAULT
config TRACE_IRQFLAGS_SUPPORT
def_bool y
-config STRICT_DEVMEM
- def_bool y
-
# SMP is required for Tilera Linux.
config SMP
def_bool y
diff --git a/arch/unicore32/Kconfig.debug b/arch/unicore32/Kconfig.debug
index 1a3626239843..f075bbe1d46f 100644
--- a/arch/unicore32/Kconfig.debug
+++ b/arch/unicore32/Kconfig.debug
@@ -2,20 +2,6 @@ menu "Kernel hacking"
source "lib/Kconfig.debug"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- depends on MMU
- ---help---
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to memory mapped peripherals.
-
- If in doubt, say Y.
-
config EARLY_PRINTK
def_bool DEBUG_OCD
help
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 137dfa96aa14..1116452fcfc2 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -5,23 +5,6 @@ config TRACE_IRQFLAGS_SUPPORT
source "lib/Kconfig.debug"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- ---help---
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel. Note that with PAT support
- enabled, even in this case there are restrictions on /dev/mem
- use due to the cache aliasing requirements.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to PCI space and the BIOS code and data regions.
- This is sufficient for dosemu and X and all common users of
- /dev/mem.
-
- If in doubt, say Y.
-
config X86_VERBOSE_BOOTUP
bool "Enable verbose x86 bootup info messages"
default y
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 8c15b29d5adc..ad85145d0047 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1853,3 +1853,22 @@ source "samples/Kconfig"
source "lib/Kconfig.kgdb"
+config STRICT_DEVMEM
+ bool "Filter access to /dev/mem"
+ depends on MMU
+ depends on !SPARC
+ default y if TILE || PPC || S390
+ ---help---
+ If this option is disabled, you allow userspace (root) access to all
+ of memory, including kernel and userspace memory. Accidental
+ access to this is obviously disastrous, but specific access can
+ be used by people debugging the kernel. Note that with PAT support
+ enabled, even in this case there are restrictions on /dev/mem
+ use due to the cache aliasing requirements.
+
+ If this option is switched on, the /dev/mem file only allows
+ userspace access to PCI space and the BIOS code and data regions.
+ This is sufficient for dosemu and X and all common users of
+ /dev/mem.
+
+ If in doubt, say Y.
Re: [PATCH 1/2] arch: consolidate CONFIG_STRICT_DEVMEM in lib/Kconfig.debug
by Dan Williams
On Mon, Nov 23, 2015 at 1:53 AM, Heiko Carstens
<heiko.carstens(a)de.ibm.com> wrote:
> On Sat, Nov 21, 2015 at 07:57:02PM -0800, Dan Williams wrote:
>> Let all the archs that implement CONFIG_STRICT_DEVMEM use a common
>> definition in lib/Kconfig.debug.
>>
>> Note, the 'depends on !SPARC' is due to sparc not implementing
>> devmem_is_allowed().
>>
>> Cc: Kees Cook <keescook(a)chromium.org>
>> Cc: Russell King <linux(a)arm.linux.org.uk>
>> Cc: Catalin Marinas <catalin.marinas(a)arm.com>
>> Cc: Will Deacon <will.deacon(a)arm.com>
>> Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
>> Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
>> Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
>> Cc: Thomas Gleixner <tglx(a)linutronix.de>
>> Cc: Ingo Molnar <mingo(a)redhat.com>
>> Cc: "H. Peter Anvin" <hpa(a)zytor.com>
>> Cc: Andrew Morton <akpm(a)linux-foundation.org>
>> Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
>> Cc: "David S. Miller" <davem(a)davemloft.net>
>> Suggested-by: Arnd Bergmann <arnd(a)arndb.de>
>> Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
>> ---
>> arch/arm/Kconfig.debug | 14 --------------
>> arch/arm64/Kconfig.debug | 14 --------------
>> arch/powerpc/Kconfig.debug | 12 ------------
>> arch/s390/Kconfig.debug | 12 ------------
>> arch/tile/Kconfig | 3 ---
>> arch/unicore32/Kconfig.debug | 14 --------------
>> arch/x86/Kconfig.debug | 17 -----------------
>> lib/Kconfig.debug | 19 +++++++++++++++++++
>> 8 files changed, 19 insertions(+), 86 deletions(-)
>
> For s390
>
> Acked-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
>
>> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
>> index 8c15b29d5adc..ad85145d0047 100644
>> --- a/lib/Kconfig.debug
>> +++ b/lib/Kconfig.debug
>> @@ -1853,3 +1853,22 @@ source "samples/Kconfig"
>>
>> source "lib/Kconfig.kgdb"
>>
>> +config STRICT_DEVMEM
>> + bool "Filter access to /dev/mem"
>> + depends on MMU
>> + depends on !SPARC
>> + default y if TILE || PPC || S390
>
> I wouldn't mind if you would remove s390 from this list.
>
Will do. Thanks.
Re: [RFC PATCH] restrict /dev/mem to idle io memory ranges
by Dan Williams
On Fri, Nov 20, 2015 at 12:12 PM, Russell King - ARM Linux
<linux(a)arm.linux.org.uk> wrote:
> On Fri, Nov 20, 2015 at 09:31:33AM -0800, Dan Williams wrote:
>> This effectively promotes IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE
>> semantics by default. If userspace really believes it is safe to access
>> the memory region it can also perform the extra step of disabling an
>> active driver. This protects device address ranges with read side
>> effects and otherwise directs userspace to use the driver.
>
> I'm happy with this as long as we retain the option to disable this
> new behaviour.
>
> The reason being, when developing a driver, it is _very_ useful to
> be able to poke around in the device's (and system memory) address
> spaces with tools like devmem2 to work out what's going on when
> things go wrong.
>
> To put it another way, I think it's a good idea to disable access to
> these regions on production systems, but for driver development, we
> want to retain the ability to poke around in physical address space
> in any way we so desire.
>
Sounds ok to me, but I do think it's a good idea to default it to the
same value as STRICT_DEVMEM. Perhaps:
bool "Filter I/O access to /dev/mem" if EXPERT
default STRICT_DEVMEM
Once this is in, do we even need IORESOURCE_EXCLUSIVE? It's barely used.
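For illustration only, the semantics under discussion can be sketched in
plain userspace C. This is not the code from the patch: struct
toy_resource, the TOY_RES_* flags and toy_devmem_blocked() are
hypothetical stand-ins for the kernel's resource tree, the IORESOURCE_*
flags and iomem_is_exclusive(), and io_strict stands in for the proposed
config option. The sketch assumes the intent described in the quoted
changelog: a busy range is off limits whenever strict I/O filtering is
on, and otherwise only when it is also marked exclusive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags, loosely modelled on IORESOURCE_BUSY/_EXCLUSIVE. */
#define TOY_RES_BUSY      0x1u
#define TOY_RES_EXCLUSIVE 0x2u

/* Hypothetical flat stand-in for one entry of the iomem resource tree. */
struct toy_resource {
	uint64_t start;
	uint64_t end;		/* inclusive */
	unsigned int flags;
};

/*
 * Return true if a /dev/mem style access to addr should be refused.
 * io_strict models the proposed option: when set, every busy range is
 * treated as if it were exclusive.
 */
bool toy_devmem_blocked(const struct toy_resource *res, size_t n,
			uint64_t addr, bool io_strict)
{
	for (size_t i = 0; i < n; i++) {
		if (addr < res[i].start || addr > res[i].end)
			continue;
		/* Idle ranges (no driver attached) stay accessible. */
		if (!(res[i].flags & TOY_RES_BUSY))
			continue;
		if (io_strict || (res[i].flags & TOY_RES_EXCLUSIVE))
			return true;
	}
	return false;
}
```

In this model, once io_strict is set the exclusive flag no longer changes
the outcome for any busy range, which is the redundancy the question
above is getting at.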
[PATCH] ndctl: fix a pmd test case
by Dan Williams
With the pending kernel fixes, the O_DIRECT read test no longer crashes
the kernel. Fix the buffer size and the mishandling of the file
position.
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
lib/test-dax-pmd.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/test-dax-pmd.c b/lib/test-dax-pmd.c
index 7ea4e6c7bdb6..0fee7bee8817 100644
--- a/lib/test-dax-pmd.c
+++ b/lib/test-dax-pmd.c
@@ -106,12 +106,12 @@ static int test_pmd(int fd)
break;
case 1: /* test O_DIRECT of pre-faulted address */
sprintf(addr, "odirect data");
- if (write(fd2, addr, 4096) != 4096) {
+ if (pwrite(fd2, addr, 4096, 0) != 4096) {
faili(i);
rc = -ENXIO;
}
((char *) buf)[0] = 0;
- read(fd2, buf, sizeof(buf));
+ pread(fd2, buf, 4096, 0);
if (strcmp(buf, "odirect data") != 0) {
faili(i);
rc = -ENXIO;
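
The file position problem fixed above can be reproduced outside the test
suite. The sketch below is an assumption-laden illustration, not part of
ndctl: it uses an ordinary scratch file rather than a DAX file opened
with O_DIRECT, and open_scratch_file(), readback_sequential() and
readback_positional() are hypothetical helper names. It shows that
write() followed by read() on the same descriptor reads from the
advanced file position, whereas pwrite()/pread() with an explicit offset
of 0 read back the data just written.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an empty, already unlinked scratch file (hypothetical helper). */
int open_scratch_file(void)
{
	char path[] = "/tmp/pmd-sketch-XXXXXX";
	int fd = mkstemp(path);

	if (fd >= 0)
		unlink(path);
	return fd;
}

/*
 * write() advances the file position by len, so the following read()
 * starts after the data and hits end-of-file on a fresh file.
 */
ssize_t readback_sequential(int fd, const char *msg, char *out, size_t len)
{
	if (write(fd, msg, len) != (ssize_t)len)
		return -1;
	return read(fd, out, len);
}

/*
 * pwrite()/pread() take an explicit offset and ignore the file
 * position, so the data written at offset 0 is read back from offset 0.
 */
ssize_t readback_positional(int fd, const char *msg, char *out, size_t len)
{
	if (pwrite(fd, msg, len, 0) != (ssize_t)len)
		return -1;
	return pread(fd, out, len, 0);
}
```

Called on a fresh descriptor, readback_sequential() returns 0 bytes
because the read starts at end-of-file, while readback_positional()
returns the full length. Passing an explicit length also avoids relying
on sizeof() of the buffer variable, which yields the pointer size if the
variable is a pointer.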
[GIT PULL] libnvdimm fixes for 4.4-rc2
by Williams, Dan J
Hi Linus, please pull from...
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
...to receive:
1/ A collection of crash and deadlock fixes for DAX that are also
tagged for -stable. We will look to re-enable DAX pmd mappings in 4.5,
but for now 4.4 and -stable should disable it by default.
2/ A fixup to ext2 and ext4 to mirror the same warning emitted by XFS
when mounting with "-o dax"
This set has received a build success notification from the kbuild
robot.
The following changes since commit 8005c49d9aea74d382f474ce11afbbc7d7130bec:
Linux 4.4-rc1 (2015-11-15 17:00:27 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
for you to fetch changes up to 2e6edc95382cc36423aff18a237173ad62d5ab52:
block: protect rw_page against device teardown (2015-11-19 13:47:10 -0800)
----------------------------------------------------------------
Dan Williams (3):
ext2, ext4: warn when mounting with dax enabled
dax: disable pmd mappings
block: protect rw_page against device teardown
Yigal Korman (1):
mm, dax: fix DAX deadlocks (COW fault)
block/blk.h | 2 --
fs/Kconfig | 6 ++++++
fs/block_dev.c | 18 ++++++++++++++++--
fs/dax.c | 4 ++++
fs/ext2/super.c | 2 ++
fs/ext4/super.c | 6 +++++-
include/linux/blkdev.h | 2 ++
mm/memory.c | 8 ++++----
8 files changed, 39 insertions(+), 9 deletions(-)
commit 2e6edc95382cc36423aff18a237173ad62d5ab52
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Thu Nov 19 13:29:28 2015 -0800
block: protect rw_page against device teardown
Fix use after free crashes like the following:
general protection fault: 0000 [#1] SMP
Call Trace:
[<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
[<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem]
[<ffffffff8128fd90>] bdev_read_page+0x50/0x60
[<ffffffff812972f0>] do_mpage_readpage+0x510/0x770
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50
[<ffffffff81297657>] mpage_readpages+0x107/0x170
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff8129058d>] blkdev_readpages+0x1d/0x20
[<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310
[<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310
[<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0
[<ffffffff811c76f6>] filemap_fault+0x396/0x530
[<ffffffff811f816e>] __do_fault+0x4e/0xf0
[<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50
Cc: <stable(a)vger.kernel.org>
Cc: Jens Axboe <axboe(a)fb.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Reported-by: kbuild test robot <lkp(a)intel.com>
Acked-by: Matthew Wilcox <willy(a)linux.intel.com>
[willy: symmetry fixups]
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 0df9d41ab5d43dc5b20abc8b22a6b6d098b03994
Author: Yigal Korman <yigal(a)plexistor.com>
Date: Mon Nov 16 14:09:15 2015 +0200
mm, dax: fix DAX deadlocks (COW fault)
DAX handling of COW faults has the wrong locking sequence:
dax_fault does i_mmap_lock_read
do_cow_fault does i_mmap_unlock_write
Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
commit[3].
Original COW locking logic was introduced by Matthew here[4].
This should be applied to v4.3 as well.
[1] 0f90cc6609c7 mm, dax: fix DAX deadlocks
[2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault()
[3] 843172978bb9 dax: fix race between simultaneous faults
[4] 2e4cdab0584f mm: allow page fault handlers to perform the COW
Cc: <stable(a)vger.kernel.org>
Cc: Boaz Harrosh <boaz(a)plexistor.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner(a)redhat.com>
Cc: Jan Kara <jack(a)suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox(a)intel.com>
Acked-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Yigal Korman <yigal(a)plexistor.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit ee82c9ed41e896bd47e121d87e4628de0f2656a3
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Sun Nov 15 16:06:32 2015 -0800
dax: disable pmd mappings
While dax pmd mappings are functional in the nominal path they trigger
kernel crashes in the following paths:
BUG: unable to handle kernel paging request at ffffea0004098000
IP: [<ffffffff812362f7>] follow_trans_huge_pmd+0x117/0x3b0
[..]
Call Trace:
[<ffffffff811f6573>] follow_page_mask+0x2d3/0x380
[<ffffffff811f6708>] __get_user_pages+0xe8/0x6f0
[<ffffffff811f7045>] get_user_pages_unlocked+0x165/0x1e0
[<ffffffff8106f5b1>] get_user_pages_fast+0xa1/0x1b0
kernel BUG at arch/x86/mm/gup.c:131!
[..]
Call Trace:
[<ffffffff8106f34c>] gup_pud_range+0x1bc/0x220
[<ffffffff8106f634>] get_user_pages_fast+0x124/0x1b0
BUG: unable to handle kernel paging request at ffffea0004088000
IP: [<ffffffff81235f49>] copy_huge_pmd+0x159/0x350
[..]
Call Trace:
[<ffffffff811fad3c>] copy_page_range+0x34c/0x9f0
[<ffffffff810a0daf>] copy_process+0x1b7f/0x1e10
[<ffffffff810a11c1>] _do_fork+0x91/0x590
All of these paths are interpreting a dax pmd mapping as a transparent
huge page and making the assumption that the pfn is covered by the
memmap, i.e. that the pfn has an associated struct page. PTE mappings
do not suffer the same fate since they have the _PAGE_SPECIAL flag to
cause the gup path to fault. We can do something similar for the PMD
path, or otherwise defer pmd support for cases where a struct page is
available. For now, 4.4-rc and -stable need to disable dax pmd support
by default.
For development the "depends on BROKEN" line can be removed from
CONFIG_FS_DAX_PMD.
Cc: <stable(a)vger.kernel.org>
Cc: Jan Kara <jack(a)suse.com>
Cc: Dave Chinner <david(a)fromorbit.com>
Cc: Matthew Wilcox <willy(a)linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reported-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit ef83b6e8f40bb24b92ad73b5889732346e54a793
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Tue Sep 29 15:48:11 2015 -0400
ext2, ext4: warn when mounting with dax enabled
Similar to XFS, warn when mounting with DAX while it is still considered under
development. Also, aspects of the DAX implementation, for example
synchronization against multiple faults and faults causing block
allocation, depend on the correct implementation in the filesystem. The
maturity of a given DAX implementation is filesystem specific.
Cc: <stable(a)vger.kernel.org>
Cc: "Theodore Ts'o" <tytso(a)mit.edu>
Cc: Matthew Wilcox <willy(a)linux.intel.com>
Cc: linux-ext4(a)vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reported-by: Dave Chinner <david(a)fromorbit.com>
Acked-by: Jan Kara <jack(a)suse.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
diff --git a/block/blk.h b/block/blk.h
index da722eb786df..c43926d3d74d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq);
void __blk_queue_free_tags(struct request_queue *q);
bool __blk_end_bidi_request(struct request *rq, int error,
unsigned int nr_bytes, unsigned int bidi_bytes);
-int blk_queue_enter(struct request_queue *q, gfp_t gfp);
-void blk_queue_exit(struct request_queue *q);
void blk_freeze_queue(struct request_queue *q);
static inline void blk_queue_enter_live(struct request_queue *q)
diff --git a/fs/Kconfig b/fs/Kconfig
index da3f32f1a4e4..6ce72d8d1ee1 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -46,6 +46,12 @@ config FS_DAX
or if unsure, say N. Saying Y will increase the size of the kernel
by about 5kB.
+config FS_DAX_PMD
+ bool
+ default FS_DAX
+ depends on FS_DAX
+ depends on BROKEN
+
endif # BLOCK
# Posix ACL utility routines
diff --git a/fs/block_dev.c b/fs/block_dev.c
index bb0dfb1c7af1..c25639e907bd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -390,9 +390,17 @@ int bdev_read_page(struct block_device *bdev, sector_t sector,
struct page *page)
{
const struct block_device_operations *ops = bdev->bd_disk->fops;
+ int result = -EOPNOTSUPP;
+
if (!ops->rw_page || bdev_get_integrity(bdev))
- return -EOPNOTSUPP;
- return ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
+ return result;
+
+ result = blk_queue_enter(bdev->bd_queue, GFP_KERNEL);
+ if (result)
+ return result;
+ result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
+ blk_queue_exit(bdev->bd_queue);
+ return result;
}
EXPORT_SYMBOL_GPL(bdev_read_page);
@@ -421,14 +429,20 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
int result;
int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
const struct block_device_operations *ops = bdev->bd_disk->fops;
+
if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
+ result = blk_queue_enter(bdev->bd_queue, GFP_KERNEL);
+ if (result)
+ return result;
+
set_page_writeback(page);
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, rw);
if (result)
end_page_writeback(page);
else
unlock_page(page);
+ blk_queue_exit(bdev->bd_queue);
return result;
}
EXPORT_SYMBOL_GPL(bdev_write_page);
diff --git a/fs/dax.c b/fs/dax.c
index d1e5cb7311a1..43671b68220e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -541,6 +541,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
unsigned long pfn;
int result = 0;
+ /* dax pmd mappings are broken wrt gup and fork */
+ if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
+ return VM_FAULT_FALLBACK;
+
/* Fall back to PTEs if we're going to COW */
if (write && !(vma->vm_flags & VM_SHARED))
return VM_FAULT_FALLBACK;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 3a71cea68420..748d35afc902 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -569,6 +569,8 @@ static int parse_options(char *options, struct super_block *sb)
/* Fall through */
case Opt_dax:
#ifdef CONFIG_FS_DAX
+ ext2_msg(sb, KERN_WARNING,
+ "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
set_opt(sbi->s_mount_opt, DAX);
#else
ext2_msg(sb, KERN_INFO, "dax option not supported");
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 753f4e68b820..c9ab67da6e5a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1664,8 +1664,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
}
sbi->s_jquota_fmt = m->mount_opt;
#endif
-#ifndef CONFIG_FS_DAX
} else if (token == Opt_dax) {
+#ifdef CONFIG_FS_DAX
+ ext4_msg(sb, KERN_WARNING,
+ "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
+ sbi->s_mount_opt |= m->mount_opt;
+#else
ext4_msg(sb, KERN_INFO, "dax option not supported");
return -1;
#endif
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3fe27f8d91f0..c0d2b7927c1f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -794,6 +794,8 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
struct scsi_ioctl_command __user *);
+extern int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+extern void blk_queue_exit(struct request_queue *q);
extern void blk_start_queue(struct request_queue *q);
extern void blk_stop_queue(struct request_queue *q);
extern void blk_sync_queue(struct request_queue *q);
diff --git a/mm/memory.c b/mm/memory.c
index deb679c31f2a..c387430f06c3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3015,9 +3015,9 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
/*
* The fault handler has no page to lock, so it holds
- * i_mmap_lock for write to protect against truncate.
+ * i_mmap_lock for read to protect against truncate.
*/
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ i_mmap_unlock_read(vma->vm_file->f_mapping);
}
goto uncharge_out;
}
@@ -3031,9 +3031,9 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
/*
* The fault handler has no page to lock, so it holds
- * i_mmap_lock for write to protect against truncate.
+ * i_mmap_lock for read to protect against truncate.
*/
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ i_mmap_unlock_read(vma->vm_file->f_mapping);
}
return ret;
uncharge_out:
[PATCH] block: protect rw_page against device teardown
by Dan Williams
Fix use after free crashes like the following:
general protection fault: 0000 [#1] SMP
Call Trace:
[<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
[<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem]
[<ffffffff8128fd90>] bdev_read_page+0x50/0x60
[<ffffffff812972f0>] do_mpage_readpage+0x510/0x770
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50
[<ffffffff81297657>] mpage_readpages+0x107/0x170
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff8129058d>] blkdev_readpages+0x1d/0x20
[<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310
[<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310
[<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0
[<ffffffff811c76f6>] filemap_fault+0x396/0x530
[<ffffffff811f816e>] __do_fault+0x4e/0xf0
[<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50
Cc: <stable(a)vger.kernel.org>
Cc: Jens Axboe <axboe(a)fb.com>
Cc: Matthew Wilcox <willy(a)linux.intel.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
fs/block_dev.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index bb0dfb1c7af1..cc0af12acf94 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -390,9 +390,17 @@ int bdev_read_page(struct block_device *bdev, sector_t sector,
struct page *page)
{
const struct block_device_operations *ops = bdev->bd_disk->fops;
+ int rc = -EOPNOTSUPP;
+
if (!ops->rw_page || bdev_get_integrity(bdev))
- return -EOPNOTSUPP;
- return ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
+ return rc;
+
+ rc = blk_queue_enter(bdev->bd_queue, GFP_KERNEL);
+ if (rc)
+ return rc;
+ rc = ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
+ blk_queue_exit(bdev->bd_queue);
+ return rc;
}
EXPORT_SYMBOL_GPL(bdev_read_page);
@@ -421,14 +429,20 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
int result;
int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
const struct block_device_operations *ops = bdev->bd_disk->fops;
+
if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
+ result = blk_queue_enter(bdev->bd_queue, GFP_KERNEL);
+ if (result)
+ return result;
+
set_page_writeback(page);
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, rw);
if (result)
end_page_writeback(page);
else
unlock_page(page);
+ blk_queue_exit(bdev->bd_queue);
return result;
}
EXPORT_SYMBOL_GPL(bdev_write_page);
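
The enter/exit bracketing used in the patch above can be modelled in
userspace as a rough sketch. Everything here is hypothetical and is not
the block layer's implementation: struct toy_queue, toy_queue_enter(),
toy_queue_exit(), toy_queue_start_teardown() and toy_read_page() are
stand-ins, and a plain atomic counter plus a "dying" flag are assumed in
place of whatever the kernel uses internally. The point is only the
shape of the pattern: take a reference before calling into the driver,
refuse if teardown has started, and drop the reference afterwards so
teardown can tell when nobody is inside.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request queue's liveness tracking. */
struct toy_queue {
	atomic_int users;	/* callers currently inside the driver */
	atomic_bool dying;	/* set once teardown has started */
};

/* Take a reference, or fail with -ENODEV if teardown has begun. */
int toy_queue_enter(struct toy_queue *q)
{
	atomic_fetch_add(&q->users, 1);
	if (atomic_load(&q->dying)) {
		atomic_fetch_sub(&q->users, 1);
		return -ENODEV;
	}
	return 0;
}

/* Drop the reference taken by toy_queue_enter(). */
void toy_queue_exit(struct toy_queue *q)
{
	atomic_fetch_sub(&q->users, 1);
}

/* Mark the queue dying; true means no caller is inside right now. */
bool toy_queue_start_teardown(struct toy_queue *q)
{
	atomic_store(&q->dying, true);
	return atomic_load(&q->users) == 0;
}

/* Trivial driver callback used to exercise the sketch. */
int toy_rw_ok(void *page)
{
	(void)page;
	return 0;
}

/* Same control flow as the patched bdev_read_page(), on the toy types. */
int toy_read_page(struct toy_queue *q, int (*rw_page)(void *), void *page)
{
	int rc = -EOPNOTSUPP;

	if (!rw_page)
		return rc;

	rc = toy_queue_enter(q);
	if (rc)
		return rc;
	rc = rw_page(page);
	toy_queue_exit(q);
	return rc;
}
```

In this model a read that arrives after toy_queue_start_teardown() fails
with -ENODEV instead of calling into a driver whose state may already
have been freed.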
[RFC PATCH] restrict /dev/mem to idle io memory ranges
by Dan Williams
This effectively promotes IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE
semantics by default. If userspace really believes it is safe to access
the memory region it can also perform the extra step of disabling an
active driver. This protects device address ranges with read side
effects and otherwise directs userspace to use the driver.
Persistent memory presents a large "mistake surface" to /dev/mem as now
accidental writes can corrupt a filesystem.
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Russell King <linux(a)arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
arch/arm/Kconfig.debug | 14 --------------
arch/arm64/Kconfig.debug | 14 --------------
arch/powerpc/Kconfig.debug | 12 ------------
arch/s390/Kconfig.debug | 12 ------------
arch/tile/Kconfig | 3 ---
arch/unicore32/Kconfig.debug | 14 --------------
arch/x86/Kconfig.debug | 17 -----------------
kernel/resource.c | 3 +++
lib/Kconfig.debug | 36 ++++++++++++++++++++++++++++++++++++
9 files changed, 39 insertions(+), 86 deletions(-)
diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug
index 259c0ca9c99a..e356357d86bb 100644
--- a/arch/arm/Kconfig.debug
+++ b/arch/arm/Kconfig.debug
@@ -15,20 +15,6 @@ config ARM_PTDUMP
kernel.
If in doubt, say "N"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- depends on MMU
- ---help---
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to memory mapped peripherals.
-
- If in doubt, say Y.
-
# RMK wants arm kernels compiled with frame pointers or stack unwinding.
# If you know what you are doing and are willing to live without stack
# traces, you can get a slightly smaller kernel by setting this option to
diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
index 04fb73b973f1..e13c4bf84d9e 100644
--- a/arch/arm64/Kconfig.debug
+++ b/arch/arm64/Kconfig.debug
@@ -14,20 +14,6 @@ config ARM64_PTDUMP
kernel.
If in doubt, say "N"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- depends on MMU
- help
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to memory mapped peripherals.
-
- If in doubt, say Y.
-
config PID_IN_CONTEXTIDR
bool "Write the current PID to the CONTEXTIDR register"
help
diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 3a510f4a6b68..a0e44a9c456f 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -335,18 +335,6 @@ config PPC_EARLY_DEBUG_CPM_ADDR
platform probing is done, all platforms selected must
share the same address.
-config STRICT_DEVMEM
- def_bool y
- prompt "Filter access to /dev/mem"
- help
- This option restricts access to /dev/mem. If this option is
- disabled, you allow userspace access to all memory, including
- kernel and userspace memory. Accidental memory access is likely
- to be disastrous.
- Memory access is required for experts who want to debug the kernel.
-
- If you are unsure, say Y.
-
config FAIL_IOMMU
bool "Fault-injection capability for IOMMU"
depends on FAULT_INJECTION
diff --git a/arch/s390/Kconfig.debug b/arch/s390/Kconfig.debug
index c56878e1245f..26c5d5beb4be 100644
--- a/arch/s390/Kconfig.debug
+++ b/arch/s390/Kconfig.debug
@@ -5,18 +5,6 @@ config TRACE_IRQFLAGS_SUPPORT
source "lib/Kconfig.debug"
-config STRICT_DEVMEM
- def_bool y
- prompt "Filter access to /dev/mem"
- ---help---
- This option restricts access to /dev/mem. If this option is
- disabled, you allow userspace access to all memory, including
- kernel and userspace memory. Accidental memory access is likely
- to be disastrous.
- Memory access is required for experts who want to debug the kernel.
-
- If you are unsure, say Y.
-
config S390_PTDUMP
bool "Export kernel pagetable layout to userspace via debugfs"
depends on DEBUG_KERNEL
diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 106c21bd7f44..7b2d40db11fa 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -116,9 +116,6 @@ config ARCH_DISCONTIGMEM_DEFAULT
config TRACE_IRQFLAGS_SUPPORT
def_bool y
-config STRICT_DEVMEM
- def_bool y
-
# SMP is required for Tilera Linux.
config SMP
def_bool y
diff --git a/arch/unicore32/Kconfig.debug b/arch/unicore32/Kconfig.debug
index 1a3626239843..f075bbe1d46f 100644
--- a/arch/unicore32/Kconfig.debug
+++ b/arch/unicore32/Kconfig.debug
@@ -2,20 +2,6 @@ menu "Kernel hacking"
source "lib/Kconfig.debug"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- depends on MMU
- ---help---
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to memory mapped peripherals.
-
- If in doubt, say Y.
-
config EARLY_PRINTK
def_bool DEBUG_OCD
help
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 137dfa96aa14..1116452fcfc2 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -5,23 +5,6 @@ config TRACE_IRQFLAGS_SUPPORT
source "lib/Kconfig.debug"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- ---help---
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel. Note that with PAT support
- enabled, even in this case there are restrictions on /dev/mem
- use due to the cache aliasing requirements.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to PCI space and the BIOS code and data regions.
- This is sufficient for dosemu and X and all common users of
- /dev/mem.
-
- If in doubt, say Y.
-
config X86_VERBOSE_BOOTUP
bool "Enable verbose x86 bootup info messages"
default y
diff --git a/kernel/resource.c b/kernel/resource.c
index f150dbbe6f62..03a8b09f68a8 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1498,6 +1498,9 @@ int iomem_is_exclusive(u64 addr)
break;
if (p->end < addr)
continue;
+ if (IS_ENABLED(CONFIG_IO_STRICT_DEVMEM)
+ && p->flags & IORESOURCE_BUSY)
+ break;
if (p->flags & IORESOURCE_BUSY &&
p->flags & IORESOURCE_EXCLUSIVE) {
err = 1;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 8c15b29d5adc..a188d7757e26 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1853,3 +1853,39 @@ source "samples/Kconfig"
source "lib/Kconfig.kgdb"
+config STRICT_DEVMEM
+ bool "Filter access to /dev/mem"
+ depends on MMU
+ default y if TILE || PPC || S390
+ ---help---
+ If this option is disabled, you allow userspace (root) access to all
+ of memory, including kernel and userspace memory. Accidental
+ access to this is obviously disastrous, but specific access can
+ be used by people debugging the kernel. Note that with PAT support
+ enabled, even in this case there are restrictions on /dev/mem
+ use due to the cache aliasing requirements.
+
+ If this option is switched on, the /dev/mem file only allows
+ userspace access to PCI space and the BIOS code and data regions.
+ This is sufficient for dosemu and X and all common users of
+ /dev/mem.
+
+ If in doubt, say Y.
+
+config IO_STRICT_DEVMEM
+ bool "Filter I/O access to /dev/mem"
+ depends on STRICT_DEVMEM
+ ---help---
+ If this option is disabled, you allow userspace (root) access
+ to all io memory regardless of whether a driver is actively
+ using that range. Accidental access to this is obviously
+ disastrous, but specific access can be used by people
+ debugging the kernel.
+
+ If this option is switched on, the /dev/mem file only allows
+ userspace access to *idle* io memory ranges (any non "System
+ RAM" range listed in /proc/iomem). This may break
+ traditional users of /dev/mem if the driver using a given
+ range cannot be disabled.
+
+ If in doubt, say N.
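
The policy described in the IO_STRICT_DEVMEM help text can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: RES_BUSY, RES_EXCLUSIVE and range_is_off_limits() are hypothetical stand-ins, not the kernel's definitions, and the model assumes that with the option on every busy (driver-claimed) range is refused, which is how the help text reads.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for IORESOURCE_BUSY and IORESOURCE_EXCLUSIVE. */
#define RES_BUSY      0x1u
#define RES_EXCLUSIVE 0x2u

/*
 * Return true when /dev/mem access to a range carrying these flags
 * should be refused.
 *
 * io_strict == false: only ranges that are both busy and exclusive
 *                     are off limits.
 * io_strict == true:  any busy range is off limits; only idle ranges
 *                     remain accessible (assumption taken from the
 *                     help text, not from the kernel source).
 */
static bool range_is_off_limits(unsigned int flags, bool io_strict)
{
	if (!(flags & RES_BUSY))
		return false;		/* idle range: always allowed */
	if (io_strict)
		return true;		/* busy range, strict mode */
	return (flags & RES_EXCLUSIVE) != 0;
}
```

With io_strict set, a range a driver has claimed is refused whether or not it was also marked exclusive, which is why the help text warns that traditional /dev/mem users may break.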
[PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking
by Dan Williams
Changes since last posting [1]:
1/ Further cleanups to dax_clear_blocks(): Dropped increments of 'addr'
since we call bdev_direct_access() before the next use, and dropped the
BUG_ON for sector unaligned return values from bdev_direct_access().
2/ In [PATCH 8/8] introduce blk_dax_ctl to remove the need to have
separate dax_map_atomic and __dax_map_atomic routines. Note,
blk_dax_ctl is not passed through to drivers, it gets unpacked in
bdev_direct_access. (Willy)
3/ New [PATCH 2/8]: Disable huge page dax mappings while we resolve
various crash scenarios in this development cycle.
4/ New [PATCH 4/8]: Unmap all dax mappings at block device shutdown
I have kept the reviewed-by's received to date, let me know if these
incremental updates invalidate that review.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-November/002733.html
---
The first 4 patches in this series I consider 4.4-rc / -stable material.
The rest are for 4.5. [PATCH 4/8] needs scrutiny. It is yet another
example of where DAX behavior necessarily differs from page cache
behavior. I still maintain that we should not be surprising unaware
applications with DAX semantics, i.e. that DAX should be per-inode
opt-in, not globally enabled for all inodes at fs mount time.
The largest patch in the set, [PATCH 8/8], addresses the lifetime of the
'addr' returned by bdev_direct_access. That address is only valid while
the device driver is enabled. The new dax_map_atomic() /
dax_unmap_atomic() pairing guarantees that 'addr' stays valid for the
duration of that mapping.
While dax_map_atomic() protects against 'addr' going invalid, the new
calls to truncate_pagecache() via invalidate_inodes() protect against
the 'pfn' returned from bdev_direct_access() going invalid. Otherwise,
the storage media can be directly accessed after the driver has been
disabled.
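
The map/unmap pairing described above can be illustrated with a small userspace model. This is a sketch only, not the code from [PATCH 8/8]: struct model_dev, model_map(), model_unmap() and model_shutdown() are hypothetical names, and the model is single threaded, so shutdown refuses while a mapping is pinned where a real implementation would presumably wait for the pins to drain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device that exposes directly mapped media. */
struct model_dev {
	int   alive;	/* nonzero while the driver is enabled */
	int   users;	/* mappings currently pinned */
	char *media;	/* base address of the media */
};

/* Pin the device; the returned address stays valid until model_unmap(). */
static char *model_map(struct model_dev *dev, size_t offset)
{
	if (!dev->alive)
		return NULL;	/* driver disabled: hand out no address */
	dev->users++;
	return dev->media + offset;
}

/* Drop the pin taken by a successful model_map(). */
static void model_unmap(struct model_dev *dev, const char *addr)
{
	if (addr == NULL)
		return;		/* a failed map took no pin */
	assert(dev->users > 0);
	dev->users--;
}

/*
 * Disable the driver. Returns 0 on success, or -1 while mappings are
 * still pinned (this model refuses instead of waiting).
 */
static int model_shutdown(struct model_dev *dev)
{
	if (dev->users > 0)
		return -1;
	dev->alive = 0;
	return 0;
}
```

The point of the pairing is that every use of 'addr' sits between a successful map and its unmap, so the driver cannot be disabled underneath it, and no new address is handed out once shutdown has succeeded.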
---
[PATCH 1/8] ext2, ext4: warn when mounting with dax enabled
[PATCH 2/8] dax: disable pmd mappings
[PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault)
[PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
[PATCH 5/8] pmem, dax: clean up clear_pmem()
[PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations
[PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access()
[PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
arch/x86/include/asm/pmem.h | 7 -
block/blk.h | 2
fs/Kconfig | 6 +
fs/block_dev.c | 15 +--
fs/dax.c | 228 ++++++++++++++++++++++++++-----------------
fs/ext2/super.c | 2
fs/ext4/super.c | 6 +
fs/inode.c | 27 +++++
include/linux/blkdev.h | 19 +++-
mm/memory.c | 8 +-
mm/truncate.c | 13 ++
11 files changed, 217 insertions(+), 116 deletions(-)
[PATCH v2 00/11] DAX fsync/msync support
by Ross Zwisler
This patch series adds support for fsync/msync to DAX.
Patches 1 through 7 add various utilities that the DAX code will eventually
need, and the DAX code itself is added by patch 8. Patches 9-11 update the
three filesystems that currently support DAX, ext2, ext4 and XFS, to use
the new DAX fsync/msync code.
These patches build on the recent DAX locking changes from Dave Chinner,
Jan Kara and myself. Dave's changes for XFS and my changes for ext2 have
been merged in the v4.4 window, but Jan's are still unmerged. You can grab
them here:
http://www.spinics.net/lists/linux-ext4/msg49951.html
Ross Zwisler (11):
pmem: add wb_cache_pmem() to the PMEM API
mm: add pmd_mkclean()
pmem: enable REQ_FUA/REQ_FLUSH handling
dax: support dirty DAX entries in radix tree
mm: add follow_pte_pmd()
mm: add pgoff_mkclean()
mm: add find_get_entries_tag()
dax: add support for fsync/sync
ext2: add support for DAX fsync/msync
ext4: add support for DAX fsync/msync
xfs: add support for DAX fsync/msync
arch/x86/include/asm/pgtable.h | 5 ++
arch/x86/include/asm/pmem.h | 11 ++--
drivers/nvdimm/pmem.c | 3 +-
fs/block_dev.c | 3 +-
fs/dax.c | 140 +++++++++++++++++++++++++++++++++++++++--
fs/ext2/file.c | 14 ++++-
fs/ext4/file.c | 4 +-
fs/ext4/fsync.c | 12 +++-
fs/inode.c | 1 +
fs/xfs/xfs_file.c | 18 ++++--
include/linux/dax.h | 6 ++
include/linux/fs.h | 1 +
include/linux/mm.h | 2 +
include/linux/pagemap.h | 3 +
include/linux/pmem.h | 22 ++++++-
include/linux/radix-tree.h | 8 +++
include/linux/rmap.h | 5 ++
mm/filemap.c | 71 ++++++++++++++++++++-
mm/huge_memory.c | 14 ++---
mm/memory.c | 38 ++++++++---
mm/rmap.c | 51 +++++++++++++++
mm/truncate.c | 62 ++++++++++--------
22 files changed, 425 insertions(+), 69 deletions(-)
--
2.1.0
dax pmd fault handler never returns to userspace
by Jeff Moyer
Hi,
When running the nvml library's test suite against an ext4 file system
mounted with -o dax, I ran into an issue where many of the tests would
simply time out. The problem appears to be that the pmd fault handler
never returns to userspace (the application is doing a memcpy of 512
bytes into pmem). Here's the 'perf report -g' output:
- 88.30% 0.01% blk_non_zero.st libc-2.17.so [.] __memmove_ssse3_back
- 88.30% __memmove_ssse3_back
- 66.63% page_fault
- 66.47% do_page_fault
- 66.16% __do_page_fault
- 63.38% handle_mm_fault
- 61.15% ext4_dax_pmd_fault
- 45.04% __dax_pmd_fault
- 37.05% vmf_insert_pfn_pmd
- track_pfn_insert
- 35.58% lookup_memtype
- 33.80% pat_pagerange_is_ram
- 33.40% walk_system_ram_range
- 31.63% find_next_iomem_res
21.78% strcmp
And here's 'perf top':
Samples: 2M of event 'cycles:pp', Event count (approx.): 56080150519
Overhead Shared Object Symbol
22.55% [kernel] [k] strcmp
20.33% [unknown] [k] 0x00007f9f549ef3f3
10.01% [kernel] [k] native_irq_return_iret
9.54% [kernel] [k] find_next_iomem_res
3.00% [jbd2] [k] start_this_handle
This is easily reproduced by doing the following:
git clone https://github.com/pmem/nvml.git
cd nvml
make
make test
cd src/test/blk_non_zero
./blk_non_zero.static-nondebug 512 /path/to/ext4/dax/fs/testfile1 c 1073741824 w:0
I also ran the test suite against xfs, and the problem is not present
there. However, I did not verify that the xfs tests were getting pmd
faults.
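
For readers without nvml at hand, the access pattern described in the report (a 512-byte memcpy into a mapped file) can be approximated with plain POSIX calls. This is a rough sketch and not the nvml test: copy_into_mapping() and CHUNK are hypothetical names, and whether the store takes the pmd fault path at all depends on mapping size and alignment, which this sketch does not control.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 512	/* bytes copied, matching the size in the report */

/*
 * Map 'path' shared, copy CHUNK bytes to the start of the mapping and
 * flush it. Returns 0 on success, -1 on any error.
 */
static int copy_into_mapping(const char *path, size_t file_len)
{
	char src[CHUNK];
	void *map;
	int fd, rc = -1;

	if (file_len < CHUNK)
		return -1;
	memset(src, 0xab, sizeof(src));

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)file_len) != 0)
		goto out_close;

	map = mmap(NULL, file_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	memcpy(map, src, CHUNK);	/* first store into the mapping faults */
	if (msync(map, file_len, MS_SYNC) == 0)
		rc = 0;
	munmap(map, file_len);
out_close:
	close(fd);
	return rc;
}
```

Pointing 'path' at a file on the dax-mounted file system with a large 'file_len' would exercise the same kind of store; on an affected kernel the memcpy is where a hang like the one reported would be expected to show up.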
I'm happy to help diagnose the problem further, if necessary.
Cheers,
Jeff