[ndctl PATCH] ndctl, create-namespace: fix minimum alignment detection
by Dan Williams
Commit 0c31ce0a2875 "ndctl, create-namespace: fix --align= default..."
tried to make casual testing with nfit_test easier by auto-detecting the
alignment. However, this auto-detection always falls back to 4K
alignment on kernels that do not export the region 'resource' attribute.
Those kernels, like v4.9.4, may also be missing this fix: commit
cd755677d944 "libnvdimm, pfn: fix memmap reservation size versus 4K
alignment".
Given the chance to fail to initialize a namespace in the 4K case on
older kernels, be more conservative about falling back to 4K alignment.
Link: https://github.com/pmem/ndctl/issues/34
Reported-by: Maciej Ramotowski <maciej.ramotowski(a)intel.com>
Fixes: 0c31ce0a2875 ("ndctl, create-namespace: fix --align= default...")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
ndctl/namespace.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index ceb9e7a9b8af..aef356abbee1 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -462,7 +462,7 @@ static int validate_namespace_options(struct ndctl_region *region,
struct ndctl_namespace *ndns, struct parsed_parameters *p)
{
const char *region_name = ndctl_region_get_devname(region);
- unsigned long long size_align = SZ_4K, units = 1;
+ unsigned long long size_align = SZ_4K, units = 1, resource;
unsigned int ways;
int rc = 0;
@@ -563,8 +563,9 @@ static int validate_namespace_options(struct ndctl_region *region,
* the nfit_test use case where it is backed by vmalloc
* memory.
*/
- if (param.align_default && (ndctl_region_get_resource(region)
- & (SZ_2M - 1))) {
+ resource = ndctl_region_get_resource(region);
+ if (param.align_default && resource < ULLONG_MAX
+ && (resource & (SZ_2M - 1))) {
debug("%s: falling back to a 4K alignment\n",
region_name);
p->align = SZ_4K;
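
The decision the hunk above encodes can be sketched as a standalone function. This is only an illustration under stated assumptions: pick_default_align() is a hypothetical helper, not an ndctl function; the SZ_* constants are defined locally; and the non-fallback default is assumed to be 2M, as the SZ_2M mask in the patch suggests. ULLONG_MAX stands for "the kernel did not export the region 'resource' attribute", matching the new check.

```c
#include <assert.h>
#include <limits.h>

#define SZ_4K 0x1000ULL
#define SZ_2M 0x200000ULL

/*
 * Hypothetical helper, not an ndctl function. 'resource' is the region
 * base address; ULLONG_MAX means the kernel did not export it.
 */
static unsigned long long pick_default_align(unsigned long long resource)
{
	/* Base address known and not 2M aligned: fall back to 4K. */
	if (resource < ULLONG_MAX && (resource & (SZ_2M - 1)))
		return SZ_4K;
	/* Aligned, or unknown on an older kernel: keep the assumed 2M default. */
	return SZ_2M;
}

/* Usage: the three cases the commit message distinguishes. */
void check_default_align(void)
{
	assert(pick_default_align(0x100000000ULL) == SZ_2M); /* 2M-aligned base */
	assert(pick_default_align(0x100001000ULL) == SZ_4K); /* only 4K aligned */
	assert(pick_default_align(ULLONG_MAX) == SZ_2M);     /* attribute missing */
}
```

Before the fix, the third case would have taken the 4K branch, because an all-ones value always has low bits set under the 2M mask.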
4 years, 3 months
[PATCH 2 1/2] dax: change bdev_dax_supported() to take a block_device as input
by Dave Jiang
This is in preparation to support DAX for the realtime device on XFS.
bdev_dax_supported() will be taking a block_device as input instead of
a superblock. The only place a super_block is used in this function is
providing the id for debug outputs. Also, __bdev_dax_supported() will be
removed: bdev_dax_supported() is only an inline wrapper that calls it
directly, and it is not referenced by any other code.
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
drivers/dax/super.c | 33 ++++++++++++++++-----------------
fs/ext2/super.c | 2 +-
fs/ext4/super.c | 2 +-
fs/xfs/xfs_ioctl.c | 2 +-
fs/xfs/xfs_super.c | 2 +-
include/linux/dax.h | 8 ++------
6 files changed, 22 insertions(+), 27 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 473af694ad1c..3bdf6787b6df 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -70,11 +70,10 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
return fs_dax_get_by_host(bdev->bd_disk->disk_name);
}
EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
-#endif
/**
- * __bdev_dax_supported() - Check if the device supports dax for filesystem
- * @sb: The superblock of the device
+ * bdev_dax_supported() - Check if the device supports dax for filesystem
+ * @bdev: The block_device of the device
* @blocksize: The block size of the device
*
* This is a library function for filesystems to check if the block device
@@ -82,9 +81,8 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
*
* Return: negative errno if unsupported, 0 if supported.
*/
-int __bdev_dax_supported(struct super_block *sb, int blocksize)
+int bdev_dax_supported(struct block_device *bdev, int blocksize)
{
- struct block_device *bdev = sb->s_bdev;
struct dax_device *dax_dev;
pgoff_t pgoff;
int err, id;
@@ -93,22 +91,22 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
long len;
if (blocksize != PAGE_SIZE) {
- pr_debug("VFS (%s): error: unsupported blocksize for dax\n",
- sb->s_id);
+ pr_debug("bdev (%d:%d): error: unsupported blocksize for dax\n",
+ MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
return -EINVAL;
}
err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
if (err) {
- pr_debug("VFS (%s): error: unaligned partition for dax\n",
- sb->s_id);
+ pr_debug("bdev (%d:%d): error: unaligned partition for dax\n",
+ MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
return err;
}
dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
if (!dax_dev) {
- pr_debug("VFS (%s): error: device does not support dax\n",
- sb->s_id);
+ pr_debug("bdev (%d:%d): error: device does not support dax\n",
+ MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
return -EOPNOTSUPP;
}
@@ -119,8 +117,8 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
put_dax(dax_dev);
if (len < 1) {
- pr_debug("VFS (%s): error: dax access failed (%ld)\n",
- sb->s_id, len);
+ pr_debug("bdev (%d:%d): error: dax access failed (%ld)\n",
+ MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev), len);
return len < 0 ? len : -EIO;
}
@@ -128,15 +126,16 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
|| pfn_t_devmap(pfn))
/* pass */;
else {
- pr_debug("VFS (%s): error: dax support not enabled\n",
- sb->s_id);
+ pr_debug("bdev (%d:%d): error: dax support not enabled\n",
+ MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
return -EOPNOTSUPP;
}
return 0;
}
-EXPORT_SYMBOL_GPL(__bdev_dax_supported);
-#endif
+EXPORT_SYMBOL_GPL(bdev_dax_supported);
+#endif /* CONFIG_FS_DAX */
+#endif /* CONFIG_BLOCK */
enum dax_device_flags {
/* !alive + rcu grace period == no new operations / mappings */
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 38f9222606ee..a0de090b18d5 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -958,7 +958,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
- err = bdev_dax_supported(sb, blocksize);
+ err = bdev_dax_supported(sb->s_bdev, blocksize);
if (err) {
ext2_msg(sb, KERN_ERR,
"DAX unsupported by block device. Turning off DAX.");
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 18873ea89e08..7ebc8d5cab8c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3712,7 +3712,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
" that may contain inline data");
sbi->s_mount_opt &= ~EXT4_MOUNT_DAX;
}
- err = bdev_dax_supported(sb, blocksize);
+ err = bdev_dax_supported(sb->s_bdev, blocksize);
if (err) {
ext4_msg(sb, KERN_ERR,
"DAX unsupported by block device. Turning off DAX.");
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 20dc65fef6a4..5fd4d50eb1c6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1102,7 +1102,7 @@ xfs_ioctl_setattr_dax_invalidate(
if (fa->fsx_xflags & FS_XFLAG_DAX) {
if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
return -EINVAL;
- if (bdev_dax_supported(sb, sb->s_blocksize) < 0)
+ if (bdev_dax_supported(sb->s_bdev, sb->s_blocksize) < 0)
return -EINVAL;
}
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1dacccc367f8..e8a687232614 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1652,7 +1652,7 @@ xfs_fs_fill_super(
xfs_warn(mp,
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
- error = bdev_dax_supported(sb, sb->s_blocksize);
+ error = bdev_dax_supported(sb->s_bdev, sb->s_blocksize);
if (error) {
xfs_alert(mp,
"DAX unsupported by block device. Turning off DAX.");
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 5258346c558c..a31a7e3f929b 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -40,11 +40,7 @@ static inline void put_dax(struct dax_device *dax_dev)
int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
#if IS_ENABLED(CONFIG_FS_DAX)
-int __bdev_dax_supported(struct super_block *sb, int blocksize);
-static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
-{
- return __bdev_dax_supported(sb, blocksize);
-}
+int bdev_dax_supported(struct block_device *bdev, int blocksize);
static inline struct dax_device *fs_dax_get_by_host(const char *host)
{
@@ -58,7 +54,7 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
#else
-static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
+static inline int bdev_dax_supported(struct block_device *bdev, int blocksize)
{
return -EOPNOTSUPP;
}
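
With the superblock gone, the debug messages identify the device by major:minor instead of sb->s_id. A minimal userspace sketch of that formatting follows. The macros are local copies of the kernel's dev_t split (20 minor bits), and format_bdev_prefix() is a hypothetical helper used only for illustration. It prints with %u rather than the patch's %d because the values are unsigned here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Local copies of the kernel's dev_t split, for userspace illustration. */
#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

/* Hypothetical helper: build the "bdev (major:minor)" message prefix. */
static int format_bdev_prefix(char *buf, size_t len, unsigned int dev)
{
	return snprintf(buf, len, "bdev (%u:%u)", MAJOR(dev), MINOR(dev));
}

/* Usage: a device number with major 259, minor 0. */
void check_bdev_prefix(void)
{
	char buf[32];

	format_bdev_prefix(buf, sizeof(buf), MKDEV(259u, 0u));
	assert(strcmp(buf, "bdev (259:0)") == 0);
}
```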
4 years, 3 months
[PATCH 1/2] ndctl: add check for update firmware supported
by Dave Jiang
Add generic and Intel support functions to check whether firmware update
is supported by the kernel.
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
ndctl/lib/firmware.c | 11 +++++++++++
ndctl/lib/intel.c | 24 ++++++++++++++++++++++++
ndctl/lib/libndctl.sym | 1 +
ndctl/lib/private.h | 1 +
ndctl/libndctl.h | 1 +
ndctl/update.c | 4 ++++
6 files changed, 42 insertions(+)
diff --git a/ndctl/lib/firmware.c b/ndctl/lib/firmware.c
index f6deec5d..78d753ca 100644
--- a/ndctl/lib/firmware.c
+++ b/ndctl/lib/firmware.c
@@ -107,3 +107,14 @@ ndctl_cmd_fw_xlat_firmware_status(struct ndctl_cmd *cmd)
else
return FW_EUNKNOWN;
}
+
+NDCTL_EXPORT bool
+ndctl_dimm_fw_update_supported(struct ndctl_dimm *dimm)
+{
+ struct ndctl_dimm_ops *ops = dimm->ops;
+
+ if (ops && ops->fw_update_supported)
+ return ops->fw_update_supported(dimm);
+ else
+ return false;
+}
diff --git a/ndctl/lib/intel.c b/ndctl/lib/intel.c
index cee5204c..7d976c50 100644
--- a/ndctl/lib/intel.c
+++ b/ndctl/lib/intel.c
@@ -650,6 +650,29 @@ intel_dimm_cmd_new_lss(struct ndctl_dimm *dimm)
return cmd;
}
+static bool intel_dimm_fw_update_supported(struct ndctl_dimm *dimm)
+{
+ struct ndctl_ctx *ctx = ndctl_dimm_get_ctx(dimm);
+
+ if (!ndctl_dimm_is_cmd_supported(dimm, ND_CMD_CALL)) {
+ dbg(ctx, "unsupported cmd: %d\n", ND_CMD_CALL);
+ return false;
+ }
+
+ /*
+ * We only need to check FW_GET_INFO. If that isn't supported then
+ * the others aren't either.
+ */
+ if (test_dimm_dsm(dimm, ND_INTEL_FW_GET_INFO)
+ == DIMM_DSM_UNSUPPORTED) {
+ dbg(ctx, "unsupported function: %d\n",
+ ND_INTEL_FW_GET_INFO);
+ return false;
+ }
+
+ return true;
+}
+
struct ndctl_dimm_ops * const intel_dimm_ops = &(struct ndctl_dimm_ops) {
.cmd_desc = intel_cmd_desc,
.new_smart = intel_dimm_cmd_new_smart,
@@ -703,4 +726,5 @@ struct ndctl_dimm_ops * const intel_dimm_ops = &(struct ndctl_dimm_ops) {
.fw_fquery_get_fw_rev = intel_cmd_fw_fquery_get_fw_rev,
.fw_xlat_firmware_status = intel_cmd_fw_xlat_firmware_status,
.new_ack_shutdown_count = intel_dimm_cmd_new_lss,
+ .fw_update_supported = intel_dimm_fw_update_supported,
};
diff --git a/ndctl/lib/libndctl.sym b/ndctl/lib/libndctl.sym
index af9b7d54..cc580f9c 100644
--- a/ndctl/lib/libndctl.sym
+++ b/ndctl/lib/libndctl.sym
@@ -344,4 +344,5 @@ global:
ndctl_cmd_fw_fquery_get_fw_rev;
ndctl_cmd_fw_xlat_firmware_status;
ndctl_dimm_cmd_new_ack_shutdown_count;
+ ndctl_dimm_fw_update_supported;
} LIBNDCTL_13;
diff --git a/ndctl/lib/private.h b/ndctl/lib/private.h
index 015eeb2d..ae4454cf 100644
--- a/ndctl/lib/private.h
+++ b/ndctl/lib/private.h
@@ -325,6 +325,7 @@ struct ndctl_dimm_ops {
unsigned long long (*fw_fquery_get_fw_rev)(struct ndctl_cmd *);
enum ND_FW_STATUS (*fw_xlat_firmware_status)(struct ndctl_cmd *);
struct ndctl_cmd *(*new_ack_shutdown_count)(struct ndctl_dimm *);
+ bool (*fw_update_supported)(struct ndctl_dimm *);
};
struct ndctl_dimm_ops * const intel_dimm_ops;
diff --git a/ndctl/libndctl.h b/ndctl/libndctl.h
index 9db775ba..08030d35 100644
--- a/ndctl/libndctl.h
+++ b/ndctl/libndctl.h
@@ -625,6 +625,7 @@ unsigned int ndctl_cmd_fw_start_get_context(struct ndctl_cmd *cmd);
unsigned long long ndctl_cmd_fw_fquery_get_fw_rev(struct ndctl_cmd *cmd);
enum ND_FW_STATUS ndctl_cmd_fw_xlat_firmware_status(struct ndctl_cmd *cmd);
struct ndctl_cmd *ndctl_dimm_cmd_new_ack_shutdown_count(struct ndctl_dimm *dimm);
+bool ndctl_dimm_fw_update_supported(struct ndctl_dimm *dimm);
#ifdef __cplusplus
} /* extern "C" */
diff --git a/ndctl/update.c b/ndctl/update.c
index 0f0f0d81..72f5839b 100644
--- a/ndctl/update.c
+++ b/ndctl/update.c
@@ -454,6 +454,10 @@ static int get_ndctl_dimm(struct update_context *uctx, void *ctx)
ndctl_dimm_foreach(bus, dimm) {
if (!util_dimm_filter(dimm, uctx->dimm_id))
continue;
+ if (!ndctl_dimm_fw_update_supported(dimm)) {
+ error("DIMM firmware update not supported by the kernel.");
+ return -ENOTSUP;
+ }
uctx->dimm = dimm;
return 0;
}
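
The generic entry point above is an optional-callback dispatch: a DIMM family that provides no fw_update_supported hook is reported as unsupported. The sketch below shows the same pattern with mock types. Every mock_* name is hypothetical and stands in for the private libndctl structures, and the has_fw_dsm flag replaces the ND_CMD_CALL and ND_INTEL_FW_GET_INFO probes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock types for illustration; the real ones live in ndctl/lib/private.h. */
struct mock_dimm;

struct mock_dimm_ops {
	bool (*fw_update_supported)(struct mock_dimm *dimm);
};

struct mock_dimm {
	const struct mock_dimm_ops *ops;
	bool has_fw_dsm; /* stand-in for the command / DSM probes */
};

/* Vendor hook: in this mock it only reports the probe result. */
static bool mock_vendor_fw_update_supported(struct mock_dimm *dimm)
{
	return dimm->has_fw_dsm;
}

/* Generic entry point: a missing ops table or hook means "unsupported". */
static bool mock_dimm_fw_update_supported(struct mock_dimm *dimm)
{
	const struct mock_dimm_ops *ops = dimm->ops;

	if (ops && ops->fw_update_supported)
		return ops->fw_update_supported(dimm);
	return false;
}

/* Usage: no ops, ops without the hook, and a vendor that supports it. */
void check_fw_update_dispatch(void)
{
	static const struct mock_dimm_ops empty_ops = { NULL };
	static const struct mock_dimm_ops vendor_ops = {
		mock_vendor_fw_update_supported
	};
	struct mock_dimm no_ops = { NULL, true };
	struct mock_dimm no_hook = { &empty_ops, true };
	struct mock_dimm vendor = { &vendor_ops, true };

	assert(!mock_dimm_fw_update_supported(&no_ops));
	assert(!mock_dimm_fw_update_supported(&no_hook));
	assert(mock_dimm_fw_update_supported(&vendor));
}
```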
4 years, 3 months
[GIT PULL] libnvdimm fixes for 4.16-rc4
by Williams, Dan J
Hi Linus, please pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
...to receive a 4.16 regression fix, 3 fixes for -stable, and a cleanup
fix.
* During the merge window support for the new ACPI NVDIMM Platform
Capabilities structure disabled support for "deep flush", a force-unit-
access like mechanism for persistent memory. Restore that mechanism.
* VFIO, like RDMA, is yet one more memory registration / pinning
interface that is incompatible with Filesystem-DAX. Disable long term
pins of Filesystem-DAX mappings via VFIO.
* The Filesystem-DAX detection to prevent long term pins mistakenly
also disabled Device-DAX pins, which are not subject to the same
block-map collision concerns.
* Similar to the setup path, softlockup warnings can trigger in the
shutdown path for large persistent memory namespaces. Teach
for_each_device_pfn() to perform cond_resched() in all cases.
* Boaz noticed that the might_sleep() in dax_direct_access() is stale
as of the v4.15 kernel.
These have received a build success notification from the 0day robot,
and the longterm pin fixes have appeared in -next. However, I recently
rebased the tree to remove some other fixes that need to be reworked
after review feedback.
---
The following changes since commit 91ab883eb21325ad80f3473633f794c78ac87f51:
Linux 4.16-rc2 (2018-02-18 17:29:42 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
for you to fetch changes up to 949b93250a566cc7a578b4f829cf76b70d19a62c:
memremap: fix softlockup reports at teardown (2018-03-02 19:34:50 -0800)
----------------------------------------------------------------
Boaz Harrosh (1):
dax: ->direct_access does not sleep anymore
Dan Williams (3):
dax: fix vma_is_fsdax() helper
vfio: disable filesystem-dax page pinning
memremap: fix softlockup reports at teardown
Dave Jiang (1):
libnvdimm: re-enable deep flush for pmem devices via fsync()
drivers/dax/super.c | 6 ------
drivers/nvdimm/pmem.c | 3 +--
drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++++++---
include/linux/fs.h | 2 +-
kernel/memremap.c | 15 ++++++++++-----
5 files changed, 27 insertions(+), 17 deletions(-)
---
commit 9d4949b4935831be10534d5432bf611285a572a5
Author: Boaz Harrosh <boazh(a)netapp.com>
Date: Mon Feb 26 18:50:35 2018 +0200
dax: ->direct_access does not sleep anymore
In Patch:
[7a862fb] brd: remove dax support
Dan Williams has removed the only might_sleep
implementation of ->direct_access.
So we no longer need to check for it.
CC: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Boaz Harrosh <boazh(a)netapp.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 230f5a8969d8345fc9bbe3683f068246cf1be4b8
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Wed Feb 21 17:08:01 2018 -0800
dax: fix vma_is_fsdax() helper
Gerd reports that ->i_mode may contain other bits besides S_IFCHR. Use
S_ISCHR() instead. Otherwise, get_user_pages_longterm() may fail on
device-dax instances when those are meant to be explicitly allowed.
Fixes: 2bb6d2837083 ("mm: introduce get_user_pages_longterm")
Cc: <stable(a)vger.kernel.org>
Reported-by: Gerd Rausch <gerd.rausch(a)oracle.com>
Acked-by: Jane Chu <jane.chu(a)oracle.com>
Reported-by: Haozhong Zhang <haozhong.zhang(a)intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 94db151dc89262bfa82922c44e8320cea2334667
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Sun Feb 4 10:34:02 2018 -0800
vfio: disable filesystem-dax page pinning
Filesystem-DAX is incompatible with 'longterm' page pinning. Without
page cache indirection a DAX mapping maps filesystem blocks directly.
This means that the filesystem must not modify a file's block map while
any page in a mapping is pinned. In order to prevent the situation of
userspace holding off filesystem operations indefinitely, disallow
'longterm' Filesystem-DAX mappings.
RDMA has the same conflict and the plan there is to add a 'with lease'
mechanism to allow the kernel to notify userspace that the mapping is
being torn down for block-map maintenance. Perhaps something similar can
be put in place for vfio.
Note that xfs and ext4 still report:
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
...at mount time, and resolving the dax-dma-vs-truncate problem is one
of the last hurdles to remove that designation.
Acked-by: Alex Williamson <alex.williamson(a)redhat.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: kvm(a)vger.kernel.org
Cc: <stable(a)vger.kernel.org>
Reported-by: Haozhong Zhang <haozhong.zhang(a)intel.com>
Tested-by: Haozhong Zhang <haozhong.zhang(a)intel.com>
Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 5fdf8e5ba5666fe153bd61f851a40078a6347822
Author: Dave Jiang <dave.jiang(a)intel.com>
Date: Fri Mar 2 19:31:40 2018 -0800
libnvdimm: re-enable deep flush for pmem devices via fsync()
Re-enable deep flush so that users always have a way to be sure that a
write makes it all the way out to media. Writes from the PMEM driver
always arrive at the NVDIMM since movnt is used to bypass the cache, and
the driver relies on the ADR (Asynchronous DRAM Refresh) mechanism to
flush write buffers on power failure. The Deep Flush mechanism is there
to explicitly write buffers to protect against (rare) ADR failure. This
change prevents a regression in deep flush behavior so that applications
can continue to depend on fsync() as a mechanism to trigger deep flush
in the filesystem-DAX case.
Fixes: 06e8ccdab15f4 ("acpi: nfit: Add support for detect platform CPU cache...")
Reviewed-by: Jeff Moyer <jmoyer(a)redhat.com>
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
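
From the application side, the contract this commit restores is the ordinary write-then-fsync() sequence. A minimal sketch follows, assuming the file descriptor refers to a file on a filesystem-DAX mount backed by the pmem driver. write_durably() is a hypothetical helper, and on any other filesystem the same code is just a normal fsync().

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the whole buffer, then fsync().
 * Returns 0 on success and -1 on failure.
 */
static int write_durably(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0)
			return -1;
		p += n;
		len -= (size_t)n;
	}
	/* On filesystem-DAX this is the call the commit ties to deep flush. */
	return fsync(fd);
}

/* Usage: exercised here on a temporary file, not on real pmem. */
void check_write_durably(void)
{
	char path[] = "/tmp/deepflush-XXXXXX";
	int fd = mkstemp(path);

	assert(fd >= 0);
	assert(write_durably(fd, "data", 4) == 0);
	close(fd);
	unlink(path);
}
```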
commit 949b93250a566cc7a578b4f829cf76b70d19a62c
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Tue Feb 6 19:34:11 2018 -0800
memremap: fix softlockup reports at teardown
The cond_resched() currently in the setup path needs to be duplicated in
the teardown path. Rather than require each instance of
for_each_device_pfn() to open code the same sequence, embed it in the
helper.
Link: https://github.com/intel/ixpdimm_sw/issues/11
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: <stable(a)vger.kernel.org>
Fixes: 71389703839e ("mm, zone_device: Replace {get, put}_zone_device_page()...")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
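
Embedding the reschedule point in the iteration helper, as the last commit describes, can be sketched in userspace as follows. All mock_* names are hypothetical, mock_cond_resched() only counts calls instead of yielding, and the 1024-pfn interval is an assumption made for the illustration.

```c
#include <assert.h>

static unsigned long yield_count; /* counts simulated cond_resched() calls */

static void mock_cond_resched(void)
{
	yield_count++;
}

/* Step to the next pfn, yielding once per 1024 pfns (assumed interval). */
static unsigned long mock_pfn_next(unsigned long pfn)
{
	if (pfn % 1024 == 0)
		mock_cond_resched();
	return pfn + 1;
}

/* Every user of the iterator gets the yield without open coding it. */
#define mock_for_each_pfn(pfn, first, end) \
	for ((pfn) = (first); (pfn) < (end); (pfn) = mock_pfn_next(pfn))

/* Stand-in for a setup or teardown loop over a pfn range. */
static unsigned long mock_walk(unsigned long first, unsigned long end)
{
	unsigned long pfn, visited = 0;

	mock_for_each_pfn(pfn, first, end)
		visited++;
	return visited;
}

/* Usage: 4096 pfns are visited, with a yield at pfn 0, 1024, 2048, 3072. */
void check_pfn_walk(void)
{
	yield_count = 0;
	assert(mock_walk(0, 4096) == 4096);
	assert(yield_count == 4);
}
```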
4 years, 3 months
[PATCH v5 00/12] vfio, dax: prevent long term filesystem-dax pins and other fixes
by Dan Williams
Changes since v4 [1]:
* Fix the changelog of "dax: introduce IS_DEVDAX() and IS_FSDAX()" to
better clarify the need for new helpers (Jan)
* Replace dax_sem_is_locked() with dax_sem_assert_held() (Jan)
* Use file_inode() in vma_is_dax() (Jan)
* Resend the full series to linux-xfs@ (Dave)
* Collect Jan's Reviewed-by
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-February/014271.html
---
The vfio interface, like RDMA, wants to set up long term (indefinite)
pins of the pages backing an address range so that a guest or userspace
driver can perform DMA with the physical address. Given that this
pinning may lead to filesystem operations deadlocking in the
filesystem-dax case, the pinning request needs to be rejected.
The longer term fix for vfio, RDMA, and any other long term pin user, is
to provide a 'pin with lease' mechanism. Similar to the leases that are
held for pNFS RDMA layouts, this userspace lease gives the kernel a way
to notify userspace that the block layout of the file is changing and
the kernel is revoking access to pinned pages.
Related to this change is the discovery that vma_is_fsdax() was causing
device-dax inode detection to fail. That led to a series of fixes and
cleanups to make sure that S_DAX is defined correctly in the
CONFIG_FS_DAX=n + CONFIG_DEV_DAX=y case.
---
Dan Williams (12):
dax: fix vma_is_fsdax() helper
dax: introduce IS_DEVDAX() and IS_FSDAX()
ext2, dax: finish implementing dax_sem helpers
ext2, dax: define ext2_dax_*() infrastructure in all cases
ext4, dax: define ext4_dax_*() infrastructure in all cases
ext2, dax: replace IS_DAX() with IS_FSDAX()
ext4, dax: replace IS_DAX() with IS_FSDAX()
xfs, dax: replace IS_DAX() with IS_FSDAX()
mm, dax: replace IS_DAX() with IS_DEVDAX() or IS_FSDAX()
fs, dax: kill IS_DAX()
dax: fix S_DAX definition
vfio: disable filesystem-dax page pinning
drivers/vfio/vfio_iommu_type1.c | 18 ++++++++++++++--
fs/ext2/ext2.h | 6 +++++
fs/ext2/file.c | 19 +++++------------
fs/ext2/inode.c | 10 ++++-----
fs/ext4/file.c | 18 +++++-----------
fs/ext4/inode.c | 4 ++--
fs/ext4/ioctl.c | 2 +-
fs/ext4/super.c | 2 +-
fs/iomap.c | 2 +-
fs/xfs/xfs_file.c | 14 ++++++-------
fs/xfs/xfs_ioctl.c | 4 ++--
fs/xfs/xfs_iomap.c | 6 +++--
fs/xfs/xfs_reflink.c | 2 +-
include/linux/dax.h | 12 ++++++++---
include/linux/fs.h | 43 ++++++++++++++++++++++++++++-----------
mm/fadvise.c | 3 ++-
mm/filemap.c | 4 ++--
mm/huge_memory.c | 4 +++-
mm/madvise.c | 3 ++-
19 files changed, 102 insertions(+), 74 deletions(-)
4 years, 3 months
[PATCH v3 0/3] mm, smaps: MMUPageSize for device-dax
by Dan Williams
Changes since v2 [1]:
* Split the fix for the powerpc definition of vma_mmu_pagesize() into
its own patch.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-February/014101.html
---
Andrew,
Similar to commit 31383c6865a5 "mm, hugetlbfs: introduce ->split() to
vm_operations_struct" here is another occasion where we want
special-case hugetlbfs/hstate enabling to also apply to device-dax.
This raises the question of what other hstate conversions we might do beyond
->split() and ->pagesize(), but this appears to be the last of the
usages of hstate_vma() in generic/non-hugetlbfs specific code paths.
---
Dan Williams (3):
mm, powerpc: use vma_kernel_pagesize() in vma_mmu_pagesize()
mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct
device-dax: implement ->pagesize() for smaps to report MMUPageSize
arch/powerpc/include/asm/hugetlb.h | 6 ------
arch/powerpc/mm/hugetlbpage.c | 5 +----
drivers/dax/device.c | 10 ++++++++++
include/linux/mm.h | 1 +
mm/hugetlb.c | 27 ++++++++++++++-------------
5 files changed, 26 insertions(+), 23 deletions(-)
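
The ->pagesize() hook this series introduces follows the usual optional vm_operations pattern: ask the mapping's owner if it has an answer, otherwise report the base page size. The sketch below uses mock types only. Every mock_* name is hypothetical, MOCK_PAGE_SIZE is assumed to be 4096, and the align field stands in for whatever alignment a device-dax instance would report.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096UL

struct mock_vma;

struct mock_vm_ops {
	unsigned long (*pagesize)(struct mock_vma *vma);
};

struct mock_vma {
	const struct mock_vm_ops *vm_ops;
	unsigned long align; /* stand-in for the device-dax mapping alignment */
};

/* Device-dax style hook: report the mapping's alignment as its page size. */
static unsigned long mock_dax_pagesize(struct mock_vma *vma)
{
	return vma->align;
}

/* Generic helper: use the hook when present, else the base page size. */
static unsigned long mock_vma_kernel_pagesize(struct mock_vma *vma)
{
	if (vma->vm_ops && vma->vm_ops->pagesize)
		return vma->vm_ops->pagesize(vma);
	return MOCK_PAGE_SIZE;
}

/* Usage: an ordinary mapping versus a 2M-aligned device-dax style mapping. */
void check_vma_pagesize(void)
{
	static const struct mock_vm_ops dax_ops = { mock_dax_pagesize };
	struct mock_vma plain = { NULL, 0 };
	struct mock_vma dax = { &dax_ops, 2UL << 20 };

	assert(mock_vma_kernel_pagesize(&plain) == 4096UL);
	assert(mock_vma_kernel_pagesize(&dax) == 2097152UL);
}
```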
4 years, 3 months
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
by Stephen Bates
>> We'd prefer to have a generic way to get p2pmem instead of restricting
>> ourselves to only using CMBs. We did work in the past where the P2P memory
>> was part of an IB adapter and not the NVMe card. So this won't work if it's
>> an NVMe only interface.
> It just seems like it is making it too complicated.
I disagree. Having a common allocator (instead of some separate allocator per driver) makes things simpler.
> Seems like a very subtle and hard to debug performance trap to leave
> for the users, and pretty much the only reason to use P2P is
> performance... So why have such a dangerous interface?
P2P is about offloading the memory and PCI subsystem of the host CPU and this is achieved no matter which p2p_dev is used.
Stephen
4 years, 3 months
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
by Benjamin Herrenschmidt
On Thu, 2018-03-01 at 13:53 -0700, Jason Gunthorpe wrote:
> On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
> > Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
> > on ppc64 (maybe via an arch hook as it might depend on the processor
> > family). Server powerpc cannot do cachable accesses on IO memory
> > (unless it's special OpenCAPI or nVlink, but not on PCIe).
>
> I think you are right on this - even on x86 we must not create
> cachable mappings of PCI BARs - there is no way that works the way
> anyone would expect.
>
> I think this series doesn't have a problem here only because it never
> touches the BAR pages with the CPU.
>
> BAR memory should be mapped into the CPU as WC at best on all arches..
Could be that x86 has the smarts to do the right thing, still trying to
untangle the code :-)
Cheers,
Ben.
4 years, 3 months
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
by Logan Gunthorpe
On 02/03/18 09:18 AM, Jason Gunthorpe wrote:
> This allocator already seems not useful for the P2P target memory
> on a Mellanox NIC due to the way it has a special allocation flow
> (windowing) and special usage requirements..
>
> Nor can it be useful for the doorbell memory in the NIC.
No one says every P2P use has to use P2P memory. Doorbells are obviously
not P2P memory. But it's the p2pmem interface that's important and the
interface absolutely does not belong in the NVMe driver. Once you have
the P2P memory interface you need an allocator behind it and the obvious
place is in the P2P code to handle the common case where you're just
mapping a BAR. We don't need to implement a genalloc in every driver
that has P2P memory attached to it. If some hardware wants to expose
memory that requires complicated allocation it's up to them to solve
that problem but that isn't enough justification, to me, to push common
code into every driver.
> Both of these are existing use cases for P2P with out of tree patches..
And we have out of tree code that uses the generic allocator part of
p2pmem.
> The allocator seems to only be able to solve the CMB problem, and I
> think due to that it is better to put this allocator in the NVMe
> driver and not the core code.. At least until we find a 2nd user that
> needs the same allocation scheme...
See the first P2PMEM RFC. We used Chelsio's NIC instead of the CMB with
a very similar allocation scheme. We'd still be enabling that NIC in the
same way if we didn't run into hardware issues with it. A simple BAR
with memory behind it is always going to be the most common case.
Logan
4 years, 3 months