[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added, carved out of the previously reserved range. The
structure size is unchanged.
- IDs (SPD values) are stored as arrays of bytes, i.e. in
big-endian format. The spec clarifies that they need to be
represented in that same byte order.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
The patch-set applies on linux-pm.git acpica.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
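For reference, patch 2 formats the NVDIMM ID from the SPD fields along
these lines (an illustrative sketch rather than the exact patch; it
assumes the ACPICA-updated struct acpi_nfit_control_region fields and
the ACPI_NFIT_CONTROL_MFG_INFO_VALID flag):

  static ssize_t id_show(struct device *dev,
      struct device_attribute *attr, char *buf)
  {
      struct acpi_nfit_control_region *dcr = to_nfit_dcr(dev);

      /* SPD fields are big-endian byte arrays per ACPI 6.1 */
      if (dcr->valid_fields & ACPI_NFIT_CONTROL_MFG_INFO_VALID)
          return sprintf(buf, "%04x-%02x-%04x-%08x\n",
                  be16_to_cpu(dcr->vendor_id),
                  dcr->manufacturing_location,
                  be16_to_cpu(dcr->manufacturing_date),
                  be32_to_cpu(dcr->serial_number));
      return sprintf(buf, "%04x-%08x\n",
              be16_to_cpu(dcr->vendor_id),
              be32_to_cpu(dcr->serial_number));
  }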
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
Enabling peer to peer device transactions for PCIe devices
by Deucher, Alexander
This is certainly not the first time this has been brought up, but I'd like to try to get some consensus on the best way to move this forward. Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory, and in cases where both devices are behind a switch it avoids the CPU entirely.

Most current APIs that deal with this (DirectGMA, PeerDirect, CUDA, HSA) are pointer based. Ideally we'd be able to take a CPU virtual address and get to a physical address, taking IOMMUs, etc. into account. Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that want to use it.
Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
Here is a relatively simple example of how this could work for testing (a rough code sketch follows the list). This is obviously not a complete solution.
- Device memory will be registered with the Linux memory sub-system by creating corresponding struct page structures for device memory
- get_user_pages_fast() will return corresponding struct pages when CPU address points to the device memory
- put_page() will deal with struct pages for device memory
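As a rough sketch of how a driver could use this for testing (illustrative
only; it assumes struct pages for the device memory already exist as
described above):

  /* Pin the pages behind a user VA so a peer device can DMA to them. */
  static int pin_user_range(unsigned long uaddr, int nr_pages,
      struct page **pages)
  {
      int i, n;

      n = get_user_pages_fast(uaddr, nr_pages, 1 /* write */, pages);
      if (n <= 0)
          return n ? n : -EFAULT;

      /* a real driver would dma_map the pages and build an S/G list
       * for the peer-to-peer transfer here */

      for (i = 0; i < n; i++)
          put_page(pages[i]); /* handles device pages, per the above */
      return n;
  }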
Previously proposed solutions and related proposals:
1. P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer (http://www.spinics.net/lists/linux-pci/msg44560.html)
Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support it, doesn't handle S/G in device memory.
2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. (https://patchwork.kernel.org/patch/8583221/)
Pros: Doesn't waste system memory for ZONE metadata.
Cons: CPU access to ZONE metadata is slow, and the metadata may be lost or corrupted on device reset.
3. DMA-BUF
RDMA subsystem DMA-BUF support (http://www.spinics.net/lists/linux-rdma/msg38748.html)
Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in drivers.
4. iopmem
iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
5. HMM
Heterogeneous Memory Management (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)
6. Some new mmap-like interface that takes a userptr and a length and returns a dma-buf and offset?
Alex
Standardization of ACPI NVDIMM DSMs
by Rebecca Cran
I'm pretty new to ACPI work so it's possible I'm misunderstanding
something. I've recently started working on NVDIMMs, and have
noticed that both HPE and Intel have DSM "Example" interfaces that are
referenced/used in Linux. I've been wondering if there's a reason
the content from both couldn't be combined and added to the ACPI
specification with sufficient vendor-specific fields to support the
cases where they need to differ?
--
Rebecca Cran
[PATCH] ndctl: daxctl: Adding io option for daxctl
by Dave Jiang
The daxctl io command performs I/O between a regular file or block
device and a device dax file, in either direction. It also provides a
way to zero a device dax device.
e.g. daxctl io --input=/home/myfile --output=/dev/dax1.0
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
Documentation/Makefile.am | 3
Documentation/daxctl-io.txt | 71 +++++
daxctl/Makefile.am | 5
daxctl/daxctl.c | 2
daxctl/io.c | 567 +++++++++++++++++++++++++++++++++++++++++++
5 files changed, 646 insertions(+), 2 deletions(-)
create mode 100644 Documentation/daxctl-io.txt
create mode 100644 daxctl/io.c
diff --git a/Documentation/Makefile.am b/Documentation/Makefile.am
index c7e0758..8efdbc2 100644
--- a/Documentation/Makefile.am
+++ b/Documentation/Makefile.am
@@ -26,7 +26,8 @@ man1_MANS = \
ndctl-destroy-namespace.1 \
ndctl-check-namespace.1 \
ndctl-list.1 \
- daxctl-list.1
+ daxctl-list.1 \
+ daxctl-io.1
CLEANFILES = $(man1_MANS)
diff --git a/Documentation/daxctl-io.txt b/Documentation/daxctl-io.txt
new file mode 100644
index 0000000..c3ddd15
--- /dev/null
+++ b/Documentation/daxctl-io.txt
@@ -0,0 +1,71 @@
+daxctl-io(1)
+============
+
+NAME
+----
+daxctl-io - Perform I/O on Device-DAX devices or zero a Device-DAX device.
+
+SYNOPSIS
+--------
+[verse]
+'daxctl io' [<options>]
+
+There must be a Device-DAX device involved, either as the input or the
+output device. A read from a Device-DAX device may be written to a file,
+a block device, another Device-DAX device, or stdout. A write to a
+Device-DAX device may come from a file, a block device, another
+Device-DAX device, or stdin.
+
+If no length is specified, it defaults to the input file/device length.
+For a special character input file, the output file/device length is used.
+
+If no input is given, stdin is used; if no output is given, stdout is used.
+
+For a Device-DAX output device, an attempt is made to clear badblocks
+within the range of the write before it is performed.
+
+EXAMPLE
+-------
+[verse]
+# daxctl io --zero /dev/dax1.0
+
+# daxctl io --input=/dev/dax1.0 --output=/home/myfile --len=2097152 --seek=4096
+
+# cat /dev/zero | daxctl io --output=/dev/dax1.0
+
+# daxctl io --input=/dev/zero --output=/dev/dax1.0 --skip=4096
+
+OPTIONS
+-------
+-i::
+--input=::
+ Input device or file to read from.
+
+-o::
+--output=::
+ Output device or file to write to.
+
+-z::
+--zero::
+ Zero 'len' bytes of the output device, or the entire device if no
+ length was given. The output device must be a Device-DAX device.
+
+-l::
+--len::
+ The length of the I/O, in bytes.
+
+-s::
+--seek::
+ The number of bytes to skip over on the output before performing a
+ write.
+
+-k::
+--skip::
+ The number of bytes to skip over on the input before performing a read.
+
+COPYRIGHT
+---------
+Copyright (c) 2017, Intel Corporation. License GPLv2: GNU GPL
+version 2 <http://gnu.org/licenses/gpl.html>. This is free software:
+you are free to change and redistribute it. There is NO WARRANTY, to
+the extent permitted by law.
diff --git a/daxctl/Makefile.am b/daxctl/Makefile.am
index fe467d0..1ba1f07 100644
--- a/daxctl/Makefile.am
+++ b/daxctl/Makefile.am
@@ -5,10 +5,13 @@ bin_PROGRAMS = daxctl
daxctl_SOURCES =\
daxctl.c \
list.c \
+ io.c \
../util/json.c
daxctl_LDADD =\
lib/libdaxctl.la \
+ ../ndctl/lib/libndctl.la \
../libutil.a \
$(UUID_LIBS) \
- $(JSON_LIBS)
+ $(JSON_LIBS) \
+ -lpmem
diff --git a/daxctl/daxctl.c b/daxctl/daxctl.c
index 91a4600..db2e495 100644
--- a/daxctl/daxctl.c
+++ b/daxctl/daxctl.c
@@ -67,11 +67,13 @@ static int cmd_help(int argc, const char **argv, void *ctx)
}
int cmd_list(int argc, const char **argv, void *ctx);
+int cmd_io(int argc, const char **argv, void *ctx);
static struct cmd_struct commands[] = {
{ "version", cmd_version },
{ "list", cmd_list },
{ "help", cmd_help },
+ { "io", cmd_io },
};
int main(int argc, const char **argv)
diff --git a/daxctl/io.c b/daxctl/io.c
new file mode 100644
index 0000000..92e2878
--- /dev/null
+++ b/daxctl/io.c
@@ -0,0 +1,567 @@
+/*
+ * Copyright(c) 2015-2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <stdio.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/sysmacros.h>
+#include <sys/param.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <limits.h>
+#include <libgen.h>
+#include <libpmem.h>
+#include <util/json.h>
+#include <util/filter.h>
+#include <json-c/json.h>
+#include <daxctl/libdaxctl.h>
+#include <ccan/short_types/short_types.h>
+#include <util/parse-options.h>
+#include <ccan/array_size/array_size.h>
+#include <ndctl/ndctl.h>
+
+enum io_direction {
+ IO_READ = 0,
+ IO_WRITE,
+};
+
+struct io_dev {
+ int fd;
+ int major;
+ int minor;
+ void *mmap;
+ const char *parm_path;
+ char *real_path;
+ uint64_t offset;
+ enum io_direction direction;
+ bool is_dax;
+ bool is_char;
+ bool is_new;
+ bool need_trunc;
+ struct ndctl_ctx *ndctx;
+ struct ndctl_region *region;
+ struct ndctl_dax *dax;
+ uint64_t size;
+};
+
+static struct {
+ struct io_dev dev[2];
+ bool zero;
+ uint64_t len;
+ struct ndctl_cmd *ars_cap;
+ struct ndctl_cmd *clear_err;
+} io = {
+ .dev[0].fd = -1,
+ .dev[1].fd = -1,
+};
+
+#define fail(fmt, ...) \
+do { \
+ fprintf(stderr, "daxctl-%s:%s:%d: " fmt, \
+ VERSION, __func__, __LINE__, ##__VA_ARGS__); \
+} while (0)
+
+static bool is_stdinout(struct io_dev *io_dev)
+{
+ return io_dev->fd == STDIN_FILENO ||
+ io_dev->fd == STDOUT_FILENO;
+}
+
+static int setup_device(struct io_dev *io_dev, struct ndctl_ctx *ctx,
+ size_t size)
+{
+ int flags, rc;
+
+ if (is_stdinout(io_dev))
+ return 0;
+
+ if (io_dev->is_new)
+ flags = O_CREAT|O_WRONLY|O_TRUNC;
+ else if (io_dev->need_trunc)
+ flags = O_RDWR | O_TRUNC;
+ else
+ flags = O_RDWR;
+
+ io_dev->fd = open(io_dev->parm_path, flags, S_IRUSR|S_IWUSR);
+ if (io_dev->fd == -1) {
+ rc = -errno;
+ perror("open");
+ return rc;
+ }
+
+ if (!io_dev->is_dax)
+ return 0;
+
+ flags = (io_dev->direction == IO_READ) ? PROT_READ : PROT_WRITE;
+ io_dev->mmap = mmap(NULL, size, flags, MAP_SHARED, io_dev->fd, 0);
+ if (io_dev->mmap == MAP_FAILED) {
+ rc = -errno;
+ perror("mmap");
+ return rc;
+ }
+
+ return 0;
+}
+
+static int match_device(struct io_dev *io_dev, struct daxctl_region *dregion)
+{
+ struct daxctl_dev *dev;
+
+ daxctl_dev_foreach(dregion, dev) {
+ if (io_dev->major == daxctl_dev_get_major(dev) &&
+ io_dev->minor == daxctl_dev_get_minor(dev)) {
+ io_dev->is_dax = true;
+ io_dev->size = daxctl_dev_get_size(dev);
+ return 1;
+ }
+ }
+
+ return 0;
+}
+
+static int find_dax_device(struct io_dev *io_dev, struct ndctl_ctx *ndctx,
+ enum io_direction dir)
+{
+ struct ndctl_bus *bus;
+ struct ndctl_region *region;
+ struct ndctl_dax *dax;
+ struct daxctl_region *dregion;
+ struct stat st;
+ int rc;
+ char cdev_path[256];
+ char link_path[256];
+ char *dev_name;
+
+ if (is_stdinout(io_dev)) {
+ io_dev->size = ULONG_MAX;
+ return 0;
+ }
+
+ rc = stat(io_dev->parm_path, &st);
+ if (rc == -1) {
+ rc = -errno;
+ if (rc == -ENOENT && dir == IO_WRITE) {
+ io_dev->is_new = true;
+ io_dev->size = ULONG_MAX;
+ return 0;
+ }
+ perror("stat");
+ return rc;
+ }
+
+ if (S_ISREG(st.st_mode)) {
+ if (dir == IO_WRITE) {
+ io_dev->need_trunc = true;
+ io_dev->size = ULONG_MAX;
+ } else
+ io_dev->size = st.st_size;
+ return 0;
+ } else if (S_ISBLK(st.st_mode)) {
+ io_dev->size = st.st_size;
+ return 0;
+ } else if (S_ISCHR(st.st_mode)) {
+ io_dev->size = ULONG_MAX;
+ io_dev->is_char = true;
+ io_dev->major = major(st.st_rdev);
+ io_dev->minor = minor(st.st_rdev);
+ } else
+ return -ENODEV;
+
+ rc = snprintf(cdev_path, 255, "/sys/dev/char/%u:%u", io_dev->major,
+ io_dev->minor);
+ if (rc < 0) {
+ fail("snprintf\n");
+ return -ENXIO;
+ }
+
+ rc = readlink(cdev_path, link_path, 255);
+ if (rc == -1) {
+ rc = -errno;
+ perror("readlink");
+ return rc;
+ }
+ link_path[rc] = '\0';
+ dev_name = basename(link_path);
+
+ ndctl_bus_foreach(ndctx, bus)
+ ndctl_region_foreach(bus, region)
+ ndctl_dax_foreach(region, dax) {
+ if (strncmp(dev_name,
+ ndctl_dax_get_devname(dax),
+ 256))
+ continue;
+
+ dregion = ndctl_dax_get_daxctl_region(dax);
+ if (match_device(io_dev, dregion)) {
+ io_dev->region = region;
+ io_dev->dax = dax;
+ return 1;
+ }
+ }
+ return 0;
+}
+
+static int send_clear_error(struct ndctl_bus *bus, uint64_t start, uint64_t size)
+{
+ uint64_t cleared;
+ int rc;
+
+ io.clear_err = ndctl_bus_cmd_new_clear_error(start, size, io.ars_cap);
+ if (!io.clear_err) {
+ fail("bus: %s failed to create cmd\n",
+ ndctl_bus_get_provider(bus));
+ return -ENXIO;
+ }
+
+ rc = ndctl_cmd_submit(io.clear_err);
+ if (rc) {
+ fail("bus: %s failed to submit cmd: %d\n",
+ ndctl_bus_get_provider(bus), rc);
+ ndctl_cmd_unref(io.clear_err);
+ return rc;
+ }
+
+ cleared = ndctl_cmd_clear_error_get_cleared(io.clear_err);
+ if (cleared != size) {
+ fail("bus: %s expected to clear: %ld actual: %ld\n",
+ ndctl_bus_get_provider(bus),
+ size, cleared);
+ return -ENXIO;
+ }
+
+ return 0;
+}
+
+static int get_ars_cap(struct ndctl_bus *bus, uint64_t start, uint64_t size)
+{
+ int rc;
+
+ io.ars_cap = ndctl_bus_cmd_new_ars_cap(bus, start, size);
+ if (!io.ars_cap) {
+ fail("bus: %s failed to create cmd\n",
+ ndctl_bus_get_provider(bus));
+ return -ENOTTY;
+ }
+
+ rc = ndctl_cmd_submit(io.ars_cap);
+ if (rc) {
+ fail("bus: %s failed to submit cmd: %d\n",
+ ndctl_bus_get_provider(bus), rc);
+ ndctl_cmd_unref(io.ars_cap);
+ return rc;
+ }
+
+ if (ndctl_cmd_ars_cap_get_size(io.ars_cap) <
+ sizeof(struct nd_cmd_ars_status)) {
+ fail("bus: %s expected size >= %zd got: %d\n",
+ ndctl_bus_get_provider(bus),
+ sizeof(struct nd_cmd_ars_status),
+ ndctl_cmd_ars_cap_get_size(io.ars_cap));
+ ndctl_cmd_unref(io.ars_cap);
+ return -ENXIO;
+ }
+
+ return 0;
+}
+
+int clear_errors(struct ndctl_bus *bus, uint64_t start, uint64_t len)
+{
+ int rc;
+
+ rc = get_ars_cap(bus, start, len);
+ if (rc) {
+ fail("get_ars_cap failed\n");
+ return rc;
+ }
+
+ rc = send_clear_error(bus, start, len);
+ if (rc) {
+ fail("send_clear_error failed\n");
+ return rc;
+ }
+
+ return 0;
+}
+
+static int clear_badblocks(struct io_dev *dev, uint64_t len)
+{
+ unsigned long long dax_begin, dax_size, dax_end;
+ unsigned long long region_begin, offset;
+ unsigned long long size, io_begin, io_end, io_len;
+ struct badblock *bb;
+ int rc;
+
+ dax_begin = ndctl_dax_get_resource(dev->dax);
+ if (dax_begin == ULLONG_MAX)
+ return -ERANGE;
+
+ dax_size = ndctl_dax_get_size(dev->dax);
+ if (dax_size == ULLONG_MAX)
+ return -ERANGE;
+
+ dax_end = dax_begin + dax_size - 1;
+
+ region_begin = ndctl_region_get_resource(dev->region);
+ if (region_begin == ULLONG_MAX)
+ return -ERANGE;
+
+ ndctl_region_badblock_foreach(dev->region, bb) {
+ unsigned long long bb_begin, bb_end, begin, end;
+
+ bb_begin = region_begin + (bb->offset << 9);
+ bb_end = bb_begin + (bb->len << 9) - 1;
+
+ if (bb_end <= dax_begin || bb_begin >= dax_end)
+ continue;
+
+ if (bb_begin < dax_begin)
+ begin = dax_begin;
+ else
+ begin = bb_begin;
+
+ if (bb_end > dax_end)
+ end = dax_end;
+ else
+ end = bb_end;
+
+ offset = begin - dax_begin;
+ size = end - begin + 1;
+
+ /*
+ * If end of I/O is before badblock or the offset of the
+ * I/O is greater than the actual size of badblock range
+ */
+ if (dev->offset + len - 1 < offset || dev->offset > size)
+ continue;
+
+ io_begin = (dev->offset < offset) ? offset : dev->offset;
+ if ((dev->offset + len) < (offset + size))
+ io_end = offset + len;
+ else
+ io_end = offset + size;
+
+ io_len = io_end - io_begin;
+ io_begin += dax_begin;
+ rc = clear_errors(ndctl_region_get_bus(dev->region),
+ io_begin, io_len);
+ if (rc < 0)
+ return rc;
+ }
+
+ return 0;
+}
+
+static ssize_t __do_io(struct io_dev *dst_dev, struct io_dev *src_dev,
+ uint64_t len, bool zero)
+{
+ void *src, *dst;
+ ssize_t rc, count = 0;
+
+ if (zero && dst_dev->is_dax) {
+ dst = (uint8_t *)dst_dev->mmap + dst_dev->offset;
+ memset(dst, 0, len);
+ pmem_persist(dst, len);
+ rc = len;
+ } else if (dst_dev->is_dax && src_dev->is_dax) {
+ src = (uint8_t *)src_dev->mmap + src_dev->offset;
+ dst = (uint8_t *)dst_dev->mmap + dst_dev->offset;
+ pmem_memcpy_persist(dst, src, len);
+ rc = len;
+ } else if (src_dev->is_dax) {
+ src = (uint8_t *)src_dev->mmap + src_dev->offset;
+ if (dst_dev->offset) {
+ rc = lseek(dst_dev->fd, dst_dev->offset, SEEK_SET);
+ if (rc < 0) {
+ rc = -errno;
+ perror("lseek");
+ return rc;
+ }
+ }
+ do {
+ rc = write(dst_dev->fd, (uint8_t *)src + count,
+ len - count);
+ if (rc == -1) {
+ rc = -errno;
+ perror("write");
+ return rc;
+ }
+ count += rc;
+ } while (count != (ssize_t)len);
+ rc = count;
+ if (rc != (ssize_t)len)
+ printf("Requested size %lu larger than source.\n", len);
+ } else if (dst_dev->is_dax) {
+ dst = (uint8_t *)dst_dev->mmap + dst_dev->offset;
+ if (src_dev->offset) {
+ rc = lseek(src_dev->fd, src_dev->offset, SEEK_SET);
+ if (rc < 0) {
+ rc = -errno;
+ perror("lseek");
+ return rc;
+ }
+ }
+ do {
+ rc = read(src_dev->fd, (uint8_t *)dst + count,
+ len - count);
+ if (rc == -1) {
+ rc = -errno;
+ perror("pread");
+ return rc;
+ }
+ /* end of file */
+ if (rc == 0)
+ break;
+ count += rc;
+ } while (count != (ssize_t)len);
+ pmem_persist(dst, count);
+ rc = count;
+ if (rc != (ssize_t)len)
+ printf("Requested size %lu larger than destination.\n", len);
+ } else
+ return -EINVAL;
+
+ return rc;
+}
+
+static int do_io(struct ndctl_ctx *ctx)
+{
+ int rc, i, dax_devs = 0;
+
+ /* if we are zeroing the device, we just need output */
+ i = io.zero ? 1 : 0;
+ for (; i < 2; i++) {
+ if (!io.dev[i].parm_path)
+ continue;
+ rc = find_dax_device(&io.dev[i], ctx, i);
+ if (rc < 0)
+ return rc;
+
+ if (rc == 1)
+ dax_devs++;
+ }
+
+ if (dax_devs == 0) {
+ fail("No DAX devices for input or output, fail\n");
+ return -ENODEV;
+ }
+
+ if (io.len == 0) {
+ if (is_stdinout(&io.dev[0]))
+ io.len = io.dev[1].size;
+ else
+ io.len = io.dev[0].size;
+ }
+
+ io.dev[1].direction = IO_WRITE;
+ i = io.zero ? 1 : 0;
+ for (; i < 2; i++) {
+ if (!io.dev[i].parm_path)
+ continue;
+ rc = setup_device(&io.dev[i], ctx, io.len);
+ if (rc < 0)
+ return rc;
+ }
+
+ if (io.dev[1].is_dax) {
+ rc = clear_badblocks(&io.dev[1], io.len);
+ if (rc < 0) {
+ fail("Failed to clear badblocks on %s\n",
+ io.dev[1].parm_path);
+ return rc;
+ }
+ }
+
+ rc = __do_io(&io.dev[1], &io.dev[0], io.len, io.zero);
+ if (rc < 0) {
+ fail("Failed to perform I/O\n");
+ return rc;
+ }
+
+ printf("Data copied %u bytes to device %s\n",
+ rc, io.dev[1].parm_path);
+
+ return 0;
+}
+
+static void cleanup(struct ndctl_ctx *ctx)
+{
+ int i;
+
+ for (i = 0; i < 2; i++) {
+ if (is_stdinout(&io.dev[i]))
+ continue;
+ close(io.dev[i].fd);
+ }
+}
+
+int cmd_io(int argc, const char **argv, void *ctx)
+{
+ const struct option options[] = {
+ OPT_STRING('i', "input", &io.dev[0].parm_path, "in device",
+ "input device/file"),
+ OPT_STRING('o', "output", &io.dev[1].parm_path, "out device",
+ "output device/file"),
+ OPT_BOOLEAN('z', "zero", &io.zero, "zeroing the device"),
+ OPT_U64('l', "len", &io.len, "total length to perform the I/O"),
+ OPT_U64('s', "seek", &io.dev[1].offset, "seek offset for output"),
+ OPT_U64('k', "skip", &io.dev[0].offset, "skip offset for input"),
+ };
+ const char * const u[] = {
+ "daxctl io [<options>]",
+ NULL
+ };
+ int i, rc;
+ struct ndctl_ctx *ndctx;
+
+ argc = parse_options(argc, argv, options, u, 0);
+ for (i = 0; i < argc; i++) {
+ /* report every unknown parameter before bailing below */
+ fail("Unknown parameter \"%s\"\n", argv[i]);
+ }
+
+ if (argc) {
+ usage_with_options(u, options);
+ return -EINVAL;
+ }
+
+ if (!io.dev[0].parm_path && !io.dev[1].parm_path) {
+ usage_with_options(u, options);
+ return 0;
+ }
+
+ if (!io.dev[0].parm_path) {
+ io.dev[0].fd = STDIN_FILENO;
+ io.dev[0].offset = 0;
+ }
+
+ if (!io.dev[1].parm_path) {
+ io.dev[1].fd = STDOUT_FILENO;
+ io.dev[1].offset = 0;
+ }
+
+ rc = ndctl_new(&ndctx);
+ if (rc)
+ return -ENOMEM;
+
+ rc = do_io(ndctx);
+ if (rc < 0)
+ goto out;
+
+ rc = 0;
+out:
+ cleanup(ndctx);
+ ndctl_unref(ndctx);
+ return rc;
+}
[resend PATCH v2 00/33] dax: introduce dax_operations
by Dan Williams
[ resend to add dm-devel, linux-block, and fs-devel, apologies for the
duplicates ]
Changes since v1 [1] and the dax-fs RFC [2]:
* rename struct dax_inode to struct dax_device (Christoph)
* rewrite arch_memcpy_to_pmem() in C with inline asm
* use QUEUE_FLAG_WC to gate dax cache management (Jeff)
* add device-mapper plumbing for the ->copy_from_iter() and ->flush()
dax_operations
* kill struct blk_dax_ctl and bdev_direct_access (Christoph)
* cleanup the ->direct_access() calling convention to be page based
(Christoph)
* introduce dax_get_by_host() and don't pollute struct super_block with
dax_device details (Christoph)
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008586.html
[2]: https://lwn.net/Articles/713064/
---
A few months back, in the course of reviewing the memcpy_nocache()
proposal from Brian, Linus proposed that the pmem specific
memcpy_to_pmem() routine be moved to be implemented at the driver level
[3]:
"Quite frankly, the whole 'memcpy_nocache()' idea or (ab-)using
copy_user_nocache() just needs to die. It's idiotic.
As you point out, it's also fundamentally buggy crap.
Throw it away. There is no possible way this is ever valid or
portable. We're not going to lie and claim that it is.
If some driver ends up using 'movnt' by hand, that is up to that
*driver*. But no way in hell should we care about this one whit in
the sense of <linux/uaccess.h>."
This feedback also dovetails with another fs/dax.c design wart of being
hard coded to assume the backing device is pmem. We call the pmem
specific copy, clear, and flush routines even if the backing device
driver is one of the other 3 dax drivers (axonram, dcssblk, or brd).
There is no reason to spend cpu cycles flushing the cache after writing
to brd, for example, since it is using volatile memory for storage.
Moreover, the pmem driver might be fronting a volatile memory range
published by the ACPI NFIT, or the platform might have arranged to flush
cpu caches on power fail. This latter capability is a feature that has
appeared in embedded storage appliances (pre-ACPI-NFIT nvdimm
platforms).
So, this series:
1/ moves what was previously named "the pmem api" out of the global
namespace and into drivers that need to be concerned with
architecture specific persistent memory considerations.
2/ arranges for dax to stop abusing __copy_user_nocache() and implements
a libnvdimm-local memcpy that uses 'movnt' on x86_64. This might be
expanded in the future to use 'movntdqa' if the copy size is above
some threshold, or expanded with support for other architectures [4].
3/ makes cache maintenance optional by arranging for dax to call driver
specific copy and flush operations only if the driver publishes them.
4/ allows filesystem-dax cache management to be controlled by the block
device write-cache queue flag. The pmem driver is updated to clear
that flag by default when pmem is driving volatile memory.
[3]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[4]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009478.html
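To give a feel for the end state, the per-driver ops table ends up with
roughly this shape (a sketch based on this series; see the patches for
the exact signatures):

  struct dax_operations {
      /* pmem supplies all three; brd, axonram, and dcssblk can omit
       * the cache-management hooks */
      long (*direct_access)(struct dax_device *, pgoff_t, long,
              void **, pfn_t *);
      size_t (*copy_from_iter)(struct dax_device *, pgoff_t,
              void *, size_t, struct iov_iter *);
      void (*flush)(struct dax_device *, pgoff_t, void *, size_t);
  };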
These patches have been through a round of build regression fixes
notified by the 0day robot. All review welcome, but the patches that
need extra attention are the device-mapper and uio changes
(copy_from_iter_ops).
This series is based on a merge of char-misc-next (for cdev api reworks)
and libnvdimm-fixes (dax locking and __copy_user_nocache fixes).
---
Dan Williams (33):
device-dax: rename 'dax_dev' to 'dev_dax'
dax: refactor dax-fs into a generic provider of 'struct dax_device' instances
dax: add a facility to lookup a dax device by 'host' device name
dax: introduce dax_operations
pmem: add dax_operations support
axon_ram: add dax_operations support
brd: add dax_operations support
dcssblk: add dax_operations support
block: kill bdev_dax_capable()
dax: introduce dax_direct_access()
dm: add dax_device and dax_operations support
dm: teach dm-targets to use a dax_device + dax_operations
ext2, ext4, xfs: retrieve dax_device for iomap operations
Revert "block: use DAX for partition table reads"
filesystem-dax: convert to dax_direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
block: remove block_device_operations ->direct_access()
x86, dax, pmem: remove indirection around memcpy_from_pmem()
dax, pmem: introduce 'copy_from_iter' dax operation
dm: add ->copy_from_iter() dax operation support
filesystem-dax: convert to dax_copy_from_iter()
dax, pmem: introduce an optional 'flush' dax_operation
dm: add ->flush() dax operation support
filesystem-dax: convert to dax_flush()
x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush
x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm
x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm
x86, libnvdimm, dax: stop abusing __copy_user_nocache
uio, libnvdimm, pmem: implement cache bypass for all copy_from_iter() operations
libnvdimm, pmem: fix persistence warning
libnvdimm, nfit: enable support for volatile ranges
filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC
libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
MAINTAINERS | 2
arch/powerpc/platforms/Kconfig | 1
arch/powerpc/sysdev/axonram.c | 45 +++-
arch/x86/Kconfig | 1
arch/x86/include/asm/pmem.h | 141 ------------
arch/x86/include/asm/string_64.h | 1
block/Kconfig | 1
block/partition-generic.c | 17 -
drivers/Makefile | 2
drivers/acpi/nfit/core.c | 15 +
drivers/block/Kconfig | 1
drivers/block/brd.c | 52 +++-
drivers/dax/Kconfig | 10 +
drivers/dax/Makefile | 5
drivers/dax/dax.h | 15 -
drivers/dax/device-dax.h | 25 ++
drivers/dax/device.c | 415 +++++++++++------------------------
drivers/dax/pmem.c | 10 -
drivers/dax/super.c | 445 ++++++++++++++++++++++++++++++++++++++
drivers/md/Kconfig | 1
drivers/md/dm-core.h | 1
drivers/md/dm-linear.c | 53 ++++-
drivers/md/dm-snap.c | 6 -
drivers/md/dm-stripe.c | 65 ++++--
drivers/md/dm-target.c | 6 -
drivers/md/dm.c | 112 ++++++++--
drivers/nvdimm/Kconfig | 6 +
drivers/nvdimm/Makefile | 1
drivers/nvdimm/bus.c | 10 -
drivers/nvdimm/claim.c | 9 -
drivers/nvdimm/core.c | 2
drivers/nvdimm/dax_devs.c | 2
drivers/nvdimm/dimm_devs.c | 2
drivers/nvdimm/namespace_devs.c | 9 -
drivers/nvdimm/nd-core.h | 9 +
drivers/nvdimm/pfn_devs.c | 4
drivers/nvdimm/pmem.c | 82 +++++--
drivers/nvdimm/pmem.h | 26 ++
drivers/nvdimm/region_devs.c | 39 ++-
drivers/nvdimm/x86.c | 155 +++++++++++++
drivers/s390/block/Kconfig | 1
drivers/s390/block/dcssblk.c | 44 +++-
fs/block_dev.c | 117 +++-------
fs/dax.c | 302 ++++++++++++++------------
fs/ext2/inode.c | 9 +
fs/ext4/inode.c | 9 +
fs/iomap.c | 3
fs/xfs/xfs_iomap.c | 10 +
include/linux/blkdev.h | 19 --
include/linux/dax.h | 43 +++-
include/linux/device-mapper.h | 14 +
include/linux/iomap.h | 1
include/linux/libnvdimm.h | 10 +
include/linux/pmem.h | 165 --------------
include/linux/string.h | 8 +
include/linux/uio.h | 4
lib/Kconfig | 6 -
lib/iov_iter.c | 25 ++
tools/testing/nvdimm/Kbuild | 11 +
tools/testing/nvdimm/pmem-dax.c | 21 +-
60 files changed, 1584 insertions(+), 1042 deletions(-)
delete mode 100644 arch/x86/include/asm/pmem.h
create mode 100644 drivers/dax/device-dax.h
rename drivers/dax/{dax.c => device.c} (60%)
create mode 100644 drivers/dax/super.c
create mode 100644 drivers/nvdimm/x86.c
delete mode 100644 include/linux/pmem.h
[PATCH v3 0/5] DAX common 4k zero page
by Ross Zwisler
When servicing mmap() reads from file holes the current DAX code allocates
a page cache page of all zeroes and places the struct page pointer in the
mapping->page_tree radix tree. This has three major drawbacks:
1) It consumes memory unnecessarily. For every 4k page that is read via a
DAX mmap() over a hole, we allocate a new page cache page. This means that
if you read 1GiB worth of pages, you end up using 1GiB of zeroed memory.
2) It is slower than using a common zero page because each page fault has
more work to do. Instead of just inserting a common zero page we have to
allocate a page cache page, zero it, and then insert it.
3) The fact that we had to check for both DAX exceptional entries and for
page cache pages in the radix tree made the DAX code more complex.
This series solves these issues by following the lead of the DAX PMD code
and using a common 4k zero page instead. This reduces memory usage and
decreases latencies for some workloads, and it simplifies the DAX code,
removing over 100 lines in total.
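Conceptually, a read fault over a hole then reduces to something like the
following (an illustrative sketch, not the patch itself; the write path
instead goes through the new vm_insert_mixed_mkwrite() helper):

  /* on a read fault over a hole, map the shared zero page read-only
   * instead of allocating and zeroing a page cache page */
  if (!(vmf->flags & FAULT_FLAG_WRITE)) {
      pfn_t pfn = page_to_pfn_t(ZERO_PAGE(0));

      if (vm_insert_mixed(vmf->vma, vmf->address, pfn))
          return VM_FAULT_SIGBUS;
      return VM_FAULT_NOPAGE;
  }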
Andrew, I'm still hoping to get this merged for v4.13 if possible. I have
addressed all of Jan's feedback, but he is on vacation for the next few
weeks so he may not be able to give me Reviewed-by tags. I think this
series is relatively low risk with clear benefits, and I think we should be
able to address any issues that come up during the v4.13 RC series.
This series has passed my targeted testing and a full xfstests run on both
XFS and ext4.
---
Changes since v2:
- If we call insert_pfn() with 'mkwrite' for an entry that already exists,
don't overwrite the pte with a brand new one. Just add the appropriate
flags. (Jan)
- Keep put_locked_mapping_entry() as a simple wrapper for
dax_unlock_mapping_entry() so it has naming parity with
get_unlocked_mapping_entry(). (Jan)
- Remove DAX special casing in page_cache_tree_insert(), move
now-private definitions from dax.h to dax.c. (Jan)
Ross Zwisler (5):
mm: add vm_insert_mixed_mkwrite()
dax: relocate some dax functions
dax: use common 4k zero page for dax mmap reads
dax: remove DAX code from page_cache_tree_insert()
dax: move all DAX radix tree defs to fs/dax.c
Documentation/filesystems/dax.txt | 5 +-
fs/dax.c | 345 ++++++++++++++++----------------------
fs/ext2/file.c | 25 +--
fs/ext4/file.c | 32 +---
fs/xfs/xfs_file.c | 2 +-
include/linux/dax.h | 45 -----
include/linux/mm.h | 2 +
include/trace/events/fs_dax.h | 2 -
mm/filemap.c | 13 +-
mm/memory.c | 57 ++++++-
10 files changed, 205 insertions(+), 323 deletions(-)
--
2.9.4
[PATCH -mm -v2 00/12] mm, THP, swap: Delay splitting THP after swapped out
by Huang, Ying
From: Huang Ying <ying.huang(a)intel.com>
Hi, Andrew, could you help me to check whether the overall design is
reasonable?
Hi, Johannes and Minchan, Thanks a lot for your review to the first
step of the THP swap optimization! Could you help me to review the
second step in this patchset?
Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset? Especially [01/12], [02/12], [03/12],
[04/12], [11/12], and [12/12].
Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset? Especially [01/12], [03/12], [07/12], [08/12], [09/12],
[11/12].
Hi, Johannes, Michal, could you help me to review the cgroup part of
the patchset? Especially [08/12], [09/12], and [10/12].
And for all, Any comment is welcome!
The THP swap writing support patch [06/12] needs to be rebased on the
multipage bvec patchset, which hasn't been merged yet, so the [06/12]
in this patchset is just a test patch and will be rewritten later.
The patchset as a whole depends on the multipage bvec patchset too.
This is the second step of the THP (Transparent Huge Page) swap
optimization. In the first step, splitting the huge page is delayed
from almost the beginning of swapping out to after allocating the
swap space for the THP and adding the THP into the swap cache. In
the second step, the splitting is delayed further, to after the swap
out has finished. The plan is to keep delaying the split step by
step and finally avoid it altogether, swapping the THP out and in as
a whole.
In the patchset, more of the operations involved in reclaiming an
anonymous THP, such as TLB flushing, writing the THP to the swap
device, and removing the THP from the swap cache, are batched, so
the performance of anonymous THP swap out is improved.
This patchset is based on the 6/16 head of mmotm/master.
During the development, the following scenarios/code paths have been
checked,
- swap out/in
- swap off
- write protect page fault
- madvise_free
- process exit
- split huge page
Please let me know if I missed something.
With the patchset, the swap out throughput improves 42% (from about
5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case
with 16 processes, while the IPI count (reflecting TLB flushing) was
reduced by about 78.9%. The test was done on a Xeon E5 v3 system. The
swap device used is a RAM simulated PMEM (persistent memory) device.
To test the sequential swapping out, the test case creates 8
processes, which sequentially allocate and write to the anonymous
pages until the RAM and part of the swap device is used up.
Below is the part of the cover letter for the first step patchset of
THP swap optimization which applies to all steps.
----------------------------------------------------------------->
Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
doing page swap out, even on a high-end server machine, because storage
device performance has improved faster than that of a single logical
CPU. It seems that this trend will not change in the near future. On
the other hand, THP is becoming more and more popular because of
increased memory sizes. So it becomes necessary to optimize THP swap
performance.
The advantages of the THP swap support include:
- Batch the swap operations for the THP to reduce TLB flushing and
lock acquiring/releasing, including allocating/freeing the swap
space, adding/deleting to/from the swap cache, and writing/reading
the swap space, etc. This will help improve the performance of the
THP swap.
- THP swap space reads/writes will be 2M sequential IO. This is
particularly helpful for swap reads, which are usually 4k random
IO. This will improve the performance of THP swap too.
- It will help with memory fragmentation, especially when THP is
heavily used by applications. 2M of contiguous pages will be freed
up after the THP is swapped out.
- It will improve THP utilization on systems with swap turned on,
because khugepaged collapses normal pages into a THP quite slowly.
Once a THP is split during swap out, it takes quite a long time for
the normal pages to be collapsed back into a THP after being swapped
in. High THP utilization also helps the efficiency of page-based
memory management.
There are some concerns regarding THP swap in, mainly because possible
enlarged read/write IO size (for swap in/out) may put more overhead on
the storage device. To deal with that, the THP swap in should be
turned on only when necessary. For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
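For comparison, such a control could mirror the existing global THP
knob; the swap-in knob below is purely hypothetical, only the first
path exists today:

  # cat /sys/kernel/mm/transparent_hugepage/enabled
  always [madvise] never
  # echo madvise > /sys/kernel/mm/transparent_hugepage/swapin_enabled   (hypothetical)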
Best Regards,
Huang, Ying
[RFC 1/4] libnvdimm: add to_{nvdimm,nd_region}_dev()
by Oliver O'Halloran
struct device contains the ->of_node pointer so that devices can be
associated with the device-tree node that created them on DT platforms.
libnvdimm hides the struct device for regions and nvdimm devices inside
an opaque structure, so this patch adds an accessor for each to allow
the of_nvdimm driver to set the of_node pointer.
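For example, the of_nvdimm driver could then associate a region with its
device-tree node along these lines (an illustrative sketch; 'np' stands
in for whatever struct device_node the driver holds):

  struct device *dev = to_nd_region_dev(nd_region);

  dev->of_node = of_node_get(np);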
Signed-off-by: Oliver O'Halloran <oohall(a)gmail.com>
---
drivers/nvdimm/dimm_devs.c | 6 ++++++
drivers/nvdimm/region_devs.c | 6 ++++++
include/linux/libnvdimm.h | 2 ++
3 files changed, 14 insertions(+)
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index f0d1b7e5de01..cbddac011181 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -227,6 +227,12 @@ struct nvdimm *to_nvdimm(struct device *dev)
}
EXPORT_SYMBOL_GPL(to_nvdimm);
+struct device *to_nvdimm_dev(struct nvdimm *nvdimm)
+{
+ return &nvdimm->dev;
+}
+EXPORT_SYMBOL_GPL(to_nvdimm_dev);
+
struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr)
{
struct nd_region *nd_region = &ndbr->nd_region;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index cbaab4210c39..6c3988135fd5 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -182,6 +182,12 @@ struct nd_region *to_nd_region(struct device *dev)
}
EXPORT_SYMBOL_GPL(to_nd_region);
+struct device *to_nd_region_dev(struct nd_region *region)
+{
+ return &region->dev;
+}
+EXPORT_SYMBOL_GPL(to_nd_region_dev);
+
struct nd_blk_region *to_nd_blk_region(struct device *dev)
{
struct nd_region *nd_region = to_nd_region(dev);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 550761477005..10fbc523ff95 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -139,6 +139,8 @@ struct nd_region *to_nd_region(struct device *dev);
struct nd_blk_region *to_nd_blk_region(struct device *dev);
struct nvdimm_bus_descriptor *to_nd_desc(struct nvdimm_bus *nvdimm_bus);
struct device *to_nvdimm_bus_dev(struct nvdimm_bus *nvdimm_bus);
+struct device *to_nvdimm_dev(struct nvdimm *nvdimm);
+struct device *to_nd_region_dev(struct nd_region *region);
const char *nvdimm_name(struct nvdimm *nvdimm);
struct kobject *nvdimm_kobj(struct nvdimm *nvdimm);
unsigned long nvdimm_cmd_mask(struct nvdimm *nvdimm);
--
2.9.4
[PATCH] libnvdimm: Stop using HPAGE_SIZE
by Oliver O'Halloran
Currently libnvdimm uses HPAGE_SIZE as the default alignment for DAX and
PFN devices. HPAGE_SIZE is the default hugetlbfs page size and when
hugetlbfs is disabled it defaults to PAGE_SIZE. Given that DAX has more
in common with THP than hugetlbfs, we should probably be using
HPAGE_PMD_SIZE, but this is undefined when THP is disabled, so let's
just give it a new name.
The other usage of HPAGE_SIZE in libnvdimm is when determining how large
the altmap should be. For the reasons mentioned above it doesn't really
make sense to use HPAGE_SIZE here either. PMD_SIZE seems to be safe to
use in generic code and it happens to match the vmemmap allocation block
on x86 and Power. It's still a hack, but it's a slightly nicer hack.
Signed-off-by: Oliver O'Halloran <oohall(a)gmail.com>
---
drivers/nvdimm/nd.h | 7 +++++++
drivers/nvdimm/pfn_devs.c | 9 +++++----
2 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 8cabd836df0e..714e3337b609 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -281,6 +281,13 @@ static inline struct device *nd_btt_create(struct nd_region *nd_region)
struct nd_pfn *to_nd_pfn(struct device *dev);
#if IS_ENABLED(CONFIG_NVDIMM_PFN)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define PFN_DEFAULT_ALIGNMENT HPAGE_PMD_SIZE
+#else
+#define PFN_DEFAULT_ALIGNMENT PAGE_SIZE
+#endif
+
int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns);
bool is_nd_pfn(struct device *dev);
struct device *nd_pfn_create(struct nd_region *nd_region);
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 5fcb6f5b22a2..2ae9a000b090 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -290,7 +290,7 @@ struct device *nd_pfn_devinit(struct nd_pfn *nd_pfn,
return NULL;
nd_pfn->mode = PFN_MODE_NONE;
- nd_pfn->align = HPAGE_SIZE;
+ nd_pfn->align = PFN_DEFAULT_ALIGNMENT;
dev = &nd_pfn->dev;
device_initialize(&nd_pfn->dev);
if (ndns && !__nd_attach_ndns(&nd_pfn->dev, ndns, &nd_pfn->ndns)) {
@@ -638,11 +638,12 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
/ PAGE_SIZE);
if (nd_pfn->mode == PFN_MODE_PMEM) {
/*
- * vmemmap_populate_hugepages() allocates the memmap array in
- * HPAGE_SIZE chunks.
+ * The altmap should be padded out to the block size used
+ * when populating the vmemmap. This *should* be equal to
+ * PMD_SIZE for most architectures.
*/
offset = ALIGN(start + SZ_8K + 64 * npfns + dax_label_reserve,
- max(nd_pfn->align, HPAGE_SIZE)) - start;
+ max(nd_pfn->align, PMD_SIZE)) - start;
} else if (nd_pfn->mode == PFN_MODE_RAM)
offset = ALIGN(start + SZ_8K + dax_label_reserve,
nd_pfn->align) - start;
--
2.9.4
[PATCH] libnvdimm: show supported dax/pfn region alignments in sysfs
by Oliver O'Halloran
The alignment of DAX and PFN regions dictates the page sizes that can
be used to map the region. Even if the hardware page sizes are known,
the actual range of supported page sizes that can be used with DAX
depends on the kernel configuration. As a result it's best that the
kernel advertises the alignments that should be used with these region
types.
This patch adds the 'supported_alignments' region attribute to expose
this information to userspace.
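On an x86 kernel with THP (and 1GB PUD mapping support) enabled, reading
the new attribute would presumably look something like this (device name
and output are hypothetical):

  # cat /sys/bus/nd/devices/pfn0.1/supported_alignments
  4096 2097152 1073741824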
Signed-off-by: Oliver O'Halloran <oohall(a)gmail.com>
---
drivers/nvdimm/pfn_devs.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 2ae9a000b090..505d50ef9a91 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -260,6 +260,28 @@ static ssize_t size_show(struct device *dev,
}
static DEVICE_ATTR_RO(size);
+static ssize_t supported_alignments_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ /*
+ * This needs to be a local variable because the *_SIZE macros
+ * aren't always constants.
+ */
+ unsigned long supported_alignments[] = {
+ PAGE_SIZE,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ HPAGE_PMD_SIZE,
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+ HPAGE_PUD_SIZE,
+#endif
+#endif
+ 0,
+ };
+
+ return nd_sector_size_show(0, supported_alignments, buf);
+}
+static DEVICE_ATTR_RO(supported_alignments);
+
static struct attribute *nd_pfn_attributes[] = {
&dev_attr_mode.attr,
&dev_attr_namespace.attr,
@@ -267,6 +289,7 @@ static struct attribute *nd_pfn_attributes[] = {
&dev_attr_align.attr,
&dev_attr_resource.attr,
&dev_attr_size.attr,
+ &dev_attr_supported_alignments.attr,
NULL,
};
--
2.9.4