[RFC PATCH] block: introduce poison tracking for block devices
by Vishal Verma
This patch copies the badblock management code from md-raid and uses it
for tracking bad/'poison' sectors at a per-block-device level.
NVDIMM devices, which behave more like DRAM, may develop bad cache
lines, or 'poison'. A block device exposed by the pmem driver can
then consume poison via a read (or write), and cause a machine check.
On platforms without machine check recovery features, this would
mean a crash.
Having the block device maintain a runtime list of all known poison
lets it avoid such accesses directly, and also provides a path forward
to enable proper handling/recovery for DAX faults on such a device.
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
This really is a copy-paste + a few modifications of the badblock management
code + sysfs representation from md.
In this RFC, I want to make sure this path sounds acceptable for the use case
described above, for NVDIMMs. Eventually, I think the md badblock management
and this should be refactored to use the same code - I think this should be
easy to do:
- move the badblocks struct and associated functions into a header file (along
the lines of include/linux/list.h)
- embed the structure into whatever needs to use this list (in case of md, this
would be 'rdev', in the nvdimm case, the gendisk)
- call the functions from badblocks.h as needed to manipulate the list.
- The sysfs show/store functions in badblocks.h would be generic variants, with
wrappers being present in md and gendisk to fit into their respective sysfs
layouts
If this looks generally reasonable, I'll post a v2 with this refactoring done.
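For illustration, a minimal sketch of what such a shared header could look
like (names and signatures here are placeholders, not a settled interface):

/* include/linux/badblocks.h - hypothetical shared interface (sketch) */
#include <linux/seqlock.h>
#include <linux/types.h>

struct badblocks {
	int count;		/* number of bad-block ranges */
	int unacked_exist;	/* unacknowledged ranges may exist */
	int shift;		/* sector-to-block shift; negative disables */
	u64 *page;		/* packed 64-bit range entries */
	int changed;
	seqlock_t lock;
};

int badblocks_init(struct badblocks *bb);
int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
		    sector_t *first_bad, int *bad_sectors);
int badblocks_set(struct badblocks *bb, sector_t s, int sectors, int acked);
int badblocks_clear(struct badblocks *bb, sector_t s, int sectors);
void badblocks_ack_all(struct badblocks *bb);

md's rdev and the block layer's gendisk would then each embed (or point to)
a struct badblocks, with thin sysfs show/store wrappers on either side.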
block/genhd.c | 502 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/genhd.h | 26 +++
2 files changed, 528 insertions(+)
diff --git a/block/genhd.c b/block/genhd.c
index 0c706f3..de99d28 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -23,6 +23,15 @@
#include "blk.h"
+#define BB_LEN_MASK (0x00000000000001FFULL)
+#define BB_OFFSET_MASK (0x7FFFFFFFFFFFFE00ULL)
+#define BB_ACK_MASK (0x8000000000000000ULL)
+#define BB_MAX_LEN 512
+#define BB_OFFSET(x) (((x) & BB_OFFSET_MASK) >> 9)
+#define BB_LEN(x) (((x) & BB_LEN_MASK) + 1)
+#define BB_ACK(x) (!!((x) & BB_ACK_MASK))
+#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
+
static DEFINE_MUTEX(block_class_lock);
struct kobject *block_depr;
@@ -670,6 +679,496 @@ void del_gendisk(struct gendisk *disk)
}
EXPORT_SYMBOL(del_gendisk);
+int disk_poison_list_init(struct gendisk *disk)
+{
+ disk->plist = kmalloc(sizeof(*disk->plist), GFP_KERNEL);
+ if (!disk->plist)
+ return -ENOMEM;
+ disk->plist->count = 0;
+ disk->plist->shift = 0;
+ disk->plist->page = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ seqlock_init(&disk->plist->lock);
+ if (disk->plist->page == NULL) {
+ kfree(disk->plist);
+ disk->plist = NULL;
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(disk_poison_list_init);
+
+/* Bad block management.
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide. This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ * A 'shift' can be set so that larger blocks are tracked and
+ * consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so md_is_badblock
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad. So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ * 0 if there are no known bad blocks in the range
+ * 1 if there are known bad blocks which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int disk_check_poison(struct gendisk *disk, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors)
+{
+ struct disk_poison *bb = disk->plist;
+ int hi;
+ int lo;
+ u64 *p = bb->page;
+ int rv;
+ sector_t target = s + sectors;
+ unsigned seq;
+
+ if (bb->shift > 0) {
+ /* round the start down, and the end up */
+ s >>= bb->shift;
+ target += (1<<bb->shift) - 1;
+ target >>= bb->shift;
+ sectors = target - s;
+ }
+ /* 'target' is now the first block after the bad range */
+
+retry:
+ seq = read_seqbegin(&bb->lock);
+ lo = 0;
+ rv = 0;
+ hi = bb->count;
+
+ /* Binary search between lo and hi for 'target'
+ * i.e. for the last range that starts before 'target'
+ */
+ /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+ * are known not to be the last range before target.
+ * VARIANT: hi-lo is the number of possible
+ * ranges, and decreases until it reaches 1
+ */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a < target)
+ /* This could still be the one, earlier ranges
+ * could not. */
+ lo = mid;
+ else
+ /* This and later ranges are definitely out. */
+ hi = mid;
+ }
+ /* 'lo' might be the last that started before target, but 'hi' isn't */
+ if (hi > lo) {
+ /* need to check all ranges that end after 's' to see if
+ * any are unacknowledged.
+ */
+ while (lo >= 0 &&
+ BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+ if (BB_OFFSET(p[lo]) < target) {
+ /* starts before the end, and finishes after
+ * the start, so they must overlap
+ */
+ if (rv != -1 && BB_ACK(p[lo]))
+ rv = 1;
+ else
+ rv = -1;
+ *first_bad = BB_OFFSET(p[lo]);
+ *bad_sectors = BB_LEN(p[lo]);
+ }
+ lo--;
+ }
+ }
+
+ if (read_seqretry(&bb->lock, seq))
+ goto retry;
+
+ return rv;
+}
+EXPORT_SYMBOL_GPL(disk_check_poison);
+
+/*
+ * Add a range of bad blocks to the table.
+ * This might extend the table, or might contract it
+ * if two adjacent ranges can be merged.
+ * We binary-search to find the 'insertion' point, then
+ * decide how best to handle it.
+ */
+int disk_add_poison(struct gendisk *disk, sector_t s, int sectors,
+ int acknowledged)
+{
+ struct disk_poison *bb = disk->plist;
+ u64 *p;
+ int lo, hi;
+ int rv = 1;
+ unsigned long flags;
+
+ if (bb->shift < 0)
+ /* badblocks are disabled */
+ return 0;
+
+ if (bb->shift) {
+ /* round the start down, and the end up */
+ sector_t next = s + sectors;
+ s >>= bb->shift;
+ next += (1<<bb->shift) - 1;
+ next >>= bb->shift;
+ sectors = next - s;
+ }
+
+ write_seqlock_irqsave(&bb->lock, flags);
+
+ p = bb->page;
+ lo = 0;
+ hi = bb->count;
+ /* Find the last range that starts at-or-before 's' */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a <= s)
+ lo = mid;
+ else
+ hi = mid;
+ }
+ if (hi > lo && BB_OFFSET(p[lo]) > s)
+ hi = lo;
+
+ if (hi > lo) {
+ /* we found a range that might merge with the start
+ * of our new range
+ */
+ sector_t a = BB_OFFSET(p[lo]);
+ sector_t e = a + BB_LEN(p[lo]);
+ int ack = BB_ACK(p[lo]);
+ if (e >= s) {
+ /* Yes, we can merge with a previous range */
+ if (s == a && s + sectors >= e)
+ /* new range covers old */
+ ack = acknowledged;
+ else
+ ack = ack && acknowledged;
+
+ if (e < s + sectors)
+ e = s + sectors;
+ if (e - a <= BB_MAX_LEN) {
+ p[lo] = BB_MAKE(a, e-a, ack);
+ s = e;
+ } else {
+ /* does not all fit in one range,
+ * make p[lo] maximal
+ */
+ if (BB_LEN(p[lo]) != BB_MAX_LEN)
+ p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
+ s = a + BB_MAX_LEN;
+ }
+ sectors = e - s;
+ }
+ }
+ if (sectors && hi < bb->count) {
+ /* 'hi' points to the first range that starts after 's'.
+ * Maybe we can merge with the start of that range */
+ sector_t a = BB_OFFSET(p[hi]);
+ sector_t e = a + BB_LEN(p[hi]);
+ int ack = BB_ACK(p[hi]);
+ if (a <= s + sectors) {
+ /* merging is possible */
+ if (e <= s + sectors) {
+ /* full overlap */
+ e = s + sectors;
+ ack = acknowledged;
+ } else
+ ack = ack && acknowledged;
+
+ a = s;
+ if (e - a <= BB_MAX_LEN) {
+ p[hi] = BB_MAKE(a, e-a, ack);
+ s = e;
+ } else {
+ p[hi] = BB_MAKE(a, BB_MAX_LEN, ack);
+ s = a + BB_MAX_LEN;
+ }
+ sectors = e - s;
+ lo = hi;
+ hi++;
+ }
+ }
+ if (sectors == 0 && hi < bb->count) {
+ /* we might be able to combine lo and hi */
+ /* Note: 's' is at the end of 'lo' */
+ sector_t a = BB_OFFSET(p[hi]);
+ int lolen = BB_LEN(p[lo]);
+ int hilen = BB_LEN(p[hi]);
+ int newlen = lolen + hilen - (s - a);
+ if (s >= a && newlen < BB_MAX_LEN) {
+ /* yes, we can combine them */
+ int ack = BB_ACK(p[lo]) && BB_ACK(p[hi]);
+ p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
+ memmove(p + hi, p + hi + 1,
+ (bb->count - hi - 1) * 8);
+ bb->count--;
+ }
+ }
+ while (sectors) {
+ /* didn't merge (it all).
+ * Need to add a range just before 'hi' */
+ if (bb->count >= DISK_MAX_POISON) {
+ /* No room for more */
+ rv = 0;
+ break;
+ } else {
+ int this_sectors = sectors;
+ memmove(p + hi + 1, p + hi,
+ (bb->count - hi) * 8);
+ bb->count++;
+
+ if (this_sectors > BB_MAX_LEN)
+ this_sectors = BB_MAX_LEN;
+ p[hi] = BB_MAKE(s, this_sectors, acknowledged);
+ sectors -= this_sectors;
+ s += this_sectors;
+ }
+ }
+
+ bb->changed = 1;
+ if (!acknowledged)
+ bb->unacked_exist = 1;
+ write_sequnlock_irqrestore(&bb->lock, flags);
+
+ /* Make sure they get written out promptly */
+ /* TODO sysfs_notify_dirent_safe(disk->sysfs_state); */
+
+ return rv;
+}
+EXPORT_SYMBOL_GPL(disk_add_poison);
+
+/*
+ * Remove a range of bad blocks from the table.
+ * This may involve extending the table if we split a region,
+ * but it must not fail. So if the table becomes full, we just
+ * drop the remove request.
+ */
+int disk_clear_poison(struct gendisk *disk, sector_t s, int sectors)
+{
+ struct disk_poison *bb = disk->plist;
+ u64 *p;
+ int lo, hi;
+ sector_t target = s + sectors;
+ int rv = 0;
+
+ if (bb->shift > 0) {
+ /* When clearing we round the start up and the end down.
+ * This should not matter as the shift should align with
+ * the block size and no rounding should ever be needed.
+ * However it is better to think a block is bad when it
+ * isn't than to think a block is not bad when it is.
+ */
+ s += (1<<bb->shift) - 1;
+ s >>= bb->shift;
+ target >>= bb->shift;
+ sectors = target - s;
+ }
+
+ write_seqlock_irq(&bb->lock);
+
+ p = bb->page;
+ lo = 0;
+ hi = bb->count;
+ /* Find the last range that starts before 'target' */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a < target)
+ lo = mid;
+ else
+ hi = mid;
+ }
+ if (hi > lo) {
+ /* p[lo] is the last range that could overlap the
+ * current range. Earlier ranges could also overlap,
+ * but only this one can overlap the end of the range.
+ */
+ if (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) {
+ /* Partial overlap, leave the tail of this range */
+ int ack = BB_ACK(p[lo]);
+ sector_t a = BB_OFFSET(p[lo]);
+ sector_t end = a + BB_LEN(p[lo]);
+
+ if (a < s) {
+ /* we need to split this range */
+ if (bb->count >= DISK_MAX_POISON) {
+ rv = -ENOSPC;
+ goto out;
+ }
+ memmove(p+lo+1, p+lo, (bb->count - lo) * 8);
+ bb->count++;
+ p[lo] = BB_MAKE(a, s-a, ack);
+ lo++;
+ }
+ p[lo] = BB_MAKE(target, end - target, ack);
+ /* there is no longer an overlap */
+ hi = lo;
+ lo--;
+ }
+ while (lo >= 0 &&
+ BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+ /* This range does overlap */
+ if (BB_OFFSET(p[lo]) < s) {
+ /* Keep the early parts of this range. */
+ int ack = BB_ACK(p[lo]);
+ sector_t start = BB_OFFSET(p[lo]);
+ p[lo] = BB_MAKE(start, s - start, ack);
+ /* now lo doesn't overlap, so.. */
+ break;
+ }
+ lo--;
+ }
+ /* 'lo' is strictly before, 'hi' is strictly after,
+ * anything between needs to be discarded
+ */
+ if (hi - lo > 1) {
+ memmove(p+lo+1, p+hi, (bb->count - hi) * 8);
+ bb->count -= (hi - lo - 1);
+ }
+ }
+
+ bb->changed = 1;
+out:
+ write_sequnlock_irq(&bb->lock);
+ return rv;
+}
+EXPORT_SYMBOL_GPL(disk_clear_poison);
+
+/*
+ * Acknowledge all bad blocks in a list.
+ * This only succeeds if ->changed is clear. It is used by
+ * in-kernel metadata updates
+ */
+void disk_ack_all_poison(struct gendisk *disk)
+{
+ struct disk_poison *bb = disk->plist;
+
+ if (bb->page == NULL || bb->changed)
+ /* no point even trying */
+ return;
+ write_seqlock_irq(&bb->lock);
+
+ if (bb->changed == 0 && bb->unacked_exist) {
+ u64 *p = bb->page;
+ int i;
+ for (i = 0; i < bb->count ; i++) {
+ if (!BB_ACK(p[i])) {
+ sector_t start = BB_OFFSET(p[i]);
+ int len = BB_LEN(p[i]);
+ p[i] = BB_MAKE(start, len, 1);
+ }
+ }
+ bb->unacked_exist = 0;
+ }
+ write_sequnlock_irq(&bb->lock);
+}
+EXPORT_SYMBOL_GPL(disk_ack_all_poison);
+
+/* sysfs access to bad-blocks list.
+ * We present two files.
+ * 'bad-blocks' lists sector numbers and lengths of ranges that
+ * are recorded as bad. The list is truncated to fit within
+ * the one-page limit of sysfs.
+ * Writing "sector length" to this file adds an acknowledged
+ * bad block range.
+ * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
+ * been acknowledged. Writing to this file adds bad blocks
+ * without acknowledging them. This is largely for testing.
+ */
+
+static ssize_t poison_list_show(struct device *dev,
+ struct device_attribute *attr,
+ char *page)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+ struct disk_poison *bb = disk->plist;
+ size_t len;
+ int i;
+ u64 *p = bb->page;
+ unsigned seq;
+
+ if (bb->shift < 0)
+ return 0;
+
+retry:
+ seq = read_seqbegin(&bb->lock);
+
+ len = 0;
+ i = 0;
+
+ while (len < PAGE_SIZE && i < bb->count) {
+ sector_t s = BB_OFFSET(p[i]);
+ unsigned int length = BB_LEN(p[i]);
+
+ i++;
+ len += snprintf(page+len, PAGE_SIZE-len, "%llu %u\n",
+ (unsigned long long)s << bb->shift,
+ length << bb->shift);
+ }
+
+ if (read_seqretry(&bb->lock, seq))
+ goto retry;
+
+ return len;
+}
+
+#define DO_DEBUG 1
+
+static ssize_t poison_list_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *page, size_t len)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+ unsigned long long sector;
+ int length;
+ char newline;
+#ifdef DO_DEBUG
+ /* Allow clearing via sysfs *only* for testing/debugging.
+ * Normally only a successful write may clear a badblock
+ */
+ int clear = 0;
+ if (page[0] == '-') {
+ clear = 1;
+ page++;
+ }
+#endif /* DO_DEBUG */
+
+ switch (sscanf(page, "%llu %d%c", &sector, &length, &newline)) {
+ case 3:
+ if (newline != '\n')
+ return -EINVAL;
+ case 2:
+ if (length <= 0)
+ return -EINVAL;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+#ifdef DO_DEBUG
+ if (clear) {
+ disk_clear_poison(disk, sector, length);
+ return len;
+ }
+#endif /* DO_DEBUG */
+ if (disk_add_poison(disk, sector, length, 1))
+ return len;
+ else
+ return -ENOSPC;
+}
+
/**
* get_gendisk - get partitioning information for a given device
* @devt: device to get partitioning information for
@@ -988,6 +1487,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, disk_discard_alignment_show,
static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+static DEVICE_ATTR(poison_list, S_IRUGO | S_IWUSR, poison_list_show,
+ poison_list_store);
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -1009,6 +1510,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_capability.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+ &dev_attr_poison_list.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 2adbfa6..9acfe1b 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -163,6 +163,24 @@ struct disk_part_tbl {
struct disk_events;
+#define DISK_MAX_POISON (PAGE_SIZE/8)
+
+struct disk_poison {
+ int count; /* count of bad blocks */
+ int unacked_exist; /* there probably are unacknowledged
+ * bad blocks. This is only cleared
+ * when a read discovers none
+ */
+ int shift; /* shift from sectors to block size
+ * a -ve shift means badblocks are
+ * disabled.*/
+ u64 *page; /* badblock list */
+ int changed;
+ seqlock_t lock;
+ sector_t sector;
+ sector_t size; /* in sectors */
+};
+
struct gendisk {
/* major, first_minor and minors are input parameters only,
* don't use directly. Use disk_devt() and disk_max_parts().
@@ -201,6 +219,7 @@ struct gendisk {
struct blk_integrity *integrity;
#endif
int node_id;
+ struct disk_poison *plist;
};
static inline struct gendisk *part_to_disk(struct hd_struct *part)
@@ -434,6 +453,13 @@ extern void disk_block_events(struct gendisk *disk);
extern void disk_unblock_events(struct gendisk *disk);
extern void disk_flush_events(struct gendisk *disk, unsigned int mask);
extern unsigned int disk_clear_events(struct gendisk *disk, unsigned int mask);
+extern int disk_poison_list_init(struct gendisk *disk);
+extern int disk_check_poison(struct gendisk *disk, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors);
+extern int disk_add_poison(struct gendisk *disk, sector_t s, int sectors,
+ int acknowledged);
+extern int disk_clear_poison(struct gendisk *disk, sector_t s, int sectors);
+extern void disk_ack_all_poison(struct gendisk *disk);
/* drivers/char/random.c */
extern void add_disk_randomness(struct gendisk *disk);
--
2.5.0
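As a quick illustration of the 64-bit entry encoding used by the BB_*
macros in the patch above, here is a standalone userspace sketch that
packs and unpacks one range (same macro definitions as the patch):

/* Demo of the bad-block entry encoding; compiles with any C compiler. */
#include <stdio.h>
#include <stdint.h>

typedef uint64_t u64;

#define BB_LEN_MASK	(0x00000000000001FFULL)
#define BB_OFFSET_MASK	(0x7FFFFFFFFFFFFE00ULL)
#define BB_ACK_MASK	(0x8000000000000000ULL)
#define BB_OFFSET(x)	(((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x)	(((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x)	(!!((x) & BB_ACK_MASK))
#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))

int main(void)
{
	/* a 16-sector poison range starting at sector 123456, acknowledged */
	u64 e = BB_MAKE(123456ULL, 16, 1);

	printf("entry=0x%016llx offset=%llu len=%llu ack=%d\n",
	       (unsigned long long)e,
	       (unsigned long long)BB_OFFSET(e),
	       (unsigned long long)BB_LEN(e), BB_ACK(e));
	/* prints: entry=0x8000000003c4800f offset=123456 len=16 ack=1 */
	return 0;
}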
[RFC PATCH] Fix _FIT vs. NFIT processing breakage
by Linda Knippers
Since commit 209851649dc4f7900a6bfe1de5e2640ab2c7d931, we no longer
see NVDIMM devices on our systems. The NFIT/_FIT processing at
initialization gets a table from _FIT but doesn't like it.
When support for _FIT was added, the code presumed that the data
returned by the _FIT method is identical to the NFIT table, which
starts with an acpi_table_header. However, the _FIT is defined
to return data in the format of a series of NFIT-type structure
entries and, as a method, has an acpi_object header rather than
an acpi_table_header.
To address the differences, explicitly save the acpi_table_header
from the NFIT, since it is accessible through /sys, and change
the nfit pointer in the acpi_desc structure to point to the
table entries rather than the headers.
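Schematically, the two layouts differ only in their leading header (a
sketch, not the patch itself):

  static NFIT table:                 _FIT method output:
    struct acpi_table_header           union acpi_object (ACPI_TYPE_BUFFER)
    <NFIT sub-table entries ...>         obj->buffer.pointer ->
                                           <NFIT sub-table entries ...>

so both paths must end with acpi_desc->nfit pointing at the sub-table
entries, which is what the patch below arranges.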
This is an RFC patch for several reasons.
1) I've only tested the boot path, not the code path that
gets a _FIT later.
2) There is some debug information that we probably don't
want to keep in there.
3) I'm not even sure we should be checking _FIT at boot time
4) While this fixes my platform, it probably breaks the tests
that were used to test the original commit.
If we need to have a long discussion about whether our firmware
is correct, then perhaps we can remove the _FIT code from acpi_nfit_add()
while we sort it out.
Reported-by: Jeff Moyer <jmoyer(a)redhat.com>
Signed-off-by: Linda Knippers <linda.knippers(a)hp.com>
---
drivers/acpi/nfit.c | 55 +++++++++++++++++++++++++++++++++++++++++------------
drivers/acpi/nfit.h | 3 ++-
2 files changed, 45 insertions(+), 13 deletions(-)
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index f7dab53..ad95113 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -655,7 +655,7 @@ static ssize_t revision_show(struct device *dev,
struct nvdimm_bus_descriptor *nd_desc = to_nd_desc(nvdimm_bus);
struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
- return sprintf(buf, "%d\n", acpi_desc->nfit->header.revision);
+ return sprintf(buf, "%d\n", acpi_desc->acpi_header.revision);
}
static DEVICE_ATTR_RO(revision);
@@ -1652,7 +1652,6 @@ int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
data = (u8 *) acpi_desc->nfit;
end = data + sz;
- data += sizeof(struct acpi_table_nfit);
while (!IS_ERR_OR_NULL(data))
data = add_table(acpi_desc, &prev, data, end);
@@ -1748,13 +1747,34 @@ static int acpi_nfit_add(struct acpi_device *adev)
return PTR_ERR(acpi_desc);
}
- acpi_desc->nfit = (struct acpi_table_nfit *) tbl;
+ /*
+ * Save the acpi header for later and then skip it, make
+ * nfit point to the first nfit table header.
+ */
+ acpi_desc->acpi_header = *tbl;
+ acpi_desc->nfit = (void *) tbl + sizeof(struct acpi_table_nfit);
+ sz -= sizeof(struct acpi_table_nfit);
/* Evaluate _FIT and override with that if present */
status = acpi_evaluate_object(adev->handle, "_FIT", NULL, &buf);
if (ACPI_SUCCESS(status) && buf.length > 0) {
- acpi_desc->nfit = (struct acpi_table_nfit *)buf.pointer;
- sz = buf.length;
+ union acpi_object *obj;
+
+ dev_dbg(dev, "%s _FIT ptr %p, length: %d\n",
+ __func__, buf.pointer, (int)buf.length);
+ print_hex_dump_debug("_FIT: ", DUMP_PREFIX_OFFSET, 16, 1,
+ buf.pointer, buf.length, true);
+
+ /*
+ * Adjust for the acpi_object header of the _FIT
+ */
+ obj = buf.pointer;
+ if (obj->type == ACPI_TYPE_BUFFER) {
+ acpi_desc->nfit = (struct acpi_nfit_header *)obj->buffer.pointer;
+ sz = obj->buffer.length;
+ } else
+ dev_dbg(dev, "%s invalid type %d, ignoring _FIT\n",
+ __func__, (int) obj->type);
}
rc = acpi_nfit_init(acpi_desc, sz);
@@ -1777,8 +1796,9 @@ static void acpi_nfit_notify(struct acpi_device *adev, u32 event)
{
struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(&adev->dev);
struct acpi_buffer buf = { ACPI_ALLOCATE_BUFFER, NULL };
- struct acpi_table_nfit *nfit_saved;
+ struct acpi_nfit_header *nfit_saved;
struct device *dev = &adev->dev;
+ union acpi_object *obj;
acpi_status status;
int ret;
@@ -1807,13 +1827,24 @@ static void acpi_nfit_notify(struct acpi_device *adev, u32 event)
goto out_unlock;
}
+ dev_dbg(dev, "%s _FIT ptr %p, length: %d\n",
+ __func__, buf.pointer, (int)buf.length);
+ print_hex_dump_debug("_FIT: ", DUMP_PREFIX_OFFSET, 16, 1,
+ buf.pointer, buf.length, true);
+
nfit_saved = acpi_desc->nfit;
- acpi_desc->nfit = (struct acpi_table_nfit *)buf.pointer;
- ret = acpi_nfit_init(acpi_desc, buf.length);
- if (!ret) {
- /* Merge failed, restore old nfit, and exit */
- acpi_desc->nfit = nfit_saved;
- dev_err(dev, "failed to merge updated NFIT\n");
+ obj = buf.pointer;
+ if (obj->type == ACPI_TYPE_BUFFER) {
+ acpi_desc->nfit = (struct acpi_nfit_header *)obj->buffer.pointer;
+ ret = acpi_nfit_init(acpi_desc, obj->buffer.length);
+ if (!ret) {
+ /* Merge failed, restore old nfit, and exit */
+ acpi_desc->nfit = nfit_saved;
+ dev_err(dev, "failed to merge updated NFIT\n");
+ }
+ } else {
+ /* Bad _FIT, restore old nfit */
+ dev_err(dev, "Invalid _FIT\n");
}
kfree(buf.pointer);
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index 2ea5c07..3d549a3 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -96,7 +96,8 @@ struct nfit_mem {
struct acpi_nfit_desc {
struct nvdimm_bus_descriptor nd_desc;
- struct acpi_table_nfit *nfit;
+ struct acpi_table_header acpi_header;
+ struct acpi_nfit_header *nfit;
struct mutex spa_map_mutex;
struct mutex init_mutex;
struct list_head spa_maps;
--
1.8.3.1
Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
by Dan Williams
On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger(a)dilger.ca> wrote:
> On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
>>
>> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
>> <ross.zwisler(a)linux.intel.com> wrote:
>>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios. These
>>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
>>> and are used by filesystems to order their metadata, among other things.
>>>
>>> When we get an msync() or fsync() it is the responsibility of the DAX code
>>> to flush all dirty pages to media. The PMEM driver then just has issue a
>>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
>>> the flushed data has been durably stored on the media.
>>>
>>> Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
>>
>> Hmm, I'm not seeing why we need this patch. If the actual flushing of
>> the cache is done by the core why does the driver need support
>> REQ_FLUSH? Especially since it's just a couple instructions. REQ_FUA
>> only makes sense if individual writes can bypass the "drive" cache,
>> but no I/O submitted to the driver proper is ever cached; we always
>> flush it through to media.
>
> If the upper level filesystem gets an error when submitting a flush
> request, then it assumes the underlying hardware is broken and cannot
> be as aggressive in IO submission, but instead has to wait for in-flight
> IO to complete.
Upper level filesystems won't get errors when the driver does not
support flush. Those requests are ended cleanly in
generic_make_request_checks(). Yes, the fs still needs to wait for
outstanding I/O to complete but in the case of pmem all I/O is
synchronous. There's never anything to await when flushing at the
pmem driver level.
> Since FUA/FLUSH is basically a no-op for pmem devices,
> it doesn't make sense _not_ to support this functionality.
Seems to be a nop either way. Given that DAX may lead to dirty data
pending to the device in the cpu cache that a REQ_FLUSH request will
not touch, it's better to leave it all to the mm core to handle. I.e.
it doesn't make sense to call the driver just for two instructions
(sfence + pcommit) when the mm core is taking on the cache flushing.
Either handle it all in the mm or the driver, not a mixture.
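For reference, the trade-off being debated reduces to roughly this in
the pmem make_request path (a sketch of the idea, not the exact patch):

	/* today: fence + pcommit after every write bio */
	bio_for_each_segment(bvec, bio, iter)
		pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
			     bvec.bv_offset, bio_data_dir(bio),
			     iter.bi_sector);
	if (bio_data_dir(bio) == WRITE)
		wmb_pmem();

	/* alternative: advertise REQ_FLUSH and defer the fence */
	if (bio->bi_rw & REQ_FLUSH)
		wmb_pmem();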
[PATCH v2 0/3] nvdimm: Add an IOCTL pass thru for DSM calls
by Jerry Hoemann
The NVDIMM code in the kernel supports an IOCTL interface to user
space based upon the Intel Example DSM:
http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
This interface cannot be used by other NVDIMM DSMs that support
incompatible functions.
This patch set adds a generic "passthru" IOCTL interface which
is not tied to a particular DSM.
A new _IOC_NR ND_CMD_PASSTHRU == "100" is added for the pass thru call.
The new data structure nd_passthru_pkg serves as a wrapper for the passthru
calls. This wrapper supplies the data that the kernel needs to
make the _DSM call.
Unlike the definitions of the _DSM functions themselves, the nd_passthru_pkg
provides the calling information (input/output sizes) in a uniform
manner, making the kernel marshaling of the arguments
straightforward.
This shifts the marshaling burden from the kernel to the user
space application while still permitting the kernel to internally
call _DSM functions.
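A sketch of the shape such a wrapper could take (field names here are
illustrative, not necessarily the ones in the patches):

struct nd_passthru_pkg {
	__u64 h_dsm_rev;	/* DSM revision id */
	__u64 h_dsm_func;	/* DSM function index to invoke */
	__u32 h_buf_size_in;	/* bytes of caller-supplied input */
	__u32 h_buf_size_out;	/* bytes expected back as output */
	unsigned char buf[];	/* input payload, then output payload */
};

User space fills in the function index and the two sizes, appends the raw
_DSM payload, and the kernel can then copy the buffer in and out without
understanding the semantics of the particular function.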
To make the resulting kernel code easier to understand, the existing
functions acpi_nfit_ctl and __nd_ioctl were renamed to .*_intel to
denote that their calling mechanism, as in 4.3, is tailored to the
Intel Example DSM.
New functions acpi_nfit_ctl_passthru and __nd_ioctl_passthru were
created to supply the pass thru interface.
These changes are based upon the 4.3 kernel.
Changes in version 2:
---------------------
1. Cleanup access mode check in nd_ioctl and nvdimm_ioctl.
2. Change name of ndn_pkg to nd_passthru_pkg
3. Adjust sizes in nd_passthru_pkg. DSM integers are 64-bit.
4. No new ioctl type, instead tunnel into the existing number space.
5. Push down one function level where determine ioctl cmd type.
6. re-work diagnostic print/dump message in pass-thru functions.
Jerry Hoemann (3):
nvdimm: Clean-up access mode check.
nvdimm: Add wrapper for IOCTL pass thru
nvdimm: Add IOCTL pass thru functions
drivers/acpi/nfit.c | 93 ++++++++++++++++++++++++++++++++++++++++-
drivers/nvdimm/bus.c | 101 +++++++++++++++++++++++++++++++++++++++++----
include/uapi/linux/ndctl.h | 16 +++++++
3 files changed, 202 insertions(+), 8 deletions(-)
--
1.7.11.3
[PATCH] ndctl: dax pmd tests
by Dan Williams
Check the following conditions:
1/ gup fault
2/ gup fast
3/ copying a huge mapping on fork()
Cc: Jan Kara <jack(a)suse.com>
Cc: Dave Chinner <david(a)fromorbit.com>
Cc: Matthew Wilcox <willy(a)linux.intel.com>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
This is the test case used to generate the kernel crash signatures
mentioned in "[PATCH 2/8] dax: disable pmd mappings":
https://lists.01.org/pipermail/linux-nvdimm/2015-November/002875.html
Makefile.am | 8 ++
lib/test-dax-dev.c | 96 +++++++++++++++++++++++++++++
lib/test-dax-pmd.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++++++++
lib/test-dax.sh | 34 ++++++++++
4 files changed, 310 insertions(+)
create mode 100755 lib/test-dax-dev.c
create mode 100644 lib/test-dax-pmd.c
create mode 100755 lib/test-dax.sh
diff --git a/Makefile.am b/Makefile.am
index e998c5766631..2989e076972d 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -122,6 +122,9 @@ check_PROGRAMS = lib/test-libndctl lib/test-dpa-alloc lib/test-parent-uuid
if ENABLE_DESTRUCTIVE
TESTS += lib/test-blk-ns lib/test-pmem-ns lib/test-pcommit
check_PROGRAMS += lib/test-blk-ns lib/test-pmem-ns lib/test-pcommit
+
+TESTS += lib/test-dax-dev lib/test-dax.sh
+check_PROGRAMS += lib/test-dax-dev lib/test-dax-pmd
endif
lib_test_libndctl_SOURCES = lib/test-libndctl.c lib/test-core.c
@@ -141,3 +144,8 @@ lib_test_dpa_alloc_LDADD = lib/libndctl.la $(UUID_LIBS) $(KMOD_LIBS)
lib_test_parent_uuid_SOURCES = lib/test-parent-uuid.c lib/test-core.c
lib_test_parent_uuid_LDADD = lib/libndctl.la $(UUID_LIBS) $(KMOD_LIBS)
+
+lib_test_dax_dev_SOURCES = lib/test-dax-dev.c lib/test-core.c
+lib_test_dax_dev_LDADD = lib/libndctl.la
+
+lib_test_dax_pmd_SOURCES = lib/test-dax-pmd.c
diff --git a/lib/test-dax-dev.c b/lib/test-dax-dev.c
new file mode 100755
index 000000000000..3ca7cef0f71c
--- /dev/null
+++ b/lib/test-dax-dev.c
@@ -0,0 +1,96 @@
+/*
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU Lesser General Public License,
+ * version 2.1, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT ANY
+ * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+ * FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for
+ * more details.
+ */
+#include <stdio.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <string.h>
+#include <fcntl.h>
+#include <ctype.h>
+#include <errno.h>
+#include <unistd.h>
+#include <limits.h>
+#include <syslog.h>
+
+#include <test.h>
+#include <linux/version.h>
+#include <ndctl/libndctl.h>
+
+static int emit_e820_device(int loglevel, struct ndctl_test *test)
+{
+ int err, fd;
+ char path[256];
+ const char *bdev;
+ struct ndctl_ctx *ctx;
+ struct ndctl_bus *bus;
+ struct ndctl_region *region;
+ struct ndctl_namespace *ndns;
+
+ if (!ndctl_test_attempt(test, KERNEL_VERSION(4, 3, 0)))
+ return 77;
+
+ err = ndctl_new(&ctx);
+ if (err < 0)
+ return err;
+
+ ndctl_set_log_priority(ctx, loglevel);
+ err = -ENXIO;
+ bus = ndctl_bus_get_by_provider(ctx, "e820");
+ if (!bus)
+ goto out;
+
+ region = ndctl_region_get_first(bus);
+ if (!region)
+ goto out;
+
+ ndns = ndctl_namespace_get_first(region);
+ if (!ndns)
+ goto out;
+
+ bdev = ndctl_namespace_get_block_device(ndns);
+ if (!bdev)
+ goto out;
+
+ if (snprintf(path, sizeof(path), "/dev/%s", bdev) >= (int) sizeof(path))
+ goto out;
+
+ /*
+ * Note, if the bdev goes active after this check we'll still
+ * clobber it in the following tests, see lib/test-dax.sh.
+ */
+ fd = open(path, O_RDWR | O_EXCL);
+ if (fd < 0)
+ goto out;
+ err = 0;
+ fprintf(stdout, "%s\n", path);
+
+ out:
+ if (err)
+ fprintf(stderr, "%s: failed to find usable victim device\n",
+ __func__);
+ ndctl_unref(ctx);
+ return err;
+}
+
+int __attribute__((weak)) main(int argc, char *argv[])
+{
+ struct ndctl_test *test = ndctl_test_new(0);
+ int rc;
+
+ if (!test) {
+ fprintf(stderr, "failed to initialize test\n");
+ return EXIT_FAILURE;
+ }
+
+ rc = emit_e820_device(LOG_DEBUG, test);
+ return ndctl_test_result(test, rc);
+}
diff --git a/lib/test-dax-pmd.c b/lib/test-dax-pmd.c
new file mode 100644
index 000000000000..fc64c7b82836
--- /dev/null
+++ b/lib/test-dax-pmd.c
@@ -0,0 +1,172 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/ioctl.h>
+#include <stdlib.h>
+#include <linux/fs.h>
+#include <linux/fiemap.h>
+
+#define NUM_EXTENTS 5
+#define HPAGE_SIZE (2 << 20)
+#define ALIGN(x, a) ((((unsigned long long) x) + (a - 1)) & ~(a - 1))
+#define fail() fprintf(stderr, "%s: failed at: %d\n", __func__, __LINE__)
+#define faili(i) fprintf(stderr, "%s: failed at: %d: %ld\n", __func__, __LINE__, i)
+#define TEST_FILE "test_dax_data"
+
+/* test_pmd assumes that fd references a pre-allocated + dax-capable file */
+static int test_pmd(int fd)
+{
+ unsigned long long m_align, p_align;
+ int fd2 = -1, rc = -ENXIO;
+ struct fiemap_extent *ext;
+ struct fiemap *map;
+ void *addr, *buf;
+ unsigned long i;
+
+ if (fd < 0) {
+ fail();
+ return -ENXIO;
+ }
+
+ map = calloc(1, sizeof(struct fiemap)
+ + sizeof(struct fiemap_extent) * NUM_EXTENTS);
+ if (!map) {
+ fail();
+ return -ENXIO;
+ }
+
+ if (posix_memalign(&buf, 4096, 4096) != 0)
+ goto err_memalign;
+
+ addr = mmap(NULL, 4*HPAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ if (addr == MAP_FAILED) {
+ fail();
+ goto err_mmap;
+ }
+ munmap(addr, 4*HPAGE_SIZE);
+
+ map->fm_start = 0;
+ map->fm_length = -1;
+ map->fm_extent_count = NUM_EXTENTS;
+ rc = ioctl(fd, FS_IOC_FIEMAP, map);
+ if (rc < 0) {
+ fail();
+ goto err_extent;
+ }
+
+ for (i = 0; i < map->fm_mapped_extents; i++) {
+ ext = &map->fm_extents[i];
+ fprintf(stderr, "[%ld]: l: %llx p: %llx len: %llx flags: %x\n",
+ i, ext->fe_logical, ext->fe_physical,
+ ext->fe_length, ext->fe_flags);
+ if (ext->fe_length > 2 * HPAGE_SIZE) {
+ fprintf(stderr, "found potential huge extent\n");
+ break;
+ }
+ }
+
+ if (i >= map->fm_mapped_extents) {
+ fail();
+ goto err_extent;
+ }
+
+ m_align = ALIGN(addr, HPAGE_SIZE) - ((unsigned long) addr);
+ p_align = ALIGN(ext->fe_physical, HPAGE_SIZE) - ext->fe_physical;
+
+ for (i = 0; i < 3; i++) {
+ rc = -ENXIO;
+ addr = mmap((char *) addr + m_align, 2*HPAGE_SIZE,
+ PROT_READ|PROT_WRITE, MAP_SHARED, fd,
+ ext->fe_logical + p_align);
+ if (addr == MAP_FAILED) {
+ faili(i);
+ break;
+ }
+
+ fd2 = open(TEST_FILE, O_CREAT|O_TRUNC|O_DIRECT|O_RDWR, 0600);
+ if (fd2 < 0) {
+ faili(i);
+ munmap(addr, 2*HPAGE_SIZE);
+ break;
+ }
+
+ rc = 0;
+ switch (i) {
+ case 0: /* test O_DIRECT of unfaulted address */
+ if (write(fd2, addr, 4096) != 4096) {
+ faili(i);
+ rc = -ENXIO;
+ }
+ break;
+ case 1: /* test O_DIRECT of pre-faulted address */
+ sprintf(addr, "odirect data");
+ if (write(fd2, addr, 4096) != 4096) {
+ faili(i);
+ rc = -ENXIO;
+ }
+ ((char *) buf)[0] = 0;
+ /* rewind; the file offset is past the data just written, and
+ * sizeof(buf) would only be the size of the pointer
+ */
+ lseek(fd2, 0, SEEK_SET);
+ read(fd2, buf, 4096);
+ if (strcmp(buf, "odirect data") != 0) {
+ faili(i);
+ rc = -ENXIO;
+ }
+ break;
+ case 2: /* fork with pre-faulted pmd */
+ sprintf(addr, "fork data");
+ rc = fork();
+ if (rc == 0) {
+ /* child */
+ if (strcmp(addr, "fork data") == 0)
+ exit(EXIT_SUCCESS);
+ else
+ exit(EXIT_FAILURE);
+ } else if (rc > 0) {
+ /* parent */
+ wait(&rc);
+ if (rc != EXIT_SUCCESS)
+ faili(i);
+ } else
+ faili(i);
+ break;
+ default:
+ faili(i);
+ rc = -ENXIO;
+ break;
+ }
+
+ munmap(addr, 2*HPAGE_SIZE);
+ addr = MAP_FAILED;
+ unlink(TEST_FILE);
+ close(fd2);
+ fd2 = -1;
+ if (rc)
+ break;
+ }
+
+ err_extent:
+ err_mmap:
+ free(buf);
+ err_memalign:
+ free(map);
+ return rc;
+}
+
+int main(int argc, char *argv[])
+{
+ int fd, rc;
+
+ if (argc < 2)
+ return -EINVAL;
+
+ fd = open(argv[1], O_RDWR);
+ rc = test_pmd(fd);
+ if (fd >= 0)
+ close(fd);
+ return rc;
+}
diff --git a/lib/test-dax.sh b/lib/test-dax.sh
new file mode 100755
index 000000000000..048d82975d92
--- /dev/null
+++ b/lib/test-dax.sh
@@ -0,0 +1,34 @@
+#!/bin/bash
+MNT=test_dax_mnt
+FILE=image
+DEV=""
+
+err() {
+ rc=1
+ echo "test-dax: failed at line $1"
+ if [ -n "$DEV" ]; then
+ umount $DEV
+ else
+ rc=77
+ fi
+ rmdir $MNT
+ exit $rc
+}
+
+set -e
+mkdir -p $MNT
+trap 'err $LINENO' ERR
+
+DEV=$(lib/test-dax-dev)
+
+mkfs.ext4 $DEV
+mount $DEV $MNT -o dax
+fallocate -l 1GiB $MNT/$FILE
+lib/test-dax-pmd $MNT/$FILE
+umount $MNT
+
+mkfs.xfs -f $DEV
+mount $DEV $MNT -o dax
+fallocate -l 1GiB $MNT/$FILE
+lib/test-dax-pmd $MNT/$FILE
+umount $MNT
Re: [RFC 1/1] memremap: devm_memremap_pages has wrong nid
by Boaz Harrosh
On 11/13/2015 06:00 PM, Toshi Kani wrote:
<>
>
> Agreed. memory_add_physaddr_to_nid() uses the SRAT info, which does not work
> with the NFIT case.
>
Thanks Toshi, I did not know that NFIT would not work. (As I already ranted NFIT
is hard to find)
Would it be hard to fix? I mean the way it is today NvDIMM is always put at the
*end* of the NUMA address range, so all the NUMA boundaries (start) are there, all
we need is to make sure max_pfn is advanced behind the last NvDIMM range.
(Ok and we might have a slight problem with an NFIT only Node, where there
is no volatile memory at all)
I think it is worth fixing there are surprising places this might be used.
I know that it works with type-12 and emulated pmem.
(Once I set up my NFIT QEMU I'll see what I can find)
> Thanks,
> -Toshi
>
Thanks
Boaz
[PATCH] mm, dax: fix DAX deadlocks (COW fault)
by Yigal Korman
DAX handling of COW faults has wrong locking sequence:
dax_fault does i_mmap_lock_read
do_cow_fault does i_mmap_unlock_write
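Schematically (a sketch of the call sequence, not the diff below):

/* Before the fix:
 *   __dax_fault():    i_mmap_lock_read(mapping);
 *   do_cow_fault():   i_mmap_unlock_write(mapping);   <-- mismatched
 * After the fix, do_cow_fault() drops the lock with
 * i_mmap_unlock_read(), matching how the DAX fault handler took it.
 */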
Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
commit[3].
Original COW locking logic was introduced by Matthew here[4].
This should be applied to v4.3 as well.
[1] 0f90cc6609c7 mm, dax: fix DAX deadlocks
[2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault()
[3] 843172978bb9 dax: fix race between simultaneous faults
[4] 2e4cdab0584f mm: allow page fault handlers to perform the COW
Signed-off-by: Yigal Korman <yigal(a)plexistor.com>
Cc: Stable Tree <stable(a)vger.kernel.org>
Cc: Boaz Harrosh <boaz(a)plexistor.com>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Dave Chinner <dchinner(a)redhat.com>
Cc: Jan Kara <jack(a)suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox(a)intel.com>
---
mm/memory.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index c716913..e5071af 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3015,9 +3015,9 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
/*
* The fault handler has no page to lock, so it holds
- * i_mmap_lock for write to protect against truncate.
+ * i_mmap_lock for read to protect against truncate.
*/
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ i_mmap_unlock_read(vma->vm_file->f_mapping);
}
goto uncharge_out;
}
@@ -3031,9 +3031,9 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
/*
* The fault handler has no page to lock, so it holds
- * i_mmap_lock for write to protect against truncate.
+ * i_mmap_lock for read to protect against truncate.
*/
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ i_mmap_unlock_read(vma->vm_file->f_mapping);
}
return ret;
uncharge_out:
--
1.9.3
Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
by Dan Williams
On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara <jack(a)suse.cz> wrote:
> On Mon 16-11-15 14:37:14, Jan Kara wrote:
[..]
> But a question: Won't it be better to do sfence + pcommit only in response
> to REQ_FLUSH request and don't do it after each write? I'm not sure how
> expensive these instructions are but in theory it could be a performance
> win, couldn't it? For filesystems this is enough wrt persistency
> guarantees...
We would need to gather the performance data... The expectation is
that the cache flushing is more expensive than the sfence + pcommit.
6 years, 7 months
Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
by Dave Chinner
On Mon, Nov 16, 2015 at 03:05:26PM +0100, Jan Kara wrote:
> On Mon 16-11-15 14:37:14, Jan Kara wrote:
> > On Fri 13-11-15 18:32:40, Dan Williams wrote:
> > > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger(a)dilger.ca> wrote:
> > > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
> > > >>
> > > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > > >> <ross.zwisler(a)linux.intel.com> wrote:
> > > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios. These
> > > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > > >>> and are used by filesystems to order their metadata, among other things.
> > > >>>
> > > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > > >>> to flush all dirty pages to media. The PMEM driver then just has issue a
> > > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> > > >>> the flushed data has been durably stored on the media.
> > > >>>
> > > >>> Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
> > > >>
> > > >> Hmm, I'm not seeing why we need this patch. If the actual flushing of
> > > >> the cache is done by the core why does the driver need support
> > > >> REQ_FLUSH? Especially since it's just a couple instructions. REQ_FUA
> > > >> only makes sense if individual writes can bypass the "drive" cache,
> > > >> but no I/O submitted to the driver proper is ever cached we always
> > > >> flush it through to media.
> > > >
> > > > If the upper level filesystem gets an error when submitting a flush
> > > > request, then it assumes the underlying hardware is broken and cannot
> > > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > > IO to complete.
> > >
> > > Upper level filesystems won't get errors when the driver does not
> > > support flush. Those requests are ended cleanly in
> > > generic_make_request_checks(). Yes, the fs still needs to wait for
> > > outstanding I/O to complete but in the case of pmem all I/O is
> > > synchronous. There's never anything to await when flushing at the
> > > pmem driver level.
> > >
> > > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > > it doesn't make sense _not_ to support this functionality.
> > >
> > > Seems to be a nop either way. Given that DAX may lead to dirty data
> > > pending to the device in the cpu cache that a REQ_FLUSH request will
> > > not touch, its better to leave it all to the mm core to handle. I.e.
> > > it doesn't make sense to call the driver just for two instructions
> > > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > > Either handle it all in the mm or the driver, not a mixture.
> >
> > So I think REQ_FLUSH requests *must* end up doing sfence + pcommit because
> > e.g. journal writes going through block layer or writes done through
> > dax_do_io() must be on permanent storage once REQ_FLUSH request finishes
> > and the way driver does IO doesn't guarantee this, does it?
>
> Hum, and looking into how dax_do_io() works and what drivers/nvdimm/pmem.c
> does, I'm indeed wrong because they both do wmb_pmem() after each write
> which seems to include sfence + pcommit. Sorry for confusion.
Which I want to remove, because it makes DAX IO 3x slower than
buffered IO on ramdisk based testing.
> But a question: Won't it be better to do sfence + pcommit only in response
> to REQ_FLUSH request and don't do it after each write? I'm not sure how
> expensive these instructions are but in theory it could be a performance
> win, couldn't it? For filesystems this is enough wrt persistency
> guarantees...
I'm pretty sure it would be, because all of the overhead (and
therefore latency) I measured is in the cache flushing instructions.
But before we can remove the wmb_pmem() from dax_do_io(), we need
the underlying device to support REQ_FLUSH correctly...
Cheers,
Dave.
--
Dave Chinner
david(a)fromorbit.com
Re: [PATCH v2 00/11] DAX fsync/msync support
by Dan Williams
On Mon, Nov 16, 2015 at 6:41 AM, Jan Kara <jack(a)suse.cz> wrote:
> On Fri 13-11-15 17:06:39, Ross Zwisler wrote:
>> This patch series adds support for fsync/msync to DAX.
>>
>> Patches 1 through 7 add various utilities that the DAX code will eventually
>> need, and the DAX code itself is added by patch 8. Patches 9-11 update the
>> three filesystems that currently support DAX, ext2, ext4 and XFS, to use
>> the new DAX fsync/msync code.
>>
>> These patches build on the recent DAX locking changes from Dave Chinner,
>> Jan Kara and myself. Dave's changes for XFS and my changes for ext2 have
>> been merged in the v4.4 window, but Jan's are still unmerged. You can grab
>> them here:
>>
>> http://www.spinics.net/lists/linux-ext4/msg49951.html
>
> I had a quick look and the patches look sane to me. I'll try to give them
> more detailed look later this week. When thinking about the general design
> I was wondering: When we have this infrastructure to track data potentially
> lingering in CPU caches, would not it be a performance win to use standard
> cached stores in dax_io() and mark corresponding pages as dirty in page
> cache the same way as this patch set does it for mmaped writes? I have no
> idea how costly are non-temporal stores compared to cached ones and how
> would this compare to the cost of dirty tracking so this may be just
> completely bogus...
Keep in mind that this approach will flush every virtual address that
may be dirty. For example, if you touch 1 byte in a 2MB page we'll end
up looping through the entire 2MB range. At some point the dirty size
becomes large enough that it is cheaper to flush the entire cache; we
have not measured where that crossover point is.
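The loop Dan describes amounts to flushing one cacheline at a time over
the whole mapped range, along these lines (a sketch, assuming 64-byte
cachelines on x86):

/* Flush every cacheline covering [addr, addr + size). For a 2MB
 * mapping this is 2MB / 64B = 32768 clflush operations, even if
 * only one byte is actually dirty.
 */
static void flush_range(void *addr, size_t size)
{
	unsigned long p = (unsigned long)addr & ~63UL;
	unsigned long end = (unsigned long)addr + size;

	for (; p < end; p += 64)
		asm volatile("clflush %0" : "+m" (*(volatile char *)p));
	asm volatile("mfence" ::: "memory");	/* order the flushes */
}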