[PATCH] ACPI: NFIT: fix flexible_array.cocci warnings
by Julia Lawall
From: kernel test robot <lkp(a)intel.com>
Zero-length and one-element arrays are deprecated, see
Documentation/process/deprecated.rst
Flexible-array members should be used instead.
Generated by: scripts/coccinelle/misc/flexible_array.cocci
Fixes: 7b36c1398fb6 ("coccinelle: misc: add flexible_array.cocci script")
CC: Denis Efremov <efremov(a)linux.com>
Reported-by: kernel test robot <lkp(a)intel.com>
Signed-off-by: kernel test robot <lkp(a)intel.com>
Signed-off-by: Julia Lawall <julia.lawall(a)inria.fr>
---
tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
head: 148842c98a24e508aecb929718818fbf4c2a6ff3
commit: 7b36c1398fb63f9c38cc83dc75f143d2e5995062 coccinelle: misc: add flexible_array.cocci script
:::::: branch date: 20 hours ago
:::::: commit date: 2 months ago
core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2268,7 +2268,7 @@ struct nfit_set_info {
u64 region_offset;
u32 serial_number;
u32 pad;
- } mapping[0];
+ } mapping[];
};
struct nfit_set_info2 {
@@ -2279,7 +2279,7 @@ struct nfit_set_info2 {
u16 manufacturing_date;
u8 manufacturing_location;
u8 reserved[31];
- } mapping[0];
+ } mapping[];
};
static size_t sizeof_nfit_set_info(int num_mappings)
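For readers unfamiliar with the construct, the change above swaps a zero-length array for a C99 flexible array member. The following is a minimal userspace sketch, not the kernel code: `struct set_info`, its `count` field, and the helper names are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a structure such as struct nfit_set_info. */
struct set_info {
	uint32_t count;          /* illustrative header field (assumption) */
	struct set_info_map {
		uint64_t region_offset;
		uint32_t serial_number;
		uint32_t pad;
	} mapping[];             /* flexible array member: must be the last member */
};

/* Bytes needed for the header plus num_mappings trailing elements. */
static size_t set_info_size(size_t num_mappings)
{
	return sizeof(struct set_info) +
	       num_mappings * sizeof(struct set_info_map);
}

/* Allocate a zeroed set_info with room for num_mappings entries. */
static struct set_info *set_info_alloc(size_t num_mappings)
{
	struct set_info *info = calloc(1, set_info_size(num_mappings));

	if (info)
		info->count = (uint32_t)num_mappings;
	return info;
}
```

With `mapping[]` the compiler knows the array has no fixed size, so `sizeof(struct set_info)` covers only the header and the trailing elements are accounted for explicitly at allocation time.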
[PATCH] libnvdimm: Switch to using the new API kobj_to_dev()
by Tian Tao
Fix the following coccicheck warning:
drivers/nvdimm/region_devs.c:762:60-61: WARNING opportunity for
kobj_to_dev().
Signed-off-by: Tian Tao <tiantao6(a)hisilicon.com>
---
drivers/nvdimm/region_devs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index ef23119..d71d4e9 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -759,7 +759,7 @@ REGION_MAPPING(31);
static umode_t mapping_visible(struct kobject *kobj, struct attribute *a, int n)
{
- struct device *dev = container_of(kobj, struct device, kobj);
+ struct device *dev = kobj_to_dev(kobj);
struct nd_region *nd_region = to_nd_region(dev);
if (n < nd_region->ndr_mappings)
--
2.7.4
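As an illustration of why the two spellings in the diff above are equivalent, here is a userspace sketch of the container_of() pattern that kobj_to_dev() wraps. The structures are heavily reduced, hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-creation of the container_of() idea. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Reduced stand-ins for the kernel structures (assumption). */
struct kobject {
	const char *name;
};

struct device {
	int id;
	struct kobject kobj;     /* embedded object, not a pointer */
};

/* Map an embedded kobject back to the device that contains it. */
static struct device *kobj_to_dev(struct kobject *kobj)
{
	assert(kobj != NULL);
	return container_of(kobj, struct device, kobj);
}
```

The helper adds no behavior; it only names the conversion, which is what lets coccicheck flag open-coded container_of() calls on a kobject.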
[PATCH V3 00/10] PKS: Add Protection Keys Supervisor (PKS) support V3
by ira.weiny@intel.com
From: Ira Weiny <ira.weiny(a)intel.com>
Changes from V2 [4]
Rebased on tip-tree/core/entry
From Thomas Gleixner
Address bisectability
Drop Patch:
x86/entry: Move nmi entry/exit into common code
From Greg KH
Remove WARN_ON's
From Dan Williams
Add __must_check to pks_key_alloc()
New patch: x86/pks: Add PKS defines and config options
Split from Enable patch to build on through the series
Fix compile errors
Changes from V1
Rebase to TIP master; resolve conflicts and test
Clean up some kernel docs updates missed in V1
Add irqentry_state_t kernel doc for PKRS field
Removed redundant irq_state->pkrs
This is only needed when we add the global state and somehow
ended up in this patch series. That will come back when we add
the global functionality in.
From Thomas Gleixner
Update commit messages
Add kernel doc for struct irqentry_state_t
From Dave Hansen add flags to pks_key_alloc()
Changes from RFC V3[3]
Rebase to TIP master
Update test error output
Standardize on 'irq_state' for state variables
From Dave Hansen
Update commit messages
Add/clean up comments
Add X86_FEATURE_PKS to disabled-features.h and remove some
explicit CONFIG checks
Move saved_pkrs member of thread_struct
Remove superfluous preempt_disable()
s/irq_save_pks/irq_save_set_pks/
Ensure PKRS is not seen in faults if not configured or not
supported
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change pks_key_alloc return to -EOPNOTSUPP when not supported
From Peter Zijlstra
Clean up Attribution
Remove superfluous preempt_disable()
Add union to differentiate exit_rcu/lockdep use in
irqentry_state_t
From Thomas Gleixner
Add preliminary clean up patch and adjust series as needed
Introduce a new page protection mechanism for supervisor pages, Protection Keys
Supervisor (PKS).
Two use cases for PKS are being developed: trusted keys and PMEM. Trusted keys
is a newer use case which is still being explored. PMEM was submitted as part
of the RFC (v2) series[1]. However, since then it was found that some callers
of kmap() require a global implementation of PKS. Specifically, some users of
kmap() expect mappings to be available to all kernel threads. While global use
of PKS is rare, it needs to be included for correctness. Unfortunately, the
kmap() updates required a large patch series to make the needed changes at the
various kmap() call sites, so that patch set has been split out. Because the
global PKS feature is only required for that use case, it will be deferred to
that set as well.[2] This patch set is being submitted as a precursor to both
of the use cases.
For an overview of the entire PKS ecosystem, a git tree including this series
and 2 proposed use cases can be found here:
https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/
https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.weiny@intel.com/
PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to those pages beyond the normal paging protections. PKS works in
a similar fashion to user space pkeys (PKU). As with PKU, supervisor pkeys are
checked in addition to normal paging protections, and access or writes can be
disabled via an MSR update without TLB flushes when permissions change. Also
like PKU, a page mapping is assigned to a domain by setting pkey bits in the
page table entry for that mapping.
Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.
XSAVE is not supported for the PKRS MSR. Therefore the implementation
saves/restores the MSR across context switches and during exceptions. Nested
exceptions are supported by each exception getting a new PKS state.
For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.
Other keys (1-15) are allocated by an allocator which prepares us for key
contention from day one. Kernel users should be prepared for the allocator to
fail, either because of key exhaustion or because PKS is not supported on the
arch and/or CPU instance.
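The allocation behavior described above can be modeled in userspace as follows. This is only a sketch under stated assumptions: `key_alloc()`, `key_free()` and `pks_supported` are hypothetical names, and returning -ENOSPC on exhaustion is an assumption; only the -EOPNOTSUPP return for the unsupported case comes from the changelog.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_KEYS 16                 /* pkey 0 is reserved, 1-15 are allocatable */

static uint16_t key_map = 1u;      /* bit n set => key n in use; key 0 always taken */
static bool pks_supported = true;  /* hypothetical stand-in for the CPU/arch check */

/*
 * Returns a key in 1..15 on success, -EOPNOTSUPP when PKS is unavailable,
 * or -ENOSPC (assumed error code) when all keys are in use.
 */
static int key_alloc(void)
{
	if (!pks_supported)
		return -EOPNOTSUPP;

	for (int key = 1; key < NR_KEYS; key++) {
		if (!(key_map & (1u << key))) {
			key_map |= (uint16_t)(1u << key);
			return key;
		}
	}
	return -ENOSPC;
}

/* Release a previously allocated key; key 0 can never be freed. */
static void key_free(int key)
{
	assert(key > 0 && key < NR_KEYS);
	key_map &= (uint16_t)~(1u << key);
}
```

A caller in this model has to treat any negative return as "run without PKS protection", which is the fallback the paragraph above asks kernel users to plan for.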
The following are key attributes of PKS.
1) Fast switching of permissions
1a) Prevents access without page table manipulations
1b) No TLB flushes required
2) Works on a per thread basis
PKS is available with 4 and 5 level paging. Like PKU, it consumes 4 bits from
the PTE to store the pkey within the entry.
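To make the per-key permission model concrete, here is a small userspace sketch of how a PKRS-style value could be manipulated. It assumes the same layout as PKRU, that is, two bits per key with an access-disable bit followed by a write-disable bit; the helper names are hypothetical and no MSR is touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PKR_AD_BIT       0x1u   /* access disable */
#define PKR_WD_BIT       0x2u   /* write disable */
#define PKR_BITS_PER_KEY 2
#define PKR_NR_KEYS      16

/* Return a copy of pkrs with the two permission bits of 'key' replaced. */
static uint32_t pkrs_update(uint32_t pkrs, int key, uint32_t perm)
{
	int shift = key * PKR_BITS_PER_KEY;

	assert(key >= 0 && key < PKR_NR_KEYS);
	pkrs &= ~(0x3u << shift);
	return pkrs | ((perm & 0x3u) << shift);
}

/* Reads are allowed unless the access-disable bit is set. */
static bool pkrs_can_read(uint32_t pkrs, int key)
{
	return !((pkrs >> (key * PKR_BITS_PER_KEY)) & PKR_AD_BIT);
}

/* Writes are blocked by either the access-disable or write-disable bit. */
static bool pkrs_can_write(uint32_t pkrs, int key)
{
	return !((pkrs >> (key * PKR_BITS_PER_KEY)) &
		 (PKR_AD_BIT | PKR_WD_BIT));
}
```

Changing a domain's permissions is then a single register update, which is why no page table walk or TLB flush is needed when permissions change.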
[1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/
[2] https://lore.kernel.org/lkml/20201009195033.3208459-2-ira.weiny@intel.com/
[3] https://lore.kernel.org/lkml/20201009194258.3207172-1-ira.weiny@intel.com/
[4] https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
Fenghua Yu (2):
x86/pks: Add PKS kernel API
x86/pks: Enable Protection Keys Supervisor (PKS)
Ira Weiny (8):
x86/pkeys: Create pkeys_common.h
x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
x86/pks: Add PKS defines and Kconfig options
x86/pks: Preserve the PKRS MSR on context switch
x86/entry: Pass irqentry_state_t by reference
x86/entry: Preserve PKRS MSR across exceptions
x86/fault: Report the PKRS state on fault
x86/pks: Add PKS test code
Documentation/core-api/protection-keys.rst | 103 ++-
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 46 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/idtentry.h | 25 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/pgtable.h | 13 +-
arch/x86/include/asm/pgtable_types.h | 12 +
arch/x86/include/asm/pkeys.h | 15 +
arch/x86/include/asm/pkeys_common.h | 40 ++
arch/x86/include/asm/processor.h | 18 +-
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/cpu/common.c | 15 +
arch/x86/kernel/cpu/mce/core.c | 4 +-
arch/x86/kernel/fpu/xstate.c | 22 +-
arch/x86/kernel/kvm.c | 6 +-
arch/x86/kernel/nmi.c | 4 +-
arch/x86/kernel/process.c | 26 +
arch/x86/kernel/traps.c | 21 +-
arch/x86/mm/fault.c | 87 ++-
arch/x86/mm/pkeys.c | 196 +++++-
include/linux/entry-common.h | 31 +-
include/linux/pgtable.h | 4 +
include/linux/pkeys.h | 24 +
kernel/entry/common.c | 44 +-
lib/Kconfig.debug | 12 +
lib/Makefile | 3 +
lib/pks/Makefile | 3 +
lib/pks/pks_test.c | 692 ++++++++++++++++++++
mm/Kconfig | 2 +
tools/testing/selftests/x86/Makefile | 3 +-
tools/testing/selftests/x86/test_pks.c | 66 ++
33 files changed, 1410 insertions(+), 140 deletions(-)
create mode 100644 arch/x86/include/asm/pkeys_common.h
create mode 100644 lib/pks/Makefile
create mode 100644 lib/pks/pks_test.c
create mode 100644 tools/testing/selftests/x86/test_pks.c
--
2.28.0.rc0.12.gb6a658bd00c9
[RFC PATCH] badblocks: Improvement badblocks_set() for handling multiple ranges
by Coly Li
Recently I received a bug report that the current badblocks code does not
properly handle multiple ranges. For example,
badblocks_set(bb, 32, 1, true);
badblocks_set(bb, 34, 1, true);
badblocks_set(bb, 36, 1, true);
badblocks_set(bb, 32, 12, true);
Then badblocks_show() reports,
32 3
36 1
But the expected bad blocks table should be,
32 12
Evidently only the first two ranges are merged; badblocks_set() then returns
and ignores the rest of the range being set.
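The expected result can be checked against a trivial reference model. The sketch below is a userspace bitmap model, not the kernel implementation: it ignores the acknowledged flag, the table size limit and BB_MAX_LEN, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SECTORS 64         /* size of the toy device (assumption) */

struct range {
	int start;
	int len;
};

static bool bad[MODEL_SECTORS];  /* one flag per sector */

/* Mark every sector in [start, start + len) as bad. */
static void model_set(int start, int len)
{
	assert(start >= 0 && len > 0 && start + len <= MODEL_SECTORS);
	for (int i = start; i < start + len; i++)
		bad[i] = true;
}

/* Collapse the bitmap into maximal ranges; returns how many were written. */
static size_t model_ranges(struct range *out, size_t max)
{
	size_t n = 0;
	int i = 0;

	while (i < MODEL_SECTORS && n < max) {
		if (!bad[i]) {
			i++;
			continue;
		}
		int start = i;

		while (i < MODEL_SECTORS && bad[i])
			i++;
		out[n].start = start;
		out[n].len = i - start;
		n++;
	}
	return n;
}
```

Replaying the four calls from the example through `model_set()` leaves a single range starting at sector 32 with length 12, which is the table the report expects from badblocks_set().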
This behavior is improper. If the caller of badblocks_set() wants to set
a range of blocks in the bad blocks table, all of the blocks in the range
should be handled even if an earlier part encounters a failure.
The desired behavior of badblocks_set() when setting a bad blocks range is,
- Set as many blocks of the range as possible in the bad blocks table.
- Merge the bad blocks ranges so that they occupy as few slots as possible
in the bad blocks table.
- Be fast.
Indeed the above proposal is complicated, especially with the following
restrictions,
- The bad blocks range being set can be acknowledged or unacknowledged.
- The bad blocks table size is limited.
- Memory allocation should be avoided.
This patch is an initial effort to improve badblocks_set() for setting a
bad blocks range that covers multiple already set bad ranges in the
bad blocks table, and to do it as fast as possible.
The basic idea of the patch is to reduce all possible bad blocks range
setting combinations to a much smaller set of simplified conditions with
fewer special cases. Inside badblocks_set() there is an implicit loop
formed by jumping between the labels 're_insert' and 'update_sectors'. No
matter how large the range being set is, each iteration handles only a
minimal range from its head, using the pre-defined behavior of one of the
categorized conditions. The logic is simple and the code flow is
manageable.
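The head-first loop can be sketched in userspace as below. This is an illustration under simplifying assumptions, not the patch's code: it considers a single already set range E, ignores acknowledgment, BB_MAX_LEN and table space, and `head_len()` and `split_range()` are hypothetical names.

```c
#include <assert.h>

struct piece {
	long start;
	long len;
};

/*
 * Length of the head of S = [s, s + len) to handle in this iteration,
 * relative to one already set range E = [e_start, e_end).
 */
static long head_len(long s, long len, long e_start, long e_end)
{
	long end = s + len;

	assert(len > 0 && e_start < e_end);
	if (s < e_start)         /* S starts before E: stop at the start of E */
		return (end < e_start ? end : e_start) - s;
	if (s < e_end)           /* S starts inside E: stop at the end of E */
		return (end < e_end ? end : e_end) - s;
	return len;              /* S starts after E: nothing left to split on */
}

/* Walk S from head to tail, emitting one piece per loop iteration. */
static int split_range(long s, long len, long e_start, long e_end,
		       struct piece *out, int max)
{
	int n = 0;

	while (len > 0 && n < max) {
		long l = head_len(s, len, e_start, e_end);

		out[n].start = s;
		out[n].len = l;
		n++;
		s += l;
		len -= l;
	}
	return n;
}
```

For S covering sectors 10-29 and E covering sectors 15-19, the loop yields three pieces: the part before E, the part overlapping E, and the remainder. Each piece then falls into exactly one of the simple conditions.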
This patch is not finished yet: it only improves badblocks_set() and does
not touch badblocks_clear() or badblocks_show(). I am posting it early
because the patch will be large (more than 1000 lines of change), and I
would like comments from more people before I go too far.
The code logic has been tested as a user space program; the patch compiles
but has not been tested in kernel mode yet. For now it is for RFC purposes
only. I will post a tested patch in later versions.
Thank you in advance for any review or comments on this patch.
Signed-off-by: Coly Li <colyli(a)suse.de>
---
block/badblocks.c | 1041 ++++++++++++++++++++++++++++++-------
include/linux/badblocks.h | 33 ++
2 files changed, 881 insertions(+), 193 deletions(-)
diff --git a/block/badblocks.c b/block/badblocks.c
index d39056630d9c..04ccae95777d 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -5,6 +5,8 @@
* - Heavily based on MD badblocks code from Neil Brown
*
* Copyright (c) 2015, Intel Corporation.
+ *
+ * Improvement for handling multiple ranges by Coly Li <colyli(a)suse.de>
*/
#include <linux/badblocks.h>
@@ -16,114 +18,612 @@
#include <linux/types.h>
#include <linux/slab.h>
-/**
- * badblocks_check() - check a given range for bad sectors
- * @bb: the badblocks structure that holds all badblock information
- * @s: sector (start) at which to check for badblocks
- * @sectors: number of sectors to check for badblocks
- * @first_bad: pointer to store location of the first badblock
- * @bad_sectors: pointer to store number of badblocks after @first_bad
+/*
+ * The purpose of badblocks set/clear is to manage bad blocks ranges which are
+ * identified by LBA addresses.
*
- * We can record which blocks on each device are 'bad' and so just
- * fail those blocks, or that stripe, rather than the whole device.
- * Entries in the bad-block table are 64bits wide. This comprises:
- * Length of bad-range, in sectors: 0-511 for lengths 1-512
- * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
- * A 'shift' can be set so that larger blocks are tracked and
- * consequently larger devices can be covered.
- * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ * When the caller of badblocks_set() wants to set a range of bad blocks, the
+ * setting range can be acked or unacked. The setting range may merge with,
+ * overwrite, or skip the overlapped already set range, depending on how they
+ * overlap or are adjacent, and on the acknowledgment type of the ranges. It can
+ * be more complicated when the setting range covers multiple already set bad
+ * block ranges, with restrictions on the maximum length of each bad range and
+ * the bad table space limitation.
*
- * Locking of the bad-block table uses a seqlock so badblocks_check
- * might need to retry if it is very unlucky.
- * We will sometimes want to check for bad blocks in a bi_end_io function,
- * so we use the write_seqlock_irq variant.
+ * It is difficult and unnecessary to take care of all the possible situations.
+ * For setting a large range of bad blocks, we can handle it by dividing the
+ * large range into smaller ones when encountering overlap, max range length or
+ * bad table full conditions. Every time only a smaller piece of the bad range
+ * is handled, with a limited number of conditions for how it interacts with
+ * possible overlapped or adjacent already set bad block ranges. Then the hard
+ * complicated problem becomes much simpler to handle in a proper way.
*
- * When looking for a bad block we specify a range and want to
- * know if any block in the range is bad. So we binary-search
- * to the last range that starts at-or-before the given endpoint,
- * (or "before the sector after the target range")
- * then see if it ends after the given start.
+ * When setting a range of bad blocks to the bad table, the simplified situations
+ * to be considered are, (The already set bad blocks ranges are named with
+ * prefix E, and the setting bad blocks range is named with prefix S)
+ *
+ * 1) A setting range is not overlapped or adjacent to any other already set bad
+ * block range.
+ * +--------+
+ * | S |
+ * +--------+
+ * +-------------+ +-------------+
+ * | E1 | | E2 |
+ * +-------------+ +-------------+
+ * For this situation if the bad blocks table is not full, just allocate a
+ * free slot from the bad blocks table to mark the setting range S. The
+ * result is,
+ * +-------------+ +--------+ +-------------+
+ * | E1 | | S | | E2 |
+ * +-------------+ +--------+ +-------------+
+ * 2) A setting range starts exactly at a start LBA of an already set bad blocks
+ * range.
+ * 2.1) The setting range size < already set range size
+ * +--------+
+ * | S |
+ * +--------+
+ * +-------------+
+ * | E |
+ * +-------------+
+ * 2.1.1) If S and E are both acked or unacked range, the setting range S can
+ * be merged into existing bad range E. The result is,
+ * +-------------+
+ * | S |
+ * +-------------+
+ * 2.1.2) If S is unacked setting and E is acked, the setting will be denied, and
+ * the result is,
+ * +-------------+
+ * | E |
+ * +-------------+
+ * 2.1.3) If S is acked setting and E is unacked, range S can overwrite on E.
+ * An extra slot from the bad blocks table will be allocated for S, and head
+ * of E will move to end of the inserted range S. The result is,
+ * +--------+----+
+ * | S | E |
+ * +--------+----+
+ * 2.2) The setting range size == already set range size
+ * 2.2.1) If S and E are both acked or unacked range, the setting range S can
+ * be merged into existing bad range E. The result is,
+ * +-------------+
+ * | S |
+ * +-------------+
+ * 2.2.2) If S is unacked setting and E is acked, the setting will be denied, and
+ * the result is,
+ * +-------------+
+ * | E |
+ * +-------------+
+ * 2.2.3) If S is acked setting and E is unacked, range S can overwrite all of
+ * bad blocks range E. The result is,
+ * +-------------+
+ * | S |
+ * +-------------+
+ * 2.3) The setting range size > already set range size
+ * +-------------------+
+ * | S |
+ * +-------------------+
+ * +-------------+
+ * | E |
+ * +-------------+
+ * For such situation, the setting range S can be treated as two parts, the
+ * first part (S1) is the same size as the already set range E, the second
+ * part (S2) is the rest of the setting range.
+ * +-------------+-----+ +-------------+ +-----+
+ * | S1 | S2 | | S1 | | S2 |
+ * +-------------+-----+ ===> +-------------+ +-----+
+ * +-------------+ +-------------+
+ * | E | | E |
+ * +-------------+ +-------------+
+ * Now we only focus on how to handle the setting range S1 and already set
+ * range E, which are already explained in 2.2), for the rest S2 it will be
+ * handled later in next loop.
+ * 3) A setting range starts before the start LBA of an already set bad blocks
+ * range.
+ * +-------------+
+ * | S |
+ * +-------------+
+ * +-------------+
+ * | E |
+ * +-------------+
+ * For this situation, the setting range S can be divided into two parts, the
+ * first (S1) ends at the start LBA of already set range E, the second part
+ * (S2) starts exactly at a start LBA of the already set range E.
+ * +----+---------+ +----+ +---------+
+ * | S1 | S2 | | S1 | | S2 |
+ * +----+---------+ ===> +----+ +---------+
+ * +-------------+ +-------------+
+ * | E | | E |
+ * +-------------+ +-------------+
+ * Now only the first part S1 should be handled in this loop, which is in
+ * similar condition as 1). The rest part S2 has exact same start LBA address
+ * of the already set range E, they will be handled in next loop in one of
+ * situations in 2).
+ * 4) A setting range starts after the start LBA of an already set bad blocks
+ * range.
+ * 4.1) If the setting range S exactly matches the tail part of already set bad
+ * blocks range E, like the following chart shows,
+ * +---------+
+ * | S |
+ * +---------+
+ * +-------------+
+ * | E |
+ * +-------------+
+ * 4.1.1) If range S and E have same acknowledge value (both acked or unacked),
+ * they will be merged into one, the result is,
+ * +-------------+
+ * | S |
+ * +-------------+
+ * 4.1.2) If range E is acked and the setting range S is unacked, the setting
+ * request of S will be rejected, the result is,
+ * +-------------+
+ * | E |
+ * +-------------+
+ * 4.1.3) If range E is unacked, and the setting range S is acked, then S may
+ * overwrite the overlapped range of E, the result is,
+ * +---+---------+
+ * | E | S |
+ * +---+---------+
+ * 4.2) If the setting range S stays in middle of an already set range E, like
+ * the following chart shows,
+ * +----+
+ * | S |
+ * +----+
+ * +--------------+
+ * | E |
+ * +--------------+
+ * 4.2.1) If range S and E have same acknowledge value (both acked or unacked),
+ * they will be merged into one, the result is,
+ * +--------------+
+ * | S |
+ * +--------------+
+ * 4.2.2) If range E is acked and the setting range S is unacked, the setting
+ * request of S will be rejected, the result is also,
+ * +--------------+
+ * | E |
+ * +--------------+
+ * 4.2.3) If range E is unacked, and the setting range S is acked, then S will be
+ * inserted into the middle of E and split previous range E into two parts (E1
+ * and E2), the result is,
+ * +----+----+----+
+ * | E1 | S | E2 |
+ * +----+----+----+
+ * 4.3) If the setting bad blocks range S is overlapped with an already set bad
+ * blocks range E. The range S starts after the start LBA of range E, and
+ * ends after the end LBA of range E, as the following chart shows,
+ * +-------------------+
+ * | S |
+ * +-------------------+
+ * +-------------+
+ * | E |
+ * +-------------+
+ * For this situation the range S can be divided into two parts, the first
+ * part (S1) ends at the end of range E, and the second part (S2) is the rest
+ * of the original range S.
+ * +---------+---------+ +---------+ +---------+
+ * | S1 | S2 | | S1 | | S2 |
+ * +---------+---------+ ===> +---------+ +---------+
+ * +-------------+ +-------------+
+ * | E | | E |
+ * +-------------+ +-------------+
+ * Now in this loop the setting range S1 and already set range E can be
+ * handled as the situations 4.1), the rest range S2 will be handled in next
+ * loop and ignored in this loop.
+ * 5) A setting bad blocks range S is adjacent to one or more already set bad
+ * blocks range(s), and they are all acked or unacked range.
+ * 5.1) Front merge: If the already set bad blocks range E is before setting
+ * range S and they are adjacent,
+ * +------+
+ * | S |
+ * +------+
+ * +-------+
+ * | E |
+ * +-------+
+ * 5.1.1) When total size of range S and E <= BB_MAX_LEN, and their acknowledge
+ * values are same, the setting range S can be front merged into range E. The
+ * result is,
+ * +--------------+
+ * | S |
+ * +--------------+
+ * 5.1.2) Otherwise these two ranges cannot merge, just insert the setting
+ * range S right after already set range E into the bad blocks table. The
+ * result is,
+ * +--------+------+
+ * | E | S |
+ * +--------+------+
+ * 6) Special cases which above conditions cannot handle
+ * 6.1) Multiple already set ranges may merge into less ones in a full bad table
+ * +-------------------------------------------------------+
+ * | S |
+ * +-------------------------------------------------------+
+ * |<----- BB_MAX_LEN ----->|
+ * +-----+ +-----+ +-----+
+ * | E1 | | E2 | | E3 |
+ * +-----+ +-----+ +-----+
+ * In the above example, when the bad blocks table is full, inserting the
+ * first part of setting range S will fail because no more available slot
+ * can be allocated from bad blocks table. In this situation a proper
+ * setting method is to go through all of the setting bad blocks range and
+ * look for chances to merge already set ranges into fewer ones. When there
+ * is an available slot in the bad blocks table, retry to handle as many of
+ * the setting bad blocks ranges as possible.
+ * +------------------------+
+ * | S3 |
+ * +------------------------+
+ * |<----- BB_MAX_LEN ----->|
+ * +-----+-----+-----+---+-----+--+
+ * | S1 | S2 |
+ * +-----+-----+-----+---+-----+--+
+ * The above chart shows that although the first part (S3) cannot be inserted
+ * due to no space in the bad blocks table, the following E1, E2 and E3 ranges
+ * can be merged with the rest of S into fewer ranges S1 and S2. Now there is
+ * 1 free slot in bad blocks table.
+ * +------------------------+-----+-----+-----+---+-----+--+
+ * | S3 | S1 | S2 |
+ * +------------------------+-----+-----+-----+---+-----+--+
+ * Since the bad blocks table is not full anymore, retry for the original
+ * setting range S. Now the setting range S3 can be inserted into the bad
+ * blocks table using the slot freed by the multiple ranges merge.
+ * 6.2) Front merge after overwrite
+ * In the following example, in bad blocks table, E1 is an acked bad blocks
+ * range and E2 is an unacked bad blocks range, therefore they are not able
+ * to merge into a larger range. The setting bad blocks range S is acked,
+ * therefore part of E2 can be overwritten by S.
+ * +--------+
+ * | S | acknowledged
+ * +--------+ S: 1
+ * +-------+-------------+ E1: 1
+ * | E1 | E2 | E2: 0
+ * +-------+-------------+
+ * With previous simplified routines, after overwriting part of E2 with S,
+ * the bad blocks table should be (E3 is remaining part of E2 which is not
+ * overwritten by S),
+ * acknowledged
+ * +-------+--------+----+ S: 1
+ * | E1 | S | E3 | E1: 1
+ * +-------+--------+----+ E3: 0
+ * The above result is correct but not perfect. Range E1 and S in the bad
+ * blocks table are both acked, merging them into one larger range may
+ * occupy less bad blocks table space and make badblocks_check() faster.
+ * Therefore in such situation, after overwriting range S, the previous range
+ * E1 should be checked for possible front combination. Then the ideal
+ * result can be,
+ * +----------------+----+ acknowledged
+ * | E1 | E3 | E1: 1
+ * +----------------+----+ E3: 0
+ * 6.3) Behind merge: If the already set bad blocks range E is behind the setting
+ * range S and they are adjacent. Normally we don't need to care about this
+ * because front merge handles this while going through range S from head to
+ * tail, except for the tail part of range S. When the setting range S is
+ * fully handled, none of the above simplified routines checks whether the
+ * tail LBA of range S is adjacent to the next already set range, so they are
+ * not able to merge them even if they are mergeable.
+ * +------+
+ * | S |
+ * +------+
+ * +-------+
+ * | E |
+ * +-------+
+ * For the above special situation, when the setting range S is all handled
+ * and the loop ends, an extra check is necessary for whether next already
+ * set range E is right after S and mergeable.
+ * 6.3.1) When total size of range E and S <= BB_MAX_LEN, and their acknowledge
+ * values are same, the setting range S can be behind merged into range E. The
+ * result is,
+ * +--------------+
+ * | S |
+ * +--------------+
+ * 6.3.2) Otherwise these two ranges cannot merge, just insert the setting range
+ * S in front of the already set range E in the bad blocks table. The result
+ * is,
+ * +------+-------+
+ * | S | E |
+ * +------+-------+
+ *
+ * All the above 5 simplified situations and 3 special cases may cover 99%+ of
+ * the bad block range setting conditions. There may be some rare corner cases
+ * which are not considered and optimized, but it won't hurt if badblocks_set()
+ * fails due to no space, or if some ranges are not merged to save bad blocks
+ * table space.
+ *
+ * Inside badblocks_set() each loop starts by jumping to the re_insert label.
+ * In every new loop prev_badblocks() is called to find an already set range
+ * which starts before or at the current setting range. Since the setting bad
+ * blocks range is handled from head to tail, in most cases it is unnecessary
+ * to do the binary search inside prev_badblocks(). It is possible to provide a
+ * hint to prev_badblocks() for a fast path, so that the expensive binary search
+ * can be avoided. In my test with the hint to prev_badblocks(), except for the
+ * first loop, all remaining calls to prev_badblocks() can go into the fast path
+ * and return the correct bad blocks table index immediately.
*
- * Return:
- * 0: there are no known bad blocks in the range
- * 1: there are known bad block which are all acknowledged
- * -1: there are bad blocks which have not yet been acknowledged in metadata.
- * plus the start/length of the first bad section we overlap.
*/
-int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
- sector_t *first_bad, int *bad_sectors)
+
+static int prev_by_hint(struct badblocks *bb, sector_t s, int hint)
{
- int hi;
- int lo;
u64 *p = bb->page;
- int rv;
- sector_t target = s + sectors;
- unsigned seq;
+ int ret = -1;
+ int hint_end = hint + 2;
- if (bb->shift > 0) {
- /* round the start down, and the end up */
- s >>= bb->shift;
- target += (1<<bb->shift) - 1;
- target >>= bb->shift;
- sectors = target - s;
+ while ((hint < hint_end) && ((hint + 1) <= bb->count) &&
+ (BB_OFFSET(p[hint]) <= s)) {
+ if ((hint + 1) == bb->count || BB_OFFSET(p[hint + 1]) > s) {
+ ret = hint;
+ break;
+ }
+ hint++;
+ }
+
+ return ret;
+}
+
+/* Find the range that starts at-or-before bad->start */
+static int prev_badblocks(struct badblocks *bb, struct bad_context *bad,
+ int hint)
+{
+ u64 *p;
+ int lo, hi;
+ sector_t s = bad->start;
+ int ret = -1;
+
+ if (!bb->count)
+ goto out;
+
+ if (hint >= 0) {
+ ret = prev_by_hint(bb, s, hint);
+ if (ret >= 0)
+ goto out;
}
- /* 'target' is now the first block after the bad range */
-retry:
- seq = read_seqbegin(&bb->lock);
lo = 0;
- rv = 0;
hi = bb->count;
+ p = bb->page;
- /* Binary search between lo and hi for 'target'
- * i.e. for the last range that starts before 'target'
- */
- /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
- * are known not to be the last range before target.
- * VARIANT: hi-lo is the number of possible
- * ranges, and decreases until it reaches 1
- */
while (hi - lo > 1) {
- int mid = (lo + hi) / 2;
+ int mid = (lo + hi)/2;
sector_t a = BB_OFFSET(p[mid]);
- if (a < target)
- /* This could still be the one, earlier ranges
- * could not.
- */
+ if (a <= s)
lo = mid;
else
- /* This and later ranges are definitely out. */
hi = mid;
}
- /* 'lo' might be the last that started before target, but 'hi' isn't */
- if (hi > lo) {
- /* need to check all range that end after 's' to see if
- * any are unacknowledged.
+
+ if (BB_OFFSET(p[lo]) <= s)
+ ret = lo;
+out:
+ return ret;
+}
+
+static int can_merge_behind(struct badblocks *bb, struct bad_context *bad,
+ int behind)
+{
+ u64 *p = bb->page;
+ sector_t s = bad->start;
+ sector_t sectors = bad->len;
+ int ack = bad->ack;
+
+ if ((s <= BB_OFFSET(p[behind])) &&
+ ((s + sectors) >= BB_OFFSET(p[behind])) &&
+ ((BB_END(p[behind]) - s) <= BB_MAX_LEN) &&
+ BB_ACK(p[behind]) == ack)
+ return true;
+ return false;
+}
+
+static int behind_merge(struct badblocks *bb, struct bad_context *bad,
+ int behind)
+{
+ u64 *p = bb->page;
+ sector_t s = bad->start;
+ sector_t sectors = bad->len;
+ int ack = bad->ack;
+ int merged = 0;
+
+ WARN_ON(s > BB_OFFSET(p[behind]));
+ WARN_ON((s + sectors) < BB_OFFSET(p[behind]));
+
+ if (s < BB_OFFSET(p[behind])) {
+ merged = min_t(sector_t, sectors, BB_OFFSET(p[behind]) - s);
+ /* merged must be computed before the length check below */
+ WARN_ON((BB_LEN(p[behind]) + merged) >= BB_MAX_LEN);
+
+ p[behind] = BB_MAKE(s, BB_LEN(p[behind]) + merged, ack);
+ } else {
+ merged = min_t(sector_t, sectors, BB_LEN(p[behind]));
+ }
+
+ WARN_ON(merged == 0);
+
+ return merged;
+}
+
+static int can_merge_front(struct badblocks *bb, int prev,
+ struct bad_context *bad)
+{
+ u64 *p = bb->page;
+ sector_t s = bad->start;
+ int ack = bad->ack;
+
+ if (BB_ACK(p[prev]) == ack &&
+ (s < BB_END(p[prev]) ||
+ (s == BB_END(p[prev]) && (BB_LEN(p[prev]) < BB_MAX_LEN))))
+ return true;
+ return false;
+}
+
+static int front_merge(struct badblocks *bb, int prev, struct bad_context *bad)
+{
+ sector_t sectors = bad->len;
+ sector_t s = bad->start;
+ int ack = bad->ack;
+ u64 *p = bb->page;
+ int merged = 0;
+
+ WARN_ON(s > BB_END(p[prev]));
+
+ if (s < BB_END(p[prev])) {
+ merged = min_t(sector_t, sectors, BB_END(p[prev]) - s);
+ } else {
+ merged = min_t(sector_t, sectors, BB_MAX_LEN - BB_LEN(p[prev]));
+ if ((prev + 1) < bb->count &&
+ merged > (BB_OFFSET(p[prev + 1]) - BB_END(p[prev]))) {
+ merged = BB_OFFSET(p[prev + 1]) - BB_END(p[prev]);
+ }
+
+ p[prev] = BB_MAKE(BB_OFFSET(p[prev]), BB_LEN(p[prev]) + merged, ack);
+ }
+
+ return merged;
+}
+
+static int can_combine_front(struct badblocks *bb, int prev,
+ struct bad_context *bad)
+{
+ u64 *p = bb->page;
+
+ if ((BB_OFFSET(p[prev]) == bad->start) && (prev > 0) &&
+ (BB_LEN(p[prev - 1]) + BB_LEN(p[prev]) <= BB_MAX_LEN) &&
+ (BB_ACK(p[prev - 1]) == BB_ACK(p[prev])))
+ return true;
+ return false;
+}
+
+static void front_combine(struct badblocks *bb, int prev)
+{
+ u64 *p = bb->page;
+
+ p[prev - 1] = BB_MAKE(BB_OFFSET(p[prev - 1]),
+ BB_LEN(p[prev - 1]) + BB_LEN(p[prev]),
+ BB_ACK(p[prev]));
+ if ((prev + 1) < bb->count)
+ memmove(p + prev, p + prev + 1, (bb->count - prev - 1) * 8);
+}
+
+static int overlap_front(struct badblocks *bb, int front,
+ struct bad_context *bad)
+{
+ u64 *p = bb->page;
+
+ if (bad->start >= BB_OFFSET(p[front]) &&
+ bad->start < BB_END(p[front]))
+ return true;
+ return false;
+}
+
+static int can_front_overwrite(struct badblocks *bb, int prev,
+ struct bad_context *bad, int *extra)
+{
+ u64 *p = bb->page;
+ int len;
+
+ WARN_ON(!overlap_front(bb, prev, bad));
+
+ if (BB_ACK(p[prev]) >= bad->ack)
+ return false;
+
+ if (BB_END(p[prev]) <= (bad->start + bad->len)) {
+ len = BB_END(p[prev]) - bad->start;
+ if (BB_OFFSET(p[prev]) == bad->start)
+ *extra = 0;
+ else
+ *extra = 1;
+
+ bad->len = len;
+ } else {
+ if (BB_OFFSET(p[prev]) == bad->start)
+ *extra = 1;
+ else
+ /*
+ * prev range will be split into two, beside the overwritten
+ * one, an extra slot needed from bad table.
*/
- while (lo >= 0 &&
- BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
- if (BB_OFFSET(p[lo]) < target) {
- /* starts before the end, and finishes after
- * the start, so they must overlap
- */
- if (rv != -1 && BB_ACK(p[lo]))
- rv = 1;
- else
- rv = -1;
- *first_bad = BB_OFFSET(p[lo]);
- *bad_sectors = BB_LEN(p[lo]);
- }
- lo--;
+ *extra = 2;
+ }
+
+ if ((bb->count + (*extra)) >= MAX_BADBLOCKS)
+ return false;
+
+ return true;
+}
+
+static int front_overwrite(struct badblocks *bb, int prev,
+ struct bad_context *bad, int extra)
+{
+ u64 *p = bb->page;
+ int n = extra;
+ sector_t orig_end = BB_END(p[prev]);
+ int orig_ack = BB_ACK(p[prev]);
+
+ switch (extra) {
+ case 0:
+ p[prev] = BB_MAKE(BB_OFFSET(p[prev]), BB_LEN(p[prev]),
+ bad->ack);
+ break;
+ case 1:
+ if (BB_OFFSET(p[prev]) == bad->start) {
+ p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
+ bad->len, bad->ack);
+ memmove(p + prev + 2, p + prev + 1,
+ (bb->count - prev - 1) * 8);
+ p[prev + 1] = BB_MAKE(bad->start + bad->len,
+ orig_end - BB_END(p[prev]),
+ orig_ack);
+ } else {
+ p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
+ BB_END(p[prev]) - bad->start,
+ BB_ACK(p[prev]));
+ memmove(p + prev + 1 + n, p + prev + 1,
+ (bb->count - prev - 1) * 8);
+ p[prev + 1] = BB_MAKE(bad->start, bad->len, bad->ack);
}
+ break;
+ case 2:
+ p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
+ BB_END(p[prev]) - bad->start,
+ BB_ACK(p[prev]));
+ memmove(p + prev + 1 + n, p + prev + 1,
+ (bb->count - prev - 1) * 8);
+ p[prev + 1] = BB_MAKE(bad->start, bad->len, bad->ack);
+ p[prev + 2] = BB_MAKE(BB_END(p[prev + 1]),
+ orig_end - BB_END(p[prev + 1]),
+ BB_ACK(p[prev]));
+ break;
+ default:
+ break;
}
- if (read_seqretry(&bb->lock, seq))
- goto retry;
+ return bad->len;
+}
- return rv;
+static int overlap_behind(struct badblocks *bb, struct bad_context *bad,
+ int behind)
+{
+ u64 *p = bb->page;
+
+ if (bad->start < BB_OFFSET(p[behind]) &&
+ (bad->start + bad->len) > BB_OFFSET(p[behind]))
+ return true;
+
+ if (bad->start >= BB_OFFSET(p[behind]) &&
+ bad->start < BB_END(p[behind]))
+ return true;
+
+ return false;
+}
+
+static int insert_at(struct badblocks *bb, int at, struct bad_context *bad)
+{
+ u64 *p = bb->page;
+ int sectors = bad->len;
+ int s = bad->start;
+ int ack = bad->ack;
+ int len;
+
+ WARN_ON(badblocks_full(bb));
+
+ len = min_t(sector_t, sectors, BB_MAX_LEN);
+ if (at < bb->count)
+ memmove(p + at + 1, p + at, (bb->count - at) * 8);
+ p[at] = BB_MAKE(s, len, ack);
+
+ return len;
}
-EXPORT_SYMBOL_GPL(badblocks_check);
static void badblocks_update_acked(struct badblocks *bb)
{
@@ -164,7 +664,10 @@ int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
int acknowledged)
{
u64 *p;
- int lo, hi;
+ struct bad_context bad;
+ int prev = -1, hint = -1;
+ int len = 0, added = 0;
+ int retried = 0, space_desired = 0;
int rv = 0;
unsigned long flags;
@@ -172,144 +675,187 @@ int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
/* badblocks are disabled */
return 1;
+ if (sectors <= 0)
+ /* Invalid sectors number */
+ return 1;
+
if (bb->shift) {
/* round the start down, and the end up */
sector_t next = s + sectors;
- s >>= bb->shift;
- next += (1<<bb->shift) - 1;
- next >>= bb->shift;
+ rounddown(s, bb->shift);
+ roundup(next, bb->shift);
sectors = next - s;
}
write_seqlock_irqsave(&bb->lock, flags);
+ bad.orig_start = s;
+ bad.orig_len = sectors;
+ bad.ack = acknowledged;
p = bb->page;
- lo = 0;
- hi = bb->count;
- /* Find the last range that starts at-or-before 's' */
- while (hi - lo > 1) {
- int mid = (lo + hi) / 2;
- sector_t a = BB_OFFSET(p[mid]);
- if (a <= s)
- lo = mid;
- else
- hi = mid;
+re_insert:
+ bad.start = s;
+ bad.len = sectors;
+ len = 0;
+
+ if (badblocks_empty(bb)) {
+ len = insert_at(bb, 0, &bad);
+ bb->count++;
+ added++;
+ goto update_sectors;
}
- if (hi > lo && BB_OFFSET(p[lo]) > s)
- hi = lo;
- if (hi > lo) {
- /* we found a range that might merge with the start
- * of our new range
- */
- sector_t a = BB_OFFSET(p[lo]);
- sector_t e = a + BB_LEN(p[lo]);
- int ack = BB_ACK(p[lo]);
-
- if (e >= s) {
- /* Yes, we can merge with a previous range */
- if (s == a && s + sectors >= e)
- /* new range covers old */
- ack = acknowledged;
- else
- ack = ack && acknowledged;
-
- if (e < s + sectors)
- e = s + sectors;
- if (e - a <= BB_MAX_LEN) {
- p[lo] = BB_MAKE(a, e-a, ack);
- s = e;
+ prev = prev_badblocks(bb, &bad, hint);
+
+ /* start before all badblocks */
+ if (prev < 0) {
+ if (!badblocks_full(bb)) {
+ /* insert on the first */
+ if (bad.len > (BB_OFFSET(p[0]) - bad.start))
+ bad.len = BB_OFFSET(p[0]) - bad.start;
+ len = insert_at(bb, 0, &bad);
+ bb->count++;
+ added++;
+ hint = 0;
+ goto update_sectors;
+ }
+
+ /* No space, try to merge */
+ if (overlap_behind(bb, &bad, 0)) {
+ if (can_merge_behind(bb, &bad, 0)) {
+ len = behind_merge(bb, &bad, 0);
+ added++;
} else {
- /* does not all fit in one range,
- * make p[lo] maximal
- */
- if (BB_LEN(p[lo]) != BB_MAX_LEN)
- p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
- s = a + BB_MAX_LEN;
+ len = min_t(sector_t, BB_OFFSET(p[0]) - s, sectors);
+ space_desired = 1;
}
- sectors = e - s;
+ hint = 0;
+ goto update_sectors;
}
+
+ /* no table space and give up */
+ goto out;
}
- if (sectors && hi < bb->count) {
- /* 'hi' points to the first range that starts after 's'.
- * Maybe we can merge with the start of that range
- */
- sector_t a = BB_OFFSET(p[hi]);
- sector_t e = a + BB_LEN(p[hi]);
- int ack = BB_ACK(p[hi]);
-
- if (a <= s + sectors) {
- /* merging is possible */
- if (e <= s + sectors) {
- /* full overlap */
- e = s + sectors;
- ack = acknowledged;
- } else
- ack = ack && acknowledged;
-
- a = s;
- if (e - a <= BB_MAX_LEN) {
- p[hi] = BB_MAKE(a, e-a, ack);
- s = e;
- } else {
- p[hi] = BB_MAKE(a, BB_MAX_LEN, ack);
- s = a + BB_MAX_LEN;
+
+ /* in case p[prev-1] can be merged with p[prev] */
+ if (can_combine_front(bb, prev, &bad)) {
+ front_combine(bb, prev);
+ bb->count--;
+ added++;
+ hint = prev - 1;
+ goto update_sectors;
+ }
+
+ if (overlap_front(bb, prev, &bad)) {
+ if (can_merge_front(bb, prev, &bad)) {
+ len = front_merge(bb, prev, &bad);
+ added++;
+ hint = prev - 1;
+ } else {
+ int extra = 0;
+
+ if (!can_front_overwrite(bb, prev, &bad, &extra)) {
+ len = min_t(sector_t, BB_END(p[prev]) - s, sectors);
+ hint = prev;
+ goto update_sectors;
+ }
+
+ len = front_overwrite(bb, prev, &bad, extra);
+ added++;
+ bb->count += extra;
+ hint = prev;
+
+ if (prev > 0 && can_combine_front(bb, prev, &bad)) {
+ front_combine(bb, prev);
+ bb->count--;
+ hint = prev - 1;
}
- sectors = e - s;
- lo = hi;
- hi++;
}
+ goto update_sectors;
+ }
+
+ if (can_merge_front(bb, prev, &bad)) {
+ len = front_merge(bb, prev, &bad);
+ added++;
+ hint = prev;
+ goto update_sectors;
}
- if (sectors == 0 && hi < bb->count) {
- /* we might be able to combine lo and hi */
- /* Note: 's' is at the end of 'lo' */
- sector_t a = BB_OFFSET(p[hi]);
- int lolen = BB_LEN(p[lo]);
- int hilen = BB_LEN(p[hi]);
- int newlen = lolen + hilen - (s - a);
-
- if (s >= a && newlen < BB_MAX_LEN) {
- /* yes, we can combine them */
- int ack = BB_ACK(p[lo]) && BB_ACK(p[hi]);
-
- p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
- memmove(p + hi, p + hi + 1,
- (bb->count - hi - 1) * 8);
- bb->count--;
+
+ /* if no space in table, still try to merge in the covered range */
+ if (badblocks_full(bb)) {
+ /* skip the cannot-merge range */
+ if (((prev + 1) < bb->count) &&
+ overlap_behind(bb, &bad, prev + 1) &&
+ ((s + sectors) >= BB_END(p[prev + 1]))) {
+ len = BB_END(p[prev + 1]) - s;
+ hint = prev + 1;
+ goto update_sectors;
}
+
+ /* no retry any more */
+ len = sectors;
+ space_desired = 1;
+ hint = -1;
+ goto update_sectors;
}
- while (sectors) {
- /* didn't merge (it all).
- * Need to add a range just before 'hi'
- */
- if (bb->count >= MAX_BADBLOCKS) {
- /* No room for more */
- rv = 1;
- break;
- } else {
- int this_sectors = sectors;
- memmove(p + hi + 1, p + hi,
- (bb->count - hi) * 8);
- bb->count++;
+ /* cannot merge and there is space in bad table */
+ if (overlap_behind(bb, &bad, prev + 1))
+ bad.len = min_t(sector_t, bad.len, BB_OFFSET(p[prev + 1]) - bad.start);
- if (this_sectors > BB_MAX_LEN)
- this_sectors = BB_MAX_LEN;
- p[hi] = BB_MAKE(s, this_sectors, acknowledged);
- sectors -= this_sectors;
- s += this_sectors;
- }
+ len = insert_at(bb, prev + 1, &bad);
+ bb->count++;
+ added++;
+ hint = prev + 1;
+
+update_sectors:
+ s += len;
+ sectors -= len;
+
+ if (sectors > 0)
+ goto re_insert;
+
+ WARN_ON(sectors < 0);
+
+ /* Check whether the following already set range can be merged */
+ if ((prev + 1) < bb->count &&
+ BB_END(p[prev]) == BB_OFFSET(p[prev + 1]) &&
+ (BB_LEN(p[prev]) + BB_LEN(p[prev + 1])) <= BB_MAX_LEN &&
+ BB_ACK(p[prev]) == BB_ACK(p[prev + 1])) {
+ p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
+ BB_LEN(p[prev]) + BB_LEN(p[prev + 1]),
+ BB_ACK(p[prev]));
+
+ if ((prev + 2) < bb->count)
+ memmove(p + prev + 1, p + prev + 2,
+ (bb->count - (prev + 2)) * 8);
+ bb->count--;
+ }
+
+ if (space_desired && !badblocks_full(bb)) {
+ s = bad.orig_start;
+ sectors = bad.orig_len;
+ if (retried++ < 3)
+ goto re_insert;
+ }
+
+out:
+ if (added) {
+ set_changed(bb);
+
+ if (!acknowledged)
+ bb->unacked_exist = 1;
+ else
+ badblocks_update_acked(bb);
}
- bb->changed = 1;
- if (!acknowledged)
- bb->unacked_exist = 1;
- else
- badblocks_update_acked(bb);
write_sequnlock_irqrestore(&bb->lock, flags);
+ if (!added)
+ rv = 1;
+
return rv;
}
EXPORT_SYMBOL_GPL(badblocks_set);
@@ -423,6 +969,115 @@ int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
}
EXPORT_SYMBOL_GPL(badblocks_clear);
+/**
+ * badblocks_check() - check a given range for bad sectors
+ * @bb: the badblocks structure that holds all badblock information
+ * @s: sector (start) at which to check for badblocks
+ * @sectors: number of sectors to check for badblocks
+ * @first_bad: pointer to store location of the first badblock
+ * @bad_sectors: pointer to store number of badblocks after @first_bad
+ *
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide. This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ * A 'shift' can be set so that larger blocks are tracked and
+ * consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so badblocks_check
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad. So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ *
+ * Return:
+ * 0: there are no known bad blocks in the range
+ * 1: there are known bad blocks which are all acknowledged
+ * -1: there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors)
+{
+ int hi;
+ int lo;
+ u64 *p = bb->page;
+ int rv;
+ sector_t target = s + sectors;
+ unsigned seq;
+
+ if (bb->shift > 0) {
+ /* round the start down, and the end up */
+ s >>= bb->shift;
+ target += (1<<bb->shift) - 1;
+ target >>= bb->shift;
+ sectors = target - s;
+ }
+ /* 'target' is now the first block after the bad range */
+
+retry:
+ seq = read_seqbegin(&bb->lock);
+ lo = 0;
+ rv = 0;
+ hi = bb->count;
+
+ /* Binary search between lo and hi for 'target'
+ * i.e. for the last range that starts before 'target'
+ */
+ /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+ * are known not to be the last range before target.
+ * VARIANT: hi-lo is the number of possible
+ * ranges, and decreases until it reaches 1
+ */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+
+ if (a < target)
+ /* This could still be the one, earlier ranges
+ * could not.
+ */
+ lo = mid;
+ else
+ /* This and later ranges are definitely out. */
+ hi = mid;
+ }
+ /* 'lo' might be the last that started before target, but 'hi' isn't */
+ if (hi > lo) {
+ /* need to check all ranges that end after 's' to see if
+ * any are unacknowledged.
+ */
+ while (lo >= 0 &&
+ BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+ if (BB_OFFSET(p[lo]) < target) {
+ /* starts before the end, and finishes after
+ * the start, so they must overlap
+ */
+ if (rv != -1 && BB_ACK(p[lo]))
+ rv = 1;
+ else
+ rv = -1;
+ *first_bad = BB_OFFSET(p[lo]);
+ *bad_sectors = BB_LEN(p[lo]);
+ }
+ lo--;
+ }
+ }
+
+ if (read_seqretry(&bb->lock, seq))
+ goto retry;
+
+ return rv;
+}
+EXPORT_SYMBOL_GPL(badblocks_check);
+
/**
* ack_all_badblocks() - Acknowledge all bad blocks in a list.
* @bb: the badblocks structure that holds all badblock information
diff --git a/include/linux/badblocks.h b/include/linux/badblocks.h
index 2426276b9bd3..b4bd997a53a4 100644
--- a/include/linux/badblocks.h
+++ b/include/linux/badblocks.h
@@ -15,6 +15,7 @@
#define BB_OFFSET(x) (((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x) (((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x) (!!((x) & BB_ACK_MASK))
+#define BB_END(x) (BB_OFFSET(x) + BB_LEN(x))
#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
/* Bad block numbers are stored sorted in a single page.
@@ -41,6 +42,14 @@ struct badblocks {
sector_t size; /* in sectors */
};
+struct bad_context {
+ sector_t start;
+ sector_t len;
+ int ack;
+ sector_t orig_start;
+ sector_t orig_len;
+};
+
int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
sector_t *first_bad, int *bad_sectors);
int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
@@ -54,6 +63,7 @@ int badblocks_init(struct badblocks *bb, int enable);
void badblocks_exit(struct badblocks *bb);
struct device;
int devm_init_badblocks(struct device *dev, struct badblocks *bb);
+
static inline void devm_exit_badblocks(struct device *dev, struct badblocks *bb)
{
if (bb->dev != dev) {
@@ -63,4 +73,27 @@ static inline void devm_exit_badblocks(struct device *dev, struct badblocks *bb)
}
badblocks_exit(bb);
}
+
+static inline int badblocks_full(struct badblocks *bb)
+{
+ return (bb->count >= MAX_BADBLOCKS);
+}
+
+static inline int badblocks_empty(struct badblocks *bb)
+{
+ return (bb->count == 0);
+}
+
+static inline void set_changed(struct badblocks *bb)
+{
+ if (bb->changed != 1)
+ bb->changed = 1;
+}
+
+static inline void clear_changed(struct badblocks *bb)
+{
+ if (bb->changed != 0)
+ bb->changed = 0;
+}
+
#endif
--
2.26.2
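
As a reading aid for the BB_* macros used throughout this patch, here is a minimal user-space sketch of the 64-bit table entry layout described in the badblocks_check() comment: the length is stored as len - 1 in the low 9 bits, the start sector sits above it, and the acknowledged flag is bit 63. The three mask values and BB_MAX_LEN are not shown in this diff and are assumed to match include/linux/badblocks.h. The helpers bb_entry_make() and bb_entry_overlaps() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror include/linux/badblocks.h; not part of this diff. */
#define BB_LEN_MASK    0x00000000000001FFULL
#define BB_OFFSET_MASK 0x7FFFFFFFFFFFFE00ULL
#define BB_ACK_MASK    0x8000000000000000ULL
#define BB_MAX_LEN     512

/* Same definitions as in the header hunk above, with u64 spelled uint64_t. */
#define BB_OFFSET(x) (((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x)    (((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x)    (!!((x) & BB_ACK_MASK))
#define BB_END(x)    (BB_OFFSET(x) + BB_LEN(x))
#define BB_MAKE(a, l, ack) \
	(((a) << 9) | ((l) - 1) | ((uint64_t)(!!(ack)) << 63))

/* Hypothetical helper: pack one bad range (1..BB_MAX_LEN sectors). */
static uint64_t bb_entry_make(uint64_t start, uint64_t len, int ack)
{
	return BB_MAKE(start, len, ack);
}

/* Hypothetical helper: does [s, s + sectors) overlap the entry's range? */
static int bb_entry_overlaps(uint64_t entry, uint64_t s, uint64_t sectors)
{
	return BB_OFFSET(entry) < s + sectors && BB_END(entry) > s;
}
```

The overlap test is the same "starts before the end, and finishes after the start" condition that badblocks_check() applies to each candidate range.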
[ndctl PATCH] ndctl/dimm: Attempt an abort upon
firmware-update-busy status
by Dan Williams
Mark reports that if a previous firmware update is blocked due to a
background ARS then ndctl fails to start another firmware-update
request until the platform is rebooted.
Teach 'ndctl update-firmware' to abort previous firmware-update sessions
when '--force' is specified.
Link: https://github.com/pmem/ndctl/issues/155
Link: http://lore.kernel.org/r/20201222005704.2355076-1-jane.chu@oracle.com
Reported-by: Mark Baker <mark.a.baker(a)oracle.com>
Tested-by: Mark Baker <mark.a.baker(a)oracle.com>
Tested-by: Jane Chu <jane.chu(a)oracle.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
Needs the fix from Jane mentioned in the link above, but with that
included Jane and Mark report this works.
ndctl/dimm.c | 109 ++++++++++++++++++++++++++++++++++++----------------------
1 file changed, 67 insertions(+), 42 deletions(-)
diff --git a/ndctl/dimm.c b/ndctl/dimm.c
index 8e85d692afd3..167c3f1bc7c7 100644
--- a/ndctl/dimm.c
+++ b/ndctl/dimm.c
@@ -504,6 +504,36 @@ out:
return rc;
}
+static int submit_abort_firmware(struct ndctl_dimm *dimm,
+ struct action_context *actx)
+{
+ struct update_context *uctx = &actx->update;
+ struct ndctl_cmd *cmd;
+ int rc;
+ enum ND_FW_STATUS status;
+
+ cmd = ndctl_dimm_cmd_new_fw_abort(uctx->start);
+ if (!cmd)
+ return -ENXIO;
+
+ rc = ndctl_cmd_submit(cmd);
+ if (rc < 0)
+ goto out;
+
+ status = ndctl_cmd_fw_xlat_firmware_status(cmd);
+ if (!(status & ND_CMD_STATUS_FIN_ABORTED)) {
+ fprintf(stderr,
+ "Firmware update abort on DIMM %s failed: %#x\n",
+ ndctl_dimm_get_devname(dimm), status);
+ rc = -ENXIO;
+ goto out;
+ }
+
+out:
+ ndctl_cmd_unref(cmd);
+ return rc;
+}
+
static int submit_start_firmware_upload(struct ndctl_dimm *dimm,
struct action_context *actx)
{
@@ -511,8 +541,8 @@ static int submit_start_firmware_upload(struct ndctl_dimm *dimm,
struct update_context *uctx = &actx->update;
struct fw_info *fw = &uctx->dimm_fw;
struct ndctl_cmd *cmd;
- int rc;
enum ND_FW_STATUS status;
+ int rc;
cmd = ndctl_dimm_cmd_new_fw_start_update(dimm);
if (!cmd)
@@ -520,27 +550,46 @@ static int submit_start_firmware_upload(struct ndctl_dimm *dimm,
rc = ndctl_cmd_submit(cmd);
if (rc < 0)
- return rc;
+ goto err;
+ uctx->start = cmd;
status = ndctl_cmd_fw_xlat_firmware_status(cmd);
if (status == FW_EBUSY) {
- err("%s: busy with another firmware update", devname);
- return -EBUSY;
+ if (param.force) {
+ rc = submit_abort_firmware(dimm, actx);
+ if (rc < 0) {
+ err("%s: busy with another firmware update, "
+ "abort failed", devname);
+ rc = -EBUSY;
+ goto err;
+ }
+ rc = -EAGAIN;
+ goto err;
+ } else {
+ err("%s: busy with another firmware update", devname);
+ rc = -EBUSY;
+ goto err;
+ }
}
if (status != FW_SUCCESS) {
err("%s: failed to create start context", devname);
- return -ENXIO;
+ rc = -ENXIO;
+ goto err;
}
fw->context = ndctl_cmd_fw_start_get_context(cmd);
if (fw->context == UINT_MAX) {
err("%s: failed to retrieve start context", devname);
- return -ENXIO;
+ rc = -ENXIO;
+ goto err;
}
- uctx->start = cmd;
-
return 0;
+
+err:
+ uctx->start = NULL;
+ ndctl_cmd_unref(cmd);
+ return rc;
}
static int get_fw_data_from_file(FILE *file, void *buf, uint32_t len)
@@ -659,36 +708,6 @@ out:
return rc;
}
-static int submit_abort_firmware(struct ndctl_dimm *dimm,
- struct action_context *actx)
-{
- struct update_context *uctx = &actx->update;
- struct ndctl_cmd *cmd;
- int rc;
- enum ND_FW_STATUS status;
-
- cmd = ndctl_dimm_cmd_new_fw_abort(uctx->start);
- if (!cmd)
- return -ENXIO;
-
- rc = ndctl_cmd_submit(cmd);
- if (rc < 0)
- goto out;
-
- status = ndctl_cmd_fw_xlat_firmware_status(cmd);
- if (!(status & ND_CMD_STATUS_FIN_ABORTED)) {
- fprintf(stderr,
- "Firmware update abort on DIMM %s failed: %#x\n",
- ndctl_dimm_get_devname(dimm), status);
- rc = -ENXIO;
- goto out;
- }
-
-out:
- ndctl_cmd_unref(cmd);
- return rc;
-}
-
static enum ndctl_fwa_state fw_update_arm(struct ndctl_dimm *dimm)
{
struct ndctl_bus *bus = ndctl_dimm_get_bus(dimm);
@@ -856,15 +875,21 @@ static int update_firmware(struct ndctl_dimm *dimm,
struct action_context *actx)
{
const char *devname = ndctl_dimm_get_devname(dimm);
- int rc;
+ int rc, i;
rc = submit_get_firmware_info(dimm, actx);
if (rc < 0)
return rc;
- rc = submit_start_firmware_upload(dimm, actx);
- if (rc < 0)
- return rc;
+ /* try a few times in the --force and state busy case */
+ for (i = 0; i < 3; i++) {
+ rc = submit_start_firmware_upload(dimm, actx);
+ if (rc == -EAGAIN)
+ continue;
+ if (rc < 0)
+ return rc;
+ break;
+ }
if (param.verbose)
fprintf(stderr, "%s: uploading firmware\n", devname);
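
A note on the retry loop in update_firmware(): as written in the hunk above, if all three attempts return -EAGAIN the loop ends with rc still set to -EAGAIN and execution continues past it. Whether that fall-through is intended is not stated in the patch. The sketch below shows one way to make the exhausted-retries case explicit. It is not the ndctl code: retry_start(), start_fn and fake_start() are hypothetical names, and mapping the exhausted case to -EBUSY is an assumption.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback standing in for submit_start_firmware_upload(). */
typedef int (*start_fn)(void *ctx);

/*
 * Call start() up to max_tries times while it reports -EAGAIN.
 * Returns whatever start() returned as soon as it is not -EAGAIN
 * (0 on success, a negative errno on a hard error), or -EBUSY when
 * every attempt asked for a retry.
 */
static int retry_start(start_fn start, void *ctx, int max_tries)
{
	int i;

	for (i = 0; i < max_tries; i++) {
		int rc = start(ctx);

		if (rc != -EAGAIN)
			return rc;
	}
	return -EBUSY;
}

/* Hypothetical stub: reports busy for *ctx more calls, then succeeds. */
static int fake_start(void *ctx)
{
	int *busy_left = ctx;

	if (*busy_left > 0) {
		(*busy_left)--;
		return -EAGAIN;
	}
	return 0;
}
```

With this shape the caller only has to test the return value once, and a device that stays busy is reported as an error instead of proceeding to the upload step.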
[ndctl PATCH] ndctl/dimm: Fix submit_abort_firmware()
by Jane Chu
commit f86369ea29e2 ("ndctl: merge firmware-update into dimm.c as another dimm operation")
introduces submit_abort_firmware() that calls
ndctl_cmd_fw_xlat_firmware_status() to parse status returned
from a firmware abort action. The callee returns an FW_* enum value,
but the caller checks for ND_CMD_STATUS_FIN_ABORTED, which is a bit mask.
So firmware abort always "fails" even when it succeeds.
Fixes: f86369ea29e2 ("ndctl: merge firmware-update into dimm.c as another dimm operation")
Tested-by: Mark Baker <mark.a.baker(a)oracle.com>
Signed-off-by: Jane Chu <jane.chu(a)oracle.com>
---
ndctl/dimm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ndctl/dimm.c b/ndctl/dimm.c
index 8bb7f672e35c..255dbe156a34 100644
--- a/ndctl/dimm.c
+++ b/ndctl/dimm.c
@@ -531,7 +531,7 @@ static int submit_abort_firmware(struct ndctl_dimm *dimm,
goto out;
status = ndctl_cmd_fw_xlat_firmware_status(cmd);
- if (!(status & ND_CMD_STATUS_FIN_ABORTED)) {
+ if (status != FW_ABORTED) {
fprintf(stderr,
"Firmware update abort on DIMM %s failed: %#x\n",
ndctl_dimm_get_devname(dimm), status);
--
2.18.4
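
To illustrate the class of bug fixed here, a small stand-alone sketch follows. It does not use the real ndctl definitions: the enum xlat_status, its values, and RAW_STATUS_ABORTED are hypothetical stand-ins for "a translated enum" and "a raw status bit mask". It only shows why testing an enum value against a mask from a different value space can never succeed.

```c
#include <assert.h>

/* Hypothetical translated status codes: small sequential enum values. */
enum xlat_status {
	XLAT_SUCCESS = 0,
	XLAT_EBUSY,
	XLAT_ABORTED,
};

/* Hypothetical raw firmware status bit, living in a different value space. */
#define RAW_STATUS_ABORTED 0x00010000u

/* Buggy pattern: tests a translated enum against a raw bit mask. */
static int abort_ok_buggy(enum xlat_status status)
{
	return (status & RAW_STATUS_ABORTED) != 0;
}

/* Fixed pattern: compare within the translated value space. */
static int abort_ok_fixed(enum xlat_status status)
{
	return status == XLAT_ABORTED;
}
```

With these illustrative values the buggy check reports failure even for a successful abort, which matches the symptom described in the commit message.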
Re: [RFC Qemu PATCH v2 0/2] spapr: nvdimm: Asynchronous flush hcall support
by Greg Kurz
On Mon, 30 Nov 2020 09:16:14 -0600
Shivaprasad G Bhat <sbhat(a)linux.ibm.com> wrote:
> The nvdimm devices are expected to ensure write persistence in
> power-failure scenarios.
>
> The libpmem has architecture specific instructions like dcbf on power
> to flush the cache data to backend nvdimm device during normal writes.
>
> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
> doesn't translate to actual flush to the backend file on the host in case
> of file backed vnvdimms. This is addressed by virtio-pmem in case of x86_64
> by making asynchronous flushes.
>
> On PAPR, the issue is addressed by adding a new hcall with which the guest
> ndctl driver requests an explicit asynchronous flush when the backend
> nvdimm cannot ensure write persistence with dcbf
> alone. So, the approach here is to convey when the asynchronous flush is
> required in a device tree property. The guest makes the hcall when the
> property is found, instead of relying on dcbf.
>
> The first patch adds the necessary asynchronous hcall support infrastructure
> code at the DRC level. Second patch implements the hcall using the
> infrastructure.
>
> Hcall semantics are in review and not final.
>
> A new device property sync-dax is added to the nvdimm device. When the
> sync-dax is off(default), the asynchronous hcalls will be called.
>
> With respect to save from new qemu to restore on old qemu, having the
> sync-dax by default off(when not specified) causes IO errors in guests as
> the async-hcall would not be supported on old qemu. The new hcall
> implementation being supported only on the new pseries machine version,
> the current machine version checks may be sufficient to prevent
> such migration. Please suggest what should be done.
>
First, all requests that are still not completed from the guest POV,
ie. the hcall hasn't returned H_SUCCESS yet, are state that we should
migrate in theory. In this case, I guess we rather want to drain all
pending requests on the source in some pre-save handler.
Then, as explained in another mail, you should enforce stable behavior
for existing machine types with some hw_compat magic.
> The below demonstration shows the map_sync behavior with sync-dax on & off.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master...)
>
> The pmem0 is from nvdimm with sync-dax=on, and pmem1 is from nvdimm with sync-dax=off, mounted as
> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
>
> [root@atest-guest ~]# ./mapsync /mnt1/newfile ----> when sync-dax=on
> [root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when sync-dax=off
> Failed to mmap with Operation not supported
>
> ---
> v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
> Changes from v1
> - Fixed a missed-out unlock
> - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token
>
> Shivaprasad G Bhat (2):
> spapr: drc: Add support for async hcalls at the drc level
> spapr: nvdimm: Implement async flush hcalls
>
>
> hw/mem/nvdimm.c | 1
> hw/ppc/spapr_drc.c | 146 ++++++++++++++++++++++++++++++++++++++++++++
> hw/ppc/spapr_nvdimm.c | 79 ++++++++++++++++++++++++
> include/hw/mem/nvdimm.h | 10 +++
> include/hw/ppc/spapr.h | 3 +
> include/hw/ppc/spapr_drc.h | 25 ++++++++
> 6 files changed, 263 insertions(+), 1 deletion(-)
>
> --
> Signature
>
>
Re: [RFC Qemu PATCH v2 2/2] spapr: nvdimm: Implement async flush
hcalls
by Greg Kurz
On Mon, 30 Nov 2020 09:17:24 -0600
Shivaprasad G Bhat <sbhat(a)linux.ibm.com> wrote:
> When the persistent memory is backed by a file, a cpu cache flush instruction
> is not sufficient to ensure the stores are correctly flushed to the media.
>
> The patch implements the async hcalls for flush operation on demand from the
> guest kernel.
>
> The device option sync-dax is off by default, which enables explicit
> asynchronous flush requests from the guest. Setting sync-dax=on disables them.
>
> Signed-off-by: Shivaprasad G Bhat <sbhat(a)linux.ibm.com>
> ---
> hw/mem/nvdimm.c | 1 +
> hw/ppc/spapr_nvdimm.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++
> include/hw/mem/nvdimm.h | 10 ++++++
> include/hw/ppc/spapr.h | 3 +-
> 4 files changed, 92 insertions(+), 1 deletion(-)
>
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index 03c2201b56..37a4db0135 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -220,6 +220,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
>
> static Property nvdimm_properties[] = {
> DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
> + DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),
> DEFINE_PROP_END_OF_LIST(),
> };
>
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index a833a63b5e..557e36aa98 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -22,6 +22,7 @@
> * THE SOFTWARE.
> */
> #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
> #include "qapi/error.h"
> #include "hw/ppc/spapr_drc.h"
> #include "hw/ppc/spapr_nvdimm.h"
> @@ -155,6 +156,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
> "operating-system")));
> _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
>
> + if (!nvdimm->sync_dax) {
So this is done unconditionally for all machine types. This means that a
guest started on a newer QEMU cannot be migrated to an older QEMU. This
is annoying because people legitimately expect an existing machine type
to be migratable to any QEMU version that supports it.
This means that something like the following should be added in hw_compat_5_2[]
to fix the property for pre-6.0 machine types:
{ "nvdimm", "sync-dax", "on" },
> + _FDT(fdt_setprop(fdt, child_offset, "ibm,async-flush-required",
> + NULL, 0));
> + }
> +
> return child_offset;
> }
>
> @@ -370,6 +376,78 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
> return H_SUCCESS;
> }
>
> +typedef struct SCMAsyncFlushData {
> + int fd;
> + uint64_t token;
> +} SCMAsyncFlushData;
> +
> +static int flush_worker_cb(void *opaque)
> +{
> + int ret = H_SUCCESS;
> + SCMAsyncFlushData *req_data = opaque;
> +
> + /* flush raw backing image */
> + if (qemu_fdatasync(req_data->fd) < 0) {
> + error_report("papr_scm: Could not sync nvdimm to backend file: %s",
> + strerror(errno));
> + ret = H_HARDWARE;
> + }
> +
> + g_free(req_data);
> +
> + return ret;
> +}
> +
> +static target_ulong h_scm_async_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
> + target_ulong opcode, target_ulong *args)
> +{
> + int ret;
> + uint32_t drc_index = args[0];
> + uint64_t continue_token = args[1];
> + SpaprDrc *drc = spapr_drc_by_index(drc_index);
> + PCDIMMDevice *dimm;
> + HostMemoryBackend *backend = NULL;
> + SCMAsyncFlushData *req_data = NULL;
> +
> + if (!drc || !drc->dev ||
> + spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> + return H_PARAMETER;
> + }
> +
> + if (continue_token != 0) {
> + ret = spapr_drc_get_async_hcall_status(drc, continue_token);
> + if (ret == H_BUSY) {
> + args[0] = continue_token;
> + return H_LONG_BUSY_ORDER_1_SEC;
> + }
> +
> + return ret;
> + }
> +
> + dimm = PC_DIMM(drc->dev);
> + backend = MEMORY_BACKEND(dimm->hostmem);
> +
> + req_data = g_malloc0(sizeof(SCMAsyncFlushData));
> + req_data->fd = memory_region_get_fd(&backend->mr);
> +
> + continue_token = spapr_drc_get_new_async_hcall_token(drc);
> + if (!continue_token) {
> + g_free(req_data);
> + return H_P2;
> + }
> + req_data->token = continue_token;
> +
> + spapr_drc_run_async_hcall(drc, continue_token, &flush_worker_cb, req_data);
> +
> + ret = spapr_drc_get_async_hcall_status(drc, continue_token);
> + if (ret == H_BUSY) {
> + args[0] = req_data->token;
> + return ret;
> + }
> +
> + return ret;
> +}
> +
> static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
> target_ulong opcode, target_ulong *args)
> {
> @@ -486,6 +564,7 @@ static void spapr_scm_register_types(void)
> spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
> spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
> spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
> + spapr_register_hypercall(H_SCM_ASYNC_FLUSH, h_scm_async_flush);
> }
>
> type_init(spapr_scm_register_types)
> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> index c699842dd0..9e8795766e 100644
> --- a/include/hw/mem/nvdimm.h
> +++ b/include/hw/mem/nvdimm.h
> @@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
> #define NVDIMM_LABEL_SIZE_PROP "label-size"
> #define NVDIMM_UUID_PROP "uuid"
> #define NVDIMM_UNARMED_PROP "unarmed"
> +#define NVDIMM_SYNC_DAX_PROP "sync-dax"
>
> struct NVDIMMDevice {
> /* private */
> @@ -85,6 +86,15 @@ struct NVDIMMDevice {
> */
> bool unarmed;
>
> + /*
> + * On PPC64,
> + * The 'off' value results in the async-flush-required property set
> + * in the device tree for pseries machines. When 'off', the guest
> + * initiates explicit flush requests to the backend device ensuring
> + * write persistence.
> + */
> + bool sync_dax;
> +
> /*
> * The PPC64 - spapr requires each nvdimm device have a uuid.
> */
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 2e89e36cfb..6d7110b7dc 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -535,8 +535,9 @@ struct SpaprMachineState {
> #define H_SCM_BIND_MEM 0x3EC
> #define H_SCM_UNBIND_MEM 0x3F0
> #define H_SCM_UNBIND_ALL 0x3FC
> +#define H_SCM_ASYNC_FLUSH 0x4A0
>
> -#define MAX_HCALL_OPCODE H_SCM_UNBIND_ALL
> +#define MAX_HCALL_OPCODE H_SCM_ASYNC_FLUSH
>
> /* The hcalls above are standardized in PAPR and implemented by pHyp
> * as well.
>
>
>
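
For readers following the continue-token flow in h_scm_async_flush() quoted above, here is a simplified, self-contained model of how a caller might drive it: the first call passes token 0 and receives a token, and later calls pass that token back until the status is no longer busy. Everything in it is hypothetical (struct fake_dev, fake_flush_hcall(), guest_flush(), the ST_* codes), the real H_* values and hcall semantics are defined elsewhere and, per the cover letter, still under review. The model also always reports busy on the first call, which the quoted handler does not guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical status codes; these are not the PAPR H_* values. */
enum { ST_SUCCESS = 0, ST_BUSY = 1, ST_BAD_TOKEN = -1 };

/* Hypothetical device model tracking a single in-flight flush. */
struct fake_dev {
	uint64_t token;   /* token of the in-flight request, 0 if none */
	int polls_left;   /* polls remaining before the flush completes */
};

/* Models the handler: token 0 starts a flush, a non-zero token polls it. */
static int fake_flush_hcall(struct fake_dev *dev, uint64_t *token)
{
	if (*token == 0) {
		dev->token = 1;       /* hand out a fresh token */
		*token = dev->token;
		return ST_BUSY;
	}
	if (*token != dev->token)
		return ST_BAD_TOKEN;
	if (dev->polls_left > 0) {
		dev->polls_left--;
		return ST_BUSY;
	}
	dev->token = 0;               /* request retired */
	return ST_SUCCESS;
}

/* Caller side: re-issue the call with the token until it is not busy. */
static int guest_flush(struct fake_dev *dev, int *calls)
{
	uint64_t token = 0;
	int rc;

	do {
		rc = fake_flush_hcall(dev, &token);
		(*calls)++;
	} while (rc == ST_BUSY);

	return rc;
}
```

Requests that are still in the busy state in a model like this correspond to the not-yet-completed hcalls that the reply above suggests draining before migration.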
[PATCH] device-dax: Fix range release
by Dan Williams
There are multiple locations that open-code the release of the last
range in a device-dax instance. Consolidate this into a new
dev_dax_trim_range() helper.
This also addresses a kmemleak report:
# cat /sys/kernel/debug/kmemleak
[..]
unreferenced object 0xffff976bd46f6240 (size 64):
comm "ndctl", pid 23556, jiffies 4299514316 (age 5406.733s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 20 c3 37 00 00 00 .......... .7...
ff ff ff 7f 38 00 00 00 00 00 00 00 00 00 00 00 ....8...........
backtrace:
[<00000000064003cf>] __kmalloc_track_caller+0x136/0x379
[<00000000d85e3c52>] krealloc+0x67/0x92
[<00000000d7d3ba8a>] __alloc_dev_dax_range+0x73/0x25c
[<0000000027d58626>] devm_create_dev_dax+0x27d/0x416
[<00000000434abd43>] __dax_pmem_probe+0x1c9/0x1000 [dax_pmem_core]
[<0000000083726c1c>] dax_pmem_probe+0x10/0x1f [dax_pmem]
[<00000000b5f2319c>] nvdimm_bus_probe+0x9d/0x340 [libnvdimm]
[<00000000c055e544>] really_probe+0x230/0x48d
[<000000006cabd38e>] driver_probe_device+0x122/0x13b
[<0000000029c7b95a>] device_driver_attach+0x5b/0x60
[<0000000053e5659b>] bind_store+0xb7/0xc3
[<00000000d3bdaadc>] drv_attr_store+0x27/0x31
[<00000000949069c5>] sysfs_kf_write+0x4a/0x57
[<000000004a8b5adf>] kernfs_fop_write+0x150/0x1e5
[<00000000bded60f0>] __vfs_write+0x1b/0x34
[<00000000b92900f0>] vfs_write+0xd8/0x1d1
Reported-by: Jane Chu <jane.chu(a)oracle.com>
Cc: Zhen Lei <thunder.leizhen(a)huawei.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
drivers/dax/bus.c | 44 +++++++++++++++++++++-----------------------
1 file changed, 21 insertions(+), 23 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 9761cb40d4bb..720cd140209f 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -367,19 +367,28 @@ void kill_dev_dax(struct dev_dax *dev_dax)
}
EXPORT_SYMBOL_GPL(kill_dev_dax);
-static void free_dev_dax_ranges(struct dev_dax *dev_dax)
+static void trim_dev_dax_range(struct dev_dax *dev_dax)
{
+ int i = dev_dax->nr_range - 1;
+ struct range *range = &dev_dax->ranges[i].range;
struct dax_region *dax_region = dev_dax->region;
- int i;
device_lock_assert(dax_region->dev);
- for (i = 0; i < dev_dax->nr_range; i++) {
- struct range *range = &dev_dax->ranges[i].range;
-
- __release_region(&dax_region->res, range->start,
- range_len(range));
+ dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
+ (unsigned long long)range->start,
+ (unsigned long long)range->end);
+
+ __release_region(&dax_region->res, range->start, range_len(range));
+ if (--dev_dax->nr_range == 0) {
+ kfree(dev_dax->ranges);
+ dev_dax->ranges = NULL;
}
- dev_dax->nr_range = 0;
+}
+
+static void free_dev_dax_ranges(struct dev_dax *dev_dax)
+{
+ while (dev_dax->nr_range)
+ trim_dev_dax_range(dev_dax);
}
static void unregister_dev_dax(void *dev)
@@ -804,15 +813,10 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
return 0;
rc = devm_register_dax_mapping(dev_dax, dev_dax->nr_range - 1);
- if (rc) {
- dev_dbg(dev, "delete range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
- &alloc->start, &alloc->end);
- dev_dax->nr_range--;
- __release_region(res, alloc->start, resource_size(alloc));
- return rc;
- }
+ if (rc)
+ trim_dev_dax_range(dev_dax);
- return 0;
+ return rc;
}
static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
@@ -885,12 +889,7 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
if (shrink >= range_len(range)) {
devm_release_action(dax_region->dev,
unregister_dax_mapping, &mapping->dev);
- __release_region(&dax_region->res, range->start,
- range_len(range));
- dev_dax->nr_range--;
- dev_dbg(dev, "delete range[%d]: %#llx:%#llx\n", i,
- (unsigned long long) range->start,
- (unsigned long long) range->end);
+ trim_dev_dax_range(dev_dax);
to_shrink -= shrink;
if (!to_shrink)
break;
@@ -1267,7 +1266,6 @@ static void dev_dax_release(struct device *dev)
put_dax(dax_dev);
free_dev_dax_id(dev_dax);
dax_region_put(dax_region);
- kfree(dev_dax->ranges);
kfree(dev_dax->pgmap);
kfree(dev_dax);
}
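For readers following the ownership change in the diff above, here is a minimal user-space sketch of the same pattern: always release the last range, and free the backing array when the count drops to zero, so the final release path needs no separate free. The names struct dev_ranges, add_range(), trim_range() and free_ranges() are hypothetical stand-ins, not kernel symbols, and realloc()/free() stand in for krealloc()/kfree(). The region bookkeeping done by __release_region() is deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct range {
	unsigned long long start, end;
};

struct dev_ranges {
	struct range *ranges;	/* heap array, grown with realloc() */
	int nr_range;		/* number of valid entries in ranges[] */
};

/* Append one range; returns 0 on success, -1 on allocation failure. */
static int add_range(struct dev_ranges *d, unsigned long long start,
		     unsigned long long end)
{
	struct range *tmp = realloc(d->ranges,
				    sizeof(*tmp) * (d->nr_range + 1));

	if (!tmp)
		return -1;
	d->ranges = tmp;
	d->ranges[d->nr_range].start = start;
	d->ranges[d->nr_range].end = end;
	d->nr_range++;
	return 0;
}

/* Drop the last range; free the array once the final entry is gone. */
static void trim_range(struct dev_ranges *d)
{
	assert(d->nr_range > 0);
	if (--d->nr_range == 0) {
		free(d->ranges);
		d->ranges = NULL;
	}
}

/* Release everything by repeatedly trimming from the tail. */
static void free_ranges(struct dev_ranges *d)
{
	while (d->nr_range)
		trim_range(d);
}
```

Because the array pointer is reset to NULL together with the count, a later add_range() on the same object starts from a clean state, and every error path that undoes the most recent allocation can call trim_range() instead of open-coding the bookkeeping.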
1 year, 5 months
Re: [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release()
by Joao Martins
On 12/8/20 7:29 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:29:00PM +0000, Joao Martins wrote:
>
>> static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
>> {
>> + bool make_dirty = umem->writable && dirty;
>> + struct page **page_list = NULL;
>> struct sg_page_iter sg_iter;
>> + unsigned long nr = 0;
>> struct page *page;
>>
>> + page_list = (struct page **) __get_free_page(GFP_KERNEL);
>
> Gah, no, don't do it like this!
>
> Instead something like:
>
> for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
> unpin_use_pages_range_dirty_lock(sg_page(sg), sg->length/PAGE_SIZE,
> umem->writable && dirty);
>
> And have the mm implementation split the contiguous range of pages into
> pairs of (compound head, ntails) with a bit of maths.
>
Got it :)
I was trying to avoid another exported symbol.
Although, given your suggestion below, avoiding it doesn't justify the loss in efficiency and clarity.
Joao
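The "bit of maths" Jason mentions can be illustrated with a small user-space sketch. It assumes, purely for illustration, that pages are identified by page frame number and that every compound page in the range has the same order and is naturally aligned. The real mm code would query each page (for example via compound_head()) rather than assume a fixed order. struct head_run and split_range() are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* One (compound head, ntails) pair as described in the thread. */
struct head_run {
	unsigned long head;	/* pfn of the compound head */
	unsigned long ntails;	/* pages of that compound page inside the range */
};

/*
 * Split the contiguous range [pfn, pfn + npages) into per-compound-page
 * runs. Assumes every compound page has 2^order pages and is naturally
 * aligned (an assumption of this sketch only). Writes at most max
 * entries to out[] and returns the number written.
 */
static size_t split_range(unsigned long pfn, unsigned long npages,
			  unsigned int order, struct head_run *out,
			  size_t max)
{
	unsigned long per_compound = 1UL << order;
	size_t n = 0;

	while (npages && n < max) {
		/* Round down to the head of the compound page holding pfn. */
		unsigned long head = pfn & ~(per_compound - 1);
		/* Pages left in this compound page, starting at pfn. */
		unsigned long left = head + per_compound - pfn;
		unsigned long take = left < npages ? left : npages;

		out[n].head = head;
		out[n].ntails = take;
		n++;
		pfn += take;
		npages -= take;
	}
	return n;
}
```

With one entry per compound page, the caller can adjust the pin count on each head once by ntails instead of touching every page individually, which is the batching the suggestion is after.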
1 year, 5 months