Re: Detecting NUMA per pmem
by Oren Berman
Hi Ross
Thanks for the speedy reply. I am also adding the public list to this
thread as you suggested.
We have tried to dump the SPA table and this is what we get:
/*
* Intel ACPI Component Architecture
* AML/ASL+ Disassembler version 20160108-64
* Copyright (c) 2000 - 2016 Intel Corporation
*
* Disassembly of NFIT, Sun Oct 22 10:46:19 2017
*
* ACPI Data Table [NFIT]
*
* Format: [HexOffset DecimalOffset ByteLength] FieldName : FieldValue
*/
[000h 0000 4] Signature : "NFIT" [NVDIMM Firmware
Interface Table]
[004h 0004 4] Table Length : 00000028
[008h 0008 1] Revision : 01
[009h 0009 1] Checksum : B2
[00Ah 0010 6] Oem ID : "SUPERM"
[010h 0016 8] Oem Table ID : "SMCI--MB"
[018h 0024 4] Oem Revision : 00000001
[01Ch 0028 4] Asl Compiler ID : " "
[020h 0032 4] Asl Compiler Revision : 00000001
[024h 0036 4] Reserved : 00000000
Raw Table Data: Length 40 (0x28)
0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D // NFIT(.....SUPERM
0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00 // SMCI--MB........
0020: 01 00 00 00 00 00 00 00
As you can see, the memory region info is missing.
This specific check was done on a Supermicro server.
We also performed a BIOS update, but the results were the same.
As mentioned before, the pmem devices are detected correctly, and we verified
with the PCM utility that they correspond to different NUMA nodes. However,
Linux still reports both pmem devices to be on the same NUMA node - node 0.
If this information is missing, why are the pmem devices and address ranges
still detected correctly?
Is there another table that we need to check?
I also ran dmidecode and the NVDIMMs are listed (we tested with Netlist
NVDIMMs). I can also see the bank locator showing P0 and P1, which I think
indicates the NUMA node. Here is an example:
Handle 0x002D, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002A
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA3
Bank Locator: P0_Node0_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66F50006
Asset Tag: P1-DIMMA3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x003B, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMME3
Bank Locator: P1_Node1_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66B50010
Asset Tag: P2-DIMME3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Did you encounter such a case? We would appreciate any insight you might
have.
BR
Oren Berman
On 20 October 2017 at 19:22, Ross Zwisler <ross.zwisler(a)linux.intel.com>
wrote:
> On Thu, Oct 19, 2017 at 06:12:24PM +0300, Oren Berman wrote:
> > Hi Ross
> > My name is Oren Berman and I am a senior developer at lightbitslabs.
> > We are working with NVDIMMs, but we encountered a problem: the kernel
> > does not seem to detect the NUMA id per PMEM device.
> > It always reports NUMA node 0, although we have NVDIMM devices on both
> > nodes.
> > We checked that it always returns 0 from sysfs and also from retrieving
> > the device of pmem in the kernel and calling dev_to_node.
> > The result is always 0 for both pmem0 and pmem1.
> > In order to make sure that both NUMA sockets are indeed used, we ran
> > Intel's PCM utility. We verified that writing to pmem0 increases socket 0
> > utilization and writing to pmem1 increases socket 1 utilization, so the hw
> > works properly.
> > Only the detection seems to be invalid.
> > Did you encounter such a problem?
> > We are using kernel version 4.9 - are you aware of any fix for this issue
> > or a workaround that we can use?
> > Are we missing something?
> > Thanks for any help you can give us.
> > BR
> > Oren Berman
>
> Hi Oren,
>
> My first guess is that your platform isn't properly filling out the
> "proximity domain" field in the NFIT SPA table.
>
> See section 5.2.25.2 in ACPI 6.2:
> http://uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>
> Here's how to check that:
>
> # cd /tmp
> # cp /sys/firmware/acpi/tables/NFIT .
> # iasl NFIT
>
> Intel ACPI Component Architecture
> ASL+ Optimizing Compiler version 20160831-64
> Copyright (c) 2000 - 2016 Intel Corporation
>
> Binary file appears to be a valid ACPI table, disassembling
> Input file NFIT, Length 0xE0 (224) bytes
> ACPI: NFIT 0x0000000000000000 0000E0 (v01 BOCHS BXPCNFIT 00000001 BXPC
> 00000001)
> Acpi Data Table [NFIT] decoded
> Formatted output: NFIT.dsl - 5191 bytes
>
> This will give you an NFIT.dsl file which you can look at. Here is what my
> SPA table looks like for an emulated QEMU NVDIMM:
>
> [028h 0040 2] Subtable Type : 0000 [System Physical
> Address Range]
> [02Ah 0042 2] Length : 0038
>
> [02Ch 0044 2] Range Index : 0002
> [02Eh 0046 2] Flags (decoded below) : 0003
> Add/Online Operation Only : 1
> Proximity Domain Valid : 1
> [030h 0048 4] Reserved : 00000000
> [034h 0052 4] Proximity Domain : 00000000
> [038h 0056 16] Address Range GUID :
> 66F0D379-B4F3-4074-AC43-0D3318B78CDB
> [048h 0072 8] Address Range Base : 0000000240000000
> [050h 0080 8] Address Range Length : 0000000440000000
> [058h 0088 8] Memory Map Attribute : 0000000000008008
>
> So, the "Proximity Domain" field is 0, and this lets the system know which
> NUMA node to associate with this memory region.
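>
> For reference, here is roughly how the Linux nfit driver consumes that field
> when it registers a pmem region; this is a simplified sketch of the logic in
> acpi_nfit_register_region(), and the helper name nfit_spa_to_node() is made
> up purely for illustration:
>
> static int nfit_spa_to_node(struct acpi_nfit_system_address *spa)
> {
>         if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
>                 /* translate the ACPI proximity domain to a Linux node id */
>                 return acpi_map_pxm_to_online_node(spa->proximity_domain);
>
>         /* no usable proximity information in this SPA entry */
>         return NUMA_NO_NODE;
> }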
>
> BTW, in the future it's best to CC our public list,
> linux-nvdimm(a)lists.01.org,
> as a) someone else might have the same question and b) someone else might
> know
> the answer.
>
> Thanks,
> - Ross
>
[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added from reserved range. No change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
The patch-set applies on linux-pm.git acpica.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
[PATCH 0/18 v6] dax, ext4, xfs: Synchronous page faults
by Jan Kara
Hello,
here is the sixth version of my patches to implement synchronous page faults
for DAX mappings. They make flushing of DAX mappings possible from userspace,
so that they can be flushed at finer-than-page granularity while also avoiding
the overhead of a syscall.
I think we are ready to get this merged - I've talked to Dan and he said he
could take the patches through his tree. It would just be nice to get final
ack from Christoph for the first patch implementing MAP_VALIDATE and someone
from XFS folks to check patch 17 (make xfs_filemap_pfn_mkwrite use
__xfs_filemap_fault()).
---
We use a new mmap flag MAP_SYNC to indicate that page faults for the mapping
should be synchronous. The guarantee provided by this flag is: while a block
is writably mapped into the page tables of this mapping, it is guaranteed to
be visible in the file at that offset even after a crash.
The implementation is that ->iomap_begin() indicates via a flag that the
inode's block mapping metadata is unstable and may need flushing (using the
same test as whether fdatasync() has metadata to write). If so, the DAX fault
handler refrains from inserting / write-enabling the page table entry and
instead returns the special flag VM_FAULT_NEEDDSYNC together with a PFN to map
to the filesystem fault handler. The handler then calls fdatasync()
(vfs_fsync_range()) for the affected range and after that calls the DAX code
to update the page table entry appropriately.
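From userspace the feature boils down to one new mmap(2) flag. Below is a
minimal sketch of the intended usage; the flag values are copied from the
proposed uapi additions and the file path is only an example:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_SYNC                        /* proposed uapi values */
#define MAP_SHARED_VALIDATE 0x03
#define MAP_SYNC            0x080000
#endif

int main(void)
{
        int fd = open("/mnt/dax/file", O_RDWR); /* file on a DAX filesystem */
        char *p;

        if (fd < 0)
                return 1;

        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED) {
                /* e.g. EOPNOTSUPP: fs/device cannot honor MAP_SYNC */
                perror("mmap");
                return 1;
        }

        /*
         * After the synchronous fault the block allocation backing this
         * page is durable; the store itself still needs a CPU cache flush
         * to be persistent.
         */
        p[0] = 'x';
        return 0;
}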
I did some basic performance testing of the patches on a ramdisk - timed
latency of page faults when faulting 512 pages. I ran several tests: with the
file preallocated / with the file empty, with background file copying going on
/ without it, and with / without MAP_SYNC (so that we get a comparison). The
results are (numbers are in microseconds):
File preallocated, no background load no MAP_SYNC:
min=9 avg=10 max=46
8 - 15 us: 508
16 - 31 us: 3
32 - 63 us: 1
File preallocated, no background load, MAP_SYNC:
min=9 avg=10 max=47
8 - 15 us: 508
16 - 31 us: 2
32 - 63 us: 2
File empty, no background load, no MAP_SYNC:
min=21 avg=22 max=70
16 - 31 us: 506
32 - 63 us: 5
64 - 127 us: 1
File empty, no background load, MAP_SYNC:
min=40 avg=124 max=242
32 - 63 us: 1
64 - 127 us: 333
128 - 255 us: 178
File empty, background load, no MAP_SYNC:
min=21 avg=23 max=67
16 - 31 us: 507
32 - 63 us: 4
64 - 127 us: 1
File empty, background load, MAP_SYNC:
min=94 avg=112 max=181
64 - 127 us: 489
128 - 255 us: 23
So here we can see that the difference between MAP_SYNC and non-MAP_SYNC is
about 100-200 us when we need to wait for a transaction commit in this setup.
Changes since v5:
* really updated the manpage
* improved comment describing IOMAP_F_DIRTY
* fixed XFS handling of VM_FAULT_NEEDDSYNC in xfs_filemap_pfn_mkwrite()
Changes since v4:
* fixed couple of minor things in the manpage
* make legacy mmap flags always supported, remove them from mask declared
to be supported by ext4 and xfs
Changes since v3:
* updated some changelogs
* folded fs support for VM_SYNC flag into patches implementing the
functionality
* removed ->mmap_validate, use ->mmap_supported_flags instead
* added some Reviewed-by tags
* added manpage patch
Changes since v2:
* avoid unnecessary flushing of faulted page (Ross) - I've realized it makes no
sense to remeasure my benchmark results (after actually doing that and seeing
no difference, sigh) since I use ramdisk and not real PMEM HW and so flushes
are ignored.
* handle nojournal mode of ext4
* other smaller cleanups & fixes (Ross)
* factor larger part of finishing of synchronous fault into a helper (Christoph)
* reorder pfnp argument of dax_iomap_fault() (Christoph)
* add XFS support from Christoph
* use proper MAP_SYNC support in mmap(2)
* rebased on top of 4.14-rc4
Changes since v1:
* switched to using mmap flag MAP_SYNC
* cleaned up fault handlers to avoid passing pfn in vmf->orig_pte
* switched to not touching page tables before we are ready to insert final
entry as it was unnecessary and not really simplifying anything
* renamed fault flag to VM_FAULT_NEEDDSYNC
* other smaller fixes found by reviewers
Honza
[PATCH v4 00/18] dax: fix dma vs truncate/hole-punch
by Dan Williams
Changes since v3 [1]:
* Kill the i_daxdma_lock, and do not impose any new locking constraints
on filesystem implementations (Dave)
* Reuse the existing i_mmap_lock for synchronizing against
get_user_pages() by unmapping and causing punch-hole/truncate to
re-fault the page before get_user_pages() can elevate the page reference
count (Jan)
* Create a dax-specific address_space_operations instance for each
filesystem. This allows page->mapping to be set for dax pages. (Jan).
* Change the ext4 and ext2 policy of 'mount -o dax' vs a device that
does not support dax. This converts any environments that may have
been using 'page-less' dax back to using page cache.
* Rename wait_on_devmap_idle() to wait_on_atomic_one(), a generic
facility for waiting for an atomic counter to reach a value of '1'.
[1]: https://lwn.net/Articles/737273/
---
Background:
get_user_pages() pins file-backed memory pages for access by dma
devices. However, it only pins the memory pages, not the page-to-file
offset association. If a file is truncated, the pages are mapped out of
the file and dma may continue indefinitely into a page that is owned by
a device driver. This breaks coherency of the file vs dma, but the
assumption is that if userspace wants the file-space truncated it does
not matter what data is inbound from the device, it is not relevant
anymore. The only expectation is that dma can safely continue while the
filesystem reallocates the block(s).
Problem:
This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesystem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.
Solution:
Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".
The dax_flush_dma() routine is called by filesystems with a lock held
against mm faults (i_mmap_lock). It then invalidates all mappings to
trigger any subsequent get_user_pages() to block on i_mmap_lock. Finally
it scans/rescans all pages in the mapping until it observes all pages
idle.
So far this solution only targets xfs since it already implements
xfs_break_layouts in all the locations that would need this
synchronization. It applies on top of the vmem_altmap / dev_pagemap
reworks from Christoph.
---
Dan Williams (18):
mm, dax: introduce pfn_t_special()
ext4: auto disable dax instead of failing mount
ext2: auto disable dax instead of failing mount
dax: require 'struct page' by default for filesystem dax
dax: stop using VM_MIXEDMAP for dax
dax: stop using VM_HUGEPAGE for dax
dax: store pfns in the radix
tools/testing/nvdimm: add 'bio_delay' mechanism
mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
fs, dax: introduce DEFINE_FSDAX_AOPS
xfs: use DEFINE_FSDAX_AOPS
ext4: use DEFINE_FSDAX_AOPS
ext2: use DEFINE_FSDAX_AOPS
mm, fs, dax: use page->mapping to warn if dma collides with truncate
wait_bit: introduce {wait_on,wake_up}_atomic_one
mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper
arch/powerpc/platforms/Kconfig | 1
arch/powerpc/sysdev/axonram.c | 2
drivers/dax/device.c | 1
drivers/dax/super.c | 100 ++++++++++-
drivers/nvdimm/pmem.c | 3
drivers/s390/block/Kconfig | 1
drivers/s390/block/dcssblk.c | 3
fs/Kconfig | 8 +
fs/dax.c | 295 ++++++++++++++++++++++++++++-----
fs/ext2/ext2.h | 1
fs/ext2/file.c | 1
fs/ext2/inode.c | 23 ++-
fs/ext2/namei.c | 18 --
fs/ext2/super.c | 13 +
fs/ext4/file.c | 1
fs/ext4/inode.c | 6 +
fs/ext4/super.c | 15 +-
fs/xfs/Makefile | 3
fs/xfs/xfs_aops.c | 2
fs/xfs/xfs_aops.h | 1
fs/xfs/xfs_dma.c | 81 +++++++++
fs/xfs/xfs_dma.h | 24 +++
fs/xfs/xfs_file.c | 8 -
fs/xfs/xfs_ioctl.c | 7 -
fs/xfs/xfs_iops.c | 12 +
fs/xfs/xfs_super.c | 20 +-
include/linux/dax.h | 70 +++++++-
include/linux/memremap.h | 28 +--
include/linux/mm.h | 62 +++++--
include/linux/pfn_t.h | 13 +
include/linux/vma.h | 23 +++
include/linux/wait_bit.h | 13 +
kernel/memremap.c | 30 +++
kernel/sched/wait_bit.c | 59 ++++++-
mm/Kconfig | 5 +
mm/gup.c | 5 +
mm/hmm.c | 13 -
mm/huge_memory.c | 6 -
mm/ksm.c | 3
mm/madvise.c | 2
mm/memory.c | 22 ++
mm/migrate.c | 3
mm/mlock.c | 5 -
mm/mmap.c | 8 -
mm/swap.c | 3
tools/testing/nvdimm/Kbuild | 1
tools/testing/nvdimm/test/iomap.c | 62 +++++++
tools/testing/nvdimm/test/nfit.c | 34 ++++
tools/testing/nvdimm/test/nfit_test.h | 1
49 files changed, 918 insertions(+), 203 deletions(-)
create mode 100644 fs/xfs/xfs_dma.c
create mode 100644 fs/xfs/xfs_dma.h
create mode 100644 include/linux/vma.h
[PATCH] acpi: add NFIT and HMAT to the initrd override list
by Dan Williams
These tables, NFIT and HMAT, are essential for describing
next-generation platform memory topologies and performance
characteristics. Allow them to be overridden for debug and test
purposes.
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
drivers/acpi/tables.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 80ce2a7d224b..67a44fd79449 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -456,7 +456,8 @@ static const char * const table_sigs[] = {
ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
- ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
+ ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, ACPI_SIG_NFIT,
+ ACPI_SIG_HMAT, NULL };
#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
[PATCH 1/2] ndctl: add support to enable latch system shutdown status
by Dave Jiang
Add the Enable Latch System Shutdown Status command (Function Index 10) from
the DSM v1.6 spec.
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
ndctl/lib/Makefile.am | 1 +
ndctl/lib/intel.c | 38 ++++++++++++++++++++++++++++++++++++++
ndctl/lib/intel.h | 7 +++++++
ndctl/lib/libndctl.sym | 2 ++
ndctl/lib/lss.c | 43 +++++++++++++++++++++++++++++++++++++++++++
ndctl/lib/private.h | 2 ++
6 files changed, 93 insertions(+)
create mode 100644 ndctl/lib/lss.c
diff --git a/ndctl/lib/Makefile.am b/ndctl/lib/Makefile.am
index e3a12e7..c1ea371 100644
--- a/ndctl/lib/Makefile.am
+++ b/ndctl/lib/Makefile.am
@@ -22,6 +22,7 @@ libndctl_la_SOURCES =\
msft.c \
ars.c \
firmware.c \
+ lss.c \
libndctl.c
libndctl_la_LIBADD =\
diff --git a/ndctl/lib/intel.c b/ndctl/lib/intel.c
index 6d26a6c..10a0881 100644
--- a/ndctl/lib/intel.c
+++ b/ndctl/lib/intel.c
@@ -547,6 +547,42 @@ static unsigned long intel_cmd_fw_fquery_get_fw_rev(struct ndctl_cmd *cmd)
return cmd->intel->fquery.updated_fw_rev;
}
+static struct ndctl_cmd *
+intel_dimm_cmd_new_lss_enable(struct ndctl_dimm *dimm)
+{
+ struct ndctl_cmd *cmd;
+
+ BUILD_ASSERT(sizeof(struct nd_intel_lss) == 5);
+
+ cmd = alloc_intel_cmd(dimm, ND_INTEL_ENABLE_LSS_STATUS, 1, 4);
+ if (!cmd)
+ return NULL;
+
+ cmd->firmware_status = &cmd->intel->lss.status;
+ return cmd;
+}
+
+static int intel_lss_set_valid(struct ndctl_cmd *cmd)
+{
+ struct nd_pkg_intel *pkg = cmd->intel;
+
+ if (cmd->type != ND_CMD_CALL || cmd->status != 1
+ || pkg->gen.nd_family != NVDIMM_FAMILY_INTEL
+ || pkg->gen.nd_command != ND_INTEL_ENABLE_LSS_STATUS)
+ return -EINVAL;
+ return 0;
+}
+
+static int
+intel_cmd_lss_set_enable(struct ndctl_cmd *cmd, unsigned char enable)
+{
+ if (intel_lss_set_valid(cmd) < 0)
+ return -EINVAL;
+ cmd->intel->lss.enable = enable;
+ return 0;
+}
+
+
struct ndctl_dimm_ops * const intel_dimm_ops = &(struct ndctl_dimm_ops) {
.cmd_desc = intel_cmd_desc,
.new_smart = intel_dimm_cmd_new_smart,
@@ -601,4 +637,6 @@ struct ndctl_dimm_ops * const intel_dimm_ops = &(struct ndctl_dimm_ops) {
.new_fw_finish_query = intel_dimm_cmd_new_fw_finish_query,
.fw_fquery_set_context = intel_cmd_fw_fquery_set_context,
.fw_fquery_get_fw_rev = intel_cmd_fw_fquery_get_fw_rev,
+ .new_lss_enable = intel_dimm_cmd_new_lss_enable,
+ .lss_set_enable = intel_cmd_lss_set_enable,
};
diff --git a/ndctl/lib/intel.h b/ndctl/lib/intel.h
index 080e37b..92bed53 100644
--- a/ndctl/lib/intel.h
+++ b/ndctl/lib/intel.h
@@ -6,6 +6,7 @@
#define ND_INTEL_SMART 1
#define ND_INTEL_SMART_THRESHOLD 2
+#define ND_INTEL_ENABLE_LSS_STATUS 10
#define ND_INTEL_FW_GET_INFO 12
#define ND_INTEL_FW_START_UPDATE 13
#define ND_INTEL_FW_SEND_DATA 14
@@ -118,6 +119,11 @@ struct nd_intel_fw_finish_query {
__u64 updated_fw_rev;
} __attribute__((packed));
+struct nd_intel_lss {
+ __u8 enable;
+ __u32 status;
+} __attribute__((packed));
+
struct nd_pkg_intel {
struct nd_cmd_pkg gen;
union {
@@ -129,6 +135,7 @@ struct nd_pkg_intel {
struct nd_intel_fw_send_data send;
struct nd_intel_fw_finish_update finish;
struct nd_intel_fw_finish_query fquery;
+ struct nd_intel_lss lss;
};
};
#endif /* __INTEL_H__ */
diff --git a/ndctl/lib/libndctl.sym b/ndctl/lib/libndctl.sym
index 2e248ab..65673ad 100644
--- a/ndctl/lib/libndctl.sym
+++ b/ndctl/lib/libndctl.sym
@@ -344,4 +344,6 @@ global:
ndctl_cmd_fw_finish_set_ctrl_flags;
ndctl_cmd_fw_finish_set_context;
ndctl_cmd_fw_fquery_set_context;
+ ndctl_dimm_cmd_new_lss_enable;
+ ndctl_cmd_lss_set_enable;
} LIBNDCTL_13;
diff --git a/ndctl/lib/lss.c b/ndctl/lib/lss.c
new file mode 100644
index 0000000..fbeeec5
--- /dev/null
+++ b/ndctl/lib/lss.c
@@ -0,0 +1,43 @@
+/*
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU Lesser General Public License,
+ * version 2.1, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT ANY
+ * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+ * FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for
+ * more details.
+ */
+#include <stdlib.h>
+#include <limits.h>
+#include <util/log.h>
+#include <ndctl/libndctl.h>
+#include "private.h"
+
+/*
+ * Define the wrappers around the ndctl_dimm_ops for LSS DSM
+ */
+NDCTL_EXPORT struct ndctl_cmd *
+ndctl_dimm_cmd_new_lss_enable(struct ndctl_dimm *dimm)
+{
+ struct ndctl_dimm_ops *ops = dimm->ops;
+
+ if (ops && ops->new_lss_enable)
+ return ops->new_lss_enable(dimm);
+ else
+ return NULL;
+}
+
+NDCTL_EXPORT int
+ndctl_cmd_lss_set_enable(struct ndctl_cmd *cmd, unsigned char enable)
+{
+ if (cmd->dimm) {
+ struct ndctl_dimm_ops *ops = cmd->dimm->ops;
+
+ if (ops && ops->lss_set_enable)
+ return ops->lss_set_enable(cmd, enable);
+ }
+ return -ENXIO;
+}
diff --git a/ndctl/lib/private.h b/ndctl/lib/private.h
index 20f9e6e..4035e11 100644
--- a/ndctl/lib/private.h
+++ b/ndctl/lib/private.h
@@ -325,6 +325,8 @@ struct ndctl_dimm_ops {
struct ndctl_cmd *(*new_fw_finish_query)(struct ndctl_dimm *);
int (*fw_fquery_set_context)(struct ndctl_cmd *, unsigned int context);
unsigned long (*fw_fquery_get_fw_rev)(struct ndctl_cmd *);
+ struct ndctl_cmd *(*new_lss_enable)(struct ndctl_dimm *);
+ int (*lss_set_enable)(struct ndctl_cmd *, unsigned char enable);
};
struct ndctl_dimm_ops * const intel_dimm_ops;
diff --git a/ndctl/libndctl.h b/ndctl/libndctl.h
index 64a4e99..24dc7a3 100644
--- a/ndctl/libndctl.h
+++ b/ndctl/libndctl.h
@@ -607,6 +607,9 @@ int ndctl_cmd_fw_finish_set_ctrl_flags(struct ndctl_cmd *cmd, unsigned char flag
int ndctl_cmd_fw_finish_set_context(struct ndctl_cmd *cmd, unsigned int context);
int ndctl_cmd_fw_fquery_set_context(struct ndctl_cmd *cmd, unsigned int context);
+struct ndctl_cmd *ndctl_dimm_cmd_new_lss_enable(struct ndctl_dimm *dimm);
+int ndctl_cmd_lss_set_enable(struct ndctl_cmd *cmd, unsigned char enable);
+
#ifdef __cplusplus
} /* extern "C" */
#endif
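
For completeness, a consumer of the two new calls would look roughly like
this; an illustrative sketch only, where enable_lss() is a made-up wrapper
name and ndctl_cmd_submit()/ndctl_cmd_unref() are the existing libndctl
helpers:

#include <errno.h>
#include <ndctl/libndctl.h>

/* sketch: enable latch system shutdown status reporting on one DIMM */
static int enable_lss(struct ndctl_dimm *dimm)
{
        struct ndctl_cmd *cmd;
        int rc;

        cmd = ndctl_dimm_cmd_new_lss_enable(dimm);
        if (!cmd)
                return -ENOTTY;                 /* no LSS support */

        rc = ndctl_cmd_lss_set_enable(cmd, 1);  /* 1 == enable the latch */
        if (rc == 0)
                rc = ndctl_cmd_submit(cmd);     /* issue the DSM to the DIMM */

        ndctl_cmd_unref(cmd);
        return rc;
}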
[PATCH v2 0/4] add support for platform persistence capabilities
by Dave Jiang
ACPI 6.2a provides an NFIT sub-table that indicates whether the platform has
auto CPU flush and memory flush on unexpected power loss events. This series
propagates those attributes to nd_region and adds a sysfs attribute to show
those capabilities.
---
v2:
Per Dan's comments
- Added ADR cap flags propagation
- Added sysfs attribute
Dave Jiang (4):
acpi: nfit: Add support for detect platform CPU cache flush on power loss
acpi: nfit: add persistent memory control flag for nd_region
libnvdimm: expose platform persistence attribute for nd_region
nfit-test: Add platform cap support from ACPI 6.2a to test
drivers/acpi/nfit/core.c | 23 +++++++++++++++++++++++
drivers/acpi/nfit/nfit.h | 1 +
drivers/nvdimm/pmem.c | 4 +++-
drivers/nvdimm/region_devs.c | 14 ++++++++++++++
include/linux/libnvdimm.h | 11 +++++++++++
tools/testing/nvdimm/test/nfit.c | 11 ++++++++++-
6 files changed, 62 insertions(+), 2 deletions(-)
--
[PATCH v3 0/4] ndctl: add DIMM firmware update support
by Dave Jiang
The following series implements support for DIMM firmware update in ndctl.
v3: Addressed Dan's comments
- Removed Intel specific bits from update get info.
- Added inherited context for related commands
- Moved all input params into new_cmd function.
- Added translated status return
---
Dave Jiang (4):
ndctl: add support to alloc_intel_cmd for variable payload
ndctl: add firmware download support functions in libndctl
ndctl: add firmware update command option for ndctl
ndctl, test: firmware update unit test
ndctl/Makefile.am | 3
ndctl/lib/Makefile.am | 1
ndctl/lib/firmware.c | 119 +++++++++++
ndctl/lib/intel.c | 264 +++++++++++++++++++++++
ndctl/lib/intel.h | 77 +++++++
ndctl/lib/libndctl.sym | 15 +
ndctl/lib/private.h | 16 +
ndctl/libndctl.h | 36 +++
ndctl/ndctl.c | 1
ndctl/update.c | 544 ++++++++++++++++++++++++++++++++++++++++++++++++
10 files changed, 1074 insertions(+), 2 deletions(-)
create mode 100644 ndctl/lib/firmware.c
create mode 100644 ndctl/update.c
--
Signature
Re: KVM "fake DAX" flushing interface - discussion
by Xiao Guangrong
On 11/22/2017 02:19 AM, Rik van Riel wrote:
> We can go with the "best" interface for what
> could be a relatively slow flush (fsync on a
> file on ssd/disk on the host), which requires
> that the flushing task wait on completion
> asynchronously.
I'd like to clarify the interface of "wait on completion
asynchronously" and KVM async page fault a bit more.
The current design of async page fault only works on RAM rather
than MMIO, i.e., if the page fault is caused by accessing the
device memory of an emulated device, it needs to go to
userspace (QEMU), which emulates the operation in the vCPU's
thread.
As I mentioned before, the memory region used for the vNVDIMM
flush interface should be MMIO, and considering its support
on other hypervisors, we had better push this async
mechanism into the flush interface design itself rather
than depend on KVM async-page-fault.
[patch] btt: fix uninitialized err_lock
by Jeff Moyer
Hi,
When a sector mode namespace is initially created, the arena's err_lock
is not initialized. If, on the other hand, the namespace already
exists, the mutex is initialized. To fix the issue, I moved the mutex
initialization into the arena_alloc, which is called by both
discover_arenas and create_arenas.
This was discovered on an older kernel where mutex_trylock checks the
count to determine whether the lock is held. Because the data structure
is kzalloc-d, that count was 0 (held), and I/O to the device would hang
forever waiting for the lock to be released (see btt_write_pg, for
example). Current kernels have a different mutex implementation that
checks for a non-null owner, and so this doesn't show up as a problem.
If that lock were ever contended, it might cause issues, but you'd have
to be really unlucky, I think.
Signed-off-by: Jeff Moyer <jmoyer(a)redhat.com>
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index e949e33..5860f99 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -630,6 +630,7 @@ static struct arena_info *alloc_arena(struct btt *btt, size_t size,
return NULL;
arena->nd_btt = btt->nd_btt;
arena->sector_size = btt->sector_size;
+ mutex_init(&arena->err_lock);
if (!size)
return arena;
@@ -758,7 +759,6 @@ static int discover_arenas(struct btt *btt)
arena->external_lba_start = cur_nlba;
parse_arena_meta(arena, super, cur_off);
- mutex_init(&arena->err_lock);
ret = btt_freelist_init(arena);
if (ret)
goto out;