This patchset adds support for Write-Through (WT) mapping on x86.
The study below shows that using WT mapping may be useful for
The patchset consists of the following changes.
- Patch 1/10 to 5/10 add ioremap_wt()
- Patch 6/10 adds pgprot_writethrough()
- Patch 7/10 to 8/10 add set_memory_wt()
- Patch 9/10 refactors the !pat_enabled paths
- Patch 10/10 changes the pmem driver to call ioremap_wt()
All new/modified interfaces have been tested.
- Changed to export the set_xxx_wt() interfaces with GPL.
- Changed is_new_memtype_allowed() to handle WT cases.
- Changed arch-specific io.h to define ioremap_wt().
- Changed the pmem driver to use ioremap_wt().
- Rebased to 4.1-rc3 and resolved minor conflicts.
- Rebased to 4.0-rc1 and resolved conflicts with 9d34cfdf4 in
- Rebased to 3.19-rc3 as Juergen's patchset for the PAT management
has been accepted.
- Dropped the patch moving [set|get]_page_memtype() to pat.c
since the tip branch already has this change.
- Fixed an issue when CONFIG_X86_PAT is not defined.
- Clarified the comment on why slot 7 is used. (Andy Lutomirski,
- Moved [set|get]_page_memtype() to pat.c. (Thomas Gleixner)
- Removed BUG() from set_page_memtype(). (Thomas Gleixner)
- Added set_memory_wt() by adding WT support for regular memory.
- Dropped the set_memory_wt() patch. (Andy Lutomirski)
- Refactored the !pat_enabled handling. (H. Peter Anvin,
- Added the picture of PTE encoding. (Konrad Rzeszutek Wilk)
- Changed WT to use slot 7 of the PAT MSR. (H. Peter Anvin,
- Changed to have conservative checks to exclude all Pentium 2, 3,
M, and 4 families. (Ingo Molnar, Henrique de Moraes Holschuh,
- Updated documentation to cover WT interfaces and usages.
(Andy Lutomirski, Yigal Korman)
Toshi Kani (10):
1/10 x86, mm, pat: Set WT to PA7 slot of PAT MSR
2/10 x86, mm, pat: Change reserve_memtype() for WT
3/10 x86, asm: Change is_new_memtype_allowed() for WT
4/10 x86, mm, asm-gen: Add ioremap_wt() for WT
5/10 arch/*/asm/io.h: Add ioremap_wt() to all architectures
6/10 x86, mm, pat: Add pgprot_writethrough() for WT
7/10 x86, mm, asm: Add WT support to set_page_memtype()
8/10 x86, mm: Add set_memory_wt() for WT
9/10 x86, mm, pat: Refactor !pat_enabled handling
10/10 drivers/block/pmem: Map NVDIMM with ioremap_wt()
Documentation/x86/pat.txt | 13 ++-
arch/arc/include/asm/io.h | 1 +
arch/arm/include/asm/io.h | 1 +
arch/arm64/include/asm/io.h | 1 +
arch/avr32/include/asm/io.h | 1 +
arch/frv/include/asm/io.h | 7 ++
arch/m32r/include/asm/io.h | 1 +
arch/m68k/include/asm/io_mm.h | 7 ++
arch/m68k/include/asm/io_no.h | 6 ++
arch/metag/include/asm/io.h | 3 +
arch/microblaze/include/asm/io.h | 1 +
arch/mn10300/include/asm/io.h | 1 +
arch/nios2/include/asm/io.h | 1 +
arch/s390/include/asm/io.h | 1 +
arch/sparc/include/asm/io_32.h | 1 +
arch/sparc/include/asm/io_64.h | 1 +
arch/tile/include/asm/io.h | 1 +
arch/x86/include/asm/cacheflush.h | 6 +-
arch/x86/include/asm/io.h | 2 +
arch/x86/include/asm/pgtable.h | 8 +-
arch/x86/include/asm/pgtable_types.h | 3 +
arch/x86/mm/init.c | 6 +-
arch/x86/mm/iomap_32.c | 12 +--
arch/x86/mm/ioremap.c | 26 ++++-
arch/x86/mm/pageattr.c | 61 +++++++++--
arch/x86/mm/pat.c | 194 ++++++++++++++++++++++++-----------
arch/xtensa/include/asm/io.h | 1 +
drivers/block/pmem.c | 4 +-
include/asm-generic/io.h | 9 ++
include/asm-generic/iomap.h | 4 +
include/asm-generic/pgtable.h | 4 +
31 files changed, 296 insertions(+), 92 deletions(-)
Here is another version of the same trivial pmem driver, because two
obviously aren't enough. The first patch is the same pmem driver
that Ross posted a short time ago, just modified to use platform_devices
to find the persistent memory region instead of hardcoding it in the
Kconfig. This keeps pmem.c separate from any discovery mechanism
while still allowing auto-discovery.
The other two patches are a heavily rewritten version of the code that
Intel gave to various storage vendors to discover the type 12 (and earlier
type 6) NVDIMMs, which I massaged into a form that is hopefully suitable
Note that pmem.c really is the minimal version, as I think we need something
included ASAP. We'll eventually need to be able to do other I/O from and
to it, and, as most people know, everyone has their own preferred method
for doing so; I'd like to discuss that once we have the basic driver in.
This has been tested both with a real NVDIMM on a system with a type 12
capable BIOS, and with "fake persistent" memory using the memmap= kernel
parameter.
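For reference, the "fake persistent" setup can be reproduced with the
memmap= kernel parameter, which can mark a region as e820 type 12; the
size and offset below are placeholders, so adjust them to a range that is
actually free on the test machine:

```
memmap=4G!8G    # reserve 4 GiB at offset 8 GiB as e820 type 12 (pmem)
```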
Can I test the Intel ND driver now? Is it working yet?
If so, how can I test and work with it?
It would be very helpful if you could give me some pointers.
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Daniel J Blueman
> Sent: Thursday, April 30, 2015 11:10 AM
> Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4
> On a 7TB, 1728-core NumaConnect system with 108 NUMA nodes, we're
> seeing stock 4.0 boot in 7136s. This drops to 2159s, a 70% reduction,
> with this patchset. Non-temporal PMD init drops this to 1045s.
> Nathan, what do you guys see with the non-temporal PMD patch? Do
> add an sfence at the end label if you manually patch.
>  https://lkml.org/lkml/2015/4/23/350
From that post:
> + decq %rcx
> + movnti %rax,(%rdi)
> + movnti %rax,8(%rdi)
> + movnti %rax,16(%rdi)
> + movnti %rax,24(%rdi)
> + movnti %rax,32(%rdi)
> + movnti %rax,40(%rdi)
> + movnti %rax,48(%rdi)
> + movnti %rax,56(%rdi)
> + leaq 64(%rdi),%rdi
> + jnz loop_64
There are some even more efficient non-temporal store instructions
available on x86, depending on the CPU features:
* movnti 8 byte
* movntdq %xmm 16 byte, SSE
* vmovntdq %ymm 32 byte, AVX
* vmovntdq %zmm 64 byte, AVX-512 (forthcoming)
The last will transfer a full cache line at a time.
For NVDIMMs, the nd pmem driver is also in need of memcpy functions that
use these non-temporal instructions, both for performance and reliability.
We also need to speed up __clear_page and copy_user_enhanced_string so
userspace accesses through the page cache can keep up.
https://lkml.org/lkml/2015/4/2/453 is one of the threads on that topic.
Some results I've gotten there under different cache attributes
(in terms of 4 KiB IOPS):
UC write iops=697872 (697.872 K)(0.697872 M)
WB write iops=9745800 (9745.8 K)(9.7458 M)
WC write iops=9801800 (9801.8 K)(9.8018 M)
WT write iops=9812400 (9812.4 K)(9.8124 M)

UC write iops=1274400 (1274.4 K)(1.2744 M)
WB write iops=10259000 (10259 K)(10.259 M)
WC write iops=10286000 (10286 K)(10.286 M)
WT write iops=10294000 (10294 K)(10.294 M)
Robert Elliott, HP Server Storage