[Linux-nvdimm] another pmem variant
by Christoph Hellwig
Here is another version of the same trivial pmem driver, because two
obviously aren't enough. The first patch is the same pmem driver
that Ross posted a short time ago, just modified to use platform_devices
to find the persistent memory region instead of hardcoding it in the
Kconfig. This keeps pmem.c separate from any discovery mechanism,
but still allows auto-discovery.
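As a minimal sketch (my illustration, not the posted code) of what the discovery split looks like on the driver side: pmem.c only registers a platform_driver and reads the memory region out of the platform_device's resources, so whatever discovers the region (e820 parsing, ACPI, a test stub) just registers a device:

```c
/* Sketch only: names and structure are assumptions, not the posted patch. */
#include <linux/module.h>
#include <linux/platform_device.h>
#include <linux/ioport.h>

static int pmem_probe(struct platform_device *pdev)
{
	struct resource *res;

	/* The discovery code hands us the region as a MEM resource. */
	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
	if (!res)
		return -ENXIO;

	/* ... claim and map [res->start, res->end], register the disk ... */
	return 0;
}

static struct platform_driver pmem_driver = {
	.probe	= pmem_probe,
	.driver	= { .name = "pmem" },
};
module_platform_driver(pmem_driver);
```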
The other two patches are a heavily rewritten version of the code that
Intel gave to various storage vendors to discover the type 12 (and earlier
type 6) nvdimms, which I massaged into a form that is hopefully suitable
for mainline.
Note that pmem.c really is the minimal version, as I think we need something
included ASAP. We'll eventually need to be able to do other I/O from and
to it, and as most people know everyone has their own preferred method to
do it, which I'd like to discuss once we have the basic driver in.
This has been tested both with a real NVDIMM on a system with a type 12-capable
BIOS, and with "fake persistent" memory using the memmap=
option.
5 years, 11 months
[Linux-nvdimm] another pmem variant V2
by Christoph Hellwig
Here is another version of the same trivial pmem driver, because two
obviously aren't enough. The first patch is the same pmem driver
that Ross posted a short time ago, just modified to use platform_devices
to find the persistent memory region instead of hardcoding it in the
Kconfig. This keeps pmem.c separate from any discovery mechanism,
but still allows auto-discovery.
The other two patches are a heavily rewritten version of the code that
Intel gave to various storage vendors to discover the type 12 (and earlier
type 6) nvdimms, which I massaged into a form that is hopefully suitable
for mainline.
Note that pmem.c really is the minimal version, as I think we need something
included ASAP. We'll eventually need to be able to do other I/O from and
to it, and as most people know everyone has their own preferred method to
do it, which I'd like to discuss once we have the basic driver in.
This has been tested both with a real NVDIMM on a system with a type 12-capable
BIOS, and with "fake persistent" memory using the memmap=
option.
Changes since V1:
- s/E820_PROTECTED_KERN/E820_PMEM/g
- map the persistent memory as uncached
- better kernel parameter description
- various typo fixes
- MODULE_LICENSE fix
6 years
[Linux-nvdimm] [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips
by Boaz Harrosh
Hi
[v5]
* [PATCH 2/3] Added the add_taint(TAINT_FIRMWARE_WORKAROUND,...)
and changed the printed message as requested
* Use IORESOURCE_MEM_WARN bit from the mem specific bit range
(not 64bit only anymore, only works with memory resources)
* Fix user visible typo reserved-unkown => reserved-unknown &&
unkown-12 => unknown-12
* Select [PATCH 3A/3] (over [PATCH 3B/3])
* ...
* Also posting RFC of pmem as reference
[v2]
* Added warning at bring up about unknown type
* Added an extra patch to warn-print in request_resource
* changed name from NvDIMM-12 => unknown-12
I wish we would reconsider this. Must we really suffer until some unknown
future when ACPI decides to reuse type-12? If that ever happens, we can fix
it then, no?
* Now based on 4.0-rc1
[v1]
There is a deficiency in the current e820.c handling where unknown new
memory-chip types come up as a BUSY resource when some other driver (like
pmem) tries to call request_mem_region_exclusive() on that resource, even
though nothing is actually using it.
From inspecting the code and the history of e820.c it looks like a BUG.
Either way, this is a problem for the new type-12 NvDIMM memory chips that
are circulating around. (It is estimated that hundreds of thousands of
NvDIMM chips are already in active use.)
The patches below first fix the above problem for any future memory type,
so external drivers can access these mem chips.
I then also add the NvDIMM type-12 memory constant so it comes up
nicely in dprints and in /proc/iomem.
Just as before all these chips are very much usable with the pmem
driver. This lets us remove the hack for type-12 NvDIMMs that ignores
the return code from request_mem_region_exclusive() in pmem.c.
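To illustrate the failure mode (a sketch of my own, not code from the series): a pmem-style driver claims its region roughly like this, and before the fix the request fails against the unknown-type e820 range even though nothing owns it:

```c
/* Sketch only: illustrates the BUSY failure, not the posted pmem code. */
#include <linux/ioport.h>
#include <linux/io.h>

static void __iomem *pmem_map_region(phys_addr_t base, size_t size)
{
	/* Pre-fix, an unknown e820 type is already marked busy,
	 * so this fails even with no real user of the range. */
	if (!request_mem_region_exclusive(base, size, "pmem"))
		return NULL;

	return ioremap(base, size);
}
```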
For all the pmem people. I maintain a tree with these patches
and latest pmem code here:
git://git.open-osd.org/pmem.git (pmem branch)
[web-view:http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=summary]
List of patches:
[PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY
The main fix
[PATCH 2/3] resource: Add new flag IORESOURCE_MEM_WARN
Warn in request_resource
[PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
Also submitted as reference is an RFC of the pmem driver that demonstrates
the use of the add_resource API for the NvDIMM chips. This can be seen in
pmem-patch-1. Also please see pmem-patch-8, an out-of-tree patch that
ignores the add_resource failure so pmem can work with NvDIMMs on old
kernels.
Thanks
Boaz
6 years
[Linux-nvdimm] [PATCH 0/3 v3] dax: Fix mmap-write not updating c/mtime
by Boaz Harrosh
Hi
[v3]
* I'm re-posting the two DAX patches that fix the mmap-write after read
problem with DAX. (No changes since [v2])
* I'm also posting a 3rd RFC patch to address what Jan said about fs_freeze
and making mapping read-only.
Jan Please review and see if this is what you meant.
[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.
[v1]
The main problem is that the current mm/memory.c will not call us with
page_mkwrite if we do not have an actual page mapping, which is what DAX uses.
The solution presented here introduces a new pfn_mkwrite to solve this problem.
Please see patch-2 for details.
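For reference, a rough sketch (the handler body and names are my assumptions, not the posted patch) of how a pfn_mkwrite hook slots into a DAX-style vm_operations_struct, covering both the c/mtime update and the freeze protection discussed above:

```c
/* Sketch only: an assumed shape of the new hook, not the actual patch. */
#include <linux/mm.h>
#include <linux/fs.h>

static int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vma->vm_file);

	sb_start_pagefault(inode->i_sb);	/* wait if the fs is frozen */
	file_update_time(vma->vm_file);		/* dirty c/mtime on mmap write */
	sb_end_pagefault(inode->i_sb);
	return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct dax_vm_ops = {
	/* .fault and friends as before ... */
	.pfn_mkwrite = dax_pfn_mkwrite,	/* new hook for VM_PFNMAP mappings */
};
```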
I've been running with this patch for 4 months on both HW and VMs with no
apparent danger, but see patch-1: I played it safe.
I am also posting an xfstest (080) that demonstrates this problem; I believe
some git operations (can't remember which) also suffer from it.
Actually, Eryu Guan found that this test fails on some other FSes as well.
List of patches:
[PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
[PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze
[PATCH 3/3] RFC: dax: dax_prepare_freeze
[PATCH v4] xfstest: generic/080 test that mmap-write updates c/mtime
Could some mm person please review the first patch?
Andrew hi
I believe this needs to eventually go through your tree. Please pick it
up when you feel it is ready. I believe the first 2 are ready and fix real
bugs.
Matthew hi
I would love to have your ACK on these patches.
Thanks
Boaz
6 years
Re: [Linux-nvdimm] another pmem variant V2
by Christoph Hellwig
And here's the patch, sorry:
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 4bd525a..e7bf89e 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -346,7 +346,7 @@ int __init sanitize_e820_map(struct e820entry *biosmap, int max_nr_map,
* continue building up new bios map based on this
* information
*/
- if (current_type != last_type) {
+ if (current_type != last_type || current_type == E820_PRAM) {
if (last_type != 0) {
new_bios[new_bios_entry].size =
change_point[chgidx]->addr - last_addr;
6 years
[Linux-nvdimm] [PATCH 0/3 v4] dax: some dax fixes and cleanups
by Boaz Harrosh
Hi
[v4] dax: some dax fixes and cleanups
* First patch fixed according to Andrew's comments. Thanks Andrew.
1st and 2nd patch can go into the current kernel as they fix something
that was merged this release.
* Added a new patch to fix up splice in the dax case, and cleanup.
This one can wait for 4.1. (So can the first two; not that anyone uses dax
in production.)
* DAX freeze is not fixed yet, as we have more problems than I originally
hoped for, as pointed out by Dave.
(Just as a reference I'm sending a NO-GOOD additional patch to show what
is not good enough to do. It was the RFC of [v3].)
* Not re-posting the xfstest; Dave, please pick this up. (It already found
bugs in non-dax FSes.)
[v3] dax: Fix mmap-write not updating c/mtime
* I'm re-posting the two DAX patches that fix the mmap-write after read
problem with DAX. (No changes since [v2])
* I'm also posting a 3rd RFC patch to address what Jan said about fs_freeze
and making mapping read-only.
Jan Please review and see if this is what you meant.
[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.
[v1]
The main problem is that the current mm/memory.c will not call us with
page_mkwrite if we do not have an actual page mapping, which is what DAX uses.
The solution presented here introduces a new pfn_mkwrite to solve this problem.
Please see patch-2 for details.
I've been running with this patch for 4 months on both HW and VMs with no
apparent danger, but see patch-1: I played it safe.
I am also posting an xfstest (080) that demonstrates this problem; I believe
some git operations (can't remember which) also suffer from it.
Actually, Eryu Guan found that this test fails on some other FSes as well.
List of patches:
[PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP
[PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze
[PATCH 3/3] dax: Unify ext2/4_{dax,}_file_operations
[PATCH] NOTGOOD: dax: dax_prepare_freeze
Andrew hi
I believe this needs to eventually go through your tree. Please pick it
up when you feel it is ready. I believe all 3 are ready and fix real
bugs.
Matthew hi
I would love to have your ACK on these patches.
Thanks
Boaz
6 years
[Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
by Dan Williams
Avoid the impending disaster of requiring struct page coverage for what
is expected to be ever-increasing capacities of persistent memory. In
conversations with Rik van Riel, Mel Gorman, and Jens Axboe at the
recently concluded Linux Storage Summit it became clear that struct page
is not required in many places; it was simply convenient to re-use.
Introduce helpers and infrastructure to remove struct page usage where
it is not necessary. One use case for these changes is to implement a
write-back-cache in persistent memory for software-RAID. Another use
case for the scatterlist changes is RDMA to a pfn-range.
This compiles and boots, but 0day-kbuild-robot coverage is needed before
this set exits "RFC". Obviously, the coccinelle script needs to be
re-run on the block updates for kernel.next. As is, this only includes
the resulting auto-generated-patch against 4.0-rc3.
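To make the first two patches concrete, a sketch of the accessor-helper idea (the helper names here are my assumptions about the series, not quoted from it): callers stop dereferencing bv_page directly and go through helpers, so the field can later become a pfn that may or may not have a struct page behind it:

```c
/* Sketch only: assumed shape of the bio_vec accessor helpers. */
#include <linux/blk_types.h>
#include <linux/mm.h>

static inline struct page *bvec_page(const struct bio_vec *bvec)
{
	/* Today: the stored page. After the conversion this would be
	 * something like pfn_to_page(bvec->bv_pfn), valid only when a
	 * struct page actually exists for the pfn. */
	return bvec->bv_page;
}

static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
{
	/* After the conversion: bvec->bv_pfn = page_to_pfn(page). */
	bvec->bv_page = page;
}
```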
---
Dan Williams (6):
block: add helpers for accessing a bio_vec page
block: convert bio_vec.bv_page to bv_pfn
dma-mapping: allow archs to optionally specify a ->map_pfn() operation
scatterlist: use sg_phys()
x86: support dma_map_pfn()
block: base support for pfn i/o
Matthew Wilcox (1):
scatterlist: support "page-less" (__pfn_t only) entries
arch/Kconfig | 3 +
arch/arm/mm/dma-mapping.c | 2 -
arch/microblaze/kernel/dma.c | 2 -
arch/powerpc/sysdev/axonram.c | 2 -
arch/x86/Kconfig | 12 +++
arch/x86/kernel/amd_gart_64.c | 22 ++++--
arch/x86/kernel/pci-nommu.c | 22 ++++--
arch/x86/kernel/pci-swiotlb.c | 4 +
arch/x86/pci/sta2x11-fixup.c | 4 +
arch/x86/xen/pci-swiotlb-xen.c | 4 +
block/bio-integrity.c | 8 +-
block/bio.c | 83 +++++++++++++++------
block/blk-core.c | 9 ++
block/blk-integrity.c | 7 +-
block/blk-lib.c | 2 -
block/blk-merge.c | 15 ++--
block/bounce.c | 26 +++----
drivers/block/aoe/aoecmd.c | 8 +-
drivers/block/brd.c | 2 -
drivers/block/drbd/drbd_bitmap.c | 5 +
drivers/block/drbd/drbd_main.c | 4 +
drivers/block/drbd/drbd_receiver.c | 4 +
drivers/block/drbd/drbd_worker.c | 3 +
drivers/block/floppy.c | 6 +-
drivers/block/loop.c | 8 +-
drivers/block/nbd.c | 8 +-
drivers/block/nvme-core.c | 2 -
drivers/block/pktcdvd.c | 11 ++-
drivers/block/ps3disk.c | 2 -
drivers/block/ps3vram.c | 2 -
drivers/block/rbd.c | 2 -
drivers/block/rsxx/dma.c | 3 +
drivers/block/umem.c | 2 -
drivers/block/zram/zram_drv.c | 10 +--
drivers/dma/ste_dma40.c | 5 -
drivers/iommu/amd_iommu.c | 21 ++++-
drivers/iommu/intel-iommu.c | 26 +++++--
drivers/iommu/iommu.c | 2 -
drivers/md/bcache/btree.c | 4 +
drivers/md/bcache/debug.c | 6 +-
drivers/md/bcache/movinggc.c | 2 -
drivers/md/bcache/request.c | 6 +-
drivers/md/bcache/super.c | 10 +--
drivers/md/bcache/util.c | 5 +
drivers/md/bcache/writeback.c | 2 -
drivers/md/dm-crypt.c | 12 ++-
drivers/md/dm-io.c | 2 -
drivers/md/dm-verity.c | 2 -
drivers/md/raid1.c | 50 +++++++------
drivers/md/raid10.c | 38 +++++-----
drivers/md/raid5.c | 6 +-
drivers/mmc/card/queue.c | 4 +
drivers/s390/block/dasd_diag.c | 2 -
drivers/s390/block/dasd_eckd.c | 14 ++--
drivers/s390/block/dasd_fba.c | 6 +-
drivers/s390/block/dcssblk.c | 2 -
drivers/s390/block/scm_blk.c | 2 -
drivers/s390/block/scm_blk_cluster.c | 2 -
drivers/s390/block/xpram.c | 2 -
drivers/scsi/mpt2sas/mpt2sas_transport.c | 6 +-
drivers/scsi/mpt3sas/mpt3sas_transport.c | 6 +-
drivers/scsi/sd_dif.c | 4 +
drivers/staging/android/ion/ion_chunk_heap.c | 4 +
drivers/staging/lustre/lustre/llite/lloop.c | 2 -
drivers/xen/biomerge.c | 4 +
drivers/xen/swiotlb-xen.c | 29 +++++--
fs/btrfs/check-integrity.c | 6 +-
fs/btrfs/compression.c | 12 ++-
fs/btrfs/disk-io.c | 4 +
fs/btrfs/extent_io.c | 8 +-
fs/btrfs/file-item.c | 8 +-
fs/btrfs/inode.c | 18 +++--
fs/btrfs/raid56.c | 4 +
fs/btrfs/volumes.c | 2 -
fs/buffer.c | 4 +
fs/direct-io.c | 2 -
fs/exofs/ore.c | 4 +
fs/exofs/ore_raid.c | 2 -
fs/ext4/page-io.c | 2 -
fs/f2fs/data.c | 4 +
fs/f2fs/segment.c | 2 -
fs/gfs2/lops.c | 4 +
fs/jfs/jfs_logmgr.c | 4 +
fs/logfs/dev_bdev.c | 10 +--
fs/mpage.c | 2 -
fs/splice.c | 2 -
include/asm-generic/dma-mapping-common.h | 30 ++++++++
include/asm-generic/memory_model.h | 4 +
include/asm-generic/scatterlist.h | 6 ++
include/crypto/scatterwalk.h | 10 +++
include/linux/bio.h | 24 +++---
include/linux/blk_types.h | 21 +++++
include/linux/blkdev.h | 2 +
include/linux/dma-debug.h | 23 +++++-
include/linux/dma-mapping.h | 8 ++
include/linux/scatterlist.h | 101 ++++++++++++++++++++++++--
include/linux/swiotlb.h | 5 +
kernel/power/block_io.c | 2 -
lib/dma-debug.c | 4 +
lib/swiotlb.c | 20 ++++-
mm/iov_iter.c | 22 +++---
mm/page_io.c | 8 +-
net/ceph/messenger.c | 2 -
103 files changed, 658 insertions(+), 335 deletions(-)
6 years
[Linux-nvdimm] REQ: display how pmem is configured when loading
by Roger C. Pao
Currently, when I load Ross' prd like this:
sudo modprobe pmem pmem_start_gb=6 pmem_size_gb=2 pmem_count=1
dmesg output is:
pmem: module loaded
I would really like it to display:
pmem: /dev/pmem0 at 6GiB for 2GiB
or something similar detailing how each /dev/pmem# is configured.
This is especially needed for Boaz's pmem, as he uses the kernel memmap=#$#
syntax. The $ requires escaping in shell scripts and is easy to get wrong.
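As an illustration of the quoting pitfall (the region numbers here are made up, memmap=<size>$<start> style): in a shell script the $ must be protected, or "$6" is expanded as a positional parameter and the kernel sees a mangled option.

```shell
# Hypothetical region: 2 GiB reserved at the 6 GiB mark.
wrong="memmap=2G$6G"          # "$6" expands (usually to nothing) -> memmap=2GG
right='memmap=2G$6G'          # single quotes keep the $ literal
also_right="memmap=2G\$6G"    # or backslash-escape it in double quotes
echo "$wrong"
echo "$right"
echo "$also_right"
```

With no sixth positional parameter set, the first form silently loses the start address, which is exactly the "easy to get wrong" case.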
Thank you for your consideration,
rcpao
6 years
[Linux-nvdimm] [PATCH 0/6] Add persistent memory driver
by Ross Zwisler
PMEM is a modified version of the Block RAM Driver, BRD. The major difference
is that BRD allocates its backing store pages from the page cache, whereas
PMEM uses reserved memory that has been ioremapped.
One benefit of this approach is that there is a direct mapping between
filesystem block numbers and virtual addresses. In PMEM, filesystem blocks N,
N+1, N+2, etc. will all be adjacent in the virtual memory space. This property
allows us to set up PMD mappings (2 MiB) for DAX.
This patch set builds upon the work that Matthew Wilcox has been doing for
DAX, which has been merged into the v4.0 kernel series.
For more information on PMEM and for some instructions on how to use it, please
check out PMEM's github tree:
https://github.com/01org/prd
Cc: linux-nvdimm(a)lists.01.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: axboe(a)kernel.dk
Cc: hch(a)infradead.org
Cc: riel(a)redhat.com
Boaz Harrosh (1):
pmem: Let each device manage private memory region
Ross Zwisler (5):
pmem: Initial version of persistent memory driver
pmem: Add support for getgeo()
pmem: Add support for rw_page()
pmem: Add support for direct_access()
pmem: Clean up includes
MAINTAINERS | 6 +
drivers/block/Kconfig | 41 +++++
drivers/block/Makefile | 1 +
drivers/block/pmem.c | 401 +++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 449 insertions(+)
create mode 100644 drivers/block/pmem.c
--
1.9.3
6 years
[Linux-nvdimm] [PATCH] brd: Ensure that bio_vecs have size <= PAGE_SIZE
by Ross Zwisler
The functions copy_from_brd() and copy_to_brd() are written with an
assumption that the bio_vec they are given has size <= PAGE_SIZE. This
assumption is not enforced in any way, and if the bio_vec has size
larger than PAGE_SIZE data will just be lost.
Such a situation can occur with I/Os generated from in-kernel sources,
or with coalesced bio_vecs. This bug was originally reported against
the pmem driver, where it was found using the Enmotus tiering engine.
Instead we should have brd explicitly tell the block layer that it can
handle data segments of at most PAGE_SIZE.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reported-by: Hugh Daschbach <hugh.daschbach(a)enmotus.com>
Cc: Roger C. Pao (Enmotus) <rcpao.enmotus(a)gmail.com>
Cc: Boaz Harrosh <boaz(a)plexistor.com>
Cc: linux-nvdimm(a)lists.01.org
Cc: Nick Piggin <npiggin(a)kernel.dk>
---
drivers/block/brd.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 898b4f256782..7e4873361b64 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -490,6 +490,7 @@ static struct brd_device *brd_alloc(int i)
blk_queue_make_request(brd->brd_queue, brd_make_request);
blk_queue_max_hw_sectors(brd->brd_queue, 1024);
blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
+ blk_queue_max_segment_size(brd->brd_queue, PAGE_SIZE);
brd->brd_queue->limits.discard_granularity = PAGE_SIZE;
brd->brd_queue->limits.max_discard_sectors = UINT_MAX;
--
1.9.3
6 years, 1 month