On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote:
On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
> For DAX msync we just need to flush the given range using
> wb_cache_pmem(), which is now a public part of the PMEM API.
This is wrong, because it still leaves fsync() broken on dax.
Flushing dirty data to stable storage is the responsibility of the
writeback infrastructure, not the VMA/mm infrasrtucture.
Writeback infrastructure is non-existent for DAX. Without struct page we
don't have anything to transfer pte_ditry() to. And I'm not sure we need
to invent some. For DAX flushing in-place can be cheaper than dirty
tracking beyond page tables.
For non-dax configurations, msync defers all that to
vfs_fsync_range(),
because it has to be implemented there for fsync() to work.
Not necessary. I think fsync() for DAX can be implemented with rmap over
all file's VMA and msync() them with commiting metadata afterwards.
But we also need to commit to persistent on zap_page_range() to make it
work.
Even for DAX, msync has to call vfs_fsync_range() for the filesystem
to commit
the backing store allocations to stable storage, so there's not
getting around the fact msync is the wrong place to be flushing
DAX mappings to persistent storage.
Why?
IIUC, msync() doesn't have any requirements wrt metadata, right?
I pointed this out almost 6 months ago (i.e. that fsync was broken)
anf hinted at how to solve it. Fix fsync, and msync gets fixed for
free:
https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html
I've also reported to Willy that DAX write page faults don't work
correctly, either. xfstests generic/080 exposes this: a read
from a page followed immediately by a write to that page does not
result in ->page_mkwrite being called on the write and so
backing store is not allocated for the page, nor are the timestamps
for the file updated. This will also result in fsync (and msync)
not working properly.
Is that because XFS doesn't provide vm_ops->pfn_mkwrite?
--
Kirill A. Shutemov