On Thu, Feb 04, 2016 at 10:15:58AM +0100, Jan Kara wrote:
Yes, you are right. filemap_write_and_wait_range() actually
guarantee data durability. That function only means all dirty data has been
sent to storage and the storage has acknowledged them. This is noop for
PMEM. So we are perfectly fine ignoring calls to
filemap_write_and_wait_range(). What guarantees data durability are only
->fsync() and ->sync_fs() calls. But some code could get upset by seeing
that filemap_write_and_wait_range() didn't actually get rid of dirty pages
(in some special cases like inode eviction or similar). That's why I'd
choose one of the two options for consistency:
1) Treat inode indexes to flush as close to dirty pages as we can - this
means inode is dirty with all the tracking associated with it, radix tree
entries have dirty tag, we get rid of these in ->writepages(). We are close
to this currently.
I think we're actually pretty far from this, at least for v4.5. The issue is
that I don't think we can safely clear radix tree dirty entries during the DAX
code that is called via ->writepages(). To do this correctly we would need to
also mark the PTE as clean so that when the userspace process next writes to
their mmap mapping we would get a new fault to make the page writable. This
would allow us to re-dirty the DAX entry in the radix tree.
I implemented code to do this in v2 of my set, but ripped it out in v3:
(DAX fsync v2)
(DAX fsync v3)
The race that compelled this removal is described here:
(sorry for all the links)
Anyway, for v4.5 I think whatever solution we come up with must be okay with
an ever growing list of dirty radix tree entries, as we currently have. Are
you aware of a reason why this won't work, or was the cleaning of the radix
tree entries just a good goal to have? (And I agree it is a good goal, I just
don't know how to do it safely.)
2) Completely avoid the dirty tracking and writeback code and
everything in DAX code.
Because some hybrid between these is IMHO bound to provoke weird (and very
hard to find) bugs.
> > So revisiting the decision I see two options:
> > 1) Move the DAX flushing code from filemap_write_and_wait() into
> > ->writepages() fs callback. There the filesystem can provide all the
> > information it needs including bdev, get_block callback, or whatever.
> This seems fine as long as we add it to ->fsync as well since ->writepages is
> never called in that path, and as long as we are okay with skipping DAX
> writebacks on hole punch, truncate, and block relocation.
Look at ext4_sync_file() -> filemap_write_and_wait_range() ->
__filemap_fdatawrite_range() -> do_writepages(). Except those nrpages > 0
checks which would need to be changed.
Ah, cool, I missed this path. Thank you for setting me straight.