On Sun, Feb 07, 2016 at 10:15:12AM +1100, Dave Chinner wrote:
On Thu, Feb 04, 2016 at 10:15:58AM +0100, Jan Kara wrote:
> On Wed 03-02-16 13:13:28, Ross Zwisler wrote:
> > Here is the comment from Dave Chinner that had me move to having the calls to
> > dax_writeback_mapping_range() into the generic mm code:
> > > Lastly, this flushing really needs to be inside
> > > filemap_write_and_wait_range(), because we call the writeback code
> > > from many more places than just fsync to ensure ordering of various
> > > operations such that files are in known state before proceeding
> > > (e.g. hole punch).
> > https://lkml.org/lkml/2015/11/16/847
> > > So revisiting the decision I see two options:
> > >
> > > 1) Move the DAX flushing code from filemap_write_and_wait() into
> > > ->writepages() fs callback. There the filesystem can provide all the
> > > information it needs including bdev, get_block callback, or whatever.
> > This seems fine as long as we add it to ->fsync as well since
> > never called in that path, and as long as we are okay with skipping DAX
> > writebacks on hole punch, truncate, and block relocation.
> Look at ext4_sync_file() -> filemap_write_and_wait_range() ->
> __filemap_fdatawrite_range() -> do_writepages(). Except those nrpages > 0
> checks which would need to be changed.
Just to be clear: this is pretty much what I was implying was
necessary when I said that the DAX flushing needed to be "inside
filemap_write_and_wait_range". And that's what I thought Ross was
planning on doing after that round discussion. i.e. Ross said:
"If the race described above isn't an issue then I agree moving this
call out of the filesystems and down into the generic page writeback
code is probably the right thing to do."
Realistically, what Jan is saying in this thread is exactly what I
said we needed to do way back when I first pointed out that fsync
was broken and dirty tracking in the mapping radix tree was still
needed for fsync to work effectively.
Here, let me try and quickly summarize what is going on.
1) The DAX fsync set was merged into v4.5-rc1, it does use the radix tree for
tracking dirty PTE and PMD pages, and we do currently call into the DAX sync
code via filemap_write_and_wait_range() as you initially suggested.
2) During testing of raw block devices + DAX I noticed that the struct
block_device that we were using for DAX operations was incorrect. For the
fault handlers, etc. we can just get the correct bdev via get_block(), which
is passed in as a function pointer, but for the flushing code we don't have
access to get_block(). This is also an issue for XFS real-time devices,
whenever we get those working.
In short, somehow we need to get dax_writeback_mapping_range() a valid bdev.
Right now it is called via filemap_write_and_wait_range(), which can't provide
either the bdev nor a get_block() function pointer. So, our options seem to
a) Move the calls to dax_writeback_mapping_range() into the filesystems
(what Jan is suggesting, i.e. ->writepages())
b) Keep the calls to dax_writeback_mapping_range() in the mm code, and
provide a generic way to ask a filesystem for an inode's bdev. I did a
version of this using a superblock operation here:
3) During the review and discussion for the above problems, Jan noticed that
the flushing code wasn't being called for sync() and syncfs(). Clearly from
your other response (https://lkml.org/lkml/2016/2/6/168
) you think this is
incorrect. Regardless, the above issue remains -
dax_writeback_mapping_range() needs a bdev. Do we move the calls into the
filesystem so the fs can provide a bdev, or do we we create a generic method
for DAX to ask the fs for the correct bdev for an inode?