On Thu 04-02-16 12:56:19, Ross Zwisler wrote:
On Wed, Feb 03, 2016 at 11:46:11AM +0100, Jan Kara wrote:
> On Tue 02-02-16 10:34:56, Ross Zwisler wrote:
> > On Tue, Feb 02, 2016 at 09:10:24AM -0800, Dan Williams wrote:
> > > On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara <jack(a)suse.cz> wrote:
> > > > On Tue 02-02-16 08:33:56, Dan Williams wrote:
> > > >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack(a)suse.cz>
> > > >> [..]
> > > >> > I see, thanks for explanation. So I'm OK with changing
what is stored in
> > > >> > the radix tree to accommodate this use case but my
reservation that we IHMO
> > > >> > have other more pressing things to fix remains...
> > > >>
> > > >> We don't need pfns in the radix to support XFS RT
> > > >> Just call get_blocks() again and use the sector, or am I missing
> > > >> something?
> > > >
> > > > You are correct. But if you decide to pay the cost of additional
> > > > get_block() call, you only need the dirty tag in the radix tree and
> > > > else. So my understanding was that the whole point of games with
> > > > is avoiding this extra get_block() calls for fsync().
> > > >
> > >
> > > DAX-fsync() is already a potentially expensive operation to cover data
> > > durability guarantees for DAX-unaware applications. A DAX-aware
> > > application is going to skip fsync, and the get_blocks() cost, to do
> > > cache management itself.
> > >
> > > Willy pointed out some other potential benefits, assuming a suitable
> > > replacement for the protections afforded by the block-device
> > > percpu_ref counter can be found. However, optimizing for the
> > > DAX-unaware-application case seems the wrong motivation to me.
> > Oh, no, the primary issue with calling get_block() in the fsync path isn't
> > performance. It's that we don't have any idea what get_block()
> > call.
> > The fault handler calls all come from the filesystem directly, so they can
> > easily give us an appropriate get_block() function pointer. But the
> > dax_writeback_mapping_range() calls come from the generic code in
> > mm/filemap.c, and don't know what get_block() to pass in.
> > During one iteration I had the calls to dax_writeback_mapping_range()
> > happening in the filesystem fsync code so that it could pass in get_block(),
> > but Dave Chinner pointed out that this misses other paths in the filesystem
> > that need to have things flushed via a call to filemap_write_and_wait_range().
> Let's clear this up a bit: The problem with using ->fsync() method is that
> it doesn't get called for sync(2). We could use ->sync_fs() to flush caches
> in case of sync(2) (that's what's happening for normal storage) but the
> problem with PMEM is that "flush all cached data" operation effectively
> means iterate through all modified pages and we didn't want to implement
> this for DAX fsync code.
> So we have decided to do cache flushing for DAX at a different point - mark
> inodes which may have writes cached as dirty and use writeback code for the
> cache flushing. But looking at it now, we have actually chosen a wrong
> place to do the flushing in the writeback path - note that sync(2) writes
> data via __writeback_single_inode() -> do_writepages() and thus doesn't
> even get to filemap_write_and_wait().
> So revisiting the decision I see two options:
> 1) Move the DAX flushing code from filemap_write_and_wait() into
> ->writepages() fs callback. There the filesystem can provide all the
> information it needs including bdev, get_block callback, or whatever.
> 2) Back out even further and implement own tracking and iteration of inodes
> to write.
> So far I still think 2) is not worth the complexity (although it would
> bring DAX code closer to how things behave with standard storage) so I
> would go for 1).
Jan, just to clarify, are you proposing this change for v4.5 in the remaining
RCs as an alternative to the get_bdev() patch?
Yes, because I don't think anything like ->get_bdev() is needed at all.
Look: dax_do_io(), __dax_fault(), __dax_pmd_fault(), dax_zero_page_range()
don't really need bdev - we have agreed that get_block() fills that in just
dax_clear_blocks() has IMO just the wrong signature - it should take bdev
and not inode as an argument. Because combination inode + bdev sector
doesn't really make much sense.
dax_writeback_mapping_range() is the only remaining offender and it can
easily take bdev as an argument when called from ->writepages().
Or can we move forward with get_bdev(), and try and figure out this
new way of
calling fsync/msync for v4.6? My main concern here is that changing how the
DAX sync code gets called will affect all three filesystems as well as MM, and
that it might be too much for RC inclusion...
I think changes aren't very intrusive so we can feed them in during RC
phase and frankly, you have to move to using ->writepages() anyway to make
sync(2) work reliably.
Jan Kara <jack(a)suse.com>
SUSE Labs, CR