On Thu 30-05-19 08:14:45, Dave Chinner wrote:
On Wed, May 29, 2019 at 03:46:29PM +0200, Jan Kara wrote:
> On Wed 29-05-19 14:46:58, Dave Chinner wrote:
> > iomap_apply()
> > ->iomap_begin()
> > map old data extent that we copy from
> > allocate new data extent we copy to in data fork,
> > immediately replacing old data extent
> > return transaction handle as private data
This holds the inode block map locked exclusively across the IO,
Does it? We do hold XFS_IOLOCK_EXCL during the whole dax write. But
xfs_file_iomap_begin() does release XFS_ILOCK_* on exit AFAICS. So I don't
see anything that would prevent page fault from mapping blocks into page
tables just after xfs_file_iomap_begin() returns.
> > dax_iomap_actor()
> > copies data from old extent to new extent
> > ->iomap_end
> > commits transaction now data has been copied, making
> > the COW operation atomic with the data copy.
> > This, in fact, should be how we do all DAX writes that require
> > allocation, because then we get rid of the need to zero newly
> > allocated or unwritten extents before we copy the data into it. i.e.
> > we only need to write once to newly allocated storage rather than
> > twice.
> You need to be careful though. You need to synchronize with page faults so
> that they cannot see and expose in page tables blocks you've allocated
> before their contents is filled.
... so the page fault will block trying to map the blocks because
it can't get the xfs_inode->i_ilock until the allocation transaciton
> This race was actually the strongest
> motivation for pre-zeroing of blocks. OTOH copy_from_iter() in
> dax_iomap_actor() needs to be able to fault pages to copy from (and these
> pages may be from the same file you're writing to) so you cannot just block
> faulting for the file through I_MMAP_LOCK.
Right, it doesn't take the I_MMAP_LOCK, but it would block further
in. And, really, I'm not caring all this much about this corner
case. i.e. anyone using a "mmap()+write() zero copy" pattern on DAX
within a file is unbeleivably naive - the data still gets copied by
the CPU in the write() call. It's far simpler and more effcient to
just mmap() both ranges of the file(s) and memcpy() in userspace....
FWIW, it's to avoid problems with stupid userspace stuff that nobody
really should be doing that I want range locks for the XFS inode
locks. If userspace overlaps the ranges and deadlocks in that case,
they they get to keep all the broken bits because, IMO, they are
doing something monumentally stupid. I'd probably be making it
return EDEADLOCK back out to userspace in the case rather than
deadlocking but, fundamentally, I think it's broken behaviour that
we should be rejecting with an error rather than adding complexity
trying to handle it.
I agree with this. We must just prevent user from taking the kernel down
with maliciously created IOs...
Jan Kara <jack(a)suse.com>
SUSE Labs, CR