On Fri, Jan 22, 2016 at 04:06:11PM -0700, Ross Zwisler wrote:
With the current DAX code the following race exists:
Process 1 Process 2
__dax_fault() - read file f, index 0
get_block() -> returns hole
__dax_fault() - write file f, index 0
get_block() -> allocates blocks
An analogous race exists between __dax_fault() loading a hole and
__dax_pmd_fault() allocating a PMD DAX page and trying to insert it, and
that race also ends in data corruption.
Ok, so why doesn't this problem exist for the normal page cache
insertion case with concurrent read vs write faults? It's because
the write fault first does a read fault and so always the write
fault always has a page in the radix tree for the get_block call
that allocates the extents, right?
And DAX has an optimisation in the page fault part where it skips
the read fault part of the write fault? And so essentially the DAX
write fault is missing the object (page lock of page in the radix
tree) that the non-DAX write fault uses to avoid this problem?
What happens if we get rid of that DAX write fault optimisation that
skips the initial read fault? The write fault will always run on a
mapping that has a hole loaded, right?, so the race between
dax_load_hole() and dax_insert_mapping() goes away, because nothing
will be calling dax_load_hole() once the write fault is allocating
One solution to this race was proposed by Jan Kara:
So we need some exclusion that makes sure pgoff->block mapping
information is uptodate at the moment we insert it into page tables. The
simplest reasonably fast thing I can see is:
When handling a read fault, things stay as is and filesystem protects the
fault with an equivalent of EXT4_I(inode)->i_mmap_sem held for reading.
When handling a write fault we first grab EXT4_I(inode)->i_mmap_sem for
reading and try a read fault. If __dax_fault() sees a hole returned from
get_blocks() during a write fault, it bails out. Filesystem grabs
EXT4_I(inode)->i_mmap_sem for writing and retries with different
get_blocks() callback which will allocate blocks. That way we get proper
exclusion for faults needing to allocate blocks.
This patch adds this logic to DAX, ext2, ext4 and XFS.
It's too ugly to live. It hacks around a special DAX optimisation in
the fault code by adding special case locking to the filesystems,
and adds a siginificant new locking constraint to the page fault
If the write fault first goes through the read fault path and loads
the hole, this race condition simply does not exist. I'd suggest
that we get rid of the DAX optimisation that skips read fault
processing on write fault so that this problem simply goes away.
Yes, it means write faults on holes will be a little slower (which,
quite frankly, I don't care at all about), but it means we don't
need to hack special cases into code that should not have to care
about various different types of page fault races. Correctness
first, speed later.
FWIW, this also means we can get rid of the hacks in the filesystem
code where we have to handle write faults through the ->fault
handler rather than the ->page_mkwrite handler.