On Tue, Feb 23, 2016 at 11:06:44PM +1100, Dave Chinner wrote:
On Tue, Feb 23, 2016 at 10:07:07AM +0000, Rudoff, Andy wrote:
> > [Hi Andy - care to properly line break after ~75 characters, that makes
> > reading the message a lot easier, thanks!]
> My bad.
> >> The instructions give you very fine-grain flushing control, but the
> >> downside is that the app must track what it changes at that fine
> >> granularity. Both models work, but there's a trade-off.
> > No, the cache flush model simply does not work without a lot of hard
> > work to enable it first.
> It's working well enough to pass tests that simulate crashes and
> various workload tests for the apps involved. And I agree there
> has been a lot of hard work behind it. I guess I'm not sure why you're
> saying it is impossible or not working.
> Let's take an example: an app uses fallocate() to create a DAX file,
> mmap() to map it, msync() to flush changes. The app follows POSIX
> meaning it doesn't expect file metadata to be flushed magically, etc.
> The app is tested carefully and it works correctly. Now the msync()
> call used to flush stores is replaced by flushing instructions.
> What's broken?
You haven't told the filesystem to flush any dirty metadata required
to access the user data to persistent storage. If the zeroing and
unwritten extent conversion that is run by the filesystem during
write faults into preallocated blocks isn't persistent, then after a
crash the file will read back as unwritten extents, returning zeros
rather than the data that was written.
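A toy model may make that failure mode concrete. This is only a sketch,
not real filesystem code: ToyFile, its fields, and its methods are all
hypothetical names invented for the illustration, and a real filesystem
tracks extent state in a journal rather than in two Python lists.

```python
# Toy model of "data flushed, metadata not". Not real filesystem code:
# ToyFile and everything in it is hypothetical.

class ToyFile:
    def __init__(self, nblocks):
        # Preallocated blocks start out as unwritten extents.
        self.media = [b"\0"] * nblocks              # persistent data blocks
        self.written_volatile = [False] * nblocks   # in-memory extent state
        self.written_durable = [False] * nblocks    # extent state on media

    def store_and_flush(self, blk, data):
        # Write fault: the filesystem converts the extent in memory only.
        self.written_volatile[blk] = True
        # The app's store plus CPU cache flush: user data reaches the media.
        self.media[blk] = data

    def sync_metadata(self):
        # What fsync()/msync() adds: the extent conversion becomes durable.
        self.written_durable = list(self.written_volatile)

    def crash(self):
        # Volatile state is lost; only durable metadata survives.
        self.written_volatile = list(self.written_durable)

    def read(self, blk):
        # Unwritten extents read back as zeros regardless of media contents.
        return self.media[blk] if self.written_volatile[blk] else b"\0"


def run(sync_first):
    f = ToyFile(4)
    f.store_and_flush(1, b"A")
    if sync_first:
        f.sync_metadata()
    f.crash()
    return f.read(1)
```

In this model run(False) loses the store even though the data block
itself reached the media, while run(True) reads it back.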
msync() calls fsync() on file-backed pages, which makes file metadata
changes persistent. Indeed, if you read the fdatasync man page, you
might have noticed that it makes explicit reference that it requires
the filesystem to flush the metadata needed to access the data that
is being synced. IOWs, the filesystem knows about this dirty
metadata that needs to be flushed to ensure data integrity.
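For reference, the msync()-based sequence from Andy's example might look
roughly like the sketch below. Several things here are assumptions: it
uses Python instead of C for brevity, os.posix_fallocate() stands in for
fallocate(), the helper names write_durably() and read_back() are made
up, and it runs against any ordinary file, not specifically a DAX one.
CPython's mmap.flush() is implemented with msync() on Unix.

```python
import mmap
import os


def write_durably(path, payload, size=4096):
    """Preallocate, map, store, then sync through the filesystem."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)   # preallocate the blocks
        with mmap.mmap(fd, size) as mm:   # MAP_SHARED by default on Unix
            mm[:len(payload)] = payload   # plain store into the mapping
            mm.flush()                    # msync(): the filesystem gets to
                                          # flush data and needed metadata
    finally:
        os.close(fd)


def read_back(path, n):
    """Read the first n bytes back through the normal file API."""
    with open(path, "rb") as fh:
        return fh.read(n)
```

The point of the sketch is the mm.flush() line: it is the one place
where the filesystem is told that something needs to be made
persistent, and it is exactly the call that userspace cache flush
instructions would replace.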
Not to mention that the filesystem will convert and zero much more
than just a single cacheline (whole pages at minimum, could be 2MB
extents for large pages, etc) so the filesystem may require CPU
cache flushes over a much wider range of cachelines than the
application realises are dirty and require flushing for data
integrity purposes. The filesystem knows about these dirty cache
lines, userspace doesn't.
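To put rough numbers on that gap, here is a small arithmetic sketch.
The 64-byte cacheline and 4 KiB page sizes are assumptions (typical for
x86, not stated above), and the 8-byte store is just an arbitrary
example of what an application might track.

```python
CACHELINE = 64  # bytes; an assumption, typical for x86


def cachelines(nbytes, line=CACHELINE):
    """Number of cachelines needed to cover nbytes, rounded up."""
    return -(-nbytes // line)


app_store = cachelines(8)                # one 8-byte store the app tracks
page = cachelines(4096)                  # a 4 KiB page zeroed on fault
extent = cachelines(2 * 1024 * 1024)     # a 2 MiB extent
```

Under those assumptions the application knows about 1 dirty cacheline,
while the zeroing done on its behalf may have dirtied 64 (one page) or
32768 (one 2MB extent).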
With the current code at least dax_zero_page_range() doesn't rely on
fsync/msync from userspace to make the zeroes that it writes persistent. It
does all the necessary flushing and wmb_pmem() calls itself. I agree that
this does not address your concern about metadata being in sync, though.
IOWs, your userspace library may have made sure the data it modifies
is in the physical location via your userspace CPU cache flushes,
but there can be a lot of stuff it doesn't know about internal to
the filesystem that also needs to be flushed to ensure data integrity.
Linux-nvdimm mailing list