Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
by Dan Williams
On Sat, Nov 7, 2015 at 12:38 AM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
> On Sat, 7 Nov 2015, Dan Williams wrote:
>> On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
>> > On Fri, 6 Nov 2015, H. Peter Anvin wrote:
>> >> On 11/06/15 15:17, Dan Williams wrote:
>> >> >>
>> >> >> Is it really required to do that on all cpus?
>> >> >
>> >> > I believe it is, but I'll double check.
>> >> >
>> >>
>> >> It's required on all CPUs on which the DAX memory may have been dirtied.
>> >> This is similar to the way we flush TLBs.
>> >
>> > Right. And that's exactly the problem: "may have been dirtied"
>> >
>> > If DAX is used on 50% of the CPUs and the other 50% are plugging away
>> > happily in user space or run low latency RT tasks w/o ever touching
>> > it, then having an unconditional flush on ALL CPUs is just wrong
>> > because you penalize the uninvolved cores with a completely pointless
>> > SMP function call and drain their caches.
>> >
>>
>> It's not wrong and pointless; it's all we have available, short of
>> having the kernel remember every virtual address that might have been
>> touched since the last fsync and sit in a loop flushing those virtual
>> addresses cache line by cache line.
>>
>> There is a crossover point, yet to be determined, where wbinvd is better
>> than a clwb loop.
>
> This is a totally different issue and I'm well aware that there is a
> tradeoff between wbinvd() and a clwb loop. wbinvd() might be more
> efficient performance wise above some number of cache lines, but then
> again it's draining all unrelated stuff as well, which can result in an
> even larger performance hit.
>
> Now what really concerns me more is that you just unconditionally
> flush on all CPUs whether they were involved in that DAX stuff or not.
>
> Assume a DAX-using application on CPUs 0-3 and some other unrelated
> workload on CPUs 4-7. That flush will
>
> - Interrupt CPUs 4-7 for no reason (whether you use clwb or wbinvd)
>
> - Drain the cache for CPUs 4-7 for no reason if done with wbinvd()
>
> - Render Cache Allocation useless if done with wbinvd()
>
> And we are not talking about a few microseconds here. Assume that
> CPUs 4-7 have cache allocated and it's mostly dirty. We've measured the
> wbinvd() impact on RT, back when the graphics folks used it as a
> big hammer. The maximum latency spike was way above one millisecond.
>
> We have similar issues with TLB flushing, but there we
>
> - track where it was used and never flush innocent CPUs
>
> - an application can be designed to use separate processes so that
> cross-CPU flushing does not happen
>
> I know that this is not an easy problem to solve, but you should be
> aware that various application scenarios are going to be massively
> unhappy about that.
>
Thanks for that explanation. Peter had alluded to it at KS, but I did
not realize it was as horrible as milliseconds of latency, hmm...
One other mitigation, following on from Dave's plan for per-inode DAX
control, is to also track when an inode has a writable DAX mmap
established. With that we could have a REQ_DAX flag to augment
REQ_FLUSH and potentially reduce committing violence on the cache. In
an earlier thread I also recall an idea to have an mmap flag that an
app can use to say "yes, I'm doing a writable DAX mapping, but I'm
taking care of the cache myself". We could track innocent CPUs, but I
suspect that would require a core change to write-protect pages when a
thread migrates? In general I feel there's a limit to how much
hardware workaround is reasonable to do in the core kernel versus
waiting for the platform to offer better options...
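Roughly, that tracking idea could look something like the sketch below;
the AS_DAX_WRITE_MAPPED bit and both hook sites are hypothetical names
used purely for illustration, not proposed interfaces:

	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/pagemap.h>

	/* Hypothetical new address_space flag bit; value is illustrative. */
	#define AS_DAX_WRITE_MAPPED	5

	/* In the DAX mmap path: note that a writable shared mapping exists. */
	static void dax_note_writable_mmap(struct vm_area_struct *vma)
	{
		if ((vma->vm_flags & (VM_SHARED | VM_WRITE)) ==
				(VM_SHARED | VM_WRITE))
			set_bit(AS_DAX_WRITE_MAPPED,
				&vma->vm_file->f_mapping->flags);
	}

	/* In the REQ_FLUSH handling for the device: skip the cache big
	 * hammer if nothing was ever mapped writable. */
	static bool dax_flush_needed(struct address_space *mapping)
	{
		return test_bit(AS_DAX_WRITE_MAPPED, &mapping->flags);
	}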
Sorry if I'm being a bit punchy, but I'm still feeling like I need to
defend the notion that DAX may just need to be turned off in some
situations.
Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
by Dan Williams
On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
> On Fri, 6 Nov 2015, H. Peter Anvin wrote:
>> On 11/06/15 15:17, Dan Williams wrote:
>> >>
>> >> Is it really required to do that on all cpus?
>> >
>> > I believe it is, but I'll double check.
>> >
>>
>> It's required on all CPUs on which the DAX memory may have been dirtied.
>> This is similar to the way we flush TLBs.
>
> Right. And that's exactly the problem: "may have been dirtied"
>
> If DAX is used on 50% of the CPUs and the other 50% are plugging away
> happily in user space or run low latency RT tasks w/o ever touching
> it, then having an unconditional flush on ALL CPUs is just wrong
> because you penalize the uninvolved cores with a completely pointless
> SMP function call and drain their caches.
>
It's not wrong and pointless; it's all we have available, short of
having the kernel remember every virtual address that might have been
touched since the last fsync and sit in a loop flushing those virtual
addresses cache line by cache line.
There is a crossover point, yet to be determined, where wbinvd is better
than a clwb loop.
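Purely as a sketch of how such a crossover might be expressed once the
threshold is known (the half-of-cache threshold, the function name, and
the use of boot_cpu_data.x86_cache_size are assumptions, not a proposal;
clflush_cache_range() and wbinvd_on_all_cpus() are existing helpers):

	#include <linux/smp.h>
	#include <asm/cacheflush.h>
	#include <asm/processor.h>

	static void dax_writeback_range(void *vaddr, size_t len)
	{
		/* x86_cache_size is reported in KB */
		size_t cache_bytes = boot_cpu_data.x86_cache_size * 1024UL;

		if (len < cache_bytes / 2) {
			/* small dirty range: write back line by line */
			clflush_cache_range(vaddr, len);
		} else {
			/* big hammer: write back + invalidate on all CPUs */
			wbinvd_on_all_cpus();
		}
	}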
Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
by Dan Williams
On Fri, Nov 6, 2015 at 9:35 AM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
> On Fri, 6 Nov 2015, Dan Williams wrote:
>> On Fri, Nov 6, 2015 at 12:06 AM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
>> > Just for the record. Such a flush mechanism with
>> >
>> > on_each_cpu()
>> > wbinvd()
>> > ...
>> >
>> > will make that stuff completely unusable on Real-Time systems. We've
>> > been there with the big hammer approach of the intel graphics
>> > driver.
>>
>> Noted. This means RT systems either need to disable DAX or avoid
>> fsync. Yes, this is a wart, but not an unexpected one in a first
>> generation persistent memory platform.
>
> And it's not just only RT. The folks who are aiming for 100%
> undisturbed user space (NOHZ_FULL) will be massively unhappy about
> that as well.
>
> Is it really required to do that on all cpus?
>
I believe it is, but I'll double check.
I assume the folks who want undisturbed userspace are OK with the
mitigation of modifying their applications to flush individual cache
lines if they want to use DAX without fsync. At least until the
platform can provide a cheaper fsync implementation.
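For illustration, that userspace mitigation might look roughly like the
sketch below (assuming a 64-byte line size and a CPU with CLWB;
CLFLUSHOPT/CLFLUSH would be the fallbacks, and any PCOMMIT requirement
is ignored here):

	#include <stddef.h>
	#include <stdint.h>
	#include <immintrin.h>

	#define CACHELINE 64UL	/* assumed cache line size */

	/* Flush the application's own stores to a writable DAX mapping,
	 * instead of calling fsync().  Compile with -mclwb. */
	static void flush_pmem_range(const void *addr, size_t len)
	{
		uintptr_t p = (uintptr_t)addr & ~(CACHELINE - 1);
		uintptr_t end = (uintptr_t)addr + len;

		for (; p < end; p += CACHELINE)
			_mm_clwb((void *)p);	/* write back, keep line cached */
		_mm_sfence();		/* order flushes before later stores */
	}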
The option to drive cache flushing from the radix tree is at least
interruptible, but it might be long running depending on how much
virtual address space is dirty. Altogether, the options in the
current generation are:
1/ wbinvd driven: quick flush O(size of cache), but long interrupt-off latency
2/ radix driven: long flush O(size of dirty range), but at least preemptible
3/ DAX without calling fsync: userspace takes direct responsibility
for cache management of DAX mappings
4/ DAX disabled: fsync is the standard page cache writeback latency
We could potentially argue about 1 vs 2 ad nauseam, but I wonder if
there is room to punt it to a configuration option or make it
dynamic? My stance is to do 1 with the hope of riding options 3 and 4
until the platform happens to provide a better alternative.
[PATCH 0/2] "big hammer" for DAX msync/fsync correctness
by Ross Zwisler
This series implements the very slow but correct handling for
blkdev_issue_flush() with DAX mappings, as discussed here:
https://lkml.org/lkml/2015/10/26/116
I don't think that we can actually do the
        on_each_cpu(sync_cache, ...);
...where sync_cache is something like:
        cache_disable();
        wbinvd();
        pcommit();
        cache_enable();
solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
your writes actually make it durably onto the DIMMs. I believe you really do
need to loop through the cache lines, flush them with CLWB, then fence and
PCOMMIT.
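The shape of that loop, as I understand it, is roughly the following
sketch (the real interface is wb_cache_pmem() in patch 1; clwb() and
wmb_pmem() here stand in for the arch helpers, and exact headers are
approximate):

	#include <linux/pmem.h>
	#include <asm/processor.h>

	/* Write back every cache line in the range, then fence + PCOMMIT. */
	static void wb_cache_pmem_sketch(void __pmem *vaddr, size_t size)
	{
		unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
		void *addr = (void __force *)vaddr;
		void *end = addr + size;

		for (; addr < end; addr += clflush_size)
			clwb(addr);	/* write back this line, leave it cached */

		wmb_pmem();	/* fence, then PCOMMIT + fence: make it durable */
	}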
I do worry that the cost of blindly flushing the entire PMEM namespace on each
fsync or msync will be prohibitively expensive, and that we'll be very
incentivized to move to the radix-tree-based dirty page tracking as soon as
possible. :)
Ross Zwisler (2):
pmem: add wb_cache_pmem() to the PMEM API
pmem: Add simple and slow fsync/msync support
arch/x86/include/asm/pmem.h | 11 ++++++-----
drivers/nvdimm/pmem.c | 10 +++++++++-
include/linux/pmem.h | 22 +++++++++++++++++++++-
3 files changed, 36 insertions(+), 7 deletions(-)
--
2.1.0
Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
by Dan Williams
On Fri, Nov 6, 2015 at 12:06 AM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
> On Thu, 5 Nov 2015, Dan Williams wrote:
>> On Wed, Oct 28, 2015 at 3:51 PM, Ross Zwisler
>> <ross.zwisler(a)linux.intel.com> wrote:
>> > On Wed, Oct 28, 2015 at 06:24:29PM -0400, Jeff Moyer wrote:
>> >> Ross Zwisler <ross.zwisler(a)linux.intel.com> writes:
>> >>
>> >> > This series implements the very slow but correct handling for
>> >> > blkdev_issue_flush() with DAX mappings, as discussed here:
>> >> >
>> >> > https://lkml.org/lkml/2015/10/26/116
>> >> >
>> >> > I don't think that we can actually do the
>> >> >
>> >> > on_each_cpu(sync_cache, ...);
>> >> >
>> >> > ...where sync_cache is something like:
>> >> >
>> >> > cache_disable();
>> >> > wbinvd();
>> >> > pcommit();
>> >> > cache_enable();
>> >> >
>> >> > solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
>> >> > your writes actually make it durably onto the DIMMs. I believe you really do
>> >> > need to loop through the cache lines, flush them with CLWB, then fence and
>> >> > PCOMMIT.
>> >>
>> >> *blink*
>> >> *blink*
>> >>
>> >> So much for not violating the principle of least surprise. I suppose
>> >> you've asked the hardware folks, and they've sent you down this path?
>> >
>> > Sadly, yes, this was the guidance from the hardware folks.
>>
>> So it turns out we weren't asking the right question. wbinvd may
>> indeed be viable... we're still working through the caveats.
>
> Just for the record. Such a flush mechanism with
>
> on_each_cpu()
> wbinvd()
> ...
>
> will make that stuff completely unusable on Real-Time systems. We've
> been there with the big hammer approach of the intel graphics
> driver.
Noted. This means RT systems either need to disable DAX or avoid
fsync. Yes, this is a wart, but not an unexpected one in a first
generation persistent memory platform.
[RFC 00/11] DAX fsync/msync support
by Ross Zwisler
This patch series adds support for fsync/msync to DAX.
Patches 1 through 8 add various utilities that the DAX code will eventually
need, and the DAX code itself is added by patch 9. Patches 10 and 11 are
filesystem changes that are needed after the DAX code is added, but these
patches may change slightly as the filesystem fault handling for DAX is
being modified ([1] and [2]).
I've marked this series as RFC because I'm still testing, but I wanted to
get this out there so people would see the direction I was going and
hopefully comment on any big red flags sooner rather than later.
I realize that we are getting pretty dang close to the v4.4 merge window,
but I think that if we can get this reviewed and working it's a much better
solution than the "big hammer" approach that blindly flushes entire PMEM
namespaces [3].
[1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html
[2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2
[3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html
Ross Zwisler (11):
pmem: add wb_cache_pmem() to the PMEM API
mm: add pmd_mkclean()
pmem: enable REQ_FLUSH handling
dax: support dirty DAX entries in radix tree
mm: add follow_pte_pmd()
mm: add pgoff_mkclean()
mm: add find_get_entries_tag()
fs: add get_block() to struct inode_operations
dax: add support for fsync/sync
xfs, ext2: call dax_pfn_mkwrite() on write fault
ext4: add ext4_dax_pfn_mkwrite()
arch/x86/include/asm/pgtable.h | 5 ++
arch/x86/include/asm/pmem.h | 11 +--
drivers/nvdimm/pmem.c | 3 +-
fs/dax.c | 161 +++++++++++++++++++++++++++++++++++++++--
fs/ext2/file.c | 5 +-
fs/ext4/file.c | 23 +++++-
fs/inode.c | 1 +
fs/xfs/xfs_file.c | 9 ++-
fs/xfs/xfs_iops.c | 1 +
include/linux/dax.h | 6 ++
include/linux/fs.h | 5 +-
include/linux/mm.h | 2 +
include/linux/pagemap.h | 3 +
include/linux/pmem.h | 22 +++++-
include/linux/radix-tree.h | 3 +
include/linux/rmap.h | 5 ++
mm/filemap.c | 73 ++++++++++++++++++-
mm/huge_memory.c | 14 ++--
mm/memory.c | 41 +++++++++--
mm/page-writeback.c | 9 +++
mm/rmap.c | 53 ++++++++++++++
mm/truncate.c | 5 +-
22 files changed, 418 insertions(+), 42 deletions(-)
--
2.1.0
Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
by Ross Zwisler
On Wed, Nov 04, 2015 at 10:06:12AM +0100, Jan Kara wrote:
> On Tue 03-11-15 21:46:13, Ross Zwisler wrote:
> > On Tue, Nov 03, 2015 at 05:02:34PM -0800, Dan Williams wrote:
> > > > Hmm...if we go this path, though, is that an argument against moving the
> > > > zeroing from DAX down into the driver? True, with BRD it makes things nice
> > > > and efficient because you can zero and never flush, and the driver knows
> > > > there's nothing else to do.
> > > >
> > > > For PMEM, though, you lose the ability to zero the data and then queue the
> > > > flushing for later, as you would be able to do if you left the zeroing code in
> > > > DAX. The benefit of this is that if you are going to immediately re-write the
> > > > newly zeroed data (which seems common), PMEM will end up doing an extra cache
> > > > flush of the zeroes, only to have them overwritten and marked as dirty by DAX.
> > > > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write
> > > > it once, and flush it once.
> > >
> > > Why do we lose the ability to flush later if the driver supports
> > > blkdev_issue_zeroout?
> >
> > I think that if you implement zeroing in the driver you'd need to also
> > flush in the driver because you wouldn't have access to the radix tree to
> > be able to mark entries as dirty so you can flush them later.
> >
> > As I think about this more, though, I'm not sure that having the zeroing
> > flush later could work. I'm guessing that the filesystem must require a
> > sync point between the zeroing and the subsequent follow-up writes so
> > that you can sync metadata for the block allocation. Otherwise you could
> > end up in a situation where you've got your metadata pointing at newly
> > allocated blocks but the new zeros are still in the processor cache - if
> > you lose power you've just created an information leak. Dave, Jan, does
> > this make sense?
>
> So the problem you describe does not exist. The thing to keep in mind is that
> filesystems are designed to work reliably with a 'non-persistent' cache in the
> disk, which is common these days. That's why we bother with all that
> REQ_FLUSH | REQ_FUA and blkdev_issue_flush() stuff after all. The processor
> cache is exactly that kind of cache attached to the PMEM storage. And
> Dave and I are trying to steer you toward a solution that treats it the same
> way in DAX filesystems as well :).
And I'm always grateful for the guidance. :)
> Now how the problem is currently solved: When we allocate blocks, we just
> record that information in a transaction in the journal. For the DAX case we
> also submit the IO zeroing those blocks and wait for it. Now if we crash
> before the transaction gets committed, blocks won't be seen in the inode
> after a journal recovery and thus no data exposure can happen. As a part of
> transaction commit, we call blkdev_issue_flush() (or submit REQ_FLUSH
> request). We expect that to force out all the IO in volatile caches into
> the persistent storage. So this will also force the zeroing into persistent
> storage for normal disks, and AFAIU if you do the zeroing with non-temporal
> writes in the pmem driver and then do wmb_pmem() in response to a flush request,
> we get the same persistence guarantee in the pmem case as well. So after a
> transaction commit we are guaranteed to see zeros in those allocated
> blocks.
>
> So the transaction commit and the corresponding flush request in particular
> is the sync point you speak about above, but the good thing is that in most
> cases this will happen after the real data gets written into those blocks, so
> we save the unnecessary flush.
Cool, thank you for the explanation, that makes sense to me.
When dealing with normal SSDs and the page cache, does the filesystem keep the
zeroes in the page cache, or does it issue them directly to the driver via
sb_issue_zeroout()/blkdev_issue_zeroout()? If we keep them in the page cache so
that the follow-up writes just update the dirty pages and we end up writing to
media once, this seems like it would flow nicely into the idea of zeroing new
blocks at the DAX level without flushing and just marking them as dirty in the
radix tree. If the zeroing happens via sb_issue_zeroout() then this probably
doesn't make sense, because the existing flow won't include an fsync/msync-type
step for the newly zeroed data in the page cache.
Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
by Dave Chinner
[add people to the cc list]
On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote:
> On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote:
> > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote:
> > > Unless there is some
> > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically
> > > causes a clear_pmem() over every page sized chunk of the target I/O
> > > range for which we already have the data.
> >
> > I don't follow - this only zeros blocks when we do allocation of new
> > blocks or overwrite unwritten extents, not on blocks which we
> > already have written data extents allocated for...
> >
>
> Why are we assuming that block zeroing is more efficient than unwritten
> extents for DAX/dio? I haven't played with pmem enough to know for sure
> one way or another (or if hw support is imminent), but I'd expect the
> latter to be more efficient in general without any kind of hardware
> support.
>
> Just as an example, here's an 8GB pwrite test, large buffer size, to XFS
> on a ramdisk mounted with '-o dax':
>
> - Before this series:
>
> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
> wrote 8589934592/8589934592 bytes at offset 0
> 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec)
>
> - After this series:
>
> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
> wrote 8589934592/8589934592 bytes at offset 0
> 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec)
That looks wrong. Much, much slower than it should be just zeroing
pages and then writing to them again while cache hot.
Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this
loop spending most of the CPU time:
       │      ↓ jbe    ea
       │ de:    clflush %ds:(%rax)
 84.67 │        add    %rsi,%rax
       │        cmp    %rax,%rdx
       │      ↑ ja     de
       │ ea:    add    %r13,-0x38(%rbp)
       │        sub    %r12,%r14
       │        sub    %r12,-0x40(%rbp)
That is the overhead of __arch_wb_cache_pmem(), i.e. issuing CPU
cache flushes after each memset.
None of these pmem memory operations are optimised yet - the
implementations are correct, but performance still needs work. The
conversion to non-temporal stores should get rid of this cache flush
overhead (somewhat), but I was still expecting this code to be much
closer to memset speed and not reduce performance to bugger all...
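To make the non-temporal store point concrete, a userspace-flavored
sketch (illustrative only; the kernel side would do the equivalent with
movnti through its own helpers): streaming stores bypass the cache, so
there is no line to flush afterwards, only a fence.

	#include <stddef.h>
	#include <emmintrin.h>	/* _mm_stream_si64, _mm_sfence (x86-64 SSE2) */

	/* Zero 'len' bytes with non-temporal 8-byte stores; assumes dst and
	 * len are 8-byte aligned, as a block zeroing path would guarantee. */
	static void zero_nocache(void *dst, size_t len)
	{
		long long *p = dst;
		long long *end = (long long *)((char *)dst + len);

		while (p < end)
			_mm_stream_si64(p++, 0);  /* movnti: bypass the cache */
		_mm_sfence();	/* make the streaming stores globally visible */
	}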
> The impact is less with a smaller buffer size so the above is just meant
> to illustrate the point. FWIW, I'm also fine with getting this in as a
> matter of "correctness before performance" since this stuff is clearly
> still under development, but as far as I can see so far we should
> probably ultimately prefer unwritten extents for DAX/DIO (or at least
> plan to run some similar tests on real pmem hw). Thoughts?
We're going to have these problems initially, but from the XFS
perspective those problems don't matter because we have a different
problem to solve. That is, we need to move towards an architecture
that is highly efficient for low latency, high IOPS storage
subsystems. The first step towards that is to be able to offload
block zeroing to the hardware so we can avoid using unwritten
extents.
In the long run, we don't want to use unwritten extents on DAX if we
can avoid it - the CPU overhead of unwritten extent conversion isn't
constant (i.e. it grows with the size of the BMBT) and it isn't
deterministic (e.g. split/merge take much more CPU than a simple
record write to clear an unwritten flag). We don't notice this
overhead much with normal IO because the CPU time
for conversion is much less than the CPU time for the IO to
complete, hence it's a win.
But for PMEM, directly zeroing a 4k chunk of data should take *much
less CPU* than synchronously starting a transaction, reserving
space, looking up the extent in the BMBT, loading the buffers from
cache, modifying the buffers, logging the changes and committing the
transaction (which in itself will typically copy more than a single
page of metadata into the CIL buffers).
Realistically, dax_clear_blocks() should probably be implemented at
the pmem driver layer through blkdev_issue_zeroout() because all it
does is directly map the sector/len to pfn via bdev_direct_access()
and then zero it - it's a sector-based, block device operation. We
don't actually need a special case path for DAX here. Optimisation
of this operation has little to do with the filesystem.
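A sketch of what that driver-level path could look like (the function
itself is hypothetical; bdev_direct_access() and clear_pmem() are the
existing interfaces as I understand them, and how this gets wired up
behind blkdev_issue_zeroout() is left open):

	#include <linux/blkdev.h>
	#include <linux/pmem.h>

	/* Map sector/len to a pmem virtual address and zero it directly. */
	static int pmem_zero_range(struct block_device *bdev, sector_t sector,
				   long size)
	{
		void __pmem *addr;
		unsigned long pfn;
		long avail;

		avail = bdev_direct_access(bdev, sector, &addr, &pfn, size);
		if (avail < 0)
			return avail;
		if (avail < size)
			return -ENXIO;

		clear_pmem(addr, size);	/* non-temporal zeroing of the range */
		wmb_pmem();		/* or defer durability to REQ_FLUSH,
					   as suggested above */
		return 0;
	}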
This comes back to the comments I made w.r.t. the pmem driver
implementation doing synchronous IO by immediately forcing CPU cache
flushes and barriers. It's obviously correct, but it looks like
there's going to be a major performance penalty associated with it.
This is why I recently suggested that a pmem driver that doesn't do
CPU cache writeback during IO but does it on REQ_FLUSH is an
architecture we'll likely have to support.
In this case, the block device won't need to flush CPU cache lines
for the zeroed blocks until the allocation transaction is actually
committed to the journal. In that case, there's a good chance that
we'd end up also committing the new data as well, hence avoiding two
synchronous memory writes. i.e. the "big hammer" REQ_FLUSH
implementation may well be *much* faster than fine-grained
synchronous "stable on completion" writes for persistent memory.
This, however, is not really a problem for the filesystem - it's
a pmem driver architecture problem. ;)
Cheers,
Dave.
--
Dave Chinner
david(a)fromorbit.com