On Sat, Nov 7, 2015 at 12:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
On Sat, 7 Nov 2015, Dan Williams wrote:
> On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 6 Nov 2015, H. Peter Anvin wrote:
> >> On 11/06/15 15:17, Dan Williams wrote:
> >> >>
> >> >> Is it really required to do that on all cpus?
> >> >
> >> > I believe it is, but I'll double check.
> >> >
> >> It's required on all CPUs on which the DAX memory may have been dirtied.
> >> This is similar to the way we flush TLBs.
> > Right. And that's exactly the problem: "may have been dirtied"
> > If DAX is used on 50% of the CPUs and the other 50% are plugging away
> > happily in user space or running low latency RT tasks w/o ever touching
> > it, then having an unconditional flush on ALL CPUs is just wrong,
> > because you penalize the uninvolved cores with a completely pointless
> > SMP function call and drain their caches.
> It's not wrong and pointless, it's all we have available outside of
> having the kernel remember every virtual address that might have been
> touched since the last fsync and sit in a loop flushing those virtual
> address cache line by cache line.
> There is a crossover point where wbinvd is better than a clwb loop
> that needs to be determined.
This is a totally different issue and I'm well aware that there is a
tradeoff between wbinvd() and a clwb loop. wbinvd() might be more
efficient performance wise above some number of cache lines, but then
again it's draining all unrelated stuff as well, which can result in an
even larger performance hit.
Now what really concerns me more is that you just unconditionally
flush on all CPUs whether they were involved in that DAX stuff or not.
Assume a DAX-using application on CPUs 0-3 and some other unrelated
workload on CPUs 4-7. That flush will:
- Interrupt CPUs 4-7 for no reason (whether you use clwb or wbinvd)
- Drain the caches of CPUs 4-7 for no reason if done with wbinvd()
- Render Cache Allocation useless if done with wbinvd()
And we are not talking about a few micro seconds here. Assume that
CPUs 4-7 have cache allocated and it's mostly dirty. We've measured the
wbinvd() impact on RT, back then when the graphic folks used it as a
big hammer. The maximum latency spike was way above one millisecond.
We have similar issues with TLB flushing, but there:
- we track where a mapping was used and never flush on innocent CPUs
- one can design an application to use separate processes, so that
  cross-CPU flushing does not happen
I know that this is not an easy problem to solve, but you should be
aware that various application scenarios are going to be massively
unhappy about that.
Thanks for that explanation. Peter had alluded to it at KS, but I
indeed did not know that it was as horrible as milliseconds of latency.
One other mitigation that follows on with Dave's plan of per-inode DAX
control, is to also track when an inode has a writable DAX mmap
established. With that we could have a REQ_DAX flag to augment
REQ_FLUSH to potentially reduce committing violence on the cache. In
an earlier thread I also recall an idea to have an mmap flag that an
app can use to say "yes, I'm doing a writable DAX mapping, but I'm
taking care of the cache myself". We could track innocent CPUs, but
I'm thinking that would be a core change to write-protect pages when a
thread migrates? In general I feel there's a limit for how much
hardware workaround is reasonable to do in the core kernel vs waiting
for the platform to offer better options...
Sorry if I'm being a bit punchy, but I'm still feeling like I need to
defend the notion that DAX may just need to be turned off in some cases.