On Tue, Feb 2, 2016 at 4:34 PM, Matthew Wilcox <willy(a)linux.intel.com> wrote:
On Tue, Feb 02, 2016 at 01:46:06PM -0800, Jared Hulbert wrote:
> On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams <dan.j.williams(a)intel.com> wrote:
> >> The filesystem I'm concerned with is AXFS
> >> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf
> >> Which I've been planning on trying to merge again due to a recent
> >> resurgence of interest. The device model for AXFS is... weird. It
> >> can use one or two devices at a time of any mix of NOR MTD, NAND MTD,
> >> block, and unmanaged physical memory. It's a terribly useful model
> >> for embedded. Anyway AXFS is readonly so hacking in a read only
> >> dax_fault_nodev() and dax_file_read() would work fine, looks easy
> >> enough. But... it would be cool if similar small embedded focused RW
> >> filesystems were enabled.
> > Are those also out of tree?
> Of course. Merging embedded filesystems is little merging regular
> filesystems except 98% of you reviewers don't want it merged.
You should at least be able to get it into staging these days. I mean,
look at some of the junk that's in staging ... and I don't think AXFS was
nearly as bad.
> IMO you're making DAX more complex by overly coupling to the
> I think it could bite you later. I submit this rework of the radix
> tree and confusion about where to get the real bdev as evidence. I'm
> guessing that it won't be the last time. It's unnecessary to couple
> it like this, and in fact is not how the vfs has been layered in the
Huh? The rework to use the radix tree for PFNs was done with one eye
firmly on your usage case. Just because I had to thread the get_block
interface through it for the moment doesn't mean that I didn't have
the "how do we get rid of get_block entirely" question on my mind.
Oh yeah. I think we're on the same page. But I'm not sure Dan is. I
get the need to phase this in too.
Using get_block seemed like the right idea three years ago. I
know just how fundamentally ext4 and XFS disagree on how it should be
Sure. I can see that.
> To look at the the downside consider dax_fault(). Its called on
> fault to a user memory map, uses the filesystems get_block() to lookup
> a sector so you can ask a block device to convert it to an address on
> a DIMM. Come on, that's awkward. Everything around dax_fault() is
> dripping with memory semantic interfaces, the dax_fault() call are
> fundamentally about memory, the pmem calls are memory, the hardware is
> memory, and yet it directly calls bdev_direct_access(). It's out of
What was out of place was the old 'get_xip_mem' in address_space
operations. Returning a kernel virtual address and a PFN from a
filesystem operation? That looks awful.
Yes. Yes it does! But at least my big hack was just one line. ;)
Nobody really even seemed to notice at the time.
All the other operations deal
in struct pages, file offsets and occasionally sectors. Of course, we
don't have a struct page, so a pfn makes sense, but the kernel virtual
address being returned was a gargantuan layering problem.
Well yes, but it was an expedient hack.
> The legacy vfs/mm code didn't have this layering problem
> filemap_fault() that dax_fault() is modeled after doesn't call any
> bdev methods directly, when it needs something it asks the filesystem
> with a ->readpage(). The precedence is that you ask the filesystem
> for what you need. Look at the get_bdev() thing you've concluded you
> need. It _almost_ makes my point. I just happen to be of the opinion
> that you don't actually want or need the bdev, you want the pfn/kaddr
> so you can flush or map or memcpy().
You want the pfn. The device driver doesn't have enough information to
give you a (coherent with userspace) kaddr. That's what (some future
arch-specific implementation of) dax_map_pfn() is for. That's why it
takes 'index' as a parameter, so you can calculate where it'll be mapped
in userspace, and determine an appropriate kernel virtual address to
use for it.
Oh.... I think I'm just beginning to catch your vision for
dax_map_pfn(). I still don't get why we can't just do semi-arch
specific flushing instead of the alignment thing. But that just might
be epic ignorance on my part. Either way flush or magic alignments
dax_(un)map_pfn() would handle it, right?