I'm reviving this old thread because I'm back to working in this area again. The
challenge for changes this large is to figure out how to do them in small
pieces. Some of the major changes to the threading model will need to be done in
one go, but I think there are a few incremental steps we can take to improve the
code base and prepare it for the big transition.
I think the first of those changes is to remove the "mode" parameter from
subsystems. Today, NVMe-oF subsystems can be in either direct (I/O routed to
nvme library) or virtual (I/O routed to bdev library) mode. Recently, the wider
community contributed a patch (thank you!) that adds an NVMe passthrough
command to the bdev layer, which allows us to send an NVMe command through the
regular bdev stack. Those commands are generally only interpreted by the NVMe
bdev module - the other backing devices don't report support for NVMe
passthrough - but that's good enough. Given that new capability, we can do
everything in virtual mode that we previously did in direct mode.
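For reference, here is a minimal sketch of what routing a fabric-captured NVMe
command through the bdev layer looks like. spdk_bdev_nvme_io_passthru is the
bdev API I'm referring to above; the completion callback shape follows the
current bdev API, and the surrounding function and variable names are just
placeholders for illustration.

#include "spdk/bdev.h"
#include "spdk/nvme_spec.h"

/* Illustrative completion callback - translate the bdev completion back into
 * an NVMe-oF response here. */
static void
passthru_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
        spdk_bdev_free_io(bdev_io);
}

/* Sketch: submit an NVMe command received from the fabric through the bdev
 * stack instead of directly to the nvme library. 'desc' and 'ch' are the bdev
 * descriptor and per-thread I/O channel the target already holds. */
static int
submit_passthru(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
                const struct spdk_nvme_cmd *cmd, void *buf, size_t len)
{
        return spdk_bdev_nvme_io_passthru(desc, ch, cmd, buf, len,
                                          passthru_done, NULL);
}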
The only reason we didn't remove direct mode immediately after the addition of
NVMe passthrough was because we wanted to do a full performance evaluation to
verify the bdev layer doesn't have a measurable amount of overhead. I'm glad to
report those results have come in and the overhead of routing I/O through the
bdev library instead of the nvme library isn't measurable on any hardware setup
we were able to build.
I wrote up a patch here:
https://review.gerrithub.io/#/c/369496/
The next big step is probably to make some changes to the transport API to
accommodate the new ideas in my previous email.
Discussion and requests are always welcome!
Thanks,
Ben
On Fri, 2017-07-14 at 17:16 +0000, Walker, Benjamin wrote:
-----Original Message-----
From: Walker, Benjamin
Sent: Wednesday, April 26, 2017 2:06 PM
To: spdk(a)lists.01.org
Subject: NVMe-oF Target Library
Hi all,
I was hoping to start a bit of a design discussion about the future of the
NVMe-oF target library (lib/nvmf). The NVMe-oF target was originally created as
part of a skunkworks project and was very much an application. It wasn't
divided into a library and an app as it is today. Right before we released it,
I decided to attempt to break it up into a library and an application, but I
never really finished that task. I'd like to resume that work now, but let the
entire community weigh in on what the library looks like.
First, libraries in SPDK (most things that live in lib/) shouldn't enforce a
threading model. They should, as much as possible, be entirely passive C
libraries with as few dependencies as we can manage. Applications in SPDK
(things that live in app/), on the other hand, necessarily must choose a
particular threading model. We universally use our application/event framework
(lib/event) for apps, which spawns one thread per core, etc. We'll continue
this model for NVMe-oF where app/nvmf_tgt will be a full application with a
threading model dictated by the application/event framework, while lib/nvmf
will be a passive C library that will depend only on other passive C
libraries.
I don't think this distinction reflects reality today, but let's work to make
it so.
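To make that distinction concrete, here's a rough sketch of the split. Every
name in it is hypothetical; the point is only that the library side is a plain
function with no thread awareness, while the application side decides where
and how often to call it.

/* lib/nvmf (passive library side): a plain function the caller invokes from
 * whatever thread it likes. The library creates no threads and pulls in no
 * framework. (Hypothetical name.) */
int spdk_nvmf_tgt_poll(struct spdk_nvmf_tgt *tgt);

/* app/nvmf_tgt (application side): the app owns the threading model, so it
 * wraps the library call in a poller callback and registers that callback
 * with the event framework (spdk_poller_register) on whatever core it
 * chooses. */
static int
nvmf_tgt_poll_wrapper(void *ctx)
{
        struct spdk_nvmf_tgt *tgt = ctx;

        return spdk_nvmf_tgt_poll(tgt);
}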
The other major issue with the NVMe-oF target implementation is that it has
quite a few baked in assumptions about what the backing storage device looks
like. In particular, it was written assuming that it was talking directly to an
NVMe device (Direct mode), and the ability to route I/O to the bdev layer
(Virtual mode) was added much later and isn't entirely fleshed out yet. One of
these assumptions is that real NVMe devices don't benefit from multiple queues
- you can get the full performance from an NVMe device using just one queue
pair. That isn't necessarily true for bdevs, which may be arbitrarily
complex virtualized devices. Given that assumption, the NVMe-oF target
today only creates a single queue pair to the backing storage device and only
uses a single thread to route I/O to it. We're definitely going to need to
break that assumption.
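To sketch one direction for breaking it (my speculation, not part of the patch
above): each I/O-handling thread would hold its own I/O channel to the backing
bdev, so multiple threads can submit in parallel and the bdev module can map
those channels onto however many queue pairs the device actually wants. The
struct below is a made-up per-thread context; spdk_bdev_get_io_channel is the
real bdev call, assuming its current descriptor-based form.

#include "spdk/bdev.h"

/* Hypothetical per-thread context - not an existing SPDK structure. */
struct io_handler_ns_ctx {
        struct spdk_bdev_desc  *desc; /* opened once for the namespace, shared */
        struct spdk_io_channel *ch;   /* per-thread channel to the backing bdev */
};

/* Each thread that will submit I/O grabs its own channel. The channel
 * returned by spdk_bdev_get_io_channel is tied to the calling thread, which
 * is what allows more than one thread to drive the same backing device. */
static int
io_handler_attach_ns(struct io_handler_ns_ctx *ctx, struct spdk_bdev_desc *desc)
{
        ctx->desc = desc;
        ctx->ch = spdk_bdev_get_io_channel(desc);
        return ctx->ch != NULL ? 0 : -1;
}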
The first discussion that I want to have is around what the high level concepts
should be. We clearly need to expose things like "subsystem", "queue
pair/connection", "namespace", and "port". We should probably have an object
that represents the entire target too, maybe "nvmf_tgt". However, in order to
separate the threading model from the library I think we'll need at least two
more concepts.
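For discussion purposes, a header along these lines is roughly what I have in
mind. Every name below is hypothetical and up for debate; the point is just
that each concept becomes an opaque type owned by the library:

/* include/spdk/nvmf.h - hypothetical sketch */

struct spdk_bdev; /* from the bdev library */

/* Opaque objects owned by lib/nvmf. The application never sees the internals
 * and the library never creates threads on their behalf. */
struct spdk_nvmf_tgt;         /* the entire target */
struct spdk_nvmf_subsystem;   /* an NVMe-oF subsystem */
struct spdk_nvmf_ns;          /* a namespace within a subsystem */
struct spdk_nvmf_listen_addr; /* a port/listen address */
struct spdk_nvmf_conn;        /* a queue pair / connection */

struct spdk_nvmf_tgt *spdk_nvmf_tgt_create(void);
void spdk_nvmf_tgt_destroy(struct spdk_nvmf_tgt *tgt);

struct spdk_nvmf_subsystem *spdk_nvmf_tgt_add_subsystem(struct spdk_nvmf_tgt *tgt,
                                                        const char *nqn);
int spdk_nvmf_subsystem_add_ns(struct spdk_nvmf_subsystem *subsystem,
                               struct spdk_bdev *bdev);
int spdk_nvmf_tgt_listen(struct spdk_nvmf_tgt *tgt, const char *trtype,
                         const char *traddr, const char *trsvcid);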
First, some thread has to be in charge of polling for new connections. We
typically refer to this as the "acceptor" thread today. Maybe the best way to
handle this is to add an "accept" function that takes the nvmf_tgt object as an
argument. This function can only be called on a single thread at a time and is
repeatedly called to discover new connections. I think the user will end up
passing a callback in to this function that will be called for each new
connection discovered.
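Something like the following is what I'm picturing - again, hypothetical names:

/* Callback invoked once per newly discovered connection. */
typedef void (*spdk_nvmf_new_conn_fn)(struct spdk_nvmf_conn *conn, void *cb_arg);

/* Poll every listen address on the target once and invoke new_conn_cb for
 * each connection discovered. Must be called from only one thread at a time;
 * the library does not decide which thread handles the connection afterward -
 * that's the callback's job. */
int spdk_nvmf_tgt_accept(struct spdk_nvmf_tgt *tgt,
                         spdk_nvmf_new_conn_fn new_conn_cb, void *cb_arg);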
Second, once a new connection is discovered, we need to hand it off to some
collection that a dedicated thread can poll. This collection of connections
would be tied specifically to that dedicated thread, but it wouldn't
necessarily be tied to a subsystem or a particular storage device. I don't
really know what to call this thing - right now I'm thinking "io_handler".
So the general flow for an application would be to construct a target, add
subsystems, namespaces, and ports as needed, and then poll the target for
incoming connections. For each new connection, the application would assign it
to an io_handler (using whatever algorithm it wanted) and then poll the
io_handlers to actually handle I/O on the connections. Does this seem like a
reasonable design at a very high level? Feedback is very much welcome and
encouraged.
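To make that flow concrete, here's a rough sketch of the glue an application
built on the event framework might use, reusing the hypothetical names from
above (the round-robin assignment is just for illustration):

#define NUM_IO_THREADS 4 /* illustrative */

static struct spdk_nvmf_io_handler *g_handlers[NUM_IO_THREADS];
static uint32_t g_next_handler;

/* Runs on the acceptor core: assign each new connection to an io_handler
 * using whatever policy the application likes. */
static void
new_conn_cb(struct spdk_nvmf_conn *conn, void *cb_arg)
{
        struct spdk_nvmf_io_handler *handler = g_handlers[g_next_handler];

        g_next_handler = (g_next_handler + 1) % NUM_IO_THREADS;
        spdk_nvmf_io_handler_add_conn(handler, conn);
}

/* Registered as a poller on the acceptor core. */
static int
acceptor_poll(void *ctx)
{
        struct spdk_nvmf_tgt *tgt = ctx;

        return spdk_nvmf_tgt_accept(tgt, new_conn_cb, NULL);
}

/* Registered as a poller on each I/O core. */
static int
io_poll(void *ctx)
{
        struct spdk_nvmf_io_handler *handler = ctx;

        return spdk_nvmf_io_handler_poll(handler);
}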
If I don't hear back with a bunch of "you're wrong!" or "that's stupid!" type
replies over the next few days, the next step will be to write up a new header
file for the library that we can discuss in more detail.
Thanks,
Ben