I also have a few quick follow-up questions about WRR inline below.
Thank you very much!
Department of Computer Science and Engineering <http://www.cs.umn.edu/>
College of Science and Engineering <http://cse.umn.edu/>
University of Minnesota, Twin Cities <http://www.umn.edu>
On Wed, Jan 31, 2018 at 12:09 PM, Harris, James R <james.r.harris(a)intel.com> wrote:
The max IOPS number is per-device, not per-queue. The observed latency
for each I/O, from submission to completion, will be the same whether the
128 I/Os are submitted on one queue or spread across four queues. Spreading
the I/O across four queues instead of one just means that the device will
process I/O from each of the four queues at ¼ the rate compared to all of
the I/O being submitted on a single queue.
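To make the arithmetic behind this concrete, here is a small worked example (mine, not from the original mail) applying Little's law to the numbers quoted later in this thread: with a fixed per-device rate, the average completion latency depends only on the total outstanding I/O, not on how it is split across queue pairs.

```python
# Little's law: average latency = outstanding I/O / throughput.
# 450,000 IOPS is the example figure from the SPDK "Scaling Performance" doc.

DEVICE_IOPS = 450_000  # per-device rate; the device, not the queue, is the bottleneck

def avg_latency_us(total_queue_depth, iops=DEVICE_IOPS):
    """Average per-I/O latency in microseconds for a given total queue depth."""
    return total_queue_depth / iops * 1_000_000

# One queue pair at depth 128:
single = avg_latency_us(128)
# Four queue pairs at depth 32 each -- the same 128 total outstanding I/Os:
quad = avg_latency_us(32 * 4)

print(f"1 qpair  @ qd 128: {single:.1f} us")
print(f"4 qpairs @ qd 32:  {quad:.1f} us")  # identical: ~284.4 us
```

Either way the device sees 128 outstanding I/Os and completes them at the same rate, so the per-I/O latency is identical.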
For BlobFS, spreading the I/O across multiple NVMe queues would not
normally help with latency. There are NVMe features such as Weighted Round
Robin (WRR), which provide different priorities to different queues. With
WRR, multiple NVMe queues could be used to separate high-priority I/O (e.g.
WAL writes) from lower-priority I/O (e.g. background compaction I/O).
However, most NVMe devices today do not support WRR, and even on devices
that do, it is still questionable whether WRR alone would be sufficient or
whether additional software queuing would be required.
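The "additional software queuing" mentioned above could look something like the sketch below. This is purely illustrative, not SPDK code: two software queues with hypothetical 4:1 weights, drained weighted-round-robin so WAL writes get more device submissions per round than compaction writes.

```python
# Illustrative-only software WRR across two priority classes.
# Queue names and the 4:1 weighting are made up for this sketch.
from collections import deque

queues = {
    "wal":        {"q": deque(), "weight": 4},  # high priority (e.g. WAL writes)
    "compaction": {"q": deque(), "weight": 1},  # low priority (background I/O)
}

def submit(name, io):
    queues[name]["q"].append(io)

def wrr_drain():
    """One WRR round: pop up to `weight` I/Os from each queue, in priority order."""
    batch = []
    for name, entry in queues.items():
        for _ in range(entry["weight"]):
            if entry["q"]:
                batch.append(entry["q"].popleft())
    return batch

for i in range(6):
    submit("wal", f"wal-{i}")
    submit("compaction", f"compact-{i}")

print(wrr_drain())  # 4 WAL I/Os, then 1 compaction I/O
```

Hardware WRR would do the equivalent arbitration inside the device across real NVMe submission queues; the software version is what you would fall back to when the device lacks the feature.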
Is WRR a device feature or a driver software feature?
If it is a device feature: our research lab currently has two 300GB P3700
SSDs. Will they support WRR?
If it is a software feature: does the SPDK block driver now support WRR? I
am guessing the BlobFS layer does not support WRR, in that there is only one
dedicated core/thread/qpair doing the I/O. Is my understanding right?
*From: *SPDK <spdk-bounces(a)lists.01.org> on behalf of Fenggang Wu <
*Reply-To: *Storage Performance Development Kit <spdk(a)lists.01.org>
*Date: *Wednesday, January 31, 2018 at 10:49 AM
*To: *Storage Performance Development Kit <spdk(a)lists.01.org>
*Cc: *"wuxx0835(a)umn.edu" <wuxx0835(a)umn.edu>
*Subject: *[SPDK] Performance Scaling in BlobFS/RocksDB by Multiple I/O
I read the SPDK doc "NVMe Driver Design -- Scaling Performance" (here),
which says:
"For example, if a device claims to be capable of 450,000 I/O per second
at queue depth 128, in practice it does not matter if the driver is using 4
queue pairs each with queue depth 32, or a single queue pair with queue
depth 128."
Does this consider the queuing latency? I am guessing the latency in the
two cases (qp/qd = 4/32 vs. qp/qd = 1/128) will be different: in the
4-thread case, the latency will be 1/4 of the 1-thread case. Do I get it
right?
If so, then I am confused, because the document also says:
"In order to take full advantage of this scaling, applications should
consider organizing their internal data structures such that data is
assigned exclusively to a single thread."
Please correct me if I get it wrong. I understand that if the dedicated I/O
thread has total ownership of the I/O data structures, there is no lock
contention to slow down the I/O. I believe BlobFS is also designed with
this philosophy, in that only one thread is doing I/O.
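The single-ownership pattern described above can be sketched in a few lines (my illustration, with made-up names, not BlobFS internals): requests are statically sharded so each lands on exactly one worker's private list, and no lock is ever taken on the I/O path.

```python
# Illustrative-only sketch of "assign data exclusively to a single thread".
# Instead of one shared, locked request list, each worker owns a private
# list; the router decides ownership, so workers never contend on a lock.

NUM_WORKERS = 4
# One private request list per worker -- exclusive ownership, no locking.
per_worker_requests = [[] for _ in range(NUM_WORKERS)]

def route(request_id):
    """Statically shard requests so each one maps to exactly one worker."""
    return request_id % NUM_WORKERS

for rid in range(16):
    per_worker_requests[route(rid)].append(rid)

# Every request ended up in exactly one list; no shared state was touched.
print([len(reqs) for reqs in per_worker_requests])  # [4, 4, 4, 4]
```

The contrast with a shared structure is the point: with exclusive ownership the synchronization cost is moved into the (cheap) routing decision instead of a lock taken on every I/O.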
But consider the RocksDB case: if the shared data structures are already
largely taken care of by the RocksDB logic via locking (which is inevitable
anyway), then the I/O requests each RocksDB thread sends to BlobFS could
also go through that thread's own queue pair. More I/O threads would mean a
shorter queue depth per queue and a smaller queuing delay.
Even if some FS metadata operations require locking, I would guess such
metadata operations take only a small fraction of the overall work.
Therefore, is it a viable idea to have more I/O threads in BlobFS to serve
the multi-threaded RocksDB with smaller delay? What would be the pitfalls
or challenges?
Any thoughts/comments are appreciated. Thank you very much!