Hi Fenggang,


The max IOPS number is per-device, not per-queue.  The observed latency for each I/O, from submission to completion, will be the same whether the 128 I/Os are submitted on one queue or spread across four queues.  Spreading the I/O across four queues instead of one just means that the device processes I/O from each of the four queues at one quarter of the rate it would if everything were submitted on a single queue.
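To make the arithmetic concrete, here is a back-of-the-envelope sketch (not SPDK code) applying Little's Law (latency = outstanding I/Os / throughput) to the 450,000 IOPS / QD 128 figures from the SPDK doc quoted below:

```python
# Little's Law: average latency = outstanding I/Os / throughput.
DEVICE_IOPS = 450_000  # device saturates at this rate with 128 outstanding I/Os

# One queue pair at queue depth 128:
latency_1x128 = 128 / DEVICE_IOPS  # seconds per I/O

# Four queue pairs at queue depth 32 each: the device serves each
# queue at roughly 1/4 of its total rate.
per_queue_iops = DEVICE_IOPS / 4
latency_4x32 = 32 / per_queue_iops  # seconds per I/O

print(f"1 qp  x qd 128: {latency_1x128 * 1e6:.0f} us")
print(f"4 qps x qd 32:  {latency_4x32 * 1e6:.0f} us")
```

Both cases come out to the same per-I/O latency (about 284 us with these numbers): each of the four queues holds a quarter of the outstanding I/O but is also drained at a quarter of the rate, so the two effects cancel.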


For BlobFS, spreading the I/O across multiple NVMe queues would not normally help with latency.  There are NVMe features such as Weighted Round Robin (WRR) that provide different priorities to different queues.  With WRR, multiple NVMe queues could be used to separate high-priority I/O (e.g. WAL writes) from lower-priority I/O (e.g. background compaction I/O).  However, most NVMe devices today do not support WRR, and even on those that do, it is questionable whether WRR alone would be sufficient or whether additional software queuing would be required.
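To illustrate the arbitration idea, here is a toy model (plain Python, not SPDK code, and a simplification of the real NVMe arbitration mechanism): with WRR the controller pulls more commands per arbitration round from higher-weight queues, so a high-priority WAL queue drains ahead of a bulk compaction queue.  The queue names and weights below are hypothetical.

```python
from collections import deque

def wrr_schedule(queues, weights, rounds):
    """Toy weighted-round-robin: each arbitration round, pull up to
    `weight` commands from each queue, highest weight first."""
    order = []
    for _ in range(rounds):
        for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
            for _ in range(w):
                if queues[name]:
                    order.append(queues[name].popleft())
    return order

# Hypothetical workload: latency-sensitive WAL writes vs. bulk compaction I/O.
queues = {
    "wal":        deque(f"wal-{i}" for i in range(4)),
    "compaction": deque(f"comp-{i}" for i in range(8)),
}
weights = {"wal": 4, "compaction": 1}  # commands drawn per arbitration round

print(wrr_schedule(queues, weights, rounds=4))
# All four WAL commands are serviced in the first round, ahead of
# most of the compaction backlog.
```

In SPDK, queue priority on a WRR-capable device would be requested via the qpair allocation options when creating the I/O queue pair; the toy model above only shows why that separation helps the high-priority stream.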



From: SPDK <spdk-bounces@lists.01.org> on behalf of Fenggang Wu <fenggang@cs.umn.edu>
Reply-To: Storage Performance Development Kit <spdk@lists.01.org>
Date: Wednesday, January 31, 2018 at 10:49 AM
To: Storage Performance Development Kit <spdk@lists.01.org>
Cc: "wuxx0835@umn.edu" <wuxx0835@umn.edu>
Subject: [SPDK] Performance Scaling in BlobFS/RocksDB by Multiple I/O Threads


Hi All, 


I read the SPDK doc "NVMe Driver Design -- Scaling Performance" (here), which says:


" For example, if a device claims to be capable of 450,000 I/O per second at queue depth 128, in practice it does not matter if the driver is using 4 queue pairs each with queue depth 32, or a single queue pair with queue depth 128."


Does this consider the queuing latency? I am guessing the latency in the two cases will be different (qp/qd = 4/32 vs. qp/qd = 1/128). In the 4-thread case, the latency will be 1/4 of the 1-thread case. Do I get it right?


If so, then I got confused as the document also says:


"In order to take full advantage of this scaling, applications should consider organizing their internal data structures such that data is assigned exclusively to a single thread."


Please correct me if I get it wrong. I understand that if the dedicated I/O thread has total ownership of the I/O data structures, there is no lock contention to slow down the I/O. I believe BlobFS is also designed with this philosophy, in that only one thread is doing I/O.


But considering the RocksDB case: if the shared data structures are already largely taken care of by the RocksDB logic via locking (which is inevitable anyway), each RocksDB thread could also have its own queue pair for the I/O requests it sends to BlobFS. More I/O threads would mean shorter queue depths and smaller queuing delay.


Even if some FS metadata operations require locking, I would guess such metadata operations make up only a small portion.


Therefore, is it a viable idea to have more I/O threads in BlobFS to serve the multi-threaded RocksDB with smaller delay? What would be the pitfalls or challenges?


Any thoughts/comments are appreciated. Thank you very much!