Hi All, 

I was reading the SPDK doc "NVMe Driver Design -- Scaling Performance" (here), which says:

" For example, if a device claims to be capable of 450,000 I/O per second at queue depth 128, in practice it does not matter if the driver is using 4 queue pairs each with queue depth 32, or a single queue pair with queue depth 128."

Does this take queuing latency into account? I am guessing the latency in the two cases will be different (qp/qd = 4/32 vs. qp/qd = 1/128): in the 4-thread case, the latency would be 1/4 of the 1-thread case. Do I get it right?
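To make the arithmetic behind my guess explicit, here is a back-of-envelope sketch using Little's law (mean latency = outstanding I/Os / throughput). The 450K IOPS figure is from the doc's example; how the controller divides its service rate across queue pairs is purely my assumption, and is exactly the part I am unsure about:

```python
# Back-of-envelope queuing latency via Little's law:
#   mean latency W = outstanding I/Os / throughput.
# 450K IOPS at QD 128 is the SPDK doc's example figure; the two
# "4 qp" scenarios below are my own assumptions about how the
# controller's service rate is shared among queue pairs.

IOPS = 450_000  # aggregate device throughput at total QD 128

def latency_us(outstanding, iops):
    """Mean per-I/O latency in microseconds by Little's law."""
    return outstanding / iops * 1e6

# Single queue pair, queue depth 128:
print(f"1 qp x qd 128          : {latency_us(128, IOPS):.1f} us")

# Four queue pairs, qd 32 each, if the 450K IOPS is split evenly
# (each queue serviced at ~112.5K IOPS) -- latency is unchanged:
print(f"4 qp x qd 32, shared   : {latency_us(32, IOPS / 4):.1f} us")

# Four queue pairs, qd 32 each, if each queue somehow saw the full
# service rate -- this is where my '1/4 the latency' guess comes from:
print(f"4 qp x qd 32, full rate: {latency_us(32, IOPS):.1f} us")
```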

If so, I am confused, because the document also says:

"In order to take full advantage of this scaling, applications should consider organizing their internal data structures such that data is assigned exclusively to a single thread."

Please correct me if I get this wrong. I understand that if a dedicated I/O thread has total ownership of the I/O data structures, there is no lock contention to slow down the I/O. I believe BlobFS is also designed with this philosophy, in that only one thread does I/O.

But consider the RocksDB case: the shared data structures are already largely protected by RocksDB's own locking (which is inevitable anyway), so each RocksDB thread could also have its own queue pair for the I/O requests it sends to BlobFS. More I/O threads means a shorter queue depth per queue pair, and thus a smaller queuing delay.

Even if some FS metadata operations still require locking, I would guess such metadata operations account for only a small portion of the total I/O.

Therefore, is it a viable idea to have more I/O threads in BlobFS to serve the multi-threaded RocksDB with lower latency? What would be the pitfalls or challenges?

Any thoughts/comments are appreciated. Thank you very much!

Best!
-Fenggang