Hi, I am running some experiments to evaluate the performance of peer-to-peer DMA. I am using SPDK to control the NVMe drives, with the fio plugin compiled against SPDK. I am seeing weird behavior: with 4K IOs at an IO depth of 1, peer-to-peer DMA from an NVMe drive to a PCI device (which exposes memory via BAR1) in a different NUMA node has a 50th percentile latency of 17 usecs. The same experiment with the NVMe device and the PCI device in the same NUMA node (node 1) has a latency of 38 usecs. In both cases fio was running on a node 0 CPU core and the PCI device (which exposes memory via BAR1) was connected to node 1. DMA from the NVMe device to host memory also takes 38 usecs.

To summarize the cases:

1. NVMe (NUMA node 0) - PCI device (NUMA node 1)   --- 18 usecs
2. NVMe (NUMA node 1) - PCI device (NUMA node 1)   --- 38 usecs
3. NVMe (NUMA node 0) - host memory   --- 38 usecs

fio was running on a NUMA node 0 CPU core in all cases.
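
In case it helps anyone reproduce this, here is a minimal sketch of how the NUMA placement in the cases above could be double-checked. It assumes Linux and reads the `numa_node` attribute the kernel exposes under `/sys/bus/pci/devices/<BDF>/`. The function names and the `sysfs_root` parameter are made up for illustration, and the PCI addresses are passed on the command line rather than being the ones from my setup.

```python
from pathlib import Path
import sys


def pci_numa_node(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node reported for a PCI device, or None if unknown.

    bdf is a PCI address such as "0000:04:00.0" (hypothetical example).
    """
    path = Path(sysfs_root) / bdf / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        # Device missing, attribute unreadable, or unexpected content.
        return None
    # The kernel reports -1 when it has no NUMA affinity information.
    return node if node >= 0 else None


def same_numa_node(bdf_a, bdf_b, sysfs_root="/sys/bus/pci/devices"):
    """True only if both devices report the same, known NUMA node."""
    node_a = pci_numa_node(bdf_a, sysfs_root)
    node_b = pci_numa_node(bdf_b, sysfs_root)
    return node_a is not None and node_a == node_b


if __name__ == "__main__":
    # Usage: python numa_check.py <nvme BDF> <pci device BDF> ...
    for bdf in sys.argv[1:]:
        print(bdf, pci_numa_node(bdf))
```

Running it with the NVMe and BAR1 device addresses prints the node each one sits on, which should match the node numbers listed in the cases above. Note that this only shows NUMA affinity, not whether the two devices share a PCIe switch or root port.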

For higher IO depth values, latency in the cross-NUMA case (case 1 above) increases steeply and is worse than in cases 2 and 3.

Any pointers on why this could be happening?

The two NVMe devices used are identical Intel datacenter SSDs (400G).