I have been trying to understand a behavior I am observing in an IOR benchmark on Lustre.  I have pared it down to a simple example.
The IOR benchmark is running in MPI mode.  There are 2 ranks, each running on its own node.  Each rank does the following (sketched in code after the list):

Note: the test was run on the "swan" cluster at Cray Inc., using /lus/scratch.

write a 10 GB file
fsync the file
close the file
MPI_Barrier
open the file that was written by the other rank
read the file that was written by the other rank
close the file that was written by the other rank
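For concreteness, here is a minimal sketch of that per-rank sequence in C with MPI and POSIX calls.  The file names, transfer size, and missing error handling are simplifications of mine, not what IOR actually does:

    /* Minimal sketch of the per-rank sequence (not the actual IOR code).
     * Assumes 2 ranks: rank 0 writes file 0 and reads file 1, rank 1 the reverse.
     * Paths, sizes, and error handling are simplified. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define FILE_SIZE  (10L * 1024 * 1024 * 1024)   /* 10 GB */
    #define BLOCK_SIZE (1L * 1024 * 1024)           /* 1 MB transfers */

    int main(int argc, char **argv)
    {
        int rank, nranks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        char mine[64], theirs[64];
        snprintf(mine,   sizeof(mine),   "/lus/scratch/ior.%d", rank);
        snprintf(theirs, sizeof(theirs), "/lus/scratch/ior.%d", (rank + 1) % nranks);

        char *buf = malloc(BLOCK_SIZE);
        memset(buf, 'x', BLOCK_SIZE);

        /* write a 10 GB file, fsync it, close it */
        int fd = open(mine, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        for (long off = 0; off < FILE_SIZE; off += BLOCK_SIZE)
            write(fd, buf, BLOCK_SIZE);
        fsync(fd);          /* completes in about 0.05 s in the runs described below */
        close(fd);

        /* wait until every rank has finished writing */
        MPI_Barrier(MPI_COMM_WORLD);

        /* open, read, and close the file written by the other rank */
        fd = open(theirs, O_RDONLY);
        for (long off = 0; off < FILE_SIZE; off += BLOCK_SIZE)
            read(fd, buf, BLOCK_SIZE);   /* these first reads are the slow ones */
        close(fd);

        free(buf);
        MPI_Finalize();
        return 0;
    }

The slow phase described below shows up in the read loop that follows the barrier.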

The writing of each file goes as expected.
The fsync takes very little time (about 0.05 seconds).
The first reads of the file (written by the other rank) start out very slowly.  While these first reads are proceeding slowly, the
kernel's cached memory (the Cached: line in /proc/meminfo) decreases from the size of the file just written to nearly zero.
Once the cached memory has reached nearly zero, the file reading proceeds as expected.
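
The cached-memory curve can be reproduced by simply sampling the Cached: line of /proc/meminfo once per second while the benchmark runs; a minimal sampler along those lines (just an illustration, not the instrumentation behind the attached jpg) would be:

    /* Minimal sketch: print the Cached: line of /proc/meminfo once per second. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            FILE *fp = fopen("/proc/meminfo", "r");
            char line[128];
            while (fp && fgets(line, sizeof(line), fp)) {
                if (strncmp(line, "Cached:", 7) == 0) {
                    printf("%s", line);    /* e.g. "Cached:  10485760 kB" */
                    break;
                }
            }
            if (fp) fclose(fp);
            fflush(stdout);
            sleep(1);
        }
        return 0;
    }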

I have attached a jpg of the instrumentation of the processes that illustrates this behavior.

My questions are:
Why does the reading of the file written by the other rank wait until the cached data drains to nearly zero before proceeding normally?
Shouldn't the fsync ensure that the file's data has been written to the backing storage, so that draining the cached memory is simply a matter of
releasing pages, with no further I/O?

In this case the "dead" time is only about 4 seconds, but it scales directly with the size of the files.

John


-- 
John Bauer
I/O Doctors LLC
507-766-0378
bauerj@iodoctors.com