Andreas

What is really puzzling to me is that the reading of the file is not completely blocked until all the locks are released.  The attached jpg has the bottom
frame zoomed in to show what is happening as the reading of the file gets started.  While all this write-lock releasing is happening, a few of the reads
manage to get serviced, and in a rather repeatable fashion.  The image depicts only one of the two processes; if I overlay the second process, its
reads behave in a very similar fashion.  If the reads do not have to wait for all the locks to be released, then I wonder what order the locks are
released in.  If it were low file position to high file position, the early reads should be cleared quickly.  If it were oldest to youngest, again the early
reads should be cleared quickly.

Here is the Lustre module that is loaded, if that helps. 
lustre-cray_ari_s/2.4_3.0.80_0.5.1_1.0501.7664.13.1-1.0501.14968.20.1
I will let Patrick Farrell of Cray address whether we can test on the latest master version.

As you can probably tell from some of my questions posted to this group, I am relatively new to Lustre.  Thanks for your patience and help.

John

On 4/21/2014 12:59 AM, Dilger, Andreas wrote:
It seems pretty clear to me. The file is cached under a write lock on the first client, and when it is read by the second client the write lock needs to be cancelled by the first client and all of the pages under that lock dropped from cache before the read lock can be granted to the reading client. Lustre does not support lock conversion, so the write lock must be cancelled first.

That said, it surprises me that canceling the write lock should take so long - a good fraction of the time it took to write in the first place.  With a 10GB file, that is 2.5M 4KB pages dropped in about 4s, or about 600K pages per second. I'm not sure how this stacks up to other filesystems, but I expect it is slower because of the DLM locking.

What version of Lustre is this? If it is not the current master version, is it possible to repeat the test with master to see if it is improved?

Cheers, Andreas

On Apr 20, 2014, at 14:01, "John Bauer" <bauerj@iodoctors.com> wrote:

I have been trying to understand a behavior I am observing in an IOR benchmark on Lustre.  I have pared it down to a simple example.
The IOR benchmark is running in MPI mode.  There are 2 ranks, each running on its own node.  Each rank does the following (a rough sketch in C follows the list):

Note: The test was run on the "swan" cluster at Cray Inc., using /lus/scratch.

write a file (10 GB)
fsync the file
close the file
MPI_Barrier
open the file that was written by the other rank
read the file that was written by the other rank
close the file that was written by the other rank
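
Here is a rough sketch in C of that per-rank sequence, in case it helps make the pattern concrete.  It is not the actual IOR code; the file names, transfer size, and the error-free write/read loops are just assumptions for illustration.

#include <mpi.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE (10L * 1024 * 1024 * 1024)  /* 10 GB per rank */
#define XFER      (1L * 1024 * 1024)          /* 1 MB transfers (assumed) */

int main(int argc, char **argv)
{
    int rank, nranks, fd;
    char mine[64], theirs[64];
    char *buf = malloc(XFER);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* hypothetical file names under /lus/scratch, one per rank */
    snprintf(mine,   sizeof(mine),   "/lus/scratch/ior_test.%d", rank);
    snprintf(theirs, sizeof(theirs), "/lus/scratch/ior_test.%d", (rank + 1) % nranks);
    memset(buf, 'x', XFER);

    /* write a 10 GB file, fsync it, close it */
    fd = open(mine, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    for (long off = 0; off < FILE_SIZE; off += XFER)
        write(fd, buf, XFER);
    fsync(fd);
    close(fd);

    /* wait until both ranks have finished writing */
    MPI_Barrier(MPI_COMM_WORLD);

    /* open, read, and close the file written by the other rank */
    fd = open(theirs, O_RDONLY);
    while (read(fd, buf, XFER) > 0)
        ;
    close(fd);

    free(buf);
    MPI_Finalize();
    return 0;
}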

The writing of each file goes as expected.
The fsync takes very little time (about 0.05 seconds).
The first reads of the file (written by the other rank) start out very slowly.  While these first reads are proceeding slowly, the
kernel's cached memory (the Cached: line in /proc/meminfo) decreases from the size of the file just written to nearly zero.
Once the cached memory has reached nearly zero, the file reading proceeds as expected.
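
For anyone wanting to reproduce that observation, the Cached: value can be watched with something as simple as the following (a minimal sketch that just polls /proc/meminfo once a second; it is not the instrumentation used to produce the attached plot).

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char line[128];

    for (;;) {
        FILE *f = fopen("/proc/meminfo", "r");
        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f))
            if (strncmp(line, "Cached:", 7) == 0)
                fputs(line, stdout);   /* e.g. "Cached:  10485760 kB" */
        fclose(f);
        sleep(1);
    }
    return 0;
}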

I have attached a jpg of the instrumentation of the processes that illustrates this behavior.

My questions are:
Why does the reading of the file, written by the other rank, wait until the cached data drains to nearly zero before proceeding normally?
Shouldn't the fsync ensure that the file's data is written to the backing storage, so that draining the cached memory is simply a matter of
releasing pages, with no further I/O?

For this case the "dead" time is only about 4 seconds, but this "dead" time scales directly with the size of the files.

John


-- 
John Bauer
I/O Doctors LLC
507-766-0378
bauerj@iodoctors.com