I changed my code to call MPI_Barrier() after the writes for i=0 and i=1, just to ensure that one task does not race ahead in the writing.  The behavior is the same.

Here is a picture of what is going on.  The top frame is the value of cached_mb for the OSC versus time.  The bottom frame indicates the part of the file that is being written (blue) over time, followed by the 3 readings (red) of the file.  Note that the first and second reads are slow and the third is fast.  I also included an image with the 2 frames overlaid.  What doesn't show very well is that the read from 1M-2M (issued by the other task/node) is what is blocked while the cache is draining.  It kind of shows in the overlaid image: at a file position near zero (vertical axis) there is a longer horizontal red line.

If you like, I could send you my test code.  It is not pretty, but only about 100 lines and, hopefully, easy to follow.


On 4/9/2015 5:18 PM, Drokin, Oleg wrote:

There's no way to dump the entire OSC cache of a different node with code like this.
What likely is happening (assuming stripe count of 1), though is:
task 1: request a Write lock from 0 to 1M,
server: sees that there are no other locks on this file, grants a lock wider than requested, from 0 to EOF (greedy extension, helping to minimize lock traffic in case of low contention)
task 1: write a bunch of 1M records to the file, all of them ending up in the cache
task 2: request a Write lock from 1M to 2M,
server: sees the task1 lock in conflict with the newly requested lock, instructs it to return the lock
node of task 1: sees there are a bunch of pages under the lock requested to be dropped, drops the pages from cache (writing any dirty ones to server first), drops the lock
                -- This is the cache flush you are likely talking about.
server: sees that there's no more conflicts in the file, grants 1M to EOF lock to task 2 (greedy extension only works forward, not backward)
task 2: writes the 1M it wanted to write.
task 1: request the Write lock at XM to X+1M
task 1: get the XM to EOF lock and proceed to write till the end of file retaining everything in the cache.
task 1: request a Read lock from 0 to 1M
server: sees there's a conflict at NM and only extends the lock to NM-1 (so grants 0 to NM-1)
task 1: proceeds to read in 1M chunks until it reaches NM, repopulating the cache in process
task 2: somewhere in the middle of the above, hopefully, requests a 1M to 2M Read lock and gets a 1M to NM-1 Read lock (not invalidating the nonconflicting Read lock from task 1)
task 2: reads the 1M it wanted to read and retains it in the cache
task 1: wanting to read 1M at offset NM, finds a locally cached write lock from NM to EOF with data cached under it, does not send any requests to server, reads the data from the cache (or not*)

*) - in theory there would be data; in practice, by default all OSCs in total are only allowed to cache up to 50% of system RAM, and there is also a per-OSC "budget" (not sure how much at this point).
If the writes happen to go beyond that limit, LRU logic comes into play and would be wiping the oldest data from cache as you are reading (or even writing), so there's never going
to be any cached data at the offsets you want to read.
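The grant logic described in the trace above (greedy forward extension, clipped at the nearest conflicting lock) can be sketched roughly as follows. This is a hypothetical simplification for illustration, not the actual LDLM policy code; extent_t, grant_extent() and LUSTRE_EOF are names made up for this sketch (Lustre's real EOF sentinel is OBD_OBJECT_EOF, i.e. ~0ULL).

```c
#include <assert.h>
#include <stdint.h>

#define LUSTRE_EOF UINT64_MAX  /* made-up stand-in for OBD_OBJECT_EOF */

typedef struct {
    uint64_t start;
    uint64_t end;    /* inclusive */
} extent_t;

/* Widen a requested extent forward (never backward) up to the start of
 * the nearest conflicting lock, or to EOF when there is no conflict. */
static extent_t grant_extent(extent_t requested,
                             const extent_t *conflicts, int nconflicts)
{
    extent_t granted = requested;
    uint64_t limit = LUSTRE_EOF;

    for (int i = 0; i < nconflicts; i++)
        if (conflicts[i].start > requested.end &&
            conflicts[i].start - 1 < limit)
            limit = conflicts[i].start - 1;

    granted.end = limit;  /* greedy extension: forward only */
    return granted;
}
```

With a policy like this, a later conflicting request forces the wide lock to be called back, which is exactly the cache flush discussed above.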

So the second reading (with no writes in between) should preserve caches as long as you have enough cache LRU space. Are you sure any locking happens at this point? You can
enable ldlm tracing (echo +ldlm >/proc/sys/lnet/debug on the client), run the rereading, and then dump the debug log with lctl dk >/tmp/log.txt to see what sort of locking was done
(or, if you know the OSS your writes/reads are destined to, you can do the same on the OSS instead and then you'll get the server's view of all locking and conflicts).

On Apr 9, 2015, at 4:57 PM, John Bauer wrote:


Thanks for the description of lock conversion.  

In my effort to get a handle on how the locking works, I have written a 2-task MPI testcase, each task on a separate node, where each task primarily writes to one file.  Then each task primarily reads the same file.  Both the writing and the reading make an exception for block i==1, which goes to the other task's file.  There are MPI_Barrier calls between all the major events, just not between the individual reads and writes.

#define BSIZE 1048576
   int crossWriteBlock = 1;
   for (int i = 0; i < numBlocks; i++) {
      /* block i==crossWriteBlock goes to the other task's file */
      int fd = (i == crossWriteBlock) ? other_fd : my_fd;
      lseek64(fd, (off64_t)BSIZE * i, SEEK_SET);
      ssize_t ret = write(fd, buffer, BSIZE);
   }
The code snippet is the same for reading; just use read instead of write.
Each file is striped on one OST, with a 1MB stripe size.

There are two things happening that I do not understand:

   1) The single read from each task of the other task's file causes the OSC on the other node to dump its entire cache, even though that cache never held the 1MB section of the file being read.

   2) After the tasks are done with their reading, each task does the same reading again.  The OSCs' caches are dumped again as the second reading starts.  I would think that all the locking would be straightened out after the first reading, but the second reading behaves the same as the first.

Any thoughts?


On 4/7/2015 10:03 PM, Drokin, Oleg wrote:
Lock conversion is a change of lock properties.

I.e. once the lock is issued, you can no longer change the lock mode (e.g. READ->WRITE or WRITE->READ) or lock properties (e.g. extent size, or lock bits).

On Apr 7, 2015, at 11:00 PM, John Bauer wrote:


The recent HPDD discussion about file read/write operations, started by 
, prompted me to go back and look at this issue.  In your reply you mentioned that Lustre does not support lock conversion (first paragraph below).  What do you mean by lock conversion?



On 4/21/2014 12:59 AM, Dilger, Andreas wrote:

It seems pretty clear to me. The file is cached under a write lock on the first client, and when it is read by the second client the write lock needs to be cancelled by the first client and all of the pages under that lock dropped from cache before the read lock can be granted to the reading client. Lustre does not support lock conversion, so the write lock must be cancelled first.

That said, it surprises me that canceling the write lock should take so long - a good fraction of the time it took to write in the first place.  With a 10GB file, that is 2.5M 4KB pages dropped in about 4s, or about 600K pages per second. I'm not sure how this stacks up to other filesystems, but I expect it is slower because of the DLM locking.
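The figures above can be sanity-checked with a little arithmetic (helper names are made up for this sketch; 10GB is taken to mean 10 GiB): a 10GB file holds 2,621,440 4KB pages, and dropping them in about 4s works out to roughly 655K pages per second, consistent with the ~600K figure.

```c
#include <assert.h>

/* Pages occupied by a file of the given size. */
static long long pages_in_file(long long file_bytes, long long page_bytes)
{
    return file_bytes / page_bytes;
}

/* Pages dropped per second when the whole file is flushed in `seconds`. */
static long long page_drop_rate(long long file_bytes, long long page_bytes,
                                long long seconds)
{
    return pages_in_file(file_bytes, page_bytes) / seconds;
}
```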

What version of Lustre is this? If it is not the current master version, is it possible to repeat the test with master to see if it is improved?

Cheers, Andreas

On Apr 20, 2014, at 14:01, "John Bauer" <
I have been trying to understand a behavior I am observing in an IOR benchmark on Lustre.  I have pared it down to a simple example.
The IOR benchmark is running in MPI mode.  There are 2 ranks, each running on its own node.  Each rank does the following:

Note : Test was run on the "swan" cluster at Cray Inc., using /lus/scratch

write a file. ( 10GB )
fsync the file
close the file
open the file that was written by the other rank.
read the file that was written by the other rank.
close the file that was written by the other rank.
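The per-rank sequence above can be sketched in plain POSIX C. This is a single-process sketch with made-up names and tiny sizes; without MPI and two Lustre clients it cannot reproduce the slow-read behavior, it only shows the syscall pattern each rank follows.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BSIZE   4096   /* scaled down from 1MB for this sketch */
#define NBLOCKS 4      /* scaled down from a 10GB file */

/* One rank's sequence: write my file, fsync, close, then open and read
 * the file "written by the other rank".  Returns bytes read, or -1. */
static long run_rank(const char *my_file, const char *other_file)
{
    char buf[BSIZE];
    memset(buf, 'x', sizeof(buf));

    /* write my file, fsync so the data reaches backing storage, close */
    int fd = open(my_file, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    for (int i = 0; i < NBLOCKS; i++)
        if (write(fd, buf, BSIZE) != BSIZE)
            return -1;
    fsync(fd);
    close(fd);

    /* open and read the other rank's file, then close it */
    fd = open(other_file, O_RDONLY);
    if (fd < 0)
        return -1;
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, BSIZE)) > 0)
        total += n;
    close(fd);
    return total;
}
```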

The writing of each file goes as expected.
The fsync takes very little time ( about .05 seconds).
The first reads of the file (written by the other rank) start out very slowly.  While these first reads are proceeding slowly, the
kernel's cached memory ( the Cached: line in /proc/meminfo) decreases from the size of the file just written to nearly zero.
Once the cached memory has reached nearly zero, the file reading proceeds as expected.

I have attached a jpg of the instrumentation of the processes that illustrates this behavior.

My questions are:
Why does the reading of the file, written by the other rank, wait until the cached data drains to nearly zero before proceeding normally?
Shouldn't the fsync ensure that the file's data is written to the backing storage, so that this draining of the cached memory should simply be
releasing pages with no further I/O?

For this case the "dead" time is only about 4 seconds, but this "dead" time scales directly with the size of the files.


I/O Doctors, LLC


HPDD-discuss mailing list
