Thanks for the description of lock conversion.
In my effort to get a handle on how the locking works, I have written a
2-task MPI testcase, each task on a separate node, where each task
primarily writes to one file; then each task primarily reads that same
file. The one exception, in both the writing and the reading phases, is
that the i==1 block is written to (or read from) the other task's file.
There are MPI_Barrier calls between all the major events, just not
between the individual reads and writes.

    #define BSIZE 1048576
    int crossWriteBlock = 1;
    int fd = (i == crossWriteBlock) ? other_fd : my_fd;
    ssize_t ret = write(fd, buffer, BSIZE);

The snippet is the same for reading, just with read() instead of
write(). Each file is striped on one OST, with a 1MB stripe size.
There are two things happening here that I do not understand:

1) The single read by each task from the other task's file causes the
OSC on the other node to dump its entire OSC cache, even though that
cache never held that 1MB section of the file.

2) After the tasks are done with their reading, each task does the same
reading again. The OSCs' caches are dumped again as the second pass
starts. I would have thought all the locking would be straightened out
after the first reading, but the second reading behaves just like the
first.
On 4/7/2015 10:03 PM, Drokin, Oleg wrote:
Lock conversion is a change of lock properties.
I.e. once the lock is issued, you can no longer change the lock mode (e.g. READ->WRITE
or WRITE->READ) or lock properties (e.g. extent size, or lock bits).
On Apr 7, 2015, at 11:00 PM, John Bauer wrote:
> The recent HPDD discussion about file read/write operations, started by
akhilesh.gadde(a)stonybrook.edu, prompted me to go back and look at this issue. In your
reply you mentioned that Lustre does not support lock conversion (first paragraph below).
What do you mean by lock conversion?
> On 4/21/2014 12:59 AM, Dilger, Andreas wrote:
>> It seems pretty clear to me. The file is cached under a write lock on the first
client, and when it is read by the second client the write lock needs to be cancelled by
the first client and all of the pages under that lock dropped from cache before the read
lock can be granted to the reading client. Lustre does not support lock conversion, so the
write lock must be cancelled first.
>> That said, it surprises me that canceling the write lock should take so long - a
good fraction of the time it took to write in the first place. With a 10GB file, that is
2.5M 4KB pages dropped in about 4s, or about 600K pages per second. I'm not sure how
this stacks up to other filesystems, but I expect it is slower because of the DLM locking.
>> What version of Lustre is this? If it is not the current master version, is it
possible to repeat the test with master to see if it is improved?
>> Cheers, Andreas
>> On Apr 20, 2014, at 14:01, "John Bauer"
>> I have been trying to understand a behavior I am observing in an IOR benchmark on
Lustre. I have pared it down to a simple example.
>> The IOR benchmark is running in MPI mode. There are 2 ranks, each running on its
own node. Each rank does the following:
>> Note : Test was run on the "swan" cluster at Cray Inc., using
>> write a file. ( 10GB )
>> fsync the file
>> close the file
>> open the file that was written by the other rank.
>> read the file that was written by the other rank.
>> close the file that was written by the other rank.
>> The writing of each file goes as expected.
>> The fsync takes very little time ( about .05 seconds).
>> The first reads of the file (written by the other rank) start out very slowly.
While these first reads are proceeding slowly, the
>> kernel's cached memory (the Cached: line in /proc/meminfo) decreases from
the size of the file just written to nearly zero.
>> Once the cached memory has reached nearly zero, the file reading proceeds as expected.
>> I have attached a jpg of the instrumentation of the processes that illustrates this behavior.
>> My questions are:
>> Why does the reading of the file, written by the other rank, wait until the
cached data drains to nearly zero before proceeding normally?
>> Shouldn't the fsync ensure that the file's data is written to the backing
storage, so that this draining of the cached memory is simply
>> releasing pages with no further I/O?
>> For this case the "dead" time is only about 4 seconds, but this
"dead" time scales directly with the size of the files.
I/O Doctors, LLC