I changed my code to call MPI_Barrier() after the writes for i=0 and i=1,
just to ensure that one task does not race ahead in the writing. The
behavior is the same.
Here is a picture of what is going on. The top frame is the value of
cached_mb for the OSC versus time. The bottom frame indicates the part
of the file that is being written (blue) over time, followed by the 3
readings (red) of the file. Note that the first and second reads are
slow and the third is fast. I also included an image with the 2 frames
overlaid. What doesn't show very well is that the read from 1M-2M
(issued by the other task/node) is what is blocked while the cache is
draining. It shows faintly in the overlaid image: at a file position
near zero (vertical axis) there is a longer horizontal red line.
If you like, I could send you my test code. It is not pretty, but only
about 100 lines and, hopefully, easy to follow.
On 4/9/2015 5:18 PM, Drokin, Oleg wrote:
There's no way to dump the entire OSC cache of a different node with code like this.
What is likely happening (assuming a stripe count of 1), though, is:
task 1: requests a Write lock from 0 to 1M,
server: sees that there are no other locks on this file, grants a lock wider than
requested, from 0 to EOF (greedy extension, helping to minimize lock traffic in case of
sequential writes),
task 1: writes a bunch of 1M records to the file, all of them ending up in the cache,
task 2: requests a Write lock from 1M to 2M,
server: sees the task 1 lock in conflict with the newly requested lock, instructs it to
return the lock,
node of task 1: sees there are a bunch of pages under the lock requested to be dropped,
drops the pages from cache (writing any dirty ones to the server first), drops the lock
-- this is the cache flush you are likely talking about,
server: sees that there are no more conflicts in the file, grants a 1M to EOF lock to
task 2 (greedy extension only works forward, not backward),
task 2: writes the 1M it wanted to write,
task 1: requests a Write lock at XM to X+1M,
task 1: gets the XM to EOF lock and proceeds to write till the end of file, retaining
everything in the cache.
task 1: requests a Read lock from 0 to 1M,
server: sees there's a conflict at NM and only extends the lock to NM-1 (so grants a
lock from 0 to NM-1),
task 1: proceeds to read in 1M chunks until it reaches NM, repopulating the cache in
the process,
task 2: somewhere in the middle of the above, hopefully, requests a 1M to 2M Read lock and
gets a 1M to NM-1 Read lock (not invalidating the nonconflicting Read lock from task 1),
task 2: reads the 1M it wanted to read, retaining it in the cache,
task 1: wanting to read 1M at offset NM, finds a locally cached Write lock from NM to EOF
with data cached under it, does not send any requests to the server, and reads the data from
the cache (or not*).
*) - in theory there would be data; in practice, by default all OSCs in total are only
allowed to cache up to 50% of system RAM, and there is also a per-OSC "budget" (not
sure how much at this point).
If the writes happen to go beyond that limit, LRU logic comes into play and will be
wiping the oldest data from cache as you are reading (or even writing), so there's no
guarantee there is any cached data at the offsets you want to read.
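For reference, the cache budgets described above can be inspected with lctl on a mounted Lustre client. This is a sketch, and an assumption on my part that these are the relevant tunables; exact parameter names vary across Lustre versions:

```shell
# Client-wide page-cache budget in MB (defaults to roughly 50% of RAM)
lctl get_param llite.*.max_cached_mb

# Per-OSC dirty-data budget, in MB
lctl get_param osc.*.max_dirty_mb

# Current DLM lock LRU sizes for each OSC namespace
lctl get_param ldlm.namespaces.*osc*.lru_size
```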
So a second reading (with no writes in between) should preserve the caches as long as you
have enough LRU cache space. Are you sure any locking happens at this point? You can
enable ldlm tracing (echo +ldlm > /proc/sys/lnet/debug on the client), run the
rereading, and then dump the debug log with lctl dk > /tmp/log.txt to see what sort of
locking was done
(or, if you have access to the OSS where the writes/reads are destined, you can do the same
on the OSS instead, and then you'll get the server's view of all locking and conflicts).
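Put together, the tracing steps above look like the following on the client (the OSS variant is the same commands run on the server). The grep filter is my own suggestion, not from the message:

```shell
# Enable DLM lock tracing on the client
echo +ldlm > /proc/sys/lnet/debug

# ... run the re-reading test here ...

# Dump the kernel debug buffer and look at the lock traffic
lctl dk > /tmp/log.txt
grep -i ldlm /tmp/log.txt | less
```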
On Apr 9, 2015, at 4:57 PM, John Bauer wrote:
> Thanks for the description of lock conversion.
> In my effort to get a handle on how the locking works, I have written a 2-task MPI
testcase, each task on a separate node, where each task primarily writes to one file.
Then each task primarily reads the same file. Both writing and reading have the
exception of doing the i==1 read/write to the other task's file. There are
MPI_Barrier calls between all the major events, just not between the individual reads and
writes.
> #define BSIZE 1048576
> crossWriteBlock = 1 ;
> /* inside the per-block loop over i: */
> int fd = (i == crossWriteBlock) ? other_fd : my_fd ;
> lseek64(fd, (off64_t)BSIZE * i, SEEK_SET) ;  /* cast avoids int overflow for large i */
> ssize_t ret = write(fd, buffer, BSIZE) ;
> Code snippet is the same for reading; just use read instead of write.
> Each file is striped on one OST, with a 1MB stripe size.
> There are two things that I am not understanding why they are happening.
> 1) The single read from each task of the other task's file causes the OSC on
the other node to dump its entire OSC cache, even though that cache never had that 1MB
section of the file in it.
> 2) After the tasks are done with their reading, each task does the same reading
again. The OSCs' caches are dumped again as the second reading starts. I was thinking
that all the locking would be straightened out for reading after the first reading, but
the second reading has the same behavior as the first.
> Any thoughts?
> On 4/7/2015 10:03 PM, Drokin, Oleg wrote:
>> Lock conversion is a change of lock properties.
>> I.e. once the lock is issued, you can no longer change the lock mode (e.g.
READ->WRITE or WRITE->READ) or lock properties (e.g. extent size, or lock bits).
>> On Apr 7, 2015, at 11:00 PM, John Bauer wrote:
>>> The recent HPDD discussion about file read/write operations, started by
>>> , prompted me to go back and look at this issue. In your reply you mentioned
that Lustre does not support lock conversion ( first paragraph below). What do you mean
by lock conversion?
>>> On 4/21/2014 12:59 AM, Dilger, Andreas wrote:
>>>> It seems pretty clear to me. The file is cached under a write lock on the
first client, and when it is read by the second client the write lock needs to be
cancelled by the first client and all of the pages under that lock dropped from cache
before the read lock can be granted to the reading client. Lustre does not support lock
conversion, so the write lock must be cancelled first.
>>>> That said, it surprises me that canceling the write lock should take so
long - a good fraction of the time it took to write in the first place. With a 10GB file,
that is 2.5M 4KB pages dropped in about 4s, or about 600K pages per second. I'm not
sure how this stacks up to other filesystems, but I expect it is slower because of the DLM
lock overhead.
>>>> What version of Lustre is this? If it is not the current master version,
is it possible to repeat the test with master to see if it is improved?
>>>> Cheers, Andreas
>>>> On Apr 20, 2014, at 14:01, "John Bauer" <
>>>> I have been trying to understand a behavior I am observing in an IOR
benchmark on Lustre. I have pared it down to a simple example.
>>>> The IOR benchmark is running in MPI mode. There are 2 ranks, each
running on its own node. Each rank does the following:
>>>> Note : Test was run on the "swan" cluster at Cray Inc., using
>>>> write a file. ( 10GB )
>>>> fsync the file
>>>> close the file
>>>> open the file that was written by the other rank.
>>>> read the file that was written by the other rank.
>>>> close the file that was written by the other rank.
>>>> The writing of each file goes as expected.
>>>> The fsync takes very little time (about .05 seconds).
>>>> The first reads of the file (written by the other rank) start out very
slowly. While these first reads are proceeding slowly, the
>>>> kernel's cached memory (the Cached: line in /proc/meminfo) decreases
from the size of the file just written to nearly zero.
>>>> Once the cached memory has reached nearly zero, the file reading proceeds normally.
>>>> I have attached a jpg of the instrumentation of the processes that
illustrates this behavior.
>>>> My questions are:
>>>> Why does the reading of the file, written by the other rank, wait until
the cached data drains to nearly zero before proceeding normally?
>>>> Shouldn't the fsync ensure that the file's data is written to the
backing storage, so that this draining of the cached memory should simply be
>>>> releasing pages with no further I/O?
>>>> For this case the "dead" time is only about 4 seconds, but this
"dead" time scales directly with the size of the files.
>>> I/O Doctors, LLC
>>> HPDD-discuss mailing list
> I/O Doctors, LLC