We have seen MDSes (admittedly with smallish memory) OOM'ing while testing 2.5.3, whereas there was no problem with 2.5.0. It turns out that, even though we have lru_size=800 everywhere, the client LDLM LRUs are growing huge, so the unreclaimable ldlm slabs on the MDS fill memory.

It looks like the root cause is the change to ldlm_cancel_aged_policy() in commit 0a6c6fcd46 on the 2.5 branch (LU-4786 osc: to not pick busy pages for ELC) - it changed the lru_size != 0 behaviour. Prior to that, the non-lru_resize behaviour (at least through the early_lock_cancel path, which is what we see being hit) was effectively

       cancel lock if  (too many in lru cache || lock unused too long)

In 2.5.3, it's

       cancel lock if  (too many in lru cache && lock unused too long) 
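
For concreteness, here is a minimal C sketch of the difference (the function names and parameters below are hypothetical, chosen just for illustration - this is not the actual ldlm_cancel_aged_policy() code):

    #include <stdbool.h>

    /* 2.5.0-era behaviour: either condition alone cancels the lock,
     * so the LRU gets trimmed back whenever it exceeds lru_size. */
    static bool should_cancel_250(unsigned int lru_len, unsigned int lru_size,
                                  long idle_time, long max_age)
    {
            return lru_len > lru_size || idle_time > max_age;
    }

    /* 2.5.3 behaviour after 0a6c6fcd46: both conditions must hold,
     * so locks used within the max-age window survive even when the
     * LRU is far over lru_size. */
    static bool should_cancel_253(unsigned int lru_len, unsigned int lru_size,
                                  long idle_time, long max_age)
    {
            return lru_len > lru_size && idle_time > max_age;
    }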

Disabling early_lock_cancel doesn't seem to help.

It might be arguable which of the two behaviours is correct, but the lru_size documentation suggests the former - the latter makes lru_size != 0 ineffective in practice, since any lock used recently enough is kept no matter how far the LRU has grown past the limit. It also looks like the change was not actually necessary for LU-4300?

Cheers
David