Interestingly, I just came across an old bug report, LU-416, which mentions that flushing the page cache alleviates the problem. I tried that on some of my affected machines, and it does indeed seem to return ldlm_poold to normal operation.
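For anyone who wants to try the same workaround, here is a minimal sketch of flushing the page cache from Python through the standard Linux /proc/sys/vm/drop_caches interface. The function name and its parameters are my own for illustration, and the choice of mode "1" (page cache only) is an assumption; the exact procedure described in LU-416 is not reproduced here. Writing to the real path needs root.

```python
import os

# Hypothetical helper (not from LU-416): asks the kernel to drop clean caches.
# Modes per the kernel's vm sysctl docs: "1" = page cache,
# "2" = dentries and inodes, "3" = both.
def drop_page_cache(mode="1", path="/proc/sys/vm/drop_caches"):
    if mode not in ("1", "2", "3"):
        raise ValueError("mode must be '1', '2', or '3'")
    # Flush dirty pages first; drop_caches only frees clean ones.
    os.sync()
    with open(path, "w") as f:
        f.write(mode + "\n")
```

Calling drop_page_cache() as root on an affected client is roughly equivalent to running sync and then writing 1 to /proc/sys/vm/drop_caches by hand. The path parameter only exists so the function can be pointed at a scratch file for a dry run.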
On a lot of our compute nodes, something is causing ldlm_poold to start consuming 100% CPU. The nodes are fine for a while, and then at some point something goes wrong and ldlm_poold goes nuts. Sometimes a bunch of ldlm_bl_ processes are also consuming large amounts of CPU.
Once the node gets into this state, it has to be rebooted to fix it. The node won't even shut down cleanly, as umount of /lustre ends up hanging.
There are no Lustre-related messages in syslog while the erroneous behavior is taking place.
The clients and servers have all been running Lustre 2.5.3 for the last few months on Red Hat 6 (kernel 2.6.32-504.3.3.el6.x86_64).
I saw bug report LU-5415, which looks similar, but that report says that 2.5.3 includes a fix for that particular problem.
Has anyone else seen this behavior?
Division of IT
University of Maryland, College Park