We've been seeing many of our compute nodes get into a state where ldlm_poold
starts consuming 100% CPU. A node runs fine for a while, then at some point something
goes wrong and ldlm_poold goes nuts. Sometimes a number of ldlm_bl_ threads are
also consuming large amounts of CPU.
Once a node gets into this state, it has to be rebooted to recover. The node won't
even shut down cleanly, since the umount of /lustre ends up hanging.
There are no Lustre-related messages in syslog while the erroneous behavior is taking
place.
The clients and servers have all been running 2.5.3 on Red Hat 6 for the last few months.
I saw bug report LU-5415, which looks similar, but that report says 2.5.3 already
includes a fix for that particular problem.
Has anyone else seen this behavior?
Division of IT
University of Maryland, College Park