On many of our compute nodes, something is causing ldlm_poold to start consuming 100% CPU. The nodes are fine for a while, then at some point something goes wrong and ldlm_poold goes nuts. Sometimes a number of ldlm_bl_ processes are also consuming large amounts of CPU.

Once a node gets into this state, it has to be rebooted to recover. The node won't even shut down cleanly, because the umount of /lustre hangs.

There are no Lustre-related messages in syslog while this is happening.

The clients and servers have all been running Lustre 2.5.3 for the last few months on Red Hat 6 (kernel 2.6.32-504.3.3.el6.x86_64).

I saw bug report LU-5415, which looks similar, but that report says 2.5.3 already includes a fix for that particular problem.

Has anyone else seen this behavior?

Thanks,

Kevin

---

Kevin Hildebrand

Division of IT

University of Maryland, College Park