Running "rm -rf" causes high MDS memory usage
by Richard F. Mohr, Jr. (Rick Mohr)
I have run into a situation where a user's "rm -rf" process seems to cause very high buffer usage on our MDS server. I verified that this process was the cause by sending the STOP signal to the "rm" command and noticing that buffer growth on the MDS slowed to a crawl; when I then sent the CONT signal, the buffer size started growing again. I dropped all the caches on the MDS and timed the growth: it increased by about 17 GB in 15 minutes. I have been dropping the caches periodically to contain the growth while I investigate the problem. Unfortunately, after each drop the new low point is a little higher than it was before, which means there will come a point where dropping the caches is no longer effective.
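For anyone who wants to reproduce the observation, the steps I used were roughly as follows. This is only a sketch: the PID 12345 is a placeholder for the user's actual rm process, and the destructive lines (signals, drop_caches) are commented out since they need the real PID and root on the MDS.

```shell
#!/bin/sh
# Sketch of the diagnostic steps (PID 12345 is a placeholder, not a real PID).

# 1. On the client: pause the suspect rm and watch whether buffer growth
#    on the MDS slows; resume it to confirm that growth picks back up.
#      kill -STOP 12345
#      kill -CONT 12345

# 2. On the MDS: sample the Buffers line from /proc/meminfo before and
#    after dropping caches, then time the regrowth.
buffers_kb() {
    awk '/^Buffers:/ {print $2}' /proc/meminfo
}

before=$(buffers_kb)
# Requires root; writing 3 drops page cache plus dentries/inodes:
#   sync; echo 3 > /proc/sys/vm/drop_caches
sleep 5
after=$(buffers_kb)
echo "Buffers changed by $(( after - before )) kB in 5s"
```

In my case the regrowth after a drop was roughly 17 GB per 15 minutes while the rm was running.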
With that in mind, I have had to stop the user's "rm" process to contain the damage. Since that is only a temporary band-aid, I am trying to get a handle on the underlying problem. I searched the Lustre bugs, and the closest matches I could find were LU-4740 (and maybe LU-4906?), but it's not clear whether those are the cause of my problem. The odd thing is that a recursive rm is not an uncommon command to run, and I have not noticed this behavior before.
The server is running:
- CentOS 6.5
- kernel 2.6.32-358.23.2
- Lustre 2.4.3
The client is running:
- CentOS 6.2
- kernel 2.6.32-358.23.2
- Lustre 1.8.9-wc1
Has anyone else seen a similar issue?
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu