You don't really explain what you are using the "lfs find" data for, so it is hard to help you optimize your usage. It is possible, for example, to specify multiple OSTs at once for "lfs find" (e.g. if emptying 4 OSTs at once), but that may not be what you want to do.
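For example (a sketch, assuming a hypothetical mount point /mnt/lustre and filesystem name "lustre" — adjust for your site), the -O/--ost option of "lfs find" accepts a comma-separated list of OST UUIDs, so one pass over the namespace can match files on any of several OSTs:

```shell
# Hypothetical mount point and OST names; one scan covers files that
# have objects on ANY of the four listed OSTs, instead of four scans:
lfs find /mnt/lustre \
    --ost lustre-OST0000_UUID,lustre-OST0001_UUID,lustre-OST0002_UUID,lustre-OST0003_UUID
```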
As for the MDS memory problem, that is caused by huge inode/DLM lock caches on the clients, and was fixed at one point. I don't know the bug number offhand, but you could find it in Jira. As a workaround you can also periodically flush the lock caches on the clients via:

lctl set_param ldlm.namespaces.*mdc*.lru_size=clear
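Alternatively (a sketch, not something tested at your scale — the value shown is an example, not a recommendation), instead of flushing periodically you can cap the client lock LRU at a fixed size so the caches never grow unbounded in the first place:

```shell
# Cap the MDC lock LRU to a fixed number of locks per namespace
# (2000 is an illustrative value; tune for your workload and RAM):
lctl set_param ldlm.namespaces.*mdc*.lru_size=2000
```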
You could avoid all of the repeated scanning by using the Robinhood policy engine (RBH) to index the filesystem once, then running queries against the RBH database, and using the Lustre ChangeLog to keep RBH updated without the need to re-scan the whole filesystem.
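A minimal sketch of the ChangeLog side, assuming a single MDT named lustre-MDT0000 (your MDT name will differ): register a changelog consumer on the MDS, let Robinhood read the records, and clear the ones already processed so they don't accumulate:

```shell
# On the MDS: register a changelog consumer; prints an ID such as "cl1".
lctl --device lustre-MDT0000 changelog_register

# On a client: dump the pending changelog records for that MDT.
lfs changelog lustre-MDT0000

# Acknowledge processed records for consumer cl1 (0 = all current records),
# so the MDT can free them.
lfs changelog_clear lustre-MDT0000 cl1 0
```

Robinhood does the read/clear cycle itself when configured as a changelog reader; the commands above just illustrate the mechanism.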
Lustre Principal Architect
Intel High Performance Data Division
On 2016/06/06, 08:58, "Kumar, Amit" <email@example.com> wrote:
I believe there is no real answer to optimizing “lfs find”, but I am still trying to see if I can learn more in case there is one.
Q1) I have been trying to scan 39 OSTs using “lfs find”, and it is taking forever. Are there any tips or tricks to speed this up? Scans take anywhere between 15-24 hours per OST to finish, if all goes well without interruption. I am parallelizing my scans across multiple clients to speed this up, but I don’t know of any alternative approaches.
Q2) On the other hand, when I start “lfs find” on each of the 39 OSTs, I doom my MDS server to a kernel panic due to an out-of-memory condition. Any tips on how I can reduce this load and keep the MDS from running out of memory?
Q3) Situation: a client dies for some reason, or the “lfs find” command it is running times out with an Input/Output error or a transport shutdown. (I never saw this until I started running multiple “lfs find” commands while simultaneously running lfs_migrate for the files identified to be moved off the OSTs.)
Observation: I believe the MDS continues to run and serve the scan request for “lfs find” on a thread until it fails or notices that it has evicted the client, tying up resources on the MDS in the meantime. I don’t know if this makes sense, but I am guessing this is loading my MDS with RPC requests and causing further slowdown. Are there any tunable options here?
I wish there was some kind of indexing we could do to avoid deep scans.