Dear Lustre,

 

We are running Lustre v2.4.3.

 

This past weekend our changelogs filled up, as reported by the error: “changelog failed: rc=-28”
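
For readers decoding the return code: kernel-style codes such as rc=-28 are negated errno values. The following is a minimal sketch, assuming a Linux host; the helper name describe_rc is hypothetical and not part of any Lustre tool.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Map a negative kernel-style return code (e.g. -28) to its errno name and message."""
    code = abs(rc)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"rc={rc}: {name} ({os.strerror(code)})"


# The code from the log line above.
print(describe_rc(-28))
```

On Linux, 28 corresponds to ENOSPC, which matches the changelog having run out of room.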

 

Looking at changelog_users showed the following:

# cat /proc/fs/lustre/mdd/scratch-MDT0000/changelog_users
current index: 8822156503
ID    index
cl1   4676806604

 

It seems we had about 4 billion unconsumed entries (the gap between the current index and the cl1 index). We use changelogs mainly for Robinhood and LMT.
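
The backlog figure comes from subtracting each reader's index from the current index. Here is a minimal sketch of that arithmetic, assuming the changelog_users layout shown above and assuming reader IDs start with "cl"; the function name changelog_backlog is hypothetical.

```python
def changelog_backlog(text: str) -> dict:
    """Return {reader_id: unconsumed_record_count} from changelog_users output."""
    current = None
    indices = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("current index:"):
            current = int(line.split(":", 1)[1])
        elif line.startswith("cl"):
            # Assumption: registered readers are named cl1, cl2, ...
            reader, index = line.split()[:2]
            indices[reader] = int(index)
    if current is None:
        raise ValueError("no 'current index:' line found")
    return {reader: current - index for reader, index in indices.items()}


# Sample copied from the output in this message.
SAMPLE = """current index: 8822156503
ID    index
cl1   4676806604
"""

print(changelog_backlog(SAMPLE))  # {'cl1': 4145349899}
```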

 

I have some observations and questions as follows:

 

(a)    I could not run the deregister command on either the MDS or the Robinhood server. Here is what happens on the Robinhood server, even though the file system is mounted:

# lctl --device scratch-MDT0000 changelog_deregister cl1
No device found for name scratch-MDT0000: Invalid argument

(b)    I was, however, able to run “lfs changelog_clear scratch-MDT0000 cl1 0” on the Robinhood server.

 

Q1) I am not sure why I was unable to run “lctl --device scratch-MDT0000 changelog_deregister cl1”. On the MDS it fails with: “error: changelog_deregister: No such file or directory”
 
Q2)*** When we hit this issue, files that were mid-transaction or being edited showed up as empty after I cleared the changelog. Is this expected behavior, or is something else going on here?
 
Q3)*** It seems that every file that was open for editing during this issue now shows up empty and reports “Bad address” if we try to save or remove it. My understanding is that changelogs are mainly for stats and profiling, so clearing them should not affect data. Is that correct?
 
Q4) We are setting up our new Lustre-ZFS solution in the next couple of weeks, along with Robinhood and LMT. Assuming we may run into this issue again, what would you recommend for setting up changelogs? Is there a system variable that can be used to increase or decrease the number of changelog records that can be stored?
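
To make the concern in Q4 concrete, here is a minimal sketch of a check that flags readers whose backlog exceeds a limit, so a stalled consumer could be noticed before the changelog fills. The function name lagging_readers and the threshold of one billion records are hypothetical illustrations, not recommended values.

```python
def lagging_readers(current_index: int, reader_indices: dict, max_backlog: int) -> dict:
    """Return {reader_id: backlog} for readers whose backlog exceeds max_backlog."""
    lagging = {}
    for reader, index in reader_indices.items():
        backlog = current_index - index
        if backlog > max_backlog:
            lagging[reader] = backlog
    return lagging


# Values from the changelog_users output in this message; the limit is made up.
print(lagging_readers(8822156503, {"cl1": 4676806604}, 1_000_000_000))
```

A check like this could run periodically from cron, with the indices read from the changelog_users file for each MDT.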
 
Any help/insight here is greatly appreciated. 

Thank you,
Amit