On 09/11/2015 03:41 AM, Martin Hecht wrote:
On 09/11/2015 05:23 AM, Dilger, Andreas wrote:
> On 2015/09/10, 6:54 PM, "Chris Hunter" <chris.hunter(a)yale.edu> wrote:
>> We experienced file corruption on several OSTs. We proceeded through
>> recovery using e2fsck & ll_recover_lost_found_obj tools.
>> Following these steps, e2fsck came out clean.
>> The file corruption did not impact the MDT. The files were still
>> referenced by the MDT. Accessing the file on a lustre client (i.e. ls -l)
>> would report the error "Cannot allocate memory".
>> Following OST recovery steps, we started removing the corrupt files via
>> "unlink" command on lustre client (rm command would not remove file).
>> Now a dry-run e2fsck of the OST is reporting errors:
>> "deleted/unused inodes" in Pass 2 (checking directory structure),
>> "Unattached inodes" in Pass 4 (checking reference counts),
>> "free block count wrong" in Pass 5 (checking group summary information).
>> Are e2fsck errors expected when unlinking files?
> No, the "unlink" command is just avoiding the -ENOENT error that "rm" gets
> by calling "stat()" on the file before trying to unlink it. This
> shouldn't cause any errors on the OSTs, unless there is ongoing corruption
> from the back-end storage.
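
To illustrate the difference described in the reply above, here is a minimal Python sketch. It is an assumption-laden simplification, not the actual implementation of rm or unlink: unlink_without_stat() issues a bare unlink(2), while remove_like_rm() is a hypothetical stand-in for any tool that calls lstat() first and gives up if that fails. The function names and demo() are made up for this example.

```python
import errno
import os
import tempfile


def unlink_without_stat(path):
    """Remove path with a bare unlink(2), never calling stat() first.

    Returns None on success, or the errno name if the unlink itself fails.
    """
    try:
        os.unlink(path)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


def remove_like_rm(path):
    """Hypothetical stat-first removal: lstat(), then unlink().

    If lstat() fails, the unlink is never attempted, which is the
    behaviour the reply attributes to "rm".
    """
    try:
        os.lstat(path)
    except OSError as exc:
        return "stat failed: " + errno.errorcode.get(exc.errno, str(exc.errno))
    return unlink_without_stat(path)


def demo():
    """Exercise both helpers on a scratch file in the local temp directory."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    first = unlink_without_stat(path)   # file exists, so it is removed
    second = unlink_without_stat(path)  # file is already gone
    third = remove_like_rm(path)        # fails at the lstat() step
    return first, second, third
```

On a Lustre client the stat step is the one that needs the OST object attributes, so a file whose OST object is damaged can fail there even though the MDT entry can still be unlinked.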
Chris, with "live filesystem" you mean that you ran a read-only e2fsck on
a lustre file system while it was mounted and clients were working on the
file system? Then it is expected that e2fsck reports some errors,
because the file system contents change while e2fsck is running and
the in-memory directory structure no longer matches the on-disk data.
However, as Andreas points out, it might as well be a sign of
ongoing corruption on the storage, but only an offline e2fsck (i.e.
while the OST is unmounted, and the journal is played back) can clarify this.
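
A rough sketch of such an offline check, assuming Python is available on the OSS and that the OST block device appears in the first column of /proc/mounts while it is mounted. The helper names and the device path passed in are hypothetical. The exit-status bits are the ones listed in the e2fsck(8) man page; "-f" forces a check and "-n" opens the file system read-only and answers "no" to every question.

```python
import subprocess

# Exit status bits as documented in the e2fsck(8) man page.
E2FSCK_EXIT_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def decode_e2fsck_status(status):
    """Translate an e2fsck exit status into a list of messages."""
    if status == 0:
        return ["no errors"]
    return [msg for bit, msg in sorted(E2FSCK_EXIT_BITS.items()) if status & bit]


def is_mounted(device, mounts_text):
    """True if device is the first field of any /proc/mounts-style line."""
    return any(line.split()[:1] == [device] for line in mounts_text.splitlines())


def offline_readonly_check(device):
    """Run 'e2fsck -f -n' on device, refusing to run if it is mounted.

    device is a hypothetical OST block device path supplied by the caller.
    """
    with open("/proc/mounts") as mounts:
        if is_mounted(device, mounts.read()):
            raise RuntimeError(device + " is mounted; unmount the OST first")
    result = subprocess.run(["e2fsck", "-f", "-n", device])
    return decode_e2fsck_status(result.returncode)
```

A status containing "file system errors left uncorrected" from a run on the unmounted OST would point at real on-disk inconsistencies rather than at changes made while e2fsck was reading.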
Hi Martin, good point. The filesystem is active (3 clients) so e2fsck
errors could be due to uncommitted journal transactions.
It would be nice to rule out underlying hardware issues before we do a