ldiskfs corruption
by Vicker, Darby (JSC-EG311)
Hello,
We have a corrupted OST that doesn't seem to be in a repairable state, and I'm looking for guidance on how to proceed. We are running Lustre 2.1.6 on CentOS 6.4, with the OSTs built on hardware RAID6. When the trouble started (yesterday), we unmounted the OST and ran a verify on the RAID6 volume, which came back clean. We then ran an fsck, which found several problems:
# e2fsck -fy /dev/Storage/ost0
e2fsck 1.42.7.wc1 (12-Apr-2013)
hpfs2eg3-OST000b: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '3102500' in /O/0/d4 (19398664) references inode 26072855 in group 203694 where _INODE_UNINIT is set.
Fix? yes
Entry '3102500' in /O/0/d4 (19398664) has an incorrect filetype (was 1, should be 0).
Fix? yes
Entry '3102690' in /O/0/d2 (19398662) references inode 26073045 in group 203695 where _INODE_UNINIT is set.
Fix? yes
Entry '3102690' in /O/0/d2 (19398662) has an incorrect filetype (was 1, should be 0).
Fix? yes
<cut - there were a total of 1024 of these>
Restarting e2fsck from the beginning...
One or more block group descriptor checksums are invalid. Fix? yes
Group descriptor 203688 checksum is 0x3e34, should be 0x6fc3. FIXED.
Group descriptor 203689 checksum is 0x492e, should be 0x18d9. FIXED.
Group descriptor 203690 checksum is 0x8112, should be 0xd0e5. FIXED.
Group descriptor 203691 checksum is 0x9b7c, should be 0xca8b. FIXED.
Group descriptor 203692 checksum is 0xe010, should be 0xb1e7. FIXED.
Group descriptor 203693 checksum is 0xf06f, should be 0xa198. FIXED.
Group descriptor 203694 checksum is 0xc0ee, should be 0x9119. FIXED.
Group descriptor 203695 checksum is 0xac6d, should be 0xfd9a. FIXED.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '3102500' in /O/0/d4 (19398664) has deleted/unused inode 26072855. Clear? yes
Entry '3102690' in /O/0/d2 (19398662) has deleted/unused inode 26073045. Clear? yes
Entry '3102208' in /O/0/d0 (19398660) has deleted/unused inode 26072563. Clear? yes
<cut - there were a total of 1032 of these>
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Inode bitmap differences: -(26072065--26073088)
Fix? yes
hpfs2eg3-OST000b: ***** FILE SYSTEM WAS MODIFIED *****
hpfs2eg3-OST000b: 2246277/28501376 files (4.4% non-contiguous), 1827729689/7296336896 blocks
Remounting the OST doesn’t go well.
Nov 7 07:39:01 hpfs2-eg3-oss11 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts:
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts:
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: Lustre: MGC10.148.0.142@o2ib: Reactivating import
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: LustreError: 137-5: hpfs2eg3-OST000b: Not available for connect from 10.148.2.75@o2ib (not set up)
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: Lustre: 11446:0:(ldlm_lib.c:2025:target_recovery_init()) RECOVERY: service hpfs2eg3-OST000b, 281 recoverable clients, last_transno 35381660
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: LustreError: 137-5: hpfs2eg3-OST000b: Not available for connect from 10.148.1.195@o2ib (not set up)
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: Lustre: hpfs2eg3-OST000b: Now serving hpfs2eg3-OST000b/ on /dev/mapper/Storage-ost0 with recovery enabled
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: Lustre: hpfs2eg3-OST000b: sending delayed replies to recovered clients
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: Lustre: hpfs2eg3-OST000b: received MDS connection from 10.148.0.142@o2ib
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LDISKFS-fs error (device dm-2): ldiskfs_lookup: deleted inode referenced: 26072429
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: Aborting journal on device dm-2-8.
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LDISKFS-fs (dm-2): Remounting filesystem read-only
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11511:0:(filter.c:1506:filter_fid2dentry()) hpfs2eg3-OST000b: object 3102074:0 lookup error: rc -5
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11511:0:(filter_lvb.c:105:filter_lvbo_init()) hpfs2eg3-OST000b: bad object 3102074/0: rc -5
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11511:0:(ldlm_resource.c:1090:ldlm_resource_get()) lvbo_init failed for resource 3102074: rc -5
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11590:0:(fsfilt-ldiskfs.c:382:fsfilt_ldiskfs_start()) error starting handle for op 8 (71 credits): rc -30
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11588:0:(llog_cat.c:485:llog_cat_process_thread()) llog_cat_process() failed -30
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11261:0:(fsfilt-ldiskfs.c:382:fsfilt_ldiskfs_start()) error starting handle for op 8 (106 credits): rc -30
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11261:0:(filter.c:3588:filter_destroy_precreated()) error destroying precreate objid 6883425: -30
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11261:0:(filter.c:3588:filter_destroy_precreated()) error destroying precreate objid 6883424: -30
Unmounting the OST and running the fsck again produces exactly the same output as before - a diff of the two runs shows no differences. Any idea why these problems aren't actually being corrected by the fsck? Is there anything else I can try?
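In case it helps with diagnosis, I can inspect the on-disk state of the offending inodes directly with debugfs while the OST is unmounted. Here is a self-contained sketch of the kind of inspection I mean - it builds a small scratch ext4 image rather than touching the real device, so the commands are safe to copy (the image path and size are just placeholders):

```shell
# Build a small scratch ext4 image (no root needed) so debugfs has
# something to read; on the real system the target would be the
# unmounted OST block device instead.
dd if=/dev/zero of=/tmp/scratch.img bs=1M count=8 status=none
mke2fs -F -q -t ext4 /tmp/scratch.img

# Dump the on-disk state of an inode by number. Inode 2 (the root
# directory) always exists; on the OST I would substitute the inode
# from the ldiskfs_lookup error, e.g. 26072429.
debugfs -R 'stat <2>' /tmp/scratch.img 2>/dev/null
```

Against the real device the equivalent would be `debugfs -c -R 'stat <26072429>' /dev/Storage/ost0` (`-c` opens the filesystem in catastrophic, read-only mode), which should show whether that inode really is deleted on disk or whether the directory entry in /O/0/d* is what's stale.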
Thanks,
Darby