Hi Darby,
This looks similar to LU-3542. I think it would be useful to get an image
of the filesystem for analysis, as we only have one image so far and it's
not as badly damaged as yours. Can you do:
e2image -Q /dev/OSTnnnn OSTnnnn.qcow
bzip2 -9 OSTnnnn.qcow
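(For reference, the -Q option writes a metadata-only qcow image, so none of the
actual file data is included and it compresses well. If we need a raw image to
poke at with debugfs on our side, I believe it can be converted back with
something along the lines of:
e2image -r OSTnnnn.qcow OSTnnnn.raw
though that's just a sketch from the e2image man page, not something I've
tested on this image.)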
After you take the image, you can clean up the filesystem by creating a
debugfs script that unlinks the files manually. It should look something like:
unlink /O/0/d4/3102500
unlink /O/0/d2/3102690
...(1032 entries)
You can parse the entries out of the 'has deleted/unused inode' messages in
the e2fsck output. Then it's a matter of running debugfs -w -f <script> <dev>.
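In case it saves some typing, here's a rough sketch of generating that script
from a saved copy of the e2fsck output (assuming you captured it to a file,
which I'm calling e2fsck.log here -- the file and device names are
placeholders, and I haven't run this against your exact log):
grep "has deleted/unused inode" e2fsck.log | \
  awk -F"'" '{ split($3, a, " "); print "unlink " a[2] "/" $2 }' > unlink.debugfs
debugfs -w -f unlink.debugfs /dev/OSTnnnn
The awk pulls the object name from between the single quotes and the parent
directory from the text after it, so each line comes out like
"unlink /O/0/d4/3102500". A final e2fsck -fy pass afterwards should confirm
the stale directory entries are gone.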
HTH,
Kit
---
Kit Westneat
L3 Lustre Support, DDN
703-659-3869
On 11/07/2013 09:51 AM, Vicker, Darby (JSC-EG311) wrote:
Hello,
We have a corrupted OST that doesn’t seem to be in a repairable state, and I’m looking for
guidance on how to proceed. We are running Lustre 2.1.6 on CentOS 6.4. The OSTs are built
on a hardware RAID6. When the trouble started (yesterday) we unmounted the OST and did a
verify on the RAID6 volume, which came back clean. We then ran an fsck, which found
several problems.
# e2fsck -fy /dev/Storage/ost0
e2fsck 1.42.7.wc1 (12-Apr-2013)
hpfs2eg3-OST000b: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '3102500' in /O/0/d4 (19398664) references inode 26072855 in group 203694 where _INODE_UNINIT is set.
Fix? yes
Entry '3102500' in /O/0/d4 (19398664) has an incorrect filetype (was 1, should be 0).
Fix? yes
Entry '3102690' in /O/0/d2 (19398662) references inode 26073045 in group 203695 where _INODE_UNINIT is set.
Fix? yes
Entry '3102690' in /O/0/d2 (19398662) has an incorrect filetype (was 1, should be 0).
Fix? yes
<cut - there were a total of 1024 of these>
Restarting e2fsck from the beginning...
One or more block group descriptor checksums are invalid. Fix? yes
Group descriptor 203688 checksum is 0x3e34, should be 0x6fc3. FIXED.
Group descriptor 203689 checksum is 0x492e, should be 0x18d9. FIXED.
Group descriptor 203690 checksum is 0x8112, should be 0xd0e5. FIXED.
Group descriptor 203691 checksum is 0x9b7c, should be 0xca8b. FIXED.
Group descriptor 203692 checksum is 0xe010, should be 0xb1e7. FIXED.
Group descriptor 203693 checksum is 0xf06f, should be 0xa198. FIXED.
Group descriptor 203694 checksum is 0xc0ee, should be 0x9119. FIXED.
Group descriptor 203695 checksum is 0xac6d, should be 0xfd9a. FIXED.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '3102500' in /O/0/d4 (19398664) has deleted/unused inode 26072855.
Clear? yes
Entry '3102690' in /O/0/d2 (19398662) has deleted/unused inode 26073045.
Clear? yes
Entry '3102208' in /O/0/d0 (19398660) has deleted/unused inode 26072563.
Clear? yes
<cut - there are a total of 1032 of these>
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Inode bitmap differences: -(26072065--26073088)
Fix? yes
hpfs2eg3-OST000b: ***** FILE SYSTEM WAS MODIFIED *****
hpfs2eg3-OST000b: 2246277/28501376 files (4.4% non-contiguous), 1827729689/7296336896 blocks
Remounting the OST doesn’t go well.
Nov 7 07:39:01 hpfs2-eg3-oss11 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts:
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts:
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: Lustre: MGC10.148.0.142@o2ib: Reactivating import
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: LustreError: 137-5: hpfs2eg3-OST000b: Not available for connect from 10.148.2.75@o2ib (not set up)
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: Lustre: 11446:0:(ldlm_lib.c:2025:target_recovery_init()) RECOVERY: service hpfs2eg3-OST000b, 281 recoverable clients, last_transno 35381660
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: LustreError: 137-5: hpfs2eg3-OST000b: Not available for connect from 10.148.1.195@o2ib (not set up)
Nov 7 07:39:02 hpfs2-eg3-oss11 kernel: Lustre: hpfs2eg3-OST000b: Now serving hpfs2eg3-OST000b/ on /dev/mapper/Storage-ost0 with recovery enabled
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: Lustre: hpfs2eg3-OST000b: sending delayed replies to recovered clients
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: Lustre: hpfs2eg3-OST000b: received MDS connection from 10.148.0.142@o2ib
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LDISKFS-fs error (device dm-2): ldiskfs_lookup: deleted inode referenced: 26072429
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: Aborting journal on device dm-2-8.
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LDISKFS-fs (dm-2): Remounting filesystem read-only
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11511:0:(filter.c:1506:filter_fid2dentry()) hpfs2eg3-OST000b: object 3102074:0 lookup error: rc -5
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11511:0:(filter_lvb.c:105:filter_lvbo_init()) hpfs2eg3-OST000b: bad object 3102074/0: rc -5
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11511:0:(ldlm_resource.c:1090:ldlm_resource_get()) lvbo_init failed for resource 3102074: rc -5
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11590:0:(fsfilt-ldiskfs.c:382:fsfilt_ldiskfs_start()) error starting handle for op 8 (71 credits): rc -30
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11588:0:(llog_cat.c:485:llog_cat_process_thread()) llog_cat_process() failed -30
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11261:0:(fsfilt-ldiskfs.c:382:fsfilt_ldiskfs_start()) error starting handle for op 8 (106 credits): rc -30
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11261:0:(filter.c:3588:filter_destroy_precreated()) error destroying precreate objid 6883425: -30
Nov 7 07:39:45 hpfs2-eg3-oss11 kernel: LustreError: 11261:0:(filter.c:3588:filter_destroy_precreated()) error destroying precreate objid 6883424: -30
Unmounting the OST and running the fsck again produces exactly the same output as before;
a diff of the two outputs shows no differences. Any idea why these problems aren’t being
corrected by the fsck? Is there anything else I can try?
Thanks,
Darby