Lester, the Lustre lister
by David Dillow
I'd like to announce the availability of Lester, the Lustre lister. It
is available on GitHub at https://github.com/ORNL-TechInt/lester
Lester is an extension of (n)e2scan for generating lists of files (and
potentially their attributes) from an ext2/ext3/ext4/ldiskfs filesystem.
We primarily use it for generating a purge candidate list, but it is
also useful for generating a list of files affected by an OST outage or
providing a name for an inode.
For example, to list files that have not been accessed in two weeks and
put the output in ne2scan format in $OUTFILE:
touch -d 'now - 2 weeks' /tmp/flag
lester -A fslist -a before=/tmp/flag -o $OUTFILE $BLOCKDEV
To do the same thing, but generate a full listing of the filesystem in
parallel:
touch -d 'now - 2 weeks' /tmp/flag
lester -A fslist -a before=/tmp/flag -a genhit=$UNACCESSED_LIST \
    -o $FULL_LIST $BLOCKDEV
To name inodes to stdout (when not using Lustre 2.4's LINKEA):
lester -A namei -a $INODE1 -a $INODE2 ... $BLOCKDEV
To get a list of files with objects on OSTs 999 and 1000:
lester -A lsost -a 999 -a 1000 -o $OUTFILE $BLOCKDEV
To get a list of options and actions, use 'lester -h'; to get a list of
options for a given action, use 'lester -A $ACTION -a help'.
Lester uses its own AIO-based I/O engine by default, which is usually
much faster than the plain Unix engine for large filesystems on
high-performance devices. The number of requests in flight, request
size, cache size, and read-ahead settings for various phases of the scan
are all configurable. I recommend experimenting with the settings to
find a balance between speed and resource usage for your situation.
More information about the gains we've seen in testing prototype versions
of Lester is in my LUG 2011 presentation:
http://www.opensfs.org/wp-content/uploads/2012/12/500-530_David_Dillow_LU...
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
lfs_migrate on an OST results in zero files in spite of being 60% used
by Kumar, Amit
Dear All,
I have yet another interesting thing happening that I am not able to understand.
One of the OSTs that I want to migrate the data off of is 58% full. When I ran the migrate script on the entire Lustre file system, it only found a few files and migrated only a couple of gigabytes, and now it does not find any files to migrate. Any idea how to interpret this? In the past, when I was migrating an OST's data, it took a very long time because there were literally millions of files to go through. The situation is the same here, but every run of the migration script now finishes without moving anything.
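One way to target the single OST directly, rather than walking the whole filesystem, is to pipe an `lfs find` listing for that OST into `lfs_migrate`. A hedged sketch only; the OST UUID (lustre-OST0004_UUID), the client mount point (/mnt/lustre), and the OSC device index are placeholder assumptions, not values from this thread:

```shell
# List every file with at least one object on the given OST and migrate it.
# lustre-OST0004_UUID and /mnt/lustre are placeholders; substitute your own.
lfs find --obd lustre-OST0004_UUID /mnt/lustre | lfs_migrate -y

# If new files keep landing on the OST during migration, deactivate its
# OSC on the MDS first (find the device index with "lctl dl"):
#   lctl --device <osc-device-index> deactivate
```

If this pass also finds nothing, the files counted against the OST may already have been migrated or deleted, and the space may simply not have been released yet.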
Any insight into this is greatly appreciated.
Best Regards,
Amit H. Kumar
How to locate the OST object through the MDT object for Lustre 2.5.0?
by Frank Yang
Hi all,
It seems I have little chance to successfully mount the lustre filesystem
again. As a result, I'm trying to extract the data back by inspecting the
MDT and OST objects on ldiskfs. However, I can only find an example of
"Identifying To Which Lustre File an OST Object Belongs"
(http://wiki.lustre.org/manual/LustreManual20_HTML/LustreOperations.html#5...).
That's exactly the opposite direction (OST->MDT) of what I want (MDT->OST).
But if it worked, I could still build a big lookup table to get the data back,
slowly. However, it seems this example is for an older Lustre version.
Does anybody have some reference about this? I only use the default Lustre
2.5.0 configuration. It seems I need to check the oi.16 files to get the
correct mapping between MDT object and OST object.
I found a document, http://users.nccs.gov/~fwang2/papers/lustre_report.pdf,
describing the
internals of Lustre. However, I'm not sure if it's
up-to-date enough and actually I can hardly find enough time to comprehend
it. Below is a file map between OST and MDT objects that I'm sure of. If
somebody can help, it may be used as an example. Thanks a lot.
######
###### .zshrc
######
[root@old_mds ~]# debugfs -c -R "stat /ROOT/space/users2/fsyang/.zshrc" /dev/mapper/VolGroup00-LogVol03
debugfs 1.42.7.wc2 (07-Nov-2013)
/dev/mapper/VolGroup00-LogVol03: catastrophic mode - not reading inode or group bitmaps
Inode: 19988768 Type: regular Mode: 0644 Flags: 0x0
Generation: 4060094755 Version: 0x00000003:01d726a2
User: 646 Group: 100 Size: 0
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x52b4cef8:00000000 -- Sat Dec 21 07:12:56 2013
atime: 0x52b4cef8:00000000 -- Sat Dec 21 07:12:56 2013
mtime: 0x52677014:00000000 -- Wed Oct 23 14:43:32 2013
crtime: 0x52b4cef8:326ee6e8 -- Sat Dec 21 07:12:56 2013
Size of extra inode fields: 28
Extended attributes stored in inode body:
lma = "00 00 00 00 00 00 00 00 26 04 00 00 02 00 00 00 3f 24 01 00 00 00 00 00" (24)
lma: fid=[0x200000426:0x1243f:0x0] compat=0 incompat=0
lov = "d0 0b d1 0b 01 00 00 00 3f 24 01 00 00 00 00 00 26 04 00 00 02 00 00 00
00 00 10 00 01 00 00 00 ef ad 41 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00" (56)
link = "df f1 ea 11 01 00 00 00 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 18 00 00 00 02 00 00 04 26 00 00 ef 80 00 00 00 00 2e 7a 73 68 72 63" (48)
BLOCKS:
[root@myoss]# debugfs -c -R "stat /O/0/d15/4304367" /dev/sda5
debugfs 1.42.7.wc2 (07-Nov-2013)
/dev/sda5: catastrophic mode - not reading inode or group bitmaps
Inode: 5353030 Type: regular Mode: 0666 Flags: 0x80000
Generation: 3492092415 Version: 0x00000003:00c4f216
User: 646 Group: 100 Size: 658
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x52b4cef8:00000000 -- Sat Dec 21 07:12:56 2013
atime: 0x52b4cef8:00000000 -- Sat Dec 21 07:12:56 2013
mtime: 0x52677014:00000000 -- Wed Oct 23 14:43:32 2013
crtime: 0x52b4cea7:6b754c20 -- Sat Dec 21 07:11:35 2013
Size of extra inode fields: 28
Extended attributes stored in inode body:
lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ef ad 41 00 00 00 00 00" (24)
lma: fid=[0x100000000:0x41adef:0x0] compat=8 incompat=0
fid = "26 04 00 00 02 00 00 00 3f 24 01 00 00 00 00 00 " (16)
fid: parent=[0x200000426:0x1243f:0x0] stripe=0
EXTENTS:
(0):685147348
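As a sanity check on the pair above: in the lov xattr dumped on the MDT, the bytes "ef ad 41 00 00 00 00 00" hold the OST object id little-endian, and ldiskfs OSTs of this era hash objects into subdirectories d0..d31 by id modulo 32 (an assumption about this Lustre version). Plain shell arithmetic recovers the OST path used in the second debugfs command:

```shell
# "ef ad 41 00 ..." read little-endian is 0x41adef; that is the OST object id
# (it also appears in the OST inode's lma fid [0x100000000:0x41adef:0x0]).
objid=$((0x41adef))      # 4304367
subdir=$((objid % 32))   # objects are hashed into d0..d31 on the OST
echo "/O/0/d${subdir}/${objid}"
# prints /O/0/d15/4304367, matching the "stat /O/0/d15/4304367" above
```

So, for a default MDT->OST lookup, reading the lov EA of each MDT inode should be enough for single-striped files, without consulting the oi.16 files.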
[root@old_mds MDT]# ls -l
total 485260
-rw-r--r-- 1 root root 32 Dec 20 10:59 CATALOGS
-rw-r--r-- 1 root root 0 Dec 20 10:47 changelog_catalog
-rw-r--r-- 1 root root 8256 Dec 20 10:47 changelog_users
drwxr-xr-x 2 root root 4096 Dec 20 10:47 CONFIGS
-rw-rw-rw- 1 root root 8192 Dec 20 10:47 fld
-rw-r--r-- 1 root root 0 Dec 20 10:47 hsm_actions
-rw-r--r-- 1 root root 8960 Dec 20 10:47 last_rcvd
-rw-r--r-- 1 root root 64 Dec 20 10:47 lfsck_bookmark
-rw-r--r-- 1 root root 8192 Dec 20 10:47 lfsck_namespace
drwx------ 2 root root 16384 Dec 20 10:47 lost+found
-rw-r--r-- 1 root root 8 Dec 20 10:59 lov_objid
-rw-r--r-- 1 root root 8 Dec 20 10:59 lov_objseq
drwxr-xr-x 2 root root 4096 Dec 20 10:47 NIDTBL_VERSIONS
drwxr-xr-x 5 root root 4096 Dec 20 10:47 O
-rw-r--r-- 1 root root 12759040 Dec 20 10:47 oi.16.0
-rw-r--r-- 1 root root 17424384 Dec 20 10:47 oi.16.1
-rw-r--r-- 1 root root 12759040 Dec 20 10:47 oi.16.10
-rw-r--r-- 1 root root 12759040 Dec 20 10:47 oi.16.11
-rw-r--r-- 1 root root 12759040 Dec 20 10:47 oi.16.12
-rw-r--r-- 1 root root 7991296 Dec 20 10:47 oi.16.13
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.14
-rw-r--r-- 1 root root 6541312 Dec 20 10:47 oi.16.15
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.16
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.17
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.18
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.19
-rw-r--r-- 1 root root 12759040 Dec 20 10:47 oi.16.2
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.20
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.21
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.22
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.23
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.24
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.25
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.26
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.27
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.28
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.29
-rw-r--r-- 1 root root 12759040 Dec 20 10:47 oi.16.3
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.30
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.31
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.32
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.33
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.34
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.35
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.36
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.37
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.38
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.39
-rw-r--r-- 1 root root 12759040 Dec 20 10:47 oi.16.4
-rw-r--r-- 1 root root 7053312 Dec 20 10:47 oi.16.40
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.41
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.42
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.43
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.44
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.45
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.46
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.47
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.48
-rw-r--r-- 1 root root 6545408 Dec 20 10:47 oi.16.49
-rw-r--r-- 1 root root 12759040 Dec 20 10:47 oi.16.5
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.50
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.51
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.52
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.53
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.54
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.55
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.56
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.57
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.58
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.59
-rw-r--r-- 1 root root 12759040 Dec 20 10:47 oi.16.6
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.60
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.61
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.62
-rw-r--r-- 1 root root 6381568 Dec 20 10:47 oi.16.63
-rw-r--r-- 1 root root 12759040 Dec 20 10:47 oi.16.7
-rw-r--r-- 1 root root 12759040 Dec 20 10:47 oi.16.8
-rw-r--r-- 1 root root 10059776 Dec 20 10:47 oi.16.9
-rw-r--r-- 1 root root 400 Dec 20 10:47 OI_scrub
drwxr-xr-x 2 root root 4096 Dec 20 10:47 PENDING
drwxr-xr-x 4 root root 4096 Dec 20 10:47 quota_master
drwxr-xr-x 2 root root 4096 Dec 20 10:47 quota_slave
drwxr-xr-x 2 root root 4096 Dec 20 10:47 REMOTE_PARENT_DIR
drwxr-xr-x 5 root root 4096 Dec 21 19:53 ROOT
-rw-rw-rw- 1 root root 24 Dec 20 10:47 seq_ctl
-rw-rw-rw- 1 root root 24 Dec 20 10:47 seq_srv
Regards,
Frank
Retry for Lustre Server Move
by Frank Yang
Hi,
Sorry to write almost the same mail again. Since the original email, titled
"Help for Lustre Server Move", contains lengthy reports, I think
perhaps it's better to start a new thread.
I have restored Lustre from 2.4.2 back to 2.5.0. The error is basically the
same as in the first mail I sent. However, I have rearranged the commands and
outputs in order, so that anyone who can help will have a better
understanding.
The case is basically moving the data from the old MDS to a new MDS.
*******************************************************************
Some basic info again
*** Client (cola1)
eth0: 10.242.116.6
eth1: 192.168.1.6
modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
*** Old MDS (old_mds)
eth0: 10.242.116.7
eth1: 192.168.1.7
modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
MGS/MDS mount point: /MDT
device: /dev/mapper/VolGroup00-LogVol03
*** New MDS (new_mds)
eth0: 10.242.116.32
eth1: 192.168.1.32
modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
MGS/MDS mount point: /MDT
device: /dev/sda6
*** OSS (myoss)
eth0: 192.168.1.34
eth1: Disabled
modprobe.conf: options lnet ip2nets="tcp0 192.168.1.*"
OST mount point: /OST
device: /dev/sda5
*******************************************************************
[root@new_mds]# tunefs.lustre --erase-params /dev/sda6
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x5
(MDT MGS )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:
Permanent disk data:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x45
(MDT MGS update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:
Writing CONFIGS/mountdata
Dec 30 09:38:23 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem with
ordered data mode. quota=on. Opts:
[root@new_mds home]# tunefs.lustre --writeconf --mgs --mdt
/dev/sda6
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x45
(MDT MGS update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:
Permanent disk data:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x145
(MDT MGS update writeconf )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:
Writing CONFIGS/mountdata
[root@new_mds home]# mount -t lustre /dev/sda6 /MDT
Dec 30 09:40:10 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 30 09:40:35 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 30 09:40:35 new_mds kernel: LNet: HW CPU cores: 8, npartitions: 2
Dec 30 09:40:35 new_mds modprobe: FATAL: Error inserting crc32c_intel
(/lib/modules/2.6.32-358.18.1.el6_lustre.x86_64/kernel/arch/x86/crypto/crc32c-intel.ko):
No such device
Dec 30 09:40:35 new_mds kernel: alg: No test for crc32 (crc32-table)
Dec 30 09:40:35 new_mds kernel: alg: No test for adler32 (adler32-zlib)
Dec 30 09:40:39 new_mds kernel: padlock: VIA PadLock Hash Engine not
detected.
Dec 30 09:40:39 new_mds modprobe: FATAL: Error inserting padlock_sha
(/lib/modules/2.6.32-358.18.1.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko):
No such device
Dec 30 09:40:43 new_mds kernel: Lustre: Lustre: Build Version:
2.5.0-RC1--PRISTINE-2.6.32-358.18.1.el6_lustre.x86_64
Dec 30 09:40:43 new_mds kernel: LNet: Added LNI 192.168.1.32@tcp[8/256/0/180]
Dec 30 09:40:43 new_mds kernel: LNet: Added LNI 10.242.116.32@tcp1[8/256/0/180]
Dec 30 09:40:43 new_mds kernel: LNet: Accept secure, port 988
Dec 30 09:40:44 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 30 09:40:44 new_mds kernel: Lustre: MGS: Logs for fs lustre were
removed by user request. All servers must be restarted in order to
regenerate the logs.
Dec 30 09:40:45 new_mds kernel: Lustre: lustre-MDT0000: used disk, loading
Dec 30 09:40:45 new_mds kernel: LustreError:
3461:0:(osd_io.c:950:osd_ldiskfs_read()) lustre-MDT0000: can't read
128@8192 on ino 33: rc = 0
Dec 30 09:40:45 new_mds kernel: LustreError:
3461:0:(mdt_recovery.c:112:mdt_clients_data_init()) error reading MDS
last_rcvd idx 0, off 8192: rc -14
Dec 30 09:40:45 new_mds kernel: LustreError: 11-0:
lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect
failed with -11.
Dec 30 09:40:45 new_mds kernel: Lustre: lustre-MDD0000: changelog on
[root@myoss ~]# tunefs.lustre --erase-params /dev/sda5
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x1002
(OST no_primnode )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.1.32@tcp
Permanent disk data:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x1042
(OST update no_primnode )
Persistent mount opts: errors=remount-ro
Parameters:
Writing CONFIGS/mountdata
Dec 30 09:42:08 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
[root@myoss ~]# tunefs.lustre --writeconf --mgsnode=192.168.1.32@tcp --ost
/dev/sda5
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x1042
(OST update no_primnode )
Persistent mount opts: errors=remount-ro
Parameters:
Permanent disk data:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Dec 30 09:42:51 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
[root@myoss ~]# tunefs.lustre --writeconf /dev/sda5
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.1.32@tcp
Permanent disk data:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.1.32@tcp
Writing CONFIGS/mountdata
Dec 30 09:44:14 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
[root@myoss ~]# mount -t lustre /dev/sda5 /OST
Dec 30 09:44:55 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 30 09:44:55 myoss kernel: LNet: HW CPU cores: 12, npartitions: 4
Dec 30 09:44:55 myoss kernel: alg: No test for crc32 (crc32-table)
Dec 30 09:44:55 myoss kernel: alg: No test for adler32 (adler32-zlib)
Dec 30 09:44:55 myoss kernel: alg: No test for crc32 (crc32-pclmul)
Dec 30 09:44:59 myoss kernel: padlock: VIA PadLock Hash Engine not detected.
Dec 30 09:44:59 myoss modprobe: FATAL: Error inserting padlock_sha
(/lib/modules/2.6.32-358.18.1.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko):
No such device
Dec 30 09:45:03 myoss kernel: Lustre: Lustre: Build Version:
2.5.0-RC1--PRISTINE-2.6.32-358.18.1.el6_lustre.x86_64
Dec 30 09:45:03 myoss kernel: LNet: Added LNI 192.168.1.34@tcp [8/256/0/180]
Dec 30 09:45:03 myoss kernel: LNet: Accept secure, port 988
Dec 30 09:45:04 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 30 09:45:04 myoss kernel: LustreError: 13a-8: Failed to get MGS log
params and no local copy.
Dec 30 09:45:04 myoss kernel: LustreError:
3457:0:(fld_handler.c:150:fld_server_lookup()) srv-lustre-OST0000: lookup
0x860002, but not connects to MDT0 yet: rc = -5.
Dec 30 09:45:04 myoss kernel: LustreError:
3457:0:(osd_handler.c:2134:osd_fld_lookup()) lustre-OST0000-osd: cannot
find FLD range for 0x860002: rc = -5
Dec 30 09:45:04 myoss kernel: LustreError:
3457:0:(osd_handler.c:3364:osd_mdt_seq_exists()) lustre-OST0000-osd: Can
not lookup fld for 0x860002
Dec 30 09:45:05 myoss kernel: LustreError: 13a-8: Failed to get MGS log
params and no local copy.
Dec 30 09:46:24 new_mds kernel: Lustre: MGS: Regenerating lustre-OST0000
log by user request.
Dec 30 09:46:34 new_mds kernel: Lustre:
3432:0:(mgc_request.c:1645:mgc_process_recover_log()) Process recover log
lustre-mdtir error -22
Dec 30 09:46:34 new_mds kernel: LustreError:
3505:0:(ldlm_lib.c:429:client_obd_setup()) can't add initial connection
Dec 30 09:46:34 new_mds kernel: LustreError:
3505:0:(osp_dev.c:684:osp_init0()) lustre-OST0000-osc-MDT0000: can't setup
obd: -2
Dec 30 09:46:34 new_mds kernel: LustreError:
3505:0:(obd_config.c:572:class_setup()) setup lustre-OST0000-osc-MDT0000
failed (-2)
Dec 30 09:46:34 new_mds kernel: LustreError:
3505:0:(obd_config.c:1591:class_config_llog_handler()) MGC192.168.1.32@tcp:
cfg command failed: rc = -2
Dec 30 09:46:34 new_mds kernel: Lustre: cmd=cf003
0:lustre-OST0000-osc-MDT0000 1:lustre-OST0000_UUID 2:0@<0:0>
[root@cola1 ~]# mount -t lustre 192.168.1.32@tcp:/lustre /lustre
mount.lustre: mount 192.168.1.32@tcp:/lustre at /lustre failed: No such
file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
Dec 30 09:48:52 cola1 kernel: LNet: HW CPU cores: 8, npartitions: 2
Dec 30 09:48:52 cola1 modprobe: FATAL: Error inserting crc32c_intel
(/lib/modules/2.6.32-358.18.1.el6_lustre.x86_64/kernel/arch/x86/crypto/crc32c-intel.ko):
No such device
Dec 30 09:48:52 cola1 kernel: alg: No test for crc32 (crc32-table)
Dec 30 09:48:52 cola1 kernel: alg: No test for adler32 (adler32-zlib)
Dec 30 09:48:56 cola1 kernel: padlock: VIA PadLock Hash Engine not detected.
Dec 30 09:48:56 cola1 modprobe: FATAL: Error inserting padlock_sha
(/lib/modules/2.6.32-358.18.1.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko):
No such device
Dec 30 09:49:00 cola1 kernel: Lustre: Lustre: Build Version:
2.5.0-RC1--PRISTINE-2.6.32-358.18.1.el6_lustre.x86_64
Dec 30 09:49:00 cola1 kernel: LNet: Added LNI 192.168.1.7@tcp [8/256/0/180]
Dec 30 09:49:00 cola1 kernel: LNet: Added LNI 10.242.116.7@tcp1[8/256/0/180]
Dec 30 09:49:00 cola1 kernel: LNet: Accept secure, port 988
Dec 30 09:49:00 cola1 kernel: LustreError:
2562:0:(ldlm_lib.c:429:client_obd_setup()) can't add initial connection
Dec 30 09:49:00 cola1 kernel: LustreError:
2562:0:(obd_config.c:572:class_setup()) setup
lustre-OST0000-osc-ffff88021996ac00 failed (-2)
Dec 30 09:49:00 cola1 kernel: LustreError:
2562:0:(obd_config.c:1591:class_config_llog_handler()) MGC192.168.1.32@tcp:
cfg command failed: rc = -2
Dec 30 09:49:00 cola1 kernel: Lustre: cmd=cf003 0:lustre-OST0000-osc
1:lustre-OST0000_UUID 2:0@<0:0>
Dec 30 09:49:00 cola1 kernel: LustreError: 15c-8: MGC192.168.1.32@tcp: The
configuration from log 'lustre-client' failed (-2). This may be the result
of communication errors between this node and the MGS, a bad configuration,
or other errors. See the syslog for more information.
Dec 30 09:49:00 cola1 kernel: LustreError:
2481:0:(llite_lib.c:1044:ll_fill_super()) Unable to process log: -2
Dec 30 09:49:00 cola1 kernel: LustreError:
2481:0:(obd_config.c:619:class_cleanup()) Device 4 not setup
Dec 30 09:49:00 cola1 kernel: Lustre: Unmounted lustre-client
Dec 30 09:49:00 cola1 kernel: LustreError:
2481:0:(obd_mount.c:1311:lustre_fill_super()) Unable to mount (-2)
Dec 30 09:47:24 myoss kernel: INFO: task tgt_recov:3645 blocked for more
than 120 seconds.
Dec 30 09:47:24 myoss kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 30 09:47:24 myoss kernel: tgt_recov D 000000000000000b 0
3645 2 0x00000080
Dec 30 09:47:24 myoss kernel: ffff88044d985da0 0000000000000046
0000000000000000 0000000000000003
Dec 30 09:47:24 myoss kernel: ffff88044d985d30 ffffffff81055f96
ffff88044d985d40 ffff8804733beae0
Dec 30 09:47:24 myoss kernel: ffff88044d983af8 ffff88044d985fd8
000000000000fb88 ffff88044d983af8
Dec 30 09:47:24 myoss kernel: Call Trace:
Dec 30 09:47:24 myoss kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Dec 30 09:47:24 myoss kernel: [<ffffffffa07b0210>] ?
check_for_clients+0x0/0x70 [ptlrpc]
Dec 30 09:47:24 myoss kernel: [<ffffffffa07b187d>]
target_recovery_overseer+0x9d/0x230 [ptlrpc]
Dec 30 09:47:24 myoss kernel: [<ffffffffa07aff00>] ?
exp_connect_healthy+0x0/0x20 [ptlrpc]
Dec 30 09:47:24 myoss kernel: [<ffffffff81096da0>] ?
autoremove_wake_function+0x0/0x40
Dec 30 09:47:24 myoss kernel: [<ffffffffa07b8140>] ?
target_recovery_thread+0x0/0x1920 [ptlrpc]
Dec 30 09:47:24 myoss kernel: [<ffffffffa07b8680>]
target_recovery_thread+0x540/0x1920 [ptlrpc]
Dec 30 09:47:24 myoss kernel: [<ffffffff81063422>] ?
default_wake_function+0x12/0x20
Dec 30 09:47:24 myoss kernel: [<ffffffffa07b8140>] ?
target_recovery_thread+0x0/0x1920 [ptlrpc]
Dec 30 09:47:24 myoss kernel: [<ffffffff81096a36>] kthread+0x96/0xa0
Dec 30 09:47:24 myoss kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Dec 30 09:47:24 myoss kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
Dec 30 09:47:24 myoss kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Thanks,
Frank
Re: [HPDD-discuss] Help for Lustre Server Move
by Frank Yang
I'm thinking perhaps I should change 2.4.2 back to 2.5.0. Originally I
"upgraded" 2.5.0 to 2.4.2 in the hope that it would fix the problem, and
because the database seems to be compatible between these two versions. But
now, since that didn't work, perhaps I should change it back to 2.5.0 again.
Does anybody have an idea of which one is better for later debugging or
reconfiguration? Currently the old_mds is still 2.5.0, and it hasn't
"talked" to the 2.4.2 OSS, since I haven't mounted it and haven't used
tunefs.lustre to change the mgsnode parameter back to old_mds on the OSS.
new_mds is already 2.4.2.
BTW, does anybody know how to make a device which is in the ATTACH state go
to the UP state? Perhaps it's a very important step in my case. Today I
tried "tunefs.lustre --erase-params /dev/..." and "tunefs.lustre
--writeconf /dev/...". The result now looks much more similar to the case
mentioned in my first mail, as below:
[root@old_mds ~]# cat /proc/fs/lustre/devices
0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 8
1 UP mgs MGS MGS 7
2 UP mgc MGC192.168.1.7@tcp 28fc524e-9128-4b16-adbf-df94972b556a 5
3 UP mds MDS MDS_uuid 3
4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 7
6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
8 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5
9 AT osp lustre-OST0000-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 1
In my later experiment after the first email, I couldn't even see the OST
device (device 9 above). Now, at least, it comes back, though still in the
AT state...
Actually I still don't understand why I can't restore them. I have chosen
to completely destroy the configuration with "--erase-params" (only
"--writeconf" by itself doesn't seem to destroy anything). This should
enable me to reconfigure them, so I guess there must be something more to
destroy or configure...
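For reference, the writeconf procedure described in the Lustre manual assumes every target is unmounted first, regenerated MDT-before-OSTs, and remounted in the same order. A sketch using the device paths from this thread (they may differ on your nodes):

```shell
# Hedged sketch of a full config-log regeneration; all clients and servers
# must be stopped first. Device paths are the ones quoted in this thread.
umount /MDT                           # on the MDS
umount /OST                           # on each OSS
tunefs.lustre --writeconf /dev/sda6   # regenerate the MDT log first (MDS)
tunefs.lustre --writeconf /dev/sda5   # then each OST (OSS)
mount -t lustre /dev/sda6 /MDT        # remount MGS/MDT first
mount -t lustre /dev/sda5 /OST        # then OSTs, and finally the clients
```

Running writeconf on the OSTs without also regenerating and remounting the MDT first leaves the MGS with stale client/OSC entries, which may explain a device stuck in the AT state.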
Regards,
Frank
On Sat, Dec 28, 2013 at 10:42 AM, Frank Yang <fsyang.tw(a)gmail.com> wrote:
> Hi Andreas,
>
> Thanks for your suggestions. But I have already tried to bring up old_mds
> again (certainly with the same address and data), as mentioned in my first
> email. I did it right after I failed to bring up new_mds. In my last mail,
> I wanted to focus on new_mds since old_mds already can't work either (so
> that I don't need to move again after they get online). If necessary, I can
> use old_mds again, since the OST should also be able to work with either
> one (both have the same metadata).
>
> And actually I "upgraded" from 2.5.0 to 2.4.2, not 1.8 to 2.4.2. Sorry, my
> email is very lengthy, so you can easily get confused.
>
> Actually I still don't know what "do a writeconf" is exactly. I have tried
> "tunefs.lustre -o writeconf" and "mount -t lustre -o writeconf". But I
> don't see anything changed. And, now sometimes I am forced to add "-o
> writeconf" to mount.lustre otherwise I can't successfully mount OST or MDT.
>
> I have tried carefully not to damage any data on the MDT and OST, although
> I did accidentally write a file to the MDT on old_mds (mounted as -t
> ldiskfs, before the tar operation) and then deleted it after I found it. I
> hope this will not screw up the filesystem. e2fsck on the MDT on new_mds
> and on the OST is clean. So I guess/hope perhaps a complete reconfig can
> bring the filesystem back.
>
> Regards,
> Frank
>
>
>
> On Sat, Dec 28, 2013 at 5:07 AM, Dilger, Andreas <andreas.dilger(a)intel.com
> > wrote:
>
>> I think you are complicating your efforts by changing too many things
>> at the same time. Upgrading Lustre from 1.8 to 2.4 and changing the
>> hardware and changing the server addresses makes it very hard to know where
>> the problem might be.
>>
>> I would recommend to move your new_mds to have the same hostname and IP
>> address as old_mds and only do the upgrade first. I would guess by this
>> point you also need to do a writeconf (MDS and OSS) to reset the
>> configuration logs, since everything looks confused.
>>
>> Cheers, Andreas
>>
>> On Dec 27, 2013, at 6:35, "Frank Yang" <fsyang.tw(a)gmail.com> wrote:
>>
>> Hi all,
>>
>> I guess I may have provided too much info. Let me try to make things
>> simpler. Now I decide to use only the new MGS/MDS server (192.168.1.32).
>> As a result, I use "tar ... --xattrs ..." to copy the data to the new
>> server again, according to lustre_manual.pdf. (Note that the old server
>> already couldn't work again after I fell back to it, as in the last
>> mail.) And then I do the following things. By the way, I also "upgraded"
>> Lustre 2.5.0 to Lustre 2.4.2 (since 2.4.2 was released after 2.5.0, I
>> assume it has fewer bugs) on my CentOS 6.5.
>>
>> *******************************************************************
>> Some basic info again
>>
>> *** Client (cola1)
>>
>> eth0: 10.242.116.6
>>
>> eth1: 192.168.1.6
>>
>> modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
>>
>> *** Old MDS (old_mds)
>>
>> eth0: 10.242.116.7
>>
>> eth1: 192.168.1.7
>>
>> modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
>>
>> MGS/MDS mount point: /MDT
>>
>> device: /dev/mapper/VolGroup00-LogVol03
>>
>> *** New MDS (new_mds)
>>
>> eth0: 10.242.116.32
>>
>> eth1: 192.168.1.32
>>
>> modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
>>
>> MGS/MDS mount point: /MDT
>>
>> device: /dev/sda6
>>
>> *** OSS (myoss)
>>
>> eth0: 192.168.1.34
>>
>> eth1: Disabled
>>
>> modprobe.conf: options lnet ip2nets="tcp0 192.168.1.*"
>>
>> OST mount point: /OST
>>
>> device: /dev/sda5
>>
>> *******************************************************************
>> I did: (Note that I had done these several times before, but to dump the
>> messages, I had to do them again now. This means I didn't do the
>> time-consuming tar again.) In order to make the sequence and error
>> messages clearer, I put the commands and messages on new_mds/myoss/cola1
>> in order.
>>
>>
>> [root@new_mds ~]# mount -t lustre /dev/sda6 -o nosvc /MDT
>> [root@new_mds ~]# lctl replace_nids lustre-MDT0000 192.168.1.32@tcp
>> [root@new_mds ~]# cd
>> [root@new_mds ~]# umount /MDT
>> [root@new_mds ~]# mount -t lustre /dev/sda6 /MDT
>> [root@new_mds ~]# lctl dl
>> 0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 8
>> 1 UP mgs MGS MGS 5
>> 2 UP mgc MGC192.168.1.32@tcp b304f3ec-630a-940d-37e6-ce5ded6a6c71 5
>> 3 UP mds MDS MDS_uuid 3
>> 4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
>> 5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 3
>> 6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
>> 7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
>> 8 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5
>>
>>
>> Dec 27 19:04:52 new_mds kernel: LNet: HW CPU cores: 8, npartitions: 2
>> Dec 27 19:04:52 new_mds modprobe: FATAL: Error inserting crc32c_intel
>> (/lib/modules/2.6.32-358.23.2.el6_lustre.x86_64/kernel/arch/x86/crypto/crc32c-intel.ko):
>> No such device
>> Dec 27 19:04:52 new_mds kernel: alg: No test for crc32 (crc32-table)
>> Dec 27 19:04:52 new_mds kernel: alg: No test for adler32 (adler32-zlib)
>> Dec 27 19:04:56 new_mds kernel: padlock: VIA PadLock Hash Engine not
>> detected.
>> Dec 27 19:04:56 new_mds modprobe: FATAL: Error inserting padlock_sha
>> (/lib/modules/2.6.32-358.23.2.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko):
>> No such device
>> Dec 27 19:05:00 new_mds kernel: Lustre: Lustre: Build Version:
>> 2.4.2-RC2--PRISTINE-2.6.32-358.23.2.el6_lustre.x86_64
>> Dec 27 19:05:00 new_mds kernel: LNet: Added LNI 192.168.1.32@tcp[8/256/0/180]
>> Dec 27 19:05:00 new_mds kernel: LNet: Added LNI 10.242.116.32@tcp1[8/256/0/180]
>> Dec 27 19:05:00 new_mds kernel: LNet: Accept secure, port 988
>> Dec 27 19:05:00 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem
>> with ordered data mode. quota=on. Opts:
>> Dec 27 19:06:21 new_mds kernel: LustreError:
>> 25744:0:(obd_mount_server.c:865:lustre_disconnect_lwp())
>> lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
>> Dec 27 19:06:21 new_mds kernel: LustreError:
>> 25744:0:(obd_mount_server.c:1443:server_put_super()) MGS: failed to
>> disconnect lwp. (rc=-2)
>> Dec 27 19:06:27 new_mds kernel: Lustre:
>> 25744:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has
>> timed out for slow reply: [sent 1388142381/real 1388142381]
>> req@ffff880258d60800 x1455572700364820/t0(0) o251->MGC192.168.1.32@tcp
>> @0@lo:26/25 lens 224/224 e 0 to 1 dl 1388142387 ref 2 fl
>> Rpc:XN/0/ffffffff rc 0/-1
>> Dec 27 19:06:27 new_mds kernel: Lustre: server umount MGS complete
>> Dec 27 19:07:01 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem
>> with ordered data mode. quota=on. Opts:
>> Dec 27 19:07:01 new_mds kernel: Lustre: lustre-MDT0000: used disk, loading
>> Dec 27 19:07:01 new_mds kernel: LustreError:
>> 25814:0:(osd_io.c:1000:osd_ldiskfs_read()) lustre-MDT0000: can't read
>> 128@8192 on ino 33: rc = 0
>> Dec 27 19:07:01 new_mds kernel: LustreError:
>> 25814:0:(mdt_recovery.c:112:mdt_clients_data_init()) error reading MDS
>> last_rcvd idx 0, off 8192: rc -14
>> Dec 27 19:07:01 new_mds kernel: LustreError: 11-0:
>> lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation
>> mds_connect failed with -11.
>> Dec 27 19:07:01 new_mds kernel: Lustre: lustre-MDD0000: changelog on
>>
>>
>> [root@myoss ~]# tunefs.lustre --erase-params /dev/sda5
>> checking for existing Lustre data: found
>> Reading CONFIGS/mountdata
>>
>> Read previous values:
>> Target: lustre-OST0000
>> Index: 0
>> Lustre FS: lustre
>> Mount type: ldiskfs
>> Flags: 0x1102
>> (OST writeconf no_primnode )
>> Persistent mount opts: errors=remount-ro
>> Parameters: mgsnode=192.168.1.7@tcp
>>
>>
>> Permanent disk data:
>> Target: lustre=OST0000
>> Index: 0
>> Lustre FS: lustre
>> Mount type: ldiskfs
>> Flags: 0x1142
>> (OST update writeconf no_primnode )
>> Persistent mount opts: errors=remount-ro
>> Parameters:
>>
>> Writing CONFIGS/mountdata
>> [root@myoss ~]# tunefs.lustre --writeconf --mgsnode=192.168.1.32@tcp --ost
>> /dev/sda5
>>
>> checking for existing Lustre data: found
>> Reading CONFIGS/mountdata
>>
>> Read previous values:
>> Target: lustre-OST0000
>> Index: 0
>> Lustre FS: lustre
>> Mount type: ldiskfs
>> Flags: 0x1142
>> (OST update writeconf no_primnode )
>> Persistent mount opts: errors=remount-ro
>> Parameters:
>>
>>
>> Permanent disk data:
>> Target: lustre=OST0000
>> Index: 0
>> Lustre FS: lustre
>> Mount type: ldiskfs
>> Flags: 0x1142
>> (OST update writeconf no_primnode )
>> Persistent mount opts: errors=remount-ro
>> Parameters: mgsnode=192.168.1.32@tcp
>>
>> Writing CONFIGS/mountdata
>> [root@myoss ~]# mount -t lustre /dev/sda5 /OST
>> mount.lustre: mount /dev/sda5 at /OST failed: File exists
>>
>>
>> Dec 27 19:13:30 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
>> ordered data mode. quota=on. Opts:
>> Dec 27 19:13:51 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
>> ordered data mode. quota=on. Opts:
>> Dec 27 19:14:11 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
>> ordered data mode. quota=on. Opts:
>> Dec 27 19:14:11 myoss kernel: LNet: HW CPU cores: 12, npartitions: 4
>> Dec 27 19:14:11 myoss kernel: alg: No test for crc32 (crc32-table)
>> Dec 27 19:14:11 myoss kernel: alg: No test for adler32 (adler32-zlib)
>> Dec 27 19:14:11 myoss kernel: alg: No test for crc32 (crc32-pclmul)
>> Dec 27 19:14:15 myoss kernel: padlock: VIA PadLock Hash Engine not
>> detected.
>> Dec 27 19:14:15 myoss modprobe: FATAL: Error inserting padlock_sha
>> (/lib/modules/2.6.32-358.23.2.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko):
>> No such device
>> Dec 27 19:14:20 myoss kernel: Lustre: Lustre: Build Version:
>> 2.4.2-RC2--PRISTINE-2.6.32-358.23.2.el6_lustre.x86_64
>> Dec 27 19:14:20 myoss kernel: LNet: Added LNI 192.168.1.34@tcp[8/256/0/180]
>> Dec 27 19:14:20 myoss kernel: LNet: Accept secure, port 988
>> Dec 27 19:14:20 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
>> ordered data mode. quota=on. Opts:
>> Dec 27 19:14:21 myoss kernel: LustreError: 15f-b: lustre-OST0000: cannot
>> register this server with the MGS: rc = -17. Is the MGS running?
>> Dec 27 19:14:21 myoss kernel: LustreError:
>> 2492:0:(obd_mount_server.c:1716:server_fill_super()) Unable to start
>> targets: -17
>> Dec 27 19:14:21 myoss kernel: LustreError:
>> 2492:0:(obd_mount_server.c:865:lustre_disconnect_lwp())
>> lustre-MDT0000-lwp-OST0000: Can't end config log lustre-client.
>> Dec 27 19:14:21 myoss kernel: LustreError:
>> 2492:0:(obd_mount_server.c:1443:server_put_super()) lustre-OST0000: failed
>> to disconnect lwp. (rc=-2)
>> Dec 27 19:14:21 myoss kernel: LustreError:
>> 2492:0:(obd_mount_server.c:1473:server_put_super()) no obd lustre-OST0000
>> Dec 27 19:14:21 myoss kernel: LustreError:
>> 2492:0:(obd_mount_server.c:135:server_deregister_mount()) lustre-OST0000
>> not registered
>> Dec 27 19:14:21 myoss kernel: Lustre: server umount lustre-OST0000
>> complete
>> Dec 27 19:14:21 myoss kernel: LustreError:
>> 2492:0:(obd_mount.c:1289:lustre_fill_super()) Unable to mount (-17)
>>
>>
>> [root@myoss ~]# mount -t lustre -o writeconf /dev/sda5 /OST # Note
>> that "-o writeconf" can work. Why???
>>
>>
>> Dec 27 19:15:41 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
>> ordered data mode. quota=on. Opts:
>> Dec 27 19:15:42 myoss kernel: LustreError:
>> 2677:0:(obd_mount_server.c:1140:server_register_target()) lustre-OST0000:
>> error registering with the MGS: rc = -17 (not fatal)
>> Dec 27 19:15:42 myoss kernel: Lustre: lustre-OST0000: Imperative Recovery
>> enabled, recovery window shrunk from 300-900 down to 150-450
>>
>>
>> [root@myoss ~]# lctl dl
>> 0 UP osd-ldiskfs lustre-OST0000-osd lustre-OST0000-osd_UUID 5
>> 1 UP mgc MGC192.168.1.32@tcp 156c5656-12d7-81ba-bbe5-70de335088ef 5
>> 2 UP ost OSS OSS_uuid 3
>> 3 UP obdfilter lustre-OST0000 lustre-OST0000_UUID 4
>> 4 UP lwp lustre-MDT0000-lwp-OST0000 lustre-MDT0000-lwp-OST0000_UUID 5
>>
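As an aside, listings like the `lctl dl` output above are easy to check mechanically — for instance, that every device is UP, or whether an OST (obdfilter) device is attached at all, which becomes relevant later in this thread. A rough sketch with awk, using a canned copy of the OSS listing above as input (this is not an official Lustre tool):

```shell
# Canned copy of the OSS `lctl dl` listing above; each line is
# "<index> <state> <type> <name> <uuid> <refcount>".
lctl_dl='0 UP osd-ldiskfs lustre-OST0000-osd lustre-OST0000-osd_UUID 5
1 UP mgc MGC192.168.1.32@tcp 156c5656-12d7-81ba-bbe5-70de335088ef 5
2 UP ost OSS OSS_uuid 3
3 UP obdfilter lustre-OST0000 lustre-OST0000_UUID 4
4 UP lwp lustre-MDT0000-lwp-OST0000 lustre-MDT0000-lwp-OST0000_UUID 5'

# Exit non-zero if any device is not UP:
echo "$lctl_dl" | awk '$2 != "UP" { bad = 1; print "not up:", $4 } END { exit bad }'

# Name of the attached OST (obdfilter) device, if any:
ost=$(echo "$lctl_dl" | awk '$3 == "obdfilter" { print $4 }')
echo "${ost:-no OST device attached}"
```

On a live server the canned variable would be replaced by `lctl dl` itself.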
>>
>> Dec 27 19:15:41 new_mds kernel: Lustre: MGS: Regenerating lustre-OST0000
>> log by user request.
>> Dec 27 19:15:41 new_mds kernel: LustreError:
>> 25783:0:(llog.c:250:llog_init_handle()) MGS: llog uuid mismatch:
>> config_uuid/
>> Dec 27 19:15:41 new_mds kernel: LustreError:
>> 25783:0:(mgs_llog.c:1454:record_start_log()) MGS: can't start log
>> lustre-MDT0000: rc = -17
>> Dec 27 19:15:41 new_mds kernel: LustreError:
>> 25783:0:(mgs_llog.c:3658:mgs_write_log_target()) Can't write logs for
>> lustre-OST0000 (-17)
>> Dec 27 19:15:41 new_mds kernel: LustreError:
>> 25783:0:(mgs_handler.c:408:mgs_handle_target_reg()) Failed to write
>> lustre-OST0000 log (-17)
>> Dec 27 19:17:02 new_mds kernel: Lustre: MGS: Regenerating lustre-OST0000
>> log by user request.
>> Dec 27 19:17:02 new_mds kernel: Lustre: Found index 0 for lustre-OST0000,
>> updating log
>> Dec 27 19:17:02 new_mds kernel: Lustre: Client log for lustre-OST0000 was
>> not updated; writeconf the MDT first to regenerate it.
>> Dec 27 19:17:02 new_mds kernel: LustreError:
>> 25782:0:(llog.c:250:llog_init_handle()) MGS: llog uuid mismatch:
>> config_uuid/
>> Dec 27 19:17:02 new_mds kernel: LustreError:
>> 25782:0:(mgs_llog.c:1454:record_start_log()) MGS: can't start log
>> lustre-MDT0000: rc = -17
>> Dec 27 19:17:02 new_mds kernel: LustreError:
>> 25782:0:(mgs_llog.c:3658:mgs_write_log_target()) Can't write logs for
>> lustre-OST0000 (-17)
>> Dec 27 19:17:02 new_mds kernel: LustreError:
>> 25782:0:(mgs_handler.c:408:mgs_handle_target_reg()) Failed to write
>> lustre-OST0000 log (-17)
>> Dec 27 19:17:09 new_mds kernel: Lustre:
>> 25785:0:(mgc_request.c:1564:mgc_process_recover_log()) Process recover log
>> lustre-mdtir error -22
>>
>>
>> [root@new_mds ~]# lctl dl
>> 0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 8
>> 1 UP mgs MGS MGS 7
>> 2 UP mgc MGC192.168.1.32@tcp b304f3ec-630a-940d-37e6-ce5ded6a6c71 5
>> 3 UP mds MDS MDS_uuid 3
>> 4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
>> 5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 7
>> 6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
>> 7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
>> 8 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5
>>
>> Dec 27 19:17:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more
>> than 120 seconds.
>> Dec 27 19:17:50 myoss kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Dec 27 19:17:50 myoss kernel: tgt_recov D 000000000000000b 0
>> 2775 2 0x00000080
>> Dec 27 19:17:50 myoss kernel: ffff88044e523e00 0000000000000046
>> 0000000000000000 0000000000000003
>> Dec 27 19:17:50 myoss kernel: ffff88044e523d90 ffffffff81055f96
>> ffff88044e523da0 ffff880474742ae0
>> Dec 27 19:17:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8
>> 000000000000fb88 ffff8804541cf058
>> Dec 27 19:17:50 myoss kernel: Call Trace:
>> Dec 27 19:17:50 myoss kernel: [<ffffffff81055f96>] ?
>> enqueue_task+0x66/0x80
>> Dec 27 19:17:50 myoss kernel: [<ffffffffa0700070>] ?
>> check_for_clients+0x0/0x70 [ptlrpc]
>> Dec 27 19:17:50 myoss kernel: [<ffffffffa070172d>]
>> target_recovery_overseer+0x9d/0x230 [ptlrpc]
>> Dec 27 19:17:50 myoss kernel: [<ffffffffa06ffd60>] ?
>> exp_connect_healthy+0x0/0x20 [ptlrpc]
>> Dec 27 19:17:50 myoss kernel: [<ffffffff81096da0>] ?
>> autoremove_wake_function+0x0/0x40
>> Dec 27 19:17:50 myoss kernel: [<ffffffffa070856e>]
>> target_recovery_thread+0x58e/0x1970 [ptlrpc]
>> Dec 27 19:17:50 myoss kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
>> Dec 27 19:17:50 myoss kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:17:50 myoss kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
>> Dec 27 19:17:50 myoss kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:17:50 myoss kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:17:50 myoss kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
>>
>>
>> [root@cola1 ~]# mount -t lustre 192.168.1.32@tcp:/lustre /lustre
>>
>>
>> Dec 27 19:21:05 cola1 kernel: LNet: HW CPU cores: 8, npartitions: 2
>> Dec 27 19:21:05 cola1 modprobe: FATAL: Error inserting crc32c_intel
>> (/lib/modules/2.6.32-358.18.1.el6.x86_64/kernel/arch/x86/crypto/crc32c-intel.ko):
>> No such device
>> Dec 27 19:21:05 cola1 kernel: alg: No test for crc32 (crc32-table)
>> Dec 27 19:21:05 cola1 kernel: alg: No test for adler32 (adler32-zlib)
>> Dec 27 19:21:09 cola1 kernel: padlock: VIA PadLock Hash Engine not
>> detected.
>> Dec 27 19:21:09 cola1 modprobe: FATAL: Error inserting padlock_sha
>> (/lib/modules/2.6.32-358.18.1.el6.x86_64/kernel/drivers/crypto/padlock-sha.ko):
>> No such device
>> Dec 27 19:21:13 cola1 kernel: Lustre: Lustre: Build Version:
>> 2.5.0-RC1--PRISTINE-2.6.32-358.18.1.el6.x86_64
>> Dec 27 19:21:13 cola1 kernel: LNet: Added LNI 192.168.1.6@tcp[8/256/0/180]
>> Dec 27 19:21:13 cola1 kernel: LNet: Added LNI 10.242.116.6@tcp1[8/256/0/180]
>> Dec 27 19:21:13 cola1 kernel: LNet: Accept secure, port 988
>> Dec 27 19:21:13 cola1 kernel: Lustre:
>> 19583:0:(mgc_request.c:1645:mgc_process_recover_log()) Process recover log
>> lustre-cliir error -22
>> Dec 27 19:21:13 cola1 kernel: LustreError: 11-0:
>> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
>> operation mds_connect failed with -11.
>> Dec 27 19:21:38 cola1 kernel: LustreError: 11-0:
>> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
>> operation mds_connect failed with -11.
>> Dec 27 19:22:03 cola1 kernel: LustreError: 11-0:
>> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
>> operation mds_connect failed with -11.
>> Dec 27 19:22:28 cola1 kernel: LustreError: 11-0:
>> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
>> operation mds_connect failed with -11.
>> Dec 27 19:22:53 cola1 kernel: LustreError: 11-0:
>> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
>> operation mds_connect failed with -11.
>> Dec 27 19:23:18 cola1 kernel: LustreError: 11-0:
>> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
>> operation mds_connect failed with -11.
>>
>>
>> ### No additional messages seen on new_mds
>>
>>
>> Dec 27 19:19:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more
>> than 120 seconds.
>> Dec 27 19:19:50 myoss kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Dec 27 19:19:50 myoss kernel: tgt_recov D 000000000000000b 0
>> 2775 2 0x00000080
>> Dec 27 19:19:50 myoss kernel: ffff88044e523e00 0000000000000046
>> 0000000000000000 0000000000000003
>> Dec 27 19:19:50 myoss kernel: ffff88044e523d90 ffffffff81055f96
>> ffff88044e523da0 ffff880474742ae0
>> Dec 27 19:19:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8
>> 000000000000fb88 ffff8804541cf058
>> Dec 27 19:19:50 myoss kernel: Call Trace:
>> Dec 27 19:19:50 myoss kernel: [<ffffffff81055f96>] ?
>> enqueue_task+0x66/0x80
>> Dec 27 19:19:50 myoss kernel: [<ffffffffa0700070>] ?
>> check_for_clients+0x0/0x70 [ptlrpc]
>> Dec 27 19:19:50 myoss kernel: [<ffffffffa070172d>]
>> target_recovery_overseer+0x9d/0x230 [ptlrpc]
>> Dec 27 19:19:50 myoss kernel: [<ffffffffa06ffd60>] ?
>> exp_connect_healthy+0x0/0x20 [ptlrpc]
>> Dec 27 19:19:50 myoss kernel: [<ffffffff81096da0>] ?
>> autoremove_wake_function+0x0/0x40
>> Dec 27 19:19:50 myoss kernel: [<ffffffffa070856e>]
>> target_recovery_thread+0x58e/0x1970 [ptlrpc]
>> Dec 27 19:19:50 myoss kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
>> Dec 27 19:19:50 myoss kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:19:50 myoss kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
>> Dec 27 19:19:50 myoss kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:19:50 myoss kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:19:50 myoss kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
>> Dec 27 19:21:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more
>> than 120 seconds.
>> Dec 27 19:21:50 myoss kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Dec 27 19:21:50 myoss kernel: tgt_recov D 000000000000000b 0
>> 2775 2 0x00000080
>> Dec 27 19:21:50 myoss kernel: ffff88044e523e00 0000000000000046
>> 0000000000000000 0000000000000003
>> Dec 27 19:21:50 myoss kernel: ffff88044e523d90 ffffffff81055f96
>> ffff88044e523da0 ffff880474742ae0
>> Dec 27 19:21:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8
>> 000000000000fb88 ffff8804541cf058
>> Dec 27 19:21:50 myoss kernel: Call Trace:
>> Dec 27 19:21:50 myoss kernel: [<ffffffff81055f96>] ?
>> enqueue_task+0x66/0x80
>> Dec 27 19:21:50 myoss kernel: [<ffffffffa0700070>] ?
>> check_for_clients+0x0/0x70 [ptlrpc]
>> Dec 27 19:21:50 myoss kernel: [<ffffffffa070172d>]
>> target_recovery_overseer+0x9d/0x230 [ptlrpc]
>> Dec 27 19:21:50 myoss kernel: [<ffffffffa06ffd60>] ?
>> exp_connect_healthy+0x0/0x20 [ptlrpc]
>> Dec 27 19:21:50 myoss kernel: [<ffffffff81096da0>] ?
>> autoremove_wake_function+0x0/0x40
>> Dec 27 19:21:50 myoss kernel: [<ffffffffa070856e>]
>> target_recovery_thread+0x58e/0x1970 [ptlrpc]
>> Dec 27 19:21:50 myoss kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
>> Dec 27 19:21:50 myoss kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:21:50 myoss kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
>> Dec 27 19:21:50 myoss kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:21:50 myoss kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:21:50 myoss kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
>> Dec 27 19:23:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more
>> than 120 seconds.
>> Dec 27 19:23:50 myoss kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Dec 27 19:23:50 myoss kernel: tgt_recov D 000000000000000b 0
>> 2775 2 0x00000080
>> Dec 27 19:23:50 myoss kernel: ffff88044e523e00 0000000000000046
>> 0000000000000000 0000000000000003
>> Dec 27 19:23:50 myoss kernel: ffff88044e523d90 ffffffff81055f96
>> ffff88044e523da0 ffff880474742ae0
>> Dec 27 19:23:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8
>> 000000000000fb88 ffff8804541cf058
>> Dec 27 19:23:50 myoss kernel: Call Trace:
>> Dec 27 19:23:50 myoss kernel: [<ffffffff81055f96>] ?
>> enqueue_task+0x66/0x80
>> Dec 27 19:23:50 myoss kernel: [<ffffffffa0700070>] ?
>> check_for_clients+0x0/0x70 [ptlrpc]
>> Dec 27 19:23:50 myoss kernel: [<ffffffffa070172d>]
>> target_recovery_overseer+0x9d/0x230 [ptlrpc]
>> Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa06ffd60>] ?
>> exp_connect_healthy+0x0/0x20 [ptlrpc]
>> Dec 27 19:23:50 pepsi3 kernel: [<ffffffff81096da0>] ?
>> autoremove_wake_function+0x0/0x40
>> Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa070856e>]
>> target_recovery_thread+0x58e/0x1970 [ptlrpc]
>> Dec 27 19:23:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
>> Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:23:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
>> Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:23:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
>> Dec 27 19:25:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more
>> than 120 seconds.
>> Dec 27 19:25:50 pepsi3 kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Dec 27 19:25:50 pepsi3 kernel: tgt_recov D 000000000000000b 0
>> 2775 2 0x00000080
>> Dec 27 19:25:50 pepsi3 kernel: ffff88044e523e00 0000000000000046
>> 0000000000000000 0000000000000003
>> Dec 27 19:25:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96
>> ffff88044e523da0 ffff880474742ae0
>> Dec 27 19:25:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8
>> 000000000000fb88 ffff8804541cf058
>> Dec 27 19:25:50 pepsi3 kernel: Call Trace:
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffff81055f96>] ?
>> enqueue_task+0x66/0x80
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0700070>] ?
>> check_for_clients+0x0/0x70 [ptlrpc]
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa070172d>]
>> target_recovery_overseer+0x9d/0x230 [ptlrpc]
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa06ffd60>] ?
>> exp_connect_healthy+0x0/0x20 [ptlrpc]
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffff81096da0>] ?
>> autoremove_wake_function+0x0/0x40
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa070856e>]
>> target_recovery_thread+0x58e/0x1970 [ptlrpc]
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:25:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
>> Dec 27 19:27:14 pepsi3 ntpd_intres[1856]: host name not found:
>> qrdcntp.quanta.corp
>> Dec 27 19:27:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more
>> than 120 seconds.
>> Dec 27 19:27:50 pepsi3 kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Dec 27 19:27:50 pepsi3 kernel: tgt_recov D 000000000000000b 0
>> 2775 2 0x00000080
>> Dec 27 19:27:50 pepsi3 kernel: ffff88044e523e00 0000000000000046
>> 0000000000000000 0000000000000003
>> Dec 27 19:27:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96
>> ffff88044e523da0 ffff880474742ae0
>> Dec 27 19:27:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8
>> 000000000000fb88 ffff8804541cf058
>> Dec 27 19:27:50 pepsi3 kernel: Call Trace:
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffff81055f96>] ?
>> enqueue_task+0x66/0x80
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0700070>] ?
>> check_for_clients+0x0/0x70 [ptlrpc]
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa070172d>]
>> target_recovery_overseer+0x9d/0x230 [ptlrpc]
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa06ffd60>] ?
>> exp_connect_healthy+0x0/0x20 [ptlrpc]
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffff81096da0>] ?
>> autoremove_wake_function+0x0/0x40
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa070856e>]
>> target_recovery_thread+0x58e/0x1970 [ptlrpc]
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:27:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
>> Dec 27 19:29:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more
>> than 120 seconds.
>> Dec 27 19:29:50 pepsi3 kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Dec 27 19:29:50 pepsi3 kernel: tgt_recov D 000000000000000b 0
>> 2775 2 0x00000080
>> Dec 27 19:29:50 pepsi3 kernel: ffff88044e523e00 0000000000000046
>> 0000000000000000 0000000000000003
>> Dec 27 19:29:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96
>> ffff88044e523da0 ffff880474742ae0
>> Dec 27 19:29:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8
>> 000000000000fb88 ffff8804541cf058
>> Dec 27 19:29:50 pepsi3 kernel: Call Trace:
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffff81055f96>] ?
>> enqueue_task+0x66/0x80
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0700070>] ?
>> check_for_clients+0x0/0x70 [ptlrpc]
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa070172d>]
>> target_recovery_overseer+0x9d/0x230 [ptlrpc]
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa06ffd60>] ?
>> exp_connect_healthy+0x0/0x20 [ptlrpc]
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffff81096da0>] ?
>> autoremove_wake_function+0x0/0x40
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa070856e>]
>> target_recovery_thread+0x58e/0x1970 [ptlrpc]
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:29:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
>> Dec 27 19:31:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more
>> than 120 seconds.
>> Dec 27 19:31:50 pepsi3 kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Dec 27 19:31:50 pepsi3 kernel: tgt_recov D 000000000000000b 0
>> 2775 2 0x00000080
>> Dec 27 19:31:50 pepsi3 kernel: ffff88044e523e00 0000000000000046
>> 0000000000000000 0000000000000003
>> Dec 27 19:31:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96
>> ffff88044e523da0 ffff880474742ae0
>> Dec 27 19:31:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8
>> 000000000000fb88 ffff8804541cf058
>> Dec 27 19:31:50 pepsi3 kernel: Call Trace:
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffff81055f96>] ?
>> enqueue_task+0x66/0x80
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0700070>] ?
>> check_for_clients+0x0/0x70 [ptlrpc]
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa070172d>]
>> target_recovery_overseer+0x9d/0x230 [ptlrpc]
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa06ffd60>] ?
>> exp_connect_healthy+0x0/0x20 [ptlrpc]
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffff81096da0>] ?
>> autoremove_wake_function+0x0/0x40
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa070856e>]
>> target_recovery_thread+0x58e/0x1970 [ptlrpc]
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
>> target_recovery_thread+0x0/0x1970 [ptlrpc]
>> Dec 27 19:31:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
>> Dec 27 19:33:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more
>> than 120 seconds.
>>
>> ### The same call trace messages are printed again and again until I
>> Ctrl-C the client...
>>
>> ### No new messages on myoss and new_mds.
>>
>>
>>
>> The result is different from the previous one, although I don't think
>> there's much difference in what I have done. For example, previously the
>> "lctl dl" report had an "AT" (attached?) OST device. However, now, no OST
>> device is reported.
>>
>> I have seen a mail saying that doing a "writeconf" can solve some
>> problems. However, I've tried "tunefs.lustre --writeconf /dev/sda6" and even
>> now "mount -t lustre -o writeconf". I don't know which one is correct, or
>> perhaps they are both wrong.
>>
>> Or, if there's no way to solve this problem, is there any way to
>> extract the data from the ldiskfs filesystems without a client?
>>
>> Regards,
>> Frank
>>
>>
>>
>> _______________________________________________
>> HPDD-discuss mailing list
>> HPDD-discuss@lists.01.org
>> https://lists.01.org/mailman/listinfo/hpdd-discuss
>>
>>
>
8 years, 5 months
Multiple independent MGSs/filesystems in a cluster
by Frank Yang
Hi all,
The documentation says the MGS is universal in a cluster. But I want to
set up another MGS which is completely independent from the original one,
since I am still trying to rescue the original one and don't want the
second one to make things more complicated. The reason I need the
second one is that, before I successfully restore the original one, I
don't have enough space in a single RAID/machine. As a result, I need to
combine several smaller ones.
I guess "universal MGS" just means we don't need another MGS to
serve multiple Lustre filesystems. But if I don't mind that, I can still
set up another MGS to serve other Lustre filesystems in my case. Can
anybody confirm this? Thanks a lot.
Regards,
Frank
8 years, 5 months
Re: [HPDD-discuss] Gerrit - Post upgrade
by Dilger, Andreas
On 2013/12/27 7:42 PM, "Patrick Farrell" <paf@cray.com> wrote:
>Good evening,
>
>
>Since the upgrade to Gerrit, some links don't seem to work reliably.
>Sometimes I'll click a link from an LU, and most of the Gerrit page will
>load, but not the actual content, and it will give me the 'Working'
>message (I don't remember exactly what it
> says.). Reloading the page doesn't seem to help.
>
>
>If I go to that page from a link not from Jira, like from one of the
>lists of proposed or merged patches in Gerrit, it seems to work fine.
>Has anyone else been seeing this? It's not crippling, but it makes
>looking at patches a bit annoying sometimes.
Yes, we've noticed this also, and are looking at how to fix it.
>Next, and this is a bit more trivial, but at least to me, inconvenient:
>The new 'grey text on all-white background' style is very hard to read.
>I much preferred the old, higher contrast color version, and there
>doesn't seem to be a user option to change it.
>
>
>Perhaps some of those at Intel feel the same way and would be interested
>in getting it switched back to something higher contrast? Or perhaps I
>just have bad eyes. :)
I agree that the "new sparse look" seems a bit too sparse for me as well.
Even changing back to the "old interface" keeps the same grey-on-white
color scheme.
Joshua (who is on vacation over Christmas, so I don't expect a speedy
reply), is there some option to change the color scheme so it is easier to
separate different parts of the interface?
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
8 years, 5 months
Gerrit - Post upgrade
by Patrick Farrell
Good evening,
Since the upgrade to Gerrit, some links don't seem to work reliably. Sometimes I'll click a link from an LU, and most of the Gerrit page will load, but not the actual content, and it will give me the 'Working' message (I don't remember exactly what it says.). Reloading the page doesn't seem to help.
If I go to that page from a link not from Jira, like from one of the lists of proposed or merged patches in Gerrit, it seems to work fine. Has anyone else been seeing this? It's not crippling, but it makes looking at patches a bit annoying sometimes.
Next, and this is a bit more trivial, but at least to me, inconvenient:
The new 'grey text on all-white background' style is very hard to read. I much preferred the old, higher contrast color version, and there doesn't seem to be a user option to change it.
Perhaps some of those at Intel feel the same way and would be interested in getting it switched back to something higher contrast? Or perhaps I just have bad eyes. :)
- Patrick Farrell
Software Engineer
OSIO File Systems
Cray, Inc.
8 years, 6 months
Fwd: Help for Lustre Server Move
by Frank Yang
Hi all,
I previously set up a temporary combined MGS/MDS server and it worked fine
with the OST server and clients (Lustre 2.5.0). After I finished setting up
the final server, I used GNU tar v1.26 to "copy" the data from the
temporary server to the new one according to the Lustre user manual. However,
everything then went wrong. The basic information:
*** Client (cola1)
eth0: 10.242.116.6
eth1: 192.168.1.6
modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
*** Old MDS (cola4)
eth0: 10.242.116.7
eth1: 192.168.1.7
modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
MGS/MDS mount point: /MDT
device: /dev/mapper/VolGroup00-LogVol03
*** New MDS (new_mds)
eth0: 10.242.116.32
eth1: 192.168.1.32
modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
MGS/MDS mount point: /MDT
device: /dev/sda6
*** OSS (myoss)
eth0: 192.168.1.34
eth1: Disabled
modprobe.conf: options lnet ip2nets="tcp0 192.168.1.*"
OST mount point: /OST
device: /dev/sda5
*************************************************************************************
What I did (the steps may not be entirely accurate, since I tried,
failed, and then retried according to what I googled):
*** New MDS (new_mds)
[root@new_mds ~]# mkfs.lustre --reformat --fsname=lustre --mgs --mdt
--index=0 /dev/sda6
Permanent disk data:
Target: lustre:MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x65
(MDT MGS first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:
device size = 102406MB
formatting backing filesystem ldiskfs on /dev/sda6
target name lustre:MDT0000
4k blocks 26216064
options -J size=400 -I 512 -i 2048 -q -O
dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E
lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre:MDT0000 -J size=400 -I 512 -i 2048
-q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E
lazy_journal_init -F /dev/sda6 26216064
Writing CONFIGS/mountdata
[root@new_mds ~]# mount -t ldiskfs /dev/sda6 /MDT
[root@new_mds ~]# cd /MDT
[root@new_mds MDT]# ssh 192.168.1.7 "cd /MDT; tar cf - --xattrs --sparse ."
| tar xpf - --xattrs --sparse
root@192.168.1.7's password:
tar: ./ROOT/space/users/svnsync/.google/desktop/a2_sock: socket ignored
tar: ./ROOT/space/users/svnsync/.google/desktop/a1_sock: socket ignored
tar: ./ROOT/space/users/svnsync/.google/desktop/a3_sock: socket ignored
tar: ./ROOT/space/users/svnsync/.google/desktop/a4_sock: socket ignored
tar: value -1984391879 out of time_t range 0..8589934591
tar: ./ROOT/space2/a/Database/Database/FILE0064.jpg: implausibly old time
stamp 1907-02-13 20:02:01
tar: ./ROOT/space2/Departee/???(??)/????/IMP????/Green.JPG: time stamp
2019-07-07 10:12:00 is 174511703.910839819 s in the future
tar: ./ROOT/space2/Departee/???(??)/????/IMP????/Horizon.jpg: time stamp
2035-03-24 10:12:00 is 670361303.91045891 s in the future
tar: ./ROOT/space2/Departee/???(??)/????/IMP????/a.jpg: time stamp
2035-02-24 10:12:00 is 667942103.910342464 s in the future
tar: ./ROOT/space2/Departee/???(??)/????/IMP????/Blue.JPG: time stamp
2027-07-30 10:12:00 is 428959703.909806892 s in the future
tar: ./ROOT/space2/server_backup/cola8/var/lib/mysql/mysql.sock: socket
ignored
tar: value -2147483648 out of time_t range 0..8589934591
tar: value -2147483648 out of time_t range 0..8589934591
tar: ./ROOT/space2/user_backup/Obsolete/DVB/software_sim/.ico: implausibly
old time stamp 1901-12-14 04:45:52
tar: ./ROOT/space2/user_backup/Obsolete/DVB/software_sim/small.ico:
implausibly old time stamp 1901-12-14 04:45:52
tar: Exiting with failure status due to previous errors
[root@new_mds MDT]# ls
CATALOGS lost+found oi.16.2 oi.16.33 oi.16.47 oi.16.60
CONFIGS lov_objid oi.16.20 oi.16.34 oi.16.48 oi.16.61
NIDTBL_VERSIONS lov_objseq oi.16.21 oi.16.35 oi.16.49 oi.16.62
O oi.16.0 oi.16.22 oi.16.36 oi.16.5 oi.16.63
OI_scrub oi.16.1 oi.16.23 oi.16.37 oi.16.50 oi.16.7
PENDING oi.16.10 oi.16.24 oi.16.38 oi.16.51 oi.16.8
REMOTE_PARENT_DIR oi.16.11 oi.16.25 oi.16.39 oi.16.52 oi.16.9
ROOT oi.16.12 oi.16.26 oi.16.4 oi.16.53 quota_master
changelog_catalog oi.16.13 oi.16.27 oi.16.40 oi.16.54 quota_slave
changelog_users oi.16.14 oi.16.28 oi.16.41 oi.16.55 seq_ctl
fld oi.16.15 oi.16.29 oi.16.42 oi.16.56 seq_srv
hsm_actions oi.16.16 oi.16.3 oi.16.43 oi.16.57
last_rcvd oi.16.17 oi.16.30 oi.16.44 oi.16.58
lfsck_bookmark oi.16.18 oi.16.31 oi.16.45 oi.16.59
lfsck_namespace oi.16.19 oi.16.32 oi.16.46 oi.16.6
[root@new_mds MDT]# cd
[root@new_mds ~]# umount /MDT
[root@new_mds ~]# mount -t lustre /dev/sda6 -o nosvc /MDT
[root@new_mds ~]# lctl replace_nids lustre-MDT0000 192.168.1.32@tcp
[root@new_mds ~]# lctl dl
0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 5
1 UP mgs MGS MGS 5
2 UP mgc MGC192.168.1.32@tcp fa00c372-90c9-ce21-7d9b-058e1125be1a 5
[root@new_mds ~]# umount /MDT
[root@new_mds ~]# mount /MDT
[root@new_mds ~]# mount -t lustre /dev/sda6 /MDT
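As a side note, before running the copy over ssh against the live MDT, the tar pipeline itself can be sanity-checked locally with throwaway directories. The paths below are hypothetical temp dirs created for the check, not real MDT paths:

```shell
# Local dry run of the xattr/sparse-preserving tar pipeline used for the
# MDT copy; src/dst are throwaway temp dirs, not a real MDT.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "mdt-data" > "$src/sample"
# Same pipeline as the ssh copy: pack on one side, extract with -p on the
# other. --xattrs matters for a real MDT, since the trusted.* extended
# attributes hold Lustre metadata; --sparse keeps sparse files from
# ballooning during the copy.
(cd "$src" && tar cf - --xattrs --sparse .) | (cd "$dst" && tar xpf - --xattrs --sparse)
cat "$dst/sample"
```

On a real MDT the copy must be done as root so the trusted.* attributes are actually readable and restorable.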
*** OSS (myoss)
# umount /MDT
# tunefs.lustre --erase-param --mgsnode=192.168.1.32@tcp --writeconf
/dev/sd5
# tunefs.lustre --writeconf /dev/sda5
# mount -t lustre /dev/sda5 /OST
Mounting /OST succeeded, so I assumed everything should be okay. However,
the client failed to mount the filesystem with the following error:
# mount -t lustre 192.168.1.32:/lustre /mnt/tmp
mount.lustre: mount 192.168.1.32:/lustre at /mnt/tmp failed: File exists
On the OSS, I can see errors like this in /var/log/messages:
Dec 25 15:44:21 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 15:44:40 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 15:44:40 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 15:44:41 myoss kernel: LustreError: 13a-8: Failed to get MGS log
params and no local copy.
Dec 25 15:44:41 myoss kernel: LustreError: Skipped 1 previous similar
message
Dec 25 15:44:41 myoss kernel: Lustre: lustre-OST0000: Imperative Recovery
enabled, recovery window shrunk from 300-900 down to 150-450
Dec 25 15:44:46 myoss kernel: Lustre: lustre-OST0000: Will be in recovery
for at least 2:30, or until 1 client reconnects
Dec 25 15:44:46 myoss kernel: Lustre: lustre-OST0000: Recovery over after
0:01, of 1 clients 1 recovered and 0 were evicted.
Dec 25 15:44:46 myoss kernel: Lustre: lustre-OST0000: deleting orphan
objects from 0x0:8499078 to 0x0:8499169
Dec 25 15:46:57 myoss kernel: Lustre: Failing over lustre-OST0000
Dec 25 15:46:58 myoss kernel: Lustre: server umount lustre-OST0000 complete
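For what it's worth, my understanding from the Lustre manual is that "Failed to get MGS log params and no local copy" usually means a target was mounted before the MGS had regenerated its configuration logs after --writeconf. The prescribed ordering, sketched below with this setup's device names purely as placeholders, is roughly:

```shell
# Writeconf restart ordering per my reading of the Lustre manual; the
# device names and mount points are just this thread's values, shown as
# placeholders rather than commands to run blindly.
order=$(cat <<'EOF'
umount clients, then all OSTs, then the MDT
tunefs.lustre --writeconf /dev/sda6   # on the MDS (combined MGS/MDT) first
tunefs.lustre --writeconf /dev/sda5   # then on each OSS
mount -t lustre /dev/sda6 /MDT        # bring the MGS/MDT back first
mount -t lustre /dev/sda5 /OST        # then every OST
mount -t lustre 192.168.1.32@tcp:/lustre /mnt/tmp   # clients last
EOF
)
printf '%s\n' "$order"
```

I am not certain this ordering alone explains the failure here, but mounting the OST before the regenerated MGS logs existed would produce exactly these messages.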
On the new MDS, I see:
Dec 25 15:46:31 new_mds kernel: Lustre:
1738:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1387957584/real 1387957584]
req@ffff880279ed3c00 x1455378659279340/t0(0)
o13->lustre-OST0000-osc-MDT0000@192.168.1.34@tcp:7/4 lens 224/368 e 0 to 1
dl 1387957591 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Dec 25 15:46:31 new_mds kernel: Lustre: lustre-OST0000-osc-MDT0000:
Connection to lustre-OST0000 (at 192.168.1.34@tcp) was lost; in progress
operations using this service will wait for recovery to complete
Dec 25 15:46:37 new_mds kernel: Lustre:
1731:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1387957591/real 1387957591]
req@ffff880279ed3000 x1455378659279344/t0(0)
o8->lustre-OST0000-osc-MDT0000@192.168.1.34@tcp:28/4 lens 400/544 e 0 to 1
dl 1387957597 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 25 15:47:07 new_mds kernel: Lustre:
1731:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1387957616/real 1387957616]
req@ffff880279ed3400 x1455378659279352/t0(0)
o8->lustre-OST0000-osc-MDT0000@192.168.1.34@tcp:28/4 lens 400/544 e 0 to 1
dl 1387957627 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 25 15:47:31 new_mds kernel: Lustre: Found index 0 for lustre-OST0000,
updating log
Dec 25 15:47:37 new_mds kernel: Lustre:
1731:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1387957641/real 1387957641]
req@ffff880279ed3400 x1455378659279364/t0(0)
o8->lustre-OST0000-osc-MDT0000@192.168.1.34@tcp:28/4 lens 400/544 e 0 to 1
dl 1387957657 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 25 15:47:38 new_mds kernel: Lustre:
1777:0:(mgc_request.c:1645:mgc_process_recover_log()) Process recover log
lustre-mdtir error -22
Dec 25 15:48:02 new_mds kernel: Lustre:
1731:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1387957666/real 1387957666]
req@ffff88027b6fe000 x1455378659279452/t0(0)
o8->lustre-OST0000-osc-MDT0000@0@lo:28/4 lens 400/544 e 0 to 1 dl
1387957682 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 25 15:48:11 new_mds kernel: Lustre: lustre-OST0000-osc-MDT0000:
Connection restored to lustre-OST0000 (at 192.168.1.34@tcp)
Dec 25 15:54:27 new_mds kernel: Lustre: Failing over lustre-MDT0000
Dec 25 15:54:33 new_mds kernel: Lustre:
2396:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1387958067/real 1387958067]
req@ffff88026b260800 x1455378659279928/t0(0)
o251->MGC192.168.1.32@tcp@0@lo:26/25
lens 224/224 e 0 to 1 dl 1387958073 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 25 15:54:33 new_mds kernel: Lustre: server umount lustre-MDT0000
complete
Dec 25 15:54:39 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 15:54:58 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 15:54:58 new_mds kernel: Lustre: MGS: Logs for fs lustre were
removed by user request. All servers must be restarted in order to
regenerate the logs.
Dec 25 15:54:58 new_mds kernel: Lustre: lustre-MDT0000: used disk, loading
Dec 25 15:54:58 new_mds kernel: LustreError:
2486:0:(osd_io.c:950:osd_ldiskfs_read()) lustre=MDT0000: can't read
128@8192 on ino 28: rc = 0
Dec 25 15:54:58 new_mds kernel: LustreError:
2486:0:(mdt_recovery.c:112:mdt_clients_data_init()) error reading MDS
last_rcvd idx 0, off 8192: rc -14
Dec 25 15:54:58 new_mds kernel: LustreError: 11-0:
lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect
failed with -11.
After several hours of trials, I decided to fall back to the old MDS
server, so I repeated the same steps (again, several times):
*** OSS (myoss)
[root@myoss ~]# tunefs.lustre --erase-param
--mgsnode=192.168.1.7@tcp --writeconf /dev/sda5
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x1002
(OST no_primnode )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.1.7@tcp
Permanent disk data:
Target: lustre=OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.1.7@tcp
Writing CONFIGS/mountdata
[root@myoss ~]# tunefs.lustre --writeconf /dev/sda5
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.1.7@tcp
Permanent disk data:
Target: lustre=OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x1142
(OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.1.7@tcp
Writing CONFIGS/mountdata
[root@myoss ~]# mount -t lustre /dev/sda5 /OST
[root@myoss ~]# df -kh
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 29G 8.9G 19G 33% /
tmpfs 7.8G 72K 7.8G 1% /dev/shm
/dev/sda1 291M 62M 215M 23% /boot
/dev/sda4 9.7G 2.5G 6.7G 28% /tmp
/dev/sda5 11T 3.9T 6.5T 38% /OST
*** Old MDS (old_mds)
[root@old_mds ~]# mount -t lustre /dev/mapper/VolGroup00-LogVol03 -o nosvc
/MDT
[root@old_mds ~]# lctl replace_nids lustre-MDT0000 192.168.1.7@tcp
[root@old_mds ~]# lctl dl
0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 5
1 UP mgs MGS MGS 5
2 UP mgc MGC192.168.1.7@tcp 6094ff79-4ad8-d93f-1f37-307e324387e2 5
[root@old_mds ~]# umount /MDT
[root@old_mds ~]# tunefs.lustre --writeconf /dev/mapper/VolGroup00-LogVol03
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x5
(MDT MGS )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:
Permanent disk data:
Target: lustre=MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x105
(MDT MGS writeconf )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:
Writing CONFIGS/mountdata
[root@old_mds ~]# mount -t lustre /dev/mapper/VolGroup00-LogVol03 /MDT
[root@old_mds ~]# cat /proc/fs/lustre/devices
0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 8
1 UP mgs MGS MGS 7
2 UP mgc MGC192.168.1.7@tcp 28fc524e-9128-4b16-adbf-df94972b556a 5
3 UP mds MDS MDS_uuid 3
4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 7
6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
8 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5
9 AT osp lustre-OST0000-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 1
[root@old_mds ~]# df -k
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol02 30237648 11163024 17538624 39% /
tmpfs 4029452 72 4029380 1% /dev/shm
/dev/sda1 297485 62889 219236 23% /boot
/dev/mapper/VolGroup00-LogVol01 30237648 773892 27927756 3% /tmp
/dev/mapper/VolGroup00-LogVol03 78629560 7876016 65511384 11% /MDT
*** Client
[root@cola1 ~]# mount -t lustre 192.168.1.7@tcp0:/lustre /mnt
mount.lustre: mount 192.168.1.7@tcp0:/lustre at /mnt failed: No such file
or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
*** Messages on OSS
Dec 25 20:53:21 myoss modprobe: FATAL: Error inserting padlock_sha
(/lib/modules/2.6.32-358.18.1.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko):
No such device
Dec 25 20:53:25 myoss kernel: Lustre: Lustre: Build Version:
2.5.0-RC1--PRISTINE-2.6.32-358.18.1.el6_lustre.x86_64
Dec 25 20:53:26 myoss kernel: LNet: Added LNI 192.168.1.34@tcp [8/256/0/180]
Dec 25 20:53:26 myoss kernel: LNet: Accept secure, port 988
Dec 25 21:41:50 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 21:42:07 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 21:51:58 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 21:51:59 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 21:52:00 myoss kernel: LustreError: 13a-8: Failed to get MGS log
params and no local copy.
Dec 25 21:52:00 myoss kernel: LustreError: 13a-8: Failed to get MGS log
params and no local copy.
*** Messages on Old MDS
Dec 25 20:56:31 old_mds kernel: LNet: HW CPU cores: 8, npartitions: 2
Dec 25 20:56:31 old_mds modprobe: FATAL: Error inserting crc32c_intel
(/lib/modules/2.6.32-358.18.1.el6_lustre.x86_64/kernel/arch/x86/crypto/crc32c-intel.ko):
No such device
Dec 25 20:56:31 old_mds kernel: alg: No test for crc32 (crc32-table)
Dec 25 20:56:31 old_mds kernel: alg: No test for adler32 (adler32-zlib)
Dec 25 20:56:35 old_mds modprobe: FATAL: Error inserting padlock_sha
(/lib/modules/2.6.32-358.18.1.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko):
No such device
Dec 25 20:56:39 old_mds kernel: Lustre: Lustre: Build Version:
2.5.0-RC1--PRISTINE-2.6.32-358.18.1.el6_lustre.x86_64
Dec 25 20:56:40 old_mds kernel: LNet: Added LNI 192.168.1.7@tcp [8/256/0/180]
Dec 25 20:56:40 old_mds kernel: LNet: Added LNI 10.242.116.7@tcp1 [8/256/0/180]
Dec 25 20:56:40 old_mds kernel: LNet: Accept secure, port 988
Dec 25 21:45:23 old_mds kernel: LDISKFS-fs (dm-3): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 21:46:20 old_mds kernel: LustreError:
2378:0:(obd_mount_server.c:848:lustre_disconnect_lwp())
lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
Dec 25 21:46:20 old_mds kernel: LustreError:
2378:0:(obd_mount_server.c:1426:server_put_super()) MGS: failed to
disconnect lwp. (rc=-2)
Dec 25 21:46:26 old_mds kernel: Lustre:
2378:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1387979180/real 1387979180]
req@ffff8802193b0c00 x1455398531891216/t0(0)
o251->MGC192.168.1.7@tcp@0@lo:26/25
lens 224/224 e 0 to 1 dl 1387979186 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 25 21:46:26 old_mds kernel: Lustre: server umount MGS complete
Dec 25 21:52:43 old_mds kernel: LDISKFS-fs (dm-3): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 21:53:08 old_mds kernel: LDISKFS-fs (dm-3): mounted filesystem with
ordered data mode. quota=on. Opts:
Dec 25 21:53:08 old_mds kernel: Lustre: MGS: Logs for fs lustre were
removed by user request. All servers must be restarted in order to
regenerate the logs.
Dec 25 21:53:09 old_mds kernel: Lustre: lustre-MDT0000: used disk, loading
Dec 25 21:53:09 old_mds kernel: LustreError: 11-0:
lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect
failed with -11.
Dec 25 21:53:20 old_mds kernel: Lustre: MGS: Regenerating lustre-OST0000
log by user request.
Dec 25 21:53:27 old_mds kernel: Lustre:
2461:0:(mgc_request.c:1645:mgc_process_recover_log()) Process recover log
lustre-mdtir error -22
Dec 25 21:53:27 old_mds kernel: LustreError:
2516:0:(ldlm_lib.c:429:client_obd_setup()) can't add initial connection
Dec 25 21:53:27 old_mds kernel: LustreError:
2516:0:(osp_dev.c:684:osp_init0()) lustre-OST0000-osc-MDT0000: can't setup
obd: -2
Dec 25 21:53:27 old_mds kernel: LustreError:
2516:0:(obd_config.c:572:class_setup()) setup lustre-OST0000-osc-MDT0000
failed (-2)
Dec 25 21:53:27 old_mds kernel: LustreError:
2516:0:(obd_config.c:1591:class_config_llog_handler()) MGC192.168.1.7@tcp:
cfg command failed: rc = -2
Dec 25 21:53:27 old_mds kernel: Lustre: cmd=cf003
0:lustre-OST0000-osc-MDT0000 1:lustre-OST0000_UUID 2:0@<0:0>
During the tar copy, I accidentally created a file under /MDT and then
deleted it. Could this have damaged the MDT?
I then even tried adding failover nodes to the OSS and running both MDS
servers at the same time, then mounting with "mount -t lustre
192.168.1.7@tcp:/lustre /lustre" and "mount -t lustre
192.168.1.32@tcp:/lustre /lustre". Both fail.
Since we had almost run out of space, I put some important data onto
Lustre. If it can't be recovered, it will be a disaster for me… I hope
somebody can help. Thanks a lot.
Frank
8 years, 6 months
Lustre client for Ubuntu 12.04
by Robert Stites
I need a Lustre client for Ubuntu 12.04. My Lustre file system is running version 2.4 on CentOS 6.4. Can someone give me a pointer to achieving this? I have compiled from git://git.whamcloud.com/fs/lustre-release.git, but building the lustre-client with 'make debs' crashes. Running 'make' compiles fine. Thanks.
Rob Stites
Center for Spoken Language and Understanding
Research Associate
Phone: (503) 346-3764
Email: stites@ohsu.edu
Oregon Health & Science University (OHSU)
3181 SW Sam Jackson Park Rd. GH40
Portland, OR 97239-3098
8 years, 6 months