I'm thinking of changing 2.4.2 back to 2.5.0. Originally I "upgraded"
from 2.5.0 to 2.4.2 in the hope that it would fix the problem, and
because the database seems to be compatible between these two versions.
But since it still doesn't work, perhaps I should change back to 2.5.0.
Does anybody have an idea which version is better for later debugging or
reconfiguration? Currently old_mds is still on 2.5.0 and it hasn't
"talked" to the 2.4.2 OSS, since I haven't mounted it and haven't used
tunefs.lustre to change the mgsnode parameter on the OSS back to
old_mds. new_mds is already on 2.4.2.
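For reference, the revert I have in mind on the OSS would be roughly the
following (a sketch only, mirroring the erase-params/writeconf steps I
describe below; /dev/sda5 and the NIDs are the values from earlier in
this thread, and I assume the OST must be unmounted first):

```shell
# Sketch: point the OST back at old_mds instead of new_mds.
# Device path and NIDs are the ones used elsewhere in this thread.
umount /OST                                         # target must be offline
tunefs.lustre --erase-params /dev/sda5              # wipe stored parameters
tunefs.lustre --mgsnode=192.168.1.7@tcp /dev/sda5   # re-point at old_mds (was 192.168.1.32)
```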
BTW, does anybody know how to make a device that is in the AT (attach)
state go to the UP state? Perhaps that's a very important step in my
case. Today I tried "tunefs.lustre --erase-params /dev/..." and
"tunefs.lustre --writeconf /dev/...". The result now looks much more
similar to the case mentioned in my first mail, as below:
[root@old_mds ~]# cat /proc/fs/lustre/devices
0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 8
1 UP mgs MGS MGS 7
2 UP mgc MGC192.168.1.7@tcp 28fc524e-9128-4b16-adbf-df94972b556a 5
3 UP mds MDS MDS_uuid 3
4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 7
6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
8 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5
9 AT osp lustre-OST0000-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 1
In my later experiments after the first email, I couldn't even see the
OST device (device 9 above). Now at least it comes back, though still in
the AT state...
Actually I still don't understand why I can't restore them. I chose to
completely destroy the configuration with "--erase-params"
("--writeconf" by itself doesn't seem to destroy anything), which should
let me reconfigure everything from scratch. So I guess there must be
something more to destroy or configure...
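As I understand the manual, a full writeconf regeneration is supposed to
go roughly like this (my reading of the procedure, not something I have
verified end to end; device paths and mount points are the ones from my
setup below):

```shell
# My reading of the manual's writeconf procedure (untested here):
# 1. Stop everything: unmount clients, then the OST(s), then the MDT.
umount /lustre            # on each client
umount /OST               # on myoss
umount /MDT               # on the MDS
# 2. Regenerate the configuration logs, MDT first, then every OST.
tunefs.lustre --writeconf /dev/sda6   # MDT on new_mds
tunefs.lustre --writeconf /dev/sda5   # OST on myoss
# 3. Bring targets back in order: MDT first, then OST(s), then clients.
mount -t lustre /dev/sda6 /MDT
mount -t lustre /dev/sda5 /OST
mount -t lustre 192.168.1.32@tcp:/lustre /lustre   # on the client
```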
Regards,
Frank
On Sat, Dec 28, 2013 at 10:42 AM, Frank Yang <fsyang.tw(a)gmail.com> wrote:
Hi Andreas,
Thanks for your suggestions. But I have already tried to bring up
old_mds again (with the same address and data), as mentioned in my first
email. I did it right after I failed to bring up new_mds. In my last
mail I wanted to focus on new_mds, since old_mds can't work either (and
this way I don't need to move the data again once things are online). If
necessary, I can use old_mds again, since the OST should be able to work
with either one (both have the same metadata).
And actually I "upgraded" from 2.5.0 to 2.4.2, not from 1.8 to 2.4.2.
Sorry, my emails are very lengthy, so it's easy to get confused.
Actually I still don't know what "do a writeconf" means exactly. I have
tried "tunefs.lustre --writeconf" and "mount -t lustre -o writeconf",
but I don't see anything change. And now I am sometimes forced to add
"-o writeconf" to mount.lustre, otherwise I can't successfully mount the
OST or MDT.
I have tried carefully not to damage any data on the MDT and OST,
although I did accidentally write a file to the MDT on old_mds (mounted
as -t ldiskfs before the tar operation) and then deleted it after I
found it. I hope this won't screw up the filesystem. e2fsck on the
new_mds MDT and on the OST comes back clean, so I guess/hope a complete
reconfiguration can bring the filesystem back.
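For completeness, the checks I ran were along these lines (read-only
mode, with the device paths listed later in this thread; the exact
flags here are my reconstruction):

```shell
# Read-only consistency checks (no repairs attempted):
e2fsck -f -n /dev/sda6   # MDT on new_mds -- came back clean
e2fsck -f -n /dev/sda5   # OST on myoss -- came back clean
```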
Regards,
Frank
On Sat, Dec 28, 2013 at 5:07 AM, Dilger, Andreas <andreas.dilger(a)intel.com
> wrote:
> I think you are complicating your efforts by changing too many things
> at the same time. Upgrading Lustre from 1.8 to 2.4 and changing the
> hardware and changing the server addresses makes it very hard to know where
> the problem might be.
>
> I would recommend to move your new_mds to have the same hostname and IP
> address as old_mds and only do the upgrade first. I would guess by this
> point you also need to do a writeconf (MDS and OSS) to reset the
> configuration logs, since everything looks confused.
>
> Cheers, Andreas
>
> On Dec 27, 2013, at 6:35, "Frank Yang" <fsyang.tw(a)gmail.com> wrote:
>
> Hi all,
>
> I guess I may have provided too much info. Let me try to make things
> simpler. I have now decided to use only the new MGS/MDS server
> (192.168.1.32). As a result, I used "tar ... --xattrs ..." to copy the
> data to the new server again, following the lustre_manual.pdf. (Note
> that the old server stopped working after I fell back to it, as
> described in my last mail.) Then I did the following. By the way, I
> also "upgraded" Lustre 2.5.0 to Lustre 2.4.2 (since 2.4.2 was released
> after 2.5.0, I assume it has fewer bugs) on my CentOS 6.5.
>
> *******************************************************************
> Some basic info again
>
> *** Client (cola1)
>
> eth0: 10.242.116.6
>
> eth1: 192.168.1.6
>
> modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
>
> *** Old MDS (old_mds)
>
> eth0: 10.242.116.7
>
> eth1: 192.168.1.7
>
> modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
>
> MGS/MDS mount point: /MDT
>
> device: /dev/mapper/VolGroup00-LogVol03
>
> *** New MDS (new_mds)
>
> eth0: 10.242.116.32
>
> eth1: 192.168.1.32
>
> modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)
>
> MGS/MDS mount point: /MDT
>
> device: /dev/sda6
>
> *** OSS (myoss)
>
> eth0: 192.168.1.34
>
> eth1: Disabled
>
> modprobe.conf: options lnet ip2nets="tcp0 192.168.1.*"
>
> OST mount point: /OST
>
> device: /dev/sda5
>
> *******************************************************************
> I did the following. (Note that I had already done these steps several
> times before, but to capture the messages I had to do them again now;
> this means I didn't repeat the time-consuming tar.) To make the
> sequence and error messages clearer, I list the commands and messages
> on new_mds/myoss/cola1 in order.
>
>
> [root@new_mds ~]# mount -t lustre /dev/sda6 -o nosvc /MDT
> [root@new_mds ~]# lctl replace_nids lustre-MDT0000 192.168.1.32@tcp
> [root@new_mds ~]# cd
> [root@new_mds ~]# umount /MDT
> [root@new_mds ~]# mount -t lustre /dev/sda6 /MDT
> [root@new_mds ~]# lctl dl
> 0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 8
> 1 UP mgs MGS MGS 5
> 2 UP mgc MGC192.168.1.32@tcp b304f3ec-630a-940d-37e6-ce5ded6a6c71 5
> 3 UP mds MDS MDS_uuid 3
> 4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
> 5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 3
> 6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
> 7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
> 8 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5
>
>
> Dec 27 19:04:52 new_mds kernel: LNet: HW CPU cores: 8, npartitions: 2
> Dec 27 19:04:52 new_mds modprobe: FATAL: Error inserting crc32c_intel
> (/lib/modules/2.6.32-358.23.2.el6_lustre.x86_64/kernel/arch/x86/crypto/crc32c-intel.ko):
> No such device
> Dec 27 19:04:52 new_mds kernel: alg: No test for crc32 (crc32-table)
> Dec 27 19:04:52 new_mds kernel: alg: No test for adler32 (adler32-zlib)
> Dec 27 19:04:56 new_mds kernel: padlock: VIA PadLock Hash Engine not
> detected.
> Dec 27 19:04:56 new_mds modprobe: FATAL: Error inserting padlock_sha
> (/lib/modules/2.6.32-358.23.2.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko):
> No such device
> Dec 27 19:05:00 new_mds kernel: Lustre: Lustre: Build Version:
> 2.4.2-RC2--PRISTINE-2.6.32-358.23.2.el6_lustre.x86_64
> Dec 27 19:05:00 new_mds kernel: LNet: Added LNI 192.168.1.32@tcp[8/256/0/180]
> Dec 27 19:05:00 new_mds kernel: LNet: Added LNI 10.242.116.32@tcp1[8/256/0/180]
> Dec 27 19:05:00 new_mds kernel: LNet: Accept secure, port 988
> Dec 27 19:05:00 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem
> with ordered data mode. quota=on. Opts:
> Dec 27 19:06:21 new_mds kernel: LustreError:
> 25744:0:(obd_mount_server.c:865:lustre_disconnect_lwp())
> lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
> Dec 27 19:06:21 new_mds kernel: LustreError:
> 25744:0:(obd_mount_server.c:1443:server_put_super()) MGS: failed to
> disconnect lwp. (rc=-2)
> Dec 27 19:06:27 new_mds kernel: Lustre:
> 25744:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has
> timed out for slow reply: [sent 1388142381/real 1388142381]
> req@ffff880258d60800 x1455572700364820/t0(0) o251->MGC192.168.1.32@tcp
> @0@lo:26/25 lens 224/224 e 0 to 1 dl 1388142387 ref 2 fl
> Rpc:XN/0/ffffffff rc 0/-1
> Dec 27 19:06:27 new_mds kernel: Lustre: server umount MGS complete
> Dec 27 19:07:01 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem
> with ordered data mode. quota=on. Opts:
> Dec 27 19:07:01 new_mds kernel: Lustre: lustre-MDT0000: used disk, loading
> Dec 27 19:07:01 new_mds kernel: LustreError:
> 25814:0:(osd_io.c:1000:osd_ldiskfs_read()) lustre-MDT0000: can't read
> 128@8192 on ino 33: rc = 0
> Dec 27 19:07:01 new_mds kernel: LustreError:
> 25814:0:(mdt_recovery.c:112:mdt_clients_data_init()) error reading MDS
> last_rcvd idx 0, off 8192: rc -14
> Dec 27 19:07:01 new_mds kernel: LustreError: 11-0:
> lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation
> mds_connect failed with -11.
> Dec 27 19:07:01 new_mds kernel: Lustre: lustre-MDD0000: changelog on
>
>
> [root@myoss ~]# tunefs.lustre --erase-params /dev/sda5
> checking for existing Lustre data: found
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target: lustre-OST0000
> Index: 0
> Lustre FS: lustre
> Mount type: ldiskfs
> Flags: 0x1102
> (OST writeconf no_primnode )
> Persistent mount opts: errors=remount-ro
> Parameters: mgsnode=192.168.1.7@tcp
>
>
> Permanent disk data:
> Target: lustre=OST0000
> Index: 0
> Lustre FS: lustre
> Mount type: ldiskfs
> Flags: 0x1142
> (OST update writeconf no_primnode )
> Persistent mount opts: errors=remount-ro
> Parameters:
>
> Writing CONFIGS/mountdata
> [root@myoss ~]# tunefs.lustre --writeconf --mgsnode=192.168.1.32@tcp --ost /dev/sda5
>
> checking for existing Lustre data: found
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target: lustre-OST0000
> Index: 0
> Lustre FS: lustre
> Mount type: ldiskfs
> Flags: 0x1142
> (OST update writeconf no_primnode )
> Persistent mount opts: errors=remount-ro
> Parameters:
>
>
> Permanent disk data:
> Target: lustre=OST0000
> Index: 0
> Lustre FS: lustre
> Mount type: ldiskfs
> Flags: 0x1142
> (OST update writeconf no_primnode )
> Persistent mount opts: errors=remount-ro
> Parameters: mgsnode=192.168.1.32@tcp
>
> Writing CONFIGS/mountdata
> [root@myoss ~]# mount -t lustre /dev/sda5 /OST
> mount.lustre: mount /dev/sda5 at /OST failed: File exists
>
>
> Dec 27 19:13:30 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
> ordered data mode. quota=on. Opts:
> Dec 27 19:13:51 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
> ordered data mode. quota=on. Opts:
> Dec 27 19:14:11 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
> ordered data mode. quota=on. Opts:
> Dec 27 19:14:11 myoss kernel: LNet: HW CPU cores: 12, npartitions: 4
> Dec 27 19:14:11 myoss kernel: alg: No test for crc32 (crc32-table)
> Dec 27 19:14:11 myoss kernel: alg: No test for adler32 (adler32-zlib)
> Dec 27 19:14:11 myoss kernel: alg: No test for crc32 (crc32-pclmul)
> Dec 27 19:14:15 myoss kernel: padlock: VIA PadLock Hash Engine not
> detected.
> Dec 27 19:14:15 myoss modprobe: FATAL: Error inserting padlock_sha
> (/lib/modules/2.6.32-358.23.2.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko):
> No such device
> Dec 27 19:14:20 myoss kernel: Lustre: Lustre: Build Version:
> 2.4.2-RC2--PRISTINE-2.6.32-358.23.2.el6_lustre.x86_64
> Dec 27 19:14:20 myoss kernel: LNet: Added LNI 192.168.1.34@tcp[8/256/0/180]
> Dec 27 19:14:20 myoss kernel: LNet: Accept secure, port 988
> Dec 27 19:14:20 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
> ordered data mode. quota=on. Opts:
> Dec 27 19:14:21 myoss kernel: LustreError: 15f-b: lustre-OST0000: cannot
> register this server with the MGS: rc = -17. Is the MGS running?
> Dec 27 19:14:21 myoss kernel: LustreError:
> 2492:0:(obd_mount_server.c:1716:server_fill_super()) Unable to start
> targets: -17
> Dec 27 19:14:21 myoss kernel: LustreError:
> 2492:0:(obd_mount_server.c:865:lustre_disconnect_lwp())
> lustre-MDT0000-lwp-OST0000: Can't end config log lustre-client.
> Dec 27 19:14:21 myoss kernel: LustreError:
> 2492:0:(obd_mount_server.c:1443:server_put_super()) lustre-OST0000: failed
> to disconnect lwp. (rc=-2)
> Dec 27 19:14:21 myoss kernel: LustreError:
> 2492:0:(obd_mount_server.c:1473:server_put_super()) no obd lustre-OST0000
> Dec 27 19:14:21 myoss kernel: LustreError:
> 2492:0:(obd_mount_server.c:135:server_deregister_mount()) lustre-OST0000
> not registered
> Dec 27 19:14:21 myoss kernel: Lustre: server umount lustre-OST0000
> complete
> Dec 27 19:14:21 myoss kernel: LustreError:
> 2492:0:(obd_mount.c:1289:lustre_fill_super()) Unable to mount (-17)
>
>
> [root@myoss ~]# mount -t lustre -o writeconf /dev/sda5 /OST # Note
> that "-o writeconf" can work. Why???
>
>
> Dec 27 19:15:41 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with
> ordered data mode. quota=on. Opts:
> Dec 27 19:15:42 myoss kernel: LustreError:
> 2677:0:(obd_mount_server.c:1140:server_register_target()) lustre-OST0000:
> error registering with the MGS: rc = -17 (not fatal)
> Dec 27 19:15:42 myoss kernel: Lustre: lustre-OST0000: Imperative Recovery
> enabled, recovery window shrunk from 300-900 down to 150-450
>
>
> [root@myoss ~]# lctl dl
> 0 UP osd-ldiskfs lustre-OST0000-osd lustre-OST0000-osd_UUID 5
> 1 UP mgc MGC192.168.1.32@tcp 156c5656-12d7-81ba-bbe5-70de335088ef 5
> 2 UP ost OSS OSS_uuid 3
> 3 UP obdfilter lustre-OST0000 lustre-OST0000_UUID 4
> 4 UP lwp lustre-MDT0000-lwp-OST0000 lustre-MDT0000-lwp-OST0000_UUID 5
>
>
> Dec 27 19:15:41 new_mds kernel: Lustre: MGS: Regenerating lustre-OST0000
> log by user request.
> Dec 27 19:15:41 new_mds kernel: LustreError:
> 25783:0:(llog.c:250:llog_init_handle()) MGS: llog uuid mismatch:
> config_uuid/
> Dec 27 19:15:41 new_mds kernel: LustreError:
> 25783:0:(mgs_llog.c:1454:record_start_log()) MGS: can't start log
> lustre-MDT0000: rc = -17
> Dec 27 19:15:41 new_mds kernel: LustreError:
> 25783:0:(mgs_llog.c:3658:mgs_write_log_target()) Can't write logs for
> lustre-OST0000 (-17)
> Dec 27 19:15:41 new_mds kernel: LustreError:
> 25783:0:(mgs_handler.c:408:mgs_handle_target_reg()) Failed to write
> lustre-OST0000 log (-17)
> Dec 27 19:17:02 new_mds kernel: Lustre: MGS: Regenerating lustre-OST0000
> log by user request.
> Dec 27 19:17:02 new_mds kernel: Lustre: Found index 0 for lustre-OST0000,
> updating log
> Dec 27 19:17:02 new_mds kernel: Lustre: Client log for lustre-OST0000 was
> not updated; writeconf the MDT first to regenerate it.
> Dec 27 19:17:02 new_mds kernel: LustreError:
> 25782:0:(llog.c:250:llog_init_handle()) MGS: llog uuid mismatch:
> config_uuid/
> Dec 27 19:17:02 new_mds kernel: LustreError:
> 25782:0:(mgs_llog.c:1454:record_start_log()) MGS: can't start log
> lustre-MDT0000: rc = -17
> Dec 27 19:17:02 new_mds kernel: LustreError:
> 25782:0:(mgs_llog.c:3658:mgs_write_log_target()) Can't write logs for
> lustre-OST0000 (-17)
> Dec 27 19:17:02 new_mds kernel: LustreError:
> 25782:0:(mgs_handler.c:408:mgs_handle_target_reg()) Failed to write
> lustre-OST0000 log (-17)
> Dec 27 19:17:09 new_mds kernel: Lustre:
> 25785:0:(mgc_request.c:1564:mgc_process_recover_log()) Process recover log
> lustre-mdtir error -22
>
>
> [root@new_mds ~]# lctl dl
> 0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 8
> 1 UP mgs MGS MGS 7
> 2 UP mgc MGC192.168.1.32@tcp b304f3ec-630a-940d-37e6-ce5ded6a6c71 5
> 3 UP mds MDS MDS_uuid 3
> 4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
> 5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 7
> 6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
> 7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
> 8 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5
>
> Dec 27 19:17:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more
> than 120 seconds.
> Dec 27 19:17:50 myoss kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 27 19:17:50 myoss kernel: tgt_recov D 000000000000000b 0
> 2775 2 0x00000080
> Dec 27 19:17:50 myoss kernel: ffff88044e523e00 0000000000000046
> 0000000000000000 0000000000000003
> Dec 27 19:17:50 myoss kernel: ffff88044e523d90 ffffffff81055f96
> ffff88044e523da0 ffff880474742ae0
> Dec 27 19:17:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8
> 000000000000fb88 ffff8804541cf058
> Dec 27 19:17:50 myoss kernel: Call Trace:
> Dec 27 19:17:50 myoss kernel: [<ffffffff81055f96>] ?
> enqueue_task+0x66/0x80
> Dec 27 19:17:50 myoss kernel: [<ffffffffa0700070>] ?
> check_for_clients+0x0/0x70 [ptlrpc]
> Dec 27 19:17:50 myoss kernel: [<ffffffffa070172d>]
> target_recovery_overseer+0x9d/0x230 [ptlrpc]
> Dec 27 19:17:50 myoss kernel: [<ffffffffa06ffd60>] ?
> exp_connect_healthy+0x0/0x20 [ptlrpc]
> Dec 27 19:17:50 myoss kernel: [<ffffffff81096da0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 27 19:17:50 myoss kernel: [<ffffffffa070856e>]
> target_recovery_thread+0x58e/0x1970 [ptlrpc]
> Dec 27 19:17:50 myoss kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
> Dec 27 19:17:50 myoss kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:17:50 myoss kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
> Dec 27 19:17:50 myoss kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:17:50 myoss kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:17:50 myoss kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
>
>
> [root@cola1 ~]# mount -t lustre 192.168.1.32@tcp:/lustre /lustre
>
>
> Dec 27 19:21:05 cola1 kernel: LNet: HW CPU cores: 8, npartitions: 2
> Dec 27 19:21:05 cola1 modprobe: FATAL: Error inserting crc32c_intel
> (/lib/modules/2.6.32-358.18.1.el6.x86_64/kernel/arch/x86/crypto/crc32c-intel.ko):
> No such device
> Dec 27 19:21:05 cola1 kernel: alg: No test for crc32 (crc32-table)
> Dec 27 19:21:05 cola1 kernel: alg: No test for adler32 (adler32-zlib)
> Dec 27 19:21:09 cola1 kernel: padlock: VIA PadLock Hash Engine not
> detected.
> Dec 27 19:21:09 cola1 modprobe: FATAL: Error inserting padlock_sha
> (/lib/modules/2.6.32-358.18.1.el6.x86_64/kernel/drivers/crypto/padlock-sha.ko):
> No such device
> Dec 27 19:21:13 cola1 kernel: Lustre: Lustre: Build Version:
> 2.5.0-RC1--PRISTINE-2.6.32-358.18.1.el6.x86_64
> Dec 27 19:21:13 cola1 kernel: LNet: Added LNI 192.168.1.6@tcp[8/256/0/180]
> Dec 27 19:21:13 cola1 kernel: LNet: Added LNI 10.242.116.6@tcp1[8/256/0/180]
> Dec 27 19:21:13 cola1 kernel: LNet: Accept secure, port 988
> Dec 27 19:21:13 cola1 kernel: Lustre:
> 19583:0:(mgc_request.c:1645:mgc_process_recover_log()) Process recover log
> lustre-cliir error -22
> Dec 27 19:21:13 cola1 kernel: LustreError: 11-0:
> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
> operation mds_connect failed with -11.
> Dec 27 19:21:38 cola1 kernel: LustreError: 11-0:
> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
> operation mds_connect failed with -11.
> Dec 27 19:22:03 cola1 kernel: LustreError: 11-0:
> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
> operation mds_connect failed with -11.
> Dec 27 19:22:28 cola1 kernel: LustreError: 11-0:
> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
> operation mds_connect failed with -11.
> Dec 27 19:22:53 cola1 kernel: LustreError: 11-0:
> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
> operation mds_connect failed with -11.
> Dec 27 19:23:18 cola1 kernel: LustreError: 11-0:
> lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp,
> operation mds_connect failed with -11.
>
>
> ### No additional messages seen on new_mds
>
>
> Dec 27 19:19:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more
> than 120 seconds.
> Dec 27 19:19:50 myoss kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 27 19:19:50 myoss kernel: tgt_recov D 000000000000000b 0
> 2775 2 0x00000080
> Dec 27 19:19:50 myoss kernel: ffff88044e523e00 0000000000000046
> 0000000000000000 0000000000000003
> Dec 27 19:19:50 myoss kernel: ffff88044e523d90 ffffffff81055f96
> ffff88044e523da0 ffff880474742ae0
> Dec 27 19:19:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8
> 000000000000fb88 ffff8804541cf058
> Dec 27 19:19:50 myoss kernel: Call Trace:
> Dec 27 19:19:50 myoss kernel: [<ffffffff81055f96>] ?
> enqueue_task+0x66/0x80
> Dec 27 19:19:50 myoss kernel: [<ffffffffa0700070>] ?
> check_for_clients+0x0/0x70 [ptlrpc]
> Dec 27 19:19:50 myoss kernel: [<ffffffffa070172d>]
> target_recovery_overseer+0x9d/0x230 [ptlrpc]
> Dec 27 19:19:50 myoss kernel: [<ffffffffa06ffd60>] ?
> exp_connect_healthy+0x0/0x20 [ptlrpc]
> Dec 27 19:19:50 myoss kernel: [<ffffffff81096da0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 27 19:19:50 myoss kernel: [<ffffffffa070856e>]
> target_recovery_thread+0x58e/0x1970 [ptlrpc]
> Dec 27 19:19:50 myoss kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
> Dec 27 19:19:50 myoss kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:19:50 myoss kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
> Dec 27 19:19:50 myoss kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:19:50 myoss kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:19:50 myoss kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
> Dec 27 19:21:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more
> than 120 seconds.
> Dec 27 19:21:50 myoss kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 27 19:21:50 myoss kernel: tgt_recov D 000000000000000b 0
> 2775 2 0x00000080
> Dec 27 19:21:50 myoss kernel: ffff88044e523e00 0000000000000046
> 0000000000000000 0000000000000003
> Dec 27 19:21:50 myoss kernel: ffff88044e523d90 ffffffff81055f96
> ffff88044e523da0 ffff880474742ae0
> Dec 27 19:21:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8
> 000000000000fb88 ffff8804541cf058
> Dec 27 19:21:50 myoss kernel: Call Trace:
> Dec 27 19:21:50 myoss kernel: [<ffffffff81055f96>] ?
> enqueue_task+0x66/0x80
> Dec 27 19:21:50 myoss kernel: [<ffffffffa0700070>] ?
> check_for_clients+0x0/0x70 [ptlrpc]
> Dec 27 19:21:50 myoss kernel: [<ffffffffa070172d>]
> target_recovery_overseer+0x9d/0x230 [ptlrpc]
> Dec 27 19:21:50 myoss kernel: [<ffffffffa06ffd60>] ?
> exp_connect_healthy+0x0/0x20 [ptlrpc]
> Dec 27 19:21:50 myoss kernel: [<ffffffff81096da0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 27 19:21:50 myoss kernel: [<ffffffffa070856e>]
> target_recovery_thread+0x58e/0x1970 [ptlrpc]
> Dec 27 19:21:50 myoss kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
> Dec 27 19:21:50 myoss kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:21:50 myoss kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
> Dec 27 19:21:50 myoss kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:21:50 myoss kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:21:50 myoss kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
> Dec 27 19:23:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more
> than 120 seconds.
> Dec 27 19:23:50 myoss kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 27 19:23:50 myoss kernel: tgt_recov D 000000000000000b 0
> 2775 2 0x00000080
> Dec 27 19:23:50 myoss kernel: ffff88044e523e00 0000000000000046
> 0000000000000000 0000000000000003
> Dec 27 19:23:50 myoss kernel: ffff88044e523d90 ffffffff81055f96
> ffff88044e523da0 ffff880474742ae0
> Dec 27 19:23:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8
> 000000000000fb88 ffff8804541cf058
> Dec 27 19:23:50 myoss kernel: Call Trace:
> Dec 27 19:23:50 myoss kernel: [<ffffffff81055f96>] ?
> enqueue_task+0x66/0x80
> Dec 27 19:23:50 myoss kernel: [<ffffffffa0700070>] ?
> check_for_clients+0x0/0x70 [ptlrpc]
> Dec 27 19:23:50 myoss kernel: [<ffffffffa070172d>]
> target_recovery_overseer+0x9d/0x230 [ptlrpc]
> Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa06ffd60>] ?
> exp_connect_healthy+0x0/0x20 [ptlrpc]
> Dec 27 19:23:50 pepsi3 kernel: [<ffffffff81096da0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa070856e>]
> target_recovery_thread+0x58e/0x1970 [ptlrpc]
> Dec 27 19:23:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
> Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:23:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
> Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:23:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
> Dec 27 19:25:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more
> than 120 seconds.
> Dec 27 19:25:50 pepsi3 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 27 19:25:50 pepsi3 kernel: tgt_recov D 000000000000000b 0
> 2775 2 0x00000080
> Dec 27 19:25:50 pepsi3 kernel: ffff88044e523e00 0000000000000046
> 0000000000000000 0000000000000003
> Dec 27 19:25:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96
> ffff88044e523da0 ffff880474742ae0
> Dec 27 19:25:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8
> 000000000000fb88 ffff8804541cf058
> Dec 27 19:25:50 pepsi3 kernel: Call Trace:
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffff81055f96>] ?
> enqueue_task+0x66/0x80
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0700070>] ?
> check_for_clients+0x0/0x70 [ptlrpc]
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa070172d>]
> target_recovery_overseer+0x9d/0x230 [ptlrpc]
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa06ffd60>] ?
> exp_connect_healthy+0x0/0x20 [ptlrpc]
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffff81096da0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa070856e>]
> target_recovery_thread+0x58e/0x1970 [ptlrpc]
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:25:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
> Dec 27 19:27:14 pepsi3 ntpd_intres[1856]: host name not found:
> qrdcntp.quanta.corp
> Dec 27 19:27:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more
> than 120 seconds.
> Dec 27 19:27:50 pepsi3 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 27 19:27:50 pepsi3 kernel: tgt_recov D 000000000000000b 0
> 2775 2 0x00000080
> Dec 27 19:27:50 pepsi3 kernel: ffff88044e523e00 0000000000000046
> 0000000000000000 0000000000000003
> Dec 27 19:27:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96
> ffff88044e523da0 ffff880474742ae0
> Dec 27 19:27:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8
> 000000000000fb88 ffff8804541cf058
> Dec 27 19:27:50 pepsi3 kernel: Call Trace:
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffff81055f96>] ?
> enqueue_task+0x66/0x80
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0700070>] ?
> check_for_clients+0x0/0x70 [ptlrpc]
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa070172d>]
> target_recovery_overseer+0x9d/0x230 [ptlrpc]
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa06ffd60>] ?
> exp_connect_healthy+0x0/0x20 [ptlrpc]
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffff81096da0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa070856e>]
> target_recovery_thread+0x58e/0x1970 [ptlrpc]
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:27:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
> Dec 27 19:29:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more
> than 120 seconds.
> Dec 27 19:29:50 pepsi3 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 27 19:29:50 pepsi3 kernel: tgt_recov D 000000000000000b 0
> 2775 2 0x00000080
> Dec 27 19:29:50 pepsi3 kernel: ffff88044e523e00 0000000000000046
> 0000000000000000 0000000000000003
> Dec 27 19:29:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96
> ffff88044e523da0 ffff880474742ae0
> Dec 27 19:29:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8
> 000000000000fb88 ffff8804541cf058
> Dec 27 19:29:50 pepsi3 kernel: Call Trace:
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffff81055f96>] ?
> enqueue_task+0x66/0x80
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0700070>] ?
> check_for_clients+0x0/0x70 [ptlrpc]
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa070172d>]
> target_recovery_overseer+0x9d/0x230 [ptlrpc]
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa06ffd60>] ?
> exp_connect_healthy+0x0/0x20 [ptlrpc]
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffff81096da0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa070856e>]
> target_recovery_thread+0x58e/0x1970 [ptlrpc]
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:29:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
> Dec 27 19:31:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more
> than 120 seconds.
> Dec 27 19:31:50 pepsi3 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 27 19:31:50 pepsi3 kernel: tgt_recov D 000000000000000b 0
> 2775 2 0x00000080
> Dec 27 19:31:50 pepsi3 kernel: ffff88044e523e00 0000000000000046
> 0000000000000000 0000000000000003
> Dec 27 19:31:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96
> ffff88044e523da0 ffff880474742ae0
> Dec 27 19:31:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8
> 000000000000fb88 ffff8804541cf058
> Dec 27 19:31:50 pepsi3 kernel: Call Trace:
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffff81055f96>] ?
> enqueue_task+0x66/0x80
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0700070>] ?
> check_for_clients+0x0/0x70 [ptlrpc]
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa070172d>]
> target_recovery_overseer+0x9d/0x230 [ptlrpc]
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa06ffd60>] ?
> exp_connect_healthy+0x0/0x20 [ptlrpc]
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffff81096da0>] ?
> autoremove_wake_function+0x0/0x40
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa070856e>]
> target_recovery_thread+0x58e/0x1970 [ptlrpc]
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0707fe0>] ?
> target_recovery_thread+0x0/0x1970 [ptlrpc]
> Dec 27 19:31:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
> Dec 27 19:33:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more
> than 120 seconds.
>
> ### The same call trace messages are printed again and again until I
> CTRL-C the client...
>
> ### No new messages on myoss and new_mds.
>
>
>
> The result is different from the previous one, although I don't think
> there's much difference in what I did. For example, previously the
> "lctl dl" report showed an "AT" (attached?) OST device. Now, however,
> no OST device is reported at all.
>
> I have seen one mail saying that doing a "writeconf" can solve some
> problems. However, I've tried "tunefs.lustre --writeconf /dev/sda6"
> and even now "mount -t lustre -o writeconf". I don't know which one is
> correct; perhaps they are both wrong.
>
> Or, if there's no way to solve this problem, is there any means to
> extract the data from the ldiskfs filesystem without a client?
>
> Regards,
> Frank
>
>
>
> _______________________________________________
> HPDD-discuss mailing list
> HPDD-discuss(a)lists.01.org
> https://lists.01.org/mailman/listinfo/hpdd-discuss
>
>