Hi all,

I may have provided too much information earlier, so let me simplify. I have decided to use only the new MGS/MDS server (192.168.1.32). I therefore used "tar ... --xattrs..." to copy the data to the new server again, following lustre_manual.pdf. (Note that the old server has stopped working since I fell back to it in my last mail.) I then did the steps below. By the way, I also "upgraded" from Lustre 2.5.0 to Lustre 2.4.2 on my CentOS 6.5 (since 2.4.2 was released after 2.5.0, I assume it has fewer bugs).

*******************************************************************
Some basic info again

*** Client (cola1)

eth0: 10.242.116.6

eth1: 192.168.1.6

modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)


*** Old MDS (old_mds)

eth0: 10.242.116.7

eth1: 192.168.1.7

modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)

MGS/MDS mount point: /MDT

device: /dev/mapper/VolGroup00-LogVol03


*** New MDS (new_mds)

eth0: 10.242.116.32

eth1: 192.168.1.32

modprobe.conf: options lnet networks=tcp0(eth1),tcp1(eth0)

MGS/MDS mount point: /MDT

device: /dev/sda6


*** OSS (myoss)

eth0: 192.168.1.34

eth1: Disabled

modprobe.conf: options lnet ip2nets="tcp0 192.168.1.*"

OST mount point: /OST

device: /dev/sda5


*******************************************************************
What I did: (I had already run these steps several times before. To capture the messages I ran them again now, but without repeating the time-consuming tar.) To make the sequence and the error messages clearer, I list the commands and messages on new_mds/myoss/cola1 in order.


[root@new_mds ~]# mount -t lustre /dev/sda6 -o nosvc /MDT
[root@new_mds ~]# lctl replace_nids lustre-MDT0000 192.168.1.32@tcp
[root@new_mds ~]# cd
[root@new_mds ~]# umount /MDT
[root@new_mds ~]# mount -t lustre /dev/sda6 /MDT
[root@new_mds ~]# lctl dl
  0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 8
  1 UP mgs MGS MGS 5
  2 UP mgc MGC192.168.1.32@tcp b304f3ec-630a-940d-37e6-ce5ded6a6c71 5
  3 UP mds MDS MDS_uuid 3
  4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
  5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 3
  6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
  7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
  8 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5


Dec 27 19:04:52 new_mds kernel: LNet: HW CPU cores: 8, npartitions: 2
Dec 27 19:04:52 new_mds modprobe: FATAL: Error inserting crc32c_intel (/lib/modules/2.6.32-358.23.2.el6_lustre.x86_64/kernel/arch/x86/crypto/crc32c-intel.ko): No such device
Dec 27 19:04:52 new_mds kernel: alg: No test for crc32 (crc32-table)
Dec 27 19:04:52 new_mds kernel: alg: No test for adler32 (adler32-zlib)
Dec 27 19:04:56 new_mds kernel: padlock: VIA PadLock Hash Engine not detected.
Dec 27 19:04:56 new_mds modprobe: FATAL: Error inserting padlock_sha (/lib/modules/2.6.32-358.23.2.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko): No such device
Dec 27 19:05:00 new_mds kernel: Lustre: Lustre: Build Version: 2.4.2-RC2--PRISTINE-2.6.32-358.23.2.el6_lustre.x86_64
Dec 27 19:05:00 new_mds kernel: LNet: Added LNI 192.168.1.32@tcp [8/256/0/180]
Dec 27 19:05:00 new_mds kernel: LNet: Added LNI 10.242.116.32@tcp1 [8/256/0/180]
Dec 27 19:05:00 new_mds kernel: LNet: Accept secure, port 988
Dec 27 19:05:00 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 27 19:06:21 new_mds kernel: LustreError: 25744:0:(obd_mount_server.c:865:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
Dec 27 19:06:21 new_mds kernel: LustreError: 25744:0:(obd_mount_server.c:1443:server_put_super()) MGS: failed to disconnect lwp. (rc=-2)
Dec 27 19:06:27 new_mds kernel: Lustre: 25744:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1388142381/real 1388142381]  req@ffff880258d60800 x1455572700364820/t0(0) o251->MGC192.168.1.32@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1388142387 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 27 19:06:27 new_mds kernel: Lustre: server umount MGS complete
Dec 27 19:07:01 new_mds kernel: LDISKFS-fs (sda6): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 27 19:07:01 new_mds kernel: Lustre: lustre-MDT0000: used disk, loading
Dec 27 19:07:01 new_mds kernel: LustreError: 25814:0:(osd_io.c:1000:osd_ldiskfs_read()) lustre-MDT0000: can't read 128@8192 on ino 33: rc = 0
Dec 27 19:07:01 new_mds kernel: LustreError: 25814:0:(mdt_recovery.c:112:mdt_clients_data_init()) error reading MDS last_rcvd idx 0, off 8192: rc -14
Dec 27 19:07:01 new_mds kernel: LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Dec 27 19:07:01 new_mds kernel: Lustre: lustre-MDD0000: changelog on


[root@myoss ~]# tunefs.lustre --erase-params /dev/sda5       
checking for existing Lustre data: found                      
Reading CONFIGS/mountdata                                     

   Read previous values:
Target:     lustre-OST0000
Index:      0            
Lustre FS:  lustre       
Mount type: ldiskfs      
Flags:      0x1102       
              (OST writeconf no_primnode )
Persistent mount opts: errors=remount-ro 
Parameters: mgsnode=192.168.1.7@tcp      


   Permanent disk data:
Target:     lustre=OST0000
Index:      0            
Lustre FS:  lustre       
Mount type: ldiskfs      
Flags:      0x1142       
              (OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro        
Parameters:                                     

Writing CONFIGS/mountdata
[root@myoss ~]# tunefs.lustre --writeconf --mgsnode=192.168.1.32@tcp --ost /dev/sda5                                                                          
checking for existing Lustre data: found                                       
Reading CONFIGS/mountdata                                                      

   Read previous values:
Target:     lustre-OST0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x1142
              (OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro
Parameters:


   Permanent disk data:
Target:     lustre=OST0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x1142
              (OST update writeconf no_primnode )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.1.32@tcp

Writing CONFIGS/mountdata
[root@myoss ~]# mount -t lustre /dev/sda5 /OST
mount.lustre: mount /dev/sda5 at /OST failed: File exists


Dec 27 19:13:30 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 27 19:13:51 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 27 19:14:11 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 27 19:14:11 myoss kernel: LNet: HW CPU cores: 12, npartitions: 4
Dec 27 19:14:11 myoss kernel: alg: No test for crc32 (crc32-table)
Dec 27 19:14:11 myoss kernel: alg: No test for adler32 (adler32-zlib)
Dec 27 19:14:11 myoss kernel: alg: No test for crc32 (crc32-pclmul)
Dec 27 19:14:15 myoss kernel: padlock: VIA PadLock Hash Engine not detected.
Dec 27 19:14:15 myoss modprobe: FATAL: Error inserting padlock_sha (/lib/modules/2.6.32-358.23.2.el6_lustre.x86_64/kernel/drivers/crypto/padlock-sha.ko): No such device
Dec 27 19:14:20 myoss kernel: Lustre: Lustre: Build Version: 2.4.2-RC2--PRISTINE-2.6.32-358.23.2.el6_lustre.x86_64
Dec 27 19:14:20 myoss kernel: LNet: Added LNI 192.168.1.34@tcp [8/256/0/180]
Dec 27 19:14:20 myoss kernel: LNet: Accept secure, port 988
Dec 27 19:14:20 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 27 19:14:21 myoss kernel: LustreError: 15f-b: lustre-OST0000: cannot register this server with the MGS: rc = -17. Is the MGS running?
Dec 27 19:14:21 myoss kernel: LustreError: 2492:0:(obd_mount_server.c:1716:server_fill_super()) Unable to start targets: -17
Dec 27 19:14:21 myoss kernel: LustreError: 2492:0:(obd_mount_server.c:865:lustre_disconnect_lwp()) lustre-MDT0000-lwp-OST0000: Can't end config log lustre-client.
Dec 27 19:14:21 myoss kernel: LustreError: 2492:0:(obd_mount_server.c:1443:server_put_super()) lustre-OST0000: failed to disconnect lwp. (rc=-2)
Dec 27 19:14:21 myoss kernel: LustreError: 2492:0:(obd_mount_server.c:1473:server_put_super()) no obd lustre-OST0000
Dec 27 19:14:21 myoss kernel: LustreError: 2492:0:(obd_mount_server.c:135:server_deregister_mount()) lustre-OST0000 not registered
Dec 27 19:14:21 myoss kernel: Lustre: server umount lustre-OST0000 complete
Dec 27 19:14:21 myoss kernel: LustreError: 2492:0:(obd_mount.c:1289:lustre_fill_super()) Unable to mount  (-17)
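
As a reading aid for the return codes in these logs: the kernel reports negative Linux errno values, so "rc = -17" corresponds to the "File exists" text printed by mount.lustre. A minimal sketch of that mapping follows. The function name errname is only illustrative, and the table covers just the codes that appear in this mail.

```shell
#!/bin/sh
# Map the negative return codes seen in the logs above to Linux errno names.
# Only the codes that actually appear in this mail are listed.
errname() {
    case "${1#-}" in          # strip the leading minus sign
        2)  echo "ENOENT (No such file or directory)" ;;
        11) echo "EAGAIN (Resource temporarily unavailable)" ;;
        14) echo "EFAULT (Bad address)" ;;
        17) echo "EEXIST (File exists)" ;;
        22) echo "EINVAL (Invalid argument)" ;;
        *)  echo "unknown ($1)" ;;
    esac
}

for rc in -2 -11 -14 -17 -22; do
    printf 'rc = %s -> %s\n' "$rc" "$(errname "$rc")"
done
```

The names only say what kind of error the kernel returned. They do not explain why the MGS considers the target as already existing.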


[root@myoss ~]# mount -t lustre -o writeconf /dev/sda5 /OST    # Note that the mount succeeds with "-o writeconf". Why?


Dec 27 19:15:41 myoss kernel: LDISKFS-fs (sda5): mounted filesystem with ordered data mode. quota=on. Opts:
Dec 27 19:15:42 myoss kernel: LustreError: 2677:0:(obd_mount_server.c:1140:server_register_target()) lustre-OST0000: error registering with the MGS: rc = -17 (not fatal)
Dec 27 19:15:42 myoss kernel: Lustre: lustre-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450


[root@myoss ~]# lctl dl
  0 UP osd-ldiskfs lustre-OST0000-osd lustre-OST0000-osd_UUID 5
  1 UP mgc MGC192.168.1.32@tcp 156c5656-12d7-81ba-bbe5-70de335088ef 5
  2 UP ost OSS OSS_uuid 3
  3 UP obdfilter lustre-OST0000 lustre-OST0000_UUID 4
  4 UP lwp lustre-MDT0000-lwp-OST0000 lustre-MDT0000-lwp-OST0000_UUID 5


Dec 27 19:15:41 new_mds kernel: Lustre: MGS: Regenerating lustre-OST0000 log by user request.
Dec 27 19:15:41 new_mds kernel: LustreError: 25783:0:(llog.c:250:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
Dec 27 19:15:41 new_mds kernel: LustreError: 25783:0:(mgs_llog.c:1454:record_start_log()) MGS: can't start log lustre-MDT0000: rc = -17
Dec 27 19:15:41 new_mds kernel: LustreError: 25783:0:(mgs_llog.c:3658:mgs_write_log_target()) Can't write logs for lustre-OST0000 (-17)
Dec 27 19:15:41 new_mds kernel: LustreError: 25783:0:(mgs_handler.c:408:mgs_handle_target_reg()) Failed to write lustre-OST0000 log (-17)
Dec 27 19:17:02 new_mds kernel: Lustre: MGS: Regenerating lustre-OST0000 log by user request.
Dec 27 19:17:02 new_mds kernel: Lustre: Found index 0 for lustre-OST0000, updating log
Dec 27 19:17:02 new_mds kernel: Lustre: Client log for lustre-OST0000 was not updated; writeconf the MDT first to regenerate it.
Dec 27 19:17:02 new_mds kernel: LustreError: 25782:0:(llog.c:250:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
Dec 27 19:17:02 new_mds kernel: LustreError: 25782:0:(mgs_llog.c:1454:record_start_log()) MGS: can't start log lustre-MDT0000: rc = -17
Dec 27 19:17:02 new_mds kernel: LustreError: 25782:0:(mgs_llog.c:3658:mgs_write_log_target()) Can't write logs for lustre-OST0000 (-17)
Dec 27 19:17:02 new_mds kernel: LustreError: 25782:0:(mgs_handler.c:408:mgs_handle_target_reg()) Failed to write lustre-OST0000 log (-17)
Dec 27 19:17:09 new_mds kernel: Lustre: 25785:0:(mgc_request.c:1564:mgc_process_recover_log()) Process recover log lustre-mdtir error -22


[root@new_mds ~]# lctl dl
  0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 8
  1 UP mgs MGS MGS 7
  2 UP mgc MGC192.168.1.32@tcp b304f3ec-630a-940d-37e6-ce5ded6a6c71 5
  3 UP mds MDS MDS_uuid 3
  4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
  5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 7
  6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
  7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
  8 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5

Dec 27 19:17:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more than 120 seconds.
Dec 27 19:17:50 myoss kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 27 19:17:50 myoss kernel: tgt_recov     D 000000000000000b     0  2775      2 0x00000080
Dec 27 19:17:50 myoss kernel: ffff88044e523e00 0000000000000046 0000000000000000 0000000000000003
Dec 27 19:17:50 myoss kernel: ffff88044e523d90 ffffffff81055f96 ffff88044e523da0 ffff880474742ae0
Dec 27 19:17:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8 000000000000fb88 ffff8804541cf058
Dec 27 19:17:50 myoss kernel: Call Trace:
Dec 27 19:17:50 myoss kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Dec 27 19:17:50 myoss kernel: [<ffffffffa0700070>] ? check_for_clients+0x0/0x70 [ptlrpc]
Dec 27 19:17:50 myoss kernel: [<ffffffffa070172d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Dec 27 19:17:50 myoss kernel: [<ffffffffa06ffd60>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Dec 27 19:17:50 myoss kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Dec 27 19:17:50 myoss kernel: [<ffffffffa070856e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Dec 27 19:17:50 myoss kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
Dec 27 19:17:50 myoss kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:17:50 myoss kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Dec 27 19:17:50 myoss kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:17:50 myoss kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:17:50 myoss kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20


[root@cola1 ~]# mount -t lustre 192.168.1.32@tcp:/lustre /lustre


Dec 27 19:21:05 cola1 kernel: LNet: HW CPU cores: 8, npartitions: 2
Dec 27 19:21:05 cola1 modprobe: FATAL: Error inserting crc32c_intel (/lib/modules/2.6.32-358.18.1.el6.x86_64/kernel/arch/x86/crypto/crc32c-intel.ko): No such device
Dec 27 19:21:05 cola1 kernel: alg: No test for crc32 (crc32-table)
Dec 27 19:21:05 cola1 kernel: alg: No test for adler32 (adler32-zlib)
Dec 27 19:21:09 cola1 kernel: padlock: VIA PadLock Hash Engine not detected.
Dec 27 19:21:09 cola1 modprobe: FATAL: Error inserting padlock_sha (/lib/modules/2.6.32-358.18.1.el6.x86_64/kernel/drivers/crypto/padlock-sha.ko): No such device
Dec 27 19:21:13 cola1 kernel: Lustre: Lustre: Build Version: 2.5.0-RC1--PRISTINE-2.6.32-358.18.1.el6.x86_64
Dec 27 19:21:13 cola1 kernel: LNet: Added LNI 192.168.1.6@tcp [8/256/0/180]
Dec 27 19:21:13 cola1 kernel: LNet: Added LNI 10.242.116.6@tcp1 [8/256/0/180]
Dec 27 19:21:13 cola1 kernel: LNet: Accept secure, port 988
Dec 27 19:21:13 cola1 kernel: Lustre: 19583:0:(mgc_request.c:1645:mgc_process_recover_log()) Process recover log lustre-cliir error -22
Dec 27 19:21:13 cola1 kernel: LustreError: 11-0: lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp, operation mds_connect failed with -11.
Dec 27 19:21:38 cola1 kernel: LustreError: 11-0: lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp, operation mds_connect failed with -11.
Dec 27 19:22:03 cola1 kernel: LustreError: 11-0: lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp, operation mds_connect failed with -11.
Dec 27 19:22:28 cola1 kernel: LustreError: 11-0: lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp, operation mds_connect failed with -11.
Dec 27 19:22:53 cola1 kernel: LustreError: 11-0: lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp, operation mds_connect failed with -11.
Dec 27 19:23:18 cola1 kernel: LustreError: 11-0: lustre-MDT0000-mdc-ffff880215fac400: Communicating with 192.168.1.32@tcp, operation mds_connect failed with -11.


### No additional messages seen on new_mds


Dec 27 19:19:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more than 120 seconds.
Dec 27 19:19:50 myoss kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 27 19:19:50 myoss kernel: tgt_recov     D 000000000000000b     0  2775      2 0x00000080
Dec 27 19:19:50 myoss kernel: ffff88044e523e00 0000000000000046 0000000000000000 0000000000000003
Dec 27 19:19:50 myoss kernel: ffff88044e523d90 ffffffff81055f96 ffff88044e523da0 ffff880474742ae0
Dec 27 19:19:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8 000000000000fb88 ffff8804541cf058
Dec 27 19:19:50 myoss kernel: Call Trace:
Dec 27 19:19:50 myoss kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Dec 27 19:19:50 myoss kernel: [<ffffffffa0700070>] ? check_for_clients+0x0/0x70 [ptlrpc]
Dec 27 19:19:50 myoss kernel: [<ffffffffa070172d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Dec 27 19:19:50 myoss kernel: [<ffffffffa06ffd60>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Dec 27 19:19:50 myoss kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Dec 27 19:19:50 myoss kernel: [<ffffffffa070856e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Dec 27 19:19:50 myoss kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
Dec 27 19:19:50 myoss kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:19:50 myoss kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Dec 27 19:19:50 myoss kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:19:50 myoss kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:19:50 myoss kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Dec 27 19:21:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more than 120 seconds.
Dec 27 19:21:50 myoss kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 27 19:21:50 myoss kernel: tgt_recov     D 000000000000000b     0  2775      2 0x00000080
Dec 27 19:21:50 myoss kernel: ffff88044e523e00 0000000000000046 0000000000000000 0000000000000003
Dec 27 19:21:50 myoss kernel: ffff88044e523d90 ffffffff81055f96 ffff88044e523da0 ffff880474742ae0
Dec 27 19:21:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8 000000000000fb88 ffff8804541cf058
Dec 27 19:21:50 myoss kernel: Call Trace:
Dec 27 19:21:50 myoss kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Dec 27 19:21:50 myoss kernel: [<ffffffffa0700070>] ? check_for_clients+0x0/0x70 [ptlrpc]
Dec 27 19:21:50 myoss kernel: [<ffffffffa070172d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Dec 27 19:21:50 myoss kernel: [<ffffffffa06ffd60>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Dec 27 19:21:50 myoss kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Dec 27 19:21:50 myoss kernel: [<ffffffffa070856e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Dec 27 19:21:50 myoss kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
Dec 27 19:21:50 myoss kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:21:50 myoss kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Dec 27 19:21:50 myoss kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:21:50 myoss kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:21:50 myoss kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Dec 27 19:23:50 myoss kernel: INFO: task tgt_recov:2775 blocked for more than 120 seconds.
Dec 27 19:23:50 myoss kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 27 19:23:50 myoss kernel: tgt_recov     D 000000000000000b     0  2775      2 0x00000080
Dec 27 19:23:50 myoss kernel: ffff88044e523e00 0000000000000046 0000000000000000 0000000000000003
Dec 27 19:23:50 myoss kernel: ffff88044e523d90 ffffffff81055f96 ffff88044e523da0 ffff880474742ae0
Dec 27 19:23:50 myoss kernel: ffff8804541cf058 ffff88044e523fd8 000000000000fb88 ffff8804541cf058
Dec 27 19:23:50 myoss kernel: Call Trace:
Dec 27 19:23:50 myoss kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Dec 27 19:23:50 myoss kernel: [<ffffffffa0700070>] ? check_for_clients+0x0/0x70 [ptlrpc]
Dec 27 19:23:50 myoss kernel: [<ffffffffa070172d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa06ffd60>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Dec 27 19:23:50 pepsi3 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa070856e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Dec 27 19:23:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:23:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:23:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:23:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Dec 27 19:25:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more than 120 seconds.
Dec 27 19:25:50 pepsi3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 27 19:25:50 pepsi3 kernel: tgt_recov     D 000000000000000b     0  2775      2 0x00000080
Dec 27 19:25:50 pepsi3 kernel: ffff88044e523e00 0000000000000046 0000000000000000 0000000000000003
Dec 27 19:25:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96 ffff88044e523da0 ffff880474742ae0
Dec 27 19:25:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8 000000000000fb88 ffff8804541cf058
Dec 27 19:25:50 pepsi3 kernel: Call Trace:
Dec 27 19:25:50 pepsi3 kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0700070>] ? check_for_clients+0x0/0x70 [ptlrpc]
Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa070172d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa06ffd60>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Dec 27 19:25:50 pepsi3 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa070856e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Dec 27 19:25:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:25:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:25:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:25:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Dec 27 19:27:14 pepsi3 ntpd_intres[1856]: host name not found: qrdcntp.quanta.corp
Dec 27 19:27:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more than 120 seconds.
Dec 27 19:27:50 pepsi3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 27 19:27:50 pepsi3 kernel: tgt_recov     D 000000000000000b     0  2775      2 0x00000080
Dec 27 19:27:50 pepsi3 kernel: ffff88044e523e00 0000000000000046 0000000000000000 0000000000000003
Dec 27 19:27:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96 ffff88044e523da0 ffff880474742ae0
Dec 27 19:27:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8 000000000000fb88 ffff8804541cf058
Dec 27 19:27:50 pepsi3 kernel: Call Trace:
Dec 27 19:27:50 pepsi3 kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0700070>] ? check_for_clients+0x0/0x70 [ptlrpc]
Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa070172d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa06ffd60>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Dec 27 19:27:50 pepsi3 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa070856e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Dec 27 19:27:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:27:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:27:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:27:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Dec 27 19:29:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more than 120 seconds.
Dec 27 19:29:50 pepsi3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 27 19:29:50 pepsi3 kernel: tgt_recov     D 000000000000000b     0  2775      2 0x00000080
Dec 27 19:29:50 pepsi3 kernel: ffff88044e523e00 0000000000000046 0000000000000000 0000000000000003
Dec 27 19:29:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96 ffff88044e523da0 ffff880474742ae0
Dec 27 19:29:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8 000000000000fb88 ffff8804541cf058
Dec 27 19:29:50 pepsi3 kernel: Call Trace:
Dec 27 19:29:50 pepsi3 kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0700070>] ? check_for_clients+0x0/0x70 [ptlrpc]
Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa070172d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa06ffd60>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Dec 27 19:29:50 pepsi3 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa070856e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Dec 27 19:29:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:29:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:29:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:29:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Dec 27 19:31:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more than 120 seconds.
Dec 27 19:31:50 pepsi3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 27 19:31:50 pepsi3 kernel: tgt_recov     D 000000000000000b     0  2775      2 0x00000080
Dec 27 19:31:50 pepsi3 kernel: ffff88044e523e00 0000000000000046 0000000000000000 0000000000000003
Dec 27 19:31:50 pepsi3 kernel: ffff88044e523d90 ffffffff81055f96 ffff88044e523da0 ffff880474742ae0
Dec 27 19:31:50 pepsi3 kernel: ffff8804541cf058 ffff88044e523fd8 000000000000fb88 ffff8804541cf058
Dec 27 19:31:50 pepsi3 kernel: Call Trace:
Dec 27 19:31:50 pepsi3 kernel: [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0700070>] ? check_for_clients+0x0/0x70 [ptlrpc]
Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa070172d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa06ffd60>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Dec 27 19:31:50 pepsi3 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa070856e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Dec 27 19:31:50 pepsi3 kernel: [<ffffffff8106b484>] ? __mmdrop+0x44/0x60
Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:31:50 pepsi3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:31:50 pepsi3 kernel: [<ffffffffa0707fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Dec 27 19:31:50 pepsi3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Dec 27 19:33:50 pepsi3 kernel: INFO: task tgt_recov:2775 blocked for more than 120 seconds.

### The same call trace is printed again and again until I CTRL-C the client. (pepsi3 and myoss are the same host.)
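
Since the hung-task report repeats every two minutes, it can be easier to read as a count per host and task. The sketch below assumes syslog lines in the format shown above, with the hostname in field 4 and "name:pid" in field 8. The function name summarize_hung_tasks is hypothetical.

```shell
#!/bin/sh
# Count "blocked for more than N seconds" reports per host and task,
# so the repeated call traces collapse into one line each.
# Reads syslog-style lines on stdin.
summarize_hung_tasks() {
    awk '/INFO: task .* blocked for more than/ {
             # $4 is the hostname, $8 is "name:pid" in this log format
             count[$4 " " $8]++
         }
         END { for (k in count) print count[k], k }' | sort -k2
}
```

A possible invocation, assuming the messages are in the usual CentOS location, is "summarize_hung_tasks < /var/log/messages".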

### No new messages on myoss and new_mds.



The result differs from the previous one, although I don't think what I did was much different. For example, the "lctl dl" report previously showed an "AT" (attached?) OST device. Now no OST device is reported.

I once saw a mail saying that "writeconf" can solve some problems. I have tried both "tunefs.lustre --writeconf /dev/sda6" and, just now, "mount -t lustre -o writeconf". I don't know which one is correct, or whether both are wrong.
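
For comparison, the order usually described for regenerating the configuration logs is: stop everything, writeconf the MDT, then the OSTs, then remount with the MDT first. This also matches the MGS hint above, "writeconf the MDT first to regenerate it". The sketch below is a dry run that only prints commands and executes nothing. The hosts, devices and mount points are the ones from this mail, and the function name writeconf_plan is hypothetical. Whether this sequence clears the "llog uuid mismatch" errors is an assumption that the logs do not confirm.

```shell
#!/bin/sh
# Dry run: print the writeconf sequence, one "host: command" per line.
# Nothing is executed. Hosts, devices and mount points are taken from this mail.
MDT_DEV=/dev/sda6; MDT_MNT=/MDT      # on new_mds
OST_DEV=/dev/sda5; OST_MNT=/OST      # on myoss
MGS_NID=192.168.1.32@tcp

writeconf_plan() {
    # 1. Stop everything: clients first, then the MDT, then the OSTs.
    echo "cola1:   umount /lustre"
    echo "new_mds: umount $MDT_MNT"
    echo "myoss:   umount $OST_MNT"
    # 2. Regenerate the config logs: MDT before any OST.
    echo "new_mds: tunefs.lustre --writeconf $MDT_DEV"
    echo "myoss:   tunefs.lustre --writeconf $OST_DEV"
    # 3. Restart in order: MGS/MDT, then OSTs, then clients.
    echo "new_mds: mount -t lustre $MDT_DEV $MDT_MNT"
    echo "myoss:   mount -t lustre $OST_DEV $OST_MNT"
    echo "cola1:   mount -t lustre $MGS_NID:/lustre /lustre"
}

writeconf_plan
```

The point of the ordering is that the MDT is both written and mounted before the OST, so the MGS regenerates the lustre-MDT0000 log before lustre-OST0000 tries to register.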

If there is no way to solve this problem, is there any means of extracting the data from the ldiskfs filesystem without the client?

    Regards,
    Frank