Thanks in advance!

On 30/05/2014, at 17:15, Patrick Farrell <paf@cray.com> wrote:

Alfonso,

Just at a guess, I'd suggest you've got two things going on...

First of all, your targets could use a writeconf operation to clear up those "non-config logname" messages. That will wipe out any settings you applied with lctl conf_param, but it's probably a good idea.
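For reference, a writeconf is roughly the following sequence (a sketch only — the device paths and mount points here are placeholders, and all clients must be unmounted first; check the Lustre manual for your version before running it):

```shell
# 1. Unmount all clients, then all targets: OSTs first, MDT, MGS last.
umount /mnt/mdt

# 2. Regenerate the configuration logs on every target.
#    NOTE: this discards anything previously set via "lctl conf_param".
tunefs.lustre --writeconf /dev/sdX   # run on the MGS/MDT device
tunefs.lustre --writeconf /dev/sdY   # repeat on each OST device

# 3. Remount in order: MGS, then MDT, then OSTs, then clients.
mount -t lustre /dev/sdX /mnt/mdt
```

The ordering matters: the targets re-register with the MGS as they mount, which is what rebuilds the config logs.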

Next, (and more seriously) you've got some damage/confusion on your MDT, and looking at the stack trace, I'd suggest running lfsck on the MDT to get your Object Index scrubbed:

http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.lfsckadmin

The manual isn't as helpful as it could be, but that's where I'd start.
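Roughly, starting and monitoring the OI scrub from the MDS looks like this (a sketch — the target name cetafs-MDT0000 is taken from your log, but verify the exact parameter names against your 2.5.1 install, as they vary between releases):

```shell
# Kick off LFSCK on the MDT; the OI scrub runs as part of it.
lctl lfsck_start -M cetafs-MDT0000

# Watch OI scrub progress / status on the ldiskfs OSD.
lctl get_param -n osd-ldiskfs.cetafs-MDT0000.oi_scrub
```

When the scrub status reports "completed", the Object Index entries should be consistent again.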

- Patrick Farrell

On 05/30/2014 03:09 AM, Pardo Diaz, Alfonso wrote:
Hello,

Since I updated my Lustre 2.2 to 2.5.1 (CentOS 6.5) and copied the MDT to a new SSD disk, I have been getting random kernel panics in the MDS (both HA pairs). The last kernel panic produced this log:

<4>Lustre: MGS: non-config logname received: params
<3>LustreError: 11-0: cetafs-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: cetafs-MDT0000: Will be in recovery for at least 5:00, or until 102 clients reconnect
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 5 previous similar messages
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 9 previous similar messages
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 2 previous similar messages
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 23 previous similar messages
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 8 previous similar messages
<3>LustreError: 3461:0:(ldlm_lib.c:1751:check_for_next_transno()) cetafs-MDT0000: waking for gap in transno, VBR is OFF (skip: 17188113481, ql: 1, comp: 101, conn: 102, next: 17188113493, last_committed: 17188113480)
<6>Lustre: cetafs-MDT0000: Recovery over after 1:13, of 102 clients 102 recovered and 0 were evicted.
<1>BUG: unable to handle kernel NULL pointer dereference at (null)
<1>IP: [<ffffffffa0c3b6a0>] __iam_path_lookup+0x70/0x1f0 [osd_ldiskfs]
<4>PGD 106c0bf067 PUD 106c0be067 PMD 0
<4>Oops: 0002 [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 0
<4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ipmi_devintf cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_multipath microcode iTCO_wdt iTCO_vendor_support sb_edac edac_core lpc_ich mfd_core i2c_i801 igb i2c_algo_bit i2c_core ptp pps_core ioatdma dca mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core sg ext4 jbd2 mbcache sd_mod crc_t10dif ahci isci libsas mpt2sas scsi_transport_sas raid_class megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 3362, comm: mdt00_001 Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 Bull SAS bullx/X9DRH-7TF/7F/iTF/iF
<4>RIP: 0010:[<ffffffffa0c3b6a0>]  [<ffffffffa0c3b6a0>] __iam_path_lookup+0x70/0x1f0 [osd_ldiskfs]
<4>RSP: 0018:ffff88085e2754b0  EFLAGS: 00010246
<4>RAX: 00000000fffffffb RBX: ffff88085e275600 RCX: 000000000009c93c
<4>RDX: 0000000000000000 RSI: 000000000009c93b RDI: ffff88106bcc32f0
<4>RBP: ffff88085e275500 R08: 0000000000000000 R09: 00000000ffffffff
<4>R10: 0000000000000000 R11: 0000000000000000 R12: ffff88085e2755c8
<4>R13: 0000000000005250 R14: ffff8810569bf308 R15: 0000000000000001
<4>FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000000000000000 CR3: 000000106dd9b000 CR4: 00000000000407f0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process mdt00_001 (pid: 3362, threadinfo ffff88085e274000, task ffff88085f55c080)
<4>Stack:
<4>  0000000000000000 ffff88085e2755d8 ffff8810569bf288 ffffffffa00fd2c4
<4><d>  ffff88085e275660 ffff88085e2755c8 ffff88085e2756c8 0000000000000000
<4><d>  0000000000000000 ffff88085db2a480 ffff88085e275530 ffffffffa0c3ba6c
<4>Call Trace:
<4>  [<ffffffffa00fd2c4>] ? do_get_write_access+0x3b4/0x520 [jbd2]
<4>  [<ffffffffa0c3ba6c>] iam_lookup_lock+0x7c/0xb0 [osd_ldiskfs]
<4>  [<ffffffffa0c3bad4>] __iam_it_get+0x34/0x160 [osd_ldiskfs]
<4>  [<ffffffffa0c3be1e>] iam_it_get+0x2e/0x150 [osd_ldiskfs]
<4>  [<ffffffffa0c3bf4e>] iam_it_get_exact+0xe/0x30 [osd_ldiskfs]
<4>  [<ffffffffa0c3d47f>] iam_insert+0x4f/0xb0 [osd_ldiskfs]
<4>  [<ffffffffa0c366ea>] osd_oi_iam_refresh+0x18a/0x330 [osd_ldiskfs]
<4>  [<ffffffffa0c3ea40>] ? iam_lfix_ipd_alloc+0x0/0x20 [osd_ldiskfs]
<4>  [<ffffffffa0c386dd>] osd_oi_insert+0x11d/0x480 [osd_ldiskfs]
<4>  [<ffffffff811ae522>] ? generic_setxattr+0xa2/0xb0
<4>  [<ffffffffa0c25021>] ? osd_ea_fid_set+0xf1/0x410 [osd_ldiskfs]
<4>  [<ffffffffa0c33595>] osd_object_ea_create+0x5b5/0x700 [osd_ldiskfs]
<4>  [<ffffffffa0e173bf>] lod_object_create+0x13f/0x260 [lod]
<4>  [<ffffffffa0e756c0>] mdd_object_create_internal+0xa0/0x1c0 [mdd]
<4>  [<ffffffffa0e86428>] mdd_create+0xa38/0x1730 [mdd]
<4>  [<ffffffffa0c2af37>] ? osd_xattr_get+0x97/0x2e0 [osd_ldiskfs]
<4>  [<ffffffffa0e14770>] ? lod_index_lookup+0x0/0x30 [lod]
<4>  [<ffffffffa0d50358>] mdo_create+0x18/0x50 [mdt]
<4>  [<ffffffffa0d5a64c>] mdt_reint_open+0x13ac/0x21a0 [mdt]
<4>  [<ffffffffa065983c>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
<4>  [<ffffffffa04f4600>] ? lu_ucred_key_init+0x160/0x1a0 [obdclass]
<4>  [<ffffffffa0d431f1>] mdt_reint_rec+0x41/0xe0 [mdt]
<4>  [<ffffffffa0d2add3>] mdt_reint_internal+0x4c3/0x780 [mdt]
<4>  [<ffffffffa0d2b35d>] mdt_intent_reint+0x1ed/0x520 [mdt]
<4>  [<ffffffffa0d26a0e>] mdt_intent_policy+0x3ae/0x770 [mdt]
<4>  [<ffffffffa0610511>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
<4>  [<ffffffffa0639abf>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
<4>  [<ffffffffa0d26ed6>] mdt_enqueue+0x46/0xe0 [mdt]
<4>  [<ffffffffa0d2dbca>] mdt_handle_common+0x52a/0x1470 [mdt]
<4>  [<ffffffffa0d68545>] mds_regular_handle+0x15/0x20 [mdt]
<4>  [<ffffffffa0669a45>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4>  [<ffffffffa03824ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4>  [<ffffffffa03933df>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
<4>  [<ffffffffa06610e9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4>  [<ffffffff81054839>] ? __wake_up_common+0x59/0x90
<4>  [<ffffffffa066adad>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
<4>  [<ffffffffa066a2c0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
<4>  [<ffffffff8109aee6>] kthread+0x96/0xa0
<4>  [<ffffffff8100c20a>] child_rip+0xa/0x20
<4>  [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4>  [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>Code: 00 48 8b 5d b8 45 31 ff 0f 1f 00 49 8b 46 30 31 d2 48 89 d9 44 89 ee 48 8b 7d c0 ff 50 20 48 8b 13 66 2e 0f 1f 84 00 00 00 00 00<f0>  0f ba 2a 19 19 c9 85 c9 74 15 48 8b 0a f7 c1 00 00 00 02 74
<1>RIP  [<ffffffffa0c3b6a0>] __iam_path_lookup+0x70/0x1f0 [osd_ldiskfs]
<4>  RSP<ffff88085e2754b0>
<4>CR2: 0000000000000000







Any suggestions are welcome!

THANKS!!!







Alfonso Pardo Diaz
System Administrator / Researcher
c/ Sola nº 1; 10200 Trujillo, ESPAÑA
Tel: +34 927 65 93 17 Fax: +34 927 32 32 37




----------------------------
Disclaimer:
This message and its attached files is intended exclusively for its recipients and may contain confidential information. If you received this e-mail in error you are hereby notified that any dissemination, copy or disclosure of this communication is strictly prohibited and may be unlawful. In this case, please notify us by a reply and delete this email and its contents immediately.
----------------------------

_______________________________________________
HPDD-discuss mailing list
HPDD-discuss@lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss