Lustre training
by Kurt Strosahl
Hello,
About a year ago I attended a Lustre training class in DC, titled "Lustre Installation and Administration." One of my coworkers, who also works on our Lustre file system, would like to take a similar class, but neither of us can find such training. Have these classes been discontinued?
w/r,
Kurt J. Strosahl
System Administrator
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
6 years, 10 months
Lustre 2.4.2 – Client refused reconnection, still busy with 1 active RPCs
by Jaime paret
Dear Lustre Experts,
We currently run the Lustre maintenance release 2.4.2 (Whamcloud build) on RHEL 6.4.
We have an issue on the client side: the client's reconnection attempts fail because the MDS refuses the connection.
We have to reboot the MDS in order to recover the client.
Client side:
Lustre: 8863:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent
has timed out for slow reply: [sent 1398453761/real 1398453761]
req@ffff88009e281800 x1465479219941096/t0(0)
o101->data1-MDT0000-mdc-ffff880335a7bc00@10.64.18.12@tcp1:12/10 lens
576/1136 e 5 to 1 dl 1398454684 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
Lustre: data1-MDT0000-mdc-ffff880335a7bc00: Connection to data1-MDT0000 (at
10.64.18.12@tcp1) was lost; in progress operations using this service will
wait for recovery to complete
LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with
10.64.18.12@tcp1, operation mds_connect failed with -16.
LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with
10.64.18.12@tcp1, operation mds_connect failed with -16.
LustreError: Skipped 1 previous similar message
LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with
10.64.18.12@tcp1, operation mds_connect failed with -16.
LustreError: Skipped 2 previous similar messages
LustreError: 26416:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0
(data1-MDT0000-mdc-ffff880335a7bc00), error -16
LustreError: 26416:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs
fails: rc = -16
LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with
10.64.18.12@tcp1, operation mds_connect failed with -16.
LustreError: Skipped 5 previous similar messages
LustreError: 26474:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0
(data1-MDT0000-mdc-ffff880335a7bc00), error -16
LustreError: 26474:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs
fails: rc = -16
MDS side:
Lustre: 18484:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@
Couldn't add any time (5/-207), not sending early reply
req@ffff8800514b2400 x1465479219941096/t0(0)
o101->60155cc5-7c3a-a0af-08a5-19451109c288@10.64.18.11@tcp1:0/0 lens
576/1152 e 5 to 0 dl 1398454573 ref 2 fl Interpret:/0/0 rc 0/0
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) reconnecting
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1 previous similar message
Can you give me some advice on this issue?
In HPDD Jira I found LU-793 (
https://jira.hpdd.intel.com/browse/LU-793), but many other tickets also seem to
relate to my case ...
Do you think LU-793 is the right one?
In that ticket Peter lists 3 patches:
http://review.whamcloud.com/#/c/9209/
http://review.whamcloud.com/#/c/9210/
http://review.whamcloud.com/#/c/9211/
Are they production ready for the 2.4 release?
Do you think there is another way to solve this?
Cheers, Jaime
6 years, 11 months
lustre permissions
by Daguman, Brainard R
Hi All,
I have a mixed environment built on Lustre v2.4.2. The servers are on the workgroup IB and Ethernet (legacy) networks, while the clients are on the workgroup IB network and the domain Ethernet network. I currently have an issue accessing Lustre files that were created from clients using the root account, which has rwx permissions. When I switch over to my domain account I get "Permission denied".
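A minimal check of the numeric UID/GIDs the client presents versus what is stored on a file might look like the sketch below (the path is only a placeholder, not an actual mount point of mine):

#!/usr/bin/env python
# Sketch: compare the numeric UID/GIDs this client presents with the
# ownership and mode recorded on a Lustre file.
# The path below is only a placeholder -- point it at a real file.
import os
import stat

path = "/mnt/lustre/somefile"   # placeholder path

st = os.stat(path)
print("file owner uid/gid : %d/%d" % (st.st_uid, st.st_gid))
print("file mode          : %s" % oct(stat.S_IMODE(st.st_mode)))
print("my uid             : %d" % os.getuid())
print("my gids            : %s" % sorted(os.getgroups()))

# If the numbers reported for the domain account do not match what the
# MDS has stored on the file, that mismatch would explain a
# "Permission denied" even though root can access the file.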
--
Thanks,
Brainard
6 years, 11 months
lustre 2.5.0 + zfs
by luka leskovec
Hello all,
I have a running Lustre 2.5.0 + ZFS setup on top of CentOS 6.4 (using the
kernels available on the public Whamcloud site); my clients are on CentOS 6.5
(a minor version difference; I recompiled the client sources with the options
specified on the Whamcloud site).
Now I have some problems. I cannot judge how serious they are, as the only
symptoms I observe are slow responses on ls, rm, and tar; apart from that
it works great. I also export the filesystem over NFS, which sometimes hangs
the client it is exported from, but I suspect this is related to how many
service threads I have running on my servers (old machines); a rough way I
check that is sketched below.
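This is roughly how I look at the OSS I/O service thread counters with lctl; the parameter names are what I believe this Lustre version exposes, so treat them as an assumption and adjust if they differ:

#!/usr/bin/env python
# Rough sketch: print the OSS I/O service thread counters via lctl.
# Assumption: the ost.OSS.ost_io.threads_* parameter names used here
# are the ones this Lustre version exposes; adjust them if they differ.
import subprocess

params = [
    "ost.OSS.ost_io.threads_started",
    "ost.OSS.ost_io.threads_min",
    "ost.OSS.ost_io.threads_max",
]

# lctl get_param prints "name=value" for each matching parameter.
subprocess.call(["lctl", "get_param"] + params)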
But my OSSes (I have two) keep spitting these messages into the system
log:
xxxxxxxxxxxxxxxxxxxxxx kernel: SPL: Showing stack for process 3264
xxxxxxxxxxxxxxxxxxxxxx kernel: Pid: 3264, comm: txg_sync Tainted:
P --------------- 2.6.32-358.18.1.el6_lustre.x86_64 #1
xxxxxxxxxxxxxxxxxxxxxx kernel: Call Trace:
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa01595a7>] ?
spl_debug_dumpstack+0x27/0x40 [spl]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa0161337>] ?
kmem_alloc_debug+0x437/0x4c0 [spl]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa0163b13>] ?
task_alloc+0x1d3/0x380 [spl]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa0160f8f>] ?
kmem_alloc_debug+0x8f/0x4c0 [spl]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa02926f0>] ? spa_deadman+0x0/0x120
[zfs]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa016432b>] ?
taskq_dispatch_delay+0x19b/0x2a0 [spl]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa0164612>] ?
taskq_cancel_id+0x102/0x1e0 [spl]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa028259a>] ? spa_sync+0x1fa/0xa80
[zfs]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa0295707>] ?
txg_sync_thread+0x307/0x590 [zfs]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffff810560a9>] ?
set_user_nice+0xc9/0x130
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa0295400>] ?
txg_sync_thread+0x0/0x590 [zfs]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa0162478>] ?
thread_generic_wrapper+0x68/0x80 [spl]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffffa0162410>] ?
thread_generic_wrapper+0x0/0x80 [spl]
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffff81096a36>] ? kthread+0x96/0xa0
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
xxxxxxxxxxxxxxxxxxxxxx kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Does anyone know whether this is a serious problem or just cosmetic? Is there
any way to get rid of it? Any hints?
best regards,
Luka Leskovec
6 years, 11 months
Fwd: [zfs-discuss] Lustre failover
by Dilger, Andreas
Typically this is done in the same manner as with ext4/ldiskfs-backed filesystems, using Corosync or similar HA software. Care needs to be taken to STONITH the failed node so that you don't have multiple nodes importing the same pool (which can lead to almost immediate corruption).
The MMP feature that we use for ldiskfs to detect and prevent multiple nodes mounting the filesystem concurrently is not available for ZFS yet, though there are some possible ways to get this on the cheap. If you "dd" the überblocks (128KB at a 4MB offset, IIRC) and checksum them, then sleep and repeat, the checksum should not change as long as no other node has the pool imported; if it does change, another node is writing to the pool and it is not safe to import it locally.
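A minimal sketch of that idea, assuming the 128KB-at-4MB figure above and a placeholder device path; only an illustration of the check, not a hardened MMP replacement:

#!/usr/bin/env python
# Minimal sketch of the "poor man's MMP" check described above:
# read the uberblock region of a pool device, checksum it, wait,
# and read it again.  If the checksum changed, another node is
# actively writing to the pool and it is NOT safe to import it here.
# The device path, offset, and size are assumptions/placeholders.
import hashlib
import time

DEVICE = "/dev/mapper/mdt_pool_disk"  # placeholder: a vdev of the pool
OFFSET = 4 * 1024 * 1024              # 4MB offset (per the note above)
LENGTH = 128 * 1024                   # 128KB uberblock region
WAIT   = 10                           # seconds between the two reads

def uberblock_checksum():
    with open(DEVICE, "rb") as dev:
        dev.seek(OFFSET)
        return hashlib.sha1(dev.read(LENGTH)).hexdigest()

first = uberblock_checksum()
time.sleep(WAIT)
second = uberblock_checksum()

if first == second:
    print("uberblock region unchanged; no other node appears to have the pool imported")
else:
    print("uberblock region changed; another node is writing to this pool, do not import")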
Cheers, Andreas
> On May 5, 2014, at 9:42, Andrew Holway <andrew.holway(a)gmail.com> wrote:
>
> Hello,
>
> How would you manage failover for MDS / OSS for Lustre / ZFS?
>
> Thanks,
>
> Andrew
6 years, 11 months
xeon-phi client for 1.8 server
by Brock Palen
For a few more months we will be running Lustre 1.8.x on our servers, exporting Lustre over o2ib and 10-gig TCP (if need be).
Looking at the wiki and a few presentations, it appears our Xeon Phis can only mount NFS until we have Lustre 2.x on our servers. Is this true? Or can the Lustre 2.4 client on the Phi mount our 1.8 filesystem?
Thanks!
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
brockp(a)umich.edu
(734)936-1985
6 years, 11 months