Hi Ben,
Please see my comments inline.
On Tuesday 27 August 2013 07:47 PM, Ben Evans wrote:
I believe if you do the following, you should be OK:

1. Run Step 1 as described.
2. Unmount.
3. Mount on the secondary MDS (no reformat).

Remember that an MGS can be on two completely separate devices from an
MDT, so there's no direct relationship between the --mgsnode switches
and the --failnode switch.
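Concretely, something like this should do it (a sketch, assuming the
MGS device from Step 1 is on shared storage visible to both MDS nodes):

# On MDS0: stop the MGS
umount /lustre/mgs
# On MDS1: mount the same MGS device with no mkfs.lustre --reformat,
# so the existing configuration logs are preserved
mount -t lustre /dev/mapper/vg_v-mgs /lustre/mgs/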
When you try the format on the secondary MDS (Step 2.1), you specify
the address of the secondary MDS as the failnode, and when you mount
it, it's also the primary node, so it's set to fail over to itself. If
I had to guess, the MGS sees that a new MDT is trying to register for
the first time on its failover node, and rejects the whole thing (the
same happens with OSTs).
(Swapnil) The MDT on MDS1 is unable to communicate with the MGS
on MDS0.
It keeps trying to communicate with 0@lo, since that is the nearest
peer. I tried doing what you suggested - unmounting the MGS from MDS0
and mounting it on MDS1. In that case, the targets get mounted on
MDS1, but the targets on MDS0 are unable to connect to the MGS due to
the same bug.
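The misrouting is easy to see from MDS1 (a diagnostic sketch, using
the NIDs from the reproduction steps below):

# Both MGS NIDs are reachable at the LNet level from MDS1...
lctl ping 10.10.11.210@tcp1
lctl ping 10.10.11.211@tcp1
# ...but 10.10.11.211@tcp1 is one of MDS1's own NIDs, so the MGC
# treats the MGS as local and ends up targeting 0@lo
lctl list_nids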
I think this is just an initialization issue; it could be fixed with
an optional --primary switch (or something).
(Swapnil) If I understand this issue correctly, I think the fix for
this bug should be that the target code that connects to the MGS
should try to connect to each of the mgsnodes it is configured with.
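For reference, both NIDs do end up in the target's on-disk parameters,
so the MGC already has everything it needs to try the second NID (a
quick check; --dryrun prints the stored configuration without changing
anything):

# On MDS1: inspect MDT1's stored parameters; both mgsnode entries are listed
tunefs.lustre --dryrun /dev/mapper/vg_mdt1_v-mdt1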
------------------------------------------------------------------------
*From:* hpdd-discuss-bounces@lists.01.org
<hpdd-discuss-bounces@lists.01.org> on behalf of Swapnil Pimpale
<spimpale@ddn.com>
*Sent:* Tuesday, August 27, 2013 9:28 AM
*To:* hpdd-discuss@lists.01.org
*Subject:* [HPDD-discuss] MDT mount fails if mkfs.lustre is run with
multiple mgsnode arguments on MDSs where MGS is not running
Hi All,
We are facing the following issue. There is a JIRA ticket open for it
(https://jira.hpdd.intel.com/browse/LU-3829).
The description from the ticket is as follows:
If multiple --mgsnode arguments are provided to mkfs.lustre while
formatting an MDT, then the mount of this MDT fails on the MDS where
the MGS is not running.
Reproduction Steps:
Step 1) On MDS0, run the following script:
mgs_dev='/dev/mapper/vg_v-mgs'
mds0_dev='/dev/mapper/vg_v-mdt'
mgs_pri_nid='10.10.11.210@tcp1'   # MDS0
mgs_sec_nid='10.10.11.211@tcp1'   # MDS1

# Format the MGS, then an MDT that is told about both MGS NIDs
mkfs.lustre --mgs --reformat $mgs_dev
mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid \
    --failnode=$mgs_sec_nid --reformat --fsname=v --mdt --index=0 $mds0_dev

mount -t lustre $mgs_dev /lustre/mgs/
mount -t lustre $mds0_dev /lustre/v/mdt
So the MGS and MDT0 will be mounted on MDS0.
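As a quick sanity check on MDS0 (plain util-linux behaviour:
"mount -t <type>" with no device argument lists mounts of that type):

mount -t lustre   # should list both the MGS and MDT0 mounts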
Step 2.1) On MDS1:

mdt1_dev='/dev/mapper/vg_mdt1_v-mdt1'
mdt2_dev='/dev/mapper/vg_mdt2_v-mdt2'
mgs_pri_nid='10.10.11.210@tcp1'
mgs_sec_nid='10.10.11.211@tcp1'

# Format MDT1 with both MGS NIDs, failing over to MDS0
mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid \
    --failnode=$mgs_pri_nid --reformat --fsname=v --mdt --index=1 $mdt1_dev

# This mount does not succeed:
mount -t lustre $mdt1_dev /lustre/v/mdt1
The mount of MDT1 will fail with the following error:
mount.lustre: mount /dev/mapper/vg_mdt1_v-mdt1 at /lustre/v/mdt1 failed: Input/output error
Is the MGS running?
These are messages from Lustre logs while trying to mount MDT1:
LDISKFS-fs (dm-20): mounted filesystem with ordered data mode. quota=on. Opts:
LDISKFS-fs (dm-20): mounted filesystem with ordered data mode. quota=on. Opts:
LDISKFS-fs (dm-20): mounted filesystem with ordered data mode. quota=on. Opts:
Lustre: 7564:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply:[sent 1377197751/real 1377197751]req@ffff880027956c00 x1444089351391184/t0(0) o250->MGC10.10.11.210@tcp1@0@lo:26/25 lens 400/544 e 0 to 1 dl 1377197756 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 8059:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880027956800 x1444089351391188/t0(0) o253->MGC10.10.11.210@tcp1@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 15f-b: v-MDT0001: cannot register this server with the MGS: rc = -5. Is the MGS running?
LustreError: 8059:0:(obd_mount_server.c:1732:server_fill_super()) Unable to start targets: -5
LustreError: 8059:0:(obd_mount_server.c:848:lustre_disconnect_lwp()) v-MDT0000-lwp-MDT0001: Can't end config log v-client.
LustreError: 8059:0:(obd_mount_server.c:1426:server_put_super()) v-MDT0001: failed to disconnect lwp. (rc=-2)
LustreError: 8059:0:(obd_mount_server.c:1456:server_put_super()) no obd v-MDT0001
LustreError: 8059:0:(obd_mount_server.c:137:server_deregister_mount()) v-MDT0001 not registered
Lustre: server umount v-MDT0001 complete
LustreError: 8059:0:(obd_mount.c:1277:lustre_fill_super()) Unable to mount (-5)
Step 2.2) On MDS1:

mdt1_dev='/dev/mapper/vg_mdt1_v-mdt1'
mdt2_dev='/dev/mapper/vg_mdt2_v-mdt2'
mgs_pri_nid='10.10.11.210@tcp1'
mgs_sec_nid='10.10.11.211@tcp1'

# Format MDT1 with only the primary MGS NID
mkfs.lustre --mgsnode=$mgs_pri_nid --failnode=$mgs_pri_nid --reformat \
    --fsname=v --mdt --index=1 $mdt1_dev

mount -t lustre $mdt1_dev /lustre/v/mdt1

With this, MDT1 mounts successfully. The only difference is that the
second "--mgsnode" is not provided during mkfs.lustre.
Step 3) On MDS1 again:

# With MDT1 already mounted, a second --mgsnode no longer causes a failure
mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid \
    --failnode=$mgs_pri_nid --reformat --fsname=v --mdt --index=2 $mdt2_dev

mount -t lustre $mdt2_dev /lustre/v/mdt2

Once MDT1 is mounted, using a second "--mgsnode" option works without
any errors and the mount of MDT2 succeeds.
Lustre versions: reproducible on 2.4.0 and 2.4.91.
Conclusion: Due to this bug, MDTs do not mount on MDSs that are not
running the MGS. With the workaround (omitting the second "--mgsnode"),
the targets never learn the MGS failover NID, so HA will not be
properly configured.
Also note that this issue is not related to DNE: the same failure and
"workaround" apply to an MDT of a different filesystem on MDS1 as well.
My initial thoughts on this are as follows:
In the above case, while mounting an MDT on MDS1, one of the mgsnodes
is MDS1 itself.
It looks like ptlrpc_uuid_to_peer() calculates the distance to the
NIDs using LNetDist() and chooses the one with the least distance,
which in this case turns out to be MDS1 itself, which does not have a
running MGS.
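This selection can be seen from user space as well (a sketch with the
NIDs from the steps above; I believe "lctl which_nid" uses the same
nearest-peer logic):

# On MDS1, ask LNet which of the two MGS NIDs it would pick;
# it picks the local one, behind which no MGS is running
lctl which_nid 10.10.11.210@tcp1 10.10.11.211@tcp1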
Removing MDS1's NID from the --mgsnode list and adding a different
node's NID instead worked for me.
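For an already formatted target, something along these lines should
achieve the same without a full reformat (a sketch, not verified
against this bug; --erase-params drops the old settings before the
new ones are written):

# On MDS1: rewrite MDT1's parameters so MDS1's own NID is not an mgsnode
tunefs.lustre --erase-params --mgsnode=10.10.11.210@tcp1 \
    --failnode=10.10.11.210@tcp1 /dev/mapper/vg_mdt1_v-mdt1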
I would really appreciate any inputs on this.
Thanks!
--
Swapnil