Re: [HPDD-discuss] MDT mount fails if mkfs.lustre is run with multiple mgsnode arguments on MDSs where MGS is not running
by Swapnil Pimpale
Hi Ben,
Please see my comments inline.
On Tuesday 27 August 2013 07:47 PM, Ben Evans wrote:
> I believe if you do the following:
>
> Run Step 1 as described.
>
> Unmount
>
> Mount on the secondary MDS (no reformat), and you should be OK.
> Remember that an MGS can be on a completely separate device from an
> MDT, so there's no direct relationship between the --mgsnode switches
> and the --failnode switch.
>
> When you try the format on the secondary MDS (2.1), you specify the
> address of the secondary MDS as the failnode, and when you mount it,
> it's also the primary node, so it is set to fail over to itself. If I
> had to guess, the MGS sees that a new MDT is trying to register for
> the first time on its failover node, and rejects the whole thing (the
> same happens with OSTs).
(Swapnil) The MDT on MDS1 is unable to communicate with the MGS on MDS0.
It keeps trying to communicate with 0@lo, since that is the nearest
peer. I tried what you suggested: unmounting the MGS from MDS0 and
mounting it on MDS1. In that case, the targets get mounted on MDS1, but
the targets on MDS0 are unable to connect to the MGS due to the same bug.
>
> I think this is just an initialization issue, it could be fixed with
> an optional --primary switch (or something).
(Swapnil) If I understand this issue correctly, I think the fix for this
bug should be that the target code that connects to the MGS tries both
of the mgsnode NIDs it is configured with.
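The nearest-peer selection described in the original report can be
observed directly with lctl; a minimal sketch, assuming `lctl which_nid`
is available on the affected node and using the NIDs from the
reproduction steps below:

```shell
# On MDS1 (10.10.11.211@tcp1), ask LNet which of the two configured MGS
# NIDs it considers closest. The local NID is expected to win, matching
# the 0@lo connection attempts seen in the logs even though no MGS is
# running there.
lctl which_nid 10.10.11.210@tcp1 10.10.11.211@tcp1
```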
>
> ------------------------------------------------------------------------
> *From:* hpdd-discuss-bounces(a)lists.01.org
> <hpdd-discuss-bounces(a)lists.01.org> on behalf of Swapnil Pimpale
> <spimpale(a)ddn.com>
> *Sent:* Tuesday, August 27, 2013 9:28 AM
> *To:* hpdd-discuss(a)lists.01.org
> *Subject:* [HPDD-discuss] MDT mount fails if mkfs.lustre is run with
> multiple mgsnode arguments on MDSs where MGS is not running
>
> Hi All,
>
>
> We are facing the following issue. There is a JIRA ticket opened for
> the same (https://jira.hpdd.intel.com/browse/LU-3829)
> The description from the bug is as follows:
>
>
> If multiple --mgsnode arguments are provided to mkfs.lustre while
> formatting an MDT, then the mount of this MDT fails on the MDS where
> the MGS is not running.
>
> Reproduction Steps:
> Step 1) On MDS0, run the following script:
> mgs_dev='/dev/mapper/vg_v-mgs'
> mds0_dev='/dev/mapper/vg_v-mdt'
>
> mgs_pri_nid='10.10.11.210@tcp1'
> mgs_sec_nid='10.10.11.211@tcp1'
>
> mkfs.lustre --mgs --reformat $mgs_dev
> mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid
> --failnode=$mgs_sec_nid --reformat --fsname=v --mdt --index=0 $mds0_dev
>
> mount -t lustre $mgs_dev /lustre/mgs/
> mount -t lustre $mds0_dev /lustre/v/mdt
>
> So the MGS and MDT0 will be mounted on MDS0.
>
> Step 2.1) On MDS1:
> mdt1_dev='/dev/mapper/vg_mdt1_v-mdt1'
> mdt2_dev='/dev/mapper/vg_mdt2_v-mdt2'
>
> mgs_pri_nid='10.10.11.210@tcp1'
> mgs_sec_nid='10.10.11.211@tcp1'
>
> mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid
> --failnode=$mgs_pri_nid --reformat --fsname=v --mdt --index=1
> $mdt1_dev # Does not mount.
>
> mount -t lustre $mdt1_dev /lustre/v/mdt1
>
> The mount of MDT1 will fail with the following error:
> mount.lustre: mount /dev/mapper/vg_mdt1_v-mdt1 at /lustre/v/mdt1
> failed: Input/output error
> Is the MGS running?
>
> These are messages from Lustre logs while trying to mount MDT1:
> LDISKFS-fs (dm-20): mounted filesystem with ordered data mode.
> quota=on. Opts:
> LDISKFS-fs (dm-20): mounted filesystem with ordered data mode.
> quota=on. Opts:
> LDISKFS-fs (dm-20): mounted filesystem with ordered data mode.
> quota=on. Opts:
> Lustre: 7564:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request
> sent has timed out for slow reply:[sent 1377197751/real
> 1377197751]req@ffff880027956c00 x1444089351391184/t0(0)
> o250->MGC10.10.11.210@tcp1@0@lo:26/25 lens 400/544 e 0 to 1 dl
> 1377197756 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> LustreError: 8059:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send
> limit expired req@ffff880027956800 x1444089351391188/t0(0)
> o253->MGC10.10.11.210@tcp1@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref
> 2 fl Rpc:W/0/ffffffff rc 0/-1
> LustreError: 15f-b: v-MDT0001: cannot register this server with the
> MGS: rc = -5. Is the MGS running?
> LustreError: 8059:0:(obd_mount_server.c:1732:server_fill_super())
> Unable to start targets: -5
> LustreError: 8059:0:(obd_mount_server.c:848:lustre_disconnect_lwp())
> v-MDT0000-lwp-MDT0001: Can't end config log v-client.
> LustreError: 8059:0:(obd_mount_server.c:1426:server_put_super())
> v-MDT0001: failed to disconnect lwp. (rc=-2)
> LustreError: 8059:0:(obd_mount_server.c:1456:server_put_super()) no
> obd v-MDT0001
> LustreError: 8059:0:(obd_mount_server.c:137:server_deregister_mount())
> v-MDT0001 not registered
> Lustre: server umount v-MDT0001 complete
> LustreError: 8059:0:(obd_mount.c:1277:lustre_fill_super()) Unable to
> mount (-5)
>
> Step 2.2) On MDS1:
> mdt1_dev='/dev/mapper/vg_mdt1_v-mdt1'
> mdt2_dev='/dev/mapper/vg_mdt2_v-mdt2'
>
> mgs_pri_nid='10.10.11.210@tcp1'
> mgs_sec_nid='10.10.11.211@tcp1'
>
> mkfs.lustre --mgsnode=$mgs_pri_nid --failnode=$mgs_pri_nid --reformat
> --fsname=v --mdt --index=1 $mdt1_dev
>
> mount -t lustre $mdt1_dev /lustre/v/mdt1
>
> With this, MDT1 will mount successfully. The only difference is that
> the second "--mgsnode" is not provided during mkfs.lustre.
>
> Step 3: On MDS1 again:
> mkfs.lustre --mgsnode=$mgs_pri_nid --mgsnode=$mgs_sec_nid
> --failnode=$mgs_pri_nid --reformat --fsname=v --mdt --index=2 $mdt2_dev
> mount -t lustre $mdt2_dev /lustre/v/mdt2
>
> Once MDT1 is mounted, using a second "--mgsnode" option works without
> any errors, and the mount of MDT2 succeeds.
>
> Lustre Versions: Reproducible on 2.4.0 and 2.4.91 versions.
>
> Conclusion: Due to this bug, MDTs do not mount on MDSs that are not
> running the MGS. With the workaround, HA will not be properly
> configured. Also note that this issue is not related to DNE. The same
> issue and "workaround" apply to an MDT of a different filesystem on
> MDS1 as well.
>
> My initial thoughts on this are as follows:
>
> In the above case, while mounting an MDT on MDS1, one of the mgsnodes
> is MDS1 itself.
>
> It looks like ptlrpc_uuid_to_peer() calculates the distance to NIDs
> using LNetDist() and chooses the one with the least distance (which in
> this case turns out to be MDS1 itself, which does not have a running
> MGS).
>
> Removing MDS1 from mgsnode and adding a different node worked for me.
>
> I would really appreciate any inputs on this.
> Thanks!
>
> --
> Swapnil
--
Swapnil
Re: [HPDD-discuss] [Lustre-discuss] Loop device performance
by Dilger, Andreas
On 2013/08/25 6:39 AM, "Nikolay Kvetsinski" <nkvecinski(a)gmail.com> wrote:
>Hello, I have a production script that does read operations on a lot
>of small files. I read that one can gain a performance boost with small
>files by using a loop device on top of Lustre. So I created a 500 GB
>file striped across all of my OSTs (of which there are 8). I formatted
>the file with an ext2 fs and mounted it on a client. Just for the sake
>of testing, a simple bash script finds all files with a given file type
>and cats the first 10 lines to /dev/null.
>
>
>When I run the script on the Lustre cluster I get :
>
>
>time sh test.sh
>
>
>real 1m16.804s
>user 0m2.539s
>sys 0m5.363s
>
>
>
>If I immediately re-run the script the time is :
>
>
>real 0m12.158s
>user 0m2.218s
>sys 0m5.430s
>
>
>
>
>There are 5406 files that meet the filetype criteria.
>
>
>When I run the script on the mounted loop device I get :
>
>
>real 2m30.177s
>user 0m2.290s
>sys 0m4.880s
>
>And immediate re-run gives me :
>
>real 0m7.810s
>user 0m2.187s
>sys 0m5.360s
>
>
>I'm using lustre-2.4.0-2.6.32_358.6.2.el6_lustre.g230b174.x86_64_gd3f91c4.
>I also set all of the "small files" optimizations, like no striping for
>the dirs containing the small files, max_dirty_mb=256,
>max_rpcs_in_flight=32, statahead=8192 and lnet.debug=0.
>Is it normal to get two times slower access times with the mounted loop
>device?
It depends on how the loop device is doing IO on the underlying objects.
It may be that ext2 isn't the best filesystem for this. You could try
formatting it with:
mke2fs -t ext4 -O ^has_journal {device}
which will enable flex_bg, extents, and other ext4 features but disables
the journal (which I assume you don't need, because you were formatting
as ext2 originally). You should also mount with "-t ext4".
The flex_bg and mballoc features of ext4 may help improve the IO going
to the back-end storage and improve the performance when running over
loop devices.
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
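The suggestion above, end to end, might look like the following sketch
(the paths and the sparse-allocation step are hypothetical, not from the
original posts):

```shell
# Stripe the backing file across all OSTs, then allocate it sparsely
lfs setstripe -c -1 /mnt/lustre/loopfile
truncate -s 500G /mnt/lustre/loopfile
# Format with ext4 features (flex_bg, extents, mballoc) but no journal;
# -F is needed because the target is a regular file, not a block device
mke2fs -t ext4 -O ^has_journal -F /mnt/lustre/loopfile
# Loop-mount it as ext4 on the client
mkdir -p /mnt/smallfiles
mount -t ext4 -o loop /mnt/lustre/loopfile /mnt/smallfiles
```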
How to add another servicenode
by Ryuichi Sudo @ SSTC
Hello,
I have Lustre 2.2 running and am looking for a way to add service nodes
based on tcp (Ethernet).
My MDS and OSS are formatted with IB-based NIDs (o2ib) like this:
(1) MDS
mkfs.lustre --reformat --fsname=lust01 --mgs --mdt
--servicenode=10.10.6.203@o2ib --servicenode=10.10.6.204@o2ib /dev/sdb
(2) OSS
mkfs.lustre --reformat --fsname=lust01 --servicenode=10.10.6.205@o2ib
--servicenode=10.10.6.206@o2ib --mgsnode=10.10.6.203@o2ib
--mgsnode=10.10.6.204@o2ib --ost --index=0 /dev/sdb
This Lustre filesystem ('lust01') has been up & running for a while,
and I'd like to add tcp0 interfaces to the MDS/OSS as service nodes.
The purpose is to add non-IB-based Lustre clients to this LNET.
(e.g.)
--servicenode=10.10.1.1@tcp0 --servicenode=10.10.1.2@tcp0
My questions are:
- Is this a valid Lustre configuration?
- Is it possible to add tcp0 service nodes to an existing (operating)
Lustre filesystem?
- If yes, could someone show me how to add tcp0 NIDs to the MDS/OSS?
- Any pointers to the manual/URLs are welcome.
Thanks, in advance.
Sudo
= 2xMDS = = = = = 2xOSS = =
203@o2ib - - - - - - 205@o2ib
204@o2ib - - - - - - 206@o2ib
- - - - - - - - - - - - - - - - - -
1@tcp0
2@tcp0
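For reference, the usual mechanism for changing the NIDs recorded for an
existing target is tunefs.lustre with --writeconf, which regenerates the
configuration logs. The following is only a hedged sketch (not verified
on Lustre 2.2; the tcp0 NIDs for the OSS are placeholders, all targets
must be unmounted first, and NIDs belonging to the same node are
comma-separated within one option):

```shell
# LNET must already be configured with both networks on each server,
# e.g.:  options lnet networks="o2ib(ib0),tcp0(eth0)"
# On the MDS device, respecify every service node with both NIDs:
tunefs.lustre --writeconf \
  --servicenode=10.10.6.203@o2ib,10.10.1.1@tcp0 \
  --servicenode=10.10.6.204@o2ib,10.10.1.2@tcp0 /dev/sdb
# On the OSS device, also respecify the MGS NIDs (tcp0 addresses for the
# OSS itself are hypothetical):
tunefs.lustre --writeconf \
  --servicenode=10.10.6.205@o2ib,10.10.1.3@tcp0 \
  --servicenode=10.10.6.206@o2ib,10.10.1.4@tcp0 \
  --mgsnode=10.10.6.203@o2ib,10.10.1.1@tcp0 \
  --mgsnode=10.10.6.204@o2ib,10.10.1.2@tcp0 /dev/sdb
```

Note that --writeconf erases the old configuration logs, so all servers
and clients have to be remounted afterwards.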
Query on RDMA over IP with Lustre
by Singhal, Upanshu
Hello,
I am planning to use RDMA over IP with Lustre. Can someone please let me
know what configuration/settings we need for Lustre with RDMA over IP?
Also, is there a list of RDMA-capable cards that are certified with
Lustre?
I looked into the manual and it mentions RDMA with InfiniBand; I hope we
should be able to use RDMA over IP with Lustre as well. Please reply.
Thanks,
-Upanshu
Upanshu Singhal
EMC Data Storage Systems, Bangalore, India.
Phone: 91-80-67375604
Re: [HPDD-discuss] [Lustre-discuss] chmod and chown
by Dilger, Andreas
You didn't mention which version of Lustre you are using, but this is probably related to https://jira.hpdd.intel.com/browse/LU-3671
Cheers, Andreas
On 2013-08-20, at 1:08, "Nikolay Kvetsinski" <nkvecinski(a)gmail.com<mailto:nkvecinski@gmail.com>> wrote:
Hello guys,
do any of you have problems with chown and chmod operations taking an extremely long time to complete? For example, chown on a 200 MB folder is taking minutes. The MDS is not loaded at all. All other operations (file read and write) seem to be doing fine.
Any help is greatly appreciated.
Lustre Community Roadmap Update, Realignment on 2.5
by OpenSFS Administration
Background
----------
The Lustre community has banded together to work on the development of the
Lustre source code. As part of that effort, we regularly discuss the
roadmap for major Lustre releases. We have developed a schedule of major
releases that occur every six months.
We recognize, however, that many organizations need a branch of Lustre that
they can run for much longer periods of time than six months. To that end
we collectively decided that we would target the Lustre 2.4.0 feature
release to begin a side branch of Lustre to receive regularly scheduled
maintenance releases for a significant period of time (18 months or more).
Unfortunately, one particular feature that some folks feel strongly is an
absolute requirement, HSM (Hierarchical Storage Management), did not land on
the development branch in time for Lustre 2.4.0. Happily, HSM has now
landed on the master branch and will appear in the next Lustre feature
release, version 2.5.0.
Rather than split the community effort between two closely spaced scheduled
maintenance release branches, we have decided to realign our efforts from
the 2.4 branch to the 2.5 branch.
An updated Community Lustre Roadmap has been posted here:
http://lustre.opensfs.org/community-lustre-roadmap/
What this means in a Nutshell
-----------------------------
The Lustre 2.5 branch is the one for which we expect to see regularly
scheduled maintenance releases for an extended period of time (18+ months).
We no longer expect regularly scheduled maintenance releases on the Lustre
2.4 branch after the Lustre 2.4.2 release. Our recommended upgrade path
after 2.4.2 is an upgrade to the latest 2.5.X release.
What to do if you are already committed to Lustre 2.4
-----------------------------------------------------
If you have already committed to using Lustre 2.4 at your organization and
an upgrade to Lustre 2.5 is not either desired or possible, we recommend
that you contact your Lustre support vendor to find out which branches they
intend to support. Some Lustre vendors offer support for branches beyond
the Community's officially recommended releases.
Some Lustre vendors have already announced their intent to continue
supporting their users who have committed to the Lustre 2.4 branch.
About OpenSFS
Open Scalable File Systems, Inc. is a strong and growing nonprofit
organization dedicated to the success of the Lustre® file system.
OpenSFS was founded in 2010 to advance Lustre
(http://lustre.opensfs.org/download-lustre/), ensuring it remains
vendor-neutral, open, and free. Since its inception, OpenSFS has been
responsible for advancing Lustre, delivering new releases
(http://lustre.opensfs.org/community-lustre-roadmap/) on behalf of the
open source development community. Through working groups, events, and
ongoing funding initiatives, OpenSFS (http://www.opensfs.org/) harnesses
the power of collaborative development to fuel innovation and growth of
Lustre worldwide.
* Lustre is a registered trademark of Xyratex Technology Ltd.
__________________________
OpenSFS Administration
3855 SW 153rd Drive Beaverton, OR 97006 USA
Phone: +1 503-619-0561 | Fax: +1 503-644-6708
Twitter: @OpenSFS (https://twitter.com/opensfs)
Email: admin(a)opensfs.org | Website: www.opensfs.org