Thanks, Rick. I'm just getting my bearings with Lustre, so I'm not yet familiar with all the diagnostic tools and troubleshooting mechanisms available; it's helpful that you mentioned "lctl". I tried that on the MDS, and it shows LNET as up, consistent with my configuration:

[root@lustre-mgs ~]# lctl list_nids
192.168.1.100@tcp
[root@lustre-mgs ~]#

If I do a kind of loopback Lustre ping on the MDS, it appears to work; at least it doesn't give me an error message back:

[root@lustre-mgs ~]# lctl ping 192.168.1.100@tcp0
12345-0@lo
12345-192.168.1.100@tcp
[root@lustre-mgs ~]# 

Now, on the OSS machines, "lctl" also shows Lustre networking up and running, consistent with how I have it configured:

[root@lustre-oss1 ~]# lctl list_nids
192.168.1.101@tcp
[root@lustre-oss1 ~]#

I can do the same loop-back ping on the OSS and it seems to "work":

[root@lustre-oss1 log]# lctl ping 192.168.1.101@tcp0
12345-0@lo
12345-192.168.1.101@tcp
[root@lustre-oss1 log]#

However, if I try to ping the MDS from the OSS, it gives me an I/O error!

[root@lustre-oss1 ~]# lctl ping 192.168.1.100@tcp0
failed to ping 192.168.1.100@tcp: Input/output error
[root@lustre-oss1 ~]#

It fails consistently in both directions with the same error message; I also tried pinging the OSS from the MGS:

[root@lustre-mgs ~]# lctl ping 192.168.1.101@tcp0
failed to ping 192.168.1.101@tcp: Input/output error
[root@lustre-mgs ~]# 

Am I missing a module somewhere that I need to be loading? I don't see any messages in dmesg or /var/log/messages corresponding to my attempt to run "lctl ping" that might help to point in the direction of what's going wrong.
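
One way to rule a missing module in or out is to compare the output of "lsmod" against the modules LNet normally needs over TCP. This is only a minimal sketch: the function name is made up for illustration, and the module list (libcfs, lnet, ksocklnd) is an assumption about a typical TCP-only setup and may differ between Lustre versions.

```shell
# Hypothetical helper: print any expected LNet module that does not
# appear in an lsmod-style listing passed as the first argument.
# The module list is an assumption for a TCP-only configuration.
missing_lnet_modules() {
    local listing="$1" mod
    for mod in libcfs lnet ksocklnd; do
        # lsmod prints the module name in the first column
        if ! printf '%s\n' "$listing" |
            awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
            echo "$mod"
        fi
    done
}

# Usage on a live host:
#   missing_lnet_modules "$(lsmod)"
```

If this prints nothing, all three modules are loaded and the problem most likely lies elsewhere.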

Of course, a normal ICMP ping between the hosts works fine; they're on the same switch, so the same L2 broadcast domain, etc.
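
An ordinary ping only shows that the hosts can reach each other at the IP level; the TCP LND uses its own TCP port, 988 by default, which a host firewall could still be blocking. The sketch below tries a bare TCP connect to that port. The function names are hypothetical, and it assumes bash (for /dev/tcp), the coreutils "timeout" command, and an unchanged default acceptor port.

```shell
# Hypothetical helper: strip the network suffix from a NID,
# e.g. 192.168.1.100@tcp0 -> 192.168.1.100
nid_addr() {
    printf '%s\n' "${1%%@*}"
}

# Hypothetical helper: attempt a plain TCP connect to the LNet acceptor port.
# $1 = NID, $2 = port (assumed default: 988). Success only proves the TCP
# handshake completes; it does not prove LNet itself is healthy.
check_lnet_port() {
    local addr port="${2:-988}"
    addr=$(nid_addr "$1")
    if timeout 3 bash -c "exec 3<>/dev/tcp/$addr/$port" 2>/dev/null; then
        echo "$addr:$port reachable"
    else
        echo "$addr:$port NOT reachable"
        return 1
    fi
}

# Usage from the OSS:
#   check_lnet_port 192.168.1.100@tcp0
```

A "NOT reachable" result while ICMP ping succeeds would point towards packet filtering (iptables, for example) rather than a missing module.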

Nothing in /etc/hosts to go awry; there's just the one entry for localhost.localdomain at 127.0.0.1.

Any thoughts?

Thanks,

Sean


On Tue, Jun 30, 2015 at 11:51 PM, Mohr Jr, Richard Frank (Rick Mohr) <rmohr@utk.edu> wrote:

> On Jun 30, 2015, at 5:07 PM, Sean Caron <scaron@umich.edu> wrote:
>
> So that all seems okay, but then I go over to my first OSS node ... I first try to run mkfs.lustre, that seems to complete okay:
>
> mkfs.lustre --fsname=lustre --mgsnode=192.168.1.100@tcp0 --ost --index=1 --reformat /dev/md2
>
> But then if I try to actually mount that, it pauses for a moment, then gives me a timeout error:
>

Have you tried running “lctl ping 192.168.1.100@tcp0” from the OSS node to make sure it has LNet connectivity to the MDS node?  You can also try running “lctl list_nids” on the MDS node to make sure that it has the 192.168.1.100@tcp0 nid configured.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu