Hello,

Can the servers ping each other with the system ping command?
Have you checked that no firewall is running on the machines?
Have you checked whether SELinux is disabled or properly configured on the machines?

Have you tried running tcpdump on both machines to see whether any Lustre traffic passes through the network interface?
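
A minimal sketch of those four checks as shell commands, printed rather than executed so they can be reviewed first and then run as root on each node. The peer address 192.168.1.101, the interface p1p2 and TCP port 988 come from the quoted message below; the function name print_lnet_checks is made up for illustration, and iptables is only an assumption about which firewall these hosts use:

```shell
# Hypothetical helper: print basic connectivity checks for one LNet peer.
# Nothing is executed; copy the lines you want and run them as root.
print_lnet_checks() {
    peer="$1"          # IP address of the other node, e.g. 192.168.1.101
    iface="$2"         # interface LNet is bound to, e.g. p1p2
    port="${3:-988}"   # TCP port seen in the LNet log; defaults to 988

    echo "ping -c 3 ${peer}"        # plain ICMP reachability
    echo "iptables -L -n -v"        # list firewall rules (assumes iptables)
    echo "getenforce"               # current SELinux mode
    echo "tcpdump -nn -i ${iface} host ${peer} and tcp port ${port}"
}

print_lnet_checks 192.168.1.101 p1p2
```

Running the tcpdump line on both nodes while repeating the failing "lctl ping" shows whether the connection attempt leaves one machine and whether it arrives at the other.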

On 02/07/2015 at 23:03, Sean Caron wrote:
OK, I did a little more research and I found that I could increase the verbosity of the LNET debugging output by doing the following:

echo +neterror > /proc/sys/lnet/printk

So, I did that and tried one of the failing "lctl ping" commands again:

[root@lustre-mgs ~]# lctl ping 192.168.1.101@tcp0
failed to ping 192.168.1.101@tcp: Input/output error
[root@lustre-mgs ~]#

Here's what I see now in dmesg:

[174224.584669] LNet: 3900:0:(lib-socket.c:626:lnet_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 192.168.1.101/988
[174224.584692] LNet: 3900:0:(acceptor.c:114:lnet_connect_console_error()) Connection to 192.168.1.101@tcp at host 192.168.1.101 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
[174224.584711] LNet: 3900:0:(socklnd_cb.c:424:ksocknal_txlist_done()) Deleting packet type 2 len 0 192.168.1.100@tcp->192.168.1.101@tcp

I understand tcp is just a synonym for tcp0, so I think that's okay. The network configuration on each of these machines is very simple: only one interface is up and running on any of them, one port on an Intel X520 10 Gig NIC. I have LNET configured in /etc/modprobe.d/lustre.conf on, for example, the MGS like so:

options lnet networks=tcp0(p1p2)

That's correct, yes? In this case, p1p1 and p1p2 are the two 10 Gig NIC ports ... I don't know why RHEL uses such funky names ... But very basic, no routing, not even multiple interfaces ...
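
One way to double-check that the interface named in the option is the one that is actually up is to pull the interface name out of the networks= value and inspect it with ip. A sketch only: lnet_ifaces is a made-up helper, and it handles just the simple single-network net(iface[,iface]) form shown above:

```shell
# Hypothetical helper: list the interface names inside the parentheses of a
# single-network "networks=" value, one per line.
lnet_ifaces() {
    printf '%s\n' "$1" | sed -n 's/.*(\([^)]*\)).*/\1/p' | tr ',' '\n'
}

for i in $(lnet_ifaces 'options lnet networks=tcp0(p1p2)'); do
    echo "ip addr show dev $i"   # printed, not run; run it on the node itself
done
```

If the interface that carries the 192.168.1.x address turns out to be the other port, the networks= line is the place to fix it.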

Continuing to research ... I assume error -113 in this case is just a generic "connection failure" type error although if something could be deduced from that, it would certainly be great :O
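
The number in that log line is a negated Linux errno value, and on x86 Linux 113 is EHOSTUNREACH ("No route to host"), which is more specific than a generic connection failure. When an ordinary ping works, one common cause is a firewall rule rejecting the TCP connection with an ICMP unreachable reply, though that is only a guess for this setup. A small sketch that names a few connect-related values; decode_errno is a made-up helper and the table is deliberately incomplete:

```shell
# Hypothetical helper: name a few Linux errno values (asm-generic numbering,
# as used on x86) that show up when a TCP connect fails.
decode_errno() {
    case "${1#-}" in          # strip the leading minus the kernel log uses
        101) echo "ENETUNREACH: Network is unreachable" ;;
        110) echo "ETIMEDOUT: Connection timed out" ;;
        111) echo "ECONNREFUSED: Connection refused" ;;
        113) echo "EHOSTUNREACH: No route to host" ;;
        *)   echo "errno ${1#-}: not in this short table" ;;
    esac
}

decode_errno -113
```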

Thanks,

Sean


On Thu, Jul 2, 2015 at 4:47 PM, Sean Caron <scaron@umich.edu> wrote:
Thanks, Rick. I'm just starting to get my bearings with Lustre, so the various diagnostic tools and troubleshooting mechanisms aren't yet clear to me, and it's helpful that you mentioned "lctl". I tried that on the MDS and it shows LNET as up, consistent with my configuration:

[root@lustre-mgs ~]# lctl list_nids
192.168.1.100@tcp
[root@lustre-mgs ~]#

If I do kind of a loop-back Lustre ping on the MDS, it appears to work ... doesn't give me an error message back:

[root@lustre-mgs ~]# lctl ping 192.168.1.100@tcp0
12345-0@lo
12345-192.168.1.100@tcp
[root@lustre-mgs ~]# 

Now, on the OSS machines, "lctl" also shows Lustre networking being up and running consistently with how I have it configured:

[root@lustre-oss1 ~]# lctl list_nids
192.168.1.101@tcp
[root@lustre-oss1 ~]#

I can do the same loop-back ping on the OSS and it seems to "work":

[root@lustre-oss1 log]# lctl ping 192.168.1.101@tcp0
12345-0@lo
12345-192.168.1.101@tcp
[root@lustre-oss1 log]#

However, if I try to ping the MGS from the OSS, it gives me an I/O error!

[root@lustre-oss1 ~]# lctl ping 192.168.1.100@tcp0
failed to ping 192.168.1.100@tcp: Input/output error
[root@lustre-oss1 ~]#

It seems to fail consistently in both directions with the same error message; I tried it also on the MGS:

[root@lustre-mgs ~]# lctl ping 192.168.1.101@tcp0
failed to ping 192.168.1.101@tcp: Input/output error
[root@lustre-mgs ~]# 

Am I missing a module somewhere that I need to be loading? I don't see any messages in dmesg or /var/log/messages corresponding to my attempt to run "lctl ping" that might help to point in the direction of what's going wrong.

Of course, a normal (ICMP) ping between the hosts works fine; they're on the same switch, so same L2 broadcast domain, etc.

Nothing in /etc/hosts to go awry; there's just the one entry for localhost.localdomain at 127.0.0.1.

Any thoughts?

Thanks,

Sean


On Tue, Jun 30, 2015 at 11:51 PM, Mohr Jr, Richard Frank (Rick Mohr) <rmohr@utk.edu> wrote:

> On Jun 30, 2015, at 5:07 PM, Sean Caron <scaron@umich.edu> wrote:
>
> So that all seems okay, but then I go over to my first OSS node ... I first try to run mkfs.lustre, that seems to complete okay:
>
> mkfs.lustre --fsname=lustre --mgsnode=192.168.1.100@tcp0 --ost --index=1 --reformat /dev/md2
>
> But then if I try to actually mount that, it pauses for a moment, then gives me a timeout error:
>

Have you tried running “lctl ping 192.168.1.100@tcp0” from the OSS node to make sure it has LNet connectivity to the MDS node? You can also try running “lctl list_nids” on the MDS node to make sure that it has the 192.168.1.100@tcp0 NID configured.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu





_______________________________________________
HPDD-discuss mailing list
HPDD-discuss@lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss

-- 
Jérome BECOT

Systems and Network Administrator

Molécules à visée Thérapeutique par des approches in Silico (MTi)
Univ Paris Diderot, UMRS973 Inserm
Case 013
Bât. Lamarck A, porte 412
35, rue Hélène Brion 75205 Paris Cedex 13
France

Tel : 01 57 27 83 82