Hello,
Can the servers ping each other via the system ping command?
Have you checked that there is no firewall running on the machines?
Have you checked whether SELinux is disabled or properly configured on the
machines?
Have you tried running tcpdump on both machines to see whether any Lustre
traffic passes through the network interface?
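One quick way to separate a firewall problem from an LNet problem is to check whether a plain TCP connection to the LNet port succeeds at all. The following is only a rough sketch: it assumes port 988 (the destination port that appears in the dmesg output quoted below), and check_tcp_port is a name made up for this illustration, not a Lustre tool. A successful connect only shows that the port is reachable; it says nothing about whether LNet itself will accept the peer.

```python
import errno
import os
import socket


def check_tcp_port(host, port=988, timeout=3.0):
    """Try a plain TCP connect and return (reachable, detail).

    port=988 is an assumption taken from the dmesg output in this
    thread; change it if the acceptor is configured differently.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer at all: packets may be silently dropped somewhere.
        return False, "timed out"
    except OSError as exc:
        # Report the symbolic errno name so it can be compared with dmesg.
        name = errno.errorcode.get(exc.errno, "UNKNOWN")
        text = os.strerror(exc.errno) if exc.errno else str(exc)
        return False, f"{name}: {text}"
```

For example, running check_tcp_port("192.168.1.101") on the MGS and check_tcp_port("192.168.1.100") on the OSS shows whether each side can reach the other on that port, and if not, which error the kernel reports.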
On 02/07/2015 23:03, Sean Caron wrote:
OK, I did a little more research and I found that I could increase the
verbosity of the LNET debugging output by doing the following:
echo +neterror > /proc/sys/lnet/printk
So, I did that and tried one of the failing "lctl ping" commands again:
[root@lustre-mgs ~]# lctl ping 192.168.1.101@tcp0
failed to ping 192.168.1.101@tcp: Input/output error
[root@lustre-mgs ~]#
Here's what I see now in dmesg:
[174224.584669] LNet: 3900:0:(lib-socket.c:626:lnet_sock_connect())
Error -113 connecting 0.0.0.0/1023 -> 192.168.1.101/988
[174224.584692] LNet:
3900:0:(acceptor.c:114:lnet_connect_console_error()) Connection to
192.168.1.101@tcp at host 192.168.1.101 was unreachable: the network
or that node may be down, or Lustre may be misconfigured.
[174224.584711] LNet: 3900:0:(socklnd_cb.c:424:ksocknal_txlist_done())
Deleting packet type 2 len 0 192.168.1.100@tcp->192.168.1.101@tcp
I understand tcp is just a synonym for tcp0 so I think that's okay ...
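That equivalence can be made explicit with a small helper when comparing NIDs from different outputs. This is a hypothetical sketch, not part of Lustre: normalize_nid and same_nid are invented names, and the code only handles the simple address@network form seen in this thread.

```python
import re


def normalize_nid(nid):
    """Return the NID with an explicit network number, e.g. tcp -> tcp0."""
    addr, sep, net = nid.partition("@")
    if not sep or not net:
        raise ValueError(f"not a NID: {nid!r}")
    # A network name without a trailing number is treated as network 0.
    if not re.search(r"\d$", net):
        net += "0"
    return f"{addr}@{net}"


def same_nid(a, b):
    """Compare two NIDs after making the network number explicit."""
    return normalize_nid(a) == normalize_nid(b)
```

With this, "192.168.1.101@tcp" (as printed by lctl) and "192.168.1.101@tcp0" (as typed on the command line) compare as equal.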
Network configuration on each of these machines is very simple: only
one interface on any of them is up and running, one port on an Intel
X520 10 Gig NIC. I have LNET configured in /etc/modprobe.d/lustre.conf
on, e.g., the MGS like so:
options lnet networks=tcp0(p1p2)
That's correct, yes? In this case, p1p1 and p1p2 are the two 10 Gig
NIC ports ... I don't know why RHEL uses such funky names ... But very
basic, no routing, not even multiple interfaces ...
Continuing to research ... I assume error -113 in this case is just a
generic "connection failure" type error although if something could be
deduced from that, it would certainly be great :O
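Something can in fact be deduced from it: kernel messages report failures as negated errno values, so -113 can be looked up directly. On x86 Linux, errno 113 is EHOSTUNREACH ("No route to host"), which is more specific than a generic connection failure. It is commonly seen when ARP for the peer fails, or when a firewall on the peer rejects the connection with an ICMP "host prohibited" reply. A minimal sketch of the lookup follows; describe_errno is a name invented for this example, and the result depends on the platform where it runs.

```python
import errno
import os


def describe_errno(code):
    """Map a kernel-style (possibly negative) error code to (name, message)."""
    num = abs(code)
    return errno.errorcode.get(num, "UNKNOWN"), os.strerror(num)


# The value from the dmesg line "Error -113 connecting ...".
print(describe_errno(-113))
```

Run on the machine that logged the error, this gives the symbolic name to search for, which is usually more useful than the bare number.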
Thanks,
Sean
On Thu, Jul 2, 2015 at 4:47 PM, Sean Caron <scaron(a)umich.edu> wrote:
Thanks, Rick. I'm just starting out getting my bearings with
Lustre, so the various diagnostic tools and troubleshooting
mechanisms aren't all clear to me yet; it's helpful that you
mentioned "lctl". I tried that on the MDS and it shows LNET as up,
consistent with my configuration:
[root@lustre-mgs ~]# lctl list_nids
192.168.1.100@tcp
[root@lustre-mgs ~]#
If I do kind of a loop-back Lustre ping on the MDS, it appears to
work ... doesn't give me an error message back:
[root@lustre-mgs ~]# lctl ping 192.168.1.100@tcp0
12345-0@lo
12345-192.168.1.100@tcp
[root@lustre-mgs ~]#
Now, on the OSS machines, "lctl" also shows Lustre networking
being up and running consistently with how I have it configured:
[root@lustre-oss1 ~]# lctl list_nids
192.168.1.101@tcp
[root@lustre-oss1 ~]#
I can do the same loop-back ping on the OSS and it seems to "work":
[root@lustre-oss1 log]# lctl ping 192.168.1.101@tcp0
12345-0@lo
12345-192.168.1.101@tcp
[root@lustre-oss1 log]#
However, if I try to ping the MGS from the OSS, it gives me an I/O error!
[root@lustre-oss1 ~]# lctl ping 192.168.1.100@tcp0
failed to ping 192.168.1.100@tcp: Input/output error
[root@lustre-oss1 ~]#
It seems to fail consistently in both directions with the same
error message; I tried it also on the MGS:
[root@lustre-mgs ~]# lctl ping 192.168.1.101@tcp0
failed to ping 192.168.1.101@tcp: Input/output error
[root@lustre-mgs ~]#
Am I missing a module somewhere that I need to be loading? I don't
see any messages in dmesg or /var/log/messages corresponding to my
attempt to run "lctl ping" that might help to point in the
direction of what's going wrong.
Of course, normal ICMP ping between the hosts works fine; they're
on the same switch, so same L2 broadcast domain, etc.
Nothing in /etc/hosts to go awry; there's just the one entry for
localhost.localdomain at 127.0.0.1.
Any thoughts?
Thanks,
Sean
On Tue, Jun 30, 2015 at 11:51 PM, Mohr Jr, Richard Frank (Rick
Mohr) <rmohr(a)utk.edu> wrote:
> On Jun 30, 2015, at 5:07 PM, Sean Caron <scaron(a)umich.edu> wrote:
>
> So that all seems okay, but then I go over to my first OSS
> node ... I first try to run mkfs.lustre, that seems to
> complete okay:
>
> mkfs.lustre --fsname=lustre --mgsnode=192.168.1.100@tcp0 --ost --index=1 --reformat /dev/md2
>
> But then if I try to actually mount that, it pauses for a
> moment, then gives me a timeout error:
>
Have you tried running “lctl ping 192.168.1.100@tcp0” from the
OSS node to make sure it has LNet connectivity to the MDS
node? You can also try running “lctl list_nids” on the MDS
node to make sure that it has the 192.168.1.100@tcp0 nid
configured.
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss(a)lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss
--
Jérome BECOT
Systems and Network Administrator
Molécules à visée Thérapeutique par des approches in Silico (MTi)
Univ Paris Diderot, UMRS973 Inserm
Case 013
Bât. Lamarck A, porte 412
35, rue Hélène Brion 75205 Paris Cedex 13
France
Tel : 01 57 27 83 82