OK, I did a little more research and I found that I could increase the verbosity of the LNET debugging output by doing the following:
echo +neterror > /proc/sys/lnet/printk
So, I did that and tried one of the failing "lctl ping" commands again:
[root@lustre-mgs ~]# lctl ping 192.168.1.101@tcp0
failed to ping 192.168.1.101@tcp: Input/output error
Here's what I see now in dmesg:
[174224.584692] LNet: 3900:0:(acceptor.c:114:lnet_connect_console_error()) Connection to 192.168.1.101@tcp at host 192.168.1.101 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
[174224.584711] LNet: 3900:0:(socklnd_cb.c:424:ksocknal_txlist_done()) Deleting packet type 2 len 0 192.168.1.100@tcp->192.168.1.101@tcp
I understand tcp is just a synonym for tcp0, so I think that's okay ... The network configuration on each of these machines is very simple: only one interface on any of them is up and running, one port on an Intel X520 10 Gig NIC. I have LNET configured in /etc/modprobe.d/lustre.conf on, e.g., the MGS like so:
options lnet networks=tcp0(p1p2)
That's correct, yes? In this case, p1p1 and p1p2 are the two 10 Gig NIC ports ... I don't know why RHEL uses such funky names ... But very basic, no routing, not even multiple interfaces ...
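One thing that might be worth double-checking is that the interface named in the networks= option is really the one that is up and carries the 192.168.1.x address. A minimal sketch of that check follows; the helper name lnet_ifaces is made up for illustration, and it only handles the simple net(iface[,iface]) form:

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): print the interface names
# found in an "options lnet networks=..." line, one per line.
lnet_ifaces() {
    # Extract each parenthesised group, drop the parentheses,
    # then split comma-separated interfaces onto separate lines.
    printf '%s\n' "$1" | grep -o '([^)]*)' | tr -d '()' | tr ',' '\n'
}

conf=/etc/modprobe.d/lustre.conf
if [ -r "$conf" ]; then
    for ifc in $(lnet_ifaces "$(grep 'networks=' "$conf")"); do
        # Prints one line per IPv4 address on the interface;
        # empty output means the interface exists but has no IPv4 address.
        ip -o -4 addr show dev "$ifc" || echo "$ifc: interface not found"
    done
fi
```

If the interface listed there turns out to be the port that is down or unaddressed, that alone could explain the failed ping, though this is only a guess from the configuration shown.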
Continuing to research ... I assume error -113 in this case is just a generic "connection failure" type of error, although if something could be deduced from that, it would certainly be great :O
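For what it's worth, on x86 Linux errno 113 is EHOSTUNREACH ("No route to host"), which is somewhat more specific than a generic failure. A small sketch of a lookup helper is below; the function name errno_name is made up here, and the table covers only a few Linux errno values that tend to show up with connection problems:

```shell
#!/bin/sh
# Hypothetical helper: map a (possibly negative) kernel error number
# to its Linux errno name. Values are those used on x86 Linux.
errno_name() {
    n=${1#-}    # strip a leading minus sign, as in "-113"
    case "$n" in
        5)   echo "EIO (Input/output error)" ;;
        101) echo "ENETUNREACH (Network is unreachable)" ;;
        110) echo "ETIMEDOUT (Connection timed out)" ;;
        111) echo "ECONNREFUSED (Connection refused)" ;;
        113) echo "EHOSTUNREACH (No route to host)" ;;
        *)   echo "unknown ($n)" ;;
    esac
}

errno_name -113
```

If it really is EHOSTUNREACH, things to rule out might include a firewall rule on the target rejecting the LNET acceptor port (988 by default) or an ARP or cabling problem on the link. Those are guesses, not something the log above proves.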