Thanks, Rick. I'm just starting out and getting my bearings with Lustre, so the various diagnostic tools and troubleshooting mechanisms aren't clear to me yet; it's helpful that you mentioned "lctl". I tried it on the MDS, and it shows LNET as up, consistent with my configuration:
[root@lustre-mgs ~]# lctl list_nids
If I do a kind of loop-back Lustre ping on the MDS, it appears to work; at least, it doesn't give me an error message back:
[root@lustre-mgs ~]# lctl ping 192.168.1.100@tcp0
Now, on the OSS machines, "lctl" also shows Lustre networking as up and running, consistent with how I have it configured:
[root@lustre-oss1 ~]# lctl list_nids
I can do the same loop-back ping on the OSS and it seems to "work":
[root@lustre-oss1 log]# lctl ping 192.168.1.101@tcp0
However, if I try to ping the MDS from the OSS, it gives me an I/O error!
[root@lustre-oss1 ~]# lctl ping 192.168.1.100@tcp0
failed to ping 192.168.1.100@tcp: Input/output error
It fails consistently in both directions with the same error message; I also tried it from the MGS to the OSS:
[root@lustre-mgs ~]# lctl ping 192.168.1.101@tcp0
failed to ping 192.168.1.101@tcp: Input/output error
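The checks above can be collected into one small script so they are easy to repeat from each node. This is only a sketch: the function name `ping_nids` and the `LCTL` variable are made up for illustration, and it assumes that "lctl ping" exits non-zero when the ping fails.

```shell
#!/bin/sh
# Sketch: run "lctl ping" against several NIDs and summarize the results.
# LCTL and ping_nids are illustrative names, not part of Lustre.
LCTL=${LCTL:-lctl}

ping_nids() {
    for nid in "$@"; do
        # Assumption: lctl exits non-zero when the ping fails.
        if "$LCTL" ping "$nid" >/dev/null 2>&1; then
            echo "$nid ok"
        else
            echo "$nid FAILED"
        fi
    done
}

# Example, using the NIDs shown above:
#   ping_nids 192.168.1.100@tcp0 192.168.1.101@tcp0
```

Running the same call on both the MGS and the OSS makes it easy to see at a glance which direction fails.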
Am I missing a module somewhere that I need to be loading? I don't see any messages in dmesg or /var/log/messages corresponding to my attempt to run "lctl ping" that might help to point in the direction of what's going wrong.
Of course, a normal (ICMP) ping between the hosts works fine; they're on the same switch, so the same L2 broadcast domain, etc.
Nothing in /etc/hosts to go awry; there's just the one entry for localhost.localdomain at 127.0.0.1.
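A working ICMP ping between the hosts does not by itself show that the TCP port used by LNET is reachable. A minimal sketch of a further check follows. It assumes the default LNET acceptor port, 988, is in use (this has not been confirmed for this setup), and the function name `check_tcp_port` is hypothetical.

```shell
#!/bin/sh
# Sketch: test whether a TCP port on a peer accepts connections.
# Assumption: LNET is using its default acceptor port, 988.
check_tcp_port() {
    host="$1"
    port="${2:-988}"
    # bash's /dev/tcp pseudo-device opens a TCP connection;
    # "timeout" bounds the wait if packets are silently dropped.
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
        echo "${host}:${port} reachable"
    else
        echo "${host}:${port} NOT reachable"
        return 1
    fi
}

# Check every peer given on the command line, e.g.
#   sh check_port.sh 192.168.1.100 192.168.1.101
for peer in "$@"; do
    check_tcp_port "$peer" || true
done
```

If the port turns out not to be reachable while ICMP works, something between the hosts (for example a host firewall) may be filtering TCP; that is a possibility to rule out, not a diagnosis.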