> On Jul 3, 2015, at 9:22 AM, Sean Caron <firstname.lastname@example.org> wrote:
> Regular TCP connectivity between the MGS and OSS machines appears to work fine; you can ping; traceroute; SSH and so forth between the MGS server and the OSS servers, no problem. The machines are all on the same L2 broadcast domain and there's no software firewall (i.e. iptables) running on any of the MGS or OSS machines. SELinux has been completely disabled and all the machines have been rebooted after that change was made …
Is there any chance that there is some configuration on the switch that could be causing problems? The main thing is to make sure there is nothing that would be blocking port 988.
> I haven't yet looked at things with TCPdump but only because I suspect it's more a problem of configuration or I'm missing a module or something ... I don't think really any Lustre traffic is actually hitting the network ... but I can go check. I was hoping someone would just recognize it as a silly error and come back to me and say, oh, you just need to load this module, or you're missing this in your LNET configuration :O
Based on what I have seen, I think your config looks pretty good (but it’s hard to tell for sure). The symptoms you describe are ones I have seen, and in all cases I can think of, the problem was due to some sort of network connectivity issue (firewalls, hardware problems, etc.). Running tcpdump would be useful to see if any packets are getting through. I would have tcpdump listening on the MDS nodes on port 988, and then try running “lctl ping <mds>” from the OSS node. If tcpdump doesn’t see any traffic, then somehow the Lustre requests are not making it through the network.
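As a quick complement to tcpdump, something like the following could confirm whether a plain TCP connection to port 988 succeeds from the OSS side. This is only a rough sketch: the function name and the command-line usage are my own invention, and it tests nothing beyond the TCP handshake, so a success here says nothing about whether LNET itself is configured correctly.

```python
import socket

LNET_PORT = 988  # the port discussed in this thread


def tcp_port_open(host, port=LNET_PORT, timeout=3.0):
    """Return True if a TCP handshake to host:port completes within timeout.

    Hypothetical helper for illustration; it does not speak the LNET protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    import sys

    # Usage (assumed): python check_port.py <mds-host> [<other-host> ...]
    for host in sys.argv[1:]:
        state = "reachable" if tcp_port_open(host) else "NOT reachable"
        print(f"{host}:{LNET_PORT} {state}")
```

If this reports the port as not reachable while SSH between the same two machines works, that would point toward something between the hosts filtering port 988 specifically.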
Senior HPC System Administrator
National Institute for Computational Sciences