Okay, so I have a bit more on this, though still no solution.  It seems the heavy load on the MDS doesn't hit until the logs are renamed.  At that point the load on the MDS skyrockets, lasts for a few minutes, and then things settle down.  I haven't run strace yet, but I suspect it's the fsync operation that triggers everything to go badly.
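When I do get around to strace, the plan is something along these lines against one of the writer processes on a client (the PID is a placeholder for whichever writer you attach to):

```shell
# Attach to a writer shortly before its log hits 50MB.  -T shows time
# spent inside each syscall, so a slow fsync or rename should stand out.
strace -f -T -e trace=fsync,fdatasync,rename -p <writer-pid>
```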

- Chuck

On Fri, Jul 5, 2013 at 1:35 PM, Chuck Atkins <chuck.atkins@kitware.com> wrote:
I've got a small lustre deployment with the following configuration:

1 x MDS (2 x 6-core), 1 x MDT (6 x 15K SAS disks in R10)
2 x OSS (2 x 6-core), each with 1 x OST (6 x 10K SAS disks in R6)
14 Clients
All Lustre operations over QDR IB.

OS: RHEL 6.3, kernel-2.6.32-279.14.1
Lustre: 2.1.4 (clients and servers) via the pre-built whamcloud packages

Now here's my issue:

I've got 42 processes distributed across 12 of the clients, all reading the same input data (a never-ending stream of 16-30MB files) from the Lustre mount.  I've also got 76 processes across all of the clients whose only filesystem I/O is writing their stdout to log files on the same Lustre mount point.  When the logs reach a certain size (50MB), they rotate.  The problem is that when the logs rotate, the processes hang for about 2-3 minutes and then resume normal operation.  If we turn off log rolling and let the logs grow indefinitely on the Lustre mount, nothing hangs.  Likewise, if the logs are written to a local disk on each client, nothing hangs at rollover time.
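For what it's worth, each writer's I/O pattern boils down to something like this little sketch (the path and rotation threshold here are scaled-down stand-ins; the real logs live on the Lustre mount and roll at 50MB):

```shell
#!/bin/sh
# Minimal stand-in for one writer process.
LOG=/tmp/demo.log   # point this at the Lustre mount to reproduce
LIMIT=1024          # rotation threshold; the real setup uses 50MB

: > "$LOG"
i=0
while [ "$i" -lt 300 ]; do
  i=$((i + 1))
  echo "log line $i" >> "$LOG"
  sync "$LOG"                      # stand-in for the per-entry fsync
  size=$(stat -c %s "$LOG")        # the size poll that happens on every entry
  if [ "$size" -ge "$LIMIT" ]; then
    mv "$LOG" "$LOG.1"             # the rename that coincides with the hang
    : > "$LOG"
  fi
done
```

On local disk this is harmless; run against the Lustre mount with 76 concurrent copies, it stats and fsyncs through the MDS on every single entry.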

I was also surprised to see a very heavy load on the MDS, distributed roughly equally between the mdt_N and kiblnd_sd_N processes.  This load seems to be constant at all times, before and after rollover, with or without the rename operation from the rotation.  I can only assume it comes from the constant polling for file size across 76 processes every time an entry is written to a file (several times a second).  Checking the debug logs with lctl on the MDS, it seems to be flooded with entries like:

00000004:00080000:2.0:1373044044.269379:0:8952:0:(mdt_open.c:590:mdt_mfd_set_mode()) Change mfd ffff88082b797640 mode 0x0->0x1
00000004:00080000:2.0:1373044044.284062:0:8952:0:(mdt_open.c:590:mdt_mfd_set_mode()) Change mfd ffff8808298329c0 mode 0x0->0x1
00000004:00080000:2.0:1373044044.299441:0:8952:0:(mdt_open.c:590:mdt_mfd_set_mode()) Change mfd ffff88082b797640 mode 0x0->0x1
00000004:00080000:2.0:1373044044.303248:0:8952:0:(mdt_open.c:590:mdt_mfd_set_mode()) Change mfd ffff8808230b1340 mode 0x0->0x1
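For reference, this is roughly how I'm pulling the debug log on the MDS (the dump path is just a scratch location):

```shell
# On the MDS: clear the debug buffer, widen the mask, reproduce a
# rotation from a client, then dump the buffer and count the entries.
lctl clear
lctl set_param debug=+rpctrace
# ...trigger a log rotation on a client...
lctl debug_kernel /tmp/mds-debug.txt
grep -c mdt_mfd_set_mode /tmp/mds-debug.txt
```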

Obviously this isn't the best use case for Lustre, but I do need it for the input data.  Is there anything I can set or change so the MDS copes better with this particular load?  If nothing else, at least to keep it from hanging during the file renames at log rotation?  Perhaps a way to give priority to certain types of MDS operations?
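In case it helps anyone suggest something, these are the sort of client-side knobs I was thinking of experimenting with (the values are guesses on my part, not recommendations):

```shell
# Allow more concurrent metadata RPCs per client (default is typically 8):
lctl set_param mdc.*.max_rpcs_in_flight=32
# Statahead can add getattr traffic to the MDS; 0 disables it entirely:
lctl set_param llite.*.statahead_max=0
```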

Thanks for the help.

- Chuck