Lustre and kernel buffer interaction
by John Bauer
I have been trying to understand a behavior I am observing in an IOR
benchmark on Lustre. I have pared it down to a simple example.
The IOR benchmark is running in MPI mode. There are 2 ranks, each
running on its own node. (Note: the test was run on the "swan" cluster
at Cray Inc., using /lus/scratch.) Each rank does the following
(a rough shell approximation is sketched after the list):
write a file (10 GB)
fsync the file
close the file
MPI_Barrier
open the file that was written by the other rank
read the file that was written by the other rank
close the file that was written by the other rank
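For anyone who wants to reproduce this without IOR, here is a rough non-MPI
shell approximation of the same sequence (no real barrier: run the write
phase on both nodes and let it finish everywhere before starting the read
phase; the directory under /lus/scratch and the node labels A/B are
placeholders):
# Phase 1 -- on node A (node B does the same with file_A/file_B swapped):
DIR=/lus/scratch/$USER/cache_test
mkdir -p $DIR
dd if=/dev/zero of=$DIR/file_A bs=1M count=10240 conv=fsync   # 10 GB write + fsync
# ... wait until both nodes have finished phase 1 ...
# Phase 2 -- on node A: read the file node B wrote, while watching the
# local page cache (the Cached: line) drain.
( while sleep 1; do grep ^Cached: /proc/meminfo; done ) &
WATCHER=$!
dd if=$DIR/file_B of=/dev/null bs=1M
kill $WATCHER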
The writing of each file goes as expected.
The fsync takes very little time (about 0.05 seconds).
The first reads of the file (written by the other rank) start out *very*
slowly. While these first reads are proceeding slowly, the kernel's cached
memory (the Cached: line in /proc/meminfo) decreases from the size of the
file just written to nearly zero.
Once the cached memory has reached nearly zero, the file reading
proceeds as expected.
I have attached a jpg of the instrumentation of the processes that
illustrates this behavior.
My questions are:
Why does the reading of the file written by the other rank wait until
the cached data drains to nearly zero before proceeding normally?
Shouldn't the fsync ensure that the file's data is written to the
backing storage, so that draining the cached memory is simply a matter
of releasing pages, with no further I/O?
For this case the "dead" time is only about 4 seconds, but this "dead"
time scales directly with the size of the files.
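One way to separate the two effects would be to drop the clean page cache
explicitly between the write and read phases and time that step; if the
fsync really left only clean pages, the drop should be quick and the
subsequent read should start at full speed. A sketch (requires root on the
client; untested here):
# after the write + fsync, before reading the other rank's file:
sync                                             # should be a near no-op after fsync
time sh -c 'echo 1 > /proc/sys/vm/drop_caches'   # drops only clean page cache
grep ^Cached: /proc/meminfo                      # should now be close to zero
dd if=/lus/scratch/$USER/cache_test/file_B of=/dev/null bs=1M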
John
--
John Bauer
I/O Doctors LLC
507-766-0378
bauerj(a)iodoctors.com
7 years, 2 months
quotas on 2.4.3
by Matt Bettinger
Hello,
We have a fresh Lustre 2.4.3 upgrade, not yet put into production,
running on RHEL 6.4.
We would like to take a look at quotas, but it looks like there are some
major performance problems with 1.8.9 clients.
Here is how I enabled quotas:
[root@lfs-mds-0-0 ~]# lctl conf_param lustre2.quota.mdt=ug
[root@lfs-mds-0-0 ~]# lctl conf_param lustre2.quota.ost=ug
[root@lfs-mds-0-0 ~]# lctl get_param osd-*.*.quota_slave.info
osd-ldiskfs.lustre2-MDT0000.quota_slave.info=
target name: lustre2-MDT0000
pool ID: 0
type: md
quota enabled: ug
conn to master: setup
space acct: ug
user uptodate: glb[1],slv[1],reint[0]
group uptodate: glb[1],slv[1],reint[0]
The quotas seem to be working; however, the write performance from a
1.8.9-wc client to 2.4.3 with quotas on is horrific. Am I not setting
quotas up correctly?
I try to set a simple user quota on the /lustre2/mattb/300MB_QUOTA directory:
[root@hous0036 mattb]# lfs setquota -u l0363734 -b 307200 -B 309200 -i
10000 -I 11000 /lustre2/mattb/300MB_QUOTA/
See that the quota change is in effect:
[root@hous0036 mattb]# lfs quota -u l0363734 /lustre2/mattb/300MB_QUOTA/
Disk quotas for user l0363734 (uid 1378):
Filesystem kbytes quota limit grace files quota limit grace
/lustre2/mattb/300MB_QUOTA/
310292* 307200 309200 - 4 10000 11000 -
Try to write to the quota directory as the user, but get horrible write speed:
[l0363734@hous0036 300MB_QUOTA]$ dd if=/dev/zero of=301MB_FILE bs=1M count=301
301+0 records in
301+0 records out
315621376 bytes (316 MB) copied, 61.7426 seconds, 5.1 MB/s
Try file number 2, and then the quota takes effect, it seems:
[l0363734@hous0036 300MB_QUOTA]$ dd if=/dev/zero of=301MB_FILE2 bs=1M count=301
dd: writing `301MB_FILE2': Disk quota exceeded
dd: closing output file `301MB_FILE2': Input/output error
If I disable quotas using
[root@lfs-mds-0-0 ~]# lctl conf_param lustre2.quota.mdt=none
[root@lfs-mds-0-0 ~]# lctl conf_param lustre2.quota.oss=none
Then, when I try to write the same file, the speeds are more like what we
expect, but then we can't use quotas:
[l0363734@hous0036 300MB_QUOTA]$ dd if=/dev/zero of=301MB_FILE2 bs=1M count=301
301+0 records in
301+0 records out
315621376 bytes (316 MB) copied, 0.965009 seconds, 327 MB/s
I have not tried this with a 2.4 client yet, since all of our nodes
are 1.8.x until we rebuild our images.
I was going by the manual on
http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact...
but it looks like I am running into an interoperability issue (which I
thought I had fixed by using the 1.8.9-wc client) or am just not
configuring this correctly.
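For reference, the corresponding check on the OSS side, to confirm that each
OST has enforcement enabled and has finished quota reintegration (a sketch;
I have only run the MDT variant shown above, and the OSS hostname is a
placeholder):
[root@lfs-oss-0-0 ~]# lctl get_param osd-*.*.quota_slave.info
# expect for each OST something like:
#   quota enabled: ug
#   conn to master: setup
#   space acct: ug
#   user uptodate: glb[1],slv[1],reint[0]
# reint[1] here would suggest the slave is still reintegrating its quota data.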
Thanks!
MB
7 years, 5 months
New liblustreapi ?
by Simmons, James A.
Now that Lustre 2.7 is coming up soon, I would like to open a discussion
on one of the directions we could go. Recently several projects have sprung
up that impact liblustreapi, and during one of those discussions the idea of
a new liblustreapi was brought up: a liblustreapi 2.0, you could say. So I
would like to get a feel for where the community stands on this. If people
want this proposal, I would recommend that we gradually build the new
library alongside the original liblustreapi and link it to the Lustre
utilities as needed. First, I would like to discuss using the LGPL license
for this new library. I look forward to the feedback.
7 years, 6 months
automounting lustre with RHEL6 autofs
by Adesanya, Adeyemi
Hi.
I would like to revisit the possibility of automounting Lustre. Is there a reliable, scalable automount option for Lustre RHEL6 clients? The man page for 'autofs' states that type 'lustre' is unsupported.
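To be concrete about the kind of setup I mean, the sort of map entry in
question would look roughly like this (untested; -fstype= simply passes the
filesystem type through to mount, and the MGS NIDs, fsname and mount point
below are placeholders):
# /etc/auto.master
/lustre  /etc/auto.lustre  --timeout=600
# /etc/auto.lustre
scratch  -fstype=lustre,flock  mgs1@tcp0:mgs2@tcp0:/scratch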
-------
Yemi
7 years, 7 months
RHEL6.6 client compatibility
by Adesanya, Adeyemi
Hi.
When can we expect official support for RHEL6.6 clients? I just built the 2.5.1 client RPMs for the 2.6.32-504.el6.x86_64 kernel. Appears to work but I'd like the developers to confirm.
-------
Yemi
7 years, 7 months
How to create multiple oi.16
by DEGREMONT Aurelien
Hello
I have a now quite old Lustre filesystem, formatted with Lustre 2.1 and
since upgraded to 2.5.3.
It has only one OI index file: oi.16
This file is very big:
-rw-r--r-- 1 root root 82G Feb 10 2012 oi.16
The filesystem currently has 90 M files.
- I'm pretty sure this file should not be so big given the number
of inodes we have. Is there a way to fix this?
- Is there a way to upgrade the filesystem to several OI files, the way
Lustre creates them now for a recently formatted filesystem?
Thanks
Aurélien
7 years, 7 months
Re: [HPDD-discuss] Anyone using 4MB RPCs
by Mandar Joshi
We have also seen significant performance improvements with 4MB RPCs (with
Lustre 2.6.53) and with increasing max_rpcs_in_flight.
> However, more importantly, you should have end-to-end 4MB IO from the
> client to disk. I mean clients send 4MB RPCs to the server, but the OSS
> also needs to pass an efficient IO size to the OSTs. I believe you are
> missing this part.
I have a question here.
Our OST is built on an md/raid6 device. Even though I can see in brw_stats
that an RPC request is 4MB, it is getting fragmented into 1MB requests; my
md/raid6 is receiving 1MB writes.
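A rough way to see where the split happens, assuming the usual proc/sysfs
locations (the md device name is a placeholder):
# on the OSS: distribution of I/O sizes actually submitted to disk
lctl get_param obdfilter.*.brw_stats | grep -A 12 "disk I/O size"
# block-layer limits of the md device backing the OST; requests larger
# than max_sectors_kb are split by the block layer
cat /sys/block/md0/queue/max_sectors_kb /sys/block/md0/queue/max_hw_sectors_kb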
I was looking at the following from https://jira.hpdd.intel.com/browse/LU-2598:
<snip>
The http://review.whamcloud.com/4993 patch looks like it will resolve this
problem in osd-ldiskfs/osd-io.c:
- bio = bio_alloc(GFP_NOIO, max(BIO_MAX_PAGES,
+ bio = bio_alloc(GFP_NOIO, min(BIO_MAX_PAGES,
</snip>
BIO_MAX_PAGES is defined as 256 (even in the latest vanilla kernel).
With the above change (max -> min), a bio will never be allocated with more
than 256 pages, i.e. 256 x 4 KB = 1MB? Am I understanding that correctly?
Is there any other kernel patch or Lustre patch which changes BIO_MAX_PAGES
or otherwise rectifies this issue?
Regarding another reply:
> 1. Clients are often CPU bound in the CLIO layers, which I think would be
> unaffected by a larger RPC size.
> 2. Clients do not package IOs anywhere near as well, resulting in a larger
> number of smaller IOs. (Even in a test where we do only 1 MB IO requests,
> on 2.x we would see a large number of RPCs of (much) < 1 MB from the
> client, whereas in 1.8 we saw almost exclusively 1 MB RPCs. For some
> tests, we'd see as many as 10 times as many total RPCs on 2.x vs 1.8.)
> Since the client isn't doing a good job of filling 1 MB RPCs, I don't
> think it would fill 4 MB RPCs.
> In contrast, 2.6 is much more like 1.8. CPU usage is down, and IOs are
> packaged much better. Our IO statistics for 2.6 look much more like 1.8
> than for earlier 2.x.
I am using Lustre server 2.6.53 and Lustre client 2.5.3.
Does Lustre client 2.5.3 have the above-mentioned issues? Do I need to use
the Lustre 2.6.53 client to see good performance results?
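For reference, the client-side settings discussed in the quoted messages
below, plus a quick way to check what RPC sizes the client actually
generates (the values are only examples, not a recommendation):
# on the client: allow 4MB RPCs and check the related settings
lctl set_param osc.*.max_pages_per_rpc=1024
lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb
# distribution of pages per RPC actually sent (1024 pages = 4MB)
lctl get_param osc.*.rpc_stats | grep -A 20 "pages per rpc"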
--
Mandar Joshi
From: Shuichi Ihara <sihara(a)ddn.com>
To: "Simmons, James A." <simmonsja(a)ornl.gov>
Cc: "hpdd-discuss(a)lists.01.org" <hpdd-discuss(a)lists.01.org>
Subject: Re: [HPDD-discuss] Anyone using 4MB RPCs
Yes, we got significant performance improvements with 4MB RPCs. Not only
peak performance, but also high sustained performance, even with a lot of
concurrent access to OSTs. max_dirty_mb is one of the important parameters
for 4MB RPCs, but it is now automatically set to a suitable value from
max_pages_per_rpc and max_rpcs_in_flight (see LU-4933). More importantly,
however, you should have end-to-end 4MB IO from the client to disk. I mean
clients send 4MB RPCs to the server, but the OSS also needs to pass an
efficient IO size to the OSTs. I believe you are missing this part.
Thanks
Ihara
> On Oct 23, 2014, at 10:16 AM, Simmons, James A. <simmonsja(a)ornl.gov> wrote:
>
> So recently we have moved our systems from 1.8 to 2.5 clients and have
> lost some of the performance we had before, which is expected. So I
> thought we could try using 4MB RPCs instead of the default 1MB RPC
> packet. I set max_pages_per_rpc to 1024 and looked at the value of
> max_dirty_mb, which was 32, and max_rpcs_in_flight, which is 8. By
> default a dirty cache of 32MB should be enough in this case. So I
> tested it and saw no performance improvements. After that I boosted
> max_dirty_mb to 64 and still saw no improvement over the default settings.
> Has anyone seen this before? What could I be missing?
> _______________________________________________
> HPDD-discuss mailing list
> HPDD-discuss(a)lists.01.org
> https://lists.01.org/mailman/listinfo/hpdd-discuss
7 years, 8 months
logstash patterns for Lustre syslog
by Michael Kluge
Hi list,
does anyone have a set of logstash/grok patterns for Lustre logs that
you can/want to share?
Regards, Michael
--
Dr.-Ing. Michael Kluge
Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany
Contact:
Willersbau, Room A 208
Phone: (+49) 351 463-34217
Fax: (+49) 351 463-37773
e-mail: michael.kluge(a)tu-dresden.de
WWW: http://www.tu-dresden.de/zih
7 years, 8 months