On 2013/14/04 5:46 PM, "Erich Focht" <efocht(a)gmail.com> wrote:
checksums are off for the tests. I'll have a look at LU-744 and will try
to check the multithreaded single-client performance. The single-threaded
stream performance difference (1.8.8 vs. 2.3.63 client) is still an
important use case. Here is concrete data, and I wonder whether it's
something you'd rather see as a Jira ticket...?
1.8.8 client (on 2.1.5 servers):
# dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/test256gbx bs=4M count=$((256/4*1000))
268435456000 bytes (268 GB) copied, 349.921 s, 767 MB/s
2.3.63 client (on 2.1.5 servers):
# dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/test256gby bs=4M count=$((256/4*1000))
268435456000 bytes (268 GB) copied, 689.333 s, 389 MB/s
For 2.3.63 I can see at the beginning (with collectl -sL on the client)
that it starts at about 680MB/s; after a little while (~20s) bandwidth drops
to around 380MB/s and stays there. I tried increasing/decreasing
osc.*.max_dirty_mb and changing osc.*.max_rpcs_in_flight, but those didn't
change the behavior.
Could you please try the patches from LU-2139? The "slowdown after some
time" issue sounds like LU-2139, which relates to dirty pages filling up
the client cache while their IO is being committed to disk on the server:
http://review.whamcloud.com/5692
http://review.whamcloud.com/4245
http://review.whamcloud.com/4374
http://review.whamcloud.com/4375
http://review.whamcloud.com/5935
Note that 5692 is a style change that was abandoned, but it is needed by the
current patches.
The middle three patches (4245, 4374, 4375) are client-side only and may
provide some benefit by themselves. It would make sense to test just those
first, which is easier since it does not need a server update.
The 5935 patch is a server-side change (interacting with 4375 on the client)
that forces server commits if the client has memory pressure. It improves
single-client behaviour, but there is some concern about whether it might
impact aggregate performance under some workloads. Getting performance
results for these patches separately will help us decide whether they need
to be included in 2.4.0 or not.
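(In case it saves time, the changes can be pulled straight from Gerrit onto
a client build tree, roughly as below. The project path and the "/1" patch
set number are guesses on my side, so take the exact refspec from the
download box on each change page; the same pattern applies for 4245, 4374,
and 4375.)

# cd lustre-release
# git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/92/5692/1
# git cherry-pick FETCH_HEAD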
I tried oflag=direct but that's significantly slower (by about a factor
of 2). Also, increasing bs up to 256MB doesn't actually help, with or
without direct IO.
Hmm. Not totally surprising, but worthwhile to test. Thanks for doing so.
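(If you want a quick check of whether the limit is per-stream, a few dd
processes writing disjoint ranges of the same striped file should roughly
approximate the multi-threaded case from LU-744. The file name and
per-stream sizes below are only illustrative; conv=notrunc keeps the
streams from truncating each other.)

# F=/mnt/lnec/pool4/stripe_4_4096/test_multi
# for i in 0 1 2 3; do
#     dd if=/dev/zero of=$F bs=4M count=16000 seek=$((i*16000)) conv=notrunc &
# done; wait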
In the data below there's a significant difference between the "offset"
outputs from the client rpc_stats. I'm not sure what offset means, but if
it's related to the next sequential location of consecutive transactions,
then maybe this is a reason for the performance difference. In 1.8.8 I see
a flush-lustre-1 process in top, but it's gone in 2.3.63. Was that one
taking care of flushing dirty pages in an ordered way?
This is getting into the realm of needing a separate bug report. For
well-formed RPCs the offset should always be zero. It may well be that the
LU-2139 changes would fix this as well, because the client wouldn't need to
submit partial RPCs due to memory pressure.
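(When you file it, a fresh rpc_stats sample covering just the slow run would
help. Something along these lines should do it; I believe writing 0 resets
the counters, but please double-check on your version:)

# lctl set_param osc.*.rpc_stats=0
# dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/test256gbz bs=4M count=$((256/4*1000))
# lctl get_param osc.*.rpc_stats > /tmp/rpc_stats.2.3.63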
Cheers, Andreas
2.3.63 client rpc_stats
(typical OST rpcs in flight stats from rpc_stats)
read write
rpcs in flight rpcs % cum % | rpcs % cum %
0: 0 0 0 | 0 0 0
1: 0 0 0 | 16030 25 25
2: 0 0 0 | 16516 25 50
3: 0 0 0 | 23525 36 87
4: 0 0 0 | 7909 12 99
5: 0 0 0 | 7 0 99
6: 0 0 0 | 5 0 99
7: 0 0 0 | 4 0 99
8: 0 0 0 | 4 0 100
read write
offset rpcs % cum % | rpcs % cum %
0: 0 0 0 | 1 0 0
1: 0 0 0 | 0 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 0 0 0
64: 0 0 0 | 0 0 0
128: 0 0 0 | 0 0 0
256: 0 0 0 | 1 0 0
512: 0 0 0 | 2 0 0
1024: 0 0 0 | 4 0 0
2048: 0 0 0 | 8 0 0
4096: 0 0 0 | 16 0 0
8192: 0 0 0 | 32 0 0
16384: 0 0 0 | 64 0 0
32768: 0 0 0 | 128 0 0
65536: 0 0 0 | 256 0 0
131072: 0 0 0 | 512 0 1
262144: 0 0 0 | 1024 1 3
524288: 0 0 0 | 2048 3 6
1048576: 0 0 0 | 4096 6 12
2097152: 0 0 0 | 8192 12 25
4194304: 0 0 0 | 16384 25 51
8388608: 0 0 0 | 31232 48 100
2.3.63 client perf top (entries down to 1%):
5.03% [kernel] [k] copy_user_generic_string
3.79% [obdclass] [k] key_fini
3.78% [kernel] [k] _spin_lock
2.43% [obdclass] [k] keys_fill
2.13% [kernel] [k] kmem_cache_alloc
1.99% [kernel] [k] __clear_user
1.92% [kernel] [k] kmem_cache_free
1.78% [lvfs] [k] lprocfs_counter_add
1.48% [kernel] [k] kfree
1.34% [kernel] [k] __kmalloc
1.33% [kernel] [k] radix_tree_delete
1.31% [kernel] [k] _spin_lock_irqsave
1.28% [obdclass] [k] cl_page_put
1.15% [kernel] [k] radix_tree_insert
1.13% [lvfs] [k] lprocfs_counter_sub
1.04% [osc] [k] osc_lru_del
1.02% [obdclass] [k] cl_page_find0
1.01% [obdclass] [k] lu_context_key_get
1.8.8 client rpc_stats (part):
(typical OST rpcs in flight for the full run)
read write
rpcs in flight rpcs % cum % | rpcs % cum %
0: 0 0 0 | 11594 18 18
1: 0 0 0 | 13130 20 38
2: 0 0 0 | 16592 25 64
3: 0 0 0 | 16543 25 90
4: 0 0 0 | 3762 5 96
5: 0 0 0 | 1598 2 98
6: 0 0 0 | 338 0 99
7: 0 0 0 | 401 0 99
8: 0 0 0 | 45 0 99
9: 0 0 0 | 1 0 100
read write
offset rpcs % cum % | rpcs % cum %
0: 0 0 0 | 64000 99 99
1: 0 0 0 | 0 0 99
2: 0 0 0 | 0 0 99
4: 0 0 0 | 0 0 99
8: 0 0 0 | 0 0 99
16: 0 0 0 | 0 0 99
32: 0 0 0 | 0 0 99
64: 0 0 0 | 0 0 99
128: 0 0 0 | 4 0 100
1.8.8 client perf top:
2151.00 9.3% ll_ra_read_init [lustre]
1781.00 7.7% copy_user_generic_string [kernel.kallsyms]
1734.00 7.5% _spin_lock [kernel.kallsyms]
1300.00 5.6% lov_putref [lov]
781.00 3.4% radix_tree_tag_clear [kernel.kallsyms]
720.00 3.1% _spin_lock_irqsave [kernel.kallsyms]
667.00 2.9% intel_idle [kernel.kallsyms]
607.00 2.6% osc_destroy [osc]
554.00 2.4% __clear_user [kernel.kallsyms]
551.00 2.4% refresh_entry [lvfs]
451.00 2.0% kiblnd_ni_fini_pools [ko2iblnd]
399.00 1.7% radix_tree_tag_set [kernel.kallsyms]
356.00 1.5% __percpu_counter_add [kernel.kallsyms]
352.00 1.5% lov_queue_group_io [lov]
340.00 1.5% mark_page_accessed [kernel.kallsyms]
338.00 1.5% _spin_lock_irq [kernel.kallsyms]
307.00 1.3% test_clear_page_writeback [kernel.kallsyms]
306.00 1.3% lov_stripe_unlock [lov]
302.00 1.3% _spin_unlock_irqrestore [kernel.kallsyms]
280.00 1.2% osc_update_grant [osc]
272.00 1.2% __dec_zone_state [kernel.kallsyms]
221.00 1.0% lop_makes_rpc [osc]
On 04/13/2013 06:05 AM, Dilger, Andreas wrote:
> On 2013/12/04 11:22 AM, "Erich Focht" <efocht(a)gmail.com> wrote:
>> tried the tag 2.3.63 client on top of 2.1.5 servers. Feels faster than
>> the 2.1.5 client, but still slower than 1.8.8.
>>
>> # striped over 4 OSTs
>>
>> # dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/test256g__ bs=4M count=$((256/4*1000))
>> 268435456000 bytes (268 GB) copied, 562.538 s, 477 MB/s
>>
>> # striped over 32 OSTs
>> # dd if=/dev/zero of=/mnt/lnec/stripe_all/test256g bs=4M count=$((256/4*1000))
>> 268435456000 bytes (268 GB) copied, 620.909 s, 432 MB/s
> Hmm, "feels faster" but comparing it to the below 2.1.5 numbers it looks
> slower?
> According to LU-744, running multiple threads on the client for IO can
>get
> close
> to 5GB/s on FDR IB from a single client. Is your IO workload single
> threaded,
> or is it possible to get multiple threads submitting the IO?
>
> Part of the problem is that the CPU overhead of just copying pages from
> userspace is noticeable in an age where single-core speed is not increasing
> noticeably. Have you tried with "oflag=direct" and large block sizes (e.g.
> 32MB or more)? Another option is to disable RPC checksumming
> ("lctl set_param osc.*.checksums=0"), but I'm not sure if this will be a
> significant factor anymore, since the checksums are probably being computed
> on other cores in Lustre 2.3+ using hardware acceleration (depending on CPU
> type, check "lctl get_param osc.*.checksum_type" for the algorithm used).
>
> Cheers, Andreas
>
>> 2013/4/12 Dilger, Andreas <andreas.dilger(a)intel.com>
>> On 2013/11/04 11:50 AM, "Erich Focht" <efocht(a)gmail.com> wrote:
>>> I'm puzzled about the poor scaling in Lustre 2.X (meaning 2.1.4, 2.1.5,
>>> 2.3) when writing one stream from one client to one striped file. With 2
>>> OSSes, each having 4 OSTs capable of 500-600MB/s each (with one write
>>> stream), the performance with 8-fold striping barely exceeds 700MB/s.
>>> From 1.8 I was used to easily exceeding 1GB/s for one dd to an 8-fold
>>> striped file.
>>>
>>>
>>> Here are some numbers:
>>>
>>> stripes: 1 size: 16384kB
>>> 268435456000 bytes (268 GB) copied, 526.797 s, 510 MB/s
>>> stripes: 2 size: 16384kB
>>> 268435456000 bytes (268 GB) copied, 413.182 s, 650 MB/s
>>> stripes: 4 size: 16384kB
>>> 268435456000 bytes (268 GB) copied, 374.03 s, 718 MB/s
>>> stripes: 6 size: 16384kB
>>> 268435456000 bytes (268 GB) copied, 382.277 s, 702 MB/s
>>> stripes: 8 size: 16384kB
>>> 268435456000 bytes (268 GB) copied, 378.835 s, 709 MB/s
>>>
>>> Obtained basically with:
>>> dd if=/dev/zero of=test256g bs=16M count=$((256/16*1000))
>>>
>>>
>>> I used a 1.8.8 Lustre client over FDR IB, no bandwidth limitations there.
>>> With the Lustre 2.1.5 client performance is even worse.
>>>
>>>
>>> Is there anything basically wrong with striping in Lustre 2.X? Is the
>>> degradation understandable, somehow? Or am I missing something? (I tried
>>> various tweaks on the server and client side, checksums are disabled,
>>> etc... but nothing helped).
>> If this is a test system, could you please give the current master git
>> version of Lustre a try on the client? There should be significantly
>> improved single-client IO performance in the upcoming 2.4 release compared
>> to earlier 2.x releases.
>
> Cheers, Andreas
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division