Lustre metadata operations from a single client
by Grégoire Pichon
Hi all,
I have a question related to single client metadata operations.
While running the mdtest benchmark, I have observed that file creation and unlink rates from a single Lustre client quickly saturate at around 8000 IOPS: the maximum is reached with as few as 4 tasks in parallel.
When using several Lustre mount points on a single client node, the file creation and unlink rates do scale with the number of tasks, up to the 16 cores of my client node.
Looking at the code, it appears that most metadata operations are serialized by a mutex in the MDC layer.
In the mdc_reint() routine, request posting is protected by mdc_get_rpc_lock() and mdc_put_rpc_lock(), where the lock is:
struct client_obd -> struct mdc_rpc_lock *cl_rpc_lock -> struct mutex rpcl_mutex.
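For reference, the pattern I am looking at is roughly the following (a simplified sketch of lustre/mdc/mdc_reint.c from the tree I am reading; exact signatures and error handling differ between versions):

static int mdc_reint(struct ptlrpc_request *request,
                     struct mdc_rpc_lock *rpc_lock,
                     int level)
{
        int rc;

        request->rq_send_state = level;

        /* only one modifying metadata RPC in flight per client_obd */
        mdc_get_rpc_lock(rpc_lock, NULL);   /* mutex_lock(&rpc_lock->rpcl_mutex) */
        rc = ptlrpc_queue_wait(request);    /* full RPC round-trip under the mutex */
        mdc_put_rpc_lock(rpc_lock, NULL);   /* mutex_unlock(&rpc_lock->rpcl_mutex) */

        return rc;
}

So, if I read it correctly, each create or unlink holds the per-target rpcl_mutex for the whole RPC round-trip, which would explain why adding tasks beyond a few does not help on a single mount point.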
What is the reason for this serialization?
Is it a current limitation of the MDC layer design? If so, are there plans to improve this behavior?
Thanks,
Grégoire Pichon.
6 years, 9 months
MDS kernel panic
by Pardo Diaz, Alfonso
Hello,
Since I updated my Lustre 2.2 to 2.5.1 (CentOS 6.5) and copied the MDT to a new SSD disk, I have been getting random kernel panics in the MDS (both HA pairs). On the last kernel panic I got this log:
<4>Lustre: MGS: non-config logname received: params
<3>LustreError: 11-0: cetafs-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: cetafs-MDT0000: Will be in recovery for at least 5:00, or until 102 clients reconnect
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 5 previous similar messages
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 9 previous similar messages
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 2 previous similar messages
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 23 previous similar messages
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 8 previous similar messages
<3>LustreError: 3461:0:(ldlm_lib.c:1751:check_for_next_transno()) cetafs-MDT0000: waking for gap in transno, VBR is OFF (skip: 17188113481, ql: 1, comp: 101, conn: 102, next: 17188113493, last_committed: 17188113480)
<6>Lustre: cetafs-MDT0000: Recovery over after 1:13, of 102 clients 102 recovered and 0 were evicted.
<1>BUG: unable to handle kernel NULL pointer dereference at (null)
<1>IP: [<ffffffffa0c3b6a0>] __iam_path_lookup+0x70/0x1f0 [osd_ldiskfs]
<4>PGD 106c0bf067 PUD 106c0be067 PMD 0
<4>Oops: 0002 [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 0
<4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ipmi_devintf cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_multipath microcode iTCO_wdt iTCO_vendor_support sb_edac edac_core lpc_ich mfd_core i2c_i801 igb i2c_algo_bit i2c_core ptp pps_core ioatdma dca mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core sg ext4 jbd2 mbcache sd_mod crc_t10dif ahci isci libsas mpt2sas scsi_transport_sas raid_class megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 3362, comm: mdt00_001 Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 Bull SAS bullx/X9DRH-7TF/7F/iTF/iF
<4>RIP: 0010:[<ffffffffa0c3b6a0>] [<ffffffffa0c3b6a0>] __iam_path_lookup+0x70/0x1f0 [osd_ldiskfs]
<4>RSP: 0018:ffff88085e2754b0 EFLAGS: 00010246
<4>RAX: 00000000fffffffb RBX: ffff88085e275600 RCX: 000000000009c93c
<4>RDX: 0000000000000000 RSI: 000000000009c93b RDI: ffff88106bcc32f0
<4>RBP: ffff88085e275500 R08: 0000000000000000 R09: 00000000ffffffff
<4>R10: 0000000000000000 R11: 0000000000000000 R12: ffff88085e2755c8
<4>R13: 0000000000005250 R14: ffff8810569bf308 R15: 0000000000000001
<4>FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000000000000000 CR3: 000000106dd9b000 CR4: 00000000000407f0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process mdt00_001 (pid: 3362, threadinfo ffff88085e274000, task ffff88085f55c080)
<4>Stack:
<4> 0000000000000000 ffff88085e2755d8 ffff8810569bf288 ffffffffa00fd2c4
<4><d> ffff88085e275660 ffff88085e2755c8 ffff88085e2756c8 0000000000000000
<4><d> 0000000000000000 ffff88085db2a480 ffff88085e275530 ffffffffa0c3ba6c
<4>Call Trace:
<4> [<ffffffffa00fd2c4>] ? do_get_write_access+0x3b4/0x520 [jbd2]
<4> [<ffffffffa0c3ba6c>] iam_lookup_lock+0x7c/0xb0 [osd_ldiskfs]
<4> [<ffffffffa0c3bad4>] __iam_it_get+0x34/0x160 [osd_ldiskfs]
<4> [<ffffffffa0c3be1e>] iam_it_get+0x2e/0x150 [osd_ldiskfs]
<4> [<ffffffffa0c3bf4e>] iam_it_get_exact+0xe/0x30 [osd_ldiskfs]
<4> [<ffffffffa0c3d47f>] iam_insert+0x4f/0xb0 [osd_ldiskfs]
<4> [<ffffffffa0c366ea>] osd_oi_iam_refresh+0x18a/0x330 [osd_ldiskfs]
<4> [<ffffffffa0c3ea40>] ? iam_lfix_ipd_alloc+0x0/0x20 [osd_ldiskfs]
<4> [<ffffffffa0c386dd>] osd_oi_insert+0x11d/0x480 [osd_ldiskfs]
<4> [<ffffffff811ae522>] ? generic_setxattr+0xa2/0xb0
<4> [<ffffffffa0c25021>] ? osd_ea_fid_set+0xf1/0x410 [osd_ldiskfs]
<4> [<ffffffffa0c33595>] osd_object_ea_create+0x5b5/0x700 [osd_ldiskfs]
<4> [<ffffffffa0e173bf>] lod_object_create+0x13f/0x260 [lod]
<4> [<ffffffffa0e756c0>] mdd_object_create_internal+0xa0/0x1c0 [mdd]
<4> [<ffffffffa0e86428>] mdd_create+0xa38/0x1730 [mdd]
<4> [<ffffffffa0c2af37>] ? osd_xattr_get+0x97/0x2e0 [osd_ldiskfs]
<4> [<ffffffffa0e14770>] ? lod_index_lookup+0x0/0x30 [lod]
<4> [<ffffffffa0d50358>] mdo_create+0x18/0x50 [mdt]
<4> [<ffffffffa0d5a64c>] mdt_reint_open+0x13ac/0x21a0 [mdt]
<4> [<ffffffffa065983c>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
<4> [<ffffffffa04f4600>] ? lu_ucred_key_init+0x160/0x1a0 [obdclass]
<4> [<ffffffffa0d431f1>] mdt_reint_rec+0x41/0xe0 [mdt]
<4> [<ffffffffa0d2add3>] mdt_reint_internal+0x4c3/0x780 [mdt]
<4> [<ffffffffa0d2b35d>] mdt_intent_reint+0x1ed/0x520 [mdt]
<4> [<ffffffffa0d26a0e>] mdt_intent_policy+0x3ae/0x770 [mdt]
<4> [<ffffffffa0610511>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
<4> [<ffffffffa0639abf>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
<4> [<ffffffffa0d26ed6>] mdt_enqueue+0x46/0xe0 [mdt]
<4> [<ffffffffa0d2dbca>] mdt_handle_common+0x52a/0x1470 [mdt]
<4> [<ffffffffa0d68545>] mds_regular_handle+0x15/0x20 [mdt]
<4> [<ffffffffa0669a45>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4> [<ffffffffa03824ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4> [<ffffffffa03933df>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
<4> [<ffffffffa06610e9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4> [<ffffffff81054839>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa066adad>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
<4> [<ffffffffa066a2c0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
<4> [<ffffffff8109aee6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>Code: 00 48 8b 5d b8 45 31 ff 0f 1f 00 49 8b 46 30 31 d2 48 89 d9 44 89 ee 48 8b 7d c0 ff 50 20 48 8b 13 66 2e 0f 1f 84 00 00 00 00 00 <f0> 0f ba 2a 19 19 c9 85 c9 74 15 48 8b 0a f7 c1 00 00 00 02 74
<1>RIP [<ffffffffa0c3b6a0>] __iam_path_lookup+0x70/0x1f0 [osd_ldiskfs]
<4> RSP <ffff88085e2754b0>
<4>CR2: 0000000000000000
Any suggestions are welcome.
THANKS!!!
Alfonso Pardo Diaz
System Administrator / Researcher
c/ Sola nº 1; 10200 Trujillo, ESPAÑA
Tel: +34 927 65 93 17 Fax: +34 927 32 32 37
6 years, 10 months
Re: [HPDD-discuss] [Lustre-discuss] Vdev Configuration on Lustre
by Dilger, Andreas
This should be possible by specifying the pool configuration to mkfs.lustre just as you would with zpool:
mkfs.lustre --backfstype zfs {other opts} {poolname/dataset} mirror sde sdf mirror sdg sdh
There is also a bug open to improve the mkfs.lustre(8) man page to describe the ZFS options better: https://jira.hpdd.intel.com/browse/LU-2234
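For example, a complete invocation for a single OST backed by two mirrored vdevs might look like the following (the fsname, index, MGS NID and mount point here are illustrative placeholders, not taken from your setup):

mkfs.lustre --backfstype=zfs --fsname=testfs --ost --index=0 \
    --mgsnode=192.168.1.10@o2ib testpool/ost0 mirror sde sdf mirror sdg sdh
mount -t lustre testpool/ost0 /mnt/lustre/ost0

mkfs.lustre creates the zpool with the given vdev layout and the Lustre dataset inside it, and ZFS then handles the redundancy and striping across those vdevs within the single OST, as you describe below.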
On 2014/06/09, 1:21 PM, "Indivar Nair" <indivar.nair(a)techterra.in> wrote:
Hi All,
In Lustre, is it possible to create a zpool with multiple vdevs -
E.g.:
# zpool create tank mirror sde sdf mirror sdg sdh
# zpool status
pool: tank
state: ONLINE
scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
This will allow us to have a single OST per OSS, with ZFS managing the striping across vdevs.
Regards,
Indivar Nair
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
6 years, 10 months
File still exists after OST delete
by l h
Hi,
After an OST failure (the OST was removed today), I have a folder which does not really exist anymore because it was on this OST, but it still appears in my filesystem (it still exists on the MDT, I think). If I try to remove it, I am told that the object doesn't exist. I can't do anything with it and I don't know how to clean it up!
Can you help me ?
Thanks.
6 years, 10 months
multiple networks and lnet
by Michael Di Domenico
From my understanding of the Lustre manual, I'm not sure what I want to do is possible, so I figured I'd ask.
I have several clusters on different subnets
clusterA eth0 192.168.1.0/24 ib0 192.168.2.0/24
clusterB eth0 192.168.3.0/24 ib0 192.168.4.0/24
...etc...
What I want to do is connect Lustre to a separate network and use only InfiniBand for the Lustre communication:
lustreA eth0 192.168.5.0/24 ib0 192.168.6.0/24
If I add
ib0:1 192.168.2.250/32
ib0:2 192.168.4.250/32
to the lustreA server, I can talk via IPoIB. However, as I understand it (probably incorrectly), lustreA would appear in LNET as
o2ib0(ib0) 192.168.6.250
o2ib1(ib0:1) 192.168.2.250
o2ib2(ib0:2) 192.168.4.250
However, the clients would appear as
clusterA o2ib0(ib0) 192.168.2.250
clusterB o2ib0(ib0) 192.168.4.250
which, as I understand it, creates a conflict in the o2ib networks: o2ib0 on the client will not match o2ib2(ib0:2) on the Lustre server.
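In modprobe terms, I think that corresponds to something like the following (a sketch of what I have in mind, with the networks option syntax taken from my reading of the manual, so I may well have it wrong):

# on the lustreA servers
options lnet networks="o2ib0(ib0),o2ib1(ib0:1),o2ib2(ib0:2)"

# on clusterA clients
options lnet networks="o2ib0(ib0)"

# on clusterB clients
options lnet networks="o2ib0(ib0)"

so, for example, a clusterB client would try to reach 192.168.4.250@o2ib0 while the server actually exports that address as 192.168.4.250@o2ib2.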
Is there a way to accomplish this or a better way overall?
6 years, 10 months
Re: [HPDD-discuss] Number of Cores and Memory
by Ramiro Alba
On 2014-06-02 12:15, E.S. Rosenberg wrote:
> On Mon, Jun 2, 2014 at 1:03 PM, Ramiro Alba <raq(a)cttc.upc.edu> wrote:
>
>> Hi all,
>>
>> We are in the process of migrating our lustre cluster from 1.8.5 to
>> 2.4.X which also involves changing hardware for servers (1 MDS + 2
>> OSS).
>
> Small question, why are you moving to 2.4.x and not to 2.5.x which is
> supposed to be the "long support" release now?
We are in a test period. The upgrade will be done in October/November, so we will move to the maintenance release at that time. I did not know whether 2.5.X would be the one to choose, so I took a conservative approach. That's all.
>
>> Nowadays we have 8 core (2 * 4) servers (all three) with 16 Gb of
>> RAM, and we have the option to substitute them by (1 * 4) core
>> servers or (2 * 6 ) core servers with at least 32 Gb of RAM. The
>> Lustre file system is of 64 TB: 8 TB * 8 OSTs, 90% in use, but an
>> MDT of 750 MB in size
>>
>> Taking into account that we tend to use big files in size, would it
>> be of any help having an MDS with 2 * 6 cores with 64 GB of RAM MDS
>> more than 4 cores (1 * 4) and 32 GB of RAM with Lustre 2.4.X?
>> And what about the OSSs?. Would it be more interesting having 12
>> cores (6 * 2) instead of 4 (4 * 1)?
>>
>> I read the interesting Whamcloud paper 'Lustre OSS and MDS Server
>> Node Requirements', which treats this issue, but I would like to
>> comment to the list as I can not see clearly if in our case it is
>> worth to put about an additional 100% of money in cores and memory
>>
>> Any advice or suggestion will be welcomed
>>
>> Regards
>>
>> --
>> Ramiro Alba
>>
>> Centre Tecnològic de Tranferència de Calor
>> http://www.cttc.upc.edu [1]
>>
>> Escola Tècnica Superior d'Enginyeries
>> Industrial i Aeronàutica de Terrassa
>> Colom 11, E-08222, Terrassa, Barcelona, Spain
>> Tel: (+34) 93 739 8928 [2]
>>
>> --
>> Aquest missatge ha estat analitzat per MailScanner
>> a la cerca de virus i d'altres continguts perillosos,
>> i es considera que està net.
>>
>> _______________________________________________
>> HPDD-discuss mailing list
>> HPDD-discuss(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/hpdd-discuss [3]
>
> --
> Aquest missatge ha estat analitzat per MAILSCANNER [4]
> a la cerca de virus i d'altres continguts perillosos,
> i es considera que está net.
>
> Links:
> ------
> [1] http://www.cttc.upc.edu
> [2] tel:%28%2B34%29%2093%20739%208928
> [3] https://lists.01.org/mailman/listinfo/hpdd-discuss
> [4] http://www.mailscanner.info/
--
Ramiro Alba
Centre Tecnològic de Tranferència de Calor
http://www.cttc.upc.edu
Escola Tècnica Superior d'Enginyeries
Industrial i Aeronàutica de Terrassa
Colom 11, E-08222, Terrassa, Barcelona, Spain
Tel: (+34) 93 739 8928
6 years, 10 months
Number of Cores and Memory
by Ramiro Alba
Hi all,
We are in the process of migrating our lustre cluster from 1.8.5 to
2.4.X which also involves changing hardware for servers (1 MDS + 2 OSS).
Currently we have 8-core (2 * 4) servers (all three) with 16 GB of RAM, and we have the option to replace them with (1 * 4)-core servers or (2 * 6)-core servers with at least 32 GB of RAM. The Lustre file system is 64 TB (8 TB * 8 OSTs), 90% in use, with an MDT of 750 MB in size.
Taking into account that we tend to use big files, would it be of any help to have an MDS with 2 * 6 cores and 64 GB of RAM, rather than 4 cores (1 * 4) and 32 GB of RAM, with Lustre 2.4.X?
And what about the OSSs? Would it be worthwhile to have 12 cores (6 * 2) instead of 4 (4 * 1)?
I read the interesting Whamcloud paper 'Lustre OSS and MDS Server Node Requirements', which addresses this issue, but I would like to ask the list, as I cannot see clearly whether in our case it is worth spending roughly an additional 100% on cores and memory.
Any advice or suggestions will be welcome.
Regards
--
Ramiro Alba
Centre Tecnològic de Tranferència de Calor
http://www.cttc.upc.edu
Escola Tècnica Superior d'Enginyeries
Industrial i Aeronàutica de Terrassa
Colom 11, E-08222, Terrassa, Barcelona, Spain
Tel: (+34) 93 739 8928
6 years, 10 months