With lustre 2.6 so close to release it is time for a newer kernel status
update. In my testing the current master branch of lustre-release will
work on clients running up to 3.10 kernels. Also the needed patches to
update to the newer proc handling have been landed for server side support.
You should be able to build a lustre file system with the servers running a
3.10 kernel using a ZFS back end. Support for ldiskfs is being worked
on but it will be a while before it is ready for production. If you would like
to use a 3.12 kernel for either clients or servers you will need one patch.
As a disclaimer, this work, though completed, is not officially a part of
the Lustre support matrix. So don't use it for a production file system.
Testing is always most welcomed.
I have downloaded the source RPMs from http://downloads.whamcloud.com/public/lustre/lustre-2.1.6/el6/server/SRPMS/
My understanding is that this kernel is not yet patched. Is this correct?
To patch this,
I have set up the following links:
# ls -ld patches/ series
drwxr-xr-x 2 tsadmin tsadmin 4096 Jun 22 2013 patches/
lrwxrwxrwx 1 root root 77 Jun 20 18:25 series -> /root/lustre-2.1.6/lustre-2.1.6/lustre/kernel_patches/series/2.6-rhel6.series
And I ran the following command:
# quilt push -av 2>&1 | tee quilt.log
And, it gets the following error:
Applying patch patches/lustre_version.patch
The next patch would create the file include/linux/lustre_version.h,
which already exists! Applying it anyway.
patching file include/linux/lustre_version.h
Hunk #1 FAILED at 1.
1 out of 1 hunk FAILED -- rejects in file include/linux/lustre_version.h
Patch patches/lustre_version.patch can be reverse-applied
This is really strange. It means that the source RPM on the Whamcloud web site already includes lustre_version.h, which makes it look like it is already patched. But it does not have the other patches!
Can someone please explain this?
Specifying several '--mgsnode=' failover NIDs when formatting an OST does not work in my test
case. Using ':' to separate the NIDs works.
It seems this is the first time I have a failover pair of MDSes with Lustre 2.x.
I formatted with:
MDS:~# mkfs.lustre --mgs --mdt --fsname=testfs --index=0 --servicenode=10.20.0.2@o2ib0
No trouble mounting (the MGS/MDT is on 10.20.0.2).
My test-OSS is a single box, so I did
OSS:~# mkfs.lustre --reformat --ost --backfstype=zfs --fsname=testfs --index=0 --param
ost.quota_type=ug --mgsnode=10.20.0.2@o2ib --mgsnode=10.20.0.3@o2ib lpool-oss/ost1 raidz2 dm-1 dm-2
dm-3 dm-4 ...
Permanent disk data:
Lustre FS: testfs
Mount type: zfs
(OST first_time update )
Persistent mount opts:
Parameters: ost.quota_type=ug mgsnode=10.20.0.2@o2ib mgsnode=10.20.0.3@o2ib
mkfs_cmd = zfs create -o canmount=off -o xattr=sa lpool-oss/ost0
Writing lpool-oss/ost0 properties
This didn't mount, claiming the MGS on 10.20.0.3@o2ib wasn't up - which is correct.
I checked with tunefs.lustre:
OSS:~# tunefs.lustre --dryrun lpool-oss/ost0
checking for existing Lustre data: found
Read previous values:
Lustre FS: testfs
Mount type: zfs
(OST first_time update )
Persistent mount opts:
Parameters: mgsnode=10.20.0.3@o2ib ost.quota_type=ug
So indeed, the information about the first MGS NID (--mgsnode=10.20.0.2@o2ib) was lost somehow.
Changing the failover NID specification to 'colon-syntax' does the trick:
OSS:~# mkfs.lustre --reformat --ost --backfstype=zfs --fsname=testfs --index=0 --param
ost.quota_type=ug --mgsnode=10.20.0.2@o2ib:10.20.0.3@o2ib lpool-oss/ost0 raidz2 dm-1 dm-2 dm-3 ...
The example in the manual suggests that the doubling of '--mgsnode' would still work:
A flaw in the manual, or did I miss something?
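The comparison above can be sketched in a few lines of Python. This is only an illustration, not a Lustre tool: the helper name `mgs_nids` is hypothetical, the two "Parameters:" lines are copied from the mkfs.lustre and tunefs.lustre output quoted in this message, and it assumes a single mgsnode= value may carry several colon-separated NIDs.

```python
def mgs_nids(parameters_line):
    """Return the MGS NIDs listed in a 'Parameters:' line, in order."""
    nids = []
    for token in parameters_line.split():
        if token.startswith("mgsnode="):
            # Assumption: one mgsnode= value may hold several
            # colon-separated failover NIDs.
            nids.extend(token[len("mgsnode="):].split(":"))
    return nids


# Line printed by mkfs.lustre when the OST was formatted.
written = "Parameters: ost.quota_type=ug mgsnode=10.20.0.2@o2ib mgsnode=10.20.0.3@o2ib"
# Line printed afterwards by tunefs.lustre --dryrun.
read_back = "Parameters: mgsnode=10.20.0.3@o2ib ost.quota_type=ug"

missing = [nid for nid in mgs_nids(written) if nid not in mgs_nids(read_back)]
print(missing)  # ['10.20.0.2@o2ib']
```

Run against the output of the colon-syntax format, the same check should report nothing missing if both NIDs are stored.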
When patching a kernel by following https://wiki.hpdd.intel.com/pages/viewpage.action?pageId=8126821, the instructions say:
** link the Lustre series and patches
# ln -s ~/lustre-release/lustre/kernel_patches/series/2.6-rhel6.series series
# ln -s ~/lustre-release/lustre/kernel_patches/patches patches
** Apply the patches to the kernel source using quilt
# quilt push -av
In Lustre 2.1.6, the file lustre/kernel_patches/series/2.6-rhel6.series (which is where the link is made to) has only 8 patch files, while the file ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series has 34 patches.
Why do we use one and not the other? Or do we need both?
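For anyone who wants to reproduce the patch counts, here is a minimal sketch of counting the entries in a quilt series file. The helper name `list_patches` and the sample patch names are hypothetical. It assumes the usual series layout: one patch per line, optional per-patch options after the name, and '#' starting a comment.

```python
def list_patches(series_text):
    """Return the patch names found in the text of a quilt series file."""
    patches = []
    for line in series_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            patches.append(line.split()[0])   # ignore options such as -p1
    return patches


# Hypothetical series content, for illustration only.
sample = """\
# comment line
example-a.patch
example-b.patch -p1

example-c.patch
"""
print(len(list_patches(sample)))  # 3
```

To use it on the real files, read lustre/kernel_patches/series/2.6-rhel6.series and ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series with open(...).read() and pass the text to list_patches().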
Just as a reminder, don't forget to register and to send your LAD'14
presentation proposal (abstract of a 30-min technical presentation and a
brief bio if you like) before July 25th, 2014. If you have any questions
regarding this year's LAD, please contact us at lad(a)eofs.eu or if you are
attending ISC'14 next week in Leipzig, Germany, we would love to have
you at the EOFS and OpenSFS Booth #820 to talk about Lustre and LAD.
All details of the Lustre Administrators and Developers Workshop 2014
Release in Europe can be found at the following URL:
Direct link to registration form:
See you soon!
On 2014/06/18, 4:11 AM, "Anjana Kar" <kar(a)psc.edu> wrote:
>We just finished testing zfs 0.6.3 and lustre 2.5.2 with zfs MDT & OSTs,
>still ran into the problem of running out of inodes on the MDT. The number
>started at about 7.9M, and grew to 8.9M, but not beyond that, the MDT
>being on a mirrored zpool of 2 80GB SSD drives. The filesystem size was
>with 8 13TB raidz2 OSTs on a shared MDS/OSS node, and a second OSS.
With your mirrored 80GB drives and 8.9M inodes, this is about 9kB per inode.
It seems a bit high, but considering that ZFS will also replicate the metadata,
add checksums and in general have more space overhead than ldiskfs, in
addition to the mirrored VDEV, it isn't totally unreasonable. It is still
only 0.1% of the OST space, so it doesn't seem unreasonable to add another
pair of SSDs in a second mirrored VDEV to increase the MDT space, and also
likely improve your performance.
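The two figures above can be checked with a quick calculation. This is a rough sketch that assumes decimal units (1 GB = 10^9 bytes) and takes the quoted 80GB and 13TB sizes at face value, ignoring raidz2 and pool overhead.

```python
GB = 10**9
TB = 10**12

mdt_bytes = 80 * GB       # mirrored pair of 80GB SSDs, so 80GB usable
inodes = 8.9e6            # inode count at which the MDT filled up
ost_bytes = 8 * 13 * TB   # eight 13TB raidz2 OSTs

bytes_per_inode = mdt_bytes / inodes
mdt_fraction = mdt_bytes / ost_bytes

print(f"{bytes_per_inode / 1000:.1f} kB per inode")  # 9.0 kB per inode
print(f"{mdt_fraction:.2%} of OST space")            # 0.08% of OST space
```

With these assumptions the MDT comes out at a little under 0.1% of the OST capacity, so doubling it with a second mirrored VDEV would still leave it below 0.2%.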
>It took ~5000 seconds to run out of inodes in our empty file test. But,
>during that time it averaged about 1650/sec which is the best we've seen.
>I'm not sure why the inodes have been an issue, but we ran out of time to
>pursue this further.
>Instead we have moved to ldiskfs MDT and zfs OSTs, with the same
>versions, and have a lot more inodes available.
>Filesystem Inodes IUsed IFree IUse% Mounted on
> 39049920 7455386 31594534 20% /iconfs
>Performance has been reportedly better, but one problem was that when
>the OSS nodes went down before the OSTs could be taken offline (as would
>happen during a power outage), OSTs failed to mount after the reboot.
>To get around that we added a zpool import -f line after the message
>"Unexpected return code from import of pool $pool" in the lustre startup
>so the pools mount, then ran the lustre startup script to start the
>there is a better way to handle this please let me know.
>Another problem we ran into is that our 1.8.9 clients could not write
>into the new filesystem with lustre 2.5.60 which came from
>Things worked after checking out "track -b b2_5 origin/b2_5", and
>for ldiskfs. OS on the lustre servers is CentOS 6.5, kernel
Note that there is no long-term plan to maintain interoperability between
1.8.9 and 2.6 or later releases. It is great that this continues to work,
but we are only testing 2.6 interoperability with 2.x releases so I would
recommend upgrading the 1.8.9 clients sooner rather than later.
>Thanks again for all the responses.
>On 06/12/2014 09:43 PM, Scott Nolin wrote:
>> Just a note, I see zfs-0.6.3 has just been announced:
>> I also see it is upgraded in the zfs/lustre repo.
>> The changelog notes the default as changed to 3/4 arc_c_max and a
>> variety of other fixes, many focusing on performance.
>> So Anjana this is probably worth testing, especially if you're
>> considering drastic measures.
>> We upgraded for our MDS, so this file create issue is harder for us to
>> test now (literally started testing writes this afternoon, and it's
>> not degraded yet, so far at 20 million writes). Since your problem
>> still happens fairly quickly I'm sure any information you have will be
>> very helpful to add to LU-2476. And if it helps, it may save you some
>> We will likely install the upgrade but may not be able to test
>> millions of writes any time soon, as the filesystem is needed for
>> On Thu, 12 Jun 2014 16:41:14 +0000
>> "Dilger, Andreas" <andreas.dilger(a)intel.com> wrote:
>>> It looks like you've already increased arc_meta_limit beyond the
>>> default, which is c_max / 4. That was critical to performance in our
>>> There is also a patch from Brian that should help performance in your
>>> Cheers, Andreas
>>> On Jun 11, 2014, at 12:53, "Scott Nolin"
>>> <scott.nolin(a)ssec.wisc.edu> wrote:
>>> We tried a few arc tunables as noted here:
>>> However, I didn't find any clear benefit in the long term. We were
>>> just trying a few things without a lot of insight.
>>> On 6/9/2014 12:37 PM, Anjana Kar wrote:
>>> Thanks for all the input.
>>> Before we move away from zfs MDT, I was wondering if we can try
>>> setting zfs
>>> tunables to test the performance. Basically what's a value we can use for
>>> arc_meta_limit for our system? Are there any other settings that can
>>> be changed?
>>> Generating small files on our current system, things started off at 500
>>> then declined so it was about 1/20th of that after 2.45 million files.
>>> On 06/09/2014 10:27 AM, Scott Nolin wrote:
>>> We ran some scrub performance tests, and even without tunables set it
>>> wasn't too bad, for our specific configuration. The main thing we did
>>> was verify it made sense to scrub all OSTs simultaneously.
>>> Anyway, indeed scrub or resilver aren't about Defrag.
>>> Further, the mds performance issues aren't about fragmentation.
>>> A side note, it's probably ideal to stay below 80% due to
>>> fragmentation for ldiskfs too or performance degrades.
>>> Sean, note I am dealing with specific issues for a very create intense
>>> workload, and this is on the mds only where we may change. The data
>>> integrity features of Zfs make it very attractive too. I fully expect
>>> things will improve too with Zfs.
>>> If you want a lot of certainty in your choices, you may want to
>>> consult various vendors of lustre systems.
>>> On June 8, 2014 11:42:15 AM CDT, "Dilger, Andreas"
>>> <andreas.dilger(a)intel.com> wrote:
>>> Scrub and resilver have nothing to do with defrag.
>>> Scrub is scanning of all the data blocks in the pool to verify
>>> their checksums and parity to detect silent data corruption, and
>>> rewrite the bad blocks if necessary.
>>> Resilver is reconstructing a failed disk onto a new disk using
>>> parity or mirror copies of all the blocks on the failed disk. This is
>>> similar to scrub.
>>> Both scrub and resilver can be done online, though resilver of
>>> course requires a spare disk to rebuild onto, which may not be
>>> possible to add to a running system if your hardware does not support
>>> Both of them do not "improve" the performance or layout of data on
>>> disk. They do impact performance because they cause a lot of random
>>> IO to the disks, though this impact can be limited by tunables on the
>>> Cheers, Andreas
>>> On Jun 8, 2014, at 4:21, "Sean Brisbane"
>>> Hi Scott,
>>> We are considering running zfs backed lustre and the factor of
>>> 10ish performance hit you see worries me. I know zfs can splurge bits
>>> of files all over the place by design. The oracle docs do recommend
>>> scrubbing the volumes and keeping usage below 80% for maintenance and
>>> performance reasons, I'm going to call it 'defrag' but I'm sure
>>> someone who knows better will probably correct me as to why it is not
>>> the same.
>>> So are these performance issues after scrubbing and is it possible
>>> to scrub online - I.e. some reasonable level of performance is
>>> maintained while the scrub happens?
>>> Resilvering is also recommended. Not sure if that is for
>>> performance reasons.
>>> Sent from my HTC Desire C on Three
>>> ----- Reply message -----
>>> From: "Scott Nolin"
>>> To: "Anjana Kar"
>>> Subject: [Lustre-discuss] number of inodes in zfs MDT
>>> Date: Fri, Jun 6, 2014 3:23 AM
>>> Looking at some of our existing zfs filesystems, we have a couple
>>> with zfs mdts
>>> One has 103M inodes and uses 152G of MDT space, another 12M and
>>> 19G. I'd plan for less than that I guess as Mr. Dilger suggests. It
>>> all depends on your expected average file size and number of files
>>> for what will work.
>>> We have run into some unpleasant surprises with zfs for the MDT, I
>>> believe mostly documented in bug reports, or at least hinted at.
>>> A serious issue we have is performance of the zfs arc cache over
>>> time. This is something we didn't see in early testing, but with
>>> enough use it grinds things to a crawl. I believe this may be
>>> addressed in the newer version of ZFS, which we're hopefully awaiting.
>>> Another thing we've seen, which is mysterious to me, is that it
>>> appears that as the MDT begins to fill up file create rates go down.
>>> We don't really have a strong handle on this (not enough for a bug
>>> report I think), but we see this:
>>> The aforementioned 104M inode / 152GB MDT system has 4 SAS drives
>>> raid10. On initial testing file creates were about 2500 to 3000 IOPs
>>> per second. Follow up testing in its current state (about half
>>> full..) shows them at about 500 IOPs now, but with a few iterations
>>> of mdtest those IOPs plummet quickly to unbearable levels (like 30...).
>>> We took a snapshot of the filesystem and sent it to the backup MDS,
>>> this time with the MDT built on 4 SAS drives in a raid0 - really not
>>> for performance so much as "extra headroom" if that makes any sense.
>>> Testing this the IOPs started higher, at maybe 800 or 1000 (this is
>>> from memory, I don't have my data in front of me). That initial
>>> faster speed could just be writing to 4 spindles I suppose, but
>>> surprising to me, the performance degraded at a slower rate. It took
>>> much longer to get painfully slow. It still got there. The
>>> performance didn't degrade at the same rate if that makes sense - the
>>> same number of writes on the smaller/slower mdt degraded the
>>> performance more quickly. My guess is that had something to do with
>>> the total space available. Who knows. I believe restarting lustre
>>> (and certainly rebooting) 'resets the clock' on the file create
>>> performance degradation.
>>> For that problem we're just going to try adding 4 SSDs, but it's
>>> an ugly problem. Also, we are once again hopeful the new zfs version
>>> addresses it.
>>> And finally, we've got a real concern with snapshot backups of the
>>> MDT that my colleague posted about - the problem we see manifests in
>>> essentially a read-only recovered file system, so it's a concern and
>>> not quite terrifying.
>>> not quite terrifying.
>>> All in all, the next lustre file system we bring up (in a couple
>>> weeks) we are very strongly considering going with ldiskfs for the
>>> MDT this time.
>>> From: Anjana Kar
>>> Sent: Tuesday, June 3, 2014 7:38 PM
>>> Is there a way to set the number of inodes for zfs MDT?
>>> I've tried using --mkfsoptions="-N value" mentioned in lustre 2.0
>>> manual, but it
>>> fails to accept it. We are mirroring 2 80GB SSDs for the MDT, but the
>>> number of
>>> inodes is getting set to 7 million, which is not enough for a 100TB
>>> Thanks in advance.
>>> -Anjana Kar
>>> Pittsburgh Supercomputing Center
Lustre Software Architect
Intel High Performance Data Division
On 2014/06/16, 11:43 PM, "Anil Belur" <askb23(a)gmail.com> wrote:
>From: Anil Belur <askb23(a)gmail.com>
>* WARNING: min() should probably be min_t(__u32, desc.ld_tgt_count,
>Signed-off-by: Anil Belur <askb23(a)gmail.com>
> drivers/staging/lustre/lustre/lclient/lcommon_misc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>diff --git a/drivers/staging/lustre/lustre/lclient/lcommon_misc.c
>index 21de1cd..0900bef 100644
>@@ -63,7 +63,7 @@ int cl_init_ea_size(struct obd_export *md_exp, struct
> if (rc)
> return rc;
>- stripes = min(desc.ld_tgt_count, (__u32)LOV_MAX_STRIPE_COUNT);
>+ stripes = min_t(__u32, desc.ld_tgt_count, (__u32)LOV_MAX_STRIPE_COUNT);
If you are using min_t(__u32, ...) then there is no need for the (__u32)
cast on LOV_MAX_STRIPE_COUNT, since the whole point of min_t() is that the
arguments are cast to the given type.
> lsm.lsm_stripe_count = stripes;
> easize = obd_size_diskmd(dt_exp, &lsm);
Lustre Software Architect
Intel High Performance Data Division