Issue patching kernel-2.6.32-358.18.1.el6 (lustre 2.5.50)
by Gianfranco Sciacca
Hi All,
I have built Lustre 2.5.50 RPMs against kernel-2.6.32-358.18.1.el6
according to section 29.2 in http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact...
(This is the first time I have done this.) All seems to have gone well, except that the resulting kernel RPM is the following: kernel-2.6.32358.18.1.el6_lustre.x86_64-1.x86_64.rpm
(no dash between 32 and 358; it should be 32-358). All the Lustre RPMs have the correct version embedded in the name.
During the "Patch the Kernel Source with the Lustre Code" step I edited the EXTRAVERSION value in the Makefile like this:
[build@ce02 rpmbuild]$ cd ~/kernel/rpmbuild/BUILD/kernel-2.6.32-358.18.1.el6/linux-2.6.32-358.18.1.el6.x86_64/
Edit line 4 of the Makefile to look like this:
[build@ce02 linux-2.6.32-358.18.1.el6.x86_64]$ grep EXTRAVERSION Makefile|grep 86
EXTRAVERSION = -358.18.1.el6_lustre.x86_64
which I think is what the documentation says. Is this correct?
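For reference, a quick way to double-check the value actually in effect in the patched tree (paths as above; the final RPM file name also depends on the kernel spec file, so treat this only as a sanity check, not as the documented procedure):
[build@ce02 ~]$ cd ~/kernel/rpmbuild/BUILD/kernel-2.6.32-358.18.1.el6/linux-2.6.32-358.18.1.el6.x86_64
[build@ce02 linux-2.6.32-358.18.1.el6.x86_64]$ head -n 4 Makefile               # VERSION, PATCHLEVEL, SUBLEVEL, EXTRAVERSION
[build@ce02 linux-2.6.32-358.18.1.el6.x86_64]$ grep '^EXTRAVERSION' Makefile
EXTRAVERSION = -358.18.1.el6_lustre.x86_64
[build@ce02 linux-2.6.32-358.18.1.el6.x86_64]$ make kernelversion              # expect 2.6.32-358.18.1.el6_lustre.x86_64 (the dash comes from EXTRAVERSION)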
Many thanks for any help,
Gianfranco
Re: [HPDD-discuss] [Lustre-discuss] e2fsprogs-1.41.10.sun2 source -- compile failed due to db.h
by Dilger, Andreas
On 2013/10/29 5:21 PM, "Weilin Chang" <Weilin.Chang(a)huawei.com> wrote:
>
>I tried to compile e2fsprogs from its source code. The compilation
>failed because db.h was missing. Has anyone compiled the tool from its
>source code before? Where is this file located? Is it OK to use a
>different version of e2fsprogs than
> the one provided in the Lustre release?
This is a really old version of e2fsprogs (over 3 years old). There are a
lot of bug fixes in e2fsprogs since 1.41.10 was released. You should use a
newer version from http://downloads.whamcloud.com/public/e2fsprogs/latest/
for any new installations. In most cases, you can just use the compiled
RPMs.
If you want to build your own, you either need to install db4-devel or
disable
lfsck in order to build the Lustre e2fsprogs. You can't use the unpatched
e2fsprogs because it doesn't support features that Lustre uses in ldiskfs.
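Concretely, the two options above might look roughly like this (package and configure-flag names are assumptions to verify against the actual source and spec file, not a tested recipe):
# Option 1: provide the Berkeley DB headers that lfsck needs, then rebuild
yum install -y db4-devel
rpmbuild --rebuild e2fsprogs-*.src.rpm
# Option 2: build from the unpacked source tree, checking configure for a
# switch to disable lfsck if db4-devel cannot be installed
cd e2fsprogs-*
./configure --help | grep -i lfsck    # confirm the exact flag name first
./configure && make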
Cheers, Andreas
>
>Thank you for your help.
>
>Here were the error messages generated while compiling e2fsprogs:
>
>making all in e2fsck
>make[2]: Entering directory
>`/usr/src/redhat/SOURCES/e2fsprogs-1.41.10.sun2/build/e2fsck'
> COMPILE_ET prof_err.et
> CC gen_crc32table
> GEN32TABLE crc32table.h
> CC ../../e2fsck/crc32.c
> CC ../../e2fsck/dict.c
> CC ../../e2fsck/unix.c
>In file included from ../../e2fsck/unix.c:62:
>../../e2fsck/lfsck.h:15:16: error: db.h: No such file or directory
>
>
>
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
Lustre 2.5 update - October 25th 2013
by Jones, Peter A
Hi there
Here is an update on the Lustre 2.5 release.
Landings
========
-No landings made; code freeze in effect on b2_5; landings have commenced on master for the 2.6 release
Testing
=======
-Testing on 2.5.0-RC1 tag ongoing
Blockers
========
-None at present; if there are any issues not presently marked as blockers that you believe should be, please let me know
Other
=====
-Testing on Hyperion suffered a relatively short outage due to the US Government shutdown, but testing is now back in service
-We encourage community members to test out the release candidate and to use JIRA (https://jira.hpdd.intel.com) to report any bugs encountered
Thanks
Peter
PS/ You can also keep up to date with matters relating to the 2.5 release on the CDWG wiki - http://wiki.opensfs.org/Lustre_2.5.0
Re: [HPDD-discuss] [Lustre-discuss] Speeding up configuration log regeneration?
by Yong, Fan
> -----Original Message-----
> From: Olli Lounela [mailto:olli.lounela@helsinki.fi]
> Sent: Monday, October 21, 2013 4:59 PM
> To: Dilger, Andreas
> Cc: Yong, Fan; hpdd-discuss(a)lists.01.org
> Subject: Re: [Lustre-discuss] Speeding up configuration log regeneration?
>
> First, my apologies for not noticing that the mail client dropped the CC's.
>
> The system still isn't up, it's excruciatingly slow to recover.
>
> Quoting "Dilger, Andreas" <andreas.dilger(a)intel.com>:
>
> > If you have done a file-level backup and restore before your upgrade
> > it sounds like the MDS is rebuilding the Object Index files. I thought
> > from your original email that you had only changed the NIDs and maybe
> > updated the MDS node.
>
> No, I had to switch the host. The previous one was not compatible with the
> 10GE hardware. Also, the new one is much faster, larger and more reliable in
> multiple ways, so I needed to do it eventually anyway. I did expect a recovery
> period, but nothing in excess of several days.
>
> > It would have been much faster to do a device level backup/restore
> > using "dd" in this case, since the OI scrub wouldn't be needed.
>
> OK, good to know. Well, I should be able to go back since I still have the
> original MDS untouched, but have been loath to do so since at least
> something is happening. Should I still do that?
>
> > You can check progress via /proc/fs/lustre/osd-ldiskfs/{MDT
> > name}/scrub (I think). It should tell you the current progress and
> > scanning rate. It should be able to run at tens if thousands of files
> > per second. That said, few people have so many small files as you do.
> >
> > I would be interested to see what your scrub statistics are.
>
> AFAICT, it should have been complete in a bit over nine hours, and it's now
> been nearly a week:
>
> [root@dna-lustre-mds mdd]# cat
> /proc/fs/lustre/osd-ldiskfs/g4data-MDT0000/filestotal
> 1497235456
> [root@dna-lustre-mds mdd]# cat
> /proc/fs/lustre/osd-ldiskfs/g4data-MDT0000/oi_scrub
> name: OI scrub
> magic: 0x4c5fd252
> oi_files: 64
> status: completed
> flags:
> param:
> time_since_last_completed: 591570 seconds
> time_since_latest_start: 592140 seconds
> time_since_last_checkpoint: 591570 seconds
> latest_start_position: 11
> last_checkpoint_position: 1497235457
> first_failure_position: N/A
> checked: 25871866
> updated: 25871728
> failed: 0
> prior_updated: 0
> noscrub: 0
> igif: 0
> success_count: 1
> run_time: 569 seconds
> average_speed: 45469 objects/sec
> real-time_speed: N/A
> current_position: N/A
> [root@dna-lustre-mds mdd]# echo '1497235457/45469/60^2'|bc -l
> 9.14686353461821363028
>
> If 'updated' is where it's at, and if it updates all, doesn't that mean it's just
> 1.73% done in 6 days?!? Oops, "only" a year to go..?
>
[Nasf] The OI scrub has already rebuilt your OI files successfully. It took 569 seconds ("run_time") to check 25871866 objects (or inodes) ("checked"), and it rebuilt 25871728 of those items ("updated"). The 1497235457 ("last_checkpoint_position") is the index of the last object (i.e. the last inode's ino#); it does NOT mean you have 1497235457 objects (or inodes).
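(For anyone repeating this check later: the same state can be re-read on the MDS at any time. A small sketch, assuming the lctl parameter name simply mirrors the /proc path already shown above:)
cat /proc/fs/lustre/osd-ldiskfs/g4data-MDT0000/oi_scrub | egrep 'status|checked|updated|run_time|average_speed'
lctl get_param osd-ldiskfs.*.oi_scrub 2>/dev/null | egrep 'status|position|speed'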
--
Cheers,
Nasf
> This seems to be the typical top speed based on iostat:
>
> 10/21/2013 11:09:01 AM
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> sda 0.10 0.00 1.20 0 24
>
> The disk subsystem is pretty fast, though:
>
> [root@dna-lustre-mds mdd]# dd if=/dev/sda of=/dev/null bs=1024k
> count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 10.5425 s, 1.0 GB/s
>
> I dare not write there unless told how; as is the case with most
> HPC/bioinformatics labs, we cannot make backups of a significant amount of
> data since there is just too much.
>
> MDS is 99.9-100% idle and no memory pressure:
>
> [root@dna-lustre-mds mdd]# free
> total used free shared buffers cached
> Mem: 32864824 32527208 337616 0 13218200 114504
> -/+ buffers/cache: 19194504 13670320
> Swap: 124999672 0 124999672
>
> I cannot explain the slowness in any way, for all practical purposes there's
> nothing happening at all. If the system was physically hard pressed to cope, I
> would be much happier, at least I'd know what to do...
>
> Thanks again,
> Olli
>
> > On 2013-10-18, at 4:06, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
> >
> >> Thanks for the quick reply.
> >>
> >> What's preventing use of the system is that for some reason the file
> >> content doesn't seem to match its metadata. The client systems hang
> >> the connection at login (which I believe is working as designed), and
> >> when I try listing the mount (/home) first-level directories, it very
> >> quickly brings up what content it has, but what it has grows very
> >> slowly. Yesterday, ls /home/* hung but no longer today, and user
> >> logins hang, probably because ~/.login and ~/.bashrc contents don't
> >> come up. Indeed, I can see the entries in the home directories, and
> >> some subdirectories, though not all of them, and I cannot cat all files. I
> >> conjecture that since directories are just files of a sort, the
> >> metadata/content issue affects all 1,5*10^9 files.
> >>
> >> Looking with iostat, the MDS is averaging some 0.1 TPS at most and
> >> writing maybe a block a second. As mentioned, there's 13 GB free RAM
> >> (ie. buffers) in MDS and the system is 99.9% idle. Plenty of
> >> resources and nothing happening. Any ideas how to start tracking the
> >> problem? (NB: see also the zfs issue below.)
> >>
> >> Yes, I switched the hardware under MDS, but Centos 6.x tar seems to
> >> handle --xattrs, so in principle the slow progress in rebuilding
> >> (whatever is being rebuilt) remains unexplained. The MDS is quad-core
> >> Opteron with 32 GB RAM, OSS's are the same as earlier, dual Xeon
> >> 5130's with 8 GB RAM, which seems sufficient. The disk units are
> >> SAS-attached shelves of up to 24 disks. SAS-controllers are standard
> >> LSI ones, and I've seen them performing at or in excess of 100 MBps.
> >>
> >> I have seen similar behaviour earlier with zfs, where writing just
> >> does not happen at any reasonable speed after about 20 TiB, but I had
> >> unfortunately turned on confounding factors like compression and
> >> dedup, which are known to be broken. Hence I did not follow it up,
> >> especially since it seems a longstanding/nontrivial issue, and since it
> >> seems zfs developers are busier integrating into Lustre (and yes,
> >> Lustre 2.3 latest didn't compile cleanly with the zfs stuff turned
> >> off.) I did suspect that there is some sort of combination of write
> >> throttling and wait-for-flush/commit that explodes after an unusually
> >> large dataset (ie. 20+ TiB,) but no tunable fixed anything, and
> >> eventually it seemed a better option just to give up zfs. We now have
> >> ldiskfs. And yes, our dataset will no doubt exceed 70 TiB before the
> >> year is out.
> >>
> >> The major reason for 2.3 was that 2.4 did not yet exist and 2.3 was
> >> the first to allow for big OST slices. With modern disks and nobody
> >> wanting to fund required computing hardware (we do consider ours an
> >> HPC cluster,) running 4-disk RAID-6's was deemed unacceptable waste.
> >> In theory, and especially if it's deemed necessary, I could upgrade
> >> to 2.4, but our informaticians have been out of work for more than a
> >> week now, and a week or two more for the upgrade is really not a good
> >> idea.
> >>
> >> Thankfully yours,
> >> Olli
> >>
> >> Quoting "Dilger, Andreas" <andreas.dilger(a)intel.com>:
> >>
> >>> On 2013/10/17 5:34 AM, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> We run four-node Lustre 2.3, and I needed to both change hardware
> >>>> under MGS/MDS and reassign an OSS ip. Just the same, I added a
> >>>> brand new 10GE network to the system, which was the reason for MDS
> >>>> hardware change.
> >>>
> >>> Note that in Lustre 2.4 there is a "lctl replace_nids" command that
> >>> allows you to change the NIDs without running --writeconf. That
> >>> doesn't help you now, but possibly in the future.
> >>>
> >>>> I ran tunefs.lustre --writeconf as per chapter 14.4 in Lustre
> >>>> Manual, and everything mounts fine. Log regeneration apparently
> >>>> works, since it seems to do something, but exceedingly slowly.
> >>>> Disks show all but no activity, CPU utilization is zero across the
> >>>> board, and memory should be no issue. I believe it works, but
> >>>> currently it seems the
> >>>> 1,5*10^9 files (some 55 TiB of data) won't be indexed in a week. My
> >>>> boss isn't happy when I can't even predict how long this will take,
> >>>> or even say for sure that it really works.
> >>>
> >>> The --writeconf information is at most a few kB and should only take
> >>> seconds to complete. What "reindexing" operation are you referencing?
> >>> It should be possible to mount the filesystem immediately (MGS
> >>> first, then MDS and OSSes) after running --writeconf.
> >>>
> >>> You didn't really explain what is preventing you from using the
> >>> filesystem, since you said it mounted properly?
> >>>
> >>>> Two questions: is there a way to know how fast it is progressing
> >>>> and/or where it is at, or even that it really works, and is there a
> >>>> way to speed up whatever is slowing it down? Seems all diagnostic
> >>>> /proc entries have been removed from 2.3. I have tried mounting
> >>>> the Lustre partitions with -o nobarrier (yes, I know it's
> >>>> dangerous, but I'd really need to speed things up) but I don't know
> >>>> if that does anything at all.
> >>>
> >>> I doubt that the "-o nobarrier" is helping you much.
> >>>
> >>>> We run Centos 6.x in Lustre servers, where Lustre has been
> >>>> installed from rpm's from Whamcloud/Intel build bot, and Ubuntu
> >>>> 10.04 in clients with hand compiled kernel and Lustre. One MGC/MGS
> >>>> with twelve 15k-RPM SAS disks in RAID-10 as MDT that is all but
> >>>> empty, and six variously build RAID-6's in SAS-attached shelves in three
> OSS's.
> >>
> >> --
> >> Olli Lounela
> >> IT specialist and administrator
> >> DNA sequencing and genomics
> >> Institute of Biotechnology
> >> University of Helsinki
> >>
>
>
> --
> Olli Lounela
> IT specialist and administrator
> DNA sequencing and genomics
> Institute of Biotechnology
> University of Helsinki
New Lustre/HSM doc
by DEGREMONT Aurelien
Hi
As the Lustre/HSM documentation is not yet integrated into the Lustre manual, here is the
latest version of the Lustre/HSM doc, with the /proc names updated.
This doc matches the Lustre 2.5.0rc1 code.
Aurélien
In what order to restart Lustre
by jyizheng
Hi,
I want to restart all the servers of my Lustre File System. In what order do I need to reboot the machines? The MGS and MDT are using separate disk partitions but reside on the same machine. I have eight OSSes, each with one OST.
Could you please give some suggestions?
Thanks very much.
Yizheng
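(A minimal sketch of the usual ordering, with made-up mount points and device names; the startup order follows Andreas's note elsewhere in this digest, MGS first, then MDS and OSSes, with clients last. For shutdown, unmount the clients first, then the server targets:)
# Shutdown: clients first, then the server targets
umount /mnt/lustre                          # on every client
umount /mnt/ost0                            # on each OSS, for each OST
umount /mnt/mdt && umount /mnt/mgs          # on the combined MGS/MDS node
# Startup: MGS first, then MDT, then OSTs, then clients
mount -t lustre /dev/sdX1 /mnt/mgs          # MGS partition
mount -t lustre /dev/sdX2 /mnt/mdt          # MDT partition
mount -t lustre /dev/sdY  /mnt/ost0         # on each OSS
mount -t lustre mgsnode@tcp0:/fsname /mnt/lustre   # on each client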
Re: [HPDD-discuss] [Lustre-discuss] Speeding up configuration log regeneration?
by Olli Lounela
Andreas, Yong, apologies for resend, but I wasn't subscribed to
hpdd-discuss. Resending to fix this.
Quoting Olli Lounela <olli.lounela(a)helsinki.fi>:
> First, my apologies for not noticing that the mail client dropped the CC's.
>
> The system still isn't up, it's excruciatingly slow to recover.
>
> Quoting "Dilger, Andreas" <andreas.dilger(a)intel.com>:
>
>> If you have done a file-level backup and restore before your
>> upgrade it sounds like the MDS is rebuilding the Object Index
>> files. I thought from your original email that you had only changed
>> the NIDs and maybe updated the MDS node.
>
> No, I had to switch the host. The previous one was not compatible
> with the 10GE hardware. Also, the new one is much faster, larger and
> more reliable in multiple ways, so I needed to do it eventually
> anyway. I did expect a recovery period, but nothing in excess of
> several days.
>
>> It would have been much faster to do a device level backup/restore
>> using "dd" in this case, since the OI scrub wouldn't be needed.
>
> OK, good to know. Well, I should be able to go back since I still
> have the original MDS untouched, but have been loath to do so since
> at least something is happening. Should I still do that?
>
>> You can check progress via /proc/fs/lustre/osd-ldiskfs/{MDT
>> name}/scrub (I think). It should tell you the current progress and
>> scanning rate. It should be able to run at tens of thousands of
>> files per second. That said, few people have so many small files as
>> you do.
>>
>> I would be interested to see what your scrub statistics are.
>
> AFAICT, it should have been complete in a bit over nine hours, and
> it's now been nearly a week:
>
> [root@dna-lustre-mds mdd]# cat
> /proc/fs/lustre/osd-ldiskfs/g4data-MDT0000/filestotal
> 1497235456
> [root@dna-lustre-mds mdd]# cat
> /proc/fs/lustre/osd-ldiskfs/g4data-MDT0000/oi_scrub
> name: OI scrub
> magic: 0x4c5fd252
> oi_files: 64
> status: completed
> flags:
> param:
> time_since_last_completed: 591570 seconds
> time_since_latest_start: 592140 seconds
> time_since_last_checkpoint: 591570 seconds
> latest_start_position: 11
> last_checkpoint_position: 1497235457
> first_failure_position: N/A
> checked: 25871866
> updated: 25871728
> failed: 0
> prior_updated: 0
> noscrub: 0
> igif: 0
> success_count: 1
> run_time: 569 seconds
> average_speed: 45469 objects/sec
> real-time_speed: N/A
> current_position: N/A
> [root@dna-lustre-mds mdd]# echo '1497235457/45469/60^2'|bc -l
> 9.14686353461821363028
>
> If 'updated' is where it's at, and if it updates all, doesn't that
> mean it's just 1.73% done in 6 days?!? Oops, "only" a year to go..?
>
> This seems to be the typical top speed based on iostat:
>
> 10/21/2013 11:09:01 AM
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> sda 0.10 0.00 1.20 0 24
>
> The disk subsystem is pretty fast, though:
>
> [root@dna-lustre-mds mdd]# dd if=/dev/sda of=/dev/null bs=1024k count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 10.5425 s, 1.0 GB/s
>
> I dare not write there unless told how; as is the case with most
> HPC/bioinformatics labs, we cannot make backups of a significant
> amount of data since there is just too much.
>
> MDS is 99.9-100% idle and no memory pressure:
>
> [root@dna-lustre-mds mdd]# free
> total used free shared buffers cached
> Mem: 32864824 32527208 337616 0 13218200 114504
> -/+ buffers/cache: 19194504 13670320
> Swap: 124999672 0 124999672
>
> I cannot explain the slowness in any way, for all practical purposes
> there's nothing happening at all. If the system was physically hard
> pressed to cope, I would be much happier, at least I'd know what to
> do...
>
> Thanks again,
> Olli
>
>> On 2013-10-18, at 4:06, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
>>
>>> Thanks for the quick reply.
>>>
>>> What's preventing use of the system is that for some reason the file
>>> content doesn't seem to match its metadata. The client systems hang
>>> the connection at login (which I believe is working as designed), and
>>> when I try listing the mount (/home) first-level directories, it
>>> very quickly brings up what content it has, but what it has grows
>>> very slowly. Yesterday, ls /home/* hung but no longer today, and
>>> user logins hang, probably because ~/.login and ~/.bashrc contents
>>> don't come up. Indeed, I can see the entries in the home
>>> directories, and some subdirectories, though not all of them, and I
>>> cannot cat all files. I conjecture that since directories are just files of a
>>> sort, the metadata/content issue affects all 1,5*10^9 files.
>>>
>>> Looking with iostat, the MDS is averaging some 0.1 TPS at most and
>>> writing maybe a block a second. As mentioned, there's 13 GB free
>>> RAM (ie. buffers) in MDS and the system is 99.9% idle. Plenty of
>>> resources and nothing happening. Any ideas how to start tracking
>>> the problem? (NB: see also the zfs issue below.)
>>>
>>> Yes, I switched the hardware under MDS, but Centos 6.x tar seems
>>> to handle --xattrs, so in principle the slow progress in
>>> rebuilding (whatever is being rebuilt) remains unexplained. The
>>> MDS is quad-core Opteron with 32 GB RAM, OSS's are the same as
>>> earlier, dual Xeon 5130's with 8 GB RAM, which seems sufficient.
>>> The disk units are SAS-attached shelves of up to 24 disks.
>>> SAS-controllers are standard LSI ones, and I've seen them
>>> performing at or in excess of 100 MBps.
>>>
>>> I have seen similar behaviour earlier with zfs, where writing just
>>> does not happen at any reasonable speed after about 20 TiB, but I
>>> had unfortunately turned on confounding factors like compression
>>> and dedup, which are known to be broken. Hence I did not follow it
>>> up, especially since it seems a longstanding/nontrivial issue, and
>>> since it seems zfs developers are busier integrating into Lustre
>>> (and yes, Lustre 2.3 latest didn't compile cleanly with the zfs
>>> stuff turned off.) I did suspect that there is some sort of
>>> combination of write throttling and wait-for-flush/commit that
>>> explodes after an unusually large dataset (ie. 20+ TiB,) but no
>>> tunable fixed anything, and eventually it seemed a better option
>>> just to give up zfs. We now have ldiskfs. And yes, our dataset
>>> will no doubt exceed 70 TiB before the year is out.
>>>
>>> The major reason for 2.3 was that 2.4 did not yet exist and 2.3
>>> was the first to allow for big OST slices. With modern disks and
>>> nobody wanting to fund required computing hardware (we do consider
>>> ours an HPC cluster,) running 4-disk RAID-6's was deemed
>>> unacceptable waste. In theory, and especially if it's deemed
>>> necessary, I could upgrade to 2.4, but our informaticians have
>>> been out of work for more than a week now, and a week or two more
>>> for the upgrade is really not a good idea.
>>>
>>> Thankfully yours,
>>> Olli
>>>
>>> Quoting "Dilger, Andreas" <andreas.dilger(a)intel.com>:
>>>
>>>> On 2013/10/17 5:34 AM, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We run four-node Lustre 2.3, and I needed to both change hardware
>>>>> under MGS/MDS and reassign an OSS ip. Just the same, I added a brand
>>>>> new 10GE network to the system, which was the reason for MDS hardware
>>>>> change.
>>>>
>>>> Note that in Lustre 2.4 there is a "lctl replace_nids" command that
>>>> allows you to change the NIDs without running --writeconf. That doesn't
>>>> help you now, but possibly in the future.
>>>>
>>>>> I ran tunefs.lustre --writeconf as per chapter 14.4 in Lustre Manual,
>>>>> and everything mounts fine. Log regeneration apparently works, since
>>>>> it seems to do something, but exceedingly slowly. Disks show all but
>>>>> no activity, CPU utilization is zero across the board, and memory
>>>>> should be no issue. I believe it works, but currently it seems the
>>>>> 1,5*10^9 files (some 55 TiB of data) won't be indexed in a week. My
>>>>> boss isn't happy when I can't even predict how long this will take, or
>>>>> even say for sure that it really works.
>>>>
>>>> The --writeconf information is at most a few kB and should only take
>>>> seconds to complete. What "reindexing" operation are you referencing?
>>>> It should be possible to mount the filesystem immediately (MGS first,
>>>> then MDS and OSSes) after running --writeconf.
>>>>
>>>> You didn't really explain what is preventing you from using the
>>>> filesystem,
>>>> since you said it mounted properly?
>>>>
>>>>> Two questions: is there a way to know how fast it is progressing
>>>>> and/or where it is at, or even that it really works, and is there a
>>>>> way to speed up whatever is slowing it down? Seems all diagnostic
>>>>> /proc entries have been removed from 2.3. I have tried mounting the
>>>>> Lustre partitions with -o nobarrier (yes, I know it's dangerous, but
>>>>> I'd really need to speed things up) but I don't know if that does
>>>>> anything at all.
>>>>
>>>> I doubt that the "-o nobarrier" is helping you much.
>>>>
>>>>> We run Centos 6.x in Lustre servers, where Lustre has been installed
>>>>> from rpm's from Whamcloud/Intel build bot, and Ubuntu 10.04 in clients
>>>>> with hand compiled kernel and Lustre. One MGC/MGS with twelve 15k-RPM
>>>>> SAS disks in RAID-10 as MDT that is all but empty, and six variously
>>>>> built RAID-6's in SAS-attached shelves in three OSS's.
>>>
>>> --
>>> Olli Lounela
>>> IT specialist and administrator
>>> DNA sequencing and genomics
>>> Institute of Biotechnology
>>> University of Helsinki
--
Olli Lounela
IT specialist and administrator
DNA sequencing and genomics
Institute of Biotechnology
University of Helsinki
No multi_mount_protect MMP for MGS?
by Thomas Roth
Hi all,
we have a test system where the metadata part consists of two servers connected via Fibre Channel to
one storage box.
We are running CentOS 6.4 (Lustre 2.4) and can access the storage device using multipathd:
I have created a partition for the MGS, one for the MDT, both of which I can see on both servers as
/dev/mapper/mdsp1, /dev/mapper/mdsp2
Lustre works fine; I have just tested failing over from one metadata server to the other, and back,
all OK.
For the fun of it I wanted to test the MMP feature, so I mounted the partitions on the inactive one of
the two servers.
The MDT mount was refused, as expected, but I was surprised to see the MGS mount cleanly, without so much
as a single log entry!
If both servers are running an MGS, which one is being addressed by the clients? Round-robin? Did I
discover an active-active scenario I was not aware of?
More importantly: the MGS holds the configuration of my system, correct? What if I add an OST now?
Which server gets to write this to the MGS disk? Isn't it possible to mess up big time in this way?
Regards,
Thomas
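(A small sketch of how one might verify this, using the device names from the message above; which of mdsp1/mdsp2 is the MGS is not stated, and tune2fs here needs to be the Lustre-patched e2fsprogs. Whether mkfs.lustre actually set MMP on this particular target is exactly what the check would show:)
tune2fs -l /dev/mapper/mdsp1 | grep -i features    # look for "mmp" in the feature list
tune2fs -l /dev/mapper/mdsp2 | grep -i features
# If the MGS partition turns out to lack mmp, it can be added offline
# (only while the target is unmounted on ALL nodes), e.g.:
tune2fs -O mmp /dev/mapper/mdsp1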
Re: [HPDD-discuss] [Lustre-discuss] Speeding up configuration log regeneration?
by Dilger, Andreas
If you have done a file-level backup and restore before your upgrade it sounds like the MDS is rebuilding the Object Index files. I thought from your original email that you had only changed the NIDs and maybe updated the MDS node.
It would have been much faster to do a device level backup/restore using "dd" in this case, since the OI scrub wouldn't be needed.
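(To illustrate what such a device-level copy can look like, with hypothetical device names; the source must be unmounted and the destination at least as large:)
# on the old MDS, with the MDT unmounted:
dd if=/dev/old_mdt_device bs=4M | gzip -c > /backup/mdt.img.gz
# on the new MDS, onto the new MDT device:
gunzip -c /backup/mdt.img.gz | dd of=/dev/new_mdt_device bs=4M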
You can check progress via /proc/fs/lustre/osd-ldiskfs/{MDT name}/scrub (I think). It should tell you the current progress and scanning rate. It should be able to run at tens of thousands of files per second. That said, few people have so many small files as you do.
I would be interested to see what your scrub statistics are.
Cheers, Andreas
On 2013-10-18, at 4:06, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
> Thanks for the quick reply.
>
> What's preventing use of the system is that for some reason the file content doesn't seem to match its metadata. The client systems hang the connection at login (which I believe is working as designed), and when I try listing the mount (/home) first-level directories, it very quickly brings up what content it has, but what it has grows very slowly. Yesterday, ls /home/* hung but no longer today, and user logins hang, probably because ~/.login and ~/.bashrc contents don't come up. Indeed, I can see the entries in the home directories, and some subdirectories, though not all of them, and I cannot cat all files. I conjecture that since directories are just files of a sort, the metadata/content issue affects all 1,5*10^9 files.
>
> Looking with iostat, the MDS is averaging some 0.1 TPS at most and writing maybe a block a second. As mentioned, there's 13 GB free RAM (ie. buffers) in MDS and the system is 99.9% idle. Plenty of resources and nothing happening. Any ideas how to start tracking the problem? (NB: see also the zfs issue below.)
>
> Yes, I switched the hardware under MDS, but Centos 6.x tar seems to handle --xattrs, so in principle the slow progress in rebuilding (whatever is being rebuilt) remains unexplained. The MDS is quad-core Opteron with 32 GB RAM, OSS's are the same as earlier, dual Xeon 5130's with 8 GB RAM, which seems sufficient. The disk units are SAS-attached shelves of up to 24 disks. SAS-controllers are standard LSI ones, and I've seen them performing at or in excess of 100 MBps.
>
> I have seen similar behaviour earlier with zfs, where writing just does not happen at any reasonable speed after about 20 TiB, but I had unfortunately turned on confounding factors like compression and dedup, which are known to be broken. Hence I did not follow it up, especially since it seems a longstanding/nontrivial issue, and since it seems zfs developers are busier integrating into Lustre (and yes, Lustre 2.3 latest didn't compile cleanly with the zfs stuff turned off.) I did suspect that there is some sort of combination of write throttling and wait-for-flush/commit that explodes after an unusually large dataset (ie. 20+ TiB,) but no tunable fixed anything, and eventually it seemed a better option just to give up zfs. We now have ldiskfs. And yes, our dataset will no doubt exceed 70 TiB before the year is out.
>
> The major reason for 2.3 was that 2.4 did not yet exist and 2.3 was the first to allow for big OST slices. With modern disks and nobody wanting to fund required computing hardware (we do consider ours an HPC cluster,) running 4-disk RAID-6's was deemed unacceptable waste. In theory, and especially if it's deemed necessary, I could upgrade to 2.4, but our informaticians have been out of work for more than a week now, and a week or two more for the upgrade is really not a good idea.
>
> Thankfully yours,
> Olli
>
> Quoting "Dilger, Andreas" <andreas.dilger(a)intel.com>:
>
>> On 2013/10/17 5:34 AM, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
>>
>>> Hi,
>>>
>>> We run four-node Lustre 2.3, and I needed to both change hardware
>>> under MGS/MDS and reassign an OSS ip. Just the same, I added a brand
>>> new 10GE network to the system, which was the reason for MDS hardware
>>> change.
>>
>> Note that in Lustre 2.4 there is a "lctl replace_nids" command that
>> allows you to change the NIDs without running --writeconf. That doesn't
>> help you now, but possibly in the future.
>>
>>> I ran tunefs.lustre --writeconf as per chapter 14.4 in Lustre Manual,
>>> and everything mounts fine. Log regeneration apparently works, since
>>> it seems to do something, but exceedingly slowly. Disks show all but
>>> no activity, CPU utilization is zero across the board, and memory
>>> should be no issue. I believe it works, but currently it seems the
>>> 1,5*10^9 files (some 55 TiB of data) won't be indexed in a week. My
>>> boss isn't happy when I can't even predict how long this will take, or
>>> even say for sure that it really works.
>>
>> The --writeconf information is at most a few kB and should only take
>> seconds to complete. What "reindexing" operation are you referencing?
>> It should be possible to mount the filesystem immediately (MGS first,
>> then MDS and OSSes) after running --writeconf.
>>
>> You didn't really explain what is preventing you from using the filesystem,
>> since you said it mounted properly?
>>
>>> Two questions: is there a way to know how fast it is progressing
>>> and/or where it is at, or even that it really works, and is there a
>>> way to speed up whatever is slowing it down? Seems all diagnostic
>>> /proc entries have been removed from 2.3. I have tried mounting the
>>> Lustre partitions with -o nobarrier (yes, I know it's dangerous, but
>>> I'd really need to speed things up) but I don't know if that does
>>> anything at all.
>>
>> I doubt that the "-o nobarrier" is helping you much.
>>
>>> We run Centos 6.x in Lustre servers, where Lustre has been installed
>>> from rpm's from Whamcloud/Intel build bot, and Ubuntu 10.04 in clients
>>> with hand compiled kernel and Lustre. One MGC/MGS with twelve 15k-RPM
>>> SAS disks in RAID-10 as MDT that is all but empty, and six variously
>>> built RAID-6's in SAS-attached shelves in three OSS's.
>
> --
> Olli Lounela
> IT specialist and administrator
> DNA sequencing and genomics
> Institute of Biotechnology
> University of Helsinki
>