Brock,
Happy to hear you were able to resolve the problems. Thanks for reporting back your
success and how it was achieved - others will benefit from your experience and
information.
Regarding the 2.x manual, we (HPDD) recommend using the latest manual regardless of the
version of Lustre. It's more up to date and often has better content (examples and
corrections) than previous versions. Differences between versions are also called out in
the Lustre manual ("this is 2.2-specific", "this is 2.4", etc.). So while it says 2.x,
it seems to work well from 1.6 forward.
Regards,
Brett Lee
Solutions Architect, High Performance Data Division
-----Original Message-----
From: Brock Palen [mailto:brockp@umich.edu]
Sent: Wednesday, August 06, 2014 8:26 AM
To: Lee, Brett
Cc: hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] removing dead OST,
Lee,
As I noted, we are on Lustre 1.8 (for three more weeks).
We were able to recover the data; here is what we did.
This was an old Lustre filesystem based on Sun X4500/X4540 hardware, built with only
software RAID5 plus a spare, following the Tokyo Tech paper (old old, see a theme?).
We had a drive fail, and then a second drive threw read errors during the rebuild.
We were able to recover the data, though.
Because the FS had been down for a long time, and because this was only one of 37 OSTs,
we followed section 23.3.5, "Identifying a Missing OST", in the 1.8 manual.
We deactivated the OSC for that OST on each client; this let the FS keep
moving, only affecting files with stripes on that OST.
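For anyone following along, the per-client deactivation looks roughly like this on 1.8. The fsname ("lustre"), OST index, and device number below are hypothetical; find the real ones with lctl dl:

```shell
# On each client, list Lustre devices and find the OSC that points at
# the dead OST (names here are made-up examples).
lctl dl | grep osc
#   ... 12 UP osc lustre-OST0012-osc-... ...

# Deactivate that OSC so the client stops waiting on the missing OST.
# I/O to files striped only on other OSTs continues normally; files with
# a stripe on the dead OST return errors instead of hanging.
lctl --device 12 deactivate
```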
Recovering the OST itself was not a Lustre-specific thing. In general: we used
ddrescue to image the last drive kicked out of the array onto another drive. It
turned out we had only two unreadable sectors. We were able to assemble the array
with the new drive, and fsck -p didn't even complain (which I had expected it to).
The rebuild from there went fine, and we reactivated the OSC connection
on every host:
lctl --device <deviceid> activate
And all is well again.
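Roughly, the recovery sequence was the following. All device names below are hypothetical, and the actual array member list would differ:

```shell
# Image the failing member onto a fresh disk. The map file records
# progress and bad sectors, so the copy can be resumed or retried.
ddrescue -f /dev/sdg /dev/sdh rescue.map

# Force-assemble the RAID5 array, substituting the rescued copy for
# the failing drive (member list is illustrative).
mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdh1

# Preen-check the OST filesystem before bringing it back into Lustre.
fsck -p /dev/md0
```

After the md rebuild completed, the lctl activate above brought the OST back on each client.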
I took a lot from this page:
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID
Specifically, I did a lot of testing with device-mapper snapshot devices to
check whether a given solution worked, without ever writing to the borked array
until a test succeeded on the overlays.
Great trick, btw.
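For readers who haven't seen it: the overlay trick from that wiki page puts a copy-on-write snapshot over each suspect disk, backed by a sparse file, so trial repairs (mdadm assembles, fscks) write only to the overlay and never touch the original. A sketch with a hypothetical device:

```shell
# Create a sparse 4 GiB backing file for the copy-on-write overlay.
dd if=/dev/zero of=overlay-sdg bs=1M count=0 seek=4096
loop=$(losetup -f --show overlay-sdg)

# Build a dm snapshot over the real disk: writes go to the loop file,
# reads fall through to /dev/sdg, which stays untouched.
size=$(blockdev --getsz /dev/sdg)
dmsetup create sdg-overlay --table "0 $size snapshot /dev/sdg $loop P 8"

# Experiment against /dev/mapper/sdg-overlay; if the attempt fails,
# tear it down and recreate a clean overlay to try again:
#   dmsetup remove sdg-overlay
```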
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
brockp(a)umich.edu
(734)936-1985
On Aug 4, 2014, at 1:16 PM, Lee, Brett <brett.lee(a)intel.com> wrote:
> Hi Brock,
>
> The Lustre manual covers the different (temporarily failed, permanently
> failed) scenarios.
>
>
> https://wiki.hpdd.intel.com/display/PUB/Documentation
>
> Chapter 14 - Lustre Maintenance in the 2.x manual
>
> 14.1 illustrates a mount option that may be what you're looking for:
>
> "mount -o exclude=testfs-OST0000 ..."
>
> Brett Lee
> Solutions Architect, High Performance Data Division
>
>
>
>
>> -----Original Message-----
>> From: HPDD-discuss [mailto:hpdd-discuss-bounces@lists.01.org] On
>> Behalf Of Brock Palen
>> Sent: Monday, August 04, 2014 10:53 AM
>> To: hpdd-discuss(a)lists.01.org
>> Subject: [HPDD-discuss] removing dead OST,
>>
>> We just lost an OST failure in a legacy lustre 1.8 filesystem,
>>
>> How can one go about bringing the filesystem up without this OST?
>>
>> Thanks,
>>
>> Brock Palen
>>
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> brockp(a)umich.edu
>> (734)936-1985
>>
>>
>