You have triggered this bug.
https://jira.hpdd.intel.com/browse/LU-5778
I think you have to go to 2.5.4 to work around it.
On Fri, Dec 22, 2017 at 10:50 AM, Nicolas Gonzalez <Nicolas.Gonzalez(a)alma.cl> wrote:
> Hello.
>
> OST10 remains inactive. Two disks failed in its RAID array, so the OST was
> disabled and the data on it was unlinked.
>
> The Lustre version is 2.5.3.
>
>
> Do you have any tips to solve this problem?
>
>
> Thanks!
>
>
>
>
> ------------------------------
> *From:* Tim Carlson <tim.s.carlson(a)gmail.com>
> *Sent:* Friday, December 22, 2017 12:50 PM
> *To:* Nicolas Gonzalez
> *Cc:* hpdd-discuss(a)lists.01.org
> *Subject:* Re: [HPDD-discuss] An unbalancing Lustre fs write the first
> ACTIVE OST always
>
> I have seen this before, where an INACTIVE OST stops Lustre from using the
> OSTs past that index. Can you reactivate OST10? In my case this was Lustre
> 2.5.4.
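For reference, reactivating an OST is normally done with lctl on the MDS (and, if needed, per client); the device index placeholder below must be filled in from `lctl dl` output, and the fsname `jaopost` is taken from the df listing later in this thread:

```shell
# On the MDS: find the device index of the entry for the inactive OST
lctl dl | grep jaopost-OST0010

# Reactivate it, using the index shown in the first column of `lctl dl`
lctl --device <index> activate

# On a client, the per-OSC state can also be re-enabled via set_param
lctl set_param osc.jaopost-OST0010-*.active=1
```

These are standard lctl invocations, but they must be run against a live Lustre system; the index and target names will differ per installation.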
>
>
>
> On Fri, Dec 22, 2017 at 3:21 AM, Nicolas Gonzalez <Nicolas.Gonzalez(a)alma.cl> wrote:
>
>>
>> Hello,
>> We have a Lustre fs for data reduction, currently with the following usage
>> distribution:
>>
>> UUID                   1K-blocks         Used    Available  Use% Mounted on
>> jaopost-MDT0000_UUID    652420096     35893004    573024684    6% /.lustre/jaopost[MDT:0]
>> jaopost-MDT0001_UUID    307547736       834192    286206104    0% /.lustre/jaopost[MDT:1]
>> jaopost-OST0000_UUID  15617202700  15384873240    232295720   99% /.lustre/jaopost[OST:0]
>> jaopost-OST0001_UUID  15617202700  15418334924    198855308   99% /.lustre/jaopost[OST:1]
>> jaopost-OST0002_UUID  15617202700  15462419636    154754580   99% /.lustre/jaopost[OST:2]
>> jaopost-OST0003_UUID  15617202700  15461905276    155125548   99% /.lustre/jaopost[OST:3]
>> jaopost-OST0004_UUID  15617202700  15476870016    140305764   99% /.lustre/jaopost[OST:4]
>> jaopost-OST0005_UUID  15617202700  15550920180     66263692  100% /.lustre/jaopost[OST:5]
>> jaopost-OST0006_UUID  15617202700  15495824888    121358212   99% /.lustre/jaopost[OST:6]
>> jaopost-OST0007_UUID  15617202700  15509071792    108086048   99% /.lustre/jaopost[OST:7]
>> jaopost-OST0008_UUID  15617202700  15465714268    151463980   99% /.lustre/jaopost[OST:8]
>> jaopost-OST0009_UUID  15617202700  15490943928    126146476   99% /.lustre/jaopost[OST:9]
>> jaopost-OST000a_UUID  15617202700  15447985132    169182460   99% /.lustre/jaopost[OST:10]
>> jaopost-OST000b_UUID  15617202700  15364135336    253034356   98% /.lustre/jaopost[OST:11]
>> jaopost-OST000c_UUID  15617202700  15532906368     84281576   99% /.lustre/jaopost[OST:12]
>> jaopost-OST000d_UUID  15617202700  15485639672    131543112   99% /.lustre/jaopost[OST:13]
>> jaopost-OST000e_UUID  15617202700  15528786804     88404480   99% /.lustre/jaopost[OST:14]
>> jaopost-OST000f_UUID  15617202700  15523110328     94092292   99% /.lustre/jaopost[OST:15]
>> OST0010 : inactive device
>> jaopost-OST0011_UUID  15617202700  13303847400   2313354908   85% /.lustre/jaopost[OST:17]
>> jaopost-OST0012_UUID  15617202700   2593078056  13024119288   17% /.lustre/jaopost[OST:18]
>> jaopost-OST0013_UUID  15617202700    580724544  15036476468    4% /.lustre/jaopost[OST:19]
>> jaopost-OST0014_UUID  15617202700   1793039312  13824161232   11% /.lustre/jaopost[OST:20]
>> jaopost-OST0015_UUID  15617202700   4323099708  11294102856   28% /.lustre/jaopost[OST:21]
>> jaopost-OST0016_UUID  15617202700    281201736  15336000780    2% /.lustre/jaopost[OST:22]
>> jaopost-OST0017_UUID  15617202700    110096064  15507106444    1% /.lustre/jaopost[OST:23]
>> jaopost-OST0018_UUID  15617202700   2858929908  12758272512   18% /.lustre/jaopost[OST:24]
>>
>> OSTs 17-24 were added. I followed the documented procedure: OSTs 0-15
>> were disabled and lfs_migrate was started. Because of problems with the
>> reduction software, every working folder has a stripe count of 1 and the
>> stripe offset set to -1.
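The layout described above (stripe count 1, offset -1) would typically be applied with lfs setstripe; a minimal sketch, assuming a hypothetical working directory under the mount point shown in the df listing:

```shell
# Stripe count 1; offset -1 lets the allocator choose the OST for each new file
lfs setstripe -c 1 -i -1 /.lustre/jaopost/workdir

# Verify the layout that new files in this directory will inherit
lfs getstripe -d /.lustre/jaopost/workdir
```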
>>
>> But only OST 17 filled up (some data was moved with lfs_migrate, forced
>> onto specific OSTs). I ran some tests with the dd command and the
>> situation was the same: only by setting the stripe index to a specific
>> OST would dd write to a different target.
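A test along the lines described could look like this (file names are hypothetical; `lfs getstripe` on a file reports the obdidx that was actually chosen):

```shell
# Default placement: offset -1, the allocator chooses the OST
lfs setstripe -c 1 -i -1 /.lustre/jaopost/test-default
dd if=/dev/zero of=/.lustre/jaopost/test-default bs=1M count=100
lfs getstripe /.lustre/jaopost/test-default

# Forced placement: pin the file to a specific OST index, e.g. 18
lfs setstripe -c 1 -i 18 /.lustre/jaopost/test-forced
dd if=/dev/zero of=/.lustre/jaopost/test-forced bs=1M count=100
lfs getstripe /.lustre/jaopost/test-forced
```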
>>
>> I repeated the test on another cluster we have, and the files were
>> correctly written across different OSTs with offset -1.
>>
>> I changed the priority of the QOS algorithm
>> (/proc/fs/lustre/lov/*/qos_prio_free) to 100%, and the result was the same.
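The same tunable can be set through lctl rather than writing /proc directly. The related qos_threshold_rr knob, which forces round-robin allocation when raised to 100%, may also be worth checking here, though whether it helps in this case is an assumption:

```shell
# Weight free space at 100% in the QOS allocator (equivalent to the /proc write)
lctl set_param lov.*.qos_prio_free=100

# Optionally force round-robin allocation regardless of free-space imbalance
lctl set_param lov.*.qos_threshold_rr=100

# Confirm the current values
lctl get_param lov.*.qos_prio_free lov.*.qos_threshold_rr
```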
>>
>> Do you have any idea what the root cause is?
>>
>> Could it be a bug or a setup problem?
>>
>> Thanks in advance...
>>
>>
>>
>>
>> _______________________________________________
>> HPDD-discuss mailing list
>> HPDD-discuss(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/hpdd-discuss
>>
>>
>