You have triggered this bug.

https://jira.hpdd.intel.com/browse/LU-5778

I think you have to go to 2.5.4 to work around it.
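
After the upgrade, something like this should confirm the running version on the servers and clients (the output shown is only illustrative):

    # Report the Lustre version on this node
    lctl get_param version
    # version=
    # lustre: 2.5.4
    # ...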

On Fri, Dec 22, 2017 at 10:50 AM, Nicolas Gonzalez <Nicolas.Gonzalez@alma.cl> wrote:

Hello.

OST10 remains inactive. Two disks failed in its RAID array, so the OST was disabled and the data on it was unlinked.

The Lustre version is 2.5.3.
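
For reference, this is roughly how I check the state (the wildcard parameter path is from memory, so the exact name may differ):

    # 'lfs df' on a client reports OST0010 as an inactive device (as in the listing I sent earlier)
    lfs df /.lustre/jaopost

    # Whether the OSC was administratively deactivated:
    lctl get_param osc.jaopost-OST0010-*.active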


 Do you have any tips to solve this problem?


Thanks!





From: Tim Carlson <tim.s.carlson@gmail.com>
Sent: Friday, December 22, 2017 12:50 PM
To: Nicolas Gonzalez
Cc: hpdd-discuss@lists.01.org
Subject: Re: [HPDD-discuss] An unbalancing Lustre fs write the first ACTIVE OST always
 
I have seen this before, where an INACTIVE OST will stop Lustre from using the OSTs past that index. Can you reactivate OST10? In my case this was on Lustre 2.5.4.
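
If the OST can still be mounted, reactivating it is usually along these lines (the device number is whatever 'lctl dl' prints on your MDS; I'm writing this from memory, so double-check against the manual):

    # On the MDS: find the OSC device for the inactive OST and activate it
    lctl dl | grep jaopost-OST0010
    lctl --device <devno> activate      # <devno> = device index from 'lctl dl'

    # On the clients: re-enable the OSC as well
    lctl set_param osc.jaopost-OST0010-*.active=1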



On Fri, Dec 22, 2017 at 3:21 AM, Nicolas Gonzalez <Nicolas.Gonzalez@alma.cl> wrote:


Hello
We have a Lustre fs for data reduction, and this is the current usage distribution:

UID                   1K-blocks        Used   Available Use% Mounted on
jaopost-MDT0000_UUID   652420096    35893004   573024684   6% /.lustre/jaopost[MDT:0]
jaopost-MDT0001_UUID   307547736      834192   286206104   0% /.lustre/jaopost[MDT:1]
jaopost-OST0000_UUID 15617202700 15384873240   232295720  99% /.lustre/jaopost[OST:0]
jaopost-OST0001_UUID 15617202700 15418334924   198855308  99% /.lustre/jaopost[OST:1]
jaopost-OST0002_UUID 15617202700 15462419636   154754580  99% /.lustre/jaopost[OST:2]
jaopost-OST0003_UUID 15617202700 15461905276   155125548  99% /.lustre/jaopost[OST:3]
jaopost-OST0004_UUID 15617202700 15476870016   140305764  99% /.lustre/jaopost[OST:4]
jaopost-OST0005_UUID 15617202700 15550920180    66263692 100% /.lustre/jaopost[OST:5]
jaopost-OST0006_UUID 15617202700 15495824888   121358212  99% /.lustre/jaopost[OST:6]
jaopost-OST0007_UUID 15617202700 15509071792   108086048  99% /.lustre/jaopost[OST:7]
jaopost-OST0008_UUID 15617202700 15465714268   151463980  99% /.lustre/jaopost[OST:8]
jaopost-OST0009_UUID 15617202700 15490943928   126146476  99% /.lustre/jaopost[OST:9]
jaopost-OST000a_UUID 15617202700 15447985132   169182460  99% /.lustre/jaopost[OST:10]
jaopost-OST000b_UUID 15617202700 15364135336   253034356  98% /.lustre/jaopost[OST:11]
jaopost-OST000c_UUID 15617202700 15532906368    84281576  99% /.lustre/jaopost[OST:12]
jaopost-OST000d_UUID 15617202700 15485639672   131543112  99% /.lustre/jaopost[OST:13]
jaopost-OST000e_UUID 15617202700 15528786804    88404480  99% /.lustre/jaopost[OST:14]
jaopost-OST000f_UUID 15617202700 15523110328    94092292  99% /.lustre/jaopost[OST:15]
OST0010             : inactive device
jaopost-OST0011_UUID 15617202700 13303847400  2313354908  85% /.lustre/jaopost[OST:17]
jaopost-OST0012_UUID 15617202700  2593078056 13024119288  17% /.lustre/jaopost[OST:18]
jaopost-OST0013_UUID 15617202700   580724544 15036476468   4% /.lustre/jaopost[OST:19]
jaopost-OST0014_UUID 15617202700  1793039312 13824161232  11% /.lustre/jaopost[OST:20]
jaopost-OST0015_UUID 15617202700  4323099708 11294102856  28% /.lustre/jaopost[OST:21]
jaopost-OST0016_UUID 15617202700   281201736 15336000780   2% /.lustre/jaopost[OST:22]
jaopost-OST0017_UUID 15617202700   110096064 15507106444   1% /.lustre/jaopost[OST:23]
jaopost-OST0018_UUID 15617202700  2858929908 12758272512  18% /.lustre/jaopost[OST:24]


OSTs 17-24 were added following the documentation procedure, OSTs 0-15 were disabled, and lfs_migrate was started. Because of issues with the reduction software, every working folder has a stripe count of 1 and the stripe offset set to -1.
However, only OST 17 is being filled (some data was moved with lfs_migrate and forced onto a specific OST). I ran some tests with the dd command and saw the same behaviour: only when I change the stripe index to a specific OST does dd write to a different target.

I repeated the test on another cluster that we have, and there the files were correctly written across different OSTs with offset -1.
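
For reference, the layout and the test were roughly the following (the directory path here is just an example):

    # Working folders use stripe count 1 and let the MDS pick the OST (-1)
    lfs setstripe -c 1 -i -1 /.lustre/jaopost/workdir

    # Write a test file and check which OST it landed on
    dd if=/dev/zero of=/.lustre/jaopost/workdir/testfile bs=1M count=1024
    lfs getstripe /.lustre/jaopost/workdir/testfile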

I changed the priority in the QOS algorithm, /proc/fs/lustre/lov/*/qos_prio_free, to 100%, and the result was the same.
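
For completeness, the same tunable can also be read and set through lctl (qos_threshold_rr is only the related knob, shown for context):

    # Equivalent of writing /proc/fs/lustre/lov/*/qos_prio_free
    lctl get_param lov.*.qos_prio_free
    lctl set_param lov.*.qos_prio_free=100

    # Related tunable: below this free-space imbalance, allocation stays round-robin
    lctl get_param lov.*.qos_threshold_rr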

Do you have any idea what the root cause is?
Could it be a bug or a setup problem?

Thanks in advance...





_______________________________________________
HPDD-discuss mailing list
HPDD-discuss@lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss