Hi Andreas,
Thank you for your input. I see now that it was the caching effect that made the second run appear
faster. Knowing this was very helpful, and because of it I chose a client with a large amount of
memory so that I can repeat my scans (lfs find) and benefit from the cached inode entries.
Clearing the client locks (lctl set_param ldlm.namespaces.*mdc*.lru_size=clear) also helped
"lfs find" complete successfully. At that point I was quite happy that I was
making progress and nearing completion.
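For reference, each repeat scan on the client now looks roughly like this (the OST UUID and the
output file name are just placeholders):
# lctl set_param ldlm.namespaces.*mdc*.lru_size=clear
# lfs find --obd <OST_UUID> /scratch > /tmp/<OST_UUID>.list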
Now I have something odd happening that I can't figure out; I hope there is an
explanation for it on v2.4.3.
An "lfs find" that completes successfully fails to find all files belonging to
a given --obd XYZ, while the same search succeeds with --ost, as shown below. The results
found with --ost are also more consistent with what robinhood reports.
[Context: scratch-OST0010_UUID 6.5T 1.3T 5.3T 20% /scratch[OST:16]; OST0010 (hex) is OST index 16]
# lfs find /scratch/users/amit/mfc1006.hpc.smu.edu/sc11g
/scratch/users/amit/mfc1006.hpc.smu.edu/sc11g
/scratch/users/amit/mfc1006.hpc.smu.edu/sc11g/results
/scratch/users/amit/mfc1006.hpc.smu.edu/sc11g/filename
# lfs find --obd scratch-OST0010_UUID /scratch/users/amit/mfc1006.hpc.smu.edu/sc11g
<no output; the command above returned an empty result>
# lfs find --ost 16 /scratch/users/amit/mfc1006.hpc.smu.edu/sc11g
/scratch/users/amit/mfc1006.hpc.smu.edu/sc11g/filename
# lfs getstripe /scratch/users/amit/mfc1006.hpc.smu.edu/sc11g
/scratch/users/amit/mfc1006.hpc.smu.edu/sc11g
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
/scratch/users/amit/mfc1006.hpc.smu.edu/sc11g/results
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 33
lmm_pool: default
obdidx objid objid group
33 418949 0x66485 0
/scratch/users/amit/mfc1006.hpc.smu.edu/sc11g/filename
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 16
obdidx objid objid group
16 2104 0x838 0
7 485877 0x769f5 0
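As a further cross-check, the recursive getstripe filter might help confirm which entries actually
carry objects on that OST (assuming getstripe's -O/--obd option is available on 2.4.3):
# lfs getstripe -r -O scratch-OST0010_UUID /scratch/users/amit/mfc1006.hpc.smu.edu/sc11g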
I am rerunning my scans with --ost and robinhood to make sure I capture the files I
missed, but I still don't understand why the --obd form behaves differently.
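The rerun is roughly along these lines, one pass per OST index (the indices shown are only examples):
# for idx in 16 33; do lfs find --ost "$idx" /scratch > /tmp/scratch-ost${idx}.list; done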
Thank you,
Amit
-----Original Message-----
From: Dilger, Andreas
Sent: Wednesday, June 8, 2016 7:11 PM
To: Kumar, Amit
Cc: hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] lfs find tips or tricks
On 2016/06/08, 14:47, "Kumar, Amit" <ahkumar(a)mail.smu.edu> wrote:
Hi Andreas,
I just wanted to add that switching to "lfs find --ost index1 --ost
index2 ... /scratch" yields quicker results than using "lfs find
--obd X --obd Y ... /scratch"; I am not sure why.
That might only be caused by the MDS/OSS nodes already having the files in cache for the
second run and not the first. The "--ost" and "--obd"
options are just aliases for the same "-O" option internally, so they should
behave exactly the same way.
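For example, these forms should all end up as the same internal filter (the UUID and index here
are placeholders):
# lfs find -O <OST_UUID> /scratch
# lfs find --obd <OST_UUID> /scratch
# lfs find --ost <index> /scratch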
Also, timeout and ldlm_timeout on the MDS/OSS/clients were at 30 and 6 (or 33),
respectively. I bumped timeout up to over 1000 and ldlm_timeout to 300, and that
has kept the "blocking AST" lock timeouts from occurring, so no client evictions
have happened so far while running the scans.
The timeouts should definitely be consistent between the clients and servers. You might
not need them to be as high as 1000 and 300, since that can make failure detection and
recovery very slow, but you can tune them later when your migration is finished. Probably
300 is OK for most cases.
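For example, something like the following can be used to check and set the value consistently on
each node ("scratch" being your fsname; conf_param is run on the MGS if you want the setting to
persist):
# lctl get_param timeout ldlm_timeout
# lctl set_param timeout=300
# lctl conf_param scratch.sys.timeout=300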
Cheers, Andreas
I just can't wait to
release these OSS servers and migrate to IEEL 3.0 ... I am sure these
issues have to do with version 2.4.3, which we are currently running.
Thank you,
Amit
-----Original Message-----
From: HPDD-discuss [mailto:hpdd-discuss-bounces@lists.01.org] On Behalf
Of Kumar, Amit
Sent: Tuesday, June 7, 2016 10:38 PM
To: Dilger, Andreas
Cc: hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] lfs find tips or tricks
Hi Andreas,
Thank you for your reply. I will try to see if there is any difference between running
"lfs find" and an RBH query; it will be a good test.
One thing I found by attaching strace to "lfs find" is that it actually walks the file
system file by file and directory by directory, checking whether each file belongs to the
given "--obd".
I was under the impression that "lfs find" would hand the scanning off to the MDS and
simply wait for the results, but that does not seem to be the case. I thought it was good
to know.
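What I did was essentially attach strace to the running process and watch the per-entry
syscalls, roughly like this (the syscall filter is optional, just to cut down the noise):
# strace -f -p <pid of lfs find> -e trace=open,openat,getdents,ioctl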
Q) I ran more "lfs find --obd X --obd Y ... /scratch" scans today in the hope that they
would finish without issues, but the client ended up being evicted by the MDS after about
4 hours, as shown in the messages below. Any idea under what circumstances I could be
evicted?
Best Regards,
Amit
# dmesg
LustreError: 11-0: scratch-MDT0000-mdc-ffff885fec564000: Communicating with
10.42.130.10@o2ib, operation mds_getattr_lock failed with -107.
LustreError: Skipped 6 previous similar messages
Lustre: scratch-MDT0000-mdc-ffff885fec564000: Connection to
scratch-MDT0000 (at 10.42.130.10@o2ib) was lost; in progress operations
using this service will wait for recovery to complete
LustreError: 167-0: scratch-MDT0000-mdc-ffff885fec564000: This client was evicted by
scratch-MDT0000; in progress operations using this service will fail.
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue:
-108
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) scratch: refresh file layout
[0x200012804:0x27:0x0] error -108.
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue:
-108
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 23193
previous similar messages
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) scratch: refresh file layout
[0x200012804:0x27:0x0] error -108.
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) Skipped 23193
previous similar messages
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue:
-108
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 48401
previous similar messages
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) scratch: refresh file layout
[0x200012804:0x27:0x0] error -108.
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) Skipped 48401
previous similar messages
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue:
-108
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 100099
previous similar messages
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) scratch: refresh file layout
[0x200012804:0x27:0x0] error -108.
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) Skipped 100099
previous similar messages
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue:
-108
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 201363
previous similar messages
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) scratch: refresh file layout
[0x200012804:0x27:0x0] error -108.
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) Skipped 201363
previous similar messages
LustreError: 66120:0:(file.c:171:ll_close_inode_openhandle()) inode
144116297737699971 mdc close failed: rc = -108
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) scratch: refresh file layout
[0x200012804:0x27:0x0] error -108.
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) Skipped 400423
previous similar messages
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue:
-108
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 400424
previous similar messages
LustreError: 66120:0:(file.c:171:ll_close_inode_openhandle()) inode
144116297737699972 mdc close failed: rc = -108
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue:
-108
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 797596
previous similar messages
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) scratch: refresh file layout
[0x200012804:0x27:0x0] error -108.
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) Skipped 797597
previous similar messages
LustreError: 66120:0:(file.c:171:ll_close_inode_openhandle()) inode
144116297737699973 mdc close failed: rc = -108
LustreError: 66120:0:(file.c:171:ll_close_inode_openhandle()) inode
144116297737699942 mdc close failed: rc = -108
LustreError: 66120:0:(file.c:171:ll_close_inode_openhandle()) inode
144116297737699975 mdc close failed: rc = -108
LustreError: 66120:0:(file.c:171:ll_close_inode_openhandle()) Skipped 1
previous similar message
LustreError: 66120:0:(file.c:171:ll_close_inode_openhandle()) inode
144116297737699976 mdc close failed: rc = -108
LustreError: 66120:0:(file.c:171:ll_close_inode_openhandle()) Skipped 2
previous similar messages
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue:
-108
LustreError: 125945:0:(mdc_locks.c:918:mdc_enqueue()) Skipped 1596912
previous similar messages
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) scratch: refresh file layout
[0x200012804:0x27:0x0] error -108.
LustreError: 125945:0:(vvp_io.c:1229:vvp_io_init()) Skipped 1596911
previous similar messages
LustreError: 66125:0:(dir.c:378:ll_get_dir_page()) lock enqueue:
[0x200012804:0x5:0x0] at 0: rc -108
LustreError: 66125:0:(dir.c:584:ll_dir_read()) error reading dir
[0x200012804:0x5:0x0] at 0: rc -108
LustreError: 66120:0:(file.c:171:ll_close_inode_openhandle()) inode
144116297737699949 mdc close failed: rc = -108
LustreError: 66120:0:(file.c:171:ll_close_inode_openhandle()) Skipped 9
previous similar messages