Jira and Wiki Disruptions
by Joshua J. Kugler
All -
Due to an unannounced reboot of the VM that hosts our JIRA and Wiki
instances, the GreenHopper (agile) plugin is not currently working on
JIRA.
JIRA (and possibly Confluence) may go down for short intervals today
while we work with Atlassian to troubleshoot this problem.
We apologize for any inconvenience.
j
--
Dev/Ops Lead
High Performance Data Division (formerly Whamcloud)
Intel
9 years
Crash of evicted Lustre client
by Götz Waschk
Dear All,
I've seen several client machines crash lately. The Lustre servers are
under high load, and the clients seem to run into time-outs and get
evicted. My clients run Lustre 1.8.9, and the servers run 2.1.3, 1.8.7,
and 1.8.8.
Is there a known bug in the 1.8.9 client that could cause this?
Regards, Götz Waschk
9 years
Re: [HPDD-discuss] How to run TestDfsio on hadoop running on lustre?
by linux freaker
I created the directory /benchmarks/TestDFSIO under / and linked it with:
ln -s /benchmarks /mnt/lustre
But I am getting this error:
# bin/hadoop jar hadoop-test-1.1.1.jar TestDFSIO -read -nrFiles 10 -filesize 1000
TestDFSIO.0.0.4
13/05/10 15:42:38 INFO fs.TestDFSIO: nrFiles = 10
13/05/10 15:42:38 INFO fs.TestDFSIO: fileSize (MB) = 1
13/05/10 15:42:38 INFO fs.TestDFSIO: bufferSize = 1000000
13/05/10 15:42:38 INFO fs.TestDFSIO: creating control file: 1 mega bytes, 10 files
13/05/10 15:42:38 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/05/10 15:42:38 INFO fs.TestDFSIO: created control files for: 10 files
13/05/10 15:42:39 INFO mapred.FileInputFormat: Total input paths to process : 10
13/05/10 15:42:39 INFO mapred.JobClient: Running job: job_201305101207_0005
13/05/10 15:42:40 INFO mapred.JobClient: map 0% reduce 0%
13/05/10 15:43:12 INFO mapred.JobClient: Task Id : attempt_201305101207_0005_m_000000_0, Status : FAILED
java.io.FileNotFoundException: File file:/benchmarks/TestDFSIO/io_control/in_file_test_io_5 does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:796)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:197)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
13/05/10 15:43:12 WARN mapred.JobClient: Error reading task outputhttp://lustreclient2:50060/tasklog?plaintext=true&attemptid=attempt_201305101207_0005_m_000000_0&filter=stdout
13/05/10 15:43:12 WARN mapred.JobClient: Error reading task outputhttp://lustreclient2:50060/tasklog?plaintext=true&attemptid=attempt_201305101207_0005_m_000000_0&filter=stderr
13/05/10 15:43:12 INFO mapred.JobClient: Task Id : attempt_201305101207_0005_m_000001_0, Status : FAILED
java.io.FileNotFoundException: File file:/benchmarks/TestDFSIO/io_control/in_file_test_io_0 does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:796)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:197)
On Thu, May 9, 2013 at 8:32 PM, linux freaker <linuxfreaker(a)gmail.com> wrote:
> Do I need to create benchmark folder under /mnt/lustre?
>
>
> On Thu, May 9, 2013 at 8:24 PM, Diep, Minh <minh.diep(a)intel.com> wrote:
>
>> You should be able to use the exact same command as in hdfs but make
>> sure you create a symlink from /benchmark or /benchmarks (don't remember
>> with the s or not) to lustre FS.
>>
>> HTH
>> -Minh
>>
>> From: linux freaker <linuxfreaker(a)gmail.com>
>> Date: Thursday, May 9, 2013 1:16 AM
>> To: "hpdd-discuss(a)lists.01.org" <hpdd-discuss(a)lists.01.org>
>> Subject: [HPDD-discuss] How to run TestDfsio on hadoop running on lustre?
>>
>> Any suggestion?
>>
>
>
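A note on the symlink in this thread: `ln -s TARGET LINK_NAME` creates LINK_NAME pointing at TARGET, so `ln -s /benchmarks /mnt/lustre` leaves /benchmarks as an ordinary local directory, whereas the quoted advice asks for a link from /benchmarks to the Lustre file system. The trace shows RawLocalFileSystem looking up file:/benchmarks/..., which would be consistent with each task node reading its own local /benchmarks, though that is an inference and not something the log confirms. The following is a minimal Python sketch of the suggested direction, run in a temporary sandbox; the directory names `mnt_lustre` and `benchmarks` are stand-ins for /mnt/lustre and /benchmarks, and the helper name is hypothetical.

```python
import os
import tempfile


def link_into_shared_fs(shared_root: str, link_path: str) -> str:
    """Create link_path as a symlink to a same-named directory under shared_root."""
    target = os.path.join(shared_root, os.path.basename(link_path))
    os.makedirs(target, exist_ok=True)
    # Equivalent to: ln -s <target> <link_path>
    os.symlink(target, link_path)
    return target


def demo() -> tuple:
    """Return (link is a symlink, data written via the link lands under shared_root)."""
    with tempfile.TemporaryDirectory() as sandbox:
        shared = os.path.join(sandbox, "mnt_lustre")  # stands in for /mnt/lustre
        os.makedirs(shared)
        link = os.path.join(sandbox, "benchmarks")    # stands in for /benchmarks
        link_into_shared_fs(shared, link)
        # Create a directory through the link, as TestDFSIO would.
        os.makedirs(os.path.join(link, "TestDFSIO"))
        landed = os.path.isdir(os.path.join(shared, "benchmarks", "TestDFSIO"))
        return os.path.islink(link), landed


if __name__ == "__main__":
    print(demo())
```

With the link in this direction, anything written under the link path is stored on the shared file system, so every node that has the same link sees the same control files.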
9 years
Error Handle in ldlm_cli_cancel_req()
by Nozaki, Hiroya
Hi, All.
I would like to discuss an issue with the error handling in ldlm_cli_cancel_req().
According to the Lustre 2.3.64 code, ldlm_cli_cancel_req() never resends the ldlm_cancel request when it gets -EAGAIN from ptlrpc_queue_wait(), and ptlrpc_queue_wait() returns -EAGAIN when the import object targeted by the ldlm_cancel request is in a recovery state such as DISCON or CONNECTING. In my experience, this happens quite often, especially on large-scale systems.
I therefore suggest that ldlm_cli_cancel_req() resend the ldlm_cancel request when it gets -EAGAIN. Otherwise, a client that failed to send the ldlm_cancel request will be evicted by the server once the server sends a blocking callback request to that client.
If anyone agrees with this idea, I'd like to open a ticket for the issue in Jira.
Best regards.
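The proposed behaviour can be sketched as a bounded resend loop. This is illustrative Python, not Lustre code: `send_with_retry`, `make_flaky_sender`, the attempt limit, and the delay are all assumptions introduced here, and the message above does not say whether a real fix should bound the number of resends.

```python
import errno
import time


def send_with_retry(send, max_attempts=5, delay_s=0.0):
    """Call send() until it returns something other than -EAGAIN.

    Returns (last return code, number of attempts made).
    """
    rc = -errno.EAGAIN
    for attempt in range(1, max_attempts + 1):
        rc = send()
        if rc != -errno.EAGAIN:
            return rc, attempt
        # The target is still recovering; wait before resending.
        time.sleep(delay_s)
    return rc, max_attempts


def make_flaky_sender(failures):
    """Hypothetical sender: returns -EAGAIN `failures` times, then 0."""
    remaining = [failures]

    def send():
        if remaining[0] > 0:
            remaining[0] -= 1
            return -errno.EAGAIN
        return 0

    return send


if __name__ == "__main__":
    print(send_with_retry(make_flaky_sender(2)))
```

A sender that keeps returning -EAGAIN past `max_attempts` still surfaces the error to the caller, which is the case where eviction remains possible.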
-----------------------------------
Hiroya Nozaki nozaki.hiroya(a)jp.fujitsu.com
Next Generation Technical Computing Unit
Fujitsu, Ltd
Tel: 044-754-8769
Ext: 7103-8594
-----------------------------------
9 years
bringing deactivated ost back online
by Kurt Strosahl
Good Morning,
A bit of back story: several months ago one of my OSSs encountered a hardware error and had to be returned to the vendor. It came back intact (the vendor hadn't touched the hard drive). I have now brought one of the OSTs online (one of six), and we are encountering some unusual issues. I set the OST to active using conf_param, and some of the clients are picking it up while others are not. I was able to force a client to see it by unmounting and remounting Lustre, but this would be awkward to do for every client (~1100), as they are currently in use.
w/r,
Kurt J. Strosahl
System Administrator
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
9 years
OST refused connection from client
by Mi Zhou
Hi,
We sometimes see the following error message on our OSSs:
May 5 20:47:16 lustre-oss03 kernel: Lustre: scratch-OST0006: Client
511ae429-07b7-f9ca-22b6-f0f8839b8029 (at 192.168.102.37@o2ib) refused
reconnection, still busy with 1 active RPCs
On the client whose reconnection was refused, the errors are as follows:
May 5 20:47:03 nodem37 kernel: Lustre:
2424:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for sent delay: [sent 1367804814/real 0] req@ffff881849d84800
x1433750448809719/t0(0)
o101->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:28/4 lens
296/352 e 0 to 1 dl 1367804823 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
May 5 20:47:03 nodem37 kernel: Lustre:
2424:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 4 previous
similar messages
May 5 20:47:03 nodem37 kernel: Lustre:
scratch-OST0008-osc-ffff880c3fe37400: Connection to scratch-OST0008 (at
192.168.100.3@o2ib) was lost; in progress operations using this service
will wait for recovery to complete
May 5 20:47:03 nodem37 kernel: Lustre: Skipped 1 previous similar message
May 5 20:47:04 nodem37 kernel: Lustre:
2424:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for sent delay: [sent 1367804815/real 0] req@ffff880a86515400
x1433750448809779/t0(0)
o101->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:28/4 lens
296/352 e 0 to 1 dl 1367804824 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
May 5 20:47:04 nodem37 kernel: Lustre:
2424:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 6 previous
similar messages
May 5 20:47:05 nodem37 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.100.3@o2ib. The ost_connect operation
failed with -16
May 5 20:47:05 nodem37 kernel: LustreError: Skipped 1 previous similar
message
May 5 20:47:05 nodem37 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.100.3@o2ib. The ost_connect operation
failed with -16
May 5 20:47:28 nodem37 kernel: Lustre:
scratch-OST0007-osc-ffff880c3fe37400: Connection restored to
scratch-OST0007 (at 192.168.100.3@o2ib)
May 5 20:47:28 nodem37 kernel: Lustre: Skipped 1 previous similar message
May 5 20:49:09 nodem37 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.100.3@o2ib. The ost_destroy operation
failed with -107
May 5 20:49:09 nodem37 kernel: LustreError: Skipped 1 previous similar
message
May 5 20:49:09 nodem37 kernel: Lustre:
scratch-OST0008-osc-ffff880c3fe37400: Connection to scratch-OST0008 (at
192.168.100.3@o2ib) was lost; in progress operations using this service
will wait for recovery to complete
May 5 20:49:09 nodem37 kernel: Lustre: Skipped 2 previous similar messages
May 5 20:49:09 nodem37 kernel: LustreError: 167-0: This client was
evicted by scratch-OST0008; in progress operations using this service
will fail.
May 5 20:49:09 nodem37 kernel: LustreError:
2422:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@ffff88184061d400 x1433750448823924/t0(0)
o4->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:6/4 lens
456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
May 5 20:49:09 nodem37 kernel: LustreError:
2422:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 5687 previous
similar messages
May 5 20:49:09 nodem37 kernel: LustreError:
5585:0:(osc_lock.c:809:osc_ldlm_completion_ast())
lock@ffff88128c19b698[2 2 0 1 1 00000000] W(2):[0,
0]@[0x100080000:0xcdb5aed5:0x0] {
May 5 20:49:09 nodem37 kernel: LustreError:
5585:0:(osc_lock.c:809:osc_ldlm_completion_ast())
lovsub@ffff880db54ec860: [0 ffff8810e95d6e30 W(2):[0,
0]@[0x201c50c90:0x16927:0x0]]
May 5 20:49:09 nodem37 kernel: LustreError:
5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@ffff88169bf71d78:
ffff881344ac6240 40120002 0x7293132dc153773c 2 (null) size: 0 mtime:
1367804804 atime: 1367804804 ctime: 1367804804 blocks: 0
May 5 20:49:09 nodem37 kernel: LustreError:
5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) } lock@ffff88128c19b698
May 5 20:49:09 nodem37 kernel: LustreError:
5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
May 5 20:49:09 nodem37 kernel: Lustre:
scratch-OST0008-osc-ffff880c3fe37400: Connection restored to
scratch-OST0008 (at 192.168.100.3@o2ib)
Has anybody seen this? Any advice is appreciated.
Thanks
Mi
Email Disclaimer: www.stjude.org/emaildisclaimer
Consultation Disclaimer: www.stjude.org/consultationdisclaimer
9 years
Issue bringing ib0 up on RHEL 6.3
by linux freaker
Hi,
I am trying to set up a Lustre environment but am facing an issue with
the InfiniBand setup.
To bring up ib0, I am running:
[root@slave3 ~]# /etc/rc.d/init.d/rdma restart
Unloading OpenIB kernel modules:
Found opensm running.
Please stop all RDMA applications before downing the stack.
[FAILED]
Loading OpenIB kernel modules:FATAL: Error inserting ib_addr
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/core/ib_addr.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
Failed to load module WARNING: Error inserting ib_core
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/core/ib_core.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
WARNING: Error inserting ib_mad
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/core/ib_mad.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
WARNING: Error inserting ib_sa
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/core/ib_sa.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
WARNING: Error inserting iw_cm
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/core/iw_cm.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
WARNING: Error inserting ib_cm
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/core/ib_cm.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
FATAL: Error inserting rdma_cm
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/core/rdma_cm.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
Failed to load module WARNING: Error inserting iw_cm
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/core/iw_cm.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
WARNING: Error inserting ib_cm
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/core/ib_cm.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
WARNING: Error inserting rdma_cm
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/core/rdma_cm.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
FATAL: Error inserting rdma_ucm
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/core/rdma_ucm.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
Failed to load module FATAL: Error inserting ib_ipoib
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko):
Unknown symbol in module, or unknown parameter (see dmesg)
Failed to load module [FAILED]
[root@slave3 ~]#
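Because the output above repeats the same "Unknown symbol" failure many times, it can help to reduce it to the set of modules involved before looking at dmesg. The following is a minimal Python sketch under the assumption that each failure appears as "FATAL: Error inserting NAME" or "WARNING: Error inserting NAME", as it does in the paste; the function name is hypothetical.

```python
import re

INSERT_ERROR = re.compile(r"(FATAL|WARNING): Error inserting (\w+)")


def summarize_failed_modules(output: str) -> dict:
    """Map each module name to the most severe level reported for it."""
    worst = {}
    for level, module in INSERT_ERROR.findall(output):
        # Once a module has been reported as FATAL, keep it that way.
        if worst.get(module) != "FATAL":
            worst[module] = level
    return worst


if __name__ == "__main__":
    sample = (
        "Loading OpenIB kernel modules:FATAL: Error inserting ib_addr\n"
        "Failed to load module WARNING: Error inserting ib_core\n"
        "WARNING: Error inserting rdma_cm\n"
        "FATAL: Error inserting rdma_cm\n"
    )
    for module, level in summarize_failed_modules(sample).items():
        print(level, module)
```

Every failing path in the paste sits under the same 2.6.32-279.14.1.el6_lustre.x86_64 module tree, so the summary is mainly a shortlist of which module names to search for in dmesg.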
9 years
Mellanox OFED MLNX_OFED_LINUX-1.5.3-3 for Lustre not working
by linux freaker
Hi,
I installed RHEL 6.3, which by default comes with openib (InfiniBand
support software). I needed to install
MLNX_OFED_LINUX-1.5.3-3.1.0-rhel6.3-x86_64.iso, so I extracted it and ran
the ./mlnxofedinstall script. It spent a while removing the old Mellanox
packages and drivers but could not install because of the Lustre kernel.
Out of curiosity, I installed it on the default RHEL kernel, where it went
fine. After I rebooted and tried to bring Lustre up with lctl network up,
it threw errors on the MDS:
[root@oss2 ~]# modprobe lustre
WARNING: Error inserting fld
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/fld.ko):
Input/output error
WARNING: Error inserting fid
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/fid.ko):
Input/output error
WARNING: Error inserting mdc
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/mdc.ko):
Input/output error
WARNING: Error inserting osc
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/osc.ko):
Input/output error
WARNING: Error inserting lov
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/lov.ko):
Input/output error
FATAL: Error inserting lustre
(/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/lustre.ko):
Input/output error
[root@oss2 ~]# modprobe lnet
[root@oss2 ~]# lctl network up
LNET configure error 100: Network is down
It seemed that I needed to build OFED for Lustre, and I found this thread:
http://thr3ads.net/lustre-discuss/2012/12/2164117-problem-with-installing...
It suggested the following steps, which I followed line by line:
2. boot into the lustre kernel
3. in our /usr/src/lustre-2.1.2 directory built lustre against the
Mellanox "Module.symvers" information (which is why you see the
"Input/Output" errors on fid.ko, mdc.ko, osc.ko, lov.ko and because of
the aforementioned items, the lustre.ko. The MLNX version 1.8.5 that
we needed was in the /usr/src/ofa_kernel directory (with the
Module.symvers etc....) We used the defaults other than the o2ib so
our command in the /usr/src/lustre-2.1.2 directory looked like
"./configure --with-o2ib=/usr/src/ofa_kernel"
4. next we issued "make"
5. next we chose to run a "make rpms" command so that we could have
rpms for our system for cluster re-building
But even this failed to bring Lustre up. modprobe lnet works, but lctl
network up does not:
lctl list_nids
IOC_LIBCFS_GET_NI error 100: Network is down
My ifconfig shows:
ib0 Link encap:InfiniBand HWaddr
80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::202:c903:b:8b85/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:2628 errors:0 dropped:0 overruns:0 frame:0
TX packets:20 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1024
RX bytes:155819 (152.1 KiB) TX bytes:3503 (3.4 KiB)
uname -arn
Linux oss2 2.6.32-279.14.1.el6_lustre.x86_64 #1 SMP Fri Dec 14 23:22:17 PST
2012 x86_64 x86_64 x86_64 GNU/Linux
I also tried running the kernel support script under the MLNX directory,
and it did install an RPM, but I still had no luck with lctl list_nids.
Can anyone suggest how to fix this?
9 years