Hello everybody.
Here is a strange problem that could perhaps be reported as a bug.
Explanation:
Lustre 2.7 (CentOS 6.6) on the MDS, 2 OSSs (12 OSTs) and ~200 clients. One
single Lustre filesystem.
Running "ungrib.exe"
<http://www2.mmm.ucar.edu/wrf/OnLineTutorial/Basics/UNGRIB/index.html>
for WRF in any directory with "stripe = 1" works perfectly.
Running "ungrib.exe" for WRF in any directory with any "stripe <> 1"
fails.
"ungrib.exe" works by writing medium-size (610MB) files (34 in my case)
into the working directory. If the working directory is configured with
"stripe = 1" (each file goes to a single OST), the process does its
job and finishes cleanly.
If the working directory is configured with any other stripe (tested
with 2, 4, 8, 12 and -1), the process aborts while writing one of the
610MB files (not always the same one).
Executing with "strace", the last lines of the trace are:
[...]
lseek(4, 99309154, SEEK_SET) = 99309154
read(4, "GRIB\0\0\0\2\0\0\0\0\0\v5\304\0\0\0\25\1\0\7\0\0\2\1\1\7\340\4\22"..., 512) = 512
lseek(4, 100043810, SEEK_SET) = 100043810
read(4, "7777", 4) = 4
lseek(4, 99309154, SEEK_SET) = 99309154
read(4, "GRIB\0\0\0\2\0\0\0\0\0\v5\304\0\0\0\25\1\0\7\0\0\2\1\1\7\340\4\22"..., 734660) = 734660
brk(0x1e41f000) = 0x1e41f000
lseek(4, 100043814, SEEK_SET) = 100043814
read(4, "GRIB\0\0\0\2\0\0\0\0\0\22:\376\0\0\0\25\1\0\7\0\0\2\1\1\7\340\4\22"..., 512) = 512
lseek(4, 101238560, SEEK_SET) = 101238560
read(4, "7777", 4) = 4
lseek(4, 100043814, SEEK_SET) = 100043814
read(4, "GRIB\0\0\0\2\0\0\0\0\0\22:\376\0\0\0\25\1\0\7\0\0\2\1\1\7\340\4\22"..., 1194750) = 619482
write(3, "2016-04-19 17:15:43.713 --- ", 28) = 28
write(1, "ERROR: ", 7ERROR: ) = 7
write(3, "ERROR: ", 7) = 7
write(1, "rd_grib2: IO Error. ", 20rd_grib2: IO Error. ) = 20
write(3, "rd_grib2: IO Error. ", 20) = 20
write(1, "1194750", 71194750) = 7
write(3, "1194750", 7) = 7
write(1, " .ne. ", 6 .ne. ) = 6
write(3, " .ne. ", 6) = 6
write(1, "619482", 6619482) = 6
write(3, "619482", 6) = 6
write(1, "\n", 1
) = 1
write(3, "\n", 1) = 1
exit_group(0) = ?
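The call that triggers the error is the last read() on descriptor 4: it asks for 1194750 bytes and gets 619482 back. The arithmetic below only uses the numbers from that trace. The 1 MiB stripe size is an assumption (it has not been checked on this filesystem), and the trace does not show whether the file behind descriptor 4 is striped at all, so treat this as a hint rather than a diagnosis.

```python
# Values copied from the last lseek()/read() pair in the trace.
offset = 100043814      # lseek(4, 100043814, SEEK_SET)
requested = 1194750     # byte count passed to read()
returned = 619482       # byte count read() actually returned

# ASSUMPTION: 1 MiB stripe size; not confirmed for this filesystem.
STRIPE_SIZE = 1 << 20

end_of_short_read = offset + returned
missing = requested - returned

print(end_of_short_read)                 # 100663296
print(end_of_short_read % STRIPE_SIZE)   # 0
print(end_of_short_read // STRIPE_SIZE)  # 96
print(missing)                           # 575268
```

Under that assumption the short read stops exactly at the 96 MiB offset, which would be a stripe boundary.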
All OSTs look OK. No timeouts are detected. The OSSs and the MDS are not
reporting errors. The only pattern detected is the stripe of the working
directory where the data is written. There is no specific "breaking point"
in the process. One time it occurs while writing one file, and another
time while writing a different one.
This issue has also been seen with other programs that write
medium-size files, but there is no easy way to reproduce it.
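One way to look for the same symptom outside of "ungrib.exe" would be a small checker that reads a file in large chunks and reports every read that comes back short before the end of the file. This is only a sketch: the function name find_short_reads is made up for illustration, the default chunk size is simply the request size from the trace, and it has not been shown to trigger the problem.

```python
import os

def find_short_reads(path, chunk_size=1194750):
    """Return (offset, requested, got) for each read that is short before EOF."""
    size = os.path.getsize(path)
    short = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while offset < size:
            os.lseek(fd, offset, os.SEEK_SET)
            # Never ask for more than what is left, so any shortfall is a
            # genuine short read and not just the end of the file.
            want = min(chunk_size, size - offset)
            got = len(os.read(fd, want))
            if got < want:
                short.append((offset, want, got))
            if got == 0:
                break  # nothing returned at all; stop instead of looping forever
            offset += got
    finally:
        os.close(fd)
    return short
```

Running it on a file in a directory with "stripe = 1" and on a copy in a directory with a larger stripe would show whether short reads only appear in the second case; an empty list means every read returned the full amount requested.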
Any idea what could be happening?
Thank you so much.
--
*Jose Manuel Martínez García / Tel. 987 293 174 *
*Systems Coordinator*
Fundación Centro de Supercomputación de Castilla y León
Edificio CRAI-TIC, Campus de Vegazana, s/n
Universidad de León
24071 León, España
www.fcsc.es