Hello,
I wonder if anyone is attending LISA in Seattle in November and might be
interested in a Birds of a Feather type session on Lustre monitoring?
At SSEC we have been testing Lustre performance monitoring and have made
some progress using graphite for stats, but not much yet on log files.
Here's the really short version of what we've found:
1) What stats do you care about? Which ones are interesting?
Take some time searching for the various stats files in /proc, and you may
find yourself questioning what choices in life have brought you to that
point. Some are obviously useful, but there are *many*. To make it even
more fun, the formatting sometimes differs between files for no good
reason, so have fun parsing the data...
To me, though, this is really the critical point. Similar to sharing
logstash patterns, it would be nice to share which statistics people
collect and how they use them.
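For reference, the most common flavour ('stats') is just one counter per
line with a sample count and, for byte counters, min/max/sum, e.g. (numbers
made up):

snapshot_time             1414000000.123456 secs.usecs
read_bytes                1234 samples [bytes] 4096 1048576 987654321
write_bytes               567 samples [bytes] 4096 1048576 123456789
open                      42 samples [reqs]

A minimal parsing sketch for that one flavour (Python here purely for
illustration - the module I actually wrote is perl and handles the other
flavours too):

def parse_stats(path):
    # Parse a Lustre 'stats' style proc file into {counter: value}.
    # Assumes the "<name> <count> samples [<unit>] <min> <max> <sum>" layout.
    metrics = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] == 'snapshot_time':
                continue
            metrics[fields[0] + '.count'] = int(fields[1])
            if len(fields) >= 7:  # min/max/sum present for byte counters
                metrics[fields[0] + '.sum'] = int(fields[6])
    return metrics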
2) Collect/parse the stats and get them to graphite -
The collectl plugins mentioned elsewhere in this thread both parse and send.
NCI at LUG14 mentioned a custom collectd plugin. Is that public? I
haven't seen it.
There's the roll-your-own method - this is actually simple and can
make sense - Brock wrote one, I see.
I wrote a simple perl module for the various syntax types I have unearthed
so far: 'stats', 'exportstats', 'single' and 'jobstats'. I then tried
sending via a simple perl socket, and that worked fine.
Now we're testing check_mk local checks (collecting and parsing via the
perl module I made) plus 'graphios'. This works too; the sketch below shows
the general shape of the send step.
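For what it's worth, the send side is trivial with graphite's plaintext
protocol - one "path value timestamp" line per metric to carbon on TCP
2003. A Python sketch (the host name and metric prefix are placeholders,
not our real setup):

import socket, time

GRAPHITE_HOST, GRAPHITE_PORT = 'graphite.example.com', 2003

def send_metrics(prefix, metrics):
    # metrics is a dict like the output of the parser sketch above
    now = int(time.time())
    lines = ['%s.%s %s %d' % (prefix, name, value, now)
             for name, value in metrics.items()]
    sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT))
    sock.sendall(('\n'.join(lines) + '\n').encode())
    sock.close()

# e.g. send_metrics('lustre.mds01.MDT0000',
#                   parse_stats('/proc/fs/lustre/mdt/<target>/md_stats'))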
3) Deal with them in graphite
This isn't too hard, but beware: jobstats are their own beast. You create
new metrics for every job, which means you lose one of the big benefits of
whisper - like rrd, it allocates a fixed-size database per metric, which
makes capacity planning easy. If you create new metrics at some variable
rate, you now have to cope with that growth. It's a mess, and graphite
might not be the right tool for that job.
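To make that concrete, with jobstats every job id ends up in the metric
path, e.g. (the naming here is just an illustration):

lustre.oss01.OST0000.jobstats.12345.write_bytes
lustre.oss01.OST0000.jobstats.12346.write_bytes
...

so whisper keeps creating brand new files for every job that shows up, at
whatever retention storage-schemas.conf assigns them.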
It sounds like a lot, but it's really *very* easy to go from nothing to
massive data collection and presentation.
The hard part, I think, is #1 - knowing which statistics matter and why.
With some of these newer tools, collecting data and building really slick
GUI panels is no longer a big deal.
Scott
On 10/22/2014 9:51 PM, Brock Palen wrote:
Justin, that is cool.
Joshua, I don't think graphite is good for monitoring (notifications), but
it is good for reporting. I hope to add more rules to logstash to
send alerts for slow_commit and other errors that normally tell me
lustre is going to start getting mad when the backend gets slow.
For logstash I had to make my own grok {} rules, and I made a helper
script that takes all the stats and munges them into JSON so that logstash
can parse them. I run two inputs: one for summary (per-OST/MDT) stats and
one for per-client stats. Because this turns into a lot of data, I keep
the per-client stats for less time, and I don't collect them at the same
resolution.
I am doing this for a 30-OST, 1-MDT, ~1100-client setup. The graphite
server is not sharded and currently has just one spindle (yep, we lose
everything if the disk dies).
command => "/root/logstash.git/helpers/json-stats-wrapper.py /proc/fs/lustre/mdt/*/exports/*/stats"
command => "/root/logstash.git/helpers/json-stats-wrapper.py /proc/fs/lustre/mdt/*/md_stats"
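Those commands sit inside logstash exec inputs, along these lines (the
interval values here are placeholders; the point is that the per-client
input runs less often), plus a json codec/filter to turn the wrapper's
output into fields:

input {
  exec {
    # per-client export stats - heavier, so polled less often
    command  => "/root/logstash.git/helpers/json-stats-wrapper.py /proc/fs/lustre/mdt/*/exports/*/stats"
    interval => 300
  }
  exec {
    # per-MDT summary stats - cheap, polled more often
    command  => "/root/logstash.git/helpers/json-stats-wrapper.py /proc/fs/lustre/mdt/*/md_stats"
    interval => 60
  }
}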
The end result I am storing on my current graphite server; I hit about
27,000 commits/minute, and most of those are from Lustre.
I have attached some example plots:
- The read_bytes rate from each OST summary
- Summed read_bytes rates from all OSTs, with a moving median over the
last 30 points
- Per-client MDT open rate (/proc/fs/lustre/mdt/*/exports/*/stats) for
the last hour, filtered to the 'most deviant'; I use this to find the
users/jobs responsible for most of our MDT load.
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
brockp(a)umich.edu
(734)936-1985
> On Oct 22, 2014, at 2:41 PM, Justin Miller <jupmille(a)iu.edu> wrote:
>
> I have experimented with collectl + graphite a little bit. collectl has
> graphite export functionality built in, so in combination with the
> Lustre collectl plugins (which I think are maintained by someone from
> Terascala), you can send data directly to graphite.
>
> Lustre collectl plugins: https://github.com/pcpiela/collectl-lustre
>
> Download those plugins and put them in /usr/share/collectl
>
> Then specify which plugin to use depending on the type of machine, e.g.
>
> OSS: collectl --import lustreOSS.ph,sdBC --export graphite,10.10.0.103,d=1
> MDS: collectl --import lustreMDS.ph,sC --export graphite,10.10.0.103,d=1
> CLI: collectl --import lustreClient.ph,sdO --export graphite,10.10.0.103,d=1
>
> The plugins have options you can pass specifying what service data to
> export and how much detail to include (the ",sdBC" in the OSS example),
> although I can't recall right now what each one means. I seem to recall
> that specifying a full path on the import option didn't work; it had to
> be in /usr/share/collectl.
>
> Setting d=1 will echo the data being sent. Setting d=9 will just echo
> data and not send anything to the graphite server. Without the debug
> flag, the graphite export won't echo anything. See
> http://collectl.sourceforge.net/Graphite.html
>
> Hopefully this is useful. I am also very interested in any ELK stack or
> graphite related projects people are willing to share and discuss.
>
> Slightly off topic, but my understanding is that the builtin collectl
> Lustre plugins are being deprecated in favor of the external Lustre
> plugins (see http://sourceforge.net/p/collectl/mailman/message/31992463/).
>
> Regards,
>
> Justin Miller
>
> On 10/21/14 6:14 PM, Joshua Rich wrote:
>> Hey Brock,
>>
>> I'd be really interested in any notes on your Graphite set-up if you
>> are willing to share. I've been interested in whether I could use
>> Graphite as part of a wider monitoring and reporting framework we are
>> looking into. Using it for Lustre though sounds interesting!
>>
>> Best regards,
>>
>> Joshua Rich
>> HPC Systems Administrator
>> HPC Systems and Cloud Services
>>
>> National Computational Infrastructure
>> The Australian National University
>> 143 Ward Road
>> Acton, ACT, 2601
>>
>> T +61 2 6125 2360
>> joshua.rich(a)anu.edu.au
>> http://nci.org.au
>>
>> ________________________________________
>> From: HPDD-discuss <hpdd-discuss-bounces(a)lists.01.org> on behalf of
>> Brock Palen <brockp(a)umich.edu>
>> Sent: Wednesday, 22 October 2014 12:45 AM
>> To: Justin_Bovee(a)Dell.com
>> Cc: hpdd-discuss(a)lists.01.org
>> Subject: Re: [HPDD-discuss] logstash patterns for Lustre syslog
>>
>> I would also be interested.
>>
>> I use logstash + graphite right now to make graphs of lustre patterns.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> brockp(a)umich.edu
>> (734)936-1985
>>
>>
>>
>> On Oct 21, 2014, at 9:43 AM, <Justin_Bovee(a)Dell.com> wrote:
>>
>>> I would be interested in this as well.
>>>
>>> Justin Bovee
>>> ProSupport Master Engineer
>>> Dell | Enterprise Support Services
>>> Office Number: (512) 723-9536 ext. 7239536
>>> M-F (8:00am – 5:00pm CST)
>>> How am I doing? Email my manager Rocky_Salim(a)dell.com with any feedback
>>>
>>> -----Original Message-----
>>> From: HPDD-discuss [mailto:hpdd-discuss-bounces@lists.01.org] On
>>> Behalf Of Michael Kluge
>>> Sent: Tuesday, October 21, 2014 5:24 AM
>>> To: hpdd-discuss(a)lists.01.org
>>> Subject: [HPDD-discuss] logstash patterns for Lustre syslog
>>>
>>> Hi list,
>>>
>>> does anyone have a set of logstash/grok patterns for Lustre logs
>>> that you can/want to share?
>>>
>>>
>>> Regards, Michael
>>>
>>> --
>>> Dr.-Ing. Michael Kluge
>>>
>>> Technische Universität Dresden
>>> Center for Information Services and
>>> High Performance Computing (ZIH)
>>> D-01062 Dresden
>>> Germany
>>>
>>> Contact:
>>> Willersbau, Room A 208
>>> Phone: (+49) 351 463-34217
>>> Fax: (+49) 351 463-37773
>>> e-mail: michael.kluge(a)tu-dresden.de
>>> WWW: http://www.tu-dresden.de/zih
>>>
>
>
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss(a)lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss