>>>>
>>>>>> Another approach could be to integrate NVDIMM event
>>>>>> monitoring into some other utility, like the rasdaemon.
>>>>>> I'm interested in your thoughts.
>>>>>
>>>>> Though I'm not sure which (existing or new) utility is
>>>>> appropriate yet.
>>>>> I prefer this way. So, I'll think about it.
>>>>
>>>> I investigated the notification/monitoring feature for over-threshold
>>>> events with my co-worker. Here is our current understanding.
>>>>
>>>>
>>>> a) rasdaemon
>>>> It is a good tool for machine check errors, and if a machine check
>>>> occurs on an NVDIMM, I suppose it will work not only for traditional
>>>> RAM but also for NVDIMM.
>>>> But it may not fit the purpose of notification/monitoring of
>>>> threshold events.
>>>>
>>>>
>>>> b) smartmontools (https://www.smartmontools.org/)
>>>> This tool may fit the purpose of notification/monitoring of the
>>>> health of NVDIMMs.
>>>> However, it may be a bit troublesome for the following reasons.
>>>>
>>>> - smartd seems to check the SMART values of each device with
>>>> ioctl() periodically (in other words, by "polling").
>>>> Probably, other devices do not have a notification interface
>>>> like "ndctl_dimm_get_health_eventfd() and poll()/select()".
>>>>
>>>> - smartmontools supports many OSs (Windows, darwin, xxxBSDs,
>>>> os2(!)).
>>>> I'm not sure whether other OSs have a notification interface
>>>> similar to the one on Linux.
>>>> So, NVDIMMs may need "polling" like other devices.
>>>>
>>>> c) udev
>>>> Udev can kick any program if a udev rule is created.
>>>> However, there is currently no uevent for the over-threshold
>>>> event.
>>>> In addition, I'm not sure that udev fits this type of event
>>>> notification.
>>>>
>>>>
>>>> d) make a new tiny daemon in the ndctl tree
>>>> This may be a simpler way.
>>>> It can use ndctl_dimm_get_health_eventfd() and poll()/select().
>>>>
>>>> But ndctl may be included in the kernel source,
>>>> and I don't know whether the kernel includes other daemon tools
>>>> or not.
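
A rough sketch of the wait loop that option (d) describes, in Python for
brevity. It only shows the general "block in poll() until a descriptor
fires" pattern, as opposed to the periodic polling described for smartd.
The pipe below is a stand-in for the descriptor that
ndctl_dimm_get_health_eventfd() would return; wait_for_events() is a
hypothetical helper name, and the exact poll flags a real health
descriptor raises are an assumption here, so several are requested.

```python
import os
import select


def wait_for_events(fds, timeout_ms):
    """Block until any descriptor in `fds` is ready or the timeout expires.

    Returns the descriptors that reported an event (empty list on timeout).
    """
    poller = select.poll()
    for fd in fds:
        # POLLIN is what the demo pipe raises. Other descriptor types may
        # raise POLLPRI instead, so both are requested (assumption).
        poller.register(fd, select.POLLIN | select.POLLPRI)
    return [fd for fd, _mask in poller.poll(timeout_ms)]


# Stand-in for one per-DIMM health descriptor (hypothetical).
read_end, write_end = os.pipe()

idle = wait_for_events([read_end], 0)       # nothing pending, returns at once
os.write(write_end, b"x")                   # simulate an over-threshold event
ready = wait_for_events([read_end], 1000)   # wakes because data is pending
os.read(read_end, 1)                        # consume the simulated event

os.close(read_end)
os.close(write_end)
```

A daemon would register one descriptor per DIMM and call this in a loop
with no timeout, so it sleeps until the kernel reports something.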
>>>
>>> e) acpid
>>
>> Except acpid is ACPI specific, and the event sources that libnvdimm
>> generates are generic. For example, we may be getting an Open Firmware
>> libnvdimm bus in the next merge window.
>
> Can you say more about that? It seems that the notifications we're worried
> about here and the interface for getting information about the notification
> are both ACPI-specific.
Capturing the raw acpi events is not that interesting because we'll
immediately want to turn around and ask what those mean to Linux
kernel objects, so might as well monitor those objects directly.
> We haven't talked much about what a daemon would do once it gets a
> notification from whatever the source is. That might help us determine
> the right tool. Is it just logging?
Yes, logging, and maybe a simple framework to call external helper
applications when a given event fires, or fires too many times within
a certain threshold.
I agree.
I guess some users would like to use Logstash, Fluentd, or other
log monitor/collector/analyzer tools. But other users want to kick
applications to avoid serious trouble like data corruption.
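
To make the "fires too many times within a certain threshold" idea
concrete, here is a minimal sketch of a sliding-window counter that
calls a helper when the limit is exceeded. Everything in it is
hypothetical (the EventRateTrigger name, its parameters, the helper
callback); it is not taken from ndctl or any existing tool. Timestamps
are passed in explicitly so the behaviour is easy to check.

```python
import collections


class EventRateTrigger:
    """Call `helper` when more than `max_events` arrive within the window."""

    def __init__(self, max_events, window_seconds, helper):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.helper = helper
        self._times = collections.deque()

    def record(self, now):
        """Record one event at time `now` (seconds); return True if helper ran."""
        self._times.append(now)
        # Forget events that have fallen out of the window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        if len(self._times) > self.max_events:
            self.helper(len(self._times))
            self._times.clear()  # start counting afresh after firing
            return True
        return False


calls = []
trigger = EventRateTrigger(max_events=2, window_seconds=10.0,
                           helper=calls.append)
# Two early events, a long quiet gap, then a burst of three.
results = [trigger.record(t) for t in (0.0, 1.0, 30.0, 31.0, 32.0)]
```

In a daemon, the timestamp would come from something like
time.monotonic(), and the helper could write a log line or start an
external program.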
Thanks,
---
Yasunori Goto