How do you monitor your RAID and HDD errors?

(The question may over-explain a few things, but that is only to make it clear; any moderator is welcome to edit or delete. Thanks.)
The question is a mix of “how do you do it” and “is there anything related in Erlang or Elixir”.

Like most people, we deploy our applications on servers that are not our own machines; we rent them online and use them as servers.

In our case, we are using Hetzner, it’s good.

Question: We were using Nagios to monitor RAID and HDD errors, as the servers are on RAID6 with an LSI RAID controller. In Nagios we used a mega_sas_raid plugin for RAID and a SMART monitoring plugin for HDD errors. It worked well; we never actually had RAID or HDD errors, but notifications were set up and everything was fine. Nagios is very expensive to use, so I want to build something of my own. But for RAID and HDD errors you ultimately have to get onto the server itself first, and only then can you run any checks.

Can anyone please share how they handle these things? What tools do you use, and which tools are good for this?

Or, in Elixir/Erlang, is there any way to do or build such a thing?

If this is mission-critical for you in any way, stick to paid software or a paid service that is mature and battle-tested!

I’m not sure if there are any alternatives to Nagios. The Ops people at my company love it and are glad to have it, but then our customers pay us to watch the Nagios they also pay for, so it costs us nothing :wink:

Also, I have no idea where I’d start if I tried to implement it on the BEAM, but it would probably just be a NIF or a port wrapper around some driver calls. Perhaps you are lucky and SMART info is available under /proc somewhere?

Why not write a script to check the SMART and/or RAID status and run it every hour via cron?
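A minimal sketch of such a script, assuming smartmontools is installed and the script runs as root. The function names are made up for illustration. Disks behind an LSI MegaRAID controller usually need an extra device-type option such as -d megaraid,N, which this sketch leaves out.

```python
import re
import subprocess
import sys

# Health summary lines printed by `smartctl -H` for ATA and SCSI/SAS devices.
_HEALTH_RE = re.compile(
    r"^(?:SMART overall-health self-assessment test result|SMART Health Status):\s*(.+)$",
    re.MULTILINE,
)


def smart_health_ok(smartctl_output):
    """Return True only if the output contains a passing health summary."""
    match = _HEALTH_RE.search(smartctl_output)
    if match is None:
        # No summary line at all: treat as a problem rather than guess.
        return False
    return match.group(1).strip() in ("PASSED", "OK")


def check_device(device):
    """Run `smartctl -H` on one device (hypothetical helper, needs root)."""
    try:
        result = subprocess.run(
            ["smartctl", "-H", device],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # smartctl missing or not executable: report as unhealthy/unknown.
        return False
    return smart_health_ok(result.stdout)


if __name__ == "__main__":
    # Devices are passed on the command line, e.g. /dev/sda /dev/sdb.
    for device in sys.argv[1:]:
        if not check_device(device):
            print(f"SMART health check failed for {device}")
```

Run from cron, any output is typically mailed to the address in MAILTO, so printing only on failure gives a crude notification without extra tooling.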

You can use cat /proc/mdstat to check RAID status.
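Note that /proc/mdstat only covers Linux software RAID (md), not arrays managed by a hardware controller. A rough sketch of parsing it, assuming the usual layout where each array’s status line contains something like [4/3] [U_UU]; the function name is hypothetical:

```python
import re

# An array header looks like: "md1 : active raid6 sdd1[3] sdc1[2] ..."
_ARRAY_RE = re.compile(r"^(md\S+)\s*:")
# The status line carries "[devices/active] [U_UU]"; "_" marks a missing member.
_STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that report missing members."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = _ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = _STATUS_RE.search(line)
        if status and current is not None:
            expected = int(status.group(1))
            active = int(status.group(2))
            flags = status.group(3)
            if active < expected or "_" in flags:
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as handle:
            for name in degraded_arrays(handle.read()):
                print(f"RAID array {name} is degraded")
    except OSError:
        # No md driver loaded (or not Linux): nothing to report.
        pass
```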

My company uses Nagios too, but isn’t Nagios open source? I don’t recall us paying for this :thinking:

It can depend on who your hardware vendor is, but if it’s an LSI controller you’re probably not using Dell hardware, where you could just query the BMC through iDRAC. With some hardware you may also be able to query via SNMP and/or emit SNMP traps.

For our LSI controllers, we do as mentioned above: a script / agent runs on each host every X minutes, queries and parses lsscsi / lshw, then writes the info to that host’s metrics endpoint to be collected by our monitoring system.
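The “write to a metrics endpoint” step could look roughly like this, assuming a Prometheus-style text format is what the monitoring system scrapes (the post does not say which system is used). The metric name disk_healthy and the function name are invented for illustration, and the lsscsi / lshw parsing is left out.

```python
def render_disk_metrics(disk_healthy):
    """Render {device: bool} as Prometheus text exposition format.

    `disk_healthy` maps a device path to True (healthy) or False.
    """
    lines = [
        "# HELP disk_healthy 1 if the disk reported healthy, 0 otherwise.",
        "# TYPE disk_healthy gauge",
    ]
    # Sort so the output is stable between runs.
    for device in sorted(disk_healthy):
        value = 1 if disk_healthy[device] else 0
        lines.append(f'disk_healthy{{device="{device}"}} {value}')
    return "\n".join(lines) + "\n"
```

The agent would then write this string to wherever the collector picks it up, for example a file that the host’s exporter reads.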

If you’re using mdadm, you might as well use mdadm --monitor. It will either send you an e-mail or run an arbitrary command whenever an “interesting” event comes up. Check man 5 mdadm.conf for the MAILADDR and PROGRAM settings.
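As a sketch of what such a command could be: mdadm passes the program two or three arguments (the event name, the md device, and sometimes the affected component device). The set of events worth alerting on and the function name are my own choices here, not anything mdadm prescribes.

```python
import sys

# Events from mdadm's monitor mode that usually deserve attention.
# Which ones to alert on is a judgment call made for this sketch.
ALERT_EVENTS = {
    "Fail",
    "FailSpare",
    "DegradedArray",
    "DeviceDisappeared",
    "SparesMissing",
}


def format_alert(argv):
    """Turn mdadm's program arguments into an alert line, or None to ignore."""
    if len(argv) < 2:
        return None
    event, md_device = argv[0], argv[1]
    component = argv[2] if len(argv) > 2 else None
    if event not in ALERT_EVENTS:
        return None
    message = f"mdadm event {event} on {md_device}"
    if component:
        message += f" (component {component})"
    return message


if __name__ == "__main__":
    alert = format_alert(sys.argv[1:])
    if alert:
        # Replace this with whatever notification channel you actually use.
        print(alert, file=sys.stderr)
```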

/me is lazy and just uses ZFS’s normal reporting… >.>

mdadm is awesome though!