(The question may over-explain a few things, but that is only to make the picture clear; any moderator is welcome to edit or delete it. Thanks.)
The question is a mix of “how do you do this” and “is there anything related in Erlang or Elixir”.
Like most people, we have our application deployed on servers that are not our own machines; we rent them online and use them as servers.
In our case we are using Hetzner, and it’s good.
Question: We were using Nagios to monitor RAID and HDD errors; the servers run RAID6 on an LSI RAID controller. In Nagios we used a mega_sas_raid plugin for RAID and a SMART monitoring plugin for HDD errors. It worked well: we never actually had any RAID or HDD errors, but notifications were set up and everything was fine. Nagios is very expensive to use, though, so I want to build something of my own. But for RAID and HDD errors you eventually have to get onto the server itself before you can run any checks.
Can anyone please share how they handle this? What tools do you use, and which tools are good for this kind of thing?
Or, in Elixir/Erlang, is there any way to do or build such things?
If this is mission critical for you by any means, stick to paid software or a paid service that is mature and battle-tested!
I’m not sure if there are any alternatives to Nagios. The ops people at my company love it and are glad to have it, but our customers pay us to watch the Nagios that they also pay for, so it costs us nothing.
Also, I have no idea where I’d start if I tried to implement it on the BEAM, but it would probably be just a NIF or a port wrapper around some driver calls. Perhaps you are lucky and SMART info is available under /proc somewhere?
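As a rough illustration of the "wrapper around an external tool" idea, here is a minimal Python sketch that shells out to smartctl and reads its overall health verdict. It assumes smartctl (from smartmontools) is installed and run with enough privileges. The function names are made up for this example, and disks behind a hardware RAID controller may need an extra smartctl device-type option that is not shown here. The same parsing could sit behind an Erlang/Elixir port.

```python
import subprocess

# Health lines printed by `smartctl -H`: the first form is used for
# ATA/SATA drives, the second for SCSI/SAS drives.
ATA_PREFIX = "SMART overall-health self-assessment test result:"
SCSI_PREFIX = "SMART Health Status:"


def parse_smart_health(output):
    """Return True if the drive reports healthy, False if it reports a
    failure, and None if no health line was found in the output."""
    for line in output.splitlines():
        line = line.strip()
        if line.startswith(ATA_PREFIX):
            return line[len(ATA_PREFIX):].strip() == "PASSED"
        if line.startswith(SCSI_PREFIX):
            return line[len(SCSI_PREFIX):].strip() == "OK"
    return None


def check_device(device):
    """Run `smartctl -H` on one device (hypothetical helper).

    smartctl exits non-zero when a disk is failing, so the exit status is
    deliberately not treated as an error here; only the text is parsed.
    """
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
    )
    return parse_smart_health(result.stdout)
```

Calling check_device("/dev/sda") from a periodic job and alerting on anything other than True would be one simple way to use it.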
It can depend on your hardware vendor, but if it’s an LSI controller you’re probably not using Dell hardware, where you could just query the BMC through iDRAC. On some hardware you may also be able to query via SNMP and/or emit SNMP traps.
For our LSI controllers, we do as mentioned above: a script / agent runs on each host every X minutes, queries and parses lsscsi / lshw, and writes the info to that host’s metrics endpoint, where it is collected by our monitoring system.
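A per-host agent like that could look roughly like the following Python sketch. It is not the poster's actual script: it only handles lsscsi's default output, the metric name scsi_devices is invented, and the Prometheus-style text line format is an assumption about what the metrics endpoint expects.

```python
def parse_lsscsi(output):
    """Parse default `lsscsi` output into a list of dicts.

    Only the first field ([H:C:T:L]), the second field (device type) and
    the last field (device node, or "-" when there is none) are used,
    because vendor and model columns may themselves contain spaces.
    """
    devices = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith("["):
            continue  # skip blank or unexpected lines
        devices.append(
            {
                "address": fields[0].strip("[]"),
                "type": fields[1],
                "node": fields[-1],
            }
        )
    return devices


def to_metric_lines(devices):
    """Count devices per type and render one metric line per type.

    The metric name and label are hypothetical; adapt them to whatever
    your monitoring system collects.
    """
    counts = {}
    for dev in devices:
        counts[dev["type"]] = counts.get(dev["type"], 0) + 1
    return [
        'scsi_devices{type="%s"} %d' % (dev_type, count)
        for dev_type, count in sorted(counts.items())
    ]
```

Run from cron or a systemd timer, the agent would feed the output of lsscsi into parse_lsscsi and publish the resulting lines; a drop in the disk count compared with the expected number is then something the monitoring system can alert on.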
If you’re using mdadm, you might as well use mdadm --monitor. It will either send you an e-mail or run an arbitrary command whenever any “interesting” events come up. Check man 5 mdadm.conf for the MAILADDR and PROGRAM keywords.
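For the "run an arbitrary command" route, mdadm calls the configured program with two or three arguments: the event name, the md device, and, where relevant, the affected component device. Below is a minimal Python sketch of such a handler. The split into critical and informational events is my own assumption, and where the message ends up (here just stderr) is a placeholder for your real notification channel.

```python
import sys

# Events treated as urgent in this sketch; the grouping is an assumption,
# the event names themselves come from the mdadm manual.
CRITICAL_EVENTS = {
    "Fail",
    "FailSpare",
    "DegradedArray",
    "DeviceDisappeared",
    "SparesMissing",
}


def format_alert(args):
    """Build an alert line from the arguments mdadm passes to the program:
    event name, md device, and optionally a component device."""
    event, md_device = args[0], args[1]
    component = args[2] if len(args) > 2 else None
    severity = "CRITICAL" if event in CRITICAL_EVENTS else "INFO"
    message = "%s: %s on %s" % (severity, event, md_device)
    if component:
        message += " (device %s)" % component
    return message


def main(argv):
    """Entry point; call as main(sys.argv) when installed as a script."""
    # Placeholder sink: replace with mail, chat webhook, metrics push, etc.
    print(format_alert(argv[1:]), file=sys.stderr)
```

Saved as an executable script that ends with main(sys.argv), it would be referenced from a PROGRAM line in mdadm.conf; the script path is whatever you choose.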