Hi Team,
I have some questions about the agent heartbeat.
First, a bit of background about the situation: we've had a few instances where a server has gone into a "soft hung" state. The agent keeps sending its heartbeat, but monitoring is not taking place. We recently had a similar situation where the host was having issues with VMware that were exhausting its resources, causing the SCOM agent to constantly unload all of its workflows while still sending a proper heartbeat. I'm being asked to monitor for these types of situations.
Here are some of my questions:
Is there any way to modify how the heartbeat works? It seems to me that the heartbeat packet the SCOM agent sends is separate from the data packet that carries the performance data and the agent's workflow results. We have other monitoring tools that expect a file at set intervals with everything included in it, so if they don't get the file, there is a problem. But in SCOM, from my understanding, you can still have a heartbeat even if the agent is not able to perform its tasks.
I know Microsoft has rules/monitors in place for when an agent has to unload its workflows, but I'm being told by Microsoft that these would not work in a situation where the agent is not even able to report that.
Is there any other kind of monitoring that you recommend to handle a situation like this? Everything I can think of (such as event ID monitoring) requires the agent to be fully operational. Is there any specific monitor that I can run on the management server side that raises an alert when the agent isn't reporting back monitoring data? It seems that the agent heartbeat isn't sufficient in this case.
Maybe a better way of phrasing this: if the heartbeat is only testing the connection, is there a better, more reliable way of monitoring agent operations on hosts?
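To make the idea concrete, here is a rough sketch of the kind of server-side check I have in mind. This is not SCOM's API: the function name, the dictionary fields (`heartbeat_ok`, `last_data_time`), and the 30-minute threshold are all hypothetical, and I'm assuming the last time each agent submitted monitoring data can be obtained somehow on the management server side.

```python
from datetime import datetime, timedelta

def find_silent_agents(agents, now, max_data_age=timedelta(minutes=30)):
    """Return agents that still heartbeat but have sent no recent monitoring data.

    `agents` maps an agent name to a dict with two hypothetical fields:
      heartbeat_ok   - True if the last heartbeat arrived on time
      last_data_time - datetime of the last monitoring data received, or None
    """
    silent = []
    for name, info in agents.items():
        last_data = info["last_data_time"]
        data_is_stale = last_data is None or (now - last_data) > max_data_age
        # Only flag the "soft hung" case: heartbeat looks fine, data does not.
        # Agents with a failed heartbeat are left to the normal heartbeat alert.
        if info["heartbeat_ok"] and data_is_stale:
            silent.append(name)
    return sorted(silent)

# Made-up sample data for illustration only.
now = datetime(2024, 1, 1, 12, 0)
agents = {
    "srv-a": {"heartbeat_ok": True, "last_data_time": datetime(2024, 1, 1, 11, 55)},
    "srv-b": {"heartbeat_ok": True, "last_data_time": datetime(2024, 1, 1, 10, 0)},
    "srv-c": {"heartbeat_ok": False, "last_data_time": datetime(2024, 1, 1, 9, 0)},
}
print(find_silent_agents(agents, now))  # ['srv-b']
```

In this made-up example only srv-b is flagged: it is heartbeating but has been quiet for two hours. That is the "heartbeat is fine but nothing else is arriving" case I would like an alert for.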
Hopefully my rant/question made sense.
Thanks in advance.