The System Health policy is deployed and available on all nodes. Nodes are the domain or CEP components; in the screenshot below, the nodes are DOMAIN_SERVER and ODIN. This policy gives insight into each node's memory utilization and CPU utilization, the overall count of facts published by all experts deployed on the node, the overall count of sensors, latency delays (if any), response time delays (if any), and other useful information. To open the System Health policy, right-click on CEP and select System Health. A screenshot is attached below.
After opening the policy, view the properties of each sensor for a description of what it is used for. In the screenshot below, follow the numbering to view the description of each sensor.
Below are a few sensors that are useful for debugging issues:
- Service Status and Recovery Sensor: This sensor reports whether any expert deployed on CEP is not running. For example, if alerting is enabled on this sensor and any expert or policy manager suddenly goes down, an alert is sent as per the configuration.
- Memory Utilization % Sensor (under General Indicators): This sensor shows what percentage of the memory allocated to CEP is being consumed. As shown in the screenshot above, 21.03% of the memory allocated to CEP is already consumed.
- Rule Engine CPU Idle % Sensor (under Utilization): This sensor helps you understand CPU utilization by the CEP process. Because it reports the idle percentage, a lower value means higher CPU utilization.
- Fact Publish Latency (ms), Fact Delivery Latency (ms) & Fact Sensor Latency (ms) Sensors (under Fact Pipeline Statistics): These sensors provide insight into latency delays for facts and their usage. Use them when facts are not updated, when sensor creation is delayed, or in other related scenarios.
- Service Fact Utilization Sensor: This sensor lists the experts that are publishing a large number of facts. It helps users identify which experts are consuming significant resources and decide on actions to address the situation.
- Policy Deployment Utilization Sensor: This sensor lists the policies that are creating many dynamic sensors, along with the name of the policy manager under which each policy is deployed. Similar to the previous sensor, it helps users identify the problematic policies and take steps to resolve the issue.
- Response Time Delay (ms) Sensor: This sensor helps you understand any delays in receiving responses from policies.
Note that the default sensor thresholds may not be appropriate for every environment and might need to be updated accordingly. For example, if the number of facts published exceeds the default thresholds, the sensor could indicate critical severity. In such scenarios, we recommend validating other available sensors to determine whether the environment is truly in a critical condition, or whether it is configured to handle the load and only the sensor thresholds need adjustment to match the use case.
Validating other sensors means:
- Check if memory utilization is consistently below 70%. Memory fluctuates with activity but should go down after the activity completes. If it is always high (above 80% or 85%), the current resources are almost exhausted by the load, and the memory allocated to the process needs to be increased to handle the current or higher volumes.
- Check for any response time or latency delays when processing facts. High delays indicate that the environment cannot keep up with the load and requires more resources (both memory and CPU).
- Review task statistics, which also help determine whether the environment can handle the load.
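The memory check above can be sketched as a small script. This is only an illustration: it assumes you have collected a series of Memory Utilization % readings by some means of your own (the function name, the input list, and the result labels are hypothetical and are not part of any meshIQ API). The 70% and 80% limits come from the guidance above.

```python
# Thresholds taken from the guidance above (percent of allocated memory).
HEALTHY_BELOW = 70.0    # consistently below this is considered fine
EXHAUSTED_ABOVE = 80.0  # always above this suggests memory is nearly exhausted


def classify_memory(samples):
    """Classify a series of Memory Utilization % readings (hypothetical helper).

    Returns "healthy" when every reading is below 70%, "increase memory"
    when every reading is above 80%, and "monitor" for anything in between,
    such as temporary spikes that drop again once activity completes.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    if all(s < HEALTHY_BELOW for s in samples):
        return "healthy"
    if all(s > EXHAUSTED_ABOVE for s in samples):
        return "increase memory"
    return "monitor"
```

For example, readings that spike during activity and then fall back would be classified as "monitor" rather than "increase memory", which matches the note that memory fluctuates with activity.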
For additional assistance, we recommend contacting your system administrators who manage the application or reaching out to the meshIQ support team with a description of your queries and any available evidence.