Understanding the Self-Health Metrics
Oracle Communications Unified Assurance poller, collector, and threshold engine applications save self-health metrics that track their own performance. You can use these metrics to monitor overall performance and as an early indicator of potential issues, such as overloaded poll cycles, an unbalanced polling cluster, database contention, and latency. They help you make informed decisions about the overall Unified Assurance installation and keep applications efficient, for example by adding pollers to a cluster to spread the polling load.
Some of the self-health metrics include:
- Average Db Time: The average time to insert data into the database in each poll cycle.
- Average Poll Time: The average time to poll a single device for data in each poll cycle.
- Poll Duration: The time the poll cycle took to complete.
- Database Queue Length: The number of messages in the database queue after the poll cycle ends.
  For pollers, this is the number of metrics waiting to be inserted into the Metric database.
  For thresholding engines, this is the number of threshold violations waiting to be inserted as events in the Event database.
- Poll Queue Length: The number of messages in the polling queue before the next poll period starts.
  For pollers, this is the number of devices waiting to be polled.
  For thresholding engines, this is the number of messages waiting to be added to the ThresholdsQueue.
- Polled Devices: The number of devices polled by the application. For thresholding engines, this is the number of metrics processed.
- Process Queue Length: The number of messages in the processing queue, waiting to be processed by rules.
The following self-health metrics apply only to thresholding engines:
- Polled Thresholds: The number of thresholds being checked for violations.
- Threshold Violations: The number of thresholds violated in the poll cycle.
Viewing Self-Health Metrics
You can see the self-health metrics in Metric Analytics dashboards:
- From the main navigation menu, select Analytics, then Metrics, then Dashboard.
- Select the default Metric Health - Dynamic dashboard.
  This dashboard shows some of the self-health metrics for pollers.
You can optionally add panels that show more data by selecting the relevant metric and application in each panel's query. See Dashboards in the Grafana documentation for more information about creating, configuring, and using dashboards. Only users in groups whose role has the Admin permission in the metricAnalytics package can add or edit Metric Analytics dashboards.
Interpreting Self-Health Metrics
- Average Db Time: Extremely high values indicate high network latency, other networking issues, or high database contention for inserts.
- Average Poll Time: Extremely high values indicate latency or network issues when communicating with polled devices.
- Database Queue Length: This number may vary. The value should stay relatively steady, or increase slightly over time as additional devices and metrics are polled. Large spikes or a constant increase may indicate a database connectivity issue or too few database threads to handle the number of metrics being inserted.
- Poll Duration: Generally, a value greater than half of the configured poll time indicates an overloaded poller. Adjust the poller to give it more threads, split it into a cluster, or add pollers to the cluster.
- Poll Queue Length: This value should be zero or stay consistent. If the value is increasing over time, the application may be falling behind (starting a new poll before the previous one has finished). Check the Poll Duration metric. Pollers that fall behind and are left unchecked can cause collection issues and delay metric insertion.
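
As a rough illustration of the Poll Duration and Poll Queue Length guidance above, the following Python sketch applies both rules of thumb to values you have read from a dashboard or exported yourself. The function names, the sample values, and the choice to treat "increasing over time" as a strictly rising series of samples are assumptions made for this example; they are not part of Unified Assurance.

```python
from typing import Sequence


def is_poller_overloaded(poll_duration_s: float, poll_interval_s: float) -> bool:
    """Return True when a poll cycle took more than half the configured poll time.

    Both arguments are in seconds. The one-half threshold is the general
    guideline described for the Poll Duration metric.
    """
    return poll_duration_s > poll_interval_s / 2


def is_queue_growing(samples: Sequence[float], min_samples: int = 3) -> bool:
    """Return True when every sample is larger than the one before it.

    Assumption for this sketch: "increasing over time" means a strictly
    rising series of at least min_samples Poll Queue Length readings.
    """
    if len(samples) < min_samples:
        return False  # too few readings to call it a trend
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


if __name__ == "__main__":
    # Hypothetical readings: a 300-second poll time and a 200-second poll cycle.
    print(is_poller_overloaded(200, 300))    # 200 is more than half of 300
    print(is_queue_growing([0, 4, 9, 15]))   # queue length rises every cycle
    print(is_queue_growing([2, 2, 2, 2]))    # steady queue length
```

A stricter or looser trend test (for example, comparing averages over two time windows) may suit noisy data better; the strict comparison here is only the simplest way to show the idea.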