System Resource Monitoring

In addition to specific function and operation monitoring, the SBC provides you with the ability to monitor overall system resource utilization to protect against undesirable behavior caused by the cumulative utilization of resources. You configure this feature using the resource-monitoring-profile element. This monitoring function operates independently on each member of an HA pair, protecting the standby from resource overutilization independently of the active.

This monitoring feature operates independently of other monitoring functions and takes precedence over those functions and configurations. From a high level, the feature:

Provides a central service that can monitor multiple critical resources
Receives resource utilization levels from other tasks and modules
Takes precautionary actions based on the information to prevent loss of service

This resource monitoring feature segregates system operation into sub-functions so that it can measure the status of each sub-function. This measurement allows the system to determine where resources are being depleted and take action to remedy the problem before it impacts system operation. This segregation establishes multiple types of resources that must be measured using different criteria. Ultimately, the feature calculates percent utilization using those criteria. This allows it to take actions based on your configured thresholds, which are also percentages. The types of resources monitored include:

Memory related resources—In these cases, the system measures the amount of memory available to the function against the full capacity of memory.
- System memory utilization (Heap Memory)
Session related resources—In these cases, the system measures how many applicable sessions are active against how many sessions are supported. An applicable resource-type includes SRTP_SESSIONS.
PPM related resources—For application-specific packet processing, the SBC establishes Packet Processing Modules (PPMs) to handle that traffic type. The system measures utilization of each resource that uses a corresponding PPM against the maximum number of packets that module can handle before diminishing system performance or behavior.
CommandQueues—Each SIP message initially enters the command queue from atcpd. Later, the SIP thread processes these messages. If a large number of SIP messages rapidly accumulate in the command queue, it can lead to the queue becoming overloaded. Currently, the queue limit is set to 1500. If more messages arrive beyond this threshold, the queue becomes overloaded.
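The percent-utilization idea behind these resource types can be sketched as follows. This is only an illustration, not the SBC's implementation: the function name and the sample counts are hypothetical, and it assumes that command queue utilization is measured as queue depth against the 1500-message limit mentioned above.

```python
# Illustrative sketch only; names and sample values are assumptions.
COMMAND_QUEUE_LIMIT = 1500  # queue limit stated in the text above


def percent_utilization(used: float, capacity: float) -> float:
    """Return how much of `capacity` is in use, as a percentage."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


# Hypothetical sample: 1200 messages waiting in a command queue.
queue_pct = percent_utilization(1200, COMMAND_QUEUE_LIMIT)

# Hypothetical sample: 450 active sessions out of 1000 supported.
session_pct = percent_utilization(450, 1000)
```

Because every resource type reduces to a percentage, the same configured thresholds can be compared against memory, session, PPM, and command queue measurements alike.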

Operation

The system uses this feature to monitor the resources that you target with configuration, and act when issues arise. After configuration, the system registers the configured resources with this monitoring function. After successful registration, the feature calculates the utilization values of a resource and generates reports to send that data to the Resource Monitoring Module.

Once reporting is initiated, registered resources calculate their resource utilization levels and send them to the Resource Monitoring Module in reports every 15 seconds. The system changes this report timing based on utilization level.

If the resource severity crosses the Minor threshold value, report timing remains at 15 seconds.
If the resource severity crosses the Major threshold value, report timing changes to 10 seconds.
If the resource severity crosses the Critical threshold value, report timing changes to 5 seconds.

Reporting times for command queues are different from the above:

If the resource severity crosses the Minor threshold value, report timing remains at 5 seconds.
If the resource severity crosses the Major threshold value, report timing changes to 3 seconds.
If the resource severity crosses the Critical threshold value, report timing changes to 2 seconds.
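The two timing lists above amount to a lookup from severity to report interval. The sketch below is a minimal illustration; the table layout, the function name, and the "NONE" label for a resource that has not crossed any threshold are assumptions, while the interval values come from the lists above.

```python
# Illustrative sketch only; structure and names are assumptions.
# Interval values (in seconds) are taken from the lists above.
REPORT_INTERVALS = {
    "default": {"NONE": 15, "MINOR": 15, "MAJOR": 10, "CRITICAL": 5},
    "command_queue": {"NONE": 5, "MINOR": 5, "MAJOR": 3, "CRITICAL": 2},
}


def report_interval(severity: str, is_command_queue: bool = False) -> int:
    """Return the reporting interval in seconds for a severity level."""
    table = "command_queue" if is_command_queue else "default"
    return REPORT_INTERVALS[table][severity]
```

For example, report_interval("MAJOR") gives 10, while report_interval("MAJOR", is_command_queue=True) gives 3.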

Note:

The SIP, MBCD and ATCPD command queues send a deregister request when they go down or generate an error.

When a resource crosses a threshold, it performs the action you configured for that threshold. The system then evaluates each subsequent report to determine if utilization has fallen below that threshold or has triggered a higher threshold.

Configuration

You use the multi-instance resource-monitoring-profile element to specify what the system monitors and what actions the system takes. Typically, you create a profile for each resource you want to monitor. To configure this feature, you:

Enable the resource-monitoring parameter in the system-config to enable monitoring of all configured resources.
Set parameters within the resource-monitoring-profile, especially:
- The resource-type—Specify the resource to monitor. You configure specific resources separately, allowing you to configure behavior per-resource and limit the resources you want to monitor.
- The processName—Applicable to COMMAND_QUEUE resources, this parameter allows you to limit the types of threads you monitor for command_queue issues. If the resource-type is anything other than COMMAND_QUEUE, this parameter must be set to ALL.
  Note also that, when you have set resource-type to COMMAND_QUEUE, you can enable new resources for monitoring by also specifying the processName. You do this by setting processName to the value PROCESS_xxx, where xxx specifies the process. Many processes have command queues. Currently, the SIPD, MBCD, and ATCPD processes are included in command queue monitoring.
- The threshold configuration sub-elements—Specify the system behavior when resources cross your configured thresholds.
- The action you want the system to take for each condition.

A complete step list for this configuration is provided below.

The system's verify-config checks flag this feature's parameters when:

The threshold value of a lower severity (for example, MINOR or MAJOR) is greater than the threshold value of a higher severity (for example, MAJOR or CRITICAL).
A resource is configured more than once.
The resource-type is COMMAND_QUEUE and its action is healthscore-decrement-value. In this case, the system throws a verify-config error.
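The checks listed above can be pictured with a small validation sketch. It does not reproduce the SBC's verify-config logic: the dictionary keys, the function name, and the choice to treat a resource-type and processName pair as the unit of duplication are all assumptions made for illustration.

```python
# Illustrative sketch only; the profile layout and key names are hypothetical.
def verify_profiles(profiles):
    """Return a list of human-readable problems found in the profiles."""
    errors = []
    seen = set()
    for p in profiles:
        # Assumption: a "resource" is identified by type plus process name.
        key = (p["resource_type"], p.get("process_name", "ALL"))
        if key in seen:
            errors.append(f"{key}: resource configured more than once")
        seen.add(key)

        t = p["thresholds"]  # e.g. {"MINOR": 55, "MAJOR": 60, "CRITICAL": 80}
        if t["MINOR"] > t["MAJOR"] or t["MAJOR"] > t["CRITICAL"]:
            errors.append(f"{key}: lower severity threshold exceeds a higher one")

        # Health score decrement is not allowed for command queues.
        if p["resource_type"] == "COMMAND_QUEUE" and p.get("healthscore_decrement", 0) > 0:
            errors.append(f"{key}: health score decrement not allowed")
    return errors


sample = [
    {"resource_type": "SRTP_SESSIONS",
     "thresholds": {"MINOR": 70, "MAJOR": 60, "CRITICAL": 90}},
    {"resource_type": "COMMAND_QUEUE",
     "thresholds": {"MINOR": 55, "MAJOR": 60, "CRITICAL": 80},
     "healthscore_decrement": 10},
    {"resource_type": "SRTP_SESSIONS",
     "thresholds": {"MINOR": 55, "MAJOR": 60, "CRITICAL": 80}},
]
problems = verify_profiles(sample)
```

In this sample, the first profile has its thresholds out of order, the second combines COMMAND_QUEUE with a health score decrement, and the third repeats a resource that is already configured, so three problems are reported.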

Actions

Actions you can configure this module to take include:

Raise an alarm—There are three alarm severity levels, including OL-1 (MINOR), OL-2 (MAJOR) and OL-3 (CRITICAL), which take precedence over hardcoded system values as well as values you configure within other features.
The system also issues a trap for the reported severity when it raises an alarm.

Note:
The system triggers a minor alarm if the usage percentage is above the minor threshold but below the major threshold. For example, if the major threshold is set to 60 and the minor threshold is set to 55, the resource monitor raises a minor alarm if the usage percentage is between 55 and 60.
Decrease the health score, triggering HA switchover when needed—You specify the health score decrement value within each minor-config, major-config and critical-config element individually.

Note:
You cannot use the health score decrement action in conjunction with a monitoring instance that has a resource-type set to COMMAND_QUEUE.
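The relationship between a usage percentage and the alarm severity it produces can be sketched as below, using the 55 and 60 thresholds from the note on minor alarms. The function name and the critical threshold of 80 are hypothetical, and the sketch assumes that a threshold is crossed only when usage strictly exceeds it; the text does not state how an exactly equal value is treated.

```python
# Illustrative sketch only; boundary handling is an assumption.
def classify(usage, minor, major, critical):
    """Return the highest severity whose threshold `usage` exceeds, or None."""
    if usage > critical:
        return "CRITICAL"  # corresponds to OL-3
    if usage > major:
        return "MAJOR"     # corresponds to OL-2
    if usage > minor:
        return "MINOR"     # corresponds to OL-1
    return None            # no threshold crossed, no alarm


# Thresholds 55 and 60 come from the note above; 80 is a hypothetical value.
severity_at_57 = classify(57, minor=55, major=60, critical=80)
```

Here a usage of 57 falls between the minor and major thresholds, so only the minor severity applies.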

Abatement

When resource utilization falls below your configured threshold values, the system refers to your configured abatement value for that profile and waits for utilization to fall further by your abatement setting before reversing an action.

For example, if you configure the threshold to 60, with a minor abatement value of 5, the Resource Monitoring Module triggers an action if the usage percentage exceeds 60. The system clears the alarm when the usage percentage drops to 54.

The system performs the following when utilization reaches the abatement value:

For raised alarms, the system clears the respective alarms.
For a decreased health score, the system resets the health score to its original value.

Note:
The system decrements health scores via this feature's triggers only when the applicable resource exceeds its critical-threshold 2 consecutive times.
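Abatement behaves like hysteresis: the level that clears an action sits below the level that triggers it. The sketch below illustrates this with the threshold of 60 and abatement of 5 from the example above. The class and function names are hypothetical, and the clearing rule (usage strictly below threshold minus abatement) is an assumption chosen to match the example, in which the alarm clears at 54. The helper at the end illustrates the note about two consecutive critical reports.

```python
# Illustrative sketch only; names and exact boundary rules are assumptions.
class AlarmState:
    """Track one threshold's alarm with an abatement margin."""

    def __init__(self, threshold, abatement):
        self.threshold = threshold
        self.abatement = abatement
        self.raised = False

    def update(self, usage):
        """Process one utilization report; return True while the alarm is raised."""
        if not self.raised and usage > self.threshold:
            self.raised = True
        elif self.raised and usage < self.threshold - self.abatement:
            self.raised = False
        return self.raised


def should_decrement(critical_flags, required=2):
    """True when the last `required` reports all exceeded the critical threshold."""
    return len(critical_flags) >= required and all(critical_flags[-required:])


alarm = AlarmState(threshold=60, abatement=5)
history = [alarm.update(u) for u in (58, 61, 57, 55, 54)]
```

In this run the alarm is raised at 61, stays raised at 57 and 55 even though usage is back under 60, and clears only at 54.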

Related Configuration

Specific functional and operational methods that prevent overload within the system and that operate independently of this System Resource Monitoring feature include:

DoS/DDoS protection
SIP Registration overload protection
Registration Cache Limit
CPU/Memory Load Limiting—When you enable this System Resource Monitoring feature to monitor heap, the system disables the alarm threshold function for heap.
Session Agent Constraints

To accommodate the limitations of these methods, the SBC feeds cumulative information about critical resources to the Resource Monitoring Module.