74 Monitoring ECE Components
Learn how to monitor system processes, such as memory and thread usage, of your Oracle Communications Elastic Charging Engine (ECE) components.
About Monitoring ECE Components
You can set up your system to monitor ECE components. When monitoring is configured, each ECE component exposes its JVM, Coherence, and application metrics through a single REST endpoint in the OpenMetrics exposition format. You can then use an external centralized metrics service, such as Prometheus, to scrape the ECE metrics and store them for analysis and monitoring.
ECE exposes metric data for the following components by default:
- ECE Server
- BRM Gateway
- Customer Updater
- Diameter Gateway
- EM Gateway
- HTTP Gateway
- CDR Formatter
- Pricing Updater
- RADIUS Gateway
- Rated Event Formatter
Setting up monitoring of these ECE components involves the following high-level tasks:
- Configuring ECE to scrape JVM, Coherence, and application metrics and expose them through a Micrometer Prometheus endpoint. See "Scraping and Exposing Metrics for ECE".
  The ECE metric data will be exposed through the following endpoint: http://localhost:19612/metrics.
- Setting up a centralized metrics service, such as Prometheus, to scrape metrics from the Micrometer Prometheus endpoint.
- Setting up a visualization tool, such as Grafana, to display your ECE metric data in a graphical format.
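
Before a full Prometheus setup is in place, it can help to confirm that an endpoint responds and to look at the samples it returns. The following Python sketch fetches the endpoint named above and parses the text exposition into tuples. It is a minimal illustration, not part of ECE: the function names are hypothetical, the parser handles only simple sample lines, and the label shown in the usage note (area="heap") is an assumption about the output.

```python
import re
from urllib.request import urlopen

# Matches "name{label="v",...} value" or "name value"; anything after the
# value (such as a timestamp) is ignored.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, HELP/TYPE comments, and "# EOF"
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this simple parser cannot read
        name, raw_labels, value = match.groups()
        # Label values are kept as written; escape sequences are not decoded.
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(url="http://localhost:19612/metrics", timeout=5):
    """Fetch and parse one scrape from an ECE metrics endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

For example, parse_metrics('jvm_memory_bytes_used{area="heap"} 1024.0') returns a single tuple holding the metric name, a one-entry label dictionary, and the value as a float.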
Scraping and Exposing Metrics for ECE
To configure ECE to scrape and expose JVM, Coherence, and application metrics for each ECE node in your system:
- Open the ECE_home/config/eceTopology.conf file in a text editor.
- Add the following information for each ECE node in your system:
  - metricsPort: Set this to a non-null port where the metrics will be exposed. The port number must be unique for each host.
  - isMetricsEnabled: Set this to true to enable monitoring of this node.
- Save and close the file.
- Perform a rolling upgrade of the ECE components.
Example 74-1 Exposing Metrics for All ECE Components
This shows sample eceTopology.conf entries for exposing the metrics of all ECE nodes except httpGateway1, cdrGateway1, and cdrFormatter1:
#node-name |role |host name (no spaces!) |host ip|JMX port |start CohMgt |JVM Tuning File|metricsPort |isMetricsEnabled
ecs1 |server |localhost | |9999 |true |defaultTuningProfile|22000 |true
customerUpdater1 |customerUpdater |localhost | |9996 |false |defaultTuningProfile|22004 |true
pricingUpdater |pricingUpdater |localhost | |9995 |false |defaultTuningProfile|22005 |true
brmGateway |brmGateway |localhost | |9994 |false |defaultTuningProfile|22006 |true
emGateway1 |emGateway |localhost | |9993 |false |defaultTuningProfile|22007 |true
ratedEventFormatter1 |ratedEventFormatter |localhost | |9992 |false |defaultTuningProfile|22008 |true
cdrFormatter1 |cdrFormatter |localhost | |19982 |false |defaultTuningProfile| |false
cdrGateway1 |cdrGateway |localhost | | |false |defaultTuningProfile| |false
diameterGateway1 |diameterGateway |localhost | |9991 |false |defaultTuningProfile|22009 |true
radiusGateway1 |radiusGateway |localhost | |9990 |false |defaultTuningProfile|22010 |true
httpGateway1 |httpGateway |localhost | | |false |defaultTuningProfile| |false
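
Because each metricsPort must be unique per host, a quick check of the topology file before the rolling upgrade can catch mistakes early. The following Python sketch reads rows laid out like the example above and builds one scrape URL per enabled node. It assumes the nine pipe-delimited columns shown in the example and the /metrics path mentioned earlier; the function names are hypothetical and a real eceTopology.conf may contain other columns.

```python
def parse_topology(text):
    """Return one dict per node row of a pipe-delimited eceTopology.conf."""
    columns = ["node", "role", "host", "ip", "jmx_port",
               "start_coh_mgt", "tuning_file", "metrics_port", "metrics_enabled"]
    nodes = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != len(columns):
            continue  # row does not have the nine columns assumed here
        nodes.append(dict(zip(columns, fields)))
    return nodes


def metrics_endpoints(nodes):
    """Build scrape URLs for enabled nodes; reject duplicate host/port pairs."""
    seen = set()
    urls = {}
    for node in nodes:
        if node["metrics_enabled"].lower() != "true":
            continue  # monitoring is not enabled for this node
        if not node["metrics_port"]:
            raise ValueError(f"{node['node']}: metrics enabled but no metricsPort")
        key = (node["host"], node["metrics_port"])
        if key in seen:
            raise ValueError(f"{node['node']}: port {key[1]} already used on {key[0]}")
        seen.add(key)
        urls[node["node"]] = f"http://{key[0]}:{key[1]}/metrics"
    return urls
```

The resulting URLs could then be copied into the scrape targets of your metrics service.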
ECE Metrics
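ECE collects metrics in the following groups to produce data for monitoring your ECE components:
- JVM Metrics
- BRS Metrics
- Coherence Cache Metrics
- Coherence Federated Service Metrics
- Coherence Service Metrics
- Kafka JMX Metrics
- Session Metrics
- Rated Events Metrics
- CDR Formatter Metrics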
JVM Metrics
The JVM Metrics group contains standard metrics about the central processing unit (CPU) and memory utilization of the JVMs that are members of the ECE grid. Table 74-1 lists the metrics in this group.
Table 74-1 JVM Metrics
Metric Name | Type | Description |
---|---|---|
jvm_memory_bytes_init | Gauge | Contains the initial size, in bytes, of the Java heap and non-heap memory. |
jvm_memory_bytes_committed | Gauge | Contains the committed size, in bytes, of the Java heap and non-heap memory. |
jvm_memory_bytes_used | Gauge | Contains the amount of Java heap and non-heap memory, in bytes, that is in use. |
jvm_memory_bytes_max | Gauge | Contains the maximum size, in bytes, of the Java heap and non-heap memory. |
jvm_memory_pool_bytes_init | Gauge | Contains the initial size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space. |
jvm_memory_pool_bytes_committed | Gauge | Contains the committed size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space. |
jvm_memory_pool_bytes_used | Gauge | Contains the amount of memory, in bytes, that is in use by the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space. |
jvm_buffer_count_buffers | Gauge | Contains the estimated number of mapped and direct buffers in the JVM memory pool. |
jvm_buffer_total_capacity_bytes | Gauge | Contains the estimated total capacity, in bytes, of the mapped and direct buffers in the JVM memory pool. |
process_cpu_usage | Gauge | Contains the CPU usage, as a percentage, for each ECE component on the server. This data is collected from the corresponding MBean attributes by JVMs. |
process_files_open_files | Gauge | Contains the total number of file descriptors currently available for an ECE component and the number of descriptors that are in use for that ECE component. |
coherence_os_system_cpu_load | Gauge | Contains the CPU load, as a percentage, for each system in the cluster. These statistics are based on the average data collected from all the ECE grid members running on a server. |
system_load_average_1m | Gauge | Contains the system load average (the number of items waiting in the CPU run queue) for each machine in the cluster. These statistics are based on the average data collected from all the ECE grid members running on a server. |
coherence_os_free_swap_space_size | Gauge | Contains system swap usage information (by default in megabytes) for each system in the cluster. These statistics are based on the average data collected from all the ECE grid members running on a server. |
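
A common use of the memory gauges is to derive heap utilization from jvm_memory_bytes_used and jvm_memory_bytes_max. The following Python sketch shows one way to do that from already-parsed samples. It assumes that each sample is a (name, labels, value) tuple and that heap samples carry an area="heap" label; both are assumptions for illustration, so check the labels your endpoint actually emits.

```python
def heap_utilization(samples):
    """Return the used/max heap ratio from (name, labels, value) samples.

    Returns None when either value is missing or the maximum is not defined.
    """
    used = maximum = None
    for name, labels, value in samples:
        if labels.get("area") != "heap":  # assumed label; see the text above
            continue
        if name == "jvm_memory_bytes_used":
            used = value
        elif name == "jvm_memory_bytes_max":
            maximum = value
    if used is None or maximum is None or maximum <= 0:
        return None  # the JVM reports -1 when no maximum is defined
    return used / maximum
```

A ratio that stays close to 1.0 across many scrapes suggests that the node is running near its configured heap limit.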
BRS Metrics
The BRS Metrics group contains the metrics for tracking the throughput and latency of the charging clients that use batch request service (BRS). Table 74-2 lists the metrics in this group.
Table 74-2 ECE BRS Metrics
Metric Name | Metric Type | Description |
---|---|---|
ece_brs_task_processed | Counter | Tracks the total number of requests accepted, processed, timed out, or rejected by the ECE component. You can use this metric to track the approximate processing rate over time, aggregated across all client applications, and so on. |
ece_brs_task_pending_count | Gauge | Contains the number of requests that are pending in the ECE component. |
ece_brs_current_latency_by_type | Gauge | Tracks the latency of a charging client per service type in the current query interval. These metrics are segregated and exposed from the BRS layer per service type and include event_type, product_type, and op_type tags. This metric provides the latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, and Refund_Unit. |
ece_brs_current_latency | Gauge | Tracks the current operation latency for a charging client in the current scrape interval. This metric contains the BRS statistics tracked using the charging.brsConfigurations MBean attributes. This configuration tracks the maximum and average latency for an operation type since the last query. The maximum window size for collecting this data is 30 seconds, so the query has to be run every 30 seconds. This metric provides the latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, Refund_Unit, and Spending_Limit_Report. |
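
Because ece_brs_task_processed is a counter, a processing rate has to be derived from two scrapes rather than read directly. The following Python sketch computes a per-second rate from two counter readings and their timestamps. The function name is hypothetical, and treating a decrease as a restart from zero is an assumption that mirrors how counters are commonly handled, not documented ECE behavior.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Return the per-second increase of a counter between two scrapes.

    Times are in seconds; the current scrape must be later than the previous.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("current scrape must be later than previous scrape")
    if curr_value < prev_value:
        # The counter went backwards: assume the component restarted and the
        # counter began again from zero.
        increase = curr_value
    else:
        increase = curr_value - prev_value
    return increase / elapsed
```

With scrapes 30 seconds apart, readings of 100 and then 400 give a rate of 10 requests per second.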
Coherence Cache Metrics
The Coherence Cache Metrics group provides operational and performance statistics for a cache. You can use these metrics, along with other metrics, to track the overall growth rate of certain caches. For more information about the Coherence cache metrics, see "Oracle Coherence MBeans Reference" in Oracle Fusion Middleware Managing Oracle Coherence.
Table 74-3 describes the metrics in this group.
Table 74-3 Coherence Cache Metrics
Metric Name | Metric Type | Description |
---|---|---|
coherence_cache_entries | Integer | Contains the total number of entries present in an ECE cache on an ECE node at the time the query is run. |
coherence_cache_total_gets | Long | Contains the total number of get operations performed on an ECE cache since the statistics were last reset. |
coherence_cache_misses | Long | Contains the approximate total number of misses on an ECE cache since the statistics were last reset. |
coherence_cache_total_puts | Long | Contains the total number of put operations made to an ECE cache since the statistics were last reset. |
coherence_cache_store_failures | Long | Contains the total number of cache store failures for load, store, and erase operations, since the statistics were last reset. The value is -1 if the cache store type is NONE. |
coherence_cache_store_read_millis | Long | Contains the cumulative time, in milliseconds, spent on load operations since the statistics were last reset. The value is -1 if the cache store type is NONE. |
coherence_cache_store_read_seconds | Long | Contains the cumulative time, in seconds, spent on load operations since the statistics were last reset. The value is -1 if the cache store type is NONE. |
coherence_cache_store_reads | Long | Contains the total number of load operations since the statistics were last reset. The value is -1 if the cache store type is NONE. |
coherence_cache_store_writes | Long | Contains the total number of store and erase operations since the statistics were last reset. The value is -1 if the cache store type is NONE. |
coherence_cache_store_write_millis | Long | Contains the cumulative time, in milliseconds, spent on store and erase operations since the statistics were last reset. The value is -1 if the cache store type is NONE. |
coherence_cache_store_write_seconds | Long | Contains the cumulative time, in seconds, spent on store and erase operations since the statistics were last reset. The value is -1 if the cache store type is NONE. |
coherence_cache_size | Gauge | Contains the total size of an ECE cache (by default in megabytes). |
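
The get and miss totals can be combined into a cache hit ratio. The following Python sketch shows the arithmetic, assuming both values were read from the same scrape of coherence_cache_total_gets and coherence_cache_misses for the same cache. The function name is hypothetical, and the result is only as precise as the miss count.

```python
def cache_hit_ratio(total_gets, misses):
    """Return the fraction of get operations served from the cache.

    Returns None when no get operations have been recorded since the last
    statistics reset.
    """
    if total_gets <= 0:
        return None
    # Clamp at zero in case the miss count slightly exceeds the get count.
    hits = max(total_gets - misses, 0)
    return hits / total_gets
```

For example, 1000 gets with 250 misses give a hit ratio of 0.75.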
Coherence Federated Service Metrics
The Coherence Federated Service Metrics group contains metrics for ECE federated caches when ECE persistence is disabled. The metrics in this group provide information such as the volume of data transferred and the number of objects transferred. You can use these metrics to monitor the data transferred from the primary production system to the remote or backup systems. This data is typically used for disaster recovery, where the Oracle NoSQL database is used for storing rated events. For more information about the Coherence federated metrics, see "Oracle Coherence MBeans Reference" in Oracle Fusion Middleware Managing Oracle Coherence.
Table 74-4 lists the metrics in this group.
Table 74-4 Coherence Federated Service Metrics
Metric Name | Type | Description |
---|---|---|
coherence_federated_service_bandwidth | Gauge | Tracks the current or maximum bandwidth used to transfer data from ECE charging nodes to a secondary ECE cluster. |
coherence_federated_service_rate | Gauge | Tracks the approximate rate of data transfer in bytes and number of messages sent. This metric uses the Coherence MBean attributes for tracking data. |
coherence_federated_service_replicate_millis | Gauge | Contains the cache replication latency, in milliseconds, for initial data replication. The type can be: total (total time taken for replication) or estimate_ttc (estimated time to complete). |
coherence_federated_service_replicate_seconds | Gauge | Contains the cache replication latency, in seconds, for initial data replication. The type can be: total (total time taken for replication) or estimate_ttc (estimated time to complete). |
coherence_federated_service_replicate_percent | Gauge | Contains the percentage of cache replication completed for a service. |
coherence_federated_service_status | Gauge | Contains the state or status of the service that is on federation. The state of a service can be: 1 (Initial), 2 (Idle), 3 (Ready), 4 (Sending), 5 (Connecting), 6 (Connect_Wait), 7 (Stopped), 8 (Paused), 9 (Error), 10 (Yielding), 11 (Backlog_Excessive), 12 (Backlog_Normal), and 13 (Disconnected). The status of a service can be: 1 (OK), 2 (Warning), and 3 (Error). |
coherence_federated_service_time_millis | Gauge | Contains the cache replication latency, in milliseconds, for data replicated to the remote cache. The type can be: apply, backlog_delay, and round_trip. These are the 90th percentile latency times. |
coherence_federated_service_time_seconds | Gauge | Contains the cache replication latency, in seconds, for data replicated to the remote cache. The type can be: apply and backlog_delay. These are the 90th percentile latency times. |
coherence_federated_service_total | Gauge | Contains the total number of bytes, cache entries, and messages that are replicated to the remote cache. The entity type can be: records (total number of journal records), bytes (total number of bytes sent), entries (total number of cache entries sent), message (total number of messages sent), response (total number of message responses received). The status can be: sent (entity shipped to remote cluster), unacked (messages sent without acknowledgment), and error (messages failed). |
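
The numeric values of coherence_federated_service_status are easier to read on a dashboard or in a log when they are translated back to names. The following Python sketch encodes the state and status codes listed in Table 74-4 as enumerations. The class and function names are hypothetical, and how a sample indicates whether it carries a state or a status is not covered here.

```python
from enum import IntEnum


class FederationState(IntEnum):
    """State codes listed for coherence_federated_service_status."""
    INITIAL = 1
    IDLE = 2
    READY = 3
    SENDING = 4
    CONNECTING = 5
    CONNECT_WAIT = 6
    STOPPED = 7
    PAUSED = 8
    ERROR = 9
    YIELDING = 10
    BACKLOG_EXCESSIVE = 11
    BACKLOG_NORMAL = 12
    DISCONNECTED = 13


class FederationStatus(IntEnum):
    """Status codes listed for coherence_federated_service_status."""
    OK = 1
    WARNING = 2
    ERROR = 3


def describe_state(value):
    """Translate a scraped state value (usually a float) into its name."""
    try:
        return FederationState(int(value)).name
    except ValueError:
        return f"UNKNOWN({value})"
```

For example, describe_state(4.0) returns "SENDING".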
Coherence Service Metrics
The Coherence Service Metrics group contains metrics for the Oracle Coherence cache services. You can use this data to monitor the server latency and load (per node) and the backlogs which may accumulate on the ECE nodes. For more information about the Coherence service metrics, see "Oracle Coherence MBeans Reference" in Oracle Fusion Middleware Managing Oracle Coherence.
Table 74-5 lists the metrics in this group.
Table 74-5 Coherence Cache Service Metrics
Metric Name | Type | Description |
---|---|---|
coherence_service_avg_thread_count | Float | Contains the average active thread count in the service thread pool. |
coherence_service_endangered_partitions | Integer | Contains the total number of partitions that are currently not backed up. The metric value is 0 if no partition is endangered, and a number greater than zero if any node has failed. |
coherence_service_ha_status | String | Contains the high-availability status for this service. The values are ENDANGERED, NODE-SAFE, MACHINE-SAFE, and N/A. |
coherence_service_request_avg_duration | Float | Contains the average duration, in milliseconds, of an individual request that was issued by the service since the last time the statistics were reset. |
coherence_service_request_count | Long | Contains the total number of synchronous requests run by the service since the last reset. |
coherence_service_request_max_duration | Long | Contains the maximum duration, in milliseconds, of a server-side request since the last reset. |
coherence_service_request_pending_count | Long | Contains the total number of requests currently pending for a service. A large number of pending tasks may indicate a performance or capacity problem. |
coherence_service_request_pending_duration | Long | Contains the duration, in milliseconds, of the request pending in a service. |
coherence_service_task_avg_duration | Float | Contains the average server-side task latency, in milliseconds, since the last reset. |
coherence_service_task_backlog | Integer | Contains the current server-side task backlog for each service. A large backlog is indicative of a performance or capacity problem. |
coherence_service_task_count | Long | Contains the total number of tasks processed by a service since the last reset. |
coherence_service_task_max_backlog | Integer | Contains the maximum size of the backlog queue that holds tasks scheduled to be executed by a service thread. |
coherence_service_unbalanced_partitions | Integer | Contains the total number of primary and backup partitions that remain to be transferred until the partition distribution across the storage-enabled service members is fully balanced. Typically, unbalanced partitions can occur temporarily during rebalancing operations when nodes are added or removed. |
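
Several of these metrics lend themselves to simple health checks: any endangered partition means data is not backed up, and a large task backlog or pending request count may indicate a capacity problem. The following Python sketch turns three scraped values into warning messages. The function name and the default limits of 100 are hypothetical placeholders, not Oracle recommendations; choose limits that match your own baseline.

```python
def service_warnings(endangered_partitions, task_backlog, pending_requests,
                     backlog_limit=100, pending_limit=100):
    """Return warning strings for one Coherence service on one node.

    The limits are illustrative defaults and should be tuned per deployment.
    """
    warnings = []
    if endangered_partitions > 0:
        warnings.append(f"{int(endangered_partitions)} partition(s) not backed up")
    if task_backlog > backlog_limit:
        warnings.append(f"task backlog {int(task_backlog)} exceeds {backlog_limit}")
    if pending_requests > pending_limit:
        warnings.append(f"{int(pending_requests)} pending requests exceed {pending_limit}")
    return warnings
```

An empty list means that none of the three checks fired for that scrape.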
Kafka JMX Metrics
The Kafka JMX Metrics group contains metrics for tracking the throughput and latency of the Kafka server and topics. Table 74-6 lists the metrics in this group.
Table 74-6 Kafka JMX Metrics
Metric Name | Type | Description |
---|---|---|
kafka_app_info_start_time_ms | Gauge | Indicates the start time in milliseconds. |
kafka_producer_metadata_wait_time_ns_total | Counter | Contains the total time, in nanoseconds, that the producer has spent waiting on topic metadata. |
kafka_producer_connection_close_rate | Gauge | Contains the number of connections closed per second. |
kafka_producer_iotime_total | Counter | Contains the total time the I/O thread spent doing I/O. |
kafka_producer_node_request_latency_max | Gauge | Contains the maximum latency, in milliseconds, of producer node requests. |
kafka_producer_txn_commit_time_ns_total | Counter | Contains the total time, in nanoseconds, that the producer has spent in commitTransaction. |
kafka_producer_record_error_total | Counter | Contains the total number of record sends that resulted in errors. |
kafka_producer_io_wait_time_ns_total | Counter | Contains the total time the I/O thread spent waiting. |
kafka_producer_io_ratio | Gauge | Contains the fraction of time the I/O thread spent doing I/O. |
kafka_producer_txn_begin_time_ns_total | Counter | Contains the total time, in nanoseconds, that the producer has spent in beginTransaction. |
Session Metrics
The Session Metrics group contains metrics on ECE server sessions. Table 74-7 lists the metrics in this group.
Table 74-7 Session Metrics
Metric Name | Type | Description |
---|---|---|
ece_session_metrics | Counter | Contains the total number of sessions opened or closed by rating group, node, or cluster. |
Rated Events Metrics
The Rated Events Metrics group contains metrics on rated events processed by ECE server sessions. Table 74-8 lists the metrics in this group.
Table 74-8 Rated Events Metrics
Metric Name | Type | Description |
---|---|---|
ece_rated_events_formatted | Counter | Contains the number of rated events formatted successfully or unsuccessfully by each RatedEventFormatter worker thread for each formatting job operation from NoSQL or the Oracle database. |
ece_rated_events_cached | Counter | Contains the total number of rated events cached by each ECE node. |
ece_rated_events_inserted | Counter | Contains the total number of rated events that were successfully inserted into the cache. |
ece_rated_events_insert_failed | Counter | Contains the total number of rated events that failed to be inserted into the cache. |
ece_rated_events_purged | Counter | Contains the total number of rated events that were purged from the Oracle NoSQL Database. |
ece_requests_by_result_code | Counter | Tracks the total number of requests processed, by result code. |
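
The inserted and failed counters can be combined into an insert failure ratio, which is often easier to alert on than either raw total. The following Python sketch shows the arithmetic, assuming both values come from the same scrape of ece_rated_events_inserted and ece_rated_events_insert_failed on the same node. The function name is hypothetical.

```python
def insert_failure_ratio(inserted, insert_failed):
    """Return the fraction of rated-event insert attempts that failed.

    Returns None when no insert attempts have been recorded.
    """
    attempts = inserted + insert_failed
    if attempts <= 0:
        return None
    return insert_failed / attempts
```

For example, 75 successful inserts and 25 failures give a ratio of 0.25.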
CDR Formatter Metrics
The CDR Formatter Metrics group contains the metrics for tracking Charging Function (CHF) records. Table 74-9 lists the metrics in this group.
Table 74-9 CDR Formatter Metrics
Metric Name | Metric Type | Description |
---|---|---|
ece_chf_records_processed | Counter | Tracks the total number of CHF records that have been processed by the CDR formatter. |
ece_chf_records_purged | Counter | Tracks the total number of CHF records that have been purged by the CDR formatter. |
ece_chf_records_loaded | Counter | Tracks the total number of CHF records that have been loaded by the CDR formatter. |