74 Monitoring ECE Components
Learn how to monitor system processes, such as memory and thread usage, of your Oracle Communications Elastic Charging Engine (ECE) components.
Note:
This document describes how to monitor ECE in an on-premises environment only. For information about monitoring ECE in a cloud native environment, see "Monitoring ECE in a Cloud Native Environment" in BRM Cloud Native System Administrator's Guide.
About Monitoring ECE Components
You can set up your system to monitor ECE on-premises components. When configured to do so, each ECE component exposes its Java Virtual Machine (JVM), Coherence, and application metrics through a single REST endpoint in the OpenMetrics exposition format. You can then use an external centralized metrics service, such as Prometheus, to scrape the ECE metrics and store them for analysis and monitoring.
Setting up monitoring of these ECE components involves the following high-level tasks:
- Configuring ECE to scrape JVM, Coherence, and application metrics and expose them through a Micrometer Prometheus endpoint. See "Scraping and Exposing Metrics for ECE".
- Setting up a centralized metrics service, such as Prometheus, to scrape metrics from the Micrometer Prometheus endpoint.
- Setting up a visualization tool, such as Grafana, to display your ECE metric data in a graphical format.
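Before pointing Prometheus at a node, it can help to read the endpoint by hand. The following is a minimal Python sketch, not part of ECE, that downloads a metrics payload and parses the text exposition format into name, label, and value tuples. The function names and the default /metrics path are assumptions made for illustration; use the path that your ECE node actually serves.

```python
import re
import urllib.request

# One sample line looks like: metric_name{label="value",...} 12.5
_SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_PAIR = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics_text(host, port, path="/metrics", timeout=5.0):
    """Download the raw payload. The default path is an assumption."""
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP", "# TYPE", and "# EOF" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore anything that is not a sample line
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

For example, `parse_metrics(fetch_metrics_text("localhost", 22000))` would read the node that exposes its metrics on port 22000, assuming that the node is running and that the path is correct.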
Scraping and Exposing Metrics for ECE
To configure ECE to scrape and expose JVM, Coherence, and application metrics for each ECE node in your system:
- Open the ECE_home/config/eceTopology.conf file in a text editor.
- Add the following information for each ECE node in your system:
  - metricsPort: Set this to a non-null port where the metrics will be exposed. The port number must be unique for each ECE JVM on a given host.
  - isMetricsEnabled: Set this to true to enable monitoring of this node.
- Save and close the file.
- Perform a rolling upgrade of the ECE components.
Example 74-1 Exposing Metrics for All ECE Components
This shows sample eceTopology.conf entries for exposing the metrics of ECE nodes:
#node-name |role |host name (no spaces!) |host ip|JMX port |start CohMgt |JVM Tuning File |metricsPort |isMetricsEnabled
ecs1 |server |localhost | |9999 |true |defaultTuningProfile |22000 |true
customerUpdater1 |customerUpdater |localhost | |9996 |false |defaultTuningProfile |22004 |true
pricingUpdater |pricingUpdater |localhost | |9995 |false |defaultTuningProfile |22005 |true
brmGateway |brmGateway |localhost | |9994 |false |defaultTuningProfile |22006 |true
emGateway1 |emGateway |localhost | |9993 |false |defaultTuningProfile |22007 |true
ratedEventFormatter1 |ratedEventFormatter |localhost | |9992 |false |defaultTuningProfile |22008 |true
diameterGateway1 |diameterGateway |localhost | |9991 |false |defaultTuningProfile |22009 |true
radiusGateway1 |radiusGateway |localhost | |9990 |false |defaultTuningProfile |22010 |true
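Because each ECE JVM on a host needs its own metricsPort, a quick check before the rolling upgrade can catch collisions. The following Python sketch is an illustration only: it assumes that every row has exactly the nine pipe-separated columns shown in the example above, and the column keys and function names are hypothetical.

```python
from collections import defaultdict

# Column keys are hypothetical names for the nine columns in the example.
COLUMNS = [
    "node", "role", "host", "ip", "jmx_port",
    "start_coh_mgt", "tuning_file", "metrics_port", "metrics_enabled",
]


def parse_topology(text):
    """Parse pipe-separated topology rows, skipping comments and blanks."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns: {line!r}")
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def duplicate_metrics_ports(rows):
    """Map (host, port) to node names when enabled nodes share a port."""
    seen = defaultdict(list)
    for row in rows:
        if row["metrics_enabled"].lower() == "true":
            seen[(row["host"], row["metrics_port"])].append(row["node"])
    return {key: nodes for key, nodes in seen.items() if len(nodes) > 1}
```

An empty result from `duplicate_metrics_ports` means that no two enabled nodes on the same host share a metrics port.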
About Monitoring Kafka Servers
It is strongly recommended that you also monitor Kafka metrics. Kafka exposes metrics about the state, usage, and performance of your Kafka brokers and Java clients via the Java Management Extensions (JMX) technology.
You can use the following tools to expose and monitor Kafka metrics:
- The Prometheus JMX Exporter can expose metrics as a Prometheus endpoint. See "JMX Exporter" on the GitHub website.
- The Kafka Exporter tool can expose additional useful metrics, such as consumer lag. See "Kafka Exporter" on the GitHub website.
You can also use these tools to help manage your Kafka clusters:
- The Strimzi Kafka Operator, which helps you to deploy and manage Kafka. It supports both Prometheus JMX Exporter and Kafka Exporter. See "Strimzi Custom Resource API Reference" on the Strimzi website.
- The Cruise Control application, which helps run large Apache Kafka clusters. See "Cruise Control for Apache Kafka" on the GitHub website.
ECE Metrics
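ECE collects metrics in the following groups to produce data for monitoring your ECE components:
- BRS Metrics
- Reactor Netty ConnectionProvider Metrics
- Reactor Netty HTTP Client Metrics
- BRS Queue Metrics
- Coherence Metrics
- Diameter Gateway Metrics
- EM Gateway Metrics
- JVM Metrics
- Kafka Client Metrics
- Kafka JMX Metrics
- Micrometer Executor Metrics
- RADIUS Gateway Metrics
- Rated Event Formatter (REF) Metrics
- Rated Events Metrics
- Session Metrics
Note:
Additional labels in the metrics indicate the name of the executor.
BRS Metrics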
The BRS Metrics group contains the metrics for tracking the throughput and latency of the charging clients that use batch request service (BRS). Table 74-1 lists the metrics in this group.
Table 74-1 BRS Metrics
Metric Name | Type | Description |
---|---|---|
ece.brs.message.receive | Counter | Tracks how many messages have been received. |
ece.brs.message.send | Counter | Tracks how many messages have been sent. |
ece.brs.task.processed | Counter | Tracks the total number of requests accepted, processed, timed out, or rejected by the ECE component. You can use this to track the approximate processing rate over time, aggregated across all client applications, and so on. |
ece.brs.task.pending.count | Gauge | Contains the number of requests that are pending for each ECE component. |
ece.brs.current.latency.by.type | Gauge | Tracks the latency of a charging client for each service type in the current query interval. These metrics are segregated and exposed from the BRS layer per service type and include event_type, product_type, and op_type tags. This metric provides the latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, and Refund_Unit. |
ece.brs.current.latency | Gauge | Tracks the current operation latency for a charging client in the current scrape interval. This metric contains the BRS statistics tracked using the charging.brsConfigurations MBean attributes. This configuration tracks the maximum and average latency for an operation type since the last query. The maximum window size for collecting this data is 30 seconds, so the query has to be run every 30 seconds. This metric provides the latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, Refund_Unit, and Spending_Limit_Report. |
ece.brs.retry.queue.phase.count | Counter | Tracks the count of operations performed on the retry queue. Additional label: phase |
ece.brs.task.resubmit | Counter | Tracks the number of tasks that were scheduled for retry. Additional label: resubmitReason |
ece.brs.task.retry.count | Counter | Tracks the distributions of the number of retries performed for a retried request. Additional labels: source, retries |
ece.brs.task.retry.distribution | Distribution Summary | Tracks the distributions of the number of retries performed for a retried request. Additional label: source |
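Counters such as ece.brs.task.processed only ever increase, so a processing rate has to be derived from the difference between successive scrapes. The following Python sketch shows one way to do that by hand, assuming that a counter restarts from zero when the component restarts. The function name and the sample layout are illustrative; in Prometheus itself you would normally use the built-in rate() function instead.

```python
def counter_rate(samples):
    """Return the average per-second increase of a counter.

    samples is a time-ordered list of (timestamp_seconds, counter_value).
    A drop in value is treated as a restart from zero (an assumption).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # Counter reset: count only what accumulated after the restart.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

Passing three scrapes taken 30 seconds apart gives the average rate over the full 60-second window.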
Reactor Netty ConnectionProvider Metrics
The Reactor Netty ConnectionProvider Metrics group contains standard metrics that provide insights into the pooled ConnectionProvider, which supports built-in integration with Micrometer. Table 74-2 lists the metrics in this group.
For more information about Reactor Netty ConnectionProvider metrics, see "Reactor Netty Reference Guide" in the Project Reactor documentation: https://projectreactor.io/docs/netty/1.1.17/reference/index.html.
Table 74-2 Reactor Netty ConnectionProvider Metrics
Metric Name | Type | Description |
---|---|---|
reactor.netty.connection.provider.total.connections | Gauge | Tracks the number of active or idle connections. |
reactor.netty.connection.provider.active.connections | Gauge | Tracks the number of connections that have been successfully acquired and are in active use. |
reactor.netty.connection.provider.max.connections | Gauge | Tracks the maximum number of active connections that are allowed. |
reactor.netty.connection.provider.idle.connections | Gauge | Tracks the number of idle connections. |
reactor.netty.connection.provider.pending.connections | Gauge | Tracks the number of requests that are waiting for a connection. |
reactor.netty.connection.provider.pending.connections.time | Timer | Tracks the time spent waiting to acquire a connection from the connection pool. |
reactor.netty.connection.provider.max.pending.connections | Gauge | Tracks the maximum number of requests that are queued while waiting for a ready connection. |
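The active, maximum, and pending connection gauges are most useful when read together. The following Python sketch is one possible way to combine them into a saturation check. The function names and the 0.9 threshold are hypothetical choices for illustration, not values recommended by ECE or Reactor Netty.

```python
def pool_utilization(active, max_connections):
    """Return active/max as a fraction, or None if max is not positive."""
    if max_connections <= 0:
        return None
    return active / max_connections


def pool_is_saturated(active, max_connections, pending, threshold=0.9):
    """Flag a pool that has waiting requests or is close to its limit.

    threshold is a hypothetical default; tune it for your deployment.
    """
    utilization = pool_utilization(active, max_connections)
    return pending > 0 or (utilization is not None and utilization >= threshold)
```

A pool with pending requests is reported as saturated regardless of utilization, because waiting requests already indicate that callers cannot get a connection immediately.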
Reactor Netty HTTP Client Metrics
The Reactor Netty HTTP Client Metrics group contains standard metrics that provide insights into the HTTP client, which supports built-in integration with Micrometer. Table 74-3 lists the metrics in this group.
For more information about Reactor Netty HTTP Client metrics, see "Reactor Netty Reference Guide" in the Project Reactor documentation: https://projectreactor.io/docs/netty/1.1.17/reference/index.html.
Table 74-3 Reactor Netty HTTP Client Metrics
Metric Name | Type | Description |
---|---|---|
reactor.netty.http.client.data.received | DistributionSummary | Tracks the amount of data received, in bytes. |
reactor.netty.http.client.data.sent | DistributionSummary | Tracks the amount of data sent, in bytes. |
reactor.netty.http.client.errors | Counter | Tracks the number of errors that occurred. |
reactor.netty.http.client.tls.handshake.time | Timer | Tracks the amount of time spent on TLS handshakes. |
reactor.netty.http.client.connect.time | Timer | Tracks the amount of time spent connecting to the remote address. |
reactor.netty.http.client.address.resolver | Timer | Tracks the amount of time spent resolving the remote address. |
reactor.netty.http.client.data.received.time | Timer | Tracks the amount of time spent consuming incoming data. |
reactor.netty.http.client.data.sent.time | Timer | Tracks the amount of time spent sending outgoing data. |
reactor.netty.http.client.response.time | Timer | Tracks the total time for the request or response. |
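Timer metrics are cumulative: a Micrometer Prometheus endpoint typically publishes each Timer as a running count of events and a running sum of their durations, commonly with _seconds_count and _seconds_sum suffixes. The following Python sketch shows how an average latency for one scrape interval could be derived from two consecutive readings. The dictionary layout and function name are assumptions made for the example.

```python
def mean_latency(previous, current):
    """Return the mean duration of events recorded between two scrapes.

    previous and current are dicts with cumulative "count" and "sum" values,
    where "sum" is the total recorded time. Returns None when no new events
    were recorded or when the timer appears to have been reset.
    """
    delta_count = current["count"] - previous["count"]
    delta_sum = current["sum"] - previous["sum"]
    if delta_count <= 0 or delta_sum < 0:
        return None
    return delta_sum / delta_count
```

Dividing the change in the sum by the change in the count gives the average for that interval only, rather than the average since the component started.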
BRS Queue Metrics
The BRS Queue Metrics group contains the metrics for tracking the throughput and latency of the BRS queue. Table 74-4 lists the metrics in this group.
Table 74-4 BRS Queue Metrics
Metric Name | Type | Description |
---|---|---|
ece.eviction.queue.size | Gauge | Tracks the number of items in the queue. |
ece.eviction.queue.eviction.batch.size | Gauge | Tracks the number of queue items the eviction cycle processes in each iteration. |
ece.eviction.queue.time | Timer | Tracks the amount of time items spend in the queue. |
ece.eviction.queue.operation.duration | Timer | Tracks the time it takes to perform an operation on the queue. |
ece.eviction.queue.scheduled.operation.duration | Timer | Tracks the time it takes to perform a scheduled operation on the queue. |
ece.eviction.queue.operation.failed | Counter | Counts the number of failures for a queue operation. |
Coherence Metrics
All Coherence metrics that are available through the Coherence metrics endpoint are also accessible through the ECE metrics endpoint.
- For details of the Coherence metrics, see "Oracle Coherence MBeans Reference" in Oracle Fusion Middleware Managing Oracle Coherence.
- For information about querying Coherence metrics, see "Querying for Coherence Metrics" in Oracle Fusion Middleware Managing Oracle Coherence.
Diameter Gateway Metrics
The Diameter Gateway Metrics group contains metrics for events processed by the Diameter Gateway. Table 74-5 lists the metrics in this group.
Table 74-5 Diameter Gateway Metrics
Metric Name | Type | Description |
---|---|---|
ece.requests.by.result.code | Counter | Tracks the total number of requests processed for each result code. |
ece.diameter.current.latency.by.type | Gauge | Tracks the latency of an Sy request for each operation type in the current query interval. The SLR_INITIAL_REQUEST, SLR_INTERMEDIATE_REQUEST, and STR operations are tracked. |
ece.diameter.session.count | Gauge | Tracks the count of the currently cached Diameter sessions. Additional label: Identity |
ece.diameter.session.cache.capacity | Gauge | Tracks the maximum number of Diameter session cache entries. Additional label: Identity |
ece.diameter.session.sub.count | Gauge | Tracks the count of currently cached active ECE sessions. This is the count of sessions in the right side of the session map (Map<String, Map<String, DiameterSession>>). |
ece.diameter.notification.requests.sent | Timer | Tracks the amount of time taken to send a Diameter notification. Additional labels: protocol, notificationType, result |
EM Gateway Metrics
The EM Gateway Metrics group contains standard metrics that provide insights into the current status of your EM Gateway activity and tasks. Table 74-6 lists the metrics in this group.
Table 74-6 EM Gateway Metrics
Metric Name | Type | Description |
---|---|---|
ece.emgw.processing.latency | Timer | Tracks the overall time taken in the EM Gateway. Additional label: handler |
ece.emgw.handler.processing.latency | Timer | Tracks the total processing time taken for each request processed by a handler. Additional label: handler |
ece.emgw.handler.processing.latency.by.phase | Timer | Tracks the time it takes to send a request to the dispatcher or BRS. Additional labels: phase, handler |
ece.emgw.handler.error.count | Counter | Tracks the number of failed requests. Additional labels: handler, failureReason |
ece.emgw.opcode.formatter.error | Counter | Tracks the number of opcode formatter errors. Additional label: phase |
JVM Metrics
The JVM Metrics group contains standard metrics about the central processing unit (CPU) and memory utilization of JVMs that are members of the ECE grid. Table 74-7 lists the metrics in this group.
Table 74-7 JVM Metrics
Metric Name | Type | Description |
---|---|---|
jvm.memory.bytes.init | Gauge | Contains the initial size, in bytes, for the Java heap and non-heap memory. |
jvm.memory.bytes.committed | Gauge | Contains the committed size, in bytes, for the Java heap and non-heap memory. |
jvm.memory.bytes.used | Gauge | Contains the amount, in bytes, of Java heap and non-heap memory that is in use. |
jvm.memory.bytes.max | Gauge | Contains the maximum size, in bytes, for the Java heap and non-heap memory. |
jvm.memory.pool.bytes.init | Gauge | Contains the initial size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space. |
jvm.memory.pool.bytes.committed | Gauge | Contains the committed size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space. |
jvm.memory.pool.bytes.used | Gauge | Contains the amount, in bytes, of Java memory space in use by the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space. |
jvm.buffer.count.buffers | Gauge | Contains the estimated number of mapped and direct buffers in the JVM memory pool. |
jvm.buffer.total.capacity.bytes | Gauge | Contains the estimated total capacity, in bytes, of the mapped and direct buffers in the JVM memory pool. |
process.cpu.usage | Gauge | Contains the CPU percentage for each ECE component on the server. This data is collected from the corresponding MBean attributes by JVMs. |
process.files.open.files | Gauge | Contains the total number of file descriptors currently available for an ECE component and the descriptors in use for that ECE component. |
coherence.os.system.cpu.load | Gauge | Contains the CPU load information percentage for each system in the cluster. These statistics are based on the average data collected from all the ECE grid members running on a server. |
system.load.average.1m | Gauge | Contains the system load average (the number of items waiting in the CPU run queue) for each machine in the cluster. These statistics are based on the average data collected from all the ECE grid members running on a server. |
coherence.os.free.swap.space.size | Gauge | Contains system swap usage information (by default in megabytes) for each system in the cluster. These statistics are based on the average data collected from all the ECE grid members running on a server. |
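The used and maximum memory gauges are usually combined into a utilization ratio for alerting. The following Python sketch shows that calculation for a single memory area or pool. It assumes that an undefined maximum is reported as -1, which is how the JVM reports a pool with no configured limit; the function name is illustrative.

```python
def memory_utilization(used_bytes, max_bytes):
    """Return used/max as a fraction between 0 and 1.

    Returns None when the maximum is undefined (reported as -1) or zero,
    because a ratio is meaningless in that case.
    """
    if max_bytes <= 0:
        return None
    return used_bytes / max_bytes
```

When the maximum is undefined, comparing the used value against the committed value is a common fallback.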
Kafka Client Metrics
The Kafka Client Metrics group contains metrics for tracking the throughput, latency, and performance of Kafka producer and consumer clients.
Note:
All Kafka producer metrics apply to the ECS and BRM Gateway. All Kafka consumer metrics apply to the BRM Gateway, HTTP Gateway, and Diameter Gateway.
For more information about the available metrics, refer to the following Apache Kafka documentation:
- Producer Metrics: https://kafka.apache.org/36/generated/producer_metrics.html
- Consumer Metrics: https://kafka.apache.org/36/generated/consumer_metrics.html
Kafka JMX Metrics
The Kafka JMX Metrics group contains metrics for tracking the throughput and latency of the Kafka server and topics. Table 74-8 lists the metrics in this group.
Table 74-8 Kafka JMX Metrics
Metric Name | Type | Description |
---|---|---|
kafka.app.info.start.time.ms | Gauge | Indicates the start time in milliseconds. |
kafka.producer.connection.close.rate | Gauge | Contains the number of connections closed per second. |
kafka.producer.io.ratio | Gauge | Contains the fraction of time the I/O thread spent doing I/O. |
kafka.producer.io.wait.time.ns.total | Counter | Contains the total time the I/O thread spent waiting. |
kafka.producer.iotime.total | Counter | Contains the total time the I/O thread spent doing I/O. |
kafka.producer.metadata.wait.time.ns.total | Counter | Contains the total time, in nanoseconds, that the producer has spent waiting on topic metadata. |
kafka.producer.node.request.latency.max | Gauge | Contains the maximum latency, in milliseconds, of producer node requests. |
kafka.producer.record.error.total | Counter | Contains the total number of record sends that resulted in errors. |
kafka.producer.txn.begin.time.ns.total | Counter | Contains the total time, in nanoseconds, that the producer has spent in beginTransaction. |
kafka.producer.txn.commit.time.ns.total | Counter | Contains the total time, in nanoseconds, that the producer has spent in commitTransaction. |
Micrometer Executor Metrics
The Micrometer Executor Metrics group contains standard metrics that provide insights into the activity of your thread pool and the status of tasks. Table 74-9 lists the metrics in this group.
Note:
The Micrometer API optionally allows a prefix to the name. In the table below, replace prefix with ece.brs for BRS metrics or ece.emgw for EM Gateway metrics.
Table 74-9 Micrometer Executor Metrics
Metric Name | Type | Description |
---|---|---|
prefix.executor.completed.tasks | FunctionCounter | Tracks the approximate total number of tasks that have completed. Additional label: Identity |
prefix.executor.active.threads | Gauge | Tracks the approximate number of threads that are actively running tasks. Additional label: Identity |
prefix.executor.queued.tasks | Gauge | Tracks the approximate number of tasks that are queued. Additional label: Identity |
prefix.executor.queue.remaining.tasks | Gauge | Tracks the number of additional elements that this queue can accept without blocking. Additional label: Identity |
prefix.executor.pool.size.threads | Gauge | Tracks the current number of threads in the pool. Additional label: Identity |
prefix.executor.pool.core.threads | Gauge | Tracks the core number of threads for the pool. Additional label: Identity |
prefix.executor.pool.max.threads | Gauge | Tracks the maximum allowed number of threads in the pool. Additional label: Identity |
RADIUS Gateway Metrics
Table 74-10 RADIUS Gateway Metrics
Metric Name | Type | Description |
---|---|---|
ece.radius.sent.disconnect.message.counter.total | Counter | Tracks the number of unique disconnect messages sent to the Network Access Server (NAS), excluding the retried ones. |
ece.radius.retried.disconnect.message.counter.total | Counter | Tracks the number of retried disconnect messages, excluding the total number of retries. |
ece.radius.successful.disconnect.message.counter.total | Counter | Tracks the number of successful disconnect messages. |
ece.radius.failed.disconnect.message.counter.total | Counter | Tracks the number of failed disconnect messages. |
ece.radius.auth.extension.user.data.latency | Timer | Tracks the following: |
Rated Event Formatter (REF) Metrics
The REF Metrics group contains standard metrics that provide insights into the current status of your REF activity and tasks. Table 74-11 lists the REF metrics in this group.
Table 74-11 REF Metrics
Metric Name | Type | Description |
---|---|---|
ece.rated.events.checkpoint.interval | Gauge | Tracks the time, in seconds, used by the REF instance to read a set of rated events at a specific time interval. |
ece.rated.events.ripe.duration | Gauge | Tracks the duration, in seconds, that rated events have existed before they can be processed. |
ece.rated.events.worker.count | Gauge | Contains the number of worker threads used to process rated events. |
ece.rated.events.phase.latency | Timer | Tracks the amount of time taken to complete a rated event phase. This only measures successful phases. Additional labels: phase, siteName |
ece.rated.events.phase.failed | Counter | Tracks the number of rated event phase operations that have failed. Additional labels: phase, siteName |
ece.rated.events.checkpoint.age | Gauge | Tracks the difference in time between the retrieved data and the current time stamp. Additional labels: phase, siteName |
ece.rated.events.batch.size | Gauge | Tracks the number of rated events retrieved on each iteration. Additional labels: phase, siteName |
Rated Events Metrics
The Rated Events Metrics group contains metrics on rated events processed by ECE server sessions. Table 74-12 lists the metrics in this group.
Table 74-12 Rated Events Metrics
Metric Name | Type | Description |
---|---|---|
ece.rated.events.formatted | Counter | Contains the number of successful or failed formatted rated events per RatedEventFormatter worker thread upon each formatting job operation. |
ece.rated.events.cached | Counter | Contains the total number of rated events cached by each ECE node. |
ece.rated.events.inserted | Counter | Contains the total number of rated events that were successfully inserted into the database. |
ece.rated.events.insert.failed | Counter | Contains the total number of rated events that failed to be inserted into the database. |
ece.rated.events.purged | Counter | Contains the total number of rated events that are purged. |
ece.requests.by.result.code | Counter | Tracks the total number of requests processed for each result code. |
Session Metrics
The Session Metrics group contains metrics on ECE server sessions. Table 74-13 lists the metrics in this group.
Table 74-13 Session Metrics
Metric Name | Type | Description |
---|---|---|
ece.session.metrics | Counter | Contains the total number of sessions opened or closed by rating group, node, or cluster. |