74 Monitoring ECE Components

Learn how to monitor system processes, such as memory and thread usage, of your Oracle Communications Elastic Charging Engine (ECE) components.

Note:

This document describes how to monitor ECE in an on-premises environment only. For information about monitoring ECE in a cloud native environment, see "Monitoring ECE in a Cloud Native Environment" in BRM Cloud Native System Administrator's Guide.

About Monitoring ECE Components

You can set up your system to monitor ECE on-premises components. When configured to do so, each ECE component exposes Java Virtual Machine (JVM), Coherence, and application metrics through a single REST endpoint in the OpenMetrics exposition format. You can then use an external centralized metrics service, such as Prometheus, to scrape the ECE metrics and store them for analysis and monitoring.

Setting up monitoring of these ECE components involves the following high-level tasks:

  1. Configuring ECE to scrape JVM, Coherence, and application metrics and expose them through a Micrometer Prometheus endpoint. See "Scraping and Exposing Metrics for ECE".

  2. Setting up a centralized metrics service, such as Prometheus, to scrape metrics from the Micrometer Prometheus endpoint.

  3. Setting up a visualization tool, such as Grafana, to display your ECE metric data in a graphical format.
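Before you point Prometheus at an ECE node, it can help to confirm that the endpoint responds and returns parseable text. The following is a minimal Python sketch of such a check, not part of ECE. The /metrics path, the function names, and the simplified parsing rules are assumptions for illustration; confirm the actual endpoint path for your deployment.

```python
import urllib.request


def parse_exposition(text):
    """Parse Prometheus/OpenMetrics text into (name, labels, value) tuples.

    Simplified on purpose: assumes label values contain no commas, escaped
    quotes, or closing braces, and that sample lines carry no timestamps.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines such as "# TYPE", "# HELP", "# EOF".
        if not line or line.startswith("#"):
            continue
        head, _, value = line.rpartition(" ")
        name, labels = head, {}
        if "{" in head:
            name, _, rest = head.partition("{")
            for pair in filter(None, rest.rstrip("}").split(",")):
                key, _, val = pair.partition("=")
                labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(host, port, path="/metrics", timeout=5):
    """Fetch and parse metrics from one node.

    The default path is an assumption; host and port come from your
    eceTopology.conf entries.
    """
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_exposition(resp.read().decode("utf-8"))
```

For example, fetch_metrics("localhost", 22000) would return the samples exposed by a node whose metricsPort is 22000, provided that the assumed path is correct.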

Scraping and Exposing Metrics for ECE

To configure ECE to scrape and expose JVM, Coherence, and application metrics for each ECE node in your system:

  1. Open the ECE_home/config/eceTopology.conf file in a text editor.

  2. Add the following information for each ECE node in your system:

    • metricsPort: Set this to a non-null port where the metrics will be exposed. The port number must be unique for each ECE JVM on a given host.

    • isMetricsEnabled: Set this to true to enable monitoring of this node.

  3. Save and close the file.

  4. Perform a rolling upgrade of the ECE components.

Example 74-1 Exposing Metrics for All ECE Components

This example shows sample eceTopology.conf entries for exposing the metrics of ECE nodes:

#node-name              |role                    |host name (no spaces!)  |host ip|JMX port  |start CohMgt  |JVM Tuning File       |metricsPort  |isMetricsEnabled
ecs1                    |server                  |localhost               |       |9999      |true          |defaultTuningProfile  |22000        |true            
customerUpdater1        |customerUpdater         |localhost               |       |9996      |false         |defaultTuningProfile  |22004        |true
pricingUpdater          |pricingUpdater          |localhost               |       |9995      |false         |defaultTuningProfile  |22005        |true
brmGateway              |brmGateway              |localhost               |       |9994      |false         |defaultTuningProfile  |22006        |true
emGateway1              |emGateway               |localhost               |       |9993      |false         |defaultTuningProfile  |22007        |true
ratedEventFormatter1    |ratedEventFormatter     |localhost               |       |9992      |false         |defaultTuningProfile  |22008        |true
diameterGateway1        |diameterGateway         |localhost               |       |9991      |false         |defaultTuningProfile  |22009        |true
radiusGateway1          |radiusGateway           |localhost               |       |9990      |false         |defaultTuningProfile  |22010        |true
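Because each ECE JVM on a host needs its own metricsPort, it can be useful to validate the file before restarting components. The following Python sketch parses rows laid out like Example 74-1, reports ports that are reused on the same host, and lists host:port targets for a metrics service to scrape. The dictionary keys and function names are hypothetical, and the sketch assumes that every row has exactly the nine pipe-separated columns shown in the example.

```python
from collections import defaultdict

# Hypothetical keys for the nine columns shown in Example 74-1.
COLUMNS = ["name", "role", "host", "ip", "jmx_port", "start_coh_mgt",
           "tuning_file", "metrics_port", "metrics_enabled"]


def parse_topology(text):
    """Return one dict per non-comment row of an eceTopology.conf-style file."""
    nodes = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns: {line!r}")
        nodes.append(dict(zip(COLUMNS, fields)))
    return nodes


def is_enabled(node):
    return node["metrics_enabled"].lower() == "true"


def duplicate_metrics_ports(nodes):
    """Map (host, port) to node names where a port is reused on one host."""
    seen = defaultdict(list)
    for node in filter(is_enabled, nodes):
        seen[(node["host"], node["metrics_port"])].append(node["name"])
    return {key: names for key, names in seen.items() if len(names) > 1}


def scrape_targets(nodes):
    """List host:port targets for every node with metrics enabled."""
    return [f'{n["host"]}:{n["metrics_port"]}' for n in nodes if is_enabled(n)]
```

An empty result from duplicate_metrics_ports indicates that no two enabled nodes share a port on the same host.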

About Monitoring Kafka Servers

It is strongly recommended that you also monitor Kafka metrics. Kafka exposes metrics about the state, usage, and performance of your Kafka brokers and Java clients via the Java Management Extensions (JMX) technology.

You can use the following tools to expose and monitor Kafka metrics:

  • The Prometheus JMX Exporter can expose metrics as a Prometheus endpoint. See "JMX Exporter" on the GitHub website.

  • The Kafka Exporter tool can expose additional useful metrics, such as consumer lag. See "Kafka Exporter" on the GitHub website.

You can also use these tools to help manage your Kafka clusters:

  • The Strimzi Kafka Operator, which helps you to deploy and manage Kafka. It supports both Prometheus JMX Exporter and Kafka Exporter. See "Strimzi Custom Resource API Reference" on the Strimzi website.

  • The Cruise Control application, which helps run large Apache Kafka clusters. See "Cruise Control for Apache Kafka" on the GitHub website.

ECE Metrics

ECE collects metrics in the following groups to produce data for monitoring your ECE components:

Note:

Additional labels in the metrics indicate the name of the executor.

BRS Metrics

The BRS Metrics group contains the metrics for tracking the throughput and latency of the charging clients that use batch request service (BRS). Table 74-1 lists the metrics in this group.

Table 74-1 BRS Metrics

Metric Name Type Description
ece.brs.message.receive Counter Tracks how many messages have been received.
ece.brs.message.send Counter Tracks how many messages have been sent.

ece.brs.task.processed

Counter

Tracks the total number of requests accepted, processed, timed out, or rejected by the ECE component.

You can use this to track the approximate processing rate over time, aggregated across all client applications, and so on.

ece.brs.task.pending.count Gauge

Contains the number of requests that are pending for each ECE component.

ece.brs.current.latency.by.type Gauge

Tracks the latency of a charging client for each service type in the current query interval. These metrics are segregated and exposed from the BRS layer per service type and include event_type, product_type, and op_type tags.

This metric provides the latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, and Refund_Unit.

ece.brs.current.latency Gauge

Tracks the current operation latency for a charging client in the current scrape interval.

This metric contains the BRS statistics tracked using the charging.brsConfigurations MBean attributes. This configuration tracks the maximum and average latency for an operation type since the last query. The maximum window size for collecting this data is 30 seconds, so the query has to be run every 30 seconds.

This metric provides the latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, Refund_Unit, and Spending_Limit_Report.

ece.brs.retry.queue.phase.count Counter Tracks the count of operations performed on the retry queue.

Additional Label: phase

ece.brs.task.resubmit Counter Tracks the number of tasks that were scheduled for retry.

Additional Label: resubmitReason

ece.brs.task.retry.count Counter Tracks the distributions of the number of retries performed for a retried request.

Additional Labels: source, retries

ece.brs.task.retry.distribution Distribution Summary Tracks the distributions of the number of retries performed for a retried request.

Additional Label: source
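Counters such as ece.brs.task.processed only ever increase, so a processing rate has to be derived from two samples taken at different times. The following Python sketch shows one way to do that, including a simple guard for a counter that restarts from zero. It is an illustration under stated assumptions, not ECE code; the function name is hypothetical, and a metrics service such as Prometheus provides comparable rate calculations of its own.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Return the per-second increase of a counter between two samples.

    Times are in seconds. If the counter went backwards, the process is
    assumed to have restarted, and the current value is treated as the
    increase since the restart.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        delta = curr_value
    return delta / elapsed
```

For example, a counter that moves from 1000 to 1600 over 30 seconds corresponds to a rate of 20 requests per second.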

Reactor Netty ConnectionProvider Metrics

The Reactor Netty ConnectionProvider Metrics group contains standard metrics that provide insights into the pooled ConnectionProvider, which supports built-in integration with Micrometer. Table 74-2 lists the metrics in this group.

For more information about Reactor Netty ConnectionProvider metrics, see "Reactor Netty Reference Guide" in the Project Reactor documentation: https://projectreactor.io/docs/netty/1.1.17/reference/index.html.

Table 74-2 Reactor Netty ConnectionProvider Metrics

Metric Name Type Description

reactor.netty.connection.provider.total.connections

Gauge

Tracks the number of active or idle connections.

reactor.netty.connection.provider.active.connections

Gauge

Tracks the number of connections that have been successfully acquired and are in active use.

reactor.netty.connection.provider.max.connections

Gauge

Tracks the maximum number of active connections that are allowed.

reactor.netty.connection.provider.idle.connections

Gauge

Tracks the number of idle connections.

reactor.netty.connection.provider.pending.connections

Gauge

Tracks the number of requests that are waiting for a connection.

reactor.netty.connection.provider.pending.connections.time

Timer

Tracks the time spent waiting to acquire a connection from the connection pool.

reactor.netty.connection.provider.max.pending.connections

Gauge

Tracks the maximum number of requests that are queued while waiting for a ready connection.
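The gauges in this table are most useful in combination. As one illustration, the following Python sketch derives a pool utilization ratio from the active and maximum connection gauges and flags a pool as saturated when utilization is high while requests are also waiting. The function names and the 0.9 threshold are hypothetical choices, not recommended values.

```python
def pool_utilization(active, max_connections):
    """Return active/max as a ratio, or None if the maximum is not positive."""
    if max_connections <= 0:
        return None
    return active / max_connections


def pool_is_saturated(active, pending, max_connections, threshold=0.9):
    """Flag a pool that is near capacity and has requests waiting.

    The threshold is an illustrative default; tune it for your workload.
    """
    utilization = pool_utilization(active, max_connections)
    return utilization is not None and utilization >= threshold and pending > 0
```

A pool with high utilization but no pending requests is not flagged, because nothing is waiting for a connection.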

Reactor Netty HTTP Client Metrics

The Reactor Netty HTTP Client Metrics group contains standard metrics that provide insights into the HTTP client, which supports built-in integration with Micrometer. Table 74-3 lists the metrics in this group.

For more information about Reactor Netty HTTP Client metrics, see "Reactor Netty Reference Guide" in the Project Reactor documentation: https://projectreactor.io/docs/netty/1.1.17/reference/index.html.

Table 74-3 Reactor Netty HTTP Client Metrics

Metric Name Type Description

reactor.netty.http.client.data.received

DistributionSummary

Tracks the amount of data received, in bytes.

reactor.netty.http.client.data.sent

DistributionSummary

Tracks the amount of data sent, in bytes.

reactor.netty.http.client.errors

Counter

Tracks the number of errors that occurred.

reactor.netty.http.client.tls.handshake.time

Timer

Tracks the amount of time spent for TLS handshakes.

reactor.netty.http.client.connect.time

Timer

Tracks the amount of time spent connecting to the remote address.

reactor.netty.http.client.address.resolver

Timer

Tracks the amount of time spent resolving the remote address.

reactor.netty.http.client.data.received.time

Timer

Tracks the amount of time spent consuming incoming data.

reactor.netty.http.client.data.sent.time

Timer

Tracks the amount of time spent in sending outgoing data.

reactor.netty.http.client.response.time

Timer

Tracks the total time for the request or response.

BRS Queue Metrics

The BRS Queue Metrics group contains the metrics for tracking the throughput and latency of the BRS queue. Table 74-4 lists the metrics in this group.

Table 74-4 BRS Queue Metrics

Metric Name Type Description

ece.eviction.queue.size

Gauge

Tracks the number of items in the queue.

ece.eviction.queue.eviction.batch.size

Gauge

Tracks the number of queue items the eviction cycle processes in each iteration.

ece.eviction.queue.time

Timer

Tracks the amount of time items spend in the queue.

ece.eviction.queue.operation.duration

Timer

Tracks the time it takes to perform an operation on the queue.

ece.eviction.queue.scheduled.operation.duration

Timer

Tracks the time it takes to perform a scheduled operation on the queue.

ece.eviction.queue.operation.failed

Counter

Counts the number of failures for a queue operation.

Coherence Metrics

All Coherence metrics that are available through the Coherence metrics endpoint are also accessible through the ECE metrics endpoint.

Diameter Gateway Metrics

The Diameter Gateway Metrics group contains metrics for events processed by the Diameter Gateway. Table 74-5 lists the metrics in this group.

Table 74-5 Diameter Gateway Metrics

Metric Name Type Description

ece.requests.by.result.code

Counter

Tracks the total number of requests processed for each result code.

ece.diameter.current.latency.by.type

Gauge

Tracks the latency of an Sy request for each operation type in the current query interval. The SLR_INITIAL_REQUEST, SLR_INTERMEDIATE_REQUEST, and STR operations are tracked.

ece.diameter.session.count

Gauge

Tracks the count of the currently cached diameter sessions.

Additional label: Identity

ece.diameter.session.cache.capacity

Gauge

Tracks the maximum number of diameter session cache entries.

Additional label: Identity

ece.diameter.session.sub.count

Gauge

Tracks the count of currently cached active ECE sessions. This is the count of sessions in the right side of the session map (Map<String, Map<String, DiameterSession>>).

ece.diameter.notification.requests.sent

Timer

Tracks the amount of time taken to send a diameter notification.

Additional labels: protocol, notificationType, result

EM Gateway Metrics

The EM Gateway Metrics group contains standard metrics that provide insights into the current status of your EM Gateway activity and tasks. Table 74-6 lists the metrics in this group.

Table 74-6 EM Gateway Metrics

Metric Name Type Description

ece.emgw.processing.latency

Timer

Tracks the overall time taken in the EM Gateway.

Additional label: handler

ece.emgw.handler.processing.latency

Timer

Tracks the total processing time taken for each request processed by a handler.

Additional label: handler

ece.emgw.handler.processing.latency.by.phase

Timer

Tracks the time it takes to send a request to the dispatcher or BRS.

Additional labels: phase, handler

ece.emgw.handler.error.count

Counter

Tracks the number of failed requests.

Additional labels: handler, failureReason

ece.emgw.opcode.formatter.error

Counter

Tracks the number of opcode formatter errors.

Additional label: phase

JVM Metrics

The JVM Metrics group contains standard metrics about the central processing unit (CPU) and memory utilization of JVMs that are members of the ECE grid. Table 74-7 lists the metrics in this group.

Table 74-7 JVM Metrics

Metric Name Type Description

jvm.memory.bytes.init

Gauge

Contains the initial size, in bytes, for the Java heap and non-heap memory.

jvm.memory.bytes.committed

Gauge

Contains the committed size, in bytes, for the Java heap and non-heap memory.

jvm.memory.bytes.used

Gauge

Contains the amount, in bytes, of Java heap and non-heap memory that is in use.

jvm.memory.bytes.max

Gauge

Contains the maximum size, in bytes, for the Java heap and non-heap memory.

jvm.memory.pool.bytes.init

Gauge

Contains the initial size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space.

jvm.memory.pool.bytes.committed

Gauge

Contains the committed size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space.

jvm.memory.pool.bytes.used

Gauge

Contains the amount, in bytes, of Java memory space in use by the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space.

jvm.buffer.count.buffers

Gauge

Contains the estimated number of mapped and direct buffers in the JVM memory pool.

jvm.buffer.total.capacity.bytes

Gauge

Contains the estimated total capacity, in bytes, of the mapped and direct buffers in the JVM memory pool.

process.cpu.usage

Gauge

Contains the CPU percentage for each ECE component on the server. This data is collected from the corresponding MBean attributes by JVMs.

process.files.open.files

Gauge

Contains the total number of file descriptors currently available for an ECE component and the descriptors in use for that ECE component.

coherence.os.system.cpu.load

Gauge

Contains the CPU load percentage for each system in the cluster.

These statistics are based on the average data collected from all the ECE grid members running on a server.

system.load.average.1m

Gauge

Contains the system load average (the number of items waiting in the CPU run queue) for each machine in the cluster.

These statistics are based on the average data collected from all the ECE grid members running on a server.

coherence.os.free.swap.space.size

Gauge

Contains system swap usage information (by default in megabytes) for each system in the cluster.

These statistics are based on the average data collected from all the ECE grid members running on a server.

Kafka Client Metrics

The Kafka Client Metrics group contains metrics for tracking the throughput, latency, and performance of Kafka producer and consumer clients.

Note:

All Kafka producer metrics apply to the ECS and BRM Gateway. All Kafka consumer metrics apply to the BRM Gateway, HTTP Gateway, and Diameter Gateway.

For more information about the available metrics, refer to the Apache Kafka documentation.

Kafka JMX Metrics

The Kafka JMX Metrics group contains metrics for tracking the throughput and latency of the Kafka server and topics. Table 74-8 lists the metrics in this group.

Table 74-8 Kafka JMX Metrics

Metric Name Type Description

kafka.app.info.start.time.ms

Gauge

Indicates the start time in milliseconds.

kafka.producer.connection.close.rate

Gauge

Contains the number of connections closed per second.

kafka.producer.io.ratio

Gauge

Contains the fraction of time the I/O thread spent doing I/O.

kafka.producer.io.wait.time.ns.total

Counter

Contains the total time the I/O thread spent waiting.

kafka.producer.iotime.total

Counter

Contains the total time the I/O thread spent doing I/O.

kafka.producer.metadata.wait.time.ns.total

Counter

Contains the total time, in nanoseconds, the producer has spent waiting on topic metadata.

kafka.producer.node.request.latency.max

Gauge

Contains the maximum latency, in milliseconds, of producer node requests.

kafka.producer.record.error.total

Counter

Contains the total number of record sends that resulted in errors.

kafka.producer.txn.begin.time.ns.total

Counter

Contains the total time, in nanoseconds, the producer has spent in beginTransaction.

kafka.producer.txn.commit.time.ns.total

Counter

Contains the total time, in nanoseconds, the producer has spent in commitTransaction.

Micrometer Executor Metrics

The Micrometer Executor Metrics group contains standard metrics that provide insights into the activity of your thread pool and the status of tasks. Table 74-9 lists the metrics in this group.

Note:

The Micrometer API optionally allows a prefix to the name. In the table below, replace prefix with ece.brs for BRS metrics or ece.emgw for EM Gateway metrics.

Table 74-9 Micrometer Executor Metrics

Metric Name Type Description
prefix.executor.completed.tasks FunctionCounter Tracks the approximate total number of tasks that have completed.

Additional label: Identity

prefix.executor.active.threads Gauge Tracks the approximate number of threads that are actively running tasks.

Additional label: Identity

prefix.executor.queued.tasks Gauge Tracks the approximate number of tasks that are queued.

Additional label: Identity

prefix.executor.queue.remaining.tasks Gauge Tracks the number of additional elements that this queue can accept without blocking.

Additional label: Identity

prefix.executor.pool.size.threads Gauge Tracks the current number of threads in the pool.

Additional label: Identity

prefix.executor.pool.core.threads Gauge Tracks the core number of threads for the pool.

Additional label: Identity

prefix.executor.pool.max.threads Gauge Tracks the maximum allowed number of threads in the pool.

Additional label: Identity
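To make the prefix substitution concrete, the following Python sketch expands the metric names in Table 74-9 for a given prefix and converts a dotted name to the underscore form that commonly appears in Prometheus-style output. The helper names are hypothetical. The conversion shown is only the dot-to-underscore step; the exposed name may also carry type-specific suffixes, so verify names against the actual endpoint output.

```python
# Metric names from Table 74-9, without the prefix.
EXECUTOR_METRICS = [
    "executor.completed.tasks",
    "executor.active.threads",
    "executor.queued.tasks",
    "executor.queue.remaining.tasks",
    "executor.pool.size.threads",
    "executor.pool.core.threads",
    "executor.pool.max.threads",
]

# Prefixes named in the note that precedes the table.
PREFIXES = {"BRS": "ece.brs", "EM Gateway": "ece.emgw"}


def expand(prefix):
    """Return the full dotted metric names for one prefix."""
    return [f"{prefix}.{metric}" for metric in EXECUTOR_METRICS]


def to_underscore_name(name):
    """Replace dots with underscores (suffix handling is deliberately omitted)."""
    return name.replace(".", "_")
```

For example, expand(PREFIXES["BRS"]) begins with ece.brs.executor.completed.tasks.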

RADIUS Gateway Metrics

The RADIUS Gateway Metrics group contains standard metrics that track the customer usage of services. Table 74-10 lists the metrics in this group.

Table 74-10 RADIUS Gateway Metrics

Metric Name Type Description

ece.radius.sent.disconnect.message.counter.total

Counter

Tracks the number of unique disconnect messages sent to the Network Access Server (NAS), excluding the retried ones.

ece.radius.retried.disconnect.message.counter.total

Counter

Tracks the number of retried disconnect messages, excluding the total number of retries.

ece.radius.successful.disconnect.message.counter.total

Counter

Tracks the number of successful disconnect messages.

ece.radius.failed.disconnect.message.counter.total

Counter

Tracks the number of failed disconnect messages.

ece.radius.auth.extension.user.data.latency

Timer

Tracks the following:

  • The latency of converting a customer into an extension customer.

  • The latency of converting a balance into an extension balance map.

  • The latency of getting a user data response from the user data.

Rated Event Formatter (REF) Metrics

The REF Metrics group contains standard metrics that provide insights into the current status of your REF activity and tasks. Table 74-11 lists the REF metrics in this group.

Table 74-11 REF Metrics

Metric Name Type Description

ece.rated.events.checkpoint.interval

Gauge

Tracks the time, in seconds, used by the REF instance to read a set of rated events at a specific time interval.

ece.rated.events.ripe.duration

Gauge

Tracks the duration, in seconds, that rated events have existed before they can be processed.

ece.rated.events.worker.count

Gauge

Contains the number of worker threads used to process rated events.

ece.rated.events.phase.latency

Timer

Tracks the amount of time taken to complete a rated event phase. This only measures successful phases.

Additional labels: phase, siteName

ece.rated.events.phase.failed

Counter

Tracks the number of rated event phase operations that have failed.

Additional labels: phase, siteName

ece.rated.events.checkpoint.age

Gauge

Tracks the difference in time between the retrieved data and the current time stamp.

Additional labels: phase, siteName

ece.rated.events.batch.size

Gauge

Tracks the number of rated events retrieved on each iteration.

Additional labels: phase, siteName

Rated Events Metrics

The Rated Events Metrics group contains metrics on rated events processed by ECE server sessions. Table 74-12 lists the metrics in this group.

Table 74-12 Rated Events Metrics

Metric Name Type Description

ece.rated.events.formatted

Counter

Contains the number of successful or failed formatted rated events per RatedEventFormatter worker thread upon each formatting job operation.

ece.rated.events.cached

Counter

Contains the total number of rated events cached by each ECE node.

ece.rated.events.inserted

Counter

Contains the total number of rated events that were successfully inserted into the database.

ece.rated.events.insert.failed

Counter

Contains the total number of rated events that failed to be inserted into the database.

ece.rated.events.purged

Counter

Contains the total number of rated events that are purged.

ece.requests.by.result.code

Counter

Tracks the total number of requests processed for each result code.

Session Metrics

The Session Metrics group contains metrics on ECE server sessions. Table 74-13 lists the metrics in this group.

Table 74-13 Session Metrics

Metric Name Type Description

ece.session.metrics

Counter

Contains the total number of sessions opened or closed by rating group, node, or cluster.