32 Monitoring ECE in a Cloud Native Environment

You can monitor system resources, such as memory and thread usage, for your Oracle Communications Elastic Charging Engine (ECE) components in a cloud native environment.

Topics in this document:

  • About Monitoring ECE in a Cloud Native Environment

  • Enabling ECE Metric Endpoints

  • Sample Prometheus Operator Configuration

  • ECE Cloud Native Metrics

About Monitoring ECE in a Cloud Native Environment

You can set up monitoring of your ECE components in a cloud native environment. When configured to do so, ECE exposes JVM, Coherence, and application metric data through a single HTTP endpoint in an OpenMetrics/Prometheus exposition format. You can then use an external centralized metrics service, such as Prometheus, to scrape the ECE cloud native metrics and store them for analysis and monitoring.

Note:

  • ECE only exposes the metrics on an HTTP endpoint. It does not provide the Prometheus service.

  • Do not modify the oc-cn-ece-helm-chart/templates/ece-ecs-metricsservice.yaml file. It is used only during ECE startup and rolling upgrades. It is not used for monitoring.

ECE cloud native exposes metric data for the following components by default:

  • ECE Server

  • BRM Gateway

  • Customer Updater

  • Diameter Gateway

  • EM Gateway

  • HTTP Gateway

  • CDR Formatter

  • Pricing Updater

  • RADIUS Gateway

  • Rated Event Formatter

Setting up monitoring of these ECE cloud native components involves the following high-level tasks:

  1. Ensuring that the ECE metric endpoints are enabled. See "Enabling ECE Metric Endpoints".

    ECE cloud native exposes metric data through the following endpoint: http://localhost:19612/metrics.

  2. Setting up a centralized metrics service, such as Prometheus Operator, to scrape metrics from the endpoint.

    For an example of how to configure Prometheus Operator to scrape ECE metric data, see "Sample Prometheus Operator Configuration".

  3. Setting up a visualization tool, such as Grafana, to display your ECE metric data in a graphical format.
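If you scrape the endpoint with a standalone Prometheus server rather than Prometheus Operator, you can use a static scrape configuration instead of a ServiceMonitor. The following is a minimal sketch; the job name and the target address are assumptions, so replace the target with the service name, namespace, and metrics port of your ECE deployment:

scrape_configs:
  - job_name: ece-metrics
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets:
          # Assumed target: the headless Service described in "Sample Prometheus
          # Operator Configuration", in a namespace named ece.
          - prom-ece-metrics.ece.svc.cluster.local:19612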

Enabling ECE Metric Endpoints

The default ECE cloud native configuration exposes JVM, Coherence, and application metric data for all ECE components through a single REST endpoint. If you create additional instances of ECE components, you must configure them to expose metric data as well.

To ensure that the ECE metric endpoints are enabled:

  1. Open your override-values.yaml file for oc-cn-ece-helm-chart.

  2. Verify that the charging.metrics.port key is set to the port number where you want to expose the ECE metrics. The default is 19612.

  3. Verify that each ECE component instance has metrics enabled.

    Each application role under the charging key can be configured to enable or disable metrics. In the jvmOpts key, setting the ece.metrics.http.service.enabled option enables (true) or disables (false) the metrics service for that role.

    For example, these override-values.yaml entries would enable the metrics service for ecs1.

    charging:
       labels: "ece"
       jmxport: "9999"
       …
       metrics:
          port: "19612"
       ecs1:
          jmxport: ""
          replicas: 1
          …
          jvmOpts: "-Dece.metrics.http.service.enabled=true"
          restartCount: "0"
  4. Save and close your override-values.yaml file.

  5. Run the helm upgrade command to update your ECE Helm release:

    helm upgrade EceReleaseName oc-cn-ece-helm-chart --namespace EceNameSpace --values OverrideValuesFile

    where:

    • EceReleaseName is the release name for oc-cn-ece-helm-chart.

    • EceNameSpace is the namespace in which to create ECE Kubernetes objects for the ECE Helm chart.

    • OverrideValuesFile is the name and location of your override-values.yaml file for oc-cn-ece-helm-chart.
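If you have created additional instances of ECE components, enable the metrics service for each of them in the same way. For example, the following sketch enables metrics for a second ECE server instance; the role name ecs2 is an assumption, so use the name of the instance defined in your override-values.yaml file. Only the metrics-related keys are shown:

charging:
   metrics:
      port: "19612"
   ecs2:
      replicas: 1
      # Enables the metrics HTTP service for this instance.
      jvmOpts: "-Dece.metrics.http.service.enabled=true"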

Sample Prometheus Operator Configuration

After installing Prometheus Operator, you configure it to scrape metrics from the ECE metric endpoint. The following shows sample entries you can use to create Prometheus Service and ServiceMonitor objects that scrape ECE metric data.

This sample creates a Service object that specifies to:

  • Select all pods with the app label ece

  • Scrape metrics from port 19612

apiVersion: v1
kind: Service
metadata:
  name: prom-ece-metrics
  labels:
    application: prom-ece-metrics
spec:
  ports:
    - name: metrics
      port: 19612
      protocol: TCP
      targetPort: 19612
  selector:
    app: ece
  sessionAffinity: None
  type: ClusterIP
  clusterIP: None

This sample creates a ServiceMonitor object that specifies to:

  • Select Service objects in the namespace named ece

  • Select all Service objects with the application label prom-ece-metrics

  • Scrape metrics from the HTTP path /metrics every 15 seconds

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prom-ece-metrics
spec:
  endpoints:
    - interval: 15s
      path: /metrics
      port: metrics
      scheme: http
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - ece
  selector:
    matchLabels:
      application: prom-ece-metrics
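For Prometheus Operator to scrape the ServiceMonitor, the Prometheus custom resource must select it. The following is a minimal sketch of such a resource; the resource name, service account, and selectors are assumptions that depend on how Prometheus Operator is deployed in your cluster:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  # Empty selectors match all ServiceMonitors in all namespaces.
  # Narrow them with matchLabels or matchNames if needed.
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}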

For more information about configuring Prometheus Operator, see https://github.com/prometheus-operator/prometheus-operator/tree/main/Documentation.

ECE Cloud Native Metrics

ECE cloud native collects metrics in the following groups to produce data for monitoring your ECE components:

Note:

Additional labels in the metrics indicate the name of the executor.

BRS Metrics

The BRS Metrics group contains the metrics for tracking the throughput and latency of the charging clients that use batch request service (BRS).

Table 32-1 lists the metrics in this group.

Table 32-1 BRS Metrics

Metric Name Type Description

ece.brs.message.receive

Counter

Tracks how many messages have been received.

ece.brs.message.send

Counter

Tracks how many messages have been sent.

ece.brs.task.processed

Counter

Tracks the total number of requests accepted, processed, timed out, or rejected by the ECE component.

You can use this metric to track the approximate processing rate over time, aggregated across all client applications, and so on. See the example after this table.

ece.brs.task.pending.count

Gauge

Contains the number of requests that are pending for each ECE component.

ece.brs.current.latency.by.type

Gauge

Tracks the latency of a charging client for each service type in the current query interval. These metrics are segregated and exposed from the BRS layer per service type and include event_type, product_type, and op_type tags.

This metric provides latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, and Refund_Unit.

ece.brs.current.latency

Gauge

Tracks the current operation latency for a charging client in the current scrape interval.

This metric contains the BRS statistics tracked using the charging.brsConfigurations MBean attributes. This configuration tracks the maximum and average latency for an operation type since the last query. The maximum window size for collecting this data is 30 seconds, so the query must be run every 30 seconds.

This metric provides latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, Refund_Unit, and Spending_Limit_Report.

ece.brs.retry.queue.phase.count

Counter

Tracks the count of operations performed on the retry queue.

Additional label: phase

ece.brs.task.resubmit

Counter

Tracks the number of tasks that were scheduled for retry.

Additional label: resubmitReason

ece.brs.task.retry.count

Counter

Tracks the distribution of the number of retries performed for a retried request.

Additional labels: source, retries

ece.brs.task.retry.distribution

Distribution Summary

Tracks the distribution of the number of retries performed for a retried request.

Additional label: source
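As noted for ece.brs.task.processed, you can derive an approximate processing rate from the counter. The following sketch shows a Prometheus recording rule for this; the exposed metric name is an assumption, because the exporter may translate dots to underscores and append a suffix such as _total, so confirm the name that actually appears on the /metrics endpoint:

groups:
  - name: ece-brs-rules
    rules:
      # Approximate BRS request processing rate over the last 5 minutes.
      - record: ece_brs_task_processed:rate5m
        expr: rate(ece_brs_task_processed_total[5m])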

Reactor Netty ConnectionProvider Metrics

The Reactor Netty ConnectionProvider Metrics group contains standard metrics that provide insights into the pooled ConnectionProvider, which supports built-in integration with Micrometer. Table 32-2 lists the metrics in this group.

For more information about Reactor Netty ConnectionProvider metrics, see the "Reactor Netty Reference Guide" in the Project Reactor documentation: https://projectreactor.io/docs/netty/1.1.17/reference/index.html.

Table 32-2 Reactor Netty ConnectionProvider Metrics

Metric Name Type Description

reactor.netty.connection.provider.total.connections

Gauge

Tracks the number of active or idle connections.

reactor.netty.connection.provider.active.connections

Gauge

Tracks the number of connections that have been successfully acquired and are in active use.

reactor.netty.connection.provider.max.connections

Gauge

Tracks the maximum number of active connections that are allowed.

reactor.netty.connection.provider.idle.connections

Gauge

Tracks the number of idle connections.

reactor.netty.connection.provider.pending.connections

Gauge

Tracks the number of requests that are waiting for a connection.

reactor.netty.connection.provider.pending.connections.time

Timer

Tracks the time spent waiting to acquire a connection from the connection pool.

reactor.netty.connection.provider.max.pending.connections

Gauge

Tracks the maximum number of requests that are queued while waiting for a ready connection.

Reactor Netty HTTP Client Metrics

The Reactor Netty HTTP Client Metrics group contains standard metrics that provide insights into the HTTP client, which supports built-in integration with Micrometer. Table 32-3 lists the metrics in this group.

For more information about Reactor Netty HTTP client metrics, see the "Reactor Netty Reference Guide" in the Project Reactor documentation: https://projectreactor.io/docs/netty/1.1.17/reference/index.html.

Table 32-3 Reactor Netty HTTP Client Metrics

Metric Name Type Description

reactor.netty.http.client.data.received

DistributionSummary

Tracks the amount of data received, in bytes.

reactor.netty.http.client.data.sent

DistributionSummary

Tracks the amount of data sent, in bytes.

reactor.netty.http.client.errors

Counter

Tracks the number of errors that occurred.

reactor.netty.http.client.tls.handshake.time

Timer

Tracks the amount of time spent for TLS handshakes.

reactor.netty.http.client.connect.time

Timer

Tracks the amount of time spent connecting to the remote address.

reactor.netty.http.client.address.resolver

Timer

Tracks the amount of time spent resolving the remote address.

reactor.netty.http.client.data.received.time

Timer

Tracks the amount of time spent consuming incoming data.

reactor.netty.http.client.data.sent.time

Timer

Tracks the amount of time spent in sending outgoing data.

reactor.netty.http.client.response.time

Timer

Tracks the total time for the request or response.

BRS Queue Metrics

The BRS Queue Metrics group contains the metrics for tracking the throughput and latency of the BRS queue. Table 32-4 lists the metrics in this group.

Table 32-4 BRS Queue Metrics

Metric Name Type Description

ece.eviction.queue.size

Gauge

Tracks the number of items in the queue.

ece.eviction.queue.eviction.batch.size

Gauge

Tracks the number of queue items the eviction cycle processes in each iteration.

ece.eviction.queue.time

Timer

Tracks the amount of time items spend in the queue.

ece.eviction.queue.operation.duration

Timer

Tracks the time it takes to perform an operation on the queue.

ece.eviction.queue.scheduled.operation.duration

Timer

Tracks the time it takes to perform a scheduled operation on the queue.

ece.eviction.queue.operation.failed

Counter

Counts the number of failures for a queue operation.

CDR Formatter Metrics

The CDR Formatter Metrics group contains the metrics for tracking Charging Function (CHF) records. Table 32-5 lists the metrics in this group.

Table 32-5 CDR Formatter Metrics

Metric Name Type Description

ece.chf.records.processed

Counter

Tracks the total number of CHF records the CDR formatter has processed.

ece.chf.records.purged

Counter

Tracks the total number of CHF records the CDR formatter purged.

ece.chf.records.loaded

Counter

Tracks the total number of CHF records the CDR formatter has loaded.

Coherence Metrics

All Coherence metrics that are available through the Coherence metrics endpoint are also accessible through the ECE metrics endpoint.

Diameter Gateway Metrics

The Diameter Gateway group contains metrics on events processed by the Diameter Gateway. Table 32-6 lists the metrics in this group.

Table 32-6 Diameter Gateway Metrics

Metric Name Type Description

ece.requests.by.result.code

Counter

Tracks the total number of requests processed for each result code.

ece.diameter.current.latency.by.type

Gauge

Tracks the latency of an Sy request for each operation type in the current query interval. The SLR_INITIAL_REQUEST, SLR_INTERMEDIATE_REQUEST, and STR operations are tracked.

ece.diameter.session.count

Gauge

Tracks the count of the currently cached diameter sessions.

Additional label: Identity

ece.diameter.session.cache.capacity

Gauge

Tracks the maximum number of diameter session cache entries.

Additional label: Identity

ece.diameter.session.sub.count

Gauge

Tracks the count of currently cached active ECE sessions. This is the count of sessions in the right side of the session map (Map<String, Map<String, DiameterSession>>).

ece.diameter.notification.requests.sent

Timer

Tracks the amount of time taken to send a diameter notification.

Additional labels: protocol, notificationType, result

EM Gateway Metrics

The EM Gateway Metrics group contains standard metrics that provide insights into the current status of your EM Gateway activity and tasks. Table 32-7 lists the metrics in this group.

Table 32-7 EM Gateway Metrics

Metric Name Type Description

ece.emgw.processing.latency

Timer

Tracks the overall time taken in the EM Gateway.

Additional label: handler

ece.emgw.handler.processing.latency

Timer

Tracks the total processing time taken for each request processed by a handler.

Additional label: handler

ece.emgw.handler.processing.latency.by.phase

Timer

Tracks the time it takes to send a request to the dispatcher or BRS.

Additional label: phase,handler

ece.emgw.handler.error.count

Counter

Tracks the number of failed requests.

Additional label: handler, failureReason

ece.emgw.opcode.formatter.error

Counter

Tracks the number of opcode formatter errors.

Additional label: phase

JVM Metrics

The JVM Metrics group contains standard metrics about the central processing unit (CPU) and memory utilization of JVMs which are members of the ECE grid. Table 32-8 lists the metrics in this group.

Table 32-8 JVM Metrics

Metric Name Type Description

jvm.memory.bytes.init

Gauge

Contains the initial size, in bytes, for the Java heap and non-heap memory.

jvm.memory.bytes.committed

Gauge

Contains the committed size, in bytes, for the Java heap and non-heap memory.

jvm.memory.bytes.used

Gauge

Contains the amount, in bytes, of Java heap and non-heap memory in use.

jvm.memory.bytes.max

Gauge

Contains the maximum size, in bytes, for the Java heap and non-heap memory.

jvm.memory.pool.bytes.init

Gauge

Contains the initial size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space.

jvm.memory.pool.bytes.committed

Gauge

Contains the committed size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space.

jvm.memory.pool.bytes.used

Gauge

Contains the amount, in bytes, of Java memory space in use by the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space.

jvm.buffer.count.buffers

Gauge

Contains the estimated number of mapped and direct buffers in the JVM memory pool.

jvm.buffer.total.capacity.bytes

Gauge

Contains the estimated total capacity, in bytes, of the mapped and direct buffers in the JVM memory pool.

process.cpu.usage

Gauge

Contains the CPU percentage for each ECE component on the server. This data is collected from the corresponding MBean attributes by JVMs.

process.files.open.files

Gauge

Contains the total number of file descriptors currently available for an ECE component and the descriptors in use for that ECE component.

coherence.os.system.cpu.load

Gauge

Contains the CPU load information percentage for each system in the cluster.

These statistics are based on the average data collected from all the ECE grid members running on a server.

system.load.average.1m

Gauge

Contains the system load average (the number of items waiting in the CPU run queue) for each machine in the cluster.

These statistics are based on the average data collected from all the ECE grid members running on a server.

coherence.os.free.swap.space.size

Gauge

Contains system swap usage information (by default in megabytes) for each system in the cluster.

These statistics are based on the average data collected from all the ECE grid members running on a server.
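For example, you can alert when JVM heap usage approaches the configured maximum. The following sketch assumes the exposed names for jvm.memory.bytes.used and jvm.memory.bytes.max use underscores and carry a label (here called area) that distinguishes heap from non-heap memory; verify both the names and the labels against your /metrics output before using the rule:

groups:
  - name: ece-jvm-rules
    rules:
      - alert: EceJvmHeapNearMax
        # Fires when heap usage stays above 90 percent of the maximum for 5 minutes.
        expr: jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: ECE JVM heap usage is above 90 percent of the maximum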

Kafka JMX Metrics

The Kafka JMX Metrics group contains metrics for tracking the throughput and latency of the Kafka server and topics. Table 32-9 lists the metrics in this group.

Table 32-9 Kafka JMX Metrics

Metric Name Type Description

kafka.app.info.start.time.ms

Gauge

Indicates the start time in milliseconds.

kafka.producer.connection.close.rate

Gauge

Contains the number of connections closed per second.

kafka.producer.io.ratio

Gauge

Contains the fraction of time the I/O thread spent doing I/O.

kafka.producer.io.wait.time.ns.total

Counter

Contains the total time the I/O thread spent waiting.

kafka.producer.iotime.total

Counter

Contains the total time the I/O thread spent doing I/O.

kafka.producer.metadata.wait.time.ns.total

Counter

Contains the total time, in nanoseconds, the producer has spent waiting on topic metadata.

kafka.producer.node.request.latency.max

Gauge

Contains the maximum latency, in milliseconds, of producer node requests.

kafka.producer.record.error.total

Counter

Contains the total number of record sends that resulted in errors.

kafka.producer.txn.begin.time.ns.total

Counter

Contains the total time, in nanoseconds, the producer has spent in beginTransaction.

kafka.producer.txn.commit.time.ns.total

Counter

Contains the total time, in nanoseconds, the producer has spent in commitTransaction.

Kafka Client Metrics

The Kafka Client Metrics group contains metrics for tracking the throughput, latency, and performance of Kafka producer and consumer clients.

Note:

All Kafka producer metrics apply to the ECS and BRM Gateway. All Kafka consumer metrics apply to the BRM Gateway, HTTP Gateway, and Diameter Gateway.

For more information about the available metrics, see the Apache Kafka documentation.

Micrometer Executor Metrics

The Micrometer Executor Metrics group contains standard metrics that provide insights into the activity of your thread pool and the status of tasks. These metrics are created by Micrometer, third-party software. Table 32-10 lists the metrics in this group.

Note:

The Micrometer API optionally allows a prefix to the name. In the table below, replace prefix with ece.brs for BRS metrics or ece.emgw for EM Gateway metrics.

Table 32-10 Micrometer Executor Metrics

Metric Name Type Description
prefix.executor.completed.tasks FunctionCounter Tracks the approximate total number of tasks that have completed execution.

Additional label: Identity

prefix.executor.active.threads Gauge Tracks the approximate number of threads that are actively executing tasks.

Additional label: Identity

prefix.executor.queued.tasks Gauge Tracks the approximate number of tasks that are queued for execution.

Additional label: Identity

prefix.executor.queue.remaining.tasks Gauge Tracks the number of additional elements that this queue can ideally accept without blocking.

Additional label: Identity

prefix.executor.pool.size.threads Gauge Tracks the current number of threads in the pool.

Additional label: Identity

prefix.executor.pool.core.threads Gauge Tracks the core number of threads in the pool.

Additional label: Identity

prefix.executor.pool.max.threads Gauge Tracks the maximum allowed number of threads in the pool.

Additional label: Identity

RADIUS Gateway Metrics

The RADIUS Gateway Metrics group contains standard metrics that track the customer usage of services. Table 32-11 lists the metrics in this group.

Table 32-11 RADIUS Gateway Metrics

Metric Name Type Description

ece.radius.sent.disconnect.message.counter.total

Counter

Tracks the number of unique disconnect messages sent to the Network Access Server (NAS), excluding the retried ones.

ece.radius.retried.disconnect.message.counter.total

Counter

Tracks the number of retried disconnect messages, excluding the total number of retries.

ece.radius.successful.disconnect.message.counter.total

Counter

Tracks the number of successful disconnect messages.

ece.radius.failed.disconnect.message.counter.total

Counter

Tracks the number of failed disconnect messages.

ece.radius.auth.extension.user.data.latency

Timer

Tracks the following:

  • The latency of converting a customer into an extension customer.

  • The latency of converting a balance into an extension balance map.

  • The latency of getting a user data response.

Rated Event Formatter (REF) Metrics

The Rated Event Formatter (REF) Metrics group contains standard metrics that provide insights into the current status of your REF activity and tasks. Table 32-12 lists the metrics in this group.

Table 32-12 REF Metrics

Metric Name Type Description

ece.rated.events.checkpoint.interval

Gauge

Tracks the time, in seconds, used by the REF instance to read a set of rated events at a specific time interval.

ece.rated.events.ripe.duration

Gauge

Tracks the duration, in seconds, that rated events have existed before they can be processed.

ece.rated.events.worker.count

Gauge

Contains the number of worker threads used to process rated events.

ece.rated.events.phase.latency

Timer

Tracks the amount of time taken to complete a rated event phase. This only measures successful phases.

Additional labels: phase, siteName

ece.rated.events.phase.failed

Counter

Tracks the number of rated event phase operations that have failed.

Additional labels: phase, siteName

ece.rated.events.checkpoint.age

Gauge

Tracks the difference in time between the retrieved data and the current time stamp.

Additional labels: phase, siteName

ece.rated.events.batch.size

Gauge

Tracks the number of rated events retrieved on each iteration.

Additional labels: phase, siteName
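For example, you can alert when rated event phase operations fail. The following sketch assumes the exposed name for ece.rated.events.phase.failed uses underscores and a _total suffix; confirm the name against your /metrics output before using the rule:

groups:
  - name: ece-ref-rules
    rules:
      - alert: EceRatedEventPhaseFailures
        # Fires when any rated event phase operation failed in the last 10 minutes.
        expr: increase(ece_rated_events_phase_failed_total[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: Rated event phase failures detected in the last 10 minutes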

Rated Events Metrics

The Rated Events Metrics group contains metrics on rated events processed by ECE server sessions. Table 32-13 lists the metrics in this group.

Table 32-13 Rated Events Metrics

Metric Name Type Description

ece.rated.events.formatted

Counter

Contains the number of successfully or unsuccessfully formatted rated events for each RatedEventFormatter worker thread in each formatting job operation.

ece.rated.events.cached

Counter

Contains the total number of rated events cached by each ECE node.

ece.rated.events.inserted

Counter

Contains the total number of rated events that were successfully inserted into the database.

ece.rated.events.insert.failed

Counter

Contains the total number of rated events that failed to be inserted into the database.

ece.rated.events.purged

Counter

Contains the total number of rated events that are purged.

ece.requests.by.result.code

Counter

Tracks the total number of requests processed for each result code.

Session Metrics

The Session Metrics group contains metrics on ECE server sessions. Table 32-14 lists the metrics in this group.

Table 32-14 Session Metrics

Metric Name Type Description

ece.session.metrics

Counter

Contains the total number of sessions opened or closed by rating group, node, or cluster.