87 Configuring ECE for Disaster Recovery
Learn how to configure Oracle Communications Billing and Revenue Management Elastic Charging Engine (ECE) for disaster recovery.
Introduction
BRM offers a disaster recovery (DR) architecture ensuring business continuity in the event of an unexpected site deployment failure. BRM disaster recovery capabilities provide continuity in service usage for your end customers and minimize data loss if a system failure occurs. BRM supports deployment models intended to meet disaster recovery and business continuity needs.
An Oracle Communications Charging, Billing and Revenue Management deployment includes functional components that manage business functionality end to end, from subscription acquisition to usage charging, billing, and revenue management. The functional components include:
- Billing and Revenue Management (BRM) server
- Pricing Design Center (PDC)
- Elastic Charging Engine (ECE)
- Offline Mediation Controller
Business Continuity with ECE Disaster Recovery
Customers deploying BRM products also need safeguards in case a disaster strikes. The disaster can be a power outage, a hardware burn-out, or a site becoming unavailable because of a natural calamity such as a flood or an earthquake. For a hardware burn-out or a localized power outage, you can provide local redundancy by deploying the products in high-availability mode wherever possible.
Because of their distributed architecture, BRM components can be deployed with local redundancy for high availability.
For deploying with geo-redundancy, Table 87-1 shows the available deployment modes, along with recommended options for each of the functional components.
Table 87-1 Available and Recommended Deployment Modes
| Components | Deployment Mode | 
|---|---|
| Billing and Revenue Management (BRM) Server | Active-Hot Standby | 
| Pricing Design Center (PDC) | Active-Hot Standby | 
| Elastic Charging Engine (ECE) | Active-Active | 
| Offline Mediation Controller (OCOMC) | Active-Standby | 
Transaction processing must continue even when a site fails, so choose a deployment mode based on the recovery objectives set by your business. You can choose the deployment mode that fits your business needs separately for each BRM component.
About Deployment Modes with Geo-Redundancy
You can deploy the BRM Server, PDC, and ECE components with geo-redundancy to improve business continuity.
Deploying BRM Server and PDC with Geo-Redundancy
BRM Server and PDC must be deployed for charging with ECE, and both use Oracle Database to store data. Database replication with Data Guard is recommended for replicating the data across sites.
Figure 87-1 shows the disaster recovery deployment modes for BRM and PDC.
Figure 87-1 Deployment Modes for BRM and PDC
In this deployment, BRM and PDC run in Active-Hot Standby mode, and the database is continuously replicated in real time from the active site to the standby site using Data Guard or Active Data Guard, which minimizes data loss. This mode requires monitoring of the sites and manual intervention when a site failure occurs.
Within a given site, local redundancy for BRM Server ensures continued availability of the system. BRM supports multi-instance configuration of Connection Managers and Data Managers for high availability, so that transactions are processed through available BRM processes connected to the same database instance.
Deploying ECE with Geo-Redundancy
Elastic Charging Engine is deployed in Active-Active mode. Active-Active mode uses Coherence federation-based data replication across sites and requires Coherence Grid Edition. The advantage of this mode is that both sites process data, rather than one site remaining in standby mode while its processes run.
Figure 87-2 shows the Active-Active deployment mode for disaster recovery for ECE.
Figure 87-2 Active-Active Deployment Mode for ECE
Active-Active mode consists of two ECE sites, where all ECE sites can actively process charging requests simultaneously. Each ECE site’s cache holds all subscriber data and pricing configuration data, and ECE cache data is asynchronously replicated among all of the ECE sites using Coherence cache federation so that the cache data in all of the ECE sites remains synchronized.
Charging requests can be handled in either of the following processing modes:
- Local processing mode: Process requests for all subscribers in the ECE site that they arrive in from the network. In this mode, only the requests for sharing group members may be forwarded to another site where the shared balance is managed. This is the recommended mode for better processing rates.
- Preferred site processing mode: Process all requests for a given subscriber in the same ECE site, controlled by an ordered list of preferred ECE sites for each subscriber grouping. In this mode, a charging request arriving at a non-preferred ECE site is forwarded to the preferred ECE site for processing. For example, all sharing group members are processed in a preferred site where the shared balance is managed.
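The difference between the two processing modes can be sketched as a routing decision. This is a minimal illustration under stated assumptions, not ECE's implementation: the function name `choose_processing_site`, the mode strings, and the site names are hypothetical, and the sharing-group case is simplified so that a request is always forwarded when the shared balance is managed elsewhere.

```python
from typing import Optional, Sequence


def choose_processing_site(
    mode: str,
    arrival_site: str,
    preferred_sites: Sequence[str],
    shared_balance_site: Optional[str] = None,
) -> str:
    """Return the site that should process a charging request (illustrative only)."""
    if mode == "local":
        # Local processing: handle the request where it arrived, unless the
        # subscriber is a sharing group member whose shared balance is
        # managed at another site (simplified: always forward in that case).
        if shared_balance_site is not None and shared_balance_site != arrival_site:
            return shared_balance_site
        return arrival_site
    if mode == "preferred":
        # Preferred site processing: use the first entry of the ordered
        # preference list for the subscriber grouping, wherever the request arrived.
        if not preferred_sites:
            raise ValueError("preferred mode needs at least one preferred site")
        return preferred_sites[0]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    print(choose_processing_site("local", "site1", ["site2", "site1"]))
    print(choose_processing_site("preferred", "site1", ["site2", "site1"]))
```

In local mode most requests never leave the arrival site, which is consistent with the better processing rates noted above; in preferred mode a request can incur a forwarding hop.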
All active ECE sites interface with one active BRM and PDC instance. Any updates from BRM will be processed by one of the active ECE sites and will be synchronized to all other active ECE sites via Coherence federation. Rated events created in each active ECE site will be processed on that site and loaded into the BRM database via the configured method.
In Figure 87-2, if ECE site 2 fails, ECE site 1 will automatically be able to handle the entire network’s charging traffic. The operator should manually mark ECE site 2 as failed as soon as this condition is observed to ensure no traffic processing is attempted by ECE site 2 until the problem is corrected.
In Figure 87-2, if ECE site 1 fails, ECE site 2 will automatically be able to handle the entire network’s charging traffic. However, in this case, manual steps are required to update the configuration to ensure that updates from Active BRM and PDC are processed in ECE site 2 without reliance on ECE site 1. The operator should manually mark ECE site 1 as failed as soon as this condition is observed to ensure no traffic processing is attempted by ECE site 1 until the problem is corrected.
For all Active-Active deployments, the operator should ensure that each ECE site is properly sized to handle the expected load in the worst-case failure scenario.
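As a rough sizing aid, the worst-case load per surviving site can be estimated with simple arithmetic. The sketch below assumes that traffic is redistributed evenly across the surviving sites; the function name and the example figures are hypothetical and are not Oracle sizing guidance.

```python
def required_capacity_per_site(peak_load_tps: float, total_sites: int, failed_sites: int = 1) -> float:
    """Estimate the load each surviving site must carry after a failure.

    Assumes the full peak load is spread evenly over the surviving sites.
    """
    surviving = total_sites - failed_sites
    if surviving < 1:
        raise ValueError("at least one site must survive")
    return peak_load_tps / surviving


if __name__ == "__main__":
    # Two-site deployment: the remaining site must carry the whole network load.
    print(required_capacity_per_site(10_000, total_sites=2))
```

For a two-site Active-Active deployment, this means each site must be sized for the entire expected peak load, not half of it.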
If you want to process data on only one site, the other site stays in Hot Standby mode. This deployment mode is called Active-Hot Standby, as shown in Figure 87-3. When both sites are configured for processing, however, it is usually preferable to process on both sites rather than keep one site idle. Oracle recommends deploying in Active-Active mode because it provides the best RTO and RPO of all deployment modes.
Figure 87-3 Active-Hot Standby Deployment Mode for ECE
The configurations for Active-Hot Standby mode are the same as for an Active-Active system.
If an operator deploys all components, the deployment will be as shown in Figure 87-4.
Figure 87-4 Site Deployment with Recommended Options
BRM and PDC are always in Active-Hot Standby mode. The database from the active BRM and PDC site is replicated to one or more standby sites. If the active deployment fails, manual intervention is needed to make one of the BRM/PDC standby sites active and to reconfigure the new active BRM/PDC instance to communicate with the existing primary ECE instance or the ECE instance in the same site.
ECE is in Active-Active mode. Pricing updates from PDC and customer updates from BRM will be processed by one of the active ECE sites and will be synchronized to the other active ECE site via Coherence federation. Rated events created in an active ECE site will be processed in that site and loaded into the Active BRM database via the configured method.
If an active ECE site fails:
- The remaining active ECE site can automatically handle the load redistributed from the core network.
- The operator should manually mark the ECE site as failed as soon as this condition is observed to ensure the failed ECE site attempts no traffic processing until the problem is corrected.
- If the failed ECE site was the primary ECE site being used by the active BRM and PDC, manual steps are required to update the configuration to ensure that updates from the active BRM and PDC are processed in the other active ECE site.
- If the network gateways remain in service at the failed ECE site (that is, a partial ECE site failure), the network gateways may need to be manually turned down to force the network clients to redistribute charging traffic to the remaining active ECE site.
Offline Mediation Controller is in Active-Standby mode. Offline Mediation Controller processes offline rating requests, which are generally processed on one site to keep the deployment simple and easy to manage.
Note:
You must define the customer group in the override-values.yaml file, irrespective of whether ECE is in Active/Active or Active/Hot-Standby deployment modes. The customer group configuration is mandatory for both modes. Both groups must have the same cluster preference in Active/Hot-Standby modes.
Restoring the System After Journal Growth
- Isolate incoming traffic on the affected site:
  - If you use a cloud native environment, scale down the corresponding pods of any Diameter Gateways, HTTP Gateways, RADIUS Gateways, or EM Gateways that receive active traffic (regardless of how low the volume may be).
  - If you use an on-premises system, stop the corresponding application.
- Disable the federation from the affected site for all caches: BRMFederatedCache, XRefFederatedCache, ReplicatedFederatedCache, and OfferProfileFederatedCache. If more than one ECE site is being federated to, disable the federation towards all sites for all cache services.
- After the federation has been disabled, no journal records should remain in the Coherence caches. Check whether a count of BRMFederatedCache$JournalRecord returns anything other than 0 by running the following command in CohQL (query.sh):

      CohQL> select count() from BRMFederatedCache$JournalRecord;

  If the number of entries is greater than 0, run the following command to truncate the BRMFederatedCache$JournalRecord cache entries:

      CohQL> truncate BRMFederatedCache$JournalRecord;

  Then, verify that the count of BRMFederatedCache$JournalRecord cache entries is 0:

      CohQL> select count() from BRMFederatedCache$JournalRecord;

  Note: The above commands use the BRMFederatedCache cache service as an example. Repeat this procedure for all cache services.
- To flush out any remaining journal files in the Flash Journal:
  - If you use a cloud native environment, perform a rolling restart of all ECS pods.
    Note: A rolling restart is not mandatory if only a few ECS pods were affected by the flash journal growth. In that case, restart any ECS pods where the Flash Journal Metrics dashboard in Grafana indicates that the flash journal is still high. A high flash journal usage count might mean a value greater than 1 GB, but this depends on what is considered normal for the deployment. You can use historic flash journal values from when the system was running normally to determine what might be considered a high value.
  - If you use an on-premises system, manually restart each instance of the ECS applications.
- Re-enable the federation for all caches for the affected site.
- Restore the traffic flow to the affected site (scale up the Diameter Gateway, HTTP Gateway, RADIUS Gateway, and EM Gateway pods, or start the applications).
  Note: If more than one site was affected by the flash journal growth and the flash journal full errors, the steps above may need to be repeated on the other affected sites.
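Because the journal check and truncate commands must be repeated for every cache service, it can help to generate the CohQL statements rather than type them by hand. The sketch below only builds the statement text. It assumes that the journal cache of each service follows the same `<service>$JournalRecord` naming shown for BRMFederatedCache, which the procedure above does not confirm for the other three services. The helper names are hypothetical.

```python
from typing import Dict, List, Sequence

# Cache services named in the procedure above.
CACHE_SERVICES = [
    "BRMFederatedCache",
    "XRefFederatedCache",
    "ReplicatedFederatedCache",
    "OfferProfileFederatedCache",
]


def journal_cache_name(service: str) -> str:
    # Assumption: every service uses the same naming pattern as BRMFederatedCache.
    return f"{service}$JournalRecord"


def journal_check_statements(services: Sequence[str] = CACHE_SERVICES) -> List[str]:
    """Build the count statements to run before and after truncation."""
    return [f"select count() from {journal_cache_name(s)};" for s in services]


def journal_truncate_statements(counts: Dict[str, int]) -> List[str]:
    """Build truncate statements only for services whose journal count is above 0."""
    return [
        f"truncate {journal_cache_name(service)};"
        for service, count in counts.items()
        if count > 0
    ]


if __name__ == "__main__":
    for statement in journal_check_statements():
        print(statement)
```

The generated statements are meant to be pasted into a CohQL session after federation has been disabled; the counts passed to `journal_truncate_statements` are the values you read from the count queries.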
About Load Balancing in an Active-Active System
In an active-active system, External Module (EM) Gateway routes BRM update requests across sites based on the app and site configurations to ensure load balancing.
EM Gateway routes connection requests to Diameter Gateway, RADIUS Gateway, and HTTP Gateway nodes in one of the active sites. The request is rerouted to the backup production site if the site does not respond.
You can set up load-balancing configurations based on your requirements.
About Rated Event Formatter in a Persistence-Enabled Active-Active System
When data persistence is enabled, each site in an active-active system has a primary Rated Event Formatter instance for each schema, and at least one secondary instance for each schema.
As rated events are created, the following happens on each site:
- ECE creates rated events and commits them to the Coherence cache. Each rated event created by ECE includes the Coherence cluster name of the site where it was created.
- The Coherence federation service replicates the events to the remote sites, as it does for other federated objects.
- Coherence caching persists the events to the database in batches. Each schema at each site has its own rated event database table.
- The primary Rated Event Formatter instance processes all rated events from the corresponding site-specific database table.
- The primary Rated Event Formatter instance commits the formatted events to the cache as a checkpoint. The site name is included in the checkpoint data, along with the schema number, timestamp, and plugin type.
- The Coherence federation service replicates the checkpoint to the remote sites, as it does for other federated objects. The remote site ECE servers then purge the events persisted in the checkpoint from the database in batches by schema and by site.
- Coherence caching persists the checkpoint to the database to be consumed by Rated Event Loader. Checkpoints are grouped by schema and by site.
- The ECE server purges the events related to the persisted checkpoint. Events are purged from the database in batches by schema and by site.
Remote sites that receive federated events and checkpoints similarly persist them to and purge them from the database, in site and schema-specific database tables. In this way, all sites contain the same rated events and checkpoints, no matter where they were generated, and each rated event and checkpoint retains information about the site that generated it. If the Rated Event Formatter instance at any one site is down, a secondary instance at a remote site can immediately begin processing the rated events, preserving the site-specific information as though it were the original site. See "Resolving Rated Event Formatter Instance Outages".
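The purge step in this flow can be pictured with a small model. The sketch below is not ECE code: the `RatedEvent` and `Checkpoint` classes are hypothetical, and it assumes that a checkpoint covers every event from the same site and schema created at or before the checkpoint timestamp. The text above states only that purging happens by schema and by site, so the timestamp cutoff is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RatedEvent:
    event_id: int
    site: str        # Coherence cluster name of the site that created the event
    schema: int
    created_at: int  # hypothetical timestamp, larger means later


@dataclass(frozen=True)
class Checkpoint:
    site: str
    schema: int
    timestamp: int
    plugin_type: str


def purge_checkpointed(events: List[RatedEvent], checkpoint: Checkpoint) -> List[RatedEvent]:
    """Return the events that remain after purging those covered by the checkpoint."""
    return [
        event
        for event in events
        if not (
            event.site == checkpoint.site
            and event.schema == checkpoint.schema
            and event.created_at <= checkpoint.timestamp  # assumed cutoff rule
        )
    ]
```

Because each event and checkpoint carries its originating site name, the same function can run unchanged on the originating site and on any remote site that received the federated copies.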
Resolving Rated Event Formatter Instance Outages
If a primary Rated Event Formatter instance is down, take one of the following approaches, depending on whether the outage is planned or unplanned, and considering your operational needs:
- Planned outage: Primary instance finishes processing: Choose this option for planned outages, when rating stops but the primary Rated Event Formatter instance can keep processing.
  - After no new rated events are being generated by the site, wait until the local Rated Event Formatter has finished processing all rated events from the site.
  - In the remote sites, drop or truncate the rated event database table for the rated events federated from the site with the outage. Dropping the table means you must recreate it and its indexes after resolving the outage.
  - Stop the Rated Event Formatter at the site with the outage.
  - When the outage is resolved, you can start Rated Event Formatter again to resume processing events.
- Unplanned outage: Secondary instance takes over processing: Choose this option for unplanned outages, when the primary Rated Event Formatter is also down. After failing over to the backup site as described in "Failing Over to a Backup Site (Active-Active)", perform the following tasks:
  - Confirm that the last successful Rated Event Formatter checkpoint for the local site matches the one federated to the remote site. You can use the JMX queryRatedEventCheckPoint operation in the ECE configuration MBeans. See "Getting Rated Event Formatter Checkpoint Information".
  - If needed, start the secondary Rated Event Formatter instance on the remote site.
  - Activate the secondary Rated Event Formatter instance on the remote site using the JMX activateSecondaryInstance operation in the ECE monitoring MBeans. See "Activating a Secondary Rated Event Formatter Instance".
    The secondary instance takes over processing the federated rated events as though it were the primary instance at the site with the outage. The events and checkpoints are persisted in the database tables for the original site, not the remote site.
  - Wait until the secondary instance has finished processing all rated events federated from the site with the outage.
  - At the site with the outage, drop or truncate the rated event database table for local events. Dropping the table means you must recreate it and its indexes after resolving the outage.
  - Stop the secondary Rated Event Formatter instance.
  - When the outage is resolved and the site has been recovered as described in "Switching Back to the Original Production Site (Active-Active)", restart the primary Rated Event Formatter to resume processing events at the local site. If you had the secondary Rated Event Formatter instance running at the remote site before the outage, restart it too.
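The checkpoint comparison in the unplanned-outage procedure can be done by eye, but a small helper makes the rule explicit. The sketch below assumes you have copied the checkpoint details reported by queryRatedEventCheckPoint at each site (site, schema, plugin name, and checkpoint time) into dictionaries; the data layout and the function name are hypothetical, and the sketch does not call JMX.

```python
from typing import Dict, List, Tuple

# (site name, schema number, plugin name) -> time of the most recent checkpoint
CheckpointKey = Tuple[str, int, str]


def mismatched_checkpoints(
    local: Dict[CheckpointKey, str],
    remote: Dict[CheckpointKey, str],
    site: str,
) -> List[CheckpointKey]:
    """List checkpoints for `site` whose times differ between the two sites.

    A key present on only one side is also reported as a mismatch.
    """
    keys = {key for key in set(local) | set(remote) if key[0] == site}
    return sorted(key for key in keys if local.get(key) != remote.get(key))
```

An empty result suggests that the remote site has received the latest checkpoint for the affected site; any reported key is worth investigating before you activate the secondary instance.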
 
Getting Rated Event Formatter Checkpoint Information
You can retrieve information about the last Rated Event Formatter checkpoint committed to the database.
To retrieve information about the last Rated Event Formatter checkpoint:
- Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".
- Expand the ECE Configuration node.
- Expand the database connection you want checkpoint information from.
- Expand Operations.
- Run the queryRatedEventCheckPoint operation.
  Checkpoint information appears for all Rated Event Formatter instances using the database connection. Information includes site, schema, and plugin names as well as the time of the most recent checkpoint.
Activating a Secondary Rated Event Formatter Instance
If a primary Rated Event Formatter instance is down, you can activate a secondary instance to take over rated event processing.
Note:
For a multi-site deployment with over two sites, activate only one secondary REF instance. Verify that the checkpoint time is up-to-date on the secondary site before activating a secondary REF to process the pending rated events as part of the data recovery process.
To activate a secondary Rated Event Formatter instance:
- Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".
- Expand the ECE Monitoring node.
- Expand RatedEventFormatterMatrices.
- Expand Operations.
- Run the activateSecondaryInstance operation.
  The secondary Rated Event Formatter instance begins processing rated events.
About CDR Generator in an Active-Active System
When CDR generation is enabled, each site in an active-active system contains a CDR Generator, and each site can generate unrated CDRs for external systems. When a production site goes down, the CDR Store retains all in-progress CDR sessions, and subsequent 5G usage events are diverted to the CDR Gateway on the other production site.
In an active-active system, you can configure CDR Generator to do the following:
- Mark partially processed CDRs in the CDR Store as incomplete to prevent downstream mediation systems from processing them. To do so, use the CDR Formatter's and CDR Gateway's enableIncompleteCdrDetection attribute.
- Mark when CDRs contain duplicate usage updates. To do so, use the CDR Gateway's retransmissionDuplicateDetectionEnabled attribute.
- Indicate that CDRs were closed for a custom value (in the CDR's causeForRecordClosing field). To enable CDR Generator to add a custom reason why a CDR was closed, use the CDR Formatter's enableStaleSessionCleanupCustomField attribute. To specify the custom value to add, use the CDR Formatter's staleSessionCauseForRecordClosingString attribute.
For information about configuring these attributes, see "Setting Up ECE to Generate CDRs" in ECE Implementing Charging.
About Conflict Resolution During the Journal Federation Process
In active-active disaster recovery systems, any changes to the ECE cache on one site are automatically federated to the ECE cache on other sites. Sometimes, an entity, such as a customer's balance, can change simultaneously at both sites. For example, Site 1 processes Joe's purchase of 500 prepaid minutes, while Site 2 processes Joe's usage of 20 prepaid minutes. ECE uses custom conflict resolution logic to merge these changes on both sites.
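To see why merging is needed in the example of Joe's balance, compare a delta-based merge with a naive "last writer wins" overwrite. The sketch below is purely illustrative: ECE's actual conflict resolution logic is not described in this chapter, and the function names, the starting balance of 100 minutes, and the site names are assumptions.

```python
from typing import Dict


def merge_balance(base: int, site_deltas: Dict[str, int]) -> int:
    """Apply every site's change to the common starting balance."""
    return base + sum(site_deltas.values())


def last_writer_wins(site_values: Dict[str, int], winner: str) -> int:
    """Naive alternative: keep one site's value and discard the other."""
    return site_values[winner]


if __name__ == "__main__":
    base = 100  # hypothetical balance in minutes before either change
    deltas = {"site1": +500, "site2": -20}  # purchase at Site 1, usage at Site 2
    print(merge_balance(base, deltas))

    # Each site's own view after its local change only.
    views = {site: base + delta for site, delta in deltas.items()}
    print(last_writer_wins(views, "site1"), last_writer_wins(views, "site2"))
```

Keeping either site's value alone loses the other site's change, whereas applying both changes to the common base yields a balance that reflects the purchase and the usage.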
However, ECE may occasionally be unable to resolve these conflicts. This can happen when the skipActiveActivePreferredSiteRouting entry in the charging-settings.xml file is enabled or when the federation process stops for a short time.
When unresolved conflicts happen, ECE:
- Does not make the updates to the ECE cache on each site.
- Appends the log details to the ecs logs, which can be found in the ECE_home/logs directory.
- Tracks information about the conflict resolution in the ECE metric shown in Table 87-2. Coherence also provides other federation-related metrics that you can refer to.

Table 87-2 Coherence Federated Service Metric
| Metric Name | Type | Description |
|---|---|---|
| ece.federated.service.change.records | Counter | Tracks the number of change records and tags them by conflict classification type: notModified, error, alreadyConflictResolved, internallyModified, externallyModified, sameBinary, sameRevisionNumber, deleted, conflictDetected |
Configuring an Active-Active System
To configure an active-active system:
- In the primary production site, do the following:
  - Configure the ECE components (Customer Updater, EM Gateway, and so on).
  - Add all details about participant sites to the federation-config section of the ECE Coherence override file (for example, ECE_home/config/charging-coherence-override-prod.xml). To confirm which ECE Coherence override file is used, see the tangosol.coherence.override value in the ECE_home/config/ece.properties file.

    To define federation configuration parameters:

        <!-- Defining federation-config for A-A setup with coherence 14.1.2.0.2+ -->
        <!-- <initial-action> to be defined just after cluster name -->
        <federation-config>
          <participants>
            <participant>
              <name>@CLUSTER_NAME1@</name>
              <initial-action>stop</initial-action>
              <remote-addresses>
                <socket-address>
                  <address>@IP_ADDRESS@</address>
                  <port>15000</port>
                </socket-address>
              </remote-addresses>
            </participant>
            <participant>
              <name>@CLUSTER_NAME2@</name>
              <initial-action>stop</initial-action>
              <remote-addresses>
                <socket-address>
                  <address>@IP_ADDRESS@</address>
                  <port>15000</port>
                </socket-address>
              </remote-addresses>
            </participant>
          </participants>
        </federation-config>

    See Table 87-3 for descriptions of the federation configuration parameters and their valid values.

    Table 87-3 Federation Configuration Parameters
    | Name | Description |
    |---|---|
    | name | The name of the participant site. Note: The name of the participant site must match the name of the cluster in the participant site. |
    | initial-action | Specifies whether the federation service should be started for replicating data to the participant sites. Valid values are: start, which specifies that the federation service has to be started and the data must be automatically replicated to the participant sites; and stop, which specifies that the federation service has to be stopped and the data must not be automatically replicated to the participant sites. Note: Ensure that this parameter is set to stop for all participant sites except for the current site. For example, if you are adding the backup or remote production site details in the primary production site, this parameter must be set to stop for all backup or remote production sites. |
    | address | The IP address of the participant site. |
    | port | The port number assigned to the Coherence cluster port of the participant site. |
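When several participant sites have to be listed, the federation-config fragment can be generated instead of edited by hand. The sketch below uses Python's standard `xml.etree.ElementTree` module and mirrors the element order of the sample (name, then initial-action, then remote-addresses). The function name, the dictionary layout, and the example cluster names and addresses are hypothetical; the default of stop follows the sample above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_federation_config(participants: List[Dict]) -> ET.Element:
    """Build a <federation-config> element from a list of participant descriptions.

    Each participant is a dict with "name", "address", "port", and an optional
    "initial_action" ("start" or "stop"; defaults to "stop" as in the sample).
    """
    root = ET.Element("federation-config")
    participant_list = ET.SubElement(root, "participants")
    for participant in participants:
        action = participant.get("initial_action", "stop")
        if action not in ("start", "stop"):
            raise ValueError(f"invalid initial-action: {action!r}")
        node = ET.SubElement(participant_list, "participant")
        ET.SubElement(node, "name").text = participant["name"]
        # initial-action is placed directly after the cluster name.
        ET.SubElement(node, "initial-action").text = action
        remote = ET.SubElement(node, "remote-addresses")
        socket_address = ET.SubElement(remote, "socket-address")
        ET.SubElement(socket_address, "address").text = participant["address"]
        ET.SubElement(socket_address, "port").text = str(participant["port"])
    return root


if __name__ == "__main__":
    config = build_federation_config([
        {"name": "BRM-S1", "address": "192.0.2.10", "port": 15000},
        {"name": "BRM-S2", "address": "192.0.2.20", "port": 15000},
    ])
    print(ET.tostring(config, encoding="unicode"))
```

The output is a fragment only; it still has to be placed in the federation-config section of the Coherence override file that your deployment actually uses.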
  - Go to the ECE_home/config/management directory, where ECE_home is the directory in which ECE is installed.
  - Configure HTTP Gateway. See "Connecting ECE to a 5G Client" in ECE Implementing Charging for more information.
  - Open the charging-settings.xml file.
  - In the CustomerGroupConfiguration section, set the app configuration parameters as shown in the following sample file:

        <customerGroupConfigurations config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.CustomerGroupConfigurations">
          <customerGroupConfigurationList>
            <customerGroupConfiguration
                config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.CustomerGroupConfiguration"
                name="customerGroup5">
              <clusterPreferenceList config-class="java.util.ArrayList">
                <clusterPreferenceConfiguration
                    config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.ClusterPreferenceConfiguration"
                    name="BRM-S2" priority="1" routingGatewayList="host1:port1"/>
              </clusterPreferenceList>
            </customerGroupConfiguration>
            <customerGroupConfiguration
                config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.CustomerGroupConfiguration"
                name="customerGroup2">
              <clusterPreferenceList config-class="java.util.ArrayList">
                <clusterPreferenceConfiguration
                    config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.ClusterPreferenceConfiguration"
                    name="BRM-S2" priority="1" routingGatewayList="host1:port1,host1:port1"/>
                <clusterPreferenceConfiguration
                    config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.ClusterPreferenceConfiguration"
                    name="BRM-S1" priority="2" routingGatewayList="host2:port2,host2:port2"/>
              </clusterPreferenceList>
            </customerGroupConfiguration>
          </customerGroupConfigurationList>
        </customerGroupConfigurations>

    Table 87-4 provides the configuration parameters of the CustomerGroupConfiguration section.

    Table 87-4 CustomerGroupConfiguration Parameters
    | Configuration Parameter | Description |
    |---|---|
    | CustomerGroupConfiguration: name | Customers are processed and distributed in active-active system sites based on customerGroup. The customer names configured in customerGroup are updated to the PublicUserIdentity (PUI) cache when you load the customer information to ECE through customerUpdater or when you create or update information of customers in BRM using EM Gateway. |
    | CustomerGroupConfiguration: clusterPreferenceList | Includes a list of cluster names and the priority for each cluster name for routing the requests during a site failure. |
    | clusterPreferenceConfiguration: name | Name of the cluster. |
    | clusterPreferenceConfiguration: priority | The priority of the preferred cluster that is assigned in the customerGroup list to process the rating request. Clusters are tried in ascending order of priority, so the cluster with the lowest number processes the request first. For example, if you set the priority to 1, the cluster associated with this number processes the request first. |
    | clusterPreferenceConfiguration: routingGatewayList | A comma-separated list of the host name and port number of chargingServer values used for httpGateway. |
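The priority rule for clusterPreferenceConfiguration entries (lowest number first) can be illustrated by parsing a trimmed copy of the sample. This is a sketch under stated assumptions: the config-class attributes are omitted for brevity, the helper names are hypothetical, and skipping clusters in a caller-supplied `failed` set is an illustration of routing during a site failure rather than ECE's actual mechanism.

```python
import xml.etree.ElementTree as ET
from typing import FrozenSet, List, Optional

# Trimmed version of the clusterPreferenceList for customerGroup2 in the sample.
SAMPLE = """
<clusterPreferenceList>
  <clusterPreferenceConfiguration name="BRM-S1" priority="2" routingGatewayList="host2:port2"/>
  <clusterPreferenceConfiguration name="BRM-S2" priority="1" routingGatewayList="host1:port1"/>
</clusterPreferenceList>
"""


def ordered_clusters(xml_text: str) -> List[str]:
    """Return cluster names sorted by ascending priority number."""
    root = ET.fromstring(xml_text)
    preferences = [
        (int(node.get("priority")), node.get("name"))
        for node in root.iter("clusterPreferenceConfiguration")
    ]
    return [name for _, name in sorted(preferences)]


def pick_cluster(xml_text: str, failed: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Return the highest-priority cluster that is not marked as failed."""
    for name in ordered_clusters(xml_text):
        if name not in failed:
            return name
    return None


if __name__ == "__main__":
    print(ordered_clusters(SAMPLE))
    print(pick_cluster(SAMPLE, failed=frozenset({"BRM-S2"})))
```

With both clusters available, BRM-S2 (priority 1) is chosen; when BRM-S2 is marked as failed, the request falls through to BRM-S1 (priority 2).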
 
  - Configure a primary and secondary Rated Event Formatter instance for each site in the ratedEventFormatter section, as shown in the following sample file:

        <ratedEventFormatterConfigurationList config-class="java.util.ArrayList">
          <ratedEventFormatterConfiguration
              config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
              name="ref_site1_primary"
              partition="1"
              connectionName="oracle1"
              siteName="site1"
              threadPoolSize="2"
              retainDuration="0"
              ripeDuration="30"
              checkPointInterval="20"
              maxPersistenceCatchupTime="0"
              pluginPath="ece-ratedeventformatter.jar"
              pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
              pluginName="brmCdrPluginDC1Primary"
              …
              …
          />
          <ratedEventFormatterConfiguration
              config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
              name="ref_site1_secondary"
              partition="1"
              connectionName="oracle2"
              siteName="site1"
              primaryInstanceName="ref_site1_primary"
              threadPoolSize="2"
              retainDuration="0"
              ripeDuration="30"
              checkPointInterval="20"
              maxPersistenceCatchupTime="0"
              pluginPath="ece-ratedeventformatter.jar"
              pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
              pluginName="brmCdrPluginDC1Secondary"
              …
          />
          <ratedEventFormatterConfiguration
              config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
              name="ref_site2_primary"
              partition="1"
              connectionName="oracle2"
              siteName="site2"
              threadPoolSize="2"
              retainDuration="0"
              ripeDuration="30"
              checkPointInterval="20"
              maxPersistenceCatchupTime="0"
              pluginPath="ece-ratedeventformatter.jar"
              pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
              pluginName="brmCdrPluginDC2Primary"
              …
          />
          <ratedEventFormatterConfiguration
              config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
              name="ref_site2_secondary"
              partition="1"
              connectionName="oracle1"
              siteName="site2"
              primaryInstanceName="ref_site2_primary"
              threadPoolSize="2"
              retainDuration="0"
              ripeDuration="30"
              checkPointInterval="20"
              maxPersistenceCatchupTime="0"
              pluginPath="ece-ratedeventformatter.jar"
              pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
              pluginName="brmCdrPluginDC2Secondary"
              …
          />
        </ratedEventFormatterConfigurationList>

    The siteName property determines the site that the instance processes rated events for. This lets you configure secondary instances as backups for remote sites. The sample specifies that the ref_site1_secondary instance runs at site 2, but processes rated events federated from site 1 in case of an outage.
- 
                              
                              Configure the production sites to process the routing requests. 
- 
                              
                              Open the site-configuration.xml file. Configure all monitorAgent instances from all sites. Each Monitor Agent instance includes the Coherence cluster name, host name or IP address, and JMX port. Table 87-5 describes the Monitor Agent configuration parameters.
Table 87-5 Monitor Agent Configuration Parameters
| Name | Description |
|---|---|
| name | The name of the production or remote site where the request should be processed. These names should correspond to the site names defined for the Rated Event Formatter instances. |
| host | The IP address of the participant site. |
| jmxPort | The JMX port of the production or remote site. |
| disableMonitor | Allows a monitorAgent instance to disable collecting monitoring results from multiple monitorAgent instances running within a site. It prevents generating redundant monitoring results for a site. Note: The default value is false. If you set this value to true, the monitorAgent instance does not collect redundant monitoring results. |
Note: The monitorAgent properties should match the properties in the eceTopology.conf file where a monitorAgent instance is configured to start from a specific production site.
- 
                              
                              Copy the JMSConfiguration.xml file content of all sites to a single file and enter the following details: - 
                                    
                                    Add this tag for the queue types: <Cluster>clusterName</Cluster>
- 
                                    
                                    Import the wallet for all clusters and specify the wallet path in the following locations: <KeyStoreLocation> and <ECEWalletLocation>
 
- 
                              
                              In the eceTopology.conf file, enable the JMX port for all ECS server nodes and clients, such as Diameter Gateway, HTTP Gateway, RADIUS Gateway, and EM Gateway. Also, enable the JMX port for each Monitor Agent instance. 
- 
                              
                              Start ECE. See "Starting ECE" for more information. 
 
- 
                        
                        On the backup or remote site, do the following:- 
                                 
                                 Configure the ECE components (Customer Updater, EM Gateway, and so on). Ensure the following:
- The names of Diameter Gateway, RADIUS Gateway, HTTP Gateway, Rated Event Formatter, and Rated Event Publisher for each site are unique.
- At least two instances of Rated Event Formatter are configured to allow for failover. The data persistence-enabled system requires configuring at least one primary and one secondary instance for each site.
 
- 
                                 
                                 Set the following parameter in the ECE_home/config/ece.properties file to false:
                                 loadConfigSettings = false
                                 When you start the charging server nodes, the application-configuration data is not loaded into memory.
- 
                                 
                                 Add all the details of participant sites in the federation-config section of the ECE Coherence override file (for example, ECE_home/config/charging-coherence-override-prod.xml). To confirm which ECE Coherence override file is used, see the tangosol.coherence.override value in the ECE_home/config/ece.properties file. Table 87-3 provides the federation configuration parameter descriptions and default values.
- 
                                 
                                 Start the Elastic Charging Controller (ECC):
                                 ./ecc
- 
                                 
                                 Start the charging server nodes:
                                 start server
 
- 
                        
                        Start the federation process from the primary site to the backup site by doing the following: - 
                              
                              On the backup site, turn off ECE journal conflict resolution. In your ECE configuration MBeans, set the disableFederationInterceptor attribute to true. 
- 
                              
                              On the primary site, start the federation service and replicate data by running these commands:
                              gridSync start
                              gridSync replicate
                              The federation service is started, and all the existing data is replicated to the backup or remote production sites.
- 
                              
                              After replication completes, turn on journal conflict resolution again in your backup site. In your ECE configuration MBeans, set the disableFederationInterceptor attribute to false. 
 
- 
                        
                        On the backup sites, do the following: - 
                              
                              Verify that the same number of entries as in the primary production site are available in the customer, balance, configuration, and pricing caches in the backup or remote production sites using the query.sh utility. 
- 
                              
                              Verify that the charging server nodes in the backup or remote production sites are in the same state as those in the primary production site. 
- 
                              
                              Configure the following ECE components and the Oracle persistence database connection details by using a JMX editor: - Rated Event Formatter
- Rated Event Publisher
- Diameter Gateway
- RADIUS Gateway
- HTTP Gateway
 Ensure the following: - 
                                    
                                    The names of Diameter Gateway, RADIUS Gateway, HTTP Gateway, Rated Event Formatter, and Rated Event Publisher for each site are unique. 
- 
                                    
                                    At least two instances of Rated Event Formatter are configured to allow for failover. A data persistence-enabled system requires configuring at least one primary and one secondary instance for each site. 
 
- 
                              
                              Start the following ECE processes and gateways:
                              start brmGateway
                              start ratedEventFormatter
                              start diameterGateway
                              start radiusGateway
                              start httpGateway
                              The remote production sites are up and running with all required data.
- 
                              
                              Run the following command:
                              gridSync start
                              The federation service is started to replicate the data from the backup or remote production sites to the preferred production site.
 
- 
                        
                        After starting Rated Event Formatter in the remote production sites, copy the CDR files generated by Rated Event Formatter from the remote production sites to the primary production site using the SFTP utility. 
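The naming and pairing rules stated in the steps above (unique instance names, and at least one primary and one secondary Rated Event Formatter per site) can be checked before ECE is started. The following is a minimal sketch, not part of ECE: the `RefConfigCheck` class and its `RefInstance` type are hypothetical, and the rule that a secondary instance carries the same siteName as its primary is an assumption taken from the sample file, not a documented constraint.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RefConfigCheck {

    /** One Rated Event Formatter entry, reduced to the three properties checked here. */
    public static final class RefInstance {
        final String name;
        final String siteName;
        final String primaryInstanceName; // null for a primary instance

        public RefInstance(String name, String siteName, String primaryInstanceName) {
            this.name = name;
            this.siteName = siteName;
            this.primaryInstanceName = primaryInstanceName;
        }
    }

    /** Returns readable problems; an empty list means every check passed. */
    public static List<String> validate(List<RefInstance> instances) {
        List<String> problems = new ArrayList<>();
        Set<String> seenNames = new HashSet<>();
        Map<String, String> siteOfPrimary = new HashMap<>();
        Set<String> sites = new TreeSet<>();

        // First pass: collect sites and primaries, and flag duplicate names.
        for (RefInstance ref : instances) {
            if (!seenNames.add(ref.name)) {
                problems.add("Duplicate instance name: " + ref.name);
            }
            sites.add(ref.siteName);
            if (ref.primaryInstanceName == null) {
                siteOfPrimary.put(ref.name, ref.siteName);
            }
        }

        // Second pass: every secondary must point at a known primary for the same site
        // (assumption based on the sample configuration).
        Set<String> sitesWithSecondary = new HashSet<>();
        for (RefInstance ref : instances) {
            if (ref.primaryInstanceName == null) {
                continue;
            }
            String primarySite = siteOfPrimary.get(ref.primaryInstanceName);
            if (primarySite == null) {
                problems.add(ref.name + " references unknown primary " + ref.primaryInstanceName);
            } else if (!primarySite.equals(ref.siteName)) {
                problems.add(ref.name + " and its primary process different sites");
            } else {
                sitesWithSecondary.add(ref.siteName);
            }
        }

        // Each site needs at least one primary and one secondary instance.
        for (String site : sites) {
            if (!siteOfPrimary.containsValue(site)) {
                problems.add("No primary instance for " + site);
            }
            if (!sitesWithSecondary.contains(site)) {
                problems.add("No secondary instance for " + site);
            }
        }
        return problems;
    }
}
```

Feeding the four instances from the sample file into `validate` yields an empty list; dropping ref_site2_secondary yields a single "No secondary instance for site2" entry.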
Note:
When configuring the active-hot standby system, the preferred site for each customer group should be the same. That is, the preferred site should be given as the current active site. For example:
customerGroupConfigurations:
      - name: "customergroup1"
        clusterPreference:
          - priority: "1"
            routingGatewayList: ""
            name: "BRM"
          - priority: "2"
            routingGatewayList: ""
            name: "BRM2"
      - name: "customergroup2"
        clusterPreference:
          - priority: "1"
            routingGatewayList: ""
            name: "BRM"
          - priority: "2"
            routingGatewayList: ""
            name: "BRM2"
Including Custom Clients in Your Active-Active Configuration
If your system includes a custom client application that calls the ECE API, add the custom client to your active-active disaster recovery configuration. This enables the active-active system architecture to automatically route requests from your custom client to a backup site when a site failover occurs. To do so, you configure the custom client as an ECE Monitor Framework-compliant node in the ECE cluster.
To add a custom client to an active-active configuration:
- 
                           
                           Modify your custom client to use the ECE Monitor Framework: - 
                                 
                                 Add this import statement: import oracle.communication.brm.charging.monitor.framework.internal.MonitorFramework;
- 
                                 
                                 Add these lines to the program:
                                 if (MonitorFramework.isJMXEnabledApp) {
                                     MonitorFramework monitorFramework =
                                         (MonitorFramework) context.getBean(MonitorFramework.MONITOR_BEAN_NAME);
                                     try {
                                         // null parameter for any non-ECE Monitor Agent node
                                         monitorFramework.initializeMonitor(null);
                                     } catch (Exception ex) {
                                         // Failed to initialize Monitor Framework, check log file
                                         System.exit(-1);
                                     }
                                 }
                                 ... // continue as before
 
- 
                           
                           When you start your custom client, include these Java system properties: - 
                                 
                                 -Dcom.sun.management.jmxremote.port set to the port number for enabling JMX RMI connections. Ensure that you specify an unused port number. 
- 
                                 
                                 -Dcom.sun.management.jmxremote.rmi.port set to the port number to which the RMI connector will be bound. 
- -Dtangosol.coherence.member set to the name of the custom client application instance running within the ECE cluster.
 For example:
 java -Dcom.sun.management.jmxremote.port=6666 \
      -Dcom.sun.management.jmxremote.rmi.port=6666 \
      -Dtangosol.coherence.member=customApp1 \
      -jar customApp1.jar
- 
                           
                           Edit the ECE_home/config/eceTopology.conf file to include a row for each custom client application instance. For each row, enter the following information: - 
                                 
                                 node-name: The name of the JVM process for that node. 
- 
                                 
                                 role: The role of the JVM process for that node. 
- 
                                 
                                 host name: The host name of the physical server machine on which the node resides. For a standalone system, enter localhost. 
- 
                                 
                                 host ip: If your host contains multiple IP addresses, enter the IP address so Coherence can be pointed to a port. 
- 
                                 
                                 JMX port: The JMX port of the JVM process for that node. By specifying a JMX port number for one node, you expose MBeans for setting performance-related properties and collecting statistics for all node processes. Enter any free port, such as 9999, for the charging server node to be the JMX-management enabled node. 
- 
                                 
                                 start CohMgt: Specify whether you want the node to be JMX-management enabled. 
 For example:
 #node-name |role      |host name (no spaces!) |host ip |JMX port |start CohMgt |JVM Tuning File
 customApp1 |customApp |localhost              |        |6666     |false        |
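To make the column layout of an eceTopology.conf row concrete, the following minimal sketch splits a pipe-delimited row into the fields listed above. It is an illustration only: the `TopologyRowParser` class is hypothetical, ECE does not ship it, and the handling of empty columns (an empty host ip, a JMX port of -1 when unset) is an assumption.

```java
import java.util.Optional;

public class TopologyRowParser {

    /** Parsed view of one eceTopology.conf row. */
    public static final class Row {
        public final String nodeName;
        public final String role;
        public final String hostName;
        public final String hostIp;        // empty when the column is blank
        public final int jmxPort;          // -1 when the column is blank
        public final boolean startCohMgt;
        public final String jvmTuningFile; // empty when the column is blank

        Row(String[] fields) {
            nodeName = fields[0];
            role = fields[1];
            hostName = fields[2];
            hostIp = fields[3];
            jmxPort = fields[4].isEmpty() ? -1 : Integer.parseInt(fields[4]);
            startCohMgt = Boolean.parseBoolean(fields[5]);
            jvmTuningFile = fields[6];
        }
    }

    /** Returns the parsed row, or an empty Optional for blank lines and # comments. */
    public static Optional<Row> parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return Optional.empty();
        }
        // A limit of -1 keeps trailing empty columns such as a blank JVM Tuning File.
        String[] raw = trimmed.split("\\|", -1);
        if (raw.length < 6) {
            throw new IllegalArgumentException("Expected at least 6 columns: " + line);
        }
        String[] fields = new String[7];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = i < raw.length ? raw[i].trim() : "";
        }
        return Optional.of(new Row(fields));
    }
}
```

A check of this kind can catch a custom client row whose JMX port differs from the -Dcom.sun.management.jmxremote.port value passed on the command line.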
Including Offline Mediation Controller in Your Active-Active Configuration
If your system includes Oracle Communications Offline Mediation Controller, add it to your active-active disaster recovery configuration. This enables the active-active system architecture to automatically route requests from Offline Mediation Controller to a backup site when a site failover occurs.
To include Offline Mediation Controller in your active-active configuration:
- 
                           
                           On each active production site, do the following: - 
                                 
                                 Log in to your ECE driver machine as the rms user. 
- 
                                 
                                 In your ocecesdk/config/client-charging-context.xml file, add the following line to the beans element:
                                 <import resource="classpath:/META-INF/spring/monitor.framework-context.xml"/>
 
- 
                           
                           On each Offline Mediation Controller machine, do the following: - 
                                 
                                 Log in to your Offline Mediation Controller machine as the rms user. 
- 
                                 
                                 Add the following lines to your OCOMC_home/bin/nodemgr file:
                                 -Dcom.sun.management.jmxremote
                                 -Dcom.sun.management.jmxremote.ssl=false
                                 -Dcom.sun.management.jmxremote.rmi.port=rmi_port
                                 -Dcom.sun.management.jmxremote.port=port
                                 where:
- 
                                          
                                          rmi_port is set to the port number to which the RMI connector will be bound. 
- 
                                          
                                          port is set to the port number for enabling JMX RMI connections. Ensure that you specify an unused port number. 
 
- 
                                 
                                 In the OCOMC_home/bin/UDCEnvironment file, set JMX_ENABLED_STATUS to true and set JMX_PORT to the desired JMX port number:
                                 #For enabling jmx in an active-active setup
                                 JMX_ENABLED_STATUS=true
                                 JMX_PORT=9992
 
- 
                           
                           On each Offline Mediation Controller machine, restart Node Manager by going to the OCOMC_home/bin directory and running this command: ./nodemgr
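After Node Manager restarts, it can be useful to confirm that the JMX port is reachable from the ECE side. The sketch below uses only the standard javax.management.remote API; the `JmxPortCheck` class name is hypothetical, and the service URL form shown is the conventional one for a JVM started with -Dcom.sun.management.jmxremote.port. It assumes that no JMX authentication is required; if authentication is enabled on the target JVM, credentials would have to be supplied in the environment map passed to JMXConnectorFactory.connect.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPortCheck {

    /** Builds the conventional RMI-based JMX service URL for a host and JMX port. */
    public static JMXServiceURL serviceUrl(String host, int port) {
        String url = "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
        try {
            return new JMXServiceURL(url);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Invalid JMX URL: " + url, e);
        }
    }

    /** Connects to the remote JVM and returns the number of MBeans it exposes. */
    public static int countMBeans(String host, int port) throws IOException {
        try (JMXConnector connector = JMXConnectorFactory.connect(serviceUrl(host, port))) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            return connection.getMBeanCount();
        }
    }

    public static void main(String[] args) throws IOException {
        // Example values only: the host is a placeholder, and 9992 mirrors the JMX_PORT sample.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9992;
        System.out.println("MBeans visible: " + countMBeans(host, port));
    }
}
```

A connection failure here usually points at a port mismatch or a firewall rule rather than at ECE itself, so it is a quick first check before investigating the Monitor Framework.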
Failing Over to a Backup Site (Active-Active)
- 
                              
                              Open a JMX editor such as JConsole.
- 
                              
                              Expand the ECE Monitoring node. 
- 
                              
                              Expand Agent. 
- 
                              
                              Expand Operations. 
- 
                              
                              Run the failoverSite() operation, specifying the name of the failed site.
- 
                              
                              On the backup site, stop replicating the ECE cache data to the primary production site by running the following command:
                              gridSync stop PrimaryProductionClusterName
                              where PrimaryProductionClusterName is the name of the cluster in the primary production site.
- 
                              
                              On the backup site, do the following:
- Change the BRM, PDC, and Customer Updater connection details to connect to BRM and PDC on the backup site by using a JMX editor.
  Note: If only ECE in the primary production site failed and BRM and PDC in the primary production site are still running, you need not change the BRM and PDC connection details on the backup site.
- Start BRM and PDC.
 
- 
                              
                              Recover the data in the Oracle NoSQL database data store of the primary production site by performing the following:
- Convert the secondary Oracle NoSQL database data store node of the primary production site to the primary Oracle NoSQL database data store node by performing a failover operation in the Oracle NoSQL database data store. For more information, see "Performing a Failover" in Oracle NoSQL Database Administrator's Guide.
  The secondary Oracle NoSQL database data store node of the primary production site is now the primary Oracle NoSQL database data store node of the primary production site.
- On the backup site, convert the rated events from the Oracle NoSQL database data store node that you just converted into the primary node into CDR files by starting Rated Event Formatter.
- In a backup or remote production site, load the CDR files that you just converted into BRM by using Rated Event (RE) Loader.
- Shut down the Oracle NoSQL database data store node that you just converted into the primary node.
  See the "stop" utility in Oracle NoSQL Database Administrator's Guide for more information.
- Stop the Rated Event Formatter that you just started.
 
- 
                              
                              In a backup or remote production site, start Pricing Updater, Customer Updater, and EM Gateway by running the following commands:
                              start pricingUpdater
                              start customerUpdater
                              start emGateway
                              All pricing and customer data is now back in the ECE grid in the backup or remote production site.
- 
                              
                              Stop and restart BRM Gateway. 
- 
                              
                              Migrate internal BRM notifications from the primary production site to a backup or remote production site. See "Migrating ECE Notifications" for more information. Note: - If the expiry duration is configured for these notifications, ensure that you migrate the notifications before they expire. For the expiry duration, see the expiry-delay entry for the ServiceContext module in the ECE_home/config/charging-cache-config.xml file.
- All external notifications from a production site are published to the respective JMS queue. Diameter Gateway retrieves the notifications from the JMS queue and replicates to other sites based on the configuration.
 
- 
                              
                              Ensure that the network clients route all requests to the backup or remote production site. 
The former backup site or one of the remote production sites is now the new preferred production site. When the original preferred site starts functioning again, you run the recoverSite operation for it, and site traffic routes back to the preferred site. For more information, see "Switching Back to the Original Production Site (Active-Active)".
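The JConsole steps for failoverSite can also be scripted through the standard JMX API. The sketch below is an illustration under stated assumptions, not the ECE implementation: the object name `example.ece:type=Agent`, the `SiteAgentMBean` interface, and the single String parameter are hypothetical stand-ins for the real ECE Monitoring Agent MBean, whose actual object name and operation signature should be read from JConsole. The stand-in is registered in the local platform MBean server so that the example runs on its own.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class SiteOperationSketch {

    /** Stand-in management interface; the real ECE agent MBean may expose different signatures. */
    public interface SiteAgentMBean {
        String failoverSite(String siteName);
        String recoverSite(String siteName);
    }

    /** Stand-in implementation that only tracks which sites are marked as failed. */
    public static class SiteAgent implements SiteAgentMBean {
        private final Set<String> failedSites = new TreeSet<>();

        @Override
        public synchronized String failoverSite(String siteName) {
            failedSites.add(siteName);
            return "failed sites: " + failedSites;
        }

        @Override
        public synchronized String recoverSite(String siteName) {
            failedSites.remove(siteName);
            return "failed sites: " + failedSites;
        }
    }

    /** Invokes a one-argument String operation, such as failoverSite, on the given MBean. */
    public static Object invokeSiteOperation(MBeanServerConnection connection, ObjectName agent,
                                             String operation, String siteName) {
        try {
            return connection.invoke(agent, operation,
                    new Object[] {siteName}, new String[] {String.class.getName()});
        } catch (Exception e) {
            throw new IllegalStateException("JMX operation " + operation + " failed", e);
        }
    }

    /** Registers the stand-in locally, marks site1 as failed, then recovers it. */
    public static String demo() {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName agent = new ObjectName("example.ece:type=Agent"); // hypothetical name
            server.registerMBean(new SiteAgent(), agent);
            try {
                Object afterFailover = invokeSiteOperation(server, agent, "failoverSite", "site1");
                Object afterRecover = invokeSiteOperation(server, agent, "recoverSite", "site1");
                return afterFailover + " -> " + afterRecover;
            } finally {
                server.unregisterMBean(agent);
            }
        } catch (Exception e) {
            throw new IllegalStateException("Demo failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Against a live system, `invokeSiteOperation` would receive an MBeanServerConnection obtained from a remote JMX connector instead of the local platform server.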
Switching Back to the Original Production Site (Active-Active)
- 
                              
                              Install ECE and other required components in the original primary production site. For more information, see "Installing Elastic Charging Engine" in ECE Installation Guide. Note: If only ECE in the original primary production site failed and BRM and PDC in the original primary production site are still running, install only ECE and provide the connection details about BRM and PDC in the original primary production site during ECE installation.
- 
                              
                              On the machine on which the Oracle WebLogic server is installed, verify that the JMS queues have been created for loading pricing data and for sending event notification, and that JMS credentials have been configured correctly. 
- 
                              
                              Set the following parameter in the ECE_home/config/ece.properties file to false:
                              loadConfigSettings = false
                              The configuration data is not loaded in memory.
- 
                              
                              Add all details about participant sites in the federation-config section of the ECE Coherence override file (for example, ECE_home/config/charging-coherence-override-prod.xml). To confirm which ECE Coherence override file is used, see the tangosol.coherence.override value in the ECE_home/config/ece.properties file. Table 87-3 provides the federation configuration parameter descriptions and default values.
- 
                              
                              Go to the ECE_home/bin directory. 
- 
                              
                              Start ECC: ./ecc
- 
                              
                              Start the charging server nodes: start server
- 
                              
                              Replicate the ECE cache data to the original production site by using the gridSync utility. For more information, see "Replicating ECE Cache Data". 
- 
                              
                              Start the following ECE processes and gateways:
                              start brmGateway
                              start ratedEventFormatter
                              start diameterGateway
                              start radiusGateway
                              start httpGateway
- 
                              
                              Verify that the same number of entries as in the new production site are available in the customer, balance, configuration, and pricing caches in the original production site by using the query.sh utility. 
- 
                              
                              Stop Pricing Updater, Customer Updater, and EM Gateway in the new primary production site and then start them in the original primary production site. 
- 
                              
                              Migrate internal BRM notifications from the new primary production site to the original primary production site. For more information, see "Migrating ECE Notifications". 
- 
                              
                              Change the BRM Gateway, Customer Updater, and Pricing Updater connection details to connect to BRM and PDC in the original primary production site by using a JMX editor. 
- 
                              
                              Stop RE Loader in the new primary production site and then start it in the original primary production site. 
- 
                              
                              Stop and restart BRM Gateway in both the new primary production site and the original primary production site. The roles of the sites are now reversed to the original roles. 
- 
                              
                              Open a JMX editor such as JConsole.
- 
                              
                              Expand the ECE Monitoring node. 
- 
                              
                              Expand Agent. 
- 
                              
                              Expand Operations. 
- 
                              
                              Run the recoverSite() operation, specifying the name of the recovered site.
- 
                              
                              If data persistence is enabled and you failed over your Rated Event Formatter instance at the original site to a secondary instance at a remote site, restart any primary and secondary Rated Event Formatter instances at the original site. 
Note:
EM Gateway routes connection requests either to the local Elastic Charging Server (ECS) or to the HTTP Gateway nodes on the remote sites. If the site does not respond, the request is processed locally on the same site. When a production site goes down, the CDR database retains all in-progress (or incomplete) CDR sessions, and all unrated 5G usage events are diverted to the remote HTTP Gateway.
When a site is marked DOWN in an active-active setup, bring down its gateways. Otherwise, they continue to service requests, which is not expected.
Processing Usage Requests in the Site Received
To configure the ECE active-active mode to process usage requests in the site that receives the request irrespective of the subscriber's preferred site, perform the following steps:
- 
                        
                        Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans". 
- 
                        Expand the ECE Configuration node. 
- 
                        Expand charging.brsConfigurations.default. 
- 
                        Expand Attributes. 
- 
                        
                        Set the skipActiveActivePreferredSiteRouting attribute to true. Note: By default, the skipActiveActivePreferredSiteRouting attribute is set to false.
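Setting an attribute such as skipActiveActivePreferredSiteRouting in a JMX editor corresponds to a setAttribute call in the standard JMX API. The following sketch shows that call against a local stand-in MBean so that it runs without ECE. The object name `example.ece:type=brsConfigurations` and the `BooleanConfig` class are hypothetical; the real object name must be taken from the ECE Configuration node in your JMX editor.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.ReflectionException;

public class AttributeToggleSketch {

    /** Stand-in configuration MBean that keeps its attributes in a map. */
    public static final class BooleanConfig implements DynamicMBean {
        private final Map<String, Object> values = new ConcurrentHashMap<>();

        public BooleanConfig(String attributeName, boolean initial) {
            values.put(attributeName, initial);
        }

        @Override
        public Object getAttribute(String name) throws AttributeNotFoundException {
            Object value = values.get(name);
            if (value == null) {
                throw new AttributeNotFoundException(name);
            }
            return value;
        }

        @Override
        public void setAttribute(Attribute attribute) throws AttributeNotFoundException {
            if (!values.containsKey(attribute.getName())) {
                throw new AttributeNotFoundException(attribute.getName());
            }
            values.put(attribute.getName(), attribute.getValue());
        }

        @Override
        public AttributeList getAttributes(String[] names) {
            AttributeList result = new AttributeList();
            for (String name : names) {
                if (values.containsKey(name)) {
                    result.add(new Attribute(name, values.get(name)));
                }
            }
            return result;
        }

        @Override
        public AttributeList setAttributes(AttributeList attributes) {
            AttributeList applied = new AttributeList();
            for (Attribute attribute : attributes.asList()) {
                if (values.containsKey(attribute.getName())) {
                    values.put(attribute.getName(), attribute.getValue());
                    applied.add(attribute);
                }
            }
            return applied;
        }

        @Override
        public Object invoke(String action, Object[] params, String[] signature)
                throws ReflectionException {
            throw new ReflectionException(new NoSuchMethodException(action));
        }

        @Override
        public MBeanInfo getMBeanInfo() {
            MBeanAttributeInfo[] attributes = values.keySet().stream()
                    .map(n -> new MBeanAttributeInfo(n, "boolean", n, true, true, false))
                    .toArray(MBeanAttributeInfo[]::new);
            return new MBeanInfo(getClass().getName(), "Stand-in configuration MBean",
                    attributes, null, null, null);
        }
    }

    /** Sets a boolean attribute and returns the value read back from the MBean. */
    public static boolean setAndReadBack(MBeanServerConnection connection, ObjectName target,
                                         String attribute, boolean value) {
        try {
            connection.setAttribute(target, new Attribute(attribute, value));
            return (Boolean) connection.getAttribute(target, attribute);
        } catch (Exception e) {
            throw new IllegalStateException("Could not set " + attribute, e);
        }
    }

    /** Registers the stand-in with the default value false, then switches it to true. */
    public static boolean demo() {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        String attribute = "skipActiveActivePreferredSiteRouting";
        try {
            ObjectName target = new ObjectName("example.ece:type=brsConfigurations"); // hypothetical
            server.registerMBean(new BooleanConfig(attribute, false), target);
            try {
                return setAndReadBack(server, target, attribute, true);
            } finally {
                server.unregisterMBean(target);
            }
        } catch (Exception e) {
            throw new IllegalStateException("Demo failed", e);
        }
    }
}
```

Reading the value back after the write gives a simple confirmation that the change took effect, which mirrors checking the attribute in JConsole after editing it.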
Replicating ECE Cache Data
In an active-hot standby system, a segmented active-active system, or an active-active system, when you configure or perform disaster recovery, you replicate the ECE cache data to the participant sites by using the gridSync utility.
To replicate the ECE cache data:
- 
                        Go to the ECE_home/bin directory. 
- 
                        Start ECC: ./ecc
- 
                        Do one of the following: - 
                              To start replicating data to a specific participant site asynchronously and also replicate all the existing ECE cache data to a specific participant site, run the following commands:
                              gridSync start [remoteClusterName]
                              gridSync replicate [remoteClusterName]
                              where remoteClusterName is the name of the cluster in a participant site.
- 
                              To start replicating data to all the participant sites asynchronously and also replicate all the existing ECE cache data to all the participant sites, run the following commands:
                              gridSync start
                              gridSync replicate
 
See "gridSync" for more information on the gridSync utility.
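After replication, the procedures in this chapter verify that the customer, balance, configuration, and pricing caches hold the same number of entries at both sites. The comparison itself can be sketched as follows. The `CacheCountCompare` class is hypothetical, and the counts are assumed to have been collected beforehand (for example, by running the query.sh utility at each site); this sketch does not query ECE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

public class CacheCountCompare {

    /**
     * Compares per-cache entry counts from two sites and returns one line per
     * cache whose counts differ or that is missing on either side.
     */
    public static List<String> mismatches(Map<String, Long> source, Map<String, Long> target) {
        // Use the union of cache names so a cache missing on one side is reported too.
        Set<String> caches = new TreeSet<>(source.keySet());
        caches.addAll(target.keySet());

        List<String> result = new ArrayList<>();
        for (String cache : caches) {
            Long sourceCount = source.get(cache);
            Long targetCount = target.get(cache);
            if (!Objects.equals(sourceCount, targetCount)) {
                result.add(cache + ": source=" + sourceCount + ", target=" + targetCount);
            }
        }
        return result;
    }
}
```

An empty result means the counts agree. A non-empty result while replication is still running is not necessarily an error, so the check is most meaningful after gridSync replicate has completed.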
Migrating ECE Notifications
When you fail over to a backup site or switch back to the primary site, you must migrate the notifications to the destination site.
Note:
If you are using Apache Kafka for notification handling, notifications are not migrated to the destination site. Apache Kafka retains the notifications and these notifications appear in the original site or components when they are active.
To migrate ECE notifications:
- 
                        
                        Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans". 
- 
                        Expand the ECE Configuration node. 
- 
                        Expand systemAdmin. 
- 
                        Expand Operations. 
- 
                        Select triggerFailedClusterServiceContextEventMigration. 
- 
                        In the method's failedClusterName field, enter the name of the failed site's cluster. 
- 
                        Click the triggerFailedClusterServiceContextEventMigration button. 
All the internal BRM notifications are migrated to the destination site. In an active-active system, the external notifications are also migrated to the destination site. If you cannot establish the WebLogic cluster subscription due to a site failover, you should restart Diameter Gateway on the destination site. If a site recovers from a failover, you should restart all the Diameter Gateway instances in the cluster.
Active-Active Sy/Gy Session Re-Anchoring
Note:
This is applicable to ECE systems using the partition-based Kafka architecture only.
In ECE systems using the topic-based Kafka architecture (for ECE Interim Patch 15.1.0.0.1 (37951934)), the sessions are automatically re-anchored to a gateway that has a connection to the originating client. For information about the Kafka architectures, see "Creating Kafka Topics for ECE Integration" in the ECE Implementing Charging guide.
During prolonged outages of Diameter Gateways in active-active disaster recovery systems, Sy sessions may fail to re-anchor to another available Diameter Gateway if the Policy and Charging Rules Function (PCRF) does not initiate Spending Limit Request (SLR) intermediates. Notifications such as the Status Notification Request (SNR) are not processed from the partition ID of the unavailable Diameter Gateway. Consequently, notifications remain trapped on the partition ID that was being handled by the Diameter Gateway where the SLR initiation occurred, unless the Sy sessions are terminated and restarted on a different Diameter Gateway.
For Gy sessions, the receipt of a Credit Control Request Update (CCR-U) results in the Gy session being re-anchored at a Diameter Gateway different from the one where the CCR Initial was received. However, before this re-anchoring, any notifications for Gy sessions using a partition ID consumed by the failed Diameter Gateway remain trapped on that partition ID.
When a Diameter Gateway is out of service, you can use the JMX console to issue the command for another Diameter Gateway to start consuming from the partition ID for the failed Diameter Gateway. This gateway can be on the same site or, if there is full-site outage for Diameter Gateways, at an alternate site.
The only criterion for the new Diameter Gateway is that it is connected to the remote peers that the out-of-service Diameter Gateway sends the Sy Status Notification Request (SNR) and Gy Re-Authorization Requests (RAR) to.
- 
                           
                           Configure another Diameter Gateway instance to start consuming from the Kafka partition. For more information, see "Consuming SNRs/RARs from a Specified Kafka Partition". 
- 
                           
                           Configure the Diameter Gateway instance that went down to stop consuming from the Kafka partition. This should be done after the Diameter Gateway that was down has recovered, and should be performed on the same Diameter Gateway that the previous step was performed on. For more information, see "Stopping the Consumption of SNRs/RARs from a Specified Kafka Partition". 
Figure 87-5 depicts the prolonged outage of a Diameter Gateway (DGW-1-1 in the figure) in an active-active disaster recovery system.
Figure 87-5 Single Diameter Gateway Outage
In the figure, PGW represents a packet gateway for sending charging requests to ECE and DGW represents a Diameter Gateway.
Since DGW-1-1 is down, the SNRs/RARs cannot be retrieved from Kafka partition 1 (the partition consumed by the gateway that is down) unless the JMX Console is used to issue the command that connects the partition to another Diameter Gateway. In this case, partition 1 is moved to DGW-1-2 at the same site.
Figure 87-6 depicts the prolonged outage of multiple Diameter Gateways (DGW-1-1 and DGW-1-2 in the figure) in an active-active disaster recovery system.
Figure 87-6 Multiple Diameter Gateway Outages
Since both DGW-1-1 and DGW-1-2 are down, the SNRs/RARs cannot be retrieved from Kafka partitions 1 and 2 at Site 1 unless the JMX Console is used to issue the commands that connect the partitions to other Diameter Gateways. In this case, partition 1 is moved to DGW-2-1 and partition 2 is moved to DGW-2-2, both at the second site.
Consuming SNRs/RARs from a Specified Kafka Partition
Learn how to configure the Diameter Gateway to start consuming SNRs/RARs from a specified Kafka partition during prolonged Diameter Gateway outages.
This allows the Sy/Gy sessions to be re-anchored to another Diameter Gateway that is connected to the remote peers to which it will then send the Sy SNRs and Gy RARs. You can use JMX commands to configure ECE to make another Diameter Gateway start consuming from a specific partition ID. The only parameter you need to pass to the command is the new partition ID that the Kafka consumer will use.
- 
                              
                              Connect ECE Cloud Native to JConsole following the instructions in "JMX Connection to ECE Using JConsole" in BRM Cloud Native System Administrator’s Guide. 
- 
                              
                              For ECE on-prem, access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans" for more information. 
- 
                              
                              Ensure that JConsole is connected to the DiameterGateway process, not the ECS process. 
- 
                              
                              Expand the DiameterGateway node. 
- 
                              
                              Expand DiameterGatewayKafkaResiliencyTracker. 
- 
                              
                              Expand Operations. 
- 
                              
                              Enter the partition ID in the partition field for the startKafkaListener method, and click the method. 
- 
                              
                              The Operation return value dialog box appears, with the value set to True. Click OK. 
Stopping the Consumption of SNRs/RARs from a Specified Kafka Partition
Learn how to configure the Diameter Gateway to stop consuming SNRs/RARs from a specified Kafka partition after prolonged Diameter Gateway outages.
You can use a second command in the JMX Console to configure the Diameter Gateway to stop consuming from a specific Kafka partition. Run this command after the Diameter Gateway that was down has been restored.
- 
                              
                              Connect ECE Cloud Native to JConsole following the instructions in "JMX Connection to ECE Using JConsole" in BRM Cloud Native System Administrator’s Guide. 
- 
                              
                              For ECE on-prem, access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans" for more information. 
- 
                              
                              Ensure that JConsole is connected to the DiameterGateway process, not the ECS process. 
- 
                              
                              Expand the DiameterGateway node. 
- 
                              
                              Expand DiameterGatewayKafkaResiliencyTracker. 
- 
                              
                              Expand Operations. 
- 
                              
                              Enter the partition ID in the partition field for the stopKafkaListener method, and click the method. 
- 
                              
                              The Operation return value dialog box appears, with the value set to True. Click OK. 
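When several partitions are moved during an outage, it can help to record which gateway took over which partition, so that the matching stop command is later issued on the same gateway. The following Python sketch shows one way to keep that record. It is a bookkeeping aid only and does not call JMX; the PartitionOverrides class and its method names are hypothetical.

```python
class PartitionOverrides:
    """Tracks which substitute gateway is temporarily consuming each partition."""

    def __init__(self):
        self._active = {}  # partition ID -> substitute gateway name

    def start(self, partition, gateway):
        # Record this after invoking startKafkaListener on `gateway`.
        if partition in self._active:
            raise ValueError(
                f"partition {partition} is already handled by {self._active[partition]}"
            )
        self._active[partition] = gateway

    def stop(self, partition, gateway):
        # Record this after invoking stopKafkaListener on the same gateway.
        owner = self._active.get(partition)
        if owner != gateway:
            raise ValueError(
                f"partition {partition} was started on {owner}, not {gateway}"
            )
        del self._active[partition]

    def pending(self):
        """Return the overrides that still need a stop command."""
        return dict(self._active)


overrides = PartitionOverrides()
overrides.start(1, "DGW-2-1")
overrides.start(2, "DGW-2-2")
overrides.stop(1, "DGW-2-1")
print(overrides.pending())  # {2: 'DGW-2-2'}
```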
Supporting External Module Gateways in Degraded Mode
When the number of healthy Elastic Charging Server (ECS) nodes in a cluster falls below the minimum configured threshold, the cluster runs in degraded mode. The External Module (EM) Gateway, on receiving incoming requests from the BRM server, verifies whether the ECS cluster is healthy or not before it forwards the requests.
If the EM Gateway identifies the cluster to be unhealthy, it returns a PIN_ERR_NOT_ACTIVE (107) error to the BRM server and does not forward the requests to the ECS cluster. The BRM server, on receiving this error, adds this EM Gateway site into the block list, and forwards the requests to another EM Gateway site.
For more information on how to configure the minimum health threshold, see "Configuring the Charging-Server Health Threshold".
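The interaction between the EM Gateway and the BRM server can be sketched as follows. This Python model is a simplified illustration, not the product implementation: the em_gateway and route functions, the site names, and the reply dictionaries are hypothetical. Only the error code 107 (PIN_ERR_NOT_ACTIVE) and the block-list behavior come from the description above.

```python
PIN_ERR_NOT_ACTIVE = 107  # error code named in the text above


def em_gateway(healthy_nodes, min_healthy_nodes):
    """Build a stand-in EM Gateway handler for one site."""
    def handle(request):
        if healthy_nodes < min_healthy_nodes:
            # Degraded cluster: reject without forwarding to ECS.
            return {"error": PIN_ERR_NOT_ACTIVE}
        return {"error": None, "handled": request}
    return handle


def route(request, sites, block_list):
    """Stand-in for the BRM server: try sites in order, skipping blocked ones."""
    for name, handle in sites:
        if name in block_list:
            continue
        reply = handle(request)
        if reply["error"] == PIN_ERR_NOT_ACTIVE:
            block_list.add(name)  # do not send further requests to this site
            continue
        return name, reply
    return None, None


# Site 1 has one healthy node against a threshold of two, so it is degraded.
sites = [("site1", em_gateway(1, 2)), ("site2", em_gateway(3, 2))]
blocked = set()
served_by, reply = route("update-balance", sites, blocked)
print(served_by, sorted(blocked))  # site2 ['site1']
```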
Managing Degraded Performance of ECS Nodes
ECE is sometimes unable to process real-time traffic from the Diameter Gateway instances and return the charging response within the expected service latency.
The degradations are primarily caused by:
- 
                        
                        Multiple ECS nodes reaching 100% BRMFederatedCache thread utilization, causing a backlog in the Diameter Gateway nodes. 
- 
                        
                        A single ECS node, with minimal thread utilization, accumulating a large backlog during an ECE service outage. This backlog affects all Diameter Gateway nodes, leading to 100% Invocation Service thread utilization. 
You can resolve these outage issues by doing one of the following:
- 
                        
                        Waiting for ECE to recover by itself, at which point charging responses resume without degradation. 
- 
                        
                        Manually restarting the affected ECS nodes to restore the service. See "Starting and Stopping ECE". 
- 
                        
                        Configuring the ECE ServiceMonitor to detect issues before they cause a backlog in the Diameter Gateway nodes. You can also configure it to automatically restart the BRMFederatedCache service when a problem occurs. See "About Detecting BRMFederatedCache Service Degradation Early". 
About Detecting BRMFederatedCache Service Degradation Early
You can use the ECE ServiceMonitor to regularly check the performance of the BRMFederatedCache service, so it can catch problems before degradation issues cause ECE system outages. The ServiceMonitor continuously tracks the service’s thread utilization, backlog size, and time to process pending requests. When it detects that the service is starting to degrade, the ServiceMonitor does one of the following:
- 
                           
                           Logs details about the degraded event. 
- 
                           
                           Logs details about the degraded event and generates a thread dump. This is the default. 
- 
                           
                           Logs details, generates a thread dump, and restarts the Coherence BRMFederatedCache service. 
Log file names use the format node_name.log and are stored in the ECE_home/logs directory by default. To configure a new log location, see "Configuring Log Location".
By default, the ECE ServiceMonitor is enabled, but you can disable it at any time.
How ServiceMonitor Detects Service Degradation
ECE ServiceMonitor uses dedicated threads to monitor the health of the BRMFederatedCache service. It monitors the BRMFederatedCache caches that have the Coherence cache parameter back-size-limit set to the maximum number of allowed cache entries. It consists of two layers of checks to detect service degradation:
- 
                           
                           First Layer: A normal check process runs every 5 seconds by default and detects the initial signs of degradation in the BRMFederatedCache service. 
- 
                           
                           Second Layer: An alert check process is activated when the normal check process detects degradation. By default, it monitors the BRMFederatedCache service’s performance every 500 milliseconds for up to 3 seconds to ensure requests are completed within that timeframe.
                           - 
                                 
                                 If the service shows improvement, the ServiceMonitor ends the alert check process and returns to the normal check process. 
                           - 
                                 
                                 If the service does not improve, the ServiceMonitor logs details about the degradation and optionally restarts the Coherence BRMFederatedCache service. BRMFederatedCacheServiceRestartService enables recovery after a restart. The LockKey records the ECS node name that initiated the restart and the corresponding restart timestamp. To prevent simultaneous and frequent restarts, only one node can initiate the restart process at a time by locking the ServiceMonitorCache and adding its name and timestamp to the key. 
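The two layers of checks can be modeled with a short Python sketch. This is a simplified illustration under stated assumptions, not the ServiceMonitor implementation: the alert_check and monitor_once functions are hypothetical, the real degradation criteria (thread utilization, backlog size, and processing time) are reduced to a single is_degraded probe, and time is simulated rather than measured. Ending the alert check on the first healthy result is also an assumption.

```python
def alert_check(is_degraded, period_ms=3000, interval_ms=500):
    """Second layer: probe every interval_ms for up to period_ms.

    Returns False as soon as one probe shows recovery, or True if the
    service is still degraded when the period ends.
    """
    elapsed = 0
    while elapsed < period_ms:
        elapsed += interval_ms  # simulated wait between probes
        if not is_degraded(elapsed):
            return False
    return True


def monitor_once(is_degraded, **alert_settings):
    """First layer: one normal check that escalates to the alert check."""
    if not is_degraded(0):
        return "healthy"
    if alert_check(is_degraded, **alert_settings):
        return "act"  # log, thread dump, or restart, depending on configuration
    return "recovered"


calls = []


def always_degraded(elapsed_ms):
    calls.append(elapsed_ms)
    return True


print(monitor_once(always_degraded))  # act
print(calls)  # [0, 500, 1000, 1500, 2000, 2500, 3000]
```

With the default values, the alert check probes the service six times (3000 / 500) before the ServiceMonitor acts.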
 
About Scenarios When the BRMFederatedCache Service Is Not Restarted
There are certain scenarios in which the BRMFederatedCache service is not restarted even if degradation is detected. The ServiceMonitor does not restart the BRMFederatedCache service in the following situations:
- 
                           
                           If the BRMFederatedCache service’s HA status is Endanger. Skipping the restart in this state avoids data loss. 
- 
                           
                           If the BRMFederatedCache service experiences unbalanced partitions due to active rebalancing, and partition transfers between ECS nodes are ongoing. 
- 
                           
                           If Coherence pruning activities are in progress for BRMFederatedCache caches that have the back-size-limit attribute set to a value greater than 0. The caches involved in this process include AggregateObjectUsage, AggregateObjectUpdate, RatedEvent, TerminatedSessionHistory, and ServiceContext. 
- 
                           
                           If one ECS node attempts to restart the service while another node is already in the process of restarting. 
- 
                           
                           If a restart occurred recently. This gives the system time to stabilize between restart events. 
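The conditions above amount to a series of guards that must all pass before a restart is attempted. The following Python sketch expresses them as a single function. It is illustrative only: the may_restart function, its parameters, and the reason strings are hypothetical, and the 180000-millisecond default mirrors the default of serviceMonitorRestartWaitPeriod.

```python
def may_restart(ha_endangered, rebalancing, pruning, restart_in_progress,
                last_restart_ms, now_ms, wait_period_ms=180000):
    """Return (allowed, reason) for a BRMFederatedCache restart request."""
    if ha_endangered:
        return False, "HA status is endangered"
    if rebalancing:
        return False, "partition rebalancing in progress"
    if pruning:
        return False, "cache pruning in progress"
    if restart_in_progress:
        return False, "another node is restarting the service"
    # last_restart_ms is None when no restart has happened yet.
    if last_restart_ms is not None and now_ms - last_restart_ms < wait_period_ms:
        return False, "restart wait period not elapsed"
    return True, "ok"


# One minute after the previous restart: still inside the wait period.
print(may_restart(False, False, False, False, last_restart_ms=0, now_ms=60000))
```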
Configuring ServiceMonitor to Detect BRMFederatedCache Service Degradation
You can configure ServiceMonitor to detect BRMFederatedCache service degradation in two ways:
- 
                           
                           At installation. See "Configuring ServiceMonitor Attributes At Installation". 
- 
                           
                           During runtime. See "Configuring ServiceMonitor Attributes During Runtime". 
For information on ServiceMonitor attribute entries, see "About the ServiceMonitor Attribute Entries".
Configuring ServiceMonitor Attributes At Installation
You can configure the ServiceMonitor attributes during installation by editing the ECE_home/config/management/charging-settings.xml file. To do so:
- 
                              
                              Open your charging-settings.xml file. 
- 
                              
                              In the server section, set the attributes in Table 87-6. For example:
                              <server config-class="oracle.communication.brm.charging.appconfiguration.beans.BizParamConfig"
                                 ...
                                 serviceMonitorLevel="THREAD_DUMP"
                                 serviceMonitorNormalCheckInterval="5000"
                                 serviceMonitorAlertCheckPeriod="3000"
                                 serviceMonitorAlertCheckInterval="500"
                                 serviceMonitorRestartWaitPeriod="180000"
- 
                              
                              Restart ECE. See "Starting and Stopping ECE". 
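Before restarting ECE, you may want to confirm that the edited values are numeric and mutually consistent. The following Python sketch reads the ServiceMonitor attributes from a server element and runs a few sanity checks. It is a sketch under stated assumptions: the SAMPLE fragment is a trimmed stand-in for the real charging-settings.xml file, which contains more attributes and surrounding elements, and the read_service_monitor and check functions and their rules are hypothetical rather than validations performed by ECE.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the server element; the real file has more content.
SAMPLE = """
<server
    serviceMonitorLevel="THREAD_DUMP"
    serviceMonitorNormalCheckInterval="5000"
    serviceMonitorAlertCheckPeriod="3000"
    serviceMonitorAlertCheckInterval="500"
    serviceMonitorRestartWaitPeriod="180000"/>
"""

TIMING_ATTRS = (
    "serviceMonitorNormalCheckInterval",
    "serviceMonitorAlertCheckPeriod",
    "serviceMonitorAlertCheckInterval",
    "serviceMonitorRestartWaitPeriod",
)


def read_service_monitor(xml_text):
    """Return the ServiceMonitor attributes of a server element as a dict."""
    server = ET.fromstring(xml_text.strip())
    settings = {"serviceMonitorLevel": server.get("serviceMonitorLevel")}
    for name in TIMING_ATTRS:
        settings[name] = int(server.attrib[name])  # values are milliseconds
    return settings


def check(settings):
    """Return a list of problems found; an empty list means no issues."""
    problems = [f"{n} must be positive" for n in TIMING_ATTRS if settings[n] <= 0]
    if (settings["serviceMonitorAlertCheckInterval"]
            > settings["serviceMonitorAlertCheckPeriod"]):
        problems.append("alert check interval exceeds alert check period")
    return problems


settings = read_service_monitor(SAMPLE)
print(check(settings))  # []
```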
Configuring ServiceMonitor Attributes During Runtime
You can use a JMX editor to configure ServiceMonitor attributes when ECE is running.
To configure the ServiceMonitor attributes:
- 
                              
                              Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans". 
- 
                              
                              Expand the ECE Configuration node. 
- 
                              
                              Expand charging.server. 
- 
                              
                              Expand Attributes and select the attribute you want to configure. 
- 
                              
                              Set the value of the attribute. 
For descriptions of each attribute, see "About the ServiceMonitor Attribute Entries".
About the ServiceMonitor Attribute Entries
Table 87-6 lists the ServiceMonitor attributes. You can use these attributes to configure the outage recovery settings.
Table 87-6 ServiceMonitor Attribute Entries
| Name | Description | 
|---|---|
| serviceMonitorLevel | Specifies whether to enable the ServiceMonitor feature and the actions to perform when degradations are detected. For the available actions, see "About Detecting BRMFederatedCache Service Degradation Early". The default value is THREAD_DUMP. | 
| serviceMonitorNormalCheckInterval | Specifies the time interval, in milliseconds, for node degradation monitoring for the normal check process. The default is 5000 (5 seconds). This means that the normal check process checks for degradation of the nodes every 5 seconds. | 
| serviceMonitorAlertCheckPeriod | Specifies how long, in milliseconds, the alert check process remains active. The default is 3000 (3 seconds). | 
| serviceMonitorAlertCheckInterval | Specifies the time interval, in milliseconds, for node degradation monitoring when the alert check process is active. The default is 500. This means that degradation checks are run every 500 milliseconds until the time specified in serviceMonitorAlertCheckPeriod. | 
| serviceMonitorRestartWaitPeriod | Specifies how long, in milliseconds, to wait after the most recent restart before restarting the BRMFederatedCache service again. The default value is 180000 (180 seconds). This means that a node can restart the BRMFederatedCache service after 180 seconds have elapsed from the previous restart. | 





