11 Deploying Service Impact Analysis
This chapter describes how to deploy and manage Service Impact Analysis.
Service Impact Analysis Overview
Service Impact Analysis enables you to view the alarm events associated with UIM resources and view the impacts to customer, service, network, logical, and physical resources, and connectivity. It also enables you to assign ownership to specific individuals and track the impact lifecycle using the analysis process. These alarm events are associated with the corresponding UIM resource through the Alarm Consumer service, using the ora-alarm-topology topic with the TMF642 alarm event JSON format.
For the architecture, see "ATA Architecture".
You must deploy ATA before deploying Service Impact Analysis.
Creating Service Impact Analysis Images
You must have created the Service Impact Analysis images as part of "Prerequisites and Configuration for Creating ATA Images" and "Creating ATA Images".
Verify that the following images are available:
- uim-7.8.0.0.0-alarm-consumer-1.3.0.0.0:latest
- uim-7.8.0.0.0-impact-analysis-api-1.3.0.0.0:latest
Creating Service Impact Analysis Instance
The Service Impact Analysis instance depends on a deployed ATA instance.
Prerequisites:
- Deploy ATA using "Creating an ATA Instance".
- Create the required secrets and other configurations (such as Authentication, Oracle Database schema, SmartSearch) while deploying ATA.
Configuring the applications.yaml File
To configure the applications.yaml file:
- Edit the applications.yaml file to provide the image in your repository (name and tag) by running the following command:
vi $SPEC_PATH/$PROJECT/$INSTANCE/applications.yaml
- Edit the image names to reflect the Service Impact Analysis image names and location in your Docker repository as follows:
ata:
  name: "ata"
  image:
    alarmConsumerName: uim-7.8.0.0.0-alarm-consumer-1.3.0.0.0
    impactAnalysisApiName: uim-7.8.0.0.0-impact-analysis-api-1.3.0.0.0
    alarmConsumerTag: latest
    impactAnalysisApiTag: latest
    repository:
    repositoryPath:
- Edit the applications.yaml file to update the replicaCount of alarmConsumer and impactAnalysisApi. The sample configuration is as follows. Update the replica count according to your performance needs:
alarmConsumer:
  name: "alarm-consumer"
  replicaCount: 3
impactAnalysisApi:
  name: "impact-analysis-api"
  replicaCount: 3
Configuring Service Impact Analysis
This section helps you to configure Service Impact Analysis.
Configuring UIM
Impact correlation for events submitted to the system through the alarm consumer is performed in UIM, and the results can then be viewed in the Service Impact Analysis UI. See "About SIA" for more information.
Configuring Service Impact Analysis API
Service Impact Analysis provides APIs for persisting and managing events and impact correlation of these events. It uses OpenSearch indexes as its persistence mechanism.
Sample configuration files for the Impact Analysis API service, impactanalysis-static-config.yaml.sample and impactanalysis-dynamic-config.yaml.sample, are provided under $COMMON_CNTK/charts/ata-app/charts/ata/config/impact-analysis-api.
To override configuration properties, copy the sample static property file to impactanalysis-static-config.yaml and the sample dynamic property file to impactanalysis-dynamic-config.yaml. Provide a key and value to override the default value of any specific system configuration property. The properties defined in the property files are supplied to the container using Kubernetes configuration maps. Any change to these properties requires the instance to be upgraded. Restart the pods after changing impactanalysis-static-config.yaml.
Date Format
Any modification to the date format must be applied consistently to all consumers of the APIs. The API serializes and deserializes the date attributes stored in OpenSearch indexes using the following date format:
impactanalysis:
  api:
    dateformat: yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
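As a minimal sketch of what applying the format consistently could look like in an API consumer, the following Python helpers write and read timestamps in the pattern shown above. The function names are hypothetical, and the sketch assumes that the literal Z suffix denotes UTC, which the configuration itself does not state.

```python
from datetime import datetime, timezone

# Python counterpart of the pattern yyyy-MM-dd'T'HH:mm:ss.SSS'Z'.
# Assumption: the literal 'Z' suffix means the timestamp is in UTC.


def format_event_date(dt: datetime) -> str:
    """Render an aware datetime with millisecond precision and a literal 'Z'."""
    utc = dt.astimezone(timezone.utc)
    millis = utc.microsecond // 1000  # truncate microseconds to milliseconds
    return utc.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}Z"


def parse_event_date(text: str) -> datetime:
    """Parse a timestamp written by format_event_date back into UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)
```

A consumer that formats and parses through one pair of helpers like this has a single place to change if the configured date format is ever modified.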
Event Status
Service Impact Analysis API supports the following types of events:
- RAISED: This event type is for new events.
- UPDATED: This event type is for existing events with updated information.
- CLEARED: This event type is for events that have been closed.
- REJECTED: This event type is for events that are invalid and are rejected through the Reject Event action available on the Service Impact Analysis UI.
Note:
Rejected events move to a different index called smartsearch-rejected-event and are not deleted. Rejecting events isolates invalid events that are submitted to the system.
The following event statuses, apart from REJECTED, are standard TMF642 event statuses. These event status mappings are part of Unified Assurance integration and should not be changed:
impactanalysis:
  event-status:
    CLEARED: CLEARED
    RAISED: RAISED
    UPDATED: UPDATED
    REJECTED: REJECTED
Event Severity
Service Impact Analysis API and ATA support various event severities on a device. From most severe to least severe, the severities are CRITICAL (1), MAJOR (5), WARNING (10), INDETERMINATE (15), MINOR (20), CLEARED (25), and None (999). Internally, the numeric value identifies the position in the severity hierarchy. The three most severe levels (CRITICAL, MAJOR, and WARNING) are tracked in ATA.
The following event severities are standard TMF642 event severities. These event severity mappings are part of Unified Assurance integration and should not be changed:
impactanalysis:
  severity:
    CLEARED: CLEARED
    INDETERMINATE: INDETERMINATE
    CRITICAL: CRITICAL
    MAJOR: MAJOR
    MINOR: MINOR
    WARNING: WARNING
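The severity hierarchy can be illustrated with a short Python sketch. It uses the numeric ranks listed in this section, where a lower number is more severe, and assumes that rank 15 belongs to INDETERMINATE, the name used in the mapping. The function names are hypothetical and do not reflect the internal implementation.

```python
# Numeric ranks as listed in this section; a lower number is more severe.
SEVERITY_RANK = {
    "CRITICAL": 1,
    "MAJOR": 5,
    "WARNING": 10,
    "INDETERMINATE": 15,  # assumption: rank 15 maps to this name
    "MINOR": 20,
    "CLEARED": 25,
    "NONE": 999,
}

# The three most severe levels are the ones tracked in ATA.
TRACKED_IN_ATA = {"CRITICAL", "MAJOR", "WARNING"}


def most_severe(severities):
    """Return the most severe known name, or "NONE" when nothing is known."""
    known = [s.upper() for s in severities if s.upper() in SEVERITY_RANK]
    return min(known, key=SEVERITY_RANK.__getitem__, default="NONE")


def is_tracked(severity):
    """Tell whether a severity is one of the three tracked levels."""
    return severity.upper() in TRACKED_IN_ATA
```

For a device with MINOR, MAJOR, and WARNING events, most_severe returns MAJOR because 5 is the lowest rank among the three.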
Impact Calculation Thread Pool Size
The impact calculation thread pool size defines the maximum number of open REST requests at a time to UIM for fetching impacts.
impact:
  threadPoolSize: 10
SmartSearch and OpenSearch Related Configurations
The configurations related to SmartSearch and OpenSearch are as follows:
- smartsearch.lang: The value of the SmartSearch internal field used in lang pipeline processing. The only supported value is en.
- smartsearch.tenantId: The value of the SmartSearch internal field for search tenancy. The only supported value is tenant1.
- smartsearch.fetchSize: Defines the number of documents that should be fetched at a time in memory from SmartSearch for processing during bulk operations. The maximum limit for this value is 10000.
- smartsearch.coolDownInterval: Defines the cool-down interval in milliseconds for the OpenSearch index after each batch of documents is processed (bulk updates and deletes).
smartsearch:
  language: en
  tenantId: tenant1
  fetchSize: 1000
  coolDownInterval: 1000
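How fetchSize and coolDownInterval interact during a bulk operation can be sketched as follows. This is an illustration under stated assumptions, not the service's implementation: the function names are hypothetical, the documents come from any iterable rather than from SmartSearch, and the pause is applied after every batch, including the last one.

```python
import time
from typing import Callable, Iterable, Iterator, List

MAX_FETCH_SIZE = 10000  # upper limit stated for smartsearch.fetchSize


def batches(items: Iterable, fetch_size: int) -> Iterator[List]:
    """Yield lists of at most fetch_size items."""
    if not 1 <= fetch_size <= MAX_FETCH_SIZE:
        raise ValueError("fetch_size must be between 1 and 10000")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == fetch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def process_in_batches(items: Iterable, fetch_size: int, cool_down_ms: int,
                       handle_batch: Callable[[List], None],
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Handle each batch, then pause for the cool-down interval."""
    count = 0
    for batch in batches(items, fetch_size):
        handle_batch(batch)
        count += 1
        sleep(cool_down_ms / 1000.0)  # the interval is configured in milliseconds
    return count
```

With the sample values (fetchSize 1000 and coolDownInterval 1000), processing 2500 documents in this sketch would take three batches with a one-second pause after each.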
Events Related Configurations
The configurations related to events are as follows:
- event.idPrefix: Defines the prefix used during document id generation for OpenSearch event index.
- event.reportPrefix: Defines the prefix used during report id generation when the analysis status of the event transitions to COMPLETED.
- event.defaultOwner: Defines the default owner name value to be used in case the owner field is not populated during event creation.
event:
  idPrefix: UE
  reportPrefix: REP
  defaultOwner: Unassigned
Configuring Alarm Consumer
Sample configuration files alarm-consumer-static-config.yaml.sample and alarm-consumer-dynamic-config.yaml.sample are provided under $COMMON_CNTK/charts/ata-app/charts/ata/config/alarm-consumer.
To override configuration properties, copy the sample static property file to alarm-consumer-static-config.yaml and the sample dynamic property file to alarm-consumer-dynamic-config.yaml. Provide a key and value to override the out-of-the-box default value of any specific system configuration property. The properties defined in the property files are supplied to the container using Kubernetes configuration maps. Any change to these properties requires the instance to be upgraded. Restart the pods after changing alarm-consumer-static-config.yaml.
The alarm consumer service receives alarm event notifications as JSON strings in the TMF642 v5.0 specification format (TMF642 alarm JSON wrapped in TMF688 event JSON) from the ora-alarm-topology Kafka topic. While processing a notification, the service creates the alarm and associates it with the affected entity (node or sub-node) in the Service Impact Analysis service. Based on the event type in the notification, alarms can also be updated, cleared, or deleted from the affected entity.
The default implementation for processing an alarm retrieves the entity (device or sub-device) from Inventory (UIM or ATA) by filtering on name and entity type, and then associates the alarm with it. The alarmedObject.id element value in the alarm object represents the name of the entity (the device and sub-device identifiers are separated by the "::" delimiter). The alarmedObject.@referredType element value represents the entity type (device or sub-device type), expressed as the TMF639 sub-type of the resource on which the alarm is raised. The resource sub-types are listed in the following sections.
The following sections list the extensions available to configure or customize the entity lookup logic:
- Resource Type mappings
- Customizing Device Mapping
- Alarmed object extension
The following are samples of the alarmedObject sub-structure from the TMF642 alarm event specification. For details on the full event payload, see the Active Topology Automator Asynchronous Events Guide:
Sample alarmedObject sub structure for device alarm
{
  "eventId": "700001",
  "@type": "AlarmCreateEvent",
  "eventType": "AlarmCreateEvent",
  "event": {
    "alarm": {
      .....
      "alarmedObject": {
        "@referredType": "PhysicalDevice",
        "@type": "AlarmedObjectRef",
        "id": "LSN/EMS_XDM_33/9489"
      },
      ...
    }
  }
}
Sample alarmedObject sub structure for the sub-device (port) alarm
{
  "eventId": "700001",
  "@type": "AlarmCreateEvent",
  "eventType": "AlarmCreateEvent",
  "event": {
    "alarm": {
      .....
      "alarmedObject": {
        "@referredType": "PhysicalPort",
        "@type": "AlarmedObjectRef",
        "id": "LSN/EMS_XDM_33/9489::LSN/EMS_XDM_33/P01-142.1K.07-Line-Card1.OTU4_8"
      },
      ...
    }
  }
}
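To make the identifier convention concrete, the following Python sketch splits an alarmedObject sub-structure into its device and sub-device parts using the "::" delimiter described earlier. It mirrors the default logic of the provided Groovy script but is only an illustration; the function name is hypothetical.

```python
import json

DELIMITER = "::"  # separates the device and sub-device identifiers


def split_alarmed_object(alarmed_object_json: str) -> dict:
    """Return the node, subNode, and referredType of an alarmedObject."""
    alo = json.loads(alarmed_object_json)
    parts = alo["id"].split(DELIMITER)
    return {
        "node": parts[0],
        # Only an id with exactly two parts carries a sub-device.
        "subNode": parts[1] if len(parts) == 2 else "",
        "referredType": alo.get("@referredType", ""),
    }
```

For the port sample above, the node is LSN/EMS_XDM_33/9489 and the sub-node is LSN/EMS_XDM_33/P01-142.1K.07-Line-Card1.OTU4_8. For the device sample, the sub-node is empty.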
Configuring Incoming Channel
To tune performance, uncomment or add the following properties in the alarm-consumer-static-config.yaml file to override the default configuration:
- Edit max.poll.interval.ms to increase or decrease the maximum delay between invocations of poll() when using consumer group management.
- Edit max.poll.records to increase or decrease the maximum number of records returned in a single call to poll().
mp.messaging:
  incoming:
    toFaultChannel:
      # max.poll.interval.ms: 300000
      # max.poll.records: 25
    toRetryChannel:
      # max.poll.interval.ms: 300000
      # max.poll.records: 25
    toDltChannel:
      # max.poll.interval.ms: 300000
      # max.poll.records: 100
Impact Analysis API
The URL of the Impact Analysis API is configured as follows:
impactAnalysis:
  url: http://localhost:8084
Resource Type Mappings
The TMF639ResourceType-mappings.yaml file provides the mapping from protocol-specific resource types to the TMF639 sub-resource types supported by UIM. Update this mappings file only when alarm events are sent with resource type values that UIM does not support.
The alarmedObject.@referredType element value should represent a TMF639 sub-resource type that is supported by UIM. Use this resource mapping extension only when the assurance system does not send resource types that UIM understands.
For example, the assurance system might send protocol-specific object types: the CORBA protocol has object types such as OT_EQUIPMENT and OT_MANAGED_ELEMENT.
When an alarm event does not contain UIM's TMF639 sub-resource types, map the protocol-native resource type to its TMF639 sub-resource type in the TMF639ResourceType-mappings.yaml file and upgrade the alarm consumer service.
The TMF639 resource types supported in alarm consumer are: PhysicalDevice, Equipment, PhysicalPort, and DeviceInterface.
deviceTypeMapping:
  PhysicalDevice:
    - OT_MANAGED_ELEMENT
  Equipment:
    - OT_EQUIPMENT
    - CHASSIS
    - BACKPLANE
    - MODEL
    - RACK
    - SHELF
    - CARD
  PhysicalPort:
    - OT_PHYSICAL_TERMINATION_POINT
    - PORT
    - PTP
  DeviceInterface:
    - OT_CONNECTION_TERMINATION_POINT
    - CTP
  Pipe:
  Connectivity:
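The effect of this mapping can be sketched in Python as a reverse lookup from a protocol-native type to its TMF639 sub-resource type. The dictionary below copies part of the sample mapping, and the function names are hypothetical. Treating a blank type as PhysicalDevice follows the comment in the provided Groovy script; returning None for an unmapped type is an assumption made for this sketch only.

```python
# Part of the sample deviceTypeMapping, expressed as a Python dictionary.
DEVICE_TYPE_MAPPING = {
    "PhysicalDevice": ["OT_MANAGED_ELEMENT"],
    "Equipment": ["OT_EQUIPMENT", "CHASSIS", "SHELF", "CARD"],
    "PhysicalPort": ["OT_PHYSICAL_TERMINATION_POINT", "PORT", "PTP"],
    "DeviceInterface": ["OT_CONNECTION_TERMINATION_POINT", "CTP"],
}


def build_reverse_index(mapping):
    """Map every native type, and every TMF639 type itself, to a TMF639 type."""
    index = {}
    for tmf_type, native_types in mapping.items():
        index[tmf_type] = tmf_type
        for native in native_types or []:  # tolerate keys with no entries
            index[native] = tmf_type
    return index


def resolve_type(referred_type, index):
    """Resolve a @referredType value; a blank value means PhysicalDevice."""
    if not referred_type:
        return "PhysicalDevice"
    return index.get(referred_type)  # None when the type is not mapped
```

With this index, an alarm that arrives with OT_EQUIPMENT resolves to Equipment, and an alarm that already carries PhysicalPort resolves to itself.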
Customizing Device Mapping
By default, the device or sub-device is found by name (using the value from alarmedObject.id). You can update the configuration to look up the device or sub-device by other fields (such as id or deviceIdentifier) when the alarm event carries id or deviceIdentifier values in the alarmedObject.id element.
To use different lookup fields, configure the deviceMapping.inventory.lookupFields sub-structure accordingly in the alarm-consumer-static-config.yaml file and upgrade the alarm consumer service.
A sample file is provided in the $COMMON_CNTK/charts/ata-app/charts/ata/config/alarm-consumer location. If you are modifying it for the first time, rename the alarm-consumer-static-config.yaml.sample file to alarm-consumer-static-config.yaml and update the values accordingly. The supported lookup fields are name, id, and deviceIdentifier. A sample sub-structure is as follows:
deviceMapping:
  inventory:
    lookupFields: # The lookup is done according to the provided order. Supported values are name, id & deviceIdentifier
      - name
      - id
      - deviceIdentifier
  customizeDeviceLookup:
    enabled: false
The above YAML configuration is used to change the device mapping.
Table 11-1 Device Mapping Fields
Field | Description |
---|---|
deviceMapping.lookupFields | An array that accepts only the values name, id, and deviceIdentifier. These are the names of the UIM entity fields used to search for the device or sub-device identified in the alarmedObject.id field. The default is name, which means that the part of alarmedObject.id before the '::' delimiter is searched by name. Lookups follow the order of the array. For example, if the array is changed to id, deviceIdentifier, name, the alarm consumer takes the part of alarmedObject.id before '::' and searches the database by id first, because id is the first element. Only the first three entries of the array are considered. With the order name, id, deviceIdentifier, the search uses name first, then id if nothing is found, and then deviceIdentifier. Once a device is found, no further matching is performed. If no device is found or multiple devices are found, see "Fallout Events Resolution for Alarm Consumer" for more information. |
deviceMapping.customizeDeviceLookup.enabled | A Boolean value. The default value is false. If it is true, the alarm consumer lets you provide a Groovy script that processes either the alarmedObject or the alarm (TMF642) and returns a single value. The value returned from the Groovy script is used for matching in the database. The sample Groovy script is provided in the Alarmed Object Extension section. |
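The ordered lookup described in the table can be sketched in Python as follows. The sketch searches an in-memory list rather than the inventory database, and the function and exception names are hypothetical. It assumes that a field with multiple matches stops the search immediately and results in a fallout, which the table implies but does not spell out.

```python
SUPPORTED_FIELDS = ("name", "id", "deviceIdentifier")


class FalloutError(Exception):
    """Raised when no device or more than one device matches."""


def find_device(devices, key, lookup_fields):
    """Search the fields in order and return the single matching device."""
    fields = list(lookup_fields)[:3]  # only the first three entries count
    for field in fields:
        if field not in SUPPORTED_FIELDS:
            raise ValueError(f"unsupported lookup field: {field}")
        matches = [d for d in devices if d.get(field) == key]
        if len(matches) == 1:
            return matches[0]  # stop at the first field that finds a device
        if len(matches) > 1:
            raise FalloutError(f"multiple devices match {field}={key!r}")
    raise FalloutError(f"no device matches {key!r}")
```

Changing the order of lookup_fields changes which device wins when the same value appears in different fields of different devices, which is why the order of the array matters.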
Note:
In the previous release, deviceMapping.lookupFields was documented as also accepting values such as ipv4 and ipv6. From release 1.3.0, alarms on sub-nodes are supported and deviceMapping.lookupFields no longer supports ipv4 and ipv6. From 1.3.0, the only valid values are name, id, and deviceIdentifier.
Alarmed Object Extension
If the alarmedObject sub-structure has different values or a different format than described in the preceding sections, modify the provided Groovy file to parse it and return the identifier. This custom Groovy code runs when the deviceMapping.customizeDeviceLookup.enabled element is set to true in the alarm-consumer-static-config.yaml file.
Update the Groovy code provided out of the box so that it returns the node and sub-node identifiers and the referredType value. The Groovy file is provided in the $COMMON_CNTK/charts/ata-app/charts/ata/config/alarm-consumer location. Set deviceMapping.customizeDeviceLookup.enabled to true and upgrade the alarm consumer service.
The sample implementation is as follows:
/*
 * Copyright (c) 2024. Oracle and/or its affiliates. All rights reserved.
 */
import groovy.json.JsonSlurper

/**
 * The default delimiter is "::", which is available via the method argument named "delimiter".
 * The "alarmedObject" parameter is the alarmedObject sub-section of the alarm, passed as a String.
 * The "alarm" parameter is the complete alarm information received, passed as a String.
 * The return value must be a Map with String keys and String values.
 */
def getDeviceIds(delimiter, alarmedObject, alarm) {
    def jsonSlurper = new JsonSlurper()
    def alo = jsonSlurper.parseText(alarmedObject)
    def aloId = alo.id
    def referredType = alo.'@referredType'
    def device = aloId.split(delimiter)
    def deviceInfo = [:]

    // Custom implementation starts. The following default implementation returns the expected keys.
    // node and subNode are the names that are searched for in the ATA/Inventory databases.
    if (device.size() == 2) {
        deviceInfo["node"] = device[0]      // node name
        deviceInfo["subNode"] = device[1]   // sub-node name
    } else {
        deviceInfo["node"] = device[0]
        deviceInfo["subNode"] = ""
    }
    // The referredType must match the TMF639ResourceType-mappings.yaml mapping file.
    // A blank value is treated as PhysicalDevice.
    deviceInfo["referredType"] = referredType
    // Custom implementation ends.
    return deviceInfo
}
Note:
The Groovy script that was available in previous versions is not compatible with 1.3.0 and later. Treat this as a new script that supports alarms on sub-nodes. Any logic from the previous Groovy script must be rewritten in this new Groovy file.
Mounting Groovy Scripts To Alarm Consumer
To mount Groovy scripts to the alarm consumer pod:
- Edit the script in the $COMMON_CNTK/charts/ata-app/charts/ata/config/alarm-consumer/DeviceMapping.groovy location.
- Upgrade the ATA instance as follows:
$COMMON_CNTK/scripts/upgrade-applications.sh -p project -i instance -f $SPEC_PATH/project/instance/applications.yaml -a ata
Service Impact Analysis Customer Mappings
See "Impact Analysis Customer Mappings" for more information.
Roles Required for Accessing Service Impact Analysis
For information on roles required for accessing Service Impact Analysis, see "About Authentication".
Deploying Service Impact Analysis Instance
To deploy a Service Impact Analysis instance in your environment using the scripts that are provided with the toolkit, run the following command to create an instance after updating the applications.yaml and configuring Service Impact Analysis:
$COMMON_CNTK/scripts/upgrade-applications.sh -p sr -i quick -f $SPEC_PATH/sr/quick/applications.yaml -a ata
Managing Service Impact Analysis Instance
The SIA instance consists of alarm-consumer and impact-analysis-api services. Update the corresponding sections in the applications.yaml file and follow the steps mentioned in the following sections of "Deploying the Active Topology Automation Service":
- Upgrading the ATA Instance
- Deleting and Recreating an ATA Instance
  To delete only Service Impact Analysis, update the respective replicaCount to 0 and upgrade the instance.
- Restarting the ATA Instance
Managing Service Impact Analysis Logs
To customize and enable logging, update the logging configuration files for the application as follows:
- Customize impact-analysis-api service logs as follows:
  - For service level logs, update the $COMMON_CNTK/charts/ata-app/charts/ata/config/impact-analysis-api/logging-config.xml file.
  - For Helidon-specific logs, update the $COMMON_CNTK/charts/ata-app/charts/ata/config/impact-analysis-api/logging.properties file. By default, the console handler is used. To also log to a file, uncomment the following lines and provide the project and instance names in the location where the logs are saved:
handlers=io.helidon.common.HelidonConsoleHandler,java.util.logging.FileHandler
java.util.logging.FileHandler.formatter=java.util.logging.SimpleFormatter
java.util.logging.FileHandler.pattern=/logMount/sr-quick/ata/ata-api/logs/ImpactAnalysisJULMS-%g-%u.log
- Customize alarm-consumer service logs as follows:
  - For service level logs, update the $COMMON_CNTK/charts/ata-app/charts/ata/config/alarm-consumer/logging-config.xml file.
  - For Helidon server logs, update the $COMMON_CNTK/charts/ata-app/charts/ata/config/alarm-consumer/logging.properties file.
- Once the log configuration files are updated, upgrade the ATA instance. The sample upgrade script is as follows:
$COMMON_CNTK/scripts/upgrade-applications.sh -p sr -i quick -f $SPEC_PATH/sr/quick/applications.yaml -a ata
Alternate Configuration Options
See "Deploying the Active Topology Automation Service" for more information.
See "Fallout Events Resolution for Alarm Consumer" to resolve the fallout events for the alarms.
Fallout Events Resolution for Alarm Consumer
The following image illustrates an alarm event (or message) processing flow in alarm consumer.
Figure 11-1 Process Flow of Fallout Events Resolution for Alarm Consumer
Troubleshooting the Alarm Fallouts
Alarm fallouts are alarms that could not be processed in the alarm consumer because of exceptions or errors that occurred during processing. The major fallout scenarios identified are as follows:
- Incoming alarm has an invalid JSON structure or invalid TMF642 structure.
- No device is found to which the incoming alarm can be mapped.
- Multiple devices are found for the incoming alarm.
- Processed alarm could not be forwarded to SIA API.
In these fallout scenarios, the alarm consumer is configured to retry processing of the same alarm, which helps with intermittent issues such as connectivity problems and temporary data unavailability. If the retry also results in a fallout, the fallout details are stored in the database along with the alarm information. The persisted fallout alarms can be reviewed manually. While reviewing a fallout alarm, the reviewer can modify or correct the alarm data and send the fallout alarm for reprocessing.
Note:
All fallout alarms are persisted, except alarms that have an invalid JSON structure or an invalid TMF642 structure. These non-persisted fallout alarms are dropped from the alarm consumer, and their details cannot be verified later.
The TARGETSYSTEMID for all alarm consumer fallouts is AlarmConsumer. Therefore, when running a fallout resolution, use AlarmConsumer as the target service value.
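The retry-then-persist behavior described above can be outlined with a short Python sketch. It is an illustration only: the function name, the exception type, the number of attempts, and the field names of the persisted record are all assumptions, and only the invalid JSON case is shown for dropped alarms (the invalid TMF642 structure check is omitted).

```python
import json


class ProcessingError(Exception):
    """Stands in for any error raised while processing an alarm."""


def handle_alarm(raw, process, persist_fallout, max_attempts=2):
    """Return "processed", "persisted", or "dropped" for one raw alarm."""
    try:
        alarm = json.loads(raw)
    except json.JSONDecodeError:
        return "dropped"  # structurally invalid alarms are not persisted

    last_error = None
    for _ in range(max_attempts):  # the first attempt plus the retries
        try:
            process(alarm)
            return "processed"
        except ProcessingError as exc:
            last_error = exc

    # Every attempt failed: keep the alarm and the reason for manual review.
    persist_fallout({
        "targetSystemId": "AlarmConsumer",
        "alarm": alarm,
        "error": str(last_error),
    })
    return "persisted"
```

An alarm that fails once because of a temporary connectivity issue and succeeds on the retry never reaches the fallout store in this sketch.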
The alarm consumer fallout alarms can be reprocessed using the fallout resolution APIs:
- To list all alarm consumer fallout alarms:
  - Method - GET
  - URI - /topology/v2/fallout/events?targetService=AlarmConsumer
- To find a specific fallout alarm:
  - Method - GET
  - URI - /topology/v2/fallout/events/eid/<eid value>
- To update fallout alarm data:
  - Method - PUT
  - URI - /topology/v2/fallout/events/eid/<eid value>
  - Request Body - Includes the fallout event details found using the preceding APIs.
- To send for reprocessing:
  - Method - POST
  - URI - /topology/v2/fallout/events/resubmit?state=PENDING&action=NEW&targetService=AlarmConsumer
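The four calls above can be prepared from a script, as in the following Python sketch that uses only the standard library. The host name is a placeholder and the function names are hypothetical. Authentication, which a real deployment is assumed to require, is not shown. The functions only build the requests and do not send them.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://ata.example.com"  # hypothetical host, replace with your own
EVENTS = "/topology/v2/fallout/events"
TARGET = "AlarmConsumer"


def list_fallouts_request(base=BASE_URL):
    """GET all fallout alarms of the alarm consumer."""
    query = urllib.parse.urlencode({"targetService": TARGET})
    return urllib.request.Request(f"{base}{EVENTS}?{query}", method="GET")


def get_fallout_request(eid, base=BASE_URL):
    """GET one fallout alarm by its eid value."""
    safe_eid = urllib.parse.quote(str(eid), safe="")
    return urllib.request.Request(f"{base}{EVENTS}/eid/{safe_eid}", method="GET")


def update_fallout_request(eid, event, base=BASE_URL):
    """PUT corrected fallout event details for one eid value."""
    safe_eid = urllib.parse.quote(str(eid), safe="")
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        f"{base}{EVENTS}/eid/{safe_eid}", data=body, method="PUT",
        headers={"Content-Type": "application/json"})


def resubmit_request(base=BASE_URL):
    """POST a resubmission of pending alarm consumer fallouts."""
    query = urllib.parse.urlencode(
        {"state": "PENDING", "action": "NEW", "targetService": TARGET})
    return urllib.request.Request(f"{base}{EVENTS}/resubmit?{query}", method="POST")
```

A prepared request can be sent with urllib.request.urlopen once the host and any required credentials are in place. A typical review would fetch an event, correct its data, update it with the PUT request, and then resubmit.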