5 Monitoring Oracle Service Bus Service Health

This chapter describes how to monitor the health of your Service Bus projects and services using service health statistics. Statistics such as response times or message, error, and alert counts can help you detect, analyze, and fix any issues.

About Service Health Metrics

Service Bus collects statistics to help you monitor the health of your running services and projects using Fusion Middleware Control. The Service Health page lets you view all services in a domain or search for specific services to view. You can then select a service from the Services list to view more detailed information about that service's health on its own Dashboard page.

You can monitor statistics for the current aggregation interval, or monitor a running count of the statistics since the last time they were reset. You can reset statistics at any time for the domain, for a project, or for a service.

When you display statistics based on the aggregation interval, you get a moving view of the data collected by each service: the aggregation interval determines which statistics are displayed. For example, if the aggregation interval of a particular service is twenty minutes, that service's row displays the data collected in the last twenty minutes. For more information about the aggregation interval, see Introduction to Aggregation Intervals.
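
The moving view described above can be pictured with a small sketch. This is a simplified model for illustration only, not how Service Bus stores or samples its statistics; the IntervalCounter class and the arrival times are hypothetical.

```python
from collections import deque


class IntervalCounter:
    """Counts events that fall inside a moving window (hypothetical model)."""

    def __init__(self, interval_seconds):
        self.interval_seconds = interval_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        # Assumption: timestamps arrive in non-decreasing order.
        self._timestamps.append(timestamp)

    def count(self, now):
        # Discard events that are older than the aggregation interval.
        cutoff = now - self.interval_seconds
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


# A twenty-minute interval; message arrival times are given in seconds.
counter = IntervalCounter(interval_seconds=20 * 60)
for arrival in (0, 300, 900, 1100):
    counter.record(arrival)

print(counter.count(now=1200))  # 4: every arrival is within the last 20 minutes
print(counter.count(now=1600))  # 2: the arrivals at 0 and 300 have aged out
```

The count shrinks as older messages leave the window, which is why an interval-based row can show smaller numbers than the running count for the same service.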

Service Health Metrics for Domains and Projects

When you view metrics for a domain or project, the statistics displayed are only a subset of the general metrics collected for each service. The statistics include aggregation interval, average response time, message count, error count, and alert count. Service health metrics are only displayed for services that have monitoring enabled.

The following table lists the metrics displayed for each type of service. For a complete list of statistics collected, see Statistics Collected for Oracle Service Bus.

Table 5-1 Oracle Service Bus Service Metrics

Metric	Description
Average Execution Time	For a proxy service, the average of the time interval measured between receiving the message at the transport and either handling the exception or sending the response. For a business service, the average of the time interval measured between sending the message in the outbound transport and receiving an exception or a response.
Total Number of Messages	Number of messages sent to the service. In the case of JMS proxy services, if the transaction aborts due to an exception and places the message back in the queue so it is not lost, each retry dequeue is counted as a separate message. In the case of outbound transactions, each retry or failover is likewise counted as a separate message.
Messages With Errors	Number of messages with error responses. For a proxy service, it is the number of messages that resulted in an exit with the system error handler or an exit with a reply failure action. If the error is handled in the service itself with a reply with success or a resume action, the message is not counted as an error. For a business service, it is the number of messages that resulted in a transport error or a timeout. Retries and failovers are treated as separate messages.
Success/Failure Ratio	(Total Number of Messages - Messages With Errors) / Messages With Errors
Security	Number of messages with WS-Security errors. This metric is calculated for both proxy services and business services.
Validation	Number of validation actions in the flow that failed. This metric only applies to proxy services and pipelines.
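
As a worked example of the Success/Failure Ratio formula in the table, the following sketch computes the ratio from the two counts. The function name is hypothetical, and returning None when there are no errors is an assumption made here to avoid dividing by zero; it does not describe how the console displays that case.

```python
def success_failure_ratio(total_messages, messages_with_errors):
    """Return (total - errors) / errors, or None when there are no errors."""
    if messages_with_errors < 0 or messages_with_errors > total_messages:
        raise ValueError("error count must be between 0 and the total message count")
    if messages_with_errors == 0:
        # Assumption: with no failures the ratio is undefined, so return None.
        return None
    return (total_messages - messages_with_errors) / messages_with_errors


# 200 messages, 8 of them with errors: (200 - 8) / 8
print(success_failure_ratio(200, 8))  # 24.0
```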

Proxy Service Metrics

From a proxy service's Dashboard page, you can view the following types of metrics for the service:

General: Displays a snapshot of the proxy service status for the current aggregation interval or since the last reset, including alerts, response times, message counts, error counts, and failure and success ratios.
Operations: Displays the statistics for operations defined for WSDL-based services. If there are no WSDL operations defined for the service, this table is empty.

Business Service Metrics

From a business service's Dashboard page, you can view the following types of metrics for the service:

General: Displays a snapshot of the business service status for the current aggregation interval or since the last reset, including alerts, response times, message counts, error counts, and failure and success ratios.
Result Caching: Displays information about how result caching has been used for the service (if result caching is enabled).
Throttling: Displays the throttling statistics for a business service, including the minimum, maximum, and average throttling times in milliseconds (if throttling is enabled).
Operations: Displays the statistics for operations defined for WSDL-based services. If there are no WSDL operations defined for the service, this table is empty.
Endpoint URIs: Displays statistics for the various endpoint URIs configured for a business service, including the state, message count, error count, and response times. You can also bring URIs online and offline from this view. For more information, see Viewing Endpoint URI Metrics for a Business Service and Metrics for Monitoring Endpoint URIs.

Pipeline Service Metrics

From a pipeline's Dashboard page, you can view the following types of metrics for the pipeline:

General: Displays a snapshot of the pipeline status for the current aggregation interval or since the last reset, including alerts, response times, message counts, and error counts.
Operations: Displays the statistics for operations defined for WSDL-based services. If there are no WSDL operations defined for the service, this table is empty.
Flow Metrics: Displays statistics for the message flow at the pipeline service level, pipeline (pair) level, or the action level, depending on the monitoring level for the pipeline. Statistics include message count, error count, and response times. When you select action-level statistics, the table displays information on actions in the pipeline as a hierarchy of nodes and actions.

Split-Join Service Metrics

From a split-join's Dashboard page, you can view the following types of metrics for the split-join:

General: Displays a snapshot of the split-join status for the current aggregation interval or since the last reset, including alerts, response times, message counts, and error counts.
Flow Metrics: Displays statistics for the message flow at the split-join level, branch level, or activity level, depending on the monitoring level for the split-join. The statistics include message count, error count, and response times. When you select activity-level statistics, the table displays information on the activities in the split-join as a hierarchy.

Monitoring Service Health Statistics

The Service Health pages for Service Bus domains and projects display general metrics for services that have monitoring enabled. The Dashboard page for each service displays more detailed metrics for that service.

The Current Aggregation Interval view displays a moving statistical view of the service metrics. The Since Last Reset view displays a running count of the metrics. If a cluster exists, cluster-wide metrics are displayed by default. Select an individual Managed Server to display metrics for that server.

Monitoring for services is not enabled by default. To learn how to enable monitoring for services, see Viewing and Configuring Operational Settings. By default, the Dashboard refresh rate is No Refresh.

Viewing Statistics for the Services with the Most Errors

The Service Bus Dashboard displays certain statistics for services that have generated the most errors for the time period you select. The statistics include the average response time, the number of messages processed, the number of errors generated, and the number of SLA alerts generated for the service. This is a limited set of statistics; you can click a service name to view the complete set of statistics for that service.

For information about the statistics that appear on this page, see the online help provided with Service Bus.

To view statistics for the services with the most errors:

In Fusion Middleware Control, expand SOA and select service-bus.
On the Service Bus Dashboard, scroll to the Services With Most Errors section.
In the Service Health Snapshot Table, select whether to view statistics for the current aggregation interval or for the period since the statistics were last reset.
In the Server field, select the server for which you want to view statistics.
To view additional statistics for a service, click the name of the service in the table.

The Dashboard for the selected service appears.

Viewing Service Health Statistics for a Domain

The Service Bus - Service Health page displays health statistics for all services in the domain that have monitoring enabled. This is a subset of all statistics; you can click a service name to view the complete set of statistics for that service. You can filter the services displayed in the Services table by a variety of criteria. The following figure shows the Service Health page.

Figure 5-1 Service Health Page

To view statistics for all services in a Service Bus domain:

In Fusion Middleware Control, expand SOA and select service-bus.
Click the Service Health tab.
In the Display Statistics field, do one of the following:
- To display monitoring statistics for the period of the current aggregation interval, select Current Aggregation Interval.
- To display monitoring statistics for the period since you last reset statistics for a service, select Since Last Reset.
In the Server field, select a server from the list of options to display metrics for that server.
To list only specific services, enter any of the following search criteria:
- In the Service field, select the type of service to search for, or select All Services to view all service types.
- In the Name field, enter the name of the search target. This field accepts asterisks and question marks (* and ?) as wildcard characters.
- In the Path field, enter the path of the search target. This field accepts asterisks and question marks (* and ?) as wildcard characters.
  
  Use the following format for the path:
```
project-name/root-folder/ . . ./parent-folder
```
  If a service is directly under the project, use the following format:
```
project-name
```
- In the Invoked by Proxy field, click the search icon to search for and select the proxy service that invokes the service you want to find.
- To view only services with messages, select Has Messages.
- To view only services with alerts, select Has Alerts.
- To view only services with errors, select Has Errors.
- Click Reset to remove the search filters and display all services.
Click Search.

A list of services matching your criteria appears.
To view additional statistics for a service, click the name of the service in the table.

The Dashboard for the selected service appears.
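
The search criteria in the steps above can be read as a filter over the rows of the Services table. The sketch below illustrates that reading; it is not the console's implementation. The ServiceRow record, the sample services, and the assumption that matching is case-sensitive and applies to the whole field are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class ServiceRow:
    # Hypothetical record for one row of the Services table.
    name: str
    path: str
    messages: int = 0
    alerts: int = 0
    errors: int = 0


def wildcard_to_regex(pattern):
    """Translate a pattern where * matches any run of characters and ? matches one."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def search(rows, name="*", path="*", has_messages=False, has_alerts=False, has_errors=False):
    """Return the rows that satisfy every criterion that was supplied."""
    name_re = wildcard_to_regex(name)
    path_re = wildcard_to_regex(path)
    matches = []
    for row in rows:
        # Assumption: a pattern must match the whole name or path.
        if not name_re.fullmatch(row.name) or not path_re.fullmatch(row.path):
            continue
        if has_messages and row.messages == 0:
            continue
        if has_alerts and row.alerts == 0:
            continue
        if has_errors and row.errors == 0:
            continue
        matches.append(row)
    return matches


rows = [
    ServiceRow("OrderProxy", "Retail/Proxy", messages=120, errors=3),
    ServiceRow("OrderBusiness", "Retail/Business", messages=120),
    ServiceRow("AuditProxy", "Finance"),
]

print([r.name for r in search(rows, name="Order*", has_errors=True)])  # ['OrderProxy']
print([r.name for r in search(rows, path="Retail/*")])  # ['OrderProxy', 'OrderBusiness']
```

Calling search(rows) with no criteria returns every row, which corresponds to clicking Reset and then Search.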

Viewing Service Health Statistics for a Project

The Service Bus Project - Service Health page displays health statistics for all services in the project that have monitoring enabled. This is a subset of all statistics; you can click a service name to view the complete set of statistics for that service. You can filter the services displayed in the Services table by a variety of criteria. The following figure shows the Dashboard page for a proxy service.

Figure 5-2 Service Bus Service Dashboard

To view statistics for the services in a Service Bus project:

In Fusion Middleware Control, expand SOA, expand service-bus, and select the name of the project for which you want to view statistics.

The Service Bus Project - Service Health page appears.
In the Display Statistics field, do one of the following:
- To display monitoring statistics for the period of the current aggregation interval, select Current Aggregation Interval.
- To display monitoring statistics for the period since you last reset statistics for a service, select Since Last Reset.
In the Server field, select a server from the list of options to display metrics for that server.
To list only specific services, enter any of the following search criteria:
- In the Service field, select the type of service to search for, or select All Services to view all service types.
- In the Name field, enter the name of the search target. This field accepts asterisks and question marks (* and ?) as wildcard characters.
- In the Path field, enter the path of the search target. This field accepts asterisks and question marks (* and ?) as wildcard characters.
  
  Use the following format for the path:
```
project-name/root-folder/ . . ./parent-folder
```
  If a service is directly under the project, use the following format:
```
project-name
```
- In the Invoked by Proxy field, click the search icon to search for and select the proxy service that invokes the service you want to find.
- To view only services with messages, select Has Messages.
- To view only services with alerts, select Has Alerts.
- To view only services with errors, select Has Errors.
- Click Reset to remove the search filters and display all services.
Click Search.

A list of services matching your criteria appears.
To view additional statistics for a service, click the name of the service in the table.

The Dashboard for the selected service appears.

Viewing All Service Health Statistics for a Service

The Dashboard page for each Service Bus service displays the complete set of service metrics and service-specific statistics for that service, but only if monitoring is enabled for that service. You can access the Dashboard page for a service in several ways.

To view the complete set of health statistics for a service:

Navigate to one of the following pages in Fusion Middleware Control:
- The Service Bus Dashboard page. To access this page, expand SOA and select service-bus.
- The Service Bus - Service Health page. To access this page, expand SOA, select service-bus, and click the Service Health tab.
- The Service Bus - Alert History page. To access this page, expand SOA, select service-bus, and click the Alert History tab.
- The Service Bus Project - Service Health page. To access this page, expand SOA, expand service-bus, and select the name of the project.
If necessary, perform a search for the service whose statistics you want to view.
Click the name of the service whose statistics you want to view.

Resetting Statistics for Service Monitoring

You can use the Service Health page to reset monitoring statistics for all services in a domain or project, or just for one specific service.

When you reset statistics, the system deletes all monitoring statistics that were collected for the service, project, or domain since you last reset statistics. However, the system does not delete the statistics being collected during the current aggregation interval for the service. After a statistics reset, the system immediately starts collecting monitoring statistics for the service again.
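
The difference between the two sets of statistics during a reset can be shown with a toy model. This is only an illustration of the behavior described above, with hypothetical class and attribute names; it does not reflect how Service Bus stores its counters.

```python
class ServiceStatistics:
    """Toy model of one service's message counters (not the Service Bus implementation)."""

    def __init__(self):
        self.since_last_reset = 0   # running count shown in the Since Last Reset view
        self.current_interval = 0   # count for the current aggregation interval

    def record_message(self):
        self.since_last_reset += 1
        self.current_interval += 1

    def reset(self):
        # Only the running count is cleared; the interval data is left untouched.
        self.since_last_reset = 0


stats = ServiceStatistics()
for _ in range(3):
    stats.record_message()

stats.reset()
stats.record_message()  # collection resumes immediately after the reset

print(stats.since_last_reset)  # 1
print(stats.current_interval)  # 4
```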

Note:

If a split-join that gathers branch or activity level statistics is redeployed, the statistics should be reset to ensure that the displayed statistics match the current branches and activities.

To reset statistics for service monitoring:

Do one of the following:
- To reset the statistics for all services in a domain, expand SOA and select service-bus. Click the Service Health tab.
- To reset the statistics for all services in a project, expand SOA, expand service-bus, and select the name of the project.
- To reset the statistics for a single service, navigate to that service's Dashboard page as described in Viewing All Service Health Statistics for a Service.
In the Display Statistics field, select Since Last Reset.
To the right of the Server field, click the Reset icon.

All statistics are reset at the displayed level.

Reset Option Fails to Reset Statistics

If the reset option does not reset the statistics, and no Edit Session is displayed in Change Center, dangling session data likely exists in the sessions folder. Delete the session data manually:

Shut down the OSB Managed Servers.
On each OSB Managed Server and the Admin Server, navigate to MIDDLEWARE_HOME\user_projects\domains\<domainname>\osb\configfwk\sessions.
Make a backup of the sessions folder.
Delete the sessions folder.
Restart the OSB Managed Servers.
Create a session in the OSB Console.
Submit the OSB Console session.
Statistics are now reset.
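
The backup and delete steps above could be scripted along the following lines. This is a minimal sketch, not an Oracle-supplied tool: the function name, the backup naming scheme, and the example paths are assumptions, and the script should be run only while the servers are shut down.

```python
import shutil
import time
from pathlib import Path


def backup_and_remove_sessions(domain_home, backup_root):
    """Copy <domain_home>/osb/configfwk/sessions to a backup folder, then delete it."""
    sessions = Path(domain_home) / "osb" / "configfwk" / "sessions"
    if not sessions.is_dir():
        raise FileNotFoundError(f"no sessions folder at {sessions}")

    # Hypothetical naming scheme: one timestamped backup folder per run.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = Path(backup_root) / f"sessions-backup-{stamp}"

    shutil.copytree(sessions, backup)  # back up first
    shutil.rmtree(sessions)            # delete only after the copy succeeded
    return backup


# Example call (paths are placeholders; substitute your own domain directory):
# backup_and_remove_sessions(
#     r"MIDDLEWARE_HOME\user_projects\domains\<domainname>",
#     r"D:\osb-backups",
# )
```

Because the copy is made before the delete, a failure during the backup leaves the original sessions folder in place.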

What You Might Need to Know About Resetting the Statistics

When you reset statistics for a service, all the statistics collected for the service since the last reset are lost. Resetting the statistics for the domain resets the statistics for all monitored services, whether or not they are displayed on the page. You cannot undo a reset action. The status of endpoint URIs is not reset when you reset statistics.