14 Debugging and Troubleshooting
This chapter provides information about debugging and troubleshooting issues that you may face while setting up OSM cloud native environment and creating OSM cloud native instances.
- Setting Up Java Flight Recorder (JFR)
- Troubleshooting Issues with Traefik, OSM UI, and WebLogic Administration Console
- Common Error Scenarios
- Known Issues
Setting Up Java Flight Recorder (JFR)
The Java Flight Recorder (JFR) is a tool that collects diagnostic data about running Java applications. OSM cloud native leverages JFR. See Java Platform, Standard Edition Java Flight Recorder Runtime Guide for details about JFR.
You can change the JFR settings provided with the toolkit by updating the appropriate values in the instance specification.
To analyze the output produced by the JFR, use Java Mission Control. See Java Platform, Standard Edition Java Mission Control User's Guide for details about Java Mission Control.
JFR is turned on by default in all managed servers. You can disable this feature by setting the enabled flag to false. You can control how long JFR data is retained by updating the max_age parameter in the instance specification:
# Java Flight Recorder (JFR) Settings
jfr:
  enabled: true
  max_age: 4h
Data that is generated by the JFR is saved in the container in /logMount/project-instance/performance/$server_name.
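If you need to analyze a recording on your workstation, you can copy it out of the pod first. The following is a sketch only; the pod name, server directory, and recording file name are assumptions and depend on your project, instance, and JFR settings:
# Copy a JFR recording from a managed server pod to the local machine
kubectl -n project cp project-instance-ms1:/logMount/project-instance/performance/ms1/recording.jfr ./ms1-recording.jfr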
Persisting JFR Data
JFR data can be persisted outside of the container by re-directing it to persistent storage through the use of a PV-PVC. See "Setting Up Persistent Storage" for details.
Once the storage has been set up, enable storageVolume and set the PVC name. Once enabled, JFR data is re-directed automatically.
# The storage volume must specify the PVC to be used for persistent storage.
storageVolume:
  enabled: true
  pvc: storage-pvc
Troubleshooting Issues with Traefik, OSM UI, and WebLogic Administration Console
This section describes how to troubleshoot issues with access to the OSM UI clients, WLST, and WebLogic Administration Console.
It is assumed that Traefik is the Ingress controller being used and that the domain name suffix is osm.org. You can modify the instructions to suit any other domain name suffix that you may have chosen.
Table 14-1 URLs for Accessing OSM Clients
Client | If Not Using Oracle Cloud Infrastructure Load Balancer | If Using Oracle Cloud Infrastructure Load Balancer |
---|---|---|
OSM Task Web Client | http://instance.project.osm.org:30305/OrderManagement | http://instance.project.osm.org:80/OrderManagement |
WLST | http://t3.instance.project.osm.org:30305 | http://t3.instance.project.osm.org:80 |
WebLogic Admin Console | http://admin.instance.project.osm.org:30305/console | http://admin.instance.project.osm.org:80/console |
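As a quick sanity check of name resolution and ingress routing, you can probe one of these URLs from the client machine. This is a sketch only; substitute your own instance and project names and use the port that applies to your environment:
# Any HTTP response (even a redirect or 401/403) indicates that routing works;
# connection refused or timeouts point to DNS, proxy, or ingress issues
curl -v http://instance.project.osm.org:30305/OrderManagement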
Error: HTTP 503 Service Unavailable (for OSM UI Clients)
This error occurs if the managed servers are not running.
To resolve this issue:
- Check the status of the managed servers and ensure that at least one managed server is up and running:
kubectl -n project get pods
- Log into WebLogic Admin Console and navigate to the Deployments section and
check if the State column for oms shows Active. The value in the Targets
column indicates the name of the cluster.
If the application is not Active, check the managed server logs and see if there are any errors. For example, it is likely that the OSM DB Connection pool could not be created. The following could be the reasons for this:
- DB connectivity could not be established due to reasons such as password expired, account locked, and so on.
- DB Schema health check policy failed.
Resolution: To resolve this issue, address the errors that prevent the application from becoming Active. Depending on the nature of the corrective action you take, you may have to perform the following procedures as required:
- Upgrade the instance, by running upgrade-instance.sh
- Upgrade the domain, by running upgrade-domain.sh
- Delete and create a new instance, by running delete-instance.sh followed by create-instance.sh
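These toolkit scripts are typically invoked against your project, instance, and specification files. The following invocation is a sketch only; the flags are assumed to follow the same -p/-i/-s convention used by create-ingress.sh later in this chapter, so confirm them with each script's usage help:
# Re-apply the corrected specification to the running instance
$OSM_CNTK/scripts/upgrade-instance.sh -p project -i instance -s specPath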
Security Warning in Mozilla Firefox
If you use Mozilla Firefox to connect to an OSM cloud native instance via HTTP, your connection may fail with a security warning. You may notice that the URL you entered automatically changes to https://. This can happen even if HTTPS is disabled for the OSM instance. If HTTPS is enabled, it only happens if you are using a self-signed (or otherwise untrusted) certificate.
If you wish to continue with the connection to the OSM instance using HTTP, in the configuration settings for your Firefox browser (URL: "about:config"), set the network.stricttransportsecurity.preloadlist parameter to FALSE.
Error: HTTP 404 Page Not Found
This is the most common problem that you may encounter.
To resolve this issue:
- Check the Domain Name System (DNS) configuration.
Note:
These steps apply for local DNS resolution via the hosts file. For any other DNS resolution, such as corporate DNS, follow the corresponding steps.
The hosts configuration file is located at:
- On Windows: C:\Windows\System32\drivers\etc\hosts
- On Linux: /etc/hosts
Check if the following entry exists in the hosts configuration file of the client machine from where you are trying to connect to OSM:
- Local installation of Kubernetes without Oracle Cloud Infrastructure load balancer:
Kubernetes_Cluster_Master_IP instance.project.osm.org t3.instance.project.osm.org admin.instance.project.osm.org
- If Oracle Cloud Infrastructure load balancer is used:
Load_balancer_IP instance.project.osm.org t3.instance.project.osm.org admin.instance.project.osm.org
- Check the browser settings and ensure that *.osm.org is added to the No proxy list, if your proxy cannot route to it.
- Check if the Traefik pod is running and, if necessary, install or update the Traefik Helm chart:
kubectl -n traefik get pod
NAME                                READY   STATUS    RESTARTS   AGE
traefik-operator-657b5b6d59-njxwg   1/1     Running   0          128m
- Check if the Traefik service is running. If the Traefik service is not running, install or update the Traefik Helm chart:
kubectl -n traefik get svc
NAME                         TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)                      AGE
oci-lb-service-traefik       LoadBalancer   192.0.2.1     203.0.113.25   80:31115/TCP                 20d   <---- Is expected in OCI environment only
traefik-operator             NodePort       192.0.2.25    <none>         443:30443/TCP,80:30305/TCP   141m
traefik-operator-dashboard   ClusterIP      203.0.113.1   <none>         80/TCP                       141m
- Check if Ingress is configured, by running the following command:
kubectl -n project get ing
NAME                       HOSTS                                                                                  ADDRESS   PORTS   AGE
project-instance-traefik   instance.project.osm.org,t3.instance.project.osm.org,admin.instance.project.osm.org              80      89m
If Ingress is not running, install it by running the following command:
$OSM_CNTK/scripts/create-ingress.sh -p project -i instance -s specPath
- Check if the Traefik back-end systems are registered, by using one of the following options:
- Run the following commands to check if your project namespace is being monitored by Traefik. Absence of your project namespace means that your managed server back-end systems are not registered with Traefik.
$ cd $OSM_CNTK
$ source scripts/common-utils.sh
$ find_namespace_list 'namespaces' traefik traefik-operator
"traefik","project_1", "project_2"
- Check the Traefik Dashboard. Add the following DNS entry in your hosts configuration file, regardless of whether you are using Oracle Cloud Infrastructure load balancer or not:
Kubernetes_Access_IP traefik.osm.org
Navigate to http://traefik.osm.org:30305/dashboard/ and check the back-end systems that are registered. If you cannot find your project namespace, install or upgrade the Traefik Helm chart. See "Installing the Traefik Ingress Controller as Alternate (Deprecated)".
Reloading Instance Backend Systems
If your instance's ingress is present, yet Traefik does not recognize the URLs of your instance, try to unregister and register your project namespace again. You can do this by using the unregister-namespace.sh and register-namespace.sh scripts in the toolkit.
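The following invocation is a sketch only; the flags are assumptions based on the -p project convention used by other toolkit scripts, and the -t target value is hypothetical, so check each script's usage help before running it:
# Unregister and then re-register the project namespace with the ingress controller
$OSM_CNTK/scripts/unregister-namespace.sh -p project -t traefik
$OSM_CNTK/scripts/register-namespace.sh -p project -t traefik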
Note:
Unregistering a project namespace will stop access to any existing instances in that namespace that were working prior to the unregistration.
Debugging Traefik Access Logs
To increase the log level and debug Traefik access logs:
- Run the following command:
$ helm upgrade traefik-operator traefik/traefik --version 9.11.0 --namespace traefik --reuse-values --set logs.access.enabled=true
A new instance of the Traefik pod is created automatically.
- Look for the pod that is created most recently:
$ kubectl get po -n traefik
NAME                        READY   STATUS    RESTARTS   AGE
traefik-operator-pod_name   1/1     Running   0          5s
$ kubectl -n traefik logs -f traefik-operator-pod_name
- Enabling access logs generates large amounts of information in the logs. After debugging is complete, disable access logging by running the following command:
$ helm upgrade traefik-operator traefik/traefik --version 9.11.0 --namespace traefik --reuse-values --set logs.access.enabled=false
Cleaning Up Traefik
Note:
Clean up is not usually required and should be performed only as a last resort. Before cleaning up, make a note of the project namespaces being monitored. Once Traefik is re-installed, run $OSM_CNTK/scripts/register-namespace.sh for each of the previously monitored project namespaces.
Warning: Uninstalling Traefik in this manner will interrupt access to all OSM instances in the monitored project namespaces.
To clean up Traefik, uninstall the Traefik Helm chart:
helm uninstall traefik-operator -n traefik
Cleaning up of Traefik does not impact actively running OSM instances. However, they cannot be accessed during that time. Once the Traefik chart is re-installed with all the monitored namespaces and registered as Traefik back-end systems successfully, OSM instances can be accessed again.
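When you are ready to restore access, re-install the Traefik chart and then re-register each previously monitored project namespace with register-namespace.sh as noted above. The following is a sketch only; the Helm repository URL and installation values are assumptions and should be adapted to how Traefik was originally installed in your environment:
# Re-install the Traefik chart (version shown matches the one used earlier in this chapter)
helm repo add traefik https://helm.traefik.io/traefik
helm repo update
helm install traefik-operator traefik/traefik --version 9.11.0 --namespace traefik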
Setting Up Logs
As described earlier in this guide, OSM and WebLogic logs can be stored in the individual pods or in a location provided via a Kubernetes Persistent Volume. The PV approach is strongly recommended, both to allow for proper preservation of logs (as pods are ephemeral) and to avoid straining the in-pod storage in Kubernetes.
Note:
Replace ms1 with the appropriate managed server or with "admin".
When a PV is configured, logs are available at the following path starting from the root of the PV storage: project-instance/logs.
The following logs are available in the location (within the pod or in PV) based on the specification:
- admin.log: Main log file of the admin server
- admin.out: stdout from the admin server
- admin_nodemanager.log: Main log from the nodemanager on the admin server
- admin_nodemanager.out: stdout from the nodemanager on the admin server
- admin_access.log: Log of http/s access to the admin server
- ms1.log: Main log file of the ms1 managed server
- ms1.out: stdout from the ms1 managed server
- ms1_nodemanager.log: Main log from the nodemanager on the ms1 managed server
- ms1_nodemanager.out: stdout from the nodemanager on the ms1 managed server
- ms1_access.log: Log of http/s access to the ms1 managed server
All logs in the above list for "ms1" are repeated for each running managed server, with the logs being named for their originating managed server in each case.
In addition to these logs:
- Each JMS Server configured will have its log file with the name server_msn-jms_messages.log (for example: osm_jms_server_ms2-jms_messages.log).
- Each SAF agent configured will have its log file with the name server_msn-jms_messages.log (for example: osm_saf_agent_ms1-jms_messages.log).
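When logs are directed to a PV, they can be examined directly from any host that mounts the underlying storage. The following is a sketch only; the mount point is an assumption and depends on how your PV storage is provisioned:
# Follow a managed server log from a host that mounts the PV storage
tail -f /nfs/osm-pv/project-instance/logs/ms1.log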
OSM Cloud Native and Oracle Enterprise Manager
OSM cloud native instances contain a deployment of the Oracle Enterprise Manager application, reachable at the admin server URL with the path "/em". However, the use of Enterprise Manager in this Kubernetes context is not supported. Do not use the Enterprise Manager to monitor OSM. Use standard Kubernetes pod-based monitoring and OSM cloud native logs and metrics to monitor OSM.
Recovering an OSM Cloud Native Database Schema
When the OSM DB Installer fails during an installation, it exits with an error message. You must then find and resolve the issue that caused the failure. You can re-run the DB Installer after the issue (for example, space issue or permissions issue) is rectified. You do not have to rollback the DB.
Note:
Remember to uninstall the failed DB Installer Helm chart before rerunning it. Contact Oracle Support for further assistance.
It is recommended that you always run the DB Installer with the logs directed to a Persistent Volume so that you can examine the log for errors. The log file is located at: filestore/project-instance/db-installer/{yyyy-mm-dd}-osm-db-installer.log.
In addition, to identify the operation that failed, you can look in the filestore/project-instance/db-installer/InstallPlan-OMS-CORE.csv CSV file. This file shows the progress of the DB Installer.
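A sketch of cleaning up a failed DB Installer release before re-running it follows; the release name shown is hypothetical, so pick the actual failed release from the helm ls output:
# List Helm releases in the project namespace and remove the failed DB Installer release
helm ls -n project
helm uninstall -n project project-instance-dbinstaller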
If the failure occurred during a fresh schema installation, you can do one of the following:
- Delete the new schema or use a new schema user name for the subsequent installation.
- Restart the installation of the database schema from the beginning.
If the failure occurred during a schema upgrade, do the following:
- Find the issue that caused the upgrade failure. See "Finding the Issue that Caused the OSM Cloud Native Database Schema Upgrade Failure" for details.
- Fix the issue. Use the information in the log or error messages to fix the issue before you restart the upgrade process. For information about troubleshooting log or error messages, see OSM Cloud Native System Administrator's Guide.
- Restart the schema upgrade procedure from the point of failure. See "Restarting the OSM Database Schema Upgrade from the Point of Failure" for details.
Finding the Issue that Caused the OSM Cloud Native Database Schema Upgrade Failure
There are several files where you can look to find information about the issue. By default, these files are generated in the managed server pod, but can be re-directed to a Persistent Volume Claim (PVC) supported by the underlying technology that you choose. See "Setting Up Persistent Storage" for details.
To access these files after the DB installer pod is deleted, re-direct all logs to the PVC.
See the following files for details about the issue:
- The database installation plan action spreadsheet file: This file contains a summary of all the installation actions that are part of this OSM database schema installation or upgrade. The actions are listed in the order that they are performed. The spreadsheet includes actions that have not yet been completed. To find the action that caused the failure, check the following files and review the Status column:
- filestore/project-instance/db-installer/InstallPlan-OMS-CORE.csv
- filestore/project-instance/db-installer/InstallPlan-OMS_CLOUD-CORE.csv
- The database installation log file: This file provides a more detailed description of all the installation actions that have been run for this installation. The issue that caused the failure is located in the filestore/project-instance/db-installer/{yyyy-mm-dd}-osm-db-installer.log file. The failed action, which is the last action that was performed, is typically listed at the end of log file.
- semele$plan_actions: This contains the same information as the database plan action spreadsheet. Compare this table to the spreadsheet in cases of a database connection failure.
- semele$plan: This contains a summary of the installation that has been performed on this OSM database schema.
Restarting the OSM Database Schema Upgrade from the Point of Failure
In most cases, restarting the OSM database schema upgrade consists of pointing the installer to the schema that was partially upgraded, and then re-running the installer.
Note:
This task requires a user with the DBA role.
- Most migration actions are part of a single transaction, which is rolled back in the event of a failure. However, some migration actions involve multiple transactions. In this case, it is possible that some changes were committed.
- Most migration actions are repeatable, which means that they can safely be re-run even if they were committed. However, if a failed action is not repeatable and it committed some changes, either reverse all the changes that were committed and set the status to FAILED, or complete the remaining changes and set the status to COMPLETE.
To restart the upgrade after a failure:
- Determine which action has failed and the reason for the failure.
- If the status of the failed action is STARTED, check the database to see whether the action is completed or still running. If it is still running, either end the session or wait for the action to finish.
Note:
The transaction might not finish immediately after the connection is lost, depending on how fast the database detects that the connection is lost and how long it takes to roll back.
- Fix the issue that caused the failure.
Note:
If the failure is caused by a software issue, contact Oracle Support. With the help of Oracle Support, determine whether the failed action modified the schema and whether you must undo any of those changes. If you decide to undo any changes, leave the action status set to FAILED or set it to NOT STARTED. When you retry the upgrade, the installer starts from this action. If you manually complete the action, set the status to COMPLETE, so that the installer starts with the next action. Do not leave the status set to STARTED because the next attempt to upgrade will not be successful.
- Restart the upgrade by running the installer.
The installer restarts the upgrade from the point of failure.
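A sketch of re-running the DB Installer against the partially upgraded schema follows; the flag names are assumptions based on the toolkit's usual -p/-i/-s convention, and the command code should be taken from the script's usage help:
# Re-run the OSM DB Installer; it resumes the upgrade from the failed action
$OSM_CNTK/scripts/install-osmdb.sh -p project -i instance -s specPath -c <command-code>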
Resolving Improper JMS Assignment
While running OSM cloud native with more than one managed server, the incoming orders and the resulting workload may sometimes not be distributed evenly across all managed servers.
While there are multiple causes for improper distribution (including the use of an incorrect JMS connection factory to inject order creation messages), one possible cause is the improper assignment of JMS servers to managed servers. For even distribution of workload, each managed server that is running must host its corresponding JMS server.
The following figure shows an example of improper JMS assignment.
Figure 14-1 Example of Improper JMS Assignment
In the figure, osm_jms_server@ms7 is incorrectly running on ms6 even though its native host ms7 is running. It can be normal for more than one JMS server to be running on a managed server as long as the additional JMS servers do not have a native managed server that is online.
Workaround
As a workaround, terminate the Kubernetes pod for the managed server that has been left underutilized. In the above example, the pod for ms7 should be terminated. The WebLogic Operator recreates the managed server pod, and that should trigger the migration of osm_jms_server@ms7 back to ms7.
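A sketch of this workaround follows; the pod name is hypothetical and depends on your project, instance, and managed server naming:
# Terminate the underutilized managed server pod; the WebLogic Operator re-creates it
kubectl -n project delete pod project-instance-ms7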
Resolution
To resolve this issue, tune the time setting for InitialBootDelaySeconds and PartialClusterStabilityDelaySeconds. See the WebLogic Server documentation for more details.
To tune the time setting:
- Add the following Clustering fragment to the instance specification:
Clustering:
  InitialBootDelaySeconds: 10
  PartialClusterStabilityDelaySeconds: 30
- Increase the value for the following parameters from the base WDT model:
  - InitialBootDelaySeconds. The default value in base WDT is 2.
  - PartialClusterStabilityDelaySeconds. The default value in base WDT is 5.
Note:
The default values for these parameters in WebLogic Server are 60 and 240 respectively. The actual values required depend on the environmental factors and must be arrived at by tuning. Higher values can result in slower placement of JMS servers. While this is not a factor during OSM startup, it will mean more time could be taken when a managed server shuts down before its JMS server migrates and comes up on a surviving managed server. Orders with messages pending delivery in that JMS server will be impacted by this, but the rest of the system is unaffected.
Common Problems and Solutions
This section describes some common problems that you may experience because you have run a script or a command erroneously or you have not properly followed the recommended procedures and guidelines regarding setting up your cloud environment, components, tools, and services in your environment. This section provides possible solutions for such problems.
Domain Introspection Pod Does Not Start
There may be cases where the introspector pod does not start. This could mean that the operator is not monitoring your namespace, or that your namespace is not tagged with the label that the operator is monitoring.
For more information about operator monitoring, see: https://oracle.github.io/weblogic-kubernetes-operator/managing-operators/common-mistakes/#forgetting-to-configure-the-operator-to-monitor-a-namespace
Domain Introspection Pod Status
kubectl get pods -n namespace
## healthy status looks like this
NAME                                           READY   STATUS    RESTARTS   AGE
project-instance-introspect-domain-job-hzh9t   1/1     Running   0          3s
The READY field is showing 1/1, which indicates that the pod status is healthy.
NAME                                           READY   STATUS         RESTARTS   AGE
project-instance-introspect-domain-job-r2d6j   0/1     ErrImagePull   0          5s
### OR
NAME                                           READY   STATUS             RESTARTS   AGE
project-instance-introspect-domain-job-r2d6j   0/1     ImagePullBackOff   0          45s
This shows that the introspection pod status is not healthy. If the image can be pulled, it is possible that it simply took a long time to pull.
To resolve this issue, verify the image name and tag, and confirm that the image is accessible to the pod from the repository.
You can also try the following:
- Increase the value of podStartupDeadlineSeconds in the instance specification. Start with a very high timeout value and then monitor the average time it takes, because it depends on the speed with which the images are downloaded and how busy your cluster is. Once you have a good idea of the average time, you can reduce the timeout value accordingly to one that includes the average time and some buffer.
- Pull the container image manually on all Kubernetes nodes where the OSM cloud native pods can be started up, as shown in the sketch after this list.
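The following is a sketch only; the image name and tag are hypothetical, and you should use docker instead of podman if that is the container runtime on your worker nodes:
# Run on each Kubernetes worker node that can schedule OSM cloud native pods
podman pull repository.example.com/osm/osm-cn-base:7.4.1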
Domain Introspection Errors Out
Sometimes, the domain introspector pod runs but ends with an error. To see the cause, check the introspector pod logs:
kubectl logs introspector_pod -n project
The following are the possible causes for this issue:
- RCU Schema is pre-existing: If the log shows the following, then the RCU schema could be pre-existing:
WLSDPLY-12409: createDomain failed to create the domain: Failed to write domain to /u01/oracle/user_projects/domains/domain: wlst.writeDomain(/u01/oracle/user_projects/domains/domain) failed : Error writing domain: 64254: Error occurred in "OPSS Processing" phase execution 64254: Encountered error: oracle.security.opss.tools.lifecycle.LifecycleException: Error during configuring DB security store. Exception oracle.security.opss.tools.lifecycle.LifecycleException: The schema FMW1_OPSS is already in use for security store(s). Please create a new schema.. 64254: Check log for more detail.
This could happen because the database was reused or cloned from an OSM cloud native instance. If this is so, and you wish to reuse the RCU schema as well, provide the required secrets. For details, see "Reusing the Database State".
If you do not have the secrets required to reuse the RCU instance, you must use the OSM cloud native DB Installer to create a new RCU schema in the DB. Use this new schema in your rcudb secret. If you have login details for the old RCU users in your rcudb secret, you can use the OSM cloud native DB Installer to delete and re-create the RCU schema in place. Either of these options gives you a clean slate for your next attempt.
Finally, it is possible that this was a clean RCU schema but the introspector ran into an issue after RCU data population but before it could generate the wallet secret (opssWF). If this is the case, debug the introspector failure and then use the OSM cloud native DB Installer to delete and re-create the RCU schema in place before the next attempt.
- Fusion Middleware cannot access the RCU: If the introspector logs show the following error, then it means that Fusion Middleware could not access the schema inside the RCU DB:
WLSDPLY-12409: createDomain failed to create the domain: Failed to get FMW infrastructure database defaults from the service table: Failed to get the database defaults: Got exception when auto configuring the schema component(s) with data obtained from shadow table: Failed to build JDBC Connection object:
Typically, this happens when wrong values are entered while creating secrets for this deployment. Less often, the cause is a corrupted or invalid RCU DB. Re-create your secrets, verifying the credentials, and drop and re-create the RCU DB.
Recovery After Introspection Error
If the introspection fails during instance creation, once you have gathered the required information and have a solution, delete the instance and then re-run the instance creation script with the fixed specification, model extension, or other environmental failure cause.
If the introspection fails while upgrading a running instance, then do the following:
- Make the change to fix the introspection failure. Trigger an instance upgrade. If this results in successful introspection, the recovery process stops here.
- If the instance upgrade in step 1 fails to trigger a fresh introspection, then do the following:
- Roll back to the last good Helm release. First, run the helm history -n project project-instance command and identify the version in the output that matches the running instance (that is, before the upgrade that led to the introspection failure). The timestamp on each version helps you identify it. Once you know the "good" version, roll back to it by running helm rollback -n project project-instance version. Monitor the pods in the instance to ensure only the Admin Server and the appropriate number of Managed Server pods are running.
- Upgrade the instance with the fixed specification.
Instance Deletion Errors with Timeout
You use the delete-instance.sh script to delete an instance that is no longer required. The script attempts to do this in a graceful manner and is configured to wait up to 10 minutes for any running pods to shut down. If the pods remain after this time, the script times out and exits with an error after showing the details of the leftover pods.
The leftover pods can be OSM pods (Admin Server, Managed Server) or the DBInstaller pod.
For the leftover OSM pods, see the WKO logs to identify why cleanup has not run. Delete the pods manually if necessary, using the kubectl delete commands.
For the leftover DBInstaller pod, this happens only if install-osmdb.sh is interrupted or if it failed in its last run. This should have been identified and handled at that time itself. However, to complete the cleanup, run helm ls -n project to find the failed DBInstaller release, and then invoke helm uninstall -n project release. Monitor the pods in the project namespace until the DBInstaller pod disappears.
OSM Cloud Native Toolkit Instance Create and Update Scripts Timeout; Pods Show Readiness "0/1"
If your create-instance.sh or upgrade-instance.sh scripts timeout, and you see that the desired managed server pods are present, but one or more of them show "0/1" in the "READY" column, this could be because OSM hit a fatal problem while starting up. The following could be the causes for this issue:
- A mismatch between the OSM schema found and the expected version: If this is the case, the OSM managed server log shows the following issue:
Error: The OSM application is not compatible with the schema code detected in the OSM database. Expected version[7.4.0.0.68], found version[7.4.0.0.70] This likely means that a recent installation or upgrade was not successful. Please check your install/upgrade error log and take steps to ensure the schema is at the correct version.
To resolve this issue, check the container image used for the DB installer and the OSM domain instances. They should match. A sketch of how to check the images in use appears after this list.
- OSM internal users are missing: This can happen if there are issues with the configuration of the external authentication provider and the standard OSM users (for example, oms-internal) and the group association is not loaded. The managed server log shows something like the following:
<Error> <Deployer> <BEA-149205> <Failed to initialize the application "oms" due to error weblogic.management.DeploymentException: The ApplicationLifecycleListener "com.mslv.oms.j2ee.LifecycleListener" of application "oms" has a run-as user configured with the principal name "oms-internal" but a principal of that name could not be found. Ensure that a user with that name exists.
To resolve this issue, review your external authentication system to validate users and groups. Review your configuration to ensure that the instance is configured for the correct external authenticator.
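The following is a sketch only; the pod names are hypothetical and depend on your project and instance names:
# Compare the image used by an OSM managed server pod with the one used by the DB installer pod
kubectl -n project get pod project-instance-ms1 -o jsonpath='{.spec.containers[0].image}{"\n"}'
kubectl -n project get pod project-instance-dbinstaller -o jsonpath='{.spec.containers[0].image}{"\n"}'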
OSM Cloud Native Pods Do Not Distribute Evenly Across Worker Nodes
On some occasions, OSM cloud native pods do not distribute evenly across the worker nodes. One possible cause is the time taken to pull the OSM image onto worker nodes where it is not already present. To pre-pull the image onto all worker nodes, run the sample script provided with the toolkit:
$ $OSM_CNTK/samples/image-primer.sh -p project image-name:image-tag
This should be done only once for a given image name+tag combination, regardless of which project uses that image or how many instances are created with it.
This script is offered as a sample and may need to be customized for your environment. If you are using an image from a repository that requires pull credentials, edit the image-primer.sh script to uncomment these lines and add your pull secret:
#imagePullSecrets:
#- name: secret-name
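After editing, that fragment of the sample would look like the following; the secret name is hypothetical:
imagePullSecrets:
- name: my-registry-pull-secret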
If you are planning to target OSM cloud native to specific worker nodes, edit the sample to ensure only those nodes are selected (typically by using a specific label value) as per standard Kubernetes configuration. See the Kubernetes documentation for DaemonSet objects.
User Workgroup Association Lost
During cartridge deployment, if users are not present in LDAP or if LDAP is not accessible, the user workgroup associations could get deleted.
To resolve this issue, restore the connectivity to LDAP and the users. You may need to redo the workgroup associations.
Changing the WebLogic Kubernetes Operator Log Level
Some situations, such as certain kinds of introspection failures or unexpected behavior from the operator, may require analysis of the WKO logs. The default log level for the operator is INFO.
For information about changing the log level for debugging, see the documentation at: https://oracle.github.io/weblogic-kubernetes-operator/managing-operators/troubleshooting/#operator-and-conversion-webhook-logging-level.
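One way to raise the operator log level is through its Helm values, as described in the linked documentation. The following is a sketch only; the release name, chart reference, and namespace are assumptions that must match how the operator was installed in your environment:
# Raise the WebLogic Kubernetes Operator logging level to FINE
helm upgrade weblogic-operator weblogic-operator/weblogic-operator \
  --namespace weblogic-operator-ns --reuse-values --set javaLoggingLevel=FINE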
Deleting and Re-creating the WLS Operator
You may need to delete a WLS operator and re-create it. You do this when you want to use a new version of the operator where upgrade is not possible, or when the installation is corrupted.
When the controlling operator is removed, the existing OSM cloud native instances continue to function. However, they cannot process any updates (when you run upgrade-instance.sh) or respond to the Kubernetes events such as the termination of a pod.
To avoid common mistakes during the installation of WKO, refer to the WKO troubleshooting information at: https://oracle.github.io/weblogic-kubernetes-operator/managing-operators/common-mistakes/#forgetting-to-configure-the-operator-to-monitor-a-namespace.
To uninstall WKO, follow the steps in WKO documentation at: https://oracle.github.io/weblogic-kubernetes-operator/managing-operators/installation/#uninstall-the-operator.
Re-register your namespaces using the register-namespace.sh and unregister-namespace.sh scripts in the cloud native toolkit.
You can install the operator by following the instructions in WKO documentation at: https://oracle.github.io/weblogic-kubernetes-operator/managing-operators/installation/#install-the-operator. Then, register all the projects again, one by one. See "Registering the Namespace" for details.
Lost or Missing opssWF and opssWP Contents
For an OSM instance to successfully connect to a previously initialized set of DB schemas, it needs to have the opssWF (OPSS Wallet File) and opssWP (OPSS Wallet-file Password) secrets in place. The $OSM_CNTK/scripts/manage-instance-credentials.sh script can be used to set these up if they are not already present.
If these secrets or their contents are lost, you can delete and recreate the RCU schemas (using $OSM_CNTK/scripts/install-osmdb.sh with command code 5). This deletes data (such as some user preferences) stored in the RCU schemas. On the other hand, if there is a WebLogic domain currently running against that DB (or its clone), the "exportEncryptionKey" offline WLST command can be run to dump out the "ewallet.p12" file. This command also takes a new encryption password. For details, see the WLST Command Reference for Infrastructure Security in the Oracle Fusion Middleware documentation. The contents of the resulting ewallet.p12 file can be used to recreate the opssWF secret, and the encryption password can be used to recreate the opssWP secret. This method is also suitable when a DB (or the clone of a DB) from a traditional OSM installation needs to be brought into OSM cloud native.
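A sketch of exporting the wallet from a running domain's configuration follows; the paths and password are placeholders, and the command should be run in an offline WLST session as described in the WLST Command Reference for Infrastructure Security:
# Offline WLST session (started with wlst.sh from the Fusion Middleware installation);
# exportEncryptionKey writes ewallet.p12 to keyFilePath, protected by keyFilePassword
exportEncryptionKey(jpsConfigFile='/path/to/domain/config/fmwconfig/jps-config.xml', keyFilePath='/tmp/opss_export', keyFilePassword='NewWalletPassword1')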
Clock Skew or Delay
When submitting a JMS message over the Web Service queue, you might see the following in the SOAP response:
Security token failed to validate.
weblogic.xml.crypto.wss.SecurityTokenValidateResult@5f1aec15[status: false][msg UNT
Error:Message older than allowed MessageAge]
Oracle recommends synchronizing the time across all machines that are involved in communication. See "Synchronizing Time Across Servers" for more details. Implement Network Time Protocol (NTP) across the hosts involved, including the Kubernetes cluster hosts.
It is also possible to temporarily fix this through configuration, by adding the following properties to java_options in the project specification for each managed server:
managedServers:
  project:
    #JAVA_OPTIONS for all managed servers at project level
    java_options: -Dweblogic.wsee.security.clock.skew=72000000 -Dweblogic.wsee.security.delay.max=72000000
Known Issues
This section describes known issues that you may come across, their causes, and the resolutions.
The OSM Email plugin is currently not supported. Users who require this capability can create their own plugin for this purpose.
SituationalConfig NullPointerException
In the managed server logs, you might notice a stacktrace that indicates a NullPointerException in situational config.
This exception can be safely ignored.
Connectivity Issues During Cluster Re-size
When the cluster size changes, whether from the termination and re-creation of a pod, through an explicit upgrade to the cluster size, or due to a rolling restart, transient errors are logged as the system adjusts.
These transient errors can usually be ignored and stop after the cluster has stabilized with the correct number of Managed Servers in the Ready state.
If the error messages persist after the Ready state is achieved, look for secondary symptoms of a real problem, such as orders that are inexplicably stuck or otherwise processing abnormally.
The following are examples of such transient errors:
- An MDB is unable to connect to a JMS destination. The specific MDB and JMS destination can vary, such as:
  - <The Message-Driven EJB OSMInternalEventsMDB is unable to connect to the JMS destination mslv/oms/oms1/internal/jms/events.
  - <The Message-Driven EJB DeployCartridgeMDB is unable to connect to the JMS destination mslv/provisioning/internal/ejb/deployCartridgeQueue.
- Failed to Initialize JNDI context. Connection refused; No available router to destination. This type of error is seen in an instance where SAF is configured.
- Failed to process events for event type[AutomationEvents].
- Consumer destination was closed.
Upgrade Instance failed with spec.persistentvolumesource: Forbidden: is immutable after creation.
Error: UPGRADE FAILED: cannot patch "<project>-<instance>-nfs-pv" with kind
PersistentVolume: PersistentVolume "<project>-<instance>-nfs-pv" is invalid: spec.persistentvolumesource:
Forbidden: is immutable after creation
Error in upgrading osm helm chart
Once created, the Persistent Volume Claim cannot be changed. To work around this issue:
- Disable NFS by setting the nfs.enabled parameter to false and run the upgrade-instance script. This removes the PV from the instance.
- Enable it again by changing nfs.enabled to true with the new values of NFS, and then run upgrade-instance again.
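A sketch of the relevant fragment of the instance specification follows; the exact structure of the nfs block depends on your toolkit version and storage details, so treat this as illustrative only:
nfs:
  enabled: false   # set back to true, with the new NFS values, before the second upgrade-instance run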
JMS Servers for Managed Servers are Reassigned to Remaining Managed Servers
When scaling down, the JMS servers for managed servers that no longer exist are reassigned to the remaining managed servers.
Jun 15, 2020 11:01:32,821 AM UTC> <Info> <oracle.communications.ordermanagement.automation.plugin.JMSDestinationListener> <BEA-000000> <
All local JMS destinations: ms1
JNDI JMS Server WLS Server Migratable Target Local Member Type Partition
---------------------------------------------------------------------------- ------------------ ---------- ----------------- -------- ----------------------------- ---------
osm_jms_server@ms1@mslv/oms/oms1/internal/jms/events osm_jms_server@ms1 ms1 true MEMBER_TYPE_CLUSTERED_DYNAMIC DOMAIN
osm_jms_server@ms1@oracle.communications.ordermanagement.SimpleResponseQueue osm_jms_server@ms1 ms1 true MEMBER_TYPE_CLUSTERED_DYNAMIC DOMAIN
>
Notice that osm_jms_server@ms1 is targeting ms1.
<Jun 15, 2020 11:02:20,461 AM UTC> <Info> <oracle.communications.ordermanagement.automation.plugin.JMSDestinationListener> <BEA-000000> <
All local JMS destinations: ms1
JNDI JMS Server WLS Server Migratable Target Local Member Type Partition
---------------------------------------------------------------------------- ------------------ ---------- ----------------- -------- ----------------------------- ---------
osm_jms_server@ms1@mslv/oms/oms1/internal/jms/events osm_jms_server@ms1 ms1 true MEMBER_TYPE_CLUSTERED_DYNAMIC DOMAIN
osm_jms_server@ms1@oracle.communications.ordermanagement.SimpleResponseQueue osm_jms_server@ms1 ms1 true MEMBER_TYPE_CLUSTERED_DYNAMIC DOMAIN
osm_jms_server@ms2@mslv/oms/oms1/internal/jms/events osm_jms_server@ms2 ms2 true MEMBER_TYPE_CLUSTERED_DYNAMIC DOMAIN
osm_jms_server@ms2@oracle.communications.ordermanagement.SimpleResponseQueue osm_jms_server@ms2 ms2 true MEMBER_TYPE_CLUSTERED_DYNAMIC DOMAIN
>
Notice that osm_jms_server@ms1 is targeting ms1 and osm_jms_server@ms2 is targeting ms2.
<Jun 15, 2020 11:02:20,461 AM UTC> <Info> <oracle.communications.ordermanagement.automation.plugin.JMSDestinationListener> <BEA-000000> <
All local JMS destinations: ms1
JNDI JMS Server WLS Server Migratable Target Local Member Type Partition
---------------------------------------------------------------------------- ------------------ ---------- ----------------- -------- ----------------------------- ---------
osm_jms_server@ms1@mslv/oms/oms1/internal/jms/events osm_jms_server@ms1 ms1 true MEMBER_TYPE_CLUSTERED_DYNAMIC DOMAIN
osm_jms_server@ms1@oracle.communications.ordermanagement.SimpleResponseQueue osm_jms_server@ms1 ms1 true MEMBER_TYPE_CLUSTERED_DYNAMIC DOMAIN
osm_jms_server@ms2@mslv/oms/oms1/internal/jms/events osm_jms_server@ms2 ms1 true MEMBER_TYPE_CLUSTERED_DYNAMIC DOMAIN
osm_jms_server@ms2@oracle.communications.ordermanagement.SimpleResponseQueue osm_jms_server@ms2 ms1 true MEMBER_TYPE_CLUSTERED_DYNAMIC DOMAIN
>
Notice that the JMS Server osm_jms_server@ms2 is not deleted and is targeting ms1.
This is expected behavior. It is a WebLogic feature and should not be mistaken for an inconsistency in functionality.
Image Build Failure Due to OPatch Error
You may face the following error while building images using the OSM cloud native builder toolkit:
OPatch failed with error code 73
[SEVERE ] Build command failed with error: OPatch failed to restore OH '/u01/oracle'. Consult OPatch document to restore the home manually before proceeding.
UtilSession failed: ApplySession failed in system modification phase... 'ApplySession::apply failed: oracle.glcm.opatch.common.api.install.HomeOperationsException: A failure occurred while processing patch: 31676526'
Error: building at STEP "RUN /u01/oracle/OPatch/opatch apply -silent -oh /u01/oracle -nonrollbackable /tmp/imagetool/patches/p31676526_122140_Generic.zip": while running runtime: exit status 73
The root cause for this error is that Podman's default limit on the number of open files is too low for an OPatch invocation.
To resolve this error, configure the build system's Linux host with a higher hard limit for open files. The current hard limit on the number of open files can be checked by running ulimit -n -H on the host. For more information, refer to Prerequisites for Creating OSM Images.
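A sketch of raising the hard limit on a typical Linux build host follows; it assumes pam_limits is in effect for login sessions, and the chosen limit value is illustrative only:
# Check the current hard limit for open files
ulimit -n -H
# Raise the hard limit for all users (takes effect on the next login)
echo "*  hard  nofile  65536" | sudo tee -a /etc/security/limits.conf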