8 Troubleshooting
This chapter provides troubleshooting tips and techniques for installing, discovering, and configuring the Exadata plug-in.
Discovery Troubleshooting
Very often, the error message itself will include the cause for the error. Look for error messages in the OMS and agent logs (case insensitive search for dbmdiscovery) or in the Discovery window itself.
It is recommended that you review the logs for troubleshooting issues. For the list of logs that you can review and their locations, see Review Logs.
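The case-insensitive log search mentioned above can be sketched as a small shell helper. This is only an illustration: the function name search_discovery_logs is hypothetical, and the files passed to it are whichever OMS or agent logs you want to scan.

```shell
# Sketch: case-insensitive search for discovery messages in one or more log files.
# search_discovery_logs is a hypothetical helper name, not part of Enterprise Manager.
search_discovery_logs() {
    # -i ignores case, -n prints line numbers, -H always prints the file name.
    grep -i -n -H -- 'dbmdiscovery' "$@"
}
```

For example, search_discovery_logs emoms.log gcagent.log prints every matching line with its file name and line number, assuming both files are in the current directory.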
Hardware Availability
All the hardware components must be "known" and reachable; otherwise, communication failures will occur. Use the ping command for each hardware component of the Exadata rack to make sure all names are resolved.
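The ping check described above could be looped over all rack components along these lines. This is a minimal sketch: check_components and the PROBE variable are hypothetical names, and the host names you pass are your own.

```shell
# Sketch: probe every hardware component and report which ones fail.
# check_components and PROBE are hypothetical names used for illustration.
# PROBE defaults to a single ping; override it to use another reachability check.
PROBE=${PROBE:-ping -c 1}

check_components() {
    failed=0
    for host in "$@"; do
        # $PROBE is intentionally unquoted so "ping -c 1" splits into words.
        if $PROBE "$host" >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return $failed
}
```

Call it with the names of the compute nodes, storage servers, switches, ILOMs, and PDUs in the rack; a nonzero return status means at least one name did not resolve or respond.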
The MAP targets in Enterprise Manager Cloud Control 13c may fail while collecting the correlation identifier. This failure can happen if the credentials are incorrect or if a target (for example, the ILOM) is too slow in responding.
ILOM can be slow when the number of open sessions on ILOM has exceeded the limit. You can resolve this issue by temporarily closing the sessions on the ILOM.
The rack placement of targets can fail:
- If examan did not return a valid rack position for the target.
- If there is an existing target in the same location.
Discovery Failure Diagnosis
Should discovery of your Oracle Exadata Database Machine fail, collect the following information for diagnosis:
- Any examan-*.xml, examan-*.html, targets-*.xml, and examan*.log files from the AGENT_ROOT/agent_inst/sysman/emd/state/exadata directory.
- Agent logs: emagent_perl.trc and gcagent.log
- OMS logs: emoms.trc and emoms.log
- Any snapshot (screen capture) of errors shown on the target summary page.
-
If the Prerequisite Check page shows critical errors, then discovery can be retried by clicking on the Retry menu and selecting the Retry, Static Only option.
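Gathering the agent-side files listed above can be scripted. The following is a sketch only: collect_discovery_diag is a hypothetical helper, the first argument stands for your AGENT_ROOT, and the agent log directory agent_inst/sysman/log is an assumption based on the log locations given in this chapter. The OMS logs are on the OMS host and must be collected separately.

```shell
# Sketch: copy Exadata discovery diagnostics from an agent home into one directory.
# collect_discovery_diag is a hypothetical helper name, not an Oracle-supplied tool.
collect_discovery_diag() {
    agent_root=$1   # value of AGENT_ROOT for your agent installation
    dest=$2         # directory that will receive the copies
    state_dir="$agent_root/agent_inst/sysman/emd/state/exadata"
    log_dir="$agent_root/agent_inst/sysman/log"   # assumed agent log location

    mkdir -p "$dest" || return 1
    # File patterns come from the list above; patterns with no match are skipped.
    for f in "$state_dir"/examan-*.xml "$state_dir"/examan-*.html \
             "$state_dir"/targets-*.xml "$state_dir"/examan*.log \
             "$log_dir"/emagent_perl.trc "$log_dir"/gcagent.log; do
        [ -f "$f" ] && cp -p "$f" "$dest"/
    done
    return 0
}
```

The resulting directory can then be archived and attached to a service request together with the OMS logs and screen captures.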
Exadata Storage Server is not Discovered
If the Exadata Storage Server itself is not discovered, possible causes could be:
- The /etc/oracle/cell/network-config/cellip.ora file on the compute node is missing or unreadable by the agent user, or the Exadata Storage Server is not listed in that file.
- Management Server (MS) or cellsrv is down.
- The Exadata Storage Server management IP was changed improperly. Restarting both cellsrv and MS may help.
- To check that the Exadata Storage Server is discovered with a valid management IP, run the following command from the grid infrastructure or database home on the compute node used for discovery:
  $ORACLE_HOME/bin/kfod op=cellconfig
- The discovery of the Exadata Storage Server fails when the cookie-jar is present at $HOME/.exacli/cookiejar. Note that cookie-jar is not supported for Enterprise Manager monitoring of Exadata Storage Server. To resolve the issue, move the cookie-jar away from $HOME/.exacli/cookiejar and attempt the discovery of the Exadata Storage Server again.
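Two of the causes above can be checked locally on the compute node, as the agent user. The sketch below is illustrative: both function names are hypothetical, and it assumes that cellip.ora lists each cell's addresses as plain IP strings somewhere on a line.

```shell
# Sketch: local checks for two of the causes listed above.
# Both function names are hypothetical helpers, not Oracle-supplied commands.

# Prints "missing", "unreadable", "not-listed", or "ok" for a cellip.ora file
# and the management IP expected to appear in it.
check_cellip() {
    file=$1
    ip=$2
    [ -e "$file" ] || { echo "missing"; return 1; }
    [ -r "$file" ] || { echo "unreadable"; return 1; }
    # Split the file on every character that cannot be part of an IP address,
    # then require an exact match so 10.0.0.1 does not match 10.0.0.10.
    if tr -c '0-9A-Fa-f.:\n' '\n' < "$file" | grep -F -x -q -- "$ip"; then
        echo "ok"
    else
        echo "not-listed"
        return 1
    fi
}

# Moves an ExaCLI cookie jar aside, if one exists, so discovery can be retried.
move_cookiejar() {
    jar=${1:-$HOME/.exacli/cookiejar}
    [ -e "$jar" ] || return 0
    mv -- "$jar" "$jar.bak"
}
```

Run check_cellip /etc/oracle/cell/network-config/cellip.ora followed by the storage server's management IP, and run move_cookiejar with no arguments to rename the default cookie jar before retrying discovery.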
Compute Node or InfiniBand Switch is not Discovered
If there are problems with discovery of the compute node or the InfiniBand switch, possible causes could be:
- The InfiniBand switch host name or ilom-admin password is incorrect.
- The connection from the compute node to the InfiniBand switch through SSH is blocked by a firewall.
- The InfiniBand switch is down or takes too long to respond to SSH.
To resolve problems with the compute node or InfiniBand switch discovery, try:
- If the InfiniBand switch node is not discovered, the InfiniBand switch model or switch firmware may not be supported by EM Exadata. Run the ibnetdiscover command. Output should look like:
  Switch 36 "S-002128469f47a0a0" # "Sun DCS 36 QDR switch switch1.example.com" enhanced port 0 lid 1 lmc 0
- If the compute node is not discovered, then run the ibnetdiscover command on the compute node. The output should look similar to that shown below. If the output shows missing or invalid values, refer the issue to your network administrator.
  Ca 2 "H-00212800013e8f4a" # " xdb1db02 S 192.168.229.85 HCA-1"
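To get a quick overview of what ibnetdiscover reports, its output can be summarized with a short filter. This is a sketch that assumes switch and compute node entries start with Switch and Ca at the beginning of a line, as in the sample output above; ib_summary is a hypothetical helper name.

```shell
# Sketch: count switches and channel adapters (Ca lines) in ibnetdiscover output.
# ib_summary is a hypothetical helper; it only reads text from standard input.
ib_summary() {
    awk '
        /^Switch/ { switches++ }
        /^Ca/     { cas++ }
        END       { printf "switches=%d cas=%d\n", switches, cas }
    '
}
```

A typical use would be ibnetdiscover | ib_summary on the compute node; counts lower than the number of switches and nodes in the rack suggest that a component is not visible on the fabric.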
Extra or Missing Components in the Newly Discovered Exadata Database Machine
If the list of components on the Components page of the discovery wizard contains extra components or missing components, then check the following troubleshooting steps:
- For extra components, examine them for Exadata Database Machine membership. Deselect any unnecessary pre-selected components on the Components page.
- Verify which schematic file was used for discovery. Ensure that Enterprise Manager can read the latest xml file (for example, databasemachine.xml) on the compute node.
- For missing components, check the schematic file content.
- To generate a new schematic file, see Oracle Support Doc ID 1684431.1.
ILOM, PDU, or Cisco Switch is not Discovered
If the ILOM, PDU, or Cisco switch is not discovered, the most likely cause is that the Exadata Database Machine Schematic file cannot be read or has incorrect data. See Troubleshooting the Exadata Database Machine Schematic File.
Test Connection Fails in PDU
If Test connection gives an error message Wrong credentials Please provide correct PDU user name and password..., ignore the message and proceed with the discovery. The PDU target will be in the down status initially. To correct the failure, you can try one of the following methods:
- Add the SNMP subscriptions to the PDU manually by following the steps in Enable SNMPv3 on PDUs. The target will be back up when SNMP starts collecting the availability metrics.
- Execute the following commands from the agent that is monitoring the target:
  emctl control agent runCollection <pdu-target-name>:oracle_si_pdu oracle_si_pdu_snmp_config
  emctl upload agent
Target Does not Appear in the Components Step of Guided Discovery
Even though no error may appear on the Prerequisite Check page during the Exadata Database Machine guided discovery process, the target does not appear on the Components page. Possible causes and solutions include:
- Check the All Targets page to make sure that the target has not been added as an Enterprise Manager target already:
  - Log in to Enterprise Manager.
  - Select Targets, then All Targets.
  - On the All Targets page, check to see if the target that was not shown on the Components selection page appears in the list.
- A target that is added manually may not be connected to the Exadata Database Machine system target through association. To correct this problem:
  - Delete these targets before initiating the Exadata Database Machine guided discovery.
  - Alternatively, use the emcli command to add these targets to the appropriate system target as members.
Target Status is Down or Metric Collection Error After Discovery
After the Exadata Database Machine guided discovery, an error that the target is down or that there is a problem with the metric collection may display. Possible causes and recommended solutions include:
- For the Exadata Storage Server or InfiniBand switch, SSH may not be configured properly. To troubleshoot and resolve this problem:
  - The agent's SSH public key in the <AGENT_USER_HOME>/.ssh/id_rsa.pub file is not in the authorized_keys file of $HOME/.ssh for ilom-admin.
    For the Exadata Storage Server monitored using CellCLI: The agent's SSH public key in the <AGENT_USER_HOME>/.ssh/id_rsa.pub file is not in the $HOME/.ssh/authorized_keys file for the cellmonitor user on the Exadata Storage Server.
    For the Exadata Storage Server monitored using ExaCLI/RESTAPI: Exadata Storage Server credentials are not set as monitoring credentials for the Exadata Storage Server target.
    For InfiniBand Switch: The agent's SSH public key in the <AGENT_USER_HOME>/.ssh/id_rsa.pub file is not in the $HOME/.ssh/authorized_keys file for the ilom-admin user on the InfiniBand switch.
  - Verify permissions. The permission settings for .ssh and authorized_keys should be:
    drwx------ 2 cellmonitor cellmonitor 4096 Oct 13 07:06 .ssh
    -rw-r--r-- 1 cellmonitor cellmonitor 441842 Nov 10 20:03 authorized_keys
  - Resolve a PerformOperationException error. See Troubleshooting the Exadata Database Machine Schematic File for more information.
- If the SSH setup is confirmed to be properly configured, but the target status is still down, then check to make sure there are valid monitoring and backup agents assigned to monitor the target. To confirm, click the Database Machine menu and select Monitoring Agents.
  - After confirming that valid agents are assigned to the Exadata Storage Server, check Target Down Reason in the Response metric to understand the root cause. To view the Response metric, go to the Exadata Storage Server target menu, click Monitoring, click All Metrics, and select Response.
- For the ILOM, PDU, or Cisco switch, possible causes include:
  - The Exadata Database Machine Schematic Diagram file has the wrong IP address.
  - Monitoring Credentials are not set or are incorrect.
- To verify or to set Monitoring Credentials:
  - Log in to Enterprise Manager.
  - Click Setup, then Security, and finally Monitoring Credentials.
  - On the Monitoring Credentials page, select your target type, and click Manage Monitoring Credentials. Verify that the monitoring credentials are set for the target. If not, then set the monitoring credentials.
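The permission settings for .ssh and authorized_keys given earlier in this section can be verified with a small helper, run as the monitored user (for example, cellmonitor) on the storage server. This is a sketch; mode_of and check_mode are hypothetical names.

```shell
# Sketch: compare the mode string of a path with an expected value.
# mode_of and check_mode are hypothetical helpers used for illustration.
mode_of() {
    # First 10 characters of the ls -ld output, for example drwx------
    ls -ld -- "$1" 2>/dev/null | cut -c1-10
}

check_mode() {
    path=$1
    expected=$2
    actual=$(mode_of "$path")
    if [ "$actual" = "$expected" ]; then
        echo "OK   $path $actual"
    else
        echo "BAD  $path $actual (expected $expected)"
        return 1
    fi
}
```

For example, check_mode "$HOME/.ssh" 'drwx------' and check_mode "$HOME/.ssh/authorized_keys" '-rw-r--r--' correspond to the two listing lines shown in this section.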
ILOM Credential Validation Fails During Discovery
ILOM Credential Validation Failure Errors
ILOM credential validation may fail while performing a 13c discovery. The following error may occur:
Authentication failed
The possible cause of the error is that the credentials provided are invalid.
To resolve the issue, use valid credentials.
Discovery Process Hangs
If the discovery process for the Exadata Database Machine hangs, then check the following troubleshooting steps:
- Examine your network to verify:
  - That the host name can be resolved.
  - That the Agent(s) can access the OMS.
  - That a simple job can be executed from the console.
- If the OMS reported any errors, then check the following log file:
  $MW_HOME/gc_inst/sysman/log/emoms.log
- For Repository issues, check the Repository database's alert.log file.
- For Agent issues, check the following log file on the monitoring agents:
  $AGENT/agent_inst/sysman/log/gcagent.log
SNMP Configuration Missing
Set Up SNMP Subscription for Exadata Storage Servers
Exadata Storage Server supports SNMP v1/v2 and v3 based alert notifications to the subscribers. To receive alerts in the SNMP v3 version, the monitoring agents must be added to the snmpSubscriber list using an snmpuser. To set the values of snmpSubscriber, notificationMethod, and notificationPolicy:
Set Up SNMP Subscription for Cisco Ethernet Switch Targets
The Cisco Ethernet Switch must be configured to allow the Agents that monitor it to both poll the switch and receive SNMP alerts from the switch. To allow this, perform the following steps (swapping the example switch name dm01sw-ip with the name of the Cisco Ethernet Switch target being configured):
Note:
This procedure is valid for SI targets if the monitoring of the switch is performed with a non-administrator user. If Enterprise Manager monitors the switch with an administrator user, the following procedure is automatically performed as part of the discovery process.
Set Up SNMP for Power Distribution Unit (PDU) Targets
To enable Enterprise Manager to collect metric data and raise events for the PDU target, you must configure the PDU to accept SNMP queries from the Agents that monitor the PDU target. Also, appropriate threshold values for different phase values must be set on the PDU.
For steps to enable SNMPv3 on the PDU, see Enable SNMPv3 on PDUs.
Time Out Error During SNMP Subscription
If the Exadata Storage Servers have ILOM 5.0 or later and the SNMP subscription step fails with a Time Out error during the discovery of the Exadata Database Machine, the following two fixes are recommended:
- Upgrade the Enterprise Manager Exadata plug-in to 13.4 RU12, 13.5 RU01, or later.
- Upgrade the Exadata Storage Server software to version 21.2.0.0.0 or later.
Review Logs
You can review the logs to verify that discovery and monitoring work properly and to troubleshoot problems. The following are some of the useful logs and their locations:
- OMS logs
  Location: $INSTANCE_HOME/sysman/log
  - emoms.log: The main log file for the OMS Console application
  - emoms.trc: The main trace file for the OMS Console application
  - emoms_pbs.log: The main log file for the OMS platform application
- EM Managed Server logs
  Location: Relative path from EM instance base user_projects/domains/GCDomain/servers/EMGC_OMS1
  - EMGC_OMS1.log: The EMGC_OMSn instance writes all messages from its subsystems and applications to this log file.
  - EMGC_OMS1-diagnostic.log: This log file contains application-related security errors.
- Monitoring agent logs
  Location: Relative path from AGENT INSTANCE HOME sysman/log
  - gcagent.log: This log file contains trace, debug, information, error, or warning messages from the Agent. It can be used for debugging Agent framework issues.
  - gcagent_errors.log: It is similar to gcagent.log, but it contains only the log messages of ERROR and FATAL levels. The log has no size limit.
  - emagent_perl.trc: Trace file for the Perl scripts. EM uses Perl scripts to gather some metrics and for target discovery. The log level can be changed by modifying the EMAGENT_PERL_TRACE_LEVEL variable in the sysman/config/emd.properties file.
EMCLI Discovery Issues
The deployment procedure output displays the error messages for targets not promoted, tasks not performed (like SNMP subscription, access point creation, etc.) and information to resolve those errors. After you resolve the underlying cause for the error, submit the deployment procedure again.
To triage an issue, enable the debug mode on OMS and the agent and resubmit the discovery to get more information:
-
On OMS:
emctl set property -name log4j.rootCategory -value "DEBUG, emlogAppender, emtrcAppender" -sysman_pwd sysman
-
On the agent:
emcli set_agent_property -agent_name="<agent_target_name>" -value=DEBUG -name="Logger.sdklog.level"
The following are some of the errors that may be encountered and resolved:
System Infrastructure Targets in Pending Status
System Infrastructure targets remain in Pending status until the first configuration metrics are collected.
If the discovered targets still show the Pending status after 15 minutes, then go to the target home page, investigate the cause (for example, metric collection errors), and attempt to fix it manually.
In most cases, the issue gets resolved without rediscovery if you wait long enough for the target metrics to be collected.
Discovery Status Summary States "Completed with errors"
If the EMCLI Deployment Procedure step System Creation completes with a status of Completed with Errors, then the summary information about the errors encountered will be displayed in the procedure output. For example:
Number of targets that have been created successfully:8
Number of targets that have errors during discovery:1
List of targets with errors during discovery:
Exatarget_1.example.com
ExpressionEvaluationException: Below error message was returned while executing this Computational step.
----------------------------------------------------------
Some targets are not successfully discovered. Please refer the above target creation details for the detailed errors.
----------------------------------------------------------
Target Discovered but Association Not Created
If the discovery status reports that a target is created but its associations are not, then create associations manually by following these steps:
-
Add the target manually to the Exadata Database Machine Schematic from the database machine home page. This will associate the target with database machine rack.
-
Associate the target with Exadata Database Machine using the EMCLI command as in the following example:
emcli create_assoc -assoc_type='app_composite_contains' -source='DB_Machine1.example.com:oracle_dbmachine' -dest='DB_Machine1.example.com:host'
Component not Available or Accessible Before Discovery
If the Exadata Database Machine member target is not available or accessible, then try resolving the issue through one of the following options:
-
Make component available before performing Database Machine discovery.
-
If the component's availability is unresolved for an indefinite time, then exclude it from the Exadata Database Machine discovery. Use the input file parameter component.skipComponentList and proceed with the discovery. After the component is available, remove it from the component.skipComponentList parameter and resubmit the discovery to add the target to the existing Exadata Database Machine.
Trouble Using ExaCLI during Discovery
If you selected ExaCLI as the monitoring mechanism for Exadata Storage Servers and if the monitoring agent is installed on the compute nodes, then ExaCLI should already be installed. If not, ensure that ExaCLI is installed on the agent that you have selected.
To verify if ExaCLI is installed properly on the agent, run the following command:
[oracle@myagent ~]$ which exacli
/usr/local/bin/exacli
[oracle@myagent ~]$ exacli -c mycell.example.com -l exacli_user --xml -e 'list cell attributes name'
Password: *************
<?xml version="1.0" encoding="utf-8" ?>
<cli-output>
<context cell="mycell"/>
<cell> <name>mycell</name>
</cell>
</cli-output>
Input File Errors
- Mandatory property name or value not specified in the input file
  The error indicates the properties that are missing from the input file. For example, Property outputFileLoc is not defined.
- The specified value is not applicable for a property
  The error message lists all allowed values for the property. For example, Check cell metric source input. It needs be CellCLI or ExaCLI.
- The named credential specified in the input file does not exist
  The error indicates the named credential that is not available in Enterprise Manager. For example, Credential not found owner SYSMAN name BLR_BOX_CREDS_1.
- The specified named credential is not valid
  The error indicates that the credential failed verification. For example, Credential test of cell target Exa_Storage_Server_1 failed.
  Resolve the issue based on the additional information available with the error, and rediscover the targets.
Resolve the above errors by updating the information in the input file by using the resolution steps suggested in the Deployment Procedure output. Resubmit the Deployment Procedure to discover targets.
Unclear Message Shown on the Deployment Procedure Output
Typically, detailed information about the error and resolution is displayed for identified use cases. In case an unclear message is displayed, to learn the cause of the error:
-
Check the Oracle Management Service logs (instance logs) in the location $INSTANCE_HOME/sysman/log where the information would be logged.
-
Check the Enterprise Manager server logs stored in the relative path from the Enterprise Manager instance base user_projects/domains/GCDomain/servers/EMGC_OMS1, for more information.
Troubleshooting the Exadata Database Machine Schematic File
As part of any discovery troubleshooting, possible causes and recommended resolution with the schematic file can include:
- The schematic file on the compute node is missing or is not readable by the agent user.
  For Exadata Release 11.2.3.2 and later, the schematic file is:
  /opt/oracle.SupportTools/onecommand/databasemachine.xml
  Ensure that the schematic file exists in the specified location and that the file and directory can be accessed by the agent installation user account.
- If a PerformOperationException error appears, the agent NMO is not configured for setuid-root:
  - From the OMS log:
    2019-11-08 12:28:12,910 [[ACTIVE] ExecuteThread: '6' for queue: 'weblogic.kernel.Default (self-tuning)'] ERROR model.DiscoveredTarget logp.251 - ERROR: NMO not setuid-root (Unix only) oracle.sysman.emSDK.agent.client.exception.PerformOperationException:
  - As root, run:
    # <AGENT_INST>/root.sh
- In the /etc/pam.d configuration files, pam_ldap.so is used instead of pam_unix.so on compute nodes.
  - Even though the agent user and password are correct, this error appears in the agent log:
    oracle.sysman.emSDK.agent.client.exception.PerformOperationException: ERROR: Invalid username and/or password
    To resolve the issue, use pam_unix.so in the /etc/pam.d configuration files on compute nodes.
- If the schematic file is blank, then:
  - Check your browser support and Enterprise Manager Cloud Control 13c.
  - Run through discovery again and watch for messages.
  - Check the emoms.log file for exceptions at the same time.
- If components are missing, then:
  - Add them manually to the schematic page (click Edit).
  - Check for component presence in Enterprise Manager. Check to see if it is monitored.
Exadata Database Machine Management Troubleshooting
If data is missing in Resource Utilization graphs, then run a "view object" SQL query to find out what data is missing. Common problems include:
- Schematic file is not loaded correctly.
- Cluster, Database, and ASM are not added as Enterprise Manager targets.
- Database or Exadata Storage Server target is down or is returning metric collection errors.
- Metric is collected in the Enterprise Manager repository, but has an IS_CURRENT != Y setting.
Oracle Database Storage Server System Target Status Pending
The Pending state of the Oracle Database Storage Server System indicates that the storage server associations with the system could not be derived. The derived association rule is based on the ECM data of Exadata Storage Servers, DB, and ASM. Generally, after adding a target, it may take up to 30 minutes for ECM data to appear.
The Last Collected information on the Configuration page indicates the ECM data availability. To view the information, from the target menu, select Configuration, and click Latest.
If the configuration data is not present for any of the targets like Database Machine, DB, and ASM, then perform a Refresh on the configuration page.
Target Status Issues
If the Target status shows DOWN inaccurately, then:
- If you are using CellCLI to monitor the Exadata Storage Server, you can check SSH equivalence (cellmonitor user) with the following command:
  ssh -i /home/oracle/.ssh/id_dsa -l cellmonitor <cell name> -e cellcli list cell
  Output should be:
  <cell name>
- For the PDU: Check to make sure you can access the PDU through your browser to verify that it is connected to your LAN:
  http://<pdu name>
- For the Cisco Switch: Check for proper SNMP subscriptions. See Set Up SNMP Subscription for Cisco Ethernet Switch Targets for details.
Metric Collection Issues
If the Target status shows a Metric Collection Error, then:
-
Hover over the icon or navigate to Incident Manager.
-
Read the full text of the error.
-
Visit the Monitoring Configuration page and examine the settings. From the Setup menu, select Monitoring Configuration.
-
Trigger a new collection: From the Target menu, select Configuration, then select Last Collected, then Actions, and finally select Refresh.
-
Access the monitoring Agent Metric through your browser:
https://<Agent_Host_Name>/emd/browser/main
Click Target >> and then click Response to evaluate the results. You may need to log a service request (SR) with Oracle Support.
Status: Pending Issues
For those issues where components are in a Pending status, see the following troubleshooting steps:
Cellsys Targets
If the Cellsys target seems to be in a Pending status for too long, then:
-
Verify that there is an association for the Cluster ASM, Database, and Exadata Storage Servers.
-
Check and fix the status of the associated target database.
-
Check and fix the status of the associated target ASM cluster.
-
Ensure an UP status for all Exadata Storage Server targets.
-
Delete any unassociated cellsys targets.
Database Machine Target or Any Associated Components
If the Exadata Database Machine target or any associated components are in a Pending status for too long, then:
-
Check for duplicate or pending delete targets. From the Setup menu, select Manage Cloud Control, then select Health Overview.
-
Check the target configuration. From the target's home page menu, select Target Setup, then select Monitoring Configurations.
-
Search for the target name in the agent or OMS logs:
$ grep <target name> gcagent.log
$ grep <target name> emoms.log
Monitoring Agent Not Deployed for IPv6 Environments
Problem: For IPv6 environments, the monitoring agent is not deployed.
Cause: If the IPv6 address is not included in the /etc/hosts file, then the agent will not be deployed.
Resolution: Edit the /etc/hosts file of the compute node (or the VM in case of virtual Exadata) to map the OMS host name to an IPv6 address.
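Whether the hosts file already maps the OMS host name to an IPv6 address can be checked before retrying the agent deployment. The following is a minimal sketch; has_ipv6_mapping is a hypothetical helper, and it treats any address containing a colon as IPv6.

```shell
# Sketch: succeed if a hosts file maps the given host name to an IPv6 address.
# has_ipv6_mapping is a hypothetical helper name.
has_ipv6_mapping() {
    hosts_file=$1
    name=$2
    awk -v name="$name" '
        { sub(/#.*/, "") }                 # drop comments before inspecting fields
        index($1, ":") {                   # IPv6 addresses contain a colon
            for (i = 2; i <= NF; i++) if ($i == name) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$hosts_file"
}
```

For example, has_ipv6_mapping /etc/hosts followed by your OMS host name returns success only when an IPv6 entry for that name exists.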
Monitoring: Error Fetching Data for Grid and CellSys Targets IORM Page Charts
If Error Fetching Data appears in the Workload distribution by Databases charts on the IORM page of Grid and CellSys targets during the first 24 hours after Database Machine discovery, you can either wait until 24 hours have passed since the Database Machine was discovered or perform the following steps:
- Ensure that the Grid and CellSys member targets configuration collection is collected. If the configuration is not available, perform the below steps:
  - Go to Exadata Grid target menu, Configuration, Latest, click Refresh.
  - Select the member targets configuration collection and click OK.
  - Wait until the configuration collection is collected successfully.
- Update the collection frequency of the Exadata Storage Type metric to 1 minute for every member of the Grid and CellSys target. Follow the below steps:
  - Go to Exadata Storage Server target menu, Monitoring, and Metric and Collection Settings.
  - Click the Other Collected Items tab. Update the Exadata Storage Type metric collection frequency to 1 minute and save it.
- Wait for 10 minutes, launch the Grid and CellSys target home page, and ensure that the Overview section does not display the message Configuration collection is not available. If the message continues to appear, then repeat Steps 1 and 2.
After the issue is resolved for Grid and CellSys IORM page charts, set the Exadata Storage Type metric collection frequency to 1440 minutes.
Not able to receive SNMP traps on Exadata Storage Servers using IPv6
For IPv6 environments, the Enterprise Manager agent needs to have an SNMP v3 subscription to the Exadata Storage Servers for complete monitoring.
-
SNMP V3 user created on Exadata Storage Server. See Step 1 for instructions.
-
Exadata Storage Server version must be 12.1.2.2 or later to support SNMP V3 subscription.
If the SNMP subscription was missed during Exadata discovery, you can follow these steps:
To configure SNMP v3 subscription on an Exadata Storage Server:
Exascale Discovery and Monitoring Issues
- Exascale target status is down
  Go to the Oracle Exadata Exascale, Target Setup, select Monitoring Configuration. Check the REST URL which should be of the form https://<fqdn_of_controlserver>:<port>/exascaleapi/v1.
- Exascale target status is Availability Evaluation Error (Failed to get private key)
  Go to Setup, Security, select Monitoring Credentials. Select Oracle Exadata Exascale row and click Manage Monitoring Credentials. Select the Exascale target that has the issue and click Set Credentials. Fix the credential by providing the correct value.
- Exascale target status is Availability Evaluation Error (the oracle_exascale target <target name> does not exist)
  Ensure that both the discovery and monitoring components of the Exadata plug-in are patched with Oracle Enterprise Manager 13.5GC RU23.