Restart fmd if mcelog Fails
For various reasons, the mcelog daemon might not start, or it might fail during normal operation. When this happens, CPU and memory errors from the host are no longer received or diagnosed.
-
Determine if the mcelog daemon is running.
For example for Oracle Linux 6.5:
[root@testserver16 ~]# service mcelogd status
Checking for mcelog
mcelog (pid 32435) is running...
For example for Oracle Linux 7:
[root@testserver16 ~]# systemctl status mcelogd
Checking for mcelog
mcelog (pid 32435) is running...
The status should be "running". If it is not, the daemon has either been stopped or has failed.
If mcelog is stopped or has failed, the Oracle Linux FMA mce module also fails, because it depends on a working mcelog daemon.
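The status check in this step can be scripted. The following is a minimal sketch, assuming the status text matches the format shown in the examples above; `mcelog_is_running` is a hypothetical helper name, not part of Oracle Linux FMA.

```shell
# Hypothetical helper: succeed (exit 0) only if the status text on stdin
# reports mcelog as running, in the format shown in the examples above.
mcelog_is_running() {
    grep -q 'mcelog (pid [0-9][0-9]*) is running'
}

# Example usage on a live host (Oracle Linux 6.5 style command):
#   service mcelogd status 2>&1 | mcelog_is_running || echo "mcelog is not running"
```

The helper reads from standard input instead of calling the status command itself, so the same check can be used with either `service` or `systemctl`.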
-
If the mcelog daemon is running, check the status of the Oracle Linux FMA modules.
To list the status of all fault manager modules:
[root@testserver16 ~]# fmadm config
MODULE                VERSION  STATUS  DESCRIPTION
ext-event-transport   0.2      active  External FM event transport
fmd-self-diagnosis    1.0      active  Fault Manager Self-Diagnosis
ip-transport          1.1      active  IP Transport Agent
mce                   1.0      failed  Machine Check Translator
sysevent-transport    1.0      active  SysEvent Transport Agent
syslog-msgs           1.1      active  Syslog Messaging Agent
In the above example, the mce module has a "failed" status. This means that CPU or memory machine check events are not being monitored by the host and, consequently, not being logged or diagnosed in the fault management database.
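On a host with many modules, the failed ones can be filtered out of the listing. The following is a minimal sketch, assuming the column layout shown in the example above (MODULE first, STATUS third); `failed_modules` is a hypothetical helper name.

```shell
# Hypothetical helper: read `fmadm config` output on stdin and print the
# name of every module whose STATUS column is "failed".
failed_modules() {
    # Skip the header row; column 1 is MODULE, column 3 is STATUS.
    awk 'NR > 1 && $3 == "failed" { print $1 }'
}

# Example usage on a live host:
#   fmadm config | failed_modules
```

If the output includes `mce`, continue with the remaining steps of this procedure.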
-
If the Oracle Linux FMA mce module has failed, confirm the cause of the failure using fmdump.
For example:
[root@testserver16 ~]# fmdump -Ve
May 21 2014 09:56:05.930589483 ereport.fm.fmd.module
nvlist version: 0
        version = 0x0
        class = ereport.fm.fmd.module
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x1
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        system-mfg = unknown
                        system-name = unknown
                        system-part = unknown
                        system-serial = unknown
                        sys-comp-mfg = unknown
                        sys-comp-name = unknown
                        sys-comp-part = unknown
                        sys-comp-serial = unknown
                        server-name = testserver16
                        host-id = ffffffff990a7a4a
                (end authority)
                mod-name = mce
                mod-version = 1.0
        (end detector)
        ena = 0x3631d6cd9f6c0001
        msg = mcelog not running!: client requested that module execution abort
        errno = 1072
        errclass = ereport.fm.fmd.hdl_abort
        __ttl = 0x1
        __tod = 0x52de8a85 0x3777ab2b
In the above example, the "msg =" field reports that mcelog is not running, which is the cause of the mce module failure.
-
Once you have determined that the mcelog daemon is the problem, restart it.
For example for Oracle Linux 6.5:
[root@testserver16 ~]# service mcelogd start
Starting mcelog daemon
For example for Oracle Linux 7:
[root@testserver16 ~]# systemctl start mcelogd
Starting mcelog daemon
-
Verify that mcelog is running.
For example for Oracle Linux 6.5:
[root@testserver16 ~]# service mcelogd status
Checking for mcelog
mcelog (pid 32435) is running...
For example for Oracle Linux 7:
[root@testserver16 ~]# systemctl status mcelogd
Checking for mcelog
mcelog (pid 32435) is running...
-
Unload the Oracle Linux FMA mce module.
[root@testserver16 ~]# fmadm unload mce
Doing this generates a fault event that you can identify in the fault management database.
-
Confirm that the unloading of the mce module is captured in the fault management database.
For example:
[root@ban25ts12uut2 ~]# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 21 11:35:07 528fbbb9-92d4-cd7f-ef81-e2fddfd3c244 FMD-8000-2K    Minor

Problem Status      : solved
Diag Engine         : fmd-self-diagnosis / 1.0
System
   Manufacturer     : unknown
   Name             : unknown
   Part_Number      : unknown
   Serial_Number    : unknown
   Host_ID          : ffffffff990a7a4a

----------------------------------------
Suspect 1 of 1 :
   Fault class      : defect.sunos.fmd.module
   Certainty        : 100%
   Affects          : fmd:///module/mce
   Status           : faulted and taken out of service

Description : A Linux Fault Manager component has experienced an error that
              required the module to be disabled.

Response    : The module has been disabled. Events destined for the module
              will be saved for manual diagnosis.

Impact      : Automated diagnosis and response for subsequent events
              associated with this module will not occur.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/FMD-8000-2K for the latest service
              procedures and policies regarding this diagnosis.
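To confirm that the fault belongs to the mce module without reading the full report, the "Affects" field can be extracted. The following is a minimal sketch, assuming each suspect is reported on its own "Affects :" line as in the example above; `affected_resources` is a hypothetical helper name.

```shell
# Hypothetical helper: read `fmadm faulty` output on stdin and print the
# resource named on each "Affects" line.
affected_resources() {
    sed -n 's/^[[:space:]]*Affects[[:space:]]*:[[:space:]]*//p'
}

# Example usage on a live host:
#   fmadm faulty | affected_resources | grep -qx 'fmd:///module/mce' \
#       && echo "mce module fault recorded"
```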
-
Reload the Oracle Linux FMA mce module and confirm that it is running.
For example:
[root@testserver16 ~]# fmadm load /opt/fma/fm/lib/fmd/plugins/mce.so
fmadm: module '/opt/fma/fm/lib/fmd/plugins/mce.so' loaded into fault manager
[root@testserver16 ~]# fmadm config
MODULE                VERSION  STATUS  DESCRIPTION
ext-event-transport   0.2      active  External FM event transport
fmd-self-diagnosis    1.0      active  Fault Manager Self-Diagnosis
ip-transport          1.1      active  IP Transport Agent
mce                   1.0      active  Machine Check Translator
sysevent-transport    1.0      active  SysEvent Transport Agent
syslog-msgs           1.1      active  Syslog Messaging Agent
If the mce module does not unload or reload, restart the fault manager.
For example for Oracle Linux 6.5:
[root@testserver16 ~]# service fmd.init restart
Stopping fmd: [ OK ]
Starting fmd: [ OK ]
For example for Oracle Linux 7:
[root@testserver16 ~]# systemctl restart fmd.init
Stopping fmd: [ OK ]
Starting fmd: [ OK ]
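The Oracle Linux 6.5 and Oracle Linux 7 examples in this procedure differ only in the service-management command. The following is a minimal sketch of a wrapper that prints the matching command line; `svc_cmd` is a hypothetical helper, and the caller supplies the release number instead of the helper detecting it.

```shell
# Hypothetical helper: print the service-management command used in this
# procedure for a given Oracle Linux major release.
#   $1 = major release (6 or 7), $2 = action, $3 = service name
svc_cmd() {
    case "$1" in
        6) printf 'service %s %s\n' "$3" "$2" ;;
        7) printf 'systemctl %s %s\n' "$2" "$3" ;;
        *) printf 'unsupported release: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Example usage:
#   svc_cmd 6 restart fmd.init    # prints: service fmd.init restart
#   svc_cmd 7 start mcelogd       # prints: systemctl start mcelogd
```

The helper only prints the command, so the result can be reviewed before it is run as root.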