Notification of Faults, Defects and Alerts
When the mcelog daemon encounters an error, it triggers a configurable response and logs information to the mcelog file. For example, assume that physical address 0x45a3b50c0 generates a correctable memory read error. When this happens, the mcelog daemon adds an entry such as the following to /var/log/mcelog:
CPU 8 BANK 3 TSC 0
RIP 00:0
MISC 0x85 ADDR 0x45a3b50c0 <------ address that had the correctable read error
STATUS 0x9c000000f00c009f MCGSTATUS 0x7
PROCESSOR 0:0x306f1 TIME 1389814624 SOCKETID 0 APICID 18
MCGCAP 0x7000c16
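The entry is a run of NAME VALUE pairs, so a script can pick out the fields of interest. The following is a minimal sketch, not part of mcelog: the function name parse_mce_record is hypothetical, the sample text is the entry above with the annotation arrow removed, and TIME is assumed to be seconds since the Unix epoch.

```python
import re
from datetime import datetime, timezone

# Sample entry copied from the text above (annotation arrow removed).
RECORD = (
    "CPU 8 BANK 3 TSC 0 RIP 00:0 MISC 0x85 ADDR 0x45a3b50c0 "
    "STATUS 0x9c000000f00c009f MCGSTATUS 0x7 PROCESSOR 0:0x306f1 "
    "TIME 1389814624 SOCKETID 0 APICID 18 MCGCAP 0x7000c16"
)

def parse_mce_record(text):
    """Return a dict mapping upper-case field names to raw string values."""
    return dict(re.findall(r"\b([A-Z]+) (\S+)", text))

fields = parse_mce_record(RECORD)
addr = int(fields["ADDR"], 16)  # physical address that reported the error
# Assumption: TIME is a Unix timestamp; shown here in UTC.
when = datetime.fromtimestamp(int(fields["TIME"]), tz=timezone.utc)
print(f"CPU {fields['CPU']} bank {fields['BANK']}: address {addr:#x} at {when.isoformat()}")
```

Under that assumption the timestamp decodes to 19:37:04 UTC on January 15, 2014, which lines up with the 14:37:04 local time in the system log messages if the server clock is five hours behind UTC.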
A message is also sent to the system log (/var/log/messages) describing the problem (error count exceeded threshold) and what was done (offlining the page), such as:
1 Jan 15 14:37:04 testserver16 kernel: Machine check poll done on CPU 8
2 Jan 15 14:37:04 testserver16 mcelog: Family 6 Model 3f CPU: only decoding architectural errors
3 Jan 15 14:37:04 testserver16 mcelog: corrected Socket memory error count exceeded threshold: 1 in 24h
4 Jan 15 14:37:04 testserver16 mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
5 Jan 15 14:37:04 testserver16 mcelog: Corrected memory errors on page 45a3b5000 exceed threshold 1 in 24h: 1 in 24h
6 Jan 15 14:37:04 testserver16 mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
7 Jan 15 14:37:04 testserver16 mcelog: Running trigger `page-error-trigger'
8 Jan 15 14:37:04 testserver16 mcelog: Offlining page 45a3b5000
The message on line 5 indicates that the correctable error threshold was set to 1 error in 24 hours. Because this threshold was exceeded, page 0x45a3b5000 was removed from service, as indicated by the "Offlining page" message (line 8) in the system log. The process that encountered the correctable error is either assigned a new page or killed, depending on the "memory-ce-action" value in the "page" section of the mcelog.conf file.
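The offlined page 0x45a3b5000 is the page that contains the failing address 0x45a3b50c0: clearing the in-page offset bits of the address yields the page base. The following sketch shows that arithmetic and pulls the offlined page out of the system log text. It assumes a 4 KiB page size, which is consistent with the sample values; the helper names page_of and offlined_pages are hypothetical.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, consistent with the sample values

# Two of the system log lines from the example above.
SYSLOG = """\
Jan 15 14:37:04 testserver16 mcelog: Corrected memory errors on page 45a3b5000 exceed threshold 1 in 24h: 1 in 24h
Jan 15 14:37:04 testserver16 mcelog: Offlining page 45a3b5000
"""

def page_of(addr, page_size=PAGE_SIZE):
    """Clear the in-page offset bits to get the page base address."""
    return addr & ~(page_size - 1)

def offlined_pages(log_text):
    """Return the page base addresses named in 'Offlining page' messages."""
    matches = re.findall(r"mcelog: Offlining page ([0-9a-f]+)", log_text)
    return [int(m, 16) for m in matches]

error_addr = 0x45a3b50c0  # ADDR field from the mcelog entry
pages = offlined_pages(SYSLOG)
print(hex(page_of(error_addr)))      # 0x45a3b5000
print(page_of(error_addr) in pages)  # True
```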
In addition to the page being offlined, if the DIMM corresponding to the failed address exceeds the factory-programmed DIMM threshold, the service processor (SP) generates a fault that is forwarded to the host and recorded in the fault management database.
Often, the first interaction with the Fault Manager daemon is a system message indicating that a fault or defect has been diagnosed. Messages are sent to both the console and the /var/log/messages file. All messages from the Fault Manager daemon use the following format:
1 SUNW-MSG-ID: SPX86A-8002-30, TYPE: Fault, VER: 1, SEVERITY: Minor
2 EVENT-TIME: Wed Nov 27 10:36:30 PST 2013
3 PLATFORM: SUN SERVER X4-4, CSN: -, HOSTNAME: testserver16
4 SOURCE: fdd, REV: 1.0
5 EVENT-ID: eed2208e-2dcf-40c9-9bab-ab3a13e94182
6 DESC: A processor has detected multiple memory controller correctable errors.
8 AUTO-RESPONSE: The affected processor will be disabled at the next system boot
9 and remain unavailable until repaired.
10 The chassis wide and processor service-required LED's are illuminated.
11 IMPACT: The system will continue to operate in the presence of this
12 fault.
13 System performance may be impacted due to disabled processor.
14 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this
15 event. Please refer to the associated reference document at
16 http://support.oracle.com/msg/SUN4V-8001-8H for the latest service procedures and
17 policies regarding this diagnosis.
When notified of a diagnosed problem, always consult the recommended Oracle Knowledge Article for additional details. See line 16 above for an example. The knowledge article might contain additional actions that you or a service provider should take beyond those listed on line 14.
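Because the message uses fixed "NAME: value" labels, the identifying fields and the knowledge article URL can be collected with a short script, for example to feed a ticketing system. The following is a minimal sketch under stated assumptions: the function name summarize and the keys of the returned dictionary are hypothetical, and the sample text is an abridged copy of the message above with the display line numbers removed.

```python
import re

# Abridged message text from the example above, without display line numbers.
MESSAGE = """\
SUNW-MSG-ID: SPX86A-8002-30, TYPE: Fault, VER: 1, SEVERITY: Minor
EVENT-TIME: Wed Nov 27 10:36:30 PST 2013
PLATFORM: SUN SERVER X4-4, CSN: -, HOSTNAME: testserver16
SOURCE: fdd, REV: 1.0
EVENT-ID: eed2208e-2dcf-40c9-9bab-ab3a13e94182
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this
event. Please refer to the associated reference document at
http://support.oracle.com/msg/SUN4V-8001-8H for the latest service procedures and
policies regarding this diagnosis.
"""

def summarize(message):
    """Pull the identifying fields and the article URL out of one message."""
    def field(name):
        # A value runs from the label to the next comma or end of line.
        m = re.search(rf"\b{re.escape(name)}: ([^,\n]+)", message)
        return m.group(1).strip() if m else None

    url = re.search(r"https?://\S+", message)
    return {
        "msg_id": field("SUNW-MSG-ID"),
        "severity": field("SEVERITY"),
        "event_id": field("EVENT-ID"),
        "hostname": field("HOSTNAME"),
        "article": url.group(0) if url else None,
    }

summary = summarize(MESSAGE)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The extracted URL is only a pointer to the knowledge article; the article itself remains the source for the current service procedure.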
Notification of events can also be configured in Oracle ILOM using the Simple Network Management Protocol (SNMP) or the Simple Mail Transfer Protocol (SMTP). See the Oracle ILOM documentation at: http://www.oracle.com/goto/ilom/docs
In addition, Oracle Auto Service Request (ASR) can be configured to automatically request Oracle service when supported telemetry sources (such as Oracle ILOM) report specific hardware problems. See the Oracle Auto Service Request product page for information about this feature. The documentation link on that page leads to the Oracle ASR Quick Installation Guide and the Oracle ASR Installation and Operations Guide.