Lifecycle of a Problem or Condition Managed By the Fault Manager
The lifecycle of a problem or condition managed by the Fault Manager can include the following stages. Each of these lifecycle state changes is associated with the publication of a unique list event.
- Diagnose – A new diagnosis has been made by the Fault Manager. The diagnosis includes a list of one or more suspects. A list.suspect event is published. The diagnosis is identified by a UUID in the event payload, and further events describing the resolution lifecycle of this diagnosis quote a matching UUID.
- Isolate – A suspect has been automatically isolated to prevent further errors from occurring. A list.isolated event is published. For example, a CPU core or memory page has been offlined.
- Update – One or more of the suspect resources in a problem diagnosis has been repaired, replaced, or acquitted, or the resource has faulted again. A list.updated event is published. The suspect list still contains at least one faulted resource. A repair might have been made by executing an fmadm command, or the system might have detected a repair such as a changed serial number for a part. The fmadm command is described in Repairing Faults and Defects and Clearing Alerts.
- Repair – All of the suspect resources in a diagnosis have been repaired, resolved, or acquitted. A list.repaired event is published. Some or all of the resources might still be isolated.
- Resolve – All of the suspect resources in a diagnosis have been repaired, resolved, or acquitted and are no longer isolated. A list.resolved event is published. For example, a CPU core that was a suspect and was offlined is now back online again. Offlining and onlining resources is usually automatic.
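The rules that separate the Update, Repair, and Resolve stages can be sketched in Python. This is an illustrative model only, not Fault Manager code: the names Suspect, SuspectState, event_after_change, and group_by_uuid are hypothetical, and the only details taken from the text above are the event class names and the conditions attached to them.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class SuspectState(Enum):
    # Hypothetical state names, chosen to mirror the wording of the list above.
    FAULTED = "faulted"
    REPAIRED = "repaired"
    REPLACED = "replaced"
    ACQUITTED = "acquitted"


@dataclass
class Suspect:
    # Hypothetical record for one suspect resource in a diagnosis.
    name: str
    state: SuspectState = SuspectState.FAULTED
    isolated: bool = False


def event_after_change(suspects):
    """Pick the list event that matches the suspect list after a change.

    Simplified model of the stages described above:
    - at least one suspect still faulted     -> list.updated
    - none faulted, but some still isolated  -> list.repaired
    - none faulted and none isolated         -> list.resolved
    """
    if any(s.state is SuspectState.FAULTED for s in suspects):
        return "list.updated"
    if any(s.isolated for s in suspects):
        return "list.repaired"
    return "list.resolved"


def group_by_uuid(events):
    """Group (uuid, event_class) pairs so each diagnosis keeps its own history."""
    history = defaultdict(list)
    for uuid, event_class in events:
        history[uuid].append(event_class)
    return dict(history)


# Walk one hypothetical diagnosis with two suspects through its lifecycle.
cpu = Suspect("cpu-core", state=SuspectState.REPAIRED, isolated=True)
dimm = Suspect("memory-module")
first = event_after_change([cpu, dimm])   # memory-module is still faulted

dimm.state = SuspectState.REPLACED
second = event_after_change([cpu, dimm])  # all fixed, cpu-core still isolated

cpu.isolated = False
third = event_after_change([cpu, dimm])   # all fixed, nothing isolated
```

In this sketch the three results are list.updated, list.repaired, and list.resolved in that order. The group_by_uuid helper reflects the point made under Diagnose: because later events quote the UUID of the original diagnosis, collecting events by UUID reconstructs the history of each problem.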
The Fault Manager daemon is a service enabled by default when using the Oracle Hardware Management Pack installer. See the fmd man page for more information about the Fault Manager daemon.
The fmadm config command shows the name, description, and status of each module in the Fault Manager. These modules diagnose, isolate resources, generate notifications, and auto-repair problems in the system.
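For scripted checks, the module listing can be captured from a program. The following is a minimal sketch that assumes fmadm is on the PATH and that the caller has sufficient privileges to run it. The function name fmadm_config_text is hypothetical, and the output is returned as raw text because its exact column layout is not described here.

```python
import shutil
import subprocess


def fmadm_config_text():
    """Return the raw text printed by `fmadm config`, or None if unavailable."""
    path = shutil.which("fmadm")
    if path is None:
        return None  # fmadm is not installed or not on PATH
    result = subprocess.run([path, "config"], capture_output=True, text=True)
    if result.returncode != 0:
        return None  # the command failed, for example due to missing privileges
    return result.stdout


listing = fmadm_config_text()
if listing is None:
    print("fmadm config is not available on this system")
else:
    print(listing)
```

Returning None instead of raising keeps the sketch usable on systems where the Fault Manager is not installed.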