Lifecycle of a Problem or Condition Managed by the Fault Manager

The lifecycle of a problem or condition managed by the Fault Manager can include the following stages. Each lifecycle state change publishes its own list event.

  • Diagnose – A new diagnosis has been made by the Fault Manager. The diagnosis includes a list of one or more suspects. A list.suspect event is published. The diagnosis is identified by a UUID in the event payload, and every later event in the resolution lifecycle of this diagnosis carries the same UUID.

  • Isolate – A suspect has been automatically isolated to prevent further errors from occurring. A list.isolated event is published. For example, a CPU core or memory page has been offlined.

  • Update – One or more of the suspect resources in a problem diagnosis has been repaired, replaced, or acquitted, or the resource has faulted again. A list.updated event is published. The suspect list still contains at least one faulted resource. A repair might have been made by executing an fmadm command, or the system might have detected a repair such as a changed serial number for a part. The fmadm command is described in Repairing Faults and Defects and Clearing Alerts.

  • Repair – All of the suspect resources in a diagnosis have been repaired, resolved, or acquitted. A list.repaired event is published. Some or all of the resources might still be isolated.

  • Resolve – All of the suspect resources in a diagnosis have been repaired, resolved, or acquitted and are no longer isolated. A list.resolved event is published. For example, a CPU core that was a suspect and was offlined is now back online again. Offlining and onlining resources is usually automatic.
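
The mapping between list events and lifecycle stages, and the way a shared UUID ties the events of one diagnosis together, can be illustrated with a minimal Python sketch. This is not a Fault Manager API: the DiagnosisTracker class, its method names, and the idea of receiving an event as a pair of event class and UUID are all assumptions made for illustration. Only the five event names and their stages come from the list above.

```python
from collections import defaultdict

# Event class -> lifecycle stage, taken from the stage list above.
STAGE_BY_EVENT = {
    "list.suspect": "Diagnose",
    "list.isolated": "Isolate",
    "list.updated": "Update",
    "list.repaired": "Repair",
    "list.resolved": "Resolve",
}


class DiagnosisTracker:
    """Hypothetical helper that groups lifecycle events by diagnosis UUID."""

    def __init__(self):
        # UUID -> ordered list of stages seen so far for that diagnosis.
        self._history = defaultdict(list)

    def record(self, event_class, uuid):
        """Record one list event and return the stage it represents."""
        try:
            stage = STAGE_BY_EVENT[event_class]
        except KeyError:
            raise ValueError(f"not a lifecycle list event: {event_class!r}") from None
        self._history[uuid].append(stage)
        return stage

    def history(self, uuid):
        """Return a copy of the stages recorded for a diagnosis, oldest first."""
        return list(self._history.get(uuid, []))

    def is_resolved(self, uuid):
        """True if the most recent event for this diagnosis was list.resolved."""
        stages = self._history.get(uuid)
        return bool(stages) and stages[-1] == "Resolve"
```

A consumer of such a tracker would call record() once per event, passing the event class and the UUID from the payload, and could then use is_resolved() to check whether a given diagnosis has completed its lifecycle. The sketch deliberately does not enforce an ordering between stages, because not every diagnosis passes through every stage.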

The Fault Manager daemon runs as a service that is enabled by default when you install it using the Oracle Hardware Management Pack installer. See the fmd man page for more information about the Fault Manager daemon.

The fmadm config command shows the name, description, and status of each module in the Fault Manager. These modules diagnose problems in the system, isolate resources, generate notifications, and perform automatic repairs.