Detecting Cluster Failure on a System That Uses Availability Suite Data Replication

This section describes the internal processes that occur when failure is detected on a primary or a secondary cluster.

Detecting Primary Cluster Failure
Detecting Secondary Cluster Failure

Detecting Primary Cluster Failure

When the primary cluster for a given protection group fails, the secondary cluster in the partnership detects the failure. Because the failed cluster might be a member of more than one partnership, the failure might be detected multiple times, once by each partner.

The following actions occur when the overall state of a protection group changes to the Unknown state:

Heartbeat failure is detected by a partner cluster.
The heartbeat is activated in emergency mode to verify that the heartbeat loss is not transient and that the primary cluster has failed. During this timeout interval, the heartbeat itself remains in the OK state while the heartbeat mechanism continues to retry the primary cluster. Only the heartbeat plug-ins appear in the Error state.

The query interval is set by using the Query_interval property of the heartbeat. If the heartbeat still fails after four times the Query_interval that you configured (three retries and one emergency-mode probe), a heartbeat-lost event is generated and logged in the system log. With the default interval, the emergency-mode retry behavior might delay heartbeat-loss notification by about nine minutes. Messages are displayed in the output of the geoadm status command.
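The timing rule above comes down to simple arithmetic. The following Python sketch is illustrative only and is not part of Geographic Edition. The function name and the sample interval of 60 seconds are assumptions made for the example. The sketch computes only the four query intervals described above (three retries plus one emergency-mode probe) and does not account for any additional processing time before the notification appears.

```python
def min_heartbeat_loss_delay(query_interval_s: int,
                             retries: int = 3,
                             emergency_probes: int = 1) -> int:
    """Return the number of seconds covered by the retries and the
    emergency-mode probe before a heartbeat-lost event is generated.

    query_interval_s is the configured Query_interval, in seconds.
    The function name and signature are hypothetical.
    """
    if query_interval_s <= 0:
        raise ValueError("query_interval_s must be a positive number of seconds")
    # Three retries plus one emergency-mode probe = four query intervals.
    return (retries + emergency_probes) * query_interval_s


if __name__ == "__main__":
    interval = 60  # assumed example value, not a documented default
    delay = min_heartbeat_loss_delay(interval)
    print(f"Query_interval={interval}s -> at least {delay}s "
          f"({delay / 60:.1f} minutes) before the heartbeat-lost event")
```

A larger Query_interval reduces false alarms from transient network problems but lengthens the time before a real failure is reported, so the value is a trade-off between the two.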

For more information about logging, see Viewing the Geographic Edition Log Messages in Oracle Solaris Cluster Geographic Edition System Administration Guide.

Detecting Secondary Cluster Failure

When a secondary cluster for a given protection group fails, a cluster in the same partnership detects the failure. The cluster that failed might be a member of more than one partnership, resulting in multiple failure detections.

During failure detection, the following actions occur:

Heartbeat failure is detected by a partner cluster.
The heartbeat is activated in emergency mode to verify that the secondary cluster has failed.
The cluster notifies the administrator. The system identifies all protection groups for which the failed cluster was acting as the secondary, and the state of these protection groups becomes Unknown.
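The last step can be pictured as a filter over the protection groups known to the surviving cluster. The following Python sketch is a simplified model under stated assumptions, not the product's implementation: the ProtectionGroup class, its fields, the cluster names, and the function name are all hypothetical, and the software performs the equivalent bookkeeping internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProtectionGroup:
    """Hypothetical, simplified view of a protection group."""
    name: str
    primary: str      # name of the cluster acting as primary
    secondary: str    # name of the cluster acting as secondary
    state: str = "OK"


def mark_groups_for_failed_secondary(groups: List[ProtectionGroup],
                                     failed_cluster: str) -> List[str]:
    """Set the state to Unknown for every protection group in which
    failed_cluster was acting as the secondary, and return their names."""
    affected = []
    for pg in groups:
        if pg.secondary == failed_cluster:
            pg.state = "Unknown"
            affected.append(pg.name)
    return affected


if __name__ == "__main__":
    # Example data: cluster and group names are made up for illustration.
    groups = [
        ProtectionGroup("pg-sales", primary="cluster-paris", secondary="cluster-newyork"),
        ProtectionGroup("pg-hr", primary="cluster-newyork", secondary="cluster-paris"),
    ]
    affected = mark_groups_for_failed_secondary(groups, "cluster-newyork")
    print("Protection groups now in the Unknown state:", affected)
```

In this model, a protection group in which the failed cluster was acting as the primary is left untouched, because that case is covered by primary cluster failure detection.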

	Oracle Solaris Cluster Geographic Edition Data Replication Guide for Oracle Solaris Availability Suite Oracle Solaris Cluster 4.0