MySQL 9.3 Reference Manual Including MySQL NDB Cluster 9.3
The Group Replication Resource Manager component monitors secondary server lag time and memory usage, and can expel servers which lag excessively or use too many resources from the group. Allowable lag time and resource usage are configurable for both applier channels and recovery channels, as explained in this section. This component is available as part of MySQL Enterprise Edition, beginning with MySQL 9.2.0.
Purpose: Provide monitoring of and control over secondary server lag time and resource usage.
URN:
file://component_group_replication_resource_manager
Prior to installing the Group Replication Resource Manager
component, the Group Replication plugin must be installed using
INSTALL PLUGIN
or
--plugin-load-add
(see
Section 20.2.1.2, “Configuring an Instance for Group Replication”);
otherwise, the INSTALL COMPONENT
statement is
rejected with an error. If you attempt to uninstall the Group
Replication plugin when the Group Replication Resource Manager
component is installed, UNINSTALL
PLUGIN
fails with the error Plugin
'group_replication' cannot be uninstalled now. Please uninstall
the component 'component_group_replication_resource_manager' and
then UNINSTALL PLUGIN group_replication.
Once these conditions are met, the Group Replication Resource
Manager component can be installed and uninstalled using
INSTALL COMPONENT
and
UNINSTALL COMPONENT
,
respectively. See the descriptions of these statements, as well
as Section 7.5.1, “Installing and Uninstalling Components”, for more information.
The Group Replication Resource Manager component provides a configurable automatic expulsion mechanism which detects when the applier or recovery channel on a group replication secondary is lagging, or when the secondary is swapping excessively, and expels the problematic server from the group, thus helping to maintain high availability. Due to the high availability requirement, in order to use the auto-expulsion functionality with an active replication group, the group must initially consist of no fewer than three members, including the group replication primary; this guarantees that there are at least two members (one primary and one secondary) in the event that one member has been expelled.
The Group Replication Resource Manager component does not monitor the group replication primary, and is not intended to expel the primary, but it is possible for the decision to expel a secondary to be made just before the same secondary is promoted to primary (due to a concurrent primary failure), in which case the just-elected primary may be evicted.
Using the system and status variables provided by this component, the operator can separately monitor each of the three areas of concern—applier lag, recovery lag, and system resource exhaustion—and separate thresholds for expulsion set for each of them, as listed here:
Applier channel: Obtain the time by
which this server's applier channel lags behind that of
the primary from the
Gr_resource_manager_applier_channel_lag
server status variable. You can set an upper limit for this
by setting the
group_replication_resource_manager.applier_channel_lag
server system variable; if the lag exceeds this value 10
times or more in succession, this server is expelled from
the group. The default threshold is 3600 seconds (1 hour).
Recovery channel: The time by which
this server's recovery channel lags behind that of the
primary can be obtained by checking the value of the
Gr_resource_manager_recovery_channel_lag
server status variable. You can set an upper limit for this
by setting
group_replication_resource_manager.recovery_channel_lag
;
if the secondary's recovery lag is more than this, 10
times in succession, this server is expelled from the group.
The default threshold is 3600 seconds (1 hour).
Resource (Memory) Usage: The
group_replication_resource_manager.memory_used_limit
server system variable sets the threshold for memory
consumption as a percentage of total memory; when
Gr_resource_manager_memory_used
exceeds this percentage 10 times in succession, this server
is expelled.
The Resource Manager component checks lag and usage on group replication secondaries every 5 seconds. This period is not configurable by the operator.
A server which has been expelled from the group may subsequently
try to rejoin it without manual intervention, provided that
group_replication_autorejoin_tries
is enabled (otherwise the server proceeds as specified by
group_replication_exit_state_action
).
The auto-rejoin mechanism and behavior are the same as those
described in
Section 20.7.7.3, “Auto-Rejoin”.
For a replication group member attempting to join or rejoin a
group after encountering issues and being expelled, a quarantine
period prevents immediate re-expulsion. This period is tracked
individually for each member, so that, during the quarantine
period started after group member A has been expelled and
subsequently allowed to re-join the group, member B can be
expelled safely if the need arises. The duration of the
quarantine period determined by the value of the
group_replication_resource_manager.quarantine_time
server system variable. The default length of the quarantine
period is 3600 seconds (1 hour).
The Resource Management component provides a number of server status variables which can be used for monitoring the status of Group Replication and the Resource Manager component. In addition to the three such variables discussed previously, these include the following:
Gr_resource_manager_applier_channel_threshold_hits
:
The number of samples which have exceeded
group_replication_resource_manager.applier_channel_lag
.
Gr_resource_manager_applier_channel_eviction_timestamp
:
When the last eviction caused by applier channel lag
occurred.
Gr_resource_manager_recovery_channel_threshold_hits
:
The number of samples which have exceeded
group_replication_resource_manager.recovery_channel_lag
.
Gr_resource_manager_recovery_channel_eviction_timestamp
:
When the last eviction caused by recovery channel lag
occurred.
Gr_resource_manager_memory_threshold_hits
:
The number of samples which have exceeded
group_replication_resource_manager.memory_used_limit
.
Gr_resource_manager_memory_eviction_timestamp
:
When the last eviction caused by excess memory usage took
place.
In addition, it is possible to determine if and when errors have occurred when attempting to get lag or memory usage information by checking the status variables listed here:
Gr_resource_manager_channel_lag_monitoring_error_timestamp
:
Timestamp for the last time this member encountered an error
while trying to obtain a value for channel lag.
Gr_resource_manager_memory_monitoring_error_timestamp
:
The last time this member encountered an error while trying
to obtain a value for the system memory usage.
For general information about MySQL Group Replication, see Chapter 20, Group Replication.