8.10 InnoDB ClusterSet Repair and Rejoin

Use this information if you need to repair a cluster in an InnoDB ClusterSet deployment. You can use the information here in any of the following situations:

A cluster in the InnoDB ClusterSet requires maintenance but has no issues with its functioning.
A cluster is functioning acceptably in the InnoDB ClusterSet deployment but has some issues, such as member servers that are offline.
A cluster is not functioning acceptably and needs to be repaired.
A cluster has been marked as invalidated during an emergency failover or controlled switchover procedure.

Section 8.7, “InnoDB ClusterSet Status and Topology” explains how to check the status of an InnoDB Cluster and of the whole InnoDB ClusterSet deployment, and the situations in which a cluster might need repair. You can identify the following situations from the output of the clusterSet.status() command:

A cluster does not have quorum (that is, not enough members are online to have a majority).
No members of a cluster can be reached.
A cluster's ClusterSet replication channel is stopped.
A cluster's ClusterSet replication channel is configured incorrectly.
A cluster's GTID set is inconsistent with the GTID set on the primary cluster in the InnoDB ClusterSet.
A cluster has been marked as invalidated. If the cluster is still online, the command warns that a split-brain situation might result.
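As a rough illustration of how these situations could be picked out programmatically, the following sketch filters a status result for clusters whose global status is not OK. It assumes the result exposes a `clusters` map whose entries carry `globalStatus`, `clusterRole`, and optionally `clusterErrors`, and that the object can be iterated like a plain JavaScript object. The function name and the sample data are hypothetical.

```javascript
// Sketch only: list the clusters in a clusterSet.status() result that may need repair.
// Assumption: each entry under `clusters` has `globalStatus` and `clusterRole`,
// and may have a `clusterErrors` array.
function clustersNeedingRepair(status) {
  const findings = [];
  for (const [name, info] of Object.entries(status.clusters || {})) {
    if (info.globalStatus !== "OK") {
      findings.push({
        cluster: name,
        role: info.clusterRole,
        globalStatus: info.globalStatus,
        errors: info.clusterErrors || [],
      });
    }
  }
  return findings;
}

// Hypothetical sample shaped like status output. In MySQL Shell you would
// pass the result of myclusterset.status({extended: 1}) instead.
const sampleStatus = {
  clusters: {
    clusterone: { clusterRole: "PRIMARY", globalStatus: "OK" },
    clustertwo: {
      clusterRole: "REPLICA",
      globalStatus: "OK_NOT_CONSISTENT",
      clusterErrors: ["placeholder error text, not real output"],
    },
  },
};
const sampleFindings = clustersNeedingRepair(sampleStatus);
```

With the sample above, `sampleFindings` contains a single entry for `clustertwo`. Checking the status output by eye remains the reference method; a filter like this is only a convenience.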

If the cluster is the primary cluster in the InnoDB ClusterSet deployment, before repairing it, you might need to carry out a controlled switchover or an emergency failover to demote it to a replica cluster. After that, you can take the cluster offline if necessary to repair it, and the InnoDB ClusterSet will remain available during that time.

A controlled switchover is suitable if the primary cluster is functioning acceptably but requires maintenance or has minor issues. A primary cluster that is functioning acceptably has the global status OK when you check it using the clusterSet.status() command. Section 8.8, “InnoDB ClusterSet Controlled Switchover” explains how to perform this operation.
An emergency failover is suitable if you cannot contact the primary cluster at all. Section 8.9, “InnoDB ClusterSet Emergency Failover” explains how to perform this operation.
If the primary cluster is not functioning acceptably (with the global status NOT_OK) but it can be contacted, first attempt to repair any issues using the information in this section, because an emergency failover carries the risk of losing transactions and creating a split-brain situation for the InnoDB ClusterSet. If you cannot repair the primary cluster quickly enough to restore availability, proceed with an emergency failover and then repair the cluster if possible.
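The choice between these three options can be summarized as a small decision function. This is a minimal sketch of the guidance above, not part of AdminAPI: the function name and the returned labels are hypothetical, and `reachable` stands for whether you can contact the primary cluster at all.

```javascript
// Sketch only: map the state of the primary cluster to the action suggested above.
// `reachable` is true if the primary cluster can be contacted;
// `globalStatus` is the global status reported for it by clusterSet.status().
function choosePrimaryAction(reachable, globalStatus) {
  if (!reachable) {
    // Cannot contact the primary cluster at all.
    return "emergency-failover";
  }
  if (globalStatus === "OK") {
    // Functioning acceptably, but needs maintenance or has minor issues.
    return "controlled-switchover";
  }
  // Contactable but not functioning acceptably: try to repair it first,
  // and fall back to an emergency failover only if repair takes too long.
  return "repair-first";
}

const suggestedAction = choosePrimaryAction(true, "NOT_OK");
```

Here `suggestedAction` is `"repair-first"`, reflecting that an emergency failover is the last resort for a primary cluster that can still be contacted.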

Follow this procedure to repair an InnoDB Cluster that is part of an InnoDB ClusterSet deployment:

Using MySQL Shell, connect to any member server in the primary cluster or in one of the replica clusters, using an InnoDB Cluster administrator account (created with cluster.setupAdminAccount()). You may also use the InnoDB Cluster server configuration account, which also has the required permissions. When the connection is established, get the ClusterSet object using a dba.getClusterSet() or cluster.getClusterSet() command. It is important to use an InnoDB Cluster administrator account or server configuration account so that the default user account stored in the ClusterSet object has the correct permissions. For example:
```
mysql-js> \connect admin2@127.0.0.1:4410
Creating a session to 'admin2@127.0.0.1:4410'
Please provide the password for 'admin2@127.0.0.1:4410': ********
Save password for 'admin2@127.0.0.1:4410'? [Y]es/[N]o/Ne[v]er (default No):
Fetching schema names for autocompletion... Press ^C to stop.
Closing old connection...
Your MySQL connection id is 42
Server version: 8.0.27-commercial MySQL Enterprise Server - Commercial
No default schema selected; type \use <schema> to set one.
<ClassicSession:admin2@127.0.0.1:4410>
mysql-js> myclusterset = dba.getClusterSet()
<ClusterSet:testclusterset>
```
Check the status of the whole deployment using AdminAPI's clusterSet.status() command in MySQL Shell. Use the extended option to see exactly where and what the issues are. For example:
```
mysql-js> myclusterset.status({extended: 1})
```
For an explanation of the output, see Section 8.7, “InnoDB ClusterSet Status and Topology”.
Still using an InnoDB Cluster administrator account (created with cluster.setupAdminAccount()) or InnoDB Cluster server configuration account, get the Cluster object using dba.getCluster(). You can either connect to any member server in the cluster you are repairing, or connect to any member of the InnoDB ClusterSet and use the name parameter on dba.getCluster() to specify the cluster you want. For example:
```
mysql-js> cluster2 = dba.getCluster()
<Cluster:clustertwo>
```
Check the status of the cluster using AdminAPI's cluster.status() command in MySQL Shell. Use the extended option to get the most details about the cluster. For example:
```
mysql-js> cluster2.status({extended: 2})
```
For an explanation of the output, see Checking a cluster's Status with Cluster.status().
Following an emergency failover, if there is a risk of the transaction sets differing between parts of the ClusterSet, you must fence the cluster from either write traffic or all traffic. Section 8.10.1, “Fencing Clusters in an InnoDB ClusterSet” explains how to fence and unfence a cluster. Fencing is available from MySQL Shell 8.0.28.
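As a hedged illustration of this step, the sketch below wraps the two fencing operations in one helper. It assumes MySQL Shell 8.0.28 or later and a Cluster object such as `cluster2` obtained earlier. The helper name and the `mode` values are hypothetical; only `fenceWrites()` and `fenceAllTraffic()` are AdminAPI calls.

```javascript
// Sketch only: fence a cluster from write traffic or from all traffic.
// `cluster` is a Cluster object, for example the cluster2 handle obtained above.
// `mode` is a hypothetical label: "writes" or "all".
function fenceCluster(cluster, mode) {
  if (mode === "writes") {
    cluster.fenceWrites();      // stop write traffic to the cluster
  } else if (mode === "all") {
    cluster.fenceAllTraffic();  // stop all traffic to the cluster
  } else {
    throw new Error("mode must be 'writes' or 'all'");
  }
  return mode;
}

// Example call in MySQL Shell (not executed here):
//   fenceCluster(cluster2, "writes")
```

See Section 8.10.1 for when each kind of fencing is appropriate and for how to reverse it.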
If the set of transactions (the GTID set) on the cluster is inconsistent, fix this first. The clusterSet.status() command warns you if a replica cluster's GTID set is inconsistent with the GTID set on the primary cluster in the InnoDB ClusterSet. A replica cluster in this state has the global status OK_NOT_CONSISTENT. You also need to check the GTID set on a former primary cluster, or a replica cluster, that has been marked as invalidated during a controlled switchover or emergency failover procedure. A cluster with extra transactions compared to the other clusters in the ClusterSet can continue to function acceptably in the ClusterSet while it stays active. However, a cluster with extra transactions cannot rejoin the ClusterSet. Section 8.10.2, “Inconsistent Transaction Sets (GTID Sets) in InnoDB ClusterSet Clusters” explains how to check for and resolve issues with the transactions on a server.
If there is a technical issue with a member server in the cluster, or with the overall membership of the cluster (such as insufficient fault tolerance or a loss of quorum), you can work with individual member servers or adjust the cluster membership to resolve this. Section 8.10.3, “Repairing Member Servers and Clusters in an InnoDB ClusterSet” explains what operations are available to work with the member servers in a cluster.
If you cannot repair a cluster, you can remove it from the InnoDB ClusterSet using a clusterSet.removeCluster() command. For instructions to do this, see Section 8.10.4, “Removing a Cluster from an InnoDB ClusterSet”. A removed InnoDB Cluster cannot be added back into an InnoDB ClusterSet deployment. If you want to use the server instances in the deployment again, you will need to set up a new cluster using them.
When you have repaired a cluster or carried out the required maintenance, you can rejoin it to the InnoDB ClusterSet using a clusterSet.rejoinCluster() command. This command validates that the cluster is able to rejoin, updates and starts the ClusterSet replication channel, and removes any invalidated status from the cluster. For instructions to do this, see Section 8.10.5, “Rejoining a Cluster to an InnoDB ClusterSet”.
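The final step might be scripted along the following lines. This is a sketch under stated assumptions, not a prescribed procedure: it assumes the ClusterSet object from the first step, uses the `dryRun` option of `clusterSet.rejoinCluster()` to run the validations before making changes, and relies on a failed validation raising an error that stops the function. The helper name is hypothetical, and `clustertwo` is the example cluster name used earlier.

```javascript
// Sketch only: validate, rejoin, then re-check the deployment.
// `clusterSet` is a ClusterSet object, for example myclusterset from the first step.
function rejoinAfterRepair(clusterSet, clusterName) {
  // Dry run: performs the validations without changing anything.
  // Assumption: a failed validation throws, so the real rejoin is skipped.
  clusterSet.rejoinCluster(clusterName, { dryRun: true });

  // Actual rejoin of the repaired cluster.
  clusterSet.rejoinCluster(clusterName);

  // Return the extended status so the caller can confirm the result.
  return clusterSet.status({ extended: 1 });
}

// Example call in MySQL Shell (not executed here):
//   rejoinAfterRepair(myclusterset, "clustertwo")
```

Checking the returned status against Section 8.7 confirms whether the cluster is back in the ClusterSet and no longer marked as invalidated.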