MySQL NDB Cluster 7.6 Release Notes
MySQL NDB Cluster 7.6.13 is a new release of NDB 7.6, based on MySQL Server 5.7 and including features in version 7.6 of the NDB storage engine, as well as fixing recently discovered bugs in previous NDB Cluster releases.
Obtaining NDB Cluster 7.6. NDB Cluster 7.6 source code and binaries can be obtained from https://dev.mysql.com/downloads/cluster/.
For an overview of changes made in NDB Cluster 7.6, see What is New in NDB Cluster 7.6.
This release also incorporates all bug fixes and changes made in previous NDB Cluster releases, as well as all bug fixes and feature changes which were added in mainline MySQL 5.7 through MySQL 5.7.29 (see Changes in MySQL 5.7.29 (2020-01-13, General Availability)).
Important Change: It is now possible to divide a backup into slices and to restore these in parallel, using two new options implemented for the ndb_restore utility. This makes it possible to employ multiple instances of ndb_restore to restore roughly equal-sized subsets of the backup in parallel, which should help to reduce the length of time required to restore an NDB Cluster from backup.
The --num-slices option determines the number of slices into which the backup should be divided; --slice-id provides the ID of the slice (from 0 up to one less than the number of slices) to be restored by a given ndb_restore instance. Up to 1024 slices are supported.
For more information, see the descriptions of the --num-slices and --slice-id options.
(Bug #30383937, WL #10691)
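As an illustration only (not taken from the reference documentation), the following shell sketch runs four ndb_restore processes concurrently, one per slice; the node ID, backup ID, and backup path shown are placeholders that must be replaced with values appropriate to your cluster.

    # Restore backup 10 in four slices, using one ndb_restore process per slice;
    # the node ID, backup ID, and backup path are placeholders.
    for SLICE in 0 1 2 3
    do
        ndb_restore --nodeid=1 --backupid=10 \
                    --backup-path=/var/lib/mysql-cluster/BACKUP/BACKUP-10 \
                    --restore-data \
                    --num-slices=4 --slice-id=$SLICE &
    done
    wait    # wait for all slice restores to complete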
Incompatible Change: The minimum value for the RedoOverCommitCounter data node configuration parameter has been increased from 0 to 1. The minimum value for the RedoOverCommitLimit data node configuration parameter has also been increased from 0 to 1.
You should check the cluster global configuration file and make any necessary adjustments to values set for these parameters before upgrading. (Bug #29752703)
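As a pre-upgrade sketch (the configuration file path is only an example), the following command lists any explicit settings of these parameters so that a value of 0 can be raised to at least 1 before upgrading:

    # List explicit RedoOverCommitCounter / RedoOverCommitLimit settings in the
    # global configuration file; the path shown is an example.
    grep -n -E 'RedoOverCommit(Counter|Limit)' /var/lib/mysql-cluster/config.ini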
Microsoft Windows; NDB Disk Data: On Windows, restarting a data node other than the master when using Disk Data tables led to a failure in TSMAN. (Bug #97436, Bug #30484272)
NDB Disk Data: Compatibility code for the Version 1 disk format used prior to the introduction of the Version 2 format in NDB 7.6 turned out not to be necessary, and is no longer used.
A faulty ndbrequire() introduced when implementing partial local checkpoints assumed that m_participatingLQH must be clear when receiving START_LCP_REQ, which is not necessarily true when the master fails after sending START_LCP_REQ and before handling any START_LCP_CONF signals. (Bug #30523457)
A local checkpoint sometimes hung when the master node failed while sending an LCP_COMPLETE_REP signal, so that the signal reached some nodes but not all of them. (Bug #30520818)
Execution of ndb_restore --rebuild-indexes together with the --rewrite-database and --exclude-missing-tables options did not create indexes for any tables in the target database. (Bug #30411122)
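For reference, a hypothetical invocation using the affected combination of options is shown below; the node ID, backup ID, backup path, and database names are placeholders.

    # Rebuild indexes while renaming database olddb to newdb and skipping tables
    # missing from the target; all values shown are placeholders.
    ndb_restore --nodeid=1 --backupid=10 \
                --backup-path=/var/lib/mysql-cluster/BACKUP/BACKUP-10 \
                --rebuild-indexes \
                --rewrite-database=olddb,newdb \
                --exclude-missing-tables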
When synchronizing extent pages it was possible for the current local checkpoint (LCP) to stall indefinitely if a CONTINUEB signal for handling the LCP was still outstanding when receiving the FSWRITECONF signal for the last page written in the extent synchronization page. The LCP could also be restarted if another page was written from the data pages. It was also possible that this issue caused PREP_LCP pages to be written at times when they should not have been. (Bug #30397083)
If a transaction was aborted while getting a page from the disk page buffer and the disk system was overloaded, the transaction hung indefinitely. This could also cause restarts to hang and node failure handling to fail. (Bug #30397083, Bug #30360681)
References: See also: Bug #30152258.
Data node failures with the error Another node failed during system restart... occurred during a partial restart. (Bug #30368622)
If a SYNC_EXTENT_PAGES_REQ signal was received by PGMAN while dropping a log file group as part of a partial local checkpoint, and thus dropping the page locked by this block for processing next, the LCP terminated due to trying to access the page after it had already been dropped. (Bug #30305315)
The wrong number of bytes was reported in the cluster log for a completed local checkpoint. (Bug #30274618)
References: See also: Bug #29942998.
The number of data bytes for the summary event written in the cluster log when a backup completed was truncated to 32 bits, so that there was a significant mismatch between the number of log records and the number of data records printed in the log for this event. (Bug #29942998)
Using 2 LDM threads on a 2-node cluster with 10 threads per node could result in a partition imbalance, such that one of the LDM threads on each node was the primary for zero fragments. Trying to restore a multi-threaded backup from this cluster failed because the datafile for one LDM contained only the 12-byte data file header, which ndb_restore was unable to read. The same problem could occur in other cases, such as when taking a backup immediately after adding an empty node online.
It was found that this occurred when ODirect was enabled for an EOF backup data file write whose size was less than 512 bytes and the backup was in the STOPPING state. This normally occurs only for an aborted backup, but could also happen for a successful backup for which an LDM had no fragments. The issue is fixed by introducing an additional check to ensure that writes are skipped only if the backup actually contains an error which should cause it to abort. (Bug #29892660)
References: See also: Bug #30371389.
In some cases the SignalSender class, used as part of the implementation of ndb_mgmd and ndbinfo, buffered excessive numbers of unneeded SUB_GCP_COMPLETE_REP and API_REGCONF signals, leading to unnecessary consumption of memory. (Bug #29520353)
References: See also: Bug #20075747, Bug #29474136.
The setting for the BackupLogBufferSize configuration parameter was not honored. (Bug #29415012)
The maximum global checkpoint (GCP) commit lag and GCP save timeout are recalculated whenever a node shuts down, to take into account the change in number of data nodes. This could lead to the unintentional shutdown of a viable node when the threshold decreased below the previous value. (Bug #27664092)
References: See also: Bug #26364729.
A transaction which inserts a child row may run concurrently with a transaction which deletes the parent row for that child. One of the transactions should be aborted in this case, lest an orphaned child row result.
Before committing an insert on a child row, a read of the parent row is triggered to confirm that the parent exists. Similarly, before committing a delete on a parent row, a read or scan is performed to confirm that no child rows exist. When insert and delete transactions were run concurrently, their prepare and commit operations could interact in such a way that both transactions committed. This occurred because the triggered reads were performed using LM_CommittedRead locks (see NdbOperation::LockMode), which are not strong enough to prevent such error scenarios.
This problem is fixed by using the stronger LM_SimpleRead lock mode for both triggered reads. The use of LM_SimpleRead rather than LM_CommittedRead locks ensures that at least one transaction aborts in every possible scenario involving transactions that concurrently insert a child row and delete its parent row.
(Bug #22180583)
Concurrent SELECT and ALTER TABLE statements on the same SQL node could sometimes block one another while waiting for locks to be released. (Bug #17812505, Bug #30383887)