Using Raft Replication in Oracle Globally Distributed Database

Oracle Globally Distributed Database provides built-in fault tolerance with Raft replication, a capability that integrates data replication with transaction execution in a distributed database.

Raft replication enables fast automatic failover with zero data loss. If all shards are in the same data center, it is possible to achieve sub-second failover. Raft replication is active/active; each shard can process reads and writes for a subset of data. This capability provides a uniform configuration with no primary or standby shards.

Raft replication is integrated and transparent to applications. Raft replication provides built-in replication for Oracle Globally Distributed Database without requiring configuration of Oracle Data Guard. Raft replication automatically reconfigures replication in case of shard host failures or when shards are added or removed from the distributed database.

Replication Unit

When Raft replication is enabled, a distributed database contains multiple replication units. A replication unit (RU) is a set of chunks that have the same replication topology. Each RU has three replicas placed on different shards. The Raft consensus protocol is used to maintain consistency between the replicas in case of failures, network partitioning, message loss, or delay.

Each shard contains replicas from multiple RUs. Some of these replicas are leaders and some are followers. Raft replication tries to maintain a balanced distribution of leaders and followers across shards. By default, each shard is the leader for two RUs and a follower for four other RUs. This keeps all shards active and provides optimal utilization of hardware resources.
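
In a running configuration, the distribution of leaders and followers can be inspected from a GDSCTL session connected to the shard catalog. As a minimal sketch, the STATUS RU command lists each replication unit together with the shards that currently host its leader and follower replicas.

  GDSCTL> status ru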

In Oracle Globally Distributed Database, an RU is a set of chunks, as shown in the image below.



The diagram above illustrates the relationship among shards, chunk sets, and chunks. A shard contains a set of chunks. A chunk is a set of table partitions in a given table family. A chunk is the unit of resharding (data movement across shards). A set of chunks that have the same replication topology is called a chunk set.

Raft Group

Each replication unit contains exactly one chunk set and has a leader and a set of followers; these members form a Raft group. The leader and its followers for a replication unit contain replicas of the same chunk set on different shards, as shown below. A shard can be the leader for some replication units and a follower for other replication units.

All DMLs for a particular subset of data are executed on the leader first, and then replicated to its followers.



In the image above, a leader in each shard, indicated by the set of chunks in one color with a star next to it, points to a follower (of the same color) on each of the two other shards.

Replication Factor

The replication factor (RF) determines the number of participants in a Raft group. This number includes the leader and its followers.

An RU needs a majority of its replicas (more than half of the participants, including the leader) to be available in order to accept writes:

  • RF = 3: tolerates one replica failure
  • RF = 5: tolerates two replica failures

Note:

In Oracle Globally Distributed Database, the replication factor is currently limited to three.

In Oracle Globally Distributed Database, the replication factor is specified for the entire distributed database; that is, all replication units in the database have the same RF.

Raft Log

Each RU is associated with a set of Raft logs and OS processes that maintain the logs and replicate changes from the leader to followers. This allows multiple RUs to operate independently and in parallel within a single shard and across multiple shards. It also makes it possible to scale replication up and down by changing the number of RUs.

Changes to data made by a DML are recorded in the Raft log. A commit record is also recorded at the end of each user transaction. Raft logs are maintained independently from redo logs and contain logical changes to rows. Logical replication reduces failover time because followers are open to incoming transactions and can quickly become the leader.

The Raft protocol guarantees that followers receive log records in the same order they are generated by the leader. A user transaction is committed on the leader as soon as half of the followers acknowledge receipt of the commit record and write it to their Raft logs; together with the leader, this constitutes a majority of the replicas.

Transactions

On a busy system, multiple commits are acknowledged at the same time. The synchronous propagation of transaction commit records provides zero data loss. The application of DML change records to followers, however, is done asynchronously to minimize the impact on transaction latency.

Leader Election Process

Per the Raft protocol, if followers do not receive data or a heartbeat from the leader within a specified period of time, a new leader election process begins.

The default heartbeat interval is 150 milliseconds. Election timeouts are randomized (by up to 150 milliseconds) to prevent multiple shards from triggering elections at the same time, which would lead to split votes.

Node Failure

Node failure and recovery are handled in an automated way with minimal impact on the application.

Failover time is under 3 seconds when network latency between Availability Zones is less than 10 milliseconds. This includes failure detection, shard failover, change of leadership, reconnection of the application to the new leader, and continuation of business transactions as before.

The impact of the failure on the application can be further masked by configuring retries in the JDBC driver, so that the end-user experience is a request that takes longer rather than one that returns an error.

The following is an illustration of a distributed database with all three shards in a healthy state. Application requests are able to reach all three shards, and replication between the leaders and followers is ongoing across the shards.



Leader Node Failure

When the leader for a replication unit becomes unavailable, followers will initiate a new leader election process using the Raft protocol.

As long as a majority of the nodes (a quorum) are still healthy, the Raft protocol ensures that a new leader is elected from the available nodes.

When one of the followers succeeds in becoming the new leader, proactive notifications of the leadership change are sent from the shard to the client driver. The client driver then starts routing requests to the new leader shard. Routing clients (such as UCP) are notified using ONS notifications to update their shard and chunk mapping, ensuring that they route traffic to the new leader.

During this failover and reconnection period, the application can be configured to wait and retry by using the retry interval and retry count settings in the JDBC driver configuration. These settings are very similar to those used for Oracle RAC instance failover.
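
As a minimal sketch, connection retries can be requested in the connect descriptor used by the JDBC thin driver. The host, port, service name, and timing values below are illustrative placeholders rather than required settings.

  jdbc:oracle:thin:@(DESCRIPTION=
      (CONNECT_TIMEOUT=90)(RETRY_COUNT=30)(RETRY_DELAY=3)
      (ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=shard-director1)(PORT=1522)))
      (CONNECT_DATA=(SERVICE_NAME=oltp_rw_srvc.orasdb.oradbcloud)))

With a descriptor like this, a connection attempt that fails while a new leader is being elected is retried up to RETRY_COUNT times, RETRY_DELAY seconds apart, before an error is surfaced to the application.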

Upon connecting to the new leader, the application continues to function as before.

The following diagram shows that the first shard has failed, and that the replication unit whose leader was on that shard now has a new leader on the second shard.



Failback

When the original leader comes back online after a failure, it first tries to identify the current leader and attempts to rejoin the cluster as a follower. Once the failed shard rejoins the cluster, it asks the leader for logs, based on its current log index, in order to synchronize with the leader.

Leadership rebalancing can be done by running the SWITCHOVER RU -REBALANCE command, which can also be scripted if needed.

If there is not enough Raft log available on the present leader, the follower must be repopulated from one of the healthy followers using the data copy API (COPY RU).
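
As a minimal sketch, both operations can be run from a GDSCTL session. The replication unit ID and shard names below are illustrative placeholders; confirm the exact COPY RU parameters in the GDSCTL reference for your release.

  GDSCTL> switchover ru -rebalance
  GDSCTL> copy ru -ru 1 -source shard2 -target shard1

COPY RU is needed only when the rejoining replica is too far behind the Raft log retained on the current leader; otherwise the replica catches up from the log, and SWITCHOVER RU -REBALANCE alone restores a balanced distribution of leaders.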

Follower Node Failure

If a follower node becomes unavailable, the leader's attempts to replicate the Raft log to that follower will fail.

The leader will attempt to reach the failed follower indefinitely until the failed follower rejoins or a new follower replaces it.

If needed, a new follower must be created and added to the cluster, and its data synchronized from a healthy follower, as explained above.

Example Raft Replication Deployment

The following diagram shows a simple distributed database Raft replication deployment with 6 shards, 12 replication units, and replication factor = 3.

Each shard has two leaders and four followers. Each member is labeled with an RU number, and the suffix indicates whether it is a leader (-L) or a follower (-Fn). The leaders are also indicated by a star. In this configuration, two shards can take over the load of a failed shard.



To configure this deployment, at the shard catalog creation step you specify GDSCTL CREATE SHARDCATALOG -repl NATIVE. You can specify the replication factor (-repfactor) in the GDSCTL CREATE SHARDCATALOG or ADD SHARDGROUP commands. As with the specification of chunks, you specify the number of replication units in the GDSCTL CREATE SHARDCATALOG or ADD SHARDSPACE commands.
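
As a minimal sketch, the catalog creation command for this topology could look like the following. The catalog connect string, administrator credentials, and region name are illustrative placeholders; any other options should be taken from the GDSCTL reference for your release.

  GDSCTL> create shardcatalog -database cathost:1521/catpdb -user sdb_admin/sdb_admin_pwd -repl NATIVE -repfactor 3 -region region1

The shards themselves are then added and deployed as usual (for example, with ADD SHARD and DEPLOY). Once deployed, Raft replication creates the replication units and distributes their leaders and followers across the six shards.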