45 Oracle Globally Distributed Database MAA Best Practices

Choosing Oracle Globally Distributed Database

Customers choose a distributed database architecture primarily for the following reasons:

  1. Scalability and agility (Hyperscaling)
    • Ability to horizontally scale data, handling increasing data volumes and user loads by adding more shards (nodes) without any application changes
    • Ability to add or drop shards with application transparency
    • Each shard is open and active for the application.
  2. High availability with always-on continuous database availability

    • Automatic failover and data replication across shards to ensure the database remains operational, even if some shards or nodes go down

    • Fault tolerance and isolation across shards, including protection from data center, availability domain (AD), or availability zone (AZ) failures
    • Failover in low single-digit seconds (an extremely low Recovery Time Objective) for database failures
    • Zero data loss
    • MAA observed less than 5 seconds of application impact for shard database instance or node failures. See High Availability Recommendations and Observations.
  3. Seamless distributed application integration

    • A distributed application is initially designed with a sharding key for data operations, and queries may be adjusted to be shard-scope aware.
    • No application changes are required while scaling or adjusting the number of database shards.
    • From the perspective of an application, a distributed database looks like a single database; the number of shards and the distribution of data across those shards are completely transparent to the application.
  4. Consistency and transaction support

    • Support for consistent data reads/writes, with required levels of transactional guarantees (strong, eventual, and so on) across shards as needed by the application.
    • External consistency for sharded applications is a property that guarantees transactions appear to execute as if they occurred one at a time, in an order that respects the real-time, or "wall-clock," order of their commits. This is a critical and challenging requirement for globally distributed and sharded databases, ensuring that all observers see the same, globally-consistent transaction history, regardless of which shard they interact with.
  5. Ease of management and monitoring

    • Simple operations for shard orchestration, monitoring, backup/restore, high availability, software updates, and scaling.

Oracle Globally Distributed Database 23ai with Raft replication provides all the above benefits.

MAA for Distributed Databases is a new class of MAA focused on high availability and application best practices that maximize availability and scalability.

Globally Distributed Database with Raft replication is Oracle's strategic and recommended Distributed Database Solution. For more information about Oracle Globally Distributed Database, see the Oracle Globally Distributed AI Database Guide.

MAA Best Practices for Oracle Globally Distributed Database with Raft Replication

The scope of this topic is MAA best practices for the Oracle Database 23ai feature, Oracle Globally Distributed Database with Raft replication, and does not cover alternative Oracle distributed database options that leverage Oracle Real Application Clusters, Oracle Data Guard, or Oracle GoldenGate.

Note: Applications running with existing MAA architectures or using various Oracle Database features such as Oracle Real Application Clusters, Data Guard, and Oracle GoldenGate should continue to use existing MAA best practices presented in other topics.

Oracle Maximum Availability Architecture (MAA) has compiled best practices that cover the following topics:

  • Architecture and Planning
  • Network Considerations and Best Practices
  • Application Best Practices
  • Configuration Best Practices
  • High Availability Recommendations and Observations
  • Scaling Recommendations and Observations
  • Software Update Recommendations and Observations
  • Monitoring and Alerting

Architecture and Planning

Globally Distributed Database with Raft Replication Architecture Components

The following figure illustrates the major architectural components of Oracle Globally Distributed Database.

These components can be divided into three tiers for ease of understanding.

Figure 45-1 Oracle Globally Distributed Database Architecture



Application Tier

Oracle Globally Distributed Database provides benefits for a variety of applications. Such applications may encompass real-time OLTP systems, globally distributed applications, data streaming platforms, advanced analytics, and machine learning workloads.

Application deployment for a distributed database generally follows the same process as deployment for a regular Oracle Database. Applications connect to the distributed database using the Global Services defined by Global Data Services (GDS).

Oracle connection pools and drivers support Oracle Globally Distributed Database with high availability features to achieve less than 5 seconds of application impact for failures.

Application deployment and configuration best practices for Globally Distributed Database are explained in Application Best Practices.

Routing Tier

Oracle Global Data Services (GDS) manages the routing tier in a distributed database. With the introduction of Oracle Database 23ai, GDS supports Oracle Globally Distributed Database with Raft replication.

GDS provides the framework to manage the distributed database shards, Raft replication, Global Service Manager (GSM) instances, and global services using the GDSCTL utility.

A GSM in a distributed database context is known as a shard director. A shard director is a specific implementation of a GSM that acts as a regional listener for clients that connect to a distributed database. The shard director maintains a current topology map of the distributed database. Based on the sharding key passed during a connection request, the shard director routes the connections to the appropriate shard.

For a distributed database, it is recommended that you configure three global service managers for each region. GDS uses a shard catalog database to store metadata relating to the layout and status of the distributed database configuration. For maximum availability, Oracle recommends that the shard catalog database be deployed independently, and that Oracle's high-availability features, such as Oracle Real Application Clusters (Oracle RAC) and Oracle Data Guard, be used to protect the shard catalog database against outages.

Follow the MAA guidelines in Global Data Services in Oracle Maximum Availability Architecture for the implementation and configuration of GDS.

Data Tier

The data tier consists of the shards, which are Oracle databases deployed on separate hosts. All of the shards together make up a single logical database, called a distributed database. The shards do not share any software or hardware, which provides fault isolation between shards.

Shard databases can be deployed on any platform, and if deployed on Oracle Exadata Database Machine, they can greatly benefit from extremely low latency I/O. Shards on the Exadata platform can also take advantage of Exadata Smart Scan and offload performance benefits, as well as Oracle Exadata's built-in high availability and data protection features. Oracle Restart or Oracle Clusterware can provide restart capabilities if a shard database instance or node goes down.

Each shard has multiple Replication Units (RUs), which are protected by Raft replication with one leader and multiple followers. Raft replication tries to maintain a balanced distribution of leaders and followers across shards. The application transparently reads from and writes to the leader RUs.

Raft replication is configured with a specific replication factor (RF), which determines the number of participants, or replicas, in a Raft group. This number includes the leader and its followers. By default, Raft shards are deployed with RF 3, which means each replication unit has three replicas spread across three shards.

For more details about configuring distributed database shards, see Create the Shard Databases.

Planning

Planning an Oracle Globally Distributed Database deployment with Raft replication requires careful consideration of several key factors.

Data Distribution Method

When Raft replication is selected as the replication method, a critical planning step is the selection of the data distribution strategy.

Begin with analyzing which data distribution method works best for your application. The data distribution method determines how the application schema tables are partitioned. The distribution method also controls the placement of data on the shards.

Raft replication supports system-managed (hash) and composite data distribution methods at this time.

See Data Distribution Methods for more details on each distribution method.

Replication

The replication strategy for a distributed database is chosen primarily based on the application's workload requirements, latency tolerance, and recovery objectives.

Raft replication provides synchronous data replication between the shards, high availability with multiple replicas, data consistency, and near zero recovery time from failures.

This document only covers Globally Distributed Database with Raft replication.

Topology

Proper placement of Globally Distributed Database components plays a key role in optimizing network latency, ensuring availability, and simplifying operations.

Consider the following while planning the placement of distributed database components:

  • Deployment of shards across availability domains or zones to provide best fault isolation
  • Ensure application response time is acceptable when using various database shards. An application can leverage any of the shards during failures or maintenance.
  • Configure the shard catalog database with Oracle Data Guard to ensure high availability and data protection.
  • Multiple shard directors (GSM instances) for high availability and load balancing

Capacity

Each shard should be sized with sufficient system resources to handle its portion of the workload, plus a conservative 30% of additional headroom to handle the workload when at least one shard is down for maintenance or an outage occurs. If the workload increases, resize accordingly.

While planning for network capacity, consider the bandwidth required for application data, inter-shard communication, Raft replication traffic, and backups.

More details about capacity are discussed in Provision and Configure Hosts and Operating Systems.

Finally, establish a clear methodology for managing and operating the distributed environment, including monitoring and failover procedures.

Addressing these aspects during the planning phase ensures a more resilient, scalable, and efficient deployment.

Network Considerations and Best Practices

In Oracle Globally Distributed Database with Raft replication, network latency is an important factor in ensuring acceptable application performance while leveraging lightweight synchronous Raft replication. A transaction is only complete when the Raft leader and half of the followers write the changes to the Raft logs. Commit latency is influenced by the network round trip time between the Raft leader and the slowest follower participating in the quorum to complete the transaction.

Higher network latency can impact user transaction response time and can also influence failure detection and role transition timings.

The following are some of the MAA network best practices for Globally Distributed Database with Raft replication:

  • Ensure that your application workload can be accommodated by available network bandwidth.
  • Ensure that application performance requirements with synchronous Raft replication can be met given the latency between shards.
  • Ensure a stable network connection between all shard replicas.
  • Implement network redundancy to eliminate single points of failure. This can involve redundant network interfaces, switches, and potentially diverse network paths.
  • Define service-level objectives (SLOs): target commit latency, read latency, RPO=0, and RTO for regional loss.
  • Validate network latency between shards using iperf or oratcptest tools.
  • Implement the following network configuration best practices to improve performance and minimize impact, especially if shards are geographically distant:
    • For a network interface that sends or receives Raft replication logical change records, validate optimal MTU size. When sending replication data across geographical regions, experiments have shown that using MTU=9000 can benefit in some network topologies. Conduct performance tests with oratcptest and compare the results with default MTU and MTU=9000 settings. This may reduce some unnecessary network round trips which will increase overall throughput.
    • Increasing the maximum operating system socket buffer size can increase single-process throughput by 2-8 times. Test with different socket buffer sizes on the Raft shard nodes to see what value yields the best results, for example, larger TCP socket buffer sizes of 64MB or 128MB. The Bandwidth Delay Product (BDP) is the product of the network link capacity of a channel and the round trip time, or latency. The minimum recommended value for socket buffer sizes is 3*BDP, especially for a high-latency, high-bandwidth network. Use oratcptest to tune the socket buffer sizes on the shard nodes.
    • The session data unit (SDU) determines the maximum packet size for Oracle Net Services communication. When replicating large amounts of data, a larger SDU provides several benefits. Set the SDU size in the sqlnet.ora and tnsnames.ora files on the shard database nodes. Set the SDU size to a higher value, for example, 65535, and validate with the application workload.

See Assessing and Optimizing Network Performance and Network Evaluation for detailed examples of network performance evaluation and optimization.

Application Best Practices

For OLTP applications, choose a sharding key that provides the best data distribution and avoids transactions spanning across multiple shards.

Direct routing to shards is the recommended application configuration to achieve the best performance. See Direct Routing to a Shard for routing details.

Use the latest Oracle 23ai UCP and JDBC drivers to achieve the best integration with Globally Distributed Database. The latest drivers are required to achieve high availability and extremely low recovery times. For more details about connection pools, drivers, and API compatibility, see APIs Supporting Direct Routing.
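
The following is a minimal UCP sketch of direct routing with a sharding key, based on the UCP sharding key and connection builder APIs. The connect descriptor, credentials, and key value are placeholders, and the key type (VARCHAR2 here) is an assumption to adapt to your schema.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.ShardingKey;

import oracle.jdbc.OracleType;
import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class DirectRoutingExample {
    public static void main(String[] args) throws SQLException {
        // Connection pool that connects through the GDS global service (placeholder URL and credentials).
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=gsm-host)(PORT=1522))"
                + "(CONNECT_DATA=(SERVICE_NAME=oltp_rw_srvc.orasdb.oradbcloud)))");
        pds.setUser("app_user");
        pds.setPassword("app_password");

        // Build the sharding key so the shard director can route the request
        // directly to the shard that owns this key value.
        ShardingKey key = pds.createShardingKeyBuilder()
                .subkey("customer_1001", OracleType.VARCHAR2)
                .build();

        // Borrow a connection that is already routed to the owning shard.
        try (Connection conn = pds.createConnectionBuilder()
                .shardingKey(key)
                .build()) {
            // Shard-scoped work goes here; keep queries scoped to the sharding key.
        }
    }
}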

The following UCP connection pool parameters and recommended values, based on MAA evaluation, help achieve extremely low application brownout and blackout (less than 5 seconds) for various outages; a configuration sketch follows the list.

Note:

The values of certain parameters below may be adjusted to meet application SLA requirements.
  • initialPoolSize
    Recommended value: (Total number of application threads) * (Number of shards)
    Notes: Set the initial pool size to the recommended value. Also review the best practice guidelines for static connection pools from Oracle's Real-World Performance group in Optimizing Real-World Performance with Static Connection Pools.
  • minPoolSize
    Recommended value: Same as initialPoolSize
    Notes: For a static connection pool configuration, keep the value the same as the initial pool size.
  • maxPoolSize
    Recommended value: Same as minPoolSize
    Notes: Allows the maximum connection pool size to be set based on the application requirements to handle peak workload.
  • abandonedConnectionTimeout
    Recommended value: 1 second
    Notes: Enables borrowed connections to be reclaimed by the connection pool after a connection has not been used for a specific amount of time. The default is 0, which disables the feature.
  • connectionValidationTimeout
    Recommended value: 1 second
    Notes: The maximum duration allowed for a connection validation operation when a connection is borrowed from the pool. If the validation does not complete within the specified period, the connection is considered invalid. The default is 15 seconds; a lower value is recommended for faster detection of hung connections, improving availability.
  • connectionWaitDuration
    Recommended value: 1 second
    Notes: Specifies the maximum time, in seconds, that a request waits for an available connection from the pool. The value can also be specified in milliseconds for more aggressive detection of connection availability. This parameter is new in Oracle 23ai, and the ConnectionWaitTimeout parameter is now deprecated.
  • timeoutCheckInterval
    Recommended value: 5 seconds
    Notes: Controls how frequently various timeout properties (such as abandoned connection timeout, time-to-live connection timeout, and inactive connection timeout) are enforced.
  • inactiveConnectionTimeout
    Recommended value: 120 seconds
    Notes: Specifies how long an available connection can remain idle before it is closed and removed from the pool. The default is 0, which disables the feature.
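
As an illustration, the pool settings above map to UCP PoolDataSource setters as sketched below. The setter names are assumed to follow the UCP parameter names in the list, connectionWaitDuration is assumed here to take a java.time.Duration (the 23ai replacement for connectionWaitTimeout), and APP_THREADS and SHARD_COUNT are placeholder constants for your own sizing inputs.

import java.sql.SQLException;
import java.time.Duration;

import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class UcpPoolSettings {
    // Placeholder sizing inputs; derive these from your application.
    private static final int APP_THREADS = 50;
    private static final int SHARD_COUNT = 3;

    public static PoolDataSource configurePool() throws SQLException {
        // URL, user, and password are omitted for brevity in this sketch.
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();

        // Static pool sizing: initial = min = max = application threads * shards.
        int poolSize = APP_THREADS * SHARD_COUNT;
        pds.setInitialPoolSize(poolSize);
        pds.setMinPoolSize(poolSize);
        pds.setMaxPoolSize(poolSize);

        // Timeout settings from the list above (values in seconds).
        pds.setAbandonedConnectionTimeout(1);
        pds.setConnectionValidationTimeout(1);
        pds.setTimeoutCheckInterval(5);
        pds.setInactiveConnectionTimeout(120);

        // connectionWaitDuration (new in 23ai) replaces connectionWaitTimeout.
        pds.setConnectionWaitDuration(Duration.ofSeconds(1));

        return pds;
    }
}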

The following JDBC properties and recommended values apply; a configuration sketch follows the list.

  • FastFailover
    Recommended value: true
    Notes: Enables Fast Application Notification (FAN) aware, high-availability behavior so that connection pools can quickly detect and react to database node or service failures.
  • TcpNoDelay
    Recommended value: true
    Notes: When enabled, packets are sent immediately, without waiting to batch them. Can help reduce latency for OLTP applications.
  • StatementCaching
    Recommended value: 120
    Notes: When enabled, cached statements can be reused instead of being re-parsed by the database, which can greatly increase efficiency.
  • FetchSize
    Recommended value: 30
    Notes: Helps reduce network round trips. Adjust the value based on application requirements.
  • ReadTimeout
    Recommended value: 30
    Notes: Defines the maximum time (in milliseconds) the JDBC driver waits when reading from the database socket. Adjust the value of this parameter according to application timeout requirements.
  • oracle.jdbc.useShardingDriverConnection
    Recommended value: true
    Notes: Tells the Oracle JDBC driver to use the Sharding Driver instead of the standard JDBC Thin driver for the connection. The Sharding Driver is required when connecting to an Oracle Globally Distributed Database.
  • oracle.jdbc.allowSingleShardTransactionSupport
    Recommended value: true
    Notes: When this property is set to 'true', the data source supports local transactions on a single shard when auto-commit is disabled. If set to 'false', transactions are started on the catalog database.
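
As a sketch, the driver-level settings can be supplied as UCP connection properties. The sharding-related property names come from the list above; FastFailover maps to the UCP fast connection failover setting; the statement cache, fetch size, and read timeout entries use commonly documented thin-driver property names with assumed values (oracle.jdbc.ReadTimeout is in milliseconds), so adjust them for your workload and SLAs.

import java.sql.SQLException;
import java.util.Properties;

import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class JdbcDriverSettings {
    public static PoolDataSource configureDriver() throws SQLException {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();

        // FAN-aware fast failover, corresponding to the FastFailover recommendation.
        pds.setFastConnectionFailoverEnabled(true);

        Properties props = new Properties();
        // Sharding-related properties listed above.
        props.setProperty("oracle.jdbc.useShardingDriverConnection", "true");
        props.setProperty("oracle.jdbc.allowSingleShardTransactionSupport", "true");
        // Common thin-driver tuning properties (assumed names and values; adjust per SLA).
        props.setProperty("oracle.jdbc.implicitStatementCacheSize", "120"); // statement caching
        props.setProperty("oracle.jdbc.defaultRowPrefetch", "30");          // fetch size
        props.setProperty("oracle.jdbc.ReadTimeout", "30000");              // 30 seconds, in milliseconds
        pds.setConnectionProperties(props);

        return pds;
    }
}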

The following are some of the errors raised to the application during switchover and failover of Raft replication units and shards.

Applications need to implement exception handling for these errors, and retry logic based on application-specific requirements, to ensure continued operation; see the retry sketch following the list.

  • ORA-03974: Cannot perform DML operations during switchover to shard <shard_id> for replication unit <ru_id>
    Raised during switchover of a replication unit from one shard to another.
  • ORA-03838: Cannot perform DML operations (<DML>) in a follower shard for replication unit <ru_id>, chunk ID <chunk_id>
    Raised after switchover of a replication unit.
  • ORA-03996: Cannot perform DML operations during recovery of the leader for replication unit <ru_id>
    Raised if LCR apply is slow or incomplete on a follower shard, typically during a switchover or failover of the leader shard. This generally happens while performing DMLs with proxy routing rather than direct routing.
  • ORA-05048: Cannot commit transaction in a follower shard for replication unit <ru_id>, transaction ID <txn_id>
    Raised during failover of a replication unit.
  • ORA-45582: The chunk <chunk_id> was not found or has been moved to another shard
    Raised due to a switchover or move operation of a replication unit.
  • ORA-17008: Closed connection
    Typically raised during a shard instance or node failure.
  • ORA-17002: Network adapter could not establish connection
    Raised due to a shard node failure or network error.
  • ORA-01013: User requested cancel of current operation
    When a read timeout or any other timeout fires, UCP cancels the ongoing query and the caller receives ORA-01013.
  • UCP-29: Failed to get a connection
    Raised by UCP when it is unable to acquire a connection because a shard is unavailable due to an instance or node failure.
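
The sketch below shows one possible retry pattern for these errors: it treats the ORA and UCP errors listed above as transient, retries the unit of work a bounded number of times with a short backoff, and rethrows anything else. The attempt count and backoff values are illustrative assumptions, not prescribed values.

import java.sql.Connection;
import java.sql.SQLException;
import java.util.Set;

import javax.sql.DataSource;

public class ShardRetry {
    // ORA error codes from the list above that indicate a transient
    // switchover or failover condition worth retrying.
    private static final Set<Integer> RETRYABLE_ORA_CODES =
            Set.of(3974, 3838, 3996, 5048, 45582, 17008, 17002, 1013);

    @FunctionalInterface
    public interface ShardWork<T> {
        T run(Connection conn) throws SQLException;
    }

    public static <T> T withRetry(DataSource ds, ShardWork<T> work)
            throws SQLException, InterruptedException {
        final int maxAttempts = 3;      // illustrative
        final long backoffMillis = 500; // illustrative
        for (int attempt = 1; ; attempt++) {
            try (Connection conn = ds.getConnection()) {
                return work.run(conn);
            } catch (SQLException e) {
                if (!isRetryable(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(backoffMillis * attempt);
            }
        }
    }

    private static boolean isRetryable(SQLException e) {
        // UCP-29 surfaces without an ORA error code, so match on the message text.
        return RETRYABLE_ORA_CODES.contains(e.getErrorCode())
                || (e.getMessage() != null && e.getMessage().contains("UCP-29"));
    }
}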

Configuration Best Practices

Raft Logs Sizing and Configuration

In Raft replication, changes are captured as part of the user transactions and stored in a replication log, also known as Raft log. Similar to Online Redo Logs or Standby Redo Logs, Raft logs contain committed and uncommitted user transactions. A user transaction is considered committed on the leader when half of the followers have acknowledged the commit record and persisted it in their Raft logs.

Each replication unit is tied to a set of Raft logs and operating system processes that manage the Raft logs and replicate changes from the leader to the followers. This design enables multiple replication units to function independently and in parallel, both within a single shard and across multiple shards. It also provides flexibility to scale replication up or down by adjusting the number of Replication Units. This architecture not only improves concurrency and fault isolation but also enables elastic scalability and high performance.

Sizing and placement of Raft logs is critical to maintaining predictable performance and recovery of the shard databases. Raft logs are created alongside the shard database files, in the location controlled by the db_create_file_dest parameter. Place Raft logs on an extremely low latency, high-IOPS storage tier for optimal performance.

By default, each replication unit is created with three Raft logs, which are rotated in a circular fashion. The Raft protocol ensures that followers receive log records in the exact order generated by the leader. In normal operation the followers receive a continuous stream of changes from the leader. If one of the followers falls behind in Raft replication, or is restarted because of a planned or unplanned outage, the leader sends the missed change records from its Raft logs once the replication processes are up on the follower's shard.

If a follower shard remains down for an extended period and the leader Raft logs are rotated or overwritten due to application workload, the follower will be unable to automatically resynchronize with the missed changes after it restarts. In this situation a recovery of the replication units on the follower shard is required, which is typically done using the COPY RU GDSCTL command from another follower replication unit which is consistent with the leader. It is best to avoid this situation due to the overhead and time it takes to recover.

The following guidelines apply to the shard database instance parameters that manage Raft logs.

  • shard_raft_logfile_size
    Default value: 1GB. Recommendation: size the Raft logs equal to the shard database redo logs.
    Notes: A recommended starting point is to size the Raft logs to match the size of the database redo logs. The size can be better estimated by analyzing the rotation time of each Raft log during peak application workload.
  • _shard_raft_log_retention_time_mins
    Default value: Null. Recommendation: 1440 (24 hours).
    Notes: Set the Raft log retention window large enough to accommodate planned and unplanned outages of shards. Retaining Raft logs allows leaders to send changes to followers to automatically resync after an outage.
  • _shard_max_num_raft_logfile
    Default value: 10. Recommendation: 32.
    Notes: Controls the total number of Raft logs per RU. Enforced when the _shard_raft_log_retention_time_mins parameter value is set. Setting the value to a higher number is recommended to accommodate peak workload and the retention window. The maximum value is 32.
Note:

If the number of Raft logs is not large enough to accommodate the retention time frame, Raft stops the replication unit and raises error ORA-05097: Unable to overwrite raft log for replication unit <> in the shard instance alert log.

Additional Configuration Best Practices

The following are additional configuration best practices for Globally Distributed Database with Raft Replication:

  • Size shard database and Raft logs storage tier to accommodate sufficient I/O capacity for:
    • Application workload
    • RMAN backups and remote restores for MOVE RU or COPY RU operations.
    • Exadata provides additional I/O bandwidth, extremely low commit latency, Exadata Smart Flash, and Exadata smart offloading.
  • Deploy multiple shard directors for high availability in each availability domain or zone.
  • Configure Data Guard Fast-Start Failover for shard catalog database protection and availability.
  • Deploy additional shards for flexibility, capacity management, and workload balancing.
  • Size shards so that performance remains acceptable, even when one shard is down for maintenance or an outage occurs.
  • Benchmark Raft synchronous replication performance under different workloads.
  • Implement a mechanism to automatically detect Raft shard instance failures and restart.
    • Oracle Restart, Exadata, or Oracle Clusterware provides these features and can also manage single-instance databases.
  • Place Raft shards across independent failure domains and availability domains to match your failure isolation goals and latency budget.

High Availability Recommendations and Observations

Validating the high availability and resiliency features of the Globally Distributed Database is essential. Oracle MAA has conducted rigorous and extensive testing of Globally Distributed Database with Raft replication and provides the following recommendations and considerations.

  • Add alert and automatic restart capabilities for the shard database instances after planned and unplanned outages.

    Oracle Restart or Oracle Clusterware provides this functionality and works seamlessly to restart the local listener or shard database instance.

  • Evaluate the following outage scenarios to validate that your application fails over transparently in your distributed database:
    • Shard database instance or node failure
    • Shard restart to simulate a software update maintenance window
  • Raft replication heartbeat loss timeout and election timeout values are automatically managed by the database and should not be changed using internal parameters unless directed by Oracle Support due to extensive heartbeat failures.

    The default heartbeat interval is between 150-300 milliseconds, with randomized election timeouts (up to 150 milliseconds) to prevent multiple shards from triggering elections at the same time.

Oracle MAA has observed the following during tests of distributed database Raft shards under database instance and shard node failure conditions.

Shard Database Instance Failure

  • Shard database instance failure impact is around two seconds with a brief application workload brownout.
  • Leader RUs automatically fail over to the follower RUs on the surviving shards.
  • Oracle Restart or Oracle Clusterware detects failure and automatically restarts the shard database instance.
  • After restart, the follower RUs on the failed shard instance automatically resync with the replication stream by receiving changes from the leader RUs.

Figure 45-2 Application TPS impact during shard database instance failure



Shard Node Failure

  • Shard node failure impact is typically less than five seconds with a larger application workload brownout for a short duration.
  • Leader RUs automatically fail over to the follower RUs on the surviving shards.
  • After a shard node recovers from failure, Oracle Restart or Oracle Clusterware automatically restarts the shard database instance and local listener.
  • Following a restart, the follower RUs on the failed shard node automatically resync with the replication stream by applying changes received from the leader RUs.

Figure 45-3 Application TPS impact during shard node failure



Best Practices for Shard Database Instance and Node Failure

  • Switch leader RUs back to the recovered shard at the next available opportunity to balance the application workload.
  • Avoid long-duration outages of shard database instances and nodes.

    If a shard is down for an extended period and the leader Raft logs are rotated, the failed shard cannot resync with the replication stream automatically. In this scenario, a recovery of all the follower RUs on the shard is required. Automatic detection and restart of shards can help avoid this situation.

  • Implement exception handling for specific error messages raised during shard database instance and node failures.

Scaling Recommendations and Observations

In a Globally Distributed Database with Raft replication, shards can be added or removed from the configuration for scalability and load balancing. The ADD SHARD or REMOVE SHARD operation can be performed online with no database downtime. Raft replication manages the data distribution and relocation automatically.

During the ADD SHARD process new RUs are created, and CHUNK tablespaces from existing RUs are relocated to new RUs. The new RUs are then moved to the shard being added.

Before a shard is removed, its RUs need to be relocated to shards that do not already host a replica of the same RU.

See Scaling with Raft Replication for detailed steps showing how to add or remove a shard from the distributed database configuration.

Oracle MAA has performed extensive testing of Raft shard scaling under various workloads and provides the following considerations and recommendations for shard scalability.

Recommendations for Adding Shards

  1. Evaluate network latency between the new and existing shards to quantify potential impact on application response time after replication units are moved to the new shard.
  2. Ensure there is sufficient I/O and CPU capacity to allow RU movement to the new shard. An RU move to a new shard is performed using remote RMAN restore and recovery processes initiated from the new shard.
  3. Initiate the ADD SHARD process during a low application activity or maintenance window to minimize impact. A MOVE RU operation can potentially cause multiple brief application brownouts under high workload.

Recommendations for Removing Shards

  1. Ensure that the target shards have enough compute, storage, and IOPS capacity to absorb the RUs from the shard being removed.
  2. Check if the application workload will remain balanced across the surviving shards after redistribution of RUs.
  3. Initiate the REMOVE SHARD process during a low application activity or maintenance window to minimize impact. A MOVE RU operation can potentially cause multiple brief application brownouts under high workload.

Planned Maintenance or Software Update Recommendations and Observations

Planned maintenance, including software updates, can occur frequently for any of the Oracle Globally Distributed Database components.

The following MAA best practices and guidelines can provide zero application downtime during planned maintenance for Oracle Globally Distributed Database with Raft replication.

Multiple GSM instances deployed, following MAA best practices, can provide a seamless rolling software update experience for applications connecting to Globally Distributed Database using GDS.

For a shard catalog database deployed with Oracle RAC and Data Guard, following MAA high availability best practices enables zero-downtime maintenance.

Replication Unit Leader Switchover

In a balanced configuration, all of the shard databases have one or more RU leaders, and applications read from and write to these leaders.

It is highly recommended that you switch RU leaders from a shard before planned maintenance activity. Raft replication unit switchover can be done in a graceful manner with a specified timeout value to allow application transactions to complete on the shard before switching over and enabling the leadership of RU on a different shard.

The following example workflow illustrates the switchover of an RU leader from one shard to another.

In this example a switchover of RU # 1 is performed from shard1 to shard2.

Before the switchover, the RU #1 leader is on shard1, as the STATUS RU output in step 1 shows.



  1. Check the RU status to make sure the RU followers are in a healthy state.
    GDSCTL> status ru -ru 1
    
    Replication units
    ------------------------
    Database  RU#   Role      Term  Log Index Status
    --------  ---   -----     ----  --------- -----
    shard1     1    Leader     155  123242128  Ok
    shard2     1    Follower   155  123242128  Ok
    shard3     1    Follower   155  123242128  Ok
  2. Run the GDSCTL SWITCHOVER RU command.

    This command changes which member (specified by the shard it resides on) of the RU is the leader. The -timeout value allows transactions to complete. The target shard RU will not be available for read-write until after the previous leader becomes read-only.

    GDSCTL> switchover ru -ru 1 -shard shard2 -timeout 1 
    
    Switchover process has been started
    Switchover process completed
    The operation completed successfully
    

    After the switchover command completes, the RU #1 leader is on shard2, as the STATUS RU output in step 3 shows.



  3. Check the status of RU # 1.
    GDSCTL> status ru -ru 1
    
    Replication units
    ------------------------
    Database  RU#   Role      Term  Log Index Status
    --------  ---   -----     ----  --------- -----
    shard1     1    Follower   155  123242128  Ok
    shard2     1    Leader     155  123242128  Ok
    shard3     1    Follower   155  123242128  Ok

Planned Maintenance Best Practices

Best practice guidelines for planned maintenance:

  • Perform the maintenance on shards in a rolling fashion.
  • Validate shard capacity to make sure that the shards can accommodate the additional workload when RU leaders are switched over from the shard undergoing maintenance.
    • AHF, the OSWatcher tool, and other operating system monitoring commands can be used for capacity analysis.
  • Check Raft replication apply lag and any apply errors using GDSCTL commands or the dictionary views specified in Globally Distributed Database Health Check and Monitoring on each shard.
    • If there is a high Raft replication apply lag, replication unit switchover may take longer or fail.
    • Address any apply lag issues before switching over replication units.
  • Validate leadership status of the RUs on the shard going under maintenance, and switch RU leaders to a different shard database with sufficient capacity.
  • Perform a health check before bringing down the next shard during rolling shard maintenance.
    • After a shard comes back online, Raft replication automatically starts resyncing, but it can take some time to become current, depending on the number of changes and duration of outage.
    • Validate the RU follower status and log index on the shard after it comes back online using the GDSCTL utility. The RU follower log index needs to be the same as, or very close to, the leader log index for a follower to be current.
    • Validate the apply lag for the replication units on the shard after it starts applying the Raft replication changes.
  • Switch some of the RU leaders back to the updated shard to balance RU leaders and application workload while going through rolling maintenance of shards.
  • Normalize and balance the remaining RUs at the end of the rolling maintenance.
    • For 3 shards and 6 replication units, a minimum of 10 switchovers are required for rebalancing, and 12 switchovers are required if rebalancing with affinity or putting RU leaders back on the original shard.

Rolling Shard Maintenance Example Workflow

The following example illustrates rolling shard maintenance for Globally Distributed Database with 3 shards and 6 RUs.

The shard names used in the example are shard1, shard2, and shard3. RU numbers in the example are 1 through 6.

In a balanced configuration, each shard hosts 2 RU leaders and 4 RU followers.

Perform the following steps to complete this task:

Step 1: Verify shard health and RU leadership status

Check the health and status of all the RUs across all shards using GDSCTL:

GDSCTL> status ru -sort

Run the apply lag check query specified in Globally Distributed Database Health Check and Monitoring to make sure there is no apply lag issue on any of the shards.

Proceed with the next steps only after verifying that all RUs report an OK status in the output of the STATUS RU command, and that the apply lag is within a few seconds.

Step 2: Switch over RU leaders from the first shard

On the first shard, which is going under maintenance, check the RU leadership status and switch over the RU leaders from the first shard to other shards.

GDSCTL> status ru -leaders -sort

Replication units
------------------------
Database                      RU#  Role      Term Log Index Status
--------                      ---  ----      ---- --------- ------
shard1                        1    Leader    1    1         Ok
shard1                        2    Leader    1    1         Ok
shard2                        3    Leader    2    2         Ok
shard2                        4    Leader    2    2         Ok
shard3                        5    Leader    2    2         Ok
shard3                        6    Leader    2    2         Ok

To check the RU leader status only on shard1, run the following command:

GDSCTL> status ru -leaders -shard shard1

Switch over all the RU leaders from shard1 to other shards using the timeout option.

GDSCTL> switchover ru -ru 1 -shard shard2 -timeout 1
GDSCTL> switchover ru -ru 2 -shard shard3 -timeout 1

Validate that no RU leaders remain on the first shard.

GDSCTL> status ru -leaders -shard shard1

Step 3: Perform maintenance on the first shard and restart

Shard1 can be shut down for maintenance and restarted once the activity is complete.

Step 4: Ensure that RU followers on the first shard resync with Raft replication

When the shard is brought back online, all RU members on it assume the follower role. These RU followers automatically receive updates from the RU leaders and begin resynchronization.

You must allow the RU followers to complete synchronization, ensuring their log index is aligned with, or very close to, those of the RU leaders before proceeding.

See Globally Distributed Database Health Check and Monitoring for more details about how to run the GDSCTL STATUS RU command and explanation of the log index used by the RUs.

GDSCTL> status ru -sort

GDSCTL> status ru -shard shard1

Step 5: Switch over RU leaders from the second shard

The second shard should have 3 RU leaders, as one of the leaders was switched from shard1 in step 2.

Switch over all the RU leaders from shard2 to other shards.

GDSCTL> status ru -leaders -sort

Replication units
------------------------
Database                      RU#  Role      Term Log Index Status
--------                      ---  ----      ---- --------- ------
shard2                        1    Leader    1    1         Ok
shard3                        2    Leader    1    1         Ok
shard2                        3    Leader    2    2         Ok
shard2                        4    Leader    2    2         Ok
shard3                        5    Leader    2    2         Ok
shard3                        6    Leader    2    2         Ok

To check the RU leader status only on shard2, run the following command:

GDSCTL> status ru -leaders -shard shard2

Switch over all the RU leaders from shard2 to other shards using the timeout option. RU #1 should be switched back to shard1 so that its leadership is relocated back to its original shard.

GDSCTL> switchover ru -ru 1 -shard shard1 -timeout 1
GDSCTL> switchover ru -ru 3 -shard shard1 -timeout 1
GDSCTL> switchover ru -ru 4 -shard shard3 -timeout 1

Validate that no RU leaders remain on the second shard.

GDSCTL> status ru -leaders -shard shard2

Step 6: Perform maintenance on the second shard and restart

Shard2 can be shut down for maintenance and restarted once the activity is complete.

Step 7: Ensure RU followers on the second shard resync with Raft replication

After maintenance on shard2, you must allow the RU followers to complete synchronization, ensuring their log index is aligned with, or very close to, those of the RU leaders before proceeding.

GDSCTL> status ru -sort

GDSCTL> status ru -shard shard2

Step 8: Switch over RU leaders from the third shard

GDSCTL> status ru -leaders -sort

Replication units
------------------------
Database                      RU#  Role      Term Log Index Status
--------                      ---  ----      ---- --------- ------
shard1                        1    Leader    1    1         Ok
shard3                        2    Leader    1    1         Ok
shard1                        3    Leader    2    2         Ok
shard3                        4    Leader    2    2         Ok
shard3                        5    Leader    2    2         Ok
shard3                        6    Leader    2    2         Ok

To check the RU leader status only on shard3, run the following command:

GDSCTL> status ru -leaders -shard shard3

Switch over all the RU leaders from shard3 to other shards using the timeout option. The RU #2 leader should be switched back to shard1, and the RU #4 leader back to shard2, so that each is relocated back to its original shard.

GDSCTL> switchover ru -ru 2 -shard shard1 -timeout 1
GDSCTL> switchover ru -ru 4 -shard shard2 -timeout 1
GDSCTL> switchover ru -ru 5 -shard shard1 -timeout 1
GDSCTL> switchover ru -ru 6 -shard shard2 -timeout 1

Validate that no RU leaders remain on the third shard.

GDSCTL> status ru -leaders -shard shard3

Step 9: Perform maintenance on the third shard and restart

Shard3 can be shut down for maintenance and restarted once the activity is complete.

Step 10: Ensure RU followers on the third shard resync with Raft replication

After maintenance on shard3, you must allow the RU followers to complete synchronization, ensuring their log index is aligned with, or very close to, those of the RU leaders before proceeding.

GDSCTL> status ru -sort

GDSCTL> status ru -shard shard3

Step 11: Rebalance RU leaders across all shards

Once maintenance is complete on all shards, the RU leaders should be rebalanced to their original locations to restore even distribution of load.

GDSCTL> status ru -leaders -sort

Replication units
------------------------
Database                      RU#  Role      Term Log Index Status
--------                      ---  ----      ---- --------- ------
shard1                        1    Leader    1    1         Ok
shard1                        2    Leader    1    1         Ok
shard1                        3    Leader    2    2         Ok
shard2                        4    Leader    2    2         Ok
shard1                        5    Leader    2    2         Ok
shard2                        6    Leader    2    2         Ok

Switch over the RU leaders from shard1 and shard2 so that each shard again hosts two RU leaders, matching the original distribution.

GDSCTL> switchover ru -ru 3 -shard shard2 -timeout 1
GDSCTL> switchover ru -ru 5 -shard shard3 -timeout 1
GDSCTL> switchover ru -ru 6 -shard shard3 -timeout 1

Alternatively, the SWITCHOVER command can be used with the -rebalance option to relocate all the RUs back to their original shard locations using a single command. The SWITCHOVER -rebalance command is aware of the original affinity of the RUs and issues multiple switchover commands in the background.

GDSCTL> switchover ru -rebalance

Validate that all the RUs are balanced after the switchover completes.

GDSCTL> status ru -sort

GDSCTL> status ru -leaders -sort

Globally Distributed Database Health Check and Monitoring

  • Raft replication process information and any specific error messages are typically reported in the shard database instance alert log. Additional information is logged in the shard database instance debug log, debug_<sid>.log, located in the directory $DIAG_DIR/diag/rdbms/<dbname>/<sid_name>/log, for example:
    • Raft replication process state change information.
    • RU leader and follower heartbeat failures and role transition messages.
      • For example, "SNR role change RU_ID 1. CANDIDATE. Reason heartbeat loss"
  • Implement comprehensive network monitoring to track bandwidth utilization, latency, and potential network issues that could impact performance.
    • Configure alerts to notify administrators of any network problems affecting Raft replication.
  • Monitor RU health status and apply lag.
  • Use shard database AWR/ASH reports for capacity tracking and performance troubleshooting.
  • Establish runbooks for Raft RU leader placement in a balanced configuration, and for shard addition and removal.

The following are some useful GDSCTL commands to check the overall health and status of RUs.

Example 45-1 Check the health of all of the replication units across all the shards

To check the health of all of the replication units across all the shards, run the following command.

The STATUS RU command shows the Status and Log Index of each RU. The Log Index represents the latest Raft log entry index that the RU has acknowledged. The RU leader assigns log entries and replicates them to the RU followers. Under normal operation, the RU follower Log Index is the same as, or very close to, the RU leader Log Index.

GDSCTL> status ru -sort

Example 45-2 Check the status of all the replication unit leaders

GDSCTL> status ru -leaders -sort

Example 45-3 Check the CHUNK allocation to all the replication units

GDSCTL> status ru -show_chunks -sort

Example 45-4 Check the detailed status of a replication unit and any errors

GDSCTL> status ru -ru 1
GDSCTL> ru -ru 1 -show_errors

Example 45-5 Check the status of any in progress GDS tasks or jobs

GDSCTL> config task

Example 45-6 Cancel or resume any GDS task

GDSCTL> alter task -resume -task <no>
GDSCTL> alter task -cancel -task <no>

Some useful shard database queries:

Example 45-7 Check the apply lag

To check apply lag, log in to the shard PDB as a privileged user and execute the following SQL:

SELECT
   ru_id,
   state,
   apply_name,
   case
   when (hwm_lcr_create_time > hwm_time) then 0
   else NVL(( hwm_time - hwm_lcr_create_time ) * 86400, 0)  end as latency_in_seconds,
   to_char(hwm_lcr_create_time, 'HH24:MI:SS MM/DD/YY') AS message_creation_time,
   to_char(hwm_time, 'HH24:MI:SS MM/DD/YY')            AS apply_time,
   hwm_lcr_log_index,
   processed_lcr_log_index
 FROM
   v$shard_apply_coordinator;

Example 45-8 Check transport lag

To check the transport lag, log in to the shard PDB as a privileged user and execute the following SQL:

SELECT
   lcr_prdcr.ru_id   AS ru_id,
   ntwk_sndr.remote_peer_id as remote_peer_id,
   '' as follower_shard,
   ntwk_sndr.startup_time   AS network_sender_startup,
   lcr_prdcr.startup_time   AS lcr_prdcr_startup,
   ntwk_sndr.state  AS network_sender_state,
   lcr_prdcr.state     AS lcr_prdcr_state, 
   lcr_prdcr.flwctrl_peers as flwctrl_peers,
   case
   when (ntwk_sndr.last_sent_lcr_time > lcr_prdcr.last_enqueued_lcr_time) then 0
   else nvl( abs(lcr_prdcr.last_enqueued_lcr_time - ntwk_sndr.last_sent_lcr_time) * 86400, 0) end AS transport_lag_seconds,
   ( lcr_prdcr.last_enqueued_lcr_log_index - last_sent_lcr_log_index )               AS transport_lag_logindexes,
   ntwk_sndr.elapsed_propagation_time elapsed_propagation_time,
   ntwk_sndr.connect_error_count connect_error_count
FROM
   v$shard_lcr_producer   lcr_prdcr,
   v$shard_network_sender ntwk_sndr
WHERE
   lcr_prdcr.ru_id = ntwk_sndr.ru_id;

Example 45-9 Check the status of the Raft logs for all the replication units

To check the status of the Raft logs for all the replication units, log in to the shard PDB and execute the following SQL queries.

The CURR_FILE column with value 'Y' indicates which Raft log is the current file for each replication unit.

set lines 200 pages 200
col name for a95
col start_time_timestamp for a35
select ru_id, name, rotd_idx, size_gb, curr_file, timestamp'1970-01-01 00:00:00' + numtodsinterval ( start_sec, 'second' ) start_time_timestamp
from v$shard_lcr_logs order by ru_id, curr_file, rotd_idx;

set lines 200 pages 200
col name for a95
col start_time_timestamp for a35
select ru_id, rotd_idx, size_gb, curr_file, STIDX_LOW, STIDX_HIGH, ENDIDX_LOW, ENDIDX_HIGH, timestamp'1970-01-01 00:00:00' + numtodsinterval ( start_sec, 'second' ) start_time_timestamp
from v$shard_lcr_logs order by ru_id, curr_file, rotd_idx;