Reimaging a Compute Server
If a compute server is irretrievably damaged, then you must replace it and reimage the replacement server. During the reimaging procedure, the other compute servers in the cluster are available. When adding the new server to the cluster, you copy the software from a working compute server to the new server.
The following tasks describe how to reimage a compute server:
Contacting Oracle Support Services
Open a support request with Oracle Support Services. The support engineer identifies the failed server and sends you a replacement. The support engineer also asks for the output from the imagehistory command, run from a working compute server. The output provides a link to the computeImageMaker file that was used to image the original compute server, and is used to restore the system.
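For reference, a minimal sketch of gathering that output: imagehistory is run as the root user on a working compute server, and the complete output is what the support engineer needs.

   # imagehistory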
Removing the Failed Compute Server from the Cluster
You must remove the failed compute server from Oracle Real Application Clusters (Oracle RAC).
In these steps, working_server is a working compute server in the cluster, failed_server is the compute server being replaced, and replacement_server is the new server.
To remove a failed compute server from the Oracle RAC cluster:
1. Log in to working_server as the oracle user.
2. Disable and stop the listener that runs on the failed server:

   $ srvctl disable listener -n failed_server
   $ srvctl stop listener -n failed_server
3. Delete the Oracle home directory from the inventory:

   $ cd $ORACLE_HOME/oui/bin
   $ ./runInstaller -updateNodeList ORACLE_HOME= \
     /u01/app/oracle/product/12.1.0/dbhome_1 "CLUSTER_NODES=list_of_working_servers"

   In the preceding command, list_of_working_servers is a list of the compute servers that are still working in the cluster, such as ra01db02, ra01db03, and so on.
4. Verify that the failed server was deleted (that is, unpinned) from the cluster:

   $ olsnodes -s -t
   ra01db01  Inactive  Unpinned
   ra01db02  Active    Unpinned
5. Stop and delete the virtual IP (VIP) resources for the failed compute server:

   # srvctl stop vip -i failed_server-vip
   PRCC-1016 : failed_server-vip.example.com was already stopped

   # srvctl remove vip -i failed_server-vip
   Please confirm that you intend to remove the VIPs failed_server-vip (y/[n]) y
6. Delete the compute server from the cluster:

   # crsctl delete node -n failed_server
   CRS-4661: Node failed_server successfully deleted.

   If you receive an error message similar to the following, then relocate the voting disks:

   CRS-4662: Error while trying to delete node ra01db01.
   CRS-4000: Command Delete failed, or completed with errors.
To relocate the voting disks:
   a. Determine the current location of the voting disks. The sample output shows that the current location is DBFS_DG:

      # crsctl query css votedisk
      ##  STATE   File Universal Id  File Name                                Disk group
      --  -----   -----------------  ---------                                ----------
       1. ONLINE  123456789abab      (o/192.168.73.102/DATA_CD_00_ra01cel07)  [DBFS_DG]
       2. ONLINE  123456789cdcd      (o/192.168.73.103/DATA_CD_00_ra01cel08)  [DBFS_DG]
       3. ONLINE  123456789efef      (o/192.168.73.100/DATA_CD_00_ra01cel05)  [DBFS_DG]
      Located 3 voting disk(s).
   b. Move the voting disks to another disk group:

      # ./crsctl replace votedisk +DATA
      Successful addition of voting disk 2345667aabbdd.
      ...
      CRS-4266: Voting file(s) successfully replaced
   c. Return the voting disks to the original location. This example returns them to DBFS_DG:

      # ./crsctl replace votedisk +DBFS_DG
   d. Repeat the crsctl command to delete the server from the cluster.
7. Update the Oracle inventory:

   $ cd $ORACLE_HOME/oui/bin
   $ ./runInstaller -updateNodeList ORACLE_HOME=/u01/app/12.1.0/grid \
     "CLUSTER_NODES=list_of_working_servers" CRS=TRUE
8. Verify that the server was deleted successfully:

   $ cluvfy stage -post nodedel -n failed_server -verbose

   Performing post-checks for node removal
   Checking CRS integrity...
   The Oracle clusterware is healthy on node "ra01db02"
   CRS integrity check passed
   Result: Node removal check passed
   Post-check for node removal was successful.
See Also:
Oracle Real Application Clusters Administration and Deployment Guide for information about deleting a compute server from a cluster
Preparing the USB Flash Drive for Imaging
Use a USB flash drive to copy the image to the new compute server.
To prepare the USB flash drive for use:
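A hedged outline of this task, assuming the standard computeImageMaker workflow referenced in "Contacting Oracle Support Services"; the archive name, the dl360 directory, and the makeImageMedia.sh invocation are assumptions, so confirm them against the README packaged with your computeImageMaker file:

   # unzip computeImageMaker_version.LINUX.X64.tar.zip
   # tar -xvf computeImageMaker_version.LINUX.X64.tar
   # cd dl360
   # ./makeImageMedia.sh

Run these commands as root on a working compute server with the USB flash drive inserted; makeImageMedia.sh writes bootable installation media to the drive.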
Copying the Image to the New Compute Server
Before you perform the following procedure, replace the failed compute server with the new server. See Expanding a Recovery Appliance Rack with Additional Storage Servers.
To load the image onto the replacement server:
1. Insert the USB flash drive into the USB port on the replacement server.

2. Log in to the console through the service processor to monitor progress (a sketch of one way to attach to the console follows these steps).

3. Power on the compute server, either by physically pressing the power button or by using Oracle ILOM.

4. If you replaced the motherboard:

   a. Press F2 during the BIOS boot process.

   b. Select BIOS Setup.

   c. Set the boot order to the USB flash drive first, and then the RAID controller.

   Otherwise, press F8 during the BIOS boot process, select the one-time boot selection menu, and choose the USB flash drive.

5. Allow the system to start.

   As the system starts, it detects the CELLUSBINSTALL media. The imaging process has two phases; let both phases complete before proceeding to the next step.

   The first phase identifies any BIOS or firmware that is out of date, and upgrades the components to the level expected by the image. If any components are upgraded or downgraded, then the system restarts automatically.

   The second phase installs the factory image on the replacement compute server.

6. Remove the USB flash drive when the system prompts you.

7. Press Enter to power off the server.
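For step 2, a minimal sketch of attaching to the host console through Oracle ILOM over SSH; the -ilom host name suffix is an assumption, so use the ILOM address assigned to your replacement server:

   $ ssh root@replacement_server-ilom
   -> start /SP/console

The start /SP/console command attaches the host serial console, which lets you watch the imaging progress from power-on onward.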
Configuring the Replacement Compute Server
The replacement compute server does not have host names, IP addresses, DNS settings, or NTP settings. This task describes how to configure the replacement compute server.
The information must be the same on all compute servers in Recovery Appliance. You can obtain the IP addresses from the DNS. You should also have a copy of the Installation Template from the initial installation.
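As a hedged illustration, you can check the addresses that DNS returns for the replacement server from any host that uses the same name servers; the host name below is an example:

   $ nslookup ra01db01.example.com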
To configure the replacement compute server:
Note:
- If the compute server does not use all of the network interfaces, then the configuration process stops, with a warning that some network interfaces are disconnected. It prompts whether to retry the discovery process. Respond with yes or no, as appropriate for the environment.

- If bonding is used for the ingest network, then it is now set in the default active-passive mode.
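A hedged way to confirm the bonding mode from the operating system; the bondib0 interface name is an assumption based on the BONDIB0 interface referenced later in this chapter:

   # cat /proc/net/bonding/bondib0

In active-passive mode, the output includes the line "Bonding Mode: fault-tolerance (active-backup)".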
Preparing the Replacement Compute Server for the Cluster
The initial installation of Recovery Appliance modified various files.
To modify the files on the replacement compute server:
1. Replicate the contents of the following files from a working compute server in the cluster:

   a. Copy the /etc/security/limits.conf file.

   b. Merge the contents of the /etc/hosts files.

   c. Copy the /etc/oracle/cell/network-config/cellinit.ora file, and update the IP address with the IP address of the BONDIB0 interface on the replacement compute server.

   d. Copy the /etc/oracle/cell/network-config/cellip.ora file, and configure additional network requirements, such as 10 GbE.

   e. Copy the /etc/modprobe.conf file.

   f. Copy the /etc/sysctl.conf file.

   g. Restart the compute server, so that the network changes take effect.
2. Set up the Oracle software owner on the replacement compute server by adding the user name to one or more groups. The owner is usually the oracle user.

   a. Obtain the current group information from a working compute server:

      # id oracle
      uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),1002(dba),1003(oper),1004(asmdba)

   b. Use the groupadd command to add the group information to the replacement compute server. This example adds the groups identified in the previous step:

      # groupadd -g 1001 oinstall
      # groupadd -g 1002 dba
      # groupadd -g 1003 oper
      # groupadd -g 1004 asmdba

   c. Obtain the current user information from a working compute server:

      # id oracle
      uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),1002(dba),1003(oper),1004(asmdba)

   d. Add the user information to the replacement compute server. This example adds the group IDs from the previous step to the oracle user ID:

      # useradd -u 1000 -g 1001 -G 1001,1002,1003,1004 -m -d /home/oracle -s /bin/bash oracle

   e. Create the ORACLE_BASE and Grid Infrastructure directories. This example creates /u01/app/oracle and /u01/app/12.1.0/grid:

      # mkdir -p /u01/app/oracle
      # mkdir -p /u01/app/12.1.0/grid
      # chown -R oracle:oinstall /u01/app

   f. Change the ownership of the cellip.ora and cellinit.ora files. The owner is typically oracle:dba.

      # chown -R oracle:dba /etc/oracle/cell/network-config
   g. Secure the restored compute server:

      # chmod u+x /opt/oracle.SupportTools/harden_passwords_reset_root_ssh
      # /opt/oracle.SupportTools/harden_passwords_reset_root_ssh

      The compute server restarts.

   h. Log in as the root user. When you are prompted for a new password, set it to match the root password of the other compute servers.

   i. Set the password for the Oracle software owner. The owner is typically oracle.

      # passwd oracle
3. Set up SSH for the oracle account:

   a. Change to the oracle account on the replacement compute server:

      # su - oracle

   b. Create the dcli group file on the replacement compute server, listing the servers in the Oracle cluster (see the example after these steps).

   c. Run the setssh-Linux.sh script on the replacement compute server. This example runs the script in silent mode:

      $ /opt/oracle.SupportTools/onecommand/setssh-Linux.sh -s

      The script prompts for the oracle password on the servers. The -s option causes the script to run in silent mode.

   d. Change to the oracle user on the replacement compute server:

      # su - oracle

   e. Verify SSH equivalency:

      $ dcli -g dbs_group -l oracle date
4. Set up or copy any custom login scripts from the working compute server to the replacement compute server:

   $ scp .bash* oracle@replacement_server:.

   In the preceding command, replacement_server is the name of the new server, such as ra01db01.
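For step 3b, a hedged example of a dcli group file; the dbs_group file name and the server names are examples consistent with the rest of this document. The file lists one compute server host name per line:

   ra01db01
   ra01db02

The dcli -g dbs_group -l oracle date command in step 3e then runs date on every listed server; a response from each server, with no password prompts, confirms SSH equivalency.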
Applying Patch Bundles to a Replacement Compute Server
Oracle periodically releases software patch bundles for Recovery Appliance. If the working compute server has a patch bundle that is later than the release of the computeImageMaker file, then you must apply the patch bundle to the replacement compute server.

To determine whether a patch bundle was applied, use the imagehistory command, and compare the information on the replacement compute server with the information on a working compute server. If the working compute server has a later release, then apply that patch bundle to the replacement compute server.
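A hedged sketch of that comparison, assuming root SSH access between the servers; the host names are the examples used throughout this document:

   # ssh root@ra01db02 imagehistory > /tmp/imagehistory_working
   # ssh root@ra01db01 imagehistory > /tmp/imagehistory_replacement
   # diff /tmp/imagehistory_working /tmp/imagehistory_replacement

If the diff output shows that the working compute server has a later release, apply the corresponding patch bundle to the replacement compute server before continuing.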
Cloning the Oracle Grid Infrastructure
The following procedure describes how to clone the Oracle Grid Infrastructure onto the replacement compute server. In the commands, working_server is a working compute server, and replacement_server is the replacement compute server.

To clone the Oracle Grid Infrastructure:
1. Log in as root to a working compute server in the cluster.

2. Verify the hardware and operating system installation by using the cluster verification utility (cluvfy):

   $ cluvfy stage -post hwos -n replacement_server,working_server -verbose

   The phrase "Post-check for hardware and operating system setup was successful" should appear at the end of the report.
3. Verify peer compatibility:

   $ cluvfy comp peer -refnode working_server -n replacement_server \
     -orainv oinstall -osdba dba | grep -B 3 -A 2 mismatched

   The following is an example of the output:

   Compatibility check: Available memory [reference node: ra01db02]
   Node Name   Status                   Ref. node status         Comment
   ----------  -----------------------  -----------------------  ----------
   ra01db01    31.02GB (3.2527572E7KB)  29.26GB (3.0681252E7KB)  mismatched
   Available memory check failed

   Compatibility check: Free disk space for "/tmp" [reference node: ra01db02]
   Node Name   Status                   Ref. node status         Comment
   ----------  -----------------------  -----------------------  ----------
   ra01db01    55.52GB (5.8217472E7KB)  51.82GB (5.4340608E7KB)  mismatched
   Free disk space check failed

   If the only failed components are related to the physical memory, swap space, and disk space, then it is safe for you to continue.
4. Perform the requisite checks for adding the server:

   a. Ensure that the GRID_HOME/network/admin/samples directory has its permissions set to 750.

   b. Validate the addition of the compute server:

      $ cluvfy stage -ignorePrereq -pre nodeadd -n replacement_server \
        -fixup -fixupdir /home/oracle/fixup.d

      If the only failed component is related to swap space, then it is safe for you to continue.

      You might get an error about a voting disk, similar to the following:

      ERROR:
      PRVF-5449 : Check of Voting Disk location "o/192.168.73.102/ \
      DATA_CD_00_ra01cel07(o/192.168.73.102/DATA_CD_00_ra01cel07)" \
      failed on the following nodes:
      Check failed on nodes:
      ra01db01
      ra01db01:No such file or directory
      ...
      PRVF-5431 : Oracle Cluster Voting Disk configuration check failed

      If this error occurs, then use the -ignorePrereq option when running the addnode script in the next step.
5. Add the replacement compute server to the cluster:

   $ cd /u01/app/12.1.0/grid/addnode/
   $ ./addnode.sh -silent "CLUSTER_NEW_NODES={replacement_server}" \
     "CLUSTER_NEW_VIRTUAL_HOSTNAMES={replacement_server-vip}" [-ignorePrereq]

   The addnode script causes Oracle Universal Installer to copy the Oracle Clusterware software to the replacement compute server. A message like the following is displayed:

   WARNING: A new inventory has been created on one or more nodes in this session.
   However, it has not yet been registered as the central inventory of this system.
   To register the new inventory please run the script at
   '/u01/app/oraInventory/orainstRoot.sh' with root privileges on nodes 'ra01db01'.
   If you do not register the inventory, you may not be able to update or patch the
   products you installed.

   The following configuration scripts need to be executed as the "root" user in
   each cluster node:
   /u01/app/oraInventory/orainstRoot.sh #On nodes ra01db01
   /u01/app/12.1.0/grid/root.sh #On nodes ra01db01
6. Run the configuration scripts:

   a. Open a terminal window.

   b. Log in as the root user.

   c. Run the scripts on each cluster server.

   After the scripts are run, the following message is displayed:

   The Cluster Node Addition of /u01/app/12.1.0/grid was successful.
   Please check '/tmp/silentInstall.log' for more details.
7. Run the orainstRoot.sh and root.sh scripts:

   # /u01/app/oraInventory/orainstRoot.sh
   Creating the Oracle inventory pointer file (/etc/oraInst.loc)
   Changing permissions of /u01/app/oraInventory.
   Adding read,write permissions for group.
   Removing read,write,execute permissions for world.
   Changing groupname of /u01/app/oraInventory to oinstall.
   The execution of the script is complete.

   # /u01/app/12.1.0/grid/root.sh

   Check the log files in /u01/app/12.1.0/grid/install/ for the output of the root.sh script. The output file reports that the listener resource on the replaced compute server failed to start. This is an example of the expected output:

   /u01/app/12.1.0/grid/bin/srvctl start listener -n ra01db01 \
   ...Failed
   /u01/app/12.1.0/grid/perl/bin/perl \
   -I/u01/app/12.1.0/grid/perl/lib \
   -I/u01/app/12.1.0/grid/crs/install \
   /u01/app/12.1.0/grid/crs/install/rootcrs.pl execution failed
8. Reenable the listener resource that you stopped in "Removing the Failed Compute Server from the Cluster":

   # GRID_HOME/grid/bin/srvctl enable listener -l LISTENER \
     -n replacement_server
   # GRID_HOME/grid/bin/srvctl start listener -l LISTENER \
     -n replacement_server
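As a hedged follow-up check, srvctl status listener (a standard Grid Infrastructure command) confirms that the listener is now running on the replacement server:

   # GRID_HOME/grid/bin/srvctl status listener -l LISTENER \
     -n replacement_server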
See Also:
Oracle Real Application Clusters Administration and Deployment Guide for information about cloning