Reimaging a Compute Server

If a compute server is irretrievably damaged, then you must replace it and reimage the replacement server. During the reimaging procedure, the other compute servers in the cluster are available. When adding the new server to the cluster, you copy the software from a working compute server to the new server.

The following tasks describe how to reimage a compute server:

Contacting Oracle Support Services

Open a support request with Oracle Support Services. The support engineer identifies the failed server and sends you a replacement. The support engineer also asks for the output from the imagehistory command, run from a working compute server. The output provides a link to the computeImageMaker file that was used to image the original compute server; you use this file to restore the system.
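
For reference, you can run the command as the root user on any working compute server. The following is a minimal sketch of the output; the field values are placeholders, and the exact fields vary by image release:

    # imagehistory
    Version                  : release
    Image activation date    : date
    Imaging mode             : fresh
    Imaging status           : success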

Downloading the Latest Release of the Cluster Verification Utility

The latest release of the cluster verification utility (cluvfy) is available from My Oracle Support Doc ID 316817.1.
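
If you are not sure which release of the utility is currently installed, you can check it before downloading a newer one. This is a minimal sketch; it assumes cluvfy is in the PATH of the Oracle Grid Infrastructure home:

    $ cluvfy -version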

Removing the Failed Compute Server from the Cluster

You must remove the failed compute server from Oracle Real Application Clusters (Oracle RAC).

In these steps, working_server is a working compute server in the cluster, failed_server is the compute server being replaced, and replacement_server is the new server.

To remove a failed compute server from the Oracle RAC cluster:

  1. Log in to working_server as the oracle user.

  2. Disable and stop the listener that runs on the failed server:

    $ srvctl disable listener -n failed_server
    $ srvctl stop listener -n failed_server
    
  3. Delete the Oracle home directory from the inventory:

    $ cd $ORACLE_HOME/oui/bin
    $ ./runInstaller -updateNodeList \
      ORACLE_HOME=/u01/app/oracle/product/12.1.0/dbhome_1 "CLUSTER_NODES=list_of_working_servers"
    

    In the preceding command, list_of_working_servers is a list of the compute servers that are still working in the cluster, such as ra01db02, ra01db03, and so on.
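
    For example, assuming the remaining working servers are ra01db02 and ra01db03 (the host names and Oracle home path are illustrative), the command might look like this:

    $ ./runInstaller -updateNodeList \
      ORACLE_HOME=/u01/app/oracle/product/12.1.0/dbhome_1 \
      "CLUSTER_NODES=ra01db02,ra01db03"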

  4. Verify that the failed server was deleted—that is, unpinned—from the cluster:

    $ olsnodes -s -t
    
    ra01db01        Inactive        Unpinned
    ra01db02        Active          Unpinned
    
  5. Stop and delete the virtual IP (VIP) resources for the failed compute server:

    # srvctl stop vip -i failed_server-vip
    PRCC-1016 : failed_server-vip.example.com was already stopped
    
    # srvctl remove vip -i failed_server-vip
    Please confirm that you intend to remove the VIPs failed_server-vip (y/[n]) y
    
  6. Delete the compute server from the cluster:

    # crsctl delete node -n failed_server
    CRS-4661: Node failed_server successfully deleted.
    

    If you receive an error message similar to the following, then relocate the voting disks.

    CRS-4662: Error while trying to delete node ra01db01.
    CRS-4000: Command Delete failed, or completed with errors.
    

    To relocate the voting disks:

    1. Determine the current location of the voting disks. The sample output shows that the current location is DBFS_DG.

      # crsctl query css votedisk
      
      ##  STATE    File Universal Id          File Name                Disk group
      --  -----    -----------------          ---------                ----------
      1. ONLINE   123456789abab (o/192.168.73.102/DATA_CD_00_ra01cel07) [DBFS_DG]
      2. ONLINE   123456789cdcd (o/192.168.73.103/DATA_CD_00_ra01cel08) [DBFS_DG]
      3. ONLINE   123456789efef (o/192.168.73.100/DATA_CD_00_ra01cel05) [DBFS_DG]
      Located 3 voting disk(s).
      
    2. Move the voting disks to another disk group:

      # ./crsctl replace votedisk +DATA
      
      Successful addition of voting disk 2345667aabbdd.
      ...
      CRS-4266: Voting file(s) successfully replaced
      
    3. Return the voting disks to the original location. This example returns them to DBFS_DG:

      # ./crsctl replace votedisk +DBFS_DG
      
    4. Repeat the crsctl command to delete the server from the cluster.

  7. Update the Oracle inventory:

    $ cd $ORACLE_HOME/oui/bin
    $ ./runInstaller -updateNodeList ORACLE_HOME=/u01/app/12.1.0/grid \
      "CLUSTER_NODES=list_of_working_servers" CRS=TRUE
    
  8. Verify that the server was deleted successfully:

    $ cluvfy stage -post nodedel -n failed_server -verbose
    
    Performing post-checks for node removal
    Checking CRS integrity...
    The Oracle clusterware is healthy on node "ra01db02"
    CRS integrity check passed
    Result:
    Node removal check passed
    Post-check for node removal was successful.
    

See Also:

Oracle Real Application Clusters Administration and Deployment Guide for information about deleting a compute server from a cluster

Preparing the USB Flash Drive for Imaging

Use a USB flash drive to copy the image to the new compute server.

To prepare the USB flash drive for use:

  1. Insert a blank USB flash drive into a working compute server in the cluster.
  2. Log in as the root user.
  3. Unzip the computeImageMaker file, and then extract the archive:
    # unzip computeImageMaker_release_LINUX.X64_release_date.platform.tar.zip
    
    # tar -xvf computeImageMaker_release_LINUX.X64_release_date.platform.tar
    
  4. Load the image onto the USB flash drive:
    # cd dl360
    # ./makeImageMedia.sh -dualboot no
    

    The makeImageMedia.sh script prompts for information.

  5. Remove the USB flash drive from the compute server.
  6. Remove the unzipped dl360 directory and the computeImageMaker file from the working compute server. The directory and file require about 2 GB of disk space.
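
    For example, assuming the files were unpacked in the root user's home directory (the location is an assumption; use the directory where you extracted them):

    # cd /root
    # rm -rf dl360
    # rm computeImageMaker_release_LINUX.X64_release_date.platform.tar.zip
    # rm computeImageMaker_release_LINUX.X64_release_date.platform.tar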

Copying the Image to the New Compute Server

Before you perform the following procedure, replace the failed compute server with the new server. See Expanding a Recovery Appliance Rack with Additional Storage Servers.

To load the image onto the replacement server:

  1. Insert the USB flash drive into the USB port on the replacement server.

  2. Log in to the console through the service processor to monitor progress.

  3. Power on the compute server either by physically pressing the power button or by using Oracle ILOM.

  4. If you replaced the motherboard:

    1. Press F2 during startup to enter the BIOS.

    2. Select BIOS Setup

    3. Set the boot order so that the USB flash drive is first, followed by the RAID controller.

    Otherwise, press F8 during startup, select the one-time boot selection menu, and choose the USB flash drive.

  5. Allow the system to start.

    As the system starts, it detects the CELLUSBINSTALL media. The imaging process has two phases. Let both phases complete before proceeding to the next step.

    The first phase of the imaging process identifies any BIOS or firmware that is out of date, and upgrades the components to the expected level for the image. If any components are upgraded or downgraded, then the system automatically restarts.

    The second phase of the imaging process installs the factory image on the replacement compute server.

  6. Remove the USB flash drive when the system prompts you.

  7. Press Enter to power off the server.

Configuring the Replacement Compute Server

The replacement compute server does not have host names, IP addresses, DNS, or NTP settings configured. This task describes how to configure the replacement compute server.

The information must be the same on all compute servers in Recovery Appliance. You can obtain the IP addresses from the DNS. You should also have a copy of the Installation Template from the initial installation.
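
For example, you can confirm the addresses registered in DNS before you begin. This is a minimal sketch; the host name and domain are illustrative:

    # nslookup ra01db01.example.com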

To configure the replacement compute server:

  1. Assemble the following information:
    • Name servers

    • Time zone, such as America/Chicago

    • NTP servers

    • IP address information for the management network

    • IP address information for the client access network

    • IP address information for the InfiniBand network

    • Canonical host name

    • Default gateway

  2. Power on the replacement compute server. When the system starts, it automatically runs the configuration script and prompts for information.
  3. Enter the information when prompted, and confirm the settings. The startup process then continues.

Note:

  • If the compute server does not use all network interfaces, then the configuration process stops with a warning that some network interfaces are disconnected. It prompts whether to retry the discovery process. Respond with yes or no, as appropriate for the environment.

  • If bonding is used for the ingest network, then it is initially set to the default active-passive mode.
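
    To confirm the bonding mode after the configuration completes, you can inspect the bond interface. This is a minimal sketch; the interface name bondeth0 is an assumption and varies by configuration:

    # grep 'Bonding Mode' /proc/net/bonding/bondeth0
    Bonding Mode: fault-tolerance (active-backup)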

Preparing the Replacement Compute Server for the Cluster

The initial installation of Recovery Appliance modified various configuration files. You must make the same modifications on the replacement compute server.

To modify the files on the replacement compute server:

  1. Replicate the following configuration files and settings from a working compute server in the cluster:

    1. Copy the /etc/security/limits.conf file.

    2. Merge the contents of the /etc/hosts files.

    3. Copy the /etc/oracle/cell/network-config/cellinit.ora file.

    4. Update the IP address in the copied cellinit.ora file with the IP address of the BONDIB0 interface on the replacement compute server (see the sketch at the end of this step).

    5. Copy the /etc/oracle/cell/network-config/cellip.ora file.

    6. Configure additional network requirements, such as 10 GbE.

    7. Copy the /etc/modprobe.conf file.

    8. Copy the /etc/sysctl.conf file.

    9. Restart the compute server, so the network changes take effect.
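
    The following sketch shows one way to find the BONDIB0 address referenced in the cellinit.ora update above; the interface name bondib0 and the addresses shown are illustrative:

      # ip addr show bondib0 | grep 'inet '
          inet 192.168.10.1/22 brd 192.168.11.255 scope global bondib0
      # grep ipaddress /etc/oracle/cell/network-config/cellinit.ora
      ipaddress1=192.168.10.1/22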

  2. Set up the Oracle software owner on the replacement compute server by adding the user name to one or more groups. The owner is usually the oracle user.

    1. Obtain the current group information from a working compute server:

      # id oracle
      uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),1002(dba),1003(oper),1004(asmdba)
      
    2. Use the groupadd command to add the group information to the replacement compute server. This example adds the groups identified in the previous step:

      # groupadd -g 1001 oinstall
      # groupadd -g 1002 dba
      # groupadd -g 1003 oper
      # groupadd -g 1004 asmdba
      
    3. Obtain the current user information from a working compute server:

      # id oracle
      uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),1002(dba),1003(oper),1004(asmdba)
      
    4. Add the user information to the replacement compute server. This example adds the group IDs from the previous step to the oracle user ID:

      # useradd -u 1000 -g 1001 -G 1001,1002,1003,1004 -m -d /home/oracle -s \
        /bin/bash oracle
      
    5. Create the ORACLE_BASE and Grid Infrastructure directories. This example creates /u01/app/oracle and /u01/app/12.1.0/grid:

      # mkdir -p /u01/app/oracle
      # mkdir -p /u01/app/12.1.0/grid
      # chown -R oracle:oinstall /u01/app
      
    6. Change the ownership of the cellip.ora and cellinit.ora files. The owner is typically oracle:dba.

      # chown -R oracle:dba /etc/oracle/cell/network-config
      
    7. Secure the restored compute server:

      # chmod u+x /opt/oracle.SupportTools/harden_passwords_reset_root_ssh
      # /opt/oracle.SupportTools/harden_passwords_reset_root_ssh
      

      The compute server restarts.

    8. Log in as the root user. When you are prompted for a new password, set it to match the root password of the other compute servers.

    9. Set the password for the Oracle software owner. The owner is typically oracle.

      # passwd oracle
      
  3. Set up SSH for the oracle account:

    1. Change to the oracle account on the replacement compute server:

      # su - oracle
      
    2. Create the dcli group file on the replacement compute server, listing the servers in the Oracle cluster (a sample file appears at the end of this procedure).

    3. Run the setssh-Linux.sh script on the replacement compute server. This example runs the script in silent mode:

      $ /opt/oracle.SupportTools/onecommand/setssh-Linux.sh -s
      

      The script prompts for the oracle password on the servers. The -s option causes the script to run in silent mode.

    4. Change to the oracle user on the replacement compute server:

      # su - oracle
      
    5. Verify SSH equivalency:

      $ dcli -g dbs_group -l oracle date
      
  4. Set up or copy any custom login scripts from the working compute server to the replacement compute server:

    $ scp .bash* oracle@replacement_server:. 
    

    In the preceding command, replacement_server is the name of the new server, such as ra01db01.
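
The dcli group file created in step 3 might look like the following; the file location and the host names are illustrative:

    $ cat /home/oracle/dbs_group
    ra01db01
    ra01db02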

Applying Patch Bundles to a Replacement Compute Server

Oracle periodically releases software patch bundles for Recovery Appliance. If the working compute server has a patch bundle that is later than the release of the computeImageMaker file, then you must apply the patch bundle to the replacement compute server.

To determine whether a patch bundle was applied, use the imagehistory command. Compare the information on the replacement compute server to the information on the working compute server. If the working compute server has a later release, then apply the same patch bundle to the replacement compute server.
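
One way to compare the image releases of the two servers side by side is with the dcli utility. This is a minimal sketch; the host names are illustrative, and it assumes that root SSH equivalence is configured between the compute servers:

    # dcli -l root -c ra01db02,ra01db01 imagehistory | grep -i version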

Cloning the Oracle Grid Infrastructure

The following procedure describes how to clone the Oracle Grid Infrastructure onto the replacement compute server. In the commands, working_server is a working compute server, and replacement_server is the replacement compute server.

To clone the Oracle Grid Infrastructure:

  1. Log in as root to a working compute server in the cluster.

  2. Verify the hardware and operating system installation using the cluster verification utility (cluvfy):

    $ cluvfy stage -post hwos -n replacement_server,working_server -verbose
    

    The phrase Post-check for hardware and operating system setup was successful should appear at the end of the report.

  3. Verify peer compatibility:

    $ cluvfy comp peer -refnode working_server -n replacement_server  \
      -orainv oinstall -osdba dba | grep -B 3 -A 2 mismatched
    

    The following is an example of the output:

    Compatibility check: Available memory [reference node: ra01db02]
    Node Name     Status                   Ref. node status         Comment
    ------------  -----------------------  -----------------------  ----------
    ra01db01      31.02GB (3.2527572E7KB)  29.26GB (3.0681252E7KB)  mismatched
    Available memory check failed

    Compatibility check: Free disk space for "/tmp" [reference node: ra01db02]
    Node Name     Status                   Ref. node status         Comment
    ------------  -----------------------  -----------------------  ----------
    ra01db01      55.52GB (5.8217472E7KB)  51.82GB (5.4340608E7KB)  mismatched
    Free disk space check failed
    

    If the only failed components are related to the physical memory, swap space, and disk space, then it is safe for you to continue.

  4. Perform the requisite checks for adding the server:

    1. Ensure that the GRID_HOME/network/admin/samples directory has permissions set to 750.

    2. Validate the addition of the compute server:

      $ cluvfy stage -ignorePrereq -pre nodeadd -n replacement_server \
      -fixup -fixupdir  /home/oracle/fixup.d
       

      If the only failed component is related to swap space, then it is safe for you to continue.

      You might get an error about a voting disk similar to the following:

      ERROR: 
      PRVF-5449 : Check of Voting Disk location "o/192.168.73.102/ \
      DATA_CD_00_ra01cel07(o/192.168.73.102/DATA_CD_00_ra01cel07)" \
      failed on the following nodes:
      Check failed on nodes: 
              ra01db01
              ra01db01:No such file or directory
      ...
      PRVF-5431 : Oracle Cluster Voting Disk configuration check failed
      

      If this error occurs, then use the -ignorePrereq option when running the addnode script in the next step.

  5. Add the replacement compute server to the cluster:

    $ cd /u01/app/12.1.0/grid/addnode/
    $ ./addnode.sh -silent "CLUSTER_NEW_NODES={replacement_server}" \
      "CLUSTER_NEW_VIRTUAL_HOSTNAMES={replacement_server-vip}"[-ignorePrereq]
    

    The addnode script causes Oracle Universal Installer to copy the Oracle Clusterware software to the replacement compute server. A message like the following is displayed:

    WARNING: A new inventory has been created on one or more nodes in this session.
    However, it has not yet been registered as the central inventory of this
    system. To register the new inventory please run the script at
    '/u01/app/oraInventory/orainstRoot.sh' with root privileges on nodes
    'ra01db01'. If you do not register the inventory, you may not be able to 
    update or patch the products you installed.
    
    The following configuration scripts need to be executed as the "root" user in
    each cluster node:
     
    /u01/app/oraInventory/orainstRoot.sh #On nodes ra01db01
     
    /u01/app/12.1.0/grid/root.sh #On nodes ra01db01
    
  6. Run the configuration scripts:

    1. Open a terminal window.

    2. Log in as the root user.

    3. Run the scripts on each cluster server.

    After the scripts are run, the following message is displayed:

    The Cluster Node Addition of /u01/app/12.1.0/grid was successful.
    Please check '/tmp/silentInstall.log' for more details.
    
  7. Run the orainstRoot.sh and root.sh scripts:

    # /u01/app/oraInventory/orainstRoot.sh
    Creating the Oracle inventory pointer file (/etc/oraInst.loc)
    Changing permissions of /u01/app/oraInventory.
    Adding read,write permissions for group.
    Removing read,write,execute permissions for world.
    Changing groupname of /u01/app/oraInventory to oinstall.
    The execution of the script is complete.
     
    # /u01/app/12.1.0/grid/root.sh
    

    Check the log files in /u01/app/12.1.0/grid/install/ for the output of the root.sh script. The output file reports that the listener resource on the replaced compute server failed to start. This is an example of the expected output:

    /u01/app/12.1.0/grid/bin/srvctl start listener -n ra01db01 \
    ...Failed
    /u01/app/12.1.0/grid/perl/bin/perl \
    -I/u01/app/12.1.0/grid/perl/lib \
    -I/u01/app/12.1.0/grid/crs/install \
    /u01/app/12.1.0/grid/crs/install/rootcrs.pl execution failed
    
  8. Reenable and start the listener resource that you disabled and stopped in "Removing the Failed Compute Server from the Cluster":

    # GRID_HOME/bin/srvctl enable listener -l LISTENER \
      -n replacement_server

    # GRID_HOME/bin/srvctl start listener -l LISTENER \
      -n replacement_server
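
    To confirm that the listener is running on the replacement server, a quick status check might look like this; the listener name LISTENER is an assumption:

    # GRID_HOME/bin/srvctl status listener -l LISTENER -n replacement_server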

Cloning Oracle Database Homes to the Replacement Compute Server

To clone the Oracle Database homes to the replacement server:

  1. Add Oracle Database ORACLE_HOME to the replacement compute server:
    $ cd /u01/app/oracle/product/12.1.0/dbhome_1/addnode/
    $ ./addnode.sh -silent "CLUSTER_NEW_NODES={replacement_server}" -ignorePrereq
    

    The addnode script causes Oracle Universal Installer to copy the Oracle Database software to the replacement compute server.

    WARNING: The following configuration scripts need to be executed as the "root"
    user in each cluster node.
    /u01/app/oracle/product/12.1.0/dbhome_1/root.sh #On nodes ra01db01
    To execute the configuration scripts:
    Open a terminal window.
    Log in as root.
    Run the scripts on each cluster node.
     

    After the scripts are finished, the following messages appear:

    The Cluster Node Addition of /u01/app/oracle/product/12.1.0/dbhome_1 was
    successful.
    Please check '/tmp/silentInstall.log' for more details.
    
  2. Run the root.sh script on the replacement compute server:
    # /u01/app/oracle/product/12.1.0/dbhome_1/root.sh
     

    Check the /u01/app/oracle/product/12.1.0/dbhome_1/install/root_replacement_server.company.com_date.log file for the output of the script.

  3. Ensure that the instance parameters are set for the replaced database instance. The following is an example for the CLUSTER_INTERCONNECTS parameter.
    SQL> SHOW PARAMETER cluster_interconnects
    
    NAME                                 TYPE        VALUE
    ------------------------------       --------    -------------------------
    cluster_interconnects                string
     
    SQL> ALTER SYSTEM SET cluster_interconnects='192.168.73.90' SCOPE=spfile SID='dbm1';
    
  4. Validate the configuration files, and correct them as necessary (a sample check appears after this procedure):
    • The ORACLE_HOME/dbs/initSID.ora file must point to the server parameter file (SPFILE) in the Oracle ASM shared storage.

    • The password file that was copied to the ORACLE_HOME/dbs directory must be renamed to orapwSID.

  5. Restart the database instance.
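
As referenced in step 4, a minimal check of the configuration files might look like the following; the SID (dbm1), the disk group, and the file names are illustrative:

    $ cat $ORACLE_HOME/dbs/initdbm1.ora
    SPFILE='+DATA/dbm/spfiledbm.ora'
    $ ls -l $ORACLE_HOME/dbs/orapwdbm1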