Using the Storage Server Rescue Procedure

Each storage server maintains a copy of its software on a USB flash drive. Whenever the system configuration changes, the server updates the copy on the USB flash drive. You can use this drive to recover the server after a hardware replacement or a software failure. A restore is needed when the system disks fail, the operating system has a corrupt file system, or the boot area is damaged. You can replace the disks, cards, CPU, memory, and so forth, and then recover the server. You can also insert the USB flash drive in a different server, and that server will duplicate the old server.

If only one system disk fails, then use CellCLI commands to recover. In the rare event that both system disks fail simultaneously, use the rescue functionality provided on the storage server CELLBOOT USB flash drive.

First Steps Before Rescuing the Storage Server

Before rescuing a storage server, you must take steps to protect the data that is stored on it. Those steps depend on whether the system is set up with normal redundancy or high redundancy.

If the Server Has Normal Redundancy

If you are using normal redundancy, then the server has one mirror copy. If that single mirror also fails during the rescue procedure, then the data could be irrecoverably lost.

Oracle recommends that you duplicate the mirror copy:

  1. Make a complete backup of the data in the mirror copy.

  2. Take the mirror copy server offline immediately, to prevent any new data changes to it before attempting a rescue.

This procedure ensures that all data residing on the grid disks on the failed server and its mirror copy is inaccessible during the rescue procedure.

The Oracle ASM disk repair timer has a default repair time of 3.6 hours. If you know that you cannot complete the rescue procedure within that time frame, then use the Oracle ASM rebalance procedure to rebalance the disks until you can perform the rescue.
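The time check described above can be sketched as a small planning helper. This is only an illustration: the function name fits_repair_window and the variable DEFAULT_REPAIR_WINDOW_MIN are hypothetical and are not part of Oracle ASM or the storage server software. The sketch assumes the default repair time of 3.6 hours, which is 216 minutes, and compares it with your own estimate of the rescue duration.

```shell
#!/bin/sh
# Hypothetical planning helper; not part of Oracle ASM or Exadata tooling.
# Assumes the default repair time of 3.6 hours, that is 3.6 * 60 = 216 minutes.
DEFAULT_REPAIR_WINDOW_MIN=216

# fits_repair_window ESTIMATED_MINUTES [WINDOW_MINUTES]
# Exit status 0 when the estimate is shorter than the window, 1 otherwise.
fits_repair_window() {
    estimate=$1
    window=${2:-$DEFAULT_REPAIR_WINDOW_MIN}
    [ "$estimate" -lt "$window" ]
}

# Example: a rescue estimated at 180 minutes.
if fits_repair_window 180; then
    echo "rescue fits within the repair window"
else
    echo "plan a rebalance before starting the rescue"
fi
```

If the repair time has been changed from the default on your system, pass the actual window in minutes as the second argument.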

See Also:

Oracle Exadata Storage Server Software User's Guide for information about resetting the timer

If the Server Has High Redundancy

If the server has high redundancy disk groups, so that Oracle ASM has multiple mirror copies for all the grid disks of the failed server, then take the failed cell offline. After Oracle ASM times out, it automatically drops the grid disks on the failed server, and starts rebalancing the data using mirror copies.

The default timeout is two hours. If the server rescue takes more than two hours, then you must re-create the grid disks on the rescued cells in Oracle ASM.

About the Rescue Procedure

Note the following before using the rescue procedure:

  • The rescue procedure can rewrite some or all of the disks in the cell. If this happens, then you might lose all the content of those disks without the possibility of recovery. Ensure that you complete the appropriate preliminary steps before starting the rescue. See "If the Server Has Normal Redundancy" or "If the Server Has High Redundancy".

  • Use extreme caution when using this procedure, and pay attention to the prompts. Ideally, use the rescue procedure only with assistance from Oracle Support Services, and when you can afford to lose the data on some or all of the disks.

  • The rescue procedure does not destroy the contents of the data disks or the contents of the data partitions on the system disks, unless you explicitly choose to do so during the rescue procedure.

  • The rescue procedure restores the storage server software to the same release, including any patches that existed on the server during the last successful boot.

  • The rescue procedure does not restore these configuration settings:

    • Server configurations, such as alert configurations, SMTP information, administrator email address

    • ILOM configuration. However, ILOM configurations typically remain undamaged even when the server software fails.

  • The recovery procedure does restore these configuration settings:

  • The rescue procedure does not examine or reconstruct data disks or data partitions on the system disks. If there is data corruption on the grid disks, then do not use this rescue procedure. Instead, use the rescue procedures for Oracle Database and Oracle ASM.

After a successful rescue, you must reconfigure the server. If you want to preserve the data, then import the cell disks. Otherwise, you must create new cell disks and grid disks.

See Also:

Oracle Exadata Storage Server Software User's Guide for information on configuring cells, cell disks, and grid disks using the CellCLI utility

Rescuing a Server Using the CELLBOOT USB Flash Drive

Caution:

Follow the rescue procedure with care to avoid data loss.

To rescue a server using the CELLBOOT USB flash drive:

  1. Connect to the Oracle ILOM service processor (SP) of the server that you are rescuing. You can use either HTTPS or SSH.
  2. Start the server. As soon as you see the splash screen, press any key on the keyboard. The splash screen is visible for only 5 seconds.
  3. In the displayed list of boot options, select the last option, CELL_USB_BOOT_CELLBOOT_usb_in_rescue_mode, and press Enter.
  4. Select the rescue option, and proceed with the rescue.
  5. At the end of the first phase of the rescue, choose the option to enter the shell. Do not restart the system.
  6. Log in to the shell using the rescue root password.
  7. Use the reboot command from the shell.
  8. Press F8 as the server restarts and before the splash screen appears. Pressing F8 accesses the boot device selection menu.
  9. Select the RAID controller as the boot device. This causes the server to boot from the hard disks.

Note:

Additional options might be available that allow you to enter a rescue mode Linux login shell with limited functionality. Then you can log in to the shell as the root user with the password supplied by Oracle Support Services, and manually run additional diagnostics and repairs on the server. For complete details, contact your Oracle Support Services representative.

Reconfiguring the Rescued Storage Server

After a successful rescue, you must configure the server. If the data partitions were preserved, then the cell disks are imported automatically during the rescue procedure.

  1. For any replaced servers, re-create the cell disks and grid disks.
  2. Log in to the Oracle ASM instance, and set the disks to ONLINE using the following command for each disk group:
    SQL> ALTER DISKGROUP disk_group_name ONLINE DISKS IN FAILGROUP
      2  cell_name WAIT;
    
  3. Reconfigure the cell using the ALTER CELL command. The following example shows the most common parameters:
    CellCLI> ALTER CELL -
    smtpServer='my_mail.example.com', -
    smtpFromAddr='john.doe@example.com', -
    smtpFromPwd=email_address_password, -
    smtpToAddr='jane.smith@example.com', -
    notificationPolicy='critical,warning,clear', -
    notificationMethod='mail,snmp'
    
  4. Re-create the I/O Resource Management (IORM) plan.
  5. Re-create the metric thresholds.
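Because step 2 must be repeated for each disk group, it can help to generate the statements up front. The following is a minimal sketch under stated assumptions: the function gen_online_sql is hypothetical, and the names CELL01, DATA, and RECO are placeholders rather than values from your system. The script only prints text; it does not connect to Oracle ASM, so review the output before running it in SQL*Plus.

```shell
#!/bin/sh
# Hypothetical helper: prints one ALTER DISKGROUP statement per disk group.
# It only generates text; it does not connect to Oracle ASM.

# gen_online_sql FAILGROUP DISKGROUP...
gen_online_sql() {
    failgroup=$1
    shift
    for dg in "$@"; do
        printf 'ALTER DISKGROUP %s ONLINE DISKS IN FAILGROUP %s WAIT;\n' \
            "$dg" "$failgroup"
    done
}

# Example with made-up names (CELL01, DATA, and RECO are placeholders).
gen_online_sql CELL01 DATA RECO
```

Substitute the failure group name of the rescued cell and the disk group names used in your Oracle ASM instance.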

See Also:

Oracle Exadata Storage Server Software User's Guide for information about IORM plans and metric thresholds

Recreating a Damaged CELLBOOT USB Flash Drive

If the CELLBOOT USB flash drive is lost or damaged, then you can create another one.

To create a CELLBOOT USB flash drive:

  1. Log in to the server as the root user.
  2. Attach a new USB flash drive with a capacity of 1 to 8 GB.
  3. Remove any other USB flash drives from the system.
  4. Change directories:
    cd /opt/oracle.SupportTools
    
  5. Copy the server software to the flash drive:
    ./make_cellboot_usb -verbose -force
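
Steps 1 and 3 above can be checked before running the copy. The following is a rough sketch, not part of the Oracle tooling: the functions is_root and count_usb_disks are hypothetical, and the sketch assumes that a device listing in the form "NAME TRAN" is available, for example from lsblk -dn -o NAME,TRAN on systems where the lsblk utility is installed. The example feeds in sample data so that it can run anywhere.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; not part of the Oracle tooling.

# is_root UID
# Succeeds only when the given numeric user ID is 0 (root).
is_root() {
    [ "$1" -eq 0 ]
}

# count_usb_disks
# Reads lines of the assumed form "NAME TRAN" on standard input and
# prints how many have the transport type "usb".
count_usb_disks() {
    awk '$2 == "usb" { n++ } END { print n + 0 }'
}

# Example with sample data: one SAS disk and one USB flash drive.
usb_count=$(printf 'sda sas\nsdb usb\n' | count_usb_disks)
if [ "$usb_count" -eq 1 ]; then
    echo "exactly one USB drive attached"
else
    echo "expected one USB drive, found $usb_count"
fi
```

On a live system, you might replace the sample data with the real device listing and pass the output of id -u to is_root before starting the copy.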