Using the Storage Server Rescue Procedure
Each storage server maintains a copy of its software on the USB stick, and updates that copy whenever the system configuration changes. You can use this USB stick to recover the server after a hardware replacement or a software failure. Recovery is needed when the system disks fail, the operating system has a corrupt file system, or the boot area is damaged. After you replace the disks, cards, CPU, memory, or other components, you can recover the server from the USB stick. If you insert the USB stick in a different server, it duplicates the old server.
If only one system disk fails, use CellCLI commands to recover. In the rare event that both system disks fail simultaneously, use the rescue functionality provided on the storage server CELLBOOT USB flash drive.
First Steps Before Rescuing the Storage Server
Before rescuing a storage server, you must take steps to protect the data that is stored on it. Those steps depend on whether the system is set up with normal redundancy or high redundancy.
If the Server Has Normal Redundancy
If you are using normal redundancy, then the server has one mirror copy. If that single mirror also fails during the rescue procedure, the data could be irrecoverably lost.
Oracle recommends that you duplicate the mirror copy:
- Make a complete backup of the data in the mirror copy.
- Take the mirror copy server offline immediately, to prevent any new data changes to it before attempting a rescue.
This procedure ensures that all data residing on the grid disks on the failed server and its mirror copy is inaccessible during the rescue procedure.
The Oracle ASM disk repair timer has a default repair time of 3.6 hours. If you know that you cannot perform the rescue procedure within that time frame, then use the Oracle ASM rebalance procedure to rebalance the disks until you can do the rescue procedure.
See Also:
Oracle Exadata Storage Server Software User's Guide for information about resetting the timer
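The timing decision described above can be sketched as a small calculation. This is only an illustration, not an Oracle tool: the function names, the 30-minute safety margin, and the plan labels are hypothetical, and the 3.6-hour value is the default quoted in the text, which may differ on a given system.

```python
from datetime import timedelta

# Default Oracle ASM disk repair timer quoted in the text (3.6 hours = 216 minutes).
# The value in effect on a given system may differ; treat this as an assumption.
DEFAULT_REPAIR_TIMER = timedelta(minutes=216)


def fits_repair_timer(estimated_rescue, repair_timer=DEFAULT_REPAIR_TIMER,
                      safety_margin=timedelta(minutes=30)):
    """Return True if the rescue, plus a safety margin, ends before the timer expires."""
    return estimated_rescue + safety_margin <= repair_timer


def normal_redundancy_plan(estimated_rescue):
    """Pick a hypothetical plan label for a normal redundancy server."""
    if fits_repair_timer(estimated_rescue):
        return "rescue within the repair timer"
    # The text recommends rebalancing when the rescue cannot finish in time.
    return "rebalance first, then rescue"


if __name__ == "__main__":
    for hours in (2, 3.5):
        print(hours, "h ->", normal_redundancy_plan(timedelta(hours=hours)))
```

The safety margin is a design choice for the sketch: it makes the check conservative, so an estimate that only barely fits the timer is treated as a reason to rebalance first.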
If the Server Has High Redundancy
If the server has high redundancy disk groups, so that Oracle ASM has multiple mirror copies for all the grid disks of the failed server, then take the failed cell offline. After Oracle ASM times out, it automatically drops the grid disks on the failed server and starts rebalancing the data using the mirror copies.
The default timeout is two hours. If the server rescue takes more than two hours, then you must re-create the grid disks on the rescued cells in Oracle ASM.
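As a rough illustration of the rule above, the following sketch compares how long a rescue took against the two-hour default. The function names and return strings are hypothetical. The "no re-creation" branch is an inference from the text, which only states what happens when the timeout is exceeded.

```python
from datetime import timedelta

# Default timeout quoted in the text for high redundancy; the configured
# value on a real system may differ, so this constant is an assumption.
ASM_DROP_TIMEOUT = timedelta(hours=2)


def grid_disks_dropped(rescue_duration, timeout=ASM_DROP_TIMEOUT):
    """Return True if the rescue outlasted the timeout, so ASM dropped the grid disks."""
    return rescue_duration > timeout


def post_rescue_action(rescue_duration):
    """Return a hypothetical label for the follow-up work after the rescue."""
    if grid_disks_dropped(rescue_duration):
        return "re-create grid disks in Oracle ASM"
    # Inferred case: the text does not spell out what happens under the timeout.
    return "no grid disk re-creation expected"
```

For example, `post_rescue_action(timedelta(hours=3))` returns the re-create label, while a 90-minute rescue falls under the timeout.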
About the Rescue Procedure
Note the following before using the rescue procedure:
- The rescue procedure can rewrite some or all of the disks in the cell. If this happens, then you might lose all the content of those disks without the possibility of recovery. Ensure that you complete the appropriate preliminary steps before starting the rescue. See "If the Server Has Normal Redundancy" or "If the Server Has High Redundancy".
- Use extreme caution when using this procedure, and pay attention to the prompts. Ideally, use the rescue procedure only with assistance from Oracle Support Services, and when you can afford to lose the data on some or all of the disks.
- The rescue procedure does not destroy the contents of the data disks or the contents of the data partitions on the system disks, unless you explicitly choose to do so during the rescue procedure.
- The rescue procedure restores the storage server software to the same release, including any patches that existed on the server during the last successful boot.
- The rescue procedure does not restore these configuration settings:
  - Server configurations, such as alert configurations, SMTP information, and administrator email address
  - ILOM configuration. However, ILOM configurations typically remain undamaged even when the server software fails.
- The rescue procedure does restore these configuration settings:
- The rescue procedure does not examine or reconstruct data disks or data partitions on the system disks. If there is data corruption on the grid disks, then do not use this rescue procedure. Instead, use the rescue procedures for Oracle Database and Oracle ASM.
After a successful rescue, you must reconfigure the server. If you want to preserve the data, then import the cell disks. Otherwise, you must create new cell disks and grid disks.
See Also:
Oracle Exadata Storage Server Software User's Guide for information on configuring cells, cell disks, and grid disks using the CellCLI utility
Rescuing a Server Using the CELLBOOT USB Flash Drive
Caution:
Follow the rescue procedure with care to avoid data loss.
To rescue a server using the CELLBOOT USB flash drive:
Note:
Additional options might be available that allow you to enter a rescue mode Linux login shell with limited functionality. Then you can log in to the shell as the root user with the password supplied by Oracle Support Services, and manually run additional diagnostics and repairs on the server. For complete details, contact your Oracle Support Services representative.
Reconfiguring the Rescued Storage Server
After a successful rescue, you must configure the server. If the data partitions were preserved, then the cell disks are imported automatically during the rescue procedure.
See Also:
Oracle Exadata Storage Server Software User's Guide for information about IORM plans and metric thresholds