Maintaining the Storage Servers

This section describes how to perform maintenance on the storage servers. It contains the following topics:

  • Shutting Down a Storage Server

  • Enabling Network Connectivity Using the Diagnostics ISO

Note:

Older storage servers cannot be removed and replaced with newer storage servers while keeping existing Recovery Appliance backups online. Such an exchange requires a re-image of the Recovery Appliance.

Shutting Down a Storage Server

When performing maintenance on a storage server, you might need to power down or restart the server. Before shutting down a storage server, verify that taking a server offline does not impact Oracle ASM disk group and database availability. Continued database availability depends on the level of Oracle ASM redundancy used on the affected disk groups, and the current status of disks in other storage servers that have mirror copies of the same data.
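The verification described above can be sketched as a small shell filter. This is a minimal sketch, not an official procedure: it assumes the grid disk listing was produced with the asmdeactivationoutcome attribute, which does not appear in this section (confirm it against My Oracle Support Doc ID 1188080.1), and the function name check_deactivation_safe is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads grid disk listing lines of the form
#   <name> <asmmodestatus> <asmdeactivationoutcome>
# on standard input, prints the name of every disk whose third column
# is not "Yes", and returns 0 only when no such disk was found.
check_deactivation_safe() {
    awk 'NF >= 3 && $3 != "Yes" { print $1; bad = 1 }
         END { exit (bad + 0) }'
}

# Assumed invocation on the storage server (attribute list is an assumption):
#   cellcli -e "LIST GRIDDISK ATTRIBUTES name, asmmodestatus, asmdeactivationoutcome" \
#       | check_deactivation_safe
```

If the function prints any disk names, restore Oracle ASM redundancy before taking the server offline.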

Caution:

  • If a disk in a different cell fails while the cell undergoing maintenance is not completely back in service on the Recovery Appliance, a double disk failure can occur. If the Recovery Appliance is deployed with NORMAL redundancy for the DELTA disk group and if this disk failure is permanent, you will lose all backups on the Recovery Appliance.

  • Ensure that the cell undergoing maintenance is not offline for an extended period of time. By default, a rebalance operation begins 24 hours after the cell goes offline, and it can cause problems because there is insufficient space for the operation to complete.

To power down a storage server:

  1. Log in to the storage server as root.
  2. (Optional) Keep the grid disks offline after restarting the storage server:
    CellCLI> ALTER GRIDDISK ALL INACTIVE
    

    Use this command when doing multiple restarts, or to control when the cell becomes active again. For example, you can verify that the planned maintenance activity was successful before the server is used.

  3. Stop the cell services:
    CellCLI> ALTER CELL SHUTDOWN SERVICES ALL
    

    The preceding command checks whether any disks are offline, are in predictive failure status, or must be copied to their mirrors. If Oracle ASM redundancy is intact, then the command takes the grid disks offline in Oracle ASM and stops the services.

    The following error indicates that stopping the services might cause redundancy problems and force a disk group to dismount:

    Stopping the RS, CELLSRV, and MS services...
    The SHUTDOWN of ALL services was not successful.
    CELL-01548: Unable to shut down CELLSRV because disk group DATA, RECO may be
    forced to dismount due to reduced redundancy.
    Getting the state of CELLSRV services... running
    Getting the state of MS services... running
    Getting the state of RS services... running
    

    If this error occurs, then restore Oracle ASM disk group redundancy. Retry the command when the status is normal for all disks.

  4. Shut down the server. See "Powering Down the Servers".
  5. After you complete the maintenance procedure, power up the server. The services start automatically. During startup, all grid disks are automatically brought online in Oracle ASM.
  6. Verify that all grid disks are online:
    CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus
    

    Wait until asmmodestatus shows ONLINE or UNUSED for all grid disks.

  7. If you inactivated the grid disks in step 2, then reactivate them:
    CellCLI> ALTER GRIDDISK ALL ACTIVE
    

    If you skipped step 2, then the grid disks are activated automatically.
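The wait in step 6 can be scripted. The following is a minimal sketch under stated assumptions: the function names griddisks_ready and wait_for_griddisks are hypothetical, the 60-second polling interval is arbitrary, and the loop assumes that cellcli is invoked non-interactively with the -e option.

```shell
#!/bin/sh
# Hypothetical helper: reads lines of the form "<name> <asmmodestatus>"
# on standard input. Prints every disk that is neither ONLINE nor UNUSED
# (for example OFFLINE) and returns 0 only when at least one disk was
# listed and all listed disks are ONLINE or UNUSED.
griddisks_ready() {
    awk 'NF >= 2 { seen = 1 }
         NF >= 2 && $2 != "ONLINE" && $2 != "UNUSED" { print $1, $2; pending = 1 }
         END { if (seen && !pending) exit 0; exit 1 }'
}

# Assumed polling loop for step 6; call it only on the storage server.
wait_for_griddisks() {
    until cellcli -e "LIST GRIDDISK ATTRIBUTES name, asmmodestatus" | griddisks_ready; do
        sleep 60
    done
}
```

An empty listing is treated as not ready, so a failed cellcli call does not end the wait early.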

See Also:

My Oracle Support Doc ID 1188080.1, "Steps to shut down or reboot an Exadata storage cell without affecting ASM."

Enabling Network Connectivity Using the Diagnostics ISO

You might need to use the diagnostics ISO to access a storage server that fails to restart normally. After starting the server, you can copy files from the ISO to the server, replacing the corrupt files.

The ISO is located on all Recovery Appliance servers at /opt/oracle.SupportTools/diagnostics.iso.

Caution:

Use the diagnostics ISO only after other restart methods, such as using the USB drive, have failed. Contact Oracle Support for advice and guidance before starting this procedure.

To use the diagnostics ISO:

  1. Enable a one-time CD-ROM boot in the service processor, using either the web interface or a serial console session, for example through Telnet or PuTTY. For example, use this command from a serial console:
    set boot_device=cdrom
    
  2. Mount a local copy of diagnostics.iso as a CD-ROM, using the service processor interface.
  3. Use the reboot command to restart the server.
  4. Log in to the server as the root user with the diagnostics ISO password.
  5. To avoid pings, define the following alias:
    alias ping="ping -c"
    
  6. Make a directory named /etc/network.
  7. Make a directory named /etc/network/if-pre-up.d.
  8. Add the following settings to the /etc/network/interfaces file, entering the actual IP address and netmask of the server, and the IP address of the gateway:
    iface eth0 inet static
    address IP address of server
    netmask netmask of server
    gateway gateway IP address of server
    
  9. Start the eth0 interface:
    # ifup eth0
     

    Ignore any warning messages.

  10. Use either FTP or the wget command to retrieve the files needed to repair the server.
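Steps 6 through 9 can be combined in a short script. This is a sketch only: the function name eth0_stanza is hypothetical, the addresses shown are documentation examples rather than real values, and it assumes that the shell on the diagnostics ISO provides printf and mkdir -p.

```shell
#!/bin/sh
# Hypothetical helper: prints the /etc/network/interfaces settings from
# step 8 for the given values.
# Arguments: <server IP address> <netmask> <gateway IP address>
eth0_stanza() {
    [ "$#" -eq 3 ] || { echo "usage: eth0_stanza IP NETMASK GATEWAY" >&2; return 2; }
    printf 'iface eth0 inet static\naddress %s\nnetmask %s\ngateway %s\n' "$1" "$2" "$3"
}

# Assumed usage (replace the example addresses with the actual values):
#   mkdir -p /etc/network/if-pre-up.d    # creates /etc/network as well
#   eth0_stanza 192.0.2.10 255.255.255.0 192.0.2.1 >> /etc/network/interfaces
#   ifup eth0
```

Generating the settings from arguments keeps the three values in one place, which makes it easier to spot a mistyped address before starting the interface.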