Maintaining the Flash Disks of Storage Servers

About the Flash Disks

Recovery Appliance mirrors data across storage servers, and sends write operations to at least two storage servers. If a flash card in one storage server has problems, then Recovery Appliance services the read and write operations using the mirrored data in another storage server. Service is not interrupted.

When a flash card fails, the storage server software saves the location of the data lost in the failed flash cache. It then reads that data from the surviving mirror and writes it to the server with the failed flash card. This process, called resilvering, replaces the lost data with the mirrored copy. During resilvering, the grid disk status is ACTIVE -- RESILVERING WORKING.

Each storage server has four PCIe cards. Each card has four flash disks (FDOMs) for a total of 16 flash disks. The four PCIe cards are located in PCI slot numbers 1, 2, 4, and 5.
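
As a rough illustration of this layout, the following Python sketch enumerates the flash disks expected on one storage server. The `FLASH_<slot>_<FDOM>` naming pattern and the zero-based FDOM numbering are assumptions inferred from the sample listings in this section, and the function name is hypothetical.

```python
# PCI slots that hold flash cards and FDOMs per card, as stated in the text.
PCI_SLOTS = (1, 2, 4, 5)
FDOMS_PER_CARD = 4


def expected_flash_disk_names():
    """Return the flash disk names expected on one storage server.

    Assumption: names follow FLASH_<slot>_<fdom>, with FDOMs numbered from 0.
    """
    return [
        f"FLASH_{slot}_{fdom}"
        for slot in PCI_SLOTS
        for fdom in range(FDOMS_PER_CARD)
    ]


names = expected_flash_disk_names()
print(len(names))  # 16
print(names[0], names[-1])  # FLASH_1_0 FLASH_5_3
```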

To identify a failed flash disk, use the following command:

CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=flashdisk AND STATUS=failed DETAIL

         name:                   FLASH_5_3
         diskType:               FlashDisk
         luns:                   5_3
         makeModel:              "Sun Flash Accelerator F40 PCIe Card"
         physicalFirmware:       TI35
         physicalInsertTime:     2012-07-13T15:40:59-07:00
         physicalSerial:         5L002X4P
         physicalSize:           93.13225793838501G
         slotNumber:             "PCI Slot: 5; FDOM: 3"
         status:                 failed

The card name and slotNumber attributes show the PCI slot and the FDOM number.
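
If you collect this output in a script, the slot and FDOM numbers can be pulled out of the slotNumber value. The following Python sketch assumes the value always has the form shown in the sample output; the function name is hypothetical.

```python
import re

# Matches the slotNumber format shown in the sample output,
# for example: "PCI Slot: 5; FDOM: 3"
_SLOT_RE = re.compile(r"PCI Slot:\s*(\d+);\s*FDOM:\s*(\d+)")


def parse_slot_number(value):
    """Return (pci_slot, fdom) as integers from a slotNumber attribute value."""
    match = _SLOT_RE.search(value)
    if match is None:
        raise ValueError(f"unrecognized slotNumber value: {value!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_slot_number('"PCI Slot: 5; FDOM: 3"'))  # (5, 3)
```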

When the server software detects a failure, it generates an alert that indicates that the flash disk, and the LUN on it, failed. The alert message includes the PCI slot number of the flash card and the exact FDOM number. These numbers uniquely identify the field replaceable unit (FRU). If you configured the system for alert notification, then the alert is sent to the designated address in an email message.

A flash disk outage can reduce performance and data redundancy. Replace the failed disk at the earliest opportunity. If the flash disk is used for flash cache, then the effective cache size for the server is reduced. If the flash disk is used for flash log, then the flash log is disabled on the disk, thus reducing the effective flash log size. If the flash disk is used for grid disks, then the Oracle ASM disks associated with them are automatically dropped with the FORCE option from the Oracle ASM disk group, and an Oracle ASM rebalance starts to restore the data redundancy.

Faulty Status Indicators

The following status indicators generate an alert. The alert includes specific instructions for replacing the flash disk. If you configured the system for alert notifications, then the alerts are sent by email message to the designated address.

warning - peer failure

One of the flash disks on the same Sun Flash Accelerator PCIe card failed or has a problem. For example, if FLASH_5_3 fails, then FLASH_5_0, FLASH_5_1, and FLASH_5_2 have peer failure status:

CellCLI> LIST PHYSICALDISK
         36:0            L45F3A          normal
         36:1            L45WAE          normal
         36:2            L45WQW          normal
          .
          .
          .
         FLASH_5_0       5L0034XM        warning - peer failure
         FLASH_5_1       5L0034JE        warning - peer failure
         FLASH_5_2       5L002WJH        warning - peer failure
         FLASH_5_3       5L002X4P        failed
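
The relationship between a failed flash disk and its peers can be sketched in Python as follows. The sketch assumes the three-column layout shown above (name, serial number, status) and the `FLASH_<slot>_<FDOM>` naming pattern; the function names are hypothetical.

```python
def parse_listing(text):
    """Parse short LIST PHYSICALDISK output into (name, serial, status) rows."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # the status column may contain spaces
        if len(parts) == 3:
            rows.append(tuple(part.strip() for part in parts))
    return rows


def card_of(name):
    """Return the PCI slot encoded in a FLASH_<slot>_<fdom> name, else None."""
    pieces = name.split("_")
    if len(pieces) == 3 and pieces[0] == "FLASH":
        return pieces[1]
    return None


def peers_of_failed(rows):
    """Return names of non-failed flash disks that share a card with a failed one."""
    failed_cards = {
        card_of(name) for name, _, status in rows
        if status == "failed" and card_of(name) is not None
    }
    return [
        name for name, _, status in rows
        if card_of(name) in failed_cards and status != "failed"
    ]


sample = """
         36:0            L45F3A          normal
         FLASH_5_0       5L0034XM        warning - peer failure
         FLASH_5_1       5L0034JE        warning - peer failure
         FLASH_5_2       5L002WJH        warning - peer failure
         FLASH_5_3       5L002X4P        failed
"""
print(peers_of_failed(parse_listing(sample)))
```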

warning - predictive failure

The flash disk will fail soon, and should be replaced at the earliest opportunity. If the flash disk is used for flash cache, then it continues to be used as flash cache. If the flash disk is used for grid disks, then the Oracle ASM disks associated with these grid disks are automatically dropped, and Oracle ASM rebalance relocates the data from the predictively failed disk to other disks.

When a flash disk has predictive failure status, its data is copied. If the flash disk is used for write-back flash cache, then the data is flushed from the flash disks to the grid disks.

warning - poor performance

The flash disk demonstrates extremely poor performance, and should be replaced at the earliest opportunity. If the flash disk is used for flash cache, then flash cache is dropped from this disk, thus reducing the effective flash cache size for the storage server. If the flash disk is used for grid disks, then the Oracle ASM disks associated with the grid disks on this flash disk are automatically dropped with the FORCE option, if possible. If DROP...FORCE cannot succeed because of offline partners, then the grid disks are dropped normally, and Oracle ASM rebalance relocates the data from the poor performance disk to the other disks.

warning - write-through caching

The capacitors used to support data cache on the PCIe card failed, and the card should be replaced as soon as possible.

Identifying Flash Disks in Poor Health

To identify a flash disk with a particular health status, use the LIST PHYSICALDISK command. This example queries for the warning - predictive failure status:

CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=flashdisk AND STATUS='warning - predictive failure' DETAIL

         name:                   FLASH_5_3
         diskType:               FlashDisk
         luns:                   5_3
         makeModel:              "Sun Flash Accelerator F40 PCIe Card"
         physicalFirmware:       TI35
         physicalInsertTime:     2012-07-13T15:40:59-07:00
         physicalSerial:         5L002X4P
         physicalSize:           93.13225793838501G
         slotNumber:             "PCI Slot: 5; FDOM: 3"
         status:                 warning - predictive failure
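
For scripted health checks, DETAIL output of this kind can be turned into a dictionary. The following Python sketch handles a single record and assumes each line has the `attribute: value` shape shown above; the function name is hypothetical.

```python
def parse_detail(text):
    """Parse one record of LIST ... DETAIL output into a dict of attributes."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps and
        # quoted values that contain colons stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            record[key.strip()] = value.strip().strip('"')
    return record


sample = """
         name:                   FLASH_5_3
         diskType:               FlashDisk
         physicalInsertTime:     2012-07-13T15:40:59-07:00
         status:                 warning - predictive failure
"""
disk = parse_detail(sample)
print(disk["name"], "->", disk["status"])
```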

Identifying Underperforming Flash Disks

ASR automatically identifies and removes a poorly performing disk from the active configuration. Recovery Appliance then runs a set of performance tests. When CELLSRV detects poor disk performance, the cell disk status changes to normal - confinedOnline, and the physical disk status changes to warning - confinedOnline. Table 13-2 describes the conditions that trigger disk confinement. The conditions are the same for both physical and flash disks.

If the problem is temporary and the disk passes the tests, then it is brought back into the configuration. If the disk does not pass the tests, then it is marked poor performance, and ASR submits a service request to replace the disk. If possible, Oracle ASM takes the grid disks offline for testing. Otherwise, the cell disk status stays at normal - confinedOnline until the disks can be taken offline safely.

The disk status change is recorded in the server alert history:

MESSAGE ID date_time info "Hard disk entered confinement status. The LUN
 n_m changed status to warning - confinedOnline. CellDisk changed status to normal
 - confinedOnline. Status: WARNING - CONFINEDONLINE  Manufacturer: name  Model
 Number: model  Size: size  Serial Number: serial_number  Firmware: fw_release 
 Slot Number: m  Cell Disk: cell_disk_name  Grid Disk: grid disk 1, grid disk 2
 ... Reason for confinement: threshold for service time exceeded"

These messages are entered in the storage cell alert log:

CDHS: Mark cd health state change cell_disk_name  with newState HEALTH_BAD_ONLINE pending HEALTH_BAD_ONLINE ongoing INVALID cur HEALTH_GOOD
Celldisk entering CONFINE ACTIVE state with cause CD_PERF_SLOW_ABS activeForced: 0
inactiveForced: 0 trigger HistoryFail: 0, forceTestOutcome: 0 testFail: 0
global conf related state: numHDsConf: 1 numFDsConf: 0 numHDsHung: 0 numFDsHung: 0
     .
     .
     .
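
When scanning the alert log for confinement events, the cause and the counters in these messages are the fields most likely to be of interest. The following Python sketch extracts them, assuming the message layout shown above. The field names are taken verbatim from the sample; their exact meaning is not defined in this section, and the function name is hypothetical.

```python
import re


def confinement_summary(alert_log_text):
    """Return the confinement cause and the num* counters found in the log text."""
    cause = re.search(r"with cause (\w+)", alert_log_text)
    counters = {}
    for line in alert_log_text.splitlines():
        if line.startswith("global conf related state:"):
            counters = {
                name: int(value)
                for name, value in re.findall(r"(num\w+):\s*(\d+)", line)
            }
    return {"cause": cause.group(1) if cause else None, "counters": counters}


sample = (
    "Celldisk entering CONFINE ACTIVE state with cause CD_PERF_SLOW_ABS activeForced: 0\n"
    "inactiveForced: 0 trigger HistoryFail: 0, forceTestOutcome: 0 testFail: 0\n"
    "global conf related state: numHDsConf: 1 numFDsConf: 0 numHDsHung: 0 numFDsHung: 0\n"
)
summary = confinement_summary(sample)
print(summary["cause"])
print(summary["counters"])
```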

When Is It Safe to Replace a Faulty Flash Disk?

When the server software detects a predictive or peer failure in a flash disk used for write-back flash cache, and only one FDOM is bad, the server software resilvers the data on the bad FDOM and flushes the data on the other three FDOMs. If there are valid grid disks, then the server software initiates an Oracle ASM rebalance of the disks. You cannot replace the bad disk until these tasks are completed and an alert indicates that the disk is ready.

An alert is sent when the Oracle ASM disks have been dropped and the flash disk can be safely replaced. If the flash disk is used for write-back flash cache, then wait until none of the grid disks are cached by the flash disk.
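
The write-back condition can be checked in a script once the cachedBy attribute of each grid disk has been collected. The following Python sketch assumes that cachedBy holds a comma-separated list of cell disk names, possibly quoted. The disk names in the sample are hypothetical, as is the function name.

```python
def grid_disks_cached_by(cached_by_values, flash_cell_disk):
    """Return the grid disks whose cachedBy value still lists flash_cell_disk.

    cached_by_values maps a grid disk name to its raw cachedBy string.
    An empty result means no grid disk is cached by the flash disk.
    """
    still_cached = []
    for grid_disk, value in cached_by_values.items():
        cell_disks = [name.strip() for name in value.strip('"').split(",")]
        if flash_cell_disk in cell_disks:
            still_cached.append(grid_disk)
    return still_cached


# Hypothetical grid disk and cell disk names, for illustration only.
sample = {
    "DATA_CD_00_cell01": '"FD_03_cell01, FD_07_cell01"',
    "DATA_CD_01_cell01": '"FD_07_cell01"',
    "RECO_CD_00_cell01": "",
}
print(grid_disks_cached_by(sample, "FD_03_cell01"))
print(grid_disks_cached_by(sample, "FD_05_cell01"))
```

An empty list covers only the caching condition; the alert from the storage server software remains the signal that replacement is safe.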

Replacing a Failed Flash Disk

Caution:

The PCIe cards are not hot pluggable; you must power down a storage server before replacing the flash disks or cards.

Before you perform the following procedure, shut down the server. See "Shutting Down a Storage Server".

To replace a failed flash disk:

Replace the failed flash disk. Use the PCI slot number and FDOM number to locate the failed disk. A white cell LED is lit to help you locate the affected server.
Power up the server. The services start automatically. As part of the server startup, all grid disks are automatically brought online in Oracle ASM.
Verify that all grid disks are online:
```
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus
```
Wait until asmmodestatus shows ONLINE or UNUSED for all grid disks.
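
The final check in the procedure above can be sketched in Python as follows. The sketch assumes a two-column listing (grid disk name, asmmodestatus). The grid disk names in the sample are hypothetical, as is the function name.

```python
# States that the procedure treats as "done" for a grid disk.
READY_STATES = {"ONLINE", "UNUSED"}


def grid_disks_not_ready(listing):
    """Return {grid_disk: asmmodestatus} for disks not yet ONLINE or UNUSED."""
    pending = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) == 2:  # skip blank lines and the echoed command
            name, status = fields
            if status not in READY_STATES:
                pending[name] = status
    return pending


# Hypothetical grid disk names, for illustration only.
sample = """
         DATA_CD_00_cell01       ONLINE
         DATA_CD_01_cell01       SYNCING
         RECO_CD_00_cell01       UNUSED
"""
print(grid_disks_not_ready(sample))
```

An empty dictionary means every grid disk in the listing has reached ONLINE or UNUSED.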

Replacing a Faulty Flash Disk

Caution:

The PCIe cards are not hot pluggable; you must power down a storage server before replacing the flash disks or cards.

Before you perform the following procedure, review the "When Is It Safe to Replace a Faulty Flash Disk?" topic.

To replace a faulty flash disk:

Use the following command to check the cachedBy attribute of all grid disks.
```
CellCLI> LIST GRIDDISK ATTRIBUTES name, cachedBy
```
The cell disk on the flash disk should not appear in any grid disk cachedBy attribute. If the flash disk is used for both grid disks and flash cache, then wait until you receive the alert and the cell disk no longer appears in any grid disk cachedBy attribute.
Stop all services:
```
CellCLI> ALTER CELL SHUTDOWN SERVICES ALL
```
The preceding command checks if any disks are offline, in predictive failure status, or must be copied to a mirror. If Oracle ASM redundancy is intact, then the command takes the grid disks offline in Oracle ASM, and then stops the services.

The following error indicates that it might be unsafe to stop the services, because stopping them might force a disk group to dismount:
```
Stopping the RS, CELLSRV, and MS services...
The SHUTDOWN of ALL services was not successful.
CELL-01548: Unable to shut down CELLSRV because disk group DATA, RECO may be
forced to dismount due to reduced redundancy.
Getting the state of CELLSRV services... running
Getting the state of MS services... running
Getting the state of RS services... running
```
If this error occurs, then restore Oracle ASM disk group redundancy, and retry the command when the disk status is normal for all disks.
Shut down the server.
See "Shutting Down a Storage Server".
Replace the faulty flash disk. Use the PCI slot number and FDOM number to locate the faulty disk. A white cell LED is lit to help you locate the affected server.
Power up the server. The services start automatically. As part of the server startup, all grid disks are automatically brought online in Oracle ASM.
Verify that all grid disks are online:
```
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus
```
Wait until asmmodestatus shows ONLINE or UNUSED for all grid disks.

The system automatically uses the new flash disk, as follows:

If the flash disk is used for flash cache, then the effective cache size increases.
If the flash disk is used for grid disks, then the grid disks are re-created on the new flash disk.
If the grid disks were part of an Oracle ASM disk group, then they are added back to the disk group. The data is rebalanced on them, based on the disk group redundancy and the ASM_POWER_LIMIT parameter.
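
If the service shutdown in the procedure above is scripted, the CELL-01548 error is the condition to watch for. The following Python sketch detects it and extracts the affected disk groups. It assumes the message wording shown in the sample error output; the function name is hypothetical.

```python
import re

# The message wraps across lines in the sample, so whitespace is matched loosely.
_CELL_01548 = re.compile(
    r"CELL-01548: Unable to shut down CELLSRV because disk group\s+"
    r"(.+?)\s+may be\s+forced to dismount",
    re.DOTALL,
)


def disk_groups_blocking_shutdown(output):
    """Return the disk groups named in a CELL-01548 error, or [] if none."""
    match = _CELL_01548.search(output)
    if match is None:
        return []
    return [group.strip() for group in match.group(1).split(",")]


sample = """Stopping the RS, CELLSRV, and MS services...
The SHUTDOWN of ALL services was not successful.
CELL-01548: Unable to shut down CELLSRV because disk group DATA, RECO may be
forced to dismount due to reduced redundancy.
"""
print(disk_groups_blocking_shutdown(sample))  # ['DATA', 'RECO']
```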

Removing an Underperforming Flash Disk

A bad flash disk can degrade the performance of the good flash disks, so remove it from the configuration. See "Identifying Underperforming Flash Disks".

To remove an underperforming flash disk:

If the flash disk is used for flash cache:
1. Ensure that data not synchronized with the disk (dirty data) is flushed from flash cache to the grid disks:
```
CellCLI> ALTER FLASHCACHE ... FLUSH
```
2. Disable the flash cache and create a new one. Do not include the bad flash disk when creating the flash cache.
```
CellCLI> DROP FLASHCACHE
CellCLI> CREATE FLASHCACHE CELLDISK='fd1,fd2,fd3,fd4, ...'
```
If the flash disk is used for grid disks, then direct Oracle ASM to stop using the bad disk immediately:
```
SQL> ALTER DISKGROUP diskgroup_name DROP DISK asm_disk_name FORCE 
```
Offline partners might cause the DROP command with the FORCE option to fail. If the previous command fails, do one of the following:
- Restore Oracle ASM data redundancy by correcting the other server or disk failures. Then retry the DROP...FORCE command.
- Direct Oracle ASM to rebalance the data off the bad disk:
```
SQL> ALTER DISKGROUP diskgroup_name DROP DISK asm_disk_name NOFORCE
```
Wait until the Oracle ASM disks associated with the bad flash disk are dropped successfully. The storage server software automatically sends an alert when it is safe to replace the flash disk.
Stop the services:
```
CellCLI> ALTER CELL SHUTDOWN SERVICES ALL
```
The preceding command checks if any disks are offline, in predictive failure status, or must be copied to a mirror. If Oracle ASM redundancy is intact, then the command takes the grid disks offline in Oracle ASM, and then stops the services.

The following error indicates that stopping the services might cause redundancy problems and force a disk group to dismount:
```
Stopping the RS, CELLSRV, and MS services...
The SHUTDOWN of ALL services was not successful.
CELL-01548: Unable to shut down CELLSRV because disk group DATA, RECO may be
forced to dismount due to reduced redundancy.
Getting the state of CELLSRV services... running
Getting the state of MS services... running
Getting the state of RS services... running
```
If this error occurs, then restore Oracle ASM disk group redundancy. Retry the command when the status is normal for all disks.
Shut down the server. See "Shutting Down a Storage Server".
Remove the bad flash disk, and replace it with a new flash disk.
Power up the server. The services start automatically. As part of the server startup, all grid disks are automatically brought online in Oracle ASM.

Add the new flash disk to flash cache:

CellCLI> DROP FLASHCACHE
CellCLI> CREATE FLASHCACHE ALL

Verify that all grid disks are online:
```
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus
```
Wait until asmmodestatus shows ONLINE or UNUSED for all grid disks.

The flash disks are added as follows:

If the flash disk is used for grid disks, then the grid disks are re-created on the new flash disk.
If these grid disks were part of an Oracle ASM disk group and you dropped the Oracle ASM disks with DROP...FORCE, then they are added back to the disk group and the data is rebalanced based on the disk group redundancy and the ASM_POWER_LIMIT parameter.
If you dropped the Oracle ASM disks with DROP...NOFORCE, then you must manually add the grid disks back to the Oracle ASM disk group.
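
The choice between the two drop options, and its consequence after the replacement, can be summarized in a short Python sketch. It only builds the statement text shown in the procedure above and does not connect to Oracle ASM. The disk group and disk names are hypothetical, as are the function names.

```python
def drop_disk_statement(diskgroup_name, asm_disk_name, force=True):
    """Build the ALTER DISKGROUP ... DROP DISK statement used in the procedure."""
    option = "FORCE" if force else "NOFORCE"
    return f"ALTER DISKGROUP {diskgroup_name} DROP DISK {asm_disk_name} {option}"


def readd_is_automatic(force_used):
    """Per the text above: grid disks are added back automatically only after
    DROP...FORCE; after DROP...NOFORCE they must be added back manually."""
    return force_used


# Hypothetical disk group and ASM disk names, for illustration only.
print(drop_disk_statement("DATA", "DATA_FD_03_CELL01"))
print(drop_disk_statement("DATA", "DATA_FD_03_CELL01", force=False))
print(readd_is_automatic(force_used=False))  # False
```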

About Write-Back Flash Cache

You cannot modify the write-back flash cache settings on Recovery Appliance.