2.2.4 Understanding Automated Maintenance Tasks and Policies

The Management Server (MS) performs automated maintenance tasks on the metric repository and various file systems.

Some automated maintenance occurs routinely every hour, while other tasks occur in response to storage space pressure on a specific file system.

Every hour, MS automatically performs the following tasks:

- MS deletes metric observations that meet the following criteria:
  - For metrics with a default retention policy (retentionPolicy=default), MS deletes metric observations that are older than the retention period defined by the metricHistoryDays dbserver attribute. By default, the metricHistoryDays retention period is 7 days.
  - For metrics with an annual retention policy (retentionPolicy=annual), MS deletes metric observations that are older than one year.
- MS deletes various diagnostic files that are older than the retention period defined by the diagHistoryDays dbserver attribute. These include temporary files larger than 5 MB and selected other files in the /var/log file system (including log directories under /opt/oracle with symbolic links to /var/log). Files in the Automatic Diagnostic Repository (ADR) and diagnostic pack (diagpack) files are excluded from this process. By default, the diagHistoryDays retention period is 7 days.
- MS deletes eligible segments of the Oracle Exadata System Software alert.log and debug.log files. Each of these files is automatically segmented (saved to a new name) when it reaches 10 MB. To be eligible for deletion during this routine cleanup, a file segment must be older than the retention period defined by the diagHistoryDays dbserver attribute and must not be one of the 5 latest segments (the most recent 50 MB) of the file.
- MS deletes eligible alerts from the dbserver alert history. An alert is eligible if it is either a stateful alert that has been resolved or a stateless alert. Eligible alerts are deleted according to the following criteria:
  - If there are fewer than 500 alerts, eligible alerts older than 100 days are deleted.
  - If there are between 500 and 999 alerts, eligible alerts older than 7 days are deleted.
  - If there are 1,000 or more alerts, all eligible alerts are deleted.
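The hourly retention rules for metric observations and alert history can be sketched in Python as follows. This is a minimal illustration, not MS code: the `Alert` class and all function names are hypothetical, and the sketch assumes that an alert's age is measured from the time it was raised, that the alert count is the number of entries in the whole history, and that "one year" means 365 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Alert:
    """Hypothetical record for one entry in the dbserver alert history."""
    raised_at: datetime  # assumption: age is measured from this timestamp
    stateful: bool
    resolved: bool = False


def metric_expired(observed_at: datetime, now: datetime,
                   retention_policy: str = "default",
                   metric_history_days: int = 7) -> bool:
    """Hourly rule for a single metric observation."""
    if retention_policy == "annual":
        limit = timedelta(days=365)  # assumption: one year = 365 days
    else:
        limit = timedelta(days=metric_history_days)  # metricHistoryDays
    return now - observed_at > limit


def is_eligible(alert: Alert) -> bool:
    # Stateless alerts are always eligible; stateful ones only once resolved.
    return not alert.stateful or alert.resolved


def alert_age_limit(alert_count: int) -> Optional[timedelta]:
    """Age beyond which eligible alerts are deleted; None means delete all."""
    if alert_count < 500:
        return timedelta(days=100)
    if alert_count < 1000:
        return timedelta(days=7)
    return None


def alerts_to_delete(history: List[Alert], now: datetime) -> List[Alert]:
    """Return the eligible alerts that the hourly cleanup would remove."""
    limit = alert_age_limit(len(history))
    return [a for a in history
            if is_eligible(a) and (limit is None or now - a.raised_at > limit)]
```

Note how the age limit tightens as the history grows: the count, not the age of any single alert, decides which band applies.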

Furthermore, MS routinely manages the ms-odl.trc and ms-odl.log files. Each of these files is automatically segmented (saved to a new name) when it reaches 5 MB in size. When a file segment is written, MS retains the latest 10 segments of the file and deletes any older segments.
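The two segment-retention rules (the count cap for ms-odl.trc and ms-odl.log, and the combined age-plus-count rule for alert.log and debug.log) can be compared in a short sketch. Segments are modeled here simply as the timestamps at which they were written; this representation and both function names are hypothetical and do not correspond to any Oracle interface.

```python
from datetime import datetime, timedelta
from typing import List


def ms_odl_segments_to_delete(segments: List[datetime],
                              keep: int = 10) -> List[datetime]:
    """ms-odl.trc / ms-odl.log: keep the newest `keep` segments, drop the rest."""
    newest_first = sorted(segments, reverse=True)
    return newest_first[keep:]


def alert_log_segments_to_delete(segments: List[datetime],
                                 now: datetime,
                                 diag_history_days: int = 7,
                                 keep_latest: int = 5) -> List[datetime]:
    """alert.log / debug.log: a segment is deleted only if it is outside the
    newest `keep_latest` segments AND older than diagHistoryDays."""
    newest_first = sorted(segments, reverse=True)
    cutoff = now - timedelta(days=diag_history_days)
    return [s for s in newest_first[keep_latest:] if s < cutoff]
```

The ms-odl rule depends only on the number of segments, whereas the alert.log and debug.log rule never touches the 5 latest segments, however old they are.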

In addition to the previously described routine tasks, MS automatically responds to alleviate storage space pressure on the following file systems: / (root), /var/log, /u01, and /EXAVMIMAGES.

Specifically, when file system utilization reaches a predefined action threshold, MS automatically begins an iterative process to delete eligible files. The process continues until the file system utilization drops to the corresponding clearance threshold or until there are no more eligible files.

The following describes the action and clearance thresholds for each managed file system.

- If the file system size is less than 100 GB, the action threshold is 80% and the clearance threshold is 75%.
- If the file system size is between 100 GB and 2.5 TB, the action threshold is 20 GB less than the size of the file system and the clearance threshold is 25 GB less than the size of the file system.
- If the file system size is greater than 2.5 TB, the action threshold is 100% and the clearance threshold is 99%.
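To make the three size bands concrete, the following sketch computes the action and clearance thresholds as amounts of used space in GB. It assumes that 2.5 TB means 2,500 GB and that the 100 GB and 2.5 TB boundaries belong to the middle band; the function names are hypothetical.

```python
from typing import Tuple

GB_PER_TB = 1000  # assumption: decimal terabytes


def fs_thresholds_gb(size_gb: float) -> Tuple[float, float]:
    """Return (action, clearance) thresholds as used space in GB."""
    if size_gb < 100:
        # Small file systems: percentage-based thresholds.
        return 0.80 * size_gb, 0.75 * size_gb
    if size_gb <= 2.5 * GB_PER_TB:
        # Mid-size file systems: fixed 20 GB / 25 GB of headroom.
        return size_gb - 20, size_gb - 25
    # Very large file systems.
    return 1.00 * size_gb, 0.99 * size_gb


def needs_action(used_gb: float, size_gb: float) -> bool:
    """True once utilization reaches the action threshold."""
    action, _ = fs_thresholds_gb(size_gb)
    return used_gb >= action
```

For example, a 1,000 GB file system gets an action threshold of 980 GB used and a clearance threshold of 975 GB used. At exactly 100 GB the first two bands agree (80 GB and 75 GB), so the switch between them is seamless.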

In summary, the process to ease space pressure works as follows:

- MS first deletes all eligible metrics older than metricHistoryDays and all eligible files older than diagHistoryDays from the affected file system.
- If the file system utilization drops below the clearance threshold, the process stops.
- Otherwise, MS iteratively deletes the oldest eligible files. With each iteration, MS halves the effective retention period, down to a minimum of 10 minutes. This iterative purging continues until utilization drops below the clearance threshold or all eligible files more than 10 minutes old have been removed.
- If the iterative purge cannot bring the file system utilization below the clearance threshold, an alert is automatically raised. The alert clears automatically when utilization drops below the clearance threshold.
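The space-pressure procedure can be simulated with a small model. This is a sketch under simplifying assumptions: each eligible file is represented only by its age and size, a single starting retention period stands in for both metricHistoryDays and diagHistoryDays, and names such as `ease_space_pressure` and `FileEntry` are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Tuple

# The shortest effective retention period, per the description above.
MIN_RETENTION = timedelta(minutes=10)


@dataclass
class FileEntry:
    """Hypothetical stand-in for one eligible file on the affected file system."""
    age: timedelta
    size_gb: float


def ease_space_pressure(
    files: List[FileEntry],
    used_gb: float,
    clearance_gb: float,
    retention: timedelta = timedelta(days=7),  # assumed: default history days
) -> Tuple[List[FileEntry], float, bool]:
    """Return (remaining files, used space in GB, whether an alert is raised)."""
    remaining = list(files)

    def purge_older_than(limit: timedelta) -> None:
        nonlocal remaining, used_gb
        kept = []
        for f in remaining:
            if f.age > limit:
                used_gb -= f.size_gb  # deleting the file frees its space
            else:
                kept.append(f)
        remaining = kept

    # Step 1: apply the normal retention period.
    purge_older_than(retention)
    # Step 2: halve the retention period until usage is below the clearance
    # threshold or the 10-minute floor has been applied.
    while used_gb >= clearance_gb and retention > MIN_RETENTION:
        retention = max(retention / 2, MIN_RETENTION)
        purge_older_than(retention)

    # Step 3: raise an alert if purging could not reach the clearance threshold.
    return remaining, used_gb, used_gb >= clearance_gb
```

Halving the retention period removes the oldest files first, so recent diagnostics survive as long as possible while space is reclaimed.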

In the context of easing space pressure for a specific file system, eligible files include:

- All metric and diagnostic data files on the file system that MS routinely manages, including filled segments of the dbserver alert.log, debug.log, ms-odl.trc, and ms-odl.log files.
- Automatic Diagnostic Repository (ADR) log and trace files, if the file system contains the ADR.
- Diagnostic pack (diagpack) files, if the file system contains them.
- Crash files, except that the most recent crash file is retained if it is less than 30 days old.

Note:

In all cases, MS retains every file and directory that has SAVE embedded in its name.
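The eligibility rules, together with the SAVE exception, can be expressed as a filter. In this sketch the `kind` labels are hypothetical categories standing in for the file types listed above, the SAVE check is assumed to be a case-sensitive substring match on the full path (so files inside a SAVE directory are also kept), and the class and function names are invented for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class Candidate:
    path: str
    kind: str  # hypothetical labels: "ms_managed", "adr", "diagpack", "crash", "other"
    age: timedelta


PURGEABLE_KINDS = {"ms_managed", "adr", "diagpack", "crash"}


def eligible_for_pressure_purge(files: List[Candidate]) -> List[Candidate]:
    """Return the files that the space-pressure purge may consider."""
    crashes = [f for f in files if f.kind == "crash"]
    newest_crash = min(crashes, key=lambda f: f.age, default=None)
    result = []
    for f in files:
        if "SAVE" in f.path:
            continue  # assumed: anything with SAVE in its path is always kept
        if f.kind not in PURGEABLE_KINDS:
            continue  # not one of the eligible file types
        if f is newest_crash and f.age < timedelta(days=30):
            continue  # the most recent crash file is kept while under 30 days old
        result.append(f)
    return result
```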

In addition to automatically easing space pressure on the managed file systems, MS automatically monitors the file system utilization on the /tmp and /var file systems. However, for these file systems, MS only generates an alert when file system utilization reaches the alert threshold. The alert automatically clears when the file system utilization drops below the corresponding clearance threshold. The alert and clearance thresholds are based on the file system size using the same logic used for the action and clearance thresholds on the managed file systems.
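Because /tmp and /var only raise and clear an alert, their behavior amounts to a threshold monitor with hysteresis. The sketch below reuses the size-based thresholds described for the managed file systems; the class and function names are hypothetical, and treating 2.5 TB as 2,500 GB is an assumption.

```python
from typing import Tuple


def fs_thresholds_gb(size_gb: float) -> Tuple[float, float]:
    """Return (alert, clearance) thresholds as used space in GB."""
    if size_gb < 100:
        return 0.80 * size_gb, 0.75 * size_gb
    if size_gb <= 2500:  # assumption: 2.5 TB taken as 2,500 GB
        return size_gb - 20, size_gb - 25
    return 1.00 * size_gb, 0.99 * size_gb


class AlertOnlyMonitor:
    """Hypothetical monitor for /tmp or /var: raises and clears, never deletes."""

    def __init__(self, size_gb: float) -> None:
        self.alert_at, self.clear_below = fs_thresholds_gb(size_gb)
        self.alert_active = False

    def observe(self, used_gb: float) -> bool:
        """Feed one utilization sample; return whether the alert is active."""
        if not self.alert_active and used_gb >= self.alert_at:
            self.alert_active = True
        elif self.alert_active and used_gb < self.clear_below:
            self.alert_active = False
        return self.alert_active
```

The gap between the alert and clearance thresholds keeps the alert from flapping when utilization hovers near a single value.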