4.5 Collecting Operating System Resources Metrics

Cluster Health Monitor (CHM) and System Health Monitor (SHM) are both high-performance, lightweight daemons that collect, analyze, aggregate, and store a large set of operating system metrics to help you diagnose and troubleshoot system issues.

Why CHM or SHM is unique

  • CHM or SHM: Last man standing: the daemon runs memory locked in the real-time (RT) scheduling class, ensuring consistent data collection under system load.
    Typical OS collector: Inconsistent data dropouts due to scheduling delays under system load.
  • CHM or SHM: High-fidelity data sampling rate (5 seconds) with a very low resource usage profile at that rate.
    Typical OS collector: Running multiple utilities creates additional overhead on the system being monitored, which worsens at higher sampling rates.
  • CHM or SHM: High-availability daemon with collated data collections across multiple resource categories; highly optimized collector (data is read directly from the operating system, the same source the utilities use).
    Typical OS collector: A set of scripts and command-line utilities (for example, top, ps, vmstat, iostat, and so on) redirecting their output to one or more files for every collection sample.
  • CHM or SHM: Collected data is collated into a system snapshot overview (Nodeview) on every sample; the Nodeview also contains additional summarization and analysis of the collected data across multiple resource categories.
    Typical OS collector: System snapshot overviews across different resource categories are very tedious to collate.
  • CHM or SHM: Significant inline analysis and summarization during data collection and collation into the Nodeview greatly reduces the tedious, manual, time-consuming analysis needed to derive meaningful insights.
    Typical OS collector: The analysis is time-consuming and processing-intensive because the output of various utilities across multiple files needs to be collated, parsed, interpreted, and then analyzed for meaningful insights.
  • CHM or SHM: Performs Clusterware-aware metrics collection (Process Aggregates, ASM/OCR/VD disk tagging, private/public NIC tagging), and provides an extensive toolset for in-depth data analysis and visualization.
    Typical OS collector: None.

4.5.1 Comparing CHM and SHM: Understanding their fundamental differences

This topic outlines the purpose and usage of Cluster Health Monitor (CHM) and System Health Monitor (SHM).

  • CHM: Known as the system monitor daemon (osysmond), a real-time monitoring and operating system metric collection daemon that runs on each cluster node of Oracle RAC systems.
    SHM: Known as the system health monitor service (ahf-sysmon), a real-time monitoring and operating system metric collection service available on Single-Instance Database and non-GI based systems.
  • CHM: Integrated and enabled by default as part of GI since release 11.2.
    SHM: Integrated and enabled by default as part of AHF 24.6.
  • CHM: Runs as the system monitor service (osysmond) from the GI home.
    SHM: Runs as the ahf-sysmon service from the AHF home.
  • CHM: Managed as a High Availability Services (HAS) resource within the GI stack.
    SHM: Managed as the tfa-monitor resource within the AHF stack.
  • CHM: Status of the resource can be queried using the following command:
    crsctl stat res ora.crf -init -d
    SHM: Status of the process can be queried using the following command:
    ahfctl statusahf
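
For a quick cross-check at the operating system level, a plain process listing also works (a minimal sketch; the process names come from the rows above, and the output will differ from system to system):

  # CHM: the system monitor daemon on a GI node
  ps -ef | grep '[o]sysmond'
  # SHM: the AHF-managed collector on a non-GI system
  ps -ef | grep '[a]hf-sysmon'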

CHM: Generated operating system metrics are stored in ORACLE_BASE/crsdata/<hostname>/crf/db/json.

Metric Repository is auto-managed on the above local filesystem.

  • Nodeview samples are continuously written to the repository (JSON record)
  • Historical data is auto-archived into hourly zip files
  • Archived files are automatically purged once the default retention limit is reached (default: 200 MB)
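
For example, to eyeball the CHM repository on a node (a minimal sketch: it assumes the default location described above, that ORACLE_BASE is set in the environment, and that the short host name matches the <hostname> directory):

  # Nodeview JSON records and hourly archive zips live under crf/db/json
  ls -lh $ORACLE_BASE/crsdata/$(hostname -s)/crf/db/json
  # Rough check of repository size against the 200 MB retention limit
  du -sh $ORACLE_BASE/crsdata/$(hostname -s)/crf/db/json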

SHM: Generated operating system metrics are stored in /opt/oracle.ahf/data/<hostname>/shm.

Metric Repository is auto-managed on the above local filesystem.

  • Metric samples are continuously written to the repository (JSON record)
  • Historical data is auto-archived into hourly zip files
  • Archived files are automatically purged once the default retention limit is reached (default: 200 MB)
For both CHM and SHM, the generated operating system metrics are collected as part of tfactl diagcollect.
CHM: Supported on Linux, Solaris, AIX, zLinux, ARM64, and Microsoft Windows platforms.
SHM: Supported only on the Linux platform.

4.5.2 Additional Details About System Health Monitor (SHM)

System Health Monitor (SHM) is integrated and enabled by default in AHF. AHF now includes the SHM files in its diagnostic collection.

System Health Monitor (SHM) monitors operating system metrics for processes, memory, network, I/O, and disk in real time, so that you can troubleshoot and root-cause system performance issues as they occur, as well as perform root cause analysis of past issues. System Health Monitor (SHM) analysis will be available in AHF Insights. For more information, see Explore Diagnostic Insights.

SHM operates as a daemon process that is triggered and controlled by AHF and is enabled by default; however, it is available only on Single-Instance Database and non-GI based systems.

Also, you can use the ahfctl statusahf command to check the status of System Health Monitor.

  • To start SHM:
    Run this command only if SHM has been stopped previously for some reason and needs to be switched back on.
    ahf configuration set --property ahf.collectors.enhanced_os_metrics --value on

    Running the command enables ahf-sysmon; the TFA daemon then starts and monitors it.

  • To stop SHM:
    ahf configuration set --property ahf.collectors.enhanced_os_metrics --value off

    Running the command checks whether ahf-sysmon is running. If it is, the command kills the process and stops ahf-sysmon.

  • To verify the default value of SHM:
    ahf configuration get --property ahf.collectors.enhanced_os_metrics
    ahf.collectors.enhanced_os_metrics: on
  • To verify the SHM process (ahf-sysmon) is active by default:
    ps -fe | grep sysmon
    root     3333453  1  0 22:44 ?   00:00:00 /opt/oracle.ahf/shm/ahf-sysmon/bin/ahf-sysmon
  • To check SHM JSON files:

    Locate the JSON files in the SHM data directory /opt/oracle.ahf/data/<hostname>/shm

  • To check whether SHM runs under the same cgroup as TFAMain:
    -bash-4.4$ ps -ef | grep ahf-sysmon
    root        3232       1  0 09:38 ?        00:00:47 /opt/oracle.ahf/shm/ahf-sysmon/bin/ahf-sysmon
    testuser   155833  155678  0 17:04 pts/0    00:00:00 grep --color=auto ahf-sysmon
    -bash-4.4$ cat /proc/3232/cgroup | grep "cpu"
    8:cpu,cpuacct:/oratfagroup
    4:cpuset:/
    -bash-4.4$ ps -ef | grep tfa
    root        1945       1  0 09:37 ?        00:00:02 /bin/sh /etc/init.d/init.tfa run >/dev/null 2>&1 </dev/null
    root        2851       1  1 09:37 ?        00:05:21 /opt/oracle.ahf/jre/bin/java --add-opens java.base/java.lang=ALL-UNNAMED -server -Xms128m -Xmx256m -Djava.awt.headless=true -Ddisable.checkForUpdate=true -XX:+ExitOnOutOfMemoryError oracle.rat.tfa.TFAMain /opt/oracle.ahf/tfa
    testuser   156073  155678  0 17:05 pts/0    00:00:00 grep --color=auto tfa
    -bash-4.4$ cat /proc/2851/cgroup | grep "cpu"
    8:cpu,cpuacct:/oratfagroup
    4:cpuset:/
    In general, compare the cgroup of each process:
    cat /proc/[PID_OF_AHF-SYSMON]/cgroup | grep "cpu"
    cat /proc/[PID_OF_TFA]/cgroup | grep "cpu"
  • To verify SHM files are collected in AHF collection:
    • As prerequisite, run:
      tfactl set smartprobclassifier=off
    • Then run:
      tfactl diagcollect -last 1h -tag shm_last_1h; 
      unzip -l $REPOSITORY_ROOT/shm_last_1h/$HOSTNAME*.zip

      A directory called SHM should be present in the generated zip file.

    • And finally run:
      tfactl diagcollect -last 1h
      unzip -l /opt/oracle.ahf/data/repository/collection_Wed_Apr_10_22_03_04_UTC_2024_node_all/test-node.tfa_Wed_Apr_10_22_03_03_UTC_2024.zip | grep SHM
        Length      Date    Time    Name
      ---------  ---------- -----   ----
            327  04-10-2024 22:03   test-node/SHMDATA/shmdataconverter_3279258.log
           6660  04-10-2024 22:03   test-node/SHMDATA/shmosmeta_1923000.json
          43575  04-10-2024 22:03   test-node/SHMDATA/shmosmetricdescription.json
        9561411  04-10-2024 22:03   test-node/SHMDATA/shmosdata_test-node_2024-04-10-2100.log
         997193  04-10-2024 22:03   test-node/SHMDATA/shmosdata_test-node_2024-04-10-2200.log
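
The individual checks above can also be strung together into one quick pass (a minimal sketch that reuses only commands and paths shown in this section; the short host name standing in for <hostname> is an assumption):

  # 1. The collector property should report "on"
  ahf configuration get --property ahf.collectors.enhanced_os_metrics
  # 2. The ahf-sysmon daemon should be running
  ps -ef | grep '[a]hf-sysmon'
  # 3. Fresh metric files should be accumulating in the SHM repository
  ls -lt /opt/oracle.ahf/data/$(hostname -s)/shm | head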

4.5.3 Collecting Cluster Health Monitor Data

Collect Cluster Health Monitor data from any node in the cluster.

Oracle recommends that you run the tfactl diagcollect command to collect diagnostic data when an Oracle Clusterware error occurs.

4.5.4 Operating System Metrics Collected by Cluster Health Monitor and System Health Monitor

Review the metrics collected by CHM and SHM.

Overview of Metrics

CHM groups the operating system data collected into a Nodeview. A Nodeview is a grouping of metric sets where each metric set contains detailed metrics of a unique system resource.

Brief descriptions of the metric sets are as follows:

  • CPU metric set: Metrics for top 127 CPUs sorted by usage percentage
  • Device metric set: Metrics for 127 devices that include ASM/VD/OCR along with those having a high average wait time
  • Process metric set: Metrics for 127 processes
    • Top 25 CPU consumers (idle processes not reported)
    • Top 25 Memory consumers (RSS < 1% of total RAM not reported)
    • Top 25 I/O consumers
    • Top 25 File Descriptors consumers (helps to identify top inode consumers)
    • Process Aggregation: Metrics summarized by foreground and background processes for all Oracle Database and Oracle ASM instances
  • Network metric set: Metrics for 16 NICs that include public and private interconnects
  • NFS metric set: Metrics for 32 NFS ordered by round trip time
  • Protocol metric set: Metrics for protocol groups TCP, UDP, and IP
  • Filesystem metric set: Metrics for filesystem utilization
  • Critical resources metric set: Metrics for critical system resource utilization
    • CPU Metrics: system-wide CPU utilization statistics
    • Memory Metrics: system-wide memory statistics
    • Device Metrics: system-wide device statistics distinct from individual device metric set
    • NFS Metrics: Total NFS devices collected every 30 seconds
    • Process Metrics: system-wide unique process metrics
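
On systems where CHM is running, Nodeviews can also be dumped interactively with the oclumon utility; a minimal sketch (exact options vary between releases, so treat the flags as illustrative):

  # Dump Nodeviews from all cluster nodes for the last 15 minutes
  oclumon dumpnodeview -allnodes -last "00:15:00"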

CPU Metric Set

Contains metrics from all CPU cores ordered by usage percentage.

Table 4-13 CPU Metric Set

Metric Name (units) Description
system [%] Percentage of CPU utilization that occurred while running at the system level (kernel).
user [%] Percentage of CPU utilization that occurred while running at the user level (application).
usage [%] Total utilization (system[%] + user[%]).
nice [%] Percentage of CPU utilization that occurred while running at the user level with nice priority.
ioWait [%] Percentage of time that the CPU was idle during which the system had an outstanding disk I/O request.
steal [%] Percentage of time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another virtual processor.
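
These per-CPU figures follow the standard kernel CPU accounting categories, so a live mpstat sample makes a convenient cross-check (a sketch; assumes the sysstat package is installed):

  # Per-CPU %usr, %sys, %nice, %iowait and %steal over one 5-second interval
  mpstat -P ALL 5 1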

Device Metric Set

Contains metrics from all disk devices/partitions ordered by their service time in milliseconds.

Table 4-14 Device Metric Set

Metric Name (units) Description
ioR [KB/s] Amount of data read from the device.
ioW [KB/s] Amount of data written to the device.
numIOs [#/s] Average disk I/O operations.
qLen [#] Number of I/O queued requests, that is, in a wait state.
aWait [msec] Average wait time per I/O.
svcTm [msec] Average service time per I/O request.
util [%] Percent utilization of the device (same as the %util metric from the iostat -x command; represents the percentage of time the device was active).
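
Because util [%] is defined in terms of iostat -x, comparing a Nodeview against a live iostat sample is straightforward (a sketch; assumes the sysstat package, and column names vary slightly across versions):

  # Extended device statistics: r/s, w/s, await and %util map onto numIOs, aWait and util
  iostat -x 5 2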

Process Metric Set

Contains multiple categories of summarized metric data computed across all system processes.

Table 4-15 Process Metric Set

Metric Name (units) Description
pid Process ID.
pri Process priority (raw value from the operating system).
psr The processor that the process is currently assigned to or running on.
pPid Parent process ID.
nice Nice value of the process.
state State of the process. For example, R->Running, S->Interruptible sleep, and so on.
class Scheduling class of the process. For example, RR->Round Robin, FF->First In First Out, B->Batch scheduling, and so on.
fd [#] Number of file descriptors opened by this process, which is updated every 30 seconds.
name Name of the process.
cpu [%] Process CPU utilization across cores. For example, 50% => 50% of single core, 400% => 100% usage of 4 cores.
thrds [#] Number of threads created by this process.
vmem [KB] Process virtual memory usage (KB).
shMem [KB] Process shared memory usage (KB).
rss [KB] Process memory-resident set size (KB).
ioR [KB/s] I/O read in kilobytes per second.
ioW [KB/s] I/O write in kilobytes per second.
ioT [KB/s] I/O total in kilobytes per second.
cswch [#/s] Context switch per second. Collected only for a few critical Oracle Database processes.
nvcswch [#/s] Non-voluntary context switch per second. Collected only for a few critical Oracle Database processes.
cumulativeCpu [ms] Amount of CPU used so far by the process in milliseconds.
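
Most of these per-process fields have direct ps equivalents, which makes spot-checking a Nodeview against the live system easy (a sketch using standard procps format keywords):

  # pid, pPid, psr, nice, pri, state, rss, vmem (vsz) and CPU%, sorted by CPU usage (top 25 plus header)
  ps -eo pid,ppid,psr,ni,pri,stat,rss,vsz,pcpu,comm --sort=-pcpu | head -26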

NIC Metric Set

Contains metrics from all network interfaces ordered by their total rate in kilobytes per second.

Table 4-16 NIC Metric Set

Metric Name (units) Description
name Name of the interface.
tag Tag for the interface, for example, public, private, and so on.
mtu [B] Size of the maximum transmission unit in bytes supported for the interface.
rx [Kbps] Average network receive rate.
tx [Kbps] Average network send rate.
total [Kbps] Average network transmission rate (rx[Kbps] + tx[Kbps]).
rxPkt [#/s] Average incoming packet rate.
txPkt [#/s] Average outgoing packet rate.
pkt [#/s] Average rate of packet transmission (rxPkt[#/s] + txPkt[#/s]).
rxDscrd [#/s] Average rate of dropped/discarded incoming packets.
txDscrd [#/s] Average rate of dropped/discarded outgoing packets.
rxUnicast [#/s] Average rate of unicast packets received.
rxNonUnicast [#/s] Average rate of multicast packets received.
dscrd [#/s] Average rate of total discarded packets (rxDscrd + txDscrd).
rxErr [#/s] Average error rate for incoming packets.
txErr [#/s] Average error rate for outgoing packets.
Err [#/s] Average error rate of total transmission (rxErr[#/s] + txErr[#/s]).
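
The NIC counters are read from the same kernel per-interface statistics that standard tools report, so the raw values can be inspected directly (a sketch):

  # Cumulative per-interface byte, packet, error, and drop counters
  cat /proc/net/dev
  # Equivalent per-interface RX/TX statistics via iproute2
  ip -s link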

NFS Metric Set

Contains the top 32 NFS mounts ordered by round-trip time. This metric set is collected once every 30 seconds.

Table 4-17 NFS Metric Set

Metric Name (units) Description
op [#/s] Number of read/write operations issued to a filesystem per second.
bytes [#/s] Number of bytes read/written per second from a filesystem.
rtt [s] This is the duration from the time that the client's kernel sends the RPC request until the time it receives the reply.
exe [s] This is the duration from the time that the NFS client makes the RPC request to its kernel until the RPC request is completed; this includes the rtt time above.
retrans [%] The retransmission frequency as a percentage.
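
The rtt, exe, and retrans figures correspond to the per-mount statistics reported by nfsiostat, so a live comparison is possible (a sketch; nfsiostat ships with nfs-utils and may not be installed by default):

  # Per-mount NFS operation rates, rtt, exe and retransmissions, sampled twice at 30-second intervals
  nfsiostat 30 2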

Protocol Metric Set

Contains specific metrics for protocol groups TCP, UDP, and IP. Metric values are cumulative since system startup.

Table 4-18 TCP Metric Set

Metric Name (units) Description
failedConnErr [#] Number of times that TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times that TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state.
estResetErr [#] Number of times that TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state.
segRetransErr [#] Total number of TCP segments retransmitted.
rxSeg [#] Total number of TCP segments received on TCP layer.
txSeg [#] Total number of TCP segments sent from TCP layer.

Table 4-19 UDP Metric Set

Metric Name (units) Description
unkPortErr [#] Total number of received datagrams for which there was no application at the destination port.
rxErr [#] Number of received datagrams that could not be delivered for reasons other than the lack of an application at the destination port.
rxPkt [#] Total number of packets received.
txPkt [#] Total number of packets sent.

Table 4-20 IP Metric Set

Metric Name (units) Description
ipHdrErr [#] Number of input datagrams discarded due to errors in their IPv4 headers.
addrErr [#] Number of input datagrams discarded because the IPv4 address in their IPv4 header's destination field was not a valid address to be received at this entity.
unkProtoErr [#] Number of locally-addressed datagrams received successfully but discarded because of an unknown or unsupported protocol.
reasFailErr [#] Number of failures detected by the IPv4 reassembly algorithm.
fragFailErr [#] Number of IPv4 discarded datagrams due to fragmentation failures.
rxPkt [#] Total number of packets received on IP layer.
txPkt [#] Total number of packets sent from IP layer.
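
These cumulative protocol counters are maintained by the kernel and can be cross-checked with standard networking tools (a sketch; netstat requires the net-tools package on recent Linux distributions):

  # Raw cumulative TCP/UDP/IP counters since boot
  cat /proc/net/snmp
  # Human-readable summary of the same counters
  netstat -s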

Filesystem Metric Set

Contains metrics for filesystem utilization. Collected only for the GRID_HOME filesystem.

Table 4-21 Filesystem Metric Set

Metric Name (units) Description
mount Mount point.
type Filesystem type, for example, ext4.
tag Filesystem tag, for example, GRID_HOME.
total [KB] Total amount of space (KB).
used [KB] Amount of used space (KB).
avbl [KB] Amount of available space (KB).
used [%] Percentage of used space.
ifree [%] Percentage of free file nodes.
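
Since this metric set tracks only the GRID_HOME filesystem, the same numbers can be read back with df; a minimal sketch (here $GRID_HOME stands for your Grid Infrastructure home path and is assumed to be set only for illustration):

  # Space utilization in KB and inode (file node) usage for the filesystem hosting the Grid home
  df -k $GRID_HOME
  df -i $GRID_HOME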

System Metric Set

Contains a summarized metric set of critical system resource utilization.

Table 4-22 CPU Metrics

Metric Name (units) Description
pCpus [#] Number of physical processing units in the system.
Cores [#] Number of cores for all CPUs in the system.
vCpus [#] Number of logical processing units in the system.
cpuHt CPU Hyperthreading enabled (Y) or disabled (N).
osName Name of the operating system.
chipName Name of the chip of the processing unit.
system [%] Percentage of CPU utilization that occurred while running at the system level (kernel).
user [%] Percentage of CPU utilization that occurred while running at the user level (application).
usage [%] Total CPU utilization (system[%] + user[%]).
nice [%] Percentage of CPU utilization that occurred while running at the user level with nice priority.
ioWait [%] Percentage of time that the CPUs were idle during which the system had an outstanding disk I/O request.
Steal [%] Percentage of time spent in involuntary wait by the virtual CPUs while the hypervisor was servicing another virtual processor.
cpuQ [#] Number of processes waiting in the run queue within the current sample interval.
loadAvg1 Average system load calculated over a period of one minute.
loadAvg5 Average system load calculated over a period of five minutes.
loadAvg15 Average system load calculated over a period of 15 minutes. High load averages imply that the system is overloaded and many processes are waiting for CPU time.
Intr [#/s] Number of interrupts that occurred per second in the system.
ctxSwitch [#/s] Number of context switches that occurred per second in the system.
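
Run queue depth, context switch and interrupt rates, and the load averages are also visible through standard interfaces (a sketch):

  # 1-, 5- and 15-minute load averages plus runnable/total task counts
  cat /proc/loadavg
  # System-wide run queue (r), interrupts (in) and context switches (cs); the second sample covers a 5-second interval
  vmstat 5 2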

Table 4-23 Memory Metrics

Metric Name (units) Description
totalMem [KB] Amount of total usable RAM (KB).
freeMem [KB] Amount of free RAM (KB).
avblMem [KB] Amount of memory available to start a new process without swapping.
shMem [KB] Memory used (mostly) by tmpfs.
swapTotal [KB] Total amount of physical swap memory (KB).
swapFree [KB] Amount of swap memory free (KB).
swpIn [KB/s] Average swap in rate within the current sample interval (KB/sec).
swpOut [KB/s] Average swap-out rate within the current sample interval (KB/sec).
pgIn [#/s] Average page in rate within the current sample interval (pages/sec).
pgOut [#/s] Average page out rate within the current sample interval (pages/sec).
slabReclaim [KB] The part of slab memory that might be reclaimed, such as caches.
buffer [KB] Memory used by kernel buffers.
Cache [KB] Memory used by the page cache and slabs.
bufferAndCache [KB] Total size of buffer and cache (buffer[KB] + Cache[KB]).
hugePageTotal [#] Total number of huge pages present in the system for the current sample interval.
hugePageFree [KB] Total number of free huge pages in the system for the current sample interval.
hugePageSize [KB] Size of one huge page in KB, depends on the operating system version. Typically the same for all samples for a particular host.
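
The memory figures map closely onto /proc/meminfo and the free command (a sketch):

  # totalMem, freeMem, avblMem, shMem, buffer and cache equivalents, in kilobytes
  free -k
  # Huge page total, free count and page size
  grep -i huge /proc/meminfo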

Table 4-24 Device Metrics

Metric Name (units) Description
disks [#] Number of disks configured in the system.
ioR [KB/s] Aggregate read rate across all devices.
ioW [KB/s] Aggregate write rate across all devices.
numIOs [#/s] Aggregate I/O operation rate across all devices.

Table 4-25 NFS Metrics

Metric Name (units) Description
nfs [#] Total NFS devices.

Table 4-26 Process Metrics

Metric Name (units) Description
fds [#] Number of open file structs in system.
procs [#] Number of processes.
rtProcs [#] Number of real-time processes.
procsInDState Number of processes in uninterruptible sleep.
sysFdLimit [#] System limit on the number of file structs.
procsOnCpu [#] Number of processes currently running on a CPU.
procsBlocked [#] Number of processes waiting for some event or resource to become available, such as the completion of an I/O operation.
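
The fds and sysFdLimit values correspond to the kernel-wide file handle counters, and D-state processes can be counted directly (a sketch):

  # Allocated file handles, unused-but-allocated handles, and the system-wide maximum
  cat /proc/sys/fs/file-nr
  # Current number of processes in uninterruptible sleep (state D)
  ps -eo stat | grep -c '^D'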

Process Aggregates Metric Set

Contains aggregated metrics for all processes by process groups.

Table 4-27 Process Aggregates Metric Set

Metric Name (units) Description
DBBG User Oracle Database background process group.
DBFG User Oracle Database foreground process group.
MDBBG MGMTDB background processes group.
MDBFG MGMTDB foreground processes group.
ASMBG ASM background processes group.
ASMFG ASM foreground processes group.
IOXBG IOS background processes group.
IOXFG IOS foreground processes group.
APXBG APX background processes group.
APXFG APX foreground processes group.
CLUST Clusterware processes group.
OTHER Default group.

For each group, the following metrics are aggregated to report a group summary.

Metric Name (units) Description
processes [#] Total number of processes in the group.
cpu [%] Aggregated CPU utilization.
rss [KB] Aggregated resident set size.
shMem [KB] Aggregated shared memory usage.
thrds [#] Aggregated thread count.
fds [#] Aggregated count of open file descriptors.
cpuWeight [%] Contribution of the group to the overall CPU utilization of the machine.

4.5.5 Detecting Component Failures and Self-healing Autonomously

Improved ability to detect component failures and self-heal autonomously improves business continuity.

Cluster Health Monitor introduces a new diagnostic feature that identifies critical component events that indicate pending or actual failures and provides recommendations for corrective action. These actions may sometimes be performed autonomously. Such events and actions are then captured, and administrators are notified through components such as Oracle Trace File Analyzer.

Terms Associated with Diagnosability

CHMDiag: CHMDiag is a Python daemon managed by osysmond that listens for events and takes actions. Upon receiving events/actions, CHMDiag validates them for correctness, performs flow control, and schedules the actions to run. CHMDiag monitors each action to completion, and kills an action if it takes longer than the preconfigured time specific to that action.

A JSON description file describes all events/actions and their respective attributes. All events/actions have uniquely identifiable IDs. This file also contains various configurable properties for the various actions/events. CHMDiag loads this file during its startup.

CRFE API: CRFE API is used by all C clients to send events to CHMDiag. This API is used by internal clients like components (RDBMS/CSS/GIPC) to publish events/actions.

This API also provides support for both synchronous and asynchronous publication of events. Asynchronous publication of events is done through a background thread which will be shared by all CRFE API clients within a process.

CHMDIAG_BASE: This directory resides in ORACLE_BASE/<hostname>/crf/chmdiag. It contains the following directories, which are populated and managed by CHMDiag.

  • ActionsResults: Contains all results for all of the invoked actions with a subdirectory for each action.
  • EventsLog: Contains a log of all the events/actions received by CHMDiag and the location of their respective action results. These log files are also auto-rotated after reaching a fixed size.
  • CHMDiagLog: Contains CHMDiag daemon logs. Log files are auto-rotated once they reach a specific size. Logs should have sufficient debug information to diagnose any problems that CHMDiag could run into.
  • Config: Contains a run sub-directory for CHMDiag process pid file management.
New commands to query, collect, and describe CHMDiag events/actions sent by various components:
  • oclumon chmdiag description: Use the oclumon chmdiag description command to get a detailed description of all the supported events and actions.
  • oclumon chmdiag query: Use the oclumon chmdiag query command to query CHMDiag events/actions sent by various components and generate an HTML or a text report.
  • oclumon chmdiag collect: Use the oclumon chmdiag collect command to collect all events/actions data generated by CHMDiag into the specified output directory location.
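
A typical triage flow with these commands might look like the following (a minimal sketch that uses only the subcommands listed above; check the command help in your release for the full option set, such as output-directory and time-range options):

  # Describe the supported events and actions
  oclumon chmdiag description
  # Query recent CHMDiag events/actions and generate a report
  oclumon chmdiag query
  # Gather all CHMDiag event/action data for a support upload
  oclumon chmdiag collect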