4.3 Operating System Metrics Collected by Cluster Health Monitor
Review the metrics collected by CHM.
Overview of Metrics
CHM groups the operating system data collected into a Nodeview. A Nodeview is a grouping of metric sets where each metric set contains detailed metrics of a unique system resource.
Brief descriptions of the metric sets are as follows:
- CPU metric set: Metrics for top 127 CPUs sorted by usage percentage
- Device metric set: Metrics for 127 devices that include ASM/VD/OCR along with those having a high average wait time
- Process metric set: Metrics for 127 processes
  - Top 25 CPU consumers (idle processes not reported)
  - Top 25 Memory consumers (RSS < 1% of total RAM not reported)
  - Top 25 I/O consumers
  - Top 25 File Descriptors consumers (helps to identify top inode consumers)
- Process Aggregation: Metrics summarized by foreground and background processes for all Oracle Database and Oracle ASM instances
- Network metric set: Metrics for 16 NICs that include public and private interconnects
- NFS metric set: Metrics for 32 NFS ordered by round trip time
- Protocol metric set: Metrics for protocol groups TCP, UDP, and IP
- Filesystem metric set: Metrics for filesystem utilization
- Critical resources metric set: Metrics for critical system resource utilization
  - CPU Metrics: system-wide CPU utilization statistics
  - Memory Metrics: system-wide memory statistics
  - Device Metrics: system-wide device statistics distinct from individual device metric set
  - NFS Metrics: Total NFS devices collected every 30 seconds
  - Process Metrics: system-wide unique process metrics
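To make the grouping concrete, the following is a minimal Python sketch of a Nodeview as a container of capped metric sets. The `Nodeview` class, the `LIMITS` table, its key names, and the `add` method are hypothetical and exist only for illustration; they are not part of any CHM interface. Only the per-set caps (127, 16, and 32) come from the list above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Per-set entry caps taken from the overview list; the key names are illustrative.
LIMITS: Dict[str, int] = {
    "cpu": 127,
    "device": 127,
    "process": 127,
    "network": 16,
    "nfs": 32,
}


@dataclass
class Nodeview:
    """Hypothetical container: one list of metric records per metric set."""

    metric_sets: Dict[str, List[Dict[str, Any]]] = field(default_factory=dict)

    def add(self, set_name: str, records: List[Dict[str, Any]]) -> None:
        """Store records for a metric set, truncated to that set's cap (if any)."""
        limit = LIMITS.get(set_name)  # None means "no cap" for this sketch
        self.metric_sets[set_name] = list(records[:limit])
```

Metric sets without a documented cap, such as the filesystem set, are stored unmodified in this sketch.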
CPU Metric Set
Contains metrics from all CPU cores ordered by usage percentage.
Table 4-1 CPU Metric Set
| Metric Name (units) | Description |
|---|---|
| system [%] | Percentage of CPU utilization that occurred while executing at the system level (kernel). |
| user [%] | Percentage of CPU utilization that occurred while executing at the user level (application). |
| usage [%] | Total utilization (system[%] + user[%]). |
| nice [%] | Percentage of CPU utilization that occurred while executing at the user level with nice priority. |
| ioWait [%] | Percentage of time that the CPU was idle during which the system had an outstanding disk I/O request. |
| steal [%] | Percentage of time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another virtual processor. |
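The relationship between `usage`, `system`, and `user`, and the ordering by usage percentage, can be sketched in Python as follows. The function names and the dictionary layout of a sample are assumptions made for illustration; CHM does not expose these functions.

```python
from typing import Dict, List


def with_usage(sample: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of a per-CPU sample with usage = system + user."""
    out = dict(sample)
    out["usage"] = sample["system"] + sample["user"]
    return out


def top_cpus(samples: List[Dict[str, float]], limit: int = 127) -> List[Dict[str, float]]:
    """Order CPUs by usage percentage, highest first, keeping at most `limit`."""
    ranked = sorted(
        (with_usage(s) for s in samples),
        key=lambda s: s["usage"],
        reverse=True,
    )
    return ranked[:limit]
```

The default `limit` of 127 mirrors the cap stated in the overview for the CPU metric set.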
Device Metric Set
Contains metrics from all disk devices/partitions ordered by their service time in milliseconds.
Table 4-2 Device Metric Set
| Metric Name (units) | Description |
|---|---|
| ioR [KB/s] | Amount of data read from the device. |
| ioW [KB/s] | Amount of data written to the device. |
| numIOs [#/s] | Average disk I/O operations. |
| qLen [#] | Number of I/O queued requests, that is, in a wait state. |
| aWait [msec] | Average wait time per I/O. |
| svcTm [msec] | Average service time per I/O request. |
| util [%] | Percent utilization of the device (same as the `%util` metric from the `iostat -x` command; represents the percentage of time the device was active). |
Process Metric Set
Contains multiple categories of summarized metric data computed across all system processes.
Table 4-3 Process Metric Set
| Metric Name (units) | Description |
|---|---|
| pid | Process ID. |
| pri | Process priority (raw value from the operating system). |
| psr | The processor that the process is currently assigned to or running on. |
| pPid | Parent process ID. |
| nice | Nice value of the process. |
| state | State of the process. For example, R->Running, S->Interruptible sleep, and so on. |
| class | Scheduling class of the process. For example, RR->Round robin, FF->First in first out, B->Batch scheduling, and so on. |
| fd [#] | Number of file descriptors opened by this process, which is updated every 30 seconds. |
| name | Name of the process. |
| cpu [%] | Process CPU utilization across cores. For example, 50% => 50% of single core, 400% => 100% usage of 4 cores. |
| thrds [#] | Number of threads created by this process. |
| vmem [KB] | Process virtual memory usage (KB). |
| shMem [KB] | Process shared memory usage (KB). |
| rss [KB] | Process memory-resident set size (KB). |
| ioR [KB/s] | I/O read in kilobytes per second. |
| ioW [KB/s] | I/O write in kilobytes per second. |
| ioT [KB/s] | I/O total in kilobytes per second. |
| cswch [#/s] | Context switch per second. Collected only for a few critical Oracle Database processes. |
| nvcswch [#/s] | Non-voluntary context switch per second. Collected only for a few critical Oracle Database processes. |
| cumulativeCpu [ms] | Amount of CPU time used so far by the process, in milliseconds. |
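Because `cpu [%]` is expressed relative to a single core, values above 100% are normal on multi-core systems. The following Python sketch shows two conversions that can help when reading this column. Both function names are hypothetical, and normalizing by the number of logical CPUs is an illustrative choice rather than something CHM reports for individual processes.

```python
def cores_busy(cpu_pct: float) -> float:
    """Convert a per-process cpu [%] value into the equivalent number of fully busy cores."""
    return cpu_pct / 100.0


def machine_share_pct(cpu_pct: float, vcpus: int) -> float:
    """Express a per-process cpu [%] value as a percentage of total machine capacity."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return cpu_pct / vcpus
```

For example, a process reporting 400% keeps the equivalent of four cores busy, which is half of the capacity of a machine with eight logical CPUs.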
NIC Metric Set
Contains metrics from all network interfaces ordered by their total rate in kilobytes per second.
Table 4-4 NIC Metric Set
| Metric Name (units) | Description |
|---|---|
| name | Name of the interface. |
| tag | Tag for the interface, for example, public, private, and so on. |
| mtu [B] | Size of the maximum transmission unit in bytes supported for the interface. |
| rx [Kbps] | Average network receive rate. |
| tx [Kbps] | Average network send rate. |
| total [Kbps] | Average network transmission rate (rx[Kbps] + tx[Kbps]). |
| rxPkt [#/s] | Average incoming packet rate. |
| txPkt [#/s] | Average outgoing packet rate. |
| pkt [#/s] | Average rate of packet transmission (rxPkt[#/s] + txPkt[#/s]). |
| rxDscrd [#/s] | Average rate of dropped/discarded incoming packets. |
| txDscrd [#/s] | Average rate of dropped/discarded outgoing packets. |
| rxUnicast [#/s] | Average rate of unicast packets received. |
| rxNonUnicast [#/s] | Average rate of multicast packets received. |
| dscrd [#/s] | Average rate of total discarded packets (rxDscrd + txDscrd). |
| rxErr [#/s] | Average error rate for incoming packets. |
| txErr [#/s] | Average error rate for outgoing packets. |
| Err [#/s] | Average error rate of total transmission (rxErr[#/s] + txErr[#/s]). |
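Four of the metrics in this table are sums of a receive-side and a send-side value. A minimal Python sketch of those sums follows; the function name `nic_totals` and the dictionary layout are assumptions made for illustration, while the formulas are the ones given in the table.

```python
from typing import Dict


def nic_totals(nic: Dict[str, float]) -> Dict[str, float]:
    """Compute the derived NIC metrics from their receive and send components."""
    return {
        "total": nic["rx"] + nic["tx"],            # Kbps
        "pkt": nic["rxPkt"] + nic["txPkt"],        # packets per second
        "dscrd": nic["rxDscrd"] + nic["txDscrd"],  # discards per second
        "Err": nic["rxErr"] + nic["txErr"],        # errors per second
    }
```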
NFS Metric Set
Contains metrics for the top 32 NFS devices, ordered by round trip time. This metric set is collected once every 30 seconds.
Table 4-5 NFS Metric Set
| Metric Name (units) | Description |
|---|---|
| op [#/s] | Number of read/write operations issued to a filesystem per second. |
| bytes [#/sec] | Number of bytes read from or written to a filesystem per second. |
| rtt [s] | This is the duration from the time that the client's kernel sends the RPC request until the time it receives the reply. |
| exe [s] | Duration from the time that the NFS client issues the RPC request to its kernel until the RPC request is completed. This includes the round trip time (rtt) above. |
| retrains [%] | Retransmission frequency, as a percentage. |
Protocol Metric Set
Contains specific metrics for protocol groups TCP, UDP, and IP. Metric values are cumulative since system startup.
Table 4-6 TCP Metric Set
| Metric Name (units) | Description |
|---|---|
| failedConnErr [#] | Number of times that TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times that TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state. |
| estResetErr [#] | Number of times that TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state. |
| segRetransErr [#] | Total number of TCP segments retransmitted. |
| rxSeg [#] | Total number of TCP segments received on TCP layer. |
| txSeg [#] | Total number of TCP segments sent from TCP layer. |
Table 4-7 UDP Metric Set
| Metric Name (units) | Description |
|---|---|
| unkPortErr [#] | Total number of received datagrams for which there was no application at the destination port. |
| rxErr [#] | Number of received datagrams that could not be delivered for reasons other than the lack of an application at the destination port. |
| rxPkt [#] | Total number of packets received. |
| txPkt [#] | Total number of packets sent. |
Table 4-8 IP Metric Set
| Metric Name (units) | Description |
|---|---|
| ipHdrErr [#] | Number of input datagrams discarded due to errors in their IPv4 headers. |
| addrErr [#] | Number of input datagrams discarded because the IPv4 address in their IPv4 header's destination field was not a valid address to be received at this entity. |
| unkProtoErr [#] | Number of locally-addressed datagrams received successfully but discarded because of an unknown or unsupported protocol. |
| reasFailErr [#] | Number of failures detected by the IPv4 reassembly algorithm. |
| fragFailErr [#] | Number of IPv4 datagrams discarded due to fragmentation failures. |
| rxPkt [#] | Total number of packets received on IP layer. |
| txPkt [#] | Total number of packets sent from IP layer. |
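Because the protocol counters are cumulative, a single sample says little about current behavior; the change between two samples is usually more informative. The following Python sketch derives a per-second rate from two readings of the same counter. The function name and the reset handling (returning `None` when the counter goes backwards, for example after a restart) are assumptions for illustration and do not describe how CHM processes these values.

```python
from typing import Optional


def counter_rate(previous: int, current: int, interval_seconds: float) -> Optional[float]:
    """Return the per-second rate of a cumulative counter between two samples.

    Returns None if the counter decreased, which this sketch treats as a reset.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = current - previous
    if delta < 0:
        return None
    return delta / interval_seconds
```

For example, if `segRetransErr` rises from 100 to 160 over a 5-second interval, the retransmission rate in that interval is 12 segments per second.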
Filesystem Metric Set
Contains metrics for filesystem utilization. Collected only for GRID_HOME filesystem.
Table 4-9 Filesystem Metric Set
| Metric Name (units) | Description |
|---|---|
| mount | Mount point. |
| type | Filesystem type, for example, ext4. |
| tag | Filesystem tag, for example, GRID_HOME. |
| total [KB] | Total amount of space (KB). |
| used [KB] | Amount of used space (KB). |
| avbl [KB] | Amount of available space (KB). |
| used [%] | Percentage of used space. |
| ifree [%] | Percentage of free file nodes. |
System Metric Set
Contains a summarized metric set of critical system resource utilization.
Table 4-10 CPU Metrics
| Metric Name (units) | Description |
|---|---|
| pCpus [#] | Number of physical processing units in the system. |
| Cores [#] | Number of cores for all CPUs in the system. |
| vCpus [#] | Number of logical processing units in the system. |
| cpuHt | CPU Hyperthreading enabled (Y) or disabled (N). |
| osName | Name of the operating system. |
| chipName | Name of the chip of the processing unit. |
| system [%] | Percentage of CPU utilization that occurred while executing at the system level (kernel). |
| user [%] | Percentage of CPU utilization that occurred while executing at the user level (application). |
| usage [%] | Total CPU utilization (system[%] + user[%]). |
| nice [%] | Percentage of CPU utilization that occurred while executing at the user level with nice priority. |
| ioWait [%] | Percentage of time that the CPUs were idle during which the system had an outstanding disk I/O request. |
| Steal [%] | Percentage of time spent in involuntary wait by the virtual CPUs while the hypervisor was servicing another virtual processor. |
| cpuQ [#] | Number of processes waiting in the run queue within the current sample interval. |
| loadAvg1 | Average system load calculated over a period of one minute. |
| loadAvg5 | Average system load calculated over a period of five minutes. |
| loadAvg15 | Average system load calculated over a period of 15 minutes. High load averages imply that a system is overloaded; many processes are waiting for CPU time. |
| Intr [#/s] | Number of interrupts that occurred per second in the system. |
| ctxSwitch [#/s] | Number of context switches that occurred per second in the system. |
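A load average is easier to interpret when compared with the number of logical CPUs (`vCpus`). The following Python sketch normalizes a load average this way. The function names and the default threshold of 1.0 are assumptions based on a common rule of thumb; they are not CHM behavior or an Oracle recommendation.

```python
def load_per_vcpu(load_avg: float, vcpus: int) -> float:
    """Normalize a load average by the number of logical CPUs."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return load_avg / vcpus


def looks_overloaded(load_avg: float, vcpus: int, threshold: float = 1.0) -> bool:
    """Flag a load average that exceeds `threshold` runnable tasks per logical CPU."""
    return load_per_vcpu(load_avg, vcpus) > threshold
```

Applying the check to `loadAvg15` rather than `loadAvg1` makes it less sensitive to short bursts.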
Table 4-11 Memory Metrics
| Metric Name (units) | Description |
|---|---|
| totalMem [KB] | Amount of total usable RAM (KB). |
| freeMem [KB] | Amount of free RAM (KB). |
| avblMem [KB] | Amount of memory available to start a new process without swapping. |
| shMem [KB] | Memory used (mostly) by tmpfs. |
| swapTotal [KB] | Total amount of physical swap memory (KB). |
| swapFree [KB] | Amount of swap memory free (KB). |
| swpIn [KB/s] | Average swap in rate within the current sample interval (KB/sec). |
| swpOut [KB/s] | Average swap-out rate within the current sample interval (KB/sec). |
| pgIn [#/s] | Average page in rate within the current sample interval (pages/sec). |
| pgOut [#/s] | Average page out rate within the current sample interval (pages/sec). |
| slabReclaim [KB] | The part of the slab that might be reclaimed such as caches. |
| buffer [KB] | Memory used by kernel buffers. |
| Cache [KB] | Memory used by the page cache and slabs. |
| bufferAndCache [KB] | Total size of buffer and cache (buffer[KB] + Cache[KB]). |
| hugePageTotal [#] | Total number of huge pages present in the system for the current sample interval. |
| hugePageFree [#] | Total number of free huge pages in the system for the current sample interval. |
| hugePageSize [KB] | Size of one huge page in KB, depends on the operating system version. Typically the same for all samples for a particular host. |
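Two values in this table combine directly: `bufferAndCache` is the sum of `buffer` and `Cache`, and the memory reserved for huge pages follows from the page count and page size. A minimal Python sketch, with hypothetical function names:

```python
def buffer_and_cache_kb(buffer_kb: int, cache_kb: int) -> int:
    """bufferAndCache [KB] = buffer [KB] + Cache [KB], as stated in the table."""
    return buffer_kb + cache_kb


def huge_page_memory_kb(page_count: int, page_size_kb: int) -> int:
    """Memory covered by a number of huge pages, in KB (count multiplied by page size)."""
    return page_count * page_size_kb
```

For example, 512 huge pages of 2048 KB each account for 1,048,576 KB (1 GB) of memory.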
Table 4-12 Device Metrics
| Metric Name (units) | Description |
|---|---|
| disks [#] | Number of disks configured in the system. |
| ioR [KB/s] | Aggregate read rate across all devices. |
| ioW [KB/s] | Aggregate write rate across all devices. |
| numIOs [#/s] | Aggregate I/O operation rate across all devices. |
Table 4-13 NFS Metrics
| Metric Name (units) | Description |
|---|---|
| nfs [#] | Total NFS devices. |
Table 4-14 Process Metrics
| Metric Name (units) | Description |
|---|---|
| fds [#] | Number of open file structs in system. |
| procs [#] | Number of processes. |
| rtProcs [#] | Number of real-time processes. |
| procsInDState | Number of processes in uninterruptible sleep. |
| sysFdLimit [#] | System limit on the number of file structs. |
| procsOnCpu [#] | Number of processes currently running on CPU. |
| procsBlocked [#] | Number of processes waiting for an event or resource to become available, such as the completion of an I/O operation. |
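Comparing `fds` with `sysFdLimit` shows how close the system is to exhausting file structs. The following Python sketch computes that ratio; the function name is hypothetical, and CHM does not report this percentage itself.

```python
def fd_usage_pct(fds: int, sys_fd_limit: int) -> float:
    """Return open file structs as a percentage of the system-wide limit."""
    if sys_fd_limit <= 0:
        raise ValueError("sys_fd_limit must be positive")
    return 100.0 * fds / sys_fd_limit
```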
Process Aggregates Metric Set
Contains aggregated metrics for all processes by process groups.
Table 4-15 Process Aggregates Metric Set
| Metric Name (units) | Description |
|---|---|
| DBBG | User Oracle Database background process group. |
| DBFG | User Oracle Database foreground process group. |
| MDBBG | MGMTDB background processes group. |
| MDBFG | MGMTDB foreground processes group. |
| ASMBG | ASM background processes group. |
| ASMFG | ASM foreground processes group. |
| IOXBG | IOS background processes group. |
| IOXFG | IOS foreground processes group. |
| APXBG | APX background processes group. |
| APXFG | APX foreground processes group. |
| CLUST | Clusterware processes group. |
| OTHER | Default group. |
For each group, the following metrics are aggregated to report a group summary.
| Metric Name (units) | Description |
|---|---|
| processes [#] | Total number of processes in the group. |
| cpu [%] | Aggregated CPU utilization. |
| rss [KB] | Aggregated resident set size. |
| shMem [KB] | Aggregated shared memory usage. |
| thrds [#] | Aggregated thread count. |
| fds [#] | Aggregated count of open file descriptors. |
| cpuWeight [%] | Contribution of the group in overall CPU utilization of the machine. |
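The group summary can be pictured as a sum over per-process records that share a group label. The following Python sketch illustrates the idea. The function name, the input layout (each record carrying a `group` key and the same metric names as the summary), and the definition of `cpuWeight` as a group's share of the summed CPU of all groups are assumptions; the actual CHM computation may differ, for example by measuring against total machine capacity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

# Metrics summed per group; names follow the group summary table.
SUMMED = ("cpu", "rss", "shMem", "thrds", "fds")


def aggregate_by_group(processes: Iterable[Mapping[str, object]]) -> Dict[str, Dict[str, float]]:
    """Sum per-process metrics by group and add an assumed cpuWeight share."""
    groups: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"processes": 0, **{metric: 0.0 for metric in SUMMED}}
    )
    for proc in processes:
        # Records without a group label fall into the default OTHER group.
        summary = groups[str(proc.get("group", "OTHER"))]
        summary["processes"] += 1
        for metric in SUMMED:
            summary[metric] += float(proc.get(metric, 0))

    total_cpu = sum(summary["cpu"] for summary in groups.values())
    for summary in groups.values():
        # Assumption: weight is the group's share of all reported CPU usage.
        summary["cpuWeight"] = 100.0 * summary["cpu"] / total_cpu if total_cpu else 0.0
    return dict(groups)
```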