Hardware Counter Profiling Data

Hardware counters keep track of events like cache misses, cache stall cycles, floating-point operations, branch mispredictions, CPU cycles, and instructions executed. In hardware counter profiling, the Collector records a profile packet when a designated hardware counter of the CPU on which a thread is running overflows. The counter is reset and continues counting. The profile packet includes the overflow value and the counter type.
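Because each profile packet carries the counter type and the overflow value, an approximate event total per counter can be recovered by summing the overflow values of the packets. The following is a minimal sketch of that arithmetic, not the Collector's implementation; the (counter type, overflow value) tuple layout and the sample numbers are assumptions made for illustration.

```python
from collections import Counter


def estimate_event_totals(packets):
    """Sum overflow values per counter type.

    `packets` is a hypothetical list of (counter_type, overflow_value)
    tuples; the real experiment format is not modeled here.
    """
    totals = Counter()
    for counter_type, overflow_value in packets:
        # Each packet stands for `overflow_value` events of that type.
        totals[counter_type] += overflow_value
    return dict(totals)


# Made-up sample: two overflows of a cycle counter, one of a cache-miss counter.
sample_packets = [("cycles", 1000003), ("cycles", 1000003), ("dcm", 100003)]
sample_totals = estimate_event_totals(sample_packets)
```

Events counted since the last overflow are not represented by any packet, so a total computed this way is an estimate with a resolution of one overflow value.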

Various processor chip families support from two to eighteen simultaneous hardware counter registers. The Collector can collect data on one or more registers. For each register you can select the type of counter to monitor for overflow, and set an overflow value for the counter. Some hardware counters can use any register, while others are only available on a particular register. Consequently, not all combinations of hardware counters can be chosen in a single experiment.
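The register restriction can be pictured as an assignment problem: each requested counter needs its own register, chosen from the registers on which that counter is available. The sketch below checks whether a set of counters can coexist by using a simple backtracking search. It only illustrates the constraint; it is not how the Collector assigns registers, and the counter names and register sets in it are hypothetical.

```python
def assign_registers(requests):
    """Try to give every counter a distinct register.

    `requests` maps a counter name to the set of register numbers on
    which it can run. Returns a {counter: register} dict, or None when
    the combination cannot be collected together.
    """
    # Place the most constrained counters first to fail early.
    names = sorted(requests, key=lambda name: len(requests[name]))
    assignment = {}
    used = set()

    def place(index):
        if index == len(names):
            return True
        name = names[index]
        for register in sorted(requests[name]):
            if register in used:
                continue
            used.add(register)
            assignment[name] = register
            if place(index + 1):
                return True
            # Undo the choice and try the next register.
            used.remove(register)
            del assignment[name]
        return False

    return dict(assignment) if place(0) else None


# Hypothetical counters: "b" runs only on register 0, "a" on 0 or 1.
feasible = assign_registers({"a": {0, 1}, "b": {0}})
# Two counters that both require register 0 cannot be combined.
infeasible = assign_registers({"a": {0}, "b": {0}})
```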

Hardware counter profiling can also be done on the kernel in Performance Analyzer and with the er_kernel utility. See Chapter 9, Kernel Profiling for more information.

Hardware counter profiling data is converted by Performance Analyzer into count metrics. For counters that count in cycles, the metrics reported are converted to times. For counters that do not count in cycles, the metrics reported are event counts. On machines with multiple CPUs, the clock frequency used to convert the metrics is the harmonic mean of the clock frequencies of the individual CPUs. Because each type of processor has its own set of hardware counters, and because the number of hardware counters is large, the hardware counter metrics are not listed here. See Hardware Counter Lists to find out which hardware counters are available.
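The conversion from a cycle count to a time can be sketched as follows. The example divides the cycle count by the harmonic mean of the per-CPU clock frequencies, as described above. The function name and the sample frequencies are assumptions made for illustration, not part of the tools.

```python
from statistics import harmonic_mean


def cycles_to_seconds(cycle_count, cpu_frequencies_hz):
    """Convert a cycle count to seconds.

    The conversion frequency is the harmonic mean of the clock
    frequencies of the individual CPUs, in hertz.
    """
    return cycle_count / harmonic_mean(cpu_frequencies_hz)


# Made-up machine with one 2 GHz CPU and one 3 GHz CPU.
# The harmonic mean is 2 / (1/2e9 + 1/3e9) = 2.4e9 Hz.
seconds = cycles_to_seconds(2.4e9, [2.0e9, 3.0e9])
```

With these sample values, 2.4 billion cycles correspond to approximately one second.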

If two specific counters, "cycles" and "insts", are collected, two additional metrics are available, "CPI" and "IPC", meaning cycles-per-instruction and instructions-per-cycle, respectively. They are always shown as a ratio, and not as a time, count, or percentage. A high value of CPI or a low value of IPC indicates code that runs inefficiently in the pipeline; conversely, a low value of CPI or a high value of IPC indicates code that runs efficiently in the pipeline.
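The two ratios are reciprocals of each other. The following minimal sketch computes both from a pair of totals; the function name and the sample counts are hypothetical.

```python
def cpi_and_ipc(cycles, insts):
    """Return (CPI, IPC) for the given cycle and instruction totals."""
    if cycles <= 0 or insts <= 0:
        raise ValueError("both counts must be positive")
    cpi = cycles / insts  # cycles per instruction
    ipc = insts / cycles  # instructions per cycle
    return cpi, ipc


# Made-up totals: 3,000,000 cycles spent on 1,500,000 instructions.
cpi, ipc = cpi_and_ipc(3_000_000, 1_500_000)
```

For these sample totals, CPI is 2.0 and IPC is 0.5.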

One use of hardware counters is to diagnose problems with the flow of information into and out of the CPU. High counts of cache misses, for example, indicate that restructuring your program to improve data or text locality or to increase cache reuse can improve program performance.

Some of the hardware counters correlate with other counters. For example, branch mispredictions and instruction cache misses are often related because a branch misprediction causes the wrong instructions to be loaded into the instruction cache. These must be replaced by the correct instructions. The replacement can cause an instruction cache miss, an instruction translation lookaside buffer (ITLB) miss, or even a page fault.

For many hardware counters, the overflows are often delivered one or more instructions after the instruction that caused the overflow event. This situation is referred to as "skid", and it can make counter overflow profiles difficult to interpret.

On recent SPARC processors, some memory-based counter interrupts are precise, and are delivered with the PC (program counter) and effective address of the triggering event. Such counters are indicated by the word precise following the event type. Memoryspace and dataspace data is captured by default for those counters. See Dataspace Profiling and Memoryspace Profiling for more information.

Hardware Counter Lists

Hardware counters are processor-specific, so the choice of counters available depends on the processor that you are using. The performance tools provide aliases for a number of counters that are likely to be in common use. To determine the maximum number of hardware counter definitions for profiling on the current machine, and to see the full list of available hardware counters as well as the default counter set, run collect -h with no other arguments on the current machine.

If the processor and system support hardware counter profiling, the collect -h command prints two lists containing information about hardware counters. The first list contains hardware counters that are aliased to common names. The second list contains raw hardware counters. If neither the performance counter subsystem nor the collect command has the names for the counters on a specific system, the lists are empty. In most cases, however, the counters can be specified numerically.

The following example shows entries in the counter list. The counters that are aliased are displayed first in the list, followed by a list of the raw hardware counters. Each line of output in this example is formatted for print.

Aliased HW counters available for profiling:
    cycles[/{0|1|2|3}],<interval> (`CPU Cycles', alias for Cycles_user; CPU-cycles)
    insts[/{0|1|2|3}],<interval> (`Instructions Executed', alias for Instr_all; events)
    loads[/{0|1|2|3}],<interval> 
     (`Load Instructions', alias for Instr_ld; precise load-store events)
    stores[/{0|1|2|3}],<interval> 
     (`Store Instructions', alias for Instr_st; precise load-store events)
    dcm[/{0|1|2|3}],<interval> 
     (`L1 D-cache Misses', alias for DC_miss_nospec; precise load-store events)
    l2l3dh[/{0|1|2|3}],<interval> 
     (`L2 or L3 D-cache Hits', alias for DC_miss_L2_L3_hit_nospec; precise load-store events)
    l3m[/{0|1|2|3}],<interval> 
     (`L3 D-cache Misses', alias for DC_miss_remote_L3_hit_nospec~emask=0x6; precise load-store events)
    l3m_spec[/{0|1|2|3}],<interval> 
     (`L3 D-cache Misses incl. Speculative', alias for DC_miss_remote_L3_hit~emask=0x6; events)
.
.
.
Raw HW counters available for profiling:
    Sel_pipe_drain_cycles[/{0|1|2|3}],<interval> (CPU-cycles)
    Sel_0_wait[/{0|1|2|3}],<interval> (CPU-cycles)
    Sel_0_ready[/{0|1|2|3}],<interval> (CPU-cycles)
    Sel_1[/{0|1|2|3}],<interval> (CPU-cycles)
    Sel_2[/{0|1|2|3}],<interval> (CPU-cycles)
    Pick_0[/{0|1|2|3}],<interval> (CPU-cycles)
    Pick_1[/{0|1|2|3}],<interval> (CPU-cycles)
    Pick_2[/{0|1|2|3}],<interval> (CPU-cycles)
    Pick_3[/{0|1|2|3}],<interval> (CPU-cycles)
    Pick_any[/{0|1|2|3}],<interval> (CPU-cycles)
    Branches[/{0|1|2|3}],<interval> (events)
    Instr_FGU_crypto[/{0|1|2|3}],<interval> (events)
    Instr_ld[/{0|1|2|3}],<interval> (precise load-store events)
    Instr_st[/{0|1|2|3}],<interval> (precise load-store events)

Format of the Aliased Hardware Counter List

In the aliased hardware counter list, the first field (for example, cycles) gives the alias name that can be used in the -h counter... argument of the collect command. This alias name is also the identifier to use in the er_print command.

The second field lists the available registers for the counter. For example, [/{0|1|2|3}].

The third field, <interval>, can be specified as on, hi, or low, or with a numerical value. If it is specified as on, hi, or low and the events arrive too rapidly, the rate is throttled back.

The fourth field, in parentheses, contains type information. It provides a short description (for example, CPU Cycles), the raw hardware counter name (for example, Cycles_user), and the type of units being counted (for example, CPU-cycles).

Possible entries in the type information field include the following:

precise: the counter interrupt occurs precisely when an instruction causes the event counter to overflow. The collect -h command for a precise counter collects memoryspace and dataspace data by default. See DataObjects View, DataLayout View, and MemoryObjects Views for details.
load, store, or load-store: the counter is memory-related.
not-program-related: the counter captures events initiated by some other program, such as CPU-to-CPU cache snoops. Using the counter for profiling generates a warning, and profiling does not record a call stack.

If the last or only word of the type information is:

CPU-cycles, the counter can be used to provide a time-based metric. The metrics reported for such counters are converted by default to inclusive and exclusive times, but can optionally be shown as event counts.
events, the metric is inclusive and exclusive event counts, and cannot be converted to a time.

In the aliased hardware counter list in the example, the type information contains the word CPU-cycles for the first counter and events for the second counter. For the third counter, the type information contains three words, precise load-store events.
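The four fields can be pulled apart mechanically. The sketch below parses one aliased entry in the layout shown in the example output. It is an illustration only: the regular expression is inferred from that sample and is an assumption, so output from other systems or tool versions might not match it.

```python
import re

# Pattern inferred from the sample listing in this section (an assumption).
ALIAS_ENTRY = re.compile(
    r"^(?P<name>\w+)"                          # field 1: alias name
    r"\[/\{(?P<regs>[0-9|]+)\}\],<interval>"   # fields 2 and 3
    r"\s*\(`(?P<desc>[^']*)', alias for (?P<raw>[^;]+); (?P<type>[^)]*)\)$"
)


def parse_alias_entry(text):
    """Parse one aliased counter entry, which may be wrapped over two lines.

    Returns a dict of fields, or None if the text does not match.
    """
    # Collapse line wraps and indentation into single spaces.
    match = ALIAS_ENTRY.match(" ".join(text.split()))
    if match is None:
        return None
    type_words = match.group("type").split()
    return {
        "name": match.group("name"),
        "registers": [int(reg) for reg in match.group("regs").split("|")],
        "description": match.group("desc"),
        "raw_name": match.group("raw"),
        "precise": "precise" in type_words,
        # The last or only word gives the units: CPU-cycles or events.
        "units": type_words[-1],
    }


# Third entry of the example listing, including its line wrap.
loads_entry = parse_alias_entry(
    "    loads[/{0|1|2|3}],<interval> \n"
    "     (`Load Instructions', alias for Instr_ld; precise load-store events)"
)
```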

Format of the Raw Hardware Counter List

The information included in the raw hardware counter list is a subset of the information in the aliased hardware counter list. Each line in the raw hardware counter list includes the internal counter name as used by cputrack(1), the register numbers on which that counter can be used, the default overflow value, the type information, and the counter units, which can be either CPU-cycles or events.

If the counter measures events unrelated to the program running, the first word of type information is not-program-related. For such a counter, profiling does not record a call stack, but instead shows the time as spent in an artificial function, collector_not_program_related. Thread and LWP IDs are recorded, but are meaningless.

The default overflow value for raw counters is 1000003. This value is not ideal for most raw counters, so you should specify overflow values when specifying raw counters.
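The following sketch assembles a counter definition string with an explicit overflow value, following the name[/register],<interval> layout shown in the counter lists. The helper name and the overflow value 100003 are hypothetical; choose a value suited to how often the event occurs, and check the collect documentation for the exact argument syntax on your system.

```python
KEYWORD_INTERVALS = {"on", "hi", "low"}


def counter_definition(name, interval, register=None):
    """Format one counter definition as name[/register],interval.

    `interval` is either one of the keywords on, hi, or low, or a
    positive integer overflow value.
    """
    if isinstance(interval, int) and not isinstance(interval, bool):
        if interval <= 0:
            raise ValueError("overflow value must be positive")
    elif interval not in KEYWORD_INTERVALS:
        raise ValueError("interval must be on, hi, low, or a positive integer")
    register_part = "" if register is None else f"/{register}"
    return f"{name}{register_part},{interval}"


# Raw counter with an explicit, illustrative overflow value.
raw_definition = counter_definition("Instr_ld", 100003)
# The same counter pinned to register 1.
pinned_definition = counter_definition("Instr_ld", 100003, register=1)
```

The resulting string is intended to be passed as the counter argument of collect -h.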