Oracle Solaris Studio 12.3: Performance Analyzer (Oracle Solaris Studio 12.3 Information Library)
Annotated source code for an experiment can be viewed in the Performance Analyzer by selecting the Source tab in the left pane of the Analyzer window. Alternatively, annotated source code can be viewed without running an experiment, using the er_src utility. This section of the manual describes how source code is displayed in the Performance Analyzer. For details on viewing annotated source code with the er_src utility, see Viewing Source/Disassembly Without An Experiment.
Annotated source in the Analyzer contains the following information:
The contents of the original source file
The performance metrics of each line of executable source code
Highlighting of code lines with metrics exceeding a specific threshold
Index lines
Compiler commentary
The Source tab is divided into columns, with fixed-width columns for individual metrics on the left and the annotated source taking up the remaining width on the right.
All lines displayed in black in the annotated source are taken from the original source file. The number at the start of a line in the annotated source column corresponds to the line number in the original source file. Any lines with characters displayed in a different color are either index lines or compiler commentary lines.
A source file is any file compiled to produce an object file or interpreted into byte code. An object file normally contains one or more regions of executable code corresponding to functions, subroutines, or methods in the source code. The Analyzer analyzes the object file, identifies each executable region as a function, and attempts to map the functions it finds in the object code to the functions, routines, subroutines, or methods in the source file associated with the object code. When the Analyzer succeeds, it adds an index line to the annotated source at the location corresponding to the first instruction in the function found in the object code.
The annotated source shows an index line for every function, including inline functions, even though inline functions do not appear in the list shown by the Functions tab. The Source tab displays index lines in red italics with text in angle brackets. The simplest type of index line corresponds to the function’s default context. The default source context for any function is defined as the source file to which the first instruction in that function is attributed. The following example shows an index line for a C function icputime.
                      578. int
                      579. icputime(int k)
       0.       0.    580. {
                           <Function: icputime>
As can be seen from the above example, the index line appears on the line following the first instruction. For C source, the first instruction corresponds to the opening brace at the start of the function body. In Fortran source, the index line for each subroutine follows the line containing the subroutine keyword. Also, a main function index line follows the first Fortran source instruction executed when the application starts, as shown in the following example:
                                1. ! Copyright (c) 2006, 2010, Oracle and/or its affiliates. All Rights Reserved.
                                2. ! @(#)omptest.f 1.11 10/03/24 SMI
                                3. ! Synthetic f90 program, used for testing openmp directives and the
                                4. !       analyzer
                                5.
       0.     0.     0.     0.  6.         program omptest
                                   <Function: MAIN>
                                7.
                                8. !$PRAGMA C (gethrtime, gethrvtime)
Sometimes the Analyzer cannot map a function it finds in the object code to any programming instructions in the source file associated with that object code; for example, the code might have been #included or inlined from another file, such as a header file.
Also displayed in red are special index lines and other special lines that are not compiler commentary. For example, as a result of compiler optimization, a special index line might be created for a function in the object code that does not correspond to code written in any source file. For details, refer to Special Lines in the Source, Disassembly and PCs Tabs.
Compiler commentary indicates how compiler-optimized code has been generated. Compiler commentary lines are displayed in blue, to distinguish them from index lines and original source lines. Various parts of the compiler can incorporate commentary into the executable. Each comment is associated with a specific line of source code. When the annotated source is written, the compiler commentary for any source line appears immediately preceding the source line.
The compiler commentary describes many of the transformations which have been made to the source code to optimize it. These transformations include loop optimizations, parallelization, inlining and pipelining. The following shows an example of compiler commentary.
       0.     0.     0.     0.     28.       SUBROUTINE dgemv_g2 (transa, m, n, alpha, b, ldb, &
                                   29.      &           c, incc, beta, a, inca)
                                   30.       CHARACTER (KIND=1) :: transa
                                   31.       INTEGER   (KIND=4) :: m, n, incc, inca, ldb
                                   32.       REAL      (KIND=8) :: alpha, beta
                                   33.       REAL      (KIND=8) :: a(1:m), b(1:ldb,1:n), c(1:n)
                                   34.       INTEGER            :: i, j
                                   35.       REAL      (KIND=8) :: tmr, wtime, tmrend
                                   36.       COMMON/timer/ tmr
                                   37.
                                       Function wtime_ not inlined because the compiler has not seen the body of the routine
       0.     0.     0.     0.     38.       tmrend = tmr + wtime()

                                       Function wtime_ not inlined because the compiler has not seen the body of the routine
                                       Discovered loop below has tag L16
       0.     0.     0.     0.     39.       DO WHILE(wtime() < tmrend)

                                       Array statement below generated loop L4
       0.     0.     0.     0.     40.         a(1:m) = 0.0
                                   41.
                                       Source loop below has tag L6
       0.     0.     0.     0.     42.         DO j = 1, n       ! <=-----\ swapped loop indices

                                       Source loop below has tag L5
                                       L5 cloned for unrolling-epilog.  Clone is L19
                                       All 8 copies of L19 are fused together as part of unroll and jam
                                       L19 scheduled with steady-state cycle count = 9
                                       L19 unrolled 4 times
                                       L19 has 9 loads, 1 stores, 8 prefetches, 8 FPadds, 8 FPmuls, and 0 FPdivs per iteration
                                       L19 has 0 int-loads, 0 int-stores, 11 alu-ops, 0 muls, 0 int-divs and 0 shifts per iteration
                                       L5 scheduled with steady-state cycle count = 2
                                       L5 unrolled 4 times
                                       L5 has 2 loads, 1 stores, 1 prefetches, 1 FPadds, 1 FPmuls, and 0 FPdivs per iteration
                                       L5 has 0 int-loads, 0 int-stores, 4 alu-ops, 0 muls, 0 int-divs and 0 shifts per iteration
       0.210  0.210  0.210  0.     43.           DO i = 1, m
       4.003  4.003  4.003  0.050  44.             a(i) = a(i) + b(i,j) * c(j)
       0.240  0.240  0.240  0.     45.           END DO
       0.     0.     0.     0.     46.         END DO
                                   47.       END DO
                                   48.
       0.     0.     0.     0.     49.       RETURN
       0.     0.     0.     0.     50.       END
You can set the types of compiler commentary displayed in the Source tab using the Source/Disassembly tab in the Set Data Presentation dialog box; for details, see Setting Data Presentation Options.
One very common optimization recognizes that the same expression appears in more than one place, and that performance can be improved by generating the code for that expression in one place. For example, if the same operation appears in both the if and the else branches of a block of code, the compiler can move that operation to just before the if statement. When it does so, it assigns line numbers to the instructions based on one of the previous occurrences of the expression. If the line numbers assigned to the common code correspond to one branch of an if structure, and the code actually always takes the other branch, the annotated source shows metrics on lines within the branch that is not taken.
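The situation described above can be pictured with a small C sketch. The function names `scale_before` and `scale_after` are hypothetical and are not taken from the manual; the second function is only a hand-written picture of what a compiler might generate, not the output of any particular compiler.

```c
/* Before optimization: the product a * b appears in both branches. */
int scale_before(int a, int b, int flag)
{
    int r;
    if (flag) {
        r = a * b + 1;   /* first occurrence of a * b */
    } else {
        r = a * b - 1;   /* second occurrence of a * b */
    }
    return r;
}

/* Hand-written picture of the transformed code: the product is computed
 * once, ahead of the branch. The instructions for t may carry the line
 * number of either occurrence in scale_before. */
int scale_after(int a, int b, int flag)
{
    int t = a * b;       /* common subexpression, computed once */
    return flag ? t + 1 : t - 1;
}
```

If the hoisted multiply were tagged with the line in the `if` branch while the program always ran with `flag` equal to 0, the annotated source would show time on a line in the branch that is never taken.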
The compiler can do several types of loop optimization. Some of the more common ones are as follows:
Loop unrolling
Loop peeling
Loop interchange
Loop fission
Loop fusion
Loop unrolling consists of repeating several iterations of a loop within the loop body, and adjusting the loop index accordingly. As the body of the loop becomes larger, the compiler can schedule the instructions more efficiently. Unrolling also reduces the overhead caused by the loop index increment and conditional check operations. The remainder of the loop is handled using loop peeling.
Loop peeling consists of removing a number of loop iterations from the loop, and moving them in front of or after the loop, as appropriate.
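As an illustration of unrolling combined with peeling, the following C sketch unrolls a summation loop by hand. The function names and the unroll factor of 4 are assumptions chosen for the example; a compiler picks its own factor and may place the peeled iterations differently.

```c
#include <stddef.h>

/* Reference loop: one element per iteration. */
double sum_simple(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Hand-unrolled by 4. The index test and increment now run once per
 * four elements instead of once per element. */
double sum_unrolled(const double *x, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += x[i];
        s += x[i + 1];
        s += x[i + 2];
        s += x[i + 3];
    }
    /* Peeled remainder: at most 3 leftover iterations after the main loop. */
    for (; i < n; i++)
        s += x[i];
    return s;
}
```

Both functions add the elements in the same order, so they return the same value; the unrolled version simply gives the scheduler a larger loop body to work with.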
Loop interchange changes the ordering of nested loops to minimize memory stride and thereby maximize cache-line hit rates.
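A minimal C sketch of loop interchange follows. The array dimensions and function names are hypothetical. C stores two-dimensional arrays row by row, so the interchanged version walks memory with a stride of one element.

```c
#define ROWS 4
#define COLS 3

/* Column-first traversal: the inner loop jumps COLS elements at a time. */
long sum_strided(int m[ROWS][COLS])
{
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}

/* After interchange: the inner loop visits adjacent elements in memory. */
long sum_interchanged(int m[ROWS][COLS])
{
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}
```

In Fortran, where arrays are stored column by column, the preferred loop order is the reverse of the one shown here.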
Loop fusion consists of combining adjacent or closely located loops into a single loop. The benefits of loop fusion are similar to those of loop unrolling. In addition, if the two original loops access common data, fusion improves cache locality and gives the compiler more opportunities to exploit instruction-level parallelism.
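The following C sketch shows loop fusion applied by hand. The function names are hypothetical, and the sketch assumes that the arrays `a`, `b`, and `c` do not overlap, which is the kind of condition a compiler has to establish before fusing.

```c
#include <stddef.h>

/* Two adjacent loops that both read c. */
void scale_offset_separate(double *a, double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = c[i] * 2.0;
    for (size_t i = 0; i < n; i++)
        b[i] = c[i] + 1.0;
}

/* Fused form: each c[i] is loaded once and used for both results. */
void scale_offset_fused(double *a, double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = c[i] * 2.0;
        b[i] = c[i] + 1.0;
    }
}
```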
Loop fission is the opposite of loop fusion: a loop is split into two or more loops. This optimization is appropriate if the number of computations in a loop becomes excessive, leading to register spills that degrade performance. Loop fission can also come into play if a loop contains conditional statements. Sometimes it is possible to split the loops into two: one with the conditional statement and one without. This can increase opportunities for software pipelining in the loop without the conditional statement.
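A hand-written C sketch of fission on a loop that contains a conditional is shown below. The function names are hypothetical, and the arrays are assumed not to overlap; with overlapping arrays the two forms could give different results.

```c
#include <stddef.h>

/* Original loop: straight-line work mixed with a conditional. */
void update_combined(double *a, double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = c[i] * 2.0;
        if (c[i] < 0.0)
            b[i] = 0.0;
    }
}

/* After fission: the first loop has no branch in its body, which makes
 * it a better candidate for software pipelining. */
void update_fissioned(double *a, double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = c[i] * 2.0;
    for (size_t i = 0; i < n; i++) {
        if (c[i] < 0.0)
            b[i] = 0.0;
    }
}
```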
Sometimes, with nested loops, the compiler applies loop fission to split a loop apart, and then performs loop fusion to recombine the loop in a different way to increase performance. In this case, you see compiler commentary similar to the following:
    Loop below fissioned into 2 loops
    Loop below fused with loop on line 116
    [116]    for (i=0;i<nvtxs;i++) {
With an inline function, the compiler inserts the function instructions directly at the locations where it is called instead of making actual function calls. Thus, similar to a C/C++ macro, the instructions of an inline function are replicated at each call location. The compiler performs explicit or automatic inlining at high optimization levels (4 and 5). Inlining saves the cost of a function call and provides more instructions for which register usage and instruction scheduling can be optimized, at the cost of a larger code footprint in memory. The following is an example of inlining compiler commentary.
                    Function initgraph inlined from source file ptralias.c
                       into the code for the following line
       0.       0.         44.       initgraph(rows);
Note - The compiler commentary does not wrap onto two lines in the Source tab of the Analyzer.
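The effect of inlining can be pictured with the following C sketch. The names `square`, `sum_squares_call`, and `sum_squares_inlined` are hypothetical; the last function is a hand-written picture of the code after the callee's body has been substituted at each call site, and is not the output of any particular compiler.

```c
/* A small function that is a natural candidate for inlining. */
static inline int square(int v)
{
    return v * v;
}

/* As written in the source: two calls to square. */
int sum_squares_call(int a, int b)
{
    return square(a) + square(b);
}

/* Picture of the code after inlining: the body of square is replicated
 * at each call site, and no call instructions remain. */
int sum_squares_inlined(int a, int b)
{
    return a * a + b * b;
}
```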
If your code contains Sun, Cray, or OpenMP parallelization directives, it can be compiled for parallel execution on multiple processors. The compiler commentary indicates where parallelization has and has not been performed, and why. The following shows an example of parallelization compiler commentary.
       0.       6.324       9. c$omp  parallel do shared(a,b,c,n) private(i,j,k)
                         Loop below parallelized by explicit user directive
                         Loop below interchanged with loop on line 12
       0.010    0.010     [10]            do i = 2, n-1

                         Loop below not parallelized because it was nested in a parallel loop
                         Loop below interchanged with loop on line 12
       0.170    0.170      11.               do j = 2, i
For more details about parallel execution and compiler-generated body functions, refer to Overview of OpenMP Software Execution.
Several other annotations for special cases can be shown under the Source tab, either in the form of compiler commentary, or as special lines displayed in the same color as index lines. For details, refer to Special Lines in the Source, Disassembly and PCs Tabs.
Source code metrics are displayed, for each line of executable code, in fixed-width columns. The metrics are the same as in the function list. You can change the defaults for an experiment using a .er.rc file; for details, see Commands That Set Defaults. You can also change the metrics displayed and highlighting thresholds in the Analyzer using the Set Data Presentation dialog box; for details, see Setting Data Presentation Options.
Annotated source code shows the metrics of an application at the source-line level. It is produced by taking the PCs (program counters) that are recorded in the application’s call stack, and mapping each PC to a source line. To produce an annotated source file, the Analyzer first determines all of the functions that are generated in a particular object module (.o file) or load object, then scans the data for all PCs from each function. To produce annotated source, the Analyzer must be able to find and read the object module or load object to determine the mapping from PCs to source lines, and it must be able to read the source file to produce the annotated copy that is displayed. See How the Tools Find Source Code for a description of the process used to find an experiment's source code.
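The PC-to-line mapping described above can be sketched in C as follows. This is an illustration under stated assumptions and not the Analyzer's actual implementation: the `line_entry` structure, the sorted line table, and both function names are hypothetical, and each recorded PC is treated as one sample.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical line-table entry: instructions from start_pc up to the
 * next entry's start_pc are attributed to `line`. */
struct line_entry {
    uintptr_t start_pc;
    int line;
};

/* Return the source line for pc, or -1 if pc lies before the first entry.
 * The table must be sorted by ascending start_pc. */
int pc_to_line(const struct line_entry *tab, size_t n, uintptr_t pc)
{
    size_t lo = 0, hi = n;               /* binary search over [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].start_pc <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? -1 : tab[lo - 1].line;
}

/* Add one sample per recorded PC to the per-line counters.
 * counts must have at least max_line + 1 elements. */
void attribute_samples(const struct line_entry *tab, size_t n,
                       const uintptr_t *pcs, size_t npcs,
                       unsigned *counts, int max_line)
{
    for (size_t i = 0; i < npcs; i++) {
        int line = pc_to_line(tab, n, pcs[i]);
        if (line >= 0 && line <= max_line)
            counts[line]++;
    }
}
```

The per-line totals produced this way correspond to the metric columns shown beside each source line; PCs that cannot be mapped are simply skipped in this sketch.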
The compilation process goes through many stages, depending on the level of optimization requested, and transformations take place which can confuse the mapping of instructions to source lines. For some optimizations, source line information might be completely lost, while for others, it might be confusing. The compiler relies on various heuristics to track the source line for an instruction, and these heuristics are not infallible.
Metrics for an instruction must be interpreted as metrics accrued while waiting for the instruction to be executed. If the instruction being executed when an event is recorded comes from the same source line as the leaf PC, the metrics can be interpreted as due to execution of that source line. However, if the leaf PC comes from a different source line than the instruction being executed, at least some of the metrics for the source line that the leaf PC belongs to must be interpreted as metrics accumulated while this line was waiting to be executed. An example is when a value that is computed on one source line is used on the next source line.
The issue of how to interpret the metrics matters most when there is a substantial delay in execution, such as at a cache miss or a resource queue stall, or when an instruction is waiting for a result from a previous instruction. In such cases the metrics for the source lines can seem to be unreasonably high, and you should look at other nearby lines in the code to find the line responsible for the high metric value.
The four possible formats for the metrics that can appear on a line of annotated source code are explained in Table 7-1.
Table 7-1 Annotated Source-Code Metrics