dcpicalc(1)
NAME
dcpicalc - Calculate cycles-per-instruction of procedures
SYNOPSIS
dcpicalc [<options>] -procedures procedure-name-list -- image-file
dcpicalc [<options>] procedure-name image-file
DESCRIPTION
Dcpicalc generates the control flow graph of the specified procedure(s) in the
specified image file. Using profiles collected by dcpid(1) and
stored in the specified profile files, dcpicalc augments the graph with estimated
execution frequencies of basic blocks, cycles-per-instruction for instructions, possible
explanations for stalls, and other useful information. The resulting flow graph is printed
to standard output.
The output can be converted to PostScript by dcpi2ps(1). In
the PostScript output, "larger" basic blocks are generally "more
important." Specifically, for each basic block, the font size indicates the block's
execution frequency, the physical space occupied by the block on paper indicates the
amount of time spent in that block, and the number of lines indicates the average number
of cycles required to execute it.
The first command syntax allows you to specify multiple procedures. dcpicalc
concatenates the outputs for the individual procedures, starting each with a line of the
form
; PROC procedure-name
Analyzing multiple procedures at a time is typically much more efficient than invoking
the command once per procedure, although dcpicalc reports exactly the same information in
both cases. The -procedures option can be mixed with the other options. The list of
procedures is terminated by "--" or another option. The second command syntax
can name only one procedure.
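Because the output for each procedure starts with a "; PROC" line, concatenated
output can be split back into per-procedure sections. The following Python sketch
assumes the header has exactly the form shown above; the function name
split_by_procedure and the sample text are hypothetical and are not part of dcpicalc.

```python
def split_by_procedure(output: str) -> dict[str, list[str]]:
    """Group output lines under the procedure named by the preceding '; PROC' line."""
    header = "; PROC "
    sections: dict[str, list[str]] = {}
    current = None
    for line in output.splitlines():
        if line.startswith(header):
            # A new procedure section begins here.
            current = line[len(header):].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# Hypothetical sample standing in for real dcpicalc output.
sample = "; PROC foo\nline a\nline b\n; PROC bar\nline c\n"
sections = split_by_procedure(sample)
print(sorted(sections))
```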
FLAGS
- -help
- Print information about options.
- -print_opcode
- Output the machine code, in hex, for each instruction.
- -cutoff n
- Omit basic blocks taking less than n% of the time spent in the procedure. The
instructions of these basic blocks are not printed. When the output is piped through dcpi2ps,
these basic blocks appear as tiny boxes with only block names. Note that n is a
floating point number between 0 and 100 (inclusive). The default value is 0: no blocks are
omitted.
- -procedures procedure-name-list
- Analyze the specified procedures. The list is terminated by "--" or another
option.
- -version
- Print program version information.
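The effect of -cutoff can be pictured with a small Python sketch. It assumes,
purely for illustration, that a block's share of the procedure's time is approximated
by its share of samples; the function blocks_to_print and the block names are
hypothetical, and dcpicalc's internal computation may differ.

```python
def blocks_to_print(block_samples: dict[str, int], cutoff_percent: float) -> list[str]:
    """Return the blocks whose share of the procedure's samples is at least the cutoff."""
    if not 0.0 <= cutoff_percent <= 100.0:
        raise ValueError("cutoff must be between 0 and 100 (inclusive)")
    total = sum(block_samples.values())
    if total == 0:
        return list(block_samples)
    return [
        name
        for name, samples in block_samples.items()
        if 100.0 * samples / total >= cutoff_percent
    ]


# Hypothetical per-block sample counts.
example_blocks = {"b1": 90, "b2": 9, "b3": 1}
kept = blocks_to_print(example_blocks, 5.0)
print(kept)
```

With the default cutoff of 0, every block is kept.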
FREQUENCY AND STALL ANALYSIS FLAGS
The following options can be used to control the heuristics for estimating execution
frequencies and identifying the causes of stalls.
- -conf_low
- Generate low, medium, and high confidence data.
- -conf_med
- Generate medium and high confidence data. (default)
- -conf_high
- Generate only high confidence data.
- -cross_procedure [optimistic | pessimistic | selective]
- Choose what assumption to make when a procedure call boundary is encountered while
looking for reasons to explain dynamic stalls. A procedure call boundary is either a call
made by the procedure being analyzed or the beginning or end of that procedure. With pessimistic,
assume that whatever happens outside the analyzed procedure can cause a dynamic stall
inside it. With optimistic, assume that it cannot. With selective, the
assumption is based on standard procedure call convention. (The default is optimistic.)
- -do_gp
- Use a (non-linear time) constraint solver to exploit global flow constraints when
estimating execution frequencies. The frequency estimates may still violate flow
constraints.
PROFILE FILE FLAGS
By default, this command automatically finds all of the relevant profile files. The
following options can be used to guide the search for the profile files.
- -db <directory name>
- Search for profile files in the specified profile database directory. The directory name
should be the same name as the one specified when dcpid was started. That is, the
named directory should contain a set of epochs. If this option is not specified, the
directory name is obtained from the DCPIDB environment variable. If neither this
option nor the DCPIDB environment variable is set, the name of the directory
used by the last invocation of dcpid on this machine is used. If none of these
methods succeeds in finding the appropriate directory, and no explicit set of profile files
is provided via the -profiles option, then the command fails.
- -epoch latest
- Search for profile files in the latest epoch. This is the default.
- -epoch latest-k
- Search for profile files in the "k+1"th most recent epoch. For example, search in
the third most recent epoch if "-epoch latest-2" is specified.
- -epoch all
- Search for profile files in all epochs.
- -epoch <name>
- Search for profile files in the named epoch. The epoch name should be the name of a
subdirectory corresponding to a single epoch within the profile database directory. Epoch
subdirectory names usually take the form YYMMDDHHMM
(year-month-day-hours-minutes). For example, an epoch started on December 4, 1996 at 23:34
is named 9612042334. If an epoch is given a symbolic name by creating a symbolic
link to the actual epoch directory, then the symbolic name can also be used as an argument
to the -epoch option.
- -events all
- Search for profile files corresponding to all event types such as cycles, icache misses,
branch mispredictions, etc. This is the default.
- -events type(+type)*
- Search for profile files for the specified event types. For example, search for cycles,
icache misses, and data cache misses when the option -events cycles+imiss+dmiss
is specified.
- -events all(-type)*
- Search for profile files for all event types except for the specified types. For
example, search for all event types except for branch mispredictions when the option -events
all-branchmp is specified.
- -label <label>
- Search for profile files with the specified label (see dcpilabel). If no labels
are specified on the command line, profile file labels are ignored entirely. If any labels
are specified on the command line (this option can be repeated several times), only
profile files that have one of the specified labels are used.
- -profiles <file names...> --
- Use just the profile files named by the specified file names. The list of profile file
names can be terminated either via --, or by the end of the option list. The
command prints an error message and fails if the -profiles option is used in
conjunction with any of the earlier automatic profile finding options. (Use either the
automatic profile lookup mechanism, or explicitly name the profile files with the -profiles
option, but not both.)
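The epoch naming convention described under -epoch can be illustrated with a short
Python sketch. It builds a YYMMDDHHMM name from a start time and picks the epoch that
"latest-k" would refer to, following the "third last epoch" example for latest-2. The
function names and the list of epochs are hypothetical, and the sketch assumes that
every epoch directory uses the usual YYMMDDHHMM form (no symbolic names) within a
single century, so that sorting the names sorts the epochs by time.

```python
from datetime import datetime


def epoch_name(start: datetime) -> str:
    """Format an epoch start time as YYMMDDHHMM."""
    return start.strftime("%y%m%d%H%M")


def resolve_latest(epochs: list[str], k: int = 0) -> str:
    """Return the epoch meant by 'latest-k' (k = 0 is the latest epoch)."""
    ordered = sorted(epochs)  # assumption: YYMMDDHHMM names sort chronologically
    if not 0 <= k < len(ordered):
        raise IndexError("not enough epochs for latest-%d" % k)
    return ordered[-(k + 1)]


name = epoch_name(datetime(1996, 12, 4, 23, 34))
print(name)

# Hypothetical database containing three epochs.
epochs = ["9612012200", "9612042334", "9612031000"]
print(resolve_latest(epochs, 0), resolve_latest(epochs, 2))
```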
INTERPRETING OUTPUT
Dcpicalc provides information at the instruction, basic block, and procedure level.
Dcpicalc is sometimes unable to estimate the cycle-to-sample ratio for a block. Such
blocks are excluded from all summary information except the instruction count. Dcpicalc
makes no attempt to identify stalls (static or dynamic) in such blocks. Therefore, most of
the following discussion pertains only to blocks with known cycle-to-sample ratios.
Instruction Level Information
At the instruction level, dcpicalc inserts "bubbles" into the instruction
listings to identify points where the processor stalls because it is unable to issue an
instruction. Bubbles are inserted before the stalled instruction. Here is an
example.
588584 318:2e4c0000 ldq_u a2, 0(s3) 1558 1
588588 318:a79d2d70 ldq at, 11632(gp) 191855 0 1.5cy
a
a
58858c 318:4a4c00d2 extbl a2, s3, a2 164109 2 1.5cy 8584
s
d
d
d
d
d
d
588590 318:43920412 addq at, a2, a2 428395 1 4.0cy 8588
b
?
?
588594 318:2c320000 ldq_u t0, 0(a2) 227783 1 2.0cy 8590
s
588598 318:22520001 lda a2, 1(a2) 121068 1 1.0cy
b
d
d
d
d
58859c 318:48320f41 extqh t0, a2, t0 336123 1 3.0cy 8598 8594
s
5885a0 318:48271781 sra t0, 0x38, t0 123408 1 1.0cy
b
5885a4 318:41810402 addq s3, t0, t1 127442 1 1.0cy 85a0
s
5885a8 318:2c620000 ldq_u t2, 0(t1) 123021 1 1.0cy
5885ac 318:47ff041f bis zero, zero, zero 0 0 nop
a
a
d
d
d
d
d
d
d
d
5885b0 318:486200c4 extbl t2, t1, t3 658189 2 6.0cy 85a8
5885b4 318:47ff0403 bis zero, zero, t2 0 0
5885b8 318:48807630 zapnot t3, 0x3, a0 122504 1 1.0cy
5885bc 318:47ff041f bis zero, zero, zero 0 0 nop
i
5885c0 318:421fd9b1 cmplt a0, 0xfe, a1 155841 1 1.5cy
5885c4 318:e6200002 beq a1, 0x1205885d0 0 0
Each line of assembly code shows, from left to right,
- the instruction's address (hexadecimal),
- the source line number (decimal),
- the instruction's 32-bit machine code in hexadecimal (if -print_opcode),
- the instruction in mnemonics,
- the number of PC samples falling at this instruction address (decimal),
- the minimum number of cycles the instruction is predicted to spend at the head of the issue
queue (actual schedule may vary),
- (optionally) the average number of cycles spent at this instruction address
- (optionally) the other instructions that may have caused this instruction to stall (see
details below).
Each line in the listing represents a half-cycle, which makes it easy to see whether
instructions are being dual-issued. To avoid excessively long listings, however, dcpicalc
represents a very long stall with a large but limited number of bubbles. The actual number
of stall cycles is shown as a number along with the bubbles.
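For readers who want to post-process a listing, the fields described above can be
pulled apart with a regular expression. The Python sketch below is an illustration
only: it assumes the listing was produced with -print_opcode and laid out as in the
example (source line and opcode joined by a colon, fields separated by whitespace), and
it treats everything after the cycle counts as either the "nop" marker or culprit
references. The names LINE_RE and parse_instruction_line are hypothetical.

```python
import re

# Assumed layout: address, line:opcode, mnemonic text, samples, minimum cycles,
# optional average cycles ("3.0cy"), then optional trailing tokens.
LINE_RE = re.compile(
    r"^(?P<addr>[0-9a-f]+)\s+"
    r"(?P<srcline>\d+):(?P<opcode>[0-9a-f]{8})\s+"
    r"(?P<instr>.+?)\s+"
    r"(?P<samples>\d+)\s+(?P<min_cycles>\d+)"
    r"(?:\s+(?P<avg_cycles>\d+(?:\.\d+)?)cy)?"
    r"(?P<rest>(?:\s+\S+)*)$"
)


def parse_instruction_line(line: str):
    """Return a dict of fields, or None for bubble lines and other text."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    tokens = m.group("rest").split()
    avg = m.group("avg_cycles")
    return {
        "address": int(m.group("addr"), 16),
        "source_line": int(m.group("srcline")),
        "opcode": m.group("opcode"),
        "instruction": m.group("instr"),
        "samples": int(m.group("samples")),
        "min_cycles": int(m.group("min_cycles")),
        "avg_cycles": float(avg) if avg else None,
        "is_nop": "nop" in tokens,
        "culprits": [t for t in tokens if t != "nop"],
    }


stalled = parse_instruction_line(
    "58859c 318:48320f41 extqh t0, a2, t0 336123 1 3.0cy 8598 8594"
)
nop = parse_instruction_line("5885ac 318:47ff041f bis zero, zero, zero 0 0 nop")
print(stalled["instruction"], stalled["culprits"], nop["is_nop"])
```

Bubble lines such as "d" or "s" do not match the pattern, so the function returns
None for them.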
Stall cycles are either static or dynamic. Static stall cycles are those that the
processor would suffer even if there were no dynamic stalls (e.g., if all memory loads hit
in the D-cache and all conditional branches are predicted correctly). The rest are
dynamic. The bubbles for the static and dynamic stall cycles are shown in different
columns.
In the static column (the leftmost column), bubbles have the following meanings:
- s refers to stall cycles resulting from static resource conflicts among the instructions
within the same "window" (consisting of two instructions for Alpha 21064 and
four for 21164) that the processor considers for issue in any given cycle.
- a/b/c refer to stall cycles caused by register dependencies on previous instructions
involving, respectively, Ra/Rb/Rc of the stalled instruction.
- f refers to stall cycles caused by competition for the function units and other internal
resources in the processor.
In the dynamic column(s), there may be multiple possible explanations for the same
stall cycles; sometimes there may be none. Each explanation is represented by a column of
bubbles. In some cases, dcpicalc can compute the maximum number of stall cycles that a
particular reason can account for. If this is less than the number of stall cycles, the
column for that reason may not extend all the way down to the stalled instruction.
The bubbles have the meanings below.
- d - D-cache miss
- D - DTB miss
- I - I-cache or ITB miss
- i - I-cache miss (but not ITB miss)
- w - write buffer overflow
- y - synchronization of memory operations (using memory barriers)
- p - branch misprediction
- f - busy function unit
- o - other (currently TRAPB, EXCB, or load-after-store replay trap)
- ? - unexplained
Several points are worth mentioning here. First, notice that there is no symbol for
ITB miss alone because an I-cache miss is possible whenever an ITB miss is possible.
Second, "other" means miscellaneous other reasons that typically account for
only a tiny percentage of stalls. Currently it includes stalls at TRAPB or EXCB
instructions, which are not issued until all previous instructions are guaranteed to
complete without traps (TRAPB) or without traps and exceptions (EXCB). Third, the symbol
"f" may appear in both the static and dynamic columns because competition for
function units may explain both static and dynamic stalls. For example, the stall caused
by a floating-point division may be partly static, because part of it can be predicted by
scheduling the instructions, and partly dynamic, because part of it is data dependent. An
"f" in the dynamic column typically means a busy integer multiply or
floating-point divide unit.
For each stalled instruction, dcpicalc also lists instructions that may have caused the
stalls. This list appears at the end of the line showing the stalled instruction. A
four-digit hexadecimal address indicates an instruction in the same basic block as the
stalled instruction; a full block name with a four-digit hexadecimal address indicates an
instruction in another basic block; a full block name without an address indicates that
the instruction potentially causing the stall is assumed to be in another
procedure, which can be a callee or the caller of the current procedure. Note that the
lists of instructions and explanations are not always exhaustive, in part because longer
stalls may hide shorter ones.
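In the example listing, the four-digit culprit addresses match the low-order four
hexadecimal digits of an instruction address in the same block (for instance, 8584 next
to the instruction at 58858c refers to 588584). The Python sketch below expands such a
short address under the assumption, not stated by this page, that the culprit shares
all higher-order address digits with the stalled instruction. The function name
resolve_culprit is hypothetical.

```python
def resolve_culprit(stalled_addr: int, short_addr: str) -> int:
    """Expand a four-digit culprit address using the stalled instruction's upper bits."""
    low = int(short_addr, 16)
    if not 0 <= low <= 0xFFFF:
        raise ValueError("expected at most four hexadecimal digits")
    # Assumption: both instructions share everything above the low 16 bits.
    return (stalled_addr & ~0xFFFF) | low


full = resolve_culprit(0x58858C, "8584")
print(hex(full))
```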
If an instruction is a nop, dcpicalc will indicate it by appending "nop" to
the line showing the instruction.
Block Level Information
At the beginning of a block, dcpicalc displays summary information for the block. For
example,
*** One cycle = 714428 samples
*** Executed 4.83 times/invocation
*** Best-case 8/9 = 0.89CPI, Actual 22/9 = 2.44CPI
*** (36% execution without dynamic stalls)
The first line is the cycle-to-sample ratio for the block -- this is dcpicalc's estimate of
how many PC samples in the profiling data correspond to one cycle. The next line is the
average number of times the block is executed relative to the number of times the entry
and/or exit blocks are executed. The third line displays the best-case and actual cycles
per instruction (CPI) for the block. The best-case scenario includes all stalls statically
predictable from the instruction stream (e.g., an Alpha 21164 cannot dual-issue
consecutive store instructions) but assumes that there are no dynamic stalls (e.g., all
load instructions hit in the D-cache). The last line above displays the best-case cycles
per instruction as a percentage of the actual.
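The arithmetic behind the third and fourth summary lines can be checked with a few
lines of Python. The sketch uses the figures from the example (8 best-case cycles, 22
actual cycles, 9 instructions); the function name block_summary is hypothetical.

```python
def block_summary(best_cycles: int, actual_cycles: int, instructions: int):
    """Return (best-case CPI, actual CPI, best-case as a percentage of actual)."""
    best_cpi = best_cycles / instructions
    actual_cpi = actual_cycles / instructions
    pct_without_dynamic = 100.0 * best_cycles / actual_cycles
    return best_cpi, actual_cpi, pct_without_dynamic


best_cpi, actual_cpi, pct = block_summary(8, 22, 9)
print(f"Best-case {best_cpi:.2f}CPI, Actual {actual_cpi:.2f}CPI")
print(f"({pct:.0f}% execution without dynamic stalls)")
```

Because both CPI values share the same instruction count, the percentage reduces to
best-case cycles divided by actual cycles (8/22).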
Procedure Level Information
At the procedure level, dcpicalc displays summary information in the entry block. This
information includes the number of instructions in the procedure, averages of the
best-case and actual cycles per instruction (computed from the per-block values weighted
by block execution frequencies), and a sorted list of blocks accounting for 90% of the
stalls in the procedure.
Moreover, dcpicalc summarizes how the cycles are spent. Here is a sample summary
followed by line-by-line explanations.
Line 1 I-cache (not ITB) 3.5% to 7.4%
Line 2 ITB/I-cache miss 3.7% to 3.7%
Line 3 D-cache miss 25.2% to 27.2%
Line 4 DTB miss 0.0% to 1.7%
Line 5 Write buffer 0.0% to 0.0%
Line 6 Synchronization 0.0% to 0.0%
Line 7 Branch mispredict 0.7% to 2.6%
Line 8 IMUL busy 0.0% to 0.0%
Line 9 FDIV busy 0.0% to 0.0%
Line 10 Other 0.0% to 0.0%
Line 11 Unexplained stall 1.9% to 1.9%
Line 12 Unexplained gain -0.8% to -0.8%
----------------------------------------
Line 13 Subtotal dynamic 38.4%
Line 14 Slotting 6.4%
Line 15 Ra dependency 10.0%
Line 16 Rb dependency 2.9%
Line 17 Rc dependency 0.0%
Line 18 FU dependency 1.9%
----------------------------------------
Line 19 Subtotal static 21.2%
----------------------------------------
Line 20 Total stall 59.6%
Line 21 Useful 39.4%
Line 22 Nops 1.2%
----------------------------------------
Line 23 Execution 40.6%
Line 24 Net sampling error -0.2%
----------------------------------------
Line 25 Total tallied 100.0%
Line 26 (114504716, 88.8% of all samples)
- Lines 1 to 13
- show all dynamic stall cycles. See previous discussion of instruction level information
for the meanings of these categories. Unexplained stall (line 11) represents stall cycles
for which dcpicalc cannot offer any plausible explanation. Unexplained gain (line 12)
occurs when instructions take fewer cycles than even the ideal assumption. For example,
since we take dual-issue as the ideal case, if in fact three instructions are issued (two
to the integer pipelines and one to a floating point pipeline), half a cycle would be
attributed to "unexplained gain." For the difference between "I-cache (not
ITB)" and "ITB/I-cache miss," please see the earlier discussion on the
corresponding bubbles `i' and `I'.
Dcpicalc shows a range of stall cycles (as a
percentage of total cycles tallied) that could have been caused by each reason listed.
Some of the ranges may be wide if major stalls can be explained by more than one reason.
Generally, the accuracy of the analysis can be improved using profiles for non-cycles
events. Currently, dcpicalc takes advantage of imiss, itbmiss, and dtbmiss profiles if
they are specified on the command line. Although the contributions of individual stall
reasons are reported as ranges, the subtotal for all dynamic stalls is not. It represents
the cycles attributed to any one or more of the reasons. Therefore, it does not depend on
how stall cycles are apportioned among alternative reasons for the same stall.
- Lines 14 to 19
- show the static stall cycles. These are stall cycles that the processor would suffer
even if there were no dynamic stalls. For example, this assumes that a load from memory
takes only two cycles, which corresponds to a D-cache hit. Additional stall cycles due to
a cache miss are considered dynamic. If an instruction is stalled for multiple reasons,
the static stall cycles are attributed to the last reason preventing instruction issue.
Thus, shorter stalls are hidden by longer ones.
- Slotting (line 14)
- refers to stall cycles resulting from static resource conflicts among the instructions
within the same "window" that the processor considers for issue in any given
cycle.
- Ra/Rb/Rc dependencies (lines 15-17)
- refer to stall cycles caused by register dependencies on previous instructions
involving, respectively, Ra/Rb/Rc of the stalled instruction.
- FU dependency (line 18)
- refers to stall cycles caused by competition for function units and other internal
resources in the processor.
- Lines 21 to 23
- are the numbers of cycles spent on executing instructions. Line 23 includes all
instructions; line 22 includes nops; line 21 includes "useful" instructions
(i.e., instructions other than nops). Each of them is simply half the number of executed
instructions (of the respective type) since we assume dual-issue to be the ideal case.
This percentage may exceed 100%. One reason is that the Alpha 21164 may issue floating point
instructions in addition to two integer instructions per cycle. Since dcpicalc assumes
dual issue to be the ideal case (corresponding to 100% execution), the extra instructions
would cause this percentage to exceed 100%. Another possible explanation is discrepancies
due to sampling error in rarely executed code.
Note that the time spent on
"nops" is not necessarily wasted. These operations are often inserted
deliberately by the compiler's instruction scheduler to improve instruction execution by
the processor's pipeline. If they were removed, fewer instructions would be executed, but
execution might not take any less time.
- Line 24
- is the net discrepancy due to sampling error and inaccuracy in execution frequency
estimates. This can give some indication of how noisy the sample data are, but since it is a
net discrepancy, two discrepancies of opposite signs may cancel each other out, giving a
small error term. However, significant discrepancies are attributed to unexplained stall
and gain (lines 11 and 12); they do not cancel out.
- Line 25
- is simply the sum of the subtotals. It should always be 100%. If not, report a bug!
- Line 26
- shows the total number of samples tallied for this summary, and its ratio to the number
of all samples for this procedure. We tally only the samples falling in basic blocks whose
execution frequencies have been determined by dcpicalc. All previous percentages in the
summary are computed relative to the number of tallied samples.
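The subtotals in the sample summary can be reproduced from its individual lines. The
Python sketch below uses the sample figures and assumes that "the sum of the subtotals"
on line 25 means Total stall plus Execution plus Net sampling error, which is the
combination that yields 100.0% for these numbers. The variable names are illustrative.

```python
# Figures copied from the sample summary (percent of tallied cycles).
static_stalls = {
    "Slotting": 6.4,
    "Ra dependency": 10.0,
    "Rb dependency": 2.9,
    "Rc dependency": 0.0,
    "FU dependency": 1.9,
}
subtotal_dynamic = 38.4      # line 13
useful, nops = 39.4, 1.2     # lines 21 and 22
net_sampling_error = -0.2    # line 24

subtotal_static = sum(static_stalls.values())                  # line 19
total_stall = subtotal_dynamic + subtotal_static               # line 20
execution = useful + nops                                      # line 23
total_tallied = total_stall + execution + net_sampling_error   # line 25

print(f"Subtotal static {subtotal_static:.1f}%")
print(f"Total stall     {total_stall:.1f}%")
print(f"Execution       {execution:.1f}%")
print(f"Total tallied   {total_tallied:.1f}%")
```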
TYPICAL USAGE
Typically, dcpicalc, dcpisource, and dcpi2ps are used
together as follows:
dcpicalc -db db main myprog.exe | \
dcpisource -f C:\src\main.c | \
dcpi2ps -o main.ps
It is also possible to read the ASCII output of dcpicalc directly.
SEE ALSO
dcpi(1), dcpiflow(1), dcpiprof(1), dcpilist(1), dcpidis(1), dcpiscan(1), dcpid(1), dcpiepoch(1), dcpiflush(1), dcpilabel(1), dcpi2ps(1), dcpicat(1), dcpiquit(1), dcpidiff(1), dcpitopstalls(1), dcpiwhatcg(1),
dcpictl(1), dcpiformat(4)
For more information, see the HP Continuous Profiling Infrastructure project home page (http://h30097.www3.hp.com/dcpi).
COPYRIGHT
Copyright 1996-2004, Hewlett-Packard Development Company, L.P.
AUTHORS
Shun-Tak Leung, Dick Sites, Mark Vandevoorde
This page was generated automatically by mtex software.