IBM SPEC CPU2000 Flag Descriptions for Opteron

Portland Group Compiler Technology's Fortran compiler pgf90 5.0-1
GCC C Compiler version 3.3

Last updated: 12-August-2003

Portability Options

-DSPEC_CPU2000_LP64 (Portability)
Use the source code paths written for the LP64 data model, in which longs and pointers are 64 bits wide.
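The C fragment below illustrates how a benchmark source might consume this macro. The type name spec_word_t and the function is_lp64 are hypothetical and are not taken from the SPEC sources.

```c
/* Hypothetical: a source file picks its 64-bit "word" type based on
 * whether -DSPEC_CPU2000_LP64 was passed on the command line. */
#ifdef SPEC_CPU2000_LP64
typedef long spec_word_t;        /* LP64: long is already 64 bits */
#else
typedef long long spec_word_t;   /* otherwise fall back to long long */
#endif

/* Returns 1 when the compiler's data model really is LP64,
 * i.e. both long and pointers are 8 bytes wide. */
int is_lp64(void)
{
    return sizeof(long) == 8 && sizeof(void *) == 8;
}
```

Defining the macro does not change the size of long. It only tells the sources which data model to expect, so it belongs only in builds for an LP64 target such as 64-bit Linux on Opteron.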

Flags and Compiler options for the Portland Group Compiler Technology's Fortran compiler

Fortran pgf90 5.0-1

The optimization levels and their meanings are as follows:

-O0 A basic block is generated for each Fortran statement. No scheduling is done
between statements. No global optimizations are performed.

-O1 Scheduling within extended basic blocks is performed.
Some register allocation is performed. No global optimizations
are performed.

-O2
All level 1 optimizations are performed. In addition, scalar
optimizations such as induction recognition and loop invariant motion are
performed by the global optimizer.

-O3 This level performs all level-one and level-two optimizations and enables more
aggressive hoisting and scalar replacement optimizations.

-fast Chooses generally optimal flags for the target platform. Equivalent to
"-O2 -Munroll -Mnoframe"

-fastsse Chooses generally optimal flags for machines that support SSE instructions.
Equivalent to "-fast -Mscalarsse -Mvect=sse -Mcache_align -Mflushz"

IPA: InterProcedural Analyzer (controlled by the -Mipa options)

-Mcache_align (PGI Fortran Compiler)
Align unconstrained objects of length greater than or equal to 16 bytes on
cache-line boundaries. An unconstrained object is a data object that is not
a member of an aggregate structure or common block. This option does
not affect the alignment of allocatable or automatic arrays.

Note: To effect cache-line alignment of stack-based local variables, the
main program or function must be compiled with -Mcache_align.
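The C sketch below aligns a single object by hand, which -Mcache_align does automatically. It assumes a 64-byte cache line (the Opteron line size) and uses the GCC/Clang aligned attribute. The array coeffs and the helper is_cache_aligned are hypothetical.

```c
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache-line size: 64 bytes on Opteron */

/* A 32-byte object (>= 16 bytes, so -Mcache_align would cover it),
 * placed on a cache-line boundary explicitly with a GCC attribute. */
double coeffs[4] __attribute__((aligned(CACHE_LINE)));

/* Returns 1 if p starts exactly on a cache-line boundary. */
int is_cache_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}
```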

-Mfixed (PGI Fortran Compiler)
Process source using Fortran fixed-form specifications.

-Mflushz (PGI Fortran Compiler)
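Set the SSE MXCSR register to flush-to-zero mode, so that floating-point
results too small to be represented as normalized numbers are replaced by zero.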
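The C sketch below shows the effect directly. It multiplies two small floats, optionally setting the flush-to-zero bit through the <xmmintrin.h> macros, and then restores MXCSR. The function mul_with_ftz is hypothetical. The SSE path is compiled only where GCC or Clang define __SSE_MATH__ (for example on x86-64); on other targets the function just multiplies.

```c
#if defined(__SSE_MATH__)
#include <xmmintrin.h>
#define HAVE_SSE_FTZ 1
#else
#define HAVE_SSE_FTZ 0
#endif

/* Hypothetical helper: multiplies a * b, optionally with the MXCSR
 * flush-to-zero bit set, and restores the previous MXCSR afterwards.
 * volatile keeps the compiler from folding the product at compile time. */
float mul_with_ftz(float a, float b, int enable_ftz)
{
    volatile float x = a, y = b;
    volatile float r;
#if HAVE_SSE_FTZ
    unsigned int saved = _mm_getcsr();   /* remember current MXCSR */
    if (enable_ftz)
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    r = x * y;                           /* a tiny result is flushed if FTZ is on */
    _mm_setcsr(saved);                   /* restore the caller's MXCSR */
#else
    (void)enable_ftz;                    /* no SSE MXCSR on this target */
    r = x * y;
#endif
    return r;
}
```

On x86-64, mul_with_ftz(1e-20f, 1e-20f, 0) returns a subnormal value near 1e-40. The same call with enable_ftz set returns 0.0f.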

-Mipa=align (PGI Fortran Compiler)
Instructs the IPA to recognize when pointer targets are all cache-line
aligned, allowing better SSE code generation.

-Mipa=arg (PGI Fortran Compiler)
Instructs the IPA to remove arguments replaced by -Mipa=ptr,const.

-Mipa=const (PGI Fortran Compiler)
Enable propagation of constants across procedure calls.
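Here is the same idea expressed in C (pgf90 applies it to Fortran). When every call passes the same constant, interprocedural constant propagation lets the callee be compiled as if the constant were written into it. The functions below are a hypothetical hand-written before/after pair, not compiler output.

```c
/* Hypothetical "before": every call site passes factor = 2. */
static double scale(double x, int factor)
{
    return x * factor;
}

double caller(double x)
{
    return scale(x, 2) + scale(x + 1.0, 2);
}

/* Hypothetical "after": the constant 2 is folded into the body.
 * Dropping the now-unused parameter corresponds to -Mipa=arg. */
static double scale_const(double x)
{
    return x * 2;
}

double caller_ipa(double x)
{
    return scale_const(x) + scale_const(x + 1.0);
}
```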

-Mipa=fast (PGI Fortran Compiler)
Equivalent to: -Mipa=const,globals,localarg,ptr,vestigial

-Mipa=globals (PGI Fortran Compiler)
Instructs the IPA to optimize references to globals when not used in procedure calls.

-Mipa=localarg (PGI Fortran Compiler)
Externalizes local variables for use with -Mipa=arg.

-Mipa=ptr (PGI Fortran Compiler)
Instructs the IPA to perform pointer disambiguation across procedure calls.

-Mipa=vestigial (PGI Fortran Compiler)
Instructs the IPA to eliminate functions that are not called.

-Mnoframe (PGI Fortran Compiler)
Eliminate operations that set up a true stack frame pointer for functions.

-Mnosmart (PGI Fortran Compiler)
Do not run the Smart assembly rewrite tool, which performs post-compilation
linear assembly scheduling and optimization.

-Mscalarsse (PGI Fortran Compiler)
Use SSE (Streaming SIMD Extensions, where SIMD stands for Single Instruction
Multiple Data) and SSE2 instructions to perform the scalar floating-point
operations coded. This assumes the user has an assembler capable of
interpreting SSE/SSE2 instructions, as in later versions of Linux. This
implies -Mflushz.

-Munroll (PGI Fortran Compiler)
Invokes the loop unroller. This also sets the optimization level to 2 if the
level is set to less than 2.

Sub-options (for example, -Munroll=c:4 or -Munroll=n:4):

c:m Instructs the compiler to completely unroll loops with a
constant loop count less than or equal to m, a supplied constant.
If m is not supplied, it defaults to 4.

n:u Instructs the compiler to unroll u times any loop that is
not completely unrolled or that has a non-constant loop count.
If u is not supplied, the unroller computes the number of times a
candidate loop is unrolled.
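A hand-unrolled C loop shows what the n:u sub-option with u = 4 asks for. This sketches the general technique, not the code pgf90 generates, and the function names are hypothetical. Both versions perform the additions in the same order, so their floating-point results are identical.

```c
#include <stddef.h>

/* The loop as written in the source. */
double sum_plain(const double *a, size_t n)
{
    double s = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: a main loop that handles four elements per
 * iteration, then a remainder loop for the n % 4 leftovers. */
double sum_unrolled4(const double *a, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)   /* remainder loop */
        s += a[i];
    return s;
}
```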

-Mvect=sse (PGI Fortran Compiler)
Instructs the vectorizer to search for loops, and where possible,
use the SSE or SSE2 and prefetch instructions
(depending on which processor is targeted).

Flags and Compiler options for the GCC C Compiler version 3.3

-O0 (GCC C Compiler)

Do not optimize. This is the default.

-O (GCC C Compiler)
-O1
Optimize. Optimizing compilation takes somewhat more time, and a
lot more memory for a large function.

With `-O', the compiler tries to reduce code size and execution
time, without performing any optimizations that take a great deal
of compilation time.

`-O' turns on the following optimization flags:
-fcprop-registers
-fcrossjumping
-fdefer-pop
-fdelayed-branch
-fif-conversion
-fif-conversion2
-floop-optimize
-fmerge-constants
-fthread-jumps

`-O' also turns on `-fomit-frame-pointer' on machines where doing
so does not interfere with debugging.

-O2 (GCC C Compiler)
Optimize even more. GCC performs nearly all supported
optimizations that do not involve a space-speed tradeoff. The
compiler does not perform loop unrolling or function inlining when
you specify `-O2'. As compared to `-O', this option increases
both compilation time and the performance of the generated code.

`-O2' turns on all optimization flags specified by `-O'. It also
turns on the following optimization flags:
-falign-functions
-falign-jumps
-falign-labels
-falign-loops
-fcaller-saves
-fcse-follow-jumps
-fcse-skip-blocks
-fdelete-null-pointer-checks
-fexpensive-optimizations
-fforce-mem
-fgcse
-fgcse-lm
-fgcse-sm
-foptimize-sibling-calls
-fpeephole2
-fregmove
-freorder-blocks
-freorder-functions
-frerun-cse-after-loop
-frerun-loop-opt
-fschedule-insns
-fschedule-insns2
-fsched-interblock
-fsched-spec
-fstrength-reduce
-fstrict-aliasing

Please note the warning under `-fgcse' about invoking `-O2' on
programs that use computed gotos.

-O3 (GCC C Compiler)
Optimize yet more. `-O3' turns on all optimizations specified by
`-O2' and also turns on the following:
-finline-functions
-frename-registers

-falign-functions (GCC C Compiler)
-falign-functions=N
Align the start of functions to the next power of two greater than or
equal to N, skipping at most N-1 bytes. For instance, `-falign-functions=32'
aligns functions to the next 32-byte boundary, but `-falign-functions=24'
aligns to the next 32-byte boundary only if this can be done by skipping
23 bytes or less.

`-fno-align-functions' and `-falign-functions=1' are equivalent and mean
that functions will not be aligned.

Some assemblers only support this flag when N is a power of two; in that
case, it is rounded up.

If N is not specified, use a machine-dependent default.
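The skipping rule is easiest to check as arithmetic. The C function below is a hypothetical model of the padding decision, not GCC source. It takes the skip limit to be N - 1 bytes, which is what the `-falign-functions=24' example (at most 23 bytes) implies.

```c
#include <stdint.h>

/* Smallest power of two >= n (n >= 1). */
static uintptr_t next_pow2(uintptr_t n)
{
    uintptr_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Hypothetical model of -falign-functions=N: the number of padding
 * bytes placed before a function that would otherwise start at addr.
 * The boundary is the next power of two >= N, and padding is added
 * only if it needs at most N - 1 bytes; otherwise none is added. */
uintptr_t align_padding(uintptr_t addr, uintptr_t n)
{
    uintptr_t boundary, pad;
    if (n <= 1)
        return 0;                        /* N = 1 means no alignment */
    boundary = next_pow2(n);
    pad = (boundary - addr % boundary) % boundary;
    return pad <= n - 1 ? pad : 0;
}
```

For example, a function that would start at address 0x1001 gets 31 bytes of padding with N = 32 but none with N = 24. A function at 0x1010 gets 16 bytes with N = 24.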

-falign-jumps (GCC C Compiler)
-falign-jumps=N
Align branch targets to a power-of-two boundary when they can only be
reached by jumping, skipping bytes in the same way as `-falign-functions'.
In this case, no dummy operations need be executed.

If N is not specified, use a machine-dependent default.

-falign-labels (GCC C Compiler)
-falign-labels=N
Align all branch targets to a power-of-two boundary, skipping bytes in
the same way as `-falign-functions'. This option can easily make code
slower, because it must insert dummy operations for when the branch
target is reached in the usual flow of the code.

If `-falign-loops' or `-falign-jumps' are applicable and are greater
than this value, then their values are used instead.

If N is not specified, use a machine-dependent default which is very
likely to be `1', meaning no alignment.

-falign-loops (GCC C Compiler)
-falign-loops=N
Align loops to a power-of-two boundary, skipping bytes in the same way as
`-falign-functions'. The hope is that the loop will be executed many
times, which will make up for any execution of the dummy operations.

If N is not specified, use a machine-dependent default.

-fbranch-probabilities (GCC C Compiler)
After running a program compiled with -fprofile-arcs, you can compile it
a second time using -fbranch-probabilities to improve optimizations
based on the number of times each branch was taken. When the program
compiled with -fprofile-arcs exits, it saves arc execution counts to a
file called sourcename.da for each source file. The information in this
data file is very dependent on the structure of the generated code, so
you must use the same source code and the same optimization options for
both compilations.

With -fbranch-probabilities, GCC puts a REG_EXEC_COUNT note on the first
instruction of each basic block, and a REG_BR_PROB note on each JUMP_INSN
and CALL_INSN. These can be used to improve optimization. Currently, they
are only used in one place: in reorg.c, instead of guessing which path a
branch is most likely to take, the REG_BR_PROB values are used to exactly
determine which path is taken more often.

-fcaller-saves (GCC C Compiler)
Enable values to be allocated in registers that will be clobbered
by function calls, by emitting extra instructions to save and
restore the registers around such calls. Such allocation is done
only when it seems to result in better code than would otherwise
be produced.

This option is always enabled by default on certain machines,
usually those which have no call-preserved registers to use
instead.

-fcprop-registers (GCC C Compiler)
-fno-cprop-registers
After register allocation and post-register allocation instruction
splitting, we perform a copy-propagation pass to try to reduce
scheduling dependencies and occasionally eliminate the copy.

-fcrossjumping (GCC C Compiler)
Perform cross-jumping transformation. This transformation unifies
equivalent code and saves code size. The resulting code may or may
not perform better than without cross-jumping.

-fcse-follow-jumps (GCC C Compiler)
In common subexpression elimination, scan through jump instructions
when the target of the jump is not reached by any other path. For
example, when CSE encounters an `if' statement with an `else'
clause, CSE will follow the jump when the condition tested is
false.

-fcse-skip-blocks (GCC C Compiler)
This is similar to `-fcse-follow-jumps', but causes CSE to follow
jumps which conditionally skip over blocks. When CSE encounters a
simple `if' statement with no else clause, `-fcse-skip-blocks'
causes CSE to follow the jump around the body of the `if'.

-fdefer-pop (GCC C Compiler)
-fno-defer-pop
For machines which must pop arguments after a function call, the
compiler normally (`-fdefer-pop') lets arguments accumulate on the
stack for several function calls and pops them all at once.
`-fno-defer-pop' instead always pops the arguments to each function
call as soon as that function returns.

-fdelayed-branch (GCC C Compiler)
If supported for the target machine, attempt to reorder
instructions to exploit instruction slots available after delayed
branch instructions.

-fdelete-null-pointer-checks (GCC C Compiler)
Use global dataflow analysis to identify and eliminate useless
checks for null pointers. The compiler assumes that dereferencing
a null pointer would have halted the program. If a pointer is
checked after it has already been dereferenced, it cannot be null.

In some environments, this assumption is not true, and programs can
safely dereference null pointers. Use
`-fno-delete-null-pointer-checks' to disable this optimization for
programs which depend on that behavior.
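A minimal C sketch of the pattern this option removes follows. Because *p is read before the test, the p == NULL branch can only be reached after a null dereference, which the compiler assumes has already halted the program. The second function shows the reduced code by hand. Both names are hypothetical, and passing NULL to either one is undefined behavior in C.

```c
#include <stddef.h>

/* As written: the pointer is dereferenced before it is tested. */
int first_or_minus1(const int *p)
{
    int v = *p;          /* if p were NULL, this would already fault */
    if (p == NULL)       /* so the compiler may treat this test as dead */
        return -1;
    return v;
}

/* Roughly what remains once the useless check is deleted. */
int first_or_minus1_optimized(const int *p)
{
    return *p;
}
```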

-fexpensive-optimizations (GCC C Compiler)
Perform a number of minor optimizations that are relatively
expensive.

-fforce-mem (GCC C Compiler)
Force memory operands to be copied into registers before doing
arithmetic on them. This produces better code by making all memory
references potential common subexpressions. When they are not
common subexpressions, instruction combination should eliminate
the separate register-load.

-fgcse (GCC C Compiler)
Perform a global common subexpression elimination pass. This pass
also performs global constant and copy propagation.

_Note:_ When compiling a program using computed gotos, a GCC
extension, you may get better runtime performance if you disable
the global common subexpression elimination pass by adding
`-fno-gcse' to the command line.

-fgcse-lm (GCC C Compiler)
When `-fgcse-lm' is enabled, global common subexpression
elimination will attempt to move loads which are only killed by
stores into themselves. This allows a loop containing a
load/store sequence to be changed to a load outside the loop, and
a copy/store within the loop.

Enabled by default when gcse is enabled.

-fgcse-sm (GCC C Compiler)
When `-fgcse-sm' is enabled, a store motion pass is run after
global common subexpression elimination. This pass will attempt
to move stores out of loops. When used in conjunction with
`-fgcse-lm', loops containing a load/store sequence can be changed
to a load before the loop and a store after the loop.

Enabled by default when gcse is enabled.
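Written out by hand in C, the load and store motion described above looks like the pair below. This is an illustrative sketch with hypothetical function names, which return the updated total only for convenience. The rewrite is legal only when the compiler can prove that total does not point into a[].

```c
#include <stddef.h>

/* As written: *total is loaded and stored on every iteration. */
double accumulate(double *total, const double *a, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        *total += a[i];
    return *total;
}

/* After load motion and store motion: one load before the loop,
 * a register copy inside it, and one store after it. */
double accumulate_moved(double *total, const double *a, size_t n)
{
    size_t i;
    double t = *total;   /* load hoisted out of the loop */
    for (i = 0; i < n; i++)
        t += a[i];
    *total = t;          /* store sunk below the loop */
    return *total;
}
```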

-fif-conversion (GCC C Compiler)
Attempt to transform conditional jumps into branch-less
equivalents. This includes the use of conditional moves, min, max,
set flags and abs instructions, and some tricks doable by standard
arithmetic. The use of conditional execution on chips where it
is available is controlled by `-fif-conversion2'.

-fif-conversion2 (GCC C Compiler)
Use conditional execution (where available) to transform
conditional jumps into branch-less equivalents.
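In C terms, if-conversion turns the first function below into something like the other two. The names are hypothetical. Whether the conditional expression becomes a conditional move is up to the compiler, and max_arith assumes two's-complement int, as on Opteron.

```c
/* Branchy form, as a programmer might write it. */
int max_branchy(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* Select form: a compiler may lower this to compare + conditional move. */
int max_select(int a, int b)
{
    return a > b ? a : b;
}

/* Branch-free with plain arithmetic: mask is all ones when a > b. */
int max_arith(int a, int b)
{
    int mask = -(a > b);
    return (a & mask) | (b & ~mask);
}
```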

-finline-functions (GCC C Compiler)
Integrate all simple functions into their callers.
The compiler heuristically decides which functions are
simple enough to be worth integrating in this way.

If all calls to a given function are integrated, and
the function is declared "static", then the function
is normally not output as assembler code in its own
right.

-floop-optimize (GCC C Compiler)
Perform loop optimizations: move constant expressions out of
loops, simplify exit test conditions and optionally do
strength-reduction and loop unrolling as well.

-fmerge-constants (GCC C Compiler)
Attempt to merge identical constants (string constants and
floating point constants) across compilation units.

This option is the default for optimized compilation if the
assembler and linker support it. Use `-fno-merge-constants' to
inhibit this behavior.

-fno-guess-branch-probability (GCC C Compiler)

Do not guess branch probabilities using a randomized model.

Sometimes gcc will opt to use a randomized model to guess branch
probabilities, when none are available from either profiling
feedback (`-fprofile-arcs') or `__builtin_expect'. This means that
different runs of the compiler on the same program may produce
different object code.

In a hard real-time system, people don't want different runs of the
compiler to produce code that has different behavior; minimizing
non-determinism is of paramount import. This switch allows users
to reduce non-determinism, possibly at the expense of inferior
optimization.
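Because `__builtin_expect' is named above as a source of branch probabilities, here is a short C sketch of its use. The unlikely macro follows a common convention but is not part of GCC, and safe_strlen is a hypothetical example function. A hint like this gives GCC a probability for that branch, so it does not need to guess one.

```c
#include <stddef.h>

/* __builtin_expect(expr, c) tells GCC that expr is expected to equal c. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: length of s, or 0 for NULL, with the NULL
 * case marked as the rare path. */
size_t safe_strlen(const char *s)
{
    size_t n = 0;
    if (unlikely(s == NULL))
        return 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```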

-fomit-frame-pointer (GCC C Compiler)
Don't keep the frame pointer in a register for functions that don't need
one. This avoids the instructions to save, set up and restore frame
pointers; it also makes an extra register available in many functions. It
also makes debugging impossible on some machines. On some machines, such
as the VAX, this flag has no effect, because the standard calling
sequence automatically handles the frame pointer and nothing is saved by
pretending it doesn't exist. The machine-description macro
FRAME_POINTER_REQUIRED controls whether a target machine supports
this flag.

-foptimize-register-move (GCC C Compiler)
Attempt to reassign register numbers in move instructions and as
operands of other simple instructions in order to maximize the
amount of register tying. This is especially helpful on machines
with two-operand instructions.

-foptimize-sibling-calls (GCC C Compiler)
Optimize sibling and tail recursive calls.
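Below are a tail-recursive C function and the loop it effectively becomes when the sibling call is optimized. Both names are hypothetical. The loop version is written by hand to show the idea and is not taken from compiler output.

```c
/* Tail-recursive: the recursive call is the last action, so with
 * -foptimize-sibling-calls GCC may turn it into a jump. */
unsigned long gcd_rec(unsigned long a, unsigned long b)
{
    if (b == 0)
        return a;
    return gcd_rec(b, a % b);
}

/* Roughly what the optimized code does: a loop with constant stack use. */
unsigned long gcd_loop(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}
```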

-fpeephole (GCC C Compiler)
-fno-peephole
-fpeephole2 (GCC C Compiler)
-fno-peephole2
Enable/Disable any machine-specific peephole optimizations. The
difference between `-fno-peephole' and `-fno-peephole2' is in how
they are implemented in the compiler; some targets use one, some
use the other, a few use both.

-fprofile-arcs (GCC C Compiler)
Instrument arcs during compilation to generate coverage data or for
profile-directed block ordering. During execution the program records how
many times each branch is executed and how many times it is taken. When
the compiled program exits it saves this data to a file called
sourcename.da for each source file. For profile-directed block ordering,
compile the program with -fprofile-arcs plus optimization and code
generation options, generate the arc profile information by running the
program on a selected workload, and then compile the program again with
the same optimization and code generation options plus
-fbranch-probabilities.

The other use of -fprofile-arcs is with gcov, together with the
-ftest-coverage option.

With -fprofile-arcs, for each function of your program GCC creates a
program flow graph, then finds a spanning tree for the graph. Only arcs
that are not on the spanning tree have to be instrumented: the compiler
adds code to count the number of times that these arcs are executed. When
an arc is the only exit or only entrance to a block, the instrumentation
code can be added to the block; otherwise, a new basic block must be
created to hold the instrumentation code.

-fregmove (GCC C Compiler)
Attempt to reassign register numbers in move instructions and as
operands of other simple instructions in order to maximize the
amount of register tying. This is especially helpful on machines
with two-operand instructions.

Note `-fregmove' and `-foptimize-register-move' are the same
optimization.

-frename-registers (GCC C Compiler)
Attempt to avoid false dependencies in scheduled code
by making use of registers left over after register
allocation. This optimization will most benefit
processors with lots of registers. It can, however, make
debugging impossible, since variables will no longer
stay in a ``home register''.

-freorder-blocks (GCC C Compiler)
Reorder basic blocks in the compiled function in order to reduce
the number of taken branches and improve code locality.

-freorder-functions (GCC C Compiler)
Reorder functions in the object file in order to improve code
locality. This is implemented by using special subsections
`.text.hot' for most frequently executed functions and
`.text.unlikely' for unlikely executed functions. Reordering is done
by the linker, so the object file format must support named sections
and the linker must place them in a reasonable way.

Profile feedback must also be available to make this option
effective. See `-fprofile-arcs' for details.

-frerun-cse-after-loop (GCC C Compiler)
Re-run common subexpression elimination after loop optimizations have
been performed.

-frerun-loop-opt (GCC C Compiler)
Run the loop optimizer twice.

-fschedule-insns (GCC C Compiler)
If supported for the target machine, attempt to reorder
instructions to eliminate execution stalls due to required data
being unavailable. This helps machines that have slow floating
point or memory load instructions by allowing other instructions
to be issued until the result of the load or floating point
instruction is required.

-fschedule-insns2 (GCC C Compiler)
Similar to `-fschedule-insns', but requests an additional pass of
instruction scheduling after register allocation has been done.
This is especially useful on machines with a relatively small
number of registers and where memory load instructions take more
than one cycle.

-fsched-interblock (GCC C Compiler)
-fno-sched-interblock
Schedule instructions across basic blocks; `-fno-sched-interblock'
disables this. Interblock scheduling is normally enabled by default
when scheduling before register allocation, i.e. with
`-fschedule-insns' or at `-O2' or higher.

-fsched-spec (GCC C Compiler)
-fno-sched-spec
Allow speculative motion of non-load instructions; `-fno-sched-spec'
disallows it. Speculative motion is normally enabled by default when
scheduling before register allocation, i.e. with `-fschedule-insns'
or at `-O2' or higher.

-fstrength-reduce (GCC C Compiler)
Perform the optimizations of loop strength reduction and elimination of
iteration variables.
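Here is a hand-written C illustration of loop strength reduction: the multiplication i * stride in the subscript is replaced by an index that grows by stride on each iteration. The function names are hypothetical, and the second version sketches the transformation rather than reproducing GCC output.

```c
#include <stddef.h>

/* As written: each element address needs the product i * stride. */
double sum_strided(const double *a, size_t n, size_t stride)
{
    double s = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        s += a[i * stride];
    return s;
}

/* After strength reduction: the multiply becomes a running index
 * that is incremented by stride on every iteration. */
double sum_strided_reduced(const double *a, size_t n, size_t stride)
{
    double s = 0.0;
    size_t i, idx = 0;
    for (i = 0; i < n; i++) {
        s += a[idx];
        idx += stride;
    }
    return s;
}
```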

-fstrict-aliasing (GCC C Compiler)
Allows the compiler to assume the strictest aliasing rules
applicable to the language being compiled. For C (and C++), this
activates optimizations based on the type of expressions. In
particular, an object of one type is assumed never to reside at
the same address as an object of a different type, unless the
types are almost the same. For example, an `unsigned int' can
alias an `int', but not a `void*' or a `double'. A character type
may alias any other type.

Pay special attention to code like this:

    union a_union {
        int i;
        double d;
    };

    int f() {
        union a_union t;
        t.d = 3.0;
        return t.i;
    }

The practice of reading from a different union member than the one
most recently written to (called "type-punning") is common. Even
with `-fstrict-aliasing', type-punning is allowed, provided the
memory is accessed through the union type. So, the code above
will work as expected. However, this code might not:

    int f() {
        union a_union t;
        int* ip;
        t.d = 3.0;
        ip = &t.i;
        return *ip;
    }
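When a pointer cast like the one above is needed, copying the bytes with memcpy is a common alternative that does not depend on aliasing rules. The C sketch below compares the union and memcpy forms. It assumes that double and uint64_t are both 8 bytes with the same byte order, as on Opteron, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reads the bit pattern of a double through the union type,
 * which the text above says remains valid under -fstrict-aliasing. */
uint64_t bits_via_union(double d)
{
    union { double d; uint64_t u; } t;
    t.d = d;
    return t.u;
}

/* Alternative that never accesses the object through an
 * incompatible pointer: copy the bytes with memcpy. */
uint64_t bits_via_memcpy(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}
```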

Every language that wishes to perform language-specific alias
analysis should define a function that computes, given a `tree'
node, an alias set for the node. Nodes in different alias sets
are not allowed to alias. For an example, see the C front-end
function `c_get_alias_set'.

-fthread-jumps (GCC C Compiler)
Perform optimizations where we check to see if a jump branches to a
location where another comparison subsumed by the first is found.
If so, the first branch is redirected to either the destination of
the second branch or a point immediately following it, depending
on whether the condition is known to be true or false.
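A small C illustration: once x > 10 is known to be true, the later test x > 5 is already decided, so the first branch can jump past it. The second function writes out the threaded control flow by hand. It sketches the idea with hypothetical names and is not GCC output.

```c
/* As written: the second comparison is subsumed by the first
 * whenever x > 10 holds. */
int score(int x)
{
    int s = 0;
    if (x > 10)
        s += 1;
    if (x > 5)
        s += 2;
    return s;
}

/* After jump threading: the x > 10 path skips the second test. */
int score_threaded(int x)
{
    int s = 0;
    if (x > 10) {
        s += 1;
        s += 2;          /* x > 5 is known true here: no second test */
    } else if (x > 5) {
        s += 2;          /* only this path still needs the comparison */
    }
    return s;
}
```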

rm -f *.da *.life analyz_prbrob.out
Remove any profile feedback information from previous runs.

BIOS Setting Definitions
DRAM Interleave defines whether data will be interleaved among the four data
banks within individual DRAMs.

Node Interleave defines whether data addresses alternate between
the two processors in 4KB blocks.

ACPI SRAT defines whether the System Resource Affinity Table (SRAT) is
exported by the BIOS to a location where the operating system can see it.
The SRAT may only be exported when Node Interleave is disabled.