IBM SPEC CPU2000 Flag Descriptions for Opteron 

Portland Group Compiler Technology's Fortran compiler  pgf90 5.0-1
GCC C Compiler version 3.3

Last updated:  12-August-2003

Portability  Options

-DSPEC_CPU2000_LP64 (Portability)
     Selects code paths that assume 64-bit longs and pointers (the LP64 data model).
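
     As a minimal sketch of what the LP64 assumption means in C, the
     helpers below report the widths the compiler actually uses.  The
     function names are hypothetical and are not taken from the benchmark
     sources.

```c
/* Returns 1 when long and pointers are both 64 bits wide (LP64), else 0. */
int is_lp64(void)
{
    return sizeof(long) == 8 && sizeof(void *) == 8;
}

/* Returns 1 when a pointer value can be stored in a long without loss,
   which always holds under LP64. */
int pointer_fits_in_long(void)
{
    return sizeof(long) >= sizeof(void *);
}
```

     Code that stores pointers in longs is only safe where
     pointer_fits_in_long() would return 1.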


Flags and Compiler options for the Portland Group Compiler Technology's Fortran compiler 
  

 Fortran pgf90 5.0-1


The optimization levels and their meanings are as follows:	

	-O0	A basic block is generated for each Fortran statement.  No scheduling
		is done between statements.  No global optimizations are performed.

	-O1	Scheduling within extended basic blocks is performed.
		Some register allocation is performed.  No global optimizations
		are performed.

	-O2	All level 1 optimizations are performed.  In addition, scalar
		optimizations such as induction recognition and loop invariant
		motion are performed by the global optimizer.

	-O3	All level 1 and level 2 optimizations are performed.  In addition,
		more aggressive hoisting and scalar replacement optimizations
		are enabled.



	-fast	Chooses generally optimal flags for the target platform.  Equivalent to
		"-O2 -Munroll -Mnoframe".

	-fastsse	Chooses generally optimal flags for machines that support SSE
		instructions.  Equivalent to
		"-fast -Mscalarsse -Mvect=sse -Mcache_align -Mflushz".

IPA	InterProcedural Analyzer 


-Mcache_align   (PGI Fortran Compiler) 
     Align unconstrained objects of length greater than or equal to 16 bytes on
     cache-line boundaries. An unconstrained object is a data object that is not
     a member of an aggregate structure or common block. This option does
     not affect the alignment of allocatable or automatic arrays.

     Note: To effect cache-line alignment of stack-based local variables, the
     main program or function must be compiled with -Mcache_align.

-Mfixed (PGI Fortran Compiler)
     Process source using Fortran fixed-form specifications.

-Mflushz 	(PGI Fortran Compiler) 
     Set SSE MXCSR register to flush-to-zero mode.

-Mipa=align (PGI Fortran Compiler) 
     Instructs the IPA to recognize when pointer targets are all cache-line 
     aligned, allowing better SSE code generation.

-Mipa=arg (PGI Fortran Compiler) 
     Instructs the IPA to remove arguments replaced by -Mipa=ptr,const.

-Mipa=const (PGI Fortran Compiler) 
     Enable propagation of constants across procedure calls.

-Mipa=fast (PGI Fortran Compiler) 
     Equivalent to: -Mipa=const,globals,localarg,ptr,vestigial 
              	
-Mipa=globals (PGI Fortran Compiler) 
     Instructs the IPA to optimize references to globals when not used in procedure calls.		

-Mipa=localarg (PGI Fortran Compiler) 
      Externalizes local variables for use with -Mipa=arg

-Mipa=ptr (PGI Fortran Compiler) 
     Instructs the IPA to perform pointer disambiguation across procedure calls.

-Mipa=vestigial (PGI Fortran Compiler) 
     Instructs the IPA to eliminate functions that are not called.
	
-Mnoframe (PGI Fortran Compiler) 
     Eliminate operations that set up a true stack frame pointer for functions.

-Mnosmart  (PGI Fortran Compiler) 
     Don't run the Smart assembly rewrite tool, which performs
     post-compilation linear assembly scheduling and optimization.

-Mscalarsse  (PGI Fortran Compiler) 
     Use SSE (Streaming SIMD (Single Instruction, Multiple Data) Extensions)
     and SSE2 instructions to perform the scalar floating-point operations
     coded.  This assumes an assembler capable of interpreting SSE/SSE2
     instructions, as found in later versions of Linux.  This option implies
     -Mflushz.

-Munroll (PGI Fortran Compiler) 
     Invokes the loop unroller.  This also sets the optimization level to 2 if
     the level is set to less than 2.  The following suboptions are accepted:

      c:m	Instructs the compiler to completely unroll loops with a constant
	loop count less than or equal to m, a supplied constant.
	If m is not supplied, it defaults to 4.

      n:u	Instructs the compiler to unroll u times a loop that is not
	completely unrolled or that has a non-constant loop count.
	If u is not supplied, the unroller computes the number of times a
	candidate loop is unrolled.
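
     The effect of unrolling can be sketched by hand in C (the benchmarks
     compiled with this flag are Fortran; C is used here only for
     illustration).  The example below assumes an unroll factor of 4, as
     n:4 would request, and uses hypothetical function names.  It shows
     the general shape of the transformation, not the code the compiler
     actually emits.

```c
#include <stddef.h>

/* Reference loop: one element per iteration. */
double sum_rolled(const double *a, size_t n)
{
    double s = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-unrolled by 4, with a remainder loop for the leftover elements. */
double sum_unrolled4(const double *a, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)   /* remainder: fewer than 4 elements left */
        s += a[i];
    return s;
}
```

     The unrolled loop tests and branches once per four elements instead
     of once per element, at the cost of larger code.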

-Mvect=sse (PGI Fortran Compiler) 
     Instructs the vectorizer to search for vectorizable loops and, where
     possible, use SSE or SSE2 and prefetch instructions
     (depending on which processor is targeted).


Flags and Compiler options for the GCC 'C' Compiler version 3.3


-O0  (GCC C Compiler)

     Do not optimize.  This is the default.


-O   (GCC C Compiler)
-O1
     Optimize.  Optimizing compilation takes somewhat more time, and a
     lot more memory for a large function.

     With `-O', the compiler tries to reduce code size and execution
     time, without performing any optimizations that take a great deal
     of compilation time.

     `-O' turns on the following optimization flags:
          -fcprop-registers
          -fcrossjumping
          -fdefer-pop
          -fdelayed-branch
          -fif-conversion
          -fif-conversion2
          -floop-optimize
          -fmerge-constants
          -fthread-jumps

     `-O' also turns on `-fomit-frame-pointer' on machines where doing
     so does not interfere with debugging.

-O2 (GCC C Compiler)
     Optimize even more.  GCC performs nearly all supported
     optimizations that do not involve a space-speed tradeoff.  The
     compiler does not perform loop unrolling or function inlining when
     you specify `-O2'.  As compared to `-O', this option increases
     both compilation time and the performance of the generated code.

     `-O2' turns on all optimization flags specified by `-O'.  It also
     turns on the following optimization flags:
          -falign-functions
          -falign-jumps
          -falign-labels
          -falign-loops
          -fcaller-saves
          -fcse-follow-jumps
          -fcse-skip-blocks
          -fdelete-null-pointer-checks
          -fexpensive-optimizations
          -fforce-mem
          -fgcse
          -fgcse-lm
          -fgcse-sm
          -foptimize-sibling-calls
          -fpeephole2
          -fregmove
          -freorder-blocks
          -freorder-functions
          -frerun-cse-after-loop
          -frerun-loop-opt
          -fschedule-insns
          -fschedule-insns2
          -fsched-interblock
          -fsched-spec
          -fstrength-reduce
          -fstrict-aliasing

     Please note the warning under `-fgcse' about invoking `-O2' on
     programs that use computed gotos.

-O3 (GCC C Compiler)
     Optimize yet more.  `-O3' turns on all optimizations specified by
     `-O2' and also turns on the following:
             -finline-functions
             -frename-registers

-falign-functions      (GCC C Compiler)
-falign-functions=N
     Align the start of functions to the next power-of-two greater than N,
     skipping up to N bytes.  For instance, `-falign-functions=32' aligns
     functions to the next 32-byte boundary, but `-falign-functions=24' would
     align to the next 32-byte boundary only if this can be done by skipping
     23 bytes or less.

    `-fno-align-functions' and `-falign-functions=1' are equivalent and mean
    that functions will not be aligned.

    Some assemblers only support this flag when N is a power of two; in that
    case, it is rounded up.

    If N is not specified, use a machine-dependent default.
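
     The "skip up to N bytes" rule can be sketched as a small address
     calculation.  This is an illustration of the rule as described above,
     not GCC's implementation; the function names are hypothetical and N is
     assumed to be at least 1.

```c
/* Smallest power of two that is greater than or equal to n (n >= 1). */
static unsigned long next_pow2(unsigned long n)
{
    unsigned long p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Returns the address a function would start at under -falign-functions=n:
   the next power-of-two boundary if it can be reached by skipping fewer
   than n bytes, otherwise the original address. */
unsigned long align_with_skip(unsigned long addr, unsigned long n)
{
    unsigned long align, aligned;
    if (n == 0)
        return addr;
    align = next_pow2(n);
    aligned = (addr + align - 1) & ~(align - 1);
    return (aligned - addr <= n - 1) ? aligned : addr;
}
```

     For a function that would otherwise start at address 100, N=32 moves
     it to 128 (28 bytes skipped), while N=24 leaves it at 100 because 28
     exceeds the 23-byte limit.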

-falign-jumps   (GCC C Compiler)
-falign-jumps=N
     Align branch targets to a power-of-two boundary, for branch targets where
     the targets can only be reached by jumping, skipping up to N bytes like
     `-falign-functions'.
     In this case, no dummy operations need be executed.

     If N is not specified, use a machine-dependent default.

-falign-labels   (GCC C Compiler)
-falign-labels=N
     Align all branch targets to a power-of-two boundary, skipping up to
     N bytes like `-falign-functions'.  This option can easily make code slower,
     because it must insert dummy operations for when the branch target is
     reached in the usual flow of the code.

     If `-falign-loops' or `-falign-jumps' are applicable and are greater
     than this value, then their values are used instead.

     If N is not specified, use a machine-dependent default which is very
     likely to be `1', meaning no alignment.

-falign-loops  (GCC C Compiler)
-falign-loops=N
     Align loops to a power-of-two boundary, skipping up to N bytes like
     `-falign-functions'.  The hope is that the loop will be executed many
     times, which will make up for any execution of the dummy operations.

     If N is not specified, use a machine-dependent default.

-fbranch-probabilities (GCC C Compiler)
     After running a program compiled with -fprofile-arcs, you can compile it
     a second time using -fbranch-probabilities, to improve optimizations
     based on the number of times each branch was taken.  When the program
     compiled with -fprofile-arcs exits it saves arc execution counts to a
     file called sourcename.da for each source file.  The information in this
     data file is very dependent on the structure of the generated code, so
     you must use the same source code and the same optimization options for
     both compilations.  With -fbranch-probabilities, GCC puts a
     REG_EXEC_COUNT note on the first instruction of each basic block, and
     a REG_BR_PROB note on each JUMP_INSN and CALL_INSN. These can be used to
     improve optimization.  Currently, they are only used in one place: in
     reorg.c, instead of guessing which path a branch is most likely to take, the
     REG_BR_PROB values are used to exactly determine which path is taken
     more often.

-fcaller-saves  (GCC C Compiler)
     Enable values to be allocated in registers that will be clobbered
     by function calls, by emitting extra instructions to save and
     restore the registers around such calls.  Such allocation is done
     only when it seems to result in better code than would otherwise
     be produced.

     This option is always enabled by default on certain machines,
     usually those which have no call-preserved registers to use
     instead.

-fcprop-registers    (GCC C Compiler)
-fno-cprop-registers
     After register allocation and post-register allocation instruction
     splitting, we perform a copy-propagation pass to try to reduce
     scheduling dependencies and occasionally eliminate the copy.

-fcrossjumping    (GCC C Compiler)
     Perform cross-jumping transformation. This transformation unifies
     equivalent code and saves code size. The resulting code may or may
     not perform better than without cross-jumping.

-fcse-follow-jumps  (GCC C Compiler)
     In common subexpression elimination, scan through jump instructions
     when the target of the jump is not reached by any other path.  For
     example, when CSE encounters an `if' statement with an `else'
     clause, CSE will follow the jump when the condition tested is
     false.

-fcse-skip-blocks  (GCC C Compiler)
     This is similar to `-fcse-follow-jumps', but causes CSE to follow
     jumps which conditionally skip over blocks.  When CSE encounters a
     simple `if' statement with no else clause, `-fcse-skip-blocks'
     causes CSE to follow the jump around the body of the `if'.

-fdefer-pop  (GCC C Compiler)
-fno-defer-pop
     With `-fno-defer-pop', always pop the arguments to each function call
     as soon as that function returns.  With `-fdefer-pop', for machines
     which must pop arguments after a function call, the compiler lets
     arguments accumulate on the stack for several function calls and pops
     them all at once.

-fdelayed-branch  (GCC C Compiler)
     If supported for the target machine, attempt to reorder
     instructions to exploit instruction slots available after delayed
     branch instructions.

-fdelete-null-pointer-checks  (GCC C Compiler)
     Use global dataflow analysis to identify and eliminate useless
     checks for null pointers.  The compiler assumes that dereferencing
     a null pointer would have halted the program.  If a pointer is
     checked after it has already been dereferenced, it cannot be null.

     In some environments, this assumption is not true, and programs can
     safely dereference null pointers.  Use
     `-fno-delete-null-pointer-checks' to disable this optimization for
     programs which depend on that behavior.
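
     A short sketch of the two orderings involved, using hypothetical
     function names.  In the second function the check follows the
     dereference, so it is the kind of check this option may remove.

```c
#include <stddef.h>

/* Check first, then dereference: the check is meaningful and is kept. */
int read_or_default(const int *p, int dflt)
{
    if (p == NULL)
        return dflt;
    return *p;
}

/* Dereference first, then check: after the load the compiler may assume
   p is non-null and delete the test below. */
int read_then_check(const int *p, int dflt)
{
    int v = *p;
    if (p == NULL)      /* candidate for removal */
        return dflt;
    return v;
}
```

     Only read_or_default() is safe to call with a null pointer.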

-fexpensive-optimizations  (GCC C Compiler)
     Perform a number of minor optimizations that are relatively
     expensive.

-fforce-mem    (GCC C Compiler)
     Force memory operands to be copied into registers before doing
     arithmetic on them.  This produces better code by making all memory
     references potential common subexpressions.  When they are not
     common subexpressions, instruction combination should eliminate
     the separate register-load.

-fgcse   (GCC C Compiler)
     Perform a global common subexpression elimination pass.  This pass
     also performs global constant and copy propagation.

     _Note:_ When compiling a program using computed gotos, a GCC
     extension, you may get better runtime performance if you disable
     the global common subexpression elimination pass by adding
     `-fno-gcse' to the command line.

-fgcse-lm    (GCC C Compiler)
     When `-fgcse-lm' is enabled, global common subexpression
     elimination will attempt to move loads which are only killed by
     stores into themselves.  This allows a loop containing a
     load/store sequence to be changed to a load outside the loop, and
     a copy/store within the loop.

     Enabled by default when gcse is enabled.


-fgcse-sm   (GCC C Compiler)
     When `-fgcse-sm' is enabled, A store motion pass is run after
     global common subexpression elimination.  This pass will attempt
     to move stores out of loops.  When used in conjunction with
     `-fgcse-lm', loops containing a load/store sequence can be changed
     to a load before the loop and a store after the loop.

     Enabled by default when gcse is enabled.
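
     The combined effect of load motion and store motion can be sketched
     by hand.  The function names are hypothetical, and the two versions
     are equivalent only under the assumption that `total' does not point
     into the array `a'.

```c
#include <stddef.h>

/* Before: *total is loaded and stored on every iteration. */
void accumulate_in_memory(const int *a, size_t n, int *total)
{
    size_t i;
    for (i = 0; i < n; i++)
        *total += a[i];
}

/* After: one load before the loop and one store after it. */
void accumulate_in_register(const int *a, size_t n, int *total)
{
    int t = *total;     /* load hoisted out of the loop */
    size_t i;
    for (i = 0; i < n; i++)
        t += a[i];
    *total = t;         /* store sunk below the loop */
}
```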

-fif-conversion  (GCC C Compiler)
     Attempt to transform conditional jumps into branch-less
     equivalents.  This includes the use of conditional moves, min, max,
     set flags and abs instructions, and some tricks doable by standard
     arithmetic.  The use of conditional execution on chips where it
     is available is controlled by `-fif-conversion2'.

-fif-conversion2  (GCC C Compiler)
     Use conditional execution (where available) to transform
     conditional jumps into branch-less equivalents.
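
     One of the arithmetic tricks can be written out by hand.  This is a
     sketch of a branch-less maximum with hypothetical function names; the
     compiler may instead use a conditional move or another form.

```c
/* Branching form: a conditional jump selects the result. */
int max_branch(int a, int b)
{
    if (a < b)
        return b;
    return a;
}

/* Branch-less form: build an all-ones or all-zero mask from the
   comparison and use it to select b or a. */
int max_branchless(int a, int b)
{
    int mask = -(a < b);            /* -1 when a < b, otherwise 0 */
    return a ^ ((a ^ b) & mask);    /* b when mask is -1, a when 0 */
}
```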

-finline-functions (GCC C Compiler)
     Integrate all simple functions into their callers.  The compiler
     heuristically decides which functions are simple enough to be worth
     integrating in this way.

     If all calls to a given function are integrated, and the function is
     declared "static", then the function is normally not output as
     assembler code in its own right.

-floop-optimize  (GCC C Compiler)
     Perform loop optimizations: move constant expressions out of
     loops, simplify exit test conditions and optionally do
     strength-reduction and loop unrolling as well.
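
     Moving a constant expression out of a loop looks like this when done
     by hand.  The function names are hypothetical; integer arithmetic is
     used so both versions give identical results.

```c
#include <stddef.h>

/* Before: x * y + 1 is recomputed on every iteration although it
   never changes inside the loop. */
void scale_naive(int *out, const int *in, size_t n, int x, int y)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * (x * y + 1);
}

/* After: the invariant expression is computed once, before the loop. */
void scale_hoisted(int *out, const int *in, size_t n, int x, int y)
{
    const int k = x * y + 1;
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * k;
}
```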

-fmerge-constants  (GCC C Compiler)
     Attempt to merge identical constants (string constants and
     floating point constants) across compilation units.

     This option is the default for optimized compilation if the
     assembler and linker support it.  Use `-fno-merge-constants' to
     inhibit this behavior.

-fno-guess-branch-probability  (GCC C Compiler)

     Do not guess branch probabilities using a randomized model.

     Sometimes gcc will opt to use a randomized model to guess branch
     probabilities, when none are available from either profiling
     feedback (`-fprofile-arcs') or `__builtin_expect'.  This means that
     different runs of the compiler on the same program may produce
     different object code.

     In a hard real-time system, people don't want different runs of the
     compiler to produce code that has different behavior; minimizing
     non-determinism is of paramount import.  This switch allows users
     to reduce non-determinism, possibly at the expense of inferior
     optimization.
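
     Where a branch probability is known to the programmer, it can be
     stated with `__builtin_expect' rather than left to guessing.  A
     minimal sketch follows; the likely/unlikely macro names are a common
     convention rather than part of GCC, and checked_sum is a hypothetical
     function.

```c
#include <stddef.h>

/* Conventional wrapper names; only __builtin_expect comes from GCC. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sums a buffer; the null-buffer error path is marked as rarely taken. */
long checked_sum(const int *buf, size_t n)
{
    long s = 0;
    size_t i;
    if (unlikely(buf == NULL))
        return -1;
    for (i = 0; i < n; i++)
        s += buf[i];
    return s;
}
```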

-fomit-frame-pointer (GCC C Compiler)
     Don't keep the frame pointer in a register for functions that don't need
     one. This avoids the instructions to save, set up and restore frame
     pointers; it also makes an extra register available in many functions. It
     also makes debugging impossible on some machines.  On some machines, such
     as the VAX, this flag has no effect, because the standard calling
     sequence automatically handles the frame pointer and nothing is saved by
     pretending it doesn't exist. The machine-description macro
     FRAME_POINTER_REQUIRED controls whether a target machine supports
     this flag.

-foptimize-register-move (GCC C Compiler)
     Attempt to reassign register numbers in move instructions and as
     operands of other simple instructions in order to maximize the
     amount of register tying.  This is especially helpful on machines
     with two-operand instructions.

-foptimize-sibling-calls  (GCC C Compiler)
     Optimize sibling and tail recursive calls.
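
     The calls this option targets have the shape sketched below: the call
     is the last action of the function and its result is returned
     unchanged, so the caller's stack frame can be reused.  The function
     names are hypothetical.

```c
/* Tail-recursive call: nothing remains to be done after gcd_tail returns. */
unsigned long gcd_tail(unsigned long a, unsigned long b)
{
    if (b == 0)
        return a;
    return gcd_tail(b, a % b);
}

/* Tail-recursive factorial using an accumulator. */
unsigned long fact_acc(unsigned int n, unsigned long acc)
{
    if (n <= 1)
        return acc;
    return fact_acc(n - 1, acc * n);
}

/* Sibling call: factorial returns the result of fact_acc directly. */
unsigned long factorial(unsigned int n)
{
    return fact_acc(n, 1UL);
}
```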

-fpeephole    (GCC C Compiler)
-fno-peephole
-fpeephole2   (GCC C Compiler)
-fno-peephole2
     Enable/Disable any machine-specific peephole optimizations.  The
     difference between `-fno-peephole' and `-fno-peephole2' is in how
     they are implemented in the compiler; some targets use one, some
     use the other, a few use both.


-fprofile-arcs (GCC C Compiler)
     Instrument arcs during compilation to generate coverage data or for
     profile-directed block ordering. During execution the program records how
     many times each branch is executed and how many times it is taken.  When
     the compiled program exits it saves this data to a file called
     sourcename.da for each source file.  For profile-directed block ordering,
     compile the program with -fprofile-arcs plus optimization and code
     generation options, generate the arc profile information by running the
     program on a selected workload, and then compile the program again with
     the same optimization and code generation options plus
     -fbranch-probabilities.


     The other use of -fprofile-arcs is for use with gcov, when it is used with
     the -ftest-coverage option.

     With -fprofile-arcs, for each function of your program GCC creates a
     program flow graph, then finds a spanning tree for the graph. Only arcs
     that are not on the spanning tree have to be instrumented: the compiler
     adds code to count the number of times that these arcs are executed. When
     an arc is the only exit or only entrance to a block, the instrumentation
     code can be added to the block; otherwise, a new basic block must be
     created to hold the instrumentation code.

-fregmove (GCC C Compiler)
     Attempt to reassign register numbers in move instructions and as
     operands of other simple instructions in order to maximize the
     amount of register tying.  This is especially helpful on machines
     with two-operand instructions.

     Note `-fregmove' and `-foptimize-register-move' are the same
     optimization.

-frename-registers (GCC C Compiler)
     Attempt to avoid false dependencies in scheduled code by making use
     of registers left over after register allocation.  This optimization
     will most benefit processors with lots of registers.  It can, however,
     make debugging impossible, since variables will no longer stay in a
     ``home register''.

-freorder-blocks  (GCC C Compiler)
     Reorder basic blocks in the compiled function in order to reduce
     number of taken branches and improve code locality.

-freorder-functions (GCC C Compiler)
     Reorder functions in the object file in order to improve code
     locality.  This is implemented by using special subsections
     `.text.hot' for the most frequently executed functions and
     `.text.unlikely' for unlikely executed functions.  Reordering is
     done by the linker, so the object file format must support named
     sections and the linker must place them in a reasonable way.

     Also, profile feedback must be available to make this option
     effective.  See `-fprofile-arcs' for details.

-frerun-cse-after-loop (GCC C Compiler)
     Re-run common subexpression elimination after loop optimizations have
     been performed.


-frerun-loop-opt (GCC C Compiler)
     Run the loop optimizer twice.

-fschedule-insns (GCC C Compiler)
     If supported for the target machine, attempt to reorder
     instructions to eliminate execution stalls due to required data
     being unavailable.  This helps machines that have slow floating
     point or memory load instructions by allowing other instructions
     to be issued until the result of the load or floating point
     instruction is required.

-fschedule-insns2 (GCC C Compiler)
     Similar to `-fschedule-insns', but requests an additional pass of
     instruction scheduling after register allocation has been done.
     This is especially useful on machines with a relatively small
     number of registers and where memory load instructions take more
     than one cycle.


-fsched-interblock  (GCC C Compiler)
-fno-sched-interblock
     With `-fno-sched-interblock', don't schedule instructions across basic
     blocks.  Interblock scheduling is normally enabled by default when
     scheduling before register allocation, i.e. with `-fschedule-insns' or
     at `-O2' or higher.

-fsched-spec  (GCC C Compiler)
-fno-sched-spec
     With `-fno-sched-spec', don't allow speculative motion of non-load
     instructions.  Such speculative motion is normally enabled by default
     when scheduling before register allocation, i.e. with
     `-fschedule-insns' or at `-O2' or higher.


-fstrength-reduce (GCC C Compiler)
     Perform the optimizations of loop strength reduction and elimination of
     iteration variables.
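
     A hand-written sketch of both effects: the per-iteration multiply is
     replaced by a running addition, and the original counter is replaced
     by the derived index.  The function names are hypothetical and the
     stride is assumed to be at least 1.

```c
#include <stddef.h>

/* Before: one multiply per iteration to form the index. */
long sum_strided_mul(const int *a, size_t count, size_t stride)
{
    long s = 0;
    size_t i;
    for (i = 0; i < count; i++)
        s += a[i * stride];
    return s;
}

/* After: the index advances by addition, and the counter i is gone;
   the loop is controlled by the index itself. */
long sum_strided_add(const int *a, size_t count, size_t stride)
{
    long s = 0;
    size_t limit = count * stride;
    size_t idx;
    for (idx = 0; idx < limit; idx += stride)
        s += a[idx];
    return s;
}
```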


-fstrict-aliasing (GCC C Compiler)
     Allows the compiler to assume the strictest aliasing rules
     applicable to the language being compiled.  For C (and C++), this
     activates optimizations based on the type of expressions.  In
     particular, an object of one type is assumed never to reside at
     the same address as an object of a different type, unless the
     types are almost the same.  For example, an `unsigned int' can
     alias an `int', but not a `void*' or a `double'.  A character type
     may alias any other type.

     Pay special attention to code like this:
          union a_union {
            int i;
            double d;
          };

          int f() {
            union a_union t;
            t.d = 3.0;
            return t.i;
          }
     The practice of reading from a different union member than the one
     most recently written to (called "type-punning") is common.  Even
     with `-fstrict-aliasing', type-punning is allowed, provided the
     memory is accessed through the union type.  So, the code above
     will work as expected.  However, this code might not:
          int f() {
            union a_union t;
            int* ip;
            t.d = 3.0;
            ip = &t.i;
            return *ip;
          }

     Every language that wishes to perform language-specific alias
     analysis should define a function that computes, given a `tree'
     node, an alias set for the node.  Nodes in different alias sets
     are not allowed to alias.  For an example, see the C front-end
     function `c_get_alias_set'.
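
     As a hedged aside, a common way to avoid the pointer-based case is to
     copy the bytes with memcpy instead of reading through a pointer of
     another type.  The sketch below uses hypothetical function names and
     assumes that double and unsigned long long are both 64 bits wide.

```c
#include <string.h>

/* Reads the bit pattern of a double through the union type itself,
   the form described above as permitted. */
unsigned long long bits_via_union(double d)
{
    union { double d; unsigned long long u; } t;
    t.d = d;
    return t.u;
}

/* Copies the bytes instead of casting a pointer, so no object is
   accessed through an lvalue of a different type. */
unsigned long long bits_via_memcpy(double d)
{
    unsigned long long u = 0;
    memcpy(&u, &d, sizeof u < sizeof d ? sizeof u : sizeof d);
    return u;
}
```

     Under those assumptions, both functions return the same value for
     any given input.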

-fthread-jumps (GCC C Compiler)
     Perform optimizations where we check to see if a jump branches to a
     location where another comparison subsumed by the first is found.
     If so, the first branch is redirected to either the destination of
     the second branch or a point immediately following it, depending
     on whether the condition is known to be true or false.



rm -f *.da *.life analyz_prbrob.out 
Remove any profile feedback information from previous runs. 

BIOS Setting Definitions

DRAM Interleave defines whether data will be interleaved among the four data
     banks within individual DRAMs.

Node Interleave defines whether data addresses alternate between the two
     processors in 4KB blocks.

ACPI SRAT defines whether the Static Resource Affinity Table is exported by
     the BIOS to a location where the operating system can see it.  The SRAT
     may only be exported when Node Interleave is disabled.