-----------------------------------------------------------------------------------
Fujitsu PRIMEPOWER flags/tunables description			(Dec.14 2005)

(Each section is sorted in case-insensitive alphabetical order)

Table of Contents
[1] Sun Studio 11 flag description
[2] Environment Variables
[3] Kernel Parameters (/etc/system)
[4] Commands for feedback control

-----------------------------------------------------------------------------------
[1] Sun Studio 11 flag description

Compiler options	Remark
-----------------------------------------------------------------------------------

cc			Invoke the Sun Studio 11 Compiler C 
(C compiler)

CC			Invoke the Sun Studio 11 Compiler C++ 
(C++ compiler)

-crit			Enable optimization of critical control paths 
(optimizer)

-dalign			Assume data is naturally aligned. 
(C, C++, Fortran)

-Dalloca=__builtin_alloca
(Portability flag)	Portability switch, used for 176.gcc:
			allow use of compiler's internal builtin alloca.

-depend			Synonym for -xdepend.
(Fortran)

-DHOST_WORDS_BIG_ENDIAN	Portability switch, used for 176.gcc:
(Portability flag)	controls how bytes are numbered within a word. 

-D__MATHERR_ERRNO_DONTCARE	
(C)			Allows the compiler to assume that your code
			does not rely on setting of the errno variable.

-DSPEC_CPU2000_SOLARIS	Portability switch, used for 253.perlbmk:
(Portability flag)	selects header files and code paths compatible
			with Solaris.
			

-DSUN			Portability switch, used for 186.crafty:
(Portability flag)	selects header files and code paths
			compatible with Solaris. 

-DSYS_HAS_CALLOC_PROTO	Portability switch, used for 254.gap:
(Portability flag)	allows use of the designated prototype.

-DSYS_HAS_IOCTL_PROTO	Portability switch, used for 254.gap:
(Portability flag)	allows use of the designated prototype.

-DSYS_HAS_SIGNAL_PROTO	Portability switch, used for 254.gap: 
(Portability flag)	allows use of the designated prototype.

-DSYS_HAS_TIME_PROTO	Portability switch, used for 254.gap:
(Portability flag)	allows use of the designated prototype.

-DSYS_IS_USG		Portability switch, used for 254.gap:
(Portability flag)	selects code compatible with USG-based systems. 

-e			Portability switch, used for 178.galgel:
(Portability, Fortran)	allows source lines to be up to 132 characters long. 

f90			Invoke the Sun Studio 11 Compiler Fortran 90
(Fortran compiler)

-fast			A convenience option, this switch selects the
(C)			following switches that are defined elsewhere
			in this page: 

			-D__MATHERR_ERRNO_DONTCARE
			-fns
			-fsimple=2
			-fsingle
			-xalias_level=basic
			-xarch=generic
			-xbuiltin=%all
			-xcache=generic
			-xchip=generic
			-xdepend
			-xlibmil
			-xlibmopt
			-xmemalign=8s
			-xO5
			-xprefetch=auto,explicit
			-xtarget=native

-fast			A convenience option, this switch selects the
(C++)			following switches that are defined elsewhere
			in this page: 

			-fns
			-fsimple=2
			-xbuiltin=%all
			-xlibmil
			-xlibmopt
			-xO5
			-xtarget=native

-fast			A convenience option, this switch selects the
(Fortran)		following switches that are defined elsewhere
			in this page: 

			-dalign
			-fns
			-fround=nearest
			-fsimple=2
			-ftrap=common
			-xlibmil
			-xlibmopt
			-xO5
			-xpad=local
			-xprefetch=auto,explicit
			-xtarget=native
			-xvector=yes

-fixed			Portability switch, used for 178.galgel:
(Portability, Fortran)	assume fixed-format source input.

-fns			Selects faster (but nonstandard) handling of
(C, C++, Fortran)	floating point arithmetic exceptions and
			gradual underflow.

-fsimple=<n>		Controls simplifying assumptions for
(C, C++, Fortran)	floating point arithmetic:

	    -fsimple=0	Permits no simplifying assumptions.
			Preserves strict IEEE 754 conformance. 

	    -fsimple=1	Allows the optimizer to assume: 
			The IEEE 754 default rounding/trapping
			modes do not change after process initialization. 
			Computations producing no visible result other
			than potential floating-point exceptions may
			be deleted. Computations with Infinity or NaNs
			as operands need not propagate NaNs to their
			results. For example, x*0 may be replaced by 0. 
			Computations do not depend on sign of zero. 

	    -fsimple=2	Permits more aggressive floating point
			optimizations that may cause programs to
			produce different numeric results due to
			changes in rounding. Even with -fsimple=2,
			the optimizer still is not permitted to
			introduce a floating point exception in a
			program that otherwise produces none. 
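
The -fsimple assumptions above can be checked numerically. The sketch below is
an illustration only, not compiler output: it uses Python floats (IEEE 754
doubles on common platforms) to show why replacing x*0 by 0 is an assumption
rather than an identity, and how one plausible -fsimple=2 style change
(reassociating a sum, chosen here as an example; the source does not list the
specific transformations) alters rounding. All function names are hypothetical.

```python
import math


def times_zero(x):
    """Compute x * 0.0 with strict IEEE 754 semantics (no simplification)."""
    return x * 0.0


def sum_left(a, b, c):
    """Evaluate (a + b) + c, rounding after each addition."""
    return (a + b) + c


def sum_right(a, b, c):
    """Evaluate a + (b + c): the same terms, reassociated."""
    return a + (b + c)


if __name__ == "__main__":
    # NaN and Infinity operands propagate NaN under strict semantics.
    print(math.isnan(times_zero(float("nan"))))
    print(math.isnan(times_zero(float("inf"))))
    # The sign of zero is observable: -3.0 * 0.0 is negative zero.
    print(math.copysign(1.0, times_zero(-3.0)))
    # Reassociation changes the rounded result.
    print(sum_left(1e17, -1e17, 1.0), sum_right(1e17, -1e17, 1.0))
```

Code that depends on any of these distinctions is the kind of code for which
-fsimple=0 would be the safe setting.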

-fsingle		Evaluate float expressions as single precision. 
(C)
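
To see why the evaluation precision of float expressions matters, the sketch
below emulates single-precision arithmetic in Python by rounding through a
4-byte float, and compares it with double-precision evaluation. It is an
illustration under the assumption that the alternative to -fsingle is
evaluation in double precision; the helper names are hypothetical.

```python
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_single(a, b):
    """Add two values with operands and result rounded to single precision."""
    return to_single(to_single(a) + to_single(b))


def add_double(a, b):
    """Add two values in double precision."""
    return a + b


if __name__ == "__main__":
    # Near 1e8 adjacent single-precision values are 8 apart, so adding 1.0
    # is lost in single precision but kept in double precision.
    print(add_single(1e8, 1.0), add_double(1e8, 1.0))
```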

-ftrap=common		Sets the IEEE 754 trapping mode to common exceptions
(C, C++, Fortran)	(invalid, division by zero, and overflow).

-ftrap=%none		Turns off all IEEE 754 trapping modes.
(C, C++, Fortran)

-ll2amm			Include a library containing chip specific
(linker)		memory routines.

-noex			Do not allow C++ exceptions. A throw specification
(C++)			on a function is accepted but ignored; the compiler
			does not generate exception code.

-O			A synonym for -xO3.
(Fortran)

-Qoption <phase> <flags>
			Pass flags along to compiler phase:

			f90comp	Fortran first pass
			iropt	Global optimizer
			cg	Code Generator

-Qoption cg <flags>	See -Wc,<flags> below. (The code generator
(code generator)	phase is addressed via -Qoption cg in
			Fortran and C++; and via -Wc in C.)

-Qoption cg -Qeps:enabled=1
(code generator)	See -Wc,-Qeps:enabled=1


-Qoption cg -Qeps:ws=<n>
(code generator)	See -Wc,-Qeps:ws=<n>

-Qoption cg -Qgsched-T<n>
(code generator)	See -Wc,-Qgsched-T<n>

-Qoption cg -Qgsched-trace_late=1
(code generator)	See -Wc,-Qgsched-trace_late=1

-Qoption iropt <flags>	See -W2,<flags> below. (The optimizer is
(optimizer)		addressed via -Qoption iropt in Fortran and
			C++, and via -W2 in C.)

-Qoption iropt -Addint:sf=<n>		
(optimizer)		When considering whether to interchange loops, set memory
			store operation weight to n. A higher value of n indicates
			a greater performance cost for stores.

-Qoption iropt -Ainline[:cp=<n>][:cs=<n>][:inc=<n>][:irs=<n>][:mi][:recursion=1]
(optimizer)		See -W2,-Ainline[:cp=<n>][:cs=<n>][:inc=<n>][:irs=<n>][:mi][:recursion=1]

-Qoption iropt -Apf:llist=<n>:noinnerllist
(optimizer)		Do speculative prefetching for linked-list data structures:
			llist=<n> perform prefetching n iterations ahead
			noinnerllist do not attempt for innermost loops.

-Qoption iropt -Atile:skewp[:b<n>]
(optimizer)		Perform loop tiling which is enabled by loop skewing.
			Loop skewing is a transformation that transforms a
			non-fully interchangeable loop nest to a fully
			interchangeable loop nest. The optional b<n> sets the
			tiling block size to n.
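
As a rough picture of what a tiling block size means, the sketch below tiles
a loop nest that is already fully interchangeable (a matrix transpose). It
shows only the tiling step; the skewing transformation that makes a nest
interchangeable is not shown. The function names and the block parameter are
hypothetical and are not compiler internals.

```python
def transpose_naive(a):
    """Transpose a rectangular list-of-lists with a plain two-level loop nest."""
    n, m = len(a), len(a[0])
    out = [[0] * n for _ in range(m)]
    for i in range(n):
        for j in range(m):
            out[j][i] = a[i][j]
    return out


def transpose_tiled(a, block=2):
    """Same computation, visiting the iteration space in block x block tiles."""
    n, m = len(a), len(a[0])
    out = [[0] * n for _ in range(m)]
    for ii in range(0, n, block):          # tile origin, row direction
        for jj in range(0, m, block):      # tile origin, column direction
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, m)):
                    out[j][i] = a[i][j]
    return out


if __name__ == "__main__":
    matrix = [[1, 2, 3], [4, 5, 6]]
    print(transpose_tiled(matrix, block=2) == transpose_naive(matrix))
```

Each tile touches a small region of both arrays, which is the cache benefit
tiling aims for; the result is the same for any block size.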

-Qoption iropt -Aujam:inner=g		
(optimizer)		Increase the probability that small-trip-count inner
			loops will be fully unrolled.

RM_SOURCES = lapak.f90	This option allows building the benchmark 178.galgel
(SPEC tools)		without its copy of the lapak sources; instead,
			the lapak entry points in the sunperf library are used.

rm -rf ./feedback.profile ./SunWS_cache		
(Unix)			Remove any profile feedback information from previous runs. 

-stackvar		Force all local variables to be allocated on the stack.
(Fortran)		Allocates all the local variables and arrays in
			routines onto the memory stack unless otherwise
			specified. This option makes these variables automatic
			rather than static and provides more freedom to the
			optimizer when parallelizing loops with calls to
			subprograms.

			Variables and arrays are local, unless they are:

			o    Arguments in a SUBROUTINE or FUNCTION statement
			     (already on stack)

			o    Global items in a COMMON, SAVE, or STATIC
			     statement

			o    Initialized items in a type statement or a DATA
			     statement, such as:
			         REAL X/8.0/ or DATA X/8.0/

			Putting large arrays onto the stack with -stackvar can
			overflow the stack causing segmentation faults.
			Increasing the stack size might be required.

			The initial thread executing the program has a main
			stack, while each helper thread of a multithreaded
			program has its own thread stack.

			The default size for the main stack is about 8
			Megabytes.  The default helper thread stack size is 4
			Megabytes on SPARC V8 platforms and 8 Megabytes on
			SPARC V9 platforms.

			The limit command (with no parameters) shows the
			current main stack size.

			Use the limit shell command to set the size (in
			Kilobytes) of the main thread stack.  For example, to
			set the main stack size to 64 Megabytes, use a
			    % limit stacksize 65536
			command.

			You can set the stack size to be used by each slave
			thread by giving the STACKSIZE environment variable a
			value (in Kilobytes):
			    % setenv STACKSIZE 8192
			sets the stack size for each slave thread to 8 Megabytes.

			The STACKSIZE environment variable also accepts
			numerical values with a suffix of either B, K, M, or G
			for bytes, kilobytes, megabytes, or gigabytes
			respectively.  The default is kilobytes.

			See the Fortran Programming Guide chapter on
			parallelization for details.

			See also -xcheck=stkovf to enable runtime checking for
			stack overflow situations.
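
The unit rules for STACKSIZE described above can be sketched as a small
converter. This is an illustration, not the runtime's parser: it assumes
binary units (1K = 1024 bytes, consistent with 8192 Kilobytes being called
8 Megabytes above), integer values only, and it accepts lower-case suffixes
as a convenience. The function name is hypothetical.

```python
# Bytes per unit suffix; binary multiples are an assumption.
_UNIT_BYTES = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def stacksize_to_bytes(value):
    """Convert a STACKSIZE-style string to bytes; a bare number is kilobytes."""
    text = value.strip()
    if not text:
        raise ValueError("empty STACKSIZE value")
    suffix = text[-1].upper()
    if suffix in _UNIT_BYTES:
        number, unit = text[:-1], suffix
    else:
        number, unit = text, "K"   # default unit is kilobytes
    return int(number) * _UNIT_BYTES[unit]


if __name__ == "__main__":
    for setting in ("8192", "8M", "65536", "1G", "512B"):
        print(setting, stacksize_to_bytes(setting))
```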

submit=echo 'pbind -b...' > dobmk; sh dobmk
(SPEC tools, Unix)      When running multiple copies of benchmarks, the SPEC
                        config file feature submit is sometimes used to
                        cause individual jobs to be bound to specific processors:

               submit=  causes the SPEC tools to use this line when
                        submitting jobs.
       echo ...> dobmk  causes the generated commands to be written to a
                        file, namely dobmk. 
              pbind -b  causes this copy's processes to be bound to the CPU
                        specified by the expression that follows it. See the
                        config file used in the submission for the exact
                        syntax, which tends to be cumbersome because of the
                        need to carefully quote parts of the expression.
                        When all expressions are evaluated, each CPU ends up
                        with exactly one copy of each benchmark. The pbind
                        expression may include: 
                        $SPECUSERNUM:    the SPEC tools-assigned number for
                                         this copy of the benchmark. 
                        psrinfo:         find out what processors are available
                        grep off-line:   search the psrinfo output for
                                         information regarding off-line cpus 
                        awk...print \$1: Pick out the line corresponding to
                                         this copy of the benchmark and use
                                         the CPU number mentioned at the
                                         start of this line. 
              sh dobmk  actually runs the benchmark. 
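
The CPU selection that the pbind expression performs can be pictured with the
sketch below. It is not the expression from any config file: the psrinfo
sample text is made up, the copy number is assumed to be 0-based, and the
code assumes the intent is to bind copies only to on-line processors. See the
config file used in the submission for the actual expression.

```python
# Hypothetical psrinfo-style output, for illustration only.
SAMPLE_PSRINFO = """\
0       on-line   since 12/14/2005 10:00:00
1       off-line  since 12/14/2005 10:00:00
2       on-line   since 12/14/2005 10:00:00
3       on-line   since 12/14/2005 10:00:00
"""


def online_cpus(psrinfo_text):
    """Return the CPU numbers (first column) of the lines marked on-line."""
    cpus = []
    for line in psrinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "on-line":
            cpus.append(int(fields[0]))
    return cpus


def cpu_for_copy(psrinfo_text, copy_number):
    """Pick the CPU for one benchmark copy (assumed 0-based copy number)."""
    cpus = online_cpus(psrinfo_text)
    return cpus[copy_number]   # IndexError if there are more copies than CPUs


if __name__ == "__main__":
    for copy in range(3):
        print("copy", copy, "-> pbind -b", cpu_for_copy(SAMPLE_PSRINFO, copy))
```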

-W<phase>,<flags>	Pass flags along to compiler phase (2=optimizer,
			c=code generator).

-W2,-Abcopy		Increase the probability that the compiler will
(optimizer)		perform memcpy/memset transformations. 

-W2,-Ainline[:cp=<n>][:cs=<n>][:inc=<n>][:irs=<n>][:mi][:recursion=1]
(optimizer)		Control the optimizer's inliner:

     (without a value)	Perform Inter-Procedural Analysis (IPA)-based inlining.

		cp=<n>	The minimum call site frequency counter in order to
			consider a routine for inlining.

		cs=<n>	Set inline callee size limit to n. The unit roughly
			corresponds to the number of instructions.

	       inc=<n>	The inliner is allowed to increase the size of the
			program by up to n%.

	       irs=<n>	Allow routines to increase by up to n. The unit
			roughly corresponds to the number of instructions. 

		    mi	Perform maximum inlining (without considering code
			size increase). 

	   recursion=1	Allow routines that are called recursively to still
			be eligible for inlining. 

-W2,-crit		Enable optimization of critical control paths.
(optimizer)

-W2,-Apf:llist=<n>:noinnerllist		
(optimizer)		Do speculative prefetching for linked-list data structures:
			llist=<n> perform prefetching n iterations ahead
			noinnerllist do not attempt for innermost loops. 

-W2,-Ashort_ldst	Convert multiple short memory operations into
(optimizer)		single long memory operations.

-W2,-whole		Do whole program optimizations.
(optimizer)

-Wc,-Qdepgraph-early_cross_call=1	
(code generator)	There are several scheduling passes in the compiler.
			This option allows early passes to move instructions
			across call instructions.

-Wc,-Qeps:enabled=1	Use enhanced pipeline scheduling (EPS)
(code generator)	and selective scheduling algorithms for
			instruction scheduling. 

-Wc,-Qeps:ws=<n>	Set the EPS window size, that is, the number
(code generator)	of instructions it will consider across all
			paths when trying to find independent
			instructions to schedule a parallel group.
			Larger values may result in better run time,
			at the cost of increased compile time.

-Wc,-Qgsched-T<n>	Sets the aggressiveness of the trace
(code generator)	formation, where n is 4, 5, or 6. 
			The higher the value of n, the lower
			the branch probability needed to include
			a basic block in a trace.

-Wc,-Qgsched-trace_late=1
(code generator)	Turns on the late trace scheduler.

-Wc,-Qipa:valueprediction	
(code generator)	Use profile feedback data to predict values and attempt
			to generate faster code along these control paths,
			even at the expense of possibly slower code along paths
			leading to different values. Correct code is generated
			for all paths.

-Wc,-Qlp=<n>[-av=<n>][-t=<n>][-fa=<n>][-fl=<n>]
(code generator) 	Control irregular loop prefetching:

		lp=<n>	Turns the module on (1) or off (0)
			(default is on for F90; off for C/C++)

	       -av=<n>	Sets the prefetch look ahead distance, in bytes.
			Default is 256.

		-t=<n>	Sets the number of attempts at prefetching. If not
			specified, t=2 if -xprefetch_level=3 has been set;
			otherwise, defaults to t=1.

	       -fa=<n>	1=Force user settings to override internally computed values. 
    
	       -fl=<n>	1=Force the optimization to be turned on for all languages. 

-Wc,-Qms_pipe-pref	Turn off prefetching within modulo scheduling.
(code generator)

-xalias_level=[basic|std|strong]
(C)			Allows the compiler to perform type-based alias analysis
			at the specified alias level:

		 basic	Assume that memory references that involve
			different C basic types do not alias each other.

		   std	Assume aliasing rules described in the ISO 1999 C
			standard.

		strong	In addition to the restrictions at the std level,
			assume that pointers of type char * are used only
			to access an object of type char; and assume that
			there are no interior pointers.

-xalias_level=compatible
(C++)			Allows the compiler to assume that
			layout-incompatible types are not aliased.

-xarch=<a>		Limit the set of instructions the compiler may use
(C, C++, Fortran)	to generic, generic64, native, native64, v7, v8a,
			v8, v8plus, v8plusa, v8plusb, v9, v9a, v9b.
			Typical settings include:

				UltraSPARC-II, 32-bit mode: v8plusa
				UltraSPARC-II, 64-bit mode: v9a
				UltraSPARC-III, 32-bit mode: v8plusb
				UltraSPARC-III, 64-bit mode: v9b

			For more information, see the Fortran User's Guide
			at docs.sun.com

-xbuiltin=%all		Substitute intrinsic functions or inline system
(C, C++)		functions where profitable for performance. 

-xchip=<c>		Specifies the target processor for use by the
(C, C++, Fortran)	optimizer. c must be one of: generic, native,
			old, super, super2, micro, micro2, hyper, hyper2,
			powerup, ultra, ultra2, ultra2i, ultra3, ultra3cu,
			ultra3i, ultra4, 386, 486, pentium, pentium_pro,
			pentium3, pentium4

			ultra3
				Optimize for the UltraSPARC(TM) III processor.

			ultra3cu
				Optimize for the UltraSPARC IIIcu processor.

			ultra4
				Optimize for the UltraSPARC IV processor.

-xcache=<c>		Defines the cache properties for use by the
(C, C++, Fortran)	optimizer. c must be one of the following:

				* native (set parameters for the host environment)
				* s1/l1/a1
				* s1/l1/a1:s2/l2/a2
				* s1/l1/a1:s2/l2/a2:s3/l3/a3

			The si/li/ai are defined as follows:

				si The size of the data cache
				at level i, in kilobytes.
				li The line size of the data cache
				at level i, in bytes.
				ai The associativity of the data cache
				at level i.
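
To make the s/l/a notation concrete, the sketch below splits a cache
specification into its per-level fields. It is an illustration only; the
function name is hypothetical and the example value 64/32/4:8192/512/2 is
made up rather than taken from any real system.

```python
def parse_xcache(spec):
    """Parse 's1/l1/a1[:s2/l2/a2[:s3/l3/a3]]' into one dict per cache level."""
    parts = spec.split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected one to three cache levels")
    levels = []
    for part in parts:
        fields = part.split("/")
        if len(fields) != 3:
            raise ValueError("each level must have the form s/l/a")
        size_kb, line_bytes, assoc = (int(f) for f in fields)
        levels.append({
            "size_kb": size_kb,        # si: data cache size in kilobytes
            "line_bytes": line_bytes,  # li: line size in bytes
            "assoc": assoc,            # ai: associativity
        })
    return levels


if __name__ == "__main__":
    for level, props in enumerate(parse_xcache("64/32/4:8192/512/2"), start=1):
        print("level", level, props)
```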

-xdepend		Analyze loops for inter-iteration data dependencies,
(C, Fortran)		and do loop restructuring.

-xipo[=n]		Perform optimizations across all object files in the
(C, C++, Fortran)	link step:

			0=off
			1=on
			2=performs whole-program detection and analysis

-xlibmil		Use inline expansion for math library, libm.
(C, C++, Fortran)

-xlibmopt		Select the optimized math library.
(C++, Fortran)

-xlic_lib=sunperf	Link with Sun supplied licensed sunperf library.
(C, C++, Fortran)

-xlinkopt		Perform link-time optimizations, such as branch
(C, C++, Fortran)	optimization and cache coloring.

-xmemalign=ab		This option specifies the maximum assumed
(C, C++, Fortran)	memory alignment and the behavior of misaligned
			data accesses.

			For memory accesses where the alignment is
			determinable at compile time, the compiler
			generates the appropriate load/store instruction
			sequence for that alignment of data.

			For memory accesses where the alignment cannot
			be determined at compile time, the compiler must
			assume an alignment to generate the needed
			load/store sequence.

			Use the -xmemalign option to specify the maximum
			memory alignment of data to be assumed by the
			compiler in these indeterminable situations. You
			can also specify the error behavior to be followed
			at run-time when a misaligned memory access does
			take place.

			Values

			Accepted values for a are:

				1    Assume at most 1 byte alignment.

				2    Assume at most 2 byte alignment.

				4    Assume at most 4 byte alignment.

				8    Assume at most 8 byte alignment.

				16   Assume at most 16 byte alignment.

			Accepted values for b are:

				i    Interpret access and continue execution.

				s    Raise signal SIGBUS.

				f    Raise signal SIGBUS for alignments less
				     than or equal to 4, otherwise interpret 
				     access and continue execution.

			You must also specify -xmemalign whenever you want to
			link to an object file that was compiled with the value
			of b set to either i or f. For a complete list of
			compiler options that must be specified at both compile
			time and at link time, see the C User's Guide.

			Defaults

			The default is -xmemalign=8i for all v8 architectures.
			The default for all v9 architectures is -xmemalign=8s.
			Specifying -xmemalign without a value is equivalent to
			specifying -xmemalign=1i.
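
The two-part value of -xmemalign can be pictured with the sketch below, which
splits a value such as 8s into its alignment and behavior parts and looks up
the run-time action from the table above. It is an illustration, not compiler
code: the names are hypothetical, and the handling of f reads "alignments" as
the alignment of the faulting access, which is an interpretation of the
table's wording.

```python
ALIGNMENTS = (1, 2, 4, 8, 16)   # accepted values for a
BEHAVIORS = ("i", "s", "f")     # accepted values for b


def parse_xmemalign(value):
    """Split an -xmemalign value such as '8s' into (alignment, behavior)."""
    if len(value) < 2:
        raise ValueError("expected <alignment><behavior>, for example 8s")
    align_text, behavior = value[:-1], value[-1]
    if not align_text.isdigit() or int(align_text) not in ALIGNMENTS:
        raise ValueError("alignment must be one of 1, 2, 4, 8, 16")
    if behavior not in BEHAVIORS:
        raise ValueError("behavior must be one of i, s, f")
    return int(align_text), behavior


def on_misaligned_access(behavior, alignment):
    """Return the action for a misaligned access, following the table above."""
    if behavior == "i":
        return "interpret"
    if behavior == "s":
        return "SIGBUS"
    if behavior == "f":
        return "SIGBUS" if alignment <= 4 else "interpret"
    raise ValueError("unknown behavior")


if __name__ == "__main__":
    for setting in ("8i", "8s", "1i"):
        print(setting, parse_xmemalign(setting))
```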

-xO<n>			Specify optimization level n:
(C, C++, Fortran)

		  -xO1	Does only basic local optimizations (peephole).

		  -xO2	Do basic local and global optimizations, such as
			induction variable elimination, common
			subexpression elimination, constant propagation,
			register allocation, and basic block merging. 

		  -xO3	Add global optimizations at the function level,
			loop unrolling, and software pipelining.

		  -xO4	Adds automatic inlining of functions in the
			same file.

		  -xO5	Uses optimization algorithms that may take
			significantly more compilation time or that
			do not have as high a probability of improving
			execution time, such as speculative code motion.

-xpad=common[:<n>]	If multiple same-sized arrays are placed in common,
(Fortran)		insert padding between them for better use of cache.
			n specifies the amount of padding to apply, in units
			that are the same size as the array elements. If no
			parameter is specified then the compiler selects one
			automatically.

-xpad=local		Pad local variables, for better use of cache.
(Fortran)

-xpagesize=<n>		Set the preferred page size for running the program.
(C, C++, Fortran)

-xprefetch=auto,explicit
(C, C++, Fortran)	Allow generation of prefetch instructions. -xprefetch and
			-xprefetch=yes are synonyms for -xprefetch=auto,explicit.

-xprefetch=latx:<n>	Adjust the compiler's assumptions about prefetch latency
(C, C++, Fortran)	by the specified factor. Typically values in the range of
			0.5 to 2.0 will be useful. A lower number might indicate
			that data will usually be cache resident; a higher number
			might indicate a relatively larger gap between the
			processor speed and the memory speed (compared to the
			assumptions built into the compiler).

-xprefetch=no%auto	Turn off prefetch instruction generation.
(C, C++, Fortran) 

-xprefetch_level=<n>	Control the level of searching that the compiler does
(C, C++, Fortran)	for prefetch opportunities by setting n to 1, 2, or 3,
			where higher numbers mean to do more searching.
			The default is 2.

-xprofile=collect:./feedback
(C, C++, Fortran)	Collect profile data for feedback-directed optimization,
			and store it in a sub directory of the current directory,
			named ./feedback.

-xprofile=use:./feedback
(C, C++, Fortran)	Use data collected for profile feedback. Look for it in
			a subdirectory of the current directory, named ./feedback.

-xrestrict		Treat pointer-valued function parameters as
(C)			restricted pointers.

-xtarget=[system_name]	Selects options appropriate for the system where
(C, C++, Fortran)	the compile is taking place, including architecture,
			chip, and cache sizes. (These can also be controlled
			separately, via -xarch, -xchip, and -xcache, respectively.) 

-xunroll=n		Specifies whether or not the compiler optimizes
(C, C++, Fortran)	(unrolls) loops.  n is a positive integer. When n is
			1, it is a command and the compiler unrolls no loops.
			When n is greater than 1, -xunroll=n merely suggests
			to the compiler that it unroll loops n times.
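
For readers unfamiliar with the transformation, the sketch below shows by
hand what unrolling a loop 4 times looks like: a main loop that handles four
elements per pass, followed by a remainder loop. It is a source-level
illustration with hypothetical function names and says nothing about the code
the compiler actually generates. Integer data is used so that both versions
give identical results.

```python
def sum_rolled(values):
    """Sum a sequence one element per iteration."""
    total = 0
    for v in values:
        total += v
    return total


def sum_unrolled_by_4(values):
    """Same sum with the loop body replicated four times per iteration."""
    total = 0
    n = len(values)
    limit = n - n % 4          # largest multiple of 4 not exceeding n
    i = 0
    while i < limit:
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += 4
    while i < n:               # remainder loop for the last 0 to 3 elements
        total += values[i]
        i += 1
    return total


if __name__ == "__main__":
    data = list(range(10))
    print(sum_rolled(data), sum_unrolled_by_4(data))
```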

-xvector		Allow the compiler to transform math library calls within
(C, Fortran)		loops into calls to the vector math library.

-----------------------------------------------------------------------------------
[2] Environment Variables

Flag			Remark
-----------------------------------------------------------------------------------
LD_LIBRARY_PATH=<p>	Specify the locations to resolve dynamic link dependencies.

LD_PRELOAD=mpss.so.1	Allow use of the mpss.so.1 shared object, which provides
			a means by which preferred stack and/or heap page sizes
			can be selected.

MPSSHEAP=<n>		Specify the preferred page size for heap. The specified
			page size is applied to all created processes.

MPSSSTACK=<n>		Specify the preferred page size for stack. The specified
			page size is applied to all created processes.

ulimit -s unlimited	Allow stack size to grow without limit.

-----------------------------------------------------------------------------------
[3] Kernel Parameters (/etc/system)

System Tunable		Remark
-----------------------------------------------------------------------------------
autoup			The frequency of file system sync operations.

consistent_coloring	Controls the page coloring policy. It can be set to
			one of the following:

		     0	(default) dynamic (uses various vaddr bits)
		     1	static (virtual=paddr)

tune_t_fsflushr		The number of seconds between fsflush invocations for
			checking dirty memory.

memscrub_period_sec     Period of execution of the memory patrol daemon in
                        units of seconds. This daemon periodically scans
                        memory to confirm that there are no ECC errors.

--------------------------------------------------------------------------------
[4] Commands for feedback control

Command			Remark
--------------------------------------------------------------------------------
fdo_pre0 = rm -rf `pwd`/..feedback.profile
fdo_pre0 = rm -rf `pwd`/SunWS_cache
			Remove the profile data generated by the previous
			feedback-optimized compilation.