Chapter 5
Compiler Options

Before profiling, the program should be corrected for any uninitialized data or array-bounds transgressions, using the compiler's special checking options together with a selected suite of test cases.

Tuning may require a special set of cases sized to run a convenient length of time, preferably with array sizes representative of normal runs. These cases should be run with a normally reliable set of conservative compiler options, with profiling, to get a comparison set of results.

Typically, a compiler will have a set of options which sticks to IEEE style arithmetic (including comparisons) where possible, performs enough unrolling to get most of the available performance, expands simple powers in line, observes parentheses, and produces reliable results. This may take some research, as it may not be one of the basic option packages like -O1. Profiling may require static linking, but it should allow use of compiler optimizations.
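
As a rough sketch only (the -O2 and -p spellings are placeholders; the actual conservative option set must be researched per vendor), such a profiling build might look like:

f90 -O2 -p *.f -o prog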

Fused Multiply-Add

If the architecture has fused multiply-add instructions which do not round according to IEEE standards, they probably have to be used for optimization, but the test suite should be run with them turned off to verify reasonably close results. These instructions often produce more accurate results than would be obtained with intermediate rounding, but they also have occasional bad effects, typically in the solution of quadratic equations, or where underflow occurs.
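
A contrived sketch of the quadratic-equation hazard: mathematically d is zero when b*b equals 4.0*a*c, but a fused evaluation rounds only the product term, so d can land on the other side of zero from the separately rounded result, and SQRT(d) then fails.

PROGRAM fma_demo
! values are contrived so that b*b and 4.0*a*c agree to within rounding;
! with intermediate rounding d comes out (near) zero, but a fused
! multiply-add leaves b*b unrounded and d may come out slightly negative
REAL :: a, b, c, d
a = 1.0/3.0
c = 3.0
b = 2.0*SQRT(a*c)
d = b*b - 4.0*a*c
IF (d >= 0.0) THEN
   PRINT *, 'root term:', SQRT(d)
ELSE
   PRINT *, 'discriminant went negative:', d
END IF
END PROGRAM fma_demo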

SGI changed the single-instruction multiply and add from a fused instruction without intermediate rounding on the R8000 to one with IEEE style intermediate rounding on the R10000.

Expression Re-ordering

Compilers often have a choice between left-to-right expression evaluation, observing parentheses, and re-ordering in the hope of getting more use of multiply-add chaining, common sub-expressions, or instruction level parallelism. Sometimes these are tied in with loop renesting, which is unfortunate, particularly when it is not possible to get loop optimizations without allowing disregard of parentheses. When there is a re-ordering option, the default should observe parentheses, in this author's opinion. A diagnostic could be issued suggesting a re-arrangement, without actually doing it, so that the programmer's intent is not violated silently.
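
A minimal sketch of what is at stake (values contrived for single precision):

PROGRAM paren_demo
REAL :: small, big
small = 1.0
big = 1.0e8
! as written, the parentheses cancel big exactly, preserving small
PRINT *, small + (big - big)     ! prints 1.0
! a re-associating compiler may instead evaluate (small + big) - big,
! where small is absorbed by big in single precision
PRINT *, (small + big) - big     ! prints 0.0
END PROGRAM paren_demo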

This author's preference is to use the re-ordering options only as a rough check to see if there are performance improvements to be found. It's usually possible to get a better combination of performance and accuracy with strict evaluation of properly written code. With industry standard benchmarks, where the source code (but not strict interpretation of it) is sacred, the vendors' reasons for taking liberties are understandable.

IEEE vs FORTRAN Rounding

Modern architectures allow considerably faster execution of IEEE style rounding than of FORTRAN style for functions such as NINT() and ANINT().  In most situations, the IEEE style is at least as useful. The primary difference is in the rounding of numbers such as ±0.5, ±2.5, ±4.5, ..., where FORTRAN dictates rounding away from 0. A fast implementation of ANINT() may round X to even for X > 1/EPSILON(1d0). Ideally, a compiler will allow a selection option here which is not tied to other options.  The Intel processors give an extreme example, requiring the compiler to insert rounding mode changes to implement either INT() or NINT(), at great cost in performance compared to the IEEE style rounding.
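
The halfway cases where the two styles part company, as a short sketch:

PROGRAM round_demo
! Fortran semantics: halfway cases round away from zero
PRINT *, NINT(0.5), NINT(2.5), NINT(4.5)   ! prints 1, 3, 5
! IEEE round-to-nearest-even gives 0, 2, 4 for the same arguments
END PROGRAM round_demo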

Code Size Optimization

Options which control the expansion of code generation are worth trying. They may generate faster as well as more compact code. This is particularly true when there is a register shortage, which is the usual situation on the Intel architecture.

Static Library Functions

On systems which require extra time to call a dynamically linked library function, there usually are ways to select static linking for critical functions, without linking in too many undesired static functions. For example, some systems accept linking options like "f90 *.o -B STATIC -lm -B DYNAMIC", which links the math library statically but returns to dynamic linking for the rest. Another possibility is to specify the static version of a library before allowing default linking, e.g.:

f90 *.o /usr/lib/libm.a

Of course the location of the library file will vary. Or you may wish to extract the object modules which are called most frequently and include them in your object module list.
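
For example (the member names here are hypothetical; use "ar t" to list the actual contents of the archive), the heavily used modules can be extracted with ar and named ahead of the default libraries:

ar x /usr/lib/libm.a sqrt.o exp.o
f90 *.o sqrt.o exp.o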

Unrolling Options

Compilers typically have options to specify the maximum amount of unrolling and some limit on size of unrolled code, either in terms of generated instructions (best, but architecture dependent) or source code operation count. Default settings probably are ones which worked in SPECmarks. One or another limit may be too high or too low for your application. What's really wanted is to unroll enough to facilitate pipelining but not so much that the compiler is forced into register spills.  For the sections of code which don't show up in the profile, it's probably good to reduce unrolling to get faster compilation and smaller code.

The top limit of useful unrolling is probably related to the ratio of addition latency to instruction issue rate. For example, the PA8000 architecture takes 3 cycles to complete addition, and can issue 3 integer and 3 floating point instructions during this time, so it may be useful to unroll by 6 if there are no parallelizable operations within one loop iteration. This is for the usual style of unrolling, such as the HPUX or gnu compilers perform. Software pipelining compilers such as SGI's may consider unrolling to be the number of loop iterations between count tests, but perform much additional unrolling.

Unrolling by more than the loop count is useless; that ensures that the "optimized" version of the loop won't be executed.  Ideally, the loop count should be divisible by the unrolling factor.

The left-over iterations (0 to 5 of them when unrolling by 6) may be performed before ("pre-conditioning") or after ("clean-up") the main unrolled loop. When the unrolling factor is not a power of 2, the clean-up position is preferred, to avoid the considerable penalty incurred in calculating the remainder by division.  If this were written out in Fortran source code, using a DO index increment of 6, the division would be performed anyway.
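
Written out in source form, a sum unrolled by 6 with clean-up might look like the sketch below; the three partial sums cover a 3-cycle addition latency in the spirit of the ratio argument above, and MOD (itself a division) locates the clean-up point, which is the point of the remark about source-level unrolling.

PROGRAM unroll_demo
INTEGER :: i, n
REAL :: a(1000), s, s1, s2, s3
n = 1000
a = 1.0
s1 = 0.0
s2 = 0.0
s3 = 0.0
DO i = 1, n - MOD(n,6), 6        ! main loop, 6 elements per pass
   s1 = s1 + a(i)   + a(i+3)
   s2 = s2 + a(i+1) + a(i+4)
   s3 = s3 + a(i+2) + a(i+5)
END DO
s = s1 + s2 + s3
DO i = n - MOD(n,6) + 1, n       ! clean-up: 0 to 5 left-over iterations
   s = s + a(i)
END DO
PRINT *, s
END PROGRAM unroll_demo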

Compilers which accept varying amounts of unrolling may switch between pre-conditioning and clean-up orders according to the amount of unrolling. As a result, it may be found that unrolling by 4 is faster or slower than unrolling by either 3 or 5.  Other compilers will choose to unroll only by a power of 2.  If unrolling is controlled loop by loop, it is likely to be done with directives such as
C*$* UNROLL(3)
!*$* UNROLL(3)
immediately before the DO.

Certain compilers use the unrolling factor as the sum and comparison interleaving factor. Then, if unrolling is done by more than the minimum amount needed for interleaving, the length of the cleanup code is increased.

Size of unrolled code is an issue both in terms of disk and instruction cache space, and because the compiler runs out of registers in code generation. As mentioned above, it is better (on a system with register remapping) if a compiler simply reuses registers as much as possible, rather than generating spills.

Where outer unrolling is employed, the product of inner and outer unrolling is a useful parameter. For instance, the PA8000 may work well with manual outer unrolling (by 2) and compiler inner unrolling by 3. On the R10000, excessive code expansion may be controlled by -LNO:ou_prod_max=2.  This restricts the compiler from unrolling more than one outer loop, where 3 or more loops are nested.  The compiler follows up with software pipelining, which involves considerable additional unrolling.
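
A sketch of manual outer unrolling by 2 on a matrix-vector product, leaving the inner unrolling to the compiler; each pass over the inner loop now does two columns' worth of work per load and store of y(i):

PROGRAM outer_demo
INTEGER, PARAMETER :: m = 100, n = 100
REAL :: a(m,n), x(n), y(m)
INTEGER :: i, j
a = 1.0
x = 1.0
y = 0.0
DO j = 1, n - MOD(n,2), 2
   DO i = 1, m
      y(i) = y(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
   END DO
END DO
DO j = n - MOD(n,2) + 1, n       ! clean-up column when n is odd
   DO i = 1, m
      y(i) = y(i) + a(i,j)*x(j)
   END DO
END DO
PRINT *, y(1)
END PROGRAM outer_demo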

Pre-fetch

Certain compilers have options to turn on or off generation of pre-fetch instructions.  A pre-fetch instruction is intended to initiate the movement of a block of data into cache before it is to be used. Typically, a group of pre-fetch instructions will pick up data for 16 loop iterations, in which case the optimum situation is to issue a pre-fetch only once per 16 iterations, best done by a large amount of unrolling. Pre-fetch is most likely to be useful in loops which advance through memory by strides greater than one, but it will increase the demand for cache space, and may drive data out of cache which are about to be needed. The vendors' default options usually work best in most situations, but those which are tuned for simple benchmarks may lead to excessive code expansion.  Cf. SGI's -LNO:prefetch_ahead=1, which restricts pre-fetch to one loop iteration in advance.  Pre-fetch is likely to increase the loop start-up time, so it is effective only for longer loops.  It may not be useful for loops which do not exhaust the supply of shadow registers.
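
A sketch of the arithmetic behind the 16-iteration figure, assuming 8-byte elements and a 128-byte cache line (both are assumptions): one line holds 16 elements, so a unit-stride loop unrolled by 16 lets the compiler place exactly one pre-fetch per pass, aimed one line ahead.

PROGRAM prefetch_demo
DOUBLE PRECISION :: a(4096), s
INTEGER :: i, k
a = 1.0d0
s = 0.0d0
DO i = 1, 4096, 16
   ! a single compiler-issued pre-fetch of a(i+16) would go here,
   ! starting the next 128-byte line on its way while this one is summed
   DO k = 0, 15
      s = s + a(i+k)
   END DO
END DO
PRINT *, s
END PROGRAM prefetch_demo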

Pre-inversion

We suggest avoiding pre-inversions, such as the automatic conversion of x/sqrt(dprod(x,x)+dprod(y,y)) to x * (1/sqrt(...)), as these cost accuracy and seldom show a significant improvement in speed. Compile-time inversion of constants which may be inverted exactly is a useful optimization, which should not be bundled with inversion of divisions in ways which result in inaccurate rounding.
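
A sketch of the rounding cost: the transformed form rounds twice, once for the reciprocal and once for the multiply, where the original division rounds only once; for many inputs the two results differ in the final bit.

PROGRAM preinv_demo
REAL :: x, y
DOUBLE PRECISION :: r1, r2, t
x = 3.0
y = 4.0
r1 = x / SQRT(DPROD(x,x) + DPROD(y,y))      ! one rounding of the quotient
t  = 1.0D0 / SQRT(DPROD(x,x) + DPROD(y,y))  ! reciprocal rounded once...
r2 = x * t                                  ! ...and the product rounded again
PRINT *, r1, r2, r1 - r2
END PROGRAM preinv_demo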

Fast Complex

Some compilers have an option to generate simplified in-line code for complex arithmetic, failing to take precautions against over- or underflow. There shouldn't be any problem with IEEE single precision or on Intel architectures, as the way to go is simply to promote the operation into the next higher precision, which has more than enough range. If the promotion is not used, careful testing is indicated.
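
A sketch of the promotion idea for a complex magnitude in single precision: the squares can overflow single-precision range even when the result is representable, while double precision has range to spare.

PROGRAM cabs_demo
COMPLEX :: z
REAL :: naive, promoted
z = (3.0e30, 4.0e30)
! simplified in-line code: the squares exceed single-precision range
naive = SQRT(REAL(z)*REAL(z) + AIMAG(z)*AIMAG(z))
! promoting to double precision keeps the squares in range
promoted = REAL(SQRT(DPROD(REAL(z),REAL(z)) + DPROD(AIMAG(z),AIMAG(z))))
PRINT *, naive, promoted    ! naive overflows; promoted prints 5.0e30
END PROGRAM cabs_demo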

Automatic In-line

These options should be enabled only where they are known to be beneficial. The most likely such situation is where a simple function with little internal branching is called in a loop. Such a function may be written internally with CONTAINS if it is too complicated for the old-fashioned statement function.

Certain systems use a faster interface to intrinsic functions when an automatic in-line option is invoked.

Among the standard intrinsics, some of the more likely candidates for in-line expansion are NINT and a simplified ANINT:

CONTAINS
ELEMENTAL FUNCTION anint(x)
! accept IEEE rounding, and possible large even results
! on extended precision systems, use instead the corresponding EPSILON
REAL, INTENT(IN) :: x
REAL :: anint, round
round = SIGN(1/EPSILON(x), x)
anint = (x + round) - round
END FUNCTION anint

Problem:  on Intel '387, the only satisfactory way is to use the internal rounding instructions.

TAN() is a reasonable in-line candidate (no branching). Other trig functions, if written in full generality, are not such good candidates for in-line, and should make use of the PURE function treatment even if programmed normally.  Intel '387 processors make an entirely different story about in-line intrinsic selection, as discussed below.

Data Realignment

Other than the traditional allocation of data in such a way that the compiler can choose proper alignment (don't misuse COMMON), the benefit of these options is doubtful. In a situation where various types of operation are performed on the data, special data alignments which speed access in one place, by improving mapping to cache, may hurt in another. Performance requirements for alignment are to be expected; for example, 64-bit transfers from L2 cache on the Pentium are possible only for data stored at addresses which are multiples of 8 bytes. A good compiler will observe these alignments unless forced by the programmer to do otherwise.  A common exception is gnu compilers, which default to 4-byte alignments.
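
A sketch of the COMMON misuse referred to above, assuming 4-byte INTEGER and 8-byte DOUBLE PRECISION; declaring the widest items first keeps every member naturally aligned.

SUBROUTINE layout_demo
! misuse: d starts at byte offset 4 in /bad/, misaligned for 8-byte loads
INTEGER :: i
DOUBLE PRECISION :: d
COMMON /bad/ i, d
! widest items first: every member falls on its natural boundary
DOUBLE PRECISION :: d2
INTEGER :: i2
COMMON /good/ d2, i2
END SUBROUTINE layout_demo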

Code Alignment

Most compilers take care of this invisibly, provided that any architecture switches are set properly.  Many architectures have requirements for the alignment of loop heads, which should improve performance of loops executed 4 or more times, but may hurt performance of loops executed 2 times or less.  The usual reason for this is time spent making extra cache line fetches which could be avoided by putting extra bytes ahead of the loop, so that the first cache line fetch on the second loop iteration contains no unwanted instructions, or, at least, brings in enough useful instructions to keep the CPU busy while the next cache line is brought into action.  Evidently, time may be lost traversing these pad bytes prior to the first execution of the loop, thus increasing the effective loop start-up time to an extent which may depend on the code which comes before the loop.

For example, Intel recommends for the Pentium P6 processors that any code segments which are usually reached by jumping to them should be aligned on 16-byte boundaries if this may be accomplished with at most 7 bytes of padding.  Although the cache line contains 64 bytes, the processor has a mechanism to get the 16-byte sub-fields in order of priority, and the second sub-field is expected to arrive in time to avoid further delay, provided that there are at least 8 bytes of useful code in the first sub-field. Code targeted for earlier model Intel processors will not observe these alignments.  One would expect that further benefits might be achieved by additional analysis; that padding should be added before loops with a high expected execution count if that will enable them to fit in a single 64-byte cache line, and loop unrolling should be adjusted for the same purpose.

Data Page Size

The simplest way to deal with TLB thrashing, if this still occurs when suitable choices of inner and outer loops have been made, is to increase data page size.  If the relevant arrays are allocated on the stack, increase the stack page size, or, if on heap, increase those page sizes.  This should improve the situation where the number of active data pages exceeds the size of the Translation Lookaside Buffer.  Unfortunately, exercising such options has been known to crash the OS.
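
On HP-UX, for example, the chatr utility can request larger page sizes on an existing executable (the accepted sizes, and whether the request is honored, are system dependent):

chatr +pd 4M +pi 64K prog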