Understanding Floating-point Performance

Inexact Floating Point Comparisons

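Some floating-point applications exhibit extremely poor performance because they never terminate. In many cases, the cause is a termination test that makes an exact floating-point comparison against a given value: rounding error can keep the computed result from ever equaling that value exactly, so the test never succeeds.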
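The following Python sketch illustrates the problem and one common remedy, comparing against a tolerance instead of testing for exact equality. The function names, the max_steps safety limit, and the tolerance value are assumptions made for this illustration only.

```python
import math

def count_steps_exact(step, target, max_steps=100):
    """Add `step` until the total equals `target` exactly.

    max_steps is a hypothetical safety limit; without it, this loop
    would never terminate when the exact value is never reached.
    """
    total, steps = 0.0, 0
    while total != target and steps < max_steps:
        total += step
        steps += 1
    return steps

def count_steps_tolerant(step, target, max_steps=100, rel_tol=1e-9):
    """Same loop, but stop once the total is within a relative tolerance."""
    total, steps = 0.0, 0
    while not math.isclose(total, target, rel_tol=rel_tol) and steps < max_steps:
        total += step
        steps += 1
    return steps

# Ten additions of 0.1 give 0.9999999999999999, not 1.0, so the exact
# test is never satisfied and only the safety limit stops the loop.
print(count_steps_exact(0.1, 1.0))     # stops at the safety limit
print(count_steps_tolerant(0.1, 1.0))  # stops after ten additions
```

The tolerance should be chosen to suit the magnitude and accumulated error of the values being compared; the value used here is only a placeholder.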

Denormal Computations

A denormal number is one whose mantissa is nonzero but whose exponent field is zero in an IEEE* floating-point representation. The smallest normal single-precision floating-point number greater than zero is about 1.175494350822288e-38. Smaller numbers are possible, but they are denormal and require hardware or operating system intervention to handle, which can cost hundreds of clock cycles.
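The definition above can be checked directly on the bit pattern. The following Python sketch rounds a value to IEEE single precision with the standard struct module and inspects the exponent and mantissa fields; the helper names are hypothetical and chosen for this illustration.

```python
import struct

def float32_fields(x):
    """Return (sign, exponent, mantissa) fields of x rounded to IEEE single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8-bit biased exponent field
    mantissa = bits & 0x7FFFFF          # 23-bit fraction field
    return sign, exponent, mantissa

def is_denormal32(x):
    """True if x, as a single-precision value, has a zero exponent and nonzero mantissa."""
    _, exponent, mantissa = float32_fields(x)
    return exponent == 0 and mantissa != 0

SMALLEST_NORMAL32 = 2.0 ** -126         # about 1.175494350822288e-38

print(is_denormal32(SMALLEST_NORMAL32))      # False: exponent field is 1
print(is_denormal32(SMALLEST_NORMAL32 / 2))  # True: exponent field is 0
print(is_denormal32(0.0))                    # False: zero is not denormal
```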

In many cases, denormal numbers are evidence of an algorithm problem, where a poor choice of algorithm causes too much computation in the denormal range. There are several ways to work around denormal numbers. For example, you can translate to normal: multiply by a large scalar number, do the remaining computations in the normal space, then rescale back down to the denormal range. Use this approach when the program design benefits from keeping the small denormal values. When denormals can be treated as zero, they may instead be flushed to zero.
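A minimal sketch of the translate-to-normal approach follows, written in Python with double-precision values. The scale factor, function names, and sample data are assumptions for this illustration; a power of two is used so that the scaling itself introduces no rounding error.

```python
import sys

SCALE = 2.0 ** 200  # hypothetical scale factor (a power of two)

def is_denormal(x):
    """True if x is nonzero and smaller in magnitude than the smallest normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min

def scaled_sum(values):
    """Sum values that lie in the denormal range by working in the normal range."""
    scaled = [v * SCALE for v in values]   # translate to normal
    if any(is_denormal(s) for s in scaled):
        raise ValueError("scale factor too small for these inputs")
    total = sum(scaled)                    # all arithmetic on normal numbers
    return total / SCALE                   # rescale back down once, at the end

tiny = [3e-310, 5e-310, 7e-310]            # all denormal in double precision
print(all(is_denormal(v) for v in tiny))   # True
print(scaled_sum(tiny) == sum(tiny))       # True: same result as the direct sum
```

Only the final division produces a denormal result; every intermediate operation works on normal numbers, which is where the performance benefit would come from on hardware that handles denormals slowly.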

Denormals are computed in software on Itanium® processors. Hundreds of clock cycles are required, resulting in excessive kernel time. Attempt to understand why denormal results occur and determine if they are justified. If you determine they are not justified, then use the following steps to handle the results:

  1. Translate to a normal problem by scaling values.

  2. Increase precision and range by using a wider data type.

  3. Set flush-to-zero mode in the floating-point status register: -ftz (Linux*) or /Qftz (Windows*).

Note

This process applies to the source file containing PROGRAM. See

Denormal numbers always indicate a loss of precision and an underflow condition, and usually an error (or at least a less-than-desirable condition). On the Intel® Pentium® 4 processor and the Intel Itanium® processor, floating-point computations that generate denormal results can have those results flushed to zero, which improves performance.
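To illustrate the underflow condition, and how a wider data type avoids it, the following Python sketch multiplies two single-precision values whose product falls below the smallest normal single-precision number. Single precision is emulated by rounding through the struct module; the helper name and the sample values are assumptions for this illustration.

```python
import struct

FLT_MIN = 2.0 ** -126    # smallest normal single-precision value
DBL_MIN = 2.0 ** -1022   # smallest normal double-precision value

def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

a = to_float32(1.0e-20)
b = to_float32(1.0e-20)

# The product of two single-precision values is exact in double precision,
# so rounding it once gives the correctly rounded single-precision product.
product_double = a * b
product_single = to_float32(product_double)

print(0.0 < product_single < FLT_MIN)   # True: denormal in single precision
print(product_double >= DBL_MIN)        # True: still normal in double precision
```

The same product, about 1e-40, is a denormal result in single precision but an ordinary normal number in double precision.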

Itanium® compiler

The Itanium® compiler supports the -ftz (Linux) or /Qftz (Windows) option, which flushes denormal results to zero when the application is in gradual underflow mode. Use this option if the denormal values are not critical to application behavior. The option is OFF by default, so the compiler lets results underflow gradually.
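The difference between gradual underflow and flushing to zero can be sketched in software. The following Python example emulates flush-to-zero on double-precision values; it only models the arithmetic effect and does not change any processor mode, and the function names are hypothetical.

```python
import math
import sys

DBL_MIN = sys.float_info.min   # smallest normal double-precision value

def flush_to_zero(x):
    """Software emulation of FTZ: replace a denormal value with a signed zero."""
    if x != 0.0 and abs(x) < DBL_MIN:
        return math.copysign(0.0, x)
    return x

def halvings_until_zero(start, flush):
    """Count how many times start can be halved before the result is zero."""
    x, count = start, 0
    while x != 0.0:
        x = x / 2.0
        if flush:
            x = flush_to_zero(x)
        count += 1
    return count

# Gradual underflow passes through every denormal power of two before
# reaching zero; with flushing, the first denormal result becomes zero.
print(halvings_until_zero(DBL_MIN, flush=False))
print(halvings_until_zero(DBL_MIN, flush=True))
```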

The -ftz (Linux) or /Qftz (Windows) switch only needs to be used on the source file containing PROGRAM. The switch turns on Flush-to-Zero (FTZ) mode for the process started by PROGRAM. The initial thread, and any threads subsequently created by that process, operate in FTZ mode. Note that the -O3 (Linux) or /O3 (Windows) option turns -ftz (Linux) or /Qftz (Windows) ON. Use /Qftz- (Windows) to disable flushing denormal results to zero.

IA-32 and Intel® EM64T compilers

The Intel® compiler automatically sets flush-to-zero mode in the SSE Control Register (MXCSR) when SSE instructions are enabled.  SSE instructions are enabled by default in the Intel® EM64T compiler. Enable SSE instructions in the IA-32 compiler by using -xK, -xW, -xN, -xB, or -xP (Linux) or /QxK, /QxW, /QxN, /QxB, or /QxP (Windows). The MXCSR flush-to-zero setting only affects the behavior of SSE, SSE2, and SSE3 instructions.  x87 floating-point instructions are not affected.

Refer to IA-32 Intel® Architecture Software Developer’s Manual Volume 1: Basic Architecture for more details about flush to zero or specific bit field settings.
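As a small illustration of the bit-field view, the following Python sketch manipulates a plain integer that stands in for the MXCSR value. It assumes the layout described in that manual, in which the flush-to-zero flag is bit 15 and the reset value is 0x1F80. The constant and function names are hypothetical, and the code does not read or write the real register.

```python
MXCSR_DEFAULT = 0x1F80       # assumed reset value: flush-to-zero off
MXCSR_FTZ_BIT = 1 << 15      # assumed position of the flush-to-zero flag

def ftz_enabled(mxcsr):
    """True if the flush-to-zero bit is set in this MXCSR image."""
    return bool(mxcsr & MXCSR_FTZ_BIT)

def set_ftz(mxcsr, enabled):
    """Return a copy of the MXCSR image with the flush-to-zero bit set or cleared."""
    if enabled:
        return mxcsr | MXCSR_FTZ_BIT
    return mxcsr & ~MXCSR_FTZ_BIT

with_ftz = set_ftz(MXCSR_DEFAULT, True)
print(hex(with_ftz), ftz_enabled(with_ftz))
```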

Use the -ftz (Linux) or /Qftz (Windows) switch to flush x87 floating-point values to zero. The option must be used on the source file containing PROGRAM and on any source file where abrupt underflow is desired. Using -ftz (Linux) or /Qftz (Windows) can significantly degrade performance, because the generated code stream must be synchronized after each floating-point instruction to allow the operating system to make the necessary abrupt underflow corrections.

Note

Windows* Only: The /Qftz option is not supported for Intel® EM64T.

Detailed Microarchitectural Optimization Analysis (Itanium® Compiler)

For more detailed advice on microarchitectural optimization and cycle accounting, refer to the Introduction to Microarchitectural Optimization for Itanium® 2 Processors Reference Manual, also known as the “Software Optimization book”, document number 251464-001, located at http://www.Intel.com/software/products/vtune/techtopic/Software_Optimization.pdf.