@
A NOTE ON IBM QUADRUPLE PRECISION ARITHMETIC
This file: http://ftp.aset.psu.edu/pub/ger/fortran/hdk/qpnote.for
by
H. D. Knoble (PSU)
(814)-863-0422 ext 236
The Pennsylvania State University
Computation Center
University Park, Pa. 16082
SHARE 54, Math Software Project
Session A450, Tuesday, March 4 1980, 8:30am
Quadruple precision arithmetic can be utilized with IBM's Fortran Hx
compiler by declaring floating-point variables with the declaration
REAL*16 and by using a Q for scientific exponent notation (eg. 3.Q-1 is
the notation for 3*10**(-1) ). Some double precision Fortran Hx
programs can be run as though all double precision variables and
constants were quadruple precision via the Hx compiler
PARM='AUTODBL(DBL8)'. Converting double precision (16 decimal digit
precision) to quadruple precision(33 decimal digit precision) involves
execution time expense considerably greater than converting from single
precision (7 decimal digit precision) to double precision. There are at
least two reasons for this: (1) Fortran Hx's generated code has little
or no register optimization; that is a QP number fills two of the System
370's four floating-point registers; this necessitates many more machine
instructions (stores and fetches). (2) QP arithmetic operations may be
"simulated" via software rather than be performed as hardware (firmware)
operations. QP divide is always simulated, usually with the DXR macro
instruction, because IBM 370 CPUs do not support a QP divide
instruction. The IBM 370 series computers do support machine
instructions for QP addition, subtraction, and multiplication.
The combination of optimization loss and software simulation of QP
divide may cause a program whose arithmetic is all QP to execute several
times slower than an equivalent all DP program. For example, it is not
unusual for a CPU bound QP Fortran Hx program to execute 8 to 10 times
slower than the same program optimized in all DP. This reasoning also
holds for the PL/I Optimizing Compiler. The remainder of this note will
address the QP divide implementation and pose a suggestion to improve QP
divide times.
The DXR macro works as follows: it first generates an invalid machine
instruction (op code); when executed the DXR generated code then causes
a hardware interrupt (invalid operation code); the system interrupt
handling software traps this interrupt and determines that it is the
special case of DXR; the DXR supervisor routine is then called to do the
software division. Because there are more instructions executed for
handling the interrupt than in implementing a QP divide, DXR can be
expensive in CPU time. For example, on an IBM 370/3033-Q6 a loop of
100,000 QP divisions via Fortran Hx (DXR) requires about 14 seconds of
CPU time to execute. The following Fortran Hx FUNCTION subprogram does
a software QP divide and will yield results which differ from DXR
results by no more than the eight low-order bits.
@
REAL FUNCTION QDIV*16(X,Y)
C==========QP DIVIDE. THIS USES A NEWTON-RAPHSON ITERATION FOR
C 1/Y. SEE CACM 4(1961):98.
C AGREEMENT IS WITHIN 8 LOW-ORDER BITS OF DXR MACRO;
C (ACCURATE TO APPROXIMATELY 30 DECIMAL DIGITS).
REAL*16 X,Y
DOUBLE PRECISION QPTODP
QPTODP(Y)=Y
QDIV=1.D0/QPTODP(Y)
QDIV=X*QDIV*(2.Q0-Y*QDIV)
RETURN
END
The CPU time on an IBM 3033-Q6 for a loop of 100,000 Fortran Hx calls to
QDIV(X,Y) is about 2.3 seconds, or 6 times faster than doing QP X/Y. If
the QDIV code is implemented in-line with the following Arithmetic
Statement Functions:
REAL*16 X,Y
DOUBLE PRECISION QPTODP
QPTODP(Y)=Y
QDIV(X,Y)=X*(1.D0/QPTODP(Y))*(2.Q0-Y*(1.D0/QPTODP(Y)))
then the CPU time goes down to about 1.7 seconds or about 8 times faster
than doing QP X/Y.
Here we forward a recommendation to the Fortran Hx implementor, namely:
add DXR QP divide code to the Fortran Q Library, say IHQQDIV; then fix
the Hx compiler to avoid costly interrupt handling by calling QP divide.
The long-term advantage of DXR of course is that if IBM ever decides to
implement a machine instruction to perform QP division, existing object
programs could automatically take advantage of this provided that the
(currently) invalid machine instruction matches the (possible) future
machine instruction. In the meantime, the CPU time utilized by Fortran
programs which rely heavily on QP divides can be significantly improved
by avoiding the DXR generated interrupt.
Finally, several useful comments of John Ehrman, Stanford University
Linear Accelerator Center, are gratefully acknowledged here.
@
The reader interested in obtaining further information on
numerical programming techniques for scientific computations is refered
to the three part tutorial, "A Practical Look at Computer Arithmetic",
available from the author. Also see Fortran 90 Quad Precision package
by David Bailey: http://science.nas.nasa.gov/Search/search.html