@ A NOTE ON IBM QUADRUPLE PRECISION ARITHMETIC This file: http://ftp.aset.psu.edu/pub/ger/fortran/hdk/qpnote.for by H. D. Knoble (PSU) (814)-863-0422 ext 236 The Pennsylvania State University Computation Center University Park, Pa. 16082 SHARE 54, Math Software Project Session A450, Tuesday, March 4 1980, 8:30am Quadruple precision arithmetic can be utilized with IBM's Fortran Hx compiler by declaring floating-point variables with the declaration REAL*16 and by using a Q for scientific exponent notation (eg. 3.Q-1 is the notation for 3*10**(-1) ). Some double precision Fortran Hx programs can be run as though all double precision variables and constants were quadruple precision via the Hx compiler PARM='AUTODBL(DBL8)'. Converting double precision (16 decimal digit precision) to quadruple precision(33 decimal digit precision) involves execution time expense considerably greater than converting from single precision (7 decimal digit precision) to double precision. There are at least two reasons for this: (1) Fortran Hx's generated code has little or no register optimization; that is a QP number fills two of the System 370's four floating-point registers; this necessitates many more machine instructions (stores and fetches). (2) QP arithmetic operations may be "simulated" via software rather than be performed as hardware (firmware) operations. QP divide is always simulated, usually with the DXR macro instruction, because IBM 370 CPUs do not support a QP divide instruction. The IBM 370 series computers do support machine instructions for QP addition, subtraction, and multiplication. The combination of optimization loss and software simulation of QP divide may cause a program whose arithmetic is all QP to execute several times slower than an equivalent all DP program. For example, it is not unusual for a CPU bound QP Fortran Hx program to execute 8 to 10 times slower than the same program optimized in all DP. This reasoning also holds for the PL/I Optimizing Compiler. The remainder of this note will address the QP divide implementation and pose a suggestion to improve QP divide times. The DXR macro works as follows: it first generates an invalid machine instruction (op code); when executed the DXR generated code then causes a hardware interrupt (invalid operation code); the system interrupt handling software traps this interrupt and determines that it is the special case of DXR; the DXR supervisor routine is then called to do the software division. Because there are more instructions executed for handling the interrupt than in implementing a QP divide, DXR can be expensive in CPU time. For example, on an IBM 370/3033-Q6 a loop of 100,000 QP divisions via Fortran Hx (DXR) requires about 14 seconds of CPU time to execute. The following Fortran Hx FUNCTION subprogram does a software QP divide and will yield results which differ from DXR results by no more than the eight low-order bits. @ REAL FUNCTION QDIV*16(X,Y) C==========QP DIVIDE. THIS USES A NEWTON-RAPHSON ITERATION FOR C 1/Y. SEE CACM 4(1961):98. C AGREEMENT IS WITHIN 8 LOW-ORDER BITS OF DXR MACRO; C (ACCURATE TO APPROXIMATELY 30 DECIMAL DIGITS). REAL*16 X,Y DOUBLE PRECISION QPTODP QPTODP(Y)=Y QDIV=1.D0/QPTODP(Y) QDIV=X*QDIV*(2.Q0-Y*QDIV) RETURN END The CPU time on an IBM 3033-Q6 for a loop of 100,000 Fortran Hx calls to QDIV(X,Y) is about 2.3 seconds, or 6 times faster than doing QP X/Y. If the QDIV code is implemented in-line with the following Arithmetic Statement Functions: REAL*16 X,Y DOUBLE PRECISION QPTODP QPTODP(Y)=Y QDIV(X,Y)=X*(1.D0/QPTODP(Y))*(2.Q0-Y*(1.D0/QPTODP(Y))) then the CPU time goes down to about 1.7 seconds or about 8 times faster than doing QP X/Y. Here we forward a recommendation to the Fortran Hx implementor, namely: add DXR QP divide code to the Fortran Q Library, say IHQQDIV; then fix the Hx compiler to avoid costly interrupt handling by calling QP divide. The long-term advantage of DXR of course is that if IBM ever decides to implement a machine instruction to perform QP division, existing object programs could automatically take advantage of this provided that the (currently) invalid machine instruction matches the (possible) future machine instruction. In the meantime, the CPU time utilized by Fortran programs which rely heavily on QP divides can be significantly improved by avoiding the DXR generated interrupt. Finally, several useful comments of John Ehrman, Stanford University Linear Accelerator Center, are gratefully acknowledged here. @ The reader interested in obtaining further information on numerical programming techniques for scientific computations is refered to the three part tutorial, "A Practical Look at Computer Arithmetic", available from the author. Also see Fortran 90 Quad Precision package by David Bailey: http://science.nas.nasa.gov/Search/search.html