====== Scenario x64_7: Floating Point calculations with the use of FPU. ======
In this scenario, we will write the program using FPU to calculate some mathematical equations.
** Prerequisites **\\
Read the sections "Scenario x64_2: Creating static libraries. Integer to text conversion function." and "Scenario x64_4: Converting floating point values to text." You should have the library with the conversion functions ready.
** Scenario **\\
Write the code of an assembler main program performing calculation with PFU and display the resulting values with the previously written conversion library. Use various floating-point types: single precision, double precision and double extended precision.
** Result **\\
The results are calculated and displayed in a console.
** Start **\\
You can create a new assembly source file or modify the file that uses the conversion library created in "Scenario x64_4: Converting floating point values to text."
** Step 1**\\
We will start with a simple addition of two single-precision floating-point numbers to test if all works correctly. At the beginning, we need to include all required libraries and a file containing function prototypes.
; include the system library and our convert library
includelib kernel32.lib
includelib convert.lib
include convert.inc
We need to declare external system functions and a constant for GetStdHandle.
; declare external functions
EXTERN WriteConsoleA:PROC
EXTERN GetStdHandle:PROC
; constant required by the GetStdHandle system function
STD_OUTPUT_HANDLE equ -11
The data section of our program contains the buffer for the text representation of the number to be displayed, the variables required for the proper operation of system functions, our two addends, and a place for the result.
.data
buffer db 32 dup(0) ; buffer for a string
hOut dq ? ; placeholder for console handle
dummy dq ? ; place for dummy parameter
var_x real4 4.89 ; first addend
var_y real4 17.45 ; second addend
result real4 ? ; place for result
The code of the main function is similar to the one presented in "Scenario x64_2: Creating static libraries. Integer to text conversion function." Here we show the FPU calculations and the conversion function call. Remember that the floating point argument to the conversion function is passed by XMM0.
Our conversion function is written for a single-precision data format. If you want to make computations with double-precision numbers, you need to modify this function or write another one (recommended).
; perform FPU calculations
fld var_x ; load first addend to FPU stack
fadd var_y ; add second addend
fstp result ; store the result
; convert float to text
lea rdx, buffer ; prepare the address of a buffer
movss xmm0, result ; pass the result as an argument to a function
call float_to_ascii ; call conversion function
** Step 2 **\\
While we confirmed that our simple program works properly, we can write more complex FPU calculations. Let's compute the average of the values in a table.\\
We start with the definition of a table and a variable that holds the table length.
.data
...
table_real real4 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0
table_count DW LENGTHOF table_real
The code to compute the average starts with the preparation for a loop. The number of iterations is stored in rcx. We zero-extend the word variable **table_count** to the size of **rcx** with the use of the **movzx** instruction. We need to prepare the address register to access table elements with indirect addressing. Here, the address register is **rdx**. We decrement **rcx** and add 4 to **rdx** because the first element is pre-loaded to the FPU stack before the loop starts.
; perform FPU calculations - average
movzx rcx, table_count
lea rdx, table_real
fld dword ptr [rdx]
sub rcx, 1
add rdx, 4
Inside a loop, we compute the sum of all elements. After finishing a loop, we divide the sum by the number of elements with the use of the integer divide instruction **fidiv**.
average_loop:
fadd dword ptr [rdx]
add rdx, 4
loop average_loop
fidiv table_count
fstp result
** Step 3 **\\
In this step, we'll solve the quadratic equation. The classical method is to calculate the delta, and later to calculate two possible results: root_1 and root_2. Let's start with writing formulas in Reverse Polish Notation (postfix notation). The formula to calculate delta looks like the following.
** Delta **
b b * 4 a * c * -
This can be calculated with the code below.
.data
aa real4 3.0 ; a
bb real4 7.0 ; b
cc real4 2.0 ; c
const_4 real4 4.0 ; constant 4
.code
; calculating delta
fld bb ; load b
fld bb ; load b
fmul ; multiply b * b
fld const_4 ; load 4
fld aa ; load a
fmul ; multiply 4 * a
fld cc ; load c
fmul ; multiply 4*a * c
fsub ; subtract b*b - 4*a*c
After executing the code, we have the delta at the FPU stack top. We will use it to compute both roots. Before we'll test if delta is not negative.
; here we test if delta is non-negative
ftst ; test st(0)
fstsw ax ; store status word to ax
sahf ; store ah to flags
jb no_roots ; jump if delta is negative
It's time to calculate roots. The RPN formula for this calculation looks like this:
** X1 **
b neg delta sqrt + 2 a * /
Writing the code for this calculation, we obtain:
.data
const_2 real4 2.0 ; constant 2
root_1 real4 ? ; place for the first result
root_2 real4 ? ; place for the second result
.code
; calculate first root
fld bb ; load b
fchs ; change sign of b
fld st(1) ; copy delta into st(0)
fsqrt ; square root of delta
fadd ; add (-b) + square root of delta
fld const_2 ; load 2
fld aa ; load a
fmul ; multiply 2 * a
fdiv ; divide
; we have the first root
fstp root_1
The RPN-noted formula for the second root differs by one operation only. Minus instead of plus.
** X2 **
b neg delta sqrt - 2 a * /
Please write the code for the second root calculation on your own. Please note that after calculating the first root, we still have the delta in **st(0)**. Write code to properly handle the situation when we don't have a real solution by displaying the appropriate message instead of results.
** Step 4 **\\
Our previous program was almost a direct translation from RPN to FPU mnemonics. We can optimise the code in some places. Here are some examples:
* Instead of time-consuming loading constants from memory, we can load a constant 1.0 and compute 2.0 and 4.0.
* Because we need 2*a and 4*a in the calculation, we could compute both values in a single code fragment.
* We can avoid calculating values twice; it's better to keep them in registers; in our program, we'll do it with 2a, delta and the square root of delta.
* We can use reverse versions of instructions to avoid data moving.
* We can use memory arguments instead of loading them into registers.
The optimised version of the code is shown in the following fragments. They also show the contents of the FPU stack after executing each instruction to track the algorithm.
; Calculating delta st(0); st(1); st(2); st(3);
fld1 ; Get constant 1 1; ; ; ;
fadd st,st ; Add 1 to 1 to get 2 2; ; ; ;
fld st ; Copy it 2; 2; ; ;
fmul aa ; = 2a 2a; 2; ; ;
fmul st(1),st ; = 4a 2a; 4a; ; ;
fxch ; Exchange st and st(1) 4a; 2a; ; ;
fmul cc ; = 4ac 4ac; 2a; ; ;
fld bb ; Load b b; 4ac; 2a; ;
fmul st,st ; = b*b b*b; 4ac; 2a; ;
fsubr ; = b*b - 4ac (always pop) delta; 2a; ; ;
; you can test here if the delta is non-negative
The following code presents computing both roots in the same fragment of a program. Please note that **d** represents **delta**, while **pv** and **mv** results before division by **2a**.
; Compute both roots st(0); st(1); st(2); st(3);
fsqrt ; = sqrt(b*b - 4ac) sqr(d); 2a; ; ;
fld bb ; Load b b; sqr(d); 2a; ;
fchs ; Make it negative -b; sqr(d); 2a; ;
fxch ; Exchange sqr(d); -b; 2a; ;
fld st ; Copy square root sqr(d); sqr(d); -b; 2a;
fadd st,st(2) ; pv = -b + root(b*b - 4ac) pv; sqr(d); -b; 2a;
fxch ; Exchange sqr(d); pv; -b; 2a;
fsubp st(2),st ; mv = -b - root(b*b - 4ac) pv; mv; 2a; ;
fdiv st,st(2) ; Divide plus version pv/2a; mv; 2a; ;
fstp root_1 ; Store it mv; 2a; ; ;
fdivr ; Divide minus version mv/2a; ; ; ;
fstp root_2 ; Store it ; ; ; ;
This optimised implementation requires detailed stack tracing, but can yield significant speed improvements.
** Step 5 **\\
Make calculations with double-precision and double-extended precision values. Compare results. Note that for longer data formats, you need to modify the conversion library, remembering that SSE does not operate on double-extended precision values.
** Result validation **\\
The program should display proper floating-point results of calculations. Please check the positive and negative input values.
** If it does not print the proper value? **\\
Please remember the order and placement of the arguments and the result of the function. Please keep the Windows ABI calling convention and preserve any non-volatile register before use. You can use FPU registers freely, remembering that they are organised as a stack.