====== Scenario x64_7: Floating Point calculations with the use of FPU. ======
In this scenario, we will write the program using FPU to calculate some mathematical equations.

** Prerequisites **\\
Read the sections "Scenario x64_2: Creating static libraries. Integer to text conversion function." and "Scenario x64_4: Converting floating point values to text." You should have the library with the conversion functions ready.

** Scenario **\\
Write the code of an assembler main program performing calculation with PFU and display the resulting values with the previously written conversion library. Use various floating-point types: single precision, double precision and double extended precision.

** Result **\\
The results are calculated and displayed in a console.

** Start **\\
You can create a new assembly source file or modify the file that uses the conversion library created in "Scenario x64_4: Converting floating point values to text."

** Step 1**\\
We will start with a simple addition of two single-precision floating-point numbers to test if all works correctly. At the beginning, we need to include all required libraries and a file containing function prototypes.

<code asm>
; include the system library and our convert library
includelib kernel32.lib
includelib convert.lib
include convert.inc
</code>

We need to declare external system functions and a constant for GetStdHandle.
<code asm>
; declare external functions
EXTERN WriteConsoleA:PROC
EXTERN GetStdHandle:PROC

; constant required by the GetStdHandle system function
STD_OUTPUT_HANDLE equ -11
</code>

The data section of our program contains the buffer for the text representation of the number to be displayed, the variables required for the proper operation of system functions, our two addends, and a place for the result.

<code asm>
.data
    buffer db 32 dup(0) ; buffer for a string
    hOut   dq ?         ; placeholder for console handle
    dummy  dq ?         ; place for dummy parameter
    var_x real4 4.89    ; first addend
    var_y real4 17.45   ; second addend
    result real4 ?      ; place for result
</code>

The code of the main function is similar to the one presented in "Scenario x64_2: Creating static libraries. Integer to text conversion function." Here we show the FPU calculations and the conversion function call. Remember that the floating point argument to the conversion function is passed by XMM0.

<note>
Our conversion function is written for a single-precision data format. If you want to make computations with double-precision numbers, you need to modify this function or write another one (recommended).
</note>

<code asm>
; perform FPU calculations
    fld var_x           ; load first addend to FPU stack
    fadd var_y          ; add second addend
    fstp result         ; store the result  

; convert float to text
    lea rdx, buffer     ; prepare the address of a buffer
    movss xmm0, result  ; pass the result as an argument to a function
    call float_to_ascii ; call conversion function
</code>


** Step 2 **\\
While we confirmed that our simple program works properly, we can write more complex FPU calculations. Let's compute the average of the values in a table.\\
We start with the definition of a table and a variable that holds the table length.

<code asm>
.data
    ...
    table_real real4 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0
    table_count DW LENGTHOF table_real
</code>

The code to compute the average starts with the preparation for a loop. The number of iterations is stored in rcx. We zero-extend the word variable **table_count** to the size of **rcx** with the use of the **movzx** instruction. We need to prepare the address register to access table elements with indirect addressing. Here, the address register is **rdx**. We decrement **rcx** and add 4 to **rdx** because the first element is pre-loaded to the FPU stack before the loop starts.
<code asm>
; perform FPU calculations - average
    movzx rcx, table_count
    lea rdx, table_real
    fld dword ptr [rdx]
    sub rcx, 1
    add rdx, 4
</code>

Inside a loop, we compute the sum of all elements. After finishing a loop, we divide the sum by the number of elements with the use of the integer divide instruction **fidiv**.
<code asm>
average_loop:
    fadd dword ptr [rdx]
    add rdx, 4
    loop average_loop
    fidiv table_count
    fstp result
</code>

** Step 3 **\\
In this step, we'll solve the quadratic equation. The classical method is to calculate the delta, and later to calculate two possible results: root_1 and root_2. Let's start with writing formulas in Reverse Polish Notation (postfix notation). The formula to calculate delta looks like the following.
<code>
** Delta **
b b * 4 a * c * -
</code>
This can be calculated with the code below.
<code asm>
.data
    aa       real4 3.0  ; a
    bb       real4 7.0  ; b
    cc       real4 2.0  ; c
    const_4  real4 4.0  ; constant 4

.code
; calculating delta
    fld bb      ; load b 
    fld bb      ; load b
    fmul        ; multiply b * b
    fld const_4 ; load 4
    fld aa      ; load a
    fmul        ; multiply 4 * a
    fld cc      ; load c
    fmul        ; multiply 4*a * c
    fsub        ; subtract b*b - 4*a*c
</code>
After executing the code, we have the delta at the FPU stack top. We will use it to compute both roots. Before we'll test if delta is not negative.
<code asm>
; here we test if delta is non-negative
    ftst        ; test st(0)
    fstsw ax    ; store status word to ax
    sahf        ; store ah to flags
    jb no_roots ; jump if delta is negative
</code>

It's time to calculate roots. The RPN formula for this calculation looks like this:
<code>
** X1 **
b neg delta sqrt + 2 a * /
</code> 

Writing the code for this calculation, we obtain:
<code asm>
.data
    const_2  real4 2.0  ; constant 2
    root_1   real4 ?    ; place for the first result
    root_2   real4 ?    ; place for the second result
    
.code
; calculate first root
    fld bb      ; load b
    fchs        ; change sign of b
    fld st(1)   ; copy delta into st(0)
    fsqrt       ; square root of delta
    fadd        ; add (-b) + square root of delta
    fld const_2 ; load 2
    fld aa      ; load a
    fmul        ; multiply 2 * a
    fdiv        ; divide 
; we have the first root
    fstp root_1
</code>

The RPN-noted formula for the second root differs by one operation only. Minus instead of plus.
<code>
** X2 **
b neg delta sqrt - 2 a * /
</code>
Please write the code for the second root calculation on your own. Please note that after calculating the first root, we still have the delta in **st(0)**. Write code to properly handle the situation when we don't have a real solution by displaying the appropriate message instead of results.

** Step 4 **\\
Our previous program was almost a direct translation from RPN to FPU mnemonics. We can optimise the code in some places. Here are some examples:
  * Instead of time-consuming loading constants from memory, we can load a constant 1.0 and compute 2.0 and 4.0.
  * Because we need 2*a and 4*a in the calculation, we could compute both values in a single code fragment.
  * We can avoid calculating values twice; it's better to keep them in registers; in our program, we'll do it with 2a, delta and the square root of delta.
  * We can use reverse versions of instructions to avoid data moving.
  * We can use memory arguments instead of loading them into registers.
The optimised version of the code is shown in the following fragments. They also show the contents of the FPU stack after executing each instruction to track the algorithm.
<code asm>
; Calculating delta                                     st(0);  st(1);  st(2);  st(3);
   fld1              ; Get constant 1                       1;       ;       ;       ;
   fadd    st,st     ; Add 1 to 1 to get 2                  2;       ;       ;       ;
   fld     st        ; Copy it                              2;      2;       ;       ; 
   fmul    aa        ; = 2a                                2a;      2;       ;       ;  
   fmul    st(1),st  ; = 4a                                2a;     4a;       ;       ;
   fxch              ; Exchange st and st(1)               4a;     2a;       ;       ;
   fmul    cc        ; = 4ac                              4ac;     2a;       ;       ;  
   fld     bb        ; Load b                               b;    4ac;     2a;       ;
   fmul    st,st     ; = b*b                              b*b;    4ac;     2a;       ;    
   fsubr             ; = b*b - 4ac (always pop)         delta;     2a;       ;       ;
   
; you can test here if the delta is non-negative
</code>

The following code presents computing both roots in the same fragment of a program. Please note that **d** represents **delta**, while **pv** and **mv** results before division by **2a**.
<code asm>
; Compute both roots                                    st(0);  st(1);  st(2);  st(3);
   fsqrt             ; = sqrt(b*b - 4ac)               sqr(d);     2a;       ;       ;
   fld     bb        ; Load b                               b; sqr(d);     2a;       ;
   fchs              ; Make it negative                    -b; sqr(d);     2a;       ;
   fxch              ; Exchange                        sqr(d);     -b;     2a;       ;

   fld     st        ; Copy square root                sqr(d); sqr(d);     -b;     2a; 
   fadd    st,st(2)  ; pv = -b + root(b*b - 4ac)           pv; sqr(d);     -b;     2a;
   fxch              ; Exchange                        sqr(d);     pv;     -b;     2a;
   fsubp   st(2),st  ; mv = -b - root(b*b - 4ac)           pv;     mv;     2a;       ;

   fdiv    st,st(2) ; Divide plus version               pv/2a;     mv;     2a;       ; 
   fstp    root_1   ; Store it                             mv;     2a;       ;       ;
   fdivr            ; Divide minus version              mv/2a;       ;       ;       ;
   fstp    root_2   ; Store it                               ;       ;       ;       ;
</code>

This optimised implementation requires detailed stack tracing, but can yield significant speed improvements.

** Step 5 **\\
Make calculations with double-precision and double-extended precision values. Compare results. Note that for longer data formats, you need to modify the conversion library, remembering that SSE does not operate on double-extended precision values.

** Result validation **\\
The program should display proper floating-point results of calculations. Please check the positive and negative input values.

** If it does not print the proper value? **\\
Please remember the order and placement of the arguments and the result of the function. Please keep the Windows ABI calling convention and preserve any non-volatile register before use. You can use FPU registers freely, remembering that they are organised as a stack.