Differences

This shows you the differences between two versions of the page.

--- en:multiasm:papc:chapter_6_15 [2025/11/25 14:48] – created ktokarz
+++ en:multiasm:papc:chapter_6_15 [2026/02/20 11:49] (current) – [Cache temporal locality] ktokarz
@@ Line 2: / Line 2: @@
 Optimisation strongly depends on the microarchitecture of the processor, and some recommendations change with new processor generations. Manufacturers usually publish up-to-date recommendations. The latest release of the Intel documentation is the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" ((https://www.intel.com/content/www/us/en/developer/articles/technical/intel64-and-ia32-architectures-optimization.html)). AMD published the document "Software Optimization Guide for the AMD Zen5 Microarchitecture" ((https://docs.amd.com/v/u/en-US/58455_1.00)).
 A selection of specific optimisation recommendations is described in this section.
-===== The use of inc and dec instructions =====
-It is natural for programmers to use **inc** or **dec** instructions to increment or decrement the variable. They are simple and appear to be executed faster than addition and subtraction with a constant "1". The **inc** and **dec** are single-byte instructions, while **add** and **sub** with the argument as a constant consume at least one byte more. The problem with **inc** and **dec** instructions stems from the fact that they do not modify all flags, whereas **add** and **sub** modify all. Modifying all flags frees the processor from the need to wait for previously executed instructions to finish their operation in terms of flag modification. Intel recommends replacing **inc** and **dec** with **add** and **sub** instructions, but compilers do not always consider this recommendation.
-===== Versions of logic instructions =====
-While new extensions are introduced, several new instructions appear. In addition to advanced data processing instructions, simple logic instructions are also implemented. Previous versions of instructions are extended to operate with the latest, bigger registers. This may lead to confusion about which instruction to use, especially when they perform the same operation and give the same result. Let's consider three logic XOR instructions **pxor**, **xorps** and **xorpd**. All of them can operate on 128-bit XMM registers performing the bit-wise logic XOR function. At first sight, the instruction choice is meaningless - the result will be the same. In reality, the selection of the instruction matters. The performance analysis yields a result that, in different situations, the execution time can be longer or shorter. A deeper analysis reveals that when previous calculations are performed with integers, it is better to use integer operation **pxor**; if the data is floating-point, it is better to use the floating-point version **xorps** or **xorpd**. There is a section in the Intel optimisation manual about mixing SIMD data types. It is recommended to use packed-single instead of packed-double when possible.
 ===== Data placement =====
@@ Line 15: / Line 8: @@
 ===== Registers use =====
 It is recommended to use registers instead of memory for scalar data if possible. Keeping data in registers eliminates the need to load and store it in memory.
+===== The use of inc and dec instructions =====
+It is natural for programmers to use **inc** or **dec** instructions to increment or decrement a variable. These instructions are simple and intuitively appear to execute faster than addition or subtraction of the constant 1. In 32-bit mode, the register forms of **inc** and **dec** are single-byte instructions (in 64-bit mode these opcodes serve as REX prefixes, so a two-byte encoding is used), while **add** and **sub** with an immediate operand are at least one byte longer. The problem with **inc** and **dec** is that they do not modify all flags: they leave the carry flag (CF) unchanged, whereas **add** and **sub** write all arithmetic flags. Writing all flags frees the processor from waiting for earlier instructions to produce their flag results, because no old flag value has to be merged with the new ones. Intel recommends replacing **inc** and **dec** with **add** and **sub**, but compilers do not always follow this recommendation.
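The flag behaviour described above can be observed directly. The following is a minimal sketch, assuming GCC or Clang on an x86 processor with the default AT&T inline-assembly syntax; the function names `carry_after_inc` and `carry_after_add` are made up for this illustration. Each function sets the carry flag, increments a register holding zero, and then reads the carry flag back.

```c
/* Returns the carry flag observed after "stc" followed by "inc". */
static int carry_after_inc(void) {
    unsigned int x = 0;
    unsigned char cf;
    __asm__ volatile(
        "stc\n\t"        /* set CF = 1 */
        "incl %0\n\t"    /* inc does not write CF, the old value survives */
        "setc %1"        /* copy CF into cf */
        : "+r"(x), "=q"(cf)
        :
        : "cc");
    return cf;
}

/* Returns the carry flag observed after "stc" followed by "add $1". */
static int carry_after_add(void) {
    unsigned int x = 0;
    unsigned char cf;
    __asm__ volatile(
        "stc\n\t"          /* set CF = 1 */
        "addl $1, %0\n\t"  /* add writes all arithmetic flags; 0 + 1 gives no carry */
        "setc %1"          /* copy CF into cf */
        : "+r"(x), "=q"(cf)
        :
        : "cc");
    return cf;
}
```

Under these assumptions, `carry_after_inc()` returns 1 because the old carry value is preserved, while `carry_after_add()` returns 0 because **add** overwrites it. The preserved flag is exactly the dependency on earlier instructions that the recommendation aims to avoid.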
+===== Versions of logic instructions =====
+As new extensions are introduced, new instructions appear. Besides advanced data processing instructions, simple logic instructions are also added, and earlier instructions are extended to operate on the newer, wider registers. This may lead to confusion about which instruction to use, especially when several perform the same operation and give the same result. Consider three XOR instructions: **pxor**, **xorps** and **xorpd**. All of them can operate on 128-bit XMM registers and perform the bit-wise XOR function. At first sight, the choice of instruction seems irrelevant, because the result is the same. In reality, the selection matters: performance analysis shows that, depending on the situation, the execution time can be longer or shorter. When the preceding calculations are performed on integers, it is better to use the integer instruction **pxor**; if the data is floating-point, it is better to use the floating-point version **xorps** or **xorpd**. The Intel optimisation manual has a section on mixing SIMD data types. It is recommended to use packed-single instead of packed-double when possible.
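In C, the same choice is expressed through the intrinsic that matches the data type. The sketch below assumes an x86 compiler with SSE2 support; the helper names `toggle_bits_int` and `flip_sign_ps` are hypothetical. The intrinsics correspond to **pxor** and **xorps** respectively, although an optimising compiler is free to select a different encoding.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Integer data: _mm_xor_si128 corresponds to pxor. */
static __m128i toggle_bits_int(__m128i v, __m128i mask) {
    return _mm_xor_si128(v, mask);
}

/* Single-precision data: _mm_xor_ps corresponds to xorps.
   XOR with the sign-bit pattern (-0.0f) negates all four floats. */
static __m128 flip_sign_ps(__m128 v) {
    const __m128 sign = _mm_set1_ps(-0.0f);
    return _mm_xor_ps(v, sign);
}
```

Keeping integer vectors in `__m128i` and floating-point vectors in `__m128` makes it harder to mix the two instruction families by accident.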
 ===== Pause instruction =====
 A common way to wait briefly for an event is to spin in a loop. For a short wait, this method is more efficient than calling an operating system function that waits for the event. In modern processors, the **pause** instruction should be used inside such a loop. It helps the processor's internal mechanisms to temporarily allocate hardware resources to another logical processor.
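A spin loop of this kind might look as follows in C. This is a minimal sketch, assuming a C11 compiler on x86; the function name `spin_wait` and the convention that a non-zero flag means "event occurred" are assumptions made for the example. The `_mm_pause` intrinsic emits the **pause** instruction.

```c
#include <immintrin.h>   /* _mm_pause */
#include <stdatomic.h>

/* Spins until *flag becomes non-zero.
   Returns the number of loop iterations spent waiting. */
static unsigned long spin_wait(atomic_int *flag) {
    unsigned long spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        _mm_pause();     /* hint to the processor: this is a spin-wait loop */
        ++spins;
    }
    return spins;
}
```

Another thread is expected to set the flag with a release store. For waits that may last longer, falling back to an operating system wait function after a bounded number of spins is usually the safer design.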
 ===== Cache utilisation =====
@@ Line 33: / Line 31: @@
   * Object-oriented programming helps to utilise cache because members of the class are grouped.
 ===== Cache temporal locality =====
- This feature helps improve performance in situations where the program uses the same variables repeatedly, e.g. in a loop.
+Cache temporal locality is the feature that helps improve performance in situations where the program uses the same variables repeatedly, e.g. in a loop. If the processed data exceeds half the size of the level 1 cache, it is recommended to use the non-temporal move instructions **movntq** and **movntdq** to store data from registers to memory. These instructions are hints to the processor to bypass the cache if possible. This does not mean that the data is written to memory immediately. It can remain in the processor's internal buffers, so the latest version may not yet be visible to other units of the computer. It is the programmer's responsibility to synchronise the data using the **sfence** (Store Fence) instruction.
-In a situation where the data processed exceeds half the size of a level 1 cache, it is recommended to use the non-temporal data move instructions **movntq** and **movntdq** to store data from registers to memory. These instructions are hints to the processor to omit the cache if possible. It doesn't mean that the data is immediately stored directly in memory. It can remain in the internal processor's buffers, and it is likely that the last version is not visible to other units of the computer. It is the programmer's responsibility to synchronise the data using the **sfence** (Store Fence) instruction.
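Non-temporal stores followed by a store fence can be sketched in C as shown below. The example assumes an x86 compiler with SSE2 support; the function name `fill_streaming` and its alignment requirements are choices made for this illustration. `_mm_stream_si128` corresponds to **movntdq** and `_mm_sfence` to **sfence**.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Fills n_bytes at dst with value using non-temporal stores.
   Assumption: dst is 16-byte aligned and n_bytes is a multiple of 16. */
static void fill_streaming(void *dst, uint8_t value, size_t n_bytes) {
    assert(((uintptr_t)dst % 16) == 0 && (n_bytes % 16) == 0);
    const __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < n_bytes / 16; ++i) {
        _mm_stream_si128(p + i, v);  /* store that hints the cache should be bypassed */
    }
    _mm_sfence();  /* make the buffered stores globally visible before returning */
}
```

The fence is issued once after the loop rather than after every store, so that the processor can still combine the writes in its internal buffers.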
 ===== Cache support instructions =====
@@ Line 44: / Line 41: @@
 Fence instructions guarantee that the load and/or store instructions before the fence are completed before the corresponding instructions after the fence are executed.
-  * **spence** force the memory–cache synchronisation after store instructions
+  * **sfence** forces the memory–cache synchronisation after store instructions
   * **lfence** forces the memory–cache synchronisation after load instructions
   * **mfence** forces the memory–cache synchronisation after load and store instructions
@@ Line 51: / Line 48: @@
   * **prefetch** a hint to the processor that the memory area should be brought higher in the cache hierarchy
   * **clflush** flushes a Cache Line from all levels of cache.
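The two instructions listed above are available in C through intrinsics. The sketch below assumes an x86 compiler with SSE2 support; the names `sum_with_prefetch`, `flush_line` and the prefetch distance of 16 elements are hypothetical values chosen for illustration, not tuned recommendations.

```c
#include <emmintrin.h>   /* _mm_prefetch, _mm_clflush */
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* elements ahead; an assumed, untuned value */

/* Sums an array while hinting that data further ahead will be needed soon. */
static long sum_with_prefetch(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n) {
            /* request the line into all cache levels (T0 hint) */
            _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE], _MM_HINT_T0);
        }
        sum += data[i];
    }
    return sum;
}

/* Flushes the cache line that contains p from all cache levels. */
static void flush_line(const void *p) {
    _mm_clflush(p);
}
```

Both calls affect only performance, not the values stored in memory. For a simple sequential scan such as this one, the hardware prefetcher often does the job already, so an explicit prefetch is more likely to pay off for irregular access patterns.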
 ===== Further reading =====

en/multiasm/papc/chapter_6_15.1764074920.txt.gz · Last modified: 2025/11/25 14:48 by ktokarz