====== Optimisation ======
Optimisation strongly depends on the microarchitecture of the processor. Some optimisation recommendations change together with new versions of processors. Manufacturers usually publish the most up-to-date recommendations. The latest release of the Intel documentation is the "Intel 64 and IA-32 Architectures Optimization Reference Manual".
A selection of specific optimisation recommendations is described in this section.
It is recommended to use registers instead of memory for scalar data if possible. Keeping data in registers eliminates the need to load and store it in memory.
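As a rough illustration of this recommendation, the sketch below (assuming NASM syntax for x86-64; the ''sum_array'' routine and the ''total'' variable are purely illustrative) keeps the accumulator of a summation loop in a register and touches the memory variable only once.
<code asm>
; Sketch (assumed NASM syntax, x86-64): sum an array of qwords.
; The scalar accumulator stays in RAX for the whole loop; the memory
; variable is written only once, after the loop has finished.

section .bss
total:  resq 1                  ; scalar result, touched only at the end

section .text
global sum_array
; in: rsi = address of the array, rcx = number of elements (> 0)
sum_array:
        xor     rax, rax        ; accumulator lives in a register
.next:
        add     rax, [rsi]      ; add the current element
        add     rsi, 8          ; advance to the next qword
        dec     rcx
        jnz     .next
        mov     [rel total], rax  ; store the scalar to memory just once
        ret
</code>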
===== Pause instruction =====
It is a common method to pause the program execution and wait for an event for a short period in a spin loop. For a brief waiting period, this method is more efficient than calling an operating system function which waits for an event. In modern processors, the **pause** instruction should be used inside such a loop. It helps the internal mechanisms of the processor to temporarily allocate hardware resources to another logical processor.
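A minimal sketch of such a spin loop is shown below (assuming NASM syntax for x86-64; the ''flag'' variable and the ''wait_for_flag'' label are purely illustrative).
<code asm>
; Sketch (assumed NASM syntax, x86-64): spin-wait for a flag set by
; another thread. PAUSE marks the loop as a spin-wait, which lowers power
; consumption and lets the sibling logical processor use shared resources.

section .bss
flag:   resd 1                  ; illustrative flag written by another thread

section .text
global wait_for_flag
wait_for_flag:
.spin:
        pause                   ; hint: this is a spin-wait loop
        cmp     dword [rel flag], 0
        je      .spin           ; keep spinning until the flag is non-zero
        ret
</code>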
| + | |||
| + | |||
===== Cache utilisation =====
In modern microarchitectures, accesses to code and data normally pass through several levels of cache memory placed between the processor core and the main memory, so efficient use of the cache has a major impact on performance.
The cache works on two main principles:
  * temporal locality,
  * spatial locality.
The term temporal locality refers to the fact that if a program recently used a certain portion of data, it is likely to need it again soon. It means that if data is used, it remains in a cache for a certain amount of time until other data is loaded into the cache. It is efficient to keep data in a cache instead of reloading it.
The term spatial locality refers to the fact that if a program has recently accessed data at a particular address, it is likely to soon need data at neighbouring addresses. The cache helps the program run faster by automatically prefetching data and code that will likely be used or executed soon.
It is recommended to keep these rules in mind when writing programs in any programming language. Some recommendations are listed below:
  * The program should do as much work as possible on one small area of code and data; after doing the job, it can move to the next part.
  * The program should avoid frequent jumping over distant regions of memory.
  * While processing big multidimensional data arrays, keep in mind their placement in memory (row-wise or column-wise) and traverse them in the order in which they are stored, so that consecutive accesses touch consecutive addresses (see the sketch after this list).
  * Object-oriented programming helps to utilise the cache because members of a class are grouped together in memory.
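The sketch below illustrates the array-traversal recommendation (assuming NASM syntax for x86-64 and a row-wise, C-style layout; the ''matrix'' name and the array dimensions are illustrative). The loop walks consecutive addresses, so every loaded cache line is fully used before the next one is needed.
<code asm>
; Sketch (assumed NASM syntax, x86-64): sum a ROWS x COLS array of dwords
; stored row-wise. A column-wise walk over the same data would step
; COLS*4 bytes between accesses and touch a new cache line almost every
; time, wasting most of each line it loads.

ROWS    equ 1024
COLS    equ 1024

section .bss
matrix: resd ROWS * COLS

section .text
global sum_row_major
sum_row_major:
        lea     rsi, [rel matrix]
        mov     rcx, ROWS * COLS
        xor     eax, eax
.next:
        add     eax, [rsi]      ; consecutive dwords: good spatial locality
        add     rsi, 4
        dec     rcx
        jnz     .next
        ret
</code>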
===== Cache temporal locality =====
This feature helps improve performance in situations where the program uses the same variables repeatedly, e.g. in a loop.
In a situation where the data processed exceeds half the size of a level 1 cache, it is recommended to use the non-temporal data move instructions **movntq** and **movntdq** to store data from registers to memory. These instructions are hints to the processor to omit the cache if possible. It doesn't guarantee that the cache will be bypassed, but it helps to avoid evicting data that is still useful.
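A minimal sketch of such a non-temporal store loop is shown below (assuming NASM syntax for x86-64; the ''clear_buffer_nt'' routine, its register calling convention and the 16-byte alignment requirement are illustrative assumptions).
<code asm>
; Sketch (assumed NASM syntax, x86-64): fill a large buffer with zeros
; using non-temporal stores. MOVNTDQ writes a 16-byte XMM register to
; memory with a hint to bypass the cache, so useful cache lines are not
; evicted by data that will not be read back soon. SFENCE makes the
; weakly ordered stores globally visible before returning.

section .text
global clear_buffer_nt
; in: rdi = destination (assumed 16-byte aligned),
;     rcx = size in bytes (assumed non-zero multiple of 16)
clear_buffer_nt:
        pxor    xmm0, xmm0      ; 16 bytes of zeros
.next:
        movntdq [rdi], xmm0     ; non-temporal store, bypasses the cache
        add     rdi, 16
        sub     rcx, 16
        jnz     .next
        sfence                  ; order the non-temporal stores
        ret
</code>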
| + | |||
===== Cache support instructions =====
There are also instructions which allow the programmer to support the processor with cache utilisation:
  * **movntq** saves the contents of an MMX register to memory, bypassing the cache,
  * **movntps** writes the contents of an SSE register to memory, bypassing the cache,
  * **maskmovq** writes selected bytes from an MMX register to memory, bypassing the cache,
  * **movntdqa** performs a non-temporal aligned load from memory (a load hint instruction).
| + | |||
Fence instructions guarantee that all memory operations issued before the fence are completed before any operations issued after it:
  * **sfence** forces the memory–cache synchronisation after store instructions,
  * **lfence** forces the memory–cache synchronisation after load instructions,
  * **mfence** forces the memory–cache synchronisation after both load and store instructions.
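A minimal sketch of where **sfence** matters is shown below (assuming NASM syntax for x86-64; the ''payload'' and ''ready'' variables are purely illustrative). Because non-temporal stores are weakly ordered, the fence keeps the data globally visible before the flag that publishes it.
<code asm>
; Sketch (assumed NASM syntax, x86-64): a producer publishing a value
; written with a non-temporal store. Non-temporal stores are weakly
; ordered, so without the SFENCE the flag could become visible to another
; core before the payload it guards.

section .bss
payload: resq 1                 ; illustrative data produced for a consumer
ready:   resd 1                 ; illustrative flag polled by the consumer

section .text
global publish
; in: rax = value to publish
publish:
        movnti  [rel payload], rax   ; weakly ordered non-temporal store
        sfence                       ; payload becomes visible first
        mov     dword [rel ready], 1 ; only then may the consumer see the flag
        ret
</code>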
| + | |||
Some instructions are hints to the processor indicating that the programmer expects the data to be kept in the cache rather than only in memory, or that they do not expect to use the data again soon:
  * **prefetch** is a hint to the processor to load the indicated data into the cache before it is actually used,
  * **clflush** flushes a cache line from all levels of the cache.
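A minimal sketch of software prefetching is shown below (assuming NASM syntax for x86-64; the prefetch distance of 256 bytes is only an illustrative guess and should be tuned for the target microarchitecture).
<code asm>
; Sketch (assumed NASM syntax, x86-64): sum a large array of qwords while
; prefetching ahead. PREFETCHT0 asks the processor to start loading a cache
; line that will be needed a few iterations later; it is only a hint and
; never changes the result of the program.

section .text
global sum_with_prefetch
; in: rsi = address of the array, rcx = number of qword elements (> 0)
sum_with_prefetch:
        xor     rax, rax
.next:
        prefetcht0 [rsi + 256]  ; illustrative distance: 4 cache lines ahead
        add     rax, [rsi]
        add     rsi, 8
        dec     rcx
        jnz     .next
        ret
</code>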
===== Further reading =====
The essential readings on the optimisation topic are the vendors' optimisation manuals. An exceptional and interesting resource is the Understanding Windows x64 Assembly tutorial ((https://)).