====== Optimisation ======
Optimisation strongly depends on the microarchitecture of the processor. Some optimisation recommendations change together with new versions of processors. Manufacturers usually publish the most up-to-date recommendations. The latest release of the Intel documentation is the "Intel 64 and IA-32 Architectures Optimization Reference Manual".
A selection of specific optimisation recommendations is described in this section.
It is recommended to use registers instead of memory for scalar data if possible. Keeping data in registers eliminates the need to load and store it in memory.
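As a rough illustration of this recommendation, the sketch below (assuming NASM syntax for x86-64; the ''sum_array'' routine and the ''total'' variable are purely illustrative) keeps the accumulator of a summation loop in a register and touches the memory variable only once.
<code asm>
; Sketch (assumed NASM syntax, x86-64): sum an array of qwords.
; The scalar accumulator stays in RAX for the whole loop; the memory
; variable is written only once, after the loop has finished.

section .bss
total:  resq 1                  ; scalar result, touched only at the end

section .text
global sum_array
; in: rsi = address of the array, rcx = number of elements (> 0)
sum_array:
        xor     rax, rax        ; accumulator lives in a register
.next:
        add     rax, [rsi]      ; add the current element
        add     rsi, 8          ; advance to the next qword
        dec     rcx
        jnz     .next
        mov     [rel total], rax  ; store the scalar to memory just once
        ret
</code>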
===== Pause instruction =====
It is a common method to pause the program execution and wait for an event for a short period in a spin loop. For a brief waiting period, this method is more efficient than calling an operating system function which waits for an event. In modern processors, the **pause** instruction should be used inside such a loop. It helps the internal mechanisms of the processor to temporarily allocate hardware resources to another logical processor.
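A minimal sketch of such a spin loop is shown below (assuming NASM syntax for x86-64; the ''flag'' variable and the ''wait_for_flag'' label are purely illustrative).
<code asm>
; Sketch (assumed NASM syntax, x86-64): spin-wait for a flag set by
; another thread. PAUSE marks the loop as a spin-wait, which lowers power
; consumption and lets the sibling logical processor use shared resources.

section .bss
flag:   resd 1                  ; illustrative flag written by another thread

section .text
global wait_for_flag
wait_for_flag:
.spin:
        pause                   ; hint: this is a spin-wait loop
        cmp     dword [rel flag], 0
        je      .spin           ; keep spinning until the flag is non-zero
        ret
</code>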
| + | |||
| + | |||
===== Cache utilisation =====
In modern microarchitectures, accesses to code and data normally pass through several levels of cache memory placed between the processor core and the main memory, so efficient use of the cache has a major impact on performance.
The cache works on two main principles:
  * temporal locality,
  * spatial locality.
The term temporal locality refers to the fact that if a program recently used a certain portion of data, it is likely to need it again soon. It means that if data is used, it remains in a cache for a certain amount of time until other data is loaded into the cache. It is efficient to keep data in a cache instead of reloading it.
The term spatial locality refers to the fact that if a program has recently accessed data at a particular address, it is likely to soon need data at neighbouring addresses. The cache helps the program run faster by automatically prefetching data and code that will likely be used or executed soon.
It is recommended to keep these rules in mind when writing programs in any programming language. Some recommendations are listed below:
  * The program should do as much work as possible on one small area of code and data; after doing the job, it can move to the next part.
  * The program should avoid frequent jumping over distant regions of memory.
  * While processing big multidimensional data arrays, keep in mind their placement in memory (row-wise or column-wise) and traverse them in the order in which they are stored, so that consecutive accesses touch consecutive addresses (see the sketch after this list).
  * Object-oriented programming helps to utilise the cache because members of a class are grouped together in memory.
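The sketch below illustrates the array-traversal recommendation (assuming NASM syntax for x86-64 and a row-wise, C-style layout; the ''matrix'' name and the array dimensions are illustrative). The loop walks consecutive addresses, so every loaded cache line is fully used before the next one is needed.
<code asm>
; Sketch (assumed NASM syntax, x86-64): sum a ROWS x COLS array of dwords
; stored row-wise. A column-wise walk over the same data would step
; COLS*4 bytes between accesses and touch a new cache line almost every
; time, wasting most of each line it loads.

ROWS    equ 1024
COLS    equ 1024

section .bss
matrix: resd ROWS * COLS

section .text
global sum_row_major
sum_row_major:
        lea     rsi, [rel matrix]
        mov     rcx, ROWS * COLS
        xor     eax, eax
.next:
        add     eax, [rsi]      ; consecutive dwords: good spatial locality
        add     rsi, 4
        dec     rcx
        jnz     .next
        ret
</code>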
===== Cache temporal locality =====
This feature helps improve performance in situations where the program uses the same variables repeatedly, e.g. in a loop.
In a situation where the data processed exceeds half the size of a level 1 cache, it is recommended to use the non-temporal data move instructions **movntq** and **movntdq** to store data from registers to memory. These instructions are hints to the processor to omit the cache if possible. It doesn't guarantee that the cache will be bypassed, but it helps to avoid evicting data that is still useful.
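A minimal sketch of such a non-temporal store loop is shown below (assuming NASM syntax for x86-64; the ''clear_buffer_nt'' routine, its register calling convention and the 16-byte alignment requirement are illustrative assumptions).
<code asm>
; Sketch (assumed NASM syntax, x86-64): fill a large buffer with zeros
; using non-temporal stores. MOVNTDQ writes a 16-byte XMM register to
; memory with a hint to bypass the cache, so useful cache lines are not
; evicted by data that will not be read back soon. SFENCE makes the
; weakly ordered stores globally visible before returning.

section .text
global clear_buffer_nt
; in: rdi = destination (assumed 16-byte aligned),
;     rcx = size in bytes (assumed non-zero multiple of 16)
clear_buffer_nt:
        pxor    xmm0, xmm0      ; 16 bytes of zeros
.next:
        movntdq [rdi], xmm0     ; non-temporal store, bypasses the cache
        add     rdi, 16
        sub     rcx, 16
        jnz     .next
        sfence                  ; order the non-temporal stores
        ret
</code>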
| + | |||
===== Cache support instructions =====
There are also instructions which allow the programmer to support the processor with cache utilisation:
  * **movntq** saves the contents of an MMX register to memory, bypassing the cache,
  * **movntps** writes the contents of an SSE register to memory, bypassing the cache,
  * **maskmovq** writes selected bytes from an MMX register to memory, bypassing the cache,
  * **movntdqa** performs a non-temporal aligned load from memory (a load hint instruction).
| + | |||
Fence instructions guarantee that all memory operations issued before the fence are completed before any operations issued after it:
  * **sfence** forces the memory–cache synchronisation after store instructions,
  * **lfence** forces the memory–cache synchronisation after load instructions,
  * **mfence** forces the memory–cache synchronisation after both load and store instructions.
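A minimal sketch of where **sfence** matters is shown below (assuming NASM syntax for x86-64; the ''payload'' and ''ready'' variables are purely illustrative). Because non-temporal stores are weakly ordered, the fence keeps the data globally visible before the flag that publishes it.
<code asm>
; Sketch (assumed NASM syntax, x86-64): a producer publishing a value
; written with a non-temporal store. Non-temporal stores are weakly
; ordered, so without the SFENCE the flag could become visible to another
; core before the payload it guards.

section .bss
payload: resq 1                 ; illustrative data produced for a consumer
ready:   resd 1                 ; illustrative flag polled by the consumer

section .text
global publish
; in: rax = value to publish
publish:
        movnti  [rel payload], rax   ; weakly ordered non-temporal store
        sfence                       ; payload becomes visible first
        mov     dword [rel ready], 1 ; only then may the consumer see the flag
        ret
</code>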
| + | |||
Some instructions are hints to the processor indicating that the programmer expects the data to be kept in the cache rather than only in memory, or that they do not expect to use the data again soon:
  * **prefetch** is a hint to the processor to load the indicated data into the cache before it is actually used,
  * **clflush** flushes a cache line from all levels of the cache.
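A minimal sketch of software prefetching is shown below (assuming NASM syntax for x86-64; the prefetch distance of 256 bytes is only an illustrative guess and should be tuned for the target microarchitecture).
<code asm>
; Sketch (assumed NASM syntax, x86-64): sum a large array of qwords while
; prefetching ahead. PREFETCHT0 asks the processor to start loading a cache
; line that will be needed a few iterations later; it is only a hint and
; never changes the result of the program.

section .text
global sum_with_prefetch
; in: rsi = address of the array, rcx = number of qword elements (> 0)
sum_with_prefetch:
        xor     rax, rax
.next:
        prefetcht0 [rsi + 256]  ; illustrative distance: 4 cache lines ahead
        add     rax, [rsi]
        add     rsi, 8
        dec     rcx
        jnz     .next
        ret
</code>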
===== Further reading =====
The essential readings on the optimisation topic are the vendors' optimisation manuals. An exceptional and interesting resource is the Understanding Windows x64 Assembly tutorial ((https://)).