Differences

This shows you the differences between two versions of the page.

--- en:multiasm:papc:chapter_6_11 [2026/02/18 18:10] – [Co-existence of FPU and MMX] ktokarz
+++ en:multiasm:papc:chapter_6_11 [2026/02/27 02:40] (current) – jtokarz
@@ Line 14: / Line 14: @@
 The main idea of vector data processing is shown in figure {{ref>mmxprocessing}}. It shows the example of an operation performed with packed word vector data.
 <figure mmxprocessing>
-{{ :en:multiasm:cs:mmxprocessing.png?600 |Illustration of the idea of vector data processing}}
+{{ :en:multiasm:cs:mmxprocessing.png?550 |Illustration of the idea of vector data processing}}
 <caption>The idea of vector data processing</caption>
 </figure>
@@ Line 25: / Line 25: @@
 <figure mmxpaddsw>
-{{ :en:multiasm:cs:mmxpaddsw.png?500 |Illustration of packed word addition with signed saturation}}
+{{ :en:multiasm:cs:mmxpaddsw.png?450 |Illustration of packed word addition with signed saturation}}
 <caption>The illustration of packed word addition with signed saturation</caption>
 </figure>
 <figure mmxpaddusw>
-{{ :en:multiasm:cs:mmxpaddusw.png?500 |Illustration of packed word addition with unsigned saturation}}
+{{ :en:multiasm:cs:mmxpaddusw.png?450 |Illustration of packed word addition with unsigned saturation}}
 <caption>The illustration of packed word addition with unsigned saturation</caption>
 </figure>
@@ Line 57: / Line 57: @@
 <figure mmxmultiplyandunpck>
-{{ :en:multiasm:cs:mmxmultiplyandunpck.png?650 |Illustration of packed word multiplication and unpacking results to doublewords}}
+{{ :en:multiasm:cs:mmxmultiplyandunpck.png?550 |Illustration of packed word multiplication and unpacking results to doublewords}}
 <caption>The illustration of packed word multiplication and unpacking results to doublewords</caption>
 </figure>
@@ Line 63: / Line 63: @@
 The code that calculates the presented multiplication can look as follows:
 <code asm>
-Numbers	DW  01ACh, 2112h, 03F3h, 00A4h,
+Numbers DW  01ACh, 2112h, 03F3h, 00A4h,
 0006h, 0137h, 0AB7h, 00D8h
-LEA	    ESI, Numbers
+LEA         ESI, Numbers
-MOVQ	    mm0, [ESI]	        ; mm0 = 00A4 03F3 2112 01AC
+MOVQ        mm0, [ESI]          ; mm0 = 00A4 03F3 2112 01AC
-MOVQ	    mm1, [ESI+8]	; mm1 = 00D8 0AB7 0137 0006
+MOVQ        mm1, [ESI+8]        ; mm1 = 00D8 0AB7 0137 0006
-MOVQ	    mm2, mm0
+MOVQ        mm2, mm0
-PMULLW	    mm0, mm1		; mm0 = 8A60 50B5 2CDE 0A08
+PMULLW      mm0, mm1            ; mm0 = 8A60 50B5 2CDE 0A08
-PMULHW	    mm1, mm2		; mm1 = 0000 002A 0028 0000
+PMULHW      mm1, mm2            ; mm1 = 0000 002A 0028 0000
-MOVQ	    mm2, mm0
+MOVQ        mm2, mm0
-PUNPCKLWD   mm0, mm1		; mm0 = 0028 2CDE 0000 0A08
+PUNPCKLWD   mm0, mm1            ; mm0 = 0028 2CDE 0000 0A08
-PUNPCKHWD   mm2, mm1		; mm2 = 0000 8A60 002A 50B5
+PUNPCKHWD   mm2, mm1            ; mm2 = 0000 8A60 002A 50B5
 </code>
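The register values given in the comments of the listing can be cross-checked with a scalar model. The C sketch below is only an illustration and is not part of the page under comparison. The function names (`pmullw_model`, `pmulhw_model`, `punpcklwd_model`, `punpckhwd_model`) are hypothetical; each one mimics a single MMX instruction on arrays of 16-bit words, with index 0 standing for the least significant word of the register.

```c
#include <stdint.h>

/* Hypothetical scalar models of four MMX instructions.
   Index 0 is the least significant word of the 64-bit register. */

/* PMULLW: keep the low 16 bits of each signed 16x16-bit product. */
void pmullw_model(const int16_t a[4], const int16_t b[4], uint16_t out[4]) {
    for (int i = 0; i < 4; i++) {
        int32_t p = (int32_t)a[i] * (int32_t)b[i];
        out[i] = (uint16_t)((uint32_t)p & 0xFFFFu);
    }
}

/* PMULHW: keep the high 16 bits of each signed 16x16-bit product. */
void pmulhw_model(const int16_t a[4], const int16_t b[4], uint16_t out[4]) {
    for (int i = 0; i < 4; i++) {
        int32_t p = (int32_t)a[i] * (int32_t)b[i];
        out[i] = (uint16_t)(((uint32_t)p >> 16) & 0xFFFFu);
    }
}

/* PUNPCKLWD: interleave the two low words of both operands,
   giving two doublewords (hi word from 'hi', lo word from 'lo'). */
void punpcklwd_model(const uint16_t lo[4], const uint16_t hi[4], uint32_t out[2]) {
    for (int i = 0; i < 2; i++)
        out[i] = ((uint32_t)hi[i] << 16) | lo[i];
}

/* PUNPCKHWD: the same interleaving for the two high words. */
void punpckhwd_model(const uint16_t lo[4], const uint16_t hi[4], uint32_t out[2]) {
    for (int i = 0; i < 2; i++)
        out[i] = ((uint32_t)hi[i + 2] << 16) | lo[i + 2];
}
```

Feeding the models with the words 01ACh, 2112h, 03F3h, 00A4h and 0006h, 0137h, 0AB7h, 00D8h (the operand values shown in the mm0 and mm1 comments) gives the doublewords 00000A08h, 00282CDEh, 002A50B5h and 00008A60h, which agrees with the final contents of mm0 and mm2 in the listing.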
@@ Line 80: / Line 80: @@
 <figure mmxpmaddw>
-{{ :en:multiasm:cs:mmxpmaddw.png?650 |Illustration of packed word multiplication and sum to doublewords}}
+{{ :en:multiasm:cs:mmxpmaddw.png?550 |Illustration of packed word multiplication and sum to doublewords}}
 <caption>The illustration of packed word multiplication and sum to doublewords</caption>
 </figure>
@@ Line 98: / Line 98: @@
 An example of comparison instruction for equality of two vectors of words is shown in figure {{ref>mmxcompare}}.
 <figure mmxcompare>
-{{ :en:multiasm:cs:mmxcompare.png?500 |Illustration of vector data comparison}}
+{{ :en:multiasm:cs:mmxcompare.png?450 |Illustration of vector data comparison}}
 <caption>Vector data comparison</caption>
 </figure>
@@ Line 106: / Line 106: @@
 <figure mmxpunpckhbw>
-{{ :en:multiasm:cs:mmxpunpkhbw.png?650 |Illustration of unpacking high-order bytes to words}}
+{{ :en:multiasm:cs:mmxpunpkhbw.png?550 |Illustration of unpacking high-order bytes to words}}
 <caption>The illustration of unpacking high-order bytes to words</caption>
 </figure>
 <figure mmxpunpcklbw>
-{{ :en:multiasm:cs:mmxpunpklbw.png?650 |Illustration of unpacking low-order bytes to words}}
+{{ :en:multiasm:cs:mmxpunpklbw.png?550 |Illustration of unpacking low-order bytes to words}}
 <caption>The illustration of unpacking low-order bytes to words</caption>
 </figure>
@@ Line 118: / Line 118: @@
 <figure mmxpack>
-{{ :en:multiasm:cs:mmxpacksswd.png?650 |Illustration of packing doublewords to words}}
+{{ :en:multiasm:cs:mmxpacksswd.png?550 |Illustration of packing doublewords to words}}
 <caption>The illustration of packing doublewords to words</caption>
 </figure>
@@ Line 157: / Line 157: @@
 ==== Data transfer ====
-In modern processors, it is very important to transfer data from and to memory effectively. The memory management unit can perform data transfer much faster if the data is aligned to a specific address. For SSE instructions, an address must be evenly divisible by 16. In the SSE extension, two versions of data transfer instructions were implemented. The **movups** copies packed single-precision data from any address, while the **movaps** moves data from an aligned address. The **mivss** moves the scalar single-precision value. It doesn't have to be aligned. It is also possible to copy data between the upper half of the XMM register and memory with the **movhps** instruction, between the lower half of the XMM register and memory with the **movlps**, and from the lower to higher half or from the higher to lower half of the XMM registers with the **movhlps** and **movlhps**, respectively. The **movmskps** instruction copies the most significant bits of single-precision floating-point values to a general-purpose register. It allows us to make a bit mask based on the sign bits of elements of the vector.
+In modern processors, it is very important to transfer data from and to memory effectively. The memory management unit can perform data transfer much faster if the data is aligned to a specific address. For SSE instructions, an address must be evenly divisible by 16. In the SSE extension, two versions of data transfer instructions were implemented. The **movups** copies packed single-precision data from any address, while the **movaps** moves data from an aligned address. The **movss** moves the scalar single-precision value. It doesn't have to be aligned. It is also possible to copy data between the upper half of the XMM register and memory with the **movhps** instruction, between the lower half of the XMM register and memory with the **movlps**, and from the lower to higher half or from the higher to lower half of the XMM registers with the **movhlps** and **movlhps**, respectively. The **movmskps** instruction copies the most significant bits of single-precision floating-point values to a general-purpose register. It allows us to make a bit mask based on the sign bits of elements of the vector.
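As an illustration of the last two points of the paragraph above, the following C sketch models the bit mask produced by **movmskps** and the alignment condition required by **movaps**. It is a scalar model written for this explanation, not code from the course; the function names are hypothetical, and it assumes that `float` is a 32-bit IEEE 754 value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of MOVMSKPS: bit i of the result is the sign bit
   (most significant bit) of element i of the vector. */
unsigned movmskps_model(const float v[4]) {
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits);   /* raw bit pattern of the float */
        mask |= (unsigned)(bits >> 31) << i;
    }
    return mask;
}

/* An address can be used with MOVAPS only if it is divisible by 16. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}
```

For the vector {1.0, -2.0, 3.0, -0.0} (element 0 first), the model returns 1010b, because only elements 1 and 3 have the sign bit set; note that negative zero counts as negative here.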
 ==== Calculations ====
@@ Line 164: / Line 164: @@
 The idea of vector and scalar operations is shown in figure {{ref>sse1vector}} and figure {{ref>sse1scalar}}, respectively.
 <figure sse1vector>
-{{ :en:multiasm:cs:sse1vector.png?600 |Illustration of the idea of SSE vector data processing}}
+{{ :en:multiasm:cs:sse1vector.png?550 |Illustration of the idea of SSE vector data processing}}
 <caption>The idea of vector data processing in SSE</caption>
 </figure>
 <figure sse1scalar>
-{{ :en:multiasm:cs:sse1scalar.png?600 |Illustration of the idea of SSE scalar data processing}}
+{{ :en:multiasm:cs:sse1scalar.png?550 |Illustration of the idea of SSE scalar data processing}}
 <caption>The idea of scalar data processing in SSE</caption>
 </figure>
@@ Line 221: / Line 221: @@
 <figure sseunpack>
-{{ :en:multiasm:cs:sseunpack.png?650 |Illustration of SSE unpacking single-precision floating-point values}}
+{{ :en:multiasm:cs:sseunpack.png?550 |Illustration of SSE unpacking single-precision floating-point values}}
 <caption>The illustration of SSE unpacking single-precision floating-point values</caption>
 </figure>
@@ Line 228: / Line 228: @@
 <figure sseshuffle>
-{{ :en:multiasm:cs:sseshuffle.png?650 |Illustration of SSE shuffle single-precision floating-point values}}
+{{ :en:multiasm:cs:sseshuffle.png?550 |Illustration of SSE shuffle single-precision floating-point values}}
 <caption>The illustration of SSE shuffle single-precision floating-point values</caption>
 </figure>
@@ Line 245: / Line 245: @@
 <figure sse2conversions>
-{{ :en:multiasm:cs:sse2conversions.png?650 |Illustration of a variety of data type conversion instructions}}
+{{ :en:multiasm:cs:sse2conversions.png?550 |Illustration of a variety of data type conversion instructions}}
 <caption>The illustration of a variety of data type conversion instructions</caption>
 </figure>
@@ Line 253: / Line 253: @@
 All horizontal instructions operate in a similar manner. The lower (bottom) part of the resulting vector is the result of operation on the bottom and top elements of the first (destination) operand; the higher (top) part of the resulting vector is the result of operation on the second (source) operand's bottom and top. The best way to present the principles of horizontal operations is a picture. Because in the subtraction operation the order of arguments is important, the **hsubpd** instruction is shown in figure {{ref>sse3hsubpd}}.
 <figure sse3hsubpd>
-{{ :en:multiasm:cs:sse3hsubpd.png?650 |Illustration of a horizontal subtraction instruction}}
+{{ :en:multiasm:cs:sse3hsubpd.png?550 |Illustration of a horizontal subtraction instruction}}
 <caption>The illustration of a horizontal subtraction instruction</caption>
 </figure>
@@ Line 259: / Line 259: @@
 When the source vectors have more than two elements, as in the **hsubps** instruction, it is also important to know the order of the elements in the resulting vector. Please look at figure {{ref>sse3hsubps}}.
 <figure sse3hsubps>
-{{ :en:multiasm:cs:sse3hsubps.png?650 |Illustration of a horizontal single precision subtraction instruction}}
+{{ :en:multiasm:cs:sse3hsubps.png?550 |Illustration of a horizontal single precision subtraction instruction}}
 <caption>The illustration of a horizontal single precision subtraction instruction</caption>
 </figure>
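The element order of **hsubps** can also be written down as a short scalar model. The C sketch below is an illustration under the usual convention that index 0 is the lowest element of the register; the function name `hsubps_model` is hypothetical.

```c
/* Hypothetical model of HSUBPS dst, src.
   Index 0 is the lowest element of the XMM register. */
void hsubps_model(float dst[4], const float src[4]) {
    float r[4];
    r[0] = dst[0] - dst[1];   /* lower half comes from the destination */
    r[1] = dst[2] - dst[3];
    r[2] = src[0] - src[1];   /* upper half comes from the source */
    r[3] = src[2] - src[3];
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}
```

In each pair, the element with the higher index is subtracted from the element with the lower index, so swapping the two neighbours changes the sign of the result.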
@@ Line 268: / Line 268: @@
 <table SSSE3horizontaltable>
 <caption>SSSE3 horizontal integer instructions</caption>
-^ Instruction ^ operation ^ data ^
+^ Instruction  ^ operation              ^ data                  ^
-| **phaddd** | addition | unsigned doublewords |
+| **phaddd**   | addition               | unsigned doublewords  |
-| **phaddw** | addition | unsigned words |
+| **phaddw**   | addition               | unsigned words        |
-| **phaddsw** | saturated addition | signed words |
+| **phaddsw**  | saturated addition     | signed words          |
-| **phsubd** | subtracion | unsigned doublewords |
+| **phsubd**   | subtraction            | unsigned doublewords  |
-| **phsubw** | subtracion | unsigned words |
+| **phsubw**   | subtraction            | unsigned words        |
-| **phsubsw** | saturated subtracion | signed words |
+| **phsubsw**  | saturated subtraction  | signed words          |
 </table>
@@ Line 283: / Line 283: @@
 The illustration is shown in figure {{ref>sse3pshufb}}.
 <figure sse3pshufb>
-{{ :en:multiasm:cs:sse3pshufb.png?650 |Illustration of a byte shuffle instruction}}
+{{ :en:multiasm:cs:sse3pshufb.png?600 |Illustration of a byte shuffle instruction}}
 <caption>The illustration of a byte shuffle instruction</caption>
 </figure>
@@ Line 290: / Line 290: @@
 <figure sse3palignr>
-{{ :en:multiasm:cs:sse3palignr.png?650 |Illustration of an aligned byte combine instruction}}
+{{ :en:multiasm:cs:sse3palignr.png?600 |Illustration of an aligned byte combine instruction}}
 <caption>The illustration of an aligned byte combine instruction</caption>
 </figure>
@@ Line 298: / Line 298: @@
 The **dpps** and **dppd** instructions calculate the dot product of four single-precision and two double-precision elements, respectively. The third, immediate operand selects which element products are added to the sum and which elements of the destination receive the result. The example showing the **dppd** is presented in figure {{ref>sse4dotproduct}}.
 <figure sse4dotproduct>
-{{ :en:multiasm:cs:sse4dotproduct.png?650 |Illustration of a dot product calculation instruction}}
+{{ :en:multiasm:cs:sse4dotproduct.png?600 |Illustration of a dot product calculation instruction}}
 <caption>The illustration of a dot product calculation instruction</caption>
 </figure>
-There are also advanced shuffle, insert and extract instructions which make it possible to manipulate positions of the data of various types. A few examples will be shown in the following figures. The **insertps** inserts a scalar single-precision floating-point value with the position of the vector's element in source and destination controlled with an 8-bit immediate. The example showing the **insertps** instruction is presented in figure {{ref>sse4insertps}}. In this example, the immediate contains the bit value of 10011000b.
+There are also advanced shuffle, insert and extract instructions which make it possible to manipulate positions of the data of various types. The type of the data is specified with the suffix of the mnemonic: b - bytes, w - words, d - doublewords, q - quadwords, ps - single precision and pd - double precision elements. Although these instructions behave the same for the integer and floating-point data elements, formally, those operating with integers begin with the letter "P". A few examples are shown in the following figures.
+The blending instructions copy elements of vectors, mixing two sources into the destination. The **blendps**, **blendpd** and **pblendw** conditionally copy elements from vector X or Y. The mask is specified as the third, immediate operand. The behaviour of **blendpd** is shown in figure {{ref>sse4blendpd}}.
+<figure sse4blendpd>
+{{ :en:multiasm:cs:sse4blendpd.png?400 |Illustration of an example of packed blending instruction}}
+<caption>The illustration of an example of packed blending instruction</caption>
+</figure>
+The instructions **blendvps**, **blendvpd** and **pblendvb** operate in a similar way, but the condition is taken from the sign bit of the corresponding element of the third, implicit operand stored in XMM0. The behaviour of **blendvpd** is shown in figure {{ref>sse4blendvpd}}.
+<figure sse4blendvpd>
+{{ :en:multiasm:cs:sse4blendvpd.png?400 |Illustration of an example of packed blending instruction}}
+<caption>The illustration of an example of packed blending instruction</caption>
+</figure>
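The difference between the two blending variants can be sketched with a scalar model. The C code below is an illustration only: the function names are hypothetical, `dst` and `src` stand for the first and second operand of the instruction, index 0 is the lowest element, and `double` is assumed to be a 64-bit IEEE 754 value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of BLENDPD dst, src, imm8:
   bit i of the immediate set   -> element i is taken from src,
   bit i of the immediate clear -> element i of dst is kept. */
void blendpd_model(double dst[2], const double src[2], uint8_t imm8) {
    for (int i = 0; i < 2; i++)
        if (imm8 & (1u << i))
            dst[i] = src[i];
}

/* Hypothetical model of BLENDVPD dst, src with the implicit XMM0 mask:
   the sign bit of mask element i decides whether src[i] is copied. */
void blendvpd_model(double dst[2], const double src[2], const double xmm0[2]) {
    for (int i = 0; i < 2; i++) {
        uint64_t bits;
        memcpy(&bits, &xmm0[i], sizeof bits);  /* raw bit pattern of the mask */
        if (bits >> 63)
            dst[i] = src[i];
    }
}
```

The immediate form fixes the selection when the program is assembled, while the variable form lets the selection depend on data computed at run time, for example on the result of a packed comparison.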
+The set of extract instructions includes **pextrb**, **pextrw**, **pextrd**, **pextrq** and **extractps**. They take one element of the vector from the XMM register and store it in a general-purpose register or in memory. The index of the element is specified with an immediate constant. The behaviour of **extractps** is shown in figure {{ref>sse4extractps}}.
+<figure sse4extractps>
+{{ :en:multiasm:cs:sse4extractps.png?400 |Illustration of an example of extract instruction}}
+<caption>The illustration of an example of extract instruction</caption>
+</figure>
+The insert instructions are **pinsrb**, **pinsrd** and **pinsrq**. They operate in the opposite way to the extract instructions: they take an element from memory or a general-purpose register and insert it into the XMM register at the position specified with an immediate constant. The behaviour of **pinsrd** is shown in figure {{ref>sse4pinsrd}}.
+<figure sse4pinsrd>
+{{ :en:multiasm:cs:sse4pinsrd.png?400 |Illustration of an example of an insert instruction}}
+<caption>The illustration of an example of an insert instruction</caption>
+</figure>
+The **insertps** is one of the most complex instructions. It inserts a scalar single-precision floating-point value, with the positions of the element in the source and destination vectors controlled by an 8-bit immediate. The example showing the **insertps** instruction is presented in figure {{ref>sse4insertps}}. In this example, the immediate contains the bit value of 10011000b.
 <figure sse4insertps>
 {{ :en:multiasm:cs:sse4insertps.png?500 |Illustration of an example of an advanced shuffle instruction}}
 <caption>The illustration of an example of an advanced shuffle instruction</caption>
 </figure>
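The 8-bit immediate of **insertps** is easier to read when it is split into its fields. The C sketch below models the register-to-register form of the instruction; it is an illustration with a hypothetical function name, and index 0 denotes the lowest element. Bits 7:6 of the immediate select the source element, bits 5:4 select the destination position, and bits 3:0 form a mask of destination elements that are cleared to zero after the insertion.

```c
#include <stdint.h>

/* Hypothetical model of INSERTPS dst, src, imm8 (register source). */
void insertps_model(float dst[4], const float src[4], uint8_t imm8) {
    unsigned src_index = (imm8 >> 6) & 3u;   /* bits 7:6, source element      */
    unsigned dst_index = (imm8 >> 4) & 3u;   /* bits 5:4, destination element */
    unsigned zero_mask = imm8 & 0x0Fu;       /* bits 3:0, elements to clear   */
    float value = src[src_index];            /* read first, dst may alias src */

    dst[dst_index] = value;
    for (int i = 0; i < 4; i++)
        if (zero_mask & (1u << i))
            dst[i] = 0.0f;
}
```

For the immediate 10011000b used in the example, the fields are 10b, 01b and 1000b, so the model copies source element 2 into destination element 1 and clears destination element 3. When the source is a memory operand, only a single 32-bit value is read and bits 7:6 are not used.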
-In SSE4.2, the set of string compare instructions was added. As the XMM registers can contain sixteen bytes, it is much more efficient to implement string processing algorithms with bigger XMM registers than with registers in the main processor with the use of strong instructions. There are four string compare instructions (see table {{ref>sse4stringtable}}), but each of them can be configured to achieve different functionalities. The length of strings can be explicit or implicit. Explicit length means that the length of the first operand is specified with the RAX register, and the length of the second operand is specified with the RDX register. Implicit length means that both operands contain null-terminated strings. Instructions can produce two kinds of results. Index means that the index of the first or last result is returned. Mask means that the bit mask is returned (one bit for each two elements compared) or a mask of the size of the elements (similarly to MMX compare).
+In SSE4.2, the set of string compare instructions was added. As the XMM registers can contain sixteen bytes, it is much more efficient to implement string processing algorithms with bigger XMM registers than with registers in the main processor with the use of string instructions. There are four string compare instructions (see table {{ref>sse4stringtable}}), but each of them can be configured to achieve different functionalities. The length of strings can be explicit or implicit. Explicit length means that the length of the first operand is specified with the RAX register, and the length of the second operand is specified with the RDX register. Implicit length means that both operands contain null-terminated strings. Instructions can produce two kinds of results. Index means that the index of the first or last result is returned. Mask means that the bit mask is returned (one bit for each two elements compared) or a mask of the size of the elements (similarly to MMX compare).
 <table sse4stringtable>
 <caption>SSE4.2 string compare instructions</caption>

en/multiasm/papc/chapter_6_11.1771431021.txt.gz · Last modified: 2026/02/18 18:10 by ktokarz