| en:multiasm:papc:chapter_6_11 [2026/02/18 18:10] – [Co-existence of FPU and MMX] ktokarz | en:multiasm:papc:chapter_6_11 [2026/02/27 02:40] (current) – jtokarz |
|---|
| The main idea of vector data processing is shown in figure {{ref>mmxprocessing}}. It shows the example of an operation performed with packed word vector data. | The main idea of vector data processing is shown in figure {{ref>mmxprocessing}}. It shows the example of an operation performed with packed word vector data. |
| <figure mmxprocessing> | <figure mmxprocessing> |
| {{ :en:multiasm:cs:mmxprocessing.png?600 |Illustration of the idea of vector data processing}} | {{ :en:multiasm:cs:mmxprocessing.png?550 |Illustration of the idea of vector data processing}} |
| <caption>The idea of vector data processing</caption> | <caption>The idea of vector data processing</caption> |
| </figure> | </figure> |
| |
| <figure mmxpaddsw> | <figure mmxpaddsw> |
| {{ :en:multiasm:cs:mmxpaddsw.png?500 |Illustration of packed word addition with signed saturation}} | {{ :en:multiasm:cs:mmxpaddsw.png?450 |Illustration of packed word addition with signed saturation}} |
| <caption>The illustration of packed word addition with signed saturation</caption> | <caption>The illustration of packed word addition with signed saturation</caption> |
| </figure> | </figure> |
| |
| <figure mmxpaddusw> | <figure mmxpaddusw> |
| {{ :en:multiasm:cs:mmxpaddusw.png?500 |Illustration of packed word addition with unsigned saturation}} | {{ :en:multiasm:cs:mmxpaddusw.png?450 |Illustration of packed word addition with unsigned saturation}} |
| <caption>The illustration of packed word addition with unsigned saturation</caption> | <caption>The illustration of packed word addition with unsigned saturation</caption> |
| </figure> | </figure> |
| |
| <figure mmxmultiplyandunpck> | <figure mmxmultiplyandunpck> |
| {{ :en:multiasm:cs:mmxmultiplyandunpck.png?650 |Illustration of packed word multiplication and unpacking results to doublewords}} | {{ :en:multiasm:cs:mmxmultiplyandunpck.png?550 |Illustration of packed word multiplication and unpacking results to doublewords}} |
| <caption>The illustration of packed word multiplication and unpacking results to doublewords</caption> | <caption>The illustration of packed word multiplication and unpacking results to doublewords</caption> |
| </figure> | </figure> |
| The code that calculates the presented multiplication can look as follows: | The code that calculates the presented multiplication can look as follows: |
| <code asm> | <code asm> |
| Numbers DW 01ACh, 2112h, 03F3h, 00A4h, | Numbers DW 01ACh, 2112h, 03F3h, 00A4h, |
| 0006h, 0137h, 0AB7h, 00D8h | 0006h, 0137h, 0AB7h, 00D8h |
| LEA ESI, Numbers | LEA ESI, Numbers |
| MOVQ mm0, [ESI] ; mm0 = 00A4 03F3 2112 01AC | MOVQ mm0, [ESI] ; mm0 = 00A4 03F3 2112 01AC |
| MOVQ mm1, [ESI+8] ; mm1 = 00D8 0AB7 0137 0006 | MOVQ mm1, [ESI+8] ; mm1 = 00D8 0AB7 0137 0006 |
| MOVQ mm2, mm0 | MOVQ mm2, mm0 |
| PMULLW mm0, mm1 ; mm0 = 8A60 50B5 2CDE 0A08 | PMULLW mm0, mm1 ; mm0 = 8A60 50B5 2CDE 0A08 |
| PMULHW mm1, mm2 ; mm1 = 0000 002A 0028 0000 | PMULHW mm1, mm2 ; mm1 = 0000 002A 0028 0000 |
| MOVQ mm2, mm0 | MOVQ mm2, mm0 |
| PUNPCKLWD mm0, mm1 ; mm0 = 0028 2CDE 0000 0A08 | PUNPCKLWD mm0, mm1 ; mm0 = 0028 2CDE 0000 0A08 |
| PUNPCKHWD mm2, mm1 ; mm2 = 0000 8A60 002A 50B5 | PUNPCKHWD mm2, mm1 ; mm2 = 0000 8A60 002A 50B5 |
| </code> | </code> |
| |
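The register comments in the listing above can be cross-checked with a small scalar model. The sketch below is not part of the original text: the function name `mul_words_to_dwords` and the array layout (index 0 is the lowest word) are assumptions made for illustration. It shows that combining the low half kept by PMULLW with the high half kept by PMULHW rebuilds the full 32-bit product.

```c
#include <stdint.h>

/* Scalar model (illustrative only): multiply four pairs of signed words
   and rebuild the 32-bit products the way the sequence
   PMULLW + PMULHW + PUNPCKLWD/PUNPCKHWD does. */
static void mul_words_to_dwords(const int16_t a[4], const int16_t b[4],
                                uint32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        int32_t  p  = (int32_t)a[i] * (int32_t)b[i];      /* full signed product   */
        uint16_t lo = (uint16_t)((uint32_t)p & 0xFFFFu);  /* half kept by PMULLW   */
        uint16_t hi = (uint16_t)((uint32_t)p >> 16);      /* half kept by PMULHW   */
        out[i] = ((uint32_t)hi << 16) | lo;               /* interleave as PUNPCK* */
    }
}
```

With the eight words from `Numbers` as input, the four results are 00000A08h, 00282CDEh, 002A50B5h and 00008A60h, which are the doublewords found in mm0 and mm2 at the end of the listing.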
| |
| <figure mmxpmaddw> | <figure mmxpmaddw> |
| {{ :en:multiasm:cs:mmxpmaddw.png?650 |Illustration of packed word multiplication and sum to doublewords}} | {{ :en:multiasm:cs:mmxpmaddw.png?550 |Illustration of packed word multiplication and sum to doublewords}} |
| <caption>The illustration of packed word multiplication and sum to doublewords</caption> | <caption>The illustration of packed word multiplication and sum to doublewords</caption> |
| </figure> | </figure> |
| An example of comparison instruction for equality of two vectors of words is shown in figure {{ref>mmxcompare}}. | An example of comparison instruction for equality of two vectors of words is shown in figure {{ref>mmxcompare}}. |
| <figure mmxcompare> | <figure mmxcompare> |
| {{ :en:multiasm:cs:mmxcompare.png?500 |Illustration of vector data comparison}} | {{ :en:multiasm:cs:mmxcompare.png?450 |Illustration of vector data comparison}} |
| <caption>Vector data comparison</caption> | <caption>Vector data comparison</caption> |
| </figure> | </figure> |
| |
| <figure mmxpunpckhbw> | <figure mmxpunpckhbw> |
| {{ :en:multiasm:cs:mmxpunpkhbw.png?650 |Illustration of unpacking high-order bytes to words}} | {{ :en:multiasm:cs:mmxpunpkhbw.png?550 |Illustration of unpacking high-order bytes to words}} |
| <caption>The illustration of unpacking high-order bytes to words</caption> | <caption>The illustration of unpacking high-order bytes to words</caption> |
| </figure> | </figure> |
| |
| <figure mmxpunpcklbw> | <figure mmxpunpcklbw> |
| {{ :en:multiasm:cs:mmxpunpklbw.png?650 |Illustration of unpacking low-order bytes to words}} | {{ :en:multiasm:cs:mmxpunpklbw.png?550 |Illustration of unpacking low-order bytes to words}} |
| <caption>The illustration of unpacking low-order bytes to words</caption> | <caption>The illustration of unpacking low-order bytes to words</caption> |
| </figure> | </figure> |
| |
| <figure mmxpack> | <figure mmxpack> |
| {{ :en:multiasm:cs:mmxpacksswd.png?650 |Illustration of packing doublewords to words}} | {{ :en:multiasm:cs:mmxpacksswd.png?550 |Illustration of packing doublewords to words}} |
| <caption>The illustration of packing doublewords to words</caption> | <caption>The illustration of packing doublewords to words</caption> |
| </figure> | </figure> |
| ==== Data transfer ==== | ==== Data transfer ==== |
| |
| In modern processors, it is very important to transfer data from and to memory effectively. The memory management unit can perform data transfer much faster if the data is aligned to a specific address. For SSE instructions, an address must be evenly divisible by 16. In the SSE extension, two versions of data transfer instructions were implemented. The **movups** copies packed single-precision data from any address, while the **movaps** moves data from an aligned address. The **mivss** moves the scalar single-precision value. It doesn't have to be aligned. It is also possible to copy data between the upper half of the XMM register and memory with the **movhps** instruction, between the lower half of the XMM register and memory with the **movlps**, and from the lower to higher half or from the higher to lower half of the XMM registers with the **movhlps** and **movlhps**, respectively. The **movmskps** instruction copies the most significant bits of single-precision floating-point values to a general-purpose register. It allows us to make a bit mask based on the sign bits of elements of the vector. | In modern processors, it is very important to transfer data from and to memory effectively. The memory management unit can perform data transfer much faster if the data is aligned to a specific address. For SSE instructions, an address must be evenly divisible by 16. In the SSE extension, two versions of data transfer instructions were implemented. The **movups** copies packed single-precision data from any address, while the **movaps** moves data from an aligned address. The **movss** moves the scalar single-precision value. It doesn't have to be aligned. 
It is also possible to copy data between the upper half of the XMM register and memory with the **movhps** instruction, between the lower half of the XMM register and memory with the **movlps**, and from the lower to higher half or from the higher to lower half of the XMM registers with the **movhlps** and **movlhps**, respectively. The **movmskps** instruction copies the most significant bits of single-precision floating-point values to a general-purpose register. It allows us to make a bit mask based on the sign bits of elements of the vector. |
| |
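The sign-mask behaviour of **movmskps** described above can be sketched in plain C. This is a scalar model, not the instruction itself: the name `movmskps_model` is hypothetical, index 0 stands for the lowest element of the XMM register, and `float` is assumed to be a 32-bit IEEE 754 value.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of MOVMSKPS (illustrative only): collect the sign bits of
   four single-precision elements into bits 0..3 of an integer. */
static unsigned movmskps_model(const float v[4])
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits);  /* reinterpret the float as raw bits */
        mask |= (unsigned)(bits >> 31) << i; /* sign bit of element i -> bit i    */
    }
    return mask;
}
```

For the vector {1.0, -2.0, 3.0, -4.0} the model returns 1010b, so a single test of the general-purpose register tells which elements are negative.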
| ==== Calculations ==== | ==== Calculations ==== |
| The idea of vector and scalar operations is shown in figure {{ref>sse1vector}} and figure {{ref>sse1scalar}}, respectively. | The idea of vector and scalar operations is shown in figure {{ref>sse1vector}} and figure {{ref>sse1scalar}}, respectively. |
| <figure sse1vector> | <figure sse1vector> |
| {{ :en:multiasm:cs:sse1vector.png?600 |Illustration of the idea of SSE vector data processing}} | {{ :en:multiasm:cs:sse1vector.png?550 |Illustration of the idea of SSE vector data processing}} |
| <caption>The idea of vector data processing in SSE</caption> | <caption>The idea of vector data processing in SSE</caption> |
| </figure> | </figure> |
| <figure sse1scalar> | <figure sse1scalar> |
| {{ :en:multiasm:cs:sse1scalar.png?600 |Illustration of the idea of SSE scalar data processing}} | {{ :en:multiasm:cs:sse1scalar.png?550 |Illustration of the idea of SSE scalar data processing}} |
| <caption>The idea of scalar data processing in SSE</caption> | <caption>The idea of scalar data processing in SSE</caption> |
| </figure> | </figure> |
| |
| <figure sseunpack> | <figure sseunpack> |
| {{ :en:multiasm:cs:sseunpack.png?650 |Illustration of SSE unpacking single-precision floating-point values}} | {{ :en:multiasm:cs:sseunpack.png?550 |Illustration of SSE unpacking single-precision floating-point values}} |
| <caption>The illustration of SSE unpacking single-precision floating-point values</caption> | <caption>The illustration of SSE unpacking single-precision floating-point values</caption> |
| </figure> | </figure> |
| |
| <figure sseshuffle> | <figure sseshuffle> |
| {{ :en:multiasm:cs:sseshuffle.png?650 |Illustration of SSE shuffle single-precision floating-point values}} | {{ :en:multiasm:cs:sseshuffle.png?550 |Illustration of SSE shuffle single-precision floating-point values}} |
| <caption>The illustration of SSE shuffle single-precision floating-point values</caption> | <caption>The illustration of SSE shuffle single-precision floating-point values</caption> |
| </figure> | </figure> |
| |
| <figure sse2conversions> | <figure sse2conversions> |
| {{ :en:multiasm:cs:sse2conversions.png?650 |Illustration of a variety of data type conversion instructions}} | {{ :en:multiasm:cs:sse2conversions.png?550 |Illustration of a variety of data type conversion instructions}} |
| <caption>The illustration of a variety of data type conversion instructions</caption> | <caption>The illustration of a variety of data type conversion instructions</caption> |
| </figure> | </figure> |
| All horizontal instructions operate in a similar manner. The lower (bottom) part of the resulting vector is the result of operation on the bottom and top elements of the first (destination) operand; the higher (top) part of the resulting vector is the result of operation on the second (source) operand's bottom and top. The best way to present the principles of horizontal operations is a picture. Because in the subtraction operation the order of arguments is important, the **hsubpd** instruction is shown in figure {{ref>sse3hsubpd}}. | All horizontal instructions operate in a similar manner. The lower (bottom) part of the resulting vector is the result of operation on the bottom and top elements of the first (destination) operand; the higher (top) part of the resulting vector is the result of operation on the second (source) operand's bottom and top. The best way to present the principles of horizontal operations is a picture. Because in the subtraction operation the order of arguments is important, the **hsubpd** instruction is shown in figure {{ref>sse3hsubpd}}. |
| <figure sse3hsubpd> | <figure sse3hsubpd> |
| {{ :en:multiasm:cs:sse3hsubpd.png?650 |Illustration of a horizontal subtraction instruction}} | {{ :en:multiasm:cs:sse3hsubpd.png?550 |Illustration of a horizontal subtraction instruction}} |
| <caption>The illustration of a horizontal subtraction instruction</caption> | <caption>The illustration of a horizontal subtraction instruction</caption> |
| </figure> | </figure> |
| When the source vectors have more than two elements, as in the **hsubps** instruction, it is also important to know the order of the elements in the resulting vector, as shown in figure {{ref>sse3hsubps}}. | When the source vectors have more than two elements, as in the **hsubps** instruction, it is also important to know the order of the elements in the resulting vector, as shown in figure {{ref>sse3hsubps}}. |
| <figure sse3hsubps> | <figure sse3hsubps> |
| {{ :en:multiasm:cs:sse3hsubps.png?650 |Illustration of a horizontal single precision subtraction instruction}} | {{ :en:multiasm:cs:sse3hsubps.png?550 |Illustration of a horizontal single precision subtraction instruction}} |
| <caption>The illustration of a horizontal single precision subtraction instruction</caption> | <caption>The illustration of a horizontal single precision subtraction instruction</caption> |
| </figure> | </figure> |
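The element order of the horizontal subtraction described above can also be written down as a scalar model. The sketch below is only an illustration: the function names are hypothetical, arrays stand in for XMM registers, and index 0 is the lowest element.

```c
/* Scalar models of HSUBPD and HSUBPS (illustrative only).
   dst plays the role of the first (destination) operand,
   src the role of the second (source) operand. */
static void hsubpd_model(double dst[2], const double src[2])
{
    double lo = dst[0] - dst[1];   /* lower result: from the destination */
    double hi = src[0] - src[1];   /* upper result: from the source      */
    dst[0] = lo;
    dst[1] = hi;
}

static void hsubps_model(float dst[4], const float src[4])
{
    float r0 = dst[0] - dst[1];    /* lower half comes from the destination */
    float r1 = dst[2] - dst[3];
    float r2 = src[0] - src[1];    /* upper half comes from the source      */
    float r3 = src[2] - src[3];
    dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
}
```

In both models the lower-indexed element of each pair is the minuend, which is why the order of the arguments matters for subtraction but not for the corresponding horizontal additions.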
| <table SSSE3horizontaltable> | <table SSSE3horizontaltable> |
| <caption>SSSE3 horizontal integer instructions</caption> | <caption>SSSE3 horizontal integer instructions</caption> |
| ^ Instruction ^ operation ^ data ^ | ^ Instruction ^ operation ^ data ^ |
| | **phaddd** | addition | unsigned doublewords | | | **phaddd** | addition | unsigned doublewords | |
| | **phaddw** | addition | unsigned words | | | **phaddw** | addition | unsigned words | |
| | **phaddsw** | saturated addition | signed words | | | **phaddsw** | saturated addition | signed words | |
| | **phsubd** | subtracion | unsigned doublewords | | | **phsubd** | subtraction | unsigned doublewords | |
| | **phsubw** | subtracion | unsigned words | | | **phsubw** | subtraction | unsigned words | |
| | **phsubsw** | saturated subtracion | signed words | | | **phsubsw** | saturated subtraction | signed words | |
| </table> | </table> |
| |
| The illustration is shown in figure {{ref>sse3pshufb}}. | The illustration is shown in figure {{ref>sse3pshufb}}. |
| <figure sse3pshufb> | <figure sse3pshufb> |
| {{ :en:multiasm:cs:sse3pshufb.png?650 |Illustration of a byte shuffle instruction}} | {{ :en:multiasm:cs:sse3pshufb.png?600 |Illustration of a byte shuffle instruction}} |
| <caption>The illustration of a byte shuffle instruction</caption> | <caption>The illustration of a byte shuffle instruction</caption> |
| </figure> | </figure> |
| |
| <figure sse3palignr> | <figure sse3palignr> |
| {{ :en:multiasm:cs:sse3palignr.png?650 |Illustration of an aligned byte combine instruction}} | {{ :en:multiasm:cs:sse3palignr.png?600 |Illustration of an aligned byte combine instruction}} |
| <caption>The illustration of an aligned byte combine instruction</caption> | <caption>The illustration of an aligned byte combine instruction</caption> |
| </figure> | </figure> |
| The **dpps** and **dppd** instructions calculate the dot product of four single-precision and two double-precision operands, respectively. Additionally, the arguments are controlled with the third immediate operand. The example showing the **dppd** is presented in figure {{ref>sse4dotproduct}}. | The **dpps** and **dppd** instructions calculate the dot product of four single-precision and two double-precision operands, respectively. Additionally, the arguments are controlled with the third immediate operand. The example showing the **dppd** is presented in figure {{ref>sse4dotproduct}}. |
| <figure sse4dotproduct> | <figure sse4dotproduct> |
| {{ :en:multiasm:cs:sse4dotproduct.png?650 |Illustration of a dot product calculation instruction}} | {{ :en:multiasm:cs:sse4dotproduct.png?600 |Illustration of a dot product calculation instruction}} |
| <caption>The illustration of a dot product calculation instruction</caption> | <caption>The illustration of a dot product calculation instruction</caption> |
| </figure> | </figure> |
| |
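How the immediate operand controls the **dppd** instruction can be sketched with a scalar model. The code below is an illustration, not the instruction itself; it assumes the immediate layout given in the Intel instruction reference, where bits 4 and 5 select which products enter the sum and bits 0 and 1 select which destination elements receive it. The name `dppd_model` is hypothetical and index 0 is the lower element.

```c
/* Scalar model of DPPD (illustrative only). */
static void dppd_model(double dst[2], const double src[2], unsigned imm8)
{
    double sum = 0.0;
    if (imm8 & 0x10u) sum += dst[0] * src[0];  /* bit 4: include the low product  */
    if (imm8 & 0x20u) sum += dst[1] * src[1];  /* bit 5: include the high product */
    dst[0] = (imm8 & 0x01u) ? sum : 0.0;       /* bit 0: store sum in low element  */
    dst[1] = (imm8 & 0x02u) ? sum : 0.0;       /* bit 1: store sum in high element */
}
```

With the immediate 00110001b, both products are summed and the result is stored only in the lower element, while the upper element is cleared. The **dpps** instruction follows the same scheme with four mask bits in each half of the immediate.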
| There are also advanced shuffle, insert and extract instructions which make it possible to manipulate positions of the data of various types. A few examples will be shown in the following figures. The **insertps** inserts a scalar single-precision floating-point value with the position of the vector's element in source and destination controlled with an 8-bit immediate. The example showing the **insertps** instruction is presented in figure {{ref>sse4insertps}}. In this example, the immediate contains the bit value of 10011000b. | There are also advanced shuffle, insert and extract instructions which make it possible to manipulate positions of the data of various types. The type of the data is specified with the suffix of the mnemonic: b - bytes, w - words, d - doublewords, q - quadwords, ps - single precision and pd - double precision elements. Although these instructions behave the same for the integer and floating-point data elements, formally, those operating with integers begin with the letter "P". A few examples are shown in the following figures. |
| | |
| | The blending instructions copy elements of vectors, mixing two sources into the destination. The **blendps**, **blendpd** and **pblendw** conditionally copy elements from vector X or Y. The mask is specified as the third, immediate value. The behaviour of **blendpd** is shown in fig. {{ref>sse4blendpd}} |
| | <figure sse4blendpd> |
| | {{ :en:multiasm:cs:sse4blendpd.png?400 |Illustration of an example of packed blending instruction}} |
| | <caption>The illustration of an example of packed blending instruction</caption> |
| | </figure> |
| | |
| | The instructions **blendvps**, **blendvpd** and **pblendvb** operate in a similar way, but the condition is specified as the sign bit of the corresponding elements of the third implied argument stored in XMM0. The behaviour of **blendvpd** is shown in fig. {{ref>sse4blendvpd}} |
| | <figure sse4blendvpd> |
| | {{ :en:multiasm:cs:sse4blendvpd.png?400 |Illustration of an example of packed blending instruction}} |
| | <caption>The illustration of an example of packed blending instruction</caption> |
| | </figure> |
| | |
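The two blending variants described above differ only in where the selection mask comes from, which a scalar model makes visible. The sketch below is an illustration under stated assumptions: the function names are hypothetical, `dst` stands for the first operand, `src` for the second, `xmm0` for the implied third operand, index 0 is the lower element, and `double` is assumed to be a 64-bit IEEE 754 value.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of BLENDPD (illustrative only): bit i of the immediate
   selects the source element, otherwise the destination is kept. */
static void blendpd_model(double dst[2], const double src[2], unsigned imm8)
{
    for (int i = 0; i < 2; i++)
        if (imm8 & (1u << i))
            dst[i] = src[i];
}

/* Scalar model of BLENDVPD (illustrative only): the sign bit of element i
   of the implied XMM0 operand plays the role of the mask bit. */
static void blendvpd_model(double dst[2], const double src[2],
                           const double xmm0[2])
{
    for (int i = 0; i < 2; i++) {
        uint64_t bits;
        memcpy(&bits, &xmm0[i], sizeof bits);  /* raw bits of the mask element */
        if (bits >> 63)                        /* sign bit set -> take source  */
            dst[i] = src[i];
    }
}
```

Because the variable blend looks only at the sign bit, the all-ones or all-zeros element masks produced by compare instructions can be used directly as the third operand.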
| | The set of extract instructions includes **pextrb**, **pextrw**, **pextrd**, **pextrq** and **extractps**. They take one element of the vector from the XMM register and store it in a CPU register or in memory. The offset of the element is specified with an immediate constant. The behaviour of **extractps** is shown in fig. {{ref>sse4extractps}} |
| | <figure sse4extractps> |
| | {{ :en:multiasm:cs:sse4extractps.png?400 |Illustration of an example of extract instruction}} |
| | <caption>The illustration of an example of extract instruction</caption> |
| | </figure> |
| | |
| The insert instructions are **pinsrb**, **pinsrd** and **pinsrq**. They work in the opposite direction to the extract instructions: they take an element from memory or a general-purpose register and insert it into the XMM register at the position specified by an immediate constant. The behaviour of **pinsrd** is shown in fig. {{ref>sse4pinsrd}}. |
| | <figure sse4pinsrd> |
| | {{ :en:multiasm:cs:sse4pinsrd.png?400 |Illustration of an example of an insert instruction}} |
| | <caption>The illustration of an example of an insert instruction</caption> |
| | </figure> |
| | |
| | |
| The **insertps** instruction is one of the most complex. It inserts a scalar single-precision floating-point value, with the positions of the element in the source and destination vectors controlled by an 8-bit immediate. The example showing the **insertps** instruction is presented in figure {{ref>sse4insertps}}. In this example, the immediate has the binary value 10011000b. |
| <figure sse4insertps> | <figure sse4insertps> |
| {{ :en:multiasm:cs:sse4insertps.png?500 |Illustration of an example of an advanced shuffle instruction}} | {{ :en:multiasm:cs:sse4insertps.png?500 |Illustration of an example of an advanced shuffle instruction}} |
| <caption>The illustration of an example of an advanced shuffle instruction</caption> | <caption>The illustration of an example of an advanced shuffle instruction</caption> |
| </figure> | </figure> |
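The way the 8-bit immediate of **insertps** is decoded can be sketched with a scalar model. The code below is an illustration, not the instruction itself; it assumes the register-source form and the field layout given in the Intel instruction reference, where bits 7:6 select the source element, bits 5:4 select the destination position, and bits 3:0 form a zero mask. The name `insertps_model` is hypothetical and index 0 is the lowest element.

```c
/* Scalar model of INSERTPS with a register source (illustrative only). */
static void insertps_model(float dst[4], const float src[4], unsigned imm8)
{
    unsigned src_idx = (imm8 >> 6) & 3u;  /* bits 7:6: element taken from src */
    unsigned dst_idx = (imm8 >> 4) & 3u;  /* bits 5:4: position written in dst */
    unsigned zmask   = imm8 & 0xFu;       /* bits 3:0: elements forced to 0.0  */

    dst[dst_idx] = src[src_idx];
    for (unsigned i = 0; i < 4; i++)
        if (zmask & (1u << i))
            dst[i] = 0.0f;                /* zeroing is applied after the insert */
}
```

For the immediate 10011000b used in the example, the model copies source element 2 into destination element 1 and clears destination element 3; elements 0 and 2 of the destination are left unchanged.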
| In SSE4.2, the set of string compare instructions was added. As the XMM registers can contain sixteen bytes, it is much more efficient to implement string processing algorithms with bigger XMM registers than with registers in the main processor with the use of strong instructions. There are four string compare instructions (see table {{ref>sse4stringtable}}), but each of them can be configured to achieve different functionalities. The length of strings can be explicit or implicit. Explicit length means that the length of the first operand is specified with the RAX register, and the length of the second operand is specified with the RDX register. Implicit length means that both operands contain null-terminated strings. Instructions can produce two kinds of results. Index means that the index of the first or last result is returned. Mask means that the bit mask is returned (one bit for each two elements compared) or a mask of the size of the elements (similarly to MMX compare). | In SSE4.2, the set of string compare instructions was added. As the XMM registers can contain sixteen bytes, it is much more efficient to implement string processing algorithms with bigger XMM registers than with registers in the main processor with the use of string instructions. There are four string compare instructions (see table {{ref>sse4stringtable}}), but each of them can be configured to achieve different functionalities. The length of strings can be explicit or implicit. Explicit length means that the length of the first operand is specified with the RAX register, and the length of the second operand is specified with the RDX register. Implicit length means that both operands contain null-terminated strings. Instructions can produce two kinds of results. Index means that the index of the first or last result is returned. Mask means that the bit mask is returned (one bit for each two elements compared) or a mask of the size of the elements (similarly to MMX compare). |
| <table sse4stringtable> | <table sse4stringtable> |
| <caption>SSE4.2 string compare instructions</caption> | <caption>SSE4.2 string compare instructions</caption> |