Differences

This shows you the differences between two versions of the page.

en:multiasm:papc:chapter_6_11 [2026/02/19 10:20] – [SSE4] ktokarzen:multiasm:papc:chapter_6_11 [2026/02/27 02:40] (current) jtokarz
Line 14: Line 14:
 The main idea of vector data processing is shown in figure {{ref>mmxprocessing}}. It shows the example of an operation performed with packed word vector data. The main idea of vector data processing is shown in figure {{ref>mmxprocessing}}. It shows the example of an operation performed with packed word vector data.
 <figure mmxprocessing> <figure mmxprocessing>
-{{ :en:multiasm:cs:mmxprocessing.png?600 |Illustration of the idea of vector data processing}}+{{ :en:multiasm:cs:mmxprocessing.png?550 |Illustration of the idea of vector data processing}}
 <caption>The idea of vector data processing</caption> <caption>The idea of vector data processing</caption>
 </figure> </figure>
Line 25: Line 25:
  
 <figure mmxpaddsw> <figure mmxpaddsw>
-{{ :en:multiasm:cs:mmxpaddsw.png?500 |Illustration of packed word addition with signed saturation}}+{{ :en:multiasm:cs:mmxpaddsw.png?450 |Illustration of packed word addition with signed saturation}}
 <caption>The illustration of packed word addition with signed saturation</caption> <caption>The illustration of packed word addition with signed saturation</caption>
 </figure> </figure>
  
 <figure mmxpaddusw> <figure mmxpaddusw>
-{{ :en:multiasm:cs:mmxpaddusw.png?500 |Illustration of packed word addition with unsigned saturation}}+{{ :en:multiasm:cs:mmxpaddusw.png?450 |Illustration of packed word addition with unsigned saturation}}
 <caption>The illustration of packed word addition with unsigned saturation</caption> <caption>The illustration of packed word addition with unsigned saturation</caption>
 </figure> </figure>
Line 57: Line 57:
  
 <figure mmxmultiplyandunpck> <figure mmxmultiplyandunpck>
-{{ :en:multiasm:cs:mmxmultiplyandunpck.png?650 |Illustration of packed word multiplication and unpacking results to doublewords}}+{{ :en:multiasm:cs:mmxmultiplyandunpck.png?550 |Illustration of packed word multiplication and unpacking results to doublewords}}
 <caption>The illustration of packed word multiplication and unpacking results to doublewords</caption> <caption>The illustration of packed word multiplication and unpacking results to doublewords</caption>
 </figure> </figure>
Line 63: Line 63:
 The code that calculates the presented multiplication can look as follows: The code that calculates the presented multiplication can look as follows:
 <code asm> <code asm>
-Numbers DW  01ACh, 2112h, 03F3h, 00A4h, +Numbers DW  01ACh, 2112h, 03F3h, 00A4h, 
-     0006h, 0137h, 0AB7h, 00D8h +            0006h, 0137h, 0AB7h, 00D8h 
-LEA     ESI, Numbers +LEA         ESI, Numbers 
-MOVQ     mm0, [ESI]         ; mm0 = 00A4 03F3 2112 01AC +MOVQ        mm0, [ESI]          ; mm0 = 00A4 03F3 2112 01AC 
-MOVQ     mm1, [ESI+8] ; mm1 = 00D8 0AB7 0137 0006 +MOVQ        mm1, [ESI+8]        ; mm1 = 00D8 0AB7 0137 0006 
-MOVQ     mm2, mm0 +MOVQ        mm2, mm0 
-PMULLW     mm0, mm1 ; mm0 = 8A60 50B5 2CDE 0A08 +PMULLW      mm0, mm1            ; mm0 = 8A60 50B5 2CDE 0A08 
-PMULHW     mm1, mm2 ; mm1 = 0000 002A 0028 0000 +PMULHW      mm1, mm2            ; mm1 = 0000 002A 0028 0000 
-MOVQ     mm2, mm0 +MOVQ        mm2, mm0 
-PUNPCKLWD   mm0, mm1 ; mm0 = 0028 2CDE 0000 0A08 +PUNPCKLWD   mm0, mm1            ; mm0 = 0028 2CDE 0000 0A08 
-PUNPCKHWD   mm2, mm1 ; mm2 = 0000 8A60 002A 50B5+PUNPCKHWD   mm2, mm1            ; mm2 = 0000 8A60 002A 50B5
 </code> </code>
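The register contents given in the comments above can be cross-checked with a plain C model. This is only a sketch, not the MMX code itself: the function name `mul_words_to_dwords` is hypothetical, and each loop iteration stands in for one word lane of PMULLW, PMULHW and the following unpack.

```c
#include <stdint.h>

/* Scalar model of the PMULLW / PMULHW / PUNPCKLWD / PUNPCKHWD sequence:
   multiplies four pairs of signed 16-bit words into four 32-bit results.
   Hypothetical helper, written only to illustrate the data flow. */
void mul_words_to_dwords(const int16_t a[4], const int16_t b[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        /* The full product of two 16-bit words always fits in 32 bits. */
        int32_t full = (int32_t)a[i] * (int32_t)b[i];
        uint16_t low  = (uint16_t)((uint32_t)full & 0xFFFFu); /* kept by PMULLW */
        uint16_t high = (uint16_t)((uint32_t)full >> 16);     /* kept by PMULHW */
        /* The unpack instructions interleave high and low words again. */
        out[i] = (int32_t)(((uint32_t)high << 16) | low);
    }
}
```

Called with the eight words from the `Numbers` table, the model should give the same four doublewords that end up in mm0 and mm2.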
  
Line 80: Line 80:
  
 <figure mmxpmaddw> <figure mmxpmaddw>
-{{ :en:multiasm:cs:mmxpmaddw.png?650 |Illustration of packed word multiplication and sum to doublewords}}+{{ :en:multiasm:cs:mmxpmaddw.png?550 |Illustration of packed word multiplication and sum to doublewords}}
 <caption>The illustration of packed word multiplication and sum to doublewords</caption> <caption>The illustration of packed word multiplication and sum to doublewords</caption>
 </figure> </figure>
Line 98: Line 98:
 An example of comparison instruction for equality of two vectors of words is shown in figure {{ref>mmxcompare}}. An example of comparison instruction for equality of two vectors of words is shown in figure {{ref>mmxcompare}}.
 <figure mmxcompare> <figure mmxcompare>
-{{ :en:multiasm:cs:mmxcompare.png?500 |Illustration of vector data comparison}}+{{ :en:multiasm:cs:mmxcompare.png?450 |Illustration of vector data comparison}}
 <caption>Vector data comparison</caption> <caption>Vector data comparison</caption>
 </figure> </figure>
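As a rough illustration of the equality comparison on packed words (the PCMPEQW instruction in MMX), the following C sketch models one 64-bit register as four words. The function name `compare_equal_words` is an assumption made for this example.

```c
#include <stdint.h>

/* Scalar model of a packed word equality compare:
   each result word is all ones if the lanes match, all zeros otherwise.
   Hypothetical helper, not part of any library. */
void compare_equal_words(const uint16_t a[4], const uint16_t b[4], uint16_t mask[4])
{
    for (int i = 0; i < 4; i++) {
        mask[i] = (a[i] == b[i]) ? 0xFFFFu : 0x0000u;
    }
}
```

The resulting mask can then be combined with logical instructions to select elements without branching.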
Line 106: Line 106:
  
 <figure mmxpunpckhbw> <figure mmxpunpckhbw>
-{{ :en:multiasm:cs:mmxpunpkhbw.png?650 |Illustration of unpacking high-order bytes to words}}+{{ :en:multiasm:cs:mmxpunpkhbw.png?550 |Illustration of unpacking high-order bytes to words}}
 <caption>The illustration of unpacking high-order bytes to words</caption> <caption>The illustration of unpacking high-order bytes to words</caption>
 </figure> </figure>
  
 <figure mmxpunpcklbw> <figure mmxpunpcklbw>
-{{ :en:multiasm:cs:mmxpunpklbw.png?650 |Illustration of unpacking low-order bytes to words}}+{{ :en:multiasm:cs:mmxpunpklbw.png?550 |Illustration of unpacking low-order bytes to words}}
 <caption>The illustration of unpacking low-order bytes to words</caption> <caption>The illustration of unpacking low-order bytes to words</caption>
 </figure> </figure>
Line 118: Line 118:
  
 <figure mmxpack> <figure mmxpack>
-{{ :en:multiasm:cs:mmxpacksswd.png?650 |Illustration of packing doublewords to words}}+{{ :en:multiasm:cs:mmxpacksswd.png?550 |Illustration of packing doublewords to words}}
 <caption>The illustration of packing doublewords to words</caption> <caption>The illustration of packing doublewords to words</caption>
 </figure> </figure>
Line 164: Line 164:
 The idea of vector and scalar operations is shown in figure {{ref>sse1vector}} and figure {{ref>sse1scalar}}, respectively. The idea of vector and scalar operations is shown in figure {{ref>sse1vector}} and figure {{ref>sse1scalar}}, respectively.
 <figure sse1vector> <figure sse1vector>
-{{ :en:multiasm:cs:sse1vector.png?600 |Illustration of the idea of SSE vector data processing}}+{{ :en:multiasm:cs:sse1vector.png?550 |Illustration of the idea of SSE vector data processing}}
 <caption>The idea of vector data processing in SSE</caption> <caption>The idea of vector data processing in SSE</caption>
 </figure> </figure>
 <figure sse1scalar> <figure sse1scalar>
-{{ :en:multiasm:cs:sse1scalar.png?600 |Illustration of the idea of SSE scalar data processing}}+{{ :en:multiasm:cs:sse1scalar.png?550 |Illustration of the idea of SSE scalar data processing}}
 <caption>The idea of scalar data processing in SSE</caption> <caption>The idea of scalar data processing in SSE</caption>
 </figure> </figure>
Line 221: Line 221:
  
 <figure sseunpack> <figure sseunpack>
-{{ :en:multiasm:cs:sseunpack.png?650 |Illustration of SSE unpacking single-precision floating-point values}}+{{ :en:multiasm:cs:sseunpack.png?550 |Illustration of SSE unpacking single-precision floating-point values}}
 <caption>The illustration of SSE unpacking single-precision floating-point values</caption> <caption>The illustration of SSE unpacking single-precision floating-point values</caption>
 </figure> </figure>
Line 228: Line 228:
  
 <figure sseshuffle> <figure sseshuffle>
-{{ :en:multiasm:cs:sseshuffle.png?650 |Illustration of SSE shuffle single-precision floating-point values}}+{{ :en:multiasm:cs:sseshuffle.png?550 |Illustration of SSE shuffle single-precision floating-point values}}
 <caption>The illustration of SSE shuffle single-precision floating-point values</caption> <caption>The illustration of SSE shuffle single-precision floating-point values</caption>
 </figure> </figure>
Line 245: Line 245:
  
 <figure sse2conversions> <figure sse2conversions>
-{{ :en:multiasm:cs:sse2conversions.png?650 |Illustration of a variety of data type conversion instructions}}+{{ :en:multiasm:cs:sse2conversions.png?550 |Illustration of a variety of data type conversion instructions}}
 <caption>The illustration of a variety of data type conversion instructions</caption> <caption>The illustration of a variety of data type conversion instructions</caption>
 </figure> </figure>
Line 253: Line 253:
 All horizontal instructions operate in a similar manner. The lower (bottom) part of the resulting vector is the result of operation on the bottom and top elements of the first (destination) operand; the higher (top) part of the resulting vector is the result of operation on the second (source) operand's bottom and top. The best way to present the principles of horizontal operations is a picture. Because in the subtraction operation the order of arguments is important, the **hsubpd** instruction is shown in figure {{ref>sse3hsubpd}}. All horizontal instructions operate in a similar manner. The lower (bottom) part of the resulting vector is the result of operation on the bottom and top elements of the first (destination) operand; the higher (top) part of the resulting vector is the result of operation on the second (source) operand's bottom and top. The best way to present the principles of horizontal operations is a picture. Because in the subtraction operation the order of arguments is important, the **hsubpd** instruction is shown in figure {{ref>sse3hsubpd}}.
 <figure sse3hsubpd> <figure sse3hsubpd>
-{{ :en:multiasm:cs:sse3hsubpd.png?650 |Illustration of a horizontal subtraction instruction}}+{{ :en:multiasm:cs:sse3hsubpd.png?550 |Illustration of a horizontal subtraction instruction}}
 <caption>The illustration of a horizontal subtraction instruction</caption> <caption>The illustration of a horizontal subtraction instruction</caption>
 </figure> </figure>
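The element order described for **hsubpd** can be written down as a small C model. It is a sketch under the assumption that index 0 is the bottom (low) element; `hsubpd_model` is a hypothetical name.

```c
/* Scalar model of hsubpd: dst is the first (destination) operand,
   src is the second (source) operand, index 0 is the low element. */
void hsubpd_model(double dst[2], const double src[2])
{
    double low  = dst[0] - dst[1];  /* bottom minus top of the destination */
    double high = src[0] - src[1];  /* bottom minus top of the source */
    dst[0] = low;
    dst[1] = high;
}
```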
Line 259: Line 259:
 When the source vectors have more than two elements, as in the **hsubps** instruction, it is also important to know the order of the elements in the resulting vector. See figure {{ref>sse3hsubps}}. When the source vectors have more than two elements, as in the **hsubps** instruction, it is also important to know the order of the elements in the resulting vector. See figure {{ref>sse3hsubps}}.
 <figure sse3hsubps> <figure sse3hsubps>
-{{ :en:multiasm:cs:sse3hsubps.png?650 |Illustration of a horizontal single precision subtraction instruction}}+{{ :en:multiasm:cs:sse3hsubps.png?550 |Illustration of a horizontal single precision subtraction instruction}}
 <caption>The illustration of a horizontal single precision subtraction instruction</caption> <caption>The illustration of a horizontal single precision subtraction instruction</caption>
 </figure> </figure>
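For the four-element case, a C sketch of **hsubps** makes the result order explicit. Index 0 is assumed to be the lowest element, and `hsubps_model` is a hypothetical name.

```c
/* Scalar model of hsubps: the two low results come from adjacent pairs
   of the destination, the two high results from pairs of the source. */
void hsubps_model(float dst[4], const float src[4])
{
    float r0 = dst[0] - dst[1];
    float r1 = dst[2] - dst[3];
    float r2 = src[0] - src[1];
    float r3 = src[2] - src[3];
    dst[0] = r0;
    dst[1] = r1;
    dst[2] = r2;
    dst[3] = r3;
}
```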
Line 283: Line 283:
 The illustration is shown in figure {{ref>sse3pshufb}}. The illustration is shown in figure {{ref>sse3pshufb}}.
 <figure sse3pshufb> <figure sse3pshufb>
-{{ :en:multiasm:cs:sse3pshufb.png?650 |Illustration of a byte shuffle instruction}}+{{ :en:multiasm:cs:sse3pshufb.png?600 |Illustration of a byte shuffle instruction}}
 <caption>The illustration of a byte shuffle instruction</caption> <caption>The illustration of a byte shuffle instruction</caption>
 </figure> </figure>
Line 290: Line 290:
  
 <figure sse3palignr> <figure sse3palignr>
-{{ :en:multiasm:cs:sse3palignr.png?650 |Illustration of an aligned byte combine instruction}}+{{ :en:multiasm:cs:sse3palignr.png?600 |Illustration of an aligned byte combine instruction}}
 <caption>The illustration of an aligned byte combine instruction</caption> <caption>The illustration of an aligned byte combine instruction</caption>
 </figure> </figure>
Line 298: Line 298:
 The **dpps** and **dppd** instructions calculate the dot product of four single-precision and two double-precision operands, respectively. A third, immediate operand selects which element products enter the sum and which elements of the destination receive the result. The example showing the **dppd** is presented in figure {{ref>sse4dotproduct}}. The **dpps** and **dppd** instructions calculate the dot product of four single-precision and two double-precision operands, respectively. A third, immediate operand selects which element products enter the sum and which elements of the destination receive the result. The example showing the **dppd** is presented in figure {{ref>sse4dotproduct}}.
 <figure sse4dotproduct> <figure sse4dotproduct>
-{{ :en:multiasm:cs:sse4dotproduct.png?650 |Illustration of a dot product calculation instruction}}+{{ :en:multiasm:cs:sse4dotproduct.png?600 |Illustration of a dot product calculation instruction}}
 <caption>The illustration of a dot product calculation instruction</caption> <caption>The illustration of a dot product calculation instruction</caption>
 </figure> </figure>
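The role of the immediate operand in **dppd** may be easier to follow in a C model. This is a sketch only: `dppd_model` is a hypothetical name, bits 4 and 5 of the immediate are taken as the product enables, and bits 0 and 1 as the destination write mask.

```c
/* Scalar model of dppd with an 8-bit immediate:
   bits 4..5 enable the products of elements 0 and 1,
   bits 0..1 choose which destination elements receive the sum. */
void dppd_model(double dst[2], const double src[2], unsigned imm8)
{
    double p0 = (imm8 & 0x10u) ? dst[0] * src[0] : 0.0;
    double p1 = (imm8 & 0x20u) ? dst[1] * src[1] : 0.0;
    double sum = p0 + p1;
    dst[0] = (imm8 & 0x01u) ? sum : 0.0;  /* unselected elements are zeroed */
    dst[1] = (imm8 & 0x02u) ? sum : 0.0;
}
```

With the immediate 0x31, both products are summed and the result is written only to the low element.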
  
-There are also advanced shuffle, insert and extract instructions which make it possible to manipulate positions of the data of various types. A few examples will be shown in the following figures. +There are also advanced shuffle, insert and extract instructions which make it possible to manipulate positions of the data of various types. The type of the data is specified with the suffix of the mnemonic: b - bytes, w - words, d - doublewords, q - quadwords, ps - single precision and pd - double precision elements. Although these instructions behave the same for the integer and floating-point data elements, formally, those operating with integers begin with the letter "P". A few examples are shown in the following figures. 
  
 The blending instructions copy elements of vectors, mixing two sources into the destination. The **blendps**, **blendpd** and **pblendw** conditionally copy elements from vector X or Y. The mask is specified as the third, immediate value. The behaviour of **blendpd** is shown in fig. {{ref>sse4blendpd}} The blending instructions copy elements of vectors, mixing two sources into the destination. The **blendps**, **blendpd** and **pblendw** conditionally copy elements from vector X or Y. The mask is specified as the third, immediate value. The behaviour of **blendpd** is shown in fig. {{ref>sse4blendpd}}
 <figure sse4blendpd> <figure sse4blendpd>
-{{ :en:multiasm:cs:sse4blendpd.png?500 |Illustration of an example of packed blending instruction}}+{{ :en:multiasm:cs:sse4blendpd.png?400 |Illustration of an example of packed blending instruction}}
 <caption>The illustration of an example of packed blending instruction</caption> <caption>The illustration of an example of packed blending instruction</caption>
 </figure> </figure>
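A minimal C sketch of **blendpd** follows, assuming the usual two-operand convention where the destination keeps its own element when the mask bit is clear. The name `blendpd_model` is hypothetical.

```c
/* Scalar model of blendpd: bit i of the immediate decides whether
   element i is copied from the source or left as it is in the destination. */
void blendpd_model(double dst[2], const double src[2], unsigned imm8)
{
    for (int i = 0; i < 2; i++) {
        if (imm8 & (1u << i)) {
            dst[i] = src[i];
        }
    }
}
```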
 +
 The instructions **blendvps**, **blendvpd** and **pblendvb** operate in a similar way, but the condition is specified as the sign bit of the corresponding elements of the third implied argument stored in XMM0. The behaviour of **blendvpd** is shown in fig. {{ref>sse4blendvpd}} The instructions **blendvps**, **blendvpd** and **pblendvb** operate in a similar way, but the condition is specified as the sign bit of the corresponding elements of the third implied argument stored in XMM0. The behaviour of **blendvpd** is shown in fig. {{ref>sse4blendvpd}}
 <figure sse4blendvpd> <figure sse4blendvpd>
-{{ :en:multiasm:cs:sse4blendvpd.png?500 |Illustration of an example of packed blending instruction}}+{{ :en:multiasm:cs:sse4blendvpd.png?400 |Illustration of an example of packed blending instruction}}
 <caption>The illustration of an example of packed blending instruction</caption> <caption>The illustration of an example of packed blending instruction</caption>
 </figure> </figure>
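The variable blend can be modelled in C in the same way, with the mask taken from the sign bits of a third vector that stands in for XMM0. This is a sketch; `blendvpd_model` and its `xmm0` parameter are names chosen for this example.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of blendvpd: element i is copied from the source
   when the sign bit (bit 63) of the matching mask element is set. */
void blendvpd_model(double dst[2], const double src[2], const double xmm0[2])
{
    for (int i = 0; i < 2; i++) {
        uint64_t bits;
        memcpy(&bits, &xmm0[i], sizeof bits);  /* read the raw bit pattern */
        if (bits >> 63) {
            dst[i] = src[i];
        }
    }
}
```

Only the sign bit matters, so any negative value, including -0.0, selects the source element.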
 +
 +The set of extract instructions includes **pextrb**, **pextrw**, **pextrd**, **pextrq** and **extractps**. They take one element of the vector from the XMM register and store it in a CPU register or in memory. The offset of the element is specified with an immediate constant. The behaviour of **extractps** is shown in fig. {{ref>sse4extractps}}
 +<figure sse4extractps>
 +{{ :en:multiasm:cs:sse4extractps.png?400 |Illustration of an example of extract instruction}}
 +<caption>The illustration of an example of extract instruction</caption>
 +</figure>
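As an illustration of **extractps**, the C sketch below returns the raw 32-bit pattern of the selected element, since the instruction stores the bits unchanged in a general-purpose register or in memory. The name `extractps_model` is hypothetical, and only the two lowest bits of the immediate are assumed to select the element.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of extractps: returns the bit pattern of one of the
   four single-precision elements, selected by imm8 bits 1..0. */
uint32_t extractps_model(const float src[4], unsigned imm8)
{
    uint32_t bits;
    memcpy(&bits, &src[imm8 & 3u], sizeof bits);  /* no numeric conversion */
    return bits;
}
```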
 +
 +The insert instructions are **pinsrb**, **pinsrd** and **pinsrq**. They operate in an opposite way to extract instructions. They take an element from memory or a general-purpose register and insert it into the XMM register at the position specified with a constant immediate. The behaviour of **pinsrd** is shown in fig. {{ref>sse4pinsrd}}
 +<figure sse4pinsrd>
 +{{ :en:multiasm:cs:sse4pinsrd.png?400 |Illustration of an example of an insert instruction}}
 +<caption>The illustration of an example of an insert instruction</caption>
 +</figure>
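The insert operation can be sketched in C as a single element store. `pinsrd_model` is a hypothetical name; the XMM register is modelled as four doublewords, and the two lowest bits of the immediate are assumed to select the position.

```c
#include <stdint.h>

/* Scalar model of pinsrd: replaces one doubleword of the vector
   and leaves the remaining three elements unchanged. */
void pinsrd_model(uint32_t dst[4], uint32_t value, unsigned imm8)
{
    dst[imm8 & 3u] = value;
}
```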
 +
  
 The **insertps** instruction is one of the most complex. It inserts a scalar single-precision floating-point value, with the positions of the element in the source and destination vectors controlled by an 8-bit immediate. An example of the **insertps** instruction is presented in figure {{ref>sse4insertps}}. In this example, the immediate has the binary value 10011000b. The **insertps** instruction is one of the most complex. It inserts a scalar single-precision floating-point value, with the positions of the element in the source and destination vectors controlled by an 8-bit immediate. An example of the **insertps** instruction is presented in figure {{ref>sse4insertps}}. In this example, the immediate has the binary value 10011000b.
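The three fields of the **insertps** immediate can be decoded as in the following C sketch, which models the register-source form. The name `insertps_model` is hypothetical; bits 7..6 are taken as the source element index, bits 5..4 as the destination index and bits 3..0 as the zero mask.

```c
/* Scalar model of insertps with a register source operand. */
void insertps_model(float dst[4], const float src[4], unsigned imm8)
{
    unsigned count_s = (imm8 >> 6) & 3u;  /* which source element to take */
    unsigned count_d = (imm8 >> 4) & 3u;  /* where to put it in the destination */
    unsigned zmask   = imm8 & 0xFu;       /* destination elements to clear */
    float value = src[count_s];           /* read first, in case dst and src alias */

    dst[count_d] = value;
    for (int i = 0; i < 4; i++) {
        if (zmask & (1u << i)) {
            dst[i] = 0.0f;
        }
    }
}
```

For the immediate 10011000b (0x98), the model copies source element 2 into destination element 1 and then clears destination element 3.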
Line 320: Line 334:
 <caption>The illustration of an example of an advanced shuffle instruction</caption> <caption>The illustration of an example of an advanced shuffle instruction</caption>
 </figure> </figure>
-In SSE4.2, the set of string compare instructions was added. As the XMM registers can contain sixteen bytes, it is much more efficient to implement string processing algorithms with bigger XMM registers than with registers in the main processor with the use of strong instructions. There are four string compare instructions (see table {{ref>sse4stringtable}}), but each of them can be configured to achieve different functionalities. The length of strings can be explicit or implicit. Explicit length means that the length of the first operand is specified with the RAX register, and the length of the second operand is specified with the RDX register. Implicit length means that both operands contain null-terminated strings. Instructions can produce two kinds of results. Index means that the index of the first or last result is returned. Mask means that the bit mask is returned (one bit for each two elements compared) or a mask of the size of the elements (similarly to MMX compare).+In SSE4.2, the set of string compare instructions was added. As the XMM registers can contain sixteen bytes, it is much more efficient to implement string processing algorithms with bigger XMM registers than with registers in the main processor with the use of string instructions. There are four string compare instructions (see table {{ref>sse4stringtable}}), but each of them can be configured to achieve different functionalities. The length of strings can be explicit or implicit. Explicit length means that the length of the first operand is specified with the RAX register, and the length of the second operand is specified with the RDX register. Implicit length means that both operands contain null-terminated strings. Instructions can produce two kinds of results. Index means that the index of the first or last result is returned. Mask means that the bit mask is returned (one bit for each two elements compared) or a mask of the size of the elements (similarly to MMX compare).
 <table sse4stringtable> <table sse4stringtable>
 <caption>SSE4.2 string compare instructions</caption> <caption>SSE4.2 string compare instructions</caption>
en/multiasm/papc/chapter_6_11.1771489250.txt.gz · Last modified: by ktokarz
CC Attribution-Share Alike 4.0 International