| en:multiasm:paarm:chapter_5_10 [2026/06/21 21:32] – pczekalski | en:multiasm:paarm:chapter_5_10 [2026/06/21 21:34] (current) – pczekalski |
|---|---|
| ''<fc #800000>FADD</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc> <fc #6495ed>@ Add 4 single-precision floats from v1 and v2, store in v0.</fc>''\\ | ''<fc #800000>FADD</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc> <fc #6495ed>@ Add 4 single-precision floats from v1 and v2, store in v0.</fc>''\\ |
| ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc> <fc #6495ed>@ Add 8 halfwords (16-bit integers) elementwise.</fc>''\\ | ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc> <fc #6495ed>@ Add 8 halfwords (16-bit integers) elementwise.</fc>''\\ |
| ''<fc #800000>MUL</fc> <fc #008000>V0</fc>.<fc #808000>16B</fc>, <fc #008000>V1</fc>.<fc #808000>16B</fc>, <fc #008000>V2</fc>.<fc #808000>16B</fc> <fc #6495ed>@ Multiply 16-byte elementwise.</fc>'' | ''<fc #800000>MUL</fc> <fc #008000>V0</fc>.<fc #808000>16B</fc>, <fc #008000>V1</fc>.<fc #808000>16B</fc>, <fc #008000>V2</fc>.<fc #808000>16B</fc> <fc #6495ed>@ Multiply 16 byte-sized (8-bit) elements elementwise.</fc>'' |
| |
| Elementwise operations are performed by lane, where each element occupies its own lane. Like ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc>''. The instruction specifies that there will be eight distinct 16-bit variables. The instruction is designed to operate on the integers, and the result is computed from eight 16-bit integers. In the picture below, it is visible which element from vector ''<fc #008000>V1</fc>'' is added to vector ''<fc #008000>V2</fc>'' and how the result is obtained in vector ''<fc #008000>V0</fc>''. | Elementwise operations are performed lane by lane, with each element occupying its own lane. For example, ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc>'' specifies eight distinct 16-bit elements. The instruction operates on integers, and each of the eight 16-bit results is computed from the corresponding pair of source lanes. The picture below shows which element of vector ''<fc #008000>V1</fc>'' is added to which element of vector ''<fc #008000>V2</fc>'' and where the result is placed in vector ''<fc #008000>V0</fc>''. |
| The instruction ''<fc #800000>LD1R</fc> {<fc #008000>V0</fc>.<fc #808000>S</fc>}, [<fc #008000>X1</fc>]'' loads one 32-bit value (integer or single-precision floating-point) in each of four vector lanes. All vector elements will contain the same value. | The instruction ''<fc #800000>LD1R</fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' loads one 32-bit value (integer or single-precision floating-point) from memory and replicates it into each of the four vector lanes. All vector elements will contain the same value. |
| |
| The instruction ''<fc #800000>LD1</fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' loads eight 32-bit values into the lanes of first ''<fc #008000>V0</fc>'' and then on ''<fc #008000>V1</fc>''. All the data are loaded sequentially in the form x0, x1, x2, x3 and so on. Note that x here is meant for any chosen data type and that the data by itself are in sequential order. Similar operations are performed with ''<fc #800000>LD2</fc>'', ''<fc #800000>LD3</fc>'' and ''<fc #800000>LD4</fc>'' instructions; the data loaded into the vector register are interleaved. The ''<fc #800000>LD2</fc>'' instruction loads the structure of two elements into two vector registers. There must always be two vector registers identified. The structure in the memory may be in such order: x0, y0, x1, y1, x2, y2, … and using the LD2 instruction, the x part of the structure is loaded into the first pointed register and the y part into the second vector register. This instruction is used, for example, to process audio data, where x may be the left audio channel and y the right:\\ | The instruction ''<fc #800000>LD1</fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' loads eight 32-bit values, filling the lanes of ''<fc #008000>V0</fc>'' first and then those of ''<fc #008000>V1</fc>''. The data are loaded sequentially as x0, x1, x2, x3, and so on. Here x stands for an element of any chosen data type, and the elements keep their memory order. The ''<fc #800000>LD2</fc>'', ''<fc #800000>LD3</fc>'' and ''<fc #800000>LD4</fc>'' instructions work similarly, but they treat memory as an array of interleaved structures and de-interleave the members into separate vector registers. The ''<fc #800000>LD2</fc>'' instruction loads two-element structures into two vector registers, so exactly two vector registers must be specified. If the structures in memory are ordered x0, y0, x1, y1, x2, y2, …, then ''<fc #800000>LD2</fc>'' loads the x members into the first specified register and the y members into the second. This instruction is used, for example, to process audio data, where x may be the left audio channel and y the right:\\ |
| ''<fc #800000>LD2 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' | ''<fc #800000>LD2 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' |
| |
| </figure> | </figure> |
| |
| The ''<fc #800000>LD3</fc>'' instruction takes memory data as a three-element structure, such as an image's RGB data. For this instruction, three vector registers must be identified. Assuming the data in memory: r0, g0, b0, r1, g1, b1, r2, g2, b2,…, the r (red) channel will be loaded in the first identified register, the ‘g’ in the second and ‘b’ in the third, the last identified register.\\ | The ''<fc #800000>LD3</fc>'' instruction loads memory data as three-element structures, such as an image's RGB data. For this instruction, three vector registers must be specified. Assuming the data in memory are ordered r0, g0, b0, r1, g1, b1, r2, g2, b2, …, the r (red) channel is loaded into the first specified register, the g (green) channel into the second, and the b (blue) channel into the third.\\ |
| ''<fc #800000>LD3 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' | ''<fc #800000>LD3 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' |
| |
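
The lane mappings described above can be checked without ARM hardware using a small scalar model in C. This is only a sketch of the documented behaviour, not how the instructions are implemented. The types ''vec4s'' and ''vec8h'' and all ''model_*'' function names are hypothetical helpers introduced for illustration. Lane 0 is assumed to hold the element at the lowest memory address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical models of one 128-bit vector register. */
typedef struct { uint32_t lane[4]; } vec4s;   /* viewed as 4 x 32-bit lanes (.4S) */
typedef struct { uint16_t lane[8]; } vec8h;   /* viewed as 8 x 16-bit lanes (.8H) */

/* Models ADD Vd.8H, Vn.8H, Vm.8H: each lane is added independently,
 * and each 16-bit result wraps around modulo 2^16. */
static vec8h model_add_8h(vec8h n, vec8h m) {
    vec8h d;
    for (size_t i = 0; i < 8; i++)
        d.lane[i] = (uint16_t)(n.lane[i] + m.lane[i]);
    return d;
}

/* Models LD1R {Vt.4S}, [Xn]: one 32-bit value replicated into all four lanes. */
static vec4s model_ld1r_4s(const uint32_t *mem) {
    assert(mem != NULL);
    vec4s t;
    for (size_t i = 0; i < 4; i++)
        t.lane[i] = mem[0];
    return t;
}

/* Models LD2 {Vt.4S, Vt2.4S}, [Xn]: memory holds x0, y0, x1, y1, ...;
 * the x members go to the first register, the y members to the second. */
static void model_ld2_4s(const uint32_t *mem, vec4s *first, vec4s *second) {
    assert(mem != NULL && first != NULL && second != NULL);
    for (size_t i = 0; i < 4; i++) {
        first->lane[i]  = mem[2 * i];
        second->lane[i] = mem[2 * i + 1];
    }
}

/* Models LD3 {Vt.4S, Vt2.4S, Vt3.4S}, [Xn]: memory holds r0, g0, b0, r1, ...;
 * each channel ends up in its own register. */
static void model_ld3_4s(const uint32_t *mem, vec4s *r, vec4s *g, vec4s *b) {
    assert(mem != NULL && r != NULL && g != NULL && b != NULL);
    for (size_t i = 0; i < 4; i++) {
        r->lane[i] = mem[3 * i];
        g->lane[i] = mem[3 * i + 1];
        b->lane[i] = mem[3 * i + 2];
    }
}
```

For example, calling ''model_ld2_4s'' on a buffer holding the stereo samples 10, 20, 11, 21, 12, 22, 13, 23 places 10, 11, 12, 13 in the first register (left channel) and 20, 21, 22, 23 in the second (right channel).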