| en:multiasm:paarm:chapter_5_10 [2026/02/27 10:55] – [NEON (SIMD – Single Instruction Multiple Data)] jtokarz | en:multiasm:paarm:chapter_5_10 [2026/02/27 16:48] (current) – [NEON (SIMD – Single Instruction Multiple Data)] jtokarz |
|---|
| * <fc #008000>V31</fc>.<fc #808000>16B</fc> can hold 16 elements: sixteen 8-bit integers, where the letter B indicates byte | * <fc #008000>V31</fc>.<fc #808000>16B</fc> can hold 16 elements: sixteen 8-bit integers, where the letter B indicates byte |
| |
| {{ :en:multiasm:paarm:2025-12-04_17_56_45-.png?600 |}} | <figure v0reg> |
| | {{ :en:multiasm:paarm:2025-12-04_17_56_45-.png?600 |V0 Register}} |
| | <caption>V0 Register</caption> |
| | </figure> |
| |
| NEON allows the CPU to perform operations on multiple data items with a single instruction. A standard scalar instruction handles only one value per register, whereas NEON registers are designed to hold multiple values. This is how Single Instruction, Multiple Data (SIMD) operations work: one instruction operates on several elements in parallel. The technique is widely used for image, signal, and multimedia processing, as well as for accelerating math-heavy tasks such as encryption, machine learning, and other Artificial Intelligence (AI) workloads. The processor used in the Raspberry Pi 5 supports the NEON and floating-point (FP) instruction sets. | NEON allows the CPU to perform operations on multiple data items with a single instruction. A standard scalar instruction handles only one value per register, whereas NEON registers are designed to hold multiple values. This is how Single Instruction, Multiple Data (SIMD) operations work: one instruction operates on several elements in parallel. The technique is widely used for image, signal, and multimedia processing, as well as for accelerating math-heavy tasks such as encryption, machine learning, and other Artificial Intelligence (AI) workloads. The processor used in the Raspberry Pi 5 supports the NEON and floating-point (FP) instruction sets. |
| |
| Elementwise operations are performed by lane, where each element occupies its own lane, as in ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc>''. The ''<fc #808000>8H</fc>'' arrangement specifies eight distinct 16-bit elements. The ''<fc #800000>ADD</fc>'' instruction operates on integers, so each of the eight results is the sum of the two 16-bit integers in the corresponding lane. The picture below shows which element of vector ''<fc #008000>V1</fc>'' is added to which element of vector ''<fc #008000>V2</fc>'' and where the result is placed in vector ''<fc #008000>V0</fc>''. | Elementwise operations are performed by lane, where each element occupies its own lane, as in ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc>''. The ''<fc #808000>8H</fc>'' arrangement specifies eight distinct 16-bit elements. The ''<fc #800000>ADD</fc>'' instruction operates on integers, so each of the eight results is the sum of the two 16-bit integers in the corresponding lane. The picture below shows which element of vector ''<fc #008000>V1</fc>'' is added to which element of vector ''<fc #008000>V2</fc>'' and where the result is placed in vector ''<fc #008000>V0</fc>''. |
| in the image is ilustrated instruction ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc>'' operation. | The image illustrates the following instruction: ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>8H</fc>, <fc #008000>V1</fc>.<fc #808000>8H</fc>, <fc #008000>V2</fc>.<fc #808000>8H</fc>''. |
| |
| {{ :en:multiasm:paarm:2025-12-04_18_45_46-pictures_for_the_book.pptx_-_powerpoint.jpg?600 |}} | <figure add8h> |
| | {{ :en:multiasm:paarm:2025-12-04_18_45_46-pictures_for_the_book.pptx_-_powerpoint.jpg?600 |8-lane halfword ADD operation}} |
| | <caption>8-lane halfword ADD operation</caption> |
| | </figure> |
| |
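As a rough illustration of lane-wise addition, the C sketch below models what ''ADD V0.8H, V1.8H, V2.8H'' computes. It is a scalar model rather than NEON code. The helper name ''add_8h'' and the choice of unsigned 16-bit lanes are assumptions made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES_8H 8 /* eight 16-bit lanes fill one 128-bit register */

/* Scalar model of ADD Vd.8H, Vn.8H, Vm.8H (hypothetical helper).
 * Every lane is added independently. A carry never crosses into the
 * neighbouring lane, so each sum wraps modulo 2^16. */
void add_8h(uint16_t vd[LANES_8H], const uint16_t vn[LANES_8H],
            const uint16_t vm[LANES_8H])
{
    assert(vd != NULL && vn != NULL && vm != NULL);
    for (size_t lane = 0; lane < LANES_8H; ++lane) {
        vd[lane] = (uint16_t)(vn[lane] + vm[lane]);
    }
}
```

The loop runs eight times here, whereas the hardware instruction produces all eight sums with a single instruction. A lane holding 65535 added to a lane holding 1 yields 0 in that lane and leaves the other lanes unaffected.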
| In the image, it is also visible how the elements are added together in separate lanes (eight lanes). The lanes used for operations depend on the vector register elements used in the operation. Another example with only four lanes involved would be from this instruction: ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>'' | |
| |
| {{ :en:multiasm:paarm:neon_simd2.jpg?600 |}} | The image also shows how the elements are added in eight separate lanes. The number of lanes depends on the arrangement specifier of the vector registers used in the operation. For example, the following instruction uses only four lanes of 32 bits each: ''<fc #800000>ADD</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>'' |
| | |
| | <figure add4s> |
| | {{ :en:multiasm:paarm:neon_simd2.jpg?600 |4-lane 32-bit integer ADD operation}} |
| | <caption>4-lane 32-bit integer ADD operation</caption> |
| | </figure> |
| |
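| NEON can also perform operations that narrow the result into the destination register. The plain ''<fc #800000>ADD</fc>'' instruction requires the same arrangement on all three operands, so a dedicated mnemonic is used. For example, ''<fc #800000>ADDHN</fc>'' adds the 32-bit lanes and keeps the upper 16 bits of each sum:\\ | NEON can also perform operations that narrow the result into the destination register. The plain ''<fc #800000>ADD</fc>'' instruction requires the same arrangement on all three operands, so a dedicated mnemonic is used. For example, ''<fc #800000>ADDHN</fc>'' adds the 32-bit lanes and keeps the upper 16 bits of each sum:\\ |
| ''<fc #800000>ADDHN</fc> <fc #008000>V0</fc>.<fc #808000>4H</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>'' | ''<fc #800000>ADDHN</fc> <fc #008000>V0</fc>.<fc #808000>4H</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>'' |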
| |
| {{ :en:multiasm:paarm:neon_simd_stoh.jpg?600 |}} | <figure narrowed> |
| | {{ :en:multiasm:paarm:neon_simd_stoh.jpg?600 |ADD Operation with Narrowed Destination Register}} |
| | <caption>ADD Operation with Narrowed Destination Register</caption> |
| | </figure> |
| |
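| Widening the destination register works in a similar way. The ''<fc #800000>UADDL</fc>'' (unsigned) and ''<fc #800000>SADDL</fc>'' (signed) instructions extend each 16-bit lane to 32 bits before adding:\\ | Widening the destination register works in a similar way. The ''<fc #800000>UADDL</fc>'' (unsigned) and ''<fc #800000>SADDL</fc>'' (signed) instructions extend each 16-bit lane to 32 bits before adding:\\ |
| ''<fc #800000>UADDL</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4H</fc>, <fc #008000>V2</fc>.<fc #808000>4H</fc>'' | ''<fc #800000>UADDL</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4H</fc>, <fc #008000>V2</fc>.<fc #808000>4H</fc>'' |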
| {{ :en:multiasm:paarm:neon_simd_htos.jpg?600 |}} | <figure widened> |
| | {{ :en:multiasm:paarm:neon_simd_htos.jpg?600 |ADD Operation with Widened Destination Register}} |
| | <caption>ADD Operation with Widened Destination Register</caption> |
| | </figure> |
| |
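AArch64 provides dedicated mnemonics for additions whose destination lanes are narrower or wider than the source lanes, for instance ''ADDHN'' (add, keep the high half) and ''UADDL'' (unsigned add long). The following C sketch is a scalar model of these two instructions for four lanes. The helper names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of ADDHN Vd.4H, Vn.4S, Vm.4S (hypothetical helper).
 * The 32-bit lanes are added, then only the upper 16 bits of each
 * sum are written to the narrower destination lane. */
void addhn_4s_to_4h(uint16_t vd[4], const uint32_t vn[4],
                    const uint32_t vm[4])
{
    assert(vd != NULL && vn != NULL && vm != NULL);
    for (size_t lane = 0; lane < 4; ++lane) {
        uint32_t sum = vn[lane] + vm[lane]; /* wraps modulo 2^32 */
        vd[lane] = (uint16_t)(sum >> 16);
    }
}

/* Scalar model of UADDL Vd.4S, Vn.4H, Vm.4H (hypothetical helper).
 * Each 16-bit lane is zero-extended to 32 bits before the addition,
 * so the sum cannot overflow the destination lane. */
void uaddl_4h_to_4s(uint32_t vd[4], const uint16_t vn[4],
                    const uint16_t vm[4])
{
    assert(vd != NULL && vn != NULL && vm != NULL);
    for (size_t lane = 0; lane < 4; ++lane) {
        vd[lane] = (uint32_t)vn[lane] + (uint32_t)vm[lane];
    }
}
```

In the widening model, 65535 + 1 gives 65536 instead of wrapping to 0, which is the main reason to widen the destination.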
| The arrangement specifier determines how many elements an instruction processes: the integer ''<fc #800000>ADD</fc>'' instruction computes its result from eight 16-bit integers, four 32-bit integers, and so on. SIMD instructions can also take a single element from a vector and use it as a scalar value. For example, every element of a vector can be multiplied by the same scalar and accumulated into the destination:\\ | The arrangement specifier determines how many elements an instruction processes: the integer ''<fc #800000>ADD</fc>'' instruction computes its result from eight 16-bit integers, four 32-bit integers, and so on. SIMD instructions can also take a single element from a vector and use it as a scalar value. For example, every element of a vector can be multiplied by the same scalar and accumulated into the destination:\\ |
| ''<fc #800000>FMLA</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>S</fc>[<fc #ffa500>1</fc>] <fc #6495ed>/* V0 += V1 * V2[1], element 1 of V2 is used as the scalar */</fc>'' | ''<fc #800000>FMLA</fc> <fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>S</fc>[<fc #ffa500>1</fc>] <fc #6495ed>/* V0 += V1 * V2[1], element 1 of V2 is used as the scalar */</fc>'' |
| |
| {{ :en:multiasm:paarm:neon_fmla.jpg?600 |}} | <figure scalar> |
| | {{ :en:multiasm:paarm:neon_fmla.jpg?600 |Scalar Multiplication}} |
| | <caption>Scalar Multiplication</caption> |
| | </figure> |
| |
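The multiply-accumulate by element described above can be sketched in C as follows. This is only a scalar model under stated assumptions: the helper name ''fmla_4s_by_element'' is hypothetical, and the model rounds after the multiplication and again after the addition, whereas the real ''FMLA'' instruction is fused and rounds once.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of FMLA Vd.4S, Vn.4S, Vm.S[index] (hypothetical helper).
 * One element of vm is broadcast as a scalar; every lane of vn is
 * multiplied by it and accumulated into the matching lane of vd. */
void fmla_4s_by_element(float vd[4], const float vn[4],
                        const float vm[4], unsigned index)
{
    assert(index < 4); /* a 4S vector has lanes 0..3 */
    const float scalar = vm[index]; /* read once, so vd may alias vm */
    for (size_t lane = 0; lane < 4; ++lane) {
        vd[lane] += vn[lane] * scalar;
    }
}
```

With ''index'' set to 1, the call corresponds to the instruction shown above: every lane of the second operand is scaled by element 1 of the third operand.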
| |
| ''<fc #800000>LD2 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' | ''<fc #800000>LD2 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' |
| |
| {{ :en:multiasm:paarm:neon_ld2.jpg?600 |}} | <figure v2load> |
| | {{ :en:multiasm:paarm:neon_ld2.jpg?600 |2D Vector representation in memory and after loading to the V registers}} |
| | <caption>2D Vector representation in memory and after loading to the V registers</caption> |
| | </figure> |
| |
| The ''<fc #800000>LD3</fc>'' instruction treats memory data as three-element structures, such as an image's RGB data. Three vector registers must be specified for this instruction. Assuming the data in memory is r0, g0, b0, r1, g1, b1, r2, g2, b2, …, the ‘r’ (red) channel is loaded into the first listed register, the ‘g’ (green) channel into the second, and the ‘b’ (blue) channel into the third.\\ | The ''<fc #800000>LD3</fc>'' instruction treats memory data as three-element structures, such as an image's RGB data. Three vector registers must be specified for this instruction. Assuming the data in memory is r0, g0, b0, r1, g1, b1, r2, g2, b2, …, the ‘r’ (red) channel is loaded into the first listed register, the ‘g’ (green) channel into the second, and the ‘b’ (blue) channel into the third.\\ |
| ''<fc #800000>LD3 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' | ''<fc #800000>LD3 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' |
| |
| {{ :en:multiasm:paarm:neon_ld3.jpg?600 |}} | <figure v3load> |
| | {{ :en:multiasm:paarm:neon_ld3.jpg?600 |3D Vector representation in memory and after loading to the V registers}} |
| | <caption>3D Vector representation in memory and after loading to the V registers</caption> |
| | </figure> |
| |
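The de-interleaving performed by ''LD3 {V0.4S, V1.4S, V2.4S}, [X1]'' can be modelled with the short C sketch below. The helper name ''ld3_4s'' is hypothetical, and the sketch assumes twelve consecutive 32-bit elements are readable at ''mem''.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES_4S 4 /* four 32-bit lanes per register */

/* Scalar model of LD3 {V0.4S, V1.4S, V2.4S}, [X1] (hypothetical helper).
 * Memory holds three-element structures back to back:
 *   r0, g0, b0, r1, g1, b1, ...
 * Lane i of each register receives one member of structure i. */
void ld3_4s(uint32_t v0[LANES_4S], uint32_t v1[LANES_4S],
            uint32_t v2[LANES_4S], const uint32_t *mem)
{
    assert(v0 != NULL && v1 != NULL && v2 != NULL && mem != NULL);
    for (size_t lane = 0; lane < LANES_4S; ++lane) {
        v0[lane] = mem[3 * lane + 0]; /* first member, e.g. red   */
        v1[lane] = mem[3 * lane + 1]; /* second member, e.g. green */
        v2[lane] = mem[3 * lane + 2]; /* third member, e.g. blue  */
    }
}
```

''LD2'' and ''LD4'' follow the same pattern with a stride of two and four elements per structure, respectively.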
| |
| ''<fc #800000>LD4 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>, <fc #008000>V3</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' | ''<fc #800000>LD4 </fc> {<fc #008000>V0</fc>.<fc #808000>4S</fc>, <fc #008000>V1</fc>.<fc #808000>4S</fc>, <fc #008000>V2</fc>.<fc #808000>4S</fc>, <fc #008000>V3</fc>.<fc #808000>4S</fc>}, [<fc #008000>X1</fc>]'' |
| |
| {{ :en:multiasm:paarm:neon_ld4.jpg?600 |}} | <figure v4load> |
| | {{ :en:multiasm:paarm:neon_ld4.jpg?600 |4D Vector representation in memory and after loading to the V registers}} |
| | <caption>4D Vector representation in memory and after loading to the V registers</caption> |
| | </figure> |
| |
| The ''<fc #800000>LD4R</fc>'' instruction takes four values from memory and replicates the first value across all lanes of the first vector register, the second value across the second register, and so on. Such instructions are useful when the same vector has to be combined arithmetically with several different scalar values. The ''<fc #800000>LD3R</fc>'' and ''<fc #800000>LD2R</fc>'' instructions work the same way with three and two registers. In ''<fc #800000>//LDn//</fc>'', the ‘n’ identifies the number of vector registers loaded, which always equals the number of elements in each structure. | The ''<fc #800000>LD4R</fc>'' instruction takes four values from memory and replicates the first value across all lanes of the first vector register, the second value across the second register, and so on. Such instructions are useful when the same vector has to be combined arithmetically with several different scalar values. The ''<fc #800000>LD3R</fc>'' and ''<fc #800000>LD2R</fc>'' instructions work the same way with three and two registers. In ''<fc #800000>//LDn//</fc>'', the ‘n’ identifies the number of vector registers loaded, which always equals the number of elements in each structure. |
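The load-and-replicate behaviour of ''LD4R'' can be sketched in C as shown below. The helper name ''ld4r_4s'' and the 4S arrangement are assumptions made for this example, which models ''LD4R {V0.4S, V1.4S, V2.4S, V3.4S}, [X1]''.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of LD4R {V0.4S, V1.4S, V2.4S, V3.4S}, [X1]
 * (hypothetical helper). v[reg] stands for register Vreg.
 * One four-element structure is read from memory, and member `reg`
 * is copied into every lane of register `reg`. */
void ld4r_4s(uint32_t v[4][4], const uint32_t mem[4])
{
    assert(v != NULL && mem != NULL);
    for (size_t reg = 0; reg < 4; ++reg) {
        for (size_t lane = 0; lane < 4; ++lane) {
            v[reg][lane] = mem[reg];
        }
    }
}
```

After the call, each modelled register holds four copies of one scalar, ready to be used as the second operand of a lane-wise instruction.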