Modern Processors: Cache, Pipeline, Superscalar, Branch Prediction, Hyperthreading

Modern processors have a highly complex design and include many units whose primary purpose is to reduce software execution time.

Cache

Cache memory is a layer in the memory hierarchy that sits between main memory and the processor registers. The main reason for introducing cache memory is that main memory, based on DRAM technology, is much slower than the cache, which is based on static RAM (SRAM) technology. The cache exploits two properties of software: spatial locality and temporal locality. Spatial locality results from the fact that the processor executes code, which, in most cases, is a sequence of instructions placed directly one after another, and that data structures usually occupy consecutive addresses. Temporal locality arises because programs often run in loops, repeatedly working on a single set of data over short intervals. In both cases, a larger fragment of a program or its data can be loaded into the cache and operated on without accessing main memory each time. Main memory is, in turn, designed to read and write data in blocks significantly faster than at random addresses. These properties allow a code fragment to be read in its entirety from main memory into the cache and executed without accessing RAM for each instruction. Similarly, the processor performs calculations on data after reading a block into the cache, then stores the results in a single write sequence.
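
As an illustration (a minimal C sketch, not from the original text; it assumes a typical 64-byte cache line), the sequential loop below benefits from both kinds of locality, while the strided variant defeats spatial locality as soon as the stride exceeds the line size:

#include <stddef.h>

/* Sequential traversal: consecutive elements share 64-byte cache lines
 * (spatial locality), and the accumulator and loop code are reused on
 * every iteration (temporal locality). */
double sum_sequential(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];            /* a[i], a[i+1], ... come from one line */
    return sum;
}

/* Strided traversal touches a new cache line on almost every access once
 * stride * sizeof(double) exceeds the line size, so it runs noticeably
 * slower on large arrays although it performs the same additions. */
double sum_strided(const double *a, size_t n, size_t stride)
{
    double sum = 0.0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            sum += a[i];
    return sum;
}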

In modern processors, the cache is divided into several levels, usually three. The first-level cache (L1) is the closest to the processor, the fastest, and is usually split into separate instruction and data caches. The second-level cache (L2) is shared between code and data, slower, and usually larger than L1. The largest and slowest is the third-level cache (L3), which is closest to the computer's main memory. Typical latencies are 1-2 ns for L1, 3-5 ns for L2, and 10-20 ns for L3. An example Intel Core i7 processor has, for each core, 32 KB of L1 instruction cache, 32 KB of L1 data cache, and 256 KB of L2 cache holding both code and data. The 8 MB third-level cache (L3) is shared by all cores.

Besides size, the important cache parameters are line length and associativity. The line length, usually expressed in bytes, is the size of the smallest fragment of data the cache stores and transfers; it also determines at which main-memory addresses such a fragment can start. For example, if the cache line length is 64 bytes and memory is byte-organised, a block of memory copied to the cache always starts at an address evenly divisible by 64. Associativity tells how many cache lines can be used to store the block from a specific address. If the block can go to any cache line, the cache is called fully associative. If there is only one possible location, the cache is called direct-mapped. A fully associative cache is more flexible but complex and expensive. A direct-mapped cache is simple, but it causes conflicts when two blocks of memory that map to the same cache line are in use at the same time. Real processors often implement a compromise that allows each block to be stored in 2, 4, or 8 different cache lines, called a 2-, 4-, or 8-way set-associative cache. This solution significantly reduces conflicts and ensures good performance at a reasonable cost.
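
To make the mapping concrete, the sketch below decomposes an address for a hypothetical 32 KB, 8-way set-associative cache with 64-byte lines (so 64 sets); the geometry and the example address are assumptions for illustration only. The block may then be placed in any of the 8 ways of its set:

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  64u                               /* bytes per cache line */
#define WAYS        8u                               /* associativity        */
#define CACHE_SIZE (32u * 1024u)                     /* total capacity       */
#define NUM_SETS   (CACHE_SIZE / (WAYS * LINE_SIZE)) /* = 64 sets            */

int main(void)
{
    uint64_t addr   = 0x12345678ULL;                 /* arbitrary example address */
    uint64_t offset = addr % LINE_SIZE;              /* byte within the line      */
    uint64_t set    = (addr / LINE_SIZE) % NUM_SETS; /* which set it maps to      */
    uint64_t tag    = addr / (LINE_SIZE * NUM_SETS); /* identifies the block      */

    printf("address %#llx -> offset %llu, set %llu, tag %#llx\n",
           (unsigned long long)addr, (unsigned long long)offset,
           (unsigned long long)set, (unsigned long long)tag);
    return 0;
}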

Pipeline

As described in the previous chapter, executing a single instruction requires many actions to be performed by the processor. We could see that each step, or even substep, can be performed by a separate logical unit. Designers of modern processors have used this feature to create a pipeline in which instructions are executed. A pipeline is a collection of logical units that process many instructions simultaneously, each at a different stage of execution. If the instructions arrive in a continuous stream, the pipeline allows the program to execute faster than on a processor without one. Note that the pipeline does not reduce the execution time of a single instruction; it increases the throughput of the instruction stream.
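
The gain can be quantified with an idealised model (an illustration, not taken from the source): assume a k-stage pipeline in which every stage takes one clock cycle and no stalls occur. Executing n instructions then takes

\[ T_{\text{seq}} = n \cdot k \qquad \text{versus} \qquad T_{\text{pipe}} = k + (n - 1) \]

cycles without and with pipelining, respectively. For k = 4 and n = 1000 this gives 4000 versus 1003 cycles, a speedup of about 3.99, approaching k for long instruction streams.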

A simple pipeline is implemented in AVR microcontrollers. It has two stages, which means that while one instruction is executed, another one is fetched, as shown in Fig. 1.

Figure 1: Simple 2-stage pipeline in an AVR microcontroller

The classic 8086 processor executed each instruction in four steps. This allowed for the implementation of a 4-stage pipeline, as shown in Fig. 2.

Figure 2: 4-stage pipeline in the 8086 microprocessor

Modern processors implement longer pipelines. For example, the Pentium III used a 10-stage pipeline, the Pentium 4 a 20-stage pipeline, and the Pentium 4 Prescott even a 31-stage pipeline. Does a longer pipeline mean faster program execution? Everything has benefits and drawbacks. The undoubted benefit of a longer pipeline is that more instructions can be executed simultaneously, yielding higher instruction throughput. The problem appears when branch instructions arrive. When a conditional jump appears in the instruction stream, the processor must choose which way the stream should follow: should the jump be taken or not? The answer usually depends on the result of a preceding instruction and is known only when the branch instruction is close to the end of the pipeline. In this situation, in modern processors, the branch prediction unit guesses how to handle the branch. If the guess is wrong, the pipeline content is invalidated, and the pipeline starts filling from the beginning. This causes stalls in program execution, and the longer the pipeline, the more instructions must be invalidated. In modern microarchitectures, the pipeline length ranges from 12 to 20 stages.
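
The cost of such flushes can be estimated with a simple model (an illustration; the symbols below are assumptions, not figures from the text). If a fraction f_branch of instructions are branches, a fraction r_miss of those is mispredicted, and every misprediction wastes c_penalty cycles refilling the pipeline, the average overhead is

\[ \text{stall cycles per instruction} = f_{\text{branch}} \cdot r_{\text{miss}} \cdot c_{\text{penalty}} \]

For example, with 20% branches, a 10% misprediction rate, and a 15-cycle refill, the overhead is 0.20 · 0.10 · 15 = 0.3 extra cycles per instruction, a significant tax on a core that otherwise retires several instructions per cycle.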

Superscalar

The superscalar processor increases program execution speed by executing more than one instruction per clock cycle. This is realised by simultaneously dispatching instructions to different execution units of the processor. A superscalar processor can, but does not have to, implement two or more independent pipelines. Rather, decoded instructions are sent to the chosen execution unit for further processing, as shown in Fig. 3.

Figure 3: Superscalar architecture of a pipelined processor

In the x86 family, the first processor with two execution paths was the Pentium, which had two execution units called U and V. Modern x64 processors like the Intel Core i7 implement six execution units. Not all execution units have the same functionality; for example, in the i7 processor, each execution unit has different capabilities, as shown in Table 1. A short code sketch exploiting this parallelism follows the table.

Table 1: Execution units of the i7 processor

Execution unit | Functionality
0 | Integer calculations, floating-point multiplication, SSE multiplication, divide
1 | Integer calculations, floating-point addition, SSE addition
2 | Address generation, load
3 | Address generation, store
4 | Data store
5 | Integer calculations, branch, SSE addition
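
To see why this matters when writing code, consider the following sketch (an illustration, not from the original text). Both functions compute the same dot product, but the second maintains two independent dependency chains, letting a superscalar core dispatch two multiply-add streams to different execution units in the same clock cycle:

#include <stddef.h>

/* One dependency chain: every addition waits for the previous one, so a
 * single execution unit does all the work in sequence. */
double dot_single(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Two independent accumulators break the chain; a superscalar core can
 * keep two execution units busy at once and combine the results at the end. */
double dot_dual(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                      /* leftover element for odd n */
        s0 += a[i] * b[i];
    return s0 + s1;
}

Optimising compilers often apply this transformation (and wider unrolling) automatically, but it illustrates the kind of independent work a superscalar core needs to keep its units busy.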

Branch prediction

As mentioned, the pipeline must be invalidated if a conditional branch is not properly predicted. The branch prediction unit is used to guess the outcome of conditional branch instructions. It helps to reduce delays in program execution by predicting the path the program will take. The prediction is based on historical data and program execution patterns. There are many methods of predicting branches. In general, the processor maintains a buffer with the addresses of the last few branch instructions, together with a history register for each branch. Based on this history, the branch prediction unit guesses whether the branch will be taken.
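
The quality of prediction is easy to observe with a classic experiment, sketched below (an illustration, not from the original text; the measured difference depends on the machine and compiler settings). The same loop runs twice over the same values: with random data the condition flips unpredictably and mispredictions dominate, while after sorting the predictor quickly learns the pattern and the loop runs markedly faster:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1u << 20)

static int v[N];

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* The 'if' below is the branch whose outcome the predictor must guess. */
static long long count_big(const int *v, size_t n)
{
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] >= 128)
            sum += v[i];
    return sum;
}

int main(void)
{
    for (size_t i = 0; i < N; i++)
        v[i] = rand() % 256;            /* taken ~50% of the time, randomly  */

    long long r1 = count_big(v, N);     /* unpredictable pattern: many misses */
    qsort(v, N, sizeof v[0], cmp_int);
    long long r2 = count_big(v, N);     /* sorted: predictor learns the split */

    printf("%lld %lld\n", r1, r2);      /* same result both times */
    return 0;
}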

Hyperthreading

Hyper-Threading Technology is Intel's implementation of simultaneous multithreading, which allows the operating system to execute more than one thread on a single physical core. For each physical core, the operating system sees two logical processor cores and shares the load between them when possible. Hyperthreading uses the superscalar architecture to increase the number of instructions that operate in parallel in the pipeline on separate data. With Hyper-Threading, one physical core appears to the operating system as two separate processors. The logical processors share the execution resources, including the execution engine, caches, and system bus interface. Only the elements that store the processor's architectural state, including the essential registers used for code execution, are duplicated.
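
The doubling is visible from software. A minimal sketch, assuming a POSIX system (on Windows, GetSystemInfo reports the same figure), asks the OS how many logical processors it can schedule threads on:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Logical processors visible to the scheduler; on a hyperthreaded CPU
     * this is typically twice the number of physical cores. */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors: %ld\n", logical);
    return 0;
}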

The real path of instruction processing is much more complex. Additional techniques, such as out-of-order execution and register renaming, are implemented to improve performance. They are performed automatically by the processor, and the assembly programmer has no direct influence on their behaviour.