Intel Pentium IV Guide
Written by Tuan "Solace" Nguyen
Saturday, December 09, 2000

Advanced Dynamic Execution

With the Pentium 4’s 20-stage hyper pipeline, processing instructions should be a breeze for the CPU, right? Wrong.

Starting with the original Pentium, Intel introduced a technique called Branch Prediction. This technique uses complex prediction algorithms to minimize latency and increase efficiency: the processor actually tries to predict which path a branch instruction will take before the outcome is known. What if a branch mispredict happened? The instructions on the correct path would have to be processed from the start of the pipeline. Usually the processor is correct more than 93% of the time, and it didn’t take the Pentium long to correct itself since it only had a 5-stage pipeline.
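To make the idea of "predicting a path" concrete, here is a minimal sketch of a textbook 2-bit saturating counter predictor. It is only an illustration of the general technique and is not Intel's actual design; the class name, the branch address 0x40 and the loop pattern are all made up for the example.

```python
class TwoBitPredictor:
    """Toy branch predictor: one 2-bit saturating counter per branch address."""

    def __init__(self):
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        self.counters = {}

    def predict(self, address):
        # Unknown branches start at 1 (weakly "not taken").
        return self.counters.get(address, 1) >= 2

    def update(self, address, taken):
        # Move the counter toward the real outcome, saturating at 0 and 3.
        value = self.counters.get(address, 1)
        value = min(value + 1, 3) if taken else max(value - 1, 0)
        self.counters[address] = value


def accuracy(predictor, history):
    """Fraction of (address, taken) outcomes predicted correctly."""
    correct = 0
    for address, taken in history:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct / len(history)


# Hypothetical loop branch: taken 9 times, then falls through, repeated 10 times.
loop_history = [(0x40, i % 10 != 9) for i in range(100)]
loop_accuracy = accuracy(TwoBitPredictor(), loop_history)
print(f"accuracy on the loop branch: {loop_accuracy:.0%}")
```

Even this simple scheme gets most of a regular loop right, because a single surprise (the loop exit) is not enough to flip its prediction. Real predictors keep far more history than this.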

The Pentium 4, however, has a 20-stage pipeline. If a branch mispredict is detected near the end of the pipeline, it hits performance far more significantly, because the processor has to return all the way to the beginning and start all over. Intel realized this and thus implemented a system of complex caching and buffers to avoid such cases.
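A rough back-of-the-envelope model shows why the deeper pipeline hurts more. The numbers below are assumptions for illustration only: one instruction in five is a branch, 7% of branches mispredict (the flip side of the 93% figure above), and a mispredict costs about one full pipeline refill. The function name and parameters are hypothetical.

```python
def average_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once misprediction flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Same assumed workload on both pipelines; only the refill penalty differs.
short_pipe = average_cpi(1.0, 0.20, 0.07, flush_penalty_cycles=5)
long_pipe = average_cpi(1.0, 0.20, 0.07, flush_penalty_cycles=20)
print(f"5-stage pipeline:  {short_pipe:.2f} cycles per instruction")
print(f"20-stage pipeline: {long_pipe:.2f} cycles per instruction")
```

With the same prediction accuracy, the 20-stage design pays four times as much per mispredict in this simplified model, which is why the rest of the architecture works so hard to hide that cost.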


I always thought it went A B C D

The Advanced Dynamic Execution engine is a very deep, out-of-order speculative execution engine that keeps the execution units constantly processing instructions. It does this by providing a large pool of instructions from which the execution units can choose. Some instructions can be executed before others, depending on their function. Others that are dependent on the result of another instruction cannot be executed out of order. It would be like chewing on gum before you actually placed it into your mouth -- not possible. An instance where one instruction relies on the outcome of another is called a dependency. Dependencies obviously need time to resolve and thus slow down the pipeline flow. One of the more common stalls of this kind is waiting for data to be loaded from memory on a cache miss. The NetBurst architecture can hold up to 126 instructions in this pool. Compare this to the P6’s much smaller pool of 42 instructions and you begin to see the great potential of this processor.
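The pool-and-dependency idea can be sketched with a toy scheduler. This is a heavily simplified illustration, not how NetBurst is actually built: the instruction names, latencies and the two execution units are all assumptions, and the sketch expects every dependency to name an instruction in the pool.

```python
def schedule(instructions, units=2):
    """Issue up to `units` ready instructions per cycle from a pool.

    `instructions` maps a name to (latency_in_cycles, [names it depends on]).
    Returns a dict of name -> cycle in which it was issued.
    """
    issued_at = {}
    finished_at = {}
    cycle = 0
    while len(issued_at) < len(instructions):
        # An instruction is ready once every result it needs is available.
        ready = [
            name
            for name, (_, deps) in instructions.items()
            if name not in issued_at
            and all(finished_at.get(d, float("inf")) <= cycle for d in deps)
        ]
        for name in ready[:units]:
            latency, _ = instructions[name]
            issued_at[name] = cycle
            finished_at[name] = cycle + latency
        cycle += 1
    return issued_at


# Program order: load, add (needs the load), mul and sub (both independent).
program = {
    "load": (4, []),         # slow, e.g. waiting on a cache miss
    "add": (1, ["load"]),    # dependent: must wait for the load
    "mul": (1, []),
    "sub": (1, []),
}
order = schedule(program)
print(order)
```

The dependent add has to wait for the slow load, but the independent mul and sub slip ahead of it and keep the execution units busy in the meantime. A bigger pool simply gives the engine more chances to find work like that.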

You can see what a branch mispredict can do to a processor’s efficiency. The Pentium 4’s Advanced Dynamic Execution engine helps speed things up again when something like a mispredict occurs. But this isn’t the only weapon up Intel’s sleeve.

Rapid Execution Engine

Intel has left common ground and created a very distinct arithmetic unit. Through a combination of architectural, physical and circuit design techniques, the Pentium 4’s ALUs run at twice the frequency of the processor core. So our 1.5GHz Pentium 4 is executing arithmetic and logic operations at a blazing 3GHz. This allows the ALUs to execute certain instructions with a latency that is half the duration of a core clock, and the results are higher execution throughput as well as reduced latency for arithmetic and logic operations.
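The arithmetic behind the "double pumped" claim is simple enough to check. This is just a sketch of the numbers quoted above; the function names are made up for the example.

```python
def alu_clock_ghz(core_ghz, multiplier=2):
    """ALU frequency when the ALUs run at `multiplier` times the core clock."""
    return core_ghz * multiplier


def cycle_time_ns(frequency_ghz):
    """Duration of one clock period in nanoseconds."""
    return 1.0 / frequency_ghz


core = 1.5
alu = alu_clock_ghz(core)
print(f"core: {core} GHz, period {cycle_time_ns(core):.3f} ns")
print(f"ALU:  {alu} GHz, period {cycle_time_ns(alu):.3f} ns")
```

Doubling the frequency halves the clock period, which is where the "half the duration of a core clock" latency comes from.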


Double the ALU performance

With this technique, the Pentium 4 can have results waiting before it even requires them. This puts critical instructions an arm’s reach away and speeds up mathematically intense applications like those that deal with 3D rendering in real time. Do I sense Quake 3 fragging, anyone?


The Rapid Execution Engine ties back to the Hyper Pipeline technology -- almost everything does. The long pipeline enables high processor speeds, but it is a double-edged sword: a branch mispredict is rare, yet when it happens the whole pipeline has to be refilled. There are two reasons why Intel chose to run the Pentium 4’s ALUs at double the speed of the core: performance and mispredict recovery.


Dance baby dance!

Performance is an easy one. Twice the speed obviously means better performance. However, there is a more serious reason to “double pump” the ALUs. While the Pentium 4’s branch prediction is highly advanced and efficient, it cannot predict branches that depend on integer results very effectively. With a very long arithmetic instruction, the result can be anything, and thus becomes extremely difficult to predict. This means that when doing integer-related computations, the Pentium 4 will mispredict branches more often than usual -- with 20 stages of repeated wrong predictions, things could get a little uneasy. Thankfully, we have ALUs that run at a blistering speed that is double the core speed. Things become interesting when Intel increases the Pentium 4’s speed. Imagine the speed increases... For every MHz that the core goes up, the ALUs go up by two. Strategic, isn’t it?

Cache Flow

Besides being a potential Intel cash cow, the Pentium 4 invests silicon in advanced L1 caching techniques. We’ve all seen the advantages of having cache that is integrated right on the core of the processor. Now we have improvements in L1 cache as well. Having a 20-stage pipeline definitely requires cache that helps with the issues associated with branch mispredicts.

Execution Trace Cache

The Execution Trace Cache is a more advanced way of implementing L1 instruction cache. The Trace Cache keeps a record of decoded x86 instructions, thus removing the latency associated with the instruction decoder from the main execution loops. Another advantage is that the Trace Cache stores these instructions in the order of the program’s execution flow! This increases the instruction flow from the cache and makes better use of cache storage space. This way the cache no longer stores instructions that are branched over but never executed, which would be a waste of L1 cache space. This is another technique the Pentium 4 employs to recover quickly from branches that have been mispredicted.
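Here is a toy sketch of the two ideas in that paragraph: decode once and reuse the result, and store only the instructions on the path that was actually executed. It is a drastic simplification and not Intel's implementation; the class, the tiny instruction memory and the "uop(...)" strings are all hypothetical stand-ins.

```python
class ToyTraceCache:
    """Caches already-decoded instructions in the order they were executed."""

    def __init__(self):
        self.traces = {}       # start address -> list of decoded operations
        self.decode_calls = 0  # how often the (slow) decoder had to run

    def decode(self, raw):
        # Stand-in for the expensive x86 instruction decoder.
        self.decode_calls += 1
        return f"uop({raw})"

    def fetch(self, start, executed_path, memory):
        """Return decoded operations for the path that begins at `start`."""
        if start not in self.traces:
            # Miss: decode only the instructions on the executed path,
            # skipping anything that was branched over.
            self.traces[start] = [self.decode(memory[a]) for a in executed_path]
        return self.traces[start]


memory = {0: "cmp", 1: "jne 4", 2: "mov", 3: "inc", 4: "add", 5: "ret"}
path = [0, 1, 4, 5]  # the branch at address 1 is taken, so 2 and 3 are skipped

cache = ToyTraceCache()
first = cache.fetch(0, path, memory)
second = cache.fetch(0, path, memory)  # served from the cache, no decoding
print(first, cache.decode_calls)
```

The second fetch never touches the decoder, and the skipped instructions at addresses 2 and 3 take up no room in the cache. The real hardware also has to handle a branch going the other way, which this sketch ignores.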

Advanced Transfer Cache

Like the Pentium III, the Pentium 4 also comes with 256K of on-die L2 cache, but the Pentium 4’s implementation is more efficient. The Pentium 4’s L2 cache has a 256-bit interface that transfers data on each clock cycle. As a result, a 1.4GHz Pentium 4 can move data at a blistering 44.8GB/sec. The Pentium III pales in comparison with a bandwidth of only 16GB/sec. This also helps keep the cache full so that the processor doesn’t have to look in system memory, where things are much slower.
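The 44.8GB/sec figure falls straight out of the bus width and the clock speed, as this small sketch shows. The function name is made up. The Pentium III line is an assumption: a 1GHz part with one transfer every other clock is not stated in the text and is used only because it reproduces the 16GB/sec quoted above.

```python
def cache_bandwidth_gb_per_s(bus_width_bits, clock_ghz, transfers_per_clock=1.0):
    """Peak bandwidth in GB/sec for a bus of the given width and clock."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_ghz * transfers_per_clock


# 256 bits = 32 bytes moved on every clock of a 1.4GHz core.
p4 = cache_bandwidth_gb_per_s(256, 1.4)

# ASSUMPTION: 1GHz Pentium III, 256-bit bus, one transfer every other clock.
p3 = cache_bandwidth_gb_per_s(256, 1.0, transfers_per_clock=0.5)

print(f"Pentium 4 at 1.4GHz:  {p4:.1f} GB/sec")
print(f"Pentium III (assumed): {p3:.1f} GB/sec")
```

Because the bus is tied to the core clock, the peak bandwidth scales up with every speed bump of the processor.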



All this cache is fine and dandy, but where would Intel be without extra performance-enhancing instruction sets, right?