AMD's Hammer Architecture - Making Sense of it All
by Anand Lal Shimpi on October 23, 2001 2:57 AM EST - Posted in CPUs
She's got a nice pipeline
With the introduction of the Pentium 4, it became fashionable to talk about how many stages are in a processor's pipeline, and the idea of a longer pipeline ended up with a negative connotation in the eyes of many. Because the Hammer design contrasts so sharply with Intel's NetBurst architecture, which powers the Pentium 4, we thought it best to start where the two paths diverge the most: their pipelines.
AMD Integer Pipeline Comparison

| Clock Cycle | K7 Architecture | Hammer Architecture |
|---|---|---|
| 1 | Fetch | Fetch 1 |
| 2 | Scan | Fetch 2 |
| 3 | Align 1 | Pick |
| 4 | Align 2 | Decode 1 |
| 5 | EDEC | Decode 2 |
| 6 | IDEQ/Rename | Pack |
| 7 | Schedule | Pack/Decode |
| 8 | AGU/ALU | Dispatch |
| 9 | L1 Address Generation | Schedule |
| 10 | Data Cache | AGU/ALU |
| 11 | - | Data Cache 1 |
| 12 | - | Data Cache 2 |
What you're looking at in the above table is the basic integer pipeline for K7-based processors and upcoming Hammer-based processors. A floating point instruction would take a longer path, but for our purposes these integer pipelines will work just fine.
The first thing you'll notice is that the Hammer has 2 extra stages in its pipeline compared to the K7 (12 versus 10). This 20% increase is clearly there to allow for higher clock speeds; to see why, we'll have to look at the nature of the stages themselves.
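To make the stage arithmetic concrete, here is a minimal Python sketch that counts the stages in each pipeline and computes the relative increase. The stage names are transcribed from the table above; the variable and function names are our own and purely illustrative.

```python
# Stage names transcribed from the integer pipeline table in this article.
# The last two Hammer stages (Data Cache 1 and 2) are read as Hammer-only rows.
K7_STAGES = [
    "Fetch", "Scan", "Align 1", "Align 2", "EDEC",
    "IDEQ/Rename", "Schedule", "AGU/ALU",
    "L1 Address Generation", "Data Cache",
]
HAMMER_STAGES = [
    "Fetch 1", "Fetch 2", "Pick", "Decode 1", "Decode 2",
    "Pack", "Pack/Decode", "Dispatch", "Schedule", "AGU/ALU",
    "Data Cache 1", "Data Cache 2",
]


def percent_increase(old_depth, new_depth):
    """Relative growth in stage count, as a percentage."""
    return 100.0 * (new_depth - old_depth) / old_depth


extra = len(HAMMER_STAGES) - len(K7_STAGES)
growth = percent_increase(len(K7_STAGES), len(HAMMER_STAGES))
print(f"K7: {len(K7_STAGES)} stages, Hammer: {len(HAMMER_STAGES)} stages")
print(f"Extra stages: {extra} ({growth:.0f}% deeper)")
```

Two extra stages on a 10-stage pipeline is where the 20% figure comes from.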
At the start of this overview we talked about how today's high performance x86 processors obtained their standings by working around the limitations of the x86 ISA. One of the most common practices employed today is to take conventional x86 instructions and decode them into smaller operations that can be executed much more quickly. As we found out in our investigation of the Pentium 4, these decode stages are actually very complex and heavily influence the end performance of a processor.
Those who are intimately familiar with the K7 architecture know that some of its decoding stages can vary depending on the type of instruction being decoded, but to keep things as simple as possible we have purposely omitted those alternate stages; they don't change the analysis.
The two architectures first diverge in the 2nd stage, where the Hammer has a second Fetch stage. This second fetch stage can be considered a transit stage; its purpose is to move the instruction that is to be executed from the instruction cache to the decoders. Intel has similar stages in NetBurst that allow data to be moved across the chip; the sole purpose of these types of stages is to increase clock speed.
The pick stage readies the instruction(s) for the first decode stages; it is much like the align stage from the K7 in that it tries to send as many independent instructions to the execution units as possible. The Decode 1 & 2 stages don't actually decode the instructions into the smaller operations (AMD calls them Macro-Ops); instead, these two stages only gather information about the instructions. This is much like the Early Decode (EDEC) stage on the K7, where the correct instruction path is chosen (direct or vector) before the actual decode. The only difference here is that this early decode phase takes two cycles on the Hammer, which again allows the CPU to ramp up in clock speed.
The pack stage then takes the information from the previous two decode stages and readies the instruction to actually be decoded into macro-ops. The macro-ops are then dispatched and scheduled before they enter the execution units, and finally it's off to the L1 cache, which holds true for both architectures.
As you can see, there is only a difference of two stages between the basic integer pipelines of the Hammer and the K7. The benefit of this is that the IPC of the Hammer is not tremendously reduced by the lengthening of the pipeline; remember that the Pentium 4's pipeline was increased by 100% over the P6 pipeline, whereas here we're only talking about a 20% increase. The flip side is that, with a much shorter pipeline, AMD probably won't be able to come close to the clock speeds that NetBurst will allow in the next few years.
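The IPC argument can be illustrated with a toy model in Python. This is only a sketch under loudly stated assumptions: it charges each mispredicted branch a penalty equal to the full pipeline depth, and the branch frequency, misprediction rate, and base CPI are hypothetical placeholders rather than measurements of any real processor. It also applies both the 20% and the 100% growth to the same 10-stage baseline, which simplifies the article's K7-to-Hammer and P6-to-Pentium 4 comparison.

```python
def cycles_per_instruction(depth, base_cpi=1.0,
                           branch_fraction=0.2, mispredict_rate=0.05):
    """Toy CPI model: each mispredicted branch costs roughly `depth` cycles.

    All default values are hypothetical placeholders, not measurements.
    """
    return base_cpi + branch_fraction * mispredict_rate * depth


BASELINE_DEPTH = 10  # K7 integer pipeline depth, from the table in this article


def ipc_loss_percent(growth_percent):
    """IPC lost, in percent, when the baseline pipeline grows by growth_percent."""
    new_depth = BASELINE_DEPTH * (1 + growth_percent / 100.0)
    old_ipc = 1.0 / cycles_per_instruction(BASELINE_DEPTH)
    new_ipc = 1.0 / cycles_per_instruction(new_depth)
    return 100.0 * (old_ipc - new_ipc) / old_ipc


for growth in (20, 100):
    loss = ipc_loss_percent(growth)
    print(f"+{growth}% stages -> about {loss:.1f}% lower IPC in this toy model")
```

The absolute numbers mean little, since they depend entirely on the assumed parameters. What the sketch shows is the direction of the effect: in this model a 20% deeper pipeline costs far less IPC than a 100% deeper one. It says nothing about the clock speed each design can reach.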