35 Comments
SarahKerrigan - Tuesday, June 22, 2021 - link
Considering the comparison point is Cortex-A75, this still puts top-end RV apps cores a long way behind top-end ARM licensables. But they certainly seem to be improving.
Also, Doc Ian, it looks like in the last paragraph you refer to Horse Creek as Horse Ridge, which is a very different product (quantum control circuit). I assume this is just a typo, since it's sorta baffling otherwise.
Ian Cutress - Tuesday, June 22, 2021 - link
I've written Horse Ridge so often, it's in my brain. Fixed that up.
FreckledTrout - Tuesday, June 22, 2021 - link
Schrödinger's Horse :)
eastcoast_pete - Wednesday, June 23, 2021 - link
Doesn't help that some will now say that Intel's future is indeed "up the creek(s)".
eastcoast_pete - Wednesday, June 23, 2021 - link
Although the RV "performance core" appears to be much closer to ARM's LITTLE cores in size/area/transistor count. And, unlike them, it's OOO, something ARM has been resistant to implementing even in their newest designs.
Small Bison - Tuesday, June 22, 2021 - link
“The new SiFive Performance P550 core at the heart of Horse Creek is SiFive’s highest performance processor to date, with the company quoting a SPEC2006int of 8.65 per GHz.”
How does this compare against other common CPU designs? I think it would be useful to have that info in the article.
SarahKerrigan - Tuesday, June 22, 2021 - link
SiFive is saying the comparison point is the Cortex-A75. That makes it fairly fast by the standard of apps processors as a whole, but far behind current ARM licensable IP or mainstream cores from Intel, AMD, or Apple.
brucehoult - Tuesday, June 22, 2021 - link
Four years and one month behind the A75 announcement, to be precise.
The U84 was announced 4.75 years after the A72. The U74 was announced 6 years after the A53.
So far, each generation of SiFive cores is about one less year behind ARM than the previous one.
Remembering that SiFive was founded after the A72 was announced.
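The announcement gaps quoted in this comment can be tabulated to check the "about one less year" trend. This is a minimal sketch that uses only the three figures given above; the labels and variable names are illustrative.

```python
# Announcement lag behind the comparable ARM core, in years,
# using only the figures quoted in the comment above.
lags = [
    ("U74 vs A53", 6.0),
    ("U84 vs A72", 4.75),
    ("P550 vs A75", 4 + 1 / 12),  # four years and one month
]

# How much the lag shrank from one SiFive generation to the next.
shrink = [
    round(prev[1] - cur[1], 2)
    for prev, cur in zip(lags, lags[1:])
]

for (name, lag), delta in zip(lags[1:], shrink):
    print(f"{name}: {lag:.2f} years behind, {delta:.2f} years less than before")
```

The two steps come out at 1.25 and 0.67 years, which averages close to the one year per generation described.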
Ian Cutress - Tuesday, June 22, 2021 - link
We've been publishing graphs with this data for a while.
https://www.anandtech.com/show/16226/apple-silicon...
https://www.anandtech.com/show/16214/amd-zen-3-ryz...
A13 scores 19.85/GHz
A14 scores 21.1/GHz
R9 5950X scores 13.57/GHz
i9-10900K scores 11.08/GHz
P550 is probably peak score/GHz, whereas all of these points are at the most extreme/least efficient point for the high-end products. The overall performance also matters when you factor in frequency, as cores won't scale linearly with core frequency as other parts of the SoC will be fixed.
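To put the per-GHz figures side by side, the listed scores can be divided by the P550's quoted 8.65 per GHz. This is only a rough sketch using the numbers in this comment; as noted above, the P550 figure is probably a peak value while the others were taken at the least efficient operating point, so the ratios are indicative at best.

```python
# SPEC2006int per GHz as quoted in the thread.
P550 = 8.65
scores = {
    "A13": 19.85,
    "A14": 21.1,
    "R9 5950X": 13.57,
    "i9-10900K": 11.08,
}

# Ratio of each core's per-GHz score to the P550's quoted figure.
ratios = {name: round(score / P550, 2) for name, score in scores.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the P550 per GHz")
```

Per-GHz ratios say nothing about achievable clock speed, so they should not be read as whole-chip performance ratios.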
brucethemoose - Tuesday, June 22, 2021 - link
I wonder if cloud vendors would be interested in a giant package stuffed full of these cores? Small, isolated VMs don't need tons of core-cluster to core-cluster communication.
I understand this design probably can't do that... and if there was interest, I suppose Altera or someone would've come out with a design based on smaller ARM cores.
SarahKerrigan - Tuesday, June 22, 2021 - link
Glad to see someone appreciating TileGX/TileMX.
brucethemoose - Tuesday, June 22, 2021 - link
That's pretty much the exact opposite of what I was thinking though.
I would propose stuffing as many 4-core dies as one can cram into a giant package, and having them all hang off an EPYC-style IO die built on a cheaper (22nm?) process.
Inter-cluster speeds would be anemic, but that doesn't really matter for a bunch of VMs partitioned to use 1 or 2 tiles each. And one wouldn't have to target some exotic VLIW ISA.
whatthe123 - Tuesday, June 22, 2021 - link
that sounds insanely inefficient. epyc is already memory/bandwidth starved; adding in a bunch of small clusters of cores running off that type of I/O is just going to cripple performance any time there are multiple users. may as well just buy epyc, as it's already over the limit of what that type of I/O system can handle without moving to more expensive fabric.
MikeMurphy - Tuesday, June 22, 2021 - link
Intel sold XScale in 2006 to promote x86 reliance. As it turns out, the world is no longer reliant on x86 and Intel again finds itself behind.
MiauwMing - Wednesday, July 7, 2021 - link
Intel was the first one who tried to kill x86 with IA-64 architecture back in 2001. And Intel is currently producing 10nm ARM SOC for their FPGA products (since 2019).
x86 is still going strong in Desktop/Laptop space. Soon it will also power the fastest (exascale) supercomputers with Aurora (1 exaflop, Intel), Frontier (1.5 exaflop, AMD) and El Capitan (2 exaflop, AMD). The Arm-based Japanese supercomputer is at 0.442 exaflop.
3arn0wl - Tuesday, June 22, 2021 - link
Intel might do well to spend some time developing a RISC-V translation layer for X Org apps, as Apple have done with Rosetta 2 for their M1 silicon.
michael2k - Tuesday, June 22, 2021 - link
I'm not sure that makes any sense. Rosetta 2 is a translation from x86->ARM
So what you're proposing is an x86->RiscV translator?
That aside, why the focus on XOrg? XWindows isn't exactly the biggest player in Linux any more; Android dwarfs all the other Linux installs, and in that space you want ARM->RiscV!
3arn0wl - Tuesday, June 22, 2021 - link
Enterprise is Intel's bread and butter. IF they really are going down the RISC-V route in a serious way - and I think they should - then they're going to want to keep all their current customers. The reality is that users like the OSs and apps they know, and are slow to adapt... X86 -> RISC-V ensures that the apps people want are immediately available to them. ARM -> RISC-V doesn't.
Of course, as with M1, native apps would run much quicker, and I'm sure that FOSS devs are keen to ensure that their software is running well on RISC-V.
mode_13h - Wednesday, June 23, 2021 - link
"X Org" must've been a case of getting the mental wires crossed? It only makes sense if 3arn0wl meant "x86".
Wereweeb - Tuesday, June 22, 2021 - link
Welcome to the 21st century, Intel. Now, hands off SiFive please.
lmcd - Tuesday, June 22, 2021 - link
Delusional to claim that the volume leader in multiple essential categories isn't "in this century."
Spunjji - Wednesday, June 23, 2021 - link
Their statement is daft, but it's equally daft to cite sales volume when the claim was about technology levels.
TeXWiller - Wednesday, June 23, 2021 - link
I'm thinking edge, embedded and government plays here. The EU (eventually) and India come to mind. Additionally, a new generation of students in the US and maybe in other countries have been gaining their basics in the RISC-V world so there is the talent recruiting angle as well.
FunBunny2 - Wednesday, June 23, 2021 - link
yet another load-store cpu (with countless layers of buffers/caches) and dram speed NVME just in sight. this makes sense? load-store made sense only when the speed differential was glacial. those days are numbered. yes, Intel didn't back up Optane with a processor that could really leverage App Direct mode. not to mention OS transaction support.
never said it would be simple, but then it took quite a while for load-store to gain hegemony.
MetalPenguin - Thursday, June 24, 2021 - link
Even DRAM speed is still 2 orders of magnitude slower than L1 caches in modern CPUs. They are still going strong for a reason.
mode_13h - Friday, June 25, 2021 - link
Yeah, FunBunny2's post is a lot to pick apart. I think most of us decided not to bother.
As far as I can see, the only real alternative to load/store is computational memory, which is finally gaining some traction for AI. One approach is characterized by Samsung's compute-embedded HBM2, merging computation into the memory. The other would be how lots of AI accelerators have tons of on-die SRAM, which moves the data closer to the compute elements.
SarahKerrigan - Saturday, June 26, 2021 - link
So looking at what SiFive has actually announced, I don't think the P550 has vector support like the article says - though the lower-end P270 does (they've launched a couple of smaller vector-capable cores, including the X280, but still nothing at the high end). SiFive explicitly states the P270 is a vector core everywhere they can, but doesn't say the same about P550 anywhere that I can find.
Wilco1 - Saturday, June 26, 2021 - link
Yes I think you are right. That means it might be another year (late 2023, or 2024) before there will be an OoO RISC-V core with a vector extension... This also makes the area and performance claims even more dubious. Recent improvements in GCC mean the performance benefit of a vector unit is much higher than ever before.
MetalPenguin - Sunday, June 27, 2021 - link
I think the mention of vector support was a mistake. But why would that make you doubt the area and performance numbers from SiFive's official material?
The area comparisons were against the A75, which doesn't have SVE. Just two 64-bit NEON/FP pipes I believe. The P550 also has 64-bit FP support, and according to older Linley material on U84 it also had two 64-bit FP pipes (just no SIMD). Granted the A75 supports a larger variety of instructions, but the area comparison doesn't seem that unreasonable to me.
Wilco1 - Sunday, June 27, 2021 - link
SIMD units add significant area so comparing cores with/without isn't a fair comparison. SIMD units also improve performance in many popular benchmarks. Geekbench results for recently released HiFive Unmatched (which uses U74) are over 7 times slower than Raspberry Pi 4 (Cortex-A72).
The U84 is claimed to be 3 times faster than U74, so do the math. Hence the claims of P550 (an improved U84 core) beating Cortex-A75 look extremely dubious.
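The "do the math" step in this comment can be spelled out. This is a back-of-the-envelope sketch that treats the two quoted multipliers as exact and as applying to the same workload, which is an assumption; the variable names are illustrative.

```python
# Figures from the comment above, treated as rough multipliers.
a72_over_u74 = 7.0   # Pi 4 (Cortex-A72) vs HiFive Unmatched (U74): "over 7 times"
u84_over_u74 = 3.0   # SiFive's claimed U84 uplift over U74

# Implied U84 result as a fraction of the Cortex-A72 result.
u84_vs_a72 = u84_over_u74 / a72_over_u74

# Additional uplift the P550 would need over U84 just to match the A72.
needed_uplift = a72_over_u74 / u84_over_u74

print(f"U84 at roughly {u84_vs_a72:.2f}x the A72 result")
print(f"P550 would need about {needed_uplift:.2f}x over U84 to match the A72")
```

On these inputs the U84 lands at about 0.43x the A72 result, so the P550 would need well over a 2x gain on top of the U84 before reaching the A72, let alone the A75. The calculation does not normalize for clock speed or account for how much of the Geekbench gap comes from SIMD.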
MetalPenguin - Monday, June 28, 2021 - link
If things are designed correctly, a 64-bit SIMD floating point pipeline won't have a ton of overhead compared to just a floating point pipeline which supports double precision. The bulk of the arithmetic hardware can be shared between different precisions. But I agree that adding support for SIMD integer or the other instructions which A75 supports would increase the area for U84.
Another thing is that "3 times faster" cannot be applied universally to all workloads. For a high-ILP workload that always hits out of the caches, U84 may only be something like 50% faster (3 integer pipes vs. 2 for example) or less. But for something which incurs DCache misses, the U84 can run laps around the U74.
Anyway, I'm looking forward to getting real silicon measurements from the P550 eventually.
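The point that a single "3 times faster" figure hides workload dependence can be illustrated with an Amdahl-style blend. This is a sketch under stated assumptions: the 1.5x figure for cache-resident code comes from the comment above, while the 6x figure for miss-heavy code is purely hypothetical and chosen only for illustration.

```python
def blended_speedup(hit_fraction, hit_speedup, miss_speedup):
    """Overall speedup when a fraction of baseline runtime speeds up by
    hit_speedup and the remainder by miss_speedup (Amdahl-style blend)."""
    new_time = hit_fraction / hit_speedup + (1.0 - hit_fraction) / miss_speedup
    return 1.0 / new_time

HIT_SPEEDUP = 1.5    # "something like 50% faster", from the comment
MISS_SPEEDUP = 6.0   # hypothetical: not a figure from the thread

for hit_fraction in (1.0, 0.5, 0.0):
    s = blended_speedup(hit_fraction, HIT_SPEEDUP, MISS_SPEEDUP)
    print(f"{hit_fraction:.0%} cache-resident time: {s:.2f}x")
```

With these inputs the overall speedup ranges from 1.5x to 6x depending on how much baseline time is spent on cache-resident work, so an average near 3x is compatible with much smaller gains on some benchmarks.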
Wilco1 - Tuesday, June 29, 2021 - link
Me too, but it looks like it's going to be a very long wait... There are no measurements for U84 yet almost 2 years after being announced. There is only a recent Geekbench result for U74 at 1.0GHz: https://browser.geekbench.com/v5/cpu/8493132
The fun thing is SiFive likes to call U74 a Cortex-A55 equivalent when in reality it gets about half the performance of Cortex-A53 (eg. Pi 3 B at 1.2GHz: https://browser.geekbench.com/v5/cpu/8201396 )
mode_13h - Monday, June 28, 2021 - link
> Just two 64-bit NEON/FP pipes I believe. The P550 also has 64-bit FP support
There's a big difference between a 64-bit FP pipe and a 64-bit SIMD pipe, once you go to smaller data types, such as those used in deep learning.
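The difference described here comes down to how many elements fit in the pipe per operation. The sketch below is a simple illustration, assuming a 64-bit SIMD pipe packs as many elements as fit in 64 bits and a scalar pipe handles one element per operation; the element types listed are examples, not a statement of what either core supports.

```python
PIPE_WIDTH_BITS = 64

def lanes(element_bits, pipe_bits=PIPE_WIDTH_BITS):
    """Number of elements a SIMD pipe of pipe_bits handles per operation."""
    return pipe_bits // element_bits

for name, bits in [("FP64", 64), ("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    # A scalar pipe handles one element per operation regardless of width.
    print(f"{name}: scalar 1 element/op, 64-bit SIMD {lanes(bits)} elements/op")
```

At 64-bit element width the two are on par, but the SIMD pipe's advantage doubles each time the element width halves.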
Wilco1 - Tuesday, June 29, 2021 - link
Agreed. Neon adds hundreds of instructions, so it's a big increase in decode, double the FP register file, lots of new ALUs as you can't do integer arithmetic on an FP pipe etc. And RISC-V does add something similar to SVE which is even more complex with gather/scatter, predicates etc. Lots more complexity, lots more instructions = lots more transistors.
mode_13h - Wednesday, June 30, 2021 - link
> Lots more complexity, lots more instructions = lots more transistors.
Hey, I never said it was free!
: )
I was just reacting to the way MetalPenguin seemed to be almost equating a 64-bit NEON pipe to a 64-bit FP pipe. BIG difference!