Having worked with Knights Landing, I can say that it is a weird "Xeon". When it's fast, it's very fast. But in general, it's an unforgiving CPU due to the shallow OOO pipeline, lack of shared LLC, and an instruction decoder that struggles to keep up.
Also, the high-bandwidth MCDRAM and 4-way SMT can be unforgiving too. The former because it has worse latency than normal memory, and the latter because the pipeline frontend is statically partitioned.
I'm actually going to agree. @Elstar is missing the point - SMT and low clockspeeds make the *apparent* latency of MCDRAM much lower.
I don't know if it's true that the absolute latency of MCDRAM is higher, but one benefit of SMT is to hide memory latency. If your workload has enough concurrency, then it works just fine.
I'm talking about programming in practice, not theory. MCDRAM latency on Knights Landing in practice is about 150 ns, whereas normal DRAM is about 128 ns. And yes, in theory, a great SMT implementation makes it easy to hide latency, but the SMT implementation on Knights Landing is quite basic. For example, if you average 3 busy threads on each CPU over the course of your app, then the statically partitioned frontend means that you get 3/4 of the available (and already feeble) decoder bandwidth.
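To put rough numbers on the two points above, here is a minimal Python sketch. The 150 ns and 128 ns latencies and the 4-way static split are the figures quoted in the comment; the function names are made up for illustration, and the model assumes each hardware thread owns a fixed quarter of the decode slots whether or not it is busy.

```python
SMT_WAYS = 4  # hardware threads per core, as stated in the thread


def usable_decode_fraction(busy_threads: float, smt_ways: int = SMT_WAYS) -> float:
    """Fraction of decoder bandwidth actually in use.

    Assumption of this sketch: the frontend is split into smt_ways equal,
    fixed shares, and the share of an idle thread is simply wasted.
    """
    if not 0 <= busy_threads <= smt_ways:
        raise ValueError("busy_threads must be between 0 and smt_ways")
    return busy_threads / smt_ways


def latency_penalty(mcdram_ns: float = 150.0, ddr_ns: float = 128.0) -> float:
    """Relative extra latency of MCDRAM over regular DRAM (0.1 means 10% worse)."""
    return mcdram_ns / ddr_ns - 1.0


if __name__ == "__main__":
    # Averaging 3 busy threads per core leaves a quarter of the decoder idle.
    print(f"decode bandwidth in use: {usable_decode_fraction(3):.0%}")
    print(f"MCDRAM latency penalty:  {latency_penalty():.1%}")
```

With the quoted figures this works out to 75% of the decoder and a latency penalty of about 17%, which is the gap SMT would have to hide.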
This all being said, I think Xeon Phi is a great CPU. It just isn't as forgiving as a normal Xeon. That is my only point here.
The issue might be your perspective. You're saying these cores aren't as well-optimized as a normal server core, but that's actually by design. The philosophy behind this chip is similar to that of GPUs - make the cores really simple and energy-efficient, then pack a ton of them on a single die. Sure, you could improve pipeline utilization, but would it yield a net improvement in power-efficiency? How many fewer cores would fit on a die? Would you then have to boost the memory clock to keep all the cores fed? You're really not looking at this as a tradeoff.
So, while I'm not very familiar with KNM's SMT implementation, it doesn't surprise me that it stalls in places their bigger cores wouldn't. The chip *does* have a small turbo boost to partially-compensate for cases where utilization is low, but I think the key point is to look at aggregate throughput and overall power-efficiency of compute.
And in the particular case of Knights Mill, most of its deep learning workloads will be run via Intel-optimized libraries that surely have hand-tuned code to keep stalls at a minimum. They have optimized back ends for most of the popular deep learning frameworks.
"You're saying these cores aren't as well-optimized as a normal server core"
No, he's saying "it's not as forgiving". Which means you as the programmer have to take a lot more care about how you use it, otherwise performance will suffer. After all, x86 compatibility was supposed to be a strong selling point of the Knights.
Configuring the MCDRAM as a cache is weird. For bandwidth bound code, using the MCDRAM as a cache *should* make the code faster… but it might not! If the miss/conflict rate gets high enough, using the MCDRAM as a "cache" can actually be worse than not using it at all. That is why Intel lets people turn the "cache" mode off and manually manage the MCDRAM via software.
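A back-of-the-envelope model shows why a high miss rate can make cache mode a net loss. This is only a sketch under loud assumptions: every byte is served through MCDRAM, every miss additionally pays for a DDR transfer, the two costs simply add, and the 400 GB/s and 90 GB/s figures are illustrative placeholders, not measured Knights Landing numbers. Real hardware has conflict misses, write-backs and latency effects that this ignores.

```python
def effective_bandwidth(miss_rate: float, mcdram_gbs: float, ddr_gbs: float) -> float:
    """Effective GB/s in cache mode under the additive-cost assumption.

    Time per GB = MCDRAM access time + (miss_rate * DDR access time).
    """
    time_per_gb = 1.0 / mcdram_gbs + miss_rate / ddr_gbs
    return 1.0 / time_per_gb


def break_even_miss_rate(mcdram_gbs: float, ddr_gbs: float) -> float:
    """Miss rate above which cache mode is slower than using DDR alone."""
    return 1.0 - ddr_gbs / mcdram_gbs


if __name__ == "__main__":
    MCDRAM_GBS = 400.0  # hypothetical, for illustration only
    DDR_GBS = 90.0      # hypothetical, for illustration only
    for miss in (0.0, 0.25, 0.5, 0.9):
        bw = effective_bandwidth(miss, MCDRAM_GBS, DDR_GBS)
        print(f"miss rate {miss:.2f}: {bw:6.1f} GB/s")
    print(f"break-even miss rate: {break_even_miss_rate(MCDRAM_GBS, DDR_GBS):.3f}")
```

Under these assumptions the "cache" falls below plain DDR bandwidth once the miss rate passes 1 - DDR/MCDRAM, which is the situation where managing the MCDRAM manually starts to look attractive.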
I get the point of Knights Landing, as it's more of a generic HPC processor. But I think Knights Mill actually misses the boat. It just can't compete with GPUs. Intel's acquisition of Nervana and creation of a dedicated GPU computing group seems like an acknowledgment of that fact.
I think every Phi processor is aimed at something different from what most people like to think - it's not designed to replace Xeon for general-purpose server computing, and it's not designed to replace GPUs for graphics.
It would be interesting to have a per-core performance comparison between Xeon, Phi, and Atom chips. My guess is that per-core performance will be in that order - but per-core performance for ARM-based servers would be lower than even Atom (Cxxxx series). GPU cores are a different area, especially as it relates to computing.
" It just can't compete with GPUs. Intel's acquisition of Nervana and creation of a dedicated GPU computing group seems like an acknowledgment of that fact."
I don't think Phi was ever meant to compete with GPUs. I think Intel is planning on using its technologies together to improve its products - its GPUs, for example, will not be limited to just the high end but will eventually compete with NVidia gaming GPUs.
I believe a lot of their technology is used to work on ideas that will later go mainstream. Hyperthreading is one example, and of course AVX and AVX-512 (for sure) is another. This might also mean that 4-way Hyperthreading is coming to 10nm processors.
I don't know where you get your information, but Intel straight out said that Knights Mill was targeted at deep learning. It's just a Knights Landing where they traded off some fp64 performance in order to optimize deep learning performance. This puts it squarely in competition with GPUs and more specialized neural network chips.
"I don't know where you get your information, but Intel straight out said that Knights Mill was targeted at deep learning."
I never stated this. All I stated is that Intel uses such platforms to research newer technology for its other products. Knights Landing was never designed to replace the GPU.
BTW, did you notice this uses Silvermont Atom cores? So, the hyperthreading they implemented for it cannot be re-purposed for general purpose. Even in Atoms, since the current/next generation of those is already newer (Goldmont/Goldmont Plus).
Your opinion to information ratio is too high. Please read more and post less.
Just for information, I found an erratum in the 486SLC chip from IBM - it had the cache inverted when jumping between 16- and 32-bit mode. This was about 25 years ago - I try to keep up with technologies, but nowhere near like I did then. I interviewed with Intel, but at that time they did not want assembly language developers. My only regret is that often my words don't come out the way I think them - but I have learned over the years of development that it's best to get information from the company that made the product and not the internet - because like everything, including this statement, the Internet is full of opinion.
Congratulations on your position within the industry, and skepticism is healthy, but this is information that did in fact come directly from Intel. There are members of this board who regularly post unverified claims, and they normally are easily spotted. For the most part, everyone else is either posting verified claims or attaching appropriate disclaimers ("as far as I remember," etc).
There's at least one press release where Intel specifically said they went from a P3-derivative core to a Silvermont-derivative core with additional AVX-512 support. Pretty awkward to write "nothing but this statement indicates this" when the information is directly from Intel.
You're clearly having no idea about performance, particularly ARM servers. Latest ARM servers easily beat Skylake on performance AND power: https://blog.cloudflare.com/arm-takes-wing/
On a per-core basis even the fastest Atom is just about as fast as 835-based phones despite having significantly higher TDP. Server CPUs like Falkor are significantly faster than phones...
"You're clearly having no idea about performance, particularly ARM servers. Latest ARM servers easily beat Skylake on performance AND power: https://blog.cloudflare.com/arm-takes-wing/"
From the article you cited:
"The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM based servers. Certainly core for core, the Intel Skylake is far superior, but when you look at the system level the performance becomes very attractive."
Keep in mind they're probably not referring to the newest chips from Intel. But most of the graphs show Intel blowing the ARM CPU out of the water.
Wrong - how does the article support your claim that ARM server cores are slower than even Atom?
And baiting and switching to GPU? Really? Not a good strategy since Intel GPUs are not exactly known for their great performance or efficiency... When Intel tried to compete in mobile they were forced to *license* non-Intel GPUs, and despite that their GPU performance was far below that of Android phones at the time like HTC One: https://www.anandtech.com/show/9251/the-asus-zenfo...
They literally keep spamming the difference between threads and cores throughout the entire article. Reading comprehension?
Also, Silvermont destroyed its ARM competitors at the time. I wouldn't hesitate to suggest that similar age Intel chip designs would continue to run circles around the ARM designs.
Exactly, it's hard to miss that. For example, per-core performance is around 80% of Skylake's in GZIP. Even the latest Atom is not anywhere near that level.
Silvermont was only faster than older Arm cores like Cortex-A8 when it came out; Cortex-A9 was faster at similar frequencies and available in quad-core variants.
If you're focused on performance of these cores on general-purpose code, then you're kind of missing the point. There's a reason these chips have two AVX-512 pipes per core.
The only such benchmark I saw compared an Apollo Lake with 64-bit memory against an 835 with 128-bit memory interface. If you have better data, please post.
That said, there are too many variables for comparisons between cell phone & tablet SoCs to be particularly meaningful for Server CPUs, so I'd take that with a grain of salt.
I don't think memory is that slow on the N5000; the multithreaded bandwidth results look comparable with the 835, so it seems it has high memory latency instead. However, if you have a better like-for-like comparison, post a link.
Seriously, did you L@@K at the benchies? Because the 835 has a Memory Bandwidth score of 17.2 GB/sec, while the N5000 only scores 7.80 GB/sec. Of course, that's sharing bandwidth between the CPU and GPU, so single-channel is going to have less than half the bandwidth available to the CPU.
Anyway, let me know if you find one where they're both equipped with dual-channel memory. In that case, I expect N5000 will sweep the 835 across the board. ...not that I even know why that's relevant here, because that's a Goldmont+, whereas KNM uses a **heavily**-modified Silvermont.
If you look at the *multi-threaded* scores, the memory bandwidth is 16.7 vs 13.6 GB/s. The issue seems to be that a single core somehow ends up much slower, while a single 835 core can use all of the memory bandwidth.
The GPU will use some bandwidth in *both* cases, but a static display like during GB execution should be well below 1GB/s, so wouldn't affect scores significantly.
I looked for desktop variants which might have faster memory, but there aren't any scores yet. The relevance? Current benchmarks show N5000 and 835 are in the same ballpark, while KNM runs at max 1.6GHz and uses a slower core. Given it's hard to find any benchmarks for Xeon Phi, knowing that per-core performance on scalar code is around half that of a single core in the 835 puts things into perspective.
So, single-channel config tops out at 17.9 GB/s (theoretical), while dual-channel should be good for 35.8 GB/s. So, it's definitely single-channel. This is not very unusual, for entry-level x86 notebooks.
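The theoretical figures above can be reproduced with a few lines of Python. This is a sketch that assumes 2400 MT/s memory on a 64-bit channel and that the quoted "GB/s" are really binary GiB/s; both are guesses at how the numbers were derived, not something stated in the thread, and `peak_bandwidth_gib_s` is just an illustrative name.

```python
def peak_bandwidth_gib_s(transfers_per_s: float, bus_width_bits: int, channels: int = 1) -> float:
    """Theoretical peak memory bandwidth in GiB/s.

    bytes/s = transfers per second * bytes per transfer * number of channels
    """
    bytes_per_s = transfers_per_s * (bus_width_bits // 8) * channels
    return bytes_per_s / 2**30


if __name__ == "__main__":
    RATE = 2400e6  # assumed transfer rate (2400 MT/s), not given in the thread
    WIDTH = 64     # assumed channel width in bits
    print(f"single-channel: {peak_bandwidth_gib_s(RATE, WIDTH):.1f} GiB/s")
    print(f"dual-channel:   {peak_bandwidth_gib_s(RATE, WIDTH, channels=2):.1f} GiB/s")
```

Under those assumptions the results round to 17.9 and 35.8, matching the numbers in the comment. A measured multi-threaded score of 13.6 GB/s fits under the single-channel ceiling, so the scores are at least consistent with a single-channel configuration.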
Again, relevance to Knights Mill is basically non-existent, since it's a Silvermont w/ 4-way SMT, MCDRAM, mesh uncore, dual AVX-512 pipes, and who knows what other mods.
Strange. They reduced FP64 performance, and increased FP32 and FP16 performance. But it seems like this chip would mostly be attractive to people running FP64 calculations. Those who want FP32 or FP16 performance are probably running machine learning tasks, and they are well served by GPUs already.
This article fails to mention what Intel has previously said about Knights Mill, which is that it's a version of Knights Landing that's been tuned up for Deep Learning workloads.
As such, it's not a proper successor. They will concurrently sell both chips, with KNL going into more generic HPC setups and KNM going into deep-learning focused applications.
Until this chip is fully supported by TF/PyTorch/Caffe and is faster or cheaper than 1080Ti or Titan V, I don't see why anyone would consider it for DL tasks.
I thought part of the point of these chips was that each core is independently programmable? AFAIK with CUDA/OpenCL and even with the post-VLIW architectures there's a much bigger branching penalty within a cluster than x86.
Exactly. That's why I say I can see the point of KNL for some HPC workloads, where you might be using legacy or black-box proprietary code. But KNM is a miss, because deep learning doesn't tend to derive the same benefits from general-purpose programmability and runs swimmingly on GPUs.
It seems to me that MCDRAM is supposed to play the role of high bandwidth memory (equivalent of GPU memory), and it's also 16GB, so I don't see how it is any more attractive in this regard.
I think the most interesting thing I read here is 4-way Hyperthreading. In the past, Intel has typically used Xeon and other CPU lines to work out new functionality, so I am curious when 4-way Hyperthreading will come to Xeon and then to mainline Core processors.
One thing an Intel document mentions is "Fast Short REP MOV" - I searched and found nothing on this - it's for the Ice Lake generation. I am very curious what they are talking about. From my past experience in assembly programming and my compiler knowledge, REP MOV is used quite frequently, including internally in C functions such as memcpy().
Sounds logical. Possibly this is mostly where compilers use it - typically only for small strings or other small contents, not for huge arrays.
I would think that since this is used so much, it would be a major performance increase in applications.
Small sizes is exactly where the microcode has the largest penalty. It's certainly feasible to reduce that penalty somewhat (recent CPUs already made the instructions less slow), but it is unlikely to ever run as fast as an AVX implementation. In any case you won't see a major performance increase, since applications don't use REP MOV due to it being so slow.
Intel has been incrementally improving the performance of "REP MOV" with each CPU generation. The "Fast Short REP MOV" flag is just another round of these hardware improvements that signals to programmers that they can use "REP MOV" in more places instead of C's "memcpy()" (which may or may not be optimized for the given CPU).
Much better to optimize memcpy() for your platform than to hard-code inline assembly. Intel contributes optimizations to glibc and GCC, so just using memcpy() is probably your best bet. A lot of glibc's string functions have been optimized with vector instructions for quite a while, actually.
GLIBC uses vector instructions indeed, not REP MOV. For small copies it's significantly slower to use REP MOV, and even with improvements it's unlikely to beat existing memcpy implementations. It hasn't for the last 30 years...
I have already researched this, which is exactly why I said what I said. Quoting random ancient bits of source code that aren't used doesn't help your case at all. GLIBC uses an AVX memcpy on modern cores. I've benchmarked REP MOV myself - it's ridiculously slow, which is why nobody uses it as a memcpy. The best way to be convinced of this is to benchmark it yourself and compare with an optimized AVX implementation (like in GLIBC). There are benchmarks in GLIBC that you can use, but it's not difficult to do something similar on Windows.
memcpy() is so commonly used that it's a good bet it's well-optimized for whatever platform you're on. I would never advise people to use anything but memcpy(), if they're writing C. In C++, compilers are usually smart enough to insert memcpy() when possible.
Do you really need to spread false information about how great REP MOV is in every possible article? It's not like many people, including me, have explained why REP MOV is a horribly slow and inefficient instruction which isn't used either by compilers or libraries.
Have you ever looked at the code generated by a compiler? I've shown before that the compiler uses REP MOV a lot - it is used by compilers a lot, at least on x86-based platforms. I am not sure what the ARM equivalent is, but it appears to be the "vldl.u8" instruction.
I think we are talking about two different worlds here - x86 vs ARM. I have not been around Intel assembly for a long time, but Intel has vector instructions - unfortunately they are limited to certain processors and primarily used for large transfers. This instruction enhancement is for Ice Lake and is likely designed for smaller transfers, and REP MOV is used in that case on those platforms.
No, modern compilers don't use REP MOV. Some compilers can use it when optimizing for size, but never when optimizing for performance. Try building a large amount of code with GCC with -O2 for x64 and count the number of REP instructions in the binary. As I said before, you will be educated when you actually do these experiments yourself.
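For anyone who wants to run that experiment, a minimal Python sketch follows. It assumes GNU `objdump` is on the PATH and that its disassembly prints the prefix and mnemonic as `rep movs...`; `count_rep_movs` and `disassemble` are hypothetical helper names, not part of any existing tool.

```python
import re
import subprocess
import sys

# Matches "rep movs", "rep movsb", "rep movsq", etc. in disassembly text.
REP_MOVS = re.compile(r"\brep\s+movs[bwlqd]?\b", re.IGNORECASE)


def count_rep_movs(disassembly: str) -> int:
    """Count REP MOVS instructions in a disassembly listing."""
    return len(REP_MOVS.findall(disassembly))


def disassemble(path: str) -> str:
    """Return the objdump disassembly of a binary (requires GNU objdump)."""
    result = subprocess.run(
        ["objdump", "-d", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_rep.py /path/to/binary
    print(count_rep_movs(disassemble(sys.argv[1])))
```

Comparing the count for the same code built with -O2 and with -Os is one way to check the size-versus-speed claim directly, rather than arguing about it.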
This is useless - we are talking about two different platforms. I researched the code and actually looked at disassembled code on the x86 platform, and REP MOV is used. Please show me compiler output for x64 that does not use REP MOV. I have a lot more knowledge in this area - but I do not use the GCC compiler - I use Microsoft compilers, including the latest Visual Studio 2015 compiler.
Really, try doing the research yourself. I disassembled a random large binary in my system (MicrosoftRawCodec.dll), and this is the full list of rep mov instances:
Quick note: Visual Studio uses REP MOV almost every time on Ivy Bridge and newer. Further note: some methods in Direct3D 11.x, like UpdateSubResource, use it too.
4-way hyperthreading will probably never come to desktop CPUs, or at least not anytime soon. MAYBE the server parts, but it just doesn't make sense for desktop use.
Keep in mind these are virtual cores instead of real cores. But multithreading itself is in essence the same thing, in a way, so it would be better to have the functionality in hardware anyway.
I probably agree - the recent push for more cores on CPUs is kind of crazy. Part of it might be that they are having trouble increasing single-core speed. Especially when, in the end, display logic is single-threaded - unless they can find ways to communicate with the display from multiple threads. Maybe multiple displays.
Knights Landing has a mode they call "cluster on a chip", where the cores & MCDRAM are divided into 4 independent quadrants each with distinct address space. In this case, it's like having 4 independent CPUs that just happen to share the same die. The reason being that cache coherency doesn't scale terribly well.
So, if you were looking for a sign of core counts plateauing, you might take it as one. At least with full hardware cache coherency, that is.
With the advent of the Titan V, a 3000 dollar card, Knights Mill becomes completely redundant unless you need the large 384GB memory for HPC...
Intel got burned by non-x86 projects too many times, and now they're afraid to go there. A better bet would've been to use their HD Graphics architecture as the basis for a deep learning accelerator.
Intel didn't really get burned that badly from their non-x86 efforts in recent times, mainly because they haven't had any. Rather, Intel management drank the kool-aid for x86 everywhere. This is why Intel sold off their ARM division, as they wanted to conquer mobile with their Atom core. Ditto for Larrabee trying to use basic x86 cores for shader workloads in a GPU. Whatever deficiencies x86 as an architecture had in these markets could be swept aside by Intel's foundry lead at the time. These missteps cost Paul Otellini his job as CEO.
Brian Krzanich has undone the x86 everywhere philosophy by being eager to license technology or, more commonly, purchase companies entirely. The problem with Brian Krzanich is that while he is attempting to get ahead of the markets (the classic "aim where an object is going to be in the future, not where it is now" strategy), he is often misfiring. To compound that problem, many of the early acquisitions under his tenure were paralleled by the idea that Intel would retain a foundry advantage going forward. Today, there is arguably no foundry advantage due to the massive delays regarding 10 nm production. Bringing up new foundry nodes is notoriously hard, so leadership cannot be blamed for the challenges physics provides. However, leadership is responsible for roadmaps and getting product out the door, even if it requires using an older process node. Intel's acquisitions so far have yet to fully embrace Intel's foundry efforts (in fairness, this transition can take years), but we should be seeing something by now from the early days of Brian Krzanich's spending spree. So far Altera FPGAs are the standout, but that comes with the advantage that Altera signed up to use Intel's foundries before they were outright acquired by Intel.
I like the fact it has virtualization built-in. It makes it much easier to deploy it as a throughput server where individual thread speed is not as important as aggregate throughput.
75 Comments
Elstar - Tuesday, December 19, 2017 - link
Having worked with Knights Landing, I can say that it is a weird "Xeon". When it's fast, it's very fast. But in general, it's an unforgiving CPU due to the shallow OOO pipeline, lack of shared LLC, and an instruction decoder that struggles to keep up.
Elstar - Tuesday, December 19, 2017 - link
Also, the high-bandwidth MCDRAM and 4-way SMT can be unforgiving too. The former because it has worse latency than normal memory, and the latter because the pipeline frontend is statically partitioned.
ddrіver - Tuesday, December 19, 2017 - link
If you use it for the right job that shouldn't be a problem.
mode_13h - Tuesday, December 19, 2017 - link
I'm actually going to agree. @Elstar is missing the point - SMT and low clockspeeds make the *apparent* latency of MCDRAM much lower.
I don't know if it's true that the absolute latency of MCDRAM is higher, but one benefit of SMT is to hide memory latency. If your workload has enough concurrency, then it works just fine.
Elstar - Tuesday, December 19, 2017 - link
I'm talking about programming in practice, not theory. MCDRAM latency on Knights Landing in practice is about 150 ns, whereas normal DRAM is about 128 ns. And yes, in theory, a great SMT implementation makes it easy to hide latency, but the SMT implementation on Knights Landing is quite basic. For example, if you average 3 busy threads on each CPU over the course of your app, then the statically partitioned frontend means that you get 3/4 of the available (and already feeble) decoder bandwidth.
This all being said, I think Xeon Phi is a great CPU. It just isn't as forgiving as a normal Xeon. That is my only point here.
rbanffy - Wednesday, December 20, 2017 - link
Is each core still running fixed time slots like the original Phi? I thought that feature was gone.
mode_13h - Wednesday, December 20, 2017 - link
The issue might be your perspective. You're saying these cores aren't as well-optimized as a normal server core, but that's actually by design. The philosophy behind this chip is similar to that of GPUs - make the cores really simple and energy-efficient, then pack a ton of them on a single die. Sure, you could improve pipeline utilization, but would it yield a net improvement in power-efficiency? How many fewer cores would fit on a die? Would you then have to boost the memory clock to keep all the cores fed? You're really not looking at this as a tradeoff.
So, while I'm not very familiar with KNM's SMT implementation, it doesn't surprise me that it stalls in places their bigger cores wouldn't. The chip *does* have a small turbo boost to partially-compensate for cases where utilization is low, but I think the key point is to look at aggregate throughput and overall power-efficiency of compute.
And in the particular case of Knights Mill, most of its deep learning workloads will be run via Intel-optimized libraries that surely have hand-tuned code to keep stalls at a minimum. They have optimized back ends for most of the popular deep learning frameworks.
MrSpadge - Thursday, December 21, 2017 - link
"You're saying these cores aren't as well-optimized as a normal server core"
No, he's saying "it's not as forgiving". Which means you as the programmer have to take a lot more care about how you use it, otherwise performance will suffer. After all, x86 compatibility was supposed to be a strong selling point of the Knights.
mode_13h - Tuesday, December 19, 2017 - link
FWIW, the MCDRAM can be configured as a shared L3 cache. Not sure I'd use it that way, but you could.
Elstar - Tuesday, December 19, 2017 - link
Configuring the MCDRAM as a cache is weird. For bandwidth bound code, using the MCDRAM as a cache *should* make the code faster… but it might not! If the miss/conflict rate gets high enough, using the MCDRAM as a "cache" can actually be worse than not using it at all. That is why Intel lets people turn the "cache" mode off and manually manage the MCDRAM via software.
mode_13h - Tuesday, December 19, 2017 - link
I get the point of Knights Landing, as it's more of a generic HPC processor. But I think Knights Mill actually misses the boat. It just can't compete with GPUs. Intel's acquisition of Nervana and creation of a dedicated GPU computing group seems like an acknowledgment of that fact.
HStewart - Tuesday, December 19, 2017 - link
I think every Phi processor is aimed at something different from what most people like to think - it's not designed to replace Xeon for general-purpose server computing, and it's not designed to replace GPUs for graphics.
It would be interesting to have a per-core performance comparison between Xeon, Phi, and Atom chips. My guess is that per-core performance will be in that order - but per-core performance for ARM-based servers would be lower than even Atom (Cxxxx series). GPU cores are a different area, especially as it relates to computing.
" It just can't compete with GPUs. Intel's acquisition of Nervana and creation of a dedicated GPU computing group seems like an acknowledgment of that fact."
I don't think Phi was ever meant to compete with GPUs. I think Intel is planning on using its technologies together to improve its products - its GPUs, for example, will not be limited to just the high end but will eventually compete with NVidia gaming GPUs.
I believe a lot of their technology is used to work on ideas that will later go mainstream. Hyperthreading is one example, and of course AVX and AVX-512 (for sure) is another. This might also mean that 4-way Hyperthreading is coming to 10nm processors.
mode_13h - Tuesday, December 19, 2017 - link
I don't know where you get your information, but Intel straight out said that Knights Mill was targeted at deep learning. It's just a Knights Landing where they traded off some fp64 performance in order to optimize deep learning performance. This puts it squarely in competition with GPUs and more specialized neural network chips.
HStewart - Tuesday, December 19, 2017 - link
"I don't know where you get your information, but Intel straight out said that Knights Mill was targeted at deep learning."
I never stated this. All I stated is that Intel uses such platforms to research newer technology for its other products. Knights Landing was never designed to replace the GPU.
mode_13h - Tuesday, December 19, 2017 - link
BTW, did you notice this uses Silvermont Atom cores? So, the hyperthreading they implemented for it cannot be re-purposed for general purpose. Even in Atoms, since the current/next generation of those is already newer (Goldmont/Goldmont Plus).
Your opinion to information ratio is too high. Please read more and post less.
HStewart - Tuesday, December 19, 2017 - link
"BTW, did you notice this uses Silvermont Atom cores?"
Please review your facts - Phi chips do not use Silvermont Atom - nothing but this statement indicates this and Intel's Ark does not
HStewart - Tuesday, December 19, 2017 - link
WikiChip link did mention it - but nowhere in this article or from Intel did it mention Silvermont.
In any case, Silvermont does not have AVX-512 - it would be odd to say Atom has it when the Core series does not have it, except for the X series.
mode_13h - Tuesday, December 19, 2017 - link
I'm glad to see that you've started your journey. Please seek out and read more...
HStewart - Tuesday, December 19, 2017 - link
Just for information, I found an erratum in the 486SLC chip from IBM - it had the cache inverted when jumping between 16- and 32-bit mode. This was about 25 years ago - I try to keep up with technologies, but nowhere near like I did then. I interviewed with Intel, but at that time they did not want assembly language developers. My only regret is that often my words don't come out the way I think them - but I have learned over the years of development that it's best to get information from the company that made the product and not the internet - because like everything, including this statement, the Internet is full of opinion.
lmcd - Tuesday, December 19, 2017 - link
Congratulations on your position within the industry, and skepticism is healthy, but this is information that did in fact come directly from Intel. There are members of this board who regularly post unverified claims, and they normally are easily spotted. For the most part, everyone else is either posting verified claims or attaching appropriate disclaimers ("as far as I remember," etc).
GrumpyCat - Wednesday, December 20, 2017 - link
Hi!
Do you still have access to this errata?
I am very interested in it. Very hard to find any errata for 486 era chips.
Thanks!
lmcd - Tuesday, December 19, 2017 - link
There's at least one press release where Intel specifically said they went from a P3-derivative core to a Silvermont-derivative core with additional AVX-512 support. Pretty awkward to write "nothing but this statement indicates this" when the information is directly from Intel.
Wilco1 - Tuesday, December 19, 2017 - link
You're clearly having no idea about performance, particularly ARM servers. Latest ARM servers easily beat Skylake on performance AND power: https://blog.cloudflare.com/arm-takes-wing/
On a per-core basis even the fastest Atom is just about as fast as 835-based phones despite having significantly higher TDP. Server CPUs like Falkor are significantly faster than phones...
HStewart - Tuesday, December 19, 2017 - link
"You're clearly having no idea about performance, particularly ARM servers. Latest ARM servers easily beat Skylake on performance AND power: https://blog.cloudflare.com/arm-takes-wing/"
From the article you cited:
"The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM based servers. Certainly core for core, the Intel Skylake is far superior, but when you look at the system level the performance becomes very attractive."
Keep in mind they're probably not referring to the newest chips from Intel. But most of the graphs show Intel blowing the ARM CPU out of the water.
Wilco1 - Tuesday, December 19, 2017 - link
Wrong - how does the article support your claim that ARM server cores are slower than even Atom?
And baiting and switching to GPU? Really? Not a good strategy since Intel GPUs are not exactly known for their great performance or efficiency... When Intel tried to compete in mobile they were forced to *license* non-Intel GPUs, and despite that their GPU performance was far below that of Android phones at the time like HTC One: https://www.anandtech.com/show/9251/the-asus-zenfo...
lmcd - Tuesday, December 19, 2017 - link
They literally keep spamming the difference between threads and cores throughout the entire article. Reading comprehension?
Also, Silvermont destroyed its ARM competitors at the time. I wouldn't hesitate to suggest that similar age Intel chip designs would continue to run circles around the ARM designs.
Wilco1 - Wednesday, December 20, 2017 - link
Exactly, it's hard to miss that. For example, per-core performance is around 80% of Skylake in GZIP. Even the latest Atom is not anywhere near that level.
Silvermont was only faster than older Arm cores like Cortex-A8 when it came out; Cortex-A9 was faster at similar frequencies and available in quad-core variants.
mode_13h - Wednesday, December 20, 2017 - link
If you're focused on performance of these cores on general-purpose code, then you're kind of missing the point. There's a reason these chips have two AVX-512 pipes per core.
mode_13h - Wednesday, December 20, 2017 - link
The only such benchmark I saw compared an Apollo Lake with 64-bit memory against an 835 with 128-bit memory interface. If you have better data, please post.
That said, there are too many variables for comparisons between cell phone & tablet SoCs to be particularly meaningful for Server CPUs, so I'd take that with a grain of salt.
Wilco1 - Wednesday, December 20, 2017 - link
This is what I recently found: http://browser.geekbench.com/v4/cpu/compare/532602...
I don't think memory is that slow on the N5000, the multithreaded bandwidth results look comparable with the 835, so it seems it has high memory latency. However if you have a better like for like comparison, post a link.
mode_13h - Wednesday, December 20, 2017 - link
Yeah, that's the same thing I saw before.
Seriously, did you L@@K at the benchies? Because the 835 has a Memory Bandwidth score of 17.2 GB/sec, while the N5000 only scores 7.80 GB/sec. Of course, that's sharing bandwidth between the CPU and GPU, so single-channel is going to have less than half the bandwidth available to the CPU.
Anyway, let me know if you find one where they're both equipped with dual-channel memory. In that case, I expect N5000 will sweep the 835 across the board. ...not that I even know why that's relevant here, because that's a Goldmont+, whereas KNM uses a **heavily**-modified Silvermont.
Wilco1 - Wednesday, December 20, 2017 - link
If you look at the *multi-threaded* scores, the memory bandwidth is 16.7 vs 13.6 GB/s. The issue seems to be that a single core somehow ends up much slower, while a single 835 core can use all of the memory bandwidth.
The GPU will use some bandwidth in *both* cases, but a static display like during GB execution should be well below 1GB/s, so wouldn't affect scores significantly.
I looked for desktop variants which might have faster memory, but there aren't any scores yet. The relevance? Current benchmarks show N5000 and 835 are in the same ballpark, and KNM runs at max 1.6GHz and uses a slower core. Given it's hard to find any benchmarks for Xeon Phi, knowing that per-core performance on scalar code is around half that of a single core in 835 puts things into perspective.
mode_13h - Thursday, December 21, 2017 - link
So, single-channel config tops out at 17.9 GB/s (theoretical), while dual-channel should be good for 35.8 GB/s. So, it's definitely single-channel. This is not very unusual, for entry-level x86 notebooks.
Again, relevance to Knights Mill is basically non-existent, since it's a Silvermont w/ 4-way SMT, MCDRAM, mesh uncore, dual AVX-512 pipes, and who knows what other mods.
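As a rough sanity check on those theoretical figures, the arithmetic can be sketched as below. This assumes 64-bit channels running at 2400 MT/s and reports the result in GiB/s; neither the memory speed nor the unit is stated in the thread, and the function name is made up for illustration.

```python
# Peak theoretical memory bandwidth = channels * bytes per transfer * transfers per second.
# ASSUMPTION: 64-bit channels at 2400 MT/s (not stated in the thread).

GIB = 2**30  # bytes per GiB


def peak_bandwidth_bytes(channels: int, bus_width_bits: int, transfers_per_sec: int) -> int:
    """Return the peak bandwidth in bytes per second (hypothetical helper)."""
    bytes_per_transfer = bus_width_bits // 8
    return channels * bytes_per_transfer * transfers_per_sec


single = peak_bandwidth_bytes(1, 64, 2_400_000_000)
dual = peak_bandwidth_bytes(2, 64, 2_400_000_000)

print(f"single-channel: {single / GIB:.1f} GiB/s")  # 17.9
print(f"dual-channel:   {dual / GIB:.1f} GiB/s")    # 35.8
```

Under those assumptions the numbers line up with the 17.9 and 35.8 quoted above; in decimal units the same configurations come to 19.2 and 38.4 GB/s.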
p1esk - Tuesday, December 19, 2017 - link
Strange. They reduced FP64 performance, and increased FP32 and FP16 performance. But it seems like this chip would mostly be attractive to people running FP64 calculations. Those who want FP32 or FP16 performance are probably running machine learning tasks, and they are well served by GPUs already.
mode_13h - Tuesday, December 19, 2017 - link
This article fails to mention what Intel has previously said about Knights Mill, which is that it's a version of Knights Landing that's been tuned up for Deep Learning workloads.
As such, it's not a proper successor. They will concurrently sell both chips, with KNL going into more generic HPC setups and KNM going into deep-learning focused applications.
p1esk - Tuesday, December 19, 2017 - link
Until this chip is fully supported by TF/PyTorch/Caffe and is faster or cheaper than 1080Ti or Titan V, I don't see why anyone would consider it for DL tasks.
mode_13h - Tuesday, December 19, 2017 - link
That's the fatal flaw, and you've just put your finger on it. It can't touch a V100. Even Vega is probably faster and more efficient.
lmcd - Tuesday, December 19, 2017 - link
I thought part of the point of these chips was that each core is independently programmable? AFAIK with CUDA/OpenCL and even with the post-VLIW architectures there's a much bigger branching penalty within a cluster than x86.
p1esk - Tuesday, December 19, 2017 - link
Why would you need to program each core individually for DL?
mode_13h - Wednesday, December 20, 2017 - link
Exactly. That's why I say I can see the point of KNL for some HPC workloads, where you might be using legacy or black-box proprietary code. But KNM is a miss, because deep learning doesn't tend to derive the same benefits from general-purpose programmability and runs swimmingly on GPUs.
Elstar - Tuesday, December 19, 2017 - link
Depends on the size of your data set. If your problem doesn't fit easily into the GPU's memory, then Xeon Phi starts to look very attractive.
p1esk - Tuesday, December 19, 2017 - link
It seems to me that MCDRAM is supposed to play the role of high bandwidth memory (equivalent of GPU memory), and it's also 16GB, so I don't see how it is any more attractive in this regard.
mode_13h - Tuesday, December 19, 2017 - link
Agreed. For deep learning, you're practically limited by the amount of *fast* memory.
Not only that, but Nvidia's V100 has 3x the memory bandwidth.
HStewart - Tuesday, December 19, 2017 - link
I think the most interesting thing I read here is 4-way Hyper-Threading. In the past, Intel has typically used Xeon and other CPU lines to develop new functionality, and I am curious when 4-way Hyper-Threading will come to Xeon and then mainline Core processors.
HStewart - Tuesday, December 19, 2017 - link
One thing an Intel document mentions is "Fast Short REP MOV". I searched and found nothing on this; it is for the Ice Lake generation. I am very curious what they are talking about. From my past experience in assembly programming and my knowledge of compilers, REP MOV is used quite frequently, and it is also used internally by C functions such as memcpy.
bruno.uy - Tuesday, December 19, 2017 - link
Maybe memcpy of small blocks has less overhead?
HStewart - Tuesday, December 19, 2017 - link
"Maybe memcpy of small blocks has less overhead?"Sounds logical. Possibly this is the mostly where compliers uses it - typically in only a small string of smaller contents when it used. Not in huge arrays.
I would think since this is use so much that it would be a major performance increase in applications.
Wilco1 - Tuesday, December 19, 2017 - link
Small sizes are exactly where the microcode has the largest penalty. It's certainly feasible to reduce that penalty somewhat (recent CPUs already made the instructions less slow), but it is unlikely to ever run as fast as an AVX implementation. In any case you won't see a major performance increase, since applications don't use REP MOV due to it being so slow.
Elstar - Tuesday, December 19, 2017 - link
Intel has been incrementally improving the performance of "REP MOV" with each CPU generation. The "Fast Short REP MOV" flag is just another round of these hardware improvements that signals to programmers that they can use "REP MOV" in more places instead of C's "memcpy()" (which may or may not be optimized for the given CPU).
mode_13h - Tuesday, December 19, 2017 - link
That's horrible advice.
Much better to optimize memcpy() for your platform than to hard-code inline assembly. Intel contributes optimizations to glibc and GCC, so just using memcpy() is probably your best bet. A lot of glibc's string functions have been optimized with vector instructions for quite a while, actually.
Wilco1 - Tuesday, December 19, 2017 - link
GLIBC uses vector instructions indeed, not REP MOV. For small copies it's significantly slower to use REP MOV, and even with improvements it's unlikely to beat existing memcpy implementations. It hasn't for the last 30 years...
HStewart - Tuesday, December 19, 2017 - link
I am not a Unix developer, but please research your information.
The following is the GLIBC library source available online for the memcpy function:
https://github.com/lattera/glibc/blob/master/strin...
In that code it uses a macro called BYTE_COPY_FWD.
I don't have the Unix source, but at least according to the following link it uses REP MOV:
http://justinyan.me/post/1689
My guess is that the macro is implemented differently for different instruction sets, but for x86 it uses REP MOV.
Please provide proof that GLIBC does not use REP MOV on Intel. I am not a Unix developer, so I could be wrong on that platform.
Wilco1 - Tuesday, December 19, 2017 - link
I have already researched this, which is exactly why I said what I said. Quoting random ancient bits of source code that aren't used doesn't help your case at all. GLIBC uses an AVX memcpy on modern cores. I've benchmarked REP MOV myself - it's ridiculously slow, which is why nobody uses it as a memcpy. The best way to be convinced of this is to benchmark it yourself and compare with an optimized AVX implementation (like in GLIBC). There are benchmarks in GLIBC that you can use, but it's not difficult to do something similar on Windows.
mode_13h - Wednesday, December 20, 2017 - link
I know glibc is vectorized after looking at enough backtraces where somebody passed in a bad pointer, thus resulting in a segfault.
:-/
mode_13h - Wednesday, December 20, 2017 - link
memcpy() is so commonly used that it's a good bet it's well-optimized for whatever your platform. I would never advise people to use anything but memcpy(), if they're writing C. In C++, compilers are usually smart enough to insert memcpy() when possible.
Wilco1 - Tuesday, December 19, 2017 - link
Do you really need to spread false information about how great REP MOV is in every possible article? It's not as if many people, including me, haven't explained why REP MOV is a horribly slow and inefficient instruction which isn't used either by compilers or libraries.
HStewart - Tuesday, December 19, 2017 - link
Have you ever looked at the code generated by a compiler? I have shown before that compilers use REP MOV a lot, at least on x86-based platforms. I am not sure what the ARM equivalent is, but it appears to be the "vld1.u8" instruction:
https://stackoverflow.com/questions/11161237/fast-...
but this does not look like non standard stuff.
I think we are talking about two different worlds here: x86 vs ARM. I have not been around Intel assembly for a long time, but Intel has vector instructions; unfortunately they are limited to certain processors and primarily used for large transfers. This instruction enhancement is for Ice Lake and is likely designed for smaller transfers, and REP MOV is used in that case on those platforms.
Wilco1 - Tuesday, December 19, 2017 - link
No, modern compilers don't use REP MOV. Some compilers can use it when optimizing for size, but never when optimizing for performance. Try building a large amount of code with GCC with -O2 for x64 and count the number of REP instructions in the binary. As I said before, you will be educated when you actually do these experiments yourself.
HStewart - Wednesday, December 20, 2017 - link
This is useless; we are talking about two different platforms. I researched the code and actually looked at disassembled code on the x86 platform, and REP MOV is used. Please show me compiler output for x64 that does not use REP MOV. I have a lot more knowledge in this area, but I do not use the GCC compiler; I use Microsoft compilers, including the latest Visual Studio 2015 compiler.
Wilco1 - Wednesday, December 20, 2017 - link
Really, try doing the research yourself. I disassembled a random large binary in my system (MicrosoftRawCodec.dll), and this is the full list of rep mov instances:
0000000180077695: F3 BE 3F 7B 83 BF rep mov esi,0BF837B3Fh
000000018007AC94: F3 A5 rep movs dword ptr [rdi],dword ptr [rsi]
000000018007E0A2: F3 3E 8B C1 rep mov eax,ecx
0000000180135035: F3 48 A5 rep movs qword ptr [rdi],qword ptr [rsi]
0000000180135362: F3 48 A5 rep movs qword ptr [rdi],qword ptr [rsi]
00000001802368C4: F3 8E 14 00 rep mov ss,word ptr [rax+rax]
I quickly looked at each; they all look like data being disassembled. For example, the best candidate to be a real rep mov goes like this:
000000018007E0A1: 2E
000000018007E0A2: F3 3E 8B C1 rep mov eax,ecx
000000018007E0A6: F3 3E 9B rep wait
000000018007E0A9: 54 push rsp
000000018007E0AA: F4 hlt
So there are ZERO rep mov instances in a big DLL containing well over 700000 instructions.
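The counting step above can be reproduced mechanically. A minimal sketch, assuming the disassembler output has the same "address: bytes mnemonic operands" shape as the listing quoted above; the `tally_rep` helper is hypothetical and not part of any real tool.

```python
import re
from collections import Counter

# The six hits quoted in the comment above, copied verbatim.
LISTING = """\
0000000180077695: F3 BE 3F 7B 83 BF rep mov esi,0BF837B3Fh
000000018007AC94: F3 A5 rep movs dword ptr [rdi],dword ptr [rsi]
000000018007E0A2: F3 3E 8B C1 rep mov eax,ecx
0000000180135035: F3 48 A5 rep movs qword ptr [rdi],qword ptr [rsi]
0000000180135362: F3 48 A5 rep movs qword ptr [rdi],qword ptr [rsi]
00000001802368C4: F3 8E 14 00 rep mov ss,word ptr [rax+rax]
"""

# Capture the mnemonic that follows a "rep" prefix, e.g. "mov" or "movs".
REP_RE = re.compile(r"\brep\s+([a-z]+)")


def tally_rep(listing: str) -> Counter:
    """Count rep-prefixed lines by the mnemonic after the prefix (hypothetical helper)."""
    counts = Counter()
    for line in listing.splitlines():
        match = REP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = tally_rep(LISTING)
for mnemonic, n in sorted(counts.items()):
    print(f"rep {mnemonic}: {n}")
```

This only groups hits by mnemonic. Only the `movs` forms are string copies (a `rep` prefix on a plain `mov` is not a copy loop), and the script cannot tell real code from data that happened to be disassembled, so each hit still needs the manual look at the surrounding bytes described above.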
Klimax - Monday, December 25, 2017 - link
Quick note: Visual Studio uses REP MOV almost every time on Ivy Bridge and newer. Further note: some methods in Direct3D 11.x, like UpdateSubresource, use it too.
mode_13h - Wednesday, December 27, 2017 - link
Did you explicitly tell it to target >= Ivy Bridge and optimize for speed?
Are you testing larger copies?
extide - Wednesday, December 20, 2017 - link
4-way hyperthreading will probably never come to desktop CPUs, or at least not anytime soon. MAYBE the server parts, but it just doesn't make sense for desktop use.
ddrіver - Wednesday, December 20, 2017 - link
You get 4 cores from 1 core. That's pretty sweet.
HStewart - Wednesday, December 20, 2017 - link
"You get 4 cores from 1 core."Keep in mind these are virtual cores instead of real cores. But multithreading itself is essence the same thing - in a way - so it would be better that have the functionality in hardware any where.
mode_13h - Wednesday, December 20, 2017 - link
Yeah, don't get trolled by this one. I'm pretty sure he knows what SMT is... and, um, would troll you.
mode_13h - Wednesday, December 20, 2017 - link
My favorite factoid about SMT is that Intel's recent HD Graphics cores have 7-way SMT. IMO, we need more prime numbers in computer architecture.
;)
HStewart - Wednesday, December 20, 2017 - link
I probably agree; the recent push for more cores on CPUs is kind of crazy. Part of it might be that they are having trouble increasing single-core speed. Especially when, in the end, display logic is single-threaded, unless they can find ways to communicate with the visual display in a multithreaded way. Maybe multiple displays.
mode_13h - Wednesday, December 20, 2017 - link
Knights Landing has a mode they call "cluster on a chip", where the cores & MCDRAM are divided into 4 independent quadrants each with distinct address space. In this case, it's like having 4 independent CPUs that just happen to share the same die. The reason being that cache coherency doesn't scale terribly well.
So, if you were looking for a sign of core counts plateauing, you might take it as one. At least with full hardware cache coherency, that is.
hahmed330 - Tuesday, December 19, 2017 - link
With the advent of the Titan V, a 3000-dollar card, Knights Mill becomes completely redundant unless you need the large 384GB memory for HPC...
mode_13h - Wednesday, December 20, 2017 - link
+1
Intel got burned by non-x86 projects too many times, and now they're afraid to go there. A better bet would've been to use their HD Graphics architecture as the basis for a deep learning accelerator.
Kevin G - Wednesday, December 20, 2017 - link
Intel didn't really get burned that badly by their non-x86 efforts in recent times, mainly because they haven't had any. Rather, Intel management drank the x86 kool-aid and pushed x86 everywhere. This is why Intel sold off their ARM division, as they wanted to conquer mobile with their Atom core. Ditto for Larrabee, which tried to use basic x86 cores for shader workloads in a GPU. Whatever deficiencies x86 had as an architecture in these markets could be swept aside by Intel's foundry lead at the time. These missteps cost Paul Otellini his job as CEO.
Brian Krzanich has undone the x86-everywhere philosophy by being eager to license technology or, more commonly, purchase companies entirely. The problem with Brian Krzanich is that while he is attempting to get ahead of the markets (the classic strategy of aiming where an object is going to be in the future, not where it is now), he is often misfiring. To compound that problem, many of the early acquisitions under his tenure were paralleled by the idea that Intel would retain a foundry advantage going forward. Today, there is arguably no foundry advantage due to the massive delays in 10 nm production. Bringing up new foundry nodes is notoriously hard, so leadership cannot be blamed for the challenges physics provides. However, leadership is responsible for roadmaps and getting product out the door, even if it requires using an older process node. Intel's acquisitions have yet to fully embrace Intel's foundry efforts (in fairness, this transition can take years), but we should be seeing something by now from the early days of Brian Krzanich's spending spree. So far Altera's FPGAs are the standout, but that comes with the advantage that Altera had signed up to use Intel's foundries before they were acquired outright by Intel.
mode_13h - Thursday, December 21, 2017 - link
Lol. Heard of IA64? i860? Institutional memory tends to be long and unforgiving.
rbanffy - Wednesday, December 20, 2017 - link
I like the fact it has virtualization built-in. It makes it much easier to deploy it as a throughput server where individual thread speed is not as important as aggregate throughput.
mode_13h - Wednesday, December 20, 2017 - link
That argument is good for KNL, but I don't think it works for KNM.