AMD Epyc Faces Off With Intel Skylake-SP Xeon in Massive Server Battle


AMD vs. Intel

Last month, AMD launched its new server architecture, codenamed Epyc, in single-chip configurations of up to 32 cores. We already knew back then that Intel was prepping a massive Xeon refresh, starting with the Core X-Series (working on that, for the record) and following it up with a new lineup of Xeon parts with up to 28 cores, 56 threads, and a new L2 cache structure that quadrupled the amount of L2 while slashing the amount of L3 allocated per core.

On Monday, Intel launched roughly 50 SKUs in total, with top-end 28-core prices reaching $10K to $11K per physical CPU. Intel's new Xeon "Purley" Skylake-SP CPUs support AVX-512, Intel's own mesh topology, and the aforementioned larger L2 cache, so the chips are rather significantly different (with both gains and losses) relative to previous Xeon products.

Over at Anandtech, the indelible Johan De Gelas (once of Aces Hardware for you longtime tech readers) has joined up with Ian Cutress to provide preliminary information on how AMD's Epyc and these new Skylake-SP Xeons compare with one another, with previous Xeon chips thrown in for good measure.

A few points before we dive in. First, Anandtech acknowledges having had just one week with its AMD testbed and two weeks with the Intel system. Server testing is far more complicated than desktop testing, the benchmarks themselves are often more arcane and trickier to fine-tune, and performance can be very dependent on the presence or absence of such optimizations. Anandtech makes prominent note of the fact that it had only a very limited window in which to test and to find or apply such optimizations, particularly in AMD's case.

Second, the performance scenarios and relative rankings of Epyc versus Xeon are themselves highly dependent on the tests in question. This is the first time in at least six years that AMD has had a server part that could take the fight to Intel in any context, but testing has shown some distinct strong and weak points to AMD's architecture. While we'll provide an overview of the findings, there's no substitute for close reading of the original article if you want to completely understand the subtleties.

Inherent strengths, weaknesses, and differences

In comparison to Intel's new chips, AMD's Epyc uses its own CCX and Infinity Fabric, doesn't implement AVX-512, and has the same cache structure as Ryzen. This proves critical to understanding how Intel and AMD compare in a number of benchmarks (more on that in a moment).

AMD-Epyc

AMD has a significant advantage in base price; the top-end Epyc 7601 (180W TDP) is a 32-core chip with a 2.2GHz base / 3.2GHz max clock speed and a $4,200 price tag. Intel's Xeon 8180 is a 28-core chip with a 2.5GHz base / 3.8GHz max clock and a $10,009 price tag (the same chip in a 165W TDP with support for 1.5TB of DRAM per socket and a 2.1GHz base clock retails for $11,722). Anandtech tested the Xeon 8176: 28 cores, a 2.1GHz base clock, a maximum of 768GB of RAM per socket, and a price tag of $8,719. Intel's new Platinum/Gold/Silver/Bronze format looks nothing short of nightmarishly complicated, with vastly different specs swept into the same "families" in some cases. Other designations contain a number of exceptions to the rules that are supposed to govern which chips are placed in which brackets.

See? The only difference between 51xx and 61xx is the number of QPI links, AVX-512 FMA units per core, core counts, RAM support, and scalability. They're practically identical!

We noted when news of the rebrand hit that it wasn't clear how this structure would clarify Intel's product lines, and a muddle is precisely what's emerged from these results.
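To put the pricing gap in perspective, here is a quick back-of-the-envelope sketch that divides the list prices quoted above by core count. The dictionary layout and function name are just illustrative choices, and list price per core is a crude metric that ignores clocks, memory support, and platform cost.

```python
# List prices and core counts as quoted in the article text above.
CHIPS = {
    "Epyc 7601": {"cores": 32, "price_usd": 4200},
    "Xeon 8180": {"cores": 28, "price_usd": 10009},
    "Xeon 8176": {"cores": 28, "price_usd": 8719},
}


def price_per_core(chip):
    """Return list price divided by physical core count, in USD."""
    return chip["price_usd"] / chip["cores"]


if __name__ == "__main__":
    for name, chip in CHIPS.items():
        print(f"{name}: ${price_per_core(chip):,.2f} per core")
```

By this rough measure the Epyc 7601 lands at about $131 per core, against roughly $311 for the Xeon 8176 and $357 for the Xeon 8180.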

Cache changes

I want to take a moment to talk about cache architecture differences between the new Skylake-SP Xeons and previous CPUs, as well as between Intel and AMD. Skylake-S processors have a 256KB L2 cache that's 4-way set associative (see our L1 vs. L2 cache explainer for details on what this means) with an 11-cycle latency. Previous Xeons used a large inclusive L3 cache with ~2.5MB of L3 cache allocated per core, up to 16-way set associativity, and a 44-cycle latency.

Skylake-SP, on the other hand, has a 1MB L2 cache that's 16-way associative, but has higher (13-cycle) latency. Less L3 cache is integrated per core (1.375MB), the cache is 11-way set associative instead of 16-way, it has a 77-cycle latency (up from 44), and it's a non-inclusive cache.

An inclusive cache is a cache that is guaranteed to contain the data found in the cache levels above it (the ones closer to the core). The advantage of inclusive caches is that you can search the last level of cache (L3 in Intel's case) and determine whether data is located in L1. If you can't find it in L3, you know it's not in L1, which means you know you need to load it from main memory. This reduces the miss latency penalty (searching main memory is still much slower than searching L3). The disadvantage of an inclusive cache is that it offers less effective space for storing data, since each cache must contain all the data in the cache level above it. Intel's use of very large L3 caches in previous Broadwell and Skylake-S chips mitigated this issue by providing a large absolute amount of cache space.

Skylake-SP transforms the L3 cache into what is often called a victim cache, because data lines present in L2 aren't copied to L3 until they are moved or evicted. Data can be read back from L3 into L2 but also remain in the L3. Anandtech doesn't believe Skylake-SP can prefetch into L3, which means it serves as a home for "evicted" data. It's not used as much as the inclusive Broadwell and earlier Xeon L3 cache, which is why Intel can relax its latency and performance.
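The capacity trade-off between the two designs can be sketched with the per-core figures quoted above. This is an idealized model, not a description of how either cache actually behaves: the function name is made up for illustration, the older 256KB L2 figure follows from the "quadrupled L2" claim, and the non-inclusive result is an upper bound because a line read back into L2 may also remain in L3.

```python
def unique_capacity_mb(l2_mb, l3_mb, l3_inclusive):
    """Upper bound on distinct data held by one core's L2 plus its L3 share."""
    if l3_inclusive:
        # Every L2 line is duplicated in L3, so L2 adds no unique capacity.
        return l3_mb
    # Idealized assumption: no line sits in both L2 and L3 at the same time.
    return l2_mb + l3_mb


# Previous Xeons: 256KB L2, ~2.5MB inclusive L3 per core.
older_xeon = unique_capacity_mb(0.25, 2.5, l3_inclusive=True)

# Skylake-SP: 1MB L2, 1.375MB non-inclusive L3 per core.
skylake_sp = unique_capacity_mb(1.0, 1.375, l3_inclusive=False)

if __name__ == "__main__":
    print(f"Previous Xeon: {older_xeon} MB of unique data per core")
    print(f"Skylake-SP:    up to {skylake_sp} MB of unique data per core")
```

Under these assumptions the two designs end up close in unique capacity per core (2.5MB versus at most 2.375MB), even though Skylake-SP carries far less L3. What changes is how much of that capacity sits in the faster L2.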

Meanwhile, AMD uses its own distinct CPU Complex (CCX) design, which combines four CPU cores and an 8MB L3 cache. Two CCXs make up one Zeppelin die, and AMD's own Epyc diagrams show up to four dies per CPU package. The L3 is a mostly exclusive victim cache, but AMD's reliance on the CCX architecture for cross-communication between cores means there are some tangible penalties and impacts. Local data movement within the same CCX is quite quick, but there's a significant latency penalty for moving data across CCXs. AMD states that a Naples CPU (four Zeppelin dies) has 64MB of L3, but that's not really accurate. What Epyc has is better described as 8x8MB L3s, in much the same way that a pair of GPUs in SLI mode with 4GB of RAM each are better described as 2x4GB GPUs as opposed to an 8GB GPU.
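The "8x8MB, not 64MB" point falls straight out of the topology numbers above. The sketch below just does that arithmetic; the constant names are illustrative, and the `same_ccx` helper assumes cores are numbered consecutively within each CCX, which is a hypothetical numbering and not necessarily how any operating system enumerates them.

```python
CORES_PER_CCX = 4      # four cores share one L3
L3_PER_CCX_MB = 8      # 8MB of L3 per CCX
CCX_PER_DIE = 2        # two CCXs per Zeppelin die
DIES_PER_PACKAGE = 4   # four dies in a top-end Epyc package

ccx_count = CCX_PER_DIE * DIES_PER_PACKAGE
total_cores = ccx_count * CORES_PER_CCX
total_l3_mb = ccx_count * L3_PER_CCX_MB


def same_ccx(core_a, core_b):
    """True if two cores share an L3 (hypothetical consecutive numbering)."""
    return core_a // CORES_PER_CCX == core_b // CORES_PER_CCX


if __name__ == "__main__":
    print(f"{total_cores} cores, {ccx_count} x {L3_PER_CCX_MB}MB L3 "
          f"({total_l3_mb}MB in aggregate)")
    print("Cores 0 and 3 share an L3:", same_ccx(0, 3))
    print("Cores 3 and 4 share an L3:", same_ccx(3, 4))
```

The aggregate does add up to 64MB, but any single core only has fast local access to the 8MB slice inside its own CCX.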

These cache structure differences account for a substantial part of why Epyc, pre-Skylake-SP Xeons, and the new Purley Xeons perform differently from one another. But they're hardly the only factor in play. The chart below shows how complex the comparisons between Epyc, Broadwell-EP, and Skylake-SP can get in just memory bandwidth depending on test conditions.

[Chart: memory bandwidth for Epyc, Broadwell-EP, and Skylake-SP]

There's no "wrong" test result here and all these test types are used by shipping software to varying degrees.

AMD's Epyc 7601 has 0.42x of Skylake-SP's bandwidth in some tests, but 2.26x the bandwidth in others, depending on how threads are pinned across the CPUs. Raw bandwidth for Broadwell-EP is higher than Skylake-SP in nearly every case except when 8 threads are running, which is where Skylake-SP finally pulls ahead. Relative memory latencies are also different between AMD and Intel, with AMD competing extremely well at or below 4MB of L3 and poorly once above that point. Accessing more than 8MB of L3 is a worst-case scenario for Epyc; its latency is worse than Intel's DRAM access latency.

[Chart: memory latency, Epyc vs. Xeon, measured with tinymembench]

Ouch, but not as determinative of overall performance as you might think, given the scope of the gap.

Performance Overview

AT runs through SPEC2006 (single-thread, SMT, multi-core), database and transactional performance, Java, big data number crunching, and floating point performance. AMD's FPU performance is surprisingly excellent compared with Intel. There are several reasons for this, but a number of them come down to various aspects of AVX and its impact on turbo clocks. For the last few product cycles, Intel has publicly stated that its Turbo Mode frequency figures depend on whether AVX is active, with AVX clocks being substantially lower. Intel's Xeon 8176 has a non-AVX 28-core maximum turbo frequency of 2.8GHz, an AVX 2.0 28-core maximum turbo frequency of 2.4GHz, and an AVX-512 28-core maximum turbo frequency of just 1.9GHz.

[Chart: NAMD molecular dynamics benchmark]

Intel talks up its use of 256-bit and 512-bit FMACs compared with AMD's 128-bit implementation of AVX. But AMD may have taken the wiser route here (it wins all the FPU benchmarks AT ran). Intel takes a 20 percent clock penalty compared with 256-bit AVX when running AVX-512. While the higher efficiency of wider vectors should theoretically still allow significant AVX-512 performance improvements, they're only going to happen with substantial performance tuning. Not all software vendors or buyers can afford that kind of work, but it'll be critical for AVX-512 to be a success.
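The clock penalty and the theoretical upside can both be worked out from the Xeon 8176 all-core turbo figures quoted above. This is a rough sketch under a big assumption: peak throughput is modeled as clock speed times vector width, with the same number of vector units active in both modes, which is not true of every SKU and says nothing about real, untuned code. The function names are illustrative.

```python
# Xeon 8176 28-core maximum turbo clocks, as quoted in the article.
TURBO_GHZ = {"non-AVX": 2.8, "AVX2": 2.4, "AVX-512": 1.9}
VECTOR_BITS = {"AVX2": 256, "AVX-512": 512}


def clock_penalty(baseline, mode):
    """Fraction of clock speed lost when moving from baseline to mode."""
    return 1 - TURBO_GHZ[mode] / TURBO_GHZ[baseline]


def peak_ratio(mode, baseline):
    """Idealized peak throughput ratio: clock x vector width, equal unit count."""
    return (TURBO_GHZ[mode] * VECTOR_BITS[mode]) / (
        TURBO_GHZ[baseline] * VECTOR_BITS[baseline]
    )


if __name__ == "__main__":
    print(f"AVX-512 vs AVX2 clock penalty: {clock_penalty('AVX2', 'AVX-512'):.1%}")
    print(f"AVX-512 vs non-AVX clock penalty: {clock_penalty('non-AVX', 'AVX-512'):.1%}")
    print(f"Idealized AVX-512 vs AVX2 peak: {peak_ratio('AVX-512', 'AVX2'):.2f}x")
```

On paper, doubling the vector width more than pays for the roughly 21 percent clock drop, for an idealized peak of about 1.58x over 256-bit AVX. Any code that isn't fully vectorized simply eats the lower clock, which is why the tuning effort matters so much.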

FPU performance is, surprisingly, AMD's best overall showing. Epyc is a mediocre database server, beats Intel in Java performance (but not by the same margins as in FPU code), and is extremely competitive in Big Data tests given price and clock differentials. Power consumption varies substantially by workload; the Xeon 8176 has extremely high idle power consumption, but vastly better MySQL perf/watt than Broadwell and modestly better perf/watt in this test than the Epyc 7601. In POV-Ray testing, AMD turns the tables on Intel, with higher performance at a huge power differential (327W for Epyc versus 453W for Skylake-SP).

Conclusions

The bottom line is this: AMD's Epyc isn't the better option in every situation or environment. But a combination of lower prices, competitive performance, and some solid test wins shows AMD can hang with Intel again, even at the top of the market. For hardware cost-conscious companies, or vendors that can afford to optimize heavily for Ryzen (cloud providers like MS, for example), Epyc is a very strong option. But Skylake-SP shows some formidable performance gains of its own, has a better scaling mesh topology, and offers the stronger overall level of performance. If your TCO is dominated more by software costs than hardware pricing, Intel and its proven track record may still be the better option here.

Finally, I'd like to echo some comments Johan makes. After years of watching Intel's only competition being its own previous generation of products, it's really nice to see some genuine performance back-and-forth. One of the great ironies of reviewing is that people regularly accuse reviewers of using various tricks or indulging biases to tilt reviews deliberately towards AMD or Intel when, in reality, we're probably the people that most want to see exciting performance matches. Articles like this (or, of course, AT's vastly larger review) don't write themselves; they take considerable time and effort. It's boring to watch the same company win over and over. Nobody likes a slugfest better than a reviewer, and this review is worth a read.