Why Skylake CPUs Are Sometimes 50% Slower – How Intel Has Broken Existing Code

I got a call that some performance regression tests had become slower on newer hardware. Not a big deal. Usually the cause is a bad configuration somewhere in Windows, or some BIOS settings were set to non-optimal values. But this time we were not able to find a setting that brought performance back to normal. Since the change was not small, 9s vs 19s (blue is old hardware, orange is new hardware), we needed to drill deeper:


Same OS, Same Hardware, Different CPU – 2 Times Slower

A perf drop from 9.1s to 19.6s is definitely significant. We checked whether the software version under test, Windows, or the BIOS settings were somehow different from the old baseline hardware. But nope, everything was identical. The only difference was that the same tests were running on different CPUs. Below is a picture of the newest CPU


And here is the one used for comparison


The Xeon Gold runs on a different CPU architecture named Skylake, which is common to all CPUs produced by Intel since mid 2017. *As commenters have pointed out, the consumer Skylake CPUs were already released in 2015. The server Xeon CPUs with SkylakeX were released in mid 2017. All later CPUs (Kaby Lake, …) share the same issue. If you are buying current hardware you will get a CPU with the Skylake architecture. These are nice machines, but as the tests have shown, newer and slower is not the right direction. If all else fails, get a repro and use a real profiler ™ to drill deeper. When you record the same test on the old hardware and on the new hardware, it should quickly lead somewhere:


Remember that the diff view in WPA *(Windows Performance Analyzer, a free profiling UI that is part of the Windows Performance Toolkit, which in turn is part of the Windows SDK) shows in the table the delta of Trace 2 (11s) – Trace 1 (19s). Hence a negative delta in the table indicates a CPU consumption increase in the slower test. When we look at the biggest differences in CPU consumption we find AwareLock::Contention, JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel and ThreadNative.SpinWait. Everything points towards CPU spinning when threads are competing for locks. But that looked like a red herring, because spinning is normally not the root cause of slower performance. Increased lock contention usually means that something in our software became slower while holding a lock, which as a consequence results in more CPU spinning. I checked locking times and other key metrics such as disk activity, but I failed to find anything relevant that could explain the performance degradation. Although it did not seem logical, I turned back to the increased CPU consumption in various methods.

It would be interesting to find out where exactly the CPU was stuck. WPA has file and line columns, but these work only with private symbols, which we do not have because it is .NET Framework code. The next best thing is to get the address within the dll where the instruction is located, which is called the Image RVA (Relative Virtual Address). When I load the same dll into the debugger and then do

u xxx.dll+ImageRVA

then I should see the instruction which was burning the most CPU cycles. In each method that was basically only one hot address.


Let's examine the hot code locations of the different methods with Windbg:

0:000> u clr.dll+0x19566B-10
00007ff8`0535565b f00f4cc6        lock cmovl eax,esi
00007ff8`0535565f 2bf0            sub     esi,eax
00007ff8`05355661 eb01            jmp     clr!AwareLock::Contention+0x13f (00007ff8`05355664)
00007ff8`05355663 cc              int     3
00007ff8`05355664 83e801          sub     eax,1
00007ff8`05355667 7405            je      clr!AwareLock::Contention+0x144 (00007ff8`0535566e)
00007ff8`05355669 f390            pause
00007ff8`0535566b ebf7            jmp     clr!AwareLock::Contention+0x13f (00007ff8`05355664)

We do this for the JIT method as well

0:000> u clr.dll+0x2801-10
00007ff8`051c27f1 5e              pop     rsi
00007ff8`051c27f2 c3              ret
00007ff8`051c27f3 833d0679930001  cmp     dword ptr [clr!g_SystemInfo+0x20 (00007ff8`05afa100)],1
00007ff8`051c27fa 7e1b            jle     clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x14a (00007ff8`051c2817)
00007ff8`051c27fc 418bc2          mov     eax,r10d
00007ff8`051c27ff f390            pause
00007ff8`051c2801 83e801          sub     eax,1
00007ff8`051c2804 75f9            jne     clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x132 (00007ff8`051c27ff)
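As a quick sanity check on the `dll+ImageRVA` arithmetic, the sketch below subtracts each Image RVA from the corresponding hot address in the two listings above. If both listings come from the same loaded clr.dll, both subtractions should give the same module load address. The class and method names are made up for illustration.

```csharp
using System;

static class RvaMath
{
    // Absolute address = module load address + relative virtual address (RVA).
    public static ulong ToAddress(ulong moduleBase, ulong rva) => moduleBase + rva;

    // Module load address implied by an absolute address and its RVA.
    public static ulong ToModuleBase(ulong address, ulong rva) => address - rva;

    static void Main()
    {
        // Hot addresses and RVAs copied from the two WinDbg listings.
        ulong baseFromContention = ToModuleBase(0x00007ff80535566bUL, 0x19566B);
        ulong baseFromJitHelper = ToModuleBase(0x00007ff8051c2801UL, 0x2801);

        Console.WriteLine($"{baseFromContention:x} {baseFromJitHelper:x}");
        Console.WriteLine(baseFromContention == baseFromJitHelper
            ? "Both hot addresses belong to the same module image"
            : "Load addresses differ");
    }
}
```

Both subtractions yield 7ff8051c0000, so the two hot addresses are offsets into the same image.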

Now we have a pattern. One time the hot location is a jump instruction and the other time it is a subtraction. But both hot instructions are preceded by the same instruction: pause. Different methods execute the same CPU instruction, which is for some reason very time consuming. Let's measure the duration of the pause instruction to see if we are on the right track.

If You Document A Problem It Becomes A Feature

CPU                                Pause Duration In ns
Xeon E5 1620v3 3.5GHz              4
Xeon(R) Gold 6126 CPU @ 2.60GHz    43
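A rough way to obtain numbers like these from managed code is to time a large number of spin iterations. The sketch below is not the author's tester. It assumes that one Thread.SpinWait iteration corresponds to roughly one pause instruction, which should hold on runtimes without the spin-wait fix. Newer runtimes normalize the spin count, so the result there no longer reflects the raw pause latency. Loop overhead is included in the measurement.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class PauseTimer
{
    // Average wall-clock time per Thread.SpinWait iteration in nanoseconds.
    public static double MeasureNsPerSpin(int iterations)
    {
        Thread.SpinWait(1000); // warm-up so that first-call costs are not measured

        var sw = Stopwatch.StartNew();
        Thread.SpinWait(iterations);
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds * 1000000.0 / iterations;
    }

    static void Main()
    {
        const int iterations = 1000000;
        double nsPerSpin = MeasureNsPerSpin(iterations);
        Console.WriteLine($"{iterations} spin iterations took {nsPerSpin:F1} ns each, Processors: {Environment.ProcessorCount}");
    }
}
```

Running it a few times and taking the smallest value reduces noise from thread scheduling.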

Pause on the new Skylake CPUs is an order of magnitude slower. Sure things can get faster and sometimes a bit slower. But over 10 times slower? That sounds more like a bug. A little internet search about the pause instruction leads to the Intel manuals where the Skylake Microarchitecture and the pause instruction are explicitly mentioned:



No, this is not a bug, it is a documented feature. There is even a web page which contains the timings of pretty much all CPU instructions.


  • Sandy Bridge   11
  • Ivy Bridge     10
  • Haswell        9
  • Broadwell      9
  • SkylakeX       141

The numbers are CPU cycles. To calculate the actual time, divide the cycle count by the CPU frequency in GHz to get the time in ns.
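The conversion can be written down in a few lines. The sketch below plugs in the nominal clock rates of the two CPUs from the measurement table, which is an assumption: the cores may have been running at a different (turbo) frequency during the measurement, so the results will not match the measured values exactly.

```csharp
using System;

static class CycleMath
{
    // cycles / (cycles per nanosecond) = nanoseconds; 1 GHz equals 1 cycle per ns.
    public static double CyclesToNanoseconds(double cycles, double frequencyGHz)
    {
        return cycles / frequencyGHz;
    }

    static void Main()
    {
        // Haswell: 9 cycles at the nominal 3.5 GHz of the Xeon E5 1620v3
        Console.WriteLine(CyclesToNanoseconds(9, 3.5));
        // SkylakeX: 141 cycles at the nominal 2.6 GHz of the Xeon Gold 6126
        Console.WriteLine(CyclesToNanoseconds(141, 2.6));
    }
}
```

This gives about 2.6 ns for Haswell and about 54 ns for SkylakeX, the same order of magnitude as the measured 4 ns and 43 ns.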

That means that when I execute heavily multithreaded applications on .NET on the latest hardware, things can become much slower. Someone else already noticed this in August 2017 and filed an issue for it: https://github.com/dotnet/coreclr/issues/13388. The issue has been fixed with .NET Core 2.1, and .NET Framework 4.8 Preview also contains the fixes for it.


Improved spin-waits in several synchronization primitives to perform better on Intel Skylake and more recent microarchitectures. [495945, mscorlib.dll, Bug]

But since .NET 4.8 is still one year away, I have requested a backport of the fixes to get .NET 4.7.2 back up to speed on the latest hardware. Since many parts of .NET use spinlocks, you should look out for increased CPU consumption around Thread.SpinWait and other spinning methods.



E.g. Task.Result spins internally, and for other tests I could see a significant increase in CPU consumption and degraded performance there as well.

How Bad Is It?

I have looked at the .NET Core code to see how long the CPU will keep spinning when the lock is not released, before calling into WaitForSingleObject and paying for the “expensive” context switch. A context switch is somewhere in the microsecond region and becomes much slower when many threads are waiting on the same kernel object.

.NET locks multiply the maximum spin duration by the number of cores. This has the fully contended case in mind, where every core has a thread waiting for the same lock and tries to spin long enough to give everyone a chance to work a bit before paying for the kernel call. Spinning inside .NET uses an exponential back-off algorithm: spinning starts with 50 pause calls in a loop, and for each iteration the number of spins is multiplied by 3 until the next spin count becomes greater than the maximum spin duration. I have calculated how long a thread would spin in total on pre-Skylake CPUs and on current Skylake CPUs for various core counts:


Below is some simplified code showing how .NET locks perform spinning:

/// <summary>
/// This is how .NET is spinning during lock contention minus the Lock taking/SwitchToThread/Sleep calls
/// </summary>
/// <param name="nCores">Number of cores; scales the maximum spin duration</param>
void Spin(int nCores)
{
	const int dwRepetitions = 10;
	const int dwInitialDuration = 0x32; // 50 pause calls in the first round
	const int dwBackOffFactor = 3;      // every round spins three times longer
	int dwMaximumDuration = 20 * 1000 * nCores;

	for (int i = 0; i < dwRepetitions; i++)
	{
		int duration = dwInitialDuration;
		do
		{
			for (int k = 0; k < duration; k++)
			{
				// one pause instruction is executed per iteration (the call is missing from this listing)
			}
			duration *= dwBackOffFactor;
		}
		while (duration < dwMaximumDuration);
	}
}
The old spinning times were in the millisecond region (19ms for 24 cores), which is already quite a lot compared to the often-cited high cost of context switches, which are an order of magnitude faster. But with Skylake CPUs the total CPU spinning times for a contended lock have exploded, and we will spin up to 246ms on a 24 or 48 core machine only because the latency of the pause instruction on the new Intel CPUs has increased by a factor of 14. Is this really the case? I have created a small tester to check full CPU spinning, and the calculated numbers nicely match my expectations. I have 48 threads waiting on a 24 core machine for a single lock, where I call Monitor.PulseAll to let the race begin:


Only one thread will win the race, but 47 threads will spin until they give up. This is experimental evidence that we indeed have a real issue with CPU consumption and that very long spin times are a real problem. Excessive spinning hurts scalability because CPU cycles are burned where other threads might need the CPU, although the usage of the pause instruction frees up some of the shared CPU resources while “sleeping” for longer times. The reason for spinning is to acquire the lock fast without going to the kernel. If that is true, the increased CPU consumption might not look good in Task Manager, but it should not influence performance at all as long as there are cores left for other tasks. But what the tests showed is that nearly single-threaded operations, where one thread adds something to a worker queue while the worker thread waits for work and then performs some task with the work item, are slowed down.

The reason for that is best shown with a diagram. Spinning for a contended lock happens in steps where the amount of spinning is tripled after each step. After each Spin Round the lock checks again whether the current thread can get it. While spinning, the lock tries to be fair and switches over to other threads from time to time to help the other thread(s) complete their work. That increases the chances that the lock has been released when we check again later. The problem is that the lock checks whether it can be taken only after a complete Spin Round has finished:


If, for example, the lock becomes signaled right after we started Spin Round 5, we wait for the complete Spin Round before we can acquire the lock. By calculating the spin duration for the last round we can estimate the worst-case delay that can happen to our thread:


That is many milliseconds that we can wait until spinning has completed. Is that a real issue?
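The worst-case delay of the last Spin Round can be estimated from the same simplified back-off loop shown earlier. The sketch below only mirrors that listing. The helper names are made up, and the latency per pause call is a parameter you supply (for example the 4 ns and 43 ns measured earlier).

```csharp
using System;

static class WorstCaseRound
{
    // Number of pause calls in the last (longest) round of the simplified back-off loop.
    public static long LastRoundPauseCalls(int nCores)
    {
        long maximumDuration = 20L * 1000 * nCores;
        long duration = 0x32;
        long last = duration;
        do
        {
            last = duration;   // remember the round that is actually executed
            duration *= 3;
        }
        while (duration < maximumDuration);
        return last;
    }

    // Worst-case wait in ms if the lock is released right after the last round started.
    public static double LastRoundMs(int nCores, double nsPerPause)
    {
        return LastRoundPauseCalls(nCores) * nsPerPause / 1e6;
    }

    static void Main()
    {
        Console.WriteLine(LastRoundPauseCalls(24));
        Console.WriteLine(LastRoundMs(24, 4));
        Console.WriteLine(LastRoundMs(24, 43));
    }
}
```

For 24 cores the last round consists of 328,050 pause calls, which is about 1.3 ms at 4 ns per pause and about 14 ms at 43 ns per pause.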

I have created a simple test application that implements a producer-consumer queue where the worker thread works for 10ms on each work item and the sender has a delay of 1-9 ms before sending the next work item. That is sufficient to see the effect:


We see for sender thread delays of one and two ms a total duration of 2.2s, whereas the other runs are twice as fast at ca. 1.2s. This shows that excessive CPU spinning is not just a cosmetic issue that only hurts heavily multithreaded applications. It also hurts simple producer-consumer threading which involves only two threads. For the run above, the ETW data speaks for itself: the increased CPU spinning really is the cause of the observed delay:


When we zoom into the slow section we find, in red, the 11ms of spinning, although the worker (light blue) completed its work and returned the lock a long time ago.


The fast, non-degenerate case looks much better: only 1ms is spent spinning for the lock.


The test application I used is named SkylakeXPause and is located at https://1drv.ms/u/s!AhcFq7XO98yJgsMDiyTk6ZEt9pDXGA, which contains a zip file with the source code and the binaries for .NET Core and .NET 4.5. To compare things, I installed on the Skylake machine .NET 4.8 Preview, which contains the fixes, and .NET Core 2.0, which still implements the old spinning behavior. The application targets .NET Standard 2.0 and .NET 4.5, which produces an exe and a dll. Now I can test the old and new spinning behavior side by side without the need to patch anything, which is very convenient.

readonly object _LockObject = new object();
int WorkItems;
int CompletedWorkItems;
Barrier SyncPoint;

void RunSlowTest()
{
	const int processingTimeinMs = 10;
	const int WorkItemsToSend = 100;
	Console.WriteLine($"Worker thread works {processingTimeinMs} ms for {WorkItemsToSend} times");

	// Test one sender one receiver thread with different timings when the sender wakes up again
	// to send the next work item

	// synchronize worker and sender. Ensure that worker starts first
	double[] sendDelayTimes = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

	foreach (var sendDelay in sendDelayTimes)
	{
		SyncPoint = new Barrier(2);  // one sender one receiver

		var sw = Stopwatch.StartNew();
		Parallel.Invoke(() => Sender(workItems: WorkItemsToSend,          delayInMs: sendDelay),
						() => Worker(maxWorkItemsToWork: WorkItemsToSend, workItemProcessTimeInMs: processingTimeinMs));
		Console.WriteLine($"Send Delay: {sendDelay:F1} ms Work completed in {sw.Elapsed.TotalSeconds:F3} s");
		Thread.Sleep(100);  // show some gap in ETW data so we can differentiate the test runs
	}
}

/// <summary>
/// Simulate a worker thread which consumes CPU which is triggered by the Sender thread
/// </summary>
void Worker(int maxWorkItemsToWork, double workItemProcessTimeInMs)
{
	SyncPoint.SignalAndWait(); // reconstructed: this call is missing from the listing

	while (CompletedWorkItems != maxWorkItemsToWork)
	{
		lock (_LockObject)
		{
			if (WorkItems == 0)
			{
				Monitor.Wait(_LockObject); // wait for work
			}

			for (int i = 0; i < WorkItems; i++)
			{
				CompletedWorkItems++; // reconstructed: without it the outer loop never ends
				SimulateWork(workItemProcessTimeInMs); // consume CPU under this lock
			}

			WorkItems = 0;
		}
	}
}

/// <summary>
/// Insert work for the Worker thread under a lock and wake up the worker thread n times
/// </summary>
void Sender(int workItems, double delayInMs)
{
	CompletedWorkItems = 0; // delete previous work
	SyncPoint.SignalAndWait(); // reconstructed: this call is missing from the listing

	for (int i = 0; i < workItems; i++)
	{
		lock (_LockObject)
		{
			// The listing is cut off here. The rest of this method is reconstructed
			// from the summary comment above and may differ from the downloadable source.
			WorkItems++;
			Monitor.PulseAll(_LockObject); // wake up the worker
		}
		SimulateWork(delayInMs); // wait before sending the next work item
	}
}

This is not a .NET issue. It affects all Spinlock implementations which use the pause instruction. I have done a quick check into the Windows Kernel of Server 2016 but there is no issue like that visible. Looks like Intel was kind enough to give them a hint that some changes in the spinning strategy are needed.

When the issue was reported to .NET Core in August 2017, it was already fixed in September 2017 and pushed out with .NET Core 2.0.3 (https://github.com/dotnet/coreclr/issues/13388). Not only is the reaction speed of the .NET Core team amazing, the issue has also been fixed on the mono branch a few days ago, and discussions about even more spinning improvements are ongoing. Unfortunately the Desktop .NET Framework is not moving as fast, but with .NET Framework 4.8 Preview we have at least a proof of concept that the fixes work there as well. Now I am waiting for the backport to .NET 4.7.2 to be able to use .NET at its full speed on the latest hardware as well. This was my first bug which was directly related to a performance change in one CPU instruction. ETW remains the profiling tool of choice on Windows. If I had a wish, I would ask Microsoft to port the ETW infrastructure to Linux, because the current performance tooling on Linux still sucks. There were some interesting kernel capabilities added recently, but an analysis tool like WPA has yet to be seen there.

If you are running .NET Core 2.0 or the desktop .NET Framework on CPUs which were produced since mid 2017, you should definitely check your application with a profiler to see whether you are running at reduced speed due to this issue, and upgrade to the newer .NET Core and, hopefully soon, .NET Desktop version. My test application can tell you if you could be having issues:

D:\SkylakeXPause\bin\Release\netcoreapp2.0>dotnet SkylakeXPause.dll -check
Did call pause 1,000,000 in 3.5990 ms, Processors: 8
No SkylakeX problem detected


D:\SkylakeXPause\SkylakeXPause\bin\Release\net45>SkylakeXPause.exe -check
Did call pause 1,000,000 in 3.6195 ms, Processors: 8
No SkylakeX problem detected

The tool will report an issue only if you are running a .NET Framework without the fix on a Skylake CPU. I hope you found the issue as fascinating as I did. To really understand an issue you need to create a reproducer which allows you to experiment and find all relevant influencing factors. The rest is just boring work, but now I understand the reasons and consequences of CPU spinning much better.

* Denotes changes to make things more clear and add new insights. This article has gained quite some traction at Hacker News (https://news.ycombinator.com/item?id=17336853) and Reddit  (https://www.reddit.com/r/programming/comments/8ry9u6/why_skylake_cpus_are_sometimes_50_slower/). It is even mentioned at Wikipedia (https://en.wikipedia.org/wiki/Skylake_(microarchitecture)). Wow. Thanks for the interest.


46 thoughts on “Why Skylake CPUs Are Sometimes 50% Slower – How Intel Has Broken Existing Code”

  1. Do you feel that the biggest pain point for Linux performance analysis is the lack of a nice front-end analysis tool like WPA? Or do you (also?) find ETW to be generally superior to ftrace/perf/eBPF?


    • Linux has many good data gathering capabilities but to perform an analysis efficiently you need great UIs like WPA. Otherwise you waste a lot of time to find the problem in huge text files. But if you can visualize the data you can immediately see issues. Besides that there seems no common user and kernel tracing framework to exist which supports stack traces out of the box. See http://blogs.microsoft.co.il/sasha/2017/03/30/tracing-runtime-events-in-net-core-on-linux/.


      • While not quite like WPA (and not out-of-the box), the LinuxKI Toolset has some Kernel and User stack tracing abilities, and can provide summaries and reports outside of raw trace files. For example, MariaDB suffers from the same issue on Linux on SkyLake, and we’ll see high CPU time in the function ut_delay(). If you are interested, check it out at https://github.com/HewlettPackard/LinuxKI/wiki

        As the primary author of the tool, I’ve been using it for over 6 years and have solved many difficult performance issues. Its been opensource since June 2017.


  2. That code is simply broken. It’s also not a spinlock. The “pause” instruction is intended to put into a spinlock to tell the CPU not to speculate the spinlock looping back — when it eventually doesn’t loop (because another CPU changed the memory variable it’s testing) you have a whole bunch of messy and slow clean up to do.

    The code shows in the examples isn’t a spinlock. In each case it’s a simple while(–i){} with a “pause” in the loop body. That might be *part* of a spinlock. But it’s not itself a spinlock. And it’s not what the “pause” instruction is for.


    • @BruceHoult: The sample code is not trying to implement a Spinlock. It is using the .NET lock construct which internally spins before taking the lock. It is meant to demonstrate the spinning issue on purpose. This bad code therefore serves as a good example to reproduce superfluous spinning. Implementing a good Spinlock is hard and should be left to experts, who will also have to deal with some embarrassing bugs. With C++ and STL you do not have such issues because you would need to build such things on your own, which most people (for good reasons) don't do.


      • > Implementing a good Spinlock is hard
        > should be left to experts
        > which most people (for good reasons) dont do

        You completely destroyed your credibility.

        > With C++ and STL you do not have such issues

        Laughable. STL is used only in usermode. Can spinlock even be implemented in usermode? The schedular can just decide to stop your “spinning” making the monstrosity you call a “spinlock” not a spinlock. Freaking hipsters.


      • Here is a managed Spinlock: https://referencesource.microsoft.com/#mscorlib/System/threading/SpinLock.cs
        It is part of .NET since a very long time. It is user mode and even written in a high level language such as C#.
        If you spin too long then the scheduler will, when your current thread quantum has expired (usually 1-15ms depending on currently set timer resolution), switch out your thread and the next thread of the ready queue is scheduled to run on the same core. If no other thread is scheduled to run on this core then your spinning thread is switched in again. In reality the scheduler is smarter and will not interrupt your thread at all if no one else is scheduled to run on your CPU. This way you can spin for seconds without a single context switch. If you have questions about the Windows OS Thread Scheduler you can read more about at https://docs.microsoft.com/en-us/windows-hardware/test/wpt/cpu-analysis.


  3. It’s a software issue, not a hardware issue. Check the latency of pause instruction on AMD’s microarchitectures and you’ll be surprised. Optimizing software for one microarchitecture and complaining that it’s slower on another one is just stupid. It’s been said billion times how you have to write optimized spin lock, but people keep repeating the same mistakes over and over again.


    • The link shows that it is hard but not how to do it right:

      “Backing off of a lock can be accomplished in different ways. The simplest method is just a counted lock acquired where an application can try to obtain a lock a certain number of times and then backs off the lock by requesting a context switch. The attempt count can easily be adjusted, but this method usually does scale with the number of hardware threads. The wall clock time of the spin will change as well with processor frequency which makes this code difficult to maintain.”

      This suggests that there is no other way than to change the software once the CPU vendors choose to change the latency of some instruction. How should I reliably detect that I am running on a CPU with a fast or slow pause instruction? I consider a change of an order of magnitude as breaking. That justifies some processor capability flag which software can safely rely upon. Currently the only way is to micro-benchmark the latency of the pause instruction, which also costs time because you need to spin long enough to ensure that you get reliable data since you are not running on an RTOS.


      • We did a micro benchmark on our game startup to detect the latency of pause instruction, works fine.
        >This suggests that there is no other way than to change the software once the CPU vendors choose to change the latency of some instruction.
        As I said before – AMD has a different latency, so it’s an issue with a software in the first place.
        Bulldozer – 50
        Zen – 10
        Jaguar – 50
        Pretty big difference from pre Skylake architectures


  4. The issue is not only relevant for SkylakeX and Skylake, the change has been done on Skylake microarchitecture first released in 2015. So Kaby Lake, Coffee Lake and beyond are affected too. We had been in contact with the Kernel guys and the kernel was not affected. Probably they had an adaptive spin-lock wait for a long time already (I would bet they stumbled onto it circa 2015),


  5. Great article thank you. You mentioned it was patched in a Mono branch a few days back – do you mind pointing out which one / commit? We’ve got an open source project fairly sensitive to this sort of issue.


    • The issue is https://github.com/mono/mono/issues/9011. Sorry it is still open. I did see it referenced from the main issue but it is still in the works. This time it is not in the runtime but mono library code. If you have a specific location you want to be fixed fast you should add a comment to the issue. Since it is pretty clear what needs to be done you should get a fix fast now.


  6. As was mentioned the fix for this issue is already in .NET Core runtime and will be in V4.8 of the desktop runtime. For those who need a quick fix, there is a work-around that should work. There are some configuration options that allow you some control over the amount of spinning that the .NET Runtime does. There is a config option called SpinLimitProcFactor which is a constant (normally 20K = 4E20 hex) that is used to multiply by the number of processors to get the spin count. Thus by cutting this by a factor of 10 to 20 you can bring spinning down by that factor. For Foo.exe adding the file Foo.exe.config with the following text in it will do the trick


    Note that the value is in Hex, not decimal so this cuts the value to 1024 which is by a factor of 20. Also note that the amount of spinning done by the runtime is already probably too big by a reasonable factor for most scenarios (we have just been too shy about fiddling with it), so cutting it down in this way is actually generally good)

    GIven that this is a machine wide issue, you can fix it machine wide by setting the registry key. HKLM\Software\Microsoft\DotNetRuntime\SpinLimitProcFactor (DWORD). You can also set the COMPLUS_SpinLimitProcFactor to affect any process that inherits that environment variable.

    Finally I should just reiterate that this issue will only come into play if you have a ‘hot lock’ (a lock that is entered and exited frequently by more than one thread) AND that the lock hold time is not really small (e.g. < 1us) because then spinning actually succeeds, but happens alot. Thus the worse case is a lock with a hold time around 10-100 usec. It can definitely happen, but is far from 'all' scenarios.


  7. A small correction to my comment above. The Registry key is called (all components but the last should already exist in the registry). You need to create a DWORD entry named SpinLimitProcFactor and set it to 1024 (400 hex)



    • Thanks Vance. That definitely helps. But this will solve only the lock contention part. All callers of Thread.SpinWait (e.g. Task.Result did show also abnormal high CPU values during some tests) will still spin 14 times more than before on pre Skylake CPUs.


  8. Does anyone perhaps have any statistics on how much this affects Java (openjdk/Oracle Java) on Linux (and Windows)?


  9. Someone pointed out that the config text I tried to put in my reply showed up as empty. Presumably that is because it was not escaped by the blog software. Here is another attempt. You can set the SpinLimitProcFactor config to reduce the amount of spinning in .NET when taking locks by having the following config file.

    <?xml version="1.0" encoding="utf-8" ?>
    <SpinLimitProcFactor enabled="400" />

    The above may show up as escaped XML or it may show up properly. I can’t tell until after I post it…


    • The config file changes described here do not seem to work for .NET 4.6.1. The environment variable does affect spinning though. Any ideas why?


      • .NET is bound to the OS support date. Since 4.6.1 is bound according to https://en.wikipedia.org/wiki/.NET_Framework_version_history to Windows 7 SP1, 8, 8.1 Update, 10 v1507 2008 R2 SP1, 2012, 2012 R2 Update. That means you do not get support for e.g. Windows 2012 R2 because regular support ended already 2018. Extended support ends in 2023. Depending on your OS you should consider updating to a newer .NET Framework version and OS or you will not be able to run on newer hardware with full speed. Or you will need to pay boatloads of money to MS to backport that fix to your .NET version.


  10. Please check the last .Net Preview Updates on Windows Update:

    Description of Preview of Quality Rollup for .NET Framework 4.6, 4.6.1, 4.6.2, 4.7, 4.7.1, and 4.7.2 for Windows 8.1, RT 8.1, and Server 2012 R2 (KB 4457015)

    Here I see this:

    Conditionally improves Spin-waits in several synchronization primitives to run better on Intel Skylake and more recent microarchitectures. To enable these improvements, set the new configuration variable COMPlus_Thread_NormalizeSpinWait to 1.

    Does this fix the issue?

