Why Skylake CPUs Are Sometimes 50% Slower – How Intel Has Broken Existing Code

I got a call that on newer hardware some performance regression tests have become slower. Not a big deal. Usually it is a bad configuration somewhere in Windows or some BIOS settings were set to non optimal values. But this time we were not able to find a setting that did bring performance back to normal. Since the change was not small 9s vs 19s (blue is old hardware orange is new hardware) we needed to drill deeper:

image

Same OS, Same Hardware, Different CPU – 2 Times Slower

A perf drop from 9,1s to 19,6s is definitely significant. We did more checks if the software version under test, Windows, BIOS settings were somehow different from the old baseline hardware. But nope everything was identical. The only difference was that the same tests were running on different CPUs. Below is a picture of the newest CPU

image

And here is the one used for comparison

image

The Xeon Gold runs on a different CPU Architecture named Skylake which is common to all CPUs produced by Intel since mid 2017. *As commenters have pointed out the consumer Skylake CPUs were released already 2015. The server Xeon CPUs with SkylakeX were released mid 2017. All later CPUs Kaby Lake, … share the same issue. If you are buying current hardware you will get a CPU with Skylake CPU architecture. These are nice machines but as the tests have shown newer and slower is not the right direction. If all else fails get a repro and use a real profiler ™ to drill deeper. When you record the same test on the old hardware and on the new hardware it should quickly lead to somewhere:

image

Remember the diff view in WPA *(Windows Performance Analyzer is a profiling UI which is free and part of the Windows Performance Toolkit which is part of the Windows SDK) shows in the table the delta of Trace 2 (11s) – Trace 1 (19s). Hence a negative delta in the table indicates a CPU consumption increase of the slower test. When we look at the biggest CPU consumer differences we find AwareLock::Contention, JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel and ThreadNative.SpinWait. Everything points towards CPU spinning when threads are competing for locks. But that is a false red herring because spinning is not the root cause of slower performance. Increased lock contention means that something in our software did become slower while holding a lock which as a consequence results in more CPU spinning. I was checking locking times and other key metrics, like disk and alike but I failed to find anything relevant which could explain the performance degradation. Although not logical I turned back to the increased CPU consumption in various methods.

To find where exactly the CPU was stuck would be interesting. WPA has file and line columns but these work only with private symbols which we do not have because it is .NET Framework code. The next best thing is to get the address of the dll where the instruction is located which is called Image RVA (Relative Virtual Address). When I load the same dll into the debugger and then do

u xxx.dll+ImageRVA

then I should see the instruction which was burning most CPU cycles which was basically only one hot address.

image

Lets examine the hot code locations of the different methods with Windbg:

0:000> u clr.dll+0x19566B-10
clr!AwareLock::Contention+0x135:
00007ff8`0535565b f00f4cc6        lock cmovl eax,esi
00007ff8`0535565f 2bf0            sub     esi,eax
00007ff8`05355661 eb01            jmp     clr!AwareLock::Contention+0x13f (00007ff8`05355664)
00007ff8`05355663 cc              int     3
00007ff8`05355664 83e801          sub     eax,1
00007ff8`05355667 7405            je      clr!AwareLock::Contention+0x144 (00007ff8`0535566e)
00007ff8`05355669 f390            pause
00007ff8`0535566b ebf7            jmp     clr!AwareLock::Contention+0x13f (00007ff8`05355664)

We do this for the JIT method as well

0:000> u clr.dll+0x2801-10
clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x124:
00007ff8`051c27f1 5e              pop     rsi
00007ff8`051c27f2 c3              ret
00007ff8`051c27f3 833d0679930001  cmp     dword ptr [clr!g_SystemInfo+0x20 (00007ff8`05afa100)],1
00007ff8`051c27fa 7e1b            jle     clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x14a (00007ff8`051c2817)
00007ff8`051c27fc 418bc2          mov     eax,r10d
00007ff8`051c27ff f390            pause
00007ff8`051c2801 83e801          sub     eax,1
00007ff8`051c2804 75f9            jne     clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x132 (00007ff8`051c27ff)

Now we have a pattern. One time the hot location is a jump instruction and the other time it is a subtraction. But both hot instructions are preceded by the same common instruction named pause. Different methods execute the same CPU instruction which is for some reason very time consuming. Lets measure the duration of the pause instruction to see if we are on the right track.

If You Document A Problem It Becomes A Feature

CPU Pause Duration In ns
Xeon E5 1620v3 3.5GHz 4
Xeon(R) Gold 6126 CPU @ 2.60GHz 43

Pause on the new Skylake CPUs is an order of magnitude slower. Sure things can get faster and sometimes a bit slower. But over 10 times slower? That sounds more like a bug. A little internet search about the pause instruction leads to the Intel manuals where the Skylake Microarchitecture and the pause instruction are explicitly mentioned:

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

image

No this is not a bug, it is a documented feature. There exists even a web page which contains the timings of pretty much all CPU instructions

http://www.agner.org/optimize/instruction_tables.pdf

  • Sandy Bridge     11
  • Ivy Bridege         10
  • Haswell                9
  • Broadwell            9
  • SkylakeX             141

The numbers are CPU cycles. To calculate the actual time you need to divide the cycle counts by the CPU frequency (usually GHz) to get the time in ns.

That means when I execute heavily multithreaded applications on .NET on latest hardware things can become much slower. Someone else has this noticed already in August 2017 and has written an issue for it:  https://github.com/dotnet/coreclr/issues/13388. The issue has been fixed with .NET Core 2.1 and .NET Framework 4.8 Preview contains also the fixes for it.

https://github.com/Microsoft/dotnet-framework-early-access/blob/master/release-notes/build-3621/dotnet-build-3621-changes.md#clr

Improved spin-waits in several synchronization primitives to perform better on Intel Skylake and more recent microarchitectures. [495945, mscorlib.dll, Bug]

But since .NET 4.8 is still one year away I have requested a backport of the fixes to get .NET 4.7.2 back to speed on latest hardware. Since many parts of .NET are using spinlocks you should look out for increased CPU consumption around Thread.SpinWait and other spinning methods.

 

image

E.g. Task.Result will internally Spin internally where I could see for other tests also a significant increase in CPU consumption and degraded performance.

How Bad Is It?

I have looked at the .NET Core code how long the CPU will keep spinning when the lock is not released before calling into WaitForSingleObject to pay the “expensive” context switch. A context switch is somewhere in the microsecond region and becomes much slower when many threads are waiting on the same kernel object.

.NET Locks multiply the maximum Spin duration with the number of cores which has the fully contended case in mind where every core has a thread waiting for the same lock and tries to spin long enough to give everyone a chance to work a bit before paying for the kernel call. Spinning inside .NET uses an exponential back off algorithm where spinning starts with 50 pause calls in a loop where for each iteration the number of spins is multiplied by 3 until the next spin count becomes greater than the maximum spin duration. I have calculated the total time how long a thread would spin on pre Skylake CPU and current Skylake CPUs for various core numbers:

image

Below is some simplified code how .NET Locks perform spinning:

/// <summary>
/// This is how .NET is spinning during lock contention minus the Lock taking/SwitchToThread/Sleep calls
/// </summary>
/// <param name="nCores"></param>
void Spin(int nCores)
{
	const int dwRepetitions = 10;
	const int dwInitialDuration = 0x32;
	const int dwBackOffFactor = 3;
	
	int dwMaximumDuration = 20 * 1000 * nCores;

	for (int i = 0; i < dwRepetitions; i++)
	{
		int duration = dwInitialDuration;
		do
		{
			for (int k = 0; k < duration; k++)
			{
				Call_PAUSE();
			}
			duration *= dwBackOffFactor;
		}
		while (duration < dwMaximumDuration);
	}
}

The old spinning times were in the millisecond region (19ms for 24 cores) which is already quite a lot compared to the always mentioned high costs of context switches which are an order of magnitude faster. But with Skylake CPUs the total CPU Spinning times for a contended lock have exploded and we will spin up to 246ms on a 24 or 48 core machine only because the latency of the new Intel CPUs has increased the pause instruction by a factor 14. Is this really the case? I have created  a small tester to check full CPU spinning and the calculated numbers nicely match my expectations. I have 48 threads waiting on a 24 core machine for a single lock where I call Monitor.PulseAll to let the race begin:

image

Only one thread will win the race but 47 threads will spin until the give up. This is experimental evidence that we indeed have a real issue with CPU consumption and very long Spin times are a real issue. Excessive spinning hurts scalability because CPU cycles are burned where other threads might need the CPU, although the usage of the pause instruction frees up some of the shared CPU resources while “sleeping” for longer times. The reason for spinning is to acquire the lock fast without going to the kernel. If that is true the increased CPU consumption might not look good in task manager but it should not influence performance at all as long as there are cores left for other tasks. But what the tests did show that nearly single threaded operations where one thread adds something to a worker queue while the worker thread waits for work and then performs some task with the work item are slowed down.

The reason for that can be shown best with a diagram. Spinning for a contended lock happens in steps where the amount of spinning is tripled after each step. After each Spin Round the lock checks again if the current thread can get it. While spinning the lock tries to be fair and switches over to other threads from time to time to help the other thread/s to complete its work. That increases the chances the lock has been released when we check again later. The problem is that only after a complete Spin Round has completed the lock checks if it can be taken:

image

If e.g. during Spin Round 5 the lock becomes signaled right after we did start Round 5 we wait for the complete Spin Round until we can acquire the lock. By calculating the spin duration for the last round we can estimate the worst case of delay that can happen to our thread:

image

That are many milliseconds we can wait until spinning has completed. Is that a real issue?

I have created a simple test application that implements a producer consumer queue where the worker thread works for each work item 10ms and the consumer has a delay of 1-9 ms before sending in the next work item. That is sufficient to see the effect:

image

We see for some sender thread delays of one and two ms a total duration of 2,2s whereas the other times we are twice as fast with ca. 1,2s. This shows that excessive CPU spinning is not only a cosmetic issue which only hurts heavily multithreaded applications but also simple producer consumer threading which involves only two threads. For the run above the ETW data speaks on its own that the increased CPU spinning is really the cause for the observed delay:

image

When we zoom into the slow section we find in red the 11ms of spinning although the worker (light blue) has completed its work and has returned the lock a long time ago.

image

The fast non degenerate case looks much better where only 1ms is spent spinning for the the lock.

image

The test application I did use is named  SkylakeXPause and located at https://1drv.ms/u/s!AhcFq7XO98yJgsMDiyTk6ZEt9pDXGA which contains a zip file with the source code and the binaries for .NET Core and .NET 4.5. What I actually did to compare things was to install on the Skylake machine .NET 4.8 Preview which contains the fixes and .NET Core 2.0 which still implements the old spinning behavior. The application targets .NET Standard 2.0 and .NET 4.5 which produces an exe and a dll. Now I can test the old and new spinning behavior side by side without the need to patch anything which is very convenient.

readonly object _LockObject = new object();
int WorkItems;
int CompletedWorkItems;
Barrier SyncPoint;
	
void RunSlowTest()
{
	const int processingTimeinMs = 10;
	const int WorkItemsToSend = 100;
	Console.WriteLine($"Worker thread works {processingTimeinMs} ms for {WorkItemsToSend} times");

	// Test one sender one receiver thread with different timings when the sender wakes up again
	// to send the next work item

	// synchronize worker and sender. Ensure that worker starts first
	double[] sendDelayTimes = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

	foreach (var sendDelay in sendDelayTimes)
	{
		SyncPoint = new Barrier(2);  // one sender one receiver

		var sw = Stopwatch.StartNew();
		Parallel.Invoke(() => Sender(workItems: WorkItemsToSend,          delayInMs: sendDelay),
						() => Worker(maxWorkItemsToWork: WorkItemsToSend, workItemProcessTimeInMs: processingTimeinMs));
		sw.Stop();
		Console.WriteLine($"Send Delay: {sendDelay:F1} ms Work completed in {sw.Elapsed.TotalSeconds:F3} s");
		Thread.Sleep(100);  // show some gap in ETW data so we can differentiate the test runs
	}
}

/// <summary>
/// Simulate a worker thread which consumes CPU which is triggered by the Sender thread
/// </summary>
void Worker(int maxWorkItemsToWork, double workItemProcessTimeInMs)
{
	SyncPoint.SignalAndWait();

	while (CompletedWorkItems != maxWorkItemsToWork)
	{
		lock (_LockObject)
		{
			if (WorkItems == 0)
			{
				Monitor.Wait(_LockObject); // wait for work
			}

			for (int i = 0; i < WorkItems; i++)
			{
				CompletedWorkItems++;
				SimulateWork(workItemProcessTimeInMs); // consume CPU under this lock
			}

			WorkItems = 0;
		}
	}
}

/// <summary>
/// Insert work for the Worker thread under a lock and wake up the worker thread n times
/// </summary>
void Sender(int workItems, double delayInMs)
{
	CompletedWorkItems = 0; // delete previous work
	SyncPoint.SignalAndWait();
	for (int i = 0; i < workItems; i++)
	{
		lock (_LockObject)
		{
			WorkItems++;
			Monitor.PulseAll(_LockObject);
		}
		SimulateWork(delayInMs);
	}
}

Conclusions

This is not a .NET issue. It affects all Spinlock implementations which use the pause instruction. I have done a quick check into the Windows Kernel of Server 2016 but there is no issue like that visible. Looks like Intel was kind enough to give them a hint that some changes in the spinning strategy are needed.

When the issue was reported to .NET Core in August 2017 in September 2017 it was already fixed and pushed out with .NET Core 2.0.3 (https://github.com/dotnet/coreclr/issues/13388). It is not only that the reaction speed of the .NET Core team is amazing but also that the issue has been fixed on the mono branch a few days ago now as well and discussions about even more Spinning improvements are ongoing. Unfortunately the Desktop .NET Framework is not moving as fast but at least we have with .NET Framework 4.8 Preview at least a proof of concept that the fixes work there as well. Now I am waiting for the backport to .NET 4.7.2 to be able to use .NET at its full speed also on latest hardware. This was my first bug which was directly related to a performance change in one CPU instruction. ETW remains the profiling tool of choice on Windows. If I had a wish I would Microsoft make to port the ETW infrastructure to Linux because the current performance tooling still sucks at Linux. There were some interesting kernel capabilities added recently but an analysis tool like WPA remains yet to be seen there.

If you are running .NET Core 2.0 or desktop .NET Framework on CPUs which were produced since mid 2017 you should definitely check out your application with a profiler if you are running at reduced speed due to this issue and upgrade to the newer .NET Core and hopefully soon .NET Desktop version. My test application can tell you if you could be having issues

D:\SkylakeXPause\bin\Release\netcoreapp2.0>dotnet SkylakeXPause.dll -check
Did call pause 1,000,000 in 3.5990 ms, Processors: 8
No SkylakeX problem detected

or 

D:\SkylakeXPause\SkylakeXPause\bin\Release\net45>SkylakeXPause.exe -check
Did call pause 1,000,000 in 3.6195 ms, Processors: 8
No SkylakeX problem detected

The tool will report an issue only if you are running a not fixed .NET Framework on a Skylake CPU. I hope you did find the issue as fascinating as I did. To really understand an issue you need to create a reproducer which allows you to experiment to find all relevant influencing factors. The rest is just boring work but now I understand the reasons and consequences of CPU spinning much better.

* Denotes changes to make things more clear and add new insights. This article has gained quite some traction at Hacker News (https://news.ycombinator.com/item?id=17336853) and Reddit  (https://www.reddit.com/r/programming/comments/8ry9u6/why_skylake_cpus_are_sometimes_50_slower/). It is even mentioned at Wikipedia (https://en.wikipedia.org/wiki/Skylake_(microarchitecture)). Wow. Thanks for the interest.

Advertisements