I got a call that on newer hardware some performance regression tests have become slower. Not a big deal. Usually it is a bad configuration somewhere in Windows or some BIOS settings were set to non optimal values. But this time we were not able to find a setting that did bring performance back to normal. Since the change was not small 9s vs 19s (blue is old hardware orange is new hardware) we needed to drill deeper:
Same OS, Same Hardware, Different CPU – 2 Times Slower
A perf drop from 9,1s to 19,6s is definitely significant. We did more checks if the software version under test, Windows, BIOS settings were somehow different from the old baseline hardware. But nope everything was identical. The only difference was that the same tests were running on different CPUs. Below is a picture of the newest CPU
And here is the one used for comparison
The Xeon Gold runs on a different CPU Architecture named Skylake which is common to all CPUs produced by Intel since mid 2017. If you are buying current hardware you will get a CPU with Skylake CPU architecture. These are nice machines but as the tests have shown newer and slower is not the right direction. If all else fails get a repro and use a real profiler ™ to drill deeper. When you record the same test on the old hardware and on the new hardware it should quickly lead to somewhere:
Remember the diff view in WPA shows in the table the delta of Trace 2 (11s) – Trace 1 (19s). Hence a negative delta in the table indicates a CPU consumption increase of the slower test. When we look at the biggest CPU consumer differences we find AwareLock::Contention, JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel and ThreadNative.SpinWait. Everything points towards CPU spinning when threads are competing for locks. But that is a false red herring because spinning is not the root cause of slower performance. Increased lock contention means that something in our software did become slower while holding a lock which as a consequence results in more CPU spinning. I was checking locking times and other key metrics, like disk and alike but I failed to find anything relevant which could explain the performance degradation. Although not logical I turned back to the increased CPU consumption in various methods.
To find where exactly the CPU was stuck would be interesting. WPA has file and line columns but these work only with private symbols which we do not have because it is .NET Framework code. The next best thing is to get the address of the dll where the instruction is located which is called Image RVA. When I load the same dll into the debugger and then do
then I should see the instruction which was burning most CPU cycles which was basically only one hot address.
Lets examine the hot code locations of the different methods with Windbg:
0:000> u clr.dll+0x19566B-10 clr!AwareLock::Contention+0x135: 00007ff8`0535565b f00f4cc6 lock cmovl eax,esi 00007ff8`0535565f 2bf0 sub esi,eax 00007ff8`05355661 eb01 jmp clr!AwareLock::Contention+0x13f (00007ff8`05355664) 00007ff8`05355663 cc int 3 00007ff8`05355664 83e801 sub eax,1 00007ff8`05355667 7405 je clr!AwareLock::Contention+0x144 (00007ff8`0535566e) 00007ff8`05355669 f390 pause 00007ff8`0535566b ebf7 jmp clr!AwareLock::Contention+0x13f (00007ff8`05355664)
We do this for the JIT method as well
0:000> u clr.dll+0x2801-10 clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x124: 00007ff8`051c27f1 5e pop rsi 00007ff8`051c27f2 c3 ret 00007ff8`051c27f3 833d0679930001 cmp dword ptr [clr!g_SystemInfo+0x20 (00007ff8`05afa100)],1 00007ff8`051c27fa 7e1b jle clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x14a (00007ff8`051c2817) 00007ff8`051c27fc 418bc2 mov eax,r10d 00007ff8`051c27ff f390 pause 00007ff8`051c2801 83e801 sub eax,1 00007ff8`051c2804 75f9 jne clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x132 (00007ff8`051c27ff)
Now we have a pattern. One time the hot location is a jump instruction and the other time it is a subtraction. But both hot instructions are preceded by the same common instruction named pause. Different methods execute the same CPU instruction which is for some reason very time consuming. Lets measure the duration of the pause instruction to see if we are on the right track.
If You Document A Problem It It Becomes A Feature
|CPU||Pause Duration In ns|
|Xeon E5 1620v3 3.5GHz||4|
|Xeon(R) Gold 6126 CPU @ 2.60GHz||43|
Pause on the new Skylake CPUs is an order of magnitude slower. Sure things can get faster and sometimes a bit slower. But over 10 times slower? That sounds more like a bug. A little internet search about the pause instruction leads to the Intel manuals where the Skylake Microarchitecture and the pause instruction are explicitly mentioned:
No this is not a bug, it is a documented feature. There exists even a web page which contains the timings of pretty much all CPU instructions
- Sandy Bridge 11
- Ivy Bridege 10
- Haswell 9
- Broadwell 9
- SkylakeX 141
The numbers are CPU cycles. To calculate the actual time you need to divide the cycle counts by the CPU frequency (usually GHz) to get the time in ns.
That means when I execute heavily multithreaded applications on .NET on latest hardware things can become much slower. Someone else has this noticed already in August 2017 and has written an issue for it: https://github.com/dotnet/coreclr/issues/13388. The issue has been fixed with .NET Core 2.1 and .NET Framework 4.8 Preview contains also the fixes for it.
Improved spin-waits in several synchronization primitives to perform better on Intel Skylake and more recent microarchitectures. [495945, mscorlib.dll, Bug]
But since .NET 4.8 is still one year away I have requested a backport of the fixes to get .NET 4.7.2 back to speed on latest hardware. Since many parts of .NET are using spinlocks you should look out for increased CPU consumption around Thread.SpinWait and other spinning methods.
E.g. Task.Result will internally Spin internally where I could see for other tests also a significant increase in CPU consumption and degraded performance.
How Bad Is It?
I have looked at the .NET Core code how long the CPU will keep spinning when the lock is not released before calling into WaitForSingleObject to pay the “expensive” context switch. A context switch is somewhere in the microsecond region and becomes much slower when many threads are waiting on the same kernel object.
.NET Locks multiply the maximum Spin duration with the number of cores which has the fully contended case in mind where every core has a thread waiting for the same lock and tries to spin long enough to give everyone a chance to work a bit before paying for the kernel call. Spinning inside .NET uses an exponential back off algorithm where spinning starts with 50 pause calls in a loop where for each iteration the number of spins is multiplied by 3 until the next spin count becomes greater than the maximum spin duration. I have calculated the total time how long a thread would spin on pre Skylake CPU and current Skylake CPUs for various core numbers:
Below is some simplified code how .NET Locks perform spinning:
The old spinning times were in the millisecond region (19ms for 24 cores) which is already quite a lot compared to the always mentioned high costs of context switches which are an order of magnitude faster. But with Skylake CPUs the total CPU Spinning times for a contended lock have exploded and we will spin up to 246ms on a 24 or 48 core machine only because the latency of the new Intel CPUs has increased the pause instruction by a factor 14. Is this really the case? I have created a small tester to check full CPU spinning and the calculated numbers nicely match my expectations. I have 48 threads waiting on a 24 core machine for a single lock where I call Monitor.PulseAll to let the race begin:
Only one thread will win the race but 47 threads will spin until the give up. This is experimental evidence that we indeed have a real issue with CPU consumption and very long Spin times are a real issue. Excessive spinning hurts scalability because CPU cycles are burned where other threads might need the CPU, although the usage of the pause instruction frees up some of the shared CPU resources while “sleeping” for longer times. The reason for spinning is to acquire the lock fast without going to the kernel. If that is true the increased CPU consumption might not look good in task manager but it should not influence performance at all as long as there are cores left for other tasks. But what the tests did show that nearly single threaded operations where one thread adds something to a worker queue while the worker thread waits for work and then performs some task with the work item are slowed down.
The reason for that can be shown best with a diagram. Spinning for a contended lock happens in steps where the amount of spinning is tripled after each step. After each Spin Round the lock checks again if the current thread can get it. While spinning the lock tries to be fair and switches over to other threads from time to time to help the other thread/s to complete its work. That increases the chances the lock has been released when we check again later. The problem is that only after a complete Spin Round has completed the lock checks if it can be taken:
If e.g. during Spin Round 5 the lock becomes signaled right after we did start Round 5 we wait for the complete Spin Round until we can acquire the lock. By calculating the spin duration for the last round we can estimate the worst case of delay that can happen to our thread:
That are many milliseconds we can wait until spinning has completed. Is that a real issue?
I have created a simple test application that implements a producer consumer queue where the worker thread works for each work item 10ms and the consumer has a delay of 1-9 ms before sending in the next work item. That is sufficient to see the effect:
We see for some sender thread delays of one and two ms a total duration of 2,2s whereas the other times we are twice as fast with ca. 1,2s. This shows that excessive CPU spinning is not only a cosmetic issue which only hurts heavily multithreaded applications but also simple producer consumer threading which involves only two threads. For the run above the ETW data speaks on its own that the increased CPU spinning is really the cause for the observed delay:
When we zoom into the slow section we find in red the 11ms of spinning although the worker (light blue) has completed its work and has returned the lock a long time ago.
The fast non degenerate case looks much better where only 1ms is spent spinning for the the lock.
The test application I did use is named SkylakeXPause and located at https://1drv.ms/u/s!AhcFq7XO98yJgsMDiyTk6ZEt9pDXGA which contains a zip file with the source code and the binaries for .NET Core and .NET 4.5. What I actually did to compare things was to install on the Skylake machine .NET 4.8 Preview which contains the fixes and .NET Core 2.0 which still implements the old spinning behavior. The application targets .NET Standard 2.0 and .NET 4.5 which produces and exe and a dll. Now I can test the old and new spinning behavior side by side without the need to patch anything which is very convenient.
This is not a .NET issue. It affects all Spinlock implementations which use the pause instruction. I have done a quick check into the Windows Kernel of Server 2016 but there is no issue like that visible. Looks like Intel was kind enough to give time a hint that some changes in the spinning strategy are needed.
When the issue was reported to .NET Core in August 2017 in September 2017 it was already fixed and pushed out with .NET Core 2.0.3 (https://github.com/dotnet/coreclr/issues/13388). It is not only that the reaction speed of the .NET Core team is amazing but also that the issue has been fixed on the mono branch a few days ago now as well and discussions about even more Spinning improvements are ongoing. Unfortunately the Desktop .NET Framework is not moving as fast but at least we have with .NET Framework 4.8 Preview at least a proof of concept that the fixes work there as well. Now I am waiting for the backport to .NET 4.7.2 to be able to use .NET at its full speed also on latest hardware. This was my first bug which was directly related to a performance change in one CPU instruction. ETW remains the profiling tool of choice on Windows. If I had a wish I would Microsoft make to port the ETW infrastructure on Linux because the current performance tooling still sucks at Linux. There were some interesting kernel capabilities added recently but an analysis tool like WPA remains yet to be seen there.
If you are running .NET Core 2.0 or desktop .NET Framework on CPUs which were produced since mid 2017 you should definitely check out your application with a profiler if you are running at reduced speed due to this issue and upgrade to the newer .NET Core and hopefully soon .NET Desktop version. My test application can tell you if you could be having issues
D:\SkylakeXPause\bin\Release\netcoreapp2.0>dotnet SkylakeXPause.dll -check Did call pause 1,000,000 in 3.5990 ms, Processors: 8 No SkylakeX problem detected or D:\SkylakeXPause\SkylakeXPause\bin\Release\net45>SkylakeXPause.exe -check Did call pause 1,000,000 in 3.6195 ms, Processors: 8 No SkylakeX problem detected
The tool will report an issue only if you are running on a not fixed .NET Framework on a Skylake CPU. I hope you did find the issue as fascinating as I did. To really understand an issue you need to create a reproducer which allows you to experiment to find all relevant influencing factors. The rest is just boring work but now I do understand the reasons and consequences of CPU spinning much better.