How .NET 4.8 Can Break Your Application

A colleague called me in to look at a broken .NET application after installing a preview of an upcoming Windows KB. The application was not always starting up. Those are easy issues because the problem is fast and easy to reproduce. Since it was happening on a VM with no Visual Studio installed I used the best and most underestimated tool named dnSpy. I learned about this tool by reading the book Pro .NET Memory Management by Konrad Kokosa, which is not only an in depth explanation of how the GC works but also covers many not so well known tools to drill into .NET on Windows and Linux. The most useful sections are in my opinion the last ones, which cover in depth usage of Span and System.IO.Pipelines. The GC is a fascinating piece of software, but in practice you are fastest if you do not need to rely on it. dnSpy is an xcopy deployable Visual Studio Debugger clone which automatically decompiles managed code on the fly, where you can set breakpoints and do line level debugging without having the actual source code. If you ever have to do production debugging you need to have this tool in your toolbox. In many ways dnSpy is better than Windbg, because until today Windbg has no way to debug .NET code with a source code window, which is in my opinion a shame.

One really cool option of dnSpy is to edit the decompiled source code and modify a running binary. Why am I excited about that? Will strong name verification not prevent loading such a tampered assembly? First of all: no. You can recompile a strong named assembly and it will load normally, because since .NET 3.5 SP1 the rules have changed:

Bypass signature verification of trusted assemblies

Starting with the .NET Framework 3.5 Service Pack 1, strong-name signatures are not validated when an assembly is loaded into a full-trust application domain,

Not many people are aware of this. To repeat: strong naming has not given you any security for over 10 years now.
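If you do want the old behavior back for a specific application, the same documentation describes a switch to turn the bypass off again in its configuration file. From memory the App.config entry looks like the following, so treat the exact element name as an assumption and verify it against the linked docs:

<configuration>
  <runtime>
    <!-- Re-enable strong name signature verification for full trust assemblies
         loaded by this application (assumed switch, check the official documentation). -->
    <bypassTrustedAppStrongNames enabled="false" />
  </runtime>
</configuration>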

Being able to patch and reload assemblies makes it easy to force a highly sporadic exception (e.g. connection loss, disk full, …) on production machines by modifying the original method where the exception would be thrown. Suppose you have a small application which for some reason ends up in the general exception handler, where the error handling does bad things, with code like this:

        static void Main(string[] args)
        {
            try
            {
                new FileStream(@"C:\temp\NotexistingFile.txt", FileMode.Open);
                Console.WriteLine("Got Here");
            }
            catch(FileNotFoundException ex)
            {
                Console.WriteLine("File was not found.");
            }
            catch(Exception ex)
            {
                Console.WriteLine("Fatal exception encountered!");
            }
        }

image

If you start the application from dnSpy it will skip using Native Images, which gives you, even for release builds, all local variables with full fidelity. When debugging the same application in VS 2019, even with Source Server support, it often cannot locate the .NET Framework sources and you see no local variables and no C# code:

image

That is not nice. What is the reason for that? First, Visual Studio uses Native Images if they are present. If these were compiled without debug settings, the local variables are optimized away. Even if you check

image

in Debugger – Options it will not help, because that setting only affects JITed code such as your application, but not loaded NGenned images. But there is a way out:

1. Set the environment variable COMPLUS_ZAPDISABLE=1 or the registry value HKLM\SOFTWARE\Microsoft\.NETFramework\ZapDisable=1 and launch from that shell either VS or the application you want to attach to later. See here for more information.

2. Create a file in C:\Windows\Microsoft.NET\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.ini with the content

[.NET Framework Debugging Control]
GenerateTrackingInfo=1
AllowOptimize=0

Point 1 will disable loading of native images. If you set this system wide your application will start much slower, but all of your code will now be JITed. Point 2 will disable JIT optimizations for a given dll, where the ini file must be named after the assembly: for xxxx.dll it is xxxx.ini, but NOT xxxx.dll.ini or xxxx.ini.dll. You need to do this for each assembly for which you want to disable JIT optimizations. The post of Scott Hanselman gives misleading advice here: it leaves you with the impression that you need to create that ini file only for your executable, which is not the case. To make things easier I have wrapped that stuff into a PowerShell script which allows me to disable the use of native images system wide and to generate the ini files in the GAC and additionally in my application directory.

PS D:\Utils\EnableLocalVars.ps1 -enable 1 -dir "C:\Program Files\Test" -gac -ngen 0
Did set NGen Flag to 0. NGen Images are disabled now. This will impact application startup!
Adding 1518 files from GAC
Adding 68 files from GAC64
Adding 2 files from C:\Program Files\Test
Skipped 8 files because ini file already exists
Created 1580 ini files.

That script makes it easy to enable/disable on an installed machine. I have put a link to my OneDrive Folder here.

.\EnableLocalVars.ps1 
EnableLocalVars.ps1 -enable 0/1 -dir xxxx [-recurse] or [-gac] [-Ngen 0/1]
 by Alois Kraus 2018-2019
 Use this script when you want to attach to a running process or you want to get memory dumps of crashed processes of release builds with all local variables not optimized away.
 This will disable JIT optimizations to enable viewing local variables of methods in a debugger which otherwise would be optimized away.
  -enable 0/1         Enable local variables to view in debugger
                      If 1 enable local variables. The .ini files are created.
                      If 0 disable local variables. The .ini files are removed.
  -dir directory      Select directory where all dll/exe files get a .ini file
  -recurse            If set it will search recursively for all dlls below that path
  -gac                Make all GAC dlls debuggable
  -NGen 0/1           Disable/enable usage of NGen binaries system wide by setting the registry key HKLM\SOFTWARE\Microsoft\.NETFramework\ZapDisable
                      1 Enable NGen Images  == ZapDisable=0
                      0 Disable NGen Images == ZapDisable=1
                      You need -NGen 0 if you want to view variables of .NET Framework code or if your application was NGenned
You need to restart the applications after executing this command for the changes to take effect.
Examples
 Enable locals for all GAC and  application dlls on an installed and NGenned machine
   EnableLocalVars.ps1 -enable 1 -gac -dir "c:\program files\myApp\bin" -ngen 0
 Reset settings
   EnableLocalVars.ps1 -enable 0 -gac -dir "c:\program files\myApp\bin" -ngen 1
 Enable NGen Images again
  EnableLocalVars.ps1  -ngen 1
WARNING: NGen images are disabled. Do not forget to enable them (EnableLocalVars.ps1 -Ngen 1) when you are done or all .NET applications will start ca. x2 times slower!

It comes with extensive help, so it should be easy enough to perform this step automatically before taking a memory dump. This will make memory dump debugging a whole lot easier. Usually I make the injected error configurable, e.g. by checking for the existence of a file, so that I can force the error in .NET Framework code only when I want to, like this:

image

This trick helped me a lot to force the error only after the application had reached its initial, non faulted state. Now I can trigger the issue by creating a file from the command line with

 C>echo Hi > c:\temp\ForceError.txt

This will trigger the exception in the patched file:

if (File.Exists("C:\\Temp\\ForceError.txt"))
{
         throw new Exception("This one is rare!");
}

When you try to save the changed mscorlib.dll you get an access denied error because the file is in use

image

But since we are pros we rename the running mscorlib.dll with no fear

image

because we know that Windows always opens dlls with the FILE_SHARE_DELETE sharing permission. That gives us the right to rename a running dll at any time. We can now save the patched mscorlib.dll without any issue:

image
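The rename works because of the FILE_SHARE_DELETE sharing mode mentioned above. You can convince yourself with a tiny standalone C# sketch that opens a file of its own with the same sharing flags (nothing mscorlib specific, just a demo of the sharing semantics):

using System;
using System.IO;

class RenameWhileOpen
{
    static void Main()
    {
        File.WriteAllText("demo.dll", "dummy content");

        // Open the file but allow others to delete/rename it, similar to how the OS loader maps dlls.
        using (var fs = new FileStream("demo.dll", FileMode.Open, FileAccess.Read,
                                       FileShare.Read | FileShare.Delete))
        {
            File.Move("demo.dll", "demo_old.dll"); // succeeds although the file is still open
            Console.WriteLine("Renamed while the handle is still open.");
        }
    }
}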

When we run the modified mscorlib.dll we indeed get into the impossible error region once the trigger is activated:

image

That is a great way to test how robust your application really is if you suspect that the error handling is wonky. But I am digressing here. Now back to the .NET 4.8 issue.

When I debugged it with Windbg the error did occur, but not with dnSpy, which left me puzzled. If you have followed this far you can probably guess that this issue only surfaces when native images are loaded into the process. That was the reason why dnSpy was working perfectly. Enabling local variables and NGenning the binaries again helped to find out what was happening. The exact details are not important, but it suffices to say that the application used an old Dependency Injection framework named ObjectBuilder which was part of the Enterprise Library many years ago. To select the right ctor the default policy was to always use the first one returned by a call to Type.GetConstructors:

// Microsoft.Practices.ObjectBuilder.DefaultCreationPolicy
public ConstructorInfo SelectConstructor(IBuilderContext context, Type typeToBuild, string idToBuild)
{
    ConstructorInfo[] constructors = typeToBuild.GetConstructors();
    if (constructors.Length > 0)
    {
        return constructors[0];
    }
    return null;
}
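As we will see shortly, relying on the order of GetConstructors is the actual problem. A selection policy that does not depend on reflection order would ask for the constructor it wants explicitly. Here is a hedged sketch of such a policy (my own code, not part of ObjectBuilder):

public ConstructorInfo SelectConstructor(IBuilderContext context, Type typeToBuild, string idToBuild)
{
    // Prefer the parameterless constructor explicitly instead of trusting the
    // undefined order of GetConstructors().
    ConstructorInfo defaultCtor = typeToBuild.GetConstructor(Type.EmptyTypes);
    if (defaultCtor != null)
    {
        return defaultCtor;
    }

    // Fall back to a deterministic choice, e.g. the constructor with the fewest parameters.
    ConstructorInfo best = null;
    foreach (ConstructorInfo ctor in typeToBuild.GetConstructors())
    {
        if (best == null || ctor.GetParameters().Length < best.GetParameters().Length)
        {
            best = ctor;
        }
    }
    return best;
}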

This GetConstructors based selection worked well for over a decade, but no more with .NET 4.8. If you read the API help for Type.GetConstructors

Remarks

The GetConstructors method does not return constructors in a particular order, such as declaration order. Your code must not depend on the order in which constructors are returned, because that order varies.

it becomes evident that this approach can cause problems. In this particular case the first ctor used to be the default ctor for a given type, but with .NET 4.8 and an NGenned application the order of ctors changed. The first ctor was now ctor(string), which caused object creation to fail miserably. To hit this issue many things must come together. If you e.g. add another type to the assembly that contains the affected type, the issue goes away, because for the error to surface your IL metadata needs to be laid out at specific offsets. I have created a small reproducer which contains 10K classes with two ctors and shows the opposite effect: with JITed code one class has a mixed ctor order. The source code is here.

D:\CtorOrderNgenNotNGen\bin\Debug>CtorOrderNgenNotNGen.exe
Type DerivedSealed1808 has different ordered ctor!

The application checks whether the first returned ctor is the default ctor, which should nearly always be the case:

        private static void CheckCtorOrder()
        {
            foreach(var type in typeof(Base).Assembly.GetTypes())
            {
                if( type.Name.StartsWith("DerivedSealed"))
                {
                    ConstructorInfo[] infos = type.GetConstructors();
                    if( infos[0].GetParameters().Length > 0 )
                    {
                        Console.WriteLine($"Type {type.Name} has different ordered ctor!");
                    }
                }
            }
        }

After NGenning the application all classes have the same ctor order. This is pretty fragile. The advice on MSDN is there for a reason. I have checked other, more popular Dependency Injection frameworks, namely

  • Spring.NET
  • Windsor
  • AutoFac
  • Unity

which have all done their homework and do not contain such problematic code. I might be the only one hitting that issue. But if you find strange application failures after a Windows KB, check whether the KB contains .NET 4.8, which could be the reason why your .NET application fails. .NET 4.8 has been available for 6 months and is now pushed out via Windows Update as a mandatory update. MS is careful not to roll this out globally at once. Currently every 20th machine which requests all mandatory Windows Updates will get it, but I am not sure when the switch will be flipped to 100%. This approach of MS is good, but if you wonder why only some machines of your department behave strangely you should check whether they really have the same KBs installed, even if everyone did install all mandatory updates. If you have a larger department and maintain your own Windows Update Service you should definitely keep an eye on strange .NET application behavior after you get the update which contains .NET 4.8. There are some issues with themed WPF context menus which can lead to misplaced, empty or non themed context menus, but these should happen only sporadically. WPF still has issues with grouped Virtualizing StackPanels (since .NET 4.5 where they were introduced) which can cause indefinite layout calls that surface as UI hangs due to semi memory leaks. But that is part of another story.


The Unknown Error Code

When you look at log files you will frequently find negative error codes in them (they are usually the reason why you are looking at the log file in the first place). For some of them Google will tell you what they mean, but pretty frequently you end up with nothing. Why is that so? Most likely you have a COM error code in front of you. These have a different structure, which is described here.

The most important thing to remember about COM error codes is that if bit 31 is set, the code signals an error. And when bit 31 is 1, the signed 32 bit value becomes negative, which explains the strange values of COM error codes. Bits 0..15 contain, most of the time, a Win32 error code. Other bits above bit 15 can be set as well, but these are irrelevant for normal readers of an error code.

If you ever encounter a negative error code in a log file you can follow this simple recipe to find out what it might be about. Let's suppose you have e.g. Error: -2147480631.

1. Open calc.exe, set it to Programmer mode and paste the value as a decimal number.

image

2. Take the lower 4 hex digits of the number (here BC9). Type BC9 into calc to get the decimal value.

image

3. Google for that Win32 error code or type in a command shell

net helpmsg 3017

image

and you finally have a hint why some chained installer did fail. In this case a file in use issue caused an MSI installer to demand a reboot, and in the installer log file only that cryptic error code was to be found. Not many people seem to know that a negative error code is mostly a hint to take the lower 16 bits, convert them to decimal and interpret them as a Win32 error code.
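The same decoding can be done in a few lines of C#. Win32Exception resolves the message text from the same system message table that net helpmsg uses (a small sketch):

using System;
using System.ComponentModel;

class HResultDecoder
{
    static void Main()
    {
        int error = -2147480631;          // 0x80000BC9 as a signed 32 bit value
        bool isFailure = error < 0;       // bit 31 set => the code signals an error
        int win32Code = error & 0xFFFF;   // the lower 16 bits usually hold the Win32 error code

        Console.WriteLine($"Failure: {isFailure}, Win32 code: {win32Code}");
        Console.WriteLine(new Win32Exception(win32Code).Message); // same text as: net helpmsg 3017
    }
}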

If you are a PowerShell guy:

-2147480631 -band 0xffff

image

And here is the shortest form

net helpmsg (-2147480631 -band 0xffff)

image

That’s a simple but powerful trick which should be more widely known.

How Fast Can You Get With .NET Core?

With .NET Core 2.1 released I want to come back to an old SO question of mine where I asked how fast one can parse a text file with C# and .NET 4.0. Now I want to look at how fast we can get today with .NET 4.7.2, .NET Core 2.1 and C++ on x86 and x64. The structure of the text file is a double followed by an integer and a newline, like this

1.1 0
2.1 1
3.1 2

repeated 10 million times. The result is a 179 MB text file which we want to parse as fast as possible.
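For reference, a file of that shape can be generated with a few lines of C# (a sketch, not necessarily the generator used in the repository below):

using System.Globalization;
using System.IO;

class CreateTestData
{
    static void Main()
    {
        // Writes 10 million "double int" lines, which results in roughly 179 MB of text.
        using (var writer = new StreamWriter("NumericData.txt")) // file name is just an example
        {
            for (int i = 0; i < 10_000_000; i++)
            {
                double d = i + 1.1;
                writer.WriteLine(d.ToString("F1", CultureInfo.InvariantCulture) + " " + i);
            }
        }
    }
}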

In 2011 my home machine was a Core Duo 6600 @ 2.40 GHz with Windows 7 running .NET 4.0. Let's see how much the numbers from 2011 have improved now that I have:

  • Better CPU – Haswell CPU Architecture I7 4770K 3.5 GHz
  • Newer Windows – Windows 10 1803
  • Newer .NET Framework – .NET 4.7.2
  • New .NET Core 2.1

To see how fast we can get with managed code I thought it would be good to also compare against a C implementation which reads the ANSI text file with fgets, tokenizes the string with strtok and converts the numbers with atof and atoi. That should be pretty fast, right?

The Tests

The source code for the IOBound project can be found on GitHub. You can download the binaries to test for yourself from here: https://github.com/Alois-xx/IOBound/releases.

 

ReadFile

This one checks how fast we can read a file into one huge buffer from disk. If we execute it a second time the data will be served from the file system cache, so when warmed up we measure the memcopy and soft page fault performance of the OS when it assigns memory to our application. It may look very simple, but it can tell you a lot about disk performance and the caching behavior of your OS.

        /// <summary>
        /// Check native reading speed from the OS
        /// </summary>
        private unsafe static uint ReadFileIntoByteBuffer(string dataFile)
        {
            using (var stream = new FileStream(dataFile, FileMode.Open))
            {
                byte[] buf = new byte[200 * 1024 * 1024];
                fixed (byte* pBuf = &buf[0])
                {
                    uint dwRead = 0;
                    if (ReadFile(stream.SafeFileHandle, pBuf, 200 * 1000 * 1000, out dwRead, IntPtr.Zero) == 0)
                    {
                        throw new Win32Exception();
                    }
                    return dwRead;
                }

            }
        }
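The ReadFile call above relies on a P/Invoke declaration which is not part of the snippet. A matching declaration could look like this (an assumption derived from the call site, not copied from the IOBound sources):

using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class NativeMethods
{
    // Matches the call ReadFile(stream.SafeFileHandle, pBuf, bytesToRead, out dwRead, IntPtr.Zero) == 0
    [DllImport("kernel32.dll", SetLastError = true)]
    internal static extern unsafe int ReadFile(SafeFileHandle hFile, byte* lpBuffer,
        uint nNumberOfBytesToRead, out uint lpNumberOfBytesRead, IntPtr lpOverlapped);
}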

 

MessagePack

With so many professional serializers at our disposal it would be a sin to dismiss one of the fastest ones. MessagePack is an efficient binary format for which a great library exists from Yoshifumi Kawai (aka neuecc), who wrote the fastest one for C#. At 133 MB the resulting file is the smallest one.

    [MessagePackObject]
    public class Container
    {
        [Key(0)]
        public List<double> Doubles = new List<double>();
        [Key(1)]
        public List<int> Ints = new List<int>();
    }

    private static void DeserializeMessagePack(string dataFile)
    {
       using (var stream = new FileStream(dataFile, FileMode.Open))
       {
          Container deser = MessagePack.MessagePackSerializer.Deserialize<Container>(stream);
       }
    }

 

JSON

When using a text based format, JSON is also a must. Interestingly the JSON file is, at 169 MB, even a bit smaller than the original text format. Here we also use the fastest JSON serializer, named Utf8Json. Credits for writing the fastest JSON serializer again go to Yoshifumi Kawai.

        private static void DeserializeUTF8Json(string dataFile)
        {
            using (var stream = new FileStream(dataFile, FileMode.Open))
            {
                Container deser = Utf8Json.JsonSerializer.Deserialize<Container>(stream);
            }
        }

Count Lines

This reads the file line by line but does nothing with the data. Since we do not allocate much except some temporary strings, we find out how fast we can get when reading the UTF8 file and expanding the data to UTF-16 strings.

        private static int CountLines(string dataFile)
        {
            using (var reader = new StreamReader(dataFile))
            {
                string line;
                int count = 0;
                while ((line = reader.ReadLine()) != null)
                {
                    count++;
                }

                return count;
            }
        }

Span Parse

One of the biggest improvements of .NET Core was the introduction of Span based APIs everywhere. They allow you to read characters from raw buffers without the need to copy the data around again. With that you can write nearly allocation free APIs which operate on unmanaged memory without the need to use unsafe code. To make that fast the JIT compiler was enhanced to emit the most efficient code possible. For the regular .NET Framework there exists the System.Memory nuget package to bring the type on board as well, but with no help from the JIT compiler. This one is especially interesting since it shows how much you lose without a helping hand from the JIT compiler.

 

        private static void ParseSpan(string dataFile)
        {
            var doubles = new List<double>();
            var ints = new List<int>();

            double curMeasurement = 0.0d;
            int curMeasurement2 = 0;

            ReadOnlySpan<byte> newLineSpan = NewLine.AsSpan(); // NewLine is a byte[] with the newline byte sequence (declared elsewhere)

            using (var stream = new FileStream(dataFile, FileMode.Open))
            {
                const int bufferSize = 10*4096;
                byte[] buffer = new byte[bufferSize];
                int readBytes = 0;

                int lastLineSize = 0;
                while ((readBytes = stream.Read(buffer, lastLineSize, bufferSize- lastLineSize)) != 0)
                {
                    Span<byte> bufferSpan = new Span<byte>(buffer, 0, readBytes+ lastLineSize);

                    if( bufferSpan.StartsWith(Encoding.UTF8.GetPreamble()) ) // skip byte order mark
                    {
                        bufferSpan = bufferSpan.Slice(Encoding.UTF8.GetPreamble().Length); 
                    }

                    int newLineStart = 0;
                    while( (newLineStart = bufferSpan.IndexOf(newLineSpan)) > 0 )
                    {
                        if( ParseLine( bufferSpan.Slice(0, newLineStart), ref curMeasurement, ref curMeasurement2) )
                        {
                            doubles.Add(curMeasurement);
                            ints.Add(curMeasurement2);
                        }
                        bufferSpan = bufferSpan.Slice(newLineStart + newLineSpan.Length);
                    }

                    bufferSpan.CopyTo(buffer.AsSpan());
                    lastLineSize = bufferSpan.Length;
                }
            }
        }

Unsafe Parse

This is how one could write a single method parser with some hard coded assumptions to become as fast as possible. Interestingly it is still slower than the Span based version, which is also safer.

        unsafe private static void ParseLinesUnsafe(string dataFile)
        {
            var doubles = new List<double>();
            var ints = new List<int>();
            char decimalChar = CultureInfo.CurrentCulture.NumberFormat.NumberDecimalSeparator[0];

            using (var reader = new StreamReader(dataFile))
            {
                string line;
                double d = 0;
                long a = 0, b = 0;
                int ix = 0;
                while ((line = reader.ReadLine()) != null)
                {
                    int len = line.Length;
                    ix = 0;
                    a = 0;
                    b = 0;
                    fixed (char* ln = line)
                    {
                        while (ix < len && char.IsNumber(ln[ix]))
                        {
                            a = a * 10 + (ln[ix++] - '0');
                        }

                        if (ln[ix] == decimalChar)
                        {
                            ix++;
                            long div = 1;
                            while (ix < len && char.IsNumber(ln[ix]))
                            {
                                b = b * 10 + (ln[ix++] - '0');
                                div *= 10;
                            }
                            d = a + ((double)b) / div;
                        }

                        while (ix < len && char.IsWhiteSpace(ln[ix]))
                        {
                            ix++;
                        }

                        int i = 0;
                        while (ix < len && char.IsNumber(ln[ix]))
                        {
                            i = i * 10 + (ln[ix++] - '0');
                        }

                        doubles.Add(d);
                        ints.Add(i);
                    }
                }
            }
        }

 

C++ Is Still A Thing!

C++ should be the fastest, with library routines that have been carefully optimized and profiled for decades. The code looks like plain C from 20 years ago, but it is only faster by a slight margin compared to the naïve Standard Parse C# version.

	char pLine[1024];
	std::vector<double> doubles;
	std::vector<int> ints;
	int lines = 0;

	while (fgets(pLine, sizeof(pLine), fp) != nullptr)
	{
		lines++;
		ParseLine(doubles, ints, pLine);
	}

void ParseLine(std::vector<double> &doubles, std::vector<int> &ints, char *pline)
{
	char *pDouble = strtok(pline, " \n");

	if (pDouble != nullptr)
	{
		double d = atof(pDouble);
		char *pInt = strtok(NULL, " \n");
		if (pInt != nullptr && pInt != pDouble)
		{
			int i = atoi(pInt);
			doubles.push_back(d);
			ints.push_back(i);
		}
	}
	
}

 

Standard Parse C#

That is how you would do it when writing idiomatic C# code.

        private static void ParseLines(string dataFile)
        {
            var doubles = new List<double>();
            var ints = new List<int>();

            using (var reader = new StreamReader(dataFile))
            {
                string line;
                char[] sep = new char[] { ' ' };
                while ((line = reader.ReadLine()) != null)
                {
                    var parts = line.Split(sep);
                    if (parts.Length == 2)
                    {
                        doubles.Add(double.Parse(parts[0]));
                        ints.Add(int.Parse(parts[1]));
                    }
                }
            }
        }

 

The Results

 

image

Newer CPU

One obvious way to become faster is to buy a newer CPU. That works and can give you a speedup of roughly a factor of two if you wait long enough between significant CPU generations. But the economic benefit of shrinking CPU dies has decreased a lot: with 7nm processes at the door you need EUV lithography, and with all of its challenges further huge improvements are not expected soon. And since one atom is about 0.1nm we do not have much room left to scale further. If you want to increase the frequency a lot you also run into fundamental problems. The speed of light in conducting materials is ca. 200,000 km/s. During 1ns (= 1 GHz) a signal can travel 20cm. If you increase the frequency to 10 GHz you can transmit a signal only 2cm far. If your chip is bigger than that you cannot transmit a signal to its destination in the necessary time. There is a natural upper limit on how much faster you can get with current CPU designs if you rely on ack signals of some sort.

With the rise of AI and its huge power consumption we still can and will become much more energy efficient. Although Google is leading with dedicated TensorFlow processors, the power consumption needed to perform simple image recognition is at least 1000 times worse compared to its biological counterparts (that is only a guess and this number is likely much too low). The next big thing are definitely neuromorphic chips which can rewire themselves depending on the problem to solve. With such processors you could make Google search a commodity where not a single corporation owns all of your data. Instead it could be made a public service run by many providers, where the search graph is shared and we as customers can sell our data on our own. Whether that ever becomes true depends on who creates such highly energy efficient chips first.

Newer Windows

The plain ReadFile call is about 4 times faster. That performance increase is not only due to the better CPU but also to the soft page fault performance of Windows 10 Fall Creators Update, which has become better compared to the Windows 7 machine I was using at that time. The better soft fault performance is also part of Windows Server 2019 (finally). More details about soft fault performance can be found here.

32 vs 64 Bit

The biggest difference between x86 and x64 shows up in the C version. It turns out the x86 version is about 40% slower, mainly due to much higher costs in atof, which seems to have a pretty slow x86 implementation. In the screenshot below you can see that ucrtbase:common_atof needs 2.4s vs 1.3s in the x64 version, which is a pretty big perf hit. Although C/C++ can be very efficient, this is not always the case for the runtime libraries used. By using standard libraries you are not guaranteed to get world class performance on your current hardware. Extended instruction sets in particular can only be used by binaries specifically crafted for each CPU. The standard approach in C/C++ is to create a library for each CPU architecture, like

  • libSimd.dll
  • libAVX.dll
  • libAVX256.dll
  • libAVX512.dll

That quickly becomes pretty complicated to test. On the other hand, in a JIT compiled world we still need to test on different CPUs, but we get the best possible assembly code for our CPU for free if the runtime is up to date, as is the case with RyuJIT of .NET Core.

image

The next observation is that the new Span based version is 80% slower on x86 (1.35s vs 0.75s with .NET Core 2.1). If you are after high throughput scenarios it definitely makes sense to switch over to 64 bit and use mainly index based structures to alleviate the 100% increase in pointer size for the most significant cases.

.NET Core 2.1 vs .NET 4.7.2

For streaming data into memory, .NET 4.7.2 is between 2% (Unsafe Parse x86), 7% (Unsafe Parse x64) and 40% (Span Parse x64) slower than .NET Core. Especially the 40% speed drop for Span based APIs, due to the lack of JIT support in the .NET Framework runtime, is unfortunate. But since the .NET Framework was declared legacy and most innovation will happen in .NET Core for the years to come, it is unlikely that the .NET Framework JIT compiler will get these new capabilities. The main reasons were compatibility issues, but we can already switch over to the legacy JIT if we do not want to use the new RyuJIT compiler. It should be easy to make the new JIT compiler of .NET Core optional, so that users of the .NET Framework can enable the new functionality once they have verified that their scenarios work.

 

Conclusions

Based on the numbers it makes sense to switch over to x64 and Span based APIs if you are running on .NET Core. But even if you are not using .NET Core, Span based APIs offer quite a lot. They make it possible to spare the copy operation from unmanaged locations into the managed heap. The result is nearly allocation free applications which can scale great. Another learning is that if you can change your problem a bit, e.g. by changing the data format, you can get even faster. The MessagePack serializer on x64 .NET Core 2.1 is 25% faster compared to the Span text based approach. That is a bit unfair since there is practically no parsing involved, but it shows that you can achieve near optimal performance with efficient binary formats which contain mainly primitive types. Even our pretty generic UTF8 JSON file approach performed 34% better than the x64 C/C++ version. In the end you can become very fast with any language if you profile your application and fix the obvious bottlenecks. To be able to do that you need performance profiling tools for your platform. My favorite tool is ETW, as you might have guessed from my previous posts. Are you profiling your code before problems are reported? What is your favorite profiling tool?

Why Does My Synchronous WCF Call Hang?

This is a limitation of WCF which is not widely known. Let's suppose you have a WCF interface which contains a mixture of Task based and non Task based methods:

[ServiceContract(Namespace = "WCFDispatching")]
public interface IRemotedService
{
    [OperationContract]
    Task<bool> MakeAsyncCall(int id);

    [OperationContract]
    void SyncCall(int id);
}

What will happen when you call both methods?

    async Task Work(IRemotedService service)
    {
        await service.MakeAsyncCall(50);
        service.SyncCall(150);
    }

The sad truth is that the second call will hang indefinitely with a rather long call stack:

System.Threading.WaitHandle.WaitOne
System.Runtime.TimeoutHelper.WaitOne
System.ServiceModel.Dispatcher.DuplexChannelBinder.SyncDuplexRequest.WaitForReply
System.ServiceModel.Dispatcher.DuplexChannelBinder.Request
System.ServiceModel.Channels.ServiceChannel.Call
System.ServiceModel.Channels.ServiceChannelProxy.InvokeService
System.ServiceModel.Channels.ServiceChannelProxy.Invoke
System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke
WCFDispatching.Program.Work
[Resuming Async Method]
System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.InvokeMoveNext
System.Threading.ExecutionContext.RunInternal
System.Threading.ExecutionContext.Run
System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run
System.Runtime.CompilerServices.AsyncMethodBuilderCore.OutputAsyncCausalityEvents.AnonymousMethod__0
System.Runtime.CompilerServices.AsyncMethodBuilderCore.ContinuationWrapper.Invoke
System.Runtime.CompilerServices.TaskAwaiter.OutputWaitEtwEvents.AnonymousMethod__0
System.Runtime.CompilerServices.AsyncMethodBuilderCore.ContinuationWrapper.Invoke
System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction
System.Threading.Tasks.Task.FinishContinuations
System.Threading.Tasks.Task.FinishStageThree
System.Threading.Tasks.Task<bool>.TrySetResult
System.Threading.Tasks.TaskFactory<bool>.FromAsyncCoreLogic
System.Threading.Tasks.TaskFactory<bool>.FromAsyncImpl.AnonymousMethod__0
System.Runtime.AsyncResult.Complete
System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.FinishSend
System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.SendCallback
System.Runtime.Fx.AsyncThunk.UnhandledExceptionFrame
System.Runtime.AsyncResult.Complete
System.ServiceModel.Dispatcher.DuplexChannelBinder.AsyncDuplexRequest.Done
System.ServiceModel.Dispatcher.DuplexChannelBinder.AsyncDuplexRequest.GotReply
System.ServiceModel.Dispatcher.DuplexChannelBinder.HandleRequestAsReplyCore
System.ServiceModel.Dispatcher.DuplexChannelBinder.HandleRequestAsReply
System.ServiceModel.Dispatcher.ChannelHandler.HandleRequestAsReply
System.ServiceModel.Dispatcher.ChannelHandler.HandleRequest
System.ServiceModel.Dispatcher.ChannelHandler.AsyncMessagePump

System.ServiceModel.Dispatcher.ChannelHandler.OnAsyncReceiveComplete
System.Runtime.Fx.AsyncThunk.UnhandledExceptionFrame
System.Runtime.AsyncResult.Complete
System.ServiceModel.Channels.TransportDuplexSessionChannel.TryReceiveAsyncResult.OnReceive
System.Runtime.Fx.AsyncThunk.UnhandledExceptionFrame
System.Runtime.AsyncResult.Complete
System.ServiceModel.Channels.SynchronizedMessageSource.ReceiveAsyncResult.OnReceiveComplete
System.ServiceModel.Channels.SessionConnectionReader.OnAsyncReadComplete
System.ServiceModel.Channels.PipeConnection.OnAsyncReadComplete
System.ServiceModel.Channels.OverlappedContext.CompleteCallback
System.Runtime.Fx.IOCompletionThunk.UnhandledExceptionFrame
System.Threading._IOCompletionCallback.PerformIOCompletionCallback

The interesting thing is that the synchronous call completes on the remote endpoint, but the WCF client call hangs in this call stack. The problem is that WCF runs asynchronous method completions on the WCF channel dispatcher, which seems to be single threaded just like a UI application with a message pump. When a blocking synchronous call is performed, WCF normally waits in a call stack like this for the read operation to complete:

System.Threading.WaitHandle.InternalWaitOne
System.Threading.WaitHandle.WaitOne
System.Runtime.TimeoutHelper.WaitOne
System.ServiceModel.Channels.OverlappedContext.WaitForSyncOperation
System.ServiceModel.Channels.OverlappedContext.WaitForSyncOperation
System.ServiceModel.Channels.PipeConnection.WaitForSyncRead
System.ServiceModel.Channels.PipeConnection.Read
System.ServiceModel.Channels.DelegatingConnection.Read
System.ServiceModel.Channels.SessionConnectionReader.Receive
System.ServiceModel.Channels.SynchronizedMessageSource.Receive
System.ServiceModel.Channels.TransportDuplexSessionChannel.Receive
System.ServiceModel.Channels.TransportDuplexSessionChannel.TryReceive
System.ServiceModel.Dispatcher.DuplexChannelBinder.Request
System.ServiceModel.Channels.ServiceChannel.Call
System.ServiceModel.Channels.ServiceChannelProxy.InvokeService
System.ServiceModel.Channels.ServiceChannelProxy.Invoke
System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke

But in our case we have a different wait call stack: we are not waiting for a read to complete, but in DuplexChannelBinder.SyncDuplexRequest.WaitForReply we are waiting for another thread to set an event that signals completion. This assumes that another thread is still receiving input from the remote connection, which is not the case. We can see this when we look at who sets the event:

image

To release our waiting thread another thread must call GotReply, which is never going to happen. To get things working again you must make all methods of your remoted interface either synchronous or asynchronous. A sync/async mixture of remoted methods will likely cause deadlocks like the one shown above.
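For this sample the fix is to make the second operation Task based as well, so the contract contains only asynchronous methods. This mirrors the SyncCallAsync_ method used in the full sample below:

[ServiceContract(Namespace = "WCFDispatching")]
public interface IRemotedService
{
    [OperationContract]
    Task<bool> MakeAsyncCall(int id);

    [OperationContract]
    Task SyncCallAsync_(int id);   // Task based instead of void: no sync/async mixture anymore
}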

Below is the full sample code to reproduce the issue if you are interested

using System;
using System.ServiceModel;
using System.Threading.Tasks;

namespace WCFDispatching
{
    [ServiceContract(Namespace = "WCFDispatching")]
public interface IRemotedService
{
    [OperationContract]
    Task<bool> MakeAsyncCall(int id);

    [OperationContract]
    void SyncCall(int id);

    [OperationContract]
    Task SyncCallAsync_(int id);
}

[ServiceBehavior(InstanceContextMode = InstanceContextMode.PerSession, ConcurrencyMode = ConcurrencyMode.Multiple, IncludeExceptionDetailInFaults = true)]
public class RemotedService : IRemotedService
{
    public async Task<bool> MakeAsyncCall(int id)
    {
        await Task.Delay(10);
        Console.WriteLine($"Async call with id: {id} completed.");
        return true;
    }

    public async Task SyncCallAsync_(int id)
    {
        await Task.Delay(0);
        Console.WriteLine($"SyncCallAsync call with id {id} called.");
    }

    public void SyncCall(int id)
    {
        Console.WriteLine($"Sync call with id {id} called.");
    }
}

class Program
{
    const string PipeUri = "net.pipe://localhost/WCFDispatching";

    static void Main(string[] args)
    {
        new Program().Run(args);
    }

    bool bUseAsyncVersion = false;
    readonly string Help = "WCFDispatching.exe [-server] [-client [-async]]" + Environment.NewLine +
                "    -server      Create WCF Server" + Environment.NewLine +
                "    -client      Create WCF Client" + Environment.NewLine + 
                "    -async       Call async version of both API calls" + Environment.NewLine +
                "No options means client mode which calls async/sync WCF API which produces a deadlock.";


    private void Run(string[] args)
    {
        if( args.Length == 0)
        {
            Console.WriteLine(Help);
            return;
        }
        else
        {
            for (int i = 0; i < args.Length; i++)
            {
                string arg = args[i];
                switch(arg)
                {
                    case "-server":
                        StartServer();
                        Console.WriteLine("Server started");
                        Console.ReadLine();
                        break;
                    case "-client":
                        // this is the default
                        break;
                    case "-async":
                        bUseAsyncVersion = true;
                        break;
                    default:
                        Console.WriteLine(Help);
                        Console.WriteLine($"Command line argument {arg} is not valid.");
                        return;
                }
            }

            var service = CreateServiceClient<IRemotedService>(new Uri(PipeUri));
            Task waiter = Work(service);
            waiter.Wait();
            return;
        }

    }

    async Task Work(IRemotedService service)
    {
        await service.MakeAsyncCall(50);
        if (bUseAsyncVersion)  // this will work
        {
            await service.SyncCallAsync_(50);
        }
        else
        {
            service.SyncCall(150);  // this call will deadlock!
        }
    }

    internal static T CreateServiceClient<T>(Uri uri)
    {
        var binding = CreateDefaultNetNamedPipeBinding();
        var channelFactory = new ChannelFactory<T>(binding, PipeUri.ToString());
        var serviceClient = channelFactory.CreateChannel();
        var channel = (IContextChannel)serviceClient;
        channel.OperationTimeout = TimeSpan.FromMinutes(10);

        return serviceClient;
    }

    internal static ServiceHost StartServer()
    {
        var host = new ServiceHost(typeof(RemotedService));
        host.AddServiceEndpoint(implementedContract: typeof(IRemotedService), binding: CreateDefaultNetNamedPipeBinding(), address: PipeUri);
        host.Open();

        return host;
    }

    internal static NetNamedPipeBinding CreateDefaultNetNamedPipeBinding()
    {
        //Default setting for NetNamedPipeBinding.MaxReceivedMessageSize = 65,536 bytes
        //Default settings for NetNamedPipeBinding.ReaderQuotas
        //MaxDepth = 32, MaxStringContentLength = 8192, MaxArrayLength = 16384, MaxBytesPerRead = 4096, MaxNameTableCharCount = 16384
        TimeSpan timeOut = TimeSpan.FromMinutes(1000);
        var binding = new NetNamedPipeBinding(NetNamedPipeSecurityMode.None)
        {
            ReceiveTimeout = timeOut,
            MaxReceivedMessageSize = Int32.MaxValue,
            ReaderQuotas =
            {
                MaxArrayLength = Int16.MaxValue,
                MaxStringContentLength = Int32.MaxValue,
                MaxBytesPerRead = Int32.MaxValue
            }
        };
        return binding;
    }
}
}

To try it out first start the server from a shell with

WCFDispatching.exe -server

and then start the client with the -client option to get the deadlock. To call the fixed version start the client with -client -async and the deadlock will not occur.

Why Skylake CPUs Are Sometimes 50% Slower – How Intel Has Broken Existing Code

I got a call that some performance regression tests had become slower on newer hardware. Not a big deal. Usually it is a bad configuration somewhere in Windows, or some BIOS settings were set to non optimal values. But this time we were not able to find a setting that brought performance back to normal. Since the change was not small, 9s vs 19s (blue is old hardware, orange is new hardware), we needed to drill deeper:

image

Same OS, Same Hardware, Different CPU – 2 Times Slower

A perf drop from 9.1s to 19.6s is definitely significant. We did more checks whether the software version under test, Windows or the BIOS settings were somehow different from the old baseline hardware. But no, everything was identical. The only difference was that the same tests were running on different CPUs. Below is a picture of the newest CPU

image

And here is the one used for comparison

image

The Xeon Gold runs on a different CPU architecture named Skylake, which is common to all CPUs produced by Intel since mid 2017. *As commenters have pointed out, the consumer Skylake CPUs were already released in 2015; the server Xeon CPUs with SkylakeX were released mid 2017, and all later CPUs (Kaby Lake, …) share the same issue. If you are buying current hardware you will get a CPU with the Skylake CPU architecture. These are nice machines, but as the tests have shown, newer and slower is not the right direction. If all else fails, get a repro and use a real profiler ™ to drill deeper. When you record the same test on the old hardware and on the new hardware it should quickly lead somewhere:

image

Remember that the diff view in WPA *(Windows Performance Analyzer is a free profiling UI which is part of the Windows Performance Toolkit, which in turn is part of the Windows SDK) shows in the table the delta of Trace 2 (11s) – Trace 1 (19s). Hence a negative delta in the table indicates a CPU consumption increase in the slower test. When we look at the biggest CPU consumer differences we find AwareLock::Contention, JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel and ThreadNative.SpinWait. Everything points towards CPU spinning when threads are competing for locks. But that looked like a red herring, because spinning is not the root cause of slower performance: increased lock contention means that something in our software became slower while holding a lock, which as a consequence results in more CPU spinning. I was checking locking times and other key metrics like disk and the like, but I failed to find anything relevant which could explain the performance degradation. Although it seemed not logical, I turned back to the increased CPU consumption in various methods.

It would be interesting to find out where exactly the CPU was stuck. WPA has file and line columns, but these work only with private symbols, which we do not have because it is .NET Framework code. The next best thing is the offset in the dll where the instruction is located, which is called the Image RVA (Relative Virtual Address). When I load the same dll into the debugger and then do

u xxx.dll+ImageRVA

then I should see the instruction which was burning most CPU cycles, and it was basically only one hot address.

image

Let's examine the hot code locations of the different methods with Windbg:

0:000> u clr.dll+0x19566B-10
clr!AwareLock::Contention+0x135:
00007ff8`0535565b f00f4cc6        lock cmovl eax,esi
00007ff8`0535565f 2bf0            sub     esi,eax
00007ff8`05355661 eb01            jmp     clr!AwareLock::Contention+0x13f (00007ff8`05355664)
00007ff8`05355663 cc              int     3
00007ff8`05355664 83e801          sub     eax,1
00007ff8`05355667 7405            je      clr!AwareLock::Contention+0x144 (00007ff8`0535566e)
00007ff8`05355669 f390            pause
00007ff8`0535566b ebf7            jmp     clr!AwareLock::Contention+0x13f (00007ff8`05355664)

We do this for the JIT method as well

0:000> u clr.dll+0x2801-10
clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x124:
00007ff8`051c27f1 5e              pop     rsi
00007ff8`051c27f2 c3              ret
00007ff8`051c27f3 833d0679930001  cmp     dword ptr [clr!g_SystemInfo+0x20 (00007ff8`05afa100)],1
00007ff8`051c27fa 7e1b            jle     clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x14a (00007ff8`051c2817)
00007ff8`051c27fc 418bc2          mov     eax,r10d
00007ff8`051c27ff f390            pause
00007ff8`051c2801 83e801          sub     eax,1
00007ff8`051c2804 75f9            jne     clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x132 (00007ff8`051c27ff)

Now we have a pattern. One time the hot location is a jump instruction and the other time it is a subtraction. But both hot instructions are preceded by the same common instruction named pause. Different methods execute the same CPU instruction which is, for some reason, very time consuming. Let's measure the duration of the pause instruction to see if we are on the right track.
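A rough way to do that from managed code is to time a large number of Thread.SpinWait iterations, which on the affected runtimes map more or less directly to pause instructions (an assumption that no longer holds once the fixed runtimes adapt their spin counts). This is similar in spirit to what the SkylakeXPause tool shown at the end of this post prints:

using System;
using System.Diagnostics;
using System.Threading;

class PauseTimer
{
    static void Main()
    {
        const int Iterations = 1000 * 1000;
        Thread.SpinWait(Iterations);                 // warm up

        var sw = Stopwatch.StartNew();
        Thread.SpinWait(Iterations);                 // roughly one pause per iteration on pre-fix runtimes
        sw.Stop();

        double nsPerIteration = sw.Elapsed.TotalMilliseconds * 1_000_000.0 / Iterations;
        Console.WriteLine($"{Iterations:N0} spin iterations took {sw.Elapsed.TotalMilliseconds:F4} ms " +
                          $"(~{nsPerIteration:F1} ns per iteration)");
    }
}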

If You Document A Problem It Becomes A Feature

CPU                                   Pause Duration in ns
Xeon E5 1620v3 @ 3.50 GHz                              4
Xeon(R) Gold 6126 CPU @ 2.60 GHz                      43

Pause on the new Skylake CPUs is an order of magnitude slower. Sure things can get faster and sometimes a bit slower. But over 10 times slower? That sounds more like a bug. A little internet search about the pause instruction leads to the Intel manuals where the Skylake Microarchitecture and the pause instruction are explicitly mentioned:

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

image

No, this is not a bug, it is a documented feature. There even exists a web page which contains the timings of pretty much all CPU instructions

http://www.agner.org/optimize/instruction_tables.pdf

  • Sandy Bridge    11
  • Ivy Bridge      10
  • Haswell          9
  • Broadwell        9
  • SkylakeX       141

The numbers are CPU cycles. To calculate the actual time you divide the cycle count by the CPU frequency (usually GHz) to get the time in ns. For example, 141 cycles / ~3.3 GHz ≈ 43ns, which lines up with the value measured above once turbo clocks are taken into account.

That means that when I execute heavily multithreaded applications on .NET on the latest hardware, things can become much slower. Someone else noticed this already in August 2017 and wrote an issue for it: https://github.com/dotnet/coreclr/issues/13388. The issue has been fixed with .NET Core 2.1, and the .NET Framework 4.8 Preview also contains the fixes for it.

https://github.com/Microsoft/dotnet-framework-early-access/blob/master/release-notes/build-3621/dotnet-build-3621-changes.md#clr

Improved spin-waits in several synchronization primitives to perform better on Intel Skylake and more recent microarchitectures. [495945, mscorlib.dll, Bug]

But since .NET 4.8 is still one year away I have requested a backport of the fixes to get .NET 4.7.2 back up to speed on the latest hardware. Since many parts of .NET use spinlocks, you should look out for increased CPU consumption around Thread.SpinWait and other spinning methods.

 

image

E.g. Task.Result will spin internally, where I could also see for other tests a significant increase in CPU consumption and degraded performance.

How Bad Is It?

I have looked at the .NET Core code to see how long the CPU will keep spinning, when the lock is not released, before calling into WaitForSingleObject and paying for the "expensive" context switch. A context switch takes somewhere in the microsecond region and becomes much slower when many threads are waiting on the same kernel object.

.NET locks multiply the maximum spin duration by the number of cores, which has the fully contended case in mind where every core has a thread waiting for the same lock and tries to spin long enough to give everyone a chance to work a bit before paying for the kernel call. Spinning inside .NET uses an exponential back off algorithm: spinning starts with 50 pause calls in a loop, and for each iteration the number of spins is multiplied by 3 until the next spin count becomes greater than the maximum spin duration. I have calculated the total time a thread would spin on pre Skylake CPUs and current Skylake CPUs for various core counts:

image

Below is some simplified code showing how .NET locks perform spinning:

/// <summary>
/// This is how .NET is spinning during lock contention minus the Lock taking/SwitchToThread/Sleep calls
/// </summary>
/// <param name="nCores"></param>
void Spin(int nCores)
{
	const int dwRepetitions = 10;
	const int dwInitialDuration = 0x32;
	const int dwBackOffFactor = 3;
	
	int dwMaximumDuration = 20 * 1000 * nCores;

	for (int i = 0; i < dwRepetitions; i++)
	{
		int duration = dwInitialDuration;
		do
		{
			for (int k = 0; k < duration; k++)
			{
				Call_PAUSE();
			}
			duration *= dwBackOffFactor;
		}
		while (duration < dwMaximumDuration);
	}
}
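Plugging numbers into this simplified algorithm gives a feel for the worst case. The constants come from the code above and the per pause latencies are the measured values from earlier, so this is an estimate and not CLR source:

// Counts how many pause calls the simplified spinning loop above performs in total.
static long TotalPauseCalls(int nCores)
{
    const int dwRepetitions = 10;
    const int dwInitialDuration = 0x32;   // 50
    const int dwBackOffFactor = 3;
    int dwMaximumDuration = 20 * 1000 * nCores;

    long pauses = 0;
    for (int i = 0; i < dwRepetitions; i++)
    {
        int duration = dwInitialDuration;
        do
        {
            pauses += duration;
            duration *= dwBackOffFactor;
        } while (duration < dwMaximumDuration);
    }
    return pauses;
}

// For nCores = 24 this yields about 4.9 million pause calls:
//   ~4.9M *  4 ns ≈  20 ms on the older Xeon
//   ~4.9M * 43 ns ≈ 210 ms on SkylakeX (around 250 ms if a pause costs ~50 ns)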

The old spinning times were in the millisecond region (19ms for 24 cores), which is already quite a lot compared to the often mentioned high cost of context switches, which are an order of magnitude faster. But with Skylake CPUs the total CPU spinning time for a contended lock has exploded, and we will spin for up to 246ms on a 24 or 48 core machine, only because the latency of the pause instruction on the new Intel CPUs has increased by a factor of 14. Is this really the case? I have created a small tester to check full CPU spinning, and the calculated numbers nicely match my expectations. I have 48 threads waiting on a 24 core machine for a single lock where I call Monitor.PulseAll to let the race begin:

image

Only one thread will win the race, but 47 threads will spin until they give up. This is experimental evidence that we indeed have a real issue with CPU consumption and very long spin times. Excessive spinning hurts scalability because CPU cycles are burned where other threads might need the CPU, although the pause instruction frees up some of the shared CPU resources while "sleeping" for longer times. The reason for spinning is to acquire the lock fast without going to the kernel. If that holds, the increased CPU consumption might not look good in Task Manager, but it should not influence performance at all as long as there are cores left for other tasks. But the tests showed that even nearly single threaded operations, where one thread adds something to a worker queue while the worker thread waits for work and then performs some task with the work item, are slowed down.

The reason for that is best shown with a diagram. Spinning for a contended lock happens in steps where the amount of spinning is tripled after each step. After each spin round the lock checks again if the current thread can get it. While spinning, the lock tries to be fair and switches over to other threads from time to time to help the other thread(s) complete their work. That increases the chance that the lock has been released when we check again later. The problem is that the lock checks whether it can be taken only after a complete spin round has finished:

image

If e.g. during spin round 5 the lock becomes signaled right after we started round 5, we wait for the complete spin round before we can acquire the lock. By calculating the spin duration of the last round we can estimate the worst case delay that can happen to our thread:

image

Those are many milliseconds we can wait until spinning has completed. Is that a real issue?

I have created a simple test application that implements a producer consumer queue where the worker thread works 10ms for each work item and the sender waits 1-9 ms before sending in the next work item. That is sufficient to see the effect:

image

For sender thread delays of one and two ms we see a total duration of 2.2s, whereas for the other delays we are twice as fast with ca. 1.2s. This shows that excessive CPU spinning is not only a cosmetic issue which hurts heavily multithreaded applications, it also hurts simple producer consumer threading which involves only two threads. For the run above the ETW data speaks for itself: the increased CPU spinning is really the cause of the observed delay:

image

When we zoom into the slow section we find in red the 11ms of spinning, although the worker (light blue) has completed its work and released the lock a long time ago.

image

The fast, non degenerate case looks much better, where only 1ms is spent spinning for the lock.

image

The test application I used is named SkylakeXPause and is located at https://1drv.ms/u/s!AhcFq7XO98yJgsMDiyTk6ZEt9pDXGA, which contains a zip file with the source code and the binaries for .NET Core and .NET 4.5. What I actually did to compare things was to install on the Skylake machine .NET 4.8 Preview, which contains the fixes, and .NET Core 2.0, which still implements the old spinning behavior. The application targets .NET Standard 2.0 and .NET 4.5, which produces an exe and a dll. Now I can test the old and new spinning behavior side by side without the need to patch anything, which is very convenient.

readonly object _LockObject = new object();
int WorkItems;
int CompletedWorkItems;
Barrier SyncPoint;
	
void RunSlowTest()
{
	const int processingTimeinMs = 10;
	const int WorkItemsToSend = 100;
	Console.WriteLine($"Worker thread works {processingTimeinMs} ms for {WorkItemsToSend} times");

	// Test one sender one receiver thread with different timings when the sender wakes up again
	// to send the next work item

	// synchronize worker and sender. Ensure that worker starts first
	double[] sendDelayTimes = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

	foreach (var sendDelay in sendDelayTimes)
	{
		SyncPoint = new Barrier(2);  // one sender one receiver

		var sw = Stopwatch.StartNew();
		Parallel.Invoke(() => Sender(workItems: WorkItemsToSend,          delayInMs: sendDelay),
						() => Worker(maxWorkItemsToWork: WorkItemsToSend, workItemProcessTimeInMs: processingTimeinMs));
		sw.Stop();
		Console.WriteLine($"Send Delay: {sendDelay:F1} ms Work completed in {sw.Elapsed.TotalSeconds:F3} s");
		Thread.Sleep(100);  // show some gap in ETW data so we can differentiate the test runs
	}
}

/// <summary>
/// Simulate a worker thread which consumes CPU which is triggered by the Sender thread
/// </summary>
void Worker(int maxWorkItemsToWork, double workItemProcessTimeInMs)
{
	SyncPoint.SignalAndWait();

	while (CompletedWorkItems != maxWorkItemsToWork)
	{
		lock (_LockObject)
		{
			if (WorkItems == 0)
			{
				Monitor.Wait(_LockObject); // wait for work
			}

			for (int i = 0; i < WorkItems; i++)
			{
				CompletedWorkItems++;
				SimulateWork(workItemProcessTimeInMs); // consume CPU under this lock
			}

			WorkItems = 0;
		}
	}
}

/// <summary>
/// Insert work for the Worker thread under a lock and wake up the worker thread n times
/// </summary>
void Sender(int workItems, double delayInMs)
{
	CompletedWorkItems = 0; // delete previous work
	SyncPoint.SignalAndWait();
	for (int i = 0; i < workItems; i++)
	{
		lock (_LockObject)
		{
			WorkItems++;
			Monitor.PulseAll(_LockObject);
		}
		SimulateWork(delayInMs);
	}
}
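
SimulateWork is part of the downloadable sample and not shown above. A minimal sketch that matches how it is used (my assumption: it burns CPU in a busy wait instead of sleeping, since the point is to consume CPU while holding the lock and to support sub-millisecond delays) could look like this:

void SimulateWork(double workTimeInMs)
{
	// Busy wait to simulate CPU bound work with sub-millisecond resolution.
	// Thread.Sleep would not consume CPU and rounds to whole milliseconds.
	var sw = Stopwatch.StartNew();
	while (sw.Elapsed.TotalMilliseconds < workTimeInMs)
	{
	}
}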

Conclusions

This is not a .NET issue. It affects all spinlock implementations which use the pause instruction. I did a quick check of the Windows Server 2016 kernel, but no such issue is visible there. It looks like Intel was kind enough to give them a hint that some changes in the spinning strategy were needed.

The issue was reported to .NET Core in August 2017, and by September 2017 it was already fixed and pushed out with .NET Core 2.0.3 (https://github.com/dotnet/coreclr/issues/13388). Not only is the reaction speed of the .NET Core team amazing, the issue has also been fixed on the Mono branch a few days ago, and discussions about further spinning improvements are ongoing. Unfortunately the Desktop .NET Framework is not moving as fast, but with .NET Framework 4.8 Preview we have at least a proof of concept that the fixes work there as well. Now I am waiting for the backport to .NET 4.7.2 to be able to use .NET at its full speed on the latest hardware too. This was my first bug which was directly related to a performance change in a single CPU instruction. ETW remains the profiling tool of choice on Windows. If I had one wish, I would ask Microsoft to port the ETW infrastructure to Linux, because the current performance tooling on Linux still falls short. There were some interesting kernel capabilities added recently, but an analysis tool like WPA remains yet to be seen there.

If you are running .NET Core 2.0 or the desktop .NET Framework on CPUs produced since mid 2017, you should definitely check your application with a profiler to see whether you are running at reduced speed due to this issue, and upgrade to the newer .NET Core and hopefully soon .NET Desktop version. My test application can tell you if you could be having issues:

D:\SkylakeXPause\bin\Release\netcoreapp2.0>dotnet SkylakeXPause.dll -check
Did call pause 1,000,000 in 3.5990 ms, Processors: 8
No SkylakeX problem detected

or 

D:\SkylakeXPause\SkylakeXPause\bin\Release\net45>SkylakeXPause.exe -check
Did call pause 1,000,000 in 3.6195 ms, Processors: 8
No SkylakeX problem detected
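
The -check mode essentially times a tight loop of pause instructions. A rough sketch of such a check (my assumption of how it could be implemented; Thread.SpinWait executed roughly one pause per iteration on the .NET Framework of that time) looks like this:

using System;
using System.Diagnostics;
using System.Threading;

class PauseCheck
{
    static void Main()
    {
        const int Iterations = 1000000;
        Thread.SpinWait(Iterations);            // warm up / JIT

        var sw = Stopwatch.StartNew();
        Thread.SpinWait(Iterations);            // ~1 million pause instructions
        sw.Stop();

        Console.WriteLine($"Did call pause {Iterations:N0} in {sw.Elapsed.TotalMilliseconds:F4} ms, Processors: {Environment.ProcessorCount}");
        // A pause costs ~10 cycles on older CPUs but ~140 cycles on Skylake based CPUs,
        // so a time that is an order of magnitude higher hints at the slow pause instruction.
    }
}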

The tool will report an issue only if you are running an unfixed .NET Framework on a Skylake CPU. I hope you found the issue as fascinating as I did. To really understand an issue you need to create a reproducer which allows you to experiment and find all relevant influencing factors. The rest is just boring work, but now I understand the reasons and consequences of CPU spinning much better.

* Denotes changes to make things more clear and add new insights. This article has gained quite some traction at Hacker News (https://news.ycombinator.com/item?id=17336853) and Reddit  (https://www.reddit.com/r/programming/comments/8ry9u6/why_skylake_cpus_are_sometimes_50_slower/). It is even mentioned at Wikipedia (https://en.wikipedia.org/wiki/Skylake_(microarchitecture)). Wow. Thanks for the interest.

Serialization Performance Update With .NET 4.7.2

With .NET Framework 4.7.2 out the door it was time to update my serialization performance test suite (https://github.com/Alois-xx/SerializerTests). Many serializers have been added since the article https://aloiskraus.wordpress.com/2017/04/23/the-definitive-serialization-performance-guide/ was written, which warrants a post of its own. The performance numbers were updated, but not all of the text.

First of all, the pesky BinaryFormatter O(n^2) issue is gone with .NET 4.7.2 if you add the following to your App.config:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <!-- Use this switch to make BinaryFormatter fast with large object graphs starting with .NET 4.7.2 -->
      <AppContextSwitchOverrides value="Switch.System.Runtime.Serialization.UseNewMaxArraySize=true" />
  </runtime>
</configuration>
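
If you prefer code over config, the same switch can also be set programmatically via AppContext, as long as this happens early, before BinaryFormatter reads the switch for the first time (a hedged sketch; the switch name is taken from the config above):

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

class Program
{
    static void Main()
    {
        // Must run before BinaryFormatter caches the switch value.
        AppContext.SetSwitch("Switch.System.Runtime.Serialization.UseNewMaxArraySize", true);

        var manyObjects = new List<string>();
        for (int i = 0; i < 100000; i++)
        {
            manyObjects.Add("Item " + i);   // many distinct object references stress the old O(n^2) path
        }

        using (var stream = new MemoryStream())
        {
            new BinaryFormatter().Serialize(stream, manyObjects);
            Console.WriteLine($"Serialized {manyObjects.Count} objects into {stream.Length} bytes");
        }
    }
}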
 

.NET Core does not need such a setting because it shipped with the fixed BinaryFormatter from the start of .NET Core 2.0. The de/serialization performance of .NET Core (2.0.6) is, however, not faster than the .NET Framework. You might ask: how can that be? .NET Core is performance obsessed and now it is slower? The answer is that .NET Core itself is not slower; some serializers targeting .NET Standard execute workarounds for early .NET Core versions. As always, you should measure in your actual target environment to prevent bad surprises.

Which Serializers Are Slower under .NET Core?

Most notably MsgPack.Cli, ServiceStack.Text, BinaryFormatter and Bois perform significantly worse on .NET Core. MsgPack.Cli is over two times slower on .NET Core! How does this look under a CPU profiler?

I have recorded two profiling sessions, one for the .NET Framework and one for .NET Core. Trace #1 is .NET Core and Trace #2 is the full .NET Framework. The visible WPA tab is a comparative diff view where the graph shows .NET Core and the table contains the diff values. To read the values you need to know that the displayed values are calculated by subtracting Trace #2 from Trace #1 for each row. Negative values mean that .NET Core consumed more CPU than the .NET Framework tests. The Weight column shows the CPU time difference in ms, with "," as thousands separator and "." as decimal point. I know that is a lot of information, but once you get used to that level of detail you will never go back to simple timing based tests where you wonder why the timing always fluctuates. Here I have dissected the worst performing serializers under WPA:

image

What is the reason for that? If we factor out reflection Get/SetValue and Activator.CreateInstance from the profiled data, we get a delta which is within the error margin of ca. 10%. The delta table below no longer shows large deserialization time differences. All the differences come from many calls to FieldInfo.Get/SetValue and Activator.CreateInstance. This is not the case when the same serializers (same serializer but a different dll and hence different code!) run on the regular .NET Framework.

image

The key takeaway is that the mentioned serializers switch to good old slow reflection and Activator.CreateInstance calls if you use the .NET Standard version of these serializers. This also includes BinaryFormatter, which is 20% slower on .NET Core compared to the full .NET Framework.
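
To get a feeling for how much that costs, here is a small standalone sketch (not taken from any serializer) which compares Activator.CreateInstance plus FieldInfo.SetValue with a plain constructor call and a direct field assignment, which is roughly the difference between the reflection fallback and generated accessor code:

using System;
using System.Diagnostics;
using System.Reflection;

public class DataItem
{
    public string Name;
}

class ReflectionCost
{
    const int N = 1000000;

    static void Main()
    {
        FieldInfo nameField = typeof(DataItem).GetField(nameof(DataItem.Name));

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++)
        {
            // What the .NET Standard fallback path of some serializers boils down to
            var item = (DataItem)Activator.CreateInstance(typeof(DataItem));
            nameField.SetValue(item, "Some Name");
        }
        Console.WriteLine($"Reflection:    {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int i = 0; i < N; i++)
        {
            // What compiled/generated accessors boil down to
            var item = new DataItem { Name = "Some Name" };
        }
        Console.WriteLine($"Direct access: {sw.ElapsedMilliseconds} ms");
    }
}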

Designed To Be Profiled

These are nice graphs which can tell you a lot about an application. With stack tags you can compare the CPU consumption of an exe and a hosted dll which execute the same code on different runtimes, where a comparison by call stack drill down would be useless because even the Main methods are located in different dlls. By extracting the relevant information from your application logic into logical groups you can compare runtime and resource consumption between a SerializerTests.exe and a dotnet SerializerTests.dll call with no problems. The stack tag file used for these views is part of my serializer test suite at https://github.com/Alois-xx/SerializerTests/blob/master/SerializerStack.stacktags if you are interested.

The format of the .stacktag files is pretty simple:

<?xml version="1.0" encoding="utf-8"?>
<!-- This is a Stacktag file for WPA to analyze the performance of serializers under ETW profiling 
     It allows easy comparison of .NET and .NET Core profiling data
-->
<Tag Name="">
<Tag Name="Deserialize">
    <Tag Name="BinaryFormatter">
        <Entrypoint Module="SerializerTests.*" Method="SerializerTests.Serializers.BinaryFormatter*::Deserialize*"/>
    </Tag>

To be able to compare .NET and .NET Core the module is SerializerTests.*, because on .NET it compiles to SerializerTests.exe and on .NET Core to SerializerTests.dll, which is executed by the dotnet.exe process. If you compare the CPU time of the profiling data with the actual test duration from the CSV file (which is also created), you will find that the test duration is always longer. Even worse, it is pretty hard to zoom into the time range where a test actually executes. The advantage of creating a test suite is that you can make it profiling friendly. The solution to the timing problem is an extra thread that starts waiting on an event when the test starts; the event is signaled when the test has stopped. That way we get a context switch event for each test run and we can also visualize the thread wait time for each and every test run.
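
A minimal sketch of such a marker thread (hypothetical names, not the exact code from the test suite): a dedicated thread blocks on a kernel event for exactly the duration of one test, so every run shows up in the CPU Usage (Precise) view as one long wait framed by two context switch events.

using System;
using System.Threading;

// Wraps one test run so a dedicated thread waits exactly as long as the test executes.
sealed class EtwTestMarker : IDisposable
{
    readonly ManualResetEvent _testFinished = new ManualResetEvent(false);
    readonly Thread _marker;

    public EtwTestMarker(string testName)
    {
        _marker = new Thread(() => _testFinished.WaitOne())
        {
            Name = "ETW Marker " + testName,
            IsBackground = true
        };
        _marker.Start();            // the wait begins when the test begins
    }

    public void Dispose()
    {
        _testFinished.Set();        // the wait ends when the test ends
        _marker.Join();
        _testFinished.Dispose();
    }
}

// Usage: the wait time of the marker thread equals the test duration.
// using (new EtwTestMarker("MessagePackSharp Deserialize"))
// {
//     RunDeserializeTest();    // hypothetical test method
// }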

The graph below shows the CPU consumption for the deserialize tests of all tested formatters. The formatters are executed from fast to slow (in rough order). 

image

The magic happens in the CPU Usage (Precise) view: with our custom stacktags we can visualize the test duration as a bar chart with a nice alternating pattern. If one test has a strange runtime between test runs and we have profiling data, we can drill into each and every test case and do a full root cause analysis. Looking at a specific test run is now as easy as zooming into the right test run and checking what took so long:

image

And The Winners Are

image

The fastest de/serializers are MessagePack-CSharp (https://github.com/neuecc/MessagePack-CSharp) from Yoshifumi Kawai and GroBuf (https://github.com/skbkontur/GroBuf) from Andrew Kostousov, which beat Protobuf by a factor of more than 2.5! I have no idea how these guys made it so fast, but this is the fastest C# code I have seen so far. MessagePack-CSharp even comes with code analyzers to make the annotation of your existing objects easy.
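
For reference, annotating a type for MessagePack-CSharp is straightforward. A minimal sketch (the Book type is invented for illustration):

using MessagePack;

[MessagePackObject]
public class Book
{
    [Key(0)] public int Id { get; set; }
    [Key(1)] public string Title { get; set; }
}

class Demo
{
    static void Main()
    {
        var book = new Book { Id = 1, Title = "Pro .NET Memory Management" };

        byte[] bytes = MessagePackSerializer.Serialize(book);              // compact binary payload
        Book roundTripped = MessagePackSerializer.Deserialize<Book>(bytes);

        System.Console.WriteLine(roundTripped.Title);
    }
}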

The problem with such fast serializers is that you cannot serialize object cycles (StackOverflowException with MessagePack-CSharp, GroBuf and Wire). Another problem is that they do not keep object identity (Wire has an opt-in flag, see SerializerOptions(preserveObjectReferences)). If your objects contain 100 references to the same 1 MB string, it will be serialized 100 times by value, and you end up with 100×1 MB strings. As a general rule you need to pay attention not only to performance but also to the concrete feature set. If you cannot guarantee that your objects never contain object cycles, your application will crash hard without any further notice when you use such performance optimized libraries. But if you design a reasonable data structure, then MessagePack-CSharp or GroBuf are hot options. Yoshifumi Kawai also created ZeroFormatter, which has the crazy property of zero deserialization time. The reason why ZeroFormatter is not showing up in the winners section is that it sort of cheats the usual benchmarks. ZeroFormatter creates proxy objects on the fly which only contain an index into the byte array they were deserialized from. The actual deserialization cost shows up when you access the properties, which need to be public virtual for that reason. To not distort the measured values, I included in the deserialize test a touch phase which accesses each deserialized property once and measures that as the total deserialization cost. It turns out that the deserialize + touch cost is much higher compared to protobuf-net. Personally I do not like ZeroFormatter because it is intrusive to your object design: you would need to design your data structures so that the least amount of data is accessed in your use case. But use cases can and will change. Then you either need to redesign your object hierarchy every time you have a different access pattern, or you have to live with suboptimal performance.

Similar to ZeroFormatter is FlatBuffer (https://google.github.io/flatbuffers/), which comes with its own IDL and compiler to generate code from a schema. It basically writes structs via memcopy into a byte array where each object reference is an index into another array, which makes it a great candidate if you want to share large datasets between processes via shared memory while reading only a fraction of the data. Just as with ZeroFormatter the data is deserialized when you actually access it. This is both a curse and a blessing. If you read some objects many times you create many temporary objects, which will hurt GC performance. On the other hand, if you need only a few items of a large array, this lazy deserialization approach is perfect. FlatBuffer does not work with existing objects and it cannot cope with dictionaries, which makes it rather inflexible. One needs to know that FlatBuffer comes from game programming, where most of the data are coordinates and textures. For that it can be a good choice. On the plus side it fully supports versioning despite being pretty low level.

Serializer | Can Preserve Object References | Observations
MessagePackSharp | No |
GroBuf | No |
FlatBuffer | No | Data structures are created from the IDL compiler.
Wire | Only on paper | new Serializer(new SerializerOptions(preserveObjectReferences: true)); the Wire unit tests contain cyclic references, but that seems to work only for cycles on the same object. Real objects with many identical references will still be serialized by value.
Jil | No | Cannot serialize dictionaries with DateTime as keys.
Protobuf_net | Yes, opt in at declaration level | [ProtoContract] public class A { ... [ProtoMember(5, AsReference=true)] public C Foo {get;set;} } See StackOverflow question.
SimSerializer | By default |
ZeroFormatter | No | Serializes only virtual properties.
DataContract | Yes | new DataContractSerializer(type, new DataContractSerializerSettings { PreserveObjectReferences = true })
Bois | No |
JSON.NET | Not always | JsonSerializer.Create(new JsonSerializerSettings { PreserveReferencesHandling = PreserveReferencesHandling.All }); still serializes by value when tried with more complex types like a dictionary.
ServiceStack | No | Closes the input stream. See SO question.
XmlSerializer | No |
MsgPack.Cli | No |
BinaryFormatter | Yes |
FastJson | No | Cannot round trip DateTime with 100 ns resolution.

By default, none of the serializers track object references, which speeds up serialization significantly at the expense of larger serialized data.
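
A small sketch (types invented for illustration) shows what that trade-off looks like with DataContractSerializer, which supports both modes: without reference tracking a shared instance is written once per reference, with PreserveObjectReferences it is written once plus small back references.

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;

public class Node
{
    public string Payload;
}

class ReferenceDemo
{
    static long SerializedSize(DataContractSerializer serializer, object graph)
    {
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, graph);
            return stream.Length;
        }
    }

    static void Main()
    {
        var shared = new Node { Payload = new string('x', 1000) };
        var graph = new List<Node> { shared, shared };      // two references, one object

        var byValue = new DataContractSerializer(typeof(List<Node>));
        var byReference = new DataContractSerializer(typeof(List<Node>),
            new DataContractSerializerSettings { PreserveObjectReferences = true });

        Console.WriteLine($"By value:     {SerializedSize(byValue, graph)} bytes");     // payload written twice
        Console.WriteLine($"By reference: {SerializedSize(byReference, graph)} bytes"); // payload written once
    }
}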

Which One To Choose?

That is a tricky question because it strongly depends on whether you can change your existing object model radically or have to keep backwards compatibility while switching to a different serializer. Based on experience I would name protobuf-net as the most feature complete and very fast serializer where you will almost certainly find no blocking issues. For lesser known serializers you should check the number of commits to the project and whether there is recent activity. If the library is no longer actively maintained because the author has shifted focus, you should not use it for mission critical applications. If you are bound to a specific data format like JSON, Jil is by far the fastest serializer on the planet. But be prepared for unpleasant surprises, like not being able to use Dictionary<DateTime, xxx> with Jil because it throws a NotSupportedException at you.

Despite claiming to track references and object cycles, not all serializers fulfill the claims advertised at the API level. That is a sign that not all code paths are equally well tested, and if your scenario differs from the main usage you are likely to hit unexpected issues.

If you can change everything, I would go for the fastest one, which is MessagePackSharp, because the author has a great track record of creating other serializers and is very active in his project. GroBuf, although equally fast, produces significantly larger serialized data, which can be an issue if you need to take into account not only the serializer performance but also the data size sent over the wire. A three times larger binary payload can easily defeat any performance gain from a faster serializer if a slow network is in between. To be really sure that your data types work well with the target serializers, you can use my test suite, add your data object(s) to it (https://github.com/Alois-xx/SerializerTests) and measure for yourself.

Plugging a new data type into the test suite is as simple as referencing the assembly which defines the type you care about. Then change the tested data type from BookShelf to your custom type, supply an object factory delegate to create your test data (Data), and optionally add a touch delegate that touches all properties after deserialization to take lazy, on-access deserializing serializers into account.

        private void CreateSerializersToTest()
        {
            SerializersToTest = new List<ISerializeDeserializeTester>
            {
                new MessagePackSharp<BookShelf>(Data, TouchBookShelf),
                …
            };
        }

D:\SerializerTests\bin\Release\net471>SerializerTests -test combined
Serializer      Objects "Time to serialize in s"        "Time to deserialize in s"      "Size in bytes" FileVersion     Framework
MessagePackSharp<BookShelf>     1       0.000   0.000   11      1.7.3.4 .NET Framework 4.7.3062.0
MessagePackSharp<BookShelf>     1       0.000   0.000   11      1.7.3.4 .NET Framework 4.7.3062.0
MessagePackSharp<BookShelf>     10      0.000   0.000   93      1.7.3.4 .NET Framework 4.7.3062.0
MessagePackSharp<BookShelf>     100     0.000   0.000   996     1.7.3.4 .NET Framework 4.7.3062.0
MessagePackSharp<BookShelf>     500     0.000   0.000   6014    1.7.3.4 .NET Framework 4.7.3062.0
MessagePackSharp<BookShelf>     1000    0.000   0.000   12515   1.7.3.4 .NET Framework 4.7.3062.0
MessagePackSharp<BookShelf>     10000   0.001   0.001   138516  1.7.3.4 .NET Framework 4.7.3062.0
MessagePackSharp<BookShelf>     50000   0.004   0.006   738516  1.7.3.4 .NET Framework 4.7.3062.0
MessagePackSharp<BookShelf>     100000  0.010   0.013   1557449 1.7.3.4 .NET Framework 4.7.3062.0
MessagePackSharp<BookShelf>     200000  0.016   0.032   3357449 1.7.3.4 .NET Framework 4.7.3062.0

and watch whether the numbers are worth the change and whether all data ends up in the serialized payload. The serialized data is written to a file

image

to make it easy to check whether all data was really written to the output or whether your class is missing some [MessagePackObject] or [Index] attributes which are needed to make the data show up in the serialized output. If you want to compare only two specific serializers, you can restrict the test run to the selected ones with

SerializerTests -Runs 1 -test combined -serializer protobuf,MessagePackSharp

to get your results fast. Now go and fix your serialization performance issues!

The Mysterious UI Hang Which Resolved Itself After 20s

Warning: This post includes ETW, Windbg, kernel and process memory dumps. If you don't want to deep dive into the Windows internals you should stop reading now.

One strange issue was a UI hang. Normally these are easy to solve because something is doing CPU intensive work on the UI thread, the UI thread is stuck in a blocking call waiting for something to happen (e.g. reading a 2 GB file), or a deadlock has occurred. But this case was different. The UI was stuck, but sometimes it recovered after 20 s. The UI looked like this while it was not responding:

image

With Windbg we can examine where the UI thread is stuck from a live process or a memory dump. For managed code we need to load sos.dll as usual.

image

From the screenshot above we find that the managed stack is calling WaitMessage

0:000> !ClrStack
OS Thread Id: 0x2dbc (0)
Child SP       IP Call Site
0053ec9c 761a2a9c [InlinedCallFrame: 0053ec9c] System.Windows.Forms.UnsafeNativeMethods.WaitMessage()
0053ec98 58a4d1ea System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr, Int32, Int32)
0053ed24 58a4cbee System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)
0053ed74 58a4ca60 System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)
0053eda0 58a35d59 System.Windows.Forms.Application.Run(System.Windows.Forms.Form)
UIHang.Program.Main() [D:\Source\FocusHang\UIHang\Program.cs @ 19]

which is a perfectly legal call stack and in no way an indication of a hung UI thread with the id 0x2dbc. The deadlock check command for unmanaged locks, !locks, yielded no results, and the managed counterpart !SyncBlk also showed nothing.

Dumping the other threads can be done in Windbg but when many threads are involved the Parallel Stacks window of Visual Studio is much better:

image

The other threads also look normal. By the way, what would an abnormal stack look like? If something is stuck I simply check the threads with the longest stack traces, because these are usually the ones actually doing more than waiting for things to happen. But as you can see from the picture above there are no long stacks involved.

The current dump shows nothing. What can we do? Get more dumps! These showed hangs happening in

  • user32.dll!PeekMessage
  • user32.dll!SetFocus
  • user32.dll!ShowWindow

but nowhere was a reason visible why they were hanging. The window manager of Windows lives inside the kernel in the win32k subsystem. If something is stuck at a deeper level then it is happening inside the kernel, and user mode stacks are useless. Procdump (my favorite memory dump creation tool) can give you a peek inside the kernel by dumping not only the user mode part of the call stacks but also the kernel stacks (this works on Windows 10 only as far as I know). If you looked carefully at the Windbg output you will have noticed that the memory dump was taken with the -mk option (see the Comment: line in the Windbg window), which creates a second dump file besides the user mode dump:

D:\UIHang>procdump -mk -ma UIHang.exe

ProcDump v9.0 – Sysinternals process dump utility
Copyright (C) 2009-2017 Mark Russinovich and Andrew Richards
Sysinternals – http://www.sysinternals.com

[16:56:31] Dump 1 initiated: D:\UIHang\UIHang.exe_180218_165631.dmp
[16:56:31] Dump 1 writing: Estimated dump file size is 177 MB.
[16:56:31] Dump 1 complete: 177 MB written in 0.3 seconds
[16:56:31] Dump 1 kernel: D:\UIHang\UIHang.exe_180218_165631.Kernel.dmp
[16:56:32] Dump count reached.

When you open the xxxx.Kernel.dmp file you can navigate to the user mode thread 0x2dbc from our stuck UI thread to see where the call stack continues in the kernel:

image

Sometimes you can learn something new by looking at the kernel side. In this case the kernel waits for new window messages in NtUserWaitMessage, but it is still not clear why this call never wakes up. In that case it makes sense to examine the contents of the window message queue. Unfortunately that can only be done by Microsoft support, because the windowing internals are not exposed in Windbg or any published Windbg extension that I am aware of. When sending data to someone else we should gather as much evidence as possible. My current favorite data collection for such types of issues is

  • ETW Sample Profiling with 8kHz sample rate and Context Switch Tracing
  • Memory Dump of frozen process
  • Kernel Memory Dump

Full kernel memory dumps are a pain because they are huge. If you are on Windows 10 or Server 2016 there is the option to take a kernel memory dump of only the active memory (https://blogs.msdn.microsoft.com/clustering/2015/05/18/windows-server-2016-failover-cluster-troubleshooting-enhancements-active-dump/), which is great because it excludes the file system cache, which is usually many GB in size. To force the creation of a kernel dump which excludes the file system cache you can create a reg file with the contents below:

CrashOnCtrlScroll_ActiveMemory.reg

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters]
"CrashOnCtrlScroll"=dword:00000001

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters]
"CrashOnCtrlScroll"=dword:00000001

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]
"CrashDumpEnabled"=dword:00000001
"FilterPages"=dword:00000001

and import the reg file. After that you need to reboot the machine. When you now press

Right Ctrl + Scroll Lock + Scroll Lock

you get a sad face with the bug check code MANUALLY INITIATED CRASH, which is just what we want. You can use the .reg file also on Windows 7 machines, where you get a full memory dump because the FilterPages registry key is ignored on older Windows versions.

image

On my 16 GB machine I now get a comparatively small 2.2 GB dump file.

image

If you want to transfer large files you should compress the data as much as possible. 7z archives are in my experience about 20% smaller than regular .zip files at the expense of roughly five times longer compression time. You can use multithreaded compression with the LZMA2 switch, which splits the data into blocks that can be compressed by multiple threads. If you fall back to LZMA, the speedup is much smaller. If you are doing this on a server machine where users start working after you have taken the memory dump, you should perhaps stick to the .zip format to compress on a single core and stay nicely in the background.

7z a -m0=LZMA2 c:\temp\Kernel.7z MEMORY.DMP

With LZMA2, 7z will use all the cores it can get even within a single file (plain LZMA becomes highly parallel only when compressing many different files). The compressed file is then a 577 MB file which can be sent around much more easily. A quick look by Microsoft support revealed that the message queue of our main UI thread is stuck waiting for window messages on another thread with the thread id 1880. Let's check the dump file for that thread with e.g. Windbg:

0:011> ~~[1880]s
eax=00000000 ebx=00000002 ecx=00000000 edx=00000000 esi=00000000 edi=0000050c
eip=7769e7ac esp=0866f7a8 ebp=0866f818 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
ntdll!NtWaitForSingleObject+0xc:
7769e7ac c20c00          ret     0Ch
0:011> k
 # ChildEBP RetAddr  
00 0866f7a4 7642ebf9 ntdll!NtWaitForSingleObject+0xc
01 0866f818 70e0d5bd KERNELBASE!WaitForSingleObjectEx+0x99
02 0866f87c 70e0d80e clr!CLRSemaphore::Wait+0xc0
03 0866f8b8 70e0d8a8 clr!ThreadpoolMgr::UnfairSemaphore::Wait+0x132
04 0866f924 70d6edf1 clr!ThreadpoolMgr::WorkerThreadStart+0x389
05 0866fa44 76568654 clr!Thread::intermediateThreadProc+0x55
06 0866fa58 77694a77 kernel32!BaseThreadInitThunk+0x24
07 0866faa0 77694a47 ntdll!__RtlUserThreadStart+0x2f
08 0866fab0 00000000 ntdll!_RtlUserThreadStart+0x1b

Hm. That is a .NET thread pool thread which waits for more work to arrive. There is no indication of what happened on that thread before. But since we have recorded ETW tracing with high frequency sample profiling (8 kHz), we know a lot about the history of that thread. The default sampling rate of 1 kHz is not sufficient when you are searching for cheap method calls which can have far reaching consequences. In this case we are not looking for a performance issue; we want to know which methods this thread executed before it started waiting for more work.

Let's check out what thread 0x1880 (= 6272) was doing in WPA:

image

It was executing a TPL task on that thread where the suspiciously named ctor of HangForm was called. Even more important: the thread was destroyed exactly 20 s after it executed the ctor of the form, and Windows destroys the window of our hung form on thread exit (see the Selection Duration of 20.011 s). After that the UI was responsive again. At least that is the observation. But that still does not explain why the main UI thread was stuck: even if you create a window on another thread, you can have multiple UI threads within an application without issues as long as you do not mix them up. WinForms is very vigilant in this respect and will always throw an InvalidOperationException of the form

InvalidOperationException: "Cross-thread operation not valid: Control '<name>' accessed from a thread other than the thread it was created on."

if you try bad things like that.

The only way I know how to connect window message pumps from different threads is to call AttachThreadInput. When we search the profiling data for that method we find this:

 |    |    |    |    |- UIHang.exe!UIHang.HangForm::StartUIOnOtherThread 0x0
 |    |    |    |    |    |- UIHang.exe!UIHang.HangForm::.ctor 0x0
 |    |    |    |    |    |- System.Windows.Forms.ni.dll!System.Windows.Forms.Control.Show()
 |    |    |    |    |    |- UIHang.exe!dynamicClass::IL_STUB_PInvoke 0x0
 |    |    |    |    |    |    |- user32.dll!SetParentStub
 |    |    |    |    |    |    |    win32u.dll!NtUserSetParent
 |    |    |    |    |    |    |    ntdll.dll!LdrInitializeThunk
 |    |    |    |    |    |    |    ntdll.dll!LdrpInitialize
 |    |    |    |    |    |    |    ntdll.dll!_LdrpInitialize
 |    |    |    |    |    |    |    wow64.dll!Wow64LdrpInitialize
 |    |    |    |    |    |    |    wow64.dll!RunCpuSimulation
 |    |    |    |    |    |    |    wow64cpu.dll!Thunk0Arg
 |    |    |    |    |    |    |    wow64cpu.dll!CpupSyscallStub
 |    |    |    |    |    |    |    ntoskrnl.exe!KiSystemServiceCopyEnd
 |    |    |    |    |    |    |    win32kfull.sys!NtUserSetParent
 |    |    |    |    |    |    |    win32kfull.sys!xxxSetParentWorker
 |    |    |    |    |    |    |    |- win32kfull.sys!xxxShowWindowEx
 |    |    |    |    |    |    |    |- win32kfull.sys!zzzAttachThreadInput

The window was created on another thread, but it did not attach the thread input queues directly. Instead it called user32.dll!SetParent, which inside the kernel's win32k subsystem attached the window input queues by calling zzzAttachThreadInput. All of that happened on our non message pumping TPL task thread, which is the missing ingredient to finally understand why our main UI thread was blocked due to a programming error on a seemingly unrelated thread. Even with an 8 kHz sampling rate we have only one stack trace containing zzzAttachThreadInput, so one still needs a bit of luck to see the root cause this nicely in ETW data.

More documentation about that Win32 behavior would be great. These details seem to be discussed on The Old New Thing blog (Sharing an input queue takes what used to be asynchronous and makes it synchronous, like focus changes). A few more hints are shown on page 41 of https://www.slideshare.net/wvdang/five-things-every-win32-developer-should-know. According to that (see the sketch after the list below) you will attach thread input queues implicitly if you

  • Set a parent window (user32.dll!SetParent)
  • Set an owner window (user32.dll!SetWindowLongPtr(win32window, GWLP_HWNDPARENT, formhandle))
  • Or install journal hooks (user32.dll!SetWindowsHookEx with a JournalRecordProc)
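
To make the failure mode concrete, here is a heavily simplified repro sketch along the lines of what the ETW stack above shows (hypothetical names, not the original UIHang source): a form is created on a thread pool thread that never pumps messages and is then reparented under the main UI window via SetParent, which ties the input queues of both threads together.

using System;
using System.Runtime.InteropServices;
using System.Threading.Tasks;
using System.Windows.Forms;

static class HangRepro
{
    // SetParent across threads implicitly attaches the input queues of both threads
    // (zzzAttachThreadInput inside win32k).
    [DllImport("user32.dll", SetLastError = true)]
    static extern IntPtr SetParent(IntPtr hWndChild, IntPtr hWndNewParent);

    [STAThread]
    static void Main()
    {
        var mainForm = new Form { Text = "Main UI thread" };
        mainForm.Shown += (s, e) =>
        {
            IntPtr mainHandle = mainForm.Handle;        // captured on the UI thread
            Task.Run(() =>
            {
                var otherForm = new Form();             // created on a thread pool thread
                IntPtr otherHandle = otherForm.Handle;  // forces window creation on this thread
                SetParent(otherHandle, mainHandle);     // attaches both input queues
                // The task returns without ever pumping messages, so the main UI
                // stays frozen until the idle pool thread is retired (~20 s).
            });
        };
        Application.Run(mainForm);
    }
}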

Win32K ETW Tracing?

You can also enable tracing for the Win32K subsystem to track the window focus events by adding this ETW provider to your xperf command line

Microsoft-Windows-Win32k:0x42e3000:0xff

But if the window message pump is stuck the results of this ETW provider and the WPA Window In Focus chart can be misleading.

Conclusions

Window message queue issues are notoriously hard to debug because the most relevant data is only available during live debugging, while you can still query window states with e.g. Spy++. If you only have a memory dump you will have a hard time figuring out what went wrong. A kernel dump would give you all the information, but since there is no public documentation on how to examine the contents of a window message queue, this must be left as an exercise for Microsoft support. If someone knows how to get e.g. the thread affinity of an HWND from a user mode memory dump, please leave a note below. Once again memory dumps and ETW tracing have helped to find the actual root cause. The memory dump helps to find stuck threads and strange data points. ETW helps you find out how you got into that state once you know from the dump file where to look further.

This time I have learned that .NET thread pool threads seem to be shut down 20 s after they have had no real work to do, and that Windows destroys window handles once the creating thread terminates. That can unblock a previously stuck UI as a side effect. If you want to play with the UIHang application you can find it here: https://1drv.ms/f/s!AhcFq7XO98yJgrklCE9_p4RuHoG0Mg