When Known .NET Bugs Bite You

The most interesting types of bugs do not show up during regression testing but when real people use the software. Things become nasty when the problem happens only after several hours of usage and the whole application sometimes freezes for minutes. On a bigger machine with e.g. 40 cores, most of them busy, and dozens of processes, things can become challenging to track down when something goes wrong. Attaching a debugger does not help, since it is not clear which process is causing the problem, and even if you could, you would disturb the users currently working on that machine. Plain ETW tracing is not easy either, because the amount of generated data can exceed 10 GB/min. The default Windows Performance Recorder profiles are not an option because they simply generate too much data. If you have ever tried to open a 6+ GB ETL file in WPA you know it can take 30 minutes until it opens, and if you drag some columns around you wait several more minutes until the UI settles again.

Getting ETW Data The Right Way

Capturing data with xperf offers more fine-grained control at the command line, but it also has its challenges. If you have e.g. a 2 GB user and a 2 GB kernel session which trace into circular in-memory buffers, you can dump their contents with

xperf -flush "NT Kernel Logger" -f kernel.etl 

xperf -flush "UserSession" -f user.etl

xperf -merge user.etl kernel.etl merged.etl 

The result looks strange:

image

For some reason flushing the kernel session takes a very long time (up to 15 minutes) on a busy server. In effect you get e.g. the first 20 s of kernel data, a 6 minute gap and then the user mode session data, which is long after the problem occurred. To work around that you can either flush the user session first or use a single xperf command to flush both sessions at the same time:

xperf -flush "UserSession" -f user.etl -flush "NT Kernel Logger" -f kernel.etl

If you want to profile a machine where you do not know where the issues are coming from, you should first do some "long term" tracing. If you skip context switch events and rely on sampling profiler events alone you can already get quite far. The default sampling rate is 1 kHz (a thousand samples per second), which is good for many cases but too much for long sessions. When you expect a lot of CPU activity you can take a sample only every 10 ms, which gets you much farther. To limit the amount of traced data you also have to be very careful for which events you record call stacks; usually the stack traces are ten times larger than the actual event data. If you want to create a dedicated trace configuration to spot a very specific issue you need to know which ETW providers cost you how much, especially in terms of ETL file size. WPA has a very nice overview under System Configuration – Trace Statistics which shows you exactly that:

image
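
To actually lower the sample rate from the default 1 kHz to one sample every 10 ms, as mentioned above, you can set the profile interval before starting the kernel session. The interval is given in units of 100 ns, so 100000 corresponds to 10 ms (a sketch; check the xperf help of your SDK version):

xperf -setprofint 100000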

You can also paste the Trace Statistics data into Excel to get a more detailed understanding of what is costing you how much.

image

In essence, if you want to keep the profiling file size small you should enable stack walking only for the events you really care about, since stack walks cost a lot of storage (up to ten times more than the actual event payload). A great thing is that you can configure ETW to omit stack walking for specific events of an ETW provider with tracelog. This works since Windows 8.1 and is sometimes necessary.

To enable stack walking e.g. only for the CLR Exception event you can use this tracelog call for your user mode session named DotNetSession. The GUID e13c0d23-ccbc-4e12-931b-d9cc2eee27e4 identifies the provider .NET Common Language Runtime, because tracelog does not accept pretty ETW provider names. Only guids prefixed with a # make it happy.

tracelog -enableex DotNetSession -guid #e13c0d23-ccbc-4e12-931b-d9cc2eee27e4 -flags 0x8094 -stackwalkFilter -in 1 80 

If you want to disable the highest volume events of Microsoft-Windows-Dwm-Core you can configure a list of events which should not be traced at all:

tracelog -enableex DWMSession -guid #9e9bba3c-2e38-40cb-99f4-9e8281425164 -flags 0x1 -level 4 -EventIdFilter -out 23 2 9 6 244 8 204 7 14 13 110 19 137 21 111 57 58 66 67 68 112 113 192 193

That line will keep the DWM events which are relevant to the WPA DWM Frame Details graph. Tracelog is part of the Windows SDK, but originally it was only part of the Driver SDK, which at least partially explains why it is the only tool that allows you to declare ETW event and stackwalk filters. Whether you filter something in or out, the first number you pass is the count of event ids that follow.

tracelog … -stackwalkFilter -in 1 80

means: enable stack walks for exactly one event, the one with event id 80.
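
If you want stack walks for more than one event, the leading count grows with the list; a hypothetical example with three event ids:

tracelog … -stackwalkFilter -in 3 80 81 250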

tracelog … -EventIdFilter -out 23 2 9 …

will filter out 23 events from all enabled events, where the 23 event ids directly follow the count. You can also execute such tracelog statements on operating systems older than Windows 8.1; in that case the event id filter clauses are simply ignored. In my opinion xperf should support this out of the box so that I do not have to use another tool to fully configure lightweight ETW trace sessions.

Analyzing ETW Data

After fiddling with the ETW providers and setting up filters for the highest volume ones I was able to get a snapshot at the time the server was slow. The UI was not reacting for minutes and I found this interesting call stack consuming one core for nearly three minutes!

image

When I drilled deeper I saw that Gen1 collections were consuming enormous amounts of CPU in the method WKS::allocate_in_older_generation.

image

When you have enabled the .NET ETW provider with keyword 0x1, which tracks GC events, and you use my GC Activity region file, you can easily see that some Gen1 collections take up to 7 s. If such a condition occurs in several different processes at the same time it is no wonder that the server looks largely unresponsive, although all other metrics like CPU, memory, disk IO and the network are ok.

Gen1GCTimes
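
For reference, keyword 0x1 of the CLR provider (GUID from above) is what enables the GC events used in this graph. A user session recording only those could be started roughly like this (session name, level and output file are placeholders):

xperf -start DotNetSession -on e13c0d23-ccbc-4e12-931b-d9cc2eee27e4:0x1:4 -f dotnet.etl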

When I find an interesting call stack it is time to see what the internet can tell me about allocate_in_older_generation. That leads straight to Stack Overflow, where someone else already had this issue (http://stackoverflow.com/questions/32048094/net-4-6-and-gc-freeze), which in turn points to https://blogs.msdn.microsoft.com/maoni/2015/08/12/gen2-free-list-changes-in-clr-4-6-gc/ where a .NET 4.6 GC bug is mentioned. There the .NET GC developer nicely explains what is going on:

… The symptom for this bug is that you are seeing GC taking a lot longer than 4.5 – if you collect ETW events this will show that gen1 GCs are taking a lot longer. …

Based on the data above and the fact that the server was running .NET 4.6.96.0, which is a .NET 4.6 build, this explains why, with an unlucky allocation pattern in processes that had already allocated more than 2 GB of memory, the Gen1 GC pause times would sometimes jump from a few milliseconds up to seconds. Getting 1000 times slower at random times certainly qualifies as "a lot longer". It would even be justified to use more drastic words for how much longer this sometimes takes. But ok, the bug is known and fixed: if you install .NET 4.6.1 or the upcoming .NET 4.6.2 you will never see this issue again. It was quite tough to get the right data at the right time, but I am getting better at these things by looking at the complete system at once. I am not sure you could really have found this issue with normal tools when it manifests in random processes at random times. If you concentrate on only one part of the system at a time you will most likely fail to identify such systemic issues. And when you updated the server some months later during regular maintenance, the issue would have vanished, leaving you with no real explanation why the long standing problem suddenly disappeared. At best you would have identified a .NET Framework update in retrospect as the fix you had been searching for so long.


The Non Contracting Code Contracts

What do you do when you get an impossible exception of the form

System.InvalidCastException: Object of type 'MyApp.Type[]' cannot be converted 
                to type 'System.Collections.Generic.IEnumerable`1[MyApp.Type]'.
System.RuntimeType.TryChangeType(…)
System.RuntimeType.CheckValue(…)
System.Reflection.MethodBase.CheckArguments(…)
System.Reflection.RuntimeMethodInfo.InvokeArgumentsCheck(…)
...

where basically an array of type X cannot be cast to IEnumerable<X>. Any LINQ-affine programmer does that 20 times a day. What the heck is going on here? It looks like the CLR type system got corrupted by some bad guy. It could be anything from spurious memory corruption due to cosmic rays to some code on this or another thread corrupting memory somehow. Random bit flips are not so uncommon if you know that cosmic muons hit the earth's surface at a rate of ca. 60 000 muons/(h*m^2). It is only a matter of time until some bit in your memory randomly gets another value.
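
To put the absurdity into code: the conversion that failed here is one that normally can never fail. A minimal C# reconstruction of the pattern (type and method names are invented; the reflection call mirrors the stack above):

using System;
using System.Collections.Generic;
using System.Reflection;

class MyType { }

class Program
{
    static void Consume(IEnumerable<MyType> items) { }

    static void Main()
    {
        MyType[] array = { new MyType() };

        // Array covariance: a MyType[] is implicitly an IEnumerable<MyType>.
        IEnumerable<MyType> direct = array;   // always succeeds with a healthy type system
        Console.WriteLine(direct != null);

        // The stack above comes from the argument check inside reflection,
        // i.e. the same conversion performed by MethodBase.Invoke.
        MethodInfo consume = typeof(Program).GetMethod("Consume", BindingFlags.NonPublic | BindingFlags.Static);
        consume.Invoke(null, new object[] { array });   // in the broken state this is where the InvalidCastException surfaced
    }
}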

This issue surfaced on some machines quite reproducibly, so we can put the exotic theories about possible sources of memory corruption back on the bookshelf. First of all I needed a memory dump of the process in question. This is much easier if you know how to "attach" procdump to a process and let it create a dump on any first chance exception with a specific type or message. That comes in handy if you want to create a dump not when a process crashes, but when an internally caught exception occurs and you want to inspect the application state at that point in time.

The Sysinternals tool ProcDump can do this from the command line without any extra installation. The option -e 1 instructs it to dump on first chance exceptions and -f applies a substring filter to the exception text shown by procdump in the console.

C:\>procdump  -ma -e 1 -f "InvalidCast" Contractor.exe

ProcDump v7.1 - Writes process dump files
Copyright (C) 2009-2014 Mark Russinovich
Sysinternals - www.sysinternals.com
With contributions from Andrew Richards

Process:               Contractor.exe (12656)
CPU threshold:         n/a
Performance counter:   n/a
Commit threshold:      n/a
Threshold seconds:     10
Hung window check:     Disabled
Log debug strings:     Disabled
Exception monitor:     First Chance+Unhandled
Exception filter:      *InvalidCast*
Terminate monitor:     Disabled
Cloning type:          Disabled
Concurrent limit:      n/a
Avoid outage:          n/a
Number of dumps:       1
Dump folder:           C:\
Dump filename/mask:    PROCESSNAME_YYMMDD_HHMMSS


Press Ctrl-C to end monitoring without terminating the process.

CLR Version: v4.0.30319

[19:48:12] Exception: E0434F4D.System.InvalidCastException ("Object of type 'MyApp.Type[]' cannot be converted to type 'System.Collections.Generic.IEnumerable`1[MyApp.Type]'.")
[19:48:12] Dump 1 initiated: C:\Contractor.exe_160718_194812.dmp
[19:48:12] Dump 1 writing: Estimated dump file size is 50 MB.
[19:48:12] Dump 1 complete: 50 MB written in 0.1 seconds
[19:48:13] The process has exited.
[19:48:13] Dump count reached.

If you have a more complex process creation setup, e.g. you create many instances of the same executable, then you cannot use the simple executable name filter. In that case ProcDump only accepts the process id of an already running process, which you then need to determine from somewhere else. DebugDiag would be my tool of choice; it allows you to configure rules for any process executable, regardless of whether it is already running or will be started later. There you can also configure a dump trigger for a specific exception with a specific message.

To do this, start the DebugDiag 2 Collection tool and select Crash.
image
 
After selecting a process you can configure e.g. a full dump for each InvalidCastException
 
image
 
When you have activated the Crash rule you will get a nice report of how many dumps have already been collected since the rule became active.
image
 
In the configured directory you will then find the dump files along with the DebugDiag logs.
image

Once I had the dumps I looked into them, but I could not make much sense of the InvalidCastException. Without the exact CLR source code it is very hard to make sense of the highly optimized data structures in the CLR. At that point I reached out to Microsoft to look into the issue. Since the dump also did not provide enough clues I packed a virtual machine together and transferred it directly to MS support via their upload tool. The support engineer was not able to get the VM running, which seems to happen quite frequently when you upload 50+ GB files over the internet. The solution was to send me a USB hard drive by mail, copy the VM onto the disk and send it back to MS. That worked out, and I heard nothing from MS support for some time.

It seems my issue was routed through various hands until it reached one of the CoreCLR devs, who has certainly debugged many similar issues. Just when I was beginning to suspect I would never hear about the issue again, I got a mail back saying they had found the problem.

The problem is that the CLR type system gets confused by assembly references which have invalid Flags set. The Flags value of an assembly reference in the IL metadata must be zero, but in my case it was 0x70, which is normally a C# compiler internal marker for reference assemblies. In the final code emitting phase the C# compiler zeroes out the assembly reference flags and all is fine. But for some reason one of my assemblies still had 0x70 set for all of its assembly references.

If you wonder what reference assemblies are: these are the assemblies the .NET Framework consists of, in its various versions. If you target a specific .NET Framework version like 4.6.2, you "link" against the assemblies located in these directories to ensure that you can only call APIs of the currently configured .NET target framework.
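
On a typical developer machine these reference assemblies live in versioned folders like the following (the 4.6.1 folder is just an example):

C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6.1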

image

I was told that an issue had been reported with the .NET 4.5.2 C# compiler which could trigger that behavior. Whether the newer Roslyn based C# 6 compiler still suffers from this issue is not known to me. That sounded like an arcane and spurious compiler bug. For some reason MS asked me a few days later whether we modify the generated assemblies. It turns out some of our build targets use Code Contracts to enforce runtime checks, which requires assembly rewriting. After querying all assembly references I found that ALL Code Contracts rewritten assemblies have invalid assembly reference flags with 0x70 set. This is still true with the newest Code Contracts version, which is currently 1.9.10714.2.

If you try to check the assembly references via Reflection you will find that it always reports the flags of the referenced assemblies as zero. This is even true if you load them with Assembly.ReflectionOnlyLoadFrom. Even Ildasm will not show you the Flags field of the assembly references. Windbg? Nope.
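
Today you could also dump the raw AssemblyRef rows with the System.Reflection.Metadata NuGet package; a sketch of that route (this is not what I used back then, and the output formatting is my own):

using System;
using System.IO;
using System.Reflection.Metadata;
using System.Reflection.PortableExecutable;

class DumpAssemblyRefFlags
{
    static void Main(string[] args)
    {
        using (FileStream stream = File.OpenRead(args[0]))
        using (PEReader pe = new PEReader(stream))
        {
            MetadataReader md = pe.GetMetadataReader();
            foreach (AssemblyReferenceHandle handle in md.AssemblyReferences)
            {
                AssemblyReference aref = md.GetAssemblyReference(handle);
                // The raw Flags column of the AssemblyRef metadata table.
                // The spec requires 0 here; the broken assemblies show 0x70.
                Console.WriteLine("{0,-20} 0x{1:X}", md.GetString(aref.Name), (uint)aref.Flags);
            }
        }
    }
}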

I had to go back to my old pet project ApiChange and patch the -ShowReferences command to print, besides the referenced assembly, also the Flags field of the assembly reference.


C:\>apichange -sr D:\bin\SubContractor.dll
SubContractor.dll ->
 System, Version=4.0.0.0,   Culture=neutral,
                            PublicKeyToken=b77a5c561934e089 0x70
 mscorlib, Version=4.0.0.0, Culture=neutral, 
                            PublicKeyToken=b77a5c561934e089 0x70

There I could finally see that my assembly had invalid assembly references. How can I fix it? That is easy. Turn off Perform Runtime Contract Checking!

image

Now you get correct assembly references as the ECMA spec requires them.


C:\>apichange -sr D:\bin\SubContractor.dll
SubContractor.dll ->
 mscorlib, Version=4.0.0.0, Culture=neutral, 
            PublicKeyToken=b77a5c561934e089 0x0
 System, Version=4.0.0.0,   Culture=neutral, 
            PublicKeyToken=b77a5c561934e089 0x0

The source of the spurious InvalidCastExceptions is found, but why does it happen so infrequently? I tried to create a simple reproducer but I was not able to force the cast error. Whether it triggers seems to depend on the machine, memory layout, number of loaded types, IL layout, moon phase, …. When something goes wrong it seems that in an assembly with correct assembly references to the reference assemblies a cast is attempted from

mscorlib         ;  IEnumerable<T>  to 
mscorlib,Ref=0x70;  IEnumerable<T>  

where the type identity of generic types and their arguments gets mixed up in the process; the CLR then correctly tells me that seemingly identical types are not identical, because their full type expansions differ in their assembly references.

I was never a big fan of Code Contracts because of the way they are implemented. Rewriting existing assemblies can and will cause unforeseen issues in production code. If I can, I want to keep only my C# and JIT compiler bugs and not add Code Contracts bugs on top of them. Do not get me wrong: I really like the idea of contract checking. But such checks should be visible in the source code, or at least be compiled directly by a Code Contracts plugin for the now much more open Roslyn compiler.

As it is now, Code Contracts considerably increase compilation time, and if you enable static code checks you can easily end up with builds that take several hours, which is way too hefty a price for some additional checks.

If you want to remove Code Contracts now, you need to remove all Contract.Requires<Exn>(…) checks from your source code, because these will assert in non-rewritten Release builds as well, which is certainly not what you want. As it stands, Code Contracts violate the contract of generating valid assembly references: the Code Contracts assembly rewriter generates assemblies with invalid assembly references which can, under specific circumstances, trigger impossible runtime errors in the CLR.
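
To make that removal concrete, here is the kind of change involved; a minimal sketch with invented names (per the paragraph above, a failed Contract.Requires<TException> in a non-rewritten build asserts instead of throwing the requested exception):

using System;
using System.Diagnostics.Contracts;

static class OrderValidator
{
    // Before: relies on the ccrewrite step to turn the contract into a real throw.
    public static void Process(string orderId)
    {
        Contract.Requires<ArgumentNullException>(orderId != null);
        // ... real work ...
    }

    // After removing Code Contracts: a plain guard clause that behaves
    // the same way in every build flavor, rewritten or not.
    public static void ProcessChecked(string orderId)
    {
        if (orderId == null)
            throw new ArgumentNullException("orderId");
        // ... real work ...
    }
}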