MemAnalyzer v2.5 Released

Managed  (and unmanaged) memory leak detection is still not as easy as it should be. MemAnalyzer tries to be a simple command line tool with the goal to help with the easy problems but also with the hard issues which surface only after hours of stress testing. Often you have a mixture of managed and unmanaged memory leaks where you need memory dumps and VMMap information to get the full picture. MemAnalyzer helps to automate these boring tasks. It is open source at Github (https://github.com/Alois-xx/MemAnalyzer). The executable can can be downloaded here: https://github.com/Alois-xx/MemAnalyzer/releases.

If you are looking for a nice UI to look into memory dumps I recommend MemoScope.NET (https://github.com/fremag/MemoScope.Net) which lets you look into dump files without the need to resort back to Windbg. By the way if you have Visual Studio Ultimate you can already analyze managed memory dumps. But to analyze issues in production installing Visual Studio is not an option. PerfView on the other hand is a great tool but the numbers are only approximations which can make it hard to spot handle leaks. The object counts reported by PerfView are often off by the order of a magnitude. MemAnalyzer tries to get exact metrics of the real alive objects with the -live switch which is as good as !DumpHeap -stat -live of Windbg as CSV output.

MemAnalyzer Features

  • Single self contained executable
  • Supports x86 and x64 processes and memory dumps
  • .NET Core on Windows x86 and x64 support (.NET Core 1.0,1.1 and 2.0, …)
  • Create memory dumps with associated VMMap data
  • Analyze managed heap, unmanaged, private bytes and file mappings when VMMap is present
  • Memory dump diff
  • Optional CSV output

Usage – Leak Detection

Why bother with a command line tool if nicer UIs are around? Because MemAnalyzer is capable to track not only your managed memory but also the unmanaged parts. When a managed application leaks memory you need first to find out if the leak happens on the managed heap or somewhere else.  Depending on the leaked memory type one needs to use different approaches/tools to track the leak down.

The memory consumption of a process can be categorized as

  • Managed Heap
  • Unmanaged Heap
  • Private Bytes
  • Page File Allocated Shared Memory (Shareable in VMMap lingo)
  • Memory Mapped Files

Since there are quite a few different memory types inside a process it is important to know where you need to look at.  MemAnalyzer uses VMMap to determine the size of each region and prints them out in a nice summary which can be put into a CSV file to e.g. get a summary after each test run during an automated test.

C>MemAnalyzer.exe -pid 17888 -vmmap

AllocatedBytes          Instances(Count)        Type
4,648,204               105,374                 System.String
918,824                 22                      System.String[]
697,640                 27,607                  System.Object[]
662,424                 27,601                  System.Int32
1,512                   27                      System.RuntimeType
1,072                   2                       System.Globalization.CultureData
830                     5                       System.Char[]
580                     8                       System.Int32[]
432                     2                       System.Globalization.NumberFormatInfo
26,130                  1,087                   Managed Heap(Free)!
6,936,367               160,704                 Managed Heap(Allocated)!
7,158,288                                       Managed Heap(TotalSize)
25,165,824                                      Reserved_Stack
54,398,976                                      Committed_Dll
1,994,752                                       Committed_Heap!
4,177,920                                       Committed_MappedFile!
565,248                                         Committed_Private!
3,825,664                                       Committed_Shareable!
73,629,696                                      Committed_Total
17,499,952                                      Allocated(Total)
  • Allocated managed objects. That is very much similar to !DumpHeap -stat in Windbg only with more options.
    • If you add -live then the metric will contain no temporary objects which were not reclaimed by the GC yet.
  • Managed heap summary which shows an overall metric how big the heap is and how much of it is allocated and free.
  • Additional VMMap information that gives you an overview which other memory types are allocated in the process.
    • MemAnalyzer needs the VMMap tool in the path to get that data.
  • Allocated = Managed Heap(Allocated) + Heap + MappedFile + Private Bytes + Shareable

The Allocated value is important because if this value rises over time you have a leak in one of the memory types of the sum. If you print this value over time and it does not rise you have no leak (warning simplified!). That is simple enough to do it repeated times by e.g. a script to verify that your long running test behaves well. Since repeated measurements are key to detecting a memory leak MemAnalyzer allows you to append the output to a CSV file along with some context e.g. Iteration 1, 100 to get more clues.

Inside your tracking script a more realistic command line would be

MemAnalyzer -pid {0} -vmmap -o leak.csv -dtn 5000;N#200 -live -silent  -context “{1}”

This will append the output of -dtn (Dump Type by Number)  for up to 5K types with an instance count > 200 to the CSV file leak.csv. Each line will get a context column which can be e.g. your test run number or whatever it makes easier to correlate the time when the data was taken. To get additional information you can add automatic memory dumps to the mix with

MemAnalyzer -procdump -ma {0} {1}\PROCESSNAME_{0}_YYMMDD_HHMMSS.dmp

This will take a memory dump of the process with pid {0} with procdump (needs to be in the path) and also gather VMMap information automatically (VMMap needs to be in the path). The big letter words will be expanded by procdump automatically. That way you can e.g. take a full memory dump after 1, 10 , 100, 500 iterations which contains everything but the trending data is contained in the csv file for every iteration which makes it much easier to track down the real memory leaks. Based on personal experience it is pretty easy to be led down the wrong path by a few memory dumps created by coworkers. The first dump might be created before anything was loaded into the application and the last dump might still have the test data loaded which looks like a pretty big leak but it is not the leak you a after when you have lost 500 MB after 100 iterations. Having more data points at hand which can easily be graphed in Excel is a big help to concentrate on the important incidents and to identify stable patterns and trends without the need to take a gazillion of memory dumps.

Usage – Memory Optimization

If you want to optimize the memory usage of an existing application MemAnalyzer is also a big help because you can quickly diff a memory memory dump which is your baseline against the currently running application. To get started you should take a memory dump of your current state.

MemAnalyzer -procdump -ma pid C:\temp\Baseline.dmp

After you have optimized the data structures of your application to (hopefully) consume less memory you can compare the running application against your saved baseline

MemAnalyzer -f2 baseline.dmp -pid ddd

When you use -f2 then 2-1 will be subtracted and you get a nice diff summary output. To keep the output short the diff is sorted by absolute values which makes it easy to spot top memory additions and deletions along with the totals.

Lets do an step by step example what that means for your usual application development workflow. First we start with our memory hungry application and isolate the memory issue into a single reproducer like this:

using System;
using System.Collections.Generic;

namespace coreapp
{
    class DataInstance : IDisposable
    {
        Func<string> Checker;
        long Instance;
        bool IsDisposed;
        DataInstance[] Childs;

        public DataInstance(int instance)
        {
            Instance = instance;
            Checker = () => $"Instance {Instance} already disposed";
        }

        public void Dispose()
        {
            if (IsDisposed)
            {
                throw new ObjectDisposedException(Checker());
            }
	     IsDisposed = true;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var instances = new List<DataInstance>();
            for(int i=1;i<=5*1000*1000;i++)
            {
                instances.Add(new DataInstance(i));
            }
            Console.ReadLine();
        }
    }
}

We give MemAnalyzer the process id to create a baseline memory dump. Since MemAnalyzer uses procdump and VMmap you should have both already downloaded and the tools in your path to make it work.

MemAnalyzer.exe -procdump -ma 11324 DotNetCoreApp_1.0.dmp

Ok we have a dump of a .NET Core application. How can we look into it?

 

MemAnalyzer.exe -f DotNetCoreApp_1.0.dmp
Error: Is the dump file opened by another process (debugger)? If yes close the debugger first.
       If the dump comes from a different computer with another CLR version v1.0.25211.02 that you are running on your machine you need to download the matching mscordacwks.dll first. Check out https://1drv.ms/f/s!AhcFq7XO98yJgoMwuPd7LNioVKAp_A and download the matching version/s.
       Then set _NT_SYMBOL_PATH=PathToYourDownloadedMscordackwks.dll  e.g. _NT_SYMBOL_PATH=c:\temp\mscordacwks in the shell where you did execute MemAnalyzer and then try again.

Got Exception: System.IO.FileNotFoundException: mscordaccore_Amd64_Amd64_1.0.25211.02.dll

 

Ups we have got an error. Most people stop reading when an error occurs because the error messages are most often not that helpful. But  this case is different. You need to download the folder of my OneDrive folder of the link in the error message to get nearly all .NET/Core debugging dlls you could ever need. Download them into e.g. C:\PerfTools. Then you need to tell MemAnalyzer where to look for it with the -dacdir option or you can set the environment variable _NT_SYMBOL_PATH=c:\PerfTools to get rid of the need to specify the dac directory every time manually.

MemAnalyzer.exe  -dts -f DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools

AllocatedBytes          Instances(Count)        Type
320,000,000             5,000,000               System.Func<System.String>
240,000,000             5,000,000               coreapp.DataInstance
100,663,368             3                       coreapp.DataInstance[]
24,530                  145                     System.String
33,627,594              139                     Managed Heap(Free)!
660,714,944             10,000,277              Managed Heap(Allocated)!
694,348,008                                     Managed Heap(TotalSize)

We have 660 MB on the managed heap allocated which is quite a lot of data. There are 5 million Func<string> and DataInstance instances. But why are we having 3 DataInstance arrays with 100MB? These look like temp arrays left over from our List<DataInstance> while growing the internal array. To get rid of garbage data you can either do a GC.Collect() before taking the dump or you tell MemAnalyzer to only track objects which are still alive.

MemAnalyzer.exe  -f DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools -live

AllocatedBytes          Instances(Count)        Type
320,000,000             5,000,000               System.Func<System.String>
240,000,000             5,000,000               coreapp.DataInstance
67,108,912              2                       coreapp.DataInstance[]
24,530                  145                     System.String
627,160,448             10,000,275              Managed Heap(Allocated)!
694,348,008                                     Managed Heap(TotalSize)

There is still one array left which does not belong there but the numbers are now better. While looking at the data I decided that we should get rid of the many delegate instances which cost 64 byte per instance which add up to 320 MB alone for the instances itself. But since the DataInstance object also keeps a reference (on x64 8 bytes) we have even more memory to spare. If we get rid of the delegate and remove the class member we should be able to spare 5m*(64+8)=360MB of memory. That’s a plan. Lets measure things.  Our refactored class becomes

    class DataInstance : IDisposable
    {
        long Instance;
        bool IsDisposed;
        DataInstance[] Childs;

        public DataInstance(int instance)
        {
            Instance = instance;
        }

        public void Dispose()
        {
            if (IsDisposed)
            {
                throw new ObjectDisposedException($"Instance {Instance} already disposed");
            }
            IsDisposed = true;
        }
    }

By taking a second dump we can diff both dump files with

MemAnalyzer.exe -f DotNetCoreApp_NoFuncDelegate.dmp -f2 DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools

Delta(Bytes)    Delta(Instances)        Instances       Instances2      Allocated(Bytes)        Allocated2(Bytes)       AvgSize(Bytes)  AvgSize2(Bytes) Type
320,000,000     5,000,000               0               5,000,000       0                       320,000,000                             64              System.Func<System.String>
40,000,000      0                       5,000,000       5,000,000       200,000,000             240,000,000             40              48              coreapp.DataInstance
0               0                       1               1               160                     160                     160             160             System.Globalization.CalendarData
360,000,000     5,000,000               5,000,277       10,000,277      300,714,930             660,714,930                                             Managed Heap(Allocated)!
360,010,320     0                       0               0               334,337,688             694,348,008                                             Managed Heap(TotalSize)

As expected we got rid of 5 million Func<String> instances. After removing one field in DataInstance the instance size did shrink by 8 bytes from 48 down to 40 bytes which saved another 40MB. That is already quite good. But can we do better? The dispose check is an extra bool flag which will need  due to padding 4 bytes anyway. To eliminate the bool field we can reuse the Instance field and negate the Instance count to we keep the stored value which is always > 0. When you look closely you find that Instance is of the type long but we only need an int because we will always assign in the ctor the value from an integer. The revised DataInstance class is now

    class DataInstance : IDisposable
    {
        int Instance;
        DataInstance[] Childs;

        public DataInstance(int instance)
        {
            Instance = instance;
        }

        public void Dispose()
        {
            if (Instance < 0)
            {
                throw new ObjectDisposedException($"Instance {-1*Instance} already disposed");
            }

            Instance *= -1; 
        }
    }

When we diff things again

MemAnalyzer.exe -f DotNetCoreApp_NoFuncDelegate_intFieldNoDisposeFlag.dmp -f2 DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools

Delta(Bytes)    Delta(Instances)        Instances       Instances2      Allocated(Bytes)        Allocated2(Bytes)       AvgSize(Bytes)  AvgSize2(Bytes) Type
320,000,000     5,000,000               0               5,000,000       0                       320,000,000                             64              System.Func<System.String>
80,000,000      0                       5,000,000       5,000,000       160,000,000             240,000,000             32              48              coreapp.DataInstance
33,554,456      1                       2               3               67,108,912              100,663,368             33554456        33554456        coreapp.DataInstance[]
24              1                       1               2               24                      48                      24              24              System.Int32
0               0                       2               2               208                     208                     104             104             System.Globalization.CultureInfo
0               0                       2               2               912                     912                     456             456             System.Globalization.CultureData
433,554,480     5,000,002               5,000,275       10,000,277      227,160,450             660,714,930                                             Managed Heap(Allocated)!
400,011,856     0                       0               0               294,336,152             694,348,008                                             Managed Heap(TotalSize)

Since we compare against the original baseline directly see the improvement in memory consumption by 433MB. That is 65% less memory! Not bad. If you want to keep going fast you can directly compare a memory dump against a running process to check if a temporary optimization pays off. I have found the VS profiler to break when larger x86 applications were profiled because VS seems to load the data also into a x86 process where things break if the more fancy object graph calculation because VS runs out of memory…

VS 2017.3 does not yet recognize CoreClr memory dumps as managed processes which still requires managed heap analysis with PerfView or Windbg or MemAnalyzer.

SOS and mscordacwks, mscordaccore Collection

Even if you are not interested in MemAnalyzer you might stop by for the biggest collection of SOS and mscordacwks debugging dlls for all .NET versions I could get my hands on. When you analyze memory dumps taken from other machines you need to have a close version match within Windbg or an exact version match with PerfView / ClrMd. Inside Microsoft this is a non issue because their symbol servers distribute the matching binaries without any hassle. We outsiders have to copy the corresponding debugging libraries from the original machine or from the corresponding .NET installer. To spare you the time to hunt for the matching debugging dlls I share my collection of mscordackwks files as OneDrive link: https://1drv.ms/f/s!AhcFq7XO98yJgoMwuPd7LNioVKAp_A

Currently it contains the versions

2.0.50727.1434
2.0.50727.3607
2.0.50727.3615
2.0.50727.4016
2.0.50727.4062
2.0.50727.4200
2.0.50727.4408
2.0.50727.4455
2.0.50727.4927
2.0.50727.4952
2.0.50727.5403
2.0.50727.5420
2.0.50727.5444
2.0.50727.5448
2.0.50727.5456
2.0.50727.8009
4.0.30319.01
4.0.30319.1008
4.0.30319.1022
4.0.30319.17379
4.0.30319.18052
4.0.30319.18408
4.0.30319.18444
4.0.30319.2034
4.0.30319.225
4.0.30319.237
4.0.30319.269
4.0.30319.296
4.0.30319.33440
4.0.30319.34003
4.0.30319.34011
4.0.30319.34014
4.0.30319.34209
4.0.30319.46960
4.6.100.00
4.6.1085.00
4.6.127.01
4.6.1586.00
4.6.1590.00
4.6.1637.00
4.6.81.00
4.6.96.00
4.7.2053.00

.NET Core

.NET Core 1.0     1.0.25211.02
.NET Core 1.1     4.6.25211.01
.NET Core 2.0 x64 4.6.25519.02
.NET Core 2.0 x86 4.6.25519.03

It is interesting to note that .NET Core 2.0 has different build numbers between the x86 and x64 version. It looks like one blocking issue needed fixing before they did release it to a wider audience.

Conclusions

Your toolbox just has got a little bigger. As always use the right tool for the job. MemAnalyzer is not the silver bullet for all of your memory problems but it tries its best to give you fast feedback without the overhead of a fancy UI which makes it easy to put it into your existing leak tracking/reporting scripts. If you want to share success stories sound off in the comments. If you want to report bugs/issues it would be nice to open an issue at https://github.com/Alois-xx/MemAnalyzer/issues. Now go and improve the memory footprint of your app!

The Definitive Serialization Performance Guide

When looking at performance issues with ETW I did find long deserialization times in conjunction with BinaryFormatter. A deeper look revealed that the issue is easy to reproduce if the object graph gets bigger (>100K objects). Since BinaryFormatter is in business since over 10 years and I have never heard of significant performance issues except that it is slow in general I was quite surprised that such a blatant problem still exists in .NET. But since .NET is open sourced at github it is easy to complain: https://github.com/dotnet/corefx/issues/16991. This has catched the interest of Stephen Toub himself and they added a small but impactful change to BinaryFormatter:

image

The problem during deserialization with BinaryFormatter is that it uses the MaxArraySize value to bucket an ObjectHolder array like a hash table.

ObjectHolder objects[MaxArraySize=0x1000=4096] 

class ObjectHolder 
{ 
  internal ObjectHolder m_next;
}

If we need to store one million objects into one ObjectHolder array with a length of 4096 we need  to create 4096 ObjectHolder linked lists via the ObjectHolder.m_next field with a depth of 244 nodes. When you try to access a specific object in that pretty large linked list you will need to touch a large number of linked nodes during deserialization. By increasing the value to 1 048 576 ( = 0x100000 ) we will be able to use the array as a real hash table where only a few collision’s will make it necessary to look up the next item in the linked list.

The next .NET Core version will have a band aid on it so that the issue will appear only with even bigger object graphs. With the current BinaryFormatter you you will get a nice parabola  where the serialization times for one million objects was only 2s but the deserialization time is in the order of 80s!

image

After the fix in .NET Core you can serialize object graphs up to ca. 13 million objects before you hit the next limitation of BinaryFormatter:

Unhandled Exception: System.Runtime.Serialization.SerializationException: Exception of type ‘System.Runtime.Serialization.SerializationException’ was thrown.
   at System.Runtime.Serialization.ObjectIDGenerator.Rehash() in D:\Source\vc17\NetCoreApp\ConsoleApp1\ConsoleApp2\Serialization\ObjectIDGenerator.cs:line 140

This time BinaryFormatter is running out of prime numbers for another hash table. If we try to serialize objects graphs with more than 2*6 584 983 objects we are out of luck again because the ObjectIDGenerator did never expect us to serialize more than 2*6584983 objects.

public class ObjectIDGenerator
{
    // Table of prime numbers to use as hash table sizes. Each entry is the
    // smallest prime number larger than twice the previous entry.
    private static readonly int[] s_sizes = 
    {
        5, 11, 29, 47, 97, 197, 397, 797, 1597, 3203, 6421, 12853, 25717, 51437,
        102877, 205759, 411527, 823117, 1646237, 3292489, 6584983
    };

But hey that is ok. No one did ever that successfully until now. Since there are not many (http://stackoverflow.com/questions/569127/serializationexception-when-serializing-lots-of-objects-in-net) complaints about that one I guess everyone has simply moved on and switched to a faster serializer. Besides that I wonder if the guys did ever profile their application why deserialization was taking ca. 45 minutes for a ca. ~ 300MB file. 

What Types of Serializers Exist?

When you want to switch away from BinaryFormatter you first need to check how your data is structured. If your data can contain cyclic references then you have less options because most serializers are by default tree serializers which cannot cope with object graphs. Another downside might be that your target serializer cannot serialize private fields which BinaryFormatter is capable to do. You also need to be able to change the data object and base classes to add the necessary attributes, ctors to make it work with other serializers. Only BinaryFormatter serializes pretty much everything as long as the class has [Serializable] put on it. And last but not least it should support streams to read and write data from to it. JSON strings are a nice and efficient data storage format for small messages but reading a 200MB JSON file into a single string because the serializer is not supporting streams is creating a lot of work for the garbage collector. And you can start deserializing the data only when the complete file has been read. FastJSON is the only serializer which does not support streams which makes it not a good choice for larger messages.

Below is a collection of widely used serializers and their most important feature properties summed up in a table:

Serializer Default Serializer Type Data Format Private Members Stream Support .NET Core Support Default Ctor Needed To  Deserialize Explicit Object Graph Support
BinaryFormatter Graph Binary Yes Yes Yes (NetStandard 1.6) No Enabled by Default.
XmlSerializer Tree Xml Yes Yes Yes Yes public No
DataContractSerializer Tree Xml Yes Yes Yes No
new DataContractSerializer(typeof(TypeToSerialize), 
new DataContractSerializerSettings  
{ 
   PreserveObjectReferences = true,
});
Jil Tree JSON No Yes Yes Yes public No
FastJSON Graph JSON No No No Yes public Enabled by Default.
Protobuf-Net Tree Binary
Google Protocol Buffer
Yes Yes Yes Yes* Declarative at ProtoMember level


// * no default ctor needed if SkipConstructor=true 
// Thanks Marc for that hint
[ProtoContract(SkipConstructor=true)] 
class DataForProtobuf
{
  [ProtoMember(1, AsReference = true)]
   DataForProtobuf Parent;
}
JSON.NET Tree JSON Yes Yes Yes No Does not completely work. See https://github.com/JamesNK/Newtonsoft.Json/issues/1284

var ser = JsonSerializer.Create(new JsonSerializerSettings()
{
 PreserveReferencesHandling =   PreserveReferencesHandling.Objects,
 ReferenceLoopHandling =   ReferenceLoopHandling.Serialize,          
});
NFX.SlimSerializer Graph Binary Yes Yes No No  
Wire** Tree Binary Yes Yes Yes No new Serializer(new SerializerOptions(preserveObjectReferences:true))
Crashes with StackoverflowException if cycles are present and preserveObjectReferences is not set!
MsgPack** Graph Binary Yes Yes No    

**Update1: Added Wire and MsgPack on request.

With the table above you can better judge which serializers could work for your scenario. Pretty much any serializer exchange will result in breaking changes to your serialized data format. The cost of a switch needs therefore be justified and you need either a way to migrate the data or you stay polymorphic by keeping your data objects and add the necessary support for the other serializer in place which gives you the ability to switch back to the old one for data migration. If you switch e.g. from XmlSerializer to DataContractSerializer both can write Xml as output but DataContractSerializer writes the serialized data never into Xml attributes which XmlSerializer pretty often does. That makes it impossible to switch over from either one without breaking the data exchange format.

Besides the used data format, readability, and feature set the only real metric why one would want to switch over to another serializer is performance. There are many performance guides out there which measure one aspect of serializers but all of them I have found so far ignore important details during the measuring process. The following numbers are therefore the “real” ones measured on my machine with representative samples and many runs to average out random noise from the OS. There is one pretty good serialization testing framework out there which I have found after I have finished my analysis http://aumcode.github.io/serbench/ which was written by some smart guy who did write Pile ( https://www.infoq.com/articles/Big-Memory-Part-2). The claim there is that they use the fastest homegrown serializers to stuff objects into big byte arrays (pile) so they can support many GB large heaps (10s of GB inside one process) while keeping very small GC latencies because the objects are only deserialized on access and live only a short time which makes them Gen0 garbage quite fast. The web site contains many graphs http://aumcode.github.io/serbench/Specimens_Typical_Person/web/overview-charts.htm) which do not really make it clear how you can choose your best serializer. Personally I have found the way how the data is presented confusing. Compared to serbench my own tester is simpler to play with if you want to plug in your own serializer because you can directly edit the code from one self contained executable. My tester (https://github.com/Alois-xx/SerializerTests) also warns you if you are measuring debug builds or you are using a not NGenned baseline which for first call effects does not measure the actual serializer overhead but the amount of JITed code executed. Startup performance testing is complex and I believe my tester does the best job there.

Tested Versions

Serializer File Version
BinaryFormatter 4.6.1637.0 built by: NETFXREL3STAGE
DataContract 4.6.1637.0 built by: NETFXREL3STAGE
XmlSerializer  4.6.1586.0 built by: NETFXREL2
FastJson 2.1.23.0
Jil 2.15.0
JsonNet 10.0.2.20802
Protobuf_net 2.1.0
SlimSerializer 3.0.0.1
Wire 1.0.0.0
MsgPack 0.1.4298.15470

Serializer Performance

Here are the results of my own testing with various serializers. The numbers below were created with https://github.com/Alois-xx/SerializerTests which is my own serialization testing framework to get the numbers. The graph shows on the left axis the average time for the de/serialize operation per serializer. The tests were performed with one up to 1 million Book objects  where the average throughput was used. Since each serializer has a slightly different output format the data size to process changes a lot. The serialized data size is printed on the right axis where the size is shown from top to bottom to achieve better readability. The absolute performance is therefore a function of the efficiency of the used data format and the per object overhead to read/write the data. First time init effects were factored out since this is an extra metric we will shortly discuss. The numbers below are real throughput numbers for larger message packets, although the ordering which serializer performs best remains practically constant if 300K or 1000 objects are de/serialized. For smaller objects numbers the GC effects dominate the picture which makes it harder to get reliable numbers because you get results which depend on the GC state of the previous test which is not what I wanted to show here. I did sort by deserialization time because usually you are reading data from some storage which is most of the time the more important metric.

All tests were performed on my Intel I7-4770K 3.5GHz on Windows 10 with .NET 4.6.2 x64.

image

What is interesting that the not widely known Jil JSON serializer is able to serialize JSON data by far the fastest which is pretty close to protocol buffers although the serialized data size is 90% bigger. In general the serialization times are always faster than the deserialization times. If this is not the case the serializer is either deeply flawed or you have measured things wrong. From a high level perspective these operations must always happen:

Serialization

  • Existing objects are traversed
  • Data is extracted and converted into the serialization format
  • The conversion process can involve some intermediary allocations
  • Data is written into a stream which is usually written to disk or the network

     

    Deserialization

  • Data is read from the stream
  • Data is split into sub tokens (involves memory allocations)
  • New objects are allocated and filled with tokenized data from input stream

     

    If you serialize a 1 GB in memory object into 200 MB on disk you are appending data to a stream which is flushed out to disk. But if you read 200MB from disk to deserialize you need to allocate 1GB of memory just to hold the objects in memory. While deserializing you are effectively GC bound because most of the time your thread is suspended by the GC which checks if some of your newly allocated objects are no longer needed. It is therefore not unusual to have high GC blocking times during deserialization but much less GC activity while serializing data to disk. That is also the reason why there is little point in making efficient deserialization multi threaded because you are now allocating large amounts of memory from multiple threads just to discover that the GC will block all of your threads for now much more often occurring GCs. But back to the serializer performance.

    Size Does Matter

    To fully understand the performance of an serializer one must also take into account the size of the serialized data. Below is the size of the serialized file for 100K Book objects shown:

    image

    The object definition was one Bookshelf to which N Books were added.

        [Serializable, DataContract, ProtoContract]
        public class BookShelf
        {
            [DataMember, ProtoMember(1)]
            public List<Book> Books
            {
                get;
                set;
            }
    
            [DataMember, ProtoMember(2)]
            private string Secret;
    
            public BookShelf(string secret)
            {
                Secret = secret;
            }
    
            public BookShelf()
            { }
        }
    
        [Serializable,DataContract,ProtoContract]
        public class Book
        {
            [DataMember, ProtoMember(1)]
            public string Title;
    
            [DataMember, ProtoMember(2)]
            public int Id;
        }

    Jil and JSON.NET are nearly equal but FastJSON is ca. 35% larger than the two others. Lets check out the serialized data for a Bookshelf with one Book inside it:

    Json.NET


    {“Secret”:”private member value”,”Books”:[{“Title”:”Book 1″,”Id”:1}]}


    Jil


    {“Books”:[{“Id”:1,”Title”:”Book 1″}]}


    FastJSON


    {“$types”:{“SerializerTests.TypesToSerialize.BookShelf, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null”:”1″,”SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null”:”2″},”$type”:”1″,”Books”:[{“$type”:”2″,”Title”:”Book 1″,”Id”:1}]}


    By comparing the output we find that Jil omits private members which is the reason why the Secret field value of BookShelf is missing. Besides that Jil and Json.NET have equal output. FastJSON emits for every title string an additional $type node which explains the bigger JSON output. It is educational to look at the other serialized data as well:

  • XmlSerializer


    <?xml version=”1.0″?>
    <BookShelf xmlns:xsd=”
    http://www.w3.org/2001/XMLSchema” xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance”>
      <Books>
        <Book>
          <Title>Book 1</Title>
          <Id>1</Id>
        </Book>
      </Books>
    </BookShelf>


    Data Contract Indented


    <?xml version=”1.0″ encoding=”utf-8″?>
    <BookShelf xmlns:i=”
    http://www.w3.org/2001/XMLSchema-instance” xmlns=”http://schemas.datacontract.org/2004/07/SerializerTests.TypesToSerialize”>
      <Books>
        <Book>
          <Id>1</Id>
          <Title>Book 1</Title>
        </Book>
      </Books>
      <Secret>private member value</Secret>
    </BookShelf>


    DataContract


    <BookShelf xmlns=”http://schemas.datacontract.org/2004/07/SerializerTests.TypesToSerialize” xmlns:i=”http://www.w3.org/2001/XMLSchema-instance”><Books><Book><Id>1</Id><Title>Book 1</Title></Book></Books><Secret>private member value</Secret></BookShelf>


    DataContract XmlBinaryDictionaryWriter


    @    BookShelfHhttp://schemas.datacontract.org/2004/07/SerializerTests.TypesToSerialize    i)http://www.w3.org/2001/XMLSchema-instance@Books@Book@Idƒ@Title™Book 1@Secret™private member value


    BinaryFormatter


       ÿÿÿÿ          FSerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null   *SerializerTests.TypesToSerialize.BookShelf   <Books>k__BackingFieldSecret’System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]]             private member value   ’System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]]   _items_size_version  ‘SerializerTests.TypesToSerialize.Book[]                          %SerializerTests.TypesToSerialize.Book         
       %SerializerTests.TypesToSerialize.Book   TitleId       Book 1
      


    Protobuf


    Book 1private member value


    SlimSerializer


      Êþâæ¡
    ÿäSerializerTests.TypesToSerialize.BookShelf, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
    ÿ¾System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]], mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089ÿ(private member value ÿæSerializerTests.TypesToSerialize.Book[], SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null|0~3 ÿ $6|0~3ÿÚSerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null ÿ Book 1


    MsgPack


    ‚¥Books„¦_items”‚¥Title¦Book 1¢IdÀÀÀ¥_size¨_version©_syncRootÀ¦Secret´private member value

    Wire


    ÿ;   SerializerTests.TypesToSerialize.BookShelf, SerializerTestsÿl   System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests]], mscorlib,%core%   ÿ6   SerializerTests.TypesToSerialize.Book, SerializerTests   Book 1private member value

     

    From that we see that Protocol Buffers and Wire have the smallest serialized size. For our serialized Bookshelf Wire looks less efficient but that is only the header which is so big. The serialized data is as small as the one of protocol buffers. Now you can understand the performance differences of DataContractSerializer depending on the used output format (see Data Contract Indented, DataContract and DataContract XmlBinaryDictionaryWriter). It also depends on the serialized data size if you become a factor 2 faster or not. Small is beautiful.  Protocol buffers deliver that with impressive numbers. SlimSerializer is pretty close to protocol buffers and can serialize pretty much anything with no extra attributes. Although it seems not to be able to serialize delegates. You should check if this one could work for you. Since it is not so widely used and lacks versioning support you should do thorough testing before choosing another serializer. Performance is one important aspect but correctness always beats performance.

    Serializer Init Times

    So which serializer is the fastest? As usual it depends on your requirements. Are you trying to load a configuration file faster during your application startup? In that case not only the throughput performance but first time init effects may matter more than the actual serializer performance. Lets try things out by serializing one object only and then quit.

    image

    Suddenly one of the fastest serializers is by a large margin the slowest one. To fully understand the serialization performance you need to take into account the serializer startup costs and the achieved throughput. If application startup is your main concern and you are loading only a few settings from a file you should think twice if Jil is really the right serializer for you. Unless if you change the environment a bit and you get a completely different chart:image

    This time Jil has become 240ms faster with no changes except that the test executable was NGenned with

    %windir%\Microsoft.NET\Framework64\v4.0.30319\ngen.exe install SerializerTests.exe

    That did precompile the executable and all referenced assemblies including Jil and Sigil which seem to have a lot of code running during the serializer initialization. If you are running on .NET Core you will find that the startup costs are much higher because nearly no dll is precompiled with crossgen.exe which is the .NET Core NGen pendant. Serializer startup costs are therefore dominated by JIT costs which can be minimized by precompiling your assembly which is pretty important if you do not only have great throughput but also good startup times. If you are deploying a not precompiled application you need to be aware of the greatly different startup costs. Taking only the serializer with the biggest throughput may be not the best idea.

    XmlSerializer Startup Time

    In the regular .NET Framework there is a special catch with XmlSerializer. If you instantiate XmlSerializer it will cost for the first type ca. 70ms but later invocations cost only ca. 14ms. Why is the first invocation of XmlSerializer so costly? As it turns out XmlSerializer creates a serialization assembly on the fly if it finds not a pregenerated one with the sgen tool (part of the .NET Framework SDK). In order to load it XmlSerializer will try to do an Assembly.Load(“YourAssembly.XmlSerializers, PublicKeyToken=xxxx, Version=…..“) which will fail with a FileNotFoundException if no pregenerated assembly exists. This assembly load attempt will trigger a GAC lookup which will call a method CheckMSIInstallAvailable:

    System.Xml.ni.dll!System.Xml.Serialization.XmlSerializer..ctor(System.Type, System.String)$##6001E85
    System.Xml.ni.dll!System.Xml.Serialization.TempAssembly.LoadGeneratedAssembly(System.Type, System.String, System.Xml.Serialization.XmlSerializerImplementation ByRef)$##60016C8
    mscorlib.ni.dll!System.Reflection.Assembly.Load(System.Reflection.AssemblyName)$##60040EC
    mscorlib.ni.dll!System.Reflection.RuntimeAssembly.InternalLoadAssemblyName(System.Reflection.AssemblyName, System.Security.Policy.Evidence, System.Reflection.RuntimeAssembly, System.Threading.StackCrawlMark ByRef, IntPtr, Boolean, Boolean, Boolean)$##600415D
    clr.dll!AssemblyNative::Load
    clr.dll!AssemblySpec::LoadAssembly
    clr.dll!AssemblySpec::LoadDomainAssembly
    clr.dll!AppDomain::BindAssemblySpec
    clr.dll!AssemblySpec::FindAssemblyFile
    clr.dll!AssemblySpec::LoadAssembly
    clr.dll!FusionBind::RemoteLoad
    clr.dll!CAssemblyName::BindToObject
    clr.dll!CAssemblyDownload::KickOffDownload
    clr.dll!CAssemblyDownload::DownloadComplete
    clr.dll!CAssemblyDownload::DownloadNextCodebase
    clr.dll!CAsmDownloadMgr::ProbeFailed
    clr.dll!CheckMSIInstallAvailable
    clr.dll!MSIProvideAssemblyPeek
    clr.dll!MSIProvideAssemblyGlobalPeek
    clr.dll!MSIProvideAssemblyPeekEnum

    That code is not part of .NET Core and also not of the SSCLI implementation. But with profiling it is not too hard to figure out what is really happening. By using ETW with Registry tracing we can see that the first failed assembly load failure is the most costly one:

    image

    Internally CheckMSIInstallAvailable will read all registry values below  HKEY_LOCAL_MACHINE\SOFTWARE\Classes\Installer\Assemblies\Global which by pure coincidence contains all registered assemblies from the GAC:

    image

    That minor implementation detail causes the noticed 44ms delay because CheckMSIInstallAvailable will first cache the GAC contents from the registry which needs 44ms on first access. It is not correct to attribute the time of the failed assembly load attempt to the startup costs of XmlSerializer because it happens only once for the first assembly load failure. So what is the correct XmlSerializer startup cost? If you have many different XmlSerializer instances during application startup only the first one will pay the high 70ms startup costs. All subsequent instantiation’s will cost  around 15ms per type which is much cheaper than one would expect by a single measurement. By pregenerating the code with sgen one can reduce the startup costs even further to ca. 1ms per type but the first assembly load will still cost around 19ms even when it is successful.

    image

    Before .NET 4.5 XmlSerializer did also spawn csc.exe to compile the code on the fly which is luckily no longer the case. During these “old” days XmlSerializer was costing up to 200ms startup costs per type. The usage of sgen was therefore absolutely necessary but in todays fast moving world old performance truth no longer hold true. Startup costs are non trivial to measure so beware.

    Multi Targeting .NET Executables and Precompiling .NET Core Assemblies

    Precompiling binaries with .NET Core is not very straightforward yet and I think things will change quite a bit in the future. But there is the approach that I have found to work. You can create an executable which targets desktop .NET and .NET Core inside one .csproj file with VS 2017 and later. A typical .csproj which targets .NET 4.5.2 and .NET Core 1.1 contains by semicolon separated the <TargetFrameworks> and for .NET Core the <RuntimeIdentifiers> which are downloaded when the nuget packages are restored which contains the platform dependent .NET Core dlls. When you compile this binary it is compiled two times. Once as regular .NET Desktop exe and another time as .NET Core dll which can be executed in the bin folder with

    dotnet xxxx.dlll

     

    <Project Sdk="Microsoft.NET.Sdk">
      <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFrameworks>netcoreapp1.1;net452</TargetFrameworks>
        <RuntimeIdentifiers>win7-x64</RuntimeIdentifiers>
      </PropertyGroup>
      <ItemGroup>
        <PackageReference Include="Jil" version="2.15.0" />
        <PackageReference Include="protobuf-net" Version="2.1.0" />
        <PackageReference Include="Sigil" version="[4.7.0.0]" />
        <PackageReference Include="System.Xml.XmlSerializer" version="*" />
        <PackageReference Include="System.Runtime.Serialization.Xml" version="*" />
      </ItemGroup>
    </Project>
    

    The binaries are put into a target framework dependent folder

    image

    image

    If you download .NET Core it will only contain one precompiled binary. To precompile everything you need to take the dlls of

    C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.1.1

    to your application binary folder and then call

    %USERPROFILE%\.nuget\packages\runtime.win7-x64.microsoft.netcore.runtime.coreclr\1.1.1\tools\crossgen.exe /JITPath “C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.1.1\clrjit.dll” /Platform_Assemblies_Paths “C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.1.1“;%USERPROFILE%\.nuget\packages\System.Xml.XmlSerializer\4.3.0\lib\netstandard1.3

    where you need to append the path of the referenced Nuget packages to make everything work. If things are not working correctly you can enable “Fusion” logging by setting the environment variable

    COREHOST_TRACE=1

    That and Windbg of course will give you more hints. Precompiling things in .NET Core is still a lot of try and error until everything works and I am not sure if this is the currently recommended way.

    Conclusions

    Measuring and understanding the performance aspects of serializers is quite complex. For some reason the measured numbers by the library authors of public serializers seem to prove that their serializer is the fastest one. Since I have no affiliations with any of the library maintainers the presented tests should be the most neutral one while I was trying hard to make no obvious errors in my testing methodology. If you want to migrate from an existing type hierarchy Protocol buffers and SlimSerializer look like a fast replacement to BinaryFormatter. Jil is great if you serialize the public API surface of your data objects and you do not need to serialize private fields or properties. Despite its claims FastJSON turned out in no metric to be leading in these tests. If I have made an error there please drop me a note and I will correct the data. BinaryFormatter is has a hideous O(n^2) deserialize time complexity which no one seems to have written about in public yet. At least with .NET Core things will become better. If you are deserializing larger object graphs you know now why the deserialization time takes up to 40 minutes. Before trying out a new fastest serializer be sure to measure by yourself and do not choose serializers which have fast in their name. There is a Fastest.Json serializer on Nuget which crashes the .NET Execution engine during serialization and the library author did never bother to implement the deserialize part. That’s all for today.

  • When Known .NET Bugs Bite You

    The most interesting type of bugs do not occur during regression testing but when real people use the software. Things become nasty if it happens only after several hours of usage when sometimes the whole application freezes for minutes. If you have a bigger machine with e.g. 40 cores and most of them are busy and you have dozens of processes, things can become challenging to track down, when something goes wrong . Attaching a debugger does not help since it not clear which process is causing the problem and even if you could you would cause issues for already working users on that machine. Using plain ETW tracing is not so easy because the amount of generated data can exceed 10 GB/min. The default Windows Performance Recorder profiles are not an option because they will simply generate too much data. If you have ever tried to open 6+ GB ETL file in WPA it can take 30 minutes until it opens. If you drag and drop around some columns you have to wait some minutes until the UI settles again.

    Getting ETW Data The Right Way

    Taking data with xperf offers more fined grained control at the command line, but it has also its challenges. If you have e.g. a 2 GB user and a 2 GB kernel session which trace into a circular in memory buffer you can dump the contents with

    xperf -flush "NT Kernel Logger" -f kernel.etl 
    
    xperf -flush "UserSession" -f user.etl
    
    xperf -merge user.etl kernel.etl merged.etl 
    

    The result looks strange:

    image

    For some reason flushing the kernel session takes a very long time (up to 15 minutes) on a busy server. In effect you will have e.g. the first 20s of kernel data, a 6 minutes gap and then the user mode session data which is long after the problem did occur. To work around that you can either flush the user session first or you use the xperf command to flush both sessions at the same time:

    xperf -flush "UserSession" -f user.etl -flush "NT Kernel Logger" -f kernel.etl

    If you want to profile a machine where you do not know where the issues are coming from you should get at first some “long” term tracing. If you do not record context switch events but you basically resort back to profiling events then you can already get quite far. The default sampling rate is one kHz (a thousand samples/second)  which is good for many cases but too much for long sessions. When you expect much CPU activity then you can try to take only every 10ms a sample which will get you much farther. To limit the amount of traced data you have to be very careful from which events you want to get the call stacks. Usually the stack traces are 10 times larger than the actual event data. If you want to create a dedicated trace configuration to spot a very specific issue you need to know which ETW providers cost you how much. Especially the size of the ETL file is relevant. WPA has a very nice overview under System Configuration – Trace Statistics which shows you exactly that 

    image

    You can paste the data also into Excel to get a more detailed understanding what is costing you how much.

    image

    In essence you should enable stack walking only for the events you really care about if you want to keep the profiling file size small. Stack walks cost you a lot of storage (up to 10 times more than the actual event payload). A great thing is that you can configure ETW to omit stack walking for a specific events from an ETW provider with tracelog. This works since Windows 8.1 and is sometimes necessary.

    To enable e.g. stack walking for only the CLR Exception event you can use this tracelog call for your user mode session named DotNetSession. The number e13c0d23-ccbc-4e12-931b-d9cc2eee27e4  is the guid for the provider .NET Common Language Runtime because tracelog does not accept pretty ETW provider names. Only guids prefixed with a # make it happy.

    tracelog -enableex DotNetSession -guid #e13c0d23-ccbc-4e12-931b-d9cc2eee27e4 -flags 0x8094 -stackwalkFilter -in 1 80 

    If you want to disable the highest volume events of Microsoft-Windows-Dwm-Core you can configure a list of events which should not traced at all:

    tracelog -enableex DWMSession -guid #9e9bba3c-2e38-40cb-99f4-9e8281425164 -flags 0x1 -level 4 -EventIdFilter -out 23 2 9 6 244 8 204 7 14 13 110 19 137 21 111 57 58 66 67 68 112 113 192 193

    That line will keep the DWM events which are relevant to the WPA DWM Frame Details graph. Tracelog is part of the Windows SDK but originally it was only part of the Driver SDK. That explains at least partially why it is the only tool that allows to declare ETW event and stackwalk filters. If you filter something in or out you need to pass as first number the number of events you will pass to the filter.

    tracelog … -stackwalkFilter -in 1 80

    means enable stackwalks for one event with event id 80

    tracelog … -EventIdFilter -out 23 2 9 …

    will filter out from all enabled events 23 events where the event ids are directly following after the number. You can execute such tracelog statements also on operating systems < Windows 8.1. In that case the event id filter clauses are simply ignored. In my opinion xperf should support that out of the box and I should not have to use another tool to fully configure lightweight ETW trace sessions.

    Analyzing ETW Data

    After fiddling with the ETW providers and setting up filters for the most common providers I was able to get a snapshot at the time when the server was slow. The UI was not reacting for minutes and I found this interesting call stack consuming one core for nearly three minutes!

    image

    When I drill deeper I did see that Gen1 collections were consuming enormous amounts of CPU in the method WKS::allocate_in_older_generation.

    image

    When you have enabled the .NET ETW provider with the keyword 0x1 which tracks GC events and you use my GC Activity Region file you can easily see that some Gen1 collections are taking up to 7s. If such a condition occurs in several different processes at the same time it is no wonder why the server looks largely unresponsive although all other metrics like CPU, Memory, Disc IO and the network  are ok.

    Gen1GCTimes

    When I find an interesting call stack it is time to look what the internet can tell me about allocate_in_older_generation. This directly points me to Stackoverflow where someone else had this issue already http://stackoverflow.com/questions/32048094/net-4-6-and-gc-freeze which brings me to https://blogs.msdn.microsoft.com/maoni/2015/08/12/gen2-free-list-changes-in-clr-4-6-gc/ that mentions a .NET 4.6 GC bug. There the .NET GC developer nicely explains what is going on:

    … The symptom for this bug is that you are seeing GC taking a lot longer than 4.5 – if you collect ETW events this will show that gen1 GCs are taking a lot longer. …

    Based on the data above and that the server was running .NET 4.6.96.0 which was a .NET 4.6 version this explains why sometimes a funny allocation pattern in processes that had already allocated > 2GB of memory the Gen1 GC pause times would sometimes jump from several ms up to seconds. Getting  1000 times slower at random time certainly qualifies as a lot longer. It would even be justified to use more drastic words how much longer this sometimes takes. But ok the bug is known and fixed. If you install .NET 4.6.1 or the upcoming .NET 4.6.2 version you will never see this issue again. It was quite tough to get the right data at the right time but I am getting better at these things while looking at the complete system at once. I am not sure if you could really have found this issue with normal tools if the issue manifests in random processes at random times. If you only concentrate at one part of the system at one time you will most likely fail to identify such systemic issues. When you update your server after some month by doing regular maintenance work the issue would have vanished leaving you back with no real explanation why the long standing issue has suddenly disappeared. Or you would have identified at least a .NET Framework update in retrospect as the fix you were searching for so long.

    The Non Contracting Code Contracts

    What do you do when you get an impossible exception of the form

    System.InvalidCastException: Object of type 'MyApp.Type[]' cannot be converted 
                    to type 'System.Collections.Generic.IEnumerable`1[MyApp.Type]'.
    System.RuntimeType.TryChangeType(…)
    System.RuntimeType.CheckValue(…)
    System.Reflection.MethodBase.CheckArguments(…)
    System.Reflection.RuntimeMethodInfo.InvokeArgumentsCheck(…)
    ...

    where basically an array of type X cannot be casted to IEnumerable<X>.  Any LINQ affine programmer will do that 20 times a day. What the heck is going on here? It looks like the CLR type system has got corrupted by some bad  guy. It could be anything from a spurious memory corruption due to cosmic rays or some code on this or another thread did corrupt memory somehow. Random bit flips are not so uncommon if you know that cosmic muons hit the earth surface at a rate of ca. 60 000 muons/(h*m^2). It is only a matter of time until some bits in your memory gets randomly another value.

    This issue surfaced on some machines quite reproducible so we can put back exotic theories about possible sources of memory corruption back into the bookshelf. First of all I needed to get a memory dump of the process in question. This is much easier if you know how you can “attach” procdump to one process and let it create a dump on any first chance exception with a specific type or message. That comes in handy if you want to create a dump not when a process crashes but when an internally catched exception occurs and you want to inspect the application state at this point in time.

    The Sysinternals tool Procdump can do it from the command line without any extra installation. The -e 1 will instruct it to dump on first chance exceptions and -f will do a substring filter of the shown message in the console by procdump.

    C:\>procdump  -ma -e 1 -f "InvalidCast" Contractor.exe
    
    ProcDump v7.1 - Writes process dump files
    Copyright (C) 2009-2014 Mark Russinovich
    Sysinternals - www.sysinternals.com
    With contributions from Andrew Richards
    
    Process:               Contractor.exe (12656)
    CPU threshold:         n/a
    Performance counter:   n/a
    Commit threshold:      n/a
    Threshold seconds:     10
    Hung window check:     Disabled
    Log debug strings:     Disabled
    Exception monitor:     First Chance+Unhandled
    Exception filter:      *InvalidCast*
    Terminate monitor:     Disabled
    Cloning type:          Disabled
    Concurrent limit:      n/a
    Avoid outage:          n/a
    Number of dumps:       1
    Dump folder:           C:\
    Dump filename/mask:    PROCESSNAME_YYMMDD_HHMMSS
    
    
    Press Ctrl-C to end monitoring without terminating the process.
    
    CLR Version: v4.0.30319
    
    [19:48:12] Exception: E0434F4D.System.InvalidCastException ("Object of type 'MyApp.Type[]' cannot be converted to type 'System.Collections.Generic.IEnumerable`1[MyApp.Type]'.")
    [19:48:12] Dump 1 initiated: C:\Contractor.exe_160718_194812.dmp
    [19:48:12] Dump 1 writing: Estimated dump file size is 50 MB.
    [19:48:12] Dump 1 complete: 50 MB written in 0.1 seconds
    [19:48:13] The process has exited.
    [19:48:13] Dump count reached.
    If you have a more complex process creation setup, e.g. you create many instances of the same exe then you cannot use the simple executable name filter. Procdump accepts then only the process id of an already running process which you now need to determine from somewhere else. DebugDiag would be my tool of choice which allows you to configure rules for any process executable regardless if they are already running or will be started later. There you can also configure a dump trigger when a specific exception with a specific message.
    To do this start the DebugDiag 2  Collection tool and select Crash
    image
     
    After selecting a process you can configure e.g. a full dump for each InvalidCastException
     
    image
     
    When you have activated the Crash rule you will get a nice report how many dumps already have been collected since the rule is active.
    image
     
    Then you will find in the configured directory the dump files along with the DebugDiag logs.
    image

    Once I had the dumps I did look into it but I could not make much sense out of the TypeCastException. Without the exact CLR source code it is very hard to make sense of the highly optimized data structures in the CLR. At that point I did reach out to Microsoft to look into the issue. Since the dump did also not provide enough clues I did pack a virtual machine together and transfer it directly to MS support via their upload tool. The support engineer was not able to get the VM running which seems to happen quite frequently if you upload 50+ GB files over the internet. The solution was to send me via mail an USB hard drive, upload the VM onto the disk and send it back to MS. that worked out and I heard nothing from MS support for some time.

    It seems that my issue was routed through various hands until it reached one of CoreCLR devs which have debugged certainly many similar issues. When I was finally suspecting to never hear about that issue again I got a mail back that they have found the issue.

    The problem is that the CLR type system got confused about invalid assembly references which have invalid Flags set. The assembly reference flag set in the IL code must be set to zero but in my case it was set to 0x70 which is normally a C# compiler internal tracking code to mark reference assemblies. In the final code emitting phase the C# compiler will zero out the assembly reference flags and all is fine. But for some reason one of my assemblies still had set 0x70 for all reference assemblies.

    If you wonder what reference assemblies are: These are the assemblies the .NET Framework consist of in various versions. If you link against a specific .NET Framework version like 4.6.2, you “link” against the assemblies located in these directories to ensure that can call only APIs for the currently configured .NET target framework.

    image

    I was told that with the .NET 4.5.2 C# compiler an issue was reported which could trigger that behavior. If the newer Roslyn based C# 6 compiler suffers still from this issue is not known to me. That sounded like an arcane and spurious compiler bug. For some reason I was asked a few days later by MS if we modify the generated assemblies. It turns out some targets use Code Contracts to enforce run time checks which need assembly rewriting. After doing a query for all assembly references I found that ALL Code contracts rewritten assemblies have invalid Assembly reference flags with 0x70 set. This is still true with newest Code Contracts version which is currently 1.9.10714.2.

    If you try to check the Assembly references via Reflection you will find that it will always tell you that the assembly references fields in the references assemblies are zero. This is even true if you load them with Assembly.LoadForReflectionOnlyFrom. Even Ildasm will not show you the fields of the Assembly references. Windbg? Nope.

    I had to resort back to my old pet project ApiChange and patch the –ShowReferences command to print besides the referenced assembly also the Flags field of the Assembly Reference.

    
    C:\>apichange -sr D:\bin\SubContractor.dll
    SubContractor.dll ->
     System, Version=4.0.0.0,   Culture=neutral,
                                PublicKeyToken=b77a5c561934e089 0x70
     mscorlib, Version=4.0.0.0, Culture=neutral, 
                                PublicKeyToken=b77a5c561934e089 0x70
    
    

    There I could finally see that my assembly had invalid assembly references. How can I fix it? That is easy. Turn off Perform Runtime Contract Checking!

    image

    Now you get correct assembly references as the ECMA spec requires them.

    
    C:\>apichange -sr D:\bin\SubContractor.dll
    SubContractor.dll ->
     mscorlib, Version=4.0.0.0, Culture=neutral, 
                PublicKeyToken=b77a5c561934e089 0x0
     System, Version=4.0.0.0,   Culture=neutral, 
                PublicKeyToken=b77a5c561934e089 0x0

    The source of the spurious InvalidCastExceptions is found but why does it happen so infrequently? I have tried to create a simple reproducer but I was not able to force the cast error. It seems to have something to do with the machine, memory layout, number of loaded types, IL layout, moon phase, …. to trigger it. If something goes wrong it seems that in one assembly with correct assembly references to the Reference assemblies a cast is tried from

    mscorlib         ;  IEnumerable<T>  to 
    mscorlib,Ref=0x70;  IEnumerable<T>  
    

    where the type identity with generic types and their arguments gets mixed up in the process which then correctly tells me that seemingly identical types are not identical because they differ in their full type expansion by their assembly reference.

    I was never a big fan of Code Contracts because of the way how they were implemented. Rewriting existing assemblies can and will cause unforeseen issues in production code. If I can I want to keep my C# and JIT compiler errors and not add Code Contract bugs on top of it. Do not get me wrong. I really like the idea of contract checking. But such checks should be visible in the source code and or at least compiled directly by a Code Contracts plugin by the now much more open Roslyn Compiler.

    As it is now Code Contracts considerably increase compilation time and if you enable static code checks you easily can get builds which take several hours which is a way too hefty price for some additional checks.

    If you now want to remove Code Contracts you need to remove from your source code all Contract.Requires<Exn>(…) checks because these will assert in non rewritten Release builds as well which is certainly not what you want. As it stands now Code Contracts violate the contract of generating valid assembly references. The assembly rewriter of Code Contracts  generates assemblies with invalid IL code which can trigger under specific circumstances impossible runtime errors in the CLR.