Be Careful Where You Put GC.SuppressFinalize

I had an interesting issue to debug which resulted in a race condition where the finalizer was being called while the object was still in use. If you know how .NET works this should ring some alarm bells since this should never happen. The finalizer is expected to run only when no one has a reference to the finalizable object anymore.

A simple reproducer is below. It creates 50K finalizable objects. Each object allocates 500 bytes of unmanaged memory which is released either by calling dispose on it on a dedicated thread or the finalizer thread will kill the rest during application shutdown.

using System;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        // create 50K events
        var events = Enumerable.Range(1, 50 * 1000)
                                .Select(x => new Event())
                                .ToList();

        ManualResetEvent startEvent = new ManualResetEvent(false);

        Task.Factory.StartNew(() =>
        {
            startEvent.WaitOne();  // wait for event
            foreach (var ev in events) // dispose events
            {
                ev.Dispose();
            }
        });

        startEvent.Set(); // start disposing events
        Thread.Sleep(1);  // wait a bit and then exit
    }
}

public class Event : IDisposable
{
    internal IntPtr hGlobal;  // allocate some unmanaged memory

    public Event()
    {
        hGlobal = Marshal.AllocHGlobal(500);
    }

    ~Event()  // finalizer 
    {
        Dispose();
    }

    public void Dispose()
    {
        if( hGlobal !=  IntPtr.Zero) // check if memory is gone
        {
            Marshal.FreeHGlobal(hGlobal); // free it
            GC.SuppressFinalize(this); // Prevent finalizer from running it again
            hGlobal = IntPtr.Zero;
        }
    }
}

Looks good to you? Let it run:

image

Ups that should not happen. When trying to run the application under the VS debugger everything works on my machine™. No matter how hard I try it will never crash under the debugger. But if I start it without debugging it will crash every time.

Debug The Problem

When the application crashes without the debugger on a machine where VS is installed you will get a nice dialog

image

where you can click Debug. Then I choose to debug managed and unmanaged debugging

image

Because part of the issue has to do with the .NET Runtime we need managed and unmanaged debugging. It is therefore wise to enable Native and Managed debugging.

image

If you do not manually select both debugging engine VS will default to unmanaged debugging only where we will miss our managed stack frames which is not particularly helpful:

image

With the correct debugging engine we find that while the finalizer was called a heap corruption was reported:

image

While another thread is also disposing events

image

So what is the problem here? Could it be that the finalizer is disposing the same instance on which our TPL thread is still working? A concurrent double free sounds likely but by only using Visual Studio we cannot prove it. If a finalizer is called while we the object is still alive we would have found have a pretty serious GC bug. On the other hand if that would be the case many people would have complained.

Gather More Evidence

To analyze the crash with other tools it is good to save a memory dump from the crashing aplication. You can do this pretty easily with

D:\Source\FinalizerFun\bin\Release>procdump -ma -e -x . FinalizerFunNetFull.exe

ProcDump v9.0 – Sysinternals process dump utility
Copyright (C) 2009-2017 Mark Russinovich and Andrew Richards
Sysinternals – www.sysinternals.com

[21:13:50] Exception: 04242420
[21:13:52] Exception: 80000003.BREAKPOINT
[21:13:52] Exception: C0000374
[21:13:52] Unhandled: C0000374
[21:13:52] Dump 1 initiated: .\FinalizerFunNetFull.exe_180204_211352.dmp
[21:13:52] Dump 1 writing: Estimated dump file size is 83 MB.
[21:13:52] Dump 1 complete: 83 MB written in 0.1 seconds
[21:13:52] Dump count reached.

procdump is a command line tool to take memory dumps in many ways. This time we take a full memory dump -ma for unhandled exceptions -e where we execute a process -x and put the dump to the current directory . followed by the executable and optional command line arguments for the executable. The most difficult part is that I always forget that the first parameter after -x is not the executable and its arguments but the dump folder. If you try to capture a dump on first chance exceptions before it becomes unhandled you normally use -e 1 but for reasons not known to me this did never trigger the creation of a dump file. If all fails you can still take a memory dump while the “… has stopped working” dialog is shown with procdump for a given pid like “procdump -ma pid”.

You can open the memory dump with Visual Studio without problem by dragging and dropping the .dmp file from the explorer into VS

image

Press Debug with Mixed to see managed and unmanaged code. Many people shy away from memory dumps. But if you dare to open them the debugging experience is the same as it would be for a live process which is stuck at a breakpoint. The only difference is that you cannot continue execution. VS will show your source code and the crashing thread just like it would happen during a live debugging session:

image

VS has great memory dump support (since ca. VS2012/2013 around).  If you have an automated build system it is possible to get full source code debugging  for your released application. The feature is called Source Server support. For TFS builds it is a simple config switch of your build. With git things are more involved https://shonnlyga.wordpress.com/2016/05/28/source-server-with-git-repository/. If you have Source Indexed builds you definitely want to enable Source Server support for the debugger to get live and memory dump debugging without the need to download the source files.  In Debug –  Options

image

check all items below Enable source server support. Unfortunately VS 2017 has broken Source Server Support which is tracked here: https://developercommunity.visualstudio.com/content/problem/169309/debugger-cant-create-folder-structure-when-trying.html

It is working on VS 2013, 2015 or 2017 15.6 (still beta). As workaround you can copy srcsrv.dll from an earlier VS edition to the VS2017 one to get source server support back again.

No Not Windbg!

We have reached a dead end with Visual Studio. It is time to admit that the nice GUI based tools although powerful are not always the most helpful ones when you want to completely understand an issue. First we need to download Windbg for which MS has put up a page https://developer.microsoft.com/en-us/windows/hardware/download-windbg. This will point you to the Windows SDK page

image

from where you can download the Windows SDK installer. If the installer wont start you have a newer version of the Win 10 SDK already installed. In that case you can download the latest SDK installer from  https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk.

When you start the installer you need to press next a few times to get to the list of features you want to install. Check Debugging Tools for Windows and press Install.

image

Now you will find the 32 bit version of Windbg in

“C:\Program Files (x86)\Windows Kits\10\Debuggers\x86\windbg.exe”

and the 64 bit version at

“C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\windbg.exe”

Start the correct Windbg version (x86,x64) and load the crash dump file.

image

Since it is a managed process we need to load to a managed debugging extensions named sos.dll. This is done with the Windbg command

.loadby sos clr

If you get a failure of the form

0:005> .loadby sos clr
The call to LoadLibrary(C:\Windows\Microsoft.NET\Framework\v4.0.30319\sos) failed, Win32 error 0n193
    “%1 is not a valid Win32 application.
Please check your debugger configuration and/or network access.

then you did load the wrong Windbg because you did open a 32 bit dump with the 64 bit version of Windbg. Things get easier with the upcoming new Windbg which is in Beta where is only one debugger which will load the right debugger.
The cryptic command tells the debugger to load the sos.dll from the same directory where the .NET Runtime dll clr.dll is located. If you wish you can also fully qualify the name like

.load C:\Windows\Microsoft.NET\Framework\v4.0.30319\sos.dll

The 64 bit framework dll is located at

.load C:\Windows\Microsoft.NET\Framework64\v4.0.30319\sos.dll

If you dare to use ” for the path then you need to adhere to the C style escape rules where you need to use \\ to get a \. If you analyze a memory dump on another machine with a different .NET Framework version installed some of the SOS commands might not work or sos.dll refuses to be loaded. In that case you can check out my OneDrive folder https://1drv.ms/f/s!AhcFq7XO98yJgoMwuPd7LNioVKAp_A which contains a pretty up to date list of nearly all .NET Framework sos dlls. You need to extend the symbol path to the downloaded sos dlls (.sympath+ c:\mscordackwksDownloadDir) and then load it via the full path. Things become easier in the future if Windbg automatically loads the right sos.dll from the symbol server which seems now to be in place.

We have a managed debugging extension loaded. Now what? First we test if the extension works by executing the !Threads command

0:000> !Threads
c0000005 Exception in C:\Windows\Microsoft.NET\Framework\v4.0.30319\sos.Threads debugger extension.
      PC: 0b13b8e3  VA: 00000000  R/W: 0  Parameter: ed04c8b4
0:000> !Threads
ThreadCount:      4
UnstartedThread:  0
BackgroundThread: 4
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                                                         Lock  
       ID OSID ThreadOBJ    State GC Mode     GC Alloc Context  Domain   Count Apt Exception
   0    1 2594 02959160   2022220 Preemptive  046E6CF8:00000000 02952d00 0     MTA 
   5    2 1a08 02966f30     2b220 Preemptive  046F3CDC:00000000 02952d00 0     MTA (Finalizer) System.BadImageFormatException 046ebff4
   9    3 4300 06d3f690   3021220 Preemptive  046E829C:00000000 02952d00 0     MTA (Threadpool Worker) 
  11    4 2cec 06d41e78   1029220 Preemptive  046EA1E4:00000000 02952d00 0     MTA (Threadpool Worker) 

For some reason the first time I execute the command I get an exception. But it works the second time. This is happening to me since years on many different machines. I have no idea what the bug is but it should be fixed someday. We know that we have 4 threads and one thread did throw a BadImageFormatException. Lets examine that thread. The first column are the Windbg thread numbers given by Windbg to switch easier between threads. The command to switch to a specific thread 5 where our exception lives is

~5s

Then we can execute the sos command to dump the managed thread stack with

0:005> !ClrStack
OS Thread Id: 0x1a08 (5)
Child SP       IP Call Site
0676f888 7748ed3c [HelperMethodFrame: 0676f888] System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal(Int32, IntPtr)
0676f8fc 70d0065e System.Runtime.InteropServices.Marshal.FreeHGlobal(IntPtr) [f:\dd\ndp\clr\src\BCL\system\runtime\interopservices\marshal.cs @ 1211]
0676f908 0291116a Event.Dispose() [D:\Source\vc17\FinalizerFuncNetFull\FinalizerFuncNetFull\Program.cs @ 51]
0676f914 029111a9 Event.Finalize() [D:\Source\vc17\FinalizerFuncNetFull\FinalizerFuncNetFull\Program.cs @ 43]
0676fb10 714e63f2 [DebuggerU2MCatchHandlerFrame: 0676fb10] 

We know that thread number 5 is the finalizer thread and we see that it is indeed calling into Event.Finalize where our exception happens. So far we did not get more information than we could get from the much easier to use Visual Studio debugger. Now lets check on which event object the finalizer was called. For that we can use a heuristic command named !dso which is the short form of Dump Stack Objects.

0:005> !dso
OS Thread Id: 0x1a08 (5)
ESP/REG  Object   Name
0676F5A8 046ebff4 System.BadImageFormatException
0676F698 046ebff4 System.BadImageFormatException
0676F6AC 046ebff4 System.BadImageFormatException
0676F6D0 046ebff4 System.BadImageFormatException
0676F6FC 046ebff4 System.BadImageFormatException
0676F710 046ebff4 System.BadImageFormatException
0676F718 046ebff4 System.BadImageFormatException
0676F71C 046ebff4 System.BadImageFormatException
0676F7BC 046ebff4 System.BadImageFormatException
0676F7FC 046ebff4 System.BadImageFormatException
0676F8FC 046507c0 Event
0676F958 046507c0 Event
0676F98C 046507c0 Event
0676F998 046507c0 Event
0676F9A8 046507c0 Event
0676F9B0 046507c0 Event
0676F9C0 046507c0 Event

The command is rather dumb and dumps the same object reference several times where it was located as pointer on the thread stack. There is actually a much better extension out there for that which is called netext (https://github.com/rodneyviana/netext/tree/master/Binaries). To “install” the extension you can copy it to the Windbg default extension folder which allows you to load the dll with no directory qualifier on my machine to

  • C:\Program Files (x86)\Windows Kits\10\Debuggers\x86\winext
  • C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\winext

Now we can load it

0:000> .load netext
netext version 2.1.19.5000 Feb  6 2018
License and usage can be seen here: !whelp license
Check Latest version: !wupdate
For help, type !whelp (or in WinDBG run: ‘.browse !whelp’)
Questions and Feedback:
https://github.com/rodneyviana/netext/issues
Copyright (c) 2014-2015 Rodney Viana (
http://blogs.msdn.com/b/rodneyviana)
Type: !windex -tree or ~*e!wstack to get started

0:005> !wstack

Listing objects from: 0676b000 to 06770000 from thread: 5 [1a08]

046ebff4 701d13c4   0  0         92 System.BadImageFormatException
046507c0 028b6260   0  0         12 Event

2 unique object(s) found in 104 bytes

to get a much less cluttered output. This extension is pure gold because it allows you to write LINQ style debugger queries to e.g. dump all object instances which derive from a common base class. It has extended support for WCF connections, sockets and APS.NET specific things.

From the dump we know that the event 046507c0 did cause an exception in the unmanaged heap. Was someone else working with this object? Visual Studio is of no help here but we can use the !GCRoot command to find out who references this object from somewhere else:

0:005> !GCRoot 046507c0
Thread 1a08:
    0676f908 0291116a Event.Dispose() [D:\Source\vc17\FinalizerFuncNetFull\FinalizerFuncNetFull\Program.cs @ 51]
        esi: 
            ->  046507c0 Event

Thread 4300:
    08edf790 0291116a Event.Dispose() [D:\Source\vc17\FinalizerFuncNetFull\FinalizerFuncNetFull\Program.cs @ 51]
        esi: 
            ->  046507c0 Event

    08edf79c 02911108 Program+<>c__DisplayClass0_0.<Main>b__1() [D:\Source\vc17\FinalizerFuncNetFull\FinalizerFuncNetFull\Program.cs @ 23]
        ebp+2c: 08edf7ac
            ->  046324b4 System.Collections.Generic.List`1[[Event, FinalizerFuncNetFull]]
            ->  05655530 Event[]
            ->  046507c0 Event

    08edf79c 02911108 Program+<>c__DisplayClass0_0.<Main>b__1() [D:\Source\vc17\FinalizerFuncNetFull\FinalizerFuncNetFull\Program.cs @ 23]
        ebp+28: 08edf7b0
            ->  046507c0 Event

Found 4 unique roots (run '!GCRoot -all' to see all roots).

The finalizer thread 1a08 was expected but what is thread 4300 doing with our object? Lets switch to that thread. We can use either the thread number of the OS thread id with the even more cryptic command

0:005> ~~[4300]s
eax=00000000 ebx=00000001 ecx=00000000 edx=00000000 esi=00000001 edi=00000001
eip=7748ed3c esp=08edf2b8 ebp=08edf448 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
ntdll!NtWaitForMultipleObjects+0xc:
7748ed3c c21400          ret     14h
0:009> !ClrStack
OS Thread Id: 0x4300 (9)
Child SP       IP Call Site
08edf754 7748ed3c [InlinedCallFrame: 08edf754] 
08edf750 7013bb80 DomainNeutralILStubClass.IL_STUB_PInvoke(IntPtr)
08edf754 7010d8b9 [InlinedCallFrame: 08edf754] Microsoft.Win32.Win32Native.LocalFree(IntPtr)
08edf784 7010d8b9 System.Runtime.InteropServices.Marshal.FreeHGlobal(IntPtr) [f:\dd\ndp\clr\src\BCL\system\runtime\interopservices\marshal.cs @ 1212]
08edf790 0291116a Event.Dispose() [D:\Source\vc17\FinalizerFuncNetFull\FinalizerFuncNetFull\Program.cs @ 51]
08edf79c 02911108 Program+c__DisplayClass0_0.b__1() [D:\Source\vc17\FinalizerFuncNetFull\FinalizerFuncNetFull\Program.cs @ 23]
08edf7e0 70097328 System.Threading.Tasks.Task.InnerInvoke() [f:\dd\ndp\clr\src\BCL\system\threading\Tasks\Task.cs @ 2884]
08edf7ec 70096ed0 System.Threading.Tasks.Task.Execute() [f:\dd\ndp\clr\src\BCL\system\threading\Tasks\Task.cs @ 2498]
08edf810 700972fa System.Threading.Tasks.Task.ExecutionContextCallback(System.Object) [f:\dd\ndp\clr\src\BCL\system\threading\Tasks\Task.cs @ 2861]
08edf814 7010bcd5 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 954]
08edf880 7010bbe6 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 902]
08edf894 70097178 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef) [f:\dd\ndp\clr\src\BCL\system\threading\Tasks\Task.cs @ 2827]
08edf8f8 7009704d System.Threading.Tasks.Task.ExecuteEntry(Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\Tasks\Task.cs @ 2767]
08edf908 70096fcc System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem() [f:\dd\ndp\clr\src\BCL\system\threading\Tasks\Task.cs @ 2704]
08edf90c 700e87f2 System.Threading.ThreadPoolWorkQueue.Dispatch() [f:\dd\ndp\clr\src\BCL\system\threading\threadpool.cs @ 820]
08edf95c 700e865a System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() [f:\dd\ndp\clr\src\BCL\system\threading\threadpool.cs @ 1161]
08edfb80 7143eb16 [DebuggerU2MCatchHandlerFrame: 08edfb80] 

Ahh that is our TPL thread which is also freeing the object. The call stack shows that we have caught it in action while it was still calling Marshal.FreeHGlobal the finalizer did finalize it right away! That is pretty serious since that must not happen. To see the full picture we need a mixed mode stack with no hidden stack frames like Visual Studio is showing us. For mixed mode stacks there is another Windbg extension best suited. It is called sosex (http://www.stevestechspot.com/)

0:009> .load sosex
This dump has no SOSEX heap index.
The heap index makes searching for references and roots much faster.
To create a heap index, run !bhi
0:009> !mk
Thread 9:
        SP       IP
00:U 08edf2b8 7748ed3c ntdll!NtWaitForMultipleObjects+0xc
01:U 08edf2bc 753f1293 KERNELBASE!WaitForMultipleObjectsEx+0x103
02:U 08edf450 714dff96 clr!WaitForMultipleObjectsEx_SO_TOLERANT+0x3c
03:U 08edf4a0 714dfcd8 clr!Thread::DoAppropriateWaitWorker+0x237
04:U 08edf52c 714dfdc9 clr!Thread::DoAppropriateWait+0x64
05:U 08edf598 714dff3c clr!CLREventBase::WaitEx+0x128
06:U 08edf5e4 71560152 clr!CLREventBase::Wait+0x1a
07:U 08edf5fc 714fe9dc clr!WaitForEndOfShutdown_OneIteration+0x81
08:U 08edf670 714fea29 clr!WaitForEndOfShutdown+0x1b
09:U 08edf67c 714fcd76 clr!Thread::RareDisablePreemptiveGC+0x52f
0a:U 08edf6c8 714e8374 clr!JIT_RareDisableHelper+0x24
0b:M 08edf74c 7013bb95 DomainNeutralILStubClass.IL_STUB_PInvoke(IntPtr)
0c:M 08edf750 7013bb80 DomainNeutralILStubClass.IL_STUB_PInvoke(IntPtr)
0d:M 08edf784 7010d8b9 System.Runtime.InteropServices.Marshal.FreeHGlobal(IntPtr)(+0xe IL,+0x19 Native) [f:\dd\ndp\clr\src\BCL\system\runtime\interopservices\marshal.cs @ 1212,17]
0e:M 08edf790 0291116a Event.Dispose()(+0x1d IL,+0x12 Native) [D:\Source\vc17\FinalizerFuncNetFull\FinalizerFuncNetFull\Program.cs @ 51,13]
0f:M 08edf79c 02911108 Program+<>c__DisplayClass0_0.<Main>b__1()(+0x21 IL,+0x70 Native) [D:\Source\vc17\FinalizerFuncNetFull\FinalizerFuncNetFull\Program.cs @ 23,17]

There we see that the thread did call into unmanaged code to free the heap memory but the CLR will not let it run managed code anymore because we are shutting down the process. There are some gotchas related to managed application shutdown where I did write a long time (12 years now) ago here: https://www.codeproject.com/Articles/16164/Managed-Application-Shutdown. Most things are still the same. The key takeaway is that when managed application shutdown is initiated the .NET Runtime ensures that

  • managed code calling into unmanaged code never returns (clr!WaitForEndOfShutdown)
  • All managed threads are suspended except for the finalizer thread

0:009> ~
#  0  Id: aa0.2594 Suspend: 1 Teb: 002d2000 Unfrozen
   1  Id: aa0.18e0 Suspend: 1 Teb: 002d5000 Unfrozen
   2  Id: aa0.3ac4 Suspend: 1 Teb: 002d8000 Unfrozen
   3  Id: aa0.30c0 Suspend: 1 Teb: 002db000 Unfrozen
   4  Id: aa0.1d34 Suspend: 1 Teb: 002de000 Unfrozen
  5  Id: aa0.1a08 Suspend: 0 Teb: 002e1000 Unfrozen
   6  Id: aa0.2954 Suspend: 1 Teb: 002e4000 Unfrozen
   7  Id: aa0.3cf4 Suspend: 1 Teb: 002e7000 Unfrozen
   8  Id: aa0.3d2c Suspend: 1 Teb: 002ea000 Unfrozen
.  9  Id: aa0.4300 Suspend: 1 Teb: 002ed000 Unfrozen
  10  Id: aa0.4224 Suspend: 1 Teb: 002f0000 Unfrozen
  11  Id: aa0.2cec Suspend: 1 Teb: 002f3000 Unfrozen

  • All finalizeable objects are declared as finalizable
  • Only the finalizer thread is allowed to run to finalize all now considered garbage objects

The problem with that approach is that there is an inherent race condition where a not yet completed Dispose call is calling into unmanaged code then the finalizer will try to call the unmanaged cleanup call a second time.

Is This A Problem?

Well lets check who in the .NET Framework calls GC.SuppressFinalize

image

There are quite a few classes in the Base Class Library which implement finalizers this way. GC.SuppressFinalize is always called last which is a time bomb waiting to crash on you at the worst possible time and killing e.g. your UI while you are closing everything. Lets try an experiment by changing our code to create Brushes instead of events:

// create 50K events
var events = Enumerable.Range(1, 50 * 1000)
                        .Select(x => new SolidBrush(Color.AliceBlue))
                        .ToList();

When I let it run I get a nice AccessViolationException which some of us certainly have sporadically seen but have been left back wondering why that exception did happen to them:

image

To be fair. Not all classed listed above are susceptible to that race condition. Some classes already check if a shutdown is running and do in that case nothing:

~OverlappedCache()
{
   if (!NclUtilities.HasShutdownStarted)
  {
       this.InternalFree();
  }
}

The Fix

There are several ways to get around that. The easiest is to move the GC.SuppressFinalize call before the Dispose call which will prevent the finalizer from running during shutdown if a Dispose call is already executing. If an exception escapes from the Dispose call it will not be tried by the finalizer a second time which sounds like a good deal for most resources.

public void Dispose()
{
    if( hGlobal !=  IntPtr.Zero) // check if memory is gone
    {
        GC.SuppressFinalize(this); // Prevent finalizer from running it again
        Marshal.FreeHGlobal(hGlobal); // free it            
        hGlobal = IntPtr.Zero;
    }
}

Another way is to check if a shutdown or an AppDomain unload is happening right now:

    if (!Environment.HasShutdownStarted && !AppDomain.CurrentDomain.IsFinalizingForUnload())

.NET Core on the other hand does not suffer from that issue because a .NET Core application has no final finalizer call which prevents that race condition entirely. Now go and check your finalizers to make your application correctly shutting down.

Update 1

As requested by Steve I present a fixed safe version:

In 2018 you should not write a finalizer at all. The basic Dispose(bool bDisposing) pattern is from a time where we had no SafeHandles. Today I would write my Event class entirely without a finalizer but the unmanaged resource/s are self contained by finalizable Safehandles. A typical wrapper would look like the one below which owns the memory pointer:

sealed class SafeNativeMemoryHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    public SafeNativeMemoryHandle(int size):base(true)
    {
        SetHandle(Marshal.AllocHGlobal(size));
    }

    protected override bool ReleaseHandle()
    {
        if (this.handle != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(this.handle);
            this.handle = IntPtr.Zero;
            return true;
        }
        return false;
    }
}

With that infrastructure in place we can improve the event class to the much easier version which will never leak any memory although it contains no finalizer at all:

/// <summary>
/// SafeEvent class needs no finalizer because unmanaged resources
/// are managed by the SafeNativeMemoryHandle which is the only class which needs a finalizer.
/// </summary>
public class SafeEvent : IDisposable
{
    internal SafeNativeMemoryHandle hGlobal;  // allocate some unmanaged memory

    public SafeEvent()
    {
        hGlobal = new SafeNativeMemoryHandle(500);
    }

    public void Dispose()
    {
        hGlobal.Dispose();
        hGlobal = null;
    }
}

You can also create event hierarchies by making the Dispose method virtual without fear to leak any handles from its derived classes. Each class which contains unmanaged resources should contain its own self cleaning members and you are done as long as there are no dependencies between them.

Advertisements

Crashing VS2017 15.5.2

I was just playing with the new readonly struct features of C# 7.2. To check if things got faster I first declared a new struct

    readonly struct FastPoint
    {
        public int X { get; set; }
        public int Y { get; set; }
    }

But Visual Studio will then complain

Program.cs(12,20,12,21): error CS8341: Auto-implemented instance properties in readonly structs must be readonly.
Program.cs(13,20,13,21): error CS8341: Auto-implemented instance properties in readonly structs must be readonly.

Ok. No problem lets make the setter private. But the error stays. Perhaps I need to add some modifier to the get property. Lets try readonly

    public int X { readonly get; private set; }

This results in

error CS0106: The modifier ‘readonly’ is not valid for this item

Ok. Now I am desperate. Lets try ref readonly.  Hah something happens:

image

But not for the better. VS eats up all the memory and if you try to compile it will transfer the leak into PerfWatson2.exe as well.

image

Ok that was not it. The final solution was to remove the setter completely. Interestingly you can still set the property although it has no declared setter.

    readonly struct FastPoint
    {
        public int X { get; }
        public int Y { get; }

        public FastPoint(int x, int y)
        {
            X = x;
            Y = y;
        }
    }

This seems to be a C# 6 feature I was until now not aware of. Problem solved. But wait what was the ever increasing memory of the compiler?

From the call stacks we can deduce quite a bit

image

Roslyn is parsing a property declaration and has found ref which is a valid token. Now some memory is allocated for the token but later is treated as wrong token. That in itself would not be too bad but it seems that the parser seems to rewind and then tries parsing the same wrong property declaration again which results in infinite memory consumption. I have reported the issue here

https://developercommunity.visualstudio.com/content/problem/168027/typing-ref-readonly-at-the-wrong-places-will-crash.html

which will hopefully be fixed. The error message is ok in hindsight but it did confuse me the first time. If you want to play with the newest C# features you need to open the build properties tab, press Advanced and then you can select the e.g. C# latest minor version to always use the latest C# version.

image

Lets hope you are not hitting new memory leaks as fast as I did.

The Case Of NGen.exe Needing 50 GB Of Memory

This is an old bug which seems to be in the .NET Framework since a long time but since it is highly sporadic it was not found until now. I have got reports that on some machines NGen.exe did use all of the computers memory which did lead in Task Manager to this pattern:

clip_image002

The biggest process on that machine was always Ngen.exe and everything was very slow. This tells me that NGen did not recover from time to time from its high memory consumption but that it did allocate like crazy until the machine had no physical memory anymore. When an application uses all memory the OS will page out all memory to the hard disk when no physical memory is left. After writing many GB of data to the hard disk NGen can continue to allocate more memory until no physical memory is left and the OS will write all memory to the page file again. This will continue until Ngen.exe finally hits the commit limit which is the sum of Physical Memory + Page File Size which results in an Out Of Memory error. Only then the process will terminate.

When NGen did go crazy like this MS support suggested to delete the registry key Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots and its decendants. That fixed the issue but it remained mysterious why this was happening and what exactly did break. When you delete the registry key NGen will build up its Ngen root dll cache automatically.  Inside the registry keys was no obvious garbage data visible and the issue remained mysterious. But finally I have got my hands on a machine where the issue was still present which allowed me to take more evidence data.

What Do We Know?

  • ngen install somedll.dll or ngen createpdb somedll.ni.dll causes NGen.exe to consume many GB of memory
  • NGen breaks due to corrupted registry keys
  • After deleting the registry key below \HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots NGen will build up the registry structure by itself which “fixes” the issue

What data should we get? Part of it is based on experience and the other part is more psychological. If you hand over a bug report to someone else you should anticipate that he/she is not familiar with your favorite debugging tool (e.g. Windbg). Filing a bug report with random findings is easy. Filing a bug report which enables the support personnel  to get down to the real root cause is much harder. When I have a nice reproducible bug which I can repeat easily as often as I want I tend to get all data I can get. When file/registry issues are involved I would get some or all of the things below.

Full Scale Data Capturing

  • Capture a procmon trace which will show all accessed registry keys and files
    • That is easy to do and provides a general understanding which registry keys are accessed
  • Dump the affected files/registry keys
    • E.g. export the registry hive \HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService from regedit so someone can take a look at the accessed data
    • That is much easier than to crawl though a memory dump to reconstruct the accessed registry/file contents
  • Capture several memory dumps while the problem builds up
    • procdump is a great tool to take several dumps in a row once a specific condition has been reached
  • Capture ETW Traces to get the full picture
    • It potentially shows the evolution of the problem.  But this depends highly on the skill set of the support engineer if that data is useful to them
  • Capture a Time Travel Trace
    • This gives much detail but it can be time consuming to analyze a multi GB trace. If you or the support engineer do not know exactly where to look you will only produce a multi GB random data file which is not used at all

Getting the right data which somebody else can work with is tricky since you do not know with which tools someone else is most comfortable with. Normally you will start with the easiest data capturing method and based on the analysis of the data you did get you will need to decide what else is missing to get the full picture. That usually will involve capturing more data with a different tool.  But if you capture everything from procmon tracing up to a time travel trace the chances are good that you can reduce the time until the investigation leads to somewhere from weeks down to minutes. Lets start with the easy data gathering approach first:

Getting A Procmon Trace

Procmon is a SysInternals Utility which can record all process starts along with all file and registry accesses. When you start it you can add a filter for the process of interest. In our case it is ngen.exe.

image

Since Procmon records system all events the resulting memory needed by this tool can become quite large. To prevent that is is most of the time better to discard all not interesting events from memory by checking the checkbox of File – Drop Filtered Events

image

That is important if you need to wait for an incident to run for hours. You should always get the latest version of procmon since from time to time some memory leaks or other things are fixed which could hinder a successful data collection. The gathered data can then be saved as PML file which can be read into the tool again on a different machine. To save the data you can choose a nice file name which describes the problem. Naming is hard but be precise what the trace actually contains. In a few weeks even you do not know what that file was for.

image

Under the hood Procmon uses ETW to gather the data. What does that mean? For every file/registry access and process start/dll load you will get a full call stack if you click on the event properties which can tell you already as much as a memory dump:

image

In our NGen case we find that NGen deserialized the native image roots from a registry list. That is a very powerful capability but you need to set the symbol server and the path to dbghelp.dll from a Windbg installation (x64 usually) to get valid call stacks. There is one caveat: Procmon cannot decode stack traces from managed code which makes this otherwise great tool severely limited mainly for unmanaged stack trace analysis.

Dump The Registry Keys

Who said that data collection is difficult? When we look at the NGen registry keys which were accessed we find a large list of all NGenned dlls “cached” in the registry. This is done for performance reasons. The .NET Framework had always a soft spot for the Registry. NGen does it and the GAC also. That is much faster to read than to traverse over 4000 directories only for the 64 bit NGenned dlls.

image

Since the corruption is data dependent we can simply export the whole NGenService tree into a text file which can hopefully help to diagnose the data corruption.

image

The resulting text file was over 200 MB in size. It is unlikely that you will find the root cause by looking at a 200 MB text file line by line. We need more clues where we need to look at.

Can We Already Solve?

Pattern identification is a very important skill you need to develop if you want to analyze an issue. One if not the most powerful analysis method is differential analysis. Usually you have a good case and a bad case which you can compare and see where the behavior starts to diverge. But it also works in the other way to find common patterns. The presence or the absence of a deviation can both be a useful hint. It is a good idea to capture the data not only once but several times to be able to find stable patterns in the data.

By looking at the procmon registry trace we can first filter only for the querying of registry contents of the Roots node

image

There we find that the last accessed registry key is always the same one. It is a binary registry key named ImageList. But wait. The call stack of that event is not particularly enlightening but it is a strong hint that either that was the last registry key it did read and then one of the previously read registry keys did contain invalid data or that this registry key is the one which is corrupted. Lets check the contents of the ImageList value:

image

Hm. Not sure if that is the problem. Lets get more data.

Capture Memory Dumps

There are many ways to capture memory dumps but the most flexible tool besides DebugDiag is procdump. It is a simple command line tool which can trigger the creation of a memory dump in very sophisticated ways. In our case it is straightforward. We want to start NGen and then take 3 dumps with 1s in between because the memory leaks is growing very fast.

C:\Windows\assembly\NativeImages_v4.0.30319_64\System\0c9bec7e4e969db233900a4588c91656>procdump -s 1 -n 3 -ma -x c:\temp ngen.exe createpdb system.ni.dll c:\temp

ProcDump v9.0 – Sysinternals process dump utility
Copyright (C) 2009-2017 Mark Russinovich and Andrew Richards
Sysinternals – http://www.sysinternals.com

Process:               ngen.exe (14168)
CPU threshold:         n/a
Performance counter:   n/a
Commit threshold:      n/a
Threshold seconds:     1
Hung window check:     Disabled
Log debug strings:     Disabled
Exception monitor:     Disabled
Exception filter:      [Includes]
*
[Excludes]
Terminate monitor:     Disabled
Cloning type:          Disabled
Concurrent limit:      n/a
Avoid outage:          n/a
Number of dumps:       3
Dump folder:           c:\temp\
Dump filename/mask:    PROCESSNAME_YYMMDD_HHMMSS
Queue to WER:          Disabled
Kill after dump:       Disabled

Press Ctrl-C to end monitoring without terminating the process.

Microsoft (R) CLR Native Image Generator – Version 4.7.2556.0
Copyright (c) Microsoft Corporation.  All rights reserved.
[23:16:42] Timed:
[23:16:42] Dump 1 initiated: c:\temp\ngen.exe_171212_231642.dmp
[23:16:42] Dump 1 writing: Estimated dump file size is 1418 MB.
[23:16:44] Dump 1 complete: 1419 MB written in 2.4 seconds
[23:16:46] Timed:
[23:16:46] Dump 2 initiated: c:\temp\ngen.exe_171212_231646.dmp
[23:16:47] Dump 2 writing: Estimated dump file size is 4144 MB.
[23:17:42] Dump 2 complete: 4145 MB written in 55.8 seconds
[23:17:44] Timed:
[23:17:44] Dump 3 initiated: c:\temp\ngen.exe_171212_231744.dmp

The command line parts are -s 1 to wait one second between each dump, -n 3 to take three dumps before it exits, -ma to take a full memory dump, -x expects as first argument the dump folder and all other arguments are the executable and its command line arguments. If you look at the command line parameters you will find a lot more. The output a little frightening at first but there is a secret switch (procdump -? -e) which will print a lot of useful examples how procdump is meant to be used. Actually this switch is not secret but nearly no one will read the large command line help until the end which is the reason I spell it out explicitly.

We can load the dump file into the new Windbg which will us give automatically a nice call stack window where NGen was just allocating memory:

image

That information should be sufficient for any support guy to drill down to the root cause. To make sense of the call stack you need local variables which are not part of the public symbols of MS. For us outsiders that is as far as we can analyze the problem. Really? Lets have a look at the method names. NGen deserializes a Root Array of native image roots from the registry. While it is deserializing a specific root object it deserializes something with a method BinaryDeSerializeLogicalImageList. That sounds familiar to the binary registry node ImageList from our registry dump. When we only could know the registry key it was just deserializing. This involves a little bit poking the in the dark. I would expect that the stack between DeSerialize and BinaryDeSerializeLogicalImageList contains hopefully somewhere the registry key name.

With the k command we get the call stack and the current stack pointers

0:000> k
 # Child-SP          RetAddr           Call Site
00 000000c9`5f2fdcc0 00007ffe`327a8912 ntdll!RtlpLowFragHeapAllocFromContext+0x2a
01 000000c9`5f2fdda0 00007ffe`05eebde6 ntdll!RtlpAllocateHeapInternal+0xf2
02 000000c9`5f2fde60 00007ffe`05eec700 mscorsvc!operator new+0x30
03 000000c9`5f2fde90 00007ffe`05eed445 mscorsvc!ArrayOfPointers::CreateAndAppendNode+0x2c
04 000000c9`5f2fded0 00007ffe`05eed7f1 mscorsvc!Configuration::BinaryDeSerializeLogicalImageList+0xcd
05 000000c9`5f2fe060 00007ffe`05eeffb0 mscorsvc!Configuration::DeSerialize+0x206
06 000000c9`5f2fe300 00007ffe`05ee81b2 mscorsvc!Root::DeSerialize+0x379
07 000000c9`5f2fe630 00007ffe`05eecd98 mscorsvc!RootList::DeSerializeRoot+0x9c
08 000000c9`5f2fe690 00007ffe`05f0b69c mscorsvc!RootList::GetRootArray+0x1a6
09 000000c9`5f2fe960 00007ffe`05f0bb79 mscorsvc!CCorSvcMgr::GetLogicalImageForRootedNI+0xd4
0a 000000c9`5f2fec60 00007ff6`aef17dd7 mscorsvc!CCorSvcMgr::CreatePdb2+0x229
0b 000000c9`5f2ff1c0 00007ff6`aef11f32 ngen!NGenParser::ProcessNewCommandLineOptionsHelper+0x99d
0c 000000c9`5f2ff5d0 00007ff6`aef11d54 ngen!IsNewCommandLine+0x196
0d 000000c9`5f2ff730 00007ff6`aef1276a ngen!trymain+0x19c
0e 000000c9`5f2ffd90 00007ff6`aef126f8 ngen!wmain+0x4e
0f 000000c9`5f2ffe20 00007ffe`30221fe4 ngen!BaseHolder,&Delete,2>,0,&CompareDefault,2>::~BaseHolder,&Delete,2>,0,&CompareDefault,2>+0x2a6
10 000000c9`5f2ffe50 00007ffe`327eef91 kernel32!BaseThreadInitThunk+0x14
11 000000c9`5f2ffe80 00000000`00000000 ntdll!RtlUserThreadStart+0x21

The brute force method is to dump the stack from start to end with

0:000> db c9`5f2fde90  c9`5f2ff1c0

000000c9`5f2fe710  18 e7 2f 5f c9 00 00 00-43 00 3a 00 2f 00 41 00  ../_....C.:./.A.
000000c9`5f2fe720  6e 00 79 00 4e 00 61 00-6d 00 65 00 57 00 69 00  n.y.N.a.m.e.W.i.
000000c9`5f2fe730  6c 00 6c 00 44 00 6f 00-2e 00 64 00 6c 00 6c 00  l.l.D.o...d.l.l.
000000c9`5f2fe740  00 00 74 00 75 00 62 00-73 00 2e 00 49 00 6e 00  ..t.u.b.s...I.n.
000000c9`5f2fe750  74 00 65 00 72 00 6f 00-70 00 2c 00 20 00 56 00  t.e.r.o.p.,. .V.
000000c9`5f2fe760  65 00 72 00 73 00 69 00-6f 00 6e 00 3d 00 31 00  e.r.s.i.o.n.=.1.
000000c9`5f2fe770  30 00 2e 00 30 00 2e 00-30 00 2e 00 30 00 2c 00  0...0...0...0.,.
000000c9`5f2fe780  20 00 43 00 75 00 6c 00-74 00 75 00 72 00 65 00   .C.u.l.t.u.r.e.
000000c9`5f2fe790  3d 00 4e 00 65 00 75 00-74 00 72 00 61 00 6c 00  =.N.e.u.t.r.a.l.
000000c9`5f2fe7a0  2c 00 20 00 50 00 75 00-62 00 6c 00 69 00 63 00  ,. .P.u.b.l.i.c.
000000c9`5f2fe7b0  4b 00 65 00 79 00 54 00-6f 00 6b 00 65 00 6e 00  K.e.y.T.o.k.e.n.
000000c9`5f2fe7c0  3d 00 33 00 31 00 62 00-66 00 33 00 38 00 35 00  =.3.1.b.f.3.8.5.
000000c9`5f2fe7d0  36 00 61 00 64 00 33 00-36 00 34 00 65 00 33 00  6.a.d.3.6.4.e.3.
000000c9`5f2fe7e0  35 00 2c 00 20 00 70 00-72 00 6f 00 63 00 65 00  5.,. .p.r.o.c.e.
000000c9`5f2fe7f0  73 00 73 00 6f 00 72 00-41 00 72 00 63 00 68 00  s.s.o.r.A.r.c.h.
000000c9`5f2fe800  69 00 74 00 65 00 63 00-74 00 75 00 72 00 65 00  i.t.e.c.t.u.r.e.
000000c9`5f2fe810  3d 00 61 00 6d 00 64 00-36 00 34 00 00 00 00 00  =.a.m.d.6.4.....

where we find the registry key which is currently being worked on:

0:000> du 000000c9`5f2fe718
000000c9`5f2fe718  "C:/AnyNameWillDo.dll"

It looks like the ImageList of this dll is corrupted which did cause NGen to go into an infinite loop. A deeper look at the surrounding registry keys from the registry export revealed that another registry key of the previous dll was also corrupted. This is really strange and I have no idea how NGen could manage to corrupt two unrelated registry keys RuntimeVersion (string) and ImageList  (binary).

Capture ETW Traces

Based on our previous investigations we should get data about memory allocation, CPU consumption and accessed registry keys which should give us a good understanding how the problem evolves over time. To capture ETW data you need normally to download and install the Windows Performance Toolkit which is part of the Windows SDK. But since Windows 10 the command line only tool named wpr.exe is part of Windows itself. That can be important if you are working on a machine which is locked down with e.g. Device Guard and you cannot install new software easily and you cannot execute not Authenticode signed binaries which rules many home grown data collection tools out. Normally I use ETWController (http://etwcontroler.codeplex.com/) which enables me to capture mouse and keyboard interactions along with screenshots which has proven to be invaluable many times. But on a locked down machine one needs to use the tools which you can start.

C:\WINDOWS\system32>wpr -start CPU -start Registry -start VirtualAllocation -start GeneralProfile

… Ngen …. 

C:\WINDOWS\system32>wpr -stop c:\temp\NgenGoneCrazy.etl

After loading the ETL file into WPA and a little working out the important metrics like CPU, Allocation and Registry accesses we get this one:

image

    We find that practically all CPU is spent in allocating memory while the method BinaryDeSerializeLogicalImageList  was executed. The VirtualAlloc graph shows a frightening allocation rate of 1,4 GB/s which is the most massive memory leak I have seen since a long time. The last graph shows that the huge allocation rate starts once the ImageList of the dll C:/AnyNameWillDo.dll was read. After that no more registry keys were read which is strong indicator that this registry key is the one knocking NGen out.

    After realizing that it was easy to come up with a minimal registry file which will bring NGen down

    NgenCorrupt.reg

    Windows Registry Editor Version 5.00
    
    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots\C:/AnyNameWillDo.dll]
    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots\C:/AnyNameWillDo.dll\0]
    "ImageList"=hex:00,69,00,6f
    

    When you import that reg file and you execute

    ngen.exe createpdb system.ni.dll c:\temp

    in the directory where the native image of System.ni.dll is located NGen will explode. But beware that at least on Windows 10 my machine did freeze and never recover when all physical memory was allocated. It seems that older Windows editions (e.g. Server 2008 R2) deal with such rapidly allocating applications in a better way which lets you at least terminate the application once it has gotten all physical memory.

    Conclusions

    After have drilled down that far it is pretty clear where NGen did break although it remains a mystery how the registry keys were corrupted. The case is currently being investigated at Microsoft which will hopefully result in a more robust NGen which will ignore bogus registry entries and delete them which should cause the recreation of the missing NGen root entries some time later. The data sent to MS are memory dumps, and ETL Trace, procmon trace and the reg file to corrupt the registry on a test machine.

    image

    Troubleshooting is not magic although many people assume magic happening here. It is pretty straightforward to capture the relevant data with some experience. Analyzing the captured data is indeed a complex undertaking and requires a lot of experience. The goal of this blog post is to help other people dealing with failures to understand what data is needed and why. Capturing the data is much easier and faster than to analyze it. If you capture the right data you will make the work of others trying to help you a lot easier.

    I always tell people that it makes no sense to assign one guy of a team as troubleshooter and send him to a Windbg/ETW training. First of all if someone is assigned to a task he dislikes he will never be good at it. Second it needs a lot of practice and experience to be able to drill down this deep. If you are trying to analyze such an issue once every few months you will not have the necessary skills to use the tools correctly.  If you want to bring a member of your team at a good level to troubleshoot hard issues he/she must be willing to look at a memory dump nearly every day. If no one raises his hand for this task you can spare the money for the Windbg/ETW training. But it makes sense to bring all team members to a level where everyone understands what data is needed to let some specialized guys to look into the issues in an efficient manner by providing enough and the right data to successfully nail the root cause. As a positive side effect more people will get used to these tools and some guys will like to drill deeper. These are the ones you should send to a Windbg/ETW training.

    That’s all for today. Remember: Great tools are useless. Tools become great when people use them.

    Bringing The Hardware And Windows To Its Limits

    I like to experiment a lot which often leads to surprising results. Sometimes I compare performance engineering to quantum mechanics: In quantum mechanics a measurable value (observable) is not determined until the measurement is performed. I stretch quantum mechanics here and claim that you have bad performance if you never measure. Or to put it into another way:

    The probability to get a well performing system without quantitative repeatable regular performance measurements is zero.

    When you look deep enough you will find not so well known things. Memory allocation and access performance is such a topic which could span whole books because it is such a fundamental thing which most application developers are not aware of.

    What do I mean with that? Lets perform an experiment:

    1. Allocate 2000MB of memory.
    2. Measure the access performance of every 4096th byte of the allocated memory.
    3. Repeat the measurement to get consistent results a second time.

    Below is a small C++ application to do this:

    #include <chrono>
    
    class Stopwatch
    {
    public:
        Stopwatch()
        {
            _Start = std::chrono::high_resolution_clock::now();
        }
    
        void Start()
        {
            _Start = std::chrono::high_resolution_clock::now();
        }
    
        std::chrono::milliseconds Stop()
        {
            _Stop = std::chrono::high_resolution_clock::now();
            return std::chrono::duration_cast<std::chrono::milliseconds>(_Stop - _Start);
        }
    private:
        std::chrono::high_resolution_clock::time_point _Start;
        std::chrono::high_resolution_clock::time_point _Stop;
    };
    
    #pragma optimize( "", off )
    void Touch(void *p, size_t N)
    {
        char *pB = (char *)p;
        char tmp;
        for (size_t i = 0; i < N; i += 4096)
        {
            tmp = pB[i];
        }
    
    }
    #pragma optimize("", on)
    
    void main()
    {
        const int NBytes = 2 * 1000 * 1024 * 1024; // 2000 MB of memory
        char *bytes = new char[NBytes];
        Stopwatch sw;
        Touch(bytes, NBytes );  // touch every 4096th byte
        auto ms = sw.Stop();
        printf("Did touch %d bytes in %lld ms\n", NBytes, ms.count());
        sw.Start();
        ms = sw.Stop();
        printf("Did touch 2 %d bytes in %lld ms\n", NBytes, ms.count());
    }

    When we execute it the numbers look promising

    D:\>\MemAllocPerf\x64\Debug\MemAllocPerf.exe
    Did touch 2097152000 bytes in 13 ms
    Did touch 2 2097152000 bytes in 0 ms

    13ms for the first access time and 0ms the second time. This is pretty good even for a debug build. For completeness lets execute the same thing as Release build because everyone tells you that you should never ever trust performance values from debug builds.

    D:\>\MemAllocPerf\x64\Release\MemAllocPerf.exe
    Did touch 2097152000 bytes in 377 ms
    Did touch 2 2097152000 bytes in 0 ms

    Second time still looks good but what has happened to the first time access performance? The release build has become 30 times slower! How can this be? Lets step though it with a debugger and check the memory consumption in Task Manager after the allocation but before we have touched the memory.

    Debug

    image

    Release

    image

    Well that is interesting. Both versions have committed 2000 MB of memory but the debug version has it in its working set already. The release build consumes basically zero physical memory. Just in case you need quick recap what commit size and working set means:

    • Commit size is the amount of memory you did allocate with new, malloc, calloc, GlobalAlloc, …
    • Working Set is the physical memory the operating system has assigned to your process (real ram chip usage). The working set can be smaller because the operating system can page out data of your process to make room for other also memory hungry applications.

    Ok so this means the OS did page out my data for the release build? Well no not in this case. There is a wrinkle to it. All operating systems try to be as lazy as possible to move the costs of memory allocation and usage at the latest time possible.

    The Operating System View Of Memory Allocation And Access

    When an allocation happens the OS first needs to check if the process has enough address space left for the allocation. This is especially true for x86 processes which can allocate only 4 GB of memory where we can run out of free addresses where to put our allocation because the memory is fragmented like below. We could allocate the memory but we have no address space hole big enough to satisfy the reservation request.

    image

    I have never seen that happen on x64 processes but it is a pretty common issue on x86 processes. This is called reserving memory in Windows lingo. Apart from checking if enough address space is available in the process nothing happens which is therefore a super fast operation.

    The second stage is to commit memory. The new[], malloc functions will usually reserve and commit the memory in one go with a call to VirtualAlloc on Windows. This time the OS needs to do some further checks.

    • Check if allocation size > Commit Limit
      • The Commit Limit is the maximum memory all applications together can allocate. It is the sum of the physical memory + size of the page file.
    • If the page file has not a fixed size the OS might need to grow the page file for the requested memory which can take quite some time to ensure that the allocation request can be served from physical or page file baked memory.

    After reserving the address space and committing the memory the OS guarantees that your newly allocated memory can be served by the OS either from the page file or (more likely and performant) from physical memory.

    image

    (RAM Image Source https://en.wikipedia.org/wiki/File:Laptop_RAM.jpg)

    You see the dotted lines? The OS only guarantees that you can access the memory but it is still not assigned to your process. The OS has returned you a pointer to memory but you still do not have the memory in your process. All memory pages in your process are still empty! Now comes the expensive part. But this involves no API call at all. To force the OS to actually assign the memory to your process you only need to access it. When you access an empty page the CPU will trigger an exception (page fault) and call an back into the operating system. At this time the OS will actually assign the memory to your process working set “database” where the OS keeps track which physical pages are baked by real memory or the page file.

    The operation to add memory to your process working set is called page fault. If only RAM needs to be assigned to your process it is called soft page fault (fast). If you access paged out memory a hard page fault happens (slow, …. very slow) which will cause the OS to start a read operation from the hard disk which can be a lengthy undertaking.

    After the soft/hard page fault your application finally takes over and you can access the memory without any OS interference.

    image

    If the system runs low on memory some least used memory pages are removed from your working set and its contents are put into the page file. I have indicated that with the dotted lines pointing to the page file in the picture above. The next time you access the memory you will get hard page faults which are the source of most sluggish system behavior. It has become much less of a problem if you are lucky enough to have the page file on an SSD which have pretty good random access times.

    Ok that was a pretty long excursion into the details of memory management. But why is the debug build so much faster and why is all of the memory after the allocation already in our process working if I use the debug build? When we examine the memory contents which was returned by allocation request we find some byte pattern (cd cd)

    image

    If you look further what that means at https://stackoverflow.com/questions/370195/when-and-why-will-an-os-initialise-memory-to-0xcd-0xdd-etc-on-malloc-free-new you will find that the C-runtime initializes and and hence access the memory before returning the pointer to the calling code. That is the reason why the debug build was so much faster. The soft fault performance hit did happen already at allocation time because the memory was initialized to the CD CD byte pattern. That is one of the very few cases where the measured performance of a debug build is much better compared to a release build because the most expensive part of memory access has happened before we did start he measurement.

    Memory Copy And Soft Fault Performance Do NOT Scale On Windows

    That was an interesting case but I am getting just started. Lets suppose we want to read a large file from the disk as fast as possible. Most often the file was accessed already by the application some time ago and it is already in the file system cache. If you execute the read operation the OS has nothing to do except to copy the memory from the file system cache to the buffer in your application. When the buffer was just freshly allocated by new[] how fast can we get? Copying the data from the file system cache is a problem that calls for parallelization to speed up things. The interesting thing is how fast can we get if multiple threads are copying in parallel data from A to B.

    Since memory access has some hidden costs upon first access it makes sense to measure the memory copy performance for a freshly allocated array and a second time with the same one. The test essentially should do

    1. Allocate a large array e.g. 2000 MB
    2. Fill it with random data which will be our source array
    3. Allocate a 2000 MB target array
    4. Start 1-n threads
    5. Each thread copies a sub range of source to destination
    6. Measure the time stop step 4-5
    7. Repeat Step 4-6 for a second measurement

    First lets check the results on different CPUs when we hit a “warm” destination array which will exhibit no soft page faults.

    image

    From these numbers we can deduce the “raw” memory copy performance which at some point saturates the CPU memory bus. The numbers obtained here differ by a large margin with the documented ones:

    CPU Theoretical Memory Bandwidth GB/s Max Memory Copy  Performance GB/s
    I7 4770K 25.6 9,3
    E5 2623 v3 59 10,3
    Xeon Gold 6148 ??? 40,8

    At least for my home machine (the 4770K) I get ca. 9.3 GB/s which is off by a large margin of my 25.6 GB/s. I assume that the maximum memory bandwidth was measured either for read or write operations, but not a parallel read/write operation which would mean that I can multiply my measured values with a factor two. Then I would arrive at ca. 18 GB/s which seems to be ok if I attribute the rest to the cache controller which needs also some memory bandwidth. The raw memory copy performance depends on the memory bus and the used memory modules. What happens when I replace from a full memory bank one module? You guessed it: I did another experiment with my home machine. As expected the memory copy performance did drop by 50% from 9,3 GB/s down to 4,7 GB/s which gives a strong hint that memory bandwidth saturating applications should run always on machines which have full memory banks.

    From the graph above it is clear that having more cores is better up to a certain point where the memory bandwidth is reached and more cores do not help anymore. The brand new Xeon Gold CPUs show an impressive performance where up to 10 memcopy threads did still add performance. Now lets change the use case and measure for the first access time which includes soft page faults.

    image

    Well that is interesting. Adding more cores degrade the soft page fault performance  by a large margin. The speed gains added by parallel memcpy are far less than one would expect. To isolate the issue we can measure the page touch time (soft fault performance) with a similar application like the one above just with some multi threading added. We then get this

    image

    The soft page fault performance of Windows 10 does not scale with the numbers of cores. Instead it decreases with the numbers of cores! There seems to be a magic number around 4 concurrent touch threads where we become faster but then the valley is left and the soft fault performance in all cases gets worse compared to the single threaded use case. If something unexpected like this happens it is time to use a profiler to drill deeper. Below is a CPU sampling graph which shows the parallel page touch with 1-40 threads for a 2000 MB array.

    image

    As I have feared there is some internal locking in the Windows soft fault implementation which makes soft faulting a mostly single threaded operation. If multiple threads try to access the memory we get high CPU times in the method ExpAcquireSpinLockExclusiveAtDpcLevelInstrumented which wastes many CPU cycles until it finally gives up which ends in a highly contended lock. Remember: Locks do not scale. That is the reason why the small actual soft page fault work (yellow) is constant while the overhead causes by the lock explodes. This makes not much sense in my opinion when we get more and more cores to play with but the soft fault implementation still uses a process wide lock.

    Another view of the same date like above  named Flame Graph shows the overhead even better. The selected part below is the actual useful work and the huge rest is the overhead of the Spinlock.

    image

    One would think that soft page fault is an essential OS service that should be implemented as fast as possible. Apparently that is not the case for Windows 8 and 10 (https://stackoverflow.com/questions/45024029/windows-10-poor-performance-compared-to-windows-7-page-fault-handling-is-not-sc). The issue described there was solved by using VirtualLock which is essentially soft faulting the pages from one thread which is much better than to do it concurrently from many threads as you can see from the graph above. I do not know what you think but I have the strong feeling that Microsoft should improve the soft page fault code to finally enter the multi core era. It would be interesting to compare the numbers with Linux because  Linus Torvalds seems to be vigilantly looking at the soft page fault implementation in the Linux kernel.

    With SSDs getting nearly as fast as RAM the soft fault performance becomes a limiting factor in multithreaded applications which would otherwise be memory bus constrained.

    Even seemingly simple things like allocating and accessing memory can have dramatic impact on application performance and scalability. The test application from above is pretty simple but if you start asking why the performance is as it is and you try to optimize it you quickly reach operating system limits and not much later the limits of the hardware. That is actually a good thing because it means that your application is fast as hell.

    If you want to test the soft page fault/memcopy performance for yourself you can check out https://github.com/Alois-xx/FastPageFault which should be pretty self explaining. That was the application I did use the produce the charts above.

    Update 1

    Windows 10 Fall Creators update contains a fix for the soft page fault performance. First some numbers from my home machine with 4 physical cores:

    image

    The use case was copying a 2000 MB buffer into not yet touched memory with 1-N threads.

    Creators Update

    Due to the internal locking we did never get close to the actual memory bandwidth because the kernel was busy with its Spinlocks while updating the data structures to add the pages to our process working set. The copy performance did peak around 6,5 GB/s

    Fall Creators Update

    The soft page fault implementation is now much more scalable and we can now max out our memory bandwidth of 9,xGB/s with only 3 cores while we can fully distribute the soft page fault work across threads!

     

    When we zoom deeper into the soft page fault performance we find that even the single thread soft fault performance has become 43% faster and it scales much better now.

    image

     

    Below is the Spinlock CPU vs actual work shown for Creators Update

    image

    and here for Fall Creators Update

    image

    The lock is still there but much less time is spent in locking things which is a good thing. How did the MS engineers improve the soft page fault implementation by nearly a factor two? To see this the WPA diff view is helpful along with some Excel magic. The diff view in WPA is nice but it still lacks in my opinion some important features

    • There is no way to mark methods which were removed or added except by drilling to the bottom of the call stacks which is very tedious
    • I want to filter only for added/removed methods which would be very helpful to spot the actual difference and not the consequence deeper in the call stacks

    Anyway. There is nothing Excel cannot fix for us. Below is the pimped diff view of the page touch performance test.

    image

    What stands out is that the Fall Creators Update soft page fault implementation has far less method calls. The Windows Kernel internal List (all red marked list management methods) to maintain the page list was removed in favor of a more simple data structure with better cache locality and less locking. Besides getting rid of the kernel linked list the biggest change seems to be that by default the page size has been increased from 4 KB to 64 KB which means that the kernel needs to update 16 times less often the page table structures which seem to be the biggest change. At least that is my impression by noticing that the method MiGet64KPage consumes most CPU and looks like it was introduced with the Fall Creators Update.

    What about Windows Server?

    According to MS support the soft page fault fix should have made it into Windows Server 1709 which is a desktop less server which is best for container and cloud workloads. That is nice but I need a fix for Server 2016. The Windows Server 2016 soft page fault performance affects all applications, especially the memory hungry ones. Moving forward to a not compatible server edition which is still beta with a shorter long term support contract is not an option.

    It is pretty hard to get hard facts from MS support which issues is fixed with which OS version. The question: Is the issue in that ticket fixed in the build I am running? seems no longer to be easily answerable. That definitely should be improved.

    It Is Time For Time Travel

    Windbg is one of the most powerful yet underused tool in every Windows Developers toolbox. Some time ago a new fancier version of Windbg (Windbg Preview)  which looks like a todays modern application was released as Windows Store App. So far the UI has got a nice ribbon but under the hood no real changes were visible. Today a new version was released which supports a long time internally used tool called time travel tracer (TTT).

    The official documentation https://docs.microsoft.com/de-de/windows-hardware/drivers/debugger/debugging-resources is quite good and here is the blog post announcing Time Travel Debugging https://blogs.msdn.microsoft.com/windbg/2017/09/25/time-travel-debugging-in-windbg-preview/.

    If you have a hard to debug race condition you can now attach the Windbg and check the “Record process with Time Travel Debugging”

    image

    The trace file can grow many GB in size which can make it cumbersome if you want to reproduce an error which needs some time to surface. Luckily Windbg is, although a Store App, still xcopy deployable. Search with Process Explorer where the debugger exe sits and copy the contents of the directory C:\Program Files\WindowsApps\Microsoft.WinDbg_1.0.13.0_x86__8wekyb3d8bbwe\ to your own tool folder like Windbg_1.0.13.0. Then you can run it e.g. from a memory stick or a network share as usual. After searching a bit in the debugger directories you will find the command line version of the Time Travel Trace tool located at

    • x64  Windbg_1.0.13\amd64\TTD\TTD.exe
    • x86 Windbg_1.0.13\amd64\TTD\wow64\TTD.exe

    Now you can go to your problematic machine where the problem occurs without the need to install a store app which might not be possible due to corporate policy/firewall/isolated test network, …. To record a time travel trace from the command line I normally use it in ring buffer mode with a 2 GB buffer which should cover a few seconds or minutes depending on the application activity.

     

    D:\Tools\Windbg_1.0.13\amd64\TTD>ttd -ring -maxfile 2048 -out c:\temp\test.run -launch D:\cppError.exe
    Microsoft (R) TTD 1.01.02
    Release: 10.0.16366.1000
    Copyright (C) Microsoft Corporation. All rights reserved.

    Launching D:\cppError.exe

    cppError.exe(x86) (1040): Tracing stopped after 31ms
      Ring trace dumped to c:\temp\test.run

     

    You get a 2 GB file although the actually recorded data might be much smaller. If you have a short repro it might be better to skip the ring buffer setting.

    image

    Once you have the data it is time to leave the crime scene pack the .run file and analyze it at your favorite developer machine. You can double click the .run file or fire up Windbg and select the Open Trace option. Once you have loaded the trace you can press g to let the application run until the first exception happens or you can set breakpoints. But if nothing is set the application will stop a the first exception with the actual source window and the current values of the local function:

    image

    We find that whenever i is 5 we run into an issue which you could also have found also with a memory dump. But now you can travel back in time by entering p- to check the values were just before the crash. This is incredibly powerful to find the root cause how you did get into a situation. If Time Travel Debugging works it is a great tool. Just keep in mind that it makes the application around 10x times or more slower. You should not expect great performance if Time Travel recording is running.

    As far as I can tell it looks like Time Travel Tracing is built upon the Intel Processor Tracing feature https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing which enables recording the full instruction stream along with all touched memory which is a really powerful feature.

    Wishes

    With every great tool there are things left to be desired.

    • Windbg supports no managed source code window
      • No managed breakpoint setting in source code
    • No managed local variables
    • No managed call stack
    • SOS.dll seems not to work at all with time travel traces

    The debugger shell seems to exist only in the x64 flavor which makes it impossible to load SOS.dll for 32 bit applications into Windbg because the bitness of SOS.dll must match with the bitness of the debugger executable. When I try to get a mixed mode stack inside Windbg SOS.dll can be loaded but it seems to be missing vital information. I really would like to use Windbg with time travel tracing support for managed code (regular .NET Framework and .NET Core) but until now this is a C/C++ fan boys tool only.

    Conclusions

    Time Travel Tracing is a novel debugging technique which enables us developers to solve otherwise nearly impossible to find data races in a deterministic way. I hope that the Windbg team will add managed code support in the near future which would bring back feature parity between languages. Why should only C/C++ developer get all the great tools?

    MemAnalyzer v2.5 Released

    Managed  (and unmanaged) memory leak detection is still not as easy as it should be. MemAnalyzer tries to be a simple command line tool with the goal to help with the easy problems but also with the hard issues which surface only after hours of stress testing. Often you have a mixture of managed and unmanaged memory leaks where you need memory dumps and VMMap information to get the full picture. MemAnalyzer helps to automate these boring tasks. It is open source at Github (https://github.com/Alois-xx/MemAnalyzer). The executable can can be downloaded here: https://github.com/Alois-xx/MemAnalyzer/releases.

    If you are looking for a nice UI to look into memory dumps I recommend MemoScope.NET (https://github.com/fremag/MemoScope.Net) which lets you look into dump files without the need to resort back to Windbg. By the way if you have Visual Studio Ultimate you can already analyze managed memory dumps. But to analyze issues in production installing Visual Studio is not an option. PerfView on the other hand is a great tool but the numbers are only approximations which can make it hard to spot handle leaks. The object counts reported by PerfView are often off by the order of a magnitude. MemAnalyzer tries to get exact metrics of the real alive objects with the -live switch which is as good as !DumpHeap -stat -live of Windbg as CSV output.

    MemAnalyzer Features

    • Single self contained executable
    • Supports x86 and x64 processes and memory dumps
    • .NET Core on Windows x86 and x64 support (.NET Core 1.0,1.1 and 2.0, …)
    • Create memory dumps with associated VMMap data
    • Analyze managed heap, unmanaged, private bytes and file mappings when VMMap is present
    • Memory dump diff
    • Optional CSV output

    Usage – Leak Detection

    Why bother with a command line tool if nicer UIs are around? Because MemAnalyzer is capable to track not only your managed memory but also the unmanaged parts. When a managed application leaks memory you need first to find out if the leak happens on the managed heap or somewhere else.  Depending on the leaked memory type one needs to use different approaches/tools to track the leak down.

    The memory consumption of a process can be categorized as

    • Managed Heap
    • Unmanaged Heap
    • Private Bytes
    • Page File Allocated Shared Memory (Shareable in VMMap lingo)
    • Memory Mapped Files

    Since there are quite a few different memory types inside a process it is important to know where you need to look at.  MemAnalyzer uses VMMap to determine the size of each region and prints them out in a nice summary which can be put into a CSV file to e.g. get a summary after each test run during an automated test.

    C>MemAnalyzer.exe -pid 17888 -vmmap

    AllocatedBytes          Instances(Count)        Type
    4,648,204               105,374                 System.String
    918,824                 22                      System.String[]
    697,640                 27,607                  System.Object[]
    662,424                 27,601                  System.Int32
    1,512                   27                      System.RuntimeType
    1,072                   2                       System.Globalization.CultureData
    830                     5                       System.Char[]
    580                     8                       System.Int32[]
    432                     2                       System.Globalization.NumberFormatInfo
    26,130                  1,087                   Managed Heap(Free)!
    6,936,367               160,704                 Managed Heap(Allocated)!
    7,158,288                                       Managed Heap(TotalSize)
    25,165,824                                      Reserved_Stack
    54,398,976                                      Committed_Dll
    1,994,752                                       Committed_Heap!
    4,177,920                                       Committed_MappedFile!
    565,248                                         Committed_Private!
    3,825,664                                       Committed_Shareable!
    73,629,696                                      Committed_Total
    17,499,952                                      Allocated(Total)
    
    • Allocated managed objects. That is very much similar to !DumpHeap -stat in Windbg only with more options.
      • If you add -live then the metric will contain no temporary objects which were not reclaimed by the GC yet.
    • Managed heap summary which shows an overall metric how big the heap is and how much of it is allocated and free.
    • Additional VMMap information that gives you an overview which other memory types are allocated in the process.
      • MemAnalyzer needs the VMMap tool in the path to get that data.
    • Allocated = Managed Heap(Allocated) + Heap + MappedFile + Private Bytes + Shareable

    The Allocated value is important because if this value rises over time you have a leak in one of the memory types of the sum. If you print this value over time and it does not rise you have no leak (warning simplified!). That is simple enough to do it repeated times by e.g. a script to verify that your long running test behaves well. Since repeated measurements are key to detecting a memory leak MemAnalyzer allows you to append the output to a CSV file along with some context e.g. Iteration 1, 100 to get more clues.

    Inside your tracking script a more realistic command line would be

    MemAnalyzer -pid {0} -vmmap -o leak.csv -dtn 5000;N#200 -live -silent  -context “{1}”

    This will append the output of -dtn (Dump Type by Number)  for up to 5K types with an instance count > 200 to the CSV file leak.csv. Each line will get a context column which can be e.g. your test run number or whatever it makes easier to correlate the time when the data was taken. To get additional information you can add automatic memory dumps to the mix with

    MemAnalyzer -procdump -ma {0} {1}\PROCESSNAME_{0}_YYMMDD_HHMMSS.dmp

    This will take a memory dump of the process with pid {0} with procdump (needs to be in the path) and also gather VMMap information automatically (VMMap needs to be in the path). The big letter words will be expanded by procdump automatically. That way you can e.g. take a full memory dump after 1, 10 , 100, 500 iterations which contains everything but the trending data is contained in the csv file for every iteration which makes it much easier to track down the real memory leaks. Based on personal experience it is pretty easy to be led down the wrong path by a few memory dumps created by coworkers. The first dump might be created before anything was loaded into the application and the last dump might still have the test data loaded which looks like a pretty big leak but it is not the leak you a after when you have lost 500 MB after 100 iterations. Having more data points at hand which can easily be graphed in Excel is a big help to concentrate on the important incidents and to identify stable patterns and trends without the need to take a gazillion of memory dumps.

    Usage – Memory Optimization

    If you want to optimize the memory usage of an existing application MemAnalyzer is also a big help because you can quickly diff a memory memory dump which is your baseline against the currently running application. To get started you should take a memory dump of your current state.

    MemAnalyzer -procdump -ma pid C:\temp\Baseline.dmp

    After you have optimized the data structures of your application to (hopefully) consume less memory you can compare the running application against your saved baseline

    MemAnalyzer -f2 baseline.dmp -pid ddd

    When you use -f2 then 2-1 will be subtracted and you get a nice diff summary output. To keep the output short the diff is sorted by absolute values which makes it easy to spot top memory additions and deletions along with the totals.

    Lets do an step by step example what that means for your usual application development workflow. First we start with our memory hungry application and isolate the memory issue into a single reproducer like this:

    using System;
    using System.Collections.Generic;
    
    namespace coreapp
    {
        class DataInstance : IDisposable
        {
            Func<string> Checker;
            long Instance;
            bool IsDisposed;
            DataInstance[] Childs;
    
            public DataInstance(int instance)
            {
                Instance = instance;
                Checker = () => $"Instance {Instance} already disposed";
            }
    
            public void Dispose()
            {
                if (IsDisposed)
                {
                    throw new ObjectDisposedException(Checker());
                }
    	     IsDisposed = true;
            }
        }
    
        class Program
        {
            static void Main(string[] args)
            {
                var instances = new List<DataInstance>();
                for(int i=1;i<=5*1000*1000;i++)
                {
                    instances.Add(new DataInstance(i));
                }
                Console.ReadLine();
            }
        }
    }

    We give MemAnalyzer the process id to create a baseline memory dump. Since MemAnalyzer uses procdump and VMmap you should have both already downloaded and the tools in your path to make it work.

    MemAnalyzer.exe -procdump -ma 11324 DotNetCoreApp_1.0.dmp

    Ok we have a dump of a .NET Core application. How can we look into it?

     

    MemAnalyzer.exe -f DotNetCoreApp_1.0.dmp
    Error: Is the dump file opened by another process (debugger)? If yes close the debugger first.
           If the dump comes from a different computer with another CLR version v1.0.25211.02 that you are running on your machine you need to download the matching mscordacwks.dll first. Check out https://1drv.ms/f/s!AhcFq7XO98yJgoMwuPd7LNioVKAp_A and download the matching version/s.
           Then set _NT_SYMBOL_PATH=PathToYourDownloadedMscordackwks.dll  e.g. _NT_SYMBOL_PATH=c:\temp\mscordacwks in the shell where you did execute MemAnalyzer and then try again.
    
    Got Exception: System.IO.FileNotFoundException: mscordaccore_Amd64_Amd64_1.0.25211.02.dll

     

    Ups we have got an error. Most people stop reading when an error occurs because the error messages are most often not that helpful. But  this case is different. You need to download the folder of my OneDrive folder of the link in the error message to get nearly all .NET/Core debugging dlls you could ever need. Download them into e.g. C:\PerfTools. Then you need to tell MemAnalyzer where to look for it with the -dacdir option or you can set the environment variable _NT_SYMBOL_PATH=c:\PerfTools to get rid of the need to specify the dac directory every time manually.

    MemAnalyzer.exe  -dts -f DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools

    AllocatedBytes          Instances(Count)        Type
    320,000,000             5,000,000               System.Func<System.String>
    240,000,000             5,000,000               coreapp.DataInstance
    100,663,368             3                       coreapp.DataInstance[]
    24,530                  145                     System.String
    33,627,594              139                     Managed Heap(Free)!
    660,714,944             10,000,277              Managed Heap(Allocated)!
    694,348,008                                     Managed Heap(TotalSize)

    We have 660 MB on the managed heap allocated which is quite a lot of data. There are 5 million Func<string> and DataInstance instances. But why are we having 3 DataInstance arrays with 100MB? These look like temp arrays left over from our List<DataInstance> while growing the internal array. To get rid of garbage data you can either do a GC.Collect() before taking the dump or you tell MemAnalyzer to only track objects which are still alive.

    MemAnalyzer.exe  -f DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools -live

    AllocatedBytes          Instances(Count)        Type
    320,000,000             5,000,000               System.Func<System.String>
    240,000,000             5,000,000               coreapp.DataInstance
    67,108,912              2                       coreapp.DataInstance[]
    24,530                  145                     System.String
    627,160,448             10,000,275              Managed Heap(Allocated)!
    694,348,008                                     Managed Heap(TotalSize)

    There is still one array left which does not belong there but the numbers are now better. While looking at the data I decided that we should get rid of the many delegate instances which cost 64 byte per instance which add up to 320 MB alone for the instances itself. But since the DataInstance object also keeps a reference (on x64 8 bytes) we have even more memory to spare. If we get rid of the delegate and remove the class member we should be able to spare 5m*(64+8)=360MB of memory. That’s a plan. Lets measure things.  Our refactored class becomes

        class DataInstance : IDisposable
        {
            long Instance;
            bool IsDisposed;
            DataInstance[] Childs;
    
            public DataInstance(int instance)
            {
                Instance = instance;
            }
    
            public void Dispose()
            {
                if (IsDisposed)
                {
                    throw new ObjectDisposedException($"Instance {Instance} already disposed");
                }
                IsDisposed = true;
            }
        }

    By taking a second dump we can diff both dump files with

    MemAnalyzer.exe -f DotNetCoreApp_NoFuncDelegate.dmp -f2 DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools

    Delta(Bytes)    Delta(Instances)        Instances       Instances2      Allocated(Bytes)        Allocated2(Bytes)       AvgSize(Bytes)  AvgSize2(Bytes) Type
    320,000,000     5,000,000               0               5,000,000       0                       320,000,000                             64              System.Func<System.String>
    40,000,000      0                       5,000,000       5,000,000       200,000,000             240,000,000             40              48              coreapp.DataInstance
    0               0                       1               1               160                     160                     160             160             System.Globalization.CalendarData
    360,000,000     5,000,000               5,000,277       10,000,277      300,714,930             660,714,930                                             Managed Heap(Allocated)!
    360,010,320     0                       0               0               334,337,688             694,348,008                                             Managed Heap(TotalSize)

    As expected we got rid of 5 million Func<String> instances. After removing one field in DataInstance the instance size did shrink by 8 bytes from 48 down to 40 bytes which saved another 40MB. That is already quite good. But can we do better? The dispose check is an extra bool flag which will need  due to padding 4 bytes anyway. To eliminate the bool field we can reuse the Instance field and negate the Instance count to we keep the stored value which is always > 0. When you look closely you find that Instance is of the type long but we only need an int because we will always assign in the ctor the value from an integer. The revised DataInstance class is now

        class DataInstance : IDisposable
        {
            int Instance;
            DataInstance[] Childs;
    
            public DataInstance(int instance)
            {
                Instance = instance;
            }
    
            public void Dispose()
            {
                if (Instance < 0)
                {
                    throw new ObjectDisposedException($"Instance {-1*Instance} already disposed");
                }
    
                Instance *= -1; 
            }
        }

    When we diff things again

    MemAnalyzer.exe -f DotNetCoreApp_NoFuncDelegate_intFieldNoDisposeFlag.dmp -f2 DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools

    Delta(Bytes)    Delta(Instances)        Instances       Instances2      Allocated(Bytes)        Allocated2(Bytes)       AvgSize(Bytes)  AvgSize2(Bytes) Type
    320,000,000     5,000,000               0               5,000,000       0                       320,000,000                             64              System.Func<System.String>
    80,000,000      0                       5,000,000       5,000,000       160,000,000             240,000,000             32              48              coreapp.DataInstance
    33,554,456      1                       2               3               67,108,912              100,663,368             33554456        33554456        coreapp.DataInstance[]
    24              1                       1               2               24                      48                      24              24              System.Int32
    0               0                       2               2               208                     208                     104             104             System.Globalization.CultureInfo
    0               0                       2               2               912                     912                     456             456             System.Globalization.CultureData
    433,554,480     5,000,002               5,000,275       10,000,277      227,160,450             660,714,930                                             Managed Heap(Allocated)!
    400,011,856     0                       0               0               294,336,152             694,348,008                                             Managed Heap(TotalSize)

    Since we compare against the original baseline directly see the improvement in memory consumption by 433MB. That is 65% less memory! Not bad. If you want to keep going fast you can directly compare a memory dump against a running process to check if a temporary optimization pays off. I have found the VS profiler to break when larger x86 applications were profiled because VS seems to load the data also into a x86 process where things break if the more fancy object graph calculation because VS runs out of memory…

    VS 2017.3 does not yet recognize CoreClr memory dumps as managed processes which still requires managed heap analysis with PerfView or Windbg or MemAnalyzer.

    SOS and mscordacwks, mscordaccore Collection

    Even if you are not interested in MemAnalyzer you might stop by for the biggest collection of SOS and mscordacwks debugging dlls for all .NET versions I could get my hands on. When you analyze memory dumps taken from other machines you need to have a close version match within Windbg or an exact version match with PerfView / ClrMd. Inside Microsoft this is a non issue because their symbol servers distribute the matching binaries without any hassle. We outsiders have to copy the corresponding debugging libraries from the original machine or from the corresponding .NET installer. To spare you the time to hunt for the matching debugging dlls I share my collection of mscordackwks files as OneDrive link: https://1drv.ms/f/s!AhcFq7XO98yJgoMwuPd7LNioVKAp_A

    Currently it contains the versions

    2.0.50727.1434
    2.0.50727.3607
    2.0.50727.3615
    2.0.50727.4016
    2.0.50727.4062
    2.0.50727.4200
    2.0.50727.4408
    2.0.50727.4455
    2.0.50727.4927
    2.0.50727.4952
    2.0.50727.5403
    2.0.50727.5420
    2.0.50727.5444
    2.0.50727.5448
    2.0.50727.5456
    2.0.50727.8009
    4.0.30319.01
    4.0.30319.1008
    4.0.30319.1022
    4.0.30319.17379
    4.0.30319.18052
    4.0.30319.18408
    4.0.30319.18444
    4.0.30319.2034
    4.0.30319.225
    4.0.30319.237
    4.0.30319.269
    4.0.30319.296
    4.0.30319.33440
    4.0.30319.34003
    4.0.30319.34011
    4.0.30319.34014
    4.0.30319.34209
    4.0.30319.46960
    4.6.100.00
    4.6.1085.00
    4.6.127.01
    4.6.1586.00
    4.6.1590.00
    4.6.1637.00
    4.6.81.00
    4.6.96.00
    4.7.2053.00

    .NET Core

    .NET Core 1.0     1.0.25211.02
    .NET Core 1.1     4.6.25211.01
    .NET Core 2.0 x64 4.6.25519.02
    .NET Core 2.0 x86 4.6.25519.03
    

    It is interesting to note that .NET Core 2.0 has different build numbers between the x86 and x64 version. It looks like one blocking issue needed fixing before they did release it to a wider audience.

    Conclusions

    Your toolbox just has got a little bigger. As always use the right tool for the job. MemAnalyzer is not the silver bullet for all of your memory problems but it tries its best to give you fast feedback without the overhead of a fancy UI which makes it easy to put it into your existing leak tracking/reporting scripts. If you want to share success stories sound off in the comments. If you want to report bugs/issues it would be nice to open an issue at https://github.com/Alois-xx/MemAnalyzer/issues. Now go and improve the memory footprint of your app!

    The Definitive Serialization Performance Guide

    When looking at performance issues with ETW I did find long deserialization times in conjunction with BinaryFormatter. A deeper look revealed that the issue is easy to reproduce if the object graph gets bigger (>100K objects). Since BinaryFormatter is in business since over 10 years and I have never heard of significant performance issues except that it is slow in general I was quite surprised that such a blatant problem still exists in .NET. But since .NET is open sourced at github it is easy to complain: https://github.com/dotnet/corefx/issues/16991. This has catched the interest of Stephen Toub himself and they added a small but impactful change to BinaryFormatter:

    image

    The problem during deserialization with BinaryFormatter is that it uses the MaxArraySize value to bucket an ObjectHolder array like a hash table.

    ObjectHolder objects[MaxArraySize=0x1000=4096] 
    
    class ObjectHolder 
    { 
      internal ObjectHolder m_next;
    }

    If we need to store one million objects into one ObjectHolder array with a length of 4096 we need  to create 4096 ObjectHolder linked lists via the ObjectHolder.m_next field with a depth of 244 nodes. When you try to access a specific object in that pretty large linked list you will need to touch a large number of linked nodes during deserialization. By increasing the value to 1 048 576 ( = 0x100000 ) we will be able to use the array as a real hash table where only a few collision’s will make it necessary to look up the next item in the linked list.

    The next .NET Core version will have a band aid on it so that the issue will appear only with even bigger object graphs. With the current BinaryFormatter you you will get a nice parabola  where the serialization times for one million objects was only 2s but the deserialization time is in the order of 80s!

    Update 5/2018

    .NET 4.7.2 contains the fix of .NET Core 2.0 as well now. You need to enable the changes with an AppCompat switch in your App.config

    <?xml version="1.0" encoding="utf-8"?>
    <configuration>
      <runtime>
        <!-- Use this switch to make BinaryFormatter fast with large object graphs starting with .NET 4.7.2 -->
          <AppContextSwitchOverrides value="Switch.System.Runtime.Serialization.UseNewMaxArraySize=true" />
      </runtime>
    </configuration>

     

    image

    After the fix in .NET Core you can serialize object graphs up to ca. 13 million objects before you hit the next limitation of BinaryFormatter:

    Unhandled Exception: System.Runtime.Serialization.SerializationException: Exception of type ‘System.Runtime.Serialization.SerializationException’ was thrown.
       at System.Runtime.Serialization.ObjectIDGenerator.Rehash() in D:\Source\vc17\NetCoreApp\ConsoleApp1\ConsoleApp2\Serialization\ObjectIDGenerator.cs:line 140

    This time BinaryFormatter is running out of prime numbers for another hash table. If we try to serialize objects graphs with more than 2*6 584 983 objects we are out of luck again because the ObjectIDGenerator did never expect us to serialize more than 2*6584983 objects.

    public class ObjectIDGenerator
    {
        // Table of prime numbers to use as hash table sizes. Each entry is the
        // smallest prime number larger than twice the previous entry.
        private static readonly int[] s_sizes = 
        {
            5, 11, 29, 47, 97, 197, 397, 797, 1597, 3203, 6421, 12853, 25717, 51437,
            102877, 205759, 411527, 823117, 1646237, 3292489, 6584983
        };
    

    Update 5/2018: Fixed with .NET Core 2.1 https://github.com/dotnet/corefx/issues/24902#issuecomment-377331210 and will arrive hopefully with .NET 4.7.3.

    But hey that is ok. No one did ever that successfully until now. Since there are not many (http://stackoverflow.com/questions/569127/serializationexception-when-serializing-lots-of-objects-in-net) complaints about that one I guess everyone has simply moved on and switched to a faster serializer. Besides that I wonder if the guys did ever profile their application why deserialization was taking ca. 45 minutes for a ca. ~ 300MB file. 

    What Types of Serializers Exist?

    When you want to switch away from BinaryFormatter you first need to check how your data is structured. If your data can contain cyclic references then you have less options because most serializers are by default tree serializers which cannot cope with object graphs. Another downside might be that your target serializer cannot serialize private fields which BinaryFormatter is capable to do. You also need to be able to change the data object and base classes to add the necessary attributes, ctors to make it work with other serializers. Only BinaryFormatter serializes pretty much everything as long as the class has [Serializable] put on it. And last but not least it should support streams to read and write data from to it. JSON strings are a nice and efficient data storage format for small messages but reading a 200MB JSON file into a single string because the serializer is not supporting streams is creating a lot of work for the garbage collector. And you can start deserializing the data only when the complete file has been read. FastJSON is the only serializer which does not support streams which makes it not a good choice for larger messages.

    Below is a collection of widely used serializers and their most important feature properties summed up in a table:

    Serializer Type Data Format Private Members Stream Support .NET Core Support Default Ctor Needed To  Deserialize Explicit Object Graph Support
    BinaryFormatter Graph Binary Yes Yes Yes (NetStandard 1.6) No Enabled by Default.
    XmlSerializer Tree Xml Yes Yes Yes Yes public No
    DataContractSerializer Tree Xml Yes Yes Yes No
    new DataContractSerializer(typeof(TypeToSerialize), 
    new DataContractSerializerSettings  
    { 
       PreserveObjectReferences = true,
    });
    Jil Tree JSON No Yes Yes Yes public No
    FastJSON Graph JSON No No No Yes public Enabled by Default.
    Protobuf-Net Tree Binary
    Google Protocol Buffer
    Yes Yes Yes Yes* Declarative at ProtoMember level

    
    // * no default ctor needed if SkipConstructor=true 
    // Thanks Marc for that hint
    [ProtoContract(SkipConstructor=true)] 
    class DataForProtobuf
    {
      [ProtoMember(1, AsReference = true)]
       DataForProtobuf Parent;
    }
    
    JSON.NET Tree JSON Yes Yes Yes No
     
    NFX.SlimSerializer Graph Binary Yes Yes No No  
    Wire** Tree Binary Yes Yes Yes No new Serializer(new SerializerOptions(preserveObjectReferences:true))
    Crashes with StackoverflowException if cycles are present and preserveObjectReferences is not set!
    MsgPack.Cli*** Graph Binary Yes Yes Yes    
    MessagePackSharp*** Graph Binary Yes Yes Yes    
    GroBuf*** Tree Binary Yes Yes Yes    
    FlatBuffer*** Tree Binary Yes Yes Yes    
    ZeroFormatter*** Graph Binary Yes Yes Yes    
    Bois*** Graph Binary Yes Yes Yes    
    ServiceStack.Text*** Graph Json Yes Yes Yes    

    **Update1: Added Wire and MsgPack on request.

    ***Update 5/2018 Added MessagePackSharp, MessagePack.Cli, GroBuf, FlatBuffer, ZeroFormatter, Bois, ServiceStack. Removed MsgPack which is not maintained since 2011 and was slow anyway.

           Due to the lazy nature of ZeroFormatter and FlatBuffer the deserialized object properties are touched once to get a fair comparison.

    With the table above you can better judge which serializers could work for your scenario. Pretty much any serializer exchange will result in breaking changes to your serialized data format. The cost of a switch needs therefore be justified and you need either a way to migrate the data or you stay polymorphic by keeping your data objects and add the necessary support for the other serializer in place which gives you the ability to switch back to the old one for data migration. If you switch e.g. from XmlSerializer to DataContractSerializer both can write Xml as output but DataContractSerializer writes the serialized data never into Xml attributes which XmlSerializer pretty often does. That makes it impossible to switch over from either one without breaking the data exchange format.

    Besides the used data format, readability, and feature set the only real metric why one would want to switch over to another serializer is performance. There are many performance guides out there which measure one aspect of serializers but all of them I have found so far ignore important details during the measuring process. The following numbers are therefore the “real” ones measured on my machine with representative samples and many runs to average out random noise from the OS. There is one pretty good serialization testing framework out there which I have found after I have finished my analysis http://aumcode.github.io/serbench/ which was written by some smart guy who did write Pile ( https://www.infoq.com/articles/Big-Memory-Part-2). The claim there is that they use the fastest homegrown serializers to stuff objects into big byte arrays (pile) so they can support many GB large heaps (10s of GB inside one process) while keeping very small GC latencies because the objects are only deserialized on access and live only a short time which makes them Gen0 garbage quite fast. The web site contains many graphs http://aumcode.github.io/serbench/Specimens_Typical_Person/web/overview-charts.htm) which do not really make it clear how you can choose your best serializer. Personally I have found the way how the data is presented confusing. Compared to serbench my own tester is simpler to play with if you want to plug in your own serializer because you can directly edit the code from one self contained executable. My tester (https://github.com/Alois-xx/SerializerTests) also warns you if you are measuring debug builds or you are using a not NGenned baseline which for first call effects does not measure the actual serializer overhead but the amount of JITed code executed. Startup performance testing is complex and I believe my tester does the best job there.

    Tested Versions

    Serializer File Version
    BinaryFormatter 4.7.3062.0 built by: NET472REL1
    DataContract 4.7.3062.0 built by: NET472REL1
    XmlSerializer  4.7.3062.0 built by: NET472REL1
    FastJson 2.1.28.0
    Jil 2.15.4
    JsonNet 11.0.2.21924
    Protobuf_net 2.3.7.0
    SlimSerializer 3.0.0.1
    Wire 1.0.0.0
    MsgPack_Cli 0.9.144
    FlatBuffer 1.0.0.0
    GroBuf 1.3.0.0
    ZeroFormatter 1.6.4.0
    MessagePackSharp 1.7.3.4
    Bois 2.2.1.0
    ServiceStack 5.0.2.0

    Serializer Performance

    Here are the results of my own testing with various serializers. The numbers below were created with https://github.com/Alois-xx/SerializerTests which is my own serialization testing framework to get the numbers. The graph shows on the left axis the average time for the de/serialize operation per serializer. The tests were performed with one up to 1 million Book objects  where the average throughput was used. Since each serializer has a slightly different output format the data size to process changes a lot. The serialized data size is printed on the right axis where the size is shown from top to bottom to achieve better readability. The absolute performance is therefore a function of the efficiency of the used data format and the per object overhead to read/write the data. First time init effects were factored out since this is an extra metric we will shortly discuss. The numbers below are real throughput numbers for larger message packets, although the ordering which serializer performs best remains practically constant if 300K or 1000 objects are de/serialized. For smaller objects numbers the GC effects dominate the picture which makes it harder to get reliable numbers because you get results which depend on the GC state of the previous test which is not what I wanted to show here. I did sort by deserialization time because usually you are reading data from some storage which is most of the time the more important metric.

    All tests were performed on my Intel I7-4770K 3.5GHz on Windows 10 with .NET 4.7.2 x64 and .NET Core 2.0.6. The shown numbers are the sum of .NET Core and .NET Framework.

    image

    What is interesting that the not widely known Jil JSON serializer is able to serialize JSON data by far the fastest which is pretty close to protocol buffers although the serialized data size is 90% bigger. In general the serialization times are always faster than the deserialization times. If this is not the case the serializer is either deeply flawed or you have measured things wrong. From a high level perspective these operations must always happen:

    Serialization

  • Existing objects are traversed
  • Data is extracted and converted into the serialization format
  • The conversion process can involve some intermediary allocations
  • Data is written into a stream which is usually written to disk or the network

     

    Deserialization

  • Data is read from the stream
  • Data is split into sub tokens (involves memory allocations)
  • New objects are allocated and filled with tokenized data from input stream

     

    If you serialize a 1 GB in memory object into 200 MB on disk you are appending data to a stream which is flushed out to disk. But if you read 200MB from disk to deserialize you need to allocate 1GB of memory just to hold the objects in memory. While deserializing you are effectively GC bound because most of the time your thread is suspended by the GC which checks if some of your newly allocated objects are no longer needed. It is therefore not unusual to have high GC blocking times during deserialization but much less GC activity while serializing data to disk. That is also the reason why there is little point in making efficient deserialization multi threaded because you are now allocating large amounts of memory from multiple threads just to discover that the GC will block all of your threads for now much more often occurring GCs. But back to the serializer performance.

    Size Does Matter

    To fully understand the performance of an serializer one must also take into account the size of the serialized data. Below is the size of the serialized file for 100K Book objects shown:

    Graph Updated 5/2108

  • image

    The object definition was one Bookshelf to which N Books were added.

        [Serializable, DataContract, ProtoContract]
        public class BookShelf
        {
            [DataMember, ProtoMember(1)]
            public List<Book> Books
            {
                get;
                set;
            }
    
            [DataMember, ProtoMember(2)]
            private string Secret;
    
            public BookShelf(string secret)
            {
                Secret = secret;
            }
    
            public BookShelf()
            { }
        }
    
        [Serializable,DataContract,ProtoContract]
        public class Book
        {
            [DataMember, ProtoMember(1)]
            public string Title;
    
            [DataMember, ProtoMember(2)]
            public int Id;
        }

    Jil and JSON.NET are nearly equal but FastJSON is ca. 35% larger than the two others. Lets check out the serialized data for a Bookshelf with one Book inside it:

    Json.NET


    {“Secret”:”private member value”,”Books”:[{“Title”:”Book 1″,”Id”:1}]}


    Jil


    {“Books”:[{“Id”:1,”Title”:”Book 1″}]}


    FastJSON


    {“$types”:{“SerializerTests.TypesToSerialize.BookShelf, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null”:”1″,”SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null”:”2″},”$type”:”1″,”Books”:[{“$type”:”2″,”Title”:”Book 1″,”Id”:1}]}


    By comparing the output we find that Jil omits private members which is the reason why the Secret field value of BookShelf is missing. Besides that Jil and Json.NET have equal output. FastJSON emits for every title string an additional $type node which explains the bigger JSON output. It is educational to look at the other serialized data as well:

  • XmlSerializer


    <?xml version=”1.0″?>
    <BookShelf xmlns:xsd=”
    http://www.w3.org/2001/XMLSchema” xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance”>
      <Books>
        <Book>
          <Title>Book 1</Title>
          <Id>1</Id>
        </Book>
      </Books>
    </BookShelf>


    Data Contract Indented


    <?xml version=”1.0″ encoding=”utf-8″?>
    <BookShelf xmlns:i=”
    http://www.w3.org/2001/XMLSchema-instance” xmlns=”http://schemas.datacontract.org/2004/07/SerializerTests.TypesToSerialize”>
      <Books>
        <Book>
          <Id>1</Id>
          <Title>Book 1</Title>
        </Book>
      </Books>
      <Secret>private member value</Secret>
    </BookShelf>


    DataContract


    <BookShelf xmlns=”http://schemas.datacontract.org/2004/07/SerializerTests.TypesToSerialize” xmlns:i=”http://www.w3.org/2001/XMLSchema-instance”><Books><Book><Id>1</Id><Title>Book 1</Title></Book></Books><Secret>private member value</Secret></BookShelf>


    DataContract XmlBinaryDictionaryWriter


    @    BookShelfHhttp://schemas.datacontract.org/2004/07/SerializerTests.TypesToSerialize    i)http://www.w3.org/2001/XMLSchema-instance@Books@Book@Idƒ@Title™Book 1@Secret™private member value


    BinaryFormatter


       ÿÿÿÿ          FSerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null   *SerializerTests.TypesToSerialize.BookShelf   <Books>k__BackingFieldSecret’System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]]             private member value   ’System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]]   _items_size_version  ‘SerializerTests.TypesToSerialize.Book[]                          %SerializerTests.TypesToSerialize.Book         
       %SerializerTests.TypesToSerialize.Book   TitleId       Book 1
      


    Protobuf


    Book 1private member value


    SlimSerializer


      Êþâæ¡
    ÿäSerializerTests.TypesToSerialize.BookShelf, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
    ÿ¾System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]], mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089ÿ(private member value ÿæSerializerTests.TypesToSerialize.Book[], SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null|0~3 ÿ $6|0~3ÿÚSerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null ÿ Book 1


    MsgPack


    ‚¥Books„¦_items”‚¥Title¦Book 1¢IdÀÀÀ¥_size¨_version©_syncRootÀ¦Secret´private member value

    Wire


    ÿ;   SerializerTests.TypesToSerialize.BookShelf, SerializerTestsÿl   System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests]], mscorlib,%core%   ÿ6   SerializerTests.TypesToSerialize.Book, SerializerTests   Book 1private member value

     

    From that we see that Protocol Buffers and Wire have the smallest serialized size. For our serialized Bookshelf Wire looks less efficient but that is only the header which is so big. The serialized data is as small as the one of protocol buffers. Now you can understand the performance differences of DataContractSerializer depending on the used output format (see Data Contract Indented, DataContract and DataContract XmlBinaryDictionaryWriter). It also depends on the serialized data size if you become a factor 2 faster or not. Small is beautiful.  Protocol buffers deliver that with impressive numbers. SlimSerializer is pretty close to protocol buffers and can serialize pretty much anything with no extra attributes. Although it seems not to be able to serialize delegates. You should check if this one could work for you. Since it is not so widely used and lacks versioning support you should do thorough testing before choosing another serializer. Performance is one important aspect but correctness always beats performance.

    Serializer Init Times

    So which serializer is the fastest? As usual it depends on your requirements. Are you trying to load a configuration file faster during your application startup? In that case not only the throughput performance but first time init effects may matter more than the actual serializer performance. Lets try things out by serializing one object only and then quit.

    image

    Suddenly one of the fastest serializers is by a large margin the slowest one. To fully understand the serialization performance you need to take into account the serializer startup costs and the achieved throughput. If application startup is your main concern and you are loading only a few settings from a file you should think twice if Jil is really the right serializer for you. Unless if you change the environment a bit and you get a completely different chart:image

    This time Jil has become 240ms faster with no changes except that the test executable was NGenned with

    %windir%\Microsoft.NET\Framework64\v4.0.30319\ngen.exe install SerializerTests.exe

    That did precompile the executable and all referenced assemblies including Jil and Sigil which seem to have a lot of code running during the serializer initialization. If you are running on .NET Core you will find that the startup costs are much higher because nearly no dll is precompiled with crossgen.exe which is the .NET Core NGen pendant. Serializer startup costs are therefore dominated by JIT costs which can be minimized by precompiling your assembly which is pretty important if you do not only have great throughput but also good startup times. If you are deploying a not precompiled application you need to be aware of the greatly different startup costs. Taking only the serializer with the biggest throughput may be not the best idea.

    XmlSerializer Startup Time

    In the regular .NET Framework there is a special catch with XmlSerializer. If you instantiate XmlSerializer it will cost for the first type ca. 70ms but later invocations cost only ca. 14ms. Why is the first invocation of XmlSerializer so costly? As it turns out XmlSerializer creates a serialization assembly on the fly if it finds not a pregenerated one with the sgen tool (part of the .NET Framework SDK). In order to load it XmlSerializer will try to do an Assembly.Load(“YourAssembly.XmlSerializers, PublicKeyToken=xxxx, Version=…..“) which will fail with a FileNotFoundException if no pregenerated assembly exists. This assembly load attempt will trigger a GAC lookup which will call a method CheckMSIInstallAvailable:

    System.Xml.ni.dll!System.Xml.Serialization.XmlSerializer..ctor(System.Type, System.String)$##6001E85
    System.Xml.ni.dll!System.Xml.Serialization.TempAssembly.LoadGeneratedAssembly(System.Type, System.String, System.Xml.Serialization.XmlSerializerImplementation ByRef)$##60016C8
    mscorlib.ni.dll!System.Reflection.Assembly.Load(System.Reflection.AssemblyName)$##60040EC
    mscorlib.ni.dll!System.Reflection.RuntimeAssembly.InternalLoadAssemblyName(System.Reflection.AssemblyName, System.Security.Policy.Evidence, System.Reflection.RuntimeAssembly, System.Threading.StackCrawlMark ByRef, IntPtr, Boolean, Boolean, Boolean)$##600415D
    clr.dll!AssemblyNative::Load
    clr.dll!AssemblySpec::LoadAssembly
    clr.dll!AssemblySpec::LoadDomainAssembly
    clr.dll!AppDomain::BindAssemblySpec
    clr.dll!AssemblySpec::FindAssemblyFile
    clr.dll!AssemblySpec::LoadAssembly
    clr.dll!FusionBind::RemoteLoad
    clr.dll!CAssemblyName::BindToObject
    clr.dll!CAssemblyDownload::KickOffDownload
    clr.dll!CAssemblyDownload::DownloadComplete
    clr.dll!CAssemblyDownload::DownloadNextCodebase
    clr.dll!CAsmDownloadMgr::ProbeFailed
    clr.dll!CheckMSIInstallAvailable
    clr.dll!MSIProvideAssemblyPeek
    clr.dll!MSIProvideAssemblyGlobalPeek
    clr.dll!MSIProvideAssemblyPeekEnum

    That code is not part of .NET Core and also not of the SSCLI implementation. But with profiling it is not too hard to figure out what is really happening. By using ETW with Registry tracing we can see that the first failed assembly load failure is the most costly one:

    image

    Internally CheckMSIInstallAvailable will read all registry values below  HKEY_LOCAL_MACHINE\SOFTWARE\Classes\Installer\Assemblies\Global which by pure coincidence contains all registered assemblies from the GAC:

    image

    That minor implementation detail causes the noticed 44ms delay because CheckMSIInstallAvailable will first cache the GAC contents from the registry which needs 44ms on first access. It is not correct to attribute the time of the failed assembly load attempt to the startup costs of XmlSerializer because it happens only once for the first assembly load failure. So what is the correct XmlSerializer startup cost? If you have many different XmlSerializer instances during application startup only the first one will pay the high 70ms startup costs. All subsequent instantiation’s will cost  around 15ms per type which is much cheaper than one would expect by a single measurement. By pregenerating the code with sgen one can reduce the startup costs even further to ca. 1ms per type but the first assembly load will still cost around 19ms even when it is successful.

    image

    Before .NET 4.5 XmlSerializer did also spawn csc.exe to compile the code on the fly which is luckily no longer the case. During these “old” days XmlSerializer was costing up to 200ms startup costs per type. The usage of sgen was therefore absolutely necessary but in todays fast moving world old performance truth no longer hold true. Startup costs are non trivial to measure so beware.

    Multi Targeting .NET Executables and Precompiling .NET Core Assemblies

    Precompiling binaries with .NET Core is not very straightforward yet and I think things will change quite a bit in the future. But there is the approach that I have found to work. You can create an executable which targets desktop .NET and .NET Core inside one .csproj file with VS 2017 and later. A typical .csproj which targets .NET 4.5.2 and .NET Core 1.1 contains by semicolon separated the <TargetFrameworks> and for .NET Core the <RuntimeIdentifiers> which are downloaded when the nuget packages are restored which contains the platform dependent .NET Core dlls. When you compile this binary it is compiled two times. Once as regular .NET Desktop exe and another time as .NET Core dll which can be executed in the bin folder with

    dotnet xxxx.dlll

     

    <Project Sdk="Microsoft.NET.Sdk">
      <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFrameworks>netcoreapp1.1;net452</TargetFrameworks>
        <RuntimeIdentifiers>win7-x64</RuntimeIdentifiers>
      </PropertyGroup>
      <ItemGroup>
        <PackageReference Include="Jil" version="2.15.0" />
        <PackageReference Include="protobuf-net" Version="2.1.0" />
        <PackageReference Include="Sigil" version="[4.7.0.0]" />
        <PackageReference Include="System.Xml.XmlSerializer" version="*" />
        <PackageReference Include="System.Runtime.Serialization.Xml" version="*" />
      </ItemGroup>
    </Project>
    

    The binaries are put into a target framework dependent folder

    image

    image

    If you download .NET Core it will only contain one precompiled binary. To precompile everything you need to take the dlls of

    C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.1.1

    to your application binary folder and then call

    %USERPROFILE%\.nuget\packages\runtime.win7-x64.microsoft.netcore.runtime.coreclr\1.1.1\tools\crossgen.exe /JITPath “C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.1.1\clrjit.dll” /Platform_Assemblies_Paths “C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.1.1“;%USERPROFILE%\.nuget\packages\System.Xml.XmlSerializer\4.3.0\lib\netstandard1.3

    where you need to append the path of the referenced Nuget packages to make everything work. If things are not working correctly you can enable “Fusion” logging by setting the environment variable

    COREHOST_TRACE=1

    That and Windbg of course will give you more hints. Precompiling things in .NET Core is still a lot of try and error until everything works and I am not sure if this is the currently recommended way.

    Conclusions

    Measuring and understanding the performance aspects of serializers is quite complex. For some reason the measured numbers by the library authors of public serializers seem to prove that their serializer is the fastest one. Since I have no affiliations with any of the library maintainers the presented tests should be the most neutral one while I was trying hard to make no obvious errors in my testing methodology. If you want to migrate from an existing type hierarchy Protocol buffers and SlimSerializer look like a fast replacement to BinaryFormatter. Jil is great if you serialize the public API surface of your data objects and you do not need to serialize private fields or properties. Despite its claims FastJSON turned out in no metric to be leading in these tests. If I have made an error there please drop me a note and I will correct the data. BinaryFormatter is has a hideous O(n^2) deserialize time complexity which no one seems to have written about in public yet. At least with .NET Core things will become better. If you are deserializing larger object graphs you know now why the deserialization time takes up to 40 minutes. Before trying out a new fastest serializer be sure to measure by yourself and do not choose serializers which have fast in their name. There is a Fastest.Json serializer on Nuget which crashes the .NET Execution engine during serialization and the library author did never bother to implement the deserialize part. That’s all for today.