It Is Time For Time Travel

Windbg is one of the most powerful yet underused tools in every Windows developer's toolbox. Some time ago a new, fancier version of Windbg (Windbg Preview), which looks like a modern application, was released as a Windows Store app. So far the UI had got a nice ribbon, but under the hood no real changes were visible. Today a new version was released which supports a tool that has long been used internally: the time travel tracer (TTT).

The official documentation https://docs.microsoft.com/de-de/windows-hardware/drivers/debugger/debugging-resources is quite good and here is the blog post announcing Time Travel Debugging https://blogs.msdn.microsoft.com/windbg/2017/09/25/time-travel-debugging-in-windbg-preview/.

If you have a hard to debug race condition you can now attach Windbg and check "Record process with Time Travel Debugging":

image

The trace file can grow to many GB in size, which can make it cumbersome if you want to reproduce an error which needs some time to surface. Luckily Windbg, although a Store app, is still xcopy deployable. Find out with Process Explorer where the debugger exe sits and copy the contents of the directory C:\Program Files\WindowsApps\Microsoft.WinDbg_1.0.13.0_x86__8wekyb3d8bbwe\ to your own tool folder like Windbg_1.0.13.0. Then you can run it e.g. from a memory stick or a network share as usual. After searching a bit in the debugger directories you will find the command line version of the Time Travel Trace tool located at

  • x64  Windbg_1.0.13\amd64\TTD\TTD.exe
  • x86 Windbg_1.0.13\amd64\TTD\wow64\TTD.exe

Now you can go to the machine where the problem occurs without the need to install a Store app, which might not be possible due to corporate policy/firewall/isolated test network, …. To record a time travel trace from the command line I normally use it in ring buffer mode with a 2 GB buffer, which should cover a few seconds or minutes depending on the application activity.

 

D:\Tools\Windbg_1.0.13\amd64\TTD>ttd -ring -maxfile 2048 -out c:\temp\test.run -launch D:\cppError.exe
Microsoft (R) TTD 1.01.02
Release: 10.0.16366.1000
Copyright (C) Microsoft Corporation. All rights reserved.

Launching D:\cppError.exe

cppError.exe(x86) (1040): Tracing stopped after 31ms
  Ring trace dumped to c:\temp\test.run

 

You get a 2 GB file although the actually recorded data might be much smaller. If you have a short repro it might be better to skip the ring buffer setting.

image

Once you have the data it is time to leave the crime scene, pack the .run file and analyze it on your favorite developer machine. You can double click the .run file or fire up Windbg and select the Open Trace option. Once you have loaded the trace you can press g to let the application run until the first exception happens, or you can set breakpoints. If nothing is set the application will stop at the first exception with the actual source window and the current values of the local variables:

image

We find that whenever i is 5 we run into an issue, which you could also have found with a memory dump. But now you can travel back in time by entering p- to check the values just before the crash. This is incredibly powerful for finding the root cause of how you got into a situation. If Time Travel Debugging works it is a great tool. Just keep in mind that it makes the application around 10x or more slower. You should not expect great performance while Time Travel recording is running.

As far as I can tell, Time Travel Tracing looks like it is built upon the Intel Processor Tracing feature (https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing), which enables recording the full instruction stream along with all touched memory, a really powerful capability.

Wishes

With every great tool there are things left to be desired.

  • Windbg supports no managed source code window
    • No managed breakpoint setting in source code
  • No managed local variables
  • No managed call stack
  • SOS.dll seems not to work at all with time travel traces

The debugger shell seems to exist only in the x64 flavor, which makes it impossible to load SOS.dll for 32 bit applications into Windbg because the bitness of SOS.dll must match the bitness of the debugger executable. When I try to get a mixed mode stack inside Windbg, SOS.dll can be loaded but it seems to be missing vital information. I really would like to use Windbg with time travel tracing support for managed code (regular .NET Framework and .NET Core), but until now this is a tool for C/C++ fans only.

Conclusions

Time Travel Tracing is a novel debugging technique which enables us developers to solve otherwise nearly impossible to find data races in a deterministic way. I hope that the Windbg team will add managed code support in the near future, which would bring back feature parity between languages. Why should only C/C++ developers get all the great tools?


MemAnalyzer v2.5 Released

Managed (and unmanaged) memory leak detection is still not as easy as it should be. MemAnalyzer tries to be a simple command line tool with the goal to help with the easy problems but also with the hard issues which surface only after hours of stress testing. Often you have a mixture of managed and unmanaged memory leaks where you need memory dumps and VMMap information to get the full picture. MemAnalyzer helps to automate these boring tasks. It is open source at GitHub (https://github.com/Alois-xx/MemAnalyzer). The executable can be downloaded here: https://github.com/Alois-xx/MemAnalyzer/releases.

If you are looking for a nice UI to look into memory dumps I recommend MemoScope.NET (https://github.com/fremag/MemoScope.Net), which lets you look into dump files without the need to resort to Windbg. By the way, if you have Visual Studio Ultimate you can already analyze managed memory dumps, but to analyze issues in production installing Visual Studio is not an option. PerfView on the other hand is a great tool, but the numbers are only approximations, which can make it hard to spot handle leaks. The object counts reported by PerfView are often off by an order of magnitude. MemAnalyzer tries to get exact metrics of the real alive objects with the -live switch, which is as good as !DumpHeap -stat -live of Windbg, as CSV output.

MemAnalyzer Features

  • Single self contained executable
  • Supports x86 and x64 processes and memory dumps
  • .NET Core on Windows x86 and x64 support (.NET Core 1.0,1.1 and 2.0, …)
  • Create memory dumps with associated VMMap data
  • Analyze managed heap, unmanaged, private bytes and file mappings when VMMap is present
  • Memory dump diff
  • Optional CSV output

Usage – Leak Detection

Why bother with a command line tool if nicer UIs are around? Because MemAnalyzer is capable of tracking not only your managed memory but also the unmanaged parts. When a managed application leaks memory you first need to find out if the leak happens on the managed heap or somewhere else. Depending on the leaked memory type one needs different approaches/tools to track the leak down.

The memory consumption of a process can be categorized as

  • Managed Heap
  • Unmanaged Heap
  • Private Bytes
  • Page File Allocated Shared Memory (Shareable in VMMap lingo)
  • Memory Mapped Files

Since there are quite a few different memory types inside a process it is important to know where you need to look. MemAnalyzer uses VMMap to determine the size of each region and prints them out in a nice summary which can be put into a CSV file, e.g. to get a summary after each test run during an automated test.

C>MemAnalyzer.exe -pid 17888 -vmmap

AllocatedBytes          Instances(Count)        Type
4,648,204               105,374                 System.String
918,824                 22                      System.String[]
697,640                 27,607                  System.Object[]
662,424                 27,601                  System.Int32
1,512                   27                      System.RuntimeType
1,072                   2                       System.Globalization.CultureData
830                     5                       System.Char[]
580                     8                       System.Int32[]
432                     2                       System.Globalization.NumberFormatInfo
26,130                  1,087                   Managed Heap(Free)!
6,936,367               160,704                 Managed Heap(Allocated)!
7,158,288                                       Managed Heap(TotalSize)
25,165,824                                      Reserved_Stack
54,398,976                                      Committed_Dll
1,994,752                                       Committed_Heap!
4,177,920                                       Committed_MappedFile!
565,248                                         Committed_Private!
3,825,664                                       Committed_Shareable!
73,629,696                                      Committed_Total
17,499,952                                      Allocated(Total)
  • Allocated managed objects. That is very similar to !DumpHeap -stat in Windbg, only with more options.
    • If you add -live then the metric will not contain temporary objects which have not yet been reclaimed by the GC.
  • Managed heap summary which shows an overall metric how big the heap is and how much of it is allocated and free.
  • Additional VMMap information that gives you an overview which other memory types are allocated in the process.
    • MemAnalyzer needs the VMMap tool in the path to get that data.
  • Allocated = Managed Heap(Allocated) + Heap + MappedFile + Private Bytes + Shareable

The Allocated value is important because if this value rises over time you have a leak in one of the memory types of the sum. If you print this value over time and it does not rise, you have no leak (warning: simplified!). That is simple enough to do repeatedly, e.g. from a script, to verify that your long running test behaves well. Since repeated measurements are key to detecting a memory leak, MemAnalyzer allows you to append the output to a CSV file along with some context, e.g. Iteration 1, 100, to get more clues.

Inside your tracking script a more realistic command line would be

MemAnalyzer -pid {0} -vmmap -o leak.csv -dtn 5000;N#200 -live -silent  -context “{1}”

This will append the output of -dtn (Dump Type by Number) for up to 5K types with an instance count > 200 to the CSV file leak.csv. Each line will get a context column which can be e.g. your test run number or whatever makes it easier to correlate the time when the data was taken. To get additional information you can add automatic memory dumps to the mix with

MemAnalyzer -procdump -ma {0} {1}\PROCESSNAME_{0}_YYMMDD_HHMMSS.dmp

This will take a memory dump of the process with pid {0} with procdump (needs to be in the path) and also gather VMMap information automatically (VMMap needs to be in the path). The big letter words will be expanded by procdump automatically. That way you can e.g. take a full memory dump after 1, 10, 100, 500 iterations which contains everything, while the trending data is contained in the CSV file for every iteration, which makes it much easier to track down the real memory leaks. Based on personal experience it is pretty easy to be led down the wrong path by a few memory dumps created by coworkers. The first dump might be created before anything was loaded into the application, and the last dump might still have the test data loaded, which looks like a pretty big leak but is not the leak you are after when you have lost 500 MB after 100 iterations. Having more data points at hand which can easily be graphed in Excel is a big help to concentrate on the important incidents and to identify stable patterns and trends without the need to take a gazillion memory dumps.
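A minimal sketch of such a tracking script, written as a small C# driver (names, the pid 4711 and the dump folder are made up for illustration; MemAnalyzer.exe, procdump and vmmap are assumed to be in the path):

using System;
using System.Diagnostics;

class LeakTracker
{
    static void RunMemAnalyzer(string arguments)
    {
        using (var process = Process.Start(new ProcessStartInfo("MemAnalyzer.exe", arguments)
        {
            UseShellExecute = false
        }))
        {
            process.WaitForExit();
        }
    }

    static void Main()
    {
        const int pid = 4711;   // pid of the process under test (example value)

        for (int iteration = 1; iteration <= 100; iteration++)
        {
            RunTestScenario(iteration);   // drive your own test step here

            // Append per-type instance counts with the iteration number as context
            RunMemAnalyzer($"-pid {pid} -vmmap -o leak.csv -dtn 5000;N#200 -live -silent -context \"Iteration {iteration}\"");

            if (iteration == 1 || iteration == 10 || iteration == 100)
            {
                // Full memory dump plus VMMap data for selected iterations
                RunMemAnalyzer($"-procdump -ma {pid} c:\\dumps\\PROCESSNAME_{pid}_YYMMDD_HHMMSS.dmp");
            }
        }
    }

    static void RunTestScenario(int iteration)
    {
        // Load/unload data or execute the use case you suspect of leaking.
    }
}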

Usage – Memory Optimization

If you want to optimize the memory usage of an existing application, MemAnalyzer is also a big help because you can quickly diff a memory dump which is your baseline against the currently running application. To get started you should take a memory dump of your current state.

MemAnalyzer -procdump -ma pid C:\temp\Baseline.dmp

After you have optimized the data structures of your application to (hopefully) consume less memory you can compare the running application against your saved baseline

MemAnalyzer -f2 baseline.dmp -pid ddd

When you use -f2 then the diff is calculated as the -f2 data minus the first data source (-f or -pid), and you get a nice diff summary output. To keep the output short the diff is sorted by absolute values, which makes it easy to spot the top memory additions and deletions along with the totals.

Let's do a step by step example of what that means for your usual application development workflow. First we start with our memory hungry application and isolate the memory issue into a single reproducer like this:

using System;
using System.Collections.Generic;

namespace coreapp
{
    class DataInstance : IDisposable
    {
        Func<string> Checker;
        long Instance;
        bool IsDisposed;
        DataInstance[] Childs;

        public DataInstance(int instance)
        {
            Instance = instance;
            Checker = () => $"Instance {Instance} already disposed";
        }

        public void Dispose()
        {
            if (IsDisposed)
            {
                throw new ObjectDisposedException(Checker());
            }
            IsDisposed = true;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var instances = new List<DataInstance>();
            for(int i=1;i<=5*1000*1000;i++)
            {
                instances.Add(new DataInstance(i));
            }
            Console.ReadLine();
        }
    }
}

We give MemAnalyzer the process id to create a baseline memory dump. Since MemAnalyzer uses procdump and VMMap you should have both already downloaded and in your path to make it work.

MemAnalyzer.exe -procdump -ma 11324 DotNetCoreApp_1.0.dmp

Ok we have a dump of a .NET Core application. How can we look into it?

 

MemAnalyzer.exe -f DotNetCoreApp_1.0.dmp
Error: Is the dump file opened by another process (debugger)? If yes close the debugger first.
       If the dump comes from a different computer with another CLR version v1.0.25211.02 that you are running on your machine you need to download the matching mscordacwks.dll first. Check out https://1drv.ms/f/s!AhcFq7XO98yJgoMwuPd7LNioVKAp_A and download the matching version/s.
       Then set _NT_SYMBOL_PATH=PathToYourDownloadedMscordackwks.dll  e.g. _NT_SYMBOL_PATH=c:\temp\mscordacwks in the shell where you did execute MemAnalyzer and then try again.

Got Exception: System.IO.FileNotFoundException: mscordaccore_Amd64_Amd64_1.0.25211.02.dll

 

Oops, we got an error. Most people stop reading when an error occurs because error messages are most often not that helpful. But this case is different. You need to download the contents of the OneDrive folder linked in the error message to get nearly all .NET/Core debugging dlls you could ever need. Download them into e.g. C:\PerfTools. Then you need to tell MemAnalyzer where to look for them with the -dacdir option, or you can set the environment variable _NT_SYMBOL_PATH=c:\PerfTools to get rid of the need to specify the dac directory manually every time.

MemAnalyzer.exe  -dts -f DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools

AllocatedBytes          Instances(Count)        Type
320,000,000             5,000,000               System.Func<System.String>
240,000,000             5,000,000               coreapp.DataInstance
100,663,368             3                       coreapp.DataInstance[]
24,530                  145                     System.String
33,627,594              139                     Managed Heap(Free)!
660,714,944             10,000,277              Managed Heap(Allocated)!
694,348,008                                     Managed Heap(TotalSize)

We have 660 MB allocated on the managed heap, which is quite a lot of data. There are 5 million Func<string> and DataInstance instances. But why do we have 3 DataInstance arrays with 100 MB? These look like temp arrays left over from our List<DataInstance> while growing the internal array. To get rid of garbage data you can either do a GC.Collect() before taking the dump or tell MemAnalyzer to only track objects which are still alive.
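If you go the GC.Collect() route, the collection has to happen inside the examined process right before the dump is taken; the usual full blocking collection is only a few lines (sketch below), whereas the -live switch used in the next command needs no cooperation from the target process at all.

// Run inside the target process before the dump is taken, e.g. behind a debug command.
GC.Collect();                     // collect all generations
GC.WaitForPendingFinalizers();    // let pending finalizers run
GC.Collect();                     // collect the objects released by the finalizers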

MemAnalyzer.exe  -f DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools -live

AllocatedBytes          Instances(Count)        Type
320,000,000             5,000,000               System.Func<System.String>
240,000,000             5,000,000               coreapp.DataInstance
67,108,912              2                       coreapp.DataInstance[]
24,530                  145                     System.String
627,160,448             10,000,275              Managed Heap(Allocated)!
694,348,008                                     Managed Heap(TotalSize)

There is still one array left which does not belong there, but the numbers are now better. While looking at the data I decided that we should get rid of the many delegate instances, which cost 64 bytes per instance and add up to 320 MB for the instances alone. But since the DataInstance object also keeps a reference (8 bytes on x64) we have even more memory to spare. If we get rid of the delegate and remove the class member we should be able to save 5M*(64+8) = 360 MB of memory. That's a plan. Let's measure things. Our refactored class becomes

    class DataInstance : IDisposable
    {
        long Instance;
        bool IsDisposed;
        DataInstance[] Childs;

        public DataInstance(int instance)
        {
            Instance = instance;
        }

        public void Dispose()
        {
            if (IsDisposed)
            {
                throw new ObjectDisposedException($"Instance {Instance} already disposed");
            }
            IsDisposed = true;
        }
    }

By taking a second dump we can diff both dump files with

MemAnalyzer.exe -f DotNetCoreApp_NoFuncDelegate.dmp -f2 DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools

Delta(Bytes)    Delta(Instances)        Instances       Instances2      Allocated(Bytes)        Allocated2(Bytes)       AvgSize(Bytes)  AvgSize2(Bytes) Type
320,000,000     5,000,000               0               5,000,000       0                       320,000,000                             64              System.Func<System.String>
40,000,000      0                       5,000,000       5,000,000       200,000,000             240,000,000             40              48              coreapp.DataInstance
0               0                       1               1               160                     160                     160             160             System.Globalization.CalendarData
360,000,000     5,000,000               5,000,277       10,000,277      300,714,930             660,714,930                                             Managed Heap(Allocated)!
360,010,320     0                       0               0               334,337,688             694,348,008                                             Managed Heap(TotalSize)

As expected we got rid of 5 million Func<String> instances. After removing one field in DataInstance the instance size shrank by 8 bytes from 48 down to 40 bytes, which saved another 40 MB. That is already quite good. But can we do better? The dispose check is an extra bool flag which needs 4 bytes anyway due to padding. To eliminate the bool field we can reuse the Instance field and negate its value on Dispose, so we keep the stored value, which is always > 0. When you look closely you find that Instance is of type long, but we only need an int because we always assign the value from an integer in the ctor. The revised DataInstance class is now

    class DataInstance : IDisposable
    {
        int Instance;
        DataInstance[] Childs;

        public DataInstance(int instance)
        {
            Instance = instance;
        }

        public void Dispose()
        {
            if (Instance < 0)
            {
                throw new ObjectDisposedException($"Instance {-1*Instance} already disposed");
            }

            Instance *= -1; 
        }
    }

When we diff things again

MemAnalyzer.exe -f DotNetCoreApp_NoFuncDelegate_intFieldNoDisposeFlag.dmp -f2 DotNetCoreApp_1.0.dmp -dacdir c:\PerfTools

Delta(Bytes)    Delta(Instances)        Instances       Instances2      Allocated(Bytes)        Allocated2(Bytes)       AvgSize(Bytes)  AvgSize2(Bytes) Type
320,000,000     5,000,000               0               5,000,000       0                       320,000,000                             64              System.Func<System.String>
80,000,000      0                       5,000,000       5,000,000       160,000,000             240,000,000             32              48              coreapp.DataInstance
33,554,456      1                       2               3               67,108,912              100,663,368             33554456        33554456        coreapp.DataInstance[]
24              1                       1               2               24                      48                      24              24              System.Int32
0               0                       2               2               208                     208                     104             104             System.Globalization.CultureInfo
0               0                       2               2               912                     912                     456             456             System.Globalization.CultureData
433,554,480     5,000,002               5,000,275       10,000,277      227,160,450             660,714,930                                             Managed Heap(Allocated)!
400,011,856     0                       0               0               294,336,152             694,348,008                                             Managed Heap(TotalSize)

Since we compare against the original baseline we directly see the improvement in memory consumption of 433 MB. That is 65% less memory! Not bad. If you want to keep going fast you can directly compare a memory dump against a running process to check if a temporary optimization pays off. I have found the VS profiler to break when larger x86 applications were profiled, because VS seems to load the data into an x86 process as well, where the fancier object graph calculation breaks because VS runs out of memory…

VS 2017.3 does not yet recognize CoreClr memory dumps as managed processes, so managed heap analysis of .NET Core dumps still requires PerfView, Windbg or MemAnalyzer.

SOS and mscordacwks, mscordaccore Collection

Even if you are not interested in MemAnalyzer you might stop by for the biggest collection of SOS and mscordacwks debugging dlls for all .NET versions I could get my hands on. When you analyze memory dumps taken from other machines you need a close version match within Windbg or an exact version match with PerfView / ClrMd. Inside Microsoft this is a non issue because their symbol servers distribute the matching binaries without any hassle. We outsiders have to copy the corresponding debugging libraries from the original machine or from the corresponding .NET installer. To spare you the time to hunt for the matching debugging dlls I share my collection of mscordacwks files as a OneDrive link: https://1drv.ms/f/s!AhcFq7XO98yJgoMwuPd7LNioVKAp_A

Currently it contains the versions

2.0.50727.1434
2.0.50727.3607
2.0.50727.3615
2.0.50727.4016
2.0.50727.4062
2.0.50727.4200
2.0.50727.4408
2.0.50727.4455
2.0.50727.4927
2.0.50727.4952
2.0.50727.5403
2.0.50727.5420
2.0.50727.5444
2.0.50727.5448
2.0.50727.5456
2.0.50727.8009
4.0.30319.01
4.0.30319.1008
4.0.30319.1022
4.0.30319.17379
4.0.30319.18052
4.0.30319.18408
4.0.30319.18444
4.0.30319.2034
4.0.30319.225
4.0.30319.237
4.0.30319.269
4.0.30319.296
4.0.30319.33440
4.0.30319.34003
4.0.30319.34011
4.0.30319.34014
4.0.30319.34209
4.0.30319.46960
4.6.100.00
4.6.1085.00
4.6.127.01
4.6.1586.00
4.6.1590.00
4.6.1637.00
4.6.81.00
4.6.96.00
4.7.2053.00

.NET Core

.NET Core 1.0     1.0.25211.02
.NET Core 1.1     4.6.25211.01
.NET Core 2.0 x64 4.6.25519.02
.NET Core 2.0 x86 4.6.25519.03

It is interesting to note that .NET Core 2.0 has different build numbers for the x86 and x64 versions. It looks like one blocking issue needed fixing before they released it to a wider audience.

Conclusions

Your toolbox has just got a little bigger. As always, use the right tool for the job. MemAnalyzer is not the silver bullet for all of your memory problems, but it tries its best to give you fast feedback without the overhead of a fancy UI, which makes it easy to put into your existing leak tracking/reporting scripts. If you want to share success stories, sound off in the comments. If you want to report bugs/issues please open an issue at https://github.com/Alois-xx/MemAnalyzer/issues. Now go and improve the memory footprint of your app!

The Definitive Serialization Performance Guide

When looking at performance issues with ETW I found long deserialization times in conjunction with BinaryFormatter. A deeper look revealed that the issue is easy to reproduce if the object graph gets bigger (>100K objects). Since BinaryFormatter has been in business for over 10 years and I had never heard of significant performance issues, except that it is slow in general, I was quite surprised that such a blatant problem still exists in .NET. But since .NET is open sourced at GitHub it is easy to complain: https://github.com/dotnet/corefx/issues/16991. This caught the interest of Stephen Toub himself and a small but impactful change was added to BinaryFormatter:

image

The problem during deserialization with BinaryFormatter is that it uses the MaxArraySize value to bucket an ObjectHolder array like a hash table.

// ObjectHolder bucket array with a fixed size of MaxArraySize = 0x1000 = 4096
ObjectHolder[] objects = new ObjectHolder[MaxArraySize];

class ObjectHolder
{
  internal ObjectHolder m_next;
}

If we need to store one million objects in one ObjectHolder array with a length of 4096 we end up with 4096 ObjectHolder linked lists, chained via the ObjectHolder.m_next field, each with a depth of ca. 244 nodes. When you try to access a specific object during deserialization you need to walk a large part of such a linked list, so with n objects each lookup touches on the order of n/4096 nodes and deserializing all of them costs on the order of n²/4096 link traversals. By increasing the value to 1 048 576 (= 0x100000) we are able to use the array as a real hash table where only a few collisions make it necessary to look up the next item in the linked list.

The next .NET Core version will have a band aid on it so that the issue will appear only with even bigger object graphs. With the current BinaryFormatter you get a nice parabola where the serialization time for one million objects is only 2s but the deserialization time is on the order of 80s!
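If you want to reproduce the parabola shown below yourself, a minimal sketch looks like this (the Item type is made up for the test; the absolute times depend on your machine and .NET version):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
class Item
{
    public int Id;
}

class BinaryFormatterRepro
{
    static void Main()
    {
        var items = new List<Item>();
        for (int i = 0; i < 1000 * 1000; i++)   // object graphs > 100K objects show the slowdown
        {
            items.Add(new Item { Id = i });
        }

        var formatter = new BinaryFormatter();
        using (var stream = new MemoryStream())
        {
            var sw = Stopwatch.StartNew();
            formatter.Serialize(stream, items);
            Console.WriteLine($"Serialize:   {sw.Elapsed.TotalSeconds:F1} s");

            stream.Position = 0;
            sw.Restart();
            formatter.Deserialize(stream);       // this is where the ObjectHolder chains hurt
            Console.WriteLine($"Deserialize: {sw.Elapsed.TotalSeconds:F1} s");
        }
    }
}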

image

After the fix in .NET Core you can serialize object graphs up to ca. 13 million objects before you hit the next limitation of BinaryFormatter:

Unhandled Exception: System.Runtime.Serialization.SerializationException: Exception of type ‘System.Runtime.Serialization.SerializationException’ was thrown.
   at System.Runtime.Serialization.ObjectIDGenerator.Rehash() in D:\Source\vc17\NetCoreApp\ConsoleApp1\ConsoleApp2\Serialization\ObjectIDGenerator.cs:line 140

This time BinaryFormatter is running out of prime numbers for another hash table. If we try to serialize object graphs with more than 2*6 584 983 objects we are out of luck again, because the ObjectIDGenerator never expected us to serialize more than 2*6584983 objects.

public class ObjectIDGenerator
{
    // Table of prime numbers to use as hash table sizes. Each entry is the
    // smallest prime number larger than twice the previous entry.
    private static readonly int[] s_sizes = 
    {
        5, 11, 29, 47, 97, 197, 397, 797, 1597, 3203, 6421, 12853, 25717, 51437,
        102877, 205759, 411527, 823117, 1646237, 3292489, 6584983
    };

But hey, that is ok. No one ever did that successfully until now. Since there are not many complaints (http://stackoverflow.com/questions/569127/serializationexception-when-serializing-lots-of-objects-in-net) about that one, I guess everyone has simply moved on and switched to a faster serializer. Besides that, I wonder if the affected people ever profiled their application to find out why deserialization was taking ca. 45 minutes for a ~300 MB file.

What Types of Serializers Exist?

When you want to switch away from BinaryFormatter you first need to check how your data is structured. If your data can contain cyclic references then you have fewer options, because most serializers are by default tree serializers which cannot cope with object graphs. Another downside might be that your target serializer cannot serialize private fields, which BinaryFormatter is capable of doing. You also need to be able to change the data objects and base classes to add the necessary attributes and ctors to make them work with other serializers. Only BinaryFormatter serializes pretty much everything as long as the class has [Serializable] on it. And last but not least it should support streams to read and write data from and to. JSON strings are a nice and efficient data storage format for small messages, but reading a 200 MB JSON file into a single string because the serializer does not support streams creates a lot of work for the garbage collector, and you can start deserializing the data only when the complete file has been read. FastJSON is the only serializer here which does not support streams, which makes it a poor choice for larger messages.
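To make the graph versus tree distinction concrete, here is a tiny made-up type with a back reference; a pure tree serializer will either throw or follow the cycle forever unless object references are explicitly preserved:

using System;
using System.Collections.Generic;

[Serializable]
class Node
{
    public Node Parent;                              // back reference creates a cycle
    public List<Node> Children = new List<Node>();

    public static Node CreateCycle()
    {
        var root = new Node();
        var child = new Node { Parent = root };      // root -> child -> root
        root.Children.Add(child);

        // BinaryFormatter (a graph serializer) handles this out of the box; tree
        // serializers need the object reference options listed in the table below.
        return root;
    }
}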

Below is a collection of widely used serializers and their most important feature properties summed up in a table:

Serializer             | Default Type | Data Format                      | Private Members | Stream Support | .NET Core Support     | Default Ctor Needed | Explicit Object Graph Support
BinaryFormatter        | Graph        | Binary                           | Yes             | Yes            | Yes (NetStandard 1.6) | No                  | Enabled by default
XmlSerializer          | Tree         | Xml                              | No              | Yes            | Yes                   | Yes (public)        | No
DataContractSerializer | Tree         | Xml                              | Yes             | Yes            | Yes                   | No                  | Via DataContractSerializerSettings (see below)
Jil                    | Tree         | JSON                             | No              | Yes            | Yes                   | Yes (public)        | No
FastJSON               | Graph        | JSON                             | No              | No             | No                    | Yes (public)        | Enabled by default
Protobuf-Net           | Tree         | Binary (Google Protocol Buffers) | Yes             | Yes            | Yes                   | Yes*                | Declarative at ProtoMember level (see below)
JSON.NET               | Tree         | JSON                             | Yes             | Yes            | Yes                   | No                  | Does not completely work, see https://github.com/JamesNK/Newtonsoft.Json/issues/1284 (settings below)
NFX.SlimSerializer     | Graph        | Binary                           | Yes             | Yes            | No                    | No                  |
Wire**                 | Tree         | Binary                           | Yes             | Yes            | Yes                   | No                  | new Serializer(new SerializerOptions(preserveObjectReferences:true)); crashes with a StackOverflowException if cycles are present and the option is not set!
MsgPack**              | Graph        | Binary                           | Yes             | Yes            | No                    |                     |

Object graph support for DataContractSerializer:

new DataContractSerializer(typeof(TypeToSerialize),
    new DataContractSerializerSettings
    {
        PreserveObjectReferences = true,
    });

Object graph support for Protobuf-Net:

// * no default ctor needed if SkipConstructor=true
// Thanks Marc for that hint
[ProtoContract(SkipConstructor=true)]
class DataForProtobuf
{
    [ProtoMember(1, AsReference = true)]
    DataForProtobuf Parent;
}

Object graph support for JSON.NET:

var ser = JsonSerializer.Create(new JsonSerializerSettings()
{
    PreserveReferencesHandling = PreserveReferencesHandling.Objects,
    ReferenceLoopHandling = ReferenceLoopHandling.Serialize,
});

**Update1: Added Wire and MsgPack on request.

With the table above you can better judge which serializers could work for your scenario. Pretty much any serializer exchange will result in breaking changes to your serialized data format. The cost of a switch therefore needs to be justified, and you either need a way to migrate the data or you stay polymorphic by keeping your data objects and adding the necessary support for the other serializer, which gives you the ability to switch back to the old one for data migration. If you switch e.g. from XmlSerializer to DataContractSerializer, both can write Xml as output, but DataContractSerializer never writes the serialized data into Xml attributes, which XmlSerializer pretty often does. That makes it impossible to switch from either one to the other without breaking the data exchange format.

Besides the used data format, readability and feature set, the only real metric why one would want to switch to another serializer is performance. There are many performance guides out there which measure one aspect of serializers, but all of them I have found so far ignore important details during the measuring process. The following numbers are therefore the "real" ones measured on my machine with representative samples and many runs to average out random noise from the OS. There is one pretty good serialization testing framework out there, which I found after I had finished my analysis: http://aumcode.github.io/serbench/, written by the smart guy who also wrote Pile (https://www.infoq.com/articles/Big-Memory-Part-2). The claim there is that they use the fastest homegrown serializers to stuff objects into big byte arrays (the pile) so they can support many GB large heaps (10s of GB inside one process) while keeping very small GC latencies, because the objects are only deserialized on access and live only a short time, which makes them Gen0 garbage quite fast. The web site contains many graphs (http://aumcode.github.io/serbench/Specimens_Typical_Person/web/overview-charts.htm) which do not really make it clear how to choose your best serializer; personally I found the way the data is presented confusing. Compared to serbench my own tester is simpler to play with if you want to plug in your own serializer, because you can directly edit the code of one self contained executable. My tester (https://github.com/Alois-xx/SerializerTests) also warns you if you are measuring debug builds or are using a non-NGenned baseline, which for first call effects does not measure the actual serializer overhead but the amount of JITed code executed. Startup performance testing is complex and I believe my tester does the best job there.

Tested Versions

Serializer File Version
BinaryFormatter 4.6.1637.0 built by: NETFXREL3STAGE
DataContract 4.6.1637.0 built by: NETFXREL3STAGE
XmlSerializer  4.6.1586.0 built by: NETFXREL2
FastJson 2.1.23.0
Jil 2.15.0
JsonNet 10.0.2.20802
Protobuf_net 2.1.0
SlimSerializer 3.0.0.1
Wire 1.0.0.0
MsgPack 0.1.4298.15470

Serializer Performance

Here are the results of my own testing with various serializers. The numbers below were created with https://github.com/Alois-xx/SerializerTests, my own serialization testing framework. The graph shows on the left axis the average time for the de/serialize operation per serializer. The tests were performed with one up to one million Book objects and the average throughput was used. Since each serializer has a slightly different output format, the data size to process changes a lot. The serialized data size is printed on the right axis, where the size axis is drawn from top to bottom for better readability. The absolute performance is therefore a function of the efficiency of the used data format and the per object overhead to read/write the data. First time init effects were factored out since this is an extra metric we will discuss shortly. The numbers below are real throughput numbers for larger message packets, although the ordering of which serializer performs best remains practically constant whether 300K or 1000 objects are de/serialized. For smaller object counts the GC effects dominate the picture, which makes it harder to get reliable numbers because the results depend on the GC state left over from the previous test, which is not what I wanted to show here. I sorted by deserialization time because usually you are reading data from some storage, which is most of the time the more important metric.

All tests were performed on my Intel I7-4770K 3.5GHz on Windows 10 with .NET 4.6.2 x64.

image

What is interesting is that the not widely known Jil JSON serializer serializes JSON data by far the fastest and is pretty close to Protocol Buffers, although the serialized data size is 90% bigger. In general the serialization times are always faster than the deserialization times. If this is not the case the serializer is either deeply flawed or you have measured things wrong. From a high level perspective these operations must always happen:

Serialization

  • Existing objects are traversed
  • Data is extracted and converted into the serialization format
  • The conversion process can involve some intermediary allocations
  • Data is written into a stream which is usually written to disk or the network

     

Deserialization

  • Data is read from the stream
  • Data is split into sub tokens (involves memory allocations)
  • New objects are allocated and filled with tokenized data from input stream

     

    If you serialize a 1 GB in-memory object graph into 200 MB on disk you are appending data to a stream which is flushed out to disk. But if you read 200 MB from disk to deserialize, you need to allocate 1 GB of memory just to hold the objects. While deserializing you are effectively GC bound, because most of the time your thread is suspended by the GC, which checks if some of your newly allocated objects are no longer needed. It is therefore not unusual to have high GC blocking times during deserialization but much less GC activity while serializing data to disk. That is also the reason why there is little point in making deserialization multi threaded: you would allocate large amounts of memory from multiple threads just to discover that the GC blocks all of your threads for the now much more frequent GCs. But back to serializer performance.

    Size Does Matter

    To fully understand the performance of a serializer one must also take into account the size of the serialized data. Below is the size of the serialized file for 100K Book objects:

    image

    The object definition was one Bookshelf to which N Books were added.

        [Serializable, DataContract, ProtoContract]
        public class BookShelf
        {
            [DataMember, ProtoMember(1)]
            public List<Book> Books
            {
                get;
                set;
            }
    
            [DataMember, ProtoMember(2)]
            private string Secret;
    
            public BookShelf(string secret)
            {
                Secret = secret;
            }
    
            public BookShelf()
            { }
        }
    
        [Serializable,DataContract,ProtoContract]
        public class Book
        {
            [DataMember, ProtoMember(1)]
            public string Title;
    
            [DataMember, ProtoMember(2)]
            public int Id;
        }

    Jil and JSON.NET are nearly equal, but FastJSON is ca. 35% larger than the other two. Let's check out the serialized data for a BookShelf with one Book inside it:

    Json.NET


    {"Secret":"private member value","Books":[{"Title":"Book 1","Id":1}]}


    Jil


    {"Books":[{"Id":1,"Title":"Book 1"}]}


    FastJSON


    {"$types":{"SerializerTests.TypesToSerialize.BookShelf, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null":"1","SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null":"2"},"$type":"1","Books":[{"$type":"2","Title":"Book 1","Id":1}]}


    By comparing the output we find that Jil omits private members which is the reason why the Secret field value of BookShelf is missing. Besides that Jil and Json.NET have equal output. FastJSON emits for every title string an additional $type node which explains the bigger JSON output. It is educational to look at the other serialized data as well:

    XmlSerializer


    <?xml version="1.0"?>
    <BookShelf xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <Books>
        <Book>
          <Title>Book 1</Title>
          <Id>1</Id>
        </Book>
      </Books>
    </BookShelf>


    Data Contract Indented


    <?xml version="1.0" encoding="utf-8"?>
    <BookShelf xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/SerializerTests.TypesToSerialize">
      <Books>
        <Book>
          <Id>1</Id>
          <Title>Book 1</Title>
        </Book>
      </Books>
      <Secret>private member value</Secret>
    </BookShelf>


    DataContract


    <BookShelf xmlns="http://schemas.datacontract.org/2004/07/SerializerTests.TypesToSerialize" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"><Books><Book><Id>1</Id><Title>Book 1</Title></Book></Books><Secret>private member value</Secret></BookShelf>


    DataContract XmlBinaryDictionaryWriter


    @    BookShelfHhttp://schemas.datacontract.org/2004/07/SerializerTests.TypesToSerialize    i)http://www.w3.org/2001/XMLSchema-instance@Books@Book@Idƒ@Title™Book 1@Secret™private member value


    BinaryFormatter


       ÿÿÿÿ          FSerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null   *SerializerTests.TypesToSerialize.BookShelf   <Books>k__BackingFieldSecret’System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]]             private member value   ’System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]]   _items_size_version  ‘SerializerTests.TypesToSerialize.Book[]                          %SerializerTests.TypesToSerialize.Book         
       %SerializerTests.TypesToSerialize.Book   TitleId       Book 1
      


    Protobuf


    Book 1private member value


    SlimSerializer


      Êþâæ¡
    ÿäSerializerTests.TypesToSerialize.BookShelf, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
    ÿ¾System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]], mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089ÿ(private member value ÿæSerializerTests.TypesToSerialize.Book[], SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null|0~3 ÿ $6|0~3ÿÚSerializerTests.TypesToSerialize.Book, SerializerTests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null ÿ Book 1


    MsgPack


    ‚¥Books„¦_items”‚¥Title¦Book 1¢IdÀÀÀ¥_size¨_version©_syncRootÀ¦Secret´private member value

    Wire


    ÿ;   SerializerTests.TypesToSerialize.BookShelf, SerializerTestsÿl   System.Collections.Generic.List`1[[SerializerTests.TypesToSerialize.Book, SerializerTests]], mscorlib,%core%   ÿ6   SerializerTests.TypesToSerialize.Book, SerializerTests   Book 1private member value

     

    From that we see that Protocol Buffers and Wire have the smallest serialized size. For our serialized BookShelf, Wire looks less efficient, but that is only because of its big header; the serialized data itself is as small as that of Protocol Buffers. Now you can also understand the performance differences of DataContractSerializer depending on the used output format (see Data Contract Indented, DataContract and DataContract XmlBinaryDictionaryWriter). Whether you become a factor of 2 faster or not also depends on the serialized data size. Small is beautiful, and Protocol Buffers deliver that with impressive numbers. SlimSerializer is pretty close to Protocol Buffers and can serialize pretty much anything with no extra attributes, although it seems not to be able to serialize delegates. You should check if this one could work for you. Since it is not so widely used and lacks versioning support, you should do thorough testing before choosing another serializer. Performance is one important aspect, but correctness always beats performance.
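    In case you wonder how the DataContract XmlBinaryDictionaryWriter output above was produced: it is the same DataContractSerializer with a binary XmlDictionaryWriter attached. A small sketch, using the BookShelf sample type from above (file name and helper class are made up):

    using System.IO;
    using System.Runtime.Serialization;
    using System.Xml;

    static class BinaryXmlSample
    {
        public static void Save(BookShelf shelf, string fileName)
        {
            var serializer = new DataContractSerializer(typeof(BookShelf));
            using (var stream = File.Create(fileName))
            using (XmlDictionaryWriter writer = XmlDictionaryWriter.CreateBinaryWriter(stream))
            {
                // Same serializer, different writer: produces the compact binary XML shown above
                serializer.WriteObject(writer, shelf);
            }
        }
    }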

    Serializer Init Times

    So which serializer is the fastest? As usual, it depends on your requirements. Are you trying to load a configuration file faster during your application startup? In that case not only the throughput but also first time init effects may matter more than the raw serializer performance. Let's try it out by serializing one object only and then quitting.
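    A rough sketch of such a first call measurement, using the BookShelf type from above with Jil (the absolute numbers are machine dependent and include JIT time when the assemblies are not NGenned):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using Jil;

    class FirstCallCosts
    {
        static void Main()
        {
            var shelf = new BookShelf("secret")
            {
                Books = new List<Book> { new Book { Title = "Book 1", Id = 1 } }
            };

            var sw = Stopwatch.StartNew();
            JSON.Serialize(shelf);      // first call: Jil/Sigil generate and JIT the serializer code
            Console.WriteLine($"First call:  {sw.ElapsedMilliseconds} ms");

            sw.Restart();
            JSON.Serialize(shelf);      // second call: the generated code is reused
            Console.WriteLine($"Second call: {sw.ElapsedMilliseconds} ms");
        }
    }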

    image

    Suddenly one of the fastest serializers is by a large margin the slowest one. To fully understand serialization performance you need to take into account both the serializer startup costs and the achieved throughput. If application startup is your main concern and you are loading only a few settings from a file, you should think twice whether Jil is really the right serializer for you. Unless you change the environment a bit, and then you get a completely different chart:
    image

    This time Jil has become 240ms faster with no changes except that the test executable was NGenned with

    %windir%\Microsoft.NET\Framework64\v4.0.30319\ngen.exe install SerializerTests.exe

    That precompiled the executable and all referenced assemblies, including Jil and Sigil, which seem to have a lot of code running during serializer initialization. If you are running on .NET Core you will find that the startup costs are much higher because nearly no dll is precompiled with crossgen.exe, which is the .NET Core NGen pendant. Serializer startup costs are therefore dominated by JIT costs, which can be minimized by precompiling your assembly; this is pretty important if you want not only great throughput but also good startup times. If you are deploying a not precompiled application you need to be aware of the greatly different startup costs. Taking only the serializer with the biggest throughput may not be the best idea.

    XmlSerializer Startup Time

    In the regular .NET Framework there is a special catch with XmlSerializer. If you instantiate XmlSerializer it will cost ca. 70ms for the first type, but later instantiations cost only ca. 14ms. Why is the first invocation of XmlSerializer so costly? As it turns out, XmlSerializer creates a serialization assembly on the fly if it does not find one pregenerated with the sgen tool (part of the .NET Framework SDK). In order to load it, XmlSerializer will try to do an Assembly.Load("YourAssembly.XmlSerializers, PublicKeyToken=xxxx, Version=…..") which will fail with a FileNotFoundException if no pregenerated assembly exists. This assembly load attempt will trigger a GAC lookup which will call a method CheckMSIInstallAvailable:

    System.Xml.ni.dll!System.Xml.Serialization.XmlSerializer..ctor(System.Type, System.String)$##6001E85
    System.Xml.ni.dll!System.Xml.Serialization.TempAssembly.LoadGeneratedAssembly(System.Type, System.String, System.Xml.Serialization.XmlSerializerImplementation ByRef)$##60016C8
    mscorlib.ni.dll!System.Reflection.Assembly.Load(System.Reflection.AssemblyName)$##60040EC
    mscorlib.ni.dll!System.Reflection.RuntimeAssembly.InternalLoadAssemblyName(System.Reflection.AssemblyName, System.Security.Policy.Evidence, System.Reflection.RuntimeAssembly, System.Threading.StackCrawlMark ByRef, IntPtr, Boolean, Boolean, Boolean)$##600415D
    clr.dll!AssemblyNative::Load
    clr.dll!AssemblySpec::LoadAssembly
    clr.dll!AssemblySpec::LoadDomainAssembly
    clr.dll!AppDomain::BindAssemblySpec
    clr.dll!AssemblySpec::FindAssemblyFile
    clr.dll!AssemblySpec::LoadAssembly
    clr.dll!FusionBind::RemoteLoad
    clr.dll!CAssemblyName::BindToObject
    clr.dll!CAssemblyDownload::KickOffDownload
    clr.dll!CAssemblyDownload::DownloadComplete
    clr.dll!CAssemblyDownload::DownloadNextCodebase
    clr.dll!CAsmDownloadMgr::ProbeFailed
    clr.dll!CheckMSIInstallAvailable
    clr.dll!MSIProvideAssemblyPeek
    clr.dll!MSIProvideAssemblyGlobalPeek
    clr.dll!MSIProvideAssemblyPeekEnum

    That code is not part of .NET Core and also not of the SSCLI implementation, but with profiling it is not too hard to figure out what is really happening. By using ETW with registry tracing we can see that the first failed assembly load is the most costly one:

    image

    Internally CheckMSIInstallAvailable will read all registry values below  HKEY_LOCAL_MACHINE\SOFTWARE\Classes\Installer\Assemblies\Global which by pure coincidence contains all registered assemblies from the GAC:

    image

    That minor implementation detail causes the observed 44ms delay, because CheckMSIInstallAvailable will first cache the GAC contents from the registry, which takes 44ms on first access. It is not correct to attribute the time of the failed assembly load attempt to the startup costs of XmlSerializer, because it happens only once, for the first assembly load failure. So what is the correct XmlSerializer startup cost? If you create many different XmlSerializer instances during application startup, only the first one will pay the high 70ms startup costs. All subsequent instantiations will cost around 15ms per type, which is much cheaper than one would expect from a single measurement. By pregenerating the code with sgen one can reduce the startup costs even further to ca. 1ms per type, but the first assembly load will still cost around 19ms even when it is successful.
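    You can verify these numbers yourself with a trivial measurement like the following sketch (it reuses the BookShelf/Book sample types; the concrete milliseconds will of course differ per machine):

    using System;
    using System.Diagnostics;
    using System.Xml.Serialization;

    class XmlSerializerStartupCosts
    {
        static void Measure(Type type)
        {
            var sw = Stopwatch.StartNew();
            new XmlSerializer(type);   // generates (or loads) the serialization assembly for this type
            Console.WriteLine($"{type.Name}: {sw.ElapsedMilliseconds} ms");
        }

        static void Main()
        {
            Measure(typeof(BookShelf)); // first type: pays the failed *.XmlSerializers load plus codegen (~70 ms)
            Measure(typeof(Book));      // further types: only the codegen (~15 ms), ~1 ms with an sgen assembly
        }
    }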

    image

    Before .NET 4.5 XmlSerializer also spawned csc.exe to compile the code on the fly, which is luckily no longer the case. In those "old" days XmlSerializer was costing up to 200ms of startup time per type. The usage of sgen was therefore absolutely necessary, but in today's fast moving world old performance truths no longer hold. Startup costs are non trivial to measure, so beware.

    Multi Targeting .NET Executables and Precompiling .NET Core Assemblies

    Precompiling binaries with .NET Core is not very straightforward yet and I think things will change quite a bit in the future, but here is the approach that I have found to work. You can create an executable which targets desktop .NET and .NET Core inside one .csproj file with VS 2017 and later. A typical .csproj which targets .NET 4.5.2 and .NET Core 1.1 contains the semicolon separated <TargetFrameworks> and, for .NET Core, the <RuntimeIdentifiers> whose platform dependent .NET Core dlls are downloaded when the NuGet packages are restored. When you compile this project the code is compiled two times: once as a regular .NET Desktop exe and another time as a .NET Core dll which can be executed in the bin folder with

    dotnet xxxx.dll

     

    <Project Sdk="Microsoft.NET.Sdk">
      <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFrameworks>netcoreapp1.1;net452</TargetFrameworks>
        <RuntimeIdentifiers>win7-x64</RuntimeIdentifiers>
      </PropertyGroup>
      <ItemGroup>
        <PackageReference Include="Jil" version="2.15.0" />
        <PackageReference Include="protobuf-net" Version="2.1.0" />
        <PackageReference Include="Sigil" version="[4.7.0.0]" />
        <PackageReference Include="System.Xml.XmlSerializer" version="*" />
        <PackageReference Include="System.Runtime.Serialization.Xml" version="*" />
      </ItemGroup>
    </Project>
    

    The binaries are put into a target framework dependent folder

    image

    image

    If you download .NET Core it will only contain one precompiled binary. To precompile everything you need to copy the dlls from

    C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.1.1

    to your application binary folder and then call

    %USERPROFILE%\.nuget\packages\runtime.win7-x64.microsoft.netcore.runtime.coreclr\1.1.1\tools\crossgen.exe /JITPath "C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.1.1\clrjit.dll" /Platform_Assemblies_Paths "C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.1.1";%USERPROFILE%\.nuget\packages\System.Xml.XmlSerializer\4.3.0\lib\netstandard1.3

    where you need to append the path of the referenced Nuget packages to make everything work. If things are not working correctly you can enable “Fusion” logging by setting the environment variable

    COREHOST_TRACE=1

    That, and Windbg of course, will give you more hints. Precompiling things in .NET Core is still a lot of trial and error until everything works, and I am not sure if this is the currently recommended way.

    Conclusions

    Measuring and understanding the performance aspects of serializers is quite complex. For some reason the numbers measured by the library authors of public serializers always seem to prove that their serializer is the fastest one. Since I have no affiliations with any of the library maintainers, the presented tests should be reasonably neutral, and I tried hard to make no obvious errors in my testing methodology. If you want to migrate from an existing type hierarchy, Protocol Buffers and SlimSerializer look like fast replacements for BinaryFormatter. Jil is great if you serialize the public API surface of your data objects and do not need to serialize private fields or properties. Despite its claims, FastJSON turned out not to be leading in any metric in these tests. If I have made an error there please drop me a note and I will correct the data. BinaryFormatter has a hideous O(n^2) deserialization time complexity which no one seems to have written about in public yet. At least with .NET Core things will become better. If you are deserializing larger object graphs you now know why the deserialization can take up to 40 minutes. Before trying out a new fastest serializer be sure to measure for yourself, and do not choose serializers just because they have "fast" in their name. There is a Fastest.Json serializer on NuGet which crashes the .NET execution engine during serialization, and the library author never bothered to implement the deserialize part. That's all for today.

The Mysterious Lost Memory Which Belongs To No Process

    Sometimes you experience sluggish performance for no apparent reason. Those are usually the harder issues, since it is not clear what is causing them. Things will become a bit messy, but I will give some advice on how you can pinpoint sluggish system performance with basic OS tools and some advanced tools which are rarely used in public.

    The Symptom

    A server machine became very slow once the in-use memory reported by Task Manager reached ca. 70% (e.g. 90/128 GB) of the available memory. There was still plenty of free memory, but when a new application was started it was very slow and the C drive was very busy. A first suspicion was that the application dlls were loaded from disk again, which did not really make sense since the application had been started many times before; everything should have been in the file system cache for a long time already.

     

    First Observations

    If you are on a Windows server and you want to view the disk IO graphs in Task Manager like on your normal Windows (8 and later) you need to enable the Disk IO graphs explicitly with

    C>diskperf -y 

    A closer look at disk IO in Task Manager showed that the drive hosting the page file was being written to at ca. 90 MB/s from the very beginning. We certainly did page out a lot of data, but why was this happening when only 50/128 GB were in use?

    image

    What was going on here?

    Advanced Tools

    To see where the memory is going we can use RAMMap from Sysinternals. It can show system wide memory consumption of processes and of the kernel as well. This is the only tool which can also show the lost memory, because if the memory is allocated it must at least belong to the kernel, which is also shown.

    image

    Here we find that the modified memory is coming from Shareable memory. Shareable memory is memory which is not backed by a physical file. The only possibility I am aware of to create such memory is to create a file mapping with no file name, which creates a page file backed shared memory section. The suspicion is that we have lost memory to page file backed sections which for some reason landed in the Modified list, which is the sign for the kernel to write them back to their backing store. In this case that is the page file, which nicely supports our theory because we have high page file write rates.

    That is all nice, but the question is: which process is doing this? When I looked through the processes with Task Manager I found no process with a very high Commit or Working Set memory. If nothing helps, try to reproduce the problem in isolation so you can better analyze it.

    Create A Repro

    The easiest repro is to create a file mapping, which is mapped, written to and then unmapped but the file mapping object is not closed.

    In C#  this is a few lines of code:

            private unsafe void Allocate()
            {
                // CreateFileMapping
                var file = MemoryMappedFile.CreateNew(null, BytesToAlloc);
    
                Mappings.Add(file); // root object 
    
                var accessor = file.CreateViewAccessor(); // MapViewOfFile
    
                var handle = accessor.SafeMemoryMappedViewHandle;
                byte* ptr = null;
                handle.AcquirePointer(ref ptr);
                var pLong = (long*)ptr;
    
                Random rand = new Random();
                long nr;
                // fill page file allocated memory with random data which cannot be compressed
                // by Windows 10 memory compression to see the real impact of memory allocation.
                for (long i = 0; i < BytesToAlloc / 8L; i++)
                {
                    nr = (long)(((ulong)(uint)rand.Next() << 32) | (uint)rand.Next());
                    *(pLong + i) = nr;
                }
                TotalAllocated += BytesToAlloc; 
    
                handle.ReleasePointer();
                accessor.Dispose();        // UnmapViewOfFile
                Console.WriteLine($"Modifed: {ModifiedBytes:N0} bytes");
            }

    The full source code can be found here https://1drv.ms/f/s!AhcFq7XO98yJgfZdGGK8ahaOrVnggg. With this sample application we can indeed nicely reproduce the issue: every time the memory is unmapped, the kernel takes the mapped memory away from the process working set and puts it onto the modified page list.

    Now depending on the OS different things happen. On Server 2008-2016 the memory is immediately written to the page file. On Server 2016 you can additionally enable memory compression if you wish (see https://aloiskraus.wordpress.com/2016/10/09/how-buffered-io-can-ruin-performance/). On Windows 10 machines on the other hand the memory is compressed by default and kept in memory, and it is only written to disk much later once physical memory gets low. Since we have reproduced the symptom we can search for ways to find the offending process more easily.

    Let's look at Task Manager when the process has eaten up all my physical memory:

    image

    Nope, not a single number points towards high memory consumption. That is not surprising because the memory was already unmapped and no longer belongs to any process. But our process still has a file mapping handle open, which is reason enough for the kernel to keep the data alive. Let's take a dump and look for file mapping objects. In kernel lingo these are called Section objects.
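
    One way to take such a dump is with the Sysinternals procdump tool (assuming it is available on the machine); the -ma switch creates a dump with the full memory contents:

    C:\>procdump -ma <pid> c:\temp\MappingLeak.dmp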

    Windbg – Dump Section Objects

    To dump all file mappings and their size we can use the !handle command. The first argument 0 means to dump all handles, the second argument 0xf means to dump all handle information, and the last argument limits the output to file mapping (Section) objects only.

    0:000>!handle 0 f Section
    
    …
    
    Handle 0000000000000734
      Type             Section
      Attributes       0
      GrantedAccess    0xf0007:
             Delete,ReadControl,WriteDac,WriteOwner
             Query,MapWrite,MapRead
      HandleCount      2
      PointerCount     65537
      Name             <none>
      Object specific information
        Section base address 0
        Section attributes 0x8000000
        Section max size 0x40000000
    24 handles of type Section
    

    We can sum up all open file mapping objects manually or use a little-known Windbg extension called mex (see https://www.microsoft.com/en-us/download/details.aspx?id=53304). Mex is a Windbg extension written in managed code, which is the reason for its name (managed extension). It is used a lot by support engineers within MS because it packs a lot of functionality into a single dll, which makes deployment pretty easy. With some help from ILSpy it can even decompile C# code from a memory dump, which is pretty useful. When you have downloaded the zip file, put the dll of the right bitness into the Windbg\winext folder where Windbg stores its extensions so you can load it simply by its name.

    Now we can turn the tedious summing of all shared memory section sizes into a pretty one-liner:

    0:000>.load mex
    0:000> !sum !cut -c 22-35 !grep -r "size" !handle 0 f Section
    0x10000
    0x1000
    0x11000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x2000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x1000
    0x1
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    Total = 0n19327504385 (0x480025001)
    

    The !sum command is pretty self-explanatory. !grep filters the !handle output for the lines which contain the size, !cut extracts the number from each of these lines at a fixed offset and length, and !sum adds them all up. This reminds me of my old Linux days where my command pipelines looked pretty similar to the Windbg commands above.

    From the dump it is clear that 19 GB of file mappings have been created. But Task Manager is not showing anything. Even the trustworthy VMMap tool

    image

    is not showing anything related to 19 GB of allocated shared memory. If you suspect a single process to be responsible you can use Windbg and check whether the process has a file mapping leak. If this is the problem, either someone forgot to close the file mapping, or someone unmapped the shared memory too early.
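
    To make this concrete, here is a minimal native sketch (my own illustration, error handling omitted) of the correct lifetime of a page file backed section. If the final CloseHandle is forgotten after unmapping, the written pages survive on the Modified list and end up in the page file, which is exactly the leak described above:

    #include <windows.h>
    #include <string.h>

    void UseSectionAndCleanUp(unsigned long long bytes)
    {
        // Unnamed, page file backed section (no file behind it)
        HANDLE hSection = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr, PAGE_READWRITE,
                                             (DWORD)(bytes >> 32), (DWORD)(bytes & 0xFFFFFFFF), nullptr);
        void *pView = MapViewOfFile(hSection, FILE_MAP_ALL_ACCESS, 0, 0, (SIZE_T)bytes);

        memset(pView, 1, (size_t)bytes);  // use the shared memory

        UnmapViewOfFile(pView);           // removes the pages from our working set
        CloseHandle(hSection);            // without this line the pages stay alive on the Modified
                                          // list and are written to the page file
    }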

    You can set breakpoints on Windows 8+ at kernelbase, or on earlier Windows versions at kernel32 where the methods were originally defined. Windows 8 introduced API sets (https://msdn.microsoft.com/en-us/library/windows/desktop/hh802935(v=vs.85).aspx) which allow the OS guys to split formerly huge dlls into smaller parts where the implementation might live in a different dll. You need to be careful to set the breakpoints at the right location depending on the OS you are currently working on.

    0:000> bp kernelbase!CreateFileMappingW
    0:000> bp kernelbase!UnmapViewOfFile
    0:000> bp kernelbase!UnmapViewOfFileEx
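
    If the breakpoints are hit very often you can optionally attach a command so that Windbg prints the call stack and continues automatically (shown here for the native stack only):

    0:000> bp kernelbase!CreateFileMappingW "kc 20; gc"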

    When you have hit a breakpoint and you are running managed code, load the SOS dll to get the managed call stacks

    .loadby sos clr

    Then it is easy to find e.g. who created the file mappings

    KERNELBASE!CreateFileMappingW:
    00007ffe`06280f20 4883ec48        sub     rsp,48h
    0:000> !ClrStack
    OS Thread Id: 0x59f8 (0)
            Child SP               IP Call Site
    000000aa64d3e720 00007ffe06280f20 [InlinedCallFrame: 000000aa64d3e720] Microsoft.Win32.UnsafeNativeMethods.CreateFileMapping(Microsoft.Win32.SafeHandles.SafeFileHandle, SECURITY_ATTRIBUTES, Int32, Int32, Int32, System.String)
    000000aa64d3e720 00007ffdd724c3e8 [InlinedCallFrame: 000000aa64d3e720] Microsoft.Win32.UnsafeNativeMethods.CreateFileMapping(Microsoft.Win32.SafeHandles.SafeFileHandle, SECURITY_ATTRIBUTES, Int32, Int32, Int32, System.String)
    000000aa64d3e6e0 00007ffdd724c3e8 DomainBoundILStubClass.IL_STUB_PInvoke(Microsoft.Win32.SafeHandles.SafeFileHandle, SECURITY_ATTRIBUTES, Int32, Int32, Int32, System.String)
    000000aa64d3e800 00007ffdd723014b System.IO.MemoryMappedFiles.MemoryMappedFile.CreateCore(Microsoft.Win32.SafeHandles.SafeFileHandle, System.String, System.IO.HandleInheritability, System.IO.MemoryMappedFiles.MemoryMappedFileSecurity, System.IO.MemoryMappedFiles.MemoryMappedFileAccess, System.IO.MemoryMappedFiles.MemoryMappedFileOptions, Int64)
    000000aa64d3e890 00007ffdd72169af System.IO.MemoryMappedFiles.MemoryMappedFile.CreateNew(System.String, Int64, System.IO.MemoryMappedFiles.MemoryMappedFileAccess, System.IO.MemoryMappedFiles.MemoryMappedFileOptions, System.IO.MemoryMappedFiles.MemoryMappedFileSecurity, System.IO.HandleInheritability)
    000000aa64d3e910 00007ffdd7436bd0 System.IO.MemoryMappedFiles.MemoryMappedFile.CreateNew(System.String, Int64)
    000000aa64d3e950 00007ffd8c1c08a0 FileMappingModifiedBytes.PageFileAllocator.Allocate() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 60]
    000000aa64d3ea50 00007ffd8c1c0736 FileMappingModifiedBytes.PageFileAllocator.SlowlyAllocateUntilOOM() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 49]
    000000aa64d3ead0 00007ffd8c1c05b1 FileMappingModifiedBytes.Program.Main(System.String[]) [\FileMappingModifiedBytes\FileMappingModifiedBytes\Program.cs @ 20]
    000000aa64d3ed80 00007ffdeb7b6a53 [GCFrame: 000000aa64d3ed80] 

    When the views are unmapped again you also get the full call stack

    0:000> !ClrStack
    OS Thread Id: 0x59f8 (0)
            Child SP               IP Call Site
    000000aa64d3e328 00007ffe06285400 [InlinedCallFrame: 000000aa64d3e328] Microsoft.Win32.UnsafeNativeMethods.UnmapViewOfFile(IntPtr)
    000000aa64d3e328 00007ffdd724b23e [InlinedCallFrame: 000000aa64d3e328] Microsoft.Win32.UnsafeNativeMethods.UnmapViewOfFile(IntPtr)
    000000aa64d3e300 00007ffdd724b23e DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr)
    000000aa64d3e3b0 00007ffdd72305a7 Microsoft.Win32.SafeHandles.SafeMemoryMappedViewHandle.ReleaseHandle()
    000000aa64d3e548 00007ffdeb7b6a53 [GCFrame: 000000aa64d3e548] 
    000000aa64d3e6b8 00007ffdeb7b6a53 [GCFrame: 000000aa64d3e6b8] 
    000000aa64d3e778 00007ffdeb7b6a53 [HelperMethodFrame_1OBJ: 000000aa64d3e778] System.Runtime.InteropServices.SafeHandle.InternalDispose()
    000000aa64d3e870 00007ffdd72309c1 System.IO.MemoryMappedFiles.MemoryMappedView.Dispose(Boolean)
    000000aa64d3e8a0 00007ffdd72307b9 System.IO.MemoryMappedFiles.MemoryMappedViewAccessor.Dispose(Boolean)
    000000aa64d3e8e0 00007ffdd723075a System.IO.MemoryMappedFiles.MemoryMappedViewAccessor.Dispose(Boolean)
    000000aa64d3e920 00007ffddab2b15a System.IO.UnmanagedMemoryAccessor.Dispose()
    000000aa64d3e950 00007ffd8c1c0a48 FileMappingModifiedBytes.PageFileAllocator.Allocate() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 83]
    000000aa64d3ea50 00007ffd8c1c0736 FileMappingModifiedBytes.PageFileAllocator.SlowlyAllocateUntilOOM() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 49]
    000000aa64d3ead0 00007ffd8c1c05b1 FileMappingModifiedBytes.Program.Main(System.String[]) [\FileMappingModifiedBytes\FileMappingModifiedBytes\Program.cs @ 20]
    000000aa64d3ed80 00007ffdeb7b6a53 [GCFrame: 000000aa64d3ed80] 
    

    Now we have enough information to come up with a

    Theory Of The Problem

    This is how we got into that situation

    1. Someone creates page file backed shared memory as file mapping.
    2. These file mappings are put into the address space of our process with MapViewOfFile.
    3. After usage the file mapping is removed from the process working set with UnmapViewOfFile. The file mapping object is closed much later or never.
    4. The kernel removes the memory pages from the process working set and moves them to the Modified List.
    5. No process owns that memory anymore and the OS is free to page it out to the page file.
    6. After the modified pages were written to disk they are put into the standby list, overwriting other previously cached things.

    The picture below shows the basic operations the kernel performs.

    image

    The end result is that the unmapped memory of our process is written to the page file and at the same time put into the file system cache, where it replaces older entries. This has far reaching implications. If you unmap the memory but do not close the file mapping object, the OS will page the memory out almost immediately if no other process has the same memory mapped. An innocent file mapping handle leak with large memory mappings can flush your complete file system cache at random times, and there is no clear indicator why your or other applications suddenly became much slower.

    On Windows 10 Anniversary and later you can see the effect of flushing the file system cache more directly in Task Manager: the compressed memory goes up and up while the cached memory stored on the standby list becomes less and less.

    image

    Freshly paged out memory seems to have the highest priority in the file system cache and can evict anything else that was inside it. This is also true on a server OS, where you can check with RAMMap that the file system cache was flushed. That explains why we had so much read activity on the C drive: the file system cache was flushed and the application dlls needed to be loaded from disk again. If you have a large amount of compressed memory in Windows 10 but not much in-use memory, the chances are pretty good that you have a file mapping leak somewhere.

    How Things Look in ETW

    On my Windows 10 Anniversary machine I can nicely see that my paged out memory goes directly into the working set of MemCompression (shown as Memory Compression in Process Explorer/Process Hacker), which is not shown by the Windows Task Manager. At some point it becomes so huge that it needs to page out memory to my SSD with up to 400 MB/s. While MemCompression (purple) allocates more memory, the standby list (green) becomes smaller and smaller.

    image

    With ETW and CPU sampling tracing you can also find out who is responsible for the high modified bytes which are written to the page file. That is very useful if you do not know which process is causing it and you cannot interrupt the server processes with memory dumps, which might upset people working on the server. Since we know, based on our theory, that the modified bytes grow when we unmap the data, we can simply view all callers of UnmapViewOfFile and follow the callers until we find the call stacks which show the best correlation between the growth of modified bytes and calls to UnmapViewOfFile. With our tester application we get a perfect correlation:

    image

    The blue bars map perfectly to the rise of modified bytes when we unmap a file section. You can also treat the file mapping issue as a handle leak and use ETW handle leak tracing to trace all Section objects created via CreateFileMapping, which can be used to verify that no file mappings of this type exist anymore and that a fix works as expected. WPA is the perfect tool to visually correlate different metrics to check if there is a dependency. You can come up with a theory about what depends on what, graph the different things underneath each other, and then you see a pattern … or not. Then it is time for the next theory.
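
    If you go the handle leak tracing route, a possible xperf command line for it (a minimal flag set which you may want to extend) enables the OB_HANDLE kernel flag together with handle stack walks:

    xperf -on PROC_THREAD+LOADER+OB_HANDLE -stackwalk HandleCreate+HandleClose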

    The Real Root Cause

    image

    Problem solved? Only partially. This only explains the sluggish behavior once the machine gets into high memory load which then flushes the file system cache. But there were issues right from the beginning, even when there was still plenty of data in the file system cache. Can you spot, based on the data given above, where the problem is?

    .

    .

    .

    We have a nice RAID in the machine, but the pageouts were happening at only 100 MB/s. That is not the write throughput one would expect from an expensive RAID. What I would expect is something like this

    image

    but it was about 5 times slower than usual. Perhaps one disk had failed and the RAID was not scaling properly anymore? Nope. But there was a warning logged somewhere that the RAID battery had failed. The battery in the RAID controller is there to write all cached data from its internal DRAM cache to the disks in the case of a power failure. What happens if the battery fails? The RAID controller turns off its internal cache to prevent data loss if there is a power failure. We were therefore running the server the whole time without a much wanted cache! I am not the one to take credit for finding the disabled RAID cache. It is all about teamwork where everyone looks at different aspects of a problem to (hopefully) find all issues. Thanks for finding that one. You know who you are.

    Luckily we did not need to wait for a new battery, because the cache can be enabled even when the battery is not present:

    image

    That setting enables the RAID cache regardless of the state of the battery, which is exactly what we want. In fact the battery was not broken at all: opening the server and dusting off the battery solved the battery issue as well.

    It’s The Environment Stupid!

    When you write software it is not always the software which causes a change. It still runs on hardware which can change or fail too. You need to monitor not only your application but also the environment it is running in. If you fail to take that into account you can search for a long time at the wrong or at least not the most significant spot.

    Funny side note: I drive to work by car every day. When I refilled my windshield washer fluid yesterday I noticed an unauthorized hardware change:

    image

    A marten

    image

    seems to have brought its own pet toys into my car to make the place more cozy. Next action: install some marten repellent to prevent more hardware changes:

    image

    Have a nice day and keep your hardware marten-safe.

    MS Technical Summit 2016 at Darmstadt

    I had the pleasure of giving a talk about ETW at the MS Technical Summit 2016 in Darmstadt. You can find our slides, the sample code, a crash dump and the resulting sample ETW trace here. It was a great event together with Wolfgang Kroneder from MS. Here is the video:

    https://channel9.msdn.com/Events/microsoft-techncial-summit/Technical-Summit-2016/Deep-Dive-Event-Tracing-for-Windows/player

    For some problems you need custom ETW tracing to find out what is going on. Reference counting bugs are among the nasty ones where it is very hard to find the root cause if they are timing dependent. If you add normal printf or OutputDebugString tracing to the Add/ReleaseReference calls the error usually goes away. Below is the sample code from my talk, minus the ETW manifest generated tracing macros. It mimics a reference counting issue where we suspected that someone was calling ReleaseReference too early on a reference counted class, which led to a too early destruction of the reference counted object. Interestingly it was being deleted more than once, which resulted in either a heap corruption or a pure virtual function call exit depending on the exact timing.

    The code below will call AddReference once in the main method and then many times balanced Add/ReleaseReference from 5 different threads which should never cause the reference count to drop to zero.

     

    #include <windows.h>
    #include <vector>
    #include <thread>
    #include <mutex>
    #include "ETWRefCounter.h"  // ETW Tracing header generated from ETW manifest during each build with Pre build command
                                // mc -um $(ProjectDir)RefCounterETW.man  -h $(ProjectDir)  -z $(ProjectName)
                                // ecmangen is the editor to create ETW manifest files. 
    
    
    // Thread safe refcounter base class
    class CRefCounterBase
    {
    public:
        CRefCounterBase() : m_Counter(0) {}
        
        CRefCounterBase(CRefCounterBase &other) = delete;
        CRefCounterBase& operator=(const CRefCounterBase&) = delete;
        
        virtual void DeleteObject() = 0;
    
        // Return new reference count incremented by one
        long AddReference()
        {
            auto lret = InterlockedIncrement(&m_Counter);
            EventWriteAddRefEvent((__int64)this, lret); // trace new value
            return lret;
        }
    
        // Return new reference count decremented by one
        long ReleaseReference()
        {
            auto old = m_Counter;
            auto newValue = InterlockedDecrement(&m_Counter);
            EventWriteReleaseRefEvent((__int64)this, newValue); // trace new value
            if (newValue == 0)
            {
                EventWriteFinalReleaseEvent((__int64)this);
                DeleteObject();
            }
            return newValue;
        }
    
        // Check if RefCount == 1. This can only be safely used when the threads using this object are
        // guaranteed to not change the refcount after this check has been made. 
        bool OnlySingleUser()
        {
            EventWriteIsSingleUserEvent((__int64) this, m_Counter);
            return InterlockedExchange(&m_Counter, m_Counter) == 1;
        }
    
    protected:
        virtual ~CRefCounterBase() { m_Counter = 0; }
    
        long m_Counter;
    };
    
    /// Actual reference counted class
    class RefCountedObject : public CRefCounterBase
    {
    public:
        RefCountedObject() {}
        RefCountedObject(RefCountedObject &other) = delete;
        RefCountedObject& operator=(const RefCountedObject&) = delete;
        virtual ~RefCountedObject() {}
    
        virtual void DeleteObject()
        {
            printf("\nDeleteObject called 0x%p, %d", this, ::GetCurrentThreadId());
            delete this;
        };
    };
    
    
    // Using RefCounter from multiple threads in a balanced way which works
    void AsyncWorker(CRefCounterBase *counter)
    {
        while (true)
        {
            counter->AddReference();
            if (counter->OnlySingleUser())
            {
                // some optimized code not shown here which also causes add/releaseref
            }
            counter->ReleaseReference();
        }
    }
    
    int main()
    {
        EventRegisterRefCounter();
    
        static const int ThreadsCount = 5;
        std::vector<std::thread> threads; // thread list 
    
        auto pCounter = new RefCountedObject();   // construct object
        pCounter->AddReference(); // ensure that refcount does not drop to zero while other threads are working with it.
    
        for (int i = 0; i < ThreadsCount; i++) // start some threads and increment and decrement stuff in a loop
        {
            std::thread th(AsyncWorker, pCounter);
            threads.push_back(std::move(th));
        }
    
        threads[0].join();
    
        EventUnregisterRefCounter();
        return 0;
    }
    

    When you let the exe run it will either run happily forever or crash. During a crash it will terminate with an unhandled exception because of a pure virtual function call like this,

    image

    or with a heap corruption. In reality the code was not as simple but it pretty much boils down to the code above. Can you spot the error?

    How Buffered IO Can Ruin Performance

    Paging can cause bad interactive performance. This happens quite often, but very little content exists on how you can diagnose and fix paging issues. It is time to change that (a bit). I present here a deep dive into how paging really works for some workloads and how it caused a severe interactive performance issue.

    The Observation

    It was reported that a software version performed significantly worse than its predecessors when the system was under heavy load. Measurements proved that hard page faults were the issue.

    A more detailed analysis showed that the big page out rate of nearly 500 MB/s was caused by explicit working set trims. After removing these explicit EmptyWorkingSet calls performance went back to normal. The question remained why EmptyWorkingSet was suddenly a problem, because these calls had been there for a long time.

    How Paging Really Works

    To make such measurements on Windows 10 I will give away some secrets. Some of you might have found the MM-Agent PowerShell cmdlet already. If you look e.g. at the Windows Server 2012 memory management improvements you will find some interesting switches which are also present in Windows 10 Anniversary. You can configure the Windows memory management via PowerShell.

    PS C:\WINDOWS\system32> Get-MMAgent
    
    ApplicationLaunchPrefetching : True
    ApplicationPreLaunch         : True
    MaxOperationAPIFiles         : 256
    MemoryCompression            : False
    OperationAPI                 : True
    PageCombining                : True
    PSComputerName               :

    On Windows 10 Anniversary the switches MemoryCompression and PageCombining are enabled by default. If you are blaming memory compression for bad performance you can switch off MemoryCompression by calling

    Disable-MMAgent -mc

    To enable it again you can call

    Enable-MMAgent -mc

    The MemoryCompression switch does not seem to be documented so far. If MemoryCompression is already enabled and you want to turn it off you need to reboot. If you have disabled memory compression successfully you should see a Compressed value of zero in Task Manager all the time. If you enable memory compression the setting takes immediate effect (no reboot necessary).

    Now let's perform a little experiment to see what happens when we trim the working set of a two GB process (my CppEater) while memory compression is disabled.

    C:\>CppEater.exe 2000

    Below is the screenshot taken after CppEater flushed its working set. Some memory was left over on the Modified list because the memory manager did not see the need to completely flush all of the 2 GB out into the page file.

    image

    When the memory was flushed you see an IO spike on disk D where my page file resides.

    image

    So far, so expected. Below is a diagram which shows the order of operations happening there.

    image

    In ETW traces this looks like this:

    • When the memory is flushed we have two GB of data in the modified list.
    • Later the OS flushes two GB of data into the page file.

    image 

    From that picture it is clear that when we later touch the paged out memory again we will need to read two GB of data from the page file. Now let's perform the experiment and see what happens:

    image

    Our active memory usage rises again by two GB, but the strange thing is that there is zero disk activity when we access the paged out memory. Based on our mental model we should see two GB of disk IO on the D drive. Let's have a look at the ETW traces while the page file writing is happening.

    image

    While we write two GB of data to the page file, the standby list increases by the same amount during that time. That means that although the data is persistently written out to the page file, we still have all of the written data in the file system cache in the form of the standby list. Since the contents of the page file are still in the file system cache (standby list) we see no hard page faults when we access our paged out data again.

    That is a good working hypothesis which we can now test. We only need to read enough other files via buffered IO, without closing the file handles in between, which will flush the existing file system cache contents. Then we should see our CppEater process hitting the hard disk to read its page file contents.

    Below is a small application that reads the Windows installer cache, which is ca. 30 GB in size and therefore more than enough to flush any existing file system cache contents:

    using System;
    using System.Collections.Generic;
    using System.IO;

    static int Main(string[] args)
    {
        var streams = new List<FileStream>();
        byte[] buffer = new byte[256 * 1024 * 1024];
        // Read everything with buffered IO and keep the file handles open so the freshly
        // read data fills the standby list and evicts the previously cached page file data.
        foreach (var f in Directory.EnumerateFiles(@"C:\windows\installer", "*.*"))
        {
            try
            {
                var file = new FileStream(f, FileMode.Open, FileAccess.Read);
                streams.Add(file);
                while (file.Read(buffer, 0, buffer.Length) > 0)
                {
                }
            }
            catch (Exception)
            { }
        }

        return 0;
    }

    With a flushed file system cache, CppEater.exe no longer has a secret hiding place for its paged out memory. Now we see the expected two GB of hard disk reads, minus the modified memory that was still not flushed out.

    image

    The earlier picture of what happens when data is written to the page file is lacking an important detail: it misses the fact that the Modified list transitions to the Standby list, which is just another name for the file system cache.

    image

    The Explanation For Bad Performance

    Now we have all the missing pieces together. The initial assumption that flushing the working set causes the OS to write the process memory into the page file is correct, but it is only half of the story. When the page file data is written, the memory from the modified list becomes part of the file system cache. When the pages are later accessed again it depends on the current state of the file system cache whether we see soft or hard faults, with dramatic effects on the observed performance. The badly performing software version caused more buffered reads than before. Those reads pushed the cached page file data out of the file system cache. The page faults that still happened were no longer cheap soft page faults but hard page faults. That explains the dramatic effects on interactive performance. The added buffered IO reads surfaced the misconception that flushing the working set is a cheap operation. Flushing the working set and soft faulting it back again is only cheap if the machine is not under memory pressure. If the memory condition becomes tight or the file system cache gets flushed you will see the real costs of hard page faults. If you still need to access that memory in a fast way the best thing to do is not to flush it. Otherwise you might see random hard page faults even if the machine still has plenty of free memory, due to completely unrelated file system activity! This is true for Windows Server 2008 and 2012. With Windows Server 2016, which will also employ memory compression just like Windows 10, things change a bit.

    Windows 10 Paging

    With memory compression we need to change our picture again. The modified list is not flushed out to disk but compressed and then added to the working set of the Memory Compression process, which now acts as a cache. Since the page file contents are no longer shared with the standby list we will not see this behavior on Windows 10 or Server 2016 machines with memory compression enabled.

    image

    When we execute the same use case under Windows Server 2016, where we flush the file system cache while memory compression is enabled, we will see

    image

    that the memory from the MemCompression process stays cached and is semi-hard faulted back into the CppEater process in 3s, which is much faster than the previous 10s when the page faults were hitting the hard disk. It is therefore a good idea for most workloads to keep memory compression enabled. It not only compresses the memory, but the cached page file contents are also no longer subject to standby list pollution, which should make the system performance much more predictable than it was before.

    Conclusions

    Windows has many hidden caches which make slow operations (like hard page faults) fast again. But at the worst point in time these caches are no longer there and you will experience the uncached bad performance. It is interesting that RAMMap does not show the page file as the biggest standby list consumer on machines without memory compression. To prevent such hard to find errors in the first place you should measure what things cost with detailed profiling (for me that is ETW) and then act on the measured data. If someone has a great idea to make things faster you should always ask for the detailed profiling data. Pure timing measurements can be misleading. If a use case has become 30% faster but uses 3x more memory and 2x more CPU, is this optimization still a great idea?

    Windows 10 Memory Compression And More

    If you expect performance gains from using a specific API you must measure on the actual target operating system under a realistic load. One such API about which many myths are heard is SetProcessWorkingSetSize, which can be used to trim the current working set of a process. For a long time now you can also use EmptyWorkingSet, which does the same job with a simpler API call. One usage scenario might be that our process is not used for some time, so it might be a good idea to page out its memory to make room for other processes which need the physical memory more urgently. The interesting question is: is this a good idea? To answer that question one needs to understand how memory management works in detail. To begin our journey into the inner workings of the operating system we need to know:

    How Is Memory Allocated?

    For that we look under the covers at how memory allocation works from an operating system perspective and from the application developer's view. From a developer's point of view memory is only a new xxxx away. But what happens when you do that? The answer to that question depends highly on the used programming language and its implementation. For simplicity I stick to C/C++ here.

    Below is the code for a small program named CppEater.exe that allocates and accesses memory several times before and after it gives back its memory to the OS by trimming its working set.

    1. Allocate 2 GB of data with new.
    2. Touch the first byte of every memory page.
    3. Touch the first byte of every memory page.
    4. Touch all bytes (2 GB).
    5. Trim the working set (call EmptyWorkingSet).
    6. Wait for 15s.
    7. Touch the first byte of every memory page.
    8. Touch all bytes (2 GB).

    // CppEater.cpp source. Update:  Full source is at https://1drv.ms/f/s!AhcFq7XO98yJgcg2Ko4dmxFUmQcAKA

    #include <stdio.h>
    #include <Windows.h>
    #include <Psapi.h>
    #include <chrono>
    #include <vector>
    
    template<typename T> void touch(T *pData, int dataCount, const char *pScenario, bool bFull = false)
    {
        auto start = std::chrono::high_resolution_clock::now();    
        int pageIncrement = bFull ? 1 : 4096 / sizeof(T);
        for (int i = 0; i < dataCount; i += pageIncrement) // touch all memory or only one integer every 4K
        {
            pData[i] = i;
        }
        auto stop = std::chrono::high_resolution_clock::now();
        auto durationInMs = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
        char pChars[512];
        sprintf(pChars, "%s: %lldms, %.3f ms/MB, AllTouched=%d", pScenario,
             (long long)durationInMs, 1.0*durationInMs / ( 1.0 * dataCount * sizeof(T) / (1024 * 1024)), bFull);
        printf("\n");
        printf(pChars);    
        ::Sleep(500);
    }
    
    int main()
    {
        const int BytesToAllocate = 2 * 1000 * 1024 * 1024;
        const int NumberOfIntegersToAllocate = BytesToAllocate / sizeof(int);
        auto bytes = new int[BytesToAllocate/sizeof(int)];                     // Allocate 2 GB
        touch(bytes, NumberOfIntegersToAllocate, "First page touch");          // touch only first byte in every 4K page
        touch(bytes, NumberOfIntegersToAllocate, "Second page touch");         // touch only first byte in every 4K page
        touch(bytes, NumberOfIntegersToAllocate, "All bytes touch", true);     // touch all bytes
    
        ::EmptyWorkingSet(::GetCurrentProcess());                               // Force pageout and wait until the OS calms down
        ::Sleep(15000);
    
        touch(bytes, NumberOfIntegersToAllocate, "After Empty");               // touch only first byte in every 4K page
        touch(bytes, NumberOfIntegersToAllocate, "Second After Empty", true);  // touch all bytes
        return 0;
    }

    When you record the activity of this program with ETW you find a 2 GB allocation going through the C/C++ heap manager which ends up in VirtualAlloc. VirtualAlloc is to my knowledge the most basic API to request memory from Windows. All heap segment allocations for the C/C++/C# heaps go through this API.

    |    |    |- CppEater.exe!main
    |    |    |    |- CppEater.exe!operator new
    |    |    |    |    ucrtbase.dll!malloc
    |    |    |    |    ntdll.dll!RtlpAllocateHeapInternal
    |    |    |    |    ntdll.dll!RtlpAllocateHeap
    |    |    |    |    ntdll.dll!NtAllocateVirtualMemory             2 GB alloc with MEM_COMMIT|MEM_RESERVE

    The only difference between the C/C++/C# heap managers is that the heap memory segments are managed differently depending on the target language. In C/C++ objects cannot be moved by the heap manager, so it is important to prevent heap fragmentation with sophisticated allocation balancing algorithms. The managed (.NET) heap is controlled by the garbage collector, which can move objects around to compact the heap, but it has to be careful not to fragment the heap with pinned objects either. Once the heap manager has got its big block of memory it will satisfy all following allocation requests from the heap segment(s). When all of them are full or too fragmented the heap manager will request another round of memory via VirtualAlloc.

    It is possible to track down managed and unmanaged memory leaks with VirtualAlloc ETW tracing. That works if the leak is large enough to trigger a heap segment re/allocation. But you need to be careful how you interpret the “leaking” stack, because it can also happen that some leak causes allocations in the heap while the heap segment re/allocation happens due to large temporary objects which are not the root cause.
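
    A minimal xperf command line for such a VirtualAlloc trace (you may want to add further providers depending on what else you need) looks like this:

    xperf -on PROC_THREAD+LOADER+VIRT_ALLOC -stackwalk VirtualAlloc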

    Committed/Private Memory

    Committing memory is quite fast and completes in ca. 1 ms for a new[2GB] memory allocation request which ends up in VirtualAlloc. Was that really everything? When you commit memory you will see no increase in your working set. If you put a sleep after the new operator and look at the process in Task Manager then you see this:

    image

    The 2 GB commit is there but your working set is still only 2 MB. That means that from an operating system point of view your application has only got the promise to get that much memory. Since you did not yet access any of the committed memory (via e.g. a pointer access) the OS had no need yet to allocate a new 4 K memory page, zero it out and soft fault it into your process working set at the location where you access the memory for the first time. There is quite a lot going on under the covers to make your process address space look flat and to ensure that no other process can interfere with you and your allocated memory. In reality each process has its own virtual address space and the OS is responsible for swapping physical memory in and out of your process as it sees fit. The working set is, to a first approximation, the RAM your application occupies in the physical RAM modules on your mainboard. Some things like dlls are the same in all processes and can be shared by multiple processes. That is the reason why you have a Working Set, Working Set Shared and Working Set Private column in Task Manager. If you added up all working sets of a fully utilized machine it would exceed the installed memory by far because shared memory is counted multiple times. That makes the exact calculation of the actually used memory of an application quite difficult.

    An easier metric is to sum up all committed memory, which is by definition local to your process. This gives you an upper bound of the physical memory usage. It is an upper bound because some pages may never have been touched and are therefore not assigned to any physical memory pages. Besides that, large parts of your application might already be sitting in the page file, which then also does not consume much physical memory. Process Explorer and Process Hacker have no column named Committed memory. As far as I can tell, Committed memory is the same as the Private Bytes reported by these tools - just in case you wonder why there is no Committed Bytes column in Process Explorer and Process Hacker. The definition used by Process Hacker is quite simple: Private Bytes is the memory which can go to the page file. This excludes things like file mappings which can be read again from the original file on disk instead of the page file, as long as the file mapping was not created with copy on write semantics.

    Memory Black Hole – Page File Allocated Memory

    It looks like the working set must always be smaller than the commit size, but that is not always the case. A special case are page file allocated memory mapped files. These do not count towards your committed memory. If you touch the memory of a page file backed file mapping then it all goes into your working set. See below for a 2 GB working set where we have only 4 MB of committed memory!

    image

    The code to produce such a strange process allocates a file mapping from the page file by specifying INVALID_HANDLE_VALUE as file handle. If you touch this memory then you have a large shared working set but zero committed memory. Below is a screenshot of the code which produced that allocation:

    image
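
    Since the code is only visible as a screenshot, here is a minimal sketch of what such an allocation can look like (my own reconstruction, error handling omitted): a page file backed section which is mapped, touched page by page and then simply kept alive:

    #include <windows.h>

    int main()
    {
        const unsigned long long Bytes = 2ull * 1024 * 1024 * 1024;   // 2 GB
        HANDLE hSection = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr, PAGE_READWRITE,
                                             (DWORD)(Bytes >> 32), (DWORD)(Bytes & 0xFFFFFFFF), nullptr);
        char *pView = (char *)MapViewOfFile(hSection, FILE_MAP_ALL_ACCESS, 0, 0, (SIZE_T)Bytes);

        for (unsigned long long i = 0; i < Bytes; i += 4096)  // touch every page: large shared working set,
        {                                                     // but (nearly) zero committed private memory
            pView[i] = 1;
        }

        ::Sleep(INFINITE);                                    // keep the view mapped so the working set stays
        return 0;
    }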

    If you look into this process with VMMap from Sysinternals, which is a really great tool to look into any process and its contained memory, you will find that VMMap and Task Manager seem to disagree about the committed memory, because page file backed memory is counted as committed by VMMap but not by Task Manager. Be careful which numbers you look at.

    image

    If you want to look at your whole system you can use RAMMap, also from Sysinternals. It shows page file allocated file mappings as Shareable, just as VMMap does, which gives you a good hint if you have leaked gigabytes of page file backed memory somewhere.

    image

    ETW Reference Set and Resident Set Tracing

    There is a bunch of ETW providers available which can help you to find all call stacks which touch committed memory for the first time, which then causes the OS to soft page fault physical memory into your working set. In ETW lingo this is called Reference Set. These ETW providers work best with Windows 8 and later. There you can see the call stack for every allocation and page access in the system. This shows you every first page access, but it cannot show the hard page faults due to paged out memory (you can use the Hard Faults graph in WPA for that). Although Reference Set tracing is interesting it has a very high overhead and produces many events. For every soft page fault which adds memory to any process working set an ETW event with the memory type

    • PFMappedSection
    • PageTable
    • PagedPool
    • CopyOnWriteImage
    • KernelStack
    • VirtualAlloc
    • Win32Heap
    • UserStack

    and the call stack which caused the soft page fault can be recorded. That results in many millions of events with a lot of big stack traces. On the plus side it helps a lot if you want to know exactly who touched this memory, but the resulting ETW files are quite big and take a long time to parse. The Reference Set graph was added to WPA with the Windows 10 Anniversary SDK.

    Reference Set tracing can be enabled with WPR/UI or xperf where the equivalent xperf command line is

    xperf -on  PROC_THREAD+LOADER+HARD_FAULTS+MEMORY+MEMINFO+VAMAP+SESSION+VIRT_ALLOC+FOOTPRINT+REFSET+MEMINFO_WS -stackwalk PageAccess+PageAccessEx+PageRelease+PageRangeAccess+PageRangeRelease+PagefileMappedSectionCreate+PagefileMappedSectionDelete+VirtualAlloc

    or you can use the predefined xperf kernel group

    xperf -on ReferenceSet

    image

    There is another profile in WPR which is called Resident Set. This creates a snapshot of all processes and where their memory is allocated when the trace session ends. Function-wise it is like opening VMMap for all processes at a specific point in time. The ETW events used are largely the same as with Reference Set, minus the call stacks, which reduces the traced data considerably. If you enable Reference Set tracing you also get Resident Set tracing, since Reference Set is a superset of Resident Set. Although WPA displays the resident set at a specific point in time it needs the ETW events belonging to the actual allocations to be able to assign every memory page to its owning process. Otherwise you will only know that large portions of your active memory consist of page file allocated memory, but you will not know which process it belongs to.

    The xperf command line for Resident Set tracing is

    xperf -on PROC_THREAD+LOADER+HARD_FAULTS+MEMORY+MEMINFO+VAMAP+SESSION+VIRT_ALLOC+DISK_IO

    or you can use the xperf kernel group

    xperf -on ResidentSet

    Of these many providers the MEMORY ETW provider is by far the most expensive one. It records every page access and release. But without it you will not get any Reference/Resident Set graphs in WPA. I am not sure if really all events are necessary, but the high amount of data generated by this provider only allows recording very short durations of a busy machine where many allocations and page accesses happen. VirtualAlloc allocated memory is the only exception which can always be attributed to a specific process under the special page category VirtualAlloc_PreTrace.

    image

    Memory Black Hole – Large Pages

    There are more things where Task Manager is lying to you. Have a look at this process

    image

    How much physical memory is it consuming? 68 KB? Let's look at our free memory while CppEater is running:

    image

    CppEater causes a rise of the In use value shown in Task Manager from 5.9 GB to 7.9 GB. Our small application is eating 2 GB of memory but we cannot see this memory attributed to our process! When the system is behaving strangely we need to look at our memory at the system level. For this task RAMMap is the tool to use.

    image

    Here we see that someone has allocated 2 GB of memory in large pages. I hear you saying: large what? It is common wisdom that on Intel CPUs the natural page size is 4 KB. But for some server workloads it made sense to use larger pages. Usually large pages are 2 MB in size or multiples of that. Some Xeon CPUs even support large pages of up to 1 GiB. Why should this arcane knowledge be useful? Well, there is one quite common process which employs large pages quite heavily: Microsoft SQL Server. If you happen to see an innocently small sqlserver.exe and for some reason after your 24h load test several GB of memory are missing but no process seems to have allocated them, the chances are high that SQL Server has allocated some large pages which look small in Task Manager.

    There is a tool to see how much memory SQL server really uses: VMMap

    image

    Large pages manifest themselves as Locked WS in VMMap, which shows exactly our lost 2 GB of memory. Is this ridiculously small value in Task Manager a bug? Well, sort of, yes. Trust no tool to 100%. Every number you will ever see with regards to memory consumption is wrong or a lie to some extent. Even Process Explorer shows these nonsensical values. I can only speculate how this small value is calculated. It could be that Task Manager simply calculates WS Pages * 4096 bytes/page = Working Set bytes.

    A specialty of large pages is that they are never paged out and are allocated immediately when you call VirtualAlloc. This is how SQL Server can grab large amounts of physical memory with one API call and then play operating system with the handed out memory by itself. Here is the sample code to allocate memory with large pages. Large parts are only ceremony because you need to acquire the Lock pages in memory privilege before you can successfully call VirtualAlloc with MEM_LARGE_PAGES.

    int * LargePageAlloc(unsigned int numberOfBytesToAllocate)
    {
        printf("\nLarge Page allocator used.");
        ETWSetMark("Large Page allocator");
    
        HANDLE proc_h = OpenProcess(PROCESS_ALL_ACCESS, FALSE, GetCurrentProcessId());
    
        HANDLE hToken;
        OpenProcessToken(proc_h, TOKEN_ADJUST_PRIVILEGES, &hToken);
        CloseHandle(proc_h);
    
        LUID luid;
        ::LookupPrivilegeValue(0, SE_LOCK_MEMORY_NAME, &luid);
    
        TOKEN_PRIVILEGES tp;
    
        tp.PrivilegeCount = 1;
        tp.Privileges[0].Luid = luid;
        tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    
        auto status = AdjustTokenPrivileges(hToken, FALSE, &tp, sizeof(tp), (PTOKEN_PRIVILEGES)NULL, 0);
        CloseHandle(hToken);
    
        if (status != TRUE)
        {
            printf("\nSeLockMemoryPrivilege could not be aquired. LastError: %d. Use secpol.msc to assign to your user account the SeLockMemoryPrivilege privilege.", ::GetLastError());
            return 0;
        }
    
        auto *p = (int *) VirtualAlloc(NULL, numberOfBytesToAllocate, MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES, PAGE_READWRITE);
    
        if (::GetLastError() == 1314)
        {
            printf("\nNo Privilege held. Use secpol.msc to assign to your user account the SeLockMemoryPrivilege privilege.");
        }
    
        return p;
    }

    To try the sample out you need to grant your user or group the Lock pages in memory privilege with secpol.msc and log out and in again to make the change active.

    image

    When you suspect SQL Server memory issues you should check out http://searchsqlserver.techtarget.com/feature/Built-in-tools-troubleshoot-SQL-Server-memory-usage which contains plenty of good advice on how you can troubleshoot unexpected SQL Server memory usage.

    One of the first things you should do is to configure the minimum and maximum memory you want SQL Server to use. See https://technet.microsoft.com/en-us/library/ms180797(v=sql.105).aspx for more information. Otherwise SQL Server will sooner or later use up all of your memory. If you run out of physical memory a low memory condition event is raised, which can cause SQL Server to flush its caches completely, which in turn can cause query timeouts. It is therefore important to set reasonable values for min/max SQL Server memory and test them before going into production.
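
    A minimal example of how these limits can be configured with sp_configure (the values are placeholders, pick them based on your workload) looks like this:

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'min server memory (MB)', 2048;
    EXEC sp_configure 'max server memory (MB)', 8192;
    RECONFIGURE;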

    If you are after a SQL Server memory leak you can query the SQL server diagnostics tables directly like this:

    SELECT
      count (*),
      sum(pages_in_bytes)/1024.0/1024.00 'Mem in MB',
      type
      FROM [master].[sys].[dm_os_memory_objects]
      group by type 
      order by  sum(pages_in_bytes) DESC

    to query your SQL Server for allocated objects, which can give you a hint what is being leaked. There are quite a few known memory leaks in SQL Server. It therefore makes a lot of sense to stay at the latest SQL Server patch level if you tend to lose some memory to unknown memory black holes after weeks of operation. When you look at a real SQL Server process

    image

    you can pretty safely assume that its working set is a lie. Instead you can use the commit size as a good approximation of its current physical memory usage, because nearly all SQL Server memory is stored in large pages which cannot be paged out. This is only true if SQL Server has been granted the privilege to lock pages, though.

    Windows 10 Memory Compression

    It is time to come back to the original headline. One of the big improvements of Windows 10 is that it compresses memory before writing it into the page file. With the Anniversary edition of Windows 10 this feature is now more visible in Task Manager, where the amount of compressed memory is also displayed.

    image

    Another change of the Anniversary update is that originally the System process owned all of the compressed pages. MS decided that too many users were confused by the large memory footprint of the System process, because it held all of the compressed memory in its working set. Now another hidden process owns all of the compressed memory, which shows up in Process Explorer/Process Hacker under the name Memory Compression. It is a child of the System process. In ETW traces it is called MemCompression. These caches are therefore not visible in the process list of Task Manager, except for the (Compressed) number in the memory overview which tells you how much working set the Memory Compression process currently has.

    The compression and decompression is performed single threaded in the Memory Compression process. When we look at our original CppEater program and let it run with ETW tracing enabled, we see that the first page touch at every 4 KB for 2 GB of memory takes 366 ms. This was measured in Release/x64 on Windows 10 Anniversary on my Intel i7-4770K @ 3.5 GHz.

    First page touch:   366ms, 0.183 ms/MB, AllTouched=0
    Second page touch:   26ms, 0.013 ms/MB, AllTouched=0
    All bytes touch:    314ms, 0.157 ms/MB, AllTouched=1
    Flushing working set
    After Empty:       4538ms, 2.269 ms/MB, AllTouched=0
    Second After Empty: 314ms, 0.157 ms/MB, AllTouched=1

    When you look at the details you find that the first page access is slow because of soft page faults. The second and third memory accesses are then fast and are only dominated by the CppEater process itself. When the working set is trimmed we find that the MemCompression process does a lot of single threaded work, which takes ca. 6.3 s for 2 GB of memory. The compression speed is therefore about 320 MB/s, which is not bad. After the memory has been compressed and CppEater has slept for some time it is time to touch our memory again. This time the memory causes hard faults, which are not hard in the usual sense where we need to read memory from disk; instead we need to decompress the previously compressed memory. That takes 4.5 s, which results in a decompression rate of about 440 MB/s and shows that decompression is nearly 30% faster than compression.

    It is justified to call it a hard page fault since we are 12 times slower (=4538ms/366ms) on first page access when we hard fault back compressed pages.

    image

    The evolution of the working sets of CppEater and the MemCompression process can nicely be seen with virtual memory snapshots. We see that MemCompression takes all the memory from the modified private pages which are caused by calling EmptyWorkingSet in our process, and puts it into its own working set after it has compressed the memory. There we also see that MemCompression ends up at 1.7 GB of memory, which means that for an integer array consisting of 1,2,3,4,… the compression ratio is 0.85, which is okay-ish. The compression is of course much better if large blocks with identical values can be compressed, so this is really a stress test for the compressor to compress not completely random data.

    image

    You can show the relative CPU costs of each operation nicely with Flame Graphs with the same column configuration.

    image

    The numbers for first page access (soft page faults) are pretty much consistent with the ones Bruce measured at https://randomascii.wordpress.com/2014/12/10/hidden-costs-of-memory-allocation/ where he measured 175 μs/MB for soft page faults. In reality you will always see a mixture of soft and hard page faults, and now also semi-hard faults due to memory compression, which makes the already quite complex world of memory management even more complicated. But these semi-hard faults from compressed memory are still much cheaper than reading the data back from the page file.

    When Does it Page?

    So far all operations were in memory, but when does the page file come into the game? I have made some experiments on my machine and others. Paging (mainly) sets in when you have no physical memory left. When the Active list (that is, in kernel lingo, all used memory except for caches) reaches the installed physical memory of your machine, something dramatic must happen. The used physical memory is not identical with committed memory. If committed memory was never accessed by your application it can stay “virtual” and needs no physical memory backing. Is it a good idea to commit all of your physical memory? If you look at Task Manager on my 16 GB machine then you see I can commit over 19 GB of memory and still have 1.9 GB available, because not all committed memory pages were accessed.

    image

    While I allocate and touch more and more memory the amount of Available and Cached memory goes towards zero. What does that mean? It is the file system cache which you have just flushed out of memory. We all know that the machine reacts very slowly right after it has booted and the hard disk never seems to come to rest. The reason is that after a boot all dlls need to be loaded from disk because the file system cache is not yet populated. If you want to do your users a favor: never allocate all of the physical memory but leave 15-20% for the file system cache. Your users will thank you for not ruining the interactive performance too much.

    If you still insist on allocating all physical memory then bad things will happen:

    image

    When no more memory is available the OS decides that all processes are bad guys and tries to flush out everything onto the Modified list (the brown region in the Memory Utilization graph). That is memory which is pending to be written to the page file. That makes sense if one or more processes never access that memory anymore due to huge memory leaks. That way the OS can place the memory leak into the page file and continue working for a much longer time. At this point large amounts of data are written into the page file. When that happens the system becomes unusable: the UI hangs, a simple printf call takes over six seconds and you experience a sudden system hang. The system is not really hanging, but it is quite busy reading and writing data to and from the page file at that point in time. These high response times of hard disks are the main reason why SSDs are a much better choice for your page file. It is the slow reading from the page file which causes the large slowdowns, in this case 16.3 (see the Disk Service Time column in the Utilization by Disk graph, which was filtered for the page file), and this can be greatly reduced by placing the page file on an SSD. Since Windows 10 the kernel tries its best to compress the memory before writing it into the page file, which reduces the slow disk IO by ca. 40%.

    When the system is paging, the first memory touch times rise dramatically from ca. 240 ms to 1800 ms (see ETW Marks graph), which is over seven times slower, because it takes time to write out old data to make room for new memory allocations.

    Cross Process Private Page Sharing

    While measuring the effects of memory compression I stumbled upon an interesting effect. Start e.g. 8 processes where each of them allocates one GB of random data which is initialized with

        pData[i] = rand();

    in a loop. Now each process has a working set of one GB and we allocate 8 GB of physical memory in total. When we trim the working set of each process we should see 8 GB of memory in the working set of the MemCompression process, because random data does not compress well. But instead I saw only about 1 GB of data in the MemCompression process! How can Windows compress random data to 1/8 of its original size? I think it can’t. Something else must be happening here. The numbers are only pseudo random, and rand() generates the same sequence of values in every process because they all start from the same default seed. We therefore have identical “random” data in all processes in their own private memory.
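
    A minimal sketch of what each of these processes could look like (the buffer size and the Sleep at the end are illustrative choices of mine, not the actual CppEater source):

        // Sketch of the per process allocation: 1 GB filled with pseudo random data.
        // Without calling srand() every process produces the same rand() sequence,
        // so all processes end up with byte-identical "random" pages.
        #include <windows.h>
        #include <cstdlib>

        int main()
        {
            const size_t size = 1024 * 1024 * 1024; // 1 GB per process
            int* pData = static_cast<int*>(VirtualAlloc(nullptr, size,
                                            MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
            if (!pData)
                return 1;

            for (size_t i = 0; i < size / sizeof(int); i++)
                pData[i] = rand();

            Sleep(INFINITE); // keep the memory alive while the working set is trimmed externally
            return 0;
        }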

    Now let’s change things a bit and initialize the random number generator in each process at process start with a different seed:

        LARGE_INTEGER lint;
        QueryPerformanceCounter(&lint); // high resolution counter, differs for every process start
        srand( lint.LowPart );          // seed rand() with a per process value

    Now we truly get the expected 8 GB of working set in the MemCompression process. It seems that Windows keeps an internal hash table (it could also be a B-tree) of all paged out pages and adds a newly compressed page to the MemCompression process only if it was not already encountered. A very similar technique is employed by hypervisors, where it is called transparent page sharing between VMs.
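
    To make the idea concrete, here is a toy sketch of content based page deduplication: hash every 4 KB page and store it only if its content has not been seen before. This is purely an illustration of the concept and says nothing about how the Windows memory manager really implements it:

        // Toy sketch of content based page deduplication (illustrative only).
        #include <cstdint>
        #include <cstring>
        #include <unordered_map>
        #include <vector>

        constexpr size_t kPageSize = 4096;

        // FNV-1a hash over one page of memory.
        uint64_t HashPage(const uint8_t* page)
        {
            uint64_t h = 14695981039346656037ull;
            for (size_t i = 0; i < kPageSize; i++)
                h = (h ^ page[i]) * 1099511628211ull;
            return h;
        }

        class PageStore
        {
        public:
            // Returns true if an identical page was already stored and can be shared.
            // Hash collisions with different content are handled naively here.
            bool AddPage(const uint8_t* page)
            {
                uint64_t hash = HashPage(page);
                auto it = store_.find(hash);
                if (it != store_.end() && memcmp(it->second.data(), page, kPageSize) == 0)
                    return true; // identical page already stored, no new memory needed
                store_[hash] = std::vector<uint8_t>(page, page + kPageSize);
                return false;
            }

        private:
            std::unordered_map<uint64_t, std::vector<uint8_t>> store_;
        };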

    Below you see the working set of the MemCompression process when CppEater is started the first time with no random seed, where all duplicate pages can be combined. The green bars mark the times from when the working set was flushed until the CPU consumption of MemCompression was flat again. In the second run CppEater was seeded with a unique seed for the random number generator. No page sharing can happen because all pages are different now, and we see the expected eightfold increase in the MemCompression working set when we force the working sets of the CppEater instances to be paged out.

    image

    It is interesting that this vastly different behavior can easily be seen in the kernel memory lists. The working set of MemCompression simply counts toward the Active List. The only difference is that after the flush of the working sets the Active List memory remains much larger than in the first case (blue bar).

    Conclusions – Is EmptyWorkingSet Good?

    In general it is (nearly) never a good idea to trim your own working set. The mental model that you are simply giving physical memory back to the OS, which then dutifully writes it into the page file, is simply wrong. You are not releasing memory; you are growing the Modified List, which is flushed out to disk from time to time. If you do this a lot it can cause so much disk write load, together with some random access page-in activity, that your system will come to a sudden halt while you still have plenty of physical memory left. I have seen 128 GB servers which never got above 80 GB of active memory because the system was so busy paging out memory due to forced calls to EmptyWorkingSet that it already behaved as if it had reached the physical memory limit, with the usual large user visible delays caused by paging. Once we removed the offending calls to EmptyWorkingSet, the user perceived performance and memory utilization became much better.
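
    For completeness, this is roughly what such an offending working set trim looks like in code (EmptyWorkingSet needs psapi.lib; SetProcessWorkingSetSize with -1/-1 has the same effect). The sketch only shows the API, it is not a recommendation:

        // Sketch: forcibly trimming the own working set. The pages move to the
        // Modified list and are compressed/written to the page file later, which
        // is exactly the behavior that can kill interactive performance.
        #include <windows.h>
        #include <psapi.h> // link with psapi.lib

        void TrimOwnWorkingSet()
        {
            // The psapi variant ...
            EmptyWorkingSet(GetCurrentProcess());

            // ... and the kernel32 variant: (-1, -1) removes as many pages as possible.
            SetProcessWorkingSetSize(GetCurrentProcess(), (SIZE_T)-1, (SIZE_T)-1);
        }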

    Other people have also tried SetProcessWorkingSetSize to reduce memory consumption but failed as well due to application responsiveness issues. It took much longer to bring all the loose ends together into an article than I initially anticipated. If I have missed important things or you have found an error, please drop me a note and I will update the article.