The Case Of NGen.exe Needing 50 GB Of Memory

This is an old bug which seems to have been in the .NET Framework for a long time, but since it is highly sporadic it was not found until now. I got reports that on some machines NGen.exe used up all of the computer's memory, which led to this pattern in Task Manager:

clip_image002

The biggest process on that machine was always Ngen.exe and everything was very slow. This told me that NGen did not recover from its high memory consumption from time to time, but kept allocating like crazy until the machine had no physical memory left. When an application uses all memory, the OS pages memory out to the hard disk once no physical memory is left. After writing many GB of data to the hard disk NGen can continue to allocate until physical memory is exhausted again, and the OS writes everything to the page file once more. This continues until Ngen.exe finally hits the commit limit, which is the sum of physical memory and page file size (e.g. 128 GB of RAM plus a 64 GB page file gives a 192 GB commit limit), and an Out Of Memory error occurs. Only then does the process terminate.

When NGen went crazy like this, MS support suggested deleting the registry key Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots and its descendants. That fixed the issue, but it remained mysterious why this was happening and what exactly did break. When you delete the registry key, NGen rebuilds its NGen root dll cache automatically. No obvious garbage data was visible inside the registry keys, so the root cause stayed unclear. But finally I got my hands on a machine where the issue was still present, which allowed me to collect more evidence.

What Do We Know?

  • ngen install somedll.dll or ngen createpdb somedll.ni.dll causes NGen.exe to consume many GB of memory
  • NGen breaks due to corrupted registry keys
  • After deleting the registry key below \HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots NGen rebuilds the registry structure by itself, which “fixes” the issue (see the commands sketched below)
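
If you want to apply that fix from a command prompt, something along these lines should work; back up the key first (the backup file name is just an example) and run both commands elevated:

reg export "HKLM\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots" C:\temp\NGenRoots_Backup.reg
reg delete "HKLM\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots" /f

On the next NGen run the Roots cache is rebuilt automatically.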

What data should we get? Part of the answer is based on experience, the other part is more psychological. If you hand over a bug report to someone else you should anticipate that he/she is not familiar with your favorite debugging tool (e.g. Windbg). Filing a bug report with random findings is easy. Filing a bug report which enables the support personnel to get down to the real root cause is much harder. When I have a nicely reproducible bug which I can repeat as often as I want, I tend to get all the data I can. When file/registry issues are involved I would collect some or all of the things below.

Full Scale Data Capturing

  • Capture a procmon trace which will show all accessed registry keys and files
    • That is easy to do and provides a general understanding of which registry keys are accessed
  • Dump the affected files/registry keys
    • E.g. export the registry hive \HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService from regedit so someone can take a look at the accessed data
    • That is much easier than crawling through a memory dump to reconstruct the accessed registry/file contents
  • Capture several memory dumps while the problem builds up
    • procdump is a great tool to take several dumps in a row once a specific condition has been reached (see the example after this list)
  • Capture ETW traces to get the full picture
    • They potentially show the evolution of the problem, but whether that data is useful depends highly on the skill set of the support engineer
  • Capture a Time Travel Trace
    • This gives a lot of detail but it can be time consuming to analyze a multi GB trace. If you or the support engineer do not know exactly where to look you will only produce a multi GB random data file which is not used at all
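
For the memory growth at hand, a condition based procdump call could look roughly like this (the 2000 MB commit threshold is just an example, check procdump -? for the exact switches):

procdump -ma -m 2000 -n 3 ngen.exe

This should write three full dumps of the running ngen.exe once its commit charge exceeds 2000 MB.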

Getting the right data which somebody else can work with is tricky since you do not know which tools the other person is most comfortable with. Normally you will start with the easiest data capturing method and, based on the analysis of the data you got, decide what else is missing to get the full picture. That usually involves capturing more data with a different tool. But if you capture everything from a procmon trace up to a time travel trace the chances are good that you can reduce the investigation time from weeks down to minutes. Let's start with the easy data gathering approach first:

Getting A Procmon Trace

Procmon is a SysInternals Utility which can record all process starts along with all file and registry accesses. When you start it you can add a filter for the process of interest. In our case it is ngen.exe.

image

Since Procmon records all system events the memory needed by this tool can become quite large. To prevent that it is most of the time better to discard all uninteresting events from memory by checking File – Drop Filtered Events.

image

That is important if you need to wait for hours for an incident to occur. You should always get the latest version of Procmon since from time to time memory leaks or other issues are fixed which could otherwise hinder a successful data collection. The gathered data can then be saved as a PML file which can be loaded into the tool again on a different machine. When saving the data choose a file name which describes the problem. Naming is hard, but be precise about what the trace actually contains. In a few weeks even you will not know what that file was for.

image

Under the hood Procmon uses ETW to gather the data. What does that mean? For every file/registry access and process start/dll load you get a full call stack if you click on the event properties, which can already tell you as much as a memory dump:

image

In our NGen case we find that NGen deserialized the native image roots from a registry list. That is a very powerful capability, but you need to set the symbol server and the path to dbghelp.dll from a Windbg installation (usually x64) to get valid call stacks. There is one caveat: Procmon cannot decode stack traces from managed code, which limits this otherwise great tool mainly to unmanaged stack trace analysis.

Dump The Registry Keys

Who said that data collection is difficult? When we look at the accessed NGen registry keys we find a large list of all NGenned dlls “cached” in the registry. This is done for performance reasons. The .NET Framework has always had a soft spot for the registry: NGen uses it and so does the GAC. Reading the list from the registry is much faster than traversing over 4000 directories just for the 64 bit NGenned dlls.

image

Since the corruption is data dependent we can simply export the whole NGenService tree into a text file, which will hopefully help to diagnose the data corruption.
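
The export can also be scripted from the command line, e.g. with

reg export "HKLM\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService" C:\temp\NGenService.reg

where the output file name is of course just an example.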

image

The resulting text file was over 200 MB in size. It is unlikely that you will find the root cause by reading a 200 MB text file line by line. We need more clues about where to look.

Can We Already Solve?

Pattern identification is a very important skill you need to develop if you want to analyze an issue. One of the most powerful analysis methods, if not the most powerful, is differential analysis. Usually you have a good case and a bad case which you can compare to see where the behavior starts to diverge. But it also works the other way around to find common patterns. The presence or the absence of a deviation can both be useful hints. It is a good idea to capture the data not only once but several times to be able to find stable patterns in the data.

By looking at the procmon registry trace we can first filter only for queries of registry contents below the Roots node.

image

There we find that the last accessed registry value is always the same one: a binary value named ImageList. The call stack of that event is not particularly enlightening, but it is a strong hint that either this was simply the last value read and one of the previously read values contained invalid data, or that this value itself is the corrupted one. Let's check the contents of the ImageList value:

image

Hm. Not sure if that is the problem. Let's get more data.

Capture Memory Dumps

There are many ways to capture memory dumps, but the most flexible tool besides DebugDiag is procdump. It is a simple command line tool which can trigger the creation of a memory dump in very sophisticated ways. In our case it is straightforward: we want to start NGen and then take 3 dumps with 1 s in between because the memory leak grows very fast.

C:\Windows\assembly\NativeImages_v4.0.30319_64\System\0c9bec7e4e969db233900a4588c91656>procdump -s 1 -n 3 -ma -x c:\temp ngen.exe createpdb system.ni.dll c:\temp

ProcDump v9.0 – Sysinternals process dump utility
Copyright (C) 2009-2017 Mark Russinovich and Andrew Richards
Sysinternals – http://www.sysinternals.com

Process:               ngen.exe (14168)
CPU threshold:         n/a
Performance counter:   n/a
Commit threshold:      n/a
Threshold seconds:     1
Hung window check:     Disabled
Log debug strings:     Disabled
Exception monitor:     Disabled
Exception filter:      [Includes]
*
[Excludes]
Terminate monitor:     Disabled
Cloning type:          Disabled
Concurrent limit:      n/a
Avoid outage:          n/a
Number of dumps:       3
Dump folder:           c:\temp\
Dump filename/mask:    PROCESSNAME_YYMMDD_HHMMSS
Queue to WER:          Disabled
Kill after dump:       Disabled

Press Ctrl-C to end monitoring without terminating the process.

Microsoft (R) CLR Native Image Generator – Version 4.7.2556.0
Copyright (c) Microsoft Corporation.  All rights reserved.
[23:16:42] Timed:
[23:16:42] Dump 1 initiated: c:\temp\ngen.exe_171212_231642.dmp
[23:16:42] Dump 1 writing: Estimated dump file size is 1418 MB.
[23:16:44] Dump 1 complete: 1419 MB written in 2.4 seconds
[23:16:46] Timed:
[23:16:46] Dump 2 initiated: c:\temp\ngen.exe_171212_231646.dmp
[23:16:47] Dump 2 writing: Estimated dump file size is 4144 MB.
[23:17:42] Dump 2 complete: 4145 MB written in 55.8 seconds
[23:17:44] Timed:
[23:17:44] Dump 3 initiated: c:\temp\ngen.exe_171212_231744.dmp

The command line parts are -s 1 to wait one second between each dump, -n 3 to take three dumps before it exits, -ma to take a full memory dump, and -x which expects the dump folder as its first argument; all remaining arguments are the executable and its command line arguments. If you look at the command line parameters you will find a lot more. The output is a little frightening at first, but there is a secret switch (procdump -? -e) which prints a lot of useful examples of how procdump is meant to be used. Actually this switch is not secret, but nearly no one reads the large command line help until the end, which is the reason I spell it out explicitly.

We can load the dump file into the new Windbg which automatically gives us a nice call stack window where NGen was just allocating memory:

image

That information should be sufficient for any support engineer to drill down to the root cause. To make full sense of the call stack you need local variables, which are not part of the public symbols of MS. For us outsiders that is as far as we can analyze the problem. Really? Let's have a look at the method names. NGen deserializes a root array of native image roots from the registry. While it is deserializing a specific root object it deserializes something in a method named BinaryDeSerializeLogicalImageList. That sounds suspiciously like the binary registry value ImageList from our registry dump. If only we knew which registry key it was just deserializing. This involves a bit of poking around in the dark: I would expect that the stack between DeSerialize and BinaryDeSerializeLogicalImageList hopefully contains the registry key name somewhere.

With the k command we get the call stack and the current stack pointers

0:000> k
 # Child-SP          RetAddr           Call Site
00 000000c9`5f2fdcc0 00007ffe`327a8912 ntdll!RtlpLowFragHeapAllocFromContext+0x2a
01 000000c9`5f2fdda0 00007ffe`05eebde6 ntdll!RtlpAllocateHeapInternal+0xf2
02 000000c9`5f2fde60 00007ffe`05eec700 mscorsvc!operator new+0x30
03 000000c9`5f2fde90 00007ffe`05eed445 mscorsvc!ArrayOfPointers::CreateAndAppendNode+0x2c
04 000000c9`5f2fded0 00007ffe`05eed7f1 mscorsvc!Configuration::BinaryDeSerializeLogicalImageList+0xcd
05 000000c9`5f2fe060 00007ffe`05eeffb0 mscorsvc!Configuration::DeSerialize+0x206
06 000000c9`5f2fe300 00007ffe`05ee81b2 mscorsvc!Root::DeSerialize+0x379
07 000000c9`5f2fe630 00007ffe`05eecd98 mscorsvc!RootList::DeSerializeRoot+0x9c
08 000000c9`5f2fe690 00007ffe`05f0b69c mscorsvc!RootList::GetRootArray+0x1a6
09 000000c9`5f2fe960 00007ffe`05f0bb79 mscorsvc!CCorSvcMgr::GetLogicalImageForRootedNI+0xd4
0a 000000c9`5f2fec60 00007ff6`aef17dd7 mscorsvc!CCorSvcMgr::CreatePdb2+0x229
0b 000000c9`5f2ff1c0 00007ff6`aef11f32 ngen!NGenParser::ProcessNewCommandLineOptionsHelper+0x99d
0c 000000c9`5f2ff5d0 00007ff6`aef11d54 ngen!IsNewCommandLine+0x196
0d 000000c9`5f2ff730 00007ff6`aef1276a ngen!trymain+0x19c
0e 000000c9`5f2ffd90 00007ff6`aef126f8 ngen!wmain+0x4e
0f 000000c9`5f2ffe20 00007ffe`30221fe4 ngen!BaseHolder,&Delete,2>,0,&CompareDefault,2>::~BaseHolder,&Delete,2>,0,&CompareDefault,2>+0x2a6
10 000000c9`5f2ffe50 00007ffe`327eef91 kernel32!BaseThreadInitThunk+0x14
11 000000c9`5f2ffe80 00000000`00000000 ntdll!RtlUserThreadStart+0x21

The brute force method is to dump the stack from start to end with

0:000> db c9`5f2fde90  c9`5f2ff1c0

000000c9`5f2fe710  18 e7 2f 5f c9 00 00 00-43 00 3a 00 2f 00 41 00  ../_....C.:./.A.
000000c9`5f2fe720  6e 00 79 00 4e 00 61 00-6d 00 65 00 57 00 69 00  n.y.N.a.m.e.W.i.
000000c9`5f2fe730  6c 00 6c 00 44 00 6f 00-2e 00 64 00 6c 00 6c 00  l.l.D.o...d.l.l.
000000c9`5f2fe740  00 00 74 00 75 00 62 00-73 00 2e 00 49 00 6e 00  ..t.u.b.s...I.n.
000000c9`5f2fe750  74 00 65 00 72 00 6f 00-70 00 2c 00 20 00 56 00  t.e.r.o.p.,. .V.
000000c9`5f2fe760  65 00 72 00 73 00 69 00-6f 00 6e 00 3d 00 31 00  e.r.s.i.o.n.=.1.
000000c9`5f2fe770  30 00 2e 00 30 00 2e 00-30 00 2e 00 30 00 2c 00  0...0...0...0.,.
000000c9`5f2fe780  20 00 43 00 75 00 6c 00-74 00 75 00 72 00 65 00   .C.u.l.t.u.r.e.
000000c9`5f2fe790  3d 00 4e 00 65 00 75 00-74 00 72 00 61 00 6c 00  =.N.e.u.t.r.a.l.
000000c9`5f2fe7a0  2c 00 20 00 50 00 75 00-62 00 6c 00 69 00 63 00  ,. .P.u.b.l.i.c.
000000c9`5f2fe7b0  4b 00 65 00 79 00 54 00-6f 00 6b 00 65 00 6e 00  K.e.y.T.o.k.e.n.
000000c9`5f2fe7c0  3d 00 33 00 31 00 62 00-66 00 33 00 38 00 35 00  =.3.1.b.f.3.8.5.
000000c9`5f2fe7d0  36 00 61 00 64 00 33 00-36 00 34 00 65 00 33 00  6.a.d.3.6.4.e.3.
000000c9`5f2fe7e0  35 00 2c 00 20 00 70 00-72 00 6f 00 63 00 65 00  5.,. .p.r.o.c.e.
000000c9`5f2fe7f0  73 00 73 00 6f 00 72 00-41 00 72 00 63 00 68 00  s.s.o.r.A.r.c.h.
000000c9`5f2fe800  69 00 74 00 65 00 63 00-74 00 75 00 72 00 65 00  i.t.e.c.t.u.r.e.
000000c9`5f2fe810  3d 00 61 00 6d 00 64 00-36 00 34 00 00 00 00 00  =.a.m.d.6.4.....

where we find the registry key which is currently being worked on:

0:000> du 000000c9`5f2fe718
000000c9`5f2fe718  "C:/AnyNameWillDo.dll"

It looks like the ImageList of this dll is corrupted, which caused NGen to go into an infinite loop. A deeper look at the surrounding registry keys from the registry export revealed that another registry value of the previous dll was also corrupted. This is really strange and I have no idea how NGen could manage to corrupt two unrelated registry values, RuntimeVersion (string) and ImageList (binary).

Capture ETW Traces

Based on our previous investigations we should get data about memory allocation, CPU consumption and accessed registry keys, which should give us a good understanding of how the problem evolves over time. To capture ETW data you normally need to download and install the Windows Performance Toolkit which is part of the Windows SDK. But since Windows 10 the command line only tool wpr.exe is part of Windows itself. That can be important if you are working on a machine which is locked down with e.g. Device Guard, where you cannot install new software easily and cannot execute binaries which are not Authenticode signed, which rules out many home grown data collection tools. Normally I use ETWController (http://etwcontroler.codeplex.com/) which enables me to capture mouse and keyboard interactions along with screenshots, which has proven invaluable many times. But on a locked down machine one needs to use the tools which are already there.

C:\WINDOWS\system32>wpr -start CPU -start Registry -start VirtualAllocation -start GeneralProfile

… Ngen …. 

C:\WINDOWS\system32>wpr -stop c:\temp\NgenGoneCrazy.etl

After loading the ETL file into WPA and working out the important metrics like CPU, allocation and registry accesses we get this view:

image

    We find that practically all CPU is spent allocating memory while the method BinaryDeSerializeLogicalImageList was executed. The VirtualAlloc graph shows a frightening allocation rate of 1.4 GB/s, which is the most massive memory leak I have seen in a long time. The last graph shows that the huge allocation rate starts once the ImageList of the dll C:/AnyNameWillDo.dll was read. After that no more registry keys were read, which is a strong indicator that this registry value is the one knocking NGen out.

    After realizing that, it was easy to come up with a minimal registry file which brings NGen down:

    NgenCorrupt.reg

    Windows Registry Editor Version 5.00
    
    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots\C:/AnyNameWillDo.dll]
    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots\C:/AnyNameWillDo.dll\0]
    "ImageList"=hex:00,69,00,6f
    

    When you import that reg file and you execute

    ngen.exe createpdb system.ni.dll c:\temp

    in the directory where the native image of System.ni.dll is located, NGen will explode. But beware: at least on Windows 10 my machine froze and never recovered once all physical memory was allocated. It seems that older Windows editions (e.g. Server 2008 R2) deal with such rapidly allocating applications in a better way and at least let you terminate the application once it has grabbed all physical memory.

    Conclusions

    After having drilled down that far it is pretty clear where NGen breaks, although it remains a mystery how the registry keys were corrupted. The case is currently being investigated at Microsoft, which will hopefully result in a more robust NGen that ignores bogus registry entries and deletes them, so that the missing NGen root entries are recreated some time later. The data sent to MS were the memory dumps, an ETL trace, a procmon trace and the reg file to corrupt the registry on a test machine.

    image

    Troubleshooting is not magic, although many people assume magic is happening here. With some experience it is pretty straightforward to capture the relevant data. Analyzing the captured data is indeed a complex undertaking and requires a lot of experience. The goal of this blog post is to help other people dealing with failures to understand what data is needed and why. Capturing the data is much easier and faster than analyzing it. If you capture the right data you will make the work of others trying to help you a lot easier.

    I always tell people that it makes no sense to assign one member of a team as the troubleshooter and send him to a Windbg/ETW training. First of all, if someone is assigned to a task he dislikes he will never be good at it. Second, it needs a lot of practice and experience to be able to drill down this deep. If you are trying to analyze such an issue only once every few months you will not have the necessary skills to use the tools correctly. If you want to bring a member of your team to a level where he/she can troubleshoot hard issues, that person must be willing to look at a memory dump nearly every day. If no one raises a hand for this task you can spare the money for the Windbg/ETW training. But it makes sense to bring all team members to a level where everyone understands what data is needed, so that a few specialized people can look into the issues efficiently because they get enough of the right data to successfully nail the root cause. As a positive side effect more people will get used to these tools and some will want to drill deeper. These are the ones you should send to a Windbg/ETW training.

    That’s all for today. Remember: Great tools are useless. Tools become great when people use them.


    Bringing The Hardware And Windows To Its Limits

    I like to experiment a lot, which often leads to surprising results. Sometimes I compare performance engineering to quantum mechanics: in quantum mechanics a measurable value (observable) is not determined until the measurement is performed. I stretch quantum mechanics here and claim that you have bad performance if you never measure. Or to put it another way:

    The probability to get a well performing system without quantitative repeatable regular performance measurements is zero.

    When you look deep enough you will find not so well known things. Memory allocation and access performance is such a topic; it could span whole books because it is a fundamental thing most application developers are not aware of.

    What do I mean with that? Lets perform an experiment:

    1. Allocate 2000MB of memory.
    2. Measure the access performance of every 4096th byte of the allocated memory.
    3. Repeat the measurement to get consistent results a second time.

    Below is a small C++ application to do this:

    #include <chrono>
    #include <cstdio>    // printf
    
    class Stopwatch
    {
    public:
        Stopwatch()
        {
            _Start = std::chrono::high_resolution_clock::now();
        }
    
        void Start()
        {
            _Start = std::chrono::high_resolution_clock::now();
        }
    
        std::chrono::milliseconds Stop()
        {
            _Stop = std::chrono::high_resolution_clock::now();
            return std::chrono::duration_cast<std::chrono::milliseconds>(_Stop - _Start);
        }
    private:
        std::chrono::high_resolution_clock::time_point _Start;
        std::chrono::high_resolution_clock::time_point _Stop;
    };
    
    #pragma optimize( "", off )
    void Touch(void *p, size_t N)
    {
        char *pB = (char *)p;
        char tmp;
        for (size_t i = 0; i < N; i += 4096)
        {
            tmp = pB[i];
        }
    
    }
    #pragma optimize("", on)
    
    int main()
    {
        const int NBytes = 2 * 1000 * 1024 * 1024; // 2000 MB of memory
        char *bytes = new char[NBytes];
        Stopwatch sw;
        Touch(bytes, NBytes );  // touch every 4096th byte
        auto ms = sw.Stop();
        printf("Did touch %d bytes in %lld ms\n", NBytes, ms.count());
        sw.Start();
        Touch(bytes, NBytes);   // second access: the pages are already in the working set
        ms = sw.Stop();
        printf("Did touch 2 %d bytes in %lld ms\n", NBytes, ms.count());
    }

    When we execute it the numbers look promising

    D:\>\MemAllocPerf\x64\Debug\MemAllocPerf.exe
    Did touch 2097152000 bytes in 13 ms
    Did touch 2 2097152000 bytes in 0 ms

    13 ms for the first access and 0 ms the second time. This is pretty good even for a debug build. For completeness let's execute the same thing as a Release build, because everyone tells you that you should never ever trust performance values from debug builds.

    D:\>\MemAllocPerf\x64\Release\MemAllocPerf.exe
    Did touch 2097152000 bytes in 377 ms
    Did touch 2 2097152000 bytes in 0 ms

    The second time still looks good, but what has happened to the first access performance? The release build has become 30 times slower! How can this be? Let's step through it with a debugger and check the memory consumption in Task Manager after the allocation but before we have touched the memory.

    Debug

    image

    Release

    image

    Well that is interesting. Both versions have committed 2000 MB of memory, but the debug version already has it in its working set. The release build consumes basically zero physical memory. Just in case you need a quick recap of what commit size and working set mean (a small sketch to query both values for the current process follows the list):

    • Commit size is the amount of memory you did allocate with new, malloc, calloc, GlobalAlloc, …
    • Working Set is the physical memory the operating system has assigned to your process (real RAM chip usage). The working set can be smaller than the commit size because the operating system can page out data of your process to make room for other memory hungry applications.
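
    A minimal sketch to query both values for the current process could look like this (using GetProcessMemoryInfo from psapi; PrivateUsage is the commit charge, WorkingSetSize the physical memory):

    #include <windows.h>
    #include <psapi.h>
    #include <cstdio>
    
    #pragma comment(lib, "psapi.lib")
    
    int main()
    {
        PROCESS_MEMORY_COUNTERS_EX pmc = {};
        pmc.cb = sizeof(pmc);
        if (GetProcessMemoryInfo(GetCurrentProcess(), (PROCESS_MEMORY_COUNTERS*)&pmc, sizeof(pmc)))
        {
            // Commit charge (what Task Manager calls commit size) and working set of this process
            printf("Commit:      %llu MB\n", (unsigned long long)(pmc.PrivateUsage / (1024 * 1024)));
            printf("Working Set: %llu MB\n", (unsigned long long)(pmc.WorkingSetSize / (1024 * 1024)));
        }
        return 0;
    }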

    Ok, so this means the OS paged out my data for the release build? No, not in this case. There is a wrinkle to it. All operating systems try to be as lazy as possible and move the costs of memory allocation and usage to the latest possible point in time.

    The Operating System View Of Memory Allocation And Access

    When an allocation happens the OS first needs to check if the process has enough address space left for the allocation. This is especially true for x86 processes, which can address only 4 GB of memory, where we can run out of free addresses for our allocation because the address space is fragmented like below. We could allocate the memory, but there is no address space hole big enough to satisfy the reservation request.

    image

    I have never seen that happen in x64 processes, but it is a pretty common issue in x86 processes. This is called reserving memory in Windows lingo. Apart from checking if enough address space is available in the process nothing else happens, which makes this a super fast operation.

    The second stage is to commit memory. The new[] and malloc functions will usually reserve and commit the memory in one go with a call to VirtualAlloc on Windows. This time the OS needs to do some further checks:

    • Check if allocation size > Commit Limit
      • The Commit Limit is the maximum memory all applications together can allocate. It is the sum of the physical memory + size of the page file.
    • If the page file does not have a fixed size the OS might need to grow the page file for the requested memory, which can take quite some time, to ensure that the allocation request can be served from physical or page file backed memory.

    After reserving the address space and committing the memory the OS guarantees that your newly allocated memory can be served either from the page file or (more likely and much faster) from physical memory.

    image

    (RAM Image Source https://en.wikipedia.org/wiki/File:Laptop_RAM.jpg)

    You see the dotted lines? The OS only guarantees that you can access the memory, but it is still not assigned to your process. The OS has returned you a pointer to memory, but you still do not have the memory in your process; all memory pages in your process are still empty! Now comes the expensive part, although it involves no API call at all. To force the OS to actually assign the memory to your process you only need to access it. When you access an empty page the CPU triggers an exception (page fault) and calls back into the operating system. At this time the OS actually assigns the memory to your process working set “database”, where the OS keeps track of which physical pages are backed by real memory or the page file.

    The operation which adds memory to your process working set is called a page fault. If only RAM needs to be assigned to your process it is called a soft page fault (fast). If you access paged out memory a hard page fault happens (slow … very slow), which causes the OS to start a read operation from the hard disk that can be a lengthy undertaking.

    After the soft/hard page fault your application finally takes over and you can access the memory without any OS interference.
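
    The whole sequence can be made explicit with VirtualAlloc. The sketch below (sizes chosen for illustration) reserves address space first, commits it in a second step and only then touches every page, which is where the soft page faults actually happen:

    #include <windows.h>
    #include <cstdio>
    
    int main()
    {
        const SIZE_T size = 2000ull * 1024 * 1024;   // 2000 MB, as in the test above
    
        // Stage 1: reserve address space only. Cheap, no physical memory is involved yet.
        char* p = (char*)VirtualAlloc(nullptr, size, MEM_RESERVE, PAGE_READWRITE);
        if (p == nullptr) return 1;
    
        // Stage 2: commit the memory. The OS checks the commit limit (and may grow the
        // page file), but still no page is part of our working set.
        VirtualAlloc(p, size, MEM_COMMIT, PAGE_READWRITE);
    
        // Stage 3: touch every page. Each first access raises a soft page fault and only
        // now the pages are added to the process working set.
        for (SIZE_T i = 0; i < size; i += 4096)
        {
            p[i] = 1;
        }
    
        printf("Touched %llu MB\n", (unsigned long long)(size / (1024 * 1024)));
        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }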

    image

    If the system runs low on memory the least used memory pages are removed from your working set and their contents are put into the page file. I have indicated that with the dotted lines pointing to the page file in the picture above. The next time you access that memory you get hard page faults, which are the source of most sluggish system behavior. It has become much less of a problem if you are lucky enough to have the page file on an SSD, which has pretty good random access times.

    Ok, that was a pretty long excursion into the details of memory management. But why is the debug build so much faster, and why is all of the memory already in our process working set right after the allocation when I use the debug build? When we examine the memory contents returned by the allocation request we find a byte pattern (CD CD):

    image

    If you look up what that means at https://stackoverflow.com/questions/370195/when-and-why-will-an-os-initialise-memory-to-0xcd-0xdd-etc-on-malloc-free-new you will find that the debug C runtime initializes, and hence accesses, the memory before returning the pointer to the calling code. That is the reason why the debug build was so much faster: the soft fault performance hit already happened at allocation time because the memory was initialized to the CD CD byte pattern. That is one of the very few cases where the measured performance of a debug build is much better than that of a release build, because the most expensive part of the memory access happened before we started the measurement.

    Memory Copy And Soft Fault Performance Do NOT Scale On Windows

    That was an interesting case, but I am just getting started. Let's suppose we want to read a large file from disk as fast as possible. Most often the file was already accessed by the application some time ago and it is still in the file system cache. If you execute the read operation the OS has nothing to do except copy the memory from the file system cache to the buffer in your application. When the buffer was just freshly allocated by new[], how fast can we get? Copying the data from the file system cache is a problem that calls for parallelization to speed things up. The interesting question is how fast we can get if multiple threads copy data from A to B in parallel.

    Since memory access has some hidden costs upon first access it makes sense to measure the memory copy performance once for a freshly allocated destination array and a second time with the same one. The test essentially should do the following (a minimal sketch of the copy loop follows the list):

    1. Allocate a large array e.g. 2000 MB
    2. Fill it with random data which will be our source array
    3. Allocate a 2000 MB target array
    4. Start 1-n threads
    5. Each thread copies a sub range of source to destination
    6. Measure the time of steps 4-5
    7. Repeat Step 4-6 for a second measurement
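
    A rough sketch of the parallel copy (steps 4-6) could look like this; the real measurement code is in the FastPageFault repository linked further below, this is only to illustrate the idea (the random fill of step 2 is simplified to a memset here):

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <thread>
    #include <vector>
    
    // Copy the source buffer into the destination with nThreads threads,
    // each thread copying its own contiguous sub range.
    void ParallelCopy(const char* src, char* dst, size_t bytes, int nThreads)
    {
        std::vector<std::thread> threads;
        size_t chunk = bytes / nThreads;
        for (int i = 0; i < nThreads; i++)
        {
            size_t offset = i * chunk;
            size_t len = (i == nThreads - 1) ? bytes - offset : chunk;
            threads.emplace_back([=]() { memcpy(dst + offset, src + offset, len); });
        }
        for (auto& t : threads) t.join();
    }
    
    int main()
    {
        const size_t bytes = 2000ull * 1024 * 1024;
        char* src = new char[bytes];
        memset(src, 1, bytes);              // step 2 (simplified): touch and fill the source
        char* dst = new char[bytes];        // step 3: freshly allocated, not yet touched
    
        auto start = std::chrono::high_resolution_clock::now();
        ParallelCopy(src, dst, bytes, 4);   // steps 4-5: first copy includes the soft page faults
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start);
        printf("First copy:  %lld ms\n", (long long)ms.count());
    
        start = std::chrono::high_resolution_clock::now();
        ParallelCopy(src, dst, bytes, 4);   // step 7: second copy hits already resident pages
        ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start);
        printf("Second copy: %lld ms\n", (long long)ms.count());
        return 0;
    }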

    First let's check the results on different CPUs when we hit a “warm” destination array which exhibits no soft page faults.

    image

    From these numbers we can deduce the “raw” memory copy performance, which at some point saturates the CPU memory bus. The numbers obtained here differ by a large margin from the documented ones:

    CPU               Theoretical Memory Bandwidth GB/s   Max Memory Copy Performance GB/s
    I7 4770K          25.6                                9.3
    E5 2623 v3        59                                  10.3
    Xeon Gold 6148    ???                                 40.8

    At least for my home machine (the 4770K) I get ca. 9.3 GB/s, which is off by a large margin from my 25.6 GB/s. I assume that the maximum memory bandwidth was measured either for read or for write operations, but not for a parallel read/write operation, which would mean that I can multiply my measured values by a factor of two. Then I would arrive at ca. 18 GB/s, which seems ok if I attribute the rest to the cache controller, which also needs some memory bandwidth. The raw memory copy performance depends on the memory bus and the installed memory modules. What happens when I remove one module from a fully populated memory bank? You guessed it: I did another experiment with my home machine. As expected the memory copy performance dropped by 50%, from 9.3 GB/s down to 4.7 GB/s, which gives a strong hint that memory bandwidth saturating applications should always run on machines with fully populated memory banks.

    From the graph above it is clear that having more cores is better up to a certain point where the memory bandwidth is reached and more cores do not help anymore. The brand new Xeon Gold CPUs show an impressive performance where up to 10 memcopy threads still added performance. Now let's change the use case and measure the first access time, which includes the soft page faults.

    image

    Well that is interesting. Adding more cores degrades the soft page fault performance by a large margin. The speed gains added by parallel memcpy are far less than one would expect. To isolate the issue we can measure the page touch time (soft fault performance) with an application similar to the one above, just with some multi threading added. We then get this:

    image

    The soft page fault performance of Windows 10 does not scale with the number of cores. Instead it decreases with the number of cores! There seems to be a sweet spot around 4 concurrent touch threads where we become a bit faster, but beyond that the soft fault performance in all cases gets worse compared to the single threaded case. If something unexpected like this happens it is time to use a profiler to drill deeper. Below is a CPU sampling graph which shows the parallel page touch with 1-40 threads for a 2000 MB array.

    image

    As I feared, there is some internal locking in the Windows soft fault implementation which makes soft faulting a mostly single threaded operation. If multiple threads try to access the memory we get high CPU times in the method ExpAcquireSpinLockExclusiveAtDpcLevelInstrumented, which wastes many CPU cycles until it finally gives up, ending in a highly contended lock. Remember: locks do not scale. That is the reason why the small amount of actual soft page fault work (yellow) stays constant while the overhead caused by the lock explodes. In my opinion this does not make much sense when we get more and more cores to play with but the soft fault implementation still uses a process wide lock.

    Another view of the same data, a Flame Graph, shows the overhead even better. The selected part below is the actual useful work; the huge rest is the overhead of the spinlock.

    image

    One would think that soft page fault handling is an essential OS service that should be implemented as fast as possible. Apparently that is not the case for Windows 8 and 10 (https://stackoverflow.com/questions/45024029/windows-10-poor-performance-compared-to-windows-7-page-fault-handling-is-not-sc). The issue described there was solved by using VirtualLock, which essentially soft faults the pages from one thread; that is much better than doing it concurrently from many threads, as you can see from the graph above. I do not know what you think, but I have the strong feeling that Microsoft should improve the soft page fault code to finally enter the multi core era. It would be interesting to compare the numbers with Linux, because Linus Torvalds seems to be vigilantly watching the soft page fault implementation in the Linux kernel.
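
    A rough sketch of that VirtualLock workaround (buffer size illustrative, and assuming the working set quota can be raised far enough) pre-faults the buffer from a single thread before the parallel copy starts:

    #include <windows.h>
    #include <cstdio>
    
    int main()
    {
        const SIZE_T size = 2000ull * 1024 * 1024;
        char* p = (char*)VirtualAlloc(nullptr, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        if (p == nullptr) return 1;
    
        // VirtualLock may only lock up to roughly the minimum working set size, so raise it first.
        SetProcessWorkingSetSize(GetCurrentProcess(), size + 64ull * 1024 * 1024, size + 128ull * 1024 * 1024);
    
        // Lock the pages: this soft faults all of them once from this single thread,
        // without the cross-thread spinlock contention seen above.
        if (!VirtualLock(p, size))
        {
            printf("VirtualLock failed: %lu\n", GetLastError());
        }
        VirtualUnlock(p, size);   // we only wanted the pre-faulting side effect
    
        // ... now copy into p from multiple threads without paying for soft page faults ...
    
        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }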

    With SSDs getting nearly as fast as RAM the soft fault performance becomes a limiting factor in multithreaded applications which would otherwise be memory bus constrained.

    Even seemingly simple things like allocating and accessing memory can have a dramatic impact on application performance and scalability. The test application above is pretty simple, but if you start asking why the performance is as it is and you try to optimize it, you quickly reach operating system limits and not much later the limits of the hardware. That is actually a good thing because it means that your application is fast as hell.

    If you want to test the soft page fault/memcopy performance for yourself you can check out https://github.com/Alois-xx/FastPageFault, which should be pretty self explanatory. That is the application I used to produce the charts above.

    Update 1

    Windows 10 Fall Creators update contains a fix for the soft page fault performance. First some numbers from my home machine with 4 physical cores:

    image

    The use case was copying a 2000 MB buffer into not yet touched memory with 1-N threads.

    Creators Update

    Due to the internal locking we never got close to the actual memory bandwidth, because the kernel was busy with its spinlocks while updating the data structures that add the pages to our process working set. The copy performance peaked at around 6.5 GB/s.

    Fall Creators Update

    The soft page fault implementation is now much more scalable and we can max out our memory bandwidth of 9.x GB/s with only 3 cores, since the soft page fault work can now be fully distributed across threads!

     

    When we zoom deeper into the soft page fault performance we find that even the single thread soft fault performance has become 43% faster and it scales much better now.

    image

     

    Below is the Spinlock CPU vs actual work shown for Creators Update

    image

    and here for Fall Creators Update

    image

    The lock is still there, but much less time is spent in locking things, which is a good thing. How did the MS engineers improve the soft page fault implementation by nearly a factor of two? To see this the WPA diff view is helpful, along with some Excel magic. The diff view in WPA is nice, but in my opinion it still lacks some important features:

    • There is no way to mark methods which were removed or added except by drilling to the bottom of the call stacks which is very tedious
    • I want to filter only for added/removed methods, which would be very helpful to spot the actual difference and not its consequences deeper in the call stacks

    Anyway. There is nothing Excel cannot fix for us. Below is the pimped diff view of the page touch performance test.

    image

    What stands out is that the Fall Creators Update soft page fault implementation has far fewer method calls. The Windows kernel internal list (all red marked list management methods) used to maintain the page list was removed in favor of a simpler data structure with better cache locality and less locking. Besides getting rid of the kernel linked list, the biggest change seems to be that pages are now handled by default in 64 KB instead of 4 KB chunks, which means the kernel needs to update the page table structures 16 times less often. At least that is my impression, given that the method MiGet64KPage consumes most CPU and looks like it was introduced with the Fall Creators Update.

    What about Windows Server?

    According to MS support the soft page fault fix should have made it into Windows Server 1709, which is a desktop-less server edition aimed at container and cloud workloads. That is nice, but I need a fix for Server 2016. The Windows Server 2016 soft page fault performance affects all applications, especially the memory hungry ones. Moving forward to an incompatible, still very new server edition with a shorter long term support contract is not an option.

    It is pretty hard to get hard facts from MS support about which issue is fixed in which OS version. The question “Is the issue in that ticket fixed in the build I am running?” seems no longer to be easily answerable. That definitely should be improved.

    It Is Time For Time Travel

    Windbg is one of the most powerful yet underused tools in every Windows developer's toolbox. Some time ago a new, fancier version of Windbg (Windbg Preview) which looks like a modern application was released as a Windows Store app. So far the UI has got a nice ribbon, but under the hood no real changes were visible. Today a new version was released which supports a tool that has long been used internally, the time travel tracer (TTT).

    The official documentation https://docs.microsoft.com/de-de/windows-hardware/drivers/debugger/debugging-resources is quite good and here is the blog post announcing Time Travel Debugging https://blogs.msdn.microsoft.com/windbg/2017/09/25/time-travel-debugging-in-windbg-preview/.

    If you have a hard to debug race condition you can now attach Windbg and check “Record process with Time Travel Debugging”:

    image

    The trace file can grow to many GB in size, which can make it cumbersome to reproduce an error which needs some time to surface. Luckily Windbg, although a Store app, is still xcopy deployable. Search with Process Explorer for the location of the debugger exe and copy the contents of the directory C:\Program Files\WindowsApps\Microsoft.WinDbg_1.0.13.0_x86__8wekyb3d8bbwe\ to your own tool folder like Windbg_1.0.13.0. Then you can run it e.g. from a memory stick or a network share as usual. After searching a bit in the debugger directories you will find the command line version of the Time Travel Trace tool located at

    • x64  Windbg_1.0.13\amd64\TTD\TTD.exe
    • x86 Windbg_1.0.13\amd64\TTD\wow64\TTD.exe

    Now you can go to the machine where the problem occurs without the need to install a Store app, which might not be possible due to corporate policy, firewall or an isolated test network. To record a time travel trace from the command line I normally use ring buffer mode with a 2 GB buffer, which should cover a few seconds or minutes depending on the application activity.

     

    D:\Tools\Windbg_1.0.13\amd64\TTD>ttd -ring -maxfile 2048 -out c:\temp\test.run -launch D:\cppError.exe
    Microsoft (R) TTD 1.01.02
    Release: 10.0.16366.1000
    Copyright (C) Microsoft Corporation. All rights reserved.

    Launching D:\cppError.exe

    cppError.exe(x86) (1040): Tracing stopped after 31ms
      Ring trace dumped to c:\temp\test.run

     

    You get a 2 GB file although the actually recorded data might be much smaller. If you have a short repro it might be better to skip the ring buffer setting.

    image

    Once you have the data it is time to leave the crime scene, pack the .run file and analyze it on your favorite developer machine. You can double click the .run file or fire up Windbg and select the Open Trace option. Once you have loaded the trace you can press g to let the application run until the first exception happens, or you can set breakpoints. If nothing is set the application will stop at the first exception with the actual source window and the current values of the local variables:

    image

    We find that whenever i is 5 we run into an issue, which you could also have found with a memory dump. But now you can travel back in time by entering p- to check the values just before the crash. This is incredibly powerful for finding the root cause of how you got into a situation. If Time Travel Debugging works it is a great tool. Just keep in mind that it makes the application around 10x or more slower, so you should not expect great performance while Time Travel recording is running.
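
    A typical workflow, assuming data breakpoints work in your trace and using a placeholder address, is to set a break-on-write on the corrupted variable and let the trace run backwards to the instruction which wrote it:

    0:000> ba w4 000000aa`64d3e950
    0:000> g-
    0:000> k

    ba w4 sets a data breakpoint on a four byte write at the (here hypothetical) address, g- replays the trace backwards until that write is hit, and k then shows the call stack of the code which modified the value. With p- and t- you can continue to step backwards line by line or instruction by instruction.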

    As far as I can tell it looks like Time Travel Tracing is built upon the Intel Processor Tracing feature https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing which enables recording the full instruction stream along with all touched memory which is a really powerful feature.

    Wishes

    With every great tool there are things left to be desired.

    • Windbg supports no managed source code window
      • No managed breakpoint setting in source code
    • No managed local variables
    • No managed call stack
    • SOS.dll seems not to work at all with time travel traces

    The debugger shell seems to exist only in the x64 flavor, which makes it impossible to load SOS.dll for 32 bit applications into Windbg because the bitness of SOS.dll must match the bitness of the debugger executable. When I try to get a mixed mode stack inside Windbg, SOS.dll can be loaded but it seems to be missing vital information. I really would like to use Windbg with time travel tracing support for managed code (regular .NET Framework and .NET Core), but for now this is a C/C++ fan boy's tool only.

    Conclusions

    Time Travel Tracing is a novel debugging technique which enables us developers to solve otherwise nearly impossible to find data races in a deterministic way. I hope that the Windbg team will add managed code support in the near future, which would bring back feature parity between languages. Why should only C/C++ developers get all the great tools?

    The Mysterious Lost Memory Which Belongs To No Process

    Sometimes you experience sluggish performance for no apparent reason. These are usually the harder issues since it is not clear what is causing them. Things will become a bit messy, but I will give some advice on how you can pinpoint sluggish system performance with basic OS tools and some advanced tools which are rarely used in public.

    The Symptom

    A server machine became very slow once the in use memory reported by Task Manager reached ca. 70% (e.g. 90/128 GB) of the available memory. There was still plenty of free memory, but when a new application was started it was very slow and the C drive was very busy. A first suspicion was that the application dlls were loaded from disk again, which did not really make sense since the application had been started many times before; everything should have been in the file system cache for a long time already.

     

    First Observations

    If you are on a Windows server and you want to view the disk IO graphs in Task Manager like on your normal Windows (8 and later) you need to enable the Disk IO graphs explicitly with

    C>diskperf -y 

    A closer look at disk IO in Task Manager showed that from the very beginning the drive hosting the page file was writing to it at ca. 90 MB/s. We were certainly paging out a lot of data, but why was this happening when only 50/128 GB were in use?

    image

    What was going on here?

    Advanced Tools

    To see where the memory is going we can use RAMMap from Sysinternals. It can show the system wide memory consumption of processes and the kernel as well. This is the only tool which can also show the lost memory, because if it is allocated it must at least belong to the kernel, which is shown too.

    image

    Here we find that the modified memory is coming from shareable memory. Shareable memory is memory which is not backed by a physical file. The only way I am aware of to create such memory is to create a file mapping with no file name, which creates a page file backed shared memory section. The suspicion is that we have lost memory to page file backed sections which for some reason landed in the Modified list, which is the signal for the kernel to write them back to their backing store. In this case that is the page file, which nicely supports our theory because we see high page file write rates.

    That is all nice, but the question is: which process is doing this? When I looked through the processes with Task Manager I found no process with a very high commit or working set memory. If nothing helps, try to reproduce the problem in isolation so you can analyze it better.

    Create A Repro

    The easiest repro is to create a file mapping, map it, write to it and then unmap it while the file mapping object is not closed.

    In C# this is a few lines of code:

            private unsafe void Allocate()
            {
                // CreateFileMapping
                var file = MemoryMappedFile.CreateNew(null, BytesToAlloc);
    
                Mappings.Add(file); // root object 
    
                var accessor = file.CreateViewAccessor(); // MapViewOfFile
    
                var handle = accessor.SafeMemoryMappedViewHandle;
                byte* ptr = null;
                handle.AcquirePointer(ref ptr);
                var pLong = (long*)ptr;
    
                Random rand = new Random();
                long nr;
                // fill page file allocated memory with random data which cannot be compressed
                // by Windows 10 memory compression to see the real impact of memory allocation.
                for (long i = 0; i < BytesToAlloc / 8L; i++)
                {
                    // shift the first random int into the upper 32 bits; a plain int << 32 would be a no-op in C#
                    nr = (long)(((ulong)(uint)rand.Next() << 32) | (uint)rand.Next());
                    *(pLong + i) = nr;
                }
                TotalAllocated += BytesToAlloc; 
    
                handle.ReleasePointer();
                accessor.Dispose();        // UnmapViewOfFile
                Console.WriteLine($"Modified: {ModifiedBytes:N0} bytes");
            }

    The full source code can be found here: https://1drv.ms/f/s!AhcFq7XO98yJgfZdGGK8ahaOrVnggg. With this sample application we can indeed nicely reproduce the issue: every time the memory is unmapped the kernel takes our mapped memory away from the process working set and puts it onto the modified page list.

    Now, depending on the OS, different things happen. On Server 2008-2016 the memory is immediately written to the page file. On Server 2016 you could also enable memory compression if you wish (see https://aloiskraus.wordpress.com/2016/10/09/how-buffered-io-can-ruin-performance/). On Windows 10 machines, on the other hand, the memory is compressed by default and kept in memory; it is written to disk much later, once physical memory gets low. Since we have reproduced the symptom we can search for ways to find the offending process more easily.
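
    If I remember correctly, memory compression can be toggled with the MMAgent PowerShell cmdlets (run elevated, and double check the behavior on your OS version):

    PS C:\> Get-MMAgent                          # shows whether MemoryCompression is currently enabled
    PS C:\> Enable-MMAgent -MemoryCompression    # turn memory compression on (may need a reboot to take effect)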

    Let's look at Task Manager when the process has eaten up all my physical memory:

    image

    Nope, not a single number which points towards high memory consumption. That is not surprising, because the memory was already unmapped and no longer belongs to any process. But our process still has a file mapping handle open, which is reason enough for the kernel to keep the data alive. Let's take a dump and look for file mapping objects. In kernel lingo these are called Section objects.

    Windbg – Dump Section Objects

    To dump all file mappings and their sizes we can use the !handle command. The first argument 0 means dump all handles, the flag value f (0xf) means dump all handle information, and the last argument limits the output to file mapping (Section) objects only.

    0:000>!handle 0 f Section
    
    …
    
    Handle 0000000000000734
      Type             Section
      Attributes       0
      GrantedAccess    0xf0007:
             Delete,ReadControl,WriteDac,WriteOwner
             Query,MapWrite,MapRead
      HandleCount      2
      PointerCount     65537
      Name             <none>
      Object specific information
        Section base address 0
        Section attributes 0x8000000
        Section max size 0x40000000
    24 handles of type Section
    

    We can sum up all open file mapping objects manually, or use a relatively unknown Windbg extension called mex (see https://www.microsoft.com/en-us/download/details.aspx?id=53304). Mex is a Windbg extension written in managed code, which is the reason for its name (managed extension). It is used quite a lot by support engineers within MS because it packs a lot of functionality into a single dll, which makes deployment pretty easy. It can even decompile C# code from a memory dump with some help from ILSpy, which is pretty useful. When you have downloaded the zip file, put the dll of the right bitness into the Windbg\winext folder where Windbg stores its extensions so you can load it simply by its name.

    Now we can turn the tedious summing of all shared memory section sizes into a pretty one liner:

    0:000>.load mex
    0:000> !sum !cut -c 22-35 !grep -r "size" !handle 0 f Section
    0x10000
    0x1000
    0x11000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x2000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x1000
    0x1
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    0x40000000
    Total = 0n19327504385 (0x480025001)
    

    The !sum command is pretty self explanatory. !grep filters the output of !handle for the lines which contain the size, and !cut then extracts the number from each of those lines at a fixed offset and length. This reminds me of my old Linux days, where my command lines looked pretty similar to the Windbg commands above.

    From the dump it is clear that 19 GB of file mappings have been created. But Task Manager is not showing anything. Even the trustworthy VMMap tool

    image

    is not showing anything related to 19 GB of allocated shared memory. If you suspect one specific process to be responsible you can use Windbg to check whether the process has a file mapping leak. If this is the problem, either someone forgot to close the file mapping or the shared memory was unmapped too early.

    You can set breakpoints on Windows 8+ at kernelbase, or on earlier Windows versions at kernel32, where the methods are actually defined. Windows 8 introduced API sets (https://msdn.microsoft.com/en-us/library/windows/desktop/hh802935(v=vs.85).aspx), which allow the OS guys to split formerly huge dlls into smaller parts where the implementation might live in a different dll. You need to be careful to set the breakpoints at the right location depending on the OS you are currently working on.

    0:000> bp kernelbase!CreateFileMappingW
    0:000> bp kernelbase!UnmapViewOfFile
    0:000> bp kernelbase!UnmapViewOfFileEx

    When you have hit a breakpoint and you are running managed code, load the SOS dll to get the managed call stacks:

    .loadby sos clr

    Then it is easy to find out who, for example, created the file mappings:

    KERNELBASE!CreateFileMappingW:
    00007ffe`06280f20 4883ec48        sub     rsp,48h
    0:000> !ClrStack
    OS Thread Id: 0x59f8 (0)
            Child SP               IP Call Site
    000000aa64d3e720 00007ffe06280f20 [InlinedCallFrame: 000000aa64d3e720] Microsoft.Win32.UnsafeNativeMethods.CreateFileMapping(Microsoft.Win32.SafeHandles.SafeFileHandle, SECURITY_ATTRIBUTES, Int32, Int32, Int32, System.String)
    000000aa64d3e720 00007ffdd724c3e8 [InlinedCallFrame: 000000aa64d3e720] Microsoft.Win32.UnsafeNativeMethods.CreateFileMapping(Microsoft.Win32.SafeHandles.SafeFileHandle, SECURITY_ATTRIBUTES, Int32, Int32, Int32, System.String)
    000000aa64d3e6e0 00007ffdd724c3e8 DomainBoundILStubClass.IL_STUB_PInvoke(Microsoft.Win32.SafeHandles.SafeFileHandle, SECURITY_ATTRIBUTES, Int32, Int32, Int32, System.String)
    000000aa64d3e800 00007ffdd723014b System.IO.MemoryMappedFiles.MemoryMappedFile.CreateCore(Microsoft.Win32.SafeHandles.SafeFileHandle, System.String, System.IO.HandleInheritability, System.IO.MemoryMappedFiles.MemoryMappedFileSecurity, System.IO.MemoryMappedFiles.MemoryMappedFileAccess, System.IO.MemoryMappedFiles.MemoryMappedFileOptions, Int64)
    000000aa64d3e890 00007ffdd72169af System.IO.MemoryMappedFiles.MemoryMappedFile.CreateNew(System.String, Int64, System.IO.MemoryMappedFiles.MemoryMappedFileAccess, System.IO.MemoryMappedFiles.MemoryMappedFileOptions, System.IO.MemoryMappedFiles.MemoryMappedFileSecurity, System.IO.HandleInheritability)
    000000aa64d3e910 00007ffdd7436bd0 System.IO.MemoryMappedFiles.MemoryMappedFile.CreateNew(System.String, Int64)
    000000aa64d3e950 00007ffd8c1c08a0 FileMappingModifiedBytes.PageFileAllocator.Allocate() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 60]
    000000aa64d3ea50 00007ffd8c1c0736 FileMappingModifiedBytes.PageFileAllocator.SlowlyAllocateUntilOOM() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 49]
    000000aa64d3ead0 00007ffd8c1c05b1 FileMappingModifiedBytes.Program.Main(System.String[]) [\FileMappingModifiedBytes\FileMappingModifiedBytes\Program.cs @ 20]
    000000aa64d3ed80 00007ffdeb7b6a53 [GCFrame: 000000aa64d3ed80] 

    When the mappings are closed you also get the full call stack:

    0:000> !ClrStack
    OS Thread Id: 0x59f8 (0)
            Child SP               IP Call Site
    000000aa64d3e328 00007ffe06285400 [InlinedCallFrame: 000000aa64d3e328] Microsoft.Win32.UnsafeNativeMethods.UnmapViewOfFile(IntPtr)
    000000aa64d3e328 00007ffdd724b23e [InlinedCallFrame: 000000aa64d3e328] Microsoft.Win32.UnsafeNativeMethods.UnmapViewOfFile(IntPtr)
    000000aa64d3e300 00007ffdd724b23e DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr)
    000000aa64d3e3b0 00007ffdd72305a7 Microsoft.Win32.SafeHandles.SafeMemoryMappedViewHandle.ReleaseHandle()
    000000aa64d3e548 00007ffdeb7b6a53 [GCFrame: 000000aa64d3e548] 
    000000aa64d3e6b8 00007ffdeb7b6a53 [GCFrame: 000000aa64d3e6b8] 
    000000aa64d3e778 00007ffdeb7b6a53 [HelperMethodFrame_1OBJ: 000000aa64d3e778] System.Runtime.InteropServices.SafeHandle.InternalDispose()
    000000aa64d3e870 00007ffdd72309c1 System.IO.MemoryMappedFiles.MemoryMappedView.Dispose(Boolean)
    000000aa64d3e8a0 00007ffdd72307b9 System.IO.MemoryMappedFiles.MemoryMappedViewAccessor.Dispose(Boolean)
    000000aa64d3e8e0 00007ffdd723075a System.IO.MemoryMappedFiles.MemoryMappedViewAccessor.Dispose(Boolean)
    000000aa64d3e920 00007ffddab2b15a System.IO.UnmanagedMemoryAccessor.Dispose()
    000000aa64d3e950 00007ffd8c1c0a48 FileMappingModifiedBytes.PageFileAllocator.Allocate() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 83]
    000000aa64d3ea50 00007ffd8c1c0736 FileMappingModifiedBytes.PageFileAllocator.SlowlyAllocateUntilOOM() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 49]
    000000aa64d3ead0 00007ffd8c1c05b1 FileMappingModifiedBytes.Program.Main(System.String[]) [\FileMappingModifiedBytes\FileMappingModifiedBytes\Program.cs @ 20]
    000000aa64d3ed80 00007ffdeb7b6a53 [GCFrame: 000000aa64d3ed80] 
    

    Now we have enough information to come up with a

    Theory Of The Problem

    This is how we got into that situation:

    1. Someone creates page file backed shared memory as a file mapping (see the repro sketch below).
    2. These file mappings are mapped into the address space of our process with MapViewOfFile.
    3. After usage the view is removed from the process address space with UnmapViewOfFile. The file mapping object itself is closed much later or never.
    4. The kernel removes the memory pages from the process working set and moves them to the Modified List.
    5. No process owns that memory anymore, so the OS is free to page it out to the page file.
    6. After the modified pages have been written to disk they are put onto the Standby List, evicting previously cached data.
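    To make this concrete, here is a minimal C# sketch that produces exactly this pattern. It is a hypothetical reconstruction in the spirit of the tester application visible in the call stacks above (PageFileAllocator), not the original source: it creates page file backed mappings, dirties every page, disposes only the view and keeps the MemoryMappedFile object alive.

    using System;
    using System.Collections.Generic;
    using System.IO.MemoryMappedFiles;

    class PageFileAllocator   // hypothetical reconstruction, name borrowed from the stacks above
    {
        const long ChunkBytes = 100 * 1024 * 1024;                       // 100 MB per mapping
        static readonly List<MemoryMappedFile> Leaked = new List<MemoryMappedFile>();

        static void Main()
        {
            for (int i = 0; ; i++)                                       // allocate until the commit limit is hit
            {
                // Page file backed section (CreateFileMapping against the page file under the hood)
                MemoryMappedFile mapping = MemoryMappedFile.CreateNew("LeakDemo_" + i, ChunkBytes);

                using (MemoryMappedViewAccessor accessor = mapping.CreateViewAccessor()) // MapViewOfFile
                {
                    for (long offset = 0; offset < ChunkBytes; offset += 4096)
                    {
                        accessor.Write(offset, (byte)1);                 // dirty every page
                    }
                } // Dispose unmaps the view (UnmapViewOfFile) ...

                Leaked.Add(mapping);  // ... but the section handle stays open. No process references the
                                      // pages anymore, so they land on the Modified List and get paged out.
            }
        }
    }

    While this runs you can watch the Modified and Standby numbers in Task Manager or RAMMap change exactly as described in the list above.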

    The picture below shows the basic operations the kernel performs.

    image

    The end result is that the unmapped memory of our process is written to the page file and at the same time put into the file system cache, where it replaces older entries. This has far reaching implications. If you unmap the memory but do not close the file mapping object at (almost) the same time, the OS will page the memory out as long as no other process has the same memory mapped. An innocent file mapping handle leak with large memory mappings can flush your complete file system cache at random times, and there is no clear indicator why your or other applications suddenly became much slower.
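    The defensive pattern is correspondingly simple and worth spelling out: dispose the view and the file mapping object together, so the section is destroyed while its pages are still owned. A minimal sketch (my code, not taken from the affected product):

    using System.IO.MemoryMappedFiles;

    static class SafeMappingUsage
    {
        public static void WorkWithTemporaryBuffer()
        {
            using (MemoryMappedFile mapping = MemoryMappedFile.CreateNew("TempBuffer", 100 * 1024 * 1024))
            using (MemoryMappedViewAccessor accessor = mapping.CreateViewAccessor())
            {
                accessor.Write(0, (byte)1);   // ... work with the memory ...
            }   // view and section handle are released together, so the pages are simply freed
                // instead of wandering to the Modified List and from there into the page file
        }
    }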

    On Windows 10 Anniversary and later you can see the effect of flushing the file system cache more directly in Task Manager: the compressed memory goes up and up while the cached memory stored on the standby list becomes less and less.

    image

    Freshly paged out memory seems to have the highest priority in the file system cache and can evict anything else that was inside it. This is also true on server OS editions, where you can check with RAMMap that the file system cache was flushed. That explains why we had so much read activity on the C drive: our file system cache was flushed and the application dlls needed to be loaded from disk again. If you have a large amount of compressed memory in Windows 10 but not much In use memory, the chances are pretty good that you have a file mapping leak somewhere.
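    If you want to keep an eye on this without staring at Task Manager, the Memory performance counters can be polled programmatically. A small sketch; the counter names are assumptions which you should verify with typeperf -q Memory on your machine first:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class MemoryListWatcher
    {
        static void Main()
        {
            // Counter names assumed to exist on Windows 8.1/10 - verify with: typeperf -q Memory
            using (var modified = new PerformanceCounter("Memory", "Modified Page List Bytes"))
            using (var standby = new PerformanceCounter("Memory", "Standby Cache Normal Priority Bytes"))
            {
                while (true)
                {
                    Console.WriteLine("Modified: {0:N0} MB   Standby (normal priority): {1:N0} MB",
                        modified.NextValue() / (1024 * 1024),
                        standby.NextValue() / (1024 * 1024));
                    Thread.Sleep(1000);
                }
            }
        }
    }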

    How Things Look in ETW

    On my Windows 10 Anniversary machine I can nicely see that the paged out memory goes directly into the working set of MemCompression (shown as Memory Compression in Process Explorer/Process Hacker), which is not displayed by the Windows Task Manager. At some point it gets so huge that it needs to page out memory to my SSD with up to 400 MB/s. While MemCompression (purple) allocates more memory, the standby list (green) becomes smaller and smaller.

    image

    With ETW CPU sampling tracing you can also find out who is responsible for the high number of modified bytes which are written to the page file. That is very useful if you do not know which process is causing it and you cannot interrupt the server processes with memory dumps, which might upset people working on the server. Since, based on our theory, the modified bytes grow when we unmap the data, we can simply view all callers of UnmapViewOfFile and follow them until we find the call stacks with the best correlation between the growth of modified bytes and calls to UnmapViewOfFile. With our tester application we get a perfect correlation:

    image

    The blue bars map perfectly to the rise of modified bytes when we unmap a file section. You can also treat the file mapping issue as a handle leak and use ETW handle leak tracing to record all Section objects created by CreateFileMapping, which is a good way to verify that a fix works as expected and no file mappings of this type are left over. WPA is the perfect tool to visually correlate different metrics and check if there is a dependency. You come up with a theory of what depends on what, graph the different things underneath each other and then you see a pattern … or not. Then it is time for the next theory.

    The Real Root Cause

    image

    Problem solved? Only partially. This only explains sluggish behavior when the machine gets into high memory load which then flushes the file system cache. But there were issues right from the beginning even when there still was plenty of data in the file system cache. Can you spot based on the given data above where the problem is?

    .

    .

    .

    We have a nice RAID in the machine, but the pageouts were happening only at 100 MB/s. That is not the write throughput one would expect from an expensive RAID. What I would expect is something like this

    image

    but it was about 5 times slower than usual. Perhaps one disk had failed and the RAID was not scaling properly anymore? Nope. But there was a warning logged somewhere that the RAID battery had failed. The battery in the RAID is there to preserve the data in its internal DRAM write cache during a power failure so it can still be written to the disks. What happens if the battery fails? The RAID turns off its internal cache to prevent data loss on a power failure. We were therefore running the server the whole time without a much wanted cache! I am not the one to take credit for finding the disabled RAID cache. It is all about teamwork where everyone looks at different aspects of a problem to (hopefully) find all issues. Thanks for finding that one. You know who you are.

    Luckily we did not need to wait for a new battery: the cache can be enabled even when the battery is not present:

    image

    That setting enables the RAID cache regardless of the state of the battery, which is exactly what we want. In fact the battery was not broken at all; opening the server and dusting off the battery solved the battery issue as well.

    It’s The Environment Stupid!

    When you write software it is not always the software that causes a change. It still runs on hardware which can change or fail, too. You need to monitor not only your application but also the environment it is running in. If you fail to take that into account you can search for a long time in the wrong or least significant spot.

    Funny side note: I drive to work by car every day. When I refilled my windshield washer fluid yesterday I noticed an unauthorized hardware change:

    image

    A marten

    image

    seems to have brought its own pet toys into my car to make the place more cozy. Next action: install some marten repellent to prevent further hardware changes:

    image

    Have a nice day and keep your hardware marten-safe.

    MS Technical Summit 2016 at Darmstadt

    I had the pleasure of giving a talk about ETW at the MS Technical Summit 2016 in Darmstadt. You can find our slides, the sample code, a crash dump and the resulting sample ETW trace here. It was a great event together with Wolfgang Kroneder from MS. Here is the video:

    https://channel9.msdn.com/Events/microsoft-techncial-summit/Technical-Summit-2016/Deep-Dive-Event-Tracing-for-Windows/player

    For some problems you need custom ETW tracing to find out what is going on. Reference counting bugs are among the nasty ones where it is very hard to find the root cause when they are timing dependent. If you add normal printf or OutputDebugString tracing to the Add/ReleaseReference calls the error usually goes away. Below is the sample code from my talk minus the ETW manifest generated tracing macros. It mimics a reference counting issue where we suspected that someone was calling ReleaseReference too early on a reference counted class, which led to a premature destruction of the reference counted object. Interestingly it was deleted more than once, which resulted in either a heap corruption or a pure virtual function call exit depending on the exact timing.

    The code below calls AddReference once in the main method and then performs many balanced Add/ReleaseReference calls from 5 different threads, which should never cause the reference count to drop to zero.

     

    #include <windows.h>
    #include <stdio.h>   // printf
    #include <vector>
    #include <thread>
    #include <mutex>
    #include "ETWRefCounter.h"  // ETW Tracing header generated from ETW manifest during each build with Pre build command
                                // mc -um $(ProjectDir)RefCounterETW.man  -h $(ProjectDir)  -z $(ProjectName)
                                // ecmangen is the editor to create ETW manifest files. 
    
    
    // Thread safe refcounter base class
    class CRefCounterBase
    {
    public:
        CRefCounterBase() : m_Counter(0) {}
        
        CRefCounterBase(CRefCounterBase &other) = delete;
        CRefCounterBase& operator=(const CRefCounterBase&) = delete;
        
        virtual void DeleteObject() = 0;
    
        // Return new reference count incremented by one
        long AddReference()
        {
            auto lret = InterlockedIncrement(&m_Counter);
            EventWriteAddRefEvent((__int64)this, lret); // trace new value
            return lret;
        }
    
        // Return new reference count decremented by one
        long ReleaseReference()
        {
            auto old = m_Counter;
            auto newValue = InterlockedDecrement(&m_Counter);
            EventWriteReleaseRefEvent((__int64)this, newValue); // trace new value
            if (newValue == 0)
            {
                EventWriteFinalReleaseEvent((__int64)this);
                DeleteObject();
            }
            return newValue;
        }
    
        // Check if RefCount == 1. This can only be safely used when the threads using this object are
        // guaranteed to not change the refcount after this check has been made. 
        bool OnlySingleUser()
        {
            EventWriteIsSingleUserEvent((__int64) this, m_Counter);
            return InterlockedExchange(&m_Counter, m_Counter) == 1;
        }
    
    protected:
        virtual ~CRefCounterBase() { m_Counter = 0; }
    
        long m_Counter;
    };
    
    /// Actual reference counted class
    class RefCountedObject : public CRefCounterBase
    {
    public:
        RefCountedObject() {}
        RefCountedObject(RefCountedObject &other) = delete;
        RefCountedObject& operator=(const RefCountedObject&) = delete;
        virtual ~RefCountedObject() {}
    
        virtual void DeleteObject()
        {
            printf("\nDeleteObject called 0x%p, %d", this, ::GetCurrentThreadId());
            delete this;
        };
    };
    
    
    // Using RefCounter from multiple threads in a balanced way which works
    void AsyncWorker(CRefCounterBase *counter)
    {
        while (true)
        {
            counter->AddReference();
            if (counter->OnlySingleUser())
            {
                // some optimized code not shown here which also causes add/releaseref
            }
            counter->ReleaseReference();
        }
    }
    
    int main()
    {
        EventRegisterRefCounter();
    
        static const int ThreadsCount = 5;
        std::vector<std::thread> threads; // thread list 
    
        auto pCounter = new RefCountedObject();   // construct object
        pCounter->AddReference(); // ensure that refcount does not drop to zero while other threads are working with it.
    
        for (int i = 0; i < ThreadsCount; i++) // start some threads and increment and decrement stuff in a loop
        {
            std::thread th(AsyncWorker, pCounter);
            threads.push_back(std::move(th));
        }
    
        threads[0].join();
    
        EventUnregisterRefCounter();
        return 0;
    }
    

    When you let the exe run it will either run happily forever or crash. During a crash it will terminate with an unhandled exception because of a pure virtual function call like this,

    image

    or with a heap corruption. In reality the code was not as simple but it pretty much boils down to the code above. Can you spot the error?

    Why Is Batman Arkham Knight So Slow?

    The latest Batman PC game had a bad reputation for being buggy and very slow after its first release. Due to some bad outsourcing and project management decisions the PC version did not meet the required quality gates. This quality and performance disaster forced Warner Bros to cancel PC sales for three months until a much improved version was released which did meet customer expectations with regard to performance. The outcome was that the unpatched PC version was sold at 50% off and I could get a very good game for a very low price. I bought the unpatched version, threw away the unpatched DVDs, and downloaded the “real” v1.0 from Steam with the accompanying product code.

    Being aware of the potentially bad performance I kept an eye on the game, but I did not find any rough spots anymore. The only thing that bugged me: when I switched away from the game to do something else and switched back after ca. 30 minutes, the game graphics hung for several seconds. That sounds like a good reason to check performance with my favorite performance recording tool ETWController. Since it also records screenshots I could easily track down in the profiling data what was happening while the game was hanging.

    image

    Since I also get an HTML report I can see exactly where in the timeline I continued the game, although I do not get screenshots of the in-game graphics.

    image

    After stopping the recording I can directly open the ETL file and navigate to the point of slowness in the game by selecting the time region after Screenshot 5.

    image

    When a game hangs it is usually CPU, disk, memory or the network. So let's check what is going on.

    Looking at the CPU consumption we see that the game has several seconds of near zero CPU activity between 10-13 s, which correlates with the observed hang.

    image

    While checking disk IO during the low CPU activity phase I find a lot of page file related IO. When the game resumes it needs to access a lot of its memory which was previously written to the page file, which results in a growth of the game's working set.

    image

    In the graph above I see that I have two page files (on drives C and D) where I wait 3.6 s for 55 MB from drive D and 500 ms for 103 MB from drive C. That makes sense since my C drive is an SSD

    image

    whereas my D drive is a regular hard drive

    image

    That explains why reading from the page file on the D drive takes 7 times longer although only about half of the data is read from it (55 MB in 3.6 s is roughly 15 MB/s, while 103 MB in 0.5 s is about 200 MB/s). Random read access is the worst case scenario for a regular hard disk, whereas my SSD can cope with that read pattern easily.

    I do not even remember enabling a page file on D, but I think I did it some years ago when SSDs still had write durability issues, just to be on the safe side. As it turns out that was a bad decision, and as an easy fix I disabled the page file on drive D. The sudden hangs when switching back to the game were gone after that change.

    Whose Fault Is It?

    That was the easy part. But you can go much deeper to see where the latencies are really coming from. Hard faults which involve disk IO are served in the kernel by the method ntoskrnl.exe!MiIssueHardFault, which takes care of waiting for the pending disk IO of the page file. If we look at which processes waited for page fault IO and visualize the total CPU and wait times we get this timeline:

    image

    1. The hard faults nicely correlate with the disk activity in the graph below.
    2. The Arkham Knight game itself waits only 90 ms for hard page faults (see the Waits(us)_Sum column).
    3. But the System process waits over 10 s (summed across all threads) for page-in activity.

    If we look at who is causing this page-in activity we find that DirectX in the Windows kernel causes quite some IO due to page faults:

    ntoskrnl.exe!MiIssueHardFault
    ntoskrnl.exe!MmAccessFault
    |- ntoskrnl.exe!KiPageFault
    |- ntoskrnl.exe!MiProbeLeafFrame
    |    ntoskrnl.exe!MmProbeAndLockPages
    |    |- dxgmms2.sys!VIDMM_MDL_RANGE::Lock
    |    |    dxgmms2.sys!VIDMM_RECYCLE_HEAP_PHYSICAL_VIEW::LockRange
    |    |    dxgmms2.sys!VIDMM_RECYCLE_RANGE::Lock
    |    |    dxgmms2.sys!VIDMM_RECYCLE_MULTIRANGE::Lock
    |    |    dxgmms2.sys!VIDMM_RECYCLE_HEAP_MGR::ProbeAndLockAllocation
    |    |    dxgmms2.sys!VIDMM_GLOBAL::ProbeAndLockAllocation
    |    |    |- dxgmms2.sys!VIDMM_SYSMEM_SEGMENT::LockAllocationRange
    |    |    |    |- dxgmms2.sys!VIDMM_MEMORY_SEGMENT::CommitResource
    |    |    |    |    dxgmms2.sys!VIDMM_GLOBAL::PageInOneAllocation
    |    |    |    |    dxgmms2.sys!VIDMM_GLOBAL::ProcessDeferredCommand
    |    |    |    |    dxgmms2.sys!VIDMM_WORKER_THREAD::Run
    |    |    |    |    ntoskrnl.exe!PspSystemThreadStartup
    |    |    |    |    ntoskrnl.exe!KiStartSystemThread

    It is not even the game directly: DirectX needs to page in its buffers from previously paged-out memory, which takes several seconds from my slow spinning hard disk. It is possible to track this further down into the wait calls of the game, but that should be proof enough to blame the disk IO of DirectX buffers for the observed sluggish gaming experience.

    Conclusions

    It was not Batman’s fault that he was slow but my misconfigured secondary page file on a spinning hard disk. Removing the page file from the spinning disk solved my performance problem.

    Another observation is that Windows seems to write data out to the first responding device that hosts a page file. That is nice to get as much data as possible into the page file in an efficient manner, but the real test comes when the data is read back again. I think the current page-out algorithm in Windows is inefficient when it can choose between several devices to write its page data to. It should balance the written data with respect to the random access servicing time of each device. In effect, SSDs with their very fast response times should be favored over slow hard disks. It is fine to have several page files on devices with vastly different response times, but Windows should prefer to write most of the data onto the device with the fastest random access time to keep hard page fault pause times as short as possible.

    This once again shows that assumptions about bad performance based on former evidence (the game is slow …) will lead to wrong conclusions about current performance issues.

    Performance problems are much like quantum mechanics: The problem does not exist until you have measured it. Some people come up with smart tests to verify this or that assumption. But it is usually easier to use a profiling tool to get down to the root cause with a more structured approach.

    Show GDI Handles by Type in Windbg

    What are GDI handles? Every GDI (Graphics Device Interface) object has a handle which is associated with one specific process. If you create UI resources (not windows) like

    • Fonts
    • Bitmaps
    • Device Contexts (DC)
    • Pens
    • Regions

    then you are using GDI resources. There is a hard cap of 10,000 GDI objects per process. If you have a memory leak in window related code then the chances are high that your process will not crash due to an out of memory condition but because a generic GDI or GDI+ error occurred. If this happens after a many-hour stress test you will have a hard time figuring out what went wrong if all you have got is a memory dump with an exception like this:

    System.Runtime.InteropServices.ExternalException was unhandled by user code
    Message="A generic error occurred in GDI+."
    Source="System.Drawing"
    ErrorCode=-2147467259
    StackTrace:
       at System.Drawing.Image.Save(Stream stream, ImageCodecInfo encoder, EncoderParameters    encoderParams)
       at System.Drawing.Image.Save(Stream stream, ImageFormat format)

    To make things more complex, .NET tries hard to reuse existing objects and most often avoids GDI and GDI+ entirely. You can create more than 10K Bitmaps in a Windows Forms or WPF application without problems. Not every object on your managed heap is backed by a GDI handle based object! But some of them are. This can be confusing sometimes.
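    You can verify that yourself with a small console sketch. GetGuiResources is the Win32 API that reports the per process GDI object count; the rest of the code and its names are mine:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Drawing;                     // GDI+ bitmaps, reference System.Drawing.dll
    using System.Runtime.InteropServices;

    class GdiHandleCheck
    {
        const uint GR_GDIOBJECTS = 0;

        [DllImport("user32.dll")]
        static extern uint GetGuiResources(IntPtr hProcess, uint uiFlags);

        static uint GdiCount()
        {
            return GetGuiResources(Process.GetCurrentProcess().Handle, GR_GDIOBJECTS);
        }

        static void Main()
        {
            var bitmaps = new List<Bitmap>();
            Console.WriteLine("GDI objects at start: " + GdiCount());

            for (int i = 0; i < 20000; i++)           // far more than the 10K GDI handle limit
            {
                bitmaps.Add(new Bitmap(10, 10));      // pure GDI+ object, no GDI handle involved
            }
            Console.WriteLine("After 20000 GDI+ bitmaps: " + GdiCount());

            for (int i = 0; i < 100; i++)
            {
                bitmaps[i].GetHbitmap();              // now a real HBITMAP is created (and leaked) per call
            }
            Console.WriteLine("After 100 GetHbitmap calls: " + GdiCount());
        }
    }

    Since there is no performance counter for GDI objects (more on that below), GetGuiResources is also the easiest way to add your own GDI handle monitoring to a stress test.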

    GDI handle leaks are not easy to diagnose since there are not many tools out there that can display more than the total count, like Task Manager with its GDI objects column.

    image

    Process Explorer can display pretty much anything, but if you search for GDI objects it will let you down. Process Hacker (an open source Process Explorer clone with more features) on the other hand can display them via the context menu:

    image

    The best tool I have found so far is GDIView from NirSoft. But at least on Windows 8.1 it sometimes seems to display far fewer handles than Task Manager shows, so do not trust the reported numbers blindly. Its All GDI column shows the same value as Task Manager, which should be pretty reliable. If the other columns add up to the total reported value then you should be fine. Interestingly there is no performance counter for GDI objects, which is quite odd since it is still an essential resource, although MS seems to try hard to put GDI out of service.

     

    image

     

    The problem with all of the above tools is that they do not work with memory dumps. Windbg had an extension for this, but it has not been part of the Windbg package for about 10 years and it only worked up to Windows 2000, not even on Windows XP.

    So what can you do? First we need a GDI handle leak. Bitmap.GetHbitmap will create a GDI backed bitmap and return the handle to it. The following code therefore creates a leak of 1000 bitmap GDI objects:

        private void Button_Click(object sender, RoutedEventArgs e)
        {
            Bitmap bmp = new Bitmap(100, 100);  // GDI+ bitmap, no GDI handle yet
            for (int i = 0; i < 1000; i++)
            {
                bmp.GetHbitmap();               // each call creates a new GDI HBITMAP which is never
                                                // released with DeleteObject => 1000 leaked GDI handles
            }
        }

    Ok when I put that into my WPF application I get a nice GDI leak when I press the magic button.

    image
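    Just for completeness: the leak goes away if every returned handle is released again with gdi32!DeleteObject. A sketch with the P/Invoke declaration added by me (needs using System.Runtime.InteropServices); for the rest of this post we of course keep the leak so that there is something to find in the dump:

        [DllImport("gdi32.dll")]
        static extern bool DeleteObject(IntPtr hObject);

        private void Button_Click(object sender, RoutedEventArgs e)
        {
            using (Bitmap bmp = new Bitmap(100, 100))
            {
                for (int i = 0; i < 1000; i++)
                {
                    IntPtr hBitmap = bmp.GetHbitmap();  // creates a GDI HBITMAP copy of the GDI+ bitmap
                    DeleteObject(hBitmap);              // release it again - the GDI object count stays flat
                }
            }
        }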

    Now we need to search the internet for helpful resources on how to dump GDI objects. After some searching I pretty much always land on security related kernel hacking sites, which are an interesting read of their own. These guys really know their stuff and I guess they also have some older copy of the Windows source code lying around. I found https://blog.coresecurity.com/2015/09/28/abusing-gdi-for-ring0-exploit-primitives/ which talks about a fascinating way to abuse bitmaps for kernel exploits which still seems to work today. More importantly, the author also explains that the kernel manages the GDI handles of each process in a specific kernel region which is mapped read-only back into the process address space.

    That is important to know since it gives us a chance to look at GDI kernel structures in user mode process dumps. There is hope to retrieve information about GDI handles in 2016 after all. The next good thing is that the definition of the GDICell structure still seems to be valid on Windows 8 (and on my Windows 10 Home installation):

    typedef struct {
      PVOID64 pKernelAddress; // 0x00
      USHORT wProcessId;      // 0x08
      USHORT wCount;          // 0x0a
      USHORT wUpper;          // 0x0c
      USHORT wType;           // 0x0e
      PVOID64 pUserAddress;   // 0x10
    } GDICell;                // sizeof = 0x18

    The post also tells us where we need to search for the GDI handle table. It is part of the Process Environment Block (PEB) which Windbg has been able to dump for a long time. But !peb will not display the GDI handle table. Instead we need to dump the structure with the ntdll symbols directly.

    .process will print the PEB pointer value which we can then feed into dt:

    0:014> .process
    Implicit process is now 00000000`00923000
    0:014> dt ntdll!_PEB GdiSharedHandleTable 00923000
       +0x0f8 GdiSharedHandleTable : 0x00000000`013e0000 Void

    Now we know the start of the GDI handle table. How big is it? The !address command can tell us that:

    0:014> !address 13e0000
    
    Usage:                  Other
    Base Address:           00000000`013e0000
    End Address:            00000000`01561000
    Region Size:            00000000`00181000 (   1.504 MB)
    State:                  00001000          MEM_COMMIT
    Protect:                00000002          PAGE_READONLY
    Type:                   00040000          MEM_MAPPED
    Allocation Base:        00000000`013e0000
    Allocation Protect:     00000002          PAGE_READONLY
    Additional info:        GDI Shared Handle Table
    
    
    Content source: 1 (target), length: 1000

    It seems we are not the first ones to do this since under Additional info the memory region even has a really descriptive name. That looks promising. With the | command we get the current process id in hex, and now we are ready for a first attempt to search the GDI handle table for GDICell structures which should contain our current PID.

    0:014> |
    .  0    id: 35b8    attach    name: D:\GdiLeaker\bin\Debug\GdiLeaker.exe
    0:014> s -w 13e0000 1561000 35b8
    00000000`013f02f8  35b8 0000 1c05 4005 0000 0000 0000 0000  .5.....@........
    00000000`013f3580  35b8 0000 de05 4005 0000 0000 0000 0000  .5.....@........
    00000000`013f7948  35b8 0000 8005 4005 0000 0000 0000 0000  .5.....@........
    00000000`013f7fd8  35b8 0000 c80a 400a 3370 00d4 0000 0000  .5.....@p3......
    00000000`013facf0  35b8 0000 5e05 4005 0000 0000 0000 0000  .5...^.@........
    00000000`013fb9c8  35b8 0000 0a01 4401 0000 00c8 0000 0000  .5.....D........
    00000000`013fdac8  35b8 0000 9305 4005 0000 0000 0000 0000  .5.....@........
    00000000`013fe428  35b8 0000 4905 4005 0000 0000 0000 0000  .5...I.@........
    ....

    The number of printed lines looks good. Since the kernel structure also has a wType member which should describe the GDI handle type, we dig deeper until we find http://stackoverflow.com/questions/13905661/how-to-get-list-of-gdi-handles which tells us that 05 is the Bitmap GDI handle type. Since the type should follow 6 bytes after the pid we can check our printed memory, and sure enough there is mostly a 5 at that location. The rest of the word seems to be used for other information.

    Let's create a Windbg script out of that and automate the complete process: attach to our GDI leaker and dump the GDI handle table before and after we have leaked 1000 bitmaps.

    Important: If you are running on a 64 bit OS you need to attach the 64 bit Windbg even if you debug a 32 bit application!

    0:013> $$>a<"D:\GdiDump\DumpGdi.txt"
    GDI Handle Table 00000000013e0000 0000000001561000
    GDI Handle Count      14
        DeviceContexts: 4
        Regions:        2
        Bitmaps:        2
        Palettes:       0
        Fonts:          3
        Brushes:        3
        Pens:           0
        Uncategorized:  0
    0:013> g
    0:014> $$>a<"D:\GdiDump\DumpGdi.txt"
    GDI Handle Table 00000000013e0000 0000000001561000
    GDI Handle Count      1021
        DeviceContexts: 8
        Regions:        3
        Bitmaps:        1003
        Palettes:       0
        Fonts:          3
        Brushes:        4
        Pens:           0
        Uncategorized:  0

    Bingo! That looks pretty good. I have tried my script on various processes and the results are promising. It also works with full memory dumps, which provides a long missing piece of important information. Often enough no screenshot of Task Manager with the GDI handle count was taken, so this information is lost if the failed process was terminated in the meantime. In some processes ca. 10-20% of the handle information seems to be missing. Since GDIView shows pretty much the same numbers I hope that this script will not be too far off for your crashed processes. Give it a try, or post some additions if you know more about the internals of GDI.

     

    Here is the code of DumpGDI.txt

    $$ Run as: $$>a<DumpGdi.txt
    $$ Written by Alois Kraus 2016
    $$ uses pseudo registers $t0-$t5 and $t8-$t14
    
    r @$t1=0
    r @$t8=0
    r @$t9=0
    r @$t10=0
    r @$t11=0
    r @$t12=0
    r @$t13=0
    r @$t14=0
    $$ Increment count is 1 byte until we find a matching field with the current pid
    r @$t4=1
    
    r @$t0=$peb
    $$ Get address of GDI handle table into t5
    .foreach /pS 3 /ps 1 ( @$GdiSharedHandleTable { dt ntdll!_PEB GdiSharedHandleTable @$t0 } ) { r @$t5 = @$GdiSharedHandleTable }
    
    $$ On first call !address produces more output. Do a warmup
    .foreach /pS 50 ( @$myStartAddress {!address  @$t5} ) {  }
    
    
    $$ Get start address of file mapping into t2
    .foreach /pS 4 /ps 40 ( @$myStartAddress {!address  @$t5} ) { r @$t2 = @$myStartAddress }
    $$ Get end address of file mapping into t3
    .foreach /pS 7 /ps 40 ( @$myEndAddress {!address  @$t5} ) { r @$t3 = @$myEndAddress }
    .printf "GDI Handle Table %p %p", @$t2, @$t3
    
    .for(; @$t2 < @$t3; r @$t2 = @$t2 + @$t4) 
    {
      $$ since we walk bytewise through potentially invalid memory we need first to check if it points to valid memory
      .if($vvalid(@$t2,4) == 1 ) 
      {
         $$ Check if pid matches
         .if (wo(@$t2) == @$tpid ) 
         { 
            $$ increase handle count stored in $t1 and increase step size by 0x18 because we know the cell structure GDICell has a size of 0x18 bytes.
            r @$t1 = @$t1+1
            r @$t4 = 0x18
            $$ Access wType of GDICELL and increment per GDI handle type
            .if (by(@$t2+6) == 0x1 )  { r @$t8 =  @$t8+1  }
            .if (by(@$t2+6) == 0x4 )  { r @$t9 =  @$t9+1  }
            .if (by(@$t2+6) == 0x5 )  { r @$t10 = @$t10+1 }
            .if (by(@$t2+6) == 0x8 )  { r @$t11 = @$t11+1 }
            .if (by(@$t2+6) == 0xa )  { r @$t12 = @$t12+1 }
            .if (by(@$t2+6) == 0x10 ) { r @$t13 = @$t13+1 }
            .if (by(@$t2+6) == 0x30 ) { r @$t14 = @$t14+1 }
         } 
      } 
    }
    
    .printf "\nGDI Handle Count      %d", @$t1
    .printf "\n\tDeviceContexts: %d", @$t8
    .printf "\n\tRegions:        %d", @$t9
    .printf "\n\tBitmaps:        %d", @$t10
    .printf "\n\tPalettes:       %d", @$t11
    .printf "\n\tFonts:          %d", @$t12
    .printf "\n\tBrushes:        %d", @$t13
    .printf "\n\tPens:           %d", @$t14
    .printf "\n\tUncategorized:  %d\n", @$t1-(@$t14+@$t13+@$t12+@$t11+@$t10+@$t9+@$t8)
    

    When you edit the script with Notepad++ try Perl as the scripting language; the syntax highlighter will do a pretty good job on it. Windbg scripts do not look nice but they work and make dump analysis a lot easier. If you want to use Windbg efficiently I highly recommend reading more about .foreach, .printf and pseudo registers.

    If you e.g. search for a memory leak in a managed process you usually run !DumpHeap -stat -live to see all still reachable objects. If you omit -live you also see all temporary objects which have not yet been collected, which can also be interesting if there is a large difference between the two numbers. Once you have identified a type you can click on the blue link in Windbg, which will then print all object instances of that type.

    Let's try that with FactoryRecord instances, where we get the list of object addresses by clicking in Windbg on the blue link of the !DumpHeap -stat output.

    Statistics:
          MT    Count    TotalSize Class Name
    6db871a0      114         6384 System.Configuration.FactoryRecord

    If we dump one object instance we see that there is an interesting name inside it which could help to diagnose the context in which it was allocated:

    0:014> !DumpObj /d 02c98ca8
    Name:        System.Configuration.FactoryRecord
    MethodTable: 6db871a0
    EEClass:     6db532f8
    Size:        56(0x38) bytes
    File:        C:\WINDOWS\Microsoft.Net\assembly\GAC_MSIL\System.Configuration\v4.0_4.0.0.0__b03f5f7f11d50a3a\System.Configuration.dll
    Fields:
          MT    Field   Offset                 Type VT     Attr    Value Name
    7257e918  4000249        4        System.String  0 instance 02c98c50 _configKey
    7257e918  400024a        8        System.String  0 instance 02c984b0 _group
    7257e918  400024b        c        System.String  0 instance 02c98ad4 _name
    6db86f9c  400024c       2c ...SimpleBitVector32  1 instance 02c98cd4 _flags
    7257e918  400024d       10        System.String  0 instance 02c98b08 _factoryTypeName
    6dc0c698  400024e       20         System.Int32  1 instance      200 _allowDefinition
    6dc0c6d0  400024f       24         System.Int32  1 instance      100 _allowExeDefinition
    6db87168  4000250       30 ...errideModeSetting  1 instance 02c98cd8 _overrideModeDefault
    7257e918  4000251       14        System.String  0 instance 02c86bd8 _filename
    725807a0  4000252       28         System.Int32  1 instance      123 _lineNumber
    7257ecb8  4000253       18        System.Object  0 instance 00000000 _factory
    00000000  4000254       1c                       0 instance 00000000 _errors

    Let's say we are after the group name. If we dump this object

    0:014> !DumpObj /d 02c984b0
    Name:        System.String
    MethodTable: 7257e918
    EEClass:     721bf344
    Size:        50(0x32) bytes
    File:        C:\WINDOWS\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
    String:      system.web/caching
    Fields:
          MT    Field   Offset                 Type VT     Attr    Value Name
    725807a0  4000243        4         System.Int32  1 instance       18 m_stringLength
    7257f35c  4000244        8          System.Char  1 instance       73 m_firstChar
    7257e918  4000248       40        System.String  0   shared   static Empty

    then we see that this FactoryRecord has something to do with system.web/caching. Now we want to dump the group names of all FactoryRecord instances to get an overview.

    .foreach( @$my { !DumpHeap -short -mt 6db871a0  } ) { r @$t0 = @$my; .printf "\n%mu", poi(@$t0+8)+8 }
    ...
    system.transactions
    system.transactions
    
    system.web
    system.web
    system.web
    system.web
    system.web
    system.web
    system.web
    system.web
    system.web
    system.web
    
    ...

    The basic idea is to dump the object addresses of all FactoryRecord instances. Since in .NET the Method Table (MT) pointer is the synonym for a type, we use the MT pointer as argument to !DumpHeap -mt 6db871a0 to print only FactoryRecord instances. The additional -short limits the output to the object addresses only, which makes it a perfect fit for .foreach.

    For some reason you need to assign the loop variable @$my of the .foreach loop to a pseudo register @$t0 to make calculations possible. Oh, and do not forget the spaces around the = in @$t0 = @$my. The @ before the $xx variable is not strictly necessary but it speeds up the debugger evaluation engine a bit.

    To get the pointer to the _group string, which is stored 8 bytes after the FactoryRecord pointer, we read it via poi(@$t0+8). Now we have the pointer to the string object. .NET strings are null terminated UTF-16 strings, so we can use .printf “\n%mu” to print the string contents. On x64 the offset to the char array of a .NET string is 0xc, for x86 it is 8, which is what I have used above. These little shortcuts can make a dump analysis dramatically faster.

    Although I like Visual Studio a lot, for more advanced things you need to use the power of Windbg. Give it a try when you analyze your next dump.