MS Technical Summit 2016 at Darmstadt

I had the pleasure to give a talk at the MS Technical Summit 2016 at Darmstadt about ETW. You can find our slides, the sample code, a crash dump and the resulting sample ETW trace here. It was a great event together with Wolfgang Kroneder from MS. I will put up the link here once the recording is published.

For some problems you need custom ETW tracing to find out what is going on. Reference counting bugs are one of the nasty ones where it is very hard to find the root cause if they are timing dependent. If you add normal printf, OutputDebugString tracing to the Add/ReleaseReference calls the error usually goes away. Below is the sample code from my talk minus the ETW manifest generated tracing macros. It mimic’s a reference counting issue were we suspected that someone was calling ReleaseReference too early on a reference counted class which did lead to a too early destruction of the reference counted object. Interestingly it was being deleted more than once which resulted in either a heap corruption or a pure virtual function call exit depending on the exact timing.

The code below will call AddReference once in the main method and then many times balanced Add/ReleaseReference from 5 different threads which should never cause the reference count to drop to zero.

 

#include <windows.h>
#include <vector>
#include <thread>
#include <mutex>
#include "ETWRefCounter.h"  // ETW Tracing header generated from ETW manifest during each build with Pre build command
                            // mc -um $(ProjectDir)RefCounterETW.man  -h $(ProjectDir)  -z $(ProjectName)
                            // ecmangen is the editor to create ETW manifest files. 


// Thread safe refcounter base class
class CRefCounterBase
{
public:
    CRefCounterBase() : m_Counter(0) {}
    
    CRefCounterBase(CRefCounterBase &other) = delete;
    CRefCounterBase& operator=(const CRefCounterBase&) = delete;
    
    virtual void DeleteObject() = 0;

    // Return new reference count incremented by one
    long AddReference()
    {
        auto lret = InterlockedIncrement(&m_Counter);
        EventWriteAddRefEvent((__int64)this, lret); // trace new value
        return lret;
    }

    // Return new reference count decremented by one
    long ReleaseReference()
    {
        auto old = m_Counter;
        auto newValue = InterlockedDecrement(&m_Counter);
        EventWriteReleaseRefEvent((__int64)this, newValue); // trace new value
        if (newValue == 0)
        {
            EventWriteFinalReleaseEvent((__int64)this);
            DeleteObject();
        }
        return newValue;
    }

    // Check if RefCount == 1. This can only be safely used when the threads using this object are
    // guaranteed to not change the refcount after this check has been made. 
    bool OnlySingleUser()
    {
        EventWriteIsSingleUserEvent((__int64) this, m_Counter);
        return InterlockedExchange(&m_Counter, m_Counter) == 1;
    }

protected:
    virtual ~CRefCounterBase() { m_Counter = 0; }

    long m_Counter;
};

/// Actual reference counted class
class RefCountedObject : public CRefCounterBase
{
public:
    RefCountedObject() {}
    RefCountedObject(RefCountedObject &other) = delete;
    RefCountedObject& operator=(const RefCountedObject&) = delete;
    virtual ~RefCountedObject() {}

    virtual void DeleteObject()
    {
        printf("\nDeleteObject called 0x%p, %d", this, ::GetCurrentThreadId());
        delete this;
    };
};


// Using RefCounter from multiple threads in a balanced way which works
void AsyncWorker(CRefCounterBase *counter)
{
    while (true)
    {
        counter->AddReference();
        if (counter->OnlySingleUser())
        {
            // some optimized code not shown here which also causes add/releaseref
        }
        counter->ReleaseReference();
    }
}

int main()
{
    EventRegisterRefCounter();

    static const int ThreadsCount = 5;
    std::vector<std::thread> threads; // thread list 

    auto pCounter = new RefCountedObject();   // construct object
    pCounter->AddReference(); // ensure that refcount does not drop to zero while other threads are working with it.

    for (int i = 0; i < ThreadsCount; i++) // start some threads and increment and decrement stuff in a loop
    {
        std::thread th(AsyncWorker, pCounter);
        threads.push_back(std::move(th));
    }

    threads[0].join();

    EventUnregisterRefCounter();
    return 0;
}

When you let the exe run it will either run happily forever or crash. During a crash it will terminate with an unhandled exception because of a pure virtual function call like this,

image

or with a heap corruption. In reality the code was not as simple but it pretty much boils down to the code above. Can you spot the error?

How Buffered IO Can Ruin Performance

Paging can cause bad interactive performance. This happens quite often but very little content exists how you can diagnose and fix paging issues. It is time to change that (a bit). I present you here a deep dive into how paging really works for some workloads and how that caused a severe interactive performance issue.

The Observation

It was reported that a software version did significantly worse than its predecessors when the system was under heavy load. Measurements did prove that hard page faults were the issue.

A more detailed analysis showed that the big page out rate of nearly 500MB/s was caused by explicit working set trims. After removing these explicit EmptyWorkingSet calls performance went back to normal. The problem remained though why EmptyWorkingSet was suddenly a problem because these calls were there since a long time.

How Paging Really Works

To make such measurements on Windows 10 I give away some secrets. Some of you might have found the MM-Agent powershell cmdlet already. If you look e.g. at Windows Server 2012 Memory Management improvements then you will find some interesting switches which are also present in Windows 10 Anniversary. You can configure the Windows Memory Management via powershell.

PS C:\WINDOWS\system32> Get-MMAgent

ApplicationLaunchPrefetching : True
ApplicationPreLaunch         : True
MaxOperationAPIFiles         : 256
MemoryCompression            : False
OperationAPI                 : True
PageCombining                : True
PSComputerName               :

On Windows 10 Anniversary the switches MemoryCompression and PageCombining are enabled by default. If you are blaming memory compression for bad performance you you can switch off MemoryCompression by calling

Disable-MMAgent -mc

To enable it again you can call

Enable-MMAgent -mc

The MemoryCompression switch seems not to be documented so far. If MemoryCompression is already enabled and you want to turn it off you need to reboot. If you have disabled memory compression successfully you should see in task manager as Compressed value zero all the time. If you enable memory compression the setting will take immediate effect (no reboot necessary).

Now lets perform a little experiment what happens when we trim the working set of two GB process with my CppEater when memory compression is disabled.

C:\>CppEater.exe 2000

Below is the screenshot when CppEater did flush its working set. This caused some memory to be left over in the Modified list because the memory manager did not see the need to completely flush all of the 2 GB out into the page file.

image

When the memory was flushed you see an IO spike on disk D where my page file resides.

image

So far so expected. Below is a diagram which shows order of operations happening there.

image

In ETW traces this looks like this:

  • When the memory is flushed we have two GB of data in the modified list.
  • Later the OS flushes two GB of data into the page file.

image 

From that picture it is clear that when we then touch the paged out memory again we will need to read two GB of data from the page file. Now lets perform the experiment and see what happens:

image

Our active memory usage rises again by two GB but what is strange that there is zero disk activity when we access the paged out memory. From our mental model we should see two GB of disk IO on the D drive. Lets have a look at ETW traces while page file writing is happening.image

While we write two GB of data to the page file the Standby list increases during that time by the same amount. That means that although the data is persistently written out to the page file we still have all of the written data in the file system cache in the form of the standby list. Since the contents of the page file are still in the file system cache (Standby List) we see no hard page faults when we try to access our paged out data again.

That is a good working hypothesis which we can test now. We only need to access files via buffered IO reads and do not close the file handles in between which will flush the file system cache. Then we should see our CppEater process hitting the hard disk to read its page file contents.

Below is a small application that reads the windows installer cache which is ca. 30 GB in size which is more than enough to flush any existing file system cache contents:

static int Main(string[] args)
{

    var streams = new List<FileStream>();
    byte[] buffer = new byte[256 * 1024 * 1024];
    foreach (var f in Directory.EnumerateFiles(@"C:\windows\installer", "*.*"))
    {
        try
        {
            var file = new FileStream(f, FileMode.Open, FileAccess.Read);
            streams.Add(file);
            while (file.Read(buffer, 0, buffer.Length) > 0)
            {

            }
        }
        catch (Exception)
        { }
    }

    return 0;
}

With a flushed file system cache CppEater.exe has no secret hiding place for its paged out memory anymore. Now we see the expected two GB of hard disk reads minus the still not flushed out modified memory.

image

The picture what happens when data is written to the page file is lacking an important detail: It misses out the fact that the Modified list transitions to the Standby list which is just another name for the file system cache.

image

The Explanation For Bad Performance

Now we have all the missing pieces together. The initial assumption that flushing the working set cause the OS to write the process memory into the page file is correct, but only half of the story. When the page file data is written the memory from the modified list becomes part of the file system cache. When the pages are later accessed again it depends on the current state of the file system cache if we see soft or hard faults with dramatic effects for the observed performance. The bad performing software version did cause more buffered reads than before. That reads did push the cached page file data out of the file system cache. The still happening page faults were no longer cheap soft page faults but hard page faults. That explains the dramatic effects on interactive performance. The added buffered IO reads did surface the misconception that flushing the working set is a cheap operation. Flushing the working set and soft faulting it back again is only cheap if the machine is not under memory pressure. If the memory condition becomes tight or the file system cache gets flushed you will see the real costs of hard page faults. If you still need to access that memory in a fast way the best thing to do is to not flush it. Otherwise you might be seeing random hard page faults even if the machine has still plenty of free memory due to completely unrelated file system activity! This is true for Windows Server 2008 and 2012. With Windows Server 2016 which will also employ memory compression just like Windows 10 things change a bit.

Windows 10 Paging

With memory compression we need to change our picture again. The modified list is not flushed out to disk but compressed and then added to the working set of the Memory Compression process which now acts as cache. Since the page file contents are no longer shared with the standby list we will not see this behavior on Windows 10 or Server 2016 machines with enabled memory compression.

image

When we execute the same use case under Windows Server 2016 where we flush the file system cache with enabled memory compression we will see

image

that the memory from the MemCompression process stays cached and it is semi hard faulted back into the CppEater process in 3s which is much faster than the previous 10s when the page faults were hitting the hard disk. It is therefore a good idea for most workloads to keep memory compression enabled. It not only compresses the memory but the cached pagefile contents are no longer subject to standby list pollution which should make the system performance much more predictable than it was before.

Conclusions

Windows has many hidden caches which make slow operations (like hard page faults) fast again. But at the worst point in time these caches are no longer there and you will experience the uncached bad performance. It is interesting that RamMap does not show the page file as biggest standby list consumer on machines with no memory compression enabled. To prevent such hard to find errors in the first place you should measure what things cost with detailed (for me it is ETW) profiling and then act on the measured data. If someone has a great idea to make things faster you should always ask him for the detailed profiling data. Pure timing measurements can be misleading. If a use case has become 30% faster but you use x3 more memory and x2 CPU is this optimization still a great idea?

Windows 10 Memory Compression And More

If you expect performance gains by using a specific API you must measure on the actual target operating system under a realistic load. One such API where many myths are heard about is SetProcessWorkingSetSize which can be used to trim the current working set of a process. Since a long time you can also use use EmptyWorkingSet which does the same job with a simpler API call. One usage scenario might be that our process is not used for some time so it might be a good idea to page out its memory to make room for other processes which need the physical memory more urgently. The interesting question is: Is this a good idea? To answer that question one needs to understand how memory management works in detail. To begin our journey into the inner workings of the operating system we need to know:

How Is Memory Allocated?

For that we look under the covers how memory allocation works from an operating system perspective and the application developer view. From a developers points of view memory is only a new xxxx away. But what happens when you do that? The answer to that question depends highly on the used programming language and their implementation. For simplicity I stick here to C/C++.

Below is the code for a small program named CppEater.exe that allocates and accesses memory several times before and after it gives back its memory to the OS by trimming its working set.

  1. Allocate 2 GB of data with new.
  2. Touch the first byte of every memory page .
  3. Touch the first byte of every memory page .
  4. Touch all bytes (2GB).
  5. Trim working set (call EmptyWorkingSet).
  6. Wait for 15s.
  7. Touch the first byte of every memory page .
  8. Touch all bytes (2 GB).

// CppEater.cpp source. Update:  Full source is at https://1drv.ms/f/s!AhcFq7XO98yJgcg2Ko4dmxFUmQcAKA

#include <stdio.h>
#include <Windows.h>
#include <Psapi.h>
#include <chrono>
#include <vector>

template<typename T> void touch(T *pData, int dataCount, const char *pScenario, bool bFull = false)
{
    auto start = std::chrono::high_resolution_clock::now();    
    int pageIncrement = bFull ? 1 : 4096 / sizeof(T);
    for (int i = 0; i < dataCount; i += pageIncrement) // touch all memory or only one integer every 4K
    {
        pData[i] = i;
    }
    auto stop = std::chrono::high_resolution_clock::now();
    auto durationInMs = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    char pChars[512];
    sprintf(pChars, "%s: %dms, %.3f ms/MB, AllTouched=%d", pScenario,
         durationInMs, 1.0*durationInMs / ( 1.0 * dataCount * sizeof(T) / (1024 * 1024)), bFull);
    printf("\n");
    printf(pChars);    
    ::Sleep(500);
}

int main()
{
    const int BytesToAllocate = 2 * 1000 * 1024 * 1024;
    const int NumberOfIntegersToAllocate = BytesToAllocate / sizeof(int);
    auto bytes = new int[BytesToAllocate/sizeof(int)];                     // Allocate 2 GB
    touch(bytes, NumberOfIntegersToAllocate, "First page touch");          // touch only first byte in every 4K page
    touch(bytes, NumberOfIntegersToAllocate, "Second page touch");         // touch only first byte in every 4K page
    touch(bytes, NumberOfIntegersToAllocate, "All bytes touch", true);     // touch all bytes

    ::EmptyWorkingSet(::GetCurrentProcess());                               // Force pageout and wait until the OS calms down
    ::Sleep(15000);

    touch(bytes, NumberOfIntegersToAllocate, "After Empty");               // touch only first byte in every 4K page
    touch(bytes, NumberOfIntegersToAllocate, "Second After Empty", true);  // touch all bytes
    return 0;
}

When you record the activity of this program with ETW you find a 2 GB of allocation through the C/C++ heap manager which will end up in VirtualAlloc. VirtualAlloc is to my knowledge the most basic API to request memory from Windows. All heap segment allocations for the C/C++/C# heap go through this API.

|    |    |- CppEater.exe!main

|    |    |    |- CppEater.exe!operator new

|    |    |    |    ucrtbase.dll!malloc

|    |    |    |    ntdll.dll!RtlpAllocateHeapInternal

|    |    |    |    ntdll.dll!RtlpAllocateHeap

|    |    |    |    ntdll.dll!NtAllocateVirtualMemory             2 GB alloc with MEM_COMMIT|MEM_RESERVE

 

The only difference between the C/C++/C# heap managers is that the heap memory segments are differently managed depending on the target language. In C/C++ objects cannot be moved by the heap manager so it is important to prevent heap fragmentation with sophisticated allocation balancing algorithms. The managed (.NET) heap is controlled by the garbage collector which can move objects around to compact the heap but it has to be careful to not fragment the heap as well with pinned objects. Once the heap manager has got its big block of memory it will satisfy all following allocation requests from the heap segment/s. When all of them are full or too fragmented the heap manager will need another round of memory via VirtualAlloc.

It is possible to track down managed and unmanaged memory leaks with VirtualAlloc ETW tracing. That works if the leak is large enough to trigger a heap segment re/allocation. But you need to be careful how you interpret the “leaking” stack because it can also happen that some leak causes allocations in the heap but the heap segment re/allocation happens due to large temporary objects which are not the root cause.

Committed/Private Memory

Committing memory is quite fast and completes in ca. 1ms for a new[2GB] memory allocation request with VirtualAlloc. Was that really everything? When you commit memory you will see no increase in your working set. If you put a sleep after the new operator and you look at the process with your Task Manager then you see this:

image

The 2 GB commit is there but your working set is still only 2 MB. That means from an operating system point of view your application has got only the promise to get so much memory but since you did not yet access any of the committed memory via e.g. a pointer access the OS still had no need to allocate a new 4 K memory page, zero it out and soft fault it into your process working set where you were accessing the memory the first time.  There is quite a lot going on under the covers to make your process address space look flat where no other process can interfere with you and your allocated memory. In reality you have a virtual address space for each process where the OS is responsible to swap physical memory in/out of  your process as it sees fit.  The working set is in a first approximation the RAM your application has allocated in your physical RAM modules on your mainboard. Some things like DLLs are the same in all processes which can be shared by multiple processes. That is the reason why you have a Working Set, Working Set Shared and a Working Set Private column in task manager. If you would add up all working sets of a fully utilized machine it would exceed the installed memory by far because shared memory is counted multiple times. That makes the exact calculation of your actually used memory of an application quite difficult.

An easier metric is to sum up all committed memory which is by definition local to your process. This gives you an upper bound of the physical memory usage. It is an upper bound because some pages have never been touched and are therefore not assigned to any physical memory pages. Besides that large parts of your application might be already sitting in the page file which then also does not consume much physical memory. Process Explorer and Process Hacker have no column named Committed memory. As far as I can tell Committed memory is same as Private Bytes reported by these tools. Just in case you wonder why there is no Committed Bytes column in Process Explorer and Process Hacker. The definition of Process Hacker is quite easy. Private Bytes is the memory which can go to the page file. This excludes things like file mappings which can be read from disc again from the original file and not the page file as long the file mapping was not created with copy on write semantics.

Memory Black Hole – Page File Allocated Memory

That looks like the working set must always be smaller than the commit size which is not always the case. A special case are page file allocated memory mapped files. These do not count to your committed memory. If you touch the memory of a page file backed file mapping then it goes all into your working set.  See below for a 2 GB working set where we have only 4 MB of committed memory!

image

The code to produce such a strange process was to allocate a file mapping from the page file by specifying INVALID_HANDLE_VALUE as file handle. If you touch this memory then you have a large shared working set but zero committed memory. Below is a screenshot of the code which produced that allocation:

image

If you look into this process with VMMap from SysInternals which is a really great tool to look into any process and its contained memory you will find that VMMap and Task manager seem to disagree about the committed memory because page file backed memory is counted as committed by VMMap but not by the the Task Manager. Be careful at which numbers you look at.

image

If you want to look at your system you can use RAMMap also from SysInternals. It will show page file allocated file mapping as Shareable just as VMMap which gives you a good hint if you have somewhere leaked GB of page file baked memory.

image

ETW Reference Set and Resident Set Tracing

There is a bunch of ETW providers available which can help you to find all call stacks which touch committed memory the first time which then causes the OS to soft page fault physical memory into your working set. In ETW lingo this is called Reference Set. These ETW providers work best with Windows 8 and later. There you can see  the call stack for every allocation and page access in the system. This can show you every first page access but it can’t show the hard page faults due to paged out memory (you can use the Hard Faults graph in WPA for that). Although Reference Set tracing is interesting it has a very high overhead and produces many events. For every soft page fault which adds memory to any processes working set an ETW event with the memory type

  • PFMappedSection
  • PageTable
  • PagedPool
  • CopyOnWriteImage
  • KernelStack
  • VirtualAlloc
  • Win32Heap
  • UserStack

and the call stack which did cause the soft page fault can be recorded. That results in many million of events with a lot of big stack traces. On the plus side it helps a lot if you want to exactly know who did touch this memory but the resulting ETW files are quite big and take a long time to parse. The Reference Set graph was added with the Windows 10 Anniversary SDK to WPA.

Reference Set tracing can be enabled with WPR/UI or xperf where the equivalent xperf command line is

xperf -on  PROC_THREAD+LOADER+HARD_FAULTS+MEMORY+MEMINFO+VAMAP+SESSION+VIRT_ALLOC+FOOTPRINT+REFSET+MEMINFO_WS -stackwalk PageAccess+PageAccessEx+PageRelease+PageRangeAccess+PageRangeRelease+PagefileMappedSectionCreate+PagefileMappedSectionDelete+VirtualAlloc

or you can use the predefined xperf kernel group

xperf -on ReferenceSet

image

There is another profile in WPR which is called Resident Set. This is creating a snapshot when the trace session ends of all processes where the memory is allocated. Function wise it like opening VMMap for all processes at a specific point in time. The ETW events used are largely the same as with Reference Set minus the call stacks which reduces the traced data considerably. If you enable Reference Set tracing you also get Resident Set tracing since it is a superset of Resident Set. Although WPA displays the resident set of a specific time point it needs the ETW events belonging to the actual allocations to be able to assign every memory page its owning process. Otherwise you will only know that large portions of your active memory consists of page file allocated memory but you do not know which process it belongs to.

The xperf command line for Resident Set tracing is

xperf -on PROC_THREAD+LOADER+HARD_FAULTS+MEMORY+MEMINFO+VAMAP+SESSION+VIRT_ALLOC+DISK_IO

or you can use the xperf kernel group

xperf -on ResidentSet

Of these many providers the MEMORY ETW provider is the by far most expensive one. It records every page access and release. But without it you will not get any Reference/Resident set graphs in WPA. I am not sure if really all events are necessary but the high amount of data generated by this provider allows only to record very short durations of a busy machine where many allocations and page accesses happen. VirtualAlloc allocated memory is the only exception which can always be attributed to a specific process to the special Page Category VirtualAlloc_PreTrace.

image

Memory Black Hole – Large Pages

There are more things where Task Manager is lying to you. Have a look at this process

image

How much physical memory is it consuming? 68 KB? Lets look at our free memory while CppEater is running:

image

CppEater causing a rise of In use shown in Task Manager from 5.9GB to 7.9GB . Our small application is eating 2 GB of memory but we cannot see this memory attributed to our process! When the system is working wrong we need to look at our memory at system level. For this task RAMMap is the tool to use.

image

Here we see that someone has allocated 2 GB of memory in large pages. I hear you saying: Large what? It is common wisdom that on Intel CPUs the natural page size is 4KB. But for some server workloads it made sense to use larger pages. Usually large pages are 2 MB in size or multiples of that. Some Xeon CPUs even support up to 1 GiB large pages. Why should this arcane know how be useful? Well there is one quite common process which employs large pages quite heavily. It is Microsoft SQL Server. If you happen to see an innocent small sqlserver.exe and for some reason after your 24h load test you miss several GB of memory but no process seem to have allocated it the chances are high that SQL server has allocated some large pages which look small in Task Manager.

There is a tool to see how much memory SQL server really uses: VMMap

image

Large pages manifest itself as Locked WS in VMMap which exactly shows our lost 2 GB of memory. Is this ridiculously small value in Task Manager a bug? Well sort of yes. Trust no tool to 100%. Every number you will ever see with regards to memory consumption is wrong or a lie to some extent. Even Process Explorer shows these nonsensical values. I can only speculate how this small value is calculated. It could be that task manager simply calculates WS Pages * 4096bytes/Page = Working Set bytes.

A specialty of large pages is that they are never paged out and allocated immediately  when you call VirtualAlloc. This is how SQL server can grab large amounts of physical memory with one API call and then plays operating system with the handed out memory by itself. Here is the sample code to allocate memory with large pages. Large parts are only ceremony because you need to get the Lock pages in memory privilege before you can successfully call VirtualAlloc with MEM_LARGE_PAGES.

int * LargePageAlloc(unsigned int numberOfBytesToAllocate)
{
    printf("\nLarge Page allocator used.");
    ETWSetMark("Large Page allocator");

    HANDLE proc_h = OpenProcess(PROCESS_ALL_ACCESS, FALSE, GetCurrentProcessId());

    HANDLE hToken;
    OpenProcessToken(proc_h, TOKEN_ADJUST_PRIVILEGES, &hToken);
    CloseHandle(proc_h);

    LUID luid;
    ::LookupPrivilegeValue(0, SE_LOCK_MEMORY_NAME, &luid);

    TOKEN_PRIVILEGES tp;

    tp.PrivilegeCount = 1;
    tp.Privileges[0].Luid = luid;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;

    auto status = AdjustTokenPrivileges(hToken, FALSE, &tp, sizeof(tp), (PTOKEN_PRIVILEGES)NULL, 0);
    CloseHandle(hToken);

    if (status != TRUE)
    {
        printf("\nSeLockMemoryPrivilege could not be aquired. LastError: %d. Use secpol.msc to assign to your user account the SeLockMemoryPrivilege privilege.", ::GetLastError());
        return 0;
    }

    auto *p = (int *) VirtualAlloc(NULL, numberOfBytesToAllocate, MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES, PAGE_READWRITE);

    if (::GetLastError() == 1314)
    {
        printf("\nNo Privilege held. Use secpol.msc to assign to your user account the SeLockMemoryPrivilege privilege.");
    }

    return p;
}

To try the sample out you need to add your user or group the Lock pages in memory privilege with secpol.msc and logout and login again to make the changes active.

image

When you suspect SQL server memory issues you should check out http://searchsqlserver.techtarget.com/feature/Built-in-tools-troubleshoot-SQL-Server-memory-usage which contains plenty of good advice how you can troubleshoot unexpected SQL server memory usage.

One of the first things you should do is to configure the minimum and maximum memory you want SQL server to have. See https://technet.microsoft.com/en-us/library/ms180797(v=sql.105).aspx for more information. Otherwise SQL server will sooner or later use up all of your memory. If you run out of physical memory a low memory condition event is raised which can cause SQL server to flush its caches completely which can cause query timeouts. It is therefore important to set reasonable values for min/max SQL server memory and test them before going into production.

If you are after a SQL Server memory leak you can query the SQL server diagnostics tables directly like this:

SELECT
  count (*),
  sum(pages_in_bytes)/1024.0/1024.00 'Mem in MB',
  type
  FROM [master].[sys].[dm_os_memory_objects]
  group by type 
  order by  sum(pages_in_bytes) DESC

to query your SQL server for allocated objects which can give you a hint what was being leaked. There are quite some memory leaks known of SQL server. It makes therefore a lot of sense to stay latest at the SQL server patch level if you tend to loose some memory after weeks of operation to some unknown memory black holes. When you look at a real SQL server process

image

you can pretty safely assume that its working set is a lie. Instead you can use the commit size as a good approximation of its current physical memory usage because nearly all SQL server memory is stored in large pages which cannot be paged out. This is only true if SQL server has been granted the privilege to lock pages.

Windows 10 Memory Compression

It is time to come back to the original headline. One of the big improvements of Windows 10 was that it compresses memory before writing it into the page file. With the Anniversary edition of Windows 10 this feature is now more visible in the task manager where the amount of compressed memory is now also displayed.

image

Another change of the Anniversary update is that originally the System process did own all of the compressed pages. MS had decided that too many users were confused by the large memory footprint of the System process because it holds all of the compressed memory in its working set. Now another hidden process owns all of the compressed memory which shows up in Process Explorer/Hacker under the name Memory Compression. It is a child of the System process. In ETW traces it is called MemCompression. These caches are therefore not visible in the process list of task manager except for the (Compressed) number in the overview which tells you how much working set the Memory Compression process currently has.

The compression and decompression is performed single threaded in the Memory Compression process. When we look at our original program of CppEater and let it run with ETW tracing enabled  we see that the first page touch at every 4 KB for 2 GB of memory takes 366ms. This was measured in Release/x64 on Windows 10 Anniversary on my Intel I7-4770K @ 3,5 GHz.

First page touch:   366ms, 0.183 ms/MB, AllTouched=0
Second page touch:   26ms, 0.013 ms/MB, AllTouched=0
All bytes touch:    314ms, 0.157 ms/MB, AllTouched=1
Flushing working set
After Empty:       4538ms, 2.269 ms/MB, AllTouched=0
Second After Empty: 314ms, 0.157 ms/MB, AllTouched=1

When you look at the details you find that the first page access is slow because of soft page faults. The second and third memory accesses are then fast and are only dominated by the CppEater process itself. When the working set is trimmed we find that the MemCompression process is doing a lot of single threaded work which takes ca. 6,3s for 2 GB memory. The compression speed is therefore 320MB/s which is not bad. After the memory has been  compressed and CppEater has slept some time it is time to touch our memory again. This time the memory is causing hard faults which are not hard in the usual way where we need to read memory from the disk but this time we need to decompress the previously compressed memory. That takes 4,5s which results in a decompression rate of 440MB/s which shows that decompression is nearly 30% faster than compression.

It is justified to call it a hard page fault since we are 12 times slower (=4538ms/366ms) on first page access when we hard fault back compressed pages.

image

The evolution of the Working Set of CppEater and the MemCompression process can nicely be seen by Virtual Memory Snapshots where we see that MemCompression gets all the memory from the modified private pages which are caused by calling EmptyWorkingSet in our process and puts it into its own working set after it has been compressed the memory. There we see also that MemCompression ends up at 1.7GB of memory which means that for an integer array which consists of 1,2,3,4,…. the compression rate is 0,85 which is okish. The compression is of course much better if large blocks with identical values can be compressed so this is really a stress test for the compressor to compress not completely random data.

image

You can show the relative CPU costs of each operation nicely with Flame Graphs with the same column configuration.

image

The number for first page access (soft page faults) are pretty much consistent with the ones Bruce did measure at https://randomascii.wordpress.com/2014/12/10/hidden-costs-of-memory-allocation/ where he measured 175 μs/MB for soft page faults. In reality you will always see a mixture of soft and hard page faults and now also semi hard faults due to memory compression which makes the already quite complex world of memory management even more complicated. But these semi hard faults from compressed memory are still much cheaper than to read the data back from the page file.

When Does it Page?

So far all operations were in memory but when does the page file come into the game? I have made some experiments on my machine and others. Paging (mainly) sets in when you have no physical memory left over. When the Active List (that is in kernel lingo all used memory except for caches) reaches the installed physical memory of your machine something dramatic must happen. The used physical memory is not identical with committed memory. If committed memory was never accessed by your application it can stay “virtual” and needs no physical memory backing. Is it a good idea to commit all of your physical memory? If you look at Task Manager on my 16 GB machine then you see I can commit over 19 GB of memory and still have 1,9 GB available because not all committed memory pages were accessed.

image

While I allocate and touch more and more memory the amount of Available and Cached memory goes to zero. What does that mean? It is the file system cache which you just have flushed out from memory. We all know that the machine reacts very slowly after it has booted and the hard disk never seems to come to rest. The reason is that after a boot all dlls need to be loaded from disk because the file system cache is still not yet populated. If you want to do your users a favor: Never allocate all of the physical memory but leave 15-20% for the file system cache. Your users will thank you that you did not ruin the interactive performance too much.

If you still insist to allocate all physical memory then bad things will happen:

image

When no more memory is available the OS decides that all processes are bad guys and tries to flush out everything into the Modified list (brown region in Memory Utilization graph). That is memory which is pending to be written to the page file. That makes sense if one or more processes are never accessing that memory anymore due to huge memory leaks. That way the OS can place the memory leak into the page file and continue working for a much longer time. At this time large amounts data are written into the page file. When that happens the system becomes unusable: The UI hangs, a simple printf call will take over six seconds and you experience a sudden system hang. The system is not really hanging, but it is quite busy with reading and writing data to and from the page file at that point in time. These high response times of hard disks are the main reason why SSDs are a much better choice for your page file. It is the slow reading from the page file which is causing the large slowdowns, in this case 16,3 (see Disk Service Time column in Utilization by Disk graph which was filtered for the pagefile) can be greatly reduced by placing the page file on a SSD. The kernel tries since Windows 10 its best to compress the memory before writing it into the page file to reduce slow disk IO by ca. 40%.

When the system is paging the first memory touch times rise dramatically from ca. 240ms to 1800ms (see ETW Marks graph) which is over seven times slower because it takes time to write out old data to make room for new memory allocations.

Cross Process Private Page Sharing

While measuring the effects of memory compression I stumbled across upon an interesting effect. When you start e.g. 8 process where each of them allocates one GB of random data which was initialized with

    pData[i] = rand();

in a loop. Now each process has one GB of working set and we allocate 8 GB in total of physical memory. When we trim the working set of each process we should see 8 GB of memory in the MemCompression process as its working set because random data does not compress really well. But instead I did see only about 1 GB of data in the MemCompression process! How can windows compress random data to 1/8 to its original size? I think it can’t. Something else must be happening here. The numbers are only pseudo random and always generate the same sequence of values in all processes. We therefore have the identical random data in all processes in their own private memory.

Now lets change things a bit and initialize the random number generator with a different seed in each process at process start with

    LARGE_INTEGER lint;
    QueryPerformanceCounter(&lint);
    srand( lint.LowPart );

Now we truly generate the expected 8 GB of working set of the MemCompression process. It seems that Windows creates an internal hash table of all paged out pages (could be also a B-tree) and adds them to the MemCompression process only if the newly compressed pages was not already encountered. A very much similar technique is employed by VMs where it is called transparent memory sharing between VMs.

Below you see working set of the MemCompression process when CppEater the first time is started with no random seed where all duplicate pages can be combined. The green bars are the times when the working set was flushed until the CPU consumption of MemCompression was flat again. The second run CppEater was seeded with a unique seed for the random number generator. No page sharing can happen because all pages are different now and we see the expected eight fold increase in the MemCompression working when we force page out of the working sets of the CppEater instances.

image

It is interesting that this vastly different behavior can easily be seen in the kernel memory lists. The working set of MemCompression counts simply to the Active List. The only difference is that after the flush of the working set the Active List memory remains much larger than in the first case (blue bar).

Conclusions – Is EmptyWorkingSet Good?

In general it is (nearly) never a good idea to trim your own working set. If your mental model is that you are simply giving the physical memory back to the OS which then will write it dutifully into the page file is simply wrong. You are not releasing memory but you are increasing the Modified List buffers which are flushed out to disk from time to time. If you do this a lot it can cause so much disk write load together with some random access page in activity that your system will come to a sudden halt while you have still plenty of physical memory left. I have seen 128GB servers which never got above 80GB of active memory because the system was so busy with paging out memory by forced calls to EmptyWorkingSet that it already behaved like it had reached the physical memory limit with the usual large user visible delays caused by paging. Once we removed the offending calls to EmptyWorkingSet the user perceived performance and memory utilization has become much better.

Other people have also tried SetProcessWorkingSetSize to reduce the memory consumption but failed as well due to application responsiveness issues. It did take much longer to bring all the loose ends into an article than I did initially anticipate. If I have missed out important things or you have found an error please drop me a note and I will update the article.

Why Is Batman Arkham Knight So Slow?

The latest Batman PC game had after the first release a bad reputation for being buggy and very slow. Due to some bad outsourcing and project management decisions the PC version did not meet the required quality gates. This quality and performance disaster did force Warner Bros  to cancel the PC sales for three month until a much more improved version was released which did meet customer expectations with regards to performance. The outcome was that the unpatched PC version was sold for 50% off and I could get a very good game for a very low price. I did buy the unpatched version, did throw away the unpatched DVDs,  and downloaded the “real” v1.0 from Steam with the accompanying product code.

After being aware of potentially bad performance in the game I kept an eye on it but I found not any rough spots anymore. The only thing that bugged me was when I switched away from the game to do something different and then switching back after ca. 30 minutes the first few seconds of the game graphics was hanging for several seconds. That sounds like a good reason to check performance with my favorite performance recording tool ETWController. Since it records also the screenshots I could easily track down in the profiling data what was happening while the game was hanging.

image

Since I get also a HTML report I can exactly see where in the timeline I have continued the game, although I do not get screenshots of the in game graphics.

image

After I stopped the recording I can directly open the ETL file and navigate the point of slowness in the game by selecting the time region after Screenshot 5.

image

When the game hangs it is usually CPU, Disk, memory or the network. So lets check what is going on.

By looking at the CPU consumption we see that the game has a several seconds near zero CPU activity between 10-13s which correlates with the observed hang.

image

While checking disk IO during the low CPU activity phase I find a lot of page file related IO. When the game resumes it needs to access a lot of its memory which was previously written into the page file this results in a growth of the games working set.

image

In the graph above I see that I have two page files (on drive C and D) where I wait 3,6s for 55 MB from drive D and 500ms for 103MB from drive C. That makes sense since my C drive is a SSD

image

whereas my D drive is a regular hard drive

image

That explains why reading from the page file on the D drive is 7 times slower although only 50% of the data is read from it! Random read access is the worst case scenario for a regular hard disk whereas my SSD can cope with that read pattern easily.

I do not even remember that I have enabled on D a page file but I think I did it some years ago where SSDs had still write durability issues just to be on the safe side. As it turns out it was a bad decision and as an easy fix I disabled my page file on drive D. The sudden hangs when switching back to the game were gone after that change.

Whose Fault Is It?

That was the easy part. But you can go much deeper to really see where the latencies are coming from. Hard faults which involve disk IO are served in the kernel by the method ntoskrnl.exe!MiIssueHardFault which takes care to wait for the pending disc IO of the page file. If we look at which processes did wait for page fault IOs and visualize the total CPU and wait times we get this timeline

image

  1. We can see that the hard faults nicely correlate with the Disk activity in the graph below.
  2. We find that the Arkham Knight game waits only for 90ms for hard page faults (see Waits(us)_Sum column).
  3. But the System process waits for over 10s (summed across all threads) for page in activity.

If we look who is causing this page in activity we find DirectX in the Windows kernel to cause quite some IO due to page faults.

ntoskrnl.exe!MiIssueHardFault

ntoskrnl.exe!MmAccessFault

|- ntoskrnl.exe!KiPageFault

|- ntoskrnl.exe!MiProbeLeafFrame

|    ntoskrnl.exe!MmProbeAndLockPages

|    |- dxgmms2.sys!VIDMM_MDL_RANGE::Lock

|    |    dxgmms2.sys!VIDMM_RECYCLE_HEAP_PHYSICAL_VIEW::LockRange

|    |    dxgmms2.sys!VIDMM_RECYCLE_RANGE::Lock

|    |    dxgmms2.sys!VIDMM_RECYCLE_MULTIRANGE::Lock

|    |    dxgmms2.sys!VIDMM_RECYCLE_HEAP_MGR::ProbeAndLockAllocation

|    |    dxgmms2.sys!VIDMM_GLOBAL::ProbeAndLockAllocation

|    |    |- dxgmms2.sys!VIDMM_SYSMEM_SEGMENT::LockAllocationRange

|    |    |    |- dxgmms2.sys!VIDMM_MEMORY_SEGMENT::CommitResource

|    |    |    |    dxgmms2.sys!VIDMM_GLOBAL::PageInOneAllocation

|    |    |    |    dxgmms2.sys!VIDMM_GLOBAL::ProcessDeferredCommand

|    |    |    |    dxgmms2.sys!VIDMM_WORKER_THREAD::Run

|    |    |    |    ntoskrnl.exe!PspSystemThreadStartup

|    |    |    |    ntoskrnl.exe!KiStartSystemThread

It is not even the game directly but DirectX needs to page in its buffers from previously paged out memory which takes several seconds from my slow spinning hard disk. It is possible to track this further down into the wait calls of the game but that should be proof enough to blame disk IO of DirectX buffers for the observed sluggish gaming experience.

Conclusions

It was not Batman’s fault that he was slow but my misconfigured secondary page file on my spinning hard disk. Removing the page file from the spinning disk did solve my performance problem.

Another observation is that Windows seems to write data out to the first responding device which hosts a page file. That is nice to get as much data as possible in an efficient manner into the page file but the real test comes when the data is read back again. I think the current page out algorithm in Windows is inefficient when it can choose from several devices to write its page data to it. It should balance the written data with respect to the random access servicing time of each device.  In effect SSDs with its very fast response times should be favored over slow hard disks. This makes it possible to have several page files on devices with vastly different response times but Windows should prefer to write most of the data onto the device with the fastest random access time to make hard page faults pause time as short as possible.

This once again shows that assumptions about bad performance based on former evidence (the game is slow …) will lead to wrong conclusions about current performance issues.

Performance problems are much like quantum mechanics: The problem does not exist until you have measured it. Some people come up with smart tests to verify this or that assumption. But it is usually easier to use a profiling tool to get down to the root cause with a more structured approach.

When Known .NET Bugs Bite You

The most interesting type of bugs do not occur during regression testing but when real people use the software. Things become nasty if it happens only after several hours of usage when sometimes the whole application freezes for minutes. If you have a bigger machine with e.g. 40 cores and most of them are busy and you have dozens of processes, things can become challenging to track down, when something goes wrong . Attaching a debugger does not help since it not clear which process is causing the problem and even if you could you would cause issues for already working users on that machine. Using plain ETW tracing is not so easy because the amount of generated data can exceed 10 GB/min. The default Windows Performance Recorder profiles are not an option because they will simply generate too much data. If you have ever tried to open 6+ GB ETL file in WPA it can take 30 minutes until it opens. If you drag and drop around some columns you have to wait some minutes until the UI settles again.

Getting ETW Data The Right Way

Taking data with xperf offers more fined grained control at the command line, but it has also its challenges. If you have e.g. a 2 GB user and a 2 GB kernel session which trace into a circular in memory buffer you can dump the contents with

xperf -flush "NT Kernel Logger" -f kernel.etl 

xperf -flush "UserSession" -f user.etl

xperf -merge user.etl kernel.etl merged.etl 

The result looks strange:

image

For some reason flushing the kernel session takes a very long time (up to 15 minutes) on a busy server. In effect you will have e.g. the first 20s of kernel data, a 6 minutes gap and then the user mode session data which is long after the problem did occur. To work around that you can either flush the user session first or you use the xperf command to flush both sessions at the same time:

xperf -flush "UserSession" -f user.etl -flush "NT Kernel Logger" -f kernel.etl

If you want to profile a machine where you do not know where the issues are coming from you should get at first some “long” term tracing. If you do not record context switch events but you basically resort back to profiling events then you can already get quite far. The default sampling rate is one kHz (a thousand samples/second)  which is good for many cases but too much for long sessions. When you expect much CPU activity then you can try to take only every 10ms a sample which will get you much farther. To limit the amount of traced data you have to be very careful from which events you want to get the call stacks. Usually the stack traces are 10 times larger than the actual event data. If you want to create a dedicated trace configuration to spot a very specific issue you need to know which ETW providers cost you how much. Especially the size of the ETL file is relevant. WPA has a very nice overview under System Configuration – Trace Statistics which shows you exactly that 

image

You can paste the data also into Excel to get a more detailed understanding what is costing you how much.

image

In essence you should enable stack walking only for the events you really care about if you want to keep the profiling file size small. Stack walks cost you a lot of storage (up to 10 times more than the actual event payload). A great thing is that you can configure ETW to omit stack walking for a specific events from an ETW provider with tracelog. This works since Windows 8.1 and is sometimes necessary.

To enable e.g. stack walking for only the CLR Exception event you can use this tracelog call for your user mode session named DotNetSession. The number e13c0d23-ccbc-4e12-931b-d9cc2eee27e4  is the guid for the provider .NET Common Language Runtime because tracelog does not accept pretty ETW provider names. Only guids prefixed with a # make it happy.

tracelog -enableex DotNetSession -guid #e13c0d23-ccbc-4e12-931b-d9cc2eee27e4 -flags 0x8094 -stackwalkFilter -in 1 80 

If you want to disable the highest volume events of Microsoft-Windows-Dwm-Core you can configure a list of events which should not traced at all:

tracelog -enableex DWMSession -guid #9e9bba3c-2e38-40cb-99f4-9e8281425164 -flags 0x1 -level 4 -EventIdFilter -out 23 2 9 6 244 8 204 7 14 13 110 19 137 21 111 57 58 66 67 68 112 113 192 193

That line will keep the DWM events which are relevant to the WPA DWM Frame Details graph. Tracelog is part of the Windows SDK but originally it was only part of the Driver SDK. That explains at least partially why it is the only tool that allows to declare ETW event and stackwalk filters. If you filter something in or out you need to pass as first number the number of events you will pass to the filter.

tracelog … -stackwalkFilter -in 1 80

means enable stackwalks for one event with event id 80

tracelog … -EventIdFilter -out 23 2 9 …

will filter out from all enabled events 23 events where the event ids are directly following after the number. You can execute such tracelog statements also on operating systems < Windows 8.1. In that case the event id filter clauses are simply ignored. In my opinion xperf should support that out of the box and I should not have to use another tool to fully configure lightweight ETW trace sessions.

Analyzing ETW Data

After fiddling with the ETW providers and setting up filters for the most common providers I was able to get a snapshot at the time when the server was slow. The UI was not reacting for minutes and I found this interesting call stack consuming one core for nearly three minutes!

image

When I drill deeper I did see that Gen1 collections were consuming enormous amounts of CPU in the method WKS::allocate_in_older_generation.

image

When you have enabled the .NET ETW provider with the keyword 0x1 which tracks GC events and you use my GC Activity Region file you can easily see that some Gen1 collections are taking up to 7s. If such a condition occurs in several different processes at the same time it is no wonder why the server looks largely unresponsive although all other metrics like CPU, Memory, Disc IO and the network  are ok.

Gen1GCTimes

When I find an interesting call stack it is time to look what the internet can tell me about allocate_in_older_generation. This directly points me to Stackoverflow where someone else had this issue already http://stackoverflow.com/questions/32048094/net-4-6-and-gc-freeze which brings me to https://blogs.msdn.microsoft.com/maoni/2015/08/12/gen2-free-list-changes-in-clr-4-6-gc/ that mentions a .NET 4.6 GC bug. There the .NET GC developer nicely explains what is going on:

… The symptom for this bug is that you are seeing GC taking a lot longer than 4.5 – if you collect ETW events this will show that gen1 GCs are taking a lot longer. …

Based on the data above and that the server was running .NET 4.6.96.0 which was a .NET 4.6 version this explains why sometimes a funny allocation pattern in processes that had already allocated > 2GB of memory the Gen1 GC pause times would sometimes jump from several ms up to seconds. Getting  1000 times slower at random time certainly qualifies as a lot longer. It would even be justified to use more drastic words how much longer this sometimes takes. But ok the bug is known and fixed. If you install .NET 4.6.1 or the upcoming .NET 4.6.2 version you will never see this issue again. It was quite tough to get the right data at the right time but I am getting better at these things while looking at the complete system at once. I am not sure if you could really have found this issue with normal tools if the issue manifests in random processes at random times. If you only concentrate at one part of the system at one time you will most likely fail to identify such systemic issues. When you update your server after some month by doing regular maintenance work the issue would have vanished leaving you back with no real explanation why the long standing issue has suddenly disappeared. Or you would have identified at least a .NET Framework update in retrospect as the fix you were searching for so long.

The Non Contracting Code Contracts

What do you do when you get an impossible exception of the form

System.InvalidCastException: Object of type 'MyApp.Type[]' cannot be converted 
                to type 'System.Collections.Generic.IEnumerable`1[MyApp.Type]'.
System.RuntimeType.TryChangeType(…)
System.RuntimeType.CheckValue(…)
System.Reflection.MethodBase.CheckArguments(…)
System.Reflection.RuntimeMethodInfo.InvokeArgumentsCheck(…)
...

where basically an array of type X cannot be casted to IEnumerable<X>.  Any LINQ affine programmer will do that 20 times a day. What the heck is going on here? It looks like the CLR type system has got corrupted by some bad  guy. It could be anything from a spurious memory corruption due to cosmic rays or some code on this or another thread did corrupt memory somehow. Random bit flips are not so uncommon if you know that cosmic muons hit the earth surface at a rate of ca. 60 000 muons/(h*m^2). It is only a matter of time until some bits in your memory gets randomly another value.

This issue surfaced on some machines quite reproducible so we can put back exotic theories about possible sources of memory corruption back into the bookshelf. First of all I needed to get a memory dump of the process in question. This is much easier if you know how you can “attach” procdump to one process and let it create a dump on any first chance exception with a specific type or message. That comes in handy if you want to create a dump not when a process crashes but when an internally catched exception occurs and you want to inspect the application state at this point in time.

The Sysinternals tool Procdump can do it from the command line without any extra installation. The -e 1 will instruct it to dump on first chance exceptions and -f will do a substring filter of the shown message in the console by procdump.

C:\>procdump  -ma -e 1 -f "InvalidCast" Contractor.exe

ProcDump v7.1 - Writes process dump files
Copyright (C) 2009-2014 Mark Russinovich
Sysinternals - www.sysinternals.com
With contributions from Andrew Richards

Process:               Contractor.exe (12656)
CPU threshold:         n/a
Performance counter:   n/a
Commit threshold:      n/a
Threshold seconds:     10
Hung window check:     Disabled
Log debug strings:     Disabled
Exception monitor:     First Chance+Unhandled
Exception filter:      *InvalidCast*
Terminate monitor:     Disabled
Cloning type:          Disabled
Concurrent limit:      n/a
Avoid outage:          n/a
Number of dumps:       1
Dump folder:           C:\
Dump filename/mask:    PROCESSNAME_YYMMDD_HHMMSS


Press Ctrl-C to end monitoring without terminating the process.

CLR Version: v4.0.30319

[19:48:12] Exception: E0434F4D.System.InvalidCastException ("Object of type 'MyApp.Type[]' cannot be converted to type 'System.Collections.Generic.IEnumerable`1[MyApp.Type]'.")
[19:48:12] Dump 1 initiated: C:\Contractor.exe_160718_194812.dmp
[19:48:12] Dump 1 writing: Estimated dump file size is 50 MB.
[19:48:12] Dump 1 complete: 50 MB written in 0.1 seconds
[19:48:13] The process has exited.
[19:48:13] Dump count reached.
If you have a more complex process creation setup, e.g. you create many instances of the same exe then you cannot use the simple executable name filter. Procdump accepts then only the process id of an already running process which you now need to determine from somewhere else. DebugDiag would be my tool of choice which allows you to configure rules for any process executable regardless if they are already running or will be started later. There you can also configure a dump trigger when a specific exception with a specific message.
To do this start the DebugDiag 2  Collection tool and select Crash
image
 
After selecting a process you can configure e.g. a full dump for each InvalidCastException
 
image
 
When you have activated the Crash rule you will get a nice report how many dumps already have been collected since the rule is active.
image
 
Then you will find in the configured directory the dump files along with the DebugDiag logs.
image

Once I had the dumps I did look into it but I could not make much sense out of the TypeCastException. Without the exact CLR source code it is very hard to make sense of the highly optimized data structures in the CLR. At that point I did reach out to Microsoft to look into the issue. Since the dump did also not provide enough clues I did pack a virtual machine together and transfer it directly to MS support via their upload tool. The support engineer was not able to get the VM running which seems to happen quite frequently if you upload 50+ GB files over the internet. The solution was to send me via mail an USB hard drive, upload the VM onto the disk and send it back to MS. that worked out and I heard nothing from MS support for some time.

It seems that my issue was routed through various hands until it reached one of CoreCLR devs which have debugged certainly many similar issues. When I was finally suspecting to never hear about that issue again I got a mail back that they have found the issue.

The problem is that the CLR type system got confused about invalid assembly references which have invalid Flags set. The assembly reference flag set in the IL code must be set to zero but in my case it was set to 0x70 which is normally a C# compiler internal tracking code to mark reference assemblies. In the final code emitting phase the C# compiler will zero out the assembly reference flags and all is fine. But for some reason one of my assemblies still had set 0x70 for all reference assemblies.

If you wonder what reference assemblies are: These are the assemblies the .NET Framework consist of in various versions. If you link against a specific .NET Framework version like 4.6.2, you “link” against the assemblies located in these directories to ensure that can call only APIs for the currently configured .NET target framework.

image

I was told that with the .NET 4.5.2 C# compiler an issue was reported which could trigger that behavior. If the newer Roslyn based C# 6 compiler suffers still from this issue is not known to me. That sounded like an arcane and spurious compiler bug. For some reason I was asked a few days later by MS if we modify the generated assemblies. It turns out some targets use Code Contracts to enforce run time checks which need assembly rewriting. After doing a query for all assembly references I found that ALL Code contracts rewritten assemblies have invalid Assembly reference flags with 0x70 set. This is still true with newest Code Contracts version which is currently 1.9.10714.2.

If you try to check the Assembly references via Reflection you will find that it will always tell you that the assembly references fields in the references assemblies are zero. This is even true if you load them with Assembly.LoadForReflectionOnlyFrom. Even Ildasm will not show you the fields of the Assembly references. Windbg? Nope.

I had to resort back to my old pet project ApiChange and patch the –ShowReferences command to print besides the referenced assembly also the Flags field of the Assembly Reference.


C:\>apichange -sr D:\bin\SubContractor.dll
SubContractor.dll ->
 System, Version=4.0.0.0,   Culture=neutral,
                            PublicKeyToken=b77a5c561934e089 0x70
 mscorlib, Version=4.0.0.0, Culture=neutral, 
                            PublicKeyToken=b77a5c561934e089 0x70

There I could finally see that my assembly had invalid assembly references. How can I fix it? That is easy. Turn off Perform Runtime Contract Checking!

image

Now you get correct assembly references as the ECMA spec requires them.


C:\>apichange -sr D:\bin\SubContractor.dll
SubContractor.dll ->
 mscorlib, Version=4.0.0.0, Culture=neutral, 
            PublicKeyToken=b77a5c561934e089 0x0
 System, Version=4.0.0.0,   Culture=neutral, 
            PublicKeyToken=b77a5c561934e089 0x0

The source of the spurious InvalidCastExceptions is found but why does it happen so infrequently? I have tried to create a simple reproducer but I was not able to force the cast error. It seems to have something to do with the machine, memory layout, number of loaded types, IL layout, moon phase, …. to trigger it. If something goes wrong it seems that in one assembly with correct assembly references to the Reference assemblies a cast is tried from

mscorlib         ;  IEnumerable<T>  to 
mscorlib,Ref=0x70;  IEnumerable<T>  

where the type identity with generic types and their arguments gets mixed up in the process which then correctly tells me that seemingly identical types are not identical because they differ in their full type expansion by their assembly reference.

I was never a big fan of Code Contracts because of the way how they were implemented. Rewriting existing assemblies can and will cause unforeseen issues in production code. If I can I want to keep my C# and JIT compiler errors and not add Code Contract bugs on top of it. Do not get me wrong. I really like the idea of contract checking. But such checks should be visible in the source code and or at least compiled directly by a Code Contracts plugin by the now much more open Roslyn Compiler.

As it is now Code Contracts considerably increase compilation time and if you enable static code checks you easily can get builds which take several hours which is a way too hefty price for some additional checks.

If you now want to remove Code Contracts you need to remove from your source code all Contract.Requires<Exn>(…) checks because these will assert in non rewritten Release builds as well which is certainly not what you want. As it stands now Code Contracts violate the contract of generating valid assembly references. The assembly rewriter of Code Contracts  generates assemblies with invalid IL code which can trigger under specific circumstances impossible runtime errors in the CLR.

New Beta of Windows Performance Toolkit

With the release of the first Windows Anniversary SDK Beta also a new version of Windows Performance Toolkit was shipped. If you want to search for the changes here are the most significant ones. Not sure if the version number of the beta will change but it seems to target 10.0.14366.1000.

image

Stack Tags for Context Switch Events

Now you can assign tags for wait call stacks which is a big help if you want to have tags why your application is waiting for something.  Here is sample snippet of common sources of waits.

<Tag Name="Waits">
    <Tag Name="Thread Sleep">
         <Entrypoint Module="ntdll.dll" Method="NtDelayExecution*"/>
    </Tag>
    <Tag Name="IO Completion Port Wait">
         <Entrypoint Module="ntoskrnl.exe" Method="IoRemoveIoCompletion*"/>
    </Tag>
    <Tag Name="CLR Wait" Priority="-1">
         <Entrypoint Module="clr.dll" Method="Thread::DoAppropriateWait*"/>
    </Tag>
    <Tag Name=".NET Thread Join">
         <Entrypoint Module="clr.dll" Method="Thread::JoinEx*"/>
    </Tag>
    <Tag Name="Socket">
         <Tag Name="Socket Receive Wait">
               <Entrypoint Module="mswsock.dll" Method="WSPRecv*"/>
         </Tag>
         <Tag Name="Socket Send Wait">
               <Entrypoint Module="mswsock.dll" Method="WSPSend*"/>
         </Tag>
         <Tag Name="Socket Select Wait">
               <Entrypoint Module="ws2_32.dll" Method="select*"/>
         </Tag>
    </Tag>
    <Tag Name="Garbage Collector Wait">
         <Entrypoint Module="clr.dll" Method="WKS::GCHeap::WaitUntilGCComplete*"/>
    </Tag>
</Tag>

That feature is actually already part of the Windows 10 Update 1 WPA (10.0.10586.15, th2_release.151119-1817) but I have come just over that feature now. That makes hang analysis a whole lot easier. But a really new feature are Flame Graphs which Bruce Dawson wanted since a long time built into WPA. I have written some small tool to generate them also some time ago Visualize Your Callstacks Via Flame Graphs but it never got much traction.

image

If you want to see e.g. why your threads are blocking you get in the context switch view now a nice view what are the blocking reasons and which thread did unblock your threads most often. You can stack two Wait views over each other so you can drill down to the relevant time where something is hanging and you can get to conclusions much faster now.

image

Another nice feature is that if you select a region in the graph and hold down the Ctrl key you can select multiple regions at one time and highlight them which is useful if you frequently need to zoom in into different regions and you move between them. The current flame graphs cannot really flame call stacks because the resulting flames become so small you have no chance to select them. Zooming with Ctrl and the mouse wheel only zooms into the time region which is for a flame graph perhaps not what I want. I would like to have a zoom where I can make parts of the graph readable again.

Symbol loading has got significantly faster and it seems also to crash less often which is quite surprising for a beta.

My Presets Window for managing presets

Another new feature is the My Presets Window which is useful if you work with custom WPA profiles. From there you can select from different loaded profiles the graph in your current view and simply add it.

image

 

image

 

Reference Set Tracing

WPR now also supports Reference Set analysis along with a new graph in WPA. This basically traces every page access and release in the system which allows you to track exactly how your working set evolved over time and why.

image

Since page in/out operations happen very frequently this is a high volume provider. With xperf you can enable it with the keyword REFSET. The xperf documentation about it only tells you

REFSET         : Support footprint analysis

but until now no tool was publicly available to decode the events in a meaningful manner. I still do not understand how everything works together there but it looks powerful to deeply understand when something is paged in or out.

image

TCP/IP Graph

WPA has for a long time known nothing about the network. That is changing now a bit.

image

ACK Delays are nice but as far as I understand TCP the application is free to send the ACK until the server has got the data ready. If you see high ACK delay times you cannot point directly to the network but you still need to investigate the server if something was happening there as well. For network issues I still like Wireshark much more because it understands not only raw TCP  but also the higher protocols. For a network delay analysis TCP retransmit events are the most important ones. The next thing to look at are the packet round trip times (RTT) where Wireshark does an incredible good job at it. I have seen some ETW events in the kernel which have a SRTT (Sample RTT) field but I do not know how significant that actually is.

 

Load files from Zip/CAB

image

You no longer need to extract the etl files but WPA can open them directly which is great if you deal with compressed files a lot.

The Bad

  • The new WPT no longer works on Windows 7 so beware.
  • For WPR/UI I have mixed feelings because it is not really configurable and records too much. The recorded amount of data exceeds easily one GB on a not busy machine if I enable CPU Disk, File and Reference Set tracing. The beta version also records for the Disk profile Storeport traces which are only useful if you suspect bugs in your hard disk firmware or special hard disk drivers. If I enable context switch tracing with call stacks I usually do not need for every file/disk operation the call stacks since I will find it anyway in the context switch traces at that time. If VirtualAlloc tracing is enabled the stack traces for the free calls are recorded which are seldom necessary because double frees are most often not the issue. Memory leaks are much more common. For these the allocation call stacks are the only relevant ones.
  • The improvements to the ETW infrastructure with Windows 8.1 which support filtering of specific events or to record stack traces only for some specific events have not made it into WPR nor xperf. I really would like to see more feature parity between what the kernel supports and what xperf allows me to configure to reduce the sometimes very high ETW event rates down to something manageable.
  • Currently xperf can start only two kernel trace sessions (NT Kernel Logger and Circular Kernel Context Logger) but not a generic kernel tracing session where one can have up to 8 since Windows 8. Besides this I am not sure if setting the profiling frequency is even possible to set for a specific kernel session.

Conclusions

The new version has many improvements which can help a lot to get new insights in your system in ways that were not possible before. Flame Graphs look nice and I hope that the final version makes it possible to zoom in somehow.

The WPA viewer is really a great tool and has gotten a lot of added features which shows that more and more people are using it.

Show GDI Handles by Type in Windbg

What are GDI Handles? Every GDI (Graphics Device Interface) object has a handle which is associated to one specific process. If you create UI resources (not windows) like

  • Fonts
  • Bitmaps
  • Device Contexts (DC)
  • Pens
  • Regions

then you are using GDI resources. There is a hard cap on 10K GDI objects per process. If you have a memory leak in window related code then the chances are high that your process will not crash due to an out of memory condition but because a generic GDI or GDI+ error did occur. If this happens after a many hour stress test you will have a hard time to figure out what went wrong, if all you have got is a memory dump with an exception like this:

System.Runtime.InteropServices.ExternalException was unhandled by user code
Message="A generic error occurred in GDI+."
Source="System.Drawing"
ErrorCode=-2147467259
StackTrace:
   at System.Drawing.Image.Save(Stream stream, ImageCodecInfo encoder, EncoderParameters    encoderParams)
   at System.Drawing.Image.Save(Stream stream, ImageFormat format)

To make things more complex .NET tries hard to reuse existing objects and most often it avoids GDI and GDI+ entirely. You can create more than 10K Bitmaps in a Windows Forms or WPF application without problems. Not every object on your managed heap is backed by GDI handle based object! But some of them are. This can be confusing sometimes.

GDI handle leaks are not easy to diagnose since there are not many tools out there that are able to display more than the total count like Task Manager with the GDI-Object count column.

image

Process Explorer can display pretty much anything but if you search for GDI objects it will let you down. Process Hacker (an open source Process Explorer clone with more features) on the other hand can display them via the context menu

image

The best tool I have found so far is GDIView from NirSoft. But it seems at least on Windows 8.1 to display sometimes much less handles than Task Manager shows. Do not entirely trust the reported numbers. It displays in the All GDI column the same value as Task Manager which should be pretty reliable. If the other columns add up the to the total reported value then you should be fine. Interestingly there exists no Performance counter for GDI objects which is quite odd since it is still an essential resource although MS seems to try hard to put GDI out of service. 

 

image

 

The problem with all of the above tools is that they do not work with memory dumps. Windbg had an extension which is since about 10 years no longer part of the Windbg package which did only work until Windows 2000 and not even on Windows XP.

So what can you do? First we need a GDI Handle leak. Bitmap.GetHbitmap will create a GDI backed bitmap and return you the handle to it. The following code creates therefore a leak of 1000 bitmap GDI objects:

        private void Button_Click(object sender, RoutedEventArgs e)
        {
            Bitmap bmp = new Bitmap(100, 100);
            for (int i = 0; i < 1000; i++)
            {
                bmp.GetHbitmap();
            }
        }

Ok when I put that into my WPF application I get a nice GDI leak when I press the magic button.

image

Now we need to search the internet for helpful resource how to dump GDI objects. After some searching I pretty much always land on some security related kernel hacking sites which are an interesting read on its own. These guys really know their stuff and I guess they have also some older copy of the Windows source code somewhere around. I have found https://blog.coresecurity.com/2015/09/28/abusing-gdi-for-ring0-exploit-primitives/ which talks about a fascinating way to abuse Bitmaps for kernel exploits which still seem to work today. But what is more important he also talks about that the kernel manages GDI handles for each process in a specific kernel region which is read only mapped back to the process address space.

That is important to know since we have a chance to look at GDI kernel structures in user mode process dumps. There is hope to retrieve information about GDI handles also in 2016. The next good thing is that the definition of GDICell structure seems to be still valid on Windows 8 (and on my Windows 10 home installation).

typedef struct {
  PVOID64 pKernelAddress; // 0x00
  USHORT wProcessId;      // 0x08
  USHORT wCount;          // 0x0a
  USHORT wUpper;          // 0x0c
  USHORT wType;           // 0x0e
  PVOID64 pUserAddress;   // 0x10
} GDICell;                // sizeof = 0x18

The post also tells us where we need to search for the GDI Handle table. It is part of the Process Environment Block (PEB) which Windbg can dump since a long time. But !peb will not display the GDI handle table. Instead we need to dump the structure with the ntdll symbols directly.

.process will print the PEB pointer value which we can then feed into dt

0:014> .process
Implicit process is now 00000000`00923000
0:014> dt ntdll!_PEB GdiSharedHandleTable 00923000
   +0x0f8 GdiSharedHandleTable : 0x00000000`013e0000 Void

Now we know the start of the GDI handle table. How big is it? The !address command can tell us that:

0:014> !address 13e0000

Usage:                  Other
Base Address:           00000000`013e0000
End Address:            00000000`01561000
Region Size:            00000000`00181000 (   1.504 MB)
State:                  00001000          MEM_COMMIT
Protect:                00000002          PAGE_READONLY
Type:                   00040000          MEM_MAPPED
Allocation Base:        00000000`013e0000
Allocation Protect:     00000002          PAGE_READONLY
Additional info:        GDI Shared Handle Table


Content source: 1 (target), length: 1000

It seems we are not the first ones to do that since under Additional info the name of the memory region has a really descriptive name. That looks promising. With the | command we get the current process id in hex and we are ready for a first try to search the GDI table for GDICell structures which should contain our current PID as part of the GDI handle table.

0:014> |
.  0    id: 35b8    attach    name: D:\GdiLeaker\bin\Debug\GdiLeaker.exe
0:014> s -w 13e0000 1561000 35b8
00000000`013f02f8  35b8 0000 1c05 4005 0000 0000 0000 0000  .5.....@........
00000000`013f3580  35b8 0000 de05 4005 0000 0000 0000 0000  .5.....@........
00000000`013f7948  35b8 0000 8005 4005 0000 0000 0000 0000  .5.....@........
00000000`013f7fd8  35b8 0000 c80a 400a 3370 00d4 0000 0000  .5.....@p3......
00000000`013facf0  35b8 0000 5e05 4005 0000 0000 0000 0000  .5...^.@........
00000000`013fb9c8  35b8 0000 0a01 4401 0000 00c8 0000 0000  .5.....D........
00000000`013fdac8  35b8 0000 9305 4005 0000 0000 0000 0000  .5.....@........
00000000`013fe428  35b8 0000 4905 4005 0000 0000 0000 0000  .5...I.@........
....

The number of printed lines look good. Since the kernel structures have also a wType member which should describe the GDI handle type we dig deeper until we find http://stackoverflow.com/questions/13905661/how-to-get-list-of-gdi-handles which tells us that 05 is the Bitmap GDI handle type. Since 6 bytes after the pid the type should follow we can check on our printed memory and sure there is mostly a 5 at this memory location. The rest of the word seems to be used for other information.

Lets create a Windbg script out of that and automate the complete process. Now attach to our GDI leaker and dump the GDI handle table before and after we have leaked 1000 bitmaps.

Important: If you are running on a 64 bit OS you need to attach the 64 bit Windbg even if you debug a 32 bit application!

0:013> $$>a<"D:\GdiDump\DumpGdi.txt"
GDI Handle Table 00000000013e0000 0000000001561000
GDI Handle Count      14
    DeviceContexts: 4
    Regions:        2
    Bitmaps:        2
    Palettes:       0
    Fonts:          3
    Brushes:        3
    Pens:           0
    Uncategorized:  0
0:013> g
0:014> $$>a<"D:\GdiDump\DumpGdi.txt"
GDI Handle Table 00000000013e0000 0000000001561000
GDI Handle Count      1021
    DeviceContexts: 8
    Regions:        3
    Bitmaps:        1003
    Palettes:       0
    Fonts:          3
    Brushes:        4
    Pens:           0
    Uncategorized:  0

Bingo! That looks pretty good. I have tried my script on various processes and the results are promising. This works also with full memory dumps which is a long missing piece of important information. Often enough no screenshot with Task Manager and the GDI handle count was taken so this information is lost if the failed process was terminated in the meantime. In some processes ca. 10-20% of the handle information seems to be missing. Since GDIView shows pretty much the same numbers I hope that this script will not be too far off for your crashed processes. Give it a try or post some additions if you know more about the internals of GDI.

 

Here is the code of DumpGDI.txt

$$ Run as: $$>a<DumpGdi.txt
$$ Written by Alois Kraus 2016
$$ uses pseudo registers r0-5 and r8-r14

r @$t1=0
r @$t8=0
r @$t9=0
r @$t10=0
r @$t11=0
r @$t12=0
r @$t13=0
r @$t14=0
$$ Increment count is 1 byte until we find a matching field with the current pid
r @$t4=1

r @$t0=$peb
$$ Get address of GDI handle table into t5
.foreach /pS 3 /ps 1 ( @$GdiSharedHandleTable { dt ntdll!_PEB GdiSharedHandleTable @$t0 } ) { r @$t5 = @$GdiSharedHandleTable }

$$ On first call !address produces more output. Do a warmup
.foreach /pS 50 ( @$myStartAddress {!address  @$t5} ) {  }


$$ Get start address of file mapping into t2
.foreach /pS 4 /ps 40 ( @$myStartAddress {!address  @$t5} ) { r @$t2 = @$myStartAddress }
$$ Get end address of file mapping into t3
.foreach /pS 7 /ps 40 ( @$myEndAddress {!address  @$t5} ) { r @$t3 = @$myEndAddress }
.printf "GDI Handle Table %p %p", @$t2, @$t3

.for(; @$t2 < @$t3; r @$t2 = @$t2 + @$t4) 
{
  $$ since we walk bytewise through potentially invalid memory we need first to check if it points to valid memory
  .if($vvalid(@$t2,4) == 1 ) 
  {
     $$ Check if pid matches
     .if (wo(@$t2) == @$tpid ) 
     { 
        $$ increase handle count stored in $t1 and increase step size by 0x18 because we know the cell structure GDICell has a size of 0x18 bytes.
        r @$t1 = @$t1+1
        r @$t4 = 0x18
        $$ Access wType of GDICELL and increment per GDI handle type
        .if (by(@$t2+6) == 0x1 )  { r @$t8 =  @$t8+1  }
        .if (by(@$t2+6) == 0x4 )  { r @$t9 =  @$t9+1  }
        .if (by(@$t2+6) == 0x5 )  { r @$t10 = @$t10+1 }
        .if (by(@$t2+6) == 0x8 )  { r @$t11 = @$t11+1 }
        .if (by(@$t2+6) == 0xa )  { r @$t12 = @$t12+1 }
        .if (by(@$t2+6) == 0x10 ) { r @$t13 = @$t13+1 }
        .if (by(@$t2+6) == 0x30 ) { r @$t14 = @$t14+1 }
     } 
  } 
}

.printf "\nGDI Handle Count      %d", @$t1
.printf "\n\tDeviceContexts: %d", @$t8
.printf "\n\tRegions:        %d", @$t9
.printf "\n\tBitmaps:        %d", @$t10
.printf "\n\tPalettes:       %d", @$t11
.printf "\n\tFonts:          %d", @$t12
.printf "\n\tBrushes:        %d", @$t13
.printf "\n\tPens:           %d", @$t14
.printf "\n\tUncategorized:  %d\n", @$t1-(@$t14+@$t13+@$t12+@$t11+@$t10+@$t9+@$t8)

When you edit the script with Notepad++ try Perl as scripting language. The syntax highlighter will do pretty good job on it. Windbg scripts do not look nice but they work and make dump analysis a lot easier. If you want to use Windbg efficiently I highly recommend you read more about .foreach, .printf, and pseudo registers.

If you e.g. search for a memory leak in a managed process you usually do !DumpHeap -stat -live to see all still reachable objects. If you omit -live then you see also all temporary objects which are not yet GCed which can also be interesting if there is a large difference between these two. Now you have identified a type and you click on the blue link in Windbg which will then print all object instances of that type.

Lets try that with FactoryRecord instances where we get the list of object addresses if we click in Windbg on the blue link of the !DumpHeap -stat command.

Statistics:
      MT    Count    TotalSize Class Name
6db871a0      114         6384 System.Configuration.FactoryRecord

If we dump one object instance we see that there is an interesting name inside it which could help to diagnose the context in which it was allocated

0:014> !DumpObj /d 02c98ca8
Name:        System.Configuration.FactoryRecord
MethodTable: 6db871a0
EEClass:     6db532f8
Size:        56(0x38) bytes
File:        C:\WINDOWS\Microsoft.Net\assembly\GAC_MSIL\System.Configuration\v4.0_4.0.0.0__b03f5f7f11d50a3a\System.Configuration.dll
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
7257e918  4000249        4        System.String  0 instance 02c98c50 _configKey
7257e918  400024a        8        System.String  0 instance 02c984b0 _group
7257e918  400024b        c        System.String  0 instance 02c98ad4 _name
6db86f9c  400024c       2c ...SimpleBitVector32  1 instance 02c98cd4 _flags
7257e918  400024d       10        System.String  0 instance 02c98b08 _factoryTypeName
6dc0c698  400024e       20         System.Int32  1 instance      200 _allowDefinition
6dc0c6d0  400024f       24         System.Int32  1 instance      100 _allowExeDefinition
6db87168  4000250       30 ...errideModeSetting  1 instance 02c98cd8 _overrideModeDefault
7257e918  4000251       14        System.String  0 instance 02c86bd8 _filename
725807a0  4000252       28         System.Int32  1 instance      123 _lineNumber
7257ecb8  4000253       18        System.Object  0 instance 00000000 _factory
00000000  4000254       1c                       0 instance 00000000 _errors

Lets say we are after the group name. If we dump this object

0:014> !DumpObj /d 02c984b0
Name:        System.String
MethodTable: 7257e918
EEClass:     721bf344
Size:        50(0x32) bytes
File:        C:\WINDOWS\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String:      system.web/caching
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
725807a0  4000243        4         System.Int32  1 instance       18 m_stringLength
7257f35c  4000244        8          System.Char  1 instance       73 m_firstChar
7257e918  4000248       40        System.String  0   shared   static Empty

then we see that this FactoryRecord has something to do with system.web/caching. Now we want to dump the group names of all FactoryRecord instances to get an overview.

.foreach( @$my { !DumpHeap -short -mt 6db871a0  } ) { r @$t0 = @$my; .printf "\n%mu", poi(@$t0+8)+8 }
...
system.transactions
system.transactions

system.web
system.web
system.web
system.web
system.web
system.web
system.web
system.web
system.web
system.web

...

The basic  idea is to dump the object addresses of all FactoryRecord instances. Since in .NET the Method Table (MT) pointer is the synonym for a type we use the MT pointer as argument to !DumpHeap -mt 6db871a0  to print only FactoryRecord instances. The additional -short limits the output to only object addresses which makes it a perfect fit for .foreach. 

For some reason you need to assign the local variable from the foreach loop @$my to a pseudo register @$t0 to make calculations possible. Oh and do not forget the spaces between @$t0 = @$my. The @ before the variable $xx is not strictly necessary but it speeds up the debugger evaluation engine a bit.

To get the pointer to the _group string which is 8 bytes after the FactoryRecord pointer we read it via poi(@$t0+8). Now we have the pointer to the string object. I know that .NET strings are null terminated UTF-16 strings. We can use therefore .printf “\n%mu” to print the string contents. On x64 the offset to the char array of a .NET string is 0xc and for x86 it is 8 which I have used above. These little shortcuts can make a dump analysis dramatically faster.

Although I like Visual Studio a lot, for more advanced things you need to use the power of Windbg. Give it a try when you analyze your next dump.