The Mysterious Lost Memory Which Belongs To No Process

Sometimes you experience sluggish performance for no apparent reason. These are usually the harder issues because it is not clear what is causing them. Things will become a bit messy, but I will give some advice on how you can pinpoint sluggish system performance with basic OS tools and some advanced tools which are rarely used in public.

The Symptom

A server machine became very slow once the in-use memory reported by Task Manager reached ca. 70% (e.g. 90 of 128 GB) of the available memory. There was still plenty of free memory, but when a new application was started it was very slow and the C drive was very busy. A first suspicion was that the application dlls were loaded from disk again, which did not really make sense since the application had been started many times before. Everything should have been in the file system cache for a long time.

 

First Observations

If you are on a Windows server and want to view the disk IO graphs in Task Manager like on a normal Windows client (8 and later), you need to enable them explicitly with

C>diskperf -y 

A closer look at disk IO in Task Manager showed that the drive hosting the page file was, from the very beginning, writing to the page file at ca. 90 MB/s. We were certainly paging out a lot of data, but why was this happening when only 50 of 128 GB were in use?

image

What was going on here?

Advanced Tools

To see where the memory is going we can use RAMMap from Sysinternals. It can show system-wide memory consumption of processes as well as the kernel. This is the only tool which can also show the lost memory, because if the memory is allocated it must at least belong to the kernel, which is shown too.

image

Here we find that the modified memory is coming from Shareable memory. Shareable memory is memory which is not backed by a physical file. The only way I am aware of to create such memory is to create a file mapping with no file name (in native code: CreateFileMapping with INVALID_HANDLE_VALUE as the file handle), which creates a page file backed shared memory section. The suspicion is that we have lost memory to page file backed sections which for some reason landed in the Modified list, which is the sign for the kernel to write them back to their backing store. In this case the backing store is the page file, which nicely supports our theory because we see high page file write rates.

That is all nice, but which process is doing this? When I looked through the processes with Task Manager I found no process with a very high Commit or Working Set memory. If nothing helps, try to reproduce the problem in isolation so you can analyze it better.

Create A Repro

The easiest repro is to create a file mapping which is mapped, written to and then unmapped, while the file mapping object itself is not closed.

In C# this is a few lines of code:

        // requires: using System; using System.IO.MemoryMappedFiles;
        private unsafe void Allocate()
        {
            // CreateFileMapping
            var file = MemoryMappedFile.CreateNew(null, BytesToAlloc);

            Mappings.Add(file); // root object 

            var accessor = file.CreateViewAccessor(); // MapViewOfFile

            var handle = accessor.SafeMemoryMappedViewHandle;
            byte* ptr = null;
            handle.AcquirePointer(ref ptr);
            var pLong = (long*)ptr;

            Random rand = new Random();
            long nr;
            // fill page file allocated memory with random data which cannot be compressed
            // by Windows 10 memory compression to see the real impact of memory allocation.
            for (long i = 0; i < BytesToAlloc / 8L; i++)
            {
                // combine two 31 bit random values into one 64 bit random number
                nr = ((long)rand.Next() << 32) | (uint)rand.Next();
                *(pLong + i) = nr;
            }
            TotalAllocated += BytesToAlloc; 

            handle.ReleasePointer();
            accessor.Dispose();        // UnmapViewOfFile
            Console.WriteLine($"Modified: {ModifiedBytes:N0} bytes");
        }
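The ModifiedBytes property printed above is not part of the listing. One plausible way to implement it (this is my sketch, not necessarily the one from the linked sample) is to read the system-wide Modified Page List Bytes performance counter of the Memory category:

        using System.Diagnostics;

        // Sketch: read the system wide "Modified Page List Bytes" counter.
        // This is what the printed ModifiedBytes value could be based on;
        // the property name is taken from the sample above.
        static readonly PerformanceCounter ModifiedPagesCounter =
            new PerformanceCounter("Memory", "Modified Page List Bytes");

        static long ModifiedBytes => (long)ModifiedPagesCounter.NextValue();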

The full source code can be found here https://1drv.ms/f/s!AhcFq7XO98yJgfZdGGK8ahaOrVnggg. With this sample application we can indeed nicely reproduce the issue: every time the memory is unmapped, the kernel removes our mapped memory from the process working set and puts it onto the Modified list.

Now, depending on the OS, different things happen. On Server 2008–2016 the memory is immediately written to the page file. On Server 2016 you can additionally enable memory compression if you wish (see https://aloiskraus.wordpress.com/2016/10/09/how-buffered-io-can-ruin-performance/). On Windows 10 machines, on the other hand, the memory is compressed by default and kept in memory; it is written to disk much later, once physical memory gets low. Since we have reproduced the symptom, we can search for ways to find the offending process more easily.

Let's look at Task Manager when the process has eaten up all my physical memory:

image

Nope, not a single number points towards high memory consumption. That is not surprising, because the memory was already unmapped and no longer belongs to any process. But our process still has a file mapping handle open, which is reason enough for the kernel to keep the data alive. Let's take a dump (e.g. as shown below) and look for file mapping objects. In kernel lingo these are called Section objects.
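If no debugger is attached, one way to take a dump is Sysinternals procdump, where -ma writes a dump including full process memory:

C>procdump -ma <pid>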

Windbg – Dump Section Objects

To dump all file mappings and their sizes we can use the !handle command. The first argument 0 means dump all handles, f requests all handle information, and the last argument limits the output to file mapping (Section) objects only.

0:000>!handle 0 f Section

…

Handle 0000000000000734
  Type             Section
  Attributes       0
  GrantedAccess    0xf0007:
         Delete,ReadControl,WriteDac,WriteOwner
         Query,MapWrite,MapRead
  HandleCount      2
  PointerCount     65537
  Name             <none>
  Object specific information
    Section base address 0
    Section attributes 0x8000000
    Section max size 0x40000000
24 handles of type Section

We can sum up all open file mapping objects manually, or use a pretty unknown Windbg extension called mex (see https://www.microsoft.com/en-us/download/details.aspx?id=53304). Mex is a Windbg extension written in managed code, which is the reason for its name (managed extension). It is used by support engineers within MS quite a lot because it packs a lot of functionality into a single dll, which makes deployment pretty easy. It can even decompile C# code from a memory dump with some help from ILSpy, which is pretty useful. When you have downloaded the zip file, put the dll of the right bitness into the Windbg\winext folder where Windbg stores its extensions, so you can load it simply by its name.

Now we can turn the tedious summing of all shared memory section sizes into a pretty one-liner:

0:000>.load mex
0:000> !sum !cut -c 22-35 !grep -r "size" !handle 0 f Section
0x10000
0x1000
0x11000
0x40000000
0x40000000
0x40000000
0x40000000
0x2000
0x40000000
0x40000000
0x40000000
0x40000000
0x40000000
0x40000000
0x40000000
0x40000000
0x1000
0x1
0x40000000
0x40000000
0x40000000
0x40000000
0x40000000
0x40000000
Total = 0n19327504385 (0x480025001)

The !sum command is pretty self-explanatory. !grep filters the !handle output down to the lines which contain the size, and !cut extracts the number from each of these lines at a fixed offset and length. This reminds me of old Linux days where my command lines looked pretty similar to the Windbg commands above.
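If mex is not at hand you can also save the !handle output to a text file (.logopen/.logclose) and post-process it outside the debugger. A small C# sketch which performs the same grep/cut/sum pipeline over a saved log file (the file name handles.txt is made up):

using System;
using System.IO;
using System.Linq;

class SumSectionSizes
{
    static void Main()
    {
        // Filter the saved "!handle 0 f Section" output for the size lines
        // and sum up the hex numbers - the same pipeline as !sum !cut !grep.
        long total = File.ReadLines("handles.txt")
            .Where(line => line.Contains("Section max size"))
            .Select(line => Convert.ToInt64(line.Trim().Split(' ').Last().Substring(2), 16))
            .Sum();
        Console.WriteLine($"Total = {total:N0} bytes");
    }
}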

From the dump it is clear that 19 GB of file mappings have been created. But Task Manager shows nothing. Even the trustworthy VMMap tool

image

shows nothing related to 19 GB of allocated shared memory. If you suspect a specific process you can use Windbg and check whether it has a file mapping leak. If it does, either you forgot to close the file mapping, or someone unmapped the shared memory too early.

You can set breakpoints at kernelbase on Windows 8+ or at kernel32 on earlier Windows versions, where the methods are actually defined. Windows 8 introduced API sets (https://msdn.microsoft.com/en-us/library/windows/desktop/hh802935(v=vs.85).aspx) which allow the OS guys to split formerly huge dlls into smaller parts where the implementation might live in a different dll. You need to be careful to set the breakpoints at the right location depending on the OS you are currently working on.

0:000> bp kernelbase!CreateFileMappingW
0:000> bp kernelbase!UnmapViewOfFile
0:000> bp kernelbase!UnmapViewOfFileEx

When you have hit a breakpoint and you are running managed code, load the SOS dll to get the call stacks:

.loadby sos clr
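If the breakpoints are hit very often you can let Windbg do the logging for you by attaching the stack command directly to the breakpoint and continuing automatically (a sketch):

0:000> bp kernelbase!CreateFileMappingW "!ClrStack; gc"
0:000> bp kernelbase!UnmapViewOfFile "!ClrStack; gc"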

Then it is easy to find out e.g. who created the file mappings:

KERNELBASE!CreateFileMappingW:
00007ffe`06280f20 4883ec48        sub     rsp,48h
0:000> !ClrStack
OS Thread Id: 0x59f8 (0)
        Child SP               IP Call Site
000000aa64d3e720 00007ffe06280f20 [InlinedCallFrame: 000000aa64d3e720] Microsoft.Win32.UnsafeNativeMethods.CreateFileMapping(Microsoft.Win32.SafeHandles.SafeFileHandle, SECURITY_ATTRIBUTES, Int32, Int32, Int32, System.String)
000000aa64d3e720 00007ffdd724c3e8 [InlinedCallFrame: 000000aa64d3e720] Microsoft.Win32.UnsafeNativeMethods.CreateFileMapping(Microsoft.Win32.SafeHandles.SafeFileHandle, SECURITY_ATTRIBUTES, Int32, Int32, Int32, System.String)
000000aa64d3e6e0 00007ffdd724c3e8 DomainBoundILStubClass.IL_STUB_PInvoke(Microsoft.Win32.SafeHandles.SafeFileHandle, SECURITY_ATTRIBUTES, Int32, Int32, Int32, System.String)
000000aa64d3e800 00007ffdd723014b System.IO.MemoryMappedFiles.MemoryMappedFile.CreateCore(Microsoft.Win32.SafeHandles.SafeFileHandle, System.String, System.IO.HandleInheritability, System.IO.MemoryMappedFiles.MemoryMappedFileSecurity, System.IO.MemoryMappedFiles.MemoryMappedFileAccess, System.IO.MemoryMappedFiles.MemoryMappedFileOptions, Int64)
000000aa64d3e890 00007ffdd72169af System.IO.MemoryMappedFiles.MemoryMappedFile.CreateNew(System.String, Int64, System.IO.MemoryMappedFiles.MemoryMappedFileAccess, System.IO.MemoryMappedFiles.MemoryMappedFileOptions, System.IO.MemoryMappedFiles.MemoryMappedFileSecurity, System.IO.HandleInheritability)
000000aa64d3e910 00007ffdd7436bd0 System.IO.MemoryMappedFiles.MemoryMappedFile.CreateNew(System.String, Int64)
000000aa64d3e950 00007ffd8c1c08a0 FileMappingModifiedBytes.PageFileAllocator.Allocate() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 60]
000000aa64d3ea50 00007ffd8c1c0736 FileMappingModifiedBytes.PageFileAllocator.SlowlyAllocateUntilOOM() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 49]
000000aa64d3ead0 00007ffd8c1c05b1 FileMappingModifiedBytes.Program.Main(System.String[]) [\FileMappingModifiedBytes\FileMappingModifiedBytes\Program.cs @ 20]
000000aa64d3ed80 00007ffdeb7b6a53 [GCFrame: 000000aa64d3ed80] 

When the views are unmapped you also get the full call stack:

0:000> !ClrStack
OS Thread Id: 0x59f8 (0)
        Child SP               IP Call Site
000000aa64d3e328 00007ffe06285400 [InlinedCallFrame: 000000aa64d3e328] Microsoft.Win32.UnsafeNativeMethods.UnmapViewOfFile(IntPtr)
000000aa64d3e328 00007ffdd724b23e [InlinedCallFrame: 000000aa64d3e328] Microsoft.Win32.UnsafeNativeMethods.UnmapViewOfFile(IntPtr)
000000aa64d3e300 00007ffdd724b23e DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr)
000000aa64d3e3b0 00007ffdd72305a7 Microsoft.Win32.SafeHandles.SafeMemoryMappedViewHandle.ReleaseHandle()
000000aa64d3e548 00007ffdeb7b6a53 [GCFrame: 000000aa64d3e548] 
000000aa64d3e6b8 00007ffdeb7b6a53 [GCFrame: 000000aa64d3e6b8] 
000000aa64d3e778 00007ffdeb7b6a53 [HelperMethodFrame_1OBJ: 000000aa64d3e778] System.Runtime.InteropServices.SafeHandle.InternalDispose()
000000aa64d3e870 00007ffdd72309c1 System.IO.MemoryMappedFiles.MemoryMappedView.Dispose(Boolean)
000000aa64d3e8a0 00007ffdd72307b9 System.IO.MemoryMappedFiles.MemoryMappedViewAccessor.Dispose(Boolean)
000000aa64d3e8e0 00007ffdd723075a System.IO.MemoryMappedFiles.MemoryMappedViewAccessor.Dispose(Boolean)
000000aa64d3e920 00007ffddab2b15a System.IO.UnmanagedMemoryAccessor.Dispose()
000000aa64d3e950 00007ffd8c1c0a48 FileMappingModifiedBytes.PageFileAllocator.Allocate() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 83]
000000aa64d3ea50 00007ffd8c1c0736 FileMappingModifiedBytes.PageFileAllocator.SlowlyAllocateUntilOOM() [\FileMappingModifiedBytes\FileMappingModifiedBytes\PageFileAllocator.cs @ 49]
000000aa64d3ead0 00007ffd8c1c05b1 FileMappingModifiedBytes.Program.Main(System.String[]) [\FileMappingModifiedBytes\FileMappingModifiedBytes\Program.cs @ 20]
000000aa64d3ed80 00007ffdeb7b6a53 [GCFrame: 000000aa64d3ed80] 

Now we have enough information to come up with a

Theory Of The Problem

This is how we got into that situation:

  1. Someone creates page file backed shared memory as file mapping.
  2. These file mappings are put into the address space of our process with MapViewOfFile.
  3. After usage the file mapping is removed from the process working set with UnmapViewOfFile. The file mapping object is closed much later or never.
  4. The kernel removes the memory pages from the process working set and moves them to the Modified List.
  5. No process owns that memory anymore and the OS is free to page it out to the page file.
  6. After the modified pages have been written to disk they are put onto the standby list, evicting other previously cached data.

The picture below shows the basic operations the kernel performs.

image

The end result is that the unmapped memory of our process is written to the page file and at the same time put into the file system cache, where it replaces older entries. This has far-reaching implications. If you unmap the memory but do not close the file mapping object at (almost) the same time, the OS will page the memory out if no other process has the same memory mapped. An innocent file mapping handle leak with large memory mappings can flush your complete file system cache at random times, and there is no clear indicator why your application or others suddenly became much slower.
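The fix is correspondingly simple: close the file mapping object together with (or shortly after) the view, e.g. with using blocks. A minimal sketch based on the repro code above:

        using System.IO.MemoryMappedFiles;

        // Dispose the view *and* the mapping. Once the last handle to the
        // section is closed the kernel can simply discard the pages instead
        // of writing them to the page file.
        using (var file = MemoryMappedFile.CreateNew(null, BytesToAlloc))
        using (var accessor = file.CreateViewAccessor())
        {
            // ... work with the mapped memory ...
        } // both handles are closed here, no orphaned modified pages remain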

On Windows 10 Anniversary and later you can see the effect of flushing the file system cache more directly in Task Manager: the compressed memory goes up and up while the cached memory stored in the standby list becomes less and less.

image

Freshly paged out memory seems to have the highest priority in the file system cache and can evict anything else that was inside it. This is also true on server OSes, where you can check with RAMMap that the file system cache was flushed. That explains why we saw so much read activity on the C drive: our file system cache was flushed and the application dlls needed to be loaded from disk again. If you have a large amount of compressed memory in Windows 10 but not much in-use memory, the chances are pretty good that you have a file mapping leak somewhere.

How Things Look in ETW

On my Windows 10 Anniversary machine I can nicely see that my paged out memory goes directly into the working set of MemCompression (shown as Memory Compression in Process Explorer and Process Hacker), which is not shown by the Windows Task Manager. At some point it gets so huge that it needs to page out memory to my SSD at up to 400 MB/s. While MemCompression (purple) allocates more and more memory, the standby list (green) becomes smaller and smaller.

image

With ETW you can also find out who is responsible for the high modified bytes written to the page file by using CPU sampling tracing. That is very useful if you do not know which process is causing it and you cannot interrupt the server processes with memory dumps, which might upset people working on the server. Since we know, based on our theory, that the modified bytes grow when we unmap the data, we can simply view all callers of UnmapViewOfFile and follow the callers until we find the call stacks with the best correlation between the growth of modified bytes and calls to UnmapViewOfFile. With our tester application we get a perfect correlation:

image

The blue bars map perfectly to the rise of modified bytes when we unmap a file section. You can also treat the file mapping issue as a handle leak and trace, with ETW handle leak tracing, all Section objects created via CreateFileMapping; this can be used to verify that no file mappings of this type exist anymore and hence that a fix works as expected (a sample recording command is sketched below). WPA is the perfect tool to visually correlate different metrics to check if there is a dependency. You can come up with a theory of what depends on what, graph the different things underneath each other, and then you see a pattern … or not. Then it is time for the next theory.
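Such a handle trace could be recorded e.g. with xperf from the Windows Performance Toolkit. A sketch (the exact flag spelling may depend on your WPT version):

C>xperf -on PROC_THREAD+LOADER+OB_HANDLE -stackwalk HandleCreate+HandleClose -f handle.etl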

The Real Root Cause

image

Problem solved? Only partially. This only explains sluggish behavior when the machine gets into high memory load which then flushes the file system cache. But there were issues right from the start, even while there was still plenty of data in the file system cache. Can you spot, based on the data above, where the problem is?

.

.

.

We have a nice RAID in the machine, but the pageouts were happening at only 100 MB/s. That is not the write throughput one would expect from an expensive RAID. What I would expect is something like this

image

but it was about 5 times slower than usual. Perhaps one disk had failed and the RAID was no longer scaling properly? Nope. But there was a warning logged somewhere that the RAID battery had failed. The battery in the RAID exists so that all cached data from its internal DRAM cache can still be written to the disks in the case of a power failure. What happens if the battery fails? The RAID turns off its internal cache to prevent data loss on power failure. We were therefore running the server the whole time without a much wanted cache! I am not the one to take credit for finding the disabled RAID cache. It is all about teamwork, where everyone looks at different aspects of a problem to (hopefully) find all issues. Thanks for finding that one. You know who you are.

Luckily we did not need to wait for a new battery: we could enable the cache even with the battery absent:

image

That setting enables the RAID cache regardless of the state of the battery, which is exactly what we want. In fact the battery was not broken at all: opening the server and dusting off the battery solved the battery issue as well.

It’s The Environment Stupid!

When you write software it is not always the software which causes a change. It still runs on hardware which can change or fail too. You need to monitor not only your software but also the environment it is running in. If you fail to take that into account you can search for a long time in the wrong or not most significant spot.

Funny side note: I drive to work by car every day. When I refilled my windshield washer fluid yesterday I noticed an unauthorized hardware change:

image

A marten

image

seems to have brought its own pet toys into my car to make the place more cozy. Next action: install some marten repellent to prevent more hardware changes:

image

Have a nice day and keep your hardware marten-safe.

MS Technical Summit 2016 at Darmstadt

I had the pleasure of giving a talk about ETW at the MS Technical Summit 2016 in Darmstadt. You can find our slides, the sample code, a crash dump and the resulting sample ETW trace here. It was a great event together with Wolfgang Kroneder from MS. Here is the video:

https://channel9.msdn.com/Events/microsoft-techncial-summit/Technical-Summit-2016/Deep-Dive-Event-Tracing-for-Windows/player

For some problems you need custom ETW tracing to find out what is going on. Reference counting bugs are among the nasty ones where it is very hard to find the root cause if they are timing dependent. If you add normal printf or OutputDebugString tracing to the Add/ReleaseReference calls, the error usually goes away. Below is the sample code from my talk, minus the ETW manifest generated tracing macros. It mimics a reference counting issue where we suspected that someone was calling ReleaseReference too early on a reference counted class, which led to a too early destruction of the reference counted object. Interestingly it was deleted more than once, which resulted in either a heap corruption or a pure virtual function call exit, depending on the exact timing.

The code below calls AddReference once in the main method and then performs many balanced Add/ReleaseReference calls from 5 different threads, which should never cause the reference count to drop to zero.

 

#include <windows.h>
#include <vector>
#include <thread>
#include <mutex>
#include "ETWRefCounter.h"  // ETW Tracing header generated from ETW manifest during each build with Pre build command
                            // mc -um $(ProjectDir)RefCounterETW.man  -h $(ProjectDir)  -z $(ProjectName)
                            // ecmangen is the editor to create ETW manifest files. 


// Thread safe refcounter base class
class CRefCounterBase
{
public:
    CRefCounterBase() : m_Counter(0) {}
    
    CRefCounterBase(CRefCounterBase &other) = delete;
    CRefCounterBase& operator=(const CRefCounterBase&) = delete;
    
    virtual void DeleteObject() = 0;

    // Return new reference count incremented by one
    long AddReference()
    {
        auto lret = InterlockedIncrement(&m_Counter);
        EventWriteAddRefEvent((__int64)this, lret); // trace new value
        return lret;
    }

    // Return new reference count decremented by one
    long ReleaseReference()
    {
        auto old = m_Counter;
        auto newValue = InterlockedDecrement(&m_Counter);
        EventWriteReleaseRefEvent((__int64)this, newValue); // trace new value
        if (newValue == 0)
        {
            EventWriteFinalReleaseEvent((__int64)this);
            DeleteObject();
        }
        return newValue;
    }

    // Check if RefCount == 1. This can only be safely used when the threads using this object are
    // guaranteed to not change the refcount after this check has been made. 
    bool OnlySingleUser()
    {
        EventWriteIsSingleUserEvent((__int64) this, m_Counter);
        return InterlockedExchange(&m_Counter, m_Counter) == 1;
    }

protected:
    virtual ~CRefCounterBase() { m_Counter = 0; }

    long m_Counter;
};

/// Actual reference counted class
class RefCountedObject : public CRefCounterBase
{
public:
    RefCountedObject() {}
    RefCountedObject(RefCountedObject &other) = delete;
    RefCountedObject& operator=(const RefCountedObject&) = delete;
    virtual ~RefCountedObject() {}

    virtual void DeleteObject()
    {
        printf("\nDeleteObject called 0x%p, %d", this, ::GetCurrentThreadId());
        delete this;
    };
};


// Using RefCounter from multiple threads in a balanced way which works
void AsyncWorker(CRefCounterBase *counter)
{
    while (true)
    {
        counter->AddReference();
        if (counter->OnlySingleUser())
        {
            // some optimized code not shown here which also causes add/releaseref
        }
        counter->ReleaseReference();
    }
}

int main()
{
    EventRegisterRefCounter();

    static const int ThreadsCount = 5;
    std::vector<std::thread> threads; // thread list 

    auto pCounter = new RefCountedObject();   // construct object
    pCounter->AddReference(); // ensure that refcount does not drop to zero while other threads are working with it.

    for (int i = 0; i < ThreadsCount; i++) // start some threads and increment and decrement stuff in a loop
    {
        std::thread th(AsyncWorker, pCounter);
        threads.push_back(std::move(th));
    }

    threads[0].join();

    EventUnregisterRefCounter();
    return 0;
}

When you run the exe it will either run happily forever or crash. When it crashes it terminates with an unhandled exception due to a pure virtual function call like this,

image

or with a heap corruption. In reality the code was not as simple but it pretty much boils down to the code above. Can you spot the error?

Why Is Batman Arkham Knight So Slow?

After its initial release, the latest Batman PC game had a bad reputation for being buggy and very slow. Due to some bad outsourcing and project management decisions the PC version did not meet the required quality gates. This quality and performance disaster forced Warner Bros to cancel PC sales for three months until a much improved version was released which met customer expectations with regard to performance. The outcome was that the unpatched PC version was sold at 50% off, and I could get a very good game for a very low price. I bought the unpatched version, threw away the unpatched DVDs, and downloaded the “real” v1.0 from Steam with the accompanying product code.

Being aware of potentially bad performance in the game, I kept an eye on it, but I did not find any rough spots anymore. The only thing that bugged me was that when I switched away from the game to do something different and switched back after ca. 30 minutes, the game graphics hung for several seconds. That sounded like a good reason to check performance with my favorite performance recording tool ETWController. Since it also records screenshots, I could easily track down in the profiling data what was happening while the game was hanging.

image

Since I also get an HTML report I can see exactly where in the timeline I continued the game, although I do not get screenshots of the in-game graphics.

image

After stopping the recording I can directly open the ETL file and navigate to the point of slowness in the game by selecting the time region after Screenshot 5.

image

When a game hangs it is usually CPU, disk, memory or the network. So let's check what is going on.

Looking at the CPU consumption we see that the game has several seconds of near zero CPU activity between 10–13 s, which correlates with the observed hang.

image

While checking disk IO during the low CPU activity phase I find a lot of page file related IO. When the game resumes it needs to access a lot of its memory which was previously written into the page file; this results in a growth of the game's working set.

image

In the graph above I see that I have two page files (on drives C and D), where I wait 3.6 s for 55 MB from drive D and 500 ms for 103 MB from drive C. That makes sense since my C drive is an SSD

image

whereas my D drive is a regular hard drive

image

That explains why reading from the page file on the D drive is 7 times slower although only about half of the data is read from it! Random read access is the worst-case scenario for a regular hard disk, whereas my SSD copes with that read pattern easily.

I do not even remember why I enabled a page file on D, but I think I did it some years ago when SSDs still had write durability issues, just to be on the safe side. As it turns out it was a bad decision, and as an easy fix I disabled the page file on drive D (see the sketch below). The sudden hangs when switching back to the game were gone after that change.
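Besides the System Properties dialog you can also remove a page file from an elevated command prompt, e.g. with wmic (assuming the default page file location on D; the change takes effect after a reboot):

C>wmic pagefileset where name="D:\\pagefile.sys" delete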

Whose Fault Is It?

That was the easy part. But you can go much deeper to see where the latencies really come from. Hard faults which involve disk IO are served in the kernel by the method ntoskrnl.exe!MiIssueHardFault, which takes care of waiting for the pending disk IO of the page file. If we look at which processes waited for page fault IOs and visualize the total CPU and wait times, we get this timeline

image

  1. We can see that the hard faults nicely correlate with the Disk activity in the graph below.
  2. We find that the Arkham Knight game waits only 90 ms in total for hard page faults (see the Waits(us)_Sum column).
  3. But the System process waits over 10 s (summed across all threads) for page-in activity.

If we look at who is causing this page-in activity, we find that DirectX in the Windows kernel causes quite some IO due to page faults.

ntoskrnl.exe!MiIssueHardFault
ntoskrnl.exe!MmAccessFault
|- ntoskrnl.exe!KiPageFault
|- ntoskrnl.exe!MiProbeLeafFrame
|    ntoskrnl.exe!MmProbeAndLockPages
|    |- dxgmms2.sys!VIDMM_MDL_RANGE::Lock
|    |    dxgmms2.sys!VIDMM_RECYCLE_HEAP_PHYSICAL_VIEW::LockRange
|    |    dxgmms2.sys!VIDMM_RECYCLE_RANGE::Lock
|    |    dxgmms2.sys!VIDMM_RECYCLE_MULTIRANGE::Lock
|    |    dxgmms2.sys!VIDMM_RECYCLE_HEAP_MGR::ProbeAndLockAllocation
|    |    dxgmms2.sys!VIDMM_GLOBAL::ProbeAndLockAllocation
|    |    |- dxgmms2.sys!VIDMM_SYSMEM_SEGMENT::LockAllocationRange
|    |    |    |- dxgmms2.sys!VIDMM_MEMORY_SEGMENT::CommitResource
|    |    |    |    dxgmms2.sys!VIDMM_GLOBAL::PageInOneAllocation
|    |    |    |    dxgmms2.sys!VIDMM_GLOBAL::ProcessDeferredCommand
|    |    |    |    dxgmms2.sys!VIDMM_WORKER_THREAD::Run
|    |    |    |    ntoskrnl.exe!PspSystemThreadStartup
|    |    |    |    ntoskrnl.exe!KiStartSystemThread

It is not even the game directly: DirectX needs to page in its buffers from previously paged out memory, which takes several seconds on my slow spinning hard disk. It is possible to track this further down into the wait calls of the game, but that should be proof enough to blame the disk IO of DirectX buffers for the observed sluggish gaming experience.

Conclusions

It was not Batman's fault that he was slow, but my misconfigured secondary page file on a spinning hard disk. Removing that page file solved my performance problem.

Another observation is that Windows seems to write page data out to the first responding device which hosts a page file. That is nice for getting as much data as possible into the page file in an efficient manner, but the real test comes when the data is read back. I think the current page-out algorithm in Windows is inefficient when it can choose between several devices to write its page data to. It should balance the written data with respect to the random access service time of each device. In effect, SSDs with their very fast response times should be favored over slow hard disks. This would make it possible to have several page files on devices with vastly different response times, while Windows prefers to write most of the data onto the device with the fastest random access time to keep hard page fault pause times as short as possible.

This once again shows that assumptions about bad performance based on former evidence (the game is slow …) will lead to wrong conclusions about current performance issues.

Performance problems are much like quantum mechanics: The problem does not exist until you have measured it. Some people come up with smart tests to verify this or that assumption. But it is usually easier to use a profiling tool to get down to the root cause with a more structured approach.

Show GDI Handles by Type in Windbg

What are GDI handles? Every GDI (Graphics Device Interface) object has a handle which is associated with one specific process. If you create UI resources (not windows) like

  • Fonts
  • Bitmaps
  • Device Contexts (DC)
  • Pens
  • Regions

then you are using GDI resources. There is a hard cap of 10,000 GDI objects per process. If you have a memory leak in window-related code, the chances are high that your process will not crash due to an out of memory condition but because a generic GDI or GDI+ error occurred. If this happens after a many-hour stress test you will have a hard time figuring out what went wrong if all you have got is a memory dump with an exception like this:

System.Runtime.InteropServices.ExternalException was unhandled by user code
Message="A generic error occurred in GDI+."
Source="System.Drawing"
ErrorCode=-2147467259
StackTrace:
   at System.Drawing.Image.Save(Stream stream, ImageCodecInfo encoder, EncoderParameters    encoderParams)
   at System.Drawing.Image.Save(Stream stream, ImageFormat format)

To make things more complex, .NET tries hard to reuse existing objects and most often avoids GDI and GDI+ entirely. You can create more than 10K Bitmaps in a Windows Forms or WPF application without problems. Not every object on your managed heap is backed by a GDI handle based object! But some of them are. This can be confusing at times.

GDI handle leaks are not easy to diagnose since there are not many tools out there that can display more than the total count, like Task Manager with its GDI Objects column.

image

Process Explorer can display pretty much anything but if you search for GDI objects it will let you down. Process Hacker (an open source Process Explorer clone with more features) on the other hand can display them via the context menu

image

The best tool I have found so far is GDIView from NirSoft. But at least on Windows 8.1 it sometimes displays far fewer handles than Task Manager shows, so do not entirely trust the reported numbers. It displays in the All GDI column the same value as Task Manager, which should be pretty reliable. If the other columns add up to the total reported value then you should be fine. Interestingly there exists no performance counter for GDI objects, which is quite odd since they are still an essential resource, although MS seems to be trying hard to put GDI out of service.

 

image
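Even without a performance counter, a process can poll its own GDI handle count with the GetGuiResources API, e.g. for logging during stress tests. A minimal C# sketch (the wrapper class is mine, the API and flag values are documented):

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class GdiMonitor
{
    const uint GR_GDIOBJECTS = 0;  // count of GDI objects
    const uint GR_USEROBJECTS = 1; // count of USER objects

    [DllImport("user32.dll")]
    static extern uint GetGuiResources(IntPtr hProcess, uint uiFlags);

    // Current GDI handle count of our own process
    public static uint CurrentGdiCount =>
        GetGuiResources(Process.GetCurrentProcess().Handle, GR_GDIOBJECTS);
}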

 

The problem with all of the above tools is that they do not work with memory dumps. Windbg had an extension for that, but it has not been part of the Windbg package for about 10 years, and it only worked up to Windows 2000, not even on Windows XP.

So what can you do? First we need a GDI handle leak. Bitmap.GetHbitmap creates a GDI backed bitmap and returns the handle to it. The following code therefore creates a leak of 1000 bitmap GDI objects:

        private void Button_Click(object sender, RoutedEventArgs e)
        {
            Bitmap bmp = new Bitmap(100, 100);
            for (int i = 0; i < 1000; i++)
            {
                bmp.GetHbitmap(); // returned HBITMAP is never released -> GDI handle leak
            }
        }

OK, when I put that into my WPF application I get a nice GDI leak whenever I press the magic button.

image
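For completeness: the leak goes away if every returned handle is released again with the gdi32 DeleteObject function, e.g. like this (a sketch; requires using System.Runtime.InteropServices):

        [DllImport("gdi32.dll")]
        static extern bool DeleteObject(IntPtr hObject);

        private void Button_Click(object sender, RoutedEventArgs e)
        {
            Bitmap bmp = new Bitmap(100, 100);
            for (int i = 0; i < 1000; i++)
            {
                IntPtr hBitmap = bmp.GetHbitmap();
                DeleteObject(hBitmap); // release the GDI object again
            }
        }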

Now we need to search the internet for helpful resources on how to dump GDI objects. After some searching I pretty much always land on security related kernel hacking sites, which are an interesting read of their own. These guys really know their stuff, and I guess they also have some older copy of the Windows source code lying around. I found https://blog.coresecurity.com/2015/09/28/abusing-gdi-for-ring0-exploit-primitives/ which talks about a fascinating way to abuse bitmaps for kernel exploits which still seems to work today. More importantly, it also explains that the kernel manages the GDI handles for each process in a specific kernel region which is mapped read-only back into the process address space.

That is important, since it means we have a chance to look at GDI kernel structures in user mode process dumps. There is hope to retrieve information about GDI handles in 2016 as well. The next good thing is that the definition of the GDICell structure still seems to be valid on Windows 8 (and on my Windows 10 home installation).

typedef struct {
  PVOID64 pKernelAddress; // 0x00
  USHORT wProcessId;      // 0x08
  USHORT wCount;          // 0x0a
  USHORT wUpper;          // 0x0c
  USHORT wType;           // 0x0e
  PVOID64 pUserAddress;   // 0x10
} GDICell;                // sizeof = 0x18

The post also tells us where we need to search for the GDI handle table. It is part of the Process Environment Block (PEB), which Windbg has been able to dump for a long time. But !peb will not display the GDI handle table; instead we need to dump the structure with the ntdll symbols directly.

.process prints the PEB pointer value, which we can then feed into dt:

0:014> .process
Implicit process is now 00000000`00923000
0:014> dt ntdll!_PEB GdiSharedHandleTable 00923000
   +0x0f8 GdiSharedHandleTable : 0x00000000`013e0000 Void

Now we know the start of the GDI handle table. How big is it? The !address command can tell us that:

0:014> !address 13e0000

Usage:                  Other
Base Address:           00000000`013e0000
End Address:            00000000`01561000
Region Size:            00000000`00181000 (   1.504 MB)
State:                  00001000          MEM_COMMIT
Protect:                00000002          PAGE_READONLY
Type:                   00040000          MEM_MAPPED
Allocation Base:        00000000`013e0000
Allocation Protect:     00000002          PAGE_READONLY
Additional info:        GDI Shared Handle Table


Content source: 1 (target), length: 1000

It seems we are not the first ones to do this, since under Additional info the memory region has a really descriptive name. That looks promising. With the | command we get the current process id in hex, and we are ready for a first attempt to search the GDI handle table for GDICell structures which should contain our current PID.

0:014> |
.  0    id: 35b8    attach    name: D:\GdiLeaker\bin\Debug\GdiLeaker.exe
0:014> s -w 13e0000 1561000 35b8
00000000`013f02f8  35b8 0000 1c05 4005 0000 0000 0000 0000  .5.....@........
00000000`013f3580  35b8 0000 de05 4005 0000 0000 0000 0000  .5.....@........
00000000`013f7948  35b8 0000 8005 4005 0000 0000 0000 0000  .5.....@........
00000000`013f7fd8  35b8 0000 c80a 400a 3370 00d4 0000 0000  .5.....@p3......
00000000`013facf0  35b8 0000 5e05 4005 0000 0000 0000 0000  .5...^.@........
00000000`013fb9c8  35b8 0000 0a01 4401 0000 00c8 0000 0000  .5.....D........
00000000`013fdac8  35b8 0000 9305 4005 0000 0000 0000 0000  .5.....@........
00000000`013fe428  35b8 0000 4905 4005 0000 0000 0000 0000  .5...I.@........
....

The number of printed lines looks good. Since the kernel structure also has a wType member which should describe the GDI handle type, we dig deeper until we find http://stackoverflow.com/questions/13905661/how-to-get-list-of-gdi-handles which tells us that 05 is the bitmap GDI handle type. Since the type should follow 6 bytes after the pid, we check the printed memory, and sure enough there is mostly a 5 at this location. The rest of the word seems to be used for other information.
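Based on the handle tables I have looked at, the wType values used in the script below map to the GDI object types like this (undocumented internals, so treat with care):

  • 0x01 – Device Context
  • 0x04 – Region
  • 0x05 – Bitmap
  • 0x08 – Palette
  • 0x0a – Font
  • 0x10 – Brush
  • 0x30 – Pen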

Let's create a Windbg script out of that and automate the complete process. Now attach to our GDI leaker and dump the GDI handle table before and after leaking 1000 bitmaps.

Important: If you are running on a 64 bit OS you need to attach the 64 bit Windbg even if you debug a 32 bit application!

0:013> $$>a<"D:\GdiDump\DumpGdi.txt"
GDI Handle Table 00000000013e0000 0000000001561000
GDI Handle Count      14
    DeviceContexts: 4
    Regions:        2
    Bitmaps:        2
    Palettes:       0
    Fonts:          3
    Brushes:        3
    Pens:           0
    Uncategorized:  0
0:013> g
0:014> $$>a<"D:\GdiDump\DumpGdi.txt"
GDI Handle Table 00000000013e0000 0000000001561000
GDI Handle Count      1021
    DeviceContexts: 8
    Regions:        3
    Bitmaps:        1003
    Palettes:       0
    Fonts:          3
    Brushes:        4
    Pens:           0
    Uncategorized:  0

Bingo! That looks pretty good. I have tried my script on various processes and the results are promising. This also works with full memory dumps, which supplies a long missing piece of important information. Often enough no screenshot of Task Manager with the GDI handle count was taken, so this information is lost if the failing process was terminated in the meantime. In some processes ca. 10–20% of the handle information seems to be missing. Since GDIView shows pretty much the same numbers, I hope this script will not be too far off for your crashed processes. Give it a try, or post some additions if you know more about the internals of GDI.

 

Here is the code of DumpGDI.txt

$$ Run as: $$>a<DumpGdi.txt
$$ Written by Alois Kraus 2016
$$ uses pseudo registers t0-t5 and t8-t14

r @$t1=0
r @$t8=0
r @$t9=0
r @$t10=0
r @$t11=0
r @$t12=0
r @$t13=0
r @$t14=0
$$ Increment count is 1 byte until we find a matching field with the current pid
r @$t4=1

r @$t0=$peb
$$ Get address of GDI handle table into t5
.foreach /pS 3 /ps 1 ( @$GdiSharedHandleTable { dt ntdll!_PEB GdiSharedHandleTable @$t0 } ) { r @$t5 = @$GdiSharedHandleTable }

$$ On first call !address produces more output. Do a warmup
.foreach /pS 50 ( @$myStartAddress {!address  @$t5} ) {  }


$$ Get start address of file mapping into t2
.foreach /pS 4 /ps 40 ( @$myStartAddress {!address  @$t5} ) { r @$t2 = @$myStartAddress }
$$ Get end address of file mapping into t3
.foreach /pS 7 /ps 40 ( @$myEndAddress {!address  @$t5} ) { r @$t3 = @$myEndAddress }
.printf "GDI Handle Table %p %p", @$t2, @$t3

.for(; @$t2 < @$t3; r @$t2 = @$t2 + @$t4) 
{
  $$ since we walk bytewise through potentially invalid memory we need first to check if it points to valid memory
  .if($vvalid(@$t2,4) == 1 ) 
  {
     $$ Check if pid matches
     .if (wo(@$t2) == @$tpid ) 
     { 
        $$ increase handle count stored in $t1 and increase step size by 0x18 because we know the cell structure GDICell has a size of 0x18 bytes.
        r @$t1 = @$t1+1
        r @$t4 = 0x18
        $$ Access wType of GDICELL and increment per GDI handle type
        .if (by(@$t2+6) == 0x1 )  { r @$t8 =  @$t8+1  }
        .if (by(@$t2+6) == 0x4 )  { r @$t9 =  @$t9+1  }
        .if (by(@$t2+6) == 0x5 )  { r @$t10 = @$t10+1 }
        .if (by(@$t2+6) == 0x8 )  { r @$t11 = @$t11+1 }
        .if (by(@$t2+6) == 0xa )  { r @$t12 = @$t12+1 }
        .if (by(@$t2+6) == 0x10 ) { r @$t13 = @$t13+1 }
        .if (by(@$t2+6) == 0x30 ) { r @$t14 = @$t14+1 }
     } 
  } 
}

.printf "\nGDI Handle Count      %d", @$t1
.printf "\n\tDeviceContexts: %d", @$t8
.printf "\n\tRegions:        %d", @$t9
.printf "\n\tBitmaps:        %d", @$t10
.printf "\n\tPalettes:       %d", @$t11
.printf "\n\tFonts:          %d", @$t12
.printf "\n\tBrushes:        %d", @$t13
.printf "\n\tPens:           %d", @$t14
.printf "\n\tUncategorized:  %d\n", @$t1-(@$t14+@$t13+@$t12+@$t11+@$t10+@$t9+@$t8)

When you edit the script with Notepad++, try Perl as the scripting language: the syntax highlighter will do a pretty good job on it. Windbg scripts do not look nice, but they work and make dump analysis a lot easier. If you want to use Windbg efficiently I highly recommend reading more about .foreach, .printf, and pseudo registers.

If you e.g. search for a memory leak in a managed process, you usually do !DumpHeap -stat -live to see all still reachable objects. If you omit -live you also see all temporary objects which have not yet been GCed, which can also be interesting if there is a large difference between the two. Once you have identified a type, you can click on the blue link in Windbg, which will print all object instances of that type.

Let's try that with FactoryRecord instances, where we get the list of object addresses if we click in Windbg on the blue link of the !DumpHeap -stat output.

Statistics:
      MT    Count    TotalSize Class Name
6db871a0      114         6384 System.Configuration.FactoryRecord

If we dump one object instance we see that there is an interesting name inside it which could help diagnose the context in which it was allocated:

0:014> !DumpObj /d 02c98ca8
Name:        System.Configuration.FactoryRecord
MethodTable: 6db871a0
EEClass:     6db532f8
Size:        56(0x38) bytes
File:        C:\WINDOWS\Microsoft.Net\assembly\GAC_MSIL\System.Configuration\v4.0_4.0.0.0__b03f5f7f11d50a3a\System.Configuration.dll
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
7257e918  4000249        4        System.String  0 instance 02c98c50 _configKey
7257e918  400024a        8        System.String  0 instance 02c984b0 _group
7257e918  400024b        c        System.String  0 instance 02c98ad4 _name
6db86f9c  400024c       2c ...SimpleBitVector32  1 instance 02c98cd4 _flags
7257e918  400024d       10        System.String  0 instance 02c98b08 _factoryTypeName
6dc0c698  400024e       20         System.Int32  1 instance      200 _allowDefinition
6dc0c6d0  400024f       24         System.Int32  1 instance      100 _allowExeDefinition
6db87168  4000250       30 ...errideModeSetting  1 instance 02c98cd8 _overrideModeDefault
7257e918  4000251       14        System.String  0 instance 02c86bd8 _filename
725807a0  4000252       28         System.Int32  1 instance      123 _lineNumber
7257ecb8  4000253       18        System.Object  0 instance 00000000 _factory
00000000  4000254       1c                       0 instance 00000000 _errors

Let's say we are after the group name. If we dump this object

0:014> !DumpObj /d 02c984b0
Name:        System.String
MethodTable: 7257e918
EEClass:     721bf344
Size:        50(0x32) bytes
File:        C:\WINDOWS\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String:      system.web/caching
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
725807a0  4000243        4         System.Int32  1 instance       18 m_stringLength
7257f35c  4000244        8          System.Char  1 instance       73 m_firstChar
7257e918  4000248       40        System.String  0   shared   static Empty

then we see that this FactoryRecord has something to do with system.web/caching. Now we want to dump the group names of all FactoryRecord instances to get an overview.

.foreach( @$my { !DumpHeap -short -mt 6db871a0  } ) { r @$t0 = @$my; .printf "\n%mu", poi(@$t0+8)+8 }
...
system.transactions
system.transactions

system.web
system.web
system.web
system.web
system.web
system.web
system.web
system.web
system.web
system.web

...

The basic idea is to dump the object addresses of all FactoryRecord instances. Since in .NET the Method Table (MT) pointer is the synonym for a type, we use the MT pointer as argument to !DumpHeap -mt 6db871a0 to print only FactoryRecord instances. The additional -short switch limits the output to the object addresses only, which makes it a perfect fit for .foreach.

For some reason you need to assign the local variable of the foreach loop, @$my, to a pseudo register @$t0 to make calculations possible. Oh, and do not forget the spaces around @$t0 = @$my. The @ before the variable $xx is not strictly necessary, but it speeds up the debugger evaluation engine a bit.

To get the pointer to the _group string, which is stored 8 bytes after the FactoryRecord pointer, we read it via poi(@$t0+8). Now we have the pointer to the string object. .NET strings are null terminated UTF-16 strings, so we can use .printf "\n%mu" to print the string contents. On x64 the offset from a .NET string object to its char array is 0xc, and on x86 it is 8, which I have used above. These little shortcuts can make a dump analysis dramatically faster.

Although I like Visual Studio a lot, for more advanced things you need to use the power of Windbg. Give it a try when you analyze your next dump.