If you expect performance gains from a specific API, you must measure on the actual target operating system under a realistic load. One API surrounded by many myths is SetProcessWorkingSetSize, which can be used to trim the current working set of a process. For a long time now you can also use EmptyWorkingSet, which does the same job with a simpler API call. One usage scenario might be that our process is not used for some time, so it might seem like a good idea to page out its memory to make room for other processes which need the physical memory more urgently. The interesting question is: Is this a good idea? To answer that question one needs to understand how memory management works in detail. To begin our journey into the inner workings of the operating system we need to know:
How Is Memory Allocated?
For that we look under the covers at how memory allocation works from the operating system perspective and from the application developer's view. From a developer's point of view memory is only a new xxxx away. But what happens when you do that? The answer depends highly on the programming language used and its implementation. For simplicity I stick here to C/C++.
Below is the code for a small program named CppEater.exe that allocates and accesses memory several times before and after it gives back its memory to the OS by trimming its working set.
- Allocate 2 GB of data with new.
- Touch the first byte of every memory page.
- Touch the first byte of every memory page again.
- Touch all bytes (2 GB).
- Trim working set (call EmptyWorkingSet).
- Wait for 15s.
- Touch the first byte of every memory page.
- Touch all bytes (2 GB).
// CppEater.cpp source. Update: Full source is at https://1drv.ms/f/s!AhcFq7XO98yJgcg2Ko4dmxFUmQcAKA
#include <windows.h>
#include <psapi.h> // EmptyWorkingSet, link with Psapi.lib
#include <chrono>
#include <cstdio>
template<typename T> void touch(T *pData, int dataCount, const char *pScenario, bool bFull = false)
{
    auto start = std::chrono::high_resolution_clock::now();
    int pageIncrement = bFull ? 1 : 4096 / sizeof(T);
    for (int i = 0; i < dataCount; i += pageIncrement) // touch all memory or only one integer every 4K
        pData[i] = i;
    auto stop = std::chrono::high_resolution_clock::now();
    auto durationInMs = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    char pChars[512];
    snprintf(pChars, sizeof(pChars), "%s: %lldms, %.3f ms/MB, AllTouched=%d", pScenario,
        (long long)durationInMs, 1.0*durationInMs / ( 1.0 * dataCount * sizeof(T) / (1024 * 1024)), bFull);
    printf("\n%s", pChars);
}
int main()
{
    const int BytesToAllocate = 2 * 1000 * 1024 * 1024;
    const int NumberOfIntegersToAllocate = BytesToAllocate / sizeof(int);
    auto bytes = new int[NumberOfIntegersToAllocate]; // Allocate 2 GB
    touch(bytes, NumberOfIntegersToAllocate, "First page touch"); // touch only first byte in every 4K page
    touch(bytes, NumberOfIntegersToAllocate, "Second page touch"); // touch only first byte in every 4K page
    touch(bytes, NumberOfIntegersToAllocate, "All bytes touch", true); // touch all bytes
    ::EmptyWorkingSet(::GetCurrentProcess()); // Force pageout
    ::Sleep(15 * 1000); // wait 15s until the OS calms down
    touch(bytes, NumberOfIntegersToAllocate, "After Empty"); // touch only first byte in every 4K page
    touch(bytes, NumberOfIntegersToAllocate, "Second After Empty", true); // touch all bytes
    delete[] bytes;
    return 0;
}
When you record the activity of this program with ETW you find a 2 GB allocation through the C/C++ heap manager which ends up in VirtualAlloc. VirtualAlloc is to my knowledge the most basic API to request memory from Windows. All heap segment allocations for the C/C++/C# heap go through this API.
| | |- CppEater.exe!main
| | | |- CppEater.exe!operator new
| | | | ucrtbase.dll!malloc
| | | | ntdll.dll!RtlpAllocateHeapInternal
| | | | ntdll.dll!RtlpAllocateHeap
| | | | ntdll.dll!NtAllocateVirtualMemory 2 GB alloc with MEM_COMMIT|MEM_RESERVE
The only difference between the C/C++/C# heap managers is that the heap memory segments are managed differently depending on the target language. In C/C++ objects cannot be moved by the heap manager, so it is important to prevent heap fragmentation with sophisticated allocation balancing algorithms. The managed (.NET) heap is controlled by the garbage collector, which can move objects around to compact the heap, but it has to be careful not to fragment the heap with pinned objects. Once the heap manager has got its big block of memory it will satisfy all following allocation requests from the heap segment(s). When all of them are full or too fragmented the heap manager will request another block of memory via VirtualAlloc.
It is possible to track down managed and unmanaged memory leaks with VirtualAlloc ETW tracing. That works if the leak is large enough to trigger a heap segment re/allocation. But you need to be careful how you interpret the “leaking” stack because it can also happen that some leak causes allocations in the heap but the heap segment re/allocation happens due to large temporary objects which are not the root cause.
Committing memory is quite fast and completes in ca. 1ms for a new[2GB] memory allocation request with VirtualAlloc. Was that really everything? When you commit memory you will see no increase in your working set. If you put a sleep after the new operator and you look at the process with your Task Manager then you see this:
The 2 GB commit is there but your working set is still only 2 MB. From an operating system point of view your application has only got the promise that it can have that much memory. Since you have not yet accessed any of the committed memory, e.g. via a pointer access, the OS had no need so far to allocate a new 4 KB memory page, zero it out and soft fault it into your process working set at the address you accessed for the first time. There is quite a lot going on under the covers to make your process address space look flat, where no other process can interfere with you and your allocated memory. In reality every process has its own virtual address space and the OS is responsible for swapping physical memory in and out of your process as it sees fit. The working set is, to a first approximation, the RAM your application occupies in the physical RAM modules on your mainboard. Some things like DLLs are identical in all processes and can therefore be shared by multiple processes. That is the reason why you have a Working Set, Working Set Shared and a Working Set Private column in Task Manager. If you added up all working sets of a fully utilized machine, the sum would exceed the installed memory by far because shared memory is counted multiple times. That makes the exact calculation of the memory actually used by an application quite difficult.
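The gap between commit and working set can also be made visible from inside the process. The following is a minimal sketch and not part of CppEater: the names pagesToFault, printMemoryCounters and commitThenTouch are made up for illustration, and the Windows specific part is wrapped in #ifdef _WIN32 so that the small arithmetic helper compiles anywhere. It commits memory with VirtualAlloc, reads the process counters with GetProcessMemoryInfo and then touches one byte per 4 KB page.

```cpp
#include <cstddef>
#include <cstdio>

// Number of pages the OS has to soft fault in when every page of a buffer is touched once.
constexpr std::size_t pagesToFault(std::size_t bytes, std::size_t pageSize = 4096)
{
    return (bytes + pageSize - 1) / pageSize;
}

#ifdef _WIN32
#include <windows.h>
#include <psapi.h> // GetProcessMemoryInfo; may require linking with Psapi.lib

// Hypothetical helper: print working set and commit size of the current process.
void printMemoryCounters(const char *label)
{
    PROCESS_MEMORY_COUNTERS_EX pmc = {};
    pmc.cb = sizeof(pmc);
    if (::GetProcessMemoryInfo(::GetCurrentProcess(),
            reinterpret_cast<PROCESS_MEMORY_COUNTERS *>(&pmc), sizeof(pmc)))
    {
        printf("%s: WorkingSet=%zu MB, Commit=%zu MB\n", label,
            static_cast<std::size_t>(pmc.WorkingSetSize) / (1024 * 1024),
            static_cast<std::size_t>(pmc.PrivateUsage) / (1024 * 1024));
    }
}

// Hypothetical demo: commit memory first, then touch it page by page.
void commitThenTouch(std::size_t bytes)
{
    printMemoryCounters("Start");
    auto *p = static_cast<volatile char *>(
        ::VirtualAlloc(nullptr, bytes, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));
    if (p == nullptr)
        return;
    printMemoryCounters("After commit");       // commit has grown, working set barely changed
    for (std::size_t i = 0; i < bytes; i += 4096)
        p[i] = 1;                              // first touch soft faults a zeroed page in
    printMemoryCounters("After first touch");  // now the working set has grown as well
    ::VirtualFree(const_cast<char *>(p), 0, MEM_RELEASE);
}
#endif
```

Calling commitThenTouch(2000ull * 1024 * 1024) in an x64 build should show the commit size jump right after VirtualAlloc, while the working set only follows after the loop has touched all 512,000 pages.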
An easier metric is to sum up all committed memory, which is by definition local to your process. This gives you an upper bound of the physical memory usage. It is an upper bound because some pages have never been touched and are therefore not assigned to any physical memory pages. Besides that, large parts of your application might already be sitting in the page file, where they do not consume physical memory either. In case you wonder why there is no Committed Bytes column in Process Explorer and Process Hacker: as far as I can tell, committed memory is the same as the Private Bytes reported by these tools. The definition used by Process Hacker is quite easy: Private Bytes is the memory which can go to the page file. This excludes things like file mappings, which can be read again from the original file on disk instead of from the page file, as long as the file mapping was not created with copy on write semantics.
Memory Black Hole – Page File Allocated Memory
That sounds as if the working set must always be smaller than the commit size, which is not always the case. A special case is page file backed memory mapped files. These do not count towards your committed memory. If you touch the memory of a page file backed file mapping, all of it goes into your working set. See below for a 2 GB working set where we have only 4 MB of committed memory!
The code to produce such a strange process allocates a file mapping from the page file by specifying INVALID_HANDLE_VALUE as the file handle. If you touch this memory you have a large shared working set but no additional committed memory. Below is a screenshot of the code which produced that allocation:
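Since the screenshot itself is not part of this text, here is a minimal sketch of what such an allocation could look like. It is not the original code: the names splitSize and allocatePageFileBackedMapping are made up for illustration, the 4 KB touch stride is an assumption, and the Windows specific part is wrapped in #ifdef _WIN32 so that the size helper compiles anywhere.

```cpp
#include <cstdint>
#include <utility>

// CreateFileMapping takes the maximum size as two 32-bit halves: (high, low).
inline std::pair<std::uint32_t, std::uint32_t> splitSize(std::uint64_t bytes)
{
    return { static_cast<std::uint32_t>(bytes >> 32),
             static_cast<std::uint32_t>(bytes & 0xFFFFFFFFu) };
}

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: map `bytes` of page file backed shared memory and touch every page.
void *allocatePageFileBackedMapping(std::uint64_t bytes)
{
    const auto size = splitSize(bytes);
    // INVALID_HANDLE_VALUE instead of a file handle: the section is backed by the page file.
    HANDLE hMapping = ::CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr, PAGE_READWRITE,
                                           size.first, size.second, nullptr);
    if (hMapping == nullptr)
        return nullptr;
    auto *p = static_cast<char *>(::MapViewOfFile(hMapping, FILE_MAP_ALL_ACCESS, 0, 0, 0));
    ::CloseHandle(hMapping); // the mapped view keeps the section alive
    if (p != nullptr)
    {
        for (std::uint64_t i = 0; i < bytes; i += 4096)
            p[i] = 1; // touching the view pulls the pages into the shared working set
    }
    return p; // release later with UnmapViewOfFile
}
#endif
```

With a size of 2000 MB the high part is 0 and the low part carries the full value, so the call only needs the two halves for mappings of 4 GB and more.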
If you look into this process with VMMap from SysInternals, a really great tool to look into any process and the memory it contains, you will find that VMMap and Task Manager seem to disagree about the committed memory, because page file backed memory is counted as committed by VMMap but not by Task Manager. Be careful which numbers you look at.
If you want to look at your whole system you can use RAMMap, also from SysInternals. Just like VMMap it shows page file allocated file mappings as Shareable, which gives you a good hint if you have leaked GBs of page file backed memory somewhere.
ETW Reference Set and Resident Set Tracing
There are a number of ETW providers available which can help you find all call stacks which touch committed memory for the first time, which then causes the OS to soft page fault physical memory into your working set. In ETW lingo this is called Reference Set. These ETW providers work best with Windows 8 and later. There you can see the call stack for every allocation and page access in the system. This can show you every first page access but it cannot show the hard page faults due to paged out memory (you can use the Hard Faults graph in WPA for that). Although Reference Set tracing is interesting, it has a very high overhead and produces many events. For every soft page fault which adds memory to any process's working set, an ETW event with the memory type and the call stack which caused the soft page fault can be recorded. That results in many millions of events with a lot of big stack traces. On the plus side it helps a lot if you want to know exactly who touched this memory, but the resulting ETW files are quite big and take a long time to parse. The Reference Set graph was added to WPA with the Windows 10 Anniversary SDK.
Reference Set tracing can be enabled with WPR/UI or xperf where the equivalent xperf command line is
xperf -on PROC_THREAD+LOADER+HARD_FAULTS+MEMORY+MEMINFO+VAMAP+SESSION+VIRT_ALLOC+FOOTPRINT+REFSET+MEMINFO_WS -stackwalk PageAccess+PageAccessEx+PageRelease+PageRangeAccess+PageRangeRelease+PagefileMappedSectionCreate+PagefileMappedSectionDelete+VirtualAlloc
or you can use the predefined xperf kernel group
xperf -on ReferenceSet
There is another profile in WPR which is called Resident Set. When the trace session ends it creates a snapshot which shows for all processes where their memory is allocated. Functionally it is like opening VMMap for all processes at a specific point in time. The ETW events used are largely the same as with Reference Set minus the call stacks, which reduces the traced data considerably. If you enable Reference Set tracing you also get Resident Set tracing since it is a superset of Resident Set. Although WPA displays the resident set of a specific point in time, it needs the ETW events belonging to the actual allocations to be able to assign every memory page to its owning process. Otherwise you will only know that large portions of your active memory consist of page file allocated memory but you will not know which process it belongs to.
The xperf command line for Resident Set tracing is
xperf -on PROC_THREAD+LOADER+HARD_FAULTS+MEMORY+MEMINFO+VAMAP+SESSION+VIRT_ALLOC+DISK_IO
or you can use the xperf kernel group
xperf -on ResidentSet
Of these many providers the MEMORY ETW provider is by far the most expensive one. It records every page access and release. But without it you will not get any Reference/Resident Set graphs in WPA. I am not sure if all events are really necessary, but the large amount of data generated by this provider only allows recording very short durations on a busy machine where many allocations and page accesses happen. VirtualAlloc allocated memory is the only exception, which can always be attributed to a specific process under the special Page Category VirtualAlloc_PreTrace.
Memory Black Hole – Large Pages
There are more things where Task Manager is lying to you. Have a look at this process
How much physical memory is it consuming? 68 KB? Let's look at our free memory while CppEater is running:
CppEater causes a rise of the In use memory shown in Task Manager from 5.9 GB to 7.9 GB. Our small application is eating 2 GB of memory but we cannot see this memory attributed to our process! When the system misbehaves like this we need to look at memory at system level. For this task RAMMap is the tool to use.
Here we see that someone has allocated 2 GB of memory in large pages. I hear you saying: Large what? It is common wisdom that on Intel CPUs the natural page size is 4 KB. But for some server workloads it makes sense to use larger pages. Usually large pages are 2 MB in size or multiples of that. Some Xeon CPUs even support large pages of up to 1 GiB. Why should this arcane know-how be useful? Well, there is one quite common process which employs large pages quite heavily: Microsoft SQL Server. If you happen to see an innocently small sqlservr.exe and after your 24h load test you miss several GB of memory for some reason, but no process seems to have allocated it, chances are high that SQL server has allocated some large pages which look small in Task Manager.
There is a tool to see how much memory SQL server really uses: VMMap
Large pages manifest themselves as Locked WS in VMMap, which shows exactly our lost 2 GB of memory. Is this ridiculously small value in Task Manager a bug? Well, sort of yes. Trust no tool 100%. Every number you will ever see with regard to memory consumption is wrong or a lie to some extent. Even Process Explorer shows these nonsensical values. I can only speculate how this small value is calculated. It could be that Task Manager simply calculates WS Pages * 4096 bytes/Page = Working Set bytes.
A specialty of large pages is that they are never paged out and are allocated immediately when you call VirtualAlloc. This is how SQL server can grab large amounts of physical memory with one API call and then play operating system with the handed out memory by itself. Here is the sample code to allocate memory with large pages. Large parts are only ceremony because you need to get the Lock pages in memory privilege before you can successfully call VirtualAlloc with MEM_LARGE_PAGES.
int * LargePageAlloc(unsigned int numberOfBytesToAllocate)
{
    printf("\nLarge Page allocator used.");
    ETWSetMark("Large Page allocator");
    HANDLE hToken = NULL;
    TOKEN_PRIVILEGES tp = {};
    LUID luid = {};
    HANDLE proc_h = OpenProcess(PROCESS_ALL_ACCESS, FALSE, GetCurrentProcessId());
    OpenProcessToken(proc_h, TOKEN_ADJUST_PRIVILEGES, &hToken);
    ::LookupPrivilegeValue(0, SE_LOCK_MEMORY_NAME, &luid);
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Luid = luid;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    auto status = AdjustTokenPrivileges(hToken, FALSE, &tp, sizeof(tp), (PTOKEN_PRIVILEGES)NULL, 0);
    // AdjustTokenPrivileges also returns TRUE when the privilege is not held, so check the last error as well
    if (status != TRUE || ::GetLastError() == ERROR_NOT_ALL_ASSIGNED)
    {
        printf("\nSeLockMemoryPrivilege could not be acquired. LastError: %lu. Use secpol.msc to assign to your user account the SeLockMemoryPrivilege privilege.", ::GetLastError());
    }
    ::CloseHandle(hToken);
    ::CloseHandle(proc_h);
    // The size must be a multiple of GetLargePageMinimum()
    auto *p = (int *) VirtualAlloc(NULL, numberOfBytesToAllocate, MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES, PAGE_READWRITE);
    if (p == NULL && ::GetLastError() == ERROR_PRIVILEGE_NOT_HELD) // error 1314
    {
        printf("\nNo Privilege held. Use secpol.msc to assign to your user account the SeLockMemoryPrivilege privilege.");
    }
    return p;
}
To try the sample out you need to grant your user or group the Lock pages in memory privilege with secpol.msc and then log out and log in again to make the change active.
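One detail worth knowing when you experiment with the sample: VirtualAlloc with MEM_LARGE_PAGES only accepts sizes which are a multiple of the value returned by GetLargePageMinimum. The 2000 MB used by CppEater happen to be a multiple of 2 MB, but arbitrary sizes are not. A small sketch of the rounding, with the made up helper names roundUpToLargePage and largePageAlignedSize (the Windows call is wrapped in #ifdef _WIN32 so that the arithmetic compiles anywhere):

```cpp
#include <cstddef>

// Round `bytes` up to the next multiple of `largePageSize` (largePageSize must not be 0).
inline std::size_t roundUpToLargePage(std::size_t bytes, std::size_t largePageSize)
{
    return (bytes + largePageSize - 1) / largePageSize * largePageSize;
}

#ifdef _WIN32
#include <windows.h>

// Hypothetical wrapper: returns 0 when the system does not support large pages.
inline std::size_t largePageAlignedSize(std::size_t bytes)
{
    const SIZE_T largePageSize = ::GetLargePageMinimum(); // 0 if large pages are unsupported
    return largePageSize == 0 ? 0 : roundUpToLargePage(bytes, largePageSize);
}
#endif
```

The result of largePageAlignedSize can then be passed as the size argument to an allocator like LargePageAlloc instead of the raw byte count.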
When you suspect SQL server memory issues you should check out http://searchsqlserver.techtarget.com/feature/Built-in-tools-troubleshoot-SQL-Server-memory-usage which contains plenty of good advice how you can troubleshoot unexpected SQL server memory usage.
One of the first things you should do is to configure the minimum and maximum memory you want SQL server to have. See https://technet.microsoft.com/en-us/library/ms180797(v=sql.105).aspx for more information. Otherwise SQL server will sooner or later use up all of your memory. If you run out of physical memory a low memory condition event is raised which can cause SQL server to flush its caches completely which can cause query timeouts. It is therefore important to set reasonable values for min/max SQL server memory and test them before going into production.
If you are after a SQL Server memory leak you can query the SQL server diagnostics tables directly like this:
select type,
sum(pages_in_bytes)/1024.0/1024.00 'Mem in MB',
count(*) 'Count'
from sys.dm_os_memory_objects
group by type
order by sum(pages_in_bytes) DESC
to query your SQL server for allocated objects, which can give you a hint what was being leaked. Quite a few SQL server memory leaks are known. It therefore makes a lot of sense to stay at the latest SQL server patch level if you tend to lose memory to some unknown memory black holes after weeks of operation. When you look at a real SQL server process you can pretty safely assume that its working set is a lie. Instead you can use the commit size as a good approximation of its current physical memory usage because nearly all SQL server memory is stored in large pages which cannot be paged out. This is only true if SQL server has been granted the privilege to lock pages.
Windows 10 Memory Compression
It is time to come back to the original headline. One of the big improvements of Windows 10 was that it compresses memory before writing it into the page file. With the Anniversary edition of Windows 10 this feature is now more visible in the task manager where the amount of compressed memory is now also displayed.
Another change of the Anniversary update is that originally the System process did own all of the compressed pages. MS had decided that too many users were confused by the large memory footprint of the System process because it holds all of the compressed memory in its working set. Now another hidden process owns all of the compressed memory which shows up in Process Explorer/Hacker under the name Memory Compression. It is a child of the System process. In ETW traces it is called MemCompression. These caches are therefore not visible in the process list of task manager except for the (Compressed) number in the overview which tells you how much working set the Memory Compression process currently has.
The compression and decompression is performed single threaded in the Memory Compression process. When we look at our original program CppEater and let it run with ETW tracing enabled, we see that the first page touch at every 4 KB for 2 GB of memory takes 366ms. This was measured in Release/x64 on Windows 10 Anniversary on my Intel i7-4770K @ 3.5 GHz.
First page touch: 366ms, 0.183 ms/MB, AllTouched=0
Second page touch: 26ms, 0.013 ms/MB, AllTouched=0
All bytes touch: 314ms, 0.157 ms/MB, AllTouched=1
Flushing working set
After Empty: 4538ms, 2.269 ms/MB, AllTouched=0
Second After Empty: 314ms, 0.157 ms/MB, AllTouched=1
When you look at the details you find that the first page access is slow because of soft page faults. The second and third memory accesses are then fast and are dominated only by the CppEater process itself. When the working set is trimmed we find that the MemCompression process is doing a lot of single threaded work which takes ca. 6.3 s for 2 GB of memory. The compression speed is therefore about 320 MB/s, which is not bad. After the memory has been compressed and CppEater has slept for some time, it is time to touch our memory again. This time the memory access causes hard faults, which are not hard in the usual sense where we need to read memory from the disk. Instead we need to decompress the previously compressed memory. That takes 4.5 s, which results in a decompression rate of 440 MB/s and shows that decompression takes nearly 30% less time than compression.
It is justified to call it a hard page fault since we are 12 times slower (=4538ms/366ms) on first page access when we hard fault back compressed pages.
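The rates quoted above follow directly from the measured durations. As a quick cross-check, here is a small sketch of the arithmetic. The function names are made up for illustration, and the data size of 2000 MB is taken from the 2 * 1000 * 1024 * 1024 bytes which CppEater allocates.

```cpp
#include <cstdio>

// Throughput in MB/s for `megabytes` processed in `milliseconds`.
double throughputMBPerSecond(double megabytes, double milliseconds)
{
    return megabytes / (milliseconds / 1000.0);
}

// How many times slower a duration is compared to a baseline duration.
double slowdownFactor(double milliseconds, double baselineMilliseconds)
{
    return milliseconds / baselineMilliseconds;
}

// Prints the numbers derived from the measurements quoted in the text.
void printCompressionNumbers()
{
    const double dataMB = 2000.0; // size of the CppEater allocation in MB
    printf("Compression:   %.0f MB/s\n", throughputMBPerSecond(dataMB, 6300.0));
    printf("Decompression: %.0f MB/s\n", throughputMBPerSecond(dataMB, 4538.0));
    printf("Compressed page fault vs. soft fault: %.1fx slower\n", slowdownFactor(4538.0, 366.0));
}
```

This gives roughly 317 MB/s for compression, 441 MB/s for decompression and a slowdown factor of about 12.4, which matches the rounded values in the text.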
The evolution of the working set of CppEater and the MemCompression process can be seen nicely in the Virtual Memory Snapshots. There we see that MemCompression gets all the memory from the modified private pages which are caused by calling EmptyWorkingSet in our process, and puts it into its own working set after it has compressed the memory. We also see that MemCompression ends up at 1.7 GB of memory, which means that for an integer array which consists of 1,2,3,4,… the compression ratio is 0.85, which is okish. The compression is of course much better if large blocks with identical values can be compressed, so this is really a stress test for the compressor with data that is not completely random.
You can show the relative CPU costs of each operation nicely with Flame Graphs with the same column configuration.
The numbers for first page access (soft page faults) are pretty much consistent with the ones Bruce measured at https://randomascii.wordpress.com/2014/12/10/hidden-costs-of-memory-allocation/ where he measured 175 μs/MB for soft page faults. In reality you will always see a mixture of soft and hard page faults, and now also semi hard faults due to memory compression, which makes the already quite complex world of memory management even more complicated. But these semi hard faults from compressed memory are still much cheaper than reading the data back from the page file.
When Does it Page?
So far all operations were in memory, but when does the page file come into play? I have made some experiments on my machine and others. Paging (mainly) sets in when you have no physical memory left over. When the Active List (in kernel lingo that is all used memory except for caches) reaches the installed physical memory of your machine, something dramatic must happen. The used physical memory is not identical to committed memory. If committed memory was never accessed by your application it can stay “virtual” and needs no physical memory backing. Is it a good idea to commit all of your physical memory? If you look at Task Manager on my 16 GB machine you see that I can commit over 19 GB of memory and still have 1.9 GB available because not all committed memory pages were accessed.
While I allocate and touch more and more memory the amount of Available and Cached memory goes to zero. What does that mean? It is the file system cache which you have just flushed out of memory. We all know that a machine reacts very slowly right after it has booted, while the hard disk never seems to come to rest. The reason is that after a boot all DLLs need to be loaded from disk because the file system cache is not yet populated. If you want to do your users a favor: Never allocate all of the physical memory but leave 15-20% for the file system cache. Your users will thank you that you did not ruin the interactive performance too much.
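If you want to turn that rule of thumb into a number, a minimal sketch could look like this. The helper names allocationBudget and allocationBudgetForThisMachine are made up, the default of 20% is just the upper end of the range suggested above, and the Windows call is wrapped in #ifdef _WIN32 so that the arithmetic compiles anywhere.

```cpp
#include <cstdint>

// Bytes an application may use if `reservePercent` (0-100) of the RAM is left
// to the file system cache.
inline std::uint64_t allocationBudget(std::uint64_t totalPhysicalBytes, unsigned reservePercent)
{
    return totalPhysicalBytes - totalPhysicalBytes / 100 * reservePercent;
}

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: budget based on the installed physical memory, 0 on failure.
inline std::uint64_t allocationBudgetForThisMachine(unsigned reservePercent = 20)
{
    MEMORYSTATUSEX status = {};
    status.dwLength = sizeof(status); // must be set before calling GlobalMemoryStatusEx
    if (!::GlobalMemoryStatusEx(&status))
        return 0;
    return allocationBudget(status.ullTotalPhys, reservePercent);
}
#endif
```

On a 16 GB machine a 20% reserve leaves a budget of roughly 12.8 GB. Keep in mind that this counts the memory of all processes together, not only your own.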
If you still insist on allocating all physical memory then bad things will happen:
When no more memory is available the OS decides that all processes are bad guys and tries to flush out everything into the Modified list (brown region in the Memory Utilization graph). That is memory which is pending to be written to the page file. That makes sense if one or more processes never access that memory anymore due to huge memory leaks. That way the OS can place the leaked memory into the page file and continue working for a much longer time. At this time large amounts of data are written into the page file. When that happens the system becomes unusable: the UI hangs, a simple printf call takes over six seconds and you experience a sudden system hang. The system is not really hanging, but it is quite busy with reading and writing data to and from the page file at that point in time. These high response times of hard disks are the main reason why SSDs are a much better choice for your page file. The large slowdowns are caused by slow reads from the page file, in this case 16.3 (see the Disk Service Time column in the Utilization by Disk graph, which was filtered for the page file), and they can be greatly reduced by placing the page file on an SSD. Since Windows 10 the kernel tries its best to compress the memory before writing it into the page file, which reduces slow disk IO by ca. 40%.
When the system is paging the first memory touch times rise dramatically from ca. 240ms to 1800ms (see ETW Marks graph) which is over seven times slower because it takes time to write out old data to make room for new memory allocations.
Cross Process Private Page Sharing
While measuring the effects of memory compression I stumbled upon an interesting effect. Suppose you start e.g. 8 processes where each of them allocates one GB of random data which is initialized with
pData[i] = rand();
in a loop. Now each process has one GB of working set and we allocate 8 GB of physical memory in total. When we trim the working set of each process we should see 8 GB of memory in the working set of the MemCompression process because random data does not compress well. But instead I saw only about 1 GB of data in the MemCompression process! How can Windows compress random data to 1/8 of its original size? I think it can't. Something else must be happening here. The numbers are only pseudo random and rand() always generates the same sequence of values in all processes. We therefore have identical random data in all processes in their own private memory.
Now let's change things a bit and initialize the random number generator with a different seed in each process at process start with
srand( lint.LowPart );
Now we truly get the expected 8 GB of working set in the MemCompression process. It seems that Windows creates an internal hash table of all paged out pages (it could also be a B-tree) and adds a newly compressed page to the MemCompression process only if it was not already encountered. A very similar technique is employed by VMs where it is called transparent memory sharing between VMs.
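The effect of the seed can be reproduced without any Windows API. The following sketch only simulates the situation inside one process and says nothing about how Windows really combines pages: the names Page, fillProcessMemory and countUniquePages are made up, and a std::set of page contents stands in for whatever lookup structure the kernel might use. A process which never calls srand behaves as if it had called srand(1), which is why all unseeded CppEater instances end up with the same data.

```cpp
#include <cstddef>
#include <cstdlib>
#include <set>
#include <vector>

using Page = std::vector<int>; // one 4 KB page holds 1024 ints

// Fill `pageCount` pages like one CppEater instance does: pData[i] = rand().
std::vector<Page> fillProcessMemory(unsigned seed, std::size_t pageCount)
{
    std::srand(seed); // seed 1 is what an unseeded process gets by default
    std::vector<Page> pages(pageCount, Page(4096 / sizeof(int)));
    for (auto &page : pages)
        for (auto &value : page)
            value = std::rand();
    return pages;
}

// Count the distinct page contents across all simulated processes.
std::size_t countUniquePages(const std::vector<std::vector<Page>> &processes)
{
    std::set<Page> unique; // compares the full page contents
    for (const auto &pages : processes)
        unique.insert(pages.begin(), pages.end());
    return unique.size();
}
```

With 8 simulated processes of 16 pages each, the same seed leaves only 16 distinct pages while 8 different seeds give all 128, which mirrors the 1 GB versus 8 GB seen in the MemCompression working set.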
Below you see the working set of the MemCompression process. In the first run CppEater was started without a random seed, so all duplicate pages can be combined. The green bars mark the time from the working set flush until the CPU consumption of MemCompression was flat again. In the second run each CppEater instance used a unique seed for the random number generator. No page sharing can happen because all pages are different now, and we see the expected eightfold increase in the MemCompression working set when we force the page out of the working sets of the CppEater instances.
It is interesting that this vastly different behavior can easily be seen in the kernel memory lists. The working set of MemCompression counts simply to the Active List. The only difference is that after the flush of the working set the Active List memory remains much larger than in the first case (blue bar).
Conclusions – Is EmptyWorkingSet Good?
In general it is (nearly) never a good idea to trim your own working set. If your mental model is that you are simply giving the physical memory back to the OS, which will then dutifully write it into the page file, then it is simply wrong. You are not releasing memory but you are increasing the Modified List buffers which are flushed out to disk from time to time. If you do this a lot it can cause so much disk write load, together with some random access page in activity, that your system will come to a sudden halt while you still have plenty of physical memory left. I have seen 128 GB servers which never got above 80 GB of active memory because the system was so busy with paging out memory by forced calls to EmptyWorkingSet that it already behaved as if it had reached the physical memory limit, with the usual large user visible delays caused by paging. Once we removed the offending calls to EmptyWorkingSet the user perceived performance and memory utilization became much better.
Other people have also tried SetProcessWorkingSetSize to reduce the memory consumption but failed as well due to application responsiveness issues. It took much longer to bring all the loose ends together into an article than I initially anticipated. If I have missed important things or you have found an error, please drop me a note and I will update the article.