.NET Serialization Roundup 2022

The last test is from 2019 which is still accurate but the world is changing and we have arrived at .NET 7.0 which is reason enough to spin up my test suite again and measure from .NET 4.8, 3.1, 5.0, 6.0 up to 7.0. The “old” articles are still relevant despite their age.

The following serializers are tested

– BinaryPackhttps://github.com/Sergio0694/BinaryPackBinary
– Cerashttps://github.com/rikimaru0345/CerasBinary
– FastJsonhttps://github.com/mgholam/fastJSON/Json
– JILhttps://github.com/kevin-montrose/JilJson
JsonSerializer.NET Json
– MsgPack.Clihttps://github.com/msgpack/msgpack-cliBinary
Google Protobufhttps://github.com/protocolbuffers/protobufBinary
– SimdJsonSharphttps://github.com/EgorBo/SimdJsonSharpJson
– UTF8Jsonhttps://github.com/neuecc/Utf8JsonJson

Serializers with – in front are in maintenance mode.

New entries since 2019 are

  • BinaryPack
  • MemoryPack
  • Google Protobuf

Which of these 23 serializers are in maintenance mode? Conditions (logical or) for maintenance mode are

  • Last commit > 1 year
  • It is archived
  • All commits within one year changed no product code
  • The author tells you to not use it

Maintenance mode serializers at the point of writing are:

  • BinaryPack
  • Ceras
  • FastJson
  • JIL (2019 last commit)
  • MsgPack.Cli
  • SimdJsonSharp
  • UTF8Json

You can still look at them, but I would not use them for real products where future updates to e.g. .NET 8.0 or security fixes are mandatory.

What else is not recommended in 2022?

BinaryFormatter should nowhere used in production anymore because it allows, if you deserialize attacker controlled data, to open a shell on your box. See https://learn.microsoft.com/en-us/dotnet/standard/serialization/binaryformatter-security-guide.

The following serializers which are part of .NET Framework are unsafe:

Used Benchmark

I did write the SerializerTests test suite which makes it easy to test all these serializers. If you want to measure for yourself, clone the repository, compile it, execute RunAll.cmd and you get a directory with CSV files which you can graph as you like.

Tooling is getting better and I have also added profiling support so you can check oddities at your own if you call RunAll.cmd -profile you get ETW traces besides the CSV files. This assumes you have installed the Windows Performance Toolkit from the Windows SDK.

Serialize Results

The winner in terms of serialize speed is again FlatBuffer which still has the lowest overhead, but you need to go through a more complex compiler chain to generate the de/serialization code.

*Update 1*

After discussions about the fairness how FlatBuffer should be tested I have changed how FlatBuffer performance is evaluated. Now we use an existing object (BookShelf) which is used as input to create a FlatBuffer object. Additionally it was mentioned that some byte buffer should also be part of the tested object. This moves FlatBuffer from rank 1 to 15, because the new test assumes you have an object you want to convert to the efficient FlatBuffer binary format. This assumption is reasonable, because few people will want to work with the FlatBuffer object directly. If you use the FlatBuffer object directly then the previous test results still apply.

The layout of the serialized Book object include

  • int Id
  • string Title
  • byte[] Payload Before it was null now it contains 10 bytes of data

Our BookShelf object looks as serialized xml like this

<?xml version="1.0" encoding="utf-8"?>
<BookShelf xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <Title>Book 1</Title>

Now we serialize nearly 40% of binary data which contains just printable ASCII characters. This “simple” change moves the landscape quite a bit: GroBuf is first and MemoryPack second. This contains also an update from MemoryPack from 1.4.4 -> 1.8.10. If GroBuf has less overhead (but double the serialized size because it stores strings as Utf-16 while most others convert the data to Utf-8), or if MemoryPack has some regression I cannot tell. To reproduce the tests you can run the tests with

RunAll.cmd 10

to use a 10 byte buffer payload for each Book object. If you omit the 10 then you will serialize with 0 bytes of payload. The combined csv file SerializationPerf_Combined_10.csv contains as suffix how much payload per book object was used.

*Update 1 End*

The new entry MemoryPack is second which is the newest serializer by Yoshifumi Kawai who seems to write serializers for a living. He has several entries in this list namely

  • MessagePackSharp
  • Utf8JsonSerializer

His latest MemoryPack library shows that you still can get faster once you know what things cost and what you can avoid without becoming too specialized. This one has pretty much all features one could wish for including a Source Generator for Typescript to read the binary data inside the browser or wherever your Typescript code is running.

Additionally he provided a helper to compress the data with Brotli, a very efficient compression scheme which competes with Bois_LZ4 which is currently the only serializer which supports also compression of the data. The Brotli algorithm is nice but there is zstd from Facebook which should be a bit faster which would make a nice addition perhaps in the future?

That is really impressive!

From a higher level we find that the old .NET Framework is getting behind in terms of speed and we see a nice step pattern which proves that every .NET Framework release has become faster than its predecessor.

Binary is good, but Json has also a large fan base. Microsoft did add with .NET 7.0 Source Generators to generate the serialization code at compile time for your objects. See https://github.com/Alois-xx/SerializerTests/blob/master/Serializers/SystemTextJsonSourceGen.cs how you can enable code generation for serialization. This feature generates code only for serialize but uses pregenerated metadata for deserialize. The feature to add pregenerated deserialization code is tagged as future (post .NET 8?). If you think this is important you can upvote the issue. The main reason to add this was not only performance but also AOT scenarios where you have no JIT compiler.

With pregenerated serialization code in SystemTextJsonSourceGen our BookShelf object is 20+% faster serialized compared to SystemTextJson which uses the “normal” runtime generated code. A significant part of the performance improvement seems to be related to directly write UTF8 without the need to convert an UTF-16 char memory stream to an UTF-8 byte stream.

Deserialize Results

*Update 1*

We access after deserialization of the FlatBuffer object all properties which cause the FlatBuffer struct to create a new string or byte[] array on each access again. AllocPerf does not use a binary blob yet. The numbers are therefore identical.

Even with these changes FlatBuffer wins the deserialize race with some margin. The second place goes to GroBuf instead of MemoryPack. But these are really close and had before been fighting for the second place. As already mentioned before GroBuf doubles the serialized data size which allows it to omit the Utf-16 to Utf-8 conversion which saves some speed at the expense of data size.

*Update 1 End*

Normally data is more often read than written. Deserialization needs to allocate all the new objects which makes it largely GC bound. This is the prime reason why deserialize can never be faster than serialize because it needs to allocate new objects while in the serialize case these are already existing.

The fastest serializer is AllocPerf which is not a serializer, but an allocation test which walks over an array of binary data without parsing and constructs strings out of the data. This is a test how fast one could get without any parsing overhead.

MemoryPack sets the mark again which is nearly overhead free. With so little overhead there must be a downside? Yes you loose versioning support to some extent.

The follower after MemoryPack is BinaryPack which is not even trying to be version tolerant and can be used for simple caches but other than that I would stay away from it. But beware it is in maintenance mode!

Another notable change in Protobuf_net was the deprecation since 3.0 of the AsReference attribute which many people are missing. This made it easy to deserialize object graphs with cyclic dependencies, which is with Protobuf_net 3.0 no longer the case.

The now in maintenance mode Utf8JsonSerializer is still 35% faster than SystemTextJson. There is still room for improvement for .NET.

First Startup Results

Besides the pure runtime performance startup costs are also important. That is more tricky to measure since you not only need to pay the costs of reflection, code gen, JIT time for the precompiled (Cross/Ngen) and not precompiled scenario inside a fresh executable. If you do not use a new executable each time you would attribute the costs for first time init effects to the first called serializer in a process which would introduce systematic bias. SerializerTests starts for each serializer a new executable which is started three times (last time is added to CSV) to warmup the loaded dlls to not induce expensive hard page faults which are served from the disk, but from the file system cache. Then it serializes 4 different object types with one object and takes the maximum value which should be the first call. These details allow one to check if just the first code gen is so expensive and subsequent code gen is much cheaper because many one time init effects already have happened.

The old .NET Framework 4.8 looses by a large margin here, but only if Ngen is not used. If the images are Ngenned then it is not too bad compared to crossgenned .NET core binaries.

Is this the full story? Of course not. We are only measuring inside a small loop long after many things have happened, but since precompilation will cause all dependent dlls to be eagerly loaded by the OS loader we should better look at the runtime of the process as a whole. This can show interesting … gaps in our testing methodology.

To look deeper you can run the test suite with ETW profiling enabled with

RunAll.cmd -profile

to generate ETW data while executing the tests. You need to download Windows Performance Toolkit to generate the profiling data and ETWAnalyzer to easily query the generated data. You get from the RunAll.cmd command a bunch of CSV files, a combined file CSV file and some ETW files.

When you put ETWAnalyzer into your path you can extract the ETW file data into Json files which are put into the Extract folder. To convert the ETL files into Json files you add to the ETWAnalyzer command line -Extract All, the folder where the ETL files are located and a symbol server e.g. MS which is predefined. You can also use NtSymbolPath to use your own Symbol search path defined by the _NT_SYMBOL_PATH environment variable.

 ETWAnalyzer -Extract All -FileDir D:\Source\git\SerializerTests\SerializerTestResults\_0_14_06 -SymServer MS 

The extracted Json files have a total size of 112 MB vs 5 GB of the input ETL files which is a good compression ratio! Now we can generate a nice chart with an Excel Pivot Chart to look how long each of the 3 invocations of the startup tests did run:

And the winner is .NET 4.8! That is counter intuitive after we have measured that .NET 7 is faster everywhere. What is happening here? This chart was generated with input from ETWAnalyzer to dump process start/stop times command line and their duration into a CSV file:

ETWAnalyzer.exe -dump process -cmdline *firstcall#* -timefmt s -csv FirstCall_Process.csv

ETWAnalyzer can also dump the CPU consumption to console so we can drill deeper

Although the command line syntax looks complex it is conceptually easy. You first select what to dump e.g. CPU, Process, Disk, … and then add filters and sorting. Because ETW records a lot of data you get a lot of sorting and filter options for each command of ETWAnalyzer which may look overwhelming but it is just filtering, sorting and formatting after you have selected what to dump.

The following call dumps process CPU consumption for all processes with a command line that contains

  • firstcall and cross and datacontract to select the .NET Core crossgenned processes OR
  • firstcall and ngen and datacontract to select the .NET 4.8 Ngened SerializerTests calls

-ProcessFmt s prints process start stop times in ETW recording time (time since session start) and the process duration. You can format also your local time, or utc by using local or utc as time formats.

EtwAnalyzer -dump cpu -cmdline "*firstcall* cross*#datacontract*";*firstcall*ngen*#datacontract* -ProcessFmt s

After drilling into the data it became apparent that since .NET Core 3.1 we are now having two additional serializers (MemoryPack and BinaryPack) which both have significant startup costs. These costs are always payed even if you never use them! I have created a branch of SerializerTests (noBinaryMemoryPack) to remove them so we can compare .NET 4.8 and .NET Core+ with the same number of used dependencies.

Each .NET version has two bars, the lower one is crossgen and the higher one involves JIT compilation of the binaries. Since each process is started 3 times we see in the y-axis the sum of 3 process starts. NET 7.0 seems to have become a bit slower compared to the previous versions. The winner is still .NET 4.8 (455 ms) but at least the difference is not as big as before (558ms vs 972ms).

It is very easy to fool yourself with isolated tests which do not measure the complete (system) picture. Measuring something and drawing a picture is easy. Measuring and understanding what the data means in full details is a much bigger endeavor.

Cool Console Tools

I work a lot with command line tools which produce large width output. The cmd shell has the nice feature to configure a horizontal scroll bar by setting the console width to e.g. 500. This works but newer shells like Windows Terminal are becoming the norm. It works like Linux shells which never had support for horizontal scroll bars. The issue to add a scroll bar is denied https://github.com/microsoft/terminal/issues/1860 which is a pity. But there is an open source solution for it. A little known terminal extension named vtm is offering all features I need. When configuring my command prompt with vtm -r cmd

I get a Windows Terminal window which supports horizontal scrolling, PlainText, RTF, HTML text copy, Ctrl+V (Ctrl+C not yet) and Unix style copying with right mouse click. This makes working with ETWAnalyzer much easier and you can also properly read the help which is also wide output …

EtwAnalyzer -dump CPU -ProcessName SerializerTests.exe -ShowModuleInfo -TopN 1 -NoCmdLine

The command dumps of all extracted files in the current directory the SerializerTests executable with the highest CPU consumption and the module information which includes file version, product name, and directory from where it was started. That can be useful to check if e.g. your local IT has installed the updated software to your machine, or you need to call again to get the update rolled out. By making such previously hard to get information easily queryable I have found that for some reason SerializerTests was running under false flag as protobuf product and since a long time without any file version update. Both issues are fixed, because ETWAnalyzer made it very obvious that something was wrong.


This is the end of this post, but hopefully your start to make a better informed decision which serializer is best suited for your needs. You should always measure in your environment with your specific data types to get numbers which apply to your and not my still synthetic scenario. When you execute RunAll.cmd dd you add to each Book object a byte array with dd bytes which may change the performance landscape dramatically. This can simulate a test where you use a serializer as wrapper over another already serialized payload which did already generate a byte array. That should be cheap, in theory. But when larger byte arrays were used with the fastest Json serializer (UTF8Json) it became one of the slower ones. I tried hard to get the numbers right, but you still need to measure for yourself to get your own right numbers for your use case and scenario.

Performance numbers are not everything. If essential features are there, and you trust the library author that he will provide support over the foreseeable future, only then you should consider performance as next item on your decision matrix to choose a library for mission critical long lived projects. If you are running a pet project, use whatever you seem fit.


Slow network after adding a second WIFI Access Point

Networking is hard and many things can go wrong. I have bought an Oculus Quest VR headset for my daughter. Soon I have found it frustrating to connect it via an USB cable to my PC. To get rid of the cable I have bought a cheap TP-Link Archer C5 AC1200 router which I connect via my LAN cable to the PC. After becoming wireless I can send realtime VR images over my dedicated WIFI to the VR headset with stable 150 MBit/s.

The “real” WIFI (yellow) which is connected to the internet is located in another room which is not capable to transfer 150 MBit/s without any interrupts which would kill the VR experience. That is the reason why nearly everyone recommends to use either an USB cable (inconvenient) or a LAN cable (blue) connected to a dedicated WIFI router for the VR headset.

That setup works great, but I have noticed that whenever I have the second router running that browsing the internet would feel very slow. Every page refresh was taking a lot longer compared to when I did turn the second WIFI router off.

Since all browsers have great development tools I just needed to press F12 to see what was blocking my browser requests. In the example below I have tried to access a far far away web site which shows in the network tab for the first request a delay 10,55s for host name resolution (Dns).

On my box these timings were always around 10s so I knew that I did suffer from slow Dns queries to resolve the host name e.g. http://www.dilbert.com to the actual IP address. Knowing that page load times are blocked by 10+ seconds by Dns is great, but now what? I was suspecting that, since I have two network tabs in Task Manager, that the DNS query was first sent via the LAN cable to a dead end because my Oculus is not connected to the internet and then via WIFI to the actual internet connected router.

Initially I tried the route print command to see where why packets were routed, but that looked too complicated to configure the order for every local route. After a bit more searching I have found the powershell command Get-NetIPInterface which shows you over which network interface Windows sends network packets to the outside world. The relevant column is InterfaceMetric. If multiple network cards are are considered the network interface with the lowest InterfaceMetric number is tried first.

As expected I have found my local Ethernet connection with InterfaceMetric 5 with the preferred connection to send network packets to the outside world. But what should be the other network? The easy way out is to cause some traffic in the browser and check in Task Manager in the Performance Tab which network gets all the traffic:

I need therefore to change the order vEthernet (New Virtual Switch) with metric 15 and interface id 49 and vEthernet (New Virtual Switch) with metric 25 and interface id 10.

Set-NetIPInterface -InterfaceIndex 10 -InterfaceMetric 5
Set-NetIPInterface -InterfaceIndex 49 -InterfaceMetric 25
These two commands did switch the order of the two relevant networks (ifIndex 10 and 49). How fast are things now? Internet surfing feels normal and the change of switched network ordering did save the day. There are still Dns delays but nowhwere in the 10s region. Case solved.

The average reader can stop reading now.

What follows is a deep dive into how Dns works on Windows and how you can diagnose pretty much any Dns issue with the builtin Dns tracing of Windows with the help of ETWAnalyzer.

So far we have just looked at Dns as a black box, but you can trace all Dns requests with the help of ETW (Event Tracing for Windows) and the Microsoft-Windows-DNS-Client ETW provider. The easiest way to record the data is to download a recording profile like Multiprofile.wprp, download the file e.g. via curl from the command line

curl https://raw.githubusercontent.com/Alois-xx/FileWriter/master/MultiProfile.wprp > c:\temp\MultiProfile.wprp

Then start from an Admin command prompt tracing with

wpr -start c:\temp\MultiProfile.wprp!Network

Now surf to some web sites or do your slow use case. After you have found some slowdown you can stop recording with

wpr -start c:\temp\SlowBrowsing_%ComputerName%.etl

Good naming is hard, but if you later want to revisit your recorded files you should try to come up with a descriptive name.

After recording the data you can extract a Json file with ETWAnalyzer by extracting everything (-extract all) or just Dns data with -extract DNS

C:>EtwAnalyzer -extract all -fd c:\temp\SlowBrowsing_Skyrmion.etl -symserver ms 
1 - files found to extract.
Success Extraction of c:\temp\Extract\SlowBrowsing_Skyrmion.json
Extracted 1/1 - Failed 0 files.
Extracted: 1 files in 00 00:02:40, Failed Files 0

ETWAnalyzer supports many different queries for e.g. Disk, File, CPU and other things. Since support for dumping Dns Queries was added. To show the aggregated top 5 slowest Dns queries you just need -fd (File or Directory) and the file name of the extracted data and -dump Dns to dump all Dns queries. To select the top 5 add -topn 5 and you get this output per process printed:

C:\temp\Extract>EtwAnalyzer -dump DNS -topn 5 -fd SlowBrowsing_Skyrmion.json   
02.10.2022 17:47:32    SlowBrowsing_Skyrmion 
NonOverlapping       Min        Max  Count TimeOut DNS Query                                                             
       Total s         s          s      #
OVRServer_x64.exe(6624) 124 128
      12,038 s   0,000 s   12,038 s      4       1 graph.oculus.com                                                      
backgroundTaskHost.exe(6692)  -ServerName:BackgroundTaskHost.WebAccountProvider
      12,043 s   0,000 s   12,043 s      2       1 login.microsoftonline.com                                             
      12,047 s   0,021 s   12,047 s      2       1 d21lj84g4rjzla.cloudfront.net                                         
      12,052 s   0,000 s   12,052 s      2       1 loadus.exelator.com                                                   
      12,052 s   0,000 s   12,052 s      2       1 cdn.adrtx.net                                                         
Totals: 60,232 s Dns query time for 12 Dns queries

We find that we have indeed some slow Dns queries which all did time out after ca. 12s. The magenta named processes are issuing all of the following DNS queries. To see more we can add -details to see the timing of every slow DNS Query

C:\temp\Extract>EtwAnalyzer -dump DNS -topn 5 -details -fd SlowBrowsing_Skyrmion.json  
02.10.2022 17:47:32    SlowBrowsing_Skyrmion 
NonOverlapping       Min        Max  Count TimeOut DNS Query                                                             
       Total s         s          s      #
      12,038 s   0,000 s   12,038 s      4       1 graph.oculus.com                                                      
        2022-10-02 17:47:38.416 Duration:  0,000 s OVRServer_x64.exe(6624)                       DnsAnswer: 
        2022-10-02 17:47:38.417 Duration:  0,000 s OVRServer_x64.exe(6624)                       DnsAnswer: 
        2022-10-02 17:47:38.417 Duration: 12,038 s OVRServer_x64.exe(6624)                       TimedOut: Servers:;fe80::1; DnsAnswer: 2a03:2880:f21c:80c1:face:b00c:0:32c2;::ffff:
        2022-10-02 17:47:38.417 Duration: 12,038 s OVRServer_x64.exe(6624)                       TimedOut: Servers:;fe80::1; DnsAnswer: 2a03:2880:f21c:80c1:face:b00c:0:32c2;::ffff:
      12,043 s   0,000 s   12,043 s      2       1 login.microsoftonline.com                                             
        2022-10-02 17:47:42.020 Duration:  0,000 s backgroundTaskHost.exe(6692)                  DnsAnswer: 
        2022-10-02 17:47:42.020 Duration: 12,043 s backgroundTaskHost.exe(6692)                  TimedOut: Servers:;fe80::1; DnsAnswer: ::ffff:;::ffff:;::ffff:;::ffff:;::ffff:;::ffff:;::ffff:;::ffff:
      12,047 s   0,021 s   12,047 s      2       1 d21lj84g4rjzla.cloudfront.net                                         
        2022-10-02 17:47:54.203 Duration:  0,021 s firefox.exe(13612)                            DnsAnswer:;;;
        2022-10-02 17:47:54.224 Duration: 12,047 s firefox.exe(13612)                            TimedOut: Servers:;;fe80::1 DnsAnswer: 2600:9000:225b:fc00:1e:b6b1:7b80:93a1;2600:9000:225b:3600:1e:b6b1:7b80:93a1;2600:9000:225b:2e00:1e:b6b1:7b80:93a1;2600:9000:225b:f000:1e:b6b1:7b80:93a1;2600:9000:225b:7200:1e:b6b1:7b80:93a1;2600:9000:225b:a400:1e:b6b1:7b80:93a1;2600:9000:225b:9200:1e:b6b1:7b80:93a1;2600:9000:225b:dc00:1e:b6b1:7b80:93a1
      12,052 s   0,000 s   12,052 s      2       1 loadus.exelator.com                                                   
        2022-10-02 17:47:42.091 Duration:  0,000 s firefox.exe(13612)                            DnsAnswer: 
        2022-10-02 17:47:42.091 Duration: 12,052 s firefox.exe(13612)                            TimedOut: Servers:;fe80::1; DnsAnswer: ::ffff:
      12,052 s   0,000 s   12,052 s      2       1 cdn.adrtx.net                                                         
        2022-10-02 17:47:42.091 Duration:  0,000 s firefox.exe(13612)                            DnsAnswer: 
        2022-10-02 17:47:42.091 Duration: 12,052 s firefox.exe(13612)                            TimedOut: Servers:;fe80::1; DnsAnswer: ::ffff:;::ffff:
Totals: 60,232 s Dns query time for 12 Dns queries

You might ask yourself. What are NonOverlapping Totals? The ETW provider Microsoft-Windows-DNS-Client instruments dnsapi.dll!DnsQueryEx which supports synchronous and async queries. By default an IPV4 and IPV6 query is started whenever when you try to resolve a host name. If both queries need e.g. 10s, and are started at the same time the observed delay by the user is 10s (Non Overlapping time) and not 20s (that would be the sum of both query times). To estimate the observed user delay for each query ETWAnalyzer calculates the non overlapping query time which should much better reflect the actual user delays. The summary time Totals is also taking overlapping DNS query times into account which should be a better metric than a simple sum.

For more details please refer to https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/DumpDNSCommand.md.

The details view shows that every slow timed out query has a red TimedOut Dns server name ( This is the Dns server IP for which a Dns query did time out. By looking at the interfaces with ipconfig we can indeed verify that the new TPLink Router was tried first, did time out and after that the next network (WIFI) was tried which did finally succeed.

If you are in an IPV6 network the Dns server IP might be a generic default Link Local IP which can be the same for several network interfaces. In that case you can add -ShowAdapter to show which network interfaces were tried. That should give you all information to diagnose common Dns issues. Of course you can also use WireShark to directly get your hands on the Dns queries, but with WireShark you will not know which process was waiting for how long. The ETW instrumented DnsQueryEx Api will give you a clear process correlation and also which queries (IPV4, IPV6) were tried. To get all details you can export the data to a CSV file (add -csv xxx.csv) and post process the data further if you need to. To select just one process (e.g. firefox) you can filter by process with -pn firefox which is is the shorthand notation for -processname firefox. For more information on the query syntax of ETWAnalyzer see https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/Filters.md.

If you have just Context switch traces but no DNS Client traces not all is lost. If you suspect DNS delays you can still query the thread wait times of all dns related methods with ETWAnalyzer which also calculates for wait and ready the non overlapping times summed for all threads.

EtwAnalyzer -dump cpu -pn firefox -fd WebBrowsingRouterOn_SKYRMION.10-06-2022.19-46-16 -methods *dns* -fld s s

If you have async queries in place, but you know that just one async operation was initiated you can add -fld s s to show the time difference the method was seen first and last in the trace which should give you a good estimate how long that async method was running, if upon completion it was called again (and consumed enough CPU so we could get a sample). For more details please refer to https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/DumpCPUCommand.md.

You can also use one specific method or event as zero point and let ETWAnalyzer calculate first/last timings relative to a specific event which can be useful to calculate e.g. timing of methods relative to a key event like OnLoadClicked or something similar. That advanced feature is explained at https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/ZeroTime.md in much greater detail. It is also possible to correlate traces across machines without synchronized clocks by using different zero times for different files within one query and export the data to a CSV file for further processing.

That was a lot of information, but I hope I have convinced you that you can access previously hard to get information now much easier with ETWAnalyzer. It is not only useful for performance investigations, but it might be also interesting to get an overview which servers were tried to query from your machine for forensic audits.

PDD Profiler Driven Development

There are many extreme software development strategies out there. PDD can mean Panic Driven, or Performance Driven Development. The first one should be avoided, the second is a MSR paper dealing with performance modelling techniques. Everyone writes tests for their software because they have proven to be useful, but why are so few people profiling their own code? Is it too difficult?

I have created a C# console project named FileWriter https://github.com/Alois-xx/FileWriter which writes 5000 100KB files (= 500MB) to a folder with different strategies. How do you decide which strategy is the most efficient and/or fastest one? As strategies we have

  • Single Threaded
  • Multi Threaded with TPL Tasks
  • Multi Threaded with Parallel.For
    • 2,3,4,5 Threads

The serial version is straightforward. It creates a n files and writes in each file a string until the file size is reached. The code is located at https://github.com/Alois-xx/FileWriter/blob/master/DataGenerator.cs, but the exact details are not important.

Lets measure how long it takes:

As expected the serial version is slowest and the parallel versions are faster. The fastest one is the one which uses 3 threads. Based on real world measurements we have found that 3 threads is the fastest solution. OK problem solved. Move on and implement the next feature. That was easy.

But, as you can imagine there is always a but: Are the numbers we have measured reasonable? Lets do some math. We write 500 MB with the parallel version in 11s. That gives a write speed of 45 MB/s. My mental performance model is that we should be able to max out a SATA SSD which can write 430 MB/s.

This is off nearly by a factor 10 what the SSD is able to deliver. Ups. Where did we loose performance? We can try out the Visual Studio CPU Profiler:

That shows that creating and writing to the file are the most expensive parts. But why? Drilling deeper will help to some extent, but we do not see where it did wait for disk writes because the CPU Profiler is only showing CPU consumption. We need another tool. Luckily there is the Concurrency Visualizer extension for free on the Visual Studio Marketplace, which can also show Thread blocking times.

After installation you can start your project from the Analyze menu with your current debugging settings.

We try to stay simple and start FileWriter with the single threaded strategy which is easiest to analyze

FileWriter -generate c:\temp serial

In the Utilization tab we get a nice summary of CPU consumption of the machine. In our case other processes were consuming a lot of CPU while we were running. White is free CPU, dark gray is System and light gray are all other processes. Green is our profiled process FileWriter.exe.

The first observation is that once we start writing files other processes become very busy which looks correlated. We come back to that later.

The Threads view shows where the Main Thread is blocked due to IO/Waiting/…

You can click with the mouse on the timeline of any thread to get in the lower view the corresponding stack trace. In this case we are delayed e.g. by 2,309 s by a call to WriteFile. For some reason the .NET 6.0 symbols could not be resolved but it might also be just a problem with my symbol lookup (there should be a full stack trace visible). Personally I find it difficult to work with this view, when other processes are consuming the majority of CPU. But since Concurrency Visualizer uses ETW there is nothing that can stop us to open the Concurrency Visualizer generated ETW files from the folder C:\Users\\Documents\Visual Studio 2022\ConcurrencyVisualizer\ with WPA.

We see that FileWriter (green) starts off good but then MsMpEng.exe (red = Defender Antivirus) is kicking in and probably delaying our File IO significantly. In the middle we see System (lila = Windows OS) doing strange things (Antivirus again?).

Do you remember that our SSD can write 430 MB/s? The Disk Usage view in WPA shows disk writes. Our process FileWriter writes 500 MB of data, but there are only 4 MB written in 53ms (Disk Service Time). Why are we not writing to the SSD? The answer is: The Operating System is collecting a bunch of writes and later flushes data to disk when there is time or the amount of written data exceeds some internal threshold. Based on actual data we need to revise our mental performance model:

We write to the OS write cache instead of the disk. We should be able to write GB/s! The actual disk writes are performed by the OS asynchronously.

Still we are at 45 MB/s. If the disk is not the problem, then we need to look where CPU is spent. Lets turn over to the gaps where the green line of FileWriter writing on one thread to disk vanishes. There seems to be some roadblock caused by the System process (Lila) which is the only active process during that time of meditation:

When we unfold the Checkpoint Volume stacktag which is part of my Overview Profile of ETWController we find that whenever NTFS is optimizing the Free Bitmap of the NTFS volume it is traversing a huge linked list in memory which blocks all File IO operations for the complete disk. Do you see the bluish parts in the graph which WPA shows where NTFS Checkpoint volume samples are found in the trace? Automatic highlighting of the current table selection in the graph is one of the best features of WPA. The root cause is that we delete the old files of the temp folder in FileWriter from a previous run which produces a lot of stale NTFS entries which, when many new files are created, at some point cause the OS to wreak havoc on file create/delete performance. I have seen dysfunctional machines which did have a high file churn (many files per day were created/deleted) where all file create/delete operations did last seconds instead of sub milliseconds. See a related post about that issue here: https://superuser.com/questions/1630206/fragmented-ntfs-hdd-slow-file-deletion-600ms-per-file. I did hit this issue several times when I cleaned up ETW folders which contain for each ETL file a large list of NGENPDB folders which the .ni.pdb files. At some point Explorer is stuck for many minutes to delete the last files. The next day and a defrag later (the NTFS tables still need defragmentation even for SSDs) things are fast again. With that data we need to revisit our mental performance model again:

Creating/Deleting large amount of files might become very slow depending on internal NTFS states which are in memory structures and manifest as high CPU consumption in the Kernel inside the NtfsCheckpointVolume method.

That view is already good, but we can do better if we record ETW data on our own and analyze it at scale.

How about a simple batch script that

  • Starts ETW Profiling
  • Executes your application
  • Retrieves the test execution time as return value from your application
  • Stops ETW Profiling
  • Write an ETW file name with the measured test duration and test case name
  • Extract ETW Data for further analysis

That is what the RunPerformanceTests.cmd script of the FileWriter project does. The tested application is FileWriter.exe which writes files with different strategies to a folder. When it exits it uses as exit code of the process the duration of the test case in ms. For more information refer to ProfileTest.cmd command line help.

FileWriter and Measurement Details

Read on if you are eager for details how the scripts generate ETW measurement data or go to the next headline.

call ProfileTest.cmd CreateSerial       FileWriter.exe -generate "%OutDir%" serial
call ProfileTest.cmd CreateParallel     FileWriter.exe -generate "%OutDir%" parallel
call ProfileTest.cmd CreateParallel2    FileWriter.exe -generate "%OutDir%" parallel -threads 2
call ProfileTest.cmd CreateTaskParallel FileWriter.exe -generate "%OutDir%" taskparallel

The most “complex” of ProfileTest.cmd is where several ETW profiles of the supplied MultiProfile.wprp are started. The actual start command in the ProfileTest.cmd script is

wpr -start MultiProfile!File -start MultiProfile.wprp!CSWitch -start MultiProfile.wprp!PMCLLC 

But with shell scripting with the correct escape characters it becomes this:

"!WPRLocation!" -start "!ScriptLocation!MultiProfile.wprp"^^!File -start "!ScriptLocation!MultiProfile.wprp"^^!CSwitch -start "!ScriptLocation!MultiProfile.wprp"^^!PMCLLC

This line starts WPR with File IO tracing, Context Switch tracing to see thread wait/ready times and some CPU counters for Last Level Cache misses. Disk and CPU sampling is already part of these profiles.

The PMCLLC profile can cause issues if you are using VMs like HyperV. To profile low level CPU features you need to uninstall all Windows OS HyperV Features to get Last Level Cache CPU tracing running.

You can check if you have access to your CPU counters via

C:\>wpr -pmcsources
Id  Name                        
  0 Timer                       
  2 TotalIssues                 
  6 BranchInstructions          
 10 CacheMisses                 
 11 BranchMispredictions        
 19 TotalCycles                 
 25 UnhaltedCoreCycles          
 26 InstructionRetired          
 27 UnhaltedReferenceCycles     
 28 LLCReference                
 29 LLCMisses                   
 30 BranchInstructionRetired    
 31 BranchMispredictsRetired    
 32 LbrInserts                  
 33 InstructionsRetiredFixed    
 34 UnhaltedCoreCyclesFixed     
 35 UnhaltedReferenceCyclesFixed
 36 TimerFixed                  

If you get a list with just one entry named Timer then you will not get CPU counters, because you have installed one or several HyperV features. In theory this issue should not exist because this was fixed with Windows 10 RS 5 in 2019 (https://docs.microsoft.com/en-us/archive/blogs/vancem/perfview-hard-core-cpu-investigations-using-cpu-counters-on-windows-10), but I cannot get these CPU counters when HyperV is active in Windows 10 21H2.

The easiest fix is to remove the “!ScriptLocation!MultiProfile.wprp”^^!PMCLLC from the ProfileTest.cmd script, or you modify the MultiProfile.wprp to set in all HardwareCounter nodes

<HardwareCounter Id="HardwareCounters_EventCounters" Base="" Strict="true">

the value Strict=”false” to prevent WPR from complaining when it could not enable hardware counters. The MultiProfile.wprp profile uses nearly all features of WPR which are possible according to the not really documented xsd Schema to streamline ETW recording to your needs. The result of this “reverse” documenting is an annotated recording profile that can serve as base for your own recording profiles. Which profiles are inside it? You can query it from the command line with WPR:

D:\Source\FileWriter\bin\Release\net6.0>wpr -profiles MultiProfile.wprp

Microsoft Windows Performance Recorder Version 10.0.19041 (CoreSystem)
Copyright (c) 2019 Microsoft Corporation. All rights reserved.

               (CPU Samples/Disk/.NET Exceptions/Focus)
   CSwitch    +(CPU Samples/Disk/.NET Exceptions/Focus/Context Switch)
   MiniFilter +(CPU Samples/Disk/.NET Exceptions/Focus/MiniFilter)
   File       +(CPU Samples/Disk/.NET Exceptions/Focus/File IO)
   Network    +(CPU Samples/Disk/.NET Exceptions/Focus/Network)
   Sockets    +(CPU Samples/Disk/.NET Exceptions/Focus/Sockets)
   VirtualAlloc (Long Term)
   UserGDILeaks (Long Term)
     PMC Sampling for PMC Rollover + Default
     PMC Cycles per Instruction and Branch data - Counting
     PMC Cycles per Instruction and LLC data - Counting
     LBR - Last Branch Record Sampling

Or you can load the profile into WPRUI which is part of the Windows SDK where they will show up under the Custom Measurements node:

All profiles can be mixed (except Long Term) to record common things for unmanaged and .NET Code with the least amount of data. A good choice is the default profile which enables CPU sampling, some .NET events, Disk and Window Focus events. These settings are based on years of troubleshooting, and should be helpful recording settings also in your case. The supplied MS profiles are recording virtually everything which is often a problem that the trace sizes are huge (many GB) or, if memory recording is used, the history is one minute or even less on a busy system which has problems.

Generated Data

RunPerformanceTests.cmd saves the ETW data to the C:\Temp folder

After data generation it calls ExtractETW.cmd which is one call to ETWAnalyzer which loads several files in parallel and creates Json files from the ETW files

ETWAnalyzer -extract all -fd c:\temp\*Create*.etl -symserver MS -nooverwrite

This will generate JSON files in C:\Temp\Extract which can be queried with ETWAnalyzer without the need to load every ETL file into WPA.

As you can see the 500+ MB ETL files are reduced to ca. 10 MB which is a size reduction of a factor 50 while keeping the most interesting aspects. There are multiple json files per input file to enable fast loading. If you query with ETWAnalyzer e.g. just CPU data the other *Derived* files are not even loaded which keeps your queries fast.

Working With ETWAnalyzer and WPA on Measured Data

To get an overview if all executed tests suffer from a known issue you can query the data from the command line. To get the top CPU consumers in the Kernel not by method name but by stacktag you can use this query

EtwAnalyzer -dump CPU -ProcessName System -StackTags * -TopNMethods 2

I am sure you ask yourself: What is a stacktag? A stacktag is a descriptive name for key methods in a stacktrace of an ETW event. WPA and TraceProcessing can load a user configured xml file which assigns to a stacktrace a descriptive (stacktag) name, or Other if no matching stacktag rule could be found. The following XML fragment defines e.g. the above NTFS Checkpoint Volume stacktag:

  <Tag Name="NTFS Checkpoint Volume">
	<Entrypoint Module="Ntfs.sys" Method="NtfsCheckpointVolume*"/>

See default.stacktags of ETWAnalyzer how it can be used to flag common issues like OS problems, AV Scanner Activity, Encryption Overhead, … with descriptive stacktags. If more than one stacktag matches the first matching stacktag, deepest in the stacktrace, wins. You can override this behavior if you set a priority value for a given tag.

Based on the 7 runs we find that the CheckpointVolume issue is part of every test which blocks File writes up to 6+ seconds. Other test runs do no suffer from this which makes comparison of your measured numbers questionable. Again we need to update our mental performance model:

Tests which are slower due to CheckpointVolume overhead should be excluded to make measurements comparable where this internal NTFS state delay is not occurring.

The exact number would be the sum of CPU+Wait, because this operation is single threaded we get exact numbers here. ETWAnalyzer can sum method/Stacktags CPU/Wait times with the -ShowTotal flag. If we add Method, the input lines are kept, if we use Process only process and file totals are shown.

EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -ShowTotal method

Now we know that file IO is delayed up to 7461 ms without the need to export the data to Excel. ETWAnalyzer allows you to export all data you see at the console to a CSV file. When you add -csv xxx.csv and you can aggregate/filter the data further as you need it. This is supported by all -Dump commands!

We can also add methods to the stacktags by adding -methods *checkpoint* to see where the stacktag has got its data from

EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -methods *checkpoint*
The method NtfsCheckpointVolume has 873 ms, but the stacktag has only 857 ms. This is because other stacktags can “steal” CPU sampling/wait data which is attributed to it. This is curse and feature rolled into one. It is great because if you add up all stacktags then you get the total CPU/Wait time for a given process across all threads. It is bad, because expensive things might be looking less expensive based on one stacktag because other stacktags have “stolen” some CPU/Wait time.

That is all nice, but what is our FileWriter process doing? Lets concentrate on the single threaded use case by filtering the files with -FileDir or -fd which is the shorthand notation for it to *serial*. You need to use * in the beginning, because the filter clause uses the full path name. This allows you to query with -recursive a complete directory tree where -fd *Serial*;!*WithAV* -recursive defines a file query starting from the current directory for all Serial files but it excludes folders or files which contain the name WithAV. To test which files are matching you can use -dump Testrun -fd xxxx -printfiles.

EtwAnalyzer -dump CPU -ProcessName FileWriter -stacktags * -Filedir *serial*

The test takes 12s but it spends 2,2s yet alone with Defender Activity. Other stacktags are not giving away a clear problem we could drill deeper. One interesting thing is that we spend 750ms in thread sleeps. A quick check in WPA shows

that these are coming from the .NET Runtime to recompile busy methods later on the fly. Stacktags are great to flag past issues and categorize activity, but they also tend to decompose expensive operations into several smaller ones. E.g. if you write to a file you might see WriteFile and Defender tags splitting the overall costs of a call to WriteFile. But what stands out that on my 8 Core machine a single threaded FileWriter fully utilizes my machine with Defender activity. We need to update our mental performance model again:

Writing to many small files on a single thread will be slowed down by the Virus Scanner a lot. It also generates a high load on the system which slows down everything else.

Since we did write the application we can go down to specific methods to see where the time is spent. To get an overview we use the top level view of WPA of our CreateFile method

In WPA we find that

  • CreateFiles
    • CPU (CPU Usage) 10653 ms
    • Wait Sum 672 ms
    • Ready Time Sum 1354 ms
      • This is the time a thread did wait for a CPU to become free when all CPUs were used

which adds up to 12,679 s which pretty closely (16ms off) matches the measured 12,695 s. It is always good to verify what you did measure. As you have seen before we have already quite often needed to change our mental performance model. The top CPU consumers in the CreateFiles method are

  • System.IO.StreamWriter::WriteLine
  • System.IO.Strategies.OSFileStreamStrategy::.ctor
  • System.IO.StreamWriter::Dispose
  • System.Runtime.CompilerServices.DefaultInterpolatedStringHandler::ToStringAndClear
  • System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[System.Char]::Rent
  • System.Runtime.CompilerServices.DefaultInterpolatedStringHandler::AppendFormatted
  • System.IO.Strategies.OSFileStreamStrategy::.ctor
  • System.IO.StreamWriter::WriteLine

What is surprising is that ca. 2s of our 12s are spent in creating interpolated strings. The innocent line

 string line = $"echo This is line {lineCount++}";

costs ca. 16% of the overall performance. Not huge, but still significant. Since we are trying to find out which file writing strategy is fastest this is OK because the overhead will be the same for all test cases.

We can get the same view as in WPA for our key methods with the following ETWAnalyzer query

EtwAnalyzer -dump CPU -ProcessName FileWriter -fd *serial* -methods *StreamWriter.WriteLine*;*OSFileStreamStrategy*ctor*;*StreamWriter.WriteLine*;*DefaultInterpolatedStringHandler*AppendFormatted*;*TlsOverPerCoreLockedStacksArrayPool*Rent;*DefaultInterpolatedStringHandler.ToStringAndClear;*StreamWriter.Dispose;*filewriter* -fld

You might have noticed that the method names in the list have :: between the class and method. ETWAnalyzer uses always . which is less typing and is in line with old .NET Framework code which used :: for JITed code and . for precompiled code.

Additionally I have smuggled -fld to the query. It is the shorthand for -FirstLastDuration or -fld which shows the difference between the Last-First time the method did show up in CPU Sampling or Context switch data. Because we know that our test did only measure DataGenerator.CreateFile calls we see in the Last-First column 12,695 s which is down to the ms our measured duration! This option can be of great value if you want to measure parallel asynchronous activity wall clock time. At least for .NET code the generated state machine classes of the C# compiler contain central methods which are invoked during init and finalization of asynchronous activities which makes it easy to “measure” the total duration even if you have no additional trace points at hand. You can add to -fld s s or -fld local local or -fld utc utc to also show the first and last time in

  • Trace Time (seconds)
  • Local Time (customer time in the time zone the machine was running)
  • UTC Time

See command line help for further options. ETWAnalyzer has an advanced notation of time which can be formatted in the way you need it.

The WPA times for Wait/Ready are the same in ETWAnalyzer, but not CPU. The reason is that to exactly judge the method CPU consumption you need to look in WPA into CPU Usage (Sampled) which shows per method CPU data, which is more accurate. ETWAnalyzer merges this automatically for you.

CPU usage (Precise) Trap

The view CPU usage (Precise) in WPA shows Context switch data which is whenever your application called a blocking OS call like Read/WriteFile/WaitForSingleObject, … your thread is switched off the CPU which is called context switch. You will see in this graph therefore only (simplification!) methods which were calling a blocking OS method. If you have e.g. a busy for loop in method int Calculate() and the next called method is Console.WriteLine, like this,

int result = Calculate();

to print the result then you will see in CPU usage (Precise) all CPU attributed to Console.WriteLine because that method was causing the next blocking OS call. All other non blocking methods called before Console.WriteLine are invisible in this view. To get per method data you need to use the CPU data of CPU sampling which gets data in a different way to much better measure CPU consumption at method level. Is Context Switch data wrong? No. The per thread CPU consumption is exact because this is a trace of the OS Scheduler when he moves threads on/off a CPU. But you need to be careful to not interpret the method level CPU data as true CPU consumption of that method.

CPU Usage (Sampled) Trap

Lets compare both CPU views (Sampled) vs (Precise)

StreamWriter.Dispose method consumes 887 ms in Sampling while in Context Switch view we get 1253 ms. In this case the difference is 41%! With regards to CPU consumption you should use the Sampled view for methods to get the most meaningful value.

Every WPA table can be configured to add a Count column. Many people fell into the trap to interpret the CPU Sampling Count column as number of method calls, while it actually just counts the number of sampling events. CPU sampling works by stopping all running threads 1000 (default) times per second, take a full stacktrace and write that data to ETW buffers. If method A() did show up 1000 times then it gets 1000*1ms CPU attributed. CPU sampling has statistical bias and can ignore e.g. method C() if it is executing just before, or after, the CPU sampling event did take a stack trace.

ETWAnalyzer Wait/Ready Times

The method list shown by ETWAnalyzer -Dump CPU -methods xxx is a summation across all threads in a process. This serves two purposes

  • Readability
    • If each thread would be printed the result would become unreadable
  • File Size
    • The resulting extracted Json file would become huge

This works well for single threaded workloads, but what about 100 threads waiting for something? WPA sums all threads together if you remove the thread grouping. ETWAnalyzer has chosen a different route. It only sums all non overlapping Wait/Ready times. This means if more than one thread is waiting, the wait is counted only once.

The Wait/Ready times for methods for multi threaded methods are the sum of all non overlapping Wait/Ready times for all threads. If e.g. a method was blocked on many threads you will see as maximum wait time your ETW recording time and not some arbitrary high number which no one can reason about. It is still complicated to judge parallel applications because if you have heavy thread over subscription you will see both, Wait and Ready, times to reach your ETW recording time. This is a clear indication that you have too many threads and you should use less of them.

Make Measurements Better

We have found at least two fatal flaws in our test

  • Anti Virus Scanner is consuming all CPU even in single threaded test
    • Add an exclude rule to the folder via PowerShell
      • Add-MpPreference -ExclusionPath “C:\temp\Test”
    • To remove use
      • Remove-MpPreference -ExclusionPath “C:\temp\Test”
  • Some test runs suffer from large delays due to CheckpointVolume calls which introduces seconds of latency which is not caused by our application
    • Change Script to delete temp files before test is started to give OS a chance to get into a well defined file system state

When we do this and we repeat the tests a few times we get for the single threaded case:

Nearly a factor 3 faster for changing not a single line of code is pretty good. The very high numbers are coming from overloading the system when something else is running in the background. The 12s numbers are for enabled Defender while < 10s are with a Defender exclusion rule but various NtfsCheckpoint delays. When we align everything correctly then we end up with 7s for the single threaded case which is 71 MB/s of write rate. The best value is now 3,1s for 4 threads with 161 MB/s.

What else can we find? When looking at variations we find also another influencing factor during file creation:

We have a single threaded test run with 9556 ms and a faster one with 7066 ms. Besides the CheckpointVolume issue we loose another 1,3 s due to reading MFT and paged out data. The faster test case (Trace #2) reads just 0,3 ms from the SSD, while we read in the slower Run (Trace #1) 1,3s from disk. The amount of data is small (8 MB), but for some reason the SSD was handing out the data slowly. Yes SSDs are fast, but sometimes, when they are busy with internal reorg things, they can become slow. SSD performance is a topic for another blog post.

We need to update our mental performance model

File System Cache state differences can lead to wait times due to Hard faults. Check method MiIssueHardFault if the tests have comparable numbers.

Etwanalyzer -dump CPU -pn Filewriter -methods MiIssueHardFault
That are another 2 s of difference which we can attribute to File System cache state and paged out data.

Measuring CPU Efficiency

We know that 4 threads are fastest after removing random system noise from our measured data. But is is it also the most efficient one? We can dump the CPUs Performance Monitoring Counters (PMC) to console with

ETWAnalyzer -dump PMC -pn FileWriter

The most fundamental metric is CPI which is Cycles/Instructions. If you execute the the same use case again you should execute the same numbers of instructions. When you go for multithreading you need to take locks and coordinate threads which costs you extra instructions. The most energy efficient solution is therefore the single threaded solution, although it is slower.

To get an overview you can export data into a CSV file from the current directory and the initial measurements we did start with. The following query will dump the Performance Monitoring Counters (PMC) to a CSV file. You can use multiple -fd queries to select the data you need:

ETWAnalyzer -dump PMC -pn FileWriter -fd ..\Take_Test1 -fd .  -csv PMC_Summary.csv

The CSV file contains the columns Test Time in ms and CPI to make it easy to correlate the measured numbers with something else e.g. CPI in this case. This pattern is followed for all -Dump commands of ETWAnalyzer. From this data we can plot Test Timing vs CPI this is Cycles per Instruction into a Pivot Table:

The initial test where we had

  • Defender
  • CheckpointVolume

issues was not only much slower, but CPU efficiency wise much worse: 1,2 vs 0,69 (smaller is better). This proves that even if Defender would be highly efficient on all other cores and not block our code at all we still would do much worse (73%), because even when we do exactly the same, CPU caches and other internal CPU resources, are slowing us down. With this data you can prove why you are slower even when everything else is the same.


If you have read that far I am impressed. That is a dense article with a lot of information. I hope I have convinced you that profiling your system with ETW and not only your code for strategic test cases is an important asset in automated regression testing.

Credits: NASA, ESA, CSA, and STScI (https://www.nasa.gov/image-feature/goddard/2022/nasa-s-webb-delivers-deepest-infrared-image-of-universe-yet)

If you look only at your code you will miss the universe of things that are moving around in the operating system and other processes. You will end up with fluctuating tests which no one can explain. We are doing the same thing 10 times and we still get outliers! The key is not only to understand your own code, but also the system in which it is executing. Both need attention and coordination to achieve the best performance.

To enable that I have written ETWAnalyzer to help you to speed up your own ETW analysis. If you go for ETW you are talking about a lot of data which needs a fair amount of automation to become manageable. Loading hundreds of tests into WPA is not feasible, but with ETWAnalyzer mass data extraction and analysis never was easier. The list of supported queries is growing. If you miss something file an issue so we can take a look what the community is needing.

To come back to the original question. Is profiling too difficult? If you are using only UI based tools then the overhead to do this frequently it is definitely too much. But with the scripting approach shown by FileWriter to record data and extract it with ETWAnalyzer, to make the data queryable, is giving you a lot of opportunities. ETWAnalyzer comes with an object model for the extracted Json files. You can query for known issues directly and flag past issues within seconds without the need to load the profiling data into a graphical viewer to get an overview.

GDI/User Object Leak Tracking – The Easy Way

One of the most complex issues are User/GDI object leaks because they are nearly invisible in memory dumps. Microsoft does not want us to peek into such “legacy” technology so no Windbg plugins exist anymore. Today everything is UWP, WPF (ok that’s legacy as well) and mostly some Chrome Browser based applications. With .NET 6 WPF and WinForms support has got some love again so we can expect new programmers will hit old fashioned User/GDI leaks again. These leaks are worse than “normal” Handle leaks because Windows imposes a limit of 10K GDI and User Handles per process. After that you cannot create any more User/GDI handles and you get either exceptions or partially filled windows where content or icons are missing.

Here is an application which leaks Window and GDI Handles on every button click:

    public partial class Leaker : Form
        public Leaker()

        List<Leaker> myChilds = new List<Leaker>();

        private void button1_Click(object sender, EventArgs e)
            var leaker = new Leaker();
            var tmp = leaker.Handle;
            cCountLabel.Text = $"{myChilds.Count}";

In Task Manager we find that with every click we loose 6 Window handles:

These come from the

  • Leaked Form
  • Create Button
  • Created Label
  • 85 (Number) Label
  • IME Hidden Window
  • MSCTFIME Hidden Window

A very useful tool is Spy++ which is installed when you add C++ core features

via the Visual Studio Installer. Then you will find it in C:\Program Files\Microsoft Visual Studio\<Version>\<Edition>\Common7\Tools\spyxx_amd64.exe or the 32 bit version without the _amd64 postfix.

If you intend to copy Spy++ to a different machine you need these dlls


to make it work. The resource dll in the 1033 folder is essential or it wont start.

How would you identify the call stack that did lead to the allocation of additional objects? The answer is a small and great ETW tool named Performance HUD from Microsoft.

After starting

you can attach to an already running process

After a few seconds an overlay window pops up

If you double click on it the “real” UI becomes visible
Switch over to the Handles Tab
Now press in our Leaker app a few times the create button to leak some GDI/User objects
Press Ctrl+Click on the blue underlined Outstanding text for User Window objects to find the leaking call stack
Voila you have found our button1_Click handler as the bad guy as expected. If you right click on it you can e.g. send a mail with the leaking call stack as attached .hudinsight file to the developer of the faulty application. Problem solved.

See my earlier post what else you can do with Performance HUD. It is one of the tools which save a lot of time once you know that they exist.

Another great feature of Performance HUD is to analyze UI Thread hangs. It employs some sophisticated analysis to detect that the UI thread is no longer processing window messages because some CPU hungry method is executing a long method call, or that you are stuck in a blocking OS call. Today everyone is using async/await in C#. A stuck UI thread should be a thing of the past. But now you can measure and prove that it is not hanging.

Lets add to our sample application a UI hang via TPL tasks which happens only in Debug Builds

        private void cHang_Click(object sender, EventArgs e)
            Task<int> t = Task.Run(async () => { await Task.Delay(5000); return 1; });

My favorite button the Blame Someone which will unfold the stack with probably the longest stuck call:

This precisely points towards our stuck button handler which calls into Task.Result which seems not to be resolved properly but the later method calls are. The 5s hang is nicely colored and you can also pack this into an Email and send the issue to the maintainer of the corresponding code. I have seen plenty of similar issues in the past because Thread.Sleep is caught by many FXCop rules so people tend to use await with Task.Delay which is a Thread.Sleep in disguise.

This works not only with .NET Code but with any language as long as it supports pdbs and walkable stacks. It is great that this diagnostic has been added to Windows but why did it take so long? As far as I know Performance HUD is from the Office team which is a “legacy” but important application. Someone dealing with too many GDI/User object leaks was able to commit some lines to the Kernel to instrument the relevant locations with proper ETW events and we finally arrive with a feature complete leak tracer. With ETW we can track leaks

  • OS Handles (File, Socket, Pipe, …)
    • wpr -start Handle
    • Analyze with WPA. This records handle data from all processes (system wide)
  • GDI Handles (Cursor, Bitmap, Font, …)
  • User Handles (Window, Menu, Timer, …)
  • VirtualAlloc Memory (Records system wide all processes)
    • wpr -start VirtualAllocation
    • Analyze with WPA

Finally we have covered all frequent leak types with ETW events. No leak can hide anymore.

For memory mapped files there are some ETW events but unfortunately no analysis in the public version of WPA exist. These leaks are still more tricky to figure out than the rest which are now at least in principle easy. Recording the right data and not too much (ETW files > 6 GB will usually crash WPA) is still challenging but doable. Performance HUD is very well written. I have successfully processed a 80 GB ETL trace with no hiccups. It can not only attach to already running processes, but it can also analyze previously recorded data.

Now get back to work and fix your leaks and unresponsive UIs!

How Fast is your Virus Scanner?

I am sure everyone has experienced slowdowns due to Antivirus solutions, but very few are able to attribute it to the right component on your Windows box. Most people do not even know what AV solution/s are running on their machine. The end result can be a secure but slow system which is barely usable.

Is your CPU fan always on and your notebook getting hot although you are not running anything? You want to find out what is consuming so much CPU with a command line tool?

Then I have a new tool for you: ETWAnalyzer. It is using the Trace Processing Library which powers WPA (Windows Performance Analyzer). WPA is a great UI to analyze tricky systemic issues, but if you need to check a trace the n-th time if the same issue is there again you want to automate that analysis.

ETWAnalyzer is developed by Siemens Healthineers as its first public open source project. Full Disclosure: I work for Siemens Healthineers. One motivation to make it public was to show how incredibly useful ETW (Event Tracing for Windows) is to track down performance issues in production and development environments. It is one of the best technologies for performance troubleshooting Microsoft has invented, but it is valued by far too few people. ETW is part of Windows and you need no additional tools or code instrumentation to record ETW data! Starting with Windows 10 the Windows Performance Recorder (wpr.exe) is part of the OS. You have a slow system? In that case I would open an Administrator cmd shell and type

C>wpr -start CPU -start DiskIO -start FileIO

Now execute your slow use case and then stop profiling with

C>wpr -stop C:\temp\SlowSystem.etl

This will generate a large file (hundreds MB – several GB) which covers the last ca. 60-200s (depends on system load and installed memory) what your system was doing. If you have a lot of .NET applications installed stopping the first first time can take a long time (10-60 min) because it will create pdbs for all running .NET applications. Later profiling stop runs are much faster tough.

To record data with less overhead check out MultiProfile which is part of FileWriter. You need to download the profile to the machine to a directory e.g. to c:\temp and to record the equivalent to the before command with

C>wpr -start C:\temp\MultiProfile.wprp!CSwitch -start C:\temp\MultiProfile.wprp!File

Getting the data is easy. Analyzing is hard when you do not know where you need to look at.

This is where ETWAnalyzer can help out to ask the recorded data questions about previously analyzed problems. You ask: Previously? Yes. Normally you start with WPA to identify a pattern like high CPU in this or that method. Then you (should) create a stacktag to assign a high level description what a problematic method is about. The stacktags are invaluable to categorize CPU/Wait issues and correlate them with past problems. The supplied stacktag file of ETWAnalyzer contains descriptions for Windows/Network/.NET Framework/Antivirus products even some Chrome stacktags. To get started you can download the latest ETWAnalyzer release from https://github.com/Siemens-Healthineers/ETWAnalyzer/releases und unzip it to your favorite tool directory and extend the PATH environment to the directory. If you want maximum speed you should use the self contained .NET 6 version.

Now you are ready to extract data from an ETL file with the following command:

ETWAnalyzer -extract all -filedir c:\temp\SlowSystem.etl -symserver MS

After extraction all data is contained in several small Json files.

Lets query the data by looking at the top 10 CPU consumers

ETWAnalyzer -dump CPU -filedir c:\temp\Extract\SlowSystem.json -topN 10

The second most CPU hungry process is MsMpEng.exe which is the Windows Defender scan engine. Interesting but not necessarily a bad sign.

Lets check if any known AV Vendor is inducing waits > 10ms. We use a stacktag query which matches all stacktags with Virus in their name.

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json -stacktags *virus* -MinMaxWaitMs 10

We find a 1349ms slowdown caused by Defender in a cmd shell. I did observe slow process creation after I hit enter to start the fresh downloaded executable from the internet.

Explorer seems also to be slowed down. The numbers shown are the sum of all threads for CPU and wait times. If multiple threads were executing the the method then you can see high numbers for the stacktags/methods. For single threaded use cases such as starting a new process the numbers correlate pretty good with wall clock time.

The currently stacktagged AV solutions are

  • Applocker
  • CrowdStrike
  • CyberArk
  • Defender
  • ESET
  • Mc Afee
  • Palo Alto
  • Sentinel One
  • Symantec
  • Trend Micro

Lets dump all slow methods with a wait time between 1300-2000ms and the timing of all process and file creation calls in explorer.exe and cmd.exe

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json  -stacktags *virus* -MinMaxWaitMs 1300-2000 -methods *createprocess*;*createfile* -processName explorer;cmd

Antivirus solutions usually hook into CreateProcess and CreateFile calls which did obviously slow down process creation and file operations in explorer and process creation in cmd.exe.

Was this the observed slowness? Lets dump the time when the file and process creation calls were visible the first time in the trace

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json  -stacktags *virus* -MinMaxWaitMs 1300-2000 -methods *createprocess*;*createfile* -processName explorer;cmd -FirstLastDuration local

That correlates pretty well with the observed slowness. To blame the AV vendor we can dump all methods of device drivers which have no symbols. Microsoft and no other AV vendor delivers symbols for their drivers. By this definition all unresolved symbols for system drivers are likely AV or hardware drivers. ETWAnalyzer makes this easy with the -methods *.sys -ShowModuleInfo query to show all called kernel drivers from our applications for which we do not have symbols :

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json -methods *.sys -processName explorer;cmd -ShowModuleInfo

The yellow output is from an internal database of ETWAnalyzer which has categorized pretty much all AV drivers. The high wait time comes from WDFilter.sys which is an Antivirus File System driver which is part of Windows Defender.

What have we got?

  • We can record system wide ETW data with WPR or any other ETW Recoding tool e.g. https://github.com/Alois-xx/etwcontroller/ which can capture screenshots to correlate the UI with profiling data
  • With ETWAnalyzer we can extract aggregated data for
    • CPU
    • Disk
    • File
    • .NET Exceptions
  • We can identify CPU and wait bottlenecks going from a system wide view down to method level
    • Specific features to estimate AV overhead are present

This is just the beginning. You can check out the documentation for further inspiration how you can make use of system wide profiling data (https://github.com/Siemens-Healthineers/ETWAnalyzer) which can be queried from the command line.

ETWAnalyzer will not supersede WPA to analyze data. ETWAnalyzers main strength is in analyzing known patterns quickly in large amounts of extracted ETW data. Querying 10 GB of ETW data becomes a few hundred MB of extracted JSON data which has no slow symbol server lookup times anymore.

Happy bug hunting with your new tool! If you find bugs or have ideas how to improve the tool it would be great if you file an issue.

The Case Of A Stuck LoadLibrary Call

I frequently encounter impossible bugs which after full analysis have a reasonable explanation although I have no source code to the problematic library. Todays fault is featured by



the Intel Math Kernel Library.

Lets use some fancy neural network library (AlgoLibrary!UsingCaffee) like Berkeley AI caffee.dll which in turn uses the Intel Math Kernel Library (mkl_core.dll) which hangs forever in a call to LoadLibrary. The first thing to do when you are stuck is taking memory dump (e.g. use Task Manager or procdump)

and loading it into Windbg or Visual Studio. Since you are reading this I assume you want to know more about the mysterious commands Windbg has to offer. You can load a memory dump into Windbg from the File – Menu

To get an overview where all threads of a process are standing or executing you can dump with the ~* command the following command on all threads. The dump thread stack command in Windbg is k. To dump all threads you need to execute therefore


107  Id: 67b0.8094 Suspend: 0 Teb: 00000049`9df88000 Unfrozen
Call Site

When you load a memory dump into Windbg you can issues with the k command a full stack walk which will output hopefully a nice stack trace after you have resolved your symbols from Microsoft, Intel and the caffee.dll. We see that LoadLibrary is stuck while caffee tries to allocate memory which internally loads a library. We would expect that our call from AlgoLibrary!UsingCaffee should be finished within a few ms. But this one hangs forever.

The allocation code comes from this:

 *ptr = mkl_malloc(size ? size:1, 64);

When LoadLibrary hangs it is due to some other thread holding the OS Loader lock. This loader lock is effectively serializing dll loading and global/thread static variables initialization and release. See https://docs.microsoft.com/en-us/windows/win32/dlls/dllmain for a more elaborate description.

With Windbg and the !peb command we can check further

dt ntdll!_PEB 499df0d000
+0x0e8 NumberOfHeaps    : 0x15
+0x0ec MaximumNumberOfHeaps : 0x20
+0x0f8 GdiSharedHandleTable : 0x00000183`71d00000 Void
+0x100 ProcessStarterHelper : (null) 
+0x108 GdiDCAttributeList : 0x14
+0x10c Padding3         : [4]  ""
+0x110 LoaderLock       : 0x00007ffe`f447a568 _RTL_CRITICAL_SECTION
+0x118 OSMajorVersion   : 0xa
+0x11c OSMinorVersion   : 0
+0x120 OSBuildNumber    : 0x3839
+0x122 OSCSDVersion     : 0
+0x124 OSPlatformId     : 2

We can dump the Loader Lock structure by clicking on the blue underlined link in Windbg. But this formatting has not survived in WordPress.

0:107> dx -r1 ((ntdll!_RTL_CRITICAL_SECTION *)0x7ffef447a568)
((ntdll!_RTL_CRITICAL_SECTION *)0x7ffef447a568)                 : 0x7ffef447a568 [Type: _RTL_CRITICAL_SECTION *]
[+0x000] DebugInfo        : 0x7ffef447a998 [Type: _RTL_CRITICAL_SECTION_DEBUG *]
[+0x008] LockCount        : -2 [Type: long]
[+0x00c] RecursionCount   : 1 [Type: long]
[+0x010] OwningThread     : 0x6710 [Type: void *]
[+0x018] LockSemaphore    : 0x0 [Type: void *]
[+0x020] SpinCount        : 0x4000000 [Type: unsigned __int64]

Ahh now we know that thread 0x6710 is owning the lock. Switch to that thread with the command ~~[0x6710]s which allows you specify OS thread id directly. The usual ~dds command switches to debugger enumerated threads which start always at 0. That is convenient but sometimes you need to switch by thread id.

0:114> k


That stack trace is strange because it calls SwitchToThread which basically tells the OS if any other runnable thread for this core is ready to run it should run. If not this is essentially a CPU burning endless loop. Such constructs are normally used in Spinlocks where ones takes at first not lock at all to prevent the costs of a context switch at the expense of burning CPU cycles. In this case it looks like mkl_core is cleaning up some thread static variables which requires synchronization with some lock object which is taken by some other thread. Since I do not have the code for this I can only speculate which thread it is.

Lets get the CPU time this thread to see how long it tries to enter the lock:

0:114> !runaway
 User Mode Time
  Thread       Time
  114:6710     0 days 2:14:28.671
   17:62a4     0 days 0:00:47.656
   92:88bc     0 days 0:00:08.218
   22:107c     0 days 0:00:02.953
   19:3a9c     0 days 0:00:02.765

Nearly all CPU of the process (2h 15 minutes) is spent by this thread (2h 14 minutes) which wastes CPU cycles for hours.

Since we have only one other mkl thread running which is our stuck thread in the LoadLibrary call it all makes sense now. The LoadLibrary call hangs because thread 0x6710 is exiting and cleaning up some thread static variables which takes a lock which our stuck LoadLibrary thread already possesses. But since our shutting down thread owns the loader lock we have two dependant locks owned by two different threads -> Deadlock.

You need to be careful on which threads you execute MKL code which are exiting at some point in time while new threads are initializing new mkl variables. At some points in time you can see funny deadlocks which look really strange. One thread burns CPU like hell which is coming from an exiting thread and your own code is stuck in a LoadLibrary call which waits for the Loader Lock to become free.

If you dig deep enough you can find the root cause even when you do not have the full source code. As added bonus I have learned some arcane Windbg commands along the way. These will be helpful the next time you need to dig into some stuck threads.

Concurrent Dictionary Modification Pitfalls

Dictionary<TKey,TValue> is one of the most heavily used collection classes after List<T>. One of the lesser known things is how it behaves when you accidentally modify it concurrently.

What is the output of this simple program under .NET 3.1, .NET 5.0 and .NET 4.8?

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace ConcurrentDictionary
    class Program
        static Dictionary<string, int> _dict = new();

        static void Modify()
            for (int i = 0; i < 10; i++)
                string key = $"Key {i}";
                _dict[key] = 1;

        static void Main(string[] args)
            for(int k=0; k < 1_000_000; k++)
                if (k % 100 == 0)
                    Console.WriteLine($"Modify Run {k}");

                Parallel.Invoke(Modify, Modify, Modify);

Under .NET Core 3.1, .NET 5.0 it will produce

Modify Run 0
Modify Run 0
Modify Run 100
Unhandled exception. 
 ---> System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.
   at System.Collections.Generic.Dictionary`2.FindValue(TKey key)
   at ConcurrentDictionary.Program.<>c__DisplayClass0_0.<Main>g__Modify|0() in D:\Source\vc19\ConcurrentDictionary\Program.cs:line 26
   at System.Threading.Tasks.Task.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__277_0(Object obj)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, 

This is similar what you get when you modify a list

Unhandled exception. 
 ---> System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at System.Collections.Generic.List`1.Enumerator.MoveNext()
   at ConcurrentDictionary.Program.ModifyList() in D:\Source\vc19\ConcurrentDictionary\Program.cs:line 31

When you modify a Dictionary or you enumerate a list while it is modified you get an exception back. That is good and expected.

Lets try the Dictionary sample with .NET 4.8 (or an earlier version)

Modify Run 0
Modify Run 100

This time it will stop working and just hang around. Whats the problem with .NET 4.x? When in doubt take a memory dump with procdump

C:\temp>procdump -ma 2216

ProcDump v10.0 - Sysinternals process dump utility
Copyright (C) 2009-2020 Mark Russinovich and Andrew Richards
Sysinternals - www.sysinternals.com

[22:32:25] Dump 1 initiated: C:\temp\ConcurrentDictionary.exe_210810_223225.dmp
[22:32:25] Dump 1 writing: Estimated dump file size is 68 MB.
[22:32:25] Dump 1 complete: 68 MB written in 0.3 seconds
[22:32:25] Dump count reached.

Then open the Dump with Windbg

c:\Debuggers\windbg.exe -z C:\temp\ConcurrentDictionary.exe_210810_223225.dmp

Load sos dll and then dump all managed thread stacks

Microsoft (R) Windows Debugger Version 10.0.17763.132 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\temp\ConcurrentDictionary.exe_210810_223225.dmp]
User Mini Dump File with Full Memory: Only application data is available

Comment: '
*** procdump  -ma 2216 
*** Manual dump'
Symbol search path is: srv*
Executable search path is: 
Windows 10 Version 19043 MP (8 procs) Free x64
Product: WinNt, suite: SingleUserTS
Machine Name:
Debug session time: Tue Aug 10 22:32:25.000 2021 (UTC + 2:00)
System Uptime: 0 days 0:22:07.133
Process Uptime: 0 days 0:00:45.000

0:000> .loadby sos clr

0:000> ~*e!ClrStack

OS Thread Id: 0x2370 (0)
Call Site
System.Collections.Generic.Dictionary`2[[System.__Canon, mscorlib],[System.Int32, mscorlib]].Insert(System.__Canon, Int32, Boolean)
ConcurrentDictionary.Program.Modify() [D:\Source\vc19\ConcurrentDictionary\Program.cs @ 18]
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
System.Threading.Tasks.ThreadPoolTaskScheduler.TryExecuteTaskInline(System.Threading.Tasks.Task, Boolean)
System.Threading.Tasks.TaskScheduler.TryRunInline(System.Threading.Tasks.Task, Boolean)
System.Threading.Tasks.Task.InternalRunSynchronously(System.Threading.Tasks.TaskScheduler, Boolean)
System.Threading.Tasks.Parallel.Invoke(System.Threading.Tasks.ParallelOptions, System.Action[])
ConcurrentDictionary.Program.Main(System.String[]) [D:\Source\vc19\ConcurrentDictionary\Program.cs @ 44]

OS Thread Id: 0x1e68 (6)
Call Site
System.Collections.Generic.Dictionary`2[[System.__Canon, mscorlib],[System.Int32, mscorlib]].FindEntry(System.__Canon)
System.Collections.Generic.Dictionary`2[[System.__Canon, mscorlib],[System.Int32, mscorlib]].ContainsKey(System.__Canon)
ConcurrentDictionary.Program.Modify() [D:\Source\vc19\ConcurrentDictionary\Program.cs @ 17]
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)

0:000> !runaway
 User Mode Time
  Thread       Time
    0:2370     0 days 0:00:44.843
    6:1e68     0 days 0:00:44.796
    2:928      0 days 0:00:00.015
    5:233c     0 days 0:00:00.000
    4:3560     0 days 0:00:00.000
    3:92c      0 days 0:00:00.000
    1:1fb8     0 days 0:00:00.000
0:000> .time
Debug session time: Tue Aug 10 22:32:25.000 2021 (UTC + 2:00)
System Uptime: 0 days 0:22:07.133
Process Uptime: 0 days 0:00:45.000
  Kernel time: 0 days 0:00:00.000
  User time: 0 days 0:01:29.000

We find that we consume a lot of CPU on thread 0 and 6 (44s until the dump was taken). These methods call into Dictionary.FindEntry and Dictionary.Insert which is strange. Why should it take now much more CPU to insert things into a Dictionary?

In the meantime Visual Studio has become pretty good at loading memory dumps as well. The by far best view is still the Parallel Stacks Window

But for now we turn back to Windbg because it offers information which threads consume how much CPU with !runaway on a per thread basis and .time for the complete process. This helps a lot to look at the right threads which are consuming most CPU.

To drill deeper into the innards of the Dictionary we switch to the best Windbg Extension for managed code which is netext. Unzip the latest release for your Windbg into the Windbg\winext folder or you need to fully qualify the path to the x86/x64 version of netext.

0:000> .load netext
netext version Jun  2 2021
License and usage can be seen here: !whelp license
Check Latest version: !wupdate
For help, type !whelp (or in WinDBG run: '.browse !whelp')
Questions and Feedback: https://github.com/rodneyviana/netext/issues 
Copyright (c) 2014-2015 Rodney Viana (http://blogs.msdn.com/b/rodneyviana) 
Type: !windex -tree or ~*e!wstack to get started

0:006> ~6s
0:006> !wstack

Listing objects from: 0000007a30afe000 to 0000007a30b00000 from thread: 6 [1e68]

Address          Method Table    Heap Gen      Size Type
@r8=000001eb5a279ef8 00007ffa81d38538   0  0         92 System.Int32[]
@rcx=000001eb5a27ffe8 00007ffa81d4d880   0  0        432 System.Collections.Generic.Dictionary_Entry[]
@rdi=000001eb5a2d69e0 00007ffa81d359c0   0  0         36 System.String Key 1
@rdx=000001eb5a27ffe8 00007ffa81d4d880   0  0        432 System.Collections.Generic.Dictionary_Entry[]
@rsi=000001eb5a272d88 00007ffa81d46db0   0  0         80 System.Collections.Generic.Dictionary
000001eb5a2d69a8 00007ffa81d385a0   0  0         24 System.Int32 0n1
000001eb5a274d48 00007ffa81d3a440   0  0        216 System.Globalization.NumberFormatInfo
000001eb5a2d69c0 00007ffa81d359c0   0  0         28 System.String 1
000001eb5a28e350 00007ffa81d39e00   0  0         48 System.Text.StringBuilder
000001eb5a279a50 00007ffa81d359c0   0  0         40 System.String Key {0}
19 unique object(s) found in 1,716 bytes

0:006> !wdo 000001eb5a27ffe8
Address: 000001eb5a27ffe8
Method Table/Token: 00007ffa81d4d880/200000004 
Class Name: 
Size : 432
Components: 17
[0]: -mt 00007FFA81D4D820 000001EB5A27FFF8
[1]: -mt 00007FFA81D4D820 000001EB5A280010
[2]: -mt 00007FFA81D4D820 000001EB5A280028
[3]: -mt 00007FFA81D4D820 000001EB5A280040
[4]: -mt 00007FFA81D4D820 000001EB5A280058
0:006> !wdo -mt 00007ffa81d4d820 000001eb5a280010
00007ffa81d3a238 System.__Canon+0000    key 000001eb5a2d6980 Key 0
00007ffa81d385a0 System.Int32 +0008     hashCode 6278e7e7
00007ffa81d385a0 System.Int32 +000c     next 1 (0n1)
00007ffa81d385a0 System.Int32 +0010     value 1 (0n1)

The .NET Dictionary stores the data in buckets which is a linked list. The nodes are referenced via an index with the next field. In the dysfunctional case we find at array location 1 as next node index 1 which will cause the node walk to never end.

The decompiled code shows this

The for loop calls i = this.entries[i].next in the incremeent condition. This never incrementing loop will not end and explains the high CPU on two threads.

How do these things usually crop up? When more and more code is made multithreaded and async await is added to the mix you can find in a slow or stuck process often high CPU in FindEntry or Insert. Now you know this indicates not a huge Dictionary which has just become inefficient.

Actually you have found a Dictionary multithreading bug if you are running on .NET 4.x!

You can create a stack tag for these methods and flag them as potential concurrent Dictionary modification if these consume one core or more constantly like this to directly show up in WPA:

The stacktag file for this is

<?xml version="1.0" encoding="utf-8"?>

<Tag Name=""> 
<Tag Name=".NET">	
<Tag Name="Dictionary FindEntry\If one or more Core is consumed this indicates a concurrent dictionary modification which results in endless loops!">
		<Entrypoint Module="mscorlib.ni.dll" Method="System.Collections.Generic.Dictionary*.FindEntry*"/>
		<Entrypoint Module="mscorlib.dll" Method="System.Collections.Generic.Dictionary*::FindEntry*"/>
		<Entrypoint Module="System.Private.CoreLib.dll" Method="System.Collections.Generic.Dictionary*.FindEntry*"/>
	<Tag Name="Dictionary Insert\If one or more Core is consumed this indicates a concurrent dictionary modification which results in endless loops!">
		<Entrypoint Module="mscorlib.ni.dll" Method="System.Collections.Generic.Dictionary*.Insert*"/>
		<Entrypoint Module="mscorlib.dll" Method="System.Collections.Generic.Dictionary*::Insert*"/>
		<Entrypoint Module="System.Private.CoreLib.dll" Method="System.Collections.Generic.Dictionary*.Insert*"/>

This is enough debugging Voodoo for today. Drop me a note if you have been biten by this issue and how you found it.

How To Not Satisfy Customers – Pixel 5 Android Crash

Recently I have bought a Pixel 5 which worked for about a week perfectly. While driving at a motorway and trying to connect to some Bluetooth device which did not work it rebooted a few minutes later while browsing the web. No problem, after a quick boot it will be right back, I thought. But I was greeted with this (German) error message:

“Cannot load Android system. Your data may be corrupt. If you continue to get this message, you may need to perform a factory data reset and erase all user data stored on this device.”


Retry or “Delete Everything” which is the synonym for Factory Reset are the only options. That sounds like a bad trade. Several retries later it still refused to boot. In some forum I have read that I can copy the OS image back via the Debugger Interface. Luckily I had enabled Developer Mode on my phone. After downloading first the wrong version because Google had at that point in time the OS Image I had installed not listed (https://developers.google.com/android/ota#redfin) and did not mention that you cannot install an OS Image which is older than the current security patch level. I got it  via https://www.getdroidtips.com/february-2021-patch-pixel-device/ on my phone:


Unfortunately that did also not work. There is a secret menu in Android which allows to view e.g. some logs and some other options which I cannot tell if they are useful.


Here are some log messages which might or might not be useful to anyone else

Filesystem on /dev/block/by-name/metadata was not cleanly shutdown; state flags: 0x1, incompat feature flags: 0xc6

Unable to set property "ro.boottime.init.fsck.metadata" to "17": error code: 0x18

Unable to enable ext4 casefold on /dev/block/by-name/metadata because /system/bin/tune2fs is missing

Android System could not be loaded

At that point I gave up and opened a thread https://support.google.com/pixelphone/thread/105622202/pixel-5-boot-loop-how-to-recover-data?hl=en at the Pixel support forum. The problem is that I wanted to save my photos because I do not trust Google to not scan over my entire image gallery nor any other Cloud provider. A manual USB sync does the trick and it has served me well. The problem is that Android encrypts every sector on the phone so there is zero chance to read anything unless you have a booted Android system which has unlocked the master key.

I know Bitlocker from Windows which allows me to export my secret keys so I can later access the encrypted data. Perhaps I have missed some tutorial to export my phone encryption keys but it turns out there is no way to export the key! I find it hard to believe that an OS used by billions of people has no data recovery workflow I could follow.

Google wants me to sync everything to the Cloud or I will loose my data if the OS has a bad day on an otherwise fully functional device. If the phone would have an external SD slot it would not have encrypted my data by default (I can enable that if I need to) which would be ok. But Pixel phones have no SD Card slot. I feel thrown back to the 90s where the rule save often, save early was good advice to prevent data loss. 30 years and millions of engineering hours later using an Android device feels like a retro game with a bad user experience if you to not want to share your data with Google.

Unfortunately that is the end of the story. I am pretty good with computers and having studied particle physics has helped me to analyze and solve many difficult problems. But if a system is designed in a way to lock me out when an error occurs then the bets are off. It is certainly possible to fix these shortcomings but I guess this is pretty low on Googles Business priority list where the Cloud and AI the future and it does not matter on which device your data is living. In theory this sounds great, but  corner cases like what happens to unsynced photos/videos if the same error happens, or if you do not trust the Cloud provider are just minor footnotes on the way to the Cloud.

After a factory reset the phone worked and so far it has not happened again. I think the Cloud is a great thing but as a customer I should have a choice to not share my data with Google without the fear to risk complete data loss every time the OS has detected some inconsistency.

When Things Get Really Bad – NTFS File System Corruption War Story

The setup is a PC which runs software version 1.0. After an upgrade to 2.0 everything did work but after 1-2 weeks the machine would not start the software v2.0 anymore. When the issue was analyzed it turned out that a xml configuration file was starting with binary data which prevented v2.0 from starting up.

Good file

<?xml version="1.0" encoding="utf-8"?>
  <import resource="../../Config1.xml"/>    
  <import resource="../../Config2.xml"/>
  <import resource="../../Config3.xml"/>
  <import resource="../../Config4.xml"/>

Bad file

DCMPA30€×ÕÞ±# EÔ°ƒ3¡;k”€
*d*“B  ...
    <control name="MissileControl" ref="IMissileControl"/>
    <control name="PositionControl" ref="IPositionControl"/>
    <control name="ReservationNotify" ref="IReservationNotify"/>

The binary data did cover the first 4096 bytes of the xml file which was certainly not good. This raises the question if other files have been corrupted and if we can we trust that machine at all? That calls for a tool that scans the entire hard drive and calculates the checksum of all files on that machine which can then be compared to all other files from a good machine with identical setup. This is not a new problem so there should be some open source software out there. I found hashdeep which was developed by Jesse Kornblum an agent of some government agency. It was exactly the thing I did need. You can generate from a directory recursively a list of files with their hashes (audit mode) from a good machine and compare these hashes on another machine and print the missing/new/changed files as diff output.

The audit file generation and file verification are straightforward:

hashdeep -r dir  > /tmp/auditfile            # Generate the audit file
hashdeep -a -k /tmp/auditfile -r dir         # test the audit

The current version seems to be 4.4 and can be downloaded from Github. If you are curious here are the supported options:

C:\> hashdeep64.exe [OPTION]... [FILES]...
-c <alg1,[alg2]> - Compute hashes only. Defaults are MD5 and SHA-256
                   legal values: md5,sha1,sha256,tiger,whirlpool,
-p <size> - piecewise mode. Files are broken into blocks for hashing
-r        - recursive mode. All subdirectories are traversed
-d        - output in DFXML (Digital Forensics XML)
-k <file> - add a file of known hashes
-a        - audit mode. Validates FILES against known hashes. Requires -k
-m        - matching mode. Requires -k
-x        - negative matching mode. Requires -k
-w        - in -m mode, displays which known file was matched
-M and -X act like -m and -x, but display hashes of matching files
-e        - compute estimated time remaining for each file
-s        - silent mode. Suppress all error messages
-b        - prints only the bare name of files; all path information is omitted
-l        - print relative paths for filenames
-i/-I     - only process files smaller than the given threshold
-o        - only process certain types of files. See README/manpage
-v        - verbose mode. Use again to be more verbose
-d        - output in DFXML; -W FILE - write to FILE.
-j <num>  - use num threads (default 8)

After checking all files I found no other corruptions. This is really strange. Why did the xml file not become corrupt during the upgrade but 2 weeks later? The config file is read only and should be never written by anybody. It could be a simple glitch in some memory bank of the SSD but since the exact same error did happen on several machines we can rule out cosmic rays as source of our problem. But can radiation be a problem? Intel has tried it out by putting a SSD into the beam of a particle accelerator. When enough radiation in the game anything can happen but it will be happening at random locations.

Could it be a hard power off while the system was writing data to the disk? If the OS does everything right the SSD might be having not enough power to store the data. But that theory could also be ruled out because the SSD of that machine had a built in capacitor to ensure that all pending writes are safely written even during a hard power off event.

If you have thought of enough nice theories you need to get more evidence (read data). In that case checking the Windows Event Logs could help to find some clues. One first hint was that after the machine was bricked the application event log showed that Defrag has been run. Why on earth is Defrag running on a SSD? That should not be needed since we have no need to realign our data to optimize disk seek times. It turned out that Defrag was reenabled by accident during the update of v2.0.


The default settings were that it would run every day late in the evening. Windows is smart enough to not defrag a SSD. Instead it will trigger SSD Trim to let the SSD do its own reorg things. But that is not all. Although Defrag is not performing a typical defrag operation it still defragments directories and other NTFS tables. NTFS becomes dead slow when you store ~ 100K files in a single directory. Then you will hit hard limits of the NTFS file system structures. These internal lists need some cleanup in case there are many file add/remove operations happening in some hot directories which could lead after some time to a slow or broken file system. To prevent that Defrag is still doing some house keeping and moving some things around even on SSDs.

Since not all machines are running until 9 PM it also explains why after the software update the machine did break randomly within two weeks after the update. When defrag did run and the machine was shut down in the next morning the machine is turned on it would refuse to work due to the corrupted xml configuration file.

That is a nice finding but it still does not explain why Defrag would ruin our xml file. Lets check the system event log for interesting events. With interesting I mean messages like these

System Event Log Messages

Id Source Meaning Example Message
6005 EventLog System Start The Event log service was started.
6006 EventLog System Stop The Event log service was stopped.
6008 EventLog Hard Reboot The previous system shutdown at xxxx AM on ‎dd/dd/dddd‎ was unexpected.
41 Microsoft-Windows-Kernel-Power Hard Reboot The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
109 Microsoft-Windows-Kernel-Power Reboot initiated by OS The kernel power manager has initiated a shutdown transition
131 Ntfs NTFS Corruption “The file system structure on volume C: cannot be corrected.”
130 Ntfs NTFS Repair (chkdsk) The file system structure on volume C: has now been repaired.

Application Event Log Messages

Id Source Meaning Example Message
258 Microsoft-Windows-Defrag Defrag Run The storage optimizer successfully completed defragmentation on System (C:)
258 Microsoft-Windows-Defrag Defrag Run The storage optimizer successfully completed retrim on System (C:)
26228 Chkdsk chkdsk Repair Chkdsk was executed in verify mode on a volume snapshot.


The next clue was that after Defrag has been run the NTFS volume needed a repair.

Chkdsk        26228        None        "Chkdsk was executed in verify mode on a volume snapshot.

Checking file system on \Device\HarddiskVolume2

Volume label is System.

Examining 1 corruption record ...

Record 1 of 1: Bad index ""$I30"" in directory ""\Users\xxxx\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\System Tools <0x4,0x15437>"" ... The multi-sector header signature for VCN 0x0 of index $I30

in file 0x15437 is incorrect.

44 43 4d 01 50 41 33 30 ?? ?? ?? ?? ?? ?? ?? ?? DCM.PA30........

corruption found.

That is a helpful message! After Defrag has been run NTFS is broken. This NTFS corruption is then fixed by chkdsk which is complaining about some start menu entry. Luckily chkdsk prints also the first bytes of the corrupted file which starts with DCM.PA30 which is the same as our corrupted xml file! I did extract the binary file contents of the broken xml  file and put it in an extra file. Then I did run hashdeep.exe over it to check if this binary data is the file content of some other file. Since I have only the first 4K of the data I was not hopeful to find something. But lucky me I did find a file which did match. It looks like we did mix the contents of


with our xml file and the start menu entry shortcut! The question remains what Windows Defender files have to do with our xml and the start menu entry. Lets check the System Event Log for interesting events:


Now we can replay the events that lead to the NTFS corruption

  1. OS is installed
  2. Device Guard is enabled (not shown in the logs)
  3. Windows shuts itself down (“The kernel power manager has initiated a shutdown transition”)
  4. Sometimes the system hangs and the person installing the machine presses the power button to perform a hard reset
    • This caused the initial NTFS corruption
  5. After boot NTFS detects that something is corrupt which is repaired although one message tells us that it cannot be repaired
    1. Volume C … needs to be taken offline to perform a Full Chkdsk
    2. A corruption was discovered in the file system structure on volume
    3. The file system structure on volume C: cannot be corrected
    4. The file system structure on volume C: has now been repaired
  6. Now comes our part. After the repair was done automatically the software v1.0 is installed which writes data on an ill repaired NTFS volume and continues to run happily until …
  7. v2.0 of the software is installed which enables defrag by accident again
  8. The software version 2.0 runs happily it seems …
  9. But when the machine did run at 9:00 PM in the evening defrag is executed which defragments NTFS tables which uses referential information from other tables which destroys the first 4K of our xml file and we end up with some Windows Defender file contents.
  10. The next morning the machine would refuse to work
  11. Problems ….

That was an interesting issue and as you can guess this analysis was not done by me alone but several people contributed to connect all the loose ends. Thank You! You know who you are.

What can we learn from this?

  • Log messages are really important
  • Keeping all of them is essential to find the root cause
    • Overwriting the event log with myriads of messages is not the brightest idea
  • Luck: We got lucky that the System Event Log history included the initial OS installation

Hard Power Off

This is another case where transactions do not help to get an always consistent file system. What else can happen? Lets try it out by writing to some log files and then power off the machine while it is writing data to disk.

I used this simple C# program to write to several files in an alternating manner. Not sure if that is needed but it seems to help to force issues sooner

using System.Collections.Generic;
using System.IO;

class Program
    static void Main(string[] args)
        List<Stream> streams = new List<Stream>();
        const int Files = 5;

        for (int i = 0; i < Files; i++)
            string fileName = $"C:\\temp\\TestCorruption{i}.txt";
            FileStream stream = new FileStream(fileName, FileMode.OpenOrCreate, FileAccess.Write, FileShare.ReadWrite);

        while (true)
            for (int i = 0; i < Files; i++)
                using StreamWriter writer = new StreamWriter(streams[i], leaveOpen:true);
                writer.WriteLine("Hello this is string 1");
                writer.Write("Hello this is string 2");

We are writing to 5 log files the string “Hello this is string1”, flush the data to enforce a WriteFile call to the OS so that our process has no pending buffered data, then another string and enforce WriteFile again. If the process crashes we get consistent data in our log file because the OS has already got the data and it will write at some time later the pending data to disk. But what will happen when we do a hard power off?

Lets look at our log file

Hello this is string 2Hello this is string1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this isNULNULNULNULNULNULNULNULNULNUL....

Who did write the 0 bytes to our text file? We did write only ASCII data to our files! When WriteFile(“Buffered data) is executed and the OS wants to write that data to disk the OS needs to

  1. Increase the file size in the directory
  2. Write the buffered data over the now nulled cluster

Step 1 and 2 are related but a hard power off can lets 1 happen but not 2. This can result in a text file which contain larger amounts of 0 bytes at the end for no apparent reason. See https://stackoverflow.com/questions/10913632/streamwriter-writing-nul-characters-at-the-end-of-the-file-when-system-shutdown where someone with more NTFS know how than me answers that question:

That happens. When you append a file first its size is corrected in the directory (and that’s transactional in NTFS) and then the actual new data is written. There’s good chance that if you shut down the system you end up with a file appended with lots of null bytes because data writes are not transactional unlike metadata (file size) writes.

There’s no absolute solution to this problem.

answered Jun 6 ’12 at 11:53



When a Cluster is allocated it is nulled out to ensure that no previous data can be there. While you append data to a file you can miss the last write although the file size already contains the updated value which makes the 0s visible in a text editor. That is a nasty issue if you expect that nobody will ever reset a machine. I am not aware of a way around that except that you need to be prepared that some text files might contain 0 bytes at the end due to a hard power off.

Was the file system corrupted this time again? What does the System event log tell us?

Critical    9/5/2020 5:40:52 PM    Microsoft-Windows-Kernel-Power    41    (63)    The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Information    9/5/2020 5:40:52 PM    Microsoft-Windows-Ntfs    98    None    Volume \\?\Volume{12a882c8-5384-11e3-8255-806e6f6e6963} (\Device\HarddiskVolume2) is healthy.  No action is needed.

NTFS says everything is green. From a file integrity perspective everything did work, until power was cut. Your hard disk did miss the last write but hey: Who did guarantee you that this is an atomic operation?


Even impossible errors have an explanation if you are willing to dig deep. What was your nastiest error you did manage to find the root cause?