When Things Get Really Bad – NTFS File System Corruption War Story

The setup is a PC running software version 1.0. After an upgrade to 2.0 everything worked, but after 1-2 weeks the machine would no longer start software v2.0. When the issue was analyzed it turned out that an XML configuration file started with binary data, which prevented v2.0 from starting up.

Good file

<?xml version="1.0" encoding="utf-8"?>
  <import resource="../../Config1.xml"/>    
  <import resource="../../Config2.xml"/>
  <import resource="../../Config3.xml"/>
  <import resource="../../Config4.xml"/>

Bad file

DCMPA30€×ÕÞ±# EÔ°ƒ3¡;k”€
*d*“B  ...
    <control name="MissileControl" ref="IMissileControl"/>
    <control name="PositionControl" ref="IPositionControl"/>
    <control name="ReservationNotify" ref="IReservationNotify"/>

The binary data covered the first 4096 bytes of the XML file, which was certainly not good. This raises the question whether other files were corrupted too and whether we can trust that machine at all. That calls for a tool that scans the entire hard drive and calculates a checksum for every file, which can then be compared against the files of a good machine with an identical setup. This is not a new problem, so there should be some open source software out there. I found hashdeep, which was developed by Jesse Kornblum, an agent of some government agency. It was exactly what I needed. You can recursively generate a list of files with their hashes from a good machine and then audit another machine against these known hashes, printing the missing/new/changed files as diff output.

The audit file generation and file verification are straightforward:

hashdeep -r dir  > /tmp/auditfile            # Generate the audit file
hashdeep -a -k /tmp/auditfile -r dir         # Verify the files against the audit file
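Conceptually, the audit boils down to building a map of file hashes on the good machine and diffing it against the same map computed on the suspect machine. Here is a minimal Python sketch of that idea (an illustration of the concept, not how hashdeep is actually implemented):

```python
import hashlib
import os

def hash_tree(root):
    """Return {relative_path: sha256_hex} for all files below root."""
    hashes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, "rb") as f:
                hashes[rel] = hashlib.sha256(f.read()).hexdigest()
    return hashes

def audit(known, current):
    """Compare two hash maps the way an audit does: report
    missing, new and changed files on the current machine."""
    missing = known.keys() - current.keys()
    new = current.keys() - known.keys()
    changed = {p for p in known.keys() & current.keys()
               if known[p] != current[p]}
    return missing, new, changed
```

Run `hash_tree` on the reference machine, store the result, and feed both maps into `audit` on the machine under suspicion; a non-empty `changed` set is what pointed to the corrupted XML file in this story.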

The current version seems to be 4.4 and can be downloaded from GitHub. If you are curious, here are the supported options:

C:\> hashdeep64.exe [OPTION]... [FILES]...
-c <alg1,[alg2]> - Compute hashes only. Defaults are MD5 and SHA-256
                   legal values: md5,sha1,sha256,tiger,whirlpool
-p <size> - piecewise mode. Files are broken into blocks for hashing
-r        - recursive mode. All subdirectories are traversed
-d        - output in DFXML (Digital Forensics XML)
-k <file> - add a file of known hashes
-a        - audit mode. Validates FILES against known hashes. Requires -k
-m        - matching mode. Requires -k
-x        - negative matching mode. Requires -k
-w        - in -m mode, displays which known file was matched
-M and -X act like -m and -x, but display hashes of matching files
-e        - compute estimated time remaining for each file
-s        - silent mode. Suppress all error messages
-b        - prints only the bare name of files; all path information is omitted
-l        - print relative paths for filenames
-i/-I     - only process files smaller than the given threshold
-o        - only process certain types of files. See README/manpage
-v        - verbose mode. Use again to be more verbose
-W <file> - write output to FILE
-j <num>  - use num threads (default 8)

After checking all files I found no other corruptions. That is really strange. Why did the XML file not become corrupt during the upgrade but two weeks later? The config file is read-only and should never be written by anybody. It could be a simple glitch in some memory bank of the SSD, but since the exact same error happened on several machines we can rule out cosmic rays as the source of our problem. But can radiation be a problem at all? Intel has tried it out by putting an SSD into the beam of a particle accelerator. When enough radiation is in the game anything can happen, but it will happen at random locations.

Could it be a hard power off while the system was writing data to the disk? Even if the OS does everything right, the SSD might not have enough power left to store the data. But that theory could also be ruled out, because the SSD of that machine had a built-in capacitor to ensure that all pending writes are safely persisted even during a hard power off event.

Once you have thought of enough nice theories you need to get more evidence (read: data). In this case checking the Windows event logs could help to find some clues. A first hint was that after the machine was bricked, the Application event log showed that Defrag had been run. Why on earth is Defrag running on an SSD? That should not be needed, since there is no need to realign data to optimize disk seek times. It turned out that Defrag was re-enabled by accident during the update to v2.0.


The default setting is that it runs every day late in the evening. Windows is smart enough not to defrag an SSD. Instead it triggers an SSD TRIM to let the SSD do its own reorganization. But that is not all. Although Defrag does not perform a typical defrag operation on an SSD, it still defragments directories and other NTFS tables. NTFS becomes dead slow when you store ~100K files in a single directory; at that point you hit hard limits of the NTFS file system structures. These internal lists need some cleanup when many file add/remove operations happen in hot directories, which could otherwise lead after some time to a slow or broken file system. To prevent that, Defrag still does some housekeeping and moves some things around even on SSDs.

Since not all machines run until 9 PM, that also explains why the machines broke seemingly at random within two weeks after the update. When Defrag had run in the evening and the machine was shut down, then the next morning when the machine was turned on it would refuse to work due to the corrupted XML configuration file.

That is a nice finding, but it still does not explain why Defrag would ruin our XML file. Let's check the System event log for interesting events. By interesting I mean messages like these:

System Event Log Messages

Id    Source                          Meaning                 Example Message
6005  EventLog                        System start            The Event log service was started.
6006  EventLog                        System stop             The Event log service was stopped.
6008  EventLog                        Hard reboot             The previous system shutdown at xx:xx AM on dd/dd/dddd was unexpected.
41    Microsoft-Windows-Kernel-Power  Hard reboot             The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
109   Microsoft-Windows-Kernel-Power  Reboot initiated by OS  The kernel power manager has initiated a shutdown transition.
131   Ntfs                            NTFS corruption         The file system structure on volume C: cannot be corrected.
130   Ntfs                            NTFS repair (chkdsk)    The file system structure on volume C: has now been repaired.

Application Event Log Messages

Id     Source                    Meaning        Example Message
258    Microsoft-Windows-Defrag  Defrag run     The storage optimizer successfully completed defragmentation on System (C:)
258    Microsoft-Windows-Defrag  Retrim run     The storage optimizer successfully completed retrim on System (C:)
26228  Chkdsk                    chkdsk repair  Chkdsk was executed in verify mode on a volume snapshot.


The next clue was that after Defrag had been run, the NTFS volume needed a repair.

Chkdsk    26228    None    "Chkdsk was executed in verify mode on a volume snapshot.

Checking file system on \Device\HarddiskVolume2
Volume label is System.

Examining 1 corruption record ...

Record 1 of 1: Bad index "$I30" in directory "\Users\xxxx\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\System Tools <0x4,0x15437>" ...
The multi-sector header signature for VCN 0x0 of index $I30 in file 0x15437 is incorrect.

44 43 4d 01 50 41 33 30 ?? ?? ?? ?? ?? ?? ?? ?? DCM.PA30........

corruption found.

That is a helpful message! After Defrag has run, NTFS is broken. The NTFS corruption is then fixed by chkdsk, which complains about some Start Menu entry. Luckily chkdsk also prints the first bytes of the corrupted data, which start with DCM.PA30, the same bytes as in our corrupted XML file! I extracted the binary content of the broken XML file and put it into an extra file. Then I ran hashdeep.exe over it to check whether this binary data is the content of some other file. Since I had only the first 4K of the data I was not hopeful to find anything. But lucky me, I found a file that matched. It looks like we mixed the contents of


with our XML file and the Start Menu entry shortcut! The question remains what Windows Defender files have to do with our XML file and the Start Menu entry. Let's check the System event log for interesting events:


Now we can replay the events that led to the NTFS corruption:

  1. The OS is installed
  2. Device Guard is enabled (not shown in the logs)
  3. Windows shuts itself down (“The kernel power manager has initiated a shutdown transition”)
  4. Sometimes the system hangs and the person installing the machine presses the power button to perform a hard reset
    • This caused the initial NTFS corruption
  5. After boot NTFS detects that something is corrupt, which is repaired, although one message tells us that it cannot be repaired
    • Volume C … needs to be taken offline to perform a Full Chkdsk
    • A corruption was discovered in the file system structure on volume
    • The file system structure on volume C: cannot be corrected
    • The file system structure on volume C: has now been repaired
  6. Now comes our part. After the automatic repair, software v1.0 is installed, which writes data to an ill-repaired NTFS volume and continues to run happily until …
  7. v2.0 of the software is installed, which by accident enables Defrag again
  8. Software version 2.0 runs happily, it seems …
  9. But when the machine is still running at 9:00 PM in the evening, Defrag is executed. It defragments NTFS tables using referential information from other (corrupted) tables, which destroys the first 4K of our XML file and leaves us with the contents of some Windows Defender file.
  10. The next morning the machine refuses to work
  11. Problems ….

That was an interesting issue, and as you can guess this analysis was not done by me alone; several people contributed to connect all the loose ends. Thank you! You know who you are.

What can we learn from this?

  • Log messages are really important
  • Keeping all of them is essential to find the root cause
    • Overwriting the event log with myriads of messages is not the brightest idea
  • Luck: We got lucky that the System Event Log history included the initial OS installation

Hard Power Off

This is another case where transactions do not help to get an always-consistent file system. What else can happen? Let's try it out by writing to some log files and then powering off the machine while it is writing data to disk.

I used this simple C# program to write to several files in an alternating manner. I am not sure if that is needed, but it seems to help force issues sooner:

using System.Collections.Generic;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        List<Stream> streams = new List<Stream>();
        const int Files = 5;

        for (int i = 0; i < Files; i++)
        {
            string fileName = $"C:\\temp\\TestCorruption{i}.txt";
            FileStream stream = new FileStream(fileName, FileMode.OpenOrCreate, FileAccess.Write, FileShare.ReadWrite);
            streams.Add(stream);
        }

        while (true)
        {
            for (int i = 0; i < Files; i++)
            {
                using StreamWriter writer = new StreamWriter(streams[i], leaveOpen: true);
                writer.WriteLine("Hello this is string 1");
                writer.Flush(); // force a WriteFile call so our process has no pending buffered data
                writer.Write("Hello this is string 2");
                writer.Flush(); // force WriteFile again
            }
        }
    }
}
We write the string “Hello this is string 1” to five log files, flush the data to enforce a WriteFile call to the OS so that our process has no pending buffered data, then write another string and enforce WriteFile again. If the process crashes we still get consistent data in our log files, because the OS already has the data and will write the pending data to disk some time later. But what will happen when we do a hard power off?

Let's look at one of our log files:

Hello this is string 2Hello this is string1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this is string 1
Hello this is string 2Hello this isNULNULNULNULNULNULNULNULNULNUL....

Who wrote the 0 bytes into our text file? We wrote only ASCII data to our files! When WriteFile is executed with our buffered data and the OS wants to write that data to disk, the OS needs to

  1. Increase the file size in the directory
  2. Write the buffered data over the now nulled cluster

Steps 1 and 2 are related, but a hard power off can let 1 happen and not 2. This can result in a text file that contains larger amounts of 0 bytes at the end for no apparent reason. See https://stackoverflow.com/questions/10913632/streamwriter-writing-nul-characters-at-the-end-of-the-file-when-system-shutdown, where someone with more NTFS know-how than me answers that question:

That happens. When you append a file first its size is corrected in the directory (and that’s transactional in NTFS) and then the actual new data is written. There’s good chance that if you shut down the system you end up with a file appended with lots of null bytes because data writes are not transactional unlike metadata (file size) writes.

There’s no absolute solution to this problem.

answered Jun 6 ’12 at 11:53
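The visible effect is easy to reproduce without pulling the plug. Growing a file's size without writing the new data yields exactly those zero bytes; the following Python sketch uses truncate() as a stand-in for the "size already updated, data write lost" situation (an analogy, not NTFS itself):

```python
import os
import tempfile

# Write some log data, then grow the file without writing the new data,
# which mimics "file size already updated, data write never happened".
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"Hello this is string 1\n")  # 23 bytes actually on disk

os.truncate(path, 40)  # size grows to 40 bytes, the new tail is zero-filled

with open(path, "rb") as f:
    tail = f.read()[23:]
print(tail)  # 17 zero bytes, just like at the end of the corrupted log file
```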



When a cluster is allocated it is nulled out to ensure that no previous data can leak through. While you append data to a file, the last data write can be lost although the file size already contains the updated value, which makes the 0s visible in a text editor. That is a nasty issue if you expect that nobody will ever hard-reset a machine. I am not aware of a way around it, except that you need to be prepared that some text files might contain 0 bytes at the end due to a hard power off.
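If your log readers need to cope with such files, a defensive reader can simply strip the zero-filled tail. A minimal Python sketch (assuming the log is plain text and legitimate content never ends in NUL bytes):

```python
def read_log_tolerating_power_off(path):
    """Read a text log file and drop NUL bytes left over from a hard power off.

    After an interrupted append the file size may already include a freshly
    allocated, zero-filled cluster whose data write never reached the disk,
    so the tail of the file can consist entirely of 0 bytes.
    """
    with open(path, "rb") as f:
        data = f.read()
    return data.rstrip(b"\x00").decode("utf-8", errors="replace")
```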

Was the file system corrupted this time again? What does the System event log tell us?

Critical    9/5/2020 5:40:52 PM    Microsoft-Windows-Kernel-Power    41    (63)    The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Information    9/5/2020 5:40:52 PM    Microsoft-Windows-Ntfs    98    None    Volume \\?\Volume{12a882c8-5384-11e3-8255-806e6f6e6963} (\Device\HarddiskVolume2) is healthy.  No action is needed.

NTFS says everything is green. From a file integrity perspective everything worked until power was cut. Your hard disk missed the last write, but hey: who guaranteed you that this is an atomic operation?


Even impossible errors have an explanation if you are willing to dig deep enough. What was the nastiest error you managed to track down to its root cause?

MS Performance HUD – Analyze EventLog Reading Performance In Realtime

This post explains why your home machine can be fast while many corporate PCs show high CPU consumption at random times, especially for several minutes after boot, with no apparent explanation. Event Tracing for Windows (ETW) is a fantastic piece of technology. I love it because it enables me, and perhaps you, to tackle systemic issues regardless of whether the problem is my application, an Explorer plugin, a slow device driver or your antivirus scanner gone wild. Last time I complained about slow WMI queries (https://aloiskraus.wordpress.com/2019/11/17/the-wmi-query-from-hell/) which were induced by some system management software. It turns out that reading from the Windows event log is a quite complex undertaking when you try to understand its slow read performance. For starters I read all Application event log messages via a WMI query:


Slow Event Log Read Performance

SELECT * FROM Win32_NTLogEvent WHERE Logfile = 'Application'

This runs for ca. 35 s, which is quite a while. To put the number into perspective, let's check the size of the event log: 25 MB holding 60,232 messages. This means we are reading a binary file at 0.7 MB/s on my i7-4770K CPU @ 3.50 GHz. That is really slow! Let's record some data with ETW and analyze it with the new Performance HUD tool, which was released on May 29, 2020. When I select the Event Log service process, which shows high CPU, I can easily find out why it is so busy:


We are reading 60K messages. While we are doing that, the Event Log service process accesses the registry over 800K times to find the currently configured language! The method EvtRpcMessageRender formats an event log message, which in turn calls FormatMessage, which does not reuse any cached thread culture but always reads the data from the registry. This means we have 13 registry accesses for every read message, which is way too much. The Performance HUD shows performance insights in the lower grid along with a tooltip explaining what you can do about it. The performance of FormatMessage seems to be less important than ensuring 100% correctness when the language is switched, because it must use the updated value instantaneously. The HUD tool makes ETW data much more accessible than it was before. If I want to know how much data the Event Log service has read, I just need to switch over to the File I/O tab, which shows me the 25 MB parsed Application event log. If I need it I can also get the call stack by hovering over the blue underlined text and pressing Ctrl+Click to find the corresponding stack traces.


Now we have proven that the Event Log service reads 25 MB of data, but to format the read data it needs 800K registry reads just to determine the current language, which costs in total over 35 s of execution time. At least that is what I can see with the HUD tool inside a single process. This tool is definitely not a v1.0; judging by the many helpful tooltips from performance experts it has been used within MS for quite some time. The HUD tool is a great addition to the toolbox which warrants more detailed posts in the future. If you have problems getting it running, drop me a note and I will see if I can help.

Performance HUD can be run in two modes

  • Realtime attached to a specific process
  • Offline while parsing data from an existing recorded ETL file
    • This requires CPU Profiling, Context Switch, File IO, Registry tracing with stacks enabled
    • For UI Hangs and User/GDI Handle leaks you need additionally to record the Microsoft-Windows-Win32k ETW Provider with 0x8000030000240000 flags + Stacks
      • The new GDI/User Leak tracking flags are (Win 10 1903. Windows Server 2019 e.g. is based on Win 10 1807. No server OS or Win 10 LTS release supports that yet …)
        • 0x0000010000000000  GdiHandleOperation
        • 0x0000020000000000  UserHandleOperation

Realtime ETW tracing is a great thing which makes it much easier for the average performance investigator to analyze ETW data right away. Especially UI thread hangs due to blocking calls got fantastic support which makes finding your UI hangs a no brainer.


System Wide Analysis

A first check is great to do in HUD, but a more detailed analysis is still best done in WPA (and on a 4K monitor). Let's see how the CPU consumption is spent across processes:




The WMI query is executed from my test process named EventLogFlusherAndReader.exe (14796), which consumes 10 s, mostly reading registry keys for the COM class factory of some WMI COM objects. …

Next is svchost(1676), which is the Event Log service process, also with 10 s of CPU. As we have seen earlier, it also spends most of its time reading the registry, in our case 5 s out of 10 s. Below is a picture of how a WMI event log query is actually executed:



WMI can do much, but only with the help of many lackeys which work for us. We are reading a binary file and decoding its messages, but with WMI this causes significant CPU in 4 other system processes! Most time is spent inside the Event Log service, the WMI worker process and our WMI reading process. I was wondering why the EventLogReader process consumes so much CPU when it should only receive the data from WMI. It turns out there is more behind it, because the COM objects which are used are not only for WMI but also for VirtualBox. It seems the VM host intercepts quite a few calls on my machine. I uninstalled VirtualBox (this was quite a journey, because I had moved the C:\Windows\Installer folder to a different drive, which breaks many things. Note to self: never do this again!) and checked again. Sure enough the additional registry queries were gone, but the total runtime did not change much. The reason is that the Performance HUD tool enables expensive ETW logging events for every registry access, which exaggerates the costs of registry accesses quite a bit. With HUD profiling enabled I saw application slowdowns of up to 40%. You should therefore still measure without Performance HUD enabled to get a less distorted picture of what really is the bottleneck.

So what can we do about slow event log reading performance with WMI? As it turns out, not much. MS should really check whether it needs to access the language registry keys so often without any caching. In theory we can format the messages ourselves, but this is only part of the story, since ETW supports up to 3 resource dlls for an ETW event log message:



  1. MessageFileName: Same as the old event log message table dlls, which contain the localized resource strings where the parameters are expanded at %1, %2, … as for classic events
  2. ParameterMessageFile: This contains constants such as Access Denied, Granted, …, which are referenced with two %% followed by the resource number, e.g. %%8100
  3. ResourceFileName: Defines the WEVT_TEMPLATE resource, which contains event names, task names etc. inside a binary structure

What will happen when I format messages in the old way by loading the resource dll and expanding the parameters myself? For an ETW provider which is nice and defines only resource-based strings in the event log like this


I can correctly decode the message with the old EventLog API of .NET:


But when I remove the EventMessageFile registry key, I get an unresolved message, just like for any “modern” ETW provider that declares itself in the event log without being nice to legacy APIs: logMsg.Message errors out and gives me this message:

The description for Event ID ‘xxx’ in Source ‘xxxxx’ cannot be found. The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them.

Event log providers since Windows Vista only need a ProviderGuid below the event log registry key, which makes it impossible to format resource-based messages without the newer ETW event rendering APIs.

This is an issue if you are monitoring log messages issued by low-level device drivers to report e.g. disk failures and other hardware issues. These important messages will slip through your monitoring system if you use the .NET System.Diagnostics.EventLog class. It is the fastest way to read messages, but unfortunately you will miss messages. The situation can be fixed to some extent by creating an EventMessageFile registry value for the ETW provider below the corresponding event log source registry node. You need to look up the ProviderGuid below Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\WINEVT\Publishers\ to find the corresponding MessageFileName. That will enable proper rendering of ETW-based providers.
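As a sketch, the workaround boils down to a registry value like the following .reg fragment. The source name and DLL path here are purely illustrative; the real DLL path must be taken from the MessageFileName of the matching ProviderGuid below the Publishers key mentioned above:

```
Windows Registry Editor Version 5.00

; Hypothetical event source "MyDiskDriver" below the System log.
; Replace the path below with the MessageFileName found under
; HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\WINEVT\Publishers\{ProviderGuid}
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\EventLog\System\MyDiskDriver]
"EventMessageFile"="C:\\Windows\\System32\\drivers\\MyDiskDriver.sys"
```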




There is one caveat though. Fixed string constants that are formatted via %% in the resource-based messages and backed by the ParameterMessageFile resource dll will not be expanded and will stay as placeholders in your formatted message. So far I have seen mostly security-related ETW providers use that more advanced string expansion machinery.

The best option is still to use the newer .NET API in the form of System.Diagnostics.Eventing.Reader.EventLogReader, which will properly format all messages. The only issue is that this API, just like WMI, calls into the Event Log service where the message formatting is done out of process, which is much slower. I might be obsessed with performance when I complain about newer but more powerful APIs. Let's check things by reading and formatting 10K messages from the Application and Security event logs with the old System.Diagnostics.EventLog (readoldapi) and the newer System.Diagnostics.Eventing.Reader.EventLogReader (read) API:

C>EventLogFlusherAndReader.exe read -n 10000 -p 1 -l Application
Did read 10000 messages from log Application in 0.10s, 97,749 Messages/s, Format Repeat Count: 1
Total Formatting time 10.4 s.

C>EventLogFlusherAndReader.exe readoldapi -f -n 10000 -l Application
Read up to 10000 entries from Application log
Did read 10000 message table based event log messages from log Application in 1.45s, 6,899 Messages/s
Formatting time 0.4 s.
Reading and formatting for 10000 did take 1.85s, 5,410 Messages/s


C>EventLogFlusherAndReader.exe read -n 10000 -p 1 -l Security
Did read 10000 messages from log Security in 0.12s, 82,068 Messages/s, Format Repeat Count: 1
Total Formatting time 129.4 s.

C>EventLogFlusherAndReader.exe readoldapi -f -n 10000 -l Security
Read up to 10000 entries from Security log
Did read 10000 message table based event log messages from log Security in 1.37s, 7,277 Messages/s
Formatting time 0.6 s.
Reading and formatting for 10000 did take 1.98s, 5,041 Messages/s


With the “new” API we are over 20 times slower, because formatting is now done in the Event Log service, which, like WMI, causes a lot of overhead. But what stands out is that formatting messages from the Security event log is > 200 times slower compared to the “old” EventLog API. Since .NET is open source it is easy to debug this, and I have opened an issue https://github.com/dotnet/runtime/issues/34568 with a request to fix it. As usual there was little motivation to change such old legacy code, but I was invited to provide a PR, which I did. After a bit of back and forth due to test failures I landed this single line fix: https://github.com/dotnet/runtime/pull/36188/commits/dd0e212d75218f3b0eef3bc1ab812d7eb763279e, which removes a superfluous check that did not serve any real purpose. The underlying issue was that for every message format call a check was done which formatted all task names. The Security event log defines 60 task names, which explains the bad parsing performance for the Security event log. By removing a single line it got 26 times faster, which is not bad. This change will light up in .NET Core 5.0, so stay tuned. In the meantime you can recompile the event log reading dll from the .NET sources on your own and use it from a regular .NET application without any issues. It is even easy to copy the source code from .NET Core and put it into your own project. Here are the results:


C>EventLogFlusherAndReader.exe read -n 10000 -p 1 -l Application
Did read 10000 messages from log Application in 0.10s, 99,245 Messages/s, Format Repeat Count: 1
Total Formatting time 2.7 s.


C>EventLogFlusherAndReader.exe read -n 10000 -p 1 -l Security
Did read 10000 messages from log Security in 0.12s, 80,601 Messages/s, Format Repeat Count: 1
Total Formatting time 4.3 s.


That speeds up reading event logs with .NET by a factor of 3-30, depending on how many task names the log you are parsing has defined. It is still ca. 7 times slower compared to the old API, which might be OK if you are not overdoing things.


The Security event log can be much bigger than 20 MB, because some corporate security departments follow the MS security guidance to audit everything, which needs a huge 256 MB or even a 1 GB Security event log file. If it were much smaller it would overwrite the oldest events within hours, depending on how your audit policies are configured. To check your current audit policies you can run auditpol.exe. On my home machine everything looks good, but in a corporate setup things might be different. Especially Success audit messages can fill up your log files …

C>auditpol /get /category:*
System audit policy
Category/Subcategory                      Setting
  Security System Extension               No Auditing
  System Integrity                        Success and Failure
  IPsec Driver                            No Auditing
  Other System Events                     Success and Failure
  Security State Change                   Success
  Logon                                   Success and Failure
  Logoff                                  Success
  Account Lockout                         Success
  IPsec Main Mode                         No Auditing
  IPsec Quick Mode                        No Auditing
  IPsec Extended Mode                     No Auditing
  Special Logon                           Success
  Other Logon/Logoff Events               No Auditing
  Network Policy Server                   Success and Failure
  User / Device Claims                    No Auditing
  Group Membership                        No Auditing
Object Access
  File System                             No Auditing
  Registry                                No Auditing
  Kernel Object                           No Auditing
  SAM                                     No Auditing
  Certification Services                  No Auditing
  Application Generated                   No Auditing
  Handle Manipulation                     No Auditing
  File Share                              No Auditing
  Filtering Platform Packet Drop          No Auditing
  Filtering Platform Connection           No Auditing
  Other Object Access Events              No Auditing
  Detailed File Share                     No Auditing
  Removable Storage                       No Auditing
  Central Policy Staging                  No Auditing
Privilege Use
  Non Sensitive Privilege Use             No Auditing
  Other Privilege Use Events              No Auditing
  Sensitive Privilege Use                 No Auditing
Detailed Tracking
  Process Creation                        No Auditing
  Process Termination                     No Auditing
  DPAPI Activity                          No Auditing
  RPC Events                              No Auditing
  Plug and Play Events                    No Auditing
  Token Right Adjusted Events             No Auditing
Policy Change
  Audit Policy Change                     Success
  Authentication Policy Change            Success
  Authorization Policy Change             No Auditing
  MPSSVC Rule-Level Policy Change         No Auditing
  Filtering Platform Policy Change        No Auditing
  Other Policy Change Events              No Auditing
Account Management
  Computer Account Management             No Auditing
  Security Group Management               Success
  Distribution Group Management           No Auditing
  Application Group Management            No Auditing
  Other Account Management Events         No Auditing
  User Account Management                 Success
DS Access
  Directory Service Access                No Auditing
  Directory Service Changes               No Auditing
  Directory Service Replication           No Auditing
  Detailed Directory Service Replication  No Auditing
Account Logon
  Kerberos Service Ticket Operations      No Auditing
  Other Account Logon Events              No Auditing
  Kerberos Authentication Service         No Auditing
  Credential Validation                   No Auditing


Reading a big event log by **** Some State Of The Art Security Solution **** to check for malicious activity might kill the interactive user experience, since you are now maxing out two cores (the reader process and the Event Log service formatting your messages) for many minutes. A 1 GB Security event log file would need, even with the fixed .NET version, ca. 10 minutes to read. The current way .NET handles event log messages would cause it to scan through a 1 GB Security event log file for over 5 h. I hope such a security “solution” is not in production anywhere, but one never knows for sure whether some admin has created a nice PowerShell script which uses Get-WinEvent to read all messages with a non-deprecated, up-to-date API in his Active Directory forest. This post exists to make more people aware of the performance issues related to event log reading and overly tight security auditing policies. If you see minutes of high CPU spent in svchost.exe (it is a generic service host) hosting the Event Log service, you can be pretty sure that someone is reading large event log files on your machine via PowerShell, WMI, or other means.
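A quick back-of-the-envelope check of these estimates, using the numbers measured above (25 MB holding 60,232 messages in my Application log sample; per-message formatting times from the Security log runs). The exact figures depend on the message mix, so this only confirms the order of magnitude:

```python
avg_msg_bytes = 25_000_000 / 60_232          # ~415 bytes per message in my sample
msgs_in_1gb = 1_000_000_000 / avg_msg_bytes  # ~2.4 million messages in a 1 GB log

fixed_s_per_msg = 4.3 / 10_000               # fixed .NET, Security log formatting
old_s_per_msg = 129.4 / 10_000               # unfixed .NET, Security log formatting

minutes_fixed = msgs_in_1gb * fixed_s_per_msg / 60
hours_unfixed = msgs_in_1gb * old_s_per_msg / 3600
print(round(minutes_fixed), "min with the fix")  # on the order of 10-20 minutes
print(round(hours_unfixed, 1), "h without it")   # well above 5 hours
```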


Most often this is some security-related software which is not fully optimized yet. Let's not give up, but improve the tools we use and check whether the policies which generate the most log messages are really needed. If not, you can reduce the size of the log and still keep the rollover time for the oldest log messages constant.
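If you conclude that a smaller log is acceptable, the maximum log size can be changed with the same wevtutil tool used later in this post. The size below is only an example value; pick one that matches your rollover requirements:

```
Rem Limit the Security event log to 100 MB (/ms: takes the size in bytes)
C>wevtutil sl Security /ms:104857600
```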

With Performance HUD it should be much easier to come to the same conclusions as I did. Try out the new tool and tell me what you have found! And let's hope that someone makes FormatMessage faster.

The WMI Query From Hell

This is about a case where I wondered why my Windows installation at home runs fast but a Windows machine inside a company network is so slow, especially during boot. “Nothing exists independently”, a quote attributed to the Dalai Lama, applies very much to performance engineering. I am not sure if he had that in mind, but be prepared for unexpected twists during the following posts.

First, a little background on what you can expect in a corporate Windows network. Normally you have an Active Directory installation which takes care of user management. This involves rolling out central Group Policies defined by your administrators, deploying software centrally with e.g. MS System Center, and rolling out virus scanners and virus pattern updates. This happens in the background and you will not notice anything, except that sometimes stuff is downloaded and installed and you are asked to reboot because some newly installed software or Windows update needs one.

After a reboot I noticed that my machine was always working feverishly on at least 2 cores for the next 5 minutes. That is not good. Task Manager was showing over 50% CPU utilization on my 4-core notebook, which means that I could barely work when something demanding was needed. The natural reaction of anyone who deals with performance issues on Windows is to turn on ETW tracing. Why? Because the recording tool WPR is part of every Windows 10 installation. If you want to record for longer than one minute I would set the sampling interval down to 10 ms (-setprofint takes units of 100 ns, so 100000 = 10 ms) to not get too much information, like this:

C>wpr -start GeneralProfile -start CPU -start DiskIO -start FileIO -setprofint 100000


   Reproduce Issue


C>wpr -stop c:\temp\PerfIssue.etl

The CPU profile of WPR always records context switch events, which is interesting but produces a lot of data. I am still not sold on wpr as the successor to xperf, because you need to extract the built-in profiles from the exe, create your own derivative profile and then save it to a file, which is error prone and not easy to tailor. But for short recording durations it is okay to record pretty much everything, so I do not complain. When you want to analyze the data you still need to download the Windows Performance Toolkit (WPT) from the Windows 10 SDK. Then you can apply the OverView.wpaProfile profile from ETWController which comes with many useful stacktags, located here: https://github.com/Alois-xx/etwcontroller/tree/master/ETWController/ETW (get the complete folder).


We see large CPU activity in WmiPrvSE.exe, svchost.exe and lsass.exe. What is this all about? On a live machine you can use ProcessHacker (the current version is 3.0.2824; do not download the old version 2.x from 2016) to easily see which processes are consuming CPU. A very useful but little known feature is that you can filter by multiple processes separated with | in the input window on the toolbar. It filters on all visible columns.


When you hover over a process you immediately see which service or WMI provider is hosted in it, which can be very helpful.


WmiPrvSE.exe is a WMI provider host process where the WMI provider objects live. Whenever you call into a WMI object the call ends up in one of the several provider processes. The list of WMI providers tells us that this process contains the Event Log provider. Let's turn to the next most CPU-consuming process.


This is the Event Log service process. Looks like the WMI provider calls into the Windows Event Log Service. Interesting.


lsass.exe is the process which manages all security-related things inside Windows. This one tells you whether you get access denied or not.


The last one is a Windows service host process which hosts the WMI service. Those are a lot of different processes consuming CPU. But why? It looks like some WMI activity is dealing with the Event Log, but how can we find out which WMI query this could be?

The brute force approach is to check the memory contents of the WmiPrvSE.exe process and search for WMI queries which contain e.g. the select keyword. ProcessHacker can actually scan for strings in any process. Let's do that as a quick investigation:



And behold we immediately get a nice list of current and past queries executed by WMI:


How can we verify which WMI query was so slow? We do not need to write a C++/C# application. Instead we use PowerShell, which can execute any WMI query with the Get-WmiObject -Query “select …” cmdlet. Let's try to query all Windows event logs for errors (eventtype = 1) and measure the duration with Measure-Command:

PS C:\>Measure-Command -Expression { Get-WmiObject -query "select Eventcode,Eventtype,Logfile,Message,Sourcename,TimeGenerated,user from Win32_NTLogEvent where eventtype = 1"  }
Minutes           : 4                                                                                                                                                            
Seconds           : 5                                                                                                                                                            

Looks like we have found it. What is so slow about this WMI query? The first observation is that the query found in memory contains a time filter which tries to get events only from the last 3 days. Normally you would query one specific event log, but this query reads all error (eventtype = 1) messages from every event log on the system.

Up to now we have identified an expensive WMI query which reads all error events from the last 3 days. When you have a performance problem you now have two options:

  1. Find out who did call the query and prevent it from running
  2. Identify the root cause of the bad WMI Event Log Query performance

Let's start with option 1. Since Windows 8.1 we can enable WMI activity tracing and query for the WMI execution events.

To enable WMI Tracing execute this on an elevated command prompt

C>wevtutil sl Microsoft-Windows-WMI-Activity/Trace /e:true    

To query for WMI query execution events we need to filter for event ID 11, which also contains the initiating process ID:

C>wevtutil qe Microsoft-Windows-WMI-Activity/Trace /q:"*[System[EventID=11]]" /f:text | findstr /i ExecQuery  
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 42920; OperationId = 42932; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT * FROM Win32_PhysicalMedia WHERE Tag="\\\\.\\PHYSICALDRIVE1"; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 42992; OperationId = 42993; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 2147483648; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 42996; OperationId = 42997; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 1073741824; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 43000; OperationId = 43001; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 4293918720; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 43046; OperationId = 43047; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 2147483648; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 43050; OperationId = 43051; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 1073741824; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {00000000-0000-0000-0000-000000000000}; GroupOperationId = 43054; OperationId = 43055; Operation = Start IWbemServices::ExecQuery - root\cimv2 : SELECT AdapterRAM FROM Win32_VideoController WHERE AdapterRAM >= 4293918720; ClientMachine = SKYRMION; User = NT AUTHORITY\SYSTEM; ClientProcessId = 6276; NamespaceName = 132184730473612738
CorrelationId = {C9FD2187-9C4E-0004-3FBF-FEC94E9CD501}; GroupOperationId = 43058; OperationId = 43059; Operation = Start IWbemServices::ExecQuery - root\cimv2 : select Eventcode,Eventtype,Logfile,Message,Sourcename,TimeGenerated,user from Win32_NTLogEvent where (type = 2); ClientMachine = SKYRMION; User = Skyrmion\Alois; ClientProcessId = 1112; NamespaceName = 132183970163500741

With the process ID it is trivial to identify which process initiated the WMI query. Who is it?


lspush.exe? Never heard of it. This one is not from Microsoft.


A search later we get to https://www.lansweeper.com/knowledgebase/how-to-scan-with-lspush/. Lansweeper is a network inventory tool which gathers client machine information on Windows in a generic way, e.g. by calling it at boot or login via a group policy. There is a free edition which I installed at home to reproduce the issue. And lspush did show the same behavior on my machine:


It is of course much easier to spot the issue when you see the Lansweeper UI, where the scan operation is even visible. In corporate networks such things normally run in the background and you need to resort to advanced tooling to drill deeper. I contacted Lansweeper to ask whether there is something I could do about it. The answer was that other customers have complained as well, but the issue will only be tackled when enough customers complain. Currently there is NO way to turn this offensive event log query off, which is frustrating. I know Lansweeper has many customers, but not many seem to have been able to correlate lspush-induced CPU activity in WMI-related processes with Lansweeper yet. If you are a Lansweeper customer and you are using lspush and not the agent approach to scan your machines, you definitely should reach out to their support to raise the priority of the backlog item. Lansweeper is used by many big companies (https://www.lansweeper.com/), more than one of which will surely experience this issue. When I addressed this the first time with our IT guys they did not believe me, because the Lansweeper UI also shows how long the scan took:


According to the Lansweeper UI my scan time is in the sub-second region. What this report fails to show are the scans which cannot be turned off and how long they took…

In the meantime, if you are a Lansweeper customer, measure how long lspush.exe executes in your setup and call their support to prevent wasted CPU cycles at scale and a bad user experience after logon. This issue does not happen everywhere, which brings me back to point 2: why is this event log query so slow? That is the topic of the next post.

.NET Serialization Benchmark 2019 Roundup

With .NET Core 3.0 and .NET 4.8 released it is time to check again which serializer performs best. The older overviews still contain some interesting insights, because the focus tends to shift a bit from year to year.

Time moves on; some libraries are no longer maintained and no longer compatible with .NET Core 3. These were removed:

| Name | Reason |
|------|--------|
| ZeroFormatter | Project was abandoned by the maintainer in favor of MessagePack-CSharp (see GitHub issue). Crashes on .NET Core 3 with System.BadImageFormatException: ‘Bad IL format.’ |
| Wire | Deserialization fails. See Issue 1 and Issue 2. Besides this there has been no project activity since 2017. |
| Hyperion | Serialization does not work because it accesses internal fields of Exceptions. See Issue. |


Tested Serializers




| Name | Format | Version Tolerant | # Commits / Year of Last Commit at 8/2019 |
|------|--------|------------------|-------------------------------------------|
| Apex | Binary | No | 143 / 2019 |
| BinaryFormatter | Binary | Yes | ? |
| Bois | Binary | ? | 181 / 2019 |
| DataContract | Xml | Yes | ? |
| FastJson | Json | Yes | 138 / 2019 |
| FlatBuffer | Binary | Yes | 1701 / 2019 |
| GroBuf | Binary | Yes | 322 / 2019 |
| JIL | Json | Yes | 1272 / 2019 |
| Json.NET | Json | Yes | 1787 / 2019 |
| JsonSerializer (.NET Core 3.0) | Json | Yes | ? |
| MessagePackSharp | Binary, MessagePack | Yes | 715 / 2019 |
| MsgPack.Cli | Binary, MessagePack | Yes | 3296 / 2019 |
| Protobuf.NET | Binary, Protocol Buffer | Yes | 1018 / 2019 |
| ServiceStack | Json | Yes | 2526 / 2019 |
| SimdJsonSharp | Json | No | 74 / 2019 |
| SlimSerializer | Binary | No | ~100 / 2019 |
| UTF8Json | Json | Yes | 180 / 2019 |
| XmlSerializer | Xml | Yes | ? |

New Additions since 2018 are Apex, SimdJsonSharp, Utf8Json and JsonSerializer which comes with .NET Core 3.

Test Execution

You can clone my serializer testing framework (https://github.com/Alois-xx/SerializerTests) and compile it. Then execute RunAll.cmd from the SerializerTests directory, which will execute the tests on .NET Framework and .NET Core 3. I have put the output directory above the source directory (e.g. SerializerTests\..\SerializerTests_hh_mm_ss.ss\) to not check in measured data by accident.

Test Results


And one more time for .NET 4.8, where some of the new .NET Core-only serializers are missing:


The times are shown for .NET Core 3.0 and .NET 4.8 for 1 million objects, sorted by deserialize time. The tests were performed on my i7-4770K CPU @ 3.50 GHz with 16 GB RAM on Windows 10.0.18362.356 (= 1903). Normally you read data more often than you write it, which is why I put the emphasis on deserialize times. NopSerializer and SimdJsonSharpSerializer test only deserialization, hence the 0 in the graph for their serialization times.

But You Are Doing It All Wrong Without Benchmark.NET

How accurate are the measured numbers? As usual there are many factors, such as virus scanners, Windows update downloads and other things, which can shift timings in the region of 10%. For more memory-intensive (read: slower) serializers it can sometimes reach up to 30%, because page fault performance can differ quite a lot and has many influencing factors. Although the absolute timings may vary, the relative ordering remains pretty stable when you measure at one point in time. The measurement error within one test run is < 5%, which is good enough to decide which serializer is faster for your specific use case. Every test was executed 5 times and the median value was taken. I have experimented with 10 or more runs, but the resulting numbers did not stabilize: if the test suite runs for an extended period of time (hours), the machine may reach different states (like downloading and unpacking a Windows or other application update) which introduced systematic errors. For that reason I went back to rather short test timings to ensure that the relative performance of the tested serializers stays comparable.

And no, I have not used Benchmark.NET, because it is only good for micro benchmarks. It goes to great lengths to get stable numbers, but the price is that a test which runs for one microsecond is executed for several minutes. If the same approach were used for the serializers, the test suite would run for days and you still would not know whether the numbers are repeatable the next day, because these are not real micro benchmarks which depend only on the CPU. Besides that, Benchmark.NET makes it pretty hard to measure first-time init effects such as JIT overhead.
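The median-of-five approach described above can be sketched in a few lines; this is a simplified illustration, not the actual SerializerTests code:

```csharp
using System;
using System.Diagnostics;

static class MedianTimer
{
    // Runs an action several times and returns the median wall-clock duration.
    // Unlike the mean, the median is robust against a single run being disturbed
    // by e.g. a virus scanner or a background update.
    public static TimeSpan MeasureMedian(Action action, int runs = 5)
    {
        var durations = new TimeSpan[runs];
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            durations[i] = sw.Elapsed;
        }
        Array.Sort(durations);
        return durations[runs / 2]; // middle element for an odd number of runs
    }
}
```

A call like `MedianTimer.MeasureMedian(() => serializer.Deserialize(stream))` then yields the number that ends up in the charts.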
I am aware that there is a repository of Benchmark.NET-enabled performance tests named dotnet/performance on GitHub, where I have added the .NET Core 3 test case for IndexViewModel as it was mentioned in https://github.com/dotnet/performance/pull/456/commits. With Benchmark.NET tests I get only a perf increase of 30% with the new .NET Core 3 Json serializer. I find it strange that the current repo contains no tests for the new Json serializer of .NET Core 3. I have added the tests for it, like any other serializer in the .NET test suite, like this:

        [Benchmark(Description = "NetCoreJson")]
        public T NetCoreJson_()
        {
            memoryStream.Position = 0;
            return (T)System.Text.Json.JsonSerializer.DeserializeAsync(memoryStream, typeof(T)).Result;
        }

        [GlobalSetup(Target = nameof(NetCoreJson_))]
        public void SetupNetCoreJson()
        {
            memoryStream.Position = 0;
            System.Text.Json.JsonSerializer.SerializeAsync<T>(memoryStream, value).Wait();
        }


|                     Method |      Mean |    Error |   StdDev |    Median |       Min |       Max |   Gen 0 |  Gen 1 | Gen 2 | Allocated |
|--------------------------- |----------:|---------:|---------:|----------:|----------:|----------:|--------:|-------:|------:|----------:|
|                        Jil |  64.25 us | 0.159 us | 0.124 us |  64.23 us |  64.07 us |  64.45 us |  6.4037 | 0.7684 |     - |   26.7 KB |
|                   JSON.NET |  81.11 us | 0.808 us | 0.756 us |  81.12 us |  79.98 us |  82.31 us |  8.3682 |      - |     - |  34.27 KB |
|                   Utf8Json |  38.81 us | 0.530 us | 0.496 us |  38.66 us |  38.29 us |  39.76 us |  5.1306 |      - |     - |  21.58 KB |
|               NetCoreJson  |  49.78 us | 0.692 us | 0.648 us |  49.51 us |  49.07 us |  51.00 us |  5.3066 | 0.5896 |     - |  21.91 KB |
| DataContractJsonSerializer | 343.39 us | 6.726 us | 6.907 us | 340.58 us | 335.44 us | 355.18 us | 20.3125 | 1.5625 |     - |  88.38 KB |

Then I executed MicroBenchmarks.exe with test case #92, which runs the deserialization tests on IndexViewModel. In my test suite I only test stream-based serializers, because I also want to support large datasets which need no transformation into a string before the serializer is willing to deserialize the data. You can extend Json_FromString, which leads to similar results:


|      Method |     Mean |    Error |   StdDev |   Median |      Min |      Max |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------ |---------:|---------:|---------:|---------:|---------:|---------:|-------:|-------:|------:|----------:|
| NetCoreJson | 51.52 us | 0.498 us | 0.441 us | 51.49 us | 50.78 us | 52.53 us | 8.2508 | 0.4125 |     - |   34.2 KB |
|         Jil | 42.83 us | 0.414 us | 0.388 us | 42.73 us | 42.27 us | 43.56 us | 5.6662 | 0.6868 |     - |  23.49 KB |
|    JSON.NET | 77.07 us | 0.580 us | 0.542 us | 77.22 us | 76.19 us | 78.12 us | 7.6523 | 0.9183 |     - |  31.31 KB |
|    Utf8Json | 42.64 us | 0.263 us | 0.246 us | 42.65 us | 42.27 us | 43.06 us | 8.1967 | 0.3415 |     - |  33.87 KB |

This emphasizes again that you should not blindly trust random performance numbers mentioned on GitHub. This one was created with MicroBenchmarks.exe --filter "MicroBenchmarks.Serializers.Json_FromString*" after adding NetCoreJson to the mix. There we find again a perf increase of 32% compared to Json.NET, measured with Benchmark.NET. But this is nowhere near the factor of 2 originally mentioned by Scott Hanselman. Perhaps at that point in time the serializer did not yet fully support all property types, which resulted in great numbers. To combat confirmation bias I write the serialized data to disk so I can check the file contents and verify that indeed all required data was written and that no private fields were silently omitted by this or that serializer, which is a common source of measurement errors when a serializer shows numbers that are too good.

Show Me The Real Numbers

The big pictures above need some serious explanation. In my first article I showed that serialized data size and performance correlate. It is common wisdom that binary formats are superior to textual formats. As it turns out, this is still somewhat true, but the battle-tested binary protobuf-net serializer is beaten by 50% on .NET Core by Utf8Json, a Json serializer. There is one guy, Yoshifumi Kawai, who wrote MessagePackSharp, Utf8Json, ZeroFormatter and others, and who really knows how to write fast serializers. He is able to deliver best-in-class performance both for .NET Framework and .NET Core. I have chosen the fastest version-tolerant Json serializer, Utf8Json, and normalized all values to it. Utf8Json is therefore 100% for serialize and deserialize. If something is below 100% it is faster than Utf8Json; if it is at 200% it is two times slower. Now the pictures above hopefully make more sense to you. Humans are notoriously bad at judging relative performance when large and small values sit next to each other; the graph should make it clear. The new Json serializer which comes with .NET Core 3 (NetCoreJsonSerializer) is faster, but not by a factor of 2 as the post of Scott Hanselman suggests. It is about 30% faster in the micro benchmarks and ca. 15% faster in my serializer test suite, which de/serializes 1M larger objects.
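The normalization used in the charts boils down to a simple percentage calculation; here is a minimal sketch where only the Utf8Json-as-100%-baseline idea is from the article and the other number is a made-up placeholder:

```csharp
using System;
using System.Collections.Generic;

class NormalizeToBaseline
{
    static void Main()
    {
        // Hypothetical deserialize times in seconds. Utf8Json serves as the
        // 100% baseline; values above 100% mean "that many times slower".
        var times = new Dictionary<string, double>
        {
            ["Utf8Json"] = 0.30,
            ["SomeOtherSerializer"] = 0.60, // placeholder value
        };

        double baseline = times["Utf8Json"];
        foreach (var kvp in times)
        {
            // 100% == as fast as Utf8Json, 200% == two times slower
            Console.WriteLine($"{kvp.Key}: {kvp.Value / baseline * 100:F0}%");
        }
    }
}
```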

Below is the same data presented as absolute times for .NET Core 3.0.0 when 1 million objects are de/serialized. I have grouped the serializers by their serialization format, which makes it easier to choose similar breeds if you want to stay with e.g. a binary or a textual protocol. If you want to profile on your own you can use the supplied SerializerTests\SerializerStack.stacktags file to visualize the performance data directly in CPU sampling and context switch tracing.


Here is the same data with .NET Framework 4.8:


In terms of serialized data size, Bois_LZ4 is a special one because it is a compressing serializer with pretty good numbers. Since large parts of my BookShelf data are very similar it can compress the data pretty well, which might play a role if network bandwidth is relevant in your scenario and you cannot use HTTP compression.

Measurement Errors

I have collected data over the months also with .NET 4.7.2, where the numbers were much better. It looked like .NET Framework 4.8 (4.8.4010.0) had some regression in Xml handling which led to slowdowns of up to 6 times, manifesting in large degradations in the DataContract- and XmlSerializer-based tests. You do not believe me? I have measured it! Unfortunately I had recently played around a lot with debugging .NET (https://aloiskraus.wordpress.com/2019/09/08/how-net-4-8-can-break-your-application/), where I added .ini files for all of my GAC assemblies to make it easy to debug with full locals. This works great, but what happens when NGen tries to update its NGen images? It honors the .ini settings, which tell it not to inline or optimize code too much. So no, this is not a regression in .NET 4.8; it was my own error caused by playing around too much with JIT compiler flags. Now I have proof that method inlining is indeed very beneficial, especially for Xml parsers with many small methods, where it speeds up parsing by up to a factor of 6. There is always something you can learn from an error.

What Is The Theoretical Limit?

To test how fast a serializer can get I have added the NopSerializer, which tests only deserialization, because data is much more often read than written and deserialization is the more complex operation. Deserialization consists of two main stages:

  • Data Parsing
  • Object Creation

Because any serializer must parse Xml, Json or whatever its data format is, we can probe the theoretical limit by skipping the parsing step and measuring only object creation. The code for this is at its core getting a string out of a pre-parsed UTF-8 encoded byte array:

                // This will basically test the allocation performance of .NET, which will put
                // Thread.Sleeps into the allocation calls when the allocation rate becomes too high.
                for (int i = 0; i < myCount; i++)
                {
                    books.Add(new Book()
                    {
                        Id = i + 1,
                        Title = Encoding.UTF8.GetString(myUtf8Data, (int)myStartIdxAndLength[i*2], (int)myStartIdxAndLength[i*2+1])
                    });
                }

For 1 million Book objects, which consist only of an integer and a string of the form “Book dddd”, the deserialize time is ca. 0.12 s, which results in a deserialize performance of ca. 90 MB/s. Not bad. If you look at the data charts above you still find a faster one, named ApexSerializer. Why is it faster than the NopSerializer? Great question! When you parse a text file it is usually stored as ASCII or UTF-8 text, but .NET and Windows use UTF-16 as the internal string representation. This means that after reading from a text file the data needs to be inflated from ASCII/UTF-8 (8 bits per char) to UTF-16 (16 bits per char). This is complex and is done in .NET with Encoding.UTF8.GetString. This conversion overhead takes time; it has become better with .NET Core 3, which is great because it is a very central method called by virtually all code that reads text files. When we use the fastest Json serializer, Utf8Json, which needs 0.3 s, we find that 60% of the time is still spent parsing Json. This overhead could be lessened by using SimdJsonSharp, which claims GB/s Json parsing speed and uses advanced bit-fiddling magic and AVX 128/256-bit SIMD instructions which can process up to 64 bytes within one or a few instruction cycles. So far the interop costs of the native version (SimdJsonSharpSerializerN) still outweigh the benefits by far. The managed version (SimdJsonSharpSerializer) does pretty well, but it is only a few % faster than Utf8Json, which indicates that I am either using the library wrong or that the generated code still needs more fine-tuning. But be warned that SimdJsonSharpSerializer/N is NOT a serializer for generic objects: SimdJsonSharp gives you only the Json parser. You need to write the object mapping logic yourself if you want to go this route.
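The quoted ~90 MB/s is simply payload size divided by deserialize time. As a sanity check with a rough payload estimate (the ~11 MB figure is my assumption for 1M small Book records, not a number from the article):

```csharp
using System;

class ThroughputEstimate
{
    static void Main()
    {
        // Assumed: ~11 MB of pre-parsed UTF-8 payload deserialized in 0.12 s
        // lands in the ballpark of the ~90 MB/s quoted above. By the same math,
        // 35 MB of Json in 0.3 s with Utf8Json would be over 100 MB/s of raw
        // parsing plus object creation.
        double payloadMB = 11.0;
        double seconds = 0.12;
        Console.WriteLine($"{payloadMB / seconds:F0} MB/s");
    }
}
```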

Which One Is Fastest?


When raw speed is what you want and you accept keeping the serialized object layout fixed (no versioning!), then Apex is worth a look. In the first table I added a column showing whether the serializer supports versioning in one form or another. If the answer is no, expect crashes like ExecutionEngineException, StackOverflowException, AccessViolationException and other non-recoverable exceptions if anything in your execution environment changes. If speed is what you want, speed is what you get: the data is essentially memcopied to the target location.

GroBuf is also worth a look, but it comes with a significantly bigger serialized binary blob. Only slightly slower is MessagePackSharp, which is written by the same author as Utf8Json and uses MessagePack, an open source binary encoding scheme.


When the data format is Json, then Utf8Json is the fastest one, followed by Jil, which is only 4% slower in this macro benchmark. If you measure the overhead for small objects with Benchmark.NET, then Utf8Json is about 40% faster, which is not reflected by this macro benchmark. Benchmark.NET tries to warm up the CPU caches and the branch predictor as much as possible, which results in many CPU caching effects. This is great for optimizing small tight loops, but if you process larger chunks of memory, like 35 MB of Json for 1M Book objects, you get different numbers, because now a significant amount of memory bandwidth cannot be used to eagerly load data into the caches.

The Fastest Is Best Right?

Speed is great until you miss an essential feature of one of the tested serializers. If you are trying to replace something slow but very generic like BinaryFormatter, which can serialize virtually anything, you do not have many options, because most other serializers out there will not be able to serialize e.g. delegates or classes referenced through interfaces with a different implementation behind each instance. At that point you are, for practical reasons, bound to BinaryFormatter. As the next best option you can try SlimSerializer, which also tries to be pretty generic at the cost of versioning support.

Things are easier if you start from DataContractSerializer. You can get pretty far with ProtoBuf, which allows an easy transformation by supporting the same attributes as DataContractSerializer. But there are some rough edges around interface serialization: you can register only one implementing type per interface, which should be ok for most cases.
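Migrating can be as small as keeping the existing annotations, since protobuf-net understands [DataContract]/[DataMember] when explicit Order values are given (it uses them as protobuf field numbers). A minimal sketch, with the type chosen only for illustration:

```csharp
using System.Runtime.Serialization;

// The same annotated type works with DataContractSerializer and protobuf-net;
// protobuf-net maps DataMember Order values to protobuf field numbers.
[DataContract]
public class Book
{
    [DataMember(Order = 1)]
    public int Id { get; set; }

    [DataMember(Order = 2)]
    public string Title { get; set; }
}

// With protobuf-net you would then call e.g.:
//   ProtoBuf.Serializer.Serialize(stream, book);
//   Book copy = ProtoBuf.Serializer.Deserialize<Book>(stream);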

If you are starting from scratch and you want to use Json, then go for Utf8Json, which is generic and very fast. Its startup costs are also lower than Jil's, because Utf8Json and MessagePackSharp (same author) seem to hit a sweet spot by having most of the code already in the main library and generating only a small amount of code around it. DataContractSerializer, in contrast, generates the de/serialize code on the fly, depending on the code paths you actually hit.

First Time Init Performance (Serialize One Object)

To make the tests comparable, a new process is used for each serializer, to prevent issues like the first serializer having to start up the reflection infrastructure, which takes time, while later-called serializers would benefit from the already warmed-up reflection infrastructure. Inside a fresh process four different types are serialized and then the process exits. The measured time is not the executable runtime (.NET runtime startup is not measured) but how much time each serialize call needed. Below, the max (first run) serialize time for one small object is shown along with the min time of the other three objects, to give you an idea of how much time the serializer needs for the first object and how much performance you can expect when you add new types to the mix.
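The per-call measurement described above can be sketched with a Stopwatch around each serialize call. This is a simplified illustration (the real suite serializes four different types in a fresh process; here one type is reused so the first call shows the one-time codegen/JIT cost):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Xml.Serialization;

class FirstCallTiming
{
    static void Main()
    {
        // XmlSerializer generates serialization code on first use, so the first
        // Serialize call is expected to be much slower than the following ones.
        var serializer = new XmlSerializer(typeof(int));
        for (int i = 0; i < 4; i++)
        {
            var sw = Stopwatch.StartNew();
            serializer.Serialize(Stream.Null, i);
            sw.Stop();
            Console.WriteLine($"Call {i}: {sw.Elapsed.TotalSeconds:F4} s");
        }
    }
}
```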

You can also execute this manually by running RunAll.cmd, which runs all tests including the startup timing tests and writes csv files into the output folder. Or, if you want to go into details, use the command line:

D:\Source\git\SerializerTests\bin\Release\net472>SerializerTests.exe -test firstcall
Serializer      Objects "Time to serialize in s"        "Size in bytes" FileVersion     Framework       ProjectHome     DataFormat      FormatDetails   Supports Versioning
ApexSerializer  1       0.1803  84 .NET Framework 4.8.4010.0       https://github.com/dbolin/Apex.Serialization    Binary          No
ApexSerializer  1       0.0331  86 .NET Framework 4.8.4010.0       https://github.com/dbolin/Apex.Serialization    Binary          No
ApexSerializer  1       0.0344  86 .NET Framework 4.8.4010.0       https://github.com/dbolin/Apex.Serialization    Binary          No
ApexSerializer  1       0.1089  148 .NET Framework 4.8.4010.0       https://github.com/dbolin/Apex.Serialization    Binary          No
ServiceStack    1       0.0200  69 .NET Framework 4.8.4010.0       https://github.com/ServiceStack/ServiceStack.Text       Text    Json    Yes
ServiceStack    1       0.0020  70 .NET Framework 4.8.4010.0       https://github.com/ServiceStack/ServiceStack.Text       Text    Json    Yes
ServiceStack    1       0.0020  70 .NET Framework 4.8.4010.0       https://github.com/ServiceStack/ServiceStack.Text       Text    Json    Yes
ServiceStack    1       0.0091  117 .NET Framework 4.8.4010.0       https://github.com/ServiceStack/ServiceStack.Text       Text    Json    Yes


.NET Framework Everything NGenned


.NET Framework Only Default Assemblies NGenned


.NET Core 3 – Defaults


The pictures show that JIT compilation takes a noticeable amount of time, which is usually “fixed” with NGen on .NET Framework or Crossgen on .NET Core. Especially ApexSerializer needs 500 ms to start on .NET Core 3, or even 1 s on .NET Framework, if the assembly and its dependencies are not precompiled. But if you NGen, then the times are 10 times better for Apex, which is great. If you are trying to choose the best serializer for reading your configuration values during application startup, this metric becomes important, because the startup costs of the serializer might be higher than the de/serialize times of your data. Performance is always about using the right tradeoffs at the right places.


Although I have tried my best to get the data right: do not believe me. Measure for yourself in your environment with data types representative of your code base. My testing framework can help here because it is easily extendable: you can plug in your own data types and let the serializers run. Some serializers have rough spots in how they treat DateTime, TimeSpan, float, double, decimal or other data types. Do your classes have public default ctors? Not all serializers support private ctors, or you need to add some attribute to your class to make it work. These are things you still need to test. If you find some oddity, please leave a comment so we do not fall into the same trap as others. Collecting the data and verifying many times with ETW tracing that I really measure the right thing was already pretty time consuming. Working around all the peculiarities of the serializers is left as an exercise to you.

After doing much performance work during my professional life I think I have understood why performance is so hard:

Number of ways to make something slow


Number of ways to make it fastest


When in doubt, infinity will win. But even when you have created the fastest possible solution it will not stay the fastest, because the load profile will change or your application will run on new hardware with new capabilities. That makes performance a moving target. The time dependency of O(1) makes sure that I am never running out of work ;-).

How .NET 4.8 Can Break Your Application

A colleague called me in to look at a broken .NET application after installing a preview of an upcoming Windows KB. The application did not start up every time. Such issues are easy because the problem is fast to reproduce. Since it was happening on a VM with no Visual Studio installed, I used the best and most underestimated tool named dnSpy. I learned about this tool from the book Pro .NET Memory Management by Konrad Kokosa, which is not only an in-depth explanation of how the GC works but also covers many not so well known tools to drill into .NET on Windows and Linux. The most useful sections are in my opinion the last ones, which cover in-depth usage of Span and System.IO.Pipelines. The GC is a fascinating piece of software, but in practice you are fastest if you do not need to rely on it. dnSpy is an xcopy-deployable Visual Studio debugger clone which automatically decompiles managed code on the fly, where you can set breakpoints and do line-level debugging without having the actual source code. If you ever have to do production debugging you need to have this tool in your toolbox. In many ways dnSpy is better than Windbg, because to this day Windbg has no way to debug .NET code with a source code window, which is in my opinion a shame.

One really cool option of dnSpy is to edit the decompiled source code and modify a running binary. Why am I excited about that? Will strong name verification not prevent loading such a tampered assembly? First of all: no. You can recompile a strong named assembly and it will load normally since .NET 3.5 SP1:

Bypass signature verification of trusted assemblies

Starting with the .NET Framework 3.5 Service Pack 1, strong-name signatures are not validated when an assembly is loaded into a full-trust application domain,

Not many people are aware of this. To repeat: strong naming has not given you any security for over 10 years.

This makes it easy to force a highly sporadic exception on production machines (e.g. connection loss, disk full, …) by modifying the original method where the exception was thrown. Suppose you have a small application which for some reason ends up in the general exception handler, where the error handling does bad things, with code like this:

        static void Main(string[] args)
        {
            try
            {
                new FileStream(@"C:\temp\NotexistingFile.txt", FileMode.Open);
                Console.WriteLine("Got Here");
            }
            catch (FileNotFoundException ex)
            {
                Console.WriteLine("File was not found.");
            }
            catch (Exception ex)
            {
                Console.WriteLine("Fatal exception encountered!");
            }
        }


If you start the application from dnSpy, it will skip loading native images, which gives you, even for release builds, all local variables with full fidelity. While debugging the same application in VS 2019, even with Source Server support, it often cannot locate .NET Framework sources, and you see no local variables and no C# code:


That is not nice. What is the reason for that? First, Visual Studio uses native images if present. If these were compiled with no debug settings, then the local variables will be optimized away. Even checking “Suppress JIT optimization on module load” in Debugger – Options will not help, because it only affects JITed code like your application but not loaded NGenned images. But there is a way out:

1. Set the environment variable COMPLUS_ZAPDISABLE=1  or the registry key HKLM\SOFTWARE\Microsoft\.NETFramework\ZapDisable  and launch from that shell either VS or your application to which you attach later. See here for more information.

2. Create a file C:\Windows\Microsoft.NET\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.ini with the content

[.NET Framework Debugging Control]
GenerateTrackingInfo=1
AllowOptimize=0

Point 1 will disable loading of native images. If you set this system wide, your application will start much slower, but all of your code will be JITed now. Point 2 will disable JIT optimizations for a given dll, where the ini file must be named after the assembly: for xxxx.dll the file is xxxx.ini, but NOT xxxx.dll.ini or xxxx.ini.dll. You need to do this for each assembly for which you want to disable JIT optimizations. The post of Scott Hanselman gives misleading advice here: it leaves you with the impression that you need to create that ini file only for your executable, which is not the case. To make things easier I have wrapped that stuff into a PowerShell script which allows me to disable the use of native images system wide and generate the ini files in the GAC and additionally in my application directory.

PS D:\Utils\EnableLocalVars.ps1 -enable 1 -dir "C:\Program Files\Test" -gac -ngen 0
Did set NGen Flag to 0. NGen Images are disabled now. This will impact application startup!
Adding 1518 files from GAC
Adding 68 files from GAC64
Adding 2 files from C:\Program Files\Test
Skipped 8 files because ini file already exists
Created 1580 ini files.

That script makes it easy to enable/disable on an installed machine. I have put a link to my OneDrive Folder here.
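If you would rather generate the ini files from code than from the script, here is a minimal sketch (a hypothetical helper of my own, not part of the script) which writes the documented debugging-control ini file next to an assembly:

```csharp
using System.IO;

static class DebugIniWriter
{
    // Writes e.g. xxxx.ini next to xxxx.dll (or xxxx.exe) with the documented
    // [.NET Framework Debugging Control] settings that disable JIT optimizations
    // so local variables stay visible in the debugger.
    public static string WriteFor(string assemblyPath)
    {
        string iniPath = Path.ChangeExtension(assemblyPath, ".ini");
        File.WriteAllText(iniPath,
            "[.NET Framework Debugging Control]\r\n" +
            "GenerateTrackingInfo=1\r\n" +
            "AllowOptimize=0\r\n");
        return iniPath;
    }
}
```

Calling DebugIniWriter.WriteFor(@"C:\Program Files\Test\app.exe") would create app.ini in the same directory, matching the naming rule above.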

EnableLocalVars.ps1 -enable 0/1 -dir xxxx [-recurse] or [-gac] [-Ngen 0/1]
 by Alois Kraus 2018-2019
 Use this script when you want to attach to a running process or you want to get memory dumps of crashed processes of release builds with all local variables not optimized away.
 This will disable JIT optimizations to enable viewing local variables of methods in a debugger which otherwise would be optimized away.
  -enable 0/1         Enable local variables to view in debugger
                      If 1 enable local variables. The .ini files are created.
                      If 0 disable local variables. The .ini files are removed.
  -dir directory      Select directory where all dll/exe files get a .ini file
  -recurse            If set it will search recursively for all dlls below that path
  -gac                Make all GAC dlls debuggable
  -NGen 0/1           Disable/enable usage of NGen binaries system wide by setting the registry key HKLM\SOFTWARE\Microsoft\.NETFramework\ZapDisable
                      1 Enable NGen Images  == ZapDisable=1
                      0 Disable NGen Images == ZapDisable=0
                      You need to enable this if you want to view variables of .NET Framework code or if your application was NGenned
You need to restart the applications after executing this command for the changes to take effect.
 Enable locals for all GAC and  application dlls on an installed and NGenned machine
   EnableLocalVars.ps1 -enable 1 -gac -dir "c:\program files\myApp\bin" -ngen 0
 Reset settings
   EnableLocalVars.ps1 -enable 0 -gac -dir "c:\program files\myApp\bin" -ngen 1
 Enable NGen Images again
  EnableLocalVars.ps1  -ngen 1
WARNING: NGen images are disabled. Do not forget to enable them (EnableLocalVars.ps1 -Ngen 1) when you are done or all .NET applications will start ca. x2 times slower!

It comes with extensive help, so it should be easy enough to perform this step automatically before taking a memory dump. This will make memory dump debugging a whole lot easier. Usually I make the error configurable, e.g. by checking for the existence of a file, so I can force the error exactly when I want to inject it into .NET Framework code like this:


This trick helped me a lot to force the error only when the application has got into the initial non faulted state. Now I can trigger the issue by creating a file from the command line with

 C>echo Hi > c:\temp\ForceError.txt

This will trigger the exception in the patched file:

if (File.Exists("C:\\Temp\\ForceError.txt"))
    throw new Exception("This one is rare!");

When you try to save the changed mscorlib.dll you get an access denied error because the file is in use


But since we are pros we rename the running mscorlib.dll with no fear


because we know that Windows always opens dlls with the FILE_SHARE_DELETE sharing permission. This gives us the right to rename a running dll at any time. We can now save the patched mscorlib.dll without any issue:
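You can verify the FILE_SHARE_DELETE behavior yourself with a small sketch, where a plain FileStream stands in for the loader's file handle (my illustration, not the actual loader):

```csharp
using System;
using System.IO;

class RenameWhileOpen
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "inuse_demo.dll");
        string renamed = path + ".old";
        File.Delete(renamed);
        File.WriteAllText(path, "pretend this is a loaded image");

        // Open the file with FILE_SHARE_DELETE, as the OS image loader does.
        using (new FileStream(path, FileMode.Open, FileAccess.Read,
                              FileShare.Read | FileShare.Delete))
        {
            // The rename succeeds even though the file is still open.
            File.Move(path, renamed);
            Console.WriteLine(File.Exists(renamed)); // True
        }
        File.Delete(renamed);
    }
}
```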


When we run the modified mscorlib.dll we get indeed into the impossible error region once the trigger is activated:


That is a great way to test how robust your application really is if you suspect that the error handling is wonky. But I am digressing here. Now back to the .NET 4.8 issue.

When I debugged it with Windbg the error did occur, but not with dnSpy, which left me puzzled. If you have followed this far you can probably guess that this issue only surfaces when native images are loaded into the process. That was the reason why dnSpy was working perfectly. Enabling local variables and NGenning the binaries again helped to find out what was happening. The exact details are not important, but it suffices to say that the application used an old dependency injection framework named ObjectBuilder, which was part of the Enterprise Library many years ago. To select the right ctor, the default policy was to always use the first one returned by a call to Type.GetConstructors:

// Microsoft.Practices.ObjectBuilder.DefaultCreationPolicy
public ConstructorInfo SelectConstructor(IBuilderContext context, Type typeToBuild, string idToBuild)
{
    ConstructorInfo[] constructors = typeToBuild.GetConstructors();
    if (constructors.Length > 0)
    {
        return constructors[0];
    }
    return null;
}

That worked well for over a decade, but no more with .NET 4.8. If you read the API help for Type.GetConstructors


The GetConstructors method does not return constructors in a particular order, such as declaration order. Your code must not depend on the order in which constructors are returned, because that order varies.

it becomes evident that this approach can cause problems. In this particular case the first ctor used to be the default ctor for a given type, but with .NET 4.8 and NGenning the application, the order of ctors changed. The first ctor was now ctor(string), which caused object creation to fail miserably. For this issue to hit, many things must come together. If you e.g. add another type to the specific assembly that contains the type, the issue goes away, because for the error to surface your IL metadata must line up at specific offsets. I have created a small reproducer which contains 10K classes with two ctors; it shows the opposite issue: with JITed code one class has a different ctor order. The source code is here.

Type DerivedSealed1808 has different ordered ctor!

The application checks if the first ctor is the default ctor which should be nearly always the same:

        private static void CheckCtorOrder()
        {
            foreach (var type in typeof(Base).Assembly.GetTypes())
            {
                if (type.Name.StartsWith("DerivedSealed"))
                {
                    ConstructorInfo[] infos = type.GetConstructors();
                    if (infos[0].GetParameters().Length > 0)
                    {
                        Console.WriteLine($"Type {type.Name} has different ordered ctor!");
                    }
                }
            }
        }

After NGenning the application, all classes have the same ctor order. This is pretty fragile, so the advice on MSDN is there for a reason. I have checked other, more popular dependency injection frameworks, namely

  • Spring.NET
  • Windsor
  • AutoFac
  • Unity

which have all done their homework and do not contain such problematic code. I might be the only one hitting this issue. But if you find strange application failures after a Windows KB, check whether the KB contains .NET 4.8, which could be the reason why your .NET application fails. .NET 4.8 has been available for six months and is now pushed out via Windows Update as a mandatory update. MS is careful not to roll this out globally at once. Currently every 20th machine which requests all mandatory Windows Updates will get it, but I am not sure when the switch will be flipped to 100%. This approach is good, but if you wonder why only some machines of your department behave strangely, you should check whether they really got the same KBs installed, even if everyone installed all mandatory updates. If you have a larger department and maintain your own Windows Update Service, you should definitely keep an eye on strange .NET application behavior after you get the update which contains .NET 4.8. There are some issues with themed WPF context menus which can lead to misplaced, empty or non-themed context menus, but these should happen only sporadically. WPF still has issues with grouped virtualizing StackPanels (since .NET 4.5, where they were introduced) which can cause indefinite layout calls that surface as UI hangs due to semi memory leaks. But that is part of another story.
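Coming back to the GetConstructors pitfall: a creation policy that does not depend on reflection order can simply ask for the parameterless ctor explicitly. A sketch of such a policy (my suggestion, not ObjectBuilder code):

```csharp
using System;
using System.Linq;
using System.Reflection;

static class CtorSelector
{
    // Prefer the parameterless ctor; otherwise pick the one with the
    // fewest parameters. This never relies on GetConstructors() order,
    // which is explicitly documented as undefined.
    public static ConstructorInfo Select(Type typeToBuild)
    {
        return typeToBuild.GetConstructor(Type.EmptyTypes)
            ?? typeToBuild.GetConstructors()
                          .OrderBy(c => c.GetParameters().Length)
                          .FirstOrDefault();
    }
}
```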

The Unknown Error Code

When you look at log files you will frequently find negative error codes in them (often they are the reason you are looking at the log file in the first place). For some of them you can find on Google what they mean, but pretty frequently you end up with nothing. Why is that? Most likely you have a COM error code in front of you. These have a different structure, which is described here.

The most important thing to remember about COM error codes is that if bit 31 is set, it signals an error. With bit 31 = 1 the signed 32 bit value becomes negative, which explains the strange values of COM error codes. Bits 0..15 are most of the time a Win32 error code. Bits above 15 can be set as well, but these are irrelevant for normal readers of an error code.

If you ever encounter a negative error code in a log file, you can follow this simple recipe to find out what it might be about. Let's suppose you have e.g. Error: -2147480631.

1. Open calc.exe, switch it to Programmer mode and paste the value as a decimal number.


2. Take the last 4 digits of the hex number (the lower 16 bits). Type BC9 into calc as hex to get the decimal value.


3. Google for that Win32 error code or type on a command shell

net helpmsg 3017


and you finally have a hint why some chained installer failed. In this case, a file-in-use issue caused an MSI installer to demand a reboot, and in the installer log file only that cryptic error code was to be found. Not many people seem to know that a negative error code is mostly a hint to convert the lower 16 bits of the error code to decimal and interpret that as a Win32 error code.

If you are a PowerShell person:

-2147480631 -band 0xffff


And here is the shortest form

net helpmsg (-2147480631 -band 0xffff)


That’s a simple but powerful trick which should be more widely known.
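The whole recipe can also be wrapped into a few lines of C#; here is a small sketch that decodes the example value from above:

```csharp
using System;

class HResultDecode
{
    static void Main()
    {
        int error = -2147480631;               // value from the log file
        uint hr = (uint)error;                 // reinterpret as 0x80000BC9
        bool isError = (hr & 0x80000000) != 0; // bit 31: failure flag
        int win32 = (int)(hr & 0xFFFF);        // lower 16 bits: usually a Win32 error code
        Console.WriteLine($"0x{hr:X8} error={isError} win32Code={win32}");
        // Prints: 0x80000BC9 error=True win32Code=3017
    }
}
```

Feeding the extracted code 3017 to net helpmsg then yields the same file-in-use/reboot message as in the manual recipe.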

How Fast Can You Get With .NET Core?

With .NET Core 2.1 released, I want to come back to an old SO question of mine where I asked how fast one can parse a text file with C# and .NET 4.0. Now I want to look at how fast we can get today with .NET 4.7.2, .NET Core 2.1 and C++ on x86 and x64. The structure of the text file is a double followed by an integer and a newline, like this:

1.1 0
2.1 1
3.1 2

repeated 10 million times. The result is a 179 MB text file which we want to parse as fast as possible.

In 2011 my home machine was a Core Duo 6600 @ 2.40 GHz with Windows 7 running .NET 4.0. Let's see how much the numbers from 2011 have improved now that I have:

  • Better CPU – Haswell CPU Architecture I7 4770K 3.5 GHz
  • Newer Windows – Windows 10 1803
  • Newer .NET Framework – .NET 4.7.2
  • New .NET Core 2.1

To compare how fast we can get with managed code, I thought it would be good to compare the same thing to a C implementation which reads the ANSI text file with fgets, tokenizes the string with strtok and converts the numbers with atof and atoi. That should be pretty fast, right?

The Tests

The source code for the project IOBound can be found on GitHub. You can download the binaries to test for yourself from here: https://github.com/Alois-xx/IOBound/releases.



This one checks how fast we can read a file from disk into one huge buffer. If we execute it a second time, the data will be served from the file system cache. When warmed up, we measure the memcpy and soft page fault performance of the OS while it assigns memory to our application. It may look very simple, but it can tell you a lot about the disk performance and caching behavior of your OS.

        /// <summary>
        /// Check native reading speed from the OS
        /// </summary>
        private unsafe static uint ReadFileIntoByteBuffer(string dataFile)
        {
            using (var stream = new FileStream(dataFile, FileMode.Open))
            {
                byte[] buf = new byte[200 * 1024 * 1024];
                fixed (byte* pBuf = &buf[0])
                {
                    uint dwRead = 0;
                    if (ReadFile(stream.SafeFileHandle, pBuf, 200 * 1000 * 1000, out dwRead, IntPtr.Zero) == 0)
                    {
                        throw new Win32Exception();
                    }
                    return dwRead;
                }
            }
        }



After having so many professional serializers at our disposal, it would be a sin to dismiss one of the fastest ones. MessagePack is an efficient binary format for which a great library exists from Yoshifumi Kawai (aka neuecc), who wrote the fastest one for C#. At 133 MB the resulting file is the smallest one.

    public class Container
    {
        public List<double> Doubles = new List<double>();
        public List<int> Ints = new List<int>();
    }

    private static void DeserializeMessagePack(string dataFile)
    {
        using (var stream = new FileStream(dataFile, FileMode.Open))
        {
            Container deser = MessagePack.MessagePackSerializer.Deserialize<Container>(stream);
        }
    }



When using a text based format, JSON is also a must. Interestingly, at 169 MB the JSON file is even a bit smaller than the original text format. Here we also use the fastest JSON serializer, named Utf8Json. Credits for writing the fastest JSON serializer go again to Yoshifumi Kawai.

        private static void DeserializeUTF8Json(string dataFile)
        {
            using (var stream = new FileStream(dataFile, FileMode.Open))
            {
                Container deser = Utf8Json.JsonSerializer.Deserialize<Container>(stream);
            }
        }

Count Lines

This reads the file line by line but does nothing with the data. Since we do not allocate much except some temp strings, it shows how fast we can get when reading the UTF-8 file and expanding the data to UTF-16 strings.

        private static int CountLines(string dataFile)
        {
            using (var reader = new StreamReader(dataFile))
            {
                string line;
                int count = 0;
                while ((line = reader.ReadLine()) != null)
                {
                    count++;
                }
                return count;
            }
        }

Span Parse

One of the biggest improvements of .NET Core was the introduction of Span based APIs everywhere. They allow you to read characters from raw buffers without copying the data around again. With that you can write nearly allocation-free APIs which operate on unmanaged memory without the need to use unsafe code. To make that fast, the JIT compiler was enhanced to emit the most efficient code possible. For the regular .NET Framework there is the System.Memory nuget package which brings the type on board, but with no help from the JIT compiler. This one is especially interesting since it shows how much you lose without a helping hand from the JIT compiler.


        private static void ParseSpan(string dataFile)
        {
            var doubles = new List<double>();
            var ints = new List<int>();

            double curMeasurement = 0.0d;
            int curMeasurement2 = 0;

            ReadOnlySpan<byte> newLineSpan = NewLine.AsSpan();

            using (var stream = new FileStream(dataFile, FileMode.Open))
            {
                const int bufferSize = 10 * 4096;
                byte[] buffer = new byte[bufferSize];
                int readBytes = 0;

                int lastLineSize = 0;
                while ((readBytes = stream.Read(buffer, lastLineSize, bufferSize - lastLineSize)) != 0)
                {
                    Span<byte> bufferSpan = new Span<byte>(buffer, 0, readBytes + lastLineSize);

                    if (bufferSpan.StartsWith(Encoding.UTF8.GetPreamble())) // skip byte order mark
                    {
                        bufferSpan = bufferSpan.Slice(Encoding.UTF8.GetPreamble().Length);
                    }

                    int newLineStart = 0;
                    while ((newLineStart = bufferSpan.IndexOf(newLineSpan)) > 0)
                    {
                        if (ParseLine(bufferSpan.Slice(0, newLineStart), ref curMeasurement, ref curMeasurement2))
                        {
                            doubles.Add(curMeasurement);
                            ints.Add(curMeasurement2);
                        }
                        bufferSpan = bufferSpan.Slice(newLineStart + newLineSpan.Length);
                    }

                    // Copy the partial last line to the buffer start and continue reading after it
                    bufferSpan.CopyTo(buffer);
                    lastLineSize = bufferSpan.Length;
                }
            }
        }
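The same zero-copy idea also covers the number conversion itself. On .NET Core, Utf8Parser can parse values directly out of a UTF-8 byte span without ever materializing a string. A minimal illustration (my own sketch, not part of the benchmark code):

```csharp
using System;
using System.Buffers.Text;
using System.Text;

class Utf8ParseDemo
{
    static void Main()
    {
        // One line of the test file: a double, a space, an int.
        ReadOnlySpan<byte> line = Encoding.UTF8.GetBytes("1.1 0");

        // Parse the double and remember how many bytes it consumed,
        // then parse the int that follows the space separator.
        Utf8Parser.TryParse(line, out double d, out int bytesConsumed);
        Utf8Parser.TryParse(line.Slice(bytesConsumed + 1), out int i, out _);

        Console.WriteLine($"{d} {i}"); // 1.1 0
    }
}
```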

Unsafe Parse

This is how one could write a single-method parser with some hard-coded assumptions to become as fast as possible. Interestingly, this one is slower than the Span based API, which is both safer and faster.

        unsafe private static void ParseLinesUnsafe(string dataFile)
        {
            var doubles = new List<double>();
            var ints = new List<int>();
            char decimalChar = CultureInfo.CurrentCulture.NumberFormat.NumberDecimalSeparator[0];

            using (var reader = new StreamReader(dataFile))
            {
                string line;
                double d = 0;
                long a = 0, b = 0;
                int ix = 0;
                while ((line = reader.ReadLine()) != null)
                {
                    int len = line.Length;
                    ix = 0;
                    a = 0;
                    b = 0;
                    fixed (char* ln = line)
                    {
                        // integer part of the double
                        while (ix < len && char.IsNumber(ln[ix]))
                        {
                            a = a * 10 + (ln[ix++] - '0');
                        }

                        // fractional part of the double
                        if (ix < len && ln[ix] == decimalChar)
                        {
                            ix++; // skip the decimal separator
                            long div = 1;
                            while (ix < len && char.IsNumber(ln[ix]))
                            {
                                b = b * 10 + (ln[ix++] - '0');
                                div *= 10;
                            }
                            d = a + ((double)b) / div;
                        }

                        // skip the separator between double and int
                        while (ix < len && char.IsWhiteSpace(ln[ix]))
                        {
                            ix++;
                        }

                        int i = 0;
                        while (ix < len && char.IsNumber(ln[ix]))
                        {
                            i = i * 10 + (ln[ix++] - '0');
                        }

                        doubles.Add(d);
                        ints.Add(i);
                    }
                }
            }
        }


C++ Is Still A Thing!

C++ should be the fastest, with its highly optimized library routines which have been carefully optimized and profiled for decades. The code looks like plain C from 20 years ago, but it is only slightly faster than the naïve Standard Parse C# version.

	char pLine[1024];
	std::vector<double> doubles;
	std::vector<int> ints;
	int lines = 0;

	// fp is the FILE* opened with fopen elsewhere
	while (fgets(pLine, sizeof(pLine), fp) != nullptr)
	{
		ParseLine(doubles, ints, pLine);
	}

void ParseLine(std::vector<double> &doubles, std::vector<int> &ints, char *pline)
{
	char *pDouble = strtok(pline, " \n");

	if (pDouble != nullptr)
	{
		double d = atof(pDouble);
		doubles.push_back(d);
		char *pInt = strtok(NULL, " \n");
		if (pInt != nullptr && pInt != pDouble)
		{
			int i = atoi(pInt);
			ints.push_back(i);
		}
	}
}

Standard Parse C#

That is how you would do it when writing idiomatic C# code.

        private static void ParseLines(string dataFile)
        {
            var doubles = new List<double>();
            var ints = new List<int>();

            using (var reader = new StreamReader(dataFile))
            {
                string line;
                char[] sep = new char[] { ' ' };
                while ((line = reader.ReadLine()) != null)
                {
                    var parts = line.Split(sep);
                    if (parts.Length == 2)
                    {
                        doubles.Add(double.Parse(parts[0], CultureInfo.InvariantCulture));
                        ints.Add(int.Parse(parts[1], CultureInfo.InvariantCulture));
                    }
                }
            }
        }


The Results



Newer CPU

One obvious way to become faster is to buy a newer CPU. That works and can give you a speedup of roughly a factor of two if you wait long enough between significant CPU generations. But the economic benefit of shrinking CPU dies has decreased a lot: with 7 nm processes at the door you need EUV lithography with all of its challenges, so further huge improvements are not expected soon. And since one atom is about 0.1 nm, we do not have much room left to scale further. If you want to increase the frequency a lot, you also run into fundamental problems. The speed of light in conducting materials is about 200,000 km/s. During 1 ns (= 1 GHz) a signal can travel 20 cm. If you increase the frequency to 10 GHz, you can transmit a signal only 2 cm far. If your chip is bigger, you cannot transmit a signal to its destination in the necessary time. There is a natural upper limit on how much faster current CPU designs can get if they rely on ack signals of some sort.

With the rise of AI and its huge power consumption, we still can and will become much more energy efficient. Although Google is leading with dedicated TensorFlow processors, the power consumed to perform simple image recognition is at least 1000 times higher than for its biological counterparts (that is only a guess, and the number is likely much too low). The next big thing is definitely neuromorphic chips, which can rewire themselves depending on the problem to solve. With such processors you could make Google search a commodity where no single corporation owns all of your data. Instead it could become a public service run by many providers, where the search graph is shared and we as customers can sell our data on our own. Whether that ever becomes true depends on who creates such highly energy efficient chips first.

Newer Windows

The plain ReadFile call is about 4 times faster. That performance increase is not only due to the better CPU: the soft page fault performance of Windows 10 Fall Creators Update has also become better compared to the Windows 7 I was using at that time. The better soft fault performance is (finally) also part of Windows Server 2019. More details about soft fault performance can be found here.

32 vs 64 Bit

The biggest difference between x86 and x64 shows up in the C version. It turns out the x86 version is about 40% slower, mainly due to much higher costs in atof, which seems to be implemented pretty slowly on x86. In the screenshot below you can see that ucrtbase:common_atof needs 2.4 s vs 1.3 s in the x64 version, which is a pretty big perf hit. Although C/C++ can be very efficient, this is not always the case for the runtime libraries used. By using standard libraries you are not guaranteed to get world class performance on your current hardware. In particular, extended instruction sets can only be used by binaries specifically crafted for each CPU. The standard approach in C/C++ is to create a library for each CPU architecture, like

  • libSimd.dll
  • libAVX.dll
  • libAVX256.dll
  • libAVX512.dll

That quickly becomes pretty complicated to test. In a JIT compiled world, on the other hand, we still need to test on different CPUs, but we get the best possible assembly code for our CPU for free if the runtime is up to date, as is the case with RyuJIT of .NET Core.


The next observation is that the Span based version is 80% slower on x86 (1.35 s vs 0.75 s on .NET Core 2.1). If you are after high-throughput scenarios, it definitely makes sense to switch over to 64 bit and to use mainly index based structures in the most significant cases to alleviate the 100% increase in pointer size.

.NET Core 2.1 vs .NET 4.7.2

For streaming data into memory, .NET 4.7.2 is between 2% (Unsafe Parse x86), 7% (Unsafe Parse x64) and 40% (Span Parse x64) slower than .NET Core. Especially the 40% speed drop for Span based APIs, due to the lack of JIT support in the .NET Framework runtime, is unfortunate. But since the .NET Framework was declared legacy and most innovation for the years to come will happen in .NET Core, it is unlikely that the .NET Framework JIT compiler will get these new capabilities. The main reasons were compat issues, but we can already switch back to the legacy JIT if we do not want to use the new RyuJIT compiler. It should be easy to make the new JIT compiler of .NET Core optional, so that users of the .NET Framework can enable the new functionality once they have verified that their scenarios work.



Based on the numbers, it makes sense to switch over to x64 and Span based APIs if you are running on .NET Core. But even if you are not on .NET Core, Span based APIs offer quite a lot: they make it possible to spare the copy operation from unmanaged locations into the managed heap. The result is nearly allocation-free applications which can scale great. Another lesson is that if you can change your problem a bit, e.g. by changing the data format, you can get even faster. The MessagePack serializer on x64 .NET Core 2.1 is 25% faster than the Span text based approach. That is a bit unfair since there is practically no parsing involved, but it shows that you can achieve near optimal performance with efficient binary formats which contain mainly primitive types. Even our pretty generic UTF8 JSON approach performed 34% better than the x64 C/C++ version. In the end you can become very fast with any language if you profile your application and fix the obvious bottlenecks. To be able to do that you need performance profiling tools for your platform. My favorite tool is ETW, as you might have guessed from my previous posts. Are you profiling your code before problems are reported? What is your favorite profiling tool?

Why Does My Synchronous WCF Call Hang?

This is a limitation of WCF which is not widely known. Let's suppose you have a WCF interface which contains a mixture of Task based and non Task based methods:

[ServiceContract(Namespace = "WCFDispatching")]
public interface IRemotedService
{
    Task<bool> MakeAsyncCall(int id);

    void SyncCall(int id);
}

What will happen when you call both methods?

    async Task Work(IRemotedService service)
    {
        await service.MakeAsyncCall(50);
        service.SyncCall(50);   // this second, synchronous call deadlocks
    }

The sad truth is that the second call will hang indefinitely with a rather long call stack:

[Resuming Async Method]


The interesting thing is that the synchronous call completes on the remote endpoint, but the WCF client call hangs in this call stack. The problem is that WCF runs asynchronous method completions on the WCF channel dispatcher, which seems to be single threaded, just like a UI application with a message pump. When a blocking synchronous call is performed, WCF normally waits in a stack like this for the read operation to complete:


But in our case we have a different wait call stack: we are not waiting for a read to complete, but in DuplexChannelBinder.SyncDuplexRequest.WaitForReply we are waiting for another thread to set an event signalling completion. This assumes that another thread is still receiving input from the remote connection, which is not the case. We can see this when we look at who sets the event:


To release our waiting thread, another thread must call GotReply, which is never going to happen. To get things working again, you must make either all methods of your remoted interface synchronous or all of them asynchronous. A sync/async mixture of remoted methods will likely cause deadlocks like the one shown above.

Below is the full sample code to reproduce the issue if you are interested:

using System;
using System.ServiceModel;
using System.Threading.Tasks;

namespace WCFDispatching
{
    [ServiceContract(Namespace = "WCFDispatching")]
    public interface IRemotedService
    {
        [OperationContract]
        Task<bool> MakeAsyncCall(int id);

        [OperationContract]
        void SyncCall(int id);

        [OperationContract]
        Task SyncCallAsync_(int id);
    }

    [ServiceBehavior(InstanceContextMode = InstanceContextMode.PerSession, ConcurrencyMode = ConcurrencyMode.Multiple, IncludeExceptionDetailInFaults = true)]
    public class RemotedService : IRemotedService
    {
        public async Task<bool> MakeAsyncCall(int id)
        {
            await Task.Delay(10);
            Console.WriteLine($"Async call with id: {id} completed.");
            return true;
        }

        public async Task SyncCallAsync_(int id)
        {
            await Task.Delay(0);
            Console.WriteLine($"SyncCallAsync call with id {id} called.");
        }

        public void SyncCall(int id)
        {
            Console.WriteLine($"Sync call with id {id} called.");
        }
    }

    class Program
    {
        const string PipeUri = "net.pipe://localhost/WCFDispatching";

        static void Main(string[] args)
        {
            new Program().Run(args);
        }

        bool bUseAsyncVersion = false;

        readonly string Help = "WCFDispatching.exe [-server] [-client [-async]]" + Environment.NewLine +
                    "    -server      Create WCF Server" + Environment.NewLine +
                    "    -client      Create WCF Client" + Environment.NewLine +
                    "    -async       Call async version of both API calls" + Environment.NewLine +
                    "No options means client mode which calls async/sync WCF API which produces a deadlock.";

        private void Run(string[] args)
        {
            if (args.Length == 0)
            {
                Console.WriteLine(Help);
                return;
            }

            bool bStartServer = false;
            for (int i = 0; i < args.Length; i++)
            {
                string arg = args[i];
                switch (arg)
                {
                    case "-server":
                        bStartServer = true;
                        break;
                    case "-client":
                        // this is the default
                        break;
                    case "-async":
                        bUseAsyncVersion = true;
                        break;
                    default:
                        Console.WriteLine($"Command line argument {arg} is not valid.");
                        Console.WriteLine(Help);
                        return;
                }
            }

            if (bStartServer)
            {
                using (ServiceHost host = StartServer())
                {
                    host.Open();
                    Console.WriteLine("Server started");
                    Console.ReadLine(); // keep the server alive until Enter is pressed
                }
            }
            else
            {
                var service = CreateServiceClient<IRemotedService>(new Uri(PipeUri));
                Task waiter = Work(service);
                waiter.Wait(); // in the deadlock case this never returns
            }
        }

        async Task Work(IRemotedService service)
        {
            await service.MakeAsyncCall(50);
            if (bUseAsyncVersion)
            {
                await service.SyncCallAsync_(50); // this will work
            }
            else
            {
                service.SyncCall(150);  // this call will deadlock!
            }
        }

        internal static T CreateServiceClient<T>(Uri uri)
        {
            var binding = CreateDefaultNetNamedPipeBinding();
            var channelFactory = new ChannelFactory<T>(binding, uri.ToString());
            var serviceClient = channelFactory.CreateChannel();
            var channel = (IContextChannel)serviceClient;
            channel.OperationTimeout = TimeSpan.FromMinutes(10);

            return serviceClient;
        }

        internal static ServiceHost StartServer()
        {
            var host = new ServiceHost(typeof(RemotedService));
            host.AddServiceEndpoint(implementedContract: typeof(IRemotedService), binding: CreateDefaultNetNamedPipeBinding(), address: PipeUri);

            return host;
        }

        internal static NetNamedPipeBinding CreateDefaultNetNamedPipeBinding()
        {
            // Default setting for NetNamedPipeBinding.MaxReceivedMessageSize = 65,536 bytes
            // Default settings for NetNamedPipeBinding.ReaderQuotas:
            // MaxDepth = 32, MaxStringContentLength = 8192, MaxArrayLength = 16384, MaxBytesPerRead = 4096, MaxNameTableCharCount = 16384
            TimeSpan timeOut = TimeSpan.FromMinutes(1000);
            var binding = new NetNamedPipeBinding(NetNamedPipeSecurityMode.None)
            {
                ReceiveTimeout = timeOut,
                MaxReceivedMessageSize = Int32.MaxValue,
                ReaderQuotas =
                {
                    MaxArrayLength = Int16.MaxValue,
                    MaxStringContentLength = Int32.MaxValue,
                    MaxBytesPerRead = Int32.MaxValue
                }
            };
            return binding;
        }
    }
}

To try it out first start the server from a shell with

WCFDispatching.exe -server

and then start the client with the -client option to get the deadlock. To call the fixed version start the client with -client -async and the deadlock will not occur.

Why Skylake CPUs Are Sometimes 50% Slower – How Intel Has Broken Existing Code

I got a call that on newer hardware some performance regression tests had become slower. Not a big deal. Usually it is a bad configuration somewhere in Windows, or some BIOS settings were set to non optimal values. But this time we were not able to find a setting that brought performance back to normal. Since the change was not small, 9s vs 19s (blue is old hardware, orange is new hardware), we needed to drill deeper:


Same OS, Same Hardware, Different CPU – 2 Times Slower

A perf drop from 9.1s to 19.6s is definitely significant. We did more checks whether the software version under test, Windows or the BIOS settings were somehow different from the old baseline hardware. But no, everything was identical. The only difference was that the same tests were running on different CPUs. Below is a picture of the newest CPU:


And here is the one used for comparison


The Xeon Gold runs on a different CPU architecture named Skylake, which is common to all CPUs produced by Intel since mid 2017. *As commenters have pointed out, the consumer Skylake CPUs were released already in 2015. The server Xeon CPUs with SkylakeX were released mid 2017. All later CPUs (Kaby Lake, …) share the same issue. If you are buying current hardware you will get a CPU with the Skylake CPU architecture. These are nice machines, but as the tests have shown, newer and slower is not the right direction. If all else fails, get a repro and use a real profiler ™ to drill deeper. When you record the same test on the old hardware and on the new hardware it should quickly lead to somewhere:


Remember: the diff view in WPA *(Windows Performance Analyzer is a free profiling UI, part of the Windows Performance Toolkit which ships with the Windows SDK) shows in the table the delta of Trace 2 (11s) – Trace 1 (19s). Hence a negative delta in the table indicates a CPU consumption increase in the slower test. When we look at the biggest CPU consumer differences we find AwareLock::Contention, JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel and ThreadNative.SpinWait. Everything points towards CPU spinning when threads are competing for locks. But that is a red herring, because spinning is not the root cause of the slower performance. Increased lock contention means that something in our software became slower while holding a lock, which as a consequence results in more CPU spinning. I was checking locking times and other key metrics, like disk I/O and the like, but I failed to find anything relevant which could explain the performance degradation. Although it seemed not logical, I turned back to the increased CPU consumption in various methods.

It would be interesting to find where exactly the CPU was stuck. WPA has file and line columns, but these work only with private symbols, which we do not have because it is .NET Framework code. The next best thing is to get the address in the dll where the instruction is located, which is called Image RVA (Relative Virtual Address). When I load the same dll into the debugger and then do

u xxx.dll+ImageRVA

then I should see the instruction which was burning most CPU cycles. It was basically only one hot address.


Let's examine the hot code locations of the different methods with Windbg:

0:000> u clr.dll+0x19566B-10
00007ff8`0535565b f00f4cc6        lock cmovl eax,esi
00007ff8`0535565f 2bf0            sub     esi,eax
00007ff8`05355661 eb01            jmp     clr!AwareLock::Contention+0x13f (00007ff8`05355664)
00007ff8`05355663 cc              int     3
00007ff8`05355664 83e801          sub     eax,1
00007ff8`05355667 7405            je      clr!AwareLock::Contention+0x144 (00007ff8`0535566e)
00007ff8`05355669 f390            pause
00007ff8`0535566b ebf7            jmp     clr!AwareLock::Contention+0x13f (00007ff8`05355664)

We do this for the JIT method as well:

0:000> u clr.dll+0x2801-10
00007ff8`051c27f1 5e              pop     rsi
00007ff8`051c27f2 c3              ret
00007ff8`051c27f3 833d0679930001  cmp     dword ptr [clr!g_SystemInfo+0x20 (00007ff8`05afa100)],1
00007ff8`051c27fa 7e1b            jle     clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x14a (00007ff8`051c2817)
00007ff8`051c27fc 418bc2          mov     eax,r10d
00007ff8`051c27ff f390            pause
00007ff8`051c2801 83e801          sub     eax,1
00007ff8`051c2804 75f9            jne     clr!JIT_MonEnterWorker_InlineGetThread_GetThread_PatchLabel+0x132 (00007ff8`051c27ff)

Now we have a pattern. One time the hot location is a jump instruction, and the other time it is a subtraction. But both hot instructions are preceded by the same instruction, named pause. Different methods execute the same CPU instruction which is for some reason very time consuming. Let's measure the duration of the pause instruction to see if we are on the right track.

If You Document A Problem It Becomes A Feature

CPU                                  Pause Duration in ns
Xeon E5 1620v3 3.5 GHz                4
Xeon(R) Gold 6126 CPU @ 2.60 GHz     43

Pause on the new Skylake CPUs is an order of magnitude slower. Sure, things can get faster and sometimes a bit slower. But over 10 times slower? That sounds more like a bug. A little internet search about the pause instruction leads to the Intel manuals, where the Skylake microarchitecture and the pause instruction are explicitly mentioned:



No, this is not a bug, it is a documented feature. There even exists a web page which contains the timings of pretty much all CPU instructions:


  • Sandy Bridge    11
  • Ivy Bridge      10
  • Haswell          9
  • Broadwell        9
  • SkylakeX       141

The numbers are CPU cycles. To calculate the actual time you need to divide the cycle count by the CPU frequency (usually in GHz) to get the time in ns.

That means when I execute heavily multithreaded applications in .NET on the latest hardware, things can become much slower. Someone else had noticed this already in August 2017 and filed an issue for it: https://github.com/dotnet/coreclr/issues/13388. The issue has been fixed with .NET Core 2.1, and .NET Framework 4.8 Preview also contains the fixes.


Improved spin-waits in several synchronization primitives to perform better on Intel Skylake and more recent microarchitectures. [495945, mscorlib.dll, Bug]

But since .NET 4.8 is still one year away, I have requested a backport of the fixes to get .NET 4.7.2 back to speed on the latest hardware. Since many parts of .NET use spinlocks, you should look out for increased CPU consumption around Thread.SpinWait and other spinning methods.



E.g. Task.Result will spin internally, where I could see for other tests also a significant increase in CPU consumption and degraded performance.

How Bad Is It?

I have looked at the .NET Core code to find out how long the CPU will keep spinning when the lock is not released before calling into WaitForSingleObject to pay for the "expensive" context switch. A context switch costs somewhere in the microsecond region and becomes much slower when many threads are waiting on the same kernel object.

.NET locks multiply the maximum spin duration by the number of cores, which has the fully contended case in mind where every core has a thread waiting for the same lock, and they try to spin long enough to give everyone a chance to work a bit before paying for the kernel call. Spinning inside .NET uses an exponential back off algorithm where spinning starts with 50 pause calls in a loop, and for each iteration the number of spins is multiplied by 3 until the next spin count would become greater than the maximum spin duration. I have calculated the total time a thread would spin on pre-Skylake and current Skylake CPUs for various core counts:


Below is some simplified code how .NET Locks perform spinning:

/// <summary>
/// This is how .NET is spinning during lock contention minus the Lock taking/SwitchToThread/Sleep calls
/// </summary>
/// <param name="nCores"></param>
void Spin(int nCores)
	const int dwRepetitions = 10;
	const int dwInitialDuration = 0x32;
	const int dwBackOffFactor = 3;
	int dwMaximumDuration = 20 * 1000 * nCores;

	for (int i = 0; i < dwRepetitions; i++)
		int duration = dwInitialDuration;
			for (int k = 0; k < duration; k++)
			duration *= dwBackOffFactor;
		while (duration < dwMaximumDuration);

The old spinning times were in the millisecond region (19 ms for 24 cores), which is already quite a lot compared to the often cited high cost of context switches, which are an order of magnitude faster. But with Skylake CPUs the total CPU spinning time for a contended lock has exploded, and we will spin for up to 246 ms on a 24 or 48 core machine, only because the latency of the new Intel CPUs has increased the pause instruction by a factor of 14. Is this really the case? I have created a small tester to check full CPU spinning, and the calculated numbers nicely match my expectations. I have 48 threads waiting on a 24 core machine for a single lock where I call Monitor.PulseAll to let the race begin:


Only one thread will win the race, but 47 threads will spin until they give up. This is experimental evidence that we indeed have a real issue with CPU consumption, and that very long spin times are a real issue. Excessive spinning hurts scalability because CPU cycles are burned where other threads might need the CPU, although the usage of the pause instruction frees up some of the shared CPU resources while "sleeping" for longer times. The reason for spinning is to acquire the lock quickly without going to the kernel. If that holds, the increased CPU consumption might not look good in Task Manager, but it should not influence performance at all as long as there are cores left for other tasks. But the tests did show that nearly single threaded operations, where one thread adds something to a work queue while the worker thread waits for work and then performs some task with the work item, are slowed down as well.

The reason for that is best shown with a diagram. Spinning for a contended lock happens in rounds where the amount of spinning is tripled after each round. After each spin round the lock checks again if the current thread can take it. While spinning, the lock tries to be fair and switches over to other threads from time to time to help the other thread(s) complete their work. That increases the chance that the lock has been released when we check again later. The problem is that the lock only checks if it can be taken after a complete spin round has finished:


If e.g. during Spin Round 5 the lock becomes signaled right after we started Round 5, we wait for the complete spin round before we can acquire the lock. By calculating the spin duration of the last round we can estimate the worst case delay that can happen to our thread:


Those are many milliseconds we can wait until spinning has completed. Is that a real issue?

I have created a simple test application that implements a producer consumer queue where the worker thread works 10 ms on each work item and the sender waits 1-9 ms before sending in the next work item. That is sufficient to see the effect:


We see for sender thread delays of one and two ms a total duration of 2.2s, whereas for the other delays we are twice as fast with ca. 1.2s. This shows that excessive CPU spinning is not only a cosmetic issue which hurts heavily multithreaded applications, but also hurts simple producer consumer threading which involves only two threads. For the run above the ETW data speaks for itself: the increased CPU spinning is really the cause of the observed delay:


When we zoom into the slow section we find, in red, 11 ms of spinning, although the worker (light blue) has completed its work and released the lock a long time ago.


The fast non degenerate case looks much better, where only 1 ms is spent spinning for the lock.


The test application I used is named SkylakeXPause and is located at https://1drv.ms/u/s!AhcFq7XO98yJgsMDiyTk6ZEt9pDXGA which contains a zip file with the source code and the binaries for .NET Core and .NET 4.5. What I actually did to compare things was to install on the Skylake machine .NET 4.8 Preview, which contains the fixes, and .NET Core 2.0, which still implements the old spinning behavior. The application targets .NET Standard 2.0 and .NET 4.5, which produces an exe and a dll. Now I can test the old and new spinning behavior side by side without the need to patch anything, which is very convenient.

readonly object _LockObject = new object();
int WorkItems;
int CompletedWorkItems;
Barrier SyncPoint;

void RunSlowTest()
{
	const int processingTimeinMs = 10;
	const int WorkItemsToSend = 100;
	Console.WriteLine($"Worker thread works {processingTimeinMs} ms for {WorkItemsToSend} times");

	// Test one sender one receiver thread with different timings when the sender wakes up again
	// to send the next work item
	double[] sendDelayTimes = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

	foreach (var sendDelay in sendDelayTimes)
	{
		// synchronize worker and sender. Ensure that worker starts first
		SyncPoint = new Barrier(2);  // one sender one receiver

		var sw = Stopwatch.StartNew();
		Parallel.Invoke(() => Sender(workItems: WorkItemsToSend,          delayInMs: sendDelay),
						() => Worker(maxWorkItemsToWork: WorkItemsToSend, workItemProcessTimeInMs: processingTimeinMs));
		Console.WriteLine($"Send Delay: {sendDelay:F1} ms Work completed in {sw.Elapsed.TotalSeconds:F3} s");
		Thread.Sleep(100);  // show some gap in ETW data so we can differentiate the test runs
	}
}

/// <summary>
/// Simulate a worker thread which consumes CPU which is triggered by the Sender thread
/// </summary>
void Worker(int maxWorkItemsToWork, double workItemProcessTimeInMs)
{
	SyncPoint.SignalAndWait();  // wait until sender is ready as well

	while (CompletedWorkItems != maxWorkItemsToWork)
	{
		lock (_LockObject)
		{
			if (WorkItems == 0)
			{
				Monitor.Wait(_LockObject); // wait for work
			}

			for (int i = 0; i < WorkItems; i++)
			{
				SimulateWork(workItemProcessTimeInMs); // consume CPU under this lock
				CompletedWorkItems++;
			}

			WorkItems = 0;
		}
	}
}

/// <summary>
/// Insert work for the Worker thread under a lock and wake up the worker thread n times
/// </summary>
void Sender(int workItems, double delayInMs)
{
	CompletedWorkItems = 0; // delete previous work
	SyncPoint.SignalAndWait();  // wait until worker is running

	for (int i = 0; i < workItems; i++)
	{
		lock (_LockObject)
		{
			WorkItems++;
			Monitor.PulseAll(_LockObject); // signal worker that new work is there
		}
		Thread.Sleep(TimeSpan.FromMilliseconds(delayInMs)); // wait before sending the next work item
	}
}

/// <summary>
/// Busy wait to simulate CPU bound work for the given time
/// </summary>
void SimulateWork(double ms)
{
	var sw = Stopwatch.StartNew();
	while (sw.Elapsed.TotalMilliseconds < ms)
	{
	}
}

This is not a .NET issue. It affects all spinlock implementations which use the pause instruction. I have done a quick check of the Windows Server 2016 kernel, but no such issue is visible there. Looks like Intel was kind enough to give them a hint that some changes in the spinning strategy were needed.

When the issue was reported to .NET Core in August 2017, it was already fixed in September 2017 and pushed out with .NET Core 2.0.3 (https://github.com/dotnet/coreclr/issues/13388). Not only is the reaction speed of the .NET Core team amazing, the issue has also been fixed on the Mono branch a few days ago, and discussions about even more spinning improvements are ongoing. Unfortunately the Desktop .NET Framework is not moving as fast, but with .NET Framework 4.8 Preview we have at least a proof of concept that the fixes work there as well. Now I am waiting for the backport to .NET 4.7.2 to be able to use .NET at full speed also on the latest hardware. This was my first bug which was directly related to a performance change of one CPU instruction. ETW remains the profiling tool of choice on Windows. If I had a wish, I would ask Microsoft to port the ETW infrastructure to Linux, because the current performance tooling on Linux still falls short. There were some interesting kernel capabilities added recently, but an analysis tool like WPA remains yet to be seen there.

If you are running .NET Core 2.0 or the desktop .NET Framework on CPUs produced since mid 2017, you should definitely check your application with a profiler to see if you are running at reduced speed due to this issue, and upgrade to the newer .NET Core and hopefully soon .NET Desktop version. My test application can tell you if you could be affected:

D:\SkylakeXPause\bin\Release\netcoreapp2.0>dotnet SkylakeXPause.dll -check
Did call pause 1,000,000 in 3.5990 ms, Processors: 8
No SkylakeX problem detected


D:\SkylakeXPause\SkylakeXPause\bin\Release\net45>SkylakeXPause.exe -check
Did call pause 1,000,000 in 3.6195 ms, Processors: 8
No SkylakeX problem detected

The tool will report an issue only if you are running a non-fixed .NET Framework on a Skylake CPU. I hope you found the issue as fascinating as I did. To really understand an issue you need to create a reproducer which allows you to experiment and find all relevant influencing factors. The rest is just boring work, but now I understand the reasons and consequences of CPU spinning much better.

* Denotes changes to make things more clear and add new insights. This article has gained quite some traction at Hacker News (https://news.ycombinator.com/item?id=17336853) and Reddit  (https://www.reddit.com/r/programming/comments/8ry9u6/why_skylake_cpus_are_sometimes_50_slower/). It is even mentioned at Wikipedia (https://en.wikipedia.org/wiki/Skylake_(microarchitecture)). Wow. Thanks for the interest.