Hybrid CPU Performance on Windows 10 and 11

What are hybrid CPUs? Hybrid CPUs combine fast but power-hungry cores (P-cores) with slower but more energy-efficient cores (E-cores) in one CPU. All consumer grade desktop CPUs from Intel since Alder Lake (CPU number 12xxx or greater, which was released in 2022) are hybrid CPUs. If you have recently bought an Intel CPU the chances are nearly 100% that you have bought a hybrid CPU.

In theory, everything should run faster with newer CPUs. But in practice there are issues if you want to run software which was not prepared for hybrid CPUs, e.g. if you need to replace old machines with newer ones that run your software on hybrid CPUs but still need Windows 10.

There are many great reviews online which talk about the performance of hybrid CPUs on Windows. Some have also noticed that there are issues with processes with below normal priority:

Intel gave an example of a content creator, exporting a video, and while that was processing going to edit some images. This puts the video export on the efficiency cores, while the image editor gets the performance cores. In my experience, the limiting factor in that scenario is the video export, not the image editor – what should take a unit of time on the P-cores now suddenly takes 2-3x on the E-cores while I’m doing something else. This extends to anyone who multi-tasks during a heavy workload, such as programmers waiting for the latest compile. Under this philosophy, the user would have to keep the important window in focus at all times. Beyond this, any software that spawns heavy compute threads in the background, without the potential for focus, would also be placed on the E-cores.

Personally, I think this is a crazy way to do things, especially on a desktop. Intel tells me there are three ways to stop this behavior:

  1. Running dual monitors stops it
  2. Changing Windows Power Plan from Balanced to High Performance stops it
  3. There’s an option in the BIOS that, when enabled, means the Scroll Lock can be used to disable/park the E-cores, meaning nothing will be scheduled on them when the Scroll Lock is active.
From https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid

Many CPU hungry number crunchers like compilers, video editors, ray tracers and other software try to be nice to the user and lower the priority of their threads or computation child processes, so that the user still experiences a fluid system although the CPU is doing hard work on all cores. That approach has worked pretty well for a long time, but it has some unintended side effects which are not well understood once hybrid CPUs enter the game.
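A typical way such a tool lowers its own priority looks like this in C# (a minimal sketch, not taken from any particular product):

using System.Diagnostics;

internal class BackgroundWork
{
    static void Main()
    {
        // Lower the priority of the whole process so interactive applications stay responsive.
        // On hybrid CPUs this also influences on which core type the threads end up.
        Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.BelowNormal;

        // ... start the CPU heavy work (compile, encode, render, ...) here ...
    }
}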

Windows 10

Let’s do a simple experiment. We will use Windows 10 and CPU Stress from Sysinternals where we create 7 threads which run inside a Below Normal process. I use 7 because some things might already be running, so I give it some headroom on the 8th E-Core.

When we observe the CPU load in Task Manager we find that all threads of a Below Normal process run on the E-cores only (the last 8 ones on an i7-13700). If we raise the process priority back to Normal, the load is scheduled onto the P-cores and everything is fast again if the load was e.g. an expensive compile or video editing step.

But we can achieve the same effect by simply selecting the CPU Stress window. That will cause Windows to raise the CPU priority of the foreground process, and suddenly mitigation option 1 mentioned by Dr. Ian Cutress makes sense. If the application has multiple processes which form a UI you might just put the focus on the right one to speed up things again.

Disabling the E-cores (mitigation 3) is possible in some, but not all, BIOS versions. What most offer is that you can forcefully park the E-cores so no work is scheduled onto them. In my BIOS this is called Legacy Game Mode and is activated with the Scroll Lock hotkey.

That fix works, but you need to remember to always toggle the hotkey to turn off the E-cores. Besides the usability issue, the main problem is that truly multithreaded applications which utilize e.g. all cores will experience large slowdowns because they now suffer from thread oversubscription. The application sees 20 cores (e.g. i5-13600K), but due to core parking 20 threads are squeezed onto 12 cores, which means that 8 threads will compete to run on the same cores, leading to thread contention.

Power Plans

Switching to the High Performance power plan (mitigation option 2) works. What is not explained is why it works. And what are the differences between the power plans on Windows anyway? On the surface it looks easy. You have 3 default plans

  • Power Saver
  • Balanced
  • High Performance

and you can choose one of them.

By default there are only two settings visible in the UI

  • Minimum/Maximum Processor State

and that’s it. But in reality, on a current Windows 11 (23H2) machine, you have 75 different settings. Are all of them relevant? No. By far most of the settings come from Intel and configure SpeedStep/SpeedShift for quicker voltage/frequency changes than the OS can do on its own when the CPU load changes.
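If you want to look at these settings without extra tooling, powercfg can dump them; the heterogeneous scheduling settings used later in this post are hidden by default and can be unhidden first (SUB_PROCESSOR, HETEROPOLICY and SCHEDPOLICY are the alias names which also appear in the scripts below):

rem Unhide the heterogeneous scheduling settings so they show up in queries and in the UI
powercfg /attributes SUB_PROCESSOR HETEROPOLICY -ATTRIB_HIDE
powercfg /attributes SUB_PROCESSOR SCHEDPOLICY -ATTRIB_HIDE

rem Dump the processor settings of the currently active power plan
powercfg /query SCHEME_CURRENT SUB_PROCESSOR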

Those are a lot of settings. What are the differences between the three main power plans? I have updated ETWAnalyzer to answer that question.

To dump all extracted files whose file name contains Win10, printing only the properties which differ, use

ETWAnalyzer -dump power -fd win10 -diff

Then you get a nice list of the changes. Now we also know why changing the Power profile from Balanced to High Performance will solve the E-Core oversubscription issue:

The High Performance profile has the Hetero Thread Scheduling Policy set to PreferPerformantProcessors instead of Automatic, which for below normal priority processes apparently means: prefer the E-cores. That is the setting which fixes the low priority process CPU problem on Windows 10.

Since we can compare power profiles, it is now also easy to compare them between Windows 10 and 11 by exporting the data to CSV with ETWAnalyzer. Because the result had a lot of columns I just needed to transpose the table and arrived at this:

Orange marked entries are new on Windows 11 which did not exist on Windows 10. Red fields are values which have changed in Windows 11. There are quite a few changes visible, but only a few of them make a measurable difference with simple load tests.

E.g. the main setting which changes everything between Balanced and Power Saver is MaxThrottleFrequency % Class 1, a.k.a. Maximum processor state for Processor Power Efficiency Class 1, which defines how high the P-cores can boost. The Power Saver profile sets that to 75%. Similar is the MaxEfficiencyFrequency MHz Class 1, a.k.a. Maximum processor frequency for Processor Power Efficiency Class 1, setting which lets you control the P-core frequency directly. If set, it overrides the MaxThrottleFrequency % setting.
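As a hedged example of turning such a knob yourself: PROCTHROTTLEMAX is the documented powercfg alias for Maximum processor state; I assume here that the Class 1 variant simply carries a trailing 1, so verify with powercfg /aliases on your machine before relying on it:

rem Cap the boost of the Processor Power Efficiency Class 1 cores (the P-cores) at 75%, like Power Saver does
rem Assumption: the Class 1 alias is PROCTHROTTLEMAX1 - check the output of: powercfg /aliases
powercfg /setacvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCTHROTTLEMAX1 75
powercfg /setactive SCHEME_CURRENT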

Windows 10/11 Scheduling Differences

The main knobs in the power profile are

  • Heterogeneous policy in effect
    • Values 0-4
    • Default is 4 on Windows 10
    • Default is 0 on Windows 11, except for the Power Saver profile where it is 4.
  • Heterogeneous thread scheduling policy
    • Values 0-5
      • Default is Automatic on Windows 11
      • PreferPerformantProcessors or Automatic on Windows 10, depending on the used power profile
    • 0=AllProcessors
    • 1=PerformantProcessors
    • 2=PreferPerformantProcessors
    • 3=EfficientProcessors
    • 4=PreferEfficientProcessors
    • 5=Automatic

I have created a simple load test which starts a number of busy threads for 10 s and can run at 3 process priorities (BelowNormal, Normal, AboveNormal). A script then configures all possible combinations of both policies and runs the test. Here is the code of my simple LoadStress app:

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

namespace LoadStress
{
    internal class Program
    {
        static void Main(string[] args)
        {
            var total = Stopwatch.StartNew();

            // Busy loop for 10 s of wall clock time: consumes a fixed amount of CPU time
            // independent of CPU frequency and core type.
            Action acc = () =>
            {
                var sw = Stopwatch.StartNew();
                while (sw.Elapsed.TotalSeconds < 10)
                {
                }
            };

            if (args.Length < 2)
            {
                Console.WriteLine("LoadStress [Idle, BelowNormal, Normal, AboveNormal, High] nThreads");
                return;
            }

            string prio = args[0];
            if (!Enum.TryParse(prio, true, out ProcessPriorityClass prioEnum))
            {
                Console.WriteLine("Process Priority can be Idle, BelowNormal, Normal, AboveNormal, High");
                return;
            }

            Process.GetCurrentProcess().PriorityClass = prioEnum;

            int threads = int.Parse(args[1]);

            // Run the busy loop on n threads in parallel.
            var delegates = Enumerable.Repeat(acc, threads).ToArray();

            Parallel.Invoke(delegates);
            total.Stop();
            Console.WriteLine($"Did start {threads} threads in process with priority {prioEnum} in {total.Elapsed.TotalSeconds:F1} s");
        }
    }
}

The key insight is that I want to measure CPU scheduling and not CPU performance. I simply start n threads and run a busy loop until 10 s of wall clock time have elapsed, which is independent of the CPU frequency and type. That way I always consume the same amount of CPU time regardless of which CPU type I am running on: if no thread contention shows up, I know that one thread will have consumed 10 s.

One run modifies the current power profile (I used Balanced as the base) and then starts the LoadStress application:

powercfg /SETACVALUEINDEX SCHEME_CURRENT SUB_PROCESSOR HETEROPOLICY 3
powercfg /SETACVALUEINDEX SCHEME_CURRENT SUB_PROCESSOR SCHEDPOLICY 1
powercfg /SETACTIVE SCHEME_CURRENT
LoadStress.exe BelowNormal 12
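The surrounding test script is not part of this post; a minimal sketch of how such a combination loop could look as a batch file (the alias names are the ones from the snippet above, the echo line is only there to label the runs) is:

@echo off
rem Sketch: iterate over all Hetero Policy (0-4) and Hetero Thread Policy (0-5) combinations
for /L %%h in (0,1,4) do (
  for /L %%t in (0,1,5) do (
    powercfg /SETACVALUEINDEX SCHEME_CURRENT SUB_PROCESSOR HETEROPOLICY %%h
    powercfg /SETACVALUEINDEX SCHEME_CURRENT SUB_PROCESSOR SCHEDPOLICY %%t
    powercfg /SETACTIVE SCHEME_CURRENT
    for %%p in (BelowNormal Normal AboveNormal) do (
      echo Hetero=%%h ThreadPolicy=%%t Priority=%%p
      LoadStress.exe %%p 12
    )
  )
)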

The test application consumes 120 s of CPU. Both tested CPUs (i5-13600 and i7-13700) have 8 E-cores, and 6 respectively 8 P-cores with Hyper-Threading enabled, resulting in a total logical core count of 20 and 24. What is the most performant thread scheduling for 12 threads? Either 8 P-cores and 4 E-cores are used, or, if P-cores are preferred, the load goes completely onto 12 logical P-cores. E-cores are by a factor of 2-3x slower than the P-cores, but if you put concurrent load on both hyper-threaded siblings of a P-core you are also not getting much faster than on a single P-core. In the end it depends on the specific load type whether you gain anything from mixed P/E-core scheduling or not.

Based on that I would expect a total CPU consumption of 120 s; in the optimal case 80 s of load on the P-cores and up to 40 s on the E-cores if the hyper-threaded sibling cores are skipped in favor of the E-cores.

If the combined CPU consumption on P/E-cores is below 120 s we have thread starvation issues, because the OS schedules multiple threads onto the same core, leading to thread contention. If you are reading this you are usually aiming at 80-120 s of CPU spent on the P-cores and the remaining 40-0 s spent on the E-cores. Let's check how Windows 10 and 11 behave when we change the settings and use either the Balanced or the High Performance profile as base profile while switching Hetero Policy, Hetero Thread Policy and process priority. This is already quite a matrix of 5*6*3*2 = 180 combinations, which is difficult to visualize with any chart type. In the end I opted for a simple table with embedded graphs over the actual values, which gives the most insight.

If you aim to force the workload of Below Normal processes onto the P-cores on Windows 10, you should use Hetero Policy 3, which prefers the P-cores for Below Normal processes and reaches the highest value of 110 s out of 120 s.

Using the High Performance profile, as suggested by Dr. Ian Cutress, solves the problem only partially; it uses Hetero Policy 4 and Hetero Thread Scheduling PerformantProcessors. The many other configuration knobs did not make a measurable difference for our synthetic, non-spiky load.

Normally we would expect a total CPU of 120 s running on 12 or more cores, which is nicely visualized in WPA by logical CPU number.

But sometimes on Windows 10 strange things happen when we confine Hetero Thread scheduling to Efficiency or Performance cores. In this case not 12 cores were used but only 2 during the 10s runtime with 12 threads:

Why this is happening I am not sure. If you use the default Windows 10 policy Hetero 4 and confine the load via Hetero Thread Scheduling to specific core types, you can be penalized by the Windows 10 scheduler. See Windows 10 – Hetero 4 – PerformantProcessors for AboveNormal and Normal processes, where in the worst case we get only one core assigned during the application runtime. Why is this an issue?

Because these are the standard settings of the High Performance power profile, which is the recommended mitigation for the E-core preference of the Balanced power profile! My tests show that the recommended solution can make matters worse on Windows 10.

The used Hetero Policies 0-4 are not documented by Microsoft, but Intel has published some documentation for the values 0 and 4:

Setting Value: 0 (i.e., Standard Parking or Favored Core Parking)
In this configuration, the optimum set of compute cores are unparked starting with the most performant cores first.

SettingValue: 4 (i.e., Hetero Parking)
In this configuration, based off utilization, a combination of most performant or most efficient cores are unparked first.
In certain scenarios like low power envelope SKUs or better battery life goals, it can be more efficient to run low utilization work on cores with higher efficiency capability at efficient frequency. This policy is used in these scenarios in combination with optimal performance state engine settings.

https://www.intel.com/content/dam/develop/external/us/en/documents-tps/348851-optimizing-x86-hybrid-cpus.pdf

Based on my measurements, the scheduling behavior of the 5 Hetero Policies on Windows 10 can be summarized as:

  • Hetero 0: Distribute load across all cores with a preference for P-cores (default on Windows 11).
  • Hetero 1: Same as 0.
  • Hetero 2: Confine work entirely to E-cores.
  • Hetero 3: Distribute load onto P-cores more strongly than Hetero 0, but allow E-core usage.
  • Hetero 4: Prefer E-cores for low priority tasks and constrain load to some core types when PerformantProcessors or EfficientProcessors is used as Hetero Thread Scheduling option (default on Windows 10).

So what is best? Hetero Thread Scheduling Automatic puts the load of our Below Normal priority process onto the E-cores, which is the default in the Balanced power profile. The numbers look stable, just the wrong cores are used:

Now go back to the High Performance Profile defaults

We get P-cores, but erratic scheduling where, for unknown reasons, not all of the needed 12 P-cores are utilized. This issue happens sporadically, not all the time. If you have short running tests everything will look fine, but if you let the 10 s LoadStress run 20 times you find erratic scheduling. Based on the measurements I would say this is a bug in the kernel scheduler which affects CPU heavy workloads that use many cores. If the load is constant it will settle after some time, but the ramp up phase until the kernel grants all needed cores takes many seconds, as the tests have shown. This issue should hurt short lived processes which employ many threads on Windows 10 on hybrid CPUs with the High Performance power profile.

Solution

Currently the best way out seems to use

  • Hetero Policy 3
  • Hetero Thread Policy AllProcessors

This will first fill all P-cores and then overflow the rest onto the E-cores, which should give you a better mix than the random shuffling which sometimes happens with the default profiles. Background processes are not unimportant to the user just because they have lower priority. Perhaps they are running the video AI enhancement or the compile step in the background and have lowered their process priority only because they do not want to freeze your machine.
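Applied with powercfg this uses the same aliases as in the test script above; AllProcessors corresponds to the value 0 of the Hetero Thread Scheduling Policy listed earlier:

powercfg /SETACVALUEINDEX SCHEME_CURRENT SUB_PROCESSOR HETEROPOLICY 3
powercfg /SETACVALUEINDEX SCHEME_CURRENT SUB_PROCESSOR SCHEDPOLICY 0
powercfg /SETACTIVE SCHEME_CURRENT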

Windows 11

So far I have talked mostly about Windows 10. What about Windows 11? The results are much less dramatic thanks to the Intel Thread Director. From the numbers it looks like there is not much difference between the Hetero settings at all. The only impactful setting which is strictly adhered to is the Hetero Thread Policy EfficientProcessors, which confines the load entirely to the E-cores. A deeper inspection will need to wait for another day. This beast is quite smart: it detects specific CPU instructions like pause/mwait and other things which consume cycles but do not need to be fast and can therefore safely be put onto E-cores. To properly characterize it, a lot of different workloads would need to be tested, which is outside the scope of a simple blog post.

Conclusions

I do not want to believe. I want to measure things and verify whether the claims are true. Profiling with ETW is extremely helpful to find out later what caused unexpected anomalies during measurements. In total I generated 600 ETW files with a size of 90 GB. After extraction with ETWAnalyzer, 800 MB of Json files were left, which are much quicker to query than opening and looking at 600 files with WPA. To find e.g. all .etl files which have scheduling issues with specific settings, I just need to query all LoadStress processes and filter for processes which have CPU > 115 s. All files with less CPU will show up as 0 CPU, and these are the interesting ones. Good naming of the files is essential to make such an approach work. But once you have the data it is easy:

EtwAnalyzer -dump CPU -pn LoadStress -ShowTotal Total -fd *_Normal* -MinMaxCPUMs 115s   

That simple query immediately shows that the Windows 11 thread scheduling is rather unspectacular in terms of thread contention. The only case where thread contention happens is when the load is confined to efficient processors. That would be the case for our load because we use 12 threads, but we have only 8 E-cores.

Power profile settings are a deeply underexplored topic where many myths are circulating in the (gaming) forums. I hope that the provided tooling will serve as a foundation for future quantitative measurements performed by other people. Many people seem to like to flip some switches and believe they have solved a performance problem. Without measuring you will never be able to tell. Your gut feeling is not exact enough to serve as a foundation for solving performance issues. Only by measuring and visualizing the previously unknown will you find new patterns and correlations which help you understand your system better. Your mental performance model is always too simple compared to what is really happening inside your computer.

That’s all for today. Now start measuring on your own!

Windows 10 April update can cause data corruption

Starting with the Windows 10 update KB 5025221 I had issues attaching files to mails in Outlook. Drag and drop became unreliable and would sometimes silently do nothing. When I tried the Attach File dialog it would work most of the time, but sometimes fail with an error dialog.

Update 2

The issue is fixed with June 2023 KB5027215 .

Update

MS has listed for this KB a Known Issue Rollback (KIR), which is an MSI that uninstalls the specific portions of the problematic update. The detailed description shows that only CopyFile is affected, but not MoveFile. My tests confirm this. More details about KIRs can be found at https://techcommunity.microsoft.com/t5/windows-it-pro-blog/known-issue-rollback-helping-you-keep-windows-devices-protected/ba-p/217683.

That was annoying but not really problematic. More serious was that sometimes I could not open attached PDF files. This resulted in a file not found error box. The issue was sporadic enough, and an Outlook restart always fixed it. But sometimes I ended up with empty files

which was not so good. Also, in some mails the attachments (PDFs) were set to 0 bytes, which was real data corruption.

Reason enough to drill deeper into what was going on. The first thing I would take is a procmon trace. If someone interferes (Antivirus is always a good suspect) we should see some access denied or other file deletion events there. After a first look I did not find a single failed IO operation in the procmon traces, which was strange.

At least I learned how Outlook works. When you double click an attachment in Outlook it copies the data from the mail database .pst/.ost file to a temp file located in

C:\Users\%username%\AppData\Local\Microsoft\Windows\INetCache\Content.Outlook\????\Src.pdf

Then it copies the file again into …\Src (002).pdf which is the file which is then opened by the actual PDF reader.

When the reader is not just a reader but a writer, Outlook will check if the second copy differs from the first copy and will ask if you want to save the changes in Outlook. If you do this manually you can still go back by asking Outlook for an earlier version, which is still present. So far so good. But in the corrupted mails no go back button for a previous version was present.

Dragging and dropping a file from Explorer into a new mail works similarly. First the file is copied to %temp%\xxxx.tmp, from there it is copied again to C:\Users\%username%\AppData\Local\Microsoft\Windows\INetCache\Content.Outlook\????\YourFile.pdf, and then it is finally stored in your mail database .pst/.ost file.

Outlook is copying quite a few files around, which is the reason why it uses the default Windows CopyFile method. When looking deeper with ETW tracing I could compare a good vs. a bad attach run. They differed in the duration of the CopyFile operation, which mysteriously failed and ran much shorter than in the good case:

|- OUTLOOK.EXE!HrSaveAttachmentsWorker
|    OUTLOOK.EXE!HrShowSaveAttachDialog
|    |- OUTLOOK.EXE!HrInvoke
|    |    OUTLOOK.EXE!HrInvokeHelper
|    |    OUTLOOK.EXE!StdDispatch::Invoke
|    |    OUTLOOK.EXE!_RenDispatchCall
|    |    OUTLOOK.EXE!AttachmentObject::HrSaveAsFile
|    |    |- OUTLOOK.EXE!AttachmentObject::HrCopyTempFileToLocation
|    |    |    KernelBase.dll!CopyFileW
|    |    |    KernelBase.dll!CopyFileExW
|    |    |    |- KernelBase.dll!BasepCopyFileExW
|    |    |    |- KernelBase.dll!CloseHandle
|    |    |- OUTLOOK.EXE!FDoesFileExistW
|    |- OUTLOOK.EXE!RenFFDialog::HrGetPath
|- OUTLOOK.EXE!DisplayErrorContext

The good thing is that since a few months Microsoft also delivers public symbols for its Office products, which read much better than Ordinal+xxxx stack traces. Since it is not possible to see more with ETW profiling, one would need to debug Outlook. The good thing is that this is possible now with the recently released new Windbg, which allows you to attach to a running process and capture the executed CPU instructions and accessed memory. The way this works is fascinating. At debugging start a full memory dump is taken, then CPU tracing is enabled which captures the complete CPU instruction stream. The end result is not a static memory dump but a memory dump in which you can debug forward and backward! For more info see https://aloiskraus.wordpress.com/2017/09/26/it-is-time-for-time-travel/. With that I could record a failed attach incident and send it to MS support for further analysis of where it failed. The answer was that a regression had happened in the mentioned KB update which kicks in when the following conditions occur:

  1. Windows 10 with KB 5025221 from April or May Update
  2. Process is 32 bit
  3. Process is Large Address Aware
  4. It calls into CopyFile / MoveFile
  5. A pointer to NtCreateFile with a pointer address > 2 GB is passed

When these things happen, the CreateFile call bails out early, even before procmon or other filter drivers are able to catch the return code of the failed CreateFile call. Quite a nasty thing, which also explains why it took so long to find out.

With that knowledge we can examine the issue a bit further. We can write a small test application which calls CopyFile in a 32 bit process which has allocated more than 2 GB of memory. The easiest way to achieve this is to use a few lines of C# like this:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;

namespace FileCopyBug
{
    internal class Program
    {
        static void Main(string[] args)
        {
            if (args.Length < 2)
            {
                Console.WriteLine("FileCopyBug src dest [-noalloc]");
                return;
            }

            Queue<string> queue = new Queue<string>(args);
            bool bAlloc = true;
            string src = null;
            string dest = null;

            // Parse arguments: first non-option argument is the source, the second one the destination.
            while (queue.Count > 0)
            {
                string arg = queue.Dequeue();
                if (arg == "-noalloc")
                {
                    Console.WriteLine("Skip allocations.");
                    bAlloc = false;
                    continue;
                }

                if (src == null)
                {
                    src = arg;
                }
                else if (dest == null)
                {
                    dest = arg;
                }
            }

            if (bAlloc) // allocate > 2 GB of memory to force pointers > 2 GB
            {
                int n = 0;
                const int ChunkBytes = 500;  // allocate small chunks to not leave holes
                const int nMax = 4600000;    // 4.6 million * 500 bytes = ca. 2.3 GB
                while (++n < nMax)
                {
                    Marshal.AllocHGlobal(ChunkBytes);
                }
            }

            Console.WriteLine($"Allocated memory: {Process.GetCurrentProcess().PrivateMemorySize64:N0} bytes");

            // Call KernelBase.dll CopyFile which exhibits the bug when pointers > 2 GB are used for the
            // Extended Attributes passed to NtCreateFile, which then bails out with an access violation
            // c0000005 as NTSTATUS return code.
            File.Copy(src, dest, true);
        }
    }
}

The compiled binary and code can be downloaded from my OneDrive here. If you are running Windows 10 with the latest updates you can try to copy a file to a new location just like with any copy tool:

c:\tmp>FileCopyBug.exe src.pdf dsg.pdf
Allocated memory: 2,485,137,408 bytes

Unhandled Exception: System.IO.IOException: Invalid access to memory location.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.File.InternalCopy(String sourceFileName, String destFileName, Boolean overwrite, Boolean checkHost)
   at System.IO.File.Copy(String sourceFileName, String destFileName, Boolean overwrite)
   at FileCopyBug.Program.Main(String[] args)

That really is the Access Violation code which was returned by NtCreateFile and correctly propagated by CopyFile. When you add the parameter -noalloc, no additional memory is allocated and the copy operation works. I have a machine with the January Windows 10 version around where the FileCopy issue is not present. The copy also works on Windows 11.
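If you want to rule out the .NET File.Copy wrapper entirely, you can call the Win32 API directly via P/Invoke. Below is a minimal sketch (assuming the same > 2 GB allocation trick as above in a 32 bit, large address aware process); on an affected machine the failure should surface as ERROR_NOACCESS (998, "Invalid access to memory location"):

using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

internal static class NativeCopy
{
    // CopyFileW is the kernel32/KernelBase API which File.Copy and Outlook end up calling.
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool CopyFileW(string lpExistingFileName, string lpNewFileName, bool bFailIfExists);

    static void Main(string[] args)
    {
        if (!CopyFileW(args[0], args[1], false))
        {
            // Surfaces the Win32 error, e.g. 998 "Invalid access to memory location" on affected builds
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }
    }
}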

Since we have come this far, we can now debug our test application directly and see where this leads.

The steps are

  1. Create a breakpoint for CopyFile (bp kernelbase!CopyFileW)
  2. Run until we enter CopyFile (g)
  3. Create a breakpoint for NtCreateFile (bp ntdll!NtCreateFile)
  4. Run until we hit NtCreateFile for the source file (g)
  5. Print file name dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
  6. Leave method to check return code (gu), return code is in eax. 0 is success.
  7. Run until we hit NtCreateFile for the destination file (g)
  8. Print file name dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
  9. Leave method to check return code (gu), return code is in eax with c0000005 which is the error we are after.
0:001> bp kernelbase!CopyFileW
0:001> g
Breakpoint 0 hit
Time Travel Position: FBCD:1ED4
eax=28c67690 ebx=02af91b4 ecx=82e96558 edx=02b50000 esi=5a13bdb0 edi=02afa810
eip=76a137f0 esp=02af5028 ebp=02af52b8 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200206
KERNELBASE!CopyFileW:
76a137f0 8bff            mov     edi,edi
0:006> bp ntdll!NtCreateFile
0:006> g
Breakpoint 1 hit
Time Travel Position: FBCE:72E
eax=02af4900 ebx=00000000 ecx=00000005 edx=02af474c esi=80000000 edi=80000000
eip=77c13000 esp=02af45ac ebp=02af4f8c iopl=0         nv up ei ng nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200282
ntdll!NtCreateFile:
Unable to load image C:\Program Files (x86)\Microsoft Office\root\Office16\AppVIsvSubsystems32.dll, Win32 error 0n2
77c13000 e96b11cfdf      jmp     AppVIsvSubsystems32!vfs_hooks::hooked_NtCreateFile (57904170)
0:006> dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
   +0x000 Length           : 0x18
   +0x004 RootDirectory    : (null) 
   +0x008 ObjectName       : 0x02af4864 _UNICODE_STRING "\??\C:\Users\...\AppData\Local\Microsoft\Windows\INetCache\Content.Word\src.pdf"
   +0x00c Attributes       : 0x40
   +0x010 SecurityDescriptor : (null) 
   +0x014 SecurityQualityOfService : (null) 

0:006> gu
Time Travel Position: FBD2:640
eax=00000000 ebx=00000000 ecx=579026f4 edx=02b50000 esi=80000000 edi=80000000
eip=76a1521e esp=02af45dc ebp=02af4f8c iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200246
KERNELBASE!BasepCopyFileExW+0x187e:
76a1521e 898584f9ffff    mov     dword ptr [ebp-67Ch],eax ss:002b:02af4910=00000000
0:006> g
Breakpoint 1 hit
Time Travel Position: FBED:77
eax=02af2b8c ebx=00000000 ecx=00000005 edx=10000044 esi=c0150081 edi=00002020
eip=77c13000 esp=02af25d8 ebp=02af4598 iopl=0         nv up ei ng nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200286
ntdll!NtCreateFile:
77c13000 e96b11cfdf      jmp     AppVIsvSubsystems32!vfs_hooks::hooked_NtCreateFile (57904170)
0:006> dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
   +0x000 Length           : 0x18
   +0x004 RootDirectory    : (null) 
   +0x008 ObjectName       : 0x02af27a8 _UNICODE_STRING "\??\C:\Users\...\AppData\Local\Microsoft\Windows\INetCache\Content.Outlook\XNKF02BJ\src (002).pdf"
   +0x00c Attributes       : 0x40
   +0x010 SecurityDescriptor : (null) 
   +0x014 SecurityQualityOfService : 0x02af3f18 Void
0:006> gu
Time Travel Position: FBFA:1B8
eax=c0000005 ebx=00000000 ecx=579026f4 edx=02b50000 esi=c0150081 edi=00002020
eip=76a178c2 esp=02af2608 ebp=02af4598 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200246
KERNELBASE!BaseCopyStream+0xd01:
76a178c2 8985e0e5ffff    mov     dword ptr [ebp-1A20h],eax ss:002b:02af2b78=00000000

The nice thing about the x86 WinAPI calling convention is that all parameters are passed on the stack. To see all parameters passed to an exported method with a signature of 11 arguments

__kernel_entry NTSTATUS
NTAPI
NtCreateFile (
    OUT PHANDLE FileHandle,
    IN ACCESS_MASK DesiredAccess,
    IN POBJECT_ATTRIBUTES ObjectAttributes,
    OUT PIO_STATUS_BLOCK IoStatusBlock,
    IN PLARGE_INTEGER AllocationSize OPTIONAL,
    IN ULONG FileAttributes,
    IN ULONG ShareAccess,
    IN ULONG CreateDisposition,
    IN ULONG CreateOptions,
    IN PVOID EaBuffer OPTIONAL,
    IN ULONG EaLength
    );

I can break on method enter and dump the next 12 DWORDs (the first value is the return address) from the stack pointer esp to get the passed method arguments directly:

0:006> dd @esp L0n12
02af25d8  76a17b41 02af2b8c 40150080 02af27ec
02af25e8  02af2a00 02af4ef0 00002020 00000000
02af25f8  00000005 10000044 02af29b8 00000010

This translates to

NtCreateFile( 
FileHandle Pointer = 02af2b8c,
DesiredAccess = 40150080 
   GENERIC_WRITE|SYNCHRONIZE|WRITE_DAC|DELETE|FILE_READ_ATTRIBUTES,
ObjectAttributes Pointer = 02af27ec,
IoStatusBlock Pointer = 02af2a00,
AllocationSize Pointer = 02af4ef0,
FileAttributes = 00002020
FILE_ATTRIBUTE_NORMAL|FILE_ATTRIBUTE_VIRTUAL|FILE_ATTRIBUTE_EA|FILE_ATTRIBUTE_UNPINNED,
ShareAccess = 00000000,
CreateDisposition = 5 
   FILE_OVERWRITE_IF,
CreateOptions = 10000044
   FILE_SEQUENTIAL_ONLY|FILE_NON_DIRECTORY_FILE|FILE_CONTAINS_EXTENDED_CREATE_INFORMATION,
EaBuffer Pointer = 02af29b8 
EaLength = 0x10

We are looking at a Windows 10 machine, which makes me wonder why I find a FILE_CONTAINS_EXTENDED_CREATE_INFORMATION flag which is, according to the docs, a Windows 11 only feature. It looks like Antivirus can scan less when it is clear that this is a local file. Apparently this performance feature was backported to Windows 10 with the April update. To check that we can dump the EaBuffer as an EXTENDED_CREATE_INFORMATION structure.

0:006> dt combase!PEXTENDED_CREATE_INFORMATION @esp+0n10*4
0x02af29b8 
   +0x000 ExtendedCreateFlags : 0n2
   +0x008 EaBuffer         : 0x81811fe8 Void
   +0x00c EaLength         : 0x2382

That sure looks like a valid structure with the EX_CREATE_FLAG_FILE_DEST_OPEN_FOR_COPY flag set. Now we have also found the EaBuffer pointer which is > 2 GB (0x80000000) and which causes NtCreateFile to (wrongly) return an Access Violation return code.

Conclusions

The issue presented here is not specific to 32 bit Outlook; it can affect all 32 bit applications which make use of CopyFile/MoveFile. I have not studied the effect on Word/PowerPoint/Excel, but similar issues might occur. At least you now know how you can find out whether you are affected when you use a debugger. An alternative approach would be to enable ETW SysCall tracing, which allows you to see the return code of every called kernel method. Then you can scan for NtCreateFile and a return code of 0xC0000005, which should never happen.

Let’s hope that a fix is released soon to make these strange sporadic failures go away.

WPA fails with 0x80070032

Sometimes I get ETW traces which cannot be loaded by WPA. Initially I found these on Windows Server 2022 on a VMware machine. I suspected some issue with how VMware virtualizes things on newer OSes, but I got this error also on my new non-virtual Windows 11 box. After asking around it turned out that this is a side effect of .NET 7 processes. When a .NET 7 process is running and you use the default WPR profiles, you will get unreadable traces back.

When you try to process them with TraceProcessing Library you get something like

C:\Temp>etwanalyzer -extract all -fd test.etl 
1 - files found to extract.

Error: One or more errors occurred. (The request is not supported. (0x80070032)) Check C:\Utils\ETWAnalyzer\ETWAnalyzer_Trace.log for full details, or use -debug switch to get full output.


Error:   EventSink {39cda801-59d8-472d-bc00-601a4150c4c4} returned HRESULT 0x80070032 on:
         Event#: 294800   (T#0:#294800)
         TimeStamp:    6644961, Process: 12172, Thread: 8008, Cpu:  10
         ProviderId: {a669021c-c450-4609-a035-5af59af4df18}
         EventDescriptor: 0x0096 0x0001 0x2a   0x01    0x00    0x05  0x0000000000020000
         UserDataLength: 405

The default wpr profiles are affected, but for some reason my custom WPR Profiles (https://github.com/Alois-xx/FileWriter/blob/master/MultiProfile.wprp) are not. The only way to solve this is to use either custom WPR profiles or to kill all .NET 7.0 processes. But which ones?

With ETWAnalyzer it is easy

Etwanalyzer -fd c:\temp\extract\ManySlowCalls_MAGNON.03-07-2023.23-21-41.json  -dump version -dll *coreclr* -versionfilter *7.0* 

Ahh, PhoneExperienceHost.exe is the bad guy. As a workaround all you need to do is kill PhoneExperienceHost and you should get analyzable ETW traces again.
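Killing it from a command prompt is a one-liner:

taskkill /IM PhoneExperienceHost.exe /F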

A future WPA/TraceProcessing version should contain the fix, but until then you need to use one of the workarounds.

War Story: How Antivirus solutions can bring a server down

From a performance point of view Antivirus scanners are the worst things that have happened to computers during the last ca. 5 years. Sometimes it is hard to differentiate the CPU activity of an AV scanner from a bitcoin miner. Both are eating your CPU to give you either virtual security or virtual money.

Whenever something is mysterious and slow I get a call. In this case a beefy Windows Server was working happily, but a few times a day everything would freeze for up to 5 minutes. The solution to get back to speed was to reboot the server when it got stuck. When the server got stuck, it was at 100% CPU for several minutes. That sounds like an obvious CPU problem. Identify the problematic process and uninstall the app which is causing problems.

To identify the issue, ETW traces were taken. Everything was working fine until the ETW profiling events stopped in the recording. Several minutes later the CPU problem started. Was someone tinkering with ETW traces? Shortly before the trace stopped I saw an installer for the Palo Alto Cortex AV solution executing. That was interesting. Normally I record data with xperf, which uses the NT Kernel Logger ETW session. Xperf supports another kernel session named Circular Kernel Context Logger, which has always been there since at least Windows 7. When running ETW tracing for a longer time I record with xperf into a file which is stopped every few minutes. Profiling is then restarted (no merging of the files is done). That way I can later choose the interesting ones from the many files when the issue is reported again. Although I default to a 10 ms sample rate, one still needs up to 40 GB of disk space to record a full day.

The Issue occurred again

Below is a picture of the total CPU consumption over a period of two minutes. Green is Idle (meaning the CPU is free) and red is the CPU consumed by the Windows Sleep API call, which eats ca. 60% of all cores for more than one minute. Why does sleeping consume so much CPU?

There is a specific workload which occurs only on many-core servers: Spinlocks. Spinlocks are locks which burn some CPU on all spinning cores (= trading CPU for decreased latency) before the threads enter the expensive OS lock. The usual implementation of a spinlock uses several stages (a C# sketch follows the list):

  1. Check if the lock can be acquired
  2. If not
    • For(i..n)
      • Burn some CPU cycles by calling the pause instruction in a second loop
      • Check if the lock can be acquired and exit the loop
      • Call Sleep(0) to give another thread a chance to do some useful work
      • Check if the lock can be acquired and exit the loop
    • Enter the OS lock
  3. Lock is acquired
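A minimal C# sketch of that pattern (not the code of any particular library, just an illustration of the stages above):

using System.Threading;

// Minimal spinlock sketch: bounded spinning with pause instructions and Sleep(0) before blocking in an OS primitive.
internal sealed class SimpleSpinLock
{
    private int _taken;                                                  // 0 = free, 1 = taken
    private readonly AutoResetEvent _osLock = new AutoResetEvent(false); // the "expensive" OS lock

    public void Enter()
    {
        for (int i = 0; i < 100; i++)                     // stage 2: spin a bounded number of times
        {
            if (Interlocked.CompareExchange(ref _taken, 1, 0) == 0)
                return;                                   // lock acquired while spinning

            if (i < 50)
                Thread.SpinWait(100);                     // burn cycles with pause instructions
            else
                Thread.Sleep(0);                          // give another ready thread a chance to run
        }

        while (Interlocked.CompareExchange(ref _taken, 1, 0) != 0)
            _osLock.WaitOne();                            // last stage: block in the OS lock
    }

    public void Exit()
    {
        Volatile.Write(ref _taken, 0);
        _osLock.Set();                                    // wake up a potential waiter
    }
}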

Some high performance threading libraries like the Intel OpenMP library use excessive amounts of CPU spinning, up to 200 ms, so that the OS lock is almost never taken, gaining, at insane power consumption costs, a few microseconds of faster algorithm execution. This means that if you let one empty task run every 200 ms you will get this CPU graph when using the Intel OpenMP library:

Sure, your CPU looks super busy, but did you perform much useful work? Not at all. The Intel OpenMP library makes it look like you are doing heavy number crunching and you need a bigger processor to take that load.

If you set the environment variable KMP_BLOCKTIME=0 you get a flat CPU profile, as it always should have been. The decision of Intel to waste 200 ms of CPU cycles by default is highly questionable, and no one, after profiling the CPU consumption, would leave it at the default value.
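Setting it for a process started from a console is a one-liner (MyOpenMpApp.exe is a placeholder for whatever application uses the Intel OpenMP runtime):

set KMP_BLOCKTIME=0
MyOpenMpApp.exe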

What does this have to do with virus scanners? They have the habit of intercepting system calls in creative ways, by using OS callbacks, application detouring and ETW. E.g. Windows 11 23H2 has instrumented many rarely used system calls with ETW trace points to make it easier to detect unusual application activity.

Interlude – SysCall Tracing

What can you do if you want to see every operating system call performed by an application? On Linux you can run the application under the strace tool to get a nice console output of all called operating system calls. On Windows no comparable tool exists. But there is a kernel ETW provider to enable SYSCALL tracing which not many people are aware of.

In a WPR profile you enable SYSCALL tracing by adding the SystemCall keyword, and if you want you can even get the stack trace for each call (warning: high data volume!). The logged events alone are not that useful, because they are missing the process and thread which issued the call. To make these events useful you also need to record Context Switch events, which record how long a thread was running on a given CPU.

Below is a sample snippet of a typical .wprp recording profile for wpr to enable SysCall tracing:

...
<Keywords>
   <Keyword Value="SystemCall"/>
   <Keyword Value="CSwitch"/>
</Keywords>
<Stacks>
  <Stack Value="SystemCallEnter"/>
</Stacks>

A full SysCall recording profile is located in the FileWriter project. To use it, download the MultiProfile.wprp with curl which is part of Windows. Then you can use it like this:

curl https://raw.githubusercontent.com/Alois-xx/FileWriter/master/MultiProfile.wprp > MultiProfile.wprp
wpr -start MultiProfile.wprp!SysCall
wpr -stop c:\temp\Wpr_SysCall.etl

Or if you prefer xperf:

xperf -on base+syscall+CSWITCH -stackwalk SyscallEnter 
...
xperf -stop 
xperf -merge c:\kernel.etl SyscallTracing.etl

The merge step is needed to get method names, otherwise you just get addresses in the resulting trace. You can now open the Trace with WPA (Windows Performance analyzer). Under System Activity – Syscalls a new graph pops up when you load such a trace which allows you to analyze the call duration, frequency, return value and other things:

Unlike strace on Linux you cannot get the method input arguments, but you can at least get the return code, which is better than nothing.

We can dump the data also with the help of the TraceProcessing Library with a few lines

using Microsoft.Windows.EventTracing;
using Microsoft.Windows.EventTracing.Symbols;
using Microsoft.Windows.EventTracing.Syscalls;
using System.Diagnostics;
namespace strace
{
    internal class Program
    {
        static void Main(string[] args)
        {
            var processor = TraceProcessor.Create(args[0], new TraceProcessorSettings
            {
                AllowLostEvents = true,
            });
            string symServer = "*http://msdl.microsoft.com/download/symbols";
            if (args.Length > 1 && args[1] == "-nosymserver")
            {
                symServer = ""; // Loading MS symbols makes loading the trace much slower. Once loaded with symbols you can omit them
            }
            
            var symbols = processor.UseSymbols();
            var syscalls = processor.UseSyscalls();
            var sw = Stopwatch.StartNew();
            Console.WriteLine("Loading trace");
            processor.Process();
            Console.WriteLine($"Trace loaded in {sw.Elapsed.TotalSeconds:F0} s");
            sw = Stopwatch.StartNew();
            Console.WriteLine("Loading symbols");
            SymbolPath symPath = new($@"SRV*C:\Symbols{symServer}");
            SymCachePath symCache = new(@"C:\SymCache");
            symbols.Result.LoadSymbolsAsync(symCache, symPath).Wait();
            Console.WriteLine($"Symbols loaded in {sw.Elapsed.TotalSeconds:F0} s");
            foreach(ISyscall call in syscalls.Result.Syscalls)
            {
                if( call.ReturnValue == null)  // this was a syscall enter call
                {
                    continue;
                }
                string? funcName = call?.StackFrame.Symbol?.FunctionName;
                string? image = call?.Process?.ImageName;
                if (funcName != null && image != null)
                {
                    Console.WriteLine($"{image}({call?.Process.Id}): {funcName} <- 0x{call?.ReturnValue:X}");
                }
            }
        }
    }
}

Our Windows strace tool, although much inferior to its Linux counterpart, can generate this output for e.g. all non-succeeded (return code != 0x0) calls. Here our own strace process crashed because it tried to access a non-existing file:

C>strace c:\syscallCSwitch.etl | findstr /i strace | findstr /v 0x0 
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtOpenKey <- 0xC0000034
strace.exe(17016): NtQuerySecurityAttributesToken <- 0xC0000225
strace.exe(17016): NtQuerySecurityAttributesToken <- 0xC0000225
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtSetInformationVirtualMemory <- 0xC00000BB
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtOpenSemaphore <- 0xC0000034
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtSetInformationVirtualMemory <- 0xC00000BB

The output is interesting, but without the input arguments we cannot make much progress if the input was e.g. the wrong file or if the failed call was even trying to access our file or something entirely different.

Antivirus Problem

Let’s get back to our problem of high CPU consumption in a Sleep call. When we drill deeper into the Sleep call CPU consumption we find in the stack trace (the numbers are in ms)

KernelBase.dll!SleepEx 2 522 541 
|- ntdll.dll!NtDelayExecution 2 510 323.140645
 |    |- ntoskrnl.exe!KiSystemServiceExitPico 2 489 111
 |    |    |- ntoskrnl.exe!PerfInfoLogSysCallEntry 1 387 533
 |    |    |    |- ntoskrnl.exe!EtwTraceKernelEvent 1 386 492
 |    |    |    |    |- ntoskrnl.exe!EtwpLogKernelEvent 1 384 792
 |    |    |    |    |    |- tedrdrv.sys!<PDB not found> 1 251 146
 |    |    |    |    |    |    |- ntoskrnl.exe!ExfReleaseRundownProtection 435 011
 |    |    |    |    |    |    |- ntoskrnl.exe!ExfAcquireRundownProtection 431 336
 |    |    |    |    |    |    |- tedrdrv.sys!<PDB not found> 343 058
 |    |    |    |    |    |    |- tedrdrv.sys!<PDB not found><itself> 41 649
 |    |    |- ntoskrnl.exe!PerfInfoLogSysCallExit 1 061 081
 |    |    |    |- ntoskrnl.exe!EtwTraceKernelEvent 1 059 181
 |    |    |    |    |- ntoskrnl.exe!EtwpLogKernelEvent 1 056 491
 |    |    |    |    |    |- tedrdrv.sys!<PDB not found> 916 549

calls to PerfInfoLogSysCallEntry and PerfInfoLogSysCallExit. If we run the numbers, the CPU consumption of SleepEx itself is not our problem: due to ETW tracing, the costs of SleepEx are 33 times higher than they normally would be!

But that is still not the full story. Inside the ETW logger we find a call to tedrdrv.sys, for which we have no symbols. The vast majority of the costs come from this driver, which calls into ExfAcquire/ReleaseRundownProtection, part of an API which was added so Antivirus drivers can safely update themselves at runtime.

With ETWAnalyzer we can dump the module information for all dlls which have no symbols (= most likely non MS dlls/drivers) which are showing up in expensive stack traces.

EtwAnalyzer %f% -Dump CPU -TopN 5 -ShowTotal File -ProcessName !*idle* -SortBy CPU -TopNMethods 2 -MinMaxCPUms 100000 -Methods *.dll;*.sys -ShowModuleInfo    

ETWAnalyzer has an internal list of nearly all Antivirus drivers. If a driver from that well-known list is present in the output, it shows up as yellow text. With -ShowModuleInfo (-smi) we find that tedrdrv.sys is indeed the Palo Alto driver which has intercepted ETW traces, causing the high CPU consumption. The plot is thickening, but we still do not have a smoking gun. When profiling is stopped, my script always captures the output of xperf -Loggers to see which ETW loggers were running, to check whether profiling was running properly shortly before we stopped our session. The output for the NT Kernel Logger is the smoking gun:

	Logger Name           : NT Kernel Logger
	Logger Id             : ffff
	Logger Thread Id      : 0000000000003DC4
	Buffer Size           : 8192
	Maximum Buffers       : 118
	Minimum Buffers       : 96
	Number of Buffers     : 118
	Free Buffers          : 70
	Buffers Written       : 224156
	Events Lost           : 0
	Log Buffers Lost      : 0
	Real Time Buffers Lost: 0
	Flush Timer           : 0
	Age Limit             : 0
	Log File Mode         : Secure PersistOnHybridShutdown SystemLogger
	Maximum File Size     : 0
	Log Filename          : PaloNull
	Trace Flags           : SYSCALL
	PoolTagFilter         : *

Remember that the initial ETW traces were fine until the profiling events suddenly stopped? Now we have found the reason. Palo Alto enables SysCall tracing by using the NT Kernel Logger ETW session, which was already in use by us. Not nice.

Now we have all data to understand the high CPU issue

When multithreaded code which uses spinlocks is running, Sleep(0) is called on all cores in tight loops. This is how spinlocks work, as explained earlier. Since Palo Alto wants to know which kernel methods all applications are calling, they enable ETW tracing and hook the ETW trace call inside the kernel to get easy access to that information. Unfortunately the AV driver needs to access some protected kernel object via a “lock” with ExfAcquire/ReleaseRundownProtection. This “lock” is implemented as interlocked memory accesses, which are ok as long as not many threads hammer on the same memory location. That assumption is no longer true in a heavily multithreaded application. The CPU costs of Sleep(0) are now >33 times higher, which drives the server to 100% CPU for unusually long times (minutes) and makes users want to reboot that machine. The server is not responding, but it miraculously recovers if you just wait patiently long enough.

Conclusions

Antivirus solutions, multithreaded software and many-core machines are a deadly mix for performance. The idea to intercept SysCalls might work decently on small machines, but if AV vendors try the same trick on server machines with multithreaded software, things fall apart. I hope some AV vendor reads this post and changes their mind about the SysCall tracing approach. Since consumer machines get more and more cores, this problem will become bigger over the next years. Please stop intercepting SysCalls and let the machine do its work.

In the past, AV only got in the way when files were accessed or processes were started. That can be dealt with by configuring the AV solution to exclude certain directories/processes to limit its impact. But if compute bound workloads are affected in general, what should one do, except uninstall that AV software? I have bad experiences with configuring AV solutions, because very often the overhead is still there although the configuration UI says that everything is turned off.

You can figure out the impact of AV with ETW but it takes quite some time and a lot of data to get to that point. For the average user it is impossible to find such issues, except if he can do some A/B testing with/without the AV solution. That works, but if the next AV update brings in new tracing capabilities you will have a hard time to figure that one out without ETW. It should not be so hard to quantify the impact of AV solutions given the number of times they are causing issues.

The loss of productivity due to the high performance overhead of AV solutions is hard to quantify, but it is definitely not cheap. In my opinion all AV scanning solutions try hard to identify every threat, but fail to properly address the adverse performance effects caused by these additional scanning capabilities.

On the other hand all virus “vendors” will test with all currently used AV scanning solutions to get in undetected. In effect AV scanning solutions will be mostly able to detect past threats, but not the latest release of e.g. malware X. I wonder if it is cost effective to detect past threats with such a high overhead while new malware by definition is still able to get in.

If the scanning settings of AV solutions are tuned towards performance and not only for detection rate our fleet of > 1 billion of Windows computers would consume Gigawatts less of electrical energy (conservative estimate: >1 W of additional energy * 10^9 ). This is a global issue which is not on anyone’s agenda because it is so well hidden deep inside your computer which works but consumes more CPU than necessary. Many people talk about the negative environmental effects of Bitcoin mining, but no one talks about the environmental effects of AV scanning overhead.

I hope I have answered the mystery of a stuck server and raised new questions about whether it is really ok to turn on every AV scanning feature to stay safe. The tradeoff between computer safety and power consumption should be discussed more by security experts to arrive at decisions which are more balanced than is currently the case. A wild idea would be to have a virtual watt meter for every wasted watt consumed by AV scanners to make this cost explicit. You can only act on things you can measure. This is currently not the case.

.NET Serialization Roundup 2022

The last test is from 2019, which is still accurate, but the world is changing and we have arrived at .NET 7.0, which is reason enough to spin up my test suite again and measure everything from .NET Framework 4.8 and .NET Core 3.1 over .NET 5.0 and 6.0 up to .NET 7.0. The “old” articles are still relevant despite their age.

The following serializers are tested

Name | Home | Type
BinaryFormatter | .NET | Binary
– BinaryPack | https://github.com/Sergio0694/BinaryPack | Binary
Bois | https://github.com/salarcode/Bois | Binary
– Ceras | https://github.com/rikimaru0345/Ceras | Binary
DataContract | .NET | Xml
– FastJson | https://github.com/mgholam/fastJSON/ | Json
FlatBuffer | https://google.github.io/flatbuffers/ | Binary
GroBuf | https://github.com/skbkontur/GroBuf | Binary
– JIL | https://github.com/kevin-montrose/Jil | Json
Json.NET | http://www.newtonsoft.com/json | Json
JsonSerializer | .NET | Json
MemoryPack | https://github.com/Cysharp/MemoryPack | Binary
MessagePackSharp | https://github.com/neuecc/MessagePack-CSharp | Binary
– MsgPack.Cli | https://github.com/msgpack/msgpack-cli | Binary
Protobuf.NET | https://github.com/mgravell/protobuf-net | Binary
Google Protobuf | https://github.com/protocolbuffers/protobuf | Binary
ServiceStack | https://github.com/ServiceStack/ServiceStack.Text | Json
– SimdJsonSharp | https://github.com/EgorBo/SimdJsonSharp | Json
SpanJson | https://github.com/Tornhoof/SpanJson | Json
SlimSerializer | https://github.com/azist/azos/tree/master/src/Azos/Serialization/Slim | Binary
Swifter.Json | https://github.com/Dogwei/Swifter.Json | Json
– UTF8Json | https://github.com/neuecc/Utf8Json | Json
XmlSerializer | .NET | Xml

Serializers with – in front are in maintenance mode.

New entries since 2019 are

  • BinaryPack
  • MemoryPack
  • Google Protobuf

Which of these 23 serializers are in maintenance mode? Conditions (logical or) for maintenance mode are

  • Last commit > 1 year
  • It is archived
  • All commits within one year changed no product code
  • The author tells you to not use it

Maintenance mode serializers at the point of writing are:

  • BinaryPack
  • Ceras
  • FastJson
  • JIL (2019 last commit)
  • MsgPack.Cli
  • SimdJsonSharp
  • UTF8Json

You can still look at them, but I would not use them for real products where future updates to e.g. .NET 8.0 or security fixes are mandatory.

What else is not recommended in 2022?

BinaryFormatter should not be used in production anywhere anymore, because if you deserialize attacker controlled data it allows the attacker to open a shell on your box. See https://learn.microsoft.com/en-us/dotnet/standard/serialization/binaryformatter-security-guide.

The following serializers, which are part of .NET Framework, are listed as unsafe by the linked BinaryFormatter security guide: BinaryFormatter, SoapFormatter, NetDataContractSerializer, LosFormatter and ObjectStateFormatter.

Used Benchmark

I wrote the SerializerTests test suite, which makes it easy to test all these serializers. If you want to measure for yourself: clone the repository, compile it, execute RunAll.cmd, and you get a directory with CSV files which you can graph as you like.
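In practice that boils down to something like this (the repository URL is the one linked further down in this post; the build command is an assumption, check the repository readme for the exact steps):

git clone https://github.com/Alois-xx/SerializerTests.git
cd SerializerTests
rem build first (assumption: dotnet build -c Release), then run the full test matrix
RunAll.cmd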

Tooling is getting better and I have also added profiling support so you can check oddities on your own: if you call RunAll.cmd -profile you get ETW traces besides the CSV files. This assumes you have installed the Windows Performance Toolkit from the Windows SDK.

Serialize Results

The winner in terms of serialize speed is again FlatBuffer which still has the lowest overhead, but you need to go through a more complex compiler chain to generate the de/serialization code.

*Update 1*

After discussions about the fairness of how FlatBuffer should be tested, I have changed how FlatBuffer performance is evaluated. Now we use an existing object (BookShelf) as input to create a FlatBuffer object. Additionally it was mentioned that some byte buffer should also be part of the tested object. This moves FlatBuffer from rank 1 to 15, because the new test assumes you have an object you want to convert to the efficient FlatBuffer binary format. This assumption is reasonable, because few people will want to work with the FlatBuffer object directly. If you use the FlatBuffer object directly then the previous test results still apply.

The layout of the serialized Book object includes

  • int Id
  • string Title
  • byte[] Payload (before it was null, now it contains 10 bytes of data)

Serialized as XML, our BookShelf object looks like this:

<?xml version="1.0" encoding="utf-8"?>
<BookShelf xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Books>
    <Book>
      <Title>Book 1</Title>
      <Id>1</Id>
      <BookData>YWJjZGVmZ2hpag==</BookData>
    </Book>
  </Books>
</BookShelf>

Now nearly 40% of the serialized data is binary data which contains just printable ASCII characters. This “simple” change moves the landscape quite a bit: GroBuf is first and MemoryPack second. This also includes an update of MemoryPack from 1.4.4 to 1.8.10. Whether GroBuf has less overhead (but double the serialized size, because it stores strings as UTF-16 while most others convert the data to UTF-8), or whether MemoryPack has some regression, I cannot tell. To reproduce the tests you can run them with

RunAll.cmd 10

to use a 10 byte buffer payload for each Book object. If you omit the 10 then you will serialize with 0 bytes of payload. The combined csv file SerializationPerf_Combined_10.csv contains as suffix how much payload per book object was used.

*Update 1 End*

The new entry MemoryPack, which comes in second, is the newest serializer by Yoshifumi Kawai, who seems to write serializers for a living. He has several other entries in this list, namely

  • MessagePackSharp
  • Utf8JsonSerializer

His latest MemoryPack library shows that you can still get faster once you know what things cost and what you can avoid without becoming too specialized. It has pretty much all features one could wish for, including a Source Generator for TypeScript to read the binary data inside the browser or wherever your TypeScript code is running.
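
If you have not used MemoryPack yet, basic usage looks roughly like this (a sketch based on the library's documented attribute based API, not code taken from SerializerTests):

using MemoryPack;

[MemoryPackable]
public partial class PackedBook          // the source generator emits the serialization code for this partial class
{
    public int Id { get; set; }
    public string Title { get; set; }
    public byte[] Payload { get; set; }
}

public static class PackedBookDemo
{
    public static void Roundtrip()
    {
        byte[] bin = MemoryPackSerializer.Serialize(new PackedBook { Id = 1, Title = "Book 1" });
        PackedBook copy = MemoryPackSerializer.Deserialize<PackedBook>(bin);
    }
}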

Additionally he provides a helper to compress the data with Brotli, a very efficient compression scheme. This competes with Bois_LZ4, which is currently the only other serializer in the list that also supports compression of the data. The Brotli algorithm is nice, but there is also zstd from Facebook which should be a bit faster and would make a nice addition, perhaps in the future?

That is really impressive!

From a higher level we find that the old .NET Framework is falling behind in terms of speed, and we see a nice step pattern which shows that every .NET release has become faster than its predecessor.

Binary is good, but Json also has a large fan base. With .NET 7.0 Microsoft added Source Generators which generate the serialization code for your objects at compile time. See https://github.com/Alois-xx/SerializerTests/blob/master/Serializers/SystemTextJsonSourceGen.cs for how you can enable code generation for serialization. This feature generates code only for serialize but uses pregenerated metadata for deserialize. The feature to add pregenerated deserialization code is tagged as future (post .NET 8?). If you think this is important you can upvote the issue. The main reason to add this was not only performance but also AOT scenarios where you have no JIT compiler.
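
A minimal sketch of how such a serializer context looks (the context class name is made up here; the real test code is in the linked SystemTextJsonSourceGen.cs):

using System.Text.Json;
using System.Text.Json.Serialization;

// The source generator emits the serialization code for BookShelf at compile time.
[JsonSourceGenerationOptions(GenerationMode = JsonSourceGenerationMode.Serialization)]
[JsonSerializable(typeof(BookShelf))]
internal partial class BookShelfContext : JsonSerializerContext
{
}

// Usage: string json = JsonSerializer.Serialize(shelf, BookShelfContext.Default.BookShelf);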

With pregenerated serialization code in SystemTextJsonSourceGen our BookShelf object serializes 20+% faster compared to SystemTextJson which uses the “normal” runtime generated code. A significant part of the performance improvement seems to be related to writing UTF-8 directly without the need to convert an UTF-16 char stream into an UTF-8 byte stream.

Deserialize Results

*Update 1*

After deserialization of the FlatBuffer object we access all properties, which causes the FlatBuffer struct to create a new string or byte[] array on each access again. AllocPerf does not use a binary blob yet; its numbers are therefore identical.

Even with these changes FlatBuffer wins the deserialize race with some margin. Second place goes to GroBuf instead of MemoryPack, but these two are really close and have been fighting for second place before. As already mentioned, GroBuf doubles the serialized data size, which allows it to omit the UTF-16 to UTF-8 conversion and saves some speed at the expense of data size.

*Update 1 End*

Normally data is more often read than written. Deserialization needs to allocate all the new objects, which makes it largely GC bound. This is the prime reason why deserialize can never be faster than serialize: it needs to allocate new objects, while in the serialize case these already exist.

The fastest “serializer” is AllocPerf, which is not a serializer at all, but an allocation test which walks over an array of binary data without parsing and constructs strings out of the data. It shows how fast one could get without any parsing overhead.

MemoryPack sets the mark again and is nearly overhead free. With so little overhead there must be a downside? Yes, you lose versioning support to some extent.

The follower after MemoryPack is BinaryPack, which does not even try to be version tolerant. It can be used for simple caches, but other than that I would stay away from it. And beware: it is in maintenance mode!

Another notable change in Protobuf_net is the deprecation of the AsReference attribute since 3.0, which many people are missing. It made it easy to deserialize object graphs with cyclic dependencies, which with Protobuf_net 3.0 is no longer the case.

The now in maintenance mode Utf8JsonSerializer is still 35% faster than SystemTextJson. There is still room for improvement for .NET.

First Startup Results

Besides pure runtime performance, startup costs are also important. They are trickier to measure, since you need to pay the costs of reflection, code generation and JIT time inside a fresh executable for both the precompiled (CrossGen/Ngen) and the not precompiled scenario. If you did not use a new executable each time, the costs of first time init effects would be attributed to the first called serializer in a process, which would introduce systematic bias. SerializerTests therefore starts a new executable for each serializer, which is started three times (the last run is added to the CSV) to warm up the loaded dlls, so that expensive hard page faults are not served from disk but from the file system cache. Then it serializes 4 different object types with one object and takes the maximum value, which should be the first call. These details allow one to check whether just the first code gen is so expensive and subsequent code gen is much cheaper, because many one time init effects have already happened.
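
Conceptually the first call measurement inside the freshly started process boils down to something like the following sketch (simplified, not the actual SerializerTests code):

using System;
using System.Diagnostics;

static class FirstCallTimer
{
    // The first invocation pays reflection, code generation and JIT costs;
    // repeating this for several object types, the maximum should be the very first call.
    public static TimeSpan Measure(Action serializeOnce)
    {
        var sw = Stopwatch.StartNew();
        serializeOnce();
        sw.Stop();
        return sw.Elapsed;
    }
}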

The old .NET Framework 4.8 loses by a large margin here, but only if Ngen is not used. If the images are Ngenned then it is not too bad compared to crossgenned .NET Core binaries.

Is this the full story? Of course not. We are only measuring inside a small loop long after many things have happened, but since precompilation will cause all dependent dlls to be eagerly loaded by the OS loader we should better look at the runtime of the process as a whole. This can show interesting … gaps in our testing methodology.

To look deeper you can run the test suite with ETW profiling enabled with

RunAll.cmd -profile

to generate ETW data while executing the tests. You need to download the Windows Performance Toolkit to generate the profiling data and ETWAnalyzer to easily query the generated data. The RunAll.cmd command gives you a bunch of CSV files, a combined CSV file and some ETW files.

When you put ETWAnalyzer into your path you can extract the ETW file data into Json files which are put into the Extract folder. To convert the ETL files into Json files you add -Extract All to the ETWAnalyzer command line, the folder where the ETL files are located and a symbol server, e.g. MS which is predefined. You can also use NtSymbolPath to use your own symbol search path defined by the _NT_SYMBOL_PATH environment variable.

 ETWAnalyzer -Extract All -FileDir D:\Source\git\SerializerTests\SerializerTestResults\_0_14_06 -SymServer MS 

The extracted Json files have a total size of 112 MB vs 5 GB of the input ETL files which is a good compression ratio! Now we can generate a nice chart with an Excel Pivot Chart to look how long each of the 3 invocations of the startup tests did run:

And the winner is .NET 4.8! That is counterintuitive after we have measured that .NET 7 is faster everywhere. What is happening here? This chart was generated with input from ETWAnalyzer, which can dump process start/stop times, command lines and durations into a CSV file:

ETWAnalyzer.exe -dump process -cmdline *firstcall#* -timefmt s -csv FirstCall_Process.csv

ETWAnalyzer can also dump the CPU consumption to the console so we can drill deeper.

Although the command line syntax looks complex it is conceptually easy. You first select what to dump e.g. CPU, Process, Disk, … and then add filters and sorting. Because ETW records a lot of data you get a lot of sorting and filter options for each command of ETWAnalyzer which may look overwhelming but it is just filtering, sorting and formatting after you have selected what to dump.

The following call dumps process CPU consumption for all processes with a command line that contains

  • firstcall and cross and datacontract to select the .NET Core crossgenned processes OR
  • firstcall and ngen and datacontract to select the .NET 4.8 Ngened SerializerTests calls

-ProcessFmt s prints process start/stop times in ETW recording time (time since session start) and the process duration. You can also format in local or UTC time by using local or utc as time format.

EtwAnalyzer -dump cpu -cmdline "*firstcall* cross*#datacontract*";*firstcall*ngen*#datacontract* -ProcessFmt s

After drilling into the data it became apparent that since .NET Core 3.1 we now have two additional serializers (MemoryPack and BinaryPack) which both have significant startup costs. These costs are always paid, even if you never use them! I have created a branch of SerializerTests (noBinaryMemoryPack) which removes them, so we can compare .NET 4.8 and .NET Core+ with the same number of used dependencies.

Each .NET version has two bars, the lower one is crossgen and the higher one involves JIT compilation of the binaries. Since each process is started 3 times we see on the y-axis the sum of 3 process starts. .NET 7.0 seems to have become a bit slower compared to the previous versions. The winner is still .NET 4.8 (455 ms), but at least the difference is not as big as before (558 ms vs 972 ms).

It is very easy to fool yourself with isolated tests which do not measure the complete (system) picture. Measuring something and drawing a picture is easy. Measuring and understanding what the data means in full details is a much bigger endeavor.

Cool Console Tools

I work a lot with command line tools which produce very wide output. The cmd shell has the nice feature of a horizontal scroll bar if you configure the console width to e.g. 500. This works, but newer shells like Windows Terminal are becoming the norm. It behaves like Linux shells, which never had support for horizontal scroll bars. The issue to add a scroll bar was declined (https://github.com/microsoft/terminal/issues/1860), which is a pity. But there is an open source solution for it. A little known terminal extension named vtm offers all features I need. When configuring my command prompt with vtm -r cmd

I get a Windows Terminal window which supports horizontal scrolling, PlainText, RTF, HTML text copy, Ctrl+V (Ctrl+C not yet) and Unix style copying with right mouse click. This makes working with ETWAnalyzer much easier and you can also properly read the help which is also wide output …

EtwAnalyzer -dump CPU -ProcessName SerializerTests.exe -ShowModuleInfo -TopN 1 -NoCmdLine

The command dumps, from all extracted files in the current directory, the SerializerTests executable with the highest CPU consumption and its module information, which includes file version, product name and the directory from where it was started. That can be useful to check if e.g. your local IT has installed the updated software on your machine, or if you need to call again to get the update rolled out. By making such previously hard to get information easily queryable I have found that for some reason SerializerTests was running under a false flag as a protobuf product and had not had a file version update for a long time. Both issues are fixed, because ETWAnalyzer made it very obvious that something was wrong.

Conclusions

This is the end of this post, but hopefully the start for you to make a better informed decision which serializer is best suited for your needs. You should always measure in your environment with your specific data types to get numbers which apply to your scenario and not to my still synthetic one. When you execute RunAll.cmd dd you add to each Book object a byte array with dd bytes, which may change the performance landscape dramatically. This can simulate a test where you use a serializer as a wrapper over another already serialized payload which already generated a byte array. That should be cheap, in theory. But when larger byte arrays were used, the fastest Json serializer (UTF8Json) became one of the slower ones. I tried hard to get the numbers right, but you still need to measure for yourself to get the right numbers for your use case and scenario.

Performance numbers are not everything. Only if essential features are there, and you trust the library author to provide support for the foreseeable future, should you consider performance as the next item on your decision matrix when choosing a library for mission critical, long lived projects. If you are running a pet project, use whatever you see fit.

Slow network after adding a second WIFI Access Point

Networking is hard and many things can go wrong. I have bought an Oculus Quest VR headset for my daughter. I soon found it frustrating to connect it to my PC via a USB cable. To get rid of the cable I bought a cheap TP-Link Archer C5 AC1200 router which I connect to the PC via a LAN cable. After becoming wireless I can send realtime VR images over my dedicated WIFI to the VR headset at a stable 150 MBit/s.

The “real” WIFI (yellow) which is connected to the internet is located in another room and is not capable of transferring 150 MBit/s without interruptions, which would kill the VR experience. That is the reason why nearly everyone recommends to use either a USB cable (inconvenient) or a LAN cable (blue) connected to a dedicated WIFI router for the VR headset.

That setup works great, but I noticed that whenever the second router was running, browsing the internet would feel very slow. Every page refresh took a lot longer compared to when I turned the second WIFI router off.

Since all browsers have great development tools I just needed to press F12 to see what was blocking my browser requests. In the example below I have tried to access a far, far away web site which shows in the network tab a delay of 10,55 s for host name resolution (Dns) for the first request.

On my box these timings were always around 10 s, so I knew that I suffered from slow Dns queries which resolve the host name, e.g. http://www.dilbert.com, to the actual IP address. Knowing that page load times are blocked by 10+ seconds of Dns is great, but now what? Since I have two networks shown in Task Manager, I suspected that the DNS query was first sent via the LAN cable to a dead end, because my Oculus router is not connected to the internet, and then via WIFI to the actual internet connected router.

Initially I tried the route print command to see where and why packets were routed, but it looked too complicated to configure the order for every local route. After a bit more searching I found the PowerShell command Get-NetIPInterface which shows you over which network interface Windows sends network packets to the outside world. The relevant column is InterfaceMetric. If multiple network cards are considered, the network interface with the lowest InterfaceMetric number is tried first.

As expected I found my local Ethernet connection with InterfaceMetric 5 as the preferred connection to send network packets to the outside world. But what is the other network? The easy way out is to cause some traffic in the browser and check in Task Manager in the Performance tab which network gets all the traffic:

I therefore need to change the order of vEthernet (New Virtual Switch) with metric 15 and interface id 49 and vEthernet (New Virtual Switch) with metric 25 and interface id 10.

Set-NetIPInterface -InterfaceIndex 10 -InterfaceMetric 5
Set-NetIPInterface -InterfaceIndex 49 -InterfaceMetric 25

These two commands switch the order of the two relevant networks (ifIndex 10 and 49). How fast are things now? Internet surfing feels normal again and the changed network ordering saved the day. There are still Dns delays, but nowhere near the 10 s region. Case solved.

The average reader can stop reading now.

What follows is a deep dive into how Dns works on Windows and how you can diagnose pretty much any Dns issue with the builtin Dns tracing of Windows with the help of ETWAnalyzer.

So far we have just looked at Dns as a black box, but you can trace all Dns requests with the help of ETW (Event Tracing for Windows) and the Microsoft-Windows-DNS-Client ETW provider. The easiest way to record the data is to download a recording profile like MultiProfile.wprp, e.g. via curl from the command line

curl https://raw.githubusercontent.com/Alois-xx/FileWriter/master/MultiProfile.wprp > c:\temp\MultiProfile.wprp

Then start from an Admin command prompt tracing with

wpr -start c:\temp\MultiProfile.wprp!Network

Now surf to some web sites or do your slow use case. After you have found some slowdown you can stop recording with

wpr -stop c:\temp\SlowBrowsing_%ComputerName%.etl

Good naming is hard, but if you later want to revisit your recorded files you should try to come up with a descriptive name.

After recording the data you can extract a Json file with ETWAnalyzer by extracting everything (-extract all) or just Dns data with -extract DNS

C:>EtwAnalyzer -extract all -fd c:\temp\SlowBrowsing_Skyrmion.etl -symserver ms 
1 - files found to extract.
Success Extraction of c:\temp\Extract\SlowBrowsing_Skyrmion.json
Extracted 1/1 - Failed 0 files.
Extracted: 1 files in 00 00:02:40, Failed Files 0

ETWAnalyzer supports many different queries, e.g. for Disk, File, CPU and other things. Since 2.4.0.0 support for dumping Dns queries was added. To show the aggregated slowest Dns queries you just need -fd (File or Directory) with the file name of the extracted data and -dump DNS to dump all Dns queries. To select the top 5 add -topn 5, and you get this output printed per process:

C:\temp\Extract>EtwAnalyzer -dump DNS -topn 5 -fd SlowBrowsing_Skyrmion.json   
02.10.2022 17:47:32    SlowBrowsing_Skyrmion 
NonOverlapping       Min        Max  Count TimeOut DNS Query                                                             
       Total s         s          s      #
OVRServer_x64.exe(6624) 124 128
      12,038 s   0,000 s   12,038 s      4       1 graph.oculus.com                                                      
backgroundTaskHost.exe(6692)  -ServerName:BackgroundTaskHost.WebAccountProvider
      12,043 s   0,000 s   12,043 s      2       1 login.microsoftonline.com                                             
firefox.exe(13612) 
      12,047 s   0,021 s   12,047 s      2       1 d21lj84g4rjzla.cloudfront.net                                         
      12,052 s   0,000 s   12,052 s      2       1 loadus.exelator.com                                                   
      12,052 s   0,000 s   12,052 s      2       1 cdn.adrtx.net                                                         
Totals: 60,232 s Dns query time for 12 Dns queries

We find that we have indeed some slow Dns queries which all timed out after ca. 12 s. The processes printed in magenta are the ones issuing the following DNS queries. To see more we can add -details to see the timing of every slow DNS query

C:\temp\Extract>EtwAnalyzer -dump DNS -topn 5 -details -fd SlowBrowsing_Skyrmion.json  
02.10.2022 17:47:32    SlowBrowsing_Skyrmion 
NonOverlapping       Min        Max  Count TimeOut DNS Query                                                             
       Total s         s          s      #
      12,038 s   0,000 s   12,038 s      4       1 graph.oculus.com                                                      
        2022-10-02 17:47:38.416 Duration:  0,000 s OVRServer_x64.exe(6624)                       DnsAnswer: 
        2022-10-02 17:47:38.417 Duration:  0,000 s OVRServer_x64.exe(6624)                       DnsAnswer: 
        2022-10-02 17:47:38.417 Duration: 12,038 s OVRServer_x64.exe(6624)                       TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: 2a03:2880:f21c:80c1:face:b00c:0:32c2;::ffff:31.13.92.49
        2022-10-02 17:47:38.417 Duration: 12,038 s OVRServer_x64.exe(6624)                       TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: 2a03:2880:f21c:80c1:face:b00c:0:32c2;::ffff:31.13.92.49
      12,043 s   0,000 s   12,043 s      2       1 login.microsoftonline.com                                             
        2022-10-02 17:47:42.020 Duration:  0,000 s backgroundTaskHost.exe(6692)                  DnsAnswer: 
        2022-10-02 17:47:42.020 Duration: 12,043 s backgroundTaskHost.exe(6692)                  TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: ::ffff:40.126.32.74;::ffff:20.190.160.14;::ffff:40.126.32.134;::ffff:40.126.32.72;::ffff:20.190.160.22;::ffff:40.126.32.136;::ffff:40.126.32.138;::ffff:40.126.32.68
      12,047 s   0,021 s   12,047 s      2       1 d21lj84g4rjzla.cloudfront.net                                         
        2022-10-02 17:47:54.203 Duration:  0,021 s firefox.exe(13612)                            DnsAnswer: 13.227.153.112;13.227.153.94;13.227.153.81;13.227.153.66
        2022-10-02 17:47:54.224 Duration: 12,047 s firefox.exe(13612)                            TimedOut: 192.169.0.1:53 Servers: 0.0.0.0;192.169.0.1;fe80::1 DnsAnswer: 2600:9000:225b:fc00:1e:b6b1:7b80:93a1;2600:9000:225b:3600:1e:b6b1:7b80:93a1;2600:9000:225b:2e00:1e:b6b1:7b80:93a1;2600:9000:225b:f000:1e:b6b1:7b80:93a1;2600:9000:225b:7200:1e:b6b1:7b80:93a1;2600:9000:225b:a400:1e:b6b1:7b80:93a1;2600:9000:225b:9200:1e:b6b1:7b80:93a1;2600:9000:225b:dc00:1e:b6b1:7b80:93a1
      12,052 s   0,000 s   12,052 s      2       1 loadus.exelator.com                                                   
        2022-10-02 17:47:42.091 Duration:  0,000 s firefox.exe(13612)                            DnsAnswer: 
        2022-10-02 17:47:42.091 Duration: 12,052 s firefox.exe(13612)                            TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: ::ffff:54.78.254.47
      12,052 s   0,000 s   12,052 s      2       1 cdn.adrtx.net                                                         
        2022-10-02 17:47:42.091 Duration:  0,000 s firefox.exe(13612)                            DnsAnswer: 
        2022-10-02 17:47:42.091 Duration: 12,052 s firefox.exe(13612)                            TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: ::ffff:34.248.125.130;::ffff:3.248.26.101
Totals: 60,232 s Dns query time for 12 Dns queries

You might ask yourself: what are NonOverlapping Totals? The ETW provider Microsoft-Windows-DNS-Client instruments dnsapi.dll!DnsQueryEx, which supports synchronous and async queries. By default an IPV4 and an IPV6 query are started whenever you try to resolve a host name. If both queries need e.g. 10 s and are started at the same time, the delay observed by the user is 10 s (non overlapping time) and not 20 s (the sum of both query times). To estimate the observed user delay for each query, ETWAnalyzer calculates the non overlapping query time, which should reflect the actual user delay much better. The summary time Totals also takes overlapping DNS query times into account, which should be a better metric than a simple sum.
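
To illustrate the idea, here is a small sketch (not ETWAnalyzer's actual implementation) of how a non overlapping total can be computed from possibly parallel query intervals:

using System;
using System.Collections.Generic;
using System.Linq;

static class NonOverlappingTime
{
    // Sum of the union of the intervals: two fully overlapping 10 s queries count as 10 s, not 20 s.
    public static TimeSpan Total(IEnumerable<(DateTime Start, DateTime End)> intervals)
    {
        TimeSpan total = TimeSpan.Zero;
        DateTime curStart = DateTime.MinValue;
        DateTime curEnd = DateTime.MinValue;
        foreach (var (start, end) in intervals.OrderBy(i => i.Start))
        {
            if (start > curEnd)          // gap: close the current merged interval and start a new one
            {
                total += curEnd - curStart;
                curStart = start;
                curEnd = end;
            }
            else if (end > curEnd)       // overlap: extend the current merged interval
            {
                curEnd = end;
            }
        }
        return total + (curEnd - curStart);  // add the last merged interval
    }
}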

For more details please refer to https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/DumpDNSCommand.md.

The details view shows that every slow, timed out query has a red TimedOut Dns server name (192.169.0.1:53). This is the Dns server IP for which a Dns query timed out. By looking at the interfaces with ipconfig we can indeed verify that the new TP-Link router was tried first, timed out, and after that the next network (WIFI) was tried, which finally succeeded.

If you are in an IPV6 network the Dns server IP might be a generic Link Local IP which can be the same for several network interfaces. In that case you can add -ShowAdapter to show which network interfaces were tried. That should give you all information to diagnose common Dns issues. Of course you can also use Wireshark to directly get your hands on the Dns queries, but with Wireshark you will not know which process was waiting for how long. The ETW instrumented DnsQueryEx API gives you a clear process correlation and also shows which queries (IPV4, IPV6) were tried. To get all details you can export the data to a CSV file (add -csv xxx.csv) and post process the data further if you need to. To select just one process (e.g. firefox) you can filter by process with -pn firefox, which is the shorthand notation for -processname firefox. For more information on the query syntax of ETWAnalyzer see https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/Filters.md.

If you have just context switch traces but no DNS Client traces, not all is lost. If you suspect DNS delays you can still query the thread wait times of all Dns related methods with ETWAnalyzer, which also calculates the non overlapping wait and ready times summed over all threads.

EtwAnalyzer -dump cpu -pn firefox -fd WebBrowsingRouterOn_SKYRMION.10-06-2022.19-46-16 -methods *dns* -fld s s

If you have async queries in place, but you know that just one async operation was initiated, you can add -fld s s to show the time difference between when the method was seen first and last in the trace. This should give you a good estimate how long that async method was running, provided it was called again upon completion (and consumed enough CPU so we could get a sample). For more details please refer to https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/DumpCPUCommand.md.

You can also use one specific method or event as zero point and let ETWAnalyzer calculate first/last timings relative to a specific event which can be useful to calculate e.g. timing of methods relative to a key event like OnLoadClicked or something similar. That advanced feature is explained at https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/ZeroTime.md in much greater detail. It is also possible to correlate traces across machines without synchronized clocks by using different zero times for different files within one query and export the data to a CSV file for further processing.

That was a lot of information, but I hope I have convinced you that you can access previously hard to get information much more easily with ETWAnalyzer. It is not only useful for performance investigations, it might also be interesting for forensic audits to get an overview which servers your machine tried to query.

PDD Profiler Driven Development

There are many extreme software development strategies out there. PDD can mean Panic Driven or Performance Driven Development. The first one should be avoided, the second is an MSR paper dealing with performance modelling techniques. Everyone writes tests for their software because they have proven to be useful, but why are so few people profiling their own code? Is it too difficult?

I have created a C# console project named FileWriter (https://github.com/Alois-xx/FileWriter) which writes 5000 100 KB files (= 500 MB) to a folder with different strategies. How do you decide which strategy is the most efficient and/or fastest one? As strategies we have

  • Single Threaded
  • Multi Threaded with TPL Tasks
  • Multi Threaded with Parallel.For
    • 2,3,4,5 Threads

The serial version is straightforward. It creates n files and writes a string into each file until the file size is reached. The code is located at https://github.com/Alois-xx/FileWriter/blob/master/DataGenerator.cs, but the exact details are not important.
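
For reference, the single threaded strategy boils down to something like this simplified sketch (the real code in DataGenerator.cs differs in details):

using System.IO;

static class SerialWriter
{
    // Writes fileCount files of roughly fileSizeKB each, line by line, on the calling thread.
    public static void CreateFiles(string folder, int fileCount = 5000, int fileSizeKB = 100)
    {
        Directory.CreateDirectory(folder);
        int lineCount = 0;
        for (int i = 0; i < fileCount; i++)
        {
            using var writer = new StreamWriter(Path.Combine(folder, $"File_{i}.txt"));
            long written = 0;
            while (written < fileSizeKB * 1024L)
            {
                string line = $"echo This is line {lineCount++}";
                writer.WriteLine(line);
                written += line.Length + 2;   // + CRLF, good enough for a size estimate
            }
        }
    }
}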

Lets measure how long it takes:

As expected the serial version is slowest and the parallel versions are faster. The fastest one is the one which uses 3 threads. Based on real world measurements we have found that 3 threads is the fastest solution. OK problem solved. Move on and implement the next feature. That was easy.

But, as you can imagine, there is always a but: are the numbers we have measured reasonable? Let's do some math. We write 500 MB with the parallel version in 11 s. That gives a write speed of 45 MB/s. My mental performance model is that we should be able to max out a SATA SSD which can write 430 MB/s.

This is off by nearly a factor of 10 from what the SSD is able to deliver. Oops. Where did we lose performance? We can try out the Visual Studio CPU Profiler:

That shows that creating and writing to the file are the most expensive parts. But why? Drilling deeper will help to some extent, but we do not see where we wait for disk writes, because the CPU Profiler only shows CPU consumption. We need another tool. Luckily there is the free Concurrency Visualizer extension on the Visual Studio Marketplace, which can also show thread blocking times.

After installation you can start your project from the Analyze menu with your current debugging settings.

We try to stay simple and start FileWriter with the single threaded strategy which is easiest to analyze

FileWriter -generate c:\temp serial

In the Utilization tab we get a nice summary of CPU consumption of the machine. In our case other processes were consuming a lot of CPU while we were running. White is free CPU, dark gray is System and light gray are all other processes. Green is our profiled process FileWriter.exe.

The first observation is that once we start writing files other processes become very busy which looks correlated. We come back to that later.

The Threads view shows where the Main Thread is blocked due to IO/Waiting/…

You can click with the mouse on the timeline of any thread to get the corresponding stack trace in the lower view. In this case we are delayed e.g. by 2,309 s by a call to WriteFile. For some reason the .NET 6.0 symbols could not be resolved, but it might also just be a problem with my symbol lookup (there should be a full stack trace visible). Personally I find it difficult to work with this view when other processes are consuming the majority of CPU. But since Concurrency Visualizer uses ETW, nothing stops us from opening the Concurrency Visualizer generated ETW files from the folder C:\Users\\Documents\Visual Studio 2022\ConcurrencyVisualizer\ with WPA.

We see that FileWriter (green) starts off well, but then MsMpEng.exe (red = Defender Antivirus) kicks in and probably delays our file IO significantly. In the middle we see the System process (purple = Windows OS) doing strange things (Antivirus again?).

Do you remember that our SSD can write 430 MB/s? The Disk Usage view in WPA shows disk writes. Our process FileWriter writes 500 MB of data, but only 4 MB are written in 53 ms (Disk Service Time). Why are we not writing to the SSD? The answer is: the operating system collects a bunch of writes and flushes the data to disk later, when there is time or when the amount of written data exceeds some internal threshold. Based on actual data we need to revise our mental performance model:

We write to the OS write cache instead of the disk. We should be able to write GB/s! The actual disk writes are performed by the OS asynchronously.

Still we are at 45 MB/s. If the disk is not the problem, then we need to look where CPU is spent. Let's turn to the gaps where the green line of FileWriter, writing on one thread to disk, vanishes. There seems to be some roadblock caused by the System process (purple), which is the only active process during that time of meditation:

When we unfold the Checkpoint Volume stacktag, which is part of my Overview Profile of ETWController, we find that whenever NTFS is optimizing the free bitmap of the NTFS volume it traverses a huge linked list in memory, which blocks all file IO operations for the complete disk. Do you see the bluish parts in the graph? WPA highlights where NTFS Checkpoint Volume samples are found in the trace; automatic highlighting of the current table selection in the graph is one of the best features of WPA. The root cause is that we delete the old files of the temp folder in FileWriter from a previous run, which produces a lot of stale NTFS entries. When many new files are created, these at some point cause the OS to wreak havoc on file create/delete performance. I have seen dysfunctional machines with a high file churn (many files created/deleted per day) where all file create/delete operations took seconds instead of sub milliseconds. See a related post about that issue here: https://superuser.com/questions/1630206/fragmented-ntfs-hdd-slow-file-deletion-600ms-per-file. I hit this issue several times when I cleaned up ETW folders which contain for each ETL file a large list of NGENPDB folders with the .ni.pdb files. At some point Explorer is stuck for many minutes deleting the last files. The next day, and a defrag later (the NTFS tables still need defragmentation, even on SSDs), things are fast again. With that data we need to revisit our mental performance model again:

Creating/deleting a large amount of files might become very slow depending on internal NTFS states, which are in-memory structures and manifest as high CPU consumption in the kernel inside the NtfsCheckpointVolume method.

That view is already good, but we can do better if we record ETW data on our own and analyze it at scale.

How about a simple batch script that

  • Starts ETW Profiling
  • Executes your application
  • Retrieves the test execution time as return value from your application
  • Stops ETW Profiling
  • Writes an ETW file name with the measured test duration and test case name
  • Extracts ETW data for further analysis

That is what the RunPerformanceTests.cmd script of the FileWriter project does. The tested application is FileWriter.exe which writes files with different strategies to a folder. When it exits, it uses the duration of the test case in ms as the exit code of the process. For more information refer to the ProfileTest.cmd command line help.

FileWriter and Measurement Details

Read on if you are eager for details on how the scripts generate ETW measurement data, or go to the next headline.

RunPerformanceTests.cmd
call ProfileTest.cmd CreateSerial       FileWriter.exe -generate "%OutDir%" serial
call ProfileTest.cmd CreateParallel     FileWriter.exe -generate "%OutDir%" parallel
call ProfileTest.cmd CreateParallel2    FileWriter.exe -generate "%OutDir%" parallel -threads 2
...
call ProfileTest.cmd CreateTaskParallel FileWriter.exe -generate "%OutDir%" taskparallel

The most “complex” part of ProfileTest.cmd is where several ETW profiles of the supplied MultiProfile.wprp are started. The actual start command in the ProfileTest.cmd script is

wpr -start MultiProfile.wprp!File -start MultiProfile.wprp!CSwitch -start MultiProfile.wprp!PMCLLC 

But in a shell script with the correct escape characters it becomes this:

"!WPRLocation!" -start "!ScriptLocation!MultiProfile.wprp"^^!File -start "!ScriptLocation!MultiProfile.wprp"^^!CSwitch -start "!ScriptLocation!MultiProfile.wprp"^^!PMCLLC

This line starts WPR with File IO tracing, Context Switch tracing to see thread wait/ready times and some CPU counters for Last Level Cache misses. Disk and CPU sampling is already part of these profiles.

The PMCLLC profile can cause issues if you are using VMs like HyperV. To profile low level CPU features you need to uninstall all Windows OS HyperV Features to get Last Level Cache CPU tracing running.

You can check if you have access to your CPU counters via

C:\>wpr -pmcsources
Id  Name                        
--------------------------------
  0 Timer                       
  2 TotalIssues                 
  6 BranchInstructions          
 10 CacheMisses                 
 11 BranchMispredictions        
 19 TotalCycles                 
 25 UnhaltedCoreCycles          
 26 InstructionRetired          
 27 UnhaltedReferenceCycles     
 28 LLCReference                
 29 LLCMisses                   
 30 BranchInstructionRetired    
 31 BranchMispredictsRetired    
 32 LbrInserts                  
 33 InstructionsRetiredFixed    
 34 UnhaltedCoreCyclesFixed     
 35 UnhaltedReferenceCyclesFixed
 36 TimerFixed                  

If you get a list with just one entry named Timer then you will not get CPU counters, because you have installed one or several HyperV features. In theory this issue should not exist, because it was fixed with Windows 10 RS5 in 2019 (https://docs.microsoft.com/en-us/archive/blogs/vancem/perfview-hard-core-cpu-investigations-using-cpu-counters-on-windows-10), but I cannot get these CPU counters when HyperV is active in Windows 10 21H2.

The easiest fix is to remove the "!ScriptLocation!MultiProfile.wprp"^^!PMCLLC part from the ProfileTest.cmd script, or you modify the MultiProfile.wprp to set in all HardwareCounter nodes

<HardwareCounter Id="HardwareCounters_EventCounters" Base="" Strict="true">

the value Strict="false" to prevent WPR from complaining when it could not enable hardware counters. The MultiProfile.wprp profile uses nearly all features of WPR which are possible according to the not really documented xsd schema, to streamline ETW recording to your needs. The result of this "reverse" documenting is an annotated recording profile that can serve as a base for your own recording profiles. Which profiles are inside it? You can query it from the command line with WPR:

D:\Source\FileWriter\bin\Release\net6.0>wpr -profiles MultiProfile.wprp

Microsoft Windows Performance Recorder Version 10.0.19041 (CoreSystem)
Copyright (c) 2019 Microsoft Corporation. All rights reserved.

   Default                   
               (CPU Samples/Disk/.NET Exceptions/Focus)
   CSwitch    +(CPU Samples/Disk/.NET Exceptions/Focus/Context Switch)
   MiniFilter +(CPU Samples/Disk/.NET Exceptions/Focus/MiniFilter)
   File       +(CPU Samples/Disk/.NET Exceptions/Focus/File IO)
   Network    +(CPU Samples/Disk/.NET Exceptions/Focus/Network)
   Sockets    +(CPU Samples/Disk/.NET Exceptions/Focus/Sockets)
   VirtualAlloc (Long Term)
   UserGDILeaks (Long Term)
   PMCSample  
     PMC Sampling for PMC Rollover + Default
   PMCBranch  
     PMC Cycles per Instruction and Branch data - Counting
   PMCLLC                      
     PMC Cycles per Instruction and LLC data - Counting
   LBR                         
     LBR - Last Branch Record Sampling

Or you can load the profile into WPRUI which is part of the Windows SDK where they will show up under the Custom Measurements node:

All profiles can be mixed (except Long Term) to record common things for unmanaged and .NET code with the least amount of data. A good choice is the Default profile, which enables CPU sampling, some .NET events, Disk and Window Focus events. These settings are based on years of troubleshooting and should be helpful recording settings in your case too. The profiles supplied by MS record virtually everything, which is often a problem: the trace sizes become huge (many GB) or, if memory recording is used, the history is one minute or even less on a busy system which has problems.

Generated Data

RunPerformanceTests.cmd saves the ETW data to the C:\Temp folder

After data generation it calls ExtractETW.cmd, which is one call to ETWAnalyzer that loads several files in parallel and creates Json files from the ETW files

ETWAnalyzer -extract all -fd c:\temp\*Create*.etl -symserver MS -nooverwrite

This will generate JSON files in C:\Temp\Extract which can be queried with ETWAnalyzer without the need to load every ETL file into WPA.

As you can see, the 500+ MB ETL files are reduced to ca. 10 MB, which is a size reduction of a factor of 50 while keeping the most interesting aspects. There are multiple Json files per input file to enable fast loading. If you query with ETWAnalyzer e.g. just CPU data, the other *Derived* files are not even loaded, which keeps your queries fast.

Working With ETWAnalyzer and WPA on Measured Data

To get an overview of whether all executed tests suffer from a known issue, you can query the data from the command line. To get the top CPU consumers in the kernel, not by method name but by stacktag, you can use this query

EtwAnalyzer -dump CPU -ProcessName System -StackTags * -TopNMethods 2

I am sure you ask yourself: what is a stacktag? A stacktag is a descriptive name for key methods in a stacktrace of an ETW event. WPA and TraceProcessing can load a user configured xml file which assigns a descriptive (stacktag) name to a stacktrace, or Other if no matching stacktag rule could be found. The following XML fragment defines e.g. the above NTFS Checkpoint Volume stacktag:

  <Tag Name="NTFS Checkpoint Volume">
	<Entrypoint Module="Ntfs.sys" Method="NtfsCheckpointVolume*"/>
  </Tag>

See default.stacktags of ETWAnalyzer for how it can be used to flag common issues like OS problems, AV scanner activity, encryption overhead, … with descriptive stacktags. If more than one stacktag matches, the matching stacktag deepest in the stacktrace wins. You can override this behavior by setting a priority value for a given tag.

Based on the 7 runs we find that the CheckpointVolume issue is part of every test run and blocks file writes for up to 6+ seconds. Other test runs do not suffer from this, which makes comparisons of your measured numbers questionable. Again we need to update our mental performance model:

Tests which are slower due to CheckpointVolume overhead should be excluded to make measurements comparable where this internal NTFS state delay is not occurring.

The exact number would be the sum of CPU+Wait; because this operation is single threaded we get exact numbers here. ETWAnalyzer can sum method/stacktag CPU/Wait times with the -ShowTotal flag. If we pass Method, the input lines are kept; if we use Process, only process and file totals are shown.

EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -ShowTotal method

Now we know that file IO is delayed by up to 7461 ms, without the need to export the data to Excel. ETWAnalyzer allows you to export all data you see at the console to a CSV file. When you add -csv xxx.csv you can aggregate/filter the data further as you need it. This is supported by all -Dump commands!

We can also add methods to the stacktags by adding -methods *checkpoint* to see from where the stacktag got its data

EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -methods *checkpoint*

The method NtfsCheckpointVolume has 873 ms, but the stacktag has only 857 ms. This is because other stacktags can “steal” CPU sampling/wait data which is then attributed to them. This is curse and feature rolled into one. It is great, because if you add up all stacktags you get the total CPU/Wait time for a given process across all threads. It is bad, because expensive things might look less expensive in one stacktag because other stacktags have “stolen” some CPU/Wait time.

That is all nice, but what is our FileWriter process doing? Let's concentrate on the single threaded use case by filtering the files with -FileDir (or -fd, its shorthand notation) to *serial*. You need to use * at the beginning, because the filter clause matches against the full path name. This allows you to query a complete directory tree with -recursive, where -fd *Serial*;!*WithAV* -recursive defines a file query starting from the current directory for all Serial files but excludes folders or files which contain the name WithAV. To test which files match you can use -dump Testrun -fd xxxx -printfiles.

EtwAnalyzer -dump CPU -ProcessName FileWriter -stacktags * -Filedir *serial*

The test takes 12 s, but it spends 2,2 s alone on Defender activity. The other stacktags do not give away a clear problem into which we could drill deeper. One interesting thing is that we spend 750 ms in thread sleeps. A quick check in WPA shows

that these are coming from the .NET runtime, which recompiles busy methods later on the fly. Stacktags are great to flag past issues and categorize activity, but they also tend to decompose expensive operations into several smaller ones. E.g. if you write to a file you might see WriteFile and Defender tags splitting the overall costs of a call to WriteFile. But what stands out is that on my 8 core machine a single threaded FileWriter fully utilizes my machine with Defender activity. We need to update our mental performance model again:

Writing to many small files on a single thread will be slowed down by the Virus Scanner a lot. It also generates a high load on the system which slows down everything else.

Since we wrote the application ourselves we can go down to specific methods to see where the time is spent. To get an overview we use the top level view in WPA of our CreateFiles method

In WPA we find that

  • CreateFiles
    • CPU (CPU Usage) 10653 ms
    • Wait Sum 672 ms
    • Ready Time Sum 1354 ms
      • This is the time a thread did wait for a CPU to become free when all CPUs were used

which adds up to 12,679 s, which pretty closely (16 ms off) matches the measured 12,695 s. It is always good to verify what you actually measured. As you have seen, we have already needed to change our mental performance model quite often. The top CPU consumers in the CreateFiles method are

  • System.IO.StreamWriter::WriteLine
  • System.IO.Strategies.OSFileStreamStrategy::.ctor
  • System.IO.StreamWriter::Dispose
  • System.Runtime.CompilerServices.DefaultInterpolatedStringHandler::ToStringAndClear
  • System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[System.Char]::Rent
  • System.Runtime.CompilerServices.DefaultInterpolatedStringHandler::AppendFormatted
  • System.IO.Strategies.OSFileStreamStrategy::.ctor
  • System.IO.StreamWriter::WriteLine

What is surprising is that ca. 2s of our 12s are spent in creating interpolated strings. The innocent line

 string line = $"echo This is line {lineCount++}";

costs ca. 16% of the overall runtime. Not huge, but still significant. Since we are trying to find out which file writing strategy is fastest this is OK, because the overhead is the same for all test cases.

We can get the same view as in WPA for our key methods with the following ETWAnalyzer query

EtwAnalyzer -dump CPU -ProcessName FileWriter -fd *serial* -methods *StreamWriter.WriteLine*;*OSFileStreamStrategy*ctor*;*StreamWriter.WriteLine*;*DefaultInterpolatedStringHandler*AppendFormatted*;*TlsOverPerCoreLockedStacksArrayPool*Rent;*DefaultInterpolatedStringHandler.ToStringAndClear;*StreamWriter.Dispose;*filewriter* -fld

You might have noticed that the method names in the list have :: between class and method. ETWAnalyzer always uses ., which is less typing and in line with old .NET Framework code, where :: was used for JITed code and . for precompiled code.

Additionally I have smuggled -fld into the query. It is the shorthand for -FirstLastDuration, which shows the difference between the last and first time the method shows up in CPU sampling or context switch data. Because we know that our test only measures DataGenerator.CreateFile calls, we see in the Last-First column 12,695 s, which is, down to the ms, our measured duration! This option can be of great value if you want to measure the wall clock time of parallel asynchronous activity. At least for .NET code the state machine classes generated by the C# compiler contain central methods which are invoked during init and finalization of asynchronous activities, which makes it easy to “measure” the total duration even if you have no additional trace points at hand. You can use -fld s s, -fld local local or -fld utc utc to also show the first and last time in

  • Trace Time (seconds)
  • Local Time (customer time in the time zone the machine was running)
  • UTC Time

See command line help for further options. ETWAnalyzer has an advanced notation of time which can be formatted in the way you need it.

The WPA times for Wait/Ready are the same in ETWAnalyzer, but the CPU times are not. The reason is that to exactly judge method CPU consumption you need to look in WPA at CPU Usage (Sampled), which shows per method CPU data and is more accurate. ETWAnalyzer merges this automatically for you.

CPU usage (Precise) Trap

The CPU Usage (Precise) view in WPA shows context switch data, which is generated whenever your application calls a blocking OS API like Read/WriteFile, WaitForSingleObject, … and your thread is switched off the CPU; that is called a context switch. You will therefore see in this graph only (simplification!) methods which were calling a blocking OS method. If you have e.g. a busy for loop in a method int Calculate() and the next called method is Console.WriteLine, like this,

int result = Calculate();
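// CPU heavy work happens inside Calculate(), but it contains no blocking OS call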
Console.WriteLine(result);

to print the result, then you will see in CPU Usage (Precise) all CPU attributed to Console.WriteLine, because that method caused the next blocking OS call. All other non blocking methods called before Console.WriteLine are invisible in this view. To get per method data you need to use the CPU data of CPU sampling, which gathers data in a different way to much better measure CPU consumption at method level. Is context switch data wrong? No. The per thread CPU consumption is exact, because this is a trace of the OS scheduler when it moves threads on/off a CPU. But you need to be careful not to interpret the method level CPU data as the true CPU consumption of that method.

CPU Usage (Sampled) Trap

Let's compare both CPU views, (Sampled) vs (Precise)

The StreamWriter.Dispose method consumes 887 ms in the Sampled view, while the Context Switch view attributes 1253 ms to it. In this case the difference is 41%! With regards to CPU consumption you should use the Sampled view for methods to get the most meaningful value.

Every WPA table can be configured to add a Count column. Many people fall into the trap of interpreting the CPU Sampling Count column as the number of method calls, while it actually just counts the number of sampling events. CPU sampling works by stopping all running threads 1000 times per second (the default), taking a full stacktrace and writing that data to ETW buffers. If method A() shows up 1000 times then it gets 1000 * 1 ms of CPU attributed. CPU sampling has statistical bias and can miss e.g. method C() if it executes just before or after the CPU sampling event takes a stack trace.

ETWAnalyzer Wait/Ready Times

The method list shown by ETWAnalyzer -Dump CPU -methods xxx is a summation across all threads in a process. This serves two purposes

  • Readability
    • If each thread would be printed the result would become unreadable
  • File Size
    • The resulting extracted Json file would become huge

This works well for single threaded workloads, but what about 100 threads waiting for something? WPA sums all threads together if you remove the thread grouping. ETWAnalyzer has chosen a different route: it only sums non overlapping Wait/Ready times. This means that if more than one thread is waiting, the wait is counted only once.

The Wait/Ready times of methods in multithreaded code are thus the sum of all non overlapping Wait/Ready times across all threads. If e.g. a method was blocked on many threads, you will see as maximum wait time your ETW recording time and not some arbitrary high number which no one can reason about. It is still complicated to judge parallel applications, because with heavy thread over subscription both Wait and Ready times will reach your ETW recording time. This is a clear indication that you have too many threads and you should use fewer of them.

Make Measurements Better

We have found at least two fatal flaws in our test

  • Anti Virus Scanner is consuming all CPU even in single threaded test
    • Add an exclude rule to the folder via PowerShell
      • Add-MpPreference -ExclusionPath "C:\temp\Test"
    • To remove use
      • Remove-MpPreference -ExclusionPath "C:\temp\Test"
  • Some test runs suffer from large delays due to CheckpointVolume calls which introduces seconds of latency which is not caused by our application
    • Change the script to delete the temp files before the test is started to give the OS a chance to get into a well defined file system state

When we do this and we repeat the tests a few times we get for the single threaded case:

Nearly a factor of 3 faster without changing a single line of code is pretty good. The very high numbers come from overloading the system when something else is running in the background. The 12 s numbers are with Defender enabled, while < 10 s are with a Defender exclusion rule but various NtfsCheckpoint delays. When we align everything correctly we end up with 7 s for the single threaded case, which is a write rate of 71 MB/s. The best value is now 3,1 s for 4 threads with 161 MB/s.

What else can we find? When looking at variations we find also another influencing factor during file creation:

We have a single threaded test run with 9556 ms and a faster one with 7066 ms. Besides the CheckpointVolume issue we lose another 1,3 s due to reading MFT and paged out data. The faster test case (Trace #2) reads just 0,3 ms from the SSD, while in the slower run (Trace #1) we read for 1,3 s from disk. The amount of data is small (8 MB), but for some reason the SSD was handing out the data slowly. Yes, SSDs are fast, but sometimes, when they are busy with internal reorganization, they can become slow. SSD performance is a topic for another blog post.

We need to update our mental performance model

File System Cache state differences can lead to wait times due to Hard faults. Check method MiIssueHardFault if the tests have comparable numbers.

Etwanalyzer -dump CPU -pn Filewriter -methods MiIssueHardFault

Those are another 2 s of difference which we can attribute to file system cache state and paged out data.

Measuring CPU Efficiency

We know that 4 threads are fastest after removing random system noise from our measured data. But is it also the most efficient solution? We can dump the CPU Performance Monitoring Counters (PMC) to the console with

ETWAnalyzer -dump PMC -pn FileWriter

The most fundamental metric is CPI, which is Cycles/Instructions. If you execute the same use case again you should execute the same number of instructions. When you go for multithreading you need to take locks and coordinate threads, which costs extra instructions. The most energy efficient solution is therefore the single threaded one, although it is slower.
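
As a quick reminder of what the metric means, CPI is just the ratio of two PMC counters (the numbers below are made up for illustration, not measured values):

// CPI = TotalCycles / InstructionsRetired, lower is better.
ulong totalCycles         = 12_000_000_000;
ulong instructionsRetired = 10_000_000_000;
double cpi = (double)totalCycles / instructionsRetired;
System.Console.WriteLine($"CPI = {cpi:F2}");   // 1.20 cycles per instruction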

To get an overview you can export the data from the current directory and from the initial measurements we started with into a CSV file. The following query dumps the Performance Monitoring Counters (PMC) to a CSV file. You can use multiple -fd queries to select the data you need:

ETWAnalyzer -dump PMC -pn FileWriter -fd ..\Take_Test1 -fd .  -csv PMC_Summary.csv

The CSV file contains the columns Test Time in ms and CPI, to make it easy to correlate the measured numbers with something else, CPI in this case. This pattern is followed for all -Dump commands of ETWAnalyzer. From this data we can plot Test Timing vs CPI (Cycles per Instruction) in a Pivot Table:

The initial test where we had

  • Defender
  • CheckpointVolume

issues was not only much slower, but also much worse in terms of CPU efficiency: 1,2 vs 0,69 (smaller is better). This shows that even if Defender were highly efficient on all other cores and did not block our code at all, we would still do much worse (73%), because even when we do exactly the same work, CPU caches and other internal CPU resources are slowing us down. With this data you can prove why you are slower even when everything else is the same.

Conclusions

If you have read this far I am impressed. This is a dense article with a lot of information. I hope I have convinced you that profiling your system, and not only your code, for strategic test cases is an important asset in automated regression testing.

Credits: NASA, ESA, CSA, and STScI (https://www.nasa.gov/image-feature/goddard/2022/nasa-s-webb-delivers-deepest-infrared-image-of-universe-yet)

If you look only at your code you will miss the universe of things that are moving around in the operating system and other processes. You will end up with fluctuating tests which no one can explain. We are doing the same thing 10 times and we still get outliers! The key is not only to understand your own code, but also the system in which it is executing. Both need attention and coordination to achieve the best performance.

To enable that I have written ETWAnalyzer to help you speed up your own ETW analysis. If you go for ETW you are talking about a lot of data, which needs a fair amount of automation to become manageable. Loading hundreds of tests into WPA is not feasible, but with ETWAnalyzer mass data extraction and analysis was never easier. The list of supported queries is growing. If you miss something, file an issue so we can take a look at what the community needs.

To come back to the original question: is profiling too difficult? If you are using only UI based tools, then the overhead of doing this frequently is definitely too much. But the scripting approach shown by FileWriter, recording data and extracting it with ETWAnalyzer to make the data queryable, gives you a lot of opportunities. ETWAnalyzer comes with an object model for the extracted Json files. You can query for known issues directly and flag past issues within seconds, without the need to load the profiling data into a graphical viewer to get an overview.

GDI/User Object Leak Tracking – The Easy Way

One of the most complex issues is User/GDI object leaks, because they are nearly invisible in memory dumps. Microsoft does not want us to peek into such “legacy” technology, so no Windbg plugins exist anymore. Today everything is UWP, WPF (ok, that’s legacy as well) and mostly Chrome browser based applications. With .NET 6 WPF and WinForms support has got some love again, so we can expect that new programmers will hit old fashioned User/GDI leaks again. These leaks are worse than “normal” handle leaks because Windows imposes a limit of 10K GDI and 10K User handles per process. After that you cannot create any more User/GDI handles and you get either exceptions or partially drawn windows where content or icons are missing.
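If you just want to keep an eye on those counts for your own process, Windows exposes them via the GetGuiResources API of user32.dll. Here is a minimal sketch (the GuiHandleMonitor class and its console output are my own illustration and not part of the leaking sample below):

    using System;
    using System.Diagnostics;
    using System.Runtime.InteropServices;

    static class GuiHandleMonitor
    {
        // user32!GetGuiResources returns the number of GDI (flag 0) or User (flag 1) objects of a process
        [DllImport("user32.dll")]
        static extern uint GetGuiResources(IntPtr hProcess, uint uiFlags);

        const uint GR_GDIOBJECTS = 0;
        const uint GR_USEROBJECTS = 1;

        public static void Dump()
        {
            IntPtr self = Process.GetCurrentProcess().Handle;
            Console.WriteLine($"GDI objects: {GetGuiResources(self, GR_GDIOBJECTS)}, User objects: {GetGuiResources(self, GR_USEROBJECTS)}");
        }
    }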

Here is an application which leaks Window and GDI Handles on every button click:

    public partial class Leaker : Form
    {
        public Leaker()
        {
            InitializeComponent();
        }

        // Keeps every created child form alive so it can never be garbage collected or disposed
        List<Leaker> myChilds = new List<Leaker>();

        private void button1_Click(object sender, EventArgs e)
        {
            var leaker = new Leaker();
            var tmp = leaker.Handle;                // accessing Handle forces creation of the native window handle
            myChilds.Add(leaker);                   // the form is never shown and never disposed => its User/GDI handles leak
            cCountLabel.Text = $"{myChilds.Count}";
        }
    }

In Task Manager we find that with every click we leak 6 window handles:

These come from the

  • Leaked Form
  • Create Button
  • Created Label
  • 85 (Number) Label
  • IME Hidden Window
  • MSCTFIME Hidden Window

A very useful tool is Spy++, which is installed when you add the C++ core features via the Visual Studio Installer. You will then find it in C:\Program Files\Microsoft Visual Studio\<Version>\<Edition>\Common7\Tools\spyxx_amd64.exe, or the 32 bit version without the _amd64 postfix.

If you intend to copy Spy++ to a different machine you need these files

.\1033\spyxxui_amd64.dll
spyxxhk_amd64.dll
spyxx_amd64.exe

to make it work. The resource dll in the 1033 folder is essential, otherwise it won't start.
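A possible way to copy the three files is robocopy from a cmd shell; the following sketch assumes a Visual Studio 2022 Community installation and D:\Tools\Spy as target directory, both of which you need to adapt:

C>set SPY=C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\Tools
C>robocopy "%SPY%" D:\Tools\Spy spyxx_amd64.exe spyxxhk_amd64.dll
C>robocopy "%SPY%\1033" D:\Tools\Spy\1033 spyxxui_amd64.dll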

How would you identify the call stack that led to the allocation of the additional objects? The answer is a small and great ETW tool from Microsoft named Performance HUD.

After starting Performance HUD you can attach to an already running process. After a few seconds an overlay window pops up; if you double click on it the “real” UI becomes visible.

  1. Switch over to the Handles tab.
  2. Press the Create button in our Leaker app a few times to leak some GDI/User objects.
  3. Ctrl+Click on the blue underlined Outstanding text for User Window objects to find the leaking call stack.

Voila, you have found our button1_Click handler as the bad guy, as expected. If you right click on it you can e.g. send a mail with the leaking call stack as an attached .hudinsight file to the developer of the faulty application. Problem solved.
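In our toy example the fix is of course trivial: dispose the child forms when they are no longer needed, which releases their window and GDI handles again. A minimal sketch of such a cleanup handler inside the Leaker form (the cClear_Click button handler is made up for illustration):

        private void cClear_Click(object sender, EventArgs e)
        {
            foreach (Leaker child in myChilds)
            {
                child.Dispose();   // destroys the native window handle and associated GDI objects
            }
            myChilds.Clear();      // drop the references so the forms can also be garbage collected
            cCountLabel.Text = "0";
        }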

See my earlier post for what else you can do with Performance HUD. It is one of those tools which save a lot of time once you know that they exist.

Another great feature of Performance HUD is its ability to analyze UI thread hangs. It employs some sophisticated analysis to detect that the UI thread is no longer processing window messages, either because some CPU hungry method call is executing for a long time, or because you are stuck in a blocking OS call. Today everyone is using async/await in C#, so a stuck UI thread should be a thing of the past. But now you can measure and prove that it really is not hanging.

Let's add a UI hang to our sample application via TPL tasks. It happens only in Debug builds, because the call to Debug.Print, including the evaluation of its blocking argument, is compiled away in Release builds:

        private void cHang_Click(object sender, EventArgs e)
        {
            Task<int> t = Task.Run(async () => { await Task.Delay(5000); return 1; });
            // t.Result blocks the UI thread until the task has finished, i.e. for 5s
            Debug.Print(t.Result.ToString());
        }

My favorite button is Blame Someone, which will unfold the stack with the probably longest stuck call:

This precisely points towards our stuck button handler, which calls into Task.Result; that frame seems not to be resolved properly, but the later method calls are. The 5s hang is nicely colored, and you can also pack this into an email and send the issue to the maintainer of the corresponding code. I have seen plenty of similar issues in the past, because Thread.Sleep is flagged by many FxCop rules, so people tend to use Task.Delay instead, which, when you block on its result, is just a Thread.Sleep in disguise.
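The remedy in this case is to actually await the task instead of blocking on Task.Result, so the UI thread keeps pumping messages while the delay runs. A minimal sketch of the corrected handler:

        private async void cHang_Click(object sender, EventArgs e)
        {
            // await yields the UI thread back to the message loop; the continuation runs after ~5s
            int result = await Task.Run(async () => { await Task.Delay(5000); return 1; });
            Debug.Print(result.ToString());
        }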

Performance HUD works not only with .NET code, but with any language as long as it supports pdbs and walkable stacks. It is great that this diagnostic has been added to Windows, but why did it take so long? As far as I know Performance HUD comes from the Office team, which maintains a “legacy” but important application. Someone dealing with too many GDI/User object leaks was able to commit some lines to the kernel to instrument the relevant locations with proper ETW events, and we finally arrive at a feature complete leak tracer. With ETW we can now track leaks of:

  • OS Handles (File, Socket, Pipe, …)
    • wpr -start Handle
    • Analyze with WPA. This records handle data from all processes (system wide)
  • GDI Handles (Cursor, Bitmap, Font, …)
  • User Handles (Window, Menu, Timer, …)
  • VirtualAlloc Memory (Records system wide all processes)
    • wpr -start VirtualAllocation
    • Analyze with WPA

Finally we have covered all frequent leak types with ETW events. No leak can hide anymore.

For memory mapped files there are some ETW events, but unfortunately no analysis for them exists in the public version of WPA. These leaks are still more tricky to figure out than the rest, which are now at least in principle easy. Recording the right data and not too much of it (ETW files > 6 GB will usually crash WPA) is still challenging, but doable. Performance HUD is very well written: I have successfully processed an 80 GB ETL trace with no hiccups. It can not only attach to already running processes, but it can also analyze previously recorded data.

Now get back to work and fix your leaks and unresponsive UIs!

How Fast is your Virus Scanner?

I am sure everyone has experienced slowdowns due to antivirus solutions, but very few are able to attribute them to the right component on their Windows box. Most people do not even know which AV solution(s) are running on their machine. The end result can be a secure but slow system which is barely usable.

Is your CPU fan always on and your notebook getting hot although you are not running anything? You want to find out what is consuming so much CPU with a command line tool?

Then I have a new tool for you: ETWAnalyzer. It uses the Trace Processing Library which also powers WPA (Windows Performance Analyzer). WPA is a great UI to analyze tricky systemic issues, but if you need to check a trace for the n-th time whether the same issue is present again, you want to automate that analysis.

ETWAnalyzer is developed by Siemens Healthineers as its first public open source project. Full Disclosure: I work for Siemens Healthineers. One motivation to make it public was to show how incredibly useful ETW (Event Tracing for Windows) is to track down performance issues in production and development environments. It is one of the best technologies for performance troubleshooting Microsoft has invented, but it is valued by far too few people. ETW is part of Windows and you need no additional tools or code instrumentation to record ETW data! Starting with Windows 10 the Windows Performance Recorder (wpr.exe) is part of the OS. You have a slow system? In that case I would open an Administrator cmd shell and type

C>wpr -start CPU -start DiskIO -start FileIO

Now execute your slow use case and then stop profiling with

C>wpr -stop C:\temp\SlowSystem.etl

This will generate a large file (hundreds of MB up to several GB) which covers roughly the last 60-200s (depending on system load and installed memory) of what your system was doing. If you have a lot of .NET applications installed, stopping the trace for the first time can take a long time (10-60 min), because wpr will create pdbs for all running .NET applications. Later profiling stop runs are much faster though.
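If you record such traces more often, the two commands can be wrapped into a small batch script. The following sketch (profiles and output file name are just an example) runs a command passed as argument between start and stop; it must be executed from an Administrator shell:

    @echo off
    rem RecordUseCase.cmd <command with arguments> - records CPU/Disk/File ETW data around the given command
    wpr -start CPU -start DiskIO -start FileIO
    call %*
    wpr -stop C:\temp\SlowSystem.etl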

To record data with less overhead check out the MultiProfile profile which is part of FileWriter. You need to download the profile to a directory on the target machine, e.g. c:\temp, and can then record the equivalent of the previous command with

C>wpr -start C:\temp\MultiProfile.wprp!CSwitch -start C:\temp\MultiProfile.wprp!File

Getting the data is easy. Analyzing it is hard when you do not know where to look.

This is where ETWAnalyzer can help out by asking the recorded data questions about previously analyzed problems. You ask: previously? Yes. Normally you start with WPA to identify a pattern like high CPU in this or that method. Then you (should) create a stacktag to assign a high level description of what a problematic method is about. The stacktags are invaluable to categorize CPU/wait issues and correlate them with past problems. The supplied stacktag file of ETWAnalyzer contains descriptions for Windows/Network/.NET Framework/Antivirus products and even some Chrome stacktags. To get started you can download the latest ETWAnalyzer release from https://github.com/Siemens-Healthineers/ETWAnalyzer/releases, unzip it to your favorite tool directory and add that directory to the PATH environment variable. If you want maximum speed you should use the self contained .NET 6 version.
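Extending PATH for the current cmd session could e.g. look like this (c:\Tools\ETWAnalyzer is an assumed install directory; use the Environment Variables dialog if you want to make the change permanent):

C>set PATH=%PATH%;c:\Tools\ETWAnalyzer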

Now you are ready to extract data from an ETL file with the following command:

ETWAnalyzer -extract all -filedir c:\temp\SlowSystem.etl -symserver MS

After extraction all data is contained in several small Json files.

Let's query the data by looking at the top 10 CPU consumers:

ETWAnalyzer -dump CPU -filedir c:\temp\Extract\SlowSystem.json -topN 10

The second most CPU hungry process is MsMpEng.exe which is the Windows Defender scan engine. Interesting but not necessarily a bad sign.

Let's check if any known AV vendor is inducing waits > 10ms. We use a stacktag query which matches all stacktags with Virus in their name:

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json -stacktags *virus* -MinMaxWaitMs 10

We find a 1349ms slowdown caused by Defender in a cmd shell. I did observe slow process creation after I hit enter to start the freshly downloaded executable from the internet.

Explorer also seems to be slowed down. The numbers shown are the sum of CPU and wait times across all threads. If multiple threads were executing the method, you can see high numbers for the stacktags/methods. For single threaded use cases, such as starting a new process, the numbers correlate pretty well with wall clock time.

The currently stacktagged AV solutions are

  • Applocker
  • Avast
  • CrowdStrike
  • CyberArk
  • Defender
  • ESET
  • McAfee
  • Palo Alto
  • Sentinel One
  • Sophos
  • Symantec
  • Trend Micro

Let's dump all slow methods with a wait time between 1300-2000ms, along with the timing of all process and file creation calls in explorer.exe and cmd.exe:

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json  -stacktags *virus* -MinMaxWaitMs 1300-2000 -methods *createprocess*;*createfile* -processName explorer;cmd

Antivirus solutions usually hook into CreateProcess and CreateFile calls, which obviously slowed down process creation and file operations in explorer and process creation in cmd.exe.

Was this the observed slowness? Let's dump the time when the file and process creation calls were first visible in the trace:

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json  -stacktags *virus* -MinMaxWaitMs 1300-2000 -methods *createprocess*;*createfile* -processName explorer;cmd -FirstLastDuration local

That correlates pretty well with the observed slowness. To blame the AV vendor we can dump all methods of device drivers for which we have no symbols. Neither Microsoft nor any other AV vendor delivers symbols for their antivirus drivers, so by this definition all system drivers with unresolved symbols are likely AV or hardware drivers. ETWAnalyzer makes this easy with the -methods *.sys -ShowModuleInfo query, which shows all kernel drivers called from our applications for which we do not have symbols:

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json -methods *.sys* -processName explorer;cmd -ShowModuleInfo

The yellow output comes from an internal database of ETWAnalyzer which has categorized pretty much all AV drivers. The high wait time comes from WDFilter.sys, which is the antivirus file system filter driver that is part of Windows Defender.

What have we got?

  • We can record system wide ETW data with WPR or any other ETW recording tool, e.g. https://github.com/Alois-xx/etwcontroller/, which can capture screenshots to correlate the UI with profiling data
  • With ETWAnalyzer we can extract aggregated data for
    • CPU
    • Disk
    • File
    • .NET Exceptions
  • We can identify CPU and wait bottlenecks going from a system wide view down to method level
    • Specific features to estimate AV overhead are present

This is just the beginning. You can check out the documentation for further inspiration on how you can make use of system wide profiling data which can be queried from the command line (https://github.com/Siemens-Healthineers/ETWAnalyzer).

ETWAnalyzer will not supersede WPA for analyzing data. ETWAnalyzer's main strength is analyzing known patterns quickly in large amounts of extracted ETW data. Querying 10 GB of ETW data turns into querying a few hundred MB of extracted JSON data, with no slow symbol server lookups anymore.

Happy bug hunting with your new tool! If you find bugs or have ideas on how to improve the tool, it would be great if you filed an issue.