Windows 10 April update can cause data corruption

Starting with Windows 10 KB 5025221 I had issues attaching files to mails in Outlook. Drag and drop became unreliable and would sometimes silently do nothing. When I tried the Attach File dialog it would work most of the time, but sometimes fail with:

Update

MS has listed a Known Issue Rollback for this KB, which is an MSI that uninstalls the specific portions of the problematic update. The detailed description shows that only CopyFile is affected but not MoveFile. My tests confirm this. More details about KIRs can be found at https://techcommunity.microsoft.com/t5/windows-it-pro-blog/known-issue-rollback-helping-you-keep-windows-devices-protected/ba-p/217683.

That was annoying but not really problematic. More serious was that sometimes I could not open attached PDF files. This resulted in a file not found error box. The issue was sporadic, and an Outlook restart always fixed it. But sometimes I ended up with empty files

which was not so good. Also, in some mails the attachments (PDFs) were set to 0 bytes, which was real data corruption.

Reason enough to drill deeper into what was going on. The first thing I would take is a procmon trace. If someone interferes (antivirus is always a good suspect) we should see some access denied or other file deletion events there. After a first look I did not find a single failed IO operation in the procmon traces, which was strange.

At least I learned how Outlook works. When you double click an attachment in Outlook it copies the data from the mail database (.pst/.ost file) to a temp file located in

C:\Users\%username%\AppData\Local\Microsoft\Windows\INetCache\Content.Outlook\????\Src.pdf

Then it copies the file again into …\Src (002).pdf, which is the file that is then opened by the actual PDF reader.

When the reader is not just a reader but a writer, Outlook will check if the second copy differs from the first copy and will ask if you want to save the changes back into Outlook. If you do this manually you can still go back by asking Outlook for an earlier version which is still present. So far so good. But in the corrupted mails no such go back option was present.

Dragging and dropping a file from Explorer into a new mail works similarly. First the file is copied to %temp%\xxxx.tmp, from there it is copied again to C:\Users\%username%\AppData\Local\Microsoft\Windows\INetCache\Content.Outlook\????\YourFile.pdf, and then it is finally stored in your mail database .pst/.ost file.

Outlook copies quite a few files around, which is why it uses the default Windows CopyFile method. When looking deeper with ETW tracing I could compare a good vs a bad attach run. They differed in the duration of the CopyFile operation, which in the bad case mysteriously failed and ran much shorter than in the good case:

|- OUTLOOK.EXE!HrSaveAttachmentsWorker
|    OUTLOOK.EXE!HrShowSaveAttachDialog
|    |- OUTLOOK.EXE!HrInvoke
|    |    OUTLOOK.EXE!HrInvokeHelper
|    |    OUTLOOK.EXE!StdDispatch::Invoke
|    |    OUTLOOK.EXE!_RenDispatchCall
|    |    OUTLOOK.EXE!AttachmentObject::HrSaveAsFile
|    |    |- OUTLOOK.EXE!AttachmentObject::HrCopyTempFileToLocation
|    |    |    KernelBase.dll!CopyFileW
|    |    |    KernelBase.dll!CopyFileExW
|    |    |    |- KernelBase.dll!BasepCopyFileExW
|    |    |    |- KernelBase.dll!CloseHandle
|    |    |- OUTLOOK.EXE!FDoesFileExistW
|    |- OUTLOOK.EXE!RenFFDialog::HrGetPath
|- OUTLOOK.EXE!DisplayErrorContext

The good thing is that since a few months Microsoft also delivers public symbols for its Office products, which read much better than Ordinal+xxxx stack traces. Since it is not possible to see more with ETW profiling, one needs to debug Outlook. Luckily this is possible now with the recently released new WinDbg, which allows you to attach to a running process and capture the executed CPU instructions and accessed memory. The way this works is fascinating: at debugging start a full memory dump is taken, then CPU tracing is enabled which captures the complete CPU instruction stream. The end result is not a static memory dump but a memory dump in which you can debug forward and backwards! For more infos see https://aloiskraus.wordpress.com/2017/09/26/it-is-time-for-time-travel/. With that I could record a failed attach incident and send it to MS support for further analysis of where it did fail. The answer was that a regression happened in the mentioned KB update which kicks in when the following conditions occur:

  1. Windows 10 with KB 5025221 from April or May Update
  2. Process is 32 bit
  3. Process is Large Address Aware
  4. It calls into CopyFile / MoveFile
  5. A pointer with an address > 2 GB is passed to NtCreateFile

When all of these conditions were met, the CreateFile call bailed out early, even before procmon or other filter drivers were able to catch the return code of the failed CreateFile call. Quite a nasty thing, which also explains why it took so long to find out.

With that knowledge we can examine the issue a bit further by writing a small test application which calls CopyFile in a 32 bit process that has allocated more than 2 GB of memory. The easiest way to achieve this is with a few lines of C# like this:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;

namespace FileCopyBug
{
    internal class Program
    {
        static void Main(string[] args)
        {
            if (args.Length < 2)
            {
                Console.WriteLine("FileCopyBug src dest [-noalloc]");
                return;
            }

            Queue<string> queue = new Queue<string>(args);
            bool bAlloc = true;
            string src = null;
            string dest = null;

            while (queue.Count > 0)
            {
                string arg = queue.Dequeue();
                if (arg == "-noalloc")
                {
                    Console.WriteLine("Skip allocations.");
                    bAlloc = false;
                }
                else if (src == null)   // first positional argument is the source file
                {
                    src = arg;
                }
                else if (dest == null)  // second positional argument is the destination file
                {
                    dest = arg;
                }
            }

            if (bAlloc) // allocate > 2 GB of memory to force pointers > 2 GB
            {
                int n = 0;
                const int ChunkBytes = 500;  // allocate small to not leave holes
                const int nMax = 4600000;
                while (++n < nMax)
                {
                    Marshal.AllocHGlobal(ChunkBytes);
                }
            }

            Console.WriteLine($"Allocated memory: {Process.GetCurrentProcess().PrivateMemorySize64:N0} bytes");

            // Call KernelBase.dll CopyFile which exhibits the bug when pointers > 2 GB are used for Extended Attributes to NtCreateFile which will bail out with an access violation c0000005 as NTSTATUS return code
            File.Copy(src, dest, true); 
        }
    }
}

The compiled binary and code can be downloaded from my OneDrive here. If you are running Windows 10 with the latest updates you can try to copy one file to a new location just like with any copy tool:

c:\tmp>FileCopyBug.exe src.pdf dsg.pdf
Allocated memory: 2,485,137,408 bytes

Unhandled Exception: System.IO.IOException: Invalid access to memory location.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.File.InternalCopy(String sourceFileName, String destFileName, Boolean overwrite, Boolean checkHost)
   at System.IO.File.Copy(String sourceFileName, String destFileName, Boolean overwrite)
   at FileCopyBug.Program.Main(String[] args)

That is really an Access Violation code, returned by NtCreateFile and correctly propagated by CopyFile. When you add the parameter -noalloc no additional memory is allocated and the copy operation works. I have a machine with the January Windows 10 version around where the CopyFile issue is not present. Windows 11 also works.

Since we have come this far we can now debug our test application directly and see where this leads.

The steps are

  1. Create a breakpoint for CopyFile (bp kernelbase!CopyFileW)
  2. Run until we enter CopyFile (g)
  3. Create a breakpoint for NtCreateFile (bp ntdll!NtCreateFile)
  4. Run until we hit NtCreateFile for the source file (g)
  5. Print file name dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
  6. Leave method to check return code (gu), return code is in eax. 0 is success.
  7. Run until we hit NtCreateFile for the destination file (g)
  8. Print file name dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
  9. Leave method to check return code (gu), return code is in eax with c0000005 which is the error we are after.
0:001> bp kernelbase!CopyFileW
0:001> g
Breakpoint 0 hit
Time Travel Position: FBCD:1ED4
eax=28c67690 ebx=02af91b4 ecx=82e96558 edx=02b50000 esi=5a13bdb0 edi=02afa810
eip=76a137f0 esp=02af5028 ebp=02af52b8 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200206
KERNELBASE!CopyFileW:
76a137f0 8bff            mov     edi,edi
0:006> bp ntdll!NtCreateFile
0:006> g
Breakpoint 1 hit
Time Travel Position: FBCE:72E
eax=02af4900 ebx=00000000 ecx=00000005 edx=02af474c esi=80000000 edi=80000000
eip=77c13000 esp=02af45ac ebp=02af4f8c iopl=0         nv up ei ng nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200282
ntdll!NtCreateFile:
Unable to load image C:\Program Files (x86)\Microsoft Office\root\Office16\AppVIsvSubsystems32.dll, Win32 error 0n2
77c13000 e96b11cfdf      jmp     AppVIsvSubsystems32!vfs_hooks::hooked_NtCreateFile (57904170)
0:006> dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
   +0x000 Length           : 0x18
   +0x004 RootDirectory    : (null) 
   +0x008 ObjectName       : 0x02af4864 _UNICODE_STRING "\??\C:\Users\...\AppData\Local\Microsoft\Windows\INetCache\Content.Word\src.pdf"
   +0x00c Attributes       : 0x40
   +0x010 SecurityDescriptor : (null) 
   +0x014 SecurityQualityOfService : (null) 

0:006> gu
Time Travel Position: FBD2:640
eax=00000000 ebx=00000000 ecx=579026f4 edx=02b50000 esi=80000000 edi=80000000
eip=76a1521e esp=02af45dc ebp=02af4f8c iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200246
KERNELBASE!BasepCopyFileExW+0x187e:
76a1521e 898584f9ffff    mov     dword ptr [ebp-67Ch],eax ss:002b:02af4910=00000000
0:006> g
Breakpoint 1 hit
Time Travel Position: FBED:77
eax=02af2b8c ebx=00000000 ecx=00000005 edx=10000044 esi=c0150081 edi=00002020
eip=77c13000 esp=02af25d8 ebp=02af4598 iopl=0         nv up ei ng nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200286
ntdll!NtCreateFile:
77c13000 e96b11cfdf      jmp     AppVIsvSubsystems32!vfs_hooks::hooked_NtCreateFile (57904170)
0:006> dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
   +0x000 Length           : 0x18
   +0x004 RootDirectory    : (null) 
   +0x008 ObjectName       : 0x02af27a8 _UNICODE_STRING "\??\C:\Users\...\AppData\Local\Microsoft\Windows\INetCache\Content.Outlook\XNKF02BJ\src (002).pdf"
   +0x00c Attributes       : 0x40
   +0x010 SecurityDescriptor : (null) 
   +0x014 SecurityQualityOfService : 0x02af3f18 Void
0:006> gu
Time Travel Position: FBFA:1B8
eax=c0000005 ebx=00000000 ecx=579026f4 edx=02b50000 esi=c0150081 edi=00002020
eip=76a178c2 esp=02af2608 ebp=02af4598 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200246
KERNELBASE!BaseCopyStream+0xd01:
76a178c2 8985e0e5ffff    mov     dword ptr [ebp-1A20h],eax ss:002b:02af2b78=00000000

The nice thing about the x86 WinAPI calling convention is that all parameters are on the stack. To see all parameters passed to an exported method with a signature of 11 arguments

__kernel_entry NTSTATUS
NTAPI
NtCreateFile (
    OUT PHANDLE FileHandle,
    IN ACCESS_MASK DesiredAccess,
    IN POBJECT_ATTRIBUTES ObjectAttributes,
    OUT PIO_STATUS_BLOCK IoStatusBlock,
    IN PLARGE_INTEGER AllocationSize OPTIONAL,
    IN ULONG FileAttributes,
    IN ULONG ShareAccess,
    IN ULONG CreateDisposition,
    IN ULONG CreateOptions,
    IN PVOID EaBuffer OPTIONAL,
    IN ULONG EaLength
    );

I can break on method enter and then dump the next 12 DWORDs (the first value is the return address) starting at the stack pointer esp, which gives me the passed method arguments directly:

0:006> dd @esp L0n12
02af25d8  76a17b41 02af2b8c 40150080 02af27ec
02af25e8  02af2a00 02af4ef0 00002020 00000000
02af25f8  00000005 10000044 02af29b8 00000010

This translates to

NtCreateFile(
    FileHandle Pointer = 02af2b8c,
    DesiredAccess = 40150080
       GENERIC_WRITE|SYNCHRONIZE|WRITE_DAC|DELETE|FILE_READ_ATTRIBUTES,
    ObjectAttributes Pointer = 02af27ec,
    IoStatusBlock Pointer = 02af2a00,
    AllocationSize Pointer = 02af4ef0,
    FileAttributes = 00002020
       FILE_ATTRIBUTE_NORMAL|FILE_ATTRIBUTE_VIRTUAL|FILE_ATTRIBUTE_EA|FILE_ATTRIBUTE_UNPINNED,
    ShareAccess = 00000000,
    CreateDisposition = 5
       FILE_OVERWRITE_IF,
    CreateOptions = 10000044
       FILE_SEQUENTIAL_ONLY|FILE_NON_DIRECTORY_FILE|FILE_CONTAINS_EXTENDED_CREATE_INFORMATION,
    EaBuffer Pointer = 02af29b8,
    EaLength = 0x10)

We are looking at a Windows 10 machine, which makes me wonder why I find a FILE_CONTAINS_EXTENDED_CREATE_INFORMATION flag that is, according to the docs, a Windows 11 only feature. It looks like antivirus can scan less when it is clear that this is a local file, and it looks like this performance feature was backported to Windows 10 with the April update. To check that, we can dump the EaBuffer as an EXTENDED_CREATE_INFORMATION structure.

0:006> dt combase!PEXTENDED_CREATE_INFORMATION @esp+0n10*4
0x02af29b8 
   +0x000 ExtendedCreateFlags : 0n2
   +0x008 EaBuffer         : 0x81811fe8 Void
   +0x00c EaLength         : 0x2382

That sure looks like a valid structure with the EX_CREATE_FLAG_FILE_DEST_OPEN_FOR_COPY flag set. Now we have also found the EaBuffer pointer which is > 2 GB (0x80000000) and which causes NtCreateFile to (wrongly) return an Access Violation return code.
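
For reference, here is a rough C# picture of that structure's layout as it appears in the 32 bit process. This is a sketch based on the dt output above, not an official header definition; the flag value 0x2 is the one the text identifies as EX_CREATE_FLAG_FILE_DEST_OPEN_FOR_COPY.

using System;
using System.Runtime.InteropServices;

// Sketch of the layout shown by the dt output above (32 bit process, IntPtr = 4 bytes).
[StructLayout(LayoutKind.Sequential)]
struct EXTENDED_CREATE_INFORMATION
{
    public long   ExtendedCreateFlags;  // +0x000, here 0n2 = EX_CREATE_FLAG_FILE_DEST_OPEN_FOR_COPY
    public IntPtr EaBuffer;             // +0x008, the pointer which is > 2 GB in the failing case
    public uint   EaLength;             // +0x00c
}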

Conclusions

The issue presented here is not specific to 32 bit Outlook; it can affect all 32 bit applications which make use of CopyFile/MoveFile. I have not studied the effect on Word/PowerPoint/Excel, but similar issues might occur. At least you now know how you can find out with a debugger whether you are affected. An alternative approach would be to enable ETW SysCall tracing, which allows you to see the return code of every called kernel method. Then you can scan for NtCreateFile calls with a return code of 0xC0000005, which should never happen.

Let's hope that a fix is released soon to make these strange sporadic failures go away.


WPA fails with 0x80070032

Sometimes I get ETW traces which cannot be loaded by WPA. Initially I found these on Windows Server 2022 on a VMware machine. I suspected some issue with how VMware virtualizes things on newer OSes, but I got this error also on my new non-virtual Windows 11 box. After asking around it turned out that this is a side effect of .NET 7 processes. When a .NET 7 process is running and you use the default WPR profiles you will get unreadable traces back.

When you try to process them with the TraceProcessing library you get something like:

C:\Temp>etwanalyzer -extract all -fd test.etl 
1 - files found to extract.

Error: One or more errors occurred. (The request is not supported. (0x80070032)) Check C:\Utils\ETWAnalyzer\ETWAnalyzer_Trace.log for full details, or use -debug switch to get full output.


Error:   EventSink {39cda801-59d8-472d-bc00-601a4150c4c4} returned HRESULT 0x80070032 on:
         Event#: 294800   (T#0:#294800)
         TimeStamp:    6644961, Process: 12172, Thread: 8008, Cpu:  10
         ProviderId: {a669021c-c450-4609-a035-5af59af4df18}
         EventDescriptor: 0x0096 0x0001 0x2a   0x01    0x00    0x05  0x0000000000020000
         UserDataLength: 405

The default wpr profiles are affected, but for some reason my custom WPR profiles (https://github.com/Alois-xx/FileWriter/blob/master/MultiProfile.wprp) are not. The only way to solve this is to either use custom WPR profiles or to kill all .NET 7.0 processes. But which ones?

With ETWAnalyzer this is easy:

Etwanalyzer -fd c:\temp\extract\ManySlowCalls_MAGNON.03-07-2023.23-21-41.json  -dump version -dll *coreclr* -versionfilter *7.0* 

Ahh, PhoneExperienceHost.exe is the bad guy. As a workaround all you need to do is kill PhoneExperienceHost and you should get analyzable ETW traces again.

A future WPA/TraceProcessing version should contain the fix, but until then you need to use one of the workarounds.

War Story: How Antivirus solutions can bring a server down

From a performance point of view Antivirus scanners are the worst things that have happened to computers during the last ca. 5 years. Sometimes it is hard to differentiate the CPU activity of an AV scanner from a bitcoin miner. Both are eating your CPU to give you either virtual security or virtual money.

Whenever something is mysterious and slow I get a call. In this case a beefy Windows Server was working happily, but a few times a day everything would freeze for up to 5 minutes. The solution to get back to speed was to reboot the server when it got stuck. When the server got stuck, it was at 100% CPU for several minutes. That sounds like an obvious CPU problem. Identify the problematic process and uninstall the app which is causing problems.

To identify the issue, ETW traces were taken. Everything was working fine until the ETW profiling events suddenly stopped in the recording. Several minutes later the CPU problem started. Was someone tinkering with the ETW traces? Shortly before the trace stopped I saw an installer for the Palo Alto Cortex AV solution executing. That was interesting. Normally I record data with xperf, which uses the NT Kernel Logger ETW session. Xperf supports another kernel session named Circular Kernel Context Logger which has been there since at least Windows 7. When running ETW tracing for a longer time I record with xperf into a file which is stopped every few minutes and profiling is then restarted (no merging of the files is done). That way I can later pick the interesting files from the many recorded ones when the issue is reported again. Although I default to a 10 ms sample rate, one still needs up to 40 GB of disk space to record a full day.

The Issue occurred again

Below is a picture of total CPU consumption over a period of two minutes. Green is Idle (meaning the CPU is free) and red is the CPU consumed by the Windows Sleep API call, which needs ca. 60% of all cores for more than one minute. Why is sleeping consuming so much CPU?

There is a specific workload which occurs only on many core servers: spinlocks. Spinlocks are locks which burn some CPU while spinning (= trading CPU for decreased latency) before the spinning threads enter the expensive OS lock. The usual implementation of a spinlock uses several stages (a small sketch follows after the list):

  1. Check if lock can be acquired
  2. If not
    • For(i..n)
      • Burn some cpu cycles by calling pause instruction in a second loop
      • Check if lock can be acquired and exit loop
      • Call Sleep(0) to give another thread a chance to do some useful work
      • Check if lock can be acquired and exit loop
    • Enter OS lock
  3. Lock is acquired
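
As a rough illustration, here is a minimal C# sketch of such a spin loop. It shows the general pattern described above, not the Intel OpenMP or Windows implementation:

using System.Threading;

// Minimal sketch of the spin stages described above. A real spinlock implementation is more
// elaborate and falls back to an OS lock/event after a bounded number of spin rounds.
class SpinThenSleepLock
{
    int _taken;   // 0 = free, 1 = taken

    public void Enter()
    {
        while (true)
        {
            // Check if the lock can be acquired
            if (Interlocked.CompareExchange(ref _taken, 1, 0) == 0)
                return;

            // Burn some CPU cycles (pause instruction loop) ...
            Thread.SpinWait(100);

            // ... and give another ready thread a chance to do some useful work.
            // This Sleep(0) is the call which shows up with huge CPU in the traces.
            Thread.Sleep(0);
        }
    }

    public void Exit() => Interlocked.Exchange(ref _taken, 0);
}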

Some high performance threading libraries like the Intel OpenMP library use excessive amounts of CPU spinning (up to 200 ms) so they almost never have to take a lock, buying, at insane power consumption costs, a few microseconds of faster algorithm execution. This means if you let one empty task run every 200 ms you will get this CPU graph when using the Intel OpenMP library:

Sure, your CPU looks super busy, but did you perform much useful work? Not at all. The Intel OpenMP library makes it look like you are doing heavy number crunching and you need a bigger processor to take that load.

If you set the environment variable KMP_BLOCKTIME=0 then you get a flat CPU profile, as it should always have been. The decision of Intel to waste 200 ms of CPU cycles by default is highly questionable, and no one, after profiling the CPU consumption, would leave it at the default value.

What has this to do with virus scanners? They have the habit of intercepting system calls in creative ways, by using OS callbacks, application detouring and ETW. E.g. Windows 11 23H2 has instrumented many rarely used system calls with ETW trace points to make it easier to detect unusual application activity.

Interlude – SysCall Tracing

What can you do if you want to see every operating system call performed by an application? On Linux you can run the application with the strace tool to get a nice console output of all called operating system calls. On Windows no comparable tool exists. But there is a kernel ETW provider to enable SYSCALL tracing, which not many people are aware of.

In a wpr profile you enable SYSCALL tracing by adding the SystemCall keyword, and if you want you can even get the stack trace for each call (warning: high data volume!). The logged events alone are not that useful, because they are missing the process and thread which issued the call. To make these events useful you also need to record Context Switch events, which record how long a thread was running on a given CPU.

Below is a sample snippet of a typical .wprp recording profile for wpr to enable SysCall tracing:

...
<Keywords>
   <Keyword Value="SystemCall"/>
   <Keyword Value="CSwitch"/>
</Keywords>
<Stacks>
  <Stack Value="SystemCallEnter"/>
</Stacks>

A full SysCall recording profile is located in the FileWriter project. To use it, download the MultiProfile.wprp with curl which is part of Windows. Then you can use it like this:

curl https://raw.githubusercontent.com/Alois-xx/FileWriter/master/MultiProfile.wprp > MultiProfile.wprp
wpr -start MultiProfile.wprp!SysCall
wpr -stop c:\temp\Wpr_SysCall.etl

Or if you prefer xperf:

xperf -on base+syscall+CSWITCH -stackwalk SyscallEnter 
...
xperf -stop 
xperf -merge c:\kernel.etl SyscallTracing.etl

The merge step is needed to get method names, otherwise you just get addresses in the resulting trace. You can now open the trace with WPA (Windows Performance Analyzer). Under System Activity – Syscalls a new graph pops up when you load such a trace, which allows you to analyze the call duration, frequency, return value and other things:

Unlike strace on Linux you cannot get the method input arguments, but you can at least get the return code, which is better than nothing.

We can also dump the data with the help of the TraceProcessing library with a few lines of code:

using Microsoft.Windows.EventTracing;
using Microsoft.Windows.EventTracing.Symbols;
using Microsoft.Windows.EventTracing.Syscalls;
using System.Diagnostics;
namespace strace
{
    internal class Program
    {
        static void Main(string[] args)
        {
            var processor = TraceProcessor.Create(args[0], new TraceProcessorSettings
            {
                AllowLostEvents = true,
            });
            string symServer = "*http://msdl.microsoft.com/download/symbols";
            if (args.Length > 1 && args[1] == "-nosymserver")
            {
                symServer = ""; // Loading MS symbols makes loading the trace much slower. Once loaded with symbols you can omit them
            }
            
            var symbols = processor.UseSymbols();
            var syscalls = processor.UseSyscalls();
            var sw = Stopwatch.StartNew();
            Console.WriteLine("Loading trace");
            processor.Process();
            Console.WriteLine($"Trace loaded in {sw.Elapsed.TotalSeconds:F0} s");
            sw = Stopwatch.StartNew();
            Console.WriteLine("Loading symbols");
            SymbolPath symPath = new($@"SRV*C:\Symbols{symServer}");
            SymCachePath symCache = new(@"C:\SymCache");
            symbols.Result.LoadSymbolsAsync(symCache, symPath).Wait();
            Console.WriteLine($"Symbols loaded in {sw.Elapsed.TotalSeconds:F0} s");
            foreach(ISyscall call in syscalls.Result.Syscalls)
            {
                if( call.ReturnValue == null)  // this was a syscall enter call
                {
                    continue;
                }
                string? funcName = call?.StackFrame.Symbol?.FunctionName;
                string? image = call?.Process?.ImageName;
                if (funcName != null && image != null)
                {
                    Console.WriteLine($"{image}({call?.Process.Id}): {funcName} <- 0x{call?.ReturnValue:X}");
                }
            }
        }
    }
}

Our much less capable Windows strace tool can generate this output, e.g. for all non-succeeded (return code != 0x0) calls, here from a run where our own strace process crashed because it tried to access a non-existing file:

C>strace c:\syscallCSwitch.etl | findstr /i strace | findstr /v 0x0 
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtOpenKey <- 0xC0000034
strace.exe(17016): NtQuerySecurityAttributesToken <- 0xC0000225
strace.exe(17016): NtQuerySecurityAttributesToken <- 0xC0000225
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtOpenThreadTokenEx <- 0xC000007C
strace.exe(17016): NtSetInformationVirtualMemory <- 0xC00000BB
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtOpenSemaphore <- 0xC0000034
strace.exe(17016): NtQueryValueKey <- 0xC0000034
strace.exe(17016): NtSetInformationVirtualMemory <- 0xC00000BB

The output is interesting, but without the input arguments we cannot tell whether the input was e.g. the wrong file, or whether the failed call was even trying to access our file at all or something entirely different.

Antivirus Problem

Let's get back to our problem of high CPU in a Sleep call. When we drill deeper into the Sleep call CPU consumption we find in the stack trace (the numbers are CPU ms)

KernelBase.dll!SleepEx 2 522 541 
|- ntdll.dll!NtDelayExecution 2 510 323.140645
 |    |- ntoskrnl.exe!KiSystemServiceExitPico 2 489 111
 |    |    |- ntoskrnl.exe!PerfInfoLogSysCallEntry 1 387 533
 |    |    |    |- ntoskrnl.exe!EtwTraceKernelEvent 1 386 492
 |    |    |    |    |- ntoskrnl.exe!EtwpLogKernelEvent 1 384 792
 |    |    |    |    |    |- tedrdrv.sys!<PDB not found> 1 251 146
 |    |    |    |    |    |    |- ntoskrnl.exe!ExfReleaseRundownProtection 435 011
 |    |    |    |    |    |    |- ntoskrnl.exe!ExfAcquireRundownProtection 431 336
 |    |    |    |    |    |    |- tedrdrv.sys!<PDB not found> 343 058
 |    |    |    |    |    |    |- tedrdrv.sys!<PDB not found><itself> 41 649
 |    |    |- ntoskrnl.exe!PerfInfoLogSysCallExit 1 061 081
 |    |    |    |- ntoskrnl.exe!EtwTraceKernelEvent 1 059 181
 |    |    |    |    |- ntoskrnl.exe!EtwpLogKernelEvent 1 056 491
 |    |    |    |    |    |- tedrdrv.sys!<PDB not found> 916 549

a call to PerfInfoLogSysCallEntry and PerfInfoLogSysCallExit. If we run the numbers, the CPU consumption of SleepEx itself is not our problem: of the 2 522 541 ms total, about 2 448 614 ms (1 387 533 + 1 061 081) are spent logging the syscall entry/exit ETW events. Due to ETW tracing the costs of SleepEx are 33 times higher than they would normally be!

But that is still not the full story. Inside the ETW logger we find a call to tedrdrv.sys, for which we have no symbols. The vast majority of the costs come from this driver, which calls into ExfAcquire/ReleaseRundownProtection, part of an API that was added so antivirus drivers can safely update themselves at runtime.

With ETWAnalyzer we can dump the module information for all dlls which have no symbols (= most likely non-MS dlls/drivers) and which show up in expensive stack traces.

EtwAnalyzer %f% -Dump CPU -TopN 5 -ShowTotal File -ProcessName !*idle* -SortBy CPU -TopNMethods 2 -MinMaxCPUms 100000 -Methods *.dll;*.sys -ShowModuleInfo    

ETWAnalyzer has an internal list of nearly all antivirus drivers. If a driver from that well-known list is present in the output it shows up as yellow text. With -ShowModuleInfo (-smi) we find that tedrdrv.sys is indeed the Palo Alto driver which has intercepted the ETW traces and causes the high CPU consumption. The plot thickens, but we still do not have a smoking gun. When profiling is stopped my script always captures the output of xperf -Loggers to see which ETW loggers were running, to check if profiling was working properly shortly before we stop our session. The output for the NT Kernel Logger is the smoking gun:

	Logger Name           : NT Kernel Logger
	Logger Id             : ffff
	Logger Thread Id      : 0000000000003DC4
	Buffer Size           : 8192
	Maximum Buffers       : 118
	Minimum Buffers       : 96
	Number of Buffers     : 118
	Free Buffers          : 70
	Buffers Written       : 224156
	Events Lost           : 0
	Log Buffers Lost      : 0
	Real Time Buffers Lost: 0
	Flush Timer           : 0
	Age Limit             : 0
	Log File Mode         : Secure PersistOnHybridShutdown SystemLogger
	Maximum File Size     : 0
	Log Filename          : PaloNull
	Trace Flags           : SYSCALL
	PoolTagFilter         : *

Remember that the initial ETW traces were fine until we suddenly got no more profiling events? Now we have found the reason: Palo Alto enables SysCall tracing by taking over the NT Kernel Logger ETW session, which was already in use by us. Not nice.

Now we have all the data to understand the high CPU issue.

When multithreaded code which uses spinlocks is running, Sleep(0) is called on all cores in a tight loop. This is how spinlocks work, as explained earlier. Since Palo Alto wants to know which kernel methods all applications are calling, they enable ETW tracing and hook the ETW trace call inside the kernel to get easy access to that information. Unfortunately the AV driver needs to access some protected kernel object via a “lock” with ExfAcquire/ReleaseRundownProtection. This “lock” is implemented as interlocked memory accesses, which are okay as long as not many threads are hammering the same memory location. That assumption is no longer true in a heavily multithreaded application. The CPU costs of Sleep(0) are now > 33 times higher, which drives the server to 100% CPU consumption for unusually long times (minutes) and makes users want to reboot that machine. The server is not responding, but it miraculously recovers again if you patiently wait just long enough.
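
To get a feeling for why a single interlocked counter does not scale on a many-core box, you can run a tiny test like the following C# sketch. This is a hypothetical micro-benchmark of my own, not related to the actual Palo Alto driver code:

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class InterlockedContention
{
    static long _shared; // all threads hammer the same memory location / cache line

    static void Main()
    {
        const int Increments = 5_000_000;
        foreach (int threads in new[] { 1, Environment.ProcessorCount })
        {
            _shared = 0;
            var sw = Stopwatch.StartNew();
            Parallel.For(0, threads, new ParallelOptions { MaxDegreeOfParallelism = threads }, _ =>
            {
                for (int i = 0; i < Increments; i++)
                {
                    Interlocked.Increment(ref _shared); // cross-core cache line ping-pong
                }
            });
            Console.WriteLine($"{threads,3} thread(s): {sw.ElapsedMilliseconds} ms for {(long)threads * Increments:N0} increments");
        }
    }
}

On a machine with many cores the cost per increment typically rises sharply compared to the single-threaded run, which is the same kind of contention the Sleep(0) storm triggers inside the driver's rundown protection.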

Conclusions

Antivirus solutions, multithreaded software and many core machines are a deadly mix for performance. The idea to intercept SysCalls might work decently on small machines, but if AV vendors try the same trick on server machines with multithreaded software things fall apart. I hope some AV vendors read this post and change their minds about the SysCall tracing approach. Since consumer machines get more and more cores this problem will only become bigger over the next years. Please stop intercepting SysCalls and let the machine do its work.

In the past AV only got in the way when files were accessed or processes were started. That can be dealt with by configuring the AV solution to exclude certain directories/processes to limit its impact. But if compute-bound workloads are affected in general, what should one do, except uninstall that AV software? I have had bad experiences with configuring AV solutions, because very often the overhead is still there although the configuration UI says that everything is turned off.

You can figure out the impact of AV with ETW, but it takes quite some time and a lot of data to get to that point. For the average user it is impossible to find such issues, unless they can do some A/B testing with/without the AV solution. That works, but if the next AV update brings in new tracing capabilities you will have a hard time figuring that out without ETW. It should not be so hard to quantify the impact of AV solutions given the number of times they cause issues.

The loss of productivity due to the high performance overhead of AV solutions is hard to quantify, but it is definitely not cheap. In my opinion all AV scanning solutions try hard to identify every threat, but fail to properly address the adverse performance effects caused by these additional scanning capabilities.

On the other hand all virus “vendors” will test with all currently used AV scanning solutions to get in undetected. In effect AV scanning solutions will be mostly able to detect past threats, but not the latest release of e.g. malware X. I wonder if it is cost effective to detect past threats with such a high overhead while new malware by definition is still able to get in.

If the scanning settings of AV solutions were tuned towards performance and not only for detection rate, our fleet of > 1 billion Windows computers would consume gigawatts less electrical power (conservative estimate: > 1 W of additional power * 10^9 machines). This is a global issue which is not on anyone’s agenda because it is so well hidden deep inside your computer, which still works but consumes more CPU than necessary. Many people talk about the negative environmental effects of Bitcoin mining, but no one talks about the environmental effects of AV scanning overhead.

I hope I have answered the mystery of a stuck server and raised new questions about whether it is really okay to turn on every AV scanning feature to stay safe. The tradeoff between computer safety and power consumption should be discussed more by security experts to arrive at decisions which are more balanced than is currently the case. A wild idea would be a virtual watt meter for every wasted watt consumed by AV scanners to make this cost explicit. You can only act on things you can measure, and this is currently not the case.

.NET Serialization Roundup 2022

The last test is from 2019, which is still accurate, but the world is changing and we have arrived at .NET 7.0. That is reason enough to spin up my test suite again and measure .NET 4.8, 3.1, 5.0, 6.0 and 7.0. The “old” articles are still relevant despite their age.

The following serializers are tested

Name | Home | Type
BinaryFormatter | .NET | Binary
– BinaryPack | https://github.com/Sergio0694/BinaryPack | Binary
Bois | https://github.com/salarcode/Bois | Binary
– Ceras | https://github.com/rikimaru0345/Ceras | Binary
DataContract | .NET | Xml
– FastJson | https://github.com/mgholam/fastJSON/ | Json
FlatBuffer | https://google.github.io/flatbuffers/ | Binary
GroBuf | https://github.com/skbkontur/GroBuf | Binary
– JIL | https://github.com/kevin-montrose/Jil | Json
Json.NET | http://www.newtonsoft.com/json | Json
JsonSerializer | .NET | Json
MemoryPack | https://github.com/Cysharp/MemoryPack | Binary
MessagePackSharp | https://github.com/neuecc/MessagePack-CSharp | Binary
– MsgPack.Cli | https://github.com/msgpack/msgpack-cli | Binary
Protobuf.NET | https://github.com/mgravell/protobuf-net | Binary
Google Protobuf | https://github.com/protocolbuffers/protobuf | Binary
ServiceStack | https://github.com/ServiceStack/ServiceStack.Text | Json
– SimdJsonSharp | https://github.com/EgorBo/SimdJsonSharp | Json
SpanJson | https://github.com/Tornhoof/SpanJson | Json
SlimSerializer | https://github.com/azist/azos/tree/master/src/Azos/Serialization/Slim | Binary
Swifter.Json | https://github.com/Dogwei/Swifter.Json | Json
– UTF8Json | https://github.com/neuecc/Utf8Json | Json
XmlSerializer | .NET | Xml

Serializers with – in front are in maintenance mode.

New entries since 2019 are

  • BinaryPack
  • MemoryPack
  • Google Protobuf

Which of these 23 serializers are in maintenance mode? Conditions (logical or) for maintenance mode are

  • Last commit > 1 year
  • It is archived
  • All commits within one year changed no product code
  • The author tells you to not use it

Maintenance mode serializers at the point of writing are:

  • BinaryPack
  • Ceras
  • FastJson
  • JIL (2019 last commit)
  • MsgPack.Cli
  • SimdJsonSharp
  • UTF8Json

You can still look at them, but I would not use them for real products where future updates to e.g. .NET 8.0 or security fixes are mandatory.

What else is not recommended in 2022?

BinaryFormatter should not be used anywhere in production anymore because, if you deserialize attacker-controlled data, it allows an attacker to open a shell on your box. See https://learn.microsoft.com/en-us/dotnet/standard/serialization/binaryformatter-security-guide.

Several other serializers which are part of .NET Framework are also considered unsafe; the linked BinaryFormatter security guide lists them.

Used Benchmark

I wrote the SerializerTests test suite, which makes it easy to test all these serializers. If you want to measure for yourself, clone the repository, compile it, execute RunAll.cmd, and you get a directory with CSV files which you can graph as you like.

Tooling is getting better and I have also added profiling support so you can check oddities on your own: if you call RunAll.cmd -profile you get ETW traces besides the CSV files. This assumes you have installed the Windows Performance Toolkit from the Windows SDK.

Serialize Results

The winner in terms of serialize speed is again FlatBuffer which still has the lowest overhead, but you need to go through a more complex compiler chain to generate the de/serialization code.

*Update 1*

After discussions about how FlatBuffer should be fairly tested, I have changed how FlatBuffer performance is evaluated. Now we use an existing object (BookShelf) as input to create a FlatBuffer object. Additionally, it was mentioned that some byte buffer should also be part of the tested object. This moves FlatBuffer from rank 1 to 15, because the new test assumes you have an object you want to convert to the efficient FlatBuffer binary format. This assumption is reasonable, because few people will want to work with the FlatBuffer object directly. If you use the FlatBuffer object directly then the previous test results still apply.

The layout of the serialized Book object includes

  • int Id
  • string Title
  • byte[] Payload – before it was null, now it contains 10 bytes of data

Our BookShelf object, serialized as XML, looks like this:

<?xml version="1.0" encoding="utf-8"?>
<BookShelf xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Books>
    <Book>
      <Title>Book 1</Title>
      <Id>1</Id>
      <BookData>YWJjZGVmZ2hpag==</BookData>
    </Book>
  </Books>
</BookShelf>
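
The corresponding data classes look roughly like this. This is a sketch derived from the XML above; the real classes in the SerializerTests repository carry additional serializer-specific attributes:

using System.Collections.Generic;

public class Book
{
    public string Title { get; set; }
    public int Id { get; set; }
    public byte[] BookData { get; set; }   // the optional payload (10 bytes in the run above)
}

public class BookShelf
{
    public List<Book> Books { get; set; } = new List<Book>();
}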

Now we serialize nearly 40% binary data, which in this case contains just printable ASCII characters. This “simple” change moves the landscape quite a bit: GroBuf is first and MemoryPack second. This also includes an update of MemoryPack from 1.4.4 to 1.8.10. Whether GroBuf has less overhead (but double the serialized size, because it stores strings as UTF-16 while most others convert the data to UTF-8), or whether MemoryPack has some regression, I cannot tell. To reproduce the tests you can run them with

RunAll.cmd 10

to use a 10 byte buffer payload for each Book object. If you omit the 10 then you will serialize with 0 bytes of payload. The combined CSV file SerializationPerf_Combined_10.csv contains as suffix how much payload per Book object was used.

*Update 1 End*

The new entry MemoryPack is second. It is the newest serializer by Yoshifumi Kawai, who seems to write serializers for a living. He has several entries in this list, namely

  • MessagePackSharp
  • Utf8JsonSerializer

His latest MemoryPack library shows that you can still get faster once you know what things cost and what you can avoid without becoming too specialized. This one has pretty much all the features one could wish for, including a Source Generator for TypeScript to read the binary data inside the browser or wherever your TypeScript code is running.
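
To give an idea of how little ceremony MemoryPack needs, here is a minimal usage sketch of my own (not taken from SerializerTests), using a Book class like the one sketched above:

using MemoryPack;

[MemoryPackable]
public partial class Book   // MemoryPack generates the serialization code for this partial class
{
    public int Id { get; set; }
    public string Title { get; set; }
    public byte[] BookData { get; set; }
}

public static class MemoryPackDemo
{
    public static void Run()
    {
        var book = new Book { Id = 1, Title = "Book 1", BookData = new byte[10] };
        byte[] blob = MemoryPackSerializer.Serialize(book);        // no reflection at runtime
        Book back   = MemoryPackSerializer.Deserialize<Book>(blob);
    }
}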

Additionally he provides a helper to compress the data with Brotli, a very efficient compression scheme. This competes with Bois_LZ4, which is currently the only other serializer that also supports compression of the data. The Brotli algorithm is nice, but there is also zstd from Facebook, which should be a bit faster and would perhaps make a nice addition in the future.

That is really impressive!

From a higher level we find that the old .NET Framework is falling behind in terms of speed, and we see a nice step pattern which shows that every new .NET release has become faster than its predecessor.

Binary is good, but Json also has a large fan base. With .NET 7.0 Microsoft added Source Generators to generate the serialization code for your objects at compile time. See https://github.com/Alois-xx/SerializerTests/blob/master/Serializers/SystemTextJsonSourceGen.cs for how you can enable code generation for serialization. This feature generates code only for serialize, but uses pregenerated metadata for deserialize. The feature to add pregenerated deserialization code is tagged as future (post .NET 8?). If you think this is important you can upvote the issue. The main reason to add this was not only performance but also AOT scenarios where you have no JIT compiler.
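
For illustration, the wiring for the System.Text.Json source generator looks roughly like this. This is my own minimal sketch; the actual test code lives in the linked SystemTextJsonSourceGen.cs:

using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

public class Book      { public int Id { get; set; } public string Title { get; set; } }
public class BookShelf { public List<Book> Books { get; set; } = new List<Book>(); }

// The source generator emits serialization code/metadata for the listed types at compile time.
[JsonSerializable(typeof(BookShelf))]
internal partial class BookShelfJsonContext : JsonSerializerContext
{
}

public static class JsonSourceGenDemo
{
    public static byte[] Serialize(BookShelf shelf) =>
        // Writes UTF-8 directly, no UTF-16 string detour needed
        JsonSerializer.SerializeToUtf8Bytes(shelf, BookShelfJsonContext.Default.BookShelf);
}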

With pregenerated serialization code in SystemTextJsonSourceGen our BookShelf object is serialized 20+% faster compared to SystemTextJson, which uses the “normal” runtime-generated code. A significant part of the performance improvement seems to come from writing UTF-8 directly, without the need to convert a UTF-16 char stream to a UTF-8 byte stream.

Deserialize Results

*Update 1*

After deserialization of the FlatBuffer object we access all properties, which causes the FlatBuffer struct to create a new string or byte[] array on each access. AllocPerf does not use a binary blob yet; the numbers are therefore identical.

Even with these changes FlatBuffer wins the deserialize race by some margin. Second place goes to GroBuf instead of MemoryPack, but these two are really close and were already fighting for second place before. As mentioned before, GroBuf doubles the serialized data size, which allows it to omit the UTF-16 to UTF-8 conversion; that buys some speed at the expense of data size.

*Update 1 End*

Normally data is more often read than written. Deserialization needs to allocate all the new objects, which makes it largely GC bound. This is the prime reason why deserialize can never be faster than serialize: it needs to allocate new objects, while in the serialize case these already exist.

The fastest “serializer” is AllocPerf, which is not a serializer at all but an allocation test which walks over an array of binary data without parsing and constructs strings out of the data. This is a test of how fast one could get without any parsing overhead.

MemoryPack sets the mark again and is nearly overhead free. With so little overhead there must be a downside? Yes, you lose versioning support to some extent.

The runner-up after MemoryPack is BinaryPack, which does not even try to be version tolerant. It can be used for simple caches, but other than that I would stay away from it. And beware: it is in maintenance mode!

Another notable change in Protobuf_net is the deprecation of the AsReference attribute since 3.0, which many people are missing. It made it easy to deserialize object graphs with cyclic dependencies, which with Protobuf_net 3.0 is no longer the case.

The now maintenance-mode Utf8JsonSerializer is still 35% faster than SystemTextJson. There is still room for improvement for .NET.

First Startup Results

Besides the pure runtime performance, startup costs are also important. That is trickier to measure, since you need to pay the costs of reflection, code gen and JIT time inside a fresh executable, both for the precompiled (CrossGen/Ngen) and the non-precompiled scenario. If you do not use a new executable each time, you would attribute the costs of first-time init effects to the first called serializer in a process, which would introduce systematic bias. SerializerTests therefore starts a new executable for each serializer, three times (the last run is added to the CSV), to warm up the loaded dlls so we do not pay expensive hard page faults served from disk but get the data from the file system cache. It then serializes 4 different object types with one object each and takes the maximum value, which should be the first call. These details allow one to check whether only the first code gen is so expensive and subsequent code gen is much cheaper because many one-time init effects have already happened.
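
To make the idea of first-call costs concrete, here is a tiny standalone sketch (not the SerializerTests code) which compares the first, cold XmlSerializer call with a warm one:

using System;
using System.Diagnostics;
using System.IO;
using System.Xml.Serialization;

public class Book { public int Id { get; set; } public string Title { get; set; } }

class FirstCallDemo
{
    static void Main()
    {
        var book = new Book { Id = 1, Title = "Book 1" };

        var sw = Stopwatch.StartNew();
        var serializer = new XmlSerializer(typeof(Book));   // first use: reflection + code gen + JIT
        serializer.Serialize(Stream.Null, book);
        Console.WriteLine($"First (cold) call: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        serializer.Serialize(Stream.Null, book);             // warm call: generated code already exists
        Console.WriteLine($"Warm call:         {sw.ElapsedMilliseconds} ms");
    }
}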

The old .NET Framework 4.8 loses by a large margin here, but only if Ngen is not used. If the images are Ngenned then it is not too bad compared to crossgenned .NET Core binaries.

Is this the full story? Of course not. We are only measuring inside a small loop long after many things have happened, but since precompilation will cause all dependent dlls to be eagerly loaded by the OS loader, we had better look at the runtime of the process as a whole. This can show interesting … gaps in our testing methodology.

To look deeper you can run the test suite with ETW profiling enabled with

RunAll.cmd -profile

to generate ETW data while executing the tests. You need to download the Windows Performance Toolkit to generate the profiling data and ETWAnalyzer to easily query the generated data. From the RunAll.cmd command you get a bunch of CSV files, a combined CSV file, and some ETL files.

When you put ETWAnalyzer into your path you can extract the ETW file data into Json files, which are put into the Extract folder. To convert the ETL files into Json files you pass to the ETWAnalyzer command line -Extract All, the folder where the ETL files are located, and a symbol server, e.g. the predefined MS. You can also use NtSymbolPath to use your own symbol search path defined by the _NT_SYMBOL_PATH environment variable.

 ETWAnalyzer -Extract All -FileDir D:\Source\git\SerializerTests\SerializerTestResults\_0_14_06 -SymServer MS 

The extracted Json files have a total size of 112 MB vs 5 GB for the input ETL files, which is a good compression ratio! Now we can generate a nice chart with an Excel pivot chart to look at how long each of the 3 invocations of the startup tests ran:

And the winner is .NET 4.8! That is counterintuitive after we have measured that .NET 7 is faster everywhere. What is happening here? This chart was generated with input from ETWAnalyzer, which can dump process start/stop times, command lines and process durations into a CSV file:

ETWAnalyzer.exe -dump process -cmdline *firstcall#* -timefmt s -csv FirstCall_Process.csv

ETWAnalyzer can also dump the CPU consumption to the console so we can drill deeper.

Although the command line syntax looks complex, it is conceptually easy. You first select what to dump, e.g. CPU, Process, Disk, …, and then add filters and sorting. Because ETW records a lot of data you get a lot of sorting and filter options for each ETWAnalyzer command, which may look overwhelming, but it is just filtering, sorting and formatting after you have selected what to dump.

The following call dumps process CPU consumption for all processes with a command line that contains

  • firstcall and cross and datacontract to select the .NET Core crossgenned processes OR
  • firstcall and ngen and datacontract to select the .NET 4.8 Ngened SerializerTests calls

-ProcessFmt s prints process start/stop times in ETW recording time (time since session start) and the process duration. You can also format as your local time or UTC by using local or utc as time formats.

EtwAnalyzer -dump cpu -cmdline "*firstcall* cross*#datacontract*";*firstcall*ngen*#datacontract* -ProcessFmt s

After drilling into the data it became apparent that since .NET Core 3.1 we now have two additional serializers (MemoryPack and BinaryPack) which both have significant startup costs. These costs are always paid even if you never use them! I have created a branch of SerializerTests (noBinaryMemoryPack) to remove them so we can compare .NET 4.8 and .NET Core+ with the same number of dependencies.

Each .NET version has two bars; the lower one is crossgen and the higher one involves JIT compilation of the binaries. Since each process is started 3 times, the y-axis shows the sum of 3 process starts. .NET 7.0 seems to have become a bit slower compared to the previous versions. The winner is still .NET 4.8 (455 ms), but at least the difference is not as big as before (558 ms vs 972 ms).

It is very easy to fool yourself with isolated tests which do not measure the complete (system) picture. Measuring something and drawing a picture is easy. Measuring and understanding what the data means in full details is a much bigger endeavor.

Cool Console Tools

I work a lot with command line tools which produce wide output. The cmd shell has the nice feature that you can configure a horizontal scroll bar by setting the console width to e.g. 500. This works, but newer shells like Windows Terminal are becoming the norm, and they behave like Linux shells, which never had support for horizontal scroll bars. The issue to add a scroll bar was declined (https://github.com/microsoft/terminal/issues/1860), which is a pity. But there is an open source solution for it. A little known terminal extension named vtm offers all the features I need. When configuring my command prompt with vtm -r cmd

I get a Windows Terminal window which supports horizontal scrolling, plain text/RTF/HTML text copy, Ctrl+V (Ctrl+C not yet) and Unix style copying with a right mouse click. This makes working with ETWAnalyzer much easier, and you can also properly read the help, which is also wide output …

EtwAnalyzer -dump CPU -ProcessName SerializerTests.exe -ShowModuleInfo -TopN 1 -NoCmdLine

The command dumps, from all extracted files in the current directory, the SerializerTests executable with the highest CPU consumption and its module information, which includes file version, product name, and the directory from where it was started. That can be useful to check if e.g. your local IT has installed the updated software on your machine, or whether you need to call again to get the update rolled out. By making such previously hard-to-get information easily queryable I found that for some reason SerializerTests was running under a false flag as a protobuf product and had not had a file version update for a long time. Both issues are fixed, because ETWAnalyzer made it very obvious that something was wrong.

Conclusions

This is the end of this post, but hopefully the start of your journey to a better informed decision about which serializer is best suited for your needs. You should always measure in your environment with your specific data types to get numbers which apply to your scenario and not to my still synthetic one. When you execute RunAll.cmd dd you add to each Book object a byte array with dd bytes, which may change the performance landscape dramatically. This can simulate a test where you use a serializer as a wrapper over another, already serialized payload which has already generated a byte array. That should be cheap, in theory. But when larger byte arrays were used with the fastest Json serializer (UTF8Json), it became one of the slower ones. I tried hard to get the numbers right, but you still need to measure for yourself to get the right numbers for your use case and scenario.

Performance numbers are not everything. Only if the essential features are there, and you trust the library author to provide support for the foreseeable future, should you consider performance as the next item on your decision matrix when choosing a library for mission critical, long lived projects. If you are running a pet project, use whatever you see fit.

Slow network after adding a second WIFI Access Point

Networking is hard and many things can go wrong. I bought an Oculus Quest VR headset for my daughter. Soon I found it frustrating to connect it via a USB cable to my PC. To get rid of the cable I bought a cheap TP-Link Archer C5 AC1200 router which I connect via a LAN cable to the PC. After becoming wireless I can send realtime VR images over my dedicated WIFI to the VR headset at a stable 150 MBit/s.

The “real” WIFI (yellow) which is connected to the internet is located in another room and is not capable of transferring 150 MBit/s without interruptions, which would kill the VR experience. That is the reason why nearly everyone recommends using either a USB cable (inconvenient) or a LAN cable (blue) connected to a dedicated WIFI router for the VR headset.

That setup works great, but I noticed that whenever the second router was running, browsing the internet felt very slow. Every page refresh took a lot longer compared to when the second WIFI router was turned off.

Since all browsers have great development tools, I just needed to press F12 to see what was blocking my browser requests. In the example below I tried to access a far, far away web site, and the network tab shows for the first request a delay of 10,55 s for host name resolution (Dns).

On my box these timings were always around 10 s, so I knew that I was suffering from slow Dns queries to resolve the host name, e.g. http://www.dilbert.com, to the actual IP address. Knowing that page load times are blocked for 10+ seconds by Dns is great, but now what? Since I have two network tabs in Task Manager, I suspected that the DNS query was first sent via the LAN cable into a dead end, because my Oculus router is not connected to the internet, and only then via WIFI to the actual internet-connected router.

Initially I tried the route print command to see where and why packets were routed, but that looked too complicated for configuring the order of every local route. After a bit more searching I found the PowerShell command Get-NetIPInterface, which shows you over which network interface Windows sends network packets to the outside world. The relevant column is InterfaceMetric. If multiple network cards are considered, the network interface with the lowest InterfaceMetric number is tried first.

As expected I found my local Ethernet connection with InterfaceMetric 5 as the preferred connection to send network packets to the outside world. But which should be the other network? The easy way out is to cause some traffic in the browser and check in Task Manager in the Performance tab which network gets all the traffic:

I therefore need to swap the order of vEthernet (New Virtual Switch) with metric 15 and interface id 49, and vEthernet (New Virtual Switch) with metric 25 and interface id 10.

Set-NetIPInterface -InterfaceIndex 10 -InterfaceMetric 5
Set-NetIPInterface -InterfaceIndex 49 -InterfaceMetric 25
These two commands switch the order of the two relevant networks (ifIndex 10 and 49). How fast are things now? Internet surfing feels normal again, and the changed network ordering saved the day. There are still Dns delays, but nowhere near the 10 s region. Case solved.

The average reader can stop reading now.

What follows is a deep dive into how Dns works on Windows and how you can diagnose pretty much any Dns issue with the builtin Dns tracing of Windows with the help of ETWAnalyzer.

So far we have just looked at Dns as a black box, but you can trace all Dns requests with the help of ETW (Event Tracing for Windows) and the Microsoft-Windows-DNS-Client ETW provider. The easiest way to record the data is to use a recording profile like MultiProfile.wprp, which you can download e.g. via curl from the command line:

curl https://raw.githubusercontent.com/Alois-xx/FileWriter/master/MultiProfile.wprp > c:\temp\MultiProfile.wprp

Then start from an Admin command prompt tracing with

wpr -start c:\temp\MultiProfile.wprp!Network

Now surf to some web sites or do your slow use case. After you have found some slowdown you can stop recording with

wpr -stop c:\temp\SlowBrowsing_%ComputerName%.etl

Good naming is hard, but if you later want to revisit your recorded files you should try to come up with a descriptive name.

After recording the data you can extract a Json file with ETWAnalyzer by extracting everything (-extract all) or just Dns data with -extract DNS

C:>EtwAnalyzer -extract all -fd c:\temp\SlowBrowsing_Skyrmion.etl -symserver ms 
1 - files found to extract.
Success Extraction of c:\temp\Extract\SlowBrowsing_Skyrmion.json
Extracted 1/1 - Failed 0 files.
Extracted: 1 files in 00 00:02:40, Failed Files 0

ETWAnalyzer supports many different queries, e.g. for Disk, File, CPU and other things. Since 2.4.0.0 support for dumping Dns queries has been added. To show the aggregated top 5 slowest Dns queries you just need -fd (file or directory) with the file name of the extracted data and -dump Dns to dump all Dns queries. To select the top 5 add -topn 5, and you get this output printed per process:

C:\temp\Extract>EtwAnalyzer -dump DNS -topn 5 -fd SlowBrowsing_Skyrmion.json   
02.10.2022 17:47:32    SlowBrowsing_Skyrmion 
NonOverlapping       Min        Max  Count TimeOut DNS Query                                                             
       Total s         s          s      #
OVRServer_x64.exe(6624) 124 128
      12,038 s   0,000 s   12,038 s      4       1 graph.oculus.com                                                      
backgroundTaskHost.exe(6692)  -ServerName:BackgroundTaskHost.WebAccountProvider
      12,043 s   0,000 s   12,043 s      2       1 login.microsoftonline.com                                             
firefox.exe(13612) 
      12,047 s   0,021 s   12,047 s      2       1 d21lj84g4rjzla.cloudfront.net                                         
      12,052 s   0,000 s   12,052 s      2       1 loadus.exelator.com                                                   
      12,052 s   0,000 s   12,052 s      2       1 cdn.adrtx.net                                                         
Totals: 60,232 s Dns query time for 12 Dns queries

We find that we have indeed some slow Dns queries which all timed out after ca. 12s. The magenta process names are the processes issuing the DNS queries listed below them. To see more we can add -details to get the timing of every slow DNS query

C:\temp\Extract>EtwAnalyzer -dump DNS -topn 5 -details -fd SlowBrowsing_Skyrmion.json  
02.10.2022 17:47:32    SlowBrowsing_Skyrmion 
NonOverlapping       Min        Max  Count TimeOut DNS Query                                                             
       Total s         s          s      #
      12,038 s   0,000 s   12,038 s      4       1 graph.oculus.com                                                      
        2022-10-02 17:47:38.416 Duration:  0,000 s OVRServer_x64.exe(6624)                       DnsAnswer: 
        2022-10-02 17:47:38.417 Duration:  0,000 s OVRServer_x64.exe(6624)                       DnsAnswer: 
        2022-10-02 17:47:38.417 Duration: 12,038 s OVRServer_x64.exe(6624)                       TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: 2a03:2880:f21c:80c1:face:b00c:0:32c2;::ffff:31.13.92.49
        2022-10-02 17:47:38.417 Duration: 12,038 s OVRServer_x64.exe(6624)                       TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: 2a03:2880:f21c:80c1:face:b00c:0:32c2;::ffff:31.13.92.49
      12,043 s   0,000 s   12,043 s      2       1 login.microsoftonline.com                                             
        2022-10-02 17:47:42.020 Duration:  0,000 s backgroundTaskHost.exe(6692)                  DnsAnswer: 
        2022-10-02 17:47:42.020 Duration: 12,043 s backgroundTaskHost.exe(6692)                  TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: ::ffff:40.126.32.74;::ffff:20.190.160.14;::ffff:40.126.32.134;::ffff:40.126.32.72;::ffff:20.190.160.22;::ffff:40.126.32.136;::ffff:40.126.32.138;::ffff:40.126.32.68
      12,047 s   0,021 s   12,047 s      2       1 d21lj84g4rjzla.cloudfront.net                                         
        2022-10-02 17:47:54.203 Duration:  0,021 s firefox.exe(13612)                            DnsAnswer: 13.227.153.112;13.227.153.94;13.227.153.81;13.227.153.66
        2022-10-02 17:47:54.224 Duration: 12,047 s firefox.exe(13612)                            TimedOut: 192.169.0.1:53 Servers: 0.0.0.0;192.169.0.1;fe80::1 DnsAnswer: 2600:9000:225b:fc00:1e:b6b1:7b80:93a1;2600:9000:225b:3600:1e:b6b1:7b80:93a1;2600:9000:225b:2e00:1e:b6b1:7b80:93a1;2600:9000:225b:f000:1e:b6b1:7b80:93a1;2600:9000:225b:7200:1e:b6b1:7b80:93a1;2600:9000:225b:a400:1e:b6b1:7b80:93a1;2600:9000:225b:9200:1e:b6b1:7b80:93a1;2600:9000:225b:dc00:1e:b6b1:7b80:93a1
      12,052 s   0,000 s   12,052 s      2       1 loadus.exelator.com                                                   
        2022-10-02 17:47:42.091 Duration:  0,000 s firefox.exe(13612)                            DnsAnswer: 
        2022-10-02 17:47:42.091 Duration: 12,052 s firefox.exe(13612)                            TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: ::ffff:54.78.254.47
      12,052 s   0,000 s   12,052 s      2       1 cdn.adrtx.net                                                         
        2022-10-02 17:47:42.091 Duration:  0,000 s firefox.exe(13612)                            DnsAnswer: 
        2022-10-02 17:47:42.091 Duration: 12,052 s firefox.exe(13612)                            TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: ::ffff:34.248.125.130;::ffff:3.248.26.101
Totals: 60,232 s Dns query time for 12 Dns queries

You might ask yourself: What are NonOverlapping Totals? The ETW provider Microsoft-Windows-DNS-Client instruments dnsapi.dll!DnsQueryEx which supports synchronous and async queries. By default an IPV4 and an IPV6 query are started whenever you try to resolve a host name. If both queries need e.g. 10s and are started at the same time, the delay observed by the user is 10s (non overlapping time) and not 20s (which would be the sum of both query times). To estimate the observed user delay for each query ETWAnalyzer calculates the non overlapping query time, which should reflect the actual user delays much better. The summary time Totals also takes overlapping DNS query times into account, which makes it a better metric than a simple sum.
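To illustrate the idea with a minimal sketch (this is only an illustration of the concept, not ETWAnalyzer's actual implementation), the non overlapping time is the length of the union of all query intervals:

using System;
using System.Collections.Generic;
using System.Linq;

class NonOverlappingExample
{
    // Length of the union of possibly overlapping time intervals.
    static TimeSpan Union(IEnumerable<(DateTime Start, DateTime End)> intervals)
    {
        TimeSpan total = TimeSpan.Zero;
        DateTime coveredUntil = DateTime.MinValue;
        foreach (var (start, end) in intervals.OrderBy(i => i.Start))
        {
            DateTime effectiveStart = start > coveredUntil ? start : coveredUntil; // skip already counted overlap
            if (end > effectiveStart)
            {
                total += end - effectiveStart;
                coveredUntil = end;
            }
        }
        return total;
    }

    static void Main()
    {
        DateTime t0 = DateTime.Now;
        // IPV4 and IPV6 query for the same host, both taking 10s and started at the same time:
        // the user observed (non overlapping) delay is 10s, not 20s.
        var queries = new[] { (t0, t0.AddSeconds(10)), (t0, t0.AddSeconds(10)) };
        Console.WriteLine(Union(queries)); // 00:00:10
    }
}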

For more details please refer to https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/DumpDNSCommand.md.

The details view shows that every slow, timed out query has a red TimedOut Dns server name (192.169.0.1:53). This is the Dns server IP for which a Dns query timed out. By looking at the interfaces with ipconfig we can indeed verify that the new TPLink router was tried first and timed out, and after that the next network (WIFI) was tried, which finally succeeded.

If you are in an IPV6 network the Dns server IP might be a generic default Link Local IP which can be the same for several network interfaces. In that case you can add -ShowAdapter to show which network interfaces were tried. That should give you all the information you need to diagnose common Dns issues. Of course you can also use WireShark to directly get your hands on the Dns queries, but with WireShark you will not know which process was waiting for how long. The ETW instrumented DnsQueryEx Api will give you a clear process correlation and also show which queries (IPV4, IPV6) were tried. To get all details you can export the data to a CSV file (add -csv xxx.csv) and post process the data further if you need to. To select just one process (e.g. firefox) you can filter by process with -pn firefox, which is the shorthand notation for -processname firefox. For more information on the query syntax of ETWAnalyzer see https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/Filters.md.

If you have just context switch traces but no DNS Client traces, not all is lost. If you suspect DNS delays you can still query the thread wait times of all dns related methods with ETWAnalyzer, which also calculates the non overlapping wait and ready times summed over all threads.

EtwAnalyzer -dump cpu -pn firefox -fd WebBrowsingRouterOn_SKYRMION.10-06-2022.19-46-16 -methods *dns* -fld s s

If you have async queries in place, but you know that just one async operation was initiated, you can add -fld s s to show the time difference between the first and last occurrence of the method in the trace. This should give you a good estimate of how long that async method was running, provided it was called again upon completion (and consumed enough CPU so we could get a sample). For more details please refer to https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/DumpCPUCommand.md.

You can also use one specific method or event as zero point and let ETWAnalyzer calculate first/last timings relative to it, which can be useful to e.g. time methods relative to a key event like OnLoadClicked or something similar. That advanced feature is explained at https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/ZeroTime.md in much greater detail. It is also possible to correlate traces across machines without synchronized clocks by using different zero times for different files within one query and exporting the data to a CSV file for further processing.

That was a lot of information, but I hope I have convinced you that previously hard to get information is now much easier to access with ETWAnalyzer. It is not only useful for performance investigations; for forensic audits it might also be interesting to get an overview of which servers your machine tried to query.

PDD Profiler Driven Development

There are many extreme software development strategies out there. PDD can mean Panic Driven or Performance Driven Development. The first one should be avoided; the second is an MSR paper dealing with performance modelling techniques. Everyone writes tests for their software because they have proven to be useful, but why are so few people profiling their own code? Is it too difficult?

I have created a C# console project named FileWriter https://github.com/Alois-xx/FileWriter which writes 5000 100KB files (= 500MB) to a folder with different strategies. How do you decide which strategy is the most efficient and/or fastest one? As strategies we have

  • Single Threaded
  • Multi Threaded with TPL Tasks
  • Multi Threaded with Parallel.For
    • 2,3,4,5 Threads

The serial version is straightforward. It creates n files and writes a string into each file until the target file size is reached. The code is located at https://github.com/Alois-xx/FileWriter/blob/master/DataGenerator.cs, but the exact details are not important.
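In essence the single threaded strategy boils down to something like the following sketch (simplified, the real code in DataGenerator.cs differs in details):

using System.IO;

class SerialStrategySketch
{
    // Writes fileCount files of roughly fileSizeBytes each, one after the other.
    public static void CreateFiles(string folder, int fileCount = 5000, int fileSizeBytes = 100 * 1024)
    {
        Directory.CreateDirectory(folder);
        for (int i = 0; i < fileCount; i++)
        {
            using var writer = new StreamWriter(Path.Combine(folder, $"File_{i}.txt"));
            long written = 0;
            int lineCount = 0;
            while (written < fileSizeBytes)
            {
                string line = $"echo This is line {lineCount++}";
                writer.WriteLine(line);
                written += line.Length + 2; // rough byte count for ASCII content incl. CR/LF
            }
        }
    }
}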

Lets measure how long it takes:

As expected the serial version is slowest and the parallel versions are faster. The fastest one is the one which uses 3 threads. Based on real world measurements we have found that 3 threads is the fastest solution. OK problem solved. Move on and implement the next feature. That was easy.

But, as you can imagine there is always a but: Are the numbers we have measured reasonable? Lets do some math. We write 500 MB with the parallel version in 11s. That gives a write speed of 45 MB/s. My mental performance model is that we should be able to max out a SATA SSD which can write 430 MB/s.

This is off by nearly a factor of 10 from what the SSD is able to deliver. Oops. Where did we lose performance? We can try out the Visual Studio CPU Profiler:

That shows that creating and writing to the file are the most expensive parts. But why? Drilling deeper helps to some extent, but we do not see where it waited for disk writes because the CPU Profiler only shows CPU consumption. We need another tool. Luckily there is the free Concurrency Visualizer extension on the Visual Studio Marketplace, which can also show thread blocking times.

After installation you can start your project from the Analyze menu with your current debugging settings.

We try to stay simple and start FileWriter with the single threaded strategy which is easiest to analyze

FileWriter -generate c:\temp serial

In the Utilization tab we get a nice summary of CPU consumption of the machine. In our case other processes were consuming a lot of CPU while we were running. White is free CPU, dark gray is System and light gray are all other processes. Green is our profiled process FileWriter.exe.

The first observation is that once we start writing files other processes become very busy which looks correlated. We come back to that later.

The Threads view shows where the Main Thread is blocked due to IO/Waiting/…

You can click with the mouse on the timeline of any thread to get the corresponding stack trace in the lower view. In this case we are delayed e.g. by 2,309 s by a call to WriteFile. For some reason the .NET 6.0 symbols could not be resolved, but it might also just be a problem with my symbol lookup (there should be a full stack trace visible). Personally I find it difficult to work with this view when other processes are consuming the majority of CPU. But since Concurrency Visualizer uses ETW, there is nothing that can stop us from opening the Concurrency Visualizer generated ETW files from the folder C:\Users\\Documents\Visual Studio 2022\ConcurrencyVisualizer\ with WPA.

We see that FileWriter (green) starts off well, but then MsMpEng.exe (red = Defender Antivirus) kicks in and probably delays our File IO significantly. In the middle we see System (purple = Windows OS) doing strange things (Antivirus again?).

Do you remember that our SSD can write 430 MB/s? The Disk Usage view in WPA shows disk writes. Our process FileWriter writes 500 MB of data, but there are only 4 MB written in 53ms (Disk Service Time). Why are we not writing to the SSD? The answer is: The Operating System is collecting a bunch of writes and later flushes data to disk when there is time or the amount of written data exceeds some internal threshold. Based on actual data we need to revise our mental performance model:

We write to the OS write cache instead of the disk. We should be able to write GB/s! The actual disk writes are performed by the OS asynchronously.
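As a side note: if you ever want to take the OS write cache out of the equation, you can open the file with write-through semantics. FileWriter does not do this; the following is only a minimal sketch to illustrate the difference:

using System.IO;

class WriteThroughSketch
{
    // FileOptions.WriteThrough asks the OS to push the write to the device instead of
    // only to the file system cache, which makes the real disk speed visible.
    public static void WriteDurable(string path, byte[] data)
    {
        using var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None,
                                      bufferSize: 64 * 1024, FileOptions.WriteThrough);
        fs.Write(data, 0, data.Length);
        fs.Flush(true); // also flush intermediate buffers to disk
    }
}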

Still we are at 45 MB/s. If the disk is not the problem, then we need to look at where the CPU is spent. Lets turn to the gaps where the green line of FileWriter, writing on one thread to disk, vanishes. There seems to be some roadblock caused by the System process (purple), which is the only active process during that time of meditation:

When we unfold the Checkpoint Volume stacktag, which is part of my Overview Profile of ETWController, we find that whenever NTFS is optimizing the free bitmap of the NTFS volume it is traversing a huge linked list in memory which blocks all File IO operations for the complete disk. Do you see the bluish parts in the graph which WPA shows where NTFS Checkpoint volume samples are found in the trace? Automatic highlighting of the current table selection in the graph is one of the best features of WPA. The root cause is that we delete the old files of the temp folder in FileWriter from a previous run, which produces a lot of stale NTFS entries which, when many new files are created, at some point cause the OS to wreak havoc on file create/delete performance. I have seen dysfunctional machines with a high file churn (many files created/deleted per day) where all file create/delete operations took seconds instead of sub milliseconds. See a related post about that issue here: https://superuser.com/questions/1630206/fragmented-ntfs-hdd-slow-file-deletion-600ms-per-file. I have hit this issue several times when I cleaned up ETW folders, which contain for each ETL file a large list of NGENPDB folders with the .ni.pdb files. At some point Explorer is stuck for many minutes deleting the last files. The next day and a defrag later (the NTFS tables still need defragmentation, even for SSDs) things are fast again. With that data we need to revisit our mental performance model again:

Creating/deleting large amounts of files might become very slow depending on internal NTFS state, which lives in in-memory structures and manifests as high CPU consumption in the Kernel inside the NtfsCheckpointVolume method.

That view is already good, but we can do better if we record ETW data on our own and analyze it at scale.

How about a simple batch script that

  • Starts ETW Profiling
  • Executes your application
  • Retrieves the test execution time as return value from your application
  • Stops ETW Profiling
  • Writes an ETW file name containing the measured test duration and test case name
  • Extracts ETW data for further analysis

That is what the RunPerformanceTests.cmd script of the FileWriter project does. The tested application is FileWriter.exe, which writes files with different strategies to a folder. When it exits it returns the duration of the test case in ms as the process exit code. For more information refer to the ProfileTest.cmd command line help.
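If batch files are not your thing, the same steps can be sketched in a few lines of C# (a simplified equivalent of what the scripts do, with hypothetical paths; the real logic lives in ProfileTest.cmd):

using System;
using System.Diagnostics;

class ProfiledRunSketch
{
    static int Run(string exe, string args)
    {
        using var p = Process.Start(new ProcessStartInfo(exe, args) { UseShellExecute = false });
        p.WaitForExit();
        return p.ExitCode;
    }

    static void Main()
    {
        string testCase = "CreateSerial";                                    // hypothetical test name
        Run("wpr.exe", @"-start c:\temp\MultiProfile.wprp!CSwitch -start c:\temp\MultiProfile.wprp!File");
        int durationMs = Run("FileWriter.exe", @"-generate c:\temp serial"); // exit code = test duration in ms
        Run("wpr.exe", $@"-stop c:\temp\{testCase}_{durationMs}ms.etl");     // encode duration and test name in the file name
        Console.WriteLine($"{testCase}: {durationMs} ms");
    }
}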

FileWriter and Measurement Details

Read on if you are eager for details how the scripts generate ETW measurement data or go to the next headline.

RunPerformanceTests.cmd
call ProfileTest.cmd CreateSerial       FileWriter.exe -generate "%OutDir%" serial
call ProfileTest.cmd CreateParallel     FileWriter.exe -generate "%OutDir%" parallel
call ProfileTest.cmd CreateParallel2    FileWriter.exe -generate "%OutDir%" parallel -threads 2
...
call ProfileTest.cmd CreateTaskParallel FileWriter.exe -generate "%OutDir%" taskparallel

The most “complex” part of ProfileTest.cmd is where several ETW profiles of the supplied MultiProfile.wprp are started. The actual start command in the ProfileTest.cmd script is

wpr -start MultiProfile.wprp!File -start MultiProfile.wprp!CSwitch -start MultiProfile.wprp!PMCLLC

But with batch scripting and the correct escape characters it becomes this:

"!WPRLocation!" -start "!ScriptLocation!MultiProfile.wprp"^^!File -start "!ScriptLocation!MultiProfile.wprp"^^!CSwitch -start "!ScriptLocation!MultiProfile.wprp"^^!PMCLLC

This line starts WPR with File IO tracing, Context Switch tracing to see thread wait/ready times and some CPU counters for Last Level Cache misses. Disk and CPU sampling is already part of these profiles.

The PMCLLC profile can cause issues if you are using VMs like HyperV. To profile low level CPU features you need to uninstall all Windows OS HyperV Features to get Last Level Cache CPU tracing running.

You can check if you have access to your CPU counters via

C:\>wpr -pmcsources
Id  Name                        
--------------------------------
  0 Timer                       
  2 TotalIssues                 
  6 BranchInstructions          
 10 CacheMisses                 
 11 BranchMispredictions        
 19 TotalCycles                 
 25 UnhaltedCoreCycles          
 26 InstructionRetired          
 27 UnhaltedReferenceCycles     
 28 LLCReference                
 29 LLCMisses                   
 30 BranchInstructionRetired    
 31 BranchMispredictsRetired    
 32 LbrInserts                  
 33 InstructionsRetiredFixed    
 34 UnhaltedCoreCyclesFixed     
 35 UnhaltedReferenceCyclesFixed
 36 TimerFixed                  

If you get a list with just one entry named Timer then you will not get CPU counters, because you have installed one or several HyperV features. In theory this issue should not exist because this was fixed with Windows 10 RS 5 in 2019 (https://docs.microsoft.com/en-us/archive/blogs/vancem/perfview-hard-core-cpu-investigations-using-cpu-counters-on-windows-10), but I cannot get these CPU counters when HyperV is active in Windows 10 21H2.

The easiest fix is to remove the “!ScriptLocation!MultiProfile.wprp”^^!PMCLLC from the ProfileTest.cmd script, or you modify the MultiProfile.wprp to set in all HardwareCounter nodes

<HardwareCounter Id="HardwareCounters_EventCounters" Base="" Strict="true">

the value Strict=”false” to prevent WPR from complaining when it cannot enable the hardware counters. The MultiProfile.wprp profile uses nearly all features of WPR which are possible according to the not really documented xsd schema to streamline ETW recording to your needs. The result of this “reverse” documenting is an annotated recording profile that can serve as a base for your own recording profiles. Which profiles are inside it? You can query it from the command line with WPR:

D:\Source\FileWriter\bin\Release\net6.0>wpr -profiles MultiProfile.wprp

Microsoft Windows Performance Recorder Version 10.0.19041 (CoreSystem)
Copyright (c) 2019 Microsoft Corporation. All rights reserved.

   Default                   
               (CPU Samples/Disk/.NET Exceptions/Focus)
   CSwitch    +(CPU Samples/Disk/.NET Exceptions/Focus/Context Switch)
   MiniFilter +(CPU Samples/Disk/.NET Exceptions/Focus/MiniFilter)
   File       +(CPU Samples/Disk/.NET Exceptions/Focus/File IO)
   Network    +(CPU Samples/Disk/.NET Exceptions/Focus/Network)
   Sockets    +(CPU Samples/Disk/.NET Exceptions/Focus/Sockets)
   VirtualAlloc (Long Term)
   UserGDILeaks (Long Term)
   PMCSample  
     PMC Sampling for PMC Rollover + Default
   PMCBranch  
     PMC Cycles per Instruction and Branch data - Counting
   PMCLLC                      
     PMC Cycles per Instruction and LLC data - Counting
   LBR                         
     LBR - Last Branch Record Sampling

Or you can load the profile into WPRUI which is part of the Windows SDK where they will show up under the Custom Measurements node:

All profiles can be mixed (except Long Term) to record common things for unmanaged and .NET code with the least amount of data. A good choice is the default profile which enables CPU sampling, some .NET events, Disk and Window Focus events. These settings are based on years of troubleshooting and should be helpful recording settings in your case too. The supplied MS profiles record virtually everything, which is often a problem: the trace sizes become huge (many GB) or, if in-memory recording is used, the history covers one minute or even less on a busy system which has problems.

Generated Data

RunPerformanceTests.cmd saves the ETW data to the C:\Temp folder

After data generation it calls ExtractETW.cmd which is one call to ETWAnalyzer which loads several files in parallel and creates Json files from the ETW files

ETWAnalyzer -extract all -fd c:\temp\*Create*.etl -symserver MS -nooverwrite

This will generate JSON files in C:\Temp\Extract which can be queried with ETWAnalyzer without the need to load every ETL file into WPA.

As you can see the 500+ MB ETL files are reduced to ca. 10 MB which is a size reduction of a factor 50 while keeping the most interesting aspects. There are multiple json files per input file to enable fast loading. If you query with ETWAnalyzer e.g. just CPU data the other *Derived* files are not even loaded which keeps your queries fast.

Working With ETWAnalyzer and WPA on Measured Data

To get an overview if all executed tests suffer from a known issue you can query the data from the command line. To get the top CPU consumers in the Kernel not by method name but by stacktag you can use this query

EtwAnalyzer -dump CPU -ProcessName System -StackTags * -TopNMethods 2

I am sure you ask yourself: What is a stacktag? A stacktag is a descriptive name for key methods in a stacktrace of an ETW event. WPA and TraceProcessing can load a user configured xml file which assigns to a stacktrace a descriptive (stacktag) name, or Other if no matching stacktag rule could be found. The following XML fragment defines e.g. the above NTFS Checkpoint Volume stacktag:

  <Tag Name="NTFS Checkpoint Volume">
	<Entrypoint Module="Ntfs.sys" Method="NtfsCheckpointVolume*"/>
  </Tag>

See default.stacktags of ETWAnalyzer for how it can be used to flag common issues like OS problems, AV scanner activity, encryption overhead, … with descriptive stacktags. If more than one stacktag matches, the matching stacktag that is deepest in the stacktrace wins. You can override this behavior by setting a priority value for a given tag.

Based on the 7 runs we find that the CheckpointVolume issue is part of every test and blocks file writes for up to 6+ seconds. Other test runs do not suffer from this, which makes comparing your measured numbers questionable. Again we need to update our mental performance model:

Tests which are slower due to CheckpointVolume overhead should be excluded to make measurements comparable where this internal NTFS state delay is not occurring.

The exact number would be the sum of CPU+Wait; because this operation is single threaded we get exact numbers here. ETWAnalyzer can sum method/stacktag CPU/Wait times with the -ShowTotal flag. If we pass Method, the input lines are kept; if we use Process, only process and file totals are shown.

EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -ShowTotal method

Now we know that file IO is delayed by up to 7461 ms without the need to export the data to Excel. ETWAnalyzer allows you to export all data you see at the console to a CSV file. Just add -csv xxx.csv and you can aggregate/filter the data further as you need it. This is supported by all -Dump commands!

We can also add methods to the stacktags by adding -methods *checkpoint* to see where the stacktag has got its data from

EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -methods *checkpoint*
The method NtfsCheckpointVolume has 873 ms, but the stacktag has only 857 ms. This is because other stacktags can “steal” CPU sampling/wait data which would otherwise be attributed to it. This is a curse and a feature rolled into one. It is great because if you add up all stacktags you get the total CPU/Wait time for a given process across all threads. It is bad because expensive things might look less expensive in one stacktag when other stacktags have “stolen” some of its CPU/Wait time.

That is all nice, but what is our FileWriter process doing? Lets concentrate on the single threaded use case by filtering the files to *serial* with -FileDir, or -fd which is its shorthand notation. You need to use * at the beginning because the filter clause matches against the full path name. This allows you to query a complete directory tree with -recursive, where -fd *Serial*;!*WithAV* -recursive defines a file query starting from the current directory for all Serial files, but excludes folders or files which contain the name WithAV. To test which files match you can use -dump Testrun -fd xxxx -printfiles.

EtwAnalyzer -dump CPU -ProcessName FileWriter -stacktags * -Filedir *serial*

The test takes 12s but spends 2,2s alone with Defender activity. The other stacktags do not give away a clear problem into which we could drill deeper. One interesting thing is that we spend 750ms in thread sleeps. A quick check in WPA shows

that these are coming from the .NET Runtime, which recompiles busy methods later on the fly. Stacktags are great to flag past issues and categorize activity, but they also tend to decompose expensive operations into several smaller ones. E.g. if you write to a file you might see WriteFile and Defender tags splitting the overall costs of a call to WriteFile. But what stands out is that on my 8 core machine a single threaded FileWriter fully utilizes my machine with Defender activity. We need to update our mental performance model again:

Writing to many small files on a single thread will be slowed down by the Virus Scanner a lot. It also generates a high load on the system which slows down everything else.

Since we wrote the application we can go down to specific methods to see where the time is spent. To get an overview we use the top level view in WPA of our CreateFile method

In WPA we find that

  • CreateFiles
    • CPU (CPU Usage) 10653 ms
    • Wait Sum 672 ms
    • Ready Time Sum 1354 ms
      • This is the time a thread did wait for a CPU to become free when all CPUs were used

which adds up to 12,679 s which pretty closely (16ms off) matches the measured 12,695 s. It is always good to verify what you did measure. As you have seen before we have already quite often needed to change our mental performance model. The top CPU consumers in the CreateFiles method are

  • System.IO.StreamWriter::WriteLine
  • System.IO.Strategies.OSFileStreamStrategy::.ctor
  • System.IO.StreamWriter::Dispose
  • System.Runtime.CompilerServices.DefaultInterpolatedStringHandler::ToStringAndClear
  • System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[System.Char]::Rent
  • System.Runtime.CompilerServices.DefaultInterpolatedStringHandler::AppendFormatted
  • System.IO.Strategies.OSFileStreamStrategy::.ctor
  • System.IO.StreamWriter::WriteLine

What is surprising is that ca. 2s of our 12s are spent in creating interpolated strings. The innocent line

 string line = $"echo This is line {lineCount++}";

costs ca. 16% of the overall performance. Not huge, but still significant. Since we are trying to find out which file writing strategy is fastest this is OK because the overhead will be the same for all test cases.

We can get the same view as in WPA for our key methods with the following ETWAnalyzer query

EtwAnalyzer -dump CPU -ProcessName FileWriter -fd *serial* -methods *StreamWriter.WriteLine*;*OSFileStreamStrategy*ctor*;*StreamWriter.WriteLine*;*DefaultInterpolatedStringHandler*AppendFormatted*;*TlsOverPerCoreLockedStacksArrayPool*Rent;*DefaultInterpolatedStringHandler.ToStringAndClear;*StreamWriter.Dispose;*filewriter* -fld

You might have noticed that the method names in the list have :: between class and method. ETWAnalyzer always uses . which is less typing and is in line with old .NET Framework code, which used :: for JITed code and . for precompiled code.

Additionally I have smuggled -fld into the query. It is the shorthand for -FirstLastDuration, which shows the difference between the last and first time the method showed up in CPU sampling or context switch data. Because we know that our test only measured DataGenerator.CreateFile calls, we see in the Last-First column 12,695 s, which matches our measured duration down to the ms! This option can be of great value if you want to measure the wall clock time of parallel asynchronous activity. At least for .NET code the state machine classes generated by the C# compiler contain central methods which are invoked during init and finalization of asynchronous activities, which makes it easy to “measure” the total duration even if you have no additional trace points at hand. You can use -fld s s or -fld local local or -fld utc utc to also show the first and last time in

  • Trace Time (seconds)
  • Local Time (customer time in the time zone the machine was running)
  • UTC Time

See command line help for further options. ETWAnalyzer has an advanced notation of time which can be formatted in the way you need it.

The WPA times for Wait/Ready are the same in ETWAnalyzer, but not the CPU time. The reason is that to exactly judge a method's CPU consumption you need to look in WPA into CPU Usage (Sampled), which shows per method CPU data and is more accurate. ETWAnalyzer merges this automatically for you.

CPU usage (Precise) Trap

The view CPU usage (Precise) in WPA shows context switch data: whenever your application calls a blocking OS call like Read/WriteFile/WaitForSingleObject, … your thread is switched off the CPU, which is called a context switch. You will therefore see in this graph only (simplification!) methods which were calling a blocking OS method. If you have e.g. a busy for loop in a method int Calculate() and the next called method is Console.WriteLine, like this,

int result = Calculate();
Console.WriteLine(result);

to print the result, then you will see in CPU usage (Precise) all CPU attributed to Console.WriteLine because that method caused the next blocking OS call. All other non blocking methods called before Console.WriteLine are invisible in this view. To get per method data you need to use the CPU data of CPU sampling, which gathers its data in a different way and measures CPU consumption at method level much better. Is context switch data wrong? No. The per thread CPU consumption is exact because this is a trace of the OS scheduler when it moves threads on/off a CPU. But you need to be careful not to interpret the method level CPU data as the true CPU consumption of that method.
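A self contained repro of that effect could look like this (hypothetical code, only meant to show the attribution; record it with WPR and compare both CPU views in WPA):

using System;

class PreciseTrapDemo
{
    static long Calculate()
    {
        long sum = 0;
        for (int i = 0; i < 500_000_000; i++) // busy loop, no blocking OS call in here
        {
            sum += i % 7;
        }
        return sum;
    }

    static void Main()
    {
        long result = Calculate();
        // In CPU usage (Precise) the cycles burned in Calculate show up under Console.WriteLine,
        // because that is the next method which performs a blocking OS call.
        Console.WriteLine(result);
    }
}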

CPU Usage (Sampled) Trap

Lets compare both CPU views (Sampled) vs (Precise)

The StreamWriter.Dispose method consumes 887 ms in the Sampled view, while in the Context Switch view we get 1253 ms. In this case the difference is 41%! With regards to CPU consumption you should use the Sampled view for methods to get the most meaningful value.

Every WPA table can be configured to add a Count column. Many people fall into the trap of interpreting the CPU Sampling Count column as the number of method calls, while it actually just counts the number of sampling events. CPU sampling works by stopping all running threads 1000 (default) times per second, taking a full stacktrace and writing that data to ETW buffers. If method A() showed up 1000 times then it gets 1000*1ms of CPU attributed. CPU sampling has statistical bias and can ignore e.g. method C() if it executes just before, or after, the CPU sampling event takes a stack trace.

ETWAnalyzer Wait/Ready Times

The method list shown by ETWAnalyzer -Dump CPU -methods xxx is a summation across all threads in a process. This serves two purposes

  • Readability
    • If each thread would be printed the result would become unreadable
  • File Size
    • The resulting extracted Json file would become huge

This works well for single threaded workloads, but what about 100 threads waiting for something? WPA sums all threads together if you remove the thread grouping. ETWAnalyzer has chosen a different route. It only sums all non overlapping Wait/Ready times. This means if more than one thread is waiting, the wait is counted only once.

The Wait/Ready times for multi threaded methods are the sum of all non overlapping Wait/Ready times over all threads. If e.g. a method was blocked on many threads, the maximum wait time you will see is your ETW recording time and not some arbitrarily high number which no one can reason about. It is still complicated to judge parallel applications, because with heavy thread over subscription you will see both Wait and Ready times reach your ETW recording time. This is a clear indication that you have too many threads and should use fewer of them.

Make Measurements Better

We have found at least two fatal flaws in our test

  • Anti Virus Scanner is consuming all CPU even in single threaded test
    • Add an exclude rule to the folder via PowerShell
      • Add-MpPreference -ExclusionPath “C:\temp\Test”
    • To remove use
      • Remove-MpPreference -ExclusionPath “C:\temp\Test”
  • Some test runs suffer from large delays due to CheckpointVolume calls which introduces seconds of latency which is not caused by our application
    • Change Script to delete temp files before test is started to give OS a chance to get into a well defined file system state

When we do this and we repeat the tests a few times we get for the single threaded case:

Nearly a factor of 3 faster without changing a single line of code is pretty good. The very high numbers come from overloading the system when something else is running in the background. The 12s numbers are with Defender enabled, while the < 10s numbers are with a Defender exclusion rule but various NtfsCheckpoint delays. When we align everything correctly we end up with 7s for the single threaded case, which is a write rate of 71 MB/s. The best value is now 3,1s for 4 threads with 161 MB/s.

What else can we find? When looking at variations we find also another influencing factor during file creation:

We have a single threaded test run with 9556 ms and a faster one with 7066 ms. Besides the CheckpointVolume issue we lose another 1,3 s due to reading the MFT and paged out data. The faster test case (Trace #2) spends just 0,3 ms reading from the SSD, while in the slower run (Trace #1) we read for 1,3 s from disk. The amount of data is small (8 MB), but for some reason the SSD was handing out the data slowly. Yes, SSDs are fast, but sometimes, when they are busy with internal reorganization, they can become slow. SSD performance is a topic for another blog post.

We need to update our mental performance model

File system cache state differences can lead to wait times due to hard faults. Check the method MiIssueHardFault to see if the tests have comparable numbers.

Etwanalyzer -dump CPU -pn Filewriter -methods MiIssueHardFault
Those are another 2 s of difference which we can attribute to file system cache state and paged out data.

Measuring CPU Efficiency

We know that 4 threads are fastest after removing random system noise from our measured data. But is it also the most efficient solution? We can dump the CPU's Performance Monitoring Counters (PMC) to the console with

ETWAnalyzer -dump PMC -pn FileWriter

The most fundamental metric is CPI, which is Cycles/Instructions. If you execute the same use case again you should execute the same number of instructions. When you go for multithreading you need to take locks and coordinate threads, which costs you extra instructions. The most energy efficient solution is therefore the single threaded one, although it is slower.
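As a reminder of the formula (the counter values below are made up, just to illustrate the calculation):

using System;

class CpiExample
{
    static void Main()
    {
        long totalCycles = 12_000_000_000;         // made up counter value
        long instructionsRetired = 10_000_000_000; // made up counter value
        double cpi = (double)totalCycles / instructionsRetired;
        Console.WriteLine(cpi); // 1.2 - the higher the CPI, the less efficient the run
    }
}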

To get an overview you can export the data from the current directory and the initial measurements we started with into a CSV file. The following query will dump the Performance Monitoring Counters (PMC) to a CSV file. You can use multiple -fd queries to select the data you need:

ETWAnalyzer -dump PMC -pn FileWriter -fd ..\Take_Test1 -fd .  -csv PMC_Summary.csv

The CSV file contains the columns Test Time in ms and CPI to make it easy to correlate the measured numbers with something else, e.g. CPI in this case. This pattern is followed by all -Dump commands of ETWAnalyzer. From this data we can plot Test Timing vs CPI (Cycles per Instruction) in a Pivot Table:

The initial test where we had

  • Defender
  • CheckpointVolume

issues was not only much slower, but also much worse in terms of CPU efficiency: 1,2 vs 0,69 (smaller is better). This proves that even if Defender were highly efficient on all other cores and did not block our code at all, we would still do much worse (73%), because even when we do exactly the same work, CPU caches and other internal CPU resources are slowing us down. With this data you can prove why you are slower even when everything else is the same.

Conclusions

If you have read that far I am impressed. That is a dense article with a lot of information. I hope I have convinced you that profiling your system with ETW and not only your code for strategic test cases is an important asset in automated regression testing.

Credits: NASA, ESA, CSA, and STScI (https://www.nasa.gov/image-feature/goddard/2022/nasa-s-webb-delivers-deepest-infrared-image-of-universe-yet)

If you look only at your code you will miss the universe of things that are moving around in the operating system and other processes. You will end up with fluctuating tests which no one can explain. We are doing the same thing 10 times and we still get outliers! The key is not only to understand your own code, but also the system in which it is executing. Both need attention and coordination to achieve the best performance.

To enable that I have written ETWAnalyzer to help you speed up your own ETW analysis. If you go for ETW you are talking about a lot of data, which needs a fair amount of automation to become manageable. Loading hundreds of tests into WPA is not feasible, but with ETWAnalyzer mass data extraction and analysis was never easier. The list of supported queries is growing. If you miss something, file an issue so we can take a look at what the community needs.

To come back to the original question: Is profiling too difficult? If you are using only UI based tools then the overhead of doing this frequently is definitely too high. But the scripting approach shown by FileWriter, recording data and extracting it with ETWAnalyzer to make it queryable, gives you a lot of opportunities. ETWAnalyzer comes with an object model for the extracted Json files. You can query for known issues directly and flag past issues within seconds, without the need to load the profiling data into a graphical viewer to get an overview.

GDI/User Object Leak Tracking – The Easy Way

One of the most complex issues are User/GDI object leaks, because they are nearly invisible in memory dumps. Microsoft does not want us to peek into such “legacy” technology, so no Windbg plugins exist anymore. Today everything is UWP, WPF (ok, that’s legacy as well) and mostly Chrome browser based applications. With .NET 6 WPF and WinForms support has got some love again, so we can expect that new programmers will hit old fashioned User/GDI leaks again. These leaks are worse than “normal” handle leaks because Windows imposes a limit of 10K GDI and User handles per process. After that you cannot create any more User/GDI handles and you get either exceptions or partially filled windows where content or icons are missing.

Here is an application which leaks Window and GDI Handles on every button click:

    public partial class Leaker : Form
    {
        public Leaker()
        {
            InitializeComponent();
        }

        List<Leaker> myChilds = new List<Leaker>();

        private void button1_Click(object sender, EventArgs e)
        {
            var leaker = new Leaker();
            var tmp = leaker.Handle; // accessing Handle forces creation of the native window handle
            myChilds.Add(leaker);
            cCountLabel.Text = $"{myChilds.Count}";
        }
    }

In Task Manager we find that with every click we leak 6 Window handles:

These come from the

  • Leaked Form
  • Create Button
  • Created Label
  • 85 (Number) Label
  • IME Hidden Window
  • MSCTFIME Hidden Window
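Task Manager shows the totals, but you can also watch the User/GDI handle counts of a process programmatically via the documented user32 API GetGuiResources. The little helper below is my own and not part of the Leaker sample:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class GuiHandleCounter
{
    const uint GR_GDIOBJECTS = 0;  // number of GDI objects
    const uint GR_USEROBJECTS = 1; // number of USER objects (windows, menus, ...)

    [DllImport("user32.dll")]
    static extern uint GetGuiResources(IntPtr hProcess, uint uiFlags);

    public static void Print()
    {
        IntPtr self = Process.GetCurrentProcess().Handle;
        Console.WriteLine($"GDI: {GetGuiResources(self, GR_GDIOBJECTS)}, User: {GetGuiResources(self, GR_USEROBJECTS)}");
    }
}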

A very useful tool is Spy++ which is installed when you add C++ core features

via the Visual Studio Installer. Then you will find it in C:\Program Files\Microsoft Visual Studio\<Version>\<Edition>\Common7\Tools\spyxx_amd64.exe or the 32 bit version without the _amd64 postfix.

If you intend to copy Spy++ to a different machine you need these dlls

.\1033\spyxxui_amd64.dll
spyxxhk_amd64.dll
spyxx_amd64.exe

to make it work. The resource dll in the 1033 folder is essential or it won’t start.

How would you identify the call stack that did lead to the allocation of additional objects? The answer is a small and great ETW tool named Performance HUD from Microsoft.

After starting

you can attach to an already running process

After a few seconds an overlay window pops up

If you double click on it the “real” UI becomes visible.
Switch over to the Handles tab.
Now press the create button in our Leaker app a few times to leak some GDI/User objects.
Press Ctrl+Click on the blue underlined Outstanding text for User Window objects to find the leaking call stack.
Voila, you have found our button1_Click handler as the bad guy, as expected. If you right click on it you can e.g. send a mail with the leaking call stack as an attached .hudinsight file to the developer of the faulty application. Problem solved.

See my earlier post for what else you can do with Performance HUD. It is one of those tools which save a lot of time once you know that they exist.

Another great feature of Performance HUD is to analyze UI Thread hangs. It employs some sophisticated analysis to detect that the UI thread is no longer processing window messages because some CPU hungry method is executing a long method call, or that you are stuck in a blocking OS call. Today everyone is using async/await in C#. A stuck UI thread should be a thing of the past. But now you can measure and prove that it is not hanging.

Lets add a UI hang via TPL tasks to our sample application, which happens only in Debug builds

        private void cHang_Click(object sender, EventArgs e)
        {
            Task<int> t = Task.Run(async () => { await Task.Delay(5000); return 1; });
            Debug.Print(t.Result.ToString()); // t.Result blocks the UI thread until the task has finished (5s)
        }

My favorite button is Blame Someone, which will unfold the stack with probably the longest stuck call:

This precisely points towards our stuck button handler which calls into Task.Result, which seems not to be resolved properly, although the later method calls are. The 5s hang is nicely colored and you can also pack this into an email and send the issue to the maintainer of the corresponding code. I have seen plenty of similar issues in the past: because Thread.Sleep is caught by many FxCop rules, people tend to use Task.Delay instead, which, when you block on it like above, is a Thread.Sleep in disguise.
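The hang free variant is of course to await the task instead of blocking on Task.Result. A sketch of the fixed handler inside the same Form (not part of the sample download):

        private async void cHang_Click(object sender, EventArgs e)
        {
            // await keeps the UI thread pumping messages while the delay runs
            int result = await Task.Run(async () => { await Task.Delay(5000); return 1; });
            Debug.Print(result.ToString());
        }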

Performance HUD works not only with .NET code but with any language as long as it supports pdbs and walkable stacks. It is great that this diagnostic has been added to Windows, but why did it take so long? As far as I know Performance HUD is from the Office team, which owns a “legacy” but important application. Someone dealing with too many GDI/User object leaks was able to commit some lines to the Kernel to instrument the relevant locations with proper ETW events, and we finally arrive at a feature complete leak tracer. With ETW we can track leaks of

  • OS Handles (File, Socket, Pipe, …)
    • wpr -start Handle
    • Analyze with WPA. This records handle data from all processes (system wide)
  • GDI Handles (Cursor, Bitmap, Font, …)
  • User Handles (Window, Menu, Timer, …)
  • VirtualAlloc Memory (Records system wide all processes)
    • wpr -start VirtualAllocation
    • Analyze with WPA

Finally we have covered all frequent leak types with ETW events. No leak can hide anymore.

For memory mapped files there are some ETW events, but unfortunately no analysis for them exists in the public version of WPA. These leaks are still more tricky to figure out than the rest, which are now at least in principle easy. Recording the right data and not too much of it (ETW files > 6 GB will usually crash WPA) is still challenging but doable. Performance HUD is very well written. I have successfully processed an 80 GB ETL trace with no hiccups. It can not only attach to already running processes, but can also analyze previously recorded data.

Now get back to work and fix your leaks and unresponsive UIs!

How Fast is your Virus Scanner?

I am sure everyone has experienced slowdowns due to Antivirus solutions, but very few are able to attribute them to the right component on their Windows box. Most people do not even know which AV solution/s are running on their machine. The end result can be a secure but slow system which is barely usable.

Is your CPU fan always on and your notebook getting hot although you are not running anything? You want to find out what is consuming so much CPU with a command line tool?

Then I have a new tool for you: ETWAnalyzer. It is using the Trace Processing Library which powers WPA (Windows Performance Analyzer). WPA is a great UI to analyze tricky systemic issues, but if you need to check a trace the n-th time if the same issue is there again you want to automate that analysis.

ETWAnalyzer is developed by Siemens Healthineers as its first public open source project. Full Disclosure: I work for Siemens Healthineers. One motivation to make it public was to show how incredibly useful ETW (Event Tracing for Windows) is to track down performance issues in production and development environments. It is one of the best technologies for performance troubleshooting Microsoft has invented, but it is valued by far too few people. ETW is part of Windows and you need no additional tools or code instrumentation to record ETW data! Starting with Windows 10 the Windows Performance Recorder (wpr.exe) is part of the OS. You have a slow system? In that case I would open an Administrator cmd shell and type

C>wpr -start CPU -start DiskIO -start FileIO

Now execute your slow use case and then stop profiling with

C>wpr -stop C:\temp\SlowSystem.etl

This will generate a large file (hundreds of MB – several GB) which covers what your system was doing during the last ca. 60-200s (depending on system load and installed memory). If you have a lot of .NET applications installed, stopping the first time can take a long time (10-60 min) because it will create pdbs for all running .NET applications. Later profiling stop runs are much faster though.

To record data with less overhead check out MultiProfile, which is part of FileWriter. You need to download the profile to a directory on the machine, e.g. c:\temp, and can then record the equivalent of the previous command with

C>wpr -start C:\temp\MultiProfile.wprp!CSwitch -start C:\temp\MultiProfile.wprp!File

Getting the data is easy. Analyzing is hard when you do not know where you need to look at.

This is where ETWAnalyzer can help out, by asking the recorded data questions about previously analyzed problems. You ask: Previously? Yes. Normally you start with WPA to identify a pattern like high CPU in this or that method. Then you (should) create a stacktag to assign a high level description of what a problematic method is about. The stacktags are invaluable to categorize CPU/Wait issues and correlate them with past problems. The supplied stacktag file of ETWAnalyzer contains descriptions for Windows/Network/.NET Framework/Antivirus products and even some Chrome stacktags. To get started you can download the latest ETWAnalyzer release from https://github.com/Siemens-Healthineers/ETWAnalyzer/releases, unzip it to your favorite tool directory and add that directory to the PATH environment variable. If you want maximum speed you should use the self contained .NET 6 version.

Now you are ready to extract data from an ETL file with the following command:

ETWAnalyzer -extract all -filedir c:\temp\SlowSystem.etl -symserver MS

After extraction all data is contained in several small Json files.

Lets query the data by looking at the top 10 CPU consumers

ETWAnalyzer -dump CPU -filedir c:\temp\Extract\SlowSystem.json -topN 10

The second most CPU hungry process is MsMpEng.exe which is the Windows Defender scan engine. Interesting but not necessarily a bad sign.

Lets check if any known AV Vendor is inducing waits > 10ms. We use a stacktag query which matches all stacktags with Virus in their name.

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json -stacktags *virus* -MinMaxWaitMs 10

We find a 1349ms slowdown caused by Defender in a cmd shell. I observed slow process creation after I hit enter to start the freshly downloaded executable from the internet.

Explorer also seems to be slowed down. The numbers shown are the sum over all threads for CPU and wait times. If multiple threads were executing the method then you can see high numbers for the stacktags/methods. For single threaded use cases such as starting a new process the numbers correlate pretty well with wall clock time.

The currently stacktagged AV solutions are

  • Applocker
  • Avast
  • CrowdStrike
  • CyberArk
  • Defender
  • ESET
  • Mc Afee
  • Palo Alto
  • Sentinel One
  • Sophos
  • Symantec
  • Trend Micro

Lets dump all slow methods with a wait time between 1300-2000ms and the timing of all process and file creation calls in explorer.exe and cmd.exe

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json  -stacktags *virus* -MinMaxWaitMs 1300-2000 -methods *createprocess*;*createfile* -processName explorer;cmd

Antivirus solutions usually hook into CreateProcess and CreateFile calls which did obviously slow down process creation and file operations in explorer and process creation in cmd.exe.

Was this the observed slowness? Lets dump the time when the file and process creation calls were visible the first time in the trace

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json  -stacktags *virus* -MinMaxWaitMs 1300-2000 -methods *createprocess*;*createfile* -processName explorer;cmd -FirstLastDuration local

That correlates pretty well with the observed slowness. To blame the AV vendor we can dump all methods of device drivers which have no symbols. Neither Microsoft nor any other AV vendor delivers symbols for their AV drivers. By this definition all unresolved symbols for system drivers are likely AV or hardware drivers. ETWAnalyzer makes this easy with the -methods *.sys -ShowModuleInfo query, which shows all kernel drivers called from our applications for which we do not have symbols:

ETWAnalyzer -dump cpu -fd c:\temp\Extract\SlowSystem.json -methods *.sys* -processName explorer;cmd -ShowModuleInfo

The yellow output is from an internal database of ETWAnalyzer which has categorized pretty much all AV drivers. The high wait time comes from WDFilter.sys which is an Antivirus File System driver which is part of Windows Defender.

What have we got?

  • We can record system wide ETW data with WPR or any other ETW recording tool, e.g. https://github.com/Alois-xx/etwcontroller/ which can capture screenshots to correlate the UI with profiling data
  • With ETWAnalyzer we can extract aggregated data for
    • CPU
    • Disk
    • File
    • .NET Exceptions
  • We can identify CPU and wait bottlenecks going from a system wide view down to method level
    • Specific features to estimate AV overhead are present

This is just the beginning. You can check out the documentation for further inspiration how you can make use of system wide profiling data (https://github.com/Siemens-Healthineers/ETWAnalyzer) which can be queried from the command line.

ETWAnalyzer will not supersede WPA for analyzing data. ETWAnalyzer's main strength is analyzing known patterns quickly in large amounts of extracted ETW data. 10 GB of ETW data becomes a few hundred MB of extracted JSON data which can be queried without any slow symbol server lookups.

Happy bug hunting with your new tool! If you find bugs or have ideas how to improve the tool it would be great if you file an issue.

The Case Of A Stuck LoadLibrary Call

I frequently encounter impossible bugs which after full analysis have a reasonable explanation, although I have no source code of the problematic library. Today's fault is featured by

….

suspense

the Intel Math Kernel Library.

Lets use some fancy neural network library (AlgoLibrary!UsingCaffee) like the Berkeley AI caffee.dll, which in turn uses the Intel Math Kernel Library (mkl_core.dll), which hangs forever in a call to LoadLibrary. The first thing to do when you are stuck is taking a memory dump (e.g. use Task Manager or procdump)

and loading it into Windbg or Visual Studio. Since you are reading this I assume you want to know more about the mysterious commands Windbg has to offer. You can load a memory dump into Windbg from the File menu

To get an overview of where all threads of a process are standing or what they are executing, you can run a command on all threads by prefixing it with ~*. The command to dump a thread stack in Windbg is k. To dump all thread stacks you therefore execute

~*k

107  Id: 67b0.8094 Suspend: 0 Teb: 00000049`9df88000 Unfrozen
Call Site
ntdll!NtWaitForSingleObject+0x14
ntdll!LdrpDrainWorkQueue+0xe5
ntdll!LdrpLoadDllInternal+0xc2
ntdll!LdrpLoadDll+0xf1
ntdll!LdrLoadDll+0x8c
KERNELBASE!LoadLibraryExW+0x16f
KERNELBASE!LoadLibraryExA+0x31
KERNELBASE!LoadLibraryA+0x39
mkl_core!mkl_serv_mpi_get_ppn+0x48
mkl_core!mkl_serv_mkl_malloc+0x7e7
caffe!caffe::CaffeMallocHost+0x16
caffe!caffe::SyncedMemory::to_cpu+0x116
caffe!caffe::SyncedMemory::mutable_cpu_data+0xe
caffe!caffe::Blob<float>::Reshape+0x175
caffe!caffe::Blob<float>::Reshape+0x138
caffe!caffe::InputLayer<float>::LayerSetUp+0x184
caffe!caffe::Layer<float>::SetUp+0x15
caffe!caffe::Net<float>::Init+0x898
caffe!caffe::Net<float>::Net<float>+0x473
AlgoLibrary!UsingCaffee

When you load a memory dump into Windbg you can issue a full stack walk with the k command, which will hopefully output a nice stack trace after you have resolved your symbols from Microsoft, Intel and the caffee.dll. We see that LoadLibrary is stuck while caffee tries to allocate memory, which internally loads a library. We would expect that our call from AlgoLibrary!UsingCaffee should be finished within a few ms. But this one hangs forever.

The allocation code comes from this:

 *ptr = mkl_malloc(size ? size:1, 64);

When LoadLibrary hangs it is usually due to some other thread holding the OS loader lock. This loader lock effectively serializes dll loading and the initialization and release of global/thread static variables. See https://docs.microsoft.com/en-us/windows/win32/dlls/dllmain for a more elaborate description.

With Windbg and the !peb command we can check further

dt ntdll!_PEB 499df0d000
+0x0e8 NumberOfHeaps    : 0x15
+0x0ec MaximumNumberOfHeaps : 0x20
+0x0f8 GdiSharedHandleTable : 0x00000183`71d00000 Void
+0x100 ProcessStarterHelper : (null) 
+0x108 GdiDCAttributeList : 0x14
+0x10c Padding3         : [4]  ""
+0x110 LoaderLock       : 0x00007ffe`f447a568 _RTL_CRITICAL_SECTION
+0x118 OSMajorVersion   : 0xa
+0x11c OSMinorVersion   : 0
+0x120 OSBuildNumber    : 0x3839
+0x122 OSCSDVersion     : 0
+0x124 OSPlatformId     : 2

We can dump the Loader Lock structure by clicking on the blue underlined link in Windbg. But this formatting has not survived in WordPress.

0:107> dx -r1 ((ntdll!_RTL_CRITICAL_SECTION *)0x7ffef447a568)
((ntdll!_RTL_CRITICAL_SECTION *)0x7ffef447a568)                 : 0x7ffef447a568 [Type: _RTL_CRITICAL_SECTION *]
[+0x000] DebugInfo        : 0x7ffef447a998 [Type: _RTL_CRITICAL_SECTION_DEBUG *]
[+0x008] LockCount        : -2 [Type: long]
[+0x00c] RecursionCount   : 1 [Type: long]
[+0x010] OwningThread     : 0x6710 [Type: void *]
[+0x018] LockSemaphore    : 0x0 [Type: void *]
[+0x020] SpinCount        : 0x4000000 [Type: unsigned __int64]

Ahh, now we know that thread 0x6710 owns the lock. Switch to that thread with the command ~~[0x6710]s which allows you to specify the OS thread id directly. The usual ~<n>s command switches by the debugger enumerated thread number, which always starts at 0. That is convenient, but sometimes you need to switch by OS thread id.

0:114> k

ntdll!NtYieldExecution+0x14
KERNELBASE!SwitchToThread+0x1d
mkl_core!mkl_serv_lock+0x6d
mkl_core!mkl_serv_thread_free_buffers+0x384
mkl_core+0x1016
mkl_core!sdttrsb_4dfl+0x31a
ntdll!LdrpCallInitRoutine+0x4b
ntdll!LdrShutdownThread+0x151
ntdll!RtlExitUserThread+0x3e
kernel32!BaseThreadInitThunk+0x1c
ntdll!RtlUserThreadStart+0x21

That stack trace is strange because it calls SwitchToThread, which basically tells the OS that if any other runnable thread for this core is ready to run, it should run now. If there is none, this is essentially a CPU burning endless loop. Such constructs are normally used in spinlocks, where one at first takes no lock at all to avoid the cost of a context switch at the expense of burning CPU cycles. In this case it looks like mkl_core is cleaning up some thread static variables, which requires synchronization with some lock object which is taken by another thread. Since I do not have the code for this I can only speculate which thread it is.

Lets get the CPU time of this thread to see how long it has been trying to enter the lock:

0:114> !runaway
 User Mode Time
  Thread       Time
  114:6710     0 days 2:14:28.671
   17:62a4     0 days 0:00:47.656
   92:88bc     0 days 0:00:08.218
   22:107c     0 days 0:00:02.953
   19:3a9c     0 days 0:00:02.765

Nearly all CPU of the process (2h 15 minutes) is spent by this thread (2h 14 minutes) which wastes CPU cycles for hours.

Since we have only one other mkl thread running, which is our stuck thread in the LoadLibrary call, it all makes sense now. The LoadLibrary call hangs because thread 0x6710 is exiting and cleaning up some thread static variables, which requires a lock which our stuck LoadLibrary thread already possesses. But since the shutting down thread owns the loader lock, we have two dependent locks owned by two different threads -> deadlock.
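Stripped of all loader and MKL specifics, this is the classic lock order inversion. A minimal sketch of the pattern (it has nothing to do with the actual MKL code, it only shows the principle):

using System;
using System.Threading;

class LockOrderInversion
{
    static readonly object LoaderLock = new object();
    static readonly object MklLock = new object();

    static void Main()
    {
        var loading = new Thread(() =>
        {
            lock (MklLock)             // like the thread inside mkl_malloc ...
            {
                Thread.Sleep(100);
                lock (LoaderLock) { }  // ... which then calls LoadLibrary
            }
        });
        var exiting = new Thread(() =>
        {
            lock (LoaderLock)          // like the exiting thread which already owns the loader lock ...
            {
                Thread.Sleep(100);
                lock (MklLock) { }     // ... and then cleans up thread static MKL state
            }
        });
        loading.Start(); exiting.Start();
        loading.Join();  exiting.Join(); // never returns: both threads wait on each other forever
        Console.WriteLine("unreachable");
    }
}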

You need to be careful on which threads you execute MKL code when these threads exit at some point in time while new threads are initializing new mkl variables. At some point you can see funny deadlocks which look really strange: one thread, the exiting one, burns CPU like hell, and your own code is stuck in a LoadLibrary call which waits for the loader lock to become free.

If you dig deep enough you can find the root cause even when you do not have the full source code. As an added bonus I have learned some arcane Windbg commands along the way. These will be helpful the next time you need to dig into some stuck threads.